2025-06-02

Title: Meaning Is Not A Metric: Using LLMs to make cultural context legible at scale

Authors: Cody Kommers, Drew Hemment, Maria Antoniak, Joel Z. Leibo, Hoyt Long, Emily Robinson, Adam Sobey
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.23785
Pdf URL: https://arxiv.org/pdf/2505.23785
Copy Paste: [[2505.23785]] Meaning Is Not A Metric: Using LLMs to make cultural context legible at scale(https://arxiv.org/abs/2505.23785)
Keywords: language model, llm
Abstract: This position paper argues that large language models (LLMs) can make cultural context, and therefore human meaning, legible at an unprecedented scale in AI-based sociotechnical systems. We argue that such systems have previously been unable to represent human meaning because they rely on thin descriptions: numerical representations that enforce standardization and therefore strip human activity of the cultural context that gives it meaning. By contrast, scholars in the humanities and qualitative social sciences have developed frameworks for representing meaning through thick description: verbal representations that accommodate heterogeneity and retain contextual information needed to represent human meaning. While these methods can effectively codify meaning, they are difficult to deploy at scale. However, the verbal capabilities of LLMs now provide a means of (at least partially) automating the generation and processing of thick descriptions, potentially overcoming this bottleneck. We argue that the problem of rendering human meaning legible is not just about selecting better metrics, but about developing new representational formats (based on thick description). We frame this as a crucial direction for the application of generative AI and identify five key challenges: preserving context, maintaining interpretive pluralism, integrating perspectives based on lived experience and critical distance, distinguishing qualitative content from quantitative magnitude, and acknowledging meaning as dynamic rather than static. Furthermore, we suggest that thick description has the potential to serve as a unifying framework to address a number of emerging concerns about the difficulties of representing culture in (or using) LLMs.
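Summary: This position paper argues that large language models (LLMs) can make cultural context, and therefore human meaning, legible at unprecedented scale in AI-based sociotechnical systems. Such systems have relied on thin descriptions, numerical representations that enforce standardization and strip away cultural context, whereas the humanities and qualitative social sciences rely on thick descriptions, verbal representations that accommodate heterogeneity but are hard to deploy at scale. LLMs can at least partially automate the generation and processing of thick descriptions. The authors identify five challenges: preserving context, maintaining interpretive pluralism, integrating perspectives based on lived experience and critical distance, distinguishing qualitative content from quantitative magnitude, and treating meaning as dynamic rather than static.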

Title: Nine Ways to Break Copyright Law and Why Our LLM Won't: A Fair Use Aligned Generation Framework

Authors: Aakash Sen Sharma, Debdeep Sanyal, Priyansh Srivastava, Sundar Atreya H., Shirish Karande, Mohan Kankanhalli, Murari Mandal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23788
Pdf URL: https://arxiv.org/pdf/2505.23788
Copy Paste: [[2505.23788]] Nine Ways to Break Copyright Law and Why Our LLM Won't: A Fair Use Aligned Generation Framework(https://arxiv.org/abs/2505.23788)
Keywords: language model, llm
Abstract: Large language models (LLMs) commonly risk copyright infringement by reproducing protected content verbatim or with insufficient transformative modifications, posing significant ethical, legal, and practical concerns. Current inference-time safeguards predominantly rely on restrictive refusal-based filters, often compromising the practical utility of these models. To address this, we collaborated closely with intellectual property experts to develop FUA-LLM (Fair Use Aligned Language Models), a legally-grounded framework explicitly designed to align LLM outputs with fair-use doctrine. Central to our method is FairUseDB, a carefully constructed dataset containing 18,000 expert-validated examples covering nine realistic infringement scenarios. Leveraging this dataset, we apply Direct Preference Optimization (DPO) to fine-tune open-source LLMs, encouraging them to produce legally compliant and practically useful alternatives rather than resorting to blunt refusal. Recognizing the shortcomings of traditional evaluation metrics, we propose new measures: Weighted Penalty Utility and Compliance Aware Harmonic Mean (CAH) to balance infringement risk against response utility. Extensive quantitative experiments coupled with expert evaluations confirm that FUA-LLM substantially reduces problematic outputs (up to 20%) compared to state-of-the-art approaches, while preserving real-world usability.
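Summary: LLMs risk copyright infringement when they reproduce protected content verbatim or with too little transformation, and refusal-based filters reduce their usefulness. Working with intellectual property experts, the authors develop FUA-LLM (Fair Use Aligned Language Models), built around FairUseDB, a dataset of 18,000 expert-validated examples covering nine realistic infringement scenarios. Open-source LLMs are fine-tuned on it with Direct Preference Optimization (DPO) so that they offer compliant, useful alternatives instead of blunt refusals. Two new measures, Weighted Penalty Utility and Compliance Aware Harmonic Mean (CAH), balance infringement risk against response utility. Experiments and expert evaluations show that FUA-LLM reduces problematic outputs by up to 20% relative to state-of-the-art approaches while preserving usability.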
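The abstract above names Direct Preference Optimization as the fine-tuning step. As a rough illustration only, the standard per-pair DPO loss can be sketched as below; the function name, argument names, and the `beta` default are assumptions made for this sketch, not details taken from the FUA-LLM paper.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-pair DPO loss computed from sequence log-probabilities.

    logp_*     : log-probability of each response under the policy being tuned
    ref_logp_* : log-probability of the same response under the frozen reference model
    beta       : strength of the implicit KL penalty (0.1 is an illustrative default)
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a fair-use setting the "chosen" response would presumably be the compliant alternative and the "rejected" one the infringing output, but that pairing is an assumption here.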

Title: Conversational Exploration of Literature Landscape with LitChat

Authors: Mingyu Huang, Shasha Zhou, Yuxuan Chen, Ke Li
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23789
Pdf URL: https://arxiv.org/pdf/2505.23789
Copy Paste: [[2505.23789]] Conversational Exploration of Literature Landscape with LitChat(https://arxiv.org/abs/2505.23789)
Keywords: language model, llm, hallucination, chat, agent
Abstract: We are living in an era of "big literature", where the volume of digital scientific publications is growing exponentially. While offering new opportunities, this also poses challenges for understanding literature landscapes, as traditional manual reviewing is no longer feasible. Recent large language models (LLMs) have shown strong capabilities for literature comprehension, yet they are incapable of offering "comprehensive, objective, open and transparent" views desired by systematic reviews due to their limited context windows and trust issues like hallucinations. Here we present LitChat, an end-to-end, interactive and conversational literature agent that augments LLM agents with data-driven discovery tools to facilitate literature exploration. LitChat automatically interprets user queries, retrieves relevant sources, constructs knowledge graphs, and employs diverse data-mining techniques to generate evidence-based insights addressing user needs. We illustrate the effectiveness of LitChat via a case study on AI4Health, highlighting its capacity to quickly navigate the users through large-scale literature landscape with data-based evidence that is otherwise infeasible with traditional means.
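Summary: In the era of "big literature", the volume of digital scientific publications is growing exponentially and manual reviewing is no longer feasible. LLMs comprehend literature well, but limited context windows and hallucinations prevent them from delivering the "comprehensive, objective, open and transparent" views that systematic reviews require. LitChat is an end-to-end, interactive, conversational literature agent that augments LLM agents with data-driven discovery tools: it interprets user queries, retrieves relevant sources, constructs knowledge graphs, and applies data-mining techniques to produce evidence-based insights. A case study on AI4Health illustrates how it guides users through a large-scale literature landscape.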

Title: Rethinking the Understanding Ability across LLMs through Mutual Information

Authors: Shaojie Wang, Sirui Ding, Na Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23790
Pdf URL: https://arxiv.org/pdf/2505.23790
Copy Paste: [[2505.23790]] Rethinking the Understanding Ability across LLMs through Mutual Information(https://arxiv.org/abs/2505.23790)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have revolutionized natural language processing, yet evaluating their intrinsic linguistic understanding remains challenging. Moving beyond specialized evaluation tasks, we propose an information-theoretic framework grounded in mutual information (MI) to achieve this. We formalize understanding as the MI between an input sentence and its latent representation (sentence-level MI), measuring how effectively input information is preserved in the latent representation. Given that LLMs learn embeddings for individual tokens, we decompose sentence-level MI into token-level MI between tokens and sentence embeddings, establishing theoretical bounds connecting these measures. Based on this foundation, we theoretically derive a computable lower bound for token-level MI using Fano's inequality, which directly relates to token-level recoverability, i.e., the ability to predict the original tokens from the sentence embedding. We implement this recoverability task to comparatively measure MI across different LLMs, revealing that encoder-only models consistently maintain higher information fidelity than their decoder-only counterparts, with the latter exhibiting a distinctive late-layer "forgetting" pattern where mutual information is first enhanced and then discarded. Moreover, fine-tuning to maximize token-level recoverability consistently improves the understanding ability of LLMs on tasks without task-specific supervision, demonstrating that mutual information can serve as a foundation for understanding and improving language model capabilities.
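Summary: The paper proposes an information-theoretic framework for evaluating the intrinsic linguistic understanding of LLMs. Understanding is formalized as the mutual information (MI) between an input sentence and its latent representation, which is then decomposed into token-level MI between tokens and sentence embeddings. Using Fano's inequality, the authors derive a computable lower bound on token-level MI that is tied to token-level recoverability, the ability to predict the original tokens from the sentence embedding. Encoder-only models retain more information than decoder-only models, which show a late-layer "forgetting" pattern in which MI is first enhanced and then discarded. Fine-tuning to maximize recoverability improves understanding on tasks without task-specific supervision.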
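To make the link between recoverability and mutual information concrete, here is a minimal sketch of the textbook Fano argument. Fano's inequality gives H(X|Y) <= H_b(P_e) + P_e * log(|V| - 1), so I(X;Y) = H(X) - H(X|Y) >= H(X) - H_b(P_e) - P_e * log(|V| - 1). The paper's exact bound may differ; the function names and the inputs `token_entropy_bits`, `error_rate`, and `vocab_size` are assumptions for illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_mi_lower_bound(token_entropy_bits, error_rate, vocab_size):
    """Lower bound (bits) on I(token; embedding) from a token-recovery error rate.

    token_entropy_bits : H(X), entropy of the token distribution
    error_rate         : P_e, probability that the probe fails to recover the token
    vocab_size         : |V|, number of possible tokens (must be at least 2)
    """
    # Fano: H(X|Y) <= H_b(P_e) + P_e * log2(|V| - 1)
    cond_entropy_upper = binary_entropy(error_rate) + error_rate * math.log2(vocab_size - 1)
    # I(X;Y) = H(X) - H(X|Y); mutual information is never negative.
    return max(0.0, token_entropy_bits - cond_entropy_upper)
```

A lower recovery error gives a larger guaranteed amount of preserved information, which is why a recoverability probe can stand in for a direct MI estimate.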

Title: R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning

Authors: Yuan Li, Qi Luo, Xiaonan Li, Bufan Li, Qinyuan Cheng, Bo Wang, Yining Zheng, Yuxin Wang, Zhangyue Yin, Xipeng Qiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23794
Pdf URL: https://arxiv.org/pdf/2505.23794
Copy Paste: [[2505.23794]] R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning(https://arxiv.org/abs/2505.23794)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge with Large Language Models (LLMs) to enhance factual correctness and mitigate hallucination. However, dense retrievers often become the bottleneck of RAG systems due to their limited parameters compared to LLMs and their inability to perform step-by-step reasoning. While prompt-based iterative RAG attempts to address these limitations, it is constrained by human-designed workflows. To address these limitations, we propose R3-RAG, which uses Reinforcement learning to make the LLM learn how to Reason and Retrieve step by step, thus retrieving comprehensive external knowledge and leading to correct answers. R3-RAG is divided into two stages. We first use cold start to make the model learn the manner of iteratively interleaving reasoning and retrieval. Then we use reinforcement learning to further harness its ability to better explore the external retrieval environment. Specifically, we propose two rewards for R3-RAG: 1) answer correctness for outcome reward, which judges whether the trajectory leads to a correct answer; 2) relevance-based document verification for process reward, encouraging the model to retrieve documents that are relevant to the user question, through which we can let the model learn how to iteratively reason and retrieve relevant documents to get the correct answer. Experimental results show that R3-RAG significantly outperforms baselines and can transfer well to different retrievers. We release R3-RAG at this https URL.
Summary: Dense retrievers are often the bottleneck of RAG systems because they have far fewer parameters than LLMs and cannot reason step by step, while prompt-based iterative RAG is limited by human-designed workflows. R3-RAG uses reinforcement learning to teach the LLM to reason and retrieve step by step. Training has two stages: a cold start that teaches the model to interleave reasoning and retrieval, followed by reinforcement learning that lets it explore the external retrieval environment. Two rewards are used: an outcome reward for answer correctness and a process reward based on relevance-based document verification. R3-RAG significantly outperforms baselines and transfers well to different retrievers.

Title: Emergent LLM behaviors are observationally equivalent to data leakage

Authors: Christopher Barrie, Petter Törnberg
Subjects: cs.CL, cs.GT
Abstract URL: https://arxiv.org/abs/2505.23796
Pdf URL: https://arxiv.org/pdf/2505.23796
Copy Paste: [[2505.23796]] Emergent LLM behaviors are observationally equivalent to data leakage(https://arxiv.org/abs/2505.23796)
Keywords: language model, llm
Abstract: Ashery et al. recently argue that large language models (LLMs), when paired to play a classic "naming game," spontaneously develop linguistic conventions reminiscent of human social norms. Here, we show that their results are better explained by data leakage: the models simply reproduce conventions they already encountered during pre-training. Despite the authors' mitigation measures, we provide multiple analyses demonstrating that the LLMs recognize the structure of the coordination game and recall its outcomes, rather than exhibit "emergent" conventions. Consequently, the observed behaviors are indistinguishable from memorization of the training corpus. We conclude by pointing to potential alternative strategies and reflecting more generally on the place of LLMs for social science models.
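Summary: Ashery et al. recently argued that LLMs paired to play a classic "naming game" spontaneously develop linguistic conventions reminiscent of human social norms. This paper shows that those results are better explained by data leakage: the models reproduce conventions already encountered during pre-training. Despite the original authors' mitigation measures, multiple analyses indicate that the LLMs recognize the structure of the coordination game and recall its outcomes rather than exhibiting "emergent" conventions, so the observed behaviors are indistinguishable from memorization of the training corpus. The paper closes with alternative strategies and a broader reflection on the place of LLMs in social science models.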

Title: My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals

Authors: Jian Lan, Yifei Fu, Udo Schlegel, Gengyuan Zhang, Tanveer Hannan, Haokun Chen, Thomas Seidl
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.23798
Pdf URL: https://arxiv.org/pdf/2505.23798
Copy Paste: [[2505.23798]] My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals(https://arxiv.org/abs/2505.23798)
Keywords: language model
Abstract: Social bias is a critical issue in large vision-language models (VLMs), where fairness- and ethics-related problems harm certain groups of people in society. It is unknown to what extent VLMs yield social bias in generative responses. In this study, we focus on evaluating and mitigating social bias on both the model's response and probability distribution. To do so, we first evaluate four state-of-the-art VLMs on PAIRS and SocialCounterfactuals datasets with the multiple-choice selection task. Surprisingly, we find that models suffer from generating gender-biased or race-biased responses. We also observe that models are prone to stating their responses are fair, but indeed having mis-calibrated confidence levels towards particular social groups. While investigating why VLMs are unfair in this study, we observe that VLMs' hidden layers exhibit substantial fluctuations in fairness levels. Meanwhile, residuals in each layer show mixed effects on fairness, with some contributing positively while some lead to increased bias. Based on these findings, we propose a post-hoc method for the inference stage to mitigate social bias, which is training-free and model-agnostic. We achieve this by ablating bias-associated residuals while amplifying fairness-associated residuals on model hidden layers during inference. We demonstrate that our post-hoc method outperforms the competing training strategies, helping VLMs have fairer responses and more reliable confidence levels.
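Summary: The study evaluates and mitigates social bias in large vision-language models (VLMs), looking at both model responses and probability distributions. Four state-of-the-art VLMs are evaluated on the PAIRS and SocialCounterfactuals datasets with a multiple-choice selection task; they produce gender-biased or race-biased responses and tend to claim fairness while showing mis-calibrated confidence toward particular social groups. Fairness levels fluctuate substantially across hidden layers, and residuals in each layer have mixed effects, some improving fairness and some increasing bias. The proposed post-hoc, training-free, model-agnostic method ablates bias-associated residuals and amplifies fairness-associated residuals at inference time, and it outperforms competing training strategies.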

Title: Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

Authors: Xiaoyuan Wu, Weiran Lin, Omer Akgul, Lujo Bauer
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23799
Pdf URL: https://arxiv.org/pdf/2505.23799
Copy Paste: [[2505.23799]] Estimating LLM Consistency: A User Baseline vs Surrogate Metrics(https://arxiv.org/abs/2505.23799)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility -- one of them being measuring the consistency (the model's confidence in the response, or likelihood of generating a similar response when resampled) of LLM responses. In previous work, measuring consistency often relied on the probability of a response appearing within a pool of resampled responses, or internal states or logits of responses. However, it is not yet clear how well these approaches approximate how humans perceive the consistency of LLM responses. We performed a user study (n=2,976) and found current methods typically do not approximate users' perceptions of LLM consistency very well. We propose a logit-based ensemble method for estimating LLM consistency, and we show that this method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods of estimating LLM consistency without human evaluation are sufficiently imperfect that we suggest evaluation with human input be more broadly used.
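Summary: LLMs are prone to hallucination and sensitive to prompt perturbations. One proposed mitigation is to measure the consistency of responses, meaning the model's confidence in a response or the likelihood of generating a similar response when resampled. Prior work estimated consistency from the probability of a response appearing in a pool of resampled responses, or from internal states or logits. A user study (n=2,976) finds that current methods typically do not approximate users' perceptions of consistency very well. The authors propose a logit-based ensemble method that matches the best existing metric at estimating human ratings, and they recommend broader use of evaluation with human input.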
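The resampling-based notion of consistency mentioned in the abstract can be illustrated with a very small sketch: score a response by the fraction of resampled responses that match it. The exact-match comparison after whitespace and case normalization, and the function names, are simplifying assumptions; the surrogate metrics studied in the paper are more elaborate.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def resample_consistency(response, resampled):
    """Fraction of resampled responses that match the original after normalization."""
    if not resampled:
        raise ValueError("need at least one resampled response")
    target = normalize(response)
    matches = sum(1 for candidate in resampled if normalize(candidate) == target)
    return matches / len(resampled)
```

Exact matching is too strict for free-form text; a semantic similarity check would be the natural replacement, which is one reason such surrogate scores can diverge from human judgments.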

Title: MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Authors: Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M. Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, Akshay Swaminathan, Philip Chung, Fateme Nateghi, Asad Aali, Ashwin Nayak, Shivam Vedak, Sneha S. Jain, Birju Patel, Oluseyi Fayanju, Shreya Shah, Ethan Goh, Dong-han Yao, Brian Soetikno, Eduardo Reis, Sergios Gatidis, Vasu Divi, Robson Capasso, Rachna Saralkar, Chia-Chun Chiang, Jenelle Jindal, Tho Pham, Faraz Ghoddusi, Steven Lin, Albert S. Chiou, Christy Hong, Mohana Roy, Michael F. Gensheimer, Hinesh Patel, Kevin Schulman, Dev Dash, Danton Char, Lance Downing, Francois Grolleau, Kameron Black, Bethel Mieso, Aydin Zahedivash, Wen-wai Yim, Harshita Sharma, Tony Lee, Hannah Kirsch, Jennifer Lee, Nerissa Ambers, Carlene Lugtu, Aditya Sharma, Bilal Mawji, Alex Alekseyev, Vicky Zhou, Vikas Kakkar, Jarrod Helzer, Anurang Revri, Yair Bannett, Roxana Daneshjou, Jonathan Chen, Emily Alsentzer, Keith Morse, Nirmal Ravi, Nima Aghaeepour, Vanessa Kennedy, Akshay Chaudhari, Thomas Wang, Sanmi Koyejo, Matthew P. Lungren, Eric Horvitz, Percy Liang, Mike Pfeffer, Nigam H. Shah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23802
Pdf URL: https://arxiv.org/pdf/2505.23802
Copy Paste: [[2505.23802]] MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks(https://arxiv.org/abs/2505.23802)
Keywords: language model, llm
Abstract: While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open source framework to enable this.
Summary: MedHELM is an extensible framework for evaluating LLMs on medical tasks. It contributes a clinician-validated taxonomy (5 categories, 22 subcategories, 121 tasks, developed with 29 clinicians), a suite of 35 benchmarks (17 existing, 18 newly formulated) covering every category and subcategory, and a systematic comparison of LLMs using an LLM-jury plus a cost-performance analysis. Across 9 frontier LLMs, reasoning models lead (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate), while Claude 3.5 Sonnet achieves comparable results at 40% lower estimated computational cost. Models score highest in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83) and lowest in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). The LLM-jury agrees with clinician ratings (ICC = 0.47) better than clinicians agree with each other on average (ICC = 0.43) and better than ROUGE-L (0.36) and BERTScore-F1 (0.44).

Title: Calibrating LLMs for Text-to-SQL Parsing by Leveraging Sub-clause Frequencies

Authors: Terrance Liu, Shuyi Wang, Daniel Preotiuc-Pietro, Yash Chandarana, Chirag Gupta
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23804
Pdf URL: https://arxiv.org/pdf/2505.23804
Copy Paste: [[2505.23804]] Calibrating LLMs for Text-to-SQL Parsing by Leveraging Sub-clause Frequencies(https://arxiv.org/abs/2505.23804)
Keywords: language model, llm
Abstract: While large language models (LLMs) achieve strong performance on text-to-SQL parsing, they sometimes exhibit unexpected failures in which they are confidently incorrect. Building trustworthy text-to-SQL systems thus requires eliciting reliable uncertainty measures from the LLM. In this paper, we study the problem of providing a calibrated confidence score that conveys the likelihood of an output query being correct. Our work is the first to establish a benchmark for post-hoc calibration of LLM-based text-to-SQL parsing. In particular, we show that Platt scaling, a canonical method for calibration, provides substantial improvements over directly using raw model output probabilities as confidence scores. Furthermore, we propose a method for text-to-SQL calibration that leverages the structured nature of SQL queries to provide more granular signals of correctness, named "sub-clause frequency" (SCF) scores. Using multivariate Platt scaling (MPS), our extension of the canonical Platt scaling technique, we combine individual SCF scores into an overall accurate and calibrated score. Empirical evaluation on two popular text-to-SQL datasets shows that our approach of combining MPS and SCF yields further improvements in calibration and the related task of error detection over traditional Platt scaling.
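Summary: LLMs perform well on text-to-SQL parsing but are sometimes confidently wrong, so trustworthy systems need calibrated confidence scores that convey how likely an output query is to be correct. This work is the first to establish a benchmark for post-hoc calibration of LLM-based text-to-SQL parsing. Platt scaling, a canonical calibration method, substantially improves on raw model output probabilities. The authors also propose "sub-clause frequency" (SCF) scores, which exploit the structure of SQL queries to give finer-grained correctness signals, and combine them with multivariate Platt scaling (MPS). On two popular text-to-SQL datasets, MPS with SCF further improves calibration and error detection over traditional Platt scaling.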
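Platt scaling fits a logistic function that maps a raw score to a calibrated probability; with several input signals it becomes a small logistic regression. The sketch below is a plain gradient-descent version in pure Python, not the authors' MPS implementation: the feature layout (a raw model probability plus one SCF-style score), the learning rate, and the step count are all assumptions, and Platt's label-smoothing refinement is omitted.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(features, labels, lr=0.5, steps=2000):
    """Fit weights w and bias b so that sigmoid(w . x + b) approximates P(correct | x)."""
    n, n_feat = len(features), len(features[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        # Gradient step on the mean log loss.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def calibrated_confidence(x, w, b):
    """Calibrated probability that the query described by features x is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical held-out data: [raw model probability, one SCF-style score] and correctness labels.
FEATURES = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.3], [0.3, 0.2], [0.2, 0.4], [0.6, 0.7]]
LABELS = [1, 1, 0, 0, 0, 1]
W, B = fit_platt(FEATURES, LABELS)
```

In practice the fit would use a held-out calibration split, and a library logistic regression would replace the hand-written loop.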

Title: MedOrchestra: A Hybrid Cloud-Local LLM Approach for Clinical Data Interpretation

Authors: Sihyeon Lee, Hyunjoo Song, Jong-chan Lee, Yoon Jin Lee, Boram Lee, Hee-Eon Lim, Dongyeong Kim, Jinwook Seo, Bohyoung Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23806
Pdf URL: https://arxiv.org/pdf/2505.23806
Copy Paste: [[2505.23806]] MedOrchestra: A Hybrid Cloud-Local LLM Approach for Clinical Data Interpretation(https://arxiv.org/abs/2505.23806)
Keywords: language model, llm, prompt
Abstract: Deploying large language models (LLMs) in clinical settings faces critical trade-offs: cloud LLMs, with their extensive parameters and superior performance, pose risks to sensitive clinical data privacy, while local LLMs preserve privacy but often fail at complex clinical interpretation tasks. We propose MedOrchestra, a hybrid framework where a cloud LLM decomposes complex clinical tasks into manageable subtasks and prompt generation, while a local LLM executes these subtasks in a privacy-preserving manner. Without accessing clinical data, the cloud LLM generates and validates subtask prompts using clinical guidelines and synthetic test cases. The local LLM executes subtasks locally and synthesizes outputs generated by the cloud LLM. We evaluate MedOrchestra on pancreatic cancer staging using 100 radiology reports under NCCN guidelines. On free-text reports, MedOrchestra achieves 70.21% accuracy, outperforming local model baselines (without guideline: 48.94%, with guideline: 56.59%) and board-certified clinicians (gastroenterologists: 59.57%, surgeons: 65.96%, radiologists: 55.32%). On structured reports, MedOrchestra reaches 85.42% accuracy, showing clear superiority across all settings.
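Summary: Cloud LLMs perform well but put sensitive clinical data at risk, while local LLMs preserve privacy but often fail at complex clinical interpretation. MedOrchestra is a hybrid framework in which a cloud LLM, without access to clinical data, decomposes a task into subtasks and generates and validates subtask prompts using clinical guidelines and synthetic test cases; a local LLM then executes the subtasks in a privacy-preserving manner. On pancreatic cancer staging with 100 radiology reports under NCCN guidelines, MedOrchestra reaches 70.21% accuracy on free-text reports, ahead of local baselines (48.94% without guideline, 56.59% with guideline) and board-certified clinicians (gastroenterologists 59.57%, surgeons 65.96%, radiologists 55.32%), and 85.42% on structured reports.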

Title: DLP: Dynamic Layerwise Pruning in Large Language Models

Authors: Yuli Chen, Bo Cheng, Jiale Han, Yingying Zhang, Yingting Li, Shuhao Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23807
Pdf URL: https://arxiv.org/pdf/2505.23807
Copy Paste: [[2505.23807]] DLP: Dynamic Layerwise Pruning in Large Language Models(https://arxiv.org/abs/2505.23807)
Keywords: language model, llm
Abstract: Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at this https URL to facilitate future research.
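Summary: Pruning is widely used to shrink LLMs and speed up inference, but uniform layerwise pruning degrades performance severely at high sparsity, and existing non-uniform approaches depend on pre-defined values. Dynamic Layerwise Pruning (DLP) adaptively determines the relative importance of each layer by combining model weights with input activation information and assigns pruning rates accordingly. At 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves average accuracy by 2.7% compared to state-of-the-art methods. It is compatible with existing LLM compression techniques and can be integrated into Parameter-Efficient Fine-Tuning (PEFT). The code is released at the linked URL.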
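The idea of non-uniform layerwise pruning, where more important layers are pruned less while the overall sparsity target is kept, can be sketched as follows. The linear allocation rule, the `spread` parameter, and the function name are hypothetical choices for illustration; the abstract does not specify how DLP converts importance into pruning rates.

```python
def allocate_sparsity(importance, target=0.7, spread=0.1):
    """Assign a pruning rate to each layer from its importance score.

    importance : one non-negative score per layer (higher means more important)
    target     : desired average sparsity across layers
    spread     : maximum total variation of rates around the target (hypothetical knob)
    """
    n = len(importance)
    mean = sum(importance) / n
    span = (max(importance) - min(importance)) or 1.0  # avoid division by zero
    # Layers above mean importance get a rate below the target, and vice versa.
    # The deviations sum to zero, so the average rate stays at the target.
    rates = [target - spread * (score - mean) / span for score in importance]
    # Clipping keeps rates valid; it only matters for extreme target/spread settings.
    return [min(1.0, max(0.0, rate)) for rate in rates]
```

For importance scores `[1.0, 2.0, 3.0]` with the defaults, the least important layer is pruned at about 75% and the most important at about 65%.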

Title: DenseLoRA: Dense Low-Rank Adaptation of Large Language Models

Authors: Lin Mu, Xiaoyu Wang, Li Ni, Yang Li, Zhize Wu, Peiquan Jin, Yiwen Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23808
Pdf URL: https://arxiv.org/pdf/2505.23808
Copy Paste: [[2505.23808]] DenseLoRA: Dense Low-Rank Adaptation of Large Language Models(https://arxiv.org/abs/2505.23808)
Keywords: language model, llm
Abstract: Low-rank adaptation (LoRA) has been developed as an efficient approach for adapting large language models (LLMs) by fine-tuning two low-rank matrices, thereby reducing the number of trainable parameters. However, prior research indicates that many of the weights in these matrices are redundant, leading to inefficiencies in parameter utilization. To address this limitation, we introduce Dense Low-Rank Adaptation (DenseLoRA), a novel approach that enhances parameter efficiency while achieving superior performance compared to LoRA. DenseLoRA builds upon the concept of representation fine-tuning, incorporating a single Encoder-Decoder to refine and compress hidden representations across all adaptation layers before applying adaptation. Instead of relying on two redundant low-rank matrices as in LoRA, DenseLoRA adapts LLMs through a dense low-rank matrix, improving parameter utilization and adaptation efficiency. We evaluate DenseLoRA on various benchmarks, showing that it achieves 83.8% accuracy with only 0.01% of trainable parameters, compared to LoRA's 80.8% accuracy with 0.70% of trainable parameters on LLaMA3-8B. Additionally, we conduct extensive experiments to systematically assess the impact of DenseLoRA's components on overall model performance. Code is available at this https URL.
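Summary: Low-rank adaptation (LoRA) adapts LLMs by fine-tuning two low-rank matrices, but prior research indicates that many of their weights are redundant. Dense Low-Rank Adaptation (DenseLoRA) builds on representation fine-tuning: a single Encoder-Decoder refines and compresses hidden representations across all adaptation layers, and the model is adapted through a dense low-rank matrix instead of two redundant ones. On LLaMA3-8B, DenseLoRA reaches 83.8% accuracy with only 0.01% of trainable parameters, compared with LoRA's 80.8% accuracy with 0.70% of trainable parameters. Ablation experiments assess the contribution of each component. Code is available at the linked URL.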
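The parameter savings described above can be made concrete with a small count. For one weight matrix of shape d_out x d_in, LoRA trains two matrices of shapes d_out x r and r x d_in. The second function reflects one possible reading of the abstract, in which each adapted layer trains only a dense r x r matrix while the encoder and decoder are shared across layers; that reading, the function names, and the example sizes are assumptions, not details confirmed by the source.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def dense_core_trainable_params(rank):
    """Per-layer count if only a dense rank x rank matrix is trained (assumed reading)."""
    return rank * rank


# Illustrative sizes only: a 4096 x 4096 projection with rank 16.
LORA_COUNT = lora_trainable_params(4096, 4096, 16)
DENSE_COUNT = dense_core_trainable_params(16)
```

With these illustrative sizes, LoRA adds 131,072 trainable parameters per matrix against 256 for the dense core, before counting the shared encoder and decoder.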

Title: LLM-Driven E-Commerce Marketing Content Optimization: Balancing Creativity and Conversion

Authors: Haowei Yang, Haotian Lyu, Tianle Zhang, Dingzhou Wang, Yushang Zhao
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.23809
Pdf URL: https://arxiv.org/pdf/2505.23809
Copy Paste: [[2505.23809]] LLM-Driven E-Commerce Marketing Content Optimization: Balancing Creativity and Conversion(https://arxiv.org/abs/2505.23809)
Keywords: llm, prompt
Abstract: As e-commerce competition intensifies, balancing creative content with conversion effectiveness becomes critical. Leveraging LLMs' language generation capabilities, we propose a framework that integrates prompt engineering, multi-objective fine-tuning, and post-processing to generate marketing copy that is both engaging and conversion-driven. Our fine-tuning method combines sentiment adjustment, diversity enhancement, and CTA embedding. Through offline evaluations and online A/B tests across categories, our approach achieves a 12.5% increase in CTR and an 8.3% increase in CVR while maintaining content novelty. This provides a practical solution for automated copy generation and suggests paths for future multimodal, real-time personalization.

Title: MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Authors: Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23810
Pdf URL: https://arxiv.org/pdf/2505.23810
Copy Paste: [[2505.23810]] MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation(https://arxiv.org/abs/2505.23810)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions that involve frequent motivation transfer and sophisticated cross-turn dependency, has long been criticized. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, explicit reasoning significantly boosts LLMs' robustness on handling long complex dialogue sessions, and LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependency. Moreover, we provide mechanistic interpretability on how attention sinks due to special tokens lead to LLMs' performance degradation when handling long complex dialogue sessions based on attention visualization experiment in Qwen2.5-7B-Instruction.

Title: LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions

Authors: Hadi Askari, Shivanshu Gupta, Fei Wang, Anshuman Chhabra, Muhao Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23811
Pdf URL: https://arxiv.org/pdf/2505.23811
Copy Paste: [[2505.23811]] LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions(https://arxiv.org/abs/2505.23811)
Keywords: language model, llm
Abstract: Pretrained Large Language Models (LLMs) achieve strong performance across a wide range of tasks, yet exhibit substantial variability in the various layers' training quality with respect to specific downstream applications, limiting their downstream performance. It is therefore critical to estimate layer-wise training quality in a manner that accounts for both model architecture and training data. However, existing approaches predominantly rely on model-centric heuristics (such as spectral statistics, outlier detection, or uniform allocation) while overlooking the influence of data. To address these limitations, we propose LayerIF, a data-driven framework that leverages Influence Functions to quantify the training quality of individual layers in a principled and task-sensitive manner. By isolating each layer's gradients and measuring the sensitivity of the validation loss to training examples by computing layer-wise influences, we derive data-driven estimates of layer importance. Notably, our method produces task-specific layer importance estimates for the same LLM, revealing how layers specialize for different test-time evaluation tasks. We demonstrate the utility of our scores by leveraging them for two downstream applications: (a) expert allocation in LoRA-MoE architectures and (b) layer-wise sparsity distribution for LLM pruning. Experiments across multiple LLM architectures demonstrate that our model-agnostic, influence-guided allocation leads to consistent gains in task performance.

Title: Aligning LLMs by Predicting Preferences from User Writing Samples

Authors: Stéphane Aroca-Ouellette, Natalie Mackraz, Barry-John Theobald, Katherine Metcalf
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23815
Pdf URL: https://arxiv.org/pdf/2505.23815
Copy Paste: [[2505.23815]] Aligning LLMs by Predicting Preferences from User Writing Samples(https://arxiv.org/abs/2505.23815)
Keywords: gpt, llm, agent
Abstract: Accommodating human preferences is essential for creating aligned LLM agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs acting as writing agents to infer a description of user preferences. Agent alignment then comes from conditioning on the inferred preference description. However, existing methods often produce generic preference descriptions that fail to capture the unique and individualized nature of human preferences. This paper introduces PROSE, a method designed to enhance the precision of preference descriptions inferred from user writing samples. PROSE incorporates two key elements: (1) iterative refinement of inferred preferences, and (2) verification of inferred preferences across multiple user writing samples. We evaluate PROSE with several LLMs (i.e., Qwen2.5 7B and 72B Instruct, GPT-mini, and GPT-4o) on a summarization and an email writing task. We find that PROSE more accurately infers nuanced human preferences, improving the quality of the writing agent's generations over CIPHER (a state-of-the-art method for inferring preferences) by 33%. Lastly, we demonstrate that ICL and PROSE are complementary methods, and combining them provides up to a 9% improvement over ICL alone.

Title: A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs

Authors: Trenton Chang, Tobias Schnabel, Adith Swaminathan, Jenna Wiens
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23816
Pdf URL: https://arxiv.org/pdf/2505.23816
Copy Paste: [[2505.23816]] A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs(https://arxiv.org/abs/2505.23816)
Keywords: language model, llm, prompt
Abstract: Despite advances in large language models (LLMs) on reasoning and instruction-following benchmarks, it remains unclear whether they can reliably produce outputs aligned with a broad variety of user goals, a concept we refer to as steerability. The abundance of methods proposed to modify LLM behavior makes it unclear whether current LLMs are already steerable, or require further intervention. In particular, LLMs may exhibit (i) poor coverage, where rare user goals are underrepresented; (ii) miscalibration, where models overshoot requests; and (iii) side effects, where changes to one dimension of text inadvertently affect others. To systematically evaluate these failures, we introduce a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs struggle with steerability, as side effects are persistent. Interventions to improve steerability, such as prompt engineering, best-of-N sampling, and reinforcement learning fine-tuning, have varying effectiveness, yet side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient. We open-source our steerability evaluation framework at this https URL.
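One way to picture the goal-space framing in the abstract above is to split the movement a model actually produced into a part along the requested direction and a part orthogonal to it. The sketch below is an illustrative decomposition, not the paper's metric: the function name, the projection-based split, and the two example attributes are all assumptions.

```python
import math
from typing import Sequence, Tuple


def steering_error(source: Sequence[float], goal: Sequence[float],
                   output: Sequence[float]) -> Tuple[float, float]:
    """Return (progress ratio along the requested direction, off-direction drift)."""
    requested = [g - s for g, s in zip(goal, source)]
    actual = [o - s for o, s in zip(output, source)]
    req_norm_sq = sum(r * r for r in requested)
    if req_norm_sq == 0:
        raise ValueError("goal equals source; no direction was requested")
    # Projection coefficient: 1.0 means the request was met exactly, >1.0 is an overshoot.
    ratio = sum(a * r for a, r in zip(actual, requested)) / req_norm_sq
    # Movement left over after removing the requested component is treated as a side effect.
    residual = [a - ratio * r for a, r in zip(actual, requested)]
    drift = math.sqrt(sum(x * x for x in residual))
    return ratio, drift


# Two hypothetical text attributes, e.g. (reading difficulty, formality).
ratio, drift = steering_error(source=(0.0, 0.0), goal=(1.0, 0.0), output=(1.5, 0.5))
print(ratio, drift)
```

In this toy case the output overshoots the requested change in the first attribute and also moves the second attribute, which was never requested.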

Title: Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks

Authors: Bhaktipriya Radharapu, Manon Revel, Megan Ung, Sebastian Ruder, Adina Williams
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23820
Pdf URL: https://arxiv.org/pdf/2505.23820
Copy Paste: [[2505.23820]] Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks(https://arxiv.org/abs/2505.23820)
Keywords: llm
Abstract: The increasing use of LLMs as substitutes for humans in "aligning" LLMs has raised questions about their ability to replicate human judgments and preferences, especially in ambivalent scenarios where humans disagree. This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. These roles loosely correspond to previously described alignment frameworks: preference alignment (judge) and scalable oversight (debater), with the answer generator reflecting the typical setting with user interactions. We develop a "no-consensus" benchmark by curating examples that encompass a variety of a priori ambivalent scenarios, each presenting two possible stances. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters. These findings underscore the necessity for more sophisticated methods for aligning LLMs without human oversight, highlighting that LLMs cannot fully capture human disagreement even on topics where humans themselves are divided.

Title: Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction

Authors: Mai Ali, Christopher Lucasius, Tanmay P. Patel, Madison Aitken, Jacob Vorstman, Peter Szatmari, Marco Battaglia, Deepa Kundur
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2505.23822
Pdf URL: https://arxiv.org/pdf/2505.23822
Copy Paste: [[2505.23822]] Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction(https://arxiv.org/abs/2505.23822)
Keywords: language model, llm
Abstract: Speech is a noninvasive digital phenotype that can offer valuable insights into mental health conditions, but it is often treated as a single modality. In contrast, we propose the treatment of patient speech data as a trimodal multimedia data source for depression detection. This study explores the potential of large language model-based architectures for speech-based depression prediction in a multimodal regime that integrates speech-derived text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents a significant challenge and is often comorbid with multiple disorders, such as suicidal ideation and sleep disturbances. This presents an additional opportunity to integrate multi-task learning (MTL) into our study by simultaneously predicting depression, suicidal ideation, and sleep disturbances using the multimodal formulation. We also propose a longitudinal analysis strategy that models temporal changes across multiple clinical interactions, allowing for a comprehensive understanding of the conditions' progression. Our proposed approach, featuring trimodal, longitudinal MTL, is evaluated on the Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%, which is higher than each of the unimodal, single-task, and non-longitudinal methods.
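The headline figure in the abstract above is a balanced accuracy. As a reminder of what that metric measures, here is a minimal sketch using the standard definition (the mean of per-class recall). The labels and predictions are made up for illustration and have nothing to do with the paper's dataset.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall, so each class counts equally regardless of its size."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data: three positive cases and one negative case.
labels = [1, 1, 1, 0]
predictions = [1, 1, 0, 0]
score = balanced_accuracy(labels, predictions)
print(score)
```

Plain accuracy on this toy data is 3/4, while the balanced score averages a recall of 2/3 on the positive class with a recall of 1 on the negative class. This is why the metric is preferred when classes are imbalanced.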

Title: RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery

Authors: Youngseung Jeon, Ziwen Li, Thomas Li, JiaSyuan Chang, Morteza Ziyadi, Xiang 'Anthony' Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23823
Pdf URL: https://arxiv.org/pdf/2505.23823
Copy Paste: [[2505.23823]] RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery(https://arxiv.org/abs/2505.23823)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that reflected expert labeling characteristics, which facilitates the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.

Title: Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation

Authors: Tianmai M. Zhang, Neil F. Abernethy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23824
Pdf URL: https://arxiv.org/pdf/2505.23824
Copy Paste: [[2505.23824]] Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation(https://arxiv.org/abs/2505.23824)
Keywords: language model, llm
Abstract: Recent advancements in large language models have sparked interest in utilizing them to assist the peer review process of scientific publication. Instead of having AI models generate reviews in the same way as human reviewers, we propose adopting them as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs from different providers and assessed their performance and API costs for identifying critical errors and unsoundness problems. The OpenAI o3 model performed the best, while o4-mini was the most cost-effective one in our evaluation. This paper provides insights into document-based scientific understanding/reasoning and lays the foundation for future applications.

Title: ValueSim: Generating Backstories to Model Individual Value Systems

Authors: Bangde Du, Ziyi Ye, Zhijing Wu, Jankowska Monika, Shuqi Zhu, Qingyao Ai, Yujia Zhou, Yiqun Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23827
Pdf URL: https://arxiv.org/pdf/2505.23827
Copy Paste: [[2505.23827]] ValueSim: Generating Backstories to Model Individual Value Systems(https://arxiv.org/abs/2505.23827)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: As Large Language Models (LLMs) continue to exhibit increasingly human-like capabilities, aligning them with human values has become critically important. Contemporary advanced techniques, such as prompt learning and reinforcement learning, are being deployed to better align LLMs with human values. However, while these approaches address broad ethical considerations and helpfulness, they rarely focus on simulating individualized human value systems. To address this gap, we present ValueSim, a framework that simulates individual values through the generation of personal backstories reflecting past experiences and demographic information. ValueSim converts structured individual data into narrative backstories and employs a multi-module architecture inspired by the Cognitive-Affective Personality System to simulate individual values based on these narratives. Testing ValueSim on a self-constructed benchmark derived from the World Values Survey demonstrates an improvement in top-1 accuracy by over 10% compared to retrieval-augmented generation methods. Further analysis reveals that performance improves as additional user interaction history becomes available, indicating the model's ability to refine its persona simulation capabilities over time.

Title: BiasFilter: An Inference-Time Debiasing Framework for Large Language Models

Authors: Xiaoqing Cheng, Ruizhe Chen, Hongying Zan, Yuxiang Jia, Min Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23829
Pdf URL: https://arxiv.org/pdf/2505.23829
Copy Paste: [[2505.23829]] BiasFilter: An Inference-Time Debiasing Framework for Large Language Models(https://arxiv.org/abs/2505.23829)
Keywords: language model, llm
Abstract: Mitigating social bias in large language models (LLMs) has become an increasingly important research objective. However, existing debiasing methods often incur high human and computational costs, exhibit limited effectiveness, and struggle to scale to larger models and open-ended generation tasks. To address these limitations, this paper proposes BiasFilter, a model-agnostic, inference-time debiasing framework that integrates seamlessly with both open-source and API-based LLMs. Instead of relying on retraining with balanced data or modifying model parameters, BiasFilter enforces fairness by filtering generation outputs in real time. Specifically, it periodically evaluates intermediate outputs every few tokens, maintains an active set of candidate continuations, and incrementally completes generation by discarding low-reward segments based on a fairness reward signal. To support this process, we construct a fairness preference dataset and train an implicit reward model to assess token-level fairness in generated responses. Extensive experiments demonstrate that BiasFilter effectively mitigates social bias across a range of LLMs while preserving overall generation quality.
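The filtering loop described in the abstract above (score partial outputs every few tokens, keep an active set of candidates, discard low-reward ones) can be sketched as a reward-guided search. Everything below is a toy stand-in: `toy_expand` plays the role of the language model proposing continuations, `toy_reward` plays the role of the trained fairness reward model, and the function names and parameters are hypothetical.

```python
from typing import Callable, List

Tokens = List[str]


def filtered_generation(expand: Callable[[Tokens], List[Tokens]],
                        reward: Callable[[Tokens], float],
                        rounds: int, keep: int) -> Tokens:
    """Grow candidates segment by segment, keeping only the `keep` best after each round."""
    active: List[Tokens] = [[]]
    for _ in range(rounds):
        # Every active candidate proposes its possible next segments.
        proposals = [cont for cand in active for cont in expand(cand)]
        # Rank partial outputs by reward and discard the low-reward ones.
        proposals.sort(key=reward, reverse=True)
        active = proposals[:keep]
    return active[0]


def toy_expand(prefix: Tokens) -> List[Tokens]:
    """Stand-in for a language model: two possible next segments per prefix."""
    return [prefix + [segment] for segment in ("neutral", "biased")]


def toy_reward(tokens: Tokens) -> float:
    """Stand-in for a fairness reward model: penalize each flagged segment."""
    return -float(tokens.count("biased"))


result = filtered_generation(toy_expand, toy_reward, rounds=3, keep=2)
print(result)
```

Because filtering happens on intermediate outputs, no model parameters are touched, which matches the inference-time framing in the abstract.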

Title: EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models

Authors: Linglin Jing, Yuting Gao, Zhigang Wang, Wang Lan, Yiwen Tang, Wenhai Wang, Kaipeng Zhang, Qingpei Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23830
Pdf URL: https://arxiv.org/pdf/2505.23830
Copy Paste: [[2505.23830]] EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models(https://arxiv.org/abs/2505.23830)
Keywords: language model, llm
Abstract: Recent advancements have shown that the Mixture of Experts (MoE) approach significantly enhances the capacity of large language models (LLMs) and improves performance on downstream tasks. Building on these promising results, multi-modal large language models (MLLMs) have increasingly adopted MoE techniques. However, existing multi-modal MoE tuning methods typically face two key challenges: expert uniformity and router rigidity. Expert uniformity occurs because MoE experts are often initialized by simply replicating the FFN parameters from LLMs, leading to homogenized expert functions and weakening the intended diversification of the MoE architecture. Meanwhile, router rigidity stems from the prevalent use of static linear routers for expert selection, which fail to distinguish between visual and textual tokens, resulting in similar expert distributions for image and text. To address these limitations, we propose EvoMoE, an innovative MoE tuning framework. EvoMoE introduces a meticulously designed expert initialization strategy that progressively evolves multiple robust experts from a single trainable expert, a process termed expert evolution that specifically targets severe expert homogenization. Furthermore, we introduce the Dynamic Token-aware Router (DTR), a novel routing mechanism that allocates input tokens to appropriate experts based on their modality and intrinsic token values. This dynamic routing is facilitated by hypernetworks, which dynamically generate routing weights tailored for each individual token. Extensive experiments demonstrate that EvoMoE significantly outperforms other sparse MLLMs across a variety of multi-modal benchmarks, including MME, MMBench, TextVQA, and POPE. Our results highlight the effectiveness of EvoMoE in enhancing the performance of MLLMs by addressing the critical issues of expert uniformity and router rigidity.

Title: ICH-Qwen: A Large Language Model Towards Chinese Intangible Cultural Heritage

Authors: Wenhao Ye, Tiansheng Zheng, Yue Qi, Wenhua Zhao, Xiyu Wang, Xue Zhao, Jiacheng He, Yaya Zheng, Dongbo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23831
Pdf URL: https://arxiv.org/pdf/2505.23831
Copy Paste: [[2505.23831]] ICH-Qwen: A Large Language Model Towards Chinese Intangible Cultural Heritage(https://arxiv.org/abs/2505.23831)
Keywords: language model
Abstract: The intangible cultural heritage (ICH) of China, a cultural asset transmitted across generations by various ethnic groups, serves as a significant testament to the evolution of human civilization and holds irreplaceable value for the preservation of historical lineage and the enhancement of cultural self-confidence. However, the rapid pace of modernization poses formidable challenges to ICH, including threats of damage, disappearance, and discontinuity of inheritance. China has the highest number of items on the UNESCO Intangible Cultural Heritage List, which is indicative of the nation's abundant cultural resources and emphasises the pressing need for ICH preservation. In recent years, the rapid advancements in large language modelling have provided a novel technological approach for the preservation and dissemination of ICH. This study utilises a substantial corpus of open-source Chinese ICH data to develop a large language model, ICH-Qwen, for the ICH domain. The model employs natural language understanding and knowledge reasoning capabilities of large language models, augmented with synthetic data and fine-tuning techniques. The experimental results demonstrate the efficacy of ICH-Qwen in executing tasks specific to the ICH domain. It is anticipated that the model will provide intelligent solutions for the protection, inheritance and dissemination of intangible cultural heritage, as well as new theoretical and practical references for the sustainable development of intangible cultural heritage. Furthermore, it is expected that the study will open up new paths for digital humanities research.

Title: Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

Authors: Qingchuan Ma, Yuhang Wu, Xiawu Zheng, Rongrong Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23833
Pdf URL: https://arxiv.org/pdf/2505.23833
Copy Paste: [[2505.23833]] Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective(https://arxiv.org/abs/2505.23833)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: $\scoreGamma$ measures basic reasoning accuracy, while $\scoreDelta$ quantifies a model's reliance on specific symbols rather than underlying patterns - a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark: systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B, multi-agent) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) $\scoreDelta$'s effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. These findings underscore that current LLMs, despite domain-specific strengths, still lack robust abstract reasoning, highlighting key areas for future improvement.
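The symbol-remapping idea in the abstract above can be illustrated with a small sketch: rewrite a rule-based task with different surface symbols, then compare accuracy before and after. The abstract does not give the formula for either metric, so `remapping_drop` below is a hypothetical degradation score (a plain accuracy difference), and the digit-to-letter mapping is likewise an assumption.

```python
from typing import Dict, Sequence


def remap_symbols(text: str, mapping: Dict[str, str]) -> str:
    """Replace each mapped symbol by its image; other characters pass through unchanged."""
    return text.translate(str.maketrans(mapping))


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of exact-match answers."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def remapping_drop(acc_original: float, acc_remapped: float) -> float:
    """Hypothetical degradation score: accuracy lost when only the surface symbols change."""
    return acc_original - acc_remapped


# Assumed remapping: digits become letters, the operator is left alone.
digit_to_letter = dict(zip("0123456789", "abcdefghij"))
letter_to_digit = {letter: digit for digit, letter in digit_to_letter.items()}

prompt = "12+30"
remapped = remap_symbols(prompt, digit_to_letter)
restored = remap_symbols(remapped, letter_to_digit)
print(remapped, restored)
```

A model that has learned the underlying rule should answer the remapped task about as well as the original, while a large drop suggests reliance on the specific symbols.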

Title: Say What You Mean: Natural Language Access Control with Large Language Models for Internet of Things

Authors: Ye Cheng, Minghui Xu, Yue Zhang, Kun Li, Hao Wu, Yechao Zhang, Shaoyong Guo, Wangjie Qiu, Dongxiao Yu, Xiuzhen Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23835
Pdf URL: https://arxiv.org/pdf/2505.23835
Copy Paste: [[2505.23835]] Say What You Mean: Natural Language Access Control with Large Language Models for Internet of Things(https://arxiv.org/abs/2505.23835)
Keywords: language model, gpt, llm, prompt
Abstract: Access control in the Internet of Things (IoT) is becoming increasingly complex, as policies must account for dynamic and contextual factors such as time, location, user behavior, and environmental conditions. However, existing platforms either offer only coarse-grained controls or rely on rigid rule matching, making them ill-suited for semantically rich or ambiguous access scenarios. Moreover, the policy authoring process remains fragmented: domain experts describe requirements in natural language, but developers must manually translate them into code, introducing semantic gaps and potential misconfiguration. In this work, we present LACE, the Language-based Access Control Engine, a hybrid framework that leverages large language models (LLMs) to bridge the gap between human intent and machine-enforceable logic. LACE combines prompt-guided policy generation, retrieval-augmented reasoning, and formal validation to support expressive, interpretable, and verifiable access control. It enables users to specify policies in natural language, automatically translates them into structured rules, validates semantic correctness, and makes access decisions using a hybrid LLM-rule-based engine. We evaluate LACE in smart home environments through extensive experiments. LACE achieves 100% correctness in verified policy generation and up to 88% decision accuracy with 0.79 F1-score using DeepSeek-V3, outperforming baselines such as GPT-3.5 and Gemini. The system also demonstrates strong scalability under increasing policy volume and request concurrency. Our results highlight LACE's potential to enable secure, flexible, and user-friendly access control across real-world IoT platforms.

Title: Large Language Models Often Know When They Are Being Evaluated

Authors: Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23836
Pdf URL: https://arxiv.org/pdf/2505.23836
Copy Paste: [[2505.23836]] Large Language Models Often Know When They Are Being Evaluated(https://arxiv.org/abs/2505.23836)
Keywords: language model, prompt, chat, agent
Abstract: If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations, leading to less reliable benchmarks for deployment and governance decisions. We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from scaffolding frameworks (e.g., web-browsing agents). Frontier models clearly demonstrate above-random evaluation awareness (Gemini-2.5-Pro reaches an AUC of 0.83), but do not yet surpass our simple human baseline (AUC of 0.92). Furthermore, both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. Additionally, we test whether models can identify the purpose of the evaluation. Under multiple-choice and open-ended questioning, AI models far outperform random chance in identifying what an evaluation is testing for. Our results indicate that frontier models already exhibit a substantial, though not yet superhuman, level of evaluation-awareness. We recommend tracking this capability in future models.
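
Illustration (not from the paper): the AUC figures quoted above summarize how well a score separates evaluation transcripts from deployment transcripts. A minimal sketch of that computation follows, assuming each transcript receives a scalar "this is an evaluation" score. The function name, scores, and labels are hypothetical; the abstract does not describe the authors' scoring pipeline.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen evaluation transcript
    (label 1) is scored above a randomly chosen deployment transcript
    (label 0), with ties counted as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: higher means "looks more like an evaluation".
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.4]
labels = [1, 1, 1, 0, 0, 0]  # 1 = evaluation, 0 = deployment
example_auc = auc(scores, labels)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the reported 0.83 and 0.92 sit.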

Title: CoMaPOI: A Collaborative Multi-Agent Framework for Next POI Prediction Bridging the Gap Between Trajectory and Language

Authors: Lin Zhong, Lingzhi Wang, Xu Yang, Qing Liao
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.23837
Pdf URL: https://arxiv.org/pdf/2505.23837
Copy Paste: [[2505.23837]] CoMaPOI: A Collaborative Multi-Agent Framework for Next POI Prediction Bridging the Gap Between Trajectory and Language(https://arxiv.org/abs/2505.23837)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) offer new opportunities for the next Point-Of-Interest (POI) prediction task, leveraging their capabilities in semantic understanding of POI trajectories. However, previous LLM-based methods, which are superficially adapted to next POI prediction, largely overlook critical challenges associated with applying LLMs to this task. Specifically, LLMs encounter two critical challenges: (1) a lack of intrinsic understanding of numeric spatiotemporal data, which hinders accurate modeling of users' spatiotemporal distributions and preferences; and (2) an excessively large and unconstrained candidate POI space, which often results in random or irrelevant predictions. To address these issues, we propose a Collaborative Multi-Agent Framework for Next POI Prediction, named CoMaPOI. Through the close interaction of three specialized agents (Profiler, Forecaster, and Predictor), CoMaPOI collaboratively addresses the two critical challenges. The Profiler agent is responsible for converting numeric data into language descriptions, enhancing semantic understanding. The Forecaster agent focuses on dynamically constraining and refining the candidate POI space. The Predictor agent integrates this information to generate high-precision predictions. Extensive experiments on three benchmark datasets (NYC, TKY, and CA) demonstrate that CoMaPOI achieves state-of-the-art performance, improving all metrics by 5% to 10% compared to SOTA baselines. This work pioneers the investigation of challenges associated with applying LLMs to complex spatiotemporal tasks by leveraging tailored collaborative agents.

Title: Exploring the Landscape of Text-to-SQL with Large Language Models: Progresses, Challenges and Opportunities

Authors: Yiming Huang, Jiyu Guo, Wenxin Mao, Cuiyun Gao, Peiyi Han, Chuanyi Liu, Qing Ling
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.23838
Pdf URL: https://arxiv.org/pdf/2505.23838
Copy Paste: [[2505.23838]] Exploring the Landscape of Text-to-SQL with Large Language Models: Progresses, Challenges and Opportunities(https://arxiv.org/abs/2505.23838)
Keywords: language model, llm
Abstract: Converting natural language (NL) questions into SQL queries, referred to as Text-to-SQL, has emerged as a pivotal technology for facilitating access to relational databases, especially for users without SQL knowledge. Recent progress in large language models (LLMs) has markedly propelled the field of natural language processing (NLP), opening new avenues to improve text-to-SQL systems. This study presents a systematic review of LLM-based text-to-SQL, focusing on four key aspects: (1) an analysis of the research trends in LLM-based text-to-SQL; (2) an in-depth analysis of existing LLM-based text-to-SQL techniques from diverse perspectives; (3) summarization of existing text-to-SQL datasets and evaluation metrics; and (4) discussion on potential obstacles and avenues for future exploration in this domain. This survey seeks to furnish researchers with an in-depth understanding of LLM-based text-to-SQL, sparking new innovations and advancements in this field.

Title: Measuring Sycophancy of Language Models in Multi-turn Dialogues

Authors: Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23840
Pdf URL: https://arxiv.org/pdf/2505.23840
Copy Paste: [[2505.23840]] Measuring Sycophancy of Language Models in Multi-turn Dialogues(https://arxiv.org/abs/2505.23840)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy--conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model's ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user's underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in the debate scenario. We release our code and data at this https URL.
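
Illustration (not from the paper): one plausible way to compute the two metrics named above from a per-turn record of whether the model still holds its original stance. The function names, the 1-indexed turn convention, and the choice to return None when no flip occurs are assumptions made for this sketch; the abstract does not give the exact definitions.

```python
def turn_of_flip(stances):
    """Return the 1-indexed turn at which the model first abandons its
    original stance, or None if it never does.

    `stances[i]` is True when the model still holds its original stance
    in its reply at turn i + 1 (an assumed encoding).
    """
    for turn, holds in enumerate(stances, start=1):
        if not holds:
            return turn
    return None


def number_of_flips(stances):
    """Count how many times the stance changes between consecutive turns."""
    return sum(1 for prev, cur in zip(stances, stances[1:]) if prev != cur)


# Hypothetical five-turn dialogue: holds, holds, caves, recovers, caves again.
dialogue = [True, True, False, True, False]
tof = turn_of_flip(dialogue)
nof = number_of_flips(dialogue)
```

Under these conventions a later Turn of Flip and a lower Number of Flip both indicate more resistance to user pressure.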

Title: Document Valuation in LLM Summaries: A Cluster Shapley Approach

Authors: Zikun Ye, Hema Yoganarasimhan
Subjects: cs.CL, econ.GN
Abstract URL: https://arxiv.org/abs/2505.23842
Pdf URL: https://arxiv.org/pdf/2505.23842
Copy Paste: [[2505.23842]] Document Valuation in LLM Summaries: A Cluster Shapley Approach(https://arxiv.org/abs/2505.23842)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used in systems that retrieve and summarize content from multiple sources, such as search engines and AI assistants. While these models enhance user experience by generating coherent summaries, they obscure the contributions of original content creators, raising concerns about credit attribution and compensation. We address the challenge of valuing individual documents used in LLM-generated summaries. We propose using Shapley values, a game-theoretic method that allocates credit based on each document's marginal contribution. Although theoretically appealing, Shapley values are expensive to compute at scale. We therefore propose Cluster Shapley, an efficient approximation algorithm that leverages semantic similarity between documents. By clustering documents using LLM-based embeddings and computing Shapley values at the cluster level, our method significantly reduces computation while maintaining attribution quality. We demonstrate our approach on a summarization task using Amazon product reviews. Cluster Shapley significantly reduces computational complexity while maintaining high accuracy, outperforming baseline methods such as Monte Carlo sampling and Kernel SHAP with a better efficient frontier. Our approach is agnostic to the exact LLM used, the summarization process used, and the evaluation procedure, which makes it broadly applicable to a variety of summarization settings.
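
Illustration (not from the paper): a toy sketch of the idea of computing Shapley values over clusters instead of individual documents. Everything concrete here is an assumption: the clusters are written by hand instead of being derived from embeddings, the value function counts distinct topics covered instead of scoring an LLM summary, and the equal split of a cluster's credit among its documents is one simple choice the abstract does not confirm.

```python
from itertools import permutations
from math import factorial


def shapley(players, value):
    """Exact Shapley values: each player's marginal contribution,
    averaged over every ordering of the players."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


# Hypothetical documents and the topics each one covers.
topics = {
    "d1": {"battery"},
    "d2": {"battery"},
    "d3": {"screen"},
    "d4": {"price", "screen"},
}
# Hand-made clusters standing in for embedding-based clustering.
clusters = {"c_batt": ["d1", "d2"], "c_other": ["d3", "d4"]}


def doc_value(docs):
    """Toy value of a document set: number of distinct topics covered."""
    covered = set()
    for d in docs:
        covered |= topics[d]
    return float(len(covered))


def cluster_value(cluster_ids):
    """Value of a set of clusters: value of all documents they contain."""
    return doc_value([d for c in cluster_ids for d in clusters[c]])


# Shapley over 2 clusters needs 2! orderings instead of 4! over documents.
cluster_phi = shapley(list(clusters), cluster_value)
# Assumed rule: split each cluster's credit equally among its members.
approx_doc_phi = {
    d: cluster_phi[c] / len(members)
    for c, members in clusters.items()
    for d in members
}
exact_doc_phi = shapley(list(topics), doc_value)
```

In this toy case the cluster-level result preserves the total credit but smooths differences within a cluster (d3 and d4 receive equal shares, whereas exact Shapley gives d4 more), which is the accuracy-for-cost trade-off the abstract refers to.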

Title: Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks

Authors: Wenhan Dong, Tianyi Hu, Jingyi Zheng, Zhen Sun, Yuemeng Zhao, Yule Liu, Xinlei He, Xinyi Huang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23843
Pdf URL: https://arxiv.org/pdf/2505.23843
Copy Paste: [[2505.23843]] Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks(https://arxiv.org/abs/2505.23843)
Keywords: language model, llm, hallucination
Abstract: Multi-round incomplete information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Currently, research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals novel insights into the limitations of existing methods, as they often yield misleading results that fail to uncover key issues, such as shortcut-taking behaviors, rigid patterns, and premature task termination. These issues obscure the true reasoning capabilities of LLMs and undermine the reliability of evaluations. To address these limitations, we propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.

Title: Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation

Authors: Zhenglun Kong, Zheng Zhan, Shiyue Hou, Yifan Gong, Xin Meng, Pengwei Sui, Peiyan Dong, Xuan Shen, Zifeng Wang, Pu Zhao, Hao Tang, Stratis Ioannidis, Yanzhi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23844
Pdf URL: https://arxiv.org/pdf/2505.23844
Copy Paste: [[2505.23844]] Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation(https://arxiv.org/abs/2505.23844)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning, particularly when integrating capabilities from other specialized LLMs. Popular methods like ensemble and weight merging require substantial memory and struggle to adapt to changing data environments. Recent efforts have transferred knowledge from multiple LLMs into a single target model; however, they suffer from interference and degraded performance among tasks, largely due to limited flexibility in candidate selection and training pipelines. To address these issues, we propose a framework that adaptively selects and aggregates knowledge from diverse LLMs to build a single, stronger model, avoiding the high memory overhead of ensemble and inflexible weight merging. Specifically, we design an adaptive selection network that identifies the most relevant source LLMs based on their scores, thereby reducing knowledge interference. We further propose a dynamic weighted fusion strategy that accounts for the inherent strengths of candidate LLMs, along with a feedback-driven loss function that prevents the selector from converging on a single subset of sources. Experimental results demonstrate that our method can enable a more stable and scalable knowledge aggregation process while reducing knowledge interference by up to 50% compared to existing approaches. Code is available at this https URL.

Title: Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs

Authors: Jakub Podolak, Rajeev Verma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23845
Pdf URL: https://arxiv.org/pdf/2505.23845
Copy Paste: [[2505.23845]] Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs(https://arxiv.org/abs/2505.23845)
Keywords: llm, chain-of-thought
Abstract: We study the source of uncertainty in DeepSeek R1-32B by analyzing its self-reported verbal confidence on question answering (QA) tasks. In the default answer-then-confidence setting, the model is regularly over-confident, whereas semantic entropy - obtained by sampling many responses - remains reliable. We hypothesize that this is because of semantic entropy's larger test-time compute, which lets us explore the model's predictive distribution. We show that granting DeepSeek the budget to explore its distribution by forcing a long chain-of-thought before the final answer greatly improves its verbal score effectiveness, even on simple fact-retrieval questions that normally require no reasoning. Furthermore, a separate reader model that sees only the chain can reconstruct very similar confidences, indicating the verbal score might be merely a statistic of the alternatives surfaced during reasoning. Our analysis concludes that reliable uncertainty estimation requires explicit exploration of the generative space, and self-reported confidence is trustworthy only after such exploration.
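
Illustration (not from the paper): a minimal sketch of a sampling-based semantic entropy estimate, as contrasted with verbal confidence above. Sampled answers are grouped into meaning clusters and the entropy of the cluster frequencies is reported. Grouping by lower-cased, stripped strings is a stand-in assumed for this sketch; published semantic-entropy methods typically decide equivalence with an entailment model. The function name and sample answers are hypothetical.

```python
from collections import Counter
from math import log


def semantic_entropy(answers, canonical=lambda a: a.strip().lower()):
    """Entropy (in nats) of the distribution over meaning clusters.

    `canonical` maps an answer to its cluster key; string normalization
    is an assumed stand-in for a real semantic-equivalence check.
    """
    counts = Counter(canonical(a) for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())


# Hypothetical samples for one question.
agreeing = ["Paris", "paris ", " PARIS"]          # one cluster
scattered = ["Paris", "Lyon", "Paris", "Nice"]    # three clusters
low = semantic_entropy(agreeing)
high = semantic_entropy(scattered)
```

Low entropy means the samples agree on one meaning; higher entropy means the model's answers spread over several alternatives.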

Title: Scalable, Symbiotic, AI and Non-AI Agent Based Parallel Discrete Event Simulations

Authors: Atanu Barai, Stephan Eidenbenz, Nandakishore Santhi
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2505.23846
Pdf URL: https://arxiv.org/pdf/2505.23846
Copy Paste: [[2505.23846]] Scalable, Symbiotic, AI and Non-AI Agent Based Parallel Discrete Event Simulations(https://arxiv.org/abs/2505.23846)
Keywords: agent
Abstract: To fully leverage the potential of artificial intelligence (AI) systems in a trustworthy manner, it is desirable to couple multiple AI and non-AI systems together seamlessly for constraining and ensuring correctness of the output. This paper introduces a novel parallel discrete event simulation (PDES) based methodology to combine multiple AI and non-AI agents in a causal, rule-based way. Our approach tightly integrates the concept of passage of time, with each agent considered as an entity in the PDES framework and responding to prior requests from other agents. Such a coupling mechanism enables the agents to work in a co-operative environment towards a common goal while many tasks run in parallel throughout the simulation. It further enables setting up boundaries to the outputs of the AI agents by applying necessary dynamic constraints using non-AI agents while allowing for scalability through deployment of hundreds of such agents in a larger compute cluster. Distributing smaller AI agents can enable extremely scalable simulations in the future, addressing local memory bottlenecks for model parameter storage. Within a PDES involving both AI and non-AI agents, we break down the problem at hand into structured steps, when necessary, providing a set of multiple choices to the AI agents, and then progressively solve these steps towards a final goal. At each step, the non-AI agents act as unbiased auditors, verifying each action by the AI agents so that certain rules of engagement are followed. We evaluate our approach by solving four problems from four different domains and comparing the results with those from AI models alone. Our results show greater accuracy in solving problems from various domains where the AI models struggle to solve the problems solely by themselves. Results show that the overall accuracy of our approach is 68%, whereas the accuracy of vanilla models is less than 23%.

Title: Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language Models

Authors: Harvey Dam, Jonas Knochelmann, Vinu Joseph, Ganesh Gopalakrishnan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23848
Pdf URL: https://arxiv.org/pdf/2505.23848
Copy Paste: [[2505.23848]] Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language Models(https://arxiv.org/abs/2505.23848)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: We introduce a method to reduce refusal rates of large language models (LLMs) on sensitive content without modifying model weights or prompts. Motivated by the observation that refusals in certain models were often preceded by the specific token sequence of a token marking the beginning of the chain-of-thought (CoT) block (<think>) followed by a double newline token (\n\n), we investigate the impact of two simple formatting adjustments during generation: suppressing \n\n after <think> and suppressing the end-of-sequence token after the end of the CoT block (</think>). Our method requires no datasets, parameter changes, or training, relying solely on modifying token probabilities during generation. In our experiments with official DeepSeek-R1 distillations, these interventions increased the proportion of substantive answers to sensitive prompts without affecting performance on standard benchmarks. Our findings suggest that refusal behaviors can be circumvented by blocking refusal subspaces at specific points in the generation process.
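
Illustration (not from the paper): a library-free sketch of context-dependent logit suppression, the general mechanism described above. When the most recent token is a trigger, one specific next token has its logit set to negative infinity so it cannot be chosen. The token ids, the five-entry vocabulary, and the function names are all hypothetical; a real implementation would hook into the decoding loop of an inference framework.

```python
NEG_INF = float("-inf")

# Hypothetical token ids for a toy five-token vocabulary.
COT_OPEN, DOUBLE_NEWLINE, COT_CLOSE, EOS, WORD = 0, 1, 2, 3, 4

# Assumed rule table: trigger token -> token forbidden immediately after it.
RULES = {COT_OPEN: DOUBLE_NEWLINE, COT_CLOSE: EOS}


def suppress(logits, generated, rules=RULES):
    """Return a copy of `logits` with the forbidden token masked out
    when the last generated token is a trigger; otherwise unchanged."""
    out = list(logits)
    if generated and generated[-1] in rules:
        out[rules[generated[-1]]] = NEG_INF
    return out


def greedy_next(logits):
    """Pick the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Right after the CoT-opening token the toy model prefers the double newline.
raw = [0.1, 2.0, 0.0, 0.5, 1.0]
unsuppressed_choice = greedy_next(raw)
suppressed_choice = greedy_next(suppress(raw, [COT_OPEN]))
```

Only the sampling distribution at two specific positions is altered; no weights or prompts change, which matches the intervention style the abstract describes.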

Title: ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

Authors: Michael Shalyt, Rotem Elimelech, Ido Kaminer
Subjects: cs.CL, cs.AI, cs.SC
Abstract URL: https://arxiv.org/abs/2505.23851
Pdf URL: https://arxiv.org/pdf/2505.23851
Copy Paste: [[2505.23851]] ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark(https://arxiv.org/abs/2505.23851)
Keywords: language model, llm
Abstract: Large language models (LLMs) are rapidly approaching the level of proficiency in university-level symbolic mathematics required for applications in advanced science and technology. However, existing benchmarks fall short in assessing the core skills of LLMs in symbolic mathematics, such as integration, differential equations, and algebraic simplification. To address this gap, we introduce ASyMOB, a novel assessment framework focused exclusively on symbolic manipulation, featuring 17,092 unique math challenges, organized by similarity and complexity. ASyMOB enables analysis of LLM generalization capabilities by comparing performance in problems that differ by simple numerical or symbolic "perturbations". Evaluated LLMs exhibit substantial degradation in performance for all perturbation types (up to -70.3%), suggesting reliance on memorized patterns rather than deeper understanding of symbolic math, even among models achieving high baseline accuracy. Comparing LLM performance to computer algebra systems, we identify examples where they fail while LLMs succeed, as well as problems solved only by combining both approaches. Models capable of integrated code execution yielded higher accuracy compared to their performance without code, particularly stabilizing weaker models (up to +33.1% for certain perturbation types). Notably, the most advanced models (o4-mini, Gemini 2.5 Flash) demonstrate not only high symbolic math proficiency (scoring 96.8% and 97.6% on the unperturbed set), but also remarkable robustness against perturbations (-21.7% and -21.2% vs. average -50.4% for the other models). This may indicate a recent "phase transition" in the generalization capabilities of frontier LLMs. It remains to be seen whether the path forward lies in deeper integration with sophisticated external tools, or in developing models so capable that symbolic math systems like CAS become unnecessary.

Title: Large Language Model-Based Agents for Automated Research Reproducibility: An Exploratory Study in Alzheimer's Disease

Authors: Nic Dobbins, Christelle Xiong, Kristine Lan, Meliha Yetisgen
Subjects: cs.CL, cs.AI, cs.MA, stat.AP
Abstract URL: https://arxiv.org/abs/2505.23852
Pdf URL: https://arxiv.org/pdf/2505.23852
Copy Paste: [[2505.23852]] Large Language Model-Based Agents for Automated Research Reproducibility: An Exploratory Study in Alzheimer's Disease(https://arxiv.org/abs/2505.23852)
Keywords: language model, gpt, llm, agent
Abstract: Objective: To demonstrate the capabilities of Large Language Models (LLMs) as autonomous agents to reproduce findings of published research studies using the same or similar dataset. Materials and Methods: We used the "Quick Access" dataset of the National Alzheimer's Coordinating Center (NACC). We identified highly cited published research manuscripts using NACC data and selected five studies that appeared reproducible using this dataset alone. Using GPT-4o, we created a simulated research team of LLM-based autonomous agents tasked with writing and executing code to dynamically reproduce the findings of each study, given only study Abstracts, Methods sections, and data dictionary descriptions of the dataset. Results: We extracted 35 key findings described in the Abstracts across 5 Alzheimer's studies. On average, LLM agents approximately reproduced 53.2% of findings per study. Numeric values and range-based findings often differed between studies and agents. The agents also applied statistical methods or parameters that varied from the originals, though overall trends and significance were sometimes similar. Discussion: In some cases, LLM-based agents replicated research techniques and findings. In others, they failed due to implementation flaws or missing methodological detail. These discrepancies show the current limits of LLMs in fully automating reproducibility assessments. Still, this early investigation highlights the potential of structured agent-based systems to provide scalable evaluation of scientific rigor. Conclusion: This exploratory work illustrates both the promise and limitations of LLMs as autonomous agents for automating reproducibility in biomedical research.

Title: Revisiting Uncertainty Estimation and Calibration of Large Language Models

Authors: Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Philip Torr, Chang Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23854
Pdf URL: https://arxiv.org/pdf/2505.23854
Copy Paste: [[2505.23854]] Revisiting Uncertainty Estimation and Calibration of Large Language Models(https://arxiv.org/abs/2505.23854)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly deployed in high-stakes applications, robust uncertainty estimation is essential for ensuring the safe and trustworthy deployment of LLMs. We present the most comprehensive study to date of uncertainty estimation in LLMs, evaluating 80 models spanning open- and closed-source families, dense and Mixture-of-Experts (MoE) architectures, reasoning and non-reasoning modes, quantization variants and parameter scales from 0.6B to 671B. Focusing on three representative black-box single-pass methods, including token probability-based uncertainty (TPU), numerical verbal uncertainty (NVU), and linguistic verbal uncertainty (LVU), we systematically evaluate uncertainty calibration and selective classification using the challenging MMLU-Pro benchmark, which covers both reasoning-intensive and knowledge-based tasks. Our results show that LVU consistently outperforms TPU and NVU, offering stronger calibration and discrimination while being more interpretable. We also find that high accuracy does not imply reliable uncertainty, and that model scale, post-training, reasoning ability and quantization all influence estimation performance. Notably, LLMs exhibit better uncertainty estimates on reasoning tasks than on knowledge-heavy ones, and good calibration does not necessarily translate to effective error ranking. These findings highlight the need for multi-perspective evaluation and position LVU as a practical tool for improving the reliability of LLMs in real-world settings.
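
Illustration (not from the paper): calibration asks whether stated confidence matches observed accuracy. A minimal sketch of expected calibration error (ECE), one widely used calibration metric, follows. The abstract does not say which calibration metrics the study reports, so the choice of ECE, the equal-width binning, and the sample data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy,
    computed over equal-width confidence bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical model outputs: stated confidence and whether the answer was right.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
```

A well-calibrated model has an ECE near zero. As the abstract notes, low calibration error does not by itself guarantee that confidence ranks errors well, which is what selective classification measures.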

Title: OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

Authors: Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23856
Pdf URL: https://arxiv.org/pdf/2505.23856
Copy Paste: [[2505.23856]] OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities(https://arxiv.org/abs/2505.23856)
Keywords: language model, llm, prompt
Abstract: The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (approximately 120× faster than the next fastest baseline). Code and data are available at: this https URL.

Title: Infi-Med: Low-Resource Medical MLLMs with Robust Reasoning Evaluation

Authors: Zeyu Liu, Zhitian Hou, Yining Di, Kejing Yang, Zhijie Sang, Congkai Xie, Jingwen Yang, Siyuan Liu, Jialu Wang, Chunming Li, Ming Li, Hongxia Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23867
Pdf URL: https://arxiv.org/pdf/2505.23867
Copy Paste: [[2505.23867]] Infi-Med: Low-Resource Medical MLLMs with Robust Reasoning Evaluation(https://arxiv.org/abs/2505.23867)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) have demonstrated promising prospects in healthcare, particularly for addressing complex medical tasks, supporting multidisciplinary treatment (MDT), and enabling personalized precision medicine. However, their practical deployment faces critical challenges in resource efficiency, diagnostic accuracy, clinical considerations, and ethical privacy. To address these limitations, we propose Infi-Med, a comprehensive framework for medical MLLMs that introduces three key innovations: (1) a resource-efficient approach through curating and constructing high-quality supervised fine-tuning (SFT) datasets with minimal sample requirements, with a forward-looking design that extends to both pretraining and posttraining phases; (2) enhanced multimodal reasoning capabilities for cross-modal integration and clinical task understanding; and (3) a systematic evaluation system that assesses model performance across medical modalities and task types. Our experiments demonstrate that Infi-Med achieves state-of-the-art (SOTA) performance in general medical reasoning while maintaining rapid adaptability to clinical scenarios. The framework establishes a solid foundation for deploying MLLMs in real-world healthcare settings by balancing model effectiveness with operational constraints.

Title: One Task Vector is not Enough: A Large-Scale Study for In-Context Learning

Authors: Pavel Tikhonov, Ivan Oseledets, Elena Tutubalina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23911
Pdf URL: https://arxiv.org/pdf/2505.23911
Copy Paste: [[2505.23911]] One Task Vector is not Enough: A Large-Scale Study for In-Context Learning(https://arxiv.org/abs/2505.23911)
Keywords: language model, llm
Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to adapt to new tasks using few examples, with task vectors - specific hidden state activations - hypothesized to encode task information. Existing studies are limited by small-scale benchmarks, restricting comprehensive analysis. We introduce QuiteAFew, a novel dataset of 3,096 diverse few-shot tasks, each with 30 input-output pairs derived from the Alpaca dataset. Experiments with Llama-3-8B on QuiteAFew reveal: (1) task vector performance peaks at an intermediate layer (e.g., 15th), (2) effectiveness varies significantly by task type, and (3) complex tasks rely on multiple, subtask-specific vectors rather than a single vector, suggesting distributed task knowledge representation.

Title: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation

Authors: Caiqi Zhang, Xiaochen Zhu, Chengzu Li, Nigel Collier, Andreas Vlachos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23912
Pdf URL: https://arxiv.org/pdf/2505.23912
Copy Paste: [[2505.23912]] Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation(https://arxiv.org/abs/2505.23912)
Keywords: language model, llm, hallucination
Abstract: Hallucination remains a major challenge for the safe and trustworthy deployment of large language models (LLMs) in factual content generation. Prior work has explored confidence estimation as an effective approach to hallucination detection, but often relies on post-hoc self-consistency methods that require computationally expensive sampling. Verbalized confidence offers a more efficient alternative, but existing approaches are largely limited to short-form question answering (QA) tasks and do not generalize well to open-ended generation. In this paper, we propose LoVeC (Long-form Verbalized Confidence), an on-the-fly verbalized confidence estimation method for long-form generation. Specifically, we use reinforcement learning (RL) to train LLMs to append numerical confidence scores to each generated statement, serving as a direct and interpretable signal of the factuality of generation. Our experiments consider both on-policy and off-policy RL methods, including DPO, ORPO, and GRPO, to enhance the model calibration. We introduce two novel evaluation settings, free-form tagging and iterative tagging, to assess different verbalized confidence estimation methods. Experiments on three long-form QA datasets show that our RL-trained models achieve better calibration and generalize robustly across domains. Also, our method is highly efficient, as it only requires adding a few tokens to the output being decoded.
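The LoVeC abstract above reports "better calibration" for per-statement confidence scores without naming a metric. As a minimal sketch, one common way to quantify calibration is expected calibration error (ECE); the choice of ECE, the function name, the binning scheme, and the sample scores below are all assumptions for illustration and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins.

    confidences: floats in [0, 1], one per generated statement (hypothetical data).
    correct:     booleans, True if the statement was judged factual.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each statement to a bin by its confidence score.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    # Weighted average of |mean confidence - accuracy| across non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical scores: an overconfident model versus a well-calibrated one.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
calibrated = expected_calibration_error([1.0, 1.0, 0.0, 0.0], [True, True, False, False])
```

A lower value means the stated confidence tracks factual accuracy more closely.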

Title: Probing Association Biases in LLM Moderation Over-Sensitivity

Authors: Yuxin Wang, Botao Yu, Ivory Yang, Saeed Hassanpour, Soroush Vosoughi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23914
Pdf URL: https://arxiv.org/pdf/2505.23914
Copy Paste: [[2505.23914]] Probing Association Biases in LLM Moderation Over-Sensitivity(https://arxiv.org/abs/2505.23914)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models are widely used for content moderation but often misclassify benign comments as toxic, leading to over-sensitivity. While previous research attributes this issue primarily to the presence of offensive terms, we reveal a potential cause beyond token level: LLMs exhibit systematic topic biases in their implicit associations. Inspired by cognitive psychology's implicit association tests, we introduce Topic Association Analysis, a semantic-level approach to quantify how LLMs associate certain topics with toxicity. By prompting LLMs to generate free-form scenario imagination for misclassified benign comments and analyzing their topic amplification levels, we find that more advanced models (e.g., GPT-4 Turbo) demonstrate stronger topic stereotype despite lower overall false positive rates. These biases suggest that LLMs do not merely react to explicit, offensive language but rely on learned topic associations, shaping their moderation decisions. Our findings highlight the need for refinement beyond keyword-based filtering, providing insights into the underlying mechanisms driving LLM over-sensitivity.

Title: ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents

Authors: Feiteng Fang, Ting-En Lin, Yuchuan Wu, Xiong Liu, Xiang Huang, Dingwei Chen, Jing Ye, Haonan Zhang, Liang Zhu, Hamid Alinejad-Rokny, Min Yang, Fei Huang, Yongbin Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23923
Pdf URL: https://arxiv.org/pdf/2505.23923
Copy Paste: [[2505.23923]] ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents(https://arxiv.org/abs/2505.23923)
Keywords: agent
Abstract: Role-Playing Language Agents (RPLAs) aim to simulate characters for realistic and engaging human-computer interactions. However, traditional reward models often struggle with scalability and adapting to subjective conversational preferences. We propose ChARM, a Character-based Act-adaptive Reward Model, addressing these challenges through two innovations: (1) an act-adaptive margin that significantly enhances learning efficiency and generalizability, and (2) a self-evolution mechanism leveraging large-scale unlabeled data to improve training coverage. Additionally, we introduce RoleplayPref, the first large-scale preference dataset specifically for RPLAs, featuring 1,108 characters, 13 subcategories, and 16,888 bilingual dialogues, alongside RoleplayEval, a dedicated evaluation benchmark. Experimental results show a 13% improvement over the conventional Bradley-Terry model in preference rankings. Furthermore, applying ChARM-generated rewards to preference learning techniques (e.g., direct preference optimization) achieves state-of-the-art results on CharacterEval and RoleplayEval. Code and dataset are available at this https URL.

Title: SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving

Authors: Wendong Xu, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang, Hui Shen, Zhongwei Wan, Jianbo Dai, Taiqiang Wu, He Xiao, Chaofan Tao, Z. Morley Mao, Ying Sheng, Zhijiang Guo, Hongxia Yang, Bei Yu, Lingpeng Kong, Quanquan Gu, Ngai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23932
Pdf URL: https://arxiv.org/pdf/2505.23932
Copy Paste: [[2505.23932]] SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving(https://arxiv.org/abs/2505.23932)
Keywords: language model, gpt, llm
Abstract: We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. SwingArena presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings. More details are available on our project page: this http URL

Title: Retrieval Augmented Generation based Large Language Models for Causality Mining

Authors: Thushara Manjari Naduvilakandy, Hyeju Jang, Mohammad Al Hasan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23944
Pdf URL: https://arxiv.org/pdf/2505.23944
Copy Paste: [[2505.23944]] Retrieval Augmented Generation based Large Language Models for Causality Mining(https://arxiv.org/abs/2505.23944)
Keywords: language model, llm, prompt, retrieval augmented generation, retrieval-augmented generation
Abstract: Causality detection and mining are important tasks in information retrieval because of their extensive use in information extraction and knowledge graph construction. The existing literature offers several solutions to these tasks, both unsupervised and supervised. However, unsupervised methods perform poorly and often require significant human intervention for causal rule selection, leading to poor generalization across domains. Supervised methods, on the other hand, suffer from the lack of large training datasets. Recently, large language models (LLMs) with effective prompt engineering have been found to overcome the unavailability of large training datasets. Yet the existing literature lacks comprehensive work on causality detection and mining using LLM prompting. In this paper, we present several retrieval-augmented generation (RAG) based dynamic prompting schemes to enhance LLM performance in causality detection and extraction tasks. Extensive experiments over three datasets and five LLMs validate the superiority of our proposed RAG-based dynamic prompting over other static prompting schemes.

Title: A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models

Authors: Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23945
Pdf URL: https://arxiv.org/pdf/2505.23945
Copy Paste: [[2505.23945]] A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models(https://arxiv.org/abs/2505.23945)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-thought (CoT) reasoning enhances performance of large language models, but questions remain about whether these reasoning traces faithfully reflect the internal processes of the model. We present the first comprehensive study of CoT faithfulness in large vision-language models (LVLMs), investigating how both text-based and previously unexplored image-based biases affect reasoning and bias articulation. Our work introduces a novel, fine-grained evaluation pipeline for categorizing bias articulation patterns, enabling significantly more precise analysis of CoT reasoning than previous methods. This framework reveals critical distinctions in how models process and respond to different types of biases, providing new insights into LVLM CoT faithfulness. Our findings reveal that subtle image-based biases are rarely articulated compared to explicit text-based ones, even in models specialized for reasoning. Additionally, many models exhibit a previously unidentified phenomenon we term "inconsistent" reasoning: correctly reasoning before abruptly changing answers, serving as a potential canary for detecting biased reasoning from unfaithful CoTs. We then apply the same evaluation pipeline to revisit CoT faithfulness in LLMs across various levels of implicit cues. Our findings reveal that current language-only reasoning models continue to struggle with articulating cues that are not overtly stated.

Title: FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression

Authors: Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23966
Pdf URL: https://arxiv.org/pdf/2505.23966
Copy Paste: [[2505.23966]] FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression(https://arxiv.org/abs/2505.23966)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation and expensive calibration procedures, and they produce inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and accurate, training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis (PCA), and employ an importance-based metric to adaptively allocate ranks across decoders. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning, and its calibration completes within a few minutes. Evaluated across 4 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.

Title: Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs

Authors: Yinong Oliver Wang, Nivedha Sivakumar, Falaah Arif Khan, Rin Metcalf Susa, Adam Golinski, Natalie Mackraz, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23996
Pdf URL: https://arxiv.org/pdf/2505.23996
Copy Paste: [[2505.23996]] Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs(https://arxiv.org/abs/2505.23996)
Keywords: language model, llm
Abstract: The recent rapid adoption of large language models (LLMs) highlights the critical need for benchmarking their fairness. Conventional fairness metrics, which focus on discrete accuracy-based evaluations (i.e., prediction correctness), fail to capture the implicit impact of model uncertainty (e.g., higher model confidence about one group over another despite similar accuracy). To address this limitation, we propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness that is more reflective of the internal bias in model decisions compared to conventional fairness measures. Furthermore, observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset with 31,756 samples for co-reference resolution, offering a more diverse and suitable dataset for evaluating modern LLMs. We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source LLMs. For example, Mistral-7B exhibits suboptimal fairness due to high confidence in incorrect predictions, a detail overlooked by Equalized Odds but captured by UCerF. Overall, our proposed LLM benchmark, which evaluates fairness with uncertainty awareness, paves the way for developing more transparent and accountable AI systems.
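The UCerF abstract above argues that accuracy-based metrics such as Equalized Odds can miss differences in model confidence between groups. The sketch below illustrates only that gap: it computes a simple Equalized Odds difference for two groups with identical predictions but different confidence. It does not implement UCerF, whose definition the abstract does not give; the function names and the toy data are assumptions.

```python
from statistics import fmean


def tpr_fpr(y_true, y_pred):
    """True-positive and false-positive rates for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def equalized_odds_gap(group_a, group_b):
    """Largest absolute TPR or FPR difference between two (y_true, y_pred) groups."""
    tpr_a, fpr_a = tpr_fpr(*group_a)
    tpr_b, fpr_b = tpr_fpr(*group_b)
    return max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))


# Toy data (hypothetical): both groups get the same predictions on the same labels.
labels = [1, 1, 0, 0]
preds = [1, 0, 0, 1]
eo_gap = equalized_odds_gap((labels, preds), (labels, preds))

# The model is far more confident on group A than on group B.
conf_a = [0.95, 0.90, 0.92, 0.97]
conf_b = [0.55, 0.60, 0.52, 0.58]
confidence_gap = fmean(conf_a) - fmean(conf_b)
```

Here the Equalized Odds gap is zero even though the confidence gap is large, which is the kind of disparity an uncertainty-aware metric is meant to surface.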

Title: Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws

Authors: Hidetaka Kamigaito, Ying Zhang, Jingun Kwon, Katsuhiko Hayashi, Manabu Okumura, Taro Watanabe
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.24009
Pdf URL: https://arxiv.org/pdf/2505.24009
Copy Paste: [[2505.24009]] Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws(https://arxiv.org/abs/2505.24009)
Keywords: language model, llm
Abstract: Transformers deliver outstanding performance across a wide range of tasks and are now a dominant backbone architecture for large language models (LLMs). Their task-solving performance is improved by increasing parameter size, as shown in recent studies on parameter scaling laws. Although recent mechanistic-interpretability studies have deepened our understanding of the internal behavior of Transformers by analyzing their residual stream, the relationship between these internal mechanisms and the parameter scaling laws remains unclear. To bridge this gap, we focus on layers and their size, which mainly determine the parameter size of Transformers. For this purpose, we first theoretically investigate the layers within the residual stream through a bias-diversity decomposition. The decomposition separates (i) bias, the error of each layer's output from the ground truth, and (ii) diversity, which indicates how much the outputs of each layer differ from each other. Analyzing Transformers under this theory reveals that performance improves when individual layers make predictions close to the correct answer and remain mutually diverse. We show that diversity becomes especially critical when individual layers' outputs are far from the ground truth. Finally, we introduce an information-theoretic diversity and show our main finding: adding layers enhances performance only when those layers behave differently, i.e., are diverse. We also show that the performance gains from increasing the number of layers exhibit submodularity: marginal improvements diminish as additional layers are added, mirroring the logarithmic convergence predicted by the parameter scaling laws. Experiments on multiple semantic-understanding tasks with various LLMs empirically confirm the theoretical properties derived in this study.
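The bias-diversity decomposition described in the abstract above can be illustrated with the classic squared-error ambiguity decomposition for an equally weighted average of predictors: combined error equals average individual error minus diversity. This is a minimal sketch under that assumption; the paper's own decomposition over residual-stream layers and its information-theoretic variant may differ, and the function name and numbers are hypothetical.

```python
from statistics import fmean


def ambiguity_decomposition(outputs, target):
    """Split the squared error of an averaged prediction into bias and diversity.

    outputs: scalar predictions from individual components (stand-ins for layers).
    target:  the ground-truth value.
    Returns (combined_error, avg_individual_error, diversity), which satisfy
    combined_error == avg_individual_error - diversity.
    """
    combined = fmean(outputs)
    # "Bias" term: how far each component is from the ground truth, on average.
    avg_individual_error = fmean((o - target) ** 2 for o in outputs)
    # "Diversity" term: how far each component is from the combined prediction.
    diversity = fmean((o - combined) ** 2 for o in outputs)
    combined_error = (combined - target) ** 2
    return combined_error, avg_individual_error, diversity


# Hypothetical component outputs scattered around a target of 1.0.
err, bias, div = ambiguity_decomposition([0.2, 0.9, 1.3], target=1.0)

# Identical components: diversity is zero, so averaging gives no benefit.
same_err, same_bias, same_div = ambiguity_decomposition([0.5, 0.5, 0.5], target=1.0)
```

For a fixed average individual error, higher diversity means lower combined error, which matches the abstract's claim that added layers help only when they behave differently.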

Title: Large Language Model Meets Constraint Propagation

Authors: Alexandre Bonlarron, Florian Régin, Elisabetta De Maria, Jean-Charles Régin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24012
Pdf URL: https://arxiv.org/pdf/2505.24012
Copy Paste: [[2505.24012]] Large Language Model Meets Constraint Propagation(https://arxiv.org/abs/2505.24012)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel at generating fluent text but struggle to enforce external constraints because they generate tokens sequentially without explicit control mechanisms. GenCP addresses this limitation by combining LLM predictions with Constraint Programming (CP) reasoning, formulating text generation as a Constraint Satisfaction Problem (CSP). In this paper, we improve GenCP by integrating Masked Language Models (MLMs) for domain generation, which allows bidirectional constraint propagation that leverages both past and future tokens. This integration bridges the gap between token-level prediction and structured constraint enforcement, leading to more reliable and constraint-aware text generation. Our evaluation on COLLIE benchmarks demonstrates that incorporating domain preview via MLM calls significantly improves GenCP's performance. Although this approach incurs additional MLM calls and, in some cases, increased backtracking, the overall effect is a more efficient use of LLM inferences and an enhanced ability to generate feasible and meaningful solutions, particularly in tasks with strict content constraints.

Title: BeaverTalk: Oregon State University's IWSLT 2025 Simultaneous Speech Translation System

Authors: Matthew Raffel, Victor Agostinelli, Lizhong Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.24016
Pdf URL: https://arxiv.org/pdf/2505.24016
Copy Paste: [[2505.24016]] BeaverTalk: Oregon State University's IWSLT 2025 Simultaneous Speech Translation System(https://arxiv.org/abs/2505.24016)
Keywords: llm, prompt
Abstract: This paper discusses the construction, fine-tuning, and deployment of BeaverTalk, a cascaded system for speech-to-text translation as part of the IWSLT 2025 simultaneous translation task. The system architecture employs a VAD segmenter for breaking a speech stream into segments, Whisper Large V2 for automatic speech recognition (ASR), and Gemma 3 12B for simultaneous translation. The simultaneous translation LLM is fine-tuned via low-rank adaptors (LoRAs) for a conversational prompting strategy that leverages a single prior-sentence memory bank from the source language as context. The cascaded system participated in the English→German and English→Chinese language directions for both the low and high latency regimes. In particular, on the English→German task, the system achieves a BLEU of 24.64 and 27.83 at a StreamLAAL of 1837.86 and 3343.73, respectively. Then, on the English→Chinese task, the system achieves a BLEU of 34.07 and 37.23 at a StreamLAAL of 2216.99 and 3521.35, respectively.

Title: Hidden Persuasion: Detecting Manipulative Narratives on Social Media During the 2022 Russian Invasion of Ukraine

Authors: Kateryna Akhynko, Oleksandr Kosovan, Mykola Trokhymovych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24028
Pdf URL: https://arxiv.org/pdf/2505.24028
Copy Paste: [[2505.24028]] Hidden Persuasion: Detecting Manipulative Narratives on Social Media During the 2022 Russian Invasion of Ukraine(https://arxiv.org/abs/2505.24028)
Keywords: language model
Abstract: This paper presents one of the top-performing solutions to the UNLP 2025 Shared Task on Detecting Manipulation in Social Media. The task focuses on detecting and classifying rhetorical and stylistic manipulation techniques used to influence Ukrainian Telegram users. For the classification subtask, we fine-tuned the Gemma 2 language model with LoRA adapters and applied a second-level classifier leveraging meta-features and threshold optimization. For span detection, we employed an XLM-RoBERTa model trained for multi-target, including token binary classification. Our approach achieved 2nd place in classification and 3rd place in span detection.

Title: MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering

Authors: Yuexing Hao, Kumail Alhamoud, Hyewon Jeong, Haoran Zhang, Isha Puri, Philip Torr, Mike Schaekermann, Ariel D. Stern, Marzyeh Ghassemi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24040
Pdf URL: https://arxiv.org/pdf/2505.24040
Copy Paste: [[2505.24040]] MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering(https://arxiv.org/abs/2505.24040)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the MedPAIR (Medical Dataset Comparing Physicians and AI Relevance Estimation and Question Answering) dataset to evaluate how physician trainees and LLMs prioritize relevant information when answering QA questions. We obtain annotations on 1,300 QA pairs from 36 physician trainees, labeling each sentence within the question components for relevance. We compare these relevance estimates to those for LLMs, and further evaluate the impact of these "relevant" subsets on downstream task performance for both physician trainees and LLMs. We find that LLMs are frequently not aligned with the content relevance estimates of physician trainees. After filtering out physician trainee-labeled irrelevant sentences, accuracy improves for both the trainees and the LLMs. All LLM and physician trainee-labeled data are available at: this http URL.

Title: TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Authors: Jiacheng Xie, Yang Yu, Ziyang Zhang, Shuai Zeng, Jiaxuan He, Ayush Vasireddy, Xiaoting Tang, Congyu Guo, Lening Zhao, Congcong Jing, Guanghui An, Dong Xu
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2505.24063
Pdf URL: https://arxiv.org/pdf/2505.24063
Copy Paste: [[2505.24063]] TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine(https://arxiv.org/abs/2505.24063)
Keywords: language model, llm
Abstract: Traditional Chinese Medicine (TCM), as an effective alternative medicine, has been receiving increasing attention. In recent years, the rapid development of large language models (LLMs) tailored for TCM has underscored the need for an objective and comprehensive evaluation framework to assess their performance on real-world tasks. However, existing evaluation datasets are limited in scope and primarily text-based, lacking a unified and standardized multimodal question-answering (QA) benchmark. To address this issue, we introduce TCM-Ladder, the first multimodal QA dataset specifically designed for evaluating large TCM language models. The dataset spans multiple core disciplines of TCM, including fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics. In addition to textual content, TCM-Ladder incorporates various modalities such as images and videos. The datasets were constructed using a combination of automated and manual filtering processes and comprise 52,000+ questions in total. These questions include single-choice, multiple-choice, fill-in-the-blank, diagnostic dialogue, and visual comprehension tasks. We trained a reasoning model on TCM-Ladder and conducted comparative experiments against 9 state-of-the-art general domain and 5 leading TCM-specific LLMs to evaluate their performance on the datasets. Moreover, we propose Ladder-Score, an evaluation method specifically designed for TCM question answering that effectively assesses answer quality regarding terminology usage and semantic expression. To our knowledge, this is the first work to evaluate mainstream general domain and TCM-specific LLMs on a unified multimodal benchmark. The datasets and leaderboard are publicly available at this https URL or this https URL and will be continuously updated.

Title: HardTests: Synthesizing High-Quality Test Cases for LLM Coding

Authors: Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24098
Pdf URL: https://arxiv.org/pdf/2505.24098
Copy Paste: [[2505.24098]] HardTests: Synthesizing High-Quality Test Cases for LLM Coding(https://arxiv.org/abs/2505.24098)
Keywords: language model, llm
Abstract: Verifiers play a crucial role in large language model (LLM) reasoning and are needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution may be detected only by carefully written, human-authored edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate a comprehensive competitive programming dataset HARDTESTS with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at this https URL.
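
The precision and recall figures in the HardTests abstract treat a test suite as a binary classifier over candidate solutions. The following is a minimal sketch of that bookkeeping, assuming that precision is the share of test-passing solutions that are truly correct and recall is the share of truly correct solutions that pass. The function name and the sample verdicts are hypothetical and are not taken from the paper.

```python
def verifier_precision_recall(passed, correct):
    """Score a test suite as a classifier of solution correctness.

    passed[i]  -- True if solution i passed every test in the suite
    correct[i] -- True if solution i is actually correct (ground truth)
    """
    pairs = list(zip(passed, correct))
    true_pos = sum(1 for p, c in pairs if p and c)         # accepted and correct
    false_pos = sum(1 for p, c in pairs if p and not c)    # accepted but wrong
    false_neg = sum(1 for p, c in pairs if not p and c)    # rejected but correct
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical verdicts for five candidate solutions.
passed = [True, True, True, False, False]
correct = [True, True, False, True, False]
precision, recall = verifier_precision_recall(passed, correct)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A weak test suite mostly hurts precision (wrong solutions slip through), while an over-strict or buggy one hurts recall (correct solutions get rejected).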

Title: Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning

Authors: Jiacheng Lin, Zhenbang Wu, Jimeng Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24105
Pdf URL: https://arxiv.org/pdf/2505.24105
Copy Paste: [[2505.24105]] Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning(https://arxiv.org/abs/2505.24105)
Keywords: language model, llm
Abstract: We present EHRMIND, a practical recipe for adapting large language models (LLMs) to complex clinical reasoning tasks using reinforcement learning with verifiable rewards (RLVR). While RLVR has succeeded in mathematics and coding, its application to healthcare contexts presents unique challenges due to the specialized knowledge and reasoning required for electronic health record (EHR) interpretation. Our pilot study on the MEDCALC benchmark reveals two key failure modes: (1) misapplied knowledge, where models possess relevant medical knowledge but apply it incorrectly, and (2) missing knowledge, where models lack essential domain knowledge. To address these cases, EHRMIND applies a two-stage solution: a lightweight supervised fine-tuning (SFT) warm-up that injects missing domain knowledge, stabilizes subsequent training, and encourages structured, interpretable outputs; followed by RLVR, which reinforces outcome correctness and refines the model's decision-making. We demonstrate the effectiveness of our method across diverse clinical applications, including medical calculations (MEDCALC), patient-trial matching (TREC CLINICAL TRIALS), and disease diagnosis (EHRSHOT). EHRMIND delivers consistent gains in accuracy, interpretability, and cross-task generalization. These findings offer practical guidance for applying RLVR to enhance LLM capabilities in healthcare settings.

Title: The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It

Authors: Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, Julia Kreutzer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24119
Pdf URL: https://arxiv.org/pdf/2505.24119
Copy Paste: [[2505.24119]] The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It(https://arxiv.org/abs/2505.24119)
Keywords: llm
Abstract: This paper presents a comprehensive analysis of the linguistic diversity of LLM safety research, highlighting the English-centric nature of the field. Through a systematic review of nearly 300 publications from 2020-2024 across major NLP conferences and workshops at *ACL, we identify a significant and growing language gap in LLM safety research, with even high-resource non-English languages receiving minimal attention. We further observe that non-English languages are rarely studied as a standalone language and that English safety research exhibits poor language documentation practice. To motivate future research into multilingual safety, we make several recommendations based on our survey, and we then pose three concrete future directions on safety evaluation, training data generation, and crosslingual safety generalization. Based on our survey and proposed directions, the field can develop more robust, inclusive AI safety practices for diverse global populations.

Title: R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration

Authors: Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24133
Pdf URL: https://arxiv.org/pdf/2505.24133
Copy Paste: [[2505.24133]] R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration(https://arxiv.org/abs/2505.24133)
Keywords: chain-of-thought
Abstract: Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and a 6.6X throughput over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.

Title: CrossICL: Cross-Task In-Context Learning via Unsupervised Demonstration Transfer

Authors: Jinglong Gao, Xiao Ding, Lingxiao Zou, Bing Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24143
Pdf URL: https://arxiv.org/pdf/2505.24143
Copy Paste: [[2505.24143]] CrossICL: Cross-Task In-Context Learning via Unsupervised Demonstration Transfer(https://arxiv.org/abs/2505.24143)
Keywords: language model, gpt, llm
Abstract: In-Context Learning (ICL) enhances the performance of large language models (LLMs) with demonstrations. However, obtaining these demonstrations primarily relies on manual effort. In most real-world scenarios, users are often unwilling or unable to provide such demonstrations. Inspired by the human analogy, we explore a new ICL paradigm CrossICL to study how to utilize existing source task demonstrations in the ICL for target tasks, thereby obtaining reliable guidance without any additional manual effort. To explore this, we first design a two-stage alignment strategy to mitigate the interference caused by gaps across tasks, as the foundation for our experimental exploration. Based on it, we conduct comprehensive exploration of CrossICL, with 875 NLP tasks from the Super-NI benchmark and six types of LLMs, including GPT-4o. Experimental results demonstrate the effectiveness of CrossICL and provide valuable insights on questions like the criteria for selecting cross-task demonstrations, as well as the types of task-gap-induced interference in CrossICL.

Title: Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability

Authors: Chiwei Zhu, Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Zhendong Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24147
Pdf URL: https://arxiv.org/pdf/2505.24147
Copy Paste: [[2505.24147]] Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability(https://arxiv.org/abs/2505.24147)
Keywords: language model
Abstract: Training language models with rationale augmentation has been shown to be beneficial in many existing works. In this paper, we find that this prevailing view does not hold consistently. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance as well as on model reliability, a novel perspective. The results lead to several key findings that add new insights to existing understanding: 1) Rationales can, at times, deteriorate model performance; 2) Rationales can, at times, improve model reliability, even outperforming their untrained counterparts; 3) A linear correspondence exists between the performance and reliability improvements, while both are driven by the intrinsic difficulty of the task. These findings provide informative guidance on the broad use of rationales and raise critical implications for the procedure of explicitly aligning language models with implicit human thoughts. Code can be found at this https URL.

Title: LKD-KGC: Domain-Specific KG Construction via LLM-driven Knowledge Dependency Parsing

Authors: Jiaqi Sun, Shiyou Qian, Zhangchi Han, Wei Li, Zelin Qian, Dingyu Yang, Jian Cao, Guangtao Xue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24163
Pdf URL: https://arxiv.org/pdf/2505.24163
Copy Paste: [[2505.24163]] LKD-KGC: Domain-Specific KG Construction via LLM-driven Knowledge Dependency Parsing(https://arxiv.org/abs/2505.24163)
Keywords: language model, llm
Abstract: Knowledge Graphs (KGs) structure real-world entities and their relationships into triples, enhancing machine reasoning for various tasks. While domain-specific KGs offer substantial benefits, their manual construction is often inefficient and requires specialized knowledge. Recent approaches for knowledge graph construction (KGC) based on large language models (LLMs), such as schema-guided KGC and reference knowledge integration, have proven efficient. However, these methods are constrained by their reliance on manually defined schema, single-document processing, and public-domain references, making them less effective for domain-specific corpora that exhibit complex knowledge dependencies and specificity, as well as limited reference knowledge. To address these challenges, we propose LKD-KGC, a novel framework for unsupervised domain-specific KG construction. LKD-KGC autonomously analyzes document repositories to infer knowledge dependencies, determines optimal processing sequences via LLM driven prioritization, and autoregressively generates entity schema by integrating hierarchical inter-document contexts. This schema guides the unsupervised extraction of entities and relationships, eliminating reliance on predefined structures or external knowledge. Extensive experiments show that compared with state-of-the-art baselines, LKD-KGC generally achieves improvements of 10% to 20% in both precision and recall rate, demonstrating its potential in constructing high-quality domain-specific KGs.
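
The LKD-KGC abstract describes inferring knowledge dependencies between documents and then choosing a processing sequence. As a rough illustration only, the sketch below orders documents with a plain topological sort from Python's standard `graphlib` module. This is a swapped-in stand-in for the paper's LLM-driven prioritization, and the file names and the hard-coded dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each document maps to the set of documents
# whose knowledge it builds on. In the paper these dependencies are inferred
# by an LLM; here they are hard-coded for illustration.
depends_on = {
    "core_concepts.md": set(),
    "api_reference.md": {"core_concepts.md"},
    "deployment_guide.md": {"core_concepts.md", "api_reference.md"},
}

# static_order() yields every document after all of its prerequisites,
# so schema extracted from earlier documents is available to later ones.
processing_order = list(TopologicalSorter(depends_on).static_order())
print(processing_order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the inferred dependencies are circular, which is a useful sanity check before any extraction starts.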

Title: Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models

Authors: Shilin Xu, Yanwei Li, Rui Yang, Tao Zhang, Yueyi Sun, Wei Chow, Linfeng Li, Hang Song, Qi Xu, Yunhai Tong, Xiangtai Li, Hao Fei
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.24164
Pdf URL: https://arxiv.org/pdf/2505.24164
Copy Paste: [[2505.24164]] Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models(https://arxiv.org/abs/2505.24164)
Keywords: language model, llm
Abstract: Recent work on large language models (LLMs) has successfully demonstrated the emergence of reasoning capabilities via reinforcement learning (RL). Although recent efforts leverage group relative policy optimization (GRPO) for post-training multimodal LLMs (MLLMs), each explores only one specific aspect, such as grounding tasks, math problems, or chart analysis. No existing work leverages multi-source MLLM tasks for stable reinforcement learning. In this work, we present a unified perspective to solve this problem. We present Mixed-R1, a unified yet straightforward framework that contains a mixed reward function design (Mixed-Reward) and a mixed post-training dataset (Mixed-45K). We first design a data engine to select high-quality examples to build the Mixed-45K post-training dataset. Then, we present a Mixed-Reward design, which contains various reward functions for various MLLM tasks. In particular, it has four different reward functions: matching reward for binary answer or multiple-choice problems, chart reward for chart-aware datasets, IoU reward for grounding problems, and open-ended reward for long-form text responses such as caption datasets. To handle the various long-form text content, we propose a new open-ended reward named Bidirectional Max-Average Similarity (BMAS) by leveraging tokenizer embedding matching between the generated response and the ground truth. Extensive experiments show the effectiveness of our proposed method on various MLLMs, including Qwen2.5-VL and Intern-VL at various sizes. Our dataset and model are available at this https URL.
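
Two of the Mixed-R1 reward types lend themselves to a short sketch. The IoU function below is the standard intersection-over-union for axis-aligned boxes. The BMAS function is only one plausible reading of "Bidirectional Max-Average Similarity": for each token embedding on one side, take the maximum cosine similarity against the other side, average those maxima, and then average the two directions. The abstract does not give the exact formula, so the function names, the equal weighting of directions, and the toy embeddings are all assumptions.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def max_average(source, target):
    """Average, over source tokens, of the best match found among target tokens."""
    return sum(max(cosine(s, t) for t in target) for s in source) / len(source)


def bmas(response_emb, reference_emb):
    """Assumed BMAS: mean of the two directional max-average similarities."""
    if not response_emb or not reference_emb:
        return 0.0
    forward = max_average(response_emb, reference_emb)
    backward = max_average(reference_emb, response_emb)
    return 0.5 * (forward + backward)


# Toy 2-d "token embeddings" (hypothetical values).
reference = [[1.0, 0.0], [0.0, 1.0]]
response = [[1.0, 0.0]]
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(bmas(response, reference))
```

Scoring in both directions penalizes a response that covers only part of the reference as well as one that adds unrelated content, which a one-directional match would miss.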

Title: Adaptive LoRA Merge with Parameter Pruning for Low-Resource Generation

Authors: Ryota Miyano, Yuki Arase
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.24174
Pdf URL: https://arxiv.org/pdf/2505.24174
Copy Paste: [[2505.24174]] Adaptive LoRA Merge with Parameter Pruning for Low-Resource Generation(https://arxiv.org/abs/2505.24174)
Keywords: llm
Abstract: This study proposes a simple yet effective LoRA merge method to achieve LLM adaptation for low-resource language generation tasks. The LoRA merge technique, which integrates multiple LoRA modules trained on different tasks, has gained attention as an effective and efficient approach for adapting LLMs to target tasks. However, previous methods are limited in adaptability as they keep the LoRA parameters frozen. Additionally, the low-resource problem has been out of their scope. We propose a LoRA merge method that updates and prunes LoRA parameters through fine-tuning with minimal target task data, which allows finer-grained adjustments of LoRA parameters and enhancement of task adaptability. Extensive experiments have been conducted taking summarization as a benchmark task. Our datasets cover various domains and multiple languages of English and Japanese. The results confirm that the proposed method achieves significant and consistent improvements in task adaptability over the previous methods.

Title: Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models

Authors: Mikhail L. Arbuzov, Alexey A. Shvets, Sisong Beir
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24187
Pdf URL: https://arxiv.org/pdf/2505.24187
Copy Paste: [[2505.24187]] Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models(https://arxiv.org/abs/2505.24187)
Keywords: language model, llm
Abstract: The prevailing assumption of an exponential decay in large language model (LLM) reliability with sequence length, predicated on independent per-token error probabilities, posits an inherent limitation for long autoregressive outputs. Our research fundamentally challenges this view by synthesizing emerging evidence that LLM errors are not uniformly distributed but are concentrated at sparse "key tokens" (5-10% of total tokens) representing critical decision junctions. By distinguishing these high-impact tokens from the increasingly predictable majority, we introduce a new reliability formula explaining the sustained coherence of modern LLMs over thousands of tokens. Converging research streams reveal that long-context performance primarily depends on accurately navigating a few crucial semantic decision points rather than on uniform token-level accuracy, enabling targeted strategies that significantly outperform brute-force approaches. We thus propose a framework for next-generation systems centered on selective preservation of semantically vital tokens, dynamic computational allocation at uncertain decision boundaries, multi-path exploration at ambiguities, and architectures aligned with natural semantic domains. This marks a fundamental shift from raw scaling to strategic reasoning, promising breakthrough performance without proportionate computational scaling and offering a more nuanced understanding that supersedes the exponential decay hypothesis, thereby opening pathways toward substantially more powerful and efficient language systems.
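
The contrast between uniform per-token errors and errors concentrated at key tokens can be made concrete with a small calculation. Under independent per-token error probability e, the chance of an error-free sequence of n tokens is (1 - e)^n. The second function below is an illustrative simplification and not the paper's reliability formula, which the abstract does not state: it assumes only a fraction of tokens are "key tokens" that carry a meaningful error probability, and the remainder are nearly error-free. All parameter values are hypothetical.

```python
def uniform_reliability(n_tokens, per_token_error):
    """P(no error) when every token fails independently with the same probability."""
    return (1.0 - per_token_error) ** n_tokens


def key_token_reliability(n_tokens, key_fraction, key_error, other_error=0.0):
    """P(no error) when only a fraction of tokens carries meaningful error risk.

    Assumed model: key tokens fail with probability key_error, all remaining
    tokens with probability other_error (default: never).
    """
    n_key = key_fraction * n_tokens
    n_other = n_tokens - n_key
    return (1.0 - key_error) ** n_key * (1.0 - other_error) ** n_other


n = 1000
uniform = uniform_reliability(n, per_token_error=0.01)
keyed = key_token_reliability(n, key_fraction=0.10, key_error=0.01)
print(f"uniform model:   {uniform:.6f}")
print(f"key-token model: {keyed:.6f}")
```

With the same 1% error rate applied to only 10% of the tokens, the exponent shrinks from 1000 to 100, so reliability still decays exponentially but far more slowly in sequence length.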

Title: CLaSp: In-Context Layer Skip for Self-Speculative Decoding

Authors: Longze Chen, Renke Shan, Huiming Wang, Lu Wang, Ziqiang Liu, Run Luo, Jiawei Wang, Hamid Alinejad-Rokny, Min Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24196
Pdf URL: https://arxiv.org/pdf/2505.24196
Copy Paste: [[2505.24196]] CLaSp: In-Context Layer Skip for Self-Speculative Decoding(https://arxiv.org/abs/2505.24196)
Keywords: language model, llm
Abstract: Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and ensure compatibility across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp does not require additional drafting modules or extra training. Instead, it employs a plug-and-play mechanism by skipping intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by leveraging the complete hidden states from the last verification stage as an objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x ~ 1.7x on LLaMA3 series models without altering the original distribution of the generated text.

Title: Intuitionistic Fuzzy Sets for Large Language Model Data Annotation: A Novel Approach to Side-by-Side Preference Labeling

Authors: Yimin Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24199
Pdf URL: https://arxiv.org/pdf/2505.24199
Copy Paste: [[2505.24199]] Intuitionistic Fuzzy Sets for Large Language Model Data Annotation: A Novel Approach to Side-by-Side Preference Labeling(https://arxiv.org/abs/2505.24199)
Keywords: language model, llm
Abstract: The quality of human preference data is crucial for training and evaluating large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) scenarios. Traditional side-by-side (SBS) annotation approaches often struggle with inherent uncertainty, annotator disagreement, and the complexity of preference judgments. This paper introduces a novel framework based on intuitionistic fuzzy sets (IFS) for modeling and aggregating human preferences in LLM data annotation tasks. Our approach captures not only the degree of preference but also the uncertainty and hesitation inherent in human judgment through membership, non-membership, and hesitation degrees. We propose an IFS-based annotation protocol that enables more nuanced preference modeling, develops aggregation methods for handling annotator disagreement, and introduces quality metrics for preference data assessment. Experimental validation on multiple datasets demonstrates that our IFS-based approach significantly improves annotation consistency, reduces annotator fatigue, and produces higher-quality preference data compared to traditional binary and Likert-scale methods. The resulting preference datasets lead to improved model performance in downstream tasks, with 12.3% improvement in win-rate against baseline models and 15.7% reduction in annotation time. Our framework provides a principled approach to handling uncertainty in human preference annotation and offers practical benefits for large-scale LLM training.
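
In an intuitionistic fuzzy set, each judgment carries a membership degree mu and a non-membership degree nu with mu + nu <= 1, and the hesitation degree is the remainder, pi = 1 - mu - nu. The sketch below encodes one side-by-side judgment ("response A is better than response B") in that form. The aggregation shown is a plain arithmetic mean across annotators, chosen for simplicity; it is an assumption and not necessarily the aggregation operator proposed in the paper. The class name, function name, and vote values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IFSJudgment:
    """One annotator's judgment that 'response A is better than response B'."""
    membership: float      # mu: degree of support for the statement
    non_membership: float  # nu: degree of support against the statement

    def __post_init__(self):
        if not (0.0 <= self.membership <= 1.0 and 0.0 <= self.non_membership <= 1.0):
            raise ValueError("degrees must lie in [0, 1]")
        if self.membership + self.non_membership > 1.0 + 1e-9:
            raise ValueError("membership + non_membership must not exceed 1")

    @property
    def hesitation(self):
        """pi = 1 - mu - nu: the part of the judgment left undecided."""
        return 1.0 - self.membership - self.non_membership


def aggregate_mean(judgments):
    """Average mu and nu across annotators (a simple, assumed aggregator)."""
    n = len(judgments)
    return IFSJudgment(
        membership=sum(j.membership for j in judgments) / n,
        non_membership=sum(j.non_membership for j in judgments) / n,
    )


# Three hypothetical annotators comparing the same pair of responses.
votes = [IFSJudgment(0.7, 0.2), IFSJudgment(0.5, 0.3), IFSJudgment(0.6, 0.1)]
combined = aggregate_mean(votes)
print(combined.membership, combined.non_membership, combined.hesitation)
```

Unlike a binary label or a single Likert score, the hesitation term keeps "I am not sure" separate from "B is better", so disagreement and uncertainty are not collapsed into one number.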

Title: Semi-structured LLM Reasoners Can Be Rigorously Audited

Authors: Jixuan Leng, Cassandra A. Cohen, Zhixian Zhang, Chenyan Xiong, William W. Cohen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24217
Pdf URL: https://arxiv.org/pdf/2505.24217
Copy Paste: [[2505.24217]] Semi-structured LLM Reasoners Can Be Rigorously Audited(https://arxiv.org/abs/2505.24217)
Keywords: language model, llm, chain-of-thought
Abstract: As Large Language Models (LLMs) become increasingly capable at reasoning, the problem of "faithfulness" persists: LLM "reasoning traces" can contain errors and omissions that are difficult to detect, and may obscure biases in model outputs. To address these limitations, we introduce Semi-Structured Reasoning Models (SSRMs), which internalize a semi-structured Chain-of-Thought (CoT) reasoning format within the model. Our SSRMs generate reasoning traces in a Pythonic syntax. While SSRM traces are not executable, they adopt a restricted, task-specific vocabulary to name distinct reasoning steps, and to mark each step's inputs and outputs. Through extensive evaluation on ten benchmarks, SSRMs demonstrate strong performance and generality: they outperform comparably sized baselines by nearly ten percentage points on in-domain tasks while remaining competitive with specialized models on out-of-domain medical benchmarks. Furthermore, we show that semi-structured reasoning is more amenable to analysis: in particular, they can be automatically audited to identify reasoning flaws. We explore both hand-crafted structured audits, which detect task-specific problematic reasoning patterns, and learned typicality audits, which apply probabilistic models over reasoning patterns, and show that both audits can be used to effectively flag probable reasoning errors.

Title: Automated Structured Radiology Report Generation

Authors: Jean-Benoit Delbrouck, Justin Xu, Johannes Moll, Alois Thomas, Zhihong Chen, Sophie Ostmeier, Asfandyar Azhar, Kelvin Zhenghao Li, Andrew Johnston, Christian Bluethgen, Eduardo Reis, Mohamed Muneer, Maya Varma, Curtis Langlotz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24223
Pdf URL: https://arxiv.org/pdf/2505.24223
Copy Paste: [[2505.24223]] Automated Structured Radiology Report Generation(https://arxiv.org/abs/2505.24223)
Keywords: language model, llm
Abstract: Automated radiology report generation from chest X-ray (CXR) images has the potential to improve clinical efficiency and reduce radiologists' workload. However, most datasets, including the publicly available MIMIC-CXR and CheXpert Plus, consist entirely of free-form reports, which are inherently variable and unstructured. This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. Additionally, we introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports. To assess report quality, we propose F1-SRR-BERT, a metric that leverages SRR-BERT's hierarchical disease taxonomy to bridge the gap between free-text variability and structured clinical reporting. We validate our dataset through a reader study conducted by five board-certified radiologists and extensive benchmarking experiments.
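
A label-based report metric of the kind described for F1-SRR-BERT can be pictured as follows: a classifier assigns disease labels to both the reference report and the generated report, and the two label sets are compared with F1. The sketch below computes a micro-averaged F1 over such label sets. It is an assumption about the general shape of the metric, not the paper's definition (the abstract does not specify the averaging or how the hierarchical taxonomy is used), and the label strings are hypothetical placeholders for classifier output.

```python
def micro_f1(reference_labels, generated_labels):
    """Micro-averaged F1 between per-report label sets.

    reference_labels[i] -- set of labels found in the i-th reference report
    generated_labels[i] -- set of labels found in the i-th generated report
    """
    true_pos = false_pos = false_neg = 0
    for ref, gen in zip(reference_labels, generated_labels):
        true_pos += len(ref & gen)    # findings present in both reports
        false_pos += len(gen - ref)   # findings only in the generated report
        false_neg += len(ref - gen)   # findings the generated report missed
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifier output for two report pairs.
reference = [{"pleural effusion", "cardiomegaly"}, {"pneumothorax"}]
generated = [{"pleural effusion"}, {"pneumothorax", "atelectasis"}]
score = micro_f1(reference, generated)
print(f"micro-F1 = {score:.3f}")
```

Comparing extracted findings rather than surface wording is what lets such a metric tolerate free-text variability between two reports that describe the same clinical picture.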

Title: Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization

Authors: Luong Ho, Khanh Le, Vinh Pham, Bao Nguyen, Tan Tran, Duc Chau
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.24229
Pdf URL: https://arxiv.org/pdf/2505.24229
Copy Paste: [[2505.24229]] Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization(https://arxiv.org/abs/2505.24229)
Keywords: language model
Abstract: Inverse Text Normalization (ITN) is crucial for converting spoken Automatic Speech Recognition (ASR) outputs into well-formatted written text, enhancing both readability and usability. Despite its importance, the integration of streaming ITN within streaming ASR remains largely unexplored due to challenges in accuracy, efficiency, and adaptability, particularly in low-resource and limited-context scenarios. In this paper, we introduce a streaming pretrained language model for ITN, leveraging pretrained linguistic representations for improved robustness. To address streaming constraints, we propose Dynamic Context-Aware during training and inference, enabling adaptive chunk size adjustments and the integration of right-context information. Experimental results demonstrate that our method achieves accuracy comparable to non-streaming ITN and surpasses existing streaming ITN models on a Vietnamese dataset, all while maintaining low latency, ensuring seamless integration into ASR systems.

Title: Advantageous Parameter Expansion Training Makes Better Large Language Models

Authors: Naibin Gu, Yilong Chen, Zhenyu Zhang, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24241
Pdf URL: https://arxiv.org/pdf/2505.24241
Copy Paste: [[2505.24241]] Advantageous Parameter Expansion Training Makes Better Large Language Models(https://arxiv.org/abs/2505.24241)
Keywords: language model
Abstract: Although scaling up the number of trainable parameters in both pre-training and fine-tuning can effectively improve the performance of large language models, it also leads to increased computational overhead. When delving into the parameter difference, we find that a subset of parameters, termed advantageous parameters, plays a crucial role in determining model performance. Further analysis reveals that stronger models tend to possess more such parameters. In this paper, we propose Advantageous Parameter EXpansion Training (APEX), a method that progressively expands advantageous parameters into the space of disadvantageous ones, thereby increasing their proportion and enhancing training effectiveness. Further theoretical analysis from the perspective of matrix effective rank explains the performance gains of APEX. Extensive experiments on both instruction tuning and continued pre-training demonstrate that, in instruction tuning, APEX outperforms full-parameter tuning while using only 52% of the trainable parameters. In continued pre-training, APEX achieves the same perplexity level as conventional training with just 33% of the training data, and yields significant improvements on downstream tasks.

Title: Mamba Knockout for Unraveling Factual Information Flow

Authors: Nir Endy, Idan Daniel Grosbard, Yuval Ran-Milo, Yonatan Slutzky, Itay Tshuva, Raja Giryes
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.24244
Pdf URL: https://arxiv.org/pdf/2505.24244
Copy Paste: [[2505.24244]] Mamba Knockout for Unraveling Factual Information Flow(https://arxiv.org/abs/2505.24244)
Keywords: language model, llm
Abstract: This paper investigates the flow of factual information in Mamba State-Space Model (SSM)-based language models. We rely on theoretical and empirical connections to Transformer-based architectures and their attention mechanisms. Exploiting this relationship, we adapt attentional interpretability techniques originally developed for Transformers--specifically, the Attention Knockout methodology--to both Mamba-1 and Mamba-2. Using them, we trace how information is transmitted and localized across tokens and layers, revealing patterns of subject-token information emergence and layer-wise dynamics. Notably, some phenomena vary between Mamba models and Transformer-based models, while others appear universally across all models inspected--hinting that these may be inherent to LLMs in general. By further leveraging Mamba's structured factorization, we disentangle how distinct "features" either enable token-to-token information exchange or enrich individual tokens, thus offering a unified lens to understand Mamba's internal operations.

Title: Proactive Guidance of Multi-Turn Conversation in Industrial Search

Authors: Xiaoyu Li, Xiao Li, Li Gao, Yiding Liu, Xiaoyang Wang, Shuaiqiang Wang, Junfeng Wang, Dawei Yin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.24251
Pdf URL: https://arxiv.org/pdf/2505.24251
Copy Paste: [[2505.24251]] Proactive Guidance of Multi-Turn Conversation in Industrial Search(https://arxiv.org/abs/2505.24251)
Keywords: language model, llm, agent
Abstract: The evolution of Large Language Models (LLMs) has significantly advanced multi-turn conversation systems, emphasizing the need for proactive guidance to enhance users' interactions. However, these systems face challenges in dynamically adapting to shifts in users' goals and maintaining low latency for real-time interactions. In the Baidu Search AI assistant, an industrial-scale multi-turn search system, we propose a novel two-phase framework to provide proactive guidance. The first phase, Goal-adaptive Supervised Fine-Tuning (G-SFT), employs a goal adaptation agent that dynamically adapts to user goal shifts and provides goal-relevant contextual information. G-SFT also incorporates scalable knowledge transfer to distill insights from LLMs into a lightweight model for real-time interaction. The second phase, Click-oriented Reinforcement Learning (C-RL), adopts a generate-rank paradigm, systematically constructs preference pairs from user click signals, and proactively improves click-through rates through more engaging guidance. This dual-phase architecture achieves complementary objectives: G-SFT ensures accurate goal tracking, while C-RL optimizes interaction quality through click signal-driven reinforcement learning. Extensive experiments demonstrate that our framework achieves 86.10% accuracy in offline evaluation (+23.95% over baseline) and 25.28% CTR in online deployment (149.06% relative improvement), while reducing inference latency by 69.55% through scalable knowledge distillation.
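As a small worked check on the online figures quoted above, the sketch below backs out the baseline CTR implied by a 25.28% CTR and a 149.06% relative improvement. It assumes the usual definition of relative improvement, (new - baseline) / baseline, which the abstract does not spell out; the function name `implied_baseline` is hypothetical.

```python
def implied_baseline(new_value: float, relative_improvement_pct: float) -> float:
    """Return the baseline implied by a new value and its relative improvement.

    Assumes relative improvement = (new - baseline) / baseline, in percent.
    """
    return new_value / (1.0 + relative_improvement_pct / 100.0)


# Figures quoted in the abstract: 25.28% online CTR, 149.06% relative improvement.
baseline_ctr = implied_baseline(25.28, 149.06)
print(f"implied baseline CTR: {baseline_ctr:.2f}%")
```

Under that assumption the implied baseline CTR is about 10.15%. The offline "+23.95% over baseline" figure is not decomposed here because the abstract does not say whether it is an absolute or a relative gain.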

Title: Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games

Authors: Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.24255
Pdf URL: https://arxiv.org/pdf/2505.24255
Copy Paste: [[2505.24255]] Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games(https://arxiv.org/abs/2505.24255)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Large Language Models (LLMs) have shown potential in simulating human behaviors and performing theory-of-mind (ToM) reasoning, a crucial skill for complex social interactions. In this study, we investigate the role of ToM reasoning in aligning agentic behaviors with human norms in negotiation tasks, using the ultimatum game as a controlled environment. We initialized LLM agents with different prosocial beliefs (including Greedy, Fair, and Selfless) and reasoning methods like chain-of-thought (CoT) and varying ToM levels, and examined their decision-making processes across diverse LLMs, including reasoning models like o3-mini and DeepSeek-R1 Distilled Qwen 32B. Results from 2,700 simulations indicated that ToM reasoning enhances behavior alignment, decision-making consistency, and negotiation outcomes. Consistent with previous findings, reasoning models exhibit limited capability compared to models with ToM reasoning, and different roles in the game benefit from different orders of ToM reasoning. Our findings contribute to the understanding of ToM's role in enhancing human-AI interaction and cooperative decision-making. The code used for our experiments can be found at this https URL.
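For readers unfamiliar with the ultimatum game used as the testbed above, the following minimal sketch shows the standard payoff rule: a proposer offers part of a fixed pot, and if the responder rejects, both players receive nothing. The function name, the pot size, and the acceptance threshold are illustrative assumptions, not the paper's agent setup.

```python
def play_round(total: int, offer: int, min_acceptable: int) -> tuple[int, int]:
    """Return (proposer_payoff, responder_payoff) for one ultimatum-game round.

    The responder is modeled (as an assumption) by a simple threshold:
    it accepts any offer at or above `min_acceptable`.
    """
    if not 0 <= offer <= total:
        raise ValueError("offer must lie between 0 and the total pot")
    if offer >= min_acceptable:
        return total - offer, offer  # accepted: the pot is split as proposed
    return 0, 0                      # rejected: both players get nothing


# Hypothetical offers against a responder who rejects anything under 30.
print(play_round(100, 50, 30))  # an even split is accepted
print(play_round(100, 10, 30))  # a low offer is rejected
```

The rejection branch is what makes the game a test of norms: a purely payoff-maximizing responder would accept any positive offer, so rejections signal fairness considerations.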

Title: Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation

Authors: Naila Shafirni Hidayat, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24263
Pdf URL: https://arxiv.org/pdf/2505.24263
Copy Paste: [[2505.24263]] Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation(https://arxiv.org/abs/2505.24263)
Keywords: language model, llm
Abstract: The performance of large language models (LLMs) continues to improve, as reflected in rising scores on standard benchmarks. However, the lack of transparency around training data raises concerns about potential overlap with evaluation sets and the fairness of reported results. Although prior work has proposed methods for detecting data leakage, these approaches primarily focus on identifying outliers and have not been evaluated under controlled simulated leakage conditions. In this work, we compare existing leakage detection techniques, namely permutation and n-gram-based methods, under a continual pretraining setup that simulates real-world leakage scenarios, and additionally explore a lightweight method we call semi-half question. Although semi-half offers a low-cost alternative, our analysis shows that the n-gram method consistently achieves the highest F1-score. We also refine these techniques to support instance-level detection and reduce computational overhead. Leveraging the best-performing method, we create cleaned versions of MMLU and HellaSwag, and re-evaluate several LLMs. Our findings present a practical path toward more reliable and transparent evaluations, and we recommend contamination checks as a standard step before releasing benchmark results.
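To make the idea of an n-gram contamination check concrete, here is a minimal sketch that measures how many of a benchmark item's word n-grams also occur in a reference corpus. This is a generic overlap heuristic assumed for illustration; the abstract does not describe the authors' n-gram method in detail, and the function names, the whitespace tokenization, and the example strings are all hypothetical.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus: str, n: int = 3) -> float:
    """Fraction of the item's n-grams that also appear in the corpus."""
    item_grams = ngrams(item.lower().split(), n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = ngrams(corpus.lower().split(), n)
    return len(item_grams & corpus_grams) / len(item_grams)


corpus = "the capital of france is paris and the capital of italy is rome"
leaked = "the capital of france is paris"
clean = "which planet is closest to the sun"

print(overlap_ratio(leaked, corpus))  # high overlap suggests possible leakage
print(overlap_ratio(clean, corpus))   # no shared trigrams
```

A real check would need a proper tokenizer and a threshold chosen on held-out data. Note also that this variant requires access to the training corpus, which is often unavailable for released LLMs.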

Title: Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations

Authors: Xin Quan, Marco Valentino, Louise A. Dennis, André Freitas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24264
Pdf URL: https://arxiv.org/pdf/2505.24264
Copy Paste: [[2505.24264]] Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations(https://arxiv.org/abs/2505.24264)
Keywords: language model, llm
Abstract: Natural language explanations play a fundamental role in Natural Language Inference (NLI) by revealing how premises logically entail hypotheses. Recent work has shown that the interaction of large language models (LLMs) with theorem provers (TPs) can help verify and improve the validity of NLI explanations. However, TPs require translating natural language into machine-verifiable formal representations, a process that introduces the risk of semantic information loss and unfaithful interpretation, an issue compounded by LLMs' challenges in capturing critical logical structures with sufficient precision. Moreover, LLMs are still limited in their capacity for rigorous and robust proof construction within formal verification frameworks. To mitigate issues related to faithfulness and robustness, this paper investigates strategies to (1) alleviate semantic loss during autoformalisation, (2) efficiently identify and correct syntactic errors in logical representations, (3) explicitly use logical expressions to guide LLMs in generating structured proof sketches, and (4) increase LLMs' capacity to interpret the TP's feedback for iterative refinement. Our empirical results on e-SNLI, QASC and WorldTree using different LLMs demonstrate that the proposed strategies yield significant improvements in autoformalisation (+18.46%, +34.2%, +39.77%) and explanation refinement (+29.5%, +51.5%, +41.25%) over the state-of-the-art model. Moreover, we show that specific interventions on the hybrid LLM-TP architecture can substantially improve efficiency, drastically reducing the number of iterations required for successful verification.

Title: ScienceMeter: Tracking Scientific Knowledge Updates in Language Models

Authors: Yike Wang, Shangbin Feng, Yulia Tsvetkov, Hannaneh Hajishirzi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24302
Pdf URL: https://arxiv.org/pdf/2505.24302
Copy Paste: [[2505.24302]] ScienceMeter: Tracking Scientific Knowledge Updates in Language Models(https://arxiv.org/abs/2505.24302)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used to support scientific research, but their knowledge of scientific advancements can quickly become outdated. We introduce ScienceMeter, a new framework for evaluating scientific knowledge update methods over scientific knowledge spanning the past, present, and future. ScienceMeter defines three metrics: knowledge preservation, the extent to which models' understanding of previously learned papers is preserved; knowledge acquisition, how well scientific claims from newly introduced papers are acquired; and knowledge projection, the ability of the updated model to anticipate or generalize to related scientific claims that may emerge in the future. Using ScienceMeter, we examine the scientific knowledge of LLMs on claim judgment and generation tasks across a curated dataset of 15,444 scientific papers and 30,888 scientific claims from ten domains including medicine, biology, materials science, and computer science. We evaluate five representative knowledge update approaches including training- and inference-time methods. With extensive experiments, we find that the best-performing knowledge update methods can preserve only 85.9% of existing knowledge, acquire 71.7% of new knowledge, and project 37.7% of future knowledge. Inference-based methods work for larger models, whereas smaller models require training to achieve comparable performance. Cross-domain analysis reveals that performance on these objectives is correlated. Even when applied to specialized scientific LLMs, existing knowledge update methods fail to achieve these objectives collectively, underscoring that developing robust scientific knowledge update mechanisms is both crucial and challenging.

Title: HiCaM: A Hierarchical-Causal Modification Framework for Long-Form Text Modification

Authors: Yuntao Shi, Yi Luo, Yeyun Gong, Chen Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24319
Pdf URL: https://arxiv.org/pdf/2505.24319
Copy Paste: [[2505.24319]] HiCaM: A Hierarchical-Causal Modification Framework for Long-Form Text Modification(https://arxiv.org/abs/2505.24319)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success in various domains. However, when handling long-form text modification tasks, they still face two major problems: (1) producing undesired modifications by inappropriately altering or summarizing irrelevant content, and (2) missing necessary modifications to implicitly related passages that are crucial for maintaining document coherence. To address these issues, we propose HiCaM, a Hierarchical-Causal Modification framework that operates through a hierarchical summary tree and a causal graph. Furthermore, to evaluate HiCaM, we derive a multi-domain dataset from various benchmarks, providing a resource for assessing its effectiveness. Comprehensive evaluations on the dataset demonstrate significant improvements over strong LLMs, with our method achieving up to a 79.50% win rate. These results highlight the comprehensiveness of our approach, showing consistent performance improvements across multiple models and domains.

Title: Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents

Authors: Fanhang Man, Huandong Wang, Jianjie Fang, Zhaoyi Deng, Baining Zhao, Xinlei Chen, Yong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24331
Pdf URL: https://arxiv.org/pdf/2505.24331
Copy Paste: [[2505.24331]] Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents(https://arxiv.org/abs/2505.24331)
Keywords: llm, agent
Abstract: User sentiment on social media reveals the underlying social trends, crises, and needs. Researchers have analyzed users' past messages to trace the evolution of sentiments and reconstruct sentiment dynamics. However, predicting the imminent sentiment of an ongoing event is rarely studied. In this paper, we address the problem of sentiment forecasting on social media to predict the user's future sentiment in response to the development of the event. We extract sentiment-related features to enhance the modeling skill and propose a multi-perspective role-playing framework to simulate the process of human response. Our preliminary results show significant improvement in sentiment forecasting on both microscopic and macroscopic levels.

Title: Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning

Authors: Wenxuan Shi, Haochen Tan, Chuqiao Kuang, Xiaoguang Li, Xiaozhe Ren, Chen Zhang, Hanting Chen, Yasheng Wang, Lifeng Shang, Fisher Yu, Yunhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24332
Pdf URL: https://arxiv.org/pdf/2505.24332
Copy Paste: [[2505.24332]] Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning(https://arxiv.org/abs/2505.24332)
Keywords: language model, llm, prompt
Abstract: Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing methods rely on static prompting rules or training with Wikipedia-based corpora and retrieval environments, limiting adaptability to the real-world web environment where ambiguity, conflicting evidence, and noise are prevalent. These constrained training settings hinder LLMs from learning to dynamically decide when and where to search, and how to adjust search depth and frequency based on informational demands. We define this missing capacity as Search Intensity Scaling (SIS)--the emergent skill to intensify search efforts under ambiguous or conflicting conditions, rather than settling on overconfident, under-verified answers. To study SIS, we introduce WebPuzzle, the first dataset designed to foster information-seeking behavior in open-world internet environments. WebPuzzle consists of 24K training instances and 275 test questions spanning both wiki-based and open-web queries. Building on this dataset, we propose DeepDiver, a Reinforcement Learning (RL) framework that promotes SIS by encouraging adaptive search policies through exploration under a real-world open-web environment. Experimental results show that Pangu-7B-Reasoner empowered by DeepDiver achieves performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's training curriculum from cold-start supervised fine-tuning to a carefully designed RL phase, and show that its SIS capability generalizes from closed-form QA to open-ended tasks such as long-form writing. Our contributions advance adaptive information seeking in LLMs and provide a valuable benchmark and dataset for future research.

Title: Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings

Authors: Shujian Yang, Shiyao Cui, Chuanrui Hu, Haicheng Wang, Tianwei Zhang, Minlie Huang, Jialiang Lu, Han Qiu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.24341
Pdf URL: https://arxiv.org/pdf/2505.24341
Copy Paste: [[2505.24341]] Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings(https://arxiv.org/abs/2505.24341)
Keywords: language model, llm
Abstract: Detecting toxic content using language models is important but challenging. While large language models (LLMs) have demonstrated strong performance in understanding Chinese, recent studies show that simple character substitutions in toxic Chinese text can easily confuse the state-of-the-art (SOTA) LLMs. In this paper, we highlight the multimodal nature of Chinese language as a key challenge for deploying LLMs in toxic Chinese detection. First, we propose a taxonomy of 3 perturbation strategies and 8 specific approaches in toxic Chinese content. Then, we curate a dataset based on this taxonomy, and benchmark 9 SOTA LLMs (from both the US and China) to assess if they can detect perturbed toxic Chinese text. Additionally, we explore cost-effective enhancement solutions like in-context learning (ICL) and supervised fine-tuning (SFT). Our results reveal two important findings. (1) LLMs are less capable of detecting perturbed multimodal Chinese toxic content. (2) ICL or SFT with a small number of perturbed examples may cause the LLMs to "overcorrect": they misidentify much normal Chinese content as toxic.

Title: Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

Authors: Yangui Fang, Baixu Cheng, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2505.24347
Pdf URL: https://arxiv.org/pdf/2505.24347
Copy Paste: [[2505.24347]] Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction(https://arxiv.org/abs/2505.24347)
Keywords: gpt, llm, hallucination, chain-of-thought
Abstract: Automatic Speech Recognition (ASR) error correction aims to correct recognition errors while preserving accurate text. Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, directly using LLMs runs into the hallucination problem, which may lead to the modification of correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-task iterative correction, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and ensures the correctness of the LLM correction under multi-pass programming. Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.
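The relative CER/WER reductions reported above can be read with the following minimal sketch, which computes word error rate from a word-level edit distance and then the relative reduction between two error rates. It assumes the conventional definitions (errors divided by reference length, and (before - after) / before); the example sentences and rates are made up for illustration and are not results from the paper.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative error reduction, e.g. 0.21 for a 21% relative reduction."""
    return (before - after) / before


# Hypothetical example: one dropped word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
# Hypothetical error rates before and after correction.
print(relative_reduction(0.10, 0.079))
```

Character error rate follows the same recipe with characters instead of words, for example by passing `list(reference)` and `list(hypothesis)` to `edit_distance`.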

Title: Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research

Authors: Qianqian Zhang, Jiajia Liao, Heting Ying, Yibo Ma, Haozhan Shen, Jingcheng Li, Peng Liu, Lu Zhang, Chunxin Fang, Kyusong Lee, Ruochen Xu, Tiancheng Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24354
Pdf URL: https://arxiv.org/pdf/2505.24354
Copy Paste: [[2505.24354]] Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research(https://arxiv.org/abs/2505.24354)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks. However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison. We introduce Agent Graph-based Orchestration for Reasoning and Assessment (AGORA), a flexible and extensible framework that addresses these challenges through three key contributions: (1) a modular architecture with a graph-based workflow engine, efficient memory management, and clean component abstraction; (2) a comprehensive suite of reusable agent algorithms implementing state-of-the-art reasoning approaches; and (3) a rigorous evaluation framework enabling systematic comparison across multiple dimensions. Through extensive experiments on mathematical reasoning and multimodal tasks, we evaluate various agent algorithms across different LLMs, revealing important insights about their relative strengths and applicability. Our results demonstrate that while sophisticated reasoning approaches can enhance agent capabilities, simpler methods like Chain-of-Thought often exhibit robust performance with significantly lower computational overhead. AGORA not only simplifies language agent development but also establishes a foundation for reproducible agent research through standardized evaluation protocols.

Title: Knowing Before Saying: LLM Representations Encode Information About Chain-of-Thought Success Before Completion

Authors: Anum Afzal, Florian Matthes, Gal Chechik, Yftah Ziser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24362
Pdf URL: https://arxiv.org/pdf/2505.24362
Copy Paste: [[2505.24362]] Knowing Before Saying: LLM Representations Encode Information About Chain-of-Thought Success Before Completion(https://arxiv.org/abs/2505.24362)
Keywords: llm, chain-of-thought
Abstract: We investigate whether the success of a zero-shot Chain-of-Thought (CoT) process can be predicted before completion. We discover that a probing classifier, based on LLM representations, performs well even before a single token is generated, suggesting that crucial information about the reasoning process is already present in the representations of the initial steps. In contrast, a strong BERT-based baseline, which relies solely on the generated tokens, performs worse, likely because it depends on shallow linguistic cues rather than deeper reasoning dynamics. Surprisingly, using later reasoning steps does not always improve classification. When additional context is unhelpful, earlier representations resemble later ones more, suggesting LLMs encode key information early. This implies reasoning can often stop early without loss. To test this, we conduct early stopping experiments, showing that truncating CoT reasoning still improves performance over not using CoT at all, though a gap remains compared to full reasoning. However, approaches like supervised learning or reinforcement learning designed to shorten CoT chains could leverage our classifier's guidance to identify when early stopping is effective. Our findings provide insights that may support such methods, helping to optimize CoT's efficiency while preserving its benefits. Code and data are available at this https URL.

Title: LLM Inference Enhanced by External Knowledge: A Survey

Authors: Yu-Hsuan Lin, Qian-Hui Chen, Yi-Jie Cheng, Jia-Ren Zhang, Yi-Hung Liu, Liang-Yu Hsia, Yun-Nung Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24377
Pdf URL: https://arxiv.org/pdf/2505.24377
Copy Paste: [[2505.24377]] LLM Inference Enhanced by External Knowledge: A Survey(https://arxiv.org/abs/2505.24377)
Keywords: language model, llm, hallucination
Abstract: Recent advancements in large language models (LLMs) have enhanced natural-language reasoning. However, their limited parametric memory and susceptibility to hallucination present persistent challenges for tasks requiring accurate, context-based inference. To overcome these limitations, an increasing number of studies have proposed leveraging external knowledge to enhance LLMs. This study offers a systematic exploration of strategies for using external knowledge to enhance LLMs, beginning with a taxonomy that categorizes external knowledge into unstructured and structured data. We then focus on structured knowledge, presenting distinct taxonomies for tables and knowledge graphs (KGs), detailing their integration paradigms with LLMs, and reviewing representative methods. Our comparative analysis further highlights the trade-offs among interpretability, scalability, and performance, providing insights for developing trustworthy and generalizable knowledge-enhanced LLMs.

Title: ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation

Authors: Hao Chen, Yukun Yan, Sen Mei, Wanxiang Che, Zhenghao Liu, Qi Shi, Xinze Li, Yuchun Fan, Pengcheng Huang, Qiushi Xiong, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24388
Pdf URL: https://arxiv.org/pdf/2505.24388
Copy Paste: [[2505.24388]] ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation(https://arxiv.org/abs/2505.24388)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge to improve factuality. However, existing RAG systems frequently underutilize the retrieved documents, failing to extract and integrate the key clues needed to support faithful and interpretable reasoning, especially in cases where relevant evidence is implicit, scattered, or obscured by noise. To address this issue, we propose ClueAnchor, a novel framework for enhancing RAG via clue-anchored reasoning exploration and optimization. ClueAnchor extracts key clues from retrieved content and generates multiple reasoning paths based on different knowledge configurations, optimizing the model by selecting the most effective one through reward-based preference optimization. Experiments show that ClueAnchor significantly outperforms prior RAG baselines in reasoning completeness and robustness. Further analysis confirms its strong resilience to noisy or partially relevant retrieved content, as well as its capability to identify supporting evidence even in the absence of explicit clue supervision during inference.

Title: LLMs Are Globally Multilingual Yet Locally Monolingual: Exploring Knowledge Transfer via Language and Thought Theory

Authors: Eojin Kang, Juae Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24409
Pdf URL: https://arxiv.org/pdf/2505.24409
Copy Paste: [[2505.24409]] LLMs Are Globally Multilingual Yet Locally Monolingual: Exploring Knowledge Transfer via Language and Thought Theory(https://arxiv.org/abs/2505.24409)
Keywords: language model, llm, prompt
Abstract: Multilingual large language models (LLMs) open up new possibilities for leveraging information across languages, but their factual knowledge recall remains inconsistent depending on the input language. While previous studies have attempted to address this issue through English-based prompting and evaluation, we explore non-English to English transfer via Language and Thought Theory. This perspective allows us to examine language-thought binding in LLMs and uncover why factual knowledge often fails to transfer effectively. We propose the Language-to-Thought (L2T) prompting strategy, which analyzes the relationship between input language, internal cognitive processes, and knowledge. Experimental results challenge the assumption that English-based approaches consistently outperform other languages and offer a novel insight that aligning the model's internal thought with the knowledge required for the task is critical for successful cross-lingual transfer. Furthermore, we show that applying L2T during training can alleviate LLMs' reliance on the input language and facilitate cross-linguistic knowledge integration without translation-based learning. Code and datasets will be available.

Title: MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs

Authors: Zhiwei Liu, Lingfei Qian, Qianqian Xie, Jimin Huang, Kailai Yang, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24423
Pdf URL: https://arxiv.org/pdf/2505.24423
Copy Paste: [[2505.24423]] MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs(https://arxiv.org/abs/2505.24423)
Keywords: language model, gpt, llm
Abstract: Large language models and vision-language models (which we jointly call LMs) have transformed NLP and CV, demonstrating remarkable potential across various fields. However, their capabilities in affective analysis (i.e. sentiment analysis and emotion detection) remain underexplored. This gap is largely due to the absence of comprehensive evaluation benchmarks, and the inherent complexity of affective analysis tasks. In this paper, we introduce MMAFFBen, the first extensive open-source benchmark for multilingual multimodal affective analysis. MMAFFBen encompasses text, image, and video modalities across 35 languages, covering four key affective analysis tasks: sentiment polarity, sentiment intensity, emotion classification, and emotion intensity. Moreover, we construct the MMAFFIn dataset for fine-tuning LMs on affective analysis tasks, and further develop MMAFFLM-3b and MMAFFLM-7b based on it. We evaluate various representative LMs, including GPT-4o-mini, providing a systematic comparison of their affective understanding capabilities. This project is available at this https URL.

Title: Model Unlearning via Sparse Autoencoder Subspace Guided Projections

Authors: Xu Wang, Zihao Li, Benyou Wang, Yan Hu, Difan Zou
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.24428
Pdf URL: https://arxiv.org/pdf/2505.24428
Copy Paste: [[2505.24428]] Model Unlearning via Sparse Autoencoder Subspace Guided Projections(https://arxiv.org/abs/2505.24428)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) store vast amounts of information, making them powerful yet raising privacy and safety concerns when selective knowledge removal is required. Existing unlearning strategies, ranging from gradient-based fine-tuning and model editing to sparse autoencoder (SAE) steering, either lack interpretability or fail to provide a robust defense against adversarial prompts. We propose SAE-Guided Subspace Projection Unlearning (SSPU), a novel framework that leverages SAE features to drive targeted updates in the model's parameter space, enabling precise, interpretable, and robust unlearning. SSPU's three-stage pipeline performs data-driven layer and feature selection, subspace construction via QR decomposition, and constrained optimization that controls activations into an "irrelevant" subspace while preserving retained knowledge. Overall, we use SAE features to construct a subspace that supervises unlearning, refining the loss and adding a regularization term to guide interpretable parameter updates. In experiments on the WMDP-Cyber forget set and three utility benchmarks (MMLU, TruthfulQA, GSM8K), SSPU reduces harmful knowledge accuracy by 3.22% compared to the strongest baseline. It also improves adversarial robustness, lowering malicious accuracy under jailbreak prompts compared to baselines. Our findings expose the limitations of prior unlearning methods and demonstrate how interpretable subspace-guided optimization can achieve robust, controllable model behavior.
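The subspace-construction step named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the feature vectors, the function names `orthonormal_basis` and `project`, and the use of modified Gram-Schmidt (which yields the Q factor of a thin QR decomposition) are all assumptions made for illustration. The sketch builds an orthonormal basis from a few feature directions and splits a vector into its component inside and outside the spanned subspace.

```python
import math

def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt over the given vectors.

    The returned rows are the columns of Q in a thin QR decomposition of the
    matrix whose columns are `vectors` (hypothetical feature directions).
    """
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = sum(a * b for a, b in zip(w, q))
            w = [a - coeff * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([a / norm for a in w])
    return basis

def project(x, basis):
    """Orthogonal projection of x onto the span of an orthonormal basis."""
    out = [0.0] * len(x)
    for q in basis:
        coeff = sum(a * b for a, b in zip(x, q))
        out = [o + coeff * b for o, b in zip(out, q)]
    return out

# Two made-up feature directions spanning the x-y plane of a 3-D space.
features = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
basis = orthonormal_basis(features)

x = [2.0, 3.0, 4.0]
inside = project(x, basis)                    # component within the subspace
outside = [a - b for a, b in zip(x, inside)]  # component orthogonal to it
```

In practice a library routine such as a numerical QR factorization would replace the hand-written loop; the pure-Python version is used here only to keep the sketch self-contained.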

Title: Exploring the Impact of Occupational Personas on Domain-Specific QA

Authors: Eojin Kang, Jaehyuk Yu, Juae Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24448
Pdf URL: https://arxiv.org/pdf/2505.24448
Copy Paste: [[2505.24448]] Exploring the Impact of Occupational Personas on Domain-Specific QA(https://arxiv.org/abs/2505.24448)
Keywords: language model, llm
Abstract: Recent studies on personas have improved the way Large Language Models (LLMs) interact with users. However, the effect of personas on domain-specific question-answering (QA) tasks remains a subject of debate. This study analyzes whether personas enhance specialized QA performance by introducing two types of persona: Profession-Based Personas (PBPs) (e.g., scientist), which directly relate to domain expertise, and Occupational Personality-Based Personas (OPBPs) (e.g., scientific person), which reflect cognitive tendencies rather than explicit expertise. Through empirical evaluations across multiple scientific domains, we demonstrate that while PBPs can slightly improve accuracy, OPBPs often degrade performance, even when semantically related to the task. Our findings suggest that persona relevance alone does not guarantee effective knowledge utilization and that personas may impose cognitive constraints that hinder optimal knowledge application. Future research can explore how nuanced distinctions in persona representations guide LLMs, potentially contributing to reasoning and knowledge retrieval that more closely mirror human social conceptualization.

Title: When Large Multimodal Models Confront Evolving Knowledge:Challenges and Pathways

Authors: Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang, Zhi Gao, Zilong Zheng, Lei Liu, Bin Li, Qing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24449
Pdf URL: https://arxiv.org/pdf/2505.24449
Copy Paste: [[2505.24449]] When Large Multimodal Models Confront Evolving Knowledge:Challenges and Pathways(https://arxiv.org/abs/2505.24449)
Keywords: llm
Abstract: Large language/multimodal models (LLMs/LMMs) store extensive pre-trained knowledge but struggle to maintain consistency with real-world updates, making it difficult to avoid catastrophic forgetting while acquiring evolving knowledge. Previous work focused on constructing textual knowledge datasets and exploring knowledge injection in LLMs, lacking exploration of multimodal evolving knowledge injection in LMMs. To address this, we propose the EVOKE benchmark to evaluate LMMs' ability to inject multimodal evolving knowledge in real-world scenarios. Meanwhile, a comprehensive evaluation of multimodal evolving knowledge injection revealed two challenges: (1) Existing knowledge injection methods perform terribly on evolving knowledge. (2) Supervised fine-tuning causes catastrophic forgetting; in particular, instruction-following ability is severely compromised. Additionally, we provide pathways and find that: (1) Text knowledge augmentation during the training phase improves performance, while image augmentation does not. (2) Continual learning methods, especially Replay and MoELoRA, effectively mitigate forgetting. Our findings indicate that current knowledge injection methods have many limitations on evolving knowledge, which motivates further research on more efficient and stable knowledge injection methods.

Title: CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

Authors: Emilio Villa-Cueva, Sholpan Bolatzhanova, Diana Turmakhan, Kareem Elzeky, Henok Biadglign Ademtew, Alham Fikri Aji, Israel Abebe Azime, Jinheon Baek, Frederico Belcavello, Fermin Cristobal, Jan Christian Blaise Cruz, Mary Dabre, Raj Dabre, Toqeer Ehsan, Naome A Etori, Fauzan Farooqui, Jiahui Geng, Guido Ivetta, Thanmay Jayakumar, Soyeong Jeong, Zheng Wei Lim, Aishik Mandal, Sofia Martinelli, Mihail Minkov Mihaylov, Daniil Orel, Aniket Pramanick, Sukannya Purkayastha, Israfel Salazar, Haiyue Song, Tiago Timponi Torrent, Debela Desalegn Yadeta, Injy Hamed, Atnafu Lambebo Tonja, Thamar Solorio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24456
Pdf URL: https://arxiv.org/pdf/2505.24456
Copy Paste: [[2505.24456]] CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation(https://arxiv.org/abs/2505.24456)
Keywords: language model
Abstract: Cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender usage. By releasing CaMMT, we aim to support broader efforts in building and evaluating multimodal translation systems that are better aligned with cultural nuance and regional variation.

Title: VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation

Authors: Hieu Tran, Phuong-Anh Nguyen-Le, Huy Nghiem, Quang-Nhan Nguyen, Wei Ai, Marine Carpuat
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.24472
Pdf URL: https://arxiv.org/pdf/2505.24472
Copy Paste: [[2505.24472]] VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation(https://arxiv.org/abs/2505.24472)
Keywords: llm
Abstract: Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline. This pipeline incorporates filtering mechanisms to ensure syntactic plausibility and pragmatic appropriateness in code-mixing patterns. Experimental validation shows that our naturalistic and complementary synthetic data boost model performance, with translation quality estimation scores of up to 71.84 on COMETkiwi and 81.77 on XCOMET. Triangulating these positive results with LLM-based assessments, we find that augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments (54-56% excluding ties). VietMix and our augmentation methodology advance ecological validity in neural MT evaluations and establish a framework for addressing code-mixed translation challenges across other low-resource pairs.

Title: TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning for Enhancing LLMs' Social Intelligence

Authors: Guiyang Hou, Xing Gao, Yuchuan Wu, Xiang Huang, Wenqi Zhang, Zhe Zheng, Yongliang Shen, Jialu Du, Fei Huang, Yongbin Li, Weiming Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24500
Pdf URL: https://arxiv.org/pdf/2505.24500
Copy Paste: [[2505.24500]] TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning for Enhancing LLMs' Social Intelligence(https://arxiv.org/abs/2505.24500)
Keywords: language model, llm
Abstract: Recently, Large Language Models (LLMs) have made significant progress in IQ-related domains that require careful thinking, such as mathematics and coding. However, enhancing LLMs' cognitive development in social domains, particularly from a post-training perspective, remains underexplored. Recognizing that the social world follows a distinct timeline and requires a richer blend of cognitive modes (from intuitive reactions (System 1) and surface-level thinking to deliberate thinking (System 2)) than mathematics, which primarily relies on System 2 cognition (careful, step-by-step reasoning), we introduce Temporal-aware Hierarchical Cognitive Reinforcement Learning (TimeHC-RL) for enhancing LLMs' social intelligence. In our experiments, we systematically explore improving LLMs' social intelligence and validate the effectiveness of the TimeHC-RL method, through five other post-training paradigms and two test-time intervention paradigms on eight datasets with diverse data patterns. Experimental results reveal the superiority of our proposed TimeHC-RL method compared to the widely adopted System 2 RL method. It gives the 7B backbone model wings, enabling it to rival the performance of advanced models like DeepSeek-R1 and OpenAI-O3. Additionally, the systematic exploration from post-training and test-time interventions perspectives to improve LLMs' social intelligence has uncovered several valuable insights.

Title: Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

Authors: Andrea Pedrotti, Michele Papucci, Cristiano Ciaccio, Alessio Miaschi, Giovanni Puccetti, Felice Dell'Orletta, Andrea Esuli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24523
Pdf URL: https://arxiv.org/pdf/2505.24523
Copy Paste: [[2505.24523]] Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors(https://arxiv.org/abs/2505.24523)
Keywords: language model, llm
Abstract: Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.

Title: DEEPQUESTION: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance

Authors: Ali Khoramfar, Ali Ramezani, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi, Heshaam Faili
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24532
Pdf URL: https://arxiv.org/pdf/2505.24532
Copy Paste: [[2505.24532]] DEEPQUESTION: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance(https://arxiv.org/abs/2505.24532)
Keywords: llm
Abstract: LLMs often excel on standard benchmarks but falter on real-world tasks. We introduce DeepQuestion, a scalable automated framework that augments existing datasets based on Bloom's taxonomy and creates novel questions that trace original solution paths to probe evaluative and creative skills. Extensive experiments across ten open-source and proprietary models, covering both general-purpose and reasoning LLMs, reveal substantial performance drops (even up to 70% accuracy loss) on higher-order tasks, underscoring persistent gaps in deep reasoning. Our work highlights the need for cognitively diverse benchmarks to advance LLM progress. DeepQuestion and related datasets will be released upon acceptance of the paper.

Title: Don't Erase, Inform! Detecting and Contextualizing Harmful Language in Cultural Heritage Collections

Authors: Orfeas Menis Mastromichalakis, Jason Liartis, Kristina Rose, Antoine Isaac, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24538
Pdf URL: https://arxiv.org/pdf/2505.24538
Copy Paste: [[2505.24538]] Don't Erase, Inform! Detecting and Contextualizing Harmful Language in Cultural Heritage Collections(https://arxiv.org/abs/2505.24538)
Keywords: language model, llm
Abstract: Cultural Heritage (CH) data hold invaluable knowledge, reflecting the history, traditions, and identities of societies, and shaping our understanding of the past and present. However, many CH collections contain outdated or offensive descriptions that reflect historical biases. CH Institutions (CHIs) face significant challenges in curating these data due to the vast scale and complexity of the task. To address this, we develop an AI-powered tool that detects offensive terms in CH metadata and provides contextual insights into their historical background and contemporary perception. We leverage a multilingual vocabulary co-created with marginalized communities, researchers, and CH professionals, along with traditional NLP techniques and Large Language Models (LLMs). Available as a standalone web app and integrated with major CH platforms, the tool has processed over 7.9 million records, contextualizing the contentious terms detected in their metadata. Rather than erasing these terms, our approach seeks to inform, making biases visible and providing actionable insights for creating more inclusive and accessible CH collections.

Title: Localizing Persona Representations in LLMs

Authors: Celia Cintas, Miriam Rateike, Erik Miehling, Elizabeth Daly, Skyler Speakman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24539
Pdf URL: https://arxiv.org/pdf/2505.24539
Copy Paste: [[2505.24539]] Localizing Persona Representations in LLMs(https://arxiv.org/abs/2505.24539)
Keywords: language model, llm
Abstract: We present a study on how and where personas -- defined by distinct sets of human characteristics, values, and beliefs -- are encoded in the representation space of large language models (LLMs). Using a range of dimension reduction and pattern recognition methods, we first identify the model layers that show the greatest divergence in encoding these representations. We then analyze the activations within a selected layer to examine how specific personas are encoded relative to others, including their shared and distinct embedding spaces. We find that, across multiple pre-trained decoder-only LLMs, the analyzed personas show large differences in representation space only within the final third of the decoder layers. We observe overlapping activations for specific ethical perspectives -- such as moral nihilism and utilitarianism -- suggesting a degree of polysemy. In contrast, political ideologies like conservatism and liberalism appear to be represented in more distinct regions. These findings help to improve our understanding of how LLMs internally represent information and can inform future efforts in refining the modulation of specific human traits in LLM outputs. Warning: This paper includes potentially offensive sample statements.

Title: Cross-Attention Speculative Decoding

Authors: Wei Zhong, Manasa Bharadwaj, Yixiao Wang, Nikhil Verma, Yipeng Ji, Chul Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24544
Pdf URL: https://arxiv.org/pdf/2505.24544
Copy Paste: [[2505.24544]] Cross-Attention Speculative Decoding(https://arxiv.org/abs/2505.24544)
Keywords: language model, llm
Abstract: Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), to our knowledge the first cross-attention-based Transformer decoder SD model that achieves performance on par with leading self-attention SD models (EAGLE-v2) while eliminating the need for pooling or auxiliary components, simplifying the architecture, improving training efficiency, and maintaining stable memory usage during training-time simulation. To enable effective training of this novel architecture, we propose Two-Stage Block-Attention Training, a new method that achieves training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, offering a strong alternative for architectures in speculative decoding.
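For readers unfamiliar with the draft-then-verify loop that speculative decoding relies on, here is a minimal greedy sketch. It is a generic illustration, not Beagle or EAGLE: the toy `target_next` and `draft_next` functions, the block size `k`, and the sequential verification (a real system would score the whole proposal in one batched target pass and typically add a bonus token when every draft token is accepted) are all assumptions.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding with toy next-token functions.

    The draft proposes k tokens; the target keeps the longest prefix it agrees
    with and substitutes its own token at the first disagreement.
    """
    out = list(prompt)
    goal = len(prompt) + num_tokens
    while len(out) < goal:
        # Draft phase: cheaply propose k tokens in a row.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            token = draft_next(ctx)
            proposal.append(token)
            ctx.append(token)
        # Verify phase: the target's token is always the one that is kept.
        for token in proposal:
            if len(out) >= goal:
                break
            expected = target_next(out)
            out.append(expected)
            if token != expected:
                break  # first mismatch: discard the rest of the proposal
    return out

# Hypothetical toy "models" over digit tokens: the target counts upward,
# and the draft agrees with it except right after the token 4.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

result = speculative_decode(target_next, draft_next, prompt=[0], num_tokens=8)
```

Because every kept token comes from the target, the output matches plain greedy decoding with the target alone; the draft only affects how many target checks can be amortized per round.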

Title: A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings

Authors: Xiaoang Xu, Shuo Wang, Xu Han, Zhenghao Liu, Huijia Wu, Peipei Li, Zhiyuan Liu, Maosong Sun, Zhaofeng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24550
Pdf URL: https://arxiv.org/pdf/2505.24550
Copy Paste: [[2505.24550]] A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings(https://arxiv.org/abs/2505.24550)
Keywords: chain-of-thought
Abstract: Large Reasoning Models (LRMs) achieve superior performance by extending the thought length. However, a lengthy thinking trajectory leads to reduced efficiency. Most existing methods assume that the model is overthinking and attempt to reason efficiently by compressing the Chain-of-Thought, but this often leads to performance degradation. To address this problem, we introduce A*-Thought, an efficient tree search-based unified framework designed to identify and isolate the most essential thoughts from the extensive reasoning chains produced by these models. It formulates the reasoning process of LRMs as a search tree, where each node represents a reasoning span in the giant reasoning space. By combining the A* search algorithm with a cost function specific to the reasoning path, it can efficiently compress the chain of thought and determine a reasoning path with high information density and low cost. In addition, we propose a bidirectional importance estimation mechanism, which further refines this search process and enhances its efficiency beyond uniform sampling. Extensive experiments on several advanced math tasks show that A*-Thought effectively balances performance and efficiency over a huge search space. Specifically, A*-Thought can improve the performance of QwQ-32B by 2.39× under a low budget and reduce the output token length by nearly 50% under a high budget. The proposed method is also compatible with several other LRMs, demonstrating its generalization capability. The code can be accessed at: this https URL.
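The abstract builds on the A* search algorithm, which expands nodes in order of f = g + h (cost so far plus an estimate of the remaining cost). The sketch below shows plain A* on a toy graph. It is not the paper's method: the node names, edge costs, and heuristic values are invented for illustration, and the paper's reasoning-specific cost function and bidirectional importance estimation are not reproduced.

```python
import heapq

def a_star(graph, heuristic, start, goal):
    """Textbook A*: returns (path, cost) for the cheapest path found."""
    # Frontier entries are (f, g, node, path); the heap orders them by f first.
    frontier = [(heuristic[start], 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for neighbor, step_cost in graph.get(node, []):
            new_g = g + step_cost
            if new_g < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_g
                f = new_g + heuristic[neighbor]
                heapq.heappush(frontier, (f, new_g, neighbor, path + [neighbor]))
    return None, float("inf")

# Hypothetical toy graph: nodes stand in for reasoning spans, edge weights
# for the cost of including the next span.
graph = {
    "question": [("step_a", 2), ("step_b", 5)],
    "step_a": [("step_c", 2)],
    "step_b": [("answer", 1)],
    "step_c": [("answer", 3)],
    "answer": [],
}
# Admissible heuristic: never overestimates the true remaining cost.
heuristic = {"question": 5, "step_a": 4, "step_b": 1, "step_c": 3, "answer": 0}

path, cost = a_star(graph, heuristic, "question", "answer")
```

With an admissible heuristic, as here, the first time the goal is popped from the frontier its path is optimal, which is why the shorter two-step route is preferred over the three-step one.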

Title: CREFT: Sequential Multi-Agent LLM for Character Relation Extraction

Authors: Ye Eun Chun, Taeyoon Hwang, Seung-won Hwang, Byung-Hak Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24553
Pdf URL: https://arxiv.org/pdf/2505.24553
Copy Paste: [[2505.24553]] CREFT: Sequential Multi-Agent LLM for Character Relation Extraction(https://arxiv.org/abs/2505.24553)
Keywords: language model, llm, agent
Abstract: Understanding complex character relations is crucial for narrative analysis and efficient script evaluation, yet existing extraction methods often fail to handle long-form narratives with nuanced interactions. To address this challenge, we present CREFT, a novel sequential framework leveraging specialized Large Language Model (LLM) agents. First, CREFT builds a base character graph through knowledge distillation, then iteratively refines character composition, relation extraction, role identification, and group assignments. Experiments on a curated Korean drama dataset demonstrate that CREFT significantly outperforms single-agent LLM baselines in both accuracy and completeness. By systematically visualizing character networks, CREFT streamlines narrative comprehension and accelerates script review -- offering substantial benefits to the entertainment, publishing, and educational sectors.

Title: Bench4KE: Benchmarking Automated Competency Question Generation

Authors: Anna Sofia Lippolis, Minh Davide Ragagni, Paolo Ciancarini, Andrea Giovanni Nuzzolese, Valentina Presutti
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24554
Pdf URL: https://arxiv.org/pdf/2505.24554
Copy Paste: [[2505.24554]] Bench4KE: Benchmarking Automated Competency Question Generation(https://arxiv.org/abs/2505.24554)
Keywords: language model, llm
Abstract: The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation, a trend already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs). However, the evaluation of these tools lacks standardisation. This undermines the methodological rigour and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. Its first release focuses on evaluating tools that generate CQs automatically. CQs are natural language questions used by ontology engineers to define the functional requirements of an ontology. Bench4KE provides a curated gold standard consisting of CQ datasets from four real-world ontology projects. It uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of four recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.

Title: NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization

Authors: Hyuntak Kim, Byung-Hak Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24575
Pdf URL: https://arxiv.org/pdf/2505.24575
Copy Paste: [[2505.24575]] NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization(https://arxiv.org/abs/2505.24575)
Keywords: llm, agent
Abstract: Summarizing long-form narratives--such as books, movies, and TV scripts--requires capturing intricate plotlines, character interactions, and thematic coherence, a task that remains challenging for existing LLMs. We introduce NexusSum, a multi-agent LLM framework for narrative summarization that processes long-form text through a structured, sequential pipeline--without requiring fine-tuning. Our approach introduces two key innovations: (1) Dialogue-to-Description Transformation: A narrative-specific preprocessing method that standardizes character dialogue and descriptive text into a unified format, improving coherence. (2) Hierarchical Multi-LLM Summarization: A structured summarization pipeline that optimizes chunk processing and controls output length for accurate, high-quality summaries. Our method establishes a new state-of-the-art in narrative summarization, achieving up to a 30.0% improvement in BERTScore (F1) across books, movies, and TV scripts. These results demonstrate the effectiveness of multi-agent LLMs in handling long-form content, offering a scalable approach for structured summarization in diverse storytelling domains.

Title: When Harry Meets Superman: The Role of The Interlocutor in Persona-Based Dialogue Generation

Authors: Daniela Occhipinti, Marco Guerini, Malvina Nissim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24613
Pdf URL: https://arxiv.org/pdf/2505.24613
Copy Paste: [[2505.24613]] When Harry Meets Superman: The Role of The Interlocutor in Persona-Based Dialogue Generation(https://arxiv.org/abs/2505.24613)
Keywords: llm, agent
Abstract: Endowing dialogue agents with persona information has proven to significantly improve the consistency and diversity of their generations. While much focus has been placed on aligning dialogues with provided personas, the adaptation to the interlocutor's profile remains largely underexplored. In this work, we investigate three key aspects: (1) a model's ability to align responses with both the provided persona and the interlocutor's; (2) its robustness when dealing with familiar versus unfamiliar interlocutors and topics, and (3) the impact of additional fine-tuning on specific persona-based dialogues. We evaluate dialogues generated with diverse speaker pairings and topics, framing the evaluation as an author identification task and employing both LLM-as-a-judge and human evaluations. By systematically masking or disclosing information about the interlocutor, we assess its impact on dialogue generation. Results show that access to the interlocutor's persona improves the recognition of the target speaker, while masking it does the opposite. Although models generalise well across topics, they struggle with unfamiliar interlocutors. Finally, we found that in zero-shot settings, LLMs often copy biographical details, facilitating identification but trivialising the task.

Title: Harnessing Large Language Models for Scientific Novelty Detection

Authors: Yan Liu, Zonglin Yang, Soujanya Poria, Thanh-Son Nguyen, Erik Cambria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24615
Pdf URL: https://arxiv.org/pdf/2505.24615
Copy Paste: [[2505.24615]] Harnessing Large Language Models for Scientific Novelty Detection(https://arxiv.org/abs/2505.24615)
Keywords: language model, llm
Abstract: In an era of exponential scientific growth, identifying novel research ideas is both crucial and challenging in academia. Despite its potential, research on novelty detection is hindered by the lack of an appropriate benchmark dataset. More importantly, simply adopting existing NLP technologies, e.g., retrieving and then cross-checking, is not a one-size-fits-all solution due to the gap between textual similarity and idea conception. In this paper, we propose to harness large language models (LLMs) for scientific novelty detection (ND), accompanied by two new datasets in the marketing and NLP domains. To construct suitable datasets for ND, we propose to extract closure sets of papers based on their relationship, and then summarize their main ideas using LLMs. To capture idea conception, we propose to train a lightweight retriever by distilling idea-level knowledge from LLMs to align ideas with similar conception, enabling efficient and accurate idea retrieval for LLM novelty detection. Experiments show that our method consistently outperforms others on the proposed benchmark datasets for idea retrieval and ND tasks. Code and data are available at this https URL.

Title: Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

Authors: Nikita Martynov, Anastasia Mordasheva, Dmitriy Gorbetskiy, Danil Astafurov, Ulyana Isaeva, Elina Basyrova, Sergey Skachkov, Victoria Berestova, Nikolay Ivanov, Valeriia Zanina, Alena Fenogenova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24616
Pdf URL: https://arxiv.org/pdf/2505.24616
Copy Paste: [[2505.24616]] Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX(https://arxiv.org/abs/2505.24616)
Keywords: language model, llm, prompt
Abstract: We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.

Title: Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization

Authors: Utsav Maskey, Chencheng Zhu, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24621
Pdf URL: https://arxiv.org/pdf/2505.24621
Copy Paste: [[2505.24621]] Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization(https://arxiv.org/abs/2505.24621)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) have transformed natural language understanding and generation, leading to extensive benchmarking across diverse tasks. However, cryptanalysis, a critical area for data security and encryption, has not yet been thoroughly explored in LLM evaluations. To address this gap, we evaluate the cryptanalytic potential of state-of-the-art LLMs on encrypted texts generated using a range of cryptographic algorithms. We introduce a novel benchmark dataset comprising diverse plain texts, spanning various domains, lengths, writing styles, and topics, paired with their encrypted versions. Using zero-shot and few-shot settings, we assess multiple LLMs for decryption accuracy and semantic comprehension across different encryption schemes. Our findings reveal key insights into the strengths and limitations of LLMs in side-channel communication while raising concerns about their susceptibility to jailbreaking attacks. This research highlights the dual-use nature of LLMs in security contexts and contributes to the ongoing discussion on AI safety and security.

Title: The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models

Authors: Junyi Li, Hwee Tou Ng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24630
Pdf URL: https://arxiv.org/pdf/2505.24630
Copy Paste: [[2505.24630]] The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models(https://arxiv.org/abs/2505.24630)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have significantly advanced in reasoning tasks through reinforcement learning (RL) optimization, achieving impressive capabilities across various challenging benchmarks. However, our empirical analysis reveals a critical drawback: reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. We theoretically analyze the RL training dynamics, identifying high-variance gradient, entropy-induced randomness, and susceptibility to spurious local optima as key factors leading to hallucinations. To address this drawback, we propose Factuality-aware Step-wise Policy Optimization (FSPO), an innovative RL fine-tuning algorithm incorporating explicit factuality verification at each reasoning step. FSPO leverages automated verification against given evidence to dynamically adjust token-level advantage values, incentivizing factual correctness throughout the reasoning process. Experiments across mathematical reasoning and hallucination benchmarks using Qwen2.5 and Llama models demonstrate that FSPO effectively reduces hallucinations while enhancing reasoning accuracy, substantially improving both reliability and performance.

Title: Disentangling Language and Culture for Evaluating Multilingual Large Language Models

Authors: Jiahao Ying, Wei Tang, Yiran Zhao, Yixin Cao, Yu Rong, Wenxuan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24635
Pdf URL: https://arxiv.org/pdf/2505.24635
Copy Paste: [[2505.24635]] Disentangling Language and Culture for Evaluating Multilingual Large Language Models(https://arxiv.org/abs/2505.24635)
Keywords: language model, llm
Abstract: This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions within both native and cross-cultural contexts cross-lingually. Extensive evaluations are conducted on a wide range of models, revealing a notable "Cultural-Linguistic Synergy" phenomenon, where models exhibit better performance when questions are culturally aligned with the language. This phenomenon is further explored through interpretability probing, which shows that a higher proportion of specific neurons are activated in a language's cultural context. This activation proportion could serve as a potential indicator for evaluating multilingual performance during model training. Our findings challenge the prevailing notion that LLMs, primarily trained on English data, perform uniformly across languages and highlight the necessity of culturally and linguistically informed model evaluations. Our code can be found at https://yingjiahao14. this http URL.

Title: Efficient Text Encoders for Labor Market Analysis

Authors: Jens-Joris Decorte, Jeroen Van Hautte, Chris Develder, Thomas Demeester
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24640
Pdf URL: https://arxiv.org/pdf/2505.24640
Copy Paste: [[2505.24640]] Efficient Text Encoders for Labor Market Analysis(https://arxiv.org/abs/2505.24640)
Keywords: language model, llm
Abstract: Labor market analysis relies on extracting insights from job advertisements, which provide valuable yet unstructured information on job titles and corresponding skill requirements. While state-of-the-art methods for skill extraction achieve strong performance, they depend on large language models (LLMs), which are computationally expensive and slow. In this paper, we propose ConTeXT-match, a novel contrastive learning approach with token-level attention that is well-suited for the extreme multi-label classification task of skill classification. ConTeXT-match significantly improves skill extraction efficiency and performance, achieving state-of-the-art results with a lightweight bi-encoder model. To support robust evaluation, we introduce Skill-XL, a new benchmark with exhaustive, sentence-level skill annotations that explicitly address the redundancy in the large label space. Finally, we present JobBERT V2, an improved job title normalization model that leverages extracted skills to produce high-quality job title representations. Experiments demonstrate that our models are efficient, accurate, and scalable, making them ideal for large-scale, real-time labor market analysis.

Title: Are Optimal Algorithms Still Optimal? Rethinking Sorting in LLM-Based Pairwise Ranking with Batching and Caching

Authors: Juan Wisznia, Cecilia Bolaños, Juan Tollo, Giovanni Marraffini, Agustín Gianolini, Noe Hsueh, Luciano Del Corro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24643
Pdf URL: https://arxiv.org/pdf/2505.24643
Copy Paste: [[2505.24643]] Are Optimal Algorithms Still Optimal? Rethinking Sorting in LLM-Based Pairwise Ranking with Batching and Caching(https://arxiv.org/abs/2505.24643)
Keywords: llm, prompt
Abstract: We introduce a novel framework for analyzing sorting algorithms in pairwise ranking prompting (PRP), re-centering the cost model around LLM inferences rather than traditional pairwise comparisons. While classical metrics based on comparison counts have traditionally been used to gauge efficiency, our analysis reveals that expensive LLM inferences overturn these predictions; accordingly, our framework encourages strategies such as batching and caching to mitigate inference costs. We show that algorithms optimal in the classical setting can lose efficiency when LLM inferences dominate the cost under certain optimizations.
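The cost-model shift described in this abstract can be illustrated with a small sketch. It is not the paper's framework: `CountingRanker`, `odd_even_sort`, and the toy preference function are hypothetical names, and the "LLM" is a plain Python callable. The sketch assumes that one inference call can answer a whole batch of independent pairwise comparisons and that repeated pairs can be served from a cache, so it counts comparisons and inference calls separately.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Pair = Tuple[Hashable, Hashable]


class CountingRanker:
    """Stand-in for an LLM pairwise ranker (hypothetical, for illustration)."""

    def __init__(self, prefer: Callable[[Hashable, Hashable], bool]):
        self.prefer = prefer              # True if the first item should rank first
        self.cache: Dict[Pair, bool] = {}
        self.comparisons = 0              # classical cost: pairwise comparisons
        self.inference_calls = 0          # assumed cost: batched LLM calls

    def compare_batch(self, pairs: Sequence[Pair]) -> List[bool]:
        fresh = [p for p in pairs if p not in self.cache]
        if fresh:
            # Assumption: one inference call answers every uncached pair.
            self.inference_calls += 1
            for a, b in fresh:
                if (a, b) not in self.cache:
                    self.cache[(a, b)] = self.prefer(a, b)
                    self.comparisons += 1
        return [self.cache[p] for p in pairs]


def odd_even_sort(items: Sequence[Hashable], ranker: CountingRanker) -> List[Hashable]:
    """Odd-even transposition sort: the comparisons inside a round are
    independent of each other, so each round fits in a single batch."""
    out = list(items)
    n = len(out)
    for rnd in range(n):
        idx = list(range(rnd % 2, n - 1, 2))
        answers = ranker.compare_batch([(out[i], out[i + 1]) for i in idx])
        for i, in_order in zip(idx, answers):
            if not in_order:
                out[i], out[i + 1] = out[i + 1], out[i]
    return out


docs = [5, 3, 8, 1, 9, 2]
ranker = CountingRanker(lambda a, b: a <= b)
ranked = odd_even_sort(docs, ranker)
print(ranked, ranker.comparisons, ranker.inference_calls)
```

Odd-even transposition sort is quadratic in comparisons, yet under this assumed cost model it needs at most one inference call per round, which is the kind of reversal the abstract points to.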

Title: Multiple LLM Agents Debate for Equitable Cultural Alignment

Authors: Dayeon Ki, Rachel Rudinger, Tianyi Zhou, Marine Carpuat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24671
Pdf URL: https://arxiv.org/pdf/2505.24671
Copy Paste: [[2505.24671]] Multiple LLM Agents Debate for Equitable Cultural Alignment(https://arxiv.org/abs/2505.24671)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) need to adapt their predictions to diverse cultural contexts to benefit diverse communities across the world. While previous efforts have focused on single-LLM, single-turn approaches, we propose to exploit the complementary strengths of multiple LLMs to promote cultural adaptability. We introduce a Multi-Agent Debate framework, where two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision. We propose two variants: one where the LLM agents exclusively debate and another where they dynamically choose between self-reflection and debate during their turns. We evaluate these approaches on 7 open-weight LLMs (and 21 LLM combinations) using the NormAd-ETI benchmark for social etiquette norms in 75 countries. Experiments show that debate improves both overall accuracy and cultural group parity over single-LLM baselines. Notably, multi-agent debate enables relatively small LLMs (7-9B) to achieve accuracies comparable to those of a much larger model (27B parameters).
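A two-agent debate loop of the kind described here might be sketched as follows. This is only an illustration under stated assumptions: the agents are plain Python callables rather than LLMs, the names `debate`, `stubborn`, and `follower` are hypothetical, and falling back to the first agent's answer when no consensus is reached is an arbitrary choice, not the paper's decision rule.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]          # (speaker, answer) turns
Agent = Callable[[str, Transcript], str]    # (scenario, transcript) -> answer


def debate(scenario: str, agent_a: Agent, agent_b: Agent,
           max_rounds: int = 3) -> Tuple[str, Transcript]:
    """Alternate turns until both agents give the same answer."""
    transcript: Transcript = []
    answer_a = ""
    for _ in range(max_rounds):
        answer_a = agent_a(scenario, transcript)
        transcript.append(("A", answer_a))
        answer_b = agent_b(scenario, transcript)
        transcript.append(("B", answer_b))
        if answer_a == answer_b:
            return answer_a, transcript
    # Assumption: without consensus, fall back to agent A's last answer.
    return answer_a, transcript


def stubborn(scenario: str, transcript: Transcript) -> str:
    """Toy agent that never changes its mind."""
    return "yes"


def follower(scenario: str, transcript: Transcript) -> str:
    """Toy agent that starts with 'no' and then adopts the last answer it heard."""
    return transcript[-1][1] if transcript else "no"


decision, turns = debate("Is it polite to arrive early?", follower, stubborn)
print(decision, len(turns))
```

In a real system each callable would wrap an LLM prompt that includes the scenario and the transcript so far.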

Title: TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis

Authors: Xiaorui Wu, Xiaofeng Mao, Fei Li, Xin Zhang, Xuanhong Li, Chong Teng, Donghong Ji, Zhuang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24672
Pdf URL: https://arxiv.org/pdf/2505.24672
Copy Paste: [[2505.24672]] TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis(https://arxiv.org/abs/2505.24672)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge demonstrates substantial improvements, achieving an average 14.29% reduction in Harm Score, and a 20% decrease in Attack Success Rate compared to the best-performing baseline model fine-tuned on the WildBreak dataset.

Title: A Simple Linear Patch Revives Layer-Pruned Large Language Models

Authors: Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, Chun Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24680
Pdf URL: https://arxiv.org/pdf/2505.24680
Copy Paste: [[2505.24680]] A Simple Linear Patch Revives Layer-Pruned Large Language Models(https://arxiv.org/abs/2505.24680)
Keywords: language model, llm
Abstract: Layer pruning has become a popular technique for compressing large language models (LLMs) due to its simplicity. However, existing layer pruning methods often suffer from significant performance drops. We identify that this degradation stems from the mismatch of activation magnitudes across layers and tokens at the pruning interface. To address this, we propose LinearPatch, a simple plug-and-play technique to revive layer-pruned LLMs. The proposed method adopts a Hadamard transformation to suppress massive outliers in particular tokens, and channel-wise scaling to align the activation magnitudes. These operations can be fused into a single matrix, which functions as a patch to bridge the pruning interface with negligible inference overhead. LinearPatch retains up to 94.15% of the original model's performance when pruning 5 layers of LLaMA-3-8B on the question answering benchmark, surpassing existing state-of-the-art methods by 4%. In addition, the patch matrix can be further optimized with memory-efficient offline knowledge distillation. With only 5K samples, the retained performance of LinearPatch can be further boosted to 95.16% within 30 minutes on a single computing card.
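The claim that a rotation and a channel-wise scaling can be fused into a single matrix can be checked with a small pure-Python sketch. The composition used here (rotate with a normalized Hadamard matrix, scale each channel, rotate back) is an assumption made for illustration; the abstract does not spell out the exact order, and `hadamard`, `matmul`, and `fused_patch` are hypothetical helper names.

```python
from typing import List

Matrix = List[List[float]]


def hadamard(n: int) -> Matrix:
    """Normalized n x n Hadamard matrix via Sylvester's construction."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H: Matrix = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    norm = n ** -0.5
    return [[v * norm for v in row] for row in H]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def fused_patch(scales: List[float]) -> Matrix:
    """P = H * diag(scales) * H^T, so x @ P equals rotate, scale, rotate back."""
    n = len(scales)
    H = hadamard(n)
    D = [[scales[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    Ht = [list(col) for col in zip(*H)]
    return matmul(matmul(H, D), Ht)


x = [[0.5, -1.0, 100.0, 2.0]]       # one token; channel 2 is a large outlier
scales = [1.0, 0.5, 2.0, 1.5]       # illustrative per-channel scales

H = hadamard(4)
rotated = matmul(x, H)              # the rotation spreads the outlier over channels
scaled = [[v * s for v, s in zip(rotated[0], scales)]]
step_by_step = matmul(scaled, [list(col) for col in zip(*H)])

fused = matmul(x, fused_patch(scales))  # a single matrix multiply
print(step_by_step, fused)
```

Because all three operations are linear, the fused matrix can be computed once offline, so applying it at inference costs one extra matrix multiply.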

Title: Should I Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation

Authors: Dayeon Ki, Kevin Duh, Marine Carpuat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24683
Pdf URL: https://arxiv.org/pdf/2505.24683
Copy Paste: [[2505.24683]] Should I Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation(https://arxiv.org/abs/2505.24683)
Keywords: llm
Abstract: As people increasingly use AI systems in work and daily life, feedback mechanisms that help them use AI responsibly are urgently needed, particularly in settings where users are not equipped to assess the quality of AI predictions. We study a realistic Machine Translation (MT) scenario where monolingual users decide whether to share an MT output, first without and then with quality feedback. We compare four types of quality feedback: explicit feedback that directly gives users an assessment of translation quality using 1) error highlights and 2) LLM explanations, and implicit feedback that helps users compare MT inputs and outputs through 3) backtranslation and 4) question-answer (QA) tables. We find that all feedback types, except error highlights, significantly improve both decision accuracy and appropriate reliance. Notably, implicit feedback, especially QA tables, yields significantly greater gains than explicit feedback in terms of decision accuracy, appropriate reliance, and user perceptions, receiving the highest ratings for helpfulness and trust, and the lowest for mental burden.

Title: Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration

Authors: Qinglin Zhu, Runcong Zhao, Hanqi Yan, Yulan He, Yudong Chen, Lin Gui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24688
Pdf URL: https://arxiv.org/pdf/2505.24688
Copy Paste: [[2505.24688]] Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration(https://arxiv.org/abs/2505.24688)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) struggle with complex reasoning due to limited diversity and inefficient search. We propose Soft Reasoning, an embedding-based search framework that optimises the embedding of the first token to guide generation. It combines (1) embedding perturbation for controlled exploration and (2) Bayesian optimisation to refine embeddings via a verifier-guided objective, balancing exploration and exploitation. This approach improves reasoning accuracy and coherence while avoiding reliance on heuristic search. Experiments demonstrate superior correctness with minimal computation, making it a scalable, model-agnostic solution.

Title: BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

Authors: Sander Land, Catherine Arnett
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24689
Pdf URL: https://arxiv.org/pdf/2505.24689
Copy Paste: [[2505.24689]] BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization(https://arxiv.org/abs/2505.24689)
Keywords: language model
Abstract: Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our experiments demonstrate that SCRIPT-BPE achieves competitive compression while eliminating encoding-based penalties for non-Latin-script languages.
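A rule-based pretokenizer that respects script boundaries can be sketched with the standard library alone. This is not the SCRIPT encoding itself: Python's `unicodedata` does not expose the Unicode Script property, so the sketch uses the first word of each character's Unicode name as a crude script proxy (an assumption that only works for simple cases), and the function names are hypothetical.

```python
import unicodedata
from itertools import groupby
from typing import List, Tuple


def script_proxy(ch: str) -> str:
    """Crude stand-in for the Unicode Script property: first word of the name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:                      # unnamed code points
        return "UNKNOWN"


def group_key(ch: str) -> Tuple[str, str]:
    """Group by major general category; letters are further split by script."""
    major = unicodedata.category(ch)[0]     # 'L', 'N', 'P', 'Z', ...
    return (major, script_proxy(ch)) if major == "L" else (major, "")


def pretokenize(text: str) -> List[str]:
    """Split text wherever the (category, script) key changes."""
    return ["".join(chars) for _, chars in groupby(text, key=group_key)]


print(pretokenize("Hello мир 123!"))
# → ['Hello', ' ', 'мир', ' ', '123', '!']
```

The point of the sketch is that the split rule needs no regular expressions: boundaries fall out of two per-character properties.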

Title: Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios

Authors: Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.24691
Pdf URL: https://arxiv.org/pdf/2505.24691
Copy Paste: [[2505.24691]] Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios(https://arxiv.org/abs/2505.24691)
Keywords: llm, chain-of-thought
Abstract: We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.

Title: Multi-Domain ABSA Conversation Dataset Generation via LLMs for Real-World Evaluation and Model Comparison

Authors: Tejul Pandit, Meet Raval, Dhvani Upadhyay
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24701
Pdf URL: https://arxiv.org/pdf/2505.24701
Copy Paste: [[2505.24701]] Multi-Domain ABSA Conversation Dataset Generation via LLMs for Real-World Evaluation and Model Comparison(https://arxiv.org/abs/2505.24701)
Keywords: language model, gpt, llm
Abstract: Aspect-Based Sentiment Analysis (ABSA) offers granular insights into opinions but often suffers from the scarcity of diverse, labeled datasets that reflect real-world conversational nuances. This paper presents an approach for generating synthetic ABSA data using Large Language Models (LLMs) to address this gap. We detail the generation process aimed at producing data with consistent topic and sentiment distributions across multiple domains using GPT-4o. The quality and utility of the generated data were evaluated by assessing the performance of three state-of-the-art LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and DeepSeek-R1) on topic and sentiment classification tasks. Our results demonstrate the effectiveness of the synthetic data, revealing distinct performance trade-offs among the models: DeepSeek-R1 showed higher precision, Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited strong recall, and Gemini 1.5 Pro offered significantly faster inference. We conclude that LLM-based synthetic data generation is a viable and flexible method for creating valuable ABSA resources, facilitating research and model evaluation without reliance on limited or inaccessible real-world labeled data.

Title: HESEIA: A community-based dataset for evaluating social biases in large language models, co-designed in real school settings in Latin America

Authors: Guido Ivetta (1 and 2), Marcos J. Gomez (1 and 2), Sofía Martinelli (1), Pietro Palombini (1), M. Emilia Echeveste (1 and 2), Nair Carolina Mazzeo (2), Beatriz Busaniche (2), Luciana Benotti (1 and 2) ((1) Universidad Nacional de Córdoba, Argentina, (2) Fundación Vía Libre)
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.24712
Pdf URL: https://arxiv.org/pdf/2505.24712
Copy Paste: [[2505.24712]] HESEIA: A community-based dataset for evaluating social biases in large language models, co-designed in real school settings in Latin America(https://arxiv.org/abs/2505.24712)
Keywords: language model, llm
Abstract: Most resources for evaluating social biases in Large Language Models are developed without co-design from the communities affected by these biases, and rarely involve participatory approaches. We introduce HESEIA, a dataset of 46,499 sentences created in a professional development course. The course involved 370 high-school teachers and 5,370 students from 189 Latin-American schools. Unlike existing benchmarks, HESEIA captures intersectional biases across multiple demographic axes and school subjects. It reflects local contexts through the lived experience and pedagogical expertise of educators. Teachers used minimal pairs to create sentences that express stereotypes relevant to their school subjects and communities. We show the dataset's diversity in terms of both the demographic axes represented and the knowledge areas included. We demonstrate that the dataset contains more stereotypes unrecognized by current LLMs than previous datasets. HESEIA is available to support bias assessments grounded in educational communities.

Title: FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation

Authors: Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, Zhiping Xiao, Jingshu Peng, Chengzhong Liu, Jiaming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, Yike Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24714
Pdf URL: https://arxiv.org/pdf/2505.24714
Copy Paste: [[2505.24714]] FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation(https://arxiv.org/abs/2505.24714)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The benchmark exhibits high robustness with prediction variations under different prompts remaining below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at this https URL and this https URL.

Title: Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

Authors: Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24726
Pdf URL: https://arxiv.org/pdf/2505.24726
Copy Paste: [[2505.24726]] Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning(https://arxiv.org/abs/2505.24726)
Keywords: language model, llm
Abstract: We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model's ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation writing and 18.1% improvement at function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.
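The two-stage procedure described in the Reflect, Retry, Reward abstract can be sketched as a single episode. This is a minimal illustration, not the authors' training code: `attempt`, `reflect`, and `verify` are hypothetical callables standing in for the model and the binary task checker, and the returned reward is the signal that a reinforcement learning step would, under this reading, apply to the reflection tokens only.

```python
from typing import Callable, Optional, Tuple


def reflect_retry_episode(
    task: str,
    attempt: Callable[[str, Optional[str]], str],  # (task, reflection or None) -> answer
    reflect: Callable[[str, str], str],            # (task, failed answer) -> reflection
    verify: Callable[[str, str], bool],            # binary success signal
) -> Tuple[Optional[str], float]:
    """Run one episode and return (reflection, reward for the reflection)."""
    first = attempt(task, None)
    if verify(task, first):
        # The first try succeeded, so no reflection is generated or rewarded.
        return None, 0.0

    # Stage 1: the model comments on its own failed attempt.
    reflection = reflect(task, first)

    # Stage 2: the model retries with the reflection in context.
    second = attempt(task, reflection)

    # Only a successful retry earns reward for the reflection.
    reward = 1.0 if verify(task, second) else 0.0
    return reflection, reward
```

Because the reward depends only on the binary outcome of the retry, the sketch needs no synthetic data and no reference solutions, which matches the setting the abstract describes.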

Title: Circuit Stability Characterizes Language Model Generalization

Authors: Alan Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24731
Pdf URL: https://arxiv.org/pdf/2505.24731
Copy Paste: [[2505.24731]] Circuit Stability Characterizes Language Model Generalization(https://arxiv.org/abs/2505.24731)
Keywords: language model
Abstract: Extensively evaluating the capabilities of (large) language models is difficult. Rapid development of state-of-the-art models induces benchmark saturation, while creating more challenging datasets is labor-intensive. Inspired by the recent developments in mechanistic interpretability, we introduce circuit stability as a new way to assess model performance. Circuit stability refers to a model's ability to apply a consistent reasoning process (its circuit) across various inputs. We mathematically formalize circuit stability and circuit equivalence. Then, through three case studies, we empirically show that circuit stability and the lack thereof can characterize and predict different aspects of generalization. Our proposed methods offer a step towards rigorously relating the generality of models to their interpretability.

Title: LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews

Authors: Christian Jaumann, Andreas Wiedholz, Annemarie Friedrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24757
Pdf URL: https://arxiv.org/pdf/2505.24757
Copy Paste: [[2505.24757]] LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews(https://arxiv.org/abs/2505.24757)
Keywords: language model, llm
Abstract: The scientific literature is growing rapidly, making it hard to keep track of the state of the art. Systematic literature reviews (SLRs) aim to identify and evaluate all relevant papers on a topic. After retrieving a set of candidate papers, the abstract screening phase determines initial relevance. To date, abstract screening methods using large language models (LLMs) focus on binary classification settings; existing question answering (QA) based ranking approaches suffer from error propagation. LLMs offer a unique opportunity to evaluate the SLR's inclusion and exclusion criteria, yet existing benchmarks do not provide them exhaustively. We manually extract these criteria as well as research questions for 57 SLRs, mostly in the medical domain, enabling principled comparisons between approaches. Moreover, we propose LGAR, a zero-shot LLM Guided Abstract Ranker composed of an LLM-based graded relevance scorer and a dense re-ranker. Our extensive experiments show that LGAR outperforms existing QA-based methods by 5-10 pp. in mean average precision. Our code and data are publicly available.
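The LGAR abstract reports its gains in mean average precision. For readers unfamiliar with the metric, the sketch below computes it from binary relevance labels in ranked order. It illustrates the standard definition only. The function names are hypothetical, and nothing here reproduces the LGAR scorer or re-ranker.

```python
from typing import List, Sequence


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Average precision for one ranking of 0/1 relevance labels.

    Precision is taken at each rank where a relevant item appears,
    then averaged over the relevant items.
    """
    hits = 0
    precisions: List[float] = []
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings: Sequence[Sequence[int]]) -> float:
    """Mean of the per-ranking average precision (one ranking per review)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

A ranking that places relevant abstracts first scores 1.0, so in a screening setting a higher value means reviewers reach the relevant papers earlier in the list.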

Title: From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning

Authors: Haoyu Li, Xuhong Li, Yiming Dong, Kun Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24768
Pdf URL: https://arxiv.org/pdf/2505.24768
Copy Paste: [[2505.24768]] From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning(https://arxiv.org/abs/2505.24768)
Keywords: language model, llm
Abstract: Dataset diversity plays a pivotal role for the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development. Despite increasing recognition of its importance, systematic analyses of dataset diversity still remain underexplored. To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which primarily focus on the instruction component, operating at either macroscopic (entire instruction semantics) or mesoscopic levels (instruction units), and furthermore introduces a novel analysis of microscopic diversity within the response component, specifically analyzing the statistical distribution of tokens in SFT training samples. In the experimental evaluation, we construct fixed-size datasets (e.g., 10,000 samples each) from a corpus of 117,000 open-source SFT samples, incorporating six distinct diversity-control strategies spanning macro-, meso-, and microscopic levels applied to both instructions and responses. We then fine-tune LLMs on these datasets to assess the six diversity-control strategies. Results reveal that while macroscopic and mesoscopic strategies lead to higher performance with increasing diversity, the microscopic strategy in responses exhibits both a stronger correlation between model performance and the degree of diversity and superior performance with maximum diversity across all strategies. These findings offer actionable insights for constructing high-performance SFT datasets.
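The abstract describes microscopic diversity as a property of the statistical distribution of tokens in the response component. One simple way to put a number on such a distribution is Shannon entropy over token frequencies. The sketch below is an assumed proxy, not the paper's measure: it uses whitespace tokenization instead of a model tokenizer, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def token_entropy(responses: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the token distribution across responses.

    Whitespace splitting is an assumption made for illustration; a real
    analysis would presumably use the fine-tuned model's tokenizer.
    """
    counts = Counter(tok for response in responses for tok in response.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A response set that reuses the same few tokens yields low entropy, while one that spreads probability mass over many tokens yields high entropy, which is the intuition behind comparing fixed-size datasets by token-level diversity.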

Title: Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?

Authors: Jiayu Liu, Qing Zong, Weiqi Wang, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24778
Pdf URL: https://arxiv.org/pdf/2505.24778
Copy Paste: [[2505.24778]] Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?(https://arxiv.org/abs/2505.24778)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., "fairly confident") instead of numerical values. However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to the difficulty of quantifying uncertainty associated with various markers. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker based confidence and actual model uncertainty. Our code is available at this https URL.
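The abstract defines marker confidence as the observed accuracy when a model employs a given epistemic marker. That definition reduces to a grouped accuracy, sketched below. The record format (a marker string paired with a correctness flag) and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def marker_confidence(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each epistemic marker to the accuracy of the answers that used it.

    Each record is (marker, is_correct), e.g. ("fairly confident", True).
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for marker, is_correct in records:
        totals[marker] += 1
        correct[marker] += int(is_correct)
    return {marker: correct[marker] / totals[marker] for marker in totals}
```

Computing this table separately on an in-distribution dataset and an out-of-distribution dataset, then comparing the values per marker, is one way to examine the stability question the abstract raises.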

Title: Drop Dropout on Single-Epoch Language Model Pretraining

Authors: Houjun Liu, John Bauer, Christopher D. Manning
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24788
Pdf URL: https://arxiv.org/pdf/2505.24788
Copy Paste: [[2505.24788]] Drop Dropout on Single-Epoch Language Model Pretraining(https://arxiv.org/abs/2505.24788)
Keywords: language model, llm
Abstract: Originally, dropout was seen as a breakthrough regularization technique that improved performance in almost all applications of deep learning by reducing overfitting. Yet, single-epoch pretraining tasks common to modern LLMs yield minimal overfitting, so dropout is generally not used for large LLMs. Nevertheless, no thorough empirical investigation has been done on the role of dropout in LM pretraining. Through experiments in single-epoch pretraining of both masked (BERT) and autoregressive (Pythia 160M and 1.4B) LMs with varying levels of dropout, we find that downstream performance in language modeling, morpho-syntax (BLiMP), question answering (SQuAD), and natural-language inference (MNLI) improves when dropout is not applied during pretraining. We additionally find that the recently introduced "early dropout" also degrades performance relative to applying no dropout at all. We further investigate the models' editability, and find that models trained without dropout are more successful in gradient-based model editing (MEND) and equivalent in representation-based model editing (ReFT). Therefore, we advocate dropping dropout during single-epoch pretraining.

Title: Guiding Generative Storytelling with Knowledge Graphs

Authors: Zhijun Pan, Antonios Andronis, Eva Hayek, Oscar AP Wilkinson, Ilya Lasy, Annette Parry, Guy Gadney, Tim J. Smith, Mick Grierson
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2505.24803
Pdf URL: https://arxiv.org/pdf/2505.24803
Copy Paste: [[2505.24803]] Guiding Generative Storytelling with Knowledge Graphs(https://arxiv.org/abs/2505.24803)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have shown great potential in automated story generation, but challenges remain in maintaining long-form coherence and providing users with intuitive and effective control. Retrieval-Augmented Generation (RAG) has proven effective in reducing hallucinations in text generation; however, the use of structured data to support generative storytelling remains underexplored. This paper investigates how knowledge graphs (KGs) can enhance LLM-based storytelling by improving narrative quality and enabling user-driven modifications. We propose a KG-assisted storytelling pipeline and evaluate its effectiveness through a user study with 15 participants. Participants created their own story prompts, generated stories, and edited knowledge graphs to shape their narratives. Through quantitative and qualitative analysis, our findings demonstrate that knowledge graphs significantly enhance story quality in action-oriented and structured narratives within our system settings. Additionally, editing the knowledge graph increases users' sense of control, making storytelling more engaging, interactive, and playful.

Title: LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text

Authors: Li yunhan, Wu gengshen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.24826
Pdf URL: https://arxiv.org/pdf/2505.24826
Copy Paste: [[2505.24826]] LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text(https://arxiv.org/abs/2505.24826)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of 2.7% noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs. This work not only establishes standardized evaluation protocols for legal LLMs but also uncovers fundamental limitations in current training data refinement approaches. Code and models are available at: this https URL.

Title: Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs

Authors: Juraj Vladika, Annika Domres, Mai Nguyen, Rebecca Moser, Jana Nano, Felix Busch, Lisa C. Adams, Keno K. Bressem, Denise Bernhardt, Stephanie E. Combs, Kai J. Borm, Florian Matthes, Jan C. Peeken
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24830
Pdf URL: https://arxiv.org/pdf/2505.24830
Copy Paste: [[2505.24830]] Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs(https://arxiv.org/abs/2505.24830)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this work, we introduce a novel atomic fact-checking framework designed to enhance the reliability and explainability of LLMs used in medical long-form question answering. This method decomposes LLM-generated responses into discrete, verifiable units called atomic facts, each of which is independently verified against an authoritative knowledge base of medical guidelines. This approach enables targeted correction of errors and direct tracing to source literature, thereby improving the factual accuracy and explainability of medical Q&A. Extensive evaluation using multi-reader assessments by medical experts and an automated open Q&A benchmark demonstrated significant improvements in factual accuracy and explainability. Our framework achieved up to a 40% overall answer improvement and a 50% hallucination detection rate. The ability to trace each atomic fact back to the most relevant chunks from the database provides a granular, transparent explanation of the generated responses, addressing a major gap in current medical AI applications. This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare.

Title: How much do language models memorize?

Authors: John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, Saeed Mahloujifar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24832
Pdf URL: https://arxiv.org/pdf/2505.24832
Copy Paste: [[2505.24832]] How much do language models memorize?(https://arxiv.org/abs/2505.24832)
Keywords: language model, gpt
Abstract: We propose a new method for estimating how much a model "knows" about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point "grokking" begins, and unintended memorization decreases as models begin to generalize. We train hundreds of transformer language models ranging from 500K to 1.5B parameters and produce a series of scaling laws relating model capacity and data size to membership inference.
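The capacity estimate of approximately 3.6 bits per parameter becomes more concrete as a short calculation. The sketch below multiplies that figure by the two ends of the parameter range quoted in the abstract. Treating capacity as exactly linear in parameter count is a simplifying assumption made here, and the function names are hypothetical.

```python
BITS_PER_PARAM = 3.6  # approximate figure quoted in the abstract


def capacity_bits(n_params: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated memorization capacity in bits, assuming linear scaling."""
    return n_params * bits_per_param


def capacity_megabytes(n_params: float) -> float:
    """Same estimate expressed in megabytes (10**6 bytes)."""
    return capacity_bits(n_params) / 8 / 1e6


if __name__ == "__main__":
    for n in (500_000, 1_500_000_000):
        print(f"{n:>13,} params -> {capacity_megabytes(n):,.3f} MB")
```

Under this assumption the smallest model in the study (500K parameters) could hold about 0.225 MB of memorized information and the largest (1.5B parameters) about 675 MB, which gives a sense of how little data it takes to fill a model's capacity.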

Title: MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs

Authors: Gabrielle Kaili-May Liu, Gal Yona, Avi Caciularu, Idan Szpektor, Tim G. J. Rudner, Arman Cohan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.24858
Pdf URL: https://arxiv.org/pdf/2505.24858
Copy Paste: [[2505.24858]] MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs(https://arxiv.org/abs/2505.24858)
Keywords: llm, prompt
Abstract: A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. We present the first systematic study of faithful confidence calibration of LLMs, benchmarking models' ability to use linguistic expressions of uncertainty that faithfully reflect their intrinsic uncertainty, across a comprehensive array of models, datasets, and prompting strategies. Our results demonstrate that LLMs largely fail at this task, and that existing interventions are insufficient: standard prompt approaches provide only marginal gains, and existing, factuality-based calibration techniques can even harm faithful calibration. To address this critical gap, we introduce MetaFaith, a novel prompt-based calibration approach inspired by human metacognition. We show that MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving an 83% win rate over original generations as judged by humans.

Title: ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Authors: Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24864
Pdf URL: https://arxiv.org/pdf/2505.24864
Copy Paste: [[2505.24864]] ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models(https://arxiv.org/abs/2505.24864)
Keywords: language model
Abstract: Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the task competence of the base model and with training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: this https URL
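The ProRL abstract compares models on pass@k evaluations. The abstract does not say how pass@k is computed, so the sketch below uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), as an assumption. Here n is the number of samples drawn per problem, c is the number that passed verification, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k samples is correct.

    n: total samples drawn for the problem
    c: number of those samples that passed verification
    k: evaluation budget, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A problem on which the base model never succeeds has c = 0 and therefore pass@k = 0 for every k, which corresponds to the case the abstract describes as failing entirely regardless of the number of attempts.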