2025-05-29

Title: More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Authors: Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, Sheng Liu
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.21523
Pdf URL: https://arxiv.org/pdf/2505.21523
Copy Paste: [[2505.21523]] More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models(https://arxiv.org/abs/2505.21523)
Keywords: language model, hallucination
Abstract: Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
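The abstract describes RH-AUC only as a measure of how perception accuracy changes with reasoning length and does not give its formula. The following is a minimal sketch, assuming an area-under-curve style summary: it computes the normalized trapezoidal area under an accuracy-versus-length curve. The function name `length_accuracy_auc` and the sample numbers are hypothetical and are not taken from the paper.

```python
def length_accuracy_auc(lengths, accuracies):
    """Normalized trapezoidal area under an accuracy-vs-reasoning-length curve.

    Assumption: this is an illustrative summary statistic, not the paper's
    exact RH-AUC definition. A flat curve at accuracy a returns a.
    """
    if len(lengths) != len(accuracies) or len(lengths) < 2:
        raise ValueError("need at least two (length, accuracy) points")
    points = sorted(zip(lengths, accuracies))  # order points by reasoning length
    span = points[-1][0] - points[0][0]
    if span <= 0:
        raise ValueError("reasoning lengths must not all be equal")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between neighbors
    return area / span  # divide by the length range so the score stays in [0, 1]


# Made-up measurements: perception accuracy at several reasoning budgets (tokens).
lengths = [128, 256, 512, 1024]
perception_acc = [0.82, 0.78, 0.70, 0.61]
print(round(length_accuracy_auc(lengths, perception_acc), 3))
```

Under this sketch, a model whose perception accuracy stays flat as reasoning grows scores higher than one whose accuracy decays, which matches the trade-off the abstract describes.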

Title: Rethinking Data Mixture for Large Language Models: A Comprehensive Survey and New Perspectives

Authors: Yajiao Liu, Congliang Chen, Junchi Yang, Ruoyu Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21598
Pdf URL: https://arxiv.org/pdf/2505.21598
Copy Paste: [[2505.21598]] Rethinking Data Mixture for Large Language Models: A Comprehensive Survey and New Perspectives(https://arxiv.org/abs/2505.21598)
Keywords: language model
Abstract: Training large language models with data collected from various domains can improve their performance on downstream tasks. However, given a fixed training budget, the sampling proportions of these different domains significantly impact the model's performance. How can we determine the domain weights across different data domains to train the best-performing model within constrained computational resources? In this paper, we provide a comprehensive overview of existing data mixture methods. First, we propose a fine-grained categorization of existing methods, extending beyond the previous offline and online classification. Offline methods are further grouped into heuristic-based, algorithm-based, and function fitting-based methods. For online methods, we categorize them into three groups: online min-max optimization, online mixing law, and other approaches by drawing connections with the optimization frameworks underlying offline methods. Second, we summarize the problem formulations, representative algorithms for each subtype of offline and online methods, and clarify the relationships and distinctions among them. Finally, we discuss the advantages and disadvantages of each method and highlight key challenges in the field of data mixture.

Title: R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

Authors: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Subjects: cs.CL, cs.AI, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2505.21600
Pdf URL: https://arxiv.org/pdf/2505.21600
Copy Paste: [[2505.21600]] R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing(https://arxiv.org/abs/2505.21600)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce **Roads to Rome (R2R)**, a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available.
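The routing idea in this abstract can be sketched as a decoding loop in which a small model drafts every token and a router decides, per token, whether to replace the draft with the large model's token. This is a toy sketch under stated assumptions: `routed_decode`, the lookup-table "models", and the oracle `needs_large` router are all hypothetical stand-ins, whereas the paper trains a neural router on automatically generated labels.

```python
from typing import Callable, List, Tuple

Token = str
NextToken = Callable[[List[Token]], Token]
Router = Callable[[List[Token], Token], bool]


def routed_decode(prompt: List[Token], small: NextToken, large: NextToken,
                  needs_large: Router, max_new_tokens: int,
                  eos: Token = "<eos>") -> Tuple[List[Token], int]:
    """Generate tokens with the small model, deferring to the large model
    only when the router flags the drafted token as path-divergent."""
    context = list(prompt)
    generated: List[Token] = []
    large_calls = 0
    for _ in range(max_new_tokens):
        draft = small(context)            # cheap draft for every position
        if needs_large(context, draft):   # router inspects context and draft
            token = large(context)        # expensive call only when flagged
            large_calls += 1
        else:
            token = draft
        if token == eos:
            break
        context.append(token)
        generated.append(token)
    return generated, large_calls


# Toy stand-ins: the "large model" follows TARGET; the "small model" agrees
# everywhere except position 4, where it drafts a wrong token.
TARGET = ["2", "+", "2", "=", "4", "<eos>"]
PROMPT = ["Q:"]


def large_model(context: List[Token]) -> Token:
    return TARGET[len(context) - len(PROMPT)]


def small_model(context: List[Token]) -> Token:
    position = len(context) - len(PROMPT)
    return "5" if position == 4 else TARGET[position]


def oracle_router(context: List[Token], draft: Token) -> bool:
    # Hypothetical oracle: flags exactly the position where the models diverge.
    return len(context) - len(PROMPT) == 4


tokens, calls = routed_decode(PROMPT, small_model, large_model, oracle_router, 10)
print(tokens, calls)
```

In this toy run the large model is consulted for a single position, which is the efficiency argument the abstract makes: most tokens never need the expensive model.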

Title: How does Misinformation Affect Large Language Model Behaviors and Preferences?

Authors: Miao Peng, Nuo Chen, Jianheng Tang, Jia Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21608
Pdf URL: https://arxiv.org/pdf/2505.21608
Copy Paste: [[2505.21608]] How does Misinformation Affect Large Language Model Behaviors and Preferences?(https://arxiv.org/abs/2505.21608)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks, while they remain vulnerable when encountering misinformation. Existing studies have explored the role of LLMs in combating misinformation, but there is still a lack of fine-grained analysis on the specific aspects and extent to which LLMs are influenced by misinformation. To bridge this gap, we present MisBench, currently the largest and most comprehensive benchmark for evaluating LLMs' behavior and knowledge preference toward misinformation. MisBench consists of 10,346,712 pieces of misinformation, which uniquely considers both knowledge-based conflicts and stylistic variations in misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they still remain susceptible to knowledge conflicts and stylistic variations. Based on these findings, we further propose a novel approach called Reconstruct to Discriminate (RtD) to strengthen LLMs' ability to detect misinformation. Our study provides valuable insights into LLMs' interactions with misinformation, and we believe MisBench can serve as an effective benchmark for evaluating LLM-based detectors and enhancing their reliability in real-world applications. Code and data are available.

Title: Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations

Authors: Zeinab Dehghani, Koorosh Aslansefat, Adil Khan, Mohammed Naveed Akram
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21657
Pdf URL: https://arxiv.org/pdf/2505.21657
Copy Paste: [[2505.21657]] Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations(https://arxiv.org/abs/2505.21657)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models like GPT, LLAMA, and Claude have become incredibly powerful at generating text, but they are still black boxes, so it is hard to understand how they decide what to say. That lack of transparency can be problematic, especially in fields where trust and accountability matter. To help with this, we introduce SMILE, a new method that explains how these models respond to different parts of a prompt. SMILE is model-agnostic and works by slightly changing the input, measuring how the output changes, and then highlighting which words had the most impact. It produces simple visual heat maps showing which parts of a prompt matter the most. We tested SMILE on several leading LLMs and used metrics such as accuracy, consistency, stability, and fidelity to show that it gives clear and reliable explanations. By making these models easier to understand, SMILE brings us one step closer to making AI more transparent and trustworthy.
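The perturb-and-measure loop described in this abstract can be illustrated with a simplified leave-one-word-out attribution. This is a sketch, not SMILE itself: the abstract does not specify SMILE's perturbation scheme or distance measure, and `word_importance` and `toy_score` are hypothetical names, with `toy_score` standing in for a black-box model that returns one number per prompt.

```python
from typing import Callable, List, Tuple


def word_importance(prompt: str,
                    score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Leave-one-word-out attribution: drop each word in turn and record
    how much the black-box score moves relative to the full prompt."""
    words = prompt.split()
    base = score(" ".join(words))
    importances = []
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])  # prompt without word i
        importances.append((word, abs(base - score(perturbed))))
    return importances


def toy_score(text: str) -> float:
    # Hypothetical stand-in for a model output: a fixed per-word weight sum.
    weights = {"urgent": 0.6, "refund": 0.3}
    return sum(weights.get(w.lower(), 0.0) for w in text.split())


ranked = sorted(word_importance("Please send urgent refund", toy_score),
                key=lambda pair: pair[1], reverse=True)
for word, delta in ranked:
    print(f"{word:>8}  {delta:.2f}")
```

The per-word deltas are the values a heat map would color; with a real LLM, `score` would typically be a distance between the original and perturbed outputs.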

Title: Rethinking the Outlier Distribution in Large Language Models: An In-depth Study

Authors: Rahul Raman, Khushi Sharma, Sai Qian Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21670
Pdf URL: https://arxiv.org/pdf/2505.21670
Copy Paste: [[2505.21670]] Rethinking the Outlier Distribution in Large Language Models: An In-depth Study(https://arxiv.org/abs/2505.21670)
Keywords: language model, llm
Abstract: Investigating outliers in large language models (LLMs) is crucial due to their significant impact on various aspects of LLM performance, including quantization and compression. Outliers often cause considerable quantization errors, leading to degraded model performance. Identifying and addressing these outliers can enhance the accuracy and efficiency of the quantization process, enabling smoother deployment on edge devices or specialized hardware. Recent studies have identified two common types of outliers in LLMs: massive activations and channel-wise outliers. While numerous quantization algorithms have been proposed to mitigate their effects and maintain satisfactory accuracy, few have thoroughly explored the root causes of these outliers in depth. In this paper, we conduct a comprehensive investigation into the formation mechanisms of these outliers and propose potential strategies to mitigate their occurrence. Ultimately, we introduce some efficient approaches to eliminate most massive activations and channel-wise outliers with minimal impact on accuracy.

Title: LLMPR: A Novel LLM-Driven Transfer Learning based Petition Ranking Model

Authors: Avijit Gayen, Somyajit Chakraborty, Mainak Sen, Soham Paul, Angshuman Jana
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21689
Pdf URL: https://arxiv.org/pdf/2505.21689
Copy Paste: [[2505.21689]] LLMPR: A Novel LLM-Driven Transfer Learning based Petition Ranking Model(https://arxiv.org/abs/2505.21689)
Keywords: language model, llm
Abstract: The persistent accumulation of unresolved legal cases, especially within the Indian judiciary, significantly hampers the timely delivery of justice. Manual methods of prioritizing petitions are often prone to inefficiencies and subjective biases, further exacerbating delays. To address this issue, we propose LLMPR (Large Language Model-based Petition Ranking), an automated framework that utilizes transfer learning and machine learning to assign priority rankings to legal petitions based on their contextual urgency. Leveraging the ILDC dataset comprising 7,593 annotated petitions, we process unstructured legal text and extract features through various embedding techniques, including DistilBERT, LegalBERT, and MiniLM. These textual embeddings are combined with quantitative indicators such as gap days, rank scores, and word counts to train multiple machine learning models, including Random Forest, Decision Tree, XGBoost, LightGBM, and CatBoost. Our experiments demonstrate that Random Forest and Decision Tree models yield superior performance, with accuracy exceeding 99% and a Spearman rank correlation of 0.99. Notably, models using only numerical features achieve nearly optimal ranking results (R² = 0.988, ρ = 0.998), while LLM-based embeddings offer only marginal gains. These findings suggest that automated petition ranking can effectively streamline judicial workflows, reduce case backlog, and improve fairness in legal prioritization.

Title: MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs

Authors: Raoyuan Zhao, Beiduo Chen, Barbara Plank, Michael A. Hedderich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21693
Pdf URL: https://arxiv.org/pdf/2505.21693
Copy Paste: [[2505.21693]] MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs(https://arxiv.org/abs/2505.21693)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive multilingual evaluation remains challenging due to limited benchmarks and questionable translation quality. To better assess these disparities, we introduce MAKIEval, an automatic multilingual framework for evaluating cultural awareness in LLMs across languages, regions, and topics. MAKIEval evaluates open-ended text generation, capturing how models express culturally grounded knowledge in natural language. Leveraging Wikidata's multilingual structure as a cross-lingual anchor, it automatically identifies cultural entities in model outputs and links them to structured knowledge, enabling scalable, language-agnostic evaluation without manual annotation or translation. We then introduce four metrics that capture complementary dimensions of cultural awareness: granularity, diversity, cultural specificity, and consensus across languages. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems, across 13 languages, 19 countries and regions, and 6 culturally salient topics (e.g., food, clothing). Notably, we find that models tend to exhibit stronger cultural awareness in English, suggesting that English prompts more effectively activate culturally grounded knowledge. We publicly release our code and data.

Title: Do We Know What LLMs Don't Know? A Study of Consistency in Knowledge Probing

Authors: Raoyuan Zhao, Abdullatif Köksal, Ali Modarressi, Michael A. Hedderich, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21701
Pdf URL: https://arxiv.org/pdf/2505.21701
Copy Paste: [[2505.21701]] Do We Know What LLMs Don't Know? A Study of Consistency in Knowledge Probing(https://arxiv.org/abs/2505.21701)
Keywords: language model, llm, prompt
Abstract: The reliability of large language models (LLMs) is greatly compromised by their tendency to hallucinate, underscoring the need for precise identification of knowledge gaps within LLMs. Various methods for probing such gaps exist, ranging from calibration-based to prompting-based methods. To evaluate these probing methods, in this paper, we propose a new process based on using input variations and quantitative metrics. Through this, we expose two dimensions of inconsistency in knowledge gap probing. (1) Intra-method inconsistency: Minimal non-semantic perturbations in prompts lead to considerable variance in detected knowledge gaps within the same probing method; e.g., the simple variation of shuffling answer options can decrease agreement to around 40%. (2) Cross-method inconsistency: Probing methods contradict each other on whether a model knows the answer. Methods are highly inconsistent -- with decision consistency across methods being as low as 7% -- even though the model, dataset, and prompt are all the same. These findings challenge existing probing methods and highlight the urgent need for perturbation-robust probing frameworks.
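The two kinds of inconsistency in this abstract both reduce to comparing per-question "knows / does not know" decisions between two runs. Below is a minimal sketch, assuming consistency is measured as the fraction of questions with identical decisions; the paper's exact metrics may differ, and `decision_agreement` and the sample decisions are hypothetical.

```python
from typing import Sequence


def decision_agreement(decisions_a: Sequence[bool],
                       decisions_b: Sequence[bool]) -> float:
    """Fraction of questions on which two probing runs reach the same
    knows / does-not-know decision."""
    if len(decisions_a) != len(decisions_b) or not decisions_a:
        raise ValueError("need two non-empty decision lists of equal length")
    same = sum(a == b for a, b in zip(decisions_a, decisions_b))
    return same / len(decisions_a)


# Made-up decisions for five questions (True = "model knows the answer").
original_order = [True, True, False, True, False]
shuffled_options = [True, False, True, False, False]  # same method, options shuffled
other_method = [False, False, True, False, True]      # different probing method

print("intra-method agreement:", decision_agreement(original_order, shuffled_options))
print("cross-method agreement:", decision_agreement(original_order, other_method))
```

Comparing a method against a perturbed run of itself gives the intra-method view; comparing two different methods on identical inputs gives the cross-method view.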

Title: Assessing and Refining ChatGPT's Performance in Identifying Targeting and Inappropriate Language: A Comparative Study

Authors: Barbarestani Baran, Maks Isa, Vossen Piek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21710
Pdf URL: https://arxiv.org/pdf/2505.21710
Copy Paste: [[2505.21710]] Assessing and Refining ChatGPT's Performance in Identifying Targeting and Inappropriate Language: A Comparative Study(https://arxiv.org/abs/2505.21710)
Keywords: gpt, chat
Abstract: This study evaluates the effectiveness of ChatGPT, an advanced AI model for natural language processing, in identifying targeting and inappropriate language in online comments. With the increasing challenge of moderating vast volumes of user-generated content on social network sites, the role of AI in content moderation has gained prominence. We compared ChatGPT's performance against crowd-sourced annotations and expert evaluations to assess its accuracy, scope of detection, and consistency. Our findings highlight that ChatGPT performs well in detecting inappropriate content, showing notable improvements in accuracy through iterative refinements, particularly in Version 6. However, its performance in targeting language detection showed variability, with higher false positive rates compared to expert judgments. This study contributes to the field by demonstrating the potential of AI models like ChatGPT to enhance automated content moderation systems while also identifying areas for further improvement. The results underscore the importance of continuous model refinement and contextual understanding to better support automated moderation and mitigate harmful online behavior.

Title: Counterfactual Simulatability of LLM Explanations for Generation Tasks

Authors: Marvin Limpijankit, Yanda Chen, Melanie Subbiah, Nicholas Deas, Kathleen McKeown
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21740
Pdf URL: https://arxiv.org/pdf/2505.21740
Copy Paste: [[2505.21740]] Counterfactual Simulatability of LLM Explanations for Generation Tasks(https://arxiv.org/abs/2505.21740)
Keywords: llm, prompt
Abstract: LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability, how well an explanation allows users to infer the model's output on related counterfactuals. Counterfactual simulatability has been previously studied for yes/no question answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that the evaluation for counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.

Title: BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

Authors: Yubin Kim, Zhiyuan Hu, Hyewon Jeong, Eugene Park, Shuyue Stella Li, Chanwoo Park, Shiyun Xiong, MingYu Lu, Hyeonhoon Lee, Xin Liu, Daniel McDuff, Cynthia Breazeal, Samir Tulebaev, Hae Won Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21757
Pdf URL: https://arxiv.org/pdf/2505.21757
Copy Paste: [[2505.21757]] BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum(https://arxiv.org/abs/2505.21757)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnosis reasoning), LLMs often struggle with proactive engagement, like unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal LLMs' inconsistent proactivity. To address this, we propose BehaviorSFT, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blind clinician evaluations confirmed that BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a superior balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) versus standard fine-tuning or explicitly instructed agents.

Title: Calibrating LLM Confidence by Probing Perturbed Representation Stability

Authors: Reza Khanmohammadi, Erfan Miahi, Mehrsa Mardikoraem, Simerjot Kaur, Ivan Brugere, Charese H. Smiley, Kundan Thind, Mohammad M. Ghassemi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21772
Pdf URL: https://arxiv.org/pdf/2505.21772
Copy Paste: [[2505.21772]] Calibrating LLM Confidence by Probing Perturbed Representation Stability(https://arxiv.org/abs/2505.21772)
Keywords: language model, llm
Abstract: Miscalibration in Large Language Models (LLMs) undermines their reliability, highlighting the need for accurate confidence estimation. We introduce CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability), a novel method analyzing internal representational stability in LLMs. CCPS applies targeted adversarial perturbations to final hidden states, extracts features reflecting the model's response to these perturbations, and uses a lightweight classifier to predict answer correctness. CCPS was evaluated on LLMs from 8B to 32B parameters (covering Llama, Qwen, and Mistral architectures) using MMLU and MMLU-Pro benchmarks in both multiple-choice and open-ended formats. Our results show that CCPS significantly outperforms current approaches. Across four LLMs and three MMLU variants, CCPS reduces Expected Calibration Error by approximately 55% and Brier score by 21%, while increasing accuracy by 5 percentage points, Area Under the Precision-Recall Curve by 4 percentage points, and Area Under the Receiver Operating Characteristic Curve by 6 percentage points, all relative to the strongest prior method. CCPS delivers an efficient, broadly applicable, and more accurate solution for estimating LLM confidence, thereby improving their trustworthiness.
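Two of the metrics reported in this abstract, Expected Calibration Error and the Brier score, have standard definitions that are easy to compute by hand. The sketch below uses equal-width confidence bins for ECE; the bin count of 10 is an assumption (the abstract does not state the binning used), and the sample confidences are made up.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between predicted confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """ECE with equal-width bins: the sample-weighted average of
    |mean confidence - accuracy| over the non-empty bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_conf - accuracy)
    return ece


# Made-up example: the model claims 90% confidence but is right 3 times out of 4.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [1, 1, 1, 0]
print(round(expected_calibration_error(confidences, correct), 3))
print(round(brier_score(confidences, correct), 3))
```

Lower is better for both metrics: ECE measures only the gap between stated confidence and observed accuracy, while the Brier score also rewards confident correct answers.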

Title: VeriTrail: Closed-Domain Hallucination Detection with Traceability

Authors: Dasha Metropolitansky, Jonathan Larson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21786
Pdf URL: https://arxiv.org/pdf/2505.21786
Copy Paste: [[2505.21786]] VeriTrail: Closed-Domain Hallucination Detection with Traceability(https://arxiv.org/abs/2505.21786)
Keywords: language model, hallucination
Abstract: Even when instructed to adhere to source material, Language Models often generate unsubstantiated content - a phenomenon known as "closed-domain hallucination." This risk is amplified in processes with multiple generative steps (MGS), compared to processes with a single generative step (SGS). However, due to the greater complexity of MGS processes, we argue that detecting hallucinations in their final outputs is necessary but not sufficient: it is equally important to trace where hallucinated content was likely introduced and how faithful content may have been derived from the source through intermediate outputs. To address this need, we present VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for both MGS and SGS processes. We also introduce the first datasets to include all intermediate outputs as well as human annotations of final outputs' faithfulness for their respective MGS processes. We demonstrate that VeriTrail outperforms baseline methods on both datasets.

Title: Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries

Authors: Vishakh Padmakumar, Zichao Wang, David Arbour, Jennifer Healey
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21859
Pdf URL: https://arxiv.org/pdf/2505.21859
Copy Paste: [[2505.21859]] Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries(https://arxiv.org/abs/2505.21859)
Keywords: language model, llm, prompt
Abstract: While large language models (LLMs) are increasingly capable of handling longer contexts, recent work has demonstrated that they exhibit the "lost in the middle" phenomenon (Liu et al., 2024) of unevenly attending to different parts of the provided context. This hinders their ability to cover diverse source material in multi-document summarization, as noted in the DiverseSumm benchmark (Huang et al., 2024). In this work, we contend that principled content selection is a simple way to increase source coverage on this task. As opposed to prompting an LLM to perform the summarization in a single step, we explicitly divide the task into three steps -- (1) reducing document collections to atomic key points, (2) using determinantal point processes (DPP) to select key points that prioritize diverse content, and (3) rewriting to the final summary. By combining prompting steps, for extraction and rewriting, with principled techniques, for content selection, we consistently improve source coverage on the DiverseSumm benchmark across various LLMs. Finally, we also show that by incorporating relevance to a provided user intent into the DPP kernel, we can generate personalized summaries that cover relevant source information while retaining coverage.
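The content-selection step (2) can be illustrated with greedy MAP inference for a DPP, which repeatedly adds the item that most increases the determinant of the selected kernel submatrix. This is a sketch under stated assumptions: the abstract does not say which DPP inference procedure the authors use, and the 3x3 kernel below is made up, with diagonal entries playing the role of per-key-point quality and off-diagonal entries the role of similarity.

```python
from typing import List

Matrix = List[List[float]]


def determinant(matrix: Matrix) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def greedy_dpp(kernel: Matrix, k: int) -> List[int]:
    """Greedily pick up to k indices, each time maximizing det(kernel[S, S])."""
    selected: List[int] = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no remaining item gives a positive determinant
            break
        selected.append(best)
    return selected


# Made-up kernel: key points 0 and 1 are near-duplicates (similarity 0.9);
# key point 2 is lower quality (0.8 on the diagonal) but different.
kernel = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 0.8],
]
print(greedy_dpp(kernel, 2))
```

The determinant shrinks when selected items are similar, so the second pick favors the distinct key point over the near-duplicate. Relevance to a user intent could be folded in by scaling the kernel entries with per-item relevance scores, which is one common way to build such a kernel and is again an assumption here.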

Title: Evaluating the Retrieval Robustness of Large Language Models

Authors: Shuyang Cao, Karthik Radhakrishnan, David Rosenberg, Steven Lu, Pengxiang Cheng, Lu Wang, Shiyue Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21870
Pdf URL: https://arxiv.org/pdf/2505.21870
Copy Paste: [[2505.21870]] Evaluating the Retrieval Robustness of Large Language Models(https://arxiv.org/abs/2505.21870)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) generally enhances large language models' (LLMs) ability to solve knowledge-intensive tasks. But RAG may also lead to performance degradation due to imperfect retrieval and the model's limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; and (3) whether document order impacts results. To facilitate this study, we establish a benchmark of 1500 open-domain questions, each with retrieved documents from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit surprisingly high retrieval robustness; nonetheless, different degrees of imperfect robustness hinder them from fully utilizing the benefits of RAG.
摘要：检索增强的生成（RAG）通常增强了解决知识密集型任务的大型语言模型（LLMS）的能力。但是，由于检索不完善，抹布也可能导致性能下降，并且该模型利用检索到的内容的能力有限。在这项工作中，我们评估了LLM在实用的RAG设置中的鲁棒性（此后检索鲁棒性）。我们专注于三个研究问题：（1）抹布是否总是比非抹布更好；（2）是否有更多检索的文档始终导致更好的性能；（3）以及文档订单是否影响结果。为了促进这项研究，我们建立了1500个开放域问题的基准，每个问题都从Wikipedia检索了文件。我们介绍了三个鲁棒性指标，每个指标都对应于一个研究问题。我们的全面实验涉及11个LLM和3种提示策略，表明所有这些LLM都表现出令人惊讶的高检索鲁棒性。尽管如此，不同程度的不完美鲁棒性阻碍了他们充分利用抹布的好处。

Title: EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse

Authors: Tianyu Guo, Hande Dong, Yichong Leng, Feng Liu, Cheater Lin, Nong Xiao, Xianwei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21889
Pdf URL: https://arxiv.org/pdf/2505.21889
Copy Paste: [[2505.21889]] EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse(https://arxiv.org/abs/2505.21889)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are often used for infilling tasks, which involve predicting or generating missing information in a given text. These tasks typically require multiple interactions with similar context. To reduce the computation of repeated historical tokens, cross-request key-value (KV) cache reuse, a technique that stores and reuses intermediate computations, has become a crucial method in multi-round interactive services. However, in infilling tasks, the KV cache reuse is often hindered by the structure of the prompt format, which typically consists of a prefix and suffix relative to the insertion point. Specifically, the KV cache of the prefix or suffix part is frequently invalidated as the other part (suffix or prefix) is incrementally generated. To address the issue, we propose EFIM, a transformed prompt format of FIM to unleash the performance potential of KV cache reuse. Although the transformed prompt can solve the inefficiency, it exposes subtoken generation problems in current LLMs, where they have difficulty generating partial words accurately. Therefore, we introduce a fragment tokenization training method which splits text into multiple fragments before tokenization during data processing. Experiments on two representative LLMs show that LLM serving with EFIM can lower the latency by 52% and improve the throughput by 98% while maintaining the original infilling capability. EFIM's source code is publicly available at this https URL.
摘要：大型语言模型（LLMS）通常用于填充任务，其中涉及预测或生成给定文本中缺失的信息。这些任务通常需要具有相似上下文的多个交互。为了减少重复的历史令牌的计算，跨重点键值（KV）缓存重用（一种存储和重用中间计算的技术）已成为多轮交互式服务中的一种至关重要的方法。但是，在填充任务中，迅速格式的结构通常会阻碍KV缓存重用，该结构通常由相对于插入点的前缀和后缀组成。具体而言，由于另一部分（后缀或前缀）是逐步生成的，前缀或后缀部分的KV缓存经常被无效。为了解决这个问题，我们提出了EFIM，这是FIM的转换及时格式，以释放KV缓存重复使用的性能潜力。尽管转换后的提示可以解决效率低下，但它揭示了当前LLM中的微传输产生问题，在那里它们难以准确产生部分单词。因此，我们引入了一种碎片令牌化训练方法，该方法将文本分成多个片段，然后在数据处理过程中进行象征化。在两个代表性LLM的实验表明，使用EFIM服务的LLM可以将潜伏期降低52％，并将吞吐量提高98％，同时在此HTTPS URL上公开可用。

Title: Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development

Authors: Rennai Qiu, Chen Qian, Ran Li, Yufan Dang, Weize Chen, Cheng Yang, Yingli Zhang, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.MA, cs.SE
Abstract URL: https://arxiv.org/abs/2505.21898
Pdf URL: https://arxiv.org/pdf/2505.21898
Copy Paste: [[2505.21898]] Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development(https://arxiv.org/abs/2505.21898)
Keywords: language model, llm, chat, agent
Abstract: Recent advancements in Large Language Models (LLMs) and autonomous agents have demonstrated remarkable capabilities across various domains. However, standalone agents frequently encounter limitations when handling complex tasks that demand extensive interactions and substantial computational resources. Although Multi-Agent Systems (MAS) alleviate some of these limitations through collaborative mechanisms like task decomposition, iterative communication, and role specialization, they typically remain resource-unaware, incurring significant inefficiencies due to high token consumption and excessive execution time. To address these limitations, we propose a resource-aware multi-agent system -- Co-Saving (meaning that multiple agents collaboratively engage in resource-saving activities), which leverages experiential knowledge to enhance operational efficiency and solution quality. Our key innovation is the introduction of "shortcuts" -- instructional transitions learned from historically successful trajectories -- which make it possible to bypass redundant reasoning agents and expedite the collective problem-solving process. Experiments for software development tasks demonstrate significant advantages over existing methods. Specifically, compared to the state-of-the-art MAS ChatDev, our method achieves an average reduction of 50.85% in token usage, and improves the overall code quality by 10.06%.
摘要：大语言模型（LLM）和自主代理商的最新进展表现出了各个领域的显着功能。但是，独立代理在处理需要广泛互动和大量计算资源的复杂任务时经常遇到限制。尽管多机构系统（MAS）通过任务分解，迭代沟通和角色专业化等协作机制来减轻其中的一些限制，但它们通常仍然保持资源 - 纳维尔，由于消耗量较高和执行时间过多而产生了重要的效率低下。为了解决这些限制，我们提出了一种资源感知的多代理系统 - 共同避免（这意味着多个代理商合作从事资源储蓄活动），该系统利用体验知识来提高运营效率和解决方案质量。我们的关键创新是引入“快捷方式” - 从历史上成功的轨迹中学到的教学过渡 - 它允许绕过冗余推理代理并加快集体解决问题的过程。软件开发任务的实验表明，与现有方法相比具有显着优势。具体而言，与最先进的MAS CHATDEV相比，我们的方法的平均用法平均减少了50.85％，并将整体代码质量提高了10.06％。

Title: RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

Authors: Zeyi Liao, Jaylen Jones, Linxi Jiang, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, Huan Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21936
Pdf URL: https://arxiv.org/pdf/2505.21936
Copy Paste: [[2505.21936]] RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments(https://arxiv.org/abs/2505.21936)
Keywords: prompt, agent
Abstract: Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection. Current evaluations of this threat either lack support for realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring a novel hybrid sandbox that integrates a VM-based OS environment with Docker-based web platforms. Our sandbox supports key features tailored for red teaming, such as flexible adversarial scenario configuration, and a setting that decouples adversarial evaluation from navigational limitations of CUAs by initializing tests directly at the point of an adversarial injection. Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities. Benchmarking current frontier CUAs identifies significant vulnerabilities: Claude 3.7 Sonnet | CUA demonstrates an ASR of 42.9%, while Operator, the most secure CUA evaluated, still exhibits an ASR of 7.6%. Notably, CUAs often attempt to execute adversarial tasks with an Attempt Rate as high as 92.5%, although they fail to complete them due to capability limitations. Nevertheless, we observe concerning ASRs of up to 50% in realistic end-to-end settings, with the recently released frontier Claude 4 Opus | CUA showing an alarming ASR of 48%, demonstrating that indirect prompt injection presents tangible risks for even advanced CUAs despite their capabilities and safeguards. Overall, RedTeamCUA provides an essential framework for advancing realistic, controlled, and systematic analysis of CUA vulnerabilities, highlighting the urgent need for robust defenses to indirect prompt injection prior to real-world deployment.
摘要：计算机使用代理（CUAS）有望自动化跨操作系统（OS）和Web的复杂任务，但仍然容易受到间接提示注入的影响。当前对这种威胁的评估要么缺乏支持现实但受控的环境，要么忽略了涉及两个接口的混合Web-OS攻击方案。为了解决这个问题，我们提出了Redteamcua，这是一个具有新型混合沙盒的对抗测试框架，该框架将基于VM的OS环境与基于Docker的Web平台集成在一起。我们的沙箱支持针对红色组合的关键功能，例如灵活的对抗场景配置，以及通过直接在对抗注射点上初始化测试来将cuas导航限制从CUAS的导航限制中脱离的设置。使用Redteamcua，我们开发了RTC基础，这是一个全面的基准测试，其中864个示例研究了现实，混合的Web-OS攻击方案和基本安全漏洞。基准测试当前的边界CUAS标识了重大漏洞：Claude 3.7十四行诗| CUA的ASR为42.9％，而操作员是评估最安全的CUA，但ASR仍然显示为7.6％。值得注意的是，CUA经常尝试以高达92.5％的尝试执行对抗任务，尽管由于能力限制，但未能完成它们。尽管如此，我们在现实的端到端设置中观察到高达50％的ASR，而最近发布的Frontier Claude 4 Opus | CUA显示出令人震惊的ASR为48％，表明尽管有能力和保障措施，但间接及时注射即使是高级CUA的明显风险。总体而言，Redteamcua提供了一个基本框架，用于对CUA漏洞进行现实，控制和系统的分析，强调迫切需要强大的防御能力以在现实世界部署之前进行间接及时注入。

Title: Graph-Assisted Culturally Adaptable Idiomatic Translation for Indic Languages

Authors: Pratik Rakesh Singh, Kritarth Prasad, Mohammadi Zaki, Pankaj Wasnik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21937
Pdf URL: https://arxiv.org/pdf/2505.21937
Copy Paste: [[2505.21937]] Graph-Assisted Culturally Adaptable Idiomatic Translation for Indic Languages(https://arxiv.org/abs/2505.21937)
Keywords: prompt
Abstract: Translating multi-word expressions (MWEs) and idioms requires a deep understanding of the cultural nuances of both the source and target languages. This challenge is further amplified by the one-to-many nature of idiomatic translations, where a single source idiom can have multiple target-language equivalents depending on cultural references and contextual variations. Traditional static knowledge graphs (KGs) and prompt-based approaches struggle to capture these complex relationships, often leading to suboptimal translations. To address this, we propose IdiomCE, an adaptive graph neural network (GNN) based methodology that learns intricate mappings between idiomatic expressions, effectively generalizing to both seen and unseen nodes during training. Our proposed method enhances translation quality even in resource-constrained settings, facilitating improved idiomatic translation in smaller models. We evaluate our approach on multiple idiomatic translation datasets using reference-less metrics, demonstrating significant improvements in translating idioms from English to various Indian languages.
摘要：翻译多字表达式（MWES）和成语需要深入了解源和目标语言的文化细微差别。惯用译本的一对多性质进一步扩大了这一挑战，其中单个源习惯可以根据文化参考和上下文变化具有多个目标语言等效物。传统的静态知识图（KGS）和基于及时的方法难以捕获这些复杂的关系，通常会导致次优的翻译。为了解决这个问题，我们提出了基于自适应图神经网络（GNN）方法的习惯，该方法学会在习惯表达式之间学习复杂的映射，从而有效地概括了训练期间看到和看不见的节点。我们提出的方法即使在资源受限的设置中也提高了翻译质量，从而促进了较小模型中改进的惯用翻译。我们使用无参考指标评估了多个惯用翻译数据集的方法，这表明将习语从英语转换为各种印度语言方面有了显着改善。

Title: RISE: Reasoning Enhancement via Iterative Self-Exploration in Multi-hop Question Answering

Authors: Bolei He, Xinran He, Mengke Chen, Xianwei Xue, Ying Zhu, Zhenhua Ling
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21940
Pdf URL: https://arxiv.org/pdf/2505.21940
Copy Paste: [[2505.21940]] RISE: Reasoning Enhancement via Iterative Self-Exploration in Multi-hop Question Answering(https://arxiv.org/abs/2505.21940)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) excel in many areas but continue to face challenges with complex reasoning tasks, such as Multi-Hop Question Answering (MHQA). MHQA requires integrating evidence from diverse sources while managing intricate logical dependencies, which often leads to errors in reasoning. Retrieval-Augmented Generation (RAG), widely employed in MHQA tasks, faces challenges in effectively filtering noisy data and retrieving all necessary evidence, thereby limiting its effectiveness in addressing MHQA challenges. To address these challenges, we propose RISE: Reasoning Enhancement via Iterative Self-Exploration, a novel framework designed to enhance models' reasoning capability through iterative self-exploration. Specifically, RISE involves three key steps in addressing MHQA tasks: question decomposition, retrieve-then-read, and self-critique. By leveraging continuous self-exploration, RISE identifies accurate reasoning paths, iteratively self-improving the model's capability to integrate evidence, maintain logical consistency, and enhance performance in MHQA tasks. Extensive experiments on multiple MHQA benchmarks demonstrate that RISE significantly improves reasoning accuracy and task performance.
摘要：大型语言模型（LLMS）在许多领域都表现出色，但通过复杂的推理任务（例如多跳问答（MHQA））继续面临挑战。 MHQA需要在管理复杂的逻辑依赖性的同时，从各种来源中综合证据，通常会导致推理错误。在MHQA任务中广泛使用的检索增强发电（RAG）在有效过滤嘈杂的数据并检索所有必要证据时面临挑战，从而限制了其在应对MHQA挑战方面的有效性。为了应对这些挑战，我们提出了上升：通过迭代自我探索来增强推理，这是一个新型框架，旨在通过迭代自我探索来增强模型的推理能力。具体而言，RISE涉及解决MHQA任务的三个关键步骤：问题分解，检索到阅读和自我批评。通过利用持续的自我探索，RISE确定了准确的推理路径，迭代地自我提高模型可以整合证据，保持逻辑一致性并提高MHQA任务中的性能。对多个MHQA基准测试的广泛实验表明，RISE显着提高了推理的准确性和任务绩效。

Title: Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation

Authors: Ashim Gupta, Vivek Srikumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21941
Pdf URL: https://arxiv.org/pdf/2505.21941
Copy Paste: [[2505.21941]] Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation(https://arxiv.org/abs/2505.21941)
Keywords: prompt
Abstract: Inference-time scaling via repeated sampling has shown promise in reasoning tasks, but its effectiveness in multilingual generation remains underexplored. We evaluate this approach using perplexity- and reward-based verifiers on two multilingual benchmarks: the Aya Evaluation Suite and m-ArenaHard. Our results show consistent quality improvements, with gains exceeding 35% in some cases. While perplexity-based scoring is effective for open-ended prompts, only reward-based verifiers improve performance on tasks requiring reasoning (e.g., math, code). Our results demonstrate the broader utility of repeated sampling for multilingual text generation and underscore the importance of selecting the right verifiers for the task.
摘要：通过重复采样的推理时间缩放在推理任务中显示出了希望，但是其在多语言生成中的有效性仍然没有被逐渐倍增。我们在两个多语言基准上使用基于困惑和奖励的验证符评估了这种方法：AYA评估套件和M-Arenahard。我们的结果显示出一致的质量改进，在某些情况下增长超过35％。虽然基于困惑的评分对于开放式提示有效，但只有基于奖励的验证者可以改善需要推理的任务的性能（例如，数学，代码）。我们的结果表明，重复采样对多语言文本生成的更广泛的效用，并强调为任务选择正确的验证程序的重要性。
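
Repeated sampling with a verifier, as described in the abstract above, is essentially a best-of-N loop. The following is a minimal sketch under stated assumptions: `best_of_n`, `fake_generate`, and `fake_score` are hypothetical names, the generator is a stub standing in for a stochastic model call, and the scores are made-up stand-ins for a verifier such as negative perplexity or a reward model.

```python
def best_of_n(prompt, generate, score, n):
    """Draw n candidates and keep the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Stub generator: returns canned samples in place of real model outputs.
_samples = iter(["Bonjour le monde", "Bonjour monde le", "Salut monde"])


def fake_generate(prompt):
    return next(_samples)


# Stub verifier: higher is better (for example, negative perplexity).
_fake_scores = {
    "Bonjour le monde": -2.0,
    "Bonjour monde le": -9.0,
    "Salut monde": -4.0,
}


def fake_score(prompt, candidate):
    return _fake_scores[candidate]


best = best_of_n("Translate 'Hello world' into French.",
                 fake_generate, fake_score, n=3)
print(best)
```

Swapping `fake_score` for a different verifier changes which candidate wins without touching the sampling loop, which is why the choice of verifier can matter as much as the number of samples.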

Title: Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning

Authors: Qihuang Zhong, Liang Ding, Fei Liao, Juhua Liu, Bo Du, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21958
Pdf URL: https://arxiv.org/pdf/2505.21958
Copy Paste: [[2505.21958]] Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning(https://arxiv.org/abs/2505.21958)
Keywords: language model, llm, hallucination
Abstract: Domain-specific instruction-tuning has become the de facto standard for improving the performance of large language models (LLMs) in specialized applications, e.g., medical question answering. Since the instruction-tuning dataset might contain redundant or low-quality data, data selection (DS) is usually required to maximize the data efficiency. Despite the successes in the general domain, current DS methods often struggle to select the desired data for domain-specific instruction-tuning. One of the main reasons is that they neglect the impact of knowledge conflicts, i.e., the discrepancy between LLMs' pretrained knowledge and context knowledge of instruction data, which could damage LLMs' prior abilities and lead to hallucination. To this end, we propose a simple-yet-effective Knowledge-aware Data Selection (namely KDS) framework to select the domain-specific instruction-tuning data that meets LLMs' actual needs. The core of KDS is to leverage two knowledge-aware metrics for quantitatively measuring knowledge conflicts from two aspects: context-memory knowledge alignment and intra-memory knowledge consistency. By filtering the data with large knowledge conflicts and sampling the high-quality and diverse data, KDS can effectively stimulate the LLMs' abilities and achieve better domain-specific performance. Taking the medical domain as the testbed, we conduct extensive experiments and empirically prove that KDS surpasses the other baselines and brings significant and consistent performance gains among all LLMs. More encouragingly, KDS effectively improves the model generalization and alleviates the hallucination problem.
摘要：特定于领域的指令调整已成为改善专业应用程序中大型语言模型（LLM）性能（例如医学问题回答）的表现标准。由于指令调整数据集可能包含冗余或低质量数据，因此通常需要数据选择（DS）来最大化数据效率。尽管一般领域取得了成功，但当前的DS方法通常很难为特定于域特定的指导调整选择所需的数据。主要原因之一是，他们忽略了知识冲突的影响，即LLMS验证的知识与教学数据的上下文知识之间的差异，这可能会损害LLMS的先前能力并导致幻觉。为此，我们提出了一个简单的知识感知数据选择（即KDS）框架，以选择满足LLMS实际需求的特定领域的指令数据。 KD的核心是利用两个知识感知指标来定量测量两个方面的知识冲突：上下文记忆知识对准和内存内存知识的一致性。通过大量知识冲突过滤数据并对高质量和不同的数据进行采样，KD可以有效刺激LLMS的能力并实现更好的领域特定性能。我们将医疗领域作为测试台，我们进行广泛的实验，并从经验上证明KDS超过其他基线，并在所有LLM中带来显着，一致的性能提高。更令人鼓舞的是，KD有效地改善了模型的概括并减轻幻觉问题。

Title: LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents

Authors: Taro Yano, Yoichi Ishibashi, Masafumi Oyamada
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21963
Pdf URL: https://arxiv.org/pdf/2505.21963
Copy Paste: [[2505.21963]] LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents(https://arxiv.org/abs/2505.21963)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks. To further tailor LLMs to specific domains or applications, post-training techniques such as Supervised Fine-Tuning (SFT), Preference Learning, and model merging are commonly employed. While each of these methods has been extensively studied in isolation, the automated construction of complete post-training pipelines remains an underexplored area. Existing approaches typically rely on manual design or focus narrowly on optimizing individual components, such as data ordering or merging strategies. In this work, we introduce LaMDAgent (short for Language Model Developing Agent), a novel framework that autonomously constructs and optimizes full post-training pipelines through the use of LLM-based agents. LaMDAgent systematically explores diverse model generation techniques, datasets, and hyperparameter configurations, leveraging task-based feedback to discover high-performing pipelines with minimal human intervention. Our experiments show that LaMDAgent improves tool-use accuracy by 9.0 points while preserving instruction-following capabilities. Moreover, it uncovers effective post-training strategies that are often overlooked by conventional human-driven exploration. We further analyze the impact of data and model size scaling to reduce the computational cost of exploration, finding that model size scaling introduces new challenges, whereas scaling data size enables cost-effective pipeline discovery.
摘要：大型语言模型（LLMS）在各种任务中都表现出了出色的表现。为了进一步量身定制特定领域或应用程序，通常使用培训后的培训技术，例如监督微调（SFT），偏好学习和模型合并。尽管这些方法中的每种方法都经过广泛的隔离研究，但完整的训练后管道的自动构造仍然是一个未置换的区域。现有方法通常依赖于手动设计，或者专注于优化单个组件，例如数据排序或合并策略。在这项工作中，我们介绍了Lamdagent（语言模型开发代理的缩写），这是一个新颖的框架，可以自主构建和通过使用基于LLM的代理来构建和优化完整的训练后管道。 Lamdagent系统地探索了不同的模型生成技术，数据集和超参数配置，从而利用基于任务的反馈来发现以最少的人干预来发现高性能的管道。我们的实验表明，Lamdagent将工具使用的精度提高了9.0点，同时保留了遵循指示性能的功能。此外，它揭示了常规以人为驱动的探索而忽略的有效的训练后策略。我们进一步分析了数据和模型尺寸缩放的影响，以减少计算成本对勘探的影响，发现模型尺寸尺寸引入了新的挑战，而扩展数据尺寸可以使较高的管道发现。

Title: Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack

Authors: Juan Ren, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21967
Pdf URL: https://arxiv.org/pdf/2505.21967
Copy Paste: [[2505.21967]] Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack(https://arxiv.org/abs/2505.21967)
Keywords: language model, prompt
Abstract: Large Vision-Language Models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, their integration of visual inputs introduces expanded attack surfaces, thereby exposing them to novel security vulnerabilities. In this work, we conduct a systematic representational analysis to uncover why conventional adversarial attacks can circumvent the safety mechanisms embedded in LVLMs. We further propose a novel two-stage evaluation framework for adversarial attacks on LVLMs. The first stage differentiates among instruction noncompliance, outright refusal, and successful adversarial exploitation. The second stage quantifies the degree to which the model's output fulfills the harmful intent of the adversarial prompt, while categorizing refusal behavior into direct refusals, soft refusals, and partial refusals that remain inadvertently helpful. Finally, we introduce a normative schema that defines idealized model behavior when confronted with harmful prompts, offering a principled target for safety alignment in multimodal systems.
摘要：大型视觉模型（LVLM）在各种多模式任务中表现出了显着的功能。但是，它们的视觉输入集成引入了扩展的攻击表面，从而使它们暴露于新颖的安全漏洞中。在这项工作中，我们进行了系统的代表性分析，以发现为什么传统的对抗攻击可以规避LVLM中嵌入的安全机制。我们进一步提出了一个新颖的两阶段评估框架，用于对LVLM的对抗性攻击。第一阶段将指导不合规，完全拒绝和成功的对抗性剥削区分了。第二阶段量化了模型的输出实现对抗提示的有害意图的程度，同时将拒绝行为分为直接拒绝，软拒绝和部分拒绝，这些拒绝行为仍然无意间有所帮助。最后，我们介绍了一种规范性架构，该模式在面对有害提示时定义了理想化的模型行为，为多模式系统中的安全对准提供了原则性的目标。

Title: Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset

Authors: Fakhraddin Alwajih, Samar Mohamed Magdy, Abdellah El Mekki, Omer Nacar, Youssef Nafea, Safaa Taher Abdelfadil, Abdulfattah Mohammed Yahya, Hamzah Luqman, Nada Almarwani, Samah Aloufi, Baraah Qawasmeh, Houdaifa Atou, Serry Sibaee, Hamzah A. Alsayadi, Walid Al-Dhabyani, Maged S. Al-shaibani, Aya El aatar, Nour Qandos, Rahaf Alhamouri, Samar Ahmad, Razan Khassib, Lina Hamad, Mohammed Anwar AL-Ghrawi, Fatimah Alshamari, Cheikh Malainine, Doaa Qawasmeh, Aminetou Yacoub, Tfeil moilid, Ruwa AbuHweidi, Ahmed Aboeitta, Vatimetou Mohamed Lemin, Reem Abdel-Salam, Ahlam Bashiti, Adel Ammar, Aisha Alansari, Ahmed Ashraf, Nora Alturayeif, Sara Shatnawi, Alcides Alcoba Inciarte, AbdelRahim A. Elmadany, Mohamedou cheikh tourad, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21979
Pdf URL: https://arxiv.org/pdf/2505.21979
Copy Paste: [[2505.21979]] Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset(https://arxiv.org/abs/2505.21979)
Keywords: language model, agent
Abstract: Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce Pearl, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 45 annotators from across the Arab world, Pearl comprises over K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks Pearl and Pearl-Lite along with a specialized subset Pearl-X explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models' cultural grounding compared to conventional scaling methods. Pearl establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.
摘要：主流大型视觉模型（LVLM）固有地编码文化偏见，突出了对多种多模式数据集的需求。为了解决这一差距，我们介绍了Pearl，这是一个大规模的阿拉伯语多模式数据集和明确设计用于文化理解的基准。 Pearl通过45个来自阿拉伯世界的45个注释者通过高级代理工作流和广泛的人类注释，其中包括跨越涵盖所有阿拉伯国家的十个具有文化重要领域的K多模式示例。我们进一步提供了两个可靠的评估基准珍珠和珍珠莱特以及明确开发的专门子集珍珠X，以评估细微的文化差异。对最新的开放和专有LVLM的全面评估表明，以推理为中心的教学对准可改善模型的文化基础，而与传统的缩放方法相比。 Pearl建立了一种基础资源，用于推进文化知识的多模型建模研究。所有数据集和基准都可以公开使用。

Title: Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data

Authors: Jihong Zhang, Xinya Liang, Anqi Deng, Nicole Bonge, Lin Tan, Ling Zhang, Nicole Zarrett
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21997
Pdf URL: https://arxiv.org/pdf/2505.21997
Copy Paste: [[2505.21997]] Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data(https://arxiv.org/abs/2505.21997)
Keywords: language model, gpt, llm, prompt
Abstract: Mixed methods research integrates quantitative and qualitative data but faces challenges in aligning their distinct structures, particularly in examining measurement characteristics and individual response patterns. Advances in large language models (LLMs) offer promising solutions by generating synthetic survey responses informed by qualitative data. This study investigates whether LLMs, guided by personal interviews, can reliably predict human survey responses, using the Behavioral Regulations in Exercise Questionnaire (BREQ) and interviews from after-school program staff as a case study. Results indicate that LLMs capture overall response patterns but exhibit lower variability than humans. Incorporating interview data improves response diversity for some models (e.g., Claude, GPT), while well-crafted prompts and low-temperature settings enhance alignment between LLM and human responses. Demographic information had less impact than interview content on alignment accuracy. These findings underscore the potential of interview-informed LLMs to bridge qualitative and quantitative methodologies while revealing limitations in response variability, emotional interpretation, and psychometric fidelity. Future research should refine prompt design, explore bias mitigation, and optimize model settings to enhance the validity of LLM-generated survey data in social science research.
摘要：混合方法研究集成了定量和定性数据，但在对齐它们的不同结构时面临挑战，尤其是在检查测量特征和个体响应模式时。大语言模型（LLMS）的进步通过产生由定性数据告知的合成调查响应，提供了有希望的解决方案。这项研究调查了在个人访谈的指导下，LLM是否可以使用锻炼问卷（BREQ）中的行为法规来可靠地预测人类调查的回应，并作为案例研究中的课后计划人员的访谈。结果表明，LLMS捕获了总体响应模式，但比人类表现出较低的变异性。合并访谈数据改善了某些模型（例如Claude，GPT）的响应多样性，而精心制作的提示和低温设置可以增强LLM和人类反应之间的一致性。人口统计信息的影响少于访谈内容对一致性准确性的影响。这些发现强调了采访知识的LLM在弥合定性和定量方法的潜力，同时揭示了响应变异性，情绪解释和心理忠诚度的局限性。未来的研究应完善及时的设计，探索偏见缓解并优化模型设置，以增强社会科学研究中LLM生成的调查数据的有效性。

Title: Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate

Authors: Ashim Gupta, Maitrey Mehta, Zhichao Xu, Vivek Srikumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21999
Pdf URL: https://arxiv.org/pdf/2505.21999
Copy Paste: [[2505.21999]] Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate(https://arxiv.org/abs/2505.21999)
Keywords: language model, llm
Abstract: Large language models (LLMs) provide detailed and impressive responses to queries in English. However, are they really consistent at responding to the same query in other languages? The popular way of evaluating the multilingual performance of LLMs requires expensive-to-collect annotated datasets. Further, evaluating tasks like open-ended generation, where multiple correct answers may exist, is nontrivial. Instead, we propose to evaluate the predictability of model response across different languages. In this work, we propose a framework to evaluate LLMs' cross-lingual consistency based on a simple Translate then Evaluate strategy. We instantiate this evaluation framework along two dimensions of consistency: information and empathy. Our results reveal pronounced inconsistencies in popular LLM responses across thirty languages, with severe performance deficits in certain language families and scripts, underscoring critical weaknesses in their multilingual capabilities. These findings necessitate cross-lingual evaluations that are consistent along multiple dimensions. We invite practitioners to use our framework for future multilingual LLM benchmarking.
摘要：大型语言模型（LLMS）对英语的查询提供了详细且令人印象深刻的响应。但是，他们在用其他语言中响应相同的查询时真的很稳定吗？评估LLM的多语言性能的流行方式需要昂贵的注释数据集。此外，对可能存在多个正确答案的开放式生成等任务进行评估是不平凡的。相反，我们建议评估不同语言跨不同语言的模型响应的可预测性。在这项工作中，我们提出了一个框架来根据简单翻译然后评估策略来评估LLM的跨语性一致性。我们沿着一致性的两个维度实例化了该评估框架：信息和同理心。我们的结果表明，在三十种语言中流行的LLM响应中明显不一致，某些语言系列和脚本的性能不足，强调了其多语言能力的关键弱点。这些发现需要沿多个维度一致的跨语性评估。我们邀请从业者使用我们的框架进行多种语言LLM基准测试。

Title: Legal Assist AI: Leveraging Transformer-Based Model for Effective Legal Assistance

Authors: Jatin Gupta, Akhil Sharma, Saransh Singhania, Ali Imam Abidi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22003
Pdf URL: https://arxiv.org/pdf/2505.22003
Copy Paste: [[2505.22003]] Legal Assist AI: Leveraging Transformer-Based Model for Effective Legal Assistance(https://arxiv.org/abs/2505.22003)
Keywords: language model, gpt, llm, hallucination
Abstract: The pursuit of accessible legal assistance in India faces a critical gap, as many citizens struggle to leverage their legal rights due to limited awareness and access to relevant legal information. This paper introduces Legal Assist AI, a transformer-based model designed to bridge this gap by offering effective legal assistance through large language models (LLMs). The system retrieves relevant legal information from a curated database and generates accurate responses, enabling effective assistance for diverse users, including legal professionals, scholars, and the general public. The model was fine-tuned on extensive datasets from the Indian legal domain, including the Indian Constitution, Bharatiya Nyaya Sanhita (BNS), Bharatiya Nagarik Suraksha Sanhita (BNSS) and so forth, providing a robust understanding of the complexities of Indian law. By incorporating domain-specific legal datasets, the proposed model demonstrated remarkable efficiency and specialization in legal Question-Answering. The model was evaluated against state-of-the-art models such as GPT-3.5 Turbo and Mistral 7B, achieving a 60.08% score on the AIBE, outperforming its competitors in legal reasoning and accuracy. Unlike other models, Legal Assist AI avoided common issues such as hallucinations, making it highly reliable for practical legal applications. It showcases the model's applicability in real-world legal scenarios, with future iterations aiming to enhance performance and expand its dataset to cover a broader range of multilingual and case-specific queries as well.
摘要：在印度寻求可访问的法律援助面临着巨大的差距，因为许多公民由于有限的意识和获得相关法律信息而难以利用其合法权利。本文介绍了法律辅助AI，这是一种基于变压器的模型，旨在通过通过大语言模型（LLM）提供有效的法律援助来弥合这一差距。该系统从策划的数据库中检索相关的法律信息，并产生准确的响应，为包括法律专业人士，学者和公众在内的不同用户提供有效的帮助。该模型在印度法律领域的广泛数据集上进行了微调，包括印度宪法，Bharatiya Nyaya Sanhita（BNS），Bharatiya Nagarik Suraksha Sanhita（BNSS）（BNSS）和SOET，提供了对印度法律复杂性的强有力理解。通过合并特定领域的法律数据集，拟议的模型在法律提问方面表现出显着的效率和专业化。该模型是针对GPT-3.5 Turbo和Mismtral 7B等最新模型进行了评估的，在AIBE上获得了60.08％的得分，在法律推理和准确性方面表现出色。与其他模型不同，法律协助AI避免了常见问题，例如幻觉，使其对实际法律应用非常可靠。它展示了该模型在现实世界法律场景中的适用性，未来的迭代旨在提高性能并扩展其数据集，以涵盖更广泛的多语言和特定于案例的查询。

Title: CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models

Authors: Siqi Fan, Peng Han, Shuo Shang, Yequan Wang, Aixin Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22017
Pdf URL: https://arxiv.org/pdf/2505.22017
Copy Paste: [[2505.22017]] CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models(https://arxiv.org/abs/2505.22017)
Keywords: language model, llm
Abstract: Large language models (LLMs) benefit from increased test-time compute, a phenomenon known as test-time scaling. However, reasoning-optimized models often overthink even simple problems, producing excessively verbose outputs and leading to low token efficiency. By comparing these models with equally sized instruct models, we identify two key causes of this verbosity: (1) reinforcement learning reduces the information density of forward reasoning, and (2) backward chain-of-thought training encourages redundant and often unnecessary verification steps. Since LLMs cannot assess the difficulty of a given problem, they tend to apply the same cautious reasoning strategy across all tasks, resulting in inefficient overthinking. To address this, we propose CoThink, an embarrassingly simple pipeline: an instruct model first drafts a high-level solution outline; a reasoning model then works out the solution. We observe that CoThink enables dynamic adjustment of reasoning depth based on input difficulty. Evaluated with three reasoning models DAPO, DeepSeek-R1, and QwQ on three datasets GSM8K, MATH500, and AIME24, CoThink reduces total token generation by 22.3% while maintaining pass@1 accuracy within a 0.42% margin on average. With reference to the instruct model, we formally define reasoning efficiency and observe a potential reasoning efficiency scaling law in LLMs.
摘要：大型语言模型（LLM）受益于增加测试时间计算，这是一种称为测试时间缩放的现象。但是，推理优化的模型通常甚至要思考简单的问题，产生过多的详细输出并导致令牌效率低。通过将这些模型与同等尺寸的指示模型进行比较，我们确定了这种冗长的两个关键原因：（1）增强学习降低了向前推理的信息密度，以及（2）向后的思想训练训练会鼓励冗余且通常是不必要的验证步骤。由于LLM无法评估给定问题的难度，因此它们倾向于在所有任务中应用相同的谨慎推理策略，从而导致过度思考效率低下。为了解决这个问题，我们提出了Cothink，这是一个令人尴尬的简单管道：指示模型首先起草了高级解决方案大纲；然后，推理模型可以解决解决方案。我们观察到，Cothink可以根据输入难度对推理深度进行动态调整。 Cothink用三个数据集GSM8K，Math500和Aime24上的三种推理模型DAPO，DeepSeek-R1和QWQ进行了评估，Cothink将总代币的产生降低了22.3％，同时平均在0.42％的利润率中保持了@1的准确性。关于指导模型，我们正式定义了推理效率，并观察LLMS中潜在的推理效率缩放定律。
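
The two-stage pipeline in the abstract above (an instruct model drafts an outline, then a reasoning model works out the solution) can be sketched as a simple function composition. This is an illustration only: the prompt wording, the function name `cothink`, and the stub models are assumptions, not the paper's actual prompts or interfaces.

```python
def cothink(problem, instruct_model, reasoning_model):
    """Stage 1: outline from an instruct model. Stage 2: solve with a reasoner."""
    outline = instruct_model(
        "Draft a brief high-level solution outline. Do not solve.\n" + problem
    )
    solve_prompt = (
        "Problem:\n" + problem + "\n\n"
        "Outline:\n" + outline + "\n\n"
        "Follow the outline and work out the final answer."
    )
    return reasoning_model(solve_prompt)


# Stub models standing in for real LLM calls (hypothetical behavior).
def stub_instruct(prompt):
    return "1. Multiply the unit price by the quantity."


def stub_reasoner(prompt):
    # A real reasoning model would generate text; the stub only checks
    # that the outline reached it and returns a canned answer.
    assert "Outline:" in prompt
    return "3 * 4 = 12"


answer = cothink("Apples cost 3 each. What do 4 apples cost?",
                 stub_instruct, stub_reasoner)
print(answer)
```

Both models are passed in as plain callables, so either stage can be swapped for a different model or a logging wrapper without changing the pipeline itself.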

Title: Jailbreak Distillation: Renewable Safety Benchmarking

Authors: Jingyu Zhang, Ahmed Elgohary, Xiawei Wang, A S M Iftekhar, Ahmed Magooda, Benjamin Van Durme, Daniel Khashabi, Kyle Jackson
Subjects: cs.CL, cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2505.22037
Pdf URL: https://arxiv.org/pdf/2505.22037
Copy Paste: [[2505.22037]] Jailbreak Distillation: Renewable Safety Benchmarking(https://arxiv.org/abs/2505.22037)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are rapidly deployed in critical applications, raising urgent needs for robust safety benchmarking. We propose Jailbreak Distillation (JBDistill), a novel benchmark construction framework that "distills" jailbreak attacks into high-quality and easily-updatable safety benchmarks. JBDistill utilizes a small set of development models and existing jailbreak attack algorithms to create a candidate prompt pool, then employs prompt selection algorithms to identify an effective subset of prompts as safety benchmarks. JBDistill addresses challenges in existing safety evaluation: the use of consistent evaluation prompts across models ensures fair comparisons and reproducibility. It requires minimal human effort to rerun the JBDistill pipeline and produce updated benchmarks, alleviating concerns on saturation and contamination. Extensive experiments demonstrate our benchmarks generalize robustly to 13 diverse evaluation models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, significantly outperforming existing safety benchmarks in effectiveness while maintaining high separability and diversity. Our framework thus provides an effective, sustainable, and adaptable solution for streamlining safety evaluation.
摘要：大型语言模型（LLM）迅速部署在关键应用程序中，从而提高了对安全基准测试的紧急需求。我们提出了越狱蒸馏（JBDISTILL），这是一个新颖的基准建设框架，将越狱攻击“提炼”成高质量且易于上升的安全基准。 JBDISTILL利用一小部分开发模型和现有的越狱攻击算法来创建候选提示池，然后采用提示选择算法来确定提示的有效子集作为安全基准。 JBDistill解决了现有安全评估中的挑战：在模型之间使用一致的评估提示确保了公平的比较和可重复性。它需要最少的人类努力来重新运行JBDISTILL管道并产生更新的基准，从而减轻了对饱和和污染的担忧。广泛的实验表明，我们的基准对基准结构的13种不同的评估模型进行了鲁棒性，包括专有，专业和较新的LLM，在维持高可分离性和多样性的同时，在有效性的同时，在有效性方面表现明显优于现有的安全基准。因此，我们的框架为简化安全评估提供了有效，可持续和适应性的解决方案。
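
Illustrative sketch (not from the paper): the abstract does not say which prompt selection algorithms JBDistill uses. One simple possibility, shown here purely as an assumption, is a greedy pass that repeatedly picks the candidate prompt that jailbreaks the most development models not yet covered. The function name `greedy_select` and the data layout are hypothetical.

```python
from typing import Dict, List, Set


def greedy_select(success: Dict[str, Set[str]], budget: int) -> List[str]:
    """Pick up to `budget` prompts from a candidate pool.

    `success` maps a prompt id to the set of development models it jailbreaks
    (an assumed representation of attack results).
    """
    chosen: List[str] = []
    covered: Set[str] = set()
    remaining = dict(success)
    while remaining and len(chosen) < budget:
        # Prefer prompts that add the most uncovered models; break ties by
        # total successes, then by the smallest prompt id (via the sorted order).
        best = max(
            sorted(remaining),
            key=lambda p: (len(remaining[p] - covered), len(remaining[p])),
        )
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen
```

Because the same selected subset is then used for every evaluated model, comparisons across models stay consistent, which is the property the abstract emphasizes.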

Title: Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home?

Authors: Yujin Choi, Youngjoo Park, Junyoung Byun, Jaewook Lee, Jinseong Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22061
Pdf URL: https://arxiv.org/pdf/2505.22061
Copy Paste: [[2505.22061]] Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home?(https://arxiv.org/abs/2505.22061)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) mitigates the hallucination problem in large language models (LLMs) and has proven effective for specific, personalized applications. However, passing private retrieved documents directly to LLMs introduces vulnerability to membership inference attacks (MIAs), which try to determine whether the target datum exists in the private external database or not. Based on the insight that MIA queries typically exhibit high similarity to only one target document, we introduce Mirabel, a similarity-based MIA detection framework designed for the RAG system. With the proposed Mirabel, we show that simple detect-and-hide strategies can successfully obfuscate attackers, maintain data utility, and remain system-agnostic. We experimentally prove its detection and defense against various state-of-the-art MIA methods and its adaptability to existing private RAG systems.
摘要：检索增强的生成（RAG）减轻了大语言模型（LLMS）中的幻觉问题，并已证明对特定的个性化应用程序有效。但是，将私人检索的文档直接传递给LLMS引入了会员资格推理攻击（MIA）的漏洞，这些漏洞试图确定目标基准是否存在于私人外部数据库中。基于这样的见解，MIA查询通常与仅一个目标文档具有很高的相似性，我们引入了Mirabel，这是一个基于相似性的MIA检测框架，该框架是为RAG系统设计的。借助拟议的Mirabel，我们表明，简单的检测和隐藏策略可以成功地混淆攻击者，维护数据实用程序并保持系统不可能。我们通过实验证明了其针对各种最新MIA方法及其对现有私人抹布系统的适应性的检测和防御。
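
Illustrative sketch (not from the paper): the insight that MIA queries are highly similar to only one target document suggests a detect-and-hide check like the one below. This is an assumed simplification, not Mirabel's actual detection statistic; the thresholds and the names `looks_like_mia` and `retrieve_with_hiding` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def looks_like_mia(query: Sequence[float], docs: Sequence[Sequence[float]],
                   top_threshold: float = 0.9, gap_threshold: float = 0.3) -> bool:
    """Flag a query whose best match is very close AND far ahead of the runner-up."""
    sims = sorted((cosine(query, d) for d in docs), reverse=True)
    if not sims:
        return False
    runner_up = sims[1] if len(sims) > 1 else 0.0
    return sims[0] >= top_threshold and sims[0] - runner_up >= gap_threshold


def retrieve_with_hiding(query: Sequence[float], docs: Sequence[Sequence[float]],
                         k: int = 2) -> List[int]:
    """Return indices of the top-k documents, hiding the top hit for suspicious queries."""
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    if looks_like_mia(query, docs):
        ranked = ranked[1:]  # hide the single near-duplicate document
    return ranked[:k]
```

Ordinary queries, which tend to match several documents to a similar degree, pass through unchanged, so retrieval utility is preserved for them.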

Title: Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO

Authors: Ran Li, Shimin Di, Yuchen Liu, Chen Jing, Yu Qiu, Lei Chen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.22068
Pdf URL: https://arxiv.org/pdf/2505.22068
Copy Paste: [[2505.22068]] Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO(https://arxiv.org/abs/2505.22068)
Keywords: language model, llm, chain-of-thought
Abstract: Previous studies suggest that powerful Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR) only refine their reasoning path without improving reasoning capacity in math tasks, whereas supervised fine-tuning (SFT) with distillation can improve it. We study this from the view of scientific information extraction (SciIE), where LLMs and reasoning LLMs underperform small BERT-based models. SciIE requires both reasoning and memorization. We argue that both SFT and RLVR can refine the reasoning path and improve reasoning capacity in a simple way based on SciIE. We propose two-stage training with (1) MimicSFT, which uses structured reasoning templates without needing high-quality chain-of-thought data, and (2) R$^2$GRPO with relevance and rule-induced rewards. Experiments on scientific IE benchmarks show that both methods can improve the reasoning capacity. R$^2$GRPO with MimicSFT surpasses baseline LLMs and specialized supervised models in relation extraction. Our code is available at this https URL.
摘要：先前的研究表明，通过可验证的奖励（RLVR）培训的强大语言模型（LLMS）仅在不改善数学任务的推理能力的同时，在有监督的罚款（SFT）的情况下，只能完善推理路径。我们从科学信息提取（SCIIE）的角度研究了这一点，其中LLM和推理LLM的表现不足基于BERT的小型模型。科学需要推理和记忆。我们认为，SFT和RLVR都可以根据科学以简单的方式来完善推理路径并提高推理能力。我们提出了使用结构化推理模板的1。MimicsFT进行两阶段培训，而无需高质量的链链数据，2。R$^2 $ GRPO具有相关性和规则引起的奖励。关于科学IE基准的实验表明，两种方法都可以提高推理能力。 r $^2 $ grpo带有Mimicsft的GRPO超过基线LLM和专门的监督模型，以提取。我们的代码可在此HTTPS URL上找到。

Title: ArgInstruct: Specialized Instruction Fine-Tuning for Computational Argumentation

Authors: Maja Stahl, Timon Ziegenbein, Joonsuk Park, Henning Wachsmuth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22076
Pdf URL: https://arxiv.org/pdf/2505.22076
Copy Paste: [[2505.22076]] ArgInstruct: Specialized Instruction Fine-Tuning for Computational Argumentation(https://arxiv.org/abs/2505.22076)
Keywords: language model, llm
Abstract: Training large language models (LLMs) to follow instructions has significantly enhanced their ability to tackle unseen tasks. However, despite their strong generalization capabilities, instruction-following LLMs encounter difficulties when dealing with tasks that require domain knowledge. This work introduces a specialized instruction fine-tuning for the domain of computational argumentation (CA). The goal is to enable an LLM to effectively tackle any unseen CA tasks while preserving its generalization capabilities. Reviewing existing CA research, we crafted natural language instructions for 105 CA tasks to this end. On this basis, we developed a CA-specific benchmark for LLMs that allows for a comprehensive evaluation of LLMs' capabilities in solving various CA tasks. We synthesized 52k CA-related instructions, adapting the self-instruct process to train a CA-specialized instruction-following LLM. Our experiments suggest that CA-specialized instruction fine-tuning significantly enhances the LLM on both seen and unseen CA tasks. At the same time, performance on the general NLP tasks of the SuperNI benchmark remains stable.
摘要：培训大型语言模型（LLMS）遵循指示大大提高了他们处理看不见的任务的能力。但是，尽管具有强大的概括能力，但在处理需要域知识的任务时，指导遵循的LLMS遇到了困难。这项工作介绍了针对计算论证领域（CA）领域的专门指令进行微调。目的是使LLM能够有效地解决任何看不见的CA任务，同时保留其泛化功能。回顾了现有的CA研究，我们为105个CA任务制定了自然语言指令。在此基础上，我们为LLMS开发了一个特定于CA的基准，该基准允许对LLMS解决各种CA任务的能力进行全面评估。我们合成了52K CA相关的指令，调整了自我教学过程以训练CA专用的指导跟随LLM。我们的实验表明，CA专科指导微调显着增强了可见和看不见的CA任务的LLM。同时，超级基准的一般NLP任务的性能保持稳定。

Title: Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning

Authors: Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan, Shuo Wang, Zhiyuan Liu, Yu Gu, Minghe Yu, Ge Yu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22095
Pdf URL: https://arxiv.org/pdf/2505.22095
Copy Paste: [[2505.22095]] Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning(https://arxiv.org/abs/2505.22095)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge during generation. Existing MRAG methods typically adopt a static retrieval pipeline that fetches relevant information from multiple Knowledge Bases (KBs), followed by a refinement step. However, these approaches overlook the reasoning and planning capabilities of MLLMs to dynamically determine how to interact with different KBs during the reasoning process. To address this limitation, we propose R1-Router, a novel MRAG framework that learns to decide when and where to retrieve knowledge based on the evolving reasoning state. Specifically, R1-Router generates follow-up queries according to the current reasoning step, routes these intermediate queries to the most suitable KB, and integrates external knowledge into a coherent reasoning trajectory to answer the original query. Furthermore, we introduce Step-wise Group Relative Policy Optimization (Step-GRPO), a tailored reinforcement learning algorithm that assigns step-specific rewards to optimize the reasoning behavior of MLLMs. Experimental results on various open-domain QA benchmarks across multiple modalities demonstrate that R1-Router outperforms baseline models by over 7%. Further analysis shows that R1-Router can adaptively and effectively leverage diverse KBs, reducing unnecessary retrievals and improving both efficiency and accuracy.
摘要：多模式检索仪（MRAG）通过在发电过程中纳入外部知识来减轻多模式大语言模型（MLLM）的幻觉表现出了希望。现有的MRAG方法通常采用静态检索管道，该管道从多个知识库（KBS）中获取相关信息，然后进行改进步骤。但是，这些方法忽略了MLLM的推理和计划能力，以动态确定如何在推理过程中与不同的KBS相互作用。为了解决这一限制，我们提出了R1-Router，这是一个新颖的MRAG框架，学会根据不断发展的推理状态来决定何时何地检索知识。具体而言，R1-Router可以根据当前的推理步骤生成后续查询，将这些中间查询路由到最合适的KB，并将外部知识集成到连贯的推理轨迹中以回答原始查询。此外，我们推出了踏步组相对策略优化（Step-Grpo），这是一种量身定制的增强式学习算法，该算法分配了阶梯特定的奖励，以优化MLLM的推理行为。多种方式上各种开放域QA基准测试的实验结果表明，R1-Router的表现优于基线模型超过7％。进一步的分析表明，R1-Router可以适应有效地利用各种KBS，减少不必要的检索并提高效率和准确性。

Title: Knowledge Base Construction for Knowledge-Augmented Text-to-SQL

Authors: Jinheon Baek, Horst Samulowitz, Oktie Hassanzadeh, Dharmashankar Subramanian, Sola Shirai, Alfio Gliozzo, Debarun Bhattacharjya
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22096
Pdf URL: https://arxiv.org/pdf/2505.22096
Copy Paste: [[2505.22096]] Knowledge Base Construction for Knowledge-Augmented Text-to-SQL(https://arxiv.org/abs/2505.22096)
Keywords: language model, llm
Abstract: Text-to-SQL aims to translate natural language queries into SQL statements, which is practical as it enables anyone to easily retrieve the desired information from databases. Recently, many approaches have tackled this problem with Large Language Models (LLMs), leveraging their strong capability in understanding user queries and generating corresponding SQL code. Yet, the parametric knowledge in LLMs might be insufficient to cover all the diverse and domain-specific queries that require grounding in various database schemas, which often makes the generated SQL less accurate. To tackle this, we propose constructing a knowledge base for text-to-SQL, a foundational source of knowledge from which we retrieve and generate the necessary knowledge for given queries. In particular, unlike existing approaches that either manually annotate knowledge or generate only a few pieces of knowledge for each query, our knowledge base is comprehensive: it is constructed from a combination of all the available questions and their associated database schemas along with their relevant knowledge, and it can be reused for unseen databases from different datasets and domains. We validate our approach on multiple text-to-SQL datasets, considering both the overlapping and non-overlapping database scenarios, where it outperforms relevant baselines substantially.
摘要：文本到SQL旨在将自然语言查询转换为SQL语句，这是实用的，因为它使任何人都可以轻松从数据库中检索所需的信息。最近，许多现有的方法通过大型语言模型（LLM）解决了这个问题，利用其强大的能力理解用户查询并生成相应的SQL代码。但是，LLMS中的参数知识可能仅限于涵盖需要在各种数据库模式中接地的所有不同和域特异性查询，这使得生成的SQL经常不准确。为了解决这个问题，我们建议构建文本到SQL的知识库，这是知识的基本来源，我们从中取回并为给定查询生成必要的知识。特别是，与现有的方法不同，这些方法要么手动注释知识或仅对每个查询生成几个知识，我们的知识基础是全面的，这是根据所有可用问题及其相关数据库模式以及相关知识的组合构建的，并且可以重用来自不同数据库和域的未看到的数据库。我们在多个文本到SQL数据集上验证了我们的方法，考虑到重叠和非重叠的数据库方案，在该方案中，它大大优于相关的基线。

Title: MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

Authors: Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, Junpeng Ren, Zehao Lin, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhiqiang Yin, Qingchen Yu, Bo Tang, Hongkang Yang, Zhi-Qin John Xu, Feiyu Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22101
Pdf URL: https://arxiv.org/pdf/2505.22101
Copy Paste: [[2505.22101]] MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models(https://arxiv.org/abs/2505.22101)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have emerged as foundational infrastructure in the pursuit of Artificial General Intelligence (AGI). Despite their remarkable capabilities in language perception and generation, current LLMs fundamentally lack a unified and structured architecture for handling memory. They primarily rely on parametric memory (knowledge encoded in model weights) and ephemeral activation memory (context-limited runtime states). While emerging methods like Retrieval-Augmented Generation (RAG) incorporate plaintext memory, they lack lifecycle management and multi-modal integration, limiting their capacity for long-term knowledge evolution. To address this, we introduce MemOS, a memory operating system designed for LLMs that, for the first time, elevates memory to a first-class operational resource. It builds unified mechanisms for representation, organization, and governance across three core memory types: parametric, activation, and plaintext. At its core is the MemCube, a standardized memory abstraction that enables tracking, fusion, and migration of heterogeneous memory, while offering structured, traceable access across tasks and contexts. MemOS establishes a memory-centric execution framework with strong controllability, adaptability, and evolvability. It fills a critical gap in current LLM infrastructure and lays the groundwork for continual adaptation, personalized intelligence, and cross-platform coordination in next-generation intelligent systems.
摘要：大型语言模型（LLM）已成为追求人工通用智能（AGI）的基础基础设施。尽管在语言感知和产生方面具有显着的能力，但目前的LLM从根本上缺乏处理记忆的统一和结构化的体系结构。它们主要依赖于参数内存（在模型权重编码的知识）和短暂的激活内存（上下文限制的运行时状态）。尽管诸如检索效果生成（RAG）之类的新兴方法纳入了明文记忆，但它们缺乏生命周期管理和多模式整合，从而限制了其长期知识演变的能力。为了解决这个问题，我们介绍了备忘录，这是一种专为LLMS设计的内存操作系统，该系统首次将存储器提升到一流的操作资源。它构建了三种核心内存类型的表示，组织和治理的统一机制：参数，激活和明文。 Memcube是Memcube，这是一种标准化的内存抽象，可实现异质内存的跟踪，融合和迁移，同时在任务和上下文之间提供结构化的可追溯访问。备忘录建立了一个以内存为中心的执行框架，具有强大的可控性，适应性和可发展性。它填补了当前LLM基础架构的关键空白，并为下一代智能系统中的持续适应，个性化智能和跨平台协调奠定了基础。

Title: Curse of High Dimensionality Issue in Transformer for Long-context Modeling

Authors: Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22107
Pdf URL: https://arxiv.org/pdf/2505.22107
Copy Paste: [[2505.22107]] Curse of High Dimensionality Issue in Transformer for Long-context Modeling(https://arxiv.org/abs/2505.22107)
Keywords: language model, llm
Abstract: Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at this https URL.
摘要：基于变形金刚的大型语言模型（LLM）在自然语言处理任务中脱颖而出，通过通过自我注意的机制捕获长期依赖性。但是，由于\ textIt {reduction}注意计算，长篇小说建模面对明显的计算效率低下：虽然注意力的权重通常是\ textit {sparse {sparse}，但所有令牌都会消耗\ textit {quare}计算资源。在本文中，我们将传统的概率序列建模重新制定为\ textit {有监督的学习任务}，从而使相关和无关紧要的令牌分开，并提供了对冗余的更清晰的理解。基于这种重新制定，我们理论上分析了注意力稀疏性，表明只有少数几个令牌显着有助于预测。在此基础上，我们将注意力优化作为线性编码问题，并提出了\ textIt {组编码策略}，从理论上讲，它可以提高其提高对随机噪声并提高学习效率的鲁棒性的能力。在此激励的情况下，我们提出了\ textIt {Dynamic群体注意}（DGA），该{DGA）通过在注意计算过程中汇总不太重要的标记来利用组编码来显式降低冗余。经验结果表明，我们的DGA显着降低了计算成本，同时保持竞争性此HTTP URL可在此HTTPS URL上获得。
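
Illustrative sketch (not from the paper): the idea of aggregating less important tokens can be pictured as follows. The most important tokens keep their own key vectors, while the rest are merged into group averages, so attention runs over fewer entries. This is an assumed toy version, not the DGA algorithm itself; `group_tokens`, the fixed `group_size`, and the use of a plain mean are hypothetical choices.

```python
from typing import List, Sequence


def group_tokens(keys: Sequence[Sequence[float]], importance: Sequence[float],
                 keep: int, group_size: int) -> List[List[float]]:
    """Keep the `keep` most important key vectors; average the others in groups."""
    # Rank token positions by importance (highest first); ties keep original order.
    order = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    kept = sorted(order[:keep])   # important tokens, back in sequence order
    rest = sorted(order[keep:])   # less important tokens, in sequence order

    out = [list(keys[i]) for i in kept]
    for start in range(0, len(rest), group_size):
        group = [keys[i] for i in rest[start:start + group_size]]
        dim = len(group[0])
        # One averaged vector stands in for the whole group.
        out.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return out
```

With `n` tokens, the attention computation then sees roughly `keep + (n - keep) / group_size` keys instead of `n`, which is where the cost saving in this toy version comes from.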

Title: THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models

Authors: Zhiyuan Li, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22113
Pdf URL: https://arxiv.org/pdf/2505.22113
Copy Paste: [[2505.22113]] THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models(https://arxiv.org/abs/2505.22113)
Keywords: language model, llm, chain-of-thought
Abstract: Large reasoning models (LRMs) have achieved impressive performance in complex tasks, often outperforming conventional large language models (LLMs). However, the prevalent issue of overthinking severely limits their computational efficiency. Overthinking occurs when models generate excessive and redundant tokens that contribute little to accurate outcomes, especially in simple tasks, resulting in a significant waste of computational resources. To systematically investigate this issue, we introduce Think-Bench, a benchmark designed to evaluate the reasoning efficiency of LRMs. We also propose novel efficiency metrics and conduct a comprehensive evaluation of various LRMs across multiple dimensions, including the reasoning process, outcome quality, and chain-of-thought (CoT) characteristics. Our analysis reveals that most LRMs exhibit overthinking in handling easy questions, generating unnecessarily lengthy reasoning chains. While many LRMs demonstrate high CoT quality, several suffer from low efficiency. We hope that Think-Bench can serve as a robust foundation for advancing research into LRMs.
摘要：大型推理模型（LRMS）在复杂的任务中取得了令人印象深刻的表现，通常超过常规大语模型（LLMS）。但是，过度思考的普遍问题严重限制了其计算效率。当模型产生过多和多余的代币时，就会发生过度思考，这对准确的结果几乎没有贡献，尤其是在简单的任务中，从而导致了大量浪费计算资源。为了系统地调查此问题，我们介绍了Think-Bench，这是一种旨在评估LRMS推理效率的基准测试基准。我们还提出了新颖的效率指标，并对各个维度的各种LRM进行全面评估，包括推理过程，结果质量和思想链（COT）特征。我们的分析表明，大多数LRM在处理简单问题，产生不必要的漫长推理链时表现出过度思考。虽然许多LRMS表现出很高的婴儿床质量，但有些人效率低。我们希望思想基础可以成为推进LRMS研究的强大基础。

Title: Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model

Authors: Jintao Zhang, Zirui Liu, Mingyue Cheng, Shilong Zhang, Tingyue Pan, Qi Liu, Yanhu Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22116
Pdf URL: https://arxiv.org/pdf/2505.22116
Copy Paste: [[2505.22116]] Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model(https://arxiv.org/abs/2505.22116)
Keywords: language model
Abstract: Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose IOHFuseLM, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at this https URL.
摘要：术中低血压（IOH）经常发生在全身麻醉下，与不良结局（例如心肌损伤和死亡率增加）密切相关。尽管具有重要意义，但事件的稀疏性和整合不同患者的静态和动态数据的挑战也会阻碍IOH预测。在本文中，我们建议\ textbf {iohfuselm}，一种多模式模型框架。为了准确识别和区分稀疏的低血压事件，我们利用了两阶段的训练策略。第一阶段涉及通过扩散方法增强IOH生理时间序列的域自适应预测，从而增强了对与低血压相关的模式的敏感性。随后，在原始临床数据集上进行了微调，以进一步增强了将正常引度诱导与降压状态区分开的能力。为了使每位患者的多模式融合，我们将结构化的临床描述与令牌水平的相应生理时间序列保持一致。这种对齐使模型能够捕获其相应的临床语义以及其相应的临床语义。此外，我们将静态患者属性转换为结构化文本以丰富个性化信息。对两个术中数据集进行的实验评估表明，IOHFUSELM在准确识别IOH事件方面优于建立的基准，从而强调了其在临床决策支持方案中的适用性。我们的代码公开可在此HTTPS URL上促进可重复性。

Title: Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches

Authors: Alan Ramponi, Marco Rovera, Robert Moro, Sara Tonelli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22118
Pdf URL: https://arxiv.org/pdf/2505.22118
Copy Paste: [[2505.22118]] Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches(https://arxiv.org/abs/2505.22118)
Keywords: llm
Abstract: Retrieval of previously fact-checked claims is a well-established task, whose automation can assist professional fact-checkers in the initial steps of information verification. Previous works have mostly tackled the task monolingually, i.e., having both the input and the retrieved claims in the same language. However, especially for languages with a limited availability of fact-checks and in case of global narratives, such as pandemics, wars, or international politics, it is crucial to be able to retrieve claims across languages. In this work, we examine strategies to improve the multilingual and crosslingual performance, namely selection of negative examples (in the supervised setting) and re-ranking (in the unsupervised setting). We evaluate all approaches on a dataset containing posts and claims in 47 languages (283 language combinations). We observe that the best results are obtained by using LLM-based re-ranking, followed by fine-tuning with negative examples sampled using a sentence similarity-based strategy. Most importantly, we show that crosslinguality is a setup with its own unique characteristics compared to the multilingual setup.
摘要：检索以前事实检查的主张是一项良好的任务，其自动化可以在信息验证的初始步骤中帮助专业的事实检查器。以前的作品主要是单一完成的，即以相同的语言具有输入和检索的主张。但是，尤其是对于事实核对的语言以及在全球叙事（例如大流行，战争或国际政治）的情况下，能够跨语言检索主张是至关重要的。在这项工作中，我们研究了改善多语言和跨语言表现的策略，即选择负面示例（在监督中）和重新排列（在无监督的环境中）。我们在包含47种语言（283种语言组合）的帖子和主张的数据集上评估所有方法。我们观察到，最佳结果是通过使用基于LLM的重新排列获得的，然后通过使用基于句子相似性的策略进行了微调示例进行微调。最重要的是，我们表明，与多语言设置相比，跨语言是具有其独特特征的设置。

Title: LoKI: Low-damage Knowledge Implanting of Large Language Models

Authors: Runyu Wang, Peng Ping, Zhengyu Guo, Xiaoye Zhang, Quan Shi, Liting Zhou, Tianbo Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22120
Pdf URL: https://arxiv.org/pdf/2505.22120
Copy Paste: [[2505.22120]] LoKI: Low-damage Knowledge Implanting of Large Language Models(https://arxiv.org/abs/2505.22120)
Keywords: language model, llm
Abstract: Fine-tuning adapts pretrained models for specific tasks but poses the risk of catastrophic forgetting (CF), where critical knowledge from pre-training is overwritten. Current Parameter-Efficient Fine-Tuning (PEFT) methods for Large Language Models (LLMs), while efficient, often sacrifice general capabilities. To address the issue of CF in a general-purpose PEFT framework, we propose Low-damage Knowledge Implanting (LoKI), a PEFT technique that is based on a mechanistic understanding of how knowledge is stored in transformer architectures. In two real-world scenarios, LoKI demonstrates task-specific performance that is comparable to or even surpasses that of full fine-tuning and LoRA-based methods across various model types, while significantly better preserving general capabilities. Our work connects mechanistic insights into LLM knowledge storage with practical fine-tuning objectives, achieving state-of-the-art trade-offs between task specialization and the preservation of general capabilities. Our implementation is publicly available as ready-to-use code at this https URL.
摘要：微观调整适应了预估计的模型，以适应特定任务，但构成了灾难性遗忘（CF）的风险，在这种情况下，预训练的批判性知识被覆盖。大型语言模型（LLMS）的当前参数效率微调方法（PEFT）方法，虽然有效，但通常会牺牲一般能力。为了在通用PEFT框架中解决CF问题，我们提出\ TextBf {lo} w-damage \ textbf {k} nowledge \ textbf {i} mplanting（\ textbf {loki}），基于对知识的机械性理解的PEFT技术，该技术是对知识的机械性理解的构图。在两种现实世界的情况下，Loki展示了特定于任务的性能，与各种模型类型的完全微调和基于LORA的方法相当，同时可以更好地保留一般能力。我们的工作将机械洞察力与LLM知识存储与实用的微调目标联系起来，从而实现了任务专业和一般能力保存之间的最新权衡。我们的实现可作为现成的代码\ footNote {this https url}公开使用。

Title: EULER: Enhancing the Reasoning Ability of Large Language Models through Error-Induced Learning

Authors: Zhuoyang Wu, Xinze Li, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Minghe Yu, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22131
Pdf URL: https://arxiv.org/pdf/2505.22131
Copy Paste: [[2505.22131]] EULER: Enhancing the Reasoning Ability of Large Language Models through Error-Induced Learning(https://arxiv.org/abs/2505.22131)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated strong reasoning capabilities and achieved promising results in mathematical problem-solving tasks. Learning from errors offers the potential to further enhance the performance of LLMs during Supervised Fine-Tuning (SFT). However, the errors in synthesized solutions are typically gathered from sampling trials, making it challenging to generate solution errors for each mathematical problem. This paper introduces the Error-IndUced LEaRning (EULER) model, which aims to develop an error exposure model that generates high-quality solution errors to enhance the mathematical reasoning capabilities of LLMs. Specifically, EULER optimizes the error exposure model to increase the generation probability of self-made solution errors while utilizing solutions produced by a superior LLM to regularize the generation quality. Our experiments across various mathematical problem datasets demonstrate the effectiveness of the EULER model, achieving an improvement of over 4% compared to all baseline models. Further analysis reveals that EULER is capable of synthesizing more challenging and educational solution errors, which facilitate both the training and inference processes of LLMs. All code is available at this https URL.
摘要：大型语言模型（LLM）表现出强大的推理能力，并在数学解决问题的任务中取得了令人鼓舞的结果。从错误中学习提供了进一步提高监督微调（SFT）期间LLM的性能的潜力。但是，合成解决方案中的错误通常是从采样跟踪中收集的，这使得为每个数学问题生成解决方案错误具有挑战性。本文介绍了错误诱导的学习（EULER）模型，该模型旨在开发错误的曝光模型，该模型生成高质量的解决方案错误以增强LLMS的数学推理能力。具体而言，Euler优化了错误暴露模型，以增加自制溶液错误的发电概率，同时利用由上级LLM生成的解决方案来正常生成质量。我们在各种数学问题数据集中进行的实验证明了Euler模型的有效性，与所有基线模型相比，提高了4％以上。进一步的分析表明，Euler能够综合更具挑战性和教育解决方案错误，从而有助于LLM的培训和推理过程。所有代码均可在此HTTPS URL上找到。

Title: InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing

Authors: Shuaiyi Li, Zhisong Zhang, Yang Deng, Chenlong Deng, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Wai Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22156
Pdf URL: https://arxiv.org/pdf/2505.22156
Copy Paste: [[2505.22156]] InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing(https://arxiv.org/abs/2505.22156)
Keywords: language model, llm
Abstract: Although existing model editing methods perform well in recalling exact edit facts, they often struggle in complex scenarios that require deeper semantic understanding rather than mere knowledge regurgitation. Leveraging the strong contextual reasoning abilities of large language models (LLMs), in-context learning (ICL) becomes a promising editing method by comprehending edit information through context encoding. However, this method is constrained by the limited context window of LLMs, leading to degraded performance and efficiency as the number of edits increases. To overcome this limitation, we propose InComeS, a flexible framework that enhances LLMs' ability to process editing contexts through explicit compression and selection mechanisms. Specifically, InComeS compresses each editing context into the key-value (KV) cache of a special gist token, enabling efficient handling of multiple edits without being restricted by the model's context window. Furthermore, specialized cross-attention modules are added to dynamically select the most relevant information from the gist pools, enabling adaptive and effective utilization of edit information. We conduct experiments on diverse model editing benchmarks with various editing formats, and the results demonstrate the effectiveness and efficiency of our method.
摘要：尽管现有的模型编辑方法在回忆确切的编辑事实方面表现良好，但它们通常在需要更深入的语义理解而不是仅仅是知识反流的复杂场景中挣扎。通过通过上下文编码来理解编辑信息，利用大语言模型（LLMS）的强大上下文推理能力（LLMS）成为一种有希望的编辑方法。但是，该方法受到LLM的有限上下文窗口的约束，导致性能和效率随着编辑数量的增加而降低。为了克服这一限制，我们提出了收入，这是一个灵活的框架，可增强LLMS通过明确的压缩和选择机制处理编辑环境的能力。具体而言，INCOMES将每个编辑上下文压缩到特殊要点令牌的键值（KV）缓存中，从而有效地处理多个编辑，而不会受模型上下文窗口的限制。此外，添加专门的跨意义模块以动态从要点池中选择最相关的信息，从而可以自适应地利用编辑信息。我们对各种编辑格式进行各种模型编辑基准进行实验，结果证明了我们方法的有效性和效率。

Title: Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy

Authors: Paramita Mirza, Lucas Weber, Fabian Küch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22157
Pdf URL: https://arxiv.org/pdf/2505.22157
Copy Paste: [[2505.22157]] Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy(https://arxiv.org/abs/2505.22157)
Keywords: llm
Abstract: Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this paper, we demonstrate that data selection can be both -- efficient and universal -- by using a multi-step pipeline in which we efficiently bin data points into groups, estimate quality using specialized models, and score difficulty with a robust, lightweight method. Task-based categorization allows us to control the composition of our final data -- crucial for finetuning multi-purpose models. To guarantee diversity, we improve upon previous work using embedding models and a clustering algorithm. This integrated strategy enables high-performance fine-tuning with minimal overhead.
摘要：最近的工作表明，LLMS的培训数据集可以大大减少采样，而不会明显恶化性能。但是，数据选择通常会导致高计算成本或仅限于狭窄的域。在本文中，我们通过使用多步管道来证明数据选择可以是高效和通用的，在该管道中，我们可以在该管道中有效地将数据点分为组，使用专用模型估算质量，并使用强大的轻量级方法进行得分难度。基于任务的分类使我们能够控制最终数据的组成 - 对于填充多功能模型至关重要。为了确保多样性，我们使用嵌入模型和聚类算法改进了以前的工作。这种综合策略可以用最少的开销进行高性能的微调。

Title: ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

Authors: Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22169
Pdf URL: https://arxiv.org/pdf/2505.22169
Copy Paste: [[2505.22169]] ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments(https://arxiv.org/abs/2505.22169)
Keywords: gpt, llm, prompt
Abstract: LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval - a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
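
To make the idea of estimating the number of prompt resamplings concrete, here is a generic moment-based sketch. It is not the ReliableEval estimator itself: it assumes a normal approximation, and the function names, the `half_width` tolerance, and the pilot scores are hypothetical.

```python
import math
import statistics

def moments(scores):
    """Return the sample mean and unbiased sample variance of per-prompt scores."""
    return statistics.fmean(scores), statistics.variance(scores)

def resamplings_needed(scores, half_width, z=1.96):
    """Smallest n with z * sqrt(var / n) <= half_width (normal approximation)."""
    _, var = moments(scores)
    return max(1, math.ceil(z * z * var / (half_width ** 2)))

# Hypothetical accuracies of one model on five paraphrases of the same prompt.
pilot = [0.70, 0.74, 0.66, 0.72, 0.68]
mean, var = moments(pilot)
n = resamplings_needed(pilot, half_width=0.01)
```

With these made-up pilot scores the first two moments are a mean of 0.70 and a variance of 0.001, so roughly 39 paraphrases would be needed for a 95% interval of about ±0.01 around the mean.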

Title: Reverse Preference Optimization for Complex Instruction Following

Authors: Xiang Huang, Ting-En Lin, Feiteng Fang, Yuchuan Wu, Hangyu Li, Yuzhong Qu, Fei Huang, Yongbin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22172
Pdf URL: https://arxiv.org/pdf/2505.22172
Copy Paste: [[2505.22172]] Reverse Preference Optimization for Complex Instruction Following(https://arxiv.org/abs/2505.22172)
Keywords: language model, gpt, llm
Abstract: Instruction following (IF) is a critical capability for large language models (LLMs). However, handling complex instructions with multiple constraints remains challenging. Previous methods typically select preference pairs based on the number of constraints they satisfy, introducing noise where chosen examples may fail to follow some constraints and rejected examples may excel in certain respects over the chosen ones. To address the challenge of aligning with multiple preferences, we propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect, alleviating the burden of extensive sampling and filtering to collect perfect responses. Besides, reversal also enlarges the gap between chosen and rejected responses, thereby clarifying the optimization direction and making it more robust to noise. We evaluate RPO on two multi-turn IF benchmarks, Sysbench and Multi-IF, demonstrating average improvements over the DPO baseline of 4.6 and 2.5 points (on Llama-3.1 8B), respectively. Moreover, RPO scales effectively across model sizes (8B to 70B parameters), with the 70B RPO model surpassing GPT-4o.

Title: Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

Authors: Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22179
Pdf URL: https://arxiv.org/pdf/2505.22179
Copy Paste: [[2505.22179]] Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design(https://arxiv.org/abs/2505.22179)
Keywords: language model
Abstract: Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at this https URL.

Title: Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon

Authors: Xuchen Ma, Jianxiang Yu, Wenming Shao, Bo Pang, Xiang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22184
Pdf URL: https://arxiv.org/pdf/2505.22184
Copy Paste: [[2505.22184]] Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon(https://arxiv.org/abs/2505.22184)
Keywords: llm, prompt
Abstract: Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks, presenting growing challenges for content moderation. Some users evade censorship by deliberately disguising toxic words through homophonic cloak, which necessitates the task of unveiling cloaked toxicity. Existing methods are mostly designed for English texts, while Chinese cloaked toxicity unveiling has not been solved yet. To tackle the issue, we propose C$^2$TU, a novel training-free and prompt-free method for Chinese cloaked toxic content unveiling. It first employs substring matching to identify candidate toxic words based on Chinese homo-graph and toxic lexicon. Then it filters those candidates that are non-toxic and corrects cloaks to be their corresponding toxicities. Specifically, we develop two model variants for filtering, which are based on BERT and LLMs, respectively. For LLMs, we address the auto-regressive limitation in computing word occurrence probability and utilize the full semantic contexts of a text sequence to reveal cloaked toxic words. Extensive experiments demonstrate that C$^2$TU can achieve superior performance on two Chinese toxic datasets. In particular, our method outperforms the best competitor by up to 71% on the F1 score and 35% on accuracy, respectively.
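
The first stage described above, substring matching against a lexicon through homophones, might look something like the sketch below. The tiny `PINYIN` table, the one-word `LEXICON` (a mild insult chosen for illustration), and the function names are assumptions; tones are ignored, and the BERT- or LLM-based filtering stage from the abstract is not implemented.

```python
# Hypothetical toneless pinyin table covering only the characters used below.
PINYIN = {"傻": "sha", "沙": "sha", "瓜": "gua", "呱": "gua", "你": "ni", "好": "hao"}
# Hypothetical lexicon with a single mild entry ("fool").
LEXICON = ["傻瓜"]

def to_pinyin(text):
    """Map each character to its pinyin; unknown characters map to themselves."""
    return tuple(PINYIN.get(ch, ch) for ch in text)

def find_candidates(text, lexicon=LEXICON):
    """Return (start, cloaked_span, lexicon_word) for homophone matches."""
    hits = []
    for word in lexicon:
        target = to_pinyin(word)
        k = len(word)
        for i in range(len(text) - k + 1):
            span = text[i:i + k]
            # Report only cloaked spans: same sound, different characters.
            if span != word and to_pinyin(span) == target:
                hits.append((i, span, word))
    return hits

candidates = find_candidates("你好沙呱")
```

Matching on sound alone over-generates, because many harmless character sequences share a pronunciation with a lexicon entry. That is why a separate filtering step over the candidates is needed.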

Title: Let's Predict Sentence by Sentence

Authors: Hyeonbin Hwang, Byeongguk Jeon, Seungone Kim, Jiyeon Kim, Hoyeon Chang, Sohee Yang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22202
Pdf URL: https://arxiv.org/pdf/2505.22202
Copy Paste: [[2505.22202]] Let's Predict Sentence by Sentence(https://arxiv.org/abs/2505.22202)
Keywords: language model, chain-of-thought
Abstract: Autoregressive language models (LMs) generate one token at a time, yet human reasoning operates over higher-level abstractions: sentences, propositions, and concepts. This contrast raises a central question: can LMs likewise learn to reason over structured semantic units rather than raw token sequences? In this work, we investigate whether pretrained LMs can be lifted into such abstract reasoning spaces by building on their learned representations. We present a framework that adapts a pretrained token-level LM to operate in sentence space by autoregressively predicting continuous embeddings of next sentences. We explore two embedding paradigms inspired by classical representation learning: 1) semantic embeddings, learned via autoencoding to preserve surface meaning; and 2) contextual embeddings, trained via next-sentence prediction to encode anticipatory structure. We evaluate both under two inference regimes: Discretized, which decodes each predicted embedding into text before re-encoding; and Continuous, which reasons entirely in embedding space for improved efficiency. Across four domains (mathematics, logic, commonsense, and planning), contextual embeddings under continuous inference show competitive performance with Chain-of-Thought (CoT) while reducing inference-time FLOPs on average by half. We also present early signs of scalability and modular adaptation. Finally, to visualize latent trajectories, we introduce SentenceLens, a diagnostic tool that decodes intermediate model states into interpretable sentences. Together, our results indicate that pretrained LMs can effectively transition to abstract, structured reasoning within latent embedding spaces.

Title: Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Authors: Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, Felix Stollenwerk, David Kaczér, Florian Mai, Lucie Flek, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Patrick Schramowski, Michael Fromm, Kristian Kersting
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22232
Pdf URL: https://arxiv.org/pdf/2505.22232
Copy Paste: [[2505.22232]] Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models(https://arxiv.org/abs/2505.22232)
Keywords: language model, llm
Abstract: High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.

Title: BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain

Authors: Yunsoo Kim, Yusuf Abdulle, Honghan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22240
Pdf URL: https://arxiv.org/pdf/2505.22240
Copy Paste: [[2505.22240]] BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain(https://arxiv.org/abs/2505.22240)
Keywords: language model, gpt, llm
Abstract: Biomedical reasoning often requires traversing interconnected relationships across entities such as drugs, diseases, and proteins. Despite the increasing prominence of large language models (LLMs), existing benchmarks lack the ability to evaluate multi-hop reasoning in the biomedical domain, particularly for queries involving one-to-many and many-to-many relationships. This gap leaves the critical challenges of biomedical multi-hop reasoning underexplored. To address this, we introduce BioHopR, a novel benchmark designed to evaluate multi-hop, multi-answer reasoning in structured biomedical knowledge graphs. Built from the comprehensive PrimeKG, BioHopR includes 1-hop and 2-hop reasoning tasks that reflect real-world biomedical complexities. Evaluations of state-of-the-art models reveal that O3-mini, a proprietary reasoning-focused model, achieves 37.93% precision on 1-hop tasks and 14.57% on 2-hop tasks, outperforming proprietary models such as GPT4O and open-source biomedical models including HuatuoGPT-o1-70B and Llama-3.3-70B. However, all models exhibit significant declines in multi-hop performance, underscoring the challenges of resolving implicit reasoning steps in the biomedical domain. By addressing the lack of benchmarks for multi-hop reasoning in biomedical domain, BioHopR sets a new standard for evaluating reasoning capabilities and highlights critical gaps between proprietary and open-source models while paving the way for future advancements in biomedical LLMs.

Title: MRT at SemEval-2025 Task 8: Maximizing Recovery from Tables with Multiple Steps

Authors: Maximiliano Hormazábal Lagos, Álvaro Bueno Saez, Héctor Cerezo-Costas, Pedro Alonso Doval, Jorge Alcalde Vesteiro
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.22264
Pdf URL: https://arxiv.org/pdf/2505.22264
Copy Paste: [[2505.22264]] MRT at SemEval-2025 Task 8: Maximizing Recovery from Tables with Multiple Steps(https://arxiv.org/abs/2505.22264)
Keywords: llm, prompt
Abstract: In this paper we present our approach to the SemEval 2025 Task 8 challenge, Question-Answering over Tabular Data. Our strategy leverages Python code generation with LLMs to interact with the table and answer the questions. The process is composed of multiple steps: understanding the content of the table, generating natural language instructions in the form of steps to follow in order to get the answer, translating these instructions to code, running it, and handling potential errors or exceptions. These steps use open-source LLMs and fine-grained, optimized prompts for each task (step). With this approach, we achieved a score of 70.50% for subtask 1.
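
The final steps of such a pipeline (run the generated code, then handle errors by asking again) can be sketched as a small retry loop. Everything here is an assumption made for illustration: `run_with_retries`, the `answer` variable convention, and `fake_llm`, which stands in for a real model call. The table-understanding and instruction-generation steps are not shown.

```python
def run_with_retries(table, question, generate_code, max_attempts=3):
    """Execute generated code against `table`, feeding errors back on failure.

    `generate_code(question, error)` is a hypothetical callable that returns
    Python source which must assign its result to a variable named `answer`.
    """
    error = None
    for _ in range(max_attempts):
        code = generate_code(question, error)
        scope = {"table": table}
        try:
            exec(code, scope)        # run the generated snippet
            return scope["answer"]   # KeyError here is also treated as a failure
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"  # passed to the next attempt
    raise RuntimeError(
        f"no working code after {max_attempts} attempts; last error: {error}"
    )

def fake_llm(question, error):
    """Stand-in for an LLM: wrong column name first, corrected after an error."""
    if error is None:
        return "answer = max(row['price'] for row in table)"
    return "answer = max(row['cost'] for row in table)"

table = [{"item": "a", "cost": 3}, {"item": "b", "cost": 7}]
result = run_with_retries(table, "What is the highest cost?", fake_llm)
```

Calling `exec` on model output is unsafe outside a sandbox, so a real system would isolate this step.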

Title: Compensating for Data with Reasoning: Low-Resource Machine Translation with LLMs

Authors: Samuel Frontull, Thomas Ströhle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22293
Pdf URL: https://arxiv.org/pdf/2505.22293
Copy Paste: [[2505.22293]] Compensating for Data with Reasoning: Low-Resource Machine Translation with LLMs(https://arxiv.org/abs/2505.22293)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in multilingual machine translation, sometimes even outperforming traditional neural systems. However, previous research has highlighted the challenges of using LLMs, particularly with prompt engineering, for low-resource languages. In this work, we introduce Fragment-Shot Prompting, a novel in-context learning method that segments input and retrieves translation examples based on syntactic coverage, along with Pivoted Fragment-Shot, an extension that enables translation without direct parallel data. We evaluate these methods using GPT-3.5, GPT-4o, o1-mini, LLaMA-3.3, and DeepSeek-R1 for translation between Italian and two Ladin variants, revealing three key findings: (1) Fragment-Shot Prompting is effective for translating into and between the studied low-resource languages, with syntactic coverage positively correlating with translation quality; (2) Models with stronger reasoning abilities make more effective use of retrieved knowledge, generally produce better translations, and enable Pivoted Fragment-Shot to significantly improve translation quality between the Ladin variants; and (3) prompt engineering offers limited, if any, improvements when translating from a low-resource to a high-resource language, where zero-shot prompting already yields satisfactory results. We publicly release our code and the retrieval corpora.

Title: Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing

Authors: Yifan Lu, Jing Li, Yigeng Zhou, Yihui Zhang, Wenya Wang, Xiucheng Li, Meishan Zhang, Fangming Liu, Jun Yu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22298
Pdf URL: https://arxiv.org/pdf/2505.22298
Copy Paste: [[2505.22298]] Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing(https://arxiv.org/abs/2505.22298)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) exhibit impressive language capabilities but remain vulnerable to malicious prompts and jailbreaking attacks. Existing knowledge editing methods for LLM detoxification face two major challenges. First, they often rely on entity-specific localization, making them ineffective against adversarial inputs without explicit entities. Second, these methods suffer from over-editing, where detoxified models reject legitimate queries, compromising overall performance. In this paper, we propose ToxEdit, a toxicity-aware knowledge editing approach that dynamically detects toxic activation patterns during forward propagation. It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively. This design ensures precise toxicity mitigation while preserving LLMs' general capabilities. To more accurately assess over-editing, we also enhance the SafeEdit benchmark by incorporating instruction-following evaluation tasks. Experimental results on multiple LLMs demonstrate that our ToxEdit outperforms previous state-of-the-art methods in both detoxification performance and safeguarding general capabilities of LLMs.

Title: If Pigs Could Fly... Can LLMs Logically Reason Through Counterfactuals?

Authors: Ishwar B Balappanawar, Vamshi Krishna Bonagiri, Anish R Joishy, Manas Gaur, Krishnaprasad Thirunarayan, Ponnurangam Kumaraguru
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22318
Pdf URL: https://arxiv.org/pdf/2505.22318
Copy Paste: [[2505.22318]] If Pigs Could Fly... Can LLMs Logically Reason Through Counterfactuals?(https://arxiv.org/abs/2505.22318)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) demonstrate impressive reasoning capabilities in familiar contexts, but struggle when the context conflicts with their parametric knowledge. To investigate this phenomenon, we introduce CounterLogic, a dataset containing 1,800 examples across 9 logical schemas, explicitly designed to evaluate logical reasoning through counterfactual (hypothetical knowledge-conflicting) scenarios. Our systematic evaluation of 11 LLMs across 6 different datasets reveals a consistent performance degradation, with accuracies dropping by 27% on average when reasoning through counterfactual information. We propose Self-Segregate, a prompting method enabling metacognitive awareness (explicitly identifying knowledge conflicts) before reasoning. Our method dramatically narrows the average performance gap from 27% to just 11%, while significantly increasing the overall accuracy (+7.5%). We discuss the implications of these findings and draw parallels to human cognitive processes, particularly on how humans disambiguate conflicting information during reasoning tasks. Our findings offer practical insights for understanding and enhancing LLMs' reasoning capabilities in real-world applications, especially where models must logically reason independently of their factual knowledge.

Title: Advancing Expert Specialization for Better MoE

Authors: Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, Xudong Jiang
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.22323
Pdf URL: https://arxiv.org/pdf/2505.22323
Copy Paste: [[2505.22323]] Advancing Expert Specialization for Better MoE(https://arxiv.org/abs/2505.22323)
Keywords: language model, llm
Abstract: Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.
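
The abstract names the two objectives but does not give their formulas, so the sketch below shows only one plausible reading of each, with both definitions being assumptions. The orthogonality term is taken as the mean squared cosine similarity between expert representation vectors, and the variance term as the negated per-token variance of routing probabilities (so that minimizing it pushes routing away from uniform).

```python
def orthogonality_loss(expert_vecs):
    """Mean squared cosine similarity between distinct expert vectors.

    Assumed form: 0.0 when all experts are mutually orthogonal,
    1.0 when they all point in the same direction.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    pairs = [
        (i, j)
        for i in range(len(expert_vecs))
        for j in range(i + 1, len(expert_vecs))
    ]
    return sum(cos(expert_vecs[i], expert_vecs[j]) ** 2 for i, j in pairs) / len(pairs)

def variance_loss(router_probs):
    """Negative mean per-token variance of routing probabilities.

    Assumed form: uniform routing gives 0; peaked routing gives a lower value.
    """
    total = 0.0
    for p in router_probs:
        m = sum(p) / len(p)
        total += sum((x - m) ** 2 for x in p) / len(p)
    return -total / len(router_probs)

overlapping = orthogonality_loss([[1.0, 0.0], [1.0, 0.0]])
distinct = orthogonality_loss([[1.0, 0.0], [0.0, 1.0]])
uniform = variance_loss([[0.5, 0.5]])
peaked = variance_loss([[1.0, 0.0]])
```

In training, terms like these would be added to the task loss and the existing load-balancing loss with small weights; those weights are not stated in the abstract.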

Title: NLP for Social Good: A Survey of Challenges, Opportunities, and Responsible Deployment

Authors: Antonia Karamolegkou, Angana Borah, Eunjung Cho, Sagnik Ray Choudhury, Martina Galletti, Rajarshi Ghosh, Pranav Gupta, Oana Ignat, Priyanka Kargupta, Neema Kotonya, Hemank Lamba, Sun-Joo Lee, Arushi Mangla, Ishani Mondal, Deniz Nazarova, Poli Nemkova, Dina Pisarevskaya, Naquee Rizwan, Nazanin Sabri, Dominik Stammbach, Anna Steinberg, David Tomás, Steven R Wilson, Bowen Yi, Jessica H Zhu, Arkaitz Zubiaga, Anders Søgaard, Alexander Fraser, Zhijing Jin, Rada Mihalcea, Joel R. Tetreault, Daryna Dementieva
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.22327
Pdf URL: https://arxiv.org/pdf/2505.22327
Copy Paste: [[2505.22327]] NLP for Social Good: A Survey of Challenges, Opportunities, and Responsible Deployment(https://arxiv.org/abs/2505.22327)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have unlocked unprecedented possibilities across a range of applications. However, as a community, we believe that the field of Natural Language Processing (NLP) has a growing need to approach deployment with greater intentionality and responsibility. In alignment with the broader vision of AI for Social Good (Tomašev et al., 2020), this paper examines the role of NLP in addressing pressing societal challenges. Through a cross-disciplinary analysis of social goals and emerging risks, we highlight promising research directions and outline challenges that must be addressed to ensure responsible and equitable progress in NLP4SG research.

Title: Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Authors: Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22334
Pdf URL: https://arxiv.org/pdf/2505.22334
Copy Paste: [[2505.22334]] Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start(https://arxiv.org/abs/2505.22334)
Keywords: language model, llm, chain-of-thought
Abstract: Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns, where models exhibit self-correction through reflection, are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% $\rightarrow$ 73.4% on MathVista, 62.9% $\rightarrow$ 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at this https URL.

Title: Text2Grad: Reinforcement Learning from Natural Language Feedback

Authors: Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22338
Pdf URL: https://arxiv.org/pdf/2505.22338
Copy Paste: [[2505.22338]] Text2Grad: Reinforcement Learning from Natural Language Feedback(https://arxiv.org/abs/2505.22338)
Keywords: language model, prompt
Abstract: Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level reward on answer while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available at this https URL

Title: LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High

Authors: Judith Sieker, Clara Lachenmaier, Sina Zarrieß
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22354
Pdf URL: https://arxiv.org/pdf/2505.22354
Copy Paste: [[2505.22354]] LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High(https://arxiv.org/abs/2505.22354)
Keywords: gpt, llm
Abstract: This paper examines how LLMs handle false presuppositions and whether certain linguistic factors influence their responses to falsely presupposed content. Presuppositions subtly introduce information as given, making them highly effective at embedding disputable or false information. This raises concerns about whether LLMs, like humans, may fail to detect and correct misleading assumptions introduced as false presuppositions, even when the stakes of misinformation are high. Using a systematic approach based on linguistic presupposition analysis, we investigate the conditions under which LLMs are more or less prone to adopting or rejecting false presuppositions. Focusing on political contexts, we examine how factors like linguistic construction, political party, and scenario probability impact the recognition of false presuppositions. We conduct experiments with a newly created dataset and examine three LLMs: OpenAI's GPT-4-o, Meta's Llama-3-8B, and MistralAI's Mistral-7B-v03. Our results show that the models struggle to recognize false presuppositions, with performance varying by condition. This study highlights that linguistic presupposition analysis is a valuable tool for uncovering the reinforcement of political misinformation in LLM responses.
摘要：本文探讨了LLM如何处理错误的预设以及某些语言因素是否影响其对错误的预设内容的反应。预设巧妙地介绍了给出的信息，使其在嵌入有争议或虚假信息方面非常有效。这引起了人们对像人类这样的LLM的担忧是否可能无法检测并纠正以虚假的预设引入的误导性假设，即使误导性的赌注很高。使用基于语言预设分析的系统方法，我们研究了LLM或多或少敏感或拒绝虚假预设的条件。为了关注政治背景，我们研究了语言建设，政党和场景概率等因素如何影响识别错误的预设。我们使用新创建的数据集进行实验，并检查三个LLM：OpenAI的GPT-4-O，Meta的Llama-3-8B和Mistralai的Mistral-7b-V03。我们的结果表明，模型难以识别错误的预设，并且性能因条件而变化。这项研究强调，语言的前提分析是发现LLM反应中政治错误信息的有价值的工具。

Title: Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition

Authors: Hanting Chen, Yasheng Wang, Kai Han, Dong Li, Lin Li, Zhenni Bi, Jinpeng Li, Haoyu Wang, Fei Mi, Mingjian Zhu, Bin Wang, Kaikai Song, Yifei Fu, Xu He, Yu Luo, Chong Zhu, Quan He, Xueyu Wu, Wei He, Hailin Hu, Yehui Tang, Dacheng Tao, Xinghao Chen, Yunhe Wang, Other Contributors
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22375
Pdf URL: https://arxiv.org/pdf/2505.22375
Copy Paste: [[2505.22375]] Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition(https://arxiv.org/abs/2505.22375)
Keywords: language model, llm
Abstract: This work presents Pangu Embedded, an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs), featuring flexible fast and slow thinking capabilities. Pangu Embedded addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. We propose a two-stage training framework for its construction. In Stage 1, the model is finetuned via an iterative distillation process, incorporating inter-iteration model merging to effectively aggregate complementary knowledge. This is followed by reinforcement learning on Ascend clusters, optimized by a latency-tolerant scheduler that combines stale synchronous parallelism with prioritized data queues. The RL process is guided by a Multi-source Adaptive Reward System (MARS), which generates dynamic, task-specific reward signals using deterministic metrics and lightweight LLM evaluators for mathematics, coding, and general problem-solving tasks. Stage 2 introduces a dual-system framework, endowing Pangu Embedded with a "fast" mode for routine queries and a deeper "slow" mode for complex inference. This framework offers both manual mode switching for user control and an automatic, complexity-aware mode selection mechanism that dynamically allocates computational resources to balance latency and reasoning depth. Experimental results on benchmarks including AIME 2024, GPQA, and LiveCodeBench demonstrate that Pangu Embedded, with 7B parameters, outperforms similar-size models like Qwen3-8B and GLM4-9B. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture, highlighting a promising direction for developing powerful yet practically deployable LLM reasoners.
摘要：这项工作介绍了Pangu嵌入式的，这是一种有效的大语言模型（LLM）推理器，它是在上升神经处理单元（NPU）上开发的，具有灵活的快速和缓慢的思维功能。 Pangu嵌入式问题解决了现有推理优化的LLM中普遍存在的重大计算成本和推理潜伏期挑战。我们为其建设提出了一个两阶段的培训框架。在阶段1中，该模型通过迭代蒸馏过程进行了固定，并结合了合并有效汇总互补知识的间介质模型。接下来是对上升群集的加强学习，该学习由延迟耐受性调度程序优化，该调度程序将陈旧的同步并行性与优先的数据队列相结合。 RL过程以多源自适应奖励系统（MARS）为指导，该系统使用确定性指标和轻量级LLM评估器来生成动态的，特定于任务的奖励信号，用于数学，编码和一般问题解决任务。第2阶段引入了双系统框架，赋予了pangu，并具有“快速”模式的常规查询模式，并以更深入的“慢速”模式进行复杂的推理。该框架既提供了用于用户控制的手动模式切换，也提供了自动，复杂性感知模式的选择机制，该机制可以动态分配计算资源以平衡延迟和推理深度。在包括AIME 2024，GPQA和LiveCodeBench在内的基准上的实验结果表明，Pangu嵌入了7B参数，优于QWEN3-8B和GLM4-9B（例如QWEN3-8B和GLM4-9B）的大小相似大小。它在单个统一的模型体系结构中提供了快速的响应和最新的推理质量，突出了开发强大但实际上可部署的LLM推理器的有希望的方向。

Title: RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning

Authors: Kun Li, Yunxiang Li, Tianhua Zhang, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22430
Pdf URL: https://arxiv.org/pdf/2505.22430
Copy Paste: [[2505.22430]] RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning(https://arxiv.org/abs/2505.22430)
Keywords: llm, prompt, retrieval-augmented generation
Abstract: Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems. However, current LLM-based evaluation frameworks predominantly rely on directly prompting resource-intensive models with complex multi-stage prompts, underutilizing models' reasoning capabilities and introducing significant computational cost. In this paper, we present RAG-Zeval (RAG-Zero Evaluator), a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments with detailed explanation in one-pass. We introduce a ranking-based outcome reward mechanism, using preference judgments rather than absolute scores, to address the challenge of obtaining precise pointwise reward signals. To this end, we synthesize the ranking references by generating quality-controlled responses with zero human annotation. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments and outperforming baselines that rely on LLMs with 10-100 times more parameters. Our approach also exhibits superior interpretability in response evaluation.
摘要：强大的评估对于部署值得信赖的检索生成（RAG）系统至关重要。但是，当前基于LLM的评估框架主要依赖于直接提示具有复杂多阶段提示的资源密集型模型，使模型的推理功能不足并引入了大量的计算成本。在本文中，我们提出了Rag-Zeval（Rag-Zero评估器），这是一个新颖的端到端框架，旨在提出忠诚和正确性评估作为规则引导的推理任务。我们的方法培训评估人员的加强学习，促进紧凑的模型，以一通中的详细说明来产生全面和合理的评估。我们使用偏好判断而不是绝对分数介绍了基于排名的结果奖励机制，以应对获得精确的奖励信号的挑战。为此，我们通过以零人类注释生成质量控制的响应来综合排名参考。实验证明了Rag-Zeval的出色表现，与人类判断和表现优于依赖于LLM的依赖10-100倍参数的llms的相关性最强。我们的方法还表现出卓越的解释性评估。

Title: Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Authors: Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22453
Pdf URL: https://arxiv.org/pdf/2505.22453
Copy Paste: [[2505.22453]] Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO(https://arxiv.org/abs/2505.22453)
Keywords: language model, llm
Abstract: Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math), using a standard dataset without ground-truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at this https URL.
摘要：在训练后阶段改善多模式的大语言模型（MLLM）通常依赖于监督的微调（SFT）或增强学习（RL）。但是，这些监督的方法需要昂贵且手动注释的多模式数据 - 最终是不可持续的资源。尽管最近的努力探讨了无监督的训练后，但他们的方法很复杂且难以迭代。在这项工作中，我们是第一个调查GRPO的使用，GRPO是一种稳定且可扩展的在线RL算法，可以在没有任何外部监督的情况下继续进行自我改善。我们提出了MM-upt，这是MLLM无监督后培训的简单而有效的框架。 MM-upt建立在GRPO的基础上，基于多数采样响应的多数投票，用自我奖励机制代替了传统的奖励信号。我们的实验表明，MM-upt显着提高了QWEN2.5-VL-7B的推理能力（例如，Mathvista上的66.3％$ \ rightarrow $ 72.9％，62.9％$ \ rightarrow $ 68.7％，使用WE-MATH上的68.7％），使用标准数据集中没有地面真相标签。 MM爆发也表现优于事先无监督的基线，甚至接近受监督的GRPO的结果。此外，我们表明，合成仅由MLLM本身产生的合成问题也可以提高性能，从而强调了一种有希望的可扩展自我改善方法。总体而言，在没有外部监督的情况下，MM-upt提供了一个新的范式，可用于持续，自主增强MLLM。我们的代码可在此HTTPS URL上找到。
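The self-rewarding step described in the MM-UPT abstract can be illustrated with a minimal sketch. It assumes each sampled response has already been reduced to a final answer string, gives reward 1.0 to answers that match the majority answer, and then applies the group-normalized advantage commonly used with GRPO. The function names, the exact-match comparison, and the epsilon value are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Reward 1.0 for answers equal to the most common answer, else 0.0.

    Assumption: answers are compared by exact string match.
    """
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled answers to the same question.
answers = ["72", "72", "68", "72"]
rewards = majority_vote_rewards(answers)
advantages = group_normalized_advantages(rewards)
print(rewards)
print(advantages)
```

When every sample agrees, all rewards are equal and the advantages collapse to zero, so such a group contributes no learning signal under this scheme.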

Title: EvolveSearch: An Iterative Self-Evolving Search Agent

Authors: Dingchu Zhang, Yida Zhao, Jialong Wu, Baixuan Li, Wenbiao Yin, Liwen Zhang, Yong Jiang, Yufeng Li, Kewei Tu, Pengjun Xie, Fei Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22501
Pdf URL: https://arxiv.org/pdf/2505.22501
Copy Paste: [[2505.22501]] EvolveSearch: An Iterative Self-Evolving Search Agent(https://arxiv.org/abs/2505.22501)
Keywords: language model, llm, agent
Abstract: The rapid advancement of large language models (LLMs) has transformed the landscape of agentic information-seeking capabilities through the integration of tools such as search engines and web browsers. However, current mainstream approaches for enabling LLM web search proficiency face significant challenges: supervised fine-tuning struggles with data production in open-search domains, while RL converges quickly, limiting its data utilization efficiency. To address these issues, we propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data. Extensive experiments on seven multi-hop question-answering (MHQA) benchmarks demonstrate that EvolveSearch consistently improves performance across iterations, ultimately achieving an average improvement of 4.7% over the current state of the art across seven benchmarks, opening the door to self-evolution agentic capabilities in open web search domains.
摘要：大型语言模型（LLMS）的快速发展已通过集成诸如搜索引擎和Web浏览器之类的工具来改变了代理信息寻求功能的景观。但是，当前启用LLM Web搜索能力的当前主流方法面临着重大挑战：在开放式搜索域中与数据生产进行的微调斗争，而RL迅速收敛，从而限制了他们的数据利用效率。为了解决这些问题，我们提出了Evolvesearch，这是一种新型的迭代自我进化框架，结合了SFT和RL，以增强代理Web搜索功能，而无需任何外部人类宣传的推理数据。对七个多跳问答（MHQA）的基准进行的广泛实验表明，EvolvesEarch始终提高跨迭代的性能，最终取得了4.7％的平均改善，比当前的七个基准测试中的最新先进，在开放的Web搜索范围内开放了对大门的代理能力，在开放的Web搜索域中为大门开放了。

Title: Multi-MLLM Knowledge Distillation for Out-of-Context News Detection

Authors: Yimeng Gu, Zhao Tong, Ignacio Castro, Shu Wu, Gareth Tyson
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2505.22517
Pdf URL: https://arxiv.org/pdf/2505.22517
Copy Paste: [[2505.22517]] Multi-MLLM Knowledge Distillation for Out-of-Context News Detection(https://arxiv.org/abs/2505.22517)
Keywords: language model, gpt, llm, prompt
Abstract: Multimodal out-of-context news is a type of misinformation in which the image is used outside of its original context. Many existing works have leveraged multimodal large language models (MLLMs) for detecting out-of-context news. However, observing the limited zero-shot performance of smaller MLLMs, they generally require label-rich fine-tuning and/or expensive API calls to GPT models to improve the performance, which is impractical in low-resource scenarios. In contrast, we aim to improve the performance of small MLLMs in a more label-efficient and cost-effective manner. To this end, we first prompt multiple teacher MLLMs to generate both label predictions and corresponding rationales, which collectively serve as the teachers' knowledge. We then introduce a two-stage knowledge distillation framework to transfer this knowledge to a student MLLM. In Stage 1, we apply LoRA fine-tuning to the student model using all training data. In Stage 2, we further fine-tune the student model using both LoRA fine-tuning and DPO on the data points where teachers' predictions conflict. This two-stage strategy reduces annotation costs and helps the student model uncover subtle patterns in more challenging cases. Experimental results demonstrate that our approach achieves state-of-the-art performance using less than 10% labeled data.
摘要：多模式外观新闻是一种错误信息，其中图像在其原始上下文之外使用。许多现有的作品利用了多模式大型语言模型（MLLM）来检测神秘的新闻。但是，在观察较小的MLLM的零拍摄性能有限的情况下，它们通常需要富含标签的微调和/或昂贵的API调用GPT型号以改善性能，这在低资源场景中是不切实际的。相比之下，我们旨在以更有效和具有成本效益的方式提高小型MLLM的性能。为此，我们首先提示多个教师MLLM产生标签预测和相应的理由，这些预测共同充当教师的知识。然后，我们引入了两个阶段的知识蒸馏框架，以将这些知识转移到学生MLLM。在第1阶段，我们使用所有培训数据将Lora微调应用于学生模型。在第2阶段，我们在教师预测冲突的数据点上使用Lora微调和DPO进行了对学生模型的进一步调整。这种两阶段的策略降低了注释成本，并帮助学生模型在更具挑战性的情况下发现微妙的模式。实验结果表明，我们的方法使用少于10％的标记数据实现了最先进的性能。

Title: Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs

Authors: Changhao Song, Yazhou Zhang, Peng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22548
Pdf URL: https://arxiv.org/pdf/2505.22548
Copy Paste: [[2505.22548]] Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs(https://arxiv.org/abs/2505.22548)
Keywords: llm
Abstract: Emotion understanding includes basic tasks (e.g., sentiment/emotion classification) and advanced tasks (e.g., sarcasm/humor detection). Current methods rely on fixed-length CoT reasoning, failing to adapt to the varying complexity of emotions. We propose a task-adaptive reasoning framework that employs DeepSeek-R1 to generate variable-length reasoning chains for different emotion tasks. By combining fine-tuning with reinforcement learning, we design a composite reward function that balances four objectives: prediction accuracy, adaptive reasoning depth control, structural diversity in reasoning paths, and suppression of repetitive logic. This approach achieves dynamic context-sensitive inference while enabling LLMs to autonomously develop deep reasoning capabilities. Experimental results demonstrate consistent improvements in both Acc and F1 scores across four tasks: emotion, sentiment, humor, and sarcasm. Notably, peak enhancements reached 3.56% F1 (2.76% Acc) for basic tasks and 37.95% F1 (23.14% Acc) for advanced tasks. Our work bridges rigid CoT reasoning and emotional complexity through adaptive-depth analysis.
摘要：情绪理解包括基本任务（例如情感/情感分类）和高级任务（例如讽刺/幽默检测）。当前的方法依赖于固定长度的COT推理，无法适应情绪的不同复杂性。我们提出了一个任务自适应推理框架，该框架采用DeepSeek-R1来为不同的情感任务生成可变的长度推理。通过将微调与增强学习相结合，我们设计了一个复合奖励功能，可以平衡四个目标：预测准确性，自适应推理深度控制，推理路径中的结构多样性以及对重复逻辑的抑制。这种方法可以实现动态上下文敏感的推理，同时使LLMS能够自主发展深度的推理能力。实验结果表明，在四个任务中，ACC和F1得分的一致性得到了持续的改善：情绪，情感，幽默和讽刺。值得注意的是，针对基本任务的峰值增强功能达到3.56％F1（2.76％ACC），高级任务达到37.95％的F1（23.14％ACC）。我们的工作通过自适应深度分析桥梁僵化的COT推理和情绪复杂性。

Title: ClaimPKG: Enhancing Claim Verification via Pseudo-Subgraph Generation with Lightweight Specialized LLM

Authors: Hoang Pham, Thanh-Do Nguyen, Khac-Hoai Nam Bui
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2505.22552
Pdf URL: https://arxiv.org/pdf/2505.22552
Copy Paste: [[2505.22552]] ClaimPKG: Enhancing Claim Verification via Pseudo-Subgraph Generation with Lightweight Specialized LLM(https://arxiv.org/abs/2505.22552)
Keywords: language model, llm
Abstract: Integrating knowledge graphs (KGs) to enhance the reasoning capabilities of large language models (LLMs) is an emerging research challenge in claim verification. While KGs provide structured, semantically rich representations well-suited for reasoning, most existing verification methods rely on unstructured text corpora, limiting their ability to effectively leverage KGs. Additionally, despite possessing strong reasoning abilities, modern LLMs struggle with multi-step modular pipelines and reasoning over KGs without adaptation. To address these challenges, we propose ClaimPKG, an end-to-end framework that seamlessly integrates LLM reasoning with structured knowledge from KGs. Specifically, the main idea of ClaimPKG is to employ a lightweight, specialized LLM to represent the input claim as pseudo-subgraphs, guiding a dedicated subgraph retrieval module to identify relevant KG subgraphs. These retrieved subgraphs are then processed by a general-purpose LLM to produce the final verdict and justification. Extensive experiments on the FactKG dataset demonstrate that ClaimPKG achieves state-of-the-art performance, outperforming strong baselines in this research field by 9%-12% accuracy points across multiple categories. Furthermore, ClaimPKG exhibits zero-shot generalizability to unstructured datasets such as HoVer and FEVEROUS, effectively combining structured knowledge from KGs with LLM reasoning across various LLM backbones.
摘要：集成知识图（KGS）以增强大语模型（LLMS）的推理能力（LLMS）是索赔验证的新兴研究挑战。尽管KGS提供了适合推理的结构化，语义上丰富的表示，但大多数现有的验证方法都依赖于非结构化的文本语料库，从而限制了它们有效利用KGS的能力。此外，尽管具有强大的推理能力，但现代LLM仍在多步模块化管道上挣扎，而在没有适应的情况下进行了推理。为了应对这些挑战，我们提出了索赔，这是一个端到端的框架，将LLM推理与KGS的结构化知识无缝整合。具体而言，SoipePKG的主要思想是采用轻巧的专业LLM来表示输入索赔为伪用图，指导一个专用的子图检索模块来识别相关的KG子绘图。然后，通过通用LLM处理这些检索的子图，以产生最终的判决和理由。对FACTKG数据集的广泛实验表明，SopelPKG实现了最先进的性能，在该研究领域的强大基本线的表现优于多个类别的9％-12％的精度。此外，SopelPKG对非结构化数据集（如悬停和发烧，有效地结合了KGS的结构化知识与LLM推理）在各种LLM骨架上的推理表现出零变量的概括性。

Title: Do Large Language Models Think Like the Brain? Sentence-Level Evidence from fMRI and Hierarchical Embeddings

Authors: Yu Lei, Xingyang Ge, Yi Zhang, Yiming Yang, Bolei Ma
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2505.22563
Pdf URL: https://arxiv.org/pdf/2505.22563
Copy Paste: [[2505.22563]] Do Large Language Models Think Like the Brain? Sentence-Level Evidence from fMRI and Hierarchical Embeddings(https://arxiv.org/abs/2505.22563)
Keywords: language model, llm
Abstract: Understanding whether large language models (LLMs) and the human brain converge on similar computational principles remains a fundamental and important question in cognitive neuroscience and AI. Do the brain-like patterns observed in LLMs emerge simply from scaling, or do they reflect deeper alignment with the architecture of human language processing? This study focuses on the sentence-level neural mechanisms of language models, systematically investigating how hierarchical representations in LLMs align with the dynamic neural responses during human sentence comprehension. By comparing hierarchical embeddings from 14 publicly available LLMs with fMRI data collected from participants, who were exposed to a naturalistic narrative story, we constructed sentence-level neural prediction models to precisely identify the model layers most significantly correlated with brain region activations. Results show that improvements in model performance drive the evolution of representational architectures toward brain-like hierarchies, particularly achieving stronger functional and anatomical correspondence at higher semantic abstraction levels.
摘要：了解大型语言模型（LLM）和人脑是否在类似的计算原理上汇聚仍然是认知神经科学和AI中的一个基本和重要问题。在LLM中观察到的类似大脑的模式是否仅仅是由于缩放而出现的，还是反映了与人类语言处理的结构更深的一致性？这项研究的重点是语言模型的句子级神经机制，系统地研究了LLMS中的层次结构表示如何与人类句子理解过程中的动态神经反应保持一致。通过将14个公开LLMS的层次嵌入与从参与者中收集的fMRI数据进行比较，他们接触了自然主义的叙事故事，我们构建了句子级的神经预测模型，以精确地识别与大脑区域激活最显着相关的模型层。结果表明，模型性能的改善推动了代表体系结构向类似大脑的层次结构的演变，尤其是在较高的语义抽象水平下实现了更强的功能和解剖对应关系。

Title: Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems

Authors: Hoang Pham, Khac-Hoai Nam Bui
Subjects: cs.CL, cs.AI, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2505.22571
Pdf URL: https://arxiv.org/pdf/2505.22571
Copy Paste: [[2505.22571]] Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems(https://arxiv.org/abs/2505.22571)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: This paper presents a novel approach for unified retrieval-augmented generation (RAG) systems using the recent emerging large language model (LLM) agent concept. Specifically, Agent LLM, which utilizes LLM as fundamental controllers, has become a promising approach to enable the interpretability of RAG tasks, especially for complex reasoning question-answering systems (e.g., multi-hop queries). Nonetheless, previous works mainly focus on solving RAG systems with either single-hop or multi-hop approaches separately, which limits the application of those approaches to real-world applications. In this study, we propose a trainable agent framework called Agent-UniRAG for unified retrieval-augmented LLM systems, which enhances the effectiveness and interpretability of RAG systems. The main idea is to design an LLM agent framework to solve RAG tasks step-by-step based on the complexity of the inputs, simultaneously including single-hop and multi-hop queries in an end-to-end manner. Furthermore, we introduce SynAgent-RAG, a synthetic dataset to enable the proposed agent framework for small open-source LLMs (e.g., Llama-3-8B). The results show comparable performances with closed-source and larger open-source LLMs across various RAG benchmarks. Our source code and dataset are publicly available for further exploitation.
摘要：本文提出了一种使用最新新兴的大语模型（LLM）代理概念的统一检索生成（RAG）系统的新颖方法。具体而言，将LLM用作基本控制器的Agent LLM已成为一种有前途的方法来实现抹布任务的可解释性，尤其是对于复杂的推理提问系统（例如，多跳查询）。尽管如此，以前的作品主要集中于分别使用单跳或多跳方法求解抹布系统，这限制了这些方法在现实世界应用中的应用。在这项研究中，我们提出了一个可训练的代理框架，称为Agent-unirag用于统一检索的LLM系统，从而增强了抹布系统的有效性和解释性。主要思想是设计一个LLM代理框架，以基于输入的复杂性逐步求解抹布任务，同时以端到端的方式包括单跳和多跳的查询。此外，我们引入了犹太教堂，这是一种合成数据集，以实现小型开源LLMS（例如Llama-3-8B）的建议代理框架。结果显示出在各种抹布基准的封闭源和较大的开源LLMS和较大的开源LLM的表现。我们的源代码和数据集可公开使用。

Title: Fusion Steering: Prompt-Specific Activation Control

Authors: Waldemar Chang, Alhassan Yasin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22572
Pdf URL: https://arxiv.org/pdf/2505.22572
Copy Paste: [[2505.22572]] Fusion Steering: Prompt-Specific Activation Control(https://arxiv.org/abs/2505.22572)
Keywords: language model, llm, prompt
Abstract: We present Fusion Steering, an activation steering methodology that improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Unlike traditional methods constrained to single-layer or fixed-layer operations, Fusion Steering employs dynamic injection of prompt-specific activation deltas across all transformer layers. These activation deltas are derived from reference completions that combine the ground-truth answer with a model-generated explanation to facilitate semantically enriched, example-specific steering. The injection weights are optimized per prompt using Optuna, targeting a joint objective that balances token overlap (factual alignment) and perplexity (fluency proxy). Evaluation employs a composite score integrating token overlap and LLM-graded quality, encompassing factual accuracy, coherence, and relevance. Empirical results on 260 SimpleQA prompts (selected from 500 where the baseline failed) showcase the efficacy of segmented steering. Using Gemma-2-2B-IT with 8-bit quantization, segmented steering achieves an accuracy of 25.4% (outputs scoring ≥ 0.6), outperforming the baseline at 3.5% and full-layer steering at 16.2%. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%. These findings highlight the strengths of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control. Fusion Steering is also amenable to sparse representations, such as Neuronpedia or sparse crosscoders, suggesting a promising direction for interpretable and scalable activation-level control in LLMs.
摘要：我们提出了Fusion转向，这是一种激活转向方法，可提高大语模型（LLMS）的事实准确性（QA）任务。这种方法引入了灵活的转向配置，包括全层转向和分段转向。与限制单层或固定层操作的传统方法不同，Fusion转向在所有变压器层中采用了及时的特异性激活三角洲的动态注入。这些激活增量来自参考完成，将基础真实答案与模型生成的解释相结合，以促进语义丰富的，示例特定的转向。使用Optuna对注射权重进行优化，以平衡令牌重叠（事实对准）和困惑（Flumenty代理）的关节目标。评估采用集成令牌重叠和LLM分级质量的综合分数，包括事实准确性，连贯性和相关性。 260个简单QA提示（从基线失败的500中选择）的经验结果展示了分割转向的功效。使用具有8位量化的Gemma-2-2B-IT，分段转向的准确度为25.4％（输出得分$ \ geq 0.6 $），以3.5％的比率优于基线，全层转向以16.2％。在更严格的SimpleQA漏鲁式下，分段转向的促进将完全正确的响应从0.0％提高到13.1％。这些发现突出了分段，动态干预策略的优势以及每项准备下的全网络激活控制的希望。融合转向也适合稀疏表示，例如神经元或稀疏的交叉编码器，这表明在LLMS中可解释的可解释且可扩展的激活级控制有前途的方向。

Title: Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts

Authors: Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, Jie Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22582
Pdf URL: https://arxiv.org/pdf/2505.22582
Copy Paste: [[2505.22582]] Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts(https://arxiv.org/abs/2505.22582)
Keywords: language model, llm
Abstract: Continually expanding new languages for existing large language models (LLMs) is a promising yet challenging approach to building powerful multilingual LLMs. The biggest challenge is to make the model continuously learn new languages while preserving its proficiency in old languages. To achieve this, recent work utilizes the Mixture-of-Experts (MoE) architecture to expand new languages by adding new experts and avoids catastrophic forgetting of old languages by routing corresponding tokens to the original model backbone (old experts). Although intuitive, this kind of method is parameter-costly when expanding new languages and still inevitably impacts the performance of old languages. To address these limitations, we analyze the language characteristics of different layers in LLMs and propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Specifically, we find that different layers in LLMs exhibit different representation similarities between languages, and we then use the similarity as the indicator to allocate experts for each layer, i.e., the higher the similarity, the fewer the experts. Additionally, to further mitigate the forgetting of old languages, we add a classifier in front of the router network on the layers with higher similarity to guide the routing of old language tokens. Experimental results show that our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting and with 33.3% fewer experts in the lifelong-expansion setting, demonstrating the effectiveness of our method.
摘要：不断扩展现有大型语言模型（LLM）的新语言是建立强大的多语言LLM的一种有前途但充满挑战的方法。最大的挑战是使模型在保持旧语言能力的同时不断学习新语言。为了实现这一目标，最近的工作利用了专家（MOE）架构的混合物来扩展新的语言，并添加新专家并避免通过将相应的标记与原始模型骨干（老专家）路由相应的标记（老专家）来避免灾难性忘记。尽管直观，但是在扩展新语言时，这种方法是对参数的成本，并且仍然不可避免地会影响旧语言的性能。为了解决这些局限性，我们分析了LLMS中不同层的语言特征，并提出了层面专家分配算法（Layermoe），以确定每一层的适当数量的新专家数量。具体而言，我们发现LLMS中的不同层表在语言之间表现出不同的表示相似性，然后利用相似性作为指标来为每一层分配专家，即越高的相似性，专家越少。此外，为了进一步减轻旧语言的忘记，我们在路由器网络的前面添加了一个分类器，并具有更高的相似性，以指导旧语言令牌的路由。实验结果表明，我们的方法的表现优于先前的最先进的基线，而单一膨胀设置的专家则少60％，而在终生膨胀环境中的专家则减少了33.3％，这证明了我们方法的有效性。
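The allocation rule stated in the LayerMoE abstract (the higher the similarity, the fewer the experts) can be sketched as follows. The abstract does not give the actual formula, so this sketch makes an assumption: a fixed expert budget is split across layers in proportion to (1 - similarity), with largest-remainder rounding. The function name, the budget parameter, and the example similarity values are hypothetical.

```python
def allocate_experts(similarities, total_experts):
    """Split an expert budget across layers, favoring low-similarity layers.

    Assumption: the share of a layer is proportional to (1 - similarity).
    This is an illustrative rule, not the formula used in the paper.
    """
    weights = [max(0.0, 1.0 - s) for s in similarities]
    total_weight = sum(weights)
    if total_weight == 0:
        # Every layer is fully similar: fall back to an even split.
        weights = [1.0] * len(similarities)
        total_weight = float(len(similarities))

    shares = [total_experts * w / total_weight for w in weights]
    counts = [int(share) for share in shares]

    # Hand out the experts lost to rounding, largest remainder first.
    leftover = total_experts - sum(counts)
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical cross-language similarity per layer (higher = more similar).
layer_similarities = [0.8, 0.5, 0.2]
counts = allocate_experts(layer_similarities, total_experts=6)
print(counts)
```

Keeping the total budget fixed makes it easy to compare allocations against a uniform baseline that uses the same number of new experts.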

Title: Precise In-Parameter Concept Erasure in Large Language Models

Authors: Yoav Gur-Arieh, Clara Suslik, Yihuai Hong, Fazl Barez, Mor Geva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22586
Pdf URL: https://arxiv.org/pdf/2505.22586
Copy Paste: [[2505.22586]] Precise In-Parameter Concept Erasure in Large Language Models(https://arxiv.org/abs/2505.22586)
Keywords: language model, llm
Abstract: Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
摘要：大型语言模型（LLMS）通常在预处理过程中获得知识，而这些知识在下游部署（例如敏感信息或受版权保护的内容）中是不可取的。删除此类知识的现有方法取决于微调，培训低级适配器或事实级编辑，但是这些方法太粗糙，太浅或无效。在这项工作中，我们提出了双鱼座（用于概念擦除的精确参数抑制），这是一个新颖的框架，用于通过直接编辑在参数空间中编码它们的方向来精确地从模型参数中删除整个概念。双鱼座使用DISENTANGLER模型将MLP向量分解为可解释的功能，并使用自动化的可解释性技术识别与目标概念相关联的功能，并将其从模型参数中删除。关于Gemma 2和Llama 3.1在各种概念上的实验表明，双鱼座在领先的擦除方法上实现了适度的功效提高，将目标概念的准确性降低到低至7.7％，同时显着提高了擦除特异性（最高31％）和鲁棒性（最高38％）。总体而言，这些结果表明，基于功能的参数编辑可以使一种更精确，更可靠的方法来消除语言模型中的概念知识。

Title: Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning

Authors: Erxin Yu, Jing Li, Ming Liao, Qi Zhu, Boyang Xue, Minghui Xu, Baojun Wang, Lanqing Hong, Fei Mi, Lifeng Shang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22591
Pdf URL: https://arxiv.org/pdf/2505.22591
Copy Paste: [[2505.22591]] Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning(https://arxiv.org/abs/2505.22591)
Keywords: language model, gpt, llm
Abstract: Although large language models demonstrate strong performance across various domains, they still struggle with numerous bad cases in mathematical reasoning. Previous approaches to learning from errors synthesize training data by solely extrapolating from isolated bad cases, thereby failing to generalize the extensive patterns inherent within these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized targeted training data. Specifically, we explore a target model on two mathematical datasets, GSM8K and MATH, to pinpoint bad cases. Then, we generate error keyphrases for these cases based on the instructor model's (GPT-4o) analysis and identify error types by clustering these keyphrases. Next, we sample a few bad cases during each generation for each identified error type and input them into the instructor model, which synthesizes additional training data using a self-instruct approach. This new data is refined through a one-shot learning process to ensure that only the most effective examples are kept. Finally, we use these curated data to fine-tune the target model, iteratively repeating the process to enhance performance. We apply our framework to various models and observe improvements in their reasoning abilities across both in-domain and out-of-domain mathematics datasets. These results demonstrate the effectiveness of self-error instruction in improving LLMs' mathematical reasoning through error generalization.
摘要：尽管大型语言模型在各个领域都表现出强劲的表现，但它们仍在数学推理中与许多不良案例斗争。从错误中学习的先前方法通过仅从孤立的坏病例中推断出综合培训数据，从而无法概括这些情况下固有的广泛模式。本文介绍了自我纠正教学（SEI），该框架解决了这些模型弱点并综合了更广泛的目标培训数据。具体来说，我们在两个数学数据集（GSM8K和数学）上探索了一个目标模型，以查明不良情况。然后，我们根据指导员模型（GPT-4O）分析为这些情况生成错误键形，并通过聚集这些键形式来识别错误类型。接下来，我们在每一代中为每种已确定的错误类型采样了一些不良情况，并将其输入指导员模型，该模型使用自我实施方法综合了其他训练数据。这些新数据是通过一次性学习过程来完善的，以确保仅保留最有效的示例。最后，我们使用这些精选的数据来微调目标模型，迭代地重复该过程以增强性能。我们将框架应用于各种模型，并观察其在内域和室外数学数据集的推理能力的改进。这些结果证明了自我解释指导通过错误概括改善LLM的数学推理的有效性。

Title: Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Authors: Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22618
Pdf URL: https://arxiv.org/pdf/2505.22618
Copy Paste: [[2505.22618]] Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding(https://arxiv.org/abs/2505.22618)
Keywords: language model, llm
Abstract: Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
摘要：基于扩散的大语言模型（扩散LLM）已显示出具有并行解码功能的非自回归文本生成的希望。但是，由于缺乏键值（KV）缓存以及同时解码多个令牌时的质量下降，开源扩散LLM的实际推理速度通常会落后于自回归模型。为了弥合这一差距，我们引入了一种针对双向扩散模型量身定制的新颖的块级近似KV缓存机制，从而使缓存可重复使用，而性能降低可忽略不计。此外，我们确定并行解码中生成质量下降的根本原因是在条件独立性假设下令牌依赖性的破坏。为了解决这一问题，我们提出了一种置信度感知的并行解码策略，该策略有选择地解码超过置信度阈值的令牌，减轻依赖性违背并保持生成质量。在多个LLM基准上对LLaDA和Dream模型的实验结果表明，吞吐量最高提升27.6倍，精确度损失最小，缩小了与自回归模型的性能差距，并为扩散LLM的实际部署铺平了道路。
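
Illustrative sketch (not the authors' code): the confidence-aware parallel decoding idea in the abstract above can be approximated as a per-step selection rule. Among the still-masked positions, commit only those whose top-1 probability exceeds a threshold. The function name `select_tokens_to_decode`, the default threshold of 0.9, and the fallback of committing the single most confident position when nothing clears the threshold are all assumptions made for this example; the abstract does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens_to_decode(logits_per_position, masked_positions, threshold=0.9):
    """Pick which masked positions to commit in one parallel decoding step.

    Returns {position: token_id} for every masked position whose top-1
    probability exceeds `threshold`. Assumption (not from the abstract):
    if no position qualifies, the single most confident position is
    committed so that each step still makes progress.
    """
    candidates = {}
    for pos in masked_positions:
        probs = softmax(logits_per_position[pos])
        token_id = max(range(len(probs)), key=probs.__getitem__)
        candidates[pos] = (token_id, probs[token_id])

    chosen = {pos: tok for pos, (tok, conf) in candidates.items() if conf > threshold}
    if not chosen and candidates:
        best = max(candidates, key=lambda p: candidates[p][1])
        chosen[best] = candidates[best][0]
    return chosen


if __name__ == "__main__":
    # Toy logits for three masked positions over a 3-token vocabulary.
    toy_logits = {0: [10.0, 0.0, 0.0], 1: [1.0, 1.1, 0.9], 2: [0.0, 0.0, 8.0]}
    print(select_tokens_to_decode(toy_logits, [0, 1, 2]))
```

In this toy call, positions 0 and 2 are sharply peaked and are committed together, while the near-uniform position 1 stays masked for a later step.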

Title: Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs

Authors: Ziling Cheng, Meng Cao, Marc-Antoine Rondeau, Jackie Chi Kit Cheung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22630
Pdf URL: https://arxiv.org/pdf/2505.22630
Copy Paste: [[2505.22630]] Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs(https://arxiv.org/abs/2505.22630)
Keywords: language model, llm, hallucination
Abstract: The widespread success of large language models (LLMs) on NLP benchmarks has been accompanied by concerns that LLMs function primarily as stochastic parrots that reproduce texts similar to what they saw during pre-training, often erroneously. But what is the nature of their errors, and do these errors exhibit any regularities? In this work, we examine irrelevant context hallucinations, in which models integrate misleading contextual cues into their predictions. Through behavioral analysis, we show that these errors result from a structured yet flawed mechanism that we term class-based (mis)generalization, in which models combine abstract class cues with features extracted from the query or context to derive answers. Furthermore, mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types reveal that this behavior is reflected in the model's internal computations: (i) abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, (ii) feature selection is governed by two competing circuits -- one prioritizing direct query-based reasoning, the other incorporating contextual cues -- whose relative influences determine the final output. Our findings provide a more nuanced perspective on the stochastic parrot argument: through form-based training, LLMs can exhibit generalization leveraging abstractions, albeit in unreliable ways based on contextual cues -- what we term stochastic chameleons.
摘要：大型语言模型（LLM）在NLP基准测试中的广泛成功伴随着担心，即LLMS主要是随机鹦鹉的作用，这些随机鹦鹉重现了与预训练期间所见的文本相似的文本，通常是错误的。但是，他们错误的本质是什么，这些错误表现出任何规律性？在这项工作中，我们研究了无关紧要的环境幻觉，其中模型将误导性的上下文提示整合到其预测中。通过行为分析，我们表明这些错误是由我们根据类别的基于类（MIS）概括的结构化但有缺陷的机制引起的，其中模型将抽象的类提示与从查询或上下文中提取的功能相结合以获取答案。此外，在39个事实召回关系类型的39种对Llama-3，Mistral和Pythia上进行的机械性可解释性实验表明，这种行为反映在模型的内部计算中：（i）在较低的层中构建了较低的阶级表示，然后将抽象的答案置于较高的较优先范围的情况下，在较高的范围内，（ii）在较高的情况下进行了互动 - （ii）依据 - （ii）依赖于其他循环范围，则相关范围构建了相关性，相对于其他相关性，相对于其他相关性，相对于其他相关性，相对于其他相关范围，相对于互化，相对于较差的范围，则构建了相关性。影响决定最终输出。我们的发现提供了关于随机鹦鹉论点的更细微的观点：通过基于表格的培训，LLM可以表现出概括性化的抽象，尽管基于上下文提示，以不可靠的方式 - 我们称之为随机变色龙。

Title: Spatial Knowledge Graph-Guided Multimodal Synthesis

Authors: Yida Xue, Zhen Bi, Jinnan Yang, Jungang Lou, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.22633
Pdf URL: https://arxiv.org/pdf/2505.22633
Copy Paste: [[2505.22633]] Spatial Knowledge Graph-Guided Multimodal Synthesis(https://arxiv.org/abs/2505.22633)
Keywords: language model, llm
Abstract: Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, which is subsequently utilized to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
摘要：多模式大语言模型（MLLM）的最新进展显着增强了其能力。但是，他们的空间感知能力仍然是一个显着的限制。为了应对这一挑战，多模式数据综合提供了有希望的解决方案。但是，确保合成的数据遵守空间常识是一项非平凡的任务。在这项工作中，我们介绍了SKG2DATA，这是一种由空间知识图指导的新型多模式合成方法，其基于知识到数据的概念。 SKG2DATA自动构建空间知识图（SKG），以模拟对空间方向和距离的类似人类的感知，随后用来指导多模式数据综合。广泛的实验表明，从不同类型的空间知识（包括方向和距离）中综合的数据不仅增强了MLLM的空间感知和推理能力，而且还具有强大的概括能力。我们希望基于知识的数据综合的想法可以推动空间智能的发展。

Title: Learning Composable Chains-of-Thought

Authors: Fangcong Yin, Zeyu Leo Liu, Liu Leqi, Xi Ye, Greg Durrett
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22635
Pdf URL: https://arxiv.org/pdf/2505.22635
Copy Paste: [[2505.22635]] Learning Composable Chains-of-Thought(https://arxiv.org/abs/2505.22635)
Keywords: language model, llm, chain-of-thought
Abstract: A common approach for teaching large language models (LLMs) to reason is to train on chain-of-thought (CoT) traces of in-distribution reasoning problems, but such annotated data is costly to obtain for every problem of interest. We want reasoning models to generalize beyond their training distribution, and ideally to generalize compositionally: combine atomic reasoning skills to solve harder, unseen reasoning tasks. We take a step towards compositional generalization of reasoning skills when addressing a target compositional task that has no labeled CoT data. We find that simply training models on CoT data of atomic tasks leads to limited generalization, but minimally modifying CoT formats of constituent atomic tasks to be composable can lead to improvements. We can train "atomic CoT" models on the atomic tasks with Composable CoT data and combine them with multitask learning or model merging for better zero-shot performance on the target compositional task. Such a combined model can be further bootstrapped on a small amount of compositional data using rejection sampling fine-tuning (RFT). Results on string operations and natural language skill compositions show that training LLMs on Composable CoT outperforms multitask learning and continued fine-tuning baselines within a given training data budget.
摘要：教授大语模型（LLM）推理的一种常见方法是培训分发推理问题的三通链（COT）痕迹，但是对于每个感兴趣的问题，这种带注释的数据都代价高昂。我们希望推理模型超出其训练分布的概括，理想情况下以构图概括：结合原子推理技能，以解决更艰苦，看不见的推理任务。在解决没有标记COT数据的目标组成任务时，我们朝着推理技能的组成概括迈出了一步。我们发现，简单地训练原子任务的COT数据的模型会导致有限的概括，但是要组合的成分原子任务的COT格式的最小修改可能会导致改进。我们可以使用可组合的COT数据训练“原子COT”模型，并将其与多任务学习或模型合并结合使用，以在目标组成任务上更好地零拍摄性能。使用拒绝采样微调（RFT），可以在少量的组成数据上进一步进行这种组合模型。弦乐操作和自然语言技能组成的结果表明，在给定的培训数据预算中，培训可组合COT的LLM优于多任务学习和持续的微调基线。

Title: Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

Authors: Hanjia Lyu, Jiebo Luo, Jian Kang, Allison Koenecke
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.22645
Pdf URL: https://arxiv.org/pdf/2505.22645
Copy Paste: [[2505.22645]] Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese(https://arxiv.org/abs/2505.22645)
Keywords: language model, llm, prompt
Abstract: While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (this https URL).
摘要：尽管在简化和传统的中文中都研究了大语言模型（LLMS）的功能，但在这两个书面中文变体中提示时，LLMS尚不清楚LLM是否表现出差异性能。这种理解至关重要，因为LLM响应质量的差异可以通过忽略简化与传统汉语的不同文化背景来使代表性危害永久存在，并且可以加剧LLM在LLM促成的领域中的下游危害，例如教育或雇用教育或雇用。为了调查潜在的LLM性能差异，我们设计了两个反映现实情况的基准任务：区域术语选择（促使LLM命名了所描述的项目，在中国大陆和台湾大陆和区域名称选择中，其称为不同的项目（促使LLM从Simplified和传统中文中选择谁雇用了谁）。对于这两项任务，我们审核了11种领先的商业LLM服务和开源车型的性能 - 涵盖了主要接受英语，简化中文或传统中文培训的模型。我们的分析表明，LLM响应中的偏见取决于任务和提示语言：虽然大多数LLM在区域选择任务中都不成比例地简化中文响应，但它们在区域名称选择任务中令人惊讶地偏爱传统的中文名称。我们发现，这些差异可能是由于培训数据表示，书面角色偏好以及简化和传统中文的令牌化的差异。这些发现凸显了对LLM偏见进行进一步分析的必要性。因此，我们提供开源的基准数据集，以促进对跨语言变种（此HTTPS URL）的未来LLM行为的可重现评估。

Title: WebDancer: Towards Autonomous Information Seeking Agency

Authors: Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22648
Pdf URL: https://arxiv.org/pdf/2505.22648
Copy Paste: [[2505.22648]] WebDancer: Towards Autonomous Information Seeking Agency(https://arxiv.org/abs/2505.22648)
Keywords: agent
Abstract: Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectory sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in WebDancer, a web agent based on ReAct. Empirical evaluations on the challenging information seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The codes and demo will be released in this https URL.
摘要：解决复杂的现实世界问题需要深入的信息寻求和多步推理。代理系统的最新进展以深入的研究为例，强调了自动多步研究的潜力。在这项工作中，我们提出了一个有凝聚力的范式，用于从以数据为中心和培训阶段的角度来构建端到端的代理信息。我们的方法包括四个关键阶段：（1）浏览数据构建，（2）轨迹采样，（3）有效的冷启动进行微调，以及（4）增强概括的增强学习。我们根据React，WebDancer在Web代理中实例化此框架。对寻求基准的挑战性信息Gaia和Webwalkerqa的经验评估证明了WebDancer的出色表现，取得了可观的结果并突出了我们培训范式的功效。对代理训练的进一步分析为开发更有能力的代理模型提供了宝贵的见解和可行的系统途径。代码和演示将在此HTTPS URL中发布。

Title: The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

Authors: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22653
Pdf URL: https://arxiv.org/pdf/2505.22653
Copy Paste: [[2505.22653]] The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason(https://arxiv.org/abs/2505.22653)
Keywords: language model, llm
Abstract: Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as "first, I need to", without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at this https URL.
摘要：关于通过强化学习（RL）对大语言模型（LLM）进行后训练以提升推理能力的最新研究，通常关注可以准确验证和奖励的任务，例如解决数学问题。相比之下，我们的研究调查了奖励噪声的影响，这是对使用奖励模型进行LLM后训练的现实世界情景的更实际考虑。我们发现LLM对实质性奖励噪声表现出强大的鲁棒性。例如，手动翻转奖励函数在数学任务中40％的输出，仍然允许Qwen-2.5-7B模型实现快速收敛，将其在数学任务上的表现从5％提高到72％，而使用无噪声奖励训练的模型获得的精度为75％。令人惊讶的是，仅奖励关键推理短语的出现（即推理模式奖励，RPR），例如“首先，我需要”，而不验证答案的正确性，该模型即可达到下游性能的峰值（Qwen-2.5-7B的精度超过70％），可与使用严格正确性验证和准确奖励训练的模型相媲美。认识到推理过程相对于最终结果的重要性，我们将RPR与嘈杂的奖励模型相结合。RPR帮助校准了嘈杂的奖励模型，减轻潜在的假阴性并提高LLM在开放式任务上的表现。这些发现表明在预训练阶段提高模型基础能力的重要性，同时为推进后训练技术提供见解。我们的代码和脚本可在此HTTPS URL上找到。
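
Illustrative sketch (not the authors' code): the two reward manipulations described in the abstract above can be written as small functions. This assumes a binary correctness reward and independent per-sample flipping, neither of which the abstract states explicitly. The names `noisy_reward` and `reasoning_pattern_reward` are hypothetical, and the phrase list contains only the one example quoted in the abstract; the full phrase set is not given there.

```python
import random


def noisy_reward(is_correct, flip_prob, rng):
    """Binary correctness reward whose output is flipped with probability flip_prob.

    Assumption: rewards are 0/1 and each sample is flipped independently.
    """
    reward = 1.0 if is_correct else 0.0
    if rng.random() < flip_prob:
        reward = 1.0 - reward
    return reward


# Only this phrase is quoted in the abstract; the real list is an unknown.
KEY_PHRASES = ("first, i need to",)


def reasoning_pattern_reward(response, phrases=KEY_PHRASES):
    """Return 1.0 if any key reasoning phrase appears in the response, else 0.0.

    The answer's correctness is deliberately not checked. Scoring by
    presence rather than by count is an assumption of this sketch.
    """
    text = response.lower()
    return float(any(phrase in text for phrase in phrases))


if __name__ == "__main__":
    rng = random.Random(0)
    flipped = sum(noisy_reward(True, 0.4, rng) == 0.0 for _ in range(10_000))
    print(f"empirical flip rate: {flipped / 10_000:.3f}")
    print(reasoning_pattern_reward("First, I need to isolate x."))
```

With `flip_prob=0.4`, roughly 40% of correct answers receive a reward of 0 and roughly 40% of incorrect answers receive a reward of 1, which corresponds to the noise level quoted in the abstract.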

Title: GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

Authors: Qingchen Yu, Zifan Zheng, Ding Chen, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22661
Pdf URL: https://arxiv.org/pdf/2505.22661
Copy Paste: [[2505.22661]] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning(https://arxiv.org/abs/2505.22661)
Keywords: language model, llm
Abstract: The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the "Guess Who I Am?" game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains (finance, healthcare, manufacturing, information technology, and education) demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.
摘要：传统上，对大语言模型（LLM）的评估依赖于静态基准，这是一个范式构成两个主要局限性的范式：（1）预定义的测试集缺乏对各种应用程序领域的适应性，以及（2）标准化的评估协议通常无法捕获域特异性知识和上下文的精细评估。为了克服这些挑战，我们提出了Guessarena，这是一个以对抗性游戏为基础的互动为基础的自适应评估框架。受猜测我是谁的互动结构的启发？游戏，我们的框架无缝将动态域知识建模与渐进推理评估相结合，以提高评估保真度。跨五个垂直领域的经验研究，医疗保健，制造，信息技术和教育示意，猜测有效地从域知识覆盖范围和推理链完整性方面区分了LLM。与传统的基准相比，我们的方法在可解释性，可伸缩性和方案适应性方面具有很大的优势。

Title: AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models

Authors: Feng Luo, Yu-Neng Chuang, Guanchu Wang, Hoang Anh Duy Le, Shaochen Zhong, Hongyi Liu, Jiayi Yuan, Yang Sui, Vladimir Braverman, Vipin Chaudhary, Xia Hu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22662
Pdf URL: https://arxiv.org/pdf/2505.22662
Copy Paste: [[2505.22662]] AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models(https://arxiv.org/abs/2505.22662)
Keywords: language model, llm, chain-of-thought
Abstract: Reasoning-capable large language models (LLMs) demonstrate strong performance on complex reasoning tasks but often suffer from overthinking, generating unnecessarily long chain-of-thought (CoT) reasoning paths for easy reasoning questions, thereby increasing inference cost and latency. Recent approaches attempt to address this challenge by manually deciding when to apply long or short reasoning. However, they lack the flexibility to adapt CoT length dynamically based on question complexity. In this paper, we propose Auto Long-Short Reasoning (AutoL2S), a dynamic and model-agnostic framework that enables LLMs to dynamically compress their generated reasoning path based on the complexity of the reasoning question. AutoL2S enables a learned paradigm, in which LLMs themselves can decide when longer reasoning is necessary and when shorter reasoning suffices, by training on data annotated with our proposed method, which includes both long and short CoT paths and a special token. We then use this token to indicate when the model can skip generating lengthy CoT reasoning. This proposed annotation strategy can enhance the LLMs' ability to generate shorter CoT reasoning paths with improved quality after training. Extensive evaluation results show that AutoL2S reduces the length of reasoning generation by up to 57% without compromising performance, demonstrating the effectiveness of AutoL2S for scalable and efficient LLM reasoning.
摘要：具有推理能力的大语言模型（LLMS）在复杂的推理任务上表现出很强的表现，但通常会遭受过度思考的痛苦，从而产生了不必要的长期思想链（COT）推理路径，以实现简单的推理问题，从而增加了推理成本和潜伏期。最近的方法试图通过手动决定何时应用长期或短暂的推理来应对这一挑战。但是，他们缺乏基于问题复杂性动态调整COT长度的灵活性。在本文中，我们提出了自动长短推理（AUTOL2S），这是一种动态和模型的敏锐性框架，使LLMS能够根据推理问题的复杂性动态压缩其生成的推理路径。 autol2s启用了一个学识渊博的范式，在该范式中，LLM本身可以决定何时需要更长的推理，以及何时较短的推理足以满足我们所提出的方法的培训，其中包括长长和短的COT路径和特殊的标记。然后，我们使用令牌来指示模型何时可以跳过生成冗长的婴儿床推理。这种提出的注释策略可以增强LLMS在训练后以提高质量生成更短的COT推理路径的能力。广泛的评估结果表明，AutoL2s在不损害性能的情况下最多将推理的长度降低了57％，这证明了AutoL2s对可扩展有效的LLM推理的有效性。