2025-05-21

Title: Evaluating Reasoning LLMs for Suicide Screening with the Columbia-Suicide Severity Rating Scale

Authors: Avinash Patil, Siru Tao, Amardeep Gedhu
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13480
Pdf URL: https://arxiv.org/pdf/2505.13480
Copy Paste: [[2505.13480]] Evaluating Reasoning LLMs for Suicide Screening with the Columbia-Suicide Severity Rating Scale(https://arxiv.org/abs/2505.13480)
Keywords: language model, gpt, llm
Abstract: Suicide prevention remains a critical public health challenge. While online platforms such as Reddit's r/SuicideWatch have historically provided spaces for individuals to express suicidal thoughts and seek community support, the advent of large language models (LLMs) introduces a new paradigm in which individuals may begin disclosing ideation to AI systems instead of humans. This study evaluates the capability of LLMs to perform automated suicide risk assessment using the Columbia-Suicide Severity Rating Scale (C-SSRS). We assess the zero-shot performance of six models, including Claude, GPT, Mistral, and LLaMA, in classifying posts across a 7-point severity scale (Levels 0-6). Results indicate that Claude and GPT closely align with human annotations, while Mistral achieves the lowest ordinal prediction error. Most models exhibit ordinal sensitivity, with misclassifications typically occurring between adjacent severity levels. We further analyze confusion patterns, misclassification sources, and ethical considerations, underscoring the importance of human oversight, transparency, and cautious deployment. Full code and supplementary materials are available at this https URL.
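Illustration (not from the paper): the abstract reports "ordinal prediction error" and errors concentrated between adjacent severity levels, without naming the exact metrics. The sketch below assumes mean absolute error over the 0-6 levels as the ordinal error and computes the share of misclassifications that are off by exactly one level. The function name and the sample labels are made up for illustration.

```python
from typing import Dict, Sequence


def ordinal_report(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Exact accuracy, mean absolute error, and share of errors off by one level."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    # Absolute distance between the annotated and predicted severity level.
    diffs = [abs(g - p) for g, p in zip(gold, pred)]
    errors = [d for d in diffs if d > 0]
    return {
        "accuracy": sum(d == 0 for d in diffs) / len(diffs),
        "mae": sum(diffs) / len(diffs),
        # Assumption: "adjacent" means the prediction misses by exactly one level.
        "adjacent_error_share": (sum(d == 1 for d in errors) / len(errors)) if errors else 0.0,
    }


# Made-up labels on the 0-6 scale, for illustration only.
gold = [0, 2, 4, 6, 3]
pred = [0, 3, 4, 4, 3]
report = ordinal_report(gold, pred)
```

A model can have modest exact accuracy and still a low mean absolute error if most of its misses land on a neighbouring level, which is one way the two findings in the abstract can coexist.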

Title: Detecting Prefix Bias in LLM-based Reward Models

Authors: Ashwin Kumar, Yuzi He, Aram H. Markosyan, Bobbie Chern, Imanol Arrieta-Ibarra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13487
Pdf URL: https://arxiv.org/pdf/2505.13487
Copy Paste: [[2505.13487]] Detecting Prefix Bias in LLM-based Reward Models(https://arxiv.org/abs/2505.13487)
Keywords: language model, llm
Abstract: Reinforcement Learning with Human Feedback (RLHF) has emerged as a key paradigm for task-specific fine-tuning of language models using human preference data. While numerous publicly available preference datasets provide pairwise comparisons of responses, the potential for biases in the resulting reward models remains underexplored. In this work, we introduce novel methods to detect and evaluate prefix bias -- a systematic shift in model preferences triggered by minor variations in query prefixes -- in LLM-based reward models trained on such datasets. We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions. Our comprehensive evaluation spans diverse open-source preference datasets and reward model architectures, demonstrating susceptibility to this kind of bias regardless of the underlying model architecture. Furthermore, we propose a data augmentation strategy to mitigate these biases, showing its effectiveness in reducing the impact of prefix bias. Our findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models, contributing to the broader discourse on fairness in AI.

Title: Source framing triggers systematic evaluation bias in Large Language Models

Authors: Federico Germani, Giovanni Spitale
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.13488
Pdf URL: https://arxiv.org/pdf/2505.13488
Copy Paste: [[2505.13488]] Source framing triggers systematic evaluation bias in Large Language Models(https://arxiv.org/abs/2505.13488)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used not only to generate text but also to evaluate it, raising urgent questions about whether their judgments are consistent, unbiased, and robust to framing effects. In this study, we systematically examine inter- and intra-model agreement across four state-of-the-art LLMs (OpenAI o3-mini, Deepseek Reasoner, xAI Grok 2, and Mistral) tasked with evaluating 4,800 narrative statements on 24 different topics of social, political, and public health relevance, for a total of 192,000 assessments. We manipulate the disclosed source of each statement to assess how attribution to either another LLM or a human author of specified nationality affects evaluation outcomes. We find that, in the blind condition, different LLMs display a remarkably high degree of inter- and intra-model agreement across topics. However, this alignment breaks down when source framing is introduced. Here we show that attributing statements to Chinese individuals systematically lowers agreement scores across all models, and in particular for Deepseek Reasoner. Our findings reveal that framing effects can deeply affect text evaluation, with significant implications for the integrity, neutrality, and fairness of LLM-mediated information systems.

Title: ProdRev: A DNN framework for empowering customers using generative pre-trained transformers

Authors: Aakash Gupta, Nataraj Das
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13491
Pdf URL: https://arxiv.org/pdf/2505.13491
Copy Paste: [[2505.13491]] ProdRev: A DNN framework for empowering customers using generative pre-trained transformers(https://arxiv.org/abs/2505.13491)
Keywords: gpt
Abstract: Following the pandemic, customers' preference for e-commerce has accelerated. Because a single product can have many reviews (sometimes running into the thousands), the volume of information can create decision paralysis for the buyer. This disempowers the consumer, who cannot be expected to go through so many reviews, since doing so is time-consuming and can be confusing. Various commercial tools are available that use a scoring mechanism to arrive at an adjusted score and can alert the user to potential review manipulation. This paper proposes a framework that fine-tunes a generative pre-trained transformer to understand these reviews better and to use "common sense" to make better decisions. These models have more than 13 billion parameters. To fine-tune the model for our requirement, we use the curie engine from the generative pre-trained transformer (GPT3). By using generative models, we introduce abstractive summarization instead of a simple extractive method of summarizing the reviews. This brings out the true relationship between the reviews rather than simply copying and pasting text. It introduces an element of "common sense" for the user and helps them quickly make the right decisions. The user is shown the pros and cons drawn from the processed reviews and can thus make their own decisions.

Title: LLM4CD: Leveraging Large Language Models for Open-World Knowledge Augmented Cognitive Diagnosis

Authors: Weiming Zhang, Lingyue Fu, Qingyao Li, Kounianhua Du, Jianghao Lin, Jingwei Yu, Wei Xia, Weinan Zhang, Ruiming Tang, Yong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13492
Pdf URL: https://arxiv.org/pdf/2505.13492
Copy Paste: [[2505.13492]] LLM4CD: Leveraging Large Language Models for Open-World Knowledge Augmented Cognitive Diagnosis(https://arxiv.org/abs/2505.13492)
Keywords: language model, llm
Abstract: Cognitive diagnosis (CD) plays a crucial role in intelligent education, evaluating students' comprehension of knowledge concepts based on their test histories. However, current CD methods often model students, exercises, and knowledge concepts solely on their ID relationships, neglecting the abundant semantic relationships present within educational data space. Furthermore, contemporary intelligent tutoring systems (ITS) frequently involve the addition of new students and exercises, a situation that ID-based methods find challenging to manage effectively. The advent of large language models (LLMs) offers the potential for overcoming this challenge with open-world knowledge. In this paper, we propose LLM4CD, which Leverages Large Language Models for Open-World Knowledge Augmented Cognitive Diagnosis. Our method utilizes the open-world knowledge of LLMs to construct cognitively expressive textual representations, which are then encoded to introduce rich semantic information into the CD task. Additionally, we propose an innovative bi-level encoder framework that models students' test histories through two levels of encoders: a macro-level cognitive text encoder and a micro-level knowledge state encoder. This approach substitutes traditional ID embeddings with semantic representations, enabling the model to accommodate new students and exercises with open-world knowledge and address the cold-start problem. Extensive experimental results demonstrate that our proposed method consistently outperforms previous CD models on multiple real-world datasets, validating the effectiveness of leveraging LLMs to introduce rich semantic information into the CD task.

Title: IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation

Authors: Khanh-Tung Tran, Barry O'Sullivan, Hoang D. Nguyen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13498
Pdf URL: https://arxiv.org/pdf/2505.13498
Copy Paste: [[2505.13498]] IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation(https://arxiv.org/abs/2505.13498)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have demonstrated promising knowledge and reasoning abilities, yet their performance in multilingual and low-resource settings remains underexplored. Existing benchmarks often exhibit cultural bias, restrict evaluation to text-only, rely on multiple-choice formats, and, more importantly, are limited for extremely low-resource languages. To address these gaps, we introduce IRLBench, presented in parallel English and Irish, which is considered definitely endangered by UNESCO. Our benchmark consists of 12 representative subjects developed from the 2024 Irish Leaving Certificate exams, enabling fine-grained analysis of model capabilities across domains. By framing the task as long-form generation and leveraging the official marking scheme, it supports a comprehensive evaluation not only of correctness but also of language fidelity. Our extensive experiments on leading closed-source and open-source LLMs reveal a persistent performance gap between English and Irish, in which models produce valid Irish responses less than 80% of the time, and answer correctly 55.8% of the time compared to 76.2% in English for the best-performing model. We release IRLBench (this https URL) and an accompanying evaluation codebase (this https URL) to enable future research on robust, culturally aware multilingual AI development.

Title: Noise Injection Systemically Degrades Large Language Model Safety Guardrails

Authors: Prithviraj Singh Shahani, Matthias Scheutz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13500
Pdf URL: https://arxiv.org/pdf/2505.13500
Copy Paste: [[2505.13500]] Noise Injection Systemically Degrades Large Language Model Safety Guardrails(https://arxiv.org/abs/2505.13500)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs. Yet, their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates (p < 0.001) by up to 27%, (2) deeper safety fine-tuning affords no extra protection, and (3) chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight the potential of reasoning-based and reinforcement learning approaches as a promising direction for developing more robust AI safety systems. These results have important implications for real-world deployment of LLMs in safety-critical applications, as they imply that widely deployed safety tuning methods can fail even without adversarial prompts.
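Illustration (not from the paper): the perturbation described in the abstract is additive Gaussian noise on model activations. The abstract does not say which layers are perturbed, what noise scale is used, or how harmful-output rates are scored, so the sketch below shows only the noise step on a plain list of floats. The function name, `sigma` value, and sample vector are assumptions.

```python
import random
from typing import List


def add_gaussian_noise(activations: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Return a copy of the activation vector with i.i.d. N(0, sigma^2) noise added."""
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    return [a + rng.gauss(0.0, sigma) for a in activations]


rng = random.Random(0)            # fixed seed so the perturbation is reproducible
hidden = [0.5, -1.2, 3.0, 0.0]    # made-up activations standing in for one hidden state
noisy = add_gaussian_noise(hidden, sigma=0.1, rng=rng)
unchanged = add_gaussian_noise(hidden, sigma=0.0, rng=rng)  # sigma = 0 leaves values as-is
```

In a real model the same operation would be applied to tensors inside the forward pass. Sweeping `sigma` and comparing refusal behaviour against the `sigma = 0` baseline is one plausible way to set up the kind of experiment the abstract describes.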

Title: EcoSafeRAG: Efficient Security through Context Analysis in Retrieval-Augmented Generation

Authors: Ruobing Yao, Yifei Zhang, Shuang Song, Neng Gao, Chenyang Tu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13506
Pdf URL: https://arxiv.org/pdf/2505.13506
Copy Paste: [[2505.13506]] EcoSafeRAG: Efficient Security through Context Analysis in Retrieval-Augmented Generation(https://arxiv.org/abs/2505.13506)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) compensates for the static knowledge limitations of Large Language Models (LLMs) by integrating external knowledge, producing responses with enhanced factual correctness and query-specific contextualization. However, it also introduces new attack surfaces such as corpus poisoning. Most of the existing defense methods rely on the internal knowledge of the model, which conflicts with the design concept of RAG. To bridge the gap, EcoSafeRAG uses sentence-level processing and bait-guided context diversity detection to identify malicious content by analyzing the context diversity of candidate documents without relying on LLM internal knowledge. Experiments show EcoSafeRAG delivers state-of-the-art security with plug-and-play deployment, simultaneously improving clean-scenario RAG performance while maintaining practical operational costs (1.2× latency and 48%-80% token reduction relative to Vanilla RAG).

Title: Time-R1: Towards Comprehensive Temporal Reasoning in LLMs

Authors: Zijia Liu, Peixuan Han, Haofei Yu, Haoru Li, Jiaxuan You
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13508
Pdf URL: https://arxiv.org/pdf/2505.13508
Copy Paste: [[2505.13508]] Time-R1: Towards Comprehensive Temporal Reasoning in LLMs(https://arxiv.org/abs/2505.13508)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate impressive capabilities but lack robust temporal intelligence, struggling to integrate reasoning about the past with predictions and plausible generations of the future. Meanwhile, existing methods typically target isolated temporal skills, such as question answering about past events or basic forecasting, and exhibit poor generalization, particularly when dealing with events beyond their knowledge cutoff or requiring creative foresight. To address these limitations, we introduce Time-R1, the first framework to endow a moderate-sized (3B-parameter) LLM with comprehensive temporal abilities: understanding, prediction, and creative generation. Our approach features a novel three-stage development path; the first two constitute a reinforcement learning (RL) curriculum driven by a meticulously designed dynamic rule-based reward system. This framework progressively builds (1) foundational temporal understanding and logical event-time mappings from historical data, (2) future event prediction skills for events beyond its knowledge cutoff, and finally (3) remarkable generalization to creative future scenario generation without any fine-tuning. Strikingly, experiments demonstrate that Time-R1 outperforms models over 200 times larger, including the state-of-the-art 671B DeepSeek-R1, on highly challenging future event prediction and creative scenario generation benchmarks. This work provides strong evidence that thoughtfully engineered, progressive RL fine-tuning allows smaller, efficient models to achieve superior temporal performance, offering a practical and scalable path towards truly time-aware AI. To foster further research, we also release Time-Bench, a large-scale multi-task temporal reasoning dataset derived from 10 years of news data, and our series of Time-R1 checkpoints.

Title: Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models

Authors: Shuxun Wang, Qingyu Yin, Chak Tou Leong, Qiang Zhang, Linyi Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13514
Pdf URL: https://arxiv.org/pdf/2505.13514
Copy Paste: [[2505.13514]] Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models(https://arxiv.org/abs/2505.13514)
Keywords: language model, llm
Abstract: Repetition curse is a phenomenon where Large Language Models (LLMs) generate repetitive sequences of tokens or cyclic sequences. While the repetition curse has been widely observed, its underlying mechanisms remain poorly understood. In this work, we investigate the role of induction heads--a specific type of attention head known for their ability to perform in-context learning--in driving this repetitive behavior. Specifically, we focus on the "toxicity" of induction heads, which we define as their tendency to dominate the model's output logits during repetition, effectively excluding other attention heads from contributing to the generation process. Our findings have important implications for the design and training of LLMs. By identifying induction heads as a key driver of the repetition curse, we provide a mechanistic explanation for this phenomenon and suggest potential avenues for mitigation. We also propose a technique with attention head regularization that could be employed to reduce the dominance of induction heads during generation, thereby promoting more diverse and coherent outputs.

Title: Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression

Authors: Jingyu Peng, Maolin Wang, Nan Wang, Xiangyu Zhao, Jiatong Li, Kai Zhang, Qi Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13527
Pdf URL: https://arxiv.org/pdf/2505.13527
Copy Paste: [[2505.13527]] Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression(https://arxiv.org/abs/2505.13527)
Keywords: language model, llm, prompt
Abstract: Despite substantial advancements in aligning large language models (LLMs) with human values, current safety mechanisms remain susceptible to jailbreak attacks. We hypothesize that this vulnerability stems from distributional discrepancies between alignment-oriented prompts and malicious prompts. To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems. By converting harmful natural language prompts into formal logical expressions, LogiBreak exploits the distributional gap between alignment data and logic-based inputs, preserving the underlying semantic intent and readability while evading safety constraints. We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.

Title: Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation

Authors: Zhanglin Wu, Daimeng Wei, Xiaoyu Chen, Hengchao Shang, Jiaxin Guo, Zongyao Li, Yuanchang Luo, Jinlong Yang, Zhiqiang Rao, Hao Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13554
Pdf URL: https://arxiv.org/pdf/2505.13554
Copy Paste: [[2505.13554]] Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation(https://arxiv.org/abs/2505.13554)
Keywords: language model, llm
Abstract: Large language models (LLMs) show promising performance in a variety of downstream tasks, such as machine translation (MT). However, using LLMs for translation suffers from high computational costs and significant latency. Based on our evaluation, in most cases, translations produced by LLMs are comparable to those generated by neural machine translation (NMT) systems. Only in particular scenarios do LLM and NMT models show their respective advantages. As a result, integrating NMT and LLM for translation and using the LLM only when necessary seems to be a sound solution. A scheduling policy that optimizes translation results while ensuring fast speed and as little LLM usage as possible is therefore required. We compare several scheduling policies and propose a novel and straightforward decider that leverages source sentence features. We conduct extensive experiments on multilingual test sets, and the results show that we can achieve optimal translation performance with minimal LLM usage, demonstrating the effectiveness of our decider.
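Illustration (not from the paper): the abstract describes a decider that uses source sentence features to choose between an NMT system and an LLM, but does not list the features or the decision rule. The sketch below is a hypothetical rule-based router. The two features (sentence length and a small idiom list), the threshold, and all names are assumptions chosen only to show the shape of such a policy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RoutingDecision:
    use_llm: bool
    reason: str


def decide(
    source: str,
    max_nmt_tokens: int = 40,  # hypothetical length threshold
    idiom_markers: Tuple[str, ...] = ("piece of cake", "break a leg"),  # hypothetical list
) -> RoutingDecision:
    """Route a source sentence to the LLM only when a cheap feature suggests NMT may struggle."""
    tokens = source.split()
    if len(tokens) > max_nmt_tokens:
        return RoutingDecision(True, "long sentence")
    lowered = source.lower()
    if any(marker in lowered for marker in idiom_markers):
        return RoutingDecision(True, "idiomatic expression")
    # Default to the cheaper, faster NMT system.
    return RoutingDecision(False, "default to NMT")


plain = decide("The cat sat on the mat.")
idiom = decide("That exam was a piece of cake.")
```

Because the features are computed from the source sentence alone, the decision is made before either system runs, so the LLM is called only for the routed subset.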

Title: CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models

Authors: Sathya Krishnan Suresh, Tanmay Surana, Lim Zhi Hao, Eng Siong Chng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13559
Pdf URL: https://arxiv.org/pdf/2505.13559
Copy Paste: [[2505.13559]] CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models(https://arxiv.org/abs/2505.13559)
Keywords: language model, llm
Abstract: Code-switching (CS) poses a significant challenge for Large Language Models (LLMs), yet its comprehensibility remains underexplored in LLMs. We introduce CS-Sum to evaluate the comprehensibility of CS by LLMs through CS dialogue to English summarization. CS-Sum is the first benchmark for CS dialogue summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language pair. Evaluating ten LLMs, including open and closed-source models, we analyze performance across few-shot, translate-summarize, and fine-tuning (LoRA, QLoRA on synthetic data) approaches. Our findings show that though the scores on automated metrics are high, LLMs make subtle mistakes that alter the complete meaning of the dialogue. To this end, we introduce the three most common types of errors that LLMs make when handling CS input. Error rates vary across CS pairs and LLMs, with some LLMs showing more frequent errors on certain language pairs, underscoring the need for specialized training on code-switched data.

Title: Are Large Language Models Good at Detecting Propaganda?

Authors: Julia Jose, Rachel Greenstadt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13706
Pdf URL: https://arxiv.org/pdf/2505.13706
Copy Paste: [[2505.13706]] Are Large Language Models Good at Detecting Propaganda?(https://arxiv.org/abs/2505.13706)
Keywords: language model, gpt, llm
Abstract: Propagandists use rhetorical devices that rely on logical fallacies and emotional appeals to advance their agendas. Recognizing these techniques is key to making informed decisions. Recent advances in Natural Language Processing (NLP) have enabled the development of systems capable of detecting manipulative content. In this study, we look at several Large Language Models and their performance in detecting propaganda techniques in news articles. We compare the performance of these LLMs with transformer-based models. We find that, while GPT-4 demonstrates superior F1 scores (F1=0.16) compared to GPT-3.5 and Claude 3 Opus, it does not outperform a RoBERTa-CRF baseline (F1=0.67). Additionally, we find that all three LLMs outperform a MultiGranularity Network (MGN) baseline in detecting instances of one out of six propaganda techniques (name-calling), with GPT-3.5 and GPT-4 also outperforming the MGN baseline in detecting instances of appeal to fear and flag-waving.

Title: SQLForge: Synthesizing Reliable and Diverse Data to Enhance Text-to-SQL Reasoning in LLMs

Authors: Yu Guo, Dong Jin, Shenghao Ye, Shuangwu Chen, Jian Yang, Xiaobin Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13725
Pdf URL: https://arxiv.org/pdf/2505.13725
Copy Paste: [[2505.13725]] SQLForge: Synthesizing Reliable and Diverse Data to Enhance Text-to-SQL Reasoning in LLMs(https://arxiv.org/abs/2505.13725)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated significant potential in text-to-SQL reasoning tasks, yet a substantial performance gap persists between existing open-source models and their closed-source counterparts. In this paper, we introduce SQLForge, a novel approach for synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs. We improve data reliability through SQL syntax constraints and SQL-to-question reverse translation, ensuring data logic at both structural and semantic levels. We also propose an SQL template enrichment and iterative data domain exploration mechanism to boost data diversity. Building on the augmented data, we fine-tune a variety of open-source models with different architectures and parameter sizes, resulting in a family of models termed SQLForge-LM. SQLForge-LM achieves state-of-the-art performance among open-source models on the widely recognized Spider and BIRD benchmarks. Specifically, SQLForge-LM achieves EX accuracy of 85.7% on Spider Dev and 59.8% on BIRD Dev, significantly narrowing the performance gap with closed-source methods.
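Illustration (not from the paper): EX (execution) accuracy is commonly understood as the fraction of predicted queries whose execution result matches that of the gold query. The sketch below shows that idea with Python's built-in `sqlite3` module on a made-up table. The order-insensitive row comparison is a simplifying assumption and is not the official Spider or BIRD evaluation code.

```python
import sqlite3


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries run and return the same rows, ignoring row order."""
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


# Made-up schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ana", 31), ("Bo", 24), ("Cy", 45)])

pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT name FROM singer WHERE age > 30", "SELECT name FROM singer WHERE age >= 31"),
    # Runs, but returns different rows: counted as wrong.
    ("SELECT name FROM singer", "SELECT name FROM singer WHERE age < 30"),
    # Misspelled column, fails to execute: counted as wrong.
    ("SELECT nam FROM singer", "SELECT name FROM singer"),
]
ex_accuracy = sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)
```

Comparing results rather than query strings is what lets two differently written but equivalent queries both count as correct.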

Title: Simulation Agent: A Framework for Integrating Simulation and Large Language Models for Enhanced Decision-Making

Authors: Jacob Kleiman, Kevin Frank, Sindy Campagna
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13761
Pdf URL: https://arxiv.org/pdf/2505.13761
Copy Paste: [[2505.13761]] Simulation Agent: A Framework for Integrating Simulation and Large Language Models for Enhanced Decision-Making(https://arxiv.org/abs/2505.13761)
Keywords: language model, llm, agent
Abstract: Simulations, although powerful in accurately replicating real-world systems, often remain inaccessible to non-technical users due to their complexity. Conversely, large language models (LLMs) provide intuitive, language-based interactions but can lack the structured, causal understanding required to reliably model complex real-world dynamics. We introduce our simulation agent framework, a novel approach that integrates the strengths of both simulation models and LLMs. This framework helps empower users by leveraging the conversational capabilities of LLMs to interact seamlessly with sophisticated simulation systems, while simultaneously utilizing the simulations to ground the LLMs in accurate and structured representations of real-world phenomena. This integrated approach helps provide a robust and generalizable foundation for empirical validation and offers broad applicability across diverse domains.

Title: Krikri: Advancing Open Large Language Models for Greek

Authors: Dimitris Roussis, Leon Voukoutis, Georgios Paraskevopoulos, Sokratis Sofianopoulos, Prokopis Prokopidis, Vassilis Papavasileiou, Athanasios Katsamanis, Stelios Piperidis, Vassilis Katsouros
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13772
Pdf URL: https://arxiv.org/pdf/2505.13772
Copy Paste: [[2505.13772]] Krikri: Advancing Open Large Language Models for Greek(https://arxiv.org/abs/2505.13772)
Keywords: language model, llm, chat
Abstract: We introduce Llama-Krikri-8B, a cutting-edge Large Language Model tailored for the Greek language, built on Meta's Llama 3.1-8B. Llama-Krikri-8B has been extensively trained on high-quality Greek data to ensure superior adaptation to linguistic nuances. With 8 billion parameters, it offers advanced capabilities while maintaining efficient computational performance. Llama-Krikri-8B supports both Modern Greek and English, and is also equipped to handle polytonic text and Ancient Greek. The chat version of Llama-Krikri-8B features a multi-stage post-training pipeline that uses both human and synthetic instruction and preference data and applies techniques such as MAGPIE. In addition, for evaluation, we propose three novel public benchmarks for Greek. Our evaluation on existing benchmarks as well as the proposed ones shows notable improvements over comparable Greek and multilingual LLMs in natural language understanding, natural language generation, and code generation.

Title: Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

Authors: Siddhant Bhambri, Upasana Biswas, Subbarao Kambhampati
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13792
Pdf URL: https://arxiv.org/pdf/2505.13792
Copy Paste: [[2505.13792]] Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation(https://arxiv.org/abs/2505.13792)
Keywords: language model, gpt, chat, chain-of-thought
Abstract: Question Answering (QA) poses a challenging and critical problem, particularly in today's age of interactive dialogue systems such as ChatGPT, Perplexity, and Microsoft Copilot, where users demand both accuracy and transparency in the model's outputs. Since smaller language models (SLMs) are computationally more efficient but often under-perform compared to larger models, Knowledge Distillation (KD) methods allow for finetuning these smaller models to improve their final performance. Lately, the intermediate tokens, or the so-called 'reasoning' traces produced by Chain-of-Thought (CoT) or by reasoning models such as DeepSeek R1, are used as a training signal for KD. However, these reasoning traces are often verbose and difficult to interpret or evaluate. In this work, we aim to address the challenge of evaluating the faithfulness of these reasoning traces and their correlation with the final performance. To this end, we employ a KD method leveraging rule-based problem decomposition. This approach allows us to break down complex queries into structured sub-problems, generating interpretable traces whose correctness can be readily evaluated, even at inference time. Specifically, we demonstrate this approach on Open Book QA, decomposing the problem into a Classification step and an Information Retrieval step, thereby simplifying trace evaluation. Our SFT experiments with correct and incorrect traces on the CoTemp QA, Microsoft Machine Reading Comprehension QA, and Facebook bAbI QA datasets reveal the striking finding that correct traces do not necessarily imply that the model outputs the correct final solution. Similarly, we find a low correlation between correct final solutions and intermediate trace correctness. These results challenge the implicit assumption behind utilizing reasoning traces for improving SLMs' final performance via KD.

Title: EfficientLLM: Efficiency in Large Language Models

Authors: Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13840
Pdf URL: https://arxiv.org/pdf/2505.13840
Copy Paste: [[2505.13840]] EfficientLLM: Efficiency in Large Language Models(https://arxiv.org/abs/2505.13840)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100 model-technique pairs (0.5B-72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers optimal memory-latency trade-offs for constrained devices, MLA achieves lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.

Title: Improve Language Model and Brain Alignment via Associative Memory

Authors: Congchi Yin, Yongpeng Zhang, Xuyun Wen, Piji Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13844
Pdf URL: https://arxiv.org/pdf/2505.13844
Copy Paste: [[2505.13844]] Improve Language Model and Brain Alignment via Associative Memory(https://arxiv.org/abs/2505.13844)
Keywords: language model
Abstract: Associative memory integrates relevant information for comprehension in the human cognition system. In this work, we seek to improve the alignment between language models and the human brain during speech processing by integrating associative memory. After verifying the alignment between language model and brain by mapping language model activations to brain activity, we expand the original text stimuli with simulated associative memory and use them as input to computational language models. We find that the alignment between language model and brain improves in brain regions closely related to associative memory processing. We also demonstrate that large language models align better with brain responses after specific supervised fine-tuning, for which we build the Association dataset containing 1000 samples of stories, with instructions encouraging associative memory as input and associated content as output.

Title: Domain Gating Ensemble Networks for AI-Generated Text Detection

Authors: Arihant Tripathi, Liam Dugan, Charis Gao, Maggie Huan, Emma Jin, Peter Zhang, David Zhang, Julia Zhao, Chris Callison-Burch
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13855
Pdf URL: https://arxiv.org/pdf/2505.13855
Copy Paste: [[2505.13855]] Domain Gating Ensemble Networks for AI-Generated Text Detection(https://arxiv.org/abs/2505.13855)
Keywords: language model
Abstract: As state-of-the-art language models continue to improve, the need for robust detection of machine-generated text becomes increasingly critical. However, current state-of-the-art machine text detectors struggle to adapt to new unseen domains and generative models. In this paper we present DoGEN (Domain Gating Ensemble Networks), a technique that allows detectors to adapt to unseen domains by ensembling a set of domain expert detector models using weights from a domain classifier. We test DoGEN on a wide variety of domains from leading benchmarks and find that it achieves state-of-the-art performance on in-domain detection while outperforming models twice its size on out-of-domain detection. We release our code and trained models to assist in future research in domain-adaptive AI detection.
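The gating idea described above can be sketched as a weighted average of per-domain expert outputs, with the weights taken from a domain classifier. This is a minimal sketch under stated assumptions: the abstract does not say at which level DoGEN combines its experts, so combining final probabilities is an assumption, and `softmax`, `gated_ensemble`, and the toy numbers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_ensemble(domain_logits, expert_probs):
    """Combine expert detector outputs using domain-classifier weights.

    domain_logits: one raw score per domain from a (hypothetical) domain classifier.
    expert_probs:  one P(machine-generated) per domain expert, in the same order.
    """
    weights = softmax(domain_logits)
    return sum(w * p for w, p in zip(weights, expert_probs))

# Toy input: the classifier believes the text is most likely from domain 0,
# so the first expert dominates the combined score.
score = gated_ensemble([2.0, 0.0, 0.0], [0.9, 0.2, 0.4])
print(round(score, 3))
```

Because the weights sum to one, the combined score always lies between the lowest and highest expert outputs, and an uncertain domain classifier falls back towards a plain average of the experts.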

Title: Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

Authors: Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13866
Pdf URL: https://arxiv.org/pdf/2505.13866
Copy Paste: [[2505.13866]] Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning(https://arxiv.org/abs/2505.13866)
Keywords: language model, llm
Abstract: Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective in solving problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce token-generation throughput, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining the KV cache entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves the generation throughput of QwQ-32B by up to 1.60× compared to inference with the full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at this https URL.
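To make the compression step concrete, here is a minimal single-head sketch of scoring cached entries with a window of recent queries and keeping only the highest-scoring ones. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring formula, the compression schedule, or the per-head handling, so summed softmax attention weights, the `keep` budget, and all function names here are hypothetical.

```python
import math

def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and all cached keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def compress_kv_cache(keys, values, recent_queries, keep):
    """Keep the `keep` cache entries that the recent queries attend to most.

    recent_queries plays the role of the selector window: each cached entry's
    importance is the attention it receives, summed over those queries.
    """
    importance = [0.0] * len(keys)
    for query in recent_queries:
        for i, weight in enumerate(attention_weights(query, keys)):
            importance[i] += weight
    top = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)[:keep]
    kept = sorted(top)  # restore the original token order of the survivors
    return [keys[i] for i in kept], [values[i] for i in kept], kept

# Toy cache with four entries; the values are labels so the result is easy to read.
keys = [[1.0, 0.0], [0.0, 0.5], [1.0, 1.0], [-1.0, -1.0]]
values = ["a", "b", "c", "d"]
recent_queries = [[1.0, 1.0], [2.0, 2.0]]
new_keys, new_values, kept = compress_kv_cache(keys, values, recent_queries, keep=2)
print(kept, new_values)
```

In a real decoder this step would run periodically during generation, so the cache stays bounded while the entries that recent tokens still rely on are preserved.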

Title: Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning

Authors: Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13886
Pdf URL: https://arxiv.org/pdf/2505.13886
Copy Paste: [[2505.13886]] Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning(https://arxiv.org/abs/2505.13886)
Keywords: language model, llm, chain-of-thought
Abstract: Visual-language Chain-of-Thought (CoT) data resources are relatively scarce compared to text-only counterparts, limiting the improvement of reasoning capabilities in Vision Language Models (VLMs). However, high-quality vision-language reasoning data is expensive and labor-intensive to annotate. To address this issue, we leverage a promising resource: game code, which naturally contains logical structures and state transition processes. Therefore, we propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. Using the Code2Logic approach, we developed the GameQA dataset to train and evaluate VLMs. GameQA is cost-effective and scalable to produce, challenging for state-of-the-art models, and diverse, with 30 games and 158 tasks. Surprisingly, despite training solely on game data, VLMs demonstrated out-of-domain generalization; specifically, Qwen2.5-VL-7B improved its performance by 2.33% across 7 diverse vision-language benchmarks. Our code and dataset are available at this https URL.

Title: Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM

Authors: Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13890
Pdf URL: https://arxiv.org/pdf/2505.13890
Copy Paste: [[2505.13890]] Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM(https://arxiv.org/abs/2505.13890)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent advances in test-time scaling have enabled Large Language Models (LLMs) to display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation. Despite their potential, these Reasoning LLMs (RLMs) often demonstrate counterintuitive and unstable behaviors, such as performance degradation under few-shot prompting, that challenge our current understanding of RLMs. In this work, we introduce a unified graph-based analytical framework for better modeling the reasoning processes of RLMs. Our method first clusters long, verbose CoT outputs into semantically coherent reasoning steps, then constructs directed reasoning graphs to capture contextual and logical dependencies among these steps. Through comprehensive analysis across models and prompting regimes, we reveal that structural properties, such as exploration density, branching, and convergence ratios, strongly correlate with reasoning accuracy. Our findings demonstrate how prompting strategies substantially reshape the internal reasoning structure of RLMs, directly affecting task outcomes. The proposed framework not only enables quantitative evaluation of reasoning quality beyond conventional metrics but also provides practical insights for prompt engineering and the cognitive analysis of LLMs. Code and resources will be released to facilitate future research in this direction.

Title: InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

Authors: Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, Hongxia Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13893
Pdf URL: https://arxiv.org/pdf/2505.13893
Copy Paste: [[2505.13893]] InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion(https://arxiv.org/abs/2505.13893)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose InfiGFusion, the first structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
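The graph construction described above (keep the top-$k$ logits per position, then sum their outer products over the sequence) can be sketched as follows. This covers only the co-activation graph, not the Gromov-Wasserstein approximation or the GLD loss. Zeroing the non-top-$k$ entries and using a dense vocabulary-by-vocabulary matrix are simplifying assumptions, and the function names are hypothetical.

```python
def top_k_sparse(logits, k):
    """Keep the k largest logits of one position and zero out the rest."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    keep = set(top)
    return [x if i in keep else 0.0 for i, x in enumerate(logits)]

def coactivation_graph(logit_rows, k):
    """Sum the outer products of top-k logit vectors over sequence positions.

    logit_rows: one logit vector per sequence position.
    Returns a dense vocab x vocab matrix; entry [i][j] accumulates how strongly
    vocabulary channels i and j are active together.
    """
    vocab = len(logit_rows[0])
    graph = [[0.0] * vocab for _ in range(vocab)]
    for row in logit_rows:
        sparse = top_k_sparse(row, k)
        for i in range(vocab):
            if sparse[i] == 0.0:
                continue  # a zero entry contributes nothing to row i
            for j in range(vocab):
                graph[i][j] += sparse[i] * sparse[j]
    return graph

# Two positions over a toy vocabulary of four channels, keeping k = 2 logits each.
logit_rows = [[3.0, 1.0, 0.0, 2.0], [0.5, 4.0, 1.0, 0.0]]
graph = coactivation_graph(logit_rows, k=2)
for line in graph:
    print(line)
```

Restricting each position to its top-$k$ logits keeps the per-position work at roughly $k^2$ nonzero products instead of one for every pair of vocabulary entries.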

Title: Let's Verify Math Questions Step by Step

Authors: Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang, Zimo Meng, Zhengyang Zhao, Bohan Zeng, Zhengzhou Zhu, Bin Cui, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13903
Pdf URL: https://arxiv.org/pdf/2505.13903
Copy Paste: [[2505.13903]] Let's Verify Math Questions Step by Step(https://arxiv.org/abs/2505.13903)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math QA data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the validity of the questions themselves. In this work, we propose Math Question Verification (MathQ-Verify), a novel five-stage pipeline designed to rigorously filter ill-posed or under-specified math problems. MathQ-Verify first performs format-level validation to remove redundant instructions and ensure that each question is syntactically well-formed. It then formalizes each question, decomposes it into atomic conditions, and verifies them against mathematical definitions. Next, it detects logical contradictions among these conditions, followed by a goal-oriented completeness check to ensure the question provides sufficient information for solving. To evaluate this task, we use existing benchmarks along with an additional dataset we construct, containing 2,147 math questions with diverse error types, each manually double-validated. Experiments show that MathQ-Verify achieves state-of-the-art performance across multiple benchmarks, improving the F1 score by up to 25 percentage points over the direct verification baseline. It further attains approximately 90% precision and 63% recall through a lightweight model voting scheme. MathQ-Verify offers a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding unnecessary computation on invalid questions. Our code and data are available at this https URL.

Title: Cross-Linguistic Transfer in Multilingual NLP: The Role of Language Families and Morphology

Authors: Ajitesh Bankula, Praney Bankula
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13908
Pdf URL: https://arxiv.org/pdf/2505.13908
Copy Paste: [[2505.13908]] Cross-Linguistic Transfer in Multilingual NLP: The Role of Language Families and Morphology(https://arxiv.org/abs/2505.13908)
Keywords: language model
Abstract: Cross-lingual transfer has become a crucial aspect of multilingual NLP, as it allows models trained on resource-rich languages to be applied to low-resource languages more effectively. Recently, massively multilingual pre-trained language models (e.g., mBERT, XLM-R) have demonstrated strong zero-shot transfer capabilities [14] [13]. This paper investigates cross-linguistic transfer through the lens of language families and morphology, examining how language family proximity and morphological similarity affect performance across NLP tasks. We further discuss our results and how they relate to findings from recent literature. Overall, we compare multilingual model performance and review how linguistic distance metrics correlate with transfer outcomes. We also look into emerging approaches that integrate typological and morphological information into model pre-training to improve transfer to diverse languages [18] [19].

Title: EEG-to-Text Translation: A Model for Deciphering Human Brain Activity

Authors: Saydul Akbar Murad, Ashim Dahal, Nick Rahimi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13936
Pdf URL: https://arxiv.org/pdf/2505.13936
Copy Paste: [[2505.13936]] EEG-to-Text Translation: A Model for Deciphering Human Brain Activity(https://arxiv.org/abs/2505.13936)
Keywords: language model, gpt
Abstract: With the rapid advancement of large language models like Gemini, GPT, and others, bridging the gap between the human brain and language processing has become an important area of focus. To address this challenge, researchers have developed various models to decode EEG signals into text. However, these models still face significant performance limitations. To overcome these shortcomings, we propose a new model, R1 Translator, which aims to improve the performance of EEG-to-text decoding. The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer-based decoder, utilizing EEG features to produce high-quality text outputs. The model processes EEG embeddings through the LSTM to capture sequential dependencies, which are then fed into the transformer decoder for effective text generation. The R1 Translator excels in ROUGE metrics, outperforming both T5 (previous research) and Brain Translator. Specifically, R1 achieves a ROUGE-1 score of 38.00% (P), which is up to 9% higher than T5 (34.89%) and 3% better than Brain (35.69%). It also leads in ROUGE-L, with an F1 score of 32.51%, outperforming T5 by 3% (29.67%) and Brain by 2% (30.38%). In terms of CER, R1 achieves a CER of 0.5795, which is 2% lower than T5 (0.5917) and 4% lower than Brain (0.6001). Additionally, R1 performs better in WER with a score of 0.7280, outperforming T5 by 4.3% (0.7610) and Brain by 3.6% (0.7553). Code is available at this https URL.
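For readers unfamiliar with the CER and WER figures quoted above, both are conventionally defined as an edit distance divided by the reference length, computed over characters and words respectively. The sketch below follows that standard definition. The tokenization (whitespace splitting, no text normalization) is an assumption, since the abstract does not describe the paper's evaluation setup, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted word out of three, and one substituted character out of four.
print(wer("the cat sat", "the dog sat"))
print(cer("abcd", "abxd"))
```

Lower is better for both metrics, which is why the abstract reports R1's CER and WER as being lower than the baselines while its ROUGE scores are higher.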

Title: Towards Rehearsal-Free Continual Relation Extraction: Capturing Within-Task Variance with Adaptive Prompting

Authors: Bao-Ngoc Dao, Quang Nguyen, Luyen Ngo Dinh, Minh Le, Nam Le, Linh Ngo Van
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13944
Pdf URL: https://arxiv.org/pdf/2505.13944
Copy Paste: [[2505.13944]] Towards Rehearsal-Free Continual Relation Extraction: Capturing Within-Task Variance with Adaptive Prompting(https://arxiv.org/abs/2505.13944)
Keywords: prompt
Abstract: Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE, particularly in accurately identifying task identities and mitigating catastrophic forgetting. Existing prompt selection strategies often suffer from inaccuracies, lack robust mechanisms to prevent forgetting in shared parameters, and struggle to handle both cross-task and within-task variations. In this paper, we propose WAVE++, a novel approach inspired by the connection between prefix-tuning and mixture of experts. Specifically, we introduce task-specific prompt pools that enhance flexibility and adaptability across diverse tasks while avoiding boundary-spanning risks; this design more effectively captures variations within each task and across tasks. To further refine relation classification, we incorporate label descriptions that provide richer, more global context, enabling the model to better distinguish among different relations. We also propose a training-free mechanism to improve task prediction during inference. Moreover, we integrate a generative model to consolidate prior knowledge within the shared parameters, thereby removing the need for explicit data storage. Extensive experiments demonstrate that WAVE++ outperforms state-of-the-art prompt-based and rehearsal-based methods, offering a more robust solution for continual relation extraction. Our code is publicly available at this https URL.

Title: Memory-Centric Embodied Question Answer

Authors: Mingliang Zhai, Zhi Gao, Yuwei Wu, Yunde Jia
Subjects: cs.CL, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2505.13948
Pdf URL: https://arxiv.org/pdf/2505.13948
Copy Paste: [[2505.13948]] Memory-Centric Embodied Question Answer(https://arxiv.org/abs/2505.13948)
Keywords: language model, agent
Abstract: Embodied Question Answering (EQA) requires agents to autonomously explore and understand the environment to answer context-dependent questions. Existing frameworks typically center around the planner, which guides the stopping module, memory module, and answering module for reasoning. In this paper, we propose a memory-centric EQA framework named MemoryEQA. Unlike planner-centric EQA models where the memory module cannot fully interact with other modules, MemoryEQA flexibly feeds memory information into all modules, thereby enhancing efficiency and accuracy in handling complex tasks, such as those involving multiple targets across different regions. Specifically, we establish a multi-modal hierarchical memory mechanism, which is divided into global memory that stores language-enhanced scene maps, and local memory that retains historical observations and state information. When performing EQA tasks, the multi-modal large language model is leveraged to convert memory information into the required input formats for injection into different modules. To evaluate EQA models' memory capabilities, we constructed the MT-HM3D dataset based on HM3D, comprising 1,587 question-answer pairs involving multiple targets across various regions, which requires agents to maintain memory of exploration-acquired target information. Experimental results on HM-EQA, MT-HM3D, and OpenEQA demonstrate the effectiveness of our framework, and a 19.8% performance gain over the baseline model on MT-HM3D further underscores the pivotal role of memory capability in resolving complex tasks.
摘要：体现的问题回答（EQA）要求代理人自主探索和理解环境，以回答上下文依赖上下文的问题。现有的框架通常围绕计划者，该框架指导停止模块，内存模块和回答用于推理的模块。在本文中，我们提出了一个名为MemoryEQA的以内存为中心的EQA框架。与以计划者为中心的EQA模型不同，内存模块无法与其他模块完全交互，MemoryEQA灵活将内存信息馈入所有模块，从而提高了处理复杂任务的效率和准确性，例如涉及不同地区的多个目标的任务。具体而言，我们建立了一个多模式的层次记忆机制，该机制分为整体内存，该内存存储语言增强的场景图以及保留历史观察和状态信息的本地记忆。执行EQA任务时，将利用多模式大型语言模型将内存信息转换为所需的输入格式，以注入不同的模块。为了评估EQA模型的内存功能，我们基于HM3D构建了MT-HM3D数据集，其中包括1,587个问答对，涉及各个区域的多个目标，这需要代理来维持对勘探获得的目标信息的记忆。 HM-EQA，MT-HM3D和OpenEQA的实验结果证明了我们框架的有效性，与基线模型相比，MT-HM3D的绩效增长了19.8％，进一步凸显了记忆能力在解决复杂任务中的关键作用。

Title: FlashThink: An Early Exit Method For Efficient Reasoning

Authors: Guochao Jiang, Guofeng Quan, Zepeng Ding, Ziqin Luo, Dixuan Wang, Zheng Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13949
Pdf URL: https://arxiv.org/pdf/2505.13949
Copy Paste: [[2505.13949]] FlashThink: An Early Exit Method For Efficient Reasoning(https://arxiv.org/abs/2505.13949)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown impressive performance in reasoning tasks. However, LLMs tend to generate excessively long reasoning content, leading to significant computational overhead. Our observations indicate that even on simple problems, LLMs tend to produce unnecessarily lengthy reasoning content, which is against intuitive expectations. Preliminary experiments show that at a certain point during the generation process, the model is already capable of producing the correct solution without completing the full reasoning content. Therefore, we consider that the reasoning process of the model can be exited early to achieve the purpose of efficient reasoning. We introduce a verification model that identifies the exact moment when the model can stop reasoning and still provide the correct answer. Comprehensive experiments on four different benchmarks demonstrate that our proposed method, FlashThink, effectively shortens the reasoning content while preserving the model accuracy. For the Deepseek-R1 and QwQ-32B models, we reduced the length of reasoning content by 77.04% and 77.47%, respectively, without reducing the accuracy.
摘要：大型语言模型（LLM）在推理任务中表现出了令人印象深刻的表现。但是，LLM倾向于产生过长的推理内容，从而导致大量的计算开销。我们的观察结果表明，即使在简单的问题上，LLM倾向于产生不必要的冗长推理内容，这是违背直觉期望的。初步实验表明，在生成过程中的某个点上，该模型已经能够在不完成完整推理内容的情况下生成正确的解决方案。因此，我们认为该模型的推理过程可以尽早退出以实现有效推理的目的。我们介绍了一个验证模型，该模型可以识别模型可以停止推理并仍然提供正确答案的确切时刻。对四个不同基准测试的全面实验表明，我们提出的方法flashthink有效地缩短了推理内容，同时保留了模型的准确性。对于DeepSeek-R1和QWQ-32B模型，我们在不降低准确性的情况下将推理内容的长度分别降低了77.04％和77.47％。
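Illustrative sketch (not from the paper): the FlashThink abstract above describes a verification model that decides when reasoning can stop. The loop below shows only that control flow. The names `reason_with_early_exit` and `can_stop`, and the toy string-matching verifier, are hypothetical stand-ins; the paper's actual verifier and chunking are not specified here.

```python
from typing import Callable, Iterable, List


def reason_with_early_exit(
    chunks: Iterable[str],
    can_stop: Callable[[List[str]], bool],
) -> List[str]:
    """Consume reasoning chunks until the verifier accepts the current prefix."""
    prefix: List[str] = []
    for chunk in chunks:
        prefix.append(chunk)
        # Hypothetical verifier call: True means the prefix is judged sufficient.
        if can_stop(prefix):
            break
    return prefix


# Toy stand-ins for a reasoning trace and a verifier (assumptions for illustration).
toy_chunks = ["2 + 2 is 4.", "Let me double-check: 4.", "Checking once more: 4."]
toy_verifier = lambda prefix: any("4" in c for c in prefix)

kept = reason_with_early_exit(toy_chunks, toy_verifier)
saved = 1 - len(kept) / len(toy_chunks)  # fraction of chunks skipped
```

In this toy run the verifier fires after the first chunk, so two of the three chunks are never consumed. With a real model, `chunks` would be a generator over streamed reasoning text, so stopping the loop also stops generation.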

Title: Through a Compressed Lens: Investigating the Impact of Quantization on LLM Explainability and Interpretability

Authors: Qianli Wang, Mingyang Wang, Nils Feldhus, Simon Ostermann, Yuan Cao, Hinrich Schütze, Sebastian Möller, Vera Schmitt
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13963
Pdf URL: https://arxiv.org/pdf/2505.13963
Copy Paste: [[2505.13963]] Through a Compressed Lens: Investigating the Impact of Quantization on LLM Explainability and Interpretability(https://arxiv.org/abs/2505.13963)
Keywords: language model, llm
Abstract: Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). While prior research has extensively investigated the degradation of various LLM capabilities due to quantization, its effects on model explainability and interpretability, which are crucial for understanding decision-making processes, remain unexplored. To address this gap, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with two explainability methods, counterfactual examples and natural language explanations, as well as two interpretability approaches, knowledge memorization analysis and latent multi-hop reasoning analysis. We complement our analysis with a thorough user study, evaluating selected explainability methods. Our findings reveal that, depending on the configuration, quantization can significantly impact model explainability and interpretability. Notably, the direction of this effect is not consistent, as it strongly depends on (1) the quantization method, (2) the explainability or interpretability approach, and (3) the evaluation protocol. In some settings, human evaluation shows that quantization degrades explainability, while in others, it even leads to improvements. Our work serves as a cautionary tale, demonstrating that quantization can unpredictably affect model transparency. This insight has important implications for deploying LLMs in applications where transparency is a critical requirement.
摘要：量化方法被广泛用于加速推理和简化大型语言模型（LLMS）的部署。尽管先前的研究已广泛研究了由于量化而导致的各种LLM功能的降解，但其对模型的解释性和可解释性的影响，这对于理解决策过程至关重要，但仍未开发。为了解决这一差距，我们与两种可解释的方法，反事实示例和自然语言解释以及两种可解释性方法，知识记忆分析和潜在的多跳上合理推理分析一起，使用三种常见的量化技术进行了全面的实验。我们通过详尽的用户研究对分析进行补充，并评估所选的解释性方法。我们的发现表明，根据配置，量化可以显着影响模型的解释性和解释性。值得注意的是，这种效果的方向不一致，因为它在很大程度上取决于（1）量化方法，（2）解释性或可解释性方法，以及（3）评估协议。在某些情况下，人类评估表明，量化降低了解释性，而在其他情况下，它甚至会导致改进。我们的工作是一个警示性的故事，表明量化可以不可预测地影响模型透明度。这种见解对在透明性是关键要求的应用程序中部署LLM具有重要意义。

Title: CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring

Authors: Jiamin Su, Yibo Yan, Zhuoran Gao, Han Zhang, Xiang Liu, Xuming Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13965
Pdf URL: https://arxiv.org/pdf/2505.13965
Copy Paste: [[2505.13965]] CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring(https://arxiv.org/abs/2505.13965)
Keywords: language model, llm, agent
Abstract: Automated Essay Scoring (AES) is crucial for modern education, particularly with the increasing prevalence of multimodal assessments. However, traditional AES methods struggle with evaluation generalizability and multimodal perception, while even recent Multimodal Large Language Model (MLLM)-based approaches can produce hallucinated justifications and scores misaligned with human judgment. To address the limitations, we introduce CAFES, the first collaborative multi-agent framework specifically designed for AES. It orchestrates three specialized agents: an Initial Scorer for rapid, trait-specific evaluations; a Feedback Pool Manager to aggregate detailed, evidence-grounded strengths; and a Reflective Scorer that iteratively refines scores based on this feedback to enhance human alignment. Extensive experiments, using state-of-the-art MLLMs, achieve an average relative improvement of 21% in Quadratic Weighted Kappa (QWK) against ground truth, especially for grammatical and lexical diversity. Our proposed CAFES framework paves the way for an intelligent multimodal AES system. The code will be available upon acceptance.
摘要：自动论文评分（AES）对于现代教育至关重要，尤其是随着多模式评估的越来越多的流行率。但是，传统的AES方法在评估概括性和多模式感知方面遇到了困难，而即使是最近的多模式大语模型（MLLM）的方法也可以产生幻觉的理由，并与人类判断失误。为了解决限制，我们介绍了咖啡馆，这是第一个专门为AES设计的合作多代理框架。它精心策划了三种专门的代理：最初的得分手，用于快速，特征特定的评估；反馈池经理，汇总了详细的，有证据的优势；以及一个反思性得分手，它可以根据此反馈来迭代地优化分数，以增强人类的一致性。使用最先进的MLLM的广泛实验在二次加权Kappa（QWK）中平均相对改善，尤其是对于语法和词汇多样性而言。我们提出的咖啡馆框架为智能的多模式系统铺平了道路。该代码将在接受后可用。
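Illustrative sketch (not from the paper): the CAFES abstract above reports results in Quadratic Weighted Kappa (QWK). A minimal pure-Python version of the standard QWK formula is given below for readers unfamiliar with the metric. The function name and the example scores are assumptions chosen for illustration, and labels are assumed to be integers in `0..num_labels-1`.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    human: Sequence[int], model: Sequence[int], num_labels: int
) -> float:
    """Standard QWK: 1 - (weighted observed disagreement / weighted expected disagreement)."""
    n = len(human)
    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0.0] * num_labels for _ in range(num_labels)]
    for h, m in zip(human, model):
        observed[h][m] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(num_labels)) for j in range(num_labels)]

    num = 0.0
    den = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2  # quadratic penalty
            expected = hist_h[i] * hist_m[j] / n           # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    if den == 0:  # both raters used a single identical label
        return 1.0
    return 1.0 - num / den


perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
near = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 2], 4)
```

Because the penalty grows with the squared distance between scores, a model that is off by one level is penalized far less than one that is off by three, which suits ordinal essay scores.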

Title: Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals

Authors: Qianli Wang, Van Bach Nguyen, Nils Feldhus, Luis Felipe Villa-Arenas, Christin Seifert, Sebastian Möller, Vera Schmitt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13972
Pdf URL: https://arxiv.org/pdf/2505.13972
Copy Paste: [[2505.13972]] Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals(https://arxiv.org/abs/2505.13972)
Keywords: language model, llm
Abstract: Counterfactual examples are widely employed to enhance the performance and robustness of large language models (LLMs) through counterfactual data augmentation (CDA). However, the selection of the judge model used to evaluate label flipping, the primary metric for assessing the validity of generated counterfactuals for CDA, yields inconsistent results. To decipher this, we define four types of relationships between the counterfactual generator and judge models. Through extensive experiments involving two state-of-the-art LLM-based methods, three datasets, five generator models, and 15 judge models, complemented by a user study (n = 90), we demonstrate that judge models with an independent, non-fine-tuned relationship to the generator model provide the most reliable label flipping evaluations. Relationships between the generator and judge models, which are closely aligned with the user study for CDA, result in better model performance and robustness. Nevertheless, we find that the gap between the most effective judge models and the results obtained from the user study remains considerably large. This suggests that a fully automated pipeline for CDA may be inadequate and requires human intervention.
摘要：反事实的例子被广泛用于通过反事实数据增强（CDA）来增强大语言模型（LLMS）的鲁棒性和鲁棒性。然而，选择用于评估标签翻转的法官模型，这是评估CDA产生的反事实有效性的主要指标，会产生不一致的结果。为了破译这一点，我们定义了反事实发生器和法官模型之间的四种类型的关系。通过涉及两种基于LLM的方法，三个数据集，五个发电机模型和15个法官模型的广泛实验，并在用户研究中得到补充（n = 90），我们证明，与发电机模型具有独立，非预定关系的法官模型提供了最可靠的标签标签翻转评估。与CDA的用户研究紧密一致的发电机和法官模型之间的关系，可以提高模型性能和鲁棒性。然而，我们发现最有效的法官模型与从用户研究获得的结果之间的差距仍然很大。这表明CDA的全自动管道可能不足，需要人类干预。

Title: Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models

Authors: Wenhui Zhu, Xuanzhao Dong, Xin Li, Peijie Qiu, Xiwen Chen, Abolfazl Razi, Aris Sotiras, Yi Su, Yalin Wang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.13973
Pdf URL: https://arxiv.org/pdf/2505.13973
Copy Paste: [[2505.13973]] Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models(https://arxiv.org/abs/2505.13973)
Keywords: language model, llm
Abstract: Recently, reinforcement learning (RL)-based tuning has shifted the trajectory of Multimodal Large Language Models (MLLMs), particularly following the introduction of Group Relative Policy Optimization (GRPO). However, directly applying it to medical tasks remains challenging for achieving clinically grounded model behavior. Motivated by the need to align model response with clinical expectations, we investigate four critical dimensions that affect the effectiveness of RL-based tuning in medical visual question answering (VQA): base model initialization strategy, the role of medical semantic alignment, the impact of length-based rewards on long-chain reasoning, and the influence of bias. We conduct extensive experiments to analyze these factors for medical MLLMs, providing new insights into how models are fine-tuned for a specific domain. Our results also demonstrate that GRPO-based RL tuning consistently outperforms standard supervised fine-tuning (SFT) in both accuracy and reasoning quality.
摘要：最近，基于强化学习（RL）的调整改变了多模式大语言模型（MLLM）的轨迹，尤其是在引入小组相对策略优化（GRPO）之后。但是，将其直接应用于医疗任务仍然具有挑战性，即实现临床基础模型行为。由于需要将模型响应与临床期望保持一致的动机，我们研究了四个关键维度，这些临界维度影响了基于RL的调谐在医学视觉问题答案中的有效性（VQA）：基本模型初始化策略，医学语义一致性的作用，基于长度的奖励对长链推理的影响以及偏见的影响。我们进行了广泛的实验，以分析医疗MLLM的这些因素，从而提供了有关模型特定于域特定调整的新见解。此外，我们的结果还表明，基于GRPO的RL调整在准确性和推理质量方面始终优于标准监督微调（SFT）。

Title: DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models

Authors: Yuxuan Jiang, Dawei Li, Frank Ferraro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13975
Pdf URL: https://arxiv.org/pdf/2505.13975
Copy Paste: [[2505.13975]] DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models(https://arxiv.org/abs/2505.13975)
Keywords: chain-of-thought
Abstract: While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantial inefficiency. To address this, we propose Distilled Reasoning Pruning (DRP), a hybrid framework that combines inference-time pruning with tuning-based distillation, two widely used strategies for efficient reasoning. DRP uses a teacher model to perform skill-aware step decomposition and content pruning, and then distills the pruned reasoning paths into a student model, enabling it to reason both efficiently and accurately. Across several challenging mathematical reasoning datasets, we find that models trained with DRP achieve substantial improvements in token efficiency without sacrificing accuracy. Specifically, DRP reduces average token usage on GSM8K from 917 to 328 while improving accuracy from 91.7% to 94.1%, and achieves a 43% token reduction on AIME with no performance drop. Further analysis shows that aligning the reasoning structure of training CoTs with the student's reasoning capacity is critical for effective knowledge transfer and performance gains.
摘要：尽管大型推理模型（LRMS）通过长期的思想链（COT）推理在复杂的推理任务中表现出成功，但它们的推论通常涉及过度的详细推理痕迹，从而导致效率很大。为了解决这个问题，我们提出了蒸馏推理修剪（DRP），这是一种混合框架，将推理时间修剪与基于调整的蒸馏相结合，这两种广泛使用的有效推理的策略。 DRP使用教师模型执行技能感知的步骤分解和内容修剪，然后将修剪的推理路径提炼成学生模型，使其能够有效，准确地进行推理。在几个具有挑战性的数学推理数据集中，我们发现接受DRP训练的模型在不牺牲准确性的情况下实现了令牌效率的实质性提高。具体而言，DRP将GSM8K的平均令牌用法从917降低到328，同时将准确性从91.7％提高到94.1％，并在没有性能下降的情况下降低了43％的标记。进一步的分析表明，将培训COTS的推理结构与学生的推理能力保持一致对于有效的知识转移和绩效提高至关重要。
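Worked arithmetic (derived only from the figures quoted in the DRP abstract above): the abstract gives absolute GSM8K token counts but a relative figure for AIME, so the snippet below converts the GSM8K numbers to a relative reduction for comparison. The helper name `relative_reduction` is an illustrative assumption.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrank, e.g. 0.25 for a 25% reduction."""
    return (before - after) / before


# Figures quoted in the abstract: average GSM8K tokens 917 -> 328,
# accuracy 91.7% -> 94.1%.
gsm8k_token_reduction = relative_reduction(917, 328)
accuracy_gain_points = 94.1 - 91.7

print(f"GSM8K token reduction: {gsm8k_token_reduction:.1%}")
print(f"GSM8K accuracy gain: {accuracy_gain_points:.1f} points")
```

On these figures the GSM8K reduction is about 64.2%, larger than the 43% reported for AIME, and it comes with a 2.4-point accuracy gain.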

Title: The Hallucination Tax of Reinforcement Finetuning

Authors: Linxin Song, Taiwei Shi, Jieyu Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13988
Pdf URL: https://arxiv.org/pdf/2505.13988
Copy Paste: [[2505.13988]] The Hallucination Tax of Reinforcement Finetuning(https://arxiv.org/abs/2505.13988)
Keywords: language model, llm, hallucination
Abstract: Reinforcement finetuning (RFT) has become a standard approach for enhancing the reasoning capabilities of large language models (LLMs). However, its impact on model trustworthiness remains underexplored. In this work, we identify and systematically study a critical side effect of RFT, which we term the hallucination tax: a degradation in refusal behavior that causes models to confidently produce hallucinated answers to unanswerable questions. To investigate this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of unanswerable math problems designed to probe models' ability to recognize an unanswerable question by reasoning from insufficient or ambiguous information. Our results show that standard RFT training could reduce model refusal rates by more than 80%, which significantly increases the model's tendency to hallucinate. We further demonstrate that incorporating just 10% SUM during RFT substantially restores appropriate refusal behavior, with minimal accuracy trade-offs on solvable tasks. Crucially, this approach enables LLMs to leverage inference-time compute to reason about their own uncertainty and knowledge boundaries, improving generalization not only to out-of-domain math problems but also to factual question answering tasks.
摘要：加强框（RFT）已成为增强大语言模型（LLM）推理能力的标准方法。但是，它对模型可信度的影响仍未得到充实。在这项工作中，我们确定并系统地研究了RFT的关键副作用，我们将其称为幻觉税：拒绝行为的退化，导致模型可以确保对无法回答的问题产生幻觉答案。为了调查这一点，我们引入了总和（合成无法回答的数学），这是一个无法回答的数学问题的高质量数据集，旨在通过从不足或模棱两可的信息中推理来探讨模型识别无法回答的问题的能力。我们的结果表明，标准的RFT训练可以将模型拒绝率降低超过80％，从而大大增加了模型幻觉的趋势。我们进一步证明，在RFT期间仅合并10％的总和实质上恢复了适当的拒绝行为，而在可解决的任务上的精确度最低。至关重要的是，这种方法使LLM能够利用推理时间计算来推理其自身的不确定性和知识边界，不仅改善了对跨域数学问题的概括，而且还提高了对回答任务的事实问题。

Title: DecIF: Improving Instruction-Following through Meta-Decomposition

Authors: Tingfeng Hui, Pengyu Zhu, Bowen Ping, Ling Tang, Yaqi Zhang, Sen Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13990
Pdf URL: https://arxiv.org/pdf/2505.13990
Copy Paste: [[2505.13990]] DecIF: Improving Instruction-Following through Meta-Decomposition(https://arxiv.org/abs/2505.13990)
Keywords: language model, llm
Abstract: Instruction-following has emerged as a crucial capability for large language models (LLMs). However, existing approaches often rely on pre-existing documents or external resources to synthesize instruction-following data, which limits their flexibility and generalizability. In this paper, we introduce DecIF, a fully autonomous, meta-decomposition guided framework that generates diverse and high-quality instruction-following data using only LLMs. DecIF is grounded in the principle of decomposition. For instruction generation, we guide LLMs to iteratively produce various types of meta-information, which are then combined with response constraints to form well-structured and semantically rich instructions. We further utilize LLMs to detect and resolve potential inconsistencies within the generated instructions. Regarding response generation, we decompose each instruction into atomic-level evaluation criteria, enabling rigorous validation and the elimination of inaccurate instruction-response pairs. Extensive experiments across a wide range of scenarios and settings demonstrate DecIF's superior performance on instruction-following tasks. Further analysis highlights its strong flexibility, scalability, and generalizability in automatically synthesizing high-quality instruction data.
摘要：遵循指示性跟踪已成为大型语言模型（LLMS）的关键能力。但是，现有的方法通常依赖于先前存在的文档或外部资源来综合遵循指令的数据，从而限制了其灵活性和可推广性。在本文中，我们介绍了Decif，这是一个完全自主的，分解的指导框架，仅使用LLMS生成多样化和高质量的指导遵循数据。 Decif是基于分解原则。为了生成教学，我们将LLMS引导到迭代产生各种类型的元信息，然后将其与响应约束结合在一起，形成结构良好和语义上丰富的指令。我们进一步利用LLM来检测和解决生成指令中的潜在不一致之处。关于响应生成，我们将每种指令分解为原子级评估标准，实现严格的验证以及消除不准确的指令 - 响应对。在各种场景和设置中进行的广泛实验表明，Decif在跟踪任务上的出色表现。进一步的分析强调了其在自动综合高质量指令数据中的强大灵活性，可伸缩性和概括性。

Title: Social Sycophancy: A Broader Understanding of LLM Sycophancy

Authors: Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, Dan Jurafsky
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.13995
Pdf URL: https://arxiv.org/pdf/2505.13995
Copy Paste: [[2505.13995]] Social Sycophancy: A Broader Understanding of LLM Sycophancy(https://arxiv.org/abs/2505.13995)
Keywords: llm
Abstract: A serious risk to the safety and utility of LLMs is sycophancy, i.e., excessive agreement with and flattery of the user. Yet existing work focuses on only one aspect of sycophancy: agreement with users' explicitly stated beliefs that can be compared to a ground truth. This overlooks forms of sycophancy that arise in ambiguous contexts such as advice and support-seeking, where there is no clear ground truth, yet sycophancy can reinforce harmful implicit assumptions, beliefs, or actions. To address this gap, we introduce a richer theory of social sycophancy in LLMs, characterizing sycophancy as the excessive preservation of a user's face (the positive self-image a person seeks to maintain in an interaction). We present ELEPHANT, a framework for evaluating social sycophancy across five face-preserving behaviors (emotional validation, moral endorsement, indirect language, indirect action, and accepting framing) on two datasets: open-ended questions (OEQ) and Reddit's r/AmITheAsshole (AITA). Across eight models, we show that LLMs consistently exhibit high rates of social sycophancy: on OEQ, they preserve face 47% more than humans, and on AITA, they affirm behavior deemed inappropriate by crowdsourced human judgments in 42% of cases. We further show that social sycophancy is rewarded in preference datasets and is not easily mitigated. Our work provides theoretical grounding and empirical tools (datasets and code) for understanding and addressing this under-recognized but consequential issue.
摘要：LLM的安全性和实用性的严重风险是粘糊糊，即与用户的过度同意和奉承。然而，现有的工作仅着眼于粘糊精的一个方面：与用户明确指定的信念的同意，可以将其与地面真理进行比较。这忽略了在没有明确的地面真理的诸如建议和支持的诸如咨询和支持之类的诸如忠告和寻求支持的情况下出现的糊状形式，但是sycophancy可以加强有害的隐式假设，信念或行动。为了解决这一差距，我们介绍了LLMS中更丰富的社会摇摇欲坠的理论，将hipophancy描述为对用户脸的过度保存（一个人寻求在互动中维持的积极自我形象）。我们介绍了大象，这是一个框架，用于评估五种保护表面的行为（情感验证，道德认可，间接语言，间接行动和接受框架）的框架：开放式问题（OEQ）和REDDIT的R/AmitheasShole（AITA）。在八个模型中，我们表明LLM始终表现出高度的社会粘浮浪：在OEQ上，他们的面部比人类多47％，在AITA上，他们确认了42％案件中众包人类判断认为不合适的行为。我们进一步表明，社会糊精在偏好数据集中得到了奖励，并且不容易缓解。我们的工作提供了理论基础和经验工具（数据集和代码），以理解和解决这一不认可但结果问题的问题。

Title: Activation-Guided Consensus Merging for Large Language Models

Authors: Yuxuan Yao, Shuqi Liu, Zehua Liu, Qintong Li, Mingyang Liu, Xiongwei Han, Zhijiang Guo, Han Wu, Linqi Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14009
Pdf URL: https://arxiv.org/pdf/2505.14009
Copy Paste: [[2505.14009]] Activation-Guided Consensus Merging for Large Language Models(https://arxiv.org/abs/2505.14009)
Keywords: language model, llm, prompt
Abstract: Recent research has increasingly focused on reconciling the reasoning capabilities of System 2 with the efficiency of System 1. While existing training-based and prompt-based approaches face significant challenges in terms of efficiency and stability, model merging emerges as a promising strategy to integrate the diverse capabilities of different Large Language Models (LLMs) into a unified model. However, conventional model merging methods often assume uniform importance across layers, overlooking the functional heterogeneity inherent in neural components. To address this limitation, we propose Activation-Guided Consensus Merging (ACM), a plug-and-play merging framework that determines layer-specific merging coefficients based on mutual information between activations of pre-trained and fine-tuned models. ACM effectively preserves task-specific capabilities without requiring gradient computations or additional training. Extensive experiments on Long-to-Short (L2S) and general merging tasks demonstrate that ACM consistently outperforms all baseline methods. For instance, in the case of Qwen-7B models, TIES-Merging equipped with ACM achieves a 55.3% reduction in response length while simultaneously improving reasoning accuracy by 1.3 points. We submit the code with the paper for reproducibility, and it will be publicly available.
摘要：最近的研究越来越集中于将系统2的推理能力与系统1的效率进行调和。尽管现有的基于培训和基于及时的迅速的方法在效率和稳定性方面面临重大挑战，但模型合并是将不同大语模型（LLMS）的多样性功能整合到统一模型中的有希望的策略。但是，传统的模型合并方法通常会在层次上统一重要性，从而忽略了神经成分固有的功能异质性。为了解决这个限制，我们提出\ textbf {a} ctivation导式\ textbf {c} onsensus \ textbf {m} erging（\ textbf {acm}），这是一个基于互惠信息的插件合并框架，该插件的合并框架基于相互的模型，并基于相互的模型和互联网模型，并在互联网之间进行了互联网的模型和罚款。 ACM有效地保留了特定于任务的功能，而无需梯度计算或其他培训。关于远程（L2S）和一般合并任务的广泛实验表明，ACM始终优于所有基线方法。例如，在QWEN-7B模型的情况下，配备ACM的扎带合并实现了响应长度的降低，同时通过\ textbf {1.3}点同时提高了推理精度的降低。我们将代码与论文提交以供重复可再现性，并且将公开使用。

Title: AUTOLAW: Enhancing Legal Compliance in Large Language Models via Case Law Generation and Jury-Inspired Deliberation

Authors: Tai D. Nguyen, Long H. Pham, Jun Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14015
Pdf URL: https://arxiv.org/pdf/2505.14015
Copy Paste: [[2505.14015]] AUTOLAW: Enhancing Legal Compliance in Large Language Models via Case Law Generation and Jury-Inspired Deliberation(https://arxiv.org/abs/2505.14015)
Keywords: language model, llm
Abstract: The rapid advancement of domain-specific large language models (LLMs) in fields like law necessitates frameworks that account for nuanced regional legal distinctions, which are critical for ensuring compliance and trustworthiness. Existing legal evaluation benchmarks often lack adaptability and fail to address diverse local contexts, limiting their utility in dynamically evolving regulatory landscapes. To address these gaps, we propose AutoLaw, a novel violation detection framework that combines adversarial data generation with a jury-inspired deliberation process to enhance legal compliance of LLMs. Unlike static approaches, AutoLaw dynamically synthesizes case law to reflect local regulations and employs a pool of LLM-based "jurors" to simulate judicial decision-making. Jurors are ranked and selected based on synthesized legal expertise, enabling a deliberation process that minimizes bias and improves detection accuracy. Evaluations across three benchmarks, Law-SG, Case-SG (legality), and Unfair-TOS (policy), demonstrate AutoLaw's effectiveness: adversarial data generation improves LLM discrimination, while the jury-based voting strategy significantly boosts violation detection rates. Our results highlight the framework's ability to adaptively probe legal misalignments and deliver reliable, context-aware judgments, offering a scalable solution for evaluating and enhancing LLMs in legally sensitive applications.
摘要：像法律这样的领域中特定领域的大语言模型（LLM）的快速发展需要解决细微的区域法律区别的框架，这对于确保合规性和可信度至关重要。现有的法律评估基准通常缺乏适应性，并且无法解决各种地方环境，从而将其效用限制在动态发展的监管景观中。为了解决这些差距，我们提出了Autolaw，这是一个新颖的违规检测框架，将对抗性数据的生成与陪审团启发的审议过程相结合，以增强LLM的法律合规性。与静态方法不同，Autolaw动态地综合了判例法以反映当地法规，并采用了基于LLM的“陪审员”来模拟司法决策。根据合成的法律专业知识对陪审员进行排名和选择，从而实现了审议过程，从而最大程度地降低了偏见并提高了检测准确性。跨三个基准测试的评估：法律-SG，案例SG（合法性）和不公平的TOS（政策）证明了Autolaw的有效性：对抗性数据的生成改善了LLM歧视，而基于陪审团的投票策略则显着提高了违反检测率。我们的结果突出了该框架适应探测法律未对准并提供可靠的，上下文感知判断的能力，提供了可扩展的解决方案，以评估和增强具有法律敏感的应用程序中的LLM。
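Illustrative sketch (not from the paper): the AutoLaw abstract above says jurors are ranked and selected by expertise and that a voting strategy produces the verdict. The snippet shows one simple reading of that idea, top-k selection followed by a majority vote. The function name, the numeric expertise scores, and the use of plain majority voting are all assumptions; the paper's actual ranking and deliberation procedure may differ.

```python
from collections import Counter
from typing import List, Tuple


def jury_verdict(jurors: List[Tuple[float, str]], k: int) -> str:
    """Pick the k highest-scoring jurors, then return their majority vote.

    Each juror is a hypothetical (expertise_score, vote) pair.
    Use an odd k with two labels to avoid ties.
    """
    selected = sorted(jurors, key=lambda juror: juror[0], reverse=True)[:k]
    votes = Counter(vote for _, vote in selected)
    return votes.most_common(1)[0][0]


# Made-up pool: the three most expert jurors lean "violation",
# while the full pool leans "compliant".
pool = [
    (0.9, "violation"),
    (0.8, "compliant"),
    (0.7, "violation"),
    (0.2, "compliant"),
    (0.1, "compliant"),
]
expert_verdict = jury_verdict(pool, k=3)
full_pool_verdict = jury_verdict(pool, k=5)
```

The two verdicts differ in this toy pool, which illustrates why selecting jurors by expertise before voting can change the outcome relative to polling everyone.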

Title: From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

Authors: Yingli Shen, Wen Lai, Shuo Wang, Kangyang Luo, Alexander Fraser, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14045
Pdf URL: https://arxiv.org/pdf/2505.14045
Copy Paste: [[2505.14045]] From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora(https://arxiv.org/abs/2505.14045)
Keywords: language model, llm
Abstract: Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.
摘要：事实证明，对大规模多语言数据进行预处理和指导调整可以有效地将大语言模型（LLMS）缩放到低资源语言。但是，此类数据的未对准性质限制了其有效捕获跨语性语义的能力。相比之下，多通行的数据（在多种语言之间都对齐相同的内容）提供了更强的跨语性一致性，并为改善多语言性能提供了更大的潜力。在本文中，我们基于TED谈话，介绍了一个大型高质量的多路平行语料库TED2025。语料库跨越113种语言，并并行排列多达50种语言，以确保广泛的多语言覆盖范围。使用此数据集，我们研究了利用多路并行数据增强LLM的最佳实践，包括持续预处理，指导调整的策略以及对关键影响因素的分析。六个多语言基准测试的实验表明，在多路并行数据上训练的模型始终优于那些接受未对齐的多语言数据的模型。

Title: Improved Methods for Model Pruning and Knowledge Distillation

Authors: Wei Jiang, Anying Fu, Youling Zhang
Subjects: cs.CL, cs.CE
Abstract URL: https://arxiv.org/abs/2505.14052
Pdf URL: https://arxiv.org/pdf/2505.14052
Copy Paste: [[2505.14052]] Improved Methods for Model Pruning and Knowledge Distillation(https://arxiv.org/abs/2505.14052)
Keywords: language model
Abstract: Model pruning is a performance optimization technique for large language models like R1 or o3-mini. However, existing pruning methods often lead to significant performance degradation or require extensive retraining and fine-tuning. Pruning aims to identify and remove neurons and connections that are unlikely to contribute during the human-computer interaction phase. Our goal is to obtain a much smaller and faster knowledge-distilled model that can quickly generate content almost as good as that of the unpruned one. We propose MAMA Pruning, short for Movement and Magnitude Analysis, an improved pruning method that effectively reduces model size and computational complexity while maintaining performance comparable to the original unpruned model even at extreme pruning levels. The improved method uses weights and biases fixed in the pre-training phase, together with GRPO rewards verified during the post-training phase, as our novel pruning indicators. Preliminary experimental results show that our method outperforms or is comparable to state-of-the-art methods across various pruning levels and different downstream computational linguistics tasks.
摘要：模型修剪是针对R1或O3-Mini等大型语言模型的性能优化技术。但是，现有的修剪方法通常会导致大量的性能降解或需要大量的再训练和微调。该技术旨在识别和去除神经元，连接不太可能导致人类计算阶段的贡献。我们的目标是获得一个较小，更快的知识蒸馏模型，该模型几乎可以生成与未经修复的内容一样好的内容。我们提出了妈妈修剪，运动级和幅度分析的缩写，这是一种改进的修剪方法，可有效地降低模型大小和计算复杂性，同时即使在极端修剪水平下，也可以保持与原始未经修复模型相当的性能。改进的方法是基于权重，固定在训练阶段的偏差，在训练后阶段验证的GRPO奖励是我们的新修剪指标。初步实验结果表明，我们的方法优于各种修剪水平和不同下游计算语言学任务的最先进方法。

Title: Enhancing LLMs via High-Knowledge Data Selection

Authors: Feiyu Duan, Xuemiao Zhang, Sirui Wang, Haoran Que, Yuqi Liu, Wenge Rong, Xunliang Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14070
Pdf URL: https://arxiv.org/pdf/2505.14070
Copy Paste: [[2505.14070]] Enhancing LLMs via High-Knowledge Data Selection(https://arxiv.org/abs/2505.14070)
Keywords: language model, llm
Abstract: The performance of Large Language Models (LLMs) is intrinsically linked to the quality of their training data. Although several studies have proposed methods for high-quality data selection, they do not consider the importance of knowledge richness in text corpora. In this paper, we propose a novel and gradient-free High-Knowledge Scorer (HKS) to select high-quality data from the dimension of knowledge, to alleviate the problem of knowledge scarcity in the pre-training corpus. We propose a comprehensive multi-domain knowledge element pool and introduce knowledge density and coverage as metrics to assess the knowledge content of the text. Based on this, we propose a comprehensive knowledge scorer to select data with intensive knowledge, which can also be utilized for domain-specific high-knowledge data selection by restricting knowledge elements to the specific domain. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves the model's performance in knowledge-intensive and general comprehension tasks, and is effective in enhancing both the generic and domain-specific capabilities of the model.
摘要：大型语言模型（LLM）的性能与其培训数据的质量本质上相关。尽管有几项研究提出了用于高质量数据选择的方法，但它们并不认为知识丰富性在文本语料库中的重要性。在本文中，我们提出了一个新颖且无梯度的高知识分数（HKS），以从知识维度中选择高质量的数据，以减轻预先训练的语料库中知识稀缺的问题。我们提出了一个全面的多域知识要素池，并将知识密度和覆盖范围作为指标，以评估文本的知识内容。基于此，我们提出了一个全面的知识评分者，以选择具有密集知识的数据，这也可以通过将知识元素限制为特定领域来用于特定领域的高知识数据选择。我们在高知识的双语数据集上训练模型，实验结果表明，我们的得分手在知识密集型和一般的理解任务中提高了模型的性能，并且有效地增强了模型的通用和特定于域的功能。
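Illustrative sketch (not from the paper): the HKS abstract above names knowledge density and coverage as metrics computed against a knowledge element pool, but does not define them. The code assumes one plausible reading: density is the share of tokens that are pool elements, and coverage is the share of pool elements that appear at least once. The function name, whitespace tokenization, and the tiny pool are all assumptions.

```python
from typing import Set, Tuple


def knowledge_scores(text: str, pool: Set[str]) -> Tuple[float, float]:
    """Return (density, coverage) under the assumed definitions above."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    if not tokens or not pool:
        return 0.0, 0.0
    hits = [token for token in tokens if token in pool]
    density = len(hits) / len(tokens)      # knowledge tokens per token
    coverage = len(set(hits)) / len(pool)  # distinct pool elements matched
    return density, coverage


# Hypothetical single-word knowledge elements for a biology domain.
bio_pool = {"enzyme", "protein", "ribosome", "mitochondria"}
sample = "the ribosome builds each protein and another protein"
density, coverage = knowledge_scores(sample, bio_pool)
```

Restricting `pool` to one domain, as here, mirrors the abstract's remark that the scorer can be used for domain-specific selection by limiting the knowledge elements considered.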

Title: BAR: A Backward Reasoning based Agent for Complex Minecraft Tasks

Authors: Weihong Du, Wenrui Liao, Binyu Yan, Hongru Liang, Anthony G. Cohn, Wenqiang Lei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14079
Pdf URL: https://arxiv.org/pdf/2505.14079
Copy Paste: [[2505.14079]] BAR: A Backward Reasoning based Agent for Complex Minecraft Tasks(https://arxiv.org/abs/2505.14079)
Keywords: language model, llm, agent
Abstract: Large language model (LLM) based agents have shown great potential in following human instructions and automatically completing various tasks. To complete a task, the agent needs to decompose it into easily executed steps by planning. Existing studies mainly conduct the planning by inferring which steps should be executed next, starting from the agent's initial state. However, this forward reasoning paradigm doesn't work well for complex tasks. We propose to study this issue in Minecraft, a virtual environment that simulates complex tasks based on real-world scenarios. We believe that the failure of forward reasoning is caused by the large perception gap between the agent's initial state and the task goal. To this end, we leverage backward reasoning and start planning from the terminal state, which can directly achieve the task goal in one step. Specifically, we design a BAckward Reasoning based agent (BAR). It is equipped with a recursive goal decomposition module, a state consistency maintaining module and a stage memory module to make robust, consistent, and efficient planning starting from the terminal state. Experimental results demonstrate the superiority of BAR over existing methods and the effectiveness of the proposed modules.
摘要：基于大型语言模型（LLM）代理在遵循人类指示并自动完成各种任务方面表现出了巨大的潜力。要完成任务，代理需要通过计划将其分解为易于执行的步骤。现有研究主要是通过推断出从代理商的初始状态开始下一步应执行哪些步骤来进行计划。但是，对于复杂的任务，这种远期推理范式无法正常工作。我们建议在Minecraft中研究此问题，Minecraft是一个虚拟环境，可以根据现实情况模拟复杂的任务。我们认为，远期推理的失败是由代理商的初始状态和任务目标之间的巨大看法差距引起的。为此，我们利用向后推理，并从终端状态开始计划，该计划可以在一步中直接实现任务目标。具体而言，我们设计了基于推理的代理（bar）。它配备了递归目标分解模块，维护模块的状态一致性和阶段内存模块，以从终端状态开始进行健壮，一致和有效的计划。实验结果表明，钢筋优于现有方法以及提议的模块的有效性。
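Illustrative sketch (not from the paper): the BAR abstract above describes planning backward from the terminal state via recursive goal decomposition. The toy planner below starts at the goal item and recursively expands its prerequisites until it reaches items that need nothing. The recipe table and function name are hypothetical, quantities are ignored, and the `done` set only loosely echoes the idea of remembering completed sub-goals.

```python
from typing import Dict, List, Optional, Set

# Hypothetical, simplified crafting dependencies (item -> prerequisites).
RECIPES: Dict[str, List[str]] = {
    "wooden_pickaxe": ["planks", "stick"],
    "stick": ["planks"],
    "planks": ["log"],
    "log": [],  # obtainable directly from the environment
}


def plan_backward(
    goal: str, recipes: Dict[str, List[str]], done: Optional[Set[str]] = None
) -> List[str]:
    """Decompose the goal into prerequisites first, then emit the goal itself."""
    if done is None:
        done = set()
    if goal in done:  # sub-goal already planned, do not repeat it
        return []
    steps: List[str] = []
    for prerequisite in recipes[goal]:
        steps.extend(plan_backward(prerequisite, recipes, done))
    done.add(goal)
    steps.append(goal)
    return steps


plan = plan_backward("wooden_pickaxe", RECIPES)
```

The resulting plan lists `log`, `planks`, `stick`, then `wooden_pickaxe`: the reasoning runs from the goal backward, while the emitted steps are in executable forward order.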

Title: Gender Trouble in Language Models: An Empirical Audit Guided by Gender Performativity Theory

Authors: Franziska Sofia Hafner, Ana Valdivia, Luc Rocher
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2505.14080
Pdf URL: https://arxiv.org/pdf/2505.14080
Copy Paste: [[2505.14080]] Gender Trouble in Language Models: An Empirical Audit Guided by Gender Performativity Theory(https://arxiv.org/abs/2505.14080)
Keywords: language model
Abstract: Language models encode and subsequently perpetuate harmful gendered stereotypes. Research has succeeded in mitigating some of these harms, e.g. by dissociating non-gendered terms such as occupations from gendered terms such as 'woman' and 'man'. This approach, however, remains superficial given that associations are only one form of prejudice through which gendered harms arise. Critical scholarship on gender, such as gender performativity theory, emphasizes how harms often arise from the construction of gender itself, such as conflating gender with biological sex. In language models, these issues could lead to the erasure of transgender and gender diverse identities and cause harms in downstream applications, from misgendering users to misdiagnosing patients based on wrong assumptions about their anatomy. For FAccT research on gendered harms to go beyond superficial linguistic associations, we advocate for a broader definition of 'gender bias' in language models. We operationalize insights on the construction of gender through language from gender studies literature and then empirically test how 16 language models of different architectures, training datasets, and model sizes encode gender. We find that language models tend to encode gender as a binary category tied to biological sex, and that gendered terms that do not neatly fall into one of these binary categories are erased and pathologized. Finally, we show that larger models, which achieve better results on performance benchmarks, learn stronger associations between gender and sex, further reinforcing a narrow understanding of gender. Our findings lead us to call for a re-evaluation of how gendered harms in language models are defined and addressed.
摘要：语言模型编码并随后将有害性别的刻板印象永存。研究成功地减轻了其中一些危害，例如通过从性别术语（例如“女人”和“男人””等性别术语中解散非性别术语，例如职业。然而，考虑到联想只是一种偏见的一种形式，这种方法仍然是肤浅的。关于性别的批判性奖学金，例如性别绩效理论，强调了危害通常是由于性别本身的建构而引起的，例如将性别与生物学性别混为一谈。在语言模型中，这些问题可能会导致对跨性别者和性别多样的身份的擦除，并在下游应用中造成危害，从错误的用户到基于对其解剖结构的错误假设误诊患者。为了对性别危害的FACCT研究超越肤浅的语言关联，我们主张在语言模型中对“性别偏见”的更广泛定义。我们通过性别研究文献中的语言对性别的构建进行洞察力，然后经验测试不同建筑，培训数据集和模型大小的16种语言模型如何编码性别。我们发现，语言模型倾向于将性别编码为与生物性别相关的二元类别，并且不整齐地属于这些二进制类别之一的性别术语被删除和病态。最后，我们表明，在性能基准上获得更好的结果，可以学习性别与性别之间的更强关联，从而进一步增强了对性别的狭窄理解。我们的发现使我们呼吁重新评估语言模型中的性别危害的定义和解决。

Title: Beyond Chains: Bridging Large Language Models and Knowledge Bases in Complex Question Answering

Authors: Yihua Zhu, Qianying Liu, Akiko Aizawa, Hidetoshi Shimodaira
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.14099
Pdf URL: https://arxiv.org/pdf/2505.14099
Copy Paste: [[2505.14099]] Beyond Chains: Bridging Large Language Models and Knowledge Bases in Complex Question Answering(https://arxiv.org/abs/2505.14099)
Keywords: language model, llm, hallucination, agent
Abstract: Knowledge Base Question Answering (KBQA) aims to answer natural language questions using structured knowledge from KBs. While LLM-only approaches offer generalization, they suffer from outdated knowledge, hallucinations, and lack of transparency. Chain-based KG-RAG methods address these issues by incorporating external KBs, but are limited to simple chain-structured questions due to the absence of planning and logical structuring. Inspired by semantic parsing methods, we propose PDRR: a four-stage framework consisting of Predict, Decompose, Retrieve, and Reason. Our method first predicts the question type and decomposes the question into structured triples. It then retrieves relevant information from KBs and guides the LLM as an agent to reason over and complete the decomposed triples. Experimental results demonstrate that PDRR consistently outperforms existing methods across various LLM backbones and achieves superior performance on both chain-structured and non-chain complex questions.
摘要：知识基础问题回答（KBQA）旨在使用KBS的结构化知识回答自然语言问题。虽然仅使用LLM的方法提供了概括，但它们遭受了过时的知识，幻觉和缺乏透明度的困扰。基于链条的KG-rag方法通过合并外部KB来解决这些问题，但由于缺乏计划和逻辑结构，仅限于简单的链条结构问题。受语义解析方法的启发，我们提出了PDRR：由预测，分解，检索和推理组成的四阶段框架。我们的方法首先预测问题类型，并将问题分解为结构化的三元组。然后从KBS中检索相关信息，并指导LLM作为代理，以推理并完成分解的三元组。实验结果表明，PDRR始终胜过各种LLM骨架上的现有方法，并且在链条结构和非链复杂问题上都取得了卓越的性能。

Title: MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations

Authors: Ernests Lavrinovics, Russa Biswas, Katja Hose, Johannes Bjerva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14101
Pdf URL: https://arxiv.org/pdf/2505.14101
Copy Paste: [[2505.14101]] MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations(https://arxiv.org/abs/2505.14101)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have inherent limitations of faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks have been developed that provide a test bed for factuality evaluation within the context of English-centric datasets, while relying on supplementary informative context like web links or text passages but ignoring the available structured factual resources. To this end, Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent the facts about entities and their relations with minimal linguistic overhead. We bridge the lack of KG paths and multilinguality for factual language modeling within the existing hallucination evaluation benchmarks and propose a KG-based multilingual, multihop benchmark called MultiHal framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG-paths from open-domain KGs, from which we pruned noisy KG-paths, curating a high-quality subset of 25.9k. Our baseline evaluation shows an absolute scale increase by approximately 0.12 to 0.36 points for the semantic similarity score in KG-RAG over vanilla QA across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate MultiHal will foster future research towards several graph-based hallucination mitigation and fact-checking tasks.
摘要：大型语言模型（LLM）具有忠实和事实的固有局限性，通常称为幻觉。已经开发了几个基准，可以在以英语为中心的数据集的背景下为事实评估提供测试床，同时依靠补充信息上下文（例如Web链接或文本段落），但忽略了可用的结构化事实资源。为此，知识图（kgs）已被确定为缓解幻觉的有用辅助，因为它们提供了一种结构化的方式来表示有关实体的事实及其与最少语言开销的事实。我们在现有的幻觉评估基准中桥梁缺乏KG路径和多语言性，并提出了一种基于kg的多语言，多ihop基准，称为\ textbf {Multihal}构架生成文本评估。作为数据收集管道的一部分，我们从开放域KGS开采了140k kg paths，从中我们修剪了嘈杂的kg paths，策划了25.9k的高质量子集。我们的基线评估表明，跨多种语言和多种模型的语义相似性得分在KG-rag中的语义相似性得分的绝对量表增加了约0.12至0.36点，这表明了KG积分的潜力。我们预计多哈尔将促进未来的研究，以减轻基于图的幻觉和事实检查任务。

Title: Legal Rule Induction: Towards Generalizable Principle Discovery from Analogous Judicial Precedents

Authors: Wei Fan, Tianshi Zheng, Yiran Hu, Zheye Deng, Weiqi Wang, Baixuan Xu, Chunyang Li, Haoran Li, Weixing Shen, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14104
Pdf URL: https://arxiv.org/pdf/2505.14104
Copy Paste: [[2505.14104]] Legal Rule Induction: Towards Generalizable Principle Discovery from Analogous Judicial Precedents(https://arxiv.org/abs/2505.14104)
Keywords: language model, llm, hallucination
Abstract: Legal rules encompass not only codified statutes but also implicit adjudicatory principles derived from precedents that contain discretionary norms, social morality, and policy. While computational legal research has advanced in applying established rules to cases, inducing legal rules from judicial decisions remains understudied, constrained by limitations in model inference efficacy and symbolic reasoning capability. The advent of Large Language Models (LLMs) offers unprecedented opportunities for automating the extraction of such latent principles, yet progress is stymied by the absence of formal task definitions, benchmark datasets, and methodologies. To address this gap, we formalize Legal Rule Induction (LRI) as the task of deriving concise, generalizable doctrinal rules from sets of analogous precedents, distilling their shared preconditions, normative behaviors, and legal consequences. We introduce the first LRI benchmark, comprising 5,121 case sets (38,088 Chinese cases in total) for model tuning and 216 expert-annotated gold test sets. Experimental results reveal that: 1) State-of-the-art LLMs struggle with over-generalization and hallucination; 2) Training on our dataset markedly enhances LLMs' capabilities in capturing nuanced rule patterns across similar cases.
摘要：法律规则不仅包括编纂法规，还包括来自包含可酌情规范，社会道德和政策的先例的隐式裁决原则。尽管计算法律研究在将既定规则应用于案件中提高了进步，但诱使司法决策的法律规则仍在研究中，受到模型推论效力和象征性推理能力的限制的限制。大型语言模型（LLMS）的出现为自动提取此类潜在原则提供了前所未有的机会，但由于缺乏正式的任务定义，基准数据集和方法论所阻碍了进步。为了解决这一差距，我们将法律规则归纳（LRI）形式化为从类似先例的集合中衍生出简洁，可普遍的教义规则的任务，提炼了他们的共同先决条件，规范行为和法律后果。我们介绍了第一个LRI基准，其中包括5,121套案例集（总共38,088个中国案例），用于模型调整和216个专家注销的黄金测试集。实验结果表明：1）最先进的LLM与过度概括和幻觉斗争； 2）在我们的数据集中培训明显增强了LLMS在捕获类似情况下细微的规则模式方面的功能。

Title: A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

Authors: Li Li, Peilin Cai, Ryan A. Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, Nesreen K. Ahmed, Samyadeep Basu, Subhojyoti Mukherjee, Ruiyi Zhang, Zhengmian Hu, Bo Ni, Yuxiao Zhou, Zichao Wang, Yue Huang, Yu Wang, Xiangliang Zhang, Philip S. Yu, Xiyang Hu, Yue Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14106
Pdf URL: https://arxiv.org/pdf/2505.14106
Copy Paste: [[2505.14106]] A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations(https://arxiv.org/abs/2505.14106)
Keywords: language model, llm, prompt
Abstract: We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.
摘要：我们提出了PersonAconvBench，这是一种大规模的基准，用于评估与大语言模型（LLMS）的多转交谈中的个性化推理和产生。与现有的工作着重于孤立的个性化或对话结构的工作不同，PersonAconVbench都集成了这两个核心任务：句子分类，影响回归和以用户为中心的文本生成，跨越了基于Reddit的十个不同的域。该设计实现了对个性化对话上下文如何在现实的多用户方案中塑造LLM输出的系统分析。我们在统一提示设置下基准了几个商业和开源的LLM，并观察到，合并个性化历史可实现实质性的改进，包括相对增长198％，比最佳的非差异基准在情感分类中获得了198％的增长。通过使用评估和代码发布PersonAconVbench，我们旨在支持适应各个样式，跟踪长期环境并产生上下文丰富，引人入胜的响应的LLM的研究。

Title: DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models

Authors: Yakun Zhu, Zhongzhen Huang, Linjie Mu, Yutong Huang, Wei Nie, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14107
Pdf URL: https://arxiv.org/pdf/2505.14107
Copy Paste: [[2505.14107]] DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models(https://arxiv.org/abs/2505.14107)
Keywords: language model
Abstract: The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to benchmark the diagnostic capabilities of current models systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties, deriving from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline, involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3-mini, o1, and DeepSeek-R1, achieve only 45.82%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI's diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development at this https URL.
摘要：能够执行复杂推理任务的开创性大语言模型的出现在应对各种科学挑战（包括在复杂的临床情况下引起的挑战）具有重要的希望。为了使其在实际医疗保健环境中进行安全有效的部署，迫切需要系统地对当前模型的诊断功能进行基准测试。鉴于现有医疗基准在评估高级诊断推理时的局限性，我们提出了诊断，这是一种全面且具有挑战性的基准测试，旨在严格评估专业级别的诊断能力。诊断包括1,113对分割的患者病例和相应的诊断，涵盖了28个医学专业，这些诊断源自在10个顶级医学期刊上发表的临床病例报告。基准是通过细致的施工管道开发的，涉及AI系统和人类专家的多轮筛选和审查，并进行了彻底的检查以防止数据泄漏。我们的研究表明，即使是最先进的推理模型，O3-Mini，O1和DeepSeek-R1也只能达到45.82％，31.09％和17.79％的精度。这一发现突出了面临临床诊断推理挑战时，在当前大型语言模型中的重要概括瓶颈。通过诊断，我们旨在推动AIS诊断推理能力的进一步进步，从而为现实世界中的临床诊断挑战提供更有效的解决方案。我们提供基准和评估工具，用于进一步研究和开发此HTTPS URL。

Title: Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking

Authors: Tianle Gu, Zongqi Wang, Kexin Huang, Yuanqi Yao, Xiangliang Zhang, Yujiu Yang, Xiuying Chen
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2505.14112
Pdf URL: https://arxiv.org/pdf/2505.14112
Copy Paste: [[2505.14112]] Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking(https://arxiv.org/abs/2505.14112)
Keywords: llm
Abstract: Logit-based LLM watermarking traces and verifies AI-generated content by maintaining green and red token lists and increasing the likelihood of green tokens during generation. However, it fails in low-entropy scenarios, where predictable outputs make green token selection difficult without disrupting natural text flow. Existing approaches address this by assuming access to the original LLM to calculate entropy and selectively watermark high-entropy tokens. However, these methods face two major challenges: (1) high computational costs and detection delays due to reliance on the original LLM, and (2) potential risks of model leakage. To address these limitations, we propose Invisible Entropy (IE), a watermarking paradigm designed to enhance both safety and efficiency. Instead of relying on the original LLM, IE introduces a lightweight feature extractor and an entropy tagger to predict whether the entropy of the next token is high or low. Furthermore, based on theoretical analysis, we develop a threshold navigator that adaptively sets entropy thresholds. It identifies a threshold where the watermark ratio decreases as the green token count increases, enhancing the naturalness of the watermarked text and improving detection robustness. Experiments on HumanEval and MBPP datasets demonstrate that IE reduces parameter size by 99% while achieving performance on par with state-of-the-art methods. Our work introduces a safe and efficient paradigm for low-entropy watermarking.
摘要：基于LOGIT的LLM水印轨迹并通过保持绿色和红色令牌列表并增加一代中绿色令牌的可能性来验证AI生成的内容。但是，它在低渗透方案中失败了，可以预测的输出使绿色令牌选择变得困难而不会破坏自然文本流动。现有方法通过假设访问原始LLM来计算熵并有选择地水印高凝聚令牌来解决这一问题。但是，这些方法面临两个主要挑战：（1）由于对原始LLM的依赖以及（2）模型泄漏的潜在风险，高计算成本和检测延迟。为了解决这些限制，我们提出了无形的熵（IE），这是一种旨在提高安全性和效率的水印范式。 IE不依靠原始的LLM，而是引入了轻巧的特征提取器和熵标记器，以预测近乎令牌的熵是高还是低。此外，基于理论分析，我们开发了一个适应性设置熵阈值的阈值导航器。它确定了一个阈值，其中水印比随着绿色代币计数的增加而降低，从而增强了水印文本的自然性并改善了检测鲁棒性。关于HumaneVal和MBPP数据集的实验表明，IE在使用最新方法的同时实现性能，将参数大小降低了99 \％。我们的工作引入了一个安全有效的范式，用于低凝集水印。此https url此https url
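
Illustration (not from the paper): a minimal sketch of the entropy-gated green-list scheme that the abstract above describes as the existing baseline, that is, computing entropy from the model's own logits and watermarking only high-entropy positions. IE itself is described as replacing the entropy computation with a lightweight predictor, which is not reproduced here. The hashing scheme, `gamma`, `delta`, and `entropy_threshold` are assumptions chosen for the example.

```python
import hashlib
import math
import random
from typing import List, Set


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5) -> Set[int]:
    """Pseudo-randomly pick a `gamma` fraction of the vocabulary,
    seeded by the previous token so a detector can recompute it."""
    digest = hashlib.sha256(str(prev_token).encode()).hexdigest()
    rng = random.Random(int(digest, 16) % (2 ** 32))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def watermark_logits(logits: List[float], prev_token: int,
                     delta: float = 2.0,
                     entropy_threshold: float = 1.0) -> List[float]:
    """Add `delta` to green-token logits, but only at high-entropy steps."""
    if entropy(softmax(logits)) < entropy_threshold:
        return list(logits)  # low entropy: leave the prediction untouched
    green = green_list(prev_token, len(logits))
    return [x + delta if i in green else x for i, x in enumerate(logits)]


uniform = [0.0] * 8            # high entropy: watermark is applied
peaked = [10.0] + [0.0] * 7    # low entropy: watermark is skipped
print(watermark_logits(uniform, prev_token=3))
print(watermark_logits(peaked, prev_token=3))
```

The gate is what matters for low-entropy text such as code: when one token dominates, boosting a random half of the vocabulary would either do nothing or distort the output, so those positions are left alone.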

Title: Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst

Authors: Hongru Wang, Deng Cai, Wanjun Zhong, Shijue Huang, Jeff Z. Pan, Zeming Liu, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14116
Pdf URL: https://arxiv.org/pdf/2505.14116
Copy Paste: [[2505.14116]] Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst(https://arxiv.org/abs/2505.14116)
Keywords: language model, llm, chain-of-thought
Abstract: Inference-time scaling has attracted much attention because it significantly enhances the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition, such as reflection and decomposition, and are difficult to create and acquire. In this work, we introduce Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) on how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than +2.5 points across five reasoning tasks: MMLU, GSM8K, ARC-C, HellaSwag, and BBH on two backbone models. Moreover, it brings more improvements with more times of sampling during inference, such as absolute +7.89 average improvement with 64 sampling times, revealing the in-depth, diverse and creative reasoning paths in SRLM against the strong baseline.
摘要：推理时间缩放引起了很多关注，从而通过增加思想链的长度来显着提高复杂推理任务中大型语言模型（LLM）的性能。这些较长的中间推理原理在人类认知方面体现了各种元概述技能，例如反射和分解，难以创造和获取。在这项工作中，我们介绍了\ textit {自我调查语言模型}（SRLM），其中模型本身可以合成更长的COT数据，并通过自我训练迭代地提高性能。通过合并一些演示示例（即1,000个样本），介绍了如何从现有响应中展开隐藏的推理链，这些响应是一种推理催化剂，我们证明了SRLM不仅可以增强模型的初始性能，还可以确保随后的迭代中更稳定，更稳定的改进。我们提出的SRLM在五个推理任务中的平均绝对提高了$+2.5美元以上的积分：MMLU，GSM8K，ARC-C，HELLASWAG和BBH在两个骨干模型上。此外，它在推断期间的抽样次数增加了更多的改进，例如，$+7.89美元的平均改进时间为64美元的抽样时间，揭示了SRLM中针对强大基线的深入，多样和创造性的推理路径。

Title: Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering

Authors: Wei Zhou, Mohsen Mesgar, Heike Adel, Annemarie Friedrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14131
Pdf URL: https://arxiv.org/pdf/2505.14131
Copy Paste: [[2505.14131]] Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering(https://arxiv.org/abs/2505.14131)
Keywords: language model, llm
Abstract: In table question answering (TQA), tables are encoded as either texts or images. Prior work suggests that passing images of tables to multi-modal large language models (MLLMs) performs comparably to or even better than using textual input with large language models (LLMs). However, the lack of controlled setups limits fine-grained distinctions between these approaches. In this paper, we conduct the first controlled study on the effectiveness of several combinations of table representations and models from two perspectives: question complexity and table size. We build a new benchmark based on existing TQA datasets. In a systematic analysis of seven pairs of MLLMs and LLMs, we find that the best combination of table representation and model varies across setups. We propose FRES, a method selecting table representations dynamically, and observe a 10% average performance improvement compared to using both representations indiscriminately.
摘要：在表“问题回答（TQA））中，表被编码为文本或图像。先前的工作表明，将表格的图像传递给多模式大型语言模型（MLLM）的性能比使用大型语言模型（LLMS）的文本输入相当甚至更好。但是，缺乏受控的设置限制了这些方法之间的细粒度区分。在本文中，我们从两个角度进行了第一个对表表示和模型组合有效性的对照研究：问题复杂性和桌子大小。我们基于现有的TQA数据集构建了一个新的基准测试。在对七对MLLM和LLMS的系统分析中，我们发现表表示和模型的最佳组合在设置之间各不相同。我们提出FRES，一种动态选择表表示的方法，与使用这两种表示相比，平均性能提高了10％。

Title: Prior Prompt Engineering for Reinforcement Fine-Tuning

Authors: Pittawat Taveekitworachai, Potsawee Manakul, Sarana Nutanong, Kunat Pipatanakul
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14157
Pdf URL: https://arxiv.org/pdf/2505.14157
Copy Paste: [[2505.14157]] Prior Prompt Engineering for Reinforcement Fine-Tuning(https://arxiv.org/abs/2505.14157)
Keywords: language model, prompt
Abstract: This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt--the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning--remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies--reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization--into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.
摘要：本文在加强微调（RFT）的背景下调查了先前的及时工程（PPE），其中激励语言模型（LMS）以表现出通过奖励信号最大程度地提高性能的行为。尽管现有的RFT研究主要集中于算法，奖励成型和数据策划，但先前提示的设计 - 在培训期间进行查询的指令，以引起诸如逐步推理之类的行为，例如 - 驱动器 - 驱动器 - 毫无疑问。我们研究了不同的PPE方法是否可以指导LMS在RFT之后内部化不同的行为。受推理时间提示工程（IPE）的启发，我们翻译了五种代表性的IPE策略 - 策略，计划，基于代码的推理，知识回忆和示例利用率 - Into相应的PPE方法。我们使用每种PPE方法对QWEN2.5-7B进行了实验，然后评估对内域和室外基准测试的性能（例如AIME2024，HumaneVal+和GPQA-Diamond）。我们的结果表明，所有受PPE训练的模型都超过了IPE促进的对应物，示例PPE方法可实现AIME2024和GPQA-Diamond的最大平均性能增长，并且超过了常用的推理方法。此外，通过调整行为分类框架，我们证明了不同的PPE策略在结果模型中灌输了不同的行为风格。这些发现将PPE定位为RFT的强大但已研究的轴。

Title: Temporal Alignment of Time Sensitive Facts with Activation Engineering

Authors: Sanjay Govindan, Maurice Pagnucco, Yang Song
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14158
Pdf URL: https://arxiv.org/pdf/2505.14158
Copy Paste: [[2505.14158]] Temporal Alignment of Time Sensitive Facts with Activation Engineering(https://arxiv.org/abs/2505.14158)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are trained on diverse and often conflicting knowledge spanning multiple domains and time periods. Some of this knowledge is only valid within specific temporal contexts, such as answering the question, "Who is the President of the United States in 2022?" Ensuring LLMs generate time-appropriate responses is crucial for maintaining relevance and accuracy. In this work we explore activation engineering as a method for temporally aligning LLMs to improve factual recall without any training or dataset creation. In this research we explore an activation engineering technique to ground three versions of LLaMA 2 to specific points in time and examine the effects of varying injection layers and prompting strategies. Our experiments demonstrate up to a 44% and 16% improvement in relative and explicit prompting respectively, achieving comparable performance to the fine-tuning method proposed by Zhao et al. (2024). Notably, our approach achieves similar results to the fine-tuning baseline while being significantly more computationally efficient and requiring no pre-aligned datasets.
摘要：大型语言模型（LLM）经过跨越多个领域和时间段的多样化且经常相互矛盾的知识培训。这些知识中的一些仅在特定的时间环境中有效，例如回答问题：“ 2022年美国总统是谁？”确保LLM产生时间适当的响应对于保持相关性和准确性至关重要。在这项工作中，我们探讨了激活工程作为时间对齐LLM的一种方法，以改善事实召回，而无需进行任何培训或数据集创建。在这项研究中，我们探索了一种激活工程技术，以将三种版本的Llama 2与特定的时间点结合在一起，并检查不同注射层的影响并促使策略。我们的实验证明，相对和明确提示的提示分别提高了44％和16％，与Zhao等人提出的微调方法相当。（2024）。值得注意的是，我们的方法获得了与微调基线相似的结果，同时又有明显的计算效率，并且不需要预先对准的数据集。
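
Illustration (not from the paper): activation engineering is commonly implemented by adding a steering vector to a layer's hidden state at inference time. The sketch below assumes the vector is the difference between mean activations for prompts with and without a year cue; the abstract does not say how the authors derive theirs, and the toy two-dimensional activations, `alpha`, and all function names are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(with_cue: List[Vector], without_cue: List[Vector]) -> Vector:
    """Assumed construction: mean activation with the temporal cue
    minus mean activation without it, taken at one chosen layer."""
    a, b = mean_vector(with_cue), mean_vector(without_cue)
    return [x - y for x, y in zip(a, b)]


def inject(hidden: Vector, steer: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction; no weights change."""
    return [h + alpha * s for h, s in zip(hidden, steer)]


# Toy activations standing in for one layer's residual stream.
acts_with_year = [[1.0, 2.0], [3.0, 4.0]]
acts_without_year = [[0.0, 1.0], [2.0, 1.0]]

steer = steering_vector(acts_with_year, acts_without_year)
print(inject([0.5, 0.5], steer, alpha=2.0))
```

In a real model the injection would happen inside a forward hook at the chosen layer, which is where the abstract's comparison of injection layers would come in.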

Title: Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models

Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14160
Pdf URL: https://arxiv.org/pdf/2505.14160
Copy Paste: [[2505.14160]] Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models(https://arxiv.org/abs/2505.14160)
Keywords: language model
Abstract: Multilingual vision-language models promise universal image-text retrieval, yet their social biases remain under-explored. We present the first systematic audit of three public multilingual CLIP checkpoints -- M-CLIP, NLLB-CLIP, and CAPIVARA-CLIP -- across ten languages that vary in resource availability and grammatical gender. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the assumption that multilinguality mitigates bias, every model exhibits stronger gender bias than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared cross-lingual encoder of NLLB-CLIP transports English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this transfer. Highly gendered languages consistently magnify all measured bias types, but even gender-neutral languages remain vulnerable when cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics conceal language-specific "hot spots," underscoring the need for fine-grained, language-aware bias evaluation in future multilingual vision-language research.
摘要：多语言视觉语言模型有望获得通用图像文本检索，但他们的社会偏见仍然不足。我们介绍了三种公共多语言剪辑检查点 - M-CLIP，NLLB-CLIP和CAPIVARA-CLIP的第一个系统审核，这些审核在资源可用性和语法性别各不相同。使用\ textsc {fairface}的平衡子集和\ textsc {pata}刻板印象套件在零拍设置中，我们量化了种族和性别偏见，并测量刻板印象放大。与假设多语言减轻偏见的假设相反，每个模型都表现出比仅英语基线的性别偏见更强。 Capivara-CLIP在其目标的低资源语言中表明了其最大的偏见，而NLLB-CLIP的共享跨语义编码器将英语性别刻板印象转移到性别中性语言中；松散的耦合编码器在很大程度上避免了此传递。高度性别的语言一致地放大了所有测量的偏见类型，但是当分享跨语性权重进口外国刻板印象时，即使性别中性语言也仍然容易受到伤害。汇总指标掩盖了语言特定的``热点''，强调了对未来多种语言视觉研究的细粒度，语言意识偏见评估的需求。

Title: PL-FGSA: A Prompt Learning Framework for Fine-Grained Sentiment Analysis Based on MindSpore

Authors: Zhenkai Qin, Jiajing He, Qiao Fang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14165
Pdf URL: https://arxiv.org/pdf/2505.14165
Copy Paste: [[2505.14165]] PL-FGSA: A Prompt Learning Framework for Fine-Grained Sentiment Analysis Based on MindSpore(https://arxiv.org/abs/2505.14165)
Keywords: prompt
Abstract: Fine-grained sentiment analysis (FGSA) aims to identify sentiment polarity toward specific aspects within a text, enabling more precise opinion mining in domains such as product reviews and social media. However, traditional FGSA approaches often require task-specific architectures and extensive annotated data, limiting their generalization and scalability. To address these challenges, we propose PL-FGSA, a unified prompt learning-based framework implemented using the MindSpore platform, which integrates prompt design with a lightweight TextCNN backbone. Our method reformulates FGSA as a multi-task prompt-augmented generation problem, jointly tackling aspect extraction, sentiment classification, and causal explanation in a unified paradigm. By leveraging prompt-based guidance, PL-FGSA enhances interpretability and achieves strong performance under both full-data and low-resource conditions. Experiments on three benchmark datasets (SST-2, SemEval-2014 Task 4, and MAMS) demonstrate that our model consistently outperforms traditional fine-tuning methods and achieves F1-scores of 0.922, 0.694, and 0.597, respectively. These results validate the effectiveness of prompt-based generalization and highlight the practical value of PL-FGSA for real-world sentiment analysis tasks.
摘要：细粒度的情感分析（FGSA）旨在确定文本中特定方面的情绪极性，从而在产品评论和社交媒体等领域中更精确的意见采矿。但是，传统的FGSA方法通常需要特定于任务的架构和广泛的注释数据，从而限制了它们的概括和可扩展性。为了应对这些挑战，我们提出了PL-FGSA，这是一种使用Mindspore平台实施的统一及时学习框架，该框架将及时设计与轻量级的TextCNN骨干集整合在一起。我们的方法将FGSA重新定义为一个多任务提示的一代问题，共同解决统一范式中的方面提取，情感分类和因果解释。通过利用基于及时的指导，PL-FGSA在全数据和低资源条件下增强了可解释性和实现强劲的性能。在三个基准数据集-SST-2，Semeval-2014任务4和MAMS表示的实验表明，我们的模型始终优于传统的微调方法，并分别达到0.922、0.694和0.597的F1-SCORES。这些结果证明了基于迅速的概括的有效性，并突出了PL-FGSA对现实世界情感分析任务的实际价值。

Title: The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models

Authors: Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14172
Pdf URL: https://arxiv.org/pdf/2505.14172
Copy Paste: [[2505.14172]] The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models(https://arxiv.org/abs/2505.14172)
Keywords: language model, llm
Abstract: Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
摘要：尽管在各个领域之间取得了显着的进展，但大型语言模型（LLM）仍在简单的字符级任务（例如基本限制）上始终失败，例如在单词中计数字母：标记化。在这项工作中，我们将这一限制构图为低相互信息的问题，并以概念出现的方式进行分析。使用在受控环境中隔离角色级别推理的19个合成任务的套件，我们表明这种功能逐渐出现，突然突然出现，并且仅在训练后期出现。我们进一步表明，基于渗透的概念出现模型解释了这些模式，这表明学习特征组成与学习常识知识没有根本不同。为了解决这一瓶颈，我们提出了一种轻巧的体系结构修改，可显着改善角色级别的推理，同时保留子词模型的电感优势。我们的结果共同弥合了令牌化的LMS中低级感知差距，并提供了理解和减轻其结构盲点的原则框架。我们公开提供代码。

Title: Cheaper, Better, Faster, Stronger: Robust Text-to-SQL without Chain-of-Thought or Fine-Tuning

Authors: Yusuf Denizay Dönder, Derek Hommel, Andrea W Wen-Yi, David Mimno, Unso Eun Seo Jo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14174
Pdf URL: https://arxiv.org/pdf/2505.14174
Copy Paste: [[2505.14174]] Cheaper, Better, Faster, Stronger: Robust Text-to-SQL without Chain-of-Thought or Fine-Tuning(https://arxiv.org/abs/2505.14174)
Keywords: llm, chain-of-thought
Abstract: LLMs are effective at code generation tasks like text-to-SQL, but is it worth the cost? Many state-of-the-art approaches use non-task-specific LLM techniques including Chain-of-Thought (CoT), self-consistency, and fine-tuning. These methods can be costly at inference time, sometimes requiring over a hundred LLM calls with reasoning, incurring average costs of up to $0.46 per query, while fine-tuning models can cost thousands of dollars. We introduce "N-rep" consistency, a more cost-efficient text-to-SQL approach that achieves BIRD benchmark scores similar to those of other, more expensive methods, at only $0.039 per query. N-rep leverages multiple representations of the same schema input to mitigate weaknesses in any single representation, making the solution more robust and allowing the use of smaller and cheaper models without any reasoning or fine-tuning. To our knowledge, N-rep is the best-performing text-to-SQL approach in its cost range.
摘要：LLM在代码生成任务（例如文本到SQL）方面有效，但值得成本吗？许多最先进的方法都使用非任务特异性LLM技术，包括思想链（COT），自我持续性和微调。这些方法在推理时间可能是昂贵的，有时需要推理的一百多个LLM电话，每查询的平均成本高达0.46美元，而微调模型可能会花费数千美元。我们介绍了“ N-REP”一致性，这是一种更具成本效益的文本到SQL方法，可在每个查询中获得与其他更昂贵的方法相似的鸟基准分数，仅\ $ 0.039。 N-REP利用相同架构输入的多个表示，以减轻任何单个表示中的弱点，从而使解决方案更强大，并允许使用较小且更便宜的模型，而无需任何推理或微调。据我们所知，N-REP是其成本范围内表现最好的文本到SQL方法。
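
Illustrative sketch (not from the paper): the abstract says N-rep uses multiple representations of the same schema, but it does not say how the per-representation outputs are combined. The Python below assumes a simple majority vote over execution results, borrowed from the self-consistency idea. The names `n_rep_vote` and `generate_and_run` are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def n_rep_vote(
    question: str,
    schema_views: Sequence[str],
    generate_and_run: Callable[[str, str], Optional[Hashable]],
) -> Optional[Hashable]:
    """Return the most common execution result across schema representations.

    `generate_and_run` is a hypothetical callback: it prompts a model with one
    schema representation, executes the SQL it produces, and returns a hashable
    result (or None if generation or execution failed).
    """
    results = [generate_and_run(question, view) for view in schema_views]
    valid = [r for r in results if r is not None]
    if not valid:
        return None
    # Counter.most_common keeps first-seen order among ties, so earlier
    # representations win when two results are equally frequent.
    return Counter(valid).most_common(1)[0][0]
```

Voting on execution results rather than on SQL strings treats syntactically different but equivalent queries as agreeing; that is a design choice of this sketch, not a claim about the authors' method.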

Title: Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits

Authors: Xiang Zhang, Juntai Cao, Jiaqi Wei, Yiwei Xu, Chenyu You
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14178
Pdf URL: https://arxiv.org/pdf/2505.14178
Copy Paste: [[2505.14178]] Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits(https://arxiv.org/abs/2505.14178)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Tokenization is the first - and often underappreciated - layer of computation in language models. While Chain-of-Thought (CoT) prompting enables transformer models to approximate recurrent computation by externalizing intermediate steps, we show that the success of such reasoning is fundamentally bounded by the structure of tokenized inputs. This work presents a theoretical and empirical investigation into how tokenization schemes, particularly subword-based methods like byte-pair encoding (BPE), impede symbolic computation by merging or obscuring atomic reasoning units. We introduce the notion of Token Awareness to formalize how poor token granularity disrupts logical alignment and prevents models from generalizing symbolic procedures. Through systematic evaluation on arithmetic and symbolic tasks, we demonstrate that token structure dramatically affects reasoning performance, causing failure even with CoT, while atomically-aligned formats unlock strong generalization, allowing small models (e.g., GPT-4o-mini) to outperform larger systems (e.g., o1) in structured reasoning. Our findings reveal that symbolic reasoning ability in LLMs is not purely architectural, but deeply conditioned on token-level representations.
摘要：令牌化是在语言模型中的第一个（并且通常不被调用）计算层。虽然对链链（COT）提示使变压器模型通过外部化中间步骤近似地计算，但我们表明，这种推理的成功在根本上是受令牌化输入的结构的限制。这项工作提出了一个理论和实证研究，介绍了如何通过合并或模糊的原子推理单元合并或模糊的原子推理来阻止符号计算，尤其是基于子词的方法，尤其是基于子词的方法。我们介绍了令牌意识的概念，以形式化较差的令牌粒度如何破坏逻辑对准，并阻止模型概括符号程序。通过对算术和符号任务的系统评估，我们证明令牌结构极大地影响了推理性能，即使在COT中也会导致失败，而原子对齐的格式则解锁了强有力的强度，从而使小型模型（例如GPT-4O-MINI）在结构化的理性中超过了较大的大型系统（例如，O1）。我们的发现表明，LLMS中的符号推理能力不是纯粹的建筑，而是在令牌级的表示上。
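
Illustrative sketch (not from the paper): a toy example of how merged subword tokens can hide the atomic units a symbolic task needs, and how a spaced-out input exposes them. The longest-match tokenizer and `TOY_VOCAB` are hypothetical stand-ins; real BPE applies learned merge rules and will segment differently.

```python
from typing import List, Sequence


def greedy_tokenize(text: str, vocab: Sequence[str]) -> List[str]:
    """Toy longest-match tokenizer that falls back to single characters."""
    # Ignore empty vocabulary entries so every step consumes at least one char.
    by_length = sorted((v for v in vocab if v), key=len, reverse=True)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = next((v for v in by_length if text.startswith(v, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


TOY_VOCAB = ["12", "345"]  # hypothetical merges, for illustration only

# Digits merged into multi-character tokens: individual digits are not visible.
merged = greedy_tokenize("12345", TOY_VOCAB)

# Separating the digits keeps each one as its own token.
atomic = [t for t in greedy_tokenize(" ".join("12345"), TOY_VOCAB) if t != " "]
```

Here `merged` is `["12", "345"]` while `atomic` is one token per digit, which is the kind of alignment between tokens and reasoning units that the abstract calls atomically-aligned.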

Title: SlangDIT: Benchmarking LLMs in Interpretative Slang Translation

Authors: Yunlong Liang, Fandong Meng, Jiaan Wang, Jie Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14181
Pdf URL: https://arxiv.org/pdf/2505.14181
Copy Paste: [[2505.14181]] SlangDIT: Benchmarking LLMs in Interpretative Slang Translation(https://arxiv.org/abs/2505.14181)
Keywords: language model, llm
Abstract: The challenge of slang translation lies in capturing context-dependent semantic extensions, as slang terms often convey meanings beyond their literal interpretation. While slang detection, explanation, and translation have been studied as isolated tasks in the era of large language models (LLMs), their intrinsic interdependence remains underexplored. The main reason is the lack of a benchmark in which the first two tasks serve as prerequisites for the third, which can facilitate idiomatic translation. In this paper, we introduce the interpretative slang translation task (named SlangDIT) consisting of three sub-tasks: slang detection, cross-lingual slang explanation, and slang translation within the current context, aiming to generate more accurate translations with the help of slang detection and slang explanation. To this end, we construct a SlangDIT dataset containing over 25k English-Chinese sentence pairs. Each source sentence mentions at least one slang term and is labeled with the corresponding cross-lingual slang explanation. Based on the benchmark, we propose a deep thinking model, named SlangOWL. It first identifies whether the sentence contains slang, then judges whether the slang is polysemous and analyzes its possible meanings. Further, SlangOWL provides the best explanation of the slang term for the current context. Finally, based on the whole thought process, SlangOWL offers a suitable translation. Our experiments on LLMs (e.g., Qwen2.5 and Llama-3.1) show that our deep thinking approach indeed enhances the performance of LLMs, with the proposed SlangOWL significantly surpassing both the vanilla models and supervised fine-tuned models without thinking.
摘要：语翻译的挑战在于捕获与上下文相关的语义扩展，因为语术语通常传达的含义超出了他们的字面解释。尽管在大型语言模型（LLMS）时代已经将语检测，解释和翻译作为孤立的任务进行了研究，但它们的内在相互依赖性仍未得到充实。主要原因是缺乏基准，在这两个任务可以成为第三个任务的先决条件，这可以促进惯用的翻译。在本文中，我们介绍了由三个子任务组成的解释性语翻译任务（命名为slangdit）：s语检测，跨语言s语解释和s语翻译在当前上下文中，旨在借助语检测和s语解释来产生更准确的翻译。为此，我们构建了一个Slangdit数据集，该数据集包含超过25K英语的句子对。每个源句子都提到至少一个语术语，并标有相应的跨语性语解释。基于基准，我们提出了一个名为Slangowl的深思熟虑模型。它首先确定该句子是否包含lang语，然后判断语是否是多义的，并分析其可能的含义。此外，slang夫尔提供了在当前上下文中对s语术语定位的最佳解释。最后，根据整个想法，slang夫提供了合适的翻译。我们对LLMS（\ Emph {e.g。}，Qwen2.5和Llama-3.1）进行的实验表明，我们的深思熟虑方法确实增强了LLM的性能，而建议的slangowl显着超过了香草模型，并在没有思考的情况下超过了精心调整的模型。

Title: ThinkSwitcher: When to Think Hard, When to Think Fast

Authors: Guosheng Liang, Longguang Zhong, Ziyi Yang, Xiaojun Quan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14183
Pdf URL: https://arxiv.org/pdf/2505.14183
Copy Paste: [[2505.14183]] ThinkSwitcher: When to Think Hard, When to Think Fast(https://arxiv.org/abs/2505.14183)
Keywords: prompt, chain-of-thought
Abstract: Large reasoning models (LRMs) excel at solving complex tasks by leveraging long chain-of-thought (CoT) reasoning. However, this often leads to overthinking on simple tasks, resulting in unnecessary computational overhead. We observe that LRMs inherently possess the capability for efficient short CoT reasoning, which can be reliably elicited through prompt design. To leverage this capability, we propose ThinkSwitcher, a framework that enables a single LRM to dynamically switch between short and long CoT modes based on task complexity. ThinkSwitcher introduces a lightweight switching module trained with supervision signals derived from the relative performance of each reasoning mode across tasks. Experiments on multiple reasoning benchmarks show that ThinkSwitcher reduces computational cost by 20-30% while maintaining high accuracy on complex tasks. This demonstrates the effectiveness of ThinkSwitcher as a scalable and efficient solution for unified LRM deployment.
摘要：大型推理模型（LRMS）通过利用长期的经过思考（COT）推理来解决复杂的任务。但是，这通常会导致对简单任务的思考，从而导致不必要的计算开销。我们观察到，LRM固有地具有有效的短COT推理的能力，可以通过及时设计可靠地引起该作用。为了利用这种功能，我们提出了ThinkSwitcher，该框架使单个LRM能够根据任务复杂性在短床和长COT模式之间动态切换。 ThinkSwitcher引入了一个轻巧的切换模块，该模块训练有跨任务的每个推理模式的相对性能的监督信号。在多个推理基准上进行的实验表明，ThinkSwitcher将计算成本降低了20-30％，同时在复杂的任务上保持了高精度。这证明了ThinkSwitcher作为统一LRM部署的可扩展有效解决方案的有效性。
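
Illustrative sketch (not from the paper): the abstract describes a switching module trained on the relative performance of each reasoning mode, but it does not give the decision rule. The Python below assumes the module outputs a predicted pass rate per mode and that long CoT is chosen only when its predicted gain exceeds a margin. `ModeScores`, `choose_mode`, and the default `margin` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModeScores:
    """Hypothetical switcher outputs: predicted pass rate for each mode."""
    short_cot: float
    long_cot: float


def choose_mode(scores: ModeScores, margin: float = 0.05) -> str:
    """Pick long CoT only when its predicted gain over short CoT exceeds `margin`."""
    return "long" if scores.long_cot - scores.short_cot > margin else "short"
```

Under this assumption, raising `margin` trades accuracy on borderline queries for lower compute, since more queries fall back to the short mode.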

Title: Unraveling Interwoven Roles of Large Language Models in Authorship Privacy: Obfuscation, Mimicking, and Verification

Authors: Tuc Nguyen, Yifan Hu, Thai Le
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14195
Pdf URL: https://arxiv.org/pdf/2505.14195
Copy Paste: [[2505.14195]] Unraveling Interwoven Roles of Large Language Models in Authorship Privacy: Obfuscation, Mimicking, and Verification(https://arxiv.org/abs/2505.14195)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have been fueled by large-scale training corpora drawn from diverse sources such as websites, news articles, and books. These datasets often contain explicit user information, such as person names and addresses, that LLMs may unintentionally reproduce in their generated outputs. Beyond such explicit content, LLMs can also leak identity-revealing cues through implicit signals such as distinctive writing styles, raising significant concerns about authorship privacy. There are three major automated tasks in authorship privacy, namely authorship obfuscation (AO), authorship mimicking (AM), and authorship verification (AV). Prior research has studied AO, AM, and AV independently. However, their interplay remains underexplored, which leaves a major research gap, especially in the era of LLMs, which are profoundly shaping how we curate and share user-generated content and increasingly blurring the distinction between machine-generated and human-authored text. This work presents the first unified framework for analyzing the dynamic relationships among LLM-enabled AO, AM, and AV in the context of authorship privacy. We quantify how they interact with each other to transform human-authored text, examining effects at a single point in time and iteratively over time. We also examine the role of demographic metadata, such as gender and academic background, in modulating their performance, inter-task dynamics, and privacy risks. All source code will be publicly available.
摘要：大规模培训语料库的最新进步（LLMS）从网站，新闻文章和书籍等各种来源中汲取了大规模培训语料库的推动。这些数据集通常包含明确的用户信息，例如人的名称和地址，LLMS可能会在其生成的输出中无意中复制。除了这种明确的内容之外，LLM还可以通过隐性信号（例如独特的写作方式）泄漏身份，从而揭示了线索，从而引起了对作者隐私的重大关注。作者隐私有三个主要的自动化任务，分别是作者身份混淆（AO），模仿（AM）和作者身份验证（AV）。先前的研究已经独立研究了AO，AM和AV。但是，他们的相互作用仍在探索中，这给了一个重大的研究差距，尤其是在LLMS时代，他们深刻地塑造了我们如何策划和共享用户生成的内容，并且机器生成和人类作者的文本之间的区别也越来越模糊。然后，这项工作介绍了第一个统一的框架，用于分析在作者隐私的背景下启用AO，AM和AV的LLM之间的动态关系。我们量化了它们如何相互作用以转化人类作者的文本，并随着时间的推移在单个时间点检查效果。我们还研究了人口元数据的作用，例如性别，学术背景，调节其表现，任务跨态动力和隐私风险。所有源代码将公开可用。

Title: Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks

Authors: Sizhe Yuen, Ting Su, Ziyang Wang, Yali Du, Adam J. Sobey
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14212
Pdf URL: https://arxiv.org/pdf/2505.14212
Copy Paste: [[2505.14212]] Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks(https://arxiv.org/abs/2505.14212)
Keywords: language model, llm, retrieval-augmented generation
Abstract: A question-answering (QA) system searches for suitable answers within a knowledge base. Current QA systems struggle with queries requiring complex reasoning or real-time knowledge integration. They are often supplemented with retrieval techniques on a data source, such as Retrieval-Augmented Generation (RAG). However, RAG continues to face challenges in handling complex reasoning and logical connections between multiple sources of information. A novel approach for enhancing Large Language Models (LLMs) in knowledge-intensive QA tasks is presented through the automated generation of context-based QA pairs. This methodology leverages LLMs to create fine-tuning data, reducing reliance on human labelling and improving model comprehension and reasoning capabilities. The proposed system includes an automated QA generator and a model fine-tuner, evaluated using perplexity, ROUGE, BLEU, and BERTScore. Comprehensive experiments demonstrate improvements in logical coherence and factual accuracy, with implications for developing adaptable Artificial Intelligence (AI) systems. Mistral-7b-v0.3 outperforms Llama-3-8b, with BERT F1, BLEU, and ROUGE scores of 0.858, 0.172, and 0.260 for the LLM-generated QA pairs, compared to scores of 0.836, 0.083, and 0.139 for the human-annotated QA pairs.
摘要：提问（QA）系统是在知识库中搜索合适的答案。当前的质量保证系统与需要复杂推理或实时知识集成的查询斗争。他们通常会在数据源上补充检索技术，例如检索型发电（RAG）。但是，RAG在处理复杂的推理和多个信息来源之间的逻辑联系方面继续面临挑战。通过基于上下文的QA对，提出了一种增强知识密集型QA任务中大型语言模型（LLM）的新型方法。该方法利用LLM来创建微调数据，减少对人类标签的依赖并改善模型的理解和推理能力。所提出的系统包括一个自动化的QA发生器和模型微型调节器，并使用困惑，胭脂，BLEU和BERTSCORE评估。全面的实验表明，逻辑相干性和事实准确性的改善，对开发适应能力的人工智能（AI）系统的影响。 MISTRAL-7B-V0.3优于Bert F1，Bleu和Rouge的Llama-3-8b，而LLM产生的QA对的得分为0.858、0.172和0.260的QA对，相比之下，与人类无用的QA Pairs的评分为0.836、0.083，而0.836、0.083和0.139。

Title: "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

Authors: Darpan Aswal, Siddharth D Jaiswal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14226
Pdf URL: https://arxiv.org/pdf/2505.14226
Copy Paste: [[2505.14226]] "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs(https://arxiv.org/abs/2505.14226)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have become increasingly powerful, with multilingual and multimodal capabilities improving by the day. These models are being evaluated through audits, alignment studies and red-teaming efforts to expose model vulnerabilities towards generating harmful, biased and unfair content. Existing red-teaming efforts have previously focused on the English language, using fixed template-based attacks; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially in the multimodal context. In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks. We also introduce two new jailbreak strategies that show higher effectiveness than baseline strategies. Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts. Our novel prompts achieve a 99% Attack Success Rate for text generation and 78% for image generation, with Attack Relevance Rate of 100% for text generation and 95% for image generation when using the phonetically perturbed code-mixed prompts. Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success. Our study motivates increasing the focus towards more generalizable safety alignment for multilingual multimodal models, especially in real-world settings wherein prompts can have misspelt words.
摘要：大型语言模型（LLMS）变得越来越强大，多语言和多模式的功能每天都在改善。这些模型正在通过审核，对齐研究和红色团队的努力进行评估，以暴露模型漏洞，以产生有害，有偏见和不公平的内容。现有的红色团队工作以前已经使用基于模板的攻击了英语；因此，模型继续容易受到多语言越狱策略的影响，尤其是在多模式背景下。在这项研究中，我们介绍了一种新颖的策略，该策略利用代码混合和语音扰动来越狱的LLM，用于文本和图像生成任务。我们还引入了两种新的越狱策略，这些策略比基线策略更高。我们的工作提出了一种方法，可以有效地绕过LLM中的安全过滤器，同时通过在代码混合提示中应用语音拼写错误来维持可解释性。我们的小说提示为文本生成实现了99％的攻击成功率，图像生成的攻击成功率为78％，使用语音扰动的代码混合提示时，文本生成的攻击相关率为100％，图像生成的攻击率为95％。我们的可解释性实验表明，语音扰动会影响单词令牌化，从而导致越狱成功。我们的研究促使人们将注意力集中在多语言多模型模型的更概括的安全对准方面，尤其是在现实环境中，提示可能拼写错误的单词。

Title: Mechanistic Fine-tuning for In-context Learning

Authors: Hakaze Cho, Peng Luo, Mariko Kato, Rin Kaenbyou, Naoya Inoue
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14233
Pdf URL: https://arxiv.org/pdf/2505.14233
Copy Paste: [[2505.14233]] Mechanistic Fine-tuning for In-context Learning(https://arxiv.org/abs/2505.14233)
Keywords: language model
Abstract: In-context Learning (ICL) utilizes structured demonstration-query inputs to induce few-shot learning on Language Models (LMs), which are not originally pre-trained on ICL-style data. To bridge the gap between ICL and pre-training, some approaches fine-tune LMs on large ICL-style datasets through an end-to-end paradigm at massive computational cost. To reduce such costs, in this paper we propose Attention Behavior Fine-Tuning (ABFT), which draws on previous findings about the inner mechanism of ICL and builds training objectives on the attention scores instead of the final outputs, forcing the attention scores to focus on the correct label tokens presented in the context and mitigating attention scores from the wrong label tokens. Our experiments on 9 modern LMs and 8 datasets empirically find that ABFT outperforms previous methods in performance, robustness, unbiasedness, and efficiency, at only around 0.01% of their data cost. Moreover, our subsequent analysis finds that the end-to-end training objective contains the ABFT objective, suggesting an implicit bias of ICL-style data toward the emergence of induction heads. Our work demonstrates the possibility of controlling specific module sequences within LMs to improve their behavior, opening up the future application of mechanistic interpretability.
摘要：内部文化学习（ICL）利用结构化的演示 - 问题输入来诱导语言模型（LMS）很少的学习，这些学习最初不是在ICL式数据上预先训练的。为了弥合ICL和预训练之间的差距，某些方法通过端到端范式进行了大量的ICL风格数据集中的微调LMS，并具有巨大的计算成本。为了降低这种成本，在本文中，我们提出了注意行为微调（ABFT），利用了先前关于ICL内部机制的发现，在注意力评分上而不是最终输出上建立训练目标，以迫使注意力评分专注于上下文中所显示的正确标签标记，并减轻来自错误的标签标签的注意力评分。我们在9个现代LM和8个数据集上进行的实验在经验上发现，与以前的方法相比，ABFT的表现优于性能，鲁棒性，无偏见和效率，数据成本仅为0.01％。此外，我们随后的分析发现，端到端训练目标包含ABFT目标，这表明ICL式数据的隐含偏见与归纳负责人的出现。我们的工作证明了控制LMS内特定模块序列以改善其行为的可能性，从而开放了未来的机械解释性应用。

Title: ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models

Authors: Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14238
Pdf URL: https://arxiv.org/pdf/2505.14238
Copy Paste: [[2505.14238]] ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models(https://arxiv.org/abs/2505.14238)
Keywords: language model
Abstract: Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget. We formally analyze ABBA's expressive capacity and validate its advantages through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: this https URL.
摘要：大型语言模型在各种任务中都表现出了很强的表现，但是将它们有效地适应新领域仍然是一个关键挑战。参数有效的微调（PEFT）方法通过引入轻质，可训练的模块来解决此问题，同时保持大多数预训练的权重固定。流行的方法洛拉（Lora）使用低级分解模型更新，但其表现力固有地受到等级的约束。 HIRA之类的最新方法旨在通过将Hadamard产品与冷冻权重结合，但仍依赖于预训练的模型的结构来提高表现力。我们介绍了ABBA，这是一种新的PEFT架构，将更新重新聚集为两个独立学习的低级矩阵的Hadamard产品。与先前的工作相反，ABBA完全将更新与预先训练的权重分开，从而使两个组件都能自由优化。在相同的参数预算下，这会导致明显更高的表达性。我们正式分析了ABBA的表现能力，并通过基质重建实验验证其优势。从经验上讲，ABBA在算术和常识性推理基准方面取得了最新的结果，在多个模型之间始终优于现有的PEFT方法。我们的代码可公开可用：此HTTPS URL。
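
Illustrative sketch (not from the paper): the three update forms named in the abstract, written side by side. The symbols are assumptions for illustration: $W_0$ is the frozen pre-trained weight, $B, B_i \in \mathbb{R}^{m \times r_i}$ and $A, A_i \in \mathbb{R}^{r_i \times n}$ are trainable low-rank factors, and $\odot$ is the element-wise (Hadamard) product. Any scaling factors the methods may use are omitted.

```latex
\begin{align*}
\text{LoRA:} \quad & \Delta W = B A,
  & \operatorname{rank}(\Delta W) &\le r \\
\text{HiRA:} \quad & \Delta W = W_0 \odot (B A) \\
\text{ABBA:} \quad & \Delta W = (B_1 A_1) \odot (B_2 A_2),
  & \operatorname{rank}(\Delta W) &\le r_1 r_2
\end{align*}
```

The ABBA bound follows from the general fact that $\operatorname{rank}(X \odot Y) \le \operatorname{rank}(X)\,\operatorname{rank}(Y)$. This is one way to see how a Hadamard product of two low-rank factors can reach a higher rank than a single low-rank product.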

Title: TransBench: Benchmarking Machine Translation for Industrial-Scale Applications

Authors: Haijun Li, Tianqi Shi, Zifu Shang, Yuxuan Han, Xueyu Zhao, Hao Wang, Yu Qian, Zhiqiang Qian, Linlong Xu, Minghao Wu, Chenyang Lyu, Longyue Wang, Gongbo Tang, Weihua Luo, Zhao Xu, Kaifu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14244
Pdf URL: https://arxiv.org/pdf/2505.14244
Copy Paste: [[2505.14244]] TransBench: Benchmarking Machine Translation for Industrial-Scale Applications(https://arxiv.org/abs/2505.14244)
Keywords: language model, llm
Abstract: Machine translation (MT) has become indispensable for cross-border communication in globalized industries like e-commerce, finance, and legal services, with recent advancements in large language models (LLMs) significantly enhancing translation quality. However, applying general-purpose MT models to industrial scenarios reveals critical limitations due to domain-specific terminology, cultural nuances, and stylistic conventions absent in generic benchmarks. Existing evaluation frameworks inadequately assess performance in specialized contexts, creating a gap between academic benchmarks and real-world efficacy. To address this, we propose a three-level translation capability framework: (1) Basic Linguistic Competence, (2) Domain-Specific Proficiency, and (3) Cultural Adaptation, emphasizing the need for holistic evaluation across these dimensions. We introduce TransBench, a benchmark tailored for industrial MT, initially targeting international e-commerce with 17,000 professionally translated sentences spanning 4 main scenarios and 33 language pairs. TransBench integrates traditional metrics (BLEU, TER) with Marco-MOS, a domain-specific evaluation model, and provides guidelines for reproducible benchmark construction. Our contributions include: (1) a structured framework for industrial MT evaluation, (2) the first publicly available benchmark for e-commerce translation, (3) novel metrics probing multi-level translation quality, and (4) open-sourced evaluation tools. This work bridges the evaluation gap, enabling researchers and practitioners to systematically assess and enhance MT systems for industry-specific needs.
摘要：机器翻译（MT）对于电子商务，金融和法律服务等全球化行业中的跨境通信来说是必不可少的，并且最新的大语模型（LLMS）的进步显着提高了翻译质量。然而，将通用MT模型应用于工业场景，揭示了由于特定于域特异性术语，文化差异和风格惯例而引起的临界局限性。现有的评估框架在专业环境中评估绩效不足，从而在学术基准和现实世界效力之间差距。为了解决这个问题，我们提出了一个三级翻译能力框架：（1）基本的语言能力，（2）领域特定的熟练度和（3）文化适应，强调了跨这些维度进行整体评估的需求。我们介绍了Transbench，这是一种针对工业MT量身定制的基准，最初针对国际电子商务，其中17,000个专业翻译的句子涵盖了4个主要方案和33个语言对。 TransBench将传统指标（BLEU，TER）与特定领域的评估模型Marco-Mos集成，并为可重复的基准结构提供了指南。我们的贡献包括：（1）用于工业MT评估的结构化框架，（2）电子商务翻译的第一个公开可用基准，（3）新型指标探测多层翻译质量，以及（4）开源评估工具。这项工作弥合了评估差距，使研究人员和从业人员能够系统地评估和增强MT系统，以满足行业特定的需求。

Title: FuxiMT: Sparsifying Large Language Models for Chinese-Centric Multilingual Machine Translation

Authors: Shaolin Zhu, Tianyu Dong, Bo Li, Deyi Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14256
Pdf URL: https://arxiv.org/pdf/2505.14256
Copy Paste: [[2505.14256]] FuxiMT: Sparsifying Large Language Models for Chinese-Centric Multilingual Machine Translation(https://arxiv.org/abs/2505.14256)
Keywords: language model, llm
Abstract: In this paper, we present FuxiMT, a novel Chinese-centric multilingual machine translation model powered by a sparsified large language model (LLM). We adopt a two-stage strategy to train FuxiMT. We first pre-train the model on a massive Chinese corpus and then conduct multilingual fine-tuning on a large parallel dataset encompassing 65 languages. FuxiMT incorporates Mixture-of-Experts (MoEs) and employs a curriculum learning strategy for robust performance across various resource levels. Experimental results demonstrate that FuxiMT significantly outperforms strong baselines, including state-of-the-art LLMs and machine translation models, particularly under low-resource scenarios. Furthermore, FuxiMT exhibits remarkable zero-shot translation capabilities for unseen language pairs, indicating its potential to bridge communication gaps where parallel data are scarce or unavailable.
摘要：在本文中，我们提出了Fuximt，这是一种新型的中文以中国为中心的多语言机器翻译模型，该模型由稀疏的大语言模型（LLM）提供动力。我们采用两阶段的策略来训练fuximt。我们首先将模型预先培训在大规模的中国语料库上，然后在包含65种语言的大型平行数据集上进行多语言微调。 Fuximt结合了专家的混合物（MOES），并采用课程学习策略来在各种资源水平上进行稳健性能。实验结果表明，Fuximt明显胜过强大的基准，包括最先进的LLM和机器翻译模型，尤其是在低资源场景下。此外，Fuximt对看不见的语言对表现出了显着的零击翻译功能，这表明其潜力弥合了平行数据稀缺或不可用的沟通差距。

Title: Think-J: Learning to Think for Generative LLM-as-a-Judge

Authors: Hui Huang, Yancheng He, Hongli Zhou, Rui Zhang, Wei Liu, Weixun Wang, Wenbo Su, Bo Zheng, Jiaheng Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14268
Pdf URL: https://arxiv.org/pdf/2505.14268
Copy Paste: [[2505.14268]] Think-J: Learning to Think for Generative LLM-as-a-Judge(https://arxiv.org/abs/2505.14268)
Keywords: language model, llm
Abstract: LLM-as-a-Judge refers to the automatic modeling of preferences for responses generated by Large Language Models (LLMs), which is of significant importance for both LLM evaluation and reward modeling. Although generative LLMs have made substantial progress in various tasks, their performance as LLM-Judge still falls short of expectations. In this work, we propose Think-J, which improves generative LLM-as-a-Judge by learning how to think. We first utilize a small amount of curated data to develop the model with initial judgment thinking capabilities. Subsequently, we optimize the judgment thinking traces based on reinforcement learning (RL). We propose two methods for judgment thinking optimization, based on offline and online RL, respectively. The offline method requires training a critic model to construct positive and negative examples for learning. The online method defines a rule-based reward as feedback for optimization. Experimental results show that our approach can significantly enhance the evaluation capability of generative LLM-Judge, surpassing both generative and classifier-based LLM-Judge without requiring extra human annotations.
摘要：LLM-as-a-Judge是指对大语言模型（LLM）生成的响应进行偏好的自动建模，这对于LLM评估和奖励建模都至关重要。尽管生成式LLM在各种任务上取得了长足的进步，但它们作为LLM-Judge的表现仍未达到预期。在这项工作中，我们提出了Think-J，它通过学习如何思考来改进生成式LLM-as-a-Judge。我们首先利用少量精选数据来开发具有初始判断思维能力的模型。随后，我们基于强化学习（RL）优化判断思维轨迹。我们分别基于离线和在线RL提出了两种判断思维优化方法。离线方法需要训练一个评论家模型来构建用于学习的正例和负例。在线方法将基于规则的奖励定义为优化的反馈。实验结果表明，我们的方法可以显著增强生成式LLM-Judge的评估能力，超过基于生成和基于分类器的LLM-Judge，而无需额外的人工标注。

Title: YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering

Authors: Jennifer D'Souza, Hamed Babaei Giglou, Quentin Münch
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14279
Pdf URL: https://arxiv.org/pdf/2505.14279
Copy Paste: [[2505.14279]] YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering(https://arxiv.org/abs/2505.14279)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry and artificial general intelligence.
摘要：大型语言模型（LLMS）推动了对现代搜索引擎的科学提问，但他们的评估鲁棒性仍然没有得到充实。我们介绍了YesCieval，这是一个开源框架，将基于细分的标题评估与强化学习结合在一起，以减轻LLM评估者的乐观偏见。我们发布了来自多个LLM的评估得分，包括对抗性变体，包括对抗性变体。与专有模型和人类反馈无关，我们的方法可以实现可扩展的，无成本的评估。通过推进可靠的LLM-AS-A-A-Gudge模型，这项工作支持AI的一致性，并促进了对科学探究和人工通用情报必不可少的强大，透明评估。

Title: Universal Acoustic Adversarial Attacks for Flexible Control of Speech-LLMs

Authors: Rao Ma, Mengjie Qian, Vyas Raina, Mark Gales, Kate Knill
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.14286
Pdf URL: https://arxiv.org/pdf/2505.14286
Copy Paste: [[2505.14286]] Universal Acoustic Adversarial Attacks for Flexible Control of Speech-LLMs(https://arxiv.org/abs/2505.14286)
Keywords: language model, llm, prompt
Abstract: The combination of pre-trained speech encoders with large language models has enabled the development of speech LLMs that can handle a wide range of spoken language processing tasks. While these models are powerful and flexible, this very flexibility may make them more vulnerable to adversarial attacks. To examine the extent of this problem, in this work we investigate universal acoustic adversarial attacks on speech LLMs. Here a fixed, universal, adversarial audio segment is prepended to the original input audio. We initially investigate attacks that cause the model to either produce no output or to perform a modified task overriding the original prompt. We then extend the nature of the attack to be selective so that it activates only when specific input attributes, such as a speaker gender or spoken language, are present. Inputs without the targeted attribute should be unaffected, allowing fine-grained control over the model outputs. Our findings reveal critical vulnerabilities in Qwen2-Audio and Granite-Speech and suggest that similar speech LLMs may be susceptible to universal adversarial attacks. This highlights the need for more robust training strategies and improved resistance to adversarial attacks.
摘要：预先训练的语音编码器与大型语言模型的结合使语音LLM的发展可以处理各种口头语言处理任务。尽管这些模型强大而灵活，但这种非常灵活的性能可能会使它们更容易受到对抗性攻击的影响。为了研究这个问题的程度，在这项工作中，我们研究了对语音LLM的普遍声学对抗攻击。在这里，将固定的通用，对抗性音频段添加到原始输入音频。我们最初调查导致模型不产生输出或执行修改后的任务的攻击覆盖了原始提示。然后，我们将攻击的性质扩展为选择性，以便仅当存在特定的输入属性（例如说话者性别或口语）时才激活。没有目标属性的输入不受影响，从而可以对模型输出进行细粒度的控制。我们的发现揭示了Qwen2-Audio和Granite语音中的关键脆弱性，并表明类似的语音LLM可能容易受到普遍的对抗攻击的影响。这凸显了需要更强大的训练策略，并提高了对对抗性攻击的抵抗力。

Title: Cross-Lingual Optimization for Language Transfer in Large Language Models

Authors: Jungseob Lee, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14297
Pdf URL: https://arxiv.org/pdf/2505.14297
Copy Paste: [[2505.14297]] Cross-Lingual Optimization for Language Transfer in Large Language Models(https://arxiv.org/abs/2505.14297)
Keywords: language model, llm
Abstract: Adapting large language models to other languages typically employs supervised fine-tuning (SFT) as a standard approach. However, it often suffers from an overemphasis on English performance, a phenomenon that is especially pronounced in data-constrained environments. To overcome these challenges, we propose Cross-Lingual Optimization (CLO), which efficiently transfers an English-centric LLM to a target language while preserving its English capabilities. CLO utilizes publicly available English SFT data and a translation model to enable cross-lingual transfer. We conduct experiments using five models on six languages, each possessing varying levels of resource. Our results show that CLO consistently outperforms SFT in both acquiring target language proficiency and maintaining English performance. Remarkably, in low-resource languages, CLO with only 3,200 samples surpasses SFT with 6,400 samples, demonstrating that CLO can achieve better performance with less data. Furthermore, we find that SFT is particularly sensitive to data quantity in medium and low-resource languages, whereas CLO remains robust. Our comprehensive analysis emphasizes the limitations of SFT and incorporates additional training strategies in CLO to enhance efficiency.
摘要：将大型语言模型调整为其他语言通常采用监督的微调（SFT）作为标准方法。但是，它通常受到对英语表现的过分强调，这一现象在数据受限的环境中特别明显。为了克服这些挑战，我们提出\ textbf {跨语性优化（clo）}，该挑战将以英语为中心的LLM有效地转移到目标语言的同时，同时保留其英语能力。 CLO利用公开可用的英语SFT数据和翻译模型来实现跨语性转移。我们使用六种语言进行五个模型进行实验，每种都具有不同水平的资源。我们的结果表明，CLO在获得目标语言水平和维持英语表现方面始终优于SFT。值得注意的是，在低资源语言中，只有3200个样本的CLO超过了SFT，其中有6,400个样本，表明CLO可以通过更少的数据来实现更好的性能。此外，我们发现SFT对中和低资源语言中的数据数量特别敏感，而CLO仍然坚固。我们的全面分析强调了SFT的局限性，并在CLO中纳入了其他培训策略，以提高效率。

Title: JOLT-SQL: Joint Loss Tuning of Text-to-SQL with Confusion-aware Noisy Schema Sampling

Authors: Jinwang Song, Hongying Zan, Kunli Zhang, Lingling Mu, Yingjie Han, Haobo Hua, Min Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14305
Pdf URL: https://arxiv.org/pdf/2505.14305
Copy Paste: [[2505.14305]] JOLT-SQL: Joint Loss Tuning of Text-to-SQL with Confusion-aware Noisy Schema Sampling(https://arxiv.org/abs/2505.14305)
Keywords: language model, llm, prompt
Abstract: Text-to-SQL, which maps natural language to SQL queries, has benefited greatly from recent advances in Large Language Models (LLMs). While LLMs offer various paradigms for this task, including prompting and supervised fine-tuning (SFT), SFT approaches still face challenges such as complex multi-stage pipelines and poor robustness to noisy schema information. To address these limitations, we present JOLT-SQL, a streamlined single-stage SFT framework that jointly optimizes schema linking and SQL generation via a unified loss. JOLT-SQL employs discriminative schema linking, enhanced by local bidirectional attention, alongside a confusion-aware noisy schema sampling strategy with selective attention to improve robustness under noisy schema conditions. Experiments on the Spider and BIRD benchmarks demonstrate that JOLT-SQL achieves state-of-the-art execution accuracy among comparable-size open-source models, while significantly improving both training and inference efficiency.
摘要：将自然语言映射到SQL查询的文本到SQL从最近的大型语言模型（LLMS）的最新进展中受益匪浅。尽管LLM为这项任务提供了各种范式，包括提示和监督微调（SFT），但SFT方法仍然面临着诸如复杂的多阶段管道和嘈杂架构信息的鲁棒性之类的挑战。为了解决这些局限性，我们提出了一种简化的单级SFT框架Jolt-SQL，可以通过统一损失共同优化模式链接和SQL生成。 JOLT-SQL采用歧视性架构连接，通过局部双向关注增强，以及令人困惑的嘈杂架构抽样策略，具有选择性的注意，以在嘈杂的模式条件下提高鲁棒性。蜘蛛和鸟类基准的实验表明，Jolt-SQL在相当大小的开源模型之间达到了最先进的执行精度，同时显着提高了训练和推理效率。

Title: Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency

Authors: Ehsan Doostmohammadi, Marco Kuhlmann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14309
Pdf URL: https://arxiv.org/pdf/2505.14309
Copy Paste: [[2505.14309]] Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency(https://arxiv.org/abs/2505.14309)
Keywords: language model
Abstract: Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between query and retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query-context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but substantially improves test-time perplexity and accelerates model learning above a critical threshold. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We specifically generate synthetic context through paraphrasing queries. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.

Title: HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing

Authors: Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Idris Abdulmumin, Falalu Ibrahim Lawan, Babangida Sani, Sukairaj Hafiz Imam, Yusuf Aliyu, Sani Abdullahi Sani, Ali Usman Umar, Kenneth Church, Vukosi Marivate
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14311
Pdf URL: https://arxiv.org/pdf/2505.14311
Copy Paste: [[2505.14311]] HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing(https://arxiv.org/abs/2505.14311)
Keywords: language model, llm
Abstract: Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet remains understudied as a low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges, including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP (this https URL), a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.

Title: A MIND for Reasoning: Meta-learning for In-context Deduction

Authors: Leonardo Bertolazzi, Manuel Vargas Guzmán, Raffaella Bernardi, Maciej Malicki, Jakub Szymanik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14313
Pdf URL: https://arxiv.org/pdf/2505.14313
Copy Paste: [[2505.14313]] A MIND for Reasoning: Meta-learning for In-context Deduction(https://arxiv.org/abs/2505.14313)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly evaluated on formal tasks, where strong reasoning abilities define the state of the art. However, their ability to generalize to out-of-distribution problems remains limited. In this paper, we investigate how LLMs can achieve a systematic understanding of deductive rules. Our focus is on the task of identifying the appropriate subset of premises within a knowledge base needed to derive a given hypothesis. To tackle this challenge, we propose Meta-learning for In-context Deduction (MIND), a novel few-shot meta-learning fine-tuning approach. The goal of MIND is to enable models to generalize more effectively to unseen knowledge bases and to systematically apply inference rules. Our results show that MIND significantly improves generalization in small LMs ranging from 1.5B to 7B parameters. The benefits are especially pronounced in smaller models and low-data settings. Remarkably, small models fine-tuned with MIND outperform state-of-the-art LLMs, such as GPT-4o and o3-mini, on this task.

Title: QA-prompting: Improving Summarization with Large Language Models using Question-Answering

Authors: Neelabh Sinha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14347
Pdf URL: https://arxiv.org/pdf/2505.14347
Copy Paste: [[2505.14347]] QA-prompting: Improving Summarization with Large Language Models using Question-Answering(https://arxiv.org/abs/2505.14347)
Keywords: language model, prompt
Abstract: Language Models (LMs) have revolutionized natural language processing, enabling high-quality text generation through prompting and in-context learning. However, models often struggle with long-context summarization due to positional biases, leading to suboptimal extraction of critical information. Existing remedies rely on fine-tuning, pipelining, or other complex techniques, each of which has its own challenges. To solve these challenges, we propose QA-prompting, a simple prompting method for summarization that utilizes question-answering as an intermediate step prior to summary generation. Our method extracts key information and enriches the context of text to mitigate positional biases and improve summarization in a single LM call per task without requiring fine-tuning or pipelining. Experiments on multiple datasets belonging to different domains using ten state-of-the-art pre-trained models demonstrate that QA-prompting outperforms baseline and other state-of-the-art methods, achieving up to 29% improvement in ROUGE scores. This provides an effective and scalable solution for summarization and highlights the importance of domain-specific question selection for optimal performance.
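As a rough illustration of the idea described in the abstract above (question-answering as an intermediate step inside a single LM call), the sketch below assembles one prompt that asks for short answers first and a summary second. The function name, the prompt wording, and the example questions are hypothetical and are not taken from the paper.

```python
def build_qa_prompt(document: str, questions: list[str]) -> str:
    """Assemble one prompt: answer the questions first, then summarize."""
    # Number the questions so the model can answer them one by one.
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Read the text below.\n\n"
        f"Text:\n{document}\n\n"
        "First, answer each question briefly:\n"
        f"{numbered}\n\n"
        "Then, using your answers, write a concise summary of the text."
    )


# Hypothetical domain-specific questions for a short news item.
example_prompt = build_qa_prompt(
    "The city council approved a new transit budget on Monday.",
    ["Who is the main actor?", "What happened?", "When did it happen?"],
)
```

The resulting string would be sent to a model in one call; which questions to ask is a per-domain choice, which the abstract identifies as important for performance.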

Title: OSoRA: Output-Dimension and Singular-Value Initialized Low-Rank Adaptation

Authors: Jialong Han, Si Zhang, Ke Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14350
Pdf URL: https://arxiv.org/pdf/2505.14350
Copy Paste: [[2505.14350]] OSoRA: Output-Dimension and Singular-Value Initialized Low-Rank Adaptation(https://arxiv.org/abs/2505.14350)
Keywords: language model, llm
Abstract: Fine-tuning Large Language Models (LLMs) has become increasingly challenging due to their massive scale and associated computational costs. Parameter-Efficient Fine-Tuning (PEFT) methodologies have been proposed as computational alternatives; however, their implementations still require significant resources. In this paper, we present OSoRA (Output-Dimension and Singular-Value Initialized Low-Rank Adaptation), a novel PEFT method for LLMs. OSoRA extends Low-Rank Adaptation (LoRA) by integrating Singular Value Decomposition (SVD) with learnable scaling vectors in a unified framework. It first performs an SVD of pre-trained weight matrices, then optimizes an output-dimension vector during training, while keeping the corresponding singular vector matrices frozen. OSoRA substantially reduces computational resource requirements by minimizing the number of trainable parameters during fine-tuning. Comprehensive evaluations across mathematical reasoning, common sense reasoning, and other benchmarks demonstrate that OSoRA achieves comparable or superior performance to state-of-the-art methods like LoRA and VeRA, while maintaining a linear parameter scaling even as the rank increases to higher dimensions. Our ablation studies further confirm that jointly training both the singular values and the output-dimension vector is critical for optimal performance.
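The abstract above describes an SVD of the pre-trained weights, frozen singular vector matrices, and a small set of trainable vectors. The NumPy sketch below is one possible reading of that description, not the authors' formulation: the residual split, the element-wise scaling by an output-dimension vector, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 4, 2
W = rng.standard_normal((d_out, d_in))  # stand-in for a pre-trained weight matrix

# SVD of the pre-trained weight; the singular vector matrices stay frozen.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r, Vt_r = U[:, :rank], Vt[:rank, :]

# Frozen residual outside the top-`rank` subspace (assumed split).
W_res = W - U_r @ np.diag(s[:rank]) @ Vt_r

# Trainable pieces (assumed layout): singular values and an output-dimension vector.
s_train = s[:rank].copy()   # initialized from the SVD
o_train = np.ones(d_out)    # initialized to ones, so training starts at W


def adapted_weight(o, s_vals):
    """Recombine frozen factors with the trainable vectors."""
    low_rank = U_r @ np.diag(s_vals) @ Vt_r
    return W_res + o[:, None] * low_rank  # scale each output row


# Trainable parameter count grows linearly with the rank: d_out + rank.
n_trainable = o_train.size + s_train.size
```

Under these assumptions the adapted weight equals the pre-trained weight at initialization, and only `d_out + rank` numbers are trained per matrix, which is consistent with the linear parameter scaling mentioned in the abstract.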

Title: WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications

Authors: Xin Li, Mengbing Liu, Li Wei, Jiancheng An, Mérouane Debbah, Chau Yuen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14354
Pdf URL: https://arxiv.org/pdf/2505.14354
Copy Paste: [[2505.14354]] WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications(https://arxiv.org/abs/2505.14354)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved impressive results across a broad array of tasks, yet their capacity for complex, domain-specific mathematical reasoning, particularly in wireless communications, remains underexplored. In this work, we introduce WirelessMathBench, a novel benchmark specifically designed to evaluate LLMs on mathematical modeling challenges in wireless communications engineering. Our benchmark consists of 587 meticulously curated questions sourced from 40 state-of-the-art research papers, encompassing a diverse spectrum of tasks ranging from basic multiple-choice questions to complex equation completion tasks, including both partial and full completions, all of which rigorously adhere to physical and dimensional constraints. Through extensive experimentation with leading LLMs, we observe that while many models excel in basic recall tasks, their performance degrades significantly when reconstructing partially or fully obscured equations, exposing fundamental limitations in current LLMs. Even DeepSeek-R1, the best performer on our benchmark, achieves an average accuracy of only 38.05%, with a mere 7.83% success rate in full equation completion. By publicly releasing WirelessMathBench along with the evaluation toolkit, we aim to advance the development of more robust, domain-aware LLMs for wireless system analysis and broader engineering applications.

Title: Dual Decomposition of Weights and Singular Value Low Rank Adaptation

Authors: Jialong Han, Si Zhang, Ke Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14367
Pdf URL: https://arxiv.org/pdf/2505.14367
Copy Paste: [[2505.14367]] Dual Decomposition of Weights and Singular Value Low Rank Adaptation(https://arxiv.org/abs/2505.14367)
Keywords: language model, llm
Abstract: Parameter-Efficient Fine-Tuning (PEFT) has emerged as a critical paradigm for adapting Large Language Models (LLMs) to downstream tasks, among which Low-rank Adaptation (LoRA) represents one of the most widely adopted methodologies. However, existing LoRA-based approaches exhibit two fundamental limitations: unstable training dynamics and inefficient knowledge transfer from pre-trained models, both stemming from random initialization of adapter parameters. To overcome these challenges, we propose DuDe, a novel approach that decomposes weight matrices into magnitude and direction components, employing Singular Value Decomposition (SVD) for principled initialization. Our comprehensive evaluation demonstrates DuDe's superior performance and robustness, achieving up to 48.35% accuracy on MMLU and 62.53% (± 1.59) accuracy on GSM8K. Our theoretical analysis and empirical validation collectively demonstrate that DuDe's decomposition strategy enhances optimization stability and better preserves pre-trained representations, particularly for domain-specific tasks requiring specialized knowledge. The combination of robust empirical performance and rigorous theoretical foundations establishes DuDe as a significant contribution to PEFT methodologies for LLMs.

Title: AutoRev: Automatic Peer Review System for Academic Research Papers

Authors: Maitreya Prafulla Chitale, Ketaki Mangesh Shetye, Harshit Gupta, Manav Chaudhary, Vasudeva Varma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14376
Pdf URL: https://arxiv.org/pdf/2505.14376
Copy Paste: [[2505.14376]] AutoRev: Automatic Peer Review System for Academic Research Papers(https://arxiv.org/abs/2505.14376)
Keywords: language model, llm
Abstract: Generating a review for an academic research paper is a complex task that requires a deep understanding of the document's content and the interdependencies between its sections. It demands not only insight into technical details but also an appreciation of the paper's overall coherence and structure. Recent methods have predominantly focused on fine-tuning large language models (LLMs) to address this challenge. However, they often overlook the computational and performance limitations imposed by long input token lengths. To address this, we introduce AutoRev, an Automatic Peer Review System for Academic Research Papers. Our novel framework represents an academic document as a graph, enabling the extraction of the most critical passages that contribute significantly to the review. This graph-based approach demonstrates effectiveness for review generation and is potentially adaptable to various downstream tasks, such as question answering, summarization, and document representation. When applied to review generation, our method outperforms SOTA baselines by an average of 58.72% across all evaluation metrics. We hope that our work will stimulate further research in applying graph-based extraction techniques to other downstream tasks in NLP. We plan to make our code public upon acceptance.

Title: Editing Across Languages: A Survey of Multilingual Knowledge Editing

Authors: Nadir Durrani, Basel Mousi, Fahim Dalvi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14393
Pdf URL: https://arxiv.org/pdf/2505.14393
Copy Paste: [[2505.14393]] Editing Across Languages: A Survey of Multilingual Knowledge Editing(https://arxiv.org/abs/2505.14393)
Keywords: llm
Abstract: While Knowledge Editing has been extensively studied in monolingual settings, it remains underexplored in multilingual contexts. This survey systematizes recent research on Multilingual Knowledge Editing (MKE), a growing subdomain of model editing focused on ensuring factual edits generalize reliably across languages. We present a comprehensive taxonomy of MKE methods, covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches. We survey available benchmarks, summarize key findings on method effectiveness and transfer patterns, identify challenges in cross-lingual propagation, and highlight open problems related to language anisotropy, evaluation coverage, and edit scalability. Our analysis consolidates a rapidly evolving area and lays the groundwork for future progress in editable language-aware LLMs.

Title: MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language

Authors: Seyoung Song, Seogyeong Jeong, Eunsu Kim, Jiho Jin, Dongkwan Kim, Jay Shin, Alice Oh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14395
Pdf URL: https://arxiv.org/pdf/2505.14395
Copy Paste: [[2505.14395]] MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language(https://arxiv.org/abs/2505.14395)
Keywords: language model, llm
Abstract: Evaluating text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs' multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs' accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. Then, we simply use task success rate as a proxy for successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks (r > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.
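The correlation with established benchmarks reported in the abstract above is presumably a Pearson coefficient, given the symbol r. For readers unfamiliar with the statistic, the sketch below computes it from scratch. The two score lists are invented toy values for illustration only and have no connection to the paper's results.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()  # center both series
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# Toy numbers (made up): per-model proxy success rates vs. benchmark scores.
proxy_scores = [0.42, 0.55, 0.61, 0.73, 0.80]
benchmark_scores = [0.40, 0.50, 0.66, 0.70, 0.85]

r = pearson_r(proxy_scores, benchmark_scores)
```

A value of `r` close to 1 would indicate that models ranked highly by the proxy also tend to score highly on the reference benchmark.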

Title: Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation

Authors: Peter Baile Chen, Yi Zhang, Dan Roth, Samuel Madden, Jacob Andreas, Michael Cafarella
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14398
Pdf URL: https://arxiv.org/pdf/2505.14398
Copy Paste: [[2505.14398]] Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation(https://arxiv.org/abs/2505.14398)
Keywords: language model, llm, agent
Abstract: While humans naturally learn and adapt from past experiences, large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks and apply it in future contexts. To address this limitation, we propose a novel framework, log-augmented generation (LAG), that directly reuses prior computation and reasoning from past logs at test time to enhance a model's ability to learn from previous tasks and perform better on new, unseen challenges, all while keeping the system efficient and scalable. Specifically, our system represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for only a selected subset of tokens. When a new task arises, LAG retrieves the KV values from relevant logs to augment generation. Our approach differs from reflection-based memory mechanisms by directly reusing prior reasoning and computations without requiring additional steps for knowledge extraction or distillation. Our method also goes beyond existing KV caching techniques, which primarily target efficiency gains rather than improving accuracy. Experiments on knowledge- and reasoning-intensive datasets demonstrate that our method significantly outperforms standard agentic systems that do not utilize logs, as well as existing solutions based on reflection and KV cache techniques.

Title: Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis

Authors: Haoming Huang, Yibo Yan, Jiahao Huo, Xin Zou, Xinfeng Li, Kun Wang, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14406
Pdf URL: https://arxiv.org/pdf/2505.14406
Copy Paste: [[2505.14406]] Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis(https://arxiv.org/abs/2505.14406)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs), despite their remarkable capabilities, are hampered by hallucinations. A particularly challenging variant, knowledge overshadowing, occurs when one piece of activated knowledge inadvertently masks another relevant piece, leading to erroneous outputs even with high-quality training data. Current understanding of overshadowing is largely confined to inference-time observations, lacking deep insights into its origins and internal mechanisms during model training. Therefore, we introduce PhantomCircuit, a novel framework designed to comprehensively analyze and detect knowledge overshadowing. By innovatively employing knowledge circuit analysis, PhantomCircuit dissects the internal workings of attention heads, tracing how competing knowledge pathways contribute to the overshadowing phenomenon and its evolution throughout the training process. Extensive experiments demonstrate PhantomCircuit's effectiveness in identifying such instances, offering novel insights into this elusive hallucination and providing the research community with a new methodological lens for its potential mitigation.

Title: Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents

Authors: Pengzhou Cheng, Haowen Hu, Zheng Wu, Zongru Wu, Tianjie Ju, Daizong Ding, Zhuosheng Zhang, Gongshen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14418
Pdf URL: https://arxiv.org/pdf/2505.14418
Copy Paste: [[2505.14418]] Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents(https://arxiv.org/abs/2505.14418)
Keywords: language model, llm, agent
Abstract: Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have shown great promise for human interaction. However, due to the high fine-tuning cost, users often rely on open-source GUI agents or APIs offered by AI providers, which introduces a critical but underexplored supply chain threat: backdoor attacks. In this work, we first reveal that MLLM-powered GUI agents naturally expose multiple interaction-level triggers, such as historical steps, environment states, and task progress. Based on this observation, we introduce AgentGhost, an effective and stealthy framework for red-teaming backdoor attacks. Specifically, we first construct composite triggers by combining goal and interaction levels, allowing GUI agents to unintentionally activate backdoors while ensuring task utility. Then, we formulate backdoor injection as a Min-Max optimization problem that uses supervised contrastive learning to maximize the feature difference across sample classes in the representation space, improving flexibility of the backdoor. Meanwhile, it adopts supervised fine-tuning to minimize the discrepancy between backdoor and clean behavior generation, enhancing effectiveness and utility. Extensive evaluations of various agent models in two established mobile benchmarks show that AgentGhost is effective and generic, with attack accuracy that reaches 99.7% on three attack objectives, and shows stealthiness with only 1% utility degradation. Furthermore, we tailor a defense method against AgentGhost that reduces the attack accuracy to 22.1%. Our code is available at anonymous.

Title: SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection

Authors: Huopu Zhang, Yanguang Liu, Mengnan Du
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14420
Pdf URL: https://arxiv.org/pdf/2505.14420
Copy Paste: [[2505.14420]] SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection(https://arxiv.org/abs/2505.14420)
Keywords: language model
Abstract: Predicting earnings surprises through the analysis of earnings conference call transcripts has attracted increasing attention from the financial research community. Conference calls serve as critical communication channels between company executives, analysts, and shareholders, offering valuable forward-looking information. However, these transcripts present significant analytical challenges, typically containing over 5,000 words with substantial redundancy and industry-specific terminology that creates obstacles for language models. In this work, we propose the Sparse Autoencoder for Financial Representation Enhancement (SAE-FiRE) framework to address these limitations by extracting key information while eliminating redundancy. SAE-FiRE employs Sparse Autoencoders (SAEs) to efficiently identify patterns and filter out noise, focusing specifically on capturing nuanced financial signals that have predictive power for earnings surprises. Experimental results indicate that the proposed method can significantly outperform the compared baselines.

Title: Scaling Low-Resource MT via Synthetic Data Generation with LLMs

Authors: Ona de Gibert, Joseph Attieh, Teemu Vahtola, Mikko Aulamo, Zihao Li, Raúl Vázquez, Tiancheng Hu, Jörg Tiedemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14423
Pdf URL: https://arxiv.org/pdf/2505.14423
Copy Paste: [[2505.14423]] Scaling Low-Resource MT via Synthetic Data Generation with LLMs(https://arxiv.org/abs/2505.14423)
Keywords: llm
Abstract: We investigate the potential of LLM-generated synthetic data for improving low-resource machine translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its high overall quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.

Title: From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLMs for Spatial Reasoning

Authors: Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14425
Pdf URL: https://arxiv.org/pdf/2505.14425
Copy Paste: [[2505.14425]] From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLMs for Spatial Reasoning(https://arxiv.org/abs/2505.14425)
Keywords: language model, llm
Abstract: Instruction-tuned large language models (LLMs) have shown strong performance on a variety of tasks; however, generalizing from synthetic to human-authored instructions in grounded environments remains a challenge for them. In this work, we study generalization challenges in spatial grounding tasks where models interpret and translate instructions for building object arrangements on a $2.5$D grid. We fine-tune LLMs using only synthetic instructions and evaluate their performance on a benchmark dataset containing both synthetic and human-written instructions. Our results reveal that while models generalize well on simple tasks, their performance degrades significantly on more complex tasks. We present a detailed error analysis of the gaps in instruction generalization.

Title: Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models

Authors: Yuqiao Tan, Shizhu He, Kang Liu, Jun Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14436
Pdf URL: https://arxiv.org/pdf/2505.14436
Copy Paste: [[2505.14436]] Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models(https://arxiv.org/abs/2505.14436)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) offer a transparent brain with accessible parameters that encode extensive knowledge, which can be analyzed, located and transferred. Consequently, a key research challenge is to transcend traditional knowledge transfer paradigms rooted in symbolic language and achieve genuine Parametric Knowledge Transfer (PKT). Significantly, exploring effective methods for transferring knowledge across LLMs of different scales through parameters presents an intriguing and valuable research direction. In this paper, we first demonstrate that Alignment in parametric space is the fundamental prerequisite to achieve successful cross-scale PKT. We redefine the previously explored knowledge transfer as Post-Align PKT (PostPKT), which utilizes extracted parameters for LoRA initialization and requires subsequent fine-tuning for alignment. Hence, to reduce the cost of further fine-tuning, we introduce a novel Pre-Align PKT (PrePKT) paradigm and propose a solution called LaTen (Locate-Then-Align) that aligns the parametric spaces of LLMs across scales using only a few training steps, without subsequent training. Comprehensive experiments on four benchmarks demonstrate that both PostPKT and PrePKT face challenges in achieving consistently stable transfer. Through in-depth analysis, we identify Neural Incompatibility as the ethological and parametric structural differences between LLMs of varying scales, presenting fundamental challenges to achieving effective PKT. These findings provide fresh insights into the parametric architectures of LLMs and highlight promising directions for future research on efficient PKT. Our code is available at this https URL.

Title: Creative Preference Optimization

Authors: Mete Ismayilzada, Antonio Laverghetta Jr., Simone A. Luchini, Reet Patel, Antoine Bosselut, Lonneke van der Plas, Roger Beaty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14442
Pdf URL: https://arxiv.org/pdf/2505.14442
Copy Paste: [[2505.14442]] Creative Preference Optimization(https://arxiv.org/abs/2505.14442)
Keywords: language model, gpt, llm
Abstract: While Large Language Models (LLMs) have demonstrated impressive performance across natural language generation tasks, their ability to generate truly creative content, characterized by novelty, diversity, surprise, and quality, remains limited. Existing methods for enhancing LLM creativity often focus narrowly on diversity or specific tasks, failing to address creativity's multifaceted nature in a generalizable way. In this work, we propose Creative Preference Optimization (CrPO), a novel alignment method that injects signals from multiple creativity dimensions into the preference optimization objective in a modular fashion. We train and evaluate creativity-augmented versions of several models using CrPO and MuCE, a new large-scale human preference dataset spanning over 200,000 human-generated responses and ratings from more than 30 psychological creativity assessments. Our models outperform strong baselines, including GPT-4o, on both automated and human evaluations, producing more novel, diverse, and surprising generations while maintaining high output quality. Additional evaluations on NoveltyBench further confirm the generalizability of our approach. Together, our results demonstrate that directly optimizing for creativity within preference frameworks is a promising direction for advancing the creative capabilities of LLMs without compromising output quality.

Title: CtrlDiff: Boosting Large Diffusion Language Models with Dynamic Block Prediction and Controllable Generation

Authors: Chihan Huang, Hao Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14455
Pdf URL: https://arxiv.org/pdf/2505.14455
Copy Paste: [[2505.14455]] CtrlDiff: Boosting Large Diffusion Language Models with Dynamic Block Prediction and Controllable Generation(https://arxiv.org/abs/2505.14455)
Keywords: language model
Abstract: Although autoregressive models have dominated language modeling in recent years, there has been a growing interest in exploring alternative paradigms to the conventional next-token prediction framework. Diffusion-based language models have emerged as a compelling alternative due to their powerful parallel generation capabilities and inherent editability. However, these models are often constrained by fixed-length generation. A promising direction is to combine the strengths of both paradigms, segmenting sequences into blocks, modeling autoregressive dependencies across blocks while leveraging discrete diffusion to estimate the conditional distribution within each block given the preceding context. Nevertheless, their practical application is often hindered by two key limitations: rigid fixed-length outputs and a lack of flexible control mechanisms. In this work, we address the critical limitations of fixed granularity and weak controllability in current large diffusion language models. We propose CtrlDiff, a dynamic and controllable semi-autoregressive framework that adaptively determines the size of each generation block based on local semantics using reinforcement learning. Furthermore, we introduce a classifier-guided control mechanism tailored to discrete diffusion, which significantly reduces computational overhead while facilitating efficient post-hoc conditioning without retraining. Extensive experiments demonstrate that CtrlDiff sets a new standard among hybrid diffusion models, narrows the performance gap to state-of-the-art autoregressive approaches, and enables effective conditional text generation across diverse tasks.

Title: Not All Correct Answers Are Equal: Why Your Distillation Source Matters

Authors: Xiaoyu Tian, Yunjie Ji, Haotian Wang, Shuaiting Chen, Sitong Zhao, Yiping Peng, Han Zhao, Xiangang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14464
Pdf URL: https://arxiv.org/pdf/2505.14464
Copy Paste: [[2505.14464]] Not All Correct Answers Are Equal: Why Your Distillation Source Matters(https://arxiv.org/abs/2505.14464)
Keywords: language model
Abstract: Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The AM-based model consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (this https URL) and AM-Qwen3-Distilled (this https URL).

Title: Void in Language Models

Authors: Mani Shemiranifar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14467
Pdf URL: https://arxiv.org/pdf/2505.14467
Copy Paste: [[2505.14467]] Void in Language Models(https://arxiv.org/abs/2505.14467)
Keywords: language model, prompt
Abstract: Despite advances in transformer-based language models (LMs), a fundamental question remains largely unanswered: Are all layers activated during inference? We investigate this question by detecting unactivated layers (which we refer to as Voids) using a non-trainable and parameter-free adaptive computation method called L2 Adaptive Computation (LAC). We adapt LAC from its original efficiency-focused application to trace activated layers during inference. This method monitors changes in the L2-norm of activations to identify voids. We analyze layer activation in instruction-tuned LMs across two phases: Prompt Processing (PP), where we trace activated layers for each token in the input prompts, and Response Generation (RG), where we trace activated layers for each generated token. We further demonstrate that distinct layers are activated during these two phases. To show the effectiveness of our method, we evaluated three distinct instruction-tuned LMs from the Llama, Mistral, and Qwen families on three benchmarks: MMLU, GPQA Diamond, and BoolQ. For example, on MMLU with a zero-shot setting, skipping voids in Qwen2.5-7B-Instruct resulted in an improvement from 69.24 to 71.29 while the model uses only 30% of the layers. Similarly, Mistral-7B-Instruct-v0.3 on GPQA Diamond improved from 13.88 to 18.36 when using 70% of the layers during both the PP and RG phases. These results show that not all layers contribute equally during inference, and that selectively skipping most of them can improve the performance of models on certain tasks.
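The abstract above says voids are found by monitoring changes in the L2-norm of activations. A minimal sketch of that idea follows. It is not the paper's LAC procedure: the relative-change rule, the `threshold` value, and the names `find_voids` and `states` are assumptions made for illustration only.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def find_voids(hidden_states, threshold=0.05):
    """Return indices of layers that barely change the activation norm.

    hidden_states[i] is the activation entering layer i and
    hidden_states[i + 1] is the activation leaving it.
    The relative-change criterion and the threshold are assumptions.
    """
    voids = []
    for i in range(len(hidden_states) - 1):
        before = l2_norm(hidden_states[i])
        after = l2_norm(hidden_states[i + 1])
        rel_change = abs(after - before) / max(before, 1e-12)
        if rel_change < threshold:
            voids.append(i)
    return voids


# Toy activations for one token passing through three layers.
states = [
    [3.0, 4.0],   # input to layer 0, norm 5.0
    [6.0, 8.0],   # output of layer 0, norm 10.0 (large change)
    [6.0, 8.1],   # output of layer 1, norm about 10.08 (tiny change)
    [0.0, 20.0],  # output of layer 2, norm 20.0 (large change)
]
void_layers = find_voids(states)
print(void_layers)
```

In this toy input only layer 1 falls under the threshold, so it is the only layer a skipping scheme of this kind would bypass.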

Title: Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations

Authors: Somnath Banerjee, Pratyush Chatterjee, Shanu Kumar, Sayan Layek, Parag Agrawal, Rima Hazra, Animesh Mukherjee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14469
Pdf URL: https://arxiv.org/pdf/2505.14469
Copy Paste: [[2505.14469]] Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations(https://arxiv.org/abs/2505.14469)
Keywords: language model, llm, prompt
Abstract: Recent advancements in LLMs have raised significant safety concerns, particularly when dealing with code-mixed inputs and outputs. Our study systematically investigates the increased susceptibility of LLMs to produce unsafe outputs from code-mixed prompts compared to monolingual English prompts. Utilizing explainability methods, we dissect the internal attribution shifts that cause the model's harmful behaviors. In addition, we explore cultural dimensions by distinguishing between universally unsafe and culturally-specific unsafe queries. This paper presents novel experimental insights, clarifying the mechanisms driving this phenomenon.

Title: Adapting Pretrained Language Models for Citation Classification via Self-Supervised Contrastive Learning

Authors: Tong Li, Jiachuan Wang, Yongqi Zhang, Shuangyin Li, Lei Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14471
Pdf URL: https://arxiv.org/pdf/2505.14471
Copy Paste: [[2505.14471]] Adapting Pretrained Language Models for Citation Classification via Self-Supervised Contrastive Learning(https://arxiv.org/abs/2505.14471)
Keywords: language model, llm, long context
Abstract: Citation classification, which identifies the intention behind academic citations, is pivotal for scholarly analysis. Previous works suggest fine-tuning pretrained language models (PLMs) on citation classification datasets, reaping the reward of the linguistic knowledge they gained during pretraining. However, directly fine-tuning for citation classification is challenging due to labeled data scarcity, contextual noise, and spurious keyphrase correlations. In this paper, we present a novel framework, Citss, that adapts the PLMs to overcome these challenges. Citss introduces self-supervised contrastive learning to alleviate data scarcity, and is equipped with two specialized strategies to obtain the contrastive pairs: sentence-level cropping, which enhances focus on target citations within long contexts, and keyphrase perturbation, which mitigates reliance on specific keyphrases. Compared with previous works that are only designed for encoder-based PLMs, Citss is carefully developed to be compatible with both encoder-based PLMs and decoder-based LLMs, to embrace the benefits of enlarged pretraining. Experiments with three benchmark datasets with both encoder-based PLMs and decoder-based LLMs demonstrate our superiority compared to the previous state of the art. Our code is available at: this http URL

Title: PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models

Authors: He Zhu, Junyou Su, Minxi Chen, Wen Wang, Yijie Deng, Guanhua Chen, Wenjia Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14481
Pdf URL: https://arxiv.org/pdf/2505.14481
Copy Paste: [[2505.14481]] PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models(https://arxiv.org/abs/2505.14481)
Keywords: language model, gpt, hallucination
Abstract: In the field of urban planning, existing Vision-Language Models (VLMs) frequently fail to effectively analyze and evaluate planning maps, despite the critical importance of these visual elements for urban planners and related educational contexts. Planning maps, which visualize land use, infrastructure layouts, and functional zoning, require specialized understanding of spatial configurations, regulatory requirements, and multi-scale analysis. To address this challenge, we introduce PlanGPT-VL, the first domain-specific Vision-Language Model tailored specifically for urban planning maps. PlanGPT-VL employs three innovative approaches: (1) PlanAnno-V framework for high-quality VQA data synthesis, (2) Critical Point Thinking to reduce hallucinations through structured verification, and (3) comprehensive training methodology combining Supervised Fine-Tuning with frozen vision encoder parameters. Through systematic evaluation on our proposed PlanBench-V benchmark, we demonstrate that PlanGPT-VL significantly outperforms general-purpose state-of-the-art VLMs in specialized planning map interpretation tasks, offering urban planning professionals a reliable tool for map analysis, assessment, and educational applications while maintaining high factual accuracy. Our lightweight 7B parameter model achieves comparable performance to models exceeding 72B parameters, demonstrating efficient domain specialization without sacrificing performance.

Title: MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance

Authors: Agam Goyal, Xianyang Zhan, Yilun Chen, Koustuv Saha, Eshwar Chandrasekharan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14483
Pdf URL: https://arxiv.org/pdf/2505.14483
Copy Paste: [[2505.14483]] MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance(https://arxiv.org/abs/2505.14483)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown great potential in flagging harmful content in online communities. Yet, existing approaches for moderation require a separate model for every community and are opaque in their decision-making, limiting real-world adoption. We introduce Mixture of Moderation Experts (MoMoE), a modular, cross-community framework that adds post-hoc explanations to scalable content moderation. MoMoE orchestrates four operators -- Allocate, Predict, Aggregate, Explain -- and is instantiated as seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1 scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned baselines while consistently producing concise and reliable explanations. Although community-specialized experts deliver the highest peak accuracy, norm-violation experts provide steadier performance across domains. These findings show that MoMoE yields scalable, transparent moderation without needing per-community fine-tuning. More broadly, they suggest that lightweight, explainable expert ensembles can guide future NLP and HCI research on trustworthy human-AI governance of online communities.
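The MoMoE results above are reported as Micro-F1 scores over 30 unseen subreddits. As a reminder of what that metric pools, here is a minimal sketch that sums true positives, false positives, and false negatives across groups before computing F1. The binary labels, the grouping by community, and the name `micro_f1` are assumptions for illustration; the abstract does not describe the paper's exact evaluation setup.

```python
def micro_f1(groups):
    """Micro-averaged F1 over (y_true, y_pred) pairs of binary label lists.

    Counts are pooled across all groups first, so large groups weigh
    more than small ones (unlike macro-averaging).
    """
    tp = fp = fn = 0
    for y_true, y_pred in groups:
        for t, p in zip(y_true, y_pred):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two hypothetical communities; 1 = comment violates a norm.
community_a = ([1, 0, 1, 1], [1, 0, 0, 1])  # 2 TP, 1 FN
community_b = ([0, 1], [1, 1])              # 1 TP, 1 FP
score = micro_f1([community_a, community_b])
print(score)
```

With 3 true positives, 1 false positive, and 1 false negative pooled, the score is 2*3 / (2*3 + 1 + 1) = 0.75.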

Title: Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales

Authors: Jun Cao, Jiyi Li, Ziwei Yang, Renjie Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14499
Pdf URL: https://arxiv.org/pdf/2505.14499
Copy Paste: [[2505.14499]] Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales(https://arxiv.org/abs/2505.14499)
Keywords: language model, llm
Abstract: There has been growing interest in Multimodal Aspect-Based Sentiment Analysis (MABSA) in recent years. Existing methods predominantly rely on pre-trained small language models (SLMs) to collect information related to aspects and sentiments from both image and text, with the aim of aligning these two modalities. However, SLMs possess limited capacity and knowledge, often resulting in inaccurate identification of meaning, aspects, sentiments, and their interconnections in textual and visual data. On the other hand, large language models (LLMs) have shown exceptional capabilities in various tasks by effectively exploring fine-grained information in multimodal data. However, some studies indicate that LLMs still fall short compared to fine-tuned small models in the field of ABSA. Based on these findings, we propose a novel framework, termed LRSA, which combines the decision-making capabilities of SLMs with additional information provided by LLMs for MABSA. Specifically, we inject explanations generated by LLMs as rationales into SLMs and employ a dual cross-attention mechanism for enhancing feature interaction and fusion, thereby augmenting the SLMs' ability to identify aspects and sentiments. We evaluated our method using two baseline models; numerous experiments highlight the superiority of our approach on three widely-used benchmarks, indicating its generalizability and applicability to most pre-trained models for MABSA.

Title: ModRWKV: Transformer Multimodality in Linear Time

Authors: Jiale Kang, Ziyin Yue, Qingyu Yin, Jiang Rui, Weile Li, Zening Lu, Zhouran Ji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14505
Pdf URL: https://arxiv.org/pdf/2505.14505
Copy Paste: [[2505.14505]] ModRWKV: Transformer Multimodality in Linear Time(https://arxiv.org/abs/2505.14505)
Keywords: language model, llm
Abstract: Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV, a decoupled multimodal framework built upon the RWKV7 architecture as its LLM backbone, which achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly accelerates multimodal training. Comparative experiments with different pretrained checkpoints further demonstrate that such initialization plays a crucial role in enhancing the model's ability to understand multimodal signals. Supported by extensive experiments, we conclude that modern RNN architectures present a viable alternative to Transformers in the domain of multimodal large language models (MLLMs). Furthermore, we identify the optimal configuration of the ModRWKV architecture through systematic exploration.

Title: Exploring Graph Representations of Logical Forms for Language Modeling

Authors: Michael Sullivan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14523
Pdf URL: https://arxiv.org/pdf/2505.14523
Copy Paste: [[2505.14523]] Exploring Graph Representations of Logical Forms for Language Modeling(https://arxiv.org/abs/2505.14523)
Keywords: language model
Abstract: We make the case for language models over logical forms (LFLMs), arguing that such models are more data-efficient than their textual counterparts. To that end, we introduce the Graph-based Formal-Logical Distributional Semantics (GFoLDS) prototype, a pretrained LM over graph representations of logical forms, as a proof-of-concept of LFLMs. Using GFoLDS, we present strong experimental evidence that LFLMs can leverage the built-in, basic linguistic knowledge inherent in such models to immediately begin learning more complex patterns. On downstream tasks, we show that GFoLDS vastly outperforms textual, transformer LMs pretrained on similar amounts of data, indicating that LFLMs can learn with substantially less data than models over plain text. Furthermore, we show that the performance of this model is likely to scale with additional parameters and pretraining data, suggesting the viability of LFLMs in real-world applications.

Title: Internal Chain-of-Thought: Empirical Evidence for Layer-wise Subtask Scheduling in LLMs

Authors: Zhipeng Yang, Junzhuo Li, Siyu Xia, Xuming Hu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14530
Pdf URL: https://arxiv.org/pdf/2505.14530
Copy Paste: [[2505.14530]] Internal Chain-of-Thought: Empirical Evidence for Layer-wise Subtask Scheduling in LLMs(https://arxiv.org/abs/2505.14530)
Keywords: language model, llm, chain-of-thought
Abstract: We show that large language models (LLMs) exhibit an internal chain-of-thought: they sequentially decompose and execute composite tasks layer-by-layer. Two claims ground our study: (i) distinct subtasks are learned at different network depths, and (ii) these subtasks are executed sequentially across layers. On a benchmark of 15 two-step composite tasks, we employ layer-from context-masking and propose a novel cross-task patching method, confirming (i). To examine claim (ii), we apply LogitLens to decode hidden states, revealing a consistent layerwise execution pattern. We further replicate our analysis on the real-world TRACE benchmark, observing the same stepwise dynamics. Together, our results enhance LLM transparency by showing their capacity to internally plan and execute subtasks (or instructions), opening avenues for fine-grained, instruction-level activation steering.
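The abstract above uses LogitLens to decode hidden states layer by layer. The core of that technique is to project an intermediate hidden state through the model's unembedding matrix and read off the top token. Below is a toy sketch of that projection. The two-dimensional states, the three-word vocabulary, and the matrix values are invented for illustration, and the final layer normalization that real implementations usually apply before unembedding is omitted.

```python
def logit_lens(hidden, unembed, vocab):
    """Decode one hidden state into its highest-scoring vocabulary token.

    unembed has one row per vocabulary entry; the logit for a token is
    the dot product of its row with the hidden state.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    best = max(range(len(logits)), key=logits.__getitem__)
    return vocab[best]


# Hypothetical vocabulary and unembedding matrix (toy values).
vocab = ["Paris", "France", "capital"]
unembed = [
    [1.0, 0.0],  # "Paris"
    [0.0, 1.0],  # "France"
    [0.5, 0.5],  # "capital"
]

# Hypothetical hidden states of the last token after two different layers.
layer_states = [
    [0.2, 0.9],  # earlier layer
    [0.9, 0.3],  # later layer
]
decoded = [logit_lens(h, unembed, vocab) for h in layer_states]
print(decoded)
```

In this toy case the earlier layer decodes to an intermediate answer ("France") and the later layer to the final one ("Paris"), which is the kind of stepwise pattern the abstract describes observing.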

Title: Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders

Authors: Agam Goyal, Vedant Rathi, William Yeh, Yian Wang, Yuen Chen, Hari Sundaram
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14536
Pdf URL: https://arxiv.org/pdf/2505.14536
Copy Paste: [[2505.14536]] Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders(https://arxiv.org/abs/2505.14536)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks. Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks. In this paper we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the residual stream of models and perform targeted activation steering using the corresponding decoder vectors. We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency. At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending on the aggressiveness. Crucially, standard NLP benchmark scores upon steering remain stable, indicating that the model's knowledge and general abilities are preserved. We further show that feature-splitting in wider SAEs hampers safety interventions, underscoring the importance of disentangled feature learning. Our findings highlight both the promise and the current limitations of SAE-based causal interventions for LLM detoxification, further suggesting practical guidelines for safer language-model deployment.
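The abstract above describes steering the residual stream with SAE decoder vectors at several levels of aggressiveness. One simple way to picture this is to remove a scaled share of the residual's component along a toxicity-related decoder direction. The sketch below does exactly that and nothing more. The projection-based update, the `strength` parameter, and the name `steer` are assumptions; the abstract does not define the paper's three tiers, and a common alternative is to add a fixed negative multiple of the decoder vector instead.

```python
def dot(a, b):
    """Dot product of two equal-length float lists."""
    return sum(x * y for x, y in zip(a, b))


def steer(residual, decoder_vec, strength):
    """Remove `strength` times the component of `residual` along `decoder_vec`.

    strength = 0.0 leaves the activation unchanged and strength = 1.0
    removes the component entirely (assumed update rule, for illustration).
    """
    norm_sq = dot(decoder_vec, decoder_vec)
    if norm_sq == 0:
        return list(residual)
    coeff = dot(residual, decoder_vec) / norm_sq
    return [r - strength * coeff * d for r, d in zip(residual, decoder_vec)]


# Hypothetical residual activation and toxicity-related decoder direction.
residual = [2.0, 1.0]
toxic_direction = [1.0, 0.0]

mild = steer(residual, toxic_direction, strength=0.5)
full = steer(residual, toxic_direction, strength=1.0)
print(mild, full)
```

Larger `strength` values remove more of the flagged direction, which mirrors the trade-off the abstract reports: stronger steering lowers toxicity further but can cost fluency.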

Title: KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Authors: Jiajun Shi, Jian Yang, Jiaheng Liu, Xingyuan Bu, Jiangjie Chen, Junting Zhou, Kaijing Ma, Zhoufutu Wen, Bingli Wang, Yancheng He, Liang Song, Hualei Zhu, Shilong Li, Xingjian Wang, Wei Zhang, Ruibin Yuan, Yifan Yao, Wenjun Yang, Yunli Wang, Siyuan Fang, Siyu Yuan, Qianyu He, Xiangru Tang, Yingshui Tan, Wangchunshu Zhou, Zhaoxiang Zhang, Zhoujun Li, Wenhao Huang, Ge Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14552
Pdf URL: https://arxiv.org/pdf/2505.14552
Copy Paste: [[2505.14552]] KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation(https://arxiv.org/abs/2505.14552)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.

Title: TRATES: Trait-Specific Rubric-Assisted Cross-Prompt Essay Scoring

Authors: Sohaila Eltanbouly, Salam Albatarni, Tamer Elsayed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14577
Pdf URL: https://arxiv.org/pdf/2505.14577
Copy Paste: [[2505.14577]] TRATES: Trait-Specific Rubric-Assisted Cross-Prompt Essay Scoring(https://arxiv.org/abs/2505.14577)
Keywords: language model, llm, prompt
Abstract: Research on holistic Automated Essay Scoring (AES) has a long history, yet little attention has been paid to assessing essays according to individual traits. In this work, we propose TRATES, a novel trait-specific and rubric-based cross-prompt AES framework that is generic yet specific to the underlying trait. The framework leverages a Large Language Model (LLM) that uses the trait grading rubrics to generate trait-specific features (represented by assessment questions), then assesses those features for a given essay. The trait-specific features are then combined with generic writing-quality and prompt-specific features to train a simple classical regression model that predicts trait scores of essays from an unseen prompt. Experiments show that TRATES achieves new state-of-the-art performance across all traits on a widely used dataset, with the generated LLM-based features being the most significant.

Title: Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning

Authors: Shangziqi Zhao, Jiahao Yuan, Guisong Yang, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14582
Pdf URL: https://arxiv.org/pdf/2505.14582
Copy Paste: [[2505.14582]] Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning(https://arxiv.org/abs/2505.14582)
Keywords: language model, llm, chain-of-thought
Abstract: Long chain-of-thought (Long-CoT) reasoning improves accuracy in LLMs, yet its verbose, self-reflective style often hinders effective distillation into small language models (SLMs). We revisit Long-CoT compression through the lens of capability alignment and ask: Can pruning improve reasoning? We propose Prune-on-Logic, a structure-aware framework that transforms Long-CoT into logic graphs and selectively prunes low-utility reasoning steps under self-verification constraints. Through systematic analysis across three pruning strategies -- targeting entire chains, core reasoning, and verification -- we find that pruning verification steps yields consistent accuracy gains while reducing inference cost, outperforming token-level baselines and uncompressed fine-tuning. In contrast, pruning reasoning or all-chain steps degrades performance, revealing that small models benefit not from shorter CoTs, but from semantically leaner ones. Our findings highlight pruning as a structural optimization strategy for aligning CoT reasoning with SLM capacity.
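The verification-pruning strategy mentioned in this abstract can be pictured with a small sketch. It is a simplification under stated assumptions: the paper operates on logic graphs with self-verification constraints, whereas this example uses a flat list of steps, and the `Step` type and its `role` labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    text: str
    role: str  # hypothetical labels: "reasoning" or "verification"


def prune_verification(chain):
    """Drop steps labeled as verification, keeping the core reasoning in order."""
    return [step for step in chain if step.role != "verification"]


chain = [
    Step("Let x be the unknown length.", "reasoning"),
    Step("Then 2x + 6 = 20, so x = 7.", "reasoning"),
    Step("Check: 2 * 7 + 6 = 20, which matches.", "verification"),
]
pruned = prune_verification(chain)
```

The pruned chain keeps every step needed to reach the answer and discards only the re-checking step, which mirrors the abstract's distinction between shorter chains and semantically leaner ones.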

Title: Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning

Authors: Wenbin Hu, Haoran Li, Huihao Jing, Qi Hu, Ziqian Zeng, Sirui Han, Heli Xu, Tianshu Chu, Peizhao Hu, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14585
Pdf URL: https://arxiv.org/pdf/2505.14585
Copy Paste: [[2505.14585]] Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning(https://arxiv.org/abs/2505.14585)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) exhibit remarkable capabilities, they also introduce significant safety and privacy risks. Current mitigation strategies often fail to preserve contextual reasoning capabilities in risky scenarios. Instead, they rely heavily on sensitive pattern matching to protect LLMs, which limits the scope. Furthermore, they overlook established safety and privacy standards, leading to systemic risks for legal compliance. To address these gaps, we formulate safety and privacy issues into contextualized compliance problems following the Contextual Integrity (CI) theory. Under the CI framework, we align our model with three critical regulatory standards: GDPR, EU AI Act, and HIPAA. Specifically, we employ reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning capabilities while enhancing compliance with safety and privacy norms. Through extensive experiments, we demonstrate that our method not only significantly enhances legal compliance (achieving a +17.64% accuracy improvement in safety/privacy benchmarks) but also further improves general reasoning capability. For OpenThinker-7B, a strong reasoning model that significantly outperforms its base model Qwen2.5-7B-Instruct across diverse subjects, our method enhances its general reasoning capabilities, with +2.05% and +8.98% accuracy improvement on the MMLU and LegalBench benchmark, respectively.

Title: MCIP: Protecting MCP Safety via Model Contextual Integrity Protocol

Authors: Huihao Jing, Haoran Li, Wenbin Hu, Qi Hu, Heli Xu, Tianshu Chu, Peizhao Hu, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14590
Pdf URL: https://arxiv.org/pdf/2505.14590
Copy Paste: [[2505.14590]] MCIP: Protecting MCP Safety via Model Contextual Integrity Protocol(https://arxiv.org/abs/2505.14590)
Keywords: llm
Abstract: As Model Context Protocol (MCP) introduces an easy-to-use ecosystem for users and developers, it also brings underexplored safety risks. Its decentralized architecture, which separates clients and servers, poses unique challenges for systematic safety analysis. This paper proposes a novel framework to enhance MCP safety. Guided by the MAESTRO framework, we first analyze the missing safety mechanisms in MCP, and based on this analysis, we propose the Model Contextual Integrity Protocol (MCIP), a refined version of MCP that addresses these this http URL, we develop a fine-grained taxonomy that captures a diverse range of unsafe behaviors observed in MCP scenarios. Building on this taxonomy, we develop benchmark and training data that support the evaluation and improvement of LLMs' capabilities in identifying safety risks within MCP interactions. Leveraging the proposed benchmark and training data, we conduct extensive experiments on state-of-the-art LLMs. The results highlight LLMs' vulnerabilities in MCP interactions and demonstrate that our approach substantially improves their safety performance.

Title: Success is in the Details: Evaluate and Enhance Details Sensitivity of Code LLMs through Counterfactuals

Authors: Xianzhen Luo, Qingfu Zhu, Zhiming Zhang, Mingzheng Xu, Tianhao Cheng, Yixuan Wang, Zheng Chu, Shijie Xuyang, Zhiyuan Ma, YuanTao Fan, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14597
Pdf URL: https://arxiv.org/pdf/2505.14597
Copy Paste: [[2505.14597]] Success is in the Details: Evaluate and Enhance Details Sensitivity of Code LLMs through Counterfactuals(https://arxiv.org/abs/2505.14597)
Keywords: llm
Abstract: Code sensitivity refers to the ability of Code LLMs to recognize and respond to changes in the details of problem descriptions. While current code benchmarks and instruction data focus on difficulty and diversity, sensitivity is overlooked. We first introduce the CTF-Code benchmark, constructed using counterfactual perturbations that minimize input changes while maximizing output changes. The evaluation shows that many LLMs suffer a performance drop of more than 10% compared to the original problems. To fully utilize sensitivity, CTF-Instruct, an incremental instruction fine-tuning framework, extends existing data and uses a selection mechanism to meet the three dimensions of difficulty, diversity, and sensitivity. Experiments show that LLMs fine-tuned with CTF-Instruct data achieve an improvement of over 2% on CTF-Code and a performance boost of more than 10% on LiveCodeBench, validating the feasibility of enhancing LLMs' sensitivity to improve performance.

Title: Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models

Authors: Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14599
Pdf URL: https://arxiv.org/pdf/2505.14599
Copy Paste: [[2505.14599]] Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models(https://arxiv.org/abs/2505.14599)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful biomedical hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at this https URL.

Title: sudoLLM : On Multi-role Alignment of Language Models

Authors: Soumadeep Saha, Akshay Chaturvedi, Joy Mahapatra, Utpal Garain
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2505.14607
Pdf URL: https://arxiv.org/pdf/2505.14607
Copy Paste: [[2505.14607]] sudoLLM : On Multi-role Alignment of Language Models(https://arxiv.org/abs/2505.14607)
Keywords: language model, llm, prompt
Abstract: User authorization-based access privileges are a key feature in many safety-critical systems, but have thus far been absent from the large language model (LLM) realm. In this work, drawing inspiration from such access control systems, we introduce sudoLLM, a novel framework that results in multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance with, user access rights. sudoLLM injects subtle user-based biases into queries and trains an LLM to utilize this bias signal in order to produce sensitive information if and only if the user is authorized. We present empirical results demonstrating that this approach shows substantially improved alignment, generalization, and resistance to prompt-based jailbreaking attacks. The persistent tension between the language modeling objective and safety alignment, which is often exploited to jailbreak LLMs, is somewhat resolved with the aid of the injected bias signal. Our framework is meant as an additional security layer, and complements existing guardrail mechanisms for enhanced end-to-end safety with LLMs.

Title: Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)

Authors: Rafael Rivera Soto, Barry Chen, Nicholas Andrews
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14608
Pdf URL: https://arxiv.org/pdf/2505.14608
Copy Paste: [[2505.14608]] Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)(https://arxiv.org/abs/2505.14608)
Keywords: language model
Abstract: Despite considerable progress in the development of machine-text detectors, it has been suggested that the problem is inherently hard, and therefore, that stakeholders should proceed under the assumption that machine-generated text cannot be reliably detected as such. We examine a recent such claim by Nicks et al. (2024) regarding the ease with which language models can be optimized to degrade the performance of machine-text detectors, including detectors not specifically optimized against. We identify a feature space (the stylistic feature space) that is robust to such optimization, and show that it may be used to reliably detect samples from language models optimized to prevent detection. Furthermore, we show that even when models are explicitly optimized against stylistic detectors, detection performance remains surprisingly unaffected. We then seek to understand if stylistic detectors are inherently more robust. To study this question, we explore a new paraphrasing approach that simultaneously aims to close the gap between human writing and machine writing in stylistic feature space while avoiding detection using traditional features. We show that when only a single sample is available for detection, this attack is universally effective across all detectors considered, including those that use writing style. However, as the number of samples available for detection grows, the human and machine distributions become distinguishable. This observation encourages us to introduce AURA, a metric that estimates the overlap between human and machine-generated distributions by analyzing how detector performance improves as more samples become available. Overall, our findings underscore previous recommendations to avoid reliance on machine-text detection.

Title: Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models

Authors: Sahar Abdelnabi, Ahmed Salem
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.14617
Pdf URL: https://arxiv.org/pdf/2505.14617
Copy Paste: [[2505.14617]] Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models(https://arxiv.org/abs/2505.14617)
Keywords: language model, llm, prompt
Abstract: Reasoning-focused large language models (LLMs) sometimes alter their behavior when they detect that they are being evaluated, an effect analogous to the Hawthorne phenomenon, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly its safety alignment. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-source reasoning LLMs across both realistic and hypothetical tasks. Our results demonstrate that test awareness significantly impacts safety alignment, and that the effect differs across models. By providing fine-grained control over this latent effect, our work aims to increase trust in how we perform safety evaluation.

Title: Think Only When You Need with Large Hybrid-Reasoning Models

Authors: Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14631
Pdf URL: https://arxiv.org/pdf/2505.14631
Copy Paste: [[2505.14631]] Think Only When You Need with Large Hybrid-Reasoning Models(https://arxiv.org/abs/2505.14631)
Keywords: language model, llm
Abstract: Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is particularly unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform thinking based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model's capability for hybrid thinking. Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type. It outperforms existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended thinking processes and provides a solid starting point for building hybrid thinking systems.

Title: General-Reasoner: Advancing LLM Reasoning Across All Domains

Authors: Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, Wenhu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14652
Pdf URL: https://arxiv.org/pdf/2505.14652
Copy Paste: [[2505.14652]] General-Reasoner: Advancing LLM Reasoning Across All Domains(https://arxiv.org/abs/2505.14652)
Keywords: language model, llm, chain-of-thought
Abstract: Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). In particular, the "Zero" reinforcement learning introduced by Deepseek-R1-Zero enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current work on LLM reasoning focuses mainly on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations and data is scarcer. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers curated by web crawling, covering a wide range of disciplines; and (2) developing a generative model-based answer verifier, which replaces traditional rule-based verification with the capability of chain-of-thought and context-awareness. We train a series of models and evaluate them on a wide range of datasets spanning domains such as physics, chemistry, finance, and electronics. Our comprehensive evaluation across these 12 benchmarks (e.g. MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH and MATH AMC) demonstrates that General-Reasoner outperforms existing baseline methods, achieving robust and generalizable reasoning performance while maintaining superior effectiveness in mathematical reasoning tasks.
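The abstract's point about diverse answer representations can be illustrated by how quickly rule-based verification runs out of rules. The sketch below is not the paper's verifier; the function names are hypothetical, and the two rules (exact string match and rational-number equivalence) were chosen only to show the limitation that a model-based verifier is meant to address.

```python
from fractions import Fraction


def exact_match(pred, gold):
    """Strictest rule: answers must be identical strings after trimming."""
    return pred.strip() == gold.strip()


def numeric_match(pred, gold):
    """Looser rule: answers match if both parse to the same rational number."""
    try:
        return Fraction(pred.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        # Unparseable answers (words, units, expressions) fall through as mismatches.
        return False


same_string = exact_match("0.5", "1/2")
same_number = numeric_match("0.5", "1/2")
same_words = numeric_match("one half", "1/2")
```

Exact matching rejects "0.5" against "1/2", the numeric rule accepts it, and neither rule recognizes "one half". Each new representation needs another hand-written rule, which is the gap a generative verifier is intended to close.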

Title: Reward Reasoning Model

Authors: Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14674
Pdf URL: https://arxiv.org/pdf/2505.14674
Copy Paste: [[2505.14674]] Reward Reasoning Model(https://arxiv.org/abs/2505.14674)
Keywords: language model, chain-of-thought
Abstract: Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to enhance reward model performance. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to execute a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs leverage additional test-time compute for complex queries where appropriate rewards are not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained reward reasoning models are available at this https URL.

Title: UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Large Language Models

Authors: Xiaojie Gu, Guangxu Chen, Jungang Li, Jia-Chen Gu, Xuming Hu, Kai Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14679
Pdf URL: https://arxiv.org/pdf/2505.14679
Copy Paste: [[2505.14679]] UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Large Language Models(https://arxiv.org/abs/2505.14679)
Keywords: language model, llm
Abstract: Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model's internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose ULTRAEDIT, a fundamentally new editing solution that is training-, subject-, and memory-free, making it particularly well-suited for ultra-scalable, real-world lifelong model editing. ULTRAEDIT performs editing through a self-contained process that relies solely on lightweight linear algebra operations to compute parameter shifts, enabling fast and consistent parameter modifications with minimal overhead. To improve scalability in lifelong settings, ULTRAEDIT employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. ULTRAEDIT achieves editing speeds over 7x faster than the previous state-of-the-art method (which was also the fastest known approach) while consuming less than 1/3 the VRAM, making it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct ULTRAEDITBENCH, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 1M edits while maintaining high accuracy. Comprehensive experiments on four datasets and six models show that ULTRAEDIT consistently achieves superior performance across diverse model editing scenarios. Our code is available at: this https URL.

Title: Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning

Authors: Haolei Xu, Yuchen Yan, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Shengpei Jiang, Kaitao Song, Weiming Lu, Jun Xiao, Yueting Zhuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14684
Pdf URL: https://arxiv.org/pdf/2505.14684
Copy Paste: [[2505.14684]] Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning(https://arxiv.org/abs/2505.14684)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) have achieved remarkable progress on mathematical tasks through Chain-of-Thought (CoT) reasoning. However, existing mathematical CoT datasets often suffer from Thought Leaps due to experts omitting intermediate steps, which negatively impacts model learning and generalization. We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate missing intermediate reasoning steps to restore the completeness and coherence of CoT. To facilitate this, we constructed a specialized training dataset called ScaleQM+, based on the structured ScaleQuestMath dataset, and trained CoT-Bridge to bridge thought leaps. Through comprehensive experiments on mathematical reasoning benchmarks, we demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets, with improvements of up to +5.87% on NuminaMath. Our approach effectively enhances distilled data (+3.02%) and provides better starting points for reinforcement learning (+3.1%), functioning as a plug-and-play module compatible with existing optimization techniques. Furthermore, CoT-Bridge demonstrates improved generalization to out-of-domain logical reasoning tasks, confirming that enhancing reasoning completeness yields broadly applicable benefits.

Title: Language Models use Lookbacks to Track Beliefs

Authors: Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14685
Pdf URL: https://arxiv.org/pdf/2505.14685
Copy Paste: [[2505.14685]] Language Models use Lookbacks to Track Beliefs(https://arxiv.org/abs/2505.14685)
Keywords: language model
Abstract: How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze Llama-3-70B-Instruct's ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset that consists of simple stories where two characters each separately change the state of two objects, potentially unaware of each other's actions. Our investigation uncovered a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating reference information about them, represented as their Ordering IDs (OIs) in low rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the corresponding state OI and then an answer lookback retrieves the state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into the LM's belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.