2025-05-20

Title: A Data Synthesis Method Driven by Large Language Models for Proactive Mining of Implicit User Intentions in Tourism

Authors: Jinqiang Wang, Huansheng Ning, Tao Zhu, Jianguo Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11533
Pdf URL: https://arxiv.org/pdf/2505.11533
Copy Paste: [[2505.11533]] A Data Synthesis Method Driven by Large Language Models for Proactive Mining of Implicit User Intentions in Tourism(https://arxiv.org/abs/2505.11533)
Keywords: language model, llm, agent
Abstract: In the tourism domain, Large Language Models (LLMs) often struggle to mine implicit user intentions from tourists' ambiguous inquiries and lack the capacity to proactively guide users toward clarifying their needs. A critical bottleneck is the scarcity of high-quality training datasets that facilitate proactive questioning and implicit intention mining. While recent advances leverage LLM-driven data synthesis to generate such datasets and transfer specialized knowledge to downstream models, existing approaches suffer from several shortcomings: (1) lack of adaptation to the tourism domain, (2) skewed distributions of detail levels in initial inquiries, (3) contextual redundancy in the implicit intention mining module, and (4) lack of explicit thinking about tourists' emotions and intention values. Therefore, we propose SynPT (A Data Synthesis Method Driven by LLMs for Proactive Mining of Implicit User Intentions in Tourism), which constructs an LLM-driven user agent and assistant agent to simulate dialogues based on seed data collected from Chinese tourism websites. This approach addresses the aforementioned limitations and generates SynPT-Dialog, a training dataset containing explicit reasoning. The dataset is utilized to fine-tune a general LLM, enabling it to proactively mine implicit user intentions. Experimental evaluations, conducted from both human and LLM perspectives, demonstrate the superiority of SynPT compared to existing methods. Furthermore, we analyze key hyperparameters and present case studies to illustrate the practical applicability of our method, including discussions on its adaptability to English-language scenarios. All code and data are publicly available.

Title: AI-generated Text Detection: A Multifaceted Approach to Binary and Multiclass Classification

Authors: Harika Abburi, Sanmitra Bhattacharya, Edward Bowen, Nirmala Pudota
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11550
Pdf URL: https://arxiv.org/pdf/2505.11550
Copy Paste: [[2505.11550]] AI-generated Text Detection: A Multifaceted Approach to Binary and Multiclass Classification(https://arxiv.org/abs/2505.11550)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing across a wide range of styles and genres. However, such capabilities are prone to potential misuse, such as fake news generation, spam email creation, and misuse in academic assignments. As a result, accurate detection of AI-generated text and identification of the model that generated it are crucial for maintaining the responsible use of LLMs. In this work, we addressed two sub-tasks put forward by the Defactify workshop under the AI-Generated Text Detection shared task at the Association for the Advancement of Artificial Intelligence (AAAI 2025): Task A involved distinguishing between human-authored and AI-generated text, while Task B focused on attributing text to its originating language model. For each task, we proposed two neural architectures: an optimized model and a simpler variant. For Task A, the optimized neural architecture achieved fifth place with an $F1$ score of 0.994, and for Task B, the simpler neural architecture also ranked fifth with an $F1$ score of 0.627.

Title: Assessing Collective Reasoning in Multi-Agent LLMs via Hidden Profile Tasks

Authors: Yuxuan Li, Aoi Naito, Hirokazu Shirado
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.11556
Pdf URL: https://arxiv.org/pdf/2505.11556
Copy Paste: [[2505.11556]] Assessing Collective Reasoning in Multi-Agent LLMs via Hidden Profile Tasks(https://arxiv.org/abs/2505.11556)
Keywords: language model, gpt, llm, agent
Abstract: Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving through distributed information integration, but also risk replicating collective reasoning failures observed in human groups. Yet, no theory-grounded benchmark exists to systematically evaluate such failures. In this paper, we introduce the Hidden Profile paradigm from social psychology as a diagnostic testbed for multi-agent LLM systems. By distributing critical information asymmetrically across agents, the paradigm reveals how inter-agent dynamics support or hinder collective reasoning. We first formalize the paradigm for multi-agent decision-making under distributed knowledge and instantiate it as a benchmark with nine tasks spanning diverse scenarios, including adaptations from prior human studies. We then conduct experiments with GPT-4.1 and five other leading LLMs, including reasoning-enhanced variants, showing that multi-agent systems across all models fail to match the accuracy of single agents given complete information. While agents' collective performance is broadly comparable to that of human groups, nuanced behavioral differences emerge, such as increased sensitivity to social desirability. Finally, we demonstrate the paradigm's diagnostic utility by exploring a cooperation-contradiction trade-off in multi-agent LLM systems. We find that while cooperative agents are prone to over-coordination in collective settings, increased contradiction impairs group convergence. This work contributes a reproducible framework for evaluating multi-agent LLM systems and motivates future research on artificial collective intelligence and human-AI interaction.

Title: Talk to Your Slides: Efficient Slide Editing Agent with Large Language Models

Authors: Kyudan Jung, Hojun Cho, Jooyeol Yun, Jaehyeok Jang, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11604
Pdf URL: https://arxiv.org/pdf/2505.11604
Copy Paste: [[2505.11604]] Talk to Your Slides: Efficient Slide Editing Agent with Large Language Models(https://arxiv.org/abs/2505.11604)
Keywords: language model, llm, agent
Abstract: Existing research on large language models (LLMs) for PowerPoint predominantly focuses on slide generation, overlooking the common yet tedious task of editing existing slides. We introduce Talk-to-Your-Slides, an LLM-powered agent that directly edits slides within active PowerPoint sessions through COM communication. Our system employs a two-level approach: (1) high-level processing where an LLM agent interprets instructions and formulates editing plans, and (2) low-level execution where Python scripts directly manipulate PowerPoint objects. Unlike previous methods relying on predefined operations, our approach enables more flexible and contextually-aware editing. To facilitate evaluation, we present TSBench, a human-annotated dataset of 379 diverse editing instructions with corresponding slide variations. Experimental results demonstrate that Talk-to-Your-Slides significantly outperforms baseline methods in execution success rate, instruction fidelity, and editing efficiency. Our code and benchmark are available at this https URL

Title: MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models

Authors: Xiaomin Li, Mingye Gao, Yuexing Hao, Taoran Li, Guangya Wan, Zihan Wang, Yijun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11613
Pdf URL: https://arxiv.org/pdf/2505.11613
Copy Paste: [[2505.11613]] MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models(https://arxiv.org/abs/2505.11613)
Keywords: language model, llm
Abstract: Clinical guidelines, typically structured as decision trees, are central to evidence-based medical practice and critical for ensuring safe and accurate diagnostic decision-making. However, it remains unclear whether Large Language Models (LLMs) can reliably follow such structured protocols. In this work, we introduce MedGUIDE, a new benchmark for evaluating LLMs on their ability to make guideline-consistent clinical decisions. MedGUIDE is constructed from 55 curated NCCN decision trees across 17 cancer types and uses clinical scenarios generated by LLMs to create a large pool of multiple-choice diagnostic questions. We apply a two-stage quality selection process, combining expert-labeled reward models and LLM-as-a-judge ensembles across ten clinical and linguistic criteria, to select 7,747 high-quality samples. We evaluate 25 LLMs spanning general-purpose, open-source, and medically specialized models, and find that even domain-specific LLMs often underperform on tasks requiring structured guideline adherence. We also test whether performance can be improved via in-context guideline inclusion or continued pretraining. Our findings underscore the importance of MedGUIDE in assessing whether LLMs can operate safely within the procedural frameworks expected in real-world clinical settings.

Title: Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations

Authors: Jian-Qiao Zhu, Haijiang Yan, Thomas L. Griffiths
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11615
Pdf URL: https://arxiv.org/pdf/2505.11615
Copy Paste: [[2505.11615]] Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations(https://arxiv.org/abs/2505.11615)
Keywords: language model, llm
Abstract: Changing the behavior of large language models (LLMs) can be as straightforward as editing the Transformer's residual streams using appropriately constructed "steering vectors." These modifications to internal neural activations, a form of representation engineering, offer an effective and targeted means of influencing model behavior without retraining or fine-tuning the model. But how can such steering vectors be systematically identified? We propose a principled approach for uncovering steering vectors by aligning latent representations elicited through behavioral methods (specifically, Markov chain Monte Carlo with LLMs) with their neural counterparts. To evaluate this approach, we focus on extracting latent risk preferences from LLMs and steering their risk-related outputs using the aligned representations as steering vectors. We show that the resulting steering vectors successfully and reliably modulate LLM outputs in line with the targeted behavior.
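Note: as a rough illustration of the residual-stream editing mentioned in this abstract, the sketch below uses the common additive form of steering, h' = h + alpha * v. The abstract does not state the exact editing rule, so the function name `steer`, the scale `alpha`, and the toy vectors are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of additive activation steering (an assumed form; toy numbers).

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction, leaving the inputs unchanged."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical residual-stream activation and steering vector.
hidden = [0.5, -1.0, 2.0]
risk_direction = [1.0, 0.0, -1.0]

steered = steer(hidden, risk_direction, alpha=0.5)

# The projection onto the steering direction shifts by alpha * ||direction||^2.
shift = dot(steered, risk_direction) - dot(hidden, risk_direction)
print(steered, shift)
```

In a real model the same addition would be applied to a layer's activations during the forward pass; how the vector itself is found is the subject of the paper and is not reproduced here.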

Title: THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering

Authors: Udita Patel, Rutu Mulkar, Jay Roberts, Cibi Chakravarthy Senthilkumar, Sujay Gandhi, Xiaofei Zheng, Naumaan Nayyar, Rafael Castrillo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11626
Pdf URL: https://arxiv.org/pdf/2505.11626
Copy Paste: [[2505.11626]] THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering(https://arxiv.org/abs/2505.11626)
Keywords: language model, retrieval augmented generation
Abstract: We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference-free framework for RAG (Retrieval Augmented Generation) based question answering (QA) applications. THELMA consists of six interdependent metrics specifically designed for holistic, fine-grained evaluation of RAG QA applications. The THELMA framework helps developers and application owners evaluate, monitor, and improve end-to-end RAG QA pipelines without requiring labelled sources or reference responses. We also present our findings on the interplay of the proposed THELMA metrics, which can be interpreted to identify the specific RAG component needing improvement in QA applications.

Title: Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation

Authors: Berkcan Kapusuzoglu, Supriyo Chakraborty, Chia-Hsuan Lee, Sambit Sahu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11628
Pdf URL: https://arxiv.org/pdf/2505.11628
Copy Paste: [[2505.11628]] Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation(https://arxiv.org/abs/2505.11628)
Keywords: prompt
Abstract: Supervised fine-tuning (SFT) using expert demonstrations often suffers from the imitation problem, where the model learns to reproduce the correct responses without understanding the underlying rationale. To address this limitation, we propose Critique-Guided Distillation (CGD), a novel multi-stage framework that integrates teacher-model-generated explanatory critiques and refined responses into the SFT process. A student model is then trained to map the triplet of prompt, teacher critique, and its own initial response to the corresponding refined teacher response, thereby learning both what to imitate and why. Using entropy-based analysis, we show that CGD reduces refinement uncertainty and can be interpreted as a Bayesian posterior update. We perform an extensive empirical evaluation of CGD on a variety of benchmark tasks and demonstrate significant gains on both math (AMC23 +17.5%) and language understanding tasks (MMLU-Pro +6.3%), while successfully mitigating the format drift issues observed in previous critique fine-tuning (CFT) techniques.

Title: Can an Easy-to-Hard Curriculum Make Reasoning Emerge in Small Language Models? Evidence from a Four-Stage Curriculum on GPT-2

Authors: Xiang Fu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11643
Pdf URL: https://arxiv.org/pdf/2505.11643
Copy Paste: [[2505.11643]] Can an Easy-to-Hard Curriculum Make Reasoning Emerge in Small Language Models? Evidence from a Four-Stage Curriculum on GPT-2(https://arxiv.org/abs/2505.11643)
Keywords: language model, gpt
Abstract: We demonstrate that a developmentally ordered curriculum markedly improves reasoning transparency and sample efficiency in small language models (SLMs). Concretely, we train Cognivolve, a 124M-parameter GPT-2 model, on a four-stage syllabus that ascends from lexical matching to multi-step symbolic inference and then evaluate it without any task-specific fine-tuning. Cognivolve reaches target accuracy in half the optimization steps of a single-phase baseline, activates an order of magnitude more gradient-salient reasoning heads, and shifts those heads toward deeper layers, yielding higher-entropy attention that balances local and long-range context. The same curriculum applied out of order or with optimizer resets fails to reproduce these gains, confirming that progression, not extra compute, drives the effect. We also identify open challenges: final-answer success still lags a conventional run by about 30%, and our saliency probe under-detects verbal-knowledge heads in the hardest stage, suggesting directions for mixed-stage fine-tuning and probe expansion.

Title: Multilingual Prompt Engineering in Large Language Models: A Survey Across NLP Tasks

Authors: Shubham Vatsal, Harsh Dubey, Aditi Singh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11665
Pdf URL: https://arxiv.org/pdf/2505.11665
Copy Paste: [[2505.11665]] Multilingual Prompt Engineering in Large Language Models: A Survey Across NLP Tasks(https://arxiv.org/abs/2505.11665)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated impressive performance across a wide range of Natural Language Processing (NLP) tasks. However, ensuring their effectiveness across multiple languages presents unique challenges. Multilingual prompt engineering has emerged as a key approach to enhance LLMs' capabilities in diverse linguistic settings without requiring extensive parameter re-training or fine-tuning. With growing interest in multilingual prompt engineering over the past two to three years, researchers have explored various strategies to improve LLMs' performance across languages and NLP tasks. By crafting structured natural language prompts, researchers have successfully extracted knowledge from LLMs across different languages, making these techniques an accessible pathway for a broader audience, including those without deep expertise in machine learning, to harness the capabilities of LLMs. In this paper, we survey and categorize different multilingual prompting techniques based on the NLP tasks they address across a diverse set of datasets that collectively span around 250 languages. We further highlight the LLMs employed, present a taxonomy of approaches and discuss potential state-of-the-art (SoTA) methods for specific multilingual datasets. Additionally, we derive a range of insights across language families and resource levels (high-resource vs. low-resource), including analyses such as the distribution of NLP tasks by language resource type and the frequency of prompting methods across different language families. Our survey reviews 36 research papers covering 39 prompting techniques applied to 30 multilingual NLP tasks, with the majority of these studies published in the last two years.

Title: Ambiguity Resolution in Text-to-Structured Data Mapping

Authors: Zhibo Hu, Chen Wang, Yanfeng Shu, Hye-Young Paik, Liming Zhu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11679
Pdf URL: https://arxiv.org/pdf/2505.11679
Copy Paste: [[2505.11679]] Ambiguity Resolution in Text-to-Structured Data Mapping(https://arxiv.org/abs/2505.11679)
Keywords: language model, llm, agent
Abstract: Ambiguity in natural language is a significant obstacle to achieving accurate text-to-structured-data mapping through large language models (LLMs), which affects the performance of tasks such as mapping text to agentic tool calls and text-to-SQL queries. Existing methods for handling ambiguity either exploit the ReAct framework to produce the correct mapping through trial and error, or use supervised fine-tuning to guide models toward a biased mapping that improves certain tasks. In this paper, we adopt a different approach that characterizes the representation difference of ambiguous text in the latent space and leverages that difference to identify ambiguity before mapping the text to structured data. To detect the ambiguity of a sentence, we focus on the relationship between ambiguous questions and their interpretations, and on what causes the LLM to ignore multiple interpretations. Unlike distances calculated from dense embedding vectors, we utilize the observation that ambiguity is caused by concepts missing in the latent space of the LLM to design a new distance measurement, computed through the path kernel as the integral of gradient values for each concept from a sparse autoencoder (SAE) under each state. We identify patterns that distinguish ambiguous questions using this measurement. Based on our observation, we propose a new framework to improve the performance of LLMs on ambiguous agentic tool calling through missing-concept prediction.

Title: MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

Authors: Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J. Tao, Min Woo Sun, Alejandro Lozano, James Zou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11733
Pdf URL: https://arxiv.org/pdf/2505.11733
Copy Paste: [[2505.11733]] MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports(https://arxiv.org/abs/2505.11733)
Keywords: language model, llm
Abstract: Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at this https URL.

Title: ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training

Authors: Feijiang Han, Xiaodong Yu, Jianheng Tang, Lyle Ungar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11739
Pdf URL: https://arxiv.org/pdf/2505.11739
Copy Paste: [[2505.11739]] ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training(https://arxiv.org/abs/2505.11739)
Keywords: language model, llm, long context, prompt
Abstract: Recently, training-free methods for improving large language models (LLMs) have attracted growing interest, with token-level attention tuning emerging as a promising and interpretable direction. However, existing methods typically rely on auxiliary mechanisms to identify important or irrelevant task-specific tokens, introducing potential bias and limiting applicability. In this paper, we uncover a surprising and elegant alternative: the semantically empty initial token is a powerful and underexplored control point for optimizing model behavior. Through theoretical analysis, we show that tuning the initial token's attention sharpens or flattens the attention distribution over subsequent tokens, and its role as an attention sink amplifies this effect. Empirically, we find that: (1) tuning its attention improves LLM performance more effectively than tuning other task-specific tokens; (2) the effect follows a consistent trend across layers, with earlier layers having greater impact, but varies across attention heads, with different heads showing distinct preferences in how they attend to this token. Based on these findings, we propose ZeroTuning, a training-free approach that improves LLM performance by applying head-specific attention adjustments to this special token. Despite tuning only one token, ZeroTuning achieves higher performance on text classification, multiple-choice, and multi-turn conversation tasks across models such as Llama, Qwen, and DeepSeek. For example, ZeroTuning improves Llama-3.1-8B by 11.71% on classification, 2.64% on QA tasks, and raises its multi-turn score from 7.804 to 7.966. The method is also robust to limited resources, few-shot settings, long contexts, quantization, decoding strategies, and prompt variations. Our work sheds light on a previously overlooked control point in LLMs, offering new insights into both inference-time tuning and model interpretability.
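Note: the claim that tuning the initial token's attention sharpens or flattens the distribution over later tokens can be checked on a toy example. The sketch below assumes one simple reading of the idea: multiply the post-softmax weight on position 0 by a factor `gamma` and renormalize. The function name, `gamma`, and the toy weights are hypothetical and are not taken from the paper.

```python
# Toy sketch: rescale the attention on the initial token, then renormalize.
# This is an assumed reading of the abstract, not the authors' code.

def rescale_initial_token(attn, gamma):
    """Scale attn[0] by gamma and renormalize so the weights sum to 1."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    scaled = [attn[0] * gamma] + list(attn[1:])
    total = sum(scaled)
    return [w / total for w in scaled]

# Hypothetical attention row where the initial token acts as an attention sink.
attn = [0.6, 0.1, 0.25, 0.05]

up = rescale_initial_token(attn, 2.0)    # more mass on the initial token
down = rescale_initial_token(attn, 0.5)  # less mass on the initial token

# Gap between two later tokens under each setting.
gap_orig = attn[2] - attn[3]
gap_up = up[2] - up[3]
gap_down = down[2] - down[3]
print(gap_up, gap_orig, gap_down)
```

Under this reading, the ratios among later tokens are unchanged, but their absolute differences shrink when `gamma > 1` (flatter) and grow when `gamma < 1` (sharper). Because the sink token holds a large share of the mass, a small change to it moves the others noticeably.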

Title: Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation

Authors: Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11754
Pdf URL: https://arxiv.org/pdf/2505.11754
Copy Paste: [[2505.11754]] Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation(https://arxiv.org/abs/2505.11754)
Keywords: language model, prompt
Abstract: Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also with employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals interesting findings as follows: 1) Encoder-decoder models, such as the ones in the Flan-T5 family, generally outperform causal decoder-only LMs in MHQA tasks, despite being significantly smaller in size; 2) altering the order of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition to the above, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs' performance on this task. Our code is publicly available at this https URL.
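Note: finding 3 concerns changing the causal mask so that some positions can attend in both directions. The abstract does not say which positions are opened up, so the sketch below assumes a prefix-style mask in which the first `prefix_len` tokens (for example, the retrieved documents) attend to each other bidirectionally while the remaining tokens stay causal. The function names and sizes are hypothetical.

```python
# Sketch of a causal mask versus a prefix-bidirectional mask (assumed variant).
# mask[i][j] is True when query position i may attend to key position j.

def causal_mask(n):
    """Standard lower-triangular mask: position i sees positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def prefix_bidirectional_mask(n, prefix_len):
    """Causal mask, except positions inside the prefix see the whole prefix."""
    if not 0 <= prefix_len <= n:
        raise ValueError("prefix_len must be between 0 and n")
    return [
        [j <= i or (i < prefix_len and j < prefix_len) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical sequence: 3 context tokens followed by 2 question/answer tokens.
plain = causal_mask(5)
mixed = prefix_bidirectional_mask(5, prefix_len=3)

for row in mixed:
    print("".join("1" if allowed else "0" for allowed in row))
```

With this mask, a document placed early in the context can still read a document placed after it, which is one way to reduce the sensitivity to document order that the abstract describes.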

Title: Towards Universal Semantics With Large Language Models

Authors: Raymond Baartmans, Matthew Raffel, Rahul Vikram, Aiden Deringer, Lizhong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11764
Pdf URL: https://arxiv.org/pdf/2505.11764
Copy Paste: [[2505.11764]] Towards Universal Semantics With Large Language Models(https://arxiv.org/abs/2505.11764)
Keywords: language model, gpt, llm
Abstract: The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond.

Title: Retrospex: Language Agent Meets Offline Reinforcement Learning Critic

Authors: Yufei Xiang, Yiqun Shen, Yeqin Zhang, Cam-Tu Nguyen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11807
Pdf URL: https://arxiv.org/pdf/2505.11807
Copy Paste: [[2505.11807]] Retrospex: Language Agent Meets Offline Reinforcement Learning Critic(https://arxiv.org/abs/2505.11807)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) possess extensive knowledge and commonsense reasoning capabilities, making them valuable for creating powerful agents. However, existing LLM agent frameworks have not fully utilized past experiences for improvement. This work introduces a new LLM-based agent framework called Retrospex, which addresses this challenge by analyzing past experiences in depth. Unlike previous approaches, Retrospex does not directly integrate experiences into the LLM's context. Instead, it combines the LLM's action likelihood with action values estimated by a Reinforcement Learning (RL) Critic, which is trained on past experiences through an offline "retrospection" process. Additionally, Retrospex employs a dynamic action rescoring mechanism that increases the importance of experience-based values for tasks that require more interaction with the environment. We evaluate Retrospex in ScienceWorld, ALFWorld and Webshop environments, demonstrating its advantages over strong, contemporary baselines.
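
A minimal sketch of the rescoring idea in this abstract: blend the LLM's action likelihood with a critic's action value, and let the critic's weight grow as the episode gets longer. The abstract does not give the actual combination rule, so the min-max normalization, the linear weight schedule, and the parameters `max_weight` and `ramp_steps` are all assumptions made for illustration.

```python
def rescore(llm_logprobs, q_values, step, max_weight=0.8, ramp_steps=20):
    """Return the best action under a blend of LLM likelihood and critic value.

    Assumed schedule: the critic's weight rises linearly with the step index
    and is capped at max_weight after ramp_steps steps.
    """
    weight = max_weight * min(step / ramp_steps, 1.0)

    def normalize(scores):
        # Min-max scale to [0, 1] so both signals are comparable.
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {a: (s - lo) / span for a, s in scores.items()}

    llm = normalize(llm_logprobs)
    critic = normalize(q_values)
    combined = {a: (1 - weight) * llm[a] + weight * critic[a] for a in llm}
    return max(combined, key=combined.get), combined


# Hypothetical candidate actions with made-up scores.
llm_logprobs = {"open door": -0.2, "pick up key": -1.5}
q_values = {"open door": 0.1, "pick up key": 0.9}

early_action, _ = rescore(llm_logprobs, q_values, step=0)
late_action, _ = rescore(llm_logprobs, q_values, step=20)
```

With these made-up numbers the LLM's preferred action wins at the first step, and the critic's preferred action wins once the weight has ramped up.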

Title: Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model

Authors: Shen Li, Renfen Hu, Lijun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11810
Pdf URL: https://arxiv.org/pdf/2505.11810
Copy Paste: [[2505.11810]] Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model(https://arxiv.org/abs/2505.11810)
Keywords: language model
Abstract: General-purpose large language models demonstrate notable capabilities in language comprehension and generation, achieving results that are comparable to, or even surpass, human performance in many language information processing tasks. Nevertheless, when general models are applied to some specific domains, e.g., Classical Chinese texts, their effectiveness is often unsatisfactory, and fine-tuning open-source foundational models similarly struggles to adequately incorporate domain-specific knowledge. To address this challenge, this study developed a large language model, AI Taiyan, specifically designed for understanding and generating Classical Chinese. Experiments show that with a reasonable model design, data processing, foundational training, and fine-tuning, satisfactory results can be achieved with only 1.8 billion parameters. In key tasks related to Classical Chinese information processing such as punctuation, identification of allusions, explanation of word meanings, and translation between ancient and modern Chinese, this model exhibits a clear advantage over both general-purpose large models and domain-specific traditional models, achieving levels close to or surpassing human baselines. This research provides a reference for the efficient construction of specialized domain-specific large language models. Furthermore, the paper discusses the application of this model in fields such as the collation of ancient texts, dictionary editing, and language research, combined with case studies.

Title: BELLE: A Bi-Level Multi-Agent Reasoning Framework for Multi-Hop Question Answering

Authors: Taolin Zhang, Dongyang Li, Qizhou Chen, Chengyu Wang, Xiaofeng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11811
Pdf URL: https://arxiv.org/pdf/2505.11811
Copy Paste: [[2505.11811]] BELLE: A Bi-Level Multi-Agent Reasoning Framework for Multi-Hop Question Answering(https://arxiv.org/abs/2505.11811)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Multi-hop question answering (QA) involves finding multiple relevant passages and performing step-by-step reasoning to answer complex questions. Previous works on multi-hop QA employ specific methods from different modeling perspectives based on large language models (LLMs), regardless of the question types. In this paper, we first conduct an in-depth analysis of public multi-hop QA benchmarks, dividing the questions into four types and evaluating five types of cutting-edge methods for multi-hop QA: Chain-of-Thought (CoT), Single-step, Iterative-step, Sub-step, and Adaptive-step. We find that different types of multi-hop questions have varying degrees of sensitivity to different types of methods. Thus, we propose a Bi-levEL muLti-agEnt reasoning (BELLE) framework to address multi-hop QA by specifically focusing on the correspondence between question types and methods, where each type of method is regarded as an "operator" by prompting LLMs differently. The first level of BELLE includes multiple agents that debate to obtain an executive plan of combined "operators" to address the multi-hop QA task comprehensively. During the debate, in addition to the basic roles of affirmative debater, negative debater, and judge, at the second level, we further leverage fast and slow debaters to monitor whether changes in viewpoints are reasonable. Extensive experiments demonstrate that BELLE significantly outperforms strong baselines in various datasets. Additionally, BELLE is more cost-effective in terms of model consumption than single models in more complex multi-hop QA scenarios.

Title: Chain-of-Model Learning for Language Model

Authors: Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11820
Pdf URL: https://arxiv.org/pdf/2505.11820
Copy Paste: [[2505.11820]] Chain-of-Model Learning for Language Model(https://arxiv.org/abs/2505.11820)
Keywords: language model
Abstract: In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer as a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, the model built upon the CoM framework can progressively scale up the model size by increasing the chains based on the previous models (i.e., chains), and offer multiple sub-models at varying sizes for elastic inference by using different chain numbers. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Based on CoLM, we further introduce CoLM-Air by introducing a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration and so on. Experimental results demonstrate our CoLM family can achieve comparable performance to the standard Transformer, while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and offer multiple varying model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: this https URL.
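
The chain constraint in this abstract can be pictured as a block lower-triangular mask on a linear layer. The sketch below is illustrative only: it assumes that output chain k may read input chains 0 through k (the abstract does not say whether a chain sees itself), and `chain_mask` and `masked_linear` are hypothetical helper names rather than the paper's implementation.

```python
def chain_mask(chain_sizes_in, chain_sizes_out):
    """Build mask[o][i] = 1 if output unit o may read input unit i.

    Assumption: output chain k sees input chains 0..k inclusive.
    """
    in_chain = [c for c, n in enumerate(chain_sizes_in) for _ in range(n)]
    out_chain = [c for c, n in enumerate(chain_sizes_out) for _ in range(n)]
    return [[1 if ic <= oc else 0 for ic in in_chain] for oc in out_chain]


def masked_linear(x, weight, mask):
    """Apply y = (weight * mask) @ x using plain Python lists."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weight, mask)
    ]


# Two chains of two hidden units each, on both the input and output side.
mask = chain_mask([2, 2], [2, 2])
weight = [[1.0] * 4 for _ in range(4)]

full = masked_linear([1, 2, 3, 4], weight, mask)
perturbed = masked_linear([1, 2, 99, -7], weight, mask)
```

Because the first chain's outputs never read the second chain's inputs, `full[:2]` and `perturbed[:2]` are identical. That independence is what would allow a smaller sub-model to be cut out by keeping only the leading chains.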

Title: Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning

Authors: Yansong Ning, Wei Li, Jun Fang, Naiqiang Tan, Hao Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11827
Pdf URL: https://arxiv.org/pdf/2505.11827
Copy Paste: [[2505.11827]] Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning(https://arxiv.org/abs/2505.11827)
Keywords: language model, llm, chain-of-thought
Abstract: Compressing long chain-of-thought (CoT) from large language models (LLMs) is an emerging strategy to improve the reasoning efficiency of LLMs. Despite its promising benefits, existing studies equally compress all thoughts within a long CoT, hindering more concise and effective reasoning. To this end, we first investigate the importance of different thoughts by examining their effectiveness and efficiency in contributing to reasoning through automatic long CoT chunking and Monte Carlo rollouts. Building upon the insights, we propose a theoretically bounded metric to jointly measure the effectiveness and efficiency of different thoughts. We then propose Long$\otimes$Short, an efficient reasoning framework that enables two LLMs to collaboratively solve the problem: a long-thought LLM for more effectively generating important thoughts, and a short-thought LLM for efficiently generating the remaining thoughts. Specifically, we begin by synthesizing a small amount of cold-start data to fine-tune LLMs for long-thought and short-thought reasoning styles, respectively. Furthermore, we propose synergizing-oriented multi-turn reinforcement learning, focusing on the model self-evolution and collaboration between long-thought and short-thought LLMs. Experimental results show that our method enables Qwen2.5-7B and Llama3.1-8B to achieve performance comparable to DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, while reducing token length by over 80% across the MATH500, AIME24/25, AMC23, and GPQA Diamond benchmarks. Our data and code are available at this https URL.

Title: Class Distillation with Mahalanobis Contrast: An Efficient Training Paradigm for Pragmatic Language Understanding Tasks

Authors: Chenlu Wang, Weimin Lyu, Ritwik Banerjee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11829
Pdf URL: https://arxiv.org/pdf/2505.11829
Copy Paste: [[2505.11829]] Class Distillation with Mahalanobis Contrast: An Efficient Training Paradigm for Pragmatic Language Understanding Tasks(https://arxiv.org/abs/2505.11829)
Keywords: language model, llm
Abstract: Detecting deviant language such as sexism, or nuanced language such as metaphors or sarcasm, is crucial for enhancing the safety, clarity, and interpretation of online social discourse. While existing classifiers deliver strong results on these tasks, they often come with significant computational cost and high data demands. In this work, we propose Class Distillation (ClaD), a novel training paradigm that targets the core challenge: distilling a small, well-defined target class from a highly diverse and heterogeneous background. ClaD integrates two key innovations: (i) a loss function informed by the structural properties of class distributions, based on Mahalanobis distance, and (ii) an interpretable decision algorithm optimized for class separation. Across three benchmark detection tasks -- sexism, metaphor, and sarcasm -- ClaD outperforms competitive baselines, and even with smaller language models and orders of magnitude fewer parameters, achieves performance comparable to several large language models (LLMs). These results demonstrate ClaD as an efficient tool for pragmatic language understanding tasks that require gleaning a small target class from a larger heterogeneous background.

Title: Multilingual Collaborative Defense for Large Language Models

Authors: Hongliang Li, Jinan Xu, Gengping Cui, Changhao Guan, Fengran Mo, Kaiyu Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11835
Pdf URL: https://arxiv.org/pdf/2505.11835
Copy Paste: [[2505.11835]] Multilingual Collaborative Defense for Large Language Models(https://arxiv.org/abs/2505.11835)
Keywords: language model, llm, prompt
Abstract: The robustness and security of large language models (LLMs) have become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of "jailbreaking" these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that optimizes a continuous, soft safety prompt automatically to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages. Second, MCD maintains strong generalization capabilities while minimizing false refusal rates. Third, MCD mitigates the language safety misalignment caused by imbalances in LLM training corpora. To evaluate the effectiveness of MCD, we manually construct multilingual versions of commonly used jailbreak benchmarks, such as MaliciousInstruct and AdvBench, to assess various safeguarding methods. Additionally, we introduce these datasets in underrepresented (zero-shot) languages to verify the language transferability of MCD. The results demonstrate that MCD outperforms existing approaches in safeguarding against multilingual jailbreak attempts while also exhibiting strong language transfer capabilities. Our code is available at this https URL.

Title: When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research

Authors: Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, Stella Biderman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11855
Pdf URL: https://arxiv.org/pdf/2505.11855
Copy Paste: [[2505.11855]] When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research(https://arxiv.org/abs/2505.11855)
Keywords: language model, llm, prompt
Abstract: Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work casts these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts. In this work, we explore a complementary application: using LLMs as verifiers to automate the academic verification of scientific manuscripts. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions derived from misunderstandings. These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.
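
For readers unfamiliar with the two metrics quoted in this abstract, here is a minimal sketch of how precision and recall are computed over sets of flagged errors. The `(paper_id, location)` pairs are made-up placeholders, and exact-match comparison is an assumption; the benchmark's own matching procedure is not described in the abstract.

```python
def precision_recall(predicted, gold):
    """Precision = hits / flagged errors; recall = hits / annotated errors."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical errors, identified by (paper_id, location).
gold = {("p1", "eq3"), ("p2", "sec4")}
predicted = {("p1", "eq3"), ("p1", "fig2"), ("p2", "table1"), ("p3", "eq1")}

precision, recall = precision_recall(predicted, gold)
```

In this toy case the verifier flags four errors and only one is real, so precision is 0.25, while recall is 0.5 because one of the two annotated errors was found.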

Title: NAMET: Robust Massive Model Editing via Noise-Aware Memory Optimization

Authors: Yanbo Dai, Zhenlan Ji, Zongjie Li, Shuai Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11876
Pdf URL: https://arxiv.org/pdf/2505.11876
Copy Paste: [[2505.11876]] NAMET: Robust Massive Model Editing via Noise-Aware Memory Optimization(https://arxiv.org/abs/2505.11876)
Keywords: language model, llm
Abstract: Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs). However, the effectiveness of existing approaches degrades in massive editing scenarios, particularly when evaluated with practical metrics or in context-rich settings. We attribute these failures to embedding collisions among knowledge items, which undermine editing reliability at scale. To address this, we propose NAMET (Noise-aware Model Editing in Transformers), a simple yet effective method that introduces noise during memory extraction via a one-line modification to MEMIT. Extensive experiments across six LLMs and three datasets demonstrate that NAMET consistently outperforms existing methods when editing thousands of facts.

Title: AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation

Authors: Xiechi Zhang, Zetian Ouyang, Linlin Wang, Gerard de Melo, Zhu Cao, Xiaoling Wang, Ya Zhang, Yanfeng Wang, Liang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11887
Pdf URL: https://arxiv.org/pdf/2505.11887
Copy Paste: [[2505.11887]] AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation(https://arxiv.org/abs/2505.11887)
Keywords: language model, llm
Abstract: With the proliferation of large language models (LLMs) in the medical domain, there is increasing demand for improved evaluation techniques to assess their capabilities. However, traditional metrics like F1 and ROUGE, which rely on token overlaps to measure quality, significantly overlook the importance of medical terminology. While human evaluation tends to be more reliable, it can be very costly and may also suffer from inaccuracies due to limits in human expertise and motivation. Although there are some evaluation methods based on LLMs, their usability in the medical field is limited due to their proprietary nature or lack of expertise. To tackle these challenges, we present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation. Specifically, we propose a hierarchical training method involving curriculum instruction tuning and an iterative knowledge introspection mechanism, enabling AutoMedEval to acquire professional medical assessment capabilities with limited instructional data. Human evaluations indicate that AutoMedEval surpasses other baselines in terms of correlation with human judgments.

Title: Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents

Authors: Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, Bo An
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11891
Pdf URL: https://arxiv.org/pdf/2505.11891
Copy Paste: [[2505.11891]] Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents(https://arxiv.org/abs/2505.11891)
Keywords: agent
Abstract: VLM-based mobile agents are increasingly popular due to their capabilities to interact with smartphone GUIs and XML-structured texts and to complete daily tasks. However, existing online benchmarks struggle with obtaining stable reward signals due to dynamic environmental changes. Offline benchmarks evaluate the agents through single-path trajectories, which stands in contrast to the inherently multi-solution characteristics of GUI tasks. Additionally, both types of benchmarks fail to assess whether mobile agents can handle noise or engage in proactive interactions due to a lack of noisy apps or overly full instructions during the evaluation process. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split, with offline multi-path evaluation to assess the agent's ability to obtain step rewards during task execution. It contains a noisy split based on pop-ups and ads apps, and a contaminated split named AITZ-Noise to formulate a real noisy environment. Furthermore, an ambiguous instruction split with preset Q&A interactions is released to evaluate the agent's proactive interaction capabilities. We conduct evaluations on these splits using the single-agent framework AppAgent-v1, the multi-agent framework Mobile-Agent-v2, as well as other mobile agents such as UI-Tars and OS-Atlas. Code and data are available at this https URL.

Title: RLAP: A Reinforcement Learning Enhanced Adaptive Planning Framework for Multi-step NLP Task Solving

Authors: Zepeng Ding, Dixuan Wang, Ziqin Luo, Guochao Jiang, Deqing Yang, Jiaqing Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11893
Pdf URL: https://arxiv.org/pdf/2505.11893
Copy Paste: [[2505.11893]] RLAP: A Reinforcement Learning Enhanced Adaptive Planning Framework for Multi-step NLP Task Solving(https://arxiv.org/abs/2505.11893)
Keywords: language model, llm
Abstract: Multi-step planning, which decomposes the original task into multiple subtasks and guides LLMs to solve them sequentially without additional training, has been widely employed to enhance the performance of large language models (LLMs) on downstream natural language processing (NLP) tasks. When addressing task instances, existing methods either preset the order of steps or attempt multiple paths at each step. However, these methods overlook instances' linguistic features and rely on the intrinsic planning capabilities of LLMs to evaluate intermediate feedback and then select subtasks, resulting in suboptimal outcomes. To better solve multi-step NLP tasks with LLMs, in this paper we propose a Reinforcement Learning enhanced Adaptive Planning framework (RLAP). In our framework, we model an NLP task as a Markov decision process (MDP) and employ an LLM directly into the environment. In particular, a lightweight Actor model is trained to estimate Q-values for natural language sequences consisting of states and actions through reinforcement learning. Therefore, during sequential planning, the linguistic features of each sequence in the MDP can be taken into account, and the Actor model interacts with the LLM to determine the optimal order of subtasks for each task instance. We apply RLAP to three different types of NLP tasks and conduct extensive experiments on multiple datasets to verify RLAP's effectiveness and robustness.

Title: Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data

Authors: Philipp Christmann, Gerhard Weikum
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.11900
Pdf URL: https://arxiv.org/pdf/2505.11900
Copy Paste: [[2505.11900]] Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data(https://arxiv.org/abs/2505.11900)
Keywords: language model
Abstract: Question answering over mixed sources, like text and tables, has been advanced by verbalizing all contents and encoding them with a language model. A prominent case of such heterogeneous data is personal information: user devices log vast amounts of data every day, such as calendar entries, workout statistics, shopping records, streaming history, and more. Information needs range from simple look-ups to queries of analytical nature. The challenge is to provide humans with convenient access with a small footprint, so that all personal data stays on the user devices. We present ReQAP, a novel method that creates an executable operator tree for a given question, via recursive decomposition. Operators are designed to enable seamless integration of structured and unstructured sources, and the execution of the operator tree yields a traceable answer. We further release the PerQA benchmark, with persona-based data and questions, covering a diverse spectrum of realistic user needs.
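
To make the notion of an executable operator tree concrete, the sketch below evaluates a small hand-written tree over toy personal records. It covers execution only. The operator names (`RETRIEVE`, `FILTER`, `SUM`), the record fields, and the tree layout are hypothetical and are not taken from ReQAP, and the model-driven step that would produce such a tree from a question is omitted.

```python
def execute(node, data):
    """Recursively evaluate an operator tree over a list of records."""
    op = node["op"]
    if op == "RETRIEVE":
        # Leaf operator: select all records from one source.
        return [r for r in data if r["source"] == node["source"]]
    children = [execute(child, data) for child in node["children"]]
    if op == "FILTER":
        return [r for r in children[0] if node["predicate"](r)]
    if op == "SUM":
        return sum(r[node["field"]] for r in children[0])
    raise ValueError(f"unknown operator: {op}")


# Toy on-device records from two hypothetical sources.
data = [
    {"source": "workout", "km": 5.0, "month": "May"},
    {"source": "workout", "km": 3.5, "month": "April"},
    {"source": "calendar", "title": "dentist", "month": "May"},
    {"source": "workout", "km": 2.0, "month": "May"},
]

# Hand-written tree for "How many km did I run in May?"
tree = {
    "op": "SUM",
    "field": "km",
    "children": [{
        "op": "FILTER",
        "predicate": lambda r: r["month"] == "May",
        "children": [{"op": "RETRIEVE", "source": "workout"}],
    }],
}

answer = execute(tree, data)
```

Each intermediate list produced by `RETRIEVE` and `FILTER` could be logged alongside the final value, which is one way an answer from such a tree becomes traceable.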

Title: ELITE: Embedding-Less retrieval with Iterative Text Exploration

Authors: Zhangyu Wang, Siyuan Gao, Rong Zhou, Hao Wang, Li Ning
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11908
Pdf URL: https://arxiv.org/pdf/2505.11908
Copy Paste: [[2505.11908]] ELITE: Embedding-Less retrieval with Iterative Text Exploration(https://arxiv.org/abs/2505.11908)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have achieved impressive progress in natural language processing, but their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks. Retrieval-Augmented Generation (RAG) mitigates this by retrieving relevant information from an external corpus. However, existing RAG systems often rely on embedding-based retrieval trained on corpus-level semantic similarity, which can lead to retrieving content that is semantically similar in form but misaligned with the question's true intent. Furthermore, recent RAG variants construct graph- or hierarchy-based structures to improve retrieval accuracy, resulting in significant computation and storage overhead. In this paper, we propose an embedding-free retrieval framework. Our method leverages the logical inferencing ability of LLMs in retrieval using iterative search space refinement guided by our novel importance measure and extends the retrieval results with logically related information without explicit graph construction. Experiments on long-context QA benchmarks, including NovelQA and Marathon, show that our approach outperforms strong baselines while reducing storage and runtime by over an order of magnitude.

Title: Enhancing Complex Instruction Following for Large Language Models with Mixture-of-Contexts Fine-tuning

Authors: Yuheng Lu, ZiMeng Bai, Caixia Yuan, Huixing Jiang, Xiaojie Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11922
Pdf URL: https://arxiv.org/pdf/2505.11922
Copy Paste: [[2505.11922]] Enhancing Complex Instruction Following for Large Language Models with Mixture-of-Contexts Fine-tuning(https://arxiv.org/abs/2505.11922)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit remarkable capabilities in handling natural language tasks; however, they may struggle to consistently follow complex instructions, including those that involve multiple constraints. Post-training LLMs using supervised fine-tuning (SFT) is a standard approach to improve their ability to follow instructions. In addressing complex instruction following, existing efforts primarily focus on data-driven methods that synthesize complex instruction-output pairs for SFT. However, insufficient attention allocated to crucial sub-contexts may reduce the effectiveness of SFT. In this work, we propose transforming sequentially structured input instructions into multiple parallel instructions containing sub-contexts. To support processing this multi-input, we propose MISO (Multi-Input Single-Output), an extension to currently dominant decoder-only transformer-based LLMs. MISO introduces a mixture-of-contexts paradigm that jointly considers the overall instruction-output alignment and the influence of individual sub-contexts to enhance SFT effectiveness. We apply MISO fine-tuning to complex instruction-following datasets and evaluate it with standard LLM inference. Empirical results demonstrate the superiority of MISO as a fine-tuning method for LLMs, both in terms of effectiveness in complex instruction-following scenarios and its potential for training efficiency.

Title: An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts

Authors: Yu-Ting Lee, Hui-Ying Shih, Fu-Chieh Chang, Pei-Yuan Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11924
Pdf URL: https://arxiv.org/pdf/2505.11924
Copy Paste: [[2505.11924]] An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts(https://arxiv.org/abs/2505.11924)
Keywords: language model, prompt
Abstract: We provide an explanation for the performance gains of intrinsic self-correction, a process where a language model iteratively refines its outputs without external feedback. More precisely, we investigate how prompting induces interpretable changes in hidden states and thus affects the output distributions. We hypothesize that each prompt-induced shift lies in a linear span of some linear representation vectors, naturally separating tokens based on individual concept alignment. Building around this idea, we give a mathematical formulation of self-correction and derive a concentration result for output tokens based on alignment magnitudes. Our experiments on text detoxification with zephyr-7b-sft reveal a substantial gap in the inner products of the prompt-induced shifts and the unembeddings of the top-100 most toxic tokens vs. those of the unembeddings of the bottom-100 least toxic tokens, under toxic instructions. This suggests that self-correction prompts enhance a language model's capability of latent concept recognition. Our analysis offers insights into the underlying mechanism of self-correction by characterizing how prompting works explainably. For reproducibility, our code is available.
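The gap described in the abstract above can be illustrated with a small sketch: compute the prompt-induced shift in a hidden state, then compare its mean inner product with two sets of unembedding vectors. The function names and the toy two-dimensional vectors are assumptions made for illustration; the abstract does not give the paper's exact procedure or data.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_alignment(shift, unembeddings):
    """Average inner product between a shift vector and a set of unembeddings."""
    return sum(dot(shift, u) for u in unembeddings) / len(unembeddings)


def alignment_gap(h_before, h_after, set_a, set_b):
    """Mean alignment of the prompt-induced shift with set_a minus that with set_b.

    h_before / h_after: hidden states without and with the prompt (hypothetical inputs).
    set_a / set_b: unembedding vectors of two token groups, e.g. most and least toxic.
    """
    shift = [a - b for a, b in zip(h_after, h_before)]
    return mean_alignment(shift, set_a) - mean_alignment(shift, set_b)


# Toy example: the shift points along the first axis, as do the set_a unembeddings.
gap = alignment_gap([0.0, 0.0], [1.0, 0.0], [[1.0, 0.0]], [[-1.0, 0.0]])
```

A large positive gap in this sketch means the shift is more aligned with the first token group than with the second, which is the kind of separation the abstract reports.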

Title: Neuro-Symbolic Query Compiler

Authors: Yuyao Zhang, Zhicheng Dou, Xiaoxi Li, Jiajie Jin, Yongkang Wu, Zhonghua Li, Qi Ye, Ji-Rong Wen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.11932
Pdf URL: https://arxiv.org/pdf/2505.11932
Copy Paste: [[2505.11932]] Neuro-Symbolic Query Compiler(https://arxiv.org/abs/2505.11932)
Keywords: retrieval-augmented generation
Abstract: Precise recognition of search intent in Retrieval-Augmented Generation (RAG) systems remains a challenging goal, especially under resource constraints and for complex queries with nested structures and dependencies. This paper presents QCompiler, a neuro-symbolic framework inspired by linguistic grammar rules and compiler design, to bridge this gap. It theoretically designs a minimal yet sufficient Backus-Naur Form (BNF) grammar $G[q]$ to formalize complex queries. Unlike previous methods, this grammar maintains completeness while minimizing redundancy. Based on this, QCompiler includes a Query Expression Translator, a Lexical Syntax Parser, and a Recursive Descent Processor to compile queries into Abstract Syntax Trees (ASTs) for execution. The atomicity of the sub-queries in the leaf nodes ensures more precise document retrieval and response generation, significantly improving the RAG system's ability to address complex queries.
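To make the grammar-then-parse idea above concrete, here is a minimal recursive-descent sketch that compiles a nested query into a tree with atomic sub-queries at the leaves. The two-rule grammar, the `AND` keyword, and all function names are hypothetical; this is not the paper's grammar $G[q]$ or its implementation.

```python
import re

# Hypothetical toy grammar (not the paper's G[q]):
#   <query> ::= <term> { "AND" <term> }
#   <term>  ::= "(" <query> ")" | ATOM


def tokenize(text):
    """Split on parentheses and the keyword AND; all other text forms atoms."""
    parts = re.split(r"(\(|\)|\bAND\b)", text)
    return [p.strip() for p in parts if p.strip()]


def parse_query(tokens, pos=0):
    """Parse <query> and return (node, next_position)."""
    node, pos = parse_term(tokens, pos)
    children = [node]
    while pos < len(tokens) and tokens[pos] == "AND":
        node, pos = parse_term(tokens, pos + 1)
        children.append(node)
    if len(children) == 1:
        return children[0], pos
    return {"op": "AND", "children": children}, pos


def parse_term(tokens, pos):
    """Parse <term>: a parenthesized query or a single atomic sub-query."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of query")
    if tokens[pos] == "(":
        node, pos = parse_query(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return node, pos + 1
    if tokens[pos] in (")", "AND"):
        raise ValueError("expected a sub-query, got " + tokens[pos])
    return {"atom": tokens[pos]}, pos + 1


def compile_query(text):
    """Compile a query string into a nested dict acting as a small AST."""
    tokens = tokenize(text)
    ast, pos = parse_query(tokens)
    if pos != len(tokens):
        raise ValueError("trailing tokens after query")
    return ast


ast = compile_query("who directed Inception AND (when was it released AND who starred in it)")
```

Each `atom` leaf holds one self-contained sub-query, which is the property the abstract credits for more precise retrieval.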

Title: ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs' Capability via Chart Editing

Authors: Xuanle Zhao, Xuexin Liu, Haoyue Yang, Xianzhen Luo, Fanhu Zeng, Jianling Li, Qi Shi, Chi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11935
Pdf URL: https://arxiv.org/pdf/2505.11935
Copy Paste: [[2505.11935]] ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs' Capability via Chart Editing(https://arxiv.org/abs/2505.11935)
Keywords: language model, llm
Abstract: Although multimodal large language models (MLLMs) show promise in generating chart rendering code, chart editing presents a greater challenge. This difficulty stems from its nature as a labor-intensive task for humans that also requires MLLMs to integrate chart understanding, complex reasoning, and precise intent interpretation. While many MLLMs claim such editing capabilities, current assessments typically rely on limited case studies rather than robust evaluation methodologies, highlighting the urgent need for a comprehensive evaluation framework. In this work, we propose ChartEdit, a new high-quality benchmark designed for chart editing tasks. This benchmark comprises 1,405 diverse editing instructions applied to 233 real-world charts, with each instruction-chart instance having been manually annotated and validated for accuracy. Utilizing ChartEdit, we evaluate the performance of 10 mainstream MLLMs across two types of experiments, assessing them at both the code and chart levels. The results suggest that large-scale models can generate code to produce images that partially match the reference images. However, their ability to generate accurate edits according to the instructions remains limited. The state-of-the-art (SOTA) model achieves a score of only 59.96, highlighting significant challenges in precise modification. In contrast, small-scale models, including chart-domain models, struggle both with following editing instructions and generating overall chart images, underscoring the need for further development in this area. Code is available at this https URL.

Title: CCNU at SemEval-2025 Task 3: Leveraging Internal and External Knowledge of Large Language Models for Multilingual Hallucination Annotation

Authors: Xu Liu, Guanyi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11965
Pdf URL: https://arxiv.org/pdf/2505.11965
Copy Paste: [[2505.11965]] CCNU at SemEval-2025 Task 3: Leveraging Internal and External Knowledge of Large Language Models for Multilingual Hallucination Annotation(https://arxiv.org/abs/2505.11965)
Keywords: language model, llm, hallucination
Abstract: We present the system developed by the Central China Normal University (CCNU) team for the Mu-SHROOM shared task, which focuses on identifying hallucinations in question-answering systems across 14 different languages. Our approach leverages multiple Large Language Models (LLMs) with distinct areas of expertise, employing them in parallel to annotate hallucinations, effectively simulating a crowdsourcing annotation process. Furthermore, each LLM-based annotator integrates both internal and external knowledge related to the input during the annotation process. Using the open-source LLM DeepSeek-V3, our system achieves the top ranking (#1) for Hindi data and secures a Top-5 position in seven other languages. In this paper, we also discuss unsuccessful approaches explored during our development process and share key insights gained from participating in this shared task.

Title: Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation

Authors: Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.11995
Pdf URL: https://arxiv.org/pdf/2505.11995
Copy Paste: [[2505.11995]] Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation(https://arxiv.org/abs/2505.11995)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Considering the inherent limitations of parametric knowledge in large language models (LLMs), retrieval-augmented generation (RAG) is widely employed to expand their knowledge scope. Since RAG has shown promise in knowledge-intensive tasks like open-domain question answering, its broader application to complex tasks and intelligent assistants has further advanced its utility. Despite this progress, the underlying knowledge utilization mechanisms of LLM-based RAG remain underexplored. In this paper, we present a systematic investigation of the intrinsic mechanisms by which LLMs integrate internal (parametric) and external (retrieved) knowledge in RAG scenarios. Specifically, we employ knowledge stream analysis at the macroscopic level, and investigate the function of individual modules at the microscopic level. Drawing on knowledge streaming analyses, we decompose the knowledge utilization process into four distinct stages within LLM layers: knowledge refinement, knowledge elicitation, knowledge expression, and knowledge contestation. We further demonstrate that the relevance of passages guides the streaming of knowledge through these stages. At the module level, we introduce a new method, knowledge activation probability entropy (KAPE), for identifying neurons associated with either internal or external knowledge. By selectively deactivating these neurons, we achieve targeted shifts in the LLM's reliance on one knowledge source over the other. Moreover, we discern complementary roles for multi-head attention and multi-layer perceptron layers during knowledge formation. These insights offer a foundation for improving interpretability and reliability in retrieval-augmented LLMs, paving the way for more robust and transparent generative solutions in knowledge-intensive domains.

Title: Towards Comprehensive Argument Analysis in Education: Dataset, Tasks, and Method

Authors: Yupei Ren, Xinyi Zhou, Ning Zhang, Shangqing Zhao, Man Lan, Xiaopeng Bai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12028
Pdf URL: https://arxiv.org/pdf/2505.12028
Copy Paste: [[2505.12028]] Towards Comprehensive Argument Analysis in Education: Dataset, Tasks, and Method(https://arxiv.org/abs/2505.12028)
Keywords: language model, llm
Abstract: Argument mining has garnered increasing attention over the years, with the recent advancement of Large Language Models (LLMs) further propelling this trend. However, current argument relations remain relatively simplistic and foundational, struggling to capture the full scope of argument information, particularly when it comes to representing complex argument structures in real-world scenarios. To address this limitation, we propose 14 fine-grained relation types from both vertical and horizontal dimensions, thereby capturing the intricate interplay between argument components for a thorough understanding of argument structure. On this basis, we conducted extensive experiments on three tasks: argument component detection, relation prediction, and automated essay grading. Additionally, we explored the impact of writing quality on argument component detection and relation prediction, as well as the connections between discourse relations and argumentative features. The findings highlight the importance of fine-grained argumentative annotations for argumentative writing quality assessment and encourage multi-dimensional argument analysis.

Title: MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities

Authors: Jingxue Chen, Qingkun Tang, Qianchun Lu, Siyuan Fang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12043
Pdf URL: https://arxiv.org/pdf/2505.12043
Copy Paste: [[2505.12043]] MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities(https://arxiv.org/abs/2505.12043)
Keywords: llm, hallucination
Abstract: Although LLMs perform well in general tasks, domain-specific applications suffer from hallucinations and accuracy limitations. Continual pre-training (CPT) approaches encounter two key issues: (1) domain-biased data degrades general language skills, and (2) improper corpus-mixture ratios limit effective adaptation. To address these, we propose a novel framework, Mixture of Losses (MoL), which decouples optimization objectives for domain-specific and general corpora. Specifically, cross-entropy (CE) loss is applied to domain data to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model's foundational capabilities. This dual-loss architecture preserves universal skills while enhancing domain expertise, avoiding catastrophic forgetting. Empirically, we validate that a 1:1 domain-to-general corpus ratio optimally balances training and overfitting without the need for extensive tuning or resource-intensive experiments. Furthermore, our experiments demonstrate significant performance gains compared to traditional CPT approaches, which often suffer from degradation in general language capabilities; our model achieves 27.9% higher accuracy on the Math-500 benchmark in the non-think reasoning mode, and an impressive 83.3% improvement on the challenging AIME25 subset in the think mode, underscoring the effectiveness of our approach.
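The dual-loss idea above can be sketched per token: cross-entropy for tokens from the domain corpus, and a KL term against the base model's distribution for tokens from the general corpus. This is a minimal sketch; the function names, the per-token framing, and the KL direction (base model as the reference distribution) are assumptions, since the abstract does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the target token."""
    return -math.log(softmax(logits)[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mol_token_loss(model_logits, is_domain, target_index=None, base_logits=None):
    """Pick the loss by corpus type (hypothetical helper, not the authors' code).

    Domain tokens: cross-entropy against the gold token.
    General tokens: KL between the frozen base model's distribution and the
    trained model's distribution (direction assumed here).
    """
    if is_domain:
        return cross_entropy(model_logits, target_index)
    return kl_divergence(softmax(base_logits), softmax(model_logits))


domain_loss = mol_token_loss([0.0, 0.0, 0.0, 0.0], is_domain=True, target_index=2)
general_loss = mol_token_loss([1.0, 2.0, 3.0], is_domain=False, base_logits=[1.0, 2.0, 3.0])
```

When the trained model still matches the base model on a general-corpus token, the KL term is zero, so only drift away from the base model is penalized there.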

Title: ABoN: Adaptive Best-of-N Alignment

Authors: Vinod Raman, Hilal Asi, Satyen Kale
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12050
Pdf URL: https://arxiv.org/pdf/2505.12050
Copy Paste: [[2505.12050]] ABoN: Adaptive Best-of-N Alignment(https://arxiv.org/abs/2505.12050)
Keywords: language model, prompt
Abstract: Recent advances in test-time alignment methods, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RM). However, these approaches can be computationally expensive, especially when applied uniformly across prompts without accounting for differences in alignment difficulty. In this work, we propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Motivated by latency concerns, we develop a two-stage algorithm: an initial exploratory phase estimates the reward distribution for each prompt using a small exploration budget, and a second stage adaptively allocates the remaining budget using these estimates. Our method is simple, practical, and compatible with any LM/RM combination. Empirical results on the AlpacaEval dataset for 12 LM/RM pairs and 50 different batches of prompts show that our adaptive strategy consistently outperforms the uniform allocation with the same inference budget. Moreover, our experiments show that our adaptive strategy remains competitive against uniform allocations with 20% larger inference budgets and even improves in performance as the batch size grows.
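The second stage of the two-stage scheme above can be sketched as a budget split driven by exploration-phase reward estimates. The rule used here, allocating in proportion to the spread of each prompt's exploration rewards, is a hypothetical stand-in chosen for illustration; the abstract does not state the paper's actual allocation rule, and `allocate_budget` is an invented name.

```python
import statistics


def allocate_budget(exploration_rewards, remaining_budget):
    """Split remaining_budget (an int) across prompts.

    exploration_rewards: one list of reward-model scores per prompt, gathered
    during the exploration stage. Prompts whose rewards vary more receive more
    of the remaining samples (an assumed heuristic, not the paper's rule).
    """
    spreads = [statistics.pstdev(rewards) for rewards in exploration_rewards]
    total = sum(spreads)
    n = len(spreads)
    if total == 0:
        shares = [remaining_budget / n] * n  # no signal: fall back to uniform
    else:
        shares = [remaining_budget * s / total for s in spreads]
    allocation = [int(s) for s in shares]
    # Hand out samples lost to rounding, largest fractional remainder first.
    leftover = remaining_budget - sum(allocation)
    order = sorted(range(n), key=lambda i: shares[i] - allocation[i], reverse=True)
    for i in order[:leftover]:
        allocation[i] += 1
    return allocation


# Prompt 0 gave identical rewards; prompt 1 gave widely varying rewards.
allocation = allocate_budget([[0.5, 0.5], [0.0, 1.0]], remaining_budget=10)
```

With a uniform baseline each prompt would receive five extra samples; in this sketch the prompt with stable rewards receives none, because further sampling is unlikely to change its best response.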

Title: GenderBench: Evaluation Suite for Gender Biases in LLMs

Authors: Matúš Pikuliak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12054
Pdf URL: https://arxiv.org/pdf/2505.12054
Copy Paste: [[2505.12054]] GenderBench: Evaluation Suite for Gender Biases in LLMs(https://arxiv.org/abs/2505.12054)
Keywords: llm
Abstract: We present GenderBench -- a comprehensive evaluation suite designed to measure gender biases in LLMs. GenderBench includes 14 probes that quantify 19 gender-related harmful behaviors exhibited by LLMs. We release GenderBench as an open-source and extensible library to improve the reproducibility and robustness of benchmarking across the field. We also publish our evaluation of 12 LLMs. Our measurements reveal consistent patterns in their behavior. We show that LLMs struggle with stereotypical reasoning, equitable gender representation in generated texts, and occasionally also with discriminatory behavior in high-stakes scenarios, such as hiring.

Title: Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement

Authors: Peng Ding, Jun Kuang, Zongyu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, Shujian Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12060
Pdf URL: https://arxiv.org/pdf/2505.12060
Copy Paste: [[2505.12060]] Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement(https://arxiv.org/abs/2505.12060)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks. In this paper, we identify a critical safety gap: while LLMs are adept at detecting jailbreak prompts, they often produce unsafe responses when directly processing these inputs. Inspired by this insight, we propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability. SAGE consists of two core components: a Discriminative Analysis Module and a Discriminative Response Module, enhancing resilience against sophisticated jailbreak attempts through flexible safety discrimination instructions. Extensive experiments demonstrate SAGE's effectiveness and robustness across various open-source and closed-source LLMs of different sizes and architectures, achieving an average 99% defense success rate against numerous complex and covert jailbreak methods while maintaining helpfulness on general benchmarks. We further conduct mechanistic interpretability analysis through hidden states and attention distributions, revealing the underlying mechanisms of this detection-generation discrepancy. Our work thus contributes to developing future LLMs with coherent safety awareness and generation behavior. Our code and datasets are publicly available at this https URL.

Title: Do different prompting methods yield a common task representation in language models?

Authors: Guy Davidson, Todd M. Gureckis, Brenden M. Lake, Adina Williams
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12075
Pdf URL: https://arxiv.org/pdf/2505.12075
Copy Paste: [[2505.12075]] Do different prompting methods yield a common task representation in language models?(https://arxiv.org/abs/2505.12075)
Keywords: language model, llm, prompt
Abstract: Demonstrations and instructions are two primary approaches for prompting language models to perform in-context learning (ICL) tasks. Do identical tasks elicited in different ways result in similar representations of the task? An improved understanding of task representation mechanisms would offer interpretability insights and may aid in steering models. We study this through function vectors, recently proposed as a mechanism to extract few-shot ICL task representations. We generalize function vectors to alternative task presentations, focusing on short textual instruction prompts, and successfully extract instruction function vectors that promote zero-shot task accuracy. We find evidence that demonstration- and instruction-based function vectors leverage different model components, and offer several controls to dissociate their contributions to task performance. Our results suggest that different task presentations do not induce a common task representation but elicit different, partly overlapping mechanisms. Our findings offer principled support to the practice of combining textual instructions and task demonstrations, imply challenges in universally monitoring task inference across presentation forms, and encourage further examinations of LLM task inference mechanisms.

Title: Model Merging in Pre-training of Large Language Models

Authors: Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Zhou Xun, Liang Xiang, Yonghui Wu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12082
Pdf URL: https://arxiv.org/pdf/2505.12082
Copy Paste: [[2505.12082]] Model Merging in Pre-training of Large Language Models(https://arxiv.org/abs/2505.12082)
Keywords: language model
Abstract: Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
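As a rough illustration of checkpoint merging, the sketch below averages parameters across several checkpoints, optionally with weights. Plain (weighted) averaging is an assumption made here; the abstract does not say which merging strategies the paper studies, and the dict-of-lists checkpoint format is a simplification.

```python
def merge_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter values across checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats
    (a simplified stand-in for real tensors).
    weights: optional per-checkpoint weights; defaults to a uniform average.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    merged = {}
    for name, values in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(values))
        ]
    return merged


merged = merge_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
```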

Title: Personalized Author Obfuscation with Large Language Models

Authors: Mohammad Shokri, Sarah Ita Levitan, Rivka Levitan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12090
Pdf URL: https://arxiv.org/pdf/2505.12090
Copy Paste: [[2505.12090]] Personalized Author Obfuscation with Large Language Models(https://arxiv.org/abs/2505.12090)
Keywords: language model, llm, prompt
Abstract: In this paper, we investigate the efficacy of large language models (LLMs) in obfuscating authorship by paraphrasing and altering writing styles. Rather than adopting a holistic approach that evaluates performance across the entire dataset, we focus on user-wise performance to analyze how obfuscation effectiveness varies across individual authors. While LLMs are generally effective, we observe a bimodal distribution of efficacy, with performance varying significantly across users. To address this, we propose a personalized prompting method that outperforms standard prompting techniques and partially mitigates the bimodality issue.

Title: Improving Fairness in LLMs Through Testing-Time Adversaries

Authors: Isabela Pereira Gregio, Ian Pons, Anna Helena Reali Costa, Artur Jordão
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12100
Pdf URL: https://arxiv.org/pdf/2505.12100
Copy Paste: [[2505.12100]] Improving Fairness in LLMs Through Testing-Time Adversaries(https://arxiv.org/abs/2505.12100)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) push the boundaries in natural language processing and generative AI, driving progress across various aspects of modern society. Unfortunately, the pervasive issue of bias in LLM responses (i.e., predictions) poses a significant and open challenge, hindering their application in tasks involving ethical sensitivity and responsible decision-making. In this work, we propose a straightforward, user-friendly and practical method to mitigate such biases, enhancing the reliability and trustworthiness of LLMs. Our method creates multiple variations of a given sentence by modifying specific attributes and evaluates the corresponding prediction behavior compared to the original, unaltered prediction/sentence. The idea behind this process is that critical ethical predictions often exhibit notable inconsistencies, indicating the presence of bias. Unlike previous approaches, our method relies solely on forward passes (i.e., testing-time adversaries), eliminating the need for training, fine-tuning, or prior knowledge of the training data distribution. Through extensive experiments on the popular Llama family, we demonstrate the effectiveness of our method in improving various fairness metrics, focusing on the reduction of disparities in how the model treats individuals from different racial groups. Specifically, using standard metrics, we improve the fairness in Llama3 by up to 27 percentage points. Overall, our approach significantly enhances fairness, equity, and reliability in LLM-generated results without parameter tuning or training data modifications, confirming its effectiveness in practical scenarios. We believe our work establishes an important step toward enabling the use of LLMs in tasks that require ethical considerations and responsible decision-making.
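The variation-and-compare step above can be sketched with forward passes only: rewrite one attribute of the input, rerun the predictor, and flag any variant whose prediction differs from the original. The helper names and the deliberately biased toy predictor are hypothetical; how the paper turns detected inconsistencies into a final prediction is not described in the abstract and is not modeled here.

```python
def attribute_variants(sentence, original_value, alternatives):
    """Create one variant per alternative attribute value via substitution."""
    return [sentence.replace(original_value, alt) for alt in alternatives]


def consistency_report(predict, sentence, original_value, alternatives):
    """Compare predictions on attribute-swapped variants with the original.

    predict: any callable mapping text to a label (forward passes only).
    """
    original_prediction = predict(sentence)
    variants = attribute_variants(sentence, original_value, alternatives)
    flipped = [v for v in variants if predict(v) != original_prediction]
    return {
        "original_prediction": original_prediction,
        "flipped_variants": flipped,
        "consistent": not flipped,
    }


def toy_predict(text):
    """Deliberately biased stand-in for an LLM classifier (illustration only)."""
    return "deny" if "group B" in text else "approve"


report = consistency_report(
    toy_predict,
    "The applicant from group A has a stable income.",
    "group A",
    ["group B", "group C"],
)
```

In this toy run the prediction flips only for the `group B` variant, which is the kind of inconsistency the abstract treats as a signal of bias.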

Title: The AI Gap: How Socioeconomic Status Affects Language Technology Interactions

Authors: Elisa Bassignana, Amanda Cercas Curry, Dirk Hovy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12158
Pdf URL: https://arxiv.org/pdf/2505.12158
Copy Paste: [[2505.12158]] The AI Gap: How Socioeconomic Status Affects Language Technology Interactions(https://arxiv.org/abs/2505.12158)
Keywords: language model, llm, prompt
Abstract: Socioeconomic status (SES) fundamentally influences how people interact with each other and, more recently, with digital technologies like Large Language Models (LLMs). While previous research has highlighted the interaction between SES and language technology, it was limited by reliance on proxy metrics and synthetic data. We survey 1,000 individuals from diverse socioeconomic backgrounds about their use of language technologies and generative AI, and collect 6,482 prompts from their previous interactions with LLMs. We find systematic differences across SES groups in language technology usage (i.e., frequency, performed tasks), interaction styles, and topics. Higher SES is associated with a higher level of abstraction, more concisely conveyed requests, and topics like 'inclusivity' and 'travel'. Lower SES correlates with higher anthropomorphization of LLMs (using "hello" and "thank you") and more concrete language. Our findings suggest that while generative language technologies are becoming more accessible to everyone, socioeconomic linguistic differences still stratify their use to exacerbate the digital divide. These differences underscore the importance of considering SES in developing language technologies to accommodate varying linguistic needs rooted in socioeconomic factors and limit the AI Gap across SES groups.

Title: Truth Neurons

Authors: Haohang Li, Yupeng Cao, Yangyang Yu, Jordan W. Suchow, Zining Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12182
Pdf URL: https://arxiv.org/pdf/2505.12182
Copy Paste: [[2505.12182]] Truth Neurons(https://arxiv.org/abs/2505.12182)
Keywords: language model
Abstract: Despite their remarkable success and deployment across diverse workflows, language models sometimes produce untruthful responses. Our limited understanding of how truthfulness is mechanistically encoded within these models jeopardizes their reliability and safety. In this paper, we propose a method for identifying representations of truthfulness at the neuron level. We show that language models contain truth neurons, which encode truthfulness in a subject-agnostic manner. Experiments conducted across models of varying scales validate the existence of truth neurons, confirming that the encoding of truthfulness at the neuron level is a property shared by many language models. The distribution patterns of truth neurons over layers align with prior findings on the geometry of truthfulness. Selectively suppressing the activations of truth neurons found through the TruthfulQA dataset degrades performance both on TruthfulQA and on other benchmarks, showing that the truthfulness mechanisms are not tied to a specific dataset. Our results offer novel insights into the mechanisms underlying truthfulness in language models and highlight potential directions toward improving their trustworthiness and reliability.

Title: Decoding the Mind of Large Language Models: A Quantitative Evaluation of Ideology and Biases

Authors: Manari Hirose, Masato Uchida
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2505.12183
Pdf URL: https://arxiv.org/pdf/2505.12183
Copy Paste: [[2505.12183]] Decoding the Mind of Large Language Models: A Quantitative Evaluation of Ideology and Biases(https://arxiv.org/abs/2505.12183)
Keywords: language model, gpt, llm, chat
Abstract: The widespread integration of Large Language Models (LLMs) across various sectors has highlighted the need for empirical research to understand their biases, thought patterns, and societal implications to ensure ethical and effective use. In this study, we propose a novel framework for evaluating LLMs, focusing on uncovering their ideological biases through a quantitative analysis of 436 binary-choice questions, many of which have no definitive answer. Applying our framework to ChatGPT and Gemini, we find that while LLMs generally maintain consistent opinions on many topics, their ideologies differ across models and languages. Notably, ChatGPT exhibits a tendency to change its opinion to match the questioner's opinion. Both models also exhibited problematic biases and unethical or unfair claims, which might have negative societal impacts. These results underscore the importance of addressing both ideological and ethical considerations when evaluating LLMs. The proposed framework offers a flexible, quantitative method for assessing LLM behavior, providing valuable insights for the development of more socially aligned AI systems.

Title: Vectors from Larger Language Models Predict Human Reading Time and fMRI Data More Poorly when Dimensionality Expansion is Controlled

Authors: Yi-Chien Lin, Hongao Zhu, William Schuler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12196
Pdf URL: https://arxiv.org/pdf/2505.12196
Copy Paste: [[2505.12196]] Vectors from Larger Language Models Predict Human Reading Time and fMRI Data More Poorly when Dimensionality Expansion is Controlled(https://arxiv.org/abs/2505.12196)
Keywords: language model, llm
Abstract: The impressive linguistic abilities of large language models (LLMs) have recommended them as models of human sentence processing, with some conjecturing a positive 'quality-power' relationship (Wilcox et al., 2023), in which language models' (LMs') fit to psychometric data continues to improve as their ability to predict words in context increases. This is important because it suggests that elements of LLM architecture, such as veridical attention to context and a unique objective of predicting upcoming words, reflect the architecture of the human sentence processing faculty, and that any inadequacies in predicting human reading time and brain imaging data may be attributed to insufficient model complexity, which recedes as larger models become available. Recent studies (Oh and Schuler, 2023) have shown this scaling inverts after a point, as LMs become excessively large and accurate, when word prediction probability (as information-theoretic surprisal) is used as a predictor. Other studies propose the use of entire vectors from differently sized LLMs, still showing positive scaling (Schrimpf et al., 2021), casting doubt on the value of surprisal as a predictor, but do not control for the larger number of predictors in vectors from larger LMs. This study evaluates LLM scaling using entire LLM vectors, while controlling for the larger number of predictors in vectors from larger LLMs. Results show that inverse scaling obtains, suggesting that inadequacies in predicting human reading time and brain imaging data may be due to substantial misalignment between LLMs and human sentence processing, which worsens as larger models are used.
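The abstract above refers to word prediction probability used "as information-theoretic surprisal". As a minimal sketch of that quantity only, surprisal is the negative log probability of a word given its context. The function names and the example probabilities below are hypothetical and are not taken from the paper.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal of an event with probability `prob`, measured in bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def sentence_surprisals(token_probs):
    """Per-token surprisal for a sequence of in-context word probabilities."""
    return [surprisal_bits(p) for p in token_probs]


if __name__ == "__main__":
    # Hypothetical probabilities a language model might assign to four words.
    print(sentence_surprisals([0.5, 0.25, 0.9, 0.01]))
```

Less predictable words receive larger surprisal values, which is the property that makes surprisal a candidate predictor of reading time.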

Title: How Reliable is Multilingual LLM-as-a-Judge?

Authors: Xiyan Fu, Wei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12201
Pdf URL: https://arxiv.org/pdf/2505.12201
Copy Paste: [[2505.12201]] How Reliable is Multilingual LLM-as-a-Judge?(https://arxiv.org/abs/2505.12201)
Keywords: language model, llm
Abstract: LLM-as-a-Judge has emerged as a popular evaluation strategy, where advanced large language models assess generation results in alignment with human instructions. While these models serve as a promising alternative to human annotators, their reliability in multilingual evaluation remains uncertain. To bridge this gap, we conduct a comprehensive analysis of multilingual LLM-as-a-Judge. Specifically, we evaluate five models from different model families across five diverse tasks involving 25 languages. Our findings reveal that LLMs struggle to achieve consistent judgment results across languages, with an average Fleiss' Kappa of approximately 0.3, and some models performing even worse. To investigate the cause of inconsistency, we analyze various influencing factors. We observe that consistency varies significantly across languages, with particularly poor performance in low-resource languages. Additionally, we find that neither training on multilingual data nor increasing model scale directly improves judgment consistency. These findings suggest that LLMs are not yet reliable for evaluating multilingual predictions. We finally propose an ensemble strategy which improves the consistency of the multilingual judge in real-world applications.
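The abstract above reports consistency as an average Fleiss' Kappa of approximately 0.3. The following is a minimal sketch of the standard Fleiss' Kappa formula. Treating each language's verdict as one "rater" is an assumption made here for illustration; the abstract does not describe the exact setup, and the function name and example counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a table where counts[i][j] is the number of raters
    that assigned item i to category j. Every item needs the same number of
    ratings, and at least two raters are required."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement for each item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_exp) / (1.0 - p_exp)


if __name__ == "__main__":
    # Hypothetical: 4 items judged in 5 languages, verdicts "A wins" / "B wins".
    table = [[5, 0], [3, 2], [1, 4], [4, 1]]
    print(round(fleiss_kappa(table), 3))
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a value around 0.3 indicates that verdicts change fairly often when only the language changes.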

Title: Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning

Authors: Shaobo Wang, Ziming Wang, Xiangqi Jin, Jize Wang, Jiajun Zhang, Kaixin Li, Zichen Wen, Zhong Li, Conghui He, Xuming Hu, Linfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12212
Pdf URL: https://arxiv.org/pdf/2505.12212
Copy Paste: [[2505.12212]] Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning(https://arxiv.org/abs/2505.12212)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model's predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4x speedup.

Title: GMSA: Enhancing Context Compression via Group Merging and Layer Semantic Alignment

Authors: Jiwei Tang, Zhicheng Zhang, Shunlong Wu, Jingheng Ye, Lichen Bai, Zitai Wang, Tingwei Lu, Jiaqi Chen, Lin Hai, Hai-Tao Zheng, Hong-Gee Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12215
Pdf URL: https://arxiv.org/pdf/2505.12215
Copy Paste: [[2505.12215]] GMSA: Enhancing Context Compression via Group Merging and Layer Semantic Alignment(https://arxiv.org/abs/2505.12215)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved impressive performance in a variety of natural language processing (NLP) tasks. However, when applied to long-context scenarios, they face two challenges: low computational efficiency and a large amount of redundant information. This paper introduces GMSA, a context compression framework based on the encoder-decoder architecture, which addresses these challenges by reducing input sequence length and redundant information. Structurally, GMSA has two key components: Group Merging and Layer Semantic Alignment (LSA). Group merging is used to effectively and efficiently extract summary vectors from the original context. Layer semantic alignment, on the other hand, aligns the high-level summary vectors with the low-level primary input semantics, thus bridging the semantic gap between different layers. In the training process, GMSA first learns soft tokens that contain complete semantics through autoencoder training. To further adapt GMSA to downstream tasks, we propose Knowledge Extraction Fine-tuning (KEFT) to extract knowledge from the soft tokens for downstream tasks. We train GMSA by randomly sampling the compression rate for each sample in the dataset. Under this condition, GMSA not only significantly outperforms the traditional compression paradigm in context restoration but also achieves stable and significantly faster convergence with only a few encoder layers. In downstream question-answering (QA) tasks, GMSA can achieve approximately a 2x speedup in end-to-end inference while outperforming both the original input prompts and various state-of-the-art (SOTA) methods by a large margin.

Title: One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models

Authors: Rongguang Ye, Ming Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12216
Pdf URL: https://arxiv.org/pdf/2505.12216
Copy Paste: [[2505.12216]] One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models(https://arxiv.org/abs/2505.12216)
Keywords: language model, llm
Abstract: Existing pruning methods for large language models (LLMs) focus on achieving high compression rates while maintaining model performance. Although these methods have demonstrated satisfactory performance in handling a single user's compression request, their processing time increases linearly with the number of requests, making them inefficient for real-world scenarios with multiple simultaneous requests. To address this limitation, we propose a Universal Model for Customized Compression (UniCuCo) for LLMs, which introduces a StratNet that learns to map arbitrary requests to their optimal pruning strategy. The challenge in training StratNet lies in the high computational cost of evaluating pruning strategies and the non-differentiable nature of the pruning process, which hinders gradient backpropagation for StratNet updates. To overcome these challenges, we leverage a Gaussian process to approximate the evaluation process. Since the gradient of the Gaussian process is computable, we can use it to approximate the gradient of the non-differentiable pruning process, thereby enabling StratNet updates. Experimental results show that UniCuCo is 28 times faster than baselines in processing 64 requests, while maintaining accuracy comparable to the baselines.

Title: Examining Linguistic Shifts in Academic Writing Before and After the Launch of ChatGPT: A Study on Preprint Papers

Authors: Tong Bao, Yi Zhao, Jin Mao, Chengzhi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12218
Pdf URL: https://arxiv.org/pdf/2505.12218
Copy Paste: [[2505.12218]] Examining Linguistic Shifts in Academic Writing Before and After the Launch of ChatGPT: A Study on Preprint Papers(https://arxiv.org/abs/2505.12218)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large Language Models (LLMs), such as ChatGPT, have prompted academic concerns about their impact on academic writing. Existing studies have primarily examined LLM usage in academic writing through quantitative approaches, such as word frequency statistics and probability-based analyses. However, few have systematically examined the potential impact of LLMs on the linguistic characteristics of academic writing. To address this gap, we conducted a large-scale analysis of 823,798 abstracts published in the last decade, drawn from the arXiv dataset. Through linguistic analysis of features such as the frequency of LLM-preferred words, lexical complexity, syntactic complexity, cohesion, readability, and sentiment, the results indicate a significant increase in the proportion of LLM-preferred words in abstracts, revealing the widespread influence of LLMs on academic writing. Additionally, we observed an increase in lexical complexity and sentiment in the abstracts, but a decrease in syntactic complexity, suggesting that LLMs introduce more new vocabulary and simplify sentence structure. However, the significant decrease in cohesion and readability indicates that abstracts have fewer connecting words and are becoming more difficult to read. Moreover, our analysis reveals that scholars with weaker English proficiency were more likely to use LLMs for academic writing, and focused on improving the overall logic and fluency of the abstracts. Finally, at the discipline level, we found that scholars in Computer Science showed more pronounced changes in writing style, while the changes in Mathematics were minimal.

Title: Bridging Generative and Discriminative Learning: Few-Shot Relation Extraction via Two-Stage Knowledge-Guided Pre-training

Authors: Quanjiang Guo, Jinchuan Zhang, Sijie Wang, Ling Tian, Zhao Kang, Bin Yan, Weidong Xiao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12236
Pdf URL: https://arxiv.org/pdf/2505.12236
Copy Paste: [[2505.12236]] Bridging Generative and Discriminative Learning: Few-Shot Relation Extraction via Two-Stage Knowledge-Guided Pre-training(https://arxiv.org/abs/2505.12236)
Keywords: language model, llm
Abstract: Few-Shot Relation Extraction (FSRE) remains a challenging task due to the scarcity of annotated data and the limited generalization capabilities of existing models. Although large language models (LLMs) have demonstrated potential in FSRE through in-context learning (ICL), their general-purpose training objectives often result in suboptimal performance for task-specific relation extraction. To overcome these challenges, we propose TKRE (Two-Stage Knowledge-Guided Pre-training for Relation Extraction), a novel framework that synergistically integrates LLMs with traditional relation extraction models, bridging generative and discriminative learning paradigms. TKRE introduces two key innovations: (1) leveraging LLMs to generate explanation-driven knowledge and schema-constrained synthetic data, addressing the issue of data scarcity; and (2) a two-stage pre-training strategy combining Masked Span Language Modeling (MSLM) and Span-Level Contrastive Learning (SCL) to enhance relational reasoning and generalization. Together, these components enable TKRE to effectively tackle FSRE tasks. Comprehensive experiments on benchmark datasets demonstrate the efficacy of TKRE, achieving new state-of-the-art performance in FSRE and underscoring its potential for broader application in low-resource scenarios. The code and data are publicly released.

Title: PANORAMA: A synthetic PII-laced dataset for studying sensitive data memorization in LLMs

Authors: Sriram Selvam, Anneswa Ghosh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12238
Pdf URL: https://arxiv.org/pdf/2505.12238
Copy Paste: [[2505.12238]] PANORAMA: A synthetic PII-laced dataset for studying sensitive data memorization in LLMs(https://arxiv.org/abs/2505.12238)
Keywords: language model, llm, prompt
Abstract: The memorization of sensitive and personally identifiable information (PII) by large language models (LLMs) poses growing privacy risks as models scale and are increasingly deployed in real-world applications. Existing efforts to study sensitive and PII data memorization and develop mitigation strategies are hampered by the absence of comprehensive, realistic, and ethically sourced datasets reflecting the diversity of sensitive information found on the web. We introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis), a large-scale synthetic corpus of 384,789 samples derived from 9,674 synthetic profiles designed to closely emulate the distribution, variety, and context of PII and sensitive data as it naturally occurs in online environments. Our data generation pipeline begins with the construction of internally consistent, multi-attribute human profiles using constrained selection to reflect real-world demographics such as education, health attributes, and financial status. Using a combination of zero-shot prompting and OpenAI o3-mini, we generate diverse content types, including wiki-style articles, social media posts, forum discussions, online reviews, comments, and marketplace listings, each embedding realistic, contextually appropriate PII and other sensitive information. We validate the utility of PANORAMA by fine-tuning the Mistral-7B model at 1x, 5x, 10x, and 25x data replication rates on a subset of the data and measuring PII memorization rates. The results reveal not only consistent increases with repetition but also variation across content types, highlighting PANORAMA's ability to model how memorization risks differ by context. Our dataset and code are publicly available, providing a much-needed resource for privacy risk assessment, model auditing, and the development of privacy-preserving LLMs.

Title: Distribution Prompting: Understanding the Expressivity of Language Models Through the Next-Token Distributions They Can Produce

Authors: Haojin Wang, Zining Zhu, Freda Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12244
Pdf URL: https://arxiv.org/pdf/2505.12244
Copy Paste: [[2505.12244]] Distribution Prompting: Understanding the Expressivity of Language Models Through the Next-Token Distributions They Can Produce(https://arxiv.org/abs/2505.12244)
Keywords: language model, prompt
Abstract: Autoregressive neural language models (LMs) generate a probability distribution over tokens at each time step given a prompt. In this work, we attempt to systematically understand the probability distributions that LMs can produce, showing that some distributions are significantly harder to elicit than others. Specifically, for any target next-token distribution over the vocabulary, we attempt to find a prompt that induces the LM to output a distribution as close as possible to the target, using either soft or hard gradient-based prompt tuning. We find that (1) in general, distributions with very low or very high entropy are easier to approximate than those with moderate entropy; (2) among distributions with the same entropy, those containing "outlier tokens" are easier to approximate; (3) target distributions generated by LMs -- even LMs with different tokenizers -- are easier to approximate than randomly chosen targets. These results offer insights into the expressiveness of LMs and the challenges of using them as probability distribution proposers.
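The abstract above groups target distributions by entropy and asks how closely an LM's output can match a target. A minimal sketch of those two quantities follows. Using KL divergence as the closeness measure is an assumption made here, since the abstract does not name the measure, and the toy four-token vocabulary is hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl_divergence(target, approx):
    """KL(target || approx) in nats; infinite if approx puts zero mass
    on a token that the target considers possible."""
    total = 0.0
    for t, a in zip(target, approx):
        if t == 0.0:
            continue
        if a == 0.0:
            return math.inf
        total += t * math.log(t / a)
    return total


if __name__ == "__main__":
    # Hypothetical next-token distributions over a four-token vocabulary.
    peaked = [0.97, 0.01, 0.01, 0.01]    # very low entropy
    uniform = [0.25, 0.25, 0.25, 0.25]   # maximum entropy
    moderate = [0.55, 0.25, 0.15, 0.05]  # in between
    for name, dist in [("peaked", peaked), ("uniform", uniform), ("moderate", moderate)]:
        print(name, round(entropy(dist), 3))
    print("KL(moderate || uniform) =", round(kl_divergence(moderate, uniform), 3))
```

With this framing, finding (1) says that targets like `peaked` and `uniform` can be reached with a lower divergence than targets like `moderate`.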

Title: Not All Documents Are What You Need for Extracting Instruction Tuning Data

Authors: Chi Zhang, Huaping Zhong, Hongtao Li, Chengliang Chai, Jiawei Hong, Yuhao Deng, Jiacheng Wang, Tian Tan, Yizhou Yan, Jiantao Qiu, Ye Yuan, Guoren Wang, Conghui He, Lei Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12250
Pdf URL: https://arxiv.org/pdf/2505.12250
Copy Paste: [[2505.12250]] Not All Documents Are What You Need for Extracting Instruction Tuning Data(https://arxiv.org/abs/2505.12250)
Keywords: language model, llm
Abstract: Instruction tuning improves the performance of large language models (LLMs), but it relies heavily on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data extraction framework that iteratively alternates between document selection and high-quality QA pair extraction to enhance instruction tuning. EQUAL first clusters the document corpus based on embeddings derived from contrastive learning, then uses a multi-armed bandit strategy to efficiently identify clusters that are likely to contain valuable QA pairs. This iterative approach significantly reduces computational cost while boosting model performance. Experiments on AutoMathText and StackOverflow across four downstream tasks show that EQUAL reduces computational costs by 5-10x and improves accuracy by 2.5 percent on LLaMA-3.1-8B and Mistral-7B.
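The abstract above describes using a multi-armed bandit strategy to find document clusters likely to contain valuable QA pairs. The sketch below illustrates the general idea with the classic UCB1 rule. UCB1 is a stand-in chosen here, not necessarily the strategy EQUAL uses, and the simulated per-cluster "quality" probabilities, function names, and reward model are all hypothetical.

```python
import math
import random


def ucb1_select(counts, rewards, t):
    """Pick an arm (cluster) with UCB1: try every arm once, then maximize
    mean reward plus an exploration bonus that shrinks with more pulls."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [rewards[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(cluster_quality, rounds, seed=0):
    """Simulate sampling documents from clusters. cluster_quality[a] is the
    (hypothetical) chance that a document from cluster a yields a useful QA
    pair. Returns how often each cluster was sampled."""
    rng = random.Random(seed)
    k = len(cluster_quality)
    counts = [0] * k
    rewards = [0.0] * k
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, rewards, t)
        reward = 1.0 if rng.random() < cluster_quality[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
    return counts


if __name__ == "__main__":
    print(run_bandit([0.1, 0.9, 0.2], rounds=1000))
```

The point of the bandit framing is that extraction effort concentrates on promising clusters after only a few samples from the unpromising ones, which is one plausible source of the cost savings the abstract reports.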

Title: Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches

Authors: Yuhang Zhou, Xutian Chen, Yixin Cao, Yuchen Ni, Yu He, Siyu Tian, Xiang Liu, Jian Zhang, Chuanjun Ji, Guangnan Ye, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12259
Pdf URL: https://arxiv.org/pdf/2505.12259
Copy Paste: [[2505.12259]] Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches(https://arxiv.org/abs/2505.12259)
Keywords: language model, llm
Abstract: Recent progress in large language models (LLMs) has outpaced the development of effective evaluation methods. Traditional benchmarks rely on task-specific metrics and static datasets, which often suffer from fairness issues, limited scalability, and contamination risks. In this paper, we introduce Teach2Eval, an indirect evaluation framework inspired by the Feynman Technique. Instead of directly testing LLMs on predefined tasks, our method evaluates a model's multiple abilities to teach weaker student models to perform tasks effectively. By converting open-ended tasks into standardized multiple-choice questions (MCQs) through teacher-generated feedback, Teach2Eval enables scalable, automated, and multi-dimensional assessment. Our approach not only avoids data leakage and memorization but also captures a broad range of cognitive abilities that are orthogonal to current benchmarks. Experimental results across 26 leading LLMs show strong alignment with existing human and model-based dynamic rankings, while offering additional interpretability for training guidance.

Title: Learning Auxiliary Tasks Improves Reference-Free Hallucination Detection in Open-Domain Long-Form Generation

Authors: Chengwei Qin, Wenxuan Zhou, Karthik Abinav Sankararaman, Nanshu Wang, Tengyu Xu, Alexander Radovic, Eryk Helenowski, Arya Talebzadeh, Aditya Tayade, Sinong Wang, Shafiq Joty, Han Fang, Hao Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12265
Pdf URL: https://arxiv.org/pdf/2505.12265
Copy Paste: [[2505.12265]] Learning Auxiliary Tasks Improves Reference-Free Hallucination Detection in Open-Domain Long-Form Generation(https://arxiv.org/abs/2505.12265)
Keywords: language model, llm, hallucination, prompt
Abstract: Hallucination, the generation of factually incorrect information, remains a significant challenge for large language models (LLMs), especially in open-domain long-form generation. Existing approaches for detecting hallucination in long-form tasks either focus on limited domains or rely heavily on external fact-checking tools, which may not always be available. In this work, we systematically investigate reference-free hallucination detection in open-domain long-form responses. Our findings reveal that internal states (e.g., a model's output probability and entropy) alone are insufficient for reliably (i.e., better than random guessing) distinguishing between factual and hallucinated content. To enhance detection, we explore various existing approaches, including prompting-based methods, probing, and fine-tuning, with fine-tuning proving the most effective. To further improve the accuracy, we introduce a new paradigm, named RATE-FT, that augments fine-tuning with an auxiliary task for the model to jointly learn with the main task of hallucination detection. With extensive experiments and analysis using a variety of model families and datasets, we demonstrate the effectiveness and generalizability of our method, e.g., +3% over general fine-tuning methods on LongFact.

Title: $K$-MSHC: Unmasking Minimally Sufficient Head Circuits in Large Language Models with Experiments on Syntactic Classification Tasks

Authors: Pratim Chowdhary
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12268
Pdf URL: https://arxiv.org/pdf/2505.12268
Copy Paste: [[2505.12268]] $K$-MSHC: Unmasking Minimally Sufficient Head Circuits in Large Language Models with Experiments on Syntactic Classification Tasks(https://arxiv.org/abs/2505.12268)
Keywords: language model
Abstract: Understanding which neural components drive specific capabilities in mid-sized language models ($\leq$10B parameters) remains a key challenge. We introduce the $(\bm{K}, \epsilon)$-Minimum Sufficient Head Circuit ($K$-MSHC), a methodology to identify minimal sets of attention heads crucial for classification tasks as well as Search-K-MSHC, an efficient algorithm for discovering these circuits. Applying our Search-K-MSHC algorithm to Gemma-9B, we analyze three syntactic task families: grammar acceptability, arithmetic verification, and arithmetic word problems. Our findings reveal distinct task-specific head circuits, with grammar tasks predominantly utilizing early layers, word problems showing pronounced activity in both shallow and deep regions, and arithmetic verification demonstrating a more distributed pattern across the network. We discover non-linear circuit overlap patterns, where different task pairs share computational components at varying levels of importance. While grammar and arithmetic share many "weak" heads, arithmetic and word problems share more consistently critical "strong" heads. Importantly, we find that each task maintains dedicated "super-heads" with minimal cross-task overlap, suggesting that syntactic and numerical competencies emerge from specialized yet partially reusable head circuits.

Title: LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark

Authors: Md. Atiqur Rahman, Sabrina Islam, Mushfiqul Haque Omi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12273
Pdf URL: https://arxiv.org/pdf/2505.12273
Copy Paste: [[2505.12273]] LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark(https://arxiv.org/abs/2505.12273)
Keywords: language model, llm, prompt
Abstract: Evaluating machine translation (MT) for low-resource languages poses a persistent challenge, primarily due to the limited availability of high quality reference translations. This issue is further exacerbated in languages with multiple dialects, where linguistic diversity and data scarcity hinder robust evaluation. Large Language Models (LLMs) present a promising solution through reference-free evaluation techniques; however, their effectiveness diminishes in the absence of dialect-specific context and tailored guidance. In this work, we propose a comprehensive framework that enhances LLM-based MT evaluation using a dialect guided approach. We extend the ONUBAD dataset by incorporating Sylheti-English sentence pairs, corresponding machine translations, and Direct Assessment (DA) scores annotated by native speakers. To address the vocabulary gap, we augment the tokenizer vocabulary with dialect-specific terms. We further introduce a regression head to enable scalar score prediction and design a dialect-guided (DG) prompting strategy. Our evaluation across multiple LLMs shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spearman correlation, along with improvements across other evaluation settings. The dataset and the code are available at this https URL.
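The Spearman correlation reported in this abstract measures how well predicted scores preserve the ranking of the human Direct Assessment scores. A minimal sketch of the statistic follows (Pearson correlation on average ranks, so ties are handled); the example scores are made up and do not come from the dataset.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

human_da = [80, 65, 90, 40]   # hypothetical human DA scores
predicted = [75, 70, 85, 30]  # hypothetical model scores
rho = spearman(human_da, predicted)
```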

Title: The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models

Authors: Linghan Huang, Haolin Jin, Zhaoge Bi, Pengyue Yang, Peizhou Zhao, Taozhao Chen, Xiongfei Wu, Lei Ma, Huaming Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12287
Pdf URL: https://arxiv.org/pdf/2505.12287
Copy Paste: [[2505.12287]] The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models(https://arxiv.org/abs/2505.12287)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Large language models (LLMs) have seen widespread applications across various domains, yet remain vulnerable to adversarial prompt injections. While most existing research on jailbreak attacks and hallucination phenomena has focused primarily on open-source models, we investigate the frontier of closed-source LLMs under multilingual attack scenarios. We present a first-of-its-kind integrated adversarial framework that leverages diverse attack techniques to systematically evaluate frontier proprietary solutions, including GPT-4o, DeepSeek-R1, Gemini-1.5-Pro, and Qwen-Max. Our evaluation spans six categories of security content in both English and Chinese, generating 38,400 responses across 32 types of jailbreak attacks. Attack success rate (ASR) is used as the quantitative metric to assess performance along three dimensions: prompt design, model architecture, and language environment. Our findings suggest that Qwen-Max is the most vulnerable, while GPT-4o shows the strongest defense. Notably, prompts in Chinese consistently yield higher ASRs than their English counterparts, and our novel Two-Sides attack technique proves to be the most effective across all models. This work highlights a dire need for language-aware alignment and robust cross-lingual defenses in LLMs, and we hope it will inspire researchers, developers, and policymakers toward more robust and inclusive AI systems.
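Attack success rate is the fraction of attack attempts judged successful, and it can be grouped along any of the dimensions the abstract names. The sketch below assumes a hypothetical record layout (`model`, `language`, `jailbroken`) and made-up records; how a response is judged as jailbroken is not specified here.

```python
from collections import defaultdict

def attack_success_rate(records, key):
    """ASR per group: successful attacks divided by attempts, grouped by `key`."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        wins[r[key]] += int(r["jailbroken"])
    return {k: wins[k] / totals[k] for k in totals}

# Made-up records for illustration only.
records = [
    {"model": "model-a", "language": "en", "jailbroken": True},
    {"model": "model-a", "language": "en", "jailbroken": False},
    {"model": "model-b", "language": "zh", "jailbroken": True},
    {"model": "model-b", "language": "zh", "jailbroken": True},
]
asr_by_language = attack_success_rate(records, "language")
```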

Title: Enhance Mobile Agents Thinking Process Via Iterative Preference Learning

Authors: Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, Bo An
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12299
Pdf URL: https://arxiv.org/pdf/2505.12299
Copy Paste: [[2505.12299]] Enhance Mobile Agents Thinking Process Via Iterative Preference Learning(https://arxiv.org/abs/2505.12299)
Keywords: gpt, agent
Abstract: The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRMs). To address these problems, we propose Iterative Preference Learning (IPL), which constructs a CoaT-tree through iterative sampling, scores leaf nodes using a rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q&A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard mobile GUI-agent benchmarks demonstrate that our agent, MobileIPL, outperforms strong baselines, including continual pretraining models such as OS-ATLAS and UI-TARS, achieving state-of-the-art performance and showing strong generalization to out-of-domain scenarios.
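The tree-based pair construction described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: leaf rewards are given, a parent's value is assumed to be the mean of its children, and sibling steps whose values differ by more than a margin are assumed to form (chosen, rejected) pairs. The `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    thought: str                     # text of one sampled reasoning step
    reward: Optional[float] = None   # rule-based on leaves, filled in for parents
    children: List["Node"] = field(default_factory=list)

def backpropagate(node: Node) -> float:
    """Assumed aggregation: a parent's value is the mean of its children's values."""
    if node.children:
        node.reward = sum(backpropagate(c) for c in node.children) / len(node.children)
    return node.reward

def preference_pairs(node: Node, margin: float = 0.0) -> List[Tuple[str, str]]:
    """Pair sibling steps whose backed-up values differ by more than `margin`."""
    pairs = []
    kids = sorted(node.children, key=lambda c: c.reward, reverse=True)
    for i, good in enumerate(kids):
        for bad in kids[i + 1:]:
            if good.reward - bad.reward > margin:
                pairs.append((good.thought, bad.thought))
    for child in node.children:
        pairs.extend(preference_pairs(child, margin))
    return pairs

tree = Node("root", children=[
    Node("A", children=[Node("a1", 1.0), Node("a2", 1.0)]),
    Node("B", children=[Node("b1", 0.0), Node("b2", 1.0)]),
])
backpropagate(tree)
pairs = preference_pairs(tree)
```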

Title: HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models

Authors: Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12300
Pdf URL: https://arxiv.org/pdf/2505.12300
Copy Paste: [[2505.12300]] HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models(https://arxiv.org/abs/2505.12300)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) on a mixture of diverse datasets poses challenges due to data imbalance and heterogeneity. Existing methods often address these issues across datasets (globally) but overlook the imbalance and heterogeneity within individual datasets (locally), which limits their effectiveness. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation during fine-tuning both across datasets (globally) and within each individual dataset (locally). HBO employs a bilevel optimization strategy with two types of actors: a Global Actor, which balances data sampling across different subsets of the training mixture, and several Local Actors, which optimize data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM's training state, which measure learning progress and relative performance improvement. We evaluate HBO on three LLM backbones across nine diverse tasks in multilingual and multitask setups. Results show that HBO consistently outperforms existing baselines, achieving significant accuracy gains. Our in-depth analysis further demonstrates that both the global actor and local actors of HBO effectively adjust data usage during fine-tuning. HBO provides a comprehensive solution to the challenges of data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets.

Title: Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection

Authors: Yuwei Zhang, Wenhao Yu, Shangbin Feng, Yifan Zhu, Letian Peng, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12306
Pdf URL: https://arxiv.org/pdf/2505.12306
Copy Paste: [[2505.12306]] Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection(https://arxiv.org/abs/2505.12306)
Keywords: language model, llm, prompt
Abstract: Despite significant advances in large language models (LLMs), their knowledge memorization capabilities remain underexplored, due to the lack of a standardized, high-quality test ground. In this paper, we introduce a novel, real-world, large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. Specifically, we propose WikiDYK, which leverages recently added, human-written facts from Wikipedia's "Did You Know..." entries. These entries are carefully selected by expert Wikipedia editors based on criteria such as verifiability and clarity. Each entry is converted into multiple question-answer pairs spanning diverse task formats, from easy cloze prompts to complex multi-hop questions. WikiDYK contains 12,290 facts and 77,180 questions and is seamlessly extensible with future updates from Wikipedia editors. Extensive experiments using continued pre-training reveal a surprising insight: despite their prevalence in modern LLMs, Causal Language Models (CLMs) demonstrate significantly weaker knowledge memorization capabilities than Bidirectional Language Models (BiLMs), exhibiting 23% lower accuracy in terms of reliability. To compensate for the smaller scale of current BiLMs, we introduce a modular collaborative framework that uses ensembles of BiLMs as external knowledge repositories integrated with LLMs. Experiments show that our framework further improves the reliability accuracy by up to 29.1%.

Title: ExpertSteer: Intervening in LLMs through Expert Knowledge

Authors: Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12313
Pdf URL: https://arxiv.org/pdf/2505.12313
Copy Paste: [[2505.12313]] ExpertSteer: Intervening in LLMs through Expert Knowledge(https://arxiv.org/abs/2505.12313)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities across various tasks, yet guiding them to follow desired behaviours during inference remains a significant challenge. Activation steering offers a promising method to control the generation process of LLMs by modifying their internal activations. However, existing methods commonly intervene in the model's behaviour using steering vectors generated by the model itself, which constrains their effectiveness to that specific model and excludes the possibility of leveraging powerful external expert models for steering. To address these limitations, we propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors, enabling intervention in any LLM. ExpertSteer transfers the knowledge from an expert model to a target LLM through a cohesive four-step process: first aligning representation dimensions with auto-encoders to enable cross-model transfer, then identifying intervention layer pairs based on mutual information analysis, next generating steering vectors from the expert model using Recursive Feature Machines, and finally applying these vectors on the identified layers during inference to selectively guide the target LLM without updating model parameters. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains. Experiments demonstrate that ExpertSteer significantly outperforms established baselines across diverse tasks at minimal cost.

Title: LLMSR@XLLM25: An Empirical Study of LLM for Structural Reasoning

Authors: Xinye Li, Mingqi Wan, Dianbo Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12328
Pdf URL: https://arxiv.org/pdf/2505.12328
Copy Paste: [[2505.12328]] LLMSR@XLLM25: An Empirical Study of LLM for Structural Reasoning(https://arxiv.org/abs/2505.12328)
Keywords: language model, llm, prompt
Abstract: We present Team asdfo123's submission to the LLMSR@XLLM25 shared task, which evaluates large language models on producing fine-grained, controllable, and interpretable reasoning processes. Systems must extract all problem conditions, decompose a chain of thought into statement-evidence pairs, and verify the logical validity of each pair. Leveraging only the off-the-shelf Meta-Llama-3-8B-Instruct, we craft a concise few-shot, multi-turn prompt that first enumerates all conditions and then guides the model to label, cite, and adjudicate every reasoning step. A lightweight post-processor based on regular expressions normalises spans and enforces the official JSON schema. Without fine-tuning, external retrieval, or ensembling, our method ranks 5th overall, achieving macro F1 scores on par with substantially more complex and resource-consuming pipelines. We conclude by analysing the strengths and limitations of our approach and outlining directions for future research in structural reasoning with LLMs. Our code is available at this https URL.

Title: UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models

Authors: Qizhou Chen, Dakan Wang, Taolin Zhang, Zaoming Yan, Chengsong You, Chengyu Wang, Xiaofeng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12345
Pdf URL: https://arxiv.org/pdf/2505.12345
Copy Paste: [[2505.12345]] UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models(https://arxiv.org/abs/2505.12345)
Keywords: language model, llm
Abstract: Model editing aims to enhance the accuracy and reliability of large language models (LLMs) by efficiently adjusting their internal parameters. Currently, most LLM editing datasets are confined to narrow knowledge domains and cover a limited range of editing evaluation. They often overlook the broad scope of editing demands and the diversity of ripple effects resulting from edits. In this context, we introduce UniEdit, a unified benchmark for LLM editing grounded in open-domain knowledge. First, we construct editing samples by selecting entities from 25 common domains across five major categories, utilizing the extensive triple knowledge available in open-domain knowledge graphs to ensure comprehensive coverage of the knowledge domains. To address the issues of generality and locality in editing, we design a Neighborhood Multi-hop Chain Sampling (NMCS) algorithm that samples subgraphs around a given piece of knowledge, so that the evaluation covers comprehensive ripple effects. Finally, we employ proprietary LLMs to convert the sampled knowledge subgraphs into natural language text, guaranteeing grammatical accuracy and syntactical diversity. Extensive statistical analysis confirms the scale, comprehensiveness, and diversity of our UniEdit benchmark. We conduct comprehensive experiments across multiple LLMs and editors, analyzing their performance to highlight strengths and weaknesses in editing across open knowledge domains and various evaluation criteria, thereby offering valuable insights for future research endeavors.

Title: Wisdom from Diversity: Bias Mitigation Through Hybrid Human-LLM Crowds

Authors: Axel Abels, Tom Lenaerts
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12349
Pdf URL: https://arxiv.org/pdf/2505.12349
Copy Paste: [[2505.12349]] Wisdom from Diversity: Bias Mitigation Through Hybrid Human-LLM Crowds(https://arxiv.org/abs/2505.12349)
Keywords: language model, llm
Abstract: Despite their performance, large language models (LLMs) can inadvertently perpetuate biases found in the data they are trained on. By analyzing LLM responses to bias-eliciting headlines, we find that these models often mirror human biases. To address this, we explore crowd-based strategies for mitigating bias through response aggregation. We first demonstrate that simply averaging responses from multiple LLMs, intended to leverage the "wisdom of the crowd", can exacerbate existing biases due to the limited diversity within LLM crowds. In contrast, we show that locally weighted aggregation methods more effectively leverage the wisdom of the LLM crowd, achieving both bias mitigation and improved accuracy. Finally, recognizing the complementary strengths of LLMs (accuracy) and humans (diversity), we demonstrate that hybrid crowds containing both significantly enhance performance and further reduce biases across ethnic and gender-related contexts.
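The contrast between simple averaging and locally weighted aggregation can be sketched as below. This is one possible instantiation, assumed for illustration only: each crowd member is weighted by the inverse of its past error on items close to the current one, with closeness given by a Gaussian kernel over a one-dimensional context feature. The function names, the kernel, and the numbers are all hypothetical and do not come from the paper.

```python
import math

def simple_average(responses):
    """Unweighted 'wisdom of the crowd' estimate."""
    return sum(responses) / len(responses)

def locally_weighted(responses, past_errors, past_contexts, context, bandwidth=1.0):
    """Weight each member by the inverse of its kernel-weighted past error near `context`."""
    weights = []
    for errors in past_errors:  # one list of past absolute errors per member
        num = den = 0.0
        for err, ctx in zip(errors, past_contexts):
            k = math.exp(-((ctx - context) ** 2) / (2 * bandwidth ** 2))
            num += k * err
            den += k
        weights.append(1.0 / (num / den + 1e-6))
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, responses)) / total

responses = [0.2, 0.8]                   # two members' answers for the new item
past_contexts = [0.0, 5.0]               # features of two earlier items
past_errors = [[0.1, 0.9], [0.9, 0.1]]   # member 0 was accurate near context 0
estimate = locally_weighted(responses, past_errors, past_contexts, context=0.0)
```

In this toy setup the weighted estimate moves toward the member that was accurate on similar past items, whereas the simple average treats both members equally.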

Title: CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement

Authors: Gauri Kholkar, Ratinder Ahuja
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12368
Pdf URL: https://arxiv.org/pdf/2505.12368
Copy Paste: [[2505.12368]] CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement(https://arxiv.org/abs/2505.12368)
Keywords: language model, prompt
Abstract: Prompt injection remains a major security risk for large language models. However, the efficacy of existing guardrail models in context-aware settings remains underexplored, as they often rely on static attack benchmarks. Additionally, they have over-defense tendencies. We introduce CAPTURE, a novel context-aware benchmark assessing both attack detection and over-defense tendencies with minimal in-domain examples. Our experiments reveal that current prompt injection guardrail models suffer from high false negatives in adversarial cases and excessive false positives in benign scenarios, highlighting critical limitations.

Title: From n-gram to Attention: How Model Architectures Learn and Propagate Bias in Language Modeling

Authors: Mohsinul Kabir, Tasfia Tahsin, Sophia Ananiadou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12381
Pdf URL: https://arxiv.org/pdf/2505.12381
Copy Paste: [[2505.12381]] From n-gram to Attention: How Model Architectures Learn and Propagate Bias in Language Modeling(https://arxiv.org/abs/2505.12381)
Keywords: language model
Abstract: Current research on bias in language models (LMs) predominantly focuses on data quality, with significantly less attention paid to model architecture and temporal influences of data. Even more critically, few studies systematically investigate the origins of bias. We propose a methodology grounded in comparative behavioral theory to interpret the complex interaction between training data and model architecture in bias propagation during language modeling. Building on recent work that relates transformers to n-gram LMs, we evaluate how data, model design choices, and temporal dynamics affect bias propagation. Our findings reveal that: (1) n-gram LMs are highly sensitive to context window size in bias propagation, while transformers demonstrate architectural robustness; (2) the temporal provenance of training data significantly affects bias; and (3) different model architectures respond differentially to controlled bias injection, with certain biases (e.g. sexual orientation) being disproportionately amplified. As language models become ubiquitous, our findings highlight the need for a holistic approach -- tracing bias to its origins across both data and model dimensions, not just symptoms, to mitigate harm.

Title: SLOT: Sample-specific Language Model Optimization at Test-time

Authors: Yang Hu, Xingyu Zhang, Xueji Fang, Zhiyang Chen, Xiao Wang, Huatian Zhang, Guojun Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12392
Pdf URL: https://arxiv.org/pdf/2505.12392
Copy Paste: [[2505.12392]] SLOT: Sample-specific Language Model Optimization at Test-time(https://arxiv.org/abs/2505.12392)
Keywords: language model, llm, prompt
Abstract: We propose SLOT (Sample-specific Language Model Optimization at Test-time), a novel and parameter-efficient test-time inference approach that enhances a language model's ability to respond accurately to individual prompts. Existing Large Language Models (LLMs) often struggle with complex instructions, leading to poor performance on those not well represented among general samples. To address this, SLOT conducts a few optimization steps at test time to update a lightweight sample-specific parameter vector. The vector is added to the final hidden layer before the output head, and enables efficient adaptation by caching the last-layer features during per-sample optimization. By minimizing the cross-entropy loss on the input prompt only, SLOT helps the model better align with and follow each given instruction. In experiments, we demonstrate that our method outperforms the compared models across multiple benchmarks and LLMs. For example, Qwen2.5-7B with SLOT achieves an accuracy gain of 8.6% on GSM8K (from 57.54% to 66.19%), while DeepSeek-R1-Distill-Llama-70B with SLOT achieves a SOTA accuracy of 68.69% on GPQA among 70B-level models. Our code is available at this https URL.
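The mechanism described in this abstract can be sketched in plain Python under simplifying assumptions: `H` stands for cached last-layer features at the prompt positions, `W` for the output-head weights, and `targets` for the next prompt tokens. These names, the toy numbers, the plain gradient-descent update, and the step size are assumptions for illustration, not details confirmed by the paper.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def prompt_loss_and_grad(H, W, targets, delta):
    """Mean next-token cross-entropy on the prompt and its gradient w.r.t. delta."""
    d = len(delta)
    grad = [0.0] * d
    loss = 0.0
    for h, t in zip(H, targets):
        x = [hi + di for hi, di in zip(h, delta)]      # shifted final hidden state
        logits = [sum(w * xj for w, xj in zip(row, x)) for row in W]
        p = softmax(logits)
        loss -= math.log(p[t])
        for v, row in enumerate(W):
            coeff = p[v] - (1.0 if v == t else 0.0)    # d(loss)/d(logit_v)
            for j in range(d):
                grad[j] += coeff * row[j]
    n = len(H)
    return loss / n, [g / n for g in grad]

def fit_delta(H, W, targets, steps=3, lr=0.5):
    """A few gradient steps on delta only; H and W stay fixed (cached)."""
    delta = [0.0] * len(H[0])
    for _ in range(steps):
        _, g = prompt_loss_and_grad(H, W, targets, delta)
        delta = [dv - lr * gv for dv, gv in zip(delta, g)]
    return delta

H = [[0.1, 0.2], [0.3, -0.1]]                  # toy cached features, 2 positions
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]     # toy output head, vocabulary of 3
targets = [0, 1]
loss_before, _ = prompt_loss_and_grad(H, W, targets, [0.0, 0.0])
delta = fit_delta(H, W, targets)
loss_after, _ = prompt_loss_and_grad(H, W, targets, delta)
```

Because only `delta` is updated and the features are cached, each step costs one pass through the output head rather than through the whole model, which is the efficiency argument the abstract makes.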

Title: Traversal Verification for Speculative Tree Decoding

Authors: Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12398
Pdf URL: https://arxiv.org/pdf/2505.12398
Copy Paste: [[2505.12398]] Traversal Verification for Speculative Tree Decoding(https://arxiv.org/abs/2505.12398)
Keywords: language model
Abstract: Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in parallel to determine whether the drafted tokens should be accepted or rejected. To enhance acceptance rates, existing frameworks typically construct token trees containing multiple candidates in each timestep. However, their reliance on token-level verification mechanisms introduces two critical limitations: First, the probability distribution of a sequence differs from that of individual tokens, leading to suboptimal acceptance length. Second, current verification schemes begin from the root node and proceed layer by layer in a top-down manner. Once a parent node is rejected, all its child nodes should be discarded, resulting in inefficient utilization of speculative candidates. This paper introduces Traversal Verification, a novel speculative decoding algorithm that fundamentally rethinks the verification paradigm through leaf-to-root traversal. Our approach considers the acceptance of the entire token sequence from the current node to the root, and preserves potentially valid subsequences that would be prematurely discarded by existing methods. We theoretically prove that the probability distribution obtained through Traversal Verification is identical to that of the target model, guaranteeing lossless inference while achieving substantial acceleration gains. Experimental results across different large language models and multiple tasks show that our method consistently improves acceptance length and throughput over existing methods.
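For context, the conventional top-down, token-level verification that this abstract criticizes can be sketched for a single drafted chain. This is the standard speculative sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the residual distribution and drop the rest); it is not the Traversal Verification algorithm itself. The function name and the dictionary-based distributions are assumptions for illustration.

```python
import random

def verify_chain(draft_tokens, draft_probs, target_probs, rng=random):
    """Top-down token-level verification of one drafted chain.

    draft_probs[i] and target_probs[i] map token -> probability at step i.
    Returns the accepted prefix, plus one resampled token if a rejection occurs.
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            accepted.append(tok)
            continue
        # Rejection: resample from max(0, p - q); all later drafted tokens are dropped.
        residual = {t: max(0.0, pt - q.get(t, 0.0)) for t, pt in p.items()}
        tokens = list(residual)
        accepted.append(rng.choices(tokens, weights=[residual[t] for t in tokens])[0])
        return accepted
    return accepted
```

The early `return` is the behaviour the abstract points to: once a token is rejected, everything drafted after it is discarded, even if those later tokens would have been acceptable.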

Title: The power of text similarity in identifying AI-LLM paraphrased documents: The case of BBC news articles and ChatGPT

Authors: Konstantinos Xylogiannopoulos, Petros Xanthopoulos, Panagiotis Karampelas, Georgios Bakamitsos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12405
Pdf URL: https://arxiv.org/pdf/2505.12405
Copy Paste: [[2505.12405]] The power of text similarity in identifying AI-LLM paraphrased documents: The case of BBC news articles and ChatGPT(https://arxiv.org/abs/2505.12405)
Keywords: gpt, llm, chat
Abstract: Text paraphrased by generative AI can be used for copyright infringement, and AI-paraphrased content can deprive original content creators of substantial revenue. Despite the recent surge in malicious use of generative AI, few academic publications research this threat. In this article, we demonstrate the ability of pattern-based similarity detection to recognize AI-paraphrased news. We propose an algorithmic scheme that not only detects whether an article is an AI paraphrase but, more importantly, identifies ChatGPT as the source of infringement. The proposed method is tested on a benchmark dataset created specifically for this task, comprising 2,224 real BBC articles across five news categories and 2,224 paraphrased articles created with ChatGPT. Results show that our pattern similarity-based method, which makes no use of deep learning, detects ChatGPT-assisted paraphrased articles with 96.23% accuracy, 96.25% precision, 96.21% sensitivity, 96.25% specificity, and a 96.23% F1 score.
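The five metrics quoted in this abstract all derive from a binary confusion matrix. The sketch below shows the standard definitions; the counts in the example are made up and are not the paper's confusion matrix.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),   # true negative rate
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical counts: 100 paraphrased and 100 original articles.
metrics = binary_metrics(tp=96, fp=4, tn=96, fn=4)
```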

Title: Table-R1: Region-based Reinforcement Learning for Table Understanding

Authors: Zhenhe Wu, Jian Yang, Jiaheng Liu, Xianjie Wu, Changzai Pan, Jie Zhang, Yu Zhao, Shuangyong Song, Yongxiang Li, Zhoujun Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12415
Pdf URL: https://arxiv.org/pdf/2505.12415
Copy Paste: [[2505.12415]] Table-R1: Region-based Reinforcement Learning for Table Understanding(https://arxiv.org/abs/2505.12415)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Tables present unique challenges for language models due to their structured row-column interactions, necessitating specialized approaches for effective comprehension. While large language models (LLMs) have demonstrated potential in table reasoning through prompting and techniques like chain-of-thought (CoT) and program-of-thought (PoT), optimizing their performance for table question answering remains underexplored. In this paper, we introduce region-based Table-R1, a novel reinforcement learning approach that enhances LLM table understanding by integrating region evidence into reasoning steps. Our method employs Region-Enhanced Supervised Fine-Tuning (RE-SFT) to guide models in identifying relevant table regions before generating answers, incorporating textual, symbolic, and program-based reasoning. Additionally, Table-Aware Group Relative Policy Optimization (TARPO) introduces a mixed reward system to dynamically balance region accuracy and answer correctness, with decaying region rewards and consistency penalties to align reasoning steps. Experiments show that Table-R1 achieves an average performance improvement of 14.36 points across multiple base models on three benchmark datasets, even outperforming baseline models with ten times the parameters, while TARPO reduces response token consumption by 67.5% compared to GRPO, significantly advancing LLM capabilities in efficient tabular reasoning.

Title: PSC: Extending Context Window of Large Language Models via Phase Shift Calibration

Authors: Wenqiao Zhu, Chao Xu, Lulu Wang, Jun Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12423
Pdf URL: https://arxiv.org/pdf/2505.12423
Copy Paste: [[2505.12423]] PSC: Extending Context Window of Large Language Models via Phase Shift Calibration(https://arxiv.org/abs/2505.12423)
Keywords: language model, llm
Abstract: Rotary Position Embedding (RoPE) is an efficient position encoding approach and is widely utilized in numerous large language models (LLMs). Recently, a lot of methods have been put forward to further expand the context window based on RoPE. The core concept of those methods is to predefine or search for a set of factors to rescale the base frequencies of RoPE. Nevertheless, it is quite a challenge for existing methods to predefine an optimal factor due to the exponential search space. In view of this, we introduce PSC (Phase Shift Calibration), a small module for calibrating the frequencies predefined by existing methods. With the employment of PSC, we demonstrate that many existing methods can be further enhanced, like PI, YaRN, and LongRoPE. We conducted extensive experiments across multiple models and tasks. The results demonstrate that (1) when PSC is enabled, the comparative reductions in perplexity increase as the context window size is varied from 16k, to 32k, and up to 64k. (2) Our approach is broadly applicable and exhibits robustness across a variety of models and tasks. The code can be found at this https URL.
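Note: the "factors to rescale the base frequencies of RoPE" mentioned in this abstract can be made concrete with a small sketch. The Python snippet below computes the standard RoPE base frequencies and applies one uniform rescaling factor in the style of position interpolation (PI). It does not implement PSC itself, whose calibration module the abstract does not specify; the function names and the context lengths are assumptions chosen for illustration.

```python
def rope_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE base frequencies: theta_i = base ** (-2i / head_dim)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def interpolate_frequencies(
    freqs: list[float], original_len: int, extended_len: int
) -> list[float]:
    """Uniformly rescale frequencies, PI-style.

    Dividing every frequency by s = extended_len / original_len is equivalent
    to multiplying every position index by 1 / s, so positions in the extended
    window map back into the range seen during training.
    """
    scale = extended_len / original_len
    return [f / scale for f in freqs]


# Illustrative sizes only (assumed, not taken from the paper).
freqs = rope_frequencies(head_dim=8)
scaled = interpolate_frequencies(freqs, original_len=4096, extended_len=16384)

# The rotation angle for a position is position * frequency.
last_position = 16383
print(last_position * scaled[0])  # same angle as position 4095.75 unscaled
```

Methods such as YaRN and LongRoPE choose different factors per frequency rather than one shared factor, which is why the abstract describes the space of possible factors as exponentially large.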

Title: Learning to Play Like Humans: A Framework for LLM Adaptation in Interactive Fiction Games

Authors: Jinming Zhang, Yunfei Long
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12439
Pdf URL: https://arxiv.org/pdf/2505.12439
Copy Paste: [[2505.12439]] Learning to Play Like Humans: A Framework for LLM Adaptation in Interactive Fiction Games(https://arxiv.org/abs/2505.12439)
Keywords: language model, llm, agent
Abstract: Interactive Fiction games (IF games) are where players interact through natural language commands. While recent advances in Artificial Intelligence agents have reignited interest in IF games as a domain for studying decision-making, existing approaches prioritize task-specific performance metrics over human-like comprehension of narrative context and gameplay logic. This work presents a cognitively inspired framework that guides Large Language Models (LLMs) to learn and play IF games systematically. Our proposed **L**earning to **P**lay **L**ike **H**umans (LPLH) framework integrates three key components: (1) structured map building to capture spatial and narrative relationships, (2) action learning to identify context-appropriate commands, and (3) feedback-driven experience analysis to refine decision-making over time. By aligning LLMs-based agents' behavior with narrative intent and commonsense constraints, LPLH moves beyond purely exploratory strategies to deliver more interpretable, human-like performance. Crucially, this approach draws on cognitive science principles to more closely simulate how human players read, interpret, and respond within narrative worlds. As a result, LPLH reframes the IF games challenge as a learning problem for LLMs-based agents, offering a new path toward robust, context-aware gameplay in complex text-based environments.

Title: Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment

Authors: Siyang Wu, Honglin Bao, Nadav Kunievsky, James A. Evans
Subjects: cs.CL, cs.CY, cs.DL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.12452
Pdf URL: https://arxiv.org/pdf/2505.12452
Copy Paste: [[2505.12452]] Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment(https://arxiv.org/abs/2505.12452)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) increasingly demonstrate signs of conceptual understanding, yet much of their internal knowledge remains latent, loosely structured, and difficult to access or evaluate. We propose self-questioning as a lightweight and scalable strategy to improve LLMs' understanding, particularly in domains where success depends on fine-grained semantic distinctions. To evaluate this approach, we introduce a challenging new benchmark of 1.3 million post-2015 computer science patent pairs, characterized by dense technical jargon and strategically complex writing. The benchmark centers on a pairwise differentiation task: can a model distinguish between closely related but substantively different inventions? We show that prompting LLMs to generate and answer their own questions - targeting the background knowledge required for the task - significantly improves performance. These self-generated questions and answers activate otherwise underutilized internal knowledge. Allowing LLMs to retrieve answers from external scientific texts further enhances performance, suggesting that model knowledge is compressed and lacks the full richness of the training data. We also find that chain-of-thought prompting and self-questioning converge, though self-questioning remains more effective for improving understanding of technical concepts. Notably, we uncover an asymmetry in prompting: smaller models often generate more fundamental, more open-ended, better-aligned questions for mid-sized models than large models with better understanding do, revealing a new strategy for cross-model collaboration. Altogether, our findings establish self-questioning as both a practical mechanism for automatically improving LLM comprehension, especially in domains with sparse and underrepresented knowledge, and a diagnostic probe of how internal and external knowledge are organized.

Title: Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations

Authors: Yuyang Ding, Dan Qiao, Juntao Li, Jiajie Xu, Pingfu Chao, Xiaofang Zhou, Min Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12454
Pdf URL: https://arxiv.org/pdf/2505.12454
Copy Paste: [[2505.12454]] Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations(https://arxiv.org/abs/2505.12454)
Keywords: language model
Abstract: Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation methods, enabling the automatic generation of training data by aligning text with external resources. Despite the many efforts in noise measurement methods, few works focus on the latent noise distribution between different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER from two aspects: (1) distant annotation techniques, which encompass both traditional rule-based methods and the innovative large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework addresses the challenges by distinctly categorizing them into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP), subsequently providing specialized solutions for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods.

Title: What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization

Authors: Weixiao Zhou, Junnan Zhu, Gengyao Li, Xianfu Cheng, Xinnian Liang, Feifei Zhai, Zhoujun Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12474
Pdf URL: https://arxiv.org/pdf/2505.12474
Copy Paste: [[2505.12474]] What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization(https://arxiv.org/abs/2505.12474)
Keywords: language model, llm, prompt
Abstract: In this work, we investigate the performance of LLMs on a new task that requires combining discussion with background knowledge for summarization. This aims to address the limitation of outside observer confusion in existing dialogue summarization systems due to their reliance solely on discussion information. To achieve this, we model the task output as background and opinion summaries and define two standardized summarization patterns. To support assessment, we introduce the first benchmark comprising high-quality samples consistently annotated by human experts and propose a novel hierarchical evaluation framework with fine-grained, interpretable metrics. We evaluate 12 LLMs under structured-prompt and self-reflection paradigms. Our findings reveal: (1) LLMs struggle with background summary retrieval, generation, and opinion summary integration. (2) Even top LLMs achieve less than 69% average performance across both patterns. (3) Current LLMs lack adequate self-evaluation and self-correction capabilities for this task.

Title: Enhancing Large Language Models with Reward-guided Tree Search for Knowledge Graph Question and Answering

Authors: Xiao Long, Liansheng Zhuang, Chen Shen, Shaotian Yan, Yifei Li, Shafei Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12476
Pdf URL: https://arxiv.org/pdf/2505.12476
Copy Paste: [[2505.12476]] Enhancing Large Language Models with Reward-guided Tree Search for Knowledge Graph Question and Answering(https://arxiv.org/abs/2505.12476)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recently, large language models (LLMs) have demonstrated impressive performance in Knowledge Graph Question Answering (KGQA) tasks, which aim to find answers based on knowledge graphs (KGs) for natural language questions. Existing LLMs-based KGQA methods typically follow the Graph Retrieval-Augmented Generation (GraphRAG) paradigm, which first retrieves reasoning paths from the large KGs, and then generates the answers based on them. However, these methods emphasize the exploration of new optimal reasoning paths in KGs while ignoring the exploitation of historical reasoning paths, which may lead to sub-optimal reasoning paths. Additionally, the complex semantics contained in questions may lead to the retrieval of inaccurate reasoning paths. To address these issues, this paper proposes a novel and training-free framework for KGQA tasks called Reward-guided Tree Search on Graph (RTSoG). RTSoG decomposes an original question into a series of simpler and well-defined sub-questions to handle the complex semantics. Then, a Self-Critic Monte Carlo Tree Search (SC-MCTS) guided by a reward model is introduced to iteratively retrieve weighted reasoning paths as contextual knowledge. Finally, it stacks the weighted reasoning paths according to their weights to generate the final answers. Extensive experiments on four datasets demonstrate the effectiveness of RTSoG. Notably, it achieves 8.7% and 7.0% performance improvement over the state-of-the-art method on the GrailQA and the WebQSP respectively.

Title: KG-QAGen: A Knowledge-Graph-Based Framework for Systematic Question Generation and Long-Context LLM Evaluation

Authors: Nikita Tatarinov, Vidhyakshaya Kannan, Haricharana Srinivasa, Arnav Raj, Harpreet Singh Anand, Varun Singh, Aditya Luthra, Ravij Lade, Agam Shah, Sudheer Chava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12495
Pdf URL: https://arxiv.org/pdf/2505.12495
Copy Paste: [[2505.12495]] KG-QAGen: A Knowledge-Graph-Based Framework for Systematic Question Generation and Long-Context LLM Evaluation(https://arxiv.org/abs/2505.12495)
Keywords: language model, llm
Abstract: The increasing context length of modern language models has created a need for evaluating their ability to retrieve and process information across extensive documents. While existing benchmarks test long-context capabilities, they often lack a structured way to systematically vary question complexity. We introduce KG-QAGen (Knowledge-Graph-based Question-Answer Generation), a framework that (1) extracts QA pairs at multiple complexity levels (2) by leveraging structured representations of financial agreements (3) along three key dimensions -- multi-hop retrieval, set operations, and answer plurality -- enabling fine-grained assessment of model performance across controlled difficulty levels. Using this framework, we construct a dataset of 20,139 QA pairs (the largest number among the long-context benchmarks) and open-source a part of it. We evaluate 13 proprietary and open-source LLMs and observe that even the best-performing models are struggling with set-based comparisons and multi-hop logical inference. Our analysis reveals systematic failure modes tied to semantic misinterpretation and inability to handle implicit relations.

Title: LM$^2$otifs : An Explainable Framework for Machine-Generated Texts Detection

Authors: Xu Zheng, Zhuomin Chen, Esteban Schafir, Sipeng Chen, Hojat Allah Salehi, Haifeng Chen, Farhad Shirani, Wei Cheng, Dongsheng Luo
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.12507
Pdf URL: https://arxiv.org/pdf/2505.12507
Copy Paste: [[2505.12507]] LM$^2$otifs : An Explainable Framework for Machine-Generated Texts Detection(https://arxiv.org/abs/2505.12507)
Keywords: language model
Abstract: The impressive ability of large language models to generate natural text across various tasks has led to critical challenges in authorship authentication. Although numerous detection methods have been developed to differentiate between machine-generated texts (MGT) and human-generated texts (HGT), the explainability of these methods remains a significant gap. Traditional explainability techniques often fall short in capturing the complex word relationships that distinguish HGT from MGT. To address this limitation, we present LM$^2$otifs, a novel explainable framework for MGT detection. Inspired by probabilistic graphical models, we provide a theoretical rationale for the effectiveness. LM$^2$otifs utilizes eXplainable Graph Neural Networks to achieve both accurate detection and interpretability. The LM$^2$otifs pipeline operates in three key stages: first, it transforms text into graphs based on word co-occurrence to represent lexical dependencies; second, graph neural networks are used for prediction; and third, a post-hoc explainability method extracts interpretable motifs, offering multi-level explanations from individual words to sentence structures. Extensive experiments on multiple benchmark datasets demonstrate the comparable performance of LM$^2$otifs. The empirical evaluation of the extracted explainable motifs confirms their effectiveness in differentiating HGT and MGT. Furthermore, qualitative analysis reveals distinct and visible linguistic fingerprints characteristic of MGT.

Title: DS-ProGen: A Dual-Structure Deep Language Model for Functional Protein Design

Authors: Yanting Li, Jiyue Jiang, Zikang Wang, Ziqian Lin, Dongchen He, Yuheng Shan, Yanruisheng Shao, Jiayi Li, Xiangyu Shi, Jiuming Wang, Yanyu Chen, Yimin Fan, Han Li, Yu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12511
Pdf URL: https://arxiv.org/pdf/2505.12511
Copy Paste: [[2505.12511]] DS-ProGen: A Dual-Structure Deep Language Model for Functional Protein Design(https://arxiv.org/abs/2505.12511)
Keywords: language model
Abstract: Inverse Protein Folding (IPF) is a critical subtask in the field of protein design, aiming to engineer amino acid sequences capable of folding correctly into a specified three-dimensional (3D) conformation. Although substantial progress has been achieved in recent years, existing methods generally rely on either backbone coordinates or molecular surface features alone, which restricts their ability to fully capture the complex chemical and geometric constraints necessary for precise sequence prediction. To address this limitation, we present DS-ProGen, a dual-structure deep language model for functional protein design, which integrates both backbone geometry and surface-level representations. By incorporating backbone coordinates as well as surface chemical and geometric descriptors into a next-amino-acid prediction paradigm, DS-ProGen is able to generate functionally relevant and structurally stable sequences while satisfying both global and local conformational constraints. On the PRIDE dataset, DS-ProGen attains the current state-of-the-art recovery rate of 61.47%, demonstrating the synergistic advantage of multi-modal structural encoding in protein design. Furthermore, DS-ProGen excels in predicting interactions with a variety of biological partners, including ligands, ions, and RNA, confirming its robust functional retention capabilities.

Title: ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents

Authors: Navid Madani, Rohini Srihari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12531
Pdf URL: https://arxiv.org/pdf/2505.12531
Copy Paste: [[2505.12531]] ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents(https://arxiv.org/abs/2505.12531)
Keywords: language model, llm, prompt, chat, agent
Abstract: Large language models (LLMs) increasingly power mental-health chatbots, yet the field still lacks a scalable, theory-grounded way to decide which model is most effective to deploy. We present ESC-Judge, the first end-to-end evaluation framework that (i) grounds head-to-head comparisons of emotional-support LLMs in Clara Hill's established Exploration-Insight-Action counseling model, providing a structured and interpretable view of performance, and (ii) fully automates the evaluation pipeline at scale. ESC-Judge operates in three stages: first, it synthesizes realistic help-seeker roles by sampling empirically salient attributes such as stressors, personality, and life history; second, it has two candidate support agents conduct separate sessions with the same role, isolating model-specific strategies; and third, it asks a specialized judge LLM to express pairwise preferences across rubric-anchored skills that span the Exploration, Insight, and Action spectrum. In our study, ESC-Judge matched PhD-level annotators on 85 percent of Exploration, 83 percent of Insight, and 86 percent of Action decisions, demonstrating human-level reliability at a fraction of the cost. All code, prompts, synthetic roles, transcripts, and judgment scripts are released to promote transparent progress in emotionally supportive AI.

Title: Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE

Authors: Varvara Arzt, Allan Hanbury, Michael Wiegand, Gábor Recski, Terra Blevins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12533
Pdf URL: https://arxiv.org/pdf/2505.12533
Copy Paste: [[2505.12533]] Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE(https://arxiv.org/abs/2505.12533)
Keywords: language model
Abstract: Analysing the generalisation capabilities of relation extraction (RE) models is crucial for assessing whether they learn robust relational patterns or rely on spurious correlations. Our cross-dataset experiments find that RE models struggle with unseen data, even within similar domains. Notably, higher intra-dataset performance does not indicate better transferability, instead often signaling overfitting to dataset-specific artefacts. Our results also show that data quality, rather than lexical similarity, is key to robust transfer, and the choice of optimal adaptation strategy depends on the quality of data available: while fine-tuning yields the best cross-dataset performance with high-quality data, few-shot in-context learning (ICL) is more effective with noisier data. However, even in these cases, zero-shot baselines occasionally outperform all cross-dataset results. Structural issues in RE benchmarks, such as single-relation per sample constraints and non-standardised negative class definitions, further hinder model transferability.

Title: Disambiguation in Conversational Question Answering in the Era of LLM: A Survey

Authors: Md Mehrab Tanjim, Yeonjun In, Xiang Chen, Victor S. Bursztyn, Ryan A. Rossi, Sungchul Kim, Guang-Jie Ren, Vaishnavi Muppala, Shun Jiang, Yongsung Kim, Chanyoung Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12543
Pdf URL: https://arxiv.org/pdf/2505.12543
Copy Paste: [[2505.12543]] Disambiguation in Conversational Question Answering in the Era of LLM: A Survey(https://arxiv.org/abs/2505.12543)
Keywords: language model, llm
Abstract: Ambiguity remains a fundamental challenge in Natural Language Processing (NLP) due to the inherent complexity and flexibility of human language. With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. In the context of Conversational Question Answering (CQA), this paper explores the definition, forms, and implications of ambiguity for language driven systems, particularly in the context of LLMs. We define key terms and concepts, categorize various disambiguation approaches enabled by LLMs, and provide a comparative analysis of their advantages and disadvantages. We also explore publicly available datasets for benchmarking ambiguity detection and resolution techniques and highlight their relevance for ongoing research. Finally, we identify open problems and future research directions, proposing areas for further investigation. By offering a comprehensive review of current research on ambiguities and disambiguation with LLMs, we aim to contribute to the development of more robust and reliable language systems.

Title: Towards Reliable and Interpretable Traffic Crash Pattern Prediction and Safety Interventions Using Customized Large Language Models

Authors: Yang Zhao, Pu Wang, Yibo Zhao, Hongru Du, Hao (Frank) Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12545
Pdf URL: https://arxiv.org/pdf/2505.12545
Copy Paste: [[2505.12545]] Towards Reliable and Interpretable Traffic Crash Pattern Prediction and Safety Interventions Using Customized Large Language Models(https://arxiv.org/abs/2505.12545)
Keywords: language model, llm
Abstract: Predicting crash events is crucial for understanding crash distributions and their contributing factors, thereby enabling the design of proactive traffic safety policy interventions. However, existing methods struggle to interpret the complex interplay among various sources of traffic crash data, including numeric characteristics, textual reports, crash imagery, environmental conditions, and driver behavior records. As a result, they often fail to capture the rich semantic information and intricate interrelationships embedded in these diverse data sources, limiting their ability to identify critical crash risk factors. In this research, we propose TrafficSafe, a framework that adapts LLMs to reframe crash prediction and feature attribution as text-based reasoning. A multi-modal crash dataset including 58,903 real-world reports, together with the associated infrastructure, environmental, driver, and vehicle information, is collected and textualized into the TrafficSafe Event Dataset. By customizing and fine-tuning LLMs on this dataset, the TrafficSafe LLM achieves a 42% average improvement in F1-score over baselines. To interpret these predictions and uncover contributing factors, we introduce TrafficSafe Attribution, a sentence-level feature attribution framework enabling conditional risk analysis. Findings show that alcohol-impaired driving is the leading factor in severe crashes, with aggressive and impairment-related behaviors contributing nearly twice as much to severe crashes as other driver behaviors. Furthermore, TrafficSafe Attribution highlights pivotal features during model training, guiding strategic crash data collection for iterative performance improvements. The proposed TrafficSafe offers a transformative leap in traffic safety research, providing a blueprint for translating advanced AI technologies into responsible, actionable, and life-saving outcomes.

Title: Extracting memorized pieces of (copyrighted) books from open-weight language models

Authors: A. Feder Cooper, Aaron Gokaslan, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12546
Pdf URL: https://arxiv.org/pdf/2505.12546
Copy Paste: [[2505.12546]] Extracting memorized pieces of (copyrighted) books from open-weight language models(https://arxiv.org/abs/2505.12546)
Keywords: language model, llm
Abstract: Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression. Drawing on adversarial ML and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we leverage a recent probabilistic extraction technique to extract pieces of the Books3 dataset from 13 open-weight LLMs. Through numerous experiments, we show that it's possible to extract substantial parts of at least some books from different LLMs. This is evidence that the LLMs have memorized the extracted text; this memorized content is copied inside the model parameters. But the results are complicated: the extent of memorization varies both by model and by book. With our specific experiments, we find that the largest LLMs don't memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B memorizes some books, like Harry Potter and 1984, almost entirely. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
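The abstract does not spell out the probabilistic extraction technique, so the following is only a sketch under stated assumptions: that the probability of emitting a target passage is the product of its per-token probabilities, and that samples are drawn independently. The function names `sequence_probability` and `prob_extracted_within` are hypothetical.

```python
import math

def sequence_probability(token_logprobs):
    """Probability of generating a whole passage, given per-token log-probabilities.

    Assumes the passage probability factorizes into per-token conditionals.
    """
    return math.exp(sum(token_logprobs))

def prob_extracted_within(p_single, n_samples):
    """Chance that at least one of n independent samples reproduces the passage."""
    return 1.0 - (1.0 - p_single) ** n_samples

# Hypothetical log-probabilities for a four-token continuation.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.95), math.log(0.7)]
p = sequence_probability(logprobs)
print(p, prob_extracted_within(p, 10))
```

Working in log space avoids numerical underflow when passages are long, which is why the per-token values are summed before exponentiating.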

Title: Enriching Patent Claim Generation with European Patent Dataset

Authors: Lekang Jiang, Chengzu Li, Stephan Goetz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12568
Pdf URL: https://arxiv.org/pdf/2505.12568
Copy Paste: [[2505.12568]] Enriching Patent Claim Generation with European Patent Dataset(https://arxiv.org/abs/2505.12568)
Keywords: language model, gpt, llm
Abstract: Drafting patent claims is time-intensive, costly, and requires professional skill. Therefore, researchers have investigated large language models (LLMs) to assist inventors in writing claims. However, existing work has largely relied on datasets from the United States Patent and Trademark Office (USPTO). To enlarge research scope regarding various jurisdictions, drafting conventions, and legal standards, we introduce EPD, a European patent dataset. EPD presents rich textual data and structured metadata to support multiple patent-related tasks, including claim generation. This dataset enriches the field in three critical aspects: (1) Jurisdictional diversity: Patents from different offices vary in legal and drafting conventions. EPD fills a critical gap by providing a benchmark for European patents to enable more comprehensive evaluation. (2) Quality improvement: EPD offers high-quality granted patents with finalized and legally approved texts, whereas others consist of patent applications that are unexamined or provisional. Experiments show that LLMs fine-tuned on EPD significantly outperform those trained on previous datasets and even GPT-4o in claim quality and cross-domain generalization. (3) Real-world simulation: We propose a difficult subset of EPD to better reflect real-world challenges of claim generation. Results reveal that all tested LLMs perform substantially worse on these challenging samples, which highlights the need for future research.

Title: Measuring Information Distortion in Hierarchical Ultra long Novel Generation: The Optimal Expansion Ratio

Authors: Hanwen Shen, Ting Ying
Subjects: cs.CL, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2505.12572
Pdf URL: https://arxiv.org/pdf/2505.12572
Copy Paste: [[2505.12572]] Measuring Information Distortion in Hierarchical Ultra long Novel Generation: The Optimal Expansion Ratio(https://arxiv.org/abs/2505.12572)
Keywords: language model, llm
Abstract: Writing novels with Large Language Models (LLMs) raises a critical question: how much human-authored outline is necessary to generate high-quality million-word novels? While frameworks such as DOME, Plan&Write, and Long Writer have improved stylistic coherence and logical consistency, they primarily target shorter novels (10k--100k words), leaving ultra-long generation largely unexplored. Drawing on insights from recent text compression methods like LLMZip and LLM2Vec, we conduct an information-theoretic analysis that quantifies distortion occurring when LLMs compress and reconstruct ultra-long novels under varying compression-expansion ratios. We introduce a hierarchical two-stage generation pipeline (outline -> detailed outline -> manuscript) and find an optimal outline length that balances information preservation with human effort. Through extensive experimentation with Chinese novels, we establish that a two-stage hierarchical outline approach significantly reduces semantic distortion compared to single-stage methods. Our findings provide empirically-grounded guidance for authors and researchers collaborating with LLMs to create million-word novels.

Title: Improving Multilingual Language Models by Aligning Representations through Steering

Authors: Omar Mahmoud, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12584
Pdf URL: https://arxiv.org/pdf/2505.12584
Copy Paste: [[2505.12584]] Improving Multilingual Language Models by Aligning Representations through Steering(https://arxiv.org/abs/2505.12584)
Keywords: language model, llm, prompt
Abstract: In this paper, we investigate how large language models (LLMs) process non-English tokens within their layer representations, an open question despite significant advancements in the field. Using representation steering, specifically by adding a learned vector to a single model layer's activations, we demonstrate that steering a single model layer can notably enhance performance. Our analysis shows that this approach achieves results comparable to translation baselines and surpasses state-of-the-art prompt optimization methods. Additionally, we highlight how advanced techniques like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) improve multilingual capabilities by altering representation spaces. We further illustrate how these methods align with our approach to reshaping LLM layer representations.
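The core operation described above, adding a vector to one layer's activations, can be illustrated with a minimal sketch. It uses plain Python lists rather than a real model; the function name `steer`, the `scale` parameter, and the toy values are assumptions for illustration, and how the vector is learned is not shown.

```python
def steer(activations, steering_vector, scale=1.0):
    """Add scale * steering_vector to every token's hidden state at one layer.

    activations: list of per-token hidden states (each a list of floats).
    steering_vector: one vector with the same width as a hidden state.
    """
    return [
        [h + scale * v for h, v in zip(token_state, steering_vector)]
        for token_state in activations
    ]

# Toy example: two tokens, hidden size 2 (values are hypothetical).
layer_acts = [[0.0, 1.0], [2.0, -1.0]]
vec = [0.5, -0.5]
steered = steer(layer_acts, vec, scale=2.0)
print(steered)
```

In a real model the same addition would typically be applied inside the forward pass at the chosen layer, leaving all other layers and all weights unchanged.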

Title: CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling

Authors: Aditeya Baral, Allen George Ajith, Roshan Nayak, Mrityunjay Abhijeet Bhanja
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12587
Pdf URL: https://arxiv.org/pdf/2505.12587
Copy Paste: [[2505.12587]] CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling(https://arxiv.org/abs/2505.12587)
Keywords: language model
Abstract: Code-mixed languages, characterized by frequent within-sentence language transitions, present structural challenges that standard language models fail to address. In this work, we propose CMLFormer, an enhanced multi-layer dual-decoder Transformer with a shared encoder and synchronized decoder cross-attention, designed to model the linguistic and semantic dynamics of code-mixed text. CMLFormer is pre-trained on an augmented Hinglish corpus with switching point and translation annotations with multiple new objectives specifically aimed at capturing switching behavior, cross-lingual structure, and code-mixing complexity. Our experiments show that CMLFormer improves F1 score, precision, and accuracy over other approaches on the HASOC-2021 benchmark under select pre-training setups. Attention analyses further show that it can identify and attend to switching points, validating its sensitivity to code-mixed structure. These results demonstrate the effectiveness of CMLFormer's architecture and multi-task pre-training strategy for modeling code-mixed languages.

Title: PromptPrism: A Linguistically-Inspired Taxonomy for Prompts

Authors: Sullam Jeoung, Yueyan Chen, Yi Zhang, Shuai Wang, Haibo Ding, Lin Lee Cheong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12592
Pdf URL: https://arxiv.org/pdf/2505.12592
Copy Paste: [[2505.12592]] PromptPrism: A Linguistically-Inspired Taxonomy for Prompts(https://arxiv.org/abs/2505.12592)
Keywords: language model, llm, prompt
Abstract: Prompts are the interface for eliciting the capabilities of large language models (LLMs). Understanding their structure and components is critical for analyzing LLM behavior and optimizing performance. However, the field lacks a comprehensive framework for systematic prompt analysis and understanding. We introduce PromptPrism, a linguistically-inspired taxonomy that enables prompt analysis across three hierarchical levels: functional structure, semantic component, and syntactic pattern. We show the practical utility of PromptPrism by applying it to three applications: (1) a taxonomy-guided prompt refinement approach that automatically improves prompt quality and enhances model performance across a range of tasks; (2) a multi-dimensional dataset profiling method that extracts and aggregates structural, semantic, and syntactic characteristics from prompt datasets, enabling comprehensive analysis of prompt distributions and patterns; (3) a controlled experimental framework for prompt sensitivity analysis by quantifying the impact of semantic reordering and delimiter modifications on LLM performance. Our experimental results validate the effectiveness of our taxonomy across these applications, demonstrating that PromptPrism provides a foundation for refining, profiling, and analyzing prompts.

Title: AD-AGENT: A Multi-agent Framework for End-to-end Anomaly Detection

Authors: Tiankai Yang, Junjun Liu, Wingchun Siu, Jiahang Wang, Zhuangzhuang Qian, Chanjuan Song, Cheng Cheng, Xiyang Hu, Yue Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12594
Pdf URL: https://arxiv.org/pdf/2505.12594
Copy Paste: [[2505.12594]] AD-AGENT: A Multi-agent Framework for End-to-end Anomaly Detection(https://arxiv.org/abs/2505.12594)
Keywords: llm, agent
Abstract: Anomaly detection (AD) is essential in areas such as fraud detection, network monitoring, and scientific research. However, the diversity of data modalities and the increasing number of specialized AD libraries pose challenges for non-expert users who lack in-depth library-specific knowledge and advanced programming skills. To tackle this, we present AD-AGENT, an LLM-driven multi-agent framework that turns natural-language instructions into fully executable AD pipelines. AD-AGENT coordinates specialized agents for intent parsing, data preparation, library and model selection, documentation mining, and iterative code generation and debugging. Using a shared short-term workspace and a long-term cache, the agents integrate popular AD libraries like PyOD, PyGOD, and TSLib into a unified workflow. Experiments demonstrate that AD-AGENT produces reliable scripts and recommends competitive models across libraries. The system is open-sourced to support further research and practical applications in AD.

Title: Think Before You Attribute: Improving the Performance of LLMs Attribution Systems

Authors: João Eduardo Batista, Emil Vatai, Mohamed Wahib
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.12621
Pdf URL: https://arxiv.org/pdf/2505.12621
Copy Paste: [[2505.12621]] Think Before You Attribute: Improving the Performance of LLMs Attribution Systems(https://arxiv.org/abs/2505.12621)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly applied in various science domains, yet their broader adoption remains constrained by a critical challenge: the lack of trustworthy, verifiable outputs. Current LLMs often generate answers without reliable source attribution, or worse, with incorrect attributions, posing a barrier to their use in scientific and high-stakes settings, where traceability and accountability are non-negotiable. To be reliable, attribution systems must be highly accurate and retrieve short spans, i.e., attribute to a sentence within a document rather than to a whole document. We propose a sentence-level pre-attribution step for Retrieval-Augmented Generation (RAG) systems that classifies sentences into three categories: not attributable, attributable to a single quote, and attributable to multiple quotes. By separating sentences before attribution, a proper attribution method can be selected for the type of sentence, or the attribution can be skipped altogether. Our results indicate that classifiers are well-suited for this task. In this work, we propose a pre-attribution step to reduce the computational complexity of attribution, provide a clean version of the HAGRID dataset, and provide an end-to-end attribution system that works out of the box.
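The three-way routing described above can be sketched as follows. This is not the authors' system: `toy_classify` is a hypothetical stand-in for a trained classifier (it merely counts bracketed citation markers), and the two attribution functions are placeholder stubs.

```python
from enum import Enum

class Attribution(Enum):
    NONE = "not attributable"
    SINGLE = "attributable to a single quote"
    MULTIPLE = "attributable to multiple quotes"

def route_sentence(sentence, classify, attribute_single, attribute_multi):
    """Pick an attribution method by sentence type, or skip attribution."""
    label = classify(sentence)
    if label is Attribution.NONE:
        return []  # attribution skipped altogether
    if label is Attribution.SINGLE:
        return [attribute_single(sentence)]
    return attribute_multi(sentence)

def toy_classify(sentence):
    # Hypothetical heuristic standing in for a learned classifier.
    markers = sentence.count("[")
    if markers == 0:
        return Attribution.NONE
    return Attribution.SINGLE if markers == 1 else Attribution.MULTIPLE

def stub_single(sentence):
    return "quote-1"  # placeholder for a cheap single-quote retriever

def stub_multi(sentence):
    return ["quote-1", "quote-2"]  # placeholder for a costlier multi-quote method

print(route_sentence("The sky is blue [1].", toy_classify, stub_single, stub_multi))
```

The point of the design is that the expensive multi-quote path runs only for sentences that need it, which is where the reduction in computational cost would come from.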

Title: R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model

Authors: Ali Naseh, Harsh Chaudhari, Jaechul Roh, Mingshi Wu, Alina Oprea, Amir Houmansadr
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12625
Pdf URL: https://arxiv.org/pdf/2505.12625
Copy Paste: [[2505.12625]] R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model(https://arxiv.org/abs/2505.12625)
Keywords: language model, llm, prompt
Abstract: DeepSeek recently released R1, a high-performing large language model (LLM) optimized for reasoning tasks. Despite its efficient training pipeline, R1 achieves competitive performance, even surpassing leading reasoning models like OpenAI's o1 on several benchmarks. However, emerging reports suggest that R1 refuses to answer certain prompts related to politically sensitive topics in China. While existing LLMs often implement safeguards to avoid generating harmful or offensive outputs, R1 represents a notable shift - exhibiting censorship-like behavior on politically charged queries. In this paper, we investigate this phenomenon by first introducing a large-scale set of heavily curated prompts that get censored by R1, covering a range of politically sensitive topics, but are not censored by other models. We then conduct a comprehensive analysis of R1's censorship patterns, examining their consistency, triggers, and variations across topics, prompt phrasing, and context. Beyond English-language queries, we explore censorship behavior in other languages. We also investigate the transferability of censorship to models distilled from the R1 language model. Finally, we propose techniques for bypassing or removing this censorship. Our findings reveal possible additional censorship integration likely shaped by design choices during training or alignment, raising concerns about transparency, bias, and governance in language model deployment.

Title: Revealing the Deceptiveness of Knowledge Editing: A Mechanistic Analysis of Superficial Editing

Authors: Jiakuan Xie, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12636
Pdf URL: https://arxiv.org/pdf/2505.12636
Copy Paste: [[2505.12636]] Revealing the Deceptiveness of Knowledge Editing: A Mechanistic Analysis of Superficial Editing(https://arxiv.org/abs/2505.12636)
Keywords: language model
Abstract: Knowledge editing, which aims to update the knowledge encoded in language models, can be deceptive. Despite the fact that many existing knowledge editing algorithms achieve near-perfect performance on conventional metrics, the models edited by them are still prone to generating original knowledge. This paper introduces the concept of "superficial editing" to describe this phenomenon. Our comprehensive evaluation reveals that this issue presents a significant challenge to existing algorithms. Through systematic investigation, we identify and validate two key factors contributing to this issue: (1) the residual stream at the last subject position in earlier layers and (2) specific attention modules in later layers. Notably, certain attention heads in later layers, along with specific left singular vectors in their output matrices, encapsulate the original knowledge and exhibit a causal relationship with superficial editing. Furthermore, we extend our analysis to the task of superficial unlearning, where we observe consistent patterns in the behavior of specific attention heads and their corresponding left singular vectors, thereby demonstrating the robustness and broader applicability of our methodology and conclusions. Our code is available here.

Title: Know3-RAG: A Knowledge-aware RAG Framework with Adaptive Retrieval, Generation, and Filtering

Authors: Xukai Liu, Ye Liu, Shiwen Wu, Yanghai Zhang, Yihao Yuan, Kai Zhang, Qi Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12662
Pdf URL: https://arxiv.org/pdf/2505.12662
Copy Paste: [[2505.12662]] Know3-RAG: A Knowledge-aware RAG Framework with Adaptive Retrieval, Generation, and Filtering(https://arxiv.org/abs/2505.12662)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Recent advances in large language models (LLMs) have led to impressive progress in natural language generation, yet their tendency to produce hallucinated or unsubstantiated content remains a critical concern. To improve factual reliability, Retrieval-Augmented Generation (RAG) integrates external knowledge during inference. However, existing RAG systems face two major limitations: (1) unreliable adaptive control due to limited external knowledge supervision, and (2) hallucinations caused by inaccurate or irrelevant references. To address these issues, we propose Know3-RAG, a knowledge-aware RAG framework that leverages structured knowledge from knowledge graphs (KGs) to guide three core stages of the RAG process, including retrieval, generation, and filtering. Specifically, we introduce a knowledge-aware adaptive retrieval module that employs KG embedding to assess the confidence of the generated answer and determine retrieval necessity, a knowledge-enhanced reference generation strategy that enriches queries with KG-derived entities to improve generated reference relevance, and a knowledge-driven reference filtering mechanism that ensures semantic alignment and factual accuracy of references. Experiments on multiple open-domain QA benchmarks demonstrate that Know3-RAG consistently outperforms strong baselines, significantly reducing hallucinations and enhancing answer reliability.

Title: Shadow-FT: Tuning Instruct via Base

Authors: Taiqiang Wu, Runming Yang, Jiayi Li, Pengfei Hu, Ngai Wong, Yujiu Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12716
Pdf URL: https://arxiv.org/pdf/2505.12716
Copy Paste: [[2505.12716]] Shadow-FT: Tuning Instruct via Base(https://arxiv.org/abs/2505.12716)
Keywords: language model, llm
Abstract: Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. However, we observe that directly tuning the INSTRUCT (i.e., instruction-tuned) models often leads to marginal improvements and even performance degeneration. Notably, the paired BASE models, the foundation for these INSTRUCT variants, contain highly similar weight values (i.e., differing by less than 2% on average for Llama 3.1 8B). Therefore, we propose a novel Shadow-FT framework to tune the INSTRUCT models by leveraging the corresponding BASE models. The key insight is to fine-tune the BASE model, and then directly graft the learned weight updates to the INSTRUCT model. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance. We conduct extensive experiments on tuning mainstream LLMs, such as the Qwen 3 and Llama 3 series, and evaluate them across 19 benchmarks covering coding, reasoning, and mathematical tasks. Experimental results demonstrate that Shadow-FT consistently outperforms conventional full-parameter and parameter-efficient tuning approaches. Further analyses indicate that Shadow-FT can be applied to multimodal large language models (MLLMs) and combined with direct preference optimization (DPO). Code and weights are available on GitHub.
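The grafting step described above amounts to adding the update learned on the BASE model to the INSTRUCT weights. A minimal sketch follows, assuming weights are stored as dicts of parameter name to a flat list of floats; the function name `graft_updates` and the toy numbers are hypothetical, and the fine-tuning itself is not shown.

```python
def graft_updates(base, tuned_base, instruct):
    """Return INSTRUCT weights plus the delta learned while tuning BASE.

    For each parameter: grafted = instruct + (tuned_base - base).
    """
    grafted = {}
    for name, w_instruct in instruct.items():
        delta = [t - b for t, b in zip(tuned_base[name], base[name])]
        grafted[name] = [w + d for w, d in zip(w_instruct, delta)]
    return grafted

# Toy weights (hypothetical values) for a single parameter tensor.
base = {"layer.weight": [1.0, 2.0, 3.0]}
tuned_base = {"layer.weight": [1.5, 2.0, 2.0]}
instruct = {"layer.weight": [1.25, 2.5, 3.0]}

shadow = graft_updates(base, tuned_base, instruct)
print(shadow)
```

Because only a difference of existing tensors is added, the grafted model has exactly the same parameter count and shapes as the INSTRUCT model, consistent with the claim that no additional parameters are introduced.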

Title: ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving

Authors: Haoyuan Wu, Xueyi Chen, Rui Ming, Jilong Gao, Shoubo Hu, Zhuolun He, Bei Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12717
Pdf URL: https://arxiv.org/pdf/2505.12717
Copy Paste: [[2505.12717]] ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving(https://arxiv.org/abs/2505.12717)
Keywords: language model, llm, chain-of-thought, tree-of-thought
Abstract: Large language models (LLMs) demonstrate significant reasoning capabilities, particularly through long chain-of-thought (CoT) processes, which can be elicited by reinforcement learning (RL). However, prolonged CoT reasoning presents limitations, primarily verbose outputs due to excessive introspection. The reasoning process in these LLMs often appears to follow a trial-and-error methodology rather than a systematic, logical deduction. In contrast, tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as an exploration within a tree structure. This reasoning structure facilitates the parallel generation and evaluation of multiple reasoning branches, allowing for the active identification, assessment, and pruning of unproductive paths. This process can potentially lead to improved performance and reduced token costs. Building upon the long CoT capability of LLMs, we introduce tree-of-thoughts RL (ToTRL), a novel on-policy RL framework with a rule-based reward. ToTRL is designed to guide LLMs in developing the parallel ToT strategy based on the sequential CoT strategy. Furthermore, we employ LLMs as players in a puzzle game during the ToTRL training process. Solving puzzle games inherently necessitates exploring interdependent choices and managing multiple constraints, which requires the construction and exploration of a thought tree, providing challenging tasks for cultivating the ToT reasoning capability. Our empirical evaluations demonstrate that our ToTQwen3-8B model, trained with our ToTRL, achieves significant improvement in performance and reasoning efficiency on complex reasoning tasks.

Title: Automated Bias Assessment in AI-Generated Educational Content Using CEAT Framework

Authors: Jingyang Peng, Wenyuan Shen, Jiarui Rao, Jionghao Lin
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2505.12718
Pdf URL: https://arxiv.org/pdf/2505.12718
Copy Paste: [[2505.12718]] Automated Bias Assessment in AI-Generated Educational Content Using CEAT Framework(https://arxiv.org/abs/2505.12718)
Keywords: prompt, retrieval-augmented generation
Abstract: Recent advances in Generative Artificial Intelligence (GenAI) have transformed educational content creation, particularly in developing tutor training materials. However, biases embedded in AI-generated content--such as gender, racial, or national stereotypes--raise significant ethical and educational concerns. Despite the growing use of GenAI, systematic methods for detecting and evaluating such biases in educational materials remain limited. This study proposes an automated bias assessment approach that integrates the Contextualized Embedding Association Test with a prompt-engineered word extraction method within a Retrieval-Augmented Generation framework. We applied this method to AI-generated texts used in tutor training lessons. Results show a high alignment between the automated and manually curated word sets, with a Pearson correlation coefficient of r = 0.993, indicating reliable and consistent bias assessment. Our method reduces human subjectivity and enhances fairness, scalability, and reproducibility in auditing GenAI-produced educational content.
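For reference, the Pearson correlation coefficient reported above is the covariance of two score lists divided by the product of their standard deviations. The sketch below computes it in plain Python; the two score lists are made-up placeholders, not data from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical bias scores from automated vs. manually curated word sets.
automated = [0.10, 0.40, 0.35, 0.80]
manual = [0.12, 0.38, 0.33, 0.79]
print(pearson_r(automated, manual))
```

A value near 1 means the two procedures rank and scale the items almost identically, which is the sense in which the abstract calls the automated assessment reliable.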

Title: On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding

Authors: Haoyuan Wu, Rui Ming, Jilong Gao, Hangyu Zhao, Xueyi Chen, Yikai Yang, Haisheng Zheng, Zhuolun He, Bei Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12723
Pdf URL: https://arxiv.org/pdf/2505.12723
Copy Paste: [[2505.12723]] On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding(https://arxiv.org/abs/2505.12723)
Keywords: language model, llm
Abstract: Large language models (LLMs) achieve remarkable performance in code generation tasks. However, a significant performance disparity persists between popular programming languages (e.g., Python, C++) and others. To address this capability gap, we leverage the code translation task to train LLMs, thereby facilitating the transfer of coding proficiency across diverse programming languages. Moreover, we introduce OORL for training, a novel reinforcement learning (RL) framework that integrates on-policy and off-policy strategies. Within OORL, on-policy RL is applied during code translation, guided by a rule-based reward signal derived from unit tests. Complementing this coarse-grained rule-based reward, we propose Group Equivalent Preference Optimization (GEPO), a novel preference optimization method. Specifically, GEPO trains the LLM using groups of intermediate representations (IRs). LLMs can be guided to discern IRs equivalent to the source code from inequivalent ones, while also utilizing signals about the mutual equivalence between IRs within the group. This process allows LLMs to capture nuanced aspects of code functionality. By employing OORL for training with code translation tasks, LLMs improve their recognition of code functionality and their understanding of the relationships between code implemented in different languages. Extensive experiments demonstrate that training LLMs with OORL on code translation tasks achieves significant performance improvements on code benchmarks across multiple programming languages.
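A rule-based reward derived from unit tests can be sketched as below. The abstract does not say whether the reward is binary or graded, so the all-or-nothing rule here is an assumption, and `unit_test_reward` is a hypothetical name; a real setup would also run the translated code in a sandbox rather than call it directly.

```python
def unit_test_reward(candidate, test_cases):
    """Return 1.0 if candidate passes every (args, expected) case, else 0.0.

    Any exception raised by the candidate counts as a failed test.
    """
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            return 0.0
    return 1.0

# Hypothetical unit tests for a translated "add two numbers" function.
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

def good_translation(a, b):
    return a + b

def bad_translation(a, b):
    return a - b

print(unit_test_reward(good_translation, tests), unit_test_reward(bad_translation, tests))
```

Such a reward only says whether behavior matches on the tested inputs, which is why the abstract calls it coarse-grained and pairs it with a finer preference signal.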

Title: What is Stigma Attributed to? A Theory-Grounded, Expert-Annotated Interview Corpus for Demystifying Mental-Health Stigma

Authors: Han Meng, Yancan Chen, Yunan Li, Yitian Yang, Jungup Lee, Renwen Zhang, Yi-Chieh Lee
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2505.12727
Pdf URL: https://arxiv.org/pdf/2505.12727
Copy Paste: [[2505.12727]] What is Stigma Attributed to? A Theory-Grounded, Expert-Annotated Interview Corpus for Demystifying Mental-Health Stigma(https://arxiv.org/abs/2505.12727)
Keywords: chat
Abstract: Mental-health stigma remains a pervasive social problem that hampers treatment-seeking and recovery. Existing resources for training neural models to finely classify such stigma are limited, relying primarily on social-media or synthetic data without theoretical underpinnings. To remedy this gap, we present an expert-annotated, theory-informed corpus of human-chatbot interviews, comprising 4,141 snippets from 684 participants with documented socio-cultural backgrounds. Our experiments benchmark state-of-the-art neural models and empirically unpack the challenges of stigma detection. This dataset can facilitate research on computationally detecting, neutralizing, and counteracting mental-health stigma.

Title: ReEx-SQL: Reasoning with Execution-Aware Reinforcement Learning for Text-to-SQL

Authors: Yaxun Dai (1), Wenxuan Xie (3), Xialie Zhuang (4), Tianyu Yang (5), Yiying Yang (2), Haiqin Yang (6), Yuhang Zhao (2), Pingfu Chao (1), Wenhao Jiang (2) ((1) Soochow University, (2) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), (3) South China University of Technology, (4) University of Chinese Academy of Sciences, (5) Alibaba DAMO Academy, (6) International Digital Economy Academy (IDEA))
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12768
Pdf URL: https://arxiv.org/pdf/2505.12768
Copy Paste: [[2505.12768]] ReEx-SQL: Reasoning with Execution-Aware Reinforcement Learning for Text-to-SQL(https://arxiv.org/abs/2505.12768)
Keywords: language model, llm, prompt
Abstract: In Text-to-SQL, execution feedback is essential for guiding large language models (LLMs) to reason accurately and generate reliable SQL queries. However, existing methods treat execution feedback solely as a post-hoc signal for correction or selection, failing to integrate it into the generation process. This limitation hinders their ability to address reasoning errors as they occur, ultimately reducing query accuracy and robustness. To address this issue, we propose ReEx-SQL (Reasoning with Execution-Aware Reinforcement Learning), a framework for Text-to-SQL that enables models to interact with the database during decoding and dynamically adjust their reasoning based on execution feedback. ReEx-SQL introduces an execution-aware reasoning paradigm that interleaves intermediate SQL execution into reasoning paths, facilitating context-sensitive revisions. It achieves this through structured prompts with markup tags and a stepwise rollout strategy that integrates execution feedback into each stage of generation. To supervise policy learning, we develop a composite reward function that includes an exploration reward, explicitly encouraging effective database interaction. Additionally, ReEx-SQL adopts a tree-based decoding strategy to support exploratory reasoning, enabling dynamic expansion of alternative reasoning paths. Notably, ReEx-SQL achieves 88.8% on Spider and 64.9% on BIRD at the 7B scale, surpassing the standard reasoning baseline by 2.7% and 2.6%, respectively. It also shows robustness, achieving 85.2% on Spider-Realistic with leading performance. In addition, its tree-structured decoding improves efficiency and performance over linear decoding, reducing inference time by 51.9% on the BIRD development set.
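
To make the idea of execution feedback concrete, the sketch below runs an intermediate SQL statement and turns the result (or the error) into text that could be appended to a reasoning trace. It uses Python's standard `sqlite3` module; the helper name `execution_feedback`, the `max_rows` parameter, and the `OK:`/`ERROR:` message format are hypothetical and are not taken from ReEx-SQL.

```python
import sqlite3


def execution_feedback(conn: sqlite3.Connection, sql: str, max_rows: int = 3) -> str:
    """Execute an intermediate query and describe the outcome as text."""
    try:
        rows = conn.execute(sql).fetchmany(max_rows)  # keep the feedback short
        return f"OK: {rows}"
    except sqlite3.Error as exc:
        return f"ERROR: {exc}"


# Toy in-memory database standing in for the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")

ok_feedback = execution_feedback(conn, "SELECT x FROM t")
err_feedback = execution_feedback(conn, "SELECT y FROM t")  # unknown column
```

In an execution-aware loop, a message such as `err_feedback` would be available to the model while it is still generating, instead of only after the final query is produced.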

Title: A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Authors: Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, Jun Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12781
Pdf URL: https://arxiv.org/pdf/2505.12781
Copy Paste: [[2505.12781]] A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone(https://arxiv.org/abs/2505.12781)
Keywords: language model
Abstract: Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens--while using only 20B tokens, achieving over 1,000x training efficiency. Our codes and model checkpoints are available at this https URL and this https URL.

Title: EAVIT: Efficient and Accurate Human Value Identification from Text data via LLMs

Authors: Wenhao Zhu, Yuhang Xie, Guojie Song, Xin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12792
Pdf URL: https://arxiv.org/pdf/2505.12792
Copy Paste: [[2505.12792]] EAVIT: Efficient and Accurate Human Value Identification from Text data via LLMs(https://arxiv.org/abs/2505.12792)
Keywords: language model, gpt, llm, long context, prompt
Abstract: The rapid evolution of large language models (LLMs) has revolutionized various fields, including the identification and discovery of human values within text data. While traditional NLP models, such as BERT, have been employed for this task, their ability to represent textual data is significantly outperformed by emerging LLMs like GPTs. However, the performance of online LLMs often degrades when handling long contexts required for value identification, which also incurs substantial computational costs. To address these challenges, we propose EAVIT, an efficient and accurate framework for human value identification that combines the strengths of both locally fine-tunable and online black-box LLMs. Our framework employs a value detector - a small, local language model - to generate initial value estimations. These estimations are then used to construct concise input prompts for online LLMs, enabling accurate final value identification. To train the value detector, we introduce explanation-based training and data generation techniques specifically tailored for value identification, alongside sampling strategies to optimize the brevity of LLM input prompts. Our approach effectively reduces the number of input tokens by up to 1/6 compared to directly querying online LLMs, while consistently outperforming traditional NLP methods and other LLM-based strategies.

Title: Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models

Authors: Yanbin Yin, Kun Zhou, Zhen Wang, Xiangdong Zhang, Yifei Shao, Shibo Hao, Yi Gu, Jieyuan Liu, Somanshu Singla, Tianyang Liu, Eric P. Xing, Zhengzhong Liu, Haojian Jin, Zhiting Hu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12808
Pdf URL: https://arxiv.org/pdf/2505.12808
Copy Paste: [[2505.12808]] Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models(https://arxiv.org/abs/2505.12808)
Keywords: language model, llm, chat
Abstract: The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Standard practices nowadays face fundamental trade-offs: closed-ended question-based benchmarks (e.g., MMLU) struggle with saturation as newer models emerge, while crowd-sourced leaderboards (e.g., Chatbot Arena) rely on costly and slow human judges. Recently, automated methods (e.g., LLM-as-a-judge) have shed light on scalability, but risk bias by relying on one or a few "authority" models. To tackle these issues, we propose Decentralized Arena (dearena), a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other. It mitigates single-model judge bias through democratic, pairwise evaluation, and remains efficient at scale through two key components: (1) a coarse-to-fine ranking algorithm for fast incremental insertion of new models with sub-quadratic complexity, and (2) an automatic question selection strategy for the construction of new evaluation dimensions. In extensive experiments across 66 LLMs, dearena attains up to 97% correlation with human judgements, while significantly reducing the cost. Our code and data will be publicly released on this https URL.
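
One way to see why incremental insertion can stay sub-quadratic is binary-search insertion into an existing ranking, which needs about log2(n) pairwise comparisons per new model instead of n. The sketch below is an assumption about what a coarse step could look like, not the paper's algorithm; `insert_by_binary_search` and the `beats` judge callback are hypothetical names.

```python
from typing import Callable, List


def insert_by_binary_search(
    ranking: List[str], new_model: str, beats: Callable[[str, str], bool]
) -> int:
    """Insert new_model into a best-first ranking; return the comparisons used."""
    lo, hi = 0, len(ranking)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if beats(new_model, ranking[mid]):
            hi = mid  # new model belongs above position mid
        else:
            lo = mid + 1  # new model belongs below position mid
    ranking.insert(lo, new_model)
    return comparisons


# Toy judge: a fixed strength table stands in for LLM pairwise evaluation.
strength = {"a": 4, "b": 3, "c": 2, "d": 1}
ranking = ["a", "b", "d"]
used = insert_by_binary_search(ranking, "c", lambda x, y: strength[x] > strength[y])
```

Real pairwise judgements are noisy, so a finer second pass around the inserted position (the "fine" part of coarse-to-fine) would plausibly be needed; that pass is not shown here.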

Title: PsyMem: Fine-grained psychological alignment and Explicit Memory Control for Advanced Role-Playing LLMs

Authors: Xilong Cheng, Yunxiao Qin, Yuting Tan, Zhengnan Li, Ye Wang, Hongjiang Xiao, Yuan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12814
Pdf URL: https://arxiv.org/pdf/2505.12814
Copy Paste: [[2505.12814]] PsyMem: Fine-grained psychological alignment and Explicit Memory Control for Advanced Role-Playing LLMs(https://arxiv.org/abs/2505.12814)
Keywords: llm
Abstract: Existing LLM-based role-playing methods often rely on superficial textual descriptions or simplistic metrics, inadequately modeling both intrinsic and extrinsic character dimensions. Additionally, they typically simulate character memory with implicit model knowledge or basic retrieval-augmented generation without explicit memory alignment, compromising memory consistency. These two issues weaken the reliability of role-playing LLMs in several applications, such as trustworthy social simulation. To address these limitations, we propose PsyMem, a novel framework integrating fine-grained psychological attributes and explicit memory control for role-playing. PsyMem supplements textual descriptions with 26 psychological indicators to model characters in detail. Additionally, PsyMem implements memory alignment training, which explicitly trains the model to align a character's responses with memory, thereby enabling dynamic memory-controlled responding during inference. Trained from Qwen2.5-7B-Instruct on our specially designed dataset (including 5,414 characters and 38,962 dialogues extracted from novels), the resulting model, termed PsyMem-Qwen, outperforms baseline models in role-playing, achieving the best performance in human-likeness and character fidelity.

Title: SynDec: A Synthesize-then-Decode Approach for Arbitrary Textual Style Transfer via Large Language Models

Authors: Han Sun, Zhen Sun, Zongmin Zhang, Linzhao Jia, Wei Shao, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12821
Pdf URL: https://arxiv.org/pdf/2505.12821
Copy Paste: [[2505.12821]] SynDec: A Synthesize-then-Decode Approach for Arbitrary Textual Style Transfer via Large Language Models(https://arxiv.org/abs/2505.12821)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are emerging as dominant forces for textual style transfer. However, for arbitrary style transfer, LLMs face two key challenges: (1) considerable reliance on manually-constructed prompts and (2) rigid stylistic biases inherent in LLMs. In this paper, we propose a novel Synthesize-then-Decode (SynDec) approach, which automatically synthesizes high-quality prompts and amplifies their roles during the decoding process. Specifically, our approach synthesizes prompts by selecting representative few-shot samples, conducting a four-dimensional style analysis, and reranking the candidates. At the LLM decoding stage, the textual style transfer (TST) effect is amplified by maximizing the contrast in output probabilities between scenarios with and without the synthesized prompt, as well as between prompts and negative samples. We conduct extensive experiments, and the results show that SynDec outperforms existing state-of-the-art LLM-based methods on five out of six benchmarks (e.g., achieving up to a 9% increase in accuracy for modern-to-Elizabethan English transfer). Detailed ablation studies further validate the effectiveness of SynDec.
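
The decoding-time contrast between outputs with and without the synthesized prompt can be sketched as a generic contrastive-decoding score. The formula below, `log p_with + alpha * (log p_with - log p_without)`, is a common form of this idea and is an assumption here, not SynDec's exact objective; the function names and the `alpha` parameter are hypothetical.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a small vocabulary."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(
    logits_with: List[float], logits_without: List[float], alpha: float = 1.0
) -> List[float]:
    """Boost tokens whose probability rises when the prompt is present."""
    lp_with = log_softmax(logits_with)
    lp_without = log_softmax(logits_without)
    return [w + alpha * (w - wo) for w, wo in zip(lp_with, lp_without)]


# Token 1 is only slightly behind with the prompt, but far behind without it,
# so the contrast promotes it.
with_prompt = [2.0, 1.8]
without_prompt = [2.0, 0.0]
plain = contrastive_scores(with_prompt, without_prompt, alpha=0.0)
contrasted = contrastive_scores(with_prompt, without_prompt, alpha=1.0)
```

With `alpha=0.0` the scores reduce to ordinary log-probabilities under the prompt, so `alpha` controls how strongly the prompt's influence is amplified.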

Title: Contrastive Prompting Enhances Sentence Embeddings in LLMs through Inference-Time Steering

Authors: Zifeng Cheng, Zhonghui Wang, Yuchen Fu, Zhiwei Jiang, Yafeng Yin, Cong Wang, Qing Gu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12831
Pdf URL: https://arxiv.org/pdf/2505.12831
Copy Paste: [[2505.12831]] Contrastive Prompting Enhances Sentence Embeddings in LLMs through Inference-Time Steering(https://arxiv.org/abs/2505.12831)
Keywords: language model, llm, prompt
Abstract: Extracting sentence embeddings from large language models (LLMs) is a practical direction, as it requires neither additional data nor fine-tuning. Previous studies usually focus on prompt engineering to guide LLMs to encode the core semantic information of the sentence into the embedding of the last token. However, the last token in these methods still encodes an excess of non-essential information, such as stop words, limiting its encoding capacity. To this end, we propose a Contrastive Prompting (CP) method that introduces an extra auxiliary prompt to elicit better sentence embedding. By contrasting with the auxiliary prompt, CP can steer existing prompts to encode the core semantics of the sentence, rather than non-essential information. CP is a plug-and-play inference-time intervention method that can be combined with various prompt-based methods. Extensive experiments on Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our method can improve the performance of existing prompt-based methods across different LLMs. Our code will be released at this https URL.

Title: FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models

Authors: Hengxing Cai, Jinhan Dong, Jingjun Tan, Jingcheng Deng, Sihang Li, Zhifeng Gao, Haidong Wang, Zicheng Su, Agachai Sumalee, Renxin Zhong
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.12835
Pdf URL: https://arxiv.org/pdf/2505.12835
Copy Paste: [[2505.12835]] FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models(https://arxiv.org/abs/2505.12835)
Keywords: language model, gpt, chain-of-thought
Abstract: Unmanned Aerial Vehicle (UAV) Vision-and-Language Navigation (VLN) is vital for applications such as disaster response, logistics delivery, and urban inspection. However, existing methods often struggle with insufficient multimodal fusion, weak generalization, and poor interpretability. To address these challenges, we propose FlightGPT, a novel UAV VLN framework built upon Vision-Language Models (VLMs) with powerful multimodal perception capabilities. We design a two-stage training pipeline: first, Supervised Fine-Tuning (SFT) using high-quality demonstrations to improve initialization and structured reasoning; then, the Group Relative Policy Optimization (GRPO) algorithm, guided by a composite reward that considers goal accuracy, reasoning quality, and format compliance, to enhance generalization and adaptability. Furthermore, FlightGPT introduces a Chain-of-Thought (CoT)-based reasoning mechanism to improve decision interpretability. Extensive experiments on the city-scale dataset CityNav demonstrate that FlightGPT achieves state-of-the-art performance across all scenarios, with a 9.22% higher success rate than the strongest baseline in unseen environments. Our implementation is publicly available.

Title: The Hidden Structure -- Improving Legal Document Understanding Through Explicit Text Formatting

Authors: Christian Braun, Alexander Lilienbeck, Daniel Mentjukov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12837
Pdf URL: https://arxiv.org/pdf/2505.12837
Copy Paste: [[2505.12837]] The Hidden Structure -- Improving Legal Document Understanding Through Explicit Text Formatting(https://arxiv.org/abs/2505.12837)
Keywords: gpt, llm, prompt
Abstract: Legal contracts possess an inherent, semantically vital structure (e.g., sections, clauses) that is crucial for human comprehension but whose impact on LLM processing remains under-explored. This paper investigates the effects of explicit input text structure and prompt engineering on the performance of GPT-4o and GPT-4.1 on a legal question-answering task using an excerpt of the CUAD. We compare model exact-match accuracy across various input formats: well-structured plain text (human-generated from CUAD), plain text cleaned of line breaks, plain text extracted by Azure OCR, plain text extracted by GPT-4o Vision, and Markdown (MD) extracted (and interpreted) by GPT-4o Vision. To indicate the potential impact of prompt engineering, we assess the effect of shifting task instructions to the system prompt and explicitly informing the model about the structured nature of the input. Our findings reveal that GPT-4o demonstrates considerable robustness to variations in input structure, but falls short in overall performance. Conversely, GPT-4.1's performance is markedly sensitive; poorly structured inputs yield suboptimal results (identical to those of GPT-4o), while well-structured formats (original CUAD text, GPT-4o Vision text and GPT-4o MD) improve exact-match accuracy by ~20 percentage points. Optimizing the system prompt to include task details and an advisory about structured input further elevates GPT-4.1's accuracy by an additional ~10-13 percentage points, with Markdown ultimately achieving the highest performance under these conditions (79% overall exact-match accuracy). This research empirically demonstrates that while newer models exhibit greater resilience, careful input structuring and strategic prompt design remain critical for optimizing the performance of LLMs, and can significantly affect outcomes in high-stakes legal applications.

Title: LEXam: Benchmarking Legal Reasoning on 340 Law Exams

Authors: Yu Fan, Jingwei Ni, Jakob Merane, Etienne Salimbeni, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12864
Pdf URL: https://arxiv.org/pdf/2505.12864
Copy Paste: [[2505.12864]] LEXam: Benchmarking Legal Reasoning on 340 Law Exams(https://arxiv.org/abs/2505.12864)
Keywords: language model, llm
Abstract: Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach, such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: this https URL

Title: GAP: Graph-Assisted Prompts for Dialogue-based Medication Recommendation

Authors: Jialun Zhong, Yanzeng Li, Sen Hu, Yang Zhang, Teng Xu, Lei Zou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12888
Pdf URL: https://arxiv.org/pdf/2505.12888
Copy Paste: [[2505.12888]] GAP: Graph-Assisted Prompts for Dialogue-based Medication Recommendation(https://arxiv.org/abs/2505.12888)
Keywords: language model, llm, prompt
Abstract: Medication recommendations have become an important task in the healthcare domain, especially in measuring the accuracy and safety of medical dialogue systems (MDS). Different from the recommendation task based on electronic health records (EHRs), dialogue-based medication recommendations require research on the interaction details between patients and doctors, which are crucial but may not exist in EHRs. Recent advancements in large language models (LLMs) have extended the medical dialogue domain. These LLMs can interpret patients' intent and provide medical suggestions including medication recommendations, but some challenges still deserve attention. During a multi-turn dialogue, LLMs may ignore fine-grained medical information or connections across dialogue turns, which are vital for providing accurate suggestions. Besides, LLMs may generate non-factual responses when they lack domain-specific knowledge, which is especially risky in the medical domain. To address these challenges, we propose a Graph-Assisted Prompts (GAP) framework for dialogue-based medication recommendation. It extracts medical concepts and corresponding states from dialogue to construct an explicitly patient-centric graph, which can describe the neglected but important information. Further, combined with external medical knowledge graphs, GAP can generate abundant queries and prompts, thus retrieving information from multiple sources to reduce non-factual responses. We evaluate GAP on a dialogue-based medication recommendation dataset and further explore its potential in a more difficult scenario, dynamic diagnostic interviewing. Extensive experiments demonstrate its competitive performance when compared with strong baselines.

Title: On the Thinking-Language Modeling Gap in Large Language Models

Authors: Chenxi Liu, Yongqiang Chen, Tongliang Liu, James Cheng, Bo Han, Kun Zhang
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.12896
Pdf URL: https://arxiv.org/pdf/2505.12896
Copy Paste: [[2505.12896]] On the Thinking-Language Modeling Gap in Large Language Models(https://arxiv.org/abs/2505.12896)
Keywords: language model, llm, prompt
Abstract: System 2 reasoning is one of the defining characteristics of intelligence, which requires slow and logical thinking. Humans conduct System 2 reasoning via the language of thoughts, which organizes the reasoning process as a causal sequence of mental language, or thoughts. Recently, it has been observed that System 2 reasoning can be elicited from Large Language Models (LLMs) pre-trained on large-scale natural languages. However, in this work, we show that there is a significant gap between the modeling of languages and thoughts. As language is primarily a tool for humans to share knowledge and thinking, modeling human language can easily introduce language biases into LLMs that deviate from the chain of thoughts in the mind. Furthermore, we show that these biases mislead the eliciting of "thoughts" in LLMs, causing them to focus only on a biased part of the premise. To this end, we propose a new prompt technique termed Language-of-Thoughts (LoT) to demonstrate and alleviate this gap. Instead of directly eliciting the chain of thoughts from partial information, LoT instructs LLMs to adjust the order and tokens used for the expressions of all the relevant information. We show that this simple strategy significantly reduces the language modeling biases in LLMs and improves the performance of LLMs across a variety of reasoning tasks.

Title: PyFCG: Fluid Construction Grammar in Python

Authors: Paul Van Eecke, Katrien Beuls
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.12920
Pdf URL: https://arxiv.org/pdf/2505.12920
Copy Paste: [[2505.12920]] PyFCG: Fluid Construction Grammar in Python(https://arxiv.org/abs/2505.12920)
Keywords: agent
Abstract: We present PyFCG, an open source software library that ports Fluid Construction Grammar (FCG) to the Python programming language. PyFCG enables its users to seamlessly integrate FCG functionality into Python programs, and to use FCG in combination with other libraries within Python's rich ecosystem. Apart from a general description of the library, this paper provides three walkthrough tutorials that demonstrate example usage of PyFCG in typical use cases of FCG: (i) formalising and testing construction grammar analyses, (ii) learning usage-based construction grammars from corpora, and (iii) implementing agent-based experiments on emergent communication.

Title: Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs

Authors: Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, Yunjian Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12929
Pdf URL: https://arxiv.org/pdf/2505.12929
Copy Paste: [[2505.12929]] Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs(https://arxiv.org/abs/2505.12929)
Keywords: language model, llm
Abstract: Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at this https URL.
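
The claim that low-probability tokens carry large gradients follows from a standard softmax identity: the gradient of log pi(a) with respect to the logits is one_hot(a) - pi, whose norm grows as pi(a) shrinks. The sketch below checks this numerically and then shows one possible probability-dependent reweighting of the advantage. The linear form `alpha * prob + (1 - alpha)` and the names `grad_norm_log_prob` and `reweight_advantage` are illustrative assumptions, not a confirmed description of the paper's Advantage Reweighting.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_norm_log_prob(probs: List[float], token: int) -> float:
    """L2 norm of d log pi(token) / d logits, which equals one_hot(token) - probs."""
    return math.sqrt(
        sum(((1.0 if j == token else 0.0) - p) ** 2 for j, p in enumerate(probs))
    )


def reweight_advantage(advantage: float, prob: float, alpha: float = 0.3) -> float:
    """Hypothetical reweighting: shrink the advantage of low-probability tokens."""
    return (alpha * prob + (1.0 - alpha)) * advantage


probs = softmax([3.0, 0.0, 0.0])            # token 0 is likely, tokens 1 and 2 are rare
likely_norm = grad_norm_log_prob(probs, 0)  # small gradient
rare_norm = grad_norm_log_prob(probs, 1)    # much larger gradient
```

Under this assumed weighting, a token with probability 1 keeps its full advantage while a token with probability near 0 keeps only a `1 - alpha` fraction, which damps the updates that would otherwise dominate.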
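
The claim that low-probability tokens carry large gradients can be checked on a toy softmax policy. The sketch below is a minimal illustration, not the authors' implementation: it uses the standard identity that the gradient of log pi(token) with respect to the logits is one_hot(token) - probs, and compares the gradient norms of a likely and an unlikely token. The function `reweight_advantage` and its `alpha` parameter are hypothetical; the abstract names Advantage Reweighting but does not give its formula, so the linear weight used here is an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_norm_log_prob(probs, token):
    """L2 norm of d log pi(token) / d logits, i.e. of one_hot(token) - probs."""
    return math.sqrt(
        sum(((1.0 if j == token else 0.0) - p) ** 2 for j, p in enumerate(probs))
    )

def reweight_advantage(advantage, token_prob, alpha=0.3):
    """Hypothetical linear reweighting (an assumption, not taken from the paper):
    tokens with lower probability get a smaller share of the advantage."""
    return (alpha * token_prob + (1.0 - alpha)) * advantage

probs = softmax([4.0, 1.0, 0.0, -2.0])   # token 0 is likely, token 3 is unlikely
norms = [grad_norm_log_prob(probs, t) for t in range(len(probs))]

for t, (p, n) in enumerate(zip(probs, norms)):
    print(f"token {t}: prob={p:.4f} grad_norm={n:.4f} "
          f"reweighted_adv={reweight_advantage(1.0, p):.4f}")
```

In this toy setting the least likely token has the largest gradient norm, which is the imbalance the abstract describes; any weighting that shrinks with decreasing probability would attenuate it.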

Title: A3 : an Analytical Low-Rank Approximation Framework for Attention

Authors: Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12942
Pdf URL: https://arxiv.org/pdf/2505.12942
Copy Paste: [[2505.12942]] A3 : an Analytical Low-Rank Approximation Framework for Attention(https://arxiv.org/abs/2505.12942)
Keywords: language model
Abstract: Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) They focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches for decomposed small matrices. To address these limitations, we propose $\tt A^\tt 3$, a post-training low-rank approximation framework. $\tt A^\tt 3$ splits a Transformer layer into three functional components, namely $\tt QK$, $\tt OV$, and $\tt MLP$. For each component, $\tt A^\tt 3$ provides an analytical solution that reduces the hidden dimension size inside each component while minimizing the component's functional loss (i.e., error in attention scores, attention outputs, and MLP outputs). This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overheads. In addition, it provides a new narrative in advancing the optimization problem from singular linear layer loss optimization toward improved end-to-end performance. Through extensive experiments, we show that $\tt A^\tt 3$ maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also demonstrate the versatility of $\tt A^\tt 3$, including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.
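
The trade-off behind the second limitation (splitting one weight matrix into two low-rank factors) comes down to simple arithmetic. The sketch below is an illustration of that generic two-factor scheme only, not of the analytical solution the abstract proposes; the helper names and the 4096 x 4096 layer size are assumptions chosen for the example.

```python
def dense_params(d_in, d_out):
    """Parameters stored by a dense d_in x d_out weight matrix."""
    return d_in * d_out

def factored_params(d_in, d_out, rank):
    """Parameters stored by two factors of shapes d_in x rank and rank x d_out."""
    return rank * (d_in + d_out)

def break_even_rank(d_in, d_out):
    """Largest rank at which the factored form is strictly smaller than the dense one."""
    return (d_in * d_out - 1) // (d_in + d_out)

d_in = d_out = 4096            # hypothetical layer size
for rank in (256, 1024, 2048):
    ratio = factored_params(d_in, d_out, rank) / dense_params(d_in, d_out)
    print(f"rank={rank}: {ratio:.3f} of the dense parameter count, 2 matmuls instead of 1")

print("break-even rank:", break_even_rank(d_in, d_out))
```

For a square layer the factored form only saves parameters below half the full rank, and it always replaces one matrix multiplication with two, which is the kernel-launch overhead the abstract refers to.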

Title: Neural Morphological Tagging for Nguni Languages

Authors: Cael Marquard, Simbarashe Mawere, Francois Meyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12949
Pdf URL: https://arxiv.org/pdf/2505.12949
Copy Paste: [[2505.12949]] Neural Morphological Tagging for Nguni Languages(https://arxiv.org/abs/2505.12949)
Keywords: language model
Abstract: Morphological parsing is the task of decomposing words into morphemes, the smallest units of meaning in a language, and labelling their grammatical roles. It is a particularly challenging task for agglutinative languages, such as the Nguni languages of South Africa, which construct words by concatenating multiple morphemes. A morphological parsing system can be framed as a pipeline with two separate components, a segmenter followed by a tagger. This paper investigates the use of neural methods to build morphological taggers for the four Nguni languages. We compare two classes of approaches: training neural sequence labellers (LSTMs and neural CRFs) from scratch and finetuning pretrained language models. We compare performance across these two categories, as well as to a traditional rule-based morphological parser. Neural taggers comfortably outperform the rule-based baseline and models trained from scratch tend to outperform pretrained models. We also compare parsing results across different upstream segmenters and with varying linguistic input features. Our findings confirm the viability of employing neural taggers based on pre-existing morphological segmenters for the Nguni languages.

Title: GuRE:Generative Query REwriter for Legal Passage Retrieval

Authors: Daehee Kim, Deokhyung Kang, Jonghwi Kim, Sangwon Ryu, Gary Geunbae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12950
Pdf URL: https://arxiv.org/pdf/2505.12950
Copy Paste: [[2505.12950]] GuRE:Generative Query REwriter for Legal Passage Retrieval(https://arxiv.org/abs/2505.12950)
Keywords: language model, llm
Abstract: Legal Passage Retrieval (LPR) systems are crucial because they help practitioners save time when drafting legal arguments. However, LPR remains underexplored. One primary reason is the significant vocabulary mismatch between the query and the target passage. To address this, we propose a simple yet effective method, the Generative query REwriter (GuRE). We leverage the generative capabilities of Large Language Models (LLMs) by training an LLM for query rewriting. The rewritten queries help retrievers find target passages by mitigating vocabulary mismatch. Experimental results show that GuRE significantly improves performance in a retriever-agnostic manner, outperforming all baseline methods. Further analysis reveals that different training objectives lead to distinct retrieval behaviors, making GuRE more suitable than direct retriever fine-tuning for real-world applications. Code is available at this http URL.

Title: MA-COIR: Leveraging Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition

Authors: Shanshan Liu, Noriki Nishida, Rumana Ferdous Munne, Narumi Tokunaga, Yuki Yamagata, Kouji Kozaki, Yuji Matsumoto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12964
Pdf URL: https://arxiv.org/pdf/2505.12964
Copy Paste: [[2505.12964]] MA-COIR: Leveraging Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition(https://arxiv.org/abs/2505.12964)
Keywords: language model, llm
Abstract: Recognizing biomedical concepts in text is vital for ontology refinement, knowledge graph construction, and concept relationship discovery. However, traditional concept recognition methods, which rely on explicit mention identification, often fail to capture complex concepts not explicitly stated in the text. To overcome this limitation, we introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Using a pretrained BART-based model fine-tuned on small datasets, our approach reduces computational requirements to facilitate adoption by domain experts. Furthermore, we incorporate large language model (LLM)-generated queries and synthetic data to improve recognition in low-resource settings. Experimental results on three scenarios (CDR, HPO, and HOIP) highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations during inference, advancing ontology-driven concept recognition in biomedical domain applications. Our code and constructed data are available at this https URL.

Title: Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down

Authors: Yingzhi Wang, Anas Alhmoud, Saad Alsahly, Muhammad Alqurishi, Mirco Ravanelli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12969
Pdf URL: https://arxiv.org/pdf/2505.12969
Copy Paste: [[2505.12969]] Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down(https://arxiv.org/abs/2505.12969)
Keywords: hallucination
Abstract: OpenAI's Whisper has achieved significant success in Automatic Speech Recognition. However, it has consistently been found to hallucinate, particularly on non-speech segments, which limits its broader application in complex industrial settings. In this paper, we introduce a novel method to reduce Whisper's hallucination on non-speech segments without using any pre- or post-processing techniques. Specifically, we benchmark the contribution of each self-attention head in the Whisper-large-v3 decoder to the hallucination problem by performing head-wise masking. Our findings reveal that only 3 of the 20 heads account for over 75% of the hallucinations on the UrbanSound dataset. We then fine-tune these three "crazy" heads using a collection of non-speech data. The results show that our best fine-tuned model, namely Calm-Whisper, achieves over 80% reduction in non-speech hallucination with less than 0.1% WER degradation on LibriSpeech test-clean and test-other.

Title: A Structured Literature Review on Traditional Approaches in Current Natural Language Processing

Authors: Robin Jegan, Andreas Henrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.12970
Pdf URL: https://arxiv.org/pdf/2505.12970
Copy Paste: [[2505.12970]] A Structured Literature Review on Traditional Approaches in Current Natural Language Processing(https://arxiv.org/abs/2505.12970)
Keywords: language model
Abstract: The continued rise of neural networks and large language models in recent years has altered the natural language processing landscape, enabling new approaches to typical language tasks and achieving mainstream success. Despite the huge success of large language models, many disadvantages remain. In this work, we assess the state of the art in five application scenarios, with a particular focus on the future prospects and sensible uses of traditional and older approaches and techniques. We survey recent publications in the application scenarios of classification, information extraction, relation extraction, text simplification, and text summarization. After defining our terminology, i.e., which features we consider characteristic of traditional techniques in each of the five scenarios, we examine whether such traditional approaches are still being used and, if so, in what way. It turns out that all five application scenarios still feature traditional models in one way or another: as part of a processing pipeline, as a comparison or baseline for the core model of the respective paper, or as the main model(s) of the paper. For the complete statistics, see this https URL.

Title: An Empirical Study of Many-to-Many Summarization with Large Language Models

Authors: Jiaan Wang, Fandong Meng, Zengkui Sun, Yunlong Liang, Yuxuan Cao, Jiarong Xu, Haoxiang Shi, Jie Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12983
Pdf URL: https://arxiv.org/pdf/2505.12983
Copy Paste: [[2505.12983]] An Empirical Study of Many-to-Many Summarization with Large Language Models(https://arxiv.org/abs/2505.12983)
Keywords: language model, gpt, llm
Abstract: Many-to-many summarization (M2MS) aims to process documents in any language and generate the corresponding summaries in any language. Recently, large language models (LLMs) have shown strong multilingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study of LLMs' M2MS ability. Specifically, we first reorganize M2MS data based on eight previous domain-specific datasets. The reorganized data contains 47.8K samples spanning five domains and six languages, which can be used to train and evaluate LLMs. Then, we benchmark 18 LLMs in a zero-shot manner and an instruction-tuning manner. Fine-tuned traditional models (e.g., mBART) are also evaluated for comparison. Our experiments reveal that zero-shot LLMs achieve results competitive with fine-tuned traditional models. After instruction tuning, open-source LLMs significantly improve their M2MS ability and outperform zero-shot LLMs (including GPT-4) in terms of automatic evaluations. In addition, we demonstrate that this task-specific improvement does not sacrifice the LLMs' general task-solving abilities. However, as revealed by our human evaluation, LLMs still face the factuality issue, and instruction tuning might intensify it. Thus, controlling factual errors becomes the key challenge when building LLM summarizers for real applications, and it deserves attention in future research.

Title: EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

Authors: Yuhao Qing, Boyu Zhu, Mingzhe Du, Zhijiang Guo, Terry Yue Zhuo, Qianru Zhang, Jie M. Zhang, Heming Cui, Siu-Ming Yiu, Dong Huang, See-Kiong Ng, Luu Anh Tuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13004
Pdf URL: https://arxiv.org/pdf/2505.13004
Copy Paste: [[2505.13004]] EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code(https://arxiv.org/abs/2505.13004)
Keywords: llm
Abstract: Existing code generation benchmarks primarily evaluate functional correctness, with limited focus on code efficiency and often restricted to a single language like Python. To address this gap, we introduce EffiBench-X, the first multi-language benchmark designed to measure the efficiency of LLM-generated code. EffiBench-X supports Python, C++, Java, JavaScript, Ruby, and Golang. It comprises competitive programming tasks with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EffiBench-X reveals that while models generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (Qwen3-32B) achieve only around 62% of human efficiency on average, with significant language-specific variations. LLMs show better efficiency in Python, Ruby, and JavaScript than in Java, C++, and Golang. For instance, DeepSeek-R1's Python code is significantly more efficient than its Java code. These results highlight the critical need for research into LLM optimization techniques to improve code efficiency across diverse languages. The dataset and evaluation infrastructure are submitted and available at this https URL and this https URL.

Title: Evaluating the Performance of RAG Methods for Conversational AI in the Airport Domain

Authors: Yuyang Li, Philip J.M. Kerbusch, Raimon H.R. Pruim, Tobias Käfer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13006
Pdf URL: https://arxiv.org/pdf/2505.13006
Copy Paste: [[2505.13006]] Evaluating the Performance of RAG Methods for Conversational AI in the Airport Domain(https://arxiv.org/abs/2505.13006)
Keywords: gpt, hallucination, retrieval-augmented generation
Abstract: Airports ranked in the top 20 by annual passenger numbers are highly dynamic environments with thousands of flights daily, and they aim to increase their degree of automation. To contribute to this, we implemented a Conversational AI system that enables airport staff to communicate with flight information systems. This system not only answers standard airport queries but also resolves airport terminology, jargon, abbreviations, and dynamic questions involving reasoning. In this paper, we built three different Retrieval-Augmented Generation (RAG) methods: traditional RAG, SQL RAG, and Knowledge Graph-based RAG (Graph RAG). Experiments showed that traditional RAG achieved 84.84% accuracy using BM25 + GPT-4 but occasionally produced hallucinations, which poses a risk to airport safety. In contrast, SQL RAG and Graph RAG achieved 80.85% and 91.49% accuracy respectively, with significantly fewer hallucinations. Moreover, Graph RAG was especially effective for questions that involved reasoning. Based on our observations, we recommend SQL RAG and Graph RAG for airport environments, owing to their fewer hallucinations and their ability to handle dynamic questions.
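
The retrieval half of the "BM25 + GPT-4" baseline can be sketched in a few lines. This is a minimal, standard Okapi BM25 scorer with a non-negative IDF variant, written under assumptions: the whitespace tokenizer, the `k1` and `b` defaults, and the three toy documents are illustrative choices and are not taken from the paper, whose actual retrieval setup is not described in the abstract.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy documents, invented for illustration only.
docs = [
    "gate b12 is closed for maintenance",
    "flight kl1001 departs from gate d5",
    "baggage belt 7 serves arrivals from london",
]
scores = bm25_scores("which gate for flight kl1001", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

In a traditional RAG pipeline the top-scoring passages would then be placed in the prompt of the generator model; that step is omitted here.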

Title: KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025

Authors: Sai Koneru, Maike Züfle, Thai-Binh Nguyen, Seymanur Akti, Jan Niehues, Alexander Waibel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13036
Pdf URL: https://arxiv.org/pdf/2505.13036
Copy Paste: [[2505.13036]] KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025(https://arxiv.org/abs/2505.13036)
Keywords: language model, llm
Abstract: The scope of the International Workshop on Spoken Language Translation (IWSLT) has recently broadened beyond traditional Speech Translation (ST) to encompass a wider array of tasks, including Speech Question Answering and Summarization. This shift is partly driven by the growing capabilities of modern systems, particularly the success of Large Language Models (LLMs). In this paper, we present the Karlsruhe Institute of Technology's submissions for the Offline ST and Instruction Following (IF) tracks, where we leverage LLMs to enhance performance across all tasks. For the Offline ST track, we propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused using an LLM with document-level context. This is followed by a two-step translation process that incorporates an additional refinement step to improve translation quality. For the IF track, we develop an end-to-end model that integrates a speech encoder with an LLM to perform a wide range of instruction-following tasks. We complement it with a final document-level refinement stage that uses contextual information to further enhance output quality.

Title: Advancing Sequential Numerical Prediction in Autoregressive Models

Authors: Xiang Fei, Jinghui Lu, Qi Sun, Hao Feng, Yanjie Wang, Wei Shi, An-Lan Wang, Jingqun Tang, Can Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13077
Pdf URL: https://arxiv.org/pdf/2505.13077
Copy Paste: [[2505.13077]] Advancing Sequential Numerical Prediction in Autoregressive Models(https://arxiv.org/abs/2505.13077)
Keywords: llm
Abstract: Autoregressive models have become the de facto choice for sequence generation tasks, but standard approaches treat digits as independent tokens and apply cross-entropy loss, overlooking the coherent structure of numerical sequences. This paper introduces Numerical Token Integrity Loss (NTIL) to address this gap. NTIL operates at two levels: (1) token-level, where it extends the Earth Mover's Distance (EMD) to preserve ordinal relationships between numerical values, and (2) sequence-level, where it penalizes the overall discrepancy between the predicted and actual sequences. This dual approach improves numerical prediction and integrates effectively with LLMs/MLLMs. Extensive experiments show significant performance improvements with NTIL.
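
Why cross-entropy overlooks ordinal structure, and why an Earth Mover's Distance does not, can be shown with two predictions for the digit 7. The sketch below uses the standard one-dimensional EMD (sum of absolute differences between cumulative distributions); it is an illustration under that assumption, not the NTIL formulation, which the abstract does not spell out. All helper names are hypothetical.

```python
import math
from itertools import accumulate

def emd_1d(p, q):
    """Earth Mover's Distance between two distributions over ordered bins 0..n-1,
    with unit distance between adjacent bins."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

def cross_entropy(p, target_index):
    """Cross-entropy of prediction p against a one-hot target."""
    return -math.log(p[target_index])

def digit_dist(mass):
    """Build a distribution over digits 0-9 from a {digit: probability} dict."""
    return [mass.get(d, 0.0) for d in range(10)]

target = digit_dist({7: 1.0})
near = digit_dist({7: 0.6, 8: 0.4})   # misplaced mass sits next to the true digit
far = digit_dist({7: 0.6, 1: 0.4})    # misplaced mass sits far from the true digit

print("cross-entropy:", cross_entropy(near, 7), cross_entropy(far, 7))
print("EMD:", emd_1d(near, target), emd_1d(far, target))
```

Both predictions receive the same cross-entropy because only the probability of the correct digit matters, while the EMD penalizes the prediction whose error is numerically further from the target.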

Title: Systematic Generalization in Language Models Scales with Information Entropy

Authors: Sondre Wold, Lucas Georges Gabriel Charpentier, Étienne Simon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13089
Pdf URL: https://arxiv.org/pdf/2505.13089
Copy Paste: [[2505.13089]] Systematic Generalization in Language Models Scales with Information Entropy(https://arxiv.org/abs/2505.13089)
Keywords: language model
Abstract: Systematic generalization remains challenging for current language models, which are known to be both sensitive to semantically similar permutations of the input and to struggle with known concepts presented in novel contexts. Although benchmarks exist for assessing compositional behavior, it is unclear how to measure the difficulty of a systematic generalization problem. In this work, we show how one aspect of systematic generalization can be described by the entropy of the distribution of component parts in the training data. We formalize a framework for measuring entropy in a sequence-to-sequence task and find that the performance of popular model architectures scales with the entropy. Our work connects systematic generalization to information efficiency, and our results indicate that success at high entropy can be achieved even without built-in priors, and that success at low entropy can serve as a target for assessing progress towards robust systematic generalization.
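
The quantity at the center of this abstract, the entropy of the distribution of component parts in the training data, is ordinary Shannon entropy. The sketch below is a minimal illustration under assumptions: the "components" are hypothetical verb primitives, and the paper's actual framework for sequence-to-sequence tasks is not reproduced here.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two hypothetical training sets with the same four primitives and the same size.
uniform = ["jump", "walk", "run", "look"] * 4             # each primitive seen 4 times
skewed = ["jump"] * 13 + ["walk", "run", "look"]          # one primitive dominates

print("uniform:", shannon_entropy(uniform))
print("skewed: ", round(shannon_entropy(skewed), 4))
```

The skewed set has much lower entropy than the uniform one, which corresponds to the low-entropy regime the abstract singles out as the harder target for systematic generalization.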

Title: The Effect of Language Diversity When Fine-Tuning Large Language Models for Translation

Authors: David Stap, Christof Monz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13090
Pdf URL: https://arxiv.org/pdf/2505.13090
Copy Paste: [[2505.13090]] The Effect of Language Diversity When Fine-Tuning Large Language Models for Translation(https://arxiv.org/abs/2505.13090)
Keywords: language model, llm
Abstract: Prior research diverges on language diversity in LLM fine-tuning: Some studies report benefits while others find no advantages. Through controlled fine-tuning experiments across 132 translation directions, we systematically resolve these disparities. We find that expanding language diversity during fine-tuning improves translation quality for both unsupervised and -- surprisingly -- supervised pairs, despite less diverse models being fine-tuned exclusively on these supervised pairs. However, benefits plateau or decrease beyond a certain diversity threshold. We show that increased language diversity creates more language-agnostic representations. These representational adaptations help explain the improved performance in models fine-tuned with greater diversity.

Title: Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning

Authors: Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.13115
Pdf URL: https://arxiv.org/pdf/2505.13115
Copy Paste: [[2505.13115]] Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning(https://arxiv.org/abs/2505.13115)
Keywords: language model, llm
Abstract: The popular success of text-based large language models (LLMs) has focused the attention of the multimodal community on combining other modalities, such as vision and audio, with text to achieve similar multimodal capabilities. In this quest, large audio language models (LALMs) have to be evaluated on reasoning-related tasks, which differ from traditional classification or generation tasks. Towards this goal, we propose a novel dataset called temporal reasoning evaluation of audio (TREA). We benchmark open-source LALMs and observe that they consistently fall behind human capabilities on the tasks in the TREA dataset. While evaluating LALMs, we also propose an uncertainty metric, which computes the invariance of the model to semantically identical perturbations of the input. Our analysis shows that the accuracy and uncertainty metrics are not necessarily correlated, which points to a need for holistic evaluation of LALMs for high-stakes applications.

Title: ModernGBERT: German-only 1B Encoder Model Trained from Scratch

Authors: Anton Ehrmanntraut, Julia Wunderle, Jan Pfister, Fotis Jannidis, Andreas Hotho
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13136
Pdf URL: https://arxiv.org/pdf/2505.13136
Copy Paste: [[2505.13136]] ModernGBERT: German-only 1B Encoder Model Trained from Scratch(https://arxiv.org/abs/2505.13136)
Keywords: language model, llm
Abstract: Despite the prominence of decoder-only language models, encoders remain crucial for resource-constrained applications. We introduce ModernGBERT (134M, 1B), a fully transparent family of German encoder models trained from scratch, incorporating architectural innovations from ModernBERT. To evaluate the practical trade-offs of training encoders from scratch, we also present LLäMmlein2Vec (120M, 1B, 7B), a family of encoders derived from German decoder-only models via LLM2Vec. We benchmark all models on natural language understanding, text embedding, and long-context reasoning tasks, enabling a controlled comparison between dedicated encoders and converted decoders. Our results show that ModernGBERT 1B outperforms prior state-of-the-art German encoders as well as encoders adapted via LLM2Vec, with regard to performance and parameter-efficiency. All models, training data, checkpoints and code are publicly available, advancing the German NLP ecosystem with transparent, high-performance encoder models.

Title: Understanding Cross-Lingual Inconsistency in Large Language Models

Authors: Zheng Wei Lim, Alham Fikri Aji, Trevor Cohn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13141
Pdf URL: https://arxiv.org/pdf/2505.13141
Copy Paste: [[2505.13141]] Understanding Cross-Lingual Inconsistency in Large Language Models(https://arxiv.org/abs/2505.13141)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are demonstrably capable of cross-lingual transfer, but can produce inconsistent output when prompted with the same queries written in different languages. To understand how language models are able to generalize knowledge from one language to others, we apply the logit lens to interpret the implicit steps taken by LLMs to solve multilingual multi-choice reasoning questions. We find that LLMs predict inconsistently and are less accurate because they rely on subspaces of individual languages, rather than working in a shared semantic space. While larger models are more multilingual, we show that their hidden states are more likely to dissociate from the shared representation compared to smaller models, but that they are nevertheless more capable of retrieving knowledge embedded across different languages. Finally, we demonstrate that knowledge sharing can be modulated by steering the models' latent processing towards the shared semantic space. We find that reinforcing utilization of the shared space improves the models' multilingual reasoning performance, as a result of more knowledge transfer from, and better output consistency with, English.

Title: What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text

Authors: Aswathy Velutharambath, Roman Klinger, Kai Sassenberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13147
Pdf URL: https://arxiv.org/pdf/2505.13147
Copy Paste: [[2505.13147]] What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text(https://arxiv.org/abs/2505.13147)
Keywords: language model
Abstract: Can deception be detected solely from written text? Cues of deceptive communication are inherently subtle, even more so in text-only communication. Yet, prior studies have reported considerable success in automatic deception detection. We hypothesize that such findings are largely driven by artifacts introduced during data collection and do not generalize beyond specific datasets. We revisit this assumption by introducing a belief-based deception framework, which defines deception as a misalignment between an author's claims and true beliefs, irrespective of factual accuracy, allowing deception cues to be studied in isolation. Based on this framework, we construct three corpora, collectively referred to as DeFaBel, including a German-language corpus of deceptive and non-deceptive arguments and a multilingual version in German and English, each collected under varying conditions to account for belief change and enable cross-linguistic analysis. Using these corpora, we evaluate commonly reported linguistic cues of deception. Across all three DeFaBel variants, these cues show negligible, statistically insignificant correlations with deception labels, contrary to prior work that treats such cues as reliable indicators. We further benchmark against other English deception datasets following similar data collection protocols. While some show statistically significant correlations, effect sizes remain low and, critically, the set of predictive cues is inconsistent across datasets. We also evaluate deception detection using feature-based models, pretrained language models, and instruction-tuned large language models. While some models perform well on established deception datasets, they consistently perform near chance on DeFaBel. Our findings challenge the assumption that deception can be reliably inferred from linguistic cues and call for rethinking how deception is studied and modeled in NLP.

Title: Tianyi: A Traditional Chinese Medicine all-rounder language model and its Real-World Clinical Practice

Authors: Zhi Liu, Tao Yang, Jing Wang, Yexin Chen, Zhan Gao, Jiaxi Yang, Kui Chen, Bingji Lu, Xiaochen Li, Changyong Luo, Yan Li, Xiaohong Gu, Peng Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13156
Pdf URL: https://arxiv.org/pdf/2505.13156
Copy Paste: [[2505.13156]] Tianyi: A Traditional Chinese Medicine all-rounder language model and its Real-World Clinical Practice(https://arxiv.org/abs/2505.13156)
Keywords: language model, llm, hallucination
Abstract: Natural medicines, particularly Traditional Chinese Medicine (TCM), are gaining global recognition for their therapeutic potential in addressing human symptoms and diseases. TCM, with its systematic theories and extensive practical experience, provides abundant resources for healthcare. However, the effective application of TCM requires precise syndrome diagnosis, determination of treatment principles, and prescription formulation, which demand decades of clinical expertise. Despite advancements in TCM-based decision systems, machine learning, and deep learning research, limitations in data and single-objective constraints hinder their practical application. In recent years, large language models (LLMs) have demonstrated potential in complex tasks, but they lack specialization in TCM and face significant challenges, such as model scales too large to deploy and issues with hallucination. To address these challenges, we introduce Tianyi, an appropriately sized 7.6-billion-parameter LLM specifically designed for TCM, pre-trained and fine-tuned on diverse TCM corpora, including classical texts, expert treatises, clinical records, and knowledge graphs. Tianyi is designed to assimilate interconnected and systematic TCM knowledge through progressive learning. Additionally, we establish TCMEval, a comprehensive evaluation benchmark, to assess LLMs in TCM examinations, clinical tasks, domain-specific question-answering, and real-world trials. The extensive evaluations demonstrate the significant potential of Tianyi as an AI assistant in TCM clinical practice and research, bridging the gap between TCM knowledge and practical application.

Title: Role-Playing Evaluation for Large Language Models

Authors: Yassine El Boudouri, Walter Nuninger, Julian Alvarez, Yvan Peter
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13157
Pdf URL: https://arxiv.org/pdf/2505.13157
Copy Paste: [[2505.13157]] Role-Playing Evaluation for Large Language Models(https://arxiv.org/abs/2505.13157)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate a notable capacity for adopting personas and engaging in role-playing. However, evaluating this ability presents significant challenges, as human assessments are resource-intensive and automated evaluations can be biased. To address this, we introduce Role-Playing Eval (RPEval), a novel benchmark designed to assess LLM role-playing capabilities across four key dimensions: emotional understanding, decision-making, moral alignment, and in-character consistency. This article details the construction of RPEval and presents baseline evaluations. Our code and dataset are available at this https URL

Title: Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks

Authors: Yixuan Xu, Antoine Bosselut, Imanol Schlag
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13171
Pdf URL: https://arxiv.org/pdf/2505.13171
Copy Paste: [[2505.13171]] Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks(https://arxiv.org/abs/2505.13171)
Keywords: language model, llm
Abstract: Large language models are known to memorize parts of their training data, posing a risk of copyright violations. To systematically examine this risk, we pretrain language models (1B/3B/8B) from scratch on 83B tokens, mixing web-scale data with public domain books used to simulate copyrighted content at controlled frequencies and at lengths at least ten times longer than in prior work. We thereby identify the offset effect, a phenomenon characterized by two key findings: (1) verbatim memorization is most strongly triggered by short prefixes drawn from the beginning of the context window, with memorization counterintuitively decreasing as prefix length increases; and (2) verbatim recall declines sharply when the prefix begins at an offset from the initial tokens of the context window. We attribute this to positional fragility: models rely disproportionately on the earliest tokens in their context window as retrieval anchors, making them sensitive to even slight shifts. We further observe that when the model fails to retrieve memorized content, it often produces degenerated text. Leveraging these findings, we show that shifting sensitive data deeper into the context window suppresses both extractable memorization and degeneration. Our results suggest that positional offset is a critical and previously overlooked axis for evaluating memorization risks, since prior work implicitly assumed uniformity by probing only from the beginning of training sequences.

Title: A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs

Authors: V.S.D.S.Mahesh Akavarapu, Hrishikesh Terdalkar, Pramit Bhattacharyya, Shubhangi Agarwal, Vishakha Deulgaonkar, Pralay Manna, Chaitali Dangarikar, Arnab Bhattacharya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13173
Pdf URL: https://arxiv.org/pdf/2505.13173
Copy Paste: [[2505.13173]] A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs(https://arxiv.org/abs/2505.13173)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages -- Sanskrit, Ancient Greek and Latin -- to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named entity recognition and machine translation into English. While LLMs perform equal to or better than fine-tuned baselines on out-of-domain data, smaller models often struggle, especially with niche or abstract entity types. In addition, we concentrate on Sanskrit by presenting a factoid question-answering (QA) dataset and show that incorporating context via retrieval-augmented generation approach significantly boosts performance. In contrast, we observe pronounced performance drops for smaller LLMs across these QA tasks. These results suggest model scale as an important factor influencing cross-lingual generalization. Assuming that models used such as GPT-4o and Llama-3.1 are not instruction fine-tuned on classical languages, our findings provide insights into how LLMs may generalize on these languages and their consequent utility in classical studies.

Title: ToolSpectrum : Towards Personalized Tool Utilization for Large Language Models

Authors: Zihao Cheng, Hongru Wang, Zeming Liu, Yuhang Guo, Yuanfang Guo, Yunhong Wang, Haifeng Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13176
Pdf URL: https://arxiv.org/pdf/2505.13176
Copy Paste: [[2505.13176]] ToolSpectrum : Towards Personalized Tool Utilization for Large Language Models(https://arxiv.org/abs/2505.13176)
Keywords: language model, llm
Abstract: While integrating external tools into large language models (LLMs) enhances their ability to access real-time information and domain-specific services, existing approaches focus narrowly on functional tool selection following user instructions, overlooking context-aware personalization in tool selection. This oversight leads to suboptimal user satisfaction and inefficient tool utilization, particularly when overlapping toolsets require nuanced selection based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a benchmark designed to evaluate LLMs' capabilities in personalized tool utilization. Specifically, we formalize two key dimensions of personalization, user profile and environmental factors, and analyze their individual and synergistic impacts on tool utilization. Through extensive experiments on ToolSpectrum, we demonstrate that personalized tool utilization significantly improves user experience across diverse scenarios. However, even state-of-the-art LLMs exhibit limited ability to reason jointly about user profiles and environmental factors, often prioritizing one dimension at the expense of the other. Our findings underscore the necessity of context-aware personalization in tool-augmented LLMs and reveal critical limitations of current models. Our data and code are available at this https URL.

Title: Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

Authors: Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.13181
Pdf URL: https://arxiv.org/pdf/2505.13181
Copy Paste: [[2505.13181]] Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space(https://arxiv.org/abs/2505.13181)
Keywords: language model
Abstract: We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. It simplifies the overall modeling pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis, showing its potential for broader applications in general-purpose speech language models.
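For orientation, the energy distance between two distributions P and Q is commonly defined as 2·E‖X−Y‖ − E‖X−X′‖ − E‖Y−Y′‖, with X, X′ drawn from P and Y, Y′ from Q. The sketch below estimates it from two finite sample sets, matching the abstract's description of contrasting simulated and target samples. It illustrates the general statistic only; the exact training objective used by SLED may differ, and the function names are illustrative.

```python
import math
from itertools import product

def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x, y) with x in a, y in b."""
    return sum(math.dist(x, y) for x, y in product(a, b)) / (len(a) * len(b))

def energy_distance(simulated, target):
    """Plug-in estimate of the (squared) energy distance between two samples.

    It is zero when both sample sets coincide and grows as they move apart.
    """
    return (
        2.0 * mean_pairwise_distance(simulated, target)
        - mean_pairwise_distance(simulated, simulated)
        - mean_pairwise_distance(target, target)
    )

# Toy 2-D "latent" samples; real speech latents would be higher-dimensional.
target = [(0.0, 0.0), (1.0, 0.0)]
close = [(0.0, 0.0), (1.0, 0.0)]
far = [(3.0, 0.0), (4.0, 0.0)]

print(energy_distance(close, target))
print(energy_distance(far, target))
```

Because the estimate needs only samples from the model and from the data, it can be computed without an explicit density, which is what makes it usable with continuous latents.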

Title: Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification

Authors: Jikai Wang, Zhenxu Tian, Juntao Li, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13204
Pdf URL: https://arxiv.org/pdf/2505.13204
Copy Paste: [[2505.13204]] Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification(https://arxiv.org/abs/2505.13204)
Keywords: language model
Abstract: Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the sampled outputs of the target model. Existing methods mainly achieve draft-target alignment with training-based methods (e.g., EAGLE, Medusa), which involve considerable training costs. In this paper, we present a training-free alignment-augmented speculative decoding algorithm. We propose alignment sampling, which leverages the output distribution obtained in the prefilling phase to provide more aligned draft candidates. To further benefit from high-quality but non-aligned draft candidates, we also introduce a simple yet effective flexible verification strategy. Through an adaptive probability threshold, our approach can improve generation accuracy while further improving inference efficiency. Experiments on 8 datasets (including question answering, summarization and code completion tasks) show that our approach increases the average generation score by 3.3 points for the LLaMA3 model. Our method achieves a mean acceptance length of up to 2.39 and speeds up generation by 2.23.
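As background for the verification step, here is a minimal sketch of the standard speculative-sampling acceptance rule, in which a draft token x is accepted with probability min(1, p(x)/q(x)), where q is the draft model's distribution and p the target model's. This is the baseline scheme, not the flexible verification strategy proposed in the paper. The function name and toy distributions are illustrative, the random draws are passed in explicitly so the example is deterministic, and the resampling step after a rejection is omitted.

```python
def accepted_prefix_length(draft_tokens, draft_probs, target_probs, uniforms):
    """Count how many leading draft tokens the target model accepts.

    draft_probs[i] and target_probs[i] are the two models' distributions at
    position i; uniforms[i] is a pre-drawn number in [0, 1).
    """
    accepted = 0
    for tok, q, p, u in zip(draft_tokens, draft_probs, target_probs, uniforms):
        if u < min(1.0, p[tok] / q[tok]):
            accepted += 1
        else:
            break  # the first rejection ends the accepted prefix
    return accepted

# Toy example over a two-token vocabulary.
draft_tokens = [0, 1, 0]
draft_probs = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
target_probs = [[0.6, 0.4], [0.6, 0.4], [0.3, 0.7]]
uniforms = [0.9, 0.3, 0.5]

print(accepted_prefix_length(draft_tokens, draft_probs, target_probs, uniforms))
```

The accepted prefix length per verification round is the quantity that the "mean acceptance length" in the abstract averages.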

Title: Picturized and Recited with Dialects: A Multimodal Chinese Representation Framework for Sentiment Analysis of Classical Chinese Poetry

Authors: Xiaocong Du, Haoyu Pei, Haipeng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13210
Pdf URL: https://arxiv.org/pdf/2505.13210
Copy Paste: [[2505.13210]] Picturized and Recited with Dialects: A Multimodal Chinese Representation Framework for Sentiment Analysis of Classical Chinese Poetry(https://arxiv.org/abs/2505.13210)
Keywords: llm
Abstract: Classical Chinese poetry is a vital and enduring part of Chinese literature, conveying profound emotional resonance. Existing studies analyze sentiment based on textual meanings, overlooking the unique rhythmic and visual features inherent in poetry, especially since it is often recited and accompanied by Chinese paintings. In this work, we propose a dialect-enhanced multimodal framework for classical Chinese poetry sentiment analysis. We extract sentence-level audio features from the poetry and incorporate audio from multiple dialects, which may retain regional ancient Chinese phonetic features, enriching the phonetic representation. Additionally, we generate sentence-level visual features, and the multimodal features are fused with textual features enhanced by LLM translation through multimodal contrastive representation learning. Our framework outperforms state-of-the-art methods on two public datasets, achieving at least 2.51% improvement in accuracy and 1.63% in macro F1. We open-source the code to facilitate research in this area and provide insights for general multimodal Chinese representation.

Title: SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science

Authors: Jie Ying, Zihong Chen, Zhefan Wang, Wanli Jiang, Chenyang Wang, Zhonghang Yuan, Haoyang Su, Huanjun Kong, Fan Yang, Nanqing Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13220
Pdf URL: https://arxiv.org/pdf/2505.13220
Copy Paste: [[2505.13220]] SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science(https://arxiv.org/abs/2505.13220)
Keywords: language model, llm
Abstract: Seed science is essential for modern agriculture, directly influencing crop yields and global food security. However, challenges such as interdisciplinary complexity and high costs with limited returns hinder progress, leading to a shortage of experts and insufficient technological support. While large language models (LLMs) have shown promise across various fields, their application in seed science remains limited due to the scarcity of digital resources, complex gene-trait relationships, and the lack of standardized benchmarks. To address this gap, we introduce SeedBench -- the first multi-task benchmark specifically designed for seed science. Developed in collaboration with domain experts, SeedBench focuses on seed breeding and simulates key aspects of modern breeding processes. We conduct a comprehensive evaluation of 26 leading LLMs, encompassing proprietary, open-source, and domain-specific fine-tuned models. Our findings not only highlight the substantial gaps between the power of LLMs and the real-world seed science problems, but also make a foundational step for research on LLMs for seed design.

Title: JNLP at SemEval-2025 Task 11: Cross-Lingual Multi-Label Emotion Detection Using Generative Models

Authors: Jieying Xue, Phuong Minh Nguyen, Minh Le Nguyen, Xin Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13244
Pdf URL: https://arxiv.org/pdf/2505.13244
Copy Paste: [[2505.13244]] JNLP at SemEval-2025 Task 11: Cross-Lingual Multi-Label Emotion Detection Using Generative Models(https://arxiv.org/abs/2505.13244)
Keywords: llm
Abstract: With the rapid advancement of global digitalization, users from different countries increasingly rely on social media for information exchange. In this context, multilingual multi-label emotion detection has emerged as a critical research area. This study addresses SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. Our paper focuses on two sub-tracks of this task: (1) Track A: Multi-label emotion detection, and (2) Track B: Emotion intensity. To tackle multilingual challenges, we leverage pre-trained multilingual models and focus on two architectures: (1) a fine-tuned BERT-based classification model and (2) an instruction-tuned generative LLM. Additionally, we propose two methods for handling multi-label classification: the base method, which maps an input directly to all its corresponding emotion labels, and the pairwise method, which models the relationship between the input text and each emotion category individually. Experimental results demonstrate the strong generalization ability of our approach in multilingual emotion recognition. In Track A, our method achieved Top 4 performance across 10 languages, ranking 1st in Hindi. In Track B, our approach also secured Top 5 performance in 7 languages, highlighting its simplicity and effectiveness. Our code is available at this https URL.

Title: Natural Language Planning via Coding and Inference Scaling

Authors: Rikhil Amonkar, Ronan Le Bras, Li Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13252
Pdf URL: https://arxiv.org/pdf/2505.13252
Copy Paste: [[2505.13252]] Natural Language Planning via Coding and Inference Scaling(https://arxiv.org/abs/2505.13252)
Keywords: llm
Abstract: Real-life textual planning tasks such as meeting scheduling pose a considerable challenge to LLMs, especially when the complexity is high. While previous work primarily studied auto-regressive generation of plans with closed-source models, we systematically evaluate both closed- and open-source models, including those that scale output length with complexity during inference, in generating programs, which are executed to output the plan. We consider not only standard Python code, but also code for a constraint satisfaction problem solver. Despite the algorithmic nature of the task, we show that programming often but not always outperforms planning. Our detailed error analysis also indicates a lack of robustness and efficiency in the generated code that hinders generalization.

Title: HeteroSpec: Leveraging Contextual Heterogeneity for Efficient Speculative Decoding

Authors: Siran Liu, Yang Ye, Qianchao Zhu, Zheng Cao, Yongchao He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13254
Pdf URL: https://arxiv.org/pdf/2505.13254
Copy Paste: [[2505.13254]] HeteroSpec: Leveraging Contextual Heterogeneity for Efficient Speculative Decoding(https://arxiv.org/abs/2505.13254)
Keywords: language model, llm
Abstract: Autoregressive decoding, the standard approach for Large Language Model (LLM) inference, remains a significant bottleneck due to its sequential nature. While speculative decoding algorithms mitigate this inefficiency through parallel verification, they fail to exploit the inherent heterogeneity in linguistic complexity, a key factor leading to suboptimal resource allocation. We address this by proposing HeteroSpec, a heterogeneity-adaptive speculative decoding framework that dynamically optimizes computational resource allocation based on linguistic context complexity. HeteroSpec introduces two key mechanisms: (1) A novel cumulative meta-path Top-$K$ entropy metric for efficiently identifying predictable contexts. (2) A dynamic resource allocation strategy based on data-driven entropy partitioning, enabling adaptive speculative expansion and pruning tailored to local context difficulty. Evaluated on five public benchmarks and four models, HeteroSpec achieves an average speedup of 4.26$\times$. It consistently outperforms state-of-the-art EAGLE-3 across speedup rates, average acceptance length, and verification cost. Notably, HeteroSpec requires no draft model retraining, incurs minimal overhead, and is orthogonal to other acceleration techniques. It demonstrates enhanced acceleration with stronger draft models, establishing a new paradigm for context-aware LLM inference acceleration.
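To make the idea of an entropy signal over the top-K candidates concrete, the sketch below computes the Shannon entropy of a next-token distribution restricted to its K most likely tokens and renormalized. This is only the basic building block under stated assumptions: the paper's cumulative meta-path Top-K entropy is not specified in the abstract and is not reproduced here, and the function name is hypothetical.

```python
import math

def top_k_entropy(probs, k):
    """Shannon entropy (in nats) of the renormalized top-k probabilities.

    Low values mean one candidate dominates, i.e. the context is predictable.
    Assumes at least one of the top-k probabilities is positive.
    """
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)

# A peaked distribution (predictable context) versus a flat one.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]

print(top_k_entropy(peaked, k=2))
print(top_k_entropy(flat, k=2))
```

A decoder could, for example, spend more speculative budget where this value is low and prune where it is high. The threshold and the budgeting policy would be design choices that this sketch does not cover.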

Title: WikiPersonas: What Can We Learn From Personalized Alignment to Famous People?

Authors: Zilu Tang, Afra Feyza Akyürek, Ekin Akyürek, Derry Wijaya
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13257
Pdf URL: https://arxiv.org/pdf/2505.13257
Copy Paste: [[2505.13257]] WikiPersonas: What Can We Learn From Personalized Alignment to Famous People?(https://arxiv.org/abs/2505.13257)
Keywords: prompt
Abstract: Preference alignment has become a standard pipeline in finetuning models to follow generic human preferences. The majority of work seeks to optimize models to produce responses that would be preferable on average, simplifying the diverse and often contradicting space of human preferences. While research has increasingly focused on personalized alignment, that is, adapting models to individual user preferences, there is a lack of personalized preference datasets that focus on nuanced individual-level preferences. To address this, we introduce WikiPersona: the first fine-grained personalization using well-documented, famous individuals. Our dataset challenges models to align with these personas through an interpretable process: generating verifiable textual descriptions of a persona's background and preferences in addition to alignment. We systematically evaluate different personalization approaches and find that, whereas few-shot prompting with preferences and fine-tuning fail to simultaneously ensure effectiveness and efficiency, using inferred personal preferences as prefixes enables effective personalization, especially on topics where preferences clash, and leads to more equitable generalization across unseen personas.

Title: Effective and Transparent RAG: Adaptive-Reward Reinforcement Learning for Decision Traceability

Authors: Jingyi Ren, Yekun Xu, Xiaolong Wang, Weitao Li, Weizhi Ma, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13258
Pdf URL: https://arxiv.org/pdf/2505.13258
Copy Paste: [[2505.13258]] Effective and Transparent RAG: Adaptive-Reward Reinforcement Learning for Decision Traceability(https://arxiv.org/abs/2505.13258)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive domains. However, although RAG has achieved success across distinct domains, some challenges remain unsolved: 1) Effectiveness. Existing research mainly focuses on developing more powerful RAG retrievers, but how can the generator's (LLM's) ability to utilize the retrieved information for reasoning and generation be enhanced? 2) Transparency. Most RAG methods ignore which retrieved content actually contributes to the reasoning process, resulting in a lack of interpretability and visibility. To address these challenges, we propose ARENA (Adaptive-Rewarded Evidence Navigation Agent), a transparent RAG generator framework trained via reinforcement learning (RL) with our proposed rewards. Based on structured generation and adaptive reward calculation, our RL-based training enables the model to identify key evidence, perform structured reasoning, and generate answers with interpretable decision traces. Applied to Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct, extensive experiments against various RAG baselines demonstrate that our model achieves 10-30% improvements on all multi-hop QA datasets, which is comparable to state-of-the-art commercially developed LLMs (e.g., OpenAI-o1, DeepSeek-R1). Further analyses show that ARENA can be flexibly applied to new datasets without extra training. Our models and code are publicly released.

Title: From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

Authors: Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13259
Pdf URL: https://arxiv.org/pdf/2505.13259
Copy Paste: [[2505.13259]] From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery(https://arxiv.org/abs/2505.13259)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery, evolving from task-specific automation tools into increasingly autonomous agents and fundamentally redefining research processes and human-AI collaboration. This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science. Through the lens of the scientific method, we introduce a foundational three-level taxonomy (Tool, Analyst, and Scientist) to delineate their escalating autonomy and evolving responsibilities within the research lifecycle. We further identify pivotal challenges and future research trajectories such as robotic automation, self-improvement, and ethical governance. Overall, this survey provides a conceptual architecture and strategic foresight to navigate and shape the future of AI-driven scientific discovery, fostering both rapid innovation and responsible advancement. GitHub Repository: this https URL.

Title: CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning

Authors: Lei Sheng, Shuai-Shuai Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13271
Pdf URL: https://arxiv.org/pdf/2505.13271
Copy Paste: [[2505.13271]] CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning(https://arxiv.org/abs/2505.13271)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency may select suboptimal outputs despite majority votes, while Self-Correction typically addresses only syntactic errors. To leverage the strengths of both approaches, we propose CSC-SQL, a novel method that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two most frequently occurring outputs from parallel sampling and feeds them into a merge revision model for correction. Additionally, we employ the Group Relative Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and revision models via reinforcement learning, significantly enhancing output quality. Experimental results confirm the effectiveness and generalizability of CSC-SQL. On the BIRD development set, our 3B model achieves 65.28% execution accuracy, while the 7B model achieves 69.19%. The code will be open sourced at this https URL.
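The selection step described in the CSC-SQL abstract (keep the two most frequent outputs from parallel sampling, then hand them to a revision model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `normalize`, `top_two_candidates`, and the `revise` callback are hypothetical names, and grouping samples by normalized SQL text is an assumption made only to keep the example self-contained.

```python
from collections import Counter
from typing import Callable, List


def normalize(sql: str) -> str:
    """Crude canonical form (assumption): lowercase, collapse whitespace, drop a trailing ';'."""
    return " ".join(sql.lower().split()).rstrip(";").strip()


def top_two_candidates(samples: List[str]) -> List[str]:
    """Return up to two most frequent normalized queries, most frequent first."""
    counts = Counter(normalize(s) for s in samples)
    return [sql for sql, _ in counts.most_common(2)]


def corrective_self_consistency(
    samples: List[str],
    revise: Callable[[str, str], str],
) -> str:
    """If all samples agree, return that query; otherwise let `revise`
    (a stand-in for the merge revision model) reconcile the top two."""
    candidates = top_two_candidates(samples)
    if len(candidates) == 1:
        return candidates[0]
    return revise(candidates[0], candidates[1])
```

Plain majority voting would stop after picking the first candidate; passing the runner-up to the revision step is what distinguishes the combined scheme the abstract describes.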

Title: I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models

Authors: Alice Plebe, Timothy Douglas, Diana Riazi, R. Maria del Rio-Chanona
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13302
Pdf URL: https://arxiv.org/pdf/2505.13302
Copy Paste: [[2505.13302]] I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models(https://arxiv.org/abs/2505.13302)
Keywords: language model, prompt
Abstract: Large language models are increasingly integrated into news recommendation systems, raising concerns about their role in spreading misinformation. In humans, visual content is known to boost credibility and shareability of information, yet its effect on vision-language models (VLMs) remains unclear. We present the first study examining how images influence VLMs' propensity to reshare news content, whether this effect varies across model families, and how persona conditioning and content attributes modulate this behavior. To support this analysis, we introduce two methodological contributions: a jailbreaking-inspired prompting strategy that elicits resharing decisions from VLMs while simulating users with antisocial traits and political alignments; and a multimodal dataset of fact-checked political news from PolitiFact, paired with corresponding images and ground-truth veracity labels. Experiments across model families reveal that image presence increases resharing rates by 4.8% for true news and 15.0% for false news. Persona conditioning further modulates this effect: Dark Triad traits amplify resharing of false news, whereas Republican-aligned profiles exhibit reduced veracity sensitivity. Of all the tested models, only Claude-3-Haiku demonstrates robustness to visual misinformation. These findings highlight emerging risks in multimodal model behavior and motivate the development of tailored evaluation frameworks and mitigation strategies for personalized AI systems. Code and dataset are available at: this https URL

Title: RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning

Authors: Qiguang Chen, Libo Qin, Jinhao Liu, Yue Liao, Jiaqi Wang, Jingxuan Zhou, Wanxiang Che
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.13307
Pdf URL: https://arxiv.org/pdf/2505.13307
Copy Paste: [[2505.13307]] RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning(https://arxiv.org/abs/2505.13307)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks, spurring research into its underlying mechanisms. However, two primary challenges remain for real-world applications: (1) the lack of quantitative metrics and actionable guidelines for evaluating and optimizing measurable boundaries of CoT capability, and (2) the absence of methods to assess boundaries of unmeasurable CoT capability, such as multimodal perception. To address these gaps, we introduce the Reasoning Boundary Framework++ (RBF++). To tackle the first challenge, we define the reasoning boundary (RB) as the maximum limit of CoT performance. We also propose a combination law for RBs, enabling quantitative analysis and offering actionable guidance across various CoT tasks. For the second challenge, particularly in multimodal scenarios, we introduce a constant assumption, which replaces unmeasurable RBs with scenario-specific constants. Additionally, we propose the reasoning boundary division mechanism, which divides unmeasurable RBs into two sub-boundaries, facilitating the quantification and optimization of both unmeasurable domain knowledge and multimodal perception capabilities. Extensive experiments involving 38 models across 13 tasks validate the feasibility of our framework in cross-modal settings. Additionally, we evaluate 10 CoT strategies, offer insights into optimization and decay from two complementary perspectives, and expand evaluation benchmarks for measuring RBs in LLM reasoning. We hope this work advances the understanding of RBs and optimization strategies in LLMs. Code and data are available at this https URL.

Title: GUARD: Generation-time LLM Unlearning via Adaptive Restriction and Detection

Authors: Zhijie Deng, Chris Yuhao Liu, Zirui Pang, Xinlei He, Lei Feng, Qi Xuan, Zhaowei Zhu, Jiaheng Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13312
Pdf URL: https://arxiv.org/pdf/2505.13312
Copy Paste: [[2505.13312]] GUARD: Generation-time LLM Unlearning via Adaptive Restriction and Detection(https://arxiv.org/abs/2505.13312)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in memorizing vast amounts of knowledge across diverse domains. However, the ability to selectively forget specific knowledge is critical for ensuring the safety and compliance of deployed models. Existing unlearning efforts typically fine-tune the model with resources such as forget data, retain data, and a calibration model. These additional gradient steps blur the decision boundary between forget and retain knowledge, so unlearning often comes at the expense of overall performance. To avoid the negative impact of fine-tuning, it would be better to unlearn solely at inference time by safely guarding the model against generating responses related to the forget target, without destroying the fluency of text generation. In this work, we propose Generation-time Unlearning via Adaptive Restriction and Detection (GUARD), a framework that enables dynamic unlearning during LLM generation. Specifically, we first employ a prompt classifier to detect unlearning targets and extract the corresponding forbidden token. We then dynamically penalize and filter candidate tokens during generation using a combination of token matching and semantic matching, effectively preventing the model from leaking the forgotten content. Experimental results on copyright content unlearning tasks over the Harry Potter dataset and the MUSE benchmark, as well as entity unlearning tasks on the TOFU dataset, demonstrate that GUARD achieves strong forget quality across various tasks while causing almost no degradation to the LLM's general capabilities, striking an excellent trade-off between forgetting and utility.
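The generation-time restriction described in the GUARD abstract (penalizing or filtering candidate tokens that match a forbidden token) can be illustrated with a toy example. This sketch covers exact token matching only and leaves out the semantic-matching component; the function names, the dictionary-of-logits representation, and the `penalty` value are assumptions for illustration, not the paper's interface.

```python
import math
from typing import Dict, Iterable


def restrict_candidates(
    logits: Dict[str, float],
    forbidden: Iterable[str],
    penalty: float = 10.0,
    hard_filter: bool = False,
) -> Dict[str, float]:
    """Lower (or remove) the score of any candidate token that matches a forbidden token.

    Matching is case-insensitive exact matching after stripping whitespace (assumption).
    """
    blocked = {tok.strip().lower() for tok in forbidden}
    adjusted = {}
    for token, score in logits.items():
        if token.strip().lower() in blocked:
            # Hard filtering removes the token from contention; soft penalty only demotes it.
            adjusted[token] = -math.inf if hard_filter else score - penalty
        else:
            adjusted[token] = score
    return adjusted


def greedy_pick(logits: Dict[str, float]) -> str:
    """Pick the highest-scoring candidate token."""
    return max(logits, key=logits.get)
```

Because only the candidate scores are adjusted at decoding time, the model weights stay untouched, which is the property the abstract contrasts with fine-tuning-based unlearning.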

Title: Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges

Authors: Hongru Wang, Wenyu Huang, Yufei Wang, Yuanhao Xi, Jianqiao Lu, Huan Zhang, Nan Hu, Zeming Liu, Jeff Z. Pan, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13328
Pdf URL: https://arxiv.org/pdf/2505.13328
Copy Paste: [[2505.13328]] Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges(https://arxiv.org/abs/2505.13328)
Keywords: language model, llm, agent
Abstract: Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fill this gap, we propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use, across six key tasks in three stages: 1) tool creation; 2) tool utilization: tool awareness, tool selection, tool execution; and 3) role-consistent response: response generation and role play. Furthermore, we build VirtualMobile, an embodied virtual mobile evaluation environment to simulate API calls and assess the robustness of the created APIs (we use the terms tools and APIs interchangeably, as there are no significant differences between them in this paper). Taking advantage of these artifacts, we conduct a comprehensive evaluation of 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that existing state-of-the-art LLMs still cannot use tools well over long horizons.

Title: Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation

Authors: Qiongqiong Wang, Hardik B. Sailor, Tianchi Liu, Ai Ti Aw
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2505.13338
Pdf URL: https://arxiv.org/pdf/2505.13338
Copy Paste: [[2505.13338]] Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation(https://arxiv.org/abs/2505.13338)
Keywords: llm
Abstract: Current speech-LLMs exhibit limited capability in contextual reasoning alongside paralinguistic understanding, primarily due to the lack of Question-Answer (QA) datasets that cover both aspects. We propose a novel framework for dataset generation from in-the-wild speech data that integrates contextual reasoning with paralinguistic information. It consists of pseudo paralinguistic label-based data condensation of in-the-wild speech and LLM-based Contextual Paralinguistic QA (CPQA) generation. Its effectiveness is validated by a strong correlation between evaluations of the Qwen2-Audio-7B-Instruct model on a dataset created by our framework and on a human-generated CPQA dataset. The results also reveal the speech-LLM's limitations in handling empathetic reasoning tasks, highlighting the need for such datasets and more robust models. The proposed framework is the first of its kind and has potential for training more robust speech-LLMs with paralinguistic reasoning capabilities.

Title: J4R: Learning to Judge with Equivalent Initial State Group Relative Preference Optimization

Authors: Austin Xu, Yilun Zhou, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13346
Pdf URL: https://arxiv.org/pdf/2505.13346
Copy Paste: [[2505.13346]] J4R: Learning to Judge with Equivalent Initial State Group Relative Preference Optimization(https://arxiv.org/abs/2505.13346)
Keywords: language model, gpt, llm, chat
Abstract: To keep pace with the increasing pace of large language models (LLM) development, model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality, but struggle in reasoning intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.
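The positional bias mentioned in the J4R abstract is commonly probed by asking a pairwise judge the same question twice with the two responses swapped and checking whether the same response wins. The sketch below shows only that measurement, not EIS-GRPO itself; the `Judge` signature and the `"first"`/`"second"` verdict labels are assumptions.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: (question, first_response, second_response) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def positional_consistency(judge: Judge, items: List[Tuple[str, str, str]]) -> float:
    """Fraction of (question, a, b) items where the same response wins in both orderings."""
    if not items:
        return 0.0
    consistent = 0
    for question, a, b in items:
        original = judge(question, a, b)  # a shown first
        swapped = judge(question, b, a)   # b shown first
        winner_original = a if original == "first" else b
        winner_swapped = b if swapped == "first" else a
        if winner_original == winner_swapped:
            consistent += 1
    return consistent / len(items)
```

A judge that always prefers whichever response appears first scores 0.0 on this check, while a judge whose verdict depends only on response content scores 1.0.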

Title: Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks

Authors: Narek Maloyan, Bislan Ashinov, Dmitry Namiot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13348
Pdf URL: https://arxiv.org/pdf/2505.13348
Copy Paste: [[2505.13348]] Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks(https://arxiv.org/abs/2505.13348)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly employed as evaluators (LLM-as-a-Judge) for assessing the quality of machine-generated text. This paradigm offers scalability and cost-effectiveness compared to human annotation. However, the reliability and security of such systems, particularly their robustness against adversarial manipulations, remain critical concerns. This paper investigates the vulnerability of LLM-as-a-Judge architectures to prompt-injection attacks, where malicious inputs are designed to compromise the judge's decision-making process. We formalize two primary attack strategies: Comparative Undermining Attack (CUA), which directly targets the final decision output, and Justification Manipulation Attack (JMA), which aims to alter the model's generated reasoning. Using the Greedy Coordinate Gradient (GCG) optimization method, we craft adversarial suffixes appended to one of the responses being compared. Experiments conducted on the MT-Bench Human Judgments dataset with open-source instruction-tuned LLMs (Qwen2.5-3B-Instruct and Falcon3-3B-Instruct) demonstrate significant susceptibility. The CUA achieves an Attack Success Rate (ASR) exceeding 30%, while JMA also shows notable effectiveness. These findings highlight substantial vulnerabilities in current LLM-as-a-Judge systems, underscoring the need for robust defense mechanisms and further research into adversarial evaluation and trustworthiness in LLM-based assessment frameworks.
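The abstract reports an Attack Success Rate (ASR) without defining it. One common reading, assumed here purely for illustration, counts an attack as successful when the judge did not already pick the attacker's target on the clean input but does pick it once the adversarial suffix is appended. The function name and the list-of-labels inputs are hypothetical.

```python
from typing import Sequence


def attack_success_rate(
    clean_choices: Sequence[str],
    attacked_choices: Sequence[str],
    target_choices: Sequence[str],
) -> float:
    """ASR over comparisons where the judge did not already choose the target (assumed definition)."""
    eligible = 0
    successes = 0
    for clean, attacked, target in zip(clean_choices, attacked_choices, target_choices):
        if clean == target:
            continue  # nothing to flip: the judge already preferred the target
        eligible += 1
        if attacked == target:
            successes += 1
    return successes / eligible if eligible else 0.0
```

Excluding comparisons the target already wins keeps the rate from being inflated by cases where the suffix changed nothing.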

Title: Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning

Authors: Adam Štorek, Mukur Gupta, Samira Hajizadeh, Prashast Srivastava, Suman Jana
Subjects: cs.CL, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2505.13353
Pdf URL: https://arxiv.org/pdf/2505.13353
Copy Paste: [[2505.13353]] Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning(https://arxiv.org/abs/2505.13353)
Keywords: language model, llm, long context
Abstract: Although modern Large Language Models (LLMs) support extremely large contexts, their effectiveness in utilizing long context for code reasoning remains unclear. This paper investigates LLM reasoning ability over code snippets within large repositories and how it relates to their recall ability. Specifically, we differentiate between lexical code recall (verbatim retrieval) and semantic code recall (remembering what the code does). To measure semantic recall, we propose SemTrace, a code reasoning technique where the impact of specific statements on output is attributable and unpredictable. We also present a method to quantify semantic recall sensitivity in existing benchmarks. Our evaluation of state-of-the-art LLMs reveals a significant drop in code reasoning accuracy as a code snippet approaches the middle of the input context, particularly with techniques requiring high semantic recall like SemTrace. Moreover, we find that lexical recall varies by granularity, with models excelling at function retrieval but struggling with line-by-line recall. Notably, a disconnect exists between lexical and semantic recall, suggesting different underlying mechanisms. Finally, our findings indicate that current code reasoning benchmarks may exhibit low semantic recall sensitivity, potentially underestimating LLM challenges in leveraging in-context information.

Title: What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts

Authors: Chenyang Yang, Yike Shi, Qianou Ma, Michael Xieyang Liu, Christian Kästner, Tongshuang Wu
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.13360
Pdf URL: https://arxiv.org/pdf/2505.13360
Copy Paste: [[2505.13360]] What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts(https://arxiv.org/abs/2505.13360)
Keywords: llm, prompt
Abstract: Building LLM-powered software requires developers to communicate their requirements through natural language, but developer prompts are frequently underspecified, failing to fully capture many user-important requirements. In this paper, we present an in-depth analysis of prompt underspecification, showing that while LLMs can often (41.1%) guess unspecified requirements by default, such behavior is less robust: Underspecified prompts are 2x more likely to regress over model or prompt changes, sometimes with accuracy drops by more than 20%. We then demonstrate that simply adding more requirements to a prompt does not reliably improve performance, due to LLMs' limited instruction-following capabilities and competing constraints, and standard prompt optimizers do not offer much help. To address this, we introduce novel requirements-aware prompt optimization mechanisms that can improve performance by 4.8% on average over baselines that naively specify everything in the prompt. Beyond prompt optimization, we envision that effectively managing prompt underspecification requires a broader process, including proactive requirements discovery, evaluation, and monitoring.

Title: Thinkless: LLM Learns When to Think

Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13379
Pdf URL: https://arxiv.org/pdf/2505.13379
Copy Paste: [[2505.13379]] Thinkless: LLM Learns When to Think(https://arxiv.org/abs/2505.13379)
Keywords: language model, llm, chain-of-thought
Abstract: Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, one for concise responses and one for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50%-90%, significantly improving the efficiency of Reasoning Language Models. The code is available at this https URL
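The decoupling idea in the Thinkless abstract (one loss term for the control token that selects the reasoning mode, a separate term for the response tokens) can be sketched for a single sampled response. This is a rough illustration under stated assumptions, not the DeGRPO objective from the paper: the weighting coefficient `alpha`, the length normalization of the response term, and the omission of clipping and KL terms are all simplifications.

```python
from typing import Sequence, Tuple


def decoupled_terms(
    control_logprob: float,
    response_logprobs: Sequence[float],
    advantage: float,
    alpha: float = 1.0,
) -> Tuple[float, float]:
    """Return (control_token_loss, response_loss) for one sampled response.

    control_logprob   : log-probability of the chosen control token (mode selection)
    response_logprobs : log-probabilities of the response tokens that follow it
    advantage         : group-relative advantage of this sample (as in GRPO)
    alpha             : hypothetical weight on the control-token term
    """
    if not response_logprobs:
        raise ValueError("response_logprobs must not be empty")
    # Mode selection is driven by a single token, so it gets its own weighted term.
    control_loss = -alpha * advantage * control_logprob
    # The response term is averaged over its own length only, so a long response
    # does not change how strongly the control token is updated.
    response_loss = -advantage * sum(response_logprobs) / len(response_logprobs)
    return control_loss, response_loss
```

Keeping the two terms separate lets the mode-selection signal be tuned through `alpha` independently of response length, which is the kind of fine-grained control the abstract attributes to the decoupled formulation.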

Title: R3: Robust Rubric-Agnostic Reward Models

Authors: David Anugraha, Zilu Tang, Lester James V. Miranda, Hanyang Zhao, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Wijaya, Genta Indra Winata
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13388
Pdf URL: https://arxiv.org/pdf/2505.13388
Copy Paste: [[2505.13388]] R3: Robust Rubric-Agnostic Reward Models(https://arxiv.org/abs/2505.13388)
Keywords: language model
Abstract: Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at this https URL

Title: MR. Judge: Multimodal Reasoner as a Judge

Authors: Renjie Pi, Felix Bai, Qibin Chen, Simon Wang, Jiulong Shan, Kieran Liu, Meng Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13403
Pdf URL: https://arxiv.org/pdf/2505.13403
Copy Paste: [[2505.13403]] MR. Judge: Multimodal Reasoner as a Judge(https://arxiv.org/abs/2505.13403)
Keywords: language model, gpt, llm, prompt
Abstract: The paradigm of using Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) as evaluative judges has emerged as an effective approach in RLHF and inference-time scaling. In this work, we propose Multimodal Reasoner as a Judge (MR. Judge), a paradigm for empowering general-purpose MLLM judges with strong reasoning capabilities. Instead of directly assigning scores for each response, we formulate the judgement process as a reasoning-inspired multiple-choice problem. Specifically, the judge model first conducts deliberate reasoning covering different aspects of the responses and eventually selects the best response from them. This reasoning process not only improves the interpretability of the judgement, but also greatly enhances the performance of MLLM judges. To cope with the lack of questions with scored responses, we propose the following strategy to achieve automatic annotation: 1) Reverse Response Candidates Synthesis: starting from a supervised fine-tuning (SFT) dataset, we treat the original response as the best candidate and prompt the MLLM to generate plausible but flawed negative candidates. 2) Text-based reasoning extraction: we carefully design a data synthesis pipeline for distilling the reasoning capability from a text-based reasoning model, which is adopted to enable the MLLM judges to regain complex reasoning ability via warm-up supervised fine-tuning. Experiments demonstrate that our MR. Judge is effective across a wide range of tasks. Specifically, our MR. Judge-7B surpasses GPT-4o by 9.9% on VL-RewardBench, and improves performance on MM-Vet during inference-time scaling by up to 7.7%.

Title: Granary: Speech Recognition and Translation Dataset in 25 European Languages

Authors: Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, Boris Ginsburg
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2505.13404
Pdf URL: https://arxiv.org/pdf/2505.13404
Copy Paste: [[2505.13404]] Granary: Speech Recognition and Translation Dataset in 25 European Languages(https://arxiv.org/abs/2505.13404)
Keywords: llm, hallucination
Abstract: Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approximately 50% less data. The dataset will be made available at this https URL

Title: AdaptThink: Reasoning Models Can Learn When to Think

Authors: Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, Juanzi Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13417
Pdf URL: https://arxiv.org/pdf/2505.13417
Copy Paste: [[2505.13417]] AdaptThink: Reasoning Models Can Learn When to Think(https://arxiv.org/abs/2505.13417)
Keywords: prompt
Abstract: Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our code and models are available at this https URL.

Title: Dementia Through Different Eyes: Explainable Modeling of Human and LLM Perceptions for Early Awareness

Authors: Lotem Peled-Cohen, Maya Zadok, Nitay Calderon, Hila Gonen, Roi Reichart
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13418
Pdf URL: https://arxiv.org/pdf/2505.13418
Copy Paste: [[2505.13418]] Dementia Through Different Eyes: Explainable Modeling of Human and LLM Perceptions for Early Awareness(https://arxiv.org/abs/2505.13418)
Keywords: llm
Abstract: Cognitive decline often surfaces in language years before diagnosis. It is frequently non-experts, such as those closest to the patient, who first sense a change and raise concern. As LLMs become integrated into daily communication and used over prolonged periods, it may even be an LLM that notices something is off. But what exactly do they notice--and should be noticing--when making that judgment? This paper investigates how dementia is perceived through language by non-experts. We presented transcribed picture descriptions to non-expert humans and LLMs, asking them to intuitively judge whether each text was produced by someone healthy or with dementia. We introduce an explainable method that uses LLMs to extract high-level, expert-guided features representing these picture descriptions, and use logistic regression to model human and LLM perceptions and compare with clinical diagnoses. Our analysis reveals that human perception of dementia is inconsistent and relies on a narrow, and sometimes misleading, set of cues. LLMs, by contrast, draw on a richer, more nuanced feature set that aligns more closely with clinical patterns. Still, both groups show a tendency toward false negatives, frequently overlooking dementia cases. Through our interpretable framework and the insights it provides, we hope to help non-experts better recognize the linguistic signs that matter.

Title: SMOTExT: SMOTE meets Large Language Models

Authors: Mateusz Bystroński, Mikołaj Hołysz, Grzegorz Piotrowski, Nitesh V. Chawla, Tomasz Kajdanowicz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.13434
Pdf URL: https://arxiv.org/pdf/2505.13434
Copy Paste: [[2505.13434]] SMOTExT: SMOTE meets Large Language Models(https://arxiv.org/abs/2505.13434)
Keywords: language model
Abstract: Data scarcity and class imbalance are persistent challenges in training robust NLP models, especially in specialized domains or low-resource settings. We propose a novel technique, SMOTExT, that adapts the idea of the Synthetic Minority Over-sampling Technique (SMOTE) to textual data. Our method generates new synthetic examples by interpolating between BERT-based embeddings of two existing examples and then decoding the resulting latent point into text with the xRAG architecture. By leveraging xRAG's cross-modal retrieval-generation framework, we can effectively turn interpolated vectors into coherent text. While this is preliminary work supported by qualitative outputs only, the method shows strong potential for knowledge distillation and data augmentation in few-shot settings. Notably, our approach also shows promise for privacy-preserving machine learning: in early experiments, training models solely on generated data achieved comparable performance to models trained on the original dataset. This suggests a viable path toward safe and effective learning under data protection constraints.
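
The interpolation step named in the SMOTExT abstract can be illustrated with a minimal sketch. It shows only the classic SMOTE-style mixing of two vectors; the BERT encoder and the xRAG decoder are not reproduced here. The function name `smote_interpolate`, the parameter `lam`, and the toy three-dimensional "embeddings" are assumptions made for illustration and are not taken from the paper.

```python
import random


def smote_interpolate(a, b, lam=None, rng=random):
    """Return a point on the line segment between vectors a and b.

    Hypothetical helper: `a` and `b` stand in for sentence embeddings.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if lam is None:
        # Classic SMOTE draws the mixing weight uniformly at random.
        lam = rng.random()
    return [x + lam * (y - x) for x, y in zip(a, b)]


# Toy vectors standing in for embeddings of two minority-class examples.
emb_a = [0.0, 2.0, -1.0]
emb_b = [1.0, 0.0, 1.0]

midpoint = smote_interpolate(emb_a, emb_b, lam=0.5)
print(midpoint)  # [0.5, 1.0, 0.0]
```

In the method described by the abstract, the interpolated vector would then be decoded back into text; that decoding step is outside the scope of this sketch.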

Title: ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Authors: Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.13444
Pdf URL: https://arxiv.org/pdf/2505.13444
Copy Paste: [[2505.13444]] ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models(https://arxiv.org/abs/2505.13444)
Keywords: language model
Abstract: Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.

Title: CIE: Controlling Language Model Text Generations Using Continuous Signals

Authors: Vinay Samuel, Harshita Diddee, Yiming Zhang, Daphne Ippolito
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13448
Pdf URL: https://arxiv.org/pdf/2505.13448
Copy Paste: [[2505.13448]] CIE: Controlling Language Model Text Generations Using Continuous Signals(https://arxiv.org/abs/2505.13448)
Keywords: language model, prompt
Abstract: Aligning language models with user intent is becoming increasingly relevant to enhance user experience. This calls for designing methods that allow users to control the properties of the language that LMs generate, for example the length of the generation, the complexity of the language chosen, the sentiment, and the tone. Most existing work attempts to integrate users' control by conditioning LM generations on natural language prompts or discrete control signals, which are often brittle and hard to scale. In this work, we are interested in continuous control signals, ones that exist along a spectrum that can't easily be captured in a natural language prompt or via existing techniques in conditional generation. Through a case study in controlling the precise response-length of generations produced by LMs, we demonstrate how, after fine-tuning, behaviors of language models can be controlled via continuous signals -- as vectors that are interpolated between a "low" and a "high" token embedding. Our method more reliably exerts response-length control than in-context learning methods or fine-tuning methods that represent the control signal as a discrete signal. Our full open-sourced code and datasets are available at this https URL.
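
As a rough illustration of a control vector "interpolated between a low and a high token embedding", the sketch below maps a requested response length onto a weight in [0, 1] and mixes two vectors linearly. The linear mapping, the clamping, the function name `control_embedding`, and the toy embeddings are all assumptions for illustration; the abstract does not specify how the authors compute the interpolation weight.

```python
def control_embedding(value, lo, hi, low_emb, high_emb):
    """Mix two embeddings according to where `value` falls in [lo, hi].

    Hypothetical helper: value == lo yields `low_emb`, value == hi yields
    `high_emb`, and values in between yield a linear blend.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    if len(low_emb) != len(high_emb):
        raise ValueError("embeddings must have the same dimension")
    # Normalise the control value and clamp it to the supported range.
    t = (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))
    return [(1.0 - t) * l + t * h for l, h in zip(low_emb, high_emb)]


# Toy vectors standing in for learned "low" and "high" token embeddings.
low_emb = [0.0, 0.0, 4.0]
high_emb = [2.0, -2.0, 0.0]

# Requesting a length of 50 on a 0..200 scale gives a weight of 0.25.
vec = control_embedding(50, 0, 200, low_emb, high_emb)
print(vec)  # [0.5, -0.5, 3.0]
```

In a fine-tuned model, a vector like `vec` would be supplied in place of a discrete control token; how it is injected into the model is not covered by this sketch.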