2025-05-06

Title: Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

Authors: Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.01456
Pdf URL: https://arxiv.org/pdf/2505.01456
Copy Paste: [[2505.01456]] Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation(https://arxiv.org/abs/2505.01456)
Keywords: llm, prompt
Abstract: LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates varying-proximity samples for testing generalization and specificity, followed by manual filtering for maintaining high quality. We then evaluate six defense objectives against seven attacks (four whitebox, three blackbox), including a novel whitebox method leveraging interpretability of hidden states. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.

Title: MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling

Authors: Abdoul Majid O. Thiombiano, Brahim Hnich, Ali Ben Mrad, Mohamed Wiem Mkaouer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01459
Pdf URL: https://arxiv.org/pdf/2505.01459
Copy Paste: [[2505.01459]] MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling(https://arxiv.org/abs/2505.01459)
Keywords: language model, llm
Abstract: This paper introduces MoxE, a novel architecture that synergistically combines the Extended Long Short-Term Memory (xLSTM) with the Mixture of Experts (MoE) framework to address critical scalability and efficiency challenges in large language models (LLMs). The proposed method effectively leverages xLSTM's innovative memory structures while strategically introducing sparsity through MoE to substantially reduce computational overhead. At the heart of our approach is a novel entropy-based routing mechanism, designed to dynamically route tokens to specialized experts, thereby ensuring efficient and balanced resource utilization. This entropy awareness enables the architecture to effectively manage both rare and common tokens, with mLSTM blocks being favored to handle rare tokens. To further enhance generalization, we introduce a suite of auxiliary losses, including entropy-based and group-wise balancing losses, ensuring robust performance and efficient training. Theoretical analysis and empirical evaluations rigorously demonstrate that MoxE achieves significant efficiency gains and enhanced effectiveness compared to existing approaches, marking a notable advancement in scalable LLM architectures.

Title: SymPlanner: Deliberate Planning in Language Models with Symbolic Representation

Authors: Siheng Xiong, Jieyu Zhou, Zhangding Liu, Yusen Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01479
Pdf URL: https://arxiv.org/pdf/2505.01479
Copy Paste: [[2505.01479]] SymPlanner: Deliberate Planning in Language Models with Symbolic Representation(https://arxiv.org/abs/2505.01479)
Keywords: language model
Abstract: Planning remains a core challenge for language models (LMs), particularly in domains that require coherent multi-step action sequences grounded in external constraints. We introduce SymPlanner, a novel framework that equips LMs with structured planning capabilities by interfacing them with a symbolic environment that serves as an explicit world model. Rather than relying purely on natural language reasoning, SymPlanner grounds the planning process in a symbolic state space, where a policy model proposes actions and a symbolic environment deterministically executes and verifies their effects. To enhance exploration and improve robustness, we introduce Iterative Correction (IC), which refines previously proposed actions by leveraging feedback from the symbolic environment to eliminate invalid decisions and guide the model toward valid alternatives. Additionally, Contrastive Ranking (CR) enables fine-grained comparison of candidate plans by evaluating them jointly. We evaluate SymPlanner on PlanBench, demonstrating that it produces more coherent, diverse, and verifiable plans than pure natural language baselines.
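The propose, execute, verify, and correct loop described in this abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the `CounterEnv` environment, the `greedy_policy` function, and the `plan` helper are all hypothetical stand-ins, and Contrastive Ranking is not covered.

```python
from typing import Callable, List, Optional, Set

State = int
Action = str
Policy = Callable[[State, State, Set[Action]], Optional[Action]]


class CounterEnv:
    """Hypothetical symbolic environment: the state is an integer in [0, limit]."""

    def __init__(self, limit: int) -> None:
        self.limit = limit

    def step(self, state: State, action: Action) -> Optional[State]:
        """Deterministically execute an action; return None if it is invalid."""
        deltas = {"inc": 1, "dec": -1, "double": state}
        if action not in deltas:
            return None
        nxt = state + deltas[action]
        return nxt if 0 <= nxt <= self.limit else None


def greedy_policy(state: State, goal: State, rejected: Set[Action]) -> Optional[Action]:
    """Hypothetical policy: propose the first preferred action not yet rejected."""
    candidates = ["double", "inc"] if state < goal else ["dec"]
    for action in candidates:
        if action not in rejected:
            return action
    return None


def plan(env: CounterEnv, policy: Policy, start: State, goal: State,
         max_steps: int, max_corrections: int = 3) -> Optional[List[Action]]:
    """Build a plan step by step, re-asking the policy when the environment rejects an action."""
    state, actions = start, []
    for _ in range(max_steps):
        if state == goal:
            return actions
        rejected: Set[Action] = set()
        nxt, action = None, None
        for _ in range(max_corrections + 1):
            action = policy(state, goal, rejected)
            if action is None:
                break
            nxt = env.step(state, action)   # environment verifies the proposal
            if nxt is not None:
                break
            rejected.add(action)            # feedback: exclude the invalid action
        if nxt is None:
            return None
        actions.append(action)
        state = nxt
    return actions if state == goal else None


result = plan(CounterEnv(limit=5), greedy_policy, start=1, goal=5, max_steps=10)
```

In this toy run the policy's first proposal at state 4 (`double`) is rejected by the environment because it would exceed the limit, and the correction loop falls back to `inc`.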

Title: On the effectiveness of Large Language Models in the mechanical design domain

Authors: Daniele Grandi, Fabian Riquelme
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01559
Pdf URL: https://arxiv.org/pdf/2505.01559
Copy Paste: [[2505.01559]] On the effectiveness of Large Language Models in the mechanical design domain(https://arxiv.org/abs/2505.01559)
Keywords: language model
Abstract: In this work, we seek to understand the performance of large language models in the mechanical engineering domain. We leverage the semantic data found in the ABC dataset, specifically the assembly names that designers assigned to the overall assemblies, and the individual semantic part names that were assigned to each part. After pre-processing the data, we developed two unsupervised tasks to evaluate how different model architectures perform on domain-specific data: a binary sentence-pair classification task and a zero-shot classification task. We achieved an accuracy of 0.62 on the binary sentence-pair classification task with a model fine-tuned to counter over-fitting by 1) modifying learning rates, 2) adjusting dropout values, 3) adjusting sequence length, and 4) adding a multi-head attention layer. Our model on the zero-shot classification task outperforms the baselines by a wide margin, and achieves a top-1 classification accuracy of 0.386. The results shed some light on the specific failure modes that arise when learning from language in this domain.

Title: AI agents may be worth the hype but not the resources (yet): An initial exploration of machine translation quality and costs in three language pairs in the legal and news domains

Authors: Vicent Briva Iglesias, Gokhan Dogru
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01560
Pdf URL: https://arxiv.org/pdf/2505.01560
Copy Paste: [[2505.01560]] AI agents may be worth the hype but not the resources (yet): An initial exploration of machine translation quality and costs in three language pairs in the legal and news domains(https://arxiv.org/abs/2505.01560)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) and multi-agent orchestration are touted as the next leap in machine translation (MT), but their benefits relative to conventional neural MT (NMT) remain unclear. This paper offers an empirical reality check. We benchmark five paradigms, Google Translate (strong NMT baseline), GPT-4o (general-purpose LLM), o1-preview (reasoning-enhanced LLM), and two GPT-4o-powered agentic workflows (sequential three-stage and iterative refinement), on test data drawn from a legal contract and news prose in three English-source pairs: Spanish, Catalan and Turkish. Automatic evaluation is performed with COMET, BLEU, chrF2 and TER; human evaluation is conducted with expert ratings of adequacy and fluency; efficiency with total input-plus-output token counts mapped to April 2025 pricing. Automatic scores still favour the mature NMT system, which ranks first in seven of twelve metric-language combinations; o1-preview ties or places second in most remaining cases, while both multi-agent workflows trail. Human evaluation reverses part of this narrative: o1-preview produces the most adequate and fluent output in five of six comparisons, and the iterative agent edges ahead once, indicating that reasoning layers capture semantic nuance undervalued by surface metrics. Yet these qualitative gains carry steep costs. The sequential agent consumes roughly five times, and the iterative agent fifteen times, the tokens used by NMT or single-pass LLMs. We advocate multidimensional, cost-aware evaluation protocols and highlight research directions that could tip the balance: leaner coordination strategies, selective agent activation, and hybrid pipelines combining single-pass LLMs with targeted agent intervention.
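To make the cost comparison in this abstract concrete, the token multipliers it reports (roughly 5x for the sequential agent and 15x for the iterative agent, relative to a single-pass system) can be turned into a small worked calculation. The baseline token count and the price per million tokens below are hypothetical placeholders, not figures from the paper, and a single flat price is assumed for all systems.

```python
from typing import Dict

# Approximate token multipliers relative to a single-pass system, as stated in the abstract.
TOKEN_MULTIPLIERS: Dict[str, int] = {
    "single-pass": 1,
    "sequential agent": 5,
    "iterative agent": 15,
}


def estimated_costs(baseline_tokens: int, price_per_million: float) -> Dict[str, float]:
    """Estimate the cost of each workflow from a baseline token count and a flat price."""
    return {
        name: baseline_tokens * multiplier * price_per_million / 1_000_000
        for name, multiplier in TOKEN_MULTIPLIERS.items()
    }


# Hypothetical inputs: 2,000 tokens for a single pass, 10.0 currency units per million tokens.
costs = estimated_costs(baseline_tokens=2_000, price_per_million=10.0)
for name, cost in costs.items():
    print(f"{name}: {cost:.4f}")
```

Under these assumptions the ratios between workflows depend only on the multipliers, so the iterative agent costs about three times as much as the sequential one regardless of the price chosen.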

Title: PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents

Authors: Takyoung Kim, Janvijay Singh, Shuhaib Mehri, Emre Can Acikgoz, Sagnik Mukherjee, Nimet Beyza Bozdag, Sumuk Shashidhar, Gokhan Tur, Dilek Hakkani-Tür
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01592
Pdf URL: https://arxiv.org/pdf/2505.01592
Copy Paste: [[2505.01592]] PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents(https://arxiv.org/abs/2505.01592)
Keywords: language model, llm, agent
Abstract: The growing capabilities of large language models (LLMs) in instruction-following and context-understanding lead to the era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose PIPA, a unified evaluation protocol that conceptualizes the behavioral process of interactive task planning agents within a partially observable Markov Decision Process (POMDP) paradigm. The proposed protocol offers a comprehensive assessment of agent performance through a set of atomic evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent's decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.

Title: Always Tell Me The Odds: Fine-grained Conditional Probability Estimation

Authors: Liaoyaqi Wang, Zhengping Jiang, Anqi Liu, Benjamin Van Durme
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01595
Pdf URL: https://arxiv.org/pdf/2505.01595
Copy Paste: [[2505.01595]] Always Tell Me The Odds: Fine-grained Conditional Probability Estimation(https://arxiv.org/abs/2505.01595)
Keywords: language model, llm, prompt
Abstract: We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. Recent advances in large language models (LLMs) have significantly enhanced their reasoning capabilities, particularly on well-defined tasks with complete information. However, LLMs continue to struggle with making accurate and well-calibrated probabilistic predictions under uncertainty or partial information. While incorporating uncertainty into model predictions often boosts performance, obtaining reliable estimates of that uncertainty remains understudied. In particular, LLM probability estimates tend to be coarse and biased towards more frequent numbers. Through a combination of human and synthetic data creation and assessment, scaling to larger models, and better supervision, we propose a set of strong and precise probability estimation models. We conduct systematic evaluations across tasks that rely on conditional probability estimation and show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.

Title: A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

Authors: Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01658
Pdf URL: https://arxiv.org/pdf/2505.01658
Copy Paste: [[2505.01658]] A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency(https://arxiv.org/abs/2505.01658)
Keywords: language model, llm, chat, chain-of-thought, agent
Abstract: Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open source inference engines and handle the performance and cost policy of commercial solutions. We outline future research directions that include support for complex LLM-based services, support of various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: this https URL

Title: High-Fidelity Pseudo-label Generation by Large Language Models for Training Robust Radiology Report Classifiers

Authors: Brian Wong, Kaito Tanaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01693
Pdf URL: https://arxiv.org/pdf/2505.01693
Copy Paste: [[2505.01693]] High-Fidelity Pseudo-label Generation by Large Language Models for Training Robust Radiology Report Classifiers(https://arxiv.org/abs/2505.01693)
Keywords: language model, llm
Abstract: Automated labeling of chest X-ray reports is essential for enabling downstream tasks such as training image-based diagnostic models, population health studies, and clinical decision support. However, the high variability, complexity, and prevalence of negation and uncertainty in these free-text reports pose significant challenges for traditional Natural Language Processing methods. While large language models (LLMs) demonstrate strong text understanding, their direct application for large-scale, efficient labeling is limited by computational cost and speed. This paper introduces DeBERTa-RAD, a novel two-stage framework that combines the power of state-of-the-art LLM pseudo-labeling with efficient DeBERTa-based knowledge distillation for accurate and fast chest X-ray report labeling. We leverage an advanced LLM to generate high-quality pseudo-labels, including certainty statuses, for a large corpus of reports. Subsequently, a DeBERTa-Base model is trained on this pseudo-labeled data using a tailored knowledge distillation strategy. Evaluated on the expert-annotated MIMIC-500 benchmark, DeBERTa-RAD achieves a state-of-the-art Macro F1 score of 0.9120, significantly outperforming established rule-based systems, fine-tuned transformer models, and direct LLM inference, while maintaining a practical inference speed suitable for high-throughput applications. Our analysis shows particular strength in handling uncertain findings. This work demonstrates a promising path to overcome data annotation bottlenecks and achieve high-performance medical text processing through the strategic combination of LLM capabilities and efficient student models trained via distillation.

Title: Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models

Authors: Chuan Sun, Han Yu, Lizhen Cui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01731
Pdf URL: https://arxiv.org/pdf/2505.01731
Copy Paste: [[2505.01731]] Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models(https://arxiv.org/abs/2505.01731)
Keywords: language model, gpt, llm
Abstract: Pruning large language models (LLMs) is a promising solution for reducing model sizes and computational complexity while preserving performance. Traditional layer-wise pruning methods often adopt a uniform sparsity approach across all layers, which leads to suboptimal performance because it does not account for the varying significance of individual transformer layers within the model. To this end, we propose the Shapley Value-based Non-Uniform Pruning method for LLMs. This approach quantifies the contribution of each transformer layer to the overall model performance, enabling the assignment of tailored pruning budgets to different layers to retain critical parameters. To further improve efficiency, we design the Sliding Window-based Shapley Value approximation method. It substantially reduces computational overhead compared to exact SV calculation methods. Extensive experiments on various LLMs including LLaMA-v1, LLaMA-v2 and OPT demonstrate the effectiveness of the proposed approach. The results reveal that non-uniform pruning significantly enhances the performance of pruned models. Notably, the proposed method achieves a reduction in perplexity (PPL) of 18.01% and 19.55% on LLaMA-7B and LLaMA-13B, respectively, compared to SparseGPT at 70% sparsity.
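The idea of scoring layers by Shapley value and then assigning non-uniform pruning budgets can be illustrated on a toy example. The sketch below computes exact Shapley values by enumerating permutations, which is only feasible for a handful of layers; it is not the paper's sliding-window approximation. The value function, the `sparsity_budgets` rule, and the `spread` parameter are assumptions made for illustration.

```python
from itertools import permutations
from typing import Callable, FrozenSet, List


def shapley_values(n_layers: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley value of each layer: average marginal gain over all layer orderings."""
    totals = [0.0] * n_layers
    orderings = list(permutations(range(n_layers)))
    for order in orderings:
        kept: FrozenSet[int] = frozenset()
        previous = value(kept)
        for layer in order:
            kept = kept | {layer}
            current = value(kept)
            totals[layer] += current - previous
            previous = current
    return [total / len(orderings) for total in totals]


def sparsity_budgets(scores: List[float], target: float, spread: float = 0.1) -> List[float]:
    """Hypothetical rule: prune important layers less, keeping the mean sparsity at `target`."""
    mean = sum(scores) / len(scores)
    max_dev = max(abs(s - mean) for s in scores) or 1.0
    return [target - spread * (s - mean) / max_dev for s in scores]


# Toy value function (an assumption): each kept layer adds a fixed amount of "performance".
LAYER_WEIGHTS = {0: 3.0, 1: 1.0, 2: 2.0}


def toy_value(kept: FrozenSet[int]) -> float:
    return sum(LAYER_WEIGHTS[i] for i in kept)


scores = shapley_values(3, toy_value)
budgets = sparsity_budgets(scores, target=0.7)
```

Because the toy value function is additive, each layer's Shapley value equals its weight, and the layer with the highest score receives the lowest sparsity while the average stays at the 70% target.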

Title: Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models

Authors: Tobias Domhan, Dawei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01761
Pdf URL: https://arxiv.org/pdf/2505.01761
Copy Paste: [[2505.01761]] Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models(https://arxiv.org/abs/2505.01761)
Keywords: language model, llm, prompt
Abstract: Accurately evaluating machine-translated text remains a long-standing challenge, particularly for long documents. Recent work has shown that large language models (LLMs) can serve as reliable and interpretable sentence-level translation evaluators via MQM error span annotations. With modern LLMs supporting larger context windows, a natural question arises: can we feed entire document translations into an LLM for quality assessment? Ideally, evaluation should be invariant to text length, producing consistent error spans regardless of input granularity. However, our analysis shows that text length significantly impacts evaluation: longer texts lead to fewer error spans and reduced system ranking accuracy. To address this limitation, we evaluate several strategies, including granularity-aligned prompting, Focus Sentence Prompting (FSP), and a fine-tuning approach to better align LLMs with the evaluation task. The latter two methods largely mitigate this length bias, making LLMs more reliable for long-form translation evaluation.

Title: $\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge

Authors: Core Francisco Park, Zechen Zhang, Hidenori Tanaka
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01812
Pdf URL: https://arxiv.org/pdf/2505.01812
Copy Paste: [[2505.01812]] $\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge(https://arxiv.org/abs/2505.01812)
Keywords: language model, llm
Abstract: Humans and intelligent animals can effortlessly internalize new information ("news") and accurately extract the implications for performing downstream tasks. While large language models (LLMs) can achieve this through in-context learning (ICL) when the news is explicitly given as context, fine-tuning remains challenging for the models to consolidate learning in weights. In this paper, we introduce $\textit{New News}$, a dataset composed of hypothetical yet plausible news spanning multiple domains (mathematics, coding, discoveries, leaderboards, events), accompanied by downstream evaluation questions whose correct answers critically depend on understanding and internalizing the news. We first demonstrate a substantial gap between naive fine-tuning and in-context learning (FT-ICL gap) on our news dataset. To address this gap, we explore a suite of self-play data generation protocols -- paraphrases, implications and Self-QAs -- designed to distill the knowledge from the model with context into the weights of the model without the context, which we term $\textit{System-2 Fine-tuning}$ (Sys2-FT). We systematically evaluate ICL and Sys2-FT performance across data domains and model scales with the Qwen 2.5 family of models. Our results demonstrate that the self-QA protocol of Sys2-FT significantly improves models' in-weight learning of the news. Furthermore, we discover the $\textit{contextual shadowing effect}$, where training with the news $\textit{in context}$ followed by its rephrases or QAs degrades learning of the news. Finally, we show preliminary evidence of an emerging scaling law of Sys2-FT.

Title: Intra-Layer Recurrence in Transformers for Language Modeling

Authors: Anthony Nguyen, Wenjun Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01855
Pdf URL: https://arxiv.org/pdf/2505.01855
Copy Paste: [[2505.01855]] Intra-Layer Recurrence in Transformers for Language Modeling(https://arxiv.org/abs/2505.01855)
Keywords: language model
Abstract: Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.
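As a rough illustration of applying recurrence selectively to individual layers within one forward pass, the sketch below runs each layer a configurable number of times. It uses plain Python callables in place of real transformer layers; `forward_with_ilr`, `make_affine`, and the per-layer iteration list are hypothetical names chosen for this example, not the authors' code.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward_with_ilr(x: Vector, layers: Sequence[Layer], iterations: Sequence[int]) -> Vector:
    """Run one forward pass in which layer i is applied iterations[i] times in a row."""
    if len(layers) != len(iterations):
        raise ValueError("need exactly one iteration count per layer")
    for layer, n_iter in zip(layers, iterations):
        for _ in range(n_iter):
            x = layer(x)   # same weights reused, so no extra parameters are added
    return x


def make_affine(scale: float, shift: float) -> Layer:
    """Toy stand-in for a transformer layer: an element-wise affine map."""
    def layer(x: Vector) -> Vector:
        return [scale * v + shift for v in x]
    return layer


layers = [make_affine(2.0, 0.0), make_affine(1.0, 1.0)]

# More iterations on the earlier layer versus more on the later layer.
early_heavy = forward_with_ilr([1.0], layers, [2, 1])
late_heavy = forward_with_ilr([1.0], layers, [1, 2])
```

Both configurations perform three layer applications with the same two sets of weights; only the allocation of iterations across layers differs, which is the quantity the abstract reports experimenting with.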

Title: Humans can learn to detect AI-generated texts, or at least learn when they can't

Authors: Jiří Milička, Anna Marklová, Ondřej Drobil, Eva Pospíšilová
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01877
Pdf URL: https://arxiv.org/pdf/2505.01877
Copy Paste: [[2505.01877]] Humans can learn to detect AI-generated texts, or at least learn when they can't(https://arxiv.org/abs/2505.01877)
Keywords: gpt
Abstract: This study investigates whether individuals can learn to accurately discriminate between human-written and AI-produced texts when provided with immediate feedback, and if they can use this feedback to recalibrate their self-perceived competence. We also explore the specific criteria individuals rely upon when making these decisions, focusing on textual style and perceived readability. We used GPT-4o to generate several hundred texts across various genres and text types comparable to Koditex, a multi-register corpus of human-written texts. We then presented randomized text pairs to 255 Czech native speakers who identified which text was human-written and which was AI-generated. Participants were randomly assigned to two conditions: one receiving immediate feedback after each trial, the other receiving no feedback until experiment completion. We recorded accuracy in identification, confidence levels, response times, and judgments about text readability along with demographic data and participants' engagement with AI technologies prior to the experiment. Participants receiving immediate feedback showed significant improvement in accuracy and confidence calibration. Participants initially held incorrect assumptions about AI-generated text features, including expectations about stylistic rigidity and readability. Notably, without feedback, participants made the most errors precisely when feeling most confident -- an issue largely resolved among the feedback group. The ability to differentiate between human and AI-generated texts can be effectively learned through targeted training with explicit feedback, which helps correct misconceptions about AI stylistic features and readability, as well as potential other variables that were not explored, while facilitating more accurate self-assessment. This finding might be particularly important in educational contexts.

Title: CAMOUFLAGE: Exploiting Misinformation Detection Systems Through LLM-driven Adversarial Claim Transformation

Authors: Mazal Bethany, Nishant Vishwamitra, Cho-Yu Jason Chiang, Peyman Najafirad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01900
Pdf URL: https://arxiv.org/pdf/2505.01900
Copy Paste: [[2505.01900]] CAMOUFLAGE: Exploiting Misinformation Detection Systems Through LLM-driven Adversarial Claim Transformation(https://arxiv.org/abs/2505.01900)
Keywords: llm, prompt, agent
Abstract: Automated evidence-based misinformation detection systems, which evaluate the veracity of short claims against evidence, lack comprehensive analysis of their adversarial vulnerabilities. Existing black-box text-based adversarial attacks are ill-suited for evidence-based misinformation detection systems, as these attacks primarily focus on token-level substitutions involving gradient or logit-based optimization strategies, which are incapable of fooling the multi-component nature of these detection systems. These systems incorporate both retrieval and claim-evidence comparison modules, which requires attacks to break the retrieval of evidence and/or the comparison module so that it draws incorrect inferences. We present CAMOUFLAGE, an iterative, LLM-driven approach that employs a two-agent system, a Prompt Optimization Agent and an Attacker Agent, to create adversarial claim rewritings that manipulate evidence retrieval and mislead claim-evidence comparison, effectively bypassing the system without altering the meaning of the claim. The Attacker Agent produces semantically equivalent rewrites that attempt to mislead detectors, while the Prompt Optimization Agent analyzes failed attack attempts and refines the prompt of the Attacker to guide subsequent rewrites. This enables larger structural and stylistic transformations of the text rather than token-level substitutions, adapting the magnitude of changes based on previous outcomes. Unlike existing approaches, CAMOUFLAGE optimizes its attack solely based on binary model decisions to guide its rewriting process, eliminating the need for classifier logits or extensive querying. We evaluate CAMOUFLAGE on four systems, including two recent academic systems and two real-world APIs, with an average attack success rate of 46.92% while preserving textual coherence and semantic equivalence to the original claims.

Title: Analyzing Cognitive Differences Among Large Language Models through the Lens of Social Worldview

Authors: Jiatao Li, Yanheng Li, Xiaojun Wan
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.01967
Pdf URL: https://arxiv.org/pdf/2505.01967
Copy Paste: [[2505.01967]] Analyzing Cognitive Differences Among Large Language Models through the Lens of Social Worldview(https://arxiv.org/abs/2505.01967)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have become integral to daily life, widely adopted in communication, decision-making, and information retrieval, raising critical questions about how these systems implicitly form and express socio-cognitive attitudes or "worldviews". While existing research extensively addresses demographic and ethical biases, broader dimensions, such as attitudes toward authority, equality, autonomy, and fate, remain under-explored. In this paper, we introduce the Social Worldview Taxonomy (SWT), a structured framework grounded in Cultural Theory, operationalizing four canonical worldviews (Hierarchy, Egalitarianism, Individualism, Fatalism) into measurable sub-dimensions. Using SWT, we empirically identify distinct and interpretable cognitive profiles across 28 diverse LLMs. Further, inspired by Social Referencing Theory, we experimentally demonstrate that explicit social cues systematically shape these cognitive attitudes, revealing both general response patterns and nuanced model-specific variations. Our findings enhance the interpretability of LLMs by revealing implicit socio-cognitive biases and their responsiveness to social feedback, thus guiding the development of more transparent and socially responsible language technologies.

Title: LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load

Authors: Theo Guidroz, Diego Ardila, Jimmy Li, Adam Mansour, Paul Jhun, Nina Gonzalez, Xiang Ji, Mike Sanchez, Sujay Kakarmath, Mathias MJ Bellaiche, Miguel Ángel Garrido, Faruk Ahmed, Divyansh Choudhary, Jay Hartford, Chenwei Xu, Henry Javier Serrano Echeverria, Yifan Wang, Jeff Shaffer, Eric (Yifan)Cao, Yossi Matias, Avinatan Hassidim, Dale R Webster, Yun Liu, Sho Fujiwara, Peggy Bui, Quang Duong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.01980
Pdf URL: https://arxiv.org/pdf/2505.01980
Copy Paste: [[2505.01980]] LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load(https://arxiv.org/abs/2505.01980)
Keywords: llm
Abstract: Information on the web, such as scientific publications and Wikipedia, often surpasses users' reading level. To help address this, we used a self-refinement approach to develop an LLM capability for minimally lossy text simplification. To validate our approach, we conducted a randomized study involving 4563 participants and 31 texts spanning 6 broad subject areas: PubMed (biomedical scientific articles), biology, law, finance, literature/philosophy, and aerospace/computer science. Participants were randomized to viewing original or simplified texts in a subject area, and answered multiple-choice questions (MCQs) that tested their comprehension of the text. The participants were also asked to provide qualitative feedback such as task difficulty. Our results indicate that participants who read the simplified text answered more MCQs correctly than their counterparts who read the original text (3.9% absolute increase, p<0.05). This gain was most striking with PubMed (14.6%), while more moderate gains were observed for the finance (5.5%), aerospace/computer science (3.8%), and legal (3.5%) domains. Notably, the results were robust to whether participants could refer back to the text while answering MCQs. The absolute accuracy decreased by up to ~9% for both original and simplified setups where participants could not refer back to the text, but the ~4% overall improvement persisted. Finally, participants' self-reported perceived ease based on a simplified NASA Task Load Index was greater for those who read the simplified text (absolute change on a 5-point scale 0.33, p<0.05). This randomized study, involving an order of magnitude more participants than prior works, demonstrates the potential of LLMs to make complex information easier to understand. Our work aims to enable a broader audience to better learn and make use of expert knowledge available on the web, improving information accessibility.

Title: Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs

Authors: Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02009
Pdf URL: https://arxiv.org/pdf/2505.02009
Copy Paste: [[2505.02009]] Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs(https://arxiv.org/abs/2505.02009)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. While these datasets provide linguistic data essential for high-quality natural language generation, they often contain harmful content, such as hate speech, misinformation, and biased narratives. Training LLMs on such unfiltered data risks perpetuating toxic behaviors, spreading misinformation, and amplifying societal biases, which can undermine trust in LLM-driven applications and raise ethical concerns about their use. This paper presents a large-scale analysis of inappropriate content across these datasets, offering a comprehensive taxonomy that categorizes harmful webpages into Topical and Toxic based on their intent. We also introduce a prompt evaluation dataset, a high-accuracy Topical and Toxic Prompt (TTP), and a transformer-based model (HarmFormer) for content filtering. Additionally, we create a new multi-harm open-ended toxicity benchmark (HAVOC) and provide crucial insights into how models respond to adversarial toxic inputs. Upon publishing, we will also open-source our model signal on the entire C4 dataset. Our work offers insights into ensuring safer LLM pretraining and serves as a resource for Responsible AI (RAI) compliance.

Title: An overview of artificial intelligence in computer-assisted language learning

Authors: Anisia Katinskaia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02032
Pdf URL: https://arxiv.org/pdf/2505.02032
Copy Paste: [[2505.02032]] An overview of artificial intelligence in computer-assisted language learning(https://arxiv.org/abs/2505.02032)
Keywords: agent
Abstract: Computer-assisted language learning -- CALL -- is an established research field. We review how artificial intelligence can be applied to support language learning and teaching. The need for intelligent agents that assist language learners and teachers is increasing: the human teacher's time is a scarce and costly resource, which does not scale with growing demand. Further factors contribute to the need for CALL: pandemics and increasing demand for distance learning, migration of large populations, the need for sustainable and affordable support for learning, etc. CALL systems are made up of many components that perform various functions, and AI is applied to many different aspects in CALL, corresponding to their own expansive research areas. Most of what we find in the research literature and in practical use are prototypes or partial implementations -- systems that perform some aspects of the overall desired functionality. Complete solutions -- most of them commercial -- are few, because they require massive resources. Recent advances in AI should result in improvements in CALL, yet there is a lack of surveys that focus on AI in the context of this research field. This paper aims to present a perspective on the AI methods that can be employed for language learning from a position of a developer of a CALL system. We also aim to connect work from different disciplines, to build bridges for interdisciplinary work.

Title: What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction

Authors: Eitan Wagner, Omri Abend
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02072
Pdf URL: https://arxiv.org/pdf/2505.02072
Copy Paste: [[2505.02072]] What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction(https://arxiv.org/abs/2505.02072)
Keywords: language model, llm
Abstract: The notion of language modeling has gradually shifted in recent years from a distribution over finite-length strings to general-purpose prediction models for textual inputs and outputs, following appropriate alignment phases. This paper analyzes the distinction between distribution estimation and response prediction in the context of LLMs, and their often conflicting goals. We examine the training phases of LLMs, which include pretraining, in-context learning, and preference tuning, and also the common use cases for their output probabilities, which include completion probabilities and explicit probabilities as output. We argue that the different settings lead to three distinct intended output distributions. We demonstrate that NLP works often assume that these distributions should be similar, which leads to misinterpretations of their experimental findings. Our work sets firmer formal foundations for the interpretation of LLMs, which will inform ongoing work on the interpretation and use of LLMs' induced distributions.

Title: LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning

Authors: Joy Lim Jia Yin, Daniel Zhang-Li, Jifan Yu, Haoxuan Li, Shangqing Tu, Yuanchun Wang, Zhiyuan Liu, Huiqin Liu, Lei Hou, Juanzi Li, Bin Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02078
Pdf URL: https://arxiv.org/pdf/2505.02078
Copy Paste: [[2505.02078]] LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning(https://arxiv.org/abs/2505.02078)
Keywords: language model
Abstract: Evaluating the quality of slide-based multimedia instruction is challenging. Existing methods like manual assessment, reference-based metrics, and large language model evaluators face limitations in scalability, context capture, or bias. In this paper, we introduce LecEval, an automated metric grounded in Mayer's Cognitive Theory of Multimedia Learning, to evaluate multimodal knowledge acquisition in slide-based learning. LecEval assesses effectiveness using four rubrics: Content Relevance (CR), Expressive Clarity (EC), Logical Structure (LS), and Audience Engagement (AE). We curate a large-scale dataset of over 2,000 slides from more than 50 online course videos, annotated with fine-grained human ratings across these rubrics. A model trained on this dataset demonstrates superior accuracy and adaptability compared to existing metrics, bridging the gap between automated and human assessments. We release our dataset and toolkits at this https URL.

Title: LLM-OptiRA: LLM-Driven Optimization of Resource Allocation for Non-Convex Problems in Wireless Communications

Authors: Xinyue Peng, Yanming Liu, Yihan Cang, Chaoqun Cao, Ming Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02091
Pdf URL: https://arxiv.org/pdf/2505.02091
Copy Paste: [[2505.02091]] LLM-OptiRA: LLM-Driven Optimization of Resource Allocation for Non-Convex Problems in Wireless Communications(https://arxiv.org/abs/2505.02091)
Keywords: language model, gpt, llm
Abstract: Solving non-convex resource allocation problems poses significant challenges in wireless communication systems, often beyond the capability of traditional optimization techniques. To address this issue, we propose LLM-OptiRA, the first framework that leverages large language models (LLMs) to automatically detect and transform non-convex components into solvable forms, enabling fully automated resolution of non-convex resource allocation problems in wireless communication systems. LLM-OptiRA not only simplifies problem-solving by reducing reliance on expert knowledge, but also integrates error correction and feasibility validation mechanisms to ensure robustness. Experimental results show that LLM-OptiRA achieves an execution rate of 96% and a success rate of 80% on GPT-4, significantly outperforming baseline approaches in complex optimization tasks across diverse scenarios.

Title: Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study

Authors: Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, Xiangang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02142
Pdf URL: https://arxiv.org/pdf/2505.02142
Copy Paste: [[2505.02142]] Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study(https://arxiv.org/abs/2505.02142)
Keywords: language model, llm
Abstract: Despite significant advances in long-context reasoning by large language models (LLMs), primarily through Online Reinforcement Learning (RL) methods, these approaches incur substantial computational costs and complexity. In contrast, simpler and more economical Offline RL methods remain underexplored. To address this gap, we investigate the effectiveness of Offline RL methods, specifically Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO, in enhancing the reasoning capabilities of LLMs. Extensive experiments across multiple reasoning benchmarks demonstrate that these simpler Offline RL methods substantially improve model performance, achieving an average enhancement of 3.3%, with a particularly notable increase of 10.1% on the challenging Arena-Hard benchmark. Furthermore, we analyze DPO's sensitivity to output length, emphasizing that increasing reasoning length should align with semantic richness, as indiscriminate lengthening may adversely affect model performance. We provide comprehensive descriptions of our data processing and training methodologies, offering empirical evidence and practical insights for developing more cost-effective Offline RL approaches.

Title: QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach

Authors: Shouyang Dong, Yuanbo Wen, Jun Bi, Di Huang, Jiaming Guo, Jianxing Xu, Ruibai Xu, Xinkai Song, Yifan Hao, Xuehai Zhou, Tianshi Chen, Qi Guo, Yunji Chen
Subjects: cs.CL, cs.LG, cs.PL
Abstract URL: https://arxiv.org/abs/2505.02146
Pdf URL: https://arxiv.org/pdf/2505.02146
Copy Paste: [[2505.02146]] QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach(https://arxiv.org/abs/2505.02146)
Keywords: language model, llm, prompt
Abstract: Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires developing multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous manual effort or functional incorrectness, leaving "Write Once, Run Anywhere" for tensor programs an open question. We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is leveraging the powerful code generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets with a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs with an accuracy of 95% on average, and the performance of translated programs achieves up to 2.0x over vendor-provided manually-optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0x via transcompiling legacy tensor programs.

Title: Think on your Feet: Adaptive Thinking via Reinforcement Learning for Social Agents

Authors: Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02156
Pdf URL: https://arxiv.org/pdf/2505.02156
Copy Paste: [[2505.02156]] Think on your Feet: Adaptive Thinking via Reinforcement Learning for Social Agents(https://arxiv.org/abs/2505.02156)
Keywords: chain-of-thought, agent
Abstract: Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current approaches. Existing methods either lack this kind of reasoning capability or enforce uniform long chain-of-thought reasoning across all scenarios, resulting in excessive token usage and inappropriate social simulation. In this paper, we propose Adaptive Mode Learning (AML), which strategically selects from four thinking modes (ranging from intuitive reaction to deep contemplation) based on real-time context. Our framework's core innovation, the Adaptive Mode Policy Optimization (AMPO) algorithm, introduces three key advancements over existing methods: (1) multi-granular thinking mode design, (2) context-aware mode switching across social interaction, and (3) token-efficient reasoning via depth-adaptive processing. Extensive experiments on social intelligence tasks confirm that AML achieves 15.6% higher task performance than state-of-the-art methods. Notably, our method outperforms GRPO by 7.0% with 32.8% shorter reasoning chains. These results demonstrate that context-sensitive thinking mode selection, as implemented in AMPO, enables more human-like adaptive reasoning than GRPO's fixed-depth approach.

Title: Incorporating Legal Structure in Retrieval-Augmented Generation: A Case Study on Copyright Fair Use

Authors: Justin Ho, Alexandra Colby, William Fisher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02164
Pdf URL: https://arxiv.org/pdf/2505.02164
Copy Paste: [[2505.02164]] Incorporating Legal Structure in Retrieval-Augmented Generation: A Case Study on Copyright Fair Use(https://arxiv.org/abs/2505.02164)
Keywords: llm, retrieval-augmented generation, chain-of-thought
Abstract: This paper presents a domain-specific implementation of Retrieval-Augmented Generation (RAG) tailored to the Fair Use Doctrine in U.S. copyright law. Motivated by the increasing prevalence of DMCA takedowns and the lack of accessible legal support for content creators, we propose a structured approach that combines semantic search with legal knowledge graphs and court citation networks to improve retrieval quality and reasoning reliability. Our prototype models legal precedents at the statutory factor level (e.g., purpose, nature, amount, market effect) and incorporates citation-weighted graph representations to prioritize doctrinally authoritative sources. We use Chain-of-Thought reasoning and interleaved retrieval steps to better emulate legal reasoning. Preliminary testing suggests this method improves doctrinal relevance in the retrieval process, laying groundwork for future evaluation and deployment of LLM-based legal assistance tools.

Title: A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking

Authors: Henrik Brådland, Morten Goodwin, Per-Arne Andersen, Alexander S. Nossum, Aditya Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02171
Pdf URL: https://arxiv.org/pdf/2505.02171
Copy Paste: [[2505.02171]] A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking(https://arxiv.org/abs/2505.02171)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Document chunking fundamentally impacts Retrieval-Augmented Generation (RAG) by determining how source materials are segmented before indexing. Despite evidence that Large Language Models (LLMs) are sensitive to the layout and structure of retrieved data, there is currently no framework to analyze the impact of different chunking methods. In this paper, we introduce a novel methodology that defines essential characteristics of the chunking process at three levels: intrinsic passage properties, extrinsic passage properties, and passages-document coherence. We propose HOPE (Holistic Passage Evaluation), a domain-agnostic, automatic evaluation metric that quantifies and aggregates these characteristics. Our empirical evaluations across seven domains demonstrate that the HOPE metric correlates significantly (p > 0.13) with various RAG performance indicators, revealing contrasts between the importance of extrinsic and intrinsic properties of passages. Semantic independence between passages proves essential for system performance with a performance gain of up to 56.2% in factual correctness and 21.1% in answer correctness. On the contrary, traditional assumptions about maintaining concept unity within passages show minimal impact. These findings provide actionable insights for optimizing chunking strategies, thus improving RAG system design to produce more factually correct responses.

Title: Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization

Authors: Chuck Arvin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02172
Pdf URL: https://arxiv.org/pdf/2505.02172
Copy Paste: [[2505.02172]] Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization(https://arxiv.org/abs/2505.02172)
Keywords: language model, gpt, llm, prompt
Abstract: As large language models (LLMs) continue to advance in capabilities, it is essential to assess how they perform on established benchmarks. In this study, we present a suite of experiments to assess the performance of modern LLMs (ranging from 3B to 90B+ parameters) on CaseHOLD, a legal benchmark dataset for identifying case holdings. Our experiments demonstrate "scaling effects": performance on this task improves with model size, with more capable models like GPT4o and AmazonNovaPro achieving macro F1 scores of 0.744 and 0.720 respectively. These scores are competitive with the best published results on this dataset, and do not require any technically sophisticated model training, fine-tuning or few-shot prompting. To ensure that these strong results are not due to memorization of judicial opinions contained in the training data, we develop and utilize a novel citation anonymization test that preserves semantic meaning while ensuring case names and citations are fictitious. Models maintain strong performance under these conditions (macro F1 of 0.728), suggesting the performance is not due to rote memorization. These findings demonstrate both the promise and current limitations of LLMs for legal tasks with important implications for the development and measurement of automated legal analytics and legal benchmarks.

Title: Measuring Hong Kong Massive Multi-Task Language Understanding

Authors: Chuxue Cao, Zhenghao Zhu, Junqi Zhu, Guoying Lu, Siyu Peng, Juntao Dai, Weijie Shi, Sirui Han, Yike Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02177
Pdf URL: https://arxiv.org/pdf/2505.02177
Copy Paste: [[2505.02177]] Measuring Hong Kong Massive Multi-Task Language Understanding(https://arxiv.org/abs/2505.02177)
Keywords: language model, gpt, llm, prompt
Abstract: Multilingual understanding is crucial for the cross-cultural applicability of Large Language Models (LLMs). However, evaluation benchmarks designed for Hong Kong's unique linguistic landscape, which combines Traditional Chinese script with Cantonese as the spoken form and its cultural context, remain underdeveloped. To address this gap, we introduce HKMMLU, a multi-task language understanding benchmark that evaluates Hong Kong's linguistic competence and socio-cultural knowledge. The HKMMLU includes 26,698 multi-choice questions across 66 subjects, organized into four categories: Science, Technology, Engineering, and Mathematics (STEM), Social Sciences, Humanities, and Other. To evaluate the multilingual understanding ability of LLMs, 90,550 Mandarin-Cantonese translation tasks were additionally included. We conduct comprehensive experiments on GPT-4o, Claude 3.7 Sonnet, and 18 open-source LLMs of varying sizes on HKMMLU. The results show that the best-performing model, DeepSeek-V3, struggles to achieve an accuracy of 75%, significantly lower than that of MMLU and CMMLU. This performance gap highlights the need to improve LLMs' capabilities in Hong Kong-specific language and knowledge domains. Furthermore, we investigate how question language, model size, prompting strategies, and question and reasoning token lengths affect model performance. We anticipate that HKMMLU will significantly advance the development of LLMs in multilingual and cross-cultural contexts, thereby enabling broader and more impactful applications.

Title: SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation

Authors: Tanguy Herserant, Vincent Guigue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02235
Pdf URL: https://arxiv.org/pdf/2505.02235
Copy Paste: [[2505.02235]] SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation(https://arxiv.org/abs/2505.02235)
Keywords: gpt, llm, hallucination
Abstract: Evaluating text summarization quality remains a critical challenge in Natural Language Processing. Current approaches face a trade-off between performance and interpretability. We present SEval-Ex, a framework that bridges this gap by decomposing summarization evaluation into atomic statements, enabling both high performance and explainability. SEval-Ex employs a two-stage pipeline: first extracting atomic statements from the source text and the summary using an LLM, then matching the generated statements. Unlike existing approaches that provide only summary-level scores, our method generates detailed evidence for its decisions through statement-level alignments. Experiments on the SummEval benchmark demonstrate that SEval-Ex achieves state-of-the-art performance, with a correlation of 0.580 with human consistency judgments, surpassing GPT-4 based evaluators (0.521) while maintaining interpretability. Finally, our framework shows robustness against hallucination.

Title: Personalisation or Prejudice? Addressing Geographic Bias in Hate Speech Detection using Debias Tuning in Large Language Models

Authors: Paloma Piot, Patricia Martín-Rodilla, Javier Parapar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02252
Pdf URL: https://arxiv.org/pdf/2505.02252
Copy Paste: [[2505.02252]] Personalisation or Prejudice? Addressing Geographic Bias in Hate Speech Detection using Debias Tuning in Large Language Models(https://arxiv.org/abs/2505.02252)
Keywords: language model, llm, prompt
Abstract: Commercial Large Language Models (LLMs) have recently incorporated memory features to deliver personalised responses. This memory retains details such as user demographics and individual characteristics, allowing LLMs to adjust their behaviour based on personal information. However, the impact of integrating personalised information into the context has not been thoroughly assessed, leading to questions about its influence on LLM behaviour. Personalisation can be challenging, particularly with sensitive topics. In this paper, we examine various state-of-the-art LLMs to understand their behaviour in different personalisation scenarios, specifically focusing on hate speech. We prompt the models to assume country-specific personas and use different languages for hate speech detection. Our findings reveal that context personalisation significantly influences LLMs' responses in this sensitive area. To mitigate these unwanted biases, we fine-tune the LLMs by penalising inconsistent hate speech classifications made with and without country or language-specific context. The refined models demonstrate improved performance in both personalised contexts and when no context is provided.

Title: Parameter-Efficient Transformer Embeddings

Authors: Henry Ndubuaku, Mouad Talhi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02266
Pdf URL: https://arxiv.org/pdf/2505.02266
Copy Paste: [[2505.02266]] Parameter-Efficient Transformer Embeddings(https://arxiv.org/abs/2505.02266)
Keywords: language model
Abstract: Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on sentence textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings.
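The deterministic first stage described above can be sketched as follows. This is only an illustration under stated assumptions: the normalization (dividing the token ID by the vocabulary size), the choice of integer frequencies, and the name `fourier_embedding` are all hypothetical, and the lightweight MLP that the abstract places after this step is omitted.

```python
import math


def fourier_embedding(token_id, vocab_size, dim):
    """Map a token ID to a fixed vector of sine/cosine features (no learned table)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even: each frequency yields a sin and a cos")
    if not 0 <= token_id < vocab_size:
        raise ValueError("token_id out of range")
    # Assumed normalization: scale the ID into [0, 1)
    x = token_id / vocab_size
    features = []
    for k in range(1, dim // 2 + 1):
        angle = 2.0 * math.pi * k * x
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


vec = fourier_embedding(token_id=5, vocab_size=100, dim=8)
print(len(vec))
```

Since the vector is computed from the ID rather than looked up, this stage adds no parameters that grow with vocabulary size; in this sketch only the downstream MLP would be trained.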

Title: Demystifying optimized prompts in language models

Authors: Rimon Melamed, Lucas H. McCabe, H. Howie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02273
Pdf URL: https://arxiv.org/pdf/2505.02273
Copy Paste: [[2505.02273]] Demystifying optimized prompts in language models(https://arxiv.org/abs/2505.02273)
Keywords: language model, prompt
Abstract: Modern language models (LMs) are not robust to out-of-distribution inputs. Machine-generated ("optimized") prompts can be used to modulate LM outputs and induce specific behaviors while appearing completely uninterpretable. In this work, we investigate the composition of optimized prompts, as well as the mechanisms by which LMs parse and build predictions from optimized prompts. We find that optimized prompts primarily consist of punctuation and noun tokens which are rarer in the training data. Internally, optimized prompts are clearly distinguishable from natural language counterparts based on sparse subsets of the model's activations. Across various families of instruction-tuned models, optimized prompts follow a similar path in how their representations form through the network.

Title: Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition

Authors: Siyu Liang, Yunan Li, Wentian Xin, Huizhou Chen, Xujie Liu, Kang Liu, Qiguang Miao
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.02304
Pdf URL: https://arxiv.org/pdf/2505.02304
Copy Paste: [[2505.02304]] Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition(https://arxiv.org/abs/2505.02304)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (reaching 97.1%) and Turkish AUTSL datasets (97.07% accuracy). The method's cross-lingual effectiveness highlights its potential for developing inclusive communication technologies.

Title: Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering

Authors: Jihao Zhao, Chunlai Zhou, Biao Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02311
Pdf URL: https://arxiv.org/pdf/2505.02311
Copy Paste: [[2505.02311]] Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering(https://arxiv.org/abs/2505.02311)
Keywords: language model, hallucination
Abstract: The collaborative paradigm of large and small language models (LMs) effectively balances performance and cost, yet its pivotal challenge lies in precisely pinpointing the moment of invocation when hallucinations arise in small LMs. Previous optimization efforts primarily focused on post-processing techniques, which were separate from the reasoning process of LMs, resulting in high computational costs and limited effectiveness. In this paper, we propose a practical invocation evaluation metric called AttenHScore, which calculates the accumulation and propagation of hallucinations during the generation process of small LMs, continuously amplifying potential reasoning errors. By dynamically adjusting the detection threshold, we achieve more accurate real-time invocation of large LMs. Additionally, considering the limited reasoning capacity of small LMs, we leverage uncertainty-aware knowledge reorganization to help them better capture critical information from different text chunks. Extensive experiments reveal that our AttenHScore outperforms most baselines in enhancing real-time hallucination detection capabilities across multiple QA datasets, especially when addressing complex queries. Moreover, our strategies eliminate the need for additional model training and display flexibility in adapting to various transformer-based LMs.

Title: SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

Authors: Tianjian Li, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02363
Pdf URL: https://arxiv.org/pdf/2505.02363
Copy Paste: [[2505.02363]] SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning(https://arxiv.org/abs/2505.02363)
Keywords: language model
Abstract: Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data may be task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths in preference optimization: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on open-ended tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SIMPLEMIX, an approach to combine the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SIMPLEMIX substantially improves language model alignment. Specifically, SIMPLEMIX improves upon on-policy DPO and off-policy DPO by an average of 6.03% on Alpaca Eval 2.0. Moreover, it outperforms prior approaches that are much more complex in combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05%.
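Mixing the two data sources can be illustrated with a small sampling routine. This is a sketch under assumptions, not the authors' procedure: the abstract does not state a mixing ratio, so `on_fraction`, the helper name `mix_preference_data`, and the toy records are all hypothetical.

```python
import random


def mix_preference_data(on_policy, off_policy, on_fraction=0.5, total=None, seed=0):
    """Draw a shuffled training set with a fixed share of on-policy preference pairs."""
    if not 0.0 <= on_fraction <= 1.0:
        raise ValueError("on_fraction must be in [0, 1]")
    rng = random.Random(seed)
    if total is None:
        total = min(len(on_policy), len(off_policy))
    n_on = round(total * on_fraction)
    n_off = total - n_on
    if n_on > len(on_policy) or n_off > len(off_policy):
        raise ValueError("not enough examples in one of the sources")
    # Sample without replacement from each source, then interleave randomly
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)
    return mixed


# Hypothetical preference records tagged by their source
on_data = [("on", i) for i in range(10)]
off_data = [("off", i) for i in range(10)]
mixed = mix_preference_data(on_data, off_data, on_fraction=0.5, total=10)
print(sum(1 for source, _ in mixed if source == "on"))
```

The resulting list could then be passed to any preference-optimization trainer in place of a single-source dataset.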

Title: RM-R1: Reward Modeling as Reasoning

Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02387
Pdf URL: https://arxiv.org/pdf/2505.02387
Copy Paste: [[2505.02387]] RM-R1: Reward Modeling as Reasoning(https://arxiv.org/abs/2505.02387)
Keywords: language model, gpt, llm, chat, chain-of-thought
Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. In this work, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance of generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at this https URL.

Title: Bielik 11B v2 Technical Report

Authors: Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej, Remigiusz Kinas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02410
Pdf URL: https://arxiv.org/pdf/2505.02410
Copy Paste: [[2505.02410]] Bielik 11B v2 Technical Report(https://arxiv.org/abs/2505.02410)
Keywords: language model
Abstract: We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model's parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.

Title: Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs

Authors: Elisa Forcada Rodríguez, Olatz Perez-de-Viñaspre, Jon Ander Campos, Dietrich Klakow, Vagrant Gautam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02456
Pdf URL: https://arxiv.org/pdf/2505.02456
Copy Paste: [[2505.02456]] Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs(https://arxiv.org/abs/2505.02456)
Keywords: language model, llm, prompt
Abstract: One of the goals of fairness research in NLP is to measure and mitigate stereotypical biases that are propagated by NLP systems. However, such work tends to focus on single axes of bias (most often gender) and the English language. Addressing these limitations, we contribute the first study of multilingual intersecting country and gender biases, with a focus on occupation recommendations generated by large language models. We construct a benchmark of prompts in English, Spanish and German, where we systematically vary country and gender, using 25 countries and four pronoun sets. Then, we evaluate a suite of 5 Llama-based models on this benchmark, finding that LLMs encode significant gender and country biases. Notably, we find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist. We also show that the prompting language significantly affects bias, and instruction-tuned models consistently demonstrate the lowest and most stable levels of bias. Our findings highlight the need for fairness researchers to use intersectional and multilingual lenses in their work.

Title: EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning

Authors: Lingxiao Kong (1), Cong Yang (2), Susanne Neufang (3), Oya Deniz Beyan (1,3), Zeyd Boukhers (1,3) ((1) Fraunhofer Institute for Applied Information Technology FIT, (2) Soochow University, (3) University Hospital of Cologne)
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02579
Pdf URL: https://arxiv.org/pdf/2505.02579
Copy Paste: [[2505.02579]] EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning(https://arxiv.org/abs/2505.02579)
Keywords: language model, llm
Abstract: Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including complex objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the training to improve efficiency and flexibility. Our method is the first to aggregate the last hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text-scoring LLMs to evaluate the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption ($17,529\pm 1,650$ data points and $6,573\pm 147.43$ seconds), improved scalability and explainability, and comparable performance across multiple objectives.

Title: Automatic Proficiency Assessment in L2 English Learners

Authors: Armita Mohammadi, Alessandro Lameiras Koerich, Laureano Moro-Velazquez, Patrick Cardinal
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.02615
Pdf URL: https://arxiv.org/pdf/2505.02615
Copy Paste: [[2505.02615]] Automatic Proficiency Assessment in L2 English Learners(https://arxiv.org/abs/2505.02615)
Keywords: language model
Abstract: Second language proficiency (L2) in English is usually perceptually evaluated by English teachers or expert evaluators, with the inherent intra- and inter-rater variability. This paper explores deep learning techniques for comprehensive L2 proficiency assessment, addressing both the speech signal and its corresponding transcription. We analyze spoken proficiency classification prediction using diverse architectures, including 2D CNN, frequency-based CNN, ResNet, and a pretrained wav2vec 2.0 model. Additionally, we examine text-based proficiency assessment by fine-tuning a BERT language model within resource constraints. Finally, we tackle the complex task of spontaneous dialogue assessment, managing long-form audio and speaker interactions through separate applications of wav2vec 2.0 and BERT models. Results from experiments on EFCamDat and ANGLISH datasets and a private dataset highlight the potential of deep learning, especially the pretrained wav2vec 2.0 model, for robust automated L2 proficiency evaluation.

Title: LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Authors: Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.02625
Pdf URL: https://arxiv.org/pdf/2505.02625
Copy Paste: [[2505.02625]] LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis(https://arxiv.org/abs/2505.02625)
Keywords: language model, llm, chat
Abstract: Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.

Title: Proper Name Diacritization for Arabic Wikipedia: A Benchmark Dataset

Authors: Rawan Bondok, Mayar Nassar, Salam Khalifa, Kurt Micallaf, Nizar Habash
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02656
Pdf URL: https://arxiv.org/pdf/2505.02656
Copy Paste: [[2505.02656]] Proper Name Diacritization for Arabic Wikipedia: A Benchmark Dataset(https://arxiv.org/abs/2505.02656)
Keywords: gpt
Abstract: Proper names in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper names of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper name diacritization.

Title: A Survey on Progress in LLM Alignment from the Perspective of Reward Design

Authors: Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02666
Pdf URL: https://arxiv.org/pdf/2505.02666
Copy Paste: [[2505.02666]] A Survey on Progress in LLM Alignment from the Perspective of Reward Design(https://arxiv.org/abs/2505.02666)
Keywords: language model, llm
Abstract: The alignment of large language models (LLMs) with human values and intentions represents a core challenge in current AI research, where reward mechanism design has become a critical factor in shaping model behavior. This study conducts a comprehensive investigation of reward mechanisms in LLM alignment through a systematic theoretical framework, categorizing their development into three key phases: (1) feedback (diagnosis), (2) reward design (prescription), and (3) optimization (treatment). Through a four-dimensional analysis encompassing construction basis, format, expression, and granularity, this research establishes a systematic classification framework that reveals evolutionary trends in reward modeling. The field of LLM alignment faces several persistent challenges, while recent advances in reward design are driving significant paradigm shifts. Notable developments include the transition from reinforcement learning-based frameworks to novel optimization paradigms, as well as enhanced capabilities to address complex alignment scenarios involving multimodal integration and concurrent task coordination. Finally, this survey outlines promising future research directions for LLM alignment through innovative reward design strategies.

Title: Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models

Authors: Xiaobao Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02686
Pdf URL: https://arxiv.org/pdf/2505.02686
Copy Paste: [[2505.02686]] Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models(https://arxiv.org/abs/2505.02686)
Keywords: language model, llm
Abstract: Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. It has underpinned a wide range of prevalent techniques, such as reinforcement learning (in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback. This endows LLMs with aligned preferences and deep reasoning capabilities. In this survey, we present a comprehensive overview of the paradigm of learning from rewards. We categorize and analyze the strategies under this paradigm across training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally, we highlight the challenges and future directions. We maintain a paper collection at this https URL.
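Illustration (not from the paper): one simple way a reward signal can steer outputs at inference time is best-of-N reranking, where several candidate responses are scored and the highest-scoring one is kept. The sketch below is a toy under stated assumptions: `best_of_n`, `toy_reward`, and the candidate strings are hypothetical names and data made up for this example, and the hand-written scoring rule stands in for a learned reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (best-of-N reranking)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


def toy_reward(text: str) -> float:
    """Hypothetical stand-in for a reward model.

    Rewards responses that give a justification ("because") and
    applies a small per-character length penalty.
    """
    justification_bonus = 1.0 if "because" in text else 0.0
    return justification_bonus - 0.01 * len(text)


# Hypothetical candidate responses, e.g. N samples from the same prompt.
candidates = [
    "42",
    "The answer is 42 because 6 * 7 = 42.",
    "I am not sure.",
]

best = best_of_n(candidates, toy_reward)
print(best)
```

In practice the reward function would be a trained model and the candidates would come from sampling an LLM; the selection step itself stays this simple.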

Title: Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models

Authors: Matthew Dahl
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.02763
Pdf URL: https://arxiv.org/pdf/2505.02763
Copy Paste: [[2505.02763]] Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models(https://arxiv.org/abs/2505.02763)
Keywords: language model, llm
Abstract: Legal practice requires careful adherence to procedural rules. In the United States, few are more complex than those found in The Bluebook: A Uniform System of Citation. Compliance with this system's 500+ pages of byzantine formatting instructions is the raison d'être of thousands of student law review editors and the bête noire of lawyers everywhere. To evaluate whether large language models (LLMs) are able to adhere to the procedures of such a complicated system, we construct an original dataset of 866 Bluebook tasks and test flagship LLMs from OpenAI, Anthropic, Google, Meta, and DeepSeek. We show (1) that these models produce fully compliant Bluebook citations only 69%-74% of the time and (2) that in-context learning on the Bluebook's underlying system of rules raises accuracy only to 77%. These results caution against using off-the-shelf LLMs to automate aspects of the law where fidelity to procedure is paramount.

Title: ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations

Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.02819
Pdf URL: https://arxiv.org/pdf/2505.02819
Copy Paste: [[2505.02819]] ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations(https://arxiv.org/abs/2505.02819)
Keywords: language model, llm
Abstract: We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation to approximate the pruned blocks. This estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at this repository.
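Illustration (not the authors' code): the ReplaceMe abstract above describes estimating a linear transformation from a small calibration set to approximate the pruned blocks, then merging it into the remaining weights. A minimal NumPy sketch of that idea follows, assuming an ordinary least-squares objective and ignoring residual connections and other architectural details. The function names, shapes, and synthetic "activations" are all hypothetical, and the synthetic target is nearly linear by construction, so the fit is much better than a real transformer block would allow.

```python
import numpy as np


def estimate_linear_map(x_in: np.ndarray, y_out: np.ndarray) -> np.ndarray:
    """Least-squares T minimising ||x_in @ T - y_out||_F.

    x_in:  (n_tokens, d) activations entering the blocks to be pruned.
    y_out: (n_tokens, d) activations leaving those blocks.
    """
    t, *_ = np.linalg.lstsq(x_in, y_out, rcond=None)
    return t


def merge_into_previous(w_prev: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Fold T into a preceding linear layer, since (a @ w_prev) @ T == a @ (w_prev @ T)."""
    return w_prev @ t


rng = np.random.default_rng(0)
n_tokens, d = 512, 16

# Synthetic calibration data (assumption): the "pruned blocks" act as a
# noisy linear map, which is the most favourable case for this method.
x_in = rng.normal(size=(n_tokens, d))
y_out = x_in @ rng.normal(size=(d, d)) + 0.01 * rng.normal(size=(n_tokens, d))

t_hat = estimate_linear_map(x_in, y_out)
rel_err = np.linalg.norm(x_in @ t_hat - y_out) / np.linalg.norm(y_out)
print(f"relative approximation error: {rel_err:.4f}")

# Merging adds no parameters: the preceding weight keeps its shape.
w_prev = rng.normal(size=(8, d))
w_merged = merge_into_previous(w_prev, t_hat)
a = rng.normal(size=(4, 8))
```

Because the estimated map is folded into an existing weight matrix, the merged layer has the same shape as before, which is consistent with the abstract's claim that no additional network parameters are needed.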