2025-02-18

Title: Hallucinations and Truth: A Comprehensive Accuracy Evaluation of RAG, LoRA and DoRA

Authors: Mohammad Baqar, Rajat Khanda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10497
Pdf URL: https://arxiv.org/pdf/2502.10497
Copy Paste: [[2502.10497]] Hallucinations and Truth: A Comprehensive Accuracy Evaluation of RAG, LoRA and DoRA(https://arxiv.org/abs/2502.10497)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Recent advancements in Generative AI have significantly improved the efficiency and adaptability of natural language processing (NLP) systems, particularly through Retrieval-Augmented Generation (RAG), Low-Rank Adaptation (LoRA), and Weight-Decomposed Low-Rank Adaptation (DoRA). RAG integrates external knowledge to enhance factual consistency in generative outputs, while LoRA enables parameter-efficient fine-tuning of large language models (LLMs). DoRA further refines this process by optimizing fine-tuning through adaptive parameter ranking and domain-aware weight adjustments, improving learning efficiency while maintaining inference performance. This paper presents a large-scale empirical evaluation of RAG, LoRA, and DoRA, with model fine-tuning and generation performance assessed on 20,000 FAQ-based queries, while the knowledge base spans 400,000 entries. The study analyzes key performance metrics such as accuracy, relevance, and inference latency. Experimental results demonstrate that DoRA achieves the highest accuracy (90.1%), relevance score (0.88), and lowest latency (110 ms per query), outperforming both LoRA and RAG in real-world, domain-specific generative AI applications. Furthermore, this study examines the trade-offs between fine-tuning efficiency, computational cost, and real-time adaptability across different models. Findings highlight RAG's effectiveness in knowledge grounding, LoRA's cost-efficient domain adaptation, and DoRA's ability to balance fine-tuning efficiency with model precision. These insights provide practical guidance for deploying AI-driven generative systems in accuracy-critical domains such as healthcare, finance, and legal services, ensuring scalability, reliability, and optimal performance in dynamic environments.

Title: Man Made Language Models? Evaluating LLMs' Perpetuation of Masculine Generics Bias

Authors: Enzo Doyen, Amalia Todirascu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10577
Pdf URL: https://arxiv.org/pdf/2502.10577
Copy Paste: [[2502.10577]] Man Made Language Models? Evaluating LLMs' Perpetuation of Masculine Generics Bias(https://arxiv.org/abs/2502.10577)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been shown to propagate and even amplify gender bias, in English and other languages, in specific or constrained contexts. However, no studies so far have focused on gender biases conveyed by LLMs' responses to generic instructions, especially with regard to masculine generics (MG). MG are a linguistic feature found in many gender-marked languages, denoting the use of the masculine gender as a "default" or supposedly neutral gender to refer to a mixed group of men and women, or to a person whose gender is irrelevant or unknown. Numerous psycholinguistics studies have shown that MG are not neutral and induce gender bias. This work aims to analyze the use of MG by both proprietary and local LLMs in responses to generic instructions and evaluate their MG bias rate. We focus on French and create a human noun database from existing lexical resources. We filter existing French instruction datasets to retrieve generic instructions and analyze the responses of 6 different LLMs. Overall, we find that ≈39.5% of LLMs' responses to generic instructions are MG-biased (≈73.1% across responses with human nouns). Our findings also reveal that LLMs are reluctant to use gender-fair language spontaneously.

Title: Named entity recognition for Serbian legal documents: Design, methodology and dataset development

Authors: Vladimir Kalušev, Branko Brkljač
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10582
Pdf URL: https://arxiv.org/pdf/2502.10582
Copy Paste: [[2502.10582]] Named entity recognition for Serbian legal documents: Design, methodology and dataset development(https://arxiv.org/abs/2502.10582)
Keywords: language model, llm
Abstract: Recent advancements in the field of natural language processing (NLP), especially large language models (LLMs) and their numerous applications, have brought research attention to the design of document processing tools and to enhancements in document archiving, search and retrieval. The domain of official, legal documents is especially interesting due to the vast amount of data generated on a daily basis, as well as the significant community of interested practitioners (lawyers, law offices, administrative workers, state institutions and citizens). Providing efficient ways to automate everyday work involving legal documents is therefore expected to have a significant impact in different fields. In this work we present an LLM-based solution for Named Entity Recognition (NER) in legal documents written in the Serbian language. It leverages pre-trained bidirectional encoder representations from transformers (BERT), carefully adapted to the specific task of identifying and classifying specific data points in textual content. Besides novel dataset development for the Serbian language (involving public court rulings), the presented system design and the applied methodology, the paper also discusses the achieved performance metrics and their implications for objective assessment of the proposed solution. Cross-validation tests on the manually labeled dataset, with a mean $F_1$ score of 0.96, and additional results on intentionally modified text inputs confirm the applicability of the proposed system design and the robustness of the developed NER solution.

Title: Post-training an LLM for RAG? Train on Self-Generated Demonstrations

Authors: Matthew Finlayson, Ilia Kulikov, Daneil M. Bikel, Barlas Oguz, Xilun Chen, Aasish Pappu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.10596
Pdf URL: https://arxiv.org/pdf/2502.10596
Copy Paste: [[2502.10596]] Post-training an LLM for RAG? Train on Self-Generated Demonstrations(https://arxiv.org/abs/2502.10596)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: Large language models (LLMs) often struggle with knowledge intensive NLP tasks, such as answering "Who won the latest World Cup?" because the knowledge they learn during training may be insufficient or outdated. Conditioning generation on retrieved documents -- a technique known as retrieval augmented generation (RAG) -- mitigates these shortcomings by allowing the model to leverage in-context information. Practitioners can improve LLM RAG performance by fine-tuning on retrieval-augmented instructions, but must beware that this can cause undesirable model behaviors like hallucinations. We attribute this degradation to the fact that the training data is likely to be out-of-distribution for the model and may suffer from quality issues, such as misalignment between retrievals and target responses (since retrievals are frequently added post-hoc). We propose a recipe for training RAG-enabled LLMs using self-generated demonstrations, thereby avoiding training on out-of-distribution text and integrating retrievals into the LLM responses. We evaluate our method on knowledge intensive question answering (QA) tasks and show that our method teaches LLMs to properly handle in-context retrievals and abstain from questions it will likely get wrong. Compared to conventional RA-IT methods, our method prevents model degradation in non-RAG settings while exhibiting superior QA performance.

Title: Retrieval-augmented Encoders for Extreme Multi-label Text Classification

Authors: Yau-Shian Wang, Wei-Cheng Chang, Jyun-Yu Jiang, Jiong Zhang, Hsiang-Fu Yu, S. V. N. Vishwanathan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10615
Pdf URL: https://arxiv.org/pdf/2502.10615
Copy Paste: [[2502.10615]] Retrieval-augmented Encoders for Extreme Multi-label Text Classification(https://arxiv.org/abs/2502.10615)
Keywords: language model
Abstract: Extreme multi-label classification (XMC) seeks to find relevant labels from an extremely large label collection for a given text input. To tackle such a vast label space, current state-of-the-art methods fall into two categories. The one-versus-all (OVA) method uses learnable label embeddings for each label, excelling at memorization (i.e., capturing detailed training signals for accurate head label prediction). In contrast, the dual-encoder (DE) model maps input and label text into a shared embedding space for better generalization (i.e., the capability of predicting tail labels with limited training data), but may fall short at memorization. To achieve generalization and memorization, existing XMC methods often combine DE and OVA models, which involves complex training pipelines. Inspired by the success of retrieval-augmented language models, we propose the Retrieval-augmented Encoders for XMC (RAEXMC), a novel framework that equips a DE model with retrieval-augmented capability for efficient memorization without additional trainable parameter. During training, RAEXMC is optimized by the contrastive loss over a knowledge memory that consists of both input instances and labels. During inference, given a test input, RAEXMC retrieves the top-$K$ keys from the knowledge memory, and aggregates the corresponding values as the prediction scores. We showcase the effectiveness and efficiency of RAEXMC on four public LF-XMC benchmarks. RAEXMC not only advances the state-of-the-art (SOTA) DE method DEXML, but also achieves more than 10x speedup on the largest LF-AmazonTitles-1.3M dataset under the same 8 A100 GPUs training environments.
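Note: the inference step described in the abstract above (retrieve the top-K keys from a knowledge memory, then aggregate the corresponding values into prediction scores) can be illustrated with a small sketch. This is not the authors' implementation. The dot-product similarity, the softmax weighting, the memory layout, and the name `retrieve_and_score` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_and_score(query, memory, k=2):
    """Score labels for `query` from a key-value memory (hypothetical layout).

    memory: non-empty list of (key_vector, {label: weight}) pairs.
    Returns {label: score} aggregated over the k most similar keys.
    """
    # Rank every memory key by similarity to the query, keep the top k.
    ranked = sorted(
        ((dot(query, key), values) for key, values in memory),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Assumed aggregation: softmax over the retrieved similarities.
    top = max(sim for sim, _ in ranked)
    exps = [math.exp(sim - top) for sim, _ in ranked]
    total = sum(exps)
    scores = defaultdict(float)
    for e, (_, values) in zip(exps, ranked):
        for label, weight in values.items():
            scores[label] += (e / total) * weight
    return dict(scores)


# Toy memory: keys could be embeddings of training inputs or label texts.
memory = [
    ([1.0, 0.0], {"cats": 1.0}),
    ([0.0, 1.0], {"dogs": 1.0}),
    ([0.9, 0.1], {"cats": 1.0, "pets": 1.0}),
]
print(retrieve_and_score([1.0, 0.0], memory, k=2))
```

In this toy run only the two keys closest to the query contribute, so the label attached to the distant key receives no score at all.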

Title: Lost in the Passage: Passage-level In-context Learning Does Not Necessarily Need a "Passage"

Authors: Hao Sun, Chenming Tang, Gengyang Li, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10634
Pdf URL: https://arxiv.org/pdf/2502.10634
Copy Paste: [[2502.10634]] Lost in the Passage: Passage-level In-context Learning Does Not Necessarily Need a "Passage"(https://arxiv.org/abs/2502.10634)
Keywords: language model, llm, prompt
Abstract: By simply incorporating demonstrations into the context, in-context learning (ICL) enables large language models (LLMs) to yield awesome performance on many tasks. In this paper, we focus on passage-level long-context ICL for generation tasks and find that LLMs cannot learn the intrinsic relationships between the demonstration passage and the generation output. We conduct experiments with different LLMs on two typical generation tasks including single-document QA and distractor generation, demonstrating that even a completely meaningless demonstration passage with 1/4 length achieves much better performance than the original full passage. Analysis via attention score reveals that LLMs pay little attention to passages compared to other components in prompt and little attention flows from the passage to other parts of the demonstration, which further confirms our finding. Additionally, experiments on context compression indicate that compression approaches proven effective on other long-context tasks are not suitable for passage-level ICL, since simply using shorter meaningless demonstration passages has achieved competitive performance.

Title: BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop

Authors: Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Hu, Jaap Jumelet, Tal Linzen, Jing Liu, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Wilcox, Adina Williams
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10645
Pdf URL: https://arxiv.org/pdf/2502.10645
Copy Paste: [[2502.10645]] BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop(https://arxiv.org/abs/2502.10645)
Keywords: language model
Abstract: BabyLM aims to dissolve the boundaries between cognitive modeling and language modeling. We call for both workshop papers and for researchers to join the 3rd BabyLM competition. As in previous years, we call for participants in the data-efficient pretraining challenge in the general track. This year, we also offer a new track: INTERACTION. This new track encourages interactive behavior, learning from a teacher, and adapting the teaching material to the student. We also call for papers outside the competition in any relevant areas. These include training efficiency, cognitively plausible research, weak model evaluation, and more.

Title: User Profile with Large Language Models: Construction, Updating, and Benchmarking

Authors: Nusrat Jahan Prottasha, Md Kowsher, Hafijur Raman, Israt Jahan Anny, Prakash Bhat, Ivan Garibay, Ozlem Garibay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10660
Pdf URL: https://arxiv.org/pdf/2502.10660
Copy Paste: [[2502.10660]] User Profile with Large Language Models: Construction, Updating, and Benchmarking(https://arxiv.org/abs/2502.10660)
Keywords: language model, llm
Abstract: User profile modeling plays a key role in personalized systems, as it requires building accurate profiles and updating them with new information. In this paper, we present two high-quality open-source user profile datasets: one for profile construction and another for profile updating. These datasets offer a strong basis for evaluating user profile modeling techniques in dynamic settings. We also show a methodology that uses large language models (LLMs) to tackle both profile construction and updating. Our method uses a probabilistic framework to predict user profiles from input text, allowing for precise and context-aware profile generation. Our experiments demonstrate that models like Mistral-7b and Llama2-7b perform strongly in both tasks. LLMs improve the precision and recall of the generated profiles, and high evaluation scores confirm the effectiveness of our approach.

Title: Exploring Synaptic Resonance in Large Language Models: A Novel Approach to Contextual Memory Integration

Authors: George Applegarth, Christian Weatherstone, Maximilian Hollingsworth, Henry Middlebrook, Marcus Irvin
Subjects: cs.CL, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2502.10699
Pdf URL: https://arxiv.org/pdf/2502.10699
Copy Paste: [[2502.10699]] Exploring Synaptic Resonance in Large Language Models: A Novel Approach to Contextual Memory Integration(https://arxiv.org/abs/2502.10699)
Keywords: language model
Abstract: Contextual memory integration remains a high challenge in the development of language models, particularly in tasks that require maintaining coherence over extended sequences. Traditional approaches, such as self-attention mechanisms and memory-augmented architectures, often prioritize short-term dependencies, leading to fragmentation and inconsistency in long-range contextual understanding. Inspired by principles of synaptic plasticity observed in biological neural systems, a novel mechanism, Synaptic Resonance, is introduced to dynamically reinforce relevant memory pathways during training and inference. Unlike static memory representations, this mechanism continuously adjusts synaptic weight matrices based on contextual relevance, allowing for improved information retention without excessive computational overhead. Evaluations conducted on an open-source language model demonstrate reductions in perplexity, enhancements in contextual coherence, and increased robustness against input noise, highlighting the effectiveness of reinforcement-driven memory modulation. Comparative analysis against baseline models further reveals that the proposed approach achieves higher memory retention efficiency while maintaining computational feasibility. The architectural modifications integrate seamlessly into existing transformer-based frameworks, ensuring stable convergence and efficient inference without sacrificing scalability. Applications benefiting from improved long-term contextual consistency, such as dialogue systems and document summarization, stand to gain from this approach. Empirical findings suggest that dynamically reinforced memory pathways offer a promising alternative to conventional memory mechanisms, addressing longstanding limitations in extended sequence modeling.

Title: Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey

Authors: Zirui Song, Bin Yan, Yuhan Liu, Miao Fang, Mingzhe Li, Rui Yan, Xiuying Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10708
Pdf URL: https://arxiv.org/pdf/2502.10708
Copy Paste: [[2502.10708]] Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey(https://arxiv.org/abs/2502.10708)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in various tasks such as natural language understanding, text summarization, and machine translation. However, their general-purpose nature often limits their effectiveness in domain-specific applications that require specialized knowledge, such as healthcare, chemistry, or legal analysis. To address this, researchers have explored diverse methods to enhance LLMs by integrating domain-specific knowledge. In this survey, we provide a comprehensive overview of these methods, which we categorize into four key approaches: dynamic knowledge injection, static knowledge embedding, modular adapters, and prompt optimization. Each approach offers unique mechanisms to equip LLMs with domain expertise, balancing trade-offs between flexibility, scalability, and efficiency. We discuss how these methods enable LLMs to tackle specialized tasks, compare their advantages and disadvantages, evaluate domain-specific LLMs against general LLMs, and highlight the challenges and opportunities in this emerging field. For those interested in delving deeper into this area, we also summarize the commonly used datasets and benchmarks. To keep researchers updated on the latest studies, we maintain an open-source at: this https URL, dedicated to documenting research in the field of specialized LLM.

Title: An Empirical Analysis of Uncertainty in Large Language Model Evaluations

Authors: Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10709
Pdf URL: https://arxiv.org/pdf/2502.10709
Copy Paste: [[2502.10709]] An Empirical Analysis of Uncertainty in Large Language Model Evaluations(https://arxiv.org/abs/2502.10709)
Keywords: language model, llm, prompt
Abstract: As LLM-as-a-Judge emerges as a new paradigm for assessing large language models (LLMs), concerns have been raised regarding the alignment, bias, and stability of LLM evaluators. While substantial work has focused on alignment and bias, little research has concentrated on the stability of LLM evaluators. In this paper, we conduct extensive experiments involving 9 widely used LLM evaluators across 2 different evaluation settings to investigate the uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators exhibit varying uncertainty based on model families and sizes. With careful comparative analyses, we find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent. By utilizing uncertainty to enhance LLM's reliability and detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an uncertainty-aware LLM evaluator named ConfiLM using a human-annotated fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually designed test set sourced from the 2024 Olympics. Experimental results demonstrate that incorporating uncertainty as additional information during the fine-tuning phase can largely improve the model's evaluation performance in OOD scenarios. The code and data are released at: this https URL.

Title: OPTISHEAR: Towards Efficient and Adaptive Pruning of Large Language Models via Evolutionary Optimization

Authors: Shuqi Liu, Bowei He, Han Wu, Linqi Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10735
Pdf URL: https://arxiv.org/pdf/2502.10735
Copy Paste: [[2502.10735]] OPTISHEAR: Towards Efficient and Adaptive Pruning of Large Language Models via Evolutionary Optimization(https://arxiv.org/abs/2502.10735)
Keywords: language model, llm
Abstract: Post-training pruning has emerged as a crucial optimization technique as large language models (LLMs) continue to grow rapidly. However, the significant variations in weight distributions across different LLMs make fixed pruning strategies inadequate for multiple models. In this paper, we introduce OptiShear, an efficient evolutionary optimization framework for adaptive LLM pruning. Our framework features two key innovations: an effective search space built on our Meta pruning metric to handle diverse weight distributions, and a model-wise reconstruction error for rapid evaluation during search trials. We employ Non-dominated Sorting Genetic Algorithm III (NSGA-III) to optimize both pruning metrics and layerwise sparsity ratios. Through extensive evaluation on LLaMA-1/2/3 and Mistral models (7B-70B) across multiple benchmarks, we demonstrate that our adaptive pruning metrics consistently outperform existing methods. Additionally, our discovered layerwise sparsity ratios enhance the effectiveness of other pruning metrics. The framework exhibits strong cross-task and cross-model generalizability, providing a cost-effective solution for model compression.

Title: BASE-SQL: A powerful open source Text-To-SQL baseline approach

Authors: Lei Sheng, Shuai-Shuai Xu, Wei Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10739
Pdf URL: https://arxiv.org/pdf/2502.10739
Copy Paste: [[2502.10739]] BASE-SQL: A powerful open source Text-To-SQL baseline approach(https://arxiv.org/abs/2502.10739)
Keywords: language model, gpt
Abstract: The conversion of natural language into SQL language for querying databases (Text-to-SQL) has broad application prospects and has attracted widespread attention. At present, the mainstream Text-to-SQL methods are mainly divided into in-context learning (ICL) based methods and supervised fine-tuning (SFT) based methods. ICL-based methods can achieve relatively good results thanks to the use of the most advanced closed-source models. However, in real-world application scenarios, factors such as data privacy, SQL generation efficiency and cost need to be considered. SFT-based methods have certain advantages. At present, methods based on fine-tuning of open source models lack easy-to-implement and effective (cost-effective) baseline methods. We propose a pipeline-based method using open source model fine-tuning, referred to as BASE-SQL, which includes four components: Schema Linking, Candidate SQL Generate, SQL Revision and SQL Merge Revision. Experimental results show that BASE-SQL uses the open source model Qwen2.5-Coder-32B-Instruct, and achieves an accuracy of 67.47% on the BIRD development set and 88.9% on the Spider test set, which is significantly better than other methods using open source models, and even exceeds several methods using the GPT-4o closed-source model. At the same time, BASE-SQL is easy to implement and highly efficient (on average, only five calls to the large language model are required to generate SQL once). The code will be open sourced at this https URL.

Title: 1bit-Merging: Dynamic Quantized Merging for Large Language Models

Authors: Shuqi Liu, Han Wu, Bowei He, Zehua Liu, Xiongwei Han, Mingxuan Yuan, Linqi Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10743
Pdf URL: https://arxiv.org/pdf/2502.10743
Copy Paste: [[2502.10743]] 1bit-Merging: Dynamic Quantized Merging for Large Language Models(https://arxiv.org/abs/2502.10743)
Keywords: language model, chat
Abstract: Recent advances in large language models have led to specialized models excelling in specific domains, creating a need for efficient model merging techniques. While traditional merging approaches combine parameters into a single static model, they often compromise task-specific performance. However, task-specific routing methods maintain accuracy but introduce substantial storage overhead. We present 1bit-Merging, a novel framework that integrates task-specific routing with 1-bit quantized task vectors to balance performance and storage efficiency. Our approach leverages the observation that different task-specific models store knowledge in distinct layers (chat models primarily in attention layers and math/code models in MLP layers), enabling targeted compression strategies. Through extensive experiments with LLaMA2 and Mistral model families across chat, mathematical reasoning, and code generation tasks, we demonstrate that 1bit-Merging achieves comparable or superior performance to existing methods while significantly reducing storage requirements. Our framework offers a practical solution for combining specialized models while maintaining their individual strengths and addressing the storage challenges of current approaches.
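Note: to make "1-bit quantized task vectors" in the abstract above concrete, here is a minimal sketch under stated assumptions. A task vector is taken to be the element-wise difference between fine-tuned and base weights. The quantizer shown (keep only the sign of each entry plus a single scale equal to the mean absolute value) is a common 1-bit scheme and is assumed here, since the abstract does not specify the exact quantizer. All function names are hypothetical.

```python
def task_vector(finetuned, base):
    """Element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def quantize_1bit(tau):
    """Keep one sign bit per entry plus one shared scale (assumed scheme)."""
    scale = sum(abs(t) for t in tau) / len(tau)
    signs = [1 if t >= 0 else -1 for t in tau]
    return signs, scale


def reconstruct(base, signs, scale):
    """Approximate the fine-tuned weights from the base and the 1-bit vector."""
    return [b + s * scale for b, s in zip(base, signs)]


base = [0.0, 0.0, 0.0, 0.0]
finetuned = [0.5, -0.25, 0.25, -1.0]

signs, scale = quantize_1bit(task_vector(finetuned, base))
print(signs, scale)                     # [1, -1, 1, -1] 0.5
print(reconstruct(base, signs, scale))  # [0.5, -0.5, 0.5, -0.5]
```

The storage saving comes from keeping one bit per weight and one floating-point scale per tensor instead of a full-precision copy of each specialized model. The cost is the approximation error visible in the reconstructed values.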

Title: LoRE-Merging: Exploring Low-Rank Estimation For Large Language Model Merging

Authors: Zehua Liu, Han Wu, Yuxuan Yao, Ruifeng She, Xiongwei Han, Tao Zhong, Mingxuan Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10749
Pdf URL: https://arxiv.org/pdf/2502.10749
Copy Paste: [[2502.10749]] LoRE-Merging: Exploring Low-Rank Estimation For Large Language Model Merging(https://arxiv.org/abs/2502.10749)
Keywords: language model
Abstract: While most current approaches rely on further training techniques, such as fine-tuning or reinforcement learning, to enhance model capacities, model merging stands out for its ability to improve models without requiring any additional training. In this paper, we propose a unified framework for model merging based on low-rank estimation of task vectors without the need for access to the base model, named LoRE-Merging. Our approach is motivated by the observation that task vectors from fine-tuned models frequently exhibit a limited number of dominant singular values, making low-rank estimations less prone to interference. We implement the method by formulating the merging problem as an optimization problem. Extensive empirical experiments demonstrate the effectiveness of our framework in mitigating interference and preserving task-specific information, thereby advancing the state-of-the-art performance in model merging techniques.
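The motivating observation, that a task vector with few dominant singular values is well captured by a low-rank estimate, can be illustrated with a small sketch. This is not LoRE-Merging itself, which formulates merging as an optimization problem and does not need the base model. The code below only computes a rank-1 approximation of a hypothetical weight-delta matrix using power iteration, in plain Python. The matrix `delta` and all function names are assumptions made for illustration.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def top_singular_triplet(m, iters=100):
    """Power iteration on M^T M; returns (sigma, u, v) for the largest singular value."""
    mt = transpose(m)
    v = [1.0] * len(m[0])
    for _ in range(iters):
        w = matvec(mt, matvec(m, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = matvec(m, v)
    sigma = math.sqrt(sum(x * x for x in mv))
    u = [x / sigma for x in mv]
    return sigma, u, v

def rank1(m):
    """Best rank-1 approximation sigma * u * v^T."""
    sigma, u, v = top_singular_triplet(m)
    return [[sigma * ui * vj for vj in v] for ui in u]

# Hypothetical task vector (weight delta) that is almost rank 1.
delta = [[2.0, 4.0], [1.0, 2.0], [0.01, -0.02]]
approx = rank1(delta)
max_err = max(abs(a - d) for ra, rd in zip(approx, delta) for a, d in zip(ra, rd))
```

In this toy case the rank-1 estimate keeps the two dominant rows almost exactly and discards the small third row, which mirrors the intuition that minor components are where interference between merged models tends to live.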

Title: Why is prompting hard? Understanding prompts on binary sequence predictors

Authors: Li Kevin Wenliang, Anian Ruoss, Jordi Grau-Moya, Marcus Hutter, Tim Genewein
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.10760
Pdf URL: https://arxiv.org/pdf/2502.10760
Copy Paste: [[2502.10760]] Why is prompting hard? Understanding prompts on binary sequence predictors(https://arxiv.org/abs/2502.10760)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) can be prompted to do many tasks, but finding good prompts is not always easy, nor is understanding some performant prompts. We explore these issues by viewing prompting as conditioning a near-optimal sequence predictor (LLM) pretrained on diverse data sources. Through numerous prompt search experiments, we show that the unintuitive patterns in optimal prompts can be better understood given the pretraining distribution, which is often unavailable in practice. Moreover, even using exhaustive search, reliably identifying optimal prompts from practical neural predictors can be difficult. Further, we demonstrate that common prompting methods, such as using intuitive prompts or samples from the targeted task, are in fact suboptimal. Thus, this work takes an initial step towards understanding the difficulties in finding and understanding optimal prompts from a statistical and empirical perspective.

Title: Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models

Authors: Zeping Yu, Yonatan Belinkov, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10835
Pdf URL: https://arxiv.org/pdf/2502.10835
Copy Paste: [[2502.10835]] Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models(https://arxiv.org/abs/2502.10835)
Keywords: language model, llm, prompt
Abstract: We investigate how large language models perform latent multi-hop reasoning in prompts like "Wolfgang Amadeus Mozart's mother's spouse is". To analyze this process, we introduce logit flow, an interpretability method that traces how logits propagate across layers and positions toward the final prediction. Using logit flow, we identify four distinct stages in single-hop knowledge prediction: (A) entity subject enrichment, (B) entity attribute extraction, (C) relation subject enrichment, and (D) relation attribute extraction. Extending this analysis to multi-hop reasoning, we find that failures often stem from the relation attribute extraction stage, where conflicting logits reduce prediction accuracy. To address this, we propose back attention, a novel mechanism that enables lower layers to leverage higher-layer hidden states from different positions during attention computation. With back attention, a 1-layer transformer achieves the performance of a 2-layer transformer. Applied to four LLMs, back attention improves accuracy on five reasoning datasets, demonstrating its effectiveness in enhancing latent multi-hop reasoning ability.

Title: Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Authors: Zeli Su, Ziyin Zhang, Guixian Xu, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10852
Pdf URL: https://arxiv.org/pdf/2502.10852
Copy Paste: [[2502.10852]] Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages(https://arxiv.org/abs/2502.10852)
Keywords: language model, llm
Abstract: While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.

Title: Towards Effective Extraction and Evaluation of Factual Claims

Authors: Dasha Metropolitansky, Jonathan Larson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10855
Pdf URL: https://arxiv.org/pdf/2502.10855
Copy Paste: [[2502.10855]] Towards Effective Extraction and Evaluation of Factual Claims(https://arxiv.org/abs/2502.10855)
Keywords: language model, llm
Abstract: A common strategy for fact-checking long-form content generated by Large Language Models (LLMs) is extracting simple claims that can be verified independently. Since inaccurate or incomplete claims compromise fact-checking results, ensuring claim quality is critical. However, the lack of a standardized evaluation framework impedes assessment and comparison of claim extraction methods. To address this gap, we propose a framework for evaluating claim extraction in the context of fact-checking along with automated, scalable, and replicable methods for applying this framework, including novel approaches for measuring coverage and decontextualization. We also introduce Claimify, an LLM-based claim extraction method, and demonstrate that it outperforms existing methods under our evaluation framework. A key feature of Claimify is its ability to handle ambiguity and extract claims only when there is high confidence in the correct interpretation of the source text.

Title: Divergent Thoughts toward One Goal: LLM-based Multi-Agent Collaboration System for Electronic Design Automation

Authors: Haoyuan Wu, Haisheng Zheng, Zhuolun He, Bei Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10857
Pdf URL: https://arxiv.org/pdf/2502.10857
Copy Paste: [[2502.10857]] Divergent Thoughts toward One Goal: LLM-based Multi-Agent Collaboration System for Electronic Design Automation(https://arxiv.org/abs/2502.10857)
Keywords: language model, llm, agent
Abstract: Recently, with the development of tool-calling capabilities in large language models (LLMs), these models have demonstrated significant potential for automating electronic design automation (EDA) flows by interacting with EDA tool APIs via EDA scripts. However, considering the limited understanding of EDA tools, LLMs face challenges in practical scenarios where diverse interfaces of EDA tools exist across different platforms. Additionally, EDA flow automation often involves intricate, long-chain tool-calling processes, increasing the likelihood of errors in intermediate steps. Any errors will lead to the instability and failure of EDA flow automation. To address these challenges, we introduce EDAid, a multi-agent collaboration system where multiple agents harboring divergent thoughts converge towards a common goal, ensuring reliable and successful EDA flow automation. Specifically, each agent is controlled by ChipLlama models, which are expert LLMs fine-tuned for EDA flow automation. Our experiments demonstrate the state-of-the-art (SOTA) performance of our ChipLlama models and validate the effectiveness of our EDAid in the automation of complex EDA flows, showcasing superior performance compared to single-agent systems.

Title: NitiBench: A Comprehensive Studies of LLM Frameworks Capabilities for Thai Legal Question Answering

Authors: Pawitsapak Akarajaradwong, Pirat Pothavorn, Chompakorn Chaksangchaichot, Panuthep Tasawong, Thitiwat Nopparatbundit, Sarana Nutanong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10868
Pdf URL: https://arxiv.org/pdf/2502.10868
Copy Paste: [[2502.10868]] NitiBench: A Comprehensive Studies of LLM Frameworks Capabilities for Thai Legal Question Answering(https://arxiv.org/abs/2502.10868)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The application of large language models (LLMs) in the legal domain holds significant potential for information retrieval and question answering, yet Thai legal QA systems face challenges due to a lack of standardized evaluation benchmarks and the complexity of Thai legal structures. This paper introduces NitiBench, a benchmark comprising two datasets: NitiBench-CCL, covering general Thai financial law, and NitiBench-Tax, which includes real-world tax law cases requiring advanced legal reasoning. We evaluate retrieval-augmented generation (RAG) and long-context LLM-based approaches to address three key research questions: the impact of domain-specific components like section-based chunking and cross-referencing, the comparative performance of different retrievers and LLMs, and the viability of long-context LLMs as an alternative to RAG. Our results show that section-based chunking significantly improves retrieval and end-to-end performance, current retrievers struggle with complex queries, and long-context LLMs still underperform RAG-based systems in Thai legal QA. To support fair evaluation, we propose tailored multi-label retrieval metrics and an LLM-as-judge method for coverage and contradiction detection. These findings highlight the limitations of current Thai legal NLP solutions and provide a foundation for future research in the field. We also open-source our code and dataset.

Title: The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis

Authors: Ge Lei, Samuel J. Cooper
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.10871
Pdf URL: https://arxiv.org/pdf/2502.10871
Copy Paste: [[2502.10871]] The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis(https://arxiv.org/abs/2502.10871)
Keywords: language model, llm, prompt
Abstract: This study investigates how large language models (LLMs) represent and recall multi-associated attributes across transformer layers. We show that intermediate layers encode factual knowledge by superimposing related attributes in overlapping spaces, along with effective recall even when attributes are not explicitly prompted. In contrast, later layers refine linguistic patterns and progressively separate attribute representations, optimizing task-specific outputs while appropriately narrowing attribute recall. We identify diverse encoding patterns including, for the first time, the observation of 3D spiral structures when exploring information related to the periodic table of elements. Our findings reveal a dynamic transition in attribute representations across layers, contributing to mechanistic interpretability and providing insights for understanding how LLMs handle complex, interrelated knowledge.

Title: CiteCheck: Towards Accurate Citation Faithfulness Detection

Authors: Ziyao Xu, Shaohang Wei, Zhuoheng Han, Jing Jin, Zhe Yang, Xiaoguang Li, Haochen Tan, Zhijiang Guo, Houfeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10881
Pdf URL: https://arxiv.org/pdf/2502.10881
Copy Paste: [[2502.10881]] CiteCheck: Towards Accurate Citation Faithfulness Detection(https://arxiv.org/abs/2502.10881)
Keywords: llm, retrieval-augmented generation
Abstract: Citation faithfulness detection is critical for enhancing retrieval-augmented generation (RAG) systems, yet large-scale Chinese datasets for this task are scarce. Existing methods face prohibitive costs due to the need for manually annotated negative samples. To address this, we introduce the first large-scale Chinese dataset CiteCheck for citation faithfulness detection, constructed via a cost-effective approach using two-stage manual annotation. This method balances positive and negative samples while significantly reducing annotation expenses. CiteCheck comprises training and test splits. Experiments demonstrate that: (1) the test samples are highly challenging, with even state-of-the-art LLMs failing to achieve high accuracy; and (2) training data augmented with LLM-generated negative samples enables smaller models to attain strong performance using parameter-efficient fine-tuning. CiteCheck provides a robust foundation for advancing citation faithfulness detection in Chinese RAG systems. The dataset is publicly available to facilitate research.

Title: MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

Authors: Vanya Cohen, Raymond Mooney
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10886
Pdf URL: https://arxiv.org/pdf/2502.10886
Copy Paste: [[2502.10886]] MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models(https://arxiv.org/abs/2502.10886)
Keywords: language model
Abstract: Entity tracking is a fundamental challenge in natural language understanding, requiring models to maintain coherent representations of entities. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using two structured domains, Chess and the Shell Game, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based tracking and that this performance gap stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet substantial limitations remain, especially in long-horizon multimodal scenarios. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.

Title: Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia

Authors: Rohith Perumandla, Young-Ho Bae, Diego Izaguirre, Esther Hwang, Andrew Murphy, Long-Jing Hsu, Selma Sabanovic, Casey C. Bennett
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10896
Pdf URL: https://arxiv.org/pdf/2502.10896
Copy Paste: [[2502.10896]] Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia(https://arxiv.org/abs/2502.10896)
Keywords: language model, llm
Abstract: This study presents the development and testing of a conversational speech system designed for robots to detect speech biomarkers indicative of cognitive impairments in people living with dementia (PLwD). The system integrates a backend Python WebSocket server and a central core module with a large language model (LLM) fine-tuned for dementia to process user input and generate robotic conversation responses in real-time in less than 1.5 seconds. The frontend user interface, a Progressive Web App (PWA), displays information and biomarker score graphs on a smartphone in real-time to human users (PLwD, caregivers, clinicians). Six speech biomarkers based on the existing literature - Altered Grammar, Pragmatic Impairments, Anomia, Disrupted Turn-Taking, Slurred Pronunciation, and Prosody Changes - were developed for the robot conversation system using two datasets, one that included conversations of PLwD with a human clinician (DementiaBank dataset) and one that included conversations of PLwD with a robot (Indiana dataset). We also created a composite speech biomarker that combined all six individual biomarkers into a single score. The speech system's performance was first evaluated on the DementiaBank dataset showing moderate correlation with MMSE scores, with the composite biomarker score outperforming individual biomarkers. Analysis of the Indiana dataset revealed higher and more variable biomarker scores, suggesting potential differences due to study populations (e.g. severity of dementia) and the conversational scenario (human-robot conversations are different from human-human). The findings underscore the need for further research on the impact of conversational scenarios on speech biomarkers and the potential clinical applications of robotic speech systems.

Title: Enhancing Conversational Agents from Open-Source Large Language Models with Illocutionary Force and Document-Based Knowledge Retrieval

Authors: Godfrey Inyama
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.10916
Pdf URL: https://arxiv.org/pdf/2502.10916
Copy Paste: [[2502.10916]] Enhancing Conversational Agents from Open-Source Large Language Models with Illocutionary Force and Document-Based Knowledge Retrieval(https://arxiv.org/abs/2502.10916)
Keywords: language model, llm, chat, agent
Abstract: In this paper, we first present a novel way of computationally analysing and extracting illocutionary forces from dialogue using BERT-based large language models, and demonstrate how these features affect the responses of a conversational agent guided by a document-based knowledge bank, using a bespoke web-based conversational chat agent system that we developed. Our proposed illocutionary force extraction and classification technique is the first of its kind to use the Argument Interchange Format (AIF) dataset, showing improved performance compared to two methods for carrying out similar tasks, with a macro F1 of approximately 45%. We evaluated the system on 2 knowledge files, with 2 user queries each, across 5 open-source large language models (LLMs) using 10 standard metrics. Larger open-source models, such as Llama2:13b and Llama3-chatqa-latest, demonstrated improved alignment when the user's illocutionary force was included with the query, achieving higher QA and linguistic similarity scores. Smaller models such as Tinyllama:latest, on the other hand, showed increased perplexity and mixed performance, indicating that they struggle to process queries that explicitly include illocutionary forces. The results from the analysis highlight the potential of illocutionary force to enhance conversational depth while underscoring the need for model-specific optimizations to address increased computational costs and response times.

Title: Fundamental Principles of Linguistic Structure are Not Represented by o3

Authors: Elliot Murphy, Evelina Leivada, Vittoria Dentella, Fritz Gunther, Gary Marcus
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10934
Pdf URL: https://arxiv.org/pdf/2502.10934
Copy Paste: [[2502.10934]] Fundamental Principles of Linguistic Structure are Not Represented by o3(https://arxiv.org/abs/2502.10934)
Keywords: language model
Abstract: A core component of a successful artificial general intelligence would be the rapid creation and manipulation of grounded compositional abstractions and the demonstration of expertise in the family of recursive hierarchical syntactic objects necessary for the creative use of human language. We evaluated the recently released o3 model (OpenAI; o3-mini-high) and discovered that while it succeeds on some basic linguistic tests relying on linear, surface statistics (e.g., the Strawberry Test), it fails to generalize basic phrase structure rules; it fails with comparative sentences involving semantically illegal cardinality comparisons ('Escher sentences'); it fails to correctly rate and explain acceptability dynamics; and it fails to distinguish between instructions to generate unacceptable semantic vs. unacceptable syntactic outputs. When tasked with generating simple violations of grammatical rules, it is seemingly incapable of representing multiple parses to evaluate against various possible semantic interpretations. In stark contrast to many recent claims that artificial language models are on the verge of replacing the field of linguistics, our results suggest not only that deep learning is hitting a wall with respect to compositionality (Marcus 2022), but that it is hitting [a [stubbornly [resilient wall]]] that cannot readily be surmounted to reach human-like compositional reasoning simply through more compute.

Title: Exploring Contextual Flux in Large Language Models: A Novel Approach to Self-Modulating Semantic Networks

Authors: Henry Evidail, Zachary Mountebank, Alistair Hathersage, Peter Stanhope, Basil Ravenscroft, Tobias Waddingham
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10942
Pdf URL: https://arxiv.org/pdf/2502.10942
Copy Paste: [[2502.10942]] Exploring Contextual Flux in Large Language Models: A Novel Approach to Self-Modulating Semantic Networks(https://arxiv.org/abs/2502.10942)
Keywords: language model
Abstract: Self-modulating mechanisms introduce dynamic adaptation capabilities within language models through contextual realignment strategies that influence token embedding trajectories across extended sequences. Contextual Flux is explored as an approach to embedding modulation, integrating an auxiliary gating mechanism within the self-attention framework to dynamically adjust token representations based on evolving contextual dependencies. The empirical analysis evaluates entropy variations, latent space realignments, and coherence stability to assess the extent to which self-regulation enhances text generation consistency while preserving generative flexibility. Quantitative assessments suggest that embedding shifts contribute to more structured adaptation in long-form sequences, with measured reductions in redundant phrase repetitions and improvements in thematic retention. Variability in contextual weight computation affects modulation stability, leading to differing levels of adaptation across diverse linguistic structures. The computational demands introduced through real-time embedding reconfiguration are examined in relation to model scalability, emphasizing the need for optimization strategies in high-volume generative applications. The findings suggest that while adaptive embedding updates improve certain aspects of coherence, their impact remains contingent on model capacity and input complexity.

Title: Neural Networks Remember More: The Power of Parameter Isolation and Combination

Authors: Biqing Zeng, Zehan Li, Aladdin Ayesh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10966
Pdf URL: https://arxiv.org/pdf/2502.10966
Copy Paste: [[2502.10966]] Neural Networks Remember More: The Power of Parameter Isolation and Combination(https://arxiv.org/abs/2502.10966)
Keywords: language model
Abstract: Catastrophic forgetting is a pervasive issue for pre-trained language models (PLMs) during continual learning, where models lose previously acquired knowledge when sequentially trained on a series of tasks. The model's ability to retain old tasks is referred to as stability, while its adaptability to new tasks is called plasticity. Therefore, the key to solving this problem is to find a trade-off between the plasticity and stability of the model. To address this issue, in this paper, we propose a novel method to achieve a balance between model stability and plasticity, thereby mitigating catastrophic forgetting. More specifically, our proposed approach leverages parameter isolation and a subsequent combination strategy. Initially, in the training stage, the model adapts to each downstream task via a parameter isolation method to prevent potential interference among different tasks. We then combine all trained parameters, which contain acquired knowledge, using the task arithmetic method and finally apply them to the backbone model. Empirical evaluations on continual language learning benchmarks substantiate the effectiveness of our approach, revealing a marked enhancement over existing state-of-the-art approaches.
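The combination step mentioned in the abstract, merging separately trained parameters with task arithmetic and applying them to the backbone, can be sketched as follows. This is generic task arithmetic (backbone plus a scaled sum of per-task deltas) under simplifying assumptions: parameters are plain lists keyed by name, and the scaling coefficient `scale` is a hypothetical knob. How the paper isolates parameters per task and which coefficient it uses are not stated in the abstract.

```python
# Illustrative sketch of task arithmetic; names and toy values are assumptions.

def task_vector(finetuned, base):
    """Per-parameter delta learned for one task."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}

def merge(base, task_vectors, scale=1.0):
    """Add the scaled sum of all task vectors to the backbone parameters."""
    merged = {k: list(v) for k, v in base.items()}
    for tv in task_vectors:
        for k, delta in tv.items():
            merged[k] = [m + scale * d for m, d in zip(merged[k], delta)]
    return merged

# Hypothetical backbone and two task-specific checkpoints.
backbone = {"w": [1.0, 2.0, 3.0]}
task_a = {"w": [1.5, 2.0, 3.0]}   # task A changed the first weight
task_b = {"w": [1.0, 2.0, 2.0]}   # task B changed the last weight

merged = merge(
    backbone,
    [task_vector(task_a, backbone), task_vector(task_b, backbone)],
)
```

Because the two toy tasks touch disjoint weights, the merged model carries both updates without interference, which is the situation that parameter isolation during training is meant to encourage.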

Title: FinMTEB: Finance Massive Text Embedding Benchmark

Authors: Yixuan Tang, Yi Yang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.10990
Pdf URL: https://arxiv.org/pdf/2502.10990
Copy Paste: [[2502.10990]] FinMTEB: Finance Massive Text Embedding Benchmark(https://arxiv.org/abs/2502.10990)
Keywords: language model, llm
Abstract: Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advances in large language models (LLMs) have further enhanced the performance of embedding models. While these models are often benchmarked on general-purpose datasets, real-world applications demand domain-specific evaluation. In this work, we introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a specialized counterpart to MTEB designed for the financial domain. FinMTEB comprises 64 financial domain-specific embedding datasets across 7 tasks that cover diverse textual types in both Chinese and English, such as financial news articles, corporate annual reports, ESG reports, regulatory filings, and earnings call transcripts. We also develop a finance-adapted model, FinPersona-E5, using a persona-based data synthetic method to cover diverse financial embedding tasks for training. Through extensive evaluation of 15 embedding models, including FinPersona-E5, we show three key findings: (1) performance on general-purpose benchmarks shows limited correlation with financial domain tasks; (2) domain-adapted models consistently outperform their general-purpose counterparts; and (3) surprisingly, a simple Bag-of-Words (BoW) approach outperforms sophisticated dense embeddings in financial Semantic Textual Similarity (STS) tasks, underscoring current limitations in dense embedding techniques. Our work establishes a robust evaluation framework for financial NLP applications and provides crucial insights for developing domain-specific embedding models.
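For readers unfamiliar with the Bag-of-Words baseline mentioned in finding (3), the sketch below shows one common way to score sentence similarity with it: count tokens, then take the cosine of the two count vectors. The tokenizer, the absence of any term weighting, and the example sentences are assumptions for illustration. The abstract does not describe the exact BoW configuration evaluated in FinMTEB.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical finance-style sentences.
s1 = "Net revenue rose 5% in the third quarter"
s2 = "Third quarter net revenue rose 5%"
s3 = "The board approved a share buyback"

sim_paraphrase = cosine(bow(s1), bow(s2))
sim_unrelated = cosine(bow(s1), bow(s3))
```

Sentences that reuse the same financial terms score high regardless of word order, which is one plausible reason a lexical baseline can do well on STS pairs that are dense with domain vocabulary.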

Title: RoseRAG: Robust Retrieval-augmented Generation with Small-scale LLMs via Margin-aware Preference Optimization

Authors: Tianci Liu, Haoxiang Jiang, Tianze Wang, Ran Xu, Yue Yu, Linjun Zhang, Tuo Zhao, Haoyu Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.10993
Pdf URL: https://arxiv.org/pdf/2502.10993
Copy Paste: [[2502.10993]] RoseRAG: Robust Retrieval-augmented Generation with Small-scale LLMs via Margin-aware Preference Optimization(https://arxiv.org/abs/2502.10993)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Large language models (LLMs) have achieved impressive performance but face high computational costs and latency, limiting their deployment in resource-constrained settings. In contrast, small-scale LLMs (SLMs) are more efficient yet struggle to capture evolving real-world knowledge. Retrieval-augmented generation (RAG) helps by integrating external knowledge, but imperfect retrieval can introduce distracting noise that misleads SLMs. We propose RoseRAG, a robust RAG framework for SLMs via Margin-aware Preference Optimization. RoseRAG employs multi-turn prompting for detailed reasoning, rejection sampling for high-quality explanations, and contrastive preference selection to refine responses by maximizing the likelihood gap between preferred and non-preferred outputs. By integrating these components into a margin-aware optimization process, RoseRAG robustly enhances the accuracy and reliability of SLMs for RAG applications. Extensive experiments on three open-domain question answering benchmarks indicate that our innovative RoseRAG surpasses state-of-the-art baselines significantly.
摘要：大型语言模型（LLMS）取得了令人印象深刻的性能，但面临着较高的计算成本和延迟，限制了它们在资源受限设置中的部署。相比之下，小规模的LLM（SLM）更有效，但努力捕捉不断发展的现实知识。检索增强的生成（RAG）通过整合外部知识来帮助您，但是不完美的检索可以引入分散注意力的噪音，从而误导SLM。我们提出了Roserag，这是通过边缘感知的偏好优化对SLM的强大抹布框架。 Roserag采用多转弯来提示详细的推理，高质量解释的排斥抽样以及对比度偏好选择，以通过最大程度地提高首选和非优先输出之间的可能性差距来完善响应。通过将这些组件集成到边缘感知的优化过程中，Roserag稳健地提高了SLMS对抹布应用的准确性和可靠性。对三个开放域问题回答基准测试的广泛实验表明，我们的创新roserag超过了最新的基线。
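
Illustration of "maximizing the likelihood gap" above: a minimal sketch of a hinge-style margin loss on the log-likelihood gap between a preferred and a non-preferred output. The abstract does not state RoseRAG's actual objective, so this is a generic margin-based preference loss, and `margin_preference_loss` and its `margin` parameter are hypothetical names.

```python
def margin_preference_loss(
    logp_preferred: float,
    logp_rejected: float,
    margin: float = 1.0,
) -> float:
    """Hinge loss on the log-likelihood gap (assumed form, not the paper's).

    The loss is zero once the preferred output's log-likelihood exceeds the
    rejected output's by at least `margin`, and grows linearly otherwise.
    """
    gap = logp_preferred - logp_rejected
    return max(0.0, margin - gap)
```

Minimizing this quantity pushes the model to widen the gap until the margin is met, after which the pair no longer contributes to the loss.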

Title: Evaluating Large language models on Understanding Korean indirect Speech acts

Authors: Youngeun Koo, Jiwoo Lee, Dojun Park, Seohyun Park, Sungeun Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10995
Pdf URL: https://arxiv.org/pdf/2502.10995
Copy Paste: [[2502.10995]] Evaluating Large language models on Understanding Korean indirect Speech acts(https://arxiv.org/abs/2502.10995)
Keywords: language model, llm
Abstract: Accurately understanding the intention of an utterance is crucial in conversational communication. As conversational artificial intelligence models are rapidly being developed and applied in various fields, it is important to evaluate LLMs' ability to understand the intentions behind users' utterances. This study evaluates whether current LLMs can understand the intention of an utterance by considering the given conversational context, particularly in cases where the actual intention differs from the surface-level, literal intention of the sentence, i.e., indirect speech acts. Our findings reveal that Claude3-Opus outperformed the other competing models, with 71.94% in MCQ and 65% in OEQ, showing a clear advantage. In general, proprietary models exhibited relatively higher performance compared to open-source models. Nevertheless, no LLM reached the level of human performance. Most LLMs, except for Claude3-Opus, demonstrated significantly lower performance in understanding indirect speech acts compared to direct speech acts, where the intention is explicitly revealed through the utterance. This study not only performs an overall pragmatic evaluation of each LLM's language use through the analysis of OEQ response patterns, but also emphasizes the necessity for further research to improve LLMs' understanding of indirect speech acts for more natural communication with humans.
摘要：准确地了解话语的意图对于对话交流至关重要。随着对话性人工智能模型正在迅速开发和应用在各个领域，因此评估LLMS了解用户话语意图的能力很重要。这项研究评估当前的LLM是否可以通过考虑给定的对话环境来理解话语的意图，尤其是在实际意图与表面级别的，句子的字面意图不同的情况下，即间接的语音行为。我们的发现表明，Claude3-Opus的表现优于其他竞争模型，MCQ的71.94％，OEQ的65％显示出明显的优势。通常，与开源模型相比，专有模型的性能相对较高。然而，没有LLM达到人类绩效水平。除Claude3-Opus外，大多数LLM与直接语音行为相比，在理解间接语音行为方面的性能明显较低，在这种情况下，意图通过话语明确揭示了。这项研究不仅通过分析OEQ响应模式对每个LLM的语言使用进行了整体务实的评估，而且还强调了进一步研究的必要性，以提高LLMS对间接语音行为的理解，以实现与人类更自然的交流。

Title: RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation

Authors: Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jimeng Sun, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10996
Pdf URL: https://arxiv.org/pdf/2502.10996
Copy Paste: [[2502.10996]] RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation(https://arxiv.org/abs/2502.10996)
Keywords: language model, llm
Abstract: Retrieval-augmented language models often struggle with knowledge-intensive tasks due to inefficient retrieval, unstructured knowledge integration, and single-pass architectures. We present Retrieval-And-Structuring (RAS), a novel framework that dynamically constructs and reasons over query-specific knowledge graphs through iterative retrieval and structuring. RAS introduces four key technical innovations: (1) a theme-scoped retrieval mechanism that efficiently narrows the search space while maintaining retrieval quality, (2) an action planning module that determines knowledge needs and generates focused sub-queries, (3) a dynamic knowledge structuring approach that converts retrieved text into an evolving knowledge graph, and (4) a graph-augmented answering component that leverages the accumulated structured information. Our framework achieves state-of-the-art performance, surpassing leading baselines by 6.4% with open-source language models and 7.0% with proprietary models on seven knowledge-intensive generation datasets across all evaluation metrics. Detailed ablation studies verify the contribution of each technical component to the overall system performance.
摘要：由于检索效率低下，非结构化的知识集成和单通车体系结构，检索授课的语言模型通常会在知识密集型任务上挣扎。我们介绍了检索和结构（RAS），这是一个新颖的框架，通过迭代检索和结构化，通过查询特定的知识图动态构建和原因。 RAS介绍了四个关键的技术创新：（1）一种主题检索机制，该机制有效地缩小了搜索空间的同时保持检索质量，（2）一个行动计划模块，该模块确定知识需求并产生集中的子征服，（3）动态知识结构结构的动态知识结构化结构化。将检索到的文本转换为不断发展的知识图的方法，以及（4）利用累积结构化信息的图形调整答案组件。我们的框架实现了最新的性能，开源语言模型超过了6.4％的领先基线，并且在所有评估指标的七个知识密集型生成数据集上使用了7.0％的模型。详细的消融研究验证了每个技术组成部分对整体系统性能的贡献。

Title: CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models

Authors: Yuefei Chen, Vivek K.Singh, Jing Ma, Ruxiang Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11008
Pdf URL: https://arxiv.org/pdf/2502.11008
Copy Paste: [[2502.11008]] CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models(https://arxiv.org/abs/2502.11008)
Keywords: language model, llm
Abstract: Counterfactual reasoning is widely recognized as one of the most challenging and intricate aspects of causality in artificial intelligence. In this paper, we evaluate the performance of large language models (LLMs) in counterfactual reasoning. In contrast to previous studies that primarily focus on commonsense causal reasoning, where LLMs often rely on prior knowledge for inference, we specifically assess their ability to perform counterfactual inference using a set of formal rules. To support this evaluation, we introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions. The dataset is designed with varying levels of difficulty, diverse causal graph structures, distinct types of counterfactual questions, and multiple nonsensical name variants. Our experiments demonstrate that counterfactual reasoning poses a significant challenge for LLMs, with most models performing at levels comparable to random guessing. To enhance LLMs' counterfactual reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions. Experimental results show that our method significantly improves LLM performance on counterfactual reasoning tasks and consistently enhances performance across different LLMs. Our dataset is available at this https URL.
摘要：反事实推理被广泛认为是人工智能因果关系最具挑战性和错综复杂的方面之一。在本文中，我们评估了反事实推理中大语言模型（LLM）的性能。与以前主要关注共识性因果推理（LLMS经常依靠先验知识进行推理）的研究相反，我们特别评估了他们使用一组正式规则进行反事实推断的能力。为了支持此评估，我们介绍了一个新的基准数据集，包括1K反事实推理问题。该数据集的设计具有不同的困难水平，多样化的因果图结构，不同类型的反事实问题以及多个荒谬的名称变体。我们的实验表明，反事实推理对LLM构成了重大挑战，大多数模型的表现与随机猜测相当。为了增强LLM的反事实推理能力，我们提出了一种新颖的推理范式硬币，该硬币可以指导LLM通过迭代推理和回溯，以系统地探索反事实解决方案。实验结果表明，我们的方法可显着提高反事实推理任务上的LLM性能，并在此HTTPS URL上可用。

Title: GRIFFIN: Effective Token Alignment for Faster Speculative Decoding

Authors: Shijing Hu, Jingyang Li, Xingyu Xie, Zhihui Lu, Kim-Chuan Toh, Pan Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11018
Pdf URL: https://arxiv.org/pdf/2502.11018
Copy Paste: [[2502.11018]] GRIFFIN: Effective Token Alignment for Faster Speculative Decoding(https://arxiv.org/abs/2502.11018)
Keywords: language model, llm
Abstract: Speculative decoding accelerates inference in large language models (LLMs) by generating multiple draft tokens simultaneously. However, existing methods often struggle with token misalignment between the training and decoding phases, limiting their performance. To address this, we propose GRIFFIN, a novel framework that incorporates a token-alignable training strategy and a token-alignable draft model to mitigate misalignment. The training strategy employs a loss masking mechanism to exclude highly misaligned tokens during training, preventing them from negatively impacting the draft model's optimization. The token-alignable draft model introduces input tokens to correct inconsistencies in generated features. Experiments on LLaMA-series and Vicuna models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 7% and a speedup ratio exceeding 8%, outperforming current SoTAs as shown in Fig. 1 (a) and (b).
摘要：通过同时生成多个草稿代币，投机解码可以加速大型语言模型（LLMS）的推断。但是，现有的方法通常在培训和解码阶段之间的令牌未对准方面困难，从而限制了它们的性能。为了解决这个问题，我们提出了格里芬（Griffin），这是一个新颖的框架，结合了代币的培训策略和可代币的模型草案，以减轻未对准。培训策略采用损失掩盖机制来排除培训期间高度未对准的令牌，从而阻止了它们对模型优化草案的优化产生负面影响。令牌可符合的草稿模型引入了输入令牌，以纠正生成的功能中的不一致。对骆驼系列和维库纳模型的实验表明，格里芬的平均接受度长度提高了7％以上，加速比超过8％，表现优于图1（a）和（b）所示的电流SOTA。
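
Illustration of the "acceptance length" metric above: a minimal sketch of greedy draft verification in speculative decoding, assuming token IDs are plain integers and that the target model's greedy choice at each draft position is already available. This is a generic sketch, not GRIFFIN's implementation; the function names are hypothetical. Some papers also count the extra token the target model emits after the accepted prefix, which would add one to each length.

```python
from typing import Iterable, Sequence, Tuple


def accepted_prefix_length(
    draft_tokens: Sequence[int],
    target_tokens: Sequence[int],
) -> int:
    """Count leading draft tokens that match the target model's greedy choices."""
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # the first mismatch invalidates every later draft token
        accepted += 1
    return accepted


def mean_acceptance_length(
    steps: Iterable[Tuple[Sequence[int], Sequence[int]]],
) -> float:
    """Average accepted-prefix length over (draft, target) verification steps."""
    lengths = [accepted_prefix_length(d, t) for d, t in steps]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

A draft model whose predictions stay aligned with the target model for longer yields a larger mean, so fewer target-model forward passes are needed per generated token.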

Title: TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages

Authors: Jafar Isbarov, Arofat Akhundjanova, Mammad Hajili, Kavsar Huseynova, Dmitry Gaynullin, Anar Rzayev, Osman Tursun, Ilshat Saetov, Rinat Kharisov, Saule Belginova, Ariana Kenbayeva, Amina Alisheva, Aizirek Turdubaeva, Abdullatif Köksal, Samir Rustamov, Duygu Ataman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11020
Pdf URL: https://arxiv.org/pdf/2502.11020
Copy Paste: [[2502.11020]] TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages(https://arxiv.org/abs/2502.11020)
Keywords: language model, gpt, llm
Abstract: Being able to thoroughly assess massive multi-task language understanding (MMLU) capabilities is essential for advancing the applicability of multilingual language models. However, preparing such benchmarks in high-quality native language is often costly and therefore limits the representativeness of evaluation datasets. While recent efforts have focused on building more inclusive MMLU benchmarks, these are conventionally built using machine translation from high-resource languages, which may introduce errors and fail to account for the linguistic and cultural intricacies of the target languages. In this paper, we address the lack of native-language MMLU benchmarks, especially in the under-represented Turkic language family with its distinct morphosyntactic and cultural characteristics. We propose two benchmarks for Turkic language MMLU: TUMLU is a comprehensive, multilingual, and natively developed language understanding benchmark specifically designed for Turkic languages. It consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset. Using this dataset, we systematically evaluate a diverse range of open and proprietary multilingual large language models (LLMs), including Claude, Gemini, GPT, and LLaMA, offering an in-depth analysis of their performance across different languages, subjects, and alphabets. To promote further research and development in multilingual language understanding, we release TUMLU-mini and all corresponding evaluation scripts.
摘要：能够彻底评估大规模的多任务语言理解（MMLU）功能对于推进多语言模型的适用性至关重要。但是，以高质量的母语准备此类基准通常是昂贵的，因此限制了评估数据集的代表性。尽管最近的努力着重于建立更具包容性的MMLU基准，但这些努力通常是使用高资源语言的机器翻译来构建的，这可能会引入错误，并且无法说明目标语言的语言和文化复杂性。在本文中，我们讨论了缺乏母语MMLU基准测试，尤其是在代表性不足的土耳其语家族中，具有独特的形态句法和文化特征。我们提出了两个针对Turkic语言MMLU的基准：Tumlu是一种全面的，多语言和本地发展的语言理解理解，专门为突出语言设计。它由阿塞拜疆，克里米亚塔塔尔，卡拉卡帕克，哈萨克，哈萨克，塔塔尔，土耳其语，uyghur和uzbek组成的中学和高中级问题。我们还提出了Tumlu-Mini，这是数据集的更简洁，平衡和手动验证的子集。使用此数据集，我们系统地评估了各种开放和专有的多语言大语言模型（LLMS），包括Claude，Gemini，GPT和Llama，对它们跨不同语言，主题和字母的性能进行了深入的分析。为了促进多语言语言理解中的进一步研究和发展，我们发布了Tumlu-Mini和所有相应的评估脚本。

Title: MultiTEND: A Multilingual Benchmark for Natural Language to NoSQL Query Translation

Authors: Zhiqian Qin, Yuanfeng Song, Jinwei Lu, Yuanwei Song, Shuaimin Li, Chen Jason Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11022
Pdf URL: https://arxiv.org/pdf/2502.11022
Copy Paste: [[2502.11022]] MultiTEND: A Multilingual Benchmark for Natural Language to NoSQL Query Translation(https://arxiv.org/abs/2502.11022)
Keywords: llm, retrieval-augmented generation, chain-of-thought
Abstract: Natural language interfaces for NoSQL databases are increasingly vital in the big data era, enabling users to interact with complex, unstructured data without deep technical expertise. However, most recent advancements focus on English, leaving a gap for multilingual support. This paper introduces MultiTEND, the first and largest multilingual benchmark for natural language to NoSQL query generation, covering six languages: English, German, French, Russian, Japanese and Mandarin Chinese. Using MultiTEND, we analyze challenges in translating natural language to NoSQL queries across diverse linguistic structures, including lexical and syntactic differences. Experiments show that performance accuracy in both English and non-English settings remains relatively low, with a 4%-6% gap across scenarios like fine-tuned SLM, zero-shot LLM, and RAG for LLM. To address the aforementioned challenges, we introduce MultiLink, a novel framework that bridges the multilingual input to NoSQL query generation gap through a Parallel Linking Process. It breaks down the task into multiple steps, integrating parallel multilingual processing, Chain-of-Thought (CoT) reasoning, and Retrieval-Augmented Generation (RAG) to tackle lexical and structural challenges inherent in multilingual NoSQL generation. MultiLink shows enhancements in all metrics for every language against the top baseline, boosting execution accuracy by about 15% for English and averaging a 10% improvement for non-English languages.
摘要：NOSQL数据库的自然语言界面在大数据时代越来越重要，使用户能够与复杂的，非结构化的数据进行交互，而无需深入的技术专业知识。但是，最近的进步集中在英语上，留下了多语言支持的差距。本文介绍了多端，这是NOSQL查询一代的第一和最大的自然语言的多语言基准，涵盖了六种语言：英语，德语，法语，俄语，日语和普通话。使用多端，我们分析了将自然语言转化为跨不同语言结构（包括词汇和句法差异）的NOSQL查询的挑战。实验表明，英语和非英语设置的性能准确性仍然相对较低，在微调SLM，零摄像机LLM和LLM的抹布等方案中存在4％-6％的差距。为了应对上述挑战，我们引入了Multilink，这是一个新颖的框架，通过并行链接过程将多语言输入桥接到NOSQL查询生成差距。它将任务分为多个步骤，集成了平行的多语言处理，经过思考链（COT）推理以及检索成绩的一代（RAG），以应对多语言NOSQL生成中固有的词汇和结构性挑战。 Multilink在针对最高基线的所有语言中显示了所有指标的增强功能，使英语的执行精度提高了约15％，并为非英语语言提高了10％的提高。

Title: Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models

Authors: Prateek Chhikara
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11028
Pdf URL: https://arxiv.org/pdf/2502.11028
Copy Paste: [[2502.11028]] Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models(https://arxiv.org/abs/2502.11028)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) demonstrate impressive performance across diverse tasks, yet confidence calibration remains a challenge. Miscalibration - where models are overconfident or underconfident - poses risks, particularly in high-stakes applications. This paper presents an empirical study on LLM calibration, examining how model size, distractors, and question types affect confidence alignment. We introduce an evaluation framework to measure overconfidence and investigate whether multiple-choice formats mitigate or worsen miscalibration. Our findings show that while larger models (e.g., GPT-4o) are better calibrated overall, they are more prone to distraction, whereas smaller models benefit more from answer choices but struggle with uncertainty estimation. Unlike prior work, which primarily reports miscalibration trends, we provide actionable insights into failure modes and conditions that worsen overconfidence. These findings highlight the need for calibration-aware interventions and improved uncertainty estimation methods.
摘要：大型语言模型（LLMS）在各种任务中表现出令人印象深刻的表现，但是信心校准仍然是一个挑战。错误校准 - 模型过度自信或自信的情况下会带来风险，尤其是在高风险应用中。本文介绍了一项有关LLM校准的经验研究，研究了模型大小，干扰因素和问题类型如何影响置信度。我们介绍了一个评估框架，以测量过度自信并研究多项选择格式是否减轻或恶化错误校准。我们的发现表明，虽然较大的模型（例如GPT-4O）总体上校准了更好的校准，但它们更容易分散注意力，而较小的模型从答案选择中受益更多，但在不确定性估计中挣扎。与主要报告误解趋势的先前工作不同，我们为故障模式和条件提供了可行的见解。这些发现凸显了需要校准感知干预措施和改进的不确定性估计方法。
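
Illustration of measuring miscalibration as described above: a minimal sketch of expected calibration error (ECE) with equal-width confidence bins. ECE is a standard calibration metric, but the abstract does not say which metric the paper's evaluation framework uses, so treat this choice and the `n_bins` default as assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean of |average confidence - accuracy| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to an equal-width bin; confidence 1.0 goes in the last bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

For example, four answers each given with confidence 0.9, of which three are correct, fall in one bin with average confidence 0.9 and accuracy 0.75, an overconfidence gap of about 0.15.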

Title: MMUNLEARNER: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models

Authors: Jiahao Huo, Yibo Yan, Xu Zheng, Yuanhuiyi Lyu, Xin Zou, Zhihua Wei, Xuming Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11051
Pdf URL: https://arxiv.org/pdf/2502.11051
Copy Paste: [[2502.11051]] MMUNLEARNER: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models(https://arxiv.org/abs/2502.11051)
Keywords: language model, llm
Abstract: Recent progress in Machine Unlearning (MU) has introduced solutions for the selective removal of private or sensitive information encoded within deep neural networks. Nonetheless, MU for Multimodal Large Language Models (MLLMs) remains in its nascent phase. Therefore, we propose to reformulate the task of multimodal MU in the era of MLLMs, which aims to erase only the visual patterns associated with a given entity while preserving the corresponding textual knowledge encoded within the original parameters of the language model backbone. Furthermore, we develop a novel geometry-constrained gradient descent method, MMUnlearner. It updates the weights of MLLMs with a weight saliency map jointly restricted by the remaining concepts and textual knowledge during unlearning, thereby preserving parameters essential for non-target knowledge. Extensive experiments demonstrate that MMUnlearner surpasses baselines that fine-tune MLLMs with VQA data directly through Gradient Ascent (GA) or Negative Preference Optimization (NPO), across all evaluation dimensions. Our code will be released upon acceptance.
摘要：机器学习的最新进展（MU）引入了解决方案，以选择性去除深度神经网络中编码的私人或敏感信息。但是，多模式大语言模型（MLLM）的MU仍处于其新生阶段。因此，我们建议在MLLM时代重新重新重新重新制定多模式MU的任务，该任务仅删除与给定实体相关的视觉模式，同时保留语言模型骨干的原始参数中编码的相应文本知识。此外，我们开发了一种新型的几何梯度下降方法mmunlearner。它通过在学习期间的其余概念和文本知识共同限制的重量显着性图更新了MLLM的权重，从而保留了非目标知识必不可少的参数。广泛的实验表明，Mmunlearner超过了所有评估维度，通过梯度上升（GA）或负偏好优化（NPO）直接使用VQA数据对MLLM进行了捕获。我们的代码将在接受后发布。

Title: Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models

Authors: Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, Dacheng Tao
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2502.11054
Pdf URL: https://arxiv.org/pdf/2502.11054
Copy Paste: [[2502.11054]] Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models(https://arxiv.org/abs/2502.11054)
Keywords: language model, llm
Abstract: Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation (RACE), a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves ASRs of 82% and 92% against leading commercial models, OpenAI o1 and DeepSeek R1, underscoring its potency. We release our code at this https URL to facilitate further research in this critical domain.
摘要：多转弯越狱攻击通过使大型语言模型（LLM）参与迭代对话，暴露了关键的安全漏洞，模拟了现实世界中的人类互动。但是，现有的方法通常难以在语义连贯性与攻击效果之间取得平衡，从而导致语义漂移或无效的检测逃避。为了应对这一挑战，我们提出了一个新颖的多转越越狱框架，将有害的查询重新定义为良性的推理任务，并利用了LLMS的强大推理能力，以损害安全对齐。具体来说，我们将攻击状态机框架引入系统地模拟问题翻译和迭代推理，以确保多个转弯的连贯的查询生成。在此框架的基础上，我们设计了增益引导的探索，自我播放和拒绝反馈模块，以保持攻击语义，提高有效性并维持推理驱动的攻击进程。对多个LLM的广泛实验表明，种族在复杂的对话场景中实现了最先进的攻击效果，攻击成功率（ASRS）增长了96％。值得注意的是，我们的方法对领先的商业模型，OpenAi O1和DeepSeek R1的ASRS达到了82％和92％，强调了其效力。我们在此HTTPS URL上发布代码，以促进此关键领域的进一步研究。

Title: Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection

Authors: Yang Zhao, Li Du, Xiao Ding, Yangou Ouyang, Hepeng Wang, Kai Xiong, Jinglong Gao, Zhouhao Sun, Dongliang Xu, Yang Qing, Dongchen Li, Bing Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11062
Pdf URL: https://arxiv.org/pdf/2502.11062
Copy Paste: [[2502.11062]] Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection(https://arxiv.org/abs/2502.11062)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown great potential across various industries due to their remarkable ability to generalize through instruction tuning. However, the limited availability of domain-specific data significantly hampers their performance on specialized tasks. While existing methods primarily focus on selecting training data from general datasets that are similar to the target domain, they often fail to consider the joint distribution of instructions, resulting in inefficient learning and suboptimal knowledge transfer. To address these challenges, we introduce G2IS (Gradient-based Graph Instruction Selection), a novel method that constructs a mixed gradient-based instruction graph to capture the joint distribution and interdependencies between instructions. By accounting for the relationships between instructions, G2IS improves domain adaptation efficiency. Additionally, we propose a gradient walk algorithm to refine the data selection process, enhancing both training effectiveness and efficiency. Our experiments demonstrate that G2IS outperforms traditional methods across various domain adaptation tasks, yielding significant performance gains, particularly in complex, data-scarce scenarios. These results underscore the potential of G2IS in advancing the development of large, domain-specific models.
摘要：大型语言模型（LLMS）在各个行业中都表现出巨大的潜力，因为它们通过教学调整具有出色的概括能力。但是，特定于域数据的有限可用性极大地阻碍了他们在专业任务上的性能。尽管现有方法主要集中于从类似于目标域类似的一般数据集中选择培训数据，但它们通常无法考虑指令的联合分布，从而导致学习效率低下和次优知识转移。为了应对这些挑战，我们介绍了G2IS（基于梯度的图指令选择），这是一种新的方法，该方法构建了基于梯度的指令图，以捕获指令之间的关节分布和相互依赖性。通过考虑说明之间的关系，G2IS提高了域的适应效率。此外，我们提出了一种梯度步行算法，以完善数据选择过程，从而提高培训效率和效率。我们的实验表明，G2IS在各种领域适应任务上的表现优于传统方法，从而产生了显着的性能增长，尤其是在复杂的，数据筛选的情况下。这些结果强调了G2I在推进大型域特异性模型发展的潜力。

Title: CARMA: Enhanced Compositionality in LLMs via Advanced Regularisation and Mutual Information Alignment

Authors: Nura Aljaafari, Danilo S. Carvalho, André Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11066
Pdf URL: https://arxiv.org/pdf/2502.11066
Copy Paste: [[2502.11066]] CARMA: Enhanced Compositionality in LLMs via Advanced Regularisation and Mutual Information Alignment(https://arxiv.org/abs/2502.11066)
Keywords: language model, llm
Abstract: Large language models (LLMs) struggle with compositional generalisation, limiting their ability to systematically combine learned components to interpret novel inputs. While architectural modifications, fine-tuning, and data augmentation improve compositionality, they often have limited adaptability, face scalability constraints, or yield diminishing returns on real data. To address this, we propose CARMA, an intervention that enhances the stability and robustness of compositional reasoning in LLMs while preserving fine-tuned performance. CARMA employs mutual information regularisation and layer-wise stability constraints to mitigate feature fragmentation, ensuring structured representations persist across and within layers. We evaluate CARMA on inverse dictionary modelling and sentiment classification, measuring its impact on semantic consistency, performance stability, and robustness to lexical perturbations. Results show that CARMA reduces the variability introduced by fine-tuning, stabilises token representations, and improves compositional reasoning. While its effectiveness varies across architectures, CARMA's key strength lies in reinforcing learned structures rather than introducing new capabilities, making it a scalable auxiliary method. These findings suggest that integrating CARMA with fine-tuning can improve compositional generalisation while maintaining task-specific performance in LLMs.
摘要：大型语言模型（LLM）与组成概括斗争，限制了它们系统地结合学习组件以解释新投入的能力。尽管建筑修改，微调和数据增强提高了组成性，但它们通常具有有限的适应性，面部可伸缩性约束或产生实际数据的收益降低。为了解决这个问题，我们提出了Carma，这是一种干预措施，可以增强LLMS中构图推理的稳定性和鲁棒性，同时保持微调的性能。 Carma采用相互信息正则化和层面稳定性约束来减轻特征碎片化，从而确保结构化表示持续遍及层。我们评估Carma在反词典建模和情感分类上，衡量其对语义一致性，性能稳定性和对词汇扰动的鲁棒性的影响。结果表明，Carma降低了通过微调，稳定令牌表示并改善组成推理所引入的可变性。尽管它的有效性在各个体系结构之间都有不同，但Carma的关键优势在于增强学习结构而不是引入新功能，从而使其成为一种可扩展的辅助方法。这些发现表明，将Carma与微调整合可以改善组成概括，同时保持LLMS的特定任务性能。

Title: Demystifying Hateful Content: Leveraging Large Multimodal Models for Hateful Meme Detection with Explainable Decisions

Authors: Ming Shan Hee, Roy Ka-Wei Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11073
Pdf URL: https://arxiv.org/pdf/2502.11073
Copy Paste: [[2502.11073]] Demystifying Hateful Content: Leveraging Large Multimodal Models for Hateful Meme Detection with Explainable Decisions(https://arxiv.org/abs/2502.11073)
Keywords: language model
Abstract: Hateful meme detection presents a significant challenge as a multimodal task due to the complexity of interpreting implicit hate messages and contextual cues within memes. Previous approaches have fine-tuned pre-trained vision-language models (PT-VLMs), leveraging the knowledge they gained during pre-training and their attention mechanisms to understand meme content. However, the reliance of these models on implicit knowledge and complex attention mechanisms renders their decisions difficult to explain, which is crucial for building trust in meme classification. In this paper, we introduce IntMeme, a novel framework that leverages Large Multimodal Models (LMMs) for hateful meme classification with explainable decisions. IntMeme addresses the dual challenges of improving both accuracy and explainability in meme moderation. The framework uses LMMs to generate human-like, interpretive analyses of memes, providing deeper insights into multimodal content and context. Additionally, it uses independent encoding modules for both memes and their interpretations, which are then combined to enhance classification performance. Our approach addresses the opacity and misclassification issues associated with PT-VLMs, optimizing the use of LMMs for hateful meme detection. We demonstrate the effectiveness of IntMeme through comprehensive experiments across three datasets, showcasing its superiority over state-of-the-art models.
摘要：仇恨的模因检测是由于解释模因中的隐性仇恨信息和上下文提示的复杂性，因此提出了一个重大挑战作为多模式任务。先前的方法具有微调的预训练的视觉模型（PT-VLM），利用它们在训练期间获得的知识及其注意力机制来理解模因含量。但是，这些模型对隐性知识和复杂的注意机制的依赖使他们的决定难以解释，这对于建立对模因分类的信任至关重要。在本文中，我们介绍了Intmeme，这是一个新颖的框架，该框架利用大型多模型（LMMS）进行可恶的模因分类和可解释的决定。 Intmeme解决了提高模因适度精度和解释性的双重挑战。该框架使用LMM来生成类似人类的模因的解释性分析，从而为多模式内容和上下文提供了更深入的见解。此外，它对模因及其解释都使用独立的编码模块，然后将其组合起来以增强分类性能。我们的方法解决了与PT-VLM相关的不透明度和错误分类问题，从而优化了LMM用于仇恨模因检测的使用。我们通过在三个数据集中进行的全面实验来证明Intmeme的有效性，从而展示了其优越性比最先进的模型。

Title: Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models

Authors: Haoyang Li, Xuejia Chen, Zhanchao XU, Darian Li, Nicole Hu, Fei Teng, Yiming Li, Luyu Qiu, Chen Jason Zhang, Qing Li, Lei Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11075
Pdf URL: https://arxiv.org/pdf/2502.11075
Copy Paste: [[2502.11075]] Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models(https://arxiv.org/abs/2502.11075)
Keywords: language model, gpt, llm, long context
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic arithmetic, numerical retrieval, and magnitude comparison, remains surprisingly poor. This gap arises from their reliance on surface-level statistical patterns rather than understanding numbers as continuous magnitudes. Existing benchmarks primarily focus on either linguistic competence or structured mathematical problem-solving, neglecting the fundamental numerical reasoning required in real-world scenarios. To bridge this gap, we propose NumericBench, a comprehensive benchmark to evaluate six fundamental numerical capabilities: number recognition, arithmetic operations, contextual retrieval, comparison, summary, and logical reasoning. NumericBench includes datasets ranging from synthetic number lists to crawled real-world data, addressing challenges like long contexts, noise, and multi-step reasoning. Extensive experiments on state-of-the-art LLMs, including GPT-4 and DeepSeek, reveal persistent weaknesses in numerical reasoning, highlighting the urgent need to improve numerically-aware language modeling. The benchmark is released at this https URL.
摘要：大型语言模型（LLM）在自然语言处理任务（例如文本生成和语义理解）中表现出了令人印象深刻的功能。但是，它们在数值推理任务上的表现，例如基本算术，数值检索和幅度比较，仍然令人惊讶地差。这一差距源于它们对表面级统计模式的依赖，而不是将数字理解为连续幅度。现有的基准主要集中于语言能力或结构化数学问题解决问题，而忽略了现实世界中所需的基本数值推理。为了弥合这一差距，我们提出了数字基础，这是一个评估六个基本数值能力的综合基准：数字识别，算术操作，上下文检索，比较，摘要，摘要和逻辑推理。 NumericBench包括从合成数字列表到爬行的现实世界数据的数据集，以解决诸如长上下文，噪声和多步推理之类的挑战。在包括GPT-4和DeepSeek在内的最先进的LLM的广泛实验揭示了数值推理的持续弱点，强调了迫切需要改善数字意识到的语言建模。基准发布在：此HTTPS URL中。

Title: DEEPER Insight into Your User: Directed Persona Refinement for Dynamic Persona Modeling

Authors: Aili Chen, Chengyu Du, Jiangjie Chen, Jinghan Xu, Yikai Zhang, Siyu Yuan, Zulong Chen, Liangyue Li, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11078
Pdf URL: https://arxiv.org/pdf/2502.11078
Copy Paste: [[2502.11078]] DEEPER Insight into Your User: Directed Persona Refinement for Dynamic Persona Modeling(https://arxiv.org/abs/2502.11078)
Keywords: language model, llm
Abstract: To advance personalized applications such as recommendation systems and user behavior prediction, recent research increasingly adopts large language models (LLMs) for human-readable persona modeling. In dynamic real-world scenarios, effective persona modeling necessitates leveraging streaming behavior data to continually optimize user personas. However, existing methods, whether regenerating personas or incrementally extending them with new behaviors, often fail to achieve sustained improvements in persona quality or future behavior prediction accuracy. To address this, we propose DEEPER, a novel approach for dynamic persona modeling that enables continual persona optimization. Specifically, we enhance the model's direction-search capability through an iterative reinforcement learning framework, allowing it to automatically identify effective update directions and optimize personas using discrepancies between user behaviors and model predictions. Extensive experiments on dynamic persona modeling involving 4800 users across 10 domains highlight the superior persona optimization capabilities of DEEPER, delivering an impressive 32.2% average reduction in user behavior prediction error over four update rounds, outperforming the best baseline by a remarkable 22.92%.

Title: Streamlining the Collaborative Chain of Models into A Single Forward Pass in Generation-Based Tasks

Authors: Yuanjie Lyu, Chao Zhang, Yuhao Chen, Yong Chen, Tong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11083
Pdf URL: https://arxiv.org/pdf/2502.11083
Copy Paste: [[2502.11083]] Streamlining the Collaborative Chain of Models into A Single Forward Pass in Generation-Based Tasks(https://arxiv.org/abs/2502.11083)
Keywords: prompt, retrieval-augmented generation, agent
Abstract: In Retrieval-Augmented Generation (RAG) and agent-based frameworks, the "Chain of Models" approach is widely used, where multiple specialized models work sequentially on distinct sub-tasks. This approach is effective but increases resource demands as each model must be deployed separately. Recent advancements attempt to address this by applying prompt tuning, which allows a shared base model to adapt to multiple tasks with minimal parameter changes. However, a key challenge remains: intermediate outputs, passed between models as plain text, require recomputation of hidden states (i.e., Key and Value (KV) states in Transformers) during inference. In this paper, we introduce FTHSS, a novel prompt-tuning method that enables models to share KV hidden states, eliminating redundant forward passes and reducing KV cache storage. By modifying input and attention masks during training, FTHSS allows models to effectively utilize KV hidden states from prior models in both single- and multi-round scenarios. Empirical results on four tasks show that FTHSS matches the performance of traditional model chains while improving inference efficiency.

Title: Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction

Authors: Yuting Huang, Chengyuan Liu, Yifeng Feng, Chao Wu, Fei Wu, Kun Kuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11084
Pdf URL: https://arxiv.org/pdf/2502.11084
Copy Paste: [[2502.11084]] Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction(https://arxiv.org/abs/2502.11084)
Keywords: language model, llm, prompt
Abstract: As Large Language Models (LLMs) are widely applied in various domains, the safety of LLMs is increasingly attracting attention to avoid their powerful capabilities being misused. Existing jailbreak methods create a forced instruction-following scenario, or search adversarial prompts with prefix or suffix tokens to achieve a specific representation manually or automatically. However, they suffer from low efficiency and explicit jailbreak patterns, far from the real deployment of mass attacks to LLMs. In this paper, we point out that simply rewriting the original instruction can achieve a jailbreak, and we find that this rewriting approach is learnable and transferable. We propose the Rewrite to Jailbreak (R2J) approach, a transferable black-box jailbreak method to attack LLMs by iteratively exploring the weakness of the LLMs and automatically improving the attacking strategy. The jailbreak is more efficient and hard to identify since no additional features are introduced. Extensive experiments and analysis demonstrate the effectiveness of R2J, and we find that the jailbreak is also transferable to multiple datasets and various types of models with only a few queries. We hope our work motivates further investigation of LLM safety.

Title: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11089
Pdf URL: https://arxiv.org/pdf/2502.11089
Copy Paste: [[2502.11089]] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention(https://arxiv.org/abs/2502.11089)
Keywords: language model
Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

Title: SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks

Authors: Hongye Cao, Yanming Wang, Sijia Jing, Ziyue Peng, Zhixin Bai, Zhe Cao, Meng Fang, Fan Feng, Boyan Wang, Jiaheng Liu, Tianpei Yang, Jing Huo, Yang Gao, Fanyu Meng, Xi Yang, Chao Deng, Junlan Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11090
Pdf URL: https://arxiv.org/pdf/2502.11090
Copy Paste: [[2502.11090]] SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks(https://arxiv.org/abs/2502.11090)
Keywords: language model, llm, chat
Abstract: With the rapid advancement of Large Language Models (LLMs), the safety of LLMs has been a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess the safety. Additionally, these benchmarks have not taken into account the LLM's capability of identifying and handling unsafe information in detail. To address these issues, we propose a fine-grained benchmark SafeDialBench for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generates more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative assessment framework of LLMs, measuring capabilities in detecting and handling unsafe information and maintaining consistency when facing jailbreak attacks. Experimental results across 17 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.

Title: A Survey of Large Language Models in Psychotherapy: Current Landscape and Future Directions

Authors: Hongbin Na, Yining Hua, Zimu Wang, Tao Shen, Beibei Yu, Lilin Wang, Wei Wang, John Torous, Ling Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11095
Pdf URL: https://arxiv.org/pdf/2502.11095
Copy Paste: [[2502.11095]] A Survey of Large Language Models in Psychotherapy: Current Landscape and Future Directions(https://arxiv.org/abs/2502.11095)
Keywords: language model, llm
Abstract: Mental health remains a critical global challenge, with increasing demand for accessible, effective interventions. Large language models (LLMs) offer promising solutions in psychotherapy by enhancing the assessment, diagnosis, and treatment of mental health conditions through dynamic, context-aware interactions. This survey provides a comprehensive overview of the current landscape of LLM applications in psychotherapy, highlighting the roles of LLMs in symptom detection, severity estimation, cognitive assessment, and therapeutic interventions. We present a novel conceptual taxonomy to organize the psychotherapy process into three core components: assessment, diagnosis, and treatment, and examine the challenges and advancements in each area. The survey also addresses key research gaps, including linguistic biases, limited disorder coverage, and underrepresented therapeutic models. Finally, we discuss future directions to integrate LLMs into a holistic, end-to-end psychotherapy framework, addressing the evolving nature of mental health conditions and fostering more inclusive, personalized care.

Title: Towards Achieving Concept Completeness for Unsupervised Textual Concept Bottleneck Models

Authors: Milan Bhan, Yann Choho, Pierre Moreau, Jean-Noel Vittaut, Nicolas Chesneau, Marie-Jeanne Lesot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11100
Pdf URL: https://arxiv.org/pdf/2502.11100
Copy Paste: [[2502.11100]] Towards Achieving Concept Completeness for Unsupervised Textual Concept Bottleneck Models(https://arxiv.org/abs/2502.11100)
Keywords: language model, llm
Abstract: Textual Concept Bottleneck Models (TCBMs) are interpretable-by-design models for text classification that predict a set of salient concepts before making the final prediction. This paper proposes Complete Textual Concept Bottleneck Model (CT-CBM), a novel TCBM generator building concept labels in a fully unsupervised manner using a small language model, eliminating both the need for predefined human-labeled concepts and LLM annotations. CT-CBM iteratively targets and adds important concepts in the bottleneck layer to create a complete concept basis and addresses downstream classification leakage through a parallel residual connection. CT-CBM achieves good results against competitors, offering a promising solution to enhance interpretability of NLP classifiers without sacrificing performance.

Title: CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation

Authors: Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11101
Pdf URL: https://arxiv.org/pdf/2502.11101
Copy Paste: [[2502.11101]] CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation(https://arxiv.org/abs/2502.11101)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) excel across a variety of language tasks yet are constrained by limited input lengths and high computational costs. Existing approaches, such as relative positional encodings (e.g., RoPE, ALiBi) and sliding window mechanisms, partially alleviate these issues but often require additional training or suffer from performance degradation with longer inputs. In this paper, we introduce CacheFocus, a method that enhances length normalization and reduces inference latency without any further training. Our approach leverages query-independent, offline caching to efficiently reuse a Context KV Cache Store. We address the problem of amplified abnormal token distributions by re-positioning cached keys and introducing Layer-Adaptive Cache Pruning to discard low-relevance caches during pre-filling. Additionally, our Adaptive Positional Allocation Strategy dynamically reassigns cache positions to maximize the use of the available positional encoding range. Experiments on the Natural Questions and TriviaQA datasets demonstrate that CacheFocus outperforms alternative methods even when inputs exceed the 4K limit of the LLaMA-2 model, emphasizing its practical effectiveness for long-context LLMs. Moreover, even with the large maximum input length of Qwen2, CacheFocus maintains consistent performance as the number of documents increases, effectively managing long-text generation without degradation.

Title: Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications

Authors: Alexandru Lecu, Adrian Groza, Lezan Hawizy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11108
Pdf URL: https://arxiv.org/pdf/2502.11108
Copy Paste: [[2502.11108]] Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications(https://arxiv.org/abs/2502.11108)
Keywords: language model, llm, hallucination, chat, retrieval-augmented generation
Abstract: Large language models (LLMs) have significantly advanced the field of natural language generation. However, they frequently generate unverified outputs, which compromises their reliability in critical applications. In this study, we propose an innovative framework that combines structured biomedical knowledge with LLMs through a retrieval-augmented generation technique. Our system develops a thorough knowledge graph by identifying and refining causal relationships and named entities from medical abstracts related to age-related macular degeneration (AMD). Using a vector-based retrieval process and a locally deployed language model, our framework produces responses that are both contextually relevant and verifiable, with direct references to clinical evidence. Experimental results show that this method notably decreases hallucinations, enhances factual precision, and improves the clarity of generated responses, providing a robust solution for advanced biomedical chatbot applications.

Title: Valuable Hallucinations: Realizable Non-realistic Propositions

Authors: Qiucheng Chen, Bo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11113
Pdf URL: https://arxiv.org/pdf/2502.11113
Copy Paste: [[2502.11113]] Valuable Hallucinations: Realizable Non-realistic Propositions(https://arxiv.org/abs/2502.11113)
Keywords: language model, llm, hallucination, prompt
Abstract: This paper introduces the first formal definition of valuable hallucinations in large language models (LLMs), addressing a gap in the existing [...]. We provide a systematic definition and analysis of hallucination value, proposing methods for enhancing the value of [...]. In contrast to previous works, which often treat hallucinations as a broad flaw, we focus on the potential value that certain types of hallucinations can offer in specific [...]. Hallucinations in LLMs generally refer to the generation of unfaithful, fabricated, inconsistent, or nonsensical [...]. Rather than viewing all hallucinations negatively, this paper gives formal representations and manual judgments of "valuable hallucinations" and explores how realizable non-realistic propositions (ideas that are not currently true but could be achievable under certain conditions) can have constructive [...]. We present experiments using the Qwen2.5 model and HalluQA dataset, employing ReAct prompting (which involves reasoning, confidence assessment, and answer verification) to control and optimize hallucinations. Our findings show that ReAct prompting results in a reduction in overall hallucinations and an increase in the proportion of valuable [...]. [...] results demonstrate that systematically controlling hallucinations can improve their usefulness without compromising factual reliability.

Title: Beyond Pairwise: Global Zero-shot Temporal Graph Generation

Authors: Alon Eirew, Kfir Bar, Ido Dagan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11114
Pdf URL: https://arxiv.org/pdf/2502.11114
Copy Paste: [[2502.11114]] Beyond Pairwise: Global Zero-shot Temporal Graph Generation(https://arxiv.org/abs/2502.11114)
Keywords: language model, llm
Abstract: Temporal relation extraction (TRE) is a fundamental task in natural language processing (NLP) that involves identifying the temporal relationships between events in a document. Despite the advances in large language models (LLMs), their application to TRE remains limited. Most existing approaches rely on pairwise classification, in which event pairs are considered individually, leading to computational inefficiency and a lack of global consistency in the resulting temporal graph. In this work, we propose a novel zero-shot method for TRE that generates a document's complete temporal graph at once, then applies transitive constraints optimization to refine predictions and enforce temporal consistency across relations. Additionally, we introduce OmniTemp, a new dataset with complete annotations for all pairs of targeted events within a document. Through experiments and analyses, we demonstrate that our method significantly outperforms existing zero-shot approaches while achieving competitive performance with supervised models.
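The abstract mentions enforcing temporal consistency through transitive constraints but does not spell out the optimization. As a minimal sketch of the underlying idea only, assuming a graph that contains just "before" relations and hypothetical event identifiers (`e1`, `e2`, ...), the following Python computes the relations implied by transitivity and flags contradictory pairs. It is not the authors' method.

```python
from itertools import product

def transitive_closure(before):
    """Return every (a, b) pair implied by transitivity of 'a before b'."""
    closure = set(before)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

def find_conflicts(before):
    """Pairs for which the closure asserts both 'a before b' and 'b before a'."""
    closure = transitive_closure(before)
    return sorted((a, b) for (a, b) in closure if (b, a) in closure and a < b)

# Hypothetical predictions for three events in one document.
consistent = {("e1", "e2"), ("e2", "e3")}
inconsistent = consistent | {("e3", "e1")}
```

A refinement step could use `find_conflicts` to decide which predicted edges to revisit; how edges are rescored or dropped is left open here.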

Title: DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities

Authors: Xiangyu Lu, Wang Xu, Haoyu Wang, Hongyun Zhou, Haiyan Zhao, Conghui Zhu, Tiejun Zhao, Muyun Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11123
Pdf URL: https://arxiv.org/pdf/2502.11123
Copy Paste: [[2502.11123]] DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities(https://arxiv.org/abs/2502.11123)
Keywords: language model, chat
Abstract: Real-time speech conversation is essential for natural and efficient human-machine interactions, requiring duplex and streaming capabilities. Traditional Transformer-based conversational chatbots operate in a turn-based manner and exhibit quadratic computational complexity that grows as the input size increases. In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. DuplexMamba enables simultaneous input processing and output generation, dynamically adjusting to support real-time streaming. Specifically, we develop a Mamba-based speech encoder and adapt it with a Mamba-based language model. Furthermore, we introduce a novel duplex decoding strategy that enables DuplexMamba to process input and generate output simultaneously. Experimental results demonstrate that DuplexMamba successfully implements duplex and streaming capabilities while achieving performance comparable to several recently developed Transformer-based models in automatic speech recognition (ASR) tasks and voice assistant benchmark evaluations.

Title: FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

Authors: Hui Wang, Shujie Liu, Lingwei Meng, Jinyu Li, Yifan Yang, Shiwan Zhao, Haiyang Sun, Yanqing Liu, Haoqin Sun, Jiaming Zhou, Yan Lu, Yong Qin
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2502.11128
Pdf URL: https://arxiv.org/pdf/2502.11128
Copy Paste: [[2502.11128]] FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching(https://arxiv.org/abs/2502.11128)
Keywords: language model
Abstract: To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in this https URL.

Title: Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM

Authors: Yuqi Liu, Yan Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11131
Pdf URL: https://arxiv.org/pdf/2502.11131
Copy Paste: [[2502.11131]] Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM(https://arxiv.org/abs/2502.11131)
Keywords: language model
Abstract: Given the rapid development of Legal AI, a lot of attention has been paid to one of the most important legal AI tasks: similar case retrieval, especially using language models. In our paper, however, we try to improve the ranking performance of current models from the perspective of learning to rank instead of language models. Specifically, we conduct experiments using a pairwise method, RankSVM, as the classifier in place of a fully connected layer, combined with commonly used language models on the similar case retrieval datasets LeCaRDv1 and LeCaRDv2. We conclude that RankSVM could generally help improve the retrieval performance on the LeCaRDv1 and LeCaRDv2 datasets compared with the original classifiers by optimizing the precise ranking. It could also help mitigate overfitting owing to class imbalance. Our code is available in this https URL.
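To illustrate what a pairwise method such as RankSVM does, here is a minimal Python sketch. It assumes toy 2-D feature vectors standing in for language-model representations of candidate cases, and it uses a hand-rolled hinge-loss subgradient solver rather than any particular SVM library; all function names are hypothetical and the paper's actual setup may differ.

```python
import random

def pairwise_differences(features, relevance):
    """Turn graded candidates for one query into (x_i - x_j, label) examples."""
    pairs = []
    for i in range(len(features)):
        for j in range(len(features)):
            if relevance[i] > relevance[j]:
                diff = [a - b for a, b in zip(features[i], features[j])]
                pairs.append((diff, 1.0))                 # i should outrank j
                pairs.append(([-d for d in diff], -1.0))  # mirrored example
    return pairs

def train_rank_svm(pairs, dim, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Fit a linear scorer on pair differences with L2-regularized hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * dim
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for x, y in pairs:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for k in range(dim):
                # Hinge subgradient is active only when the margin is violated.
                grad = reg * w[k] - (y * x[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w

def rank(features, w):
    """Return candidate indices sorted from highest to lowest score."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    return sorted(range(len(features)), key=lambda i: -scores[i])

# Toy query with three candidates; a higher relevance grade is better.
features = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
relevance = [2, 1, 0]
w = train_rank_svm(pairwise_differences(features, relevance), dim=2)
order = rank(features, w)
```

Because the model is trained on differences between candidates, it optimizes relative order directly instead of per-item class labels.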

Title: Safety Evaluation of DeepSeek Models in Chinese Contexts

Authors: Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Ning Wang, Zhenhong Long, Peijun Yang, Jiaojiao Zhao, Minjie Hua, Chaoyang Ma, Kai Wang, Shiguo Lian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11137
Pdf URL: https://arxiv.org/pdf/2502.11137
Copy Paste: [[2502.11137]] Safety Evaluation of DeepSeek Models in Chinese Contexts(https://arxiv.org/abs/2502.11137)
Keywords: prompt
Abstract: Recently, the DeepSeek series of models, leveraging their exceptional reasoning capabilities and open-source strategy, is reshaping the global AI landscape. Despite these advantages, they exhibit significant safety deficiencies. Research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 has a 100% attack success rate when processing harmful prompts. Additionally, multiple safety companies and research institutions have confirmed critical safety vulnerabilities in this model. As models demonstrating robust performance in Chinese and English, DeepSeek models require equally crucial safety assessments in both language contexts. However, current research has predominantly focused on safety evaluations in English environments, leaving a gap in comprehensive assessments of their safety performance in Chinese contexts. In response to this gap, this study introduces CHiSafetyBench, a Chinese-specific safety evaluation benchmark. This benchmark systematically evaluates the safety of DeepSeek-R1 and DeepSeek-V3 in Chinese contexts, revealing their performance across safety categories. The experimental results quantify the deficiencies of these two models in Chinese contexts, providing key insights for subsequent improvements.

Title: Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning

Authors: Qingwen Lin, Boyan Xu, Zijian Li, Zhifeng Hao, Keli Zhang, Ruichu Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11169
Pdf URL: https://arxiv.org/pdf/2502.11169
Copy Paste: [[2502.11169]] Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning(https://arxiv.org/abs/2502.11169)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recently, Long Chain-of-Thoughts (CoTs) have gained widespread attention for improving the reasoning capabilities of Large Language Models (LLMs). This requires existing LLMs, which lack the ability to generate Long CoTs, to acquire such capability through post-training methods. Without additional training, LLMs typically enhance their mathematical reasoning abilities through inference scaling methods such as MCTS. However, they are hindered by the large action space and inefficient search strategies, making it challenging to generate Long CoTs effectively. To tackle this issue, we propose constraining the action space and guiding the emergence of Long CoTs through a refined search strategy. In our proposed Constrained Monte Carlo Tree Search (C-MCTS) framework, we limit the actions selected from a constrained action space, which is divided into five disjoint subsets: understanding, planning, reflection, coding, and summary. Each subset is further constrained to a small number of predefined prompts, rather than allowing LLMs to generate actions arbitrarily. Additionally, we refine the search strategy by incorporating prior knowledge about the action sets, such as a human-like partial order of the action subsets and the pretrained process reward models. These strategies work together to significantly reduce the vast search space of Long CoTs. Extensive evaluations on mathematical reasoning benchmarks show that, under zero-shot settings, our method enables the 7B model to achieve reasoning capabilities that surpass those of the 72B model.
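The constrained action space described above can be pictured as a function that, for a given search node, lists only the predefined prompts whose subset is currently permitted. The Python sketch below is illustrative only: the five subset names come from the abstract, while the prompt texts and the specific partial order (`REQUIRES`) are assumptions made for this example, not the paper's actual configuration.

```python
# Hypothetical predefined prompts for each of the five action subsets.
ACTION_PROMPTS = {
    "understanding": ["Restate the problem in your own words."],
    "planning": ["Outline the steps needed to solve the problem."],
    "reflection": ["Check the previous step for mistakes."],
    "coding": ["Write code that carries out the plan."],
    "summary": ["State the final answer concisely."],
}

# Assumed partial order: each subset lists the subsets that must come first.
REQUIRES = {
    "understanding": set(),
    "planning": {"understanding"},
    "reflection": {"planning"},
    "coding": {"planning"},
    "summary": {"planning"},
}

def allowed_actions(history):
    """List (subset, prompt) actions permitted after the given action history."""
    seen = {subset for subset, _ in history}
    if "summary" in seen:
        return []  # a summary action ends the reasoning chain in this sketch
    return [
        (subset, prompt)
        for subset, prompts in ACTION_PROMPTS.items()
        if REQUIRES[subset] <= seen
        for prompt in prompts
    ]

start = allowed_actions([])
after_understanding = allowed_actions([("understanding", "Restate the problem in your own words.")])
```

In a tree search, `allowed_actions` would define the children expanded at each node, so the branching factor is bounded by the number of predefined prompts rather than by free-form generation.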

Title: Investigating Language Preference of Multilingual RAG Systems

Authors: Jeonghyun Park, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11175
Pdf URL: https://arxiv.org/pdf/2502.11175
Copy Paste: [[2502.11175]] Investigating Language Preference of Multilingual RAG Systems(https://arxiv.org/abs/2502.11175)
Keywords: language model, retrieval-augmented generation
Abstract: Multilingual Retrieval-Augmented Generation (mRAG) systems enhance language models by integrating external multilingual information to produce context-aware responses. However, mRAG systems struggle with retrieving relevant information due to linguistic variations between queries and documents, generating inconsistent responses when multilingual sources conflict. In this work, we systematically investigate language preferences in both retrieval and generation of mRAG through a series of experiments. Our analysis indicates that retrievers tend to prefer high-resource and query languages, yet this preference does not consistently improve generation performance. Moreover, we observe that generators prefer the query language or Latin scripts, leading to inconsistent outputs. To overcome these issues, we propose Dual Knowledge Multilingual RAG (DKM-RAG), a simple yet effective framework that fuses translated multilingual passages with complementary model knowledge. Empirical results demonstrate that DKM-RAG mitigates language preference in generation and enhances performance across diverse linguistic settings.

Title: LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning

Authors: Tianshi Zheng, Jiayang Cheng, Chunyang Li, Haochen Shi, Zihao Wang, Jiaxin Bai, Yangqiu Song, Ginny Y. Wong, Simon See
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11176
Pdf URL: https://arxiv.org/pdf/2502.11176
Copy Paste: [[2502.11176]] LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning(https://arxiv.org/abs/2502.11176)
Keywords: language model, llm
Abstract: Modern large language models (LLMs) employ various forms of logical inference, both implicitly and explicitly, when addressing reasoning tasks. Understanding how to optimally leverage these inference paradigms is critical for advancing LLMs' reasoning capabilities. This paper adopts an exploratory approach by introducing a controlled evaluation environment for analogical reasoning -- a fundamental cognitive task -- that is systematically parameterized across three dimensions: modality (textual, visual, symbolic), difficulty (easy, medium, hard), and task format (multiple-choice or free-text generation). We analyze the comparative dynamics of inductive, abductive, and deductive inference pipelines across these dimensions, and demonstrate that our findings generalize to broader in-context learning tasks. Additionally, we investigate advanced paradigms such as hypothesis selection, verification, and refinement, revealing their potential to scale up logical inference in LLM reasoning. This exploratory study provides a foundation for future research in enhancing LLM reasoning through systematic logical inference strategies.

Title: The Mirage of Model Editing: Revisiting Evaluation in the Wild

Authors: Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Qi Cao, Dawei Yin, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11177
Pdf URL: https://arxiv.org/pdf/2502.11177
Copy Paste: [[2502.11177]] The Mirage of Model Editing: Revisiting Evaluation in the Wild(https://arxiv.org/abs/2502.11177)
Keywords: llm
Abstract: Despite near-perfect results in artificial evaluations, the effectiveness of model editing in real-world applications remains unexplored. To bridge this gap, we propose to study model editing in question answering (QA) by establishing a rigorous evaluation practice to assess the effectiveness of editing methods in correcting LLMs' errors. It consists of QAEdit, a new benchmark derived from popular QA datasets, and a standardized evaluation framework. Our single editing experiments indicate that current editing methods perform substantially worse than previously reported (38.5% vs. ~96%). Through module analysis and controlled experiments, we demonstrate that this performance decline stems from issues in evaluation practices of prior editing research. One key issue is the inappropriate use of teacher forcing in testing, which prevents error propagation by feeding ground truth tokens (inaccessible in real-world scenarios) as input. Furthermore, we simulate real-world deployment by sequential editing, revealing that current approaches fail drastically with only 1000 edits. Our analysis provides a fundamental reexamination of both the real-world applicability of existing model editing methods and their evaluation practices, and establishes a rigorous evaluation framework with key insights to advance reliable and practical model editing research.
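The teacher-forcing issue described in the abstract can be illustrated with a toy example. The sketch below is not the QAEdit evaluation code: `toy_model`, `GOLD`, and both helper functions are hypothetical names invented here. It only shows why scoring each token with the ground-truth prefix as input can hide an error that derails free-running generation.

```python
GOLD = ["The", "capital", "is", "Paris"]


def toy_model(prefix):
    """Hypothetical next-token rule standing in for an edited LLM."""
    step = len(prefix)
    if step == 1 and prefix == GOLD[:1]:
        return "city"       # one local mistake at the second token
    if prefix == GOLD[:step]:
        return GOLD[step]   # correct as long as the prefix is still gold
    return "<unk>"          # derailed once the prefix contains an error


def teacher_forced_accuracy(model, gold):
    """Score each position with the ground-truth prefix fed as input."""
    hits = sum(model(gold[:t]) == gold[t] for t in range(len(gold)))
    return hits / len(gold)


def free_running_output(model, length):
    """Generate autoregressively, feeding the model its own predictions."""
    prefix = []
    for _ in range(length):
        prefix.append(model(prefix))
    return prefix


tf_acc = teacher_forced_accuracy(toy_model, GOLD)
generated = free_running_output(toy_model, len(GOLD))
fr_acc = sum(g == t for g, t in zip(generated, GOLD)) / len(GOLD)
print(tf_acc, fr_acc, generated)
```

In this toy setting the teacher-forced score stays high because the single mistake never reaches later inputs, while the free-running output never recovers after it.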

Title: Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls

Authors: Ante Wang, Linfeng Song, Ye Tian, Dian Yu, Haitao Mi, Xiangyu Duan, Zhaopeng Tu, Jinsong Su, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11183
Pdf URL: https://arxiv.org/pdf/2502.11183
Copy Paste: [[2502.11183]] Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls(https://arxiv.org/abs/2502.11183)
Keywords: language model, llm
Abstract: Recent advancements in tree search algorithms guided by verifiers have significantly enhanced the reasoning capabilities of large language models (LLMs), but at the cost of increased computational resources. In this work, we identify two key challenges contributing to this inefficiency: $\textit{over-exploration}$ due to redundant states with semantically equivalent content, and $\textit{under-exploration}$ caused by high variance in verifier scoring leading to frequent trajectory switching. To address these issues, we propose FETCH, an e$\textbf{f}$fici$\textbf{e}$nt $\textbf{t}$ree sear$\textbf{ch}$ framework, which is a flexible, plug-and-play system compatible with various tree search algorithms. Our framework mitigates over-exploration by merging semantically similar states using agglomerative clustering of text embeddings obtained from a fine-tuned SimCSE model. To tackle under-exploration, we enhance verifiers by incorporating temporal difference learning with adjusted $\lambda$-returns during training to reduce variance, and employing a verifier ensemble to aggregate scores during inference. Experiments on GSM8K, GSM-Plus, and MATH datasets demonstrate that our methods significantly improve reasoning accuracy and computational efficiency across four different tree search algorithms, paving the way for more practical applications of LLM-based reasoning. The code will be released upon acceptance.
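The abstract mentions training the verifier with temporal difference learning on adjusted $\lambda$-returns but does not specify the adjustment. As a point of reference only, the sketch below computes the textbook $\lambda$-return, $G_t^\lambda = r_t + \gamma[(1-\lambda)V(s_{t+1}) + \lambda G_{t+1}^\lambda]$, for one reasoning trajectory. The function name, argument layout, and example numbers are assumptions made here, not FETCH's implementation.

```python
def lambda_returns(rewards, values, gamma=1.0, lam=0.9):
    """Textbook lambda-returns for a single trajectory.

    rewards[t] is the reward received after step t.
    values[t] is the verifier's estimate V(s_t); values[-1] is the bootstrap
    value of the final state (0.0 if that state is terminal).
    """
    steps = len(rewards)
    if len(values) != steps + 1:
        raise ValueError("values must have one more entry than rewards")
    returns = [0.0] * steps
    next_return = values[steps]
    # Work backwards so each target can reuse the one that follows it.
    for t in reversed(range(steps)):
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


# A three-step solution that is only rewarded at the end (illustrative numbers).
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.8, 0.0]
print(lambda_returns(rewards, values, lam=1.0))  # Monte Carlo targets
print(lambda_returns(rewards, values, lam=0.0))  # one-step TD targets
```

Setting `lam` between 0 and 1 interpolates between the low-variance one-step targets and the unbiased but noisier Monte Carlo targets, which is the usual motivation for using $\lambda$-returns to reduce variance.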

Title: Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs

Authors: Wenxuan Wang, Xiaoyuan Liu, Kuiyi Gao, Jen-tse Huang, Youliang Yuan, Pinjia He, Shuai Wang, Zhaopeng Tu
Subjects: cs.CL, cs.AI, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2502.11184
Pdf URL: https://arxiv.org/pdf/2502.11184
Copy Paste: [[2502.11184]] Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs(https://arxiv.org/abs/2502.11184)
Keywords: language model, gpt, llm, prompt
Abstract: Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images. However, ensuring the safety of these models remains a significant challenge, particularly in accurately identifying whether multimodal content is safe or unsafe, a capability we term safety awareness. In this paper, we introduce MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios with 1500 carefully curated image-prompt pairs. MMSafeAware includes both unsafe and over-safety subsets to assess models' abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness. Evaluating nine widely used MLLMs using MMSafeAware reveals that current models are not sufficiently safe and often overly sensitive; for example, GPT-4V misclassifies 36.1% of unsafe inputs as safe and 59.9% of benign inputs as unsafe. We further explore three methods to improve safety awareness (prompting-based approaches, visual contrastive decoding, and vision-centric reasoning fine-tuning) but find that none achieve satisfactory performance. Our findings highlight the profound challenges in developing MLLMs with robust safety awareness, underscoring the need for further research in this area. All the code and data will be publicly available to facilitate future research.

Title: TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking

Authors: Shahriar Kabir Nahin, Rabindra Nath Nandi, Sagor Sarker, Quazi Sarwar Muhtaseem, Md Kowsher, Apu Chandraw Shill, Md Ibrahim, Mehadi Hasan Menon, Tareq Al Muntasir, Firoj Alam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11187
Pdf URL: https://arxiv.org/pdf/2502.11187
Copy Paste: [[2502.11187]] TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking(https://arxiv.org/abs/2502.11187)
Keywords: llm
Abstract: In this paper, we present TituLLMs, the first large pretrained Bangla LLMs, available in 1B and 3B parameter sizes. Due to computational constraints during both training and inference, we focused on smaller models. To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens. We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge, which also enables faster training and inference. There was a lack of benchmarking datasets to evaluate LLMs for Bangla. To address this gap, we developed five benchmarking datasets. We benchmarked various LLMs, including TituLLMs, and demonstrated that TituLLMs outperforms its initial multilingual versions. However, this is not always the case, highlighting the complexities of language adaptation. Our work lays the groundwork for adapting existing multilingual open models to other low-resource languages. To facilitate broader adoption and further research, we have made the TituLLMs models and benchmarking datasets publicly available (this https URL).

Title: ReLearn: Unlearning via Learning for Large Language Models

Authors: Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11190
Pdf URL: https://arxiv.org/pdf/2502.11190
Copy Paste: [[2502.11190]] ReLearn: Unlearning via Learning for Large Language Models(https://arxiv.org/abs/2502.11190)
Keywords: language model
Abstract: Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts the subsequent tokens prediction, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at this https URL.

Title: Large Language Models Penetration in Scholarly Writing and Peer Review

Authors: Li Zhou, Ruijie Zhang, Xunlian Dai, Daniel Hershcovich, Haizhou Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11193
Pdf URL: https://arxiv.org/pdf/2502.11193
Copy Paste: [[2502.11193]] Large Language Models Penetration in Scholarly Writing and Peer Review(https://arxiv.org/abs/2502.11193)
Keywords: language model, llm
Abstract: While the widespread use of Large Language Models (LLMs) brings convenience, it also raises concerns about the credibility of academic research and scholarly processes. To better understand these dynamics, we evaluate the penetration of LLMs across academic workflows from multiple perspectives and dimensions, providing compelling evidence of their growing influence. We propose a framework with two components: \texttt{ScholarLens}, a curated dataset of human- and LLM-generated content across scholarly writing and peer review for multi-perspective evaluation, and \texttt{LLMetrica}, a tool for assessing LLM penetration using rule-based metrics and model-based detectors for multi-dimensional evaluation. Our experiments demonstrate the effectiveness of \texttt{LLMetrica}, revealing the increasing role of LLMs in scholarly processes. These findings emphasize the need for transparency, accountability, and ethical practices in LLM usage to maintain academic credibility.

Title: A Survey of LLM-based Agents in Medicine: How far are we from Baymax?

Authors: Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Wenting Chen, Xiang Li, Yixuan Yuan
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.11211
Pdf URL: https://arxiv.org/pdf/2502.11211
Copy Paste: [[2502.11211]] A Survey of LLM-based Agents in Medicine: How far are we from Baymax?(https://arxiv.org/abs/2502.11211)
Keywords: language model, llm, hallucination, agent
Abstract: Large Language Models (LLMs) are transforming healthcare through the development of LLM-based agents that can understand, reason about, and assist with medical tasks. This survey provides a comprehensive review of LLM-based agents in medicine, examining their architectures, applications, and challenges. We analyze the key components of medical agent systems, including system profiles, clinical planning mechanisms, medical reasoning frameworks, and external capacity enhancement. The survey covers major application scenarios such as clinical decision support, medical documentation, training simulations, and healthcare service optimization. We discuss evaluation frameworks and metrics used to assess these agents' performance in healthcare settings. While LLM-based agents show promise in enhancing healthcare delivery, several challenges remain, including hallucination management, multimodal integration, implementation barriers, and ethical considerations. The survey concludes by highlighting future research directions, including advances in medical reasoning inspired by recent developments in LLM architectures, integration with physical systems, and improvements in training simulations. This work provides researchers and practitioners with a structured overview of the current state and future prospects of LLM-based agents in medicine.

Title: Asymmetric Conflict and Synergy in Post-training for LLM-based Multilingual Machine Translation

Authors: Tong Zheng, Yan Wen, Huiwen Bao, Junfeng Guo, Heng Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11223
Pdf URL: https://arxiv.org/pdf/2502.11223
Copy Paste: [[2502.11223]] Asymmetric Conflict and Synergy in Post-training for LLM-based Multilingual Machine Translation(https://arxiv.org/abs/2502.11223)
Keywords: language model, llm
Abstract: The emergence of Large Language Models (LLMs) has advanced multilingual machine translation (MMT), yet the Curse of Multilinguality (CoM) remains a major challenge. Existing work in LLM-based MMT typically mitigates this issue via scaling up training and computation budget, which raises a critical question: Is scaling up the training and computation budget truly necessary for high-quality MMT, or can a deeper understanding of CoM provide a more efficient solution? To explore this problem, we analyze the linguistic conflicts and synergy, the underlying mechanism of CoM during the post-training phase. We identify an asymmetric phenomenon in linguistic conflicts and synergy: the dominance of conflicts and synergy varies in different translation directions, leading to sub-optimal adaptation in existing post-training methods. We further find that a significant bottleneck in MMT appears to lie in post-training rather than multilingual pre-training, suggesting the need for more effective adaptation strategies. Building on these new insights, we propose a direction-aware training approach, combined with group-wise model merging, to address asymmetry in linguistic conflicts and synergy explicitly. Leveraging this strategy, our method fine-tunes X-ALMA-13B-Pretrain (trained only with multilingual pre-training), achieving comparable performance to XALMA-13B (only SFT) while using only 20B pretraining tokens and 17B parameters (5.5x fewer pretraining tokens and a 1.7x smaller model size), with just a 0.85 COMET drop on the Flores-200 test sets of 50 languages.

Title: Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs

Authors: Mohammad Reza Rezaei, Adji Bousso Dieng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11228
Pdf URL: https://arxiv.org/pdf/2502.11228
Copy Paste: [[2502.11228]] Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs(https://arxiv.org/abs/2502.11228)
Keywords: language model, gpt, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires connecting information from multiple sources. This paper introduces Vendi-RAG, a framework based on an iterative process that jointly optimizes retrieval diversity and answer quality. This joint optimization leads to significantly higher accuracy for multi-hop QA tasks. Vendi-RAG leverages the Vendi Score (VS), a flexible similarity-based diversity metric, to promote semantic diversity in document retrieval. It then uses an LLM judge that evaluates candidate answers, generated after a reasoning step, and outputs a score that the retriever uses to balance relevance and diversity among the retrieved documents during each iteration. Experiments on three challenging datasets -- HotpotQA, MuSiQue, and 2WikiMultiHopQA -- demonstrate Vendi-RAG's effectiveness in multi-hop reasoning tasks. The framework achieves significant accuracy improvements over traditional single-step and multi-step RAG approaches, with accuracy increases reaching up to +4.2% on HotpotQA, +4.1% on 2WikiMultiHopQA, and +1.3% on MuSiQue compared to Adaptive-RAG, the current best baseline. The benefits of Vendi-RAG are even more pronounced as the number of retrieved documents increases. Finally, we evaluated Vendi-RAG across different LLM backbones, including GPT-3.5, GPT-4, and GPT-4o-mini, and observed consistent improvements, demonstrating that the framework's advantages are model-agnostic.
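For readers unfamiliar with the Vendi Score that the abstract builds on: it is the exponential of the Shannon entropy of the eigenvalues of $K/n$, where $K$ is an $n \times n$ similarity matrix with unit diagonal. The sketch below computes only that score, in plain Python, from a precomputed similarity matrix. The helper names and the small Jacobi eigen-solver are illustrative choices made here, not the paper's implementation, and the iterative retrieval loop and LLM judge are not modelled.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [[float(v) for v in row] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes a[p][q], kept within [-pi/4, pi/4].
                y, x = 2.0 * a[p][q], a[q][q] - a[p][p]
                if x < 0:
                    y, x = -y, -x
                theta = 0.5 * math.atan2(y, x)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q (A <- A J)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q (A <- J^T A)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def vendi_score(similarity):
    """exp(Shannon entropy of the eigenvalues of K / n)."""
    n = len(similarity)
    scaled = [[v / n for v in row] for row in similarity]
    eigenvalues = symmetric_eigenvalues(scaled)
    # Treat 0 * log(0) as 0 and ignore round-off around zero.
    entropy = -sum(lam * math.log(lam) for lam in eigenvalues if lam > 1e-12)
    return math.exp(entropy)


# Two duplicate passages plus one unrelated passage (illustrative similarities).
print(vendi_score([[1, 1, 0], [1, 1, 0], [0, 0, 1]]))
```

The score behaves like an effective number of distinct items: it equals 1 when every retrieved passage is identical and equals n when all n passages are mutually dissimilar, which is why it can serve as a diversity signal during retrieval.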

Title: Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment

Authors: Somnath Banerjee, Sayan Layek, Pratyush Chatterjee, Animesh Mukherjee, Rima Hazra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11244
Pdf URL: https://arxiv.org/pdf/2502.11244
Copy Paste: [[2502.11244]] Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment(https://arxiv.org/abs/2502.11244)
Keywords: language model, llm
Abstract: Ensuring consistent safety across multiple languages remains a significant challenge for large language models (LLMs). We introduce Soteria, a lightweight yet powerful strategy that locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language. By altering only a fraction of parameters, Soteria drastically reduces policy violations without sacrificing overall model performance, even in low-resource settings. To rigorously evaluate our approach, we also present XThreatBench, a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs (e.g., Llama, Qwen, Mistral) show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages. These findings highlight a promising path toward scalable, linguistically attuned, and ethically aligned LLMs worldwide.

Title: Uncertainty-Aware Step-wise Verification with Generative Reward Models

Authors: Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, Yarin Gal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11250
Pdf URL: https://arxiv.org/pdf/2502.11250
Copy Paste: [[2502.11250]] Uncertainty-Aware Step-wise Verification with Generative Reward Models(https://arxiv.org/abs/2502.11250)
Keywords: language model, llm
Abstract: Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.

Title: Leveraging Conditional Mutual Information to Improve Large Language Model Fine-Tuning For Classification

Authors: Thanushon Sivakaran, En-Hui Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11258
Pdf URL: https://arxiv.org/pdf/2502.11258
Copy Paste: [[2502.11258]] Leveraging Conditional Mutual Information to Improve Large Language Model Fine-Tuning For Classification(https://arxiv.org/abs/2502.11258)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have demonstrated remarkable capabilities in recent years, the potential of information theory (IT) to enhance LLM development remains underexplored. This paper introduces the information theoretic principle of Conditional Mutual Information (CMI) to LLM fine-tuning for classification tasks, exploring its promise in two main ways: minimizing CMI to improve a model's standalone performance and maximizing CMI to enhance knowledge distillation (KD) for more capable student models. To apply CMI in LLM fine-tuning, we adapt the recently proposed CMI-constrained deep learning framework, which was initially developed for image classification, with some modification. By minimizing CMI during LLM fine-tuning, we achieve superior performance gains on 6 of 8 GLUE classification tasks compared to BERT. Additionally, maximizing CMI during the KD process results in significant performance improvements in 6 of 8 GLUE classification tasks compared to DistilBERT. These findings demonstrate CMI's adaptability for optimizing both standalone LLMs and student models, showcasing its potential as a robust framework for advancing LLM fine-tuning. Our work bridges the gap between information theory and LLM development, offering new insights for building high-performing language models.
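As a rough illustration of the quantity being minimized or maximized, the sketch below estimates the conditional mutual information $I(X; \hat{Y} \mid Y)$ from a batch of classifier outputs, assuming the prediction $\hat{Y}$ depends on the label only through the input. Under that assumption the CMI equals the average KL divergence between each sample's output distribution and the mean output distribution of its class. The function names and the toy numbers are assumptions made here; this is an evaluation-style estimate, not the paper's training objective.

```python
import math
from collections import defaultdict


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def empirical_cmi(probs, labels):
    """Estimate I(X; Y_hat | Y) from per-sample output distributions.

    probs[i] is the model's output distribution for sample i and labels[i]
    is its ground-truth class. Each class centroid is the mean output
    distribution over the samples of that class.
    """
    by_class = defaultdict(list)
    for p, y in zip(probs, labels):
        by_class[y].append(p)
    total = 0.0
    for members in by_class.values():
        size = len(members[0])
        centroid = [sum(p[k] for p in members) / len(members) for k in range(size)]
        total += sum(kl_divergence(p, centroid) for p in members)
    return total / len(probs)


# Class 0 outputs are spread out; class 1 outputs are tightly concentrated.
probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.2, 0.8]]
labels = [0, 0, 1, 1]
print(empirical_cmi(probs, labels))
```

A lower value means outputs within each class are more concentrated around their class centroid, which is the intuition behind minimizing CMI for a standalone classifier. A higher value means a teacher's outputs carry more per-sample information beyond the label, which is the intuition behind maximizing it for distillation.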

Title: The Shrinking Landscape of Linguistic Diversity in the Age of Large Language Models

Authors: Zhivar Sourati, Farzan Karimi-Malekabadi, Meltem Ozcan, Colin McDaniel, Alireza Ziabari, Jackson Trager, Ala Tak, Meng Chen, Fred Morstatter, Morteza Dehghani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11266
Pdf URL: https://arxiv.org/pdf/2502.11266
Copy Paste: [[2502.11266]] The Shrinking Landscape of Linguistic Diversity in the Age of Large Language Models(https://arxiv.org/abs/2502.11266)
Keywords: language model, llm, prompt
Abstract: Language is far more than a communication tool. A wealth of information - including but not limited to the identities, psychological states, and social contexts of its users - can be gleaned through linguistic markers, and such insights are routinely leveraged across diverse fields ranging from product development and marketing to healthcare. In four studies utilizing experimental and observational methods, we demonstrate that the widespread adoption of large language models (LLMs) as writing assistants is linked to notable declines in linguistic diversity and may interfere with the societal and psychological insights language provides. We show that while the core content of texts is retained when LLMs polish and rewrite texts, not only do they homogenize writing styles, but they also alter stylistic elements in a way that selectively amplifies certain dominant characteristics or biases while suppressing others - emphasizing conformity over individuality. By varying LLMs, prompts, classifiers, and contexts, we show that these trends are robust and consistent. Our findings highlight a wide array of risks associated with linguistic homogenization, including compromised diagnostic processes and personalization efforts, the exacerbation of existing divides and barriers to equity in settings like personnel selection where language plays a critical role in assessing candidates' qualifications, communication skills, and cultural fit, and the undermining of efforts for cultural preservation.

Title: Improved Unbiased Watermark for Large Language Models

Authors: Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11268
Pdf URL: https://arxiv.org/pdf/2502.11268
Copy Paste: [[2502.11268]] Improved Unbiased Watermark for Large Language Models(https://arxiv.org/abs/2502.11268)
Keywords: language model
Abstract: As artificial intelligence surpasses human capabilities in text generation, the necessity to authenticate the origins of AI-generated content has become paramount. Unbiased watermarks offer a powerful solution by embedding statistical signals into language model-generated text without distorting the quality. In this paper, we introduce MCmark, a family of unbiased, Multi-Channel-based watermarks. MCmark works by partitioning the model's vocabulary into segments and promoting token probabilities within a selected segment based on a watermark key. We demonstrate that MCmark not only preserves the original distribution of the language model but also offers significant improvements in detectability and robustness over existing unbiased watermarks. Our experiments with widely-used language models demonstrate an improvement in detectability of over 10% using MCmark, compared to existing state-of-the-art unbiased watermarks. This advancement underscores MCmark's potential in enhancing the practical application of watermarking in AI-generated texts.

Title: Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest

Authors: Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11275
Pdf URL: https://arxiv.org/pdf/2502.11275
Copy Paste: [[2502.11275]] Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest(https://arxiv.org/abs/2502.11275)
Keywords: language model, llm
Abstract: Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, with 102.6M extractive data converted from LLM's pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
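Note: the reframing of next-token prediction as extraction can be illustrated with a small sketch. It assumes a whitespace tokenizer and exact string matching, and the function name `bio_tags` and the example sentence are hypothetical; the paper's actual data conversion may differ. When the tokens to be predicted already occur in the context, the context is tagged with BIO labels instead of generating them.

```python
def bio_tags(context_tokens, span_tokens):
    """Tag every occurrence of span_tokens inside context_tokens with B/I, all other tokens with O."""
    tags = ["O"] * len(context_tokens)
    n = len(span_tokens)
    if n == 0:
        return tags
    for start in range(len(context_tokens) - n + 1):
        if context_tokens[start:start + n] == span_tokens:
            tags[start] = "B"                      # first token of the span
            for j in range(start + 1, start + n):  # remaining tokens of the span
                tags[j] = "I"
    return tags


# Hypothetical example: the "next tokens" (the answer) already appear in the context.
context = "Marie Curie was born in Warsaw . Where was she born ?".split()
next_tokens = ["Warsaw"]
tags = bio_tags(context, next_tokens)

for token, tag in zip(context, tags):
    print(f"{token}\t{tag}")
```

Tagging spans in place, rather than generating them, is what lets ordinary LLM training text double as IE supervision in this sketch.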

Title: The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval

Authors: Ting-Rui Chiang, Dani Yogatama
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11276
Pdf URL: https://arxiv.org/pdf/2502.11276
Copy Paste: [[2502.11276]] The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval(https://arxiv.org/abs/2502.11276)
Keywords: language model, llm, long context
Abstract: The Rotary Position Embedding (RoPE) is widely used in the attention heads of many large language models (LLMs). It rotates dimensions in the query and the key vectors by different angles according to their positions in the input sequence. For long context modeling, the range of positions may vary a lot, and thus RoPE rotates some dimensions through a wide range of angles. We hypothesize that the wide range of rotation angles may prevent LLMs from utilizing those dimensions. To validate this hypothesis, we present a controlled experiment showing that applying RoPE causes low utility of certain dimensions. Our analyses on three LLMs also indicate that these dimensions do not help LLMs perform long-context question answering.
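Note: a minimal sketch of the rotation the abstract describes, assuming the commonly used RoPE formulation in which dimension pair i at position p is rotated by the angle p * base^(-2i/d), with base = 10000. The function names and the example vectors are illustrative and not taken from the paper. The sketch shows that the attention score depends only on the relative offset, and that low-index pairs sweep through a far wider range of angles than high-index pairs.

```python
import math


def rope_angles(position, dim, base=10000.0):
    """Rotation angle (in radians) for each dimension pair at a given position."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by their position-dependent angle."""
    out = []
    for i, angle in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.4]

# Both pairs of positions have the same offset (3), so the scores agree.
score_a = dot(apply_rope(q, 5), apply_rope(k, 2))
score_b = dot(apply_rope(q, 105), apply_rope(k, 102))

# For a 128-dimensional head at position 4096, the first pair has rotated
# by 4096 radians while the last pair has rotated by less than one radian.
angles = rope_angles(4096, 128)
print(score_a, score_b, angles[0], angles[-1])
```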

Title: CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships?

Authors: Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.11300
Pdf URL: https://arxiv.org/pdf/2502.11300
Copy Paste: [[2502.11300]] CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships?(https://arxiv.org/abs/2502.11300)
Keywords: language model, gpt, llm, prompt
Abstract: Multimodal Large Language Models (MLLMs) are renowned for their superior instruction-following and reasoning capabilities across diverse problem domains. However, existing benchmarks primarily focus on assessing factual and logical correctness in downstream tasks, with limited emphasis on evaluating MLLMs' ability to interpret pragmatic cues and intermodal relationships. To address this gap, we assess the competency of MLLMs in performing Multimodal Discourse Analysis (MDA) using Coherence Relations. Our benchmark, CORDIAL, encompasses a broad spectrum of Coherence Relations across 3 different discourse domains at varying levels of granularity. Through our experiments on 10+ MLLMs employing different prompting strategies, we show that even top models like Gemini 1.5 Pro and GPT-4o fail to match the performance of simple classifier-based baselines. This study emphasizes the need to move beyond similarity-based metrics and adopt a discourse-driven framework for evaluating MLLMs, providing a more nuanced assessment of their capabilities. The benchmark and code are available at: this https URL.

Title: Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation

Authors: Hieu Nguyen, Zihao He, Shoumik Atul Gandre, Ujjwal Pasupulety, Sharanya Kumari Shivakumar, Kristina Lerman
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11306
Pdf URL: https://arxiv.org/pdf/2502.11306
Copy Paste: [[2502.11306]] Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation(https://arxiv.org/abs/2502.11306)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) often suffer from hallucination, generating factually incorrect or ungrounded content, which limits their reliability in high-stakes applications. A key factor contributing to hallucination is the use of hard labels during training, which enforce deterministic supervision, encourage overconfidence, and disregard the uncertainty inherent in natural language. To address this, we propose mitigating hallucination through knowledge distillation (KD), where a teacher model provides smoothed soft labels to a student model, reducing overconfidence and improving factual grounding. We apply KD during supervised finetuning on instructional data, evaluating its effectiveness across LLMs from different families. Experimental results on summarization benchmarks demonstrate that KD reduces hallucination compared to standard finetuning while preserving performance on general NLP tasks. These findings highlight KD as a promising approach for mitigating hallucination in LLMs and improving model reliability.
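Note: the contrast between hard labels and teacher-provided soft labels can be sketched as follows. This is a generic illustration of soft-label distillation, not the paper's training recipe: the logits, the temperature of 2.0, and the function names are assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a smoother distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_logits):
    """Cross-entropy between a target distribution and the student's predicted distribution."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, student_probs) if t > 0)


student_logits = [2.0, 1.0, 0.1]
teacher_logits = [3.0, 2.5, 0.2]  # hypothetical teacher output for the same token

hard_label = [1.0, 0.0, 0.0]                           # one-hot supervision
soft_label = softmax(teacher_logits, temperature=2.0)  # smoothed teacher distribution

hard_loss = cross_entropy(hard_label, student_logits)
soft_loss = cross_entropy(soft_label, student_logits)
print(hard_loss, soft_loss, soft_label)
```

With the one-hot target, the loss keeps pushing the student toward full confidence in a single token; with the soft target, probability mass on plausible alternatives is rewarded as well.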

Title: System Message Generation for User Preferences using Open-Source Models

Authors: Minbyul Jeong, Jungho Cho, Minsoo Khang, Dawoon Jung, Teakgyu Hong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11330
Pdf URL: https://arxiv.org/pdf/2502.11330
Copy Paste: [[2502.11330]] System Message Generation for User Preferences using Open-Source Models(https://arxiv.org/abs/2502.11330)
Keywords: language model, llm, prompt
Abstract: System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts to initiate conversations. Through system messages, users can assign specific roles, perform intended tasks, incorporate background information, and specify various output formats and communication styles. Despite such versatility, publicly available data often lack system messages and are subject to strict license constraints in industry. Manual labeling of publicly available data with system messages that align with user instructions demands significant resources. In view of such challenges, our work introduces SysGen, a pipeline for generating system messages with better-aligned assistant responses from a supervised fine-tuning dataset that lacks system messages. Training on SysGen data yields substantial improvements in the alignment of model responses with system messages and user instructions, as demonstrated across various open-source models on the Multifacet benchmark, while having minimal impact on other unseen benchmarks such as Open LLM Leaderboard 2. Our qualitative analysis highlights the importance of diverse system messages to ensure better adaptability across different contexts.

Title: ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability

Authors: Ryuto Koike, Masahiro Kaneko, Ayana Niwa, Preslav Nakov, Naoaki Okazaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11336
Pdf URL: https://arxiv.org/pdf/2502.11336
Copy Paste: [[2502.11336]] ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability(https://arxiv.org/abs/2502.11336)
Keywords: language model, gpt, llm
Abstract: Detecting texts generated by Large Language Models (LLMs) could cause grave mistakes due to incorrect decisions, such as undermining a student's academic dignity. LLM text detection thus needs to ensure the interpretability of the decision, which can help users judge how reliable its prediction is. When humans verify whether a text is human-written or LLM-generated, they intuitively investigate which of the two it shares more similar spans with. However, existing interpretable detectors are not aligned with the human decision-making process and fail to offer evidence that users easily understand. To bridge this gap, we introduce ExaGPT, an interpretable detection approach grounded in the human decision-making process for verifying the origin of a text. ExaGPT identifies a text by checking whether it shares more similar spans with human-written or with LLM-generated texts from a datastore. This approach can provide, as evidence, similar span examples that contribute to the decision for each span in the text. Our human evaluation demonstrates that providing similar span examples contributes more effectively to judging the correctness of the decision than existing interpretable methods. Moreover, extensive experiments in four domains and three generators show that ExaGPT massively outperforms prior powerful detectors by up to +40.9 points of accuracy at a false positive rate of 1%.
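Note: the span-comparison idea can be sketched with a toy example. Everything here is a stand-in: spans are fixed token trigrams, similarity is Jaccard overlap of token sets, and the two one-sentence datastores are made up. The paper's span segmentation and similarity measure are not specified in the abstract and are likely different.

```python
def spans(text, n=3):
    """Split a text into overlapping token n-grams (a stand-in for real span segmentation)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap(a, b):
    """Jaccard similarity between two spans viewed as token sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def nearest(span, store_spans):
    """Most similar span in a datastore."""
    return max(store_spans, key=lambda s: overlap(span, s))


def classify(text, human_texts, llm_texts, n=3):
    """Vote per span; return the label and the retrieved spans that serve as evidence."""
    human_spans = [s for t in human_texts for s in spans(t, n)]
    llm_spans = [s for t in llm_texts for s in spans(t, n)]
    votes, evidence = 0, []
    for span in spans(text, n):
        h, l = nearest(span, human_spans), nearest(span, llm_spans)
        h_sim, l_sim = overlap(span, h), overlap(span, l)
        if l_sim > h_sim:
            votes += 1
            evidence.append((span, "llm", l))
        elif h_sim > l_sim:
            votes -= 1
            evidence.append((span, "human", h))
    # Ties default to "human" here; this is an arbitrary choice for the sketch.
    return ("llm" if votes > 0 else "human"), evidence


human_store = ["honestly the movie was kinda boring but ok"]
llm_store = ["as an ai language model i can provide a summary"]
label, evidence = classify("as an ai language model i think so", human_store, llm_store)
print(label)
for span, source, match in evidence:
    print(span, "->", source, match)
```

Returning the matched spans alongside the label is what makes the decision inspectable: a reader can check each span against the example it was matched to.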

Title: "Nuclear Deployed!": Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents

Authors: Rongwu Xu, Xiaojian Li, Shuo Chen, Wei Xu
Subjects: cs.CL, cs.AI, cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2502.11355
Pdf URL: https://arxiv.org/pdf/2502.11355
Copy Paste: [[2502.11355]] "Nuclear Deployed!": Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents(https://arxiv.org/abs/2502.11355)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are evolving into autonomous decision-makers, raising concerns about catastrophic risks in high-stakes scenarios, particularly in Chemical, Biological, Radiological and Nuclear (CBRN) domains. Based on the insight that such risks can originate from trade-offs between the agent's Helpful, Harmless and Honest (HHH) goals, we build a novel three-stage evaluation framework, which is carefully constructed to effectively and naturally expose such risks. We conduct 14,400 agentic simulations across 12 advanced LLMs, with extensive experiments and analysis. Results reveal that LLM agents can autonomously engage in catastrophic behaviors and deception, without being deliberately induced. Furthermore, stronger reasoning abilities often increase, rather than mitigate, these risks. We also show that these agents can violate instructions and superior commands. On the whole, we empirically prove the existence of catastrophic risks in autonomous LLM agents. We will release our code upon request.

Title: VLDBench: Vision Language Models Disinformation Detection Benchmark

Authors: Shaina Raza, Ashmal Vayani, Aditya Jain, Aravind Narayanan, Vahid Reza Khazaie, Syed Raza Bashir, Elham Dolatabadi, Gias Uddin, Christos Emmanouilidis, Rizwan Qureshi, Mubarak Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11361
Pdf URL: https://arxiv.org/pdf/2502.11361
Copy Paste: [[2502.11361]] VLDBench: Vision Language Models Disinformation Detection Benchmark(https://arxiv.org/abs/2502.11361)
Keywords: language model, llm
Abstract: The rapid rise of AI-generated content has made detecting disinformation increasingly challenging. In particular, multimodal disinformation, i.e., online posts and articles that contain images and texts with fabricated information, is specially designed to deceive. While existing AI safety benchmarks primarily address bias and toxicity, multimodal disinformation detection remains largely underexplored. To address this challenge, we present the Vision-Language Disinformation Detection Benchmark (VLDBench), the first comprehensive benchmark for detecting disinformation across both unimodal (text-only) and multimodal (text and image) content, comprising 31,000 news article-image pairs, spanning 13 distinct categories, for robust evaluation. VLDBench features a rigorous semi-automated data curation pipeline, with 22 domain experts dedicating more than 300 hours to annotation, achieving a strong inter-annotator agreement (Cohen's kappa = 0.78). We extensively evaluate state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs), demonstrating that integrating textual and visual cues in multimodal news posts improves disinformation detection accuracy by 5-35% compared to unimodal models. Developed in alignment with AI governance frameworks such as the EU AI Act, NIST guidelines, and the MIT AI Risk Repository 2024, VLDBench is expected to become a benchmark for detecting disinformation in online multimodal content. Our code and data will be publicly available.
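Note: the inter-annotator agreement statistic quoted in the abstract, Cohen's kappa, is defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below computes it for two annotators; the label lists are made up for illustration and are unrelated to the VLDBench annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(labels_a)
    # p_o: fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from the product of per-annotator label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: the annotators disagree on one item out of ten.
annotator_a = ["disinfo", "disinfo", "real", "real", "disinfo",
               "real", "disinfo", "real", "real", "disinfo"]
annotator_b = ["disinfo", "disinfo", "real", "real", "disinfo",
               "real", "real", "real", "real", "disinfo"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # p_o = 0.9, p_e = 0.5, so kappa = 0.8
```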

Title: Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning

Authors: Yilei Tu, Andrew Xue, Freda Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11364
Pdf URL: https://arxiv.org/pdf/2502.11364
Copy Paste: [[2502.11364]] Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning(https://arxiv.org/abs/2502.11364)
Keywords: language model, prompt
Abstract: While multilingual large language models generally perform adequately, and sometimes even rival English performance on high-resource languages (HRLs), they often significantly underperform on low-resource languages (LRLs). Among several prompting strategies aiming at bridging the gap, multilingual in-context learning (ICL) has been particularly effective when demonstrations in target languages are unavailable. However, a systematic understanding of when and why it works well is lacking. In this work, we systematically analyze multilingual ICL, using demonstrations in HRLs to enhance cross-lingual transfer. We show that demonstrations in mixed HRLs consistently outperform English-only ones across the board, particularly for tasks written in LRLs. Surprisingly, our ablation study shows that the presence of irrelevant non-English sentences in the prompt yields measurable gains, suggesting the effectiveness of multilingual exposure itself. Our results highlight the potential of strategically leveraging multilingual resources to bridge the performance gap for underrepresented languages.

Title: LLMs can Perform Multi-Dimensional Analytic Writing Assessments: A Case Study of L2 Graduate-Level Academic English Writing

Authors: Zhengxiang Wang, Veronika Makarova, Zhi Li, Jordan Kodner, Owen Rambow
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11368
Pdf URL: https://arxiv.org/pdf/2502.11368
Copy Paste: [[2502.11368]] LLMs can Perform Multi-Dimensional Analytic Writing Assessments: A Case Study of L2 Graduate-Level Academic English Writing(https://arxiv.org/abs/2502.11368)
Keywords: llm, prompt
Abstract: The paper explores the performance of LLMs in the context of multi-dimensional analytic writing assessments, i.e. their ability to provide both scores and comments based on multiple assessment criteria. Using a corpus of literature reviews written by L2 graduate students and assessed by human experts against 9 analytic criteria, we prompt several popular LLMs to perform the same task under various conditions. To evaluate the quality of feedback comments, we apply a novel feedback comment quality evaluation framework. This framework is interpretable, cost-efficient, scalable, and reproducible, compared to existing methods that rely on manual judgments. We find that LLMs can generate reasonably good and generally reliable multi-dimensional analytic assessments. We release our corpus for reproducibility.

Title: Exploring the Small World of Word Embeddings: A Comparative Study on Conceptual Spaces from LLMs of Different Scales

Authors: Zhu Liu, Ying Liu, KangYang Luo, Cunliang Kong, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11380
Pdf URL: https://arxiv.org/pdf/2502.11380
Copy Paste: [[2502.11380]] Exploring the Small World of Word Embeddings: A Comparative Study on Conceptual Spaces from LLMs of Different Scales(https://arxiv.org/abs/2502.11380)
Keywords: language model, llm
Abstract: A conceptual space represents concepts as nodes and semantic relatedness as edges. Word embeddings, combined with a similarity metric, provide an effective approach to constructing such a space. Typically, embeddings are derived from traditional distributed models or encoder-only pretrained models, whose objectives directly capture the meaning of the current token. In contrast, decoder-only models, including large language models (LLMs), predict the next token, making their embeddings less directly tied to the current token's semantics. Moreover, comparative studies on LLMs of different scales remain underexplored. In this paper, we construct a conceptual space using word embeddings from LLMs of varying scales and comparatively analyze their properties. We establish a network based on a linguistic typology-inspired connectivity hypothesis, examine global statistical properties, and compare LLMs of varying scales. Locally, we analyze conceptual pairs, WordNet relations, and a cross-lingual semantic network for qualitative words. Our results indicate that the constructed space exhibits small-world properties, characterized by a high clustering coefficient and short path lengths. Larger LLMs generate more intricate spaces, with longer paths reflecting richer relational structures and connections. Furthermore, the network serves as an efficient bridge for cross-lingual semantic mapping.
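Note: the two statistics that characterize small-world structure, the clustering coefficient and the average shortest path length, can be computed on a similarity graph as sketched below. The two-dimensional toy embeddings, the cosine threshold of 0.5, and the function names are assumptions for illustration; the paper's connectivity hypothesis and embeddings are not reproduced here.

```python
import math
from collections import deque
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def build_graph(embeddings, threshold):
    """Connect two words when the cosine similarity of their embeddings reaches the threshold."""
    graph = {word: set() for word in embeddings}
    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def avg_clustering(graph):
    """Mean local clustering coefficient; nodes with fewer than two neighbors count as 0."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
            total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def avg_path_length(graph):
    """Mean shortest-path length over reachable ordered pairs, via breadth-first search."""
    total, pairs = 0, 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in graph[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical 2-D embeddings; real word embeddings have hundreds of dimensions or more.
toy = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "pet": [0.8, 0.4], "car": [0.1, 1.0]}
graph = build_graph(toy, threshold=0.5)
print(avg_clustering(graph), avg_path_length(graph))
```

A network is usually called small-world when the first number is high relative to a random graph of the same density while the second stays short.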

Title: RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following

Authors: Junru Lu, Jiazheng Li, Guodong Shen, Lin Gui, Siyu An, Yulan He, Di Yin, Xing Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11387
Pdf URL: https://arxiv.org/pdf/2502.11387
Copy Paste: [[2502.11387]] RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following(https://arxiv.org/abs/2502.11387)
Keywords: language model, llm, chat
Abstract: Role-playing is important for Large Language Models (LLMs) to follow diverse instructions while maintaining role identity and the role's pre-defined ability limits. Existing role-playing datasets mostly contribute to controlling role style and knowledge boundaries, but overlook role-playing in instruction-following scenarios. We introduce a fine-grained role-playing and instruction-following composite benchmark, named RoleMRC, including: (1) Multi-turn dialogues between ideal roles and humans, including free chats or discussions upon given passages; (2) Role-playing machine reading comprehension, involving response, refusal, and attempts according to passage answerability and role ability; (3) More complex scenarios with nested, multi-turn and prioritized instructions. The final RoleMRC features a 10.2k role profile meta-pool, 37.9k well-synthesized role-playing instructions, and 1.4k testing samples. We develop a pipeline to quantitatively evaluate the fine-grained role-playing and instruction-following capabilities of several mainstream LLMs, as well as models that are fine-tuned on our data. Moreover, cross-evaluation on external role-playing datasets confirms that fine-tuning models on RoleMRC enhances instruction-following without compromising general role-playing and reasoning capabilities. We also probe the neural-level activation maps of different capabilities over post-tuned LLMs. Access to our RoleMRC, RoleMRC-mix and Codes: this https URL.

Title: HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning

Authors: Xiaoyuan Li, Moxin Li, Rui Men, Yichang Zhang, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11393
Pdf URL: https://arxiv.org/pdf/2502.11393
Copy Paste: [[2502.11393]] HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning(https://arxiv.org/abs/2502.11393)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations in questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or just memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, with extensive experiments offering valuable insights to the community in commonsense reasoning for LLMs.

Title: Revisiting Robust RAG: Do We Still Need Complex Robust Training in the Era of Powerful LLMs?

Authors: Hanxing Ding, Shuchang Tao, Liang Pang, Zihao Wei, Liwei Chen, Kun Xu, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11400
Pdf URL: https://arxiv.org/pdf/2502.11400
Copy Paste: [[2502.11400]] Revisiting Robust RAG: Do We Still Need Complex Robust Training in the Era of Powerful LLMs?(https://arxiv.org/abs/2502.11400)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) systems often suffer from performance degradation when encountering noisy or irrelevant documents, driving researchers to develop sophisticated training strategies to enhance their robustness against such retrieval noise. However, as large language models (LLMs) continue to advance, the necessity of these complex training methods is increasingly questioned. In this paper, we systematically investigate whether complex robust training strategies remain necessary as model capacity grows. Through comprehensive experiments spanning multiple model architectures and parameter scales, we evaluate various document selection methods and adversarial training techniques across diverse datasets. Our extensive experiments consistently demonstrate that as models become more powerful, the performance gains brought by complex robust training methods drop off dramatically. We delve into the rationale and find that more powerful models inherently exhibit superior confidence calibration, better generalization across datasets (even when trained with randomly selected documents), and optimal attention mechanisms learned with simpler strategies. Our findings suggest that RAG systems can benefit from simpler architectures and training strategies as models become more powerful, enabling more scalable applications with minimal complexity.

Title: Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment

Authors: Jingcheng Deng, Zhongtao Jiang, Liang Pang, Liwei Chen, Kun Xu, Zihao Wei, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11401
Pdf URL: https://arxiv.org/pdf/2502.11401
Copy Paste: [[2502.11401]] Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment(https://arxiv.org/abs/2502.11401)
Keywords: llm
Abstract: A new trend uses LLMs as dense text encoders via contrastive learning. However, since LLM embeddings predict the probability distribution of the next token, they are inherently generative and distributive, conflicting with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. This discrepancy hinders the full utilization of LLMs' pre-training capabilities, resulting in inefficient learning. In response to this issue, we propose AutoRegEmbed, a new contrastive learning method built on embedding conditional probability distributions, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task aligns text embeddings with positive-sample embeddings by leveraging the conditional distribution of embeddings, while simultaneously reducing the likelihood of generating negative samples from text embeddings, thereby achieving embedding alignment and uniformity. Experimental results demonstrate that our method significantly outperforms traditional contrastive learning approaches and achieves performance comparable to state-of-the-art models when using the same amount of data.
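Note: for reference, the traditional contrastive objective that the abstract contrasts against can be sketched as an InfoNCE loss over cosine similarities. This is the baseline formulation, not AutoRegEmbed itself; the temperature of 0.05 and the toy vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss: low when the anchor is closer to the positive than to every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]

# Well-aligned case: the positive points in nearly the same direction as the anchor.
good_loss = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])

# Misaligned case: a negative is closer to the anchor than the positive.
bad_loss = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1]])

print(good_loss, bad_loss)
```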

Title: ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models

Authors: Hanxing Ding, Shuchang Tao, Liang Pang, Zihao Wei, Jinyang Gao, Bolin Ding, Huawei Shen, Xueqi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11404
Pdf URL: https://arxiv.org/pdf/2502.11404
Copy Paste: [[2502.11404]] ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models(https://arxiv.org/abs/2502.11404)
Keywords: language model, llm, prompt
Abstract: Tool learning has emerged as a crucial capability for large language models (LLMs) to solve complex real-world tasks through interaction with external tools. Existing approaches face significant challenges, including reliance on hand-crafted prompts, difficulty in multi-step planning, and lack of precise error diagnosis and reflection mechanisms. We propose ToolCoder, a novel framework that reformulates tool learning as a code generation task. Inspired by software engineering principles, ToolCoder transforms natural language queries into a structured Python function scaffold and systematically breaks down tasks with descriptive comments, enabling LLMs to leverage coding paradigms for complex reasoning and planning. It then generates and executes function implementations to obtain final responses. Additionally, ToolCoder stores successfully executed functions in a repository to promote code reuse, while leveraging error traceback mechanisms for systematic debugging, optimizing both execution efficiency and robustness. Experiments demonstrate that ToolCoder achieves superior performance in task completion accuracy and execution reliability compared to existing approaches, establishing the effectiveness of code-centric approaches in tool learning.

Title: LayAlign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy

Authors: Zhiwen Ruan, Yixia Li, He Zhu, Longyue Wang, Weihua Luo, Kaifu Zhang, Yun Chen, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11405
Pdf URL: https://arxiv.org/pdf/2502.11405
Copy Paste: [[2502.11405]] LayAlign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy(https://arxiv.org/abs/2502.11405)
Keywords: language model, llm
Abstract: Despite being pretrained on multilingual corpora, large language models (LLMs) exhibit suboptimal performance on low-resource languages. Recent approaches have leveraged multilingual encoders alongside LLMs by introducing trainable parameters connecting the two models. However, these methods typically focus on the encoder's output, overlooking valuable information from other layers. We propose Layer-Wise Adaptive Fusion and Alignment Strategy (LayAlign), a framework that integrates representations from all encoder layers, coupled with the \attaname mechanism to enable layer-wise interaction between the LLM and the multilingual encoder. Extensive experiments on multilingual reasoning tasks, along with analyses of learned representations, show that our approach consistently outperforms existing baselines.

Title: InsBank: Evolving Instruction Subset for Ongoing Alignment

Authors: Jiayi Shi, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Huan Ren, Yao Hu, Kan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11419
Pdf URL: https://arxiv.org/pdf/2502.11419
Copy Paste: [[2502.11419]] InsBank: Evolving Instruction Subset for Ongoing Alignment(https://arxiv.org/abs/2502.11419)
Keywords: language model, llm
Abstract: Large language models (LLMs) typically undergo instruction tuning to enhance alignment. Recent studies emphasize that quality and diversity of instruction data are more crucial than quantity, highlighting the need to select diverse, high-quality subsets to reduce training costs. However, how to evolve these selected subsets alongside the development of new instruction data remains insufficiently explored. To achieve LLMs' ongoing alignment, we introduce Instruction Bank (InsBank), a continuously updated repository that integrates the latest valuable instruction data. We further propose Progressive Instruction Bank Evolution (PIBE), a novel framework designed to evolve InsBank effectively and efficiently over time. PIBE employs a gradual data selection strategy to maintain long-term efficiency, leveraging a representation-based diversity score to capture relationships between data points and retain historical information for comprehensive diversity evaluation. This also allows for flexible combination of diversity and quality scores during data selection and ranking. Extensive experiments demonstrate that PIBE significantly outperforms baselines in InsBank evolution and is able to extract budget-specific subsets, demonstrating its effectiveness and adaptability.

Title: Exploring Persona Sentiment Sensitivity in Personalized Dialogue Generation

Authors: YongHyun Jun, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11423
Pdf URL: https://arxiv.org/pdf/2502.11423
Copy Paste: [[2502.11423]] Exploring Persona Sentiment Sensitivity in Personalized Dialogue Generation(https://arxiv.org/abs/2502.11423)
Keywords: language model, llm
Abstract: Personalized dialogue systems have advanced considerably with the integration of user-specific personas into large language models (LLMs). However, while LLMs can effectively generate personalized responses, the influence of persona sentiment on dialogue quality remains underexplored. In this work, we conduct a large-scale analysis of dialogues generated using a range of polarized user profiles. Our experiments reveal that dialogues involving negatively polarized users tend to overemphasize persona attributes, leading to increased entailment and contradiction instances and lower overall coherence. In contrast, positively polarized profiles yield dialogues that selectively incorporate persona information, resulting in smoother and more coherent interactions. Furthermore, we find that personas with weak or neutral sentiment generally produce lower-quality dialogues. Motivated by these findings, we propose a dialogue generation approach that explicitly accounts for persona polarity by combining a turn-based generation strategy with a profile ordering mechanism. Our study provides new insights into the sensitivity of LLMs to persona sentiment and offers guidance for developing more robust and nuanced personalized dialogue systems.

Title: Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models

Authors: Jongho Kim, Seung-won Hwang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11425
Pdf URL: https://arxiv.org/pdf/2502.11425
Copy Paste: [[2502.11425]] Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models(https://arxiv.org/abs/2502.11425)
Keywords: language model, llm, prompt
Abstract: Despite the advanced capabilities of large language models (LLMs), their temporal reasoning ability remains underdeveloped. Prior works have highlighted this limitation, particularly in maintaining temporal consistency when understanding events. For example, models often confuse mutually exclusive temporal relations like "before" and "after" between events and make inconsistent predictions. In this work, we tackle the issue of temporal inconsistency in LLMs by proposing a novel counterfactual prompting approach. Our method generates counterfactual questions and enforces collective constraints, enhancing the model's consistency. We evaluate our method on multiple datasets, demonstrating significant improvements in event ordering for explicit and implicit events and temporal commonsense understanding by effectively addressing temporal inconsistencies.

Title: Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models

Authors: Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.11427
Pdf URL: https://arxiv.org/pdf/2502.11427
Copy Paste: [[2502.11427]] Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models(https://arxiv.org/abs/2502.11427)
Keywords: language model, llm
Abstract: Visual instruction tuning has become the predominant technology in eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite its success, because visual instructions require images as input, this approach leaves a gap in inheriting the task-solving capabilities of the backbone LLMs and makes it costly to collect a large-scale dataset. To address this, we propose ViFT, a visual instruction-free fine-tuning framework for LVLMs. In ViFT, we require only text-only instructions and image caption data during training, to separately learn the task-solving and visual perception abilities. During inference, we extract and combine the representations of the text and image inputs, fusing the two abilities to fulfill multimodal tasks. Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and visual instruction following benchmarks, with considerably less training data. Our code and data will be publicly released.

Title: SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL

Authors: Jimin Lee, Ingeol Baek, Byeongjeong Kim, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11438
Pdf URL: https://arxiv.org/pdf/2502.11438
Copy Paste: [[2502.11438]] SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL(https://arxiv.org/abs/2502.11438)
Keywords: language model, llm, prompt
Abstract: Text-to-SQL aims to convert natural language questions into executable SQL queries. While previous approaches, such as skeleton-masked selection, have demonstrated strong performance by retrieving similar training examples to guide large language models (LLMs), they struggle in real-world scenarios where such examples are unavailable. To overcome this limitation, we propose Self-Augmentation in-context learning with Fine-grained Example selection for Text-to-SQL (SAFE-SQL), a novel framework that improves SQL generation by generating and filtering self-augmented examples. SAFE-SQL first prompts an LLM to generate multiple Text-to-SQL examples relevant to the test input. Then SAFE-SQL filters these examples through three relevance assessments, constructing high-quality in-context learning examples. Using self-generated examples, SAFE-SQL surpasses previous zero-shot and few-shot Text-to-SQL frameworks, achieving higher execution accuracy. Notably, our approach provides additional performance gains in extra hard and unseen scenarios, where conventional methods often fail.

Title: An Efficient Row-Based Sparse Fine-Tuning

Authors: Cen-Jhih Li, Aditya Bhaskara
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11439
Pdf URL: https://arxiv.org/pdf/2502.11439
Copy Paste: [[2502.11439]] An Efficient Row-Based Sparse Fine-Tuning(https://arxiv.org/abs/2502.11439)
Keywords: language model
Abstract: Fine-tuning is an important step in adapting foundation models such as large language models to downstream tasks. To make this step more accessible to users with limited computational budgets, it is crucial to develop fine-tuning methods that are memory and computationally efficient. Sparse Fine-tuning (SFT) and Low-rank adaptation (LoRA) are two frameworks that have emerged for addressing this problem and have been adopted widely in practice. In this work, we develop a new SFT framework, based on ideas from neural network pruning. At a high level, we first identify "important" neurons/nodes using feature importance metrics from network pruning (specifically, we use the structural pruning method), and then perform fine-tuning by restricting to weights involving these neurons. Using experiments on common language tasks, we demonstrate that our method significantly improves the memory efficiency of SFT without increasing training time complexity and implementation complexity, while achieving accuracy comparable to state-of-the-art methods such as LoRA and its variants.
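The two steps named in the abstract above (score neurons, then restrict fine-tuning to the weights of the selected neurons) can be illustrated with a minimal sketch. This is not the authors' implementation: the row L2-norm score is a stand-in for the pruning-based importance metric, and the names `row_importance`, `select_rows`, and `sparse_row_update` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def row_importance(weights: Matrix) -> List[float]:
    """Assumed importance score: L2 norm of each row (one row per output neuron)."""
    return [sum(w * w for w in row) ** 0.5 for row in weights]


def select_rows(weights: Matrix, k: int) -> List[int]:
    """Return the indices of the k highest-scoring rows, in ascending index order."""
    scores = row_importance(weights)
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_row_update(weights: Matrix, grads: Matrix, rows: List[int], lr: float) -> Matrix:
    """Apply one SGD step to the selected rows only; every other row stays frozen."""
    selected = set(rows)
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if i in selected else list(w_row)
        for i, (w_row, g_row) in enumerate(zip(weights, grads))
    ]


# Toy 4x2 weight matrix and a constant gradient, purely for illustration.
W = [[0.1, 0.1], [2.0, -1.0], [0.0, 0.5], [1.0, 1.0]]
G = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
rows = select_rows(W, k=2)
W_new = sparse_row_update(W, G, rows, lr=0.5)
```

Only the selected rows need gradients and optimizer state, which is where a row-based scheme could save memory relative to updating the full matrix.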

Title: Which Retain Set Matters for LLM Unlearning? A Case Study on Entity Unlearning

Authors: Hwan Chang, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11441
Pdf URL: https://arxiv.org/pdf/2502.11441
Copy Paste: [[2502.11441]] Which Retain Set Matters for LLM Unlearning? A Case Study on Entity Unlearning(https://arxiv.org/abs/2502.11441)
Keywords: language model, llm
Abstract: Large language models (LLMs) risk retaining unauthorized or sensitive information from their training data, which raises privacy concerns. LLM unlearning seeks to mitigate these risks by selectively removing specified data while maintaining overall model performance. However, most existing work focuses on methods to achieve effective forgetting and does not provide a detailed analysis of the retain set, the portion of training data that is not targeted for removal. In this paper, we investigate the effects of unlearning on various subsets of the retain set through a case study on entity unlearning. We introduce the Syntactically Similar Neighbor Set, a group of queries that share similar syntactic structures with the data targeted for removal, and show that this subset suffers the greatest performance drop during unlearning. Moreover, when used for regularization, this set not only preserves performance on syntactically similar queries but also delivers comparable or improved results across other data subsets. Our results highlight that syntactic similarity is a critical factor, potentially more so than domain or entity relationships, in achieving effective and practical LLM unlearning.

Title: Does RAG Really Perform Bad For Long-Context Processing?

Authors: Kun Luo, Zheng Liu, Peitian Zhang, Hongjin Qian, Jun Zhao, Kang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11444
Pdf URL: https://arxiv.org/pdf/2502.11444
Copy Paste: [[2502.11444]] Does RAG Really Perform Bad For Long-Context Processing?(https://arxiv.org/abs/2502.11444)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: The efficient processing of long context poses a serious challenge for large language models (LLMs). Recently, retrieval-augmented generation (RAG) has emerged as a promising strategy for this problem, as it enables LLMs to make selective use of the long context for efficient computation. However, existing RAG approaches lag behind other long-context processing methods due to inherent limitations on inaccurate retrieval and fragmented contexts. To address these challenges, we introduce RetroLM, a novel RAG framework for long-context processing. Unlike traditional methods, RetroLM employs KV-level retrieval augmentation, where it partitions the LLM's KV cache into contiguous pages and retrieves the most crucial ones for efficient computation. This approach enhances robustness to retrieval inaccuracy, facilitates effective utilization of fragmented contexts, and saves the cost from repeated computation. Building on this framework, we further develop a specialized retriever for precise retrieval of critical pages and conduct unsupervised post-training to optimize the model's ability to leverage retrieved information. We conduct comprehensive evaluations with a variety of benchmarks, including LongBench, InfiniteBench, and RULER, where RetroLM significantly outperforms existing long-context LLMs and efficient long-context processing methods, particularly in tasks requiring intensive reasoning or extremely long-context comprehension.

Title: From Personas to Talks: Revisiting the Impact of Personas on LLM-Synthesized Emotional Support Conversations

Authors: Shenghan Wu, Yang Deng, Yimo Zhu, Wynne Hsu, Mong Li Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11451
Pdf URL: https://arxiv.org/pdf/2502.11451
Copy Paste: [[2502.11451]] From Personas to Talks: Revisiting the Impact of Personas on LLM-Synthesized Emotional Support Conversations(https://arxiv.org/abs/2502.11451)
Keywords: language model, llm
Abstract: The rapid advancement of Large Language Models (LLMs) has revolutionized the generation of emotional support conversations (ESC), offering scalable solutions with reduced costs and enhanced data privacy. This paper explores the role of personas in the creation of ESC by LLMs. Our research utilizes established psychological frameworks to measure and infuse persona traits into LLMs, which then generate dialogues in the emotional support scenario. We conduct extensive evaluations to understand the stability of persona traits in dialogues, examining shifts in traits post-generation and their impact on dialogue quality and strategy distribution. Experimental results reveal several notable findings: 1) LLMs can infer core persona traits, 2) subtle shifts in emotionality and extraversion occur, influencing the dialogue dynamics, and 3) the application of persona traits modifies the distribution of emotional support strategies, enhancing the relevance and empathetic quality of the responses. These findings highlight the potential of persona-driven LLMs in crafting more personalized, empathetic, and effective emotional support dialogues, which has significant implications for the future design of AI-driven emotional support systems.

Title: UniCBE: An Uniformity-driven Comparing Based Evaluation Framework with Unified Multi-Objective Optimization

Authors: Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11454
Pdf URL: https://arxiv.org/pdf/2502.11454
Copy Paste: [[2502.11454]] UniCBE: An Uniformity-driven Comparing Based Evaluation Framework with Unified Multi-Objective Optimization(https://arxiv.org/abs/2502.11454)
Keywords: language model
Abstract: Human preference plays a significant role in measuring large language models and guiding them to align with human values. Unfortunately, current comparing-based evaluation (CBE) methods typically focus on a single optimization objective, failing to effectively utilize scarce yet valuable preference signals. To address this, we delve into key factors that can enhance the accuracy, convergence, and scalability of CBE: suppressing sampling bias, balancing the descending process of uncertainty, and mitigating updating uncertainty. Following the derived guidelines, we propose UniCBE, a unified uniformity-driven CBE framework which simultaneously optimizes these core objectives by constructing and integrating three decoupled sampling probability matrices, each designed to ensure uniformity in specific aspects. We further ablate the optimal tuple sampling and preference aggregation strategies to achieve efficient CBE. On the AlpacaEval benchmark, UniCBE saves over 17% of evaluation budgets while achieving a Pearson correlation with ground truth exceeding 0.995, demonstrating excellent accuracy and convergence. In scenarios where new models are continuously introduced, UniCBE can even save over 50% of evaluation costs, highlighting its improved scalability.

Title: Aligning Sentence Simplification with ESL Learner's Proficiency for Language Acquisition

Authors: Guanlin Li, Yuki Arase, Noel Crespi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11457
Pdf URL: https://arxiv.org/pdf/2502.11457
Copy Paste: [[2502.11457]] Aligning Sentence Simplification with ESL Learner's Proficiency for Language Acquisition(https://arxiv.org/abs/2502.11457)
Keywords: language model
Abstract: Text simplification is crucial for improving accessibility and comprehension for English as a Second Language (ESL) learners. This study goes a step further and aims to facilitate ESL learners' language acquisition by simplification. Specifically, we propose simplifying complex sentences to appropriate levels for learners while also increasing vocabulary coverage of the target level in the simplifications. We achieve this without a parallel corpus by conducting reinforcement learning on a large language model. Our method employs token-level and sentence-level rewards, and iteratively trains the model on its self-generated outputs to guide the model to search for simplification hypotheses that satisfy the target attributes. Experiment results on CEFR-SP and TurkCorpus datasets show that the proposed method can effectively increase the frequency and diversity of vocabulary of the target level by more than 20% compared to baseline models, while maintaining high simplification quality.

Title: UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance

Authors: Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2502.11460
Pdf URL: https://arxiv.org/pdf/2502.11460
Copy Paste: [[2502.11460]] UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance(https://arxiv.org/abs/2502.11460)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale pre-training data and (ii) synthesizing instruction data through prompt engineering with powerful models. While pre-training data faces quality consistency issues, instruction-based synthesis suffers from limited instruction diversity and inherent biases of LLMs. To address this gap, we introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to both guide and validate the code generation process. Combined with large-scale package-based retrieval from pre-training corpus, we generate a dataset of 500K+ verifiable programs containing diverse API calls. Evaluations on multiple Python benchmarks (BigCodeBench, HumanEval, MBPP) demonstrate that models fine-tuned on our synthetic data exhibit consistent performance improvements. Notably, Llama3.1-8B and InternLM2.5-7B improve from 31% and 28% to 40% and 39% success rates on BigCodeBench, respectively. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released.

Title: GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion

Authors: Kangyang Luo, Yuzhuo Bai, Cheng Gao, Shuzheng Si, Yingli Shen, Zhu Liu, Zhitong Wang, Cunliang Kong, Wenhao Li, Yufei Huang, Ye Tian, Xuantang Xiong, Lei Han, Maosong Sun
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.11471
Pdf URL: https://arxiv.org/pdf/2502.11471
Copy Paste: [[2502.11471]] GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion(https://arxiv.org/abs/2502.11471)
Keywords: language model, llm, prompt
Abstract: Knowledge Graph Completion (KGC), which aims to infer missing or incomplete facts, is a crucial task for KGs. However, integrating the vital structural information of KGs into Large Language Models (LLMs) and outputting predictions deterministically remains challenging. To address this, we propose a new method called GLTW, which encodes the structural information of KGs and merges it with LLMs to enhance KGC performance. Specifically, we introduce an improved Graph Transformer (iGT) that effectively encodes subgraphs with both local and global structural information and inherits the characteristics of language models, bypassing training from scratch. Also, we develop a subgraph-based multi-classification training objective, using all entities within the KG as classification objects, to boost learning. We combine iGT with an LLM that takes KG language prompts as input. Extensive experiments on various KG datasets show that GLTW achieves significant performance gains compared to SOTA baselines.

Title: FastMCTS: A Simple Sampling Strategy for Data Synthesis

Authors: Peiji Li, Kai Lv, Yunfan Shao, Yichuan Ma, Linyang Li, Xiaoqing Zheng, Xipeng Qiu, Qipeng Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11476
Pdf URL: https://arxiv.org/pdf/2502.11476
Copy Paste: [[2502.11476]] FastMCTS: A Simple Sampling Strategy for Data Synthesis(https://arxiv.org/abs/2502.11476)
Keywords: language model
Abstract: Synthetic high-quality multi-step reasoning data can significantly enhance the performance of large language models on various tasks. However, most existing methods rely on rejection sampling, which generates trajectories independently and suffers from inefficiency and imbalanced sampling across problems of varying difficulty. In this work, we introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search. FastMCTS provides a more efficient sampling method for multi-step reasoning data, offering step-level evaluation signals and promoting balanced sampling across problems of different difficulty levels. Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30% more correct reasoning paths compared to rejection sampling as the number of generated tokens scales up. Furthermore, under comparable synthetic data budgets, models trained on FastMCTS-generated data outperform those trained on rejection sampling data by 3.9% across multiple benchmarks. As a lightweight sampling strategy, FastMCTS offers a practical and efficient alternative for synthesizing high-quality reasoning data. Our code will be released soon.

Title: Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering

Authors: Runxuan Liu, Bei Luo, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11491
Pdf URL: https://arxiv.org/pdf/2502.11491
Copy Paste: [[2502.11491]] Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering(https://arxiv.org/abs/2502.11491)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable capabilities in natural language processing. However, in knowledge graph question answering tasks (KGQA), there remains the issue of answering questions that require multi-hop reasoning. Existing methods rely on entity vector matching, but the purpose of the question is abstract and difficult to match with specific entities. As a result, it is difficult to establish reasoning paths to the purpose, which leads to information loss and redundancy. To address this issue, inspired by human reverse thinking, we propose Ontology-Guided Reverse Thinking (ORT), a novel framework that constructs reasoning paths from purposes back to conditions. ORT operates in three key phases: (1) using LLM to extract purpose labels and condition labels, (2) constructing label reasoning paths based on the KG ontology, and (3) using the label reasoning paths to guide knowledge retrieval. Experiments on the WebQSP and CWQ datasets show that ORT achieves state-of-the-art performance and significantly enhances the capability of LLMs for KGQA.

Title: DAST: Context-Aware Compression in LLMs via Dynamic Allocation of Soft Tokens

Authors: Shaoshen Chen, Yangning Li, Zishan Xu, Yinghui Li, Xin Su, Zifei Shan, Hai-tao Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11493
Pdf URL: https://arxiv.org/pdf/2502.11493
Copy Paste: [[2502.11493]] DAST: Context-Aware Compression in LLMs via Dynamic Allocation of Soft Tokens(https://arxiv.org/abs/2502.11493)
Keywords: language model, llm, long context, prompt
Abstract: Large Language Models (LLMs) face computational inefficiencies and redundant processing when handling long context inputs, prompting a focus on compression techniques. While existing semantic vector-based compression methods achieve promising performance, these methods fail to account for the intrinsic information density variations between context chunks, instead allocating soft tokens uniformly across context chunks. This uniform distribution inevitably diminishes allocation to information-critical regions. To address this, we propose Dynamic Allocation of Soft Tokens (DAST), a simple yet effective method that leverages the LLM's intrinsic understanding of contextual relevance to guide compression. DAST combines perplexity-based local information with attention-driven global information to dynamically allocate soft tokens to the information-rich chunks, enabling effective, context-aware compression. Experimental results across multiple benchmarks demonstrate that DAST surpasses state-of-the-art methods.
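Illustration (not from the paper): the abstract describes blending a local, perplexity-based signal with a global, attention-based signal to decide how many soft tokens each context chunk receives. A minimal sketch of that kind of budgeted allocation might look like the following. The function name `allocate_soft_tokens`, the mixing weight `alpha`, the assumption that a higher score means a more informative chunk, and the largest-remainder rounding are all hypothetical choices made here for illustration; the paper's actual scoring and allocation rule may differ.

```python
def allocate_soft_tokens(local_scores, global_scores, budget, alpha=0.5):
    """Split `budget` soft tokens across chunks by a blended importance score.

    local_scores  -- hypothetical per-chunk scores from a perplexity-based signal
    global_scores -- hypothetical per-chunk scores from an attention-based signal
    alpha         -- assumed mixing weight between the two signals
    """
    if not local_scores or len(local_scores) != len(global_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    def normalize(xs):
        total = sum(xs)
        # Fall back to a uniform split if every score is zero.
        return [x / total for x in xs] if total > 0 else [1.0 / len(xs)] * len(xs)

    local = normalize(local_scores)
    glob = normalize(global_scores)
    weights = [alpha * l + (1 - alpha) * g for l, g in zip(local, glob)]

    # Largest-remainder rounding so the integer allocations sum to the budget.
    raw = [w * budget for w in weights]
    alloc = [int(r) for r in raw]
    leftover = budget - sum(alloc)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc
```

With equal scores every chunk gets the same share, which corresponds to the uniform allocation the abstract criticizes; skewed scores shift tokens toward the higher-scoring chunks.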

Title: Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More

Authors: Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.11494
Pdf URL: https://arxiv.org/pdf/2502.11494
Copy Paste: [[2502.11494]] Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More(https://arxiv.org/abs/2502.11494)
Keywords: language model
Abstract: Vision tokens in multimodal large language models often dominate the computational overhead due to their excessive length compared to the linguistic modality. Many recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator for deciding whether a token should be pruned. Surprisingly, it usually results in worse performance than random token pruning and leads to incompatibility with efficient attention computation. We propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, leading to 1.99× and 2.99× speed-ups in total time and the prefilling stage, respectively, with good compatibility with efficient attention operators. Our codes are available at this https URL.
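Illustration (not from the paper): the abstract's two steps, choosing pivot tokens and then keeping the tokens least duplicated with respect to those pivots, can be sketched as below. This is only a toy version under stated assumptions: duplication is measured here as the maximum cosine similarity to any pivot, the pivot indices are simply passed in by the caller, and the names `cosine` and `prune_by_duplication` are hypothetical. How DART actually selects pivots and measures duplication is not specified in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prune_by_duplication(tokens, pivot_idx, num_keep):
    """Return sorted indices of retained tokens: all pivots plus the
    `num_keep` non-pivot tokens with the lowest duplication score."""
    if not pivot_idx:
        raise ValueError("at least one pivot index is required")
    pivots = set(pivot_idx)
    scored = []
    for i, tok in enumerate(tokens):
        if i in pivots:
            continue
        # Assumed duplication measure: highest similarity to any pivot.
        duplication = max(cosine(tok, tokens[p]) for p in pivot_idx)
        scored.append((duplication, i))
    scored.sort()  # least duplicated first
    kept = {i for _, i in scored[:num_keep]}
    return sorted(pivots | kept)
```

For example, with tokens `[[1, 0], [0.99, 0.1], [0, 1], [1, 0.05]]`, pivot index `0`, and `num_keep=1`, the token `[0, 1]` is retained because it is the least similar to the pivot, while the two near-copies of the pivot are pruned.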

Title: Balanced Multi-Factor In-Context Learning for Multilingual Large Language Models

Authors: Masahiro Kaneko, Alham Fikri Aji, Timothy Baldwin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11495
Pdf URL: https://arxiv.org/pdf/2502.11495
Copy Paste: [[2502.11495]] Balanced Multi-Factor In-Context Learning for Multilingual Large Language Models(https://arxiv.org/abs/2502.11495)
Keywords: language model, llm
Abstract: Multilingual large language models (MLLMs) can use in-context learning (ICL) to achieve high performance by leveraging cross-lingual knowledge transfer without parameter updates. However, their effectiveness is highly sensitive to example selection, particularly in multilingual settings. Based on the findings of existing work, three key factors influence multilingual ICL: (1) semantic similarity, (2) linguistic alignment, and (3) language-specific performance. However, existing approaches address these factors independently, without explicitly disentangling their combined impact, leaving optimal example selection underexplored. To address this gap, we propose balanced multi-factor ICL (BMF-ICL), a method that quantifies and optimally balances these factors for improved example selection. Experiments on mCSQA and TYDI across four MLLMs demonstrate that BMF-ICL outperforms existing methods. Further analysis highlights the importance of incorporating all three factors and the importance of selecting examples from multiple languages.

Title: Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?

Authors: Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.11501
Pdf URL: https://arxiv.org/pdf/2502.11501
Copy Paste: [[2502.11501]] Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?(https://arxiv.org/abs/2502.11501)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, many works have been proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs, leading to significant acceleration without training. While these methods claim efficiency gains, critical questions about their fundamental design and evaluation remain unanswered: Why do many existing approaches underperform even compared to naive random token selection? Is attention-based scoring sufficient for reliably identifying redundant tokens? Is language information really helpful during token pruning? What makes a good trade-off between token importance and duplication? Are current evaluation protocols comprehensive and unbiased? The neglect of these problems in previous research hinders the long-term development of token pruning. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods.

Title: Chinese Spelling Correction: A Comprehensive Survey of Progress, Challenges, and Opportunities

Authors: Changchun Liu, Kai Zhang, Junzhe Jiang, Zixiao Kong, Qi Liu, Enhong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11508
Pdf URL: https://arxiv.org/pdf/2502.11508
Copy Paste: [[2502.11508]] Chinese Spelling Correction: A Comprehensive Survey of Progress, Challenges, and Opportunities(https://arxiv.org/abs/2502.11508)
Keywords: language model, llm
Abstract: Chinese Spelling Correction (CSC) is a critical task in natural language processing, aimed at detecting and correcting spelling errors in Chinese text. This survey provides a comprehensive overview of CSC, tracing its evolution from pre-trained language models to large language models, and critically analyzing their respective strengths and weaknesses in this domain. We further present a detailed examination of existing benchmark datasets, highlighting their inherent challenges and limitations. Finally, we propose promising future research directions, particularly focusing on leveraging the potential of LLMs and their reasoning capabilities for improved CSC performance. To the best of our knowledge, this is the first comprehensive survey dedicated to the field of CSC. We believe this work will serve as a valuable resource for researchers, fostering a deeper understanding of the field and inspiring future advancements.

Title: Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study

Authors: Yujie Lin, Ante Wang, Moye Chen, Jingyao Liu, Hao Liu, Jinsong Su, Xinyan Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11514
Pdf URL: https://arxiv.org/pdf/2502.11514
Copy Paste: [[2502.11514]] Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study(https://arxiv.org/abs/2502.11514)
Keywords: chain-of-thought
Abstract: Recently, inference-time scaling of chain-of-thought (CoT) has been demonstrated as a promising approach for addressing multi-modal reasoning tasks. While existing studies have predominantly centered on text-based thinking, the integration of both visual and textual modalities within the reasoning process remains unexplored. In this study, we pioneer the exploration of inference-time scaling with multi-modal thought, aiming to bridge this gap. To provide a comprehensive analysis, we systematically investigate popular sampling-based and tree search-based inference-time scaling methods on 10 challenging tasks spanning various domains. Besides, we uniformly adopt a consistency-enhanced verifier to ensure effective guidance for both methods across different thought paradigms. Results show that multi-modal thought promotes better performance against conventional text-only thought, and blending the two types of thought fosters more diverse thinking. Despite these advantages, multi-modal thoughts necessitate higher token consumption for processing richer visual inputs, which raises concerns in practical applications. We hope that our findings on the merits and drawbacks of this research line will inspire future works in the field.

Title: Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

Authors: Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin
Subjects: cs.CL, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11517
Pdf URL: https://arxiv.org/pdf/2502.11517
Copy Paste: [[2502.11517]] Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding(https://arxiv.org/abs/2502.11517)
Keywords: language model, llm
Abstract: Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on-the-fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction-following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21x to 1.93x with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against a sequential decoding baseline.

Title: AURORA:Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification

Authors: Xiaoyu Tan, Tianchu Yao, Chao Qu, Bin Li, Minghao Yang, Dakuan Lu, Haozhe Wang, Xihe Qiu, Wei Chu, Yinghui Xu, Yuan Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11520
Pdf URL: https://arxiv.org/pdf/2502.11520
Copy Paste: [[2502.11520]] AURORA:Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification(https://arxiv.org/abs/2502.11520)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: The reasoning capabilities of advanced large language models (LLMs) like o1 have revolutionized artificial intelligence applications. Nevertheless, evaluating and optimizing complex reasoning processes remain significant challenges due to diverse policy distributions and the inherent limitations of human effort and accuracy. In this paper, we present AURORA, a novel automated framework for training universal process reward models (PRMs) using ensemble prompting and reverse verification. The framework employs a two-phase approach: First, it uses diverse prompting strategies and ensemble methods to perform automated annotation and evaluation of processes, ensuring robust assessments for reward learning. Second, it leverages practical reference answers for reverse verification, enhancing the model's ability to validate outputs and improving training accuracy. To assess the framework's performance, we extend beyond the existing ProcessBench benchmark by introducing UniversalBench, which evaluates reward predictions across full trajectories under diverse policy distributions with long Chain-of-Thought (CoT) outputs. Experimental results demonstrate that AURORA enhances process evaluation accuracy and improves PRMs' accuracy for diverse policy distributions and long-CoT responses. The project will be open-sourced at this https URL. The Universal-PRM-7B is available at this https URL.

Title: Training Large Language Models to be Better Rule Followers

Authors: Yi Hu, Shijia Kang, Haotong Yang, Haotian Xu, Muhan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11525
Pdf URL: https://arxiv.org/pdf/2502.11525
Copy Paste: [[2502.11525]] Training Large Language Models to be Better Rule Followers(https://arxiv.org/abs/2502.11525)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown impressive performance across a wide range of tasks. However, they often exhibit unexpected failures in seemingly straightforward tasks, suggesting a reliance on case-based reasoning rather than rule-based reasoning. While the vast training corpus of LLMs contains numerous textual "rules", current training methods fail to leverage these rules effectively. Crucially, the relationships between these "rules" and their corresponding "instances" are not explicitly modeled. As a result, while LLMs can often recall rules with ease, they fail to apply these rules strictly and consistently in relevant reasoning scenarios. In this paper, we investigate the rule-following capabilities of LLMs and propose Meta Rule-Following Fine-Tuning (Meta-RFFT) to enhance the cross-task transferability of rule-following abilities. We first construct a dataset of 88 tasks requiring following rules, encompassing diverse reasoning domains. We demonstrate through extensive experiments that models trained on large-scale rule-following tasks are better rule followers, outperforming the baselines in both downstream fine-tuning and few-shot prompting scenarios. This highlights the cross-task transferability of models with the aid of Meta-RFFT. Furthermore, we examine the influence of factors such as dataset size, rule formulation, and in-context learning.

Title: Be Cautious When Merging Unfamiliar LLMs: A Phishing Model Capable of Stealing Privacy

Authors: Zhenyuan Guo, Yi Shi, Wenlong Meng, Chen Gong, Chengkun Wei, Wenzhi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11533
Pdf URL: https://arxiv.org/pdf/2502.11533
Copy Paste: [[2502.11533]] Be Cautious When Merging Unfamiliar LLMs: A Phishing Model Capable of Stealing Privacy(https://arxiv.org/abs/2502.11533)
Keywords: language model, llm
Abstract: Model merging is a widespread technology in large language models (LLMs) that integrates multiple task-specific LLMs into a unified one, enabling the merged model to inherit the specialized capabilities of these LLMs. Most task-specific LLMs are sourced from open-source communities and have not undergone rigorous auditing, potentially imposing risks in model merging. This paper highlights an overlooked privacy risk: an unsafe model could compromise the privacy of other LLMs involved in the model merging. Specifically, we propose PhiMM, a privacy attack approach that trains a phishing model capable of stealing privacy using a crafted privacy phishing instruction dataset. Furthermore, we introduce a novel model cloaking method that mimics a specialized capability to conceal attack intent, luring users into merging the phishing model. Once victims merge the phishing model, the attacker can extract personally identifiable information (PII) or infer membership information (MI) by querying the merged model with the phishing instruction. Experimental results show that merging a phishing model increases the risk of privacy breaches. Compared to the results before merging, PII leakage increased by 3.9% and MI leakage increased by 17.4% on average. We release the code of PhiMM through a link.

Title: MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training

Authors: Hui Huang, Jiaheng Liu, Yancheng He, Shilong Li, Bing Xu, Conghui Zhu, Muyun Yang, Tiejun Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11541
Pdf URL: https://arxiv.org/pdf/2502.11541
Copy Paste: [[2502.11541]] MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training(https://arxiv.org/abs/2502.11541)
Keywords: language model, gpt, llm
Abstract: Complex instruction-following with elaborate constraints is imperative for Large Language Models (LLMs). While existing methods have constructed data for complex instruction alignment, they all rely on a more advanced model, especially GPT-4, limiting their application. In this paper, we propose a Multi-granularity Self-Contrastive Training (MuSC) framework to improve complex instruction alignment without relying on a stronger model. Our method operates at both coarse and fine granularity. At coarse granularity, we construct constraint-aware preference data based on instruction decomposition and recombination. At fine granularity, we perform token-aware preference optimization with dynamic token-level supervision. Our method is evaluated on open-source models, and experimental results show that it achieves significant improvement on both complex and general instruction-following benchmarks, surpassing previous self-alignment methods.

Title: Evaluating o1-Like LLMs: Unlocking Reasoning for Translation through Comprehensive Analysis

Authors: Andong Chen, Yuchen Song, Wenxin Zhu, Kehai Chen, Muyun Yang, Tiejun Zhao, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11544
Pdf URL: https://arxiv.org/pdf/2502.11544
Copy Paste: [[2502.11544]] Evaluating o1-Like LLMs: Unlocking Reasoning for Translation through Comprehensive Analysis(https://arxiv.org/abs/2502.11544)
Keywords: gpt, llm, chat
Abstract: The o1-Like LLMs are transforming AI by simulating human cognitive processes, but their performance in multilingual machine translation (MMT) remains underexplored. This study examines: (1) how o1-Like LLMs perform in MMT tasks and (2) what factors influence their translation quality. We evaluate multiple o1-Like LLMs and compare them with traditional models like ChatGPT and GPT-4o. Results show that o1-Like LLMs establish new multilingual translation benchmarks, with DeepSeek-R1 surpassing GPT-4o in contextless tasks. They demonstrate strengths in historical and cultural translation but tend to ramble in Chinese-centric outputs. Further analysis reveals three key insights: (1) High inference costs and slower processing speeds make complex translation tasks more resource-intensive. (2) Translation quality improves with model size, enhancing commonsense reasoning and cultural translation. (3) The temperature parameter significantly impacts output quality: lower temperatures yield more stable and accurate translations, while higher temperatures reduce coherence and precision.
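Illustration (not from the paper): the abstract does not describe the decoding setup, but assuming the usual softmax-temperature definition, where logits are divided by the temperature before normalization, the effect it reports can be seen in a small sketch. The function name `softmax_with_temperature` and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to a probability distribution at a given temperature.

    Assumes the standard definition: p_i is proportional to exp(logit_i / T).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits for three candidate tokens.
example_logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(example_logits, 0.2)
flat = softmax_with_temperature(example_logits, 2.0)
```

At the lower temperature nearly all probability mass sits on the top-scoring token, so sampling is close to deterministic; at the higher temperature the mass spreads over the alternatives, which is one plausible reading of why higher temperatures reduce coherence and precision.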

Title: DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection

Authors: Yingli Shen, Wen Lai, Shuo Wang, Xueren Zhang, Kangyang Luo, Alexander Fraser, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11546
Pdf URL: https://arxiv.org/pdf/2502.11546
Copy Paste: [[2502.11546]] DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection(https://arxiv.org/abs/2502.11546)
Keywords: language model, llm
Abstract: The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and clean multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus built using newly extracted Common Crawl data and existing multilingual datasets. DCAD-2000 includes over 2,282 languages, 46.72TB of data, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of current data cleaning methods, which rely on manual heuristic thresholds, we propose reframing data cleaning as an anomaly detection task. This dynamic filtering approach significantly enhances data quality by identifying and removing noisy or anomalous content. We evaluate the quality of DCAD-2000 on the FineTask benchmark, demonstrating substantial improvements in multilingual dataset quality and task performance.

Title: Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models

Authors: Yue Xu, Chengyan Fu, Li Xiong, Sibei Yang, Wenjie Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11559
Pdf URL: https://arxiv.org/pdf/2502.11559
Copy Paste: [[2502.11559]] Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models(https://arxiv.org/abs/2502.11559)
Keywords: language model, llm
Abstract: Pre-training large language models (LLMs) on vast text corpora enhances natural language processing capabilities but risks encoding social biases, particularly gender bias. While parameter-modification methods like fine-tuning mitigate bias, they are resource-intensive, unsuitable for closed-source models, and lack adaptability to evolving societal norms. Instruction-based approaches offer flexibility but often compromise task performance. To address these limitations, we propose FaIRMaker, an automated and model-independent framework that employs an auto-search and refinement paradigm to adaptively generate Fairwords, which act as instructions integrated into input queries to reduce gender bias and enhance response quality. Extensive experiments demonstrate that FaIRMaker automatically searches for and dynamically refines Fairwords, effectively mitigating gender bias while preserving task integrity and ensuring compatibility with both API-based and open-source LLMs.

Title: Reinforced Information Retrieval

Authors: Chaofan Li, Zheng Liu, Jianlyv Chen, Defu Lian, Yingxia Shao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11562
Pdf URL: https://arxiv.org/pdf/2502.11562
Copy Paste: [[2502.11562]] Reinforced Information Retrieval(https://arxiv.org/abs/2502.11562)
Keywords: llm
Abstract: While retrieval techniques are widely used in practice, they still face significant challenges in cross-domain scenarios. Recently, generation-augmented methods have emerged as a promising solution to this problem. These methods enhance raw queries by incorporating additional information from an LLM-based generator, facilitating more direct retrieval of relevant documents. However, existing methods struggle with highly specialized situations that require extensive domain expertise. To address this problem, we present Reinforced-IR, a novel approach that jointly adapts a pre-trained retriever and generator for precise cross-domain retrieval. A key innovation of Reinforced-IR is its Self-Boosting framework, which enables the retriever and generator to learn from each other's feedback. Specifically, the generator is reinforced to generate query augmentations that enhance the retriever's performance, while the retriever is trained to better discriminate the relevant documents identified by the generator. This iterative process allows the end-to-end retrieval performance to be progressively optimized using an unlabeled corpus from the target domain. In our experiment, Reinforced-IR outperforms existing domain adaptation methods by a large margin, leading to substantial improvements in retrieval quality across a wide range of application scenarios.

Title: Towards Reasoning Ability of Small Language Models

Authors: Gaurav Srivastava, Shuxiang Cao, Xuan Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11569
Pdf URL: https://arxiv.org/pdf/2502.11569
Copy Paste: [[2502.11569]] Towards Reasoning Ability of Small Language Models(https://arxiv.org/abs/2502.11569)
Keywords: language model, llm, prompt
Abstract: Reasoning has long been viewed as an emergent property of large language models (LLMs), appearing at or above a certain scale (~100B parameters). However, recent studies challenge this assumption, showing that small language models (SLMs) can also achieve competitive reasoning performance. SLMs are increasingly favored for their efficiency and deployability. However, there is a lack of systematic study on the reasoning abilities of diverse SLMs, including those trained from scratch or derived from LLMs through quantization, pruning, and distillation. This raises a critical question: Can SLMs achieve reasoning abilities comparable to LLMs? In this work, we systematically survey, benchmark, and analyze 72 SLMs from six model families across 14 reasoning benchmarks. For reliable evaluation, we examine four evaluation methods and compare four LLM judges against human evaluations on 800 data points. We repeat all experiments three times to ensure a robust performance assessment. Additionally, we analyze the impact of different prompting strategies in small models. Beyond accuracy, we also evaluate model robustness under adversarial conditions and intermediate reasoning steps. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning. Instead, we foresee a future where SLMs with strong reasoning capabilities can be developed through structured training or post-training compression. They can serve as efficient alternatives to LLMs for reasoning-intensive tasks.

Title: FaMTEB: Massive Text Embedding Benchmark in Persian Language

Authors: Erfan Zinvandi, Morteza Alikhani, Mehran Sarmadi, Zahra Pourbahman, Sepehr Arvin, Reza Kazemi, Arash Amini
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11571
Pdf URL: https://arxiv.org/pdf/2502.11571
Copy Paste: [[2502.11571]] FaMTEB: Massive Text Embedding Benchmark in Persian Language(https://arxiv.org/abs/2502.11571)
Keywords: language model, chat, retrieval-augmented generation
Abstract: In this paper, we introduce a comprehensive benchmark for Persian (Farsi) text embeddings, built upon the Massive Text Embedding Benchmark (MTEB). Our benchmark includes 63 datasets spanning seven different tasks: classification, clustering, pair classification, reranking, retrieval, summary retrieval, and semantic textual similarity. The datasets are formed as a combination of existing, translated, and newly generated data, offering a diverse evaluation framework for Persian language models. Given the increasing use of text embedding models in chatbots, evaluation datasets are becoming inseparable ingredients in chatbot challenges and Retrieval-Augmented Generation systems. As a contribution, we include chatbot evaluation datasets in the MTEB benchmark for the first time. In addition, in this paper, we introduce the new task of summary retrieval which is not part of the tasks included in standard MTEB. Another contribution of this paper is the introduction of a substantial number of new Persian language NLP datasets suitable for training and evaluation, some of which have no previous counterparts in Persian. We evaluate the performance of several Persian and multilingual embedding models in a range of tasks. This work introduces an open-source benchmark with datasets, code and a public leaderboard.

Title: InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning

Authors: Congkai Xie, Shuo Cai, Wenjun Wang, Pengxiang Li, Zhijie Sang, Kejing Yang, Yiming Zhang, Zhen Li, Guanghao Zhu, Zeyu Liu, Yang Yu, Yuhang Liu, Su Lu, Baoyi He, Qi Zhou, Xiaotian Han, Jianbo Yuan, Shengyu Zhang, Fei Wu, Hongxia Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11573
Pdf URL: https://arxiv.org/pdf/2502.11573
Copy Paste: [[2502.11573]] InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning(https://arxiv.org/abs/2502.11573)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have made significant advancements in reasoning capabilities. However, they still face challenges such as high computational demands and privacy concerns. This paper focuses on developing efficient Small Language Models (SLMs) and Multimodal Small Language Models (MSLMs) that retain competitive reasoning abilities. We introduce a novel training pipeline that enhances reasoning capabilities and facilitates deployment on edge devices, achieving state-of-the-art performance while minimizing development costs. InfiR aims to advance AI systems by improving reasoning, reducing adoption barriers, and addressing privacy concerns through smaller model sizes. Resources are available at https://github.com/Reallm-Labs/InfiR.

Title: Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance

Authors: Birger Moell, Johan Boye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11578
Pdf URL: https://arxiv.org/pdf/2502.11578
Copy Paste: [[2502.11578]] Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance(https://arxiv.org/abs/2502.11578)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) have made significant strides in natural language generation but often face challenges in tasks requiring precise calculations and structural analysis. This paper investigates the performance of state-of-the-art LLMs on language complexity measurement tasks, through the computation of the LIX readability metric and Average Dependency Distance (ADD). Using Swedish high school and university-level essays, we evaluate the models' abilities to compute LIX scores and perform dependency parsing, comparing their results to established ground truths. Our findings reveal that while all models demonstrate some capacity for these tasks, ChatGPT-o1-mini performs most consistently, achieving the highest accuracy in both LIX computation and dependency parsing. Additionally, we observe a strong, statistically significant correlation (-0.875, p = 0.026, N = 6) between the models' accuracy in computing LIX and their overall performance on the Massive Multitask Language Understanding (MMLU) benchmark. These results suggest that language complexity measurement abilities can serve as a noisy zero-shot proxy for assessing the general capabilities of LLMs, providing a practical method for model evaluation without the need for extensive benchmarking datasets.
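For readers unfamiliar with the LIX readability metric mentioned in the abstract above, the following is a minimal sketch of the standard formula: average sentence length plus the percentage of long words (more than six letters). The regex-based tokenization and sentence splitting are simplifying assumptions for illustration, and the function name `lix` is hypothetical; this is not the procedure used to build the paper's ground truth.

```python
import re


def lix(text: str) -> float:
    """Compute LIX = words/sentences + 100 * long_words/words.

    Assumptions (not taken from the paper): a word is a maximal run of
    Unicode letters, a sentence ends at '.', '!', '?' or ':', and a long
    word has more than six letters.
    """
    words = re.findall(r"[^\W\d_]+", text)
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


if __name__ == "__main__":
    sample = "Detta är en mening. Universitetet publicerade resultaten."
    print(round(lix(sample), 2))
```

A real evaluation would likely need a proper tokenizer and sentence segmenter for Swedish, since abbreviations and numerals can distort the counts.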

Title: Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?

Authors: Leyi Pan, Aiwei Liu, Shiyu Huang, Yijian Lu, Xuming Hu, Lijie Wen, Irwin King, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11598
Pdf URL: https://arxiv.org/pdf/2502.11598
Copy Paste: [[2502.11598]] Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?(https://arxiv.org/abs/2502.11598)
Keywords: language model, llm
Abstract: The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies. Our code is available at this https URL.

Title: DR.GAP: Mitigating Bias in Large Language Models using Gender-Aware Prompting with Demonstration and Reasoning

Authors: Hongye Qiu, Yue Xu, Meikang Qiu, Wenjie Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11603
Pdf URL: https://arxiv.org/pdf/2502.11603
Copy Paste: [[2502.11603]] DR.GAP: Mitigating Bias in Large Language Models using Gender-Aware Prompting with Demonstration and Reasoning(https://arxiv.org/abs/2502.11603)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) exhibit strong natural language processing capabilities but also inherit and amplify societal biases, including gender bias, raising fairness concerns. Existing debiasing methods face significant limitations: parameter tuning requires access to model weights, prompt-based approaches often degrade model utility, and optimization-based techniques lack generalizability. To address these challenges, we propose DR.GAP (Demonstration and Reasoning for Gender-Aware Prompting), an automated and model-agnostic approach that mitigates gender bias while preserving model performance. DR.GAP selects bias-revealing examples and generates structured reasoning to guide models toward more impartial responses. Extensive experiments on coreference resolution and QA tasks across multiple LLMs (GPT-3.5, Llama3, and Llama2-Alpaca) demonstrate its effectiveness, generalization ability, and robustness. DR.GAP can generalize to vision-language models (VLMs), achieving significant bias reduction.

Title: Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI

Authors: Yuxia Wang, Rui Xing, Jonibek Mansurov, Giovanni Puccetti, Zhuohan Xie, Minh Ngoc Ta, Jiahui Geng, Jinyan Su, Mervat Abassy, Saad El Dine Ahmed, Kareem Elozeiri, Nurkhan Laiyk, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Alexander Aziz, Ryuto Koike, Masahiro Kaneko, Artem Shelmanov, Ekaterina Artemova, Vladislav Mikhailov, Akim Tsvigun, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11614
Pdf URL: https://arxiv.org/pdf/2502.11614
Copy Paste: [[2502.11614]] Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI(https://arxiv.org/abs/2502.11614)
Keywords: language model, llm, prompt
Abstract: Prior studies have shown that distinguishing text generated by large language models (LLMs) from human-written text is highly challenging, with accuracy often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Explicitly explaining these distinctions in the prompts can partially bridge the gaps in over 50% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source.

Title: Uncovering the Impact of Chain-of-Thought Reasoning for Direct Preference Optimization: Lessons from Text-to-SQL

Authors: Hanbing Liu, Haoyang Li, Xiaokang Zhang, Ruotong Chen, Haiyong Xu, Tian Tian, Qi Qi, Jing Zhang
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2502.11656
Pdf URL: https://arxiv.org/pdf/2502.11656
Copy Paste: [[2502.11656]] Uncovering the Impact of Chain-of-Thought Reasoning for Direct Preference Optimization: Lessons from Text-to-SQL(https://arxiv.org/abs/2502.11656)
Keywords: chain-of-thought
Abstract: Direct Preference Optimization (DPO) has proven effective in complex reasoning tasks like math word problems and code generation. However, when applied to Text-to-SQL datasets, it often fails to improve performance and can even degrade it. Our investigation reveals the root cause: unlike math and code tasks, which naturally integrate Chain-of-Thought (CoT) reasoning with DPO, Text-to-SQL datasets typically include only final answers (gold SQL queries) without detailed CoT solutions. By augmenting Text-to-SQL datasets with synthetic CoT solutions, we achieve, for the first time, consistent and significant performance improvements using DPO. Our analysis shows that CoT reasoning is crucial for unlocking DPO's potential, as it mitigates reward hacking, strengthens discriminative capabilities, and improves scalability. These findings offer valuable insights for building more robust Text-to-SQL models. To support further research, we publicly release the code and CoT-enhanced datasets.

Title: Diversity-Oriented Data Augmentation with Large Language Models

Authors: Zaitian Wang, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu, Pengfei Wang, Yuanchun Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11671
Pdf URL: https://arxiv.org/pdf/2502.11671
Copy Paste: [[2502.11671]] Diversity-Oriented Data Augmentation with Large Language Models(https://arxiv.org/abs/2502.11671)
Keywords: language model, llm
Abstract: Data augmentation is an essential technique in natural language processing (NLP) for enriching training datasets by generating diverse samples. This process is crucial for improving the robustness and generalization capabilities of NLP models. However, a significant challenge remains: insufficient attention to sample distribution diversity. Most existing methods focus on increasing the number of samples while neglecting the sample distribution diversity, which can lead to model overfitting. In response, we explore data augmentation's impact on dataset diversity and propose a Diversity-oriented data Augmentation framework (DoAug). Specifically, we utilize a diversity-oriented fine-tuning approach to train an LLM as a diverse paraphraser, which is capable of augmenting textual datasets by generating diversified paraphrases. Then, we apply the LLM paraphraser to a selected coreset of highly informative samples and integrate the paraphrases with the original data to create a more diverse augmented dataset. Finally, we conduct extensive experiments on 12 real-world textual datasets. The results show that our fine-tuned LLM augmenter improves diversity while preserving label consistency, thereby enhancing the robustness and performance of downstream tasks. Specifically, it achieves an average performance gain of 10.52%, surpassing the runner-up baseline by more than three percentage points.

Title: Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception

Authors: Shiyu Ni, Keping Bi, Jiafeng Guo, Lulu Yu, Baolong Bi, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11677
Pdf URL: https://arxiv.org/pdf/2502.11677
Copy Paste: [[2502.11677]] Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception(https://arxiv.org/abs/2502.11677)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit impressive performance across diverse tasks but often struggle to accurately gauge their knowledge boundaries, leading to confident yet incorrect responses. This paper explores leveraging LLMs' internal states to enhance their perception of knowledge boundaries from efficiency and risk perspectives. We investigate whether LLMs can estimate their confidence using internal states before response generation, potentially saving computational resources. Our experiments on datasets like Natural Questions, HotpotQA, and MMLU reveal that LLMs demonstrate significant pre-generation perception, which is further refined post-generation, with perception gaps remaining stable across varying conditions. To mitigate risks in critical domains, we introduce Consistency-based Confidence Calibration ($C^3$), which assesses confidence consistency through question reformulation. $C^3$ significantly improves LLMs' ability to recognize their knowledge gaps, enhancing the unknown perception rate by 5.6% on NQ and 4.9% on HotpotQA. Our findings suggest that pre-generation confidence estimation can optimize efficiency, while $C^3$ effectively controls output risks, advancing the reliability of LLMs in practical applications.

Title: RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars

Authors: Yuncheng Hua, Lizhen Qu, Zhuang Li, Hao Xue, Flora D. Salim, Gholamreza Haffari
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11681
Pdf URL: https://arxiv.org/pdf/2502.11681
Copy Paste: [[2502.11681]] RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars(https://arxiv.org/abs/2502.11681)
Keywords: language model, llm, prompt
Abstract: Alignment tuning is crucial for ensuring large language models (LLMs) behave ethically and helpfully. Current alignment approaches require high-quality annotations and significant training resources. This paper proposes a low-cost, tuning-free method using in-context learning (ICL) to enhance LLM alignment. Through an analysis of high-quality ICL demos, we identified style as a key factor influencing LLM alignment capabilities and explicitly restyled ICL exemplars based on this stylistic framework. Additionally, we combined the restyled demos to achieve a balance between the two conflicting aspects of LLM alignment--factuality and safety. We packaged the restyled examples as prompts to trigger few-shot learning, improving LLM alignment. Compared to the best baseline approach, with an average score of 5.00 as the maximum, our method achieves a maximum 0.10 increase on the Alpaca task (from 4.50 to 4.60), a 0.22 enhancement on the Just-eval benchmark (from 4.34 to 4.56), and a maximum improvement of 0.32 (from 3.53 to 3.85) on the MT-Bench dataset. We release the code and data at this https URL.

Title: MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task

Authors: Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Xin Xu, Mengdi Zhang, Jian Shao, Yueting Zhuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11684
Pdf URL: https://arxiv.org/pdf/2502.11684
Copy Paste: [[2502.11684]] MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task(https://arxiv.org/abs/2502.11684)
Keywords: language model, llm
Abstract: Mathematical reasoning represents a critical frontier in advancing large language models (LLMs). While step-by-step approaches have emerged as the dominant paradigm for mathematical problem-solving in LLMs, the quality of reasoning steps in training data fundamentally constrains the performance of the models. Recent studies have demonstrated that more detailed intermediate steps can enhance model performance, yet existing methods for step expansion either require more powerful external models or incur substantial computational costs. In this paper, we introduce MathFimer, a novel framework for mathematical reasoning step expansion inspired by the "Fill-in-the-middle" task from code completion. By decomposing solution chains into prefix-suffix pairs and training models to reconstruct missing intermediate steps, we develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset. We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains, creating MathFimer-expanded versions. Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct and MetaMathQA, we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on original data across various benchmarks such as GSM8K and MATH. Our approach offers a practical, scalable solution for enhancing mathematical reasoning capabilities in LLMs without relying on powerful external models or expensive inference procedures.
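The MathFimer abstract describes decomposing solution chains into prefix-suffix pairs so that a model learns to reconstruct a missing intermediate step. The sketch below shows one way such training examples might be built. It assumes a solution is a list of step strings and that each interior step is held out in turn; the function name `fim_examples`, the field names, and the hold-out policy are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterator, List


def fim_examples(question: str, steps: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one fill-in-the-middle example per interior step.

    Hypothetical format: the prefix is the question plus all earlier steps,
    the suffix is all later steps, and the held-out step is the target.
    """
    for i in range(1, len(steps) - 1):
        yield {
            "prefix": "\n".join([question] + steps[:i]),
            "suffix": "\n".join(steps[i + 1:]),
            "middle": steps[i],
        }


if __name__ == "__main__":
    solution = ["x + 2 = 5", "x = 5 - 2", "x = 3"]
    for example in fim_examples("Solve x + 2 = 5.", solution):
        print(example)
```

Holding out only interior steps keeps both a non-empty prefix and a non-empty suffix for every example, which is the property the fill-in-the-middle setup relies on.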

Title: Improve LLM-as-a-Judge Ability as a General Ability

Authors: Jiachen Yu, Shaoning Sun, Xiaohui Hu, Jiaxu Yan, Kaidong Yu, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11689
Pdf URL: https://arxiv.org/pdf/2502.11689
Copy Paste: [[2502.11689]] Improve LLM-as-a-Judge Ability as a General Ability(https://arxiv.org/abs/2502.11689)
Keywords: language model, llm
Abstract: LLM-as-a-Judge leverages the generative and reasoning capabilities of large language models (LLMs) to evaluate LLM responses across diverse scenarios, providing accurate preference signals. This approach plays a vital role in aligning LLMs with human values, ensuring ethical and reliable AI outputs that align with societal norms. Recent studies have proposed many methods for training LLMs as generative judges, but most of them are data-hungry or lack accuracy, and focus only on the LLM's judging ability. In this work, we regard judging ability as a general ability of LLMs and implement a two-stage training approach, comprising supervised fine-tuning (SFT) warm-up and direct preference optimization (DPO) enhancement, to achieve judge style adaptation and improve judgment accuracy. Additionally, we introduce an efficient data synthesis method to generate judgmental content. Experimental results demonstrate that our approach, utilizing only about 2% to 40% of the data required by other methods, achieves SOTA performance on RewardBench. Furthermore, our training method enhances the general capabilities of the model by constructing complicated judge tasks, and the judge signals provided by our model significantly enhanced the downstream DPO training performance of our internal models in our tests of optimizing a policy model with the judge model. We also open-source our model weights and training data to facilitate further research.

Title: CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation

Authors: Guangya Yu, Yanhao Li, Zongying Jiang, Yuxiong Jin, Li Dai, Yupian Lin, Ruihui Hou, Weiyan Zhang, Yongqi Fan, Qi Ye, Jingping Liu, Tong Ruan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11703
Pdf URL: https://arxiv.org/pdf/2502.11703
Copy Paste: [[2502.11703]] CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation(https://arxiv.org/abs/2502.11703)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Medical quality control indicators are essential to assess the qualifications of healthcare institutions for medical services. With the impressive performance of large language models (LLMs) like GPT-4 in the medical field, leveraging these technologies for the Medical Quality Control Indicator Calculation (MQCIC) presents a promising approach. In this work, (1) we introduce a real-world task, MQCIC, and propose an open-source dataset based on Chinese electronic medical records (EMRs), CMQCIC-Bench, comprising 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation. Then we propose the Clinical Facts-based Inferential Rule (CF-IR) method that disentangles the clinical fact verification and inferential rule reasoning actions. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct an error analysis and investigate the capabilities of clinical fact verification and inferential rule reasoning, providing insights to further improve performance on MQCIC. The dataset and code are available at this https URL.

Title: LLM Agents Making Agent Tools

Authors: Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2502.11705
Pdf URL: https://arxiv.org/pdf/2502.11705
Copy Paste: [[2502.11705]] LLM Agents Making Agent Tools(https://arxiv.org/abs/2502.11705)
Keywords: language model, llm, agent
Abstract: Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains which demand large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, a novel agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a short task description and a repository URL, ToolMaker autonomously installs required dependencies and generates code to perform the task, using a closed-loop self-correction mechanism to iteratively diagnose and rectify errors. To evaluate our approach, we introduce a benchmark comprising 15 diverse and complex computational tasks spanning both medical and non-medical domains with over 100 unit tests to objectively assess tool correctness and robustness. ToolMaker correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows.

Title: Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models

Authors: Sherzod Hakimov, Lara Pfennigschmidt, David Schlangen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11707
Pdf URL: https://arxiv.org/pdf/2502.11707
Copy Paste: [[2502.11707]] Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models(https://arxiv.org/abs/2502.11707)
Keywords: language model, llm
Abstract: This study utilizes the game Codenames as a benchmarking tool to evaluate large language models (LLMs) with respect to specific linguistic and cognitive skills. LLMs play each side of the game, where one side generates a clue word covering several target words and the other guesses those target words. We designed various experiments by controlling the choice of words (abstract vs. concrete words, ambiguous vs. monosemic) or the opponent (programmed to be faster or slower in revealing words). Recent commercial and open-weight models were compared side-by-side to identify factors affecting their performance. The evaluation reveals details about the models' strategies, challenging cases, and limitations.

Title: "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models

Authors: Jihao Gu, Yingyao Wang, Pi Bu, Chen Wang, Ziming Wang, Tengtao Song, Donglai Wei, Jiale Yuan, Yingxiu Zhao, Yancheng He, Shilong Li, Jiaheng Liu, Meng Cao, Jun Song, Yingshui Tan, Xiang Li, Wenbo Su, Zhicheng Zheng, Xiaoyong Zhu, Bo Zheng
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.11718
Pdf URL: https://arxiv.org/pdf/2502.11718
Copy Paste: [[2502.11718]] "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models(https://arxiv.org/abs/2502.11718)
Keywords: language model
Abstract: The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models' knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, multi-hop question construction, high-quality data, static consistency, and ease of evaluation through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and execution mechanisms of LVLMs. Subsequently, we evaluate 34 advanced open-source and closed-source models, revealing critical performance gaps within this field.

Title: Plant in Cupboard, Orange on Table, Book on Shelf. Benchmarking Practical Reasoning and Situation Modelling in a Text-Simulated Situated Environment

Authors: Jonathan Jordan, Sherzod Hakimov, David Schlangen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11733
Pdf URL: https://arxiv.org/pdf/2502.11733
Copy Paste: [[2502.11733]] Plant in Cupboard, Orange on Table, Book on Shelf. Benchmarking Practical Reasoning and Situation Modelling in a Text-Simulated Situated Environment(https://arxiv.org/abs/2502.11733)
Keywords: language model, llm, chat, agent
Abstract: Large language models (LLMs) have risen to prominence as 'chatbots' that users interact with via natural language. However, their ability to capture common-sense knowledge makes them seem promising as language-based planners of situated or embodied action as well. We have implemented a simple text-based environment -- similar to others that have previously been used for reinforcement learning of agents -- that simulates, very abstractly, a household setting. We use this environment and the detailed error-tracking capabilities we implemented for targeted benchmarking of LLMs on the problem of practical reasoning: going from goals and observations to actions. Our findings show that environmental complexity and game restrictions hamper performance, and that concise action planning is demanding for current LLMs.

Title: MT-RAIG: Novel Benchmark and Evaluation Framework for Retrieval-Augmented Insight Generation over Multiple Tables

Authors: Kwangwook Seo, Donguk Kwon, Dongha Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11735
Pdf URL: https://arxiv.org/pdf/2502.11735
Copy Paste: [[2502.11735]] MT-RAIG: Novel Benchmark and Evaluation Framework for Retrieval-Augmented Insight Generation over Multiple Tables(https://arxiv.org/abs/2502.11735)
Keywords: llm
Abstract: Recent advancements in table-based reasoning have expanded beyond factoid-level QA to address insight-level tasks, where systems should synthesize implicit knowledge in the table to provide explainable analyses. Although effective, existing studies remain confined to scenarios where a single gold table is given alongside the user query, failing to address cases where users seek comprehensive insights from multiple unknown tables. To bridge these gaps, we propose MT-RAIG Bench, designed to evaluate systems on Retrieval-Augmented Insight Generation over Multiple Tables. Additionally, to tackle the suboptimality of existing automatic evaluation methods in the table domain, we further introduce a fine-grained evaluation framework, MT-RAIG Eval, which achieves better alignment with human quality judgments on the generated insights. We conduct extensive experiments and reveal that even frontier LLMs still struggle with complex multi-table reasoning, establishing our MT-RAIG Bench as a challenging testbed for future research.

Title: ReviewEval: An Evaluation Framework for AI-Generated Reviews

Authors: Chavvi Kirtani, Madhav Krishan Garg, Tejash Prasad, Tanmay Singhal, Murari Mandal, Dhruv Kumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11736
Pdf URL: https://arxiv.org/pdf/2502.11736
Copy Paste: [[2502.11736]] ReviewEval: An Evaluation Framework for AI-Generated Reviews(https://arxiv.org/abs/2502.11736)
Keywords: language model, llm, hallucination, prompt
Abstract: The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. While large language models (LLMs) offer potential for automating this process, their current limitations include superficial critiques, hallucinations, and a lack of actionable insights. This research addresses these challenges by introducing a comprehensive evaluation framework for AI-generated reviews that measures alignment with human evaluations, verifies factual accuracy, assesses analytical depth, and identifies actionable insights. We also propose a novel alignment mechanism that tailors LLM-generated reviews to the unique evaluation priorities of individual conferences and journals. To enhance the quality of these reviews, we introduce a self-refinement loop that iteratively optimizes the LLM's review prompts. Our framework establishes standardized metrics for evaluating AI-based review systems, thereby bolstering the reliability of AI-generated reviews in academic research.

Title: Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation

Authors: Zengkui Sun, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11766
Pdf URL: https://arxiv.org/pdf/2502.11766
Copy Paste: [[2502.11766]] Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation(https://arxiv.org/abs/2502.11766)
Keywords: language model, llm
Abstract: The widespread deployment of Large Language Models (LLMs) is hindered by their high computational demands, making knowledge distillation (KD) crucial for developing compact, smaller models. However, conventional KD methods suffer from a distribution mismatch between the teacher and student models, leading to poor distillation performance. For instance, the widely used KL-based methods suffer from mode-averaging and mode-collapsing problems because of the mismatched probability distributions of the two models. Previous studies mainly address this issue through different distance calculations between the distributions of the two models. Unfortunately, the distribution mismatch still exists in the early stage of distillation. Hence, to reduce the impact of distribution mismatch, we propose a simple yet efficient method, named Warmup-Distill, which aligns the distillation of the student to that of the teacher in advance of distillation. Specifically, we first detect the distribution of the student model in practical scenarios with its internal knowledge, and then modify the knowledge with low probability via the teacher as the checker. Consequently, Warmup-Distill aligns the student's internal knowledge to that of the teacher, which expands the distribution of the student with the teacher's, and helps the student model learn better in the subsequent distillation. Experiments on seven benchmarks demonstrate that Warmup-Distill can provide a warmup student more suitable for distillation, which outperforms the vanilla student by at least +0.4 average score across all benchmarks. Notably, with the assistance of Warmup-Distill, distillation on the math task can yield a further improvement of up to +1.9% accuracy.
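
The mode-averaging versus mode-collapsing behaviour mentioned in the Warmup-Distill abstract can be illustrated with a small calculation. The sketch below compares forward KL, KL(teacher || student), with reverse KL, KL(student || teacher), on a toy bimodal teacher. The distributions and the function name are illustrative assumptions and are not taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy bimodal teacher over four tokens (illustrative values only).
teacher = [0.49, 0.01, 0.01, 0.49]
# A student that spreads its mass everywhere (mode-averaging behaviour).
spread_student = [0.25, 0.25, 0.25, 0.25]
# A student that commits to a single mode (mode-collapsing behaviour).
peaked_student = [0.97, 0.01, 0.01, 0.01]

for name, student in [("spread", spread_student), ("peaked", peaked_student)]:
    forward = kl_divergence(teacher, student)  # KL(teacher || student)
    reverse = kl_divergence(student, teacher)  # KL(student || teacher)
    print(f"{name}: forward={forward:.3f} reverse={reverse:.3f}")
```

With these toy numbers, forward KL scores the spread student as closer to the teacher, while reverse KL prefers the peaked one, which is the asymmetry behind the two failure modes.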

Title: The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It

Authors: Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11771
Pdf URL: https://arxiv.org/pdf/2502.11771
Copy Paste: [[2502.11771]] The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It(https://arxiv.org/abs/2502.11771)
Keywords: language model, llm
Abstract: The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models' internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on $\textit{consistency heads}$--attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models' internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why current LLMs struggle to detect even simple arithmetic errors.

Title: Efficient Response Generation Method Selection for Fine-Tuning Large Language Models

Authors: Xuan Ren, Qi Chen, Lingqiao Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11779
Pdf URL: https://arxiv.org/pdf/2502.11779
Copy Paste: [[2502.11779]] Efficient Response Generation Method Selection for Fine-Tuning Large Language Models(https://arxiv.org/abs/2502.11779)
Keywords: language model, llm
Abstract: The training data for fine-tuning large language models (LLMs) is typically structured as input-output pairs. However, for many tasks, there can be multiple equally valid output variations for the same input. Recent studies have observed that the choice of output variation used in training can affect the model's performance. This raises an important question: how can we generate the most effective output from the many possible response generation strategy options? Rather than relying on the traditional but resource-intensive train-and-evaluate approach, this paper proposes a scalable, approximate method for estimating the quality of a small subset of generated training data derived from the same input. We then evaluate how well this small subset of generated output fits the target model we are trying to train. We present a large-scale benchmark covering diverse reasoning-based datasets to support our study. The central idea is that a good output should closely resemble the output generated by the target LLM. We formalize this 'closeness' as the expected alignment score between a candidate output and the output sampled from the target LLM. We connect this measurement to the perplexity metric used in previous literature and demonstrate that leveraging an alignment-based metric can provide better predictions of model performance. Using this strategy, we can evaluate a small subset of the generated output from each response generation strategy option, then select the most effective strategy. We show that an LLM trained on data generated by the selected strategy could lead to a significant performance gain in many cases.
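
The abstract above relates its alignment score to the perplexity metric used in earlier work. As a rough illustration of that baseline only (not the paper's alignment score), the sketch below picks a response generation strategy by the mean perplexity that a target model assigns to a few sampled outputs. The strategy names, the log-probability values, and both function names are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_strategy(logprobs_by_strategy):
    """Return the strategy whose sampled outputs have the lowest mean perplexity."""
    mean_ppl = {
        name: sum(perplexity(lp) for lp in outputs) / len(outputs)
        for name, outputs in logprobs_by_strategy.items()
    }
    return min(mean_ppl, key=mean_ppl.get), mean_ppl

# Hypothetical token log-probabilities assigned by the target LLM to a small
# subset of outputs from each strategy (made-up numbers for illustration).
samples = {
    "step_by_step": [[-0.2, -0.1, -0.3], [-0.4, -0.2]],
    "terse_answer": [[-1.5, -2.0], [-0.9, -1.1, -1.3]],
}
best, scores = pick_strategy(samples)
print(best, {name: round(value, 2) for name, value in scores.items()})
```

Only a small subset of outputs per strategy is scored here, mirroring the abstract's point that the selection can be made without a full train-and-evaluate cycle.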

Title: Personality Editing for Language Models through Relevant Knowledge Editing

Authors: Seojin Hwang, Yumin Kim, Byeongjeong Kim, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11789
Pdf URL: https://arxiv.org/pdf/2502.11789
Copy Paste: [[2502.11789]] Personality Editing for Language Models through Relevant Knowledge Editing(https://arxiv.org/abs/2502.11789)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) play a vital role in applications like conversational agents and content creation, where controlling a model's personality is crucial for maintaining tone, consistency, and engagement. However, traditional prompt-based techniques for controlling personality often fall short, as they do not effectively mitigate the model's inherent biases. In this paper, we introduce a novel method PALETTE that enhances personality control through knowledge editing. By generating adjustment queries inspired by psychological assessments, our approach systematically adjusts responses to personality-related queries similar to modifying factual knowledge, thereby achieving controlled shifts in personality traits. Experimental results from both automatic and human evaluations demonstrate that our method enables more stable and well-balanced personality control in LLMs.

Title: Exploring Translation Mechanism of Large Language Models

Authors: Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11806
Pdf URL: https://arxiv.org/pdf/2502.11806
Copy Paste: [[2502.11806]] Exploring Translation Mechanism of Large Language Models(https://arxiv.org/abs/2502.11806)
Keywords: language model, llm
Abstract: Large language models (LLMs) have succeeded remarkably in multilingual translation tasks. However, the inherent translation mechanisms of LLMs remain poorly understood, largely due to sophisticated architectures and vast parameter scales. In response to this issue, this study explores the translation mechanism of LLMs from the perspective of computational components (e.g., attention heads and MLPs). Path patching is utilized to explore causal relationships between components, detecting those crucial for translation tasks and subsequently analyzing their behavioral patterns in human-interpretable terms. Comprehensive analysis reveals that translation is predominantly facilitated by a sparse subset of specialized attention heads (less than 5%), which extract source language, indicator, and positional features. MLPs subsequently integrate and process these features by transiting towards English-centric latent representations. Notably, building on the above findings, targeted fine-tuning of only 64 heads achieves translation improvement comparable to full-parameter tuning while preserving general capabilities.

Title: FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models

Authors: Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Yongxin Tong, Zhiming Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11811
Pdf URL: https://arxiv.org/pdf/2502.11811
Copy Paste: [[2502.11811]] FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2502.11811)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieved documents containing noise will hinder Retrieval-Augmented Generation (RAG) from detecting answer clues, necessitating noise filtering mechanisms to enhance this http URL methods use re-ranking or summarization to identify the most relevant sentences, but directly and accurately locating answer clues from these large-scale and complex documents remains challenging. Unlike these document-level operations, we treat noise filtering as a sentence-level MinMax optimization problem: first identifying the potential clues from multiple documents using contextual information, then ranking them by relevance, and finally retaining the fewest clues through truncation. In this paper, we propose FineFilter, a novel fine-grained noise filtering mechanism for RAG consisting of a clue extractor, a re-ranker, and a truncator. We optimize each module to tackle complex reasoning challenges: (1) the clue extractor first uses sentences containing the answer and similar ones as fine-tuning targets, aiming to extract sufficient potential clues; (2) the re-ranker is trained to prioritize effective clues based on real feedback from the generation module, with clues capable of generating the correct answer as positive samples and others as negative; (3) the truncator takes the minimum clues needed to answer the question (the truncation point) as fine-tuning targets, and performs truncation on the re-ranked clues to achieve fine-grained noise filtering. Experiments on three QA datasets demonstrate that FineFilter significantly outperforms baselines in terms of performance and inference cost. Further analysis of each module shows the effectiveness of our optimizations for complex reasoning.
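
The extract, re-rank, truncate flow described in the FineFilter abstract can be sketched as a three-stage pipeline. This is a toy illustration only: the paper trains each module, whereas the sketch below substitutes plain word overlap for the learned extractor and re-ranker, and a fixed `keep` parameter for the learned truncation point. All function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercased word set, used as a crude stand-in for a learned scorer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def extract_clues(question, documents):
    """Stage 1: keep sentences that share at least one word with the question."""
    q = tokenize(question)
    sentences = [s.strip() for d in documents for s in d.split(".") if s.strip()]
    return [s for s in sentences if q & tokenize(s)]

def rerank(question, clues):
    """Stage 2: order clues by word overlap with the question."""
    q = tokenize(question)
    return sorted(clues, key=lambda s: len(q & tokenize(s)), reverse=True)

def truncate(clues, keep):
    """Stage 3: retain only the top `keep` clues."""
    return clues[:keep]

def fine_filter(question, documents, keep=1):
    return truncate(rerank(question, extract_clues(question, documents)), keep)

docs = [
    "Paris is the capital of France. The city hosts many museums.",
    "France borders Spain. Berlin is the capital of Germany.",
]
print(fine_filter("What is the capital of France?", docs))
```

The point of the sketch is the sentence-level granularity: the generator would receive one short clue instead of two full documents.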

Title: Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

Authors: Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11812
Pdf URL: https://arxiv.org/pdf/2502.11812
Copy Paste: [[2502.11812]] Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis(https://arxiv.org/abs/2502.11812)
Keywords: language model, llm
Abstract: Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike previous studies \cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity} that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, which are closer to the practical setting. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, in contrast to previous work \cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity} showing that circuits only add some additional components after fine-tuning. Based on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA) method, which assigns ranks to layers based on edge changes in the circuits. Experimental results demonstrate that our circuit-based LoRA algorithm achieves an average performance improvement of 2.46% over standard LoRA with similar parameter sizes. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, providing new insights into the design of such tasks and deepening the understanding of circuit dynamics and fine-tuning mechanisms.
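
The abstract above says only that the circuit-aware LoRA method assigns ranks to layers based on edge changes in the circuits. One plausible reading, shown below purely as an assumption, is to split a fixed rank budget across layers in proportion to a per-layer edge-change score. The proportional rule, the `min_rank` floor, and the example scores are all hypothetical and may differ from the paper's actual scheme.

```python
def allocate_ranks(edge_changes, total_rank, min_rank=1):
    """Split a rank budget across layers in proportion to circuit edge change."""
    n = len(edge_changes)
    if total_rank < n * min_rank:
        raise ValueError("rank budget too small for the minimum per-layer rank")
    spare = total_rank - n * min_rank
    total_change = sum(edge_changes)
    if total_change == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * c / total_change for c in edge_changes]
    ranks = [min_rank + int(s) for s in shares]
    # Hand out the ranks lost to rounding, largest fractional remainder first.
    leftover = total_rank - sum(ranks)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        ranks[i] += 1
    return ranks

# Hypothetical edge-change scores for a four-layer model.
ranks = allocate_ranks([10, 40, 30, 20], total_rank=32)
print(ranks)
```

Keeping the total equal to the budget is what makes a comparison against standard LoRA "with similar parameter sizes" meaningful: uniform LoRA would give each of the four layers rank 8 under the same budget.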

Title: M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis

Authors: Chengyan Wu, Bolei Ma, Yihong Liu, Zheyu Zhang, Ningyuan Deng, Yanshu Li, Baolan Chen, Yi Zhang, Barbara Plank, Yun Xue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11824
Pdf URL: https://arxiv.org/pdf/2502.11824
Copy Paste: [[2502.11824]] M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2502.11824)
Keywords: language model
Abstract: Aspect-based sentiment analysis (ABSA) is a crucial task in information extraction and sentiment analysis, aiming to identify aspects with associated sentiment elements in text. However, existing ABSA datasets are predominantly English-centric, limiting the scope for multilingual evaluation and research. To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7 domains and 21 languages, making it the most extensive multilingual parallel dataset for ABSA to date. Our primary focus is on triplet extraction, which involves identifying aspect terms, aspect categories, and sentiment polarities. The dataset is constructed through an automatic translation process with human review to ensure quality. We perform extensive experiments using various baselines to assess performance and compatibility on M-ABSA. Our empirical findings highlight that the dataset enables diverse evaluation tasks, such as multilingual and multi-domain transfer learning, and large language model evaluation, underscoring its inclusivity and its potential to drive advancements in multilingual ABSA research.

Title: Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities

Authors: Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2502.11829
Pdf URL: https://arxiv.org/pdf/2502.11829
Copy Paste: [[2502.11829]] Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities(https://arxiv.org/abs/2502.11829)
Keywords: language model, gpt, llm
Abstract: This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. Code-Vision comprises three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding abilities across basic programming, algorithmic, and mathematical problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision. Experimental results demonstrate that there is a large performance difference between proprietary and open-source models. On Hard problems, GPT-4o can achieve 79.3% pass@1, but the best open-source model only achieves 15%. Further experiments reveal that Code-Vision can pose unique challenges compared to other multimodal reasoning benchmarks MMCode and MathVista. We also explore the reason for the poor performance of the open-source models. All data and codes are available at this https URL.
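
The Code-Vision abstract reports pass@1 figures without stating how they are computed. A common choice for code benchmarks is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass all tests. The sketch below assumes that estimator, and the per-problem counts are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem results: (samples drawn, samples passing all tests).
results = [(10, 3), (10, 0), (10, 10)]
scores = [pass_at_k(n, c, k=1) for n, c in results]
benchmark_pass_at_1 = sum(scores) / len(scores)
print(round(benchmark_pass_at_1, 3))
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over problems.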

Title: Text Classification in the LLM Era - Where do we stand?

Authors: Sowmya Vajjala, Shwetali Shimangaud
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11830
Pdf URL: https://arxiv.org/pdf/2502.11830
Copy Paste: [[2502.11830]] Text Classification in the LLM Era - Where do we stand?(https://arxiv.org/abs/2502.11830)
Keywords: language model, llm
Abstract: Large Language Models revolutionized NLP and showed dramatic performance improvements across several tasks. In this paper, we investigated the role of such language models in text classification and how they compare with other approaches relying on smaller pre-trained language models. Considering 32 datasets spanning 8 languages, we compared zero-shot classification, few-shot fine-tuning and synthetic data based classifiers with classifiers built using the complete human labeled dataset. Our results show that zero-shot approaches do well for sentiment classification, but are outperformed by other approaches for the rest of the tasks, and synthetic data sourced from multiple LLMs can build better classifiers than zero-shot open LLMs. We also see wide performance disparities across languages in all the classification scenarios. We expect that these findings would guide practitioners working on developing text classification systems across languages.

Title: Can LLM Agents Maintain a Persona in Discourse?

Authors: Pranav Bhandari, Nicolas Fay, Michael Wise, Amitava Datta, Stephanie Meek, Usman Naseem, Mehwish Nasim
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2502.11843
Pdf URL: https://arxiv.org/pdf/2502.11843
Copy Paste: [[2502.11843]] Can LLM Agents Maintain a Persona in Discourse?(https://arxiv.org/abs/2502.11843)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are widely used as conversational agents, exploiting their capabilities in various sectors such as education, law, medicine, and more. However, LLMs are often subjected to context-shifting behaviour, resulting in a lack of consistent and interpretable personality-aligned interactions. Adherence to psychological traits lacks comprehensive analysis, especially in the case of dyadic (pairwise) conversations. We examine this challenge from two viewpoints, initially using two conversation agents to generate a discourse on a certain topic with an assigned personality from the OCEAN framework (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) as High/Low for each trait. This is followed by using multiple judge agents to infer the original traits assigned to explore prediction consistency, inter-model agreement, and alignment with the assigned personality. Our findings indicate that while LLMs can be guided toward personality-driven dialogue, their ability to maintain personality traits varies significantly depending on the combination of models and discourse settings. These inconsistencies emphasise the challenges in achieving stable and interpretable personality-aligned interactions in LLMs.

Title: LLMs as a synthesis between symbolic and continuous approaches to language

Authors: Gemma Boleda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11856
Pdf URL: https://arxiv.org/pdf/2502.11856
Copy Paste: [[2502.11856]] LLMs as a synthesis between symbolic and continuous approaches to language(https://arxiv.org/abs/2502.11856)
Keywords: llm
Abstract: Since the middle of the 20th century, a fierce battle has been fought between symbolic and continuous approaches to language and cognition. The success of deep learning models, and LLMs in particular, has been alternately taken as showing that the continuous camp has won, or dismissed as an irrelevant engineering development. However, in this position paper I argue that deep learning models for language actually represent a synthesis between the two traditions. This is because 1) deep learning architectures allow for both continuous/distributed and symbolic/discrete-like representations and computations; 2) models trained on language make use of this flexibility. In particular, I review recent research in mechanistic interpretability that showcases how a substantial part of morphosyntactic knowledge is encoded in a near-discrete fashion in LLMs. This line of research suggests that different behaviors arise in an emergent fashion, and models flexibly alternate between the two modes (and everything in between) as needed. This is possibly one of the main reasons for their wild success; and it is also what makes them particularly interesting for the study of language and cognition. Is it time for peace?

Title: Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics

Authors: Shuqi Yang, Mingrui Jing, Shuai Wang, Jiaxin Kou, Manfei Shi, Weijie Xing, Yan Hu, Zheng Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11861
Pdf URL: https://arxiv.org/pdf/2502.11861
Copy Paste: [[2502.11861]] Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics(https://arxiv.org/abs/2502.11861)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: This study reviewed the use of Large Language Models (LLMs) in healthcare, focusing on their training corpora, customization techniques, and evaluation metrics. A systematic search of studies from 2021 to 2024 identified 61 articles. Four types of corpora were used: clinical resources, literature, open-source datasets, and web-crawled data. Common construction techniques included pre-training, prompt engineering, and retrieval-augmented generation, with 44 studies combining multiple methods. Evaluation metrics were categorized into process, usability, and outcome metrics, with outcome metrics divided into model-based and expert-assessed outcomes. The study identified critical gaps in corpus fairness, which contributed to biases from geographic, cultural, and socio-economic factors. The reliance on unverified or unstructured data highlighted the need for better integration of evidence-based clinical guidelines. Future research should focus on developing a tiered corpus architecture with vetted sources and dynamic weighting, while ensuring model transparency. Additionally, the lack of standardized evaluation frameworks for domain-specific models called for comprehensive validation of LLMs in real-world healthcare settings.

Title: Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu

Authors: Renhao Pei, Yihong Liu, Peiqin Lin, François Yvon, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11862
Pdf URL: https://arxiv.org/pdf/2502.11862
Copy Paste: [[2502.11862]] Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu(https://arxiv.org/abs/2502.11862)
Keywords: language model, llm, prompt
Abstract: In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can directly perform translation without any specific training, via their in-context learning capability (ICL). However, the relative importance of each type of resource (e.g., dictionary, grammar book, and retrieved parallel examples) is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affect translation performance, with the Manchu language as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and single out the effect of ICL, we also experiment with an encrypted version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap the conventional MT model. When monolingual data abound, generating synthetic parallel data through in-context MT offers a pathway to mitigate data scarcity and build effective and efficient low-resource neural MT systems.
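Illustration (not from the paper): the abstract above does not say how the Manchu texts were encrypted. One simple way to hide surface forms while keeping word boundaries and word identity intact is a consistent letter substitution. The sketch below assumes romanized text; the function name `make_cipher`, the seed, and the sample tokens are hypothetical.

```python
import random
import string


def make_cipher(seed):
    """Build matching encrypt/decrypt tables for a letter substitution.

    Assumption: a seeded shuffle of a-z stands in for whatever scheme
    the study actually used.
    """
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    encrypt = str.maketrans("".join(letters), "".join(shuffled))
    decrypt = str.maketrans("".join(shuffled), "".join(letters))
    return encrypt, decrypt


encrypt_table, decrypt_table = make_cipher(seed=0)

# Illustrative romanized tokens; the first and last are the same word.
source = "amba bithe amba"
encrypted = source.translate(encrypt_table)
restored = encrypted.translate(decrypt_table)
```

Because the mapping is consistent, the same table would also have to be applied to dictionary headwords and parallel examples placed in the prompt, so that lookups still match the encrypted input.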

Title: Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page

Authors: Michael McRae
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11866
Pdf URL: https://arxiv.org/pdf/2502.11866
Copy Paste: [[2502.11866]] Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page(https://arxiv.org/abs/2502.11866)
Keywords: llm
Abstract: I introduce a new large-scale dataset of historical wire articles from U.S. Southern newspapers, spanning 1960-1975 and covering multiple wire services: The Associated Press, United Press International, Newspaper Enterprise Association. Unlike prior work focusing on front-page content, this dataset captures articles across the entire newspaper, offering broader insight into mid-century Southern coverage. The dataset includes a version that has undergone an LLM-based text cleanup pipeline to reduce OCR noise, enhancing its suitability for quantitative text analysis. Additionally, duplicate versions of articles are retained to enable analysis of editorial differences in language and framing across newspapers. Each article is tagged by wire service, facilitating comparative studies of editorial patterns across agencies. This resource opens new avenues for research in computational social science, digital humanities, and historical linguistics, providing a detailed perspective on how Southern newspapers relayed national and international news during a transformative period in American history. The dataset will be made available upon publication or request for research purposes.

Title: VAQUUM: Are Vague Quantifiers Grounded in Visual Data?

Authors: Hugh Mee Wong, Rick Nouwen, Albert Gatt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11874
Pdf URL: https://arxiv.org/pdf/2502.11874
Copy Paste: [[2502.11874]] VAQUUM: Are Vague Quantifiers Grounded in Visual Data?(https://arxiv.org/abs/2502.11874)
Keywords: language model
Abstract: Vague quantifiers such as "a few" and "many" are influenced by many contextual factors, including how many objects are present in a given context. In this work, we evaluate the extent to which vision-and-language models (VLMs) are compatible with humans when producing or judging the appropriateness of vague quantifiers in visual contexts. We release a novel dataset, VAQUUM, containing 20300 human ratings on quantified statements across a total of 1089 images. Using this dataset, we compare human judgments and VLM predictions using three different evaluation methods. Our findings show that VLMs, like humans, are influenced by object counts in vague quantifier use. However, we find significant inconsistencies across models in different evaluation settings, suggesting that judging and producing vague quantifiers rely on two different processes.

Title: Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity

Authors: Dylan Zhang, Justin Wang, Tianran Sun
Subjects: cs.CL, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2502.11901
Pdf URL: https://arxiv.org/pdf/2502.11901
Copy Paste: [[2502.11901]] Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity(https://arxiv.org/abs/2502.11901)
Keywords: language model, gpt
Abstract: Existing LMs struggle with proof-oriented programming due to data scarcity, which manifests in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach the model the intricate reasoning process involved in proof-oriented programming. We present the first work on synthetic data augmentation for project-level proof-oriented programming, covering both generation and repair. Our method addresses data scarcity by synthesizing basic proof-oriented programming problems to build proficiency in the language, incorporating diverse coding data to elicit reasoning capability, and creating new proof and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs for function- and repository-level code. We show that our fine-tuned 14B-parameter model, PoPilot, outperforms GPT-4o in project-level proof-oriented programming by a 64% relative margin, and that repairing GPT-4o's outputs with PoPilot improves GPT-4o's performance by 54% over GPT-4o's self-repair.

Title: MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation

Authors: Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11903
Pdf URL: https://arxiv.org/pdf/2502.11903
Copy Paste: [[2502.11903]] MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation(https://arxiv.org/abs/2502.11903)
Keywords: language model, llm
Abstract: Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.

Title: EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models

Authors: Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11916
Pdf URL: https://arxiv.org/pdf/2502.11916
Copy Paste: [[2502.11916]] EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models(https://arxiv.org/abs/2502.11916)
Keywords: language model, llm
Abstract: Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (1) reliance on handcrafted features that limit generalizability, (2) difficulty in capturing fine-grained traits like coherence and argumentation, and (3) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs' strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research. Our dataset and code will be available upon acceptance.

Title: BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages

Authors: Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva, Murja Sani Gadanya, Robert Geislinger, Bela Gipp, Oumaima Hourrane, Oana Ignat, Falalu Ibrahim Lawan, Rooweither Mabuya, Rahmad Mahendra, Vukosi Marivate, Andrew Piper, Alexander Panchenko, Charles Henrique Porto Ferreira, Vitaly Protasov, Samuel Rutunda, Manish Shrivastava, Aura Cristina Udrea, Lilian Diana Awuor Wanzare, Sophie Wu, Florian Valentin Wunderlich, Hanif Muhammad Zhafran, Tianhui Zhang, Yi Zhou, Saif M. Mohammad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11926
Pdf URL: https://arxiv.org/pdf/2502.11926
Copy Paste: [[2502.11926]] BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages(https://arxiv.org/abs/2502.11926)
Keywords: llm
Abstract: People worldwide use language in subtle and complex ways to express emotions. While emotion recognition -- an umbrella term for several NLP tasks -- significantly impacts different applications in NLP and other fields, most work in the area is focused on high-resource languages. Therefore, this has led to major disparities in research and proposed solutions, especially for low-resource languages that suffer from the lack of high-quality datasets. In this paper, we present BRIGHTER-- a collection of multilabeled emotion-annotated datasets in 28 different languages. BRIGHTER covers predominantly low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances from various domains annotated by fluent speakers. We describe the data collection and annotation processes and the challenges of building these datasets. Then, we report different experimental results for monolingual and crosslingual multi-label emotion identification, as well as intensity-level emotion recognition. We investigate results with and without using LLMs and analyse the large variability in performance across languages and text domains. We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition and discuss their impact and utility.

Title: On Representational Dissociation of Language and Arithmetic in Large Language Models

Authors: Riku Kisako, Tatsuki Kuribayashi, Ryohei Sasano
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11932
Pdf URL: https://arxiv.org/pdf/2502.11932
Copy Paste: [[2502.11932]] On Representational Dissociation of Language and Arithmetic in Large Language Models(https://arxiv.org/abs/2502.11932)
Keywords: language model, llm
Abstract: The association between language and (non-linguistic) thinking ability in humans has long been debated, and recently, neuroscientific evidence of brain activity patterns has been considered. Such a scientific context naturally raises an interdisciplinary question -- what about such a language-thought dissociation in large language models (LLMs)? In this paper, as an initial foray, we explore this question by focusing on simple arithmetic skills (e.g., $1+2=$ ?) as a thinking ability and analyzing the geometry of their encoding in LLMs' representation space. Our experiments with linear classifiers and cluster separability tests demonstrate that simple arithmetic equations and general language input are encoded in completely separated regions in LLMs' internal representation space across all the layers, which is also supported with more controlled stimuli (e.g., spelled-out equations). These tentatively suggest that arithmetic reasoning is mapped into a distinct region from general language input, which is in line with the neuroscientific observations of human brain activations, while we also point out their somewhat cognitively implausible geometric properties.

Title: Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Authors: Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Brian Li, Changyi Wan, Hanpeng Hu, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Kang An, Wei Ji, Wen Li, Xuan Wen, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chengting Feng, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Jianchang Wu, Jiahong Liu, Jianjian Sun, Jiangjie Zhen, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Shaoliang Pang, Shiliang Yang, Shuli Gao, Siqi Liu, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wenqing He, Wen Sun, Xin Han, Xiaomin Deng, Xiaojia Liu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaqiang Shi, Yilei Wang, Yinmin Zhong
Subjects: cs.CL, cs.AI, cs.HC, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2502.11946
Pdf URL: https://arxiv.org/pdf/2502.11946
Copy Paste: [[2502.11946]] Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction(https://arxiv.org/abs/2502.11946)
Keywords: chat
Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and rap; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at this https URL.

Title: Can Your Uncertainty Scores Detect Hallucinated Entity?

Authors: Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11948
Pdf URL: https://arxiv.org/pdf/2502.11948
Copy Paste: [[2502.11948]] Can Your Uncertainty Scores Detect Hallucinated Entity?(https://arxiv.org/abs/2502.11948)
Keywords: llm, hallucination
Abstract: To mitigate the impact of the hallucinatory nature of LLMs, many studies propose detecting hallucinated generations through uncertainty estimation. However, these approaches predominantly operate at the sentence or paragraph level, failing to pinpoint specific spans or entities responsible for hallucinated content. This lack of granularity is especially problematic for long-form outputs that mix accurate and fabricated information. To address this limitation, we explore entity-level hallucination detection. We propose a new dataset, HalluEntity, which annotates hallucination at the entity level. Based on the dataset, we comprehensively evaluate uncertainty-based hallucination detection approaches across 17 modern LLMs. Our experimental results show that uncertainty estimation approaches focusing on individual token probabilities tend to over-predict hallucinations, while context-aware methods show better but still suboptimal performance. Through an in-depth qualitative study, we identify relationships between hallucination tendencies and linguistic properties and highlight important directions for future research.
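Illustration (not from the paper): entity-level detection with token probabilities can be sketched as scoring each entity span by the uncertainty of its tokens and flagging spans above a threshold. The probabilities, spans, threshold, and function names below are hypothetical; a real setup would read token probabilities from the model and entity spans from an annotator or tagger.

```python
import math


def token_nll(prob):
    """Negative log-likelihood of one generated token."""
    return -math.log(prob)


def entity_scores(tokens, probs, entity_spans, aggregate=max):
    """Aggregate token-level uncertainty over each entity span.

    tokens: generated tokens
    probs: probability the model assigned to each token (made-up values here)
    entity_spans: (start, end) index pairs, end exclusive
    aggregate: how to combine token scores within a span (max by default)
    """
    scores = {}
    for start, end in entity_spans:
        name = " ".join(tokens[start:end])
        scores[name] = aggregate(token_nll(p) for p in probs[start:end])
    return scores


def flag_entities(scores, threshold):
    """Return entities whose uncertainty exceeds the threshold."""
    return [name for name, score in scores.items() if score > threshold]


tokens = ["Marie", "Curie", "was", "born", "in", "Vienna"]
probs = [0.90, 0.95, 0.99, 0.97, 0.98, 0.20]
spans = [(0, 2), (5, 6)]  # "Marie Curie" and "Vienna"

scores = entity_scores(tokens, probs, spans)
flagged = flag_entities(scores, threshold=1.0)
```

A purely token-based score like this one ignores context, which is the kind of method the abstract reports as prone to over-predicting hallucinations.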

Title: Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning

Authors: Tianyi Wu, Jingwei Ni, Bryan Hooi, Jiaheng Zhang, Elliott Ash, See-Kiong Ng, Mrinmaya Sachan, Markus Leippold
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11962
Pdf URL: https://arxiv.org/pdf/2502.11962
Copy Paste: [[2502.11962]] Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning(https://arxiv.org/abs/2502.11962)
Keywords: language model, llm, hallucination
Abstract: Instruction Fine-tuning (IFT) can enhance the helpfulness of Large Language Models (LLMs), but it may lower their truthfulness. This trade-off arises because IFT steers LLMs to generate responses with long-tail knowledge that is not well covered during pre-training, leading to more informative but less truthful answers when generalizing to unseen tasks. In this paper, we empirically demonstrate this helpfulness-truthfulness trade-off in IFT and propose $\textbf{UNIT}$, a novel IFT paradigm to address it. UNIT teaches LLMs to recognize their uncertainty and explicitly reflect it at the end of their responses. Experimental results show that UNIT-tuned models maintain their helpfulness while distinguishing between certain and uncertain claims, thereby reducing hallucinations.

Title: Generating Text from Uniform Meaning Representation

Authors: Emma Markle, Reihaneh Iranmanesh, Shira Wein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11973
Pdf URL: https://arxiv.org/pdf/2502.11973
Copy Paste: [[2502.11973]] Generating Text from Uniform Meaning Representation(https://arxiv.org/abs/2502.11973)
Keywords: language model
Abstract: Uniform Meaning Representation (UMR) is a recently developed graph-based semantic representation, which expands on Abstract Meaning Representation (AMR) in a number of ways, in particular through the inclusion of document-level information and multilingual flexibility. In order to effectively adopt and leverage UMR for downstream tasks, efforts must be placed toward developing a UMR technological ecosystem. Though only limited amounts of UMR annotations have been produced to date, in this work we investigate the first approaches to producing text from multilingual UMR graphs: (1) a pipeline conversion of UMR to AMR, then using AMR-to-text generation models, (2) fine-tuning large language models with UMR data, and (3) fine-tuning existing AMR-to-text generation models with UMR data. Our best performing model achieves a multilingual BERTScore of 0.825 for English and 0.882 for Chinese when compared to the reference, which is a promising indication of the effectiveness of fine-tuning approaches for UMR-to-text generation with even limited amounts of UMR data.
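Illustration (not from the paper): BERTScore compares a candidate with a reference by greedily matching token embeddings on cosine similarity. The sketch below shows that matching on hand-made 2-D vectors; in practice the embeddings come from a pretrained encoder, and optional importance weighting and score rescaling are omitted here. The function names and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(candidate, reference):
    """Precision, recall, and F1 from greedy embedding matching.

    precision: each candidate token matched to its closest reference token
    recall: each reference token matched to its closest candidate token
    """
    precision = sum(
        max(cosine(c, r) for r in reference) for c in candidate
    ) / len(candidate)
    recall = sum(
        max(cosine(r, c) for c in candidate) for r in reference
    ) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings: the candidate covers two of the three reference tokens.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_score(candidate, reference)
```

Here precision is perfect because every candidate token has an exact counterpart, while recall drops because the third reference token is only partially covered.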

Title: Presumed Cultural Identity: How Names Shape LLM Responses

Authors: Siddhesh Pawar, Arnav Arora, Lucie-Aimée Kaffee, Isabelle Augenstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11995
Pdf URL: https://arxiv.org/pdf/2502.11995
Copy Paste: [[2502.11995]] Presumed Cultural Identity: How Names Shape LLM Responses(https://arxiv.org/abs/2502.11995)
Keywords: llm, chat
Abstract: Names are deeply tied to human identity. They can serve as markers of individuality, cultural heritage, and personal history. However, using names as a core indicator of identity can lead to over-simplification of complex identities. When interacting with LLMs, user names are an important point of information for personalisation. Names can enter chatbot conversations through direct user input (requested by chatbots), as part of task contexts such as CV reviews, or as built-in memory features that store user information for personalisation. We study biases associated with names by measuring cultural presumptions in the responses generated by LLMs when presented with common suggestion-seeking queries, which might involve making assumptions about the user. Our analyses demonstrate strong assumptions about cultural identity associated with names present in LLM generations across multiple cultures. Our work has implications for designing more nuanced personalisation systems that avoid reinforcing stereotypes while maintaining meaningful customisation.

Title: Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition

Authors: Thibault Rousset, Taisei Kakibuchi, Yusuke Sasaki, Yoshihide Nomura
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12001
Pdf URL: https://arxiv.org/pdf/2502.12001
Copy Paste: [[2502.12001]] Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition(https://arxiv.org/abs/2502.12001)
Keywords: language model
Abstract: This paper investigates the integration of technical vocabulary in merged language models. We explore the knowledge transfer mechanisms involved when combining a general-purpose language-specific model with a domain-specific model, focusing on the resulting model's comprehension of technical jargon. Our experiments analyze the impact of this merging process on the target model's proficiency in handling specialized terminology. We present a quantitative evaluation of the performance of the merged model, comparing it with that of the individual constituent models. The findings offer insights into the effectiveness of different model merging methods for enhancing domain-specific knowledge and highlight potential challenges and future directions in leveraging these methods for cross-lingual knowledge transfer in Natural Language Processing.

Title: Atom of Thoughts for Markov LLM Test-Time Scaling

Authors: Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12018
Pdf URL: https://arxiv.org/pdf/2502.12018
Copy Paste: [[2502.12018]] Atom of Thoughts for Markov LLM Test-Time Scaling(https://arxiv.org/abs/2502.12018)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that complex reasoning progress is often achieved by solving a sequence of independent subquestions, each being self-contained and verifiable. These subquestions are essentially atomic questions, relying primarily on their current state rather than accumulated history, similar to the memoryless transitions in a Markov process. Based on this observation, we propose Atom of Thoughts (AoT), where each state transition in the reasoning process consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions, forming a new atomic question state. This iterative decomposition-contraction process continues until reaching directly solvable atomic questions, naturally realizing Markov transitions between question states. Furthermore, these atomic questions can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%. The code will be available at this https URL.

Title: Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving

Authors: Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12022
Pdf URL: https://arxiv.org/pdf/2502.12022
Copy Paste: [[2502.12022]] Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving(https://arxiv.org/abs/2502.12022)
Keywords: language model, llm, chain-of-thought
Abstract: Existing approaches to mathematical reasoning with large language models (LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. While efforts have been made to combine these methods, they primarily rely on post-selection or predefined strategies, leaving an open question: whether LLMs can autonomously adapt their reasoning strategy based on their inherent capabilities. In this work, we propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously, aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware data selection during supervised fine-tuning (SFT) to tailor training data to the model's unique abilities. This approach equips LLMs to autonomously determine and apply the appropriate reasoning strategy at test time. We evaluate TATA through extensive experiments on six mathematical reasoning benchmarks, using both general-purpose and math-specialized LLMs. Empirical results demonstrate that TATA effectively combines the complementary strengths of CoT and TIR, achieving superior or comparable performance with improved inference efficiency compared to TIR alone. Further analysis underscores the critical role of aptitude-aware data selection in enabling LLMs to make effective and adaptive reasoning decisions and align reasoning strategies with model capabilities.

Title: A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability

Authors: Xinyu Hu, Mingqi Gao, Li Lin, Zhenghan Yu, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12052
Pdf URL: https://arxiv.org/pdf/2502.12052
Copy Paste: [[2502.12052]] A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability(https://arxiv.org/abs/2502.12052)
Keywords: llm
Abstract: In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with humans. However, we identify some limitations in traditional NLG meta-evaluation approaches, such as issues in handling human ratings and ambiguous selections of correlation measures, which undermine the effectiveness of meta-evaluation. In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability. In addition, we introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations. Furthermore, we conduct experiments with 16 representative LLMs as the evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.

Title: Designing Role Vectors to Improve LLM Inference Behaviour

Authors: Daniele Potertì, Andrea Seveso, Fabio Mercorio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12055
Pdf URL: https://arxiv.org/pdf/2502.12055
Copy Paste: [[2502.12055]] Designing Role Vectors to Improve LLM Inference Behaviour(https://arxiv.org/abs/2502.12055)
Keywords: language model, llm, prompt
Abstract: The influence of personas on Large Language Models (LLMs) has been widely studied, yet their direct impact on performance remains uncertain. This work explores a novel approach to guiding LLM behaviour through role vectors, an alternative to persona-based prompting. We construct 29 role vectors derived from model activations and evaluate their impact on benchmark performance across multiple domains. Our analysis investigates whether these vectors can effectively steer models toward domain-specific expertise. We measure two key interventions: (i) activation addition, which reinforces role-specific directions, and (ii) directional ablation, which removes them. Results on well-established benchmarks indicate that role vectors do, in fact, influence model behaviour, improving task performance in relevant domains while marginally affecting unrelated tasks. This, in turn, suggests that manipulating internal model representations has a greater impact on outcomes than persona-based prompting.
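The two interventions named above have simple linear-algebra forms. The NumPy sketch below is a minimal sketch under assumptions: the abstract does not say how role vectors are derived, so the difference-of-means construction in `role_vector` is an assumption, and all function names are hypothetical. Activations are random stand-ins for real hidden states.

```python
import numpy as np


def role_vector(role_acts: np.ndarray, base_acts: np.ndarray) -> np.ndarray:
    """Assumed construction: mean activation with the role minus mean without it."""
    return role_acts.mean(axis=0) - base_acts.mean(axis=0)


def activation_addition(h: np.ndarray, r: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Reinforce the role direction: h' = h + alpha * r."""
    return h + alpha * r


def directional_ablation(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the role direction: h' = h - (h . r_hat) r_hat."""
    r_hat = r / np.linalg.norm(r)
    return h - np.outer(h @ r_hat, r_hat)


rng = np.random.default_rng(0)
base = rng.normal(size=(32, 8))                # toy activations without the role
shift = np.array([2.0, 0, 0, 0, 0, 0, 0, 0])   # toy effect of the role
r = role_vector(base + shift, base)
h = rng.normal(size=(4, 8))                    # hidden states for 4 token positions

steered = activation_addition(h, r, alpha=0.5)
ablated = directional_ablation(h, r)
```

After ablation, every row of `ablated` has zero component along `r`, while addition shifts every row by the same multiple of `r`.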

Title: AI-generated Text Detection with a GLTR-based Approach

Authors: Lucía Yan Wu, Isabel Segura-Bedmar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12064
Pdf URL: https://arxiv.org/pdf/2502.12064
Copy Paste: [[2502.12064]] AI-generated Text Detection with a GLTR-based Approach(https://arxiv.org/abs/2502.12064)
Keywords: language model, gpt, llm
Abstract: The rise of LLMs (Large Language Models) has contributed to the improved performance and development of cutting-edge NLP applications. However, these can also pose risks when used maliciously, such as spreading fake news and harmful content, impersonating individuals, or facilitating school plagiarism, among others. This is because LLMs can generate high-quality texts that are challenging to differentiate from those written by humans. GLTR, which stands for Giant Language Model Test Room and was developed jointly by the MIT-IBM Watson AI Lab and HarvardNLP, is a GPT-2-based visual tool that helps detect machine-generated texts by highlighting the words in a text according to the probability that they were machine-generated. One limitation of GLTR is that the results it returns can sometimes be ambiguous and lead to confusion. This study aims to explore various ways to improve GLTR's effectiveness for detecting AI-generated texts within the context of the IberLef-AuTexTification 2023 shared task, in both English and Spanish. Experiment results show that our GLTR-based GPT-2 model outperforms the state-of-the-art models on the English dataset with a macro F1-score of 80.19%, except for the first-ranked model (80.91%). However, for the Spanish dataset, we obtained a macro F1-score of 66.20%, which differs by 4.57% from the top-performing model.
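A GLTR-style signal can be sketched as follows. The original GLTR demo colours each token by the rank it holds in the language model's next-token distribution (top 10, top 100, top 1000, other). The sketch below computes those ranks and bucket fractions from a matrix of probabilities. The probabilities would come from GPT-2 in practice; here they are passed in as arrays, and the function names are hypothetical. How the paper turns such features into a classifier is not stated in the abstract and is not shown.

```python
import numpy as np

BUCKETS = (10, 100, 1000)  # rank thresholds used by GLTR's colour overlay


def token_ranks(probs, token_ids):
    """probs: (T, V) next-token distributions; token_ids: (T,) observed tokens."""
    probs = np.asarray(probs, dtype=float)
    token_ids = np.asarray(token_ids)
    observed = probs[np.arange(len(token_ids)), token_ids]
    # Rank 1 is the most likely token: count tokens that are strictly more probable.
    return (probs > observed[:, None]).sum(axis=1) + 1


def bucket_fractions(ranks):
    """Share of tokens in each bucket: top-10, top-100, top-1000, and the rest."""
    ranks = np.asarray(ranks)
    edges = (0,) + BUCKETS + (np.inf,)
    return [float(((ranks > lo) & (ranks <= hi)).mean())
            for lo, hi in zip(edges, edges[1:])]


toy_probs = [[0.5, 0.2, 0.1, 0.1, 0.1]] * 3   # three positions, vocabulary of 5
toy_ranks = token_ranks(toy_probs, [0, 1, 2])
```

Text whose tokens fall almost entirely in the top-10 bucket is the pattern GLTR flags as likely machine-generated; human text tends to use more low-rank tokens.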

Title: Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions

Authors: Lan Zhang, Marco Valentino, Andre Freitas
Subjects: cs.CL, cs.FL
Abstract URL: https://arxiv.org/abs/2502.12065
Pdf URL: https://arxiv.org/pdf/2502.12065
Copy Paste: [[2502.12065]] Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions(https://arxiv.org/abs/2502.12065)
Keywords: llm
Abstract: Thanks to their linguistic capabilities, LLMs offer an opportunity to bridge the gap between informal mathematics and formal languages through autoformalization. However, it is still unclear how well LLMs generalize to sophisticated and naturally occurring mathematical statements. To address this gap, we investigate the task of autoformalizing real-world mathematical definitions -- a critical component of mathematical discourse. Specifically, we introduce two novel resources for autoformalization, collecting definitions from Wikipedia (Def_Wiki) and arXiv papers (Def_ArXiv). We then systematically evaluate a range of LLMs, analyzing their ability to formalize definitions into Isabelle/HOL. Furthermore, we investigate strategies to enhance LLMs' performance, including refinement through external feedback from Proof Assistants, and formal definition grounding, where we guide LLMs through relevant contextual elements from formal mathematical libraries. Our findings reveal that definitions present a greater challenge compared to existing benchmarks, such as miniF2F. In particular, we found that LLMs still struggle with self-correction and with aligning to relevant mathematical libraries. At the same time, structured refinement methods and definition grounding strategies yield notable improvements of up to 16% on self-correction capabilities and 43% on the reduction of undefined errors, highlighting promising directions for enhancing LLM-based autoformalization in real-world scenarios.

Title: TokenSkip: Controllable Chain-of-Thought Compression in LLMs

Authors: Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12067
Pdf URL: https://arxiv.org/pdf/2502.12067
Copy Paste: [[2502.12067]] TokenSkip: Controllable Chain-of-Thought Compression in LLMs(https://arxiv.org/abs/2502.12067)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). Recent advancements, such as OpenAI's o1 and DeepSeek-R1, suggest that scaling up the length of CoT sequences during inference could further boost LLM reasoning performance. However, due to the autoregressive nature of LLM decoding, longer CoT outputs lead to a linear increase in inference latency, adversely affecting user experience, particularly when the CoT exceeds 10,000 tokens. To address this limitation, we analyze the semantic importance of tokens within CoT outputs and reveal that their contributions to reasoning vary. Building on this insight, we propose TokenSkip, a simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. Extensive experiments across various models and tasks demonstrate the effectiveness of TokenSkip in reducing CoT token usage while preserving strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct, TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less than a 0.4% performance drop.
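The core pruning step can be sketched in a few lines, assuming per-token importance scores are already available (how TokenSkip computes them, and how the model is then trained on compressed chains, is not covered here). `compress_cot` and `keep_ratio` are hypothetical names: the function keeps the highest-scoring share of tokens and preserves their original order.

```python
from typing import List, Sequence


def compress_cot(tokens: Sequence[str], importance: Sequence[float],
                 keep_ratio: float) -> List[str]:
    """Keep the most important tokens, in their original order."""
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank positions by score (ties keep the earlier token), then restore order.
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [tokens[i] for i in keep]


cot = ["So", "3", "+", "4", "=", "7"]
scores = [0.1, 0.9, 0.8, 0.9, 0.7, 1.0]   # illustrative scores, not from the paper
short = compress_cot(cot, scores, keep_ratio=0.5)
```

Making `keep_ratio` an explicit argument is what gives the "controllable" compression the abstract refers to: the same chain can be shortened to different budgets.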

Title: Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation

Authors: Zhongyi Qiu, Hanjia Lyu, Wei Xiong, Jiebo Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12073
Pdf URL: https://arxiv.org/pdf/2502.12073
Copy Paste: [[2502.12073]] Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation(https://arxiv.org/abs/2502.12073)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Social media enables dynamic user engagement with trending topics, and recent research has explored the potential of large language models (LLMs) for response generation. While some studies investigate LLMs as agents for simulating user behavior on social media, their focus remains on practical viability and scalability rather than a deeper understanding of how well LLMs align with human behavior. This paper analyzes LLMs' ability to simulate social media engagement through action-guided response generation, where a model first predicts a user's most likely engagement action (retweet, quote, or rewrite) towards a trending post before generating a personalized response conditioned on the predicted action. We benchmark GPT-4o-mini, O1-mini, and DeepSeek-R1 in social media engagement simulation regarding a major societal event discussed on X. Our findings reveal that zero-shot LLMs underperform BERT in action prediction, while few-shot prompting initially degrades the prediction accuracy of LLMs with limited examples. However, in response generation, few-shot LLMs achieve stronger semantic alignment with ground truth posts.

Title: AdaSplash: Adaptive Sparse Flash Attention

Authors: Nuno Gonçalves, Marcos Treviso, André F. T. Martins
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12082
Pdf URL: https://arxiv.org/pdf/2502.12082
Copy Paste: [[2502.12082]] AdaSplash: Adaptive Sparse Flash Attention(https://arxiv.org/abs/2502.12082)
Keywords: language model, gpt
Abstract: The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations. It approaches -- and in some cases surpasses -- the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.
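For reference, the $\alpha$-entmax transformation itself can be computed with plain bisection, as in the NumPy sketch below. This is the baseline root-finding approach, not the paper's hybrid Halley-bisection algorithm or its Triton kernels; `entmax_bisect` is a hypothetical name and the sketch handles a single 1-D score vector. It searches for the threshold $\tau$ such that $\sum_i [(\alpha-1) z_i - \tau]_+^{1/(\alpha-1)} = 1$.

```python
import numpy as np


def entmax_bisect(z, alpha: float = 1.5, n_iter: int = 50) -> np.ndarray:
    """alpha-entmax of a 1-D score vector via bisection on the threshold tau."""
    if alpha <= 1.0:
        raise ValueError("alpha must be > 1 (alpha -> 1 recovers softmax)")
    s = (alpha - 1.0) * np.asarray(z, dtype=float)
    d = s.shape[-1]

    def p(tau: float) -> np.ndarray:
        # Entries with s_i <= tau are clipped to exactly zero (the sparsity).
        return np.clip(s - tau, 0.0, None) ** (1.0 / (alpha - 1.0))

    lo = s.max() - 1.0               # here the largest entry is 1, so sum >= 1
    hi = s.max() - d ** (1.0 - alpha)  # here every entry is <= 1/d, so sum <= 1
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if p(mid).sum() >= 1.0:
            lo = mid
        else:
            hi = mid
    out = p(lo)
    return out / out.sum()           # remove the residual bisection error


sparse = entmax_bisect([3.0, 0.0, 0.0], alpha=1.5)
dense = entmax_bisect([0.5, 0.0], alpha=2.0)
```

Each bisection step only halves the bracket, which is why a faster-converging update such as the one the abstract describes can cut the iteration count substantially.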

Title: VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

Authors: Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, Yi R. (May) Fung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12084
Pdf URL: https://arxiv.org/pdf/2502.12084
Copy Paste: [[2502.12084]] VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues(https://arxiv.org/abs/2502.12084)
Keywords: language model, gpt, prompt
Abstract: Visually linking matching cues is a crucial ability in daily life, such as identifying the same person in multiple photos based on their cues, even without knowing who they are. Despite the extensive knowledge that vision-language models (VLMs) possess, it remains largely unexplored whether they are capable of performing this fundamental task. To address this, we introduce VLM$^2$-Bench, a benchmark designed to assess whether VLMs can Visually Link Matching cues, with 9 subtasks and over 3,000 test cases. Comprehensive evaluation across eight open-source VLMs and GPT-4o, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap where even GPT-4o lags 34.80% behind humans. Based on these insights, we advocate for (i) enhancing core visual capabilities to improve adaptability and reduce reliance on prior knowledge, (ii) establishing clearer principles for integrating language-based reasoning in vision-centric tasks to prevent unnecessary biases, and (iii) shifting vision-text training paradigms toward fostering models' ability to independently structure and infer relationships among visual cues.

Title: Personality Structured Interview for Large Language Model Simulation in Personality Research

Authors: Pengda Wang, Huiqi Zou, Hanjie Chen, Tianjun Sun, Ziang Xiao, Frederick L. Oswald
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12109
Pdf URL: https://arxiv.org/pdf/2502.12109
Copy Paste: [[2502.12109]] Personality Structured Interview for Large Language Model Simulation in Personality Research(https://arxiv.org/abs/2502.12109)
Keywords: language model, llm
Abstract: Although psychometrics researchers have recently explored the use of large language models (LLMs) as proxies for human participants, LLMs often fail to generate heterogeneous data with human-like diversity, which diminishes their value in advancing social science research. To address these challenges, we explored the potential of the theory-informed Personality Structured Interview (PSI) as a tool for simulating human responses in personality research. In this approach, the simulation is grounded in nuanced real-human interview transcripts that target the personality construct of interest. We have provided a growing set of 357 structured interview transcripts from a representative sample, each containing an individual's response to 32 open-ended questions carefully designed to gather theory-based personality evidence. Additionally, grounded in psychometric research, we have summarized an evaluation framework to systematically validate LLM-generated psychometric data. Results from three experiments demonstrate that well-designed structured interviews could improve human-like heterogeneity in LLM-simulated personality data and predict personality-related behavioral outcomes (i.e., organizational citizenship behaviors and counterproductive work behavior). We further discuss the role of theory-informed structured interviews in LLM-based simulation and outline a general framework for designing structured interviews to simulate human-like data for psychometric research.

Title: A-MEM: Agentic Memory for LLM Agents

Authors: Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2502.12110
Pdf URL: https://arxiv.org/pdf/2502.12110
Copy Paste: [[2502.12110]] A-MEM: Agentic Memory for LLM Agents(https://arxiv.org/abs/2502.12110)
Keywords: language model, llm, agent
Abstract: While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code is available at this https URL.
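The note-link-evolve cycle described above can be sketched with a small data structure. Everything below is illustrative: the paper relies on an LLM agent to build notes and decide on links, whereas this sketch substitutes keyword overlap (Jaccard similarity) with a fixed threshold, and the names `Note`, `MemoryStore`, and `threshold` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Note:
    content: str
    keywords: Set[str]
    tags: Set[str] = field(default_factory=set)
    links: List[int] = field(default_factory=list)  # ids of related notes


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class MemoryStore:
    def __init__(self, threshold: float = 0.3) -> None:
        self.notes: List[Note] = []
        self.threshold = threshold  # assumed stand-in for the agent's link decision

    def add(self, note: Note) -> int:
        new_id = len(self.notes)
        for old_id, old in enumerate(self.notes):
            if jaccard(note.keywords, old.keywords) >= self.threshold:
                # Link in both directions so the network stays navigable.
                note.links.append(old_id)
                old.links.append(new_id)
                # "Evolution": the existing note absorbs the newcomer's tags.
                old.tags |= note.tags
        self.notes.append(note)
        return new_id


store = MemoryStore()
a = store.add(Note("LoRA fine-tuning notes", {"lora", "fine-tuning"}, {"peft"}))
b = store.add(Note("DoRA vs LoRA", {"lora", "dora", "fine-tuning"}, {"comparison"}))
c = store.add(Note("Zettelkasten basics", {"notes", "linking"}))
```

The point of the sketch is the update inside `add`: inserting a memory is not a pure append, because it can also modify notes that were stored earlier.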

Title: On the Query Complexity of Verifier-Assisted Language Generation

Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12123
Pdf URL: https://arxiv.org/pdf/2502.12123
Copy Paste: [[2502.12123]] On the Query Complexity of Verifier-Assisted Language Generation(https://arxiv.org/abs/2502.12123)
Keywords: language model
Abstract: Recently, a plethora of works have proposed inference-time algorithms (e.g., best-of-n) that incorporate verifiers to assist the generation process. Their quality-efficiency trade-offs have been empirically benchmarked on a variety of constrained generation tasks, but the algorithmic design landscape is still poorly understood. In this paper, we develop a mathematical framework for reasoning about constrained generation using a pre-trained language model generator oracle and a process verifier, which can decide whether a prefix can be extended to a string that satisfies the constraints of choice. We show that even in very simple settings, access to a verifier can turn an intractable problem (information-theoretically or computationally) into a tractable one. In fact, we show that even simple algorithms, like tokenwise rejection sampling, can enjoy significant benefits from access to a verifier. Empirically, we show that a natural modification of tokenwise rejection sampling, in which the sampler is allowed to "backtrack" (i.e., erase the final few generated tokens), has robust and substantive benefits over natural baselines (e.g., (blockwise) rejection sampling, nucleus sampling) in terms of computational efficiency, accuracy, and diversity.
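Tokenwise rejection sampling with a process verifier can be sketched on a toy constraint: generate a balanced-parentheses string of a fixed even length. The sketch below assumes a uniform random "generator" in place of a language model and an exact verifier, `extendable`, that says whether a prefix can still be completed. Both functions and the `backtrack` parameter are hypothetical names. Because the toy verifier is exact, backtracking is not needed for correctness here; it is included only to show the mechanic of erasing the last few tokens.

```python
import random
from typing import List, Tuple


def extendable(prefix: List[str], n: int) -> bool:
    """Process verifier: can `prefix` be extended to a balanced string of length n?"""
    depth = 0
    for tok in prefix:
        depth += 1 if tok == "(" else -1
        if depth < 0:
            return False  # closed more than was opened
    remaining = n - len(prefix)
    return remaining >= depth and (remaining - depth) % 2 == 0


def generate(n: int, rng: random.Random, backtrack: int = 0) -> Tuple[str, int]:
    """Sample token by token; reject (and optionally backtrack) on verifier failure."""
    if n % 2:
        raise ValueError("n must be even for a balanced string to exist")
    out: List[str] = []
    queries = 0  # number of generator calls, a simple cost measure
    while len(out) < n:
        tok = rng.choice("()")
        queries += 1
        if extendable(out + [tok], n):
            out.append(tok)
        elif backtrack:
            del out[-backtrack:]  # erase the last few accepted tokens
        # with backtrack == 0 the rejected token is simply resampled
    return "".join(out), queries


plain, plain_queries = generate(6, random.Random(0))
erased, erased_queries = generate(6, random.Random(0), backtrack=1)
```

Rejecting at the token level means a bad choice costs one extra generator call rather than a full regenerated string, which is the intuition behind the query-complexity gains the abstract reports.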

Title: SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs

Authors: Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12134
Pdf URL: https://arxiv.org/pdf/2502.12134
Copy Paste: [[2502.12134]] SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs(https://arxiv.org/abs/2502.12134)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks by generating intermediate reasoning steps. However, most existing approaches focus on hard token decoding, which constrains reasoning within the discrete vocabulary space and may not always be optimal. While recent efforts explore continuous-space reasoning, they often suffer from catastrophic forgetting, limiting their applicability to state-of-the-art LLMs that already perform well in zero-shot settings with a proper instruction. To address this challenge, we propose a novel approach for continuous-space reasoning that does not require modifying the underlying LLM. Specifically, we employ a lightweight assistant model to generate instance-specific soft thought tokens speculatively as the initial chain of thoughts, which are then mapped into the LLM's representation space via a projection module. Experimental results on five reasoning benchmarks demonstrate that our method enhances LLM reasoning performance through supervised, parameter-efficient fine-tuning.
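The data flow of the projection step can be sketched with array shapes alone. In the sketch below, random arrays stand in for the assistant model's hidden states and the LLM's prompt embeddings, the projection is assumed to be a single linear layer, and the soft tokens are assumed to be appended after the prompt; none of these details are given in the abstract, and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_assist, d_llm = 16, 32     # hidden sizes of the assistant and the main LLM
n_soft, n_prompt = 4, 10     # number of soft thought tokens and prompt tokens

# Stand-ins for real model outputs (random here, purely illustrative).
assistant_hidden = rng.normal(size=(n_soft, d_assist))
prompt_embeds = rng.normal(size=(n_prompt, d_llm))

# Projection module: the bridge between the two representation spaces.
W = rng.normal(scale=d_assist ** -0.5, size=(d_assist, d_llm))
b = np.zeros(d_llm)


def project(hidden: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map assistant hidden states into the LLM's embedding space."""
    return hidden @ W + b


soft_tokens = project(assistant_hidden, W, b)
# The LLM would consume this sequence of embeddings instead of token ids.
llm_inputs = np.concatenate([prompt_embeds, soft_tokens], axis=0)
```

Since the soft tokens enter as embeddings rather than vocabulary items, the underlying LLM's weights do not need to change to accept them.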

Title: REVERSUM: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives

Authors: Sayantan Adak, Pauras Mangesh Meher, Paramita Das, Animesh Mukherjee
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.12137
Pdf URL: https://arxiv.org/pdf/2502.12137
Copy Paste: [[2502.12137]] REVERSUM: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives(https://arxiv.org/abs/2502.12137)
Keywords: retrieval-augmented generation
Abstract: Wikipedia is an invaluable resource for factual information about a wide range of entities. However, the quality of articles on less-known entities often lags behind that of the well-known ones. This study proposes a novel approach to enhancing Wikipedia's B and C category biography articles by leveraging personal narratives such as autobiographies and biographies. By utilizing a multi-staged retrieval-augmented generation technique -- REVerSum -- we aim to enrich the informational content of these lesser-known articles. Our study reveals that personal narratives can significantly improve the quality of Wikipedia articles, providing a rich source of reliable information that has been underutilized in previous studies. Based on crowd-based evaluation, REVerSum-generated content outperforms the best performing baseline by 17% in terms of integrability to the original Wikipedia article and 28.5% in terms of informativeness. Code and Data are available at: this https URL

Title: Idiosyncrasies in Large Language Models

Authors: Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12150
Pdf URL: https://arxiv.org/pdf/2502.12150
Copy Paste: [[2502.12150]] Idiosyncrasies in Large Language Models(https://arxiv.org/abs/2502.12150)
Keywords: language model, gpt, llm, chat
Abstract: In this work, we unveil and study idiosyncrasies in Large Language Models (LLMs) -- unique patterns in their outputs that can be used to distinguish the models. To do so, we consider a simple classification task: given a particular text output, the objective is to predict the source LLM that generates the text. We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals that these idiosyncrasies are rooted in word-level distributions. These patterns persist even when the texts are rewritten, translated, or summarized by an external LLM, suggesting that they are also encoded in the semantic content. Additionally, we use LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the broader implications of our findings, particularly for training on synthetic data and inferring model similarity. Code is available at this https URL.