2025-02-21

Title: MaskPrune: Mask-based LLM Pruning for Layer-wise Uniform Structures

Authors: Jiayu Qin, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Wei Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14008
Pdf URL: https://arxiv.org/pdf/2502.14008
Copy Paste: [[2502.14008]] MaskPrune: Mask-based LLM Pruning for Layer-wise Uniform Structures(https://arxiv.org/abs/2502.14008)
Keywords: language model, llm
Abstract: The remarkable performance of large language models (LLMs) in various language tasks has attracted considerable attention. However, the ever-increasing size of these models presents growing challenges for deployment and inference. Structured pruning, an effective model compression technique, is gaining increasing attention due to its ability to enhance inference efficiency. Nevertheless, most previous optimization-based structured pruning methods sacrifice the uniform structure across layers for greater flexibility to maintain performance. The heterogeneous structure hinders the effective utilization of off-the-shelf inference acceleration techniques and impedes efficient configuration for continued training. To address this issue, we propose a novel masking learning paradigm based on minimax optimization to obtain the uniform pruned structure by optimizing the masks under sparsity regularization. Extensive experimental results demonstrate that our method can maintain high performance while ensuring the uniformity of the pruned model structure, thereby outperforming existing SOTA methods.
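Abstract (translated from Chinese): The outstanding performance of large language models (LLMs) across a range of language tasks has drawn considerable attention. However, the steadily growing size of these models poses increasing challenges for deployment and inference. Structured pruning, an effective model compression technique, is receiving growing attention because it can improve inference efficiency. Yet most earlier optimization-based structured pruning methods give up a uniform structure across layers in exchange for greater flexibility to preserve performance. The resulting heterogeneous structure hinders effective use of off-the-shelf inference acceleration techniques and makes efficient configuration for continued training harder. To address this, we propose a new mask learning paradigm based on minimax optimization that obtains a uniform pruned structure by optimizing masks under sparsity regularization. Extensive experiments show that our method maintains high performance while keeping the pruned model structure uniform, and thereby outperforms existing SOTA methods.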

Title: DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation

Authors: Giorgio Franceschelli, Mirco Musolesi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14037
Pdf URL: https://arxiv.org/pdf/2502.14037
Copy Paste: [[2502.14037]] DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation(https://arxiv.org/abs/2502.14037)
Keywords: language model
Abstract: Despite their increasing performance, large language models still tend to reproduce training data, generate several repetitions, and focus on the most common grammatical structures and words. A possible cause is the decoding strategy adopted: the most common ones either consider only the most probable tokens, reducing output diversity, or increase the likelihood of unlikely tokens at the cost of output accuracy and correctness. In this paper, we propose a family of three new decoding methods by leveraging a mathematical analysis of the token probability distribution. In particular, the difference between consecutive, sorted probabilities can be used to avoid incorrect tokens and increase the chance of low-probable but accurate words. Experiments concerning math problem solving, extreme summarization, and the divergent association task show that our approach consistently performs at least as well as current alternatives in terms of quality and diversity.
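Abstract (translated from Chinese): Although their performance keeps improving, large language models still tend to reproduce training data, generate repetitions, and concentrate on the most common grammatical structures and words. One possible cause is the decoding strategy used: the most common strategies either consider only the most probable tokens, which reduces output diversity, or raise the likelihood of improbable tokens at the expense of output accuracy and correctness. In this paper, we propose a family of three new decoding methods derived from a mathematical analysis of the token probability distribution. In particular, the differences between consecutive sorted probabilities can be used to avoid incorrect tokens and to increase the chance of selecting low-probability but accurate words. Experiments on math problem solving, extreme summarization, and the divergent association task show that our approach consistently performs at least as well as current alternatives in quality and diversity.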
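
Illustrative sketch (not the authors' code): the DiffSampling abstract says that differences between consecutive sorted probabilities can guide token selection. One plausible reading, assumed here, is to truncate the distribution at the largest drop between neighbouring sorted probabilities and sample from the renormalized remainder. The function names largest_gap_truncate and sample, and the toy distribution, are hypothetical.

```python
import random


def largest_gap_truncate(probs):
    """Keep the tokens ranked before the largest drop between
    consecutive probabilities (sorted in descending order)."""
    # Token indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    sorted_p = [probs[i] for i in order]
    if len(sorted_p) < 2:
        kept = order
    else:
        # gaps[i] is the drop from rank i to rank i + 1.
        gaps = [sorted_p[i] - sorted_p[i + 1] for i in range(len(sorted_p) - 1)]
        cut = max(range(len(gaps)), key=lambda i: gaps[i])
        kept = order[: cut + 1]
    total = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / total for i in kept}


def sample(dist, rng=random):
    """Draw one token index from a {token_index: probability} mapping."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution over five token ids (assumed values).
probs = [0.05, 0.40, 0.35, 0.15, 0.05]
truncated = largest_gap_truncate(probs)
next_token = sample(truncated, random.Random(0))
```

In this toy case the largest drop lies between the second and third ranked tokens, so only token ids 1 and 2 remain candidates. The three methods in the paper may differ from this single variant.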

Title: Semantic Decomposition and Selective Context Filtering -- Text Processing Techniques for Context-Aware NLP-Based Systems

Authors: Karl John Villardar
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2502.14048
Pdf URL: https://arxiv.org/pdf/2502.14048
Copy Paste: [[2502.14048]] Semantic Decomposition and Selective Context Filtering -- Text Processing Techniques for Context-Aware NLP-Based Systems(https://arxiv.org/abs/2502.14048)
Keywords: llm, prompt
Abstract: In this paper, we present two techniques for use in context-aware systems: Semantic Decomposition, which sequentially decomposes input prompts into a structured and hierarchical information schema that systems can parse and process easily, and Selective Context Filtering, which enables systems to systematically filter out specific irrelevant sections of the contextual information that is fed through a system's NLP-based pipeline. We explore how context-aware systems and applications can use these two techniques to implement dynamic LLM-to-system interfaces, improve an LLM's ability to generate more contextually cohesive user-facing responses, and optimize complex automated workflows and pipelines.
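Abstract (translated from Chinese): In this paper, we present two techniques for context-aware systems: Semantic Decomposition, which sequentially decomposes input prompts into a structured, hierarchical information schema that systems can easily parse and process, and Selective Context Filtering, which lets systems systematically filter out specific irrelevant parts of the contextual information supplied through a system's NLP-based pipeline. We explore how these two techniques can be used to implement dynamic LLM-to-system interfaces, improve an LLM's ability to generate more contextually cohesive user-facing responses, and optimize complex automated workflows and pipelines.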

Title: Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder

Authors: Xianjun Yang, Shaoliang Nie, Lijuan Liu, Suchin Gururangan, Ujjwal Karn, Rui Hou, Madian Khabsa, Yuning Mao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14050
Pdf URL: https://arxiv.org/pdf/2502.14050
Copy Paste: [[2502.14050]] Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder(https://arxiv.org/abs/2502.14050)
Keywords: language model
Abstract: Current pre-trained large language models typically need instruction tuning to align with human preferences. However, instruction tuning data is often quantity-saturated due to the large volume of data collection and fast model iteration, leaving coreset data selection important but underexplored. On the other hand, existing quality-driven data selection methods such as LIMA (NeurIPS 2023 (Zhou et al., 2024)) and AlpaGasus (ICLR 2024 (Chen et al.)) generally ignore the equal importance of data diversity and complexity. In this work, we aim to design a diversity-aware data selection strategy and creatively propose using sparse autoencoders to tackle the challenge of data diversity measure. In addition, sparse autoencoders can also provide more interpretability of model behavior and explain, e.g., the surprising effectiveness of selecting the longest response (ICML 2024 (Zhao et al.)). Using effective data selection, we experimentally prove that models trained on our selected data can outperform other methods in terms of model capabilities, reduce training cost, and potentially gain more control over model behaviors.
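Abstract (translated from Chinese): Current pre-trained large language models usually need instruction tuning to align with human preferences. However, because of large-scale data collection and rapid model iteration, instruction tuning data is often saturated in quantity, which makes coreset data selection important yet underexplored. On the other hand, existing quality-driven data selection methods such as LIMA (NeurIPS 2023 (Zhou et al., 2024)) and AlpaGasus (ICLR 2024 (Chen et al.)) generally overlook the equal importance of data diversity and complexity. In this work, we aim to design a diversity-aware data selection strategy and creatively propose using sparse autoencoders to address the challenge of measuring data diversity. In addition, sparse autoencoders can make model behavior more interpretable and explain, for example, the surprising effectiveness of selecting the longest response (ICML 2024 (Zhao et al.)). Using effective data selection, we show experimentally that models trained on our selected data can outperform other methods in model capability, reduce training cost, and potentially gain more control over model behavior.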

Title: RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

Authors: Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14051
Pdf URL: https://arxiv.org/pdf/2502.14051
Copy Paste: [[2502.14051]] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression(https://arxiv.org/abs/2502.14051)
Keywords: language model, llm
Abstract: Transformer-based Large Language Models rely critically on KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce both memory bandwidth and capacity demand of KV cache during the decode phase. RocketKV contains two consecutive stages. In the first stage, it performs coarse-grain KV cache eviction on the input sequence tokens with SnapKV++, a method improved upon SnapKV by introducing adaptive pooling size and full compatibility with grouped-query attention. In the second stage, it adopts a hybrid attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensional reductions. Combining these two stages, RocketKV achieves significant KV cache fetching bandwidth and storage savings while maintaining comparable accuracy to full KV cache attention. We show that RocketKV provides end-to-end speedup by up to 3$\times$ as well as peak memory reduction by up to 31% in the decode phase on an NVIDIA H100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks.
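Abstract (translated from Chinese): Transformer-based large language models depend critically on the KV cache to handle extended contexts efficiently during the decode phase. However, the size of the KV cache grows in proportion to the input length, placing a burden on both memory bandwidth and capacity as decoding proceeds. To meet this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce the memory bandwidth and capacity demands of the KV cache during the decode phase. RocketKV consists of two consecutive stages. In the first stage, it performs coarse-grained KV cache eviction on the input sequence tokens using SnapKV++, a method that improves on SnapKV by introducing an adaptive pooling size and full compatibility with grouped-query attention. In the second stage, it adopts a hybrid attention method to perform fine-grained top-k sparse attention, approximating the attention scores by exploiting reductions along both the head and sequence dimensions. By combining these two stages, RocketKV achieves significant savings in KV cache fetch bandwidth and storage while maintaining accuracy comparable to full KV cache attention. We show that, compared with the full KV cache baseline, RocketKV delivers up to 3$\times$ end-to-end speedup and up to 31% peak memory reduction in the decode phase on an NVIDIA H100 GPU, with negligible accuracy loss on a variety of long-context tasks.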
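
Illustrative sketch (not from the paper): the RocketKV abstract notes that the KV cache grows in proportion to input length. The standard size estimate for an uncompressed cache makes that concrete. The function name kv_cache_bytes and the model dimensions below are assumed for illustration and do not describe any model evaluated in the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Estimate the size of an uncompressed KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to 16-bit floats.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)


# Hypothetical model dimensions, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (4096, 8192, 16384):
    size_mib = kv_cache_bytes(seq_len, **config) / 2**20
    print(f"{seq_len:>6} tokens -> {size_mib:.0f} MiB")
```

Doubling the sequence length doubles the estimate, which is the linear growth that cache eviction and sparse attention aim to curb.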

Title: Are Rules Meant to be Broken? Understanding Multilingual Moral Reasoning as a Computational Pipeline with UniMoral

Authors: Shivani Kumar, David Jurgens
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14083
Pdf URL: https://arxiv.org/pdf/2502.14083
Copy Paste: [[2502.14083]] Are Rules Meant to be Broken? Understanding Multilingual Moral Reasoning as a Computational Pipeline with UniMoral(https://arxiv.org/abs/2502.14083)
Keywords: language model, llm
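Abstract: Moral reasoning is a complex cognitive process shaped by individual experiences and cultural contexts and presents unique challenges for computational analysis. While natural language processing (NLP) offers promising tools for studying this phenomenon, current research lacks cohesion, employing discordant datasets and tasks that examine isolated aspects of moral reasoning. We bridge this gap with UniMoral, a unified dataset integrating psychologically grounded and social-media-derived moral dilemmas annotated with labels for action choices, ethical principles, contributing factors, and consequences, alongside annotators' moral and cultural profiles. Recognizing the cultural relativity of moral reasoning, UniMoral spans six languages (Arabic, Chinese, English, Hindi, Russian, and Spanish), capturing diverse socio-cultural contexts. We demonstrate UniMoral's utility through benchmark evaluations of three large language models (LLMs) across four tasks: action prediction, moral typology classification, factor attribution analysis, and consequence generation. Key findings reveal that while implicitly embedded moral contexts enhance the moral reasoning capability of LLMs, there remains a critical need for increasingly specialized approaches to further advance moral reasoning in these models.
Abstract (translated from Chinese): Moral reasoning is a complex cognitive process shaped by individual experience and cultural context, and it poses unique challenges for computational analysis. Although natural language processing (NLP) offers promising tools for studying this phenomenon, current research lacks cohesion and relies on discordant datasets and tasks that examine isolated aspects of moral reasoning. We bridge this gap with UniMoral, a unified dataset that integrates psychologically grounded and social-media-derived moral dilemmas, annotated with labels for action choices, ethical principles, contributing factors, and consequences, together with the annotators' moral and cultural profiles. Recognizing the cultural relativity of moral reasoning, UniMoral covers six languages (Arabic, Chinese, English, Hindi, Russian, and Spanish), capturing diverse socio-cultural contexts. We demonstrate UniMoral's utility through benchmark evaluations of three large language models (LLMs) on four tasks: action prediction, moral typology classification, factor attribution analysis, and consequence generation. The main findings show that although implicitly embedded moral context strengthens the moral reasoning ability of LLMs, increasingly specialized approaches are still critically needed to advance moral reasoning in these models.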

Title: Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning

Authors: Cole Gawin, Yidan Sun, Mayank Kejriwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14086
Pdf URL: https://arxiv.org/pdf/2502.14086
Copy Paste: [[2502.14086]] Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning(https://arxiv.org/abs/2502.14086)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving reasoning tasks of moderate complexity, such as question-answering and mathematical problem-solving. However, their capabilities in tasks requiring deeper cognitive skills, such as common-sense understanding and abstract reasoning, remain under-explored. In this paper, we systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph. We propose two prompting approaches: instruct prompting, where models predict plausible semantic relationships based on provided definitions, and few-shot prompting, where models identify relations using examples as guidance. Our experiments with the gpt-4o-mini model show that in instruct prompting, consistent performance is obtained when ranking multiple relations but with substantial decline when the model is restricted to predicting only one relation. In few-shot prompting, the model's accuracy improves significantly when selecting from five relations rather than the full set, although with notable bias toward certain relations. These results suggest significant gaps still, even in commercially used LLMs' abstract common-sense reasoning abilities, compared to human-level understanding. However, the findings also highlight the promise of careful prompt engineering, based on selective retrieval, for obtaining better performance.
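Abstract (translated from Chinese): Large language models (LLMs) have achieved excellent performance in generating human-like text and solving reasoning tasks of moderate complexity, such as question answering and mathematical problem solving. However, their abilities on tasks that require deeper cognitive skills, such as common-sense understanding and abstract reasoning, remain under-explored. In this paper, we systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph. We propose two prompting approaches: instruct prompting, in which models predict plausible semantic relations from provided definitions, and few-shot prompting, in which models identify relations using examples as guidance. Our experiments with the gpt-4o-mini model show that, in instruct prompting, performance is consistent when ranking multiple relations but drops substantially when the model is restricted to predicting a single relation. In few-shot prompting, the model's accuracy improves significantly when it selects from five relations rather than the full set, although with a notable bias toward certain relations. These results indicate that significant gaps relative to human-level understanding remain in the abstract common-sense reasoning abilities of even commercially used LLMs. However, the findings also highlight the promise of careful prompt engineering, based on selective retrieval, for obtaining better performance.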

Title: Retrieving Versus Understanding Extractive Evidence in Few-Shot Learning

Authors: Karl Elbakian, Samuel Carton
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14095
Pdf URL: https://arxiv.org/pdf/2502.14095
Copy Paste: [[2502.14095]] Retrieving Versus Understanding Extractive Evidence in Few-Shot Learning(https://arxiv.org/abs/2502.14095)
Keywords: language model
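Abstract: A key aspect of alignment is the proper use of within-document evidence to construct document-level decisions. We analyze the relationship between the retrieval and interpretation of within-document evidence for large language models in a few-shot setting. Specifically, we measure the extent to which model prediction errors are associated with evidence retrieval errors with respect to gold-standard human-annotated extractive evidence for five datasets, using two popular closed proprietary models. We perform two ablation studies to investigate when both label prediction and evidence retrieval errors can be attributed to qualities of the relevant evidence. We find that there is a strong empirical relationship between model prediction and evidence retrieval error, but that evidence retrieval error is mostly not associated with evidence interpretation error, a hopeful sign for downstream applications built on this mechanism.
Abstract (translated from Chinese): A key aspect of alignment is the proper use of within-document evidence to build document-level decisions. We analyze the relationship between the retrieval and the interpretation of within-document evidence for large language models in a few-shot setting. Specifically, using two popular closed proprietary models, we measure on five datasets the extent to which model prediction errors are associated with evidence retrieval errors relative to gold-standard human-annotated extractive evidence. We carry out two ablation studies to investigate when label prediction and evidence retrieval errors can be attributed to qualities of the relevant evidence. We find a strong empirical relationship between model prediction and evidence retrieval error, but evidence retrieval error is mostly unrelated to evidence interpretation error, which is a hopeful sign for downstream applications built on this mechanism.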

Title: Towards Context-Robust LLMs: A Gated Representation Fine-tuning Approach

Authors: Shenglai Zeng, Pengfei He, Kai Guo, Tianqi Zheng, Hanqing Lu, Yue Xing, Hui Liu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.14100
Pdf URL: https://arxiv.org/pdf/2502.14100
Copy Paste: [[2502.14100]] Towards Context-Robust LLMs: A Gated Representation Fine-tuning Approach(https://arxiv.org/abs/2502.14100)
Keywords: language model, llm, retrieval-augmented generation
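Abstract: Large Language Models (LLMs) enhanced with external contexts, such as through retrieval-augmented generation (RAG), often face challenges in handling imperfect evidence. They tend to over-rely on external knowledge, making them vulnerable to misleading and unhelpful contexts. To address this, we propose the concept of context-robust LLMs, which can effectively balance internal knowledge with external context, similar to human cognitive processes. Specifically, context-robust LLMs should rely on external context only when lacking internal knowledge, identify contradictions between internal and external knowledge, and disregard unhelpful contexts. To achieve this goal, we introduce Grft, a lightweight and plug-and-play gated representation fine-tuning approach. Grft consists of two key components: a gating mechanism to detect and filter problematic inputs, and low-rank representation adapters to adjust hidden representations. By training a lightweight intervention function with only 0.0004% of model size on fewer than 200 examples, Grft can effectively adapt LLMs towards context-robust behaviors.
Abstract (translated from Chinese): Large language models (LLMs) augmented with external context, for example through retrieval-augmented generation (RAG), often struggle to handle imperfect evidence. They tend to rely too heavily on external knowledge, which leaves them vulnerable to misleading and unhelpful contexts. To address this, we propose the concept of context-robust LLMs, which can effectively balance internal knowledge against external context in a way similar to human cognitive processes. Specifically, context-robust LLMs should rely on external context only when they lack internal knowledge, identify contradictions between internal and external knowledge, and ignore unhelpful contexts. To achieve this goal, we introduce Grft, a lightweight, plug-and-play gated representation fine-tuning approach. Grft consists of two key components: a gating mechanism that detects and filters problematic inputs, and low-rank representation adapters that adjust hidden representations. By training a lightweight intervention function amounting to only 0.0004% of the model size on fewer than 200 examples, Grft can effectively adapt LLMs toward context-robust behavior.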

Title: Meaning Beyond Truth Conditions: Evaluating Discourse Level Understanding via Anaphora Accessibility

Authors: Xiaomeng Zhu, Zhenghao Zhou, Simon Charlow, Robert Frank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14119
Pdf URL: https://arxiv.org/pdf/2502.14119
Copy Paste: [[2502.14119]] Meaning Beyond Truth Conditions: Evaluating Discourse Level Understanding via Anaphora Accessibility(https://arxiv.org/abs/2502.14119)
Keywords: llm
Abstract: We present a hierarchy of natural language understanding abilities and argue for the importance of moving beyond assessments of understanding at the lexical and sentence levels to the discourse level. We propose the task of anaphora accessibility as a diagnostic for assessing discourse understanding, and to this end, present an evaluation dataset inspired by theoretical research in dynamic semantics. We evaluate human and LLM performance on our dataset and find that LLMs and humans align on some tasks and diverge on others. Such divergence can be explained by LLMs' reliance on specific lexical items during language comprehension, in contrast to human sensitivity to structural abstractions.
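Abstract (translated from Chinese): We present a hierarchy of natural language understanding abilities and argue for the importance of moving beyond assessments of understanding at the lexical and sentence levels to the discourse level. We propose the task of anaphora accessibility as a diagnostic for assessing discourse understanding and, to this end, present an evaluation dataset inspired by theoretical research in dynamic semantics. We evaluate human and LLM performance on our dataset and find that LLMs and humans agree on some tasks and diverge on others. This divergence can be explained by LLMs' reliance on specific lexical items during language comprehension, in contrast to human sensitivity to structural abstractions.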

Title: Benchmarking LLMs for Political Science: A United Nations Perspective

Authors: Yueqing Liang, Liangwei Yang, Chen Wang, Congying Xia, Rui Meng, Xiongxiao Xu, Haoran Wang, Ali Payani, Kai Shu
Subjects: cs.CL, cs.CY, cs.ET
Abstract URL: https://arxiv.org/abs/2502.14122
Pdf URL: https://arxiv.org/pdf/2502.14122
Copy Paste: [[2502.14122]] Benchmarking LLMs for Political Science: A United Nations Perspective(https://arxiv.org/abs/2502.14122)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved significant advances in natural language processing, yet their potential for high-stake political decision-making remains largely unexplored. This paper addresses the gap by focusing on the application of LLMs to the United Nations (UN) decision-making process, where the stakes are particularly high and political decisions can have far-reaching consequences. We introduce a novel dataset comprising publicly available UN Security Council (UNSC) records from 1994 to 2024, including draft resolutions, voting records, and diplomatic speeches. Using this dataset, we propose the United Nations Benchmark (UNBench), the first comprehensive benchmark designed to evaluate LLMs across four interconnected political science tasks: co-penholder judgment, representative voting simulation, draft adoption prediction, and representative statement generation. These tasks span the three stages of the UN decision-making process--drafting, voting, and discussing--and aim to assess LLMs' ability to understand and simulate political dynamics. Our experimental analysis demonstrates the potential and challenges of applying LLMs in this domain, providing insights into their strengths and limitations in political science. This work contributes to the growing intersection of AI and political science, opening new avenues for research and practical applications in global governance. The UNBench Repository can be accessed at: this https URL.
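Abstract (translated from Chinese): Large language models (LLMs) have made significant progress in natural language processing, yet their potential for high-stakes political decision-making remains largely unexplored. This paper addresses that gap by focusing on the application of LLMs to the United Nations (UN) decision-making process, where the stakes are particularly high and political decisions can have far-reaching consequences. We introduce a new dataset of publicly available UN Security Council (UNSC) records from 1994 to 2024, including draft resolutions, voting records, and diplomatic speeches. Using this dataset, we propose the United Nations Benchmark (UNBench), the first comprehensive benchmark designed to evaluate LLMs on four interconnected political science tasks: co-penholder judgment, representative voting simulation, draft adoption prediction, and representative statement generation. These tasks span the three stages of the UN decision-making process (drafting, voting, and discussing) and aim to assess LLMs' ability to understand and simulate political dynamics. Our experimental analysis shows the potential and the challenges of applying LLMs in this domain and offers insight into their strengths and limitations in political science. This work contributes to the growing intersection of AI and political science and opens new avenues for research and practical applications in global governance. The UNBench repository can be accessed at the following location.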

Title: Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Authors: Nishant Balepur, Rachel Rudinger, Jordan Lee Boyd-Graber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14127
Pdf URL: https://arxiv.org/pdf/2502.14127
Copy Paste: [[2502.14127]] Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above(https://arxiv.org/abs/2502.14127)
Keywords: llm
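Abstract: Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers, better capturing user needs and knowledge while remaining easy to score. We then show that even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. For each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful explanations), showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.
Abstract (translated from Chinese): Multiple choice question answering (MCQA) is popular for LLM evaluation because of its simplicity and resemblance to human testing, but we argue that it should be reformed. We first expose flaws in the MCQA format, which struggles to: 1) test generation and subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate generative formats based on human testing, in which LLMs construct and explain answers, so as to better capture user needs and knowledge while remaining easy to score. We then show that even when MCQA is a useful format, its datasets suffer from leakage, unanswerability, shortcuts, and saturation. For each issue, we offer fixes from education, such as rubrics to guide MCQ writing, scoring methods that curb guessing, and Item Response Theory to build harder MCQs. Finally, we discuss LLM errors in MCQA (robustness, biases, and unfaithful explanations) and show how our earlier solutions better measure or address these issues. Although we need not abandon MCQA, we encourage more effort to refine the task on the basis of educational testing and thereby advance evaluation.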

Title: Can Community Notes Replace Professional Fact-Checkers?

Authors: Nadav Borenstein, Greta Warren, Desmond Elliott, Isabelle Augenstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14132
Pdf URL: https://arxiv.org/pdf/2502.14132
Copy Paste: [[2502.14132]] Can Community Notes Replace Professional Fact-Checkers?(https://arxiv.org/abs/2502.14132)
Keywords: language model
Abstract: Two commonly-employed strategies to combat the rise of misinformation on social media are (i) fact-checking by professional organisations and (ii) community moderation by platform users. Policy changes by Twitter/X and, more recently, Meta, signal a shift away from partnerships with fact-checking organisations and towards an increased reliance on crowdsourced community notes. However, the extent and nature of dependencies between fact-checking and helpful community notes remain unclear. To address these questions, we use language models to annotate a large corpus of Twitter/X community notes with attributes such as topic, cited sources, and whether they refute claims tied to broader misinformation narratives. Our analysis reveals that community notes cite fact-checking sources up to five times more than previously reported. Fact-checking is especially crucial for notes on posts linked to broader narratives, which are twice as likely to reference fact-checking sources compared to other sources. In conclusion, our results show that successful community moderation heavily relies on professional fact-checking.
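Abstract (translated from Chinese): Two commonly used strategies for combating the rise of misinformation on social media are (i) fact-checking by professional organisations and (ii) community moderation by platform users. Policy changes at Twitter/X and, more recently, Meta signal a shift away from partnerships with fact-checking organisations and toward greater reliance on crowdsourced community notes. However, the extent and nature of the dependencies between fact-checking and helpful community notes remain unclear. To address these questions, we use language models to annotate a large corpus of Twitter/X community notes with attributes such as topic, cited sources, and whether they refute claims tied to broader misinformation narratives. Our analysis shows that community notes cite fact-checking sources up to five times more often than previously reported. Fact-checking is especially important for notes on posts linked to broader narratives, which are twice as likely to reference fact-checking sources as other sources. In conclusion, our results show that successful community moderation relies heavily on professional fact-checking.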

Title: Self-Regularization with Latent Space Explanations for Controllable LLM-based Classification

Authors: Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14133
Pdf URL: https://arxiv.org/pdf/2502.14133
Copy Paste: [[2502.14133]] Self-Regularization with Latent Space Explanations for Controllable LLM-based Classification(https://arxiv.org/abs/2502.14133)
Keywords: language model, llm, chat
Abstract: Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on task-specific datasets. In training the classification model, we propose a simple and effective regularizer, by minimizing the similarity between the classifier weights and the identified unintended feature, to remove the impacts of these unintended features toward classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed framework can significantly improve the classifier's generalizability by regularizing those features that are not semantically correlated to each task. This work pioneers controllable text classification on LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. We will release our code and data once accepted.
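Abstract (translated from Chinese): Modern text classification methods rely heavily on contextual embeddings from large language models (LLMs). Compared with human-engineered features, these embeddings provide automatic and effective representations for training classification models. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant ones, in order to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and hard to interpret. In this paper, we propose a new framework for identifying and regularizing unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure that the SAE captures task-specific features, we further fine-tune it on task-specific datasets. When training the classification model, we propose a simple and effective regularizer that minimizes the similarity between the classifier weights and the identified unintended features, removing the influence of those features on classification. We evaluate the proposed framework on three real-world tasks: toxic chat detection, reward modeling, and disease diagnosis. The results show that the framework can significantly improve a classifier's generalizability by regularizing features that are not semantically related to each task. This work pioneers controllable text classification on LLM latent spaces by using interpreted features to address generalizability, fairness, and privacy challenges. We will release our code and data upon acceptance.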
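
Illustrative sketch (not the authors' code): the abstract above describes a regularizer that minimizes the similarity between classifier weights and identified unintended features. Assuming a linear classifier with a single weight vector and squared cosine similarity as the similarity measure (both assumptions, since the abstract specifies neither), the penalty could look like the following. The names cosine and regularized_loss and the coefficient lam are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def regularized_loss(task_loss, weights, unintended_features, lam=1.0):
    """Add a penalty that grows as the classifier weight vector aligns
    with any unintended feature direction.

    Squaring the cosine penalizes alignment in either direction.
    """
    penalty = sum(cosine(weights, f) ** 2 for f in unintended_features)
    return task_loss + lam * penalty


# Toy 2-D example: one unintended feature direction along the x axis.
unintended = [[1.0, 0.0]]
aligned = regularized_loss(0.5, [1.0, 0.0], unintended, lam=0.1)
orthogonal = regularized_loss(0.5, [0.0, 1.0], unintended, lam=0.1)
```

A weight vector orthogonal to the unintended direction incurs no penalty, while a fully aligned one incurs the maximum penalty of lam per feature.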

Title: UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text

Authors: Primoz Kocbek, Leon Kopitar, Zhihong Zhang, Emirhan Aydin, Maxim Topaz, Gregor Stiglic
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14144
Pdf URL: https://arxiv.org/pdf/2502.14144
Copy Paste: [[2502.14144]] UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text(https://arxiv.org/abs/2502.14144)
Keywords: gpt, prompt, agent
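Abstract: This paper describes our submissions to the TREC 2024 PLABA track, which aims to simplify biomedical abstracts for a K8-level audience (13- to 14-year-old students). We tested three approaches using OpenAI's gpt-4o and gpt-4o-mini models: baseline prompt engineering, a two-AI agent approach, and fine-tuning. Adaptations were evaluated using qualitative metrics (5-point Likert scales for simplicity, accuracy, completeness, and brevity) and quantitative readability scores (Flesch-Kincaid grade level, SMOG Index). Results indicated that the two-agent approach and baseline prompt engineering with gpt-4o-mini models show superior qualitative performance, while fine-tuned models excelled in accuracy and completeness but were less simple. The evaluation results demonstrated that prompt engineering with gpt-4o-mini outperforms iterative improvement strategies via the two-agent approach as well as fine-tuning with gpt-4o. We intend to expand our investigation of the results and explore advanced evaluations.
Abstract (translated from Chinese): This paper describes our submissions to the TREC 2024 PLABA track, whose goal is to simplify biomedical abstracts for a K8-level audience (students aged 13 to 14). We tested three approaches using OpenAI's gpt-4o and gpt-4o-mini models: baseline prompt engineering, a two-AI agent approach, and fine-tuning. Adaptations were evaluated with qualitative metrics (5-point Likert scales for simplicity, accuracy, completeness, and brevity) and quantitative readability scores (Flesch-Kincaid grade level, SMOG Index). The results indicate that the two-agent approach and baseline prompt engineering with gpt-4o-mini show superior qualitative performance, whereas fine-tuned models excel in accuracy and completeness but are less simple. The evaluation shows that prompt engineering with gpt-4o-mini outperforms both iterative improvement via the two-agent approach and fine-tuning with gpt-4o. We intend to extend our investigation of the results and explore more advanced evaluations.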

Title: LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

Authors: Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2502.14145
Pdf URL: https://arxiv.org/pdf/2502.14145
Copy Paste: [[2502.14145]] LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems(https://arxiv.org/abs/2502.14145)
Keywords: llm
Abstract: Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.
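Abstract (translated from Chinese): Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination among listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module that acts as a dialogue manager (DM) to manage turn-taking efficiently in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing intentional from unintentional barge-ins while detecting query completion in order to handle user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is activated only for response generation, which reduces computational overhead. This design allows the DM to be optimized independently without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.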

Title: Enhancing Conversational Agents with Theory of Mind: Aligning Beliefs, Desires, and Intentions for Human-Like Interaction

Authors: Mohammadmahdi Jafari, Devin Yuncheng Hua, Hao Xue, Flora Salim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14171
Pdf URL: https://arxiv.org/pdf/2502.14171
Copy Paste: [[2502.14171]] Enhancing Conversational Agents with Theory of Mind: Aligning Beliefs, Desires, and Intentions for Human-Like Interaction(https://arxiv.org/abs/2502.14171)
Keywords: language model, llm, agent
Abstract: Natural language interaction with agentic Artificial Intelligence (AI), driven by Large Language Models (LLMs), is expected to remain a dominant paradigm in the near future. While humans instinctively align their communication with mental states -- an ability known as Theory of Mind (ToM), current LLM powered systems exhibit significant limitations in this regard. This study examines the extent to which open source language models (LLaMA) can capture and preserve ToM related information and how effectively it contributes to consistent ToM reasoning in generated responses. We further investigate whether explicit manipulation of ToM related components, such as beliefs, desires, and intentions, can enhance response alignment. Experiments on two LLaMA 3 variants demonstrate that incorporating ToM informed alignment improves response quality, achieving win rates of 67 and 63 percent for the 3B and 8B models, respectively. These findings highlight the potential of ToM driven strategies to improve alignment in LLM based conversational agents.
摘要：由大型语言模型（LLMS）驱动的代理人人工智能（AI）的自然语言互动预计将在不久的将来仍然是主要的范式。尽管人类本能地将他们与精神状态保持一致 - 一种称为心理理论的能力（TOM），但当前的LLM动力系统在这方面表现出重大局限性。这项研究研究了开源语言模型（Llama）可以在多大程度上捕获和保存与TOM相关的信息以及它在产生的响应中有效地推理的有效贡献。我们进一步研究了对TOM相关组件（例如信念，欲望和意图）的明确操纵是否可以增强响应对准。在两个Llama 3变体上进行的实验表明，结合汤姆知情的一致性可以提高响应质量，分别为3B和8B模型达到67％和63％的获胜率。这些发现突出了汤姆驱动的策略的潜力，以改善基于LLM的对话剂的一致性。

Title: QUAD-LLM-MLTC: Large Language Models Ensemble Learning for Healthcare Text Multi-Label Classification

Authors: Hajar Sakai, Sarah S. Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14189
Pdf URL: https://arxiv.org/pdf/2502.14189
Copy Paste: [[2502.14189]] QUAD-LLM-MLTC: Large Language Models Ensemble Learning for Healthcare Text Multi-Label Classification(https://arxiv.org/abs/2502.14189)
Keywords: language model, gpt, llm, prompt
Abstract: The escalating volume of collected healthcare textual data presents a unique challenge for automated Multi-Label Text Classification (MLTC), which is primarily due to the scarcity of annotated texts for training and their nuanced nature. Traditional machine learning models often fail to fully capture the array of expressed topics. However, Large Language Models (LLMs) have demonstrated remarkable effectiveness across numerous Natural Language Processing (NLP) tasks in various domains, which show impressive computational efficiency and suitability for unsupervised learning through prompt engineering. Consequently, these LLMs promise an effective MLTC of medical narratives. However, when dealing with various labels, different prompts can be relevant depending on the topic. To address these challenges, the proposed approach, QUAD-LLM-MLTC, leverages the strengths of four LLMs: GPT-4o, BERT, PEGASUS, and BART. QUAD-LLM-MLTC operates in a sequential pipeline in which BERT extracts key tokens, PEGASUS augments textual data, GPT-4o classifies, and BART provides topics' assignment probabilities, which results in four classifications, all in a 0-shot setting. The outputs are then combined using ensemble learning and processed through a meta-classifier to produce the final MLTC result. The approach is evaluated using three samples of annotated texts, which contrast it with traditional and single-model methods. The results show significant improvements across the majority of the topics in the classification's F1 score and consistency (F1 and Micro-F1 scores of 78.17% and 80.16% with standard deviations of 0.025 and 0.011, respectively). This research advances MLTC using LLMs and provides an efficient and scalable solution to rapidly categorize healthcare-related text data without further training.
摘要：收集到的医疗保健文本数据不断升级为自动多标签文本分类（MLTC）提出了一个独特的挑战，这主要是由于培训带注释的文本稀少及其细微的性质所致。传统的机器学习模型通常无法完全捕获一系列表达的主题。但是，大型语言模型（LLMS）在各种领域的众多自然语言处理（NLP）任务中表现出了出色的有效性，这些域名表现出令人印象深刻的计算效率和适用性，可通过及时的工程进行无监督的学习。因此，这些LLMS有效的医学叙事有效MLTC。但是，在使用各种标签时，根据主题的不同提示可能是相关的。为了应对这些挑战，拟议的方法是Quad-llm-MLTC，利用了四个LLM的优势：GPT-4O，BERT，PEGASUS和BART。 Quad-LLM-MLTC在一条顺序的管道中运行，其中BERT提取关键令牌，Pegasus增强文本数据，GPT-4O分类，而Bart则提供了主题的分配概率，这会导致四个分类，所有分类均以0次速度设置为例。然后，使用集合学习将输出组合在一起，并通过元分类器处理以产生最终的MLTC结果。使用三个带注释的文本样本评估该方法，这些样本与传统和单模方法对比。结果表明，分类的F1分数和一致性的大多数主题（F1和Micro-F1分别为78.17％和80.16％，标准偏差分别为0.025和0.011）。这项研究使用LLMS推进了MLTC，并提供了一种有效且可扩展的解决方案，可以快速对医疗保健相关的文本数据进行迅速分类，而无需进一步培训。
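Note: the abstract above reports F1 and Micro-F1 scores for a multi-label task but does not say how the plain "F1" is averaged. The following is a minimal sketch, not the authors' evaluation code, of how micro- and macro-averaged F1 are commonly computed for multi-label predictions. The function names and the toy labels are hypothetical, and reading "F1" as a macro average is an assumption.

```python
from typing import Dict, List, Set, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) for per-document label sets."""
    labels = sorted(set().union(*gold, *pred))
    # counts[label] = [true positives, false positives, false negatives]
    counts: Dict[str, List[int]] = {label: [0, 0, 0] for label in labels}
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    # Macro: average the per-label F1 scores (every label weighs the same).
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(labels) if labels else 0.0
    # Micro: pool the counts over all labels, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp, fp, fn)
    return micro, macro


if __name__ == "__main__":
    # Hypothetical toy data: three documents with topic labels "a", "b", "c".
    gold = [{"a", "b"}, {"a"}, {"c"}]
    pred = [{"a"}, {"a", "c"}, {"c"}]
    micro, macro = multilabel_f1(gold, pred)
    print(f"micro-F1={micro:.4f} macro-F1={macro:.4f}")
```

Micro-averaging favors frequent topics because every label decision counts equally, while macro-averaging gives rare topics the same weight as common ones, which is why the two numbers can differ on imbalanced label sets.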

Title: NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM

Authors: Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2502.14192
Pdf URL: https://arxiv.org/pdf/2502.14192
Copy Paste: [[2502.14192]] NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM(https://arxiv.org/abs/2502.14192)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been widely applied in question answering over scientific research papers. To enhance the professionalism and accuracy of responses, many studies employ external knowledge augmentation. However, existing structures of external knowledge in scientific literature often focus solely on either paper entities or domain concepts, neglecting the intrinsic connections between papers through shared domain concepts. This results in less comprehensive and specific answers when addressing questions that combine papers and concepts. To address this, we propose a novel knowledge graph framework that captures deep conceptual relations between academic papers, constructing a relational network via intra-paper semantic elements and inter-paper citation relations. Using a few-shot knowledge graph construction method based on LLM, we develop NLP-AKG, an academic knowledge graph for the NLP domain, by extracting 620,353 entities and 2,271,584 relations from 60,826 papers in ACL Anthology. Based on this, we propose a 'sub-graph community summary' method and validate its effectiveness on three NLP scientific literature question answering datasets.
摘要：大型语言模型（LLM）已被广泛应用于有关科学研究论文的回答。为了提高回应的专业水平和准确性，许多研究采用了外部知识增强。但是，科学文献中现有的外部知识结构通常仅关注纸张实体或领域概念，从而忽略了通过共享领域概念之间的纸张之间的内在联系。在解决结合论文和概念的问题时，这会导致不全面和具体的答案。为了解决这个问题，我们提出了一个新颖的知识图框架，该框架捕获了学术论文之间的深厚概念关系，通过纸质内语义元素和纸间引文关系构建关系网络。使用基于LLM的几种知识图构建方法，我们通过提取60,826篇论文中的620,353个实体和2,271,584个关系来开发NLP-AKG，这是NLP领域的学术知识图。基于此，我们提出了一种“子图社区摘要”方法，并验证了其对三个NLP科学文献问题回答数据集的有效性。

Title: On-the-fly Preference Alignment via Principle-Guided Decoding

Authors: Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14204
Pdf URL: https://arxiv.org/pdf/2502.14204
Copy Paste: [[2502.14204]] On-the-fly Preference Alignment via Principle-Guided Decoding(https://arxiv.org/abs/2502.14204)
Keywords: language model
Abstract: With the rapidly expanding landscape of large language models, aligning model generations with human values and preferences is becoming increasingly important. Popular alignment methods, such as Reinforcement Learning from Human Feedback, have shown significant success in guiding models with greater control. However, these methods require considerable computational resources, which is inefficient, and substantial collection of training data to accommodate the diverse and pluralistic nature of human preferences, which is impractical. These limitations significantly constrain the scope and efficacy of both task-specific and general preference alignment methods. In this work, we introduce On-the-fly Preference Alignment via Principle-Guided Decoding (OPAD) to directly align model outputs with human preferences during inference, eliminating the need for fine-tuning. Our approach involves first curating a surrogate solution to an otherwise infeasible optimization problem and then designing a principle-guided reward function based on this surrogate. The final aligned policy is derived by maximizing this customized reward, which exploits the discrepancy between the constrained policy and its unconstrained counterpart. OPAD directly modifies the model's predictions during inference, ensuring principle adherence without incurring the computational overhead of retraining or fine-tuning. Experiments show that OPAD achieves competitive or superior performance in both general and personalized alignment tasks, demonstrating its efficiency and effectiveness compared to state-of-the-art baselines.
摘要：随着大语言模型的快速扩展的景观，将模型世代与人类价值和偏好保持一致，变得越来越重要。流行的一致性方法，例如从人类反馈中学习的强化学习，在指导模型具有更大控制的指导模型方面取得了重大成功。但是，这些方法需要大量的计算资源，这些计算资源效率低下且大量收集培训数据，以适应不切实际的人类偏好的多样性和多元化性质。这些局限性显着限制了特定于任务和一般偏好比对方法的范围和功效。在这项工作中，我们通过原理引导的解码（OPAD）引入了直接的偏好一致性，以将模型输出与推理过程中的人类偏好直接相结合，从而消除了对微调的需求。我们的方法涉及首先策划替代解决方案，以解决原本不可行的优化问题，然后根据该替代物设计原理引导的奖励功能。最终的对齐政策是通过最大化这种自定义奖励来得出的，该奖励利用了受约束策略与其无约束的对应方之间的差异。 OPAD在推理过程中直接修改模型的预测，确保原理依从性，而不会产生重新调整或微调的计算开销。实验表明，OPAD在一般和个性化的一致性任务中都能达到竞争性或卓越的性能，这表明其与最先进的基线相比其效率和有效性。

Title: Transfer-Prompting: Enhancing Cross-Task Adaptation in Large Language Models via Dual-Stage Prompts Optimization

Authors: Yupeng Chang, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14211
Pdf URL: https://arxiv.org/pdf/2502.14211
Copy Paste: [[2502.14211]] Transfer-Prompting: Enhancing Cross-Task Adaptation in Large Language Models via Dual-Stage Prompts Optimization(https://arxiv.org/abs/2502.14211)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) face significant challenges when balancing multiple high-level objectives, such as generating coherent, relevant, and high-quality responses while maintaining efficient task adaptation across diverse tasks. To address these challenges, we introduce Transfer-Prompting, a novel two-stage framework designed to enhance cross-task adaptation in prompt generation. The framework comprises two key components: (1) source prompt construction, which refines the original prompts on source task datasets to generate source prompts with enhanced generalization ability, and (2) target prompt generation, which enhances cross-task adaptation of target prompts by fine-tuning a set of high-scored source prompts on task-specific datasets. In each optimization cycle, a reference LLM generates candidate prompts based on historical prompt-score pairs and task descriptions in our designed reference prompt. These candidate prompts are refined iteratively, while a scorer LLM evaluates their effectiveness using the multi-dimensional metrics designed in the objective prompts evaluator, a novel contribution in this work that provides a holistic evaluation of prompt quality and task performance. This feedback loop facilitates continuous refinement, optimizing both prompt quality and task-specific outcomes. We validate Transfer-Prompting through extensive experiments across 25 LLMs, including 7 foundational models and 18 specialized models, evaluated on 9 diverse datasets. The results demonstrate that Transfer-Prompting significantly improves task-specific performance, highlighting its potential for enhancing cross-task adaptation in LLMs. The code is available at this https URL.
摘要：在平衡多个高级目标时，大型语言模型（LLM）面临重大挑战，例如产生连贯，相关和高质量的响应，同时保持各种任务的有效任务适应。为了应对这些挑战，我们介绍了转移启动，这是一个新颖的两阶段框架，旨在增强及时生成中的交叉任务适应。该框架包括两个关键组成部分：（1）源提示构建，该构建提示构建，该构建构造在源任务数据集中完善了原始提示，以生成具有增强概括能力的源提示，以及（2）目标提示生成，从在特定于任务的数据集上微调一组高得分的源提示。在每个优化周期中，参考LLM在我们设计的参考提示中基于历史及时分数对和任务描述生成候选提示。这些候选提示进行了迭代的改进，而得分手LLM则使用在客观提示中设计的多维指标评估评估者评估者的多维指标来评估其有效性，这在这项工作中提供了对及时的质量和任务绩效的整体评估。此反馈循环促进了连续的精致，从而优化了及时的质量和特定于任务的结果。我们通过在25个LLM上进行的广泛实验来验证转移启动，其中包括7种基础模型和18个专用模型，这些模型在9个不同的数据集上进行了评估。结果表明，转移启动可显着提高特定于任务的性能，突出显示其增强LLM中的交叉任务适应的潜力。该代码可在此HTTPS URL上找到。

Title: Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering

Authors: Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, Wei Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14245
Pdf URL: https://arxiv.org/pdf/2502.14245
Copy Paste: [[2502.14245]] Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering(https://arxiv.org/abs/2502.14245)
Keywords: language model, gpt, llm
Abstract: In this paper, we identify a critical problem, "lost-in-retrieval", in retrieval-augmented multi-hop question answering (QA): the key entities are missed in LLMs' sub-question decomposition. "Lost-in-retrieval" significantly degrades the retrieval performance, which disrupts the reasoning chain and leads to incorrect answers. To resolve this problem, we propose a progressive retrieval and rewriting method, namely ChainRAG, which sequentially handles each sub-question by completing missing key entities and retrieving relevant sentences from a sentence graph for answer generation. Each step in our retrieval and rewriting process builds upon the previous one, creating a seamless chain that leads to accurate retrieval and answers. Finally, all retrieved sentences and sub-question answers are integrated to generate a comprehensive answer to the original question. We evaluate ChainRAG on three multi-hop QA datasets (MuSiQue, 2Wiki, and HotpotQA) using three large language models: GPT4o-mini, Qwen2.5-72B, and GLM-4-Plus. Empirical results demonstrate that ChainRAG consistently outperforms baselines in both effectiveness and efficiency.
摘要：在本文中，我们指出了检索增强的多跳问答（QA）中的一个关键问题，即“检索迷失”（lost-in-retrieval）：LLM 在分解子问题时遗漏了关键实体。“检索迷失”会显著降低检索性能，从而破坏推理链并导致错误的答案。为了解决这个问题，我们提出了一种渐进式检索与重写方法 ChainRAG，该方法通过补全缺失的关键实体并从句子图中检索相关句子以生成答案，依次处理每个子问题。我们检索和重写过程中的每个步骤都建立在上一个步骤之上，形成一条无缝的链，从而实现准确的检索和回答。最后，将所有检索到的句子和子问题答案整合起来，对原始问题生成全面的答案。我们在三个多跳 QA 数据集（MuSiQue、2Wiki 和 HotpotQA）上使用三种大型语言模型（GPT4o-mini、Qwen2.5-72B 和 GLM-4-Plus）评估 ChainRAG。实验结果表明，ChainRAG 在有效性和效率方面始终优于基线。

Title: Effects of Prompt Length on Domain-specific Tasks for Large Language Models

Authors: Qibang Liu, Wenzhe Wang, Jeffrey Willard
Subjects: cs.CL, cs.AI, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14255
Pdf URL: https://arxiv.org/pdf/2502.14255
Copy Paste: [[2502.14255]] Effects of Prompt Length on Domain-specific Tasks for Large Language Models(https://arxiv.org/abs/2502.14255)
Keywords: language model, prompt
Abstract: In recent years, Large Language Models have garnered significant attention for their strong performance in various natural language tasks, such as machine translation and question answering. These models demonstrate an impressive ability to generalize across diverse tasks. However, their effectiveness in tackling domain-specific tasks, such as financial sentiment analysis and monetary policy understanding, remains a topic of debate, as these tasks often require specialized knowledge and precise reasoning. To address such challenges, researchers design various prompts to unlock the models' abilities. By carefully crafting input prompts, researchers can guide these models to produce more accurate responses. Consequently, prompt engineering has become a key focus of study. Despite the advancements in both models and prompt engineering, the relationship between the two (specifically, how prompt design impacts models' ability to perform domain-specific tasks) remains underexplored. This paper aims to bridge this research gap.
摘要：近年来，大型语言模型在各种自然语言任务（例如机器翻译和问答回答）中的强劲表现引起了人们的重大关注。这些模型表现出令人印象深刻的跨越多个任务的能力。但是，它们在解决特定领域的任务（例如财务情感分析和货币政策理解）方面的有效性仍然是一个辩论的话题，因为这些任务通常需要专业知识和精确的推理。为了应对此类挑战，研究人员设计了各种提示来解锁模型的能力。通过仔细制作输入提示，研究人员可以指导这些模型以产生更准确的响应。因此，迅速的工程已成为研究的重点。尽管两个模型和及时工程的进步都取得了进步，但两种特殊的关系之间的关系如何及时设计如何影响模型执行特定领域的任务 - 侦查的能力。本文旨在弥合这一研究差距。

Title: Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information

Authors: Yein Park, Chanwoong Yoon, Jungwoo Park, Minbyul Jeong, Jaewoo Kang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14258
Pdf URL: https://arxiv.org/pdf/2502.14258
Copy Paste: [[2502.14258]] Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information(https://arxiv.org/abs/2502.14258)
Keywords: language model
Abstract: While the ability of language models to elicit facts has been widely investigated, how they handle temporally changing facts remains underexplored. We discover Temporal Heads, specific attention heads primarily responsible for processing temporal knowledge through circuit analysis. We confirm that these heads are present across multiple models, though their specific locations may vary, and their responses differ depending on the type of knowledge and its corresponding years. Disabling these heads degrades the model's ability to recall time-specific knowledge while maintaining its general capabilities without compromising time-invariant and question-answering performance. Moreover, the heads are activated not only by numeric conditions ("In 2004") but also by textual aliases ("In the year ..."), indicating that they encode a temporal dimension beyond simple numerical representation. Furthermore, we expand the potential of our findings by demonstrating how temporal knowledge can be edited by adjusting the values of these heads.
摘要：尽管语言模型引起事实的能力已得到广泛的研究，但它们如何处理时间变化的事实仍然没有被忽视。我们发现时间头，特定的注意力头主要负责通过电路分析来处理时间知识。我们确认这些头部存在于多个模型中，尽管它们的特定位置可能会有所不同，并且它们的响应因知识的类型及其相应的年份而有所不同。禁用这些头部会降低该模型回忆特定时间知识的能力，同时保持其一般能力，而不会损害时间流行和提问的表演。此外，头部不仅被激活数字条件（“ 2004年”），还激活文本别名（“在一年中”），表明它们在简单的数值表示之外编码了一个时间维度。此外，我们通过证明如何通过调整这些头部值来编辑时间知识来扩大发现的潜力。

Title: MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels

Authors: Xiaoou Liu, Zhen Lin, Longchao Da, Chacha Chen, Shubhendu Trivedi, Hua Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14268
Pdf URL: https://arxiv.org/pdf/2502.14268
Copy Paste: [[2502.14268]] MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels(https://arxiv.org/abs/2502.14268)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) require robust confidence estimation, particularly in critical domains like healthcare and law where unreliable outputs can lead to significant consequences. Despite much recent work in confidence estimation, current evaluation frameworks rely on correctness functions -- various heuristics that are often noisy, expensive, and possibly introduce systematic biases. These methodological weaknesses tend to distort evaluation metrics and thus the comparative ranking of confidence measures. We introduce MCQA-Eval, an evaluation framework for assessing confidence measures in Natural Language Generation (NLG) that eliminates dependence on an explicit correctness function by leveraging gold-standard correctness labels from multiple-choice datasets. MCQA-Eval enables systematic comparison of both internal state-based white-box (e.g. logit-based) and consistency-based black-box confidence measures, providing a unified evaluation methodology across different approaches. Through extensive experiments on multiple LLMs and widely used QA datasets, we report that MCQA-Eval provides efficient and more reliable assessments of confidence estimation methods than existing approaches.
摘要：大型语言模型（LLM）需要稳健的置信度估算，尤其是在医疗保健和法律等关键领域，不可靠的产出会导致重大后果。尽管置信度估算了许多最新的工作，但当前的评估框架依赖于正确性功能 - 各种通常嘈杂，昂贵且可能引入系统偏见的启发式方法。这些方法上的弱点倾向于扭曲评估指标，因此置信度度量的比较排名。我们介绍了MCQA-eval，这是一个评估自然语言产生（NLG）的置信度度量的评估框架，该框架通过利用从多项选择数据集中的金标准正确性标签来消除对明确正确性功能的依赖。 MCQA-eval可以对内部基于状态的白框（例如基于logit）和基于一致性的黑盒置信度度量进行系统比较，从而提供了跨不同方法的统一评估方法。通过对多个LLM的广泛实验和广泛使用的QA数据集，我们报告说，MCQA-Eval比现有方法提供了对置信度估计方法的有效和可靠的评估。
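Note: the abstract above describes scoring confidence measures against gold-standard correctness labels instead of a heuristic correctness function. The following is a minimal sketch, not the MCQA-Eval implementation, of one common way to score a confidence measure once binary correctness labels are available: AUROC, the probability that a correct answer receives higher confidence than an incorrect one. The choice of AUROC as the metric is an assumption (the abstract does not name one), and the function name and toy values are hypothetical.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (correct, incorrect) answer pairs in which the
    correct answer has the higher confidence; ties count as half a win.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical confidence scores for four answers and their gold labels.
    scores = [0.9, 0.8, 0.3, 0.6]
    labels = [True, False, False, True]
    print(f"AUROC={auroc(scores, labels):.2f}")
```

The pairwise form is quadratic in the number of answers, which is acceptable for a sketch; a rank-based formulation gives the same value in O(n log n) for larger evaluation sets.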

Title: PaperHelper: Knowledge-Based LLM QA Paper Reading Assistant

Authors: Congrui Yin, Evan Wei, Zhongxing Zhang, Zaifu Zhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14271
Pdf URL: https://arxiv.org/pdf/2502.14271
Copy Paste: [[2502.14271]] PaperHelper: Knowledge-Based LLM QA Paper Reading Assistant(https://arxiv.org/abs/2502.14271)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: In the paper, we introduce a paper reading assistant, PaperHelper, a potent tool designed to enhance the capabilities of researchers in efficiently browsing and understanding scientific literature. Utilizing the Retrieval-Augmented Generation (RAG) framework, PaperHelper effectively minimizes hallucinations commonly encountered in large language models (LLMs), optimizing the extraction of accurate, high-quality knowledge. The implementation of advanced technologies such as RAFT and RAG Fusion significantly boosts the performance, accuracy, and reliability of the LLM-based literature review process. Additionally, PaperHelper features a user-friendly interface that facilitates the batch downloading of documents and uses the Mermaid format to illustrate structural relationships between documents. Experimental results demonstrate that PaperHelper, based on a fine-tuned GPT-4 API, achieves an F1 Score of 60.04, with a latency of only 5.8 seconds, outperforming the basic RAG model by 7% in F1 Score.
摘要：在本文中，我们介绍了论文阅读助手 PaperHelper，这是一个功能强大的工具，旨在增强研究人员高效浏览和理解科学文献的能力。PaperHelper 利用检索增强生成（RAG）框架，有效减少了大型语言模型（LLM）中常见的幻觉，从而优化了准确、高质量知识的提取。RAFT 和 RAG Fusion 等先进技术的应用显著提高了基于 LLM 的文献综述过程的性能、准确性和可靠性。此外，PaperHelper 具有用户友好的界面，便于批量下载文档，并使用 Mermaid 格式来展示文档之间的结构关系。实验结果表明，基于微调的 GPT-4 API，PaperHelper 的 F1 分数达到 60.04，延迟仅为 5.8 秒，F1 分数比基础 RAG 模型高出 7%。

Title: Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models

Authors: Yanggan Gu, Junzhuo Li, Sirui Huang, Xin Zou, Zhenghua Li, Xuming Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14272
Pdf URL: https://arxiv.org/pdf/2502.14272
Copy Paste: [[2502.14272]] Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models(https://arxiv.org/abs/2502.14272)
Keywords: language model, llm
Abstract: Aligning small language models (SLMs) with human values typically involves distilling preference knowledge from large language models (LLMs). However, existing distillation methods model preference knowledge in teacher LLMs by comparing pairwise responses, overlooking the extent of difference between responses. This limitation hinders student SLMs from capturing the nuanced preferences for multiple responses. In this paper, we propose a Preference-Aligned Distillation (PAD) framework, which models the teacher's preference knowledge as a probability distribution over all potential preferences, thereby providing more nuanced supervisory signals. Our insight in developing PAD is rooted in the demonstration that language models can serve as reward functions, reflecting their intrinsic preferences. Based on this, PAD comprises three key steps: (1) sampling diverse responses using high-temperature; (2) computing rewards for both teacher and student to construct their intrinsic preference; and (3) training the student's intrinsic preference distribution to align with the teacher's. Experiments on four mainstream alignment benchmarks demonstrate that PAD consistently and significantly outperforms existing approaches, achieving over 20% improvement on AlpacaEval 2 and Arena-Hard, indicating superior alignment with human preferences. Notably, on MT-Bench, using the Gemma model family, the student trained by PAD surpasses its teacher, further validating the effectiveness of our PAD.
摘要：将小语言模型（SLM）与人类价值观对齐，通常需要从大语言模型（LLM）中蒸馏偏好知识。但是，现有的蒸馏方法通过比较成对响应来建模教师 LLM 中的偏好知识，忽略了响应之间的差异程度。这一局限阻碍了学生 SLM 捕获对多个响应的细微偏好。在本文中，我们提出了偏好对齐蒸馏（PAD）框架，该框架将教师的偏好知识建模为所有潜在偏好上的概率分布，从而提供更细致的监督信号。我们开发 PAD 的洞见源于这样一个论证：语言模型可以充当奖励函数，反映其内在偏好。基于此，PAD 包括三个关键步骤：（1）使用高温采样得到多样化的响应；（2）为教师和学生计算奖励，以构建各自的内在偏好；（3）训练学生的内在偏好分布，使其与教师的一致。在四个主流对齐基准上的实验表明，PAD 始终显著优于现有方法，在 AlpacaEval 2 和 Arena-Hard 上实现了超过 20% 的提升，表明其与人类偏好的对齐更优。值得注意的是，在 MT-Bench 上，使用 Gemma 模型家族时，经 PAD 训练的学生超过了其教师，进一步验证了 PAD 的有效性。

Title: Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment

Authors: Jiaxi Li, Yiwei Wang, Kai Zhang, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Jin Lu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14275
Pdf URL: https://arxiv.org/pdf/2502.14275
Copy Paste: [[2502.14275]] Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment(https://arxiv.org/abs/2502.14275)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) have been widely adopted in various downstream task domains. However, their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. Given the high-stakes nature of medical applications, where incorrect information can have critical consequences, it is essential to evaluate how well LLMs encode, retain, and recall fundamental medical facts. To bridge this gap, we introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ is constructed from the Unified Medical Language System (UMLS), a large-scale repository of standardized biomedical vocabularies and knowledge graphs. We frame knowledge assessment as a binary judgment task, requiring LLMs to verify the correctness of medical statements extracted from reliable and structured knowledge sources. Our experiments reveal that LLMs struggle with factual medical knowledge retention, exhibiting significant performance variance across different semantic categories, particularly for rare medical conditions. Furthermore, LLMs show poor calibration, often being overconfident in incorrect answers. To mitigate these issues, we explore retrieval-augmented generation, demonstrating its effectiveness in improving factual accuracy and reducing uncertainty in medical decision-making.
摘要：大型语言模型（LLM）已在各种下游任务域中被广泛采用。但是，他们直接召回和应用事实医学知识的能力仍未探索。大多数现有的医学质量检查基准测试评估复杂的推理或多跳推断，因此很难将LLMS固有的医学知识从其推理能力中隔离开来。鉴于医疗应用的高风险性质，不正确的信息可能会产生重大后果，因此必须评估LLMS的编码，保留和回忆基本医学事实的效果。为了弥合这一差距，我们介绍了医学知识判断，这是一个专门旨在衡量LLMS单跳的事实医学知识的数据集。 MKJ是由统一医学语言系统（UMLS）构建的，这是标准化生物医学词汇和知识图的大规模存储库。我们将知识评估作为二进制判断任务，要求LLMS验证从可靠和结构化知识源中提取的医学陈述的正确性。我们的实验表明，LLM与事实医学知识的保留斗争，在不同的语义类别中表现出显着的性能差异，尤其是对于极少数医疗状况。此外，LLMS的校准不佳，通常在不正确的答案中过度自信。为了减轻这些问题，我们探索了检索效果的一代，证明了其在提高事实准确性和降低医疗决策的不确定性方面的有效性。

Title: EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts

Authors: Subhajit Chaudhury, Payel Das, Sarathkrishna Swaminathan, Georgios Kollias, Elliot Nelson, Khushbu Pahwa, Tejaswini Pedapati, Igor Melnyk, Matthew Riemer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14280
Pdf URL: https://arxiv.org/pdf/2502.14280
Copy Paste: [[2502.14280]] EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts(https://arxiv.org/abs/2502.14280)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: Recent advances in Large Language Models (LLMs) have yielded impressive successes on many language tasks. However, efficient processing of long contexts using LLMs remains a significant challenge. We introduce EpMAN -- a method for processing long contexts in an episodic memory module while holistically attending to semantically relevant context chunks. The output of episodic attention is then used to reweigh the decoder's self-attention to the stored KV cache of the context during training and generation. When an LLM decoder is trained using EpMAN, its performance on multiple challenging single-hop long-context recall and question-answering benchmarks is found to be stronger and more robust across the range from 16k to 256k tokens than baseline decoders trained with self-attention, and popular retrieval-augmented generation frameworks.
摘要：大型语言模型（LLM）的最新进展在许多语言任务上取得了令人印象深刻的成功。但是，使用 LLM 高效处理长上下文仍然是一个重大挑战。我们提出 EpMAN，这是一种在情景记忆（episodic memory）模块中处理长上下文、同时整体性地关注语义相关上下文块的方法。随后，情景注意力（episodic attention）的输出被用于在训练和生成期间，对解码器针对已存储的上下文 KV 缓存的自注意力进行重新加权。使用 EpMAN 训练 LLM 解码器后，其在多个具有挑战性的单跳长上下文召回和问答基准上的性能，在 16k 到 256k token 的范围内，比使用自注意力训练的基线解码器以及流行的检索增强生成框架更强、更稳健。

Title: Vulnerability of Text-to-Image Models to Prompt Template Stealing: A Differential Evolution Approach

Authors: Yurong Wu, Fangwen Mu, Qiuhong Zhang, Jinjing Zhao, Xinrun Xu, Lingrui Mei, Yang Wu, Lin Shi, Junjie Wang, Zhiming Ding, Yiwei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14285
Pdf URL: https://arxiv.org/pdf/2502.14285
Copy Paste: [[2502.14285]] Vulnerability of Text-to-Image Models to Prompt Template Stealing: A Differential Evolution Approach(https://arxiv.org/abs/2502.14285)
Keywords: language model, gpt, llm, prompt
Abstract: Prompt trading has emerged as a significant intellectual property concern in recent years, where vendors entice users by showcasing sample images before selling prompt templates that can generate similar images. This work investigates a critical security vulnerability: attackers can steal prompt templates using only a limited number of sample images. To investigate this threat, we introduce Prism, a prompt-stealing benchmark consisting of 50 templates and 450 images, organized into Easy and Hard difficulty levels. To identify the vulnerability of VLMs to prompt stealing, we propose EvoStealer, a novel template stealing method that operates without model fine-tuning by leveraging differential evolution algorithms. The system first initializes population sets using multimodal large language models (MLLMs) based on predefined patterns, then iteratively generates enhanced offspring through MLLMs. During evolution, EvoStealer identifies common features across offspring to derive generalized templates. Our comprehensive evaluation conducted across open-source (INTERNVL2-26B) and closed-source models (GPT-4o and GPT-4o-mini) demonstrates that EvoStealer's stolen templates can reproduce images highly similar to originals and effectively generalize to other subjects, significantly outperforming baseline methods with an average improvement of over 10%. Moreover, our cost analysis reveals that EvoStealer achieves template stealing with negligible computational expenses. Our code and dataset are available at this https URL.
摘要：近年来，迅速交易已成为一个重要的知识产权问题，供应商在出售可以生成类似图像的及时模板之前通过展示样本图像来吸引用户。这项工作调查了一个关键的安全漏洞：攻击者只能使用有限数量的示例图像窃取及时的模板。为了调查这种威胁，我们介绍了Prism，这是一个迅速偷走的基准测试，该基准由50个模板和450张图像组成，分为简单而困难的水平。为了确定VLM的漏洞提示窃取，我们提出了EvoStealer，这是一种新型的模板窃取方法，该方法通过利用差异进化算法而无需模型进行微调。该系统首先使用基于预定义模式的多模式大语言模型（MLLM）初始化人口集，然后迭代通过MLLM生成增强的后代。在进化过程中，EvoStealer识别了跨后代的共同特征，以推导广义模板。我们跨开源（Intervl2-26b）和封闭源模型（GPT-4O和GPT-4O和GPT-4O-MINI）进行的全面评估表明，Evostealer的被盗模板可以重现高度相似的原始图像，并有效地推广到其他受试者，并显着超越了Performing的图像平均改善超过10％的基线方法。此外，我们的成本分析表明，Evostealer通过可忽略的计算费用实现了模板窃取。我们的代码和数据集可在此HTTPS URL上找到。

Title: Drift: Decoding-time Personalized Alignments with Implicit User Preferences

Authors: Minbeom Kim, Kang-il Lee, Seongho Joo, Hwaran Lee, Minbeom Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14289
Pdf URL: https://arxiv.org/pdf/2502.14289
Copy Paste: [[2502.14289]] Drift: Decoding-time Personalized Alignments with Implicit User Preferences(https://arxiv.org/abs/2502.14289)
Keywords: language model, llm
Abstract: Personalized alignments for individual users have been a long-standing goal in large language models (LLMs). We introduce Drift, a novel framework that personalizes LLMs at decoding time with implicit user preferences. Traditional Reinforcement Learning from Human Feedback (RLHF) requires thousands of annotated examples and expensive gradient updates. In contrast, Drift personalizes LLMs in a training-free manner, using only a few dozen examples to steer a frozen model through efficient preference modeling. Our approach models user preferences as a composition of predefined, interpretable attributes and aligns them at decoding time to enable personalized generation. Experiments on both a synthetic persona dataset (Perspective) and a real human-annotated dataset (PRISM) demonstrate that Drift significantly outperforms RLHF baselines while using only 50-100 examples. Our results and analysis show that Drift is both computationally efficient and interpretable.
摘要：在大型语言模型（LLMS）中，个人用户的个性化对齐方式一直是一个长期的目标。我们介绍了Drift，这是一个新颖的框架，在解码时间内通过隐式用户偏好来个性化LLM。从人类反馈（RLHF）中学习的传统加强学习需要数千个带注释的示例和昂贵的梯度更新。相比之下，漂移以无训练的方式个性化LLM，仅使用几十个示例来通过有效的偏好模型来引导冷冻模型。我们的方法将用户偏好模型作为预定义的，可解释的属性的组成，并在解码时间对齐以实现个性化生成。在合成角色数据集（透视图）和真实的人类通知数据集（PRISM）上进行的实验表明，在仅使用50-100个示例时，漂移显着胜过RLHF基线。我们的结果和分析表明，漂移既是计算上有效又可解释的。

Title: SEA-HELM: Southeast Asian Holistic Evaluation of Language Models

Authors: Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xian Bin Yong, Weiqi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, William Chandra Tjhi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14301
Pdf URL: https://arxiv.org/pdf/2502.14301
Copy Paste: [[2502.14301]] SEA-HELM: Southeast Asian Holistic Evaluation of Language Models(https://arxiv.org/abs/2502.14301)
Keywords: language model, llm
Abstract: With the rapid emergence of novel capabilities in Large Language Models (LLMs), the need for rigorous multilingual and multicultural benchmarks that are integrated has become more pronounced. Though existing LLM benchmarks are capable of evaluating specific capabilities of LLMs in English as well as in various mid- to low-resource languages, including those in the Southeast Asian (SEA) region, a comprehensive and authentic evaluation suite for the SEA languages has not been developed thus far. Here, we present SEA-HELM, a holistic linguistic and cultural LLM evaluation suite that emphasizes SEA languages, comprising five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, (5) Safety. SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese. We also introduce the SEA-HELM leaderboard, which allows users to understand models' multilingual and multicultural performance in a systematic and user-friendly manner.
摘要：随着大语言模型（LLM）中新型能力的快速出现，对整合的严格多语言和多元文化基准的需求变得更加明显。尽管现有的LLM基准能够评估LLM在英语以及各种中途至低资源语言中的特定功能，包括东南亚（SEA）地区的语言，但海洋语言的全面和真实的评估套件都没有到目前为止已经开发。在这里，我们介绍了一家整体语言和文化LLM评估套件，强调海语，包括五个核心支柱：（1）NLP经典，（2）LLM特异性，（3）海语言学，（3）海洋文化，（4）海洋文化，（5）安全。 Sea-Helm目前支持菲律宾，印尼，泰米尔语，泰国和越南人。我们还介绍了Sea-Helm排行榜，该排行榜允许用户以系统的和用户友好的方式了解模型的多语言和多元文化表现。

Title: MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models

Authors: Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14302
Pdf URL: https://arxiv.org/pdf/2502.14302
Copy Paste: [[2502.14302]] MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models(https://arxiv.org/abs/2502.14302)
Keywords: language model, gpt, llm, hallucination
Abstract: Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 for detecting "hard" category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth. Through experiments, we also show incorporating domain-specific knowledge and introducing a "not sure" category as one of the answer categories improves the precision and F1 scores by up to 38% relative to baselines.
摘要：大语言模型（LLM）的进步及其在医疗提问中越来越多的使用需要严格评估其可靠性。一个关键的挑战在于幻觉，模型产生了合理但事实不正确的产出。在医疗领域，这给患者的安全和临床决策带来了严重的风险。为了解决这个问题，我们介绍了Medhallu，这是第一个专门为医疗幻觉检测设计的基准。 Medhallu包括10,000个源自PubMedQA的高质量提问对，并通过受控管道系统地生成了幻觉答案。我们的实验表明，包括GPT-4O，Llama-3.1在内的最先进的LLM和医学微调的超级高级超级，与这项二进制幻觉检测任务斗争，最佳模型达到了F1的最佳模型低至0.625用于检测“硬”类别幻觉。使用双向构成聚类，我们表明难以实现的幻觉在语义上更接近地面真相。通过实验，我们还展示了合并特定领域的知识并引入“不确定”类别，因为答案类别之一将相对于基线的基准提高了精度和F1分数高达38％。

Title: Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension

Authors: Amir Hossein Yari, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14315
Pdf URL: https://arxiv.org/pdf/2502.14315
Copy Paste: [[2502.14315]] Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension(https://arxiv.org/abs/2502.14315)
Keywords: language model, llm
Abstract: Despite the impressive performance of multilingual large language models (mLLMs) in various natural language processing tasks, their ability to understand procedural texts, particularly those with culture-specific content, remains largely unexplored. Texts describing cultural procedures, including rituals, traditional craftsmanship, and social etiquette, require an inherent understanding of cultural context, presenting a significant challenge for mLLMs. In this work, we introduce CAPTex, a benchmark designed to evaluate mLLMs' ability to process and reason about culturally diverse procedural texts across multiple languages using various methodologies to assess their performance. Our findings indicate that (1) mLLMs face difficulties with culturally contextualized procedural texts, showing notable performance declines in low-resource languages, (2) model performance fluctuates across cultural domains, with some areas presenting greater difficulties, and (3) language models exhibit better performance on multiple-choice tasks within conversational frameworks compared to direct questioning. These results underscore the current limitations of mLLMs in handling culturally nuanced procedural texts and highlight the need for culturally aware benchmarks like CAPTex to enhance their adaptability and comprehension across diverse linguistic and cultural landscapes.
摘要：尽管多语言大语模型（MLLM）在各种自然语言处理任务中的表现令人印象深刻，但它们的理解程序文本的能力，尤其是那些具有特定文化内容的过程的能力，在很大程度上尚未探索。描述文化程序，包括仪式，传统工艺和社会礼节的文本需要对文化背景的内在理解，对MLLM提出了重大挑战。在这项工作中，我们介绍了CAPTEX，这是一种基准测试，旨在评估MLLM的能力和理由，以使用各种方法来评估其性能，从而跨多种语言进行文化多样化的程序文本。我们的发现表明，（1）MLLM面临着文化上下文化的程序文本的困难，显示出低资源语言的表现显着下降，（2）模型绩效在跨文化领域的波动波动，某些领域会带来更大的困难，并且（3）语言模型展示了与直接提问相比，对话框架内的多项选择任务的性能更好。这些结果强调了MLLM在处理文化细微差别的程序文本中的当前局限性，并强调了对CAPTEX等文化意识的基准的需求，以增强其在各种语言和文化景观中的适应性和理解。

Title: ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

Authors: Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14317
Pdf URL: https://arxiv.org/pdf/2502.14317
Copy Paste: [[2502.14317]] ParallelComp: Parallel Long-Context Compressor for Length Extrapolation(https://arxiv.org/abs/2502.14317)
Keywords: language model, gpt, llm, long context, chat
Abstract: Efficiently handling long contexts is crucial for large language models (LLMs). While rotary position embeddings (RoPEs) enhance length generalization, effective length extrapolation remains challenging and often requires costly fine-tuning. In contrast, recent training-free approaches suffer from the attention sink phenomenon, leading to severe performance degradation. In this paper, we introduce ParallelComp, a novel training-free method for long-context extrapolation that extends LLMs' context length from 4K to 128K while maintaining high throughput and preserving perplexity, and integrates seamlessly with Flash Attention. Our analysis offers new insights into attention biases in parallel attention mechanisms and provides practical solutions to tackle these challenges. To mitigate the attention sink issue, we propose an attention calibration strategy that reduces biases, ensuring more stable long-range attention. Additionally, we introduce a chunk eviction strategy to efficiently manage ultra-long contexts on a single A100 80GB GPU. To further enhance efficiency, we propose a parallel KV cache eviction technique, which improves chunk throughput by 1.76x, thereby achieving a 23.50x acceleration in the prefilling stage with negligible performance loss due to attention calibration. Furthermore, ParallelComp achieves 91.17% of GPT-4's performance on long-context tasks using an 8B model trained on 8K-length context, outperforming powerful closed-source models such as Claude-2 and Kimi-Chat.
摘要：有效地处理长篇小说对于大语言模型（LLM）至关重要。尽管旋转位置嵌入（绳索）可以增强长度的概括，但有效的长度外推仍然具有挑战性，并且通常需要昂贵的微调。相比之下，最近的无训练方法遭受了注意流现象的影响，导致严重的性能降解。在本文中，我们介绍了ParallelComp，这是一种新型的无训练方法，用于长篇文化外推，将LLMS的上下文长度从4K扩展到128K，同时保持高吞吐量和保留困惑，并无缝集成并闪烁。我们的分析为平行注意机制提供了对注意力偏见的新见解，并提供了解决这些挑战的实用解决方案。为了减轻注意力下沉问题，我们提出了一种注意力校准策略，以减少偏见，从而确保更稳定的长期注意力。此外，我们引入了块驱逐策略，以在单个A100 80GB GPU上有效管理超长环境。为了进一步提高效率，我们提出了一种平行的KV缓存驱逐技术，该技术将块吞吐量提高了1.76倍，从而在预填充阶段达到了23.50倍的加速度，由于注意力校准而导致的性能丧失可忽略不计。此外，使用在8K长度上下文上训练的8B模型，平行COLALLALECP在长篇小写任务上达到了GPT-4的91.17％的性能，表现优于Claude-2和Kimi-Chat等强大的闭合源模型。

Title: Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models

Authors: James Fodor
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14318
Pdf URL: https://arxiv.org/pdf/2502.14318
Copy Paste: [[2502.14318]] Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models(https://arxiv.org/abs/2502.14318)
Keywords: language model, llm
Abstract: Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive capabilities have likewise rapidly improved, with the implication that such models are becoming progressively more capable on various real-world tasks. Here I summarise theoretical and empirical considerations to challenge this narrative. I argue that inherent limitations with the benchmarking paradigm, along with specific limitations of existing benchmarks, render benchmark performance highly unsuitable as a metric for generalisable competence over cognitive tasks. I also contend that alternative methods for assessing LLM capabilities, including adversarial stimuli and interpretability techniques, have shown that LLMs do not have robust competence in many language and reasoning tasks, and often fail to learn representations which facilitate generalisable inferences. I conclude that benchmark performance should not be used as a reliable indicator of general LLM cognitive capabilities.
摘要：大型语言模型（LLMS）定期在各种语言，知识和推理基准上表现出新的和令人印象深刻的表现。如此快速的进步使许多评论员认为，LLM一般认知能力同样迅速提高，这意味着这种模型正在逐渐在各种现实世界中的任务中逐渐有能力。在这里，我总结了理论和经验的考虑，以挑战这种叙述。我认为，基准制定范式以及现有基准的特定局限性的固有局限性，使基准的性能非常不适合作为对认知任务的普遍能力的度量。我还认为，评估LLM功能的替代方法，包括对抗性刺激和可解释性技术，表明LLM在许多语言和推理任务中都没有强大的能力，并且常常无法学习促进可推断推断的代表性。我得出的结论是，基准性能不应用作一般LLM认知能力的可靠指标。

Title: A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics

Authors: Ting-Ruen Wei, Haowei Liu, Xuyang Wu, Yi Fang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14333
Pdf URL: https://arxiv.org/pdf/2502.14333
Copy Paste: [[2502.14333]] A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics(https://arxiv.org/abs/2502.14333)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent progress in large language models (LLMs) has found that chain-of-thought prompting strategies improve the reasoning ability of LLMs by encouraging problem solving through multiple steps. Therefore, subsequent research aimed to integrate the multi-step reasoning process into the LLM itself through process rewards as feedback and achieved improvements over prompting strategies. Due to the cost of step-level annotation, some turn to outcome rewards as feedback. Aside from these training-based approaches, training-free techniques leverage frozen LLMs or external tools for feedback at each step to enhance the reasoning process. With the abundance of work in mathematics due to its logical nature, we present a survey of strategies utilizing feedback at the step and outcome levels to enhance multi-step math reasoning for LLMs. As multi-step reasoning emerges as a crucial component in scaling LLMs, we hope to establish its foundation for easier understanding and empower further research.
摘要：大型语言模型（LLM）的最新进展发现，通过多个步骤鼓励解决问题，从而促使促进了LLM的推理能力的策略。因此，随后的研究旨在通过过程奖励作为反馈并实现了促使策略的改进，将多步推理过程整合到LLM本身中。由于步骤级注释的成本，有些转向结果奖励作为反馈。除了这些基于培训的方法外，无培训的技术还利用冷冻的LLM或外部工具在每个步骤中提供反馈来增强推理过程。由于其逻辑性质，我们在数学方面的大量工作，我们提出了一项策略调查，该策略利用反馈和结果级别提高了LLMS的多步数学推理。随着多步推理在缩放LLM中出现至关重要的组成部分，我们希望为更容易理解和赋予进一步的研究而建立其基础。

Title: English Please: Evaluating Machine Translation for Multilingual Bug Reports

Authors: Avinash Patil, Aryan Jadon
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2502.14338
Pdf URL: https://arxiv.org/pdf/2502.14338
Copy Paste: [[2502.14338]] English Please: Evaluating Machine Translation for Multilingual Bug Reports(https://arxiv.org/abs/2502.14338)
Keywords: gpt, chat
Abstract: Accurate translation of bug reports is critical for efficient collaboration in global software development. In this study, we conduct the first comprehensive evaluation of machine translation (MT) performance on bug reports, analyzing the capabilities of DeepL, AWS Translate, and ChatGPT using data from the Visual Studio Code GitHub repository, specifically focusing on reports labeled with the english-please tag. To thoroughly assess the accuracy and effectiveness of each system, we employ multiple machine translation metrics, including BLEU, BERTScore, COMET, METEOR, and ROUGE. Our findings indicate that DeepL consistently outperforms the other systems across most automatic metrics, demonstrating strong lexical and semantic alignment. AWS Translate performs competitively, particularly in METEOR, while ChatGPT lags in key metrics. This study underscores the importance of domain adaptation for translating technical texts and offers guidance for integrating automated translation into bug-triaging workflows. Moreover, our results establish a foundation for future research to refine machine translation solutions for specialized engineering contexts. The code and dataset for this paper are available at GitHub: this https URL.
摘要：错误报告的准确翻译对于在全球软件开发中有效协作至关重要。在这项研究中，我们使用Visual Studio Code GitHub存储库中的数据进行了首次对机器翻译（MT）性能进行首次全面评估，分析DEEPL，AWS翻译和Chatgpt的功能，特别专注于用英语标记的报告 - 请标记。为了彻底评估每个系统的准确性和有效性，我们采用了多个机器翻译指标，包括BLEU，Bertscore，Comet，Meteor和Rouge。我们的发现表明，DEEPL始终在大多数自动指标上胜过其他系统，表现出强烈的词汇和语义对齐。 AWS翻译具有竞争性的性能，尤其是在流星中，而Chatgpt落后于关键指标。这项研究强调了域适应在翻译技术文本中的重要性，并提供了将自动翻译整合到错误验证工作流程中的指导。此外，我们的结果为未来的研究建立了基础，以优化专业工程环境的机器翻译解决方案。本文的代码和数据集可在GitHub：此HTTPS URL上获得。

Title: Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Authors: Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14340
Pdf URL: https://arxiv.org/pdf/2502.14340
Copy Paste: [[2502.14340]] Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective(https://arxiv.org/abs/2502.14340)
Keywords: language model, llm
Abstract: Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but uniformly treat the contribution of rewards across sequences, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results on several benchmarks show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes. Furthermore, additional experiments on mathematical and reasoning benchmarks (MMLU, GSM8K, and MATH) confirm that our method enhances performance without compromising general capabilities. Our codebase will be available at this https URL.
摘要：直接偏好优化（DPO）已引起人们的关注，作为从人类反馈（RLHF）学习大型语言模型（LLMS）与人类偏好的有效替代方法。尽管具有优势，但DPO仍具有长度偏差，产生的响应比参考模型的响应更长。现有的解决方案（例如Simpo和Sampo）解决了这个问题，但统一地对待跨序列奖励的贡献，忽略了时间动态。为此，我们提出了一种增强的偏好优化方法，该方法结合了由伽马参数控制的时间衰减因子。这种动态的加权机制根据每个奖励在序列中的位置来调整每个奖励的影响，从而优先考虑对比对更为重要的较早令牌。通过自适应地关注更相关的反馈，我们的方法减轻了对较少相关数据的过度拟合，并且仍然对不断发展的人类偏好做出反应。几个基准测试的实验结果表明，我们的方法始终在Alpacaeval 2上胜过5.9-8.8点的香草DPO，在不同模型架构和尺寸的竞技场上，在竞技场上，我们的方法在竞技场上胜过3.3-9.7点。此外，关于数学和推理基准（MMLU，GSM8K和MATH）的其他实验证实，我们的方法在不损害一般能力的情况下增强了性能。我们的代码库可在\ url {this HTTPS url}上找到。
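
Sketch: The abstract above describes weighting each token's reward by its position through a gamma-controlled decay. The snippet below is a minimal illustration of that idea, not the authors' implementation. It assumes the per-token reward is the log-ratio between policy and reference log-probabilities, that the weight at position t is gamma**t, and that the weighted sums feed a standard DPO-style logistic loss. The function names, `beta`, and the default values are hypothetical.

```python
import math


def decayed_reward(policy_logps, ref_logps, gamma=0.98):
    """Sum of per-token log-ratios, each weighted by gamma**t.

    t = 0 is the first response token, so earlier tokens get larger weights
    whenever 0 < gamma < 1. gamma = 1.0 recovers the unweighted sum.
    """
    return sum(
        (gamma ** t) * (p - r)
        for t, (p, r) in enumerate(zip(policy_logps, ref_logps))
    )


def decayed_dpo_loss(chosen, rejected, beta=0.1, gamma=0.98):
    """DPO-style loss on decayed rewards (illustrative assumption).

    `chosen` and `rejected` are (policy_logps, ref_logps) pairs holding
    per-token log-probabilities for the preferred and dispreferred response.
    """
    margin = beta * (
        decayed_reward(*chosen, gamma=gamma)
        - decayed_reward(*rejected, gamma=gamma)
    )
    # -log(sigmoid(margin)), written so that exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With gamma = 1.0 the weights are all one and the loss reduces to the usual sequence-level DPO form, which makes it easy to compare the decayed and undecayed variants on the same data.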

Title: SR-LLM: Rethinking the Structured Representation in Large Language Model

Authors: Jiahuan Zhang, Tianheng Wang, Hanqing Wu, Ziyi Huang, Yulong Wu, Dongbai Chen, Linfeng Song, Yue Zhang, Guozheng Rao, Kaicheng Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14352
Pdf URL: https://arxiv.org/pdf/2502.14352
Copy Paste: [[2502.14352]] SR-LLM: Rethinking the Structured Representation in Large Language Model(https://arxiv.org/abs/2502.14352)
Keywords: language model, llm, prompt
Abstract: Structured representations, exemplified by Abstract Meaning Representation (AMR), have long been pivotal in computational linguistics. However, their role remains ambiguous in the Large Language Models (LLMs) era. Initial attempts to integrate structured representation into LLMs via a zero-shot setting yielded inferior performance. We hypothesize that such a decline stems from the structure information being passed into LLMs in a code format unfamiliar to LLMs' training corpora. Consequently, we propose SR-LLM, an innovative framework with two settings to explore a superior way of integrating structured representation with LLMs from training-free and training-dependent perspectives. The former integrates structural information through natural language descriptions in LLM prompts, whereas its counterpart augments the model's inference capability through fine-tuning on linguistically described structured representations. Performance improvements were observed across a wide range of downstream datasets, with particularly notable gains of 3.17% and 12.38% in PAWS. To the best of our knowledge, this work represents the pioneering demonstration that leveraging structural representations can substantially enhance LLMs' inference capability. We hope that our work sheds light and encourages future research to enhance the reasoning and interoperability of LLMs with structured data.
摘要：以抽象意义表示（AMR）为例，结构化表示在计算语言学中一直是关键的。但是，在大型语言模型（LLM）时代，它们的作用仍然含糊不清。通过零拍设置将结构化表示形式集成到LLM的初步尝试产生了较低的性能。我们假设这样的下降源于以LLMS培训语料库不熟悉的代码格式传递到LLM的结构信息。因此，我们提出了SR-LLM，这是一种具有两个设置的创新框架，探索了一种将结构化表示与LLM与无训练和培训依赖性观点的较高方式集成的卓越方式。前者通过在LLM提示中通过自然语言描述整合结构信息，而同行则通过对语言描述的结构化表示，增强了模型的推论能力。在广泛的下游数据集中观察到了性能的提高，爪子的增长尤其显着3.17％和12.38％。据我们所知，这项工作代表了一个开创性的演示，即利用结构表示可以大大增强LLMS的推理能力。我们希望我们的工作能阐明并鼓励未来的研究，以通过结构数据来增强LLM的推理和互操作性。

Title: Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning

Authors: Huimin Xu, Xin Mao, Feng-Lin Li, Xiaobao Wu, Wang Chen, Wei Zhang, Anh Tuan Luu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14356
Pdf URL: https://arxiv.org/pdf/2502.14356
Copy Paste: [[2502.14356]] Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning(https://arxiv.org/abs/2502.14356)
Keywords: language model, gpt
Abstract: Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model, which automatically scores each step, providing rewards while avoiding reliance on external signals. Furthermore, we introduce a novel step-wise DPO loss, which dynamically updates gradients based on these step-wise rewards. This endows language models with stronger reasoning capabilities. Extensive evaluations on both in-domain and out-of-domain mathematical reasoning benchmarks across various base language models demonstrate that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.
摘要：直接偏好优化（DPO）经常在长链数学推理中挣扎。现有的方法（例如Step-DPO）通常通过关注推理链中的第一个错误步骤来改善这一点。但是，他们忽略了所有其他步骤，并严重依赖人类或GPT-4来识别错误的步骤。为了解决这些问题，我们提出了全步-DPO，这是一个针对数学推理量身定制的新型DPO框架。它不仅要优化第一个错误的步骤，而是利用整个推理链的逐步奖励。这是通过训练自我监督的过程奖励模型来实现的，该模型每步自动得分，在避免依赖外部信号的同时提供奖励。此外，我们引入了一种新颖的逐步DPO损失，该损失是根据这些阶梯奖励动态更新梯度的。这赋予了语言模型更强的推理能力。对各种基本语言模型的内域和跨域数学推理基准测试的广泛评估表明，与最先进的基线相比，全步-DPO的性能优越。

Title: Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests

Authors: Filippo Momentè, Alessandro Suglia, Mario Giulianelli, Ambra Ferrari, Alexander Koller, Oliver Lemon, David Schlangen, Raquel Fernández, Raffaella Bernardi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14359
Pdf URL: https://arxiv.org/pdf/2502.14359
Copy Paste: [[2502.14359]] Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests(https://arxiv.org/abs/2502.14359)
Keywords: llm
Abstract: We examine three evaluation paradigms: large question-answering benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two (benchmarks or games) is most effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate the development of new interactive benchmarks and targeted cognitive tasks inspired by assessing human abilities but designed specifically for LLMs.
摘要：我们检查了三个评估范式：大型问题提出词基准（例如MMLU和BBH），互动游戏（例如，信号游戏或禁忌）和认知测试（例如，用于工作记忆或心理理论）。首先，我们调查了以前的两基准或游戏中的哪个是最有效的，可以区分不同质量的LLM。然后，受到人类认知评估的启发，我们编制了一套针对性的测试，这些测试衡量了对有效语言使用至关重要的认知能力，并研究了它们与基准和游戏中模型性能的相关性。我们的分析表明，互动游戏在区分模型中优于标准基准。因果关系和逻辑推理与静态和互动测试均相关，而在核心执行功能和社交/情感技能方面出现了差异，这与游戏更多相关。我们主张开发新的交互式基准和针对性的认知任务，这些认知任务是通过评估人类能力而设计的，但专门为LLM设计。

Title: Entropy-UID: A Method for Optimizing Information Density

Authors: Xinpeng Shou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14366
Pdf URL: https://arxiv.org/pdf/2502.14366
Copy Paste: [[2502.14366]] Entropy-UID: A Method for Optimizing Information Density(https://arxiv.org/abs/2502.14366)
Keywords: language model, gpt
Abstract: Balanced and efficient information flow is essential for optimizing language generation models. In this work, we propose Entropy-UID, a new token selection method that balances entropy and Uniform Information Density (UID) principles for enhanced efficiency of text generation. Our approach adaptively adjusts token selection by jointly minimizing entropy and surprisal, promoting more even information distribution across generated sequences. Theoretical validation demonstrates that Entropy-UID optimally reduces information spikes while maintaining fluency and coherence. The method has been evaluated using information-theoretic metrics on multiple benchmark datasets, including WikiText-2, OpenWebText, and WMT. Experimental results show that Entropy-UID achieves lower surprisal and entropy variance compared to standard GPT-2 and alternative heuristics, leading to more balanced and human-like text generation. Our findings point towards the potential of leveraging information-theoretic constraints to refine token selection strategies in autoregressive language models.
摘要：平衡有效的信息流对于优化语言生成模型至关重要。在这项工作中，我们提出了熵uid，这是一种新的令牌选择方法，可以平衡熵和统一信息密度（UID）原理，以提高文本生成效率。我们的方法通过共同最大程度地减少熵和惊人的方式来适应代币选择，从而促进跨生成序列的更多信息分布。理论验证表明，熵uid可以最佳地减少信息峰值，同时保持流利性和连贯性。该方法已在多个基准数据集上使用信息理论指标撤离，包括Wikitext-2，OpenWebText和WMT。实验结果表明，与标准的GPT-2和替代启发式方法相比，熵uid可实现较低的惊喜和熵差异，从而导致更加平衡和类似人类的文本产生。我们的发现表明，利用信息理论约束来完善自回归语言模型中的令牌选择策略的潜力。
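
Sketch: The abstract above describes choosing tokens by jointly minimizing entropy and surprisal. The snippet below is one possible reading of that idea, not the paper's algorithm. It assumes each candidate token comes with its own probability and with the model's next-step distribution after emitting it, and that the two quantities are combined as a weighted sum. The weight `alpha`, the function names, and the candidate format are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a {token: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def surprisal(p):
    """Surprisal in bits of an event with probability p."""
    return -math.log2(p)


def select_token(candidates, alpha=0.5):
    """Pick the candidate with the lowest combined score.

    `candidates` maps token -> (probability, next_dist), where next_dist is
    the (assumed available) next-step distribution after emitting the token.
    Score = alpha * entropy(next_dist) + (1 - alpha) * surprisal(probability).
    """
    def score(item):
        _, (p, next_dist) = item
        return alpha * entropy(next_dist) + (1 - alpha) * surprisal(p)

    return min(candidates.items(), key=score)[0]
```

Setting alpha to 0 reduces the rule to greedy decoding (lowest surprisal, i.e. highest probability), while larger values favor tokens that leave the model less uncertain about what comes next.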

Title: A Similarity Paradigm Through Textual Regularization Without Forgetting

Authors: Fangming Cui, Jan Fong, Rongfei Zeng, Xinmei Tian, Jun Yu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.14376
Pdf URL: https://arxiv.org/pdf/2502.14376
Copy Paste: [[2502.14376]] A Similarity Paradigm Through Textual Regularization Without Forgetting(https://arxiv.org/abs/2502.14376)
Keywords: language model, prompt
Abstract: Prompt learning has emerged as a promising method for adapting pre-trained visual-language models (VLMs) to a range of downstream tasks. While optimizing the context can be effective for improving performance on specific tasks, it can often lead to poor generalization performance on unseen classes or datasets sampled from different distributions. It may be attributed to the fact that textual prompts tend to overfit downstream data distributions, leading to the forgetting of generalized knowledge derived from hand-crafted prompts. In this paper, we propose a novel method called Similarity Paradigm with Textual Regularization (SPTR) for prompt learning without forgetting. SPTR is a two-pronged design based on hand-crafted prompts that is an inseparable framework. 1) To avoid forgetting general textual knowledge, we introduce the optimal transport as a textual regularization to finely ensure approximation with hand-crafted features and tuning textual features. 2) In order to continuously unleash the general ability of multiple hand-crafted prompts, we propose a similarity paradigm for natural alignment score and adversarial alignment score to improve model robustness for generalization. Both modules share a common objective in addressing generalization issues, aiming to maximize the generalization capability derived from multiple hand-crafted prompts. Four representative tasks (i.e., non-generalization few-shot learning, base-to-novel generalization, cross-dataset generalization, domain generalization) across 11 datasets demonstrate that SPTR outperforms existing prompt learning methods.
摘要：迅速学习已成为将预训练的视觉模型（VLM）调整为一系列下游任务的有前途的方法。虽然优化上下文可以有效地提高特定任务的性能，但通常会导致对从不同分布采样的看不见的类或数据集的概括性能差。这可能归因于以下事实：文本提示倾向于过度拟合下游数据分布，从而忘记了从手工制作的提示中得出的广义知识。在本文中，我们提出了一种具有文本正则化（SPTR）的新方法，称为相似性范式（SPTR），以迅速学习而不会忘记。 SPTR是基于手工制作的提示，是一个不可分割的框架。 1）为避免忘记一般的文本知识，我们将最佳传输作为文本正则化介绍，以精细地确保使用手工制作的功能和调谐文本功能近似。 2）为了连续释放多个手工制作的提示的一般能力，我们提出了自然对齐得分和对抗性对准评分的相似性范式，以提高模型鲁棒性以泛化。这两个模块在解决概括问题方面都有一个共同的目标，旨在最大化从多个手工制作的提示中得出的概括能力。跨11个数据集的四个代表性任务（即，非概要的少数学习，基础对概括，跨数据集概括，域概括）表明，SPTR的表现优于现有的及时学习方法。

Title: Rumor Detection by Multi-task Suffix Learning based on Time-series Dual Sentiments

Authors: Zhiwei Liu, Kailai Yang, Eduard Hovy, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14383
Pdf URL: https://arxiv.org/pdf/2502.14383
Copy Paste: [[2502.14383]] Rumor Detection by Multi-task Suffix Learning based on Time-series Dual Sentiments(https://arxiv.org/abs/2502.14383)
Keywords: llm, prompt
Abstract: The widespread dissemination of rumors on social media has a significant impact on people's lives, potentially leading to public panic and fear. Rumors often evoke specific sentiments, resonating with readers and prompting sharing. To effectively detect and track rumors, it is essential to observe the fine-grained sentiments of both source and response message pairs as the rumor evolves over time. However, current rumor detection methods fail to account for this aspect. In this paper, we propose MSuf, the first multi-task suffix learning framework for rumor detection and tracking using time series dual (coupled) sentiments. MSuf includes three modules: (1) an LLM to extract sentiment intensity features and sort them chronologically; (2) a module that fuses the sorted sentiment features with their source text word embeddings to obtain an aligned embedding; (3) two hard prompts are combined with the aligned vector to perform rumor detection and sentiment analysis using one frozen LLM. MSuf effectively enhances the performance of LLMs for rumor detection with only minimal parameter fine-tuning. Evaluating MSuf on four rumor detection benchmarks, we find significant improvements compared to other emotion-based methods.
摘要：社交媒体上谣言的广泛传播对人们的生活产生了重大影响，这可能导致公众恐慌和恐惧。谣言经常引起特定的情感，引起读者的共鸣并促使共享。为了有效地检测和跟踪谣言，随着谣言随着时间的推移而演变，观察源和响应消息对的细粒情感至关重要。但是，当前的谣言检测方法无法解决这一方面。在本文中，我们提出了MSUF，这是使用时间序列双重（耦合）情感的第一个用于谣言检测和跟踪的多任务后缀学习框架。 MSUF包括三个模块：（1）LLM提取情感强度特征并按时间顺序排序；（2）一个将分类情感特征与源文本嵌入的模块融合在一起，以获得对齐的嵌入；（3）将两个硬提示与对齐矢量相结合，以使用一个冷冻的LLM进行谣言检测和情感分析。 MSUF仅通过最小的参数微调有效地提高了LLM的谣言检测性能。评估MSUF在四个谣言检测基准测试基准上，我们发现与其他基于情感的方法相比，我们发现了显着改善。

Title: Tradutor: Building a Variety Specific Translation Model

Authors: Hugo Sousa, Satya Almasian, Ricardo Campos, Alípio Jorge
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14385
Pdf URL: https://arxiv.org/pdf/2502.14385
Copy Paste: [[2502.14385]] Tradutor: Building a Variety Specific Translation Model(https://arxiv.org/abs/2502.14385)
Keywords: language model
Abstract: Language models have become foundational to many widely used systems. However, these seemingly advantageous models are double-edged swords. While they excel in tasks related to resource-rich languages like English, they often lose the fine nuances of language forms, dialects, and varieties that are inherent to languages spoken in multiple regions of the world. Languages like European Portuguese are neglected in favor of their more popular counterpart, Brazilian Portuguese, leading to suboptimal performance in various linguistic tasks. To address this gap, we introduce the first open-source translation model specifically tailored for European Portuguese, along with a novel dataset specifically designed for this task. Results from automatic evaluations on two benchmark datasets demonstrate that our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese. By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties.
摘要：语言模型已成为许多广泛使用系统的基础。但是，这些看似有利的模型是双刃剑。尽管他们在与英语（例如英语）相关的任务中表现出色，但他们通常会失去语言形式，方言和品种的细微差别，这些语言和品种固有的语言在世界上多个地区使用。像欧洲葡萄牙语这样的语言被忽略了他们更受欢迎的同类语言巴西葡萄牙语，从而在各种语言任务中表现出色。为了解决这一差距，我们介绍了第一个专门针对欧洲葡萄牙语量身定制的开源翻译模型，以及专门为此任务设计的新型数据集。在两个基准数据集上进行自动评估的结果表明，我们的最佳模型超过了葡萄牙语的现有开源翻译系统，并访问了欧洲葡萄牙人的行业领先封闭源系统的性能。通过公开提供数据集，模型和代码，我们旨在支持和鼓励进一步的研究，从而促进代表不足语言品种的进步。

Title: Leveraging Small LLMs for Argument Mining in Education: Argument Component Identification, Classification, and Assessment

Authors: Lucile Favero, Juan Antonio Pérez-Ortiz, Tanja Käser, Nuria Oliver
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2502.14389
Pdf URL: https://arxiv.org/pdf/2502.14389
Copy Paste: [[2502.14389]] Leveraging Small LLMs for Argument Mining in Education: Argument Component Identification, Classification, and Assessment(https://arxiv.org/abs/2502.14389)
Keywords: language model, llm, prompt
Abstract: Argument mining algorithms analyze the argumentative structure of essays, making them a valuable tool for enhancing education by providing targeted feedback on the students' argumentation skills. While current methods often use encoder or encoder-decoder deep learning architectures, decoder-only models remain largely unexplored, offering a promising research direction. This paper proposes leveraging open-source, small Large Language Models (LLMs) for argument mining through few-shot prompting and fine-tuning. These models' small size and open-source nature ensure accessibility, privacy, and computational efficiency, enabling schools and educators to adopt and deploy them locally. Specifically, we perform three tasks: segmentation of student essays into arguments, classification of the arguments by type, and assessment of their quality. We empirically evaluate the models on the Feedback Prize - Predicting Effective Arguments dataset of grade 6-12 student essays and demonstrate how fine-tuned small LLMs outperform baseline methods in segmenting the essays and determining the argument types, while few-shot prompting yields comparable performance to that of the baselines in assessing quality. This work highlights the educational potential of small, open-source LLMs to provide real-time, personalized feedback, enhancing independent learning and writing skills while ensuring low computational cost and privacy.
摘要：论证挖掘算法分析了论文的论证结构，使其成为通过提供有关学生论证技能的有针对性的反馈来增强教育的宝贵工具。尽管当前的方法经常使用编码器或编码器 - 模型深度学习体系结构，但仅解码器模型仍未开发，提供了有希望的研究方向。本文提出了利用开源，小型语言模型（LLMS），通过几次提示和微调进行挖掘。这些模型的小规模和开源性质可确保可访问性，隐私和计算效率，使学校和教育工作者能够在当地采用和部署它们。具体来说，我们执行三个任务：将学生论文分割成论点，类型的论点分类以及评估其质量。我们从反馈奖上进行经验评估模型 - 预测6-12年级学生论文的有效论证数据集，并证明了微调的小LLM在分割论文和确定参数类型时的表现如何超过基线方法，而很少提示较少的效果，而产生了可比的绩效，以至于可比评估质量的基准。这项工作突出了小型开源LLM的教育潜力，以提供实时的个性化反馈，增强独立学习和写作技巧，同时确保低计算成本和隐私。

Title: Unstructured Evidence Attribution for Long Context Query Focused Summarization

Authors: Dustin Wright, Zain Muhammad Mujahid, Lu Wang, Isabelle Augenstein, David Jurgens
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.14409
Pdf URL: https://arxiv.org/pdf/2502.14409
Copy Paste: [[2502.14409]] Unstructured Evidence Attribution for Long Context Query Focused Summarization(https://arxiv.org/abs/2502.14409)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) are capable of generating coherent summaries from very long contexts given a user query. Extracting and properly citing evidence spans could help improve the transparency and reliability of these summaries. At the same time, LLMs suffer from positional biases in terms of which information they understand and attend to, which could affect evidence citation. Whereas previous work has focused on evidence citation with predefined levels of granularity (e.g. sentence, paragraph, document, etc.), we propose the task of long-context query focused summarization with unstructured evidence citation. We show how existing systems struggle to generate and properly cite unstructured evidence from their context, and that evidence tends to be "lost-in-the-middle". To help mitigate this, we create the Summaries with Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated using a novel domain-agnostic pipeline which can be used as supervision to adapt LLMs to this task. We demonstrate across 5 LLMs of different sizes and 4 datasets with varying document types and lengths that LLMs adapted with SUnsET data generate more relevant and factually consistent evidence than their base models, extract evidence from more diverse locations in their context, and can generate more relevant and consistent summaries.
摘要：大型语言模型（LLMS）能够从较长的上下文中生成连贯的摘要，给定用户查询。提取并正确地引用证据可以帮助提高这些摘要的透明度和可靠性。同时，LLM在他们理解和参与的信息方面遭受位置偏见，这可能会影响证据引用。尽管以前的工作集中在具有预定义水平的粒度（例如句子，段落，文档等）的证据引用上，但我们提出了长篇文章查询以非结构化证据引用的重点汇总的任务。我们展示了现有系统如何从其上下文中产生并正确地引用非结构化证据，而证据往往是“中间失落的”。为了减轻这种情况，我们使用非结构化的证据文本数据集（Sunset）创建摘要，这是一种使用新颖的域 - 无术管道生成的合成数据集，可以用作将LLMS调整为此任务的监督。我们在5个不同尺寸和4个数据集的5个LLM中证明了具有不同文档类型和长度的LLM，这些LLMS与日落数据相比，与基本模型相比，与日落数据相适应的相关性和事实一致的证据，在其上下文中提取更多不同地点的证据，并且可以产生更相关的证据，并且可以产生更相关的证据。和一致的摘要。

Title: A Survey on Data Contamination for Large Language Models

Authors: Yuxing Cheng, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14425
Pdf URL: https://arxiv.org/pdf/2502.14425
Copy Paste: [[2502.14425]] A Survey on Data Contamination for Large Language Models(https://arxiv.org/abs/2502.14425)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. However, the reliability of performance evaluation has come under scrutiny due to data contamination, that is, the unintended overlap between training and test datasets. This overlap has the potential to artificially inflate model performance, as LLMs are typically trained on extensive datasets scraped from publicly available sources. These datasets often inadvertently overlap with the benchmarks used for evaluation, leading to an overestimation of the models' true generalization capabilities. In this paper, we first examine the definition and impacts of data contamination. Secondly, we review methods for contamination-free evaluation, focusing on three strategies: data updating-based methods, data rewriting-based methods, and prevention-based methods. Specifically, we highlight dynamic benchmarks and LLM-driven evaluation methods. Finally, we categorize contamination detection methods based on model information dependency: white-box, gray-box, and black-box detection approaches. Our survey highlights the requirements for more rigorous evaluation protocols and proposes future directions for addressing data contamination challenges.
摘要：大型语言模型（LLM）的最新进展已在各个领域（例如文本生成和代码综合）中取得了重大进展。但是，由于数据污染 - 培训和测试数据集之间的意外重叠，绩效评估的可靠性受到了审查。这种重叠有可能人为地膨胀模型性能，因为LLM通常是在从公共可用来源刮除的广泛数据集中培训的。这些数据集经常无意间与用于评估的基准重叠，从而高估了模型的真实概括能力。在本文中，我们首先检查数据污染的定义和影响。其次，我们回顾了无污染评估的方法，重点介绍了三种策略：基于数据更新的方法，基于数据重写的方法和基于预防的方法。具体而言，我们重点介绍了动态基准和LLM驱动的评估方法。最后，我们根据模型信息依赖关系对污染检测方法进行分类：白色框，灰色框和黑框检测方法。我们的调查强调了更严格的评估协议的要求，并提出了解决数据污染挑战的未来方向。

Title: Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models

Authors: Artem Vazhentsev, Lyudmila Rvanova, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14427
Pdf URL: https://arxiv.org/pdf/2502.14427
Copy Paste: [[2502.14427]] Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models(https://arxiv.org/abs/2502.14427)
Keywords: language model, llm
Abstract: Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs). To date, information-based and consistency-based UQ have been the dominant UQ methods for text generation via LLMs. Density-based methods, despite being very effective for UQ in text classification with encoder-based models, have not been very successful with generative LLMs. In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation and introduce a new supervised UQ method. Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores. Through extensive experiments on eleven datasets, we demonstrate that our approach substantially improves over existing UQ methods, providing accurate and computationally efficient uncertainty scores for both sequence-level selective generation and claim-level fact-checking tasks. Our method also exhibits strong generalization to out-of-domain data, making it suitable for a wide range of LLM-based applications.
摘要：不确定性量化（UQ）是从大语言模型（LLM）中提出真实答案的重要方法。迄今为止，基于信息的基于信息和一致性的UQ一直是通过LLMS生成文本生成的主要UQ方法。尽管使用基于编码器的模型在文本分类方面对UQ非常有效，但基于密度的方法在生成LLM中并不是很成功。在这项工作中，我们适应了Mahalanobis距离（MD） - 一种已建立的分类任务的UQ技术 - 用于文本生成，并介绍一种新的监督UQ方法。我们的方法从多个LLM的多层提取令牌嵌入，计算每个令牌的MD分数，并使用对这些功能进行训练的线性回归来提供可靠的不确定性分数。通过对11个数据集的广泛实验，我们证明了我们的方法对现有的UQ方法有很大改进，从而为序列级别的选择性生成和索赔级事实检查任务提供了准确和计算上有效的不确定性得分。我们的方法还表现出对室外数据的强烈概括，使其适用于广泛的基于LLM的应用程序。
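
The per-token scoring step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a small set of reference embeddings and computes the Mahalanobis distance of a new embedding to their mean. The function names, the ridge term, and the toy 2-D data are hypothetical, and the paper's multi-layer feature extraction and linear regression stage are not reproduced here.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mean_and_cov(xs: List[Vector], ridge: float = 1e-6) -> Tuple[Vector, Matrix]:
    """Sample mean and covariance; a small ridge keeps the covariance invertible."""
    n, d = len(xs), len(xs[0])
    mu = [sum(x[j] for x in xs) / n for j in range(d)]
    cov = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in xs) / (n - 1)
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge
    return mu, cov


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x: Vector, mu: Vector, cov: Matrix) -> float:
    """sqrt((x - mu)^T cov^{-1} (x - mu)), computed without forming the inverse."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(cov, diff)
    return sum(d * zi for d, zi in zip(diff, z)) ** 0.5
```

In this reading, a larger distance marks a token embedding that lies further from the reference distribution; how such scores are combined across layers and tokens is left to the paper.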

Title: Natural Language Generation

Authors: Ehud Reiter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14437
Pdf URL: https://arxiv.org/pdf/2502.14437
Copy Paste: [[2502.14437]] Natural Language Generation(https://arxiv.org/abs/2502.14437)
Keywords: llm
Abstract: This book provides a broad overview of Natural Language Generation (NLG), including technology, user requirements, evaluation, and real-world applications. The focus is on concepts and insights which hopefully will remain relevant for many years, not on the latest LLM innovations. It draws on decades of work by the author and others on NLG. The book has the following chapters: Introduction to NLG; Rule-Based NLG; Machine Learning and Neural NLG; Requirements; Evaluation; Safety, Maintenance, and Testing; and Applications. All chapters include examples and anecdotes from the author's personal experiences, and end with a Further Reading section. The book should be especially useful to people working on applied NLG, including NLG researchers, people in other fields who want to use NLG, and commercial developers. It will not however be useful to people who want to understand the latest LLM technology. There is a companion site with more information at this https URL
摘要：本书提供了自然语言生成（NLG）的广泛概述，包括技术，用户需求，评估和现实世界应用程序。重点是概念和见解，希望多年来一直保持相关，而不是最新的LLM创新。它借鉴了作者和其他NLG上数十年的工作。该书有以下章节：NLG简介；基于规则的NLG；机器学习和神经NLG；要求;评估;安全，维护和测试；和申请。所有章节都包括作者个人经历中的示例和轶事，并以进一步的阅读部分结尾。这本书对于从事应用NLG的人来说应该特别有用，包括NLG研究人员，其他想要使用NLG的人和商业开发人员。但是，这对于想了解最新的LLM技术的人将无用。在此HTTPS URL上有更多信息

Title: PredictaBoard: Benchmarking LLM Score Predictability

Authors: Lorenzo Pacchiardi, Konstantinos Voudouris, Ben Slater, Fernando Martínez-Plumed, José Hernández-Orallo, Lexin Zhou, Wout Schellaert
Subjects: cs.CL, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2502.14445
Pdf URL: https://arxiv.org/pdf/2502.14445
Copy Paste: [[2502.14445]] PredictaBoard: Benchmarking LLM Score Predictability(https://arxiv.org/abs/2502.14445)
Keywords: language model, llm, prompt
Abstract: Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different tolerance errors. As such, PredictaBoard stimulates research into developing better assessors and making LLMs more predictable, rather than only raising their average performance. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated. Code for our benchmark can be found at this https URL
摘要：尽管拥有令人印象深刻的技能，但大型语言模型（LLM）通常会无法预测，在基本的常识推理任务中表现出不一致的成功。这种不可预测性在确保其安全部署方面构成了重大挑战，因为在可靠的“安全区域”中识别和运行对于缓解风险是必不可少的。为了解决这个问题，我们提出了预测，这是一个新颖的协作基准测试框架，旨在评估分数预测因素（称为评估者）预测在现有数据集中特定任务实例（即提示）上的LLM错误。预测，通过考虑不同公差误差的排斥率来评估LLM和评估者的对。因此，预测代理会刺激研究发展更好的评估者并使LLMS更具可预测性，不仅具有较高的平均性能。我们使用基线评估者和最先进的LLM进行说明性实验。预测助理强调了与性能一起评估可预测性的关键需求，为更安全的AI系统铺平了道路，在这种情况下，错误不仅被最小化，而且预期并有效地减轻了。可以在此https url上找到我们的基准的代码
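
One possible reading of "rejection rate at different tolerance errors" is sketched below: instances are ranked by the assessor's predicted success probability, and the largest prefix whose observed error rate stays within the tolerance is accepted. This interpretation, the function name, and the toy data are assumptions for illustration, not the benchmark's actual metric definition.

```python
from typing import List


def min_rejection_rate(pred_success: List[float], correct: List[bool],
                       tolerance: float) -> float:
    """Smallest fraction of instances to reject so that the error rate on the
    accepted (most confident) instances does not exceed `tolerance`."""
    n = len(correct)
    if n == 0 or n != len(pred_success):
        raise ValueError("inputs must be non-empty and of equal length")
    # Most confident predictions first.
    order = sorted(range(n), key=lambda i: pred_success[i], reverse=True)
    best_accepted = 0
    errors = 0
    for count, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        if errors / count <= tolerance:
            best_accepted = count
    return 1 - best_accepted / n
```

Under this sketch, a better assessor ranks the LLM's failures lower, so fewer instances need to be rejected to meet the same tolerance.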

Title: Optimal word order for non-causal text generation with Large Language Models: the Spanish case

Authors: Andrea Busto-Castiñeira, Silvia García-Méndez, Francisco de Arriba-Pérez, Francisco J. González-Castaño
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14451
Pdf URL: https://arxiv.org/pdf/2502.14451
Copy Paste: [[2502.14451]] Optimal word order for non-causal text generation with Large Language Models: the Spanish case(https://arxiv.org/abs/2502.14451)
Keywords: language model, llm
Abstract: Natural Language Generation (NLG) popularity has increased owing to the progress in Large Language Models (LLMs), with zero-shot inference capabilities. However, most neural systems utilize decoder-only causal (unidirectional) transformer models, which are effective for English but may reduce the richness of languages with less strict word order, subject omission, or different relative clause attachment preferences. This is the first work that analytically addresses optimal text generation order for non-causal language models. We present a novel Viterbi algorithm-based methodology for maximum likelihood word order estimation. We analyze the non-causal most-likelihood order probability for NLG in Spanish and, then, the probability of generating the same phrases with Spanish causal NLG. This comparative analysis reveals that causal NLG prefers English-like SVO structures. We also analyze the relationship between optimal generation order and causal left-to-right generation order using Spearman's rank correlation. Our results demonstrate that the ideal order predicted by the maximum likelihood estimator is not closely related to the causal order and may be influenced by the syntactic structure of the target sentence.
摘要：自然语言产生（NLG）的受欢迎程度由于大型语言模型（LLMS）的进展而提高了，具有零射的推理能力。但是，大多数神经系统都使用仅解码器的因果（单向）变压器模型，这些模型对英语有效，但可能会降低语言的丰富性，而单词顺序不太严格，主题遗漏或不同的相对子句附件偏好。这是第一项分析地解决非伴奏语言模型的最佳文本生成顺序的工作。我们提出了一种基于Viterbi算法的新型方法，以最大似然词顺序估计。我们分析了NLG在西班牙语中的非毒学最类似序概率，然后分析了与西班牙因果NLG产生相同短语的概率。这种比较分析表明，因果NLG更喜欢英语样的SVO结构。我们还使用Spearman的等级相关性分析了最佳生成顺序与因果关系左至右生成顺序之间的关系。我们的结果表明，最大似然估计器预测的理想顺序与因果秩序没有密切相关，并且可能受到目标句子的句法结构的影响。
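
The comparison between an estimated generation order and the left-to-right order can be illustrated with Spearman's rank correlation. The sketch below assumes each order is given as a list of generation steps with no ties, so the closed-form expression applies; the function name and example orders are hypothetical.

```python
from typing import Sequence


def spearman_rho(order_a: Sequence[int], order_b: Sequence[int]) -> float:
    """Spearman's rank correlation for two tie-free rankings of the same items:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(order_a)
    if n != len(order_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    d2 = sum((a - b) ** 2 for a, b in zip(order_a, order_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Position i holds the step at which word i is generated.
causal = [0, 1, 2, 3]    # left-to-right generation
reversed_order = [3, 2, 1, 0]  # hypothetical right-to-left generation
print(spearman_rho(causal, causal))
print(spearman_rho(causal, reversed_order))
```

A value near 1 would indicate that the estimated order follows the causal order closely, a value near -1 that it runs in reverse, and a value near 0 that the two are largely unrelated.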

Title: Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models

Authors: Aurora Polo-Rodríguez, Laura Fiorini, Erika Rovini, Filippo Cavallo, Javier Medina-Quero
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2502.14469
Pdf URL: https://arxiv.org/pdf/2502.14469
Copy Paste: [[2502.14469]] Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models(https://arxiv.org/abs/2502.14469)
Keywords: language model, llm, chat
Abstract: This work presents a novel architecture for context-aware interactions within smart environments, leveraging Large Language Models (LLMs) to enhance user experiences. Our system integrates user location data obtained through UWB tags and sensor-equipped smart homes with real-time human activity recognition (HAR) to provide a comprehensive understanding of user context. This contextual information is then fed to an LLM-powered chatbot, enabling it to generate personalised interactions and recommendations based on the user's current activity and environment. This approach moves beyond traditional static chatbot interactions by dynamically adapting to the user's real-time situation. A case study conducted from a real-world dataset demonstrates the feasibility and effectiveness of our proposed architecture, showcasing its potential to create more intuitive and helpful interactions within smart homes. The results highlight the significant benefits of integrating LLM with real-time activity and location data to deliver personalised and contextually relevant user experiences.
摘要：这项工作为智能环境中的上下文感知互动提供了一种新颖的体系结构，利用大型语言模型（LLMS）来增强用户体验。我们的系统集成了通过UWB标签和配备传感器的智能房屋获得的用户位置数据，并具有实时的人类活动识别（HAR），以提供对用户环境的全面了解。然后将这些上下文信息提供给LLM驱动的聊天机器人，使其能够根据用户的当前活动和环境生成个性化的交互和建议。通过动态适应用户的实时情况，这种方法超越了传统的静态聊天机器人交互。从现实世界中的数据集进行的案例研究表明，我们提出的架构的可行性和有效性，展示了其在智能家居中创造更直观和有用的互动的潜力。结果突出了将LLM与实时活动和位置数据集成以提供个性化和上下文相关的用户体验的重大好处。

Title: Argument-Based Comparative Question Answering Evaluation Benchmark

Authors: Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, Chris Biemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14476
Pdf URL: https://arxiv.org/pdf/2502.14476
Copy Paste: [[2502.14476]] Argument-Based Comparative Question Answering Evaluation Benchmark(https://arxiv.org/abs/2502.14476)
Keywords: language model, gpt, llm
Abstract: In this paper, we aim to solve the problems standing in the way of automatic comparative question answering. To this end, we propose an evaluation framework to assess the quality of comparative question answering summaries. We formulate 15 criteria for assessing comparative answers created using manual annotation and annotation from 6 large language models and two comparative question answering datasets. We perform our tests using several LLMs and manual annotation under different settings and demonstrate the consistency of both evaluations. Our results show that the Llama-3 70B Instruct model achieves the best results for summary evaluation, while GPT-4 is the best for answering comparative questions. All used data, code, and evaluation results are publicly available at this https URL.
摘要：在本文中，我们旨在解决以自动比较问答的方式站立的问题。为此，我们提出了一个评估框架，以评估比较问题回答摘要的质量。我们制定了15个标准，用于评估使用手动注释和从6个大语言模型的手动注释和注释创建的比较答案，并为两个比较问题而引起的数据集。我们在不同的设置下使用几个LLM和手动注释进行测试，并证明了这两种评估的选区。我们的结果表明，Llama-3 70B指示模型证明了摘要评估的最佳结果，而GPT-4是回答比较问题的最佳结果。所有使用的数据，代码和评估结果均可公开可用\ footNote {\ url {this https url}}。

Title: Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression

Authors: Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14477
Pdf URL: https://arxiv.org/pdf/2502.14477
Copy Paste: [[2502.14477]] Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression(https://arxiv.org/abs/2502.14477)
Keywords: language model, llm
Abstract: Handling long-context sequences efficiently remains a significant challenge in large language models (LLMs). Existing methods for token selection in sequence extrapolation either employ a permanent eviction strategy or select tokens by chunk, which may lead to the loss of critical information. We propose Efficient Selective Attention (ESA), a novel approach that extends context length by efficiently selecting the most critical tokens at the token level to compute attention. ESA reduces the computational complexity of token selection by compressing query and key vectors into lower-dimensional representations. We evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using open-source LLMs with context lengths of 8k and 32k. ESA outperforms other selective attention methods, especially in tasks requiring the retrieval of multiple pieces of information, achieving comparable performance to full-attention extrapolation methods across various tasks, with superior results in certain tasks.
摘要：在大型语言模型（LLMS）中，有效地处理长篇小说序列仍然是一个重大挑战。以序列外推的代币选择的现有方法要么采用永久驱逐策略，要么由块选择代币，这可能导致关键信息的丢失。我们提出了有效的选择性注意（ESA），这是一种新型方法，通过有效选择令牌级别上最关键的令牌来计算注意力来扩展上下文长度。 ESA通过将查询和关键向量压缩为较低维表示，从而降低了令牌选择的计算复杂性。我们在长序列基准上评估了ESA，其长度最大长度为256K，使用了8K和32K的上下文长度。 ESA的表现优于其他选择性注意方法，尤其是在需要检索多个信息的任务中，可以在各种任务中实现与全注意外推方法相当的性能，在某些任务中取得了卓越的结果。
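
The select-then-attend idea in the abstract above can be sketched as follows. The sketch assumes the compression is a fixed linear projection (`proj`, hypothetical) applied to both the query and the keys; tokens are ranked with the cheap low-dimensional scores, and full-dimensional attention is computed only over the selected tokens. The paper's actual compression and selection procedure may differ.

```python
import math
from typing import List

Vector = List[float]


def project(vec: Vector, proj: List[Vector]) -> Vector:
    """Map a vector to a lower dimension with a (hypothetical) projection matrix."""
    return [sum(p * v for p, v in zip(row, vec)) for row in proj]


def select_tokens(query: Vector, keys: List[Vector],
                  proj: List[Vector], k: int) -> List[int]:
    """Return the indices of the k keys with the highest compressed scores."""
    q_low = project(query, proj)
    scores = [sum(a * b for a, b in zip(q_low, project(key, proj))) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # keep the original token order


def attend(query: Vector, keys: List[Vector], values: List[Vector],
           selected: List[int]) -> Vector:
    """Scaled dot-product attention restricted to the selected tokens."""
    scale = math.sqrt(len(query))
    logits = [sum(q * x for q, x in zip(query, keys[i])) / scale for i in selected]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][j] for w, i in zip(weights, selected)) / total
            for j in range(dim)]
```

The design point is that ranking all tokens costs only low-dimensional dot products, while the full-dimensional attention is paid for only k tokens.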

Title: NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models

Authors: Chenlu Guo, Yuan Wu, Yi Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14482
Pdf URL: https://arxiv.org/pdf/2502.14482
Copy Paste: [[2502.14482]] NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models(https://arxiv.org/abs/2502.14482)
Keywords: language model, llm
Abstract: Parameter-efficient fine-tuning (PEFT) is essential for adapting large language models (LLMs), with low-rank adaptation (LoRA) being the most popular approach. However, LoRA suffers from slow convergence, and some recent LoRA variants, such as PiSSA, primarily rely on Singular Value Decomposition (SVD) for initialization, leading to expensive computation. To mitigate these problems, we use the Nyström method, which follows a three-matrix manipulation. We first introduce StructuredLoRA (SLoRA), which investigates adding a small intermediate matrix between the low-rank matrices A and B. Secondly, we propose NyströmLoRA (NLoRA), which leverages Nyström-based initialization for SLoRA to improve its effectiveness and efficiency. Finally, we propose IntermediateTune (IntTune), which explores fine-tuning exclusively on the intermediate matrix of NLoRA to further boost LLM efficiency. We evaluate our methods on five natural language generation (NLG) tasks and eight natural language understanding (NLU) tasks. On GSM8K, SLoRA and NLoRA achieve accuracies of 56.48% and 57.70%, surpassing LoRA by 33.52% and 36.41%, with only 3.67 million additional trainable parameters. IntTune improves average NLG performance over LoRA by 7.45% while using only 1.25% of its parameters. These results demonstrate the efficiency and effectiveness of our approach in enhancing model performance with minimal parameter overhead.
摘要：参数有效的微调（PEFT）对于适应大型语言模型（LLMS）至关重要，低级适应（LORA）是最受欢迎的方法。然而，洛拉（Lora）遭受缓慢的收敛性，最近的一些洛拉（Lora）变体（例如PISSA）主要依赖于单数值分解（SVD）来初始化，从而导致昂贵的计算。为了减轻这些问题，我们使用NyStröm方法，该方法遵循三矩阵操纵。我们首先引入了结构性洛拉（Slora），该结构洛拉（Slora）研究了低级别矩阵A和B之间添加一个小的中间基质。其次，我们提出了NyStrömlora（Nlora），该基于NyStrömlora（Nlora），该基于NyStröm的初始化来使Slora提高其效率和效率。最后，我们提出了中间次（Inttune），该中期季度仅在Nlora的中间基质上进行微调，以进一步提高LLM效率。我们评估了五个自然语言生成（NLG）任务和八个自然语言理解（NLU）任务的方法。在GSM8K上，Slora和Nlora的精度为56.48％和57.70％，超过Lora的精度为33.52％和36.41％，只有367万额外的可训练参数。 Inttune仅使用其参数的1.25％，将LORA的平均NLG性能提高了7.45％。这些结果证明了我们方法在用最小参数开销增强模型性能方面的效率和有效性。
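
The three-matrix structure described for SLoRA can be illustrated with a small sketch: the usual LoRA update B·A gains an r×r intermediate matrix M, giving B·M·A. The shapes, function names, and the parameter-count helper are assumptions for illustration; the Nyström-based initialization itself is not reproduced here.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_delta(b: Matrix, a: Matrix) -> Matrix:
    """Standard LoRA weight update: (d_out x r) @ (r x d_in)."""
    return matmul(b, a)


def slora_delta(b: Matrix, m: Matrix, a: Matrix) -> Matrix:
    """Three-matrix update with an r x r intermediate matrix between B and A."""
    return matmul(matmul(b, m), a)


def trainable_params(d_out: int, d_in: int, r: int, intermediate: bool = False) -> int:
    """Trainable parameters for one adapted weight matrix."""
    return r * (d_out + d_in) + (r * r if intermediate else 0)
```

With M set to the identity the three-matrix update reduces to plain LoRA, and the intermediate matrix adds only r² parameters per adapted weight, which is why tuning M alone (as the abstract describes for IntTune) touches a small fraction of the adapter.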

Title: StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following

Authors: Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14494
Pdf URL: https://arxiv.org/pdf/2502.14494
Copy Paste: [[2502.14494]] StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following(https://arxiv.org/abs/2502.14494)
Keywords: language model, llm
Abstract: Multi-turn instruction following capability constitutes a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependency between dialogue turns that distinguishes multi-turn from single-turn interactions. This structural dependency not only reflects user intent but also establishes a second dimension for instruction following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling. The benchmark innovatively defines a structural flow framework comprising six fundamental inter-turn relationships, which not only introduces novel structural constraints for model evaluation but also serves as generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at this https URL.
摘要：功能之后的多转向指令构成了现实应用应用程序中大语言模型（LLM）的核心竞争力。现有的评估基准主要集中在细粒的约束满意度和特定于域的能力评估上，但忽略了对话转弯之间的关键结构依赖性，这些依赖性与单转交流不同。这种结构依赖性不仅反映了用户意图，而且还建立了在评估超出约束满意度之后进行指导的第二维度。为了解决这一差距，我们提出了结构流板，这是一种结构流建模的基准后的多转移指令。该基准是创新的，定义了一个结构流框架，其中包括六个基本的转变关系，该关系不仅引入了模型评估的新型结构约束，而且还可以作为生成参数来创建针对特定情况下量身定制的自定义对话流。采用既定的基于LLM的自动评估方法，我们对13个领先的开源和封闭源LLM进行系统评估。实验结果揭示了当前模型对多转化对话结构的理解中的明显缺陷。该代码可在\ url {this HTTPS url}上获得。

Title: Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization

Authors: Zhitao He, Zijun Liu, Peng Li, May Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14496
Pdf URL: https://arxiv.org/pdf/2502.14496
Copy Paste: [[2502.14496]] Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization(https://arxiv.org/abs/2502.14496)
Keywords: llm, agent
Abstract: LLM-based agents have made significant advancements in interactive environments, such as mobile operations and web browsing, and in other domains beyond computer use. Current multi-agent systems universally excel in performance compared to single agents, but struggle with generalization across environments due to predefined roles and inadequate strategies for generalizing language agents. The challenge of achieving both strong performance and good generalization has hindered the progress of multi-agent systems for interactive environments. To address these issues, we propose CollabUIAgents, a multi-agent reinforcement learning framework with a novel multi-agent credit re-assignment (CR) strategy, assigning process rewards with LLMs rather than environment-specific rewards and learning with synthesized preference data, in order to foster generalizable, collaborative behaviors among the role-free agents' policies. Empirical results show that our framework improves both performance and cross-environment generalizability of multi-agent systems. Moreover, our 7B-parameter system achieves results on par with or exceeding those of strong closed-source models and of the LLM that guides the CR. We also provide insights into using granular CR rewards effectively for environment generalization and into accommodating trained LLMs in multi-agent systems.
摘要：基于LLM的代理在交互式环境中取得了重大进步，例如移动操作和Web浏览以及其他计算机使用的其他域。与单个代理相比，当前的多机构系统在性能方面普遍表现，但由于预定义的角色和对语言代理的推广策略不足，在环境中与概括斗争。达到强大的性能和良好概括的挑战阻碍了多代理系统在交互式环境中的进步。为了解决这些问题，我们提出合作，这是一个具有新颖的多机构信用重新分配（CR）策略的多项式强化学习框架，用LLMS分配过程奖励，而不是环境特定的奖励和通过合成的偏好数据进行学习，为了促进无角色代理商政策之间的可概括，协作行为。经验结果表明，我们的框架可以改善多代理系统的性能和交叉环境的概括性。此外，我们的7B参数系统以或超过强大的闭合源模型以及指导CR的LLM的成果。我们还提供有关有效使用颗粒状CR奖励进行环境概括的见解，并在多代理系统中适应受过训练的LLM。

Title: MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Authors: Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14499
Pdf URL: https://arxiv.org/pdf/2502.14499
Copy Paste: [[2502.14499]] MLGym: A New Framework and Benchmark for Advancing AI Research Agents(https://arxiv.org/abs/2502.14499)
Keywords: language model, gpt, llm, agent
Abstract: We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 diverse and open-ended AI research tasks from domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmark, such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.
摘要：我们介绍了Meta Mlgym和Mlgym-Bench，这是一个新的框架和基准，用于评估和开发AI研究任务的LLM代理。这是第一个用于机器学习（ML）任务的健身房环境，为培训此类代理的增强学习（RL）算法提供了研究。 MLGYM基础由来自计算机视觉，自然语言处理，强化学习和游戏理论等不同领域的13种不同和开放式的AI研究任务组成。解决这些任务需要现实世界中的AI研究技能，例如生成新的想法和假设，创建和处理数据，实施ML方法，培训模型，运行实验，分析结果并在此过程中进行迭代以改进给定的任务。我们在基准上评估了许多边界大型语言模型（LLM），例如Claude-3.5-Sonnet，Llama-3.1 405B，GPT-4O，O1-Preview和Gemini-1.5 Pro。我们的MLGYM框架使添加新任务，集成和评估模型或代理，按大规模生成综合数据，并为培训AI研究任务培训代理开发新的学习算法变得容易。我们发现，当前的边界模型通常可以通过找到更好的超参数来改善给定的基线，但不会产生新颖的假设，算法，体系结构或实质性改进。我们开源的框架和基准，以促进未来的研究，以促进LLM代理的AI研究能力。

Title: How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?

Authors: Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14502
Pdf URL: https://arxiv.org/pdf/2502.14502
Copy Paste: [[2502.14502]] How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?(https://arxiv.org/abs/2502.14502)
Keywords: language model, llm
Abstract: The performance of Large Language Models (LLMs) on many tasks is greatly limited by the knowledge learned during pre-training and stored in the model's parameters. Low-rank adaptation (LoRA) is a popular and efficient training technique for updating LLMs or adapting them to specific domains. In this study, we investigate how new facts can be incorporated into the LLM using LoRA without compromising the previously learned knowledge. We fine-tuned Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our experiments have shown that the best results are obtained when the training data contains a mixture of known and new facts. However, this approach is still potentially harmful because the model's performance on external question-answering benchmarks declines after such fine-tuning. When the training data is biased towards certain entities, the model tends to regress to a few overrepresented answers. In addition, we found that the model becomes more confident and refuses to provide an answer in only a few cases. These findings highlight the potential pitfalls of LoRA-based LLM updates and underscore the importance of training data composition and tuning parameters to balance new knowledge integration and general model capabilities.
摘要：大语言模型（LLM）在许多任务上的性能受到了在预训练期间学到的知识并存储在模型参数中的知识的限制。低级适应性（LORA）是一种流行而有效的训练技术，用于更新LLM的特定于域的适应性。在这项研究中，我们研究了如何使用LORA将新事实纳入LLM，而不会损害先前学习的知识。我们使用洛拉（Lora）和不同数量的新知识微调了Llama-3.1-8B教学。我们的实验表明，当训练数据包含已知事实和新事实的混合物时，获得最佳结果。但是，这种方法仍然有害，因为该模型在这种微调后的外部提问基准测试中的性能下降。当培训数据偏向某些实体时，该模型倾向于回归几乎代表过度的答案。此外，我们发现该模型变得更加自信，并且拒绝在很少的情况下提供答案。这些发现突出了基于LORA的LLM更新的潜在陷阱，并强调了培训数据组成和调整参数的重要性，以平衡新知识集成和一般模型功能。

Title: Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases

Authors: Rena Gao, Xuetong Wu, Tatsuki Kuribayashi, Mingrui Ye, Siya Qi, Carsten Roever, Yuanxing Liu, Zheng Yuan, Jey Han Lau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14507
Pdf URL: https://arxiv.org/pdf/2502.14507
Copy Paste: [[2502.14507]] Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases(https://arxiv.org/abs/2502.14507)
Keywords: language model, gpt, llm, prompt
Abstract: This study evaluates Large Language Models' (LLMs) ability to simulate non-native-like English use observed in human second language (L2) learners interfered with by their native first language (L1). In dialogue-based interviews, we prompt LLMs to mimic L2 English learners with specific L1s (e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to real L2 learner data. Our analysis examines L1-driven linguistic biases, such as reference word usage and avoidance behaviors, using information-theoretic and distributional density measures. Results show that modern LLMs (e.g., Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed in human L2 data, with distinct influences from various languages (e.g., Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu influences noun-verb collocations). Our results reveal the potential of LLMs for L2 dialogue generation and evaluation for future educational applications.
摘要：这项研究评估了大型语言模型（LLMS）模拟在人类第二语言（L2）学习者中观察到的非本地人的英语用途的能力（L1）。在基于对话的访谈中，我们促使LLMS与七种语言的特定L1（例如日语，泰语，乌尔都语）模仿L2英语学习者，将其输出与实际L2学习者数据进行比较。我们的分析使用信息理论和分布密度测量方法研究了L1驱动的语言偏见，例如参考单词使用和回避行为。结果表明，现代LLM（例如Qwen2.5，Llama3.3，DeepSeekV3，GPT-4O）复制在人类L2数据中观察到的L1依赖模式，具有各种语言的明显影响（例如，日语，韩语和Mandarin显着影响）时态协议，乌尔都语会影响名词 - 动词搭配）。我们的结果揭示了LLM在L2对话生成和评估未来教育应用的潜力。

Title: CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models

Authors: Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, Qing Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14529
Pdf URL: https://arxiv.org/pdf/2502.14529
Copy Paste: [[2502.14529]] CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models(https://arxiv.org/abs/2502.14529)
Keywords: language model, llm, agent
Abstract: Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated remarkable real-world capabilities, effectively collaborating to complete complex tasks. While these systems are designed with safety mechanisms, such as rejecting harmful instructions through alignment, their security remains largely unexplored. This gap leaves LLM-MASs vulnerable to targeted disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks (Corba), a novel and simple yet highly effective attack that disrupts interactions between agents within an LLM-MAS. Corba leverages two key properties: its contagious nature allows it to propagate across arbitrary network topologies, while its recursive property enables sustained depletion of computational resources. Notably, these blocking attacks often involve seemingly benign instructions, making them particularly challenging to mitigate using conventional alignment methods. We evaluate Corba on two widely-used LLM-MASs, namely, AutoGen and Camel across various topologies and commercial models. Additionally, we conduct more extensive experiments in open-ended interactive LLM-MASs, demonstrating the effectiveness of Corba in complex topology structures and open-source models. Our code is available at: this https URL.

Title: LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization

Authors: Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14538
Pdf URL: https://arxiv.org/pdf/2502.14538
Copy Paste: [[2502.14538]] LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization(https://arxiv.org/abs/2502.14538)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing, but their full fine-tuning remains resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), have emerged as a practical solution by approximating parameter updates with low-rank matrices. However, LoRA often exhibits a "double descent" phenomenon during fine-tuning, where model performance degrades due to overfitting and limited expressiveness caused by low-rank constraints. To address this issue, we propose LoRA-GGPO (Gradient-Guided Perturbation Optimization), a novel method that leverages gradient and weight norms to generate targeted perturbations. By optimizing the sharpness of the loss landscape, LoRA-GGPO guides the model toward flatter minima, mitigating the double descent problem and improving generalization. Extensive experiments on natural language understanding (NLU) and generation (NLG) tasks demonstrate that LoRA-GGPO outperforms LoRA and its state-of-the-art variants. Furthermore, extended experiments specifically designed to analyze the double descent phenomenon confirm that LoRA-GGPO effectively alleviates this issue, producing more robust and generalizable models. Our work provides a robust and efficient solution for fine-tuning LLMs, with broad applicability in real-world scenarios. The code is available at this https URL.

Title: LLM-based User Profile Management for Recommender System

Authors: Seunghwan Bang, Hwanjun Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14541
Pdf URL: https://arxiv.org/pdf/2502.14541
Copy Paste: [[2502.14541]] LLM-based User Profile Management for Recommender System(https://arxiv.org/abs/2502.14541)
Keywords: language model, llm
Abstract: The rapid advancement of Large Language Models (LLMs) has opened new opportunities in recommender systems by enabling zero-shot recommendation without conventional training. Despite their potential, most existing works rely solely on users' purchase histories, leaving significant room for improvement by incorporating user-generated textual data, such as reviews and product descriptions. Addressing this gap, we propose PURE, a novel LLM-based recommendation framework that builds and maintains evolving user profiles by systematically extracting and summarizing key information from user reviews. PURE consists of three core components: a Review Extractor for identifying user preferences and key product features, a Profile Updater for refining and updating user profiles, and a Recommender for generating personalized recommendations using the most current profile. To evaluate PURE, we introduce a continuous sequential recommendation task that reflects real-world scenarios by adding reviews over time and updating predictions incrementally. Our experimental results on Amazon datasets demonstrate that PURE outperforms existing LLM-based methods, effectively leveraging long-term user information while managing token limitations.

Title: Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling

Authors: Eric Egli, Matteo Manica, Jannis Born
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14553
Pdf URL: https://arxiv.org/pdf/2502.14553
Copy Paste: [[2502.14553]] Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling(https://arxiv.org/abs/2502.14553)
Keywords: language model
Abstract: Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. Therefore, we present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of 5M bytes on a single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q&A tasks and find that, despite serializing images and the absence of an encoder, an MBLM with pure next token prediction can match custom CNN-LSTM architectures with designated classification heads. We show that MBLMs exhibit strong adaptability in integrating diverse data representations, including pixel and image filestream bytes, underlining their potential toward omnimodal foundation models. Source code is publicly available at: this https URL

Title: Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs

Authors: Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2502.14561
Pdf URL: https://arxiv.org/pdf/2502.14561
Copy Paste: [[2502.14561]] Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs(https://arxiv.org/abs/2502.14561)
Keywords: language model, llm, prompt
Abstract: This work investigates the ability of open Large Language Models (LLMs) to predict citation intent through in-context learning and fine-tuning. Unlike traditional approaches that rely on pre-trained models like SciBERT, which require extensive domain-specific pretraining and specialized architectures, we demonstrate that general-purpose LLMs can be adapted to this task with minimal task-specific data. We evaluate twelve model variations across five prominent open LLM families using zero, one, few, and many-shot prompting to assess performance across scenarios. Our experimental study identifies the top-performing model through extensive experimentation of in-context learning-related parameters, which we fine-tune to further enhance task performance. The results highlight the strengths and limitations of LLMs in recognizing citation intents, providing valuable insights for model selection and prompt engineering. Additionally, we make our end-to-end evaluation framework and models openly available for future use.

Title: Behavioral Analysis of Information Salience in Large Language Models

Authors: Jan Trienes, Jörg Schlötterer, Junyi Jessy Li, Christin Seifert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14613
Pdf URL: https://arxiv.org/pdf/2502.14613
Copy Paste: [[2502.14613]] Behavioral Analysis of Information Salience in Large Language Models(https://arxiv.org/abs/2502.14613)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel at text summarization, a task that requires models to select content based on its importance. However, the exact notion of salience that LLMs have internalized remains unclear. To bridge this gap, we introduce an explainable framework to systematically derive and investigate information salience in LLMs through their summarization behavior. Using length-controlled summarization as a behavioral probe into the content selection process, and tracing the answerability of Questions Under Discussion throughout, we derive a proxy for how models prioritize information. Our experiments on 13 models across four datasets reveal that LLMs have a nuanced, hierarchical notion of salience, generally consistent across model families and sizes. While models show highly consistent behavior and hence salience patterns, this notion of salience cannot be accessed through introspection, and only weakly correlates with human perceptions of information salience.

Title: FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis

Authors: Mingyi Jia, Junwen Duan, Yan Song, Jianxin Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14614
Pdf URL: https://arxiv.org/pdf/2502.14614
Copy Paste: [[2502.14614]] FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis(https://arxiv.org/abs/2502.14614)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Large Language Models (LLMs), which integrate external knowledge into LLMs, have shown remarkable performance in various medical domains, including clinical diagnosis. However, existing RAG methods struggle to effectively assess task difficulty to make retrieval decisions, thereby failing to meet the clinical requirements for balancing efficiency and accuracy. So in this paper, we propose FIND (Fine-grained Information Density Guided Adaptive RAG), a novel framework that improves the reliability of RAG in disease diagnosis scenarios. FIND incorporates a fine-grained adaptive control module to determine whether retrieval is necessary based on the information density of the input. By optimizing the retrieval process and implementing a knowledge filtering module, FIND ensures that the retrieval is better suited to clinical scenarios. Experiments on three Chinese electronic medical record datasets demonstrate that FIND significantly outperforms various baseline methods, highlighting its effectiveness in clinical diagnosis tasks.

Title: Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity

Authors: Xinghan Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14620
Pdf URL: https://arxiv.org/pdf/2502.14620
Copy Paste: [[2502.14620]] Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity(https://arxiv.org/abs/2502.14620)
Keywords: language model
Abstract: This paper investigates the efficacy of RWKV, a novel language model architecture known for its linear attention mechanism, for generating sentence embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate the semantic similarity captured by embeddings from different hidden layers of a pre-trained RWKV model. The performance is assessed on the Microsoft Research Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared against a GloVe-based baseline. My results indicate that while RWKV embeddings capture some semantic relatedness, they underperform compared to the GloVe baseline in terms of Spearman correlation. I also analyze the inference time and GPU memory usage, highlighting the computational trade-offs associated with RWKV embeddings. The findings suggest that while RWKV offers potential advantages in terms of linear scaling, its zero-shot sentence embedding quality for semantic similarity tasks requires further investigation and potential task-specific fine-tuning to match or exceed simpler baselines.

Title: NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization

Authors: Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.14638
Pdf URL: https://arxiv.org/pdf/2502.14638
Copy Paste: [[2502.14638]] NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization(https://arxiv.org/abs/2502.14638)
Keywords: language model
Abstract: Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at this https URL.

Title: How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation

Authors: Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14642
Pdf URL: https://arxiv.org/pdf/2502.14642
Copy Paste: [[2502.14642]] How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation(https://arxiv.org/abs/2502.14642)
Keywords: llm
Abstract: Recently, LLMs have garnered increasing attention across academic disciplines for their potential as human digital twins, virtual proxies designed to replicate individuals and autonomously perform tasks such as decision-making, problem-solving, and reasoning on their behalf. However, current evaluations of LLMs primarily emphasize dialogue simulation while overlooking human behavior simulation, which is crucial for digital twins. To address this gap, we introduce BehaviorChain, the first benchmark for evaluating LLMs' ability to simulate continuous human behavior. BehaviorChain comprises diverse, high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors across 1,001 unique personas, each with detailed history and profile metadata. For evaluation, we integrate persona metadata into LLMs and employ them to iteratively infer contextually appropriate behaviors within dynamic scenarios provided by BehaviorChain. Comprehensive evaluation results demonstrated that even state-of-the-art models struggle with accurately simulating continuous human behavior.

Title: Length-Controlled Margin-Based Preference Optimization without Reference Model

Authors: Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14643
Pdf URL: https://arxiv.org/pdf/2502.14643
Copy Paste: [[2502.14643]] Length-Controlled Margin-Based Preference Optimization without Reference Model(https://arxiv.org/abs/2502.14643)
Keywords: language model
Abstract: Direct Preference Optimization (DPO) is a widely adopted offline algorithm for preference-based reinforcement learning from human feedback (RLHF), designed to improve training simplicity and stability by redefining reward functions. However, DPO is hindered by several limitations, including length bias, memory inefficiency, and probability degradation. To address these challenges, we propose Length-Controlled Margin-Based Preference Optimization (LMPO), a more efficient and robust alternative. LMPO introduces a uniform reference model as an upper bound for the DPO loss, enabling a more accurate approximation of the original optimization objective. Additionally, an average log-probability optimization strategy is employed to minimize discrepancies between training and inference phases. A key innovation of LMPO lies in its Length-Controlled Margin-Based loss function, integrated within the Bradley-Terry framework. This loss function regulates response length while simultaneously widening the margin between preferred and rejected outputs. By doing so, it mitigates probability degradation for both accepted and discarded responses, addressing a significant limitation of existing methods. We evaluate LMPO against state-of-the-art preference optimization techniques on two open-ended large language models, Mistral and LLaMA3, across six conditional benchmarks. Our experimental results demonstrate that LMPO effectively controls response length, reduces probability degradation, and outperforms existing approaches. The code is available at this https URL.

Title: LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning

Authors: Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14644
Pdf URL: https://arxiv.org/pdf/2502.14644
Copy Paste: [[2502.14644]] LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning(https://arxiv.org/abs/2502.14644)
Keywords: language model, llm, long context
Abstract: Long context understanding remains challenging for large language models due to their limited context windows. This paper presents Long Input Fine-Tuning (LIFT), a novel framework for long-context modeling that can improve the long-context performance of arbitrary (short-context) LLMs by dynamically adapting model parameters based on the long input. Importantly, LIFT, rather than endlessly extending the context window size to accommodate increasingly longer inputs in context, chooses to store and absorb the long input in the model's parameters. By fine-tuning the long input into model parameters, LIFT allows short-context LLMs to answer questions even when the required information is not provided in the context during inference. Furthermore, to enhance LIFT performance while maintaining the original in-context learning (ICL) capabilities, we introduce Gated Memory, a specialized attention adapter that automatically balances long input memorization and ICL. We provide a comprehensive analysis of the strengths and limitations of LIFT on long context understanding, offering valuable directions for future research.

Title: Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs

Authors: Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14645
Pdf URL: https://arxiv.org/pdf/2502.14645
Copy Paste: [[2502.14645]] Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs(https://arxiv.org/abs/2502.14645)
Keywords: language model, llm
Abstract: Knowledge editing allows for efficient adaptation of large language models (LLMs) to new information or corrections without requiring full retraining. However, prior methods typically focus on either single-language editing or basic multilingual editing, failing to achieve true cross-linguistic knowledge synchronization. To address this, we present a simple and practical state-of-the-art (SOTA) recipe Cross-Lingual Knowledge Democracy Edit (X-KDE), designed to propagate knowledge from a dominant language to other languages effectively. Our X-KDE comprises two stages: (i) Cross-lingual Edition Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel dataset to modify in-scope knowledge while preserving unrelated information, and (ii) Target-language Preference Optimization (TL-PO), which applies advanced optimization techniques to ensure consistency across languages, fostering the transfer of updates. Additionally, we contribute a high-quality, cross-lingual dataset, specifically designed to enhance knowledge transfer across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks show that X-KDE significantly enhances cross-lingual performance, achieving an average improvement of +8.19%, while maintaining high accuracy in monolingual settings.

Title: InstructAgent: Building User Controllable Recommender via LLM Agent

Authors: Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.14662
Pdf URL: https://arxiv.org/pdf/2502.14662
Copy Paste: [[2502.14662]] InstructAgent: Building User Controllable Recommender via LLM Agent(https://arxiv.org/abs/2502.14662)
Keywords: llm, agent
Abstract: Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform's recommendation algorithms. However, the defect of recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform's benefits, which may hinder their ability to protect and capture users' true interests. Second, these models are typically optimized using data from all users, which may overlook individual users' preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, but these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where the agent serves as the protective shield between user and recommender system that enables indirect exposure. To this end, we first construct four recommendation datasets, denoted as $\dataset$, along with user instructions for each record.

Title: AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO

Authors: Alan Dao (Gia Tuan Dao), Dinh Bach Vu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14669
Pdf URL: https://arxiv.org/pdf/2502.14669
Copy Paste: [[2502.14669]] AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO(https://arxiv.org/abs/2502.14669)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO), a technique used in DeepSeekR1, with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual spatial tasks. These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.

Title: Explanations of Deep Language Models Explain Language Representations in the Brain

Authors: Maryam Rahimi (1), Yadollah Yaghoobzadeh (2 and 4), Mohammad Reza Daliri (1 and 3) ((1) Biomedical Engineering Department, School of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran, (2) Electrical and Computer Engineering Department, University of Tehran, Tehran, Iran, (3) School of Cognitive Sciences, Institute for Research in Fundamental Sciences, Tehran, Iran, (4) Tehran Institute for Advanced Studies, Khatam University, Tehran, Iran)
Subjects: cs.CL, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2502.14671
Pdf URL: https://arxiv.org/pdf/2502.14671
Copy Paste: [[2502.14671]] Explanations of Deep Language Models Explain Language Representations in the Brain(https://arxiv.org/abs/2502.14671)
Keywords: language model, llm
Abstract: Recent advances in artificial intelligence have given rise to large language models (LLMs) that not only achieve human-like performance but also share computational principles with the brain's language processing mechanisms. While previous research has primarily focused on aligning LLMs' internal representations with neural activity, we introduce a novel approach that leverages explainable AI (XAI) methods to forge deeper connections between the two domains. Using attribution methods, we quantified how preceding words contribute to an LLM's next-word predictions and employed these explanations to predict fMRI recordings from participants listening to the same narratives. Our findings demonstrate that attribution methods robustly predict brain activity across the language network, surpassing traditional internal representations in early language areas. This alignment is hierarchical: early-layer explanations correspond to the initial stages of language processing in the brain, while later layers align with more advanced stages. Moreover, the layers more influential on LLM next-word prediction (those with higher attribution scores) exhibited stronger alignment with neural activity. This work establishes a bidirectional bridge between AI and neuroscience. First, we demonstrate that attribution methods offer a powerful lens for investigating the neural mechanisms of language comprehension, revealing how meaning emerges from preceding context. Second, we propose using brain alignment as a metric to evaluate the validity of attribution methods, providing a framework for assessing their biological plausibility.

Title: Data-Constrained Synthesis of Training Data for De-Identification

Authors: Thomas Vakili, Aron Henriksson, Hercules Dalianis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14677
Pdf URL: https://arxiv.org/pdf/2502.14677
Copy Paste: [[2502.14677]] Data-Constrained Synthesis of Training Data for De-Identification(https://arxiv.org/abs/2502.14677)
Keywords: language model, llm
Abstract: Many sensitive domains -- such as the clinical domain -- lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study -- using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.
摘要：由于隐私风险，许多敏感领域（例如临床领域）缺乏广泛可用的数据集。大型语言模型（LLM）的生成能力越来越多使合成数据集成为可行的前进道路。在这项研究中，我们使用基于编码器的NER模型，对临床领域的域名llms域上域中的LLM并生成了与标签进行机器通知的合成临床文本。然后，合成语料库用于训练合成模型。结果表明，使用合成语料库的训练NER模型只会导致预测性能下降。在系统的消融研究中，使用瑞典和西班牙数据研究了此过程的局限性。我们的分析表明，较小的数据集可以足以使域适应数据合成。相反，此过程的有效性几乎完全取决于使用原始数据训练的机器通知NER模型的性能。

Title: How to Get Your LLM to Generate Challenging Problems for Evaluation

Authors: Arkil Patel, Siva Reddy, Dzmitry Bahdanau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14678
Pdf URL: https://arxiv.org/pdf/2502.14678
Copy Paste: [[2502.14678]] How to Get Your LLM to Generate Challenging Problems for Evaluation(https://arxiv.org/abs/2502.14678)
Keywords: language model, llm
Abstract: The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
摘要：大语言模型（LLM）的演变速度需要采取新的方法来进行严格和全面的评估。由于产生高质量，具有挑战性的问题所涉及的复杂性和成本，传统的人类注释越来越不切实际。在这项工作中，我们介绍了Chase，这是一个统一的框架，用于使用不参与人类参与的LLM综合产生挑战性问题。对于给定的任务，我们的方法以更简单的组件以自下而上的方式构建了一个严重的问题。此外，我们的框架将生成过程分解为独立可验证的子任务，从而确保了高质量和正确性。我们实施追逐以在三个不同域中创建评估基准：（1）基于文档的问题回答，（2）存储库级代码完成，以及（3）数学推理。最先进的LLM在这些合成基准上的性能在40-60％的精度范围内，从而证明了我们框架在产生具有挑战性问题方面的有效性。我们公开发布基准和代码。

Title: Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup

Authors: Yonghui Kong, Hongbing Hu, Dan Zhang, Siyuan Chai, Fan Zhang, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14682
Pdf URL: https://arxiv.org/pdf/2502.14682
Copy Paste: [[2502.14682]] Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup(https://arxiv.org/abs/2502.14682)
Keywords: language model, gpt, llm
Abstract: Large language models have demonstrated excellent performance in many tasks, including Text-to-SQL, due to their powerful in-context learning capabilities. They are becoming the mainstream approach for Text-to-SQL. However, these methods still have a significant gap compared to human performance, especially on complex questions. As the complexity of questions increases, the gap between questions and SQLs increases. We identify two important gaps: the structural mapping gap and the lexical mapping gap. To tackle these two gaps, we propose PAS-SQL, an efficient SQL generation pipeline based on LLMs, which alleviates gaps through Abstract Query Pattern (AQP) and Contextual Schema Markup (CSM). AQP aims to obtain the structural pattern of the question by removing database-related information, which enables us to find structurally similar demonstrations. CSM aims to associate database-related text spans in the question with specific tables or columns in the database, which alleviates the lexical mapping gap. Experimental results on the Spider and BIRD datasets demonstrate the effectiveness of our proposed method. Specifically, PAS-SQL + GPT-4o sets a new state-of-the-art on the Spider benchmark with an execution accuracy of 87.9%, and achieves leading results on the BIRD dataset with an execution accuracy of 64.67%.
摘要：大型语言模型由于其强大的文本学习能力，在包括文本到SQL在内的许多任务中表现出了出色的性能。它们正在成为文本到SQL的主流方法。但是，与人类绩效相比，这些方法仍然存在很大的差距，尤其是在复杂的问题上。随着问题的复杂性的增加，问题和SQL之间的差距增加。我们确定了两个重要的差距：结构映射间隙和词汇映射间隙。为了解决这两个差距，我们提出了基于LLM的有效的SQL生成管道PAS-SQL，它通过抽象查询模式（AQP）和上下文模式标记（CSM）来减轻差距。 AQP旨在通过删除与数据库相关的信息来获得问题的结构模式，这使我们能够找到结构上相似的演示。 CSM的目的是将问题与数据库相关的文本跨度与数据库中的特定表或列相关联，从而减轻了词汇映射差距。蜘蛛和鸟类数据集的实验结果证明了我们提出的方法的有效性。具体而言，PAS-SQL + GPT-4O以87.9％的执行精度在蜘蛛标准上设置了一个新的最先进的方法，并以64.67 \％的执行精度在鸟数据集上实现领先的结果。

Title: I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search

Authors: Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14693
Pdf URL: https://arxiv.org/pdf/2502.14693
Copy Paste: [[2502.14693]] I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search(https://arxiv.org/abs/2502.14693)
Keywords: language model, llm, agent
Abstract: Recent advancements in large language models (LLMs) have shown remarkable potential in automating machine learning tasks. However, existing LLM-based agents often struggle with low-diversity and suboptimal code generation. While recent work has introduced Monte Carlo Tree Search (MCTS) to address these issues, limitations persist in the quality and diversity of thoughts generated, as well as in the scalar value feedback mechanisms used for node selection. In this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a novel approach that iteratively expands tree nodes through an introspective process that meticulously analyzes solutions and results from parent and sibling nodes. This facilitates a continuous refinement of the node in the search tree, thereby enhancing the overall decision-making process. Furthermore, we integrate a Large Language Model (LLM)-based value model to facilitate direct evaluation of each node's solution prior to conducting comprehensive computational rollouts. A hybrid rewarding mechanism is implemented to seamlessly transition the Q-value from LLM-estimated scores to actual performance scores. This allows higher-quality nodes to be traversed earlier. Applied to the various ML tasks, our approach demonstrates a 6% absolute improvement in performance compared to the strong open-source AutoML agents, showcasing its effectiveness in enhancing agentic AutoML systems.
摘要：大型语言模型（LLM）的最新进展在使机器学习任务自动化方面表现出了巨大的潜力。但是，现有的基于LLM的代理通常会在低多样性和次优码生成中挣扎。尽管最近的工作引入了蒙特卡洛树搜索（MCT）来解决这些问题，但限制持续存在生成的思想质量和多样性，以及用于节点选择的标量值反馈机制。在这项研究中，我们引入了内省的蒙特卡洛树搜索（I-MCTS），这种新型方法通过内省的过程迭代地扩展树节点，该过程精心分析了父母和兄弟姐妹节点的解决方案和结果。这促进了搜索树中节点的连续完善，从而增强了整个HTTP URL的决策，我们集成了一个基于大的语言模型（LLM）的价值模型，以促进对每个节点进行全面计算的直接评估推广。实施了一种混合奖励机制，以无缝将Q值从LLM估计的分数转变为实际的性能得分。这使得可以将高质量的节点遍历到各种ML任务的HTTP URL，我们的方法表明，与强大的开源汽车代理相比，性能的A6 \％绝对改善，表明其在增强代理自动系统方面的有效性。
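
Illustrative sketch (not from the paper): the abstract above describes a hybrid reward that moves a node's Q-value from an LLM-estimated score toward actual performance scores. The abstract does not give the formula, so the linear blending schedule, the function name hybrid_q_value, and the warmup_rollouts parameter below are all assumptions made only to show the general idea.

```python
def hybrid_q_value(llm_score, actual_scores, warmup_rollouts=3):
    """Blend an LLM-estimated score with observed rollout scores.

    Assumption: the weight on observed scores grows linearly with the
    number of rollouts and saturates at warmup_rollouts. The paper may
    use a different schedule.
    """
    if not actual_scores:
        # No rollout yet: rely entirely on the LLM value model.
        return llm_score
    weight = min(len(actual_scores) / warmup_rollouts, 1.0)
    observed = sum(actual_scores) / len(actual_scores)
    return (1.0 - weight) * llm_score + weight * observed


if __name__ == "__main__":
    print(hybrid_q_value(0.8, []))          # LLM estimate only
    print(hybrid_q_value(0.8, [0.4]))       # partly observed
    print(hybrid_q_value(0.8, [0.5] * 3))   # fully observed
```

With a schedule like this, a promising node can be ranked before any expensive rollout has run, and the estimate is gradually replaced by measured performance.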

Title: Entity Framing and Role Portrayal in the News

Authors: Tarek Mahmoud, Zhuohan Xie, Dimitar Dimitrov, Nikolaos Nikolaidis, Purificação Silvano, Roman Yangarber, Shivam Sharma, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14718
Pdf URL: https://arxiv.org/pdf/2502.14718
Copy Paste: [[2502.14718]] Entity Framing and Role Portrayal in the News(https://arxiv.org/abs/2502.14718)
Keywords: llm
Abstract: We introduce a novel multilingual hierarchical corpus annotated for entity framing and role portrayal in news articles. The dataset uses a unique taxonomy inspired by storytelling elements, comprising 22 fine-grained roles, or archetypes, nested within three main categories: protagonist, antagonist, and innocent. Each archetype is carefully defined, capturing nuanced portrayals of entities such as guardian, martyr, and underdog for protagonists; tyrant, deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for innocents. The dataset includes 1,378 recent news articles in five languages (Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two critical domains of global significance: the Ukraine-Russia War and Climate Change. Over 5,800 entity mentions have been annotated with role labels. This dataset serves as a valuable resource for research into role portrayal and has broader implications for news analysis. We describe the characteristics of the dataset and the annotation process, and we report evaluation results on fine-tuned state-of-the-art multilingual transformers and hierarchical zero-shot learning using LLMs at the level of a document, a paragraph, and a sentence.
摘要：我们介绍了一个新颖的多语言分层语料库，该语料库在新闻文章中为实体框架和角色刻画提供了注释。该数据集使用的独特分类学灵感来自讲故事的元素，包括22个精细角色或原型，嵌套在三个主要类别中：主角，对立者和无辜者。每个原型都经过精心定义，捕捉了主角的守护者，烈士和弱者等实体的细微刻画。暴君，欺骗者和对手的偏执；还有受害者，替罪羊，并为无辜者剥削。该数据集包括五种语言（保加利亚语，英语，印地语，欧洲葡萄牙语和俄罗斯）的1,378篇新闻文章，重点关注两个全球意义的关键领域：乌克兰 - 俄罗斯战争和气候变化。超过5,800个实体提及已经用角色标签注释。该数据集是研究角色刻画的宝贵资源，并对新闻分析具有更广泛的影响。我们描述了数据集的特征和注释过程，并报告了使用LLMS在文档，段落和段落级别上使用LLMS进行微调的最新多语言变压器和层次零拍学习的评估结果。句子。

Title: SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Authors: M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14739
Pdf URL: https://arxiv.org/pdf/2502.14739
Copy Paste: [[2502.14739]] SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines(https://arxiv.org/abs/2502.14739)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
摘要：大型语言模型（LLM）表现出在数学，物理学和计算机科学等主流学科的熟练程度上非常熟练。但是，人类知识涵盖了200多个专业学科，远远超过了现有基准的范围。 LLM在许多专业领域中的功能在光线行业，农业和面向服务的学科中的评估不足。为了解决这一差距，我们提出了SuperGPQA，这是一个全面的基准，可评估285个学科的研究生级知识和推理能力。我们的基准测试采用了一种新型的人体协作过滤机制，通过基于LLM的响应和专家反馈来消除琐碎或模棱两可的问题。我们的实验结果表明，在各种知识领域跨不同知识领域的当前最先进的LLM的表现有明显的余地（例如，以推理为中心的模型DeepSeek-R1达到了SuperGPQA上61.82％的最高准确性），突出了相当大的当前模型功能与人工通用情报之间的差距。此外，我们从管理大规模注释过程的管理中展示了全面的见解，涉及80多个专家注释和一个交互式的人类协作系统，为未来的可比范围研究计划提供了宝贵的方法学指南。

Title: HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States

Authors: Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14744
Pdf URL: https://arxiv.org/pdf/2502.14744
Copy Paste: [[2502.14744]] HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States(https://arxiv.org/abs/2502.14744)
Keywords: language model, prompt
Abstract: The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at this https URL.
摘要：与仅语言的同行相比，其他方式的整合增加了大型视力语言模型（LVLM）对安全风险（例如越狱攻击）的敏感性。尽管现有的研究主要集中在事后对准技术上，但LVLMS内的潜在安全机制在很大程度上尚未探索。在这项工作中，我们调查了LVLM在推断过程中内部激活中固有地编码与安全相关的信号。我们的发现表明，在处理不安全的提示时，LVLM会表现出不同的激活模式，可以利用这些激活模式来检测和减轻对抗性输入而无需进行广泛的微调。在此洞察力的基础上，我们引入了HiddenDetect，这是一个新颖的无调框架，可利用内部模型激活以增强安全性。实验结果表明，{HiddenDetect}在检测对LVLMS的越狱攻击方面超过了最新方法。通过利用固有的安全感知模式，我们的方法提供了一种有效且可扩展的解决方案，可增强对多模式威胁的LVLM稳健性。我们的代码将在此HTTPS URL上公开发布。

Title: Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs

Authors: Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Alden Dima, Juan Francisco Fung, Jordan Boyd-Graber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14748
Pdf URL: https://arxiv.org/pdf/2502.14748
Copy Paste: [[2502.14748]] Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs(https://arxiv.org/abs/2502.14748)
Keywords: language model, llm, hallucination
Abstract: A common use of NLP is to facilitate the understanding of large document collections, with a shift from using traditional topic models to Large Language Models. Yet the effectiveness of using LLMs for large corpus understanding in real-world applications remains under-explored. This study measures the knowledge users acquire with unsupervised or supervised LLM-based exploratory approaches or traditional topic models on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to the LLM generation process improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. We show that LLMs struggle to describe the haystack of large corpora without human help, particularly domain-specific data, and face scaling and hallucination limitations due to context length constraints. Dataset available at https://huggingface.co/datasets/zli12321/Bills.
摘要：NLP的普遍用途是促进对大型文档收集的理解，从使用传统主题模型转变为大型语言模型。然而，在现实世界应用中，使用LLM在现实应用程序中使用LLM的有效性仍未得到探索。这项研究通过两个数据集上的无监督，基于LLM的探索方法或传统主题模型来衡量知识用户获得的。尽管基于LLM的方法比传统模型的数据探索产生更多的人类可读主题，并显示出更高的平均胜利概率，但它们为特定领域的数据集生成了过于通用的主题，这些主题不容易让用户可以对文档了解太多。在LLM生成过程中增加人类的监督可以通过缓解幻觉和过度的幻觉来改善数据探索，但需要更大的努力。相反，传统。诸如潜在Dirichlet分配（LDA）之类的模型对于探索仍然有效，但对用户友好型。我们表明，LLM在没有人类帮助的情况下（尤其是特定于领域的数据）以及由于上下文长度限制而引起的面部缩放和幻觉限制，难以描述大型语料库的大麻。数据集可在https：// huggingface上找到。 CO/DATASET/ZLI12321/BILLS。

Title: TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

Authors: Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14752
Pdf URL: https://arxiv.org/pdf/2502.14752
Copy Paste: [[2502.14752]] TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators(https://arxiv.org/abs/2502.14752)
Keywords: language model, llm
Abstract: Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code, as they lack awareness of its specifications and the complexities of GPU programming. More critically, there is an urgent need for systematic evaluations tailored to Triton. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation. TritonBench features two evaluation channels: a curated set of 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces. Unlike conventional code benchmarks prioritizing functional correctness, TritonBench also profiles efficiency performance on widely deployed GPUs aligned with industry applications. Our study reveals that current state-of-the-art code LLMs struggle to generate efficient Triton operators, highlighting a significant gap in high-performance code generation. TritonBench will be available at this https URL.
摘要：Triton是一种高级Python的语言，旨在建造有效的GPU内核，由于其可移植性，灵活性和可访问性，在深度学习框架中被广泛采用。但是，编程和并行优化仍需要特里顿开发人员的大量试用和错误。尽管传统代码生成的大语言模型（LLMS）的进步，但这些模型仍在努力生成准确，性能优化的Triton代码，因为它们缺乏对其规格及其GPU编程的复杂性的认识。更重要的是，迫切需要针对特里顿量身定制的系统评估。在这项工作中，我们介绍了Tritonbench，这是Triton操作员一代的第一个全面基准。 TritonBench具有两个评估通道：GitHub的一组策划的184个现实世界运算符，以及与Pytorch接口对齐的一系列操作员。与常规的代码基准优先级正确的功能正确性不同，TritonBench还对与行业应用一致的广泛部署的GPU进行了效率性能。我们的研究表明，当前的最新代码LLM难以产生高效的Triton操作员，突出了高性能代码生成的显着差距。 TritonBench将在此HTTPS URL上提供。

Title: On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems

Authors: Juraj Vladika, Florian Matthes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14759
Pdf URL: https://arxiv.org/pdf/2502.14759
Copy Paste: [[2502.14759]] On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems(https://arxiv.org/abs/2502.14759)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has emerged as an approach to augment large language models (LLMs) by reducing their reliance on static knowledge and improving answer factuality. RAG retrieves relevant context snippets and generates an answer based on them. Despite its increasing industrial adoption, systematic exploration of RAG components is lacking, particularly regarding the ideal size of provided context, and the choice of base LLM and retrieval method. To help guide development of robust RAG systems, we evaluate various context sizes, BM25 and semantic search as retrievers, and eight base LLMs. Moving away from the usual RAG evaluation with short answers, we explore the more challenging long-form question answering in two domains, where a good answer has to utilize the entire context. Our findings indicate that final QA performance improves steadily with up to 15 snippets but stagnates or declines beyond that. Finally, we show that different general-purpose LLMs excel in the biomedical domain than in the encyclopedic one, and that open-domain evidence retrieval in large corpora is challenging.
摘要：通过减少其对静态知识的依赖并改善答案的事实，检索授权的一代（RAG）已成为增强大语模型（LLM）的一种方法。 RAG检索相关上下文片段，并根据它们生成答案。尽管采用工业的采用量增加，但仍缺乏对抹布组件的系统探索，尤其是在提供的上下文的理想大小以及基本LLM和检索方法的选择方面。为了指导强大的抹布系统的开发，我们评估各种背景尺寸，BM25和语义搜索作为检索器以及八个基本LLM。我们从通常的抹布评估中带有简短的答案，探索在两个领域中回答更具挑战性的长期问题，其中一个好的答案必须利用整个上下文。我们的发现表明，最终的质量检查性能通过多达15片片段稳步提高，但停滞不前或下降。最后，我们表明，生物医学领域中不同的通用LLM与百科全书中的excel不同，而大型语料库中的开放域证据检索是具有挑战性的。
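
Illustrative sketch (not from the paper): the abstract above evaluates BM25 as a retriever and varies the number of context snippets passed to the LLM. The snippet below shows one common form of BM25 scoring with a top-k cutoff. The function name bm25_top_k, the whitespace tokenizer, and the defaults k1=1.5 and b=0.75 are assumptions chosen for illustration and are not the paper's configuration.

```python
import math
from collections import Counter


def bm25_top_k(query, snippets, k=3, k1=1.5, b=0.75):
    """Return the k snippets with the highest BM25 score for the query."""
    if not snippets:
        return []
    docs = [s.lower().split() for s in snippets]  # naive tokenizer (assumption)
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many snippets each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))

    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, idx))

    # Highest score first; ties keep the original snippet order.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [snippets[i] for _, i in ranked[:k]]


if __name__ == "__main__":
    corpus = [
        "aspirin reduces fever",
        "the capital of france is paris",
        "aspirin aspirin dosage guidelines",
    ]
    print(bm25_top_k("aspirin dosage", corpus, k=2))
```

The parameter k plays the role of the context size studied in the abstract: raising it adds lower-ranked snippets to the prompt.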

Title: Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning

Authors: Juraj Vladika, Ivana Hacajová, Florian Matthes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14765
Pdf URL: https://arxiv.org/pdf/2502.14765
Copy Paste: [[2502.14765]] Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning(https://arxiv.org/abs/2502.14765)
Keywords: llm
Abstract: Fact verification (FV) aims to assess the veracity of a claim based on relevant evidence. The traditional approach for automated FV includes a three-part pipeline relying on short evidence snippets and encoder-only inference models. More recent approaches leverage the multi-turn nature of LLMs to address FV as a step-by-step problem where questions seeking additional context are generated and answered until there is enough information to make a decision. This iterative method makes the verification process rational and explainable. While these methods have been tested for encyclopedic claims, exploration on domain-specific and realistic claims is missing. In this work, we apply an iterative FV system on three medical fact-checking datasets and evaluate it with multiple settings, including different LLMs, external web search, and structured reasoning using logic predicates. We demonstrate improvements in the final performance over traditional approaches and the high potential of step-by-step FV systems for domain-specific claims.
摘要：事实验证（FV）旨在根据相关证据评估索赔的真实性。自动FV的传统方法包括依靠简短证据片段和仅编码的推理模型的三部分管道。最新的方法利用LLM的多转变性质将FV作为逐步问题解决，其中要生成和回答其他上下文的问题，直到有足够的信息来做出决定为止。这种迭代方法使验证过程合理且可解释。尽管这些方法已经测试了百科全书的主张，但缺少对域特异性和现实主张的探索。在这项工作中，我们将迭代FV系统应用于三个医学事实检查数据集，并使用多个设置进行评估，包括不同的LLM，外部Web搜索和使用逻辑谓词进行结构化推理。我们证明了对传统方法的最终表现以及针对特定领域特定索赔的分步FV系统的高潜力的改进。

Title: Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis

Authors: Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14767
Pdf URL: https://arxiv.org/pdf/2502.14767
Copy Paste: [[2502.14767]] Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis(https://arxiv.org/abs/2502.14767)
Keywords: language model, llm, agent
Abstract: With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. Large language models (LLMs) have recently demonstrated strong quantitative and qualitative reasoning abilities, and multi-agent LLM debates have shown promise in handling complex reasoning tasks by exploring diverse perspectives and reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a framework which converts scientific papers into LLM personas that debate their respective novelties. To emphasize structured, critical reasoning rather than focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling fine-grained analysis of independent novelty arguments within scholarly articles. Through experiments on scientific literature across various domains, evaluated by expert researchers, we demonstrate that ToD generates informative arguments, effectively contrasts papers, and supports researchers in their literature review.
摘要：随着现代技术促进的研究的指数增长和改善的可访问性，科学发现在跨领域内和跨领域的分散越来越多。这使得评估相关工作（尤其是来自不同研究社区的工作的意义，新颖性，增量发现和同等思想）的重要性，新颖性，增量发现和同等思想。大型语言模型（LLMS）最近表现出了强大的定量和定性推理能力，多代理LLM辩论通过探索各种观点和推理路径来处理复杂的推理任务，在处理复杂的推理任务方面有希望。受到此启发，我们介绍了驱散树（TOD），该框架将科学论文转换为LLM角色，以辩论各自的新颖性。为了强调结构化的批判性推理，而不是仅仅专注于结果，Tod动态构建了辩论树，从而对学术文章中的独立新颖性参数进行了精细的分析。通过由专家研究人员评估的各个领域的科学文献的实验，我们证明TOD产生了信息的论点，有效地对比了论文，并在其文献综述中支持研究人员。

Title: Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Authors: Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14768
Pdf URL: https://arxiv.org/pdf/2502.14768
Copy Paste: [[2502.14768]] Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning(https://arxiv.org/abs/2502.14768)
Keywords: llm, prompt
Abstract: Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make some key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization abilities to the challenging math benchmarks AIME and AMC.
摘要：受DeepSeek-R1成功的启发，我们探讨了大型推理模型中基于规则的增强学习（RL）的潜力。为了分析推理动力学，我们将合成的逻辑难题用作训练数据，因为它们的可控复杂性和直接的答案验证。我们做出一些关键的技术贡献，从而导致有效稳定的RL培训：强调思维和答案过程的系统提示，严格的格式奖励功能，惩罚了对选择快捷方式的产出以及可实现稳定收敛的直接培训食谱。我们的7b模型开发了先进的推理技能，例如反射，验证和摘要，而逻辑语料库中没有。值得注意的是，在仅培训了5K逻辑问题之后，它证明了具有挑战性的数学基准AIME和AMC的概括能力。
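
Illustrative sketch (not from the paper): the abstract above mentions a stringent, rule-based format reward that penalizes outputs for taking shortcuts. The snippet below shows what such a rule could look like. The <think> and <answer> tag names, the reward values of 1.0 and -1.0, and the function name format_reward are assumptions for illustration, since the abstract does not specify them.

```python
import re

# Expected layout (assumption): one reasoning block followed by one answer block.
_LAYOUT = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def format_reward(output: str) -> float:
    """Return 1.0 for a well-formed output and -1.0 otherwise."""
    match = _LAYOUT.fullmatch(output)
    if match is None:
        return -1.0  # missing reasoning, missing answer, or stray text
    if any(output.count(tag) != 1 for tag in _TAGS):
        return -1.0  # repeated or nested tags
    if not match.group(1).strip() or not match.group(2).strip():
        return -1.0  # empty reasoning or empty answer
    return 1.0


if __name__ == "__main__":
    print(format_reward("<think>A lies, so B is a knight.</think><answer>B</answer>"))
    print(format_reward("<answer>B</answer>"))
```

Because the check is a fixed rule rather than a learned model, it is cheap to evaluate and cannot be satisfied by skipping the reasoning block.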

Title: SurveyX: Academic Survey Automation via Large Language Models

Authors: Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Simin Niu, Shichao Song, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14776
Pdf URL: https://arxiv.org/pdf/2502.14776
Copy Paste: [[2502.14776]] SurveyX: Academic Survey Automation via Large Language Models(https://arxiv.org/abs/2502.14776)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated exceptional comprehension capabilities and a vast knowledge base, suggesting that LLMs can serve as efficient tools for automated survey generation. However, recent research related to automated survey generation remains constrained by some critical limitations like finite context window, lack of in-depth content discussion, and absence of systematic evaluation frameworks. Inspired by human writing processes, we propose SurveyX, an efficient and organized system for automated survey generation that decomposes the survey composing process into two phases: the Preparation and Generation phases. By innovatively introducing online reference retrieval, a pre-processing method called AttributeTree, and a re-polishing process, SurveyX significantly enhances the efficacy of survey composition. Experimental evaluation results show that SurveyX outperforms existing automated survey generation systems in content quality (0.259 improvement) and citation quality (1.76 enhancement), approaching human expert performance across multiple evaluation dimensions. Examples of surveys generated by SurveyX are available on this http URL
摘要：大型语言模型（LLM）表现出了出色的理解能力和庞大的知识库，这表明LLM可以用作自动化调查生成的有效工具。但是，与自动调查生成有关的最新研究仍然受到一些临界局限性的限制，例如有限上下文窗口，缺乏深入的内容讨论以及缺乏系统评估框架。在人类写作过程的启发下，我们提出了SurveyX，这是一种自动化调查生成的高效且有组织的系统，将调查过程分解为两个阶段：制备和生成阶段。通过创新引入在线参考检索，一种称为attributetree的预处理方法，以及重新抛光过程，Suressionx显着提高了调查组成的功效。实验评估结果表明，SurveyX在内容质量（0.259提高）和引文质量（1.76增强）方面的表现优于现有的自动化测量系统，从而在多个评估维度上接近人类专家表现。由Sureverx生成的调查的示例可在此HTTP URL上提供

Title: Harnessing PDF Data for Improving Japanese Large Multimodal Models

Authors: Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.14778
Pdf URL: https://arxiv.org/pdf/2502.14778
Copy Paste: [[2502.14778]] Harnessing PDF Data for Improving Japanese Large Multimodal Models(https://arxiv.org/abs/2502.14778)
Keywords: language model
Abstract: Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs. We plan to make the source code and data publicly available upon acceptance.
摘要：大型多模型模型（LMM）在英语方面表现出很强的性能，但是由于缺乏高质量的培训数据，它们在日语中的有效性仍然有限。当前的日本LMM通常依靠翻译的英语数据集，从而限制了它们捕获日本特定文化知识的能力。为了解决这个问题，我们探讨了日本PDF数据作为培训资源的潜力，该领域在很大程度上未被充分利用。我们引入了一条全自动的管道，该管道利用预处理的模型通过布局分析，OCR和视觉 - 语言配对从PDF中提取图像文本对，从而消除了手动注释的需求。此外，我们从提取的图像文本对构建指令数据以丰富培训数据。为了评估PDF衍生数据的有效性，我们培训日本LMM并评估其在日本LMM基准上的性能。我们的结果表明，苍鹭板凳的性能增长范围从3.9％到13.8％。进一步的分析强调了PDF衍生数据对各种因素（例如模型大小和语言模型）的影响，从而增强了其作为日本LMM的多模式资源的价值。我们计划在接受后公开提供源代码和数据。

Title: ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

Authors: Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, Minji Kim
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.14780
Pdf URL: https://arxiv.org/pdf/2502.14780
Copy Paste: [[2502.14780]] ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting(https://arxiv.org/abs/2502.14780)
Keywords: language model
Abstract: Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.
摘要：高效和保护隐私的多模式互动至关重要，因为AR，VR和具有强大相机的现代智能手机成为人类计算机通信的主要接口。现有强大的大型视觉模型（VLM）实现多模式互动通常依赖于基于云的处理，从而通过将敏感视觉数据传输到服务器以及（2）实时，实时，实时， - 设备可用性。本文探讨了视觉指令重写，这是一种新颖的方法，将多模式指令转换为仅文本命令，从而使轻质的放在设备指令重写器VLMS（250m参数）与现有对话AI系统的无缝集成，并增强视觉数据隐私。为了实现这一目标，我们介绍了14个域中39,000多个示例的数据集，并开发了一个紧凑的VLM，该数据集在图像字幕的数据集中预定，并进行了微调以重写。通过NLG指标进行评估的实验结果，例如BLEU，流星和胭脂，以及语义解析分析，证明即使是量化的版本的模型（<500MB存储足迹），也可以实现有效的指令重写，从而实现以隐私为中心的多态，多态度的指令AI应用程序。

Title: Rapid Word Learning Through Meta In-Context Learning

Authors: Wentao Wang, Guangyuan Jiang, Tal Linzen, Brenden M. Lake
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14791
Pdf URL: https://arxiv.org/pdf/2502.14791
Copy Paste: [[2502.14791]] Rapid Word Learning Through Meta In-Context Learning(https://arxiv.org/abs/2502.14791)
Keywords: language model, llm
Abstract: Humans can quickly learn a new word from a few illustrative examples, and then systematically and flexibly use it in novel contexts. Yet the abilities of current language models for few-shot word learning, and methods for improving these abilities, are underexplored. In this study, we introduce a novel method, Meta-training for IN-context learNing Of Words (Minnow). This method trains language models to generate new examples of a word's usage given a few in-context examples, using a special placeholder token to represent the new word. This training is repeated on many new words to develop a general word-learning ability. We find that training models from scratch with Minnow on human-scale child-directed language enables strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Furthermore, through discriminative and generative evaluations, we demonstrate that finetuning pre-trained LLMs with Minnow improves their ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, based on one or a few in-context examples. These findings highlight the data efficiency of Minnow and its potential to improve language model performance in word learning tasks.
摘要：人类可以从一些说明性的示例中快速学习一个新词，然后在新颖的环境中系统地灵活地使用它。然而，当前语言模型用于几个单词学习的能力以及提高这些能力的方法的能力尚未得到充实。在这项研究中，我们介绍了一种新颖的方法，即用于文字的文字学习的元训练（Minnow）。该方法训练语言模型，以使用特殊占位符代表来表示新单词，从而生成单词用法的新示例。在许多新单词上重复进行此培训，以发展一般的单词学习能力。我们发现，在人体规模的儿童指导的语言上从头开始训练模型可以实现强大的单词学习，可与大型语言模型（LLM）相媲美，该模型（LLM）在数量级的范围内进行了更多的数据。此外，通过歧视性和生成性评估，我们证明，使用MinNow进行预先培训的LLM提高了他们在新单词之间歧视新单词的句法类别的能力很少有文本示例。这些发现突出了Minnow的数据效率及其在单词学习任务中提高语言模型性能的潜力。
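
Illustrative sketch for the Minnow entry above: the abstract describes replacing a new word with a special placeholder token across a few in-context examples. The snippet below shows one possible way to build such a sequence. The placeholder string, the separator, the function name, and the example sentences are all assumptions for illustration, not the authors' data format.

```python
import re


def to_placeholder_sequence(word, examples, placeholder="[new-token]", sep=" <sep> "):
    """Replace every whole-word occurrence of `word` with a placeholder token
    and join the usage examples into one in-context sequence.

    The placeholder and separator strings are hypothetical choices.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return sep.join(pattern.sub(placeholder, ex) for ex in examples)


# Hypothetical child-directed usage examples for the word "ball".
episode = to_placeholder_sequence("ball", ["Throw the ball here.", "The ball is red."])
print(episode)
```

A model trained on many such episodes would see the same placeholder stand in for a different word each time, which is the idea the abstract describes for developing a general word-learning ability.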

Title: From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Authors: Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14802
Pdf URL: https://arxiv.org/pdf/2502.14802
Copy Paste: [[2502.14802]] From RAG to Memory: Non-Parametric Continual Learning for Large Language Models(https://arxiv.org/abs/2502.14802)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Our ability to continuously acquire, organize, and leverage knowledge is a key feature of human intelligence that AI systems must approximate to unlock their full potential. Given the challenges in continual learning with large language models (LLMs), retrieval-augmented generation (RAG) has become the dominant way to introduce new information. However, its reliance on vector retrieval hinders its ability to mimic the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures like knowledge graphs to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks drops considerably below standard RAG. We address this unintended deterioration and propose HippoRAG 2, a framework that outperforms standard RAG comprehensively on factual, sense-making, and associative memory tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs. Our code and data will be released at this https URL.
摘要：我们连续获取，组织和利用知识的能力是人类智能的关键特征，AI系统必须大约释放其全部潜力。鉴于大语言模型（LLMS）的持续学习挑战，检索声明的一代（RAG）已成为引入新信息的主要方式。但是，它对矢量检索的依赖阻碍了其模仿人类长期记忆的动态和相互联系的能力。最近的RAG方法将媒介嵌入具有各种结构（例如知识图）的嵌入，以解决这些差距中的一些，即感知和关联性。但是，它们在更基本的事实记忆任务上的性能大大低于标准抹布。我们解决了这种意外的恶化，并提出了Hipporag 2，该框架在事实，感知和关联的内存任务上均超过了标准抹布。 Hipporag 2建立在Hipporag中使用的个性化Pagerank算法的基础上，并通过更深入的通过集成和更有效的在线使用LLM来增强它。这种组合使该抹布系统更接近人类长期记忆的有效性，与最新的嵌入模型相比，关联记忆任务的提高了7％，同时也表现出了卓越的事实知识和感知的记忆能力。这项工作为LLM的非参数持续学习铺平了道路。我们的代码和数据将在此HTTPS URL上发布。
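
Illustrative sketch for the HippoRAG 2 entry above: the abstract says the framework builds on the Personalized PageRank algorithm. The snippet below is a minimal power-iteration version of that general algorithm on a tiny made-up graph. It is not the authors' implementation; the graph, node names, damping factor, and function name are assumptions.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iters=100):
    """Power iteration for Personalized PageRank.

    graph: dict mapping node -> list of out-neighbours (every neighbour must be a key).
    seeds: dict mapping node -> restart weight (normalised below).
    """
    nodes = list(graph)
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Every step, a (1 - damping) share of the mass jumps back to the seeds.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: return its mass to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


# Hypothetical toy graph linking one query entity to three passages.
toy_graph = {
    "query_entity": ["passage_a", "passage_b"],
    "passage_a": ["query_entity", "passage_c"],
    "passage_b": ["query_entity"],
    "passage_c": ["passage_a"],
}
scores = personalized_pagerank(toy_graph, seeds={"query_entity": 1.0})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Seeding the walk at the query's entities biases the scores toward nodes that are well connected to the query, which is what lets a graph-based retriever surface associated passages rather than only nearest neighbours in embedding space.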

Title: eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables

Authors: Luis Antonio Gutiérrez Guanilo, Mir Tafseer Nayeem, Cristian López, Davood Rafiei
Subjects: cs.CL, cs.AI, cs.DB, cs.HC
Abstract URL: https://arxiv.org/abs/2502.14820
Pdf URL: https://arxiv.org/pdf/2502.14820
Copy Paste: [[2502.14820]] eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables(https://arxiv.org/abs/2502.14820)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated exceptional versatility across diverse domains, yet their application in e-commerce remains underexplored due to a lack of domain-specific datasets. To address this gap, we introduce eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce, including detailed product attributes and user-specific queries. Leveraging eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to produce high-quality, attribute-specific product reviews from structured tabular data. Fine-tuned models were rigorously evaluated using standard Table2Text metrics, alongside correctness, faithfulness, and fluency assessments. Our results demonstrate substantial improvements in generating contextually accurate reviews, highlighting the transformative potential of tailored datasets and fine-tuning methodologies in optimizing e-commerce workflows. This work highlights the potential of LLMs in e-commerce workflows and the essential role of domain-specific datasets in tailoring them to industry-specific challenges.
摘要：大型语言模型（LLMS）在各种领域都表现出了异常的多功能性，但是由于缺乏特定领域的数据集，它们在电子商务中的应用仍未得到充满激光。为了解决这一差距，我们介绍了EC-TAB2Text，这是一个新颖的数据集，旨在捕获电子商务的复杂性，包括详细的产品属性和特定于用户的查询。利用EC-TAB2Text，我们专注于从产品表中的文本生成，使LLMS能够从结构化表格数据中产生高质量的，特定于属性的产品评论。使用标准表2Text指标对微调模型进行了严格评估，并进行了正确的性，忠诚和流利度评估。我们的结果表明，在生成上下文准确的评论方面有了很大的改进，突出了量身定制的数据集和优化电子商务工作流程的微调方法的变革潜力。这项工作突出了LLM在电子商务工作流程中的潜力，以及针对特定于行业特定挑战的特定领域数据集的重要作用。

Title: Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps

Authors: Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, Yonatan Belinkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14829
Pdf URL: https://arxiv.org/pdf/2502.14829
Copy Paste: [[2502.14829]] Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps(https://arxiv.org/abs/2502.14829)
Keywords: language model, prompt
Abstract: When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. However, despite much work on CoT prompting, it is unclear if CoT reasoning is faithful to the models' parametric beliefs. We introduce a framework for measuring parametric faithfulness of generated reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases information contained in reasoning steps from model parameters. We perform experiments unlearning CoTs of four LMs prompted on four multi-choice question answering (MCQA) datasets. Our experiments show that FUR is frequently able to change the underlying models' prediction by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models post-unlearning support different answers, hinting at a deeper effect of unlearning. Importantly, CoT steps identified as important by FUR do not align well with human notions of plausibility, emphasizing the need for specialized alignment.
摘要：当提示逐步思考时，语言模型（LMS）产生了一系列思想链（COT），这是该模型据称用于产生其预测的一系列推理步骤。但是，尽管在COT提示上进行了很多工作，但尚不清楚COT推理是否忠实于模型的参数信念。我们介绍了一个框架，以衡量产生的推理的参数忠诚，并通过学习推理步骤（fur）提出忠诚，这是该框架的一个实例。从模型参数中的推理步骤中包含的信息。我们执行在四个多选择性答案（MCQA）数据集中提示的四个LMS的COTS的实验。我们的实验表明，毛皮通常能够通过学习关键步骤来改变基本模型的预测，这表明何时cot是参数忠实的。进一步的分析表明，模型产生后的COTS支持不同的答案，这暗示了不学习的更深层次的效果。重要的是，通过皮毛确定为重要的COT步骤与人类的合理性概念不符，强调需要专业的一致性

Title: Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs

Authors: Danni Liu, Jan Niehues
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14830
Pdf URL: https://arxiv.org/pdf/2502.14830
Copy Paste: [[2502.14830]] Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs(https://arxiv.org/abs/2502.14830)
Keywords: language model, llm
Abstract: While large language models demonstrate remarkable capabilities at task-specific applications through fine-tuning, extending these benefits across diverse languages is essential for broad accessibility. However, effective cross-lingual transfer is hindered by LLM performance gaps across languages and the scarcity of fine-tuning data in many languages. Through analysis of LLM internal representations from over 1,000 language pairs, we discover that middle layers exhibit the strongest potential for cross-lingual alignment. Building on this finding, we propose a middle-layer alignment objective integrated into task-specific training. Our experiments on slot filling, machine translation, and structured text generation show consistent improvements in cross-lingual transfer, especially to lower-resource languages. The method is robust to the choice of alignment languages and generalizes to languages unseen during alignment. Furthermore, we show that separately trained alignment modules can be merged with existing task-specific modules, improving cross-lingual capabilities without full re-training. Our code is publicly available (this https URL).
摘要：尽管大型语言模型通过微调在特定于任务的应用程序中表现出显着的功能，但将这些好处扩展到各种语言中对于广泛的可访问性至关重要。但是，LLM的性能差距跨语言和许多语言的微调数据稀缺，从而阻碍了有效的跨语性转移。通过分析来自1,000多种语言对的LLM内部表示形式，我们发现中层层具有最强的跨语性对准潜力。在这一发现的基础上，我们提出了一个集成到特定于任务的培训中的中层路线目标。我们在插槽填充，机器翻译和结构化文本生成方面的实验表现出跨语言转移的一致改进，尤其是对低资产阶级语言。该方法可以选择对齐语言的选择，并在对齐过程中概括为看不见的语言。此外，我们表明可以将经过训练的对齐模块与现有特定任务的模块合并，从而在不完全重新训练的情况下提高了跨语性功能。我们的代码公开可用（此HTTPS URL）。

Title: Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

Authors: Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14837
Pdf URL: https://arxiv.org/pdf/2502.14837
Copy Paste: [[2502.14837]] Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs(https://arxiv.org/abs/2502.14837)
Keywords: llm
Abstract: Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from the dimensions of queries and keys that contribute less to the attention scores; for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.
摘要：多头潜在注意力（MLA）是DeepSeek提出的创新架构，旨在通过将键值（KV）缓存重大压缩为潜在矢量来确保有效和经济的推断。与MLA相比，采用多头关注的标准LLM（MHA）及其变体（例如分组疑问）（GQA）表现出明显的成本劣势。使训练有素的LLM（例如Llama）能够在不从头开始预先培训的情况下快速适应MLA，这既有意义又具有挑战性。本文提出了从MHA到MLA（MHA2MLA）过渡的第一种数据有效的微调方法，其中包括两个关键组成部分：对于部分绳索，我们从查询和键的范围中删除了对注意分数的较小的绳索，这些绳索对注意力评分的贡献较少，，即对于低级近似，我们根据键和值的预训练参数引入关节SVD近似值。这些精心设计的策略使MHA2MLA仅使用一小部分（0.3％至0.6％）的数据恢复性能，从而大大降低了推理成本，同时与KV缓存量化等压缩技术无缝集成。例如，Llama2-7B的KV高速缓存大小降低了92.19％，长板台性能下降了0.5％。
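
Illustrative arithmetic for the MHA2MLA entry above: to give a sense of scale for the quoted 92.19% KV cache reduction, the snippet below computes the size of a standard multi-head-attention KV cache for Llama2-7B (32 layers, 32 KV heads, head dimension 128) and applies the quoted reduction. The 16-bit cache precision and the 4096-token context are assumptions chosen for illustration, and the snippet does not model how the paper actually reaches that figure.

```python
# Llama2-7B uses standard multi-head attention: 32 layers, 32 KV heads, head dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
BYTES_PER_VALUE = 2   # assumption: fp16/bf16 cache entries
REDUCTION = 0.9219    # reduction figure quoted in the abstract


def kv_cache_bytes(num_tokens):
    """Bytes needed to cache keys and values for num_tokens tokens of one sequence."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # 2 = keys + values
    return num_tokens * per_token


context = 4096        # assumption: illustrative context length
baseline = kv_cache_bytes(context)
compressed = baseline * (1.0 - REDUCTION)
print(f"baseline:   {baseline / 2**20:.0f} MiB")
print(f"compressed: {compressed / 2**20:.0f} MiB")
```

Under these assumptions the per-sequence cache shrinks from about 2 GiB to roughly 160 MiB, which is why KV cache compression matters for serving many long sequences at once.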

Title: Revealing and Mitigating Over-Attention in Knowledge Editing

Authors: Pinzheng Wang, Zecheng Tang, Keyan Zhou, Juntao Li, Qiaoming Zhu, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14838
Pdf URL: https://arxiv.org/pdf/2502.14838
Copy Paste: [[2502.14838]] Revealing and Mitigating Over-Attention in Knowledge Editing(https://arxiv.org/abs/2502.14838)
Keywords: language model, llm
Abstract: Large Language Models have demonstrated superior performance across a wide range of tasks, but they still exhibit undesirable errors due to incorrect knowledge learned from the training data. To avoid this, knowledge editing methods emerged to precisely edit the specific model knowledge by efficiently modifying a very small percentage of parameters. However, those methods can lead to the problem of Specificity Failure, where the existing knowledge and capabilities are severely degraded due to editing. Our preliminary analysis indicates that Specificity Failure primarily stems from the model's attention heads assigning excessive attention scores to entities related to the edited knowledge, thereby unduly focusing on specific snippets within the context, which we denote as the Attention Drift phenomenon. To mitigate this Attention Drift issue, we introduce a simple yet effective method, Selective Attention Drift Restriction (SADR), which introduces an additional regularization term during the knowledge editing process to restrict changes in the attention weight distribution, thereby preventing undue focus on the edited entity. Experiments on five frequently used strong LLMs demonstrate the effectiveness of our method, where SADR can significantly mitigate Specificity Failure in the predominant knowledge editing tasks.
摘要：大型语言模型在各种任务中都表现出卓越的性能，但是由于从培训数据中学到的知识不正确，它们仍然表现出不良的错误。为了避免这种情况，出现了知识编辑方法，可以通过有效修改很小比例的参数来精确编辑特定的模型知识。但是，这些方法可能导致特异性失败的问题：当与编辑知识相关的内容发生在上下文中时，它会无意中损害其他先前存在的知识。但是，这些方法可能导致特异性失败的问题，因为由于编辑而严重降低了现有的知识和功能。我们的初步表明，特异性失败主要源于模型的注意力头，将与编辑知识相关的实体分配过度注意力分数，从而不适当地集中在上下文中的特定片段，我们将其表示为注意漂移现象。为了减轻此类注意力漂移问题，我们引入了一种简单而有效的方法选择性注意力漂移限制}（SADR），该方法在知识编辑过程中引入了一个额外的正规化术语，以限制注意力重量分布的变化，从而防止对编辑的不当关注实体。在五个经常使用的强LLM上进行的实验证明了我们方法的有效性，在该方法中，SADR可以显着减轻主要知识编辑任务中的特异性失败。

Title: GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks

Authors: Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14848
Pdf URL: https://arxiv.org/pdf/2502.14848
Copy Paste: [[2502.14848]] GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks(https://arxiv.org/abs/2502.14848)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have shown great promise in tool-making, yet existing frameworks often struggle to efficiently construct reliable toolsets and are limited to single-task settings. To address these challenges, we propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that dynamically constructs and evolves a hierarchical graph of reusable tools across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, TabMWP). Our results show that GATE achieves up to 4.3x faster milestone completion in Minecraft compared to the previous SOTA, and provides an average improvement of 9.23% over existing tool-making methods in code generation tasks and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, balancing tool quantity, complexity, and functionality while maintaining high efficiency. Code and data are available at this https URL.
摘要：大型语言模型（LLMS）在工具制造方面表现出了巨大的希望，但是现有的框架通常很难有效地构建可靠的工具集，并且仅限于单任务设置。为了应对这些挑战，我们提出了GATE（基于图的自适应工具演变），这是一个自适应框架，该框架在多种情况下动态构建和进化可重复使用的工具的层次图。我们在开放式任务（Minecraft），基于代理的任务（文本Craft，Dabench）和代码生成任务（数学，日期，TABMWP）上评估门。我们的结果表明，与以前的SOTA相比，Minecraft的Gate在Minecraft中实现了4.3倍，并且在代码生成任务中的现有工具制定方法的平均提高为9.23％，代理任务中的平均提高了10.03％。 Gate展示了自适应演化的力量，平衡工具数量，复杂性和功能性，同时保持高效率。代码和数据可在\ url {this HTTPS url}上获得。

Title: CLIPPER: Compression enables long-context synthetic data generation

Authors: Chau Minh Pham, Yapei Chang, Mohit Iyyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14854
Pdf URL: https://arxiv.org/pdf/2502.14854
Copy Paste: [[2502.14854]] CLIPPER: Compression enables long-context synthetic data generation(https://arxiv.org/abs/2502.14854)
Keywords: llm, chain-of-thought
Abstract: LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chains of thought. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
摘要：LLM开发人员越来越依赖综合数据，但是为复杂的长篇小说推理任务生成高质量的数据仍然具有挑战性。我们介绍了CLIPPER，这是一种基于压缩的方法，用于生成针对叙事索赔验证的合成数据 - 一项任务，需要在书上进行推理以验证给定的索赔。 Clipper并没有直接从本书的原始文本中产生索赔，而这会导致伪像的索赔，而是将书籍首先压缩为章节大纲和书籍摘要，然后使用这些中间表示形式来产生复杂的索赔和相应的链条。与幼稚的方法相比，快船产生的主张更有效，扎根和复杂。我们使用剪子，构建了一个19k合成书籍的数据集，并构建了其源文本和经过思考推理的材料，并将其用于微调三个开放重量模型。我们的最佳模型在叙事索赔验证方面取得了突破性的结果（我们的测试集中的准确性从28％到76％），并为NOCHA排行榜上的Sub-10b模型设置了新的最新设备。进一步的分析表明，我们的模型产生了更详细和扎根的基础推理，同时也提高了其他叙事理解任务的绩效（例如叙事Qa）。

Title: FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

Authors: Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14856
Pdf URL: https://arxiv.org/pdf/2502.14856
Copy Paste: [[2502.14856]] FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling(https://arxiv.org/abs/2502.14856)
Keywords: language model, llm
Abstract: Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2.
摘要：投机性采样已成为一种重要技术，用于通过利用草稿然后验证的机制来加速大型语言模型（LLMS）的自动回归生成过程，每次向前通行证产生多个令牌。虽然最新的投机抽样方法仅使用单层和语言建模（LM）作为达到令人印象深刻的层压缩的模型，但对于大型vocabulary LLM，例如Llama- 3-8b，词汇为128K令牌。为了解决这个问题，我们提出了FR-Spec，这是一个频率排名的投机采样框架，可通过词汇空间压缩优化候选候选者选择。通过将草稿搜索限制为频率优先的令牌子集，我们的方法将LM头部计算开销降低了75％，同时确保最终输出分布的等效性。多个数据集的实验表明，在最先进的投机采样方法eagle-2上，平均为1.12 $ \ times $加速。
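
Illustrative sketch for the FR-Spec entry above: the abstract describes restricting the draft model's search to a frequency-prioritized token subset so that the LM head scores fewer vocabulary entries. The snippet below shows that idea on a toy vocabulary, keeping 25% of the tokens (so 75% of the per-token dot products are skipped). The function names, toy weights, corpus, and keep fraction are assumptions, and the verification step by the full target model, which is what preserves the final output distribution, is not shown.

```python
from collections import Counter


def frequent_subset(corpus_token_ids, vocab_size, keep_fraction=0.25):
    """Return the ids of the most frequent tokens, most frequent first."""
    keep = max(1, int(vocab_size * keep_fraction))
    counts = Counter(corpus_token_ids)
    return [tok for tok, _ in counts.most_common(keep)]


def draft_logits(hidden, lm_head_rows, subset):
    """Score only the subset rows of the LM head: one dot product per kept token."""
    return {
        tok: sum(h * w for h, w in zip(hidden, lm_head_rows[tok]))
        for tok in subset
    }


# Hypothetical toy setup: 8-token vocabulary, 2-dimensional hidden state.
vocab_size = 8
lm_head_rows = [[float(i), 1.0] for i in range(vocab_size)]
corpus = [3, 3, 3, 5, 5, 1, 0, 3, 5, 7]

subset = frequent_subset(corpus, vocab_size)   # keeps 2 of the 8 tokens
logits = draft_logits([0.5, -1.0], lm_head_rows, subset)
draft_token = max(logits, key=logits.get)
print(subset, draft_token)
```

Because the draft is only a proposal, a rare token that falls outside the subset costs a rejected draft step rather than a wrong output, which is the trade the abstract describes.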

Title: Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning

Authors: Shuyue Stella Li, Jimin Mun, Faeze Brahman, Jonathan S. Ilgen, Yulia Tsvetkov, Maarten Sap
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.14860
Pdf URL: https://arxiv.org/pdf/2502.14860
Copy Paste: [[2502.14860]] Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning(https://arxiv.org/abs/2502.14860)
Keywords: language model, llm
Abstract: Large language models (LLMs) often fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decision-making. We present ALFA, a framework that improves LLM question-asking by (i) decomposing the notion of a "good" question into a set of theory-grounded attributes (e.g., clarity, relevance), (ii) controllably synthesizing attribute-specific question variations, and (iii) aligning models via preference-based optimization to explicitly learn to ask better questions along these fine-grained attributes. Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs dataset, composed of 17k real-world clinical interactions augmented with 80k attribute-specific preference pairs of follow-up questions, as well as a novel expert-annotated interactive healthcare QA task to evaluate question-asking abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level win-rate of 64.4% and strong generalizability. Our findings suggest that explicitly guiding question-asking with structured, fine-grained attributes offers a scalable path to improve LLMs, especially in expert application domains.
摘要：大型语言模型（LLM）通常无法在不确定性下提出有效的问题，这使其在积极的信息收集对于决策至关重要的领域中不可靠。我们提出了Alfa，这是一个框架，通过（i）将“好”问题的概念分解为一组理论基础的属性（例如，清晰度，相关性），（ii）可控属性特定问题的框架来提高llm的问答。变化，（iii）通过基于首选项的优化对齐模型，以明确学习沿这些细粒度属性提出更好的问题。我们以案例研究为重点介绍了Mediq-askDocs数据集，该数据集由17K现实世界中的临床相互作用组成，增强了80k属性特定的后续问题，以及新型专家宣传的互动式医疗保健QA评估问答能力的任务。与SOTA指导调节的LLM相比，与ALFA一致的模型在MEDIQ-ASKDOC上对诊断错误的诊断错误减少了56.6％，问题级的赢得率为64.4％，并且具有强大的概括性。我们的发现表明，使用结构化的，细粒度的属性明确指导提问，为改善LLM的可扩展路径提供了可扩展的途径，尤其是在专家应用程序领域中。

Title: LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Authors: Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han
Subjects: cs.CL, cs.AI, cs.DC, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2502.14866
Pdf URL: https://arxiv.org/pdf/2502.14866
Copy Paste: [[2502.14866]] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention(https://arxiv.org/abs/2502.14866)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at this https URL.
摘要：大型语言模型（LLM）在处理长序列方面表现出了巨大的潜力，但是由于预填充阶段的二次计算复杂性以及在解码阶段的KV缓存的较大记忆足迹，因此有效地服务于这些长篇小说模型仍然具有挑战性。为了解决这些问题，我们引入了Lserve，这是一个有效的系统，可通过混合稀疏的注意力加速长期效果LLM。该方法将预填充和解码的注意力统一的不同硬件，结构化的稀疏模式统一到单个框架中，在此框架中，对较小重要的令牌进行了计算，可以跳过块。 Lserve证明了长篇文化LLM注意力中静态和动态稀疏性的兼容性。该设计通过结合这些优化来实现乘法加速。具体而言，我们将注意力头的一半转换为预填充和解码阶段的几乎免费流媒体头。此外，我们发现只需要恒定数量的KV页面才能保留长篇小说功能，而与上下文长度无关。然后，我们设计了一个层次的KV页面选择策略，该策略基于以查询为中心的相似性动态修剪KV页面。平均而言，Lserve会加速LLM的预填充2.9倍，并在VLLM上加速1.3-2.1倍，以保持长篇文化准确性。代码在此HTTPS URL上发布。
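
Illustrative sketch for the LServe entry above: the abstract says that a constant number of KV pages suffices and that pages are selected dynamically by query-centric similarity. The snippet below shows one simple way such a selection could look, scoring each page by the dot product between the query and a single representative key per page. The representative-key summary, the function name, and the toy vectors are assumptions; the paper's hierarchical policy is not reproduced here.

```python
def select_kv_pages(query, page_keys, budget):
    """Pick `budget` page indices whose representative key best matches the query.

    page_keys: one representative key vector per KV page (a hypothetical summary,
    for example the mean of the keys stored in that page).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    ranked = sorted(
        range(len(page_keys)),
        key=lambda i: dot(query, page_keys[i]),
        reverse=True,
    )
    # Return the kept pages in positional order so attention sees them in sequence.
    return sorted(ranked[:budget])


# Hypothetical toy cache with four pages and a fixed budget of two pages.
page_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = select_kv_pages([1.0, 0.2], page_keys, budget=2)
print(selected)
```

Since the budget is fixed, the decoding cost per step under this scheme depends on the budget rather than on the total context length, which matches the constant-page observation in the abstract.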