2025-01-24

Title: Dagger Behind Smile: Fool LLMs with a Happy Ending Story

Authors: Xurui Song, Zhixin Xie, Shuo Huai, Jiayi Kong, Jun Luo
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2501.13115
Pdf URL: https://arxiv.org/pdf/2501.13115
Copy Paste: [[2501.13115]] Dagger Behind Smile: Fool LLMs with a Happy Ending Story(https://arxiv.org/abs/2501.13115)
Keywords: language model, gpt, llm, prompt
Abstract: The wide adoption of Large Language Models (LLMs) has attracted significant attention from jailbreak attacks, where adversarial prompts crafted through optimization or manual design exploit LLMs to generate malicious content. However, optimization-based attacks have limited efficiency and transferability, while manual designs are either easily detectable or demand intricate interactions with LLMs. In this paper, we first point out a novel perspective for jailbreak attacks: LLMs are more responsive to positive prompts. Based on this, we deploy Happy Ending Attack (HEA) to wrap up a malicious request in a scenario template involving a positive prompt formed mainly via a happy ending; it thus fools LLMs into jailbreaking either immediately or at a follow-up malicious request. This makes HEA both efficient and effective, as it requires only up to two steps to fully jailbreak LLMs. Extensive experiments show that our HEA can successfully jailbreak state-of-the-art LLMs, including GPT-4o, Llama3-70b, and Gemini-pro, and achieves an 88.79% Attack Success Rate on average. We also provide potential quantitative explanations for the success of HEA.

Title: MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking

Authors: Shihao Ji, Zihui Song, Fucheng Zhong, Jisen Jia, Zhaobo Wu, Zheyi Cao, Tianhao Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13117
Pdf URL: https://arxiv.org/pdf/2501.13117
Copy Paste: [[2501.13117]] MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking(https://arxiv.org/abs/2501.13117)
Keywords: language model, llm, prompt
Abstract: Recent advancements in large language models (LLMs) have demonstrated their impressive abilities in various reasoning and decision-making tasks. However, the quality and coherence of the reasoning process can still benefit from enhanced introspection and self-reflection. In this paper, we introduce Multiplex CoT (Chain of Thought), a method that enables LLMs to simulate a form of self-review while reasoning, by initiating double Chain of Thought (CoT) thinking. Multiplex CoT leverages the power of iterative reasoning, where the model generates an initial chain of thought and subsequently critiques and refines this reasoning with a second round of thought generation. This recursive approach allows for more coherent, logical, and robust answers, improving the overall decision-making process. We demonstrate how this method can be effectively implemented using simple prompt engineering in existing LLM architectures, achieving an effect similar to that of the Learning-Refinement Model (LRM) without the need for additional training. Additionally, we present a practical guide for implementing the method in Google Colab, enabling easy integration into real-world applications.
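As a rough illustration of the two-round prompting this abstract describes, the sketch below drafts a chain of thought and then asks for a critique and refinement of that draft. The prompt wording, the `double_cot` function, and the `llm` callable are assumptions made for illustration, not the authors' implementation; any text-in, text-out model client could be passed in.

```python
from typing import Callable, Dict


def double_cot(question: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Run two rounds of prompting: draft reasoning, then critique and refine it."""
    # Round 1: ask for an initial chain of thought.
    first_prompt = (
        f"Question: {question}\n"
        "Think step by step and give an answer."
    )
    initial = llm(first_prompt)

    # Round 2: feed the draft back and ask for a review plus a refined answer.
    second_prompt = (
        f"Question: {question}\n"
        f"Initial reasoning:\n{initial}\n"
        "Review the reasoning above, point out any flaws, "
        "and give a refined final answer."
    )
    refined = llm(second_prompt)
    return {"initial": initial, "refined": refined}
```

Passing the model as a plain callable keeps the sketch independent of any particular API; both rounds need nothing beyond prompt text, which matches the abstract's claim that no additional training is required.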

Title: Multilinguality in LLM-Designed Reward Functions for Restless Bandits: Effects on Task Performance and Fairness

Authors: Ambreesh Parthasarathy, Chandrasekar Subramanian, Ganesh Senrayan, Shreyash Adappanavar, Aparna Taneja, Balaraman Ravindran, Milind Tambe
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2501.13120
Pdf URL: https://arxiv.org/pdf/2501.13120
Copy Paste: [[2501.13120]] Multilinguality in LLM-Designed Reward Functions for Restless Bandits: Effects on Task Performance and Fairness(https://arxiv.org/abs/2501.13120)
Keywords: language model, llm, prompt
Abstract: Restless Multi-Armed Bandits (RMABs) have been successfully applied to resource allocation problems in a variety of settings, including public health. With the rapid development of powerful large language models (LLMs), they are increasingly used to design reward functions to better match human preferences. Recent work has shown that LLMs can be used to tailor automated allocation decisions to community needs using language prompts. However, this has been studied primarily for English prompts and with a focus on task performance only. This can be an issue since grassroots workers, especially in developing countries like India, prefer to work in local languages, some of which are low-resource. Further, given the nature of the problem, biases along population groups unintended by the user are also undesirable. In this work, we study the effects on both task performance and fairness when the DLM algorithm, a recent work on using LLMs to design reward functions for RMABs, is prompted with non-English language commands. Specifically, we run the model on a synthetic environment for various prompts translated into multiple languages. The prompts themselves vary in complexity. Our results show that the LLM-proposed reward functions are significantly better when prompted in English compared to other languages. We also find that the exact phrasing of the prompt impacts task performance. Further, as prompt complexity increases, performance worsens for all languages; however, it is more robust with English prompts than with lower-resource languages. On the fairness side, we find that low-resource languages and more complex prompts are both highly likely to create unfairness along unintended dimensions.

Title: Episodic Memories Generation and Evaluation Benchmark for Large Language Models

Authors: Alexis Huet, Zied Ben Houidi, Dario Rossi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13121
Pdf URL: https://arxiv.org/pdf/2501.13121
Copy Paste: [[2501.13121]] Episodic Memories Generation and Evaluation Benchmark for Large Language Models(https://arxiv.org/abs/2501.13121)
Keywords: language model, gpt, llm
Abstract: Episodic memory -- the ability to recall specific events grounded in time and space -- is a cornerstone of human cognition, enabling not only coherent storytelling, but also planning and decision-making. Despite their remarkable capabilities, Large Language Models (LLMs) lack a robust mechanism for episodic memory: we argue that integrating episodic memory capabilities into LLM is essential for advancing AI towards human-like cognition, increasing their potential to reason consistently and ground their output in real-world episodic events, hence avoiding confabulations. To address this challenge, we introduce a comprehensive framework to model and evaluate LLM episodic memory capabilities. Drawing inspiration from cognitive science, we develop a structured approach to represent episodic events, encapsulating temporal and spatial contexts, involved entities, and detailed descriptions. We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks, particularly when dealing with multiple related events or complex spatio-temporal relationships -- even in contexts as short as 10k-100k tokens.

Title: Zero-Shot Verification-guided Chain of Thoughts

Authors: Jishnu Ray Chowdhury, Cornelia Caragea
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13122
Pdf URL: https://arxiv.org/pdf/2501.13122
Copy Paste: [[2501.13122]] Zero-Shot Verification-guided Chain of Thoughts(https://arxiv.org/abs/2501.13122)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Previous works have demonstrated the effectiveness of Chain-of-Thought (COT) prompts and verifiers in guiding Large Language Models (LLMs) through the space of reasoning. However, most such studies either use a fine-tuned verifier or rely on manually handcrafted few-shot examples. In contrast, in this paper, we focus on LLM-based self-verification of self-generated reasoning steps via COT prompts in a completely zero-shot regime. To explore this setting, we design a new zero-shot prompt, which we call COT STEP, to aid zero-shot decomposition of reasoning steps and design two new zero-shot prompts for LLM-based verifiers. We evaluate the verifiers' ability to classify the correctness of reasoning chains and explore different ways to use verifier scores in guiding reasoning for various mathematical and commonsense reasoning tasks with different LLMs.

Title: Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data

Authors: Xuemiao Zhang, Liangyu Xu, Feiyu Duan, Yongwei Zhou, Sirui Wang, Jingang Wang, Xunliang Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13126
Pdf URL: https://arxiv.org/pdf/2501.13126
Copy Paste: [[2501.13126]] Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data(https://arxiv.org/abs/2501.13126)
Keywords: language model, llm
Abstract: Current large language models (LLMs) generally utilize a consistent data distribution throughout the entire pretraining process. However, as the model's ability improves, it intuitively should be pretrained with differentiated data. To achieve it, we propose the Perplexity Difference based Preference Curriculum learning (PDPC) framework, which always perceives and uses the data preferred by LLMs to train and boost them. Firstly, we introduce the PD metric to measure the difference in how well strong and weak models fit the samples. Samples with high PD are more challenging for weak models to learn and are more suitable to be arranged in the later stage of pretraining. Secondly, we propose the PD preference function to approximate the model and predict the data preference of the LLM at any time, so as to complete the arrangement of the entire data offline and ensure continuous training without interruption. Experimental results on 1.3B and 3B models demonstrate that our PDPC significantly surpasses baselines. Notably, the 3B model achieved more substantial gains, with an increased average accuracy of over 4.1% across various benchmarks.
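The abstract does not give the formula for the PD metric, so the sketch below assumes one plausible form: the gap between a weak and a strong model's perplexity on a sample, normalized by the weak model's perplexity. The function names and the simple ascending sort are illustrative assumptions; the paper's PD preference function is not reproduced here.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity of one sample from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_difference(ppl_weak: float, ppl_strong: float) -> float:
    # Assumed form: weak-minus-strong perplexity gap, normalized by the weak model.
    return (ppl_weak - ppl_strong) / ppl_weak


def order_by_pd(pd_by_sample: Dict[str, float]) -> List[str]:
    # Low-PD samples first, high-PD samples last (i.e. later in pretraining).
    return sorted(pd_by_sample, key=lambda name: pd_by_sample[name])
```

Sorting by PD is only the simplest way to place high-PD samples late in the schedule, which is the ordering the abstract argues for.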

Title: RAG-Reward: Optimizing RAG with Reward Modeling and RLHF

Authors: Hanning Zhang, Juntong Song, Juno Zhu, Yuanhao Wu, Tong Zhang, Cheng Niu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13264
Pdf URL: https://arxiv.org/pdf/2501.13264
Copy Paste: [[2501.13264]] RAG-Reward: Optimizing RAG with Reward Modeling and RLHF(https://arxiv.org/abs/2501.13264)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances Large Language Models (LLMs) with relevant and up-to-date knowledge, improving their ability to answer knowledge-intensive questions. It has been shown to enhance both generation quality and trustworthiness. While numerous works have focused on improving retrieval, generation, and evaluation, the role of reward models in reinforcement learning for optimizing RAG and establishing automated benchmarking pipelines remains underexplored. In this paper, we introduce RAG-Reward, a dataset designed to enable hallucination-free, comprehensive, reliable, and efficient RAG. We define four key metrics for assessing generation quality and develop an automated annotation pipeline that leverages multiple LLMs to generate outputs across diverse RAG scenarios. GPT-4o is used to evaluate and construct preference data. Using RAG-Reward, we train reward models and apply reinforcement learning with human feedback (RLHF) to improve LLMs' effectiveness in RAG. Experimental results show that our reward model achieves state-of-the-art performance on a held-out test set, demonstrating both the effectiveness of our approach and the quality of our dataset. Furthermore, the improved generation quality of the trained policy model highlights the feasibility of using RLHF to enhance RAG pipelines.

Title: RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering

Authors: Yang Bai, Christan Earl Grant, Daisy Zhe Wang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13297
Pdf URL: https://arxiv.org/pdf/2501.13297
Copy Paste: [[2501.13297]] RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering(https://arxiv.org/abs/2501.13297)
Keywords: language model, llm
Abstract: Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines, highlighting the effectiveness of our approach. Code and data are available at: this https URL

Title: Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents

Authors: Shrinidhi Kumbhar, Venkatesh Mishra, Kevin Coutinho, Divij Handa, Ashif Iquebal, Chitta Baral
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13299
Pdf URL: https://arxiv.org/pdf/2501.13299
Copy Paste: [[2501.13299]] Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents(https://arxiv.org/abs/2501.13299)
Keywords: language model, llm, agent
Abstract: Materials discovery and design are essential for advancing technology across various industries by enabling the development of application-specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this process. We explore the potential of LLMs to generate viable hypotheses that, once validated, can expedite materials discovery. Collaborating with materials science experts, we curated a novel dataset from recent journal publications, featuring real-world goals, constraints, and methods for designing real-world applications. Using this dataset, we test LLM-based agents that generate hypotheses for achieving given goals under specific constraints. To assess the relevance and quality of these hypotheses, we propose a novel scalable evaluation metric that emulates the process a materials scientist would use to evaluate a hypothesis critically. Our curated dataset, proposed method, and evaluation framework aim to advance future research in accelerating materials discovery and design with LLMs.

Title: Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers

Authors: Akshit Achara, Anshuman Chhabra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13302
Pdf URL: https://arxiv.org/pdf/2501.13302
Copy Paste: [[2501.13302]] Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers(https://arxiv.org/abs/2501.13302)
Keywords: language model, llm
Abstract: AI Safety Moderation (ASM) classifiers are designed to moderate content on social media platforms and to serve as guardrails that prevent Large Language Models (LLMs) from being fine-tuned on unsafe inputs. Owing to their potential for disparate impact, it is crucial to ensure that these classifiers (1) do not unfairly classify content belonging to users from minority groups as unsafe compared to content from majority groups and (2) remain robust and consistent across similar inputs. In this work, we thus examine the fairness and robustness of four widely-used, closed-source ASM classifiers: OpenAI Moderation API, Perspective API, Google Cloud Natural Language (GCNL) API, and Clarifai API. We assess fairness using metrics such as demographic parity and conditional statistical parity, comparing their performance against ASM models and a fair-only baseline. Additionally, we analyze robustness by testing the classifiers' sensitivity to small and natural input perturbations. Our findings reveal potential fairness and robustness gaps, highlighting the need to mitigate these issues in future versions of these models.
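For readers unfamiliar with demographic parity, the sketch below shows the standard textbook form of the metric applied to a moderation classifier: the gap between the rates at which two groups' content is flagged unsafe. The function names and the 0/1 flag encoding are illustrative assumptions; the paper's exact evaluation protocol is not described in the abstract.

```python
from typing import Sequence


def unsafe_rate(flags: Sequence[int]) -> float:
    """Fraction of items flagged unsafe (1 = unsafe, 0 = safe)."""
    return sum(flags) / len(flags)


def demographic_parity_gap(flags_a: Sequence[int], flags_b: Sequence[int]) -> float:
    # A gap of 0 means both groups are flagged at the same rate.
    return abs(unsafe_rate(flags_a) - unsafe_rate(flags_b))
```

For example, if one of four items from the first group is flagged and two of four from the second, the rates are 0.25 and 0.5 and the gap is 0.25.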

Title: Do as We Do, Not as You Think: the Conformity of Large Language Models

Authors: Zhiyuan Weng, Guikun Chen, Wenguan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13381
Pdf URL: https://arxiv.org/pdf/2501.13381
Copy Paste: [[2501.13381]] Do as We Do, Not as You Think: the Conformity of Large Language Models(https://arxiv.org/abs/2501.13381)
Keywords: language model, llm, agent
Abstract: Recent advancements in large language models (LLMs) revolutionize the field of intelligent agents, enabling collaborative multi-agent systems capable of tackling complex problems across various domains. However, the potential of conformity within these systems, analogous to phenomena like conformity bias and groupthink in human group dynamics, remains largely unexplored, raising concerns about their collective problem-solving capabilities and possible ethical implications. This paper presents a comprehensive study on conformity in LLM-driven multi-agent systems, focusing on three aspects: the existence of conformity, the factors influencing conformity, and potential mitigation strategies. In particular, we introduce BenchForm, a new conformity-oriented benchmark, featuring reasoning-intensive tasks and five distinct interaction protocols designed to probe LLMs' behavior in collaborative scenarios. Several representative LLMs are evaluated on BenchForm, using metrics such as conformity rate and independence rate to quantify conformity's impact. Our analysis delves into factors influencing conformity, including interaction time and majority size, and examines how the subject agent rationalizes its conforming behavior. Furthermore, we explore two strategies to mitigate conformity effects, i.e., developing enhanced personas and implementing a reflection mechanism. Several interesting findings regarding LLMs' conformity are derived from empirical results and case studies. We hope that these insights can pave the way for more robust and ethically-aligned collaborative AI systems. Our benchmark and code are available at BenchForm.

Title: Can Large Language Models Understand Preferences in Personalized Recommendation?

Authors: Zhaoxuan Tan, Zinan Zeng, Qingkai Zeng, Zhenyu Wu, Zheyuan Liu, Fengran Mo, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13391
Pdf URL: https://arxiv.org/pdf/2501.13391
Copy Paste: [[2501.13391]] Can Large Language Models Understand Preferences in Personalized Recommendation?(https://arxiv.org/abs/2501.13391)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel in various tasks, including personalized recommendations. Existing evaluation methods often focus on rating prediction, relying on regression errors between actual and predicted ratings. However, user rating bias and item quality, two influential factors behind rating scores, can obscure personal preferences in user-item pair data. To address this, we introduce PerRecBench, disassociating the evaluation from these two factors and assessing recommendation techniques on capturing the personal preferences in a grouped ranking manner. We find that the LLM-based recommendation techniques that are generally good at rating prediction fail to identify users' favored and disfavored items when the user rating bias and item quality are eliminated by grouping users. With PerRecBench and 19 LLMs, we find that while larger models generally outperform smaller ones, they still struggle with personalized recommendation. Our findings reveal the superiority of pairwise and listwise ranking approaches over pointwise ranking, PerRecBench's low correlation with traditional regression metrics, the importance of user profiles, and the role of pretraining data distributions. We further explore three supervised fine-tuning strategies, finding that merging weights from single-format training is promising but improving LLMs' understanding of user preferences remains an open research problem. Code and data are available at this https URL

Title: ExLM: Rethinking the Impact of $\texttt{[MASK]}$ Tokens in Masked Language Models

Authors: Kangjie Zheng, Junwei Yang, Siyue Liang, Bin Feng, Zequn Liu, Wei Ju, Zhiping Xiao, Ming Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13397
Pdf URL: https://arxiv.org/pdf/2501.13397
Copy Paste: [[2501.13397]] ExLM: Rethinking the Impact of $\texttt{[MASK]}$ Tokens in Masked Language Models(https://arxiv.org/abs/2501.13397)
Keywords: language model
Abstract: Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly replacing some tokens in the input sentences with $\texttt{[MASK]}$ tokens and predicting the original tokens based on the remaining context. This paper explores the impact of $\texttt{[MASK]}$ tokens on MLMs. Analytical studies show that masking tokens can introduce the corrupted semantics problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands $\texttt{[MASK]}$ tokens in the input context and models the dependencies between these expanded states. This expansion increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enhances semantic representations through context enhancement, and effectively reduces the multimodality problem commonly observed in MLMs.
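The conventional masking step that the abstract takes as its starting point can be sketched as follows. This shows only the baseline procedure (random replacement with `[MASK]` and recording the original token as the prediction target), not ExLM's expansion of mask tokens. The 15% default rate and the `mask_tokens` name are assumptions for illustration.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str], mask_rate: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly replace tokens with [MASK]; labels hold the originals to predict."""
    rng = random.Random(seed)  # seeded so the masking is reproducible
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # no loss is computed at unmasked positions
    return masked, labels
```

When several neighbouring tokens are masked at once, the surviving context can fit more than one reading, which is the corrupted semantics problem the abstract refers to.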

Title: Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

Authors: Bo Gao, Michael W. Spratling
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13428
Pdf URL: https://arxiv.org/pdf/2501.13428
Copy Paste: [[2501.13428]] Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models(https://arxiv.org/abs/2501.13428)
Keywords: language model
Abstract: Large language models have achieved remarkable success in recent years, primarily due to the implementation of self-attention mechanisms. However, traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm. We identify the latter as essential for maintaining model performance. By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic length scale factor for different token lengths based on invariance entropy, we create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths. To further improve the length extrapolation ability of the proposed attention mechanism, we introduce a re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones, enabling the model to concentrate more effectively on relevant tokens. When combined with our proposed attention mechanism, this approach demonstrates significant promise in managing longer sequences, maintaining nearly constant validation loss even at 16$\times$ the training token length while ensuring numerical stability. Our code is available at: this https URL.
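The decomposition this abstract describes can be made concrete with a small sketch: Softmax is an exponential (the non-linear transformation) followed by $l_1$ normalization, and the proposed variant swaps the exponential for Softplus. The code below covers only that substitution for a single row of attention scores. The dynamic length scale factor and the re-weighting mechanism are omitted because the abstract does not specify them, and all function names are illustrative assumptions.

```python
import math
from typing import List


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def l1_normalize(values: List[float]) -> List[float]:
    total = sum(abs(v) for v in values)
    return [v / total for v in values]


def softmax_weights(scores: List[float]) -> List[float]:
    # Softmax = exp (non-linear transformation) followed by l1 normalization.
    peak = max(scores)  # subtracting the max leaves the result unchanged
    return l1_normalize([math.exp(s - peak) for s in scores])


def softplus_weights(scores: List[float]) -> List[float]:
    # Same l1 normalization, with Softplus in place of exp.
    return l1_normalize([softplus(s) for s in scores])
```

Because Softplus grows roughly linearly for large inputs rather than exponentially, the resulting weights are less sharply peaked than Softmax weights for the same scores.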

Title: RECALL: Library-Like Behavior In Language Models is Enhanced by Self-Referencing Causal Cycles

Authors: Munachiso Nwadike, Zangir Iklassov, Toluwani Aremu, Tatsuya Hiraoka, Velibor Bojkovic, Benjamin Heinzerling, Hilal Alqaubeh, Martin Takáč, Kentaro Inui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13491
Pdf URL: https://arxiv.org/pdf/2501.13491
Copy Paste: [[2501.13491]] RECALL: Library-Like Behavior In Language Models is Enhanced by Self-Referencing Causal Cycles(https://arxiv.org/abs/2501.13491)
Keywords: language model, gpt, llm, prompt, chat
Abstract: We introduce the concept of the self-referencing causal cycle (abbreviated RECALL) - a mechanism that enables large language models (LLMs) to bypass the limitations of unidirectional causality, which underlies a phenomenon known as the reversal curse. When an LLM is prompted with sequential data, it often fails to recall preceding context. For example, when we ask an LLM to recall the line preceding "O say does that star-spangled banner yet wave" in the U.S. National Anthem, it often fails to correctly return "Gave proof through the night that our flag was still there" - this is due to the reversal curse. It occurs because language models such as ChatGPT and Llama generate text based on preceding tokens, requiring facts to be learned and reproduced in a consistent token order. While the reversal curse is often viewed as a limitation, we offer evidence of an alternative view: it is not always an obstacle in practice. We find that RECALL is driven by what we designate as cycle tokens - sequences that connect different parts of the training data, enabling recall of preceding tokens from succeeding ones. Through rigorous probabilistic formalization and controlled experiments, we demonstrate how the cycles they induce influence a model's ability to reproduce information. To facilitate reproducibility, we provide our code and experimental details at this https URL.
摘要：我们引入了自引用因果循环（简称 RECALL）的概念 - 一种使大型语言模型 (LLM) 能够绕过单向因果关系限制的机制，这种限制是所谓的逆转诅咒现象的基础。当 LLM 被提示顺序数据时，它通常无法回忆起前面的上下文。例如，当我们要求 LLM 回忆美国国歌中“哦，说那星条旗还在飘扬吗”之前的一行时，它通常无法正确返回“整个夜晚证明我们的国旗仍然在那里” - 这是由于逆转诅咒造成的。之所以发生这种情况，是因为 ChatGPT 和 Llama 等语言模型根据前面的标记生成文本，需要以一致的标记顺序学习和重现事实。虽然逆转诅咒通常被视为一种限制，但我们提供了另一种观点的证据：它在实践中并不总是一种障碍。我们发现 RECALL 是由我们指定的循环标记驱动的 - 连接训练数据不同部分的序列，能够从后续标记中回忆出前面的标记。通过严格的概率形式化和受控实验，我们展示了它们引发的循环如何影响模型重现信息的能力。为了便于重现，我们在此 https URL 上提供了我们的代码和实验细节。

Title: LLMs Can Plan Only If We Tell Them

Authors: Bilgehan Sel, Ruoxi Jia, Ming Jin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13545
Pdf URL: https://arxiv.org/pdf/2501.13545
Copy Paste: [[2501.13545]] LLMs Can Plan Only If We Tell Them(https://arxiv.org/abs/2501.13545)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have demonstrated significant capabilities in natural language processing and reasoning, yet their effectiveness in autonomous planning has been under debate. While existing studies have utilized LLMs with external feedback mechanisms or in controlled environments for planning, these approaches often involve substantial computational and development resources due to the requirement for careful design and iterative backprompting. Moreover, even the most advanced LLMs like GPT-4 struggle to match human performance on standard planning benchmarks, such as the Blocksworld, without additional support. This paper investigates whether LLMs can independently generate long-horizon plans that rival human baselines. Our novel enhancements to Algorithm-of-Thoughts (AoT), which we dub AoT+, help achieve state-of-the-art results in planning benchmarks out-competing prior methods and human baselines all autonomously.
摘要：大型语言模型 (LLM) 在自然语言处理和推理方面表现出了显著的能力，但它们在自主规划方面的有效性一直存在争议。虽然现有研究已经利用具有外部反馈机制或在受控环境中进行规划的 LLM，但由于需要精心设计和迭代反向提示，这些方法通常需要大量的计算和开发资源。此外，即使是最先进的 LLM（如 GPT-4），如果没有额外的支持，也难以在标准规划基准（如 Blocksworld）上与人类的表现相媲美。本文探讨了 LLM 是否可以独立生成与人类基线相媲美的长期计划。我们对算法思维 (AoT) 的全新增强功能（我们称之为 AoT+）有助于在规划基准方面实现最先进的结果，这些结果完全自主地超越了之前的方法和人类基线。

Title: K-COMP: Retrieval-Augmented Medical Domain Question Answering With Knowledge-Injected Compressor

Authors: Jeonghun Cho, Gary Geunbae Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13567
Pdf URL: https://arxiv.org/pdf/2501.13567
Copy Paste: [[2501.13567]] K-COMP: Retrieval-Augmented Medical Domain Question Answering With Knowledge-Injected Compressor(https://arxiv.org/abs/2501.13567)
Keywords: hallucination
Abstract: Retrieval-augmented question answering (QA) integrates external information, and thereby increases the QA accuracy of reader models that lack domain knowledge. However, documents retrieved for closed domains require high expertise, so the reader model may have difficulty fully comprehending the text. Moreover, the retrieved documents contain thousands of tokens, some unrelated to the question. As a result, the documents include some inaccurate information, which could lead the reader model to mistrust the passages and could result in hallucinations. To solve these problems, we propose K-COMP (Knowledge-injected compressor) which provides the knowledge required to answer correctly. The compressor automatically generates the requisite prior knowledge to facilitate the answering process prior to the compression of retrieved passages. Subsequently, the passages are compressed autoregressively, with the generated knowledge being integrated into the compression process. This process ensures alignment between the question intent and the compressed context. By augmenting this prior knowledge and concise context, the reader models are guided toward relevant answers and trust the context.
摘要：检索增强型问答 (QA) 集成了外部信息，从而提高了缺乏领域知识的读者模型的 QA 准确性。但是，针对封闭域检索的文档需要很高的专业知识，因此读者模型可能难以完全理解文本。此外，检索到的文档包含数千个标记，其中一些与问题无关。因此，文档中包含一些不准确的信息，这可能会导致读者模型不信任段落并导致幻觉。为了解决这些问题，我们提出了 K-COMP（知识注入压缩器），它提供了正确回答所需的知识。压缩器会在压缩检索到的段落之前自动生成必要的先验知识以促进回答过程。随后，对段落进行自回归压缩，并将生成的知识集成到压缩过程中。此过程可确保问题意图与压缩上下文保持一致。通过增强这些先验知识和简洁的上下文，读者模型会被引导至相关答案并信任上下文。

Title: Improving Contextual Faithfulness of Large Language Models via Retrieval Heads-Induced Optimization

Authors: Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yangfan Ye, Weihong Zhong, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13573
Pdf URL: https://arxiv.org/pdf/2501.13573
Copy Paste: [[2501.13573]] Improving Contextual Faithfulness of Large Language Models via Retrieval Heads-Induced Optimization(https://arxiv.org/abs/2501.13573)
Keywords: language model, gpt, llm
Abstract: Ensuring contextual faithfulness in retrieval-augmented large language models (LLMs) is crucial for building trustworthy information-seeking systems, particularly in long-form question-answering (LFQA) scenarios. In this work, we identify a salient correlation between LFQA faithfulness and retrieval heads, a set of attention heads responsible for retrieving contextual information. Leveraging this insight, we propose RHIO, a framework designed to teach LLMs to explicitly discriminate between faithful and unfaithful generations. RHIO first augments unfaithful samples that simulate realistic model-intrinsic errors by selectively masking retrieval heads. Then, these samples are incorporated into joint training, enabling the model to distinguish unfaithful outputs from faithful ones conditioned on control tokens. Furthermore, these control tokens are leveraged to self-induce contrastive outputs, amplifying their difference through contrastive decoding. Additionally, to facilitate the evaluation of contextual faithfulness, we also introduce GroundBench, a comprehensive benchmark compiled from five existing LFQA datasets. Extensive experimental results on GroundBench demonstrate that RHIO significantly improves faithfulness, even outperforming GPT-4o.
摘要：确保检索增强大型语言模型 (LLM) 中的上下文忠实度对于构建值得信赖的信息搜索系统至关重要，尤其是在长篇问答 (LFQA) 场景中。在这项工作中，我们发现了 LFQA 忠实度与检索头（一组负责检索上下文信息的注意头）之间的显著相关性。利用这一见解，我们提出了 RHIO，这是一个旨在教 LLM 明确区分忠实和不忠实生成结果的框架。RHIO 首先通过有选择地屏蔽检索头来构造模拟真实模型固有错误的不忠实样本。然后，将这些样本纳入联合训练，使模型能够在控制标记的条件下区分不忠实输出与忠实输出。此外，这些控制标记被用来自我诱导对比输出，通过对比解码放大它们的差异。此外，为了便于评估上下文忠实度，我们还引入了 GroundBench，这是一个由五个现有 LFQA 数据集编译而成的综合基准。在 GroundBench 上进行的大量实验结果表明，RHIO 显著提高了忠实度，甚至优于 GPT-4o。

Title: Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models

Authors: Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13629
Pdf URL: https://arxiv.org/pdf/2501.13629
Copy Paste: [[2501.13629]] Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models(https://arxiv.org/abs/2501.13629)
Keywords: language model, gpt
Abstract: We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on the model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model's representation capacity with minimal impacts on the inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over the conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B system domain data that we carefully collect and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves comparable performance to other state-of-arts models. In the system domain, we introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement up to 52.5%.
摘要：我们引入了 Sigma，这是一种专门用于系统领域的高效大型语言模型，它采用了包括 DiffQKV 注意力机制在内的新架构，并基于我们精心收集的系统领域数据进行了预训练。DiffQKV 注意力机制通过对注意力机制中的查询 (Q)、键 (K) 和值 (V) 组件进行差异化优化，显著提高了 Sigma 的推理效率，这些组件基于它们对模型性能和效率指标的不同影响。具体而言，我们 (1) 进行了广泛的实验，证明了模型对 K 和 V 组件压缩的不同敏感度，从而开发了差异压缩的 KV；(2) 提出了增强 Q 来扩展 Q 头维度，从而增强了模型的表示能力，同时对推理速度的影响最小。严格的理论和实证分析表明，DiffQKV 注意力机制显著提高了效率，在长上下文场景中，与传统的分组查询注意力机制 (GQA) 相比，推理速度提高了 33.36%。我们使用来自各种来源的 6T 标记对 Sigma 进行预训练，包括我们精心收集的 19.5B 系统域数据和 1T 合成和重写数据标记。在一般领域，Sigma 实现了与其他最先进模型相当的性能。在系统领域，我们引入了第一个综合基准 AIMicius，其中 Sigma 在所有任务中都表现出色，明显优于 GPT-4，绝对改进高达 52.5%。

Title: LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models

Authors: Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13652
Pdf URL: https://arxiv.org/pdf/2501.13652
Copy Paste: [[2501.13652]] LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models(https://arxiv.org/abs/2501.13652)
Keywords: language model, llm
Abstract: Multi-modal Large Language Models (MLLMs) have achieved remarkable success by integrating visual and textual modalities. However, they incur significant computational overhead due to the large number of vision tokens processed, limiting their practicality in resource-constrained environments. We introduce Language-Guided Vision Token Pruning (LVPruning) for MLLMs, an effective yet simple method that significantly reduces the computational burden while preserving model performance. LVPruning employs cross-attention modules to compute the importance of vision tokens based on their interaction with language tokens, determining which to prune. Importantly, LVPruning can be integrated without modifying the original MLLM parameters, which makes LVPruning simple to apply or remove. Our experiments show that LVPruning can effectively reduce up to 90% of vision tokens by the middle layer of LLaVA-1.5, resulting in a 62.1% decrease in inference Tera Floating-Point Operations Per Second (TFLOPs), with an average performance loss of just 0.45% across nine multi-modal benchmarks.
摘要：多模态大型语言模型 (MLLM) 通过整合视觉和文本模态取得了显著的成功。然而，由于处理的视觉标记数量巨大，它们会产生大量的计算开销，限制了它们在资源受限环境中的实用性。我们为 MLLM 引入了语言引导的视觉标记修剪 (LVPruning)，这是一种有效而简单的方法，可在保持模型性能的同时显着减轻计算负担。LVPruning 采用交叉注意模块，根据视觉标记与语言标记的交互来计算视觉标记的重要性，从而确定要修剪哪些标记。重要的是，LVPruning 可以在不修改原始 MLLM 参数的情况下集成，这使得 LVPruning 易于应用或删除。我们的实验表明，LVPruning 可以有效地减少 LLaVA-1.5 中间层高达 90% 的视觉 token，导致推理每秒 Tera 浮点运算 (TFLOP) 减少 62.1%，在九个多模态基准测试中平均性能损失仅为 0.45%。

Title: How to Complete Domain Tuning while Keeping General Ability in LLM: Adaptive Layer-wise and Element-wise Regularization

Authors: Shezheng Song, Hao Xu, Jun Ma, Shasha Li, Long Peng, Qian Wan, Xiaodong Liu, Jie Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13669
Pdf URL: https://arxiv.org/pdf/2501.13669
Copy Paste: [[2501.13669]] How to Complete Domain Tuning while Keeping General Ability in LLM: Adaptive Layer-wise and Element-wise Regularization(https://arxiv.org/abs/2501.13669)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) exhibit strong general-purpose language capabilities. However, fine-tuning these models on domain-specific tasks often leads to catastrophic forgetting, where the model overwrites or loses essential knowledge acquired during pretraining. This phenomenon significantly limits the broader applicability of LLMs. To address this challenge, we propose a novel approach to compute the element-wise importance of model parameters crucial for preserving general knowledge during fine-tuning. Our method utilizes a dual-objective optimization strategy: (1) regularization loss to retain the parameter crucial for general knowledge; (2) cross-entropy loss to adapt to domain-specific tasks. Additionally, we introduce layer-wise coefficients to account for the varying contributions of different layers, dynamically balancing the dual-objective optimization. Extensive experiments on scientific, medical, and physical tasks using GPT-J and LLaMA-3 demonstrate that our approach mitigates catastrophic forgetting while enhancing model adaptability. Compared to previous methods, our solution is approximately 20 times faster and requires only 10%-15% of the storage, highlighting the practical efficiency. The code will be released.
摘要：大型语言模型 (LLM) 表现出强大的通用语言能力。然而，在特定领域的任务上对这些模型进行微调往往会导致灾难性遗忘，即模型会覆盖或丢失在预训练期间获得的重要知识。这种现象严重限制了 LLM 的广泛适用性。为了应对这一挑战，我们提出了一种新方法来计算在微调过程中对于保留一般知识至关重要的模型参数的元素重要性。我们的方法采用了双目标优化策略：(1) 正则化损失以保留对一般知识至关重要的参数；(2) 交叉熵损失以适应特定领域的任务。此外，我们引入了逐层系数来解释不同层的不同贡献，动态平衡双目标优化。使用 GPT-J 和 LLaMA-3 对科学、医学和物理任务进行的大量实验表明，我们的方法可以减轻灾难性遗忘，同时增强模型的适应性。与之前的方法相比，我们的解决方案大约快了 20 倍，并且仅需要 10%-15% 的存储空间，突出了实用效率。代码即将发布。

Title: Question Answering on Patient Medical Records with Private Fine-Tuned LLMs

Authors: Sara Kothari, Ayush Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13687
Pdf URL: https://arxiv.org/pdf/2501.13687
Copy Paste: [[2501.13687]] Question Answering on Patient Medical Records with Private Fine-Tuned LLMs(https://arxiv.org/abs/2501.13687)
Keywords: language model, gpt, llm
Abstract: Healthcare systems continuously generate vast amounts of electronic health records (EHRs), commonly stored in the Fast Healthcare Interoperability Resources (FHIR) standard. Despite the wealth of information in these records, their complexity and volume make it difficult for users to retrieve and interpret crucial health insights. Recent advances in Large Language Models (LLMs) offer a solution, enabling semantic question answering (QA) over medical data, allowing users to interact with their health records more effectively. However, ensuring privacy and compliance requires edge and private deployments of LLMs. This paper proposes a novel approach to semantic QA over EHRs by first identifying the most relevant FHIR resources for a user query (Task1) and subsequently answering the query based on these resources (Task2). We explore the performance of privately hosted, fine-tuned LLMs, evaluating them against benchmark models such as GPT-4 and GPT-4o. Our results demonstrate that fine-tuned LLMs, while 250x smaller in size, outperform GPT-4 family models by 0.55% in F1 score on Task1 and 42% on Meteor Task in Task2. Additionally, we examine advanced aspects of LLM usage, including sequential fine-tuning, model self-evaluation (narcissistic evaluation), and the impact of training data size on performance. The models and datasets are available here: this https URL
摘要：医疗保健系统不断生成大量电子健康记录 (EHR)，这些记录通常存储在快速医疗互操作性资源 (FHIR) 标准中。尽管这些记录中的信息丰富，但其复杂性和数量使用户难以检索和解释关键的健康见解。大型语言模型 (LLM) 的最新进展提供了一种解决方案，可以对医疗数据进行语义问答 (QA)，从而使用户能够更有效地与他们的健康记录进行交互。但是，确保隐私和合规性需要边缘和私有部署 LLM。本文提出了一种通过 EHR 进行语义 QA 的新方法，首先确定与用户查询最相关的 FHIR 资源（任务 1），然后根据这些资源回答查询（任务 2）。我们探索了私人托管、经过微调的 LLM 的性能，并根据 GPT-4 和 GPT-4o 等基准模型对其进行了评估。我们的结果表明，经过微调的 LLM 虽然尺寸小了 250 倍，但在任务 1 的 F1 得分上比 GPT-4 系列模型高出 0.55%，在任务 2 的 Meteor 任务上比 GPT-4 系列模型高出 42%。此外，我们还研究了 LLM 使用的高级方面，包括顺序微调、模型自我评估（自恋评估）以及训练数据大小对性能的影响。模型和数据集可在此处获取：此 https URL

Title: DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale

Authors: Linghao Zhang, Junhao Wang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Jiaheng Wen, Chengxing Xie, Maoquan Wang, Yufan Huang, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2501.13699
Pdf URL: https://arxiv.org/pdf/2501.13699
Copy Paste: [[2501.13699]] DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale(https://arxiv.org/abs/2501.13699)
Keywords: language model, llm
Abstract: Large Language Models have advanced automated software development, however, it remains a challenge to correctly infer dependencies, namely, identifying the internal components and external packages required for a repository to successfully run. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors on the generated repository. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability on dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 42.9% execution pass rate, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
摘要：大型语言模型推动了自动化软件开发的发展，但正确推断依赖关系（即识别成功运行存储库所需的内部组件和外部包）仍然是一项挑战。现有研究表明，依赖关系相关问题导致生成的存储库中观察到的 40% 以上的运行时错误。为了解决这个问题，我们引入了 DI-BENCH，这是一个大型基准和评估框架，专门用于评估 LLM 的依赖关系推断能力。该基准测试包含 581 个存储库，测试环境涵盖 Python、C#、Rust 和 JavaScript。对文本和基于执行的指标进行的大量实验表明，当前表现最佳的模型仅实现了 42.9% 的执行通过率，这表明有很大的改进空间。DI-BENCH 为评估存储库上的 LLM 性能建立了一个新的视角，为更强大的端到端软件综合铺平了道路。

Title: Musical ethnocentrism in Large Language Models

Authors: Anna Kruspe
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.13720
Pdf URL: https://arxiv.org/pdf/2501.13720
Copy Paste: [[2501.13720]] Musical ethnocentrism in Large Language Models(https://arxiv.org/abs/2501.13720)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large Language Models (LLMs) reflect the biases in their training data and, by extension, those of the people who created this training data. Detecting, analyzing, and mitigating such biases is becoming a focus of research. One type of bias that has been understudied so far are geocultural biases. Those can be caused by an imbalance in the representation of different geographic regions and cultures in the training data, but also by value judgments contained therein. In this paper, we make a first step towards analyzing musical biases in LLMs, particularly ChatGPT and Mixtral. We conduct two experiments. In the first, we prompt LLMs to provide lists of the "Top 100" musical contributors of various categories and analyze their countries of origin. In the second experiment, we ask the LLMs to numerically rate various aspects of the musical cultures of different countries. Our results indicate a strong preference of the LLMs for Western music cultures in both experiments.
摘要：大型语言模型 (LLM) 反映了其训练数据中的偏见，进而反映了创建这些训练数据的人的偏见。检测、分析和减轻此类偏见正成为研究的重点。迄今为止，一种研究不足的偏见是地缘文化偏见。这些偏见可能是由于训练数据中不同地理区域和文化的表示不平衡造成的，也可能是由其中包含的价值判断造成的。在本文中，我们迈出了分析 LLM 中的音乐偏见的第一步，尤其是 ChatGPT 和 Mixtral。我们进行了两个实验。在第一个实验中，我们要求 LLM 提供各种类别的“前 100 名”音乐贡献者列表，并分析他们的原籍国。在第二个实验中，我们要求 LLM 对不同国家音乐文化的各个方面进行数字评分。我们的结果表明，在两个实验中，LLM 都强烈偏爱西方音乐文化。

Title: RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation

Authors: Shi-Qi Yan, Zhen-Hua Ling
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13726
Pdf URL: https://arxiv.org/pdf/2501.13726
Copy Paste: [[2501.13726]] RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation(https://arxiv.org/abs/2501.13726)
Keywords: language model, llm, retrieval-augmented generation
Abstract: While Retrieval-Augmented Generation (RAG) has exhibited promise in utilizing external knowledge, its generation process heavily depends on the quality and accuracy of the retrieved context. Large language models (LLMs) struggle to evaluate the correctness of non-parametric knowledge retrieved externally when it differs from internal memorization, leading to knowledge conflicts during response generation. To this end, we introduce the Retrieval Preference Optimization (RPO), a lightweight and effective alignment method to adaptively leverage multi-source knowledge based on retrieval relevance. An implicit representation of retrieval relevance is derived and incorporated into the reward model to integrate retrieval evaluation and response generation into a single model, solving the problem that previous methods necessitate the additional procedure to assess the retrieval quality. Notably, RPO is the only RAG-dedicated alignment approach that quantifies the awareness of retrieval relevance in training, overcoming mathematical obstacles. Experiments on four datasets demonstrate that RPO outperforms RAG by 4-10% in accuracy without any extra component, exhibiting its robust generalization.
摘要：虽然检索增强生成 (RAG) 在利用外部知识方面表现出良好的前景，但其生成过程在很大程度上取决于检索到的上下文的质量和准确性。当从外部检索到的非参数知识与内部记忆不同时，大型语言模型 (LLM) 很难评估其正确性，从而导致响应生成过程中的知识冲突。为此，我们引入了检索偏好优化 (RPO)，这是一种轻量级且有效的对齐方法，可根据检索相关性自适应地利用多源知识。我们推导出检索相关性的隐式表示并将其纳入奖励模型，以将检索评估和响应生成集成到单个模型中，解决了以前的方法需要额外程序来评估检索质量的问题。值得注意的是，RPO 是唯一一种在训练中量化检索相关性意识的 RAG 专用对齐方法，克服了数学障碍。在四个数据集上的实验表明，RPO 在没有任何额外组件的情况下，准确率比 RAG 高出 4-10%，表现出其强大的泛化能力。

Title: Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks

Authors: Chang Gong, Wanrui Bian, Zhijie Zhang, Weiguo Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13731
Pdf URL: https://arxiv.org/pdf/2501.13731
Copy Paste: [[2501.13731]] Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks(https://arxiv.org/abs/2501.13731)
Keywords: language model, llm, prompt
Abstract: Graph computational tasks are inherently challenging and often demand the development of advanced algorithms for effective solutions. With the emergence of large language models (LLMs), researchers have begun investigating their potential to address these tasks. However, existing approaches are constrained by LLMs' limited capability to comprehend complex graph structures and their high inference costs, rendering them impractical for handling large-scale graphs. Inspired by human approaches to graph problems, we introduce a novel framework, PIE (Pseudocode-Injection-Enhanced LLM Reasoning for Graph Computational Tasks), which consists of three key steps: problem understanding, prompt design, and code generation. In this framework, LLMs are tasked with understanding the problem and extracting relevant information to generate correct code. The responsibility for analyzing the graph structure and executing the code is delegated to the interpreter. We inject task-related pseudocodes into the prompts to further assist the LLMs in generating efficient code. We also employ cost-effective trial-and-error techniques to ensure that the LLM-generated code executes correctly. Unlike other methods that require invoking LLMs for each individual test case, PIE only calls the LLM during the code generation phase, allowing the generated code to be reused and significantly reducing inference costs. Extensive experiments demonstrate that PIE outperforms existing baselines in terms of both accuracy and computational efficiency.
摘要：图形计算任务本身就具有挑战性，通常需要开发高级算法才能有效解决。随着大型语言模型 (LLM) 的出现，研究人员已经开始研究它们解决这些任务的潜力。然而，现有方法受到 LLM 理解复杂图形结构的能力有限和推理成本高的限制，使得它们不适用于处理大规模图形。受人类解决图形问题的方法的启发，我们引入了一个新框架 PIE（用于图形计算任务的伪代码注入增强型 LLM 推理），它由三个关键步骤组成：问题理解、提示设计和代码生成。在这个框架中，LLM 的任务是理解问题并提取相关信息以生成正确的代码。分析图形结构和执行代码的责任委托给解释器。我们将与任务相关的伪代码注入到提示中，以进一步帮助 LLM 生成高效的代码。我们还采用了经济高效的试错技术来确保 LLM 生成的代码正确执行。与其他需要为每个单独的测试用例调用 LLM 的方法不同，PIE 仅在代码生成阶段调用 LLM，从而允许重用生成的代码并显著降低推理成本。大量实验表明，PIE 在准确性和计算效率方面均优于现有基准。

Title: UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models

Authors: Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, Can Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13766
Pdf URL: https://arxiv.org/pdf/2501.13766
Copy Paste: [[2501.13766]] UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models(https://arxiv.org/abs/2501.13766)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking extensive coverage of undergraduate-level mathematical problems or probably suffering from test-set contamination. To address these issues, we introduce UGMathBench, a diverse and dynamic benchmark specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated in UGMathBench. Furthermore, we propose two key metrics: effective accuracy (EAcc), which measures the percentage of correctly solved problems across all three versions, and reasoning gap ($\Delta$), which assesses reasoning robustness by calculating the difference between the average accuracy across all versions and EAcc. Our extensive evaluation of 23 leading LLMs reveals that the highest EAcc achieved is 56.3% by OpenAI-o1-mini, with large $\Delta$ values observed across different models. This highlights the need for future research aimed at developing "large reasoning models" with high EAcc and $\Delta = 0$. We anticipate that the release of UGMathBench, along with its detailed evaluation codes, will serve as a valuable resource to advance the development of LLMs in solving mathematical problems.
摘要：大型语言模型 (LLM) 在数学推理方面取得了重大进展，这凸显了对其能力进行全面、公平评估的必要性。然而，现有的基准测试往往存在不足，要么缺乏对本科水平数学问题的广泛覆盖，要么可能受到测试集污染。为了解决这些问题，我们推出了 UGMathBench，这是一个多样化且动态的基准测试，专门用于评估 LLM 的本科水平数学推理。UGMathBench 包含 16 个科目和 111 个主题的 5,062 个问题，具有 10 种不同的答案类型。每个问题包括三个随机版本，随着领先的开源 LLM 在 UGMathBench 中变得饱和，计划发布更多版本。此外，我们提出了两个关键指标：有效准确度 (EAcc)，它衡量所有三个版本中正确解决问题的百分比，以及推理差距 ($\Delta$)，它通过计算所有版本的平均准确度与 EAcc 之间的差异来评估推理稳健性。我们对 23 个领先的 LLM 进行了广泛的评估，结果表明 OpenAI-o1-mini 取得的最高 EAcc 为 56.3%，不同模型的 $\Delta$ 值都很大。这凸显了未来研究的必要性，旨在开发具有高 EAcc 和 $\Delta = 0$ 的“大型推理模型”。我们预计 UGMathBench 的发布及其详细的评估代码将成为推动 LLM 在解决数学问题方面的发展的宝贵资源。
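
The two UGMathBench metrics defined in the abstract above can be made concrete with a small calculation. This is a sketch based only on the abstract's wording, not the benchmark's released evaluation code. The function names and the toy `results` table are hypothetical.

```python
def effective_accuracy(results):
    # EAcc: share of problems answered correctly in every randomized version.
    return sum(all(versions) for versions in results) / len(results)

def average_accuracy(results):
    # Mean of the per-version accuracies.
    n_versions = len(results[0])
    per_version = [
        sum(versions[v] for versions in results) / len(results)
        for v in range(n_versions)
    ]
    return sum(per_version) / n_versions

def reasoning_gap(results):
    # Delta: average accuracy across versions minus EAcc.
    return average_accuracy(results) - effective_accuracy(results)

# Hypothetical outcomes: one row per problem, one boolean per version.
results = [
    (True, True, True),
    (True, False, True),
    (False, False, False),
    (True, True, False),
]

eacc = effective_accuracy(results)   # 1 of 4 problems solved in all versions
delta = reasoning_gap(results)       # 7/12 - 1/4
```

A model that answers consistently across versions has $\Delta = 0$, because every problem it solves in one version it also solves in the others.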

Title: Do Large Language Models Truly Understand Geometric Structures?

Authors: Xiaofeng Wang, Yiming Wang, Wenhong Zhu, Rui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13773
Pdf URL: https://arxiv.org/pdf/2501.13773
Copy Paste: [[2501.13773]] Do Large Language Models Truly Understand Geometric Structures?(https://arxiv.org/abs/2501.13773)
Keywords: language model, llm, chain-of-thought
Abstract: Geometric ability is a significant challenge for large language models (LLMs) due to the need for advanced spatial comprehension and abstract thinking. Existing datasets primarily evaluate LLMs on their final answers, but they cannot truly measure their true understanding of geometric structures, as LLMs can arrive at correct answers by coincidence. To fill this gap, we introduce the GeomRel dataset, designed to evaluate LLMs' understanding of geometric structures by isolating the core step of geometric relationship identification in problem-solving. Using this benchmark, we conduct thorough evaluations of diverse LLMs and identify key limitations in understanding geometric structures. We further propose the Geometry Chain-of-Thought (GeoCoT) method, which enhances LLMs' ability to identify geometric relationships, resulting in significant performance improvements.
摘要：由于需要高级空间理解和抽象思维，几何能力对于大型语言模型 (LLM) 来说是一项重大挑战。现有数据集主要根据 LLM 的最终答案对其进行评估，但无法真正衡量其对几何结构的真实理解，因为 LLM 可能会巧合地得出正确答案。为了填补这一空白，我们引入了 GeomRel 数据集，旨在通过隔离问题解决中几何关系识别的核心步骤来评估 LLM 对几何结构的理解。使用此基准，我们对各种 LLM 进行了彻底的评估，并确定了理解几何结构的主要限制。我们进一步提出了几何思维链 (GeoCoT) 方法，该方法增强了 LLM 识别几何关系的能力，从而显著提高了性能。

Title: Parameter-Efficient Fine-Tuning for Foundation Models

Authors: Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13787
Pdf URL: https://arxiv.org/pdf/2501.13787
Copy Paste: [[2501.13787]] Parameter-Efficient Fine-Tuning for Foundation Models(https://arxiv.org/abs/2501.13787)
Keywords: gpt, chat
Abstract: This survey delves into the realm of Parameter-Efficient Fine-Tuning (PEFT) within the context of Foundation Models (FMs). PEFT, a cost-effective fine-tuning technique, minimizes parameters and computational complexity while striving for optimal downstream task performance. FMs, like ChatGPT, DALL-E, and LLaVA specialize in language understanding, generative tasks, and multimodal tasks, trained on diverse datasets spanning text, images, and videos. The diversity of FMs guides various adaptation strategies for PEFT. Therefore, this survey aims to provide a comprehensive overview of PEFT techniques applied to diverse FMs and address critical gaps in understanding the techniques, trends, and applications. We start by providing a detailed development of FMs and PEFT. Subsequently, we systematically review the key categories and core mechanisms of PEFT across diverse FMs to offer a comprehensive understanding of trends. We also explore the most recent applications across various FMs to demonstrate the versatility of PEFT, shedding light on the integration of systematic PEFT methods with a range of FMs. Furthermore, we identify potential research and development directions for improving PEFTs in the future. This survey provides a valuable resource for both newcomers and experts seeking to understand and use the power of PEFT across FMs. All reviewed papers are listed at this https URL.
摘要：本综述深入探讨了基础模型 (FM) 背景下的参数高效微调 (PEFT) 领域。PEFT 是一种经济高效的微调技术，可在追求最佳下游任务性能的同时最大限度地减少参数和计算复杂性。ChatGPT、DALL-E 和 LLaVA 等 FM 专注于语言理解、生成任务和多模态任务，并在涵盖文本、图像和视频的各种数据集上进行训练。FM 的多样性指导了 PEFT 的各种适应策略。因此，本综述旨在全面概述应用于不同 FM 的 PEFT 技术，并解决理解技术、趋势和应用方面的关键差距。我们首先详细介绍 FM 和 PEFT 的发展历程。随后，我们系统地回顾了不同 FM 中 PEFT 的主要类别和核心机制，以全面了解趋势。我们还探索了各种 FM 中的最新应用，以展示 PEFT 的多功能性，阐明了系统 PEFT 方法与一系列 FM 的集成。此外，我们确定了未来改进 PEFT 的潜在研究和开发方向。这篇综述为寻求了解和使用 PEFT 在 FM 中的功能的新手和专家提供了宝贵的资源。所有被综述的论文都列在此 https URL 中。

Title: Hallucinations Can Improve Large Language Models in Drug Discovery

Authors: Shuzhou Yuan, Michael Färber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13824
Pdf URL: https://arxiv.org/pdf/2501.13824
Copy Paste: [[2501.13824]] Hallucinations Can Improve Large Language Models in Drug Discovery(https://arxiv.org/abs/2501.13824)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Concerns about hallucinations in Large Language Models (LLMs) have been raised by researchers, yet their potential in areas where creativity is vital, such as drug discovery, merits exploration. In this paper, we come up with the hypothesis that hallucinations can improve LLMs in drug discovery. To verify this hypothesis, we use LLMs to describe the SMILES string of molecules in natural language and then incorporate these descriptions as part of the prompt to address specific tasks in drug discovery. Evaluated on seven LLMs and five classification tasks, our findings confirm the hypothesis: LLMs can achieve better performance with text containing hallucinations. Notably, Llama-3.1-8B achieves an 18.35% gain in ROC-AUC compared to the baseline without hallucination. Furthermore, hallucinations generated by GPT-4o provide the most consistent improvements across models. Additionally, we conduct empirical analyses and a case study to investigate key factors affecting performance and the underlying reasons. Our research sheds light on the potential use of hallucinations for LLMs and offers new perspectives for future research leveraging LLMs in drug discovery.
摘要：研究人员对大型语言模型 (LLM) 中的幻觉问题表示担忧，但它们在药物发现等创造力至关重要的领域的潜力值得探索。在本文中，我们提出了幻觉可以改善药物发现中的 LLM 的假设。为了验证这一假设，我们使用 LLM 以自然语言描述分子的 SMILES 字符串，然后将这些描述作为提示的一部分，以解决药物发现中的特定任务。在七个 LLM 和五个分类任务上进行评估后，我们的研究结果证实了这一假设：LLM 可以在包含幻觉的文本上实现更好的性能。值得注意的是，与没有幻觉的基线相比，Llama-3.1-8B 的 ROC-AUC 提高了 18.35%。此外，GPT-4o 产生的幻觉在各个模型中提供了最一致的改进。此外，我们还进行了实证分析和案例研究，以调查影响性能的关键因素及其根本原因。我们的研究揭示了幻觉对 LLM 的潜在用途，并为未来利用 LLM 进行药物发现的研究提供了新的视角。

Title: Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing

Authors: Hao Zhang, Felix Stahlberg, Shankar Kumar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13831
Pdf URL: https://arxiv.org/pdf/2501.13831
Copy Paste: [[2501.13831]] Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing(https://arxiv.org/abs/2501.13831)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel at rewriting tasks such as text style transfer and grammatical error correction. While there is considerable overlap between the inputs and outputs in these tasks, the decoding cost still increases with output length, regardless of the amount of overlap. By leveraging the overlap between the input and the output, Kaneko and Okazaki (2023) proposed model-agnostic edit span representations to compress the rewrites to save computation. They reported an output length reduction rate of nearly 80% with minimal accuracy impact in four rewriting tasks. In this paper, we propose alternative edit phrase representations inspired by phrase-based statistical machine translation. We systematically compare our phrasal representations with their span representations. We apply the LLM rewriting model to the task of Automatic Speech Recognition (ASR) post editing and show that our target-phrase-only edit representation has the best efficiency-accuracy trade-off. On the LibriSpeech test set, our method closes 50-60% of the WER gap between the edit span model and the full rewrite model while losing only 10-20% of the length reduction rate of the edit span model.
摘要：大型语言模型 (LLM) 擅长重写任务，例如文本样式转换和语法错误纠正。虽然这些任务中的输入和输出之间存在相当大的重叠，但无论重叠量如何，解码成本仍会随着输出长度的增加而增加。通过利用输入和输出之间的重叠，Kaneko 和 Okazaki (2023) 提出了与模型无关的编辑跨度表示来压缩重写以节省计算量。他们报告称，在四项重写任务中，输出长度减少了近 80%，而准确度影响最小。在本文中，我们提出了受基于短语的统计机器翻译启发的替代编辑短语表示。我们系统地将我们的短语表示与它们的跨度表示进行比较。我们将 LLM 重写模型应用于自动语音识别 (ASR) 后编辑任务，并表明我们的仅目标短语编辑表示具有最佳的效率-准确度权衡。在 LibriSpeech 测试集上，我们的方法缩小了编辑跨度模型和完整重写模型之间 50-60% 的 WER 差距，同时仅损失了编辑跨度模型 10-20% 的长度减少率。

Title: Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages

Authors: Farhana Shahid, Mona Elswah, Aditya Vashistha
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2501.13836
Pdf URL: https://arxiv.org/pdf/2501.13836
Copy Paste: [[2501.13836]] Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages(https://arxiv.org/abs/2501.13836)
Keywords: language model
Abstract: Most social media users come from non-English speaking countries in the Global South. Despite the widespread prevalence of harmful content in these regions, current moderation systems repeatedly struggle in low-resource languages spoken there. In this work, we examine the challenges AI researchers and practitioners face when building moderation tools for low-resource languages. We conducted semi-structured interviews with 22 AI researchers and practitioners specializing in automatic detection of harmful content in four diverse low-resource languages from the Global South. These are: Tamil from South Asia, Swahili from East Africa, Maghrebi Arabic from North Africa, and Quechua from South America. Our findings reveal that social media companies' restrictions on researchers' access to data exacerbate the historical marginalization of these languages, which have long lacked datasets for studying online harms. Moreover, common preprocessing techniques and language models, predominantly designed for data-rich English, fail to account for the linguistic complexity of low-resource languages. This leads to critical errors when moderating content in Tamil, Swahili, Arabic, and Quechua, which are morphologically richer than English. Based on our findings, we establish that the precarities in current moderation pipelines are rooted in deep systemic inequities and continue to reinforce historical power imbalances. We conclude by discussing multi-stakeholder approaches to improve moderation for low-resource languages.
摘要：大多数社交媒体用户来自全球南方的非英语国家。尽管有害内容在这些地区普遍存在，但目前的审核系统在这些地区使用的资源匮乏的语言中却屡屡遭遇困境。在这项研究中，我们研究了人工智能研究人员和从业者在为资源匮乏的语言构建审核工具时面临的挑战。我们对 22 名人工智能研究人员和从业者进行了半结构化访谈，他们专门研究全球南方四种不同的资源匮乏语言的有害内容自动检测。这四种语言分别是：来自南亚的泰米尔语、来自东非的斯瓦希里语、来自北非的马格里布阿拉伯语和来自南美洲的克丘亚语。我们的研究结果表明，社交媒体公司对研究人员访问数据的限制加剧了这些语言的历史边缘化，这些语言长期以来一直缺乏用于研究在线危害的数据集。此外，常见的预处理技术和语言模型主要针对数据丰富的英语设计，无法解释资源匮乏语言的语言复杂性。这导致在审核泰米尔语、斯瓦希里语、阿拉伯语和克丘亚语内容时出现严重错误，因为这些语言在形态上比英语更丰富。根据我们的研究结果，我们发现，当前审核流程中的不稳定性根源于深层系统性不平等，并继续加剧历史权力不平衡。最后，我们讨论了多利益相关方方法来改善资源匮乏语言的审核。

Title: A RAG-Based Institutional Assistant

Authors: Gustavo Kuratomi, Paulo Pirozelli, Fabio G. Cozman, Sarajane M. Peres
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13880
Pdf URL: https://arxiv.org/pdf/2501.13880
Copy Paste: [[2501.13880]] A RAG-Based Institutional Assistant(https://arxiv.org/abs/2501.13880)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Although large language models (LLMs) demonstrate strong text generation capabilities, they struggle in scenarios requiring access to structured knowledge bases or specific documents, limiting their effectiveness in knowledge-intensive tasks. To address this limitation, retrieval-augmented generation (RAG) models have been developed, enabling generative models to incorporate relevant document fragments into their inputs. In this paper, we design and evaluate a RAG-based virtual assistant specifically tailored for the University of São Paulo. Our system architecture comprises two key modules: a retriever and a generative model. We experiment with different types of models for both components, adjusting hyperparameters such as chunk size and the number of retrieved documents. Our optimal retriever model achieves a Top-5 accuracy of 30%, while our most effective generative model scores 22.04% against ground truth answers. Notably, when the correct document chunks are supplied to the LLMs, accuracy significantly improves to 54.02%, an increase of over 30 percentage points. Conversely, without contextual input, performance declines to 13.68%. These findings highlight the critical role of database access in enhancing LLM performance. They also reveal the limitations of current semantic search methods in accurately identifying relevant documents and underscore the ongoing challenges LLMs face in generating precise responses.
摘要：尽管大型语言模型 (LLM) 表现出强大的文本生成能力，但它们在需要访问结构化知识库或特定文档的场景中却举步维艰，这限制了它们在知识密集型任务中的有效性。为了解决这一限制，人们开发了检索增强生成 (RAG) 模型，使生成模型能够将相关的文档片段合并到其输入中。在本文中，我们设计并评估了专为圣保罗大学量身定制的基于 RAG 的虚拟助手。我们的系统架构包括两个关键模块：检索器和生成模型。我们针对这两个组件尝试了不同类型的模型，调整了超参数，例如块大小和检索到的文档数量。我们的最佳检索器模型实现了 30% 的 Top-5 准确率，而我们最有效的生成模型对地面实况答案的得分为 22.04%。值得注意的是，当向 LLM 提供正确的文档块时，准确率显着提高到 54.02%，增加了 30 个百分点以上。相反，如果没有上下文输入，性能会下降到 13.68%。这些发现凸显了数据库访问在提高 LLM 性能方面的关键作用。它们还揭示了当前语义搜索方法在准确识别相关文档方面的局限性，并强调了 LLM 在生成精确响应方面面临的持续挑战。

Title: GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration

Authors: Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, Gang Wu
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13896
Pdf URL: https://arxiv.org/pdf/2501.13896
Copy Paste: [[2501.13896]] GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration(https://arxiv.org/abs/2501.13896)
Keywords: llm, agent
Abstract: Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent work on GUI action grounding leverages large GUI datasets to fine-tune MLLMs. However, the fine-tuning data always covers limited GUI environments, and we find the performance of the resulting model deteriorates in novel environments. We argue that GUI grounding models should be further aligned to novel environments to reveal their full potential when the inference is known to involve novel environments, i.e., environments not used during the previous fine-tuning. To realize this, we first propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration and then continuously fine-tune GUI grounding models with the collected data. Our agent leverages a novel Q-value-Incentive In-Context Reinforcement Learning (Q-ICRL) method to optimize exploration efficiency and data quality. Additionally, we introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments, and demonstrate the effectiveness of data collected by GUI-Bee in the experiments. Furthermore, we conduct an ablation study to validate the Q-ICRL method in enhancing the efficiency of GUI-Bee.
摘要：图形用户界面 (GUI) 动作基础是 GUI 自动化中的关键步骤，它将语言指令映射到 GUI 屏幕上的可操作元素。GUI 动作基础的最新研究利用大型 GUI 数据集来微调 MLLM。然而，微调数据总是覆盖有限的 GUI 环境，我们发现生成的模型的性能在新的环境中会下降。我们认为，当已知推理涉及新环境（即上一次微调期间未使用的环境）时，GUI 基础模型应该进一步与新环境保持一致，以充分发挥其潜力。为了实现这一点，我们首先提出了基于 MLLM 的自主代理 GUI-Bee，通过探索收集高质量、特定于环境的数据，然后使用收集的数据不断微调 GUI 基础模型。我们的代理利用一种新颖的 Q 值激励上下文强化学习 (Q-ICRL) 方法来优化探索效率和数据质量。此外，我们引入了 NovelScreenSpot，这是一个基准，用于测试数据如何帮助将 GUI 动作接地模型与新环境对齐，并证明 GUI-Bee 在实验中收集的数据的有效性。此外，我们进行了一项消融研究，以验证 Q-ICRL 方法在提高 GUI-Bee 效率方面的作用。

Title: Analysis of Indic Language Capabilities in LLMs

Authors: Aatman Vaidya, Tarunima Prabhakar, Denny George, Swair Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13912
Pdf URL: https://arxiv.org/pdf/2501.13912
Copy Paste: [[2501.13912]] Analysis of Indic Language Capabilities in LLMs(https://arxiv.org/abs/2501.13912)
Keywords: language model, llm
Abstract: This report evaluates the ability of text-in text-out Large Language Models (LLMs) to understand and generate Indic languages. This evaluation is used to identify and prioritize Indic languages suited for inclusion in safety benchmarks. We conduct this study by reviewing existing evaluation studies and datasets, and a set of twenty-eight LLMs that support Indic languages. We analyze the LLMs on the basis of the training data, license for model and data, type of access and model developers. We also compare Indic language performance across evaluation datasets and find significant performance disparities across Indic languages. Hindi is the most widely represented language in models. While model performance roughly correlates with number of speakers for the top five languages, the assessment after that varies.
摘要：本报告评估了文本输入文本输出大型语言模型 (LLM) 在理解和生成印度语方面的表现。此评估用于识别和确定适合纳入安全基准的印度语的优先顺序。我们通过审查现有的评估研究和数据集以及一组支持印度语的 28 个 LLM 来开展这项研究。我们根据训练数据、模型和数据许可证、访问类型和模型开发人员对 LLM 进行分析。我们还比较了评估数据集中的印度语性能，发现不同印度语的性能存在显著差异。印地语是模型中代表性最广泛的语言。虽然模型性能大致与前五种语言的使用者数量相关，但此后的评估会有所不同。

Title: The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities

Authors: Chan-Jan Hsu, Chia-Sheng Liu, Meng-Hsi Chen, Muxi Chen, Po-Chun Hsu, Yi-Chang Chen, Da-Shan Shiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.13921
Pdf URL: https://arxiv.org/pdf/2501.13921
Copy Paste: [[2501.13921]] The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities(https://arxiv.org/abs/2501.13921)
Keywords: language model, llm, long context, prompt
Abstract: Breeze 2 is a suite of advanced multi-modal language models, available in 3B and 8B parameter configurations, specifically designed to enhance Traditional Chinese language representation. Building upon Llama 3, Breeze 2 continues pretraining on an extensive corpus to enhance the linguistic and cultural heritage of Traditional Chinese. It incorporates vision-aware capabilities through a visual encoder and a bridge module, and supports function-calling via prompt templates and post-training on function-calling data. The effectiveness of Breeze 2 is benchmarked across various tasks, including Taiwan general knowledge, instruction-following, long context, function calling, and vision understanding. Furthermore, we showcase the capabilities of its 3B model in a mobile application. We are publicly releasing all Breeze 2 models under the Llama 3 Community License.
摘要：Breeze 2 是一套先进的多模态语言模型，提供 3B 和 8B 参数配置，专为增强繁体中文语言表示而设计。在 Llama 3 的基础上，Breeze 2 继续在广泛的语料库上进行预训练，以增强繁体中文的语言和文化遗产。它通过视觉编码器和桥接模块整合了视觉感知功能，并通过提示模板和对函数调用数据的后训练支持函数调用。Breeze 2 的有效性在各种任务中进行了基准测试，包括台湾常识、指令遵循、长上下文、函数调用和视觉理解。此外，我们在移动应用程序中展示了其 3B 模型的功能。我们将根据 Llama 3 社区许可公开发布所有 Breeze 2 模型。

Title: CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

Authors: Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2501.13927
Pdf URL: https://arxiv.org/pdf/2501.13927
Copy Paste: [[2501.13927]] CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation(https://arxiv.org/abs/2501.13927)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO and MBR score in both translation accuracy and data efficiency.
摘要：大型语言模型 (LLM) 在自然语言处理任务中显示出巨大的潜力，但由于以英语为中心的数据进行预训练以及从人类反馈 (RLHF) 进行强化学习的复杂性，它们在机器翻译 (MT) 中的应用仍然具有挑战性。直接偏好优化 (DPO) 已成为一种更简单、更高效的替代方案，但其性能在很大程度上取决于偏好数据的质量。为了解决这个问题，我们提出了置信度-奖励驱动的偏好优化 (CRPO)，这是一种将奖励分数与模型置信度相结合以改进微调数据选择的新方法。CRPO 选择模型不确定或表现不佳的具有挑战性的句子对，从而实现更有效的学习。虽然 CRPO 主要为 LLM 设计，但它也可以推广到 NLLB 等编码器-解码器模型，从而展示了它的多功能性。实证结果表明，CRPO 在翻译准确性和数据效率方面均优于现有方法，例如 RS-DPO、RSO 和 MBR 分数。