2024-03-13

Title: SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized Generation

Authors: Yanming Liu, Xinyue Peng, Jiannan Cao, Le Dai, Xingzu Liu, Weihao Liu, Mingbang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07088
Pdf URL: https://arxiv.org/pdf/2403.07088
Copy Paste: [[2403.07088]] SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized Generation(https://arxiv.org/abs/2403.07088)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown outstanding ability on various tasks, including question answering. However, LLMs incur high computation and memory costs. At the same time, LLMs may leak private data when the training or prediction procedure involves sensitive information. In this paper, we propose SPA (Side Plugin Adaption), a lightweight architecture for fast on-device inference and privacy preservation under strict on-device computation and memory constraints. Compared with other on-device seq2seq generation methods, SPA performs fast and stable inference under low-resource constraints, making it cost-efficient. Our method establishes an interaction between a pretrained LLM in the cloud and additive parameters on the device, which provides knowledge from both the pretrained LLM and private personal features. Furthermore, SPA provides a framework that keeps feature-based parameters on privacy-guaranteed but low-compute devices while leaving the parameters containing general information on high-compute devices.

Title: Narrating Causal Graphs with Large Language Models

Authors: Atharva Phatak, Vijay K. Mago, Ameeta Agrawal, Aravind Inbasekaran, Philippe J. Giabbanelli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07118
Pdf URL: https://arxiv.org/pdf/2403.07118
Copy Paste: [[2403.07118]] Narrating Causal Graphs with Large Language Models(https://arxiv.org/abs/2403.07118)
Keywords: language model, gpt
Abstract: The use of generative AI to create text descriptions from graphs has mostly focused on knowledge graphs, which connect concepts using facts. In this work we explore the capability of large pretrained language models to generate text from causal graphs, where salient concepts are represented as nodes and causality is represented via directed, typed edges. The causal reasoning encoded in these graphs can support applications as diverse as healthcare or marketing. Using two publicly available causal graph datasets, we empirically investigate the performance of four GPT-3 models under various settings. Our results indicate that while causal text descriptions improve with training data, compared to fact-based graphs, they are harder to generate under zero-shot settings. Results further suggest that users of generative AI can deploy future applications faster since similar performances are obtained when training a model with only a few examples as compared to fine-tuning via a large curated dataset.

Title: Thought Graph: Generating Thought Process for Biological Reasoning

Authors: Chi-Yang Hsu, Kyle Cox, Jiawei Xu, Zhen Tan, Tianhua Zhai, Mengzhou Hu, Dexter Pratt, Tianlong Chen, Ziniu Hu, Ying Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07144
Pdf URL: https://arxiv.org/pdf/2403.07144
Copy Paste: [[2403.07144]] Thought Graph: Generating Thought Process for Biological Reasoning(https://arxiv.org/abs/2403.07144)
Keywords: llm
Abstract: We present the Thought Graph as a novel framework to support complex reasoning and use gene set analysis as an example to uncover semantic relationships between biological processes. Our framework stands out for its ability to provide a deeper understanding of gene sets, significantly surpassing GSEA by 40.28% and LLM baselines by 5.38% based on cosine similarity to human annotations. Our analysis further provides insights into future directions for naming biological processes, and implications for bioinformatics and precision medicine.

Title: Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

Authors: Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, Daniel A. McFarland, James Y. Zou
Subjects: cs.CL, cs.AI, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2403.07183
Pdf URL: https://arxiv.org/pdf/2403.07183
Copy Paste: [[2403.07183]] Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews(https://arxiv.org/abs/2403.07183)
Keywords: language model, gpt, llm, chat
Abstract: We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
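As a rough illustration of the corpus-level maximum likelihood idea in the abstract above, the following sketch estimates a mixture weight between a human reference distribution and an AI reference distribution. It is a toy under stated assumptions, not the authors' implementation: each observation is a single word, both reference distributions are assumed known, and the function name `estimate_ai_fraction`, the word probabilities, and the grid search are all hypothetical.

```python
import math

# Hypothetical reference distributions over two words (assumed, not from the paper).
HUMAN_REF = {"good": 0.8, "commendable": 0.2}
AI_REF = {"good": 0.4, "commendable": 0.6}


def estimate_ai_fraction(tokens, p_human, p_ai, steps=1000):
    """Grid-search the mixture weight alpha in [0, 1] that maximizes the
    log-likelihood of `tokens` under (1 - alpha) * p_human + alpha * p_ai."""
    best_alpha, best_ll = 0.0, float("-inf")
    for i in range(steps + 1):
        alpha = i / steps
        ll = 0.0
        for t in tokens:
            prob = (1 - alpha) * p_human[t] + alpha * p_ai[t]
            if prob <= 0.0:
                # This alpha cannot explain the observed token at all.
                ll = float("-inf")
                break
            ll += math.log(prob)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha


# Toy corpus: 70 occurrences of "good" and 30 of "commendable".
corpus = ["good"] * 70 + ["commendable"] * 30
estimated_fraction = estimate_ai_fraction(corpus, HUMAN_REF, AI_REF)
```

The estimate is a property of the whole corpus: no individual observation is labeled as human or AI, which matches the abstract's point that corpus-level trends can be measured even when individual-level detection is too unreliable.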

Title: CuentosIE: can a chatbot about "tales with a message" help to teach emotional intelligence?

Authors: Antonio Ferrández, Rocío Lavigne-Cerván, Jesús Peral, Ignasi Navarro-Soria, Ángel Lloret, David Gil, Carmen Rocamora
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07193
Pdf URL: https://arxiv.org/pdf/2403.07193
Copy Paste: [[2403.07193]] CuentosIE: can a chatbot about "tales with a message" help to teach emotional intelligence?(https://arxiv.org/abs/2403.07193)
Keywords: chat
Abstract: In this article, we present CuentosIE (TalesEI: chatbot of tales with a message to develop Emotional Intelligence), an educational chatbot on emotions that also provides teachers and psychologists with a tool to monitor their students/patients through indicators and data compiled by CuentosIE. The use of "tales with a message" is justified by their simplicity and easy understanding, thanks to their moral or associated metaphors. The main contributions of CuentosIE are the selection, collection, and classification of a set of highly specialized tales, as well as the provision of tools (searching, reading comprehension, chatting, recommending, and classifying) that are useful for both educating users about emotions and monitoring their emotional development. The preliminary evaluation of the tool has obtained encouraging results, which provides an affirmative answer to the question posed in the title of the article.

Title: Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences

Authors: Pulkit Pattnaik, Rishabh Maheshwary, Kelechi Ogueji, Vikas Yadav, Sathwik Tejaswi Madhusudhan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.07230
Pdf URL: https://arxiv.org/pdf/2403.07230
Copy Paste: [[2403.07230]] Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences(https://arxiv.org/abs/2403.07230)
Keywords: llm, prompt
Abstract: Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (usually one chosen and rejected response pair per user prompt) to align LLMs to human preferences. In practice, multiple responses can exist for a given prompt with varying quality relative to each other. With availability of such quality ratings for multiple responses, we propose utilizing these responses to create multiple preference pairs for a given prompt. Our work focuses on systematically using the constructed multiple preference pairs in DPO training via curriculum learning methodology. In particular, we order these multiple pairs of preference data from easy to hard (emulating curriculum training) according to various criteria. We show detailed comparisons of our proposed approach to the standard single-pair DPO setting. Our method, which we call Curry-DPO, consistently shows increased performance gains on MT-bench, Vicuna, WizardLM, and the UltraFeedback test set, highlighting its effectiveness. More specifically, Curry-DPO achieves a score of 7.43 on MT-bench with the Zephyr-7B model, outperforming the majority of existing LLMs with similar parameter size. Curry-DPO also achieves the highest adjusted win rates on Vicuna, WizardLM, and UltraFeedback test datasets (90.7%, 87.1%, and 87.9% respectively) in our experiments, with notable gains of up to 7.5% when compared to the standard DPO technique.
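The pair construction and easy-to-hard ordering described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract only says pairs are ordered "according to various criteria", so using the rating gap as the difficulty criterion (larger gap means easier pair) is an assumption, and `build_curriculum_pairs` and the sample ratings are hypothetical.

```python
from itertools import combinations


def build_curriculum_pairs(prompt, rated_responses):
    """Build (chosen, rejected) preference pairs from rated responses and
    sort them easy-to-hard, treating a larger rating gap as an easier pair
    (an assumed criterion)."""
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(rated_responses, 2):
        if score_a == score_b:
            continue  # equal ratings carry no preference signal
        if score_a > score_b:
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        pairs.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
            "gap": abs(score_a - score_b),
        })
    # Largest gap first: the clearest preferences are presented earliest.
    pairs.sort(key=lambda pair: pair["gap"], reverse=True)
    return pairs


# Hypothetical ratings for three responses to one prompt.
example_pairs = build_curriculum_pairs(
    "Explain DPO briefly.",
    [("response A", 9), ("response B", 6), ("response C", 2)],
)
```

In a curriculum setup, a trainer would then consume `example_pairs` in order, for instance one DPO pass per difficulty level, rather than shuffling all pairs together.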

Title: CKERC : Joint Large Language Models with Commonsense Knowledge for Emotion Recognition in Conversation

Authors: Yumeng Fu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07260
Pdf URL: https://arxiv.org/pdf/2403.07260
Copy Paste: [[2403.07260]] CKERC : Joint Large Language Models with Commonsense Knowledge for Emotion Recognition in Conversation(https://arxiv.org/abs/2403.07260)
Keywords: language model, llm, prompt
Abstract: Emotion recognition in conversation (ERC) is the task of predicting the emotion of an utterance in the context of a conversation. It depends closely on dialogue context, speaker identity information, the multiparty dialogue scenario, and so on. However, the state-of-the-art method (instructERC) only identifies the speaker and ignores the commonsense knowledge behind speakers during a conversation (e.g., the reaction of the listeners and the intention of the speaker), which can be used to mine speaker information in depth. To this end, we propose CKERC, a novel framework that joins large language models with commonsense knowledge for emotion recognition in conversation. We design prompts to generate interlocutors' commonsense from historical utterances with a large language model, and we use an interlocutor commonsense identification task for LLM pre-training to fine-tune on speakers' implicit clues. By solving the above challenge, our method achieves state-of-the-art performance. Extensive experiments on three widely used datasets (IEMOCAP, MELD, and EmoryNLP) demonstrate our method's superiority. We also conduct in-depth analysis that further demonstrates the effectiveness of commonsense knowledge for the ERC task with large language models.

Title: Knowledge Graph Large Language Model (KG-LLM) for Link Prediction

Authors: Dong Shu, Tianle Chen, Mingyu Jin, Yiting Zhang, Mengnan Du, Yongfeng Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.07311
Pdf URL: https://arxiv.org/pdf/2403.07311
Copy Paste: [[2403.07311]] Knowledge Graph Large Language Model (KG-LLM) for Link Prediction(https://arxiv.org/abs/2403.07311)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: The task of predicting multiple links within knowledge graphs (KGs) stands as a challenge in the field of knowledge graph analysis, a challenge increasingly resolvable due to advancements in natural language processing (NLP) and KG embedding techniques. This paper introduces a novel methodology, the Knowledge Graph Large Language Model Framework (KG-LLM), which leverages pivotal NLP paradigms, including chain-of-thought (CoT) prompting and in-context learning (ICL), to enhance multi-hop link prediction in KGs. By converting the KG to a CoT prompt, our framework is designed to discern and learn the latent representations of entities and their interrelations. To show the efficacy of the KG-LLM Framework, we fine-tune three leading Large Language Models (LLMs) within this framework, employing both non-ICL and ICL tasks for a comprehensive evaluation. Further, we explore the framework's potential to provide LLMs with zero-shot capabilities for handling previously unseen prompts. Our experimental findings show that integrating ICL and CoT not only augments the performance of our approach but also significantly boosts the models' generalization capacity, thereby ensuring more precise predictions in unfamiliar scenarios.

Title: GPT-generated Text Detection: Benchmark Dataset and Tensor-based Detection Method

Authors: Zubair Qazi, William Shiao, Evangelos E. Papalexakis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07321
Pdf URL: https://arxiv.org/pdf/2403.07321
Copy Paste: [[2403.07321]] GPT-generated Text Detection: Benchmark Dataset and Tensor-based Detection Method(https://arxiv.org/abs/2403.07321)
Keywords: language model, gpt, prompt, chat
Abstract: As natural language models like ChatGPT become increasingly prevalent in applications and services, the need for robust and accurate methods to detect their output is of paramount importance. In this paper, we present GPT Reddit Dataset (GRiD), a novel Generative Pretrained Transformer (GPT)-generated text detection dataset designed to assess the performance of detection models in identifying generated responses from ChatGPT. The dataset consists of a diverse collection of context-prompt pairs based on Reddit, with human-generated and ChatGPT-generated responses. We provide an analysis of the dataset's characteristics, including linguistic diversity, context complexity, and response quality. To showcase the dataset's utility, we benchmark several detection methods on it, demonstrating their efficacy in distinguishing between human and ChatGPT-generated responses. This dataset serves as a resource for evaluating and advancing detection techniques in the context of ChatGPT and contributes to the ongoing efforts to ensure responsible and trustworthy AI-driven communication on the internet. Finally, we propose GpTen, a novel tensor-based GPT text detection method that is semi-supervised in nature since it only has access to human-generated text and performs on par with fully-supervised baselines.

Title: Rethinking ASTE: A Minimalist Tagging Scheme Alongside Contrastive Learning

Authors: Qiao Sun, Liujia Yang, Minghao Ma, Nanyang Ye, Qinying Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07342
Pdf URL: https://arxiv.org/pdf/2403.07342
Copy Paste: [[2403.07342]] Rethinking ASTE: A Minimalist Tagging Scheme Alongside Contrastive Learning(https://arxiv.org/abs/2403.07342)
Keywords: language model, gpt, llm
Abstract: Aspect Sentiment Triplet Extraction (ASTE) is a burgeoning subtask of fine-grained sentiment analysis, aiming to extract structured sentiment triplets from unstructured textual data. Existing approaches to ASTE often complicate the task with additional structures or external data. In this research, we propose a novel tagging scheme and employ a contrastive learning approach to mitigate these challenges. The proposed approach demonstrates comparable or superior performance in comparison to state-of-the-art techniques, while featuring a more compact design and reduced computational overhead. Notably, even in the era of Large Language Models (LLMs), our method exhibits superior efficacy compared to GPT 3.5 and GPT 4 in few-shot learning scenarios. This study also provides valuable insights for the advancement of ASTE techniques within the paradigm of large language models.

Title: KEBench: A Benchmark on Knowledge Editing for Large Vision-Language Models

Authors: Han Huang, Haitian Zhong, Qiang Liu, Shu Wu, Liang Wang, Tieniu Tan
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2403.07350
Pdf URL: https://arxiv.org/pdf/2403.07350
Copy Paste: [[2403.07350]] KEBench: A Benchmark on Knowledge Editing for Large Vision-Language Models(https://arxiv.org/abs/2403.07350)
Keywords: language model
Abstract: Currently, little research has been done on knowledge editing for Large Vision-Language Models (LVLMs). Editing LVLMs faces the challenge of effectively integrating diverse modalities (image and text) while ensuring coherent and contextually relevant modifications. An existing benchmark has three metrics (Reliability, Locality and Generality) to measure knowledge editing for LVLMs. However, the benchmark falls short in the quality of generated images used in evaluation and cannot assess whether models effectively utilize edited knowledge in relation to the associated content. We adopt different data collection methods to construct a new benchmark, KEBench, and add a new metric (Portability) for a comprehensive evaluation. Leveraging a multimodal knowledge graph, our image data exhibits clear directionality towards entities. This directional aspect can be further utilized to extract entity-related knowledge and form editing data. We conduct experiments with different editing methods on five LVLMs and thoroughly analyze how these methods impact the models. The results reveal strengths and deficiencies of these methods and, hopefully, provide insights into potential avenues for future research.

Title: SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression

Authors: Xin Wang, Yu Zheng, Zhongwei Wan, Mi Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.07378
Pdf URL: https://arxiv.org/pdf/2403.07378
Copy Paste: [[2403.07378]] SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression(https://arxiv.org/abs/2403.07378)
Keywords: language model, llm
Abstract: The advancements in Large Language Models (LLMs) have been hindered by their substantial sizes, which necessitate LLM compression methods for practical deployment. Singular Value Decomposition (SVD) offers a promising solution for LLM compression. However, state-of-the-art SVD-based LLM compression methods have two key limitations: truncating smaller singular values may lead to higher compression loss, and the remaining model parameters are not updated after SVD truncation. In this work, we propose SVD-LLM, a new SVD-based LLM compression method that addresses the limitations of existing methods. SVD-LLM incorporates a truncation-aware data whitening strategy to ensure a direct mapping between singular values and compression loss. Moreover, SVD-LLM adopts a layer-wise closed-form model parameter update strategy to compensate for accuracy degradation caused by SVD truncation. We evaluate SVD-LLM on a total of 11 datasets and seven models from three different LLM families at four different scales. Our results demonstrate the superiority of SVD-LLM over state-of-the-art methods, especially at high model compression ratios. The source code is available at https://github.com/AIoT-MLSys-Lab/SVD-LLM.

Title: SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

Authors: Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.07384
Pdf URL: https://arxiv.org/pdf/2403.07384
Copy Paste: [[2403.07384]] SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models(https://arxiv.org/abs/2403.07384)
Keywords: language model, llm
Abstract: Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.

Title: Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs

Authors: Tianqing Fang, Zeming Chen, Yangqiu Song, Antoine Bosselut
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07398
Pdf URL: https://arxiv.org/pdf/2403.07398
Copy Paste: [[2403.07398]] Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs(https://arxiv.org/abs/2403.07398)
Keywords: language model
Abstract: Event commonsense reasoning requires the ability to reason about the relationship between events, as well as infer implicit context underlying that relationship. However, data scarcity makes it challenging for language models to learn to generate commonsense inferences for contexts and questions involving interactions between complex events. To address this demand, we present COM2 (COMplex COMmonsense), a new dataset created by sampling multi-hop logical queries (e.g., the joint effect or cause of both event A and B, or the effect of the effect of event C) from an existing commonsense knowledge graph (CSKG), and verbalizing them using handcrafted rules and large language models into multiple-choice and text generation questions. Our experiments show that language models trained on COM2 exhibit significant improvements in complex reasoning ability, resulting in enhanced zero-shot performance in both in-domain and out-of-domain tasks for question answering and generative commonsense reasoning, without expensive human annotations.
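The two query shapes mentioned in the abstract above (the effect of the effect of an event, and the joint effect of two events) can be sketched as set operations over a small graph. This is only an illustration under stated assumptions: the toy events, the `"effect"` relation name, the dictionary layout, and the helper names `project`, `effect_of_effect`, and `joint_effect` are all hypothetical and are not taken from the COM2 pipeline.

```python
# Toy commonsense graph: (head event, relation) -> set of tail events.
# All events and edges here are invented for illustration.
TOY_GRAPH = {
    ("skip breakfast", "effect"): {"feel hungry"},
    ("feel hungry", "effect"): {"eat a snack"},
    ("run a marathon", "effect"): {"feel hungry", "feel tired"},
}


def project(graph, events, relation):
    """Follow `relation` edges one hop from every event in `events`."""
    reached = set()
    for event in events:
        reached |= graph.get((event, relation), set())
    return reached


def effect_of_effect(graph, event):
    """Two-hop query: effects of the effects of `event`."""
    return project(graph, project(graph, {event}, "effect"), "effect")


def joint_effect(graph, event_a, event_b):
    """Intersection query: effects shared by `event_a` and `event_b`."""
    return project(graph, {event_a}, "effect") & project(graph, {event_b}, "effect")


two_hop_answers = effect_of_effect(TOY_GRAPH, "skip breakfast")
shared_answers = joint_effect(TOY_GRAPH, "skip breakfast", "run a marathon")
```

In a dataset-building setting, the answer sets returned by such queries would then be verbalized into multiple-choice or generation questions, which is the step the abstract attributes to handcrafted rules and large language models.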

Title: Matrix-Transformation Based Low-Rank Adaptation (MTLoRA): A Brain-Inspired Method for Parameter-Efficient Fine-Tuning

Authors: Yao Liang, Yuwei Wang, Yi Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07440
Pdf URL: https://arxiv.org/pdf/2403.07440
Copy Paste: [[2403.07440]] Matrix-Transformation Based Low-Rank Adaptation (MTLoRA): A Brain-Inspired Method for Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2403.07440)
Keywords: language model
Abstract: Fine-tuning techniques based on Large Pretrained Language Models (LPLMs) have been proven to significantly enhance model performance on a variety of downstream tasks and effectively control the output behaviors of LPLMs. Recent studies have proposed numerous methods for fine-tuning a small number of parameters based on open-source LPLMs, reducing the demand for computational and storage resources. Among these, reparameterization fine-tuning methods represented by LoRA (Low-Rank Adaptation) have gained popularity. We find that although these methods perform well in many aspects, there is still considerable room for improvement in terms of complex task adaptability, performance, stability, and algorithm complexity. In response to this, inspired by the idea that the functions of the brain are shaped by its geometric structure, this paper integrates this idea into LoRA technology and proposes a new matrix transformation-based reparameterization method for efficient fine-tuning, named Matrix-Transformation based Low-Rank Adaptation (MTLoRA). MTLoRA aims to dynamically alter its spatial geometric structure by applying a transformation-matrix T to perform linear transformations, such as rotation, scaling, and translation, on the task-specific parameter matrix, generating new matrix feature patterns (eigenvectors) to mimic the fundamental influence of complex geometric structure feature patterns in the brain on functions, thereby enhancing the model's performance in downstream tasks. In Natural Language Understanding (NLU) tasks, it is evaluated using the GLUE benchmark test, and the results reveal that MTLoRA achieves an overall performance increase of about 1.0% across eight tasks; in Natural Language Generation (NLG) tasks, MTLoRA improves performance by an average of 0.95% and 0.31% in the DART and WebNLG tasks, respectively.

Title: MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki

Authors: Timothee Mickus, Stig-Arne Grönroos, Joseph Attieh, Michele Boggia, Ona De Gibert, Shaoxiong Ji, Niki Andreas Lopi, Alessandro Raganato, Raúl Vázquez, Jörg Tiedemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07544
Pdf URL: https://arxiv.org/pdf/2403.07544
Copy Paste: [[2403.07544]] MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki(https://arxiv.org/abs/2403.07544)
Keywords: language model
Abstract: NLP in the age of monolithic large language models is approaching its limits in terms of size and information that can be handled. The trend is toward modularization, a necessary step in the direction of designing smaller sub-networks and components with specialized functionality. In this paper, we present the MAMMOTH toolkit: a framework designed for training massively multilingual modular machine translation systems at scale, initially derived from OpenNMT-py and then adapted to ensure efficient training across computation clusters. We showcase its efficiency across clusters of A100 and V100 NVIDIA GPUs, and discuss our design philosophy and plans for future information. The toolkit is publicly available online.

Title: Truth-Aware Context Selection: Mitigating the Hallucinations of Large Language Models Being Misled by Untruthful Contexts

Authors: Tian Yu, Shaolei Zhang, Yang Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07556
Pdf URL: https://arxiv.org/pdf/2403.07556
Copy Paste: [[2403.07556]] Truth-Aware Context Selection: Mitigating the Hallucinations of Large Language Models Being Misled by Untruthful Contexts(https://arxiv.org/abs/2403.07556)
Keywords: language model, llm, hallucination
Abstract: Although large language models (LLMs) have demonstrated impressive text generation capabilities, they are easily misled by the untruthful context provided by users or knowledge argumentation tools, thereby producing hallucinations. To alleviate the LLMs from being misled by untruthful information and take advantage of knowledge argumentation, we propose Truth-Aware Context Selection (TACS), a lightweight method to shield untruthful context from the inputs. TACS begins by performing truth detection on the input context, leveraging the parameterized knowledge within the LLM. Subsequently, it constructs a corresponding attention mask based on the truthfulness of each position, selecting the truthful context and discarding the untruthful context. Additionally, we introduce a new evaluation metric, Disturbance Adaption Rate, to further study the LLMs' ability to accept truthful information and resist untruthful information. Experimental results show that TACS can effectively filter information in context and significantly improve the overall quality of LLMs' responses when presented with misleading information.
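Summary: Although large language models (LLMs) have demonstrated impressive text generation capabilities, they are easily misled by untruthful context supplied by users or knowledge argumentation tools, and thus produce hallucinations. To keep LLMs from being misled by untruthful information while still benefiting from knowledge argumentation, we propose Truth-Aware Context Selection (TACS), a lightweight method that shields untruthful context from the input. TACS first performs truth detection on the input context, using the parameterized knowledge within the LLM. It then constructs a corresponding attention mask based on the truthfulness of each position, selecting the truthful context and discarding the untruthful context. In addition, we introduce a new evaluation metric, Disturbance Adaption Rate, to further study the ability of LLMs to accept truthful information and resist untruthful information. Experimental results show that TACS can effectively filter information in the context and significantly improve the overall quality of LLM responses when misleading information is present.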
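The mask-construction step described in this abstract can be pictured with a minimal Python sketch. It assumes per-position truthfulness scores in [0, 1] are already available (the abstract says they come from truth detection using the LLM's own knowledge, which is not reproduced here). The function name `build_truth_mask` and the 0.5 threshold are hypothetical and not taken from the paper.

```python
from typing import List


def build_truth_mask(truth_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return 1 for positions kept as truthful context and 0 for masked positions.

    Hypothetical helper: both the scores and the threshold are illustrative
    assumptions, not values or names from the TACS paper.
    """
    return [1 if score >= threshold else 0 for score in truth_scores]


# Example: four context positions with assumed truthfulness scores.
scores = [0.9, 0.2, 0.7, 0.4]
mask = build_truth_mask(scores)
print(mask)
```

In an actual attention implementation, positions with mask value 0 would be excluded from attention, so the untruthful context does not influence generation.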

Title: SIFiD: Reassess Summary Factual Inconsistency Detection with LLM

Authors: Jiuding Yang, Hui Liu, Weidong Guo, Zhuwei Rao, Yu Xu, Di Niu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.07557
Pdf URL: https://arxiv.org/pdf/2403.07557
Copy Paste: [[2403.07557]] SIFiD: Reassess Summary Factual Inconsistency Detection with LLM(https://arxiv.org/abs/2403.07557)
Keywords: language model, gpt, llm
Abstract: Ensuring factual consistency between the summary and the original document is paramount in summarization tasks. Consequently, considerable effort has been dedicated to detecting inconsistencies. With the advent of Large Language Models (LLMs), recent studies have begun to leverage their advanced language understanding capabilities for inconsistency detection. However, early attempts have shown that LLMs underperform traditional models due to their limited ability to follow instructions and the absence of an effective detection methodology. In this study, we reassess summary inconsistency detection with LLMs, comparing the performances of GPT-3.5 and GPT-4. To advance research in LLM-based inconsistency detection, we propose SIFiD (Summary Inconsistency Detection with Filtered Document), which identifies key sentences within documents by either employing natural language inference or measuring semantic similarity between summaries and documents.
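Summary: Ensuring factual consistency between a summary and the original document is paramount in summarization tasks. Considerable effort has therefore been devoted to detecting inconsistencies. With the advent of large language models (LLMs), recent studies have begun to use their advanced language understanding capabilities for inconsistency detection. However, early attempts showed that LLMs underperform traditional models because of their limited ability to follow instructions and the lack of an effective detection methodology. In this study, we reassess summary inconsistency detection with LLMs, comparing the performance of GPT-3.5 and GPT-4. To advance research on LLM-based inconsistency detection, we propose SIFiD (Summary Inconsistency Detection with Filtered Document), which identifies key sentences within documents by either employing natural language inference or measuring semantic similarity between summaries and documents.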

Title: Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation

Authors: Francois Meyer, Jan Buys
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07567
Pdf URL: https://arxiv.org/pdf/2403.07567
Copy Paste: [[2403.07567]] Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation(https://arxiv.org/abs/2403.07567)
Keywords: language model
Abstract: Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages are largely unexplored. In this paper we tackle data-to-text for isiXhosa, which is low-resource and agglutinative. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which presents a new linguistic context that shifts modelling demands to subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately generated text describes the data. This enables future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side we explore two classes of methods - dedicated data-to-text models trained from scratch and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy entities, and outperforms existing dedicated models for 2 agglutinative languages (isiXhosa and Finnish). We investigate pretrained solutions for T2X, which reveals that standard PLMs come up short. Fine-tuning machine translation models emerges as the best method overall. These findings underscore the distinct challenge presented by T2X: neither well-established data-to-text architectures nor customary pretrained methodologies prove optimal. We conclude with a qualitative analysis of generation errors and an ablation study.
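Summary: Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages remain largely unexplored. In this paper we tackle data-to-text for isiXhosa, a low-resource, agglutinative language. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which provides a new linguistic context that shifts modelling demands toward subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately the generated text describes the data, enabling future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side, we explore two classes of methods: dedicated data-to-text models trained from scratch and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy entities, and outperforms existing dedicated models for two agglutinative languages (isiXhosa and Finnish). We investigate pretrained solutions for T2X and find that standard PLMs fall short. Overall, fine-tuning machine translation models is the best method. These findings underscore the distinct challenge posed by T2X: neither well-established data-to-text architectures nor customary pretrained methodologies prove optimal. We conclude with a qualitative analysis of generation errors and an ablation study.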

Title: LLMvsSmall Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model

Authors: Linmei Hu, Hongyu He, Duokang Wang, Ziwang Zhao, Yingxia Shao, Liqiang Nie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07581
Pdf URL: https://arxiv.org/pdf/2403.07581
Copy Paste: [[2403.07581]] LLMvsSmall Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model(https://arxiv.org/abs/2403.07581)
Keywords: language model, llm
Abstract: Personality detection aims to detect the personality traits underlying one's social media posts. One challenge of this task is the scarcity of ground-truth personality traits, which are collected from self-report questionnaires. Most existing methods learn post features directly by fine-tuning pre-trained language models under the supervision of limited personality labels. This leads to inferior post features and consequently affects performance. In addition, they treat personality traits as one-hot classification labels, overlooking the semantic information within them. In this paper, we propose a large language model (LLM) based text augmentation enhanced personality detection model, which distills the LLM's knowledge to enhance the small model for personality detection, even when the LLM fails in this task. Specifically, we enable the LLM to generate post analyses (augmentations) from semantic, sentiment, and linguistic aspects, which are critical for personality detection. By using contrastive learning to pull them together in the embedding space, the post encoder can better capture the psycho-linguistic information within the post representations, thus improving personality detection. Furthermore, we utilize the LLM to enrich the information of personality labels to enhance detection performance. Experimental results on the benchmark datasets demonstrate that our model outperforms the state-of-the-art methods on personality detection.
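Summary: Personality detection aims to detect the personality traits underlying social media posts. One challenge of this task is the scarcity of ground-truth personality traits collected from self-report questionnaires. Most existing methods learn post features directly by fine-tuning pretrained language models under the supervision of limited personality labels. This results in lower-quality post features and consequently hurts performance. In addition, they treat personality traits as one-hot classification labels and overlook the semantic information they contain. In this paper, we propose a large language model (LLM) based text augmentation enhanced personality detection model, which distills the LLM's knowledge to strengthen a small model for personality detection, even when the LLM itself fails at this task. Specifically, we have the LLM generate post analyses (augmentations) from semantic, sentiment, and linguistic aspects, which are critical for personality detection. By using contrastive learning to pull them together in the embedding space, the post encoder can better capture the psycho-linguistic information in the post representations, thereby improving personality detection. Furthermore, we use the LLM to enrich the information in personality labels to improve detection performance. Experimental results on benchmark datasets show that our model outperforms state-of-the-art methods on personality detection.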

Title: Reference-free Monolithic Preference Optimization with Odds Ratio

Authors: Jiwoo Hong, Noah Lee, James Thorne
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07691
Pdf URL: https://arxiv.org/pdf/2403.07691
Copy Paste: [[2403.07691]] Reference-free Monolithic Preference Optimization with Odds Ratio(https://arxiv.org/abs/2403.07691)
Keywords: language model
Abstract: While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on $\text{AlpacaEval}_{2.0}$ and 7.32 in MT-Bench, as shown in Figures 1 and 12. We release code and model checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B).
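Summary: Although recent preference alignment algorithms for language models have shown promising results, supervised fine-tuning (SFT) remains essential for achieving successful convergence. In this paper, we study the crucial role of SFT in the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this, we introduce a simple and innovative reference-model-free monolithic odds ratio preference optimization algorithm, ORPO, which eliminates the need for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across model sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on $\text{AlpacaEval}_{2.0}$ and 7.32 on MT-Bench, as shown in Figures 1 and 12. We release code and model checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B).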
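To make the idea of an odds-ratio penalty added to SFT concrete, here is a minimal Python sketch. The abstract does not state the exact loss, so everything below is an assumption for illustration: `p_chosen` and `p_rejected` stand for the model's (for example length-normalized) likelihoods of the favored and disfavored responses, the penalty is written as the negative log-sigmoid of the log odds ratio, and `lam` is a hypothetical weighting hyperparameter.

```python
import math


def log_odds(p: float) -> float:
    """Log of odds(p) = p / (1 - p), for a probability strictly between 0 and 1."""
    return math.log(p) - math.log1p(-p)


def odds_ratio_penalty(p_chosen: float, p_rejected: float) -> float:
    """Assumed penalty: -log(sigmoid(log odds ratio)).

    Small when the favored response has much higher odds than the
    disfavored one, large when the ordering is reversed.
    """
    log_ratio = log_odds(p_chosen) - log_odds(p_rejected)
    # -log(sigmoid(z)) equals log(1 + exp(-z)).
    return math.log1p(math.exp(-log_ratio))


def combined_loss(sft_nll: float, p_chosen: float, p_rejected: float, lam: float = 0.1) -> float:
    """SFT negative log-likelihood plus a weighted odds-ratio penalty (lam is hypothetical)."""
    return sft_nll + lam * odds_ratio_penalty(p_chosen, p_rejected)


print(odds_ratio_penalty(0.8, 0.2) < odds_ratio_penalty(0.2, 0.8))
```

Because the penalty only compares two likelihoods from the same model, a sketch like this needs no separate reference model, which matches the "reference model-free" framing of the abstract.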

Title: Large, Small or Both: A Novel Data Augmentation Framework Based on Language Models for Debiasing Opinion Summarization

Authors: Yanyue Zhang, Pengfei Li, Yilong Lai, Deyu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07693
Pdf URL: https://arxiv.org/pdf/2403.07693
Copy Paste: [[2403.07693]] Large, Small or Both: A Novel Data Augmentation Framework Based on Language Models for Debiasing Opinion Summarization(https://arxiv.org/abs/2403.07693)
Keywords: language model
Abstract: As more than 70% of reviews in the existing opinion summary data set are positive, current opinion summarization approaches are reluctant to generate negative summaries given the input of negative texts. To address such sentiment bias, a direct approach without the over-reliance on a specific framework is to generate additional data based on large language models to balance the emotional distribution of the dataset. However, data augmentation based on large language models faces two disadvantages: 1) the potential issues or toxicity in the augmented data; 2) the expensive costs. Therefore, in this paper, we propose a novel data augmentation framework based on both large and small language models for debiasing opinion summarization. Specifically, a small set of synthesized negative reviews is obtained by rewriting the positive text via a large language model. Then, a disentangle reconstruction model is trained based on the generated data. After training, a large amount of synthetic data can be obtained by decoding the new representation obtained from the combination of different sample representations and filtering based on confusion degree and sentiment classification. Experiments have proved that our framework can alleviate emotional bias as effectively as using only large models, but more economically.
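Summary: Because more than 70% of the reviews in existing opinion summarization datasets are positive, current opinion summarization approaches are reluctant to generate negative summaries when given negative input texts. To address this sentiment bias, a direct approach that does not over-rely on a specific framework is to generate additional data with large language models to balance the sentiment distribution of the dataset. However, data augmentation based on large language models has two disadvantages: 1) potential issues or toxicity in the augmented data; 2) high cost. Therefore, in this paper, we propose a novel data augmentation framework based on both large and small language models for debiasing opinion summarization. Specifically, a small set of synthetic negative reviews is obtained by rewriting positive text with a large language model. A disentangle reconstruction model is then trained on the generated data. After training, a large amount of synthetic data can be obtained by decoding new representations formed by combining different sample representations, and by filtering based on confusion degree and sentiment classification. Experiments show that our framework can alleviate sentiment bias as effectively as using only large models, but more economically.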

Title: Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

Authors: Wei Shen, Xiaoying Zhang, Yuanshun Yao, Rui Zheng, Hongyi Guo, Yang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07708
Pdf URL: https://arxiv.org/pdf/2403.07708
Copy Paste: [[2403.07708]] Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards(https://arxiv.org/abs/2403.07708)
Keywords: language model, gpt, llm, prompt
Abstract: Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences. Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources, e.g. human labeling errors, making the pipeline fragile. In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named contrastive rewards. Our approach involves two steps: (1) an offline sampling step to obtain responses to prompts that serve as baseline calculation and (2) a contrastive reward calculated using the baseline responses and used in the Proximal Policy Optimization (PPO) step. We show that contrastive rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over baselines, calibrate according to task difficulty, and reduce variance in PPO. We show empirically contrastive rewards can improve RLHF substantially, evaluated by both GPTs and humans, and our method consistently outperforms strong baselines.
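Summary: Reinforcement learning from human feedback (RLHF) is the mainstream paradigm for aligning large language models (LLMs) with human preferences. However, existing RLHF relies heavily on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources, such as human labeling errors, making the pipeline fragile. In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, called contrastive rewards. Our approach involves two steps: (1) an offline sampling step that obtains responses to prompts to serve as the baseline calculation, and (2) a contrastive reward computed using the baseline responses and used in the Proximal Policy Optimization (PPO) step. We show that contrastive rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over baselines, calibrate according to task difficulty, and reduce variance in PPO. We show empirically, with evaluation by both GPTs and humans, that contrastive rewards can substantially improve RLHF, and that our method consistently outperforms strong baselines.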
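The two-step procedure in this abstract can be illustrated with a small Python sketch. The abstract does not give the exact formula, so the form below is an assumption: the contrastive reward is taken to be the reward of the policy's response minus the mean reward of the offline-sampled baseline responses for the same prompt. The function name `contrastive_reward` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def contrastive_reward(reward: float, baseline_rewards: Sequence[float]) -> float:
    """Reward of the current response relative to offline baseline responses.

    Assumed form for illustration: subtract the mean reward-model score of the
    baseline responses collected in the offline sampling step.
    """
    return reward - mean(baseline_rewards)


# Example: the current response scores 1.0; two baseline responses score 0.5 each.
print(contrastive_reward(1.0, [0.5, 0.5]))
```

Under this assumed form, a response is only rewarded for the margin by which it beats the baseline responses, which is one way to read "encourage improvement over baselines" and "calibrate according to task difficulty".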

Title: StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Authors: Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07714
Pdf URL: https://arxiv.org/pdf/2403.07714
Copy Paste: [[2403.07714]] StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models(https://arxiv.org/abs/2403.07714)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied on either hand-crafted online tools with limited scale, or large-scale real online APIs suffering from instability of API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench, proposing a virtual API server and stable evaluation system. The virtual API server contains a caching system and API simulators which are complementary to alleviate the change in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate the randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and further discuss the effectiveness of API simulators, the caching system, and the evaluator system.
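Summary: Large language models (LLMs) have advanced remarkably in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the ability of LLMs to use tools requires large-scale and stable benchmarks. However, previous work relied either on hand-crafted online tools of limited scale or on large-scale real online APIs that suffer from unstable API status. To address this problem, we introduce StableToolBench, a benchmark that evolves from ToolBench and proposes a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators, which complement each other to mitigate changes in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluator system.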

Title: SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

Authors: Timothee Mickus, Elaine Zosa, Raúl Vázquez, Teemu Vahtola, Jörg Tiedemann, Vincent Segonne, Alessandro Raganato, Marianna Apidianaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07726
Pdf URL: https://arxiv.org/pdf/2403.07726
Copy Paste: [[2403.07726]] SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes(https://arxiv.org/abs/2403.07726)
Keywords: hallucination, prompt
Abstract: This paper presents the results of the SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration put in jeopardy many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs labeled by 5 annotators each, spanning 3 NLP tasks: machine translation, paraphrase generation and definition modeling. The shared task was tackled by a total of 58 different users grouped in 42 teams, out of which 27 elected to write a system description paper; collectively, they submitted over 300 prediction sets on both tracks of the shared task. We observe a number of key trends in how this approach was tackled -- many participants rely on a handful of models, and often rely either on synthetic data for fine-tuning or zero-shot prompting strategies. While a majority of the teams did outperform our proposed baseline system, the performances of top-scoring systems are still consistent with a random handling of the more challenging items.
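Summary: This paper presents the results of SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent yet inaccurate. Such cases of overgeneration jeopardize many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs, each labeled by 5 annotators, spanning 3 NLP tasks: machine translation, paraphrase generation, and definition modeling. The shared task was tackled by a total of 58 different users grouped in 42 teams, of which 27 elected to write a system description paper; collectively, they submitted over 300 prediction sets on both tracks of the shared task. We observe a number of key trends in how the task was approached: many participants rely on a handful of models, and often rely either on synthetic data for fine-tuning or on zero-shot prompting strategies. While a majority of the teams did outperform our proposed baseline system, the performance of the top-scoring systems is still consistent with random handling of the more challenging items.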

Title: FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

Authors: Yan Liu, Renren Jin, Lin Shi, Zheng Yao, Deyi Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07747
Pdf URL: https://arxiv.org/pdf/2403.07747
Copy Paste: [[2403.07747]] FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models(https://arxiv.org/abs/2403.07747)
Keywords: language model, llm
Abstract: To thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need to carefully curate evaluation datasets covering diverse mathematical concepts and mathematical problems at different difficulty levels. In pursuit of this objective, we propose FineMath in this paper, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath is created to cover the major key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of mathematical reasoning abilities of LLMs. All the 17 categories of math word problems are manually annotated with their difficulty levels according to the number of reasoning steps required to solve these problems. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvements in terms of mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis on the evaluation process and methods that have been overlooked previously. These two factors significantly influence the model results and our understanding of their mathematical reasoning capabilities. The dataset will be publicly available soon.
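Summary: To thoroughly assess the mathematical reasoning abilities of large language models (LLMs), we need to carefully curate evaluation datasets that cover diverse mathematical concepts and mathematical problems at different difficulty levels. To this end, we propose FineMath, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath is designed to cover the major mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of the mathematical reasoning abilities of LLMs. All 17 categories of math word problems are manually annotated with difficulty levels according to the number of reasoning steps required to solve them. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvement in the mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis of the evaluation process and methods, which have previously been overlooked. These two factors significantly influence the model results and our understanding of their mathematical reasoning capabilities. The dataset will be publicly available soon.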

Title: Fine-tuning Large Language Models with Sequential Instructions

Authors: Hanxu Hu, Pinzhen Chen, Edoardo M. Ponti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07794
Pdf URL: https://arxiv.org/pdf/2403.07794
Copy Paste: [[2403.07794]] Fine-tuning Large Language Models with Sequential Instructions(https://arxiv.org/abs/2403.07794)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) struggle to follow a sequence of instructions in a single query as they may ignore or misinterpret part of it. This impairs their performance in complex problems whose solution requires multiple intermediate steps, such as multilingual (translate then answer) and multimodal (caption then answer) tasks. We empirically verify this with open-source LLMs as large as LLaMA-2 70B and Mixtral-8x7B. Targeting the scarcity of sequential instructions in present-day data, we propose sequential instruction tuning, a simple yet effective strategy to automatically augment instruction tuning data and equip LLMs with the ability to execute multiple sequential instructions. After exploring interleaving instructions in existing datasets, such as Alpaca, with a wide range of intermediate tasks, we find that sequential instruction-tuned models consistently outperform the conventional instruction-tuned baselines in downstream tasks involving reasoning, multilingual, and multimodal abilities. To shed further light on our technique, we analyse how adversarial intermediate texts, unseen tasks, prompt verbalization, number of tasks, and prompt length affect SIT. We hope that this method will open new research avenues on instruction tuning for complex tasks.
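Summary: Large language models (LLMs) struggle to follow a sequence of instructions in a single query, as they may ignore or misinterpret part of it. This impairs their performance on complex problems whose solution requires multiple intermediate steps, such as multilingual (translate then answer) and multimodal (caption then answer) tasks. We verify this empirically with open-source LLMs as large as LLaMA-2 70B and Mixtral-8x7B. Targeting the scarcity of sequential instructions in present-day data, we propose sequential instruction tuning, a simple yet effective strategy that automatically augments instruction tuning data and equips LLMs with the ability to execute multiple sequential instructions. After exploring interleaving instructions in existing datasets, such as Alpaca, with a wide range of intermediate tasks, we find that sequential instruction-tuned models consistently outperform conventional instruction-tuned baselines on downstream tasks involving reasoning, multilingual, and multimodal abilities. To shed further light on our technique, we analyse how adversarial intermediate texts, unseen tasks, prompt verbalization, number of tasks, and prompt length affect SIT. We hope that this method will open new research avenues on instruction tuning for complex tasks.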

Title: Beyond Memorization: The Challenge of Random Memory Access in Language Models

Authors: Tongyao Zhu, Qian Liu, Liang Pang, Zhengbao Jiang, Min-Yen Kan, Min Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07805
Pdf URL: https://arxiv.org/pdf/2403.07805
Copy Paste: [[2403.07805]] Beyond Memorization: The Challenge of Random Memory Access in Language Models(https://arxiv.org/abs/2403.07805)
Keywords: language model, gpt
Abstract: Recent developments in Language Models (LMs) have shown their effectiveness in NLP tasks, particularly in knowledge-intensive tasks. However, the mechanisms underlying knowledge storage and memory access within their parameters remain elusive. In this paper, we investigate whether a generative LM (e.g., GPT-2) is able to access its memory sequentially or randomly. Through carefully-designed synthetic tasks, covering the scenarios of full recitation, selective recitation and grounded question answering, we reveal that LMs manage to sequentially access their memory while encountering challenges in randomly accessing memorized content. We find that techniques including recitation and permutation improve the random memory access capability of LMs. Furthermore, by applying this intervention to realistic scenarios of open-domain question answering, we validate that enhancing random access by recitation leads to notable improvements in question answering. The code to reproduce our experiments can be found at https://github.com/sail-sg/lm-random-memory-access.
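Summary: Recent developments in language models (LMs) have shown their effectiveness in NLP tasks, particularly in knowledge-intensive tasks. However, the mechanisms underlying knowledge storage and memory access within their parameters remain elusive. In this paper, we investigate whether a generative LM (e.g., GPT-2) is able to access its memory sequentially or randomly. Through carefully designed synthetic tasks covering full recitation, selective recitation, and grounded question answering, we reveal that LMs manage to access their memory sequentially while encountering challenges in randomly accessing memorized content. We find that techniques including recitation and permutation improve the random memory access capability of LMs. Furthermore, by applying this intervention to realistic open-domain question answering scenarios, we validate that enhancing random access through recitation leads to notable improvements in question answering. The code to reproduce our experiments can be found at https://github.com/sail-sg/lm-random-memory-access.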

Title: Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Authors: Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07816
Pdf URL: https://arxiv.org/pdf/2403.07816
Copy Paste: [[2403.07816]] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM(https://arxiv.org/abs/2403.07816)
Keywords: language model, llm
Abstract: We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Expert (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
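Summary: We investigate efficient methods for training large language models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning, and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in an embarrassingly parallel fashion with high throughput and reduced communication cost. After the individual experts are trained asynchronously, BTX brings together their feedforward parameters as experts in Mixture-of-Experts (MoE) layers and averages the remaining parameters, followed by an MoE fine-tuning stage to learn token-level routing. BTX generalizes two special cases: the Branch-Train-Merge method, which has no MoE fine-tuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.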
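The parameter-combination step described in this abstract (feedforward weights kept as separate experts, all other weights averaged) can be sketched in plain Python. This is a toy illustration under stated assumptions: parameters are flat lists of floats keyed by name, feedforward parameters are identified by a hypothetical `ffn.` name prefix, and the router and the subsequent MoE fine-tuning stage are not modeled.

```python
from typing import Dict, List, Union

Params = Dict[str, List[float]]
Merged = Dict[str, Union[List[float], List[List[float]]]]


def btx_merge(experts: List[Params], ffn_prefix: str = "ffn.") -> Merged:
    """Combine separately trained expert models into one set of MoE-style parameters.

    Hypothetical helper: the naming convention and data layout are assumptions
    made for illustration only.
    """
    merged: Merged = {}
    n = len(experts)
    for name in experts[0]:
        if name.startswith(ffn_prefix):
            # Keep every expert's feedforward weights side by side as MoE experts.
            merged[name] = [expert[name] for expert in experts]
        else:
            # Average the remaining (non-feedforward) parameters element-wise.
            columns = zip(*(expert[name] for expert in experts))
            merged[name] = [sum(values) / n for values in columns]
    return merged


experts = [
    {"ffn.w": [1.0, 2.0], "attn.w": [0.0, 2.0]},
    {"ffn.w": [3.0, 4.0], "attn.w": [2.0, 4.0]},
]
print(btx_merge(experts))
```

In this toy layout, dropping the later routing fine-tuning would correspond to the Branch-Train-Merge special case mentioned in the abstract, while starting all experts from identical copies without separate training would correspond to sparse upcycling.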

Title: The Missing Piece in Model Editing: A Deep Dive into the Hidden Damage Brought By Model Editing

Authors: Jianchen Wang, Zhouhong Gu, Zhuozhi Xiong, Hongwei Feng, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07825
Pdf URL: https://arxiv.org/pdf/2403.07825
Copy Paste: [[2403.07825]] The Missing Piece in Model Editing: A Deep Dive into the Hidden Damage Brought By Model Editing(https://arxiv.org/abs/2403.07825)
Keywords: language model, llm
Abstract: Large Language Models have revolutionized numerous tasks with their remarkable efficacy. However, the editing of these models, crucial for rectifying outdated or erroneous information, often leads to a complex issue known as the ripple effect in the hidden space. This effect, while difficult to detect, can significantly impede the efficacy of model editing tasks and deteriorate model performance. This paper addresses this scientific challenge by proposing a novel evaluation methodology, Graphical Outlier Relation based Assessment (GORA), which quantitatively evaluates the adaptations of the model and the subsequent impact of editing. Furthermore, we introduce the Selective Outlier Re-Editing Approach (SORA), a model editing method designed to mitigate this ripple effect. Our comprehensive evaluations reveal that the ripple effect in the hidden space is a significant issue in all current model editing methods. However, our proposed methods, GORA and SORA, effectively identify and alleviate this issue, respectively, contributing to the advancement of LLM editing techniques.
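Summary: Large language models have revolutionized numerous tasks with their remarkable efficacy. However, editing these models, which is crucial for rectifying outdated or erroneous information, often leads to a complex issue known as the ripple effect in the hidden space. This effect, while difficult to detect, can significantly impede the efficacy of model editing tasks and degrade model performance. This paper addresses this scientific challenge by proposing a novel evaluation methodology, Graphical Outlier Relation based Assessment (GORA), which quantitatively evaluates the adaptations of the model and the subsequent impact of editing. Furthermore, we introduce the Selective Outlier Re-Editing Approach (SORA), a model editing method designed to mitigate this ripple effect. Our comprehensive evaluations reveal that the ripple effect in the hidden space is a significant issue in all current model editing methods. However, our proposed methods, GORA and SORA, effectively identify and alleviate this issue, respectively, contributing to the advancement of LLM editing techniques.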

Title: Exploring Safety Generalization Challenges of Large Language Models via Code

Authors: Qibing Ren, Chang Gao, Jing Shao, Junchi Yan, Xin Tan, Wai Lam, Lizhuang Ma
Subjects: cs.CL, cs.AI, cs.CR, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2403.07865
Pdf URL: https://arxiv.org/pdf/2403.07865
Copy Paste: [[2403.07865]] Exploring Safety Generalization Challenges of Large Language Models via Code(https://arxiv.org/abs/2403.07865)
Keywords: language model, gpt, llm
Abstract: The rapid advancement of Large Language Models (LLMs) has brought about remarkable capabilities in natural language processing but also raised concerns about their potential misuse. While strategies like supervised fine-tuning and reinforcement learning from human feedback have enhanced their safety, these methods primarily focus on natural languages, which may not generalize to other domains. This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs, presenting a novel environment for testing the safety generalization of LLMs. Our comprehensive studies on state-of-the-art LLMs including GPT-4, Claude-2, and Llama-2 series reveal a common safety vulnerability of these models against code input: CodeAttack consistently bypasses the safety guardrails of all models more than 80% of the time. Furthermore, we find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization, such as encoding natural language input with data structures or using less popular programming languages. These findings highlight new safety risks in the code domain and the need for more robust safety alignment algorithms to match the code capabilities of LLMs.
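Summary: The rapid advancement of large language models (LLMs) has brought remarkable capabilities in natural language processing but has also raised concerns about their potential misuse. While strategies such as supervised fine-tuning and reinforcement learning from human feedback have enhanced their safety, these methods focus primarily on natural languages and may not generalize to other domains. This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs, providing a novel environment for testing the safety generalization of LLMs. Our comprehensive studies of state-of-the-art LLMs, including GPT-4, Claude-2, and the Llama-2 series, reveal a common safety vulnerability of these models against code input: CodeAttack consistently bypasses the safety guardrails of all models more than 80% of the time. Furthermore, we find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization, for example when natural language input is encoded with data structures or when less popular programming languages are used. These findings highlight new safety risks in the code domain and the need for more robust safety alignment algorithms to match the code capabilities of LLMs.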

Title: Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

Authors: Fangyun Wei, Xi Chen, Lin Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.07872
Pdf URL: https://arxiv.org/pdf/2403.07872
Copy Paste: [[2403.07872]] Rethinking Generative Large Language Model Evaluation for Semantic Comprehension(https://arxiv.org/abs/2403.07872)
Keywords: language model, gpt, llm
Abstract: Despite their sophisticated capabilities, large language models (LLMs) encounter a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method, multiple choice question answering (MCQA), which allows for straightforward accuracy measurement. Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA, for instance, the inconsistency between the MCQA evaluation and the generation of open-ended responses in practical scenarios. In response, we introduce an RWQ-Elo rating system, engaging 24 LLMs such as GPT-4, GPT-3.5, Google-Gemini-Pro and LLaMA-1/-2, in a two-player competitive format, with GPT-4 serving as the judge. Each LLM receives an Elo rating thereafter. This system is designed to mirror real-world usage, and for this purpose, we have compiled a new benchmark called "Real-world questions" (RWQ), comprising 20,772 authentic user inquiries. Additionally, we thoroughly analyze the characteristics of our system and compare it with prior leaderboards like AlpacaEval and MT-Bench. Our analysis reveals the stability of our RWQ-Elo system, the feasibility of registering new models, and its potential to reshape LLM leaderboards.
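Summary: Despite their sophisticated capabilities, large language models (LLMs) face a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method, multiple choice question answering (MCQA), which allows accuracy to be measured directly. Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA, such as the inconsistency between MCQA evaluation and the generation of open-ended responses in practical scenarios. In response, we introduce an RWQ-Elo rating system in which 24 LLMs, such as GPT-4, GPT-3.5, Google-Gemini-Pro, and LLaMA-1/-2, compete in a two-player format with GPT-4 serving as the judge. Each LLM then receives an Elo rating. The system is designed to mirror real-world usage, and for this purpose we have compiled a new benchmark called "Real-world questions" (RWQ), comprising 20,772 authentic user inquiries. In addition, we thoroughly analyze the characteristics of our system and compare it with prior leaderboards such as AlpacaEval and MT-Bench. Our analysis reveals the stability of the RWQ-Elo system, the feasibility of registering new models, and its potential to reshape LLM leaderboards.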
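For readers unfamiliar with Elo ratings, the following Python sketch shows the standard Elo update for one two-player comparison. It is the textbook formula, not the paper's RWQ-Elo implementation: the starting ratings, the K-factor of 32, and the function names are assumptions, and how the GPT-4 judge's verdict is mapped to `score_a` is not specified in the abstract.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one match.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    The K-factor default is an illustrative assumption.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two models start at the same rating; the judge prefers model A's answer.
print(update_elo(1000.0, 1000.0, 1.0))
```

With equal starting ratings the expected score is 0.5, so a win moves the winner up by half the K-factor and the loser down by the same amount.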