2024-04-19

Title: MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory

Authors: Ali Modarressi, Abdullatif Köksal, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11672
Pdf URL: https://arxiv.org/pdf/2404.11672
Copy Paste: [[2404.11672]] MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory(https://arxiv.org/abs/2404.11672)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: While current large language models (LLMs) demonstrate some capabilities in knowledge-intensive tasks, they are limited by relying on their parameters as an implicit storage mechanism. As a result, they struggle with infrequent knowledge and temporal degradation. In addition, the uninterpretable nature of parametric memorization makes it challenging to understand and prevent hallucination. Parametric memory pools and model editing are only partial solutions. Retrieval Augmented Generation (RAG), though non-parametric, has its own limitations: it lacks structure, complicates interpretability and makes it hard to effectively manage stored knowledge. In this paper, we introduce MemLLM, a novel method of enhancing LLMs by integrating a structured and explicit read-and-write memory module. MemLLM tackles the aforementioned challenges by enabling dynamic interaction with the memory and improving the LLM's capabilities in using stored knowledge. Our experiments indicate that MemLLM enhances the LLM's performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular. We see MemLLM as an important step towards making LLMs more grounded and factual through memory augmentation.

Title: How Well Can You Articulate that Idea? Insights from Automated Formative Assessment

Authors: Mahsa Sheikhi Karizaki, Dana Gnesdilow, Sadhana Puntambekar, Rebecca J. Passonneau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11682
Pdf URL: https://arxiv.org/pdf/2404.11682
Copy Paste: [[2404.11682]] How Well Can You Articulate that Idea? Insights from Automated Formative Assessment(https://arxiv.org/abs/2404.11682)
Keywords: prompt
Abstract: Automated methods are becoming increasingly integrated into studies of formative feedback on students' science explanation writing. Most of this work, however, addresses students' responses to short answer questions. We investigate automated feedback on students' science explanation essays, where students must articulate multiple ideas. Feedback is based on a rubric that identifies the main ideas students are prompted to include in explanatory essays about the physics of energy and mass, given their experiments with a simulated roller coaster. We have found that students generally improve on revised versions of their essays. Here, however, we focus on two factors that affect the accuracy of the automated feedback. First, we find that the main ideas in the rubric differ with respect to how much freedom they afford in explanations of the idea; thus, explanation of a natural law is relatively constrained. Students have more freedom in how they explain complex relations they observe in their roller coasters, such as transfer of different forms of energy. Second, by tracing the automated decision process, we can diagnose when a student's statement lacks sufficient clarity for the automated tool to associate it more strongly with one of the main ideas above all others. This in turn provides an opportunity for teachers and peers to help students reflect on how to state their ideas more clearly.

Title: How often are errors in natural language reasoning due to paraphrastic variability?

Authors: Neha Srikanth, Marine Carpuat, Rachel Rudinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11717
Pdf URL: https://arxiv.org/pdf/2404.11717
Copy Paste: [[2404.11717]] How often are errors in natural language reasoning due to paraphrastic variability?(https://arxiv.org/abs/2404.11717)
Keywords: language model
Abstract: Large language models have been shown to behave inconsistently in response to meaning-preserving paraphrastic inputs. At the same time, researchers evaluate the knowledge and reasoning abilities of these models with test evaluations that do not disaggregate the effect of paraphrastic variability on performance. We propose a metric for evaluating the paraphrastic consistency of natural language reasoning models based on the probability of a model achieving the same correctness on two paraphrases of the same problem. We mathematically connect this metric to the proportion of a model's variance in correctness attributable to paraphrasing. To estimate paraphrastic consistency, we collect ParaNLU, a dataset of 7,782 human-written and validated paraphrased reasoning problems constructed on top of existing benchmark datasets for defeasible and abductive natural language inference. Using ParaNLU, we measure the paraphrastic consistency of several model classes and show that consistency dramatically increases with pretraining but not finetuning. All models tested exhibited room for improvement in paraphrastic consistency.
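The consistency metric described in the abstract above (the probability that a model is equally correct on two paraphrases of the same problem) can be illustrated with a minimal sketch. The estimator below, which averages pairwise agreement within each problem and then across problems, is an assumption made for illustration and may differ from the paper's exact formulation; the function name and the toy data are hypothetical.

```python
from itertools import combinations


def paraphrastic_consistency(correctness_by_problem):
    """Estimate P(same correctness on two paraphrases of one problem).

    correctness_by_problem maps a problem id to a list of booleans,
    one per paraphrase, saying whether the model answered correctly.
    This pairwise-agreement estimator is an illustrative assumption.
    """
    per_problem = []
    for outcomes in correctness_by_problem.values():
        pairs = list(combinations(outcomes, 2))
        if not pairs:
            continue  # a single paraphrase gives no pair to compare
        agreeing = sum(a == b for a, b in pairs)
        per_problem.append(agreeing / len(pairs))
    if not per_problem:
        raise ValueError("need at least one problem with two or more paraphrases")
    return sum(per_problem) / len(per_problem)


# Hypothetical toy results: three problems with two or three paraphrases each.
results = {
    "p1": [True, True, True],   # always correct: fully consistent
    "p2": [True, False, True],  # correctness flips under paraphrase
    "p3": [False, False],       # always wrong: also fully consistent
}
score = paraphrastic_consistency(results)
print(round(score, 4))
```

Note that a model that is consistently wrong still scores as consistent here, which is why such a metric would be read alongside accuracy rather than instead of it.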

Title: Investigating Gender Bias in Turkish Language Models

Authors: Orhun Caglidil, Malte Ostendorff, Georg Rehm
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11726
Pdf URL: https://arxiv.org/pdf/2404.11726
Copy Paste: [[2404.11726]] Investigating Gender Bias in Turkish Language Models(https://arxiv.org/abs/2404.11726)
Keywords: language model
Abstract: Language models are trained mostly on Web data, which often contains social stereotypes and biases that the models can inherit. This has potentially negative consequences, as models can amplify these biases in downstream tasks or applications. However, prior research has primarily focused on the English language, especially in the context of gender bias. In particular, grammatically gender-neutral languages such as Turkish are underexplored despite representing different linguistic properties to language models with possibly different effects on biases. In this paper, we fill this research gap and investigate the significance of gender bias in Turkish language models. We build upon existing bias evaluation frameworks and extend them to the Turkish language by translating existing English tests and creating new ones designed to measure gender bias in the context of Türkiye. Specifically, we also evaluate Turkish language models for their embedded ethnic bias toward Kurdish people. Based on the experimental results, we attribute possible biases to different model characteristics such as the model size, their multilingualism, and the training corpora. We make the Turkish gender bias dataset publicly available.

Title: Missed Connections: Lateral Thinking Puzzles for Large Language Models

Authors: Graham Todd, Tim Merino, Sam Earle, Julian Togelius
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11730
Pdf URL: https://arxiv.org/pdf/2404.11730
Copy Paste: [[2404.11730]] Missed Connections: Lateral Thinking Puzzles for Large Language Models(https://arxiv.org/abs/2404.11730)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme. Solving the puzzle requires both common linguistic knowledge (i.e. definitions and typical usage) as well as, in many cases, lateral or abstract thinking. This is because the four categories ascend in complexity, with the most challenging category often requiring thinking about words in uncommon ways or as parts of larger phrases. We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning and a way to measure the semantic information encoded by data-driven linguistic systems. In particular, we study both a sentence-embedding baseline and modern large language models (LLMs). We report their accuracy on the task, measure the impacts of chain-of-thought prompting, and discuss their failure modes. Overall, we find that the Connections task is challenging yet feasible, and a strong test-bed for future work.

Title: Mapping Violence: Developing an Extensive Framework to Build a Bangla Sectarian Expression Dataset from Social Media Interactions

Authors: Nazia Tasnim, Sujan Sen Gupta, Md. Istiak Hossain Shihab, Fatiha Islam Juee, Arunima Tahsin, Pritom Ghum, Kanij Fatema, Marshia Haque, Wasema Farzana, Prionti Nasir, Ashique KhudaBukhsh, Farig Sadeque, Asif Sushmit
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2404.11752
Pdf URL: https://arxiv.org/pdf/2404.11752
Copy Paste: [[2404.11752]] Mapping Violence: Developing an Extensive Framework to Build a Bangla Sectarian Expression Dataset from Social Media Interactions(https://arxiv.org/abs/2404.11752)
Keywords: language model
Abstract: Communal violence in online forums has become extremely prevalent in South Asia, where many communities of different cultures coexist and share resources. These societies exhibit a phenomenon characterized by strong bonds within their own groups and animosity towards others, leading to conflicts that frequently escalate into violent confrontations. To address this issue, we have developed the first comprehensive framework for the automatic detection of communal violence markers in online Bangla content accompanying the largest collection (13K raw sentences) of social media interactions that fall under the definition of four major violence classes and their 16 coarse expressions. Our workflow introduces a 7-step expert annotation process incorporating insights from social scientists, linguists, and psychologists. By presenting data statistics and benchmarking performance using this dataset, we have determined that, aside from the category of Non-communal violence, Religio-communal violence is particularly pervasive in Bangla text. Moreover, we have substantiated the effectiveness of fine-tuning language models in identifying violent comments by conducting preliminary benchmarking on the state-of-the-art Bangla deep learning model.

Title: Language Models Still Struggle to Zero-shot Reason about Time Series

Authors: Mike A. Merrill, Mingtian Tan, Vinayak Gupta, Tom Hartvigsen, Tim Althoff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11757
Pdf URL: https://arxiv.org/pdf/2404.11757
Copy Paste: [[2404.11757]] Language Models Still Struggle to Zero-shot Reason about Time Series(https://arxiv.org/abs/2404.11757)
Keywords: language model
Abstract: Time series are critical for decision-making in fields like finance and healthcare. Their importance has driven a recent influx of works passing time series into language models, leading to non-trivial forecasting on some datasets. But it remains unknown whether non-trivial forecasting implies that language models can reason about time series. To address this gap, we generate a first-of-its-kind evaluation framework for time series reasoning, including formal tasks and a corresponding dataset of multi-scale time series paired with text captions across ten domains. Using these data, we probe whether language models achieve three forms of reasoning: (1) Etiological Reasoning - given an input time series, can the language model identify the scenario that most likely created it? (2) Question Answering - can a language model answer factual questions about time series? (3) Context-Aided Forecasting - does highly relevant textual context improve a language model's time series forecasts? We find that otherwise highly-capable language models demonstrate surprisingly limited time series reasoning: they score marginally above random on etiological and question answering tasks (up to 30 percentage points worse than humans) and show modest success in using context to improve forecasting. These weaknesses showcase that time series reasoning is an impactful, yet deeply underdeveloped direction for language model research. We also make our datasets and code public to support further research in this direction at https://github.com/behavioral-data/TSandLanguage

Title: REQUAL-LM: Reliability and Equity through Aggregation in Large Language Models

Authors: Sana Ebrahimi, Nima Shahbazi, Abolfazl Asudeh
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2404.11782
Pdf URL: https://arxiv.org/pdf/2404.11782
Copy Paste: [[2404.11782]] REQUAL-LM: Reliability and Equity through Aggregation in Large Language Models(https://arxiv.org/abs/2404.11782)
Keywords: language model, llm
Abstract: The extensive scope of large language models (LLMs) across various domains underscores the critical importance of responsibility in their application, beyond natural language processing. In particular, the randomized nature of LLMs, coupled with inherent biases and historical stereotypes in data, raises critical concerns regarding reliability and equity. Addressing these challenges is necessary before using LLMs for applications with societal impact. Towards addressing this gap, we introduce REQUAL-LM, a novel method for finding reliable and equitable LLM outputs through aggregation. Specifically, we develop a Monte Carlo method based on repeated sampling to find a reliable output close to the mean of the underlying distribution of possible outputs. We formally define terms such as reliability and bias, and design an equity-aware aggregation to minimize harmful bias while finding a highly reliable output. REQUAL-LM does not require specialized hardware, does not impose a significant computing load, and uses LLMs as a blackbox. This design choice enables seamless scalability alongside the rapid advancement of LLM technologies. Our system does not require retraining the LLMs, which makes it deployment ready and easy to adapt. Our comprehensive experiments using various tasks and datasets demonstrate that REQUAL-LM effectively mitigates bias and selects a more equitable response, specifically the outputs that properly represent minority groups.
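The repeated-sampling idea in the REQUAL-LM abstract above (pick a reliable output close to the mean of the distribution of possible outputs) can be sketched as follows. This is only an illustration under stated assumptions: each sampled output is assumed to already have a numeric embedding, closeness is measured with cosine similarity, and the equity-aware part of the aggregation is omitted entirely. The function names and toy vectors are hypothetical and are not taken from the paper.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def most_central(outputs, embeddings):
    """Return the sampled output whose embedding is closest to the mean."""
    center = centroid(embeddings)
    best = max(range(len(outputs)), key=lambda i: cosine(embeddings[i], center))
    return outputs[best]


# Hypothetical samples from repeated queries to a black-box LLM,
# with made-up 2-d embeddings standing in for a real encoder.
samples = ["answer A", "answer B", "answer C"]
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
chosen = most_central(samples, toy_embeddings)
print(chosen)
```

Because selection only needs sampled text and an embedding function, a scheme like this is compatible with the black-box usage the abstract describes.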

Title: AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence

Authors: Minbeom Kim, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11826
Pdf URL: https://arxiv.org/pdf/2404.11826
Copy Paste: [[2404.11826]] AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence(https://arxiv.org/abs/2404.11826)
Keywords: language model, gpt, llm
Abstract: As the integration of large language models into daily life is on the rise, there is a clear gap in benchmarks for advising on subjective and personal dilemmas. To address this, we introduce AdvisorQA, the first benchmark developed to assess LLMs' capability in offering advice for deeply personalized concerns, utilizing the LifeProTips subreddit forum. This forum features a dynamic interaction where users post advice-seeking questions, receiving an average of 8.9 pieces of advice per query, with 164.2 upvotes from hundreds of users, embodying a collective intelligence framework. Therefore, we've completed a benchmark encompassing daily life questions, diverse corresponding responses, and majority vote ranking to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation, analyzing phenomena beyond the trade-off between helpfulness and harmlessness. AdvisorQA marks a significant leap in enhancing QA systems for providing personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity.

Title: TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Authors: Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.11912
Pdf URL: https://arxiv.org/pdf/2404.11912
Copy Paste: [[2404.11912]] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding(https://arxiv.org/abs/2404.11912)
Keywords: language model, llm
Abstract: With large language models (LLMs) widely deployed in long content generation recently, there has emerged an increasing demand for efficient long-sequence inference support. However, the key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck by growing linearly in size with the sequence length. Due to the auto-regressive nature of LLMs, the entire KV cache will be loaded for every generated token, resulting in low utilization of computational cores and high latency. While various compression methods for KV cache have been proposed to alleviate this issue, they suffer from degradation in generation quality. We introduce TriForce, a hierarchical speculative decoding system that is scalable to long sequence generation. This approach leverages the original model weights and dynamic sparse KV cache via retrieval as a draft model, which serves as an intermediate layer in the hierarchy and is further speculated by a smaller model to reduce its drafting latency. TriForce not only facilitates impressive speedups for Llama2-7B-128K, achieving up to 2.31× on an A100 GPU, but also showcases scalability in handling even longer contexts. For the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108s/token, only half as slow as the auto-regressive baseline on an A100, which attains 7.78× on our optimized offloading system. Additionally, TriForce performs 4.86× faster than DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce's robustness is highlighted by its consistently outstanding performance across various temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.
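To make the linear KV-cache growth mentioned in the TriForce abstract above concrete, here is a back-of-the-envelope sketch. The default shape values (32 layers, 32 key/value heads, head dimension 128, 2 bytes per value for fp16, batch size 1) are assumptions corresponding to the commonly published Llama-2-7B configuration; they are not stated in the abstract, and the function name is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for seq_len tokens (batch of 1).

    The leading factor 2 accounts for storing both a key and a value
    vector per layer, per head, per token. Defaults are assumed values.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


GIB = 1024 ** 3
for tokens in (4 * 1024, 32 * 1024, 128 * 1024):
    # Cache size scales linearly with the number of cached tokens.
    print(f"{tokens:>7} tokens -> {kv_cache_bytes(tokens) / GIB:.1f} GiB")
```

Under these assumptions each cached token costs 0.5 MiB, so a 128K-token context needs about 64 GiB for the cache alone, which is consistent with the abstract's point that loading the full cache for every generated token becomes the bottleneck.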

Title: SKIP: Skill-Localized Prompt Tuning for Inference Speed Boost-Up

Authors: Nakyeong Yang, Junseok Kim, Jiwon Moon, Yunah Jang, Kyomin Jung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11916
Pdf URL: https://arxiv.org/pdf/2404.11916
Copy Paste: [[2404.11916]] SKIP: Skill-Localized Prompt Tuning for Inference Speed Boost-Up(https://arxiv.org/abs/2404.11916)
Keywords: language model, prompt
Abstract: Prompt-tuning methods have shown performance comparable to parameter-efficient fine-tuning (PEFT) methods in various natural language understanding tasks. However, existing prompt tuning methods still utilize the entire model architecture; thus, they fail to accelerate inference speed in the application. In this paper, we propose a novel approach called SKIll-localized Prompt tuning (SKIP), which is extremely efficient in inference time. Our method significantly enhances inference efficiency by investigating and utilizing a skill-localized subnetwork in a language model. Surprisingly, our method improves the inference speed by up to 160% while pruning 52% of the parameters. Furthermore, we demonstrate that our method is applicable across various transformer-based architectures, thereby confirming its practicality and scalability.

Title: CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment

Authors: Geyu Lin, Bin Wang, Zhengyuan Liu, Nancy F. Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11932
Pdf URL: https://arxiv.org/pdf/2404.11932
Copy Paste: [[2404.11932]] CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment(https://arxiv.org/abs/2404.11932)
Keywords: language model, llm
Abstract: Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models are usually suboptimal in other languages, particularly those that are linguistically distant from English. This performance discrepancy mainly stems from the imbalanced distribution of training data across languages during pre-training and instruction tuning stages. To address this problem, we propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction tuning data. Our method leverages the compressed representation shared by various languages to efficiently enhance the model's task-solving capabilities and multilingual proficiency within a single process. In addition, we introduce a multi-task and multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental results demonstrate that our method substantially improves performance across tasks and languages, and we provide extensive insights into the impact of cross-lingual data volume and the integration of translation data on enhancing multilingual consistency and accuracy.

Title: Aligning Language Models to Explicitly Handle Ambiguity

Authors: Hyuhng Joon Kim, Youna Kim, Cheonbok Park, Junyeob Kim, Choonghyun Park, Kang Min Yoo, Sang-goo Lee, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11972
Pdf URL: https://arxiv.org/pdf/2404.11972
Copy Paste: [[2404.11972]] Aligning Language Models to Explicitly Handle Ambiguity(https://arxiv.org/abs/2404.11972)
Keywords: language model, llm, agent
Abstract: In spoken languages, utterances are often shaped to be incomplete or vague for efficiency. This can lead to varying interpretations of the same input, based on different assumptions about the context. To ensure reliable user-model interactions in such scenarios, it is crucial for models to adeptly handle the inherent ambiguity in user queries. However, conversational agents built upon even the most recent large language models (LLMs) face challenges in processing ambiguous inputs, primarily due to the following two hurdles: (1) LLMs are not directly trained to handle inputs that are too ambiguous to be properly managed; (2) the degree of ambiguity in an input can vary according to the intrinsic knowledge of the LLMs, which is difficult to investigate. To address these issues, this paper proposes a method to align LLMs to explicitly handle ambiguous inputs. Specifically, we introduce a proxy task that guides LLMs to utilize their intrinsic knowledge to self-disambiguate a given input. We quantify the information gain from the disambiguation procedure as a measure of the extent to which the models perceive their inputs as ambiguous. This measure serves as a cue for selecting samples deemed ambiguous from the models' perspectives, which are then utilized for alignment. Experimental results from several question-answering datasets demonstrate that the LLMs fine-tuned with our approach are capable of handling ambiguous inputs while still performing competitively on clear questions within the task.
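The abstract above quantifies how ambiguous a model finds an input via the information gain from disambiguation. One simple way to picture such a quantity is as a drop in entropy over candidate answers before and after disambiguation. The sketch below is an assumption-laden illustration: the abstract does not specify how the distributions are obtained or how the gain is computed, and the function names and probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(before, after):
    """Entropy reduction from the original to the disambiguated input."""
    return entropy(before) - entropy(after)


# Hypothetical answer distributions for one query: the model is split
# between two readings before disambiguation and certain afterwards.
ambiguous_query = [0.5, 0.5]
disambiguated_query = [1.0, 0.0]
gain = information_gain(ambiguous_query, disambiguated_query)
print(gain)
```

Under this reading, queries with a large gain are the ones the model itself perceives as ambiguous, which matches the abstract's use of the measure as a cue for selecting alignment samples.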

Title: EVIT: Event-Oriented Instruction Tuning for Event Reasoning

Authors: Zhengwei Tao, Xiancai Chen, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Yiwei Lou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11978
Pdf URL: https://arxiv.org/pdf/2404.11978
Copy Paste: [[2404.11978]] EVIT: Event-Oriented Instruction Tuning for Event Reasoning(https://arxiv.org/abs/2404.11978)
Keywords: language model, llm
Abstract: Events refer to specific occurrences, incidents, or happenings that take place under a particular background. Event reasoning aims to infer events according to certain relations and predict future events. The cutting-edge techniques for event reasoning play a crucial role in various natural language processing applications. Large language models (LLMs) have made significant advancements in event reasoning owing to their wealth of knowledge and reasoning capabilities. However, smaller instruction-tuned models currently in use do not consistently demonstrate exceptional proficiency in managing these tasks. This discrepancy arises from the absence of explicit modeling of events and their interconnections within their instruction data. Consequently, these models face challenges in comprehending event structures and semantics while struggling to bridge the gap between their interpretations and human understanding of events. Additionally, their limitations in grasping event relations constrain their ability to effectively deduce and incorporate pertinent event knowledge. In this paper, we propose Event-Oriented Instruction Tuning (EvIT) to train our LLM. Specifically, we first propose a novel structure named event quadruple which contains the structure and semantics of events and is complete in the event representation. We then design event-relation learning based on the structures. We encapsulate the learning into the instruction-tuning formulation to better stimulate the event reasoning capacity of our model. We design a heuristic unsupervised method to mine event quadruples from a large-scale corpus. Finally, we finetune a Llama model on our Event-Oriented Instruction Tuning. We conduct extensive experiments on event reasoning tasks on several datasets. Automatic and human evaluations demonstrate EvIT achieves competitive performance on event reasoning.

Title: Token-level Direct Preference Optimization

Authors: Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11999
Pdf URL: https://arxiv.org/pdf/2404.11999
Copy Paste: [[2404.11999]] Token-level Direct Preference Optimization(https://arxiv.org/abs/2404.11999)
Keywords: language model, llm
Abstract: Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs at the token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
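For orientation, the sequence-level DPO baseline that this abstract compares against can be sketched in a few lines. This is a minimal sketch of standard DPO under the Bradley-Terry model, not the token-level TDPO objective, which the abstract does not spell out. The function names, the plain-list inputs, and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sequence_logprob(token_logprobs):
    """A sequence log-probability is the sum of its per-token log-probabilities."""
    return sum(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sequence-level DPO loss for one preference pair (illustrative only).

    Each argument is a list of per-token log-probabilities of a response under
    the policy or the frozen reference model.
    """
    # Implicit rewards: beta times the log-ratio between policy and reference.
    chosen_reward = beta * (sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen))
    rejected_reward = beta * (sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected))
    margin = chosen_reward - rejected_reward
    # Bradley-Terry: P(chosen preferred) = sigmoid(margin).
    # The loss is -log sigmoid(margin), which equals softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference model the margin is zero and the loss is log 2. Raising the policy's probability of the preferred response relative to the reference lowers the loss.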

Title: ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity

Authors: Lasal Jayawardena, Prasan Yapa
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12010
Pdf URL: https://arxiv.org/pdf/2404.12010
Copy Paste: [[2404.12010]] ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity(https://arxiv.org/abs/2404.12010)
Keywords: language model, llm
Abstract: Paraphrase generation is a pivotal task in natural language processing (NLP). Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English language sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLM) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation as it contains one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.

Title: Enhance Robustness of Language Models Against Variation Attack through Graph Integration

Authors: Zi Xiong, Lizhi Qing, Yangyang Kang, Jiawei Liu, Hongsong Li, Changlong Sun, Xiaozhong Liu, Wei Lu
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2404.12014
Pdf URL: https://arxiv.org/pdf/2404.12014
Copy Paste: [[2404.12014]] Enhance Robustness of Language Models Against Variation Attack through Graph Integration(https://arxiv.org/abs/2404.12014)
Keywords: language model
Abstract: The widespread use of pre-trained language models (PLMs) in natural language processing (NLP) has greatly improved performance outcomes. However, these models' vulnerability to adversarial attacks (e.g., camouflaged hints from drug dealers), particularly in the Chinese language with its rich character diversity/variation and complex structures, raises serious concern. In this study, we propose a novel method, CHinese vAriatioN Graph Enhancement (CHANGE), to increase the robustness of PLMs against character variation attacks in Chinese content. CHANGE presents a novel approach for incorporating a Chinese character variation graph into the PLMs. Through designing different supplementary tasks utilizing the graph structure, CHANGE essentially enhances PLMs' interpretation of adversarially manipulated text. Experiments conducted on a multitude of NLP tasks show that CHANGE outperforms current language models in combating adversarial attacks and serves as a valuable contribution to robust language model research. These findings contribute to the groundwork on robust language models and highlight the substantial potential of graph-guided pre-training strategies for real-world applications.

Title: Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Authors: Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12022
Pdf URL: https://arxiv.org/pdf/2404.12022
Copy Paste: [[2404.12022]] Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration(https://arxiv.org/abs/2404.12022)
Keywords: language model, llm
Abstract: Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly evident when utilizing autoregressive decoding methods, which generate one token in a single forward process, thereby not fully capitalizing on the parallel computing capabilities of GPUs. In this paper, we propose a novel parallel decoding approach, namely \textit{hidden transfer}, which decodes multiple successive tokens simultaneously in a single forward pass. The idea is to transfer the intermediate hidden states of the previous context to the \textit{pseudo} hidden states of the future tokens to be generated; the pseudo hidden states then pass through the subsequent transformer layers, thereby assimilating more semantic information and achieving superior predictive accuracy for the future tokens. In addition, we use a novel tree attention mechanism to simultaneously generate and verify multiple candidate output sequences, which ensures lossless generation and further improves the generation efficiency of our method. Experiments demonstrate the effectiveness of our method. We conduct extensive analytical experiments to support our motivation. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.

Title: Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector

Authors: Zhihao Xu, Ruixuan Huang, Xiting Wang, Fangzhao Wu, Jing Yao, Xing Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12038
Pdf URL: https://arxiv.org/pdf/2404.12038
Copy Paste: [[2404.12038]] Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector(https://arxiv.org/abs/2404.12038)
Keywords: language model, llm
Abstract: Current open-source large language models (LLMs) often undergo careful safety alignment before public release. Some attack methods have also been proposed that help check for safety vulnerabilities in LLMs to ensure alignment robustness. However, many of these methods have moderate attack success rates. Even when successful, the harmfulness of their outputs cannot be guaranteed, leading to suspicions that these methods have not accurately identified the safety vulnerabilities of LLMs. In this paper, we introduce an LLM attack method utilizing concept-based model explanation, where we extract safety concept activation vectors (SCAVs) from LLMs' activation space, enabling efficient attacks on well-aligned LLMs like LLaMA-2 and achieving a near-100% attack success rate, as if the LLMs were completely unaligned. This suggests that LLMs, even after thorough safety alignment, could still pose potential risks to society upon public release. To evaluate the harmfulness of outputs resulting from various attack methods, we propose a comprehensive evaluation method that reduces the potential inaccuracies of existing evaluations, and further validate that our method produces more harmful content. Additionally, we discover that the SCAVs show some transferability across different open-source LLMs.

Title: Can We Catch the Elephant? The Evolvement of Hallucination Evaluation on Natural Language Generation: A Survey

Authors: Siya Qi, Yulan He, Zheng Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12041
Pdf URL: https://arxiv.org/pdf/2404.12041
Copy Paste: [[2404.12041]] Can We Catch the Elephant? The Evolvement of Hallucination Evaluation on Natural Language Generation: A Survey(https://arxiv.org/abs/2404.12041)
Keywords: language model, llm, hallucination
Abstract: Hallucination in Natural Language Generation (NLG) is like the elephant in the room, obvious but often overlooked until recent achievements significantly improved the fluency and grammatical accuracy of generated text. For Large Language Models (LLMs), hallucinations can happen in various downstream tasks and casual conversations, which need accurate assessment to enhance reliability and safety. However, current studies on hallucination evaluation vary greatly, and people still find it difficult to sort out and select the most appropriate evaluation methods. Moreover, as NLP research gradually shifts to the domain of LLMs, it brings new challenges to this direction. This paper provides a comprehensive survey on the evolvement of hallucination evaluation methods, aiming to address three key aspects: 1) Diverse definitions and granularity of facts; 2) The categories of automatic evaluators and their applicability; 3) Unresolved issues and future directions.

Title: RAGAR, Your Falsehood RADAR: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models

Authors: M. Abdul Khaliq, P. Chang, M. Ma, B. Pflugfelder, F. Miletić
Subjects: cs.CL, cs.AI, cs.CY, cs.ET, cs.MA
Abstract URL: https://arxiv.org/abs/2404.12065
Pdf URL: https://arxiv.org/pdf/2404.12065
Copy Paste: [[2404.12065]] RAGAR, Your Falsehood RADAR: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models(https://arxiv.org/abs/2404.12065)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The escalating challenge of misinformation, particularly in the context of political discourse, necessitates advanced solutions for fact-checking. We introduce innovative approaches to enhance the reliability and efficiency of multimodal fact-checking through the integration of Large Language Models (LLMs) with Retrieval-augmented Generation (RAG)-based advanced reasoning techniques. This work proposes two novel methodologies, Chain of RAG (CoRAG) and Tree of RAG (ToRAG). The approaches are designed to handle multimodal claims by reasoning about the next questions that need to be answered based on previous evidence. Our approaches improve the accuracy of veracity predictions and the generation of explanations over the traditional fact-checking approach of sub-question generation with chain-of-thought veracity prediction. By employing multimodal LLMs adept at analyzing both text and images, this research advances the capability of automated systems in identifying and countering misinformation.

Title: LongEmbed: Extending Embedding Models for Long Context Retrieval

Authors: Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12096
Pdf URL: https://arxiv.org/pdf/2404.12096
Copy Paste: [[2404.12096]] LongEmbed: Extending Embedding Models for Long Context Retrieval(https://arxiv.org/abs/2404.12096)
Keywords: llm, long context
Abstract: Embedding models play a pivotal role in modern NLP applications such as IR and RAG. While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, which keeps them from application scenarios requiring long inputs, such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Based on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context is 512 or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods, such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.
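The training-free position interpolation mentioned in this abstract can be illustrated with a small sketch. The idea is to rescale the position indices of a long input so that they fall inside the range the model saw during training, then feed those (possibly fractional) positions to the position encoding. The sketch assumes a RoPE-style encoding with the conventional base of 10000. The function names are hypothetical and this is not the LongEmbed authors' code.

```python
def interpolated_positions(seq_len, trained_len):
    """Rescale positions 0..seq_len-1 so they stay within [0, trained_len).

    Inputs no longer than the trained window are left unchanged.
    """
    scale = min(1.0, trained_len / seq_len)
    return [p * scale for p in range(seq_len)]


def rope_angles(position, dim, base=10000.0):
    """RoPE rotation angles for one position; dim is the (even) head dimension."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]


# A model trained on 512 positions reading a 2048-token document:
positions = interpolated_positions(seq_len=2048, trained_len=512)
angles_for_token_4 = rope_angles(positions[4], dim=4)
```

Every fourth token lands on an integer position the model has seen, and the tokens in between receive fractional positions. That is why the technique needs an encoding that is defined for non-integer positions.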

Title: From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency

Authors: Xenia Ohmer, Elia Bruni, Dieuwke Hupkes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12145
Pdf URL: https://arxiv.org/pdf/2404.12145
Copy Paste: [[2404.12145]] From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency(https://arxiv.org/abs/2404.12145)
Keywords: language model, gpt, llm
Abstract: The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what "understanding" means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes - inspired by Fregean senses - of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.

Title: Stance Detection on Social Media with Fine-Tuned Large Language Models

Authors: İlker Gül, Rémi Lebret, Karl Aberer
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2404.12171
Pdf URL: https://arxiv.org/pdf/2404.12171
Copy Paste: [[2404.12171]] Stance Detection on Social Media with Fine-Tuned Large Language Models(https://arxiv.org/abs/2404.12171)
Keywords: language model, gpt, llm, chat
Abstract: Stance detection, a key task in natural language processing, determines an author's viewpoint based on textual analysis. This study evaluates the evolution of stance detection methods, transitioning from early machine learning approaches to the groundbreaking BERT model, and eventually to modern Large Language Models (LLMs) such as ChatGPT, LLaMa-2, and Mistral-7B. While ChatGPT's closed-source nature and associated costs present challenges, open-source models like LLaMa-2 and Mistral-7B offer an encouraging alternative. Initially, our research focused on fine-tuning ChatGPT, LLaMa-2, and Mistral-7B using several publicly available datasets. Subsequently, to provide a comprehensive comparison, we assess the performance of these models in zero-shot and few-shot learning scenarios. The results underscore the exceptional ability of LLMs in accurately detecting stance, with all tested models surpassing existing benchmarks. Notably, LLaMa-2 and Mistral-7B demonstrate remarkable efficiency and potential for stance detection, despite their smaller sizes compared to ChatGPT. This study emphasizes the potential of LLMs in stance detection and calls for more extensive research in this field.

Title: Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines?

Authors: Laura Majer, Jan Šnajder
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12174
Pdf URL: https://arxiv.org/pdf/2404.12174
Copy Paste: [[2404.12174]] Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines?(https://arxiv.org/abs/2404.12174)
Keywords: llm, prompt
Abstract: The increasing threat of disinformation calls for automating parts of the fact-checking pipeline. Identifying text segments requiring fact-checking is known as claim detection (CD) and claim check-worthiness detection (CW), the latter incorporating complex domain-specific criteria of worthiness and often framed as a ranking task. Zero- and few-shot LLM prompting is an attractive option for both tasks, as it bypasses the need for labeled datasets and allows verbalized claim and worthiness criteria to be directly used for prompting. We evaluate the LLMs' predictive and calibration accuracy on five CD/CW datasets from diverse domains, each utilizing a different worthiness criterion. We investigate two key aspects: (1) how best to distill factuality and worthiness criteria into a prompt and (2) what amount of context to provide for each claim. To this end, we experiment with varying the level of prompt verbosity and the amount of contextual information provided to the model. Our results show that optimal prompt verbosity is domain-dependent, adding context does not improve performance, and confidence scores can be directly used to produce reliable check-worthiness rankings.

Title: OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data

Authors: Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12195
Pdf URL: https://arxiv.org/pdf/2404.12195
Copy Paste: [[2404.12195]] OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data(https://arxiv.org/abs/2404.12195)
Keywords: gpt, llm
Abstract: Instruction fine-tuning pretrained LLMs for diverse downstream tasks has demonstrated remarkable success and has captured the interest of both academics and practitioners. To ensure such fine-tuned LLMs align with human preferences, techniques such as RLHF and DPO have emerged. At the same time, there is increasing interest in smaller parameter counts for models. In this work, using OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the OpenBezoar family of models. In this recipe: We first generate synthetic instruction fine-tuning data using an open and commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model under three schemes based on: LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a seed dataset) and Orca (with the Flan Collection as a seed dataset), then filter these generations using GPT-4 as a human proxy. We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme. The resulting checkpoint is further fine-tuned with a subset of the HH-RLHF dataset to minimize distribution shift prior to using the DPO loss to obtain the final checkpoint. Evaluation is done with the LM Eval Harness tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with Claude 2.1, with the finding that the final checkpoint, "OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at the 3B parameter scale, even outperforming the top model in one of the categories on the Huggingface Open LLM Leaderboard. We release "OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", "OpenBezoar-HH-RLHF-DPO" checkpoints, alongside our generated datasets on HuggingFace at https://huggingface.co/collections/SurgeGlobal/open-bezoar-6620a24923e12127e9e2b9cc and our codebase at https://bitbucket.org/paladinanalytics/workspace/projects/OP.

Title: Length Generalization of Causal Transformers without Position Encoding

Authors: Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, Xiaoling Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12224
Pdf URL: https://arxiv.org/pdf/2404.12224
Copy Paste: [[2404.12224]] Length Generalization of Causal Transformers without Position Encoding(https://arxiv.org/abs/2404.12224)
Keywords: language model, long context
Abstract: Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms manipulating explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning for searching attention heads' best temperature hyper-parameters, which substantially expands NoPE's context size. Experiments on long sequence language modeling, the synthetic passkey retrieval task and real-world long context tasks show that NoPE can achieve competitive performances with state-of-the-art length generalization algorithms. The source code is publicly accessible
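To make the role of a per-head attention temperature concrete, the sketch below shows how scaling the attention logits sharpens a head's attention distribution. The abstract links diffuse ("distracted") attention to NoPE's length-generalization failure. This is only an illustration of the effect of such a hyper-parameter. The search procedure the paper proposes is not described in the abstract and is not reproduced here, and `temperature_scale` is a hypothetical name.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, temperature_scale=1.0):
    """Attention weights of one query over its visible keys for a single head.

    temperature_scale multiplies the usual 1/sqrt(d) scaling of the logits;
    values above 1 concentrate the distribution on the highest-scoring keys.
    """
    d = len(query)
    logits = [
        temperature_scale * sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(logits)


def entropy(ps):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in ps if p > 0)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
baseline = attention_weights(query, keys)
sharpened = attention_weights(query, keys, temperature_scale=2.0)
```

With these inputs the sharpened distribution has lower entropy than the baseline and puts more weight on the best-matching key.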

Title: Introducing v0.5 of the AI Safety Benchmark from MLCommons

Authors: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, Ram Gandikota, Agasthya Gangavarapu, Ananya Gangavarapu, James Gealy, Rajat Ghosh, James Goel, Usman Gohar, Sujata Goswami, Scott A. Hale, Wiebke Hutiri, Joseph Marvin Imperial, Surgan Jandial, Nick Judd, Felix Juefei-Xu, Foutse Khomh, Bhavya Kailkhura, Hannah Rose Kirk, Kevin Klyman, Chris Knotz, Michael Kuchnik, Shachi H. Kumar, Chris Lengerich, Bo Li, Zeyi Liao, Eileen Peters Long, Victor Lu, Yifan Mai, et al. (46 additional authors not shown)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12241
Pdf URL: https://arxiv.org/pdf/2404.12241
Copy Paste: [[2404.12241]] Introducing v0.5 of the AI Safety Benchmark from MLCommons(https://arxiv.org/abs/2404.12241)
Keywords: language model, prompt, chat
Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.

Title: Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Authors: Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12253
Pdf URL: https://arxiv.org/pdf/2404.12253
Copy Paste: [[2404.12253]] Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing(https://arxiv.org/abs/2404.12253)
Keywords: language model, llm, prompt
Abstract: Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work proposed advanced prompting techniques and the necessity of fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet, the efficacy of LLMs in self-refining their responses, particularly in complex reasoning and planning tasks, remains dubious. In this paper, we introduce AlphaLLM for the self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLMs for self-improvement, including data scarcity, the vast search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM comprises a prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models for precise feedback. Our experimental results in mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.
摘要：尽管大型语言模型 (LLM) 在各种任务上具有令人印象深刻的能力，但它们仍然难以处理涉及复杂推理和规划的场景。最近的工作提出了先进的提示技术以及使用高质量数据进行微调以增强法学硕士推理能力的必要性。然而，这些方法本质上受到数据可用性和质量的限制。有鉴于此，自我纠正和自我学习成为可行的解决方案，采用的策略允许法学硕士改进他们的成果并从自我评估的奖励中学习。然而，法学硕士在自我完善其反应方面的有效性，特别是在复杂的推理和规划任务中，仍然值得怀疑。在本文中，我们引入了用于LLM自我改进的AlphaLLM，它将蒙特卡罗树搜索（MCTS）与LLM集成，建立自我改进循环，从而在无需额外注释的情况下增强LLM的能力。 AlphaLLM 从 AlphaGo 的成功中汲取灵感，解决了将 MCTS 与 LLM 相结合以实现自我提升的独特挑战，包括数据稀缺、语言任务的巨大搜索空间以及语言任务中反馈的主观性。 AlphaLLM 由即时合成组件、专为语言任务量身定制的高效 MCTS 方法以及用于精确反馈的三个批评模型组成。我们在数学推理任务中的实验结果表明，AlphaLLM 在无需额外注释的情况下显着提高了法学硕士的性能，显示了法学硕士自我提升的潜力。

Title: Advancing the Robustness of Large Language Models through Self-Denoised Smoothing

Authors: Jiabao Ji, Bairu Hou, Zhen Zhang, Guanhua Zhang, Wenqi Fan, Qing Li, Yang Zhang, Gaowen Liu, Sijia Liu, Shiyu Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12274
Pdf URL: https://arxiv.org/pdf/2404.12274
Copy Paste: [[2404.12274]] Advancing the Robustness of Large Language Models through Self-Denoised Smoothing(https://arxiv.org/abs/2404.12274)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have achieved significant success, their vulnerability to adversarial perturbations, including recent jailbreak attacks, has raised considerable concerns. However, the increasing size of these models and their limited access make improving their robustness a challenging task. Among various defense strategies, randomized smoothing has shown great potential for LLMs, as it does not require full access to the model's parameters or fine-tuning via adversarial training. However, randomized smoothing involves adding noise to the input before model prediction, and the final model's robustness largely depends on the model's performance on these noise-corrupted data. Its effectiveness is often limited by the model's sub-optimal performance on noisy data. To address this issue, we propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions. We call this procedure self-denoised smoothing. Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility. Our experimental results indicate that our method surpasses existing methods in both empirical and certified robustness in defending against adversarial attacks for both downstream tasks and human alignments (i.e., jailbreak attacks). Our code is publicly available at https://github.com/UCSB-NLP-Chang/SelfDenoise
摘要：尽管大型语言模型（LLM）取得了巨大的成功，但它们容易受到对抗性扰动（包括最近的越狱攻击）的影响，引起了相当大的担忧。然而，这些模型的规模不断增大且访问权限有限，使得提高其稳健性成为一项具有挑战性的任务。在各种防御策略中，随机平滑对于法学硕士显示出巨大的潜力，因为它不需要完全访问模型的参数或通过对抗性训练进行微调。然而，随机平滑涉及在模型预测之前向输入添加噪声，最终模型的鲁棒性很大程度上取决于模型在这些噪声损坏的数据上的性能。其有效性通常受到模型在噪声数据上的次优性能的限制。为了解决这个问题，我们建议利用法学硕士的多任务性质，首先对噪声输入进行去噪，然后根据这些去噪版本进行预测。我们将此过程称为自降噪平滑。与以前的计算机视觉中的去噪平滑技术不同，该技术需要训练单独的模型来增强法学硕士的鲁棒性，我们的方法提供了显着更好的效率和灵活性。我们的实验结果表明，我们的方法在防御下游任务和人类对齐（即越狱攻击）的对抗性攻击方面，在经验和经过认证的鲁棒性方面均优于现有方法。我们的代码可在 https://github.com/UCSB-NLP-Chang/SelfDenoise 上公开获取

Title: Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting

Authors: Nicholas Harris, Anand Butani, Syed Hashmy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12283
Pdf URL: https://arxiv.org/pdf/2404.12283
Copy Paste: [[2404.12283]] Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting(https://arxiv.org/abs/2404.12283)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Embedding models are crucial for various natural language processing tasks but can be limited by factors such as limited vocabulary, lack of context, and grammatical errors. This paper proposes a novel approach to improve embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding process. By utilizing ChatGPT 3.5 to provide additional context, correct inaccuracies, and incorporate metadata, the proposed method aims to enhance the utility and accuracy of embedding models. The effectiveness of this approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and Amazon Counter-factual Classification. Results demonstrate significant improvements over the baseline model on the TwitterSemEval 2015 dataset, with the best-performing prompt achieving a score of 85.34 compared to the previous best of 81.52 on the Massive Text Embedding Benchmark (MTEB) Leaderboard. However, performance on the other two datasets was less impressive, highlighting the importance of considering domain-specific characteristics. The findings suggest that LLM-based text enrichment shows promise for improving embedding performance, particularly in certain domains. Hence, numerous limitations in the process of embedding can be avoided.
摘要：嵌入模型对于各种自然语言处理任务至关重要，但可能受到词汇量有限、上下文缺乏和语法错误等因素的限制。本文提出了一种通过利用大型语言模型（LLM）在嵌入过程之前丰富和重写输入文本来提高嵌入性能的新方法。通过利用 ChatGPT 3.5 提供额外的上下文、纠正不准确性并合并元数据，所提出的方法旨在提高嵌入模型的实用性和准确性。该方法的有效性在三个数据集上进行评估：Banking77Classification、TwitterSemEval 2015 和 Amazon Counter-factual Classification。结果表明，TwitterSemEval 2015 数据集上的基线模型有了显着改进，表现最好的提示在大规模文本嵌入基准 (MTEB) 排行榜上获得了 85.34 分，而之前的最高分是 81.52 分。然而，其他两个数据集的性能不太令人印象深刻，这凸显了考虑特定领域特征的重要性。研究结果表明，基于法学硕士的文本丰富在提高嵌入性能方面显示出了有希望的结果，特别是在某些领域。因此，可以避免嵌入过程中的许多限制。

Title: Augmenting emotion features in irony detection with Large language modeling

Authors: Yucheng Lin, Yuhan Xia, Yunfei Long
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12291
Pdf URL: https://arxiv.org/pdf/2404.12291
Copy Paste: [[2404.12291]] Augmenting emotion features in irony detection with Large language modeling(https://arxiv.org/abs/2404.12291)
Keywords: language model, gpt, llm, prompt
Abstract: This study introduces a novel method for irony detection, applying Large Language Models (LLMs) with prompt-based learning to facilitate emotion-centric text augmentation. Traditional irony detection techniques typically fall short due to their reliance on static linguistic features and predefined knowledge bases, often overlooking the nuanced emotional dimensions integral to irony. In contrast, our methodology augments the detection process by integrating subtle emotional cues, augmented through LLMs, into three benchmark pre-trained NLP models - BERT, T5, and GPT-2 - which are widely recognized as foundational in irony detection. We assessed our method using the SemEval-2018 Task 3 dataset and observed substantial enhancements in irony detection capabilities.
摘要：这项研究引入了一种反讽检测的新方法，应用大型语言模型（LLM）和基于提示的学习来促进以情感为中心的文本增强。传统的反讽检测技术通常由于依赖静态语言特征和预定义的知识库而存在不足，往往忽视了反讽不可或缺的微妙情感维度。相比之下，我们的方法通过将微妙的情感线索（通过 LLM 增强）集成到三个基准预训练 NLP 模型（BERT、T5 和 GPT-2）中来增强检测过程，这些模型被广泛认为是反讽检测的基础。我们使用 SemEval-2018 Task 3 数据集评估了我们的方法，并观察到讽刺检测能力的显着增强。

Title: Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair

Authors: Yusuke Sakai, Mana Makinae, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2404.12299
Pdf URL: https://arxiv.org/pdf/2404.12299
Copy Paste: [[2404.12299]] Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair(https://arxiv.org/abs/2404.12299)
Keywords: language model, llm
Abstract: In Simultaneous Machine Translation (SiMT) systems, training with a simultaneous interpretation (SI) corpus is an effective method for achieving high-quality yet low-latency systems. However, it is very challenging to curate such a corpus due to limitations in the abilities of annotators, and hence, existing SI corpora are limited. Therefore, we propose a method to convert existing speech translation corpora into interpretation-style data, maintaining the original word order and preserving the entire source content using Large Language Models (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models in text-to-text and speech-to-text settings with the LLM-SI-Corpus reduces latencies while maintaining the same level of quality as the models trained with offline datasets. The LLM-SI-Corpus is available at https://github.com/yusuke1997/LLM-SI-Corpus.
摘要：在同声机器翻译（SiMT）系统中，使用同声传译（SI）语料库进行训练是实现高质量低延迟系统的有效方法。然而，由于注释者能力的限制，管理这样的语料库非常具有挑战性，因此现有的 SI 语料库是有限的。因此，我们提出了一种方法，将现有的语音翻译语料库转换为解释式数据，保持原始词序并使用大型语言模型（LLM-SI-Corpus）保留整个源内容。我们证明，使用 LLM-SI-Corpus 在文本到文本和语音到文本设置中微调 SiMT 模型可以减少延迟，同时保持与使用离线数据集训练的模型相同的质量水平。 LLM-SI-Corpus 可在 https://github.com/yusuke1997/LLM-SI-Corpus 获取。

Title: Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment

Authors: Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, Ahmad Beirami
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12318
Pdf URL: https://arxiv.org/pdf/2404.12318
Copy Paste: [[2404.12318]] Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment(https://arxiv.org/abs/2404.12318)
Keywords: language model
Abstract: Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is trained on preference data in one source language and directly applied to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: cross-lingually aligned models are preferred by humans over unaligned models on up to >70% of evaluation instances. We moreover find that a different-language reward model sometimes yields better aligned models than a same-language reward model. We also identify best practices when there is no language-specific data for even supervised finetuning, another component in alignment.
摘要：基于人工注释的偏好数据调整语言模型 (LM) 是获得实用且高性能的基于 LM 的系统的关键一步。然而，多语言人类偏好数据很难大规模获得，这使得将该框架扩展到多种语言具有挑战性。在这项工作中，我们评估了一种零样本跨语言对齐的简单方法，其中奖励模型根据一种源语言的偏好数据进行训练，然后直接应用于其他目标语言。在摘要和开放式对话生成方面，我们表明该方法在综合评估设置（包括人类评估）下始终如一地成功：在高达 > 70% 的评估实例中，人类更喜欢跨语言对齐的模型，而不是未对齐的模型。此外，我们发现不同语言奖励模型有时会比同语言奖励模型产生更好的一致性模型。当没有特定于语言的数据甚至监督微调（另一个对齐的组件）时，我们还确定了最佳实践。

Title: Large Language Models in Targeted Sentiment Analysis

Authors: Nicolay Rusnachenko, Anton Golubev, Natalia Loukachevitch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12342
Pdf URL: https://arxiv.org/pdf/2404.12342
Copy Paste: [[2404.12342]] Large Language Models in Targeted Sentiment Analysis(https://arxiv.org/abs/2404.12342)
Keywords: language model, llm, chain-of-thought
Abstract: In this paper we investigate the use of decoder-based generative transformers for extracting sentiment towards the named entities in Russian news articles. We study sentiment analysis capabilities of instruction-tuned large language models (LLMs). We consider the dataset of RuSentNE-2023 in our study. The first group of experiments was aimed at the evaluation of zero-shot capabilities of LLMs with closed and open transparencies. The second covers the fine-tuning of Flan-T5 using the "chain-of-thought" (CoT) three-hop reasoning framework (THoR). We found that the results of the zero-shot approaches are similar to the results achieved by baseline fine-tuned encoder-based transformers (BERT-base). Reasoning capabilities of the fine-tuned Flan-T5 models with THoR achieve at least a 5% increment with the base-size model compared to the results of the zero-shot experiment. The best results of sentiment analysis on RuSentNE-2023 were achieved by fine-tuned Flan-T5-xl, which surpassed the results of previous state-of-the-art transformer-based classifiers. Our CoT application framework is publicly available: https://github.com/nicolay-r/Reasoning-for-Sentiment-Analysis-Framework
摘要：在本文中，我们研究了使用基于解码器的生成变压器来提取对俄罗斯新闻文章中命名实体的情感。我们研究指令调整的大语言模型（LLM）的情感分析能力。我们在研究中考虑了 RuSentNE-2023 的数据集。第一组实验旨在评估具有封闭和开放透明度的法学硕士的零样本能力。第二个内容涉及使用“思想链”（CoT）三跳推理框架（THoR）对 Flan-T5 进行微调。我们发现零样本方法的结果与基于基线微调编码器的变压器（BERT-base）所取得的结果相似。与零样本实验的结果相比，经过 THoR 微调的 Flan-T5 模型的推理能力在基本尺寸模型的基础上实现了至少 5% 的增量。 RuSentNE-2023 上情感分析的最佳结果是通过微调的 Flan-T5-xl 实现的，超越了之前最先进的基于 Transformer 的分类器的结果。我们的 CoT 应用程序框架已公开：https://github.com/nicolay-r/Reasoning-for-Sentiment-Analysis-Framework

Title: When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes

Authors: Asaf Yehudai, Elron Bendel
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12365
Pdf URL: https://arxiv.org/pdf/2404.12365
Copy Paste: [[2404.12365]] When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes(https://arxiv.org/abs/2404.12365)
Keywords: language model, llm, prompt
Abstract: We present FastFit, a method and a Python package designed to provide fast and accurate few-shot classification, especially for scenarios with many semantically similar classes. FastFit utilizes a novel approach integrating batch contrastive learning and token-level similarity score. Compared to existing few-shot learning packages, such as SetFit, Transformers, or few-shot prompting of large language models via API calls, FastFit significantly improves multiclass classification performance in speed and accuracy across FewMany, our newly curated English benchmark, and Multilingual datasets. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds. The FastFit package is now available on GitHub and PyPI, presenting a user-friendly solution for NLP practitioners.
摘要：我们提出了 FastFit 方法和 Python 包设计，以提供快速准确的小样本分类，特别是对于具有许多语义相似类的场景。 FastFit 采用了一种集成批量对比学习和标记级相似度评分的新颖方法。与现有的少样本学习包（例如 SetFit、Transformers 或通过 API 调用对大型语言模型进行少样本提示）相比，FastFit 显着提高了 FewMany（我们新策划的英语基准测试）和多语言数据集的多类分类性能。 FastFit 的训练速度提高了 3-20 倍，只需几秒钟即可完成训练。 FastFit 包现已在 GitHub 和 PyPi 上提供，为 NLP 从业者提供了一个用户友好的解决方案。

Title: Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

Authors: Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, Zhihui Xie
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2404.12387
Pdf URL: https://arxiv.org/pdf/2404.12387
Copy Paste: [[2404.12387]] Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models(https://arxiv.org/abs/2404.12387)
Keywords: language model, gpt, chat
Abstract: We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized values for their respective compute class. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g. MMMU, VQAv2), Core performs competitively to GPT4-V. Meanwhile, on multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively to other frontier models on a set of well-established benchmarks (e.g. MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped in production at this http URL. A showcase of non-cherry-picked qualitative examples can also be found at this http URL.
摘要：我们介绍 Reka Core、Flash 和 Edge，这是 Reka 从头开始训练的一系列强大的多模态语言模型。 Reka 模型能够对文本、图像、视频和音频输入进行处理和推理。该技术报告讨论了其中一些模型的训练细节，并提供了综合评估结果。我们证明 Reka Edge 和 Reka Flash 不仅是最先进的，而且还优于许多更大的模型，为各自的计算类别提供了巨大的价值。与此同时，我们最强大、最大的模型 Reka Core 在自动评估和盲人评估方面都接近最佳前沿模型。在图像问答基准（例如 MMMU、VQAv2）上，Core 的表现与 GPT4-V 相当。同时，在多模态聊天中，Core 在第三方盲人评估设置下排名第二，优于 Claude 3 Opus 等其他模型。在文本基准测试中，Core 不仅在一组完善的基准测试（例如 MMLU、GSM8K）上与其他前沿模型相比具有竞争力，而且在人类评估方面也优于 GPT4-0613。在视频问答（感知测试）方面，Core 优于 Gemini Ultra。模型在生产中通过此 http URL 发货。在此 http URL 中还可以找到非精选定性示例的展示。