2024-02-27

Title: Beware of Words: Evaluating the Lexical Richness of Conversational Large Language Models

Authors: Gonzalo Martínez, José Alberto Hernández, Javier Conde, Pedro Reviriego, Elena Merino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15518
Pdf URL: https://arxiv.org/pdf/2402.15518
Copy Paste: [[2402.15518]] Beware of Words: Evaluating the Lexical Richness of Conversational Large Language Models(https://arxiv.org/abs/2402.15518)
Keywords: language model, gpt, llm, chat
Abstract: The performance of conversational Large Language Models (LLMs) in general, and of ChatGPT in particular, is currently being evaluated on many different tasks, from logical reasoning or maths to answering questions on a myriad of topics. Instead, much less attention is being devoted to the study of the linguistic features of the texts generated by these LLMs. This is surprising since LLMs are models for language, and understanding how they use the language is important. Indeed, conversational LLMs are poised to have a significant impact on the evolution of languages as they may eventually dominate the creation of new text. This means that for example, if conversational LLMs do not use a word it may become less and less frequent and eventually stop being used altogether. Therefore, evaluating the linguistic features of the text they produce and how those depend on the model parameters is the first step toward understanding the potential impact of conversational LLMs on the evolution of languages. In this paper, we consider the evaluation of the lexical richness of the text generated by LLMs and how it depends on the model parameters. A methodology is presented and used to conduct a comprehensive evaluation of lexical richness using ChatGPT as a case study. The results show how lexical richness depends on the version of ChatGPT and some of its parameters, such as the presence penalty, or on the role assigned to the model. The dataset and tools used in our analysis are released under open licenses with the goal of drawing the much-needed attention to the evaluation of the linguistic features of LLM-generated text.
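
As a rough illustration of what a lexical richness measure can look like, here is a minimal sketch of the type-token ratio (TTR). The abstract does not name the metrics the authors use, so the choice of TTR, the function name, and the simple regex tokenizer are all assumptions made for this example.

```python
import re


def type_token_ratio(text: str) -> float:
    """Distinct word forms divided by total words (0.0 for empty text)."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# "the" repeats, so 4 distinct forms out of 5 tokens.
print(type_token_ratio("the cat saw the dog"))
```

TTR tends to fall as texts get longer, so comparisons between models are only meaningful on samples of similar length.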

Title: Detecting misinformation through Framing Theory: the Frame Element-based Model

Authors: Guan Wang, Rebecca Frederick, Jinglong Duan, William Wong, Verica Rupar, Weihua Li, Quan Bai
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2402.15525
Pdf URL: https://arxiv.org/pdf/2402.15525
Copy Paste: [[2402.15525]] Detecting misinformation through Framing Theory: the Frame Element-based Model(https://arxiv.org/abs/2402.15525)
Keywords: language model
Abstract: In this paper, we delve into the rapidly evolving challenge of misinformation detection, with a specific focus on the nuanced manipulation of narrative frames - an under-explored area within the AI community. The potential for Generative AI models to generate misleading narratives underscores the urgency of this problem. Drawing from communication and framing theories, we posit that the presentation or 'framing' of accurate information can dramatically alter its interpretation, potentially leading to misinformation. We highlight this issue through real-world examples, demonstrating how shifts in narrative frames can transmute fact-based information into misinformation. To tackle this challenge, we propose an innovative approach leveraging the power of pre-trained Large Language Models and deep neural networks to detect misinformation originating from accurate facts portrayed under different frames. These advanced AI techniques offer unprecedented capabilities in identifying complex patterns within unstructured data critical for examining the subtleties of narrative frames. The objective of this paper is to bridge a significant research gap in the AI domain, providing valuable insights and methodologies for tackling framing-induced misinformation, thus contributing to the advancement of responsible and trustworthy AI technologies. Several experiments are intensively conducted and experimental results explicitly demonstrate the various impact of elements of framing theory proving the rationale of applying framing theory to increase the performance in misinformation detection.

Title: Chain-of-Specificity: An Iteratively Refining Method for Eliciting Knowledge from Large Language Models

Authors: Kaiwen Wei, Jingyuan Zhang, Hongzhi Zhang, Fuzheng Zhang, Di Zhang, Li Jin, Yue Yu
Subjects: cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15526
Pdf URL: https://arxiv.org/pdf/2402.15526
Copy Paste: [[2402.15526]] Chain-of-Specificity: An Iteratively Refining Method for Eliciting Knowledge from Large Language Models(https://arxiv.org/abs/2402.15526)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit remarkable generative capabilities, enabling the generation of valuable information. Despite these advancements, previous research found that LLMs sometimes struggle with adhering to specific constraints (e.g., in a specific place or at a specific time), at times even overlooking them, which leads to responses that are either too generic or not fully satisfactory. Existing approaches attempted to address this issue by decomposing or rewriting input instructions, yet they fall short in adequately emphasizing specific constraints and in unlocking the underlying knowledge (e.g., programming within the context of software development). In response, this paper proposes a simple yet effective method named Chain-of-Specificity (CoS). Specifically, CoS iteratively emphasizes the specific constraints in the input instructions, unlocks knowledge within LLMs, and refines responses. Experiments conducted on publicly available and self-built complex datasets demonstrate that CoS outperforms existing methods in enhancing generated content, especially in specificity. Besides, as the number of specific constraints increases, other baselines falter, while CoS still performs well. Moreover, we show that distilling responses generated by CoS effectively enhances the ability of smaller models to follow the constrained instructions. Resources of this paper will be released for further research.

Title: PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain

Authors: Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, Baobao Chang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2402.15527
Pdf URL: https://arxiv.org/pdf/2402.15527
Copy Paste: [[2402.15527]] PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain(https://arxiv.org/abs/2402.15527)
Keywords: language model, gpt, llm, agent
Abstract: We present PCA-Bench, a multimodal decision-making benchmark for evaluating the integrated capabilities of Multimodal Large Language Models (MLLMs). Departing from previous benchmarks focusing on simplistic tasks and individual model capability, PCA-Bench introduces three complex scenarios: autonomous driving, domestic robotics, and open-world games. Given task instructions and diverse contexts, the model is required to seamlessly integrate multiple capabilities of Perception, Cognition, and Action in a reasoning chain to make accurate decisions. Moreover, PCA-Bench features error localization capabilities, scrutinizing model inaccuracies in areas such as perception, knowledge, or reasoning. This enhances the reliability of deploying MLLMs. To balance accuracy and efficiency in evaluation, we propose PCA-Eval, an automatic evaluation protocol, and assess 10 prevalent MLLMs. The results reveal significant performance disparities between open-source models and powerful proprietary models like GPT-4 Vision. To address this, we introduce Embodied-Instruction-Evolution (EIE), an automatic framework for synthesizing instruction tuning examples in multimodal embodied environments. EIE generates 7,510 training examples in PCA-Bench and enhances the performance of open-source MLLMs, occasionally surpassing GPT-4 Vision (+3% in decision accuracy), thereby validating the effectiveness of EIE. Our findings suggest that robust MLLMs like GPT-4 Vision show promise for decision-making in embodied agents, opening new avenues for MLLM research.

Title: Evaluating the Performance of ChatGPT for Spam Email Detection

Authors: Yuwei Wu, Shijing Si, Yugui Zhang, Jiawen Gu, Jedrek Wosik
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15537
Pdf URL: https://arxiv.org/pdf/2402.15537
Copy Paste: [[2402.15537]] Evaluating the Performance of ChatGPT for Spam Email Detection(https://arxiv.org/abs/2402.15537)
Keywords: language model, gpt, prompt, chat
Abstract: Email continues to be a pivotal and extensively utilized communication medium within professional and commercial domains. Nonetheless, the prevalence of spam emails poses a significant challenge for users, disrupting their daily routines and diminishing productivity. Consequently, accurately identifying and filtering spam based on content has become crucial for cybersecurity. Recent advancements in natural language processing, particularly with large language models like ChatGPT, have shown remarkable performance in tasks such as question answering and text generation. However, its potential in spam identification remains underexplored. To fill in the gap, this study attempts to evaluate ChatGPT's capabilities for spam identification in both English and Chinese email datasets. We employ ChatGPT for spam email detection using in-context learning, which requires a prompt instruction and a few demonstrations. We also investigate how the training example size affects the performance of ChatGPT. For comparison, we also implement five popular benchmark methods, including naive Bayes, support vector machines (SVM), logistic regression (LR), feedforward dense neural networks (DNN), and BERT classifiers. Through extensive experiments, we find that ChatGPT performs significantly worse than deep supervised learning methods on the large English dataset, while it presents superior performance on the low-resource Chinese dataset, even outperforming BERT in this case.
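
To make "a prompt instruction and a few demonstrations" concrete, here is a minimal sketch of how such an in-context prompt could be assembled. The instruction wording, the `spam`/`ham` labels, and the function name are hypothetical; the abstract does not give the authors' actual prompt.

```python
def build_spam_prompt(demos, email):
    """Assemble a few-shot prompt.

    demos: list of (email_text, label) pairs, label being 'spam' or 'ham'.
    email: the message to classify; its label is left blank for the model.
    """
    lines = ["Classify each email as 'spam' or 'ham'.", ""]
    for text, label in demos:
        lines += [f"Email: {text}", f"Label: {label}", ""]
    lines += [f"Email: {email}", "Label:"]
    return "\n".join(lines)


print(build_spam_prompt([("Win cash now!!!", "spam"),
                         ("Agenda for Monday attached", "ham")],
                        "Claim your free prize today"))
```

Varying the length of `demos` is one way to study the effect of the number of demonstrations, which the abstract says the authors investigate.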

Title: Foundation Policies with Hilbert Representations

Authors: Seohong Park, Tobias Kreiman, Sergey Levine
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2402.15567
Pdf URL: https://arxiv.org/pdf/2402.15567
Copy Paste: [[2402.15567]] Foundation Policies with Hilbert Representations(https://arxiv.org/abs/2402.15567)
Keywords: prompt
Abstract: Unsupervised and self-supervised objectives, such as next token prediction, have enabled pre-training generalist models from large amounts of unlabeled data. In reinforcement learning (RL), however, finding a truly general and scalable unsupervised pre-training objective for generalist policies from offline data remains a major open question. While a number of methods have been proposed to enable generic self-supervised RL, based on principles such as goal-conditioned RL, behavioral cloning, and unsupervised skill learning, such methods remain limited in terms of either the diversity of the discovered behaviors, the need for high-quality demonstration data, or the lack of a clear prompting or adaptation mechanism for downstream tasks. In this work, we propose a novel unsupervised framework to pre-train generalist policies that capture diverse, optimal, long-horizon behaviors from unlabeled offline data such that they can be quickly adapted to any arbitrary new tasks in a zero-shot manner. Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment, and then to span this learned latent space with directional movements, which enables various zero-shot policy "prompting" schemes for downstream tasks. Through our experiments on simulated robotic locomotion and manipulation benchmarks, we show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion, even often outperforming prior methods designed specifically for each setting. Our code and videos are available at https://seohong.me/projects/hilp/

Title: Prompting LLMs to Compose Meta-Review Drafts from Peer-Review Narratives of Scholarly Manuscripts

Authors: Shubhra Kanti Karmaker Santu, Sanjeev Kumar Sinha, Naman Bansal, Alex Knipper, Souvika Sarkar, John Salvador, Yash Mahajan, Sri Guttikonda, Mousumi Akter, Matthew Freestone, Matthew C. Williams Jr
Subjects: cs.CL, cs.AI, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2402.15589
Pdf URL: https://arxiv.org/pdf/2402.15589
Copy Paste: [[2402.15589]] Prompting LLMs to Compose Meta-Review Drafts from Peer-Review Narratives of Scholarly Manuscripts(https://arxiv.org/abs/2402.15589)
Keywords: language model, gpt, llm, prompt
Abstract: One of the most important yet onerous tasks in the academic peer-reviewing process is composing meta-reviews, which involves understanding the core contributions, strengths, and weaknesses of a scholarly manuscript based on peer-review narratives from multiple experts and then summarizing those multiple experts' perspectives into a concise holistic overview. Given the latest major developments in generative AI, especially Large Language Models (LLMs), it is very compelling to rigorously study the utility of LLMs in generating such meta-reviews in an academic peer-review setting. In this paper, we perform a case study with three popular LLMs, i.e., GPT-3.5, LLaMA2, and PaLM2, to automatically generate meta-reviews by prompting them with different types/levels of prompts based on the recently proposed TELeR taxonomy. Finally, we perform a detailed qualitative study of the meta-reviews generated by the LLMs and summarize our findings and recommendations for prompting LLMs for this complex task.

Title: Training Nonlinear Transformers for Efficient In-Context Learning: A Theoretical Learning and Generalization Analysis

Authors: Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, Pin-Yu Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.15607
Pdf URL: https://arxiv.org/pdf/2402.15607
Copy Paste: [[2402.15607]] Training Nonlinear Transformers for Efficient In-Context Learning: A Theoretical Learning and Generalization Analysis(https://arxiv.org/abs/2402.15607)
Keywords: language model
Abstract: Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply augmenting the query with some input-output examples from that task. Despite the empirical success, the mechanics of how to train a Transformer to achieve ICL and the corresponding ICL capacity are mostly elusive due to the technical challenges of analyzing the nonconvex training problems resulting from the nonlinear self-attention and nonlinear activation in Transformers. To the best of our knowledge, this paper provides the first theoretical analysis of the training dynamics of Transformers with nonlinear self-attention and nonlinear MLP, together with the ICL generalization capability of the resulting model. Focusing on a group of binary classification tasks, we train Transformers using data from a subset of these tasks and quantify the impact of various factors on the ICL generalization performance on the remaining unseen tasks with and without data distribution shifts. We also analyze how different components in the learned Transformers contribute to the ICL performance. Furthermore, we provide the first theoretical analysis of how model pruning affects the ICL performance and prove that proper magnitude-based pruning can have a minimal impact on ICL while reducing inference costs. These theoretical findings are justified through numerical experiments.
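
For readers unfamiliar with magnitude-based pruning, the following is a generic sketch on a flat list of weights. It only illustrates the general idea of removing the smallest-magnitude parameters; the paper's own pruning rule and the granularity it analyzes are not specified in the abstract, and the function name is made up for this example.

```python
def magnitude_prune(weights, fraction):
    """Zero out the `fraction` of weights with the smallest absolute value."""
    k = int(len(weights) * fraction)  # number of weights to remove
    # Indices sorted by magnitude, smallest first; the first k get pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


# The two smallest magnitudes (0.05 and -0.1) are zeroed.
print(magnitude_prune([0.5, -0.1, 2.0, 0.05], 0.5))
```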

Title: Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning

Authors: Tejas Srinivasan, Jack Hessel, Tanmay Gupta, Bill Yuchen Lin, Yejin Choi, Jesse Thomason, Khyathi Raghavi Chandu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15610
Pdf URL: https://arxiv.org/pdf/2402.15610
Copy Paste: [[2402.15610]] Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning(https://arxiv.org/abs/2402.15610)
Keywords: language model, llm
Abstract: Prior work on selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm to reduce the over-abstention of a selective vision-language system without decreasing prediction accuracy. When the VLM makes a low-confidence prediction, instead of abstaining, ReCoVERR tries to find relevant clues in the image that provide additional evidence for the prediction. ReCoVERR uses an LLM to pose related questions to the VLM, collects high-confidence evidence, and if enough evidence confirms the prediction, the system makes a prediction instead of abstaining. ReCoVERR enables two VLMs, BLIP2 and InstructBLIP, to answer up to 20% more questions on the A-OKVQA task than vanilla selective prediction without decreasing system accuracy, thus improving overall system reliability.
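
The decision rule described above can be sketched roughly as follows. This is a simplification, not ReCoVERR itself: it assumes the evidence has already been gathered and scored, and the thresholds, the `min_evidence` count, and the function name are hypothetical values chosen for illustration.

```python
def decide(prediction, confidence, evidence_confidences,
           answer_threshold=0.8, evidence_threshold=0.9, min_evidence=2):
    """Return the prediction, or None to abstain."""
    if confidence >= answer_threshold:
        # Confident enough: plain selective prediction already answers.
        return prediction
    # Low confidence: count supporting evidence that is itself reliable.
    strong = [c for c in evidence_confidences if c >= evidence_threshold]
    return prediction if len(strong) >= min_evidence else None


print(decide("umbrella", 0.55, [0.95, 0.93]))  # answers despite low confidence
print(decide("umbrella", 0.55, [0.95, 0.40]))  # abstains (None)
```

In the system described by the abstract, an LLM poses the related questions and a VLM supplies the answers that serve as evidence; here that step is replaced by a precomputed list of confidences.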

Title: Towards Efficient Active Learning in NLP via Pretrained Representations

Authors: Artem Vysogorets, Achintya Gopal
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.15613
Pdf URL: https://arxiv.org/pdf/2402.15613
Copy Paste: [[2402.15613]] Towards Efficient Active Learning in NLP via Pretrained Representations(https://arxiv.org/abs/2402.15613)
Keywords: language model, llm
Abstract: Fine-tuning Large Language Models (LLMs) is now a common approach for text classification in a wide range of applications. When labeled documents are scarce, active learning helps save annotation efforts but requires retraining of massive models on each acquisition iteration. We drastically expedite this process by using pretrained representations of LLMs within the active learning loop and, once the desired amount of labeled data is acquired, fine-tuning that or even a different pretrained LLM on this labeled data to achieve the best performance. As verified on common text classification benchmarks with pretrained BERT and RoBERTa as the backbone, our strategy yields similar performance to fine-tuning all the way through the active learning loop but is orders of magnitude less computationally expensive. The data acquired with our procedure generalizes across pretrained networks, allowing flexibility in choosing the final model or updating it as newer versions get released.

Title: Language-Based User Profiles for Recommendation

Authors: Joyce Zhou, Yijia Dai, Thorsten Joachims
Subjects: cs.CL, cs.HC, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15623
Pdf URL: https://arxiv.org/pdf/2402.15623
Copy Paste: [[2402.15623]] Language-Based User Profiles for Recommendation(https://arxiv.org/abs/2402.15623)
Keywords: language model, llm
Abstract: Most conventional recommendation methods (e.g., matrix factorization) represent user profiles as high-dimensional vectors. Unfortunately, these vectors lack interpretability and steerability, and often perform poorly in cold-start settings. To address these shortcomings, we explore the use of user profiles that are represented as human-readable text. We propose the Language-based Factorization Model (LFM), which is essentially an encoder/decoder model where both the encoder and the decoder are large language models (LLMs). The encoder LLM generates a compact natural-language profile of the user's interests from the user's rating history. The decoder LLM uses this summary profile to complete predictive downstream tasks. We evaluate our LFM approach on the MovieLens dataset, comparing it against matrix factorization and an LLM model that directly predicts from the user's rating history. In cold-start settings, we find that our method can have higher accuracy than matrix factorization. Furthermore, we find that generating a compact and human-readable summary often performs comparably with or better than direct LLM prediction, while enjoying better interpretability and shorter model input length. Our results motivate a number of future research directions and potential improvements.

Title: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Authors: Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2402.15627
Pdf URL: https://arxiv.org/pdf/2402.15627
Copy Paste: [[2402.15627]] MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs(https://arxiv.org/abs/2402.15627)
Keywords: language model, llm
Abstract: We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
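
The reported numbers allow a small back-of-the-envelope check. Assuming "1.34x" is the ratio of the two MFU values (the abstract does not spell this out), the implied Megatron-LM baseline can be derived as below. The helper `mfu` states the usual definition, useful model FLOPs per second divided by the hardware's peak FLOPs per second; the variable names are ours.

```python
def mfu(achieved_flops_per_s, peak_flops_per_s):
    """Model FLOPs Utilization: useful model FLOPs/s over hardware peak FLOPs/s."""
    return achieved_flops_per_s / peak_flops_per_s


megascale_mfu = 0.552  # reported: 55.2% on 12,288 GPUs
improvement = 1.34     # reported ratio versus Megatron-LM (assumed MFU ratio)

# Implied baseline under the ratio assumption: about 41.2%.
baseline_mfu = megascale_mfu / improvement
print(f"implied Megatron-LM MFU: {baseline_mfu:.1%}")
```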

Title: Fine-Grained Self-Endorsement Improves Factuality and Reasoning

Authors: Ante Wang, Linfeng Song, Baolin Peng, Ye Tian, Lifeng Jin, Haitao Mi, Jinsong Su, Dong Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15631
Pdf URL: https://arxiv.org/pdf/2402.15631
Copy Paste: [[2402.15631]] Fine-Grained Self-Endorsement Improves Factuality and Reasoning(https://arxiv.org/abs/2402.15631)
Keywords: language model, llm, hallucination, prompt
Abstract: This work studies improving large language model (LLM) generations at inference time by mitigating fact-conflicting hallucinations. Particularly, we propose a self-endorsement framework that leverages the fine-grained fact-level comparisons across multiple sampled responses. Compared with prior ensemble methods (Wang et al., 2022; Chen et al., 2023) that perform response-level selection, our approach can better alleviate hallucinations, especially for long-form generation tasks. Our approach can broadly benefit smaller and open-source LLMs as it mainly conducts simple content-based comparisons. Experiments on Biographies show that our method can effectively improve the factuality of generations with simple and intuitive prompts across different scales of LLMs. Besides, comprehensive analyses on TriviaQA and GSM8K demonstrate the potential of self-endorsement for broader application.
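
A toy sketch of fact-level comparison across sampled responses is given below. It assumes each response has already been split into normalized fact strings and uses exact string matching as a stand-in for the LLM-based endorsement the paper describes; `min_support` and the function name are hypothetical.

```python
def endorsed_facts(samples, min_support=2):
    """Keep facts that at least `min_support` other responses also state.

    samples: list of fact lists, one list per sampled response.
    """
    kept = []
    for i, facts in enumerate(samples):
        for fact in facts:
            # Count how many *other* responses contain this fact.
            support = sum(fact in other
                          for j, other in enumerate(samples) if j != i)
            if support >= min_support and fact not in kept:
                kept.append(fact)
    return kept


samples = [["born 1950", "won prize"],
           ["born 1950", "lives in Paris"],
           ["born 1950", "won prize"]]
print(endorsed_facts(samples))  # only the fact backed by two other samples
```

Selecting at the level of individual facts, rather than picking one whole response, lets well-supported statements from different samples be combined.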

Title: Addressing Order Sensitivity of In-Context Demonstration Examples in Causal Language Models

Authors: Yanzheng Xiang, Hanqi Yan, Lin Gui, Yulan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15637
Pdf URL: https://arxiv.org/pdf/2402.15637
Copy Paste: [[2402.15637]] Addressing Order Sensitivity of In-Context Demonstration Examples in Causal Language Models(https://arxiv.org/abs/2402.15637)
Keywords: language model, llm
Abstract: In-context learning has become a popular paradigm in natural language processing. However, its performance can be significantly influenced by the order of in-context demonstration examples. In this paper, we found that causal language models (CausalLMs) are more sensitive to this order compared to prefix language models (PrefixLMs). We attribute this phenomenon to the auto-regressive attention masks within CausalLMs, which restrict each token from accessing information from subsequent tokens. This results in different receptive fields for samples at different positions, thereby leading to representation disparities across positions. To tackle this challenge, we introduce an unsupervised fine-tuning method, termed the Information-Augmented and Consistency-Enhanced approach. This approach utilizes contrastive learning to align representations of in-context examples across different positions and introduces a consistency loss to ensure similar representations for inputs with different permutations. This enhances the model's predictive consistency across permutations. Experimental results on four benchmarks suggest that our proposed method can reduce the sensitivity to the order of in-context examples and exhibit robust generalizability, particularly when demonstrations are sourced from a pool different from that used in the training phase, or when the number of in-context examples differs from what is used during training.
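
The difference in receptive fields comes from the attention masks, which can be written out for a short sequence. This sketch builds boolean masks as nested lists; the function names and the `prefix_len` parameter are ours, not taken from the paper.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Prefix tokens see the whole prefix; later tokens stay causal."""
    return [[j <= i or (i < prefix_len and j < prefix_len)
             for j in range(n)]
            for i in range(n)]


for name, mask in [("causal", causal_mask(4)), ("prefix", prefix_mask(4, 3))]:
    print(name)
    for row in mask:
        print(" ".join("x" if allowed else "." for allowed in row))
```

Under the causal mask, a demonstration placed first never sees the ones after it, so its representation depends on where it sits; under the prefix mask, all tokens in the prefix see each other regardless of order.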

Title: Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics

Authors: Sadaf Ghaffari, Nikhil Krishnaswamy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15654
Pdf URL: https://arxiv.org/pdf/2402.15654
Copy Paste: [[2402.15654]] Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics(https://arxiv.org/abs/2402.15654)
Keywords: language model, llm
Abstract: In this paper, we present an exploration of LLMs' abilities to problem solve with physical reasoning in situated environments. We construct a simple simulated environment and demonstrate examples of where, in a zero-shot setting, both text and multimodal LLMs display atomic world knowledge about various objects but fail to compose this knowledge in correct solutions for an object manipulation and placement task. We also use BLIP, a vision-language model trained with more sophisticated cross-modal attention, to identify cases relevant to object physical properties that that model fails to ground. Finally, we present a procedure for discovering the relevant properties of objects in the environment and propose a method to distill this knowledge back into the LLM.

Title: Contact Complexity in Customer Service

Authors: Shu-Ting Pi, Michael Yang, Qun Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15655
Pdf URL: https://arxiv.org/pdf/2402.15655
Copy Paste: [[2402.15655]] Contact Complexity in Customer Service(https://arxiv.org/abs/2402.15655)
Keywords: agent
Abstract: Customers who reach out for customer service support may face a range of issues that vary in complexity. Routing high-complexity contacts to junior agents can lead to multiple transfers or repeated contacts, while directing low-complexity contacts to senior agents can strain their capacity to assist customers who need professional help. To tackle this, a machine learning model that accurately predicts the complexity of customer issues is highly desirable. However, defining the complexity of a contact is a difficult task as it is a highly abstract concept. While consensus-based data annotation by experienced agents is a possible solution, it is time-consuming and costly. To overcome these challenges, we have developed a novel machine learning approach to define contact complexity. Instead of relying on human annotation, we trained an AI expert model to mimic the behavior of agents and evaluate each contact's complexity based on how the AI expert responds. If the AI expert is uncertain or lacks the skills to comprehend the contact transcript, it is considered a high-complexity contact. Our method has proven to be reliable, scalable, and cost-effective based on the collected data.

Title: Leveraging ChatGPT in Pharmacovigilance Event Extraction: An Empirical Study

Authors: Zhaoyue Sun, Gabriele Pergola, Byron C. Wallace, Yulan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15663
Pdf URL: https://arxiv.org/pdf/2402.15663
Copy Paste: [[2402.15663]] Leveraging ChatGPT in Pharmacovigilance Event Extraction: An Empirical Study(https://arxiv.org/abs/2402.15663)
Keywords: language model, gpt, llm, prompt, chat
Abstract: With the advent of large language models (LLMs), there has been growing interest in exploring their potential for medical applications. This research aims to investigate the ability of LLMs, specifically ChatGPT, in the context of pharmacovigilance event extraction, of which the main goal is to identify and extract adverse events or potential therapeutic events from textual medical sources. We conduct extensive experiments to assess the performance of ChatGPT in the pharmacovigilance event extraction task, employing various prompts and demonstration selection strategies. The findings demonstrate that while ChatGPT demonstrates reasonable performance with appropriate demonstration selection strategies, it still falls short compared to fully fine-tuned small models. Additionally, we explore the potential of leveraging ChatGPT for data augmentation. However, our investigation reveals that the inclusion of synthesized data into fine-tuning may lead to a decrease in performance, possibly attributed to noise in the ChatGPT-generated labels. To mitigate this, we explore different filtering strategies and find that, with the proper approach, more stable performance can be achieved, although constant improvement remains elusive.

Title: Teacher-Student Learning on Complexity in Intelligent Routing

Authors: Shu-Ting Pi, Michael Yang, Yuying Zhu, Qun Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15665
Pdf URL: https://arxiv.org/pdf/2402.15665
Copy Paste: [[2402.15665]] Teacher-Student Learning on Complexity in Intelligent Routing(https://arxiv.org/abs/2402.15665)
Keywords: agent
Abstract: Customer service is often the most time-consuming aspect for e-commerce websites, with each contact typically taking 10-15 minutes. Effectively routing customers to appropriate agents without transfers is therefore crucial for e-commerce success. To this end, we have developed a machine learning framework that predicts the complexity of customer contacts and routes them to appropriate agents accordingly. The framework consists of two parts. First, we train a teacher model to score the complexity of a contact based on the post-contact transcripts. Then, we use the teacher model as a data annotator to provide labels to train a student model that predicts the complexity based on pre-contact data only. Our experiments show that such a framework is successful and can significantly improve customer experience. We also propose a useful metric called complexity AUC that evaluates the effectiveness of customer service at a statistical level.

Title: Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology

Authors: Zhenhua Wang, Wei Xie, Baosheng Wang, Enze Wang, Zhiwen Gui, Shuoyoucheng Ma, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15690
Pdf URL: https://arxiv.org/pdf/2402.15690
Copy Paste: [[2402.15690]] Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology(https://arxiv.org/abs/2402.15690)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have gradually become the gateway for people to acquire new knowledge. However, attackers can break the model's security protection ("jail") to access restricted information, which is called "jailbreaking." Previous studies have shown the weakness of current LLMs when confronted with such jailbreaking attacks. Nevertheless, comprehension of the intrinsic decision-making mechanism within the LLMs upon receipt of jailbreak prompts is noticeably lacking. Our research provides a psychological explanation of the jailbreak prompts. Drawing on cognitive consistency theory, we argue that the key to jailbreak is guiding the LLM to achieve cognitive coordination in an erroneous direction. Further, we propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique. This method progressively induces the model to answer harmful questions via multi-step incremental prompts. We instantiated a prototype system to evaluate the jailbreaking effectiveness on 8 advanced LLMs, yielding an average success rate of 83.9%. This study builds a psychological perspective on the explanatory insights into the intrinsic decision-making logic of LLMs.

Title: Is Offline Decision Making Possible with Only Few Samples? Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement

Authors: Ruiqi Zhang, Yuexiang Zhai, Andrea Zanette
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2402.15703
Pdf URL: https://arxiv.org/pdf/2402.15703
Copy Paste: [[2402.15703]] Is Offline Decision Making Possible with Only Few Samples? Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement(https://arxiv.org/abs/2402.15703)
Keywords: agent
Abstract: What can an agent learn in a stochastic Multi-Armed Bandit (MAB) problem from a dataset that contains just a single sample for each arm? Surprisingly, in this work, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one. This paves the way to reliable decision-making in settings where critical decisions must be made by relying only on a handful of samples. Our analysis reveals that stochastic policies can be substantially better than deterministic ones for offline decision-making. Focusing on offline multi-armed bandits, we design an algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST) which is quite different from the predominant value-based lower confidence bound approach. Its design is enabled by localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems while being substantially lower on problems with very few samples. Finally, we consider an application to offline reinforcement learning in the special case where the logging policies are known.
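For context, the value-based lower confidence bound (LCB) baseline that the abstract above contrasts with TRUST can be sketched as follows. This is a minimal sketch of the generic LCB rule, not of TRUST itself; the Hoeffding-style radius, the `delta` parameter, the assumption that rewards lie in [0, 1], and the function name `lcb_arm` are all illustrative choices.

```python
import math


def lcb_arm(rewards_by_arm, delta=0.1):
    """Return (arm, lcb) for the arm with the highest lower confidence bound.

    rewards_by_arm maps each arm to its logged rewards, assumed to lie in [0, 1].
    The penalty is the Hoeffding radius sqrt(log(1/delta) / (2 * n)).
    """
    best_arm, best_lcb = None, -math.inf
    for arm, rewards in rewards_by_arm.items():
        n = len(rewards)
        if n == 0:
            continue  # an arm with no data cannot be scored
        mean = sum(rewards) / n
        radius = math.sqrt(math.log(1.0 / delta) / (2 * n))
        lcb = mean - radius
        if lcb > best_lcb:
            best_arm, best_lcb = arm, lcb
    return best_arm, best_lcb


# A well-sampled arm beats a lucky single draw from another arm.
uneven = {"a": [0.9], "b": [0.6] * 50}
print(lcb_arm(uneven)[0])

# With exactly one sample per arm every radius is equal, so LCB
# degenerates to picking the largest single observation.
single = {"a": [0.2], "b": [0.7], "c": [0.5]}
print(lcb_arm(single)[0])
```

The second call shows why the one-sample-per-arm regime is hard for this rule: the pessimism penalty no longer distinguishes the arms, and the choice is a deterministic bet on one noisy observation.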

Title: Query Augmentation by Decoding Semantics from Brain Signals

Authors: Ziyi Ye, Jingtao Zhan, Qingyao Ai, Yiqun Liu, Maarten de Rijke, Christina Lioma, Tuukka Ruotsalo
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2402.15708
Pdf URL: https://arxiv.org/pdf/2402.15708
Copy Paste: [[2402.15708]] Query Augmentation by Decoding Semantics from Brain Signals(https://arxiv.org/abs/2402.15708)
Keywords: prompt
Abstract: Query augmentation is a crucial technique for refining semantically imprecise queries. Traditionally, query augmentation relies on extracting information from initially retrieved, potentially relevant documents. If the quality of the initially retrieved documents is low, then the effectiveness of query augmentation would be limited as well. We propose Brain-Aug, which enhances a query by incorporating semantic information decoded from brain signals. Brain-Aug generates the continuation of the original query with a prompt constructed with brain signal information and a ranking-oriented inference approach. Experimental results on fMRI (functional magnetic resonance imaging) datasets show that Brain-Aug produces semantically more accurate queries, leading to improved document ranking performance. Such improvement brought by brain signals is particularly notable for ambiguous queries.

Title: Making Pre-trained Language Models Better Continual Few-Shot Relation Extractors

Authors: Shengkun Ma, Jiale Han, Yi Liang, Bo Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15713
Pdf URL: https://arxiv.org/pdf/2402.15713
Copy Paste: [[2402.15713]] Making Pre-trained Language Models Better Continual Few-Shot Relation Extractors(https://arxiv.org/abs/2402.15713)
Keywords: language model, gpt, prompt, chat
Abstract: Continual Few-shot Relation Extraction (CFRE) is a practical problem that requires the model to continuously learn novel relations while avoiding forgetting old ones with few labeled training data. The primary challenges are catastrophic forgetting and overfitting. This paper harnesses prompt learning to explore the implicit capabilities of pre-trained language models to address the above two challenges, thereby making language models better continual few-shot relation extractors. Specifically, we propose a Contrastive Prompt Learning framework, which designs prompt representation to acquire more generalized knowledge that can be easily adapted to old and new categories, and margin-based contrastive learning to focus more on hard samples, therefore alleviating catastrophic forgetting and overfitting issues. To further remedy overfitting in low-resource scenarios, we introduce an effective memory augmentation strategy that employs well-crafted prompts to guide ChatGPT in generating diverse samples. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin and significantly mitigates catastrophic forgetting and overfitting in low-resource scenarios.
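The idea of margin-based contrastive learning that concentrates on hard samples can be sketched with a generic hinge-style loss. The abstract above does not give the paper's exact formulation, so the cosine similarity, the `margin` value, and the function names below are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_contrastive_loss(anchor, positive, negatives, margin=0.3):
    """Mean hinge loss: a negative contributes only when it comes within
    `margin` of the positive in similarity to the anchor."""
    pos_sim = cosine(anchor, positive)
    losses = [
        max(0.0, margin - pos_sim + cosine(anchor, neg)) for neg in negatives
    ]
    return sum(losses) / len(losses)


anchor = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [-1.0, 0.0]  # far from the anchor: contributes nothing
hard_negative = [1.0, 0.1]   # nearly the same direction: contributes loss

print(margin_contrastive_loss(anchor, positive, [easy_negative]))
print(margin_contrastive_loss(anchor, positive, [hard_negative]))
```

Because easy negatives fall outside the margin and yield zero loss, the gradient signal comes only from hard negatives, which is the sense in which such a loss "focuses on hard samples".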

Title: Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

Authors: Chaoya Jiang, Wei Ye, Mengfan Dong, Hongrui Jia, Haiyang Xu, Ming Yan, Ji Zhang, Shikun Zhang
Subjects: cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.15721
Pdf URL: https://arxiv.org/pdf/2402.15721
Copy Paste: [[2402.15721]] Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models(https://arxiv.org/abs/2402.15721)
Keywords: language model, llm, hallucination
Abstract: Large Vision Language Models (LVLMs) exhibit remarkable capabilities but struggle with hallucinations, that is, inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this paper, we introduce a refined taxonomy of hallucinations, featuring a new category: Event Hallucination. We then utilize advanced LLMs to generate and filter fine-grained hallucinatory data consisting of various types of hallucinations, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework. The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging LVLMs' efficacy in handling hallucinations. We will release our code and data.

Title: How Do Humans Write Code? Large Models Do It the Same Way Too

Authors: Long Li
Subjects: cs.AI, cs.CL, cs.PL
Abstract URL: https://arxiv.org/abs/2402.15729
Pdf URL: https://arxiv.org/pdf/2402.15729
Copy Paste: [[2402.15729]] How Do Humans Write Code? Large Models Do It the Same Way Too(https://arxiv.org/abs/2402.15729)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) often make errors when performing numerical calculations. In contrast to traditional chain-of-thought reasoning, the program-of-thoughts approach involves generating executable code to solve problems. By executing this code, it achieves more precise results. Using generated executable code instead of natural language can reduce computational errors. However, we observe that when LLMs solve mathematical problems using code, they tend to generate more incorrect reasoning than when using natural language. To address this issue, we propose Human-Think Language (HTL), a straightforward yet highly efficient approach inspired by human coding practices. The approach first generates problem-solving methods described in the natural language by the model, then converts them into code, mirroring the process where people think through the logic in natural language before writing it as code. Additionally, it utilizes the Proximal Policy Optimization (PPO) algorithm, enabling it to provide feedback to itself based on the correctness of mathematical answers, much like humans do. Finally, we introduce a focus-attention mechanism that masks the question segment, enhancing its reliance on natural language inference solutions during code generation. We conduct our experiments without introducing any additional information, and the results across five mathematical calculation datasets showcase the effectiveness of our approach. Notably, on the NumGLUE dataset, the LlaMA-2-7B-based model achieves a superior performance rate (75.1%) compared to the previous best performance with the LlaMA-2-70B model (74.4%).

Title: GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation

Authors: Yi Zong, Xipeng Qiu
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2402.15745
Pdf URL: https://arxiv.org/pdf/2402.15745
Copy Paste: [[2402.15745]] GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation(https://arxiv.org/abs/2402.15745)
Keywords: language model, gpt
Abstract: Large Vision-Language Models (LVLMs) have demonstrated great abilities in image perception and language understanding. However, existing multimodal benchmarks focus on primary perception abilities and commonsense knowledge, which are insufficient to reflect the comprehensive capabilities of LVLMs. We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO), comprising 8 subjects and 12 types of images, such as diagrams, function graphs, maps and photos. GAOKAO-MM derives from native Chinese context and sets human-level requirements for the model's abilities, including perception, understanding, knowledge and reasoning. We evaluate 10 LVLMs and find that the accuracies of all of them are lower than 50%, with GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%) and Gemini-Pro-Vision (35.1%) ranking in the top three positions. The results of our multi-dimension analysis indicate that LVLMs remain at a moderate distance from Artificial General Intelligence (AGI) and provide insights facilitating the development of multilingual LVLMs.

Title: Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning

Authors: Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, Yang You
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.15751
Pdf URL: https://arxiv.org/pdf/2402.15751
Copy Paste: [[2402.15751]] Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning(https://arxiv.org/abs/2402.15751)
Keywords: language model, llm
Abstract: While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training. Memory-efficient Zeroth-order (MeZO) optimizers, recently proposed to address this issue, only require forward passes during training, making them more memory-friendly. However, the quality of gradient estimates in zeroth-order optimization often depends on the data dimensionality, potentially explaining why MeZO still exhibits significant performance drops compared to standard fine-tuning across various tasks. Inspired by the success of Parameter-Efficient Fine-Tuning (PEFT), this paper introduces Sparse MeZO, a novel memory-efficient zeroth-order optimization approach that applies ZO only to a carefully chosen subset of parameters. We propose a simple yet effective parameter selection scheme that yields significant performance gains with Sparse-MeZO. Additionally, we develop a memory-optimized implementation for sparse masking, ensuring the algorithm requires only inference-level memory consumption, allowing Sparse-MeZO to fine-tune LLaMA-30b on a single A100 GPU. Experimental results illustrate that Sparse-MeZO consistently improves both performance and convergence speed over MeZO without any overhead. For example, it achieves a 9% absolute accuracy improvement and 3.5x speedup over MeZO on the RTE task.
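The core mechanism described above, a forward-pass-only gradient estimate applied to a subset of parameters, can be sketched on a toy problem. This is a simplified illustration, not the Sparse-MeZO implementation: the quadratic `loss`, the hand-picked `mask`, and the hyperparameters are assumptions, and the paper's parameter selection scheme and memory optimizations are not reproduced.

```python
import random


def loss(theta):
    """Toy quadratic objective standing in for a model's training loss."""
    return sum(t * t for t in theta)


def sparse_zo_step(theta, mask, lr=0.01, eps=1e-3, rng=random):
    """One two-point zeroth-order update restricted to masked-in coordinates."""
    # Perturb only the selected parameters; the others get a zero direction.
    z = [rng.gauss(0.0, 1.0) if m else 0.0 for m in mask]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    # Two forward passes give a scalar estimate of the directional derivative.
    g = (loss(plus) - loss(minus)) / (2 * eps)
    # Move against the estimated gradient g * z; masked-out entries stay put.
    return [t - lr * g * zi for t, zi in zip(theta, z)]


rng = random.Random(0)
theta = [1.0, -2.0, 3.0, 0.5]
mask = [True, False, True, False]  # hypothetical choice of trainable entries
start = loss(theta)
for _ in range(200):
    theta = sparse_zo_step(theta, mask, rng=rng)

print(start, loss(theta))
print(theta)
```

No backward pass is needed at any point, and the coordinates outside the mask are never modified, so only the selected subset has to be perturbed and updated.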

Title: HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition

Authors: Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15754
Pdf URL: https://arxiv.org/pdf/2402.15754
Copy Paste: [[2402.15754]] HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition(https://arxiv.org/abs/2402.15754)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have emerged as a promising alternative to expensive human evaluations. However, the alignment and coverage of LLM-based evaluations are often limited by the scope and potential bias of the evaluation prompts and criteria. To address this challenge, we propose HD-Eval, a novel framework that iteratively aligns LLM-based evaluators with human preference via Hierarchical Criteria Decomposition. HD-Eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators by decomposing a given evaluation task into finer-grained criteria, aggregating them according to estimated human preferences, pruning insignificant criteria with attribution, and further decomposing significant criteria. By integrating these steps within an iterative alignment training process, we obtain a hierarchical decomposition of criteria that comprehensively captures aspects of natural language at multiple levels of granularity. Implemented as a white box, the human preference-guided aggregator is efficient to train and more explainable than relying solely on prompting, and its independence from model parameters makes it applicable to closed-source LLMs. Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators and providing deeper insights into the explanation of evaluation results and the task itself.

Title: Dental Severity Assessment through Few-shot Learning and SBERT Fine-tuning

Authors: Mohammad Dehghani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15755
Pdf URL: https://arxiv.org/pdf/2402.15755
Copy Paste: [[2402.15755]] Dental Severity Assessment through Few-shot Learning and SBERT Fine-tuning(https://arxiv.org/abs/2402.15755)
Keywords: language model
Abstract: Dental diseases have a significant impact on a considerable portion of the population, leading to various health issues that can detrimentally affect individuals' overall well-being. The integration of automated systems in oral healthcare has become increasingly crucial. Machine learning approaches offer a viable solution to address challenges such as diagnostic difficulties, inefficiencies, and errors in oral disease diagnosis. These methods prove particularly useful when physicians struggle to predict or diagnose diseases at their early stages. In this study, thirteen different machine learning, deep learning, and large language models were employed to determine the severity level of oral health issues based on radiologists' reports. The results revealed that the Few-shot learning with SBERT and Multi-Layer Perceptron model outperformed all other models across various experiments, achieving an impressive accuracy of 94.1% as the best result. Consequently, this model exhibits promise as a reliable tool for evaluating the severity of oral diseases, enabling patients to receive more effective treatment and aiding healthcare professionals in making informed decisions regarding resource allocation and the management of high-risk patients.

Title: Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens

Authors: Ziqian Zeng, Jiahong Yu, Qianshi Pang, Zihao Wang, Huiping Zhuang, Cen Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15758
Pdf URL: https://arxiv.org/pdf/2402.15758
Copy Paste: [[2402.15758]] Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens(https://arxiv.org/abs/2402.15758)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their widespread application is hindered by the resource-intensive decoding process. To address this challenge, current approaches have incorporated additional decoding heads to enable parallel prediction of multiple subsequent tokens, thereby achieving inference acceleration. Nevertheless, the accuracy of these decoding heads falls short of the auto-regressive decoding approach. In light of these limitations, we propose Chimera, a novel framework specifically designed for speculative sampling. Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words. To ensure both accuracy and efficiency, we present two strategies within the lightweight draft model. Firstly, we focus on capturing short-range dependencies at the bottom layer. Secondly, we leverage the readily available representations from the original LLM. Through empirical evaluation on the Vicuna and LlaMA-2 series, Chimera demonstrates impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach. This highlights the potential of our proposed framework in significantly improving the efficiency of large language models during the decoding process.
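Why draft-then-verify decoding can be lossless is easiest to see in the greedy case, sketched below. This is a generic illustration of speculative decoding with greedy verification, not Chimera's draft model: `target_next` and `draft_next` are toy stand-ins invented for the example, and a real system would verify all drafted positions in one batched forward pass of the large model.

```python
def target_next(tokens):
    """Toy 'large' model: a deterministic next token for the given context."""
    return (sum(tokens) * 7 + len(tokens)) % 10


def draft_next(tokens):
    """Toy 'draft' model: matches the target except on some contexts."""
    guess = target_next(tokens)
    return (guess + 1) % 10 if sum(tokens) % 3 == 0 else guess


def greedy_decode(prompt, num_new):
    """Plain auto-regressive decoding with the target model."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, num_new, k=4):
    """Draft k tokens, keep the prefix the target agrees with, then repeat."""
    tokens = list(prompt)
    target_len = len(prompt) + num_new
    while len(tokens) < target_len:
        # 1. The draft model proposes k tokens auto-regressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks each proposed position in order.
        accepted = []
        for proposed in draft:
            correct = target_next(tokens + accepted)
            accepted.append(correct)
            if proposed != correct:
                # 3. First mismatch: keep the target's token, drop the rest.
                break
        tokens.extend(accepted)
    return tokens[:target_len]


prompt = [1, 2, 3]
print(greedy_decode(prompt, 20) == speculative_decode(prompt, 20))
```

Every token that is kept equals the target model's own greedy choice for its context, so the output matches plain auto-regressive decoding regardless of how often the draft is wrong; draft quality affects only how many tokens are accepted per verification round.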

Title: Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models

Authors: Haoran Liao, Jidong Tian, Shaohua Hu, Hao He, Yaohui Jin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15764
Pdf URL: https://arxiv.org/pdf/2402.15764
Copy Paste: [[2402.15764]] Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models(https://arxiv.org/abs/2402.15764)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) have exhibited impressive performance across NLP tasks. So far they still face challenges in complex reasoning tasks and can be sensitive to input context. Although significant efforts have been invested in enhancing the reasoning process and improving prefix-prompt robustness, the crucial role of problem context has been overlooked. In this study, we propose a new approach to improve the mathematical capacities of LLMs, named Problem Elaboration Prompting (PEP). Specifically, PEP decomposes and elucidates the problem context before reasoning, thus enhancing the global context modeling and reducing the parsing difficulties. Experiments on datasets demonstrate promising performance on complex reasoning and indicate a beneficial impact on ill-formed problems. For instance, with the GPT-3.5 model (text-davinci-003), we observed a 9.93% improvement with greedy decoding and an 8.80% improvement with self-consistency on GSM8k compared to the standard CoT. With ChatGPT (turbo) and PEP, we achieve SOTA performance on SVAMP with 86.2% and GSM8k with 90.98%.

Title: Empowering Large Language Model Agents through Action Learning

Authors: Haiteng Zhao, Chang Ma, Guoyin Wang, Jing Su, Lingpeng Kong, Jingjing Xu, Zhi-Hong Deng, Hongxia Yang
Subjects: cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.15809
Pdf URL: https://arxiv.org/pdf/2402.15809
Copy Paste: [[2402.15809]] Empowering Large Language Model Agents through Action Learning(https://arxiv.org/abs/2402.15809)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM) agents have recently garnered increasing interest, yet they are limited in their ability to learn from trial and error, a key element of intelligent behavior. In this work, we argue that the capacity to learn new actions from experience is fundamental to the advancement of learning in LLM agents. While humans naturally expand their action spaces and develop skills through experiential learning, LLM agents typically operate within fixed action spaces, limiting their potential for growth. To address these challenges, our study explores open-action learning for language agents. We introduce a framework, LearnAct, with an iterative learning strategy to create and improve actions in the form of Python functions. In each iteration, the LLM revises and updates the currently available actions based on the errors identified in unsuccessful training tasks, thereby enhancing action effectiveness. Our experimental evaluations across Robotic Planning and AlfWorld environments reveal that after learning on a few training task instances, our approach to open-action learning markedly improves agent performance for the type of task (by 32% in AlfWorld compared to ReAct+Reflexion, for instance), highlighting the importance of experiential action learning in the development of more intelligent LLM agents.

Title: Measuring Bargaining Abilities of LLMs: A Benchmark and A Buyer-Enhancement Method

Authors: Tian Xia, Zhiwei He, Tong Ren, Yibo Miao, Zhuosheng Zhang, Yang Yang, Rui Wang
Subjects: cs.CL, cs.GT
Abstract URL: https://arxiv.org/abs/2402.15813
Pdf URL: https://arxiv.org/pdf/2402.15813
Copy Paste: [[2402.15813]] Measuring Bargaining Abilities of LLMs: A Benchmark and A Buyer-Enhancement Method(https://arxiv.org/abs/2402.15813)
Keywords: llm, agent
Abstract: Bargaining is an important and unique part of negotiation between humans. As LLM-driven agents learn to negotiate and act like real humans, how to evaluate agents' bargaining abilities remains an open problem. For the first time, we formally describe the Bargaining task as an asymmetric incomplete information game, defining the gains of the Buyer and Seller in multiple bargaining processes. This allows us to quantitatively assess an agent's performance in the Bargaining task. We collected a real product price dataset, AmazonHistoryPrice, and conducted evaluations of various LLM agents' bargaining abilities. We find that playing a Buyer is much harder than playing a Seller, and that increasing model size cannot effectively improve the Buyer's performance. To address the challenge, we propose a novel approach called OG-Narrator that integrates a deterministic Offer Generator to control the price range of the Buyer's offers, and an LLM Narrator to create natural language sentences for generated offers. Experimental results show that OG-Narrator improves the Buyer's deal rates from 26.67% to 88.88% and multiplies profits tenfold on all baselines, even for a model that has not been aligned.
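The split between a deterministic Offer Generator and a Narrator can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not specify the offer schedule, so a linear rise between two fractions of the list price is used, and a fixed template stands in for the LLM Narrator. The names `generate_offers`, `narrate`, `start_ratio`, and `end_ratio` are hypothetical.

```python
def generate_offers(list_price, start_ratio=0.5, end_ratio=0.8, rounds=4):
    """Deterministic buyer offers rising linearly between two fractions of the list price."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    if rounds == 1:
        return [round(list_price * start_ratio, 2)]
    step = (end_ratio - start_ratio) / (rounds - 1)
    # Each offer is a fixed fraction of the list price, so the price range
    # never depends on what a language model happens to generate.
    return [round(list_price * (start_ratio + i * step), 2) for i in range(rounds)]


def narrate(offer):
    """Template stand-in for the LLM Narrator: wrap a numeric offer in a sentence."""
    return f"I can offer ${offer:.2f} for this item."


offers = generate_offers(100.0)
messages = [narrate(o) for o in offers]
```

The design point the sketch tries to capture is that the numbers come from code and only the wording comes from the language component, which keeps the Buyer's offers inside a controlled range.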

Title: A Theoretical Result on the Inductive Bias of RNN Language Models

Authors: Anej Svete, Robin Shing Moon Chan, Ryan Cotterell
Subjects: cs.CL, cs.CC, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15814
Pdf URL: https://arxiv.org/pdf/2402.15814
Copy Paste: [[2402.15814]] A Theoretical Result on the Inductive Bias of RNN Language Models(https://arxiv.org/abs/2402.15814)
Keywords: language model
Abstract: Recent work by Hewitt et al. (2020) provides a possible interpretation of the empirical success of recurrent neural networks (RNNs) as language models (LMs). It shows that RNNs can efficiently represent bounded hierarchical structures that are prevalent in human language. This suggests that RNNs' success might be linked to their ability to model hierarchy. However, a closer inspection of Hewitt et al.'s (2020) construction shows that it is not limited to hierarchical LMs, posing the question of what other classes of LMs can be efficiently represented by RNNs. To this end, we generalize their construction to show that RNNs can efficiently represent a larger class of LMs: those that can be represented by a pushdown automaton with a bounded stack and a generalized stack update function. This is analogous to an automaton that keeps a memory of a fixed number of symbols and updates the memory with a simple update mechanism. Altogether, the efficiency in representing a diverse class of non-hierarchical LMs posits a lack of concrete cognitive and human-language-centered inductive biases in RNNs.

Title: Linguistic Intelligence in Large Language Models for Telecommunications

Authors: Tasnim Ahmed, Nicola Piovesan, Antonio De Domenico, Salimur Choudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15818
Pdf URL: https://arxiv.org/pdf/2402.15818
Copy Paste: [[2402.15818]] Linguistic Intelligence in Large Language Models for Telecommunications(https://arxiv.org/abs/2402.15818)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) have emerged as a significant advancement in the field of Natural Language Processing (NLP), demonstrating remarkable capabilities in language generation and other language-centric tasks. Despite their evaluation across a multitude of analytical and reasoning tasks in various scientific domains, a comprehensive exploration of their knowledge and understanding within the realm of natural language tasks in the telecommunications domain is still needed. This study, therefore, seeks to evaluate the knowledge and understanding capabilities of LLMs within this domain. To achieve this, we conduct an exhaustive zero-shot evaluation of four prominent LLMs: Llama-2, Falcon, Mistral, and Zephyr. These models require fewer resources than ChatGPT, making them suitable for resource-constrained environments. Their performance is compared with state-of-the-art, fine-tuned models. To the best of our knowledge, this is the first work to extensively evaluate and compare the understanding of LLMs across multiple language-centric tasks in this domain. Our evaluation reveals that zero-shot LLMs can achieve performance levels comparable to the current state-of-the-art fine-tuned models. This indicates that pretraining on extensive text corpora equips LLMs with a degree of specialization, even within the telecommunications domain. We also observe that no single LLM consistently outperforms others, and the performance of different LLMs can fluctuate. Although their performance lags behind fine-tuned models, our findings underscore the potential of LLMs as a valuable resource for understanding various aspects of this field that lack large annotated data.

Title: Reward Design for Justifiable Sequential Decision-Making

Authors: Aleksa Sukovic, Goran Radanovic
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15826
Pdf URL: https://arxiv.org/pdf/2402.15826
Copy Paste: [[2402.15826]] Reward Design for Justifiable Sequential Decision-Making(https://arxiv.org/abs/2402.15826)
Keywords: agent
Abstract: Equipping agents with the capacity to justify made decisions using supporting evidence represents a cornerstone of accountable decision-making. Furthermore, ensuring that justifications are in line with human expectations and societal norms is vital, especially in high-stakes situations such as healthcare. In this work, we propose the use of a debate-based reward model for reinforcement learning agents, where the outcome of a zero-sum debate game quantifies the justifiability of a decision in a particular state. This reward model is then used to train a justifiable policy, whose decisions can be more easily corroborated with supporting evidence. In the debate game, two argumentative agents take turns providing supporting evidence for two competing decisions. Given the proposed evidence, a proxy of a human judge evaluates which decision is better justified. We demonstrate the potential of our approach in learning policies for prescribing and justifying treatment decisions of septic patients. We show that augmenting the reward with the feedback signal generated by the debate-based reward model yields policies highly favored by the judge when compared to the policy obtained solely from the environment rewards, while hardly sacrificing any performance. Moreover, in terms of the overall performance and justifiability of trained policies, the debate-based feedback is comparable to the feedback obtained from an ideal judge proxy that evaluates decisions using the full information encoded in the state. This suggests that the debate game outputs key information contained in states that is most relevant for evaluating decisions, which in turn substantiates the practicality of combining our approach with human-in-the-loop evaluations. Lastly, we showcase that agents trained via multi-agent debate learn to propose evidence that is resilient to refutations and closely aligns with human preferences.

Title: Prompt Perturbation Consistency Learning for Robust Language Models

Authors: Yao Qiang, Subhrangshu Nandi, Ninareh Mehrabi, Greg Ver Steeg, Anoop Kumar, Anna Rumshisky, Aram Galstyan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15833
Pdf URL: https://arxiv.org/pdf/2402.15833
Copy Paste: [[2402.15833]] Prompt Perturbation Consistency Learning for Robust Language Models(https://arxiv.org/abs/2402.15833)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks, such as question answering and text summarization. However, their performance on sequence labeling tasks such as intent classification and slot filling (IC-SF), which is a central component in personal assistant systems, lags significantly behind discriminative models. Furthermore, there is a lack of substantive research on the robustness of LLMs to various perturbations in the input prompts. The contributions of this paper are three-fold. First, we show that fine-tuning sufficiently large LLMs can produce IC-SF performance comparable to discriminative models. Next, we systematically analyze the performance deterioration of those fine-tuned models due to three distinct yet relevant types of input perturbations - oronyms, synonyms, and paraphrasing. Finally, we propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples. Our experiments demonstrate that PPCL can recover on average 59% and 69% of the performance drop for IC and SF tasks, respectively. Furthermore, PPCL beats the data augmentation approach while using ten times fewer augmented data samples.

Title: MATHWELL: Generating Educational Math Word Problems at Scale

Authors: Bryan R Christ, Jonathan Kropko, Thomas Hartvigsen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15861
Pdf URL: https://arxiv.org/pdf/2402.15861
Copy Paste: [[2402.15861]] MATHWELL: Generating Educational Math Word Problems at Scale(https://arxiv.org/abs/2402.15861)
Keywords: language model
Abstract: Math word problems are critical K-8 educational tools, but writing them is time-consuming and requires domain expertise. We suggest that language models can support K-8 math education by automatically generating problems at scale. To be educational, generated problems must be 1) solvable, 2) accurate, and 3) appropriate. Existing datasets are unlabeled for these criteria, making them ill-suited for training problem generators. We introduce MATHWELL, a Llama-2 (70B) model iteratively finetuned to generate K-8 math word problems using data from expert annotation. Using MATHWELL, we generate the largest English word problem dataset to date, containing 20,490 problems. Of these, 3,484 are scored by domain experts, who find that MATHWELL has a 40% higher share of problems that have executable solutions and meet all criteria than alternatives, with 74% of its problems with executable solutions being solvable, accurate, and appropriate.

Title: SportQA: A Benchmark for Sports Understanding in Large Language Models

Authors: Haotian Xia, Zhengbang Yang, Yuqing Wang, Rhys Tracy, Yun Zhao, Dongdong Huang, Zezhi Chen, Yan Zhu, Yuan-fang Wang, Weining Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15862
Pdf URL: https://arxiv.org/pdf/2402.15862
Copy Paste: [[2402.15862]] SportQA: A Benchmark for Sports Understanding in Large Language Models(https://arxiv.org/abs/2402.15862)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: A deep understanding of sports, a field rich in strategic and dynamic content, is crucial for advancing Natural Language Processing (NLP). This holds particular significance in the context of evaluating and advancing Large Language Models (LLMs), given the existing gap in specialized benchmarks. To bridge this gap, we introduce SportQA, a novel benchmark specifically designed for evaluating LLMs in the context of sports understanding. SportQA encompasses over 70,000 multiple-choice questions across three distinct difficulty levels, each targeting different aspects of sports knowledge from basic historical facts to intricate, scenario-based reasoning tasks. We conducted a thorough evaluation of prevalent LLMs, mainly utilizing few-shot learning paradigms supplemented by chain-of-thought (CoT) prompting. Our results reveal that while LLMs exhibit competent performance in basic sports knowledge, they struggle with more complex, scenario-based sports reasoning, lagging behind human expertise. The introduction of SportQA marks a significant step forward in NLP, offering a tool for assessing and enhancing sports understanding in LLMs.

Title: SemEval-2024 Task 8: Weighted Layer Averaging RoBERTa for Black-Box Machine-Generated Text Detection

Authors: Ayan Datta, Aryan Chandramania, Radhika Mamidi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15873
Pdf URL: https://arxiv.org/pdf/2402.15873
Copy Paste: [[2402.15873]] SemEval-2024 Task 8: Weighted Layer Averaging RoBERTa for Black-Box Machine-Generated Text Detection(https://arxiv.org/abs/2402.15873)
Keywords: language model, llm
Abstract: This document contains the details of the authors' submission to the proceedings of SemEval 2024's Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection, Subtask A (monolingual) and Subtask B. With the advent of large language models (LLMs), detection of machine-generated text is becoming an increasingly important task. In this document, we lay out the techniques used for this detection task, along with the results obtained.

Title: Predicting Outcomes in Video Games with Long Short Term Memory Networks

Authors: Kittimate Chulajata, Sean Wu, Fabien Scalzo, Eun Sang Cha
Subjects: cs.LG, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2402.15923
Pdf URL: https://arxiv.org/pdf/2402.15923
Copy Paste: [[2402.15923]] Predicting Outcomes in Video Games with Long Short Term Memory Networks(https://arxiv.org/abs/2402.15923)
Keywords: language model, llm
Abstract: Forecasting winners in E-sports with real-time analytics has the potential to further engage audiences watching major tournament events. However, making such real-time predictions is challenging due to unpredictable variables within the game involving diverse player strategies and decision-making. Our work attempts to enhance audience engagement within video game tournaments by introducing a real-time method of predicting wins. Our approach, based on Long Short-Term Memory networks (LSTMs), enables efficient predictions of win-lose outcomes by only using the health indicator of each player as a time series. As a proof of concept, we evaluate our model's performance within a classic, two-player arcade game, Super Street Fighter II Turbo. We also benchmark our method against state-of-the-art methods for time series forecasting, i.e., Transformer models found in large language models (LLMs). Finally, we open-source our data set and code in hopes of furthering work in predictive analysis for arcade games.

Title: MultiContrievers: Analysis of Dense Retrieval Representations

Authors: Seraphina Goldfarb-Tarrant, Pedro Rodriguez, Jane Dwivedi-Yu, Patrick Lewis
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2402.15925
Pdf URL: https://arxiv.org/pdf/2402.15925
Copy Paste: [[2402.15925]] MultiContrievers: Analysis of Dense Retrieval Representations(https://arxiv.org/abs/2402.15925)
Keywords: language model
Abstract: Dense retrievers compress source documents into (possibly lossy) vector representations, yet there is little analysis of what information is lost versus preserved, and how it affects downstream tasks. We conduct the first analysis of the information captured by dense retrievers compared to the language models they are based on (e.g., BERT versus Contriever). We use 25 MultiBert checkpoints as randomized initialisations to train MultiContrievers, a set of 25 contriever models. We test whether specific pieces of information, such as gender and occupation, can be extracted from contriever vectors of Wikipedia-like documents. We measure this extractability via information theoretic probing. We then examine the relationship of extractability to performance and gender bias, as well as the sensitivity of these results to many random initialisations and data shuffles. We find that (1) contriever models have significantly increased extractability, but extractability usually correlates poorly with benchmark performance; (2) gender bias is present, but is not caused by the contriever representations; and (3) there is high sensitivity to both random initialisation and data shuffle, suggesting that future retrieval research should test across a wider spread of both.

Title: QuaCer-C: Quantitative Certification of Knowledge Comprehension in LLMs

Authors: Isha Chaudhary, Vedaant V. Jain, Gagandeep Singh
Subjects: cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15929
Pdf URL: https://arxiv.org/pdf/2402.15929
Copy Paste: [[2402.15929]] QuaCer-C: Quantitative Certification of Knowledge Comprehension in LLMs(https://arxiv.org/abs/2402.15929)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated impressive performance on several benchmarks. However, traditional studies do not provide formal guarantees on the performance of LLMs. In this work, we propose a novel certification framework for LLMs, QuaCer-C, wherein we formally certify the knowledge-comprehension capabilities of popular LLMs. Our certificates are quantitative: they consist of high-confidence, tight bounds on the probability that the target LLM gives the correct answer on any relevant knowledge comprehension prompt. Our certificates for the Llama, Vicuna, and Mistral LLMs indicate that the knowledge comprehension capability improves with an increase in the number of parameters and that the Mistral model is less performant than the rest in this evaluation.

Title: Evaluating Prompting Strategies for Grammatical Error Correction Based on Language Proficiency

Authors: Min Zeng, Jiexin Kuang, Mengyang Qiu, Jayoung Song, Jungyeul Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15930
Pdf URL: https://arxiv.org/pdf/2402.15930
Copy Paste: [[2402.15930]] Evaluating Prompting Strategies for Grammatical Error Correction Based on Language Proficiency(https://arxiv.org/abs/2402.15930)
Keywords: llm, prompt
Abstract: The writing examples of English language learners may be different from those of native speakers. Given that there are significant differences in second language (L2) learners' error types by proficiency level, this paper attempts to reduce overcorrection by examining the interaction between LLMs' performance and L2 language proficiency. Our method focuses on zero-shot and few-shot prompting and on fine-tuning models for GEC for learners of English as a foreign language at different proficiency levels. We investigate GEC results and find that overcorrection happens primarily in advanced language learners' writing (proficiency C) rather than proficiency A (a beginner level) and proficiency B (an intermediate level). Fine-tuned LLMs, and even few-shot prompting with writing examples of English learners, actually tend to exhibit decreased recall measures. To make our claim concrete, we conduct a comprehensive examination of GEC outcomes and their evaluation results based on language proficiency.

Title: Frustratingly Simple Prompting-based Text Denoising

Authors: Jungyeul Park, Mengyang Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15931
Pdf URL: https://arxiv.org/pdf/2402.15931
Copy Paste: [[2402.15931]] Frustratingly Simple Prompting-based Text Denoising(https://arxiv.org/abs/2402.15931)
Keywords: prompt
Abstract: This paper introduces a novel perspective on the automated essay scoring (AES) task, challenging the conventional view of the ASAP dataset as a static entity. Employing simple text denoising techniques using prompting, we explore the dynamic potential within the dataset. While acknowledging the previous emphasis on building regression systems, our paper underscores how making minor changes to a dataset through text denoising can enhance the final results.

Title: Scalable Volt-VAR Optimization using RLlib-IMPALA Framework: A Reinforcement Learning Approach

Authors: Alaa Selim, Yanzhu Ye, Junbo Zhao, Bo Yang
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2402.15932
Pdf URL: https://arxiv.org/pdf/2402.15932
Copy Paste: [[2402.15932]] Scalable Volt-VAR Optimization using RLlib-IMPALA Framework: A Reinforcement Learning Approach(https://arxiv.org/abs/2402.15932)
Keywords: agent
Abstract: In the rapidly evolving domain of electrical power systems, Volt-VAR optimization (VVO) is increasingly critical, especially with the burgeoning integration of renewable energy sources. Traditional approaches to learning-based VVO in expansive and dynamically changing power systems are often hindered by computational complexities. To address this challenge, our research presents a novel framework that harnesses the potential of Deep Reinforcement Learning (DRL), specifically utilizing the Importance Weighted Actor-Learner Architecture (IMPALA) algorithm, executed on the RAY platform. This framework, built upon RLlib (an industry standard in Reinforcement Learning), ingeniously capitalizes on the distributed computing capabilities and advanced hyperparameter tuning offered by RAY. This design significantly expedites the exploration and exploitation phases in the VVO solution space. Our empirical results demonstrate that our approach not only surpasses existing DRL methods in achieving superior reward outcomes but also manifests a remarkable tenfold reduction in computational requirements. The integration of our DRL agent with the RAY platform facilitates the creation of RLlib-IMPALA, a novel framework that efficiently uses RAY's resources to improve system adaptability and control. RLlib-IMPALA leverages RAY's toolkit to enhance analytical capabilities and significantly speeds up training to become more than 10 times faster than other state-of-the-art DRL methods.

Title: Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

Authors: Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Ge Li
Subjects: cs.CL, cs.AI, cs.CR, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2402.15938
Pdf URL: https://arxiv.org/pdf/2402.15938
Copy Paste: [[2402.15938]] Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models(https://arxiv.org/abs/2402.15938)
Keywords: language model, gpt, llm, chat
Abstract: Recent statements about the impressive capabilities of large language models (LLMs) are usually supported by evaluating on open-access benchmarks. Considering the vast size and wide-ranging sources of LLMs' training data, it could explicitly or implicitly include test data, leading to LLMs being more susceptible to data contamination. However, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for LLMs faces significant challenges. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD requires only the sampled texts to detect data contamination, by identifying the peakedness of an LLM's output distribution. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution, based on the correction of an LLM's output distribution. To facilitate this study, we introduce two benchmarks, i.e., DetCon and ComiEval, for data contamination detection and contamination mitigation evaluation tasks. Extensive experimental results show that CDD achieves average relative improvements of 21.8%-30.2% over other contamination detection approaches in terms of Accuracy, F1 Score, and AUC metrics, and can effectively detect contamination caused by variants of the test data. TED significantly mitigates performance improvements of up to 66.9% attributed to data contamination across 24 settings and 21 contamination degrees. In real-world applications, we reveal that ChatGPT exhibits a high potential to suffer from data contamination on the HumanEval benchmark.
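One way to make "peakedness of the output distribution" concrete is to measure how many sampled outputs sit very close to a reference output for the same prompt. The sketch below is an illustration under stated assumptions, not the CDD algorithm as published: it assumes whitespace tokenisation, token-level Levenshtein distance as the similarity measure, and a hypothetical tolerance parameter `alpha`. The function names `edit_distance` and `peakedness` are likewise illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def peakedness(reference, samples, alpha=0.05):
    """Fraction of samples within alpha * len(reference) token edits of the reference."""
    if not samples:
        raise ValueError("need at least one sampled output")
    ref = reference.split()
    limit = alpha * len(ref)
    close = sum(edit_distance(ref, s.split()) <= limit for s in samples)
    return close / len(samples)


# Hypothetical outputs sampled from a model for a single prompt.
reference = "return a + b"
samples = ["return a + b", "return a + b", "return b + a", "x = a + b"]
score = peakedness(reference, samples, alpha=0.25)
```

A score near 1 means the samples collapse onto almost the same text, which is the kind of peaked distribution the abstract associates with contamination; any decision threshold on the score would be a further assumption that has to be tuned.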

Title: GreenLLaMA: A Framework for Detoxification with Explanations

Authors: Md Tawkat Islam Khondaker, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan
Subjects: cs.LG, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2402.15951
Pdf URL: https://arxiv.org/pdf/2402.15951
Copy Paste: [[2402.15951]] GreenLLaMA: A Framework for Detoxification with Explanations(https://arxiv.org/abs/2402.15951)
Keywords: gpt, chat
Abstract: Prior works on detoxification are scattered in the sense that they do not cover all aspects of detoxification needed in a real-world scenario. Notably, prior works restrict the task of developing detoxification models to only a seen subset of platforms, leaving the question of how the models would perform on unseen platforms unexplored. Additionally, these works do not address non-detoxifiability, a phenomenon whereby the toxic text cannot be detoxified without altering the meaning. We propose GreenLLaMA, the first comprehensive end-to-end detoxification framework, which attempts to alleviate the aforementioned limitations. We first introduce a cross-platform pseudo-parallel corpus applying multi-step data processing and generation strategies leveraging ChatGPT. We then train a suite of detoxification models with our cross-platform corpus. We show that our detoxification models outperform the SoTA model trained with human-annotated parallel corpus. We further introduce explanation to promote transparency and trustworthiness. GreenLLaMA additionally offers a unique paraphrase detector especially dedicated for the detoxification task to tackle the non-detoxifiable cases. Through experimental analysis, we demonstrate the effectiveness of our cross-platform corpus and the robustness of GreenLLaMA against adversarial toxicity.
摘要：先前关于解毒的工作是分散的，因为它们没有涵盖现实世界场景中所需的解毒的所有方面。值得注意的是，先前的工作将开发解毒模型的任务限制为仅在可见的平台子集上进行，从而留下了模型如何在未见的平台上执行的问题。此外，这些作品并没有解决不可解毒性问题，即有毒文本无法在不改变含义的情况下被解毒的现象。我们提出了 GreenLLaMA，这是第一个全面的端到端解毒框架，它试图减轻上述限制。我们首先介绍一个跨平台伪并行语料库，该语料库应用利用 ChatGPT 的多步骤数据处理和生成策略。然后，我们使用跨平台语料库训练一套解毒模型。我们表明，我们的解毒模型优于使用人类注释的平行语料库训练的 SoTA 模型。我们进一步引入解释以提高透明度和可信度。 GreenLLaMA 还提供了一个独特的释义检测器，专门用于解毒任务，以解决不可解毒的情况。通过实验分析，我们证明了跨平台语料库的有效性以及 GreenLLaMA 针对对抗性毒性的稳健性。

Title: Budget-Constrained Tool Learning with Planning

Authors: Yuanhang Zheng, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Subjects: cs.AI
Abstract URL: https://arxiv.org/abs/2402.15960
Pdf URL: https://arxiv.org/pdf/2402.15960
Copy Paste: [[2402.15960]] Budget-Constrained Tool Learning with Planning(https://arxiv.org/abs/2402.15960)
Keywords: language model
Abstract: Despite intensive efforts devoted to tool learning, the problem of budget-constrained tool learning, which focuses on resolving user queries within a specific budget constraint, has been widely overlooked. This paper proposes a novel method for budget-constrained tool learning. Our approach involves creating a preferable plan under the budget constraint before utilizing the tools. This plan outlines the feasible tools and the maximum number of times they can be employed, offering a comprehensive overview of the tool learning process for large language models. This allows them to allocate the budget from a broader perspective. To devise the plan without incurring significant extra costs, we suggest initially estimating the usefulness of the candidate tools based on past experience. Subsequently, we employ dynamic programming to formulate the plan. Experimental results demonstrate that our method can be integrated with various tool learning methods, significantly enhancing their effectiveness under strict budget constraints.
摘要：尽管在工具学习方面付出了巨大的努力，但预算受限的工具学习问题（其重点是在特定预算约束内解决用户查询）却被广泛忽视。本文提出了一种预算受限工具学习的新方法。我们的方法包括在使用这些工具之前在预算限制下制定更好的计划。该计划概述了可行的工具以及它们可以使用的最大次数，全面概述了大型语言模型的工具学习过程。这使他们能够从更广泛的角度分配预算。为了在不产生大量额外成本的情况下制定计划，我们建议根据过去的经验初步估计候选工具的有用性。随后，我们采用动态规划来制定计划。实验结果表明，我们的方法可以与各种工具学习方法集成，在严格的预算限制下显着提高其有效性。
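The planning step described above (estimate each tool's usefulness, then use dynamic programming to decide how many times each tool may be called) resembles a bounded knapsack problem. The sketch below is an illustration under that assumption and is not the authors' formulation: each call to a tool has a fixed integer cost and a fixed estimated value, and `plan_tool_calls` is a hypothetical helper name.

```python
def plan_tool_calls(tools, budget):
    """Bounded-knapsack plan over tools.

    tools:  list of (name, cost_per_call, value_per_call, max_calls),
            with positive integer costs (an assumption of this sketch).
    budget: non-negative integer budget.
    Returns (best_total_value, {tool_name: number_of_calls}).
    """
    # best[b] holds the best (value, plan) achievable with budget at most b.
    best = [(0.0, {}) for _ in range(budget + 1)]
    for name, cost, value, max_calls in tools:
        new_best = list(best)
        for b in range(budget + 1):
            for k in range(1, max_calls + 1):
                spent = k * cost
                if spent > b:
                    break
                base_value, base_plan = best[b - spent]
                candidate = base_value + k * value
                if candidate > new_best[b][0]:
                    new_best[b] = (candidate, {**base_plan, name: k})
        best = new_best
    return best[budget]
```

The returned plan lists, per tool, the maximum number of calls to allow, which matches the abstract's description of a plan that "outlines the feasible tools and the maximum number of times they can be employed".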

Title: Likelihood-based Mitigation of Evaluation Bias in Large Language Models

Authors: Masanari Ohi, Masahiro Kaneko, Ryuto Koike, Mengsay Loem, Naoaki Okazaki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15987
Pdf URL: https://arxiv.org/pdf/2402.15987
Copy Paste: [[2402.15987]] Likelihood-based Mitigation of Evaluation Bias in Large Language Models(https://arxiv.org/abs/2402.15987)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics. However, the likelihood, a measure of LLM's plausibility for a sentence, can vary due to superficial differences in sentences, such as word order and sentence structure. It is therefore possible that there might be a likelihood bias if LLMs are used for evaluation: they might overrate sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate the likelihood bias. Our method utilizes highly biased instances as few-shot examples for in-context learning. Our experiments in evaluating the data-to-text and grammatical error correction tasks reveal that several LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias, also improving evaluation performance (in terms of correlation of models with human scores) significantly.
摘要：大型语言模型 (LLM) 广泛用于评估自然语言生成任务作为自动化指标。然而，可能性（衡量 LLM 对句子的合理性的衡量标准）可能会因句子的表面差异（例如词序和句子结构）而有所不同。因此，如果使用法学硕士进行评估，可能会出现似然偏差：他们可能会高估可能性较高的句子，而低估可能性较低的句子。在本文中，我们研究了基于法学硕士的评估者中似然偏差的存在及其影响。我们还提出了一种减轻似然偏差的方法。我们的方法利用高度偏差的实例作为上下文学习的少量示例。我们评估数据到文本和语法错误纠正任务的实验表明，我们测试的几个法学硕士表现出似然偏差。此外，我们提出的方法成功地减轻了这种偏差，还显着提高了评估性能（就模型与人类分数的相关性而言）。

Title: PIDformer: Transformer Meets Control Theory

Authors: Tam Nguyen, César A. Uribe, Tan M. Nguyen, Richard G. Baraniuk
Subjects: cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2402.15989
Pdf URL: https://arxiv.org/pdf/2402.15989
Copy Paste: [[2402.15989]] PIDformer: Transformer Meets Control Theory(https://arxiv.org/abs/2402.15989)
Keywords: language model
Abstract: In this work, we address two main shortcomings of transformer architectures: input corruption and rank collapse in their output representation. We unveil self-attention as an autonomous state-space model that inherently promotes smoothness in its solutions, leading to lower-rank outputs and diminished representation capacity. Moreover, the steady-state solution of the model is sensitive to input perturbations. We incorporate a Proportional-Integral-Derivative (PID) closed-loop feedback control system with a reference point into the model to improve robustness and representation capacity. This integration aims to preserve high-frequency details while bolstering model stability, rendering it more noise-resilient. The resulting controlled state-space model is theoretically proven robust and adept at addressing the rank collapse. Motivated by this control framework, we derive a novel class of transformers, PID-controlled Transformer (PIDformer), aimed at improving robustness and mitigating the rank-collapse issue inherent in softmax transformers. We empirically evaluate the model for advantages and robustness against baseline transformers across various practical tasks, including object classification, image segmentation, and language modeling.
摘要：在这项工作中，我们解决了 Transformer 架构的两个主要缺点：输入损坏和输出表示中的等级崩溃。我们将自注意力作为一种自治的状态空间模型推出，它本质上促进了其解决方案的平滑性，从而导致较低等级的输出和减少的表示能力。此外，模型的稳态解对输入扰动敏感。我们将带有参考点的比例积分微分 (PID) 闭环反馈控制系统纳入模型中，以提高鲁棒性和表示能力。这种集成旨在保留高频细节，同时增强模型稳定性，使其更具抗噪能力。由此产生的受控状态空间模型在理论上被证明是稳健的并且擅长解决秩崩溃。受此控制框架的启发，我们推导了一类新型变压器，PID控制变压器（PIDformer），旨在提高鲁棒性并减轻softmax变压器固有的等级崩溃问题。我们根据经验评估该模型在各种实际任务中相对于基线变换器的优势和鲁棒性，包括对象分类、图像分割和语言建模。
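For readers unfamiliar with the control-theory side, a generic discrete PID controller can be sketched as follows. This is only the textbook control law (proportional, integral, and derivative terms acting on the error to a reference point); it does not show how PIDformer embeds such a controller into self-attention, and the class name and gains are hypothetical.

```python
class PIDController:
    """Textbook discrete PID controller tracking a fixed setpoint."""

    def __init__(self, kp, ki, kd, setpoint, dt=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.dt = dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement):
        """Return the control signal for the current measurement."""
        error = self.setpoint - measurement
        self.integral += error * self.dt
        # No derivative term on the first step, since there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```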

Title: $C^3$: Confidence Calibration Model Cascade for Inference-Efficient Cross-Lingual Natural Language Understanding

Authors: Taixi Lu, Haoyu Wang, Huajie Shao, Jing Gao, Huaxiu Yao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15991
Pdf URL: https://arxiv.org/pdf/2402.15991
Copy Paste: [[2402.15991]] $C^3$: Confidence Calibration Model Cascade for Inference-Efficient Cross-Lingual Natural Language Understanding(https://arxiv.org/abs/2402.15991)
Keywords: language model
Abstract: Cross-lingual natural language understanding (NLU) is a critical task in natural language processing (NLP). Recent advancements have seen multilingual pre-trained language models (mPLMs) significantly enhance the performance of these tasks. However, mPLMs necessitate substantial resources and incur high computational costs during inference, posing challenges for deployment in real-world and real-time systems. Existing model cascade methods seek to enhance inference efficiency by greedily selecting the lightest model capable of processing the current input from a variety of models, based on model confidence scores. Nonetheless, deep models tend to exhibit overconfidence, and confidence distributions vary across languages. This leads to the emission of confident but incorrect predictions by smaller models, hindering their ability to generalize effectively across test languages. In this study, we introduce a confidence calibration model cascade ($C^3$) method. This approach, simple yet effective, involves calibration prior to cascade inference, thereby enhancing cascade accuracy through more reliable predictions. Extensive experiments conducted on three cross-lingual benchmarks demonstrate that $C^3$ significantly outperforms all state-of-the-art baselines.
摘要：跨语言自然语言理解（NLU）是自然语言处理（NLP）中的一项关键任务。最近的进展表明，多语言预训练语言模型 (mPLM) 显着提高了这些任务的性能。然而，mPLM 需要大量资源，并且在推理过程中会产生高昂的计算成本，这给现实世界和实时系统中的部署带来了挑战。现有的模型级联方法寻求通过基于模型置信度分数贪婪地选择能够处理来自各种模型的当前输入的最轻模型来提高推理效率。尽管如此，深度模型往往表现出过度自信，并且不同语言的置信度分布各不相同。这导致较小的模型发出自信但不正确的预测，从而阻碍了它们跨测试语言有效泛化的能力。在本研究中，我们引入了置信度校准模型级联（$C^3$）方法。这种方法简单而有效，涉及级联推理之前的校准，从而通过更可靠的预测来提高级联准确性。对三个跨语言基准进行的广泛实验表明，$C^3$ 显着优于所有最先进的基准。
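The calibrate-then-cascade idea above can be sketched in a few lines. Temperature scaling is used here as an assumed stand-in for the calibration step (the abstract does not say which calibration method is used), and `cascade_predict`, the per-model temperatures, and the threshold are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(x, models, threshold=0.9):
    """Run models from lightest to heaviest until one is confident enough.

    models: list of (predict_logits, temperature) pairs, ordered light to
            heavy; predict_logits(x) returns a list of class logits.
    Returns (predicted_class_index, index_of_model_used).
    """
    if not models:
        raise ValueError("need at least one model")
    for idx, (predict_logits, temperature) in enumerate(models):
        probs = softmax(predict_logits(x), temperature)
        confidence = max(probs)
        is_last = idx == len(models) - 1
        if confidence >= threshold or is_last:
            return probs.index(confidence), idx
```

With a temperature above 1 the small model's overconfident logits are flattened, so borderline inputs fall through to the larger model instead of being answered confidently but incorrectly.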

Title: From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings

Authors: Hao Wang, Hao Li, Minlie Huang, Lei Sha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16006
Pdf URL: https://arxiv.org/pdf/2402.16006
Copy Paste: [[2402.16006]] From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings(https://arxiv.org/abs/2402.16006)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The safety defense methods of large language models (LLMs) remain limited because the dangerous prompts are manually curated for just a few known attack types, which fails to keep pace with emerging varieties. Recent studies found that attaching suffixes to harmful instructions can hack the defense of LLMs and lead to dangerous outputs. This method, while effective, leaves a gap in understanding the underlying mechanics of such adversarial suffixes because they are unreadable, and it can be detected relatively easily by common defense methods such as perplexity filters. To cope with this challenge, in this paper we propose an Adversarial Suffixes Embedding Translation Framework (ASETF) that is able to translate unreadable adversarial suffixes into coherent, readable text, which makes it easier to understand and analyze the reasons behind harmful content generation by large language models. We conducted experiments on LLMs such as LLaMa2 and Vicuna, using the harmful instructions from the Advbench dataset. The results indicate that our method achieves a much better attack success rate than existing techniques, while significantly enhancing the textual fluency of the prompts. In addition, our approach can be generalized into a broader method for generating transferable adversarial suffixes that can successfully attack multiple LLMs, even black-box LLMs such as ChatGPT and Gemini. As a result, the prompts generated through our method exhibit enriched semantic diversity, which potentially provides more adversarial examples for LLM defense methods.
摘要：大型语言模型（LLM）的安全防御方法仍然有限，因为危险提示是针对少数已知的攻击类型手动策划的，无法跟上新兴品种的步伐。最近的研究发现，在有害指令上附加后缀可能会破坏法学硕士的防御并导致危险的输出。这种方法虽然有效，但由于不可读性，在理解此类对抗性后缀的底层机制方面留下了空白，并且可以通过常见的防御方法（例如困惑过滤器）相对容易地识破它。为了应对这一挑战，本文提出了，我们提出了一种对抗性后缀嵌入翻译框架（ASEF），能够将不可读的对抗性后缀翻译成连贯、可读的文本，这使得更容易理解和分析大型语言模型生成有害内容背后的原因。我们对 LLaMa2、Vicuna 等 LLM 进行了实验，并使用 Advbench 数据集的有害指令。结果表明，我们的方法比现有技术实现了更好的攻击成功率，同时显着增强了提示的文本流畅性。此外，我们的方法可以推广为更广泛的方法，用于生成可转移的对抗性后缀，可以成功攻击多个 LLM，甚至是黑盒 LLM，例如 ChatGPT 和 Gemini。因此，通过我们的方法生成的提示表现出丰富的语义多样性，这可能为 LLM 防御方法提供更多的对抗性示例。

Title: HiGPT: Heterogeneous Graph Language Model

Authors: Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Long Xia, Dawei Yin, Chao Huang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16024
Pdf URL: https://arxiv.org/pdf/2402.16024
Copy Paste: [[2402.16024]] HiGPT: Heterogeneous Graph Language Model(https://arxiv.org/abs/2402.16024)
Keywords: language model, gpt
Abstract: Heterogeneous graph learning aims to capture complex relationships and diverse relational semantics among entities in a heterogeneous graph to obtain meaningful representations for nodes and edges. Recent advancements in heterogeneous graph neural networks (HGNNs) have achieved state-of-the-art performance by considering relation heterogeneity and using specialized message functions and aggregation rules. However, existing frameworks for heterogeneous graph learning have limitations in generalizing across diverse heterogeneous graph datasets. Most of these frameworks follow the "pre-train" and "fine-tune" paradigm on the same dataset, which restricts their capacity to adapt to new and unseen data. This raises the question: "Can we generalize heterogeneous graph models to be well-adapted to diverse downstream learning tasks with distribution shifts in both node token sets and relation type heterogeneity?" To tackle those challenges, we propose HiGPT, a general large graph model with a heterogeneous graph instruction-tuning paradigm. Our framework enables learning from arbitrary heterogeneous graphs without the need for any fine-tuning process from downstream datasets. To handle distribution shifts in heterogeneity, we introduce an in-context heterogeneous graph tokenizer that captures semantic relationships in different heterogeneous graphs, facilitating model adaptation. We incorporate a large corpus of heterogeneity-aware graph instructions into our HiGPT, enabling the model to effectively comprehend complex relation heterogeneity and distinguish between various types of graph tokens. Furthermore, we introduce the Mixture-of-Thought (MoT) instruction augmentation paradigm to mitigate data scarcity by generating diverse and informative instructions. Through comprehensive evaluations, our proposed framework demonstrates exceptional generalization performance.
摘要：异构图学习旨在捕获异构图中实体之间的复杂关系和多样化的关系语义，以获得节点和边的有意义的表示。异构图神经网络（HGNN）的最新进展通过考虑关系异构性并使用专门的消息函数和聚合规则，实现了最先进的性能。然而，现有的异构图学习框架在泛化不同异构图数据集方面存在局限性。这些框架大多数都遵循同一数据集上的“预训练”和“微调”范例，这限制了它们适应新的和未见过的数据的能力。这就提出了一个问题：“我们能否泛化异构图模型，使其能够很好地适应节点标记集和关系类型异构性的分布变化的不同下游学习任务？”为了应对这些挑战，我们提出了 HiGPT，一种通用的大图具有异构图指令调整范式的模型。我们的框架可以从任意异构图进行学习，而不需要对下游数据集进行任何微调过程。为了处理异构性中的分布变化，我们引入了一个上下文中的异构图标记器来捕获语义关系在不同的异构图中，促进模型适应。我们将大量异构感知图指令合并到我们的 HiGPT 中，使模型能够有效理解复杂的关系异构性并区分各种类型的图标记。此外，我们引入了 Mixture-of -思想（MoT）指令增强范例，通过生成多样化且信息丰富的指令来缓解数据稀缺性。通过综合评估，我们提出的框架在泛化性能方面表现出了卓越的性能。

Title: GraphWiz: An Instruction-Following Language Model for Graph Problems

Authors: Nuo Chen, Yuhan Li, Jianheng Tang, Jia Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16029
Pdf URL: https://arxiv.org/pdf/2402.16029
Copy Paste: [[2402.16029]] GraphWiz: An Instruction-Following Language Model for Graph Problems(https://arxiv.org/abs/2402.16029)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have achieved impressive success across several fields, but their proficiency in understanding and resolving complex graph problems is less explored. To bridge this gap, we introduce GraphInstruct, a novel and comprehensive instruction-tuning dataset designed to equip language models with the ability to tackle a broad spectrum of graph problems using explicit reasoning paths. Utilizing GraphInstruct, we build GraphWiz, an open-source language model capable of resolving various graph problem types while generating clear reasoning processes. To enhance the model's capability and reliability, we incorporate the Direct Preference Optimization (DPO) framework into the graph problem-solving context. The enhanced model, GraphWiz-DPO, achieves an average accuracy of 65% across nine tasks with different complexity levels, surpassing GPT-4 which has an average accuracy of 43.8%. Moreover, our research delves into the delicate balance between training data volume and model performance, highlighting the potential for overfitting with increased data. We also explore the transferability of the model's reasoning ability across different graph tasks, indicating the model's adaptability and practical application potential. Our investigation offers a new blueprint and valuable insights for developing LLMs specialized in graph reasoning and problem-solving.
摘要：大型语言模型（LLM）在多个领域取得了令人瞩目的成功，但它们在理解和解决复杂图形问题方面的能力却很少被探索。为了弥补这一差距，我们引入了 GraphInstruct，这是一种新颖且全面的指令调整数据集，旨在使语言模型能够使用显式推理路径解决广泛的图形问题。利用 GraphInstruct，我们构建了 GraphWiz，这是一种开源语言模型，能够解决各种图形问题类型，同时生成清晰的推理过程。为了增强模型的能力和可靠性，我们将直接偏好优化（DPO）框架纳入图问题解决环境中。增强模型 GraphWiz-DPO 在不同复杂程度的 9 个任务中实现了 65% 的平均准确率，超过了平均准确率 43.8% 的 GPT-4。此外，我们的研究深入探讨了训练数据量和模型性能之间的微妙平衡，强调了数据增加导致过度拟合的可能性。我们还探索了模型推理能力在不同图任务之间的可迁移性，表明了模型的适应性和实际应用潜力。我们的调查为培养专门从事图形推理和问题解决的法学硕士提供了新的蓝图和宝贵的见解。

Title: Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration

Authors: Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Anh Tuan Luu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16030
Pdf URL: https://arxiv.org/pdf/2402.16030
Copy Paste: [[2402.16030]] Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration(https://arxiv.org/abs/2402.16030)
Keywords: language model, llm
Abstract: While Reinforcement Learning from Human Feedback (RLHF) significantly enhances the generation quality of Large Language Models (LLMs), recent studies have raised concerns regarding the complexity and instability associated with the Proximal Policy Optimization (PPO) algorithm, proposing a series of order-based calibration methods as viable alternatives. This paper delves further into current order-based methods, examining their inefficiencies in utilizing reward values and addressing misalignment issues. Building upon these findings, we propose a novel Value-based CaliBration (VCB) method to better align LLMs with human preferences. Experimental results demonstrate that VCB surpasses existing alignment methods on AI assistant and summarization datasets, providing impressive generalizability, robustness, and stability in diverse settings.
摘要：虽然人类反馈强化学习（RLHF）显着提高了大型语言模型（LLM）的生成质量，但最近的研究引起了人们对与近端策略优化（PPO）算法相关的复杂性和不稳定性的担忧，提出了一系列基于顺序的算法校准方法作为可行的替代方案。本文进一步深入研究了当前基于订单的方法，检查了它们在利用奖励值和解决错位问题方面的低效率。基于这些发现，我们提出了一种新颖的基于 \textbf{V}alue 的 \textbf{C}ali\textbf{B}ration (VCB) 方法，以更好地使法学硕士与人类偏好保持一致。实验结果表明，VCB 超越了人工智能助手和摘要数据集上的现有对齐方法，在不同设置中提供了令人印象深刻的通用性、鲁棒性和稳定性。

Title: Text Understanding and Generation Using Transformer Models for Intelligent E-commerce Recommendations

Authors: Yafei Xiang, Hanyi Yu, Yulu Gong, Shuning Huo, Mengran Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16035
Pdf URL: https://arxiv.org/pdf/2402.16035
Copy Paste: [[2402.16035]] Text Understanding and Generation Using Transformer Models for Intelligent E-commerce Recommendations(https://arxiv.org/abs/2402.16035)
Keywords: language model, llm
Abstract: With the rapid development of artificial intelligence technology, Transformer-based pre-trained models have become an important tool for large language model (LLM) tasks. In the field of e-commerce, these models are especially widely used, from text understanding to recommendation generation, and they provide powerful technical support for improving user experience and optimizing service processes. This paper reviews the core application scenarios of Transformer pre-trained models in e-commerce text understanding and recommendation generation, including but not limited to automatic generation of product descriptions, sentiment analysis of user comments, construction of personalized recommendation systems, and automated processing of customer service conversations. Through a detailed analysis of the models' working principles, implementation process, and application effects in specific cases, this paper emphasizes the unique advantages of pre-trained models in understanding complex user intentions and improving the quality of recommendations. In addition, future challenges and directions for improvement are discussed, such as how to further improve the generalization ability of the models, their ability to handle large-scale data sets, and technical strategies to protect user privacy. Ultimately, the paper points out that the application of Transformer pre-trained models in e-commerce has not only driven technological innovation but also brought substantial benefits to merchants and consumers, and that these models will continue to play a key role in e-commerce and beyond.
摘要：随着人工智能技术的快速发展，Transformer结构预训练模型已成为大型语言模型（LLM）任务的重要工具。在电子商务领域，这些模型的应用尤其广泛，从文本理解到生成推荐系统，为改善用户体验、优化服务流程提供了强大的技术支撑。本文回顾了Transformer预训练模型在电商文本理解和推荐生成中的核心应用场景，包括但不限于产品描述的自动生成、用户评论的情感分析、个性化推荐系统的构建以及客户的自动化处理服务对话。通过对模型的工作原理、实现过程以及具体案例的应用效果进行详细分析，强调了预训练模型在理解复杂用户意图、提高推荐质量方面的独特优势。此外，还讨论了未来的挑战和改进方向，例如如何进一步提高模型的泛化能力、处理大规模数据集的能力以及保护用户隐私的技术策略。论文最终指出，Transformer结构预训练模型在电商中的应用不仅带动了技术创新，也给商家和消费者带来了实质性的收益，展望未来，这些模型将继续发挥关键作用在电子商务及其他领域。

Title: Deep Learning Approaches for Improving Question Answering Systems in Hepatocellular Carcinoma Research

Authors: Shuning Huo, Yafei Xiang, Hanyi Yu, Mengran Zhu, Yulu Gong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16038
Pdf URL: https://arxiv.org/pdf/2402.16038
Copy Paste: [[2402.16038]] Deep Learning Approaches for Improving Question Answering Systems in Hepatocellular Carcinoma Research(https://arxiv.org/abs/2402.16038)
Keywords: gpt
Abstract: In recent years, advancements in natural language processing (NLP) have been fueled by deep learning techniques, particularly through the utilization of powerful computing resources like GPUs and TPUs. Models such as BERT and GPT-3, trained on vast amounts of data, have revolutionized language understanding and generation. These pre-trained models serve as robust bases for various tasks including semantic understanding, intelligent writing, and reasoning, paving the way for a more generalized form of artificial intelligence. NLP, as a vital application of AI, aims to bridge the gap between humans and computers through natural language interaction. This paper delves into the current landscape and future prospects of large-scale model-based NLP, focusing on the question-answering systems within this domain. Practical cases and developments in artificial intelligence-driven question-answering systems are analyzed to foster further exploration and research in the realm of large-scale NLP.
摘要：近年来，深度学习技术推动了自然语言处理 (NLP) 的进步，特别是通过利用 GPU 和 TPU 等强大的计算资源。 BERT 和 GPT-3 等模型经过大量数据训练，彻底改变了语言理解和生成。这些预先训练的模型为语义理解、智能写作和推理等各种任务提供了坚实的基础，为更通用的人工智能形式铺平了道路。 NLP作为人工智能的重要应用，旨在通过自然语言交互来弥合人类与计算机之间的鸿沟。本文深入探讨了基于大规模模型的 NLP 的现状和未来前景，重点关注该领域内的问答系统。分析人工智能驱动的问答系统的实际案例和发展，以促进大规模自然语言处理领域的进一步探索和研究。

Title: EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings

Authors: Sunjun Kweon, Jiyoun Kim, Heeyoung Kwak, Dongchul Cha, Hangyul Yoon, Kwanghyun Kim, Seunghyun Won, Edward Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16040
Pdf URL: https://arxiv.org/pdf/2402.16040
Copy Paste: [[2402.16040]] EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings(https://arxiv.org/abs/2402.16040)
Keywords: language model, llm
Abstract: This study introduces EHRNoteQA, a novel patient-specific question answering benchmark tailored for evaluating Large Language Models (LLMs) in clinical environments. Based on MIMIC-IV Electronic Health Record (EHR), a team of three medical professionals has curated the dataset comprising 962 unique questions, each linked to a specific patient's EHR clinical notes. What makes EHRNoteQA distinct from existing EHR-based benchmarks is as follows: Firstly, it is the first dataset to adopt a multi-choice question answering format, a design choice that effectively evaluates LLMs with reliable scores in the context of automatic evaluation, compared to other formats. Secondly, it requires an analysis of multiple clinical notes to answer a single question, reflecting the complex nature of real-world clinical decision-making where clinicians review extensive records of patient histories. Our comprehensive evaluation on various large language models showed that their scores on EHRNoteQA correlate more closely with their performance in addressing real-world medical questions evaluated by clinicians than their scores from other LLM benchmarks. This underscores the significance of EHRNoteQA in evaluating LLMs for medical applications and highlights its crucial role in facilitating the integration of LLMs into healthcare systems. The dataset will be made available to the public under PhysioNet credential access, promoting further research in this vital field.
摘要：本研究引入了 EHRNoteQA，这是一种新颖的针对患者的问答基准，专为评估临床环境中的大型语言模型 (LLM) 而设计。基于 MIMIC-IV 电子健康记录 (EHR)，由三名医疗专业人员组成的团队整理了包含 962 个独特问题的数据集，每个问题都与特定患者的 EHR 临床记录相关联。 EHRNoteQA 与现有基于 EHR 的基准的区别如下：首先，它是第一个采用多选问答格式的数据集，这种设计选择可以在自动评估的背景下有效地评估具有可靠分数的 LLM，相比之下，其他格式。其次，它需要分析多个临床记录才能回答一个问题，反映了现实世界临床决策的复杂性，其中临床医生审查了患者病史的大量记录。我们对各种大型语言模型的综合评估表明，与其他法学硕士基准的分数相比，他们在 EHRNoteQA 上的分数与解决临床医生评估的现实世界医学问题的表现更密切相关。这强调了 EHRNoteQA 在评估法学硕士医疗应用方面的重要性，并强调了其在促进法学硕士融入医疗保健系统方面的关键作用。该数据集将通过 PhysioNet 凭证访问向公众开放，促进这一重要领域的进一步研究。

Title: Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy

Authors: Shuhai Zhang, Feng Liu, Jiahao Yang, Yifan Yang, Changsheng Li, Bo Han, Mingkui Tan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16041
Pdf URL: https://arxiv.org/pdf/2402.16041
Copy Paste: [[2402.16041]] Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy(https://arxiv.org/abs/2402.16041)
Keywords: language model, gpt, llm, hallucination, chat
Abstract: Large language models (LLMs) such as ChatGPT have exhibited remarkable performance in generating human-like texts. However, machine-generated texts (MGTs) may carry critical risks, such as plagiarism issues, misleading information, or hallucination issues. Therefore, it is urgent and important to detect MGTs in many situations. Unfortunately, it is challenging to distinguish MGTs and human-written texts because the distributional discrepancy between them is often very subtle due to the remarkable performance of LLMs. In this paper, we seek to exploit maximum mean discrepancy (MMD) to address this issue in the sense that MMD can well identify distributional discrepancies. However, directly training a detector with MMD using diverse MGTs will incur a significantly increased variance of MMD since MGTs may contain multiple text populations due to various LLMs. This will severely impair MMD's ability to measure the difference between two samples. To tackle this, we propose a novel multi-population aware optimization method for MMD called MMD-MP, which can avoid variance increases and thus improve the stability to measure the distributional discrepancy. Relying on MMD-MP, we develop two methods for paragraph-based and sentence-based detection, respectively. Extensive experiments on various LLMs, e.g., GPT2 and ChatGPT, show superior detection performance of our MMD-MP. The source code is available at https://github.com/ZSHsh98/MMD-MP.
摘要：ChatGPT 等大型语言模型 (LLM) 在生成类人文本方面表现出了卓越的性能。然而，机器生成文本（MGT）可能会带来严重风险，例如抄袭问题、误导性信息或幻觉问题。因此，在许多情况下检测MGT是非常紧迫和重要的。不幸的是，区分 MGT 和人类编写的文本具有挑战性，因为由于法学硕士的卓越表现，它们之间的分布差异通常非常微妙。在本文中，我们寻求利用 \textit{最大平均差异} (MMD) 来解决这个问题，因为 MMD 可以很好地识别分布差异。然而，使用不同的 MGT 直接训练 MMD 检测器将导致 MMD 的方差显着增加，因为由于各种 LLM，MGT 可能包含 \textit{多个文本群体}。这将严重削弱 MMD 测量两个样本之间差异的能力。为了解决这个问题，我们提出了一种新的 MMD 感知优化方法，称为 MMD-MP，它可以避免方差增加，从而提高测量分布差异的稳定性。依靠MMD-MP，我们分别开发了两种基于段落和基于句子的检测方法。对各种 LLM（例如 GPT2 和 ChatGPT）的广泛实验显示了我们的 MMD-MP 的卓越检测性能。源代码可在 \url{https://github.com/ZSHsh98/MMD-MP} 获取。
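As background for the abstract above, the plain (biased) estimate of squared MMD with a Gaussian kernel can be sketched as below. This is the standard two-sample statistic only, not the multi-population MMD-MP variant the paper proposes; the inputs are assumed to be fixed-length feature vectors (for example, text embeddings), and the function names and bandwidth are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    squared_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The statistic is close to zero when both samples come from the same distribution and grows as the two sets of features separate.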

Title: LLMs with Chain-of-Thought Are Non-Causal Reasoners

Authors: Guangsheng Bao, Hongbo Zhang, Linyi Yang, Cunxiang Wang, Yue Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16048
Pdf URL: https://arxiv.org/pdf/2402.16048
Copy Paste: [[2402.16048]] LLMs with Chain-of-Thought Are Non-Causal Reasoners(https://arxiv.org/abs/2402.16048)
Keywords: language model, llm, chain-of-thought
Abstract: This paper explores the role of the Chain of Thought (CoT) in Large Language Models (LLMs) reasoning. Despite its potential to improve task performance, our analysis reveals a surprising frequency of correct answers following incorrect CoTs and vice versa. We employ causal analysis to assess the cause-effect relationship between CoTs/instructions and answers in LLMs, uncovering the Structural Causal Model (SCM) that LLMs approximate. By comparing the implied SCM with that of human reasoning, we highlight discrepancies between LLM and human reasoning processes. We further examine the factors influencing the causal structure of the implied SCM, revealing that in-context learning, supervised fine-tuning, and reinforcement learning on human feedback significantly impact the causal relations. We release the code and results at https://github.com/StevenZHB/CoT_Causal_Analysis.
摘要：本文探讨了思想链 (CoT) 在大型语言模型 (LLM) 推理中的作用。尽管它具有提高任务绩效的潜力，但我们的分析显示，在不正确的 CoT 后正确答案的出现频率令人惊讶，反之亦然。我们采用因果分析来评估法学硕士中的 CoT/指令与答案之间的因果关系，揭示法学硕士近似的结构因果模型 (SCM)。通过将隐含的 SCM 与人类推理的 SCM 进行比较，我们强调了 LLM 与人类推理过程之间的差异。我们进一步研究了影响隐含 SCM 因果结构的因素，揭示了上下文学习、监督微调和人类反馈的强化学习显着影响了因果关系。我们在 https://github.com/StevenZHB/CoT_Causal_Analysis 发布了代码和结果。

Title: Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression

Authors: Xinze Li, Zhenghao Liu, Chenyan Xiong, Shi Yu, Yukun Yan, Shuo Wang, Ge Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16058
Pdf URL: https://arxiv.org/pdf/2402.16058
Copy Paste: [[2402.16058]] Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression(https://arxiv.org/abs/2402.16058)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) require lengthy prompts as the input context to produce output aligned with user intentions, a process that incurs extra costs during inference. In this paper, we propose the Gist COnditioned deCOding (Gist-COCO) model, introducing a novel method for compressing prompts which can also assist prompt interpretation and engineering. Gist-COCO employs an encoder-decoder based language model and then incorporates an additional encoder as a plugin module to compress prompts with inputs using gist tokens. It finetunes the compression plugin module and uses the representations of gist tokens to emulate the raw prompts in the vanilla language model. By verbalizing the representations of gist tokens into gist prompts, the compression ability of Gist-COCO can be generalized to different LLMs with high compression rates. Our experiments demonstrate that Gist-COCO outperforms previous prompt compression models in both passage and instruction compression tasks. Further analysis on gist verbalization results suggests that our gist prompts serve different functions in aiding language models. They may directly provide potential answers, generate the chain-of-thought, or simply repeat the inputs. All data and code are available at https://github.com/OpenMatch/Gist-COCO.
摘要：大型语言模型 (LLM) 需要冗长的提示作为输入上下文来生成与用户意图一致的输出，这一过程会在推理过程中产生额外的成本。在本文中，我们提出了 Gist CONditioned deCOding (Gist-COCO) 模型，引入了一种压缩提示的新方法，该方法也可以辅助提示解释和工程。 Gist-COCO 采用基于编码器-解码器的语言模型，然后将额外的编码器合并为插件模块，以使用要点标记压缩带有输入的提示。它对压缩插件模块进行了微调，并使用要点标记的表示来模拟普通语言模型中的原始提示。通过将 gist token 的表示形式语言化为 gist 提示，Gist-COCO 的压缩能力可以推广到具有高压缩率的不同 LLM。我们的实验表明，Gist-COCO 在段落和指令压缩任务中都优于以前的即时压缩模型。对要点言语化结果的进一步分析表明，我们的要点提示在帮助语言模型方面发挥着不同的作用。他们可以直接提供潜在的答案，生成思路，或者简单地重复输入。所有数据和代码均可在 https://github.com/OpenMatch/Gist-COCO 获取。

Title: How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study

Authors: Tianjie Ju, Weiwei Sun, Wei Du, Xinwei Yuan, Zhaochun Ren, Gongshen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16061
Pdf URL: https://arxiv.org/pdf/2402.16061
Copy Paste: [[2402.16061]] How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study(https://arxiv.org/abs/2402.16061)
Keywords: language model, gpt, llm, chat
Abstract: Previous work has showcased the intriguing capability of large language models (LLMs) in retrieving facts and processing context knowledge. However, only limited research exists on the layer-wise capability of LLMs to encode knowledge, which challenges our understanding of their internal mechanisms. In this paper, we make a first attempt to investigate the layer-wise capability of LLMs through probing tasks. We leverage the powerful generative capability of ChatGPT to construct probing datasets, providing diverse and coherent evidence corresponding to various facts. We employ $\mathcal V$-usable information as the validation metric to better reflect the capability in encoding context knowledge across different layers. Our experiments on conflicting and newly acquired knowledge show that LLMs: (1) prefer to encode more context knowledge in the upper layers; (2) primarily encode context knowledge within knowledge-related entity tokens at lower layers while progressively expanding more knowledge within other tokens at upper layers; and (3) gradually forget the earlier context knowledge retained within the intermediate layers when provided with irrelevant evidence. Code is publicly available at https://github.com/Jometeorie/probing_llama.

Title: Citation-Enhanced Generation for LLM-based Chatbot

Authors: Weitao Li, Junkai Li, Weizhi Ma, Yang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16063
Pdf URL: https://arxiv.org/pdf/2402.16063
Copy Paste: [[2402.16063]] Citation-Enhanced Generation for LLM-based Chatbot(https://arxiv.org/abs/2402.16063)
Keywords: language model, llm, hallucination, chat, retrieval augmented generation
Abstract: Large language models (LLMs) exhibit powerful general intelligence across diverse scenarios, including their integration into chatbots. However, a vital challenge of LLM-based chatbots is that they may produce hallucinated content in responses, which significantly limits their applicability. Various efforts have been made to alleviate hallucination, such as retrieval augmented generation and reinforcement learning with human feedback, but most of them require additional training and data annotation. In this paper, we propose a novel post-hoc Citation-Enhanced Generation (CEG) approach combined with retrieval augmentation. Unlike previous studies that focus on preventing hallucinations during generation, our method addresses this issue in a post-hoc way. It incorporates a retrieval module to search for supporting documents relevant to the generated content, and employs a natural language inference-based citation generation module. When statements in the generated content lack references, our model can regenerate responses until all statements are supported by citations. Note that our method is a training-free plug-and-play plugin that is compatible with various LLMs. Experiments on three hallucination-related benchmarks show that our framework outperforms state-of-the-art methods in both hallucination detection and response regeneration. Our code and dataset will be publicly available.
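The post-hoc loop described in this abstract (generate, retrieve evidence, check each statement, regenerate while anything is unsupported) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `cite_or_regenerate`, `split_statements`, the `max_rounds` cap, and the three toy callables are hypothetical names that do not come from the paper, and a substring check stands in for a real natural language inference model.

```python
def split_statements(text):
    """Naive sentence splitter; a real system would use a proper segmenter."""
    return [s.strip() for s in text.split(".") if s.strip()]

def cite_or_regenerate(question, generate, retrieve, entails, max_rounds=3):
    """Regenerate until every statement has at least one supporting document."""
    unsupported, response, citations = [], "", {}
    for _ in range(max_rounds):
        # The generator sees which statements failed verification last round.
        response = generate(question, unsupported)
        citations, unsupported = {}, []
        for statement in split_statements(response):
            docs = [d for d in retrieve(statement) if entails(d, statement)]
            if docs:
                citations[statement] = docs
            else:
                unsupported.append(statement)
        if not unsupported:
            break
    return response, citations

# Toy stand-ins (assumptions, not real models or retrievers).
CORPUS = ["Paris is the capital of France"]

def toy_retrieve(statement):
    return CORPUS

def toy_entails(document, statement):
    return statement.lower() in document.lower()

def toy_generate(question, unsupported):
    if unsupported:  # second attempt: drop the claim that could not be cited
        return "Paris is the capital of France."
    return "Paris is the capital of France. Paris has 90 million residents."

answer, cites = cite_or_regenerate(
    "What is the capital of France?", toy_generate, toy_retrieve, toy_entails
)
```

In this toy run the first response contains a claim with no supporting document, so a second response is requested and every remaining statement ends up with a citation.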

Title: Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space

Authors: Aviad Rom, Kfir Bar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16065
Pdf URL: https://arxiv.org/pdf/2402.16065
Copy Paste: [[2402.16065]] Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space(https://arxiv.org/abs/2402.16065)
Keywords: language model
Abstract: We train a bilingual Arabic-Hebrew language model using a version of Arabic texts transliterated into Hebrew script, to ensure both languages are represented in the same script. Given the morphological and structural similarities and the large number of cognates shared between Arabic and Hebrew, we assess the performance of a language model that employs a unified script for both languages on machine translation, which requires cross-lingual knowledge. The results are promising: our model outperforms a contrasting model which keeps the Arabic texts in the Arabic script, demonstrating the efficacy of the transliteration step. Despite being trained on a dataset approximately 60% smaller than that of other existing language models, our model appears to deliver comparable performance in machine translation across both translation directions.

Title: Behavioral Refinement via Interpolant-based Policy Diffusion

Authors: Kaiqi Chen, Eugene Lim, Kelvin Lin, Yiyang Chen, Harold Soh
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2402.16075
Pdf URL: https://arxiv.org/pdf/2402.16075
Copy Paste: [[2402.16075]] Behavioral Refinement via Interpolant-based Policy Diffusion(https://arxiv.org/abs/2402.16075)
Keywords: agent
Abstract: Imitation learning empowers artificial agents to mimic behavior by learning from demonstrations. Recently, diffusion models, which have the ability to model high-dimensional and multimodal distributions, have shown impressive performance on imitation learning tasks. These models learn to shape a policy by diffusing actions (or states) from standard Gaussian noise. However, the target policy to be learned is often significantly different from Gaussian and this mismatch can result in poor performance when using a small number of diffusion steps (to improve inference speed) and under limited data. The key idea in this work is that initiating from a more informative source than Gaussian enables diffusion methods to overcome the above limitations. We contribute both theoretical results, a new method, and empirical findings that show the benefits of using an informative source policy. Our method, which we call BRIDGER, leverages the stochastic interpolants framework to bridge arbitrary policies, thus enabling a flexible approach towards imitation learning. It generalizes prior work in that standard Gaussians can still be applied, but other source policies can be used if available. In experiments on challenging benchmarks, BRIDGER outperforms state-of-the-art diffusion policies and we provide further analysis on design considerations when applying BRIDGER.

Title: FuseChat: Knowledge Fusion of Chat Models

Authors: Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, Wei Bi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16107
Pdf URL: https://arxiv.org/pdf/2402.16107
Copy Paste: [[2402.16107]] FuseChat: Knowledge Fusion of Chat Models(https://arxiv.org/abs/2402.16107)
Keywords: language model, gpt, llm, chat
Abstract: While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, this approach incurs substantial costs and may lead to potential redundancy in competencies. An alternative strategy is to combine existing LLMs into a more robust LLM, thereby diminishing the necessity for expensive pre-training. However, due to the diverse architectures of LLMs, direct parameter blending proves to be unfeasible. Recently, FuseLLM introduced the concept of knowledge fusion to transfer the collective knowledge of multiple structurally varied LLMs into a target LLM through lightweight continual training. In this report, we extend the scalability and flexibility of the FuseLLM framework to realize the fusion of chat LLMs, resulting in FuseChat. FuseChat comprises two main stages. First, we undertake knowledge fusion for structurally and scale-varied source LLMs to derive multiple target LLMs of identical structure and size via lightweight fine-tuning. Then, these target LLMs are merged within the parameter space, wherein we propose a novel method for determining the merging weights based on the variation ratio of parameter matrices before and after fine-tuning. We validate our approach using three prominent chat LLMs with diverse architectures and scales, namely NH2-Mixtral-8x7B, NH2-Solar-10.7B, and OpenChat-3.5-7B. Experimental results spanning various chat domains demonstrate the superiority of FuseChat-7B across a broad spectrum of chat LLMs at 7B and 34B scales, even surpassing GPT-3.5 (March) and approaching Mixtral-8x7B-Instruct. Our code, model weights, and data are openly accessible at https://github.com/fanqiwan/FuseLLM.
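One plausible reading of "merging weights based on the variation ratio of parameter matrices before and after fine-tuning" is sketched below in Python. The abstract does not give the formula, so everything here is an assumption: each fine-tuned model gets a weight proportional to the mean squared change of a given matrix relative to the shared base, and the merged matrix is the weighted average of the fine-tuned matrices. The function names are hypothetical.

```python
def mean_sq_change(before, after):
    """Average squared element-wise change of one parameter matrix."""
    diffs = [(a - b) ** 2 for rb, ra in zip(before, after) for b, a in zip(rb, ra)]
    return sum(diffs) / len(diffs)

def merge_by_variation(base, tuned_list):
    """Weight each fine-tuned matrix by its share of the total variation (assumed rule)."""
    changes = [mean_sq_change(base, t) for t in tuned_list]
    total = sum(changes)
    if total == 0:  # nothing moved during fine-tuning: fall back to a plain average
        weights = [1 / len(tuned_list)] * len(tuned_list)
    else:
        weights = [c / total for c in changes]
    merged = [
        [sum(w * t[i][j] for w, t in zip(weights, tuned_list))
         for j in range(len(base[0]))]
        for i in range(len(base))
    ]
    return weights, merged

base = [[0.0, 0.0], [0.0, 0.0]]
tuned_a = [[1.0, 0.0], [0.0, 0.0]]  # small change during fine-tuning
tuned_b = [[0.0, 0.0], [0.0, 3.0]]  # larger change during fine-tuning
weights, merged = merge_by_variation(base, [tuned_a, tuned_b])
```

Under this assumed rule the model whose matrix moved more during fine-tuning dominates the merge for that matrix, and the weights are computed independently for each parameter matrix.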

Title: InstructEdit: Instruction-based Knowledge Editing for Large Language Models

Authors: Bozhong Tian, Siyuan Cheng, Xiaozhuan Liang, Ningyu Zhang, Yi Hu, Kouying Xue, Yanjie Gou, Xi Chen, Huajun Chen
Subjects: cs.CL, cs.AI, cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16123
Pdf URL: https://arxiv.org/pdf/2402.16123
Copy Paste: [[2402.16123]] InstructEdit: Instruction-based Knowledge Editing for Large Language Models(https://arxiv.org/abs/2402.16123)
Keywords: language model, llm
Abstract: Knowledge editing for large language models can offer an efficient solution to alter a model's behavior without negatively impacting the overall performance. However, the current approach encounters issues with limited generalizability across tasks, necessitating one distinct editor for each task, which significantly hinders the broader applications. To address this, we take the first step to analyze the multi-task generalization issue in knowledge editing. Specifically, we develop an instruction-based editing technique, termed InstructEdit, which facilitates the editor's adaptation to various tasks simultaneously using simple instructions. With only one unified editor for each LLM, we empirically demonstrate that InstructEdit can improve the editor's control, leading to an average 14.86% increase in Reliability in the multi-task editing setting. Furthermore, experiments on held-out unseen tasks show that InstructEdit consistently surpasses previous strong baselines. To further investigate the underlying mechanisms of instruction-based knowledge editing, we analyze the principal components of the editing gradient directions, which reveals that instructions can help control the optimization direction with stronger OOD generalization. Code and datasets will be available at https://github.com/zjunlp/EasyEdit.

Title: LSTPrompt: Large Language Models as Zero-Shot Time Series Forecasters by Long-Short-Term Prompting

Authors: Haoxin Liu, Zhiyuan Zhao, Jindong Wang, Harshavardhan Kamarthi, B. Aditya Prakash
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16132
Pdf URL: https://arxiv.org/pdf/2402.16132
Copy Paste: [[2402.16132]] LSTPrompt: Large Language Models as Zero-Shot Time Series Forecasters by Long-Short-Term Prompting(https://arxiv.org/abs/2402.16132)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Time-series forecasting (TSF) finds broad applications in real-world scenarios. Prompting off-the-shelf Large Language Models (LLMs) demonstrates strong zero-shot TSF capabilities while preserving computational efficiency. However, existing prompting methods oversimplify TSF as language next-token prediction, overlooking its dynamic nature and lacking integration with state-of-the-art prompt strategies such as Chain-of-Thought. Thus, we propose LSTPrompt, a novel approach for prompting LLMs in zero-shot TSF tasks. LSTPrompt decomposes TSF into short-term and long-term forecasting sub-tasks, tailoring prompts to each. LSTPrompt guides LLMs to regularly reassess forecasting mechanisms to enhance adaptability. Extensive evaluations demonstrate consistently better performance of LSTPrompt than existing prompting methods, and competitive results compared to foundation TSF models.

Title: What Generative Artificial Intelligence Means for Terminological Definitions

Authors: Antonio San Martín
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16139
Pdf URL: https://arxiv.org/pdf/2402.16139
Copy Paste: [[2402.16139]] What Generative Artificial Intelligence Means for Terminological Definitions(https://arxiv.org/abs/2402.16139)
Keywords: gpt, chat
Abstract: This paper examines the impact of Generative Artificial Intelligence (GenAI) on the creation and consumption of terminological definitions. GenAI tools like ChatGPT present a mix of benefits and drawbacks compared to traditional terminological resources. ChatGPT excels in providing context-specific meanings in an interactive and customized fashion but faces challenges with accuracy. Terminological definitions in recognized resources will likely survive because of their reliability. From the point of view of the terminologist, tools like ChatGPT enable AI-assisted terminography, including post-editing terminography, as an approach blending AI efficiency with human expertise for faster definition creation.

Title: PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization

Authors: Xiangdi Meng, Damai Dai, Weiyao Luo, Zhe Yang, Shaoxiang Wu, Xiaochen Wang, Peiyi Wang, Qingxiu Dong, Liang Chen, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16141
Pdf URL: https://arxiv.org/pdf/2402.16141
Copy Paste: [[2402.16141]] PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization(https://arxiv.org/abs/2402.16141)
Keywords: language model, llm
Abstract: Supervised fine-tuning is the most common method to adapt large language models (LLMs) to downstream tasks, but fully fine-tuning LLMs requires massive computational resources. Recently, parameter-efficient fine-tuning (PEFT) methods have been widely studied due to their cost-effectiveness. LoRA is one of the most widely used methods, which assumes that the optimization process is essentially low-dimensional. Although LoRA fine-tuning is effective, there is still a performance gap compared to full fine-tuning, since its weight update is limited to low-rank matrices. In order to break the low-rank bottleneck in LoRA optimization, we propose PeriodicLoRA (PLoRA), which accumulates low-rank update matrices multiple times to achieve a higher update rank. PLoRA has multiple training stages. During each stage, we still update only the LoRA weights. However, at the end of each stage, we unload the LoRA weights into the backbone parameters and then reinitialize the LoRA states. Experimental results show that PLoRA has stronger learning ability, up to approximately 1.8 times that of LoRA, without increasing memory usage. Further, we introduce a momentum-based unloading strategy for PLoRA to mitigate the training instability.
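The claim that accumulating low-rank updates over several stages yields a higher-rank total update can be checked on a tiny example. The Python sketch below is illustrative only: the hand-picked rank-1 factors `B` and `A` stand in for LoRA weights that training would produce in each stage, and no actual training or momentum-based unloading is modelled.

```python
from fractions import Fraction

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rank(m):
    """Matrix rank via exact Gaussian elimination over fractions."""
    rows = [[Fraction(v) for v in row] for row in m]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

# Frozen backbone weight before any stage (3x3 identity, chosen arbitrarily).
W0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# One (B, A) pair per training stage; each product B @ A has rank 1.
stage_factors = [
    ([[1], [0], [0]], [[1, 2, 0]]),
    ([[0], [1], [0]], [[0, 1, 3]]),
]

W = W0
stage_ranks = []
for B, A in stage_factors:
    delta = matmul(B, A)        # low-rank update learned within this stage
    stage_ranks.append(rank(delta))
    W = add(W, delta)           # unload the LoRA weights into the backbone
    # The LoRA state is then reinitialized; here the next (B, A) pair plays that role.

accumulated_rank = rank(sub(W, W0))
```

Each stage contributes a rank-1 update, yet the accumulated change to the backbone has rank 2, which is the effect the abstract attributes to periodic unloading.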

Title: From Text to Transformation: A Comprehensive Review of Large Language Models' Versatility

Authors: Pravneet Kaur, Gautam Siddharth Kashyap, Ankit Kumar, Md Tabrez Nafis, Sandeep Kumar, Vikrant Shokeen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16142
Pdf URL: https://arxiv.org/pdf/2402.16142
Copy Paste: [[2402.16142]] From Text to Transformation: A Comprehensive Review of Large Language Models' Versatility(https://arxiv.org/abs/2402.16142)
Keywords: language model, gpt, llm
Abstract: This groundbreaking study explores the expanse of Large Language Models (LLMs), such as Generative Pre-Trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT), across varied domains ranging from technology, finance, and healthcare to education. Despite their established prowess in Natural Language Processing (NLP), these LLMs have not been systematically examined for their impact on domains such as fitness and holistic well-being, urban planning, climate modelling, and disaster management. This review paper, in addition to furnishing a comprehensive analysis of the extent of LLMs' utility in diverse domains, recognizes the research gaps and realms where the potential of LLMs is yet to be harnessed. This study uncovers innovative ways in which LLMs can leave a mark in fields like fitness and wellbeing, urban planning, climate modelling, and disaster response, which could inspire future research and applications in these areas.

Title: DistALANER: Distantly Supervised Active Learning Augmented Named Entity Recognition in the Open Source Software Ecosystem

Authors: Somnath Banerjee, Avik Dutta, Aaditya Agrawal, Rima Hazra, Animesh Mukherjee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16159
Pdf URL: https://arxiv.org/pdf/2402.16159
Copy Paste: [[2402.16159]] DistALANER: Distantly Supervised Active Learning Augmented Named Entity Recognition in the Open Source Software Ecosystem(https://arxiv.org/abs/2402.16159)
Keywords: llm
Abstract: This paper proposes a novel named entity recognition (NER) technique specifically tailored for open-source software systems. Our approach aims to address the scarcity of annotated software data by employing a comprehensive two-step distantly supervised annotation process. This process strategically leverages language heuristics, unique lookup tables, external knowledge sources, and an active learning approach. By harnessing these powerful techniques, we not only enhance model performance but also effectively mitigate the limitations associated with cost and the scarcity of expert annotators. It is noteworthy that our framework outperforms state-of-the-art LLMs by a substantial margin. We also show the effectiveness of NER in the downstream task of relation extraction.

Title: Hitting "Probe"rty with Non-Linearity, and More

Authors: Avik Pal, Madhura Pawar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16168
Pdf URL: https://arxiv.org/pdf/2402.16168
Copy Paste: [[2402.16168]] Hitting "Probe"rty with Non-Linearity, and More(https://arxiv.org/abs/2402.16168)
Keywords: language model
Abstract: Structural probes learn a linear transformation to find how dependency trees are embedded in the hidden states of language models. This simple design may not allow for full exploitation of the structure of the encoded information. Hence, to investigate the structure of the encoded information to its full extent, we incorporate non-linear structural probes. We reformulate the design of non-linear structural probes introduced by White et al., making it simpler yet effective. We also design a visualization framework that lets us qualitatively assess how strongly two words in a sentence are connected in the predicted dependency tree. We use this technique to understand which non-linear probe variant is good at encoding syntactical information. Additionally, we use it to qualitatively investigate the structure of dependency trees that BERT encodes in each of its layers. We find that the radial basis function (RBF) non-linear probe is more effective for the BERT model than the linear probe.

Title: How Can LLM Guide RL? A Value-Based Approach

Authors: Shenao Zhang, Sirui Zheng, Shuqi Ke, Zhihan Liu, Wanxin Jin, Jianbo Yuan, Yingxiang Yang, Hongxia Yang, Zhaoran Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16181
Pdf URL: https://arxiv.org/pdf/2402.16181
Copy Paste: [[2402.16181]] How Can LLM Guide RL? A Value-Based Approach(https://arxiv.org/abs/2402.16181)
Keywords: language model, llm, agent
Abstract: Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. However, RL algorithms may require extensive trial-and-error interactions to collect useful feedback for improvement. On the other hand, recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities for planning tasks, lacking the ability to autonomously refine their responses based on feedback. Therefore, in this paper, we study how the policy prior provided by the LLM can enhance the sample efficiency of RL algorithms. Specifically, we develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning, particularly when the difference between the ideal policy and the LLM-informed policy is small, which suggests that the initial policy is close to optimal, reducing the need for further exploration. Additionally, we present a practical algorithm SLINVIT that simplifies the construction of the value function and employs subgoals to reduce the search complexity. Our experiments across three interactive environments ALFWorld, InterCode, and BlocksWorld demonstrate that our method achieves state-of-the-art success rates and also surpasses previous RL and LLM approaches in terms of sample efficiency. Our code is available at https://github.com/agentification/Language-Integrated-VI.
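To illustrate what using an LLM policy "as a regularization factor in value-based RL" can look like, the Python sketch below uses the standard KL-regularized objective over a discrete action set: maximize E_pi[Q] - lam * KL(pi || pi_LLM), whose maximizer is pi(a) proportional to pi_LLM(a) * exp(Q(a) / lam). This is a textbook form chosen for illustration; the abstract does not state that LINVIT uses exactly this objective, and the function name, `lam`, and the toy numbers are assumptions.

```python
import math

def kl_regularized_policy(q_values, llm_prior, lam):
    """Maximizer of E_pi[Q] - lam * KL(pi || llm_prior) over discrete actions."""
    m = max(q_values)  # subtract the max before exponentiating for numerical stability
    unnorm = [p * math.exp((q - m) / lam) for q, p in zip(q_values, llm_prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

q_values = [1.0, 0.0]   # toy action values: action 0 is better
llm_prior = [0.2, 0.8]  # toy LLM policy: prefers action 1

trusting = kl_regularized_policy(q_values, llm_prior, lam=100.0)  # strong regularization
greedy = kl_regularized_policy(q_values, llm_prior, lam=0.1)      # weak regularization
```

With a large `lam` the resulting policy stays close to the LLM prior, and with a small `lam` it concentrates on the action with the highest value. When the prior already agrees with the values, little correction is needed, which matches the abstract's remark that the benefit is largest when the LLM-informed policy is close to the ideal one.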

Title: Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

Authors: Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16192
Pdf URL: https://arxiv.org/pdf/2402.16192
Copy Paste: [[2402.16192]] Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing(https://arxiv.org/abs/2402.16192)
Keywords: language model, llm, prompt
Abstract: Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, there do not exist defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.
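The idea of aggregating predictions over several transformed copies of a prompt can be shown with a toy majority vote in Python. This is a sketch under loud assumptions: the crude string transforms below stand in for semantic transformations, which the abstract does not enumerate; `toy_model_refuses` is a hypothetical stand-in for an LLM plus a refusal judge; and majority voting is one simple aggregation rule, not necessarily the one the paper uses.

```python
from collections import Counter

ADV_SUFFIX = "!!xx!!"  # pretend adversarial suffix that jailbreaks the toy model

def toy_model_refuses(prompt):
    """Toy target model: refuses harmful requests unless the suffix is present."""
    jailbroken = ADV_SUFFIX in prompt
    return "bomb" in prompt.lower() and not jailbroken

def identity(prompt):
    return prompt

def keep_words_only(prompt):
    """Drop tokens that are not purely alphabetic."""
    return " ".join(w for w in prompt.split() if w.isalpha())

def truncate(prompt, n=6):
    """Keep only the first n whitespace-separated tokens."""
    return " ".join(prompt.split()[:n])

def smoothed_refusal(prompt, transforms, model_refuses):
    """Majority vote over the model's decisions on transformed copies of the prompt."""
    votes = Counter(model_refuses(t(prompt)) for t in transforms)
    return votes[True] > votes[False]

transforms = [identity, keep_words_only, truncate]
attack = "Explain how to build a bomb " + ADV_SUFFIX
benign = "Explain how to bake bread"

undefended = toy_model_refuses(attack)
defended = smoothed_refusal(attack, transforms, toy_model_refuses)
benign_refused = smoothed_refusal(benign, transforms, toy_model_refuses)
```

In this toy setup the unmodified attack slips past the model, while two of the three transformed copies lose the suffix and trigger a refusal, so the vote refuses. The benign prompt is answered under every transform, which mirrors the nominal-performance side of the trade-off the abstract mentions.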

Title: ASEM: Enhancing Empathy in Chatbot through Attention-based Sentiment and Emotion Modeling

Authors: Omama Hamad, Ali Hamdi, Khaled Shaban
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16194
Pdf URL: https://arxiv.org/pdf/2402.16194
Copy Paste: [[2402.16194]] ASEM: Enhancing Empathy in Chatbot through Attention-based Sentiment and Emotion Modeling(https://arxiv.org/abs/2402.16194)
Keywords: chat
Abstract: Effective feature representations play a critical role in enhancing the performance of text generation models that rely on deep neural networks. However, current approaches suffer from several drawbacks, such as the inability to capture the deep semantics of language and sensitivity to minor input variations, resulting in significant changes in the generated text. In this paper, we present a novel solution to these challenges by employing a mixture of experts, multiple encoders, to offer distinct perspectives on the emotional state of the user's utterance while simultaneously enhancing performance. We propose an end-to-end model architecture called ASEM that performs emotion analysis on top of sentiment analysis for open-domain chatbots, enabling the generation of empathetic responses that are fluent and relevant. In contrast to traditional attention mechanisms, the proposed model employs a specialized attention strategy that uniquely zeroes in on sentiment and emotion nuances within the user's utterance. This ensures the generation of context-rich representations tailored to the underlying emotional tone and sentiment intricacies of the text. Our approach outperforms existing methods for generating empathetic embeddings, providing empathetic and diverse responses. The performance of our proposed model significantly exceeds that of existing models, enhancing emotion detection accuracy by 6.2% and lexical diversity by 1.4%.

Title: HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs

Authors: Cem Uluoglakci, Tugba Taskaya Temizel (Middle East Technical University)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16211
Pdf URL: https://arxiv.org/pdf/2402.16211
Copy Paste: [[2402.16211]] HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs(https://arxiv.org/abs/2402.16211)
Keywords: language model, llm, hallucination, chat, agent
Abstract: Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs), limiting their widespread acceptance beyond chatbot applications. Despite ongoing efforts, hallucinations remain a prevalent challenge in LLMs. The detection of hallucinations itself is also a formidable task, frequently requiring manual labeling or constrained evaluations. This paper introduces an automated scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection. We leverage LLMs to generate challenging tasks related to hypothetical phenomena, subsequently employing them as agents for efficient hallucination detection. The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain. We introduce the publicly available HypoTermQA Benchmarking Dataset, on which state-of-the-art models' performance ranged between 3% and 11%, and evaluator agents demonstrated a 6% error rate in hallucination prediction. The proposed framework provides opportunities to test and improve LLMs. Additionally, it has the potential to generate benchmarking datasets tailored to specific domains, such as law, health, and finance.

Title: Learning Translations: Emergent Communication Pretraining for Cooperative Language Acquisition

Authors: Dylan Cope, Peter McBurney
Subjects: cs.LG, cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2402.16247
Pdf URL: https://arxiv.org/pdf/2402.16247
Copy Paste: [[2402.16247]] Learning Translations: Emergent Communication Pretraining for Cooperative Language Acquisition(https://arxiv.org/abs/2402.16247)
Keywords: agent
Abstract: In Emergent Communication (EC), agents learn to communicate with one another, but the protocols that they develop are specialised to their training community. This observation led to research into Zero-Shot Coordination (ZSC) for learning communication strategies that are robust to agents not encountered during training. However, ZSC typically assumes that no prior data is available about the agents that will be encountered in the zero-shot setting. In many cases, this presents an unnecessarily hard problem and rules out communication via preestablished conventions. We propose a novel AI challenge called a Cooperative Language Acquisition Problem (CLAP) in which the ZSC assumptions are relaxed by allowing a 'joiner' agent to learn from a dataset of interactions between agents in a target community. We propose and compare two methods for solving CLAPs: Imitation Learning (IL), and Emergent Communication pretraining and Translation Learning (ECTL), in which an agent is trained in self-play with EC and then learns from the data to translate between the emergent protocol and the target community's protocol.

Title: Topic-to-essay generation with knowledge-based content selection

Authors: Jieyong Wang, Chunyao Song, Yihao Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16248
Pdf URL: https://arxiv.org/pdf/2402.16248
Copy Paste: [[2402.16248]] Topic-to-essay generation with knowledge-based content selection(https://arxiv.org/abs/2402.16248)
Keywords: language model
Abstract: The topic-to-essay generation (TEG) task is a challenging natural language generation task that aims to generate paragraph-level text with high semantic coherence based on a given set of topic words. Previous work has focused on the introduction of external knowledge, ignoring the insufficient diversity of the generated text. In order to improve the generation diversity, we propose a novel copy mechanism model with a content selection module that integrates rich semantic knowledge from the language model into the decoder. Furthermore, we introduce the improved prefix tuning method to train the model, enabling it to adapt to varying input complexities. In addition, we have contributed a new Chinese dataset for TEG tasks. Experimental results demonstrate that the proposed model can improve the generated text diversity by 35% to 59% compared to the state-of-the-art method, while maintaining a high level of topic consistency.

Title: Foundation Model Transparency Reports

Authors: Rishi Bommasani, Kevin Klyman, Shayne Longpre, Betty Xiong, Sayash Kapoor, Nestor Maslej, Arvind Narayanan, Percy Liang
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2402.16268
Pdf URL: https://arxiv.org/pdf/2402.16268
Copy Paste: [[2402.16268]] Foundation Model Transparency Reports(https://arxiv.org/abs/2402.16268)
Keywords: prompt
Abstract: Foundation models are critical digital technologies with sweeping societal impact that necessitates transparency. To codify how foundation model developers should provide transparency about the development and deployment of their models, we propose Foundation Model Transparency Reports, drawing upon the transparency reporting practices in social media. While external documentation of societal harms prompted social media transparency reports, our objective is to institutionalize transparency reporting for foundation models while the industry is still nascent. To design our reports, we identify 6 design principles given the successes and shortcomings of social media transparency reporting. To further schematize our reports, we draw upon the 100 transparency indicators from the Foundation Model Transparency Index. Given these indicators, we measure the extent to which they overlap with the transparency requirements included in six prominent government policies (e.g., the EU AI Act, the US Executive Order on Safe, Secure, and Trustworthy AI). Well-designed transparency reports could reduce compliance costs, in part due to overlapping regulatory requirements across different jurisdictions. We encourage foundation model developers to regularly publish transparency reports, building upon recommendations from the G7 and the White House.

Title: From Large Language Models and Optimization to Decision Optimization CoPilot: A Research Manifesto

Authors: Segev Wasserkrug, Leonard Boussioux, Dick den Hertog, Farzaneh Mirzazadeh, Ilker Birbil, Jannis Kurtz, Donato Maragno
Subjects: cs.AI, cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2402.16269
Pdf URL: https://arxiv.org/pdf/2402.16269
Copy Paste: [[2402.16269]] From Large Language Models and Optimization to Decision Optimization CoPilot: A Research Manifesto(https://arxiv.org/abs/2402.16269)
Keywords: language model, gpt, llm, chat
Abstract: Significantly simplifying the creation of optimization models for real-world business problems has long been a major goal in applying mathematical optimization more widely to important business and societal decisions. The recent capabilities of Large Language Models (LLMs) present a timely opportunity to achieve this goal. Therefore, we propose research at the intersection of LLMs and optimization to create a Decision Optimization CoPilot (DOCP) - an AI tool designed to assist any decision maker, interacting in natural language to grasp the business problem, subsequently formulating and solving the corresponding optimization model. This paper outlines our DOCP vision and identifies several fundamental requirements for its implementation. We describe the state of the art through a literature survey and experiments using ChatGPT. We show that a) LLMs already provide substantial novel capabilities relevant to a DOCP, and b) major research challenges remain to be addressed. We also propose possible research directions to overcome these gaps. We also see this work as a call to action to bring together the LLM and optimization communities to pursue our vision, thereby enabling much more widespread improved decision-making.

Title: PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering

Authors: Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, Kam-Fai Wong
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2402.16288
Pdf URL: https://arxiv.org/pdf/2402.16288
Copy Paste: [[2402.16288]] PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering(https://arxiv.org/abs/2402.16288)
Keywords: language model, gpt, llm, chat
Abstract: Long-term memory plays a critical role in personal interaction, considering long-term memory can better leverage world knowledge, historical information, and preferences in dialogues. Our research introduces PerLTQA, an innovative QA dataset that combines semantic and episodic memories, including world knowledge, profiles, social relationships, events, and dialogues. This dataset is collected to investigate the use of personalized memories, focusing on social interactions and events in the QA task. PerLTQA features two types of memory and a comprehensive benchmark of 8,593 questions for 30 characters, facilitating the exploration and application of personalized memories in Large Language Models (LLMs). Based on PerLTQA, we propose a novel framework for memory integration and generation, consisting of three main components: Memory Classification, Memory Retrieval, and Memory Synthesis. We evaluate this framework using five LLMs and three retrievers. Experimental results demonstrate that BERT-based classification models significantly outperform LLMs such as ChatGLM3 and ChatGPT in the memory classification task. Furthermore, our study highlights the importance of effective memory integration in the QA task.

Title: Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion

Authors: Xuantong Liu, Tianyang Hu, Wenjia Wang, Kenji Kawaguchi, Yuan Yao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16305
Pdf URL: https://arxiv.org/pdf/2402.16305
Copy Paste: [[2402.16305]] Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion(https://arxiv.org/abs/2402.16305)
Keywords: language model
Abstract: As a dominant force in text-to-image generation tasks, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment. As proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance the image fidelity, a Score Distillation Sampling module of Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-Compbench.

Title: Cross-domain Chinese Sentence Pattern Parsing

Authors: Yingsi Yu, Cunliang Kong, Liner Yang, Meishan Zhang, Lin Zhu, Yujie Wang, Haozhe Lin, Maosong Sun, Erhong Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16311
Pdf URL: https://arxiv.org/pdf/2402.16311
Copy Paste: [[2402.16311]] Cross-domain Chinese Sentence Pattern Parsing(https://arxiv.org/abs/2402.16311)
Keywords: language model, llm
Abstract: Sentence Pattern Structure (SPS) parsing is a syntactic analysis method primarily employed in language teaching. Existing SPS parsers rely heavily on textbook corpora for training, lacking cross-domain capability. To overcome this constraint, this paper proposes an innovative approach leveraging large language models (LLMs) within a self-training framework. Partial syntactic rules from a source domain are combined with target domain sentences to dynamically generate training data, enhancing the adaptability of the parser to diverse domains. Experiments conducted on textbook and news domains demonstrate the effectiveness of the proposed method, outperforming rule-based baselines by 1.68 points on F1 metrics.

Title: Federated Contextual Cascading Bandits with Asynchronous Communication and Heterogeneous Users

Authors: Hantao Yang, Xutong Liu, Zhiyong Wang, Hong Xie, John C. S. Lui, Defu Lian, Enhong Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16312
Pdf URL: https://arxiv.org/pdf/2402.16312
Copy Paste: [[2402.16312]] Federated Contextual Cascading Bandits with Asynchronous Communication and Heterogeneous Users(https://arxiv.org/abs/2402.16312)
Keywords: agent
Abstract: We study the problem of federated contextual combinatorial cascading bandits, where $|\mathcal{U}|$ agents collaborate under the coordination of a central server to provide tailored recommendations to the $|\mathcal{U}|$ corresponding users. Existing works consider either a synchronous framework, necessitating full agent participation and global synchronization, or assume user homogeneity with identical behaviors. We overcome these limitations by considering (1) federated agents operating in an asynchronous communication paradigm, where no mandatory synchronization is required and all agents communicate independently with the server, (2) heterogeneous user behaviors, where users can be stratified into $J \le |\mathcal{U}|$ latent user clusters, each exhibiting distinct preferences. For this setting, we propose a UCB-type algorithm with delicate communication protocols. Through theoretical analysis, we give sub-linear regret bounds on par with those achieved in the synchronous framework, while incurring only logarithmic communication costs. Empirical evaluation on synthetic and real-world datasets validates our algorithm's superior performance in terms of regrets and communication costs.

Title: Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering

Authors: Mingxu Tao, Dongyan Zhao, Yansong Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16313
Pdf URL: https://arxiv.org/pdf/2402.16313
Copy Paste: [[2402.16313]] Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering(https://arxiv.org/abs/2402.16313)
Keywords: language model, llm
Abstract: Open-ended question answering requires models to find appropriate evidence to form well-reasoned, comprehensive and helpful answers. In practical applications, models also need to engage in extended discussions on potential scenarios closely relevant to the question. With the augmentation of a retrieval module, open-source Large Language Models (LLMs) can produce coherent answers, often with different focuses, but are still sub-optimal in terms of reliable evidence selection and in-depth question analysis. In this paper, we propose a novel Chain-of-Discussion framework to leverage the synergy among multiple open-source LLMs, aiming to provide more correct and more comprehensive answers for open-ended QA, although they are not strong enough individually. Our experiments show that discussions among multiple LLMs play a vital role in enhancing the quality of answers. We release our data and code at https://github.com/kobayashikanna01/Chain-of-Discussion.

Title: Data-free Weight Compress and Denoise for Large Language Models

Authors: Runyu Peng, Yunhua Zhou, Qipeng Guo, Yang Gao, Hang Yan, Xipeng Qiu, Dahua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16319
Pdf URL: https://arxiv.org/pdf/2402.16319
Copy Paste: [[2402.16319]] Data-free Weight Compress and Denoise for Large Language Models(https://arxiv.org/abs/2402.16319)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are reshaping the research landscape in artificial intelligence, particularly as model parameters scale up significantly, unlocking remarkable capabilities across various domains. Nevertheless, the scalability of model parameters faces constraints due to limitations in GPU memory and computational speed. To address these constraints, various weight compression methods have emerged, such as Pruning and Quantization. Given the low-rank nature of weight matrices in language models, the reduction of weights through matrix decomposition undoubtedly holds significant potential and promise. In this paper, drawing upon the intrinsic structure of LLMs, we propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. Significantly, our method does not require the involvement of any corpus, while simultaneously preserving orthogonality in conjunction with pruning and quantization methods. We achieve a model pruning of 80% of the parameters while retaining 93.43% of the original performance without any calibration data. Additionally, we explore the fundamental properties of the weight matrices of LLMs that have undergone Rank-k Approximation and conduct comprehensive experiments to elucidate our hypothesis.

Title: CodeS: Towards Building Open-source Language Models for Text-to-SQL

Authors: Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2402.16347
Pdf URL: https://arxiv.org/pdf/2402.16347
Copy Paste: [[2402.16347]] CodeS: Towards Building Open-source Language Models for Text-to-SQL(https://arxiv.org/abs/2402.16347)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Language models have shown promising performance on the task of translating natural language questions into SQL queries (Text-to-SQL). However, most of the state-of-the-art (SOTA) approaches rely on powerful yet closed-source large language models (LLMs), such as ChatGPT and GPT-4, which may have the limitations of unclear model architectures, data privacy risks, and expensive inference overheads. To address the limitations, we introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B, specifically designed for the text-to-SQL task. CodeS is a fully open-source language model, which achieves superior accuracy with much smaller parameter sizes. This paper studies the research challenges in building CodeS. To enhance the SQL generation abilities of CodeS, we adopt an incremental pre-training approach using a specifically curated SQL-centric corpus. Based on this, we address the challenges of schema linking and rapid domain adaptation through strategic prompt construction and a bi-directional data augmentation technique. We conduct comprehensive evaluations on multiple datasets, including the widely used Spider benchmark, the newly released BIRD benchmark, robustness-diagnostic benchmarks such as Spider-DK, Spider-Syn, Spider-Realistic, and Dr.Spider, as well as two real-world datasets created for financial and academic applications. The experimental results show that our CodeS achieves new SOTA accuracy and robustness on nearly all challenging text-to-SQL benchmarks.

Title: MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs

Authors: Zimu Lu, Aojun Zhou, Houxing Ren, Ke Wang, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16352
Pdf URL: https://arxiv.org/pdf/2402.16352
Copy Paste: [[2402.16352]] MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs(https://arxiv.org/abs/2402.16352)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have exhibited great potential in mathematical reasoning. However, there remains a performance gap in this area between existing open-source models and closed-source models such as GPT-4. In this paper, we introduce MathGenie, a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset (denoted as seed data). We augment the ground-truth solutions of our seed data and train a back-translation model to translate the augmented solutions back into new questions. Subsequently, we generate code-integrated solutions for the new questions. To ensure the correctness of the code-integrated solutions, we employ a rationale-based strategy for solution verification. Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique, resulting in a family of models known as MathGenieLM. These models consistently outperform previous open-source models across five representative mathematical reasoning datasets, achieving state-of-the-art performance. In particular, MathGenieLM-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score among open-source language models.

Title: Language-guided Skill Learning with Temporal Variational Inference

Authors: Haotian Fu, Pratyusha Sharma, Elias Stengel-Eskin, George Konidaris, Nicolas Le Roux, Marc-Alexandre Côté, Xingdi Yuan
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.16354
Pdf URL: https://arxiv.org/pdf/2402.16354
Copy Paste: [[2402.16354]] Language-guided Skill Learning with Temporal Variational Inference(https://arxiv.org/abs/2402.16354)
Keywords: language model, llm, agent
Abstract: We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off between compression and reusability, we introduce a novel auxiliary objective based on the Minimum Description Length principle that helps guide this skill discovery process. Our results demonstrate that agents equipped with our method are able to discover skills that help accelerate learning and outperform baseline skill learning approaches on new long-horizon tasks in BabyAI, a grid world navigation environment, as well as ALFRED, a household simulation environment.

Title: An Integrated Data Processing Framework for Pretraining Foundation Models

Authors: Yiding Sun, Feng Wang, Yutao Zhu, Wayne Xin Zhao, Jiaxin Mao
Subjects: cs.LG, cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2402.16358
Pdf URL: https://arxiv.org/pdf/2402.16358
Copy Paste: [[2402.16358]] An Integrated Data Processing Framework for Pretraining Foundation Models(https://arxiv.org/abs/2402.16358)
Keywords: gpt, chat
Abstract: The ability of foundation models relies heavily on large-scale, diverse, and high-quality pretraining data. In order to improve data quality, researchers and practitioners often have to manually curate datasets from different sources and develop a dedicated data cleansing pipeline for each data repository. Lacking a unified data processing framework, this process is repetitive and cumbersome. To mitigate this issue, we propose a data processing framework that integrates a Processing Module, which consists of a series of operators at different granularity levels, and an Analyzing Module, which supports probing and evaluation of the refined data. The proposed framework is easy to use and highly flexible. In this demo paper, we first introduce how to use this framework with some example use cases and then demonstrate its effectiveness in improving the data quality with an automated evaluation with ChatGPT and an end-to-end evaluation in pretraining the GPT-2 model. The code and demonstration videos are accessible on GitHub.

Title: Layer-wise Regularized Dropout for Neural Language Models

Authors: Shiwen Ni, Min Yang, Ruifeng Xu, Chengming Li, Xiping Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16361
Pdf URL: https://arxiv.org/pdf/2402.16361
Copy Paste: [[2402.16361]] Layer-wise Regularized Dropout for Neural Language Models(https://arxiv.org/abs/2402.16361)
Keywords: language model
Abstract: Among the various pre-trained neural language models that are popular today, dropout is already an indispensable regularization technique. To solve the inconsistency between training and inference caused by the randomness of dropout, some studies use consistency training to regularize dropout at the output layer. In this paper, we propose a novel Layer-wise Regularized Dropout (LR-Drop), which is specially designed for Transformer-based Language models. Specifically, LR-Drop layer-wise regularizes each Transformer layer using the consistency training strategy. Each training sample passes through the two siamese sub-models sampled by dropout, and then LR-Drop forces the hidden states, multi-head attention matrices, and output distribution of the two siamese sub-models to be consistent. The proposed LR-Drop can be regarded as a "self-distillation" framework, in which each sub-model generated by dropout is the other's "teacher" model and "student" model. Through extensive experiments on 8 natural language understanding datasets, 6 neural machine translation datasets, and 1 abstractive summarization dataset (a total of 15 datasets), we show that LR-Drop achieves superior performances, including state-of-the-art results.
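
The consistency idea in the abstract above (two dropout-sampled sub-models whose outputs are pushed to agree) can be illustrated with a small sketch. It covers only the output-distribution term and assumes a symmetric (bidirectional) KL divergence as the consistency measure. The abstract does not name the exact loss, and the function names here are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(logits_a, logits_b):
    """Assumed form: average of KL in both directions between the
    output distributions of two dropout-sampled forward passes."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))

# Hypothetical logits from two forward passes of the same input
pass_a = [2.0, 0.5, -1.0]
pass_b = [1.5, 0.8, -0.7]
loss = consistency_loss(pass_a, pass_b)
```

In training, a term like this would be added to the task loss. According to the abstract, the paper applies analogous constraints to hidden states and attention matrices at every Transformer layer, which this sketch does not attempt.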

Title: LLM Inference Unveiled: Survey and Roofline Model Insights

Authors: Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16363
Pdf URL: https://arxiv.org/pdf/2402.16363
Copy Paste: [[2402.16363]] LLM Inference Unveiled: Survey and Roofline Model Insights(https://arxiv.org/abs/2402.16363)
Keywords: language model, llm
Abstract: The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Although the field has expanded and is vibrant, there hasn't been a concise framework that analyzes the various methods of LLM inference to provide a clear understanding of this domain. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model for systematic analysis of LLM inference techniques. This framework enables identifying the bottlenecks in LLM deployments and provides a deeper understanding of the practical aspects on real devices, thereby informing more effective strategies for deploying LLMs. Furthermore, we systematically collate the latest advancements in efficient LLM inference, covering crucial areas such as weight optimization (e.g., Knowledge Distillation and Quantization), decoding algorithm improvements (e.g., Early Exit and Mixture-of-Expert), and both hardware and system-level enhancements. Distinguished by the integration of roofline model analysis, our survey provides a comprehensive and nuanced exploration of efficient LLM inference challenges and solutions. This distinctive approach not only showcases the current research landscape but also delivers valuable insights for practical implementation, positioning our work as an indispensable resource for researchers new to the field as well as for those seeking to deepen their understanding of efficient LLM deployment. The tool LLM-Viewer is open-sourced.
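
For readers unfamiliar with the roofline model mentioned in the abstract above, here is a minimal sketch of the standard formulation: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The device figures and intensity values are hypothetical placeholders, and the code is not taken from the LLM-Viewer tool.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline bound in FLOP/s.

    peak_flops: peak compute of the device (FLOP/s)
    mem_bandwidth: memory bandwidth (bytes/s)
    arithmetic_intensity: FLOPs performed per byte moved
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

def is_memory_bound(peak_flops, mem_bandwidth, arithmetic_intensity):
    """True when the operation sits left of the ridge point."""
    ridge = peak_flops / mem_bandwidth  # intensity where both limits meet
    return arithmetic_intensity < ridge

# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth
PEAK = 100e12
BW = 1e12

# Hypothetical intensities for a low-intensity and a high-intensity layer
low_ai, high_ai = 2.0, 500.0

low_bound = attainable_flops(PEAK, BW, low_ai)    # limited by bandwidth
high_bound = attainable_flops(PEAK, BW, high_ai)  # limited by peak compute
```

Comparing an operation's intensity with the ridge point is how the bottleneck is classified: below it, memory traffic is the limit; above it, compute is.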

Title: Where Do We Go from Here? Multi-scale Allocentric Relational Inference from Natural Spatial Descriptions

Authors: Tzuf Paz-Argaman, Sayali Kulkarni, John Palowitch, Jason Baldridge, Reut Tsarfaty
Subjects: cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2402.16364
Pdf URL: https://arxiv.org/pdf/2402.16364
Copy Paste: [[2402.16364]] Where Do We Go from Here? Multi-scale Allocentric Relational Inference from Natural Spatial Descriptions(https://arxiv.org/abs/2402.16364)
Keywords: agent
Abstract: When communicating routes in natural language, the concept of acquired spatial knowledge is crucial for geographic information retrieval (GIR) and in spatial cognitive research. However, NLP navigation studies often overlook the impact of such acquired knowledge on textual descriptions. Current navigation studies concentrate on egocentric local descriptions (e.g., 'it will be on your right') that require reasoning over the agent's local perception. These instructions are typically given as a sequence of steps, with each action-step explicitly mentioning and being followed by a landmark that the agent can use to verify they are on the right path (e.g., 'turn right and then you will see...'). In contrast, descriptions based on knowledge acquired through a map provide a complete view of the environment and capture its overall structure. These instructions (e.g., 'it is south of Central Park and a block north of a police station') are typically non-sequential and contain allocentric relations, with multiple spatial relations and implicit actions and without any explicit verification. This paper introduces the Rendezvous (RVS) task and dataset, which includes 10,404 examples of English geospatial instructions for reaching a target location using map-knowledge. Our analysis reveals that RVS exhibits a richer use of spatial allocentric relations, and requires resolving more spatial relations simultaneously compared to previous text-based navigation benchmarks.
摘要：当用自然语言交流路线时，“获得的空间知识”的概念对于地理信息检索（GIR）和空间认知研究至关重要。然而，NLP 导航研究常常忽视此类获得的知识对文本描述的影响。当前的导航研究集中于以自我为中心的本地描述（例如，“它将在你的右边”），这需要对代理的本地感知进行推理。这些指令通常以一系列步骤的形式给出，每个操作步骤都明确提及并后跟一个地标，代理可以使用该地标来验证它们是否处于正确的路径上（例如，“向右转，然后您将看到..”） .'）。相比之下，基于通过地图获取的知识的描述提供了环境的完整视图并捕获其整体结构。这些指令（例如，“它在中央公园以南，警察局以北的一个街区”）通常是非顺序的，包含异心关系，具有多种空间关系和隐式动作，没有任何明确的验证。本文介绍了 Rendezvous (RVS) 任务和数据集，其中包括 10,404 个使用地图知识到达目标位置的英语地理空间指令示例。我们的分析表明，与之前基于文本的导航基准相比，RVS 表现出更丰富的空间异中心关系使用，并且需要同时解决更多空间关系。

Title: Unraveling Babel: Exploring Multilingual Activation Patterns within Large Language Models

Authors: Weize Liu, Yinlong Xu, Hongxia Xu, Jintai Chen, Xuming Hu, Jian Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16367
Pdf URL: https://arxiv.org/pdf/2402.16367
Copy Paste: [[2402.16367]] Unraveling Babel: Exploring Multilingual Activation Patterns within Large Language Models(https://arxiv.org/abs/2402.16367)
Keywords: language model, llm
Abstract: Recently, large language models (LLMs) have achieved tremendous breakthroughs in the field of language processing, yet their mechanisms for processing multiple languages remain poorly understood. Therefore, in this work we study the multilingual activation patterns of LLMs. By transforming the original LLMs into a Mixture of Experts (MoE) architecture, we analyze the expert activation patterns when processing various languages and demonstrate the connections of these activation patterns at the level of language families. We discover the existence of non-language-specific neurons as well as language-specific activation neurons. Further exploration even shows that merely leveraging high-frequency activation neurons can accelerate inference while maintaining comparable performance. These findings shed light on the LLMs' multilingual processing mechanism, and are of significant importance in guiding the multilingual training and model pruning of LLMs.
摘要：近年来，大型语言模型（LLM）在语言处理领域取得了巨大突破，但其处理多种语言的机制仍然不可知。因此，在这项工作中，我们研究了法学硕士的多语言激活模式。通过将原始的大型语言模型（LLM）转变为专家混合（MoE）架构，我们分析了处理各种语言时的专家激活模式，并在语言家族层面展示了这些激活模式的联系。我们发现非语言特异性神经元以及语言特异性激活神经元的存在。进一步的探索甚至表明，仅利用高频激活神经元就可以加速推理，同时保持可比较的性能。这些发现揭示了法学硕士的多语言处理机制，对于指导法学硕士的多语言训练和模型剪枝具有重要意义。

Title: Improving LLM-based Machine Translation with Systematic Self-Correction

Authors: Zhaopeng Feng, Yan Zhang, Hao Li, Wenqiang Liu, Jun Lang, Yang Feng, Jian Wu, Zuozhu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16379
Pdf URL: https://arxiv.org/pdf/2402.16379
Copy Paste: [[2402.16379]] Improving LLM-based Machine Translation with Systematic Self-Correction(https://arxiv.org/abs/2402.16379)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT). However, careful evaluations by humans reveal that the translations produced by LLMs still contain multiple errors. Importantly, feeding back such error information into the LLMs can lead to self-correction and result in improved translation performance. Motivated by these insights, we introduce a systematic LLM-based self-correcting translation framework, named TER, which stands for Translate, Estimate, and Refine, marking a significant step forward in this direction. Our findings demonstrate that 1) our self-correction framework successfully assists LLMs in improving their translation quality across a wide range of languages, whether it's from high-resource languages to low-resource ones or whether it's English-centric or centered around other languages; 2) TER exhibits superior systematicity and interpretability compared to previous methods; 3) different estimation strategies yield varied impacts on AI feedback, directly affecting the effectiveness of the final corrections. We further compare different LLMs and conduct various experiments involving self-correction and cross-model correction to investigate the potential relationship between the translation and evaluation capabilities of LLMs.
摘要：大型语言模型（LLM）在机器翻译（MT）方面取得了令人瞩目的成果。然而，经过人工仔细评估，法学硕士的翻译仍然存在多个错误。重要的是，将此类错误信息反馈给法学硕士可以进行自我纠正并提高翻译性能。受这些见解的启发，我们引入了一个基于 LLM 的系统自校正翻译框架，名为 TER，它代表翻译、估计和精炼，标志着朝着这个方向迈出了重要的一步。我们的研究结果表明：1）我们的自我修正框架成功帮助法学硕士提高了多种语言的翻译质量，无论是从高资源语言到低资源语言，还是以英语为中心还是以其他语言为中心； 2）与之前的方法相比，TER表现出优越的系统性和可解释性； 3）不同的估计策略对AI反馈的影响不同，直接影响最终修正的效果。我们进一步比较不同的法学硕士，并进行各种涉及自我校正和跨模型校正的实验，以研究法学硕士的翻译和评估能力之间的潜在关系。
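
The Translate, Estimate, Refine loop named in the abstract above can be sketched as a single pass over three steps. This is only an illustration of the control flow, not the authors' implementation: the three callables and the convention that `estimate` returns `None` when it finds no errors are assumptions made here.

```python
from typing import Callable, Optional


def ter_translate(
    source: str,
    translate: Callable[[str], str],
    estimate: Callable[[str, str], Optional[str]],
    refine: Callable[[str, str, str], str],
) -> str:
    """One Translate -> Estimate -> Refine pass (hypothetical interfaces)."""
    draft = translate(source)            # Translate: produce a first draft
    feedback = estimate(source, draft)   # Estimate: describe errors, or None
    if feedback is None:
        return draft                     # nothing to correct, keep the draft
    return refine(source, draft, feedback)  # Refine: correct using the feedback
```

In practice each callable would wrap a prompt to an LLM; keeping them as parameters makes it easy to swap the estimator, which the abstract reports has a direct effect on the final corrections.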

Title: Immunization against harmful fine-tuning attacks

Authors: Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16382
Pdf URL: https://arxiv.org/pdf/2402.16382
Copy Paste: [[2402.16382]] Immunization against harmful fine-tuning attacks(https://arxiv.org/abs/2402.16382)
Keywords: language model, llm, chat
Abstract: Approaches to aligning large language models (LLMs) with human values have focused on correcting misalignment that emerges from pretraining. However, this focus overlooks another source of misalignment: bad actors might purposely fine-tune LLMs to achieve harmful goals. In this paper, we present an emerging threat model that has arisen from alignment circumvention and fine-tuning attacks. However, lacking in previous works is a clear presentation of the conditions for effective defence. We propose a set of conditions for effective defence against harmful fine-tuning in LLMs called "Immunization conditions," which help us understand how we would construct and measure future defences. Using this formal framework for defence, we offer a synthesis of different research directions that might be pursued to prevent harmful fine-tuning attacks and provide a demonstration of how to use these conditions experimentally, showing early results of using an adversarial loss to immunize LLama2-7b-chat.
摘要：将大型语言模型 (LLM) 与人类价值观保持一致的方法主要集中在纠正预训练中出现的偏差。然而，这种关注忽视了另一个失调的根源：不良行为者可能会故意调整法学硕士以实现有害的目标。在本文中，我们提出了一种由对齐规避和微调攻击产生的新兴威胁模型。然而，以往的作品缺乏对有效防御的条件的明确表述。我们提出了一套有效防御法学硕士有害微调的条件，称为“免疫条件”，这有助于我们了解如何构建和衡量未来的防御。使用这个正式的防御框架，我们提供了不同研究方向的综合，这些研究方向可能被用来防止有害的微调攻击，并提供了如何通过实验使用这些条件的演示，展示了使用对抗性损失来免疫 LLama2-7b 的早期结果-聊天。

Title: MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property

Authors: Shiwen Ni, Minghuan Tan, Yuelin Bai, Fuqiang Niu, Min Yang, Bowen Zhang, Ruifeng Xu, Xiaojun Chen, Chengming Li, Xiping Hu, Ye Li, Jianping Fan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16389
Pdf URL: https://arxiv.org/pdf/2402.16389
Copy Paste: [[2402.16389]] MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property(https://arxiv.org/abs/2402.16389)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) have demonstrated impressive performance in various natural language processing (NLP) tasks. However, there is limited understanding of how well LLMs perform in specific domains (e.g., the intellectual property (IP) domain). In this paper, we contribute a new benchmark, the first Multilingual-oriented quiZ on Intellectual Property (MoZIP), for the evaluation of LLMs in the IP domain. The MoZIP benchmark includes three challenging tasks: IP multiple-choice quiz (IPQuiz), IP question answering (IPQA), and patent matching (PatentMatch). In addition, we also develop a new IP-oriented multilingual large language model (called MoZi), which is a BLOOMZ-based model that has undergone supervised fine-tuning on multilingual IP-related text data. We evaluate our proposed MoZi model and four well-known LLMs (i.e., BLOOMZ, BELLE, ChatGLM and ChatGPT) on the MoZIP benchmark. Experimental results demonstrate that MoZi outperforms BLOOMZ, BELLE and ChatGLM by a noticeable margin, while scoring lower than ChatGPT. Notably, the performance of current LLMs on the MoZIP benchmark has much room for improvement, and even the most powerful ChatGPT does not reach the passing level. Our source code, data, and models are available at https://github.com/AI-for-Science/MoZi.
摘要：大型语言模型 (LLM) 在各种自然语言处理 (NLP) 任务中表现出了令人印象深刻的性能。然而，人们对法学硕士在特定领域（例如知识产权（IP）领域）的表现了解有限。在本文中，我们贡献了一个新的基准，即第一个面向多语言的知识产权测验（MoZIP），用于评估知识产权领域的法学硕士。 MoZIP基准测试包括三项具有挑战性的任务：IP多项选择测验（IPQuiz）、IP问答（IPQA）和专利匹配（PatentMatch）。此外，我们还开发了一种新的面向IP的多语言大语言模型（称为MoZi），它是基于BLOOMZ的模型，并使用多语言IP相关文本数据进行了监督微调。我们在 MoZIP 基准上评估了我们提出的 MoZi 模型和四个著名的 LLM（即 BLOOMZ、BELLE、ChatGLM 和 ChatGPT）。实验结果表明，MoZi 的性能明显优于 BLOOMZ、BELLE 和 ChatGLM，但与 ChatGPT 相比得分较低。值得注意的是，目前LLM在MoZIP基准上的表现还有很大的提升空间，即使是最强大的ChatGPT也没有达到及格水平。我们的源代码、数据和模型可在 \url{https://github.com/AI-for-Science/MoZi} 获取。

Title: From RAGs to riches: Using large language models to write documents for clinical trials

Authors: Nigel Markey, Ilyass El-Mansouri, Gaetan Rensonnet, Casper van Langen, Christoph Meier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16406
Pdf URL: https://arxiv.org/pdf/2402.16406
Copy Paste: [[2402.16406]] From RAGs to riches: Using large language models to write documents for clinical trials(https://arxiv.org/abs/2402.16406)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Clinical trials require numerous documents to be written -- protocols, consent forms, clinical study reports and others. Large language models (LLMs) offer the potential to rapidly generate first versions of these documents; however, there are concerns about the quality of their output. Here we report an evaluation of LLMs in generating parts of one such document, clinical trial protocols. We find that an off-the-shelf LLM delivers reasonable results, especially when assessing content relevance and the correct use of terminology. However, deficiencies remain: specifically clinical thinking and logic, and appropriate use of references. To improve performance, we used retrieval-augmented generation (RAG) to prompt an LLM with accurate up-to-date information. As a result of using RAG, the writing quality of the LLM improves substantially, which has implications for the practical usability of LLMs in clinical trial-related writing.
摘要：临床试验需要编写大量文件——方案、同意书、临床研究报告等。大语言模型 (LLM) 提供了快速生成这些文档的第一个版本的潜力，但人们对其输出的质量存在担忧。在此，我们报告了对 LLM 在生成此类文档（临床试验方案）的部分内容方面的评估。我们发现现成的法学硕士可以提供合理的结果，特别是在评估内容相关性和术语的正确使用时。然而，缺陷仍然存在：特别是临床思维和逻辑，以及参考文献的适当使用。为了提高性能，我们使用检索增强生成（RAG）来提示法学硕士提供准确的最新信息。由于使用RAG，法学硕士的写作质量大幅提高，这对法学硕士在临床试验相关写作中的实际可用性具有影响。

Title: Predicting Sustainable Development Goals Using Course Descriptions -- from LLMs to Conventional Foundation Models

Authors: Lev Kharlashkin, Melany Macias, Leo Huovinen, Mika Hämäläinen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16420
Pdf URL: https://arxiv.org/pdf/2402.16420
Copy Paste: [[2402.16420]] Predicting Sustainable Development Goals Using Course Descriptions -- from LLMs to Conventional Foundation Models(https://arxiv.org/abs/2402.16420)
Keywords: language model, llm
Abstract: We present our work on predicting United Nations sustainable development goals (SDG) for university courses. We use an LLM named PaLM 2 to generate training data given a noisy human-authored course description as input. We use this data to train several different smaller language models to predict SDGs for university courses. This work contributes to better university level adaptation of SDGs. The best performing model in our experiments was BART with an F1-score of 0.786.
摘要：我们介绍了我们为大学课程预测联合国可持续发展目标 (SDG) 的工作。我们使用名为 PaLM 2 的法学硕士来生成训练数据，并将嘈杂的人类编写的课程描述输入作为输入。我们使用这些数据来训练几种不同的较小语言模型，以预测大学课程的可持续发展目标。这项工作有助于大学层面更好地适应可持续发展目标。我们实验中表现最好的模型是 BART，F1 分数为 0.786。

Title: RoCoIns: Enhancing Robustness of Large Language Models through Code-Style Instructions

Authors: Yuansen Zhang, Xiao Wang, Zhiheng Xi, Han Xia, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16431
Pdf URL: https://arxiv.org/pdf/2402.16431
Copy Paste: [[2402.16431]] RoCoIns: Enhancing Robustness of Large Language Models through Code-Style Instructions(https://arxiv.org/abs/2402.16431)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have showcased remarkable capabilities in following human instructions. However, recent studies have raised concerns about the robustness of LLMs when prompted with instructions combining textual adversarial samples. In this paper, drawing inspiration from recent works showing that LLMs are sensitive to the design of the instructions, we utilize instructions in code style, which are more structural and less ambiguous, to replace typical natural language instructions. Through this conversion, we provide LLMs with more precise instructions and strengthen the robustness of LLMs. Moreover, under few-shot scenarios, we propose a novel method to compose in-context demonstrations using both clean and adversarial samples (the adversarial context method) to further boost the robustness of the LLMs. Experiments on eight robustness datasets show that our method consistently outperforms prompting LLMs with natural language instructions. For example, with gpt-3.5-turbo, our method achieves an improvement of 5.68% in test set accuracy and a reduction of 5.66 points in Attack Success Rate (ASR).
摘要：大型语言模型 (LLM) 在遵循人类指令方面表现出了非凡的能力。然而，最近的研究引起了人们对法学硕士在结合文本对抗样本的指令时的稳健性的担忧。在本文中，从最近的工作中汲取灵感，法学硕士对指令的设计很敏感，我们利用代码风格的指令来代替典型的自然语言指令，这种指令更具结构性且不那么模糊。通过这种转换，我们为LLM提供了更精确的指导，并增强了LLM的稳健性。此外，在少数场景下，我们提出了一种使用干净样本和对抗性样本（\textit{对抗性上下文方法}）来构建上下文演示的新方法，以进一步提高 LLM 的鲁棒性。对八个稳健性数据集的实验表明，我们的方法始终优于使用自然语言指令提示法学硕士。例如，使用 gpt-3.5-turbo，我们的方法在测试集准确率上提高了 5.68%，在攻击成功率 (ASR) 上降低了 5.66 个点。

Title: Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models

Authors: Tianyi Tang, Wenyang Luo, Haoyang Huang, Dongdong Zhang, Xiaolei Wang, Xin Zhao, Furu Wei, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16438
Pdf URL: https://arxiv.org/pdf/2402.16438
Copy Paste: [[2402.16438]] Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models(https://arxiv.org/abs/2402.16438)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora. It remains a challenging problem to explain the underlying mechanisms by which LLMs process multilingual texts. In this paper, we delve into the composition of Transformer architectures in LLMs to pinpoint language-specific regions. Specifically, we propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs. Based on LAPE, we conduct comprehensive experiments on two representative LLMs, namely LLaMA-2 and BLOOM. Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons, primarily situated in the models' top and bottom layers. Furthermore, we showcase the feasibility of "steering" the output language of LLMs by selectively activating or deactivating language-specific neurons. Our research provides important evidence for the understanding and exploration of the multilingual capabilities of LLMs.
摘要：大型语言模型 (LLM) 无需在专门策划的多语言并行语料库上进行预训练，即可展现出卓越的多语言能力。解释法学硕士处理多语言文本的基本机制仍然是一个具有挑战性的问题。在本文中，我们深入研究了法学硕士中 Transformer 架构的组成，以查明特定于语言的区域。特别地，我们提出了一种新的检测方法，即语言激活概率熵（LAPE），来识别法学硕士内的语言特定神经元。基于LAPE，我们对两个具有代表性的LLM，即LLaMA-2和BLOOM进行了综合实验。我们的研究结果表明，法学硕士处理特定语言的能力主要归功于一小部分神经元，主要位于模型的顶层和底层。此外，我们展示了通过选择性激活或停用语言特定神经元来“引导”法学硕士输出语言的可行性。我们的研究为理解和探索法学硕士的多语言能力提供了重要证据。
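
The idea behind language activation probability entropy can be sketched as follows. The abstract does not spell out the formula, so this assumes one plausible reading: each neuron has an activation probability per language, these are normalized into a distribution over languages, and a low entropy marks the neuron as language-specific. The function name and inputs are hypothetical.

```python
import math
from typing import Sequence


def lape(activation_probs: Sequence[float]) -> float:
    """Entropy of one neuron's activation probabilities across languages.

    activation_probs[k] is assumed to be the fraction of tokens in language k
    on which the neuron fires. The values are normalized to sum to 1 before
    the entropy (in nats) is taken.
    """
    total = sum(activation_probs)
    if total <= 0:
        raise ValueError("neuron never activates; entropy is undefined")
    entropy = 0.0
    for p in activation_probs:
        q = p / total
        if q > 0:  # 0 * log(0) is treated as 0
            entropy -= q * math.log(q)
    return entropy
```

A neuron that fires equally often for every language gets the maximum entropy, log of the number of languages, whereas a neuron that fires for a single language gets an entropy of zero and would be a candidate language-specific neuron under this reading.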

Title: ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

Authors: Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16444
Pdf URL: https://arxiv.org/pdf/2402.16444
Copy Paste: [[2402.16444]] ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors(https://arxiv.org/abs/2402.16444)
Keywords: language model, llm
Abstract: The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but a comprehensive approach for detecting safety issues within LLMs' responses in an aligned, customizable and explainable manner is still lacking. In this paper, we propose ShieldLM, an LLM-based safety detector, which aligns with general human safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also been shown to be effective in real-world situations as a safety evaluator for advanced LLMs. We release ShieldLM at https://github.com/thu-coai/ShieldLM to support accurate and explainable safety detection under various safety standards, contributing to the ongoing efforts to enhance the safety of LLMs.
摘要：近年来，大型语言模型 (LLM) 的安全性受到越来越多的关注，但仍然缺乏一种全面的方法来以一致、可定制和可解释的方式检测 LLM 响应中的安全问题。在本文中，我们提出了 ShieldLM，一种基于 LLM 的安全检测器，它符合一般人类安全标准，支持可定制的检测规则，并为其决策提供解释。为了训练 ShieldLM，我们编译了一个包含 14,387 个查询-响应对的大型双语数据集，根据各种安全标准注释响应的安全性。通过大量的实验，我们证明 ShieldLM 在四个测试集上超越了强大的基线，展示了卓越的可定制性和可解释性。除了在标准检测数据集上表现良好之外，ShieldLM 还被证明在现实世界中作为高级法学硕士的安全评估器是有效的。我们在 \url{https://github.com/thu-coai/ShieldLM} 发布了 ShieldLM，以支持各种安全标准下准确且可解释的安全检测，为增强法学硕士安全性的持续努力做出贡献。

Title: RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering

Authors: Zihan Zhang, Meng Fang, Ling Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16457
Pdf URL: https://arxiv.org/pdf/2402.16457
Copy Paste: [[2402.16457]] RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering(https://arxiv.org/abs/2402.16457)
Keywords: llm, prompt, retrieval-augmented generation
Abstract: Adaptive retrieval-augmented generation (ARAG) aims to dynamically determine the necessity of retrieval for queries instead of retrieving indiscriminately to enhance the efficiency and relevance of the sourced information. However, previous works largely overlook the evaluation of ARAG approaches, leading to their effectiveness being understudied. This work presents a benchmark, RetrievalQA, comprising 1,271 short-form questions covering new world and long-tail knowledge. The knowledge necessary to answer the questions is absent from LLMs; therefore, external information must be retrieved to answer correctly. This makes RetrievalQA a suitable testbed to evaluate existing ARAG methods. We observe that calibration-based methods heavily rely on threshold tuning, while vanilla prompting is inadequate for guiding LLMs to make reliable retrieval decisions. Based on our findings, we propose Time-Aware Adaptive Retrieval (TA-ARE), a simple yet effective method that helps LLMs assess the necessity of retrieval without calibration or additional training. The dataset and code will be available at https://github.com/hyintell/RetrievalQA.
摘要：自适应检索增强生成（ARAG）旨在动态确定查询检索的必要性，而不是不加区别地检索，以提高源信息的效率和相关性。然而，以前的工作很大程度上忽视了 ARAG 方法的评估，导致其有效性未被充分研究。这项工作提出了一个基准 RetrievalQA，其中包含 1,271 个涵盖新世界和长尾知识的简短问题。法学硕士缺乏回答问题所需的知识；因此，必须检索外部信息才能正确回答。这使得 RetrievalQA 成为评估现有 ARAG 方法的合适测试平台。我们观察到基于校准的方法严重依赖阈值调整，而普通提示不足以指导法学硕士做出可靠的检索决策。根据我们的研究结果，我们提出了时间感知自适应检索（TA-ARE），这是一种简单而有效的方法，可以帮助法学硕士评估检索的必要性，而无需校准或额外培训。数据集和代码可在 \url{https://github.com/hyintell/RetrievalQA} 获取

Title: Defending LLMs against Jailbreaking Attacks via Backtranslation

Authors: Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16459
Pdf URL: https://arxiv.org/pdf/2402.16459
Copy Paste: [[2402.16459]] Defending LLMs against Jailbreaking Attacks via Backtranslation(https://arxiv.org/abs/2402.16459)
Keywords: language model, llm, prompt
Abstract: Although many large language models (LLMs) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks, which rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks by "backtranslation". Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that can lead to the response. The inferred prompt is called the backtranslated prompt, which tends to reveal the actual intent of the original prompt, since it is generated based on the LLM's response and is not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. We explain that the proposed defense provides several benefits in terms of effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines in cases that are hard for the baselines, and that it has little impact on the generation quality for benign input prompts.
摘要：尽管许多大型语言模型（LLM）经过训练可以拒绝有害请求，但它们仍然容易受到越狱攻击，越狱攻击会重写原始提示以隐藏其有害意图。在本文中，我们提出了一种通过“反向翻译”来保护法学硕士免受越狱攻击的新方法。具体来说，给定目标 LLM 根据输入提示生成的初始响应，我们的反向翻译会提示语言模型推断出可以导致响应的输入提示。推断出的提示称为反向翻译提示，它往往会揭示原始提示的实际意图，因为它是根据 LLM 的响应生成的，并且不是由攻击者直接操纵的。然后，我们在反向翻译的提示上再次运行目标 LLM，如果模型拒绝反向翻译的提示，我们将拒绝原始提示。我们解释说，拟议的防御措施在其有效性和效率方面提供了多种好处。我们凭经验证明，在基线困难的情况下，我们的防御明显优于基线，并且我们的防御对良性输入提示的生成质量也几乎没有影响。
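
The defense described in the abstract above can be sketched as a small wrapper around the target model. This is an illustrative reading of the abstract only: the three callables, the fixed refusal message, and the early return when the first response is already a refusal are assumptions, not details taken from the paper.

```python
from typing import Callable

# Hypothetical canned refusal returned when the defense triggers.
REFUSAL_MESSAGE = "I'm sorry, but I cannot assist with that request."


def defend_with_backtranslation(
    prompt: str,
    target_llm: Callable[[str], str],
    backtranslate: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> str:
    """Refuse the original prompt if the backtranslated prompt is refused."""
    response = target_llm(prompt)
    if is_refusal(response):
        return response  # assumption: an existing refusal is passed through
    # Infer a prompt that could have produced this response.
    inferred_prompt = backtranslate(response)
    # Run the target model again on the inferred prompt.
    if is_refusal(target_llm(inferred_prompt)):
        return REFUSAL_MESSAGE
    return response
```

Because the inferred prompt is built from the response rather than from the attacker's text, adversarial wrapping in the original prompt does not carry over to the second query.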

Title: Unveiling Vulnerability of Self-Attention

Authors: Khai Jiet Liong, Hongqiu Wu, Hai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16470
Pdf URL: https://arxiv.org/pdf/2402.16470
Copy Paste: [[2402.16470]] Unveiling Vulnerability of Self-Attention(https://arxiv.org/abs/2402.16470)
Keywords: language model
Abstract: Pre-trained language models (PLMs) are shown to be vulnerable to minor word changes, which poses a big threat to real-world systems. While previous studies directly focus on manipulating word inputs, they are limited by their means of generating adversarial samples, lacking generalization to versatile real-world attacks. This paper studies the basic structure of transformer-based PLMs, the self-attention (SA) mechanism. (1) We propose a powerful perturbation technique, HackAttend, which perturbs the attention scores within the SA matrices via meticulously crafted attention masks. We show that state-of-the-art PLMs are highly vulnerable: minor attention perturbations (1%) can produce a very high attack success rate (98%). Our paper expands the conventional text attack of word perturbations to more general structural perturbations. (2) We introduce S-Attend, a novel smoothing technique that effectively makes SA robust via structural perturbations. We empirically demonstrate that this simple yet effective technique achieves robust performance on par with adversarial training when facing various text attackers. Code is publicly available at github.com/liongkj/HackAttend.
摘要：事实证明，预训练语言模型 (PLM) 很容易受到微小单词变化的影响，这对现实世界的系统构成了巨大威胁。虽然之前的研究直接关注于操纵单词输入，但它们受到生成对抗性样本的方式的限制，缺乏对多功能现实世界攻击的泛化。本文研究了基于 Transformer 的 PLM 的基本结构，即自注意力（SA）机制。 (1) 我们提出了一种强大的扰动技术 \textit{HackAttend}，它通过精心设计的注意力掩模来扰动 SA 矩阵内的注意力分数。我们表明，最先进的 PLM 存在严重漏洞，轻微的注意力扰动 $(1\%)$ 就可以产生非常高的攻击成功率 $(98\%)$。我们的论文将单词扰动的传统文本攻击扩展到更一般的结构扰动。 (2) 我们引入了 \textit{S-Attend}，一种新颖的平滑技术，可以通过结构扰动有效地使 SA 变得鲁棒。我们凭经验证明，这种简单而有效的技术在面对各种文本攻击者时可以实现与对抗训练相当的稳健性能。代码可在 \url{github.com/liongkj/HackAttend} 上公开获取。

Title: mEdIT: Multilingual Text Editing via Instruction Tuning

Authors: Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16472
Pdf URL: https://arxiv.org/pdf/2402.16472
Copy Paste: [[2402.16472]] mEdIT: Multilingual Text Editing via Instruction Tuning(https://arxiv.org/abs/2402.16472)
Keywords: language model, llm
Abstract: We introduce mEdIT, a multi-lingual extension to CoEdIT -- the recent state-of-the-art text editing models for writing assistance. mEdIT models are trained by fine-tuning multi-lingual large, pre-trained language models (LLMs) via instruction tuning. They are designed to take instructions from the user specifying the attributes of the desired text in the form of natural language instructions, such as Grammatik korrigieren (German) or Parafrasee la oración (Spanish). We build mEdIT by curating data from multiple publicly available human-annotated text editing datasets for three text editing tasks (Grammatical Error Correction (GEC), Text Simplification, and Paraphrasing) across diverse languages belonging to six different language families. We detail the design and training of mEdIT models and demonstrate their strong performance on many multi-lingual text editing benchmarks against other multilingual LLMs. We also find that mEdIT generalizes effectively to new languages over multilingual baselines. We publicly release our data, code, and trained models at https://github.com/vipulraheja/medit.
摘要：我们介绍 mEdit，它是 CoEdIT 的多语言扩展——CoEdIT 是最近最先进的用于写作辅助的文本编辑模型。 mEdit 模型是通过指令调整对多语言大型预训练语言模型 (LLM) 进行微调来训练的。它们旨在接受用户的指令，以自然语言指令的形式指定所需文本的属性，例如 Grammatik korrigieren（德语）或 Parafrasee la oraci\'on（西班牙语）。我们通过从多个公开的人工注释文本编辑数据集中收集数据来构建 mEditIT，这些数据用于跨属于六个不同语系的不同语言的三个文本编辑任务（语法错误纠正 (GEC)、文本简化和释义）。我们详细介绍了 mEdit 模型的设计和训练，并在许多多语言文本编辑基准测试中与其他多语言法学硕士相比展示了它们的强大性能。我们还发现 mEditIT 可以在多语言基线上有效地推广到新语言。我们在 https://github.com/vipulraheja/medit 公开发布我们的数据、代码和训练模型。

Title: On Languaging a Simulation Engine

Authors: Han Liu, Liantang Li
Subjects: cs.AI, cs.CE, cs.CL
Abstract URL: https://arxiv.org/abs/2402.16482
Pdf URL: https://arxiv.org/pdf/2402.16482
Copy Paste: [[2402.16482]] On Languaging a Simulation Engine(https://arxiv.org/abs/2402.16482)
Keywords: language model, chat
Abstract: Language model intelligence is revolutionizing the way we program materials simulations. However, the diversity of simulation scenarios renders it challenging to precisely transform human language into a tailored simulator. Here, using three functionalized types of language model, we propose a language-to-simulation (Lang2Sim) framework that enables interactive navigation on languaging a simulation engine, by taking a scenario instance of water sorption in porous matrices. Unlike line-by-line coding of a target simulator, the language models interpret each simulator as an assembly of an invariant tool function and its variant input-output pair. Lang2Sim enables the precise transformation of textual descriptions by functionalizing and sequencing language models that, respectively, rationalize the tool categorization, customize its input-output combinations, and distill the simulator input into an executable format. Importantly, depending on its functionalized type, each language model features a distinct processing of chat history to best balance its memory limit and information completeness, thus applying the model's intelligence to the unstructured nature of human requests. Overall, this work establishes language models as an intelligent platform to unlock the era of languaging a simulation engine.
摘要：语言模型智能正在彻底改变我们对材料模拟进行编程的方式。然而，模拟场景的多样性使得将人类语言精确地转换为定制的模拟器具有挑战性。在这里，我们使用三种功能化类型的语言模型，提出了一种语言到模拟（Lang2Sim）框架，通过多孔基质中水吸附的场景实例，该框架可以在模拟引擎语言上进行交互式导航。与目标模拟器的逐行编码不同，语言模型将每个模拟器解释为不变工具函数及其变体输入输出对的集合。 Lang2Sim 通过分别对语言模型进行功能化和序列化、合理化工具分类、定制其输入输出组合以及将模拟器输入提炼为可执行格式来实现文本描述的精确转换。重要的是，根据其功能化类型，每种语言模型都具有对聊天历史记录的独特处理，以最好地平衡其内存限制和信息完整性，从而利用模型智能来处理人类请求的非结构化性质。总的来说，这项工作建立了语言模型作为一个智能平台，以开启语言模拟引擎的时代。

Title: LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments

Authors: Junzhe Chen, Xuming Hu, Shuodi Liu, Shiyu Huang, Wei-Wei Tu, Zhaofeng He, Lijie Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16499
Pdf URL: https://arxiv.org/pdf/2402.16499
Copy Paste: [[2402.16499]] LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments(https://arxiv.org/abs/2402.16499)
Keywords: language model, llm, agent
Abstract: Recent advancements in large language models (LLMs) have revealed their potential for achieving autonomous agents possessing human-level intelligence. However, existing benchmarks for evaluating LLM agents either use static datasets, potentially leading to data leakage, or focus only on single-agent scenarios, overlooking the complexities of multi-agent interactions. There is a lack of a benchmark that evaluates the diverse capabilities of LLM agents in multi-agent, dynamic environments. To this end, we introduce LLMArena, a novel and easily extensible framework for evaluating the diverse capabilities of LLMs in multi-agent dynamic environments. LLMArena encompasses seven distinct gaming environments, employing TrueSkill scoring to assess crucial abilities in LLM agents, including spatial reasoning, strategic planning, numerical reasoning, risk assessment, communication, opponent modeling, and team collaboration. We conduct an extensive experiment and human evaluation among different sizes and types of LLMs, showing that LLMs still have a significant journey ahead in their development towards becoming fully autonomous agents, especially in opponent modeling and team collaboration. We hope LLMArena can guide future research towards enhancing these capabilities in LLMs, ultimately leading to more sophisticated and practical applications in dynamic, multi-agent settings. The code and data will be available.
摘要：大语言模型（LLM）的最新进展揭示了它们实现具有人类水平智能的自主代理的潜力。然而，评估LLM代理的现有基准要么使用静态数据集，可能导致数据泄漏，要么只关注单代理场景，忽视多代理交互的复杂性。缺乏评估 LLM 代理在多代理、动态环境中的多样化能力的基准。为此，我们引入了 LLMArena，这是一种新颖且易于扩展的框架，用于评估多智能体动态环境中 LLM 的各种功能。 LLMArena 包含七个不同的游戏环境，采用 Trueskill 评分来评估 LLM 代理的关键能力，包括空间推理、战略规划、数字推理、风险评估、沟通、对手建模和团队协作。我们对不同规模和类型的法学硕士进行了广泛的实验和人工评估，结果表明法学硕士在成为完全自主的代理方面仍有很长的路要走，特别是在对手建模和团队协作方面。我们希望 LLMArena 能够指导未来的研究，以增强法学硕士的这些能力，最终在动态、多智能体环境中实现更复杂和更实际的应用。代码和数据将可用。

Title: Memory GAPS: Would LLM pass the Tulving Test?

Authors: Jean-Marie Chauvet
Subjects: cs.AI
Abstract URL: https://arxiv.org/abs/2402.16505
Pdf URL: https://arxiv.org/pdf/2402.16505
Copy Paste: [[2402.16505]] Memory GAPS: Would LLM pass the Tulving Test?(https://arxiv.org/abs/2402.16505)
Keywords: llm
Abstract: The Tulving Test was designed to investigate memory performance in recognition and recall tasks. Its results help assess the relevance of the "Synergistic Ecphory Model" of memory and similar RK paradigms in human performance. This paper starts investigating whether the more than forty-year-old framework sheds some light on LLMs' acts of remembering.

Title: LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification

Authors: Yiping Song, Juhua Zhang, Zhiliang Tian, Yuxin Yang, Minlie Huang, Dongsheng Li
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2402.16515
Pdf URL: https://arxiv.org/pdf/2402.16515
Copy Paste: [[2402.16515]] LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification(https://arxiv.org/abs/2402.16515)
Keywords: llm
Abstract: As sufficient data are not always publicly accessible for model training, researchers exploit limited data with advanced learning algorithms or expand the dataset via data augmentation (DA). Conducting DA in private domains requires privacy protection approaches (e.g., anonymization and perturbation), but those methods cannot provide protection guarantees. Differential privacy (DP) learning methods theoretically bound the protection but are not skilled at generating pseudo text samples with large models. In this paper, we transfer the DP-based pseudo sample generation task to a DP-based generated samples discrimination task, where we propose a DP-based DA method with an LLM and a DP-based discriminator for text classification on private domains. We construct a knowledge distillation model as the DP-based discriminator: teacher models, accessing private data, teach students how to select private samples with calibrated noise to achieve DP. To constrain the distribution of DA's generation, we propose a DP-based tutor that models the noised private distribution and controls samples' generation with a low privacy cost. We theoretically analyze our model's privacy protection and empirically verify our model.
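
The phrase "calibrated noise" in the abstract above can be illustrated with the textbook Laplace mechanism. This is only a minimal sketch of that generic mechanism, not the authors' method: the abstract does not say which noise distribution they use, and the function names, the sensitivity value, and the epsilon value below are all hypothetical.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon used by the Laplace mechanism."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_score(true_score, sensitivity, epsilon, rng):
    """Release a score with noise calibrated to the privacy budget epsilon."""
    return true_score + laplace_noise(laplace_scale(sensitivity, epsilon), rng)


# Hypothetical values: a score with sensitivity 1.0 and a budget of 0.5.
rng = random.Random(0)
released = noisy_score(0.8, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller epsilon gives a larger scale and therefore noisier releases, which is the trade-off behind the "low privacy cost" the abstract mentions.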

Title: Label Learning Method Based on Tensor Projection

Authors: Jing Li, Quanxue Gao, Qianqian Wang, Cheng Deng, Deyan Xie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.16544
Pdf URL: https://arxiv.org/pdf/2402.16544
Copy Paste: [[2402.16544]] Label Learning Method Based on Tensor Projection(https://arxiv.org/abs/2402.16544)
Keywords: llm, chat
Abstract: Multi-view clustering methods based on anchor graphs have attracted wide attention due to their high efficiency and effectiveness. In order to avoid post-processing, most of the existing anchor graph-based methods learn bipartite graphs with connected components. However, such methods have high requirements on parameters, and in some cases it may not be possible to obtain bipartite graphs with clear connected components. To this end, we propose a label learning method based on tensor projection (LLMTP). Specifically, we project the anchor graph into the label space through an orthogonal projection matrix to obtain cluster labels directly. Considering that the spatial structure information of multi-view data may be ignored to a certain extent when projected in different views separately, we extend the matrix projection transformation to tensor projection, so that the spatial structure information between views can be fully utilized. In addition, we introduce the tensor Schatten $p$-norm regularization to make the clustering label matrices of different views as consistent as possible. Extensive experiments have proved the effectiveness of the proposed method.

Title: Q-FOX Learning: Breaking Tradition in Reinforcement Learning

Authors: Mahmood Alqaseer, Yossra H. Ali, Tarik A. Rashid
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2402.16562
Pdf URL: https://arxiv.org/pdf/2402.16562
Copy Paste: [[2402.16562]] Q-FOX Learning: Breaking Tradition in Reinforcement Learning(https://arxiv.org/abs/2402.16562)
Keywords: agent
Abstract: Reinforcement learning (RL) is a subset of artificial intelligence (AI) where agents learn the best action by interacting with the environment, making it suitable for tasks that do not require labeled data or direct supervision. Hyperparameter (HP) tuning refers to choosing the best parameters that lead to optimal solutions in RL algorithms. Manual or random tuning of the HPs may be a crucial process because variations in these parameters lead to changes in the overall learning aspects and different rewards. In this paper, a novel and automatic HP-tuning method called Q-FOX is proposed. This uses both the FOX optimizer, a new optimization method inspired by nature that mimics red foxes' hunting behavior, and the commonly used, easy-to-implement RL Q-learning algorithm to solve the problem of HP tuning. Moreover, a new objective function is proposed which prioritizes the reward over the mean squared error (MSE) and learning time (steps). Q-FOX has been evaluated on two OpenAI Gym environment control tasks: Cart Pole and Frozen Lake. It achieved greater cumulative rewards than HP tuning with other optimizers, such as PSO, GA, Bee, or randomly selected HPs. The cumulative reward for the Cart Pole task was 32.08, and for the Frozen Lake task was 0.95. Despite the robustness of Q-FOX, it has limitations. It cannot be used directly in real-world problems before choosing the HPs in a simulation environment because its processes work iteratively, making it time-consuming. The results indicate that Q-FOX has played an essential role in HP tuning for RL algorithms to effectively solve different control tasks.
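
For readers unfamiliar with the Q-learning algorithm named in the abstract above, the standard tabular update can be sketched as follows. This is the generic textbook rule, not the Q-FOX code. The learning rate `alpha` and discount factor `gamma` are shown as examples of the kind of hyperparameter an HP-tuning method would search over; that choice, along with the function name and toy table, is an assumption for illustration.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new value.

    q is a list of rows, one row per state, one entry per action.
    """
    best_next = max(q[next_state])           # greedy estimate of future value
    td_target = reward + gamma * best_next   # bootstrapped target
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Toy table: two states, two actions, all values initialised to zero.
q_table = [[0.0, 0.0], [0.0, 0.0]]
new_value = q_update(q_table, state=0, action=1, reward=1.0,
                     next_state=1, alpha=0.5, gamma=0.9)
print(new_value)  # 0.5
```

Changing `alpha` or `gamma` changes how quickly and how far-sightedly the table converges, which is why tuning them affects the cumulative reward.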

Title: Aligning Large Language Models to a Domain-specific Graph Database

Authors: Yuanyuan Liang, Keren Tan, Tingyu Xie, Wenbiao Tao, Siyuan Wang, Yunshi Lan, Weining Qian
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2402.16567
Pdf URL: https://arxiv.org/pdf/2402.16567
Copy Paste: [[2402.16567]] Aligning Large Language Models to a Domain-specific Graph Database(https://arxiv.org/abs/2402.16567)
Keywords: language model, gpt, llm, chat
Abstract: Graph Databases (Graph DB) are widely applied in various fields, including finance, social networks, and medicine. However, translating Natural Language (NL) into the Graph Query Language (GQL), commonly known as NL2GQL, proves to be challenging due to its inherent complexity and specialized nature. Some approaches have sought to utilize Large Language Models (LLMs) to address analogous tasks like text2SQL. Nevertheless, when it comes to NL2GQL tasks on a particular domain, the absence of domain-specific NL-GQL data pairs makes it difficult to establish alignment between LLMs and the graph DB. To address this challenge, we propose a well-defined pipeline. Specifically, we utilize ChatGPT to create NL-GQL data pairs based on the given graph DB with self-instruct. Then, we use the created data to fine-tune LLMs, thereby achieving alignment between LLMs and the graph DB. Additionally, during inference, we propose a method that extracts the schema relevant to the queried NL as the input context to guide LLMs in generating accurate GQLs. We evaluate our method on two constructed datasets derived from graph DBs in the finance and medicine domains, namely FinGQL and MediGQL. Experimental results demonstrate that our method significantly outperforms a set of baseline methods, with improvements of 5.90 and 6.36 absolute points on EM, and 6.00 and 7.09 absolute points on EX, respectively.

Title: Two-stage Generative Question Answering on Temporal Knowledge Graph Using Large Language Models

Authors: Yifu Gao, Linbo Qiao, Zhigang Kan, Zhihua Wen, Yongquan He, Dongsheng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16568
Pdf URL: https://arxiv.org/pdf/2402.16568
Copy Paste: [[2402.16568]] Two-stage Generative Question Answering on Temporal Knowledge Graph Using Large Language Models(https://arxiv.org/abs/2402.16568)
Keywords: language model, llm
Abstract: Temporal knowledge graph question answering (TKGQA) is a significantly challenging task, due to the temporal constraints hidden in questions and the answers sought from dynamic structured knowledge. Although large language models (LLMs) have made considerable progress in their reasoning ability over structured data, their application to the TKGQA task is a relatively unexplored area. This paper first proposes a novel generative temporal knowledge graph question answering framework, GenTKGQA, which guides LLMs to answer temporal questions through two phases: Subgraph Retrieval and Answer Generation. First, we exploit the LLM's intrinsic knowledge to mine temporal constraints and structural links in the questions without extra training, thus narrowing down the subgraph search space in both temporal and structural dimensions. Next, we design virtual knowledge indicators to fuse the graph neural network signals of the subgraph and the text representations of the LLM in a non-shallow way, which helps the open-source LLM deeply understand the temporal order and structural dependencies among the retrieved facts through instruction tuning. Experimental results demonstrate that our model outperforms state-of-the-art baselines, even achieving 100% on the metrics for the simple question type.

Title: Multi-Bit Distortion-Free Watermarking for Large Language Models

Authors: Massieh Kordi Boroujeny, Ya Jiang, Kai Zeng, Brian Mark
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16578
Pdf URL: https://arxiv.org/pdf/2402.16578
Copy Paste: [[2402.16578]] Multi-Bit Distortion-Free Watermarking for Large Language Models(https://arxiv.org/abs/2402.16578)
Keywords: language model
Abstract: Methods for watermarking large language models have been proposed that distinguish AI-generated text from human-generated text by slightly altering the model output distribution, but they also distort the quality of the text, exposing the watermark to adversarial detection. More recently, distortion-free watermarking methods were proposed that require a secret key to detect the watermark. The prior methods generally embed zero-bit watermarks that do not provide additional information beyond tagging a text as being AI-generated. We extend an existing zero-bit distortion-free watermarking method by embedding multiple bits of meta-information as part of the watermark. We also develop a computationally efficient decoder that extracts the embedded information from the watermark with low bit error rate.

Title: Rethinking Negative Instances for Generative Named Entity Recognition

Authors: Yuyang Ding, Juntao Li, Pinzheng Wang, Zecheng Tang, Bowen Yan, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16602
Pdf URL: https://arxiv.org/pdf/2402.16602
Copy Paste: [[2402.16602]] Rethinking Negative Instances for Generative Named Entity Recognition(https://arxiv.org/abs/2402.16602)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities for generalizing in unseen tasks. In the Named Entity Recognition (NER) task, recent advancements have seen the remarkable improvement of LLMs in a broad range of entity domains via instruction tuning, by adopting entity-centric schema. In this work, we explore the potential enhancement of the existing methods by incorporating negative instances into training. Our experiments reveal that negative instances contribute to remarkable improvements by (1) introducing contextual information, and (2) clearly delineating label boundaries. Furthermore, we introduce a novel and efficient algorithm named Hierarchical Matching, which is tailored to transform unstructured predictions into structured entities. By integrating these components, we present GNER, a Generative NER system that shows improved zero-shot performance across unseen entity domains. Our comprehensive evaluation illustrates our system's superiority, surpassing state-of-the-art (SoTA) methods by 11 $F_1$ score in zero-shot evaluation.

Title: Understanding the Dataset Practitioners Behind Large Language Model Development

Authors: Crystal Qian, Emily Reif, Minsuk Kahng
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2402.16611
Pdf URL: https://arxiv.org/pdf/2402.16611
Copy Paste: [[2402.16611]] Understanding the Dataset Practitioners Behind Large Language Model Development(https://arxiv.org/abs/2402.16611)
Keywords: language model, llm
Abstract: As large language models (LLMs) become more advanced and impactful, it is increasingly important to scrutinize the data that they rely upon and produce. What is it to be a dataset practitioner doing this work? We approach this in two parts: first, we define the role of "dataset practitioner" by performing a retrospective analysis on the responsibilities of teams contributing to LLM development at Google. Then, we conduct semi-structured interviews with a cross-section of these practitioners (N=10). We find that data quality is the top priority. To evaluate data quality, practitioners either rely on their own intuition or write custom evaluation logic. There is a lack of consensus across practitioners on what quality is and how to evaluate it. We discuss potential reasons for this phenomenon and opportunities for alignment.

Title: Long-Context Language Modeling with Parallel Context Encoding

Authors: Howard Yen, Tianyu Gao, Danqi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16617
Pdf URL: https://arxiv.org/pdf/2402.16617
Copy Paste: [[2402.16617]] Long-Context Language Modeling with Parallel Context Encoding(https://arxiv.org/abs/2402.16617)
Keywords: language model, llm, long context, chat
Abstract: Extending large language models (LLMs) to process longer inputs is crucial for numerous applications. However, the considerable computational cost of transformers, coupled with limited generalization of positional encoding, restricts the size of their context window. We introduce Context Expansion with Parallel Encoding (CEPE), a framework that can be applied to any existing decoder-only LLMs to extend their context window. CEPE adopts a small encoder to process long inputs chunk by chunk and enables the frozen decoder to leverage additional contexts via cross-attention. CEPE is efficient, generalizable, and versatile: trained with 8K-token documents, CEPE extends the context window of LLAMA-2 to 128K tokens, offering 10x the throughput with only 1/6 of the memory. CEPE yields strong performance on language modeling and in-context learning. CEPE also excels in retrieval-augmented applications, while existing long-context models degenerate with retrieved contexts. We further introduce a CEPE variant that can extend the context window of instruction-tuned models with only unlabeled data, and showcase its effectiveness on LLAMA-2-CHAT, leading to a strong instruction-following model that can leverage very long context on downstream tasks.

Title: GenAINet: Enabling Wireless Collective Intelligence via Knowledge Transfer and Reasoning

Authors: Hang Zou, Qiyang Zhao, Lina Bariah, Yu Tian, Mehdi Bennis, Samson Lasaulce, Merouane Debbah, Faouzi Bader
Subjects: cs.AI, cs.NI, eess.SP
Abstract URL: https://arxiv.org/abs/2402.16631
Pdf URL: https://arxiv.org/pdf/2402.16631
Copy Paste: [[2402.16631]] GenAINet: Enabling Wireless Collective Intelligence via Knowledge Transfer and Reasoning(https://arxiv.org/abs/2402.16631)
Keywords: agent
Abstract: Generative artificial intelligence (GenAI) and communication networks are expected to have groundbreaking synergies in 6G. Connecting GenAI agents over a wireless network can potentially unleash the power of collective intelligence and pave the way for artificial general intelligence (AGI). However, current wireless networks are designed as a "data pipe" and are not suited to accommodate and leverage the power of GenAI. In this paper, we propose the GenAINet framework in which distributed GenAI agents communicate knowledge (high-level concepts or abstracts) to accomplish arbitrary tasks. We first provide a network architecture integrating GenAI capabilities to manage both network protocols and applications. Building on this, we investigate effective communication and reasoning problems by proposing a semantic-native GenAINet. Specifically, GenAI agents extract semantic concepts from multi-modal raw data and build a knowledge base representing their semantic relations, which is retrieved by GenAI models for planning and reasoning. Under this paradigm, an agent can learn fast from other agents' experience for making better decisions with efficient communications. Furthermore, we conduct two case studies: in wireless device query, we show that extracting and transferring knowledge can improve query accuracy with reduced communication; and in wireless power control, we show that distributed agents can improve decisions via collaborative reasoning. Finally, we argue that developing a hierarchical semantic-level Telecom world model is a key path towards a network of collective intelligence.

Title: ESG Sentiment Analysis: comparing human and language model performance including GPT

Authors: Karim Derrick
Subjects: cs.CL, cs.CE, cs.CY
Abstract URL: https://arxiv.org/abs/2402.16650
Pdf URL: https://arxiv.org/pdf/2402.16650
Copy Paste: [[2402.16650]] ESG Sentiment Analysis: comparing human and language model performance including GPT(https://arxiv.org/abs/2402.16650)
Keywords: language model, gpt
Abstract: In this paper we explore the challenges of measuring sentiment in relation to Environmental, Social and Governance (ESG) social media. ESG has grown in importance in recent years with a surge in interest from the financial sector, and the performance of many businesses has become based in part on their ESG-related reputations. The use of sentiment analysis to measure ESG-related reputation has developed, and with it interest in the use of machines to do so. The era of digital media has created an explosion of new media sources, driven by the growth of social media platforms. This growing data environment has become an excellent source for behavioural insight studies across many disciplines, including politics, healthcare and market research. Our study seeks to compare human performance with the cutting edge in machine performance in the measurement of ESG-related sentiment. To this end, researchers classify the sentiment of 150 tweets and a reliability measure is made. A gold standard data set is then established based on the consensus of 3 researchers, and this data set is then used to measure the performance of different machine approaches: one based on the VADER dictionary approach to sentiment classification and then multiple language model approaches, including Llama2, T5, Mistral, Mixtral, FINBERT, GPT3.5 and GPT4.

Title: GigaPevt: Multimodal Medical Assistant

Authors: Pavel Blinov, Konstantin Egorov, Ivan Sviridov, Nikolay Ivanov, Stepan Botman, Evgeniy Tagin, Stepan Kudin, Galina Zubkova, Andrey Savchenko
Subjects: cs.AI, cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2402.16654
Pdf URL: https://arxiv.org/pdf/2402.16654
Copy Paste: [[2402.16654]] GigaPevt: Multimodal Medical Assistant(https://arxiv.org/abs/2402.16654)
Keywords: language model
Abstract: Building an intelligent and efficient medical assistant is still a challenging AI problem. The major limitation comes from the data modality scarceness, which reduces comprehensive patient perception. This demo paper presents the GigaPevt, the first multimodal medical assistant that combines the dialog capabilities of large language models with specialized medical models. Such an approach shows immediate advantages in dialog quality and metric performance, with a 1.18% accuracy improvement in the question-answering task.

Title: RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation

Authors: Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, Xiaoyin Che, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16667
Pdf URL: https://arxiv.org/pdf/2402.16667
Copy Paste: [[2402.16667]] RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation(https://arxiv.org/abs/2402.16667)
Keywords: language model, llm, agent
Abstract: Generative models have demonstrated considerable potential in software engineering, particularly in tasks such as code generation and debugging. However, their utilization in the domain of code documentation generation remains underexplored. To this end, we introduce RepoAgent, a large language model powered open-source framework aimed at proactively generating, maintaining, and updating code documentation. Through both qualitative and quantitative evaluations, we have validated the effectiveness of our approach, showing that RepoAgent excels in generating high-quality repository-level documentation. The code and results are publicly accessible at https://github.com/OpenBMB/RepoAgent.

Title: StructLM: Towards Building Generalist Models for Structured Knowledge Grounding

Authors: Alex Zhuang, Ge Zhang, Tianyu Zheng, Xinrun Du, Junjie Wang, Weiming Ren, Stephen W. Huang, Jie Fu, Xiang Yue, Wenhu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16671
Pdf URL: https://arxiv.org/pdf/2402.16671
Copy Paste: [[2402.16671]] StructLM: Towards Building Generalist Models for Structured Knowledge Grounding(https://arxiv.org/abs/2402.16671)
Keywords: language model, gpt, llm, chat
Abstract: Structured data sources, such as tables, graphs, and databases, are ubiquitous knowledge sources. Despite the demonstrated capabilities of large language models (LLMs) on plain text, their proficiency in interpreting and utilizing structured data remains limited. Our investigation reveals a notable deficiency in LLMs' ability to process structured data, e.g., ChatGPT lags behind state-of-the-art (SoTA) model by an average of 35%. To augment the Structured Knowledge Grounding (SKG) capabilities in LLMs, we have developed a comprehensive instruction tuning dataset comprising 1.1 million examples. Utilizing this dataset, we train a series of models, referred to as StructLM, based on the Code-LLaMA architecture, ranging from 7B to 34B parameters. Our StructLM series surpasses task-specific models on 14 out of 18 evaluated datasets and establishes new SoTA achievements on 7 SKG tasks. Furthermore, StructLM demonstrates exceptional generalization across 6 novel SKG tasks. Contrary to expectations, we observe that scaling model size offers marginal benefits, with StructLM-34B showing only slight improvements over StructLM-7B. This suggests that structured knowledge grounding is still a challenging task and requires more innovative design to push to a new level.

Title: Adaptation of Biomedical and Clinical Pretrained Models to French Long Documents: A Comparative Study

Authors: Adrien Bazoge, Emmanuel Morin, Beatrice Daille, Pierre-Antoine Gourraud
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16689
Pdf URL: https://arxiv.org/pdf/2402.16689
Copy Paste: [[2402.16689]] Adaptation of Biomedical and Clinical Pretrained Models to French Long Documents: A Comparative Study(https://arxiv.org/abs/2402.16689)
Keywords: language model
Abstract: Recently, pretrained language models based on BERT have been introduced for the French biomedical domain. Although these models have achieved state-of-the-art results on biomedical and clinical NLP tasks, they are constrained by a limited input sequence length of 512 tokens, which poses challenges when applied to clinical notes. In this paper, we present a comparative study of three adaptation strategies for long-sequence models, leveraging the Longformer architecture. We conducted evaluations of these models on 16 downstream tasks spanning both biomedical and clinical domains. Our findings reveal that further pre-training an English clinical model with French biomedical texts can outperform both converting a French biomedical BERT to the Longformer architecture and pre-training a French biomedical Longformer from scratch. The results underscore that long-sequence French biomedical models improve performance across most downstream tasks regardless of sequence length, but BERT-based models remain the most efficient for named entity recognition tasks.

Title: HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

Authors: Qiwei Peng, Yekun Chai, Xuhong Li
Subjects: cs.CL, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2402.16694
Pdf URL: https://arxiv.org/pdf/2402.16694
Copy Paste: [[2402.16694]] HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization(https://arxiv.org/abs/2402.16694)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have made significant progress in generating code from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual code or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.
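
The counts quoted in the abstract above can be cross-checked with a few lines of arithmetic. The sketch below uses only the three numbers stated in the abstract. The per-pair figure it derives assumes prompts are spread evenly across every NL-PL pair, which the "parallel data" wording suggests but the abstract does not state outright.

```python
natural_languages = 23       # NLs, from the abstract
programming_languages = 12   # PLs, from the abstract
total_prompts = 22_080       # prompts, from the abstract

# Number of (NL, PL) combinations covered by the benchmark.
language_pairs = natural_languages * programming_languages

# Prompts per combination, assuming an even split (an assumption).
prompts_per_pair, remainder = divmod(total_prompts, language_pairs)

print(language_pairs, prompts_per_pair, remainder)  # 276 80 0
```

The zero remainder shows the stated total is consistent with an even split of 80 prompts per language pair.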

Title: Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models

Authors: Anchun Gui, Jian Li, Yong Dai, Nan Du, Han Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16696
Pdf URL: https://arxiv.org/pdf/2402.16696
Copy Paste: [[2402.16696]] Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models(https://arxiv.org/abs/2402.16696)
Keywords: language model, gpt, llm, hallucination, prompt, chat
Abstract: Tool-augmented large language models (LLMs) are attracting widespread attention for accessing up-to-date knowledge and alleviating hallucination issues. Nowadays, advanced closed-source LLMs (e.g., ChatGPT) have demonstrated surprising tool-usage capabilities through prompting and in-context learning techniques. To improve the ability of open-source LLMs (e.g., LLaMA) to manipulate tools, current efforts focus on either template-driven or token-triggered tool-usage. However, the former hampers LLMs' flexibility to address diverse user queries due to constrained tool interactions, while the latter limits the generalizability when engaging with new tools, since tool-usage learning is based on task- and tool-specific datasets. To alleviate these concerns, in this paper, we propose a decision-aware and generalizable tool-usage framework (DEER). Specifically, we first construct the tool-usage samples with multiple decision branches via an automatic generation pipeline, thereby fostering the decision-making awareness of LLMs under diverse scenarios. Meanwhile, we propose a novel tool sampling strategy to enhance the generalizability of LLMs over unseen tools. Extensive experiments demonstrate that our proposed DEER is effective and significantly outperforms baselines across various datasets.

Title: Generating Effective Ensembles for Sentiment Analysis

Authors: Itay Etelis, Avi Rosenfeld, Abraham Itzhak Weinberg, David Sarne
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16700
Pdf URL: https://arxiv.org/pdf/2402.16700
Copy Paste: [[2402.16700]] Generating Effective Ensembles for Sentiment Analysis(https://arxiv.org/abs/2402.16700)
Keywords: gpt
Abstract: In recent years, transformer models have revolutionized Natural Language Processing (NLP), achieving exceptional results across various tasks, including Sentiment Analysis (SA). As such, current state-of-the-art approaches for SA predominantly rely on transformer models alone, achieving impressive accuracy levels on benchmark datasets. In this paper, we show that the key for further improving the accuracy of such ensembles for SA is to include not only transformers, but also traditional NLP models, despite the inferiority of the latter compared to transformer models. However, as we empirically show, this necessitates a change in how the ensemble is constructed, specifically relying on the Hierarchical Ensemble Construction (HEC) algorithm we present. Our empirical studies across eight canonical SA datasets reveal that ensembles incorporating a mix of model types, structured via HEC, significantly outperform traditional ensembles. Finally, we provide a comparative analysis of the performance of the HEC and GPT-4, demonstrating that while GPT-4 closely approaches state-of-the-art SA methods, it remains outperformed by our proposed ensemble strategy.

Title: SelectIT: Selective Instruction Tuning for Large Language Models via Uncertainty-Aware Self-Reflection

Authors: Liangxin Liu, Xuebo Liu, Derek F. Wong, Dongfang Li, Ziyi Wang, Baotian Hu, Min Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16705
Pdf URL: https://arxiv.org/pdf/2402.16705
Copy Paste: [[2402.16705]] SelectIT: Selective Instruction Tuning for Large Language Models via Uncertainty-Aware Self-Reflection(https://arxiv.org/abs/2402.16705)
Keywords: language model, gpt, llm
Abstract: Instruction tuning (IT) is crucial to tailoring large language models (LLMs) towards human-centric interactions. Recent advancements have shown that the careful selection of a small, high-quality subset of IT data can significantly enhance the performance of LLMs. Despite this, common approaches often rely on additional models or data sets, which increases costs and limits widespread adoption. In this work, we propose a novel approach, termed SelectIT, that capitalizes on the foundational capabilities of the LLM itself. Specifically, we exploit the intrinsic uncertainty present in LLMs to more effectively select high-quality IT data, without the need for extra resources. Furthermore, we introduce a novel IT dataset, the Selective Alpaca, created by applying SelectIT to the Alpaca-GPT4 dataset. Empirical results demonstrate that IT using Selective Alpaca leads to substantial model ability enhancement. The robustness of SelectIT has also been corroborated in various foundation models and domain-specific tasks. Our findings suggest that longer and more computationally intensive IT data may serve as superior sources of IT, offering valuable insights for future research in this area. Data, code, and scripts are freely available at https://github.com/Blue-Raincoat/SelectIT.

Title: CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

Authors: Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2402.16717
Pdf URL: https://arxiv.org/pdf/2402.16717
Copy Paste: [[2402.16717]] CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models(https://arxiv.org/abs/2402.16717)
Keywords: language model, gpt, llm
Abstract: Adversarial misuse, particularly through 'jailbreaking' that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.

Title: Value Preferences Estimation and Disambiguation in Hybrid Participatory Systems

Authors: Enrico Liscio, Luciano C. Siebert, Catholijn M. Jonker, Pradeep K. Murukannaiah
Subjects: cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.16751
Pdf URL: https://arxiv.org/pdf/2402.16751
Copy Paste: [[2402.16751]] Value Preferences Estimation and Disambiguation in Hybrid Participatory Systems(https://arxiv.org/abs/2402.16751)
Keywords: agent
Abstract: Understanding citizens' values in participatory systems is crucial for citizen-centric policy-making. We envision a hybrid participatory system where participants make choices and provide motivations for those choices, and AI agents estimate their value preferences by interacting with them. We focus on situations where a conflict is detected between participants' choices and motivations, and propose methods for estimating value preferences while addressing detected inconsistencies by interacting with the participants. We operationalize the philosophical stance that "valuing is deliberatively consequential." That is, if a participant's choice is based on a deliberation of value preferences, the value preferences can be observed in the motivation the participant provides for the choice. Thus, we propose and compare value estimation methods that prioritize the values estimated from motivations over the values estimated from choices alone. Then, we introduce a disambiguation strategy that addresses the detected inconsistencies between choices and motivations by directly interacting with the participants. We evaluate the proposed methods on a dataset of a large-scale survey on energy transition. The results show that explicitly addressing inconsistencies between choices and motivations improves the estimation of an individual's value preferences. The disambiguation strategy does not show substantial improvements when compared to similar baselines; however, we discuss how the novelty of the approach can open new research avenues and propose improvements to address the current limitations.

Title: A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Authors: Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16775
Pdf URL: https://arxiv.org/pdf/2402.16775
Copy Paste: [[2402.16775]] A Comprehensive Evaluation of Quantization Strategies for Large Language Models(https://arxiv.org/abs/2402.16775)
Keywords: language model, llm
Abstract: Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2) alignment, and (3) efficiency, and conduct extensive experiments across ten diverse benchmarks. Our experimental results indicate that LLMs with 4-bit quantization can retain performance comparable to their non-quantized counterparts, and perplexity can serve as a proxy metric for quantized LLMs on most benchmarks. Furthermore, quantized LLMs with larger parameter scales can outperform smaller LLMs. Although quantization saves memory, it can also slow down the inference speed of LLMs. Consequently, substantial engineering efforts and hardware support are imperative to achieve a balanced optimization of decoding speed and memory consumption in the context of quantized LLMs.
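To make "4-bit quantization" concrete, here is a minimal sketch of symmetric round-to-nearest quantization onto a signed 4-bit grid. It is a generic illustration, not one of the specific strategies the paper evaluates; the per-tensor scale, the helper names `quantize_int4` and `dequantize`, and the sample weights are all assumptions made for this example.

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integer codes in the signed 4-bit range [-8, 7].

    Uses one shared scale for the whole list (per-tensor scaling), chosen so
    that the largest magnitude maps to +/-7.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # round() picks the nearest integer (ties go to the even neighbour);
    # the clamp guards against floating-point overshoot at the range edges.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [0.70, -0.35, 0.05, -0.62, 0.21]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each code needs 4 bits instead of 16 or 32, and the reconstruction error per weight is bounded by half the scale. Real schemes usually use per-group scales and packed storage, which this sketch omits.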

Title: Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models

Authors: Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.16786
Pdf URL: https://arxiv.org/pdf/2402.16786
Copy Paste: [[2402.16786]] Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models(https://arxiv.org/abs/2402.16786)
Keywords: language model, llm
Abstract: Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT forces models to comply with the PCT's multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.

Title: Set the Clock: Temporal Alignment of Pretrained Language Models

Authors: Bowen Zhao, Zander Brumbaugh, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16797
Pdf URL: https://arxiv.org/pdf/2402.16797
Copy Paste: [[2402.16797]] Set the Clock: Temporal Alignment of Pretrained Language Models(https://arxiv.org/abs/2402.16797)
Keywords: language model, prompt
Abstract: Language models (LMs) are trained on web text originating from many points in time and, in general, without any explicit temporal grounding. This work investigates the temporal chaos of pretrained LMs and explores various methods to align their internal knowledge to a target time, which we call "temporal alignment." To do this, we first automatically construct a dataset containing 20K time-sensitive questions and their answers for each year from 2000 to 2023. Based on this dataset, we empirically show that pretrained LMs (e.g., LLaMa2), despite having a recent pretraining cutoff (e.g., 2022), mostly answer questions using earlier knowledge (e.g., in 2019). We then develop several methods, from prompting to finetuning, to align LMs to use their most recent knowledge when answering questions, and investigate various factors in this alignment. Our experiments show that aligning LLaMa2 to the year 2022 can boost its performance by up to 62% in relative terms as measured by that year, even without mentioning time information explicitly, indicating the possibility of aligning models' internal sense of time after pretraining. Finally, we find that alignment to a historical time is also possible, with up to 2.8× the performance of the unaligned LM in 2010 when models are finetuned to that year. These findings hint at the sophistication of LMs' internal knowledge organization and the necessity of tuning them properly.

Title: OncoGPT: A Medical Conversational Model Tailored with Oncology Domain Expertise on a Large Language Model Meta-AI (LLaMA)

Authors: Fujian Jia, Xin Liu, Lixi Deng, Jiwen Gu, Chunchao Pu, Tunan Bai, Mengjiang Huang, Yuanzhi Lu, Kang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16810
Pdf URL: https://arxiv.org/pdf/2402.16810
Copy Paste: [[2402.16810]] OncoGPT: A Medical Conversational Model Tailored with Oncology Domain Expertise on a Large Language Model Meta-AI (LLaMA)(https://arxiv.org/abs/2402.16810)
Keywords: language model, gpt, llm, chat
Abstract: In the past year, there has been a growing trend in applying Large Language Models (LLMs) to the field of medicine, particularly with the advent of advanced language models such as ChatGPT developed by OpenAI. However, there is limited research on LLMs specifically addressing oncology-related queries. The primary aim of this research was to develop a specialized language model that demonstrates improved accuracy in providing advice related to oncology. We performed an extensive data collection of online question-answer interactions centered around oncology, sourced from reputable doctor-patient platforms. Following data cleaning and anonymization, a dataset comprising over 180K oncology-related conversations was established. The conversations were categorized and meticulously reviewed by field specialists and clinicians to ensure precision. Employing the LLaMA model and other selected open-source datasets, we conducted iterative fine-tuning to enhance the model's proficiency in basic medical conversation and specialized oncology knowledge. We observed a substantial enhancement in the model's understanding of genuine patient inquiries and its reliability in offering oncology-related advice through the utilization of real online question-answer interactions in the fine-tuning process. We release the database and models to the research community (https://github.com/OncoGPT1).

Title: Investigating the Effectiveness of HyperTuning via Gisting

Authors: Jason Phang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16817
Pdf URL: https://arxiv.org/pdf/2402.16817
Copy Paste: [[2402.16817]] Investigating the Effectiveness of HyperTuning via Gisting(https://arxiv.org/abs/2402.16817)
Keywords: language model
Abstract: Gisting (Mu et al., 2023) is a simple method for training models to compress information into fewer token representations using a modified attention mask, and can serve as an economical approach to training Transformer-based hypernetworks. We introduce HyperLlama, a set of Gisting-based hypernetworks built on Llama-2 models that generates task-specific soft prefixes based on few-shot inputs. In experiments across P3, Super-NaturalInstructions and Symbol Tuning datasets, we show that HyperLlama models can effectively compress information from few-shot examples into soft prefixes. However, they still underperform multi-task fine-tuned language models with full attention over few-shot in-context examples. We also show that HyperLlama-generated soft prefixes can serve as better initializations for further prefix tuning. Overall, Gisting-based hypernetworks are economical and easy to implement, but have mixed empirical performance.
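The "modified attention mask" is the core of gisting, so a small sketch may help. The abstract does not spell out the masking rule; the version below assumes a decoder-only model where the sequence is laid out as prompt tokens, then gist tokens, then the remaining tokens, and where positions after the gist tokens are blocked from attending to the raw prompt. The function name and layout are illustrative assumptions.

```python
def gist_attention_mask(num_prompt: int, num_gist: int, num_rest: int) -> list[list[bool]]:
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Assumed layout: [prompt tokens][gist tokens][remaining tokens].
    """
    total = num_prompt + num_gist + num_rest
    gist_end = num_prompt + num_gist
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            causal = j <= i  # standard causal constraint
            # Tokens after the gist span cannot look back past it to the prompt,
            # so the prompt's information must flow through the gist tokens.
            blocked = i >= gist_end and j < num_prompt
            row.append(causal and not blocked)
        mask.append(row)
    return mask


# Hypothetical sizes: 2 prompt tokens, 1 gist token, 2 remaining tokens.
mask = gist_attention_mask(num_prompt=2, num_gist=1, num_rest=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this rule the gist token sees the whole prompt, while later tokens see only the gist token and each other, which is what forces the compression.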

Title: Nemotron-4 15B Technical Report

Authors: Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, Bryan Catanzaro
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16819
Pdf URL: https://arxiv.org/pdf/2402.16819
Copy Paste: [[2402.16819]] Nemotron-4 15B Technical Report(https://arxiv.org/abs/2402.16819)
Keywords: language model
Abstract: We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.

Title: Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Authors: Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16822
Pdf URL: https://arxiv.org/pdf/2402.16822
Copy Paste: [[2402.16822]] Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts(https://arxiv.org/abs/2402.16822)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to user inputs is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem, and uses open-ended search to generate prompts that are both effective and diverse. It can uncover a model's vulnerabilities across a broad range of domains including, in this paper, safety, question answering, and cybersecurity. We also demonstrate that fine-tuning on synthetic data generated by Rainbow Teaming improves the safety of state-of-the-art LLMs without hurting their general capabilities and helpfulness, paving the path to open-ended self-improvement.

Title: Language Agents as Optimizable Graphs

Authors: Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, Jurgen Schmidhuber
Subjects: cs.AI, cs.CL, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2402.16823
Pdf URL: https://arxiv.org/pdf/2402.16823
Copy Paste: [[2402.16823]] Language Agents as Optimizable Graphs(https://arxiv.org/abs/2402.16823)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Various human-designed prompt engineering techniques have been proposed to improve problem solvers based on Large Language Models (LLMs), yielding many disparate code bases. We unify these approaches by describing LLM-based agents as computational graphs. The nodes implement functions to process multimodal data or query LLMs, and the edges describe the information flow between operations. Graphs can be recursively combined into larger composite graphs representing hierarchies of inter-agent collaboration (where edges connect operations of different agents). Our novel automatic graph optimizers (1) refine node-level LLM prompts (node optimization) and (2) improve agent orchestration by changing graph connectivity (edge optimization). Experiments demonstrate that our framework can be used to efficiently develop, integrate, and automatically improve various LLM agents. The code can be found at https://github.com/metauto-ai/gptswarm.
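The agent-as-graph idea can be sketched in a few lines: nodes are functions, edges carry outputs from one operation to the next, and the graph runs in topological order. This is not the API of the linked repository; `run_graph`, the node names, and the string-building stubs that stand in for LLM queries are all hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable

Node = Callable[[list[str]], str]


def run_graph(nodes: dict[str, Node], edges: list[tuple[str, str]], task: str) -> dict[str, str]:
    """Execute every node once, feeding each the outputs of its predecessors.

    Nodes without predecessors receive the raw task as their only input.
    """
    predecessors: dict[str, set[str]] = {name: set() for name in nodes}
    for src, dst in edges:
        predecessors[dst].add(src)

    outputs: dict[str, str] = {}
    for name in TopologicalSorter(predecessors).static_order():
        # Sort predecessor names so the input order is deterministic.
        inputs = [outputs[p] for p in sorted(predecessors[name])] or [task]
        outputs[name] = nodes[name](inputs)
    return outputs


# Stub operations standing in for LLM calls (assumptions for illustration).
nodes: dict[str, Node] = {
    "draft": lambda inputs: f"draft({inputs[0]})",
    "critique": lambda inputs: f"critique({inputs[0]})",
    "final": lambda inputs: "final(" + ", ".join(inputs) + ")",
}
edges = [("draft", "critique"), ("draft", "final"), ("critique", "final")]

print(run_graph(nodes, edges, "task")["final"])
# prints "final(critique(draft(task)), draft(task))"
```

In this framing, node optimization would change what a function such as `draft` does (for example, its prompt), while edge optimization would add or remove entries in `edges`. Dropping `("critique", "final")` makes the final node see only the draft.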

Title: A Survey on Data Selection for Language Models

Authors: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.16827
Pdf URL: https://arxiv.org/pdf/2402.16827
Copy Paste: [[2402.16827]] A Survey on Data Selection for Language Models(https://arxiv.org/abs/2402.16827)
Keywords: language model
Abstract: A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

Title: GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

Authors: Aivin V. Solatorio
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.16829
Pdf URL: https://arxiv.org/pdf/2402.16829
Copy Paste: [[2402.16829]] GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning(https://arxiv.org/abs/2402.16829)
Keywords: llm, prompt, retrieval augmented generation
Abstract: Embedding models are integral to AI applications like semantic search, personalized recommendations, and retrieval augmented generation for LLMs, necessitating high-quality training data. However, the limited scalability of manual data curation prompts the need for automated methods to ensure data integrity. Traditional unsupervised triplet mining automates training data generation, crucial for embedding model training, yet inadvertently injects biases and noise, thereby degrading model performance. Addressing this, we introduce GISTEmbed, a novel strategy that enhances in-batch negative selection during contrastive training through a guide model. This approach departs from reliance on random sampling and equal utility assumption of batch negatives, significantly reducing noise from data quality issues and improving model fine-tuning. Benchmarked against the Massive Text Embedding Benchmark (MTEB), GISTEmbed showcases consistent performance improvements across various model sizes and achieves state-of-the-art results in select categories. This framework enables significant enhancements for smaller models by leveraging the capabilities of powerful yet resource-intensive large models. GISTEmbed can potentially revolutionize the creation of highly efficient, smaller models, democratizing access to advanced AI technologies. Making these technologies more accessible and cost-effective, especially for applications constrained by resources, significantly expands the impact and accessibility of state-of-the-art AI solutions across diverse sectors.
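A rough sketch of guide-model filtering for in-batch negatives follows. The abstract says only that a guide model improves negative selection; the specific rule used here (drop any candidate negative that the guide model scores as more similar to the query than the query's own positive) is an assumption for illustration, as are the function names and the toy 2-D embeddings.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def guided_negative_mask(
    guide_queries: list[list[float]], guide_positives: list[list[float]]
) -> list[list[bool]]:
    """Return mask[i][j]: may positive j be used as a negative for query i?

    Embeddings come from the guide model, not the model being fine-tuned.
    """
    n = len(guide_queries)
    mask = []
    for i in range(n):
        # Similarity of the true (query, positive) pair sets the bar.
        threshold = cosine(guide_queries[i], guide_positives[i])
        row = []
        for j in range(n):
            if j == i:
                row.append(False)  # a query's own positive is never a negative
            else:
                # Candidates the guide finds closer than the true positive are
                # likely false negatives, so they are masked out.
                row.append(cosine(guide_queries[i], guide_positives[j]) <= threshold)
        mask.append(row)
    return mask


# Toy guide-model embeddings (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
positives = [[0.8, 0.6], [0.1, 0.9], [1.0, 0.1]]
mask = guided_negative_mask(queries, positives)
print(mask[0])  # prints "[False, True, False]"
```

For query 0, the third positive is nearly identical to the query, so it is excluded rather than treated as a negative; the second positive is clearly unrelated and is kept. The resulting mask would then be applied to the contrastive loss of the model being trained.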

Title: Mysterious Projections: Multimodal LLMs Gain Domain-Specific Visual Capabilities Without Richer Cross-Modal Projections

Authors: Gaurav Verma, Minje Choi, Kartik Sharma, Jamelle Watson-Daniels, Sejoon Oh, Srijan Kumar
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2402.16832
Pdf URL: https://arxiv.org/pdf/2402.16832
Copy Paste: [[2402.16832]] Mysterious Projections: Multimodal LLMs Gain Domain-Specific Visual Capabilities Without Richer Cross-Modal Projections(https://arxiv.org/abs/2402.16832)
Keywords: language model, gpt, llm
Abstract: Multimodal large language models (MLLMs) like LLaVA and GPT-4(V) enable general-purpose conversations about images with the language modality. As off-the-shelf MLLMs may have limited capabilities on images from domains like dermatology and agriculture, they must be fine-tuned to unlock domain-specific applications. The prevalent architecture of current open-source MLLMs comprises two major modules: an image-language (cross-modal) projection network and a large language model. It is desirable to understand the roles of these two modules in modeling domain-specific visual attributes to inform the design of future models and streamline the interpretability efforts on the current models. To this end, via experiments on 4 datasets and under 2 fine-tuning settings, we find that as the MLLM is fine-tuned, it indeed gains domain-specific visual capabilities, but the updates do not lead to the projection extracting relevant domain-specific visual attributes. Our results indicate that the domain-specific visual attributes are modeled by the LLM, even when only the projection is fine-tuned. Through this study, we offer a potential reinterpretation of the role of cross-modal projections in MLLM architectures. Projection webpage: https://claws-lab.github.io/projection-in-MLLMs/
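Summary: Multimodal large language models (MLLMs) such as LLaVA and GPT-4(V) support general-purpose conversations about images through the language modality. Because off-the-shelf MLLMs may have limited capabilities on images from domains such as dermatology and agriculture, they must be fine-tuned to unlock domain-specific applications. The prevalent architecture of current open-source MLLMs consists of two major modules: an image-language (cross-modal) projection network and a large language model. Understanding the roles of these two modules in modeling domain-specific visual attributes is desirable, both to inform the design of future models and to streamline interpretability work on current models. To this end, through experiments on 4 datasets and under 2 fine-tuning settings, we find that as the MLLM is fine-tuned it does gain domain-specific visual capabilities, but the updates do not lead to the projection extracting relevant domain-specific visual attributes. Our results indicate that the domain-specific visual attributes are modeled by the LLM, even when only the projection is fine-tuned. Through this study, we offer a potential reinterpretation of the role of cross-modal projections in MLLM architectures. Projection webpage: https://claws-lab.github.io/projection-in-MLLMs/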

Title: Eight Methods to Evaluate Robust Unlearning in LLMs

Authors: Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16835
Pdf URL: https://arxiv.org/pdf/2402.16835
Copy Paste: [[2402.16835]] Eight Methods to Evaluate Robust Unlearning in LLMs(https://arxiv.org/abs/2402.16835)
Keywords: language model, llm
Abstract: Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it. In this paper, we first survey techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). While WHP's unlearning generalizes well when evaluated with the "Familiarity" metric from Eldan and Russinovich, we find i) higher-than-baseline amounts of knowledge can reliably be extracted, ii) WHP performs on par with the original model on Harry Potter Q&A tasks, iii) it represents latent knowledge comparably to the original model, and iv) there is collateral unlearning in related domains. Overall, our results highlight the importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.
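Summary: Machine unlearning is useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for evaluating it rigorously. In this paper, we first survey the techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning to the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). Although WHP's unlearning generalizes well when evaluated with the "Familiarity" metric from Eldan and Russinovich, we find that i) higher-than-baseline amounts of knowledge can reliably be extracted, ii) WHP performs on par with the original model on Harry Potter Q&A tasks, iii) it represents latent knowledge comparably to the original model, and iv) there is collateral unlearning in related domains. Overall, our results highlight the importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.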

Title: Do Large Language Models Latently Perform Multi-Hop Reasoning?

Authors: Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, Sebastian Riedel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16837
Pdf URL: https://arxiv.org/pdf/2402.16837
Copy Paste: [[2402.16837]] Do Large Language Models Latently Perform Multi-Hop Reasoning?(https://arxiv.org/abs/2402.16837)
Keywords: language model, llm, prompt
Abstract: We study whether Large Language Models (LLMs) latently perform multi-hop reasoning with complex prompts such as "The mother of the singer of 'Superstition' is". We look for evidence of a latent reasoning pathway where an LLM (1) latently identifies "the singer of 'Superstition'" as Stevie Wonder, the bridge entity, and (2) uses its knowledge of Stevie Wonder's mother to complete the prompt. We analyze these two hops individually and consider their co-occurrence as indicative of latent multi-hop reasoning. For the first hop, we test if changing the prompt to indirectly mention the bridge entity instead of any other entity increases the LLM's internal recall of the bridge entity. For the second hop, we test if increasing this recall causes the LLM to better utilize what it knows about the bridge entity. We find strong evidence of latent multi-hop reasoning for the prompts of certain relation types, with the reasoning pathway used in more than 80% of the prompts. However, the utilization is highly contextual, varying across different types of prompts. Also, on average, the evidence for the second hop and the full multi-hop traversal is rather moderate and only substantial for the first hop. Moreover, we find a clear scaling trend with increasing model size for the first hop of reasoning but not for the second hop. Our experimental findings suggest potential challenges and opportunities for future development and applications of LLMs.
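Summary: We study whether large language models (LLMs) latently perform multi-hop reasoning on complex prompts such as "The mother of the singer of 'Superstition' is". We look for evidence of a latent reasoning pathway in which an LLM (1) latently identifies "the singer of 'Superstition'" as the bridge entity, Stevie Wonder, and (2) uses its knowledge of Stevie Wonder's mother to complete the prompt. We analyze the two hops individually and treat their co-occurrence as an indication of latent multi-hop reasoning. For the first hop, we test whether changing the prompt to indirectly mention the bridge entity, rather than any other entity, increases the LLM's internal recall of the bridge entity. For the second hop, we test whether increasing this recall causes the LLM to make better use of what it knows about the bridge entity. We find strong evidence of latent multi-hop reasoning for prompts of certain relation types, with the reasoning pathway used in more than 80% of the prompts. However, the utilization is highly contextual and varies across prompt types. In addition, on average, the evidence for the second hop and for the full multi-hop traversal is rather moderate, and it is substantial only for the first hop. Moreover, we find a clear scaling trend with increasing model size for the first hop of reasoning but not for the second. Our experimental findings suggest potential challenges and opportunities for the future development and application of LLMs.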

Title: MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

Authors: Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.16840
Pdf URL: https://arxiv.org/pdf/2402.16840
Copy Paste: [[2402.16840]] MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT(https://arxiv.org/abs/2402.16840)
Keywords: language model, gpt, llm
Abstract: "Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. However, LLMs are not well suited to scenarios that require on-device processing, energy efficiency, low memory footprint, and response efficiency. These requisites are crucial for privacy, security, and sustainable deployment. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource-constrained devices. Our primary contribution is the introduction of an accurate and fully transparent open-source 0.5 billion (0.5B) parameter SLM, named MobiLlama, catering to the specific needs of resource-constrained computing with an emphasis on enhanced performance with reduced resource demands. MobiLlama is an SLM design that initiates from a larger model and applies a careful parameter sharing scheme to reduce both the pre-training and the deployment cost. Our work strives not only to bridge the gap in open-source SLMs but also to ensure full transparency: the complete training data pipeline, training code, model weights, and over 300 checkpoints, along with evaluation code, are available at: https://github.com/mbzuai-oryx/MobiLlama.
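Summary: "Bigger the better" has been the dominant trend in recent large language model (LLM) development. However, LLMs are poorly suited to scenarios that require on-device processing, energy efficiency, a low memory footprint, and response efficiency. These requirements are crucial for privacy, security, and sustainable deployment. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient small language models (SLMs) for resource-constrained devices. Our main contribution is an accurate and fully transparent open-source SLM with 0.5 billion (0.5B) parameters, named MobiLlama, which caters to the specific needs of resource-constrained computing, with an emphasis on enhanced performance and reduced resource demands. MobiLlama is an SLM design that starts from a larger model and applies a careful parameter sharing scheme to reduce both pre-training and deployment cost. Our work aims not only to close the gap in open-source SLMs but also to ensure full transparency: the complete training data pipeline, training code, model weights, and over 300 checkpoints, along with evaluation code, are available at: https://github.com/mbzuai-oryx/MobiLlama.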

Title: Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Authors: Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.16844
Pdf URL: https://arxiv.org/pdf/2402.16844
Copy Paste: [[2402.16844]] Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding(https://arxiv.org/abs/2402.16844)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method utilizes a pretrained frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small language model (SLM), which then generates the response more efficiently. We investigate the combination of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families and only require fine-tuning of the SLM. Experiments with various benchmarks show substantial speedups of up to $4\times$, with minor performance penalties of $1-2\%$ for translation and summarization tasks compared to the LLM.
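Summary: Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization, and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method uses a pretrained, frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small language model (SLM), which then generates the response more efficiently. We investigate combinations of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families, and only the SLM requires fine-tuning. Experiments on various benchmarks show substantial speedups of up to $4\times$, with minor performance penalties of $1-2\%$ on translation and summarization tasks compared to the LLM.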