2024-02-26

Title: Orca-Math: Unlocking the potential of SLMs in Grade School Math

Authors: Arindam Mitra, Hamed Khanpour, Corby Rosset, Ahmed Awadallah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14830
Pdf URL: https://arxiv.org/pdf/2402.14830
Copy Paste: [[2402.14830]] Orca-Math: Unlocking the potential of SLMs in Grade School Math(https://arxiv.org/abs/2402.14830)
Keywords: language model, gpt, chat, agent
Abstract: Mathematical word problem-solving has long been recognized as a complex task for small language models (SLMs). A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters. To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to help avoid calculation errors. Additionally, they employ ensembling, where outputs of up to 100 model runs are combined to arrive at a more accurate result. Result selection is done using consensus, majority vote or a separate verifier model used in conjunction with the SLM. Ensembling provides a substantial boost in accuracy but at a significant cost increase with multiple calls to the model (e.g., Phi-GSM uses top-48 to boost the performance from 68.2 to 81.5). In this work, we present Orca-Math, a 7-billion-parameter SLM based on Mistral-7B, which achieves 86.81% on GSM8K without the need for multiple model calls or the use of verifiers, code execution or any other external tools. Our approach has the following key elements: (1) A high-quality synthetic dataset of 200K math problems created using a multi-agent setup where agents collaborate to create the data, (2) An iterative learning technique that enables the SLM to practice solving problems, receive feedback on its solutions and learn from preference pairs incorporating the SLM solutions and the feedback. When trained with Supervised Fine-Tuning alone, Orca-Math achieves 81.50% on the GSM8K pass@1 metric. With iterative preference learning, Orca-Math achieves 86.81% pass@1. Orca-Math surpasses the performance of significantly larger models such as LLAMA-2-70B, WizardMath-70B, Gemini-Pro, and ChatGPT-3.5. It also significantly outperforms other smaller models while using much smaller data (hundreds of thousands vs. millions of problems).
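Note: the majority-vote ensembling that the abstract contrasts Orca-Math against can be sketched in a few lines. This is a minimal illustration only; the function name `majority_vote` and the sample answers are assumptions, not taken from the paper, and no model is actually called.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among several sampled model outputs."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    # most_common(1) returns a one-element list of (answer, count)
    return counts.most_common(1)[0][0]

# Hypothetical final answers extracted from five independent runs on one problem.
samples = ["18", "18", "20", "18", "17"]
final_answer = majority_vote(samples)
print(final_answer)
```

Each extra sample is one more model call, which is the cost the abstract refers to; a single-call model skips this step entirely.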

Title: CliqueParcel: An Approach For Batching LLM Prompts That Jointly Optimizes Efficiency And Faithfulness

Authors: Jiayi Liu, Tinghan Yang, Jennifer Neville
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14833
Pdf URL: https://arxiv.org/pdf/2402.14833
Copy Paste: [[2402.14833]] CliqueParcel: An Approach For Batching LLM Prompts That Jointly Optimizes Efficiency And Faithfulness(https://arxiv.org/abs/2402.14833)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have become pivotal in recent research. However, during the inference process, LLMs still require substantial resources. In this paper, we propose CliqueParcel, a method designed to improve the efficiency of LLMs via prompt batching. Existing strategies to optimize inference efficiency often compromise on output quality, leading to a discounted output problem. This issue might result in reduced accuracy or outputs that are less detailed. CliqueParcel is our answer to this challenge. While ensuring accuracy and minimizing deviations from the original outputs (i.e., faithfulness), our method significantly improves efficiency during inference. To lay the groundwork, we first redefine efficiency measurements by excluding the reduction in running time due to shorter lengths. Then, we provide a comprehensive trade-off between efficiency and faithfulness to clarify the nature of the 'discounted output' problem. Within the CliqueParcel framework, we suggest multiple batching sub-methods and discuss the specific scenarios in which they can be applied. During evaluation, CliqueParcel is tested on eight widely recognized datasets, which can be classified into three types: reading comprehension, open-source question-answering, and reasoning. Our experiments explore the performance of CliqueParcel, including efficiency, faithfulness, and the trade-off between them. This work provides novel insights into inference efficiency and demonstrates promising performance.
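Note: the general idea of prompt batching (several queries packed into one model call, then split apart again) can be sketched as below. This is a generic illustration under stated assumptions, not the CliqueParcel method itself; the `Q<i>`/`A<i>` tagging scheme and both function names are hypothetical.

```python
def build_batched_prompt(questions):
    """Pack several questions into a single prompt with numbered tags."""
    lines = ["Answer each question. Reply with one line per question, "
             "formatted as 'A<i>: <answer>'."]
    for i, question in enumerate(questions, start=1):
        lines.append(f"Q{i}: {question}")
    return "\n".join(lines)

def split_batched_reply(reply, n):
    """Map a tagged reply back to the n original questions (None if missing)."""
    answers = [None] * n
    for line in reply.splitlines():
        line = line.strip()
        if not line.startswith("A") or ":" not in line:
            continue  # ignore anything that is not a tagged answer line
        tag, text = line.split(":", 1)
        index = tag[1:]
        if index.isdigit() and 1 <= int(index) <= n:
            answers[int(index) - 1] = text.strip()
    return answers

prompt = build_batched_prompt(["Capital of France?", "What is 2 + 2?"])
# A hand-written stand-in for a model reply, used only to exercise the parser.
parsed = split_batched_reply("A1: Paris\nA2: 4", 2)
```

Comparing `parsed` against the answers obtained from unbatched calls is one simple way to check the faithfulness the abstract describes.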

Title: MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing

Authors: Jiaqi Li, Miaozeng Du, Chuanyi Zhang, Yongrui Chen, Nan Hu, Guilin Qi, Haiyun Jiang, Siyuan Cheng, Bozhong Tian
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14835
Pdf URL: https://arxiv.org/pdf/2402.14835
Copy Paste: [[2402.14835]] MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing(https://arxiv.org/abs/2402.14835)
Keywords: language model, llm
Abstract: Multimodal knowledge editing represents a critical advancement in enhancing the capabilities of Multimodal Large Language Models (MLLMs). Despite its potential, current benchmarks predominantly focus on coarse-grained knowledge, leaving the intricacies of fine-grained (FG) multimodal entity knowledge largely unexplored. This gap presents a notable challenge, as FG entity recognition is pivotal for the practical deployment and effectiveness of MLLMs in diverse real-world scenarios. To bridge this gap, we introduce MIKE, a comprehensive benchmark and dataset specifically designed for the FG multimodal entity knowledge editing. MIKE encompasses a suite of tasks tailored to assess different perspectives, including Vanilla Name Answering, Entity-Level Caption, and Complex-Scenario Recognition. In addition, a new form of knowledge editing, Multi-step Editing, is introduced to evaluate the editing efficiency. Through our extensive evaluations, we demonstrate that the current state-of-the-art methods face significant challenges in tackling our proposed benchmark, underscoring the complexity of FG knowledge editing in MLLMs. Our findings spotlight the urgent need for novel approaches in this domain, setting a clear agenda for future research and development efforts within the community.

Title: Stealthy Attack on Large Language Model based Recommendation

Authors: Jinghao Zhang, Yuting Liu, Qiang Liu, Shu Wu, Guibing Guo, Liang Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2402.14836
Pdf URL: https://arxiv.org/pdf/2402.14836
Copy Paste: [[2402.14836]] Stealthy Attack on Large Language Model based Recommendation(https://arxiv.org/abs/2402.14836)
Keywords: language model, llm
Abstract: Recently, the powerful large language models (LLMs) have been instrumental in propelling the progress of recommender systems (RS). However, while these systems have flourished, their susceptibility to security threats has been largely overlooked. In this work, we reveal that the introduction of LLMs into recommendation models presents new security vulnerabilities due to their emphasis on the textual content of items. We demonstrate that attackers can significantly boost an item's exposure by merely altering its textual content during the testing phase, without requiring direct interference with the model's training process. Additionally, the attack is notably stealthy, as it does not affect the overall recommendation performance and the modifications to the text are subtle, making it difficult for users and platforms to detect. Our comprehensive experiments across four mainstream LLM-based recommendation models demonstrate the superior efficacy and stealthiness of our approach. Our work unveils a significant security gap in LLM-based recommendation systems and paves the way for future research on protecting these systems.

Title: An Empirical Categorization of Prompting Techniques for Large Language Models: A Practitioner's Guide

Authors: Oluwole Fagbohun, Rachel M. Harrison, Anton Dereventsov
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14837
Pdf URL: https://arxiv.org/pdf/2402.14837
Copy Paste: [[2402.14837]] An Empirical Categorization of Prompting Techniques for Large Language Models: A Practitioner's Guide(https://arxiv.org/abs/2402.14837)
Keywords: language model, llm, prompt
Abstract: Due to rapid advancements in the development of Large Language Models (LLMs), programming these models with prompts has recently gained significant attention. However, the sheer number of available prompt engineering techniques creates an overwhelming landscape for practitioners looking to utilize these tools. For the most efficient and effective use of LLMs, it is important to compile a comprehensive list of prompting techniques and establish a standardized, interdisciplinary categorization framework. In this survey, we examine some of the most well-known prompting techniques from both academic and practical viewpoints and classify them into seven distinct categories. We present an overview of each category, aiming to clarify their unique contributions and showcase their practical applications in real-world examples in order to equip fellow practitioners with a structured framework for understanding and categorizing prompting techniques tailored to their specific domains. We believe that this approach will help simplify the complex landscape of prompt engineering and enable more effective utilization of LLMs in various applications. By providing practitioners with a systematic approach to prompt categorization, we aim to assist in navigating the intricacies of effective prompt design for conversational pre-trained LLMs and inspire new possibilities in their respective fields.

Title: RFBES at SemEval-2024 Task 8: Investigating Syntactic and Semantic Features for Distinguishing AI-Generated and Human-Written Texts

Authors: Mohammad Heydari Rad, Farhan Farsi, Shayan Bali, Romina Etezadi, Mehrnoush Shamsfard
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14838
Pdf URL: https://arxiv.org/pdf/2402.14838
Copy Paste: [[2402.14838]] RFBES at SemEval-2024 Task 8: Investigating Syntactic and Semantic Features for Distinguishing AI-Generated and Human-Written Texts(https://arxiv.org/abs/2402.14838)
Keywords: language model, llm
Abstract: Nowadays, the usage of Large Language Models (LLMs) has increased, and LLMs have been used to generate texts in different languages and for different tasks. Additionally, due to the participation of remarkable companies such as Google and OpenAI, LLMs are now more accessible, and people can easily use them. However, an important issue is how we can detect AI-generated texts from human-written ones. In this article, we have investigated the problem of AI-generated text detection from two different aspects: semantics and syntax. Finally, we presented an AI model that can distinguish AI-generated texts from human-written ones with high accuracy on both multilingual and monolingual tasks using the M4 dataset. According to our results, using a semantic approach would be more helpful for detection. However, there is a lot of room for improvement in the syntactic approach, and it would be a good approach for future work.

Title: RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning

Authors: Congyun Jin, Ming Zhang, Xiaowei Ma, Li Yujiao, Yingbo Wang, Yabo Jia, Yuliang Du, Tao Sun, Haowen Wang, Cong Fan, Jinjie Gu, Chenfei Chi, Xiangguo Lv, Fangzhou Li, Wei Xue, Yiran Huang
Subjects: cs.CL, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2402.14840
Pdf URL: https://arxiv.org/pdf/2402.14840
Copy Paste: [[2402.14840]] RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning(https://arxiv.org/abs/2402.14840)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) and Large Multi-modal Models (LMMs) have shown potential in various medical applications, such as Intelligent Medical Diagnosis. Although impressive results have been achieved, we find that existing benchmarks do not reflect the complexity of real medical reports and specialized in-depth reasoning capabilities. In this work, we introduce RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization, which poses several challenges: comprehensively interpreting image content across diverse challenging layouts, possessing numerical reasoning ability to identify abnormal indicators, and demonstrating clinical reasoning ability to provide statements of disease diagnosis, status and advice based on medical contexts. We carefully design the data generation pipeline and propose the Efficient Structural Restoration Annotation (ESRA) Method, aimed at restoring textual and tabular content in medical report images. This method substantially enhances annotation efficiency, doubling the productivity of each annotator, and yields a 26.8% improvement in accuracy. We conduct extensive evaluations, including few-shot assessments of 5 LMMs which are capable of solving Chinese medical QA tasks. To further investigate the limitations and potential of current LMMs, we conduct comparative experiments on a set of strong LLMs by using image-text generated by the ESRA method. We report the performance of baselines and offer several observations: (1) The overall performance of existing LMMs is still limited; however, LMMs are more robust to low-quality and diverse-structured images compared to LLMs. (3) Reasoning across context and image content presents significant challenges. We hope this benchmark helps the community make progress on these challenging tasks in multi-modal medical document understanding and facilitates its application in healthcare.

Title: Purifying Large Language Models by Ensembling a Small Language Model

Authors: Tianlin Li, Qian Liu, Tianyu Pang, Chao Du, Qing Guo, Yang Liu, Min Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14845
Pdf URL: https://arxiv.org/pdf/2402.14845
Copy Paste: [[2402.14845]] Purifying Large Language Models by Ensembling a Small Language Model(https://arxiv.org/abs/2402.14845)
Keywords: language model, llm
Abstract: The emerging success of large language models (LLMs) heavily relies on collecting abundant training data from external (untrusted) sources. Despite substantial efforts devoted to data cleaning and curation, well-constructed LLMs have been reported to suffer from copyright infringement, data poisoning, and/or privacy violations, which would impede practical deployment of LLMs. In this study, we propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data, namely, through ensembling LLMs with benign and small language models (SLMs). Aside from theoretical guarantees, we perform comprehensive experiments to empirically confirm the efficacy of ensembling LLMs with SLMs, which can effectively preserve the performance of LLMs while mitigating issues such as copyright infringement, data poisoning, and privacy violations.

Title: Stick to your Role! Stability of Personal Values Expressed in Large Language Models

Authors: Grgur Kovač, Rémy Portelas, Masataka Sawayama, Peter Ford Dominey, Pierre-Yves Oudeyer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14846
Pdf URL: https://arxiv.org/pdf/2402.14846
Copy Paste: [[2402.14846]] Stick to your Role! Stability of Personal Values Expressed in Large Language Models(https://arxiv.org/abs/2402.14846)
Keywords: language model, llm
Abstract: The standard way to study Large Language Models (LLMs) through benchmarks or psychology questionnaires is to provide many different queries from similar minimal contexts (e.g. multiple choice questions). However, due to LLMs' highly context-dependent nature, conclusions from such minimal-context evaluations may say little about the model's behavior in deployment (where it will be exposed to many new contexts). We argue that context-dependence should be studied as another dimension of LLM comparison alongside others such as cognitive abilities, knowledge, or model size. In this paper, we present a case study of the stability of value expression over different contexts (simulated conversations on different topics), as measured using a standard psychology questionnaire (PVQ) and a behavioral downstream task. We consider 19 open-sourced LLMs from five families. Reusing methods from psychology, we study Rank-order stability on the population (interpersonal) level, and Ipsative stability on the individual (intrapersonal) level. We explore two settings: with and without instructing LLMs to simulate particular personalities. We observe similar trends in the stability of models and model families - Mixtral, Mistral and Qwen families being more stable than LLaMa-2 and Phi - over those two settings, two different simulated populations, and even in the downstream behavioral task. When instructed to simulate particular personas, LLMs exhibit low Rank-Order stability, and this stability further diminishes with conversation length. This highlights the need for future research directions on LLMs that can coherently simulate a diversity of personas, as well as how context-dependence can be studied in more thorough and efficient ways. This paper provides a foundational step in that direction, and, to our knowledge, it is the first study of value stability in LLMs.
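Note: rank-order stability is commonly quantified as a correlation between the ranks that the same individuals receive on one value in two contexts. The sketch below assumes a Spearman-style rank correlation without tie handling; the abstract does not state the exact estimator, and the scores are invented for illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); ties are not handled."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def rank_order_stability(scores_a, scores_b):
    """Correlation of participants' ranks on one value across two contexts."""
    return pearson(ranks(scores_a), ranks(scores_b))

# Hypothetical scores of four simulated personas on one value in two contexts.
context_a = [4.1, 3.2, 5.0, 2.7]
context_b = [3.9, 3.0, 4.8, 2.9]
stability = rank_order_stability(context_a, context_b)
```

A value near 1 means the personas keep their relative ordering across contexts; values near 0 mean the ordering is not preserved.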

Title: Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

Authors: Mosh Levy, Alon Jacoby, Yoav Goldberg
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14848
Pdf URL: https://arxiv.org/pdf/2402.14848
Copy Paste: [[2402.14848]] Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models(https://arxiv.org/abs/2402.14848)
Keywords: language model, llm
Abstract: This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite LLMs' advancements in recent times, their performance consistency across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each being extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs' reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that traditional perplexity metrics do not correlate with the performance of LLMs in long input reasoning tasks. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
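Note: the idea of extending one sample with padding of different lengths and locations can be sketched as follows. This is an illustrative assumption, not the paper's construction: length is counted in whitespace-separated words rather than tokens, and the function name, filler text, and location labels are hypothetical.

```python
def pad_sample(key_text, filler, target_words, location="before"):
    """Extend key_text to roughly target_words words using repeated filler text."""
    key_words = key_text.split()
    n_pad = max(0, target_words - len(key_words))
    filler_words = filler.split()
    if n_pad and not filler_words:
        raise ValueError("filler must contain at least one word")
    # Repeat the filler cyclically until the requested padding length is reached.
    pad = [filler_words[i % len(filler_words)] for i in range(n_pad)]
    if location == "before":
        words = pad + key_words
    elif location == "after":
        words = key_words + pad
    elif location == "around":
        half = n_pad // 2
        words = pad[:half] + key_words + pad[half:]
    else:
        raise ValueError(f"unknown location: {location}")
    return " ".join(words)

key = "Alice is taller than Bob. Bob is taller than Carol."
filler = "The weather report mentioned light rain in the afternoon."
versions = {loc: pad_sample(key, filler, 50, loc)
            for loc in ("before", "after", "around")}
```

Every version carries the same reasoning-relevant text, so differences in model accuracy across versions can be attributed to the padding.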

Title: CHATATC: Large Language Model-Driven Conversational Agents for Supporting Strategic Air Traffic Flow Management

Authors: Sinan Abdulhak, Wayne Hubbard, Karthik Gopalakrishnan, Max Z. Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14850
Pdf URL: https://arxiv.org/pdf/2402.14850
Copy Paste: [[2402.14850]] CHATATC: Large Language Model-Driven Conversational Agents for Supporting Strategic Air Traffic Flow Management(https://arxiv.org/abs/2402.14850)
Keywords: language model, gpt, llm, chat, agent
Abstract: Generative artificial intelligence (AI) and large language models (LLMs) have gained rapid popularity through publicly available tools such as ChatGPT. The adoption of LLMs for personal and professional use is fueled by the natural interactions between human users and computer applications such as ChatGPT, along with powerful summarization and text generation capabilities. Given the widespread use of such generative AI tools, in this work we investigate how these tools can be deployed in a non-safety critical, strategic traffic flow management setting. Specifically, we train an LLM, CHATATC, based on a large historical data set of Ground Delay Program (GDP) issuances, spanning 2000-2023 and consisting of over 80,000 GDP implementations, revisions, and cancellations. We test the query and response capabilities of CHATATC, documenting successes (e.g., providing correct GDP rates, durations, and reason) and shortcomings (e.g., superlative questions). We also detail the design of a graphical user interface for future users to interact and collaborate with the CHATATC conversational agent.

Title: SQL-CRAFT: Text-to-SQL through Interactive Refinement and Enhanced Reasoning

Authors: Hanchen Xia, Feng Jiang, Naihao Deng, Cunxiang Wang, Guojiang Zhao, Rada Mihalcea, Yue Zhang
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2402.14851
Pdf URL: https://arxiv.org/pdf/2402.14851
Copy Paste: [[2402.14851]] SQL-CRAFT: Text-to-SQL through Interactive Refinement and Enhanced Reasoning(https://arxiv.org/abs/2402.14851)
Keywords: llm, prompt
Abstract: Modern LLMs have become increasingly powerful, but they are still facing challenges in specialized tasks such as Text-to-SQL. We propose SQL-CRAFT, a framework to advance LLMs' SQL generation Capabilities through inteRActive reFinemenT and enhanced reasoning. We leverage an Interactive Correction Loop (IC-Loop) for LLMs to interact with databases automatically, as well as Python-enhanced reasoning. We conduct experiments on two Text-to-SQL datasets, Spider and Bird, with performance improvements of up to 5.7% compared to the naive prompting method. Moreover, our method surpasses the current state-of-the-art on the Spider Leaderboard, demonstrating the effectiveness of our framework.
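Note: an execute-and-retry loop of the kind the abstract calls an Interactive Correction Loop can be sketched with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions, not the authors' implementation: `correction_loop` and `toy_generator` are hypothetical names, and the generator is a hard-coded stand-in for an LLM call.

```python
import sqlite3

def correction_loop(conn, question, generate_sql, max_rounds=3):
    """Ask for SQL, run it, and feed any database error back to the generator."""
    feedback = None
    for _ in range(max_rounds):
        sql = generate_sql(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows
        except sqlite3.Error as exc:
            # The error message becomes the feedback for the next attempt.
            feedback = f"Query failed: {exc}"
    raise RuntimeError("no executable query within the round limit")

def toy_generator(question, feedback):
    """Stand-in for a model: misspells a column first, fixes it after feedback."""
    if feedback is None:
        return "SELECT nme FROM singer"
    return "SELECT name FROM singer"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT)")
conn.execute("INSERT INTO singer VALUES ('Ada')")
sql, rows = correction_loop(conn, "List singer names", toy_generator)
```

The loop only detects queries that fail to execute; a query that runs but returns the wrong rows would need a different check.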

Title: HumanEval on Latest GPT Models -- 2024

Authors: Daniel Li, Lincoln Murr
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14852
Pdf URL: https://arxiv.org/pdf/2402.14852
Copy Paste: [[2402.14852]] HumanEval on Latest GPT Models -- 2024(https://arxiv.org/abs/2402.14852)
Keywords: language model, gpt, prompt
Abstract: In 2023, we are using the latest models of GPT-4 to advance program synthesis. The large language models have significantly improved the state-of-the-art for this purpose. To make these advancements more accessible, we have created a repository that connects these models to HumanEval. This dataset was initially developed to be used with a language model called CODEGEN on natural and programming language data. The utility of these trained models is showcased by demonstrating their competitive performance in zero-shot Python code generation on HumanEval tasks compared to previous state-of-the-art solutions. Additionally, this gives way to developing more multi-step paradigm synthesis. This benchmark features 160 diverse problem sets factorized into multistep prompts that our analysis shows significantly improve program synthesis over single-turn inputs. All code is open source at https://github.com/daniel442li/gpt-human-eval .

Title: NL2Formula: Generating Spreadsheet Formulas from Natural Language Queries

Authors: Wei Zhao, Zhitao Hou, Siyuan Wu, Yan Gao, Haoyu Dong, Yao Wan, Hongyu Zhang, Yulei Sui, Haidong Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14853
Pdf URL: https://arxiv.org/pdf/2402.14853
Copy Paste: [[2402.14853]] NL2Formula: Generating Spreadsheet Formulas from Natural Language Queries(https://arxiv.org/abs/2402.14853)
Keywords: gpt
Abstract: Writing formulas on spreadsheets, such as Microsoft Excel and Google Sheets, is a widespread practice among users performing data analysis. However, crafting formulas on spreadsheets remains a tedious and error-prone task for many end-users, particularly when dealing with complex operations. To alleviate the burden associated with writing spreadsheet formulas, this paper introduces a novel benchmark task called NL2Formula, with the aim to generate executable formulas that are grounded on a spreadsheet table, given a Natural Language (NL) query as input. To accomplish this, we construct a comprehensive dataset consisting of 70,799 paired NL queries and corresponding spreadsheet formulas, covering 21,670 tables and 37 types of formula functions. We realize the NL2Formula task by providing a sequence-to-sequence baseline implementation called fCoder. Experimental results validate the effectiveness of fCoder, demonstrating its superior performance compared to the baseline models. Furthermore, we also compare fCoder with an initial GPT-3.5 model (i.e., text-davinci-003). Lastly, through in-depth error analysis, we identify potential challenges in the NL2Formula task and advocate for further investigation.

Title: A Dual-Prompting for Interpretable Mental Health Language Models

Authors: Hyolim Jeon, Dongje Yoo, Daeun Lee, Sejung Son, Seungbae Kim, Jinyoung Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14854
Pdf URL: https://arxiv.org/pdf/2402.14854
Copy Paste: [[2402.14854]] A Dual-Prompting for Interpretable Mental Health Language Models(https://arxiv.org/abs/2402.14854)
Keywords: language model, llm, prompt
Abstract: Despite the increasing demand for AI-based mental health monitoring tools, their practical utility for clinicians is limited by the lack of interpretability. The CLPsych 2024 Shared Task (Chim et al., 2024) aims to enhance the interpretability of Large Language Models (LLMs), particularly in mental health analysis, by providing evidence of suicidality through linguistic content. We propose a dual-prompting approach: (i) Knowledge-aware evidence extraction by leveraging the expert identity and a suicide dictionary with a mental health-specific LLM; and (ii) Evidence summarization by employing an LLM-based consistency evaluator. Comprehensive experiments demonstrate the effectiveness of combining domain-specific information, revealing performance improvements and the approach's potential to aid clinicians in assessing mental state progression.

Title: An LLM Maturity Model for Reliable and Transparent Text-to-Query

Authors: Lei Yu (Expression), Abir Ray (Expression)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14855
Pdf URL: https://arxiv.org/pdf/2402.14855
Copy Paste: [[2402.14855]] An LLM Maturity Model for Reliable and Transparent Text-to-Query(https://arxiv.org/abs/2402.14855)
Keywords: language model, llm
Abstract: Recognizing the imperative to address the reliability and transparency issues of Large Language Models (LLMs), this work proposes an LLM maturity model tailored for text-to-query applications. This maturity model seeks to fill the existing void in evaluating LLMs in such applications by incorporating dimensions beyond mere correctness or accuracy. Moreover, this work introduces a real-world use case from the law enforcement domain and showcases QueryIQ, an LLM-powered, domain-specific text-to-query assistant to expedite user workflows and reveal hidden relationships in data.

Title: Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning

Authors: Philipp Mondorf, Barbara Plank
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14856
Pdf URL: https://arxiv.org/pdf/2402.14856
Copy Paste: [[2402.14856]] Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning(https://arxiv.org/abs/2402.14856)
Keywords: language model, llm
Abstract: Deductive reasoning plays a pivotal role in the formulation of sound and cohesive arguments. It allows individuals to draw conclusions that logically follow, given the truth value of the information provided. Recent progress in the domain of large language models (LLMs) has showcased their capability in executing deductive reasoning tasks. Nonetheless, a significant portion of research primarily assesses the accuracy of LLMs in solving such tasks, often overlooking a deeper analysis of their reasoning behavior. In this study, we draw upon principles from cognitive psychology to examine inferential strategies employed by LLMs, through a detailed evaluation of their responses to propositional logic problems. Our findings indicate that LLMs display reasoning patterns akin to those observed in humans, including strategies like $\textit{supposition following}$ or $\textit{chain construction}$. Moreover, our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning, with more advanced models tending to adopt strategies more frequently than less sophisticated ones. Importantly, we assert that a model's accuracy, that is the correctness of its final conclusion, does not necessarily reflect the validity of its reasoning process. This distinction underscores the necessity for more nuanced evaluation procedures in the field.

Title: Is the System Message Really Important to Jailbreaks in Large Language Models?

Authors: Xiaotian Zou, Yongkang Chen, Ke Li
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2402.14857
Pdf URL: https://arxiv.org/pdf/2402.14857
Copy Paste: [[2402.14857]] Is the System Message Really Important to Jailbreaks in Large Language Models?(https://arxiv.org/abs/2402.14857)
Keywords: language model, gpt, llm, prompt
Abstract: The rapid evolution of Large Language Models (LLMs) has rendered them indispensable in modern society. While security measures are typically in place to align LLMs with human values prior to release, recent studies have unveiled a concerning phenomenon named "jailbreak." This term refers to the unexpected and potentially harmful responses generated by LLMs when prompted with malicious questions. Existing research focuses on generating jailbreak prompts, but our study aims to answer a different question: Is the system message really important to jailbreak in LLMs? To address this question, we conducted experiments on a stable GPT version, gpt-3.5-turbo-0613, to generate jailbreak prompts with varying system messages: short, long, and none. Through these experiments, we discover that different system messages have distinct resistances to jailbreak. Additionally, we explore the transferability of jailbreak across LLMs. This finding underscores the significant impact system messages can have on mitigating LLM jailbreaks. To generate system messages that are more resistant to jailbreak prompts, we propose System Messages Evolutionary Algorithms (SMEA). Through SMEA, we can obtain a robust population of system messages that demonstrates up to 98.9% resistance against jailbreak prompts. Our research not only bolsters LLM security but also raises the bar for jailbreak, fostering advancements in this field of study.

Title: ChatEL: Entity Linking with Chatbots

Authors: Yifan Ding, Qingkai Zeng, Tim Weninger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14858
Pdf URL: https://arxiv.org/pdf/2402.14858
Copy Paste: [[2402.14858]] ChatEL: Entity Linking with Chatbots(https://arxiv.org/abs/2402.14858)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Entity Linking (EL) is an essential and challenging task in natural language processing that seeks to link some text representing an entity within a document or sentence with its corresponding entry in a dictionary or knowledge base. Most existing approaches focus on creating elaborate contextual models that look for clues in the words surrounding the entity-text to help solve the linking problem. Although these fine-tuned language models tend to work, they can be unwieldy, difficult to train, and do not transfer well to other domains. Fortunately, Large Language Models (LLMs) like GPT provide a highly-advanced solution to the problems inherent in EL models, but naive prompts to LLMs do not work well. In the present work, we define ChatEL, which is a three-step framework to prompt LLMs to return accurate results. Overall, the ChatEL framework improves the average F1 performance across 10 datasets by more than 2%. Finally, a thorough error analysis shows that in many instances the ground truth labels were actually incorrect, and the labels predicted by ChatEL were actually correct. This indicates that the quantitative results presented in this paper may be a conservative estimate of the actual performance. All data and code are available as an open-source package on GitHub at https://github.com/yifding/In_Context_EL.

Title: Ranking Large Language Models without Ground Truth

Authors: Amit Dhurandhar, Rahul Nair, Moninder Singh, Elizabeth Daly, Karthikeyan Natesan Ramamurthy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14860
Pdf URL: https://arxiv.org/pdf/2402.14860
Copy Paste: [[2402.14860]] Ranking Large Language Models without Ground Truth(https://arxiv.org/abs/2402.14860)
Keywords: language model, llm, prompt
Abstract: Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses, which are expensive to acquire, or use pairs of LLMs to evaluate each other, which can be unreliable. In this paper, we provide a novel perspective where, given a dataset of prompts (viz. questions, instructions, etc.) and a set of LLMs, we rank them without access to any ground truth or reference responses. Inspired by real life, where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models, where each one of them evaluates the other two, correctly identifying the worst model in the triplet with high probability. We also analyze our idea and provide sufficient conditions for it to succeed. Applying this idea repeatedly, we propose two methods to rank LLMs. In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover close to true rankings without reference data. This points to a viable low-resource mechanism for practical use.
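
The triplet idea in this abstract can be illustrated with a small toy sketch. It is not the authors' method: the abstract does not specify their two ranking procedures, so the aggregation below (summing "worse" votes over all triplets) is an assumption made for illustration. The names `TRUE_QUALITY`, `toy_judge`, `triplet_votes`, and `rank_models` are hypothetical, and the judge is simulated from hidden scores where a real system would call an LLM to compare two responses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical hidden quality scores, used only to simulate judges in this toy example.
TRUE_QUALITY = {"m1": 0.7, "m2": 0.2, "m3": 0.9, "m4": 0.5}


def toy_judge(evaluator, a, b):
    """Return whichever of (a, b) the evaluator considers worse.

    Assumption: every evaluator judges correctly. A real judge would be an
    LLM comparing the two models' responses and could be noisy.
    """
    return a if TRUE_QUALITY[a] < TRUE_QUALITY[b] else b


def triplet_votes(triplet, judge):
    """Each model in the triplet evaluates the other two and votes for the worse one."""
    votes = Counter()
    for evaluator in triplet:
        a, b = [m for m in triplet if m != evaluator]
        votes[judge(evaluator, a, b)] += 1
    return votes


def rank_models(models, judge):
    """Rank models from best to worst by total 'worse' votes over all triplets."""
    totals = Counter({m: 0 for m in models})
    for triplet in combinations(models, 3):
        totals.update(triplet_votes(triplet, judge))
    return sorted(models, key=lambda m: totals[m])


print(rank_models(["m1", "m2", "m3", "m4"], toy_judge))
```

With perfectly accurate judges, the worst model in each triplet receives two of the three votes, so the vote totals order the models by quality. With noisy judges the totals only approximate that order.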

Title: Evaluation of a semi-autonomous attentive listening system with takeover prompting

Authors: Haruki Kawai, Divesh Lala, Koji Inoue, Keiko Ochi, Tatsuya Kawahara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.14863
Pdf URL: https://arxiv.org/pdf/2402.14863
Copy Paste: [[2402.14863]] Evaluation of a semi-autonomous attentive listening system with takeover prompting(https://arxiv.org/abs/2402.14863)
Keywords: prompt, chat
Abstract: The handling of communication breakdowns and loss of engagement is an important aspect of spoken dialogue systems, particularly for chatting systems such as attentive listening, where the user is mostly speaking. We presume that a human is best equipped to handle this task and rescue the flow of conversation. To this end, we propose a semi-autonomous system, where a remote operator can take control of an autonomous attentive listening system in real-time. In order to make human intervention easy and consistent, we introduce automatic detection of low interest and engagement to provide explicit takeover prompts to the remote operator. We implement this semi-autonomous system which detects takeover points for the operator and compare it to fully tele-operated and fully autonomous attentive listening systems. We find that the semi-autonomous system is generally perceived more positively than the autonomous system. The results suggest that identifying points of conversation when the user starts to lose interest may help us improve a fully autonomous dialogue system.

Title: DyVal 2: Dynamic Evaluation of Large Language Models by Meta Probing Agents

Authors: Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, Xing Xie
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14865
Pdf URL: https://arxiv.org/pdf/2402.14865
Copy Paste: [[2402.14865]] DyVal 2: Dynamic Evaluation of Large Language Models by Meta Probing Agents(https://arxiv.org/abs/2402.14865)
Keywords: language model, llm, agent
Abstract: Evaluation of large language models (LLMs) has raised great concerns in the community due to the issue of data contamination. Existing work designed evaluation protocols using well-defined algorithms for specific tasks, which cannot be easily extended to diverse scenarios. Moreover, current evaluation benchmarks can only provide the overall benchmark results and cannot support a fine-grained and multifaceted analysis of LLMs' abilities. In this paper, we propose meta probing agents (MPA), a general dynamic evaluation protocol inspired by psychometrics to evaluate LLMs. MPA is the key component of DyVal 2, which naturally extends the previous DyVal (Zhu et al., 2023). MPA designs the probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory on three basic cognitive abilities: language understanding, problem solving, and domain knowledge. These basic abilities are also dynamically configurable, allowing multifaceted analysis. We conducted extensive evaluations using MPA and found that most LLMs achieve poorer performance, indicating room for improvement. Our multifaceted analysis demonstrated the strong correlation between the basic abilities and an implicit Matthew effect on model size, i.e., larger models possess stronger correlations of the abilities. MPA can also be used as a data augmentation approach to enhance LLMs.

Title: APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

Authors: Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, Hao Yu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.14866
Pdf URL: https://arxiv.org/pdf/2402.14866
Copy Paste: [[2402.14866]] APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models(https://arxiv.org/abs/2402.14866)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, the high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show APTQ surpasses previous quantization methods, achieving, at an average bit width of 4, a perplexity of 5.22 on the C4 dataset, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bit width of 3.8 in LLaMa-7B and LLaMa-13B, respectively, demonstrating its effectiveness in producing high-quality quantized LLMs.
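
As a rough illustration of using a Hessian trace as a sensitivity metric, the sketch below estimates a trace with Hutchinson's estimator and then assigns more bits to the more sensitive layers. It is a generic sketch, not APTQ itself: the function names, the 4-bit/2-bit choice, the target average, and the equal-layer-size assumption are all hypothetical and not taken from the paper.

```python
import random


def hutchinson_trace(matvec, dim, num_samples=32, seed=0):
    """Estimate tr(H) as the mean of v^T H v over random Rademacher vectors v.

    `matvec(v)` must return the product H v as a list of floats.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hv = matvec(v)
        total += sum(vi * hvi for vi, hvi in zip(v, hv))
    return total / num_samples


def allocate_bits(sensitivities, high_bits=4, low_bits=2, target_avg=3.5):
    """Give high precision to the most sensitive layers under an average-bit budget.

    Assumption: all layers have the same number of weights, so the average
    bit width is a plain mean over layers.
    """
    n = len(sensitivities)
    num_high = int(n * (target_avg - low_bits) / (high_bits - low_bits))
    order = sorted(range(n), key=lambda i: sensitivities[i], reverse=True)
    bits = [low_bits] * n
    for i in order[:num_high]:
        bits[i] = high_bits
    return bits


# Toy diagonal "Hessian" with diagonal entries 1, 2, 3.
diag = [1.0, 2.0, 3.0]
print(hutchinson_trace(lambda v: [d * x for d, x in zip(diag, v)], dim=3))
print(allocate_bits([0.1, 5.0, 2.0, 3.0]))
```

In a real model the matrix-vector product would come from automatic differentiation, not an explicit matrix, since the Hessian of a layer is far too large to store.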

Title: LLM Based Multi-Agent Generation of Semi-structured Documents from Semantic Templates in the Public Administration Domain

Authors: Emanuele Musumeci, Michele Brienza, Vincenzo Suriani, Daniele Nardi, Domenico Daniele Bloisi
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2402.14871
Pdf URL: https://arxiv.org/pdf/2402.14871
Copy Paste: [[2402.14871]] LLM Based Multi-Agent Generation of Semi-structured Documents from Semantic Templates in the Public Administration Domain(https://arxiv.org/abs/2402.14871)
Keywords: language model, llm, prompt, agent
Abstract: In the digitalization process of recent years, the creation and management of documents in various domains, particularly in Public Administration (PA), have become increasingly complex and diverse. This complexity arises from the need to handle a wide range of document types, often characterized by semi-structured forms. Semi-structured documents present a fixed set of data without a fixed format. As a consequence, a template-based solution cannot be used, as understanding a document requires the extraction of the data structure. The recent introduction of Large Language Models (LLMs) has enabled the creation of customized text output satisfying user requests. In this work, we propose a novel approach that combines the LLMs with prompt engineering and multi-agent systems for generating new documents compliant with a desired structure. The main contribution of this work concerns replacing the commonly used manual prompting with a task description generated by semantic retrieval from an LLM. The potential of this approach is demonstrated through a series of experiments and case studies, showcasing its effectiveness in real-world PA scenarios.

Title: Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs

Authors: Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang
Subjects: cs.CL, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2402.14872
Pdf URL: https://arxiv.org/pdf/2402.14872
Copy Paste: [[2402.14872]] Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs(https://arxiv.org/abs/2402.14872)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs), used in creative writing, code generation, and translation, generate text based on input sequences but are vulnerable to jailbreak attacks, where crafted prompts induce harmful outputs. Most jailbreak prompt methods use a combination of jailbreak templates followed by questions to ask to create jailbreak prompts. However, existing jailbreak prompt designs generally suffer from excessive semantic differences, resulting in an inability to resist defenses that use simple semantic metrics as thresholds. Jailbreak prompts are semantically more varied than the original questions used for queries. In this paper, we introduce a Semantic Mirror Jailbreak (SMJ) approach that bypasses LLMs by generating jailbreak prompts that are semantically similar to the original question. We model the search for jailbreak prompts that satisfy both semantic similarity and jailbreak validity as a multi-objective optimization problem and employ a standardized set of genetic algorithms for generating eligible prompts. Compared to the baseline AutoDAN-GA, SMJ achieves attack success rates (ASR) that are at most 35.4% higher without ONION defense and 85.2% higher with ONION defense. SMJ's better performance in all three semantic meaningfulness metrics of Jailbreak Prompt, Similarity, and Outlier, also means that SMJ is resistant to defenses that use those metrics as thresholds.

Title: Technical Report on the Checkfor.ai AI-Generated Text Classifier

Authors: Bradley Emi, Max Spero
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14873
Pdf URL: https://arxiv.org/pdf/2402.14873
Copy Paste: [[2402.14873]] Technical Report on the Checkfor.ai AI-Generated Text Classifier(https://arxiv.org/abs/2402.14873)
Keywords: language model, gpt
Abstract: We present the Checkfor.ai text classifier, a transformer-based neural network trained to distinguish text written by large language models from text written by humans. Checkfor.ai outperforms zero-shot methods such as DetectGPT as well as leading commercial AI detection tools with over 9 times lower error rates on a comprehensive benchmark comprised of ten text domains (student writing, creative writing, scientific writing, books, encyclopedias, news, email, scientific papers, short-form Q&A) and 8 open- and closed-source large language models. We propose a training algorithm, hard negative mining with synthetic mirrors, that enables our classifier to achieve orders of magnitude lower false positive rates on high-data domains such as reviews. Finally, we show that Checkfor.ai is not biased against nonnative English speakers and generalizes to domains and models unseen during training.

Title: Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation

Authors: Phuc Phan, Hieu Tran, Long Phan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14874
Pdf URL: https://arxiv.org/pdf/2402.14874
Copy Paste: [[2402.14874]] Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation(https://arxiv.org/abs/2402.14874)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: We propose a straightforward approach called Distillation Contrastive Decoding (DCD) to enhance the reasoning capabilities of Large Language Models (LLMs) during inference. In contrast to previous approaches that relied on smaller amateur models or analysis of hidden state differences, DCD employs Contrastive Chain-of-thought Prompting and advanced distillation techniques, including Dropout and Quantization. This approach effectively addresses the limitations of Contrastive Decoding (CD), which typically requires both an expert and an amateur model, thus increasing computational resource demands. By integrating contrastive prompts with distillation, DCD obviates the need for an amateur model and reduces memory usage. Our evaluations demonstrate that DCD significantly enhances LLM performance across a range of reasoning benchmarks, surpassing both CD and existing methods in the GSM8K and StrategyQA datasets.
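
For readers unfamiliar with contrastive decoding, the sketch below shows the commonly used logit-level formulation: tokens the stronger pass finds implausible are masked out, and the remaining tokens are scored by boosting the stronger pass and subtracting the weaker one. This is an assumed, generic formulation and not necessarily the exact scoring used by DCD. In DCD the weaker signal would come from the same model under a contrastive prompt with dropout or quantization, which is not modelled here. The names `contrastive_scores`, `alpha`, and `beta` are illustrative.

```python
import math


def contrastive_scores(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    """Score next-token candidates by contrasting two sets of logits.

    Tokens whose expert logit falls below max(expert) + log(alpha) are
    masked to -inf; the rest get (1 + beta) * expert - beta * amateur.
    """
    cutoff = max(expert_logits) + math.log(alpha)
    scores = []
    for e, a in zip(expert_logits, amateur_logits):
        if e < cutoff:
            scores.append(float("-inf"))  # implausible under the expert pass
        else:
            scores.append((1 + beta) * e - beta * a)
    return scores


# Toy vocabulary of three tokens: the weaker pass strongly favors token 0.
scores = contrastive_scores([2.0, 1.8, -5.0], [2.0, 0.0, -5.0])
best_token = scores.index(max(scores))
print(best_token, scores)
```

In this toy case, greedy decoding on the expert logits alone would pick token 0, while the contrastive score picks token 1 because the weaker pass also rates token 0 highly.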

Title: What's in a Name? Auditing Large Language Models for Race and Gender Bias

Authors: Amit Haim, Alejandro Salinas, Julian Nyarko
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14875
Pdf URL: https://arxiv.org/pdf/2402.14875
Copy Paste: [[2402.14875]] What's in a Name? Auditing Large Language Models for Race and Gender Bias(https://arxiv.org/abs/2402.14875)
Keywords: language model, gpt, llm, prompt
Abstract: We employ an audit design to investigate biases in state-of-the-art large language models, including GPT-4. In our study, we prompt the models for advice regarding an individual across a variety of scenarios, such as during car purchase negotiations or election outcome predictions. We find that the advice systematically disadvantages names that are commonly associated with racial minorities and women. Names associated with Black women receive the least advantageous outcomes. The biases are consistent across 42 prompt templates and several models, indicating a systemic issue rather than isolated incidents. While providing numerical, decision-relevant anchors in the prompt can successfully counteract the biases, qualitative details have inconsistent effects and may even increase disparities. Our findings underscore the importance of conducting audits at the point of LLM deployment and implementation to mitigate their potential for harm against marginalized communities.

Title: Driving Generative Agents With Their Personality

Authors: Lawrence J. Klinkert, Stephanie Buongiorno, Corey Clark
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14879
Pdf URL: https://arxiv.org/pdf/2402.14879
Copy Paste: [[2402.14879]] Driving Generative Agents With Their Personality(https://arxiv.org/abs/2402.14879)
Keywords: language model, gpt, llm, prompt, agent
Abstract: This research explores the potential of Large Language Models (LLMs) to utilize psychometric values, specifically personality information, within the context of video game character development. Affective Computing (AC) systems quantify a Non-Player Character's (NPC) psyche, and an LLM can take advantage of the system's information by using the values for prompt generation. The research shows an LLM can consistently represent a given personality profile, thereby enhancing the human-like characteristics of game characters. Repurposing a human examination, the International Personality Item Pool (IPIP) questionnaire, to evaluate an LLM shows that the model can accurately generate content concerning the personality provided. Results show that improved LLMs, such as the latest GPT-4 model, can consistently utilize and interpret a personality to represent behavior.

Title: Automatic Histograms: Leveraging Language Models for Text Dataset Exploration

Authors: Emily Reif, Crystal Qian, James Wexler, Minsuk Kahng
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2402.14880
Pdf URL: https://arxiv.org/pdf/2402.14880
Copy Paste: [[2402.14880]] Automatic Histograms: Leveraging Language Models for Text Dataset Exploration(https://arxiv.org/abs/2402.14880)
Keywords: language model, llm
Abstract: Making sense of unstructured text datasets is perennially difficult, yet increasingly relevant with Large Language Models. Data workers often rely on dataset summaries, especially distributions of various derived features. Some features, like toxicity or topics, are relevant to many datasets, but many interesting features are domain specific: instruments and genres for a music dataset, or diseases and symptoms for a medical dataset. Accordingly, data workers often run custom analyses for each dataset, which is cumbersome and difficult. We present AutoHistograms, a visualization tool leveraging LLMs. AutoHistograms automatically identifies relevant features, visualizes them with histograms, and allows the user to interactively query the dataset for categories of entities and create new histograms. In a user study with 10 data workers (n=10), we observe that participants can quickly identify insights and explore the data using AutoHistograms, and conceptualize a broad range of applicable use cases. Together, this tool and user study contribute to the growing field of LLM-assisted sensemaking tools.

Title: A Study on the Vulnerability of Test Questions against ChatGPT-based Cheating

Authors: Shanker Ram, Chen Qian
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2402.14881
Pdf URL: https://arxiv.org/pdf/2402.14881
Copy Paste: [[2402.14881]] A Study on the Vulnerability of Test Questions against ChatGPT-based Cheating(https://arxiv.org/abs/2402.14881)
Keywords: gpt, prompt, chat
Abstract: ChatGPT is a chatbot that can answer text prompts fairly accurately, even performing very well on postgraduate-level questions. Many educators have found that their take-home or remote tests and exams are vulnerable to ChatGPT-based cheating because students may directly use answers provided by tools like ChatGPT. In this paper, we try to provide an answer to an important question: how well ChatGPT can answer test questions and how we can detect whether the questions of a test can be answered correctly by ChatGPT. We generated ChatGPT's responses to the MedMCQA dataset, which contains over 10,000 medical school entrance exam questions. We analyzed the responses and uncovered certain types of questions ChatGPT answers more inaccurately than others. In addition, we have created a basic natural language processing model to single out the most vulnerable questions to ChatGPT in a collection of questions or a sample exam. Our tool can be used by test-makers to avoid ChatGPT-vulnerable test questions.

Title: COBIAS: Contextual Reliability in Bias Assessment

Authors: Priyanshul Govil, Vamshi Krishna Bonagiri, Manas Gaur, Ponnurangam Kumaraguru, Sanorita Dey
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14889
Pdf URL: https://arxiv.org/pdf/2402.14889
Copy Paste: [[2402.14889]] COBIAS: Contextual Reliability in Bias Assessment(https://arxiv.org/abs/2402.14889)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are trained on inherently biased data. Previous works on debiasing models rely on benchmark datasets to measure model performance. However, these datasets suffer from several pitfalls due to the extremely subjective understanding of bias, highlighting a critical need for contextual exploration. We propose understanding the context of user inputs with consideration of the diverse situations in which input statements are possible. This approach would allow for frameworks that foster bias awareness rather than guardrails that hurt user engagement. Our contribution is twofold: (i) we create a dataset of 2287 stereotyped statements augmented with points for adding context; (ii) we develop the Context-Oriented Bias Indicator and Assessment Score (COBIAS) to assess statements' contextual reliability in measuring bias. Our metric is a significant predictor of the contextual reliability of bias-benchmark datasets ($\chi^2=71.02$, $p<2.2 \cdot 10^{-16}$). COBIAS can be used to create reliable datasets, resulting in an improvement in bias mitigation works.
摘要：大型语言模型 (LLM) 是根据固有偏差的数据进行训练的。先前的去偏模型工作依赖于基准数据集来衡量模型性能。然而，由于对偏见的极其主观的理解，这些数据集存在一些陷阱，这凸显了对情境探索的迫切需要。我们建议在考虑输入语句可能的不同情况的情况下理解用户输入的上下文。这种方法将允许建立培养偏见意识的框架，而不是损害用户参与度的护栏。我们的贡献是双重的：(i) 我们创建了一个包含 2287 个刻板语句的数据集，并添加了用于添加上下文的点； (ii) 我们开发了上下文导向偏差指标和评估分数（COBIAS）来评估陈述在测量偏差时的上下文可靠性。我们的指标是偏差基准数据集上下文可靠性的重要预测因子 ($\chi^2=71.02, p<2.2 \cdot 10^{-16})$。 COBIAS 可用于创建可靠的数据集，从而改进偏差缓解工作。

Title: LLMBind: A Unified Modality-Task Integration Framework

Authors: Bin Zhu, Peng Jin, Munan Ning, Bin Lin, Jinfa Huang, Qi Song, Mingjun Pan, Li Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14891
Pdf URL: https://arxiv.org/pdf/2402.14891
Copy Paste: [[2402.14891]] LLMBind: A Unified Modality-Task Integration Framework(https://arxiv.org/abs/2402.14891)
Keywords: language model, llm, agent
Abstract: While recent multimodal large language models tackle various modality tasks, they possess limited integration capabilities for complex multi-modality tasks, consequently constraining the development of the field. In this work, we take the initiative to explore and propose LLMBind, a unified framework for modality task integration, which binds Large Language Models and corresponding pre-trained task models with task-specific tokens. Consequently, LLMBind can interpret inputs and produce outputs in versatile combinations of image, text, video, and audio. Specifically, we introduce a Mixture-of-Experts technique to enable effective learning for different multimodal tasks through collaboration among diverse experts. Furthermore, we create a multi-task dataset comprising 400k instruction data, which unlocks the ability for interactive visual generation and editing tasks. Extensive experiments show the effectiveness of our framework across various tasks, including image, video, audio generation, image segmentation, and image editing. More encouragingly, our framework can be easily extended to other modality tasks, showcasing the promising potential of creating a unified AI agent for modeling universal modalities.

Title: Data Augmentation is Dead, Long Live Data Augmentation

Authors: Frédéric Piedboeuf, Philippe Langlais
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14895
Pdf URL: https://arxiv.org/pdf/2402.14895
Copy Paste: [[2402.14895]] Data Augmentation is Dead, Long Live Data Augmentation(https://arxiv.org/abs/2402.14895)
Keywords: gpt, chat, agent
Abstract: Textual data augmentation (DA) is a prolific field of study where novel techniques to create artificial data are regularly proposed, and that has demonstrated great efficiency on small data settings, at least for text classification tasks. In this paper, we challenge those results, showing that classical data augmentation is simply a way of performing better fine-tuning, and that spending more time fine-tuning before applying data augmentation negates its effect. This is a significant contribution as it answers several questions that were left open in recent years, namely: which DA technique performs best (all of them, as long as they generate data close enough to the training set so as not to impair training) and why did DA show positive results (it facilitates training of the network). We furthermore show that zero- and few-shot data generation via conversational agents such as ChatGPT or LLama2 can increase performance, concluding that this form of data augmentation does still work, even if classical methods do not.

Title: Chain-of-Thought Unfaithfulness as Disguised Accuracy

Authors: Oliver Bentham, Nathan Stringham, Ana Marasović
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14897
Pdf URL: https://arxiv.org/pdf/2402.14897
Copy Paste: [[2402.14897]] Chain-of-Thought Unfaithfulness as Disguised Accuracy(https://arxiv.org/abs/2402.14897)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model's (LLM) internal computations is critical for deciding whether to trust an LLM's output. As a proxy for CoT faithfulness, arXiv:2307.13702 propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate their experimental setup with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, we discover that simply changing the order of answer choices in the prompt can reduce the metric by 73 percentage points. The faithfulness metric is also highly correlated ($R^2$ = 0.91) with accuracy, raising doubts about its validity as a construct for evaluating faithfulness.

Title: Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

Authors: Aaditya K. Singh, DJ Strouse
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14903
Pdf URL: https://arxiv.org/pdf/2402.14903
Copy Paste: [[2402.14903]] Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs(https://arxiv.org/abs/2402.14903)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Tokenization, the division of input text into input tokens, is an often overlooked aspect of the large language model (LLM) pipeline and could be the source of useful or harmful inductive biases. Historically, LLMs have relied on byte pair encoding, without care to specific input domains. With the increased use of LLMs for reasoning, various number-specific tokenization schemes have been adopted, with popular models like LLaMa and PaLM opting for single-digit tokenization while GPT-3.5 and GPT-4 have separate tokens for each 1-, 2-, and 3-digit numbers. In this work, we study the effect this choice has on numerical reasoning through the use of arithmetic tasks. We consider left-to-right and right-to-left tokenization for GPT-3.5 and -4, finding that right-to-left tokenization (enforced by comma separating numbers at inference time) leads to largely improved performance. Furthermore, we find that model errors when using standard left-to-right tokenization follow stereotyped error patterns, suggesting that model computations are systematic rather than approximate. We show that the model is able to convert between tokenizations easily, thus allowing chain-of-thought-inspired approaches to recover performance on left-to-right tokenized inputs. We also find the gap between tokenization directions decreases when models are scaled, possibly indicating that larger models are better able to override this tokenization-dependent inductive bias. In summary, our work performs the first study of how number tokenization choices lead to differences in model performance on arithmetic tasks, accompanied by a thorough analysis of error patterns. We hope this work inspires practitioners to more carefully ablate number tokenization-related choices when working towards general models of numerical reasoning.
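
The difference between the two tokenization directions discussed in this abstract can be illustrated with a minimal sketch. It assumes, as a simplification, a tokenizer that greedily groups up to three digits starting from the left; the function names are hypothetical and this is not the authors' code. Inserting thousands separators forces the right-aligned grouping instead.

```python
def chunk_left_to_right(digits: str, size: int = 3) -> list[str]:
    """Greedily group digits from the left (assumed default behaviour)."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_right_to_left(digits: str, size: int = 3) -> list[str]:
    """Group digits from the right, so chunks align with place value."""
    head = len(digits) % size  # length of the short leading chunk, if any
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


def with_commas(n: int) -> str:
    """Format an integer with thousands separators, e.g. for a prompt."""
    return f"{n:,}"


if __name__ == "__main__":
    print(chunk_left_to_right("1234567"))
    print(chunk_right_to_left("1234567"))
    print(with_commas(1234567))
```

Under these assumptions the comma-separated string splits into exactly the chunks that right-to-left grouping produces, which is why comma formatting at inference time can serve as a way to enforce that direction.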

Title: MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Authors: Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.14905
Pdf URL: https://arxiv.org/pdf/2402.14905
Copy Paste: [[2402.14905]] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases(https://arxiv.org/abs/2402.14905)
Keywords: language model, llm, chat
Abstract: This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% over MobileLLM 125M/350M. Moreover, the MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.

Title: Mirror: A Multiple-perspective Self-Reflection Method for Knowledge-rich Reasoning

Authors: Hanqi Yan, Qinglin Zhu, Xinyu Wang, Lin Gui, Yulan He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14963
Pdf URL: https://arxiv.org/pdf/2402.14963
Copy Paste: [[2402.14963]] Mirror: A Multiple-perspective Self-Reflection Method for Knowledge-rich Reasoning(https://arxiv.org/abs/2402.14963)
Keywords: language model, llm, agent
Abstract: While large language models (LLMs) have the capability to iteratively reflect on their own outputs, recent studies have observed their struggles with knowledge-rich problems without access to external resources. In addition to the inefficiency of LLMs in self-assessment, we also observe that LLMs struggle to revisit their predictions despite receiving explicit negative feedback. Therefore, we propose Mirror, a Multiple-perspective self-reflection method for knowledge-rich reasoning, to avoid getting stuck at a particular reflection iteration. Mirror enables LLMs to reflect from multiple-perspective clues, achieved through a heuristic interaction between a Navigator and a Reasoner. It guides agents toward diverse yet plausibly reliable reasoning trajectories without access to ground truth by encouraging (1) diversity of directions generated by the Navigator and (2) agreement among strategically induced perturbations in responses generated by the Reasoner. The experiments on five reasoning datasets demonstrate Mirror's superiority over several contemporary self-reflection approaches. Additionally, the ablation studies clearly indicate that our strategies alleviate the aforementioned challenges.

Title: MultiLS: A Multi-task Lexical Simplification Framework

Authors: Kai North, Tharindu Ranasinghe, Matthew Shardlow, Marcos Zampieri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.14972
Pdf URL: https://arxiv.org/pdf/2402.14972
Copy Paste: [[2402.14972]] MultiLS: A Multi-task Lexical Simplification Framework(https://arxiv.org/abs/2402.14972)
Keywords: language model, llm
Abstract: Lexical Simplification (LS) automatically replaces difficult-to-read words with easier alternatives while preserving a sentence's original meaning. LS is a precursor to Text Simplification with the aim of improving text accessibility to various target demographics, including children, second language learners, and individuals with reading disabilities or low literacy. Several datasets exist for LS. These LS datasets specialize in one or two sub-tasks within the LS pipeline. However, as of this moment, no single LS dataset has been developed that covers all LS sub-tasks. We present MultiLS, the first LS framework that allows for the creation of a multi-task LS dataset. We also present MultiLS-PT, the first dataset to be created using the MultiLS framework. We demonstrate the potential of MultiLS-PT by carrying out all LS sub-tasks of (1) lexical complexity prediction (LCP), (2) substitute generation, and (3) substitute ranking for Portuguese. Model performances are reported, ranging from transformer-based models to more recent large language models (LLMs).

Title: GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data

Authors: Lele Cao, Valentin Buchner, Zineb Senane, Fangkai Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.14973
Pdf URL: https://arxiv.org/pdf/2402.14973
Copy Paste: [[2402.14973]] GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data(https://arxiv.org/abs/2402.14973)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) are commonly evaluated using costly annotated multimodal benchmarks. However, these benchmarks often struggle to keep pace with the rapidly advancing requirements of MLLM evaluation. We propose GenCeption, a novel and annotation-free MLLM evaluation framework that merely requires unimodal data to assess inter-modality semantic coherence and inversely reflects the models' inclination to hallucinate. Analogous to the popular DrawCeption game, GenCeption initiates with a non-textual sample and undergoes a series of iterative description and generation steps. Semantic drift across iterations is quantified using the GC@T metric. Our empirical findings validate GenCeption's efficacy, showing strong correlations with popular MLLM benchmarking results. GenCeption may be extended to mitigate training data contamination by utilizing ubiquitous, previously unseen unimodal data.

Title: Optimizing Language Models for Human Preferences is a Causal Inference Problem

Authors: Victoria Lin, Eli Ben-Michael, Louis-Philippe Morency
Subjects: cs.LG, cs.CL, stat.ME
Abstract URL: https://arxiv.org/abs/2402.14979
Pdf URL: https://arxiv.org/pdf/2402.14979
Copy Paste: [[2402.14979]] Optimizing Language Models for Human Preferences is a Causal Inference Problem(https://arxiv.org/abs/2402.14979)
Keywords: language model, llm
Abstract: As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method--causal preference optimization (CPO)--that solves an unbiased surrogate objective for the problem. We further extend CPO with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions.

Title: tinyBenchmarks: evaluating LLMs with fewer examples

Authors: Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2402.14992
Pdf URL: https://arxiv.org/pdf/2402.14992
Copy Paste: [[2402.14992]] tinyBenchmarks: evaluating LLMs with fewer examples(https://arxiv.org/abs/2402.14992)
Keywords: language model, llm
Abstract: The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very expensive. In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. For example, we show that to accurately estimate the performance of an LLM on MMLU, a popular multiple-choice QA benchmark consisting of 14K examples, it is sufficient to evaluate this LLM on 100 curated examples. We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results.
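
As a rough point of comparison for the 100-example claim, the sketch below estimates accuracy from a uniformly random subsample and reports the binomial standard error. This is a naive baseline under stated assumptions, not the curated-example selection the abstract describes; `estimate_accuracy` and the synthetic correctness list are hypothetical.

```python
import math
import random


def estimate_accuracy(correct: list[bool], sample_size: int, seed: int = 0) -> tuple[float, float]:
    """Estimate accuracy and its standard error from a random subsample."""
    rng = random.Random(seed)
    sample = rng.sample(correct, sample_size)  # sampling without replacement
    p = sum(sample) / sample_size
    stderr = math.sqrt(p * (1 - p) / sample_size)  # binomial approximation
    return p, stderr


if __name__ == "__main__":
    # Synthetic per-example outcomes for illustration only (not real results).
    outcomes = [True] * 9800 + [False] * 4200
    p, se = estimate_accuracy(outcomes, sample_size=100)
    print(f"estimated accuracy {p:.2f} +/- {se:.2f}")
```

With 100 random examples the standard error can be as large as 0.05, which gives a sense of the noise a more careful selection of examples would need to reduce.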

Title: Divide-or-Conquer? Which Part Should You Distill Your LLM?

Authors: Zhuofeng Wu, He Bai, Aonan Zhang, Jiatao Gu, VG Vinod Vydiswaran, Navdeep Jaitly, Yizhe Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15000
Pdf URL: https://arxiv.org/pdf/2402.15000
Copy Paste: [[2402.15000]] Divide-or-Conquer? Which Part Should You Distill Your LLM?(https://arxiv.org/abs/2402.15000)
Keywords: language model, llm
Abstract: Recent methods have demonstrated that Large Language Models (LLMs) can solve reasoning tasks better when they are encouraged to solve subtasks of the main task first. In this paper we devise a similar strategy that breaks down reasoning tasks into a problem decomposition phase and a problem solving phase and show that the strategy is able to outperform a single stage solution. Further, we hypothesize that the decomposition should be easier to distill into a smaller model compared to the problem solving because the latter requires large amounts of domain knowledge while the former only requires learning general problem solving strategies. We propose methods to distill these two capabilities and evaluate their impact on reasoning outcomes and inference cost. We find that we can distill the problem decomposition phase and at the same time achieve good generalization across tasks, datasets, and models. However, it is harder to distill the problem solving capability without losing performance and the resulting distilled model struggles with generalization. These results indicate that by using smaller, distilled problem decomposition models in combination with problem solving LLMs we can achieve reasoning with cost-efficient inference and local adaptation.
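
The two-phase strategy in this abstract can be sketched as a pipeline in which a (possibly small, distilled) decomposer produces subquestions and a separate solver answers the original question with their help. The type aliases, function names, and toy stand-ins below are all hypothetical; in practice each callable would wrap a model call.

```python
from typing import Callable

# Hypothetical interfaces: a decomposer maps a question to subquestions,
# and a solver answers the question given those subquestions.
Decomposer = Callable[[str], list[str]]
Solver = Callable[[str, list[str]], str]


def solve_in_two_stages(question: str, decompose: Decomposer, solve: Solver) -> str:
    """Run problem decomposition first, then problem solving."""
    subquestions = decompose(question)
    return solve(question, subquestions)


def toy_decompose(question: str) -> list[str]:
    """Toy stand-in: split a compound question on ' and '."""
    return [part.strip() + "?" for part in question.rstrip("?").split(" and ")]


def toy_solve(question: str, subquestions: list[str]) -> str:
    """Toy stand-in: report how many subquestions were handled."""
    return f"answered {len(subquestions)} subquestions"


if __name__ == "__main__":
    print(solve_in_two_stages("What is A and what is B?", toy_decompose, toy_solve))
```

Keeping the two stages behind separate callables makes it straightforward to swap a smaller model into the decomposition slot while leaving the solver unchanged.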

Title: How Important Is Tokenization in French Medical Masked Language Models?

Authors: Yanis Labrak, Adrien Bazoge, Beatrice Daille, Mickael Rouvier, Richard Dufour
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15010
Pdf URL: https://arxiv.org/pdf/2402.15010
Copy Paste: [[2402.15010]] How Important Is Tokenization in French Medical Masked Language Models?(https://arxiv.org/abs/2402.15010)
Keywords: language model
Abstract: Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair Encoding (BPE) and was later followed by the adoption of SentencePiece and WordPiece. While subword tokenization consistently outperforms character and word-level tokenization, the precise factors contributing to its success remain unclear. Key aspects such as the optimal segmentation granularity for diverse tasks and languages, the influence of data sources on tokenizers, and the role of morphological information in Indo-European languages remain insufficiently explored. This is particularly pertinent for biomedical terminology, characterized by specific rules governing morpheme combinations. Despite the agglutinative nature of biomedical terminology, existing language models do not explicitly incorporate this knowledge, leading to inconsistent tokenization strategies for common terms. In this paper, we seek to delve into the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks and pinpoint areas where further enhancements can be made. We analyze classical tokenization algorithms, including BPE and SentencePiece, and introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.

Title: Unintended Impacts of LLM Alignment on Global Representation

Authors: Michael J. Ryan, William Held, Diyi Yang
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15018
Pdf URL: https://arxiv.org/pdf/2402.15018
Copy Paste: [[2402.15018]] Unintended Impacts of LLM Alignment on Global Representation(https://arxiv.org/abs/2402.15018)
Keywords: language model, llm
Abstract: Before Large Language Models (LLMs) are deployed for user-facing applications, developers align them to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We find alignment improves capabilities in several languages. We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning.

Title: Probabilistically-sound beam search with masked language models

Authors: Charlie Cowen-Breen, Creston Brooks, Robert Calef, Anna Sappington
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.15020
Pdf URL: https://arxiv.org/pdf/2402.15020
Copy Paste: [[2402.15020]] Probabilistically-sound beam search with masked language models(https://arxiv.org/abs/2402.15020)
Keywords: language model
Abstract: Beam search with masked language models (MLMs) is challenging in part because joint probability distributions over sequences are not readily available, unlike for autoregressive models. Nevertheless, estimating such distributions has applications in many domains, including protein engineering and ancient text restoration. We present probabilistically-sound methods for beam search with MLMs. First, we clarify the conditions under which it is theoretically sound to perform text infilling with MLMs using standard beam search. When these conditions fail, we provide a probabilistically-sound modification with no additional computational complexity and demonstrate that it is superior to the aforementioned beam search in the expected conditions. We then present empirical results comparing several infilling approaches with MLMs across several domains.

Title: KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Authors: Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Wei Ye, Jindong Wang, Xing Xie, Yue Zhang, Shikun Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15043
Pdf URL: https://arxiv.org/pdf/2402.15043
Copy Paste: [[2402.15043]] KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models(https://arxiv.org/abs/2402.15043)
Keywords: language model, llm
Abstract: Automatic evaluation methods for large language models (LLMs) are hindered by data contamination, leading to inflated assessments of their effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status instead of accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation. Starting with a question in a conventional LLM benchmark involving domain-specific knowledge, KIEval utilizes dynamically generated, multi-round, and knowledge-focused dialogues to determine whether a model's response is merely a recall of benchmark answers or demonstrates comprehension deep enough to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. We also reveal that data contamination contributes nothing to, or even harms, models' real-world applicability and understanding, and that existing contamination detection methods for LLMs can only identify contamination in pre-training but not during supervised fine-tuning.

Title: CARBD-Ko: A Contextually Annotated Review Benchmark Dataset for Aspect-Level Sentiment Classification in Korean

Authors: Dongjun Jang, Jean Seo, Sungjoo Byun, Taekyoung Kim, Minseok Kim, Hyopil Shin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15046
Pdf URL: https://arxiv.org/pdf/2402.15046
Copy Paste: [[2402.15046]] CARBD-Ko: A Contextually Annotated Review Benchmark Dataset for Aspect-Level Sentiment Classification in Korean(https://arxiv.org/abs/2402.15046)
Keywords: language model, hallucination
Abstract: This paper explores the challenges posed by aspect-based sentiment classification (ABSC) within pretrained language models (PLMs), with a particular focus on contextualization and hallucination issues. In order to tackle these challenges, we introduce CARBD-Ko (a Contextually Annotated Review Benchmark Dataset for Aspect-Based Sentiment Classification in Korean), a benchmark dataset that incorporates aspects and dual-tagged polarities to distinguish between aspect-specific and aspect-agnostic sentiment classification. The dataset consists of sentences annotated with specific aspects, aspect polarity, aspect-agnostic polarity, and the intensity of aspects. To address the issue of dual-tagged aspect polarities, we propose a novel approach employing a Siamese Network. Our experimental findings highlight the inherent difficulties in accurately predicting dual-polarities and underscore the significance of contextualized sentiment analysis models. The CARBD-Ko dataset serves as a valuable resource for future research endeavors in aspect-level sentiment classification.

Title: Unlocking the Power of Large Language Models for Entity Alignment

Authors: Xuhui Jiang, Yinghan Shen, Zhichao Shi, Chengjin Xu, Wei Li, Zixuan Li, Jian Guo, Huawei Shen, Yuanzhuo Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15048
Pdf URL: https://arxiv.org/pdf/2402.15048
Copy Paste: [[2402.15048]] Unlocking the Power of Large Language Models for Entity Alignment(https://arxiv.org/abs/2402.15048)
Keywords: language model, llm, chat
Abstract: Entity Alignment (EA) is vital for integrating diverse knowledge graph (KG) data, playing a crucial role in data-driven AI applications. Traditional EA methods primarily rely on comparing entity embeddings, but their effectiveness is constrained by the limited input KG data and the capabilities of the representation learning techniques. Against this backdrop, we introduce ChatEA, an innovative framework that incorporates large language models (LLMs) to improve EA. To address the constraints of limited input KG data, ChatEA introduces a KG-code translation module that translates KG structures into a format understandable by LLMs, thereby allowing LLMs to utilize their extensive background knowledge to improve EA accuracy. To overcome the over-reliance on entity embedding comparisons, ChatEA implements a two-stage EA strategy that capitalizes on LLMs' capability for multi-step reasoning in a dialogue format, thereby enhancing accuracy while preserving efficiency. Our experimental results affirm ChatEA's superior performance, highlighting LLMs' potential in facilitating EA tasks.

Title: ToMBench: Benchmarking Theory of Mind in Large Language Models

Authors: Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, Minlie Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15052
Pdf URL: https://arxiv.org/pdf/2402.15052
Copy Paste: [[2402.15052]] ToMBench: Benchmarking Theory of Mind in Large Language Models(https://arxiv.org/abs/2402.15052)
Keywords: language model, gpt, llm
Abstract: Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether large language models (LLMs) exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag behind human performance by over 10 percentage points, indicating that LLMs have not achieved a human-level theory of mind yet. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs' ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.

Title: Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Authors: Clement Neo, Shay B. Cohen, Fazl Barez
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15055
Pdf URL: https://arxiv.org/pdf/2402.15055
Copy Paste: [[2402.15055]] Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions(https://arxiv.org/abs/2402.15055)
Keywords: gpt, llm, prompt
Abstract: In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the Multilayer Perceptron that predict specific tokens. By prompting an LLM like GPT-4 to explain these model internals, we can elucidate attention mechanisms that activate certain next-token neurons. Our analysis identifies attention heads that recognize contexts relevant to predicting a particular token, activating the associated neuron through the residual connection. We focus specifically on heads in earlier layers consistently activating the same next-token neuron across similar prompts. Exploring these differential activation patterns reveals that heads that specialize for distinct linguistic contexts are tied to generating certain tokens. Overall, our method combines neural explanations and probing isolated components to illuminate how attention enables context-dependent, specialized processing in LLMs.

Title: On the Multi-turn Instruction Following for Conversational Web Agents

Authors: Yang Deng, Xuan Zhang, Wenxuan Zhang, Yifei Yuan, See-Kiong Ng, Tat-Seng Chua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15057
Pdf URL: https://arxiv.org/pdf/2402.15057
Copy Paste: [[2402.15057]] On the Multi-turn Instruction Following for Conversational Web Agents(https://arxiv.org/abs/2402.15057)
Keywords: language model, llm, agent
Abstract: Web agents powered by Large Language Models (LLMs) have demonstrated remarkable abilities in planning and executing multi-step interactions within complex web-based environments, fulfilling a wide range of web navigation tasks. Despite these advancements, the potential for LLM-powered agents to effectively engage with sequential user instructions in real-world scenarios has not been fully explored. In this work, we introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment, supported by a specially developed dataset named Multi-Turn Mind2Web (MT-Mind2Web). To tackle the limited context length of LLMs and the context-dependency issue of the conversational tasks, we further propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques. Extensive experiments are conducted to benchmark the MT-Mind2Web dataset, and validate the effectiveness of the proposed method.

Title: ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval

Authors: Antoine Louis, Vageesh Saxena, Gijs van Dijck, Gerasimos Spanakis
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2402.15059
Pdf URL: https://arxiv.org/pdf/2402.15059
Copy Paste: [[2402.15059]] ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval(https://arxiv.org/abs/2402.15059)
Keywords: language model
Abstract: State-of-the-art neural retrievers predominantly focus on high-resource languages like English, which impedes their adoption in retrieval scenarios involving other languages. Current approaches circumvent the lack of high-quality labeled data in non-English languages by leveraging multilingual pretrained language models capable of cross-lingual transfer. However, these models require substantial task-specific fine-tuning across multiple languages, often perform poorly in languages with minimal representation in the pretraining corpus, and struggle to incorporate new languages after the pretraining phase. In this work, we present a novel modular dense retrieval model that learns from the rich data of a single high-resource language and effectively zero-shot transfers to a wide array of languages, thereby eliminating the need for language-specific labeled data. Our model, ColBERT-XM, demonstrates competitive performance against existing state-of-the-art multilingual retrievers trained on more extensive datasets in various languages. Further analysis reveals that our modular approach is highly data-efficient, effectively adapts to out-of-distribution data, and significantly reduces energy consumption and carbon emissions. By demonstrating its proficiency in zero-shot scenarios, ColBERT-XM marks a shift towards more sustainable and inclusive retrieval systems, enabling effective information accessibility in numerous languages. We publicly release our code and models for the community.

Title: Fine-tuning Large Language Models for Domain-specific Machine Translation

Authors: Jiawei Zheng, Hanghai Hong, Xiaoli Wang, Jingsong Su, Yonggui Liang, Shikai Wu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15061
Pdf URL: https://arxiv.org/pdf/2402.15061
Copy Paste: [[2402.15061]] Fine-tuning Large Language Models for Domain-specific Machine Translation(https://arxiv.org/abs/2402.15061)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have made significant progress in machine translation (MT). However, their potential in domain-specific MT remains under-explored. Current LLM-based MT systems still face several challenges. First, for LLMs with in-context learning, their effectiveness is highly sensitive to input translation examples, and processing them can increase inference costs. They often require extra post-processing due to over-generation. Second, LLMs with fine-tuning on domain-specific data often require high training costs for domain adaptation, and may weaken the zero-shot MT capabilities of LLMs due to over-specialization. The aforementioned methods can struggle to translate rare words in domain transfer scenarios. To address these challenges, this paper proposes a prompt-oriented fine-tuning method, denoted as LlamaIT, to effectively and efficiently fine-tune a general-purpose LLM for domain-specific MT tasks. First, we construct a task-specific mix-domain dataset, which is then used to fine-tune the LLM with LoRA. This can eliminate the need for input translation examples, post-processing, or over-specialization. By zero-shot prompting with instructions, we adapt the MT tasks to the target domain at inference time. To further elicit the MT capability for rare words, we construct new prompts by incorporating domain-specific bilingual vocabulary. We also conduct extensive experiments on both publicly available and self-constructed datasets. The results show that LlamaIT can significantly enhance the domain-specific MT capabilities of the LLM while preserving its zero-shot MT capabilities.

Title: Gotcha! Don't trick me with unanswerable questions! Self-aligning Large Language Models for Responding to Unknown Questions

Authors: Yang Deng, Yong Zhao, Moxin Li, See-Kiong Ng, Tat-Seng Chua
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15062
Pdf URL: https://arxiv.org/pdf/2402.15062
Copy Paste: [[2402.15062]] Gotcha! Don't trick me with unanswerable questions! Self-aligning Large Language Models for Responding to Unknown Questions(https://arxiv.org/abs/2402.15062)
Keywords: language model, llm
Abstract: Despite the remarkable abilities of Large Language Models (LLMs) to answer questions, they often display a considerable level of overconfidence even when the question does not have a definitive answer. To avoid providing hallucinated answers to these unknown questions, existing studies typically investigate approaches to refusing to answer these questions. In this work, we propose a novel and scalable self-alignment method that uses the LLM itself to enhance its ability to respond to different types of unknown questions, so that it can not only refuse to answer but also explain why an unknown question is unanswerable. Specifically, the Self-Align method first employs a two-stage class-aware self-augmentation approach to generate a large amount of unknown question-response data. Then we conduct disparity-driven self-curation to select qualified data for fine-tuning the LLM itself, aligning its responses to unknown questions as desired. Experimental results on two datasets across four types of unknown questions validate the superiority of the Self-Align method over existing baselines in terms of three types of task formulation.

Title: Infusing Hierarchical Guidance into Prompt Tuning: A Parameter-Efficient Framework for Multi-level Implicit Discourse Relation Recognition

Authors: Haodong Zhao, Ruifang He, Mengnan Xiao, Jing Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15080
Pdf URL: https://arxiv.org/pdf/2402.15080
Copy Paste: [[2402.15080]] Infusing Hierarchical Guidance into Prompt Tuning: A Parameter-Efficient Framework for Multi-level Implicit Discourse Relation Recognition(https://arxiv.org/abs/2402.15080)
Keywords: prompt
Abstract: Multi-level implicit discourse relation recognition (MIDRR) aims at identifying hierarchical discourse relations among arguments. Previous methods achieve improvements by fine-tuning PLMs. However, due to data scarcity and the task gap, the pre-trained feature space cannot be accurately tuned to the task-specific space, which even aggravates the collapse of the vanilla space. Besides, the need to comprehend hierarchical semantics in MIDRR makes the conversion much harder. In this paper, we propose a prompt-based Parameter-Efficient Multi-level IDRR (PEMI) framework to solve the above problems. First, we leverage parameter-efficient prompt tuning to drive the input arguments to match the pre-trained space and realize the approximation with few parameters. Furthermore, we propose a hierarchical label refining (HLR) method for the prompt verbalizer to deeply integrate hierarchical guidance into the prompt tuning. Finally, our model achieves comparable results on PDTB 2.0 and 3.0 using about 0.1% trainable parameters compared with baselines, and the visualization demonstrates the effectiveness of our HLR method.

Title: PEMT: Multi-Task Correlation Guided Mixture-of-Experts Enables Parameter-Efficient Transfer Learning

Authors: Zhisheng Lin, Han Fu, Chenghao Liu, Zhuo Li, Jianling Sun
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15082
Pdf URL: https://arxiv.org/pdf/2402.15082
Copy Paste: [[2402.15082]] PEMT: Multi-Task Correlation Guided Mixture-of-Experts Enables Parameter-Efficient Transfer Learning(https://arxiv.org/abs/2402.15082)
Keywords: language model, prompt
Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as an effective method for adapting pre-trained language models to various tasks efficiently. Recently, there has been a growing interest in transferring knowledge from one or multiple tasks to the downstream target task to achieve performance improvements. However, current approaches typically either train adapters on individual tasks or distill shared knowledge from source tasks, failing to fully exploit task-specific knowledge and the correlation between source and target tasks. To overcome these limitations, we propose PEMT, a novel parameter-efficient fine-tuning framework based on multi-task transfer learning. PEMT extends the mixture-of-experts (MoE) framework to capture the transferable knowledge as a weighted combination of adapters trained on source tasks. These weights are determined by a gated unit, measuring the correlation between the target and each source task using task description prompt vectors. To fully exploit the task-specific knowledge, we also propose the Task Sparsity Loss to improve the sparsity of the gated unit. We conduct experiments on a broad range of tasks over 17 datasets. The experimental results demonstrate our PEMT yields stable improvements over full fine-tuning, and state-of-the-art PEFT and knowledge transferring methods on various tasks. The results highlight the effectiveness of our method which is capable of sufficiently exploiting the knowledge and correlation features across multiple tasks.
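The gated weighting described in this abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption and not the authors' implementation: the dot-product similarity, the softmax normalization, and the names `gated_combination`, `target_prompt`, `source_prompts`, and `adapter_outputs` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gated_combination(target_prompt, source_prompts, adapter_outputs):
    """Mix source-task adapter outputs with weights from a gate.

    Assumption: the gate scores each source task by the dot product of
    its prompt vector with the target prompt vector, then normalizes
    the scores with a softmax. The abstract only says the gate measures
    task correlation from prompt vectors; the exact form is not given.
    """
    weights = softmax([dot(target_prompt, p) for p in source_prompts])
    dim = len(adapter_outputs[0])
    mixed = [
        sum(w * out[d] for w, out in zip(weights, adapter_outputs))
        for d in range(dim)
    ]
    return weights, mixed
```

A sparsity objective such as the Task Sparsity Loss mentioned in the abstract would push these weights toward a few dominant source tasks; that part is not sketched here.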

Title: AttributionBench: How Hard is Automatic Attribution Evaluation?

Authors: Yifei Li, Xiang Yue, Zeyi Liao, Huan Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15089
Pdf URL: https://arxiv.org/pdf/2402.15089
Copy Paste: [[2402.15089]] AttributionBench: How Hard is Automatic Attribution Evaluation?(https://arxiv.org/abs/2402.15089)
Keywords: language model, gpt, llm
Abstract: Modern generative search engines enhance the reliability of large language model (LLM) responses by providing cited evidence. However, evaluating the answer's attribution, i.e., whether every claim within the generated responses is fully supported by its cited evidence, remains an open problem. This verification, traditionally dependent on costly human evaluation, underscores the urgent need for automatic attribution evaluation methods. To bridge the gap in the absence of standardized benchmarks for these methods, we present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets. Our extensive experiments on AttributionBench reveal the challenges of automatic attribution evaluation, even for state-of-the-art LLMs. Specifically, our findings show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation. A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information, and from the discrepancy between the information the model has access to and the information available to human annotators.
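For readers unfamiliar with the metric quoted above, macro-F1 is the unweighted mean of the per-class F1 scores. The sketch below is a generic illustration of that definition under a binary formulation; the function name and the 0/1 label encoding are assumptions and are not taken from the AttributionBench code.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores.

    F1 for one class is 2*TP / (2*TP + FP + FN); a class with no true
    or predicted instances is scored 0.0 here (an assumed convention).
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because each class counts equally, a classifier that does well only on the majority class is penalized more than it would be under plain accuracy.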

Title: Trajectory-wise Iterative Reinforcement Learning Framework for Auto-bidding

Authors: Haoming Li, Yusen Huo, Shuai Dou, Zhenzhe Zheng, Zhilin Zhang, Chuan Yu, Jian Xu, Fan Wu
Subjects: cs.LG, cs.AI, cs.GT, cs.IR
Abstract URL: https://arxiv.org/abs/2402.15102
Pdf URL: https://arxiv.org/pdf/2402.15102
Copy Paste: [[2402.15102]] Trajectory-wise Iterative Reinforcement Learning Framework for Auto-bidding(https://arxiv.org/abs/2402.15102)
Keywords: agent
Abstract: In online advertising, advertisers participate in ad auctions to acquire ad opportunities, often by utilizing auto-bidding tools provided by demand-side platforms (DSPs). The current auto-bidding algorithms typically employ reinforcement learning (RL). However, due to safety concerns, most RL-based auto-bidding policies are trained in simulation, leading to a performance degradation when deployed in online environments. To narrow this gap, we can deploy multiple auto-bidding agents in parallel to collect a large interaction dataset. Offline RL algorithms can then be utilized to train a new policy. The trained policy can subsequently be deployed for further data collection, resulting in an iterative training framework, which we refer to as iterative offline RL. In this work, we identify the performance bottleneck of this iterative offline RL framework, which originates from the ineffective exploration and exploitation caused by the inherent conservatism of offline RL algorithms. To overcome this bottleneck, we propose Trajectory-wise Exploration and Exploitation (TEE), which introduces a novel data collecting and data utilization method for iterative offline RL from a trajectory perspective. Furthermore, to ensure the safety of online exploration while preserving the dataset quality for TEE, we propose Safe Exploration by Adaptive Action Selection (SEAS). Both offline experiments and real-world experiments on the Alibaba display advertising platform demonstrate the effectiveness of our proposed method.

Title: Multi-Armed Bandits with Abstention

Authors: Junwen Yang, Tianyuan Jin, Vincent Y. F. Tan
Subjects: cs.LG, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2402.15127
Pdf URL: https://arxiv.org/pdf/2402.15127
Copy Paste: [[2402.15127]] Multi-Armed Bandits with Abstention(https://arxiv.org/abs/2402.15127)
Keywords: agent
Abstract: We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention. In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the stochastic instantaneous reward before observing it. When opting for abstention, the agent either suffers a fixed regret or gains a guaranteed reward. Given this added layer of complexity, we ask whether we can develop efficient algorithms that are both asymptotically and minimax optimal. We answer this question affirmatively by designing and analyzing algorithms whose regrets meet their corresponding information-theoretic lower bounds. Our results offer valuable quantitative insights into the benefits of the abstention option, laying the groundwork for further exploration in other online decision-making problems with such an option. Numerical results further corroborate our theoretical findings.
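The two abstention variants named in this abstract can be made concrete with a small sketch of per-round expected regret. The definitions below are assumptions made for illustration and are not copied from the paper: in the assumed "fixed_regret" mode abstaining always costs `c`, and in the assumed "fixed_reward" mode abstaining pays `c` and regret is measured against the better of the best arm mean and `c`. The function name and parameters are hypothetical.

```python
def expected_round_regret(means, arm, abstain, c, mode):
    """Expected regret of one round in a bandit with abstention.

    means   : true mean reward of each arm (unknown to a real agent)
    arm     : index of the arm the agent selected
    abstain : True if the agent declines the reward before seeing it
    c       : abstention parameter (fixed regret or guaranteed reward)
    mode    : "fixed_regret" or "fixed_reward" (assumed definitions)
    """
    best = max(means)
    if mode == "fixed_regret":
        # Abstaining replaces the arm's gap with a constant cost c.
        return c if abstain else best - means[arm]
    if mode == "fixed_reward":
        # Abstaining pays c for sure; compare with the best achievable.
        benchmark = max(best, c)
        return benchmark - (c if abstain else means[arm])
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumed definitions, abstaining helps in the fixed-regret variant whenever the selected arm's gap to the best arm exceeds `c`, which is one way to see why exploration can become cheaper with the abstention option.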

Title: Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models

Authors: Guanming Xiong, Junwei Bao, Wen Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15131
Pdf URL: https://arxiv.org/pdf/2402.15131
Copy Paste: [[2402.15131]] Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models(https://arxiv.org/abs/2402.15131)
Keywords: language model, llm
Abstract: This study explores the realm of knowledge-base question answering (KBQA). KBQA is considered a challenging task, particularly in parsing intricate questions into executable logical forms. Traditional semantic parsing (SP)-based methods require extensive data annotations, which result in significant costs. Recently, the advent of few-shot in-context learning, powered by large language models (LLMs), has showcased promising capabilities. Yet, fully leveraging LLMs to parse questions into logical forms in low-resource scenarios poses a substantial challenge. To tackle these hurdles, we introduce Interactive-KBQA, a framework designed to generate logical forms through direct interaction with knowledge bases (KBs). Within this framework, we have developed three generic APIs for KB interaction. For each category of complex question, we devised exemplars to guide LLMs through the reasoning processes. Our method achieves competitive results on the WebQuestionsSP, ComplexWebQuestions, KQA Pro, and MetaQA datasets with a minimal number of examples (shots). Importantly, our approach supports manual intervention, allowing for the iterative refinement of LLM outputs. By annotating a dataset with step-wise reasoning processes, we showcase our model's adaptability and highlight its potential for contributing significant enhancements to the field.

Title: Improving Sentence Embeddings with an Automatically Generated NLI Dataset

Authors: Soma Sato, Hayato Tsukagoshi, Ryohei Sasano, Koichi Takeda
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15132
Pdf URL: https://arxiv.org/pdf/2402.15132
Copy Paste: [[2402.15132]] Improving Sentence Embeddings with an Automatically Generated NLI Dataset(https://arxiv.org/abs/2402.15132)
Keywords: language model, llm, prompt
Abstract: Decoder-based large language models (LLMs) have shown high performance on many tasks in natural language processing. This is also true for sentence embedding learning, where a decoder-based model, PromptEOL, has achieved the best performance on semantic textual similarity (STS) tasks. However, PromptEOL makes great use of fine-tuning with a manually annotated natural language inference (NLI) dataset. We aim to improve sentence embeddings learned in an unsupervised setting by automatically generating an NLI dataset with an LLM and using it to fine-tune PromptEOL. In experiments on STS tasks, the proposed method achieved an average Spearman's rank correlation coefficient of 82.21 with respect to human evaluation, thus outperforming existing methods without using large, manually annotated datasets.
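The evaluation metric mentioned above, Spearman's rank correlation, is the Pearson correlation computed on ranks. The sketch below is a generic pure-Python illustration with average ranks for ties; the function names are hypothetical, and the note that STS papers usually report the coefficient multiplied by 100 (so 82.21 would correspond to roughly 0.8221) is an assumption about convention, not something the abstract states.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho between model similarity scores and human scores."""
    return pearson(average_ranks(x), average_ranks(y))
```

In an STS setting, `x` would hold the cosine similarities of sentence-embedding pairs and `y` the human similarity ratings; only the ordering of each list matters, not its scale.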

Title: Self-Adaptive Reconstruction with Contrastive Learning for Unsupervised Sentence Embeddings

Authors: Junlong Liu, Xichen Shang, Huawen Feng, Junhao Zheng, Qianli Ma
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15153
Pdf URL: https://arxiv.org/pdf/2402.15153
Copy Paste: [[2402.15153]] Self-Adaptive Reconstruction with Contrastive Learning for Unsupervised Sentence Embeddings(https://arxiv.org/abs/2402.15153)
Keywords: language model
Abstract: The unsupervised sentence embedding task aims to convert sentences to semantic vector representations. Most previous works directly use the sentence representations derived from pretrained language models. However, due to the token bias in pretrained language models, the models cannot capture the fine-grained semantics in sentences, which leads to poor predictions. To address this issue, we propose a novel Self-Adaptive Reconstruction Contrastive Sentence Embeddings (SARCSE) framework, which reconstructs all tokens in sentences with an AutoEncoder to help the model preserve more fine-grained semantics during token aggregation. In addition, we propose a self-adaptive reconstruction loss to alleviate the token bias towards frequency. Experimental results show that SARCSE gains significant improvements compared with the strong baseline SimCSE on the 7 STS tasks.

Title: Machine Unlearning of Pre-trained Large Language Models

Authors: Jin Yao, Eli Chien, Minxin Du, Xinyao Niu, Tianhao Wang, Zezhou Cheng, Xiang Yue
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15159
Pdf URL: https://arxiv.org/pdf/2402.15159
Copy Paste: [[2402.15159]] Machine Unlearning of Pre-trained Large Language Models(https://arxiv.org/abs/2402.15159)
Keywords: language model, llm
Abstract: This study investigates the concept of the `right to be forgotten' within the context of large language models (LLMs). We explore machine unlearning as a pivotal solution, with a focus on pre-trained models--a notably under-researched area. Our research delineates a comprehensive framework for machine unlearning in pre-trained LLMs, encompassing a critical analysis of seven diverse unlearning methods. Through rigorous evaluation using curated datasets from arXiv, books, and GitHub, we establish a robust benchmark for unlearning performance, demonstrating that these methods are over $10^5$ times more computationally efficient than retraining. Our results show that integrating gradient ascent with gradient descent on in-distribution data improves hyperparameter robustness. We also provide detailed guidelines for efficient hyperparameter tuning in the unlearning process. Our findings advance the discourse on ethical AI practices, offering substantive insights into the mechanics of machine unlearning for pre-trained LLMs and underscoring the potential for responsible AI development.
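The abstract above mentions combining gradient ascent with gradient descent on in-distribution data. The following is a minimal sketch of that idea on a toy problem, not the paper's implementation: the one-parameter model, the data, the `ascent_weight` parameter, and the step size are all hypothetical choices made for illustration.

```python
# Toy illustration only: a one-parameter linear model y = w * x with squared loss.
# The model, data, and hyperparameters are hypothetical, not taken from the paper.

def grad_sq_loss(w, batch):
    """Gradient of the mean squared error of y = w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def mse(w, batch):
    """Mean squared error of y = w * x over (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def unlearning_step(w, forget_batch, retain_batch, lr=0.01, ascent_weight=0.5):
    """One combined update: ascend the loss on forget data, descend on retain data."""
    g_forget = grad_sq_loss(w, forget_batch)
    g_retain = grad_sq_loss(w, retain_batch)
    # Equivalent to minimizing (retain_loss - ascent_weight * forget_loss).
    return w - lr * (g_retain - ascent_weight * g_forget)

forget = [(1.0, 3.0), (2.0, 6.0)]                # data to unlearn (fits w = 3)
retain = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]    # in-distribution data (fits w = 2)

w_start = 3.0    # start from a model that fits the forget set exactly
w = w_start
for _ in range(50):
    w = unlearning_step(w, forget, retain)

print(f"forget loss: {mse(w_start, forget):.3f} -> {mse(w, forget):.3f}")
print(f"retain loss: {mse(w_start, retain):.3f} -> {mse(w, retain):.3f}")
```

In this toy setting the descent term keeps the update anchored to the retained data while the ascent term pushes the loss on the forget set upward.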

Title: Spatially-Aware Transformer Memory for Embodied Agents

Authors: Junmo Cho, Jaesik Yoon, Sungjin Ahn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15160
Pdf URL: https://arxiv.org/pdf/2402.15160
Copy Paste: [[2402.15160]] Spatially-Aware Transformer Memory for Embodied Agents(https://arxiv.org/abs/2402.15160)
Keywords: agent
Abstract: Episodic memory plays a crucial role in various cognitive processes, such as the ability to mentally recall past events. While cognitive science emphasizes the significance of spatial context in the formation and retrieval of episodic memory, the current primary approach to implementing episodic memory in AI systems is through transformers that store temporally ordered experiences, which overlooks the spatial dimension. As a result, it is unclear how the underlying structure could be extended to incorporate the spatial axis beyond temporal order alone and thereby what benefits can be obtained. To address this, this paper explores the use of Spatially-Aware Transformer models that incorporate spatial information. These models enable the creation of place-centric episodic memory that considers both temporal and spatial dimensions. Adopting this approach, we demonstrate that memory utilization efficiency can be improved, leading to enhanced accuracy in various place-centric downstream tasks. Additionally, we propose the Adaptive Memory Allocator, a memory management method based on reinforcement learning that aims to optimize efficiency of memory utilization. Our experiments demonstrate the advantages of our proposed model in various environments and across multiple downstream tasks, including prediction, generation, reasoning, and reinforcement learning. The source code for our models and experiments will be available at https://github.com/junmokane/spatially-aware-transformer.

Title: Entity-level Factual Adaptiveness of Fine-tuning based Abstractive Summarization Models

Authors: Jongyoon Song, Nohil Park, Bongkyu Hwang, Jaewoong Yun, Seongho Joe, Youngjune L. Gwon, Sungroh Yoon
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15162
Pdf URL: https://arxiv.org/pdf/2402.15162
Copy Paste: [[2402.15162]] Entity-level Factual Adaptiveness of Fine-tuning based Abstractive Summarization Models(https://arxiv.org/abs/2402.15162)
Keywords: language model
Abstract: Abstractive summarization models often generate factually inconsistent content, particularly when the parametric knowledge of the model conflicts with the knowledge in the input document. In this paper, we analyze the robustness of fine-tuning based summarization models to the knowledge conflict, which we call factual adaptiveness. We utilize pre-trained language models to construct evaluation sets and find that factual adaptiveness is not strongly correlated with factual consistency on original datasets. Furthermore, we introduce a controllable counterfactual data augmentation method in which the degree of knowledge conflict within the augmented data can be adjusted. Our experimental results on two pre-trained language models (PEGASUS and BART) and two fine-tuning datasets (XSum and CNN/DailyMail) demonstrate that our method enhances factual adaptiveness while achieving factual consistency on original datasets on par with the contrastive learning baseline.

Title: Second-Order Fine-Tuning without Pain for LLMs:A Hessian Informed Zeroth-Order Optimizer

Authors: Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, Ivor W.Tsang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.15173
Pdf URL: https://arxiv.org/pdf/2402.15173
Copy Paste: [[2402.15173]] Second-Order Fine-Tuning without Pain for LLMs:A Hessian Informed Zeroth-Order Optimizer(https://arxiv.org/abs/2402.15173)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory due to the backpropagation process. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we propose HiZOO, a diagonal Hessian informed zeroth-order optimizer, which is the first work to leverage the diagonal Hessian to enhance zeroth-order optimizers for fine-tuning LLMs. In addition, HiZOO avoids expensive memory cost and adds only one forward pass per step. Extensive experiments on various models (350M to 66B parameters) indicate that HiZOO improves model convergence, significantly reducing training steps and effectively enhancing model accuracy. Moreover, we visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8.
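To make the "two forward passes" remark concrete, here is a minimal sketch of a generic two-point zeroth-order gradient estimate on a quadratic test function with heterogeneous curvature. It is a plain zeroth-order SGD baseline, not HiZOO: the Hessian-informed scaling is not reproduced, and the function names, test loss, and hyperparameters are assumptions made for illustration.

```python
import random

def loss(theta, curvature):
    """Quadratic test loss 0.5 * sum(c_i * theta_i^2); stands in for a forward pass."""
    return 0.5 * sum(c * t * t for c, t in zip(curvature, theta))

def zo_gradient(theta, curvature, eps, rng):
    """Two-point zeroth-order gradient estimate using exactly two loss evaluations."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]      # random perturbation direction
    plus = loss([t + eps * zi for t, zi in zip(theta, z)], curvature)
    minus = loss([t - eps * zi for t, zi in zip(theta, z)], curvature)
    slope = (plus - minus) / (2.0 * eps)          # directional derivative along z
    return [slope * zi for zi in z]

def zo_sgd(theta, curvature, lr=0.01, eps=1e-3, steps=500, seed=0):
    """Plain zeroth-order SGD; no backpropagation and no stored gradients."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        g = zo_gradient(theta, curvature, eps, rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

curvature = [1.0, 10.0]     # deliberately different curvature per dimension
theta0 = [1.0, 1.0]
theta = zo_sgd(theta0, curvature)
print("initial loss:", loss(theta0, curvature))
print("final loss:  ", loss(theta, curvature))
```

With a single step size shared by all coordinates, the low-curvature dimension converges much more slowly than the high-curvature one, which is the heterogeneity problem the abstract describes.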

Title: Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition

Authors: Yufei Huang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.15175
Pdf URL: https://arxiv.org/pdf/2402.15175
Copy Paste: [[2402.15175]] Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition(https://arxiv.org/abs/2402.15175)
Keywords: language model
Abstract: Recent studies have uncovered intriguing phenomena in deep learning, such as grokking, double descent, and emergent abilities in large language models, which challenge human intuition and are crucial for a deeper understanding of neural models. In this paper, we present a comprehensive framework that provides a unified view of these three phenomena, focusing on the competition between memorization and generalization circuits. This approach, initially employed to explain grokking, is extended in our work to encompass a wider range of model sizes and training data volumes. Our framework delineates four distinct training dynamics, each depending on varying combinations of model size and training data quantity. Utilizing this framework, we provide a detailed analysis of the double descent phenomenon and propose two verifiable predictions regarding its occurrence, both substantiated by our experimental results. Moreover, we expand our framework to the multi-task learning paradigm, demonstrating how algorithmic tasks can be turned into emergent abilities. This offers a novel perspective for understanding emergent abilities in Large Language Models.

Title: Advancing Parameter Efficiency in Fine-tuning via Representation Editing

Authors: Muling Wu, Wenhao Liu, Xiaohua Wang, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.15179
Pdf URL: https://arxiv.org/pdf/2402.15179
Copy Paste: [[2402.15179]] Advancing Parameter Efficiency in Fine-tuning via Representation Editing(https://arxiv.org/abs/2402.15179)
Keywords: gpt, prompt
Abstract: Parameter Efficient Fine-Tuning (PEFT) has gained significant attention for its ability to achieve competitive results while updating only a small subset of trainable parameters. Despite the promising performance of current PEFT methods, they present challenges in hyperparameter selection, such as determining the rank of LoRA or Adapter, or specifying the length of soft prompts. In addressing these challenges, we propose a novel approach to fine-tuning neural models, termed Representation EDiting (RED), which scales and biases the representation produced at each layer. RED substantially reduces the number of trainable parameters by a factor of 25,700 compared to full parameter fine-tuning, and by a factor of 32 compared to LoRA. Remarkably, RED achieves comparable or superior results to full parameter fine-tuning and other PEFT methods. Extensive experiments were conducted across models of varying architectures and scales, including RoBERTa, GPT-2, T5, and Llama-2, and the results demonstrate the efficiency and efficacy of RED, positioning it as a promising PEFT approach for large neural models.
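The abstract describes RED as scaling and biasing the representation produced at each layer. A minimal sketch of such an element-wise edit follows, assuming one scale vector and one bias vector per layer; the class name, the identity initialization, and the toy dimensions are hypothetical and are not taken from the paper.

```python
class RepresentationEdit:
    """Element-wise edit of one layer's hidden vector: h' = scale * h + bias."""

    def __init__(self, hidden_size):
        # Identity initialization is an assumption made for this sketch.
        self.scale = [1.0] * hidden_size
        self.bias = [0.0] * hidden_size

    def __call__(self, hidden):
        return [s * h + b for s, h, b in zip(self.scale, hidden, self.bias)]

    def num_trainable(self):
        return len(self.scale) + len(self.bias)

hidden_size, num_layers = 8, 4      # hypothetical toy dimensions
edits = [RepresentationEdit(hidden_size) for _ in range(num_layers)]

h = [0.5 * i for i in range(hidden_size)]
unchanged = edits[0](h)             # equals h while scale = 1 and bias = 0

edits[0].scale[2] = 2.0
edits[0].bias[2] = 0.5
edited = edits[0](h)                # only coordinate 2 is modified

# Two vectors per layer versus one square weight matrix per layer (for scale only).
red_params = sum(e.num_trainable() for e in edits)
dense_params = num_layers * hidden_size * hidden_size
print(red_params, dense_params)
```

The trainable parameter count grows linearly with the hidden size here, whereas a dense weight matrix grows quadratically, which gives some intuition for the large reduction factors the abstract reports.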

Title: Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

Authors: Heegyu Kim, Sehyun Yuk, Hyunsouk Cho
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2402.15180
Pdf URL: https://arxiv.org/pdf/2402.15180
Copy Paste: [[2402.15180]] Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement(https://arxiv.org/abs/2402.15180)
Keywords: language model
Abstract: Caution: This paper includes offensive words that could potentially cause unpleasantness. Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment is extensive and makes it hard to respond immediately to fast-developing attacks, such as jailbreaks. We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs, and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. Additionally, we propose a formatting method that improves the efficiency of the self-refine process while reducing attack success rates in fewer iterations. We also observe that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses. In conclusion, our method achieves lower safety risk at lower computational cost, allowing non-safety-aligned LMs to be easily used in real-world services.

Title: GraphEdit: Large Language Models for Graph Structure Learning

Authors: Zirui Guo, Lianghao Xia, Yanhua Yu, Yuling Wang, Zixuan Yang, Wei Wei, Liang Pang, Tat-Seng Chua, Chao Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15183
Pdf URL: https://arxiv.org/pdf/2402.15183
Copy Paste: [[2402.15183]] GraphEdit: Large Language Models for Graph Structure Learning(https://arxiv.org/abs/2402.15183)
Keywords: language model, llm
Abstract: Graph Structure Learning (GSL) focuses on capturing intrinsic dependencies and interactions among nodes in graph-structured data by generating novel graph structures. Graph Neural Networks (GNNs) have emerged as promising GSL solutions, utilizing recursive message passing to encode node-wise inter-dependencies. However, many existing GSL methods heavily depend on explicit graph structural information as supervision signals, leaving them susceptible to challenges such as data noise and sparsity. In this work, we propose GraphEdit, an approach that leverages large language models (LLMs) to learn complex node relationships in graph-structured data. By enhancing the reasoning capabilities of LLMs through instruction-tuning over graph structures, we aim to overcome the limitations associated with explicit graph structural information and enhance the reliability of graph structure learning. Our approach not only effectively denoises noisy connections but also identifies node-wise dependencies from a global perspective, providing a comprehensive understanding of the graph structure. We conduct extensive experiments on multiple benchmark datasets to demonstrate the effectiveness and robustness of GraphEdit across various settings. We have made our model implementation available at: https://github.com/HKUDS/GraphEdit.

Title: Biomedical Entity Linking as Multiple Choice Question Answering

Authors: Zhenxi Lin, Ziheng Zhang, Xian Wu, Yefeng Zheng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15189
Pdf URL: https://arxiv.org/pdf/2402.15189
Copy Paste: [[2402.15189]] Biomedical Entity Linking as Multiple Choice Question Answering(https://arxiv.org/abs/2402.15189)
Keywords: language model
Abstract: Although biomedical entity linking (BioEL) has made significant progress with pre-trained language models, challenges still exist for fine-grained and long-tailed entities. To address these challenges, we present BioELQA, a novel model that treats Biomedical Entity Linking as Multiple Choice Question Answering. BioELQA first obtains candidate entities with a fast retriever, jointly presents the mention and candidate entities to a generator, and then outputs the predicted symbol associated with its chosen entity. This formulation enables explicit comparison of different candidate entities, thus capturing fine-grained interactions between mentions and entities, as well as among entities themselves. To improve generalization for long-tailed entities, we retrieve similar labeled training instances as clues and concatenate the input with retrieved instances for the generator. Extensive experimental results show that BioELQA outperforms state-of-the-art baselines on several datasets.

Title: DeMPT: Decoding-enhanced Multi-phase Prompt Tuning for Making LLMs Be Better Context-aware Translators

Authors: Xinglin Lyu, Junhui Li, Yanqing Zhao, Min Zhang, Daimeng Wei, Shimin Tao, Hao Yang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15200
Pdf URL: https://arxiv.org/pdf/2402.15200
Copy Paste: [[2402.15200]] DeMPT: Decoding-enhanced Multi-phase Prompt Tuning for Making LLMs Be Better Context-aware Translators(https://arxiv.org/abs/2402.15200)
Keywords: language model, llm, prompt
Abstract: Generally, decoder-only large language models (LLMs) are adapted to context-aware neural machine translation (NMT) by concatenation, where LLMs take the concatenation of the source sentence (i.e., intra-sentence context) and the inter-sentence context as the input and then generate the target tokens sequentially. This adaptation strategy, i.e., concatenation mode, considers intra-sentence and inter-sentence contexts with the same priority, despite an apparent difference between the two kinds of contexts. In this paper, we propose an alternative adaptation approach, named Decoding-enhanced Multi-phase Prompt Tuning (DeMPT), to make LLMs discriminately model and utilize the inter- and intra-sentence context and more effectively adapt LLMs to context-aware NMT. First, DeMPT divides the context-aware NMT process into three separate phases. During each phase, different continuous prompts are introduced to make LLMs discriminately model various information. Second, DeMPT employs a heuristic way to further discriminately enhance the utilization of the source-side inter- and intra-sentence information at the final decoding phase. Experiments show that our approach significantly outperforms the concatenation method, and further improves the performance of LLMs in discourse modeling.

Title: Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models

Authors: Xin Yi, Linlin Wang, Xiaoling Wang, Liang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15202
Pdf URL: https://arxiv.org/pdf/2402.15202
Copy Paste: [[2402.15202]] Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models(https://arxiv.org/abs/2402.15202)
Keywords: language model, llm, prompt
Abstract: Impressive results have been achieved in natural language processing (NLP) tasks through the training of large language models (LLMs). However, these models occasionally produce toxic content such as insults, threats, and profanity in response to certain prompts, thereby constraining their practical utility. To tackle this issue, various finetuning-based and decoding-based approaches have been utilized to mitigate toxicity. However, these methods typically necessitate additional costs such as high-quality training data or auxiliary models. In this paper, we propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost. Specifically, FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt against multiple negative prefix-prepended prompts at the instance level. This allows for constructing fine-grained subtoxicity vectors, which enables collaborative detoxification by fusing them to correct the normal generation process when provided with a raw prompt. We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels. Our method surpasses prompt-based baselines in detoxification, although at a slight cost to generation fluency and diversity.

Title: ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Authors: Lu Ye, Ze Tao, Yong Huang, Yang Li
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.15220
Pdf URL: https://arxiv.org/pdf/2402.15220
Copy Paste: [[2402.15220]] ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition(https://arxiv.org/abs/2402.15220)
Keywords: language model, llm, prompt
Abstract: Self-attention is an essential component of large language models (LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the compute and memory operation cost of self-attention can be optimized by using the probability that multiple LLM requests have shared system prompts in prefixes. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime to improve the memory utilization of KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into the auxiliary prefix tree. Consequently, on top of the prefix-tree based KV cache, we design an efficient self-attention kernel, where a two-phase partition algorithm is implemented to improve the data locality during self-attention computation in the presence of shared system prompts. Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8× compared to the state-of-the-art implementation, with the length of the system prompt ranging from 1024 to 4096.
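The prefix-tree idea in the abstract can be illustrated with a small sketch in which token sequences are split into fixed-size chunks and requests with a common prompt prefix reuse the same chunk nodes. This is an assumption-laden illustration, not ChunkAttention's implementation: the class names are hypothetical, the key/value tensors are represented by a placeholder field, and the attention kernel and two-phase partition are not shown.

```python
class ChunkNode:
    """One chunk of the prefix tree; would own the key/value tensors for its tokens."""

    def __init__(self, tokens):
        self.tokens = tokens      # tuple of token ids covered by this chunk
        self.children = {}        # maps a child's token tuple to its node
        self.kv = None            # placeholder for the chunk's key/value tensors

class PrefixTreeKVCache:
    """Hypothetical prefix tree that stores each distinct prefix chunk only once."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.root = ChunkNode(())
        self.num_chunks = 0

    def insert(self, token_ids):
        """Return the chunk nodes for a request, creating only the missing ones."""
        node, path = self.root, []
        for start in range(0, len(token_ids), self.chunk_size):
            chunk = tuple(token_ids[start:start + self.chunk_size])
            child = node.children.get(chunk)
            if child is None:
                child = ChunkNode(chunk)
                node.children[chunk] = child
                self.num_chunks += 1
            path.append(child)
            node = child
        return path

cache = PrefixTreeKVCache(chunk_size=2)
system_prompt = [1, 2, 3, 4]                        # shared by both requests
path_a = cache.insert(system_prompt + [10, 11])
path_b = cache.insert(system_prompt + [20, 21])

shared = sum(1 for a, b in zip(path_a, path_b) if a is b)
print("chunks stored:", cache.num_chunks, "shared:", shared)
```

Sharing in this sketch happens only at whole-chunk granularity along a common prefix, so the chunk size trades off sharing opportunities against per-chunk bookkeeping.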

Title: GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection?

Authors: Yiping Jin, Leo Wanner, Alexander Shvets
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2402.15238
Pdf URL: https://arxiv.org/pdf/2402.15238
Copy Paste: [[2402.15238]] GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection?(https://arxiv.org/abs/2402.15238)
Keywords: language model, gpt, llm
Abstract: Online hate detection suffers from biases incurred in data sampling, annotation, and model pre-training. Therefore, measuring the averaged performance over all examples in held-out test data is inadequate. Instead, we must identify specific model weaknesses and be informed when it is more likely to fail. A recent proposal in this direction is HateCheck, a suite for testing fine-grained model functionalities on synthesized data generated using templates of the kind "You are just a [slur] to me." However, despite enabling more detailed diagnostic insights, the HateCheck test cases are often generic and have simplistic sentence structures that do not match the real-world data. To address this limitation, we propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch by instructing large language models (LLMs). We employ an additional natural language inference (NLI) model to verify the generations. Crowd-sourced annotation demonstrates that the generated test cases are of high quality. Using the new functional tests, we can uncover model weaknesses that would be overlooked using the original HateCheck dataset.

Title: Chitchat as Interference: Adding User Backstories to Task-Oriented Dialogues

Authors: Armand Stricker, Patrick Paroubek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15248
Pdf URL: https://arxiv.org/pdf/2402.15248
Copy Paste: [[2402.15248]] Chitchat as Interference: Adding User Backstories to Task-Oriented Dialogues(https://arxiv.org/abs/2402.15248)
Keywords: prompt, chat
Abstract: During task-oriented dialogues (TODs), human users naturally introduce chitchat that is beyond the immediate scope of the task, interfering with the flow of the conversation. To address this issue without the need for expensive manual data creation, we use few-shot prompting with Llama-2-70B to enhance the MultiWOZ dataset with user backstories, a typical example of chitchat interference in TODs. We assess the impact of this addition by testing two models: one trained solely on TODs and another trained on TODs with a preliminary chitchat interaction. Our analysis reveals that our enriched dataset poses a significant challenge to these systems. Moreover, we demonstrate that our dataset can be effectively used for training purposes, enabling a system to consistently acknowledge the user's backstory while also successfully moving the task forward in the same turn, as confirmed by human evaluation. These findings highlight the benefits of generating novel chitchat-TOD scenarios to test TOD systems more thoroughly and improve their resilience to natural user interferences.

Title: DEEM: Dynamic Experienced Expert Modeling for Stance Detection

Authors: Xiaolong Wang, Yile Wang, Sijie Cheng, Peng Li, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.15264
Pdf URL: https://arxiv.org/pdf/2402.15264
Copy Paste: [[2402.15264]] DEEM: Dynamic Experienced Expert Modeling for Stance Detection(https://arxiv.org/abs/2402.15264)
Keywords: language model, llm, agent
Abstract: Recent work has made a preliminary attempt to use large language models (LLMs) to solve the stance detection task, showing promising results. However, considering that stance detection usually requires detailed background knowledge, the vanilla reasoning method may neglect the domain knowledge needed to make a professional and accurate analysis. Thus, there is still room to improve LLM reasoning, especially in leveraging the generation capability of LLMs to simulate specific experts (i.e., multi-agents) to detect the stance. In this paper, different from existing multi-agent works that require detailed descriptions and use fixed experts, we propose a Dynamic Experienced Expert Modeling (DEEM) method which can leverage the generated experienced experts and let LLMs reason in a semi-parametric way, making the experts more generalizable and reliable. Experimental results demonstrate that DEEM consistently achieves the best results on three standard benchmarks, outperforms methods with self-consistency reasoning, and reduces the bias of LLMs.

Title: MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models

Authors: Nathanaël Carraz Rakotonirina, Marco Baroni
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15268
Pdf URL: https://arxiv.org/pdf/2402.15268
Copy Paste: [[2402.15268]] MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models(https://arxiv.org/abs/2402.15268)
Keywords: language model, prompt
Abstract: Transformer-based language models (LMs) track contextual information through large, hard-coded input windows. We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors, akin to soft prompts, without requiring LM finetuning. Tested on a task designed to probe an LM's ability to keep track of multiple fact updates, a MemoryPrompt-augmented LM outperforms much larger LMs that have access to the full input history. We also test MemoryPrompt on a long-distance dialogue dataset, where its performance is comparable to that of a model conditioned on the entire conversation history. In both experiments we also observe that, unlike full-finetuning approaches, MemoryPrompt does not suffer from catastrophic forgetting when adapted to new tasks, thus not disrupting the generalist capabilities of the underlying LM.

Title: When in Doubt, Think Slow: Iterative Reasoning with Latent Imagination

Authors: Martin Benfeghoul, Umais Zahid, Qinghai Guo, Zafeirios Fountas
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15283
Pdf URL: https://arxiv.org/pdf/2402.15283
Copy Paste: [[2402.15283]] When in Doubt, Think Slow: Iterative Reasoning with Latent Imagination(https://arxiv.org/abs/2402.15283)
Keywords: agent
Abstract: In an unfamiliar setting, a model-based reinforcement learning agent can be limited by the accuracy of its world model. In this work, we present a novel, training-free approach to improving the performance of such agents separately from planning and learning. We do so by applying iterative inference at decision-time, to fine-tune the inferred agent states based on the coherence of future state representations. Our approach achieves a consistent improvement in both reconstruction accuracy and task performance when applied to visual 3D navigation tasks. We go on to show that considering more future states further improves the performance of the agent in partially-observable environments, but not in a fully-observable one. Finally, we demonstrate that agents with less training pre-evaluation benefit most from our approach.

Title: Causal Graph Discovery with Retrieval-Augmented Generation based Large Language Models

Authors: Yuzhe Zhang, Yipeng Zhang, Yidong Gan, Lina Yao, Chen Wang
Subjects: cs.CL, cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2402.15301
Pdf URL: https://arxiv.org/pdf/2402.15301
Copy Paste: [[2402.15301]] Causal Graph Discovery with Retrieval-Augmented Generation based Large Language Models(https://arxiv.org/abs/2402.15301)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Causal graph recovery is essential in the field of causal inference. Traditional methods are typically knowledge-based or statistical estimation-based, and are limited by data collection biases and individuals' knowledge about factors affecting the relations between variables of interest. The advance of large language models (LLMs) provides opportunities to address these problems. We propose a novel method that utilizes the extensive knowledge contained within a large corpus of scientific literature to deduce causal relationships in general causal graph recovery tasks. This method leverages Retrieval-Augmented Generation (RAG) based LLMs to systematically analyze and extract pertinent information from a comprehensive collection of research papers. Our method first retrieves relevant text chunks from the aggregated literature. Then, the LLM is tasked with identifying and labelling potential associations between factors. Finally, we give a method to aggregate the associational relationships to build a causal graph. We demonstrate that our method is able to construct high quality causal graphs on the well-known SACHS dataset solely from literature.

Title: How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries

Authors: Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2402.15302
Pdf URL: https://arxiv.org/pdf/2402.15302
Copy Paste: [[2402.15302]] How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries(https://arxiv.org/abs/2402.15302)
Keywords: language model, gpt, llm
Abstract: In this study, we tackle a growing concern around the safety and ethical use of large language models (LLMs). Despite their potential, these models can be tricked into producing harmful or unethical content through various sophisticated methods, including 'jailbreaking' techniques and targeted manipulation. Our work zeroes in on a specific issue: to what extent LLMs can be led astray by asking them to generate responses that are instruction-centric such as a pseudocode, a program or a software snippet as opposed to vanilla text. To investigate this question, we introduce TechHazardQA, a dataset containing complex queries which should be answered in both text and instruction-centric formats (e.g., pseudocodes), aimed at identifying triggers for unethical responses. We query a series of LLMs -- Llama-2-13b, Llama-2-7b, Mistral-V2 and Mistral 8X7B -- and ask them to generate both text and instruction-centric responses. For evaluation we report the harmfulness score metric as well as judgements from GPT-4 and humans. Overall, we observe that asking LLMs to produce instruction-centric responses enhances the unethical response generation by ~2-38% across the models. As an additional objective, we investigate the impact of model editing using the ROME technique, which further increases the propensity for generating undesirable content. In particular, asking edited LLMs to generate instruction-centric responses further increases the unethical response generation by ~3-16% across the different models.

Title: ArabianGPT: Native Arabic GPT-based Large Language

Authors: Anis Koubaa, Adel Ammar, Lahouari Ghouti, Omar Najar, Serry Sibaee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15313
Pdf URL: https://arxiv.org/pdf/2402.15313
Copy Paste: [[2402.15313]] ArabianGPT: Native Arabic GPT-based Large Language(https://arxiv.org/abs/2402.15313)
Keywords: language model, gpt, llm
Abstract: The predominance of English and Latin-based large language models (LLMs) has led to a notable deficit in native Arabic LLMs. This discrepancy is accentuated by the prevalent inclusion of English tokens in existing Arabic models, detracting from their efficacy in processing native Arabic's intricate morphology and syntax. Consequently, there is a theoretical and practical imperative for developing LLMs predominantly focused on Arabic linguistic elements. To address this gap, this paper proposes ArabianGPT, a series of transformer-based models within the ArabianLLM suite designed explicitly for Arabic. These models, including ArabianGPT-0.1B and ArabianGPT-0.3B, vary in size and complexity, aligning with the nuanced linguistic characteristics of Arabic. The AraNizer tokenizer, integral to these models, addresses the unique morphological aspects of Arabic script, ensuring more accurate text processing. Empirical results from fine-tuning the models on tasks like sentiment analysis and summarization demonstrate significant improvements. For sentiment analysis, the fine-tuned ArabianGPT-0.1B model achieved a remarkable accuracy of 95%, a substantial increase from the base model's 56%. Similarly, in summarization tasks, fine-tuned models showed enhanced F1 scores, indicating improved precision and recall in generating concise summaries. Comparative analysis of fine-tuned ArabianGPT models against their base versions across various benchmarks reveals nuanced differences in performance, with fine-tuning positively impacting specific tasks like question answering and summarization. These findings underscore the efficacy of fine-tuning in aligning ArabianGPT models more closely with specific NLP tasks, highlighting the potential of tailored transformer architectures in advancing Arabic NLP.

Title: GPTVQ: The Blessing of Dimensionality for LLM Quantization

Authors: Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.15319
Pdf URL: https://arxiv.org/pdf/2402.15319
Copy Paste: [[2402.15319]] GPTVQ: The Blessing of Dimensionality for LLM Quantization(https://arxiv.org/abs/2402.15319)
Keywords: language model, gpt, llm
Abstract: In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed by using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llamav2-70B model, depending on the quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU we show that VQ leads to improved latency compared to using a 4-bit integer format.
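To make "quantization dimensionality" concrete, here is a minimal sketch of plain vector quantization: weights are grouped into `dim`-sized vectors and each group is replaced by the index of its nearest codebook entry. This is not GPTVQ itself; the Hessian-based weight updates, EM codebook initialization, and codebook compression mentioned in the abstract are omitted, and the function names and toy codebook are assumptions for illustration.

```python
def nearest_centroid(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for i, centroid in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, centroid))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def vector_quantize(weights, codebook, dim=2):
    """Quantize a flat weight list in groups of `dim` values.

    Returns the list of codebook indices and the reconstructed weights.
    """
    if len(weights) % dim != 0:
        raise ValueError("weight count must be a multiple of dim")
    indices, reconstructed = [], []
    for start in range(0, len(weights), dim):
        group = weights[start:start + dim]
        idx = nearest_centroid(group, codebook)
        indices.append(idx)
        reconstructed.extend(codebook[idx])
    return indices, reconstructed


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy example: four weights, 2-D codebook with two entries (1 bit per pair).
weights = [0.1, 0.9, 0.8, 0.2]
codebook = [[0.0, 1.0], [1.0, 0.0]]
indices, reconstructed = vector_quantize(weights, codebook, dim=2)
print(indices, reconstructed, mse(weights, reconstructed))
```

With `dim=1` the same code reduces to ordinary scalar quantization, so comparing the two settings on the same weights is a simple way to see the effect of dimensionality.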

Title: Ranking Entities along Conceptual Space Dimensions with LLMs: An Analysis of Fine-Tuning Strategies

Authors: Nitesh Kumar, Usashi Chatterjee, Steven Schockaert
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15337
Pdf URL: https://arxiv.org/pdf/2402.15337
Copy Paste: [[2402.15337]] Ranking Entities along Conceptual Space Dimensions with LLMs: An Analysis of Fine-Tuning Strategies(https://arxiv.org/abs/2402.15337)
Keywords: language model, llm
Abstract: Conceptual spaces represent entities in terms of their primitive semantic features. Such representations are highly valuable but they are notoriously difficult to learn, especially when it comes to modelling perceptual and subjective features. Distilling conceptual spaces from Large Language Models (LLMs) has recently emerged as a promising strategy. However, existing work has been limited to probing pre-trained LLMs using relatively simple zero-shot strategies. We focus in particular on the task of ranking entities according to a given conceptual space dimension. Unfortunately, we cannot directly fine-tune LLMs on this task, because ground truth rankings for conceptual space dimensions are rare. We therefore use more readily available features as training data and analyse whether the ranking capabilities of the resulting models transfer to perceptual and subjective features. We find that this is indeed the case, to some extent, but having perceptual and subjective features in the training data seems essential for achieving the best results. We furthermore find that pointwise ranking strategies are competitive against pairwise approaches, in defiance of common wisdom.

Title: NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data

Authors: Sergei Bogdanov, Alexandre Constantin, Timothée Bernard, Benoit Crabbé, Etienne Bernard
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15343
Pdf URL: https://arxiv.org/pdf/2402.15343
Copy Paste: [[2402.15343]] NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data(https://arxiv.org/abs/2402.15343)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown impressive abilities in data annotation, opening the way for new approaches to solve classic NLP problems. In this paper, we show how to use LLMs to create NuNER, a compact language representation model specialized in the Named Entity Recognition (NER) task. NuNER can be fine-tuned to solve downstream NER problems in a data-efficient way, outperforming similar-sized foundation models in the few-shot regime and competing with much larger LLMs. We find that the size and entity-type diversity of the pre-training dataset are key to achieving good performance. We view NuNER as a member of the broader family of task-specific foundation models, recently unlocked by LLMs.

Title: AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks

Authors: Zekang Yang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2402.15351
Pdf URL: https://arxiv.org/pdf/2402.15351
Copy Paste: [[2402.15351]] AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks(https://arxiv.org/abs/2402.15351)
Keywords: llm, prompt
Abstract: Automated machine learning (AutoML) is a collection of techniques designed to automate the machine learning development process. While traditional AutoML approaches have been successfully applied in several critical steps of model development (e.g. hyperparameter optimization), there is no AutoML system that automates the entire end-to-end model production workflow. To fill this gap, we present AutoMMLab, a general-purpose LLM-empowered AutoML system that follows users' language instructions to automate the whole model production workflow for computer vision tasks. The proposed AutoMMLab system effectively employs LLMs as the bridge to connect AutoML and the OpenMMLab community, empowering non-expert individuals to easily build task-specific models via a user-friendly language interface. Specifically, we propose RU-LLaMA to understand users' requests and schedule the whole pipeline, and propose a novel LLM-based hyperparameter optimizer called HPO-LLaMA to effectively search for the optimal hyperparameters. Experiments show that our AutoMMLab system is versatile and covers a wide range of mainstream tasks, including classification, detection, segmentation and keypoint estimation. We further develop a new benchmark, called LAMP, for studying key components in the end-to-end prompt-based model training pipeline. Code, model, and data will be released.

Title: Explorations of Self-Repair in Language Models

Authors: Cody Rushing, Neel Nanda
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.15390
Pdf URL: https://arxiv.org/pdf/2402.15390
Copy Paste: [[2402.15390]] Explorations of Self-Repair in Language Models(https://arxiv.org/abs/2402.15390)
Keywords: language model, prompt
Abstract: Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon where if components in large language models are ablated, later components will change their behavior to compensate. Our work builds off this past literature, demonstrating that self-repair exists on a variety of model families and sizes when ablating individual attention heads on the full training distribution. We further show that on the full training distribution self-repair is imperfect, as the original direct effect of the head is not fully restored, and noisy, since the degree of self-repair varies significantly across different prompts (sometimes overcorrecting beyond the original effect). We highlight two different mechanisms that contribute to self-repair, including changes in the final LayerNorm scaling factor (which can repair up to 30% of the direct effect) and sparse sets of neurons implementing Anti-Erasure. We additionally discuss the implications of these results for interpretability practitioners and close with a more speculative discussion on the mystery of why self-repair occurs in these models at all, highlighting evidence for the Iterative Inference hypothesis in language models, a framework that predicts self-repair.

Title: Genie: Generative Interactive Environments

Authors: Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, Tim Rocktäschel
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2402.15391
Pdf URL: https://arxiv.org/pdf/2402.15391
Copy Paste: [[2402.15391]] Genie: Generative Interactive Environments(https://arxiv.org/abs/2402.15391)
Keywords: prompt, agent
Abstract: We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.

Title: Offline Inverse RL: New Solution Concepts and Provably Efficient Algorithms

Authors: Filippo Lazzati, Mirco Mutti, Alberto Maria Metelli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.15392
Pdf URL: https://arxiv.org/pdf/2402.15392
Copy Paste: [[2402.15392]] Offline Inverse RL: New Solution Concepts and Provably Efficient Algorithms(https://arxiv.org/abs/2402.15392)
Keywords: agent
Abstract: Inverse reinforcement learning (IRL) aims to recover the reward function of an expert agent from demonstrations of behavior. It is well known that the IRL problem is fundamentally ill-posed, i.e., many reward functions can explain the demonstrations. For this reason, IRL has been recently reframed in terms of estimating the feasible reward set, thus postponing the selection of a single reward. However, so far, the available formulations and algorithmic solutions have been proposed and analyzed mainly for the online setting, where the learner can interact with the environment and query the expert at will. This is clearly unrealistic in most practical applications, where the availability of an offline dataset is a much more common scenario. In this paper, we introduce a novel notion of feasible reward set capturing the opportunities and limitations of the offline setting and we analyze the complexity of its estimation. This requires the introduction of an original learning framework that copes with the intrinsic difficulty of the setting, for which the data coverage is not under control. Then, we propose two computationally and statistically efficient algorithms, IRLO and PIRLO, for addressing the problem. In particular, the latter adopts a specific form of pessimism to enforce the novel desirable property of inclusion monotonicity of the delivered feasible set. With this work, we aim to provide a panorama of the challenges of the offline IRL problem and how they can be fruitfully addressed.

Title: Does Combining Parameter-efficient Modules Improve Few-shot Transfer Accuracy?

Authors: Nader Asadi, Mahdi Beitollahi, Yasser Khalil, Yinchuan Li, Guojun Zhang, Xi Chen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2402.15414
Pdf URL: https://arxiv.org/pdf/2402.15414
Copy Paste: [[2402.15414]] Does Combining Parameter-efficient Modules Improve Few-shot Transfer Accuracy?(https://arxiv.org/abs/2402.15414)
Keywords: language model
Abstract: Parameter-efficient fine-tuning stands as the standard for efficiently fine-tuning large language and vision models on downstream tasks. Specifically, the efficiency of low-rank adaptation has facilitated the creation and sharing of hundreds of custom LoRA modules, each trained on distinct data from various downstream tasks. In this paper, we explore the composability of LoRA modules, examining if combining these pre-trained modules enhances generalization to unseen downstream tasks. Our investigation involves evaluating two approaches: (a) uniform composition, involving averaging upstream LoRA modules with equal weights, and (b) learned composition, where we learn the weights for each upstream module and perform weighted averaging. Our experimental results on both vision and language models reveal that in few-shot settings, where only a limited number of samples are available for the downstream task, both uniform and learned composition methods result in better transfer accuracy, outperforming full fine-tuning and training a LoRA from scratch. Moreover, in full-shot settings, learned composition performs comparably to regular LoRA training with significantly fewer trainable parameters. Our research unveils the potential of uniform composition for enhancing transferability in low-shot settings, without introducing additional learnable parameters.
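The two composition schemes named in the abstract above can be sketched on toy matrices. This is a sketch under assumptions rather than the paper's implementation: each LoRA module is represented by its weight update ΔW = B·A, averaging is done on those updates (the abstract does not say whether A and B are averaged separately), the helper names `matmul` and `compose` are hypothetical, and the "learned" weights are supplied by hand instead of being trained.

```python
def matmul(b, a):
    """Multiply matrix b (m x r) by matrix a (r x n), both lists of lists."""
    return [
        [sum(b[i][k] * a[k][j] for k in range(len(a))) for j in range(len(a[0]))]
        for i in range(len(b))
    ]


def compose(deltas, weights=None):
    """Weighted average of LoRA weight updates.

    deltas: list of equally shaped matrices, one per upstream module.
    weights: per-module coefficients; None means uniform composition
             (each module gets 1 / number of modules).
    """
    if weights is None:
        weights = [1.0 / len(deltas)] * len(deltas)
    rows, cols = len(deltas[0]), len(deltas[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for weight, delta in zip(weights, deltas):
        for i in range(rows):
            for j in range(cols):
                combined[i][j] += weight * delta[i][j]
    return combined


# Two rank-1 toy modules, each given as (B, A) with delta = B @ A.
delta_1 = matmul([[1.0], [0.0]], [[2.0, 0.0]])
delta_2 = matmul([[0.0], [1.0]], [[0.0, 4.0]])

uniform = compose([delta_1, delta_2])               # equal weights
weighted = compose([delta_1, delta_2], [0.25, 0.75])  # stand-in for learned weights
print(uniform)
print(weighted)
```

In the learned variant only the per-module coefficients would be trainable, which is consistent with the abstract's point about using far fewer trainable parameters than regular LoRA training.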

Title: A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models

Authors: Stefan Hegselmann, Shannon Zejiang Shen, Florian Gierse, Monica Agrawal, David Sontag, Xiaoyi Jiang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15422
Pdf URL: https://arxiv.org/pdf/2402.15422
Copy Paste: [[2402.15422]] A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models(https://arxiv.org/abs/2402.15422)
Keywords: language model, gpt, hallucination, prompt
Abstract: Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of large language models to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we develop a rigorous labeling protocol for hallucinations, and have two medical experts annotate 100 real-world summaries and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. Although the effect is still present, it is much smaller for GPT-4 when prompted with five examples (0.70 to 0.40). We also conduct a qualitative evaluation using hallucination-free and improved training data. GPT-4 shows very good results even in the zero-shot setting. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which yields promising results.

Title: Repetition Improves Language Model Embeddings

Authors: Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15449
Pdf URL: https://arxiv.org/pdf/2402.15449
Copy Paste: [[2402.15449]] Repetition Improves Language Model Embeddings(https://arxiv.org/abs/2402.15449)
Keywords: language model, llm
Abstract: Recent approaches to improving the extraction of text embeddings from autoregressive large language models (LLMs) have largely focused on improvements to data, backbone pretrained language models, or improving task-differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, "echo embeddings," in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art compared to prior open source models that do not leverage synthetic fine-tuning data.
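The echo idea above can be illustrated with a toy causal encoder. In this sketch the input is simply concatenated with itself and the embedding is mean-pooled over the positions of the second copy. The function names, the mean-pooling choice, and `toy_causal_encoder` (a stand-in for a real autoregressive model's per-token hidden states) are assumptions for illustration; the abstract does not specify the prompt template or pooling strategy.

```python
def echo_prompt(tokens):
    """Repeat the input and return it with the slice covering the second copy."""
    repeated = list(tokens) + list(tokens)
    second_copy = slice(len(tokens), 2 * len(tokens))
    return repeated, second_copy


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def echo_embedding(tokens, encode):
    """Embed `tokens` by pooling hidden states of the second occurrence only.

    encode: callable mapping a token list to one hidden vector per token.
    """
    repeated, second_copy = echo_prompt(tokens)
    hidden = encode(repeated)
    return mean_pool(hidden[second_copy])


def toy_causal_encoder(tokens):
    """Hypothetical causal encoder: state i depends only on tokens 0..i."""
    states, running = [], 0
    for position, token in enumerate(tokens):
        running += token
        states.append([float(running), float(position)])
    return states


tokens = [1, 2, 3]
print(echo_embedding(tokens, toy_causal_encoder))
```

Because the toy encoder is causal, the state at the first token of the second copy already reflects every token of the first copy, which is the property the abstract relies on.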

Title: Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization

Authors: Swaroop Nath, Tejpalsingh Siledar, Sankara Sri Raghava Ravindra Muddu, Rupasai Rangaraju, Harshad Khadilkar, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Sudhanshu Shekhar Singh, Muthusamy Chelliah, Nikesh Garera
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15473
Pdf URL: https://arxiv.org/pdf/2402.15473
Copy Paste: [[2402.15473]] Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization(https://arxiv.org/abs/2402.15473)
Keywords: language model, prompt
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a dominating strategy in steering Language Models (LMs) towards human values/goals. The key to the strategy is employing a reward model ($\varphi$) which can reflect a latent reward model with humans. While this strategy has proven to be effective, the training methodology requires a lot of human preference annotation (usually of the order of tens of thousands) to train $\varphi$. Such large-scale preference annotations are achievable if the reward model can be ubiquitously used. However, human values/goals are subjective and depend on the nature of the task. This poses a challenge in collecting diverse preferences for downstream applications. To address this, we propose a novel methodology to infuse domain knowledge into $\varphi$, which reduces the size of preference annotation required. We validate our approach in E-Commerce Opinion Summarization, with a significant reduction in dataset size (just 940 samples) while advancing the state-of-the-art. Our contributions include a novel Reward Modelling technique, a new dataset (PromptOpinSumm) for Opinion Summarization, and a human preference dataset (OpinPref). The proposed methodology opens avenues for efficient RLHF, making it more adaptable to diverse applications with varying human values. We release the artifacts for usage under MIT License.

Title: Prejudice and Caprice: A Statistical Framework for Measuring Social Discrimination in Large Language Models

Authors: Yiran Liu (1 and 2), Ke Yang (1 and 3), Zehan Qi (2), Xiao Liu (2), Yang Yu (2), Chengxiang Zhai (3) ((1) Equal contributions, (2) Tsinghua University, (3) University of Illinois Urbana-Champaign)
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2402.15481
Pdf URL: https://arxiv.org/pdf/2402.15481
Copy Paste: [[2402.15481]] Prejudice and Caprice: A Statistical Framework for Measuring Social Discrimination in Large Language Models(https://arxiv.org/abs/2402.15481)
Keywords: language model, llm
Abstract: The growing integration of large language models (LLMs) into social operations amplifies their impact on decisions in crucial areas such as economics, law, education, and healthcare, raising public concerns about these models' discrimination-related safety and reliability. However, prior discrimination measuring frameworks solely assess the average discriminatory behavior of LLMs, often proving inadequate because they overlook an additional discrimination-leading factor, i.e., the LLMs' prediction variation across diverse contexts. In this work, we present the Prejudice-Caprice Framework (PCF) that comprehensively measures discrimination in LLMs by considering both their consistently biased preference and preference variation across diverse contexts. Specifically, we mathematically dissect the aggregated contextualized discrimination risk of LLMs into prejudice risk, originating from LLMs' persistent prejudice, and caprice risk, stemming from their generation inconsistency. In addition, we utilize a data-mining approach to gather preference-detecting probes from sentence skeletons, devoid of attribute indications, to approximate LLMs' applied contexts. While initially intended for assessing discrimination in LLMs, our proposed PCF facilitates the comprehensive and flexible measurement of any inductive biases, including knowledge alongside prejudice, across various modality models. We apply our discrimination-measuring framework to 12 common LLMs, yielding intriguing findings: i) modern LLMs demonstrate significant pro-male stereotypes, ii) LLMs' exhibited discrimination correlates with several social and economic factors, iii) prejudice risk dominates the overall discrimination risk and follows a normal distribution, and iv) caprice risk contributes minimally to the overall risk but follows a fat-tailed distribution, suggesting that it is a wild risk requiring enhanced surveillance.

Title: API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs

Authors: Kinjal Basu, Ibrahim Abdelaziz, Subhajit Chaudhury, Soham Dan, Maxwell Crouse, Asim Munawar, Sadhana Kumaravel, Vinod Muthusamy, Pavan Kapanipathi, Luis A. Lastras
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.15491
Pdf URL: https://arxiv.org/pdf/2402.15491
Copy Paste: [[2402.15491]] API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs(https://arxiv.org/abs/2402.15491)
Keywords: language model, llm
Abstract: There is a growing need for Large Language Models (LLMs) to effectively use tools and external Application Programming Interfaces (APIs) to plan and complete tasks. As such, there is tremendous interest in methods that can acquire sufficient quantities of training and test data that involve calls to tools / APIs. Two lines of research have emerged as the predominant strategies for addressing this challenge. The first has focused on synthetic data generation techniques, while the second has involved curating task-adjacent datasets which can be transformed into API / Tool-based tasks. In this paper, we focus on the task of identifying, curating, and transforming existing datasets and, in turn, introduce API-BLEND, a large corpus for training and systematic testing of tool-augmented LLMs. The datasets mimic real-world scenarios involving API-tasks such as API / tool detection, slot filling, and sequencing of the detected APIs. We demonstrate the utility of the API-BLEND dataset for both training and benchmarking purposes.
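Summary: Large language models (LLMs) increasingly need to use tools and external application programming interfaces (APIs) effectively to plan and complete tasks. As a result, there is great interest in methods that can obtain sufficient quantities of training and test data involving tool / API calls. Two lines of research have become the main strategies for this challenge. The first focuses on synthetic data generation, while the second curates task-adjacent datasets that can be transformed into API / tool-based tasks. This paper focuses on identifying, curating, and transforming existing datasets and, in turn, introduces API-BLEND, a large corpus for training and systematically testing tool-augmented LLMs. The datasets mimic real-world scenarios involving API tasks such as API / tool detection, slot filling, and sequencing of the detected APIs. The authors demonstrate the usefulness of the API-BLEND dataset for both training and benchmarking.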

Title: AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning

Authors: Jianguo Zhang, Tian Lan, Rithesh Murthy, Zhiwei Liu, Weiran Yao, Juntao Tan, Thai Hoang, Liangwei Yang, Yihao Feng, Zuxin Liu, Tulika Awalgaonkar, Juan Carlos Niebles, Silvio Savarese, Shelby Heinecke, Huan Wang, Caiming Xiong
Subjects: cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.15506
Pdf URL: https://arxiv.org/pdf/2402.15506
Copy Paste: [[2402.15506]] AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning(https://arxiv.org/abs/2402.15506)
Keywords: language model, llm, agent
Abstract: Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce AgentOhana as a comprehensive solution to address these challenges. AgentOhana aggregates agent trajectories from distinct environments, spanning a wide array of scenarios. It meticulously standardizes and unifies these trajectories into a consistent format, streamlining the creation of a generic data loader optimized for agent training. Leveraging the data unification, our training pipeline maintains equilibrium across different data sources and preserves independent randomness across devices during dataset partitioning and model training. Additionally, we present xLAM-v0.1, a large action model tailored for AI agents, which demonstrates exceptional performance across various benchmarks.
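Summary: Autonomous agents powered by large language models (LLMs) have attracted broad research attention. However, because data sources with multi-turn trajectories are heterogeneous, fully exploiting the potential of LLMs for agent-based tasks poses inherent challenges. This paper introduces AgentOhana as a comprehensive solution to these challenges. AgentOhana aggregates agent trajectories from different environments, covering a wide range of scenarios. It carefully standardizes and unifies these trajectories into a consistent format, which simplifies the creation of a generic data loader optimized for agent training. Building on this data unification, the training pipeline keeps the different data sources balanced and preserves independent randomness across devices during dataset partitioning and model training. The paper also presents xLAM-v0.1, a large action model tailored for AI agents, which shows excellent performance on a variety of benchmarks.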
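
The unification and balancing steps described for AgentOhana can be pictured with a minimal sketch. Everything here is hypothetical: the two raw record layouts, the `Trajectory` and `Turn` classes, the converter names, and the per-device seeding rule (`base_seed + rank`) are assumptions made for illustration, not the authors' actual pipeline.

```python
import random
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    content: str


@dataclass
class Trajectory:
    source: str
    turns: list


def from_env_a(record):
    # Hypothetical layout A: {"dialog": [(role, text), ...]}
    return Trajectory("env_a", [Turn(r, c) for r, c in record["dialog"]])


def from_env_b(record):
    # Hypothetical layout B: {"steps": [{"speaker": ..., "text": ...}, ...]}
    return Trajectory("env_b", [Turn(s["speaker"], s["text"]) for s in record["steps"]])


def balanced_sample(by_source, n, base_seed, rank):
    """Draw n trajectories, picking the source uniformly before picking an item.

    by_source: dict mapping source name -> list of Trajectory objects.
    rank: device index; each device gets its own random stream (assumed rule).
    """
    rng = random.Random(base_seed + rank)
    sources = sorted(by_source)
    return [rng.choice(by_source[rng.choice(sources)]) for _ in range(n)]


if __name__ == "__main__":
    a = from_env_a({"dialog": [("user", "book a flight"), ("assistant", "Which date?")]})
    b = from_env_b({"steps": [{"speaker": "user", "text": "search shoes"}]})
    batch = balanced_sample({"env_a": [a], "env_b": [b]}, n=4, base_seed=0, rank=1)
    print([t.source for t in batch])
```

Choosing the source first and the item second keeps a small dataset from being drowned out by a large one, which is one simple way to read "maintains equilibrium across different data sources".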