2024-05-31

Title: Adaptive In-conversation Team Building for Language Model Agents

Authors: Linxin Song, Jiale Liu, Jieyu Zhang, Shaokun Zhang, Ao Luo, Shijian Wang, Qingyun Wu, Chi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.19425
Pdf URL: https://arxiv.org/pdf/2405.19425
Copy Paste: [[2405.19425]] Adaptive In-conversation Team Building for Language Model Agents(https://arxiv.org/abs/2405.19425)
Keywords: language model, llm, prompt, agent
Abstract: Leveraging multiple large language model (LLM) agents has been shown to be a promising approach for tackling complex tasks, while the effective design of multiple agents for a particular application remains an art. It is thus intriguing to answer a critical question: Given a task, how can we build a team of LLM agents to solve it effectively? Our new adaptive team-building paradigm offers a flexible solution, realized through a novel agent design named Captain Agent. It dynamically forms and manages teams for each step of a task-solving process, utilizing nested group conversations and reflection to ensure diverse expertise and prevent stereotypical outputs. It allows for a flexible yet structured approach to problem-solving and can help reduce redundancy and enhance output diversity. A comprehensive evaluation across six real-world scenarios demonstrates that Captain Agent significantly outperforms existing multi-agent methods with 21.94% improvement in average accuracy, providing outstanding performance without requiring task-specific prompt engineering.
摘要：利用多个大型语言模型 (LLM) 代理已被证明是一种解决复杂任务的有前途的方法，而为特定应用有效设计多个代理仍然是一门艺术。因此，回答一个关键问题非常有趣：给定一个任务，我们如何组建一个 LLM 代理团队来有效地解决它？我们新的自适应团队建设范式提供了一个灵活的解决方案，通过名为 Captain Agent 的新型代理设计实现。它动态地为任务解决过程的每个步骤组建和管理团队，利用嵌套的群组对话和反思来确保多样化的专业知识并防止刻板的输出。它允许采用灵活而结构化的方法解决问题，并有助于减少冗余并增强输出多样性。对六个真实场景的全面评估表明，Captain Agent 的表现明显优于现有的多代理方法，平均准确率提高了 21.94%，无需针对特定任务进行快速工程即可提供出色的性能。

Title: Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals

Authors: Yupei Wang, Renfen Hu, Zhe Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.19433
Pdf URL: https://arxiv.org/pdf/2405.19433
Copy Paste: [[2405.19433]] Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals(https://arxiv.org/abs/2405.19433)
Keywords: language model, llm
Abstract: While current automated essay scoring (AES) methods show high agreement with human raters, their scoring mechanisms are not fully explored. Our proposed method, using counterfactual intervention assisted by Large Language Models (LLMs), reveals that when scoring essays, BERT-like models primarily focus on sentence-level features, while LLMs are attuned to conventions, language complexity, as well as organization, indicating a more comprehensive alignment with scoring rubrics. Moreover, LLMs can discern counterfactual interventions during feedback. Our approach improves understanding of neural AES methods and can also apply to other domains seeking transparency in model-driven decisions. The code and data will be released on GitHub.
摘要：虽然目前的自动论文评分 (AES) 方法与人类评分者表现出高度一致性，但它们的评分机制尚未得到充分探索。我们提出的方法使用大型语言模型 (LLM) 辅助的反事实干预，结果表明，在对论文进行评分时，类似 BERT 的模型主要关注句子级特征，而 LLM 则关注惯例、语言复杂性以及组织，这表明与评分标准更加一致。此外，LLM 可以在反馈过程中辨别反事实干预。我们的方法提高了对神经 AES 方法的理解，也可以应用于寻求模型驱动决策透明度的其他领域。代码和数据将在 GitHub 上发布。

Title: A Full-duplex Speech Dialogue Scheme Based On Large Language Models

Authors: Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Yuanjun Xiong, Wei Xia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.19487
Pdf URL: https://arxiv.org/pdf/2405.19487
Copy Paste: [[2405.19487]] A Full-duplex Speech Dialogue Scheme Based On Large Language Models(https://arxiv.org/abs/2405.19487)
Keywords: language model, llm
Abstract: We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction. It is based on a large language model (LLM) carefully aligned to be aware of a perception module, a motor function module, and the concept of a simple finite state machine (called neural FSM) with two states. The perception and motor function modules run concurrently, allowing the system to speak and listen to the user at the same time. The LLM generates textual tokens for inquiry responses and makes autonomous decisions to start responding to, wait for, or interrupt the user by emitting control tokens to the neural FSM. All these tasks of the LLM are carried out as next token prediction on a serialized view of the dialogue in real-time. In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency by more than threefold compared with LLM-based half-duplex dialogue systems while responding within less than 500 milliseconds in more than 50% of evaluated interactions. Running an LLM with only 8 billion parameters, our system exhibits an 8% higher interruption precision rate than the best available commercial LLM for voice-based dialogue.
摘要：我们提出了一种能够以全双工方式运行的生成对话系统，可实现无缝交互。它基于一个大型语言模型 (LLM)，经过精心调整，可感知感知模块、运动功能模块以及具有两个状态的简单有限状态机（称为神经 FSM）的概念。感知和运动功能模块同时运行，使系统能够同时与用户对话和倾听。LLM 生成用于询问响应的文本标记，并通过向神经 FSM 发出控制标记来自主决定开始响应、等待或中断用户。LLM 的所有这些任务都是作为实时对话序列化视图上的下一个标记预测执行的。在模拟真实交互的自动质量评估中，与基于 LLM 的半双工对话系统相比，所提出的系统将平均对话响应延迟减少了 3 倍以上，同时在超过 50% 的评估交互中响应时间少于 500 毫秒。运行仅有 80 亿个参数的 LLM，我们的系统比现有的最佳商用语音对话 LLM 的中断准确率高出 8%。

Title: Two-layer retrieval augmented generation framework for low-resource medical question-answering: proof of concept using Reddit data

Authors: Sudeshna Das, Yao Ge, Yuting Guo, Swati Rajwal, JaMor Hairston, Jeanne Powell, Drew Walker, Snigdha Peddireddy, Sahithi Lakamana, Selen Bozkurt, Matthew Reyna, Reza Sameni, Yunyu Xiao, Sangmi Kim, Rasheeta Chandler, Natalie Hernandez, Danielle Mowery, Rachel Wightman, Jennifer Love, Anthony Spadaro, Jeanmarie Perrone, Abeed Sarker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.19519
Pdf URL: https://arxiv.org/pdf/2405.19519
Copy Paste: [[2405.19519]] Two-layer retrieval augmented generation framework for low-resource medical question-answering: proof of concept using Reddit data(https://arxiv.org/abs/2405.19519)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: Retrieval augmented generation (RAG) provides the capability to constrain generative model outputs, and mitigate the possibility of hallucination, by providing relevant in-context text. The number of tokens a generative large language model (LLM) can incorporate as context is finite, thus limiting the volume of knowledge from which to generate an answer. We propose a two-layer RAG framework for query-focused answer generation and evaluate a proof-of-concept for this framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. The evaluations demonstrate the effectiveness of the two-layer framework in resource-constrained settings, enabling researchers to obtain near real-time data from users.
摘要：检索增强生成 (RAG) 通过提供相关的上下文文本，提供了约束生成模型输出并降低幻觉可能性的能力。生成大型语言模型 (LLM) 可以作为上下文合并的标记数量是有限的，因此限制了生成答案的知识量。我们提出了一个两层 RAG 框架用于以查询为中心的答案生成，并在以查询为中心的社交媒体论坛摘要生成背景下评估了该框架的概念验证，重点关注新兴的毒品相关信息。评估证明了两层框架在资源受限环境中的有效性，使研究人员能够从用户那里获取近乎实时的数据。

Title: CheXpert Plus: Hundreds of Thousands of Aligned Radiology Texts, Images and Patients

Authors: Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, Curtis P. Langlotz
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] CheXpert Plus: Hundreds of Thousands of Aligned Radiology Texts, Images and Patients(https://arxiv.org/abs/)
Keywords: language model
Abstract: Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with numerous image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. The data and models are publicly available.
摘要：自五年前发布原始 CheXpert 论文以来，CheXpert 已成为使用最广泛、引用最多的临床 AI 数据集之一。视觉语言模型的出现引发了对与 CheXpert 图像相关的报告共享需求的增加，同时 AI 公平性研究人员对获取人口统计数据的兴趣也日益增长。为了解决这个问题，CheXpert Plus 是一个新的放射学数据源集合，公开可用，以增强模型的可扩展性、性能、稳健性和公平性，适用于放射学领域的所有后续机器学习任务。CheXpert Plus 是放射学领域公开发布的最大文本数据集，共有 3600 万个文本标记，包括 1300 万个印象标记。据我们所知，它代表了放射学领域最大的文本去识别工作，近 100 万个 PHI 跨度被匿名化。这只是放射学领域第二次发布大规模英语配对数据集，从而首次实现了大规模跨机构培训。所有报告均配有 DICOM 格式的高质量图像，以及涵盖各种临床和社会经济群体的大量图像和患者元数据，以及许多病理标签和 RadGraph 注释。我们希望这个数据集能够促进 AI 模型的研究，从而进一步协助放射科医生并帮助改善医疗保健。数据和模型均已公开提供。

Title: Unlearning Climate Misinformation in Large Language Models

Authors: Michael Fore, Simranjit Singh, Chaehong Lee, Amritanshu Pandey, Antonios Anastasopoulos, Dimitrios Stamoulis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.19563
Pdf URL: https://arxiv.org/pdf/2405.19563
Copy Paste: [[2405.19563]] Unlearning Climate Misinformation in Large Language Models(https://arxiv.org/abs/2405.19563)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information. Using true/false labeled Q&A data for fine-tuning and evaluating LLMs on climate-related claims, we compare open-source models, assessing their ability to generate truthful responses to climate change questions. We investigate the detectability of models intentionally poisoned with false climate information, finding that such poisoning may not affect the accuracy of a model's responses in other domains. Furthermore, we compare the effectiveness of unlearning algorithms, fine-tuning, and Retrieval-Augmented Generation (RAG) for factually grounding LLMs on climate change topics. Our evaluation reveals that unlearning algorithms can be effective for nuanced conceptual claims, despite previous findings suggesting their inefficacy in privacy contexts. These insights aim to guide the development of more factually reliable LLMs and highlight the need for additional work to secure LLMs against misinformation attacks.
摘要：有关气候变化的错误信息是解决人类面临的最严重威胁之一的主要障碍。本文研究了大型语言模型 (LLM) 中有关气候信息的事实准确性。我们使用真/假标记的问答数据对与气候相关的声明的 LLM 进行微调和评估，比较开源模型，评估它们对气候变化问题产生真实反应的能力。我们调查了故意用虚假气候信息毒害的模型的可检测性，发现这种毒害可能不会影响模型在其他领域的响应准确性。此外，我们比较了反学习算法、微调和检索增强生成 (RAG) 在事实基础上为气候变化主题的 LLM 提供依据的有效性。我们的评估表明，反学习算法对于细微的概念性声明是有效的，尽管先前的发现表明它们在隐私环境中无效。这些见解旨在指导开发更具事实可靠性的 LLM，并强调需要做更多工作来保护 LLM 免受错误信息攻击。

Title: GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment

Authors: Yao Yao, Zuchao Li, Hai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.19635
Pdf URL: https://arxiv.org/pdf/2405.19635
Copy Paste: [[2405.19635]] GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment(https://arxiv.org/abs/2405.19635)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The burgeoning size of Large Language Models (LLMs) has led to enhanced capabilities in generating responses, albeit at the expense of increased inference times and elevated resource demands. Existing methods of acceleration, predominantly hinged on knowledge distillation, generally necessitate fine-tuning of considerably large models, such as Llama-7B, posing a challenge for average users. Furthermore, present techniques for expediting inference and reducing costs operate independently. To address these issues, we introduce a novel and intuitive Guidance-based Knowledge Transfer (GKT) framework. This approach leverages a larger LLM as a "teacher" to create guidance prompts, paired with a smaller "student" model to finalize responses. Remarkably, GKT requires no fine-tuning and does not require the teacher and student models to share the same vocabulary, allowing for extensive batch generation to accelerate the process while ensuring user customization. GKT can be seamlessly integrated into cloud-edge collaboration architectures, and is versatile enough for plug-and-play application across various models. It excels in both efficiency and affordability, epitomizing a "cheap and cheerful" solution. GKT achieves a maximum accuracy improvement of 14.18%, along with a 10.72 times speed-up on GSM8K and an accuracy improvement of 14.00% along with a 7.73 times speed-up on CSQA. When utilizing ChatGPT as the teacher model and Llama2-70B as the student model, we can achieve 95.00% of ChatGPT's performance at 52% of the cost. The results highlight substantial enhancements in accuracy and processing speed on the GSM8K and CSQA datasets, surpassing the performance of using either the student or teacher models in isolation.
摘要：大型语言模型 (LLM) 规模的不断扩大，提高了生成响应的能力，尽管这会增加推理时间和资源需求。现有的加速方法主要依赖于知识提炼，通常需要对相当大的模型（例如 Llama-7B）进行微调，这对普通用户来说是一个挑战。此外，目前用于加快推理和降低成本的技术是独立运行的。为了解决这些问题，我们引入了一个新颖且直观的基于指导的知识转移 (GKT) 框架。这种方法利用较大的 LLM 作为“老师”来创建指导提示，并搭配较小的“学生”模型来完成响应。值得注意的是，GKT 不需要微调，也不需要老师和学生模型具有相同的词汇量，从而允许进行大量批量生成以加速该过程，同时确保用户自定义。 GKT 可以无缝集成到云端协作架构中，并且功能多样，可在各种模型之间即插即用。它在效率和价格方面都表现出色，堪称“物美价廉”的解决方案。GKT 在 GSM8K 上实现了 14.18% 的最大准确率提升，同时速度提高了 10.72 倍；在 CSQA 上实现了 14.00% 的准确率提升，同时速度提高了 7.73 倍。当使用 ChatGPT 作为教师模型，使用 Llama2-70B 作为学生模型时，我们可以以 52% 的成本实现 ChatGPT 95.00% 的性能。结果显示，在 GSM8K 和 CSQA 数据集上，准确率和处理速度均有显著提升，超过了单独使用学生或教师模型的性能。

Title: Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

Authors: Ernesto Quevedo, Jorge Yero, Rachel Koerner, Pablo Rivas, Tomas Cerny
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2405.19648
Pdf URL: https://arxiv.org/pdf/2405.19648
Copy Paste: [[2405.19648]] Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach(https://arxiv.org/abs/2405.19648)
Keywords: language model, llm, hallucination
Abstract: Concerns regarding the propensity of Large Language Models (LLMs) to produce inaccurate outputs, also known as hallucinations, have escalated. Detecting them is vital for ensuring the reliability of applications relying on LLM-generated content. Current methods often demand substantial resources: they rely on large-scale LLMs, or employ supervised learning with multidimensional features or intricate linguistic and semantic analyses that are difficult to reproduce, and they largely depend on using the same LLM that hallucinated. This paper introduces a supervised learning approach employing two simple classifiers that use only four numerical features derived from token and vocabulary probabilities obtained from other LLM evaluators, which need not be the LLM that generated the text. The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks. Additionally, we provide a comprehensive examination of the strengths and weaknesses of our approach, highlighting the significance of the features utilized and the LLM employed as an evaluator. We have released our code publicly.
摘要：人们越来越担心大型语言模型 (LLM) 容易产生不准确的输出（也称为幻觉）。检测它们对于确保依赖 LLM 生成内容的应用程序的可靠性至关重要。当前的方法通常需要大量资源并依赖于广泛的 LLM，或者采用具有多维特征的监督学习或难以重现的复杂语言和语义分析，并且很大程度上依赖于使用产生幻觉的相同 LLM。本文介绍了一种监督学习方法，该方法采用两个简单的分类器，仅使用从其他 LLM 评估器获得的标记和词汇概率得出的四个数值特征，这些特征不一定相同。该方法产生了有希望的结果，在三个不同基准的多个任务中超越了最先进的结果。此外，我们还全面检查了我们方法的优缺点，强调了所使用的特征和作为评估器使用的 LLM 的重要性。我们已公开发布了我们的代码。

Title: PATIENT-Ψ: Using Large Language Models to Simulate Patients for Training Mental Health Professionals

Authors: Ruiyi Wang, Stephanie Milani, Jamie C. Chiu, Shaun M. Eack, Travis Labrum, Samuel M. Murphy, Nev Jones, Kate Hardy, Hong Shen, Fei Fang, Zhiyu Zoey Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] PATIENT-Ψ: Using Large Language Models to Simulate Patients for Training Mental Health Professionals(https://arxiv.org/abs/)
Keywords: language model, gpt, llm
Abstract: Mental illness remains one of the most critical public health issues, with a significant gap between the available mental health support and patient needs. Many mental health professionals highlight a disconnect between their training and real-world patient interactions, leaving some trainees feeling unprepared and potentially affecting their early career success. In this paper, we propose PATIENT-Ψ, a novel patient simulation framework for cognitive behavior therapy (CBT) training. To build PATIENT-Ψ, we constructed diverse patient profiles and their corresponding cognitive models based on CBT principles, and then used large language models (LLMs) programmed with the patient cognitive models to act as a simulated therapy patient. We propose an interactive training scheme, PATIENT-Ψ-TRAINER, for mental health trainees to practice a key skill in CBT -- formulating the cognitive model of the patient -- through role-playing a therapy session with PATIENT-Ψ. To evaluate PATIENT-Ψ, we conducted a user study of 4 mental health trainees and 10 experts. The results demonstrate that practice using PATIENT-Ψ-TRAINER greatly enhances the perceived skill acquisition and confidence of the trainees beyond existing forms of training such as textbooks, videos, and role-play with non-patients. Based on the experts' perceptions, PATIENT-Ψ is perceived to be closer to real patient interactions than GPT-4, and PATIENT-Ψ-TRAINER holds strong promise to improve trainee competencies. Our pioneering patient simulation training framework, using LLMs, holds great potential to enhance and advance mental health training, ultimately leading to improved patient care and outcomes. We will release all our data, code, and the training platform.
摘要：精神疾病仍然是最关键的公共卫生问题之一，现有的心理健康支持与患者需求之间存在巨大差距。许多心理健康专业人士强调，他们的培训与现实世界的患者互动之间存在脱节，这让一些受训者感到措手不及，并可能影响他们早期的职业成功。在本文中，我们提出了 PATIENT-Ψ，一种用于认知行为疗法 (CBT) 培训的新型患者模拟框架。为了构建 PATIENT-Ψ，我们根据 CBT 原理构建了不同的患者档案及其相应的认知模型，然后使用用患者认知模型编程的大型语言模型 (LLM) 充当模拟治疗患者。我们提出了一种交互式培训方案 PATIENT-Ψ-TRAINER，让心理健康受训者通过与 PATIENT-Ψ 在治疗环节中进行角色扮演来练习 CBT 的一项关键技能——制定患者的认知模型。为了评估 PATIENT-Ψ，我们对 4 名心理健康受训者和 10 名专家进行了用户研究。结果表明，使用 PATIENT-Ψ-TRAINER 进行练习大大提高了受训者的感知技能习得和信心，超越了现有的培训形式，例如教科书、视频和与非患者的角色扮演。根据专家的看法，PATIENT-Ψ 被认为比 GPT-4 更接近真实的患者互动，并且 PATIENT-Ψ-TRAINER 有望提高受训者的能力。我们使用 LLM 的开创性患者模拟培训框架具有巨大的潜力，可以增强和推进心理健康培训，最终改善患者护理和结果。我们将发布所有数据、代码和培训平台。

Title: One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models

Authors: Yutao Zhu, Zhaoheng Huang, Zhicheng Dou, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models(https://arxiv.org/abs/)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune the LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs' general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs' original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs' performance but also preserves their general generation capacities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments across nine question-answering tasks demonstrate the superiority of our approach.
摘要：检索增强生成 (RAG) 是一种有前途的方法，可以改进大型语言模型 (LLM)，以生成更真实、更准确和最新的内容。现有方法要么优化提示以指导 LLM 利用检索到的信息，要么直接微调 LLM 以适应 RAG 场景。虽然微调可以产生更好的性能，但它通常会通过修改其参数来损害 LLM 的一般生成能力。这种限制在实际应用中带来了挑战，尤其是在已经部署 LLM 的情况下，因为参数调整可能会影响其原始功能。为了解决这个问题，我们提出了一种新方法，该方法涉及学习可扩展和可插入的 RAG 虚拟令牌。通过保持 LLM 的原始参数并仅微调这些可插入令牌的嵌入，我们的方法不仅提高了 LLM 的性能，而且还保留了它们的一般生成能力。此外，我们设计了几种训练策略来提高我们方法的可扩展性、灵活性和通用性。在九个问答任务中进行的综合实验证明了我们方法的优越性。

Title: Significance of Chain of Thought in Gender Bias Mitigation for English-Dravidian Machine Translation

Authors: Lavanya Prahallad, Radhika Mamidi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.19701
Pdf URL: https://arxiv.org/pdf/2405.19701
Copy Paste: [[2405.19701]] Significance of Chain of Thought in Gender Bias Mitigation for English-Dravidian Machine Translation(https://arxiv.org/abs/2405.19701)
Keywords: gpt, prompt, chat
Abstract: Gender bias in machine translation (MT) systems poses a significant challenge to achieving accurate and inclusive translations. This paper examines gender bias in machine translation systems for languages such as Telugu and Kannada from the Dravidian family, analyzing how gender inflections affect translation accuracy and neutrality using Google Translate and ChatGPT. It finds that while plural forms can reduce bias, individual-centric sentences often maintain the bias due to historical stereotypes. The study evaluates the Chain of Thought processing, noting significant bias mitigation from 80% to 4% in Telugu and from 40% to 0% in Kannada. It also compares Telugu and Kannada translations, emphasizing the need for language specific strategies to address these challenges and suggesting directions for future research to enhance fairness in both data preparation and prompts during inference.
摘要：机器翻译 (MT) 系统中的性别偏见对实现准确和包容的翻译构成了重大挑战。本文研究了机器翻译系统中的性别偏见，例如德拉威语系的泰卢固语和卡纳达语，分析了性别变化如何影响使用 Google 翻译和 ChatGPT 的翻译准确性和中立性。研究发现，虽然复数形式可以减少偏见，但以个人为中心的句子往往会因历史刻板印象而保留偏见。该研究评估了思维链处理，发现泰卢固语的偏见显著减轻，从 80% 降至 4%，卡纳达语的偏见显著减轻，从 40% 降至 0%。它还比较了泰卢固语和卡纳达语的翻译，强调需要采取针对特定语言的策略来应对这些挑战，并提出了未来研究的方向，以提高数据准备和推理过程中提示的公平性。

Title: SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

Authors: Kaixuan Huang, Xudong Guo, Mengdi Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.19715
Pdf URL: https://arxiv.org/pdf/2405.19715
Copy Paste: [[2405.19715]] SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths(https://arxiv.org/abs/2405.19715)
Keywords: language model, chat
Abstract: Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (an additional 7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively.
摘要：推测解码通过使用更小、更快的草稿模型来减少目标大型语言模型的推理延迟。其性能取决于超参数 K——候选长度，即目标模型在每一轮中要验证的候选标记数。然而，以前的方法通常使用简单的启发式方法来选择 K，这可能会导致性能不佳。我们研究了候选长度 K 的选择，并将其表述为马尔可夫决策过程。我们从理论上表明，这种马尔可夫决策过程的最优策略采用阈值策略的形式，即当被拒绝的概率超过阈值时，当前推测应该停止并进行验证。受此理论的启发，我们提出了 SpecDec++，这是推测解码的增强版本，可以动态自适应地确定候选长度。我们用经过训练的接受预测头增强了草稿模型，以预测候选标记的条件接受概率。当至少一个标记被拒绝的预测概率超过阈值时，SpecDec++ 将停止当前推测。我们实现了 SpecDec++ 并将其应用于 llama-2-chat 7B 和 70B 模型对。我们的自适应方法在 Alpaca 数据集上实现了 2.04 倍的加速（比基线推测解码额外提高了 7.2%）。在 GSM8K 和 HumanEval 数据集上，我们的方法分别实现了 2.26 倍的加速（提高了 9.4%）和 2.23 倍的加速（提高了 11.1%）。
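
The stopping rule described in the SpecDec++ abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name `candidate_length`, the default threshold, and the plain list of per-token acceptance probabilities are all assumptions made for illustration. In the paper those probabilities come from a trained acceptance prediction head on the draft model; here they are simply passed in as numbers. The sketch multiplies the conditional acceptance probabilities to get the chance that every token so far is accepted, and stops once the complement (at least one rejection) exceeds the threshold.

```python
from typing import Sequence


def candidate_length(accept_probs: Sequence[float], threshold: float = 0.7) -> int:
    """Return how many draft tokens to propose before verification.

    accept_probs[i] is the (assumed, externally predicted) conditional
    probability that draft token i is accepted, given that all earlier
    draft tokens in this round were accepted.
    """
    all_accepted = 1.0  # probability that every token so far is accepted
    for count, p in enumerate(accept_probs, start=1):
        all_accepted *= p
        # Probability that at least one token so far gets rejected.
        if 1.0 - all_accepted > threshold:
            return count
    # Threshold never exceeded: propose every available draft token.
    return len(accept_probs)


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.5, 0.9]
    print(candidate_length(probs, threshold=0.5))
    print(candidate_length(probs, threshold=0.7))
```

A lower threshold makes the draft model hand off to the target model sooner, which trades fewer wasted draft tokens for more verification rounds.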

Title: Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation

Authors: Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.19737
Pdf URL: https://arxiv.org/pdf/2405.19737
Copy Paste: [[2405.19737]] Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation(https://arxiv.org/abs/2405.19737)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion (≈4.7%) of key reasoning steps that truly impact conclusions. However, previous distillation methods typically involve supervised fine-tuning of student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key reasoning steps, instead imitating the teacher's reasoning forms and making errors or omissions on these steps. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistakE-Driven key reasonIng step distillaTion (EDIT), a novel method that helps SLMs learn key reasoning steps rather than relying on simple fine-tuning alone. Firstly, to expose these crucial steps in CoTs, we design specific prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions. Then, we apply the minimum edit distance algorithm on the dual CoTs data to locate these key steps and optimize the likelihood of these steps. Extensive experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets. Further analysis shows that EDIT can generate high-quality CoTs with more correct key reasoning steps. Notably, we also explore how different mistake patterns affect performance and find that EDIT benefits more from logical errors than from knowledge or mathematical calculation errors in dual CoTs. The code is publicly available.
摘要：随着大型语言模型 (LLM) 规模扩大并获得强大的思路链 (CoT) 推理能力，实际的资源限制促使人们努力将这些能力提炼成更紧凑的小型语言模型 (SLM)。我们发现 CoT 主要由简单的推理形式组成，其中只有一小部分 (≈4.7%) 的关键推理步骤真正影响了结论。然而，以前的提炼方法通常只涉及对教师 LLM 生成的正确 CoT 数据进行监督微调学生 SLM，导致学生难以学习关键的推理步骤，而是模仿教师的推理形式并在这些步骤上犯错或遗漏。为了解决这些问题，我们类比人类的学习，根据正确的解决方案分析错误通常会揭示导致成功或失败的关键步骤，我们提出了错误驱动关键推理步骤蒸馏 (EDIT)，这是一种新方法，可进一步帮助 SLM 学习关键推理步骤，而不仅仅是简单的微调。首先，为了揭示 CoT 中的这些关键步骤，我们设计了特定的提示来生成具有相似推理路径但不同结论的双重 CoT 数据。然后，我们在双重 CoT 数据上应用最小编辑距离算法来定位这些关键步骤并优化这些步骤的可能性。大量实验验证了 EDIT 在域内和域外基准推理数据集上的有效性。进一步分析表明，EDIT 可以生成具有更多正确关键推理步骤的高质量 CoT。值得注意的是，我们还探讨了不同的错误模式如何影响性能，并发现在双重 CoTs 中，EDIT 从逻辑错误中受益比从知识或数学计算错误中受益更多。代码已公开。
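
The idea of locating key steps with a minimum edit distance alignment, as the EDIT abstract describes, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `key_step_tokens` is hypothetical, the alignment is done over whitespace-separated words (the abstract does not state the unit the paper uses), and the example sentences are invented. The sketch computes a standard Levenshtein table between a correct and an incorrect chain of thought, then backtraces to collect the words of the correct chain that the alignment could not match.

```python
from typing import List


def key_step_tokens(correct: str, wrong: str) -> List[str]:
    """Words of `correct` that a minimum-edit alignment leaves unmatched in `wrong`."""
    a, b = correct.split(), wrong.split()
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a word of `correct`
                dist[i][j - 1] + 1,         # insert a word of `wrong`
                dist[i - 1][j - 1] + cost,  # match or substitute
            )
    # Walk back through the table and record unmatched words of `correct`.
    i, j, changed = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1          # matched word, not a key step
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            changed.append(a[i - 1])     # substituted word
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            changed.append(a[i - 1])     # word present only in `correct`
            i -= 1
        else:
            j -= 1                       # word present only in `wrong`
    return changed[::-1]


if __name__ == "__main__":
    good = "3 apples plus 4 apples is 7 apples"
    bad = "3 apples plus 4 apples is 8 apples"
    print(key_step_tokens(good, bad))
```

Because the two chains share most of their reasoning path, the unmatched words are few, which is what makes them usable as targets for extra training weight.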

Title: PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

Authors: Jiatong Li, Renjun Hu, Kunzhe Huang, Yan Zhuang, Qi Liu, Mengxiao Zhu, Xing Shi, Wei Lin
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2405.19740
Pdf URL: https://arxiv.org/pdf/2405.19740
Copy Paste: [[2405.19740]] PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations(https://arxiv.org/abs/2405.19740)
Keywords: language model, gpt, llm
Abstract: Expert-designed close-ended benchmarks serve as vital tools in assessing the knowledge capacity of large language models (LLMs). Despite their widespread use, concerns have mounted regarding their reliability due to limited test scenarios and an unavoidable risk of data contamination. To rectify this, we present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge capacity through knowledge-invariant perturbations. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, meticulously retaining knowledge-critical content while altering irrelevant details. Our toolkit further includes a suite of transition analyses that compare performance on raw vs. perturbed test sets to precisely assess LLMs' genuine knowledge capacity. Six state-of-the-art LLMs are re-evaluated using PertEval. Results reveal significantly inflated performance of the LLMs on raw benchmarks, including an absolute 21% overestimation for GPT-4. Additionally, through a nuanced response pattern analysis, we discover that PertEval retains LLMs' uncertainty to specious knowledge, potentially being resolved through rote memorization and leading to inflated performance. We also find that the detailed transition analyses by PertEval could illuminate weaknesses in existing LLMs' knowledge mastery and guide the development of refinement. Given these insights, we posit that PertEval can act as an essential tool that, when applied alongside any close-ended benchmark, unveils the true knowledge capacity of LLMs, marking a significant step toward more trustworthy LLM evaluation.
摘要：专家设计的封闭式基准是评估大型语言模型 (LLM) 知识容量的重要工具。尽管它们被广泛使用，但由于测试场景有限和不可避免的数据污染风险，人们对其可靠性的担忧日益增加。为了纠正这个问题，我们提出了 PertEval，这是一个工具包，旨在通过知识不变的扰动深入探测 LLM 的知识容量。这些扰动采用类似人类的重述技术从静态基准中生成即时测试样本，一丝不苟地保留知识关键内容，同时改变不相关的细节。我们的工具包还包括一套过渡分析，用于比较原始测试集和扰动测试集上的性能，以准确评估 LLM 的真正知识容量。使用 PertEval 重新评估了六个最先进的 LLM。结果显示，LLM 在原始基准上的表现明显被夸大，其中对 GPT-4 的绝对高估了 21%。此外，通过细致入微的反应模式分析，我们发现 PertEval 保留了 LLM 对似是而非的知识的不确定性，这种不确定性可能通过死记硬背来解决，并导致成绩膨胀。我们还发现，PertEval 的详细过渡分析可以揭示现有 LLM 知识掌握的弱点，并指导改进。鉴于这些见解，我们认为 PertEval 可以作为一种必不可少的工具，当与任何封闭式基准一起使用时，它可以揭示 LLM 的真正知识能力，标志着朝着更值得信赖的 LLM 评估迈出了重要一步。

Title: X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions

Authors: Chong Li, Wen Yang, Jiajun Zhang, Jinliang Lu, Shaonan Wang, Chengqing Zong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.19744
Pdf URL: https://arxiv.org/pdf/2405.19744
Copy Paste: [[2405.19744]] X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions(https://arxiv.org/abs/2405.19744)
Keywords: language model, gpt, chat
Abstract: Large language models respond well in high-resource languages like English but struggle in low-resource languages. This may arise from the lack of high-quality instruction-following data in these languages. Directly translating English samples into these languages is a possible solution but an unreliable one, leading to responses with translation errors and lacking language-specific or cultural knowledge. To address this issue, we propose a novel method to construct cross-lingual instruction-following samples with instructions in English and responses in low-resource languages. Specifically, the language model first learns to generate appropriate English instructions according to the natural web texts in other languages as responses. The candidate cross-lingual instruction tuning samples are further refined and diversified. We have employed this method to build a large-scale cross-lingual instruction tuning dataset on 10 languages, namely X-Instruction. The instruction data built using our method incorporate more language-specific knowledge compared with the naive translation method. Experimental results have shown that the response quality of the model tuned on X-Instruction greatly exceeds that of the model distilled from a powerful teacher model, reaching or even surpassing that of ChatGPT. In addition, we find that models tuned on cross-lingual instruction-following samples can follow the instruction in the output language without further tuning.
摘要：大型语言模型在英语等资源丰富的语言中反应良好，但在资源匮乏的语言中却举步维艰。这可能是由于这些语言缺乏高质量的教学数据。直接将英语样本翻译成这些语言可能是一种解决方案，但不可靠，导致响应出现翻译错误，并且缺乏语言特定或文化知识。为了解决这个问题，我们提出了一种新方法，以英语教学和资源匮乏的语言响应样本为基础，构建跨语言教学。具体来说，语言模型首先学习根据其他语言的自然网络文本作为响应生成合适的英语指令。候选的跨语言指令调优样本进一步细化和多样化。我们利用这种方法在 10 种语言上构建了一个大规模的跨语言指令调优数据集，即 X-Instruction。与朴素翻译方法相比，使用我们的方法构建的指令数据包含更多语言特定知识。实验结果表明，在 X-Instruction 上调优的模型的响应质量大大超过从强大的教师模型中提炼出来的模型，达到甚至超过了 ChatGPT 的响应质量。此外，我们发现，针对跨语言指令调整的模型可以按照样本执行输出语言中的指令，而无需进一步调整。

Title: Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding

Authors: Kuo Liao, Shuang Li, Meng Zhao, Liqun Liu, Mengge Xue, Zhenyu Hu, Honglin Han, Chengguo Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.19763
Pdf URL: https://arxiv.org/pdf/2405.19763
Copy Paste: [[2405.19763]] Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding(https://arxiv.org/abs/2405.19763)
Keywords: language model, llm
Abstract: Recent strides in large language models (LLMs) have yielded remarkable performance, leveraging reinforcement learning from human feedback (RLHF) to significantly enhance generation and alignment capabilities. However, RLHF encounters numerous challenges, including the objective mismatch issue, leading to suboptimal performance in Natural Language Understanding (NLU) tasks. To address this limitation, we propose a novel Reinforcement Learning framework enhanced with Label-sensitive Reward (RLLR) to amplify the performance of LLMs in NLU tasks. By incorporating label-sensitive pairs into reinforcement learning, our method aims to adeptly capture nuanced label-sensitive semantic features during RL, thereby enhancing natural language understanding. Experiments conducted on five diverse foundation models across eight tasks showcase promising results. In comparison to Supervised Fine-tuning models (SFT), RLLR demonstrates an average performance improvement of 1.54%. Compared with RLHF models, the improvement averages at 0.69%. These results reveal the effectiveness of our method for LLMs in NLU tasks. Code and data available at: this https URL.
摘要：大型语言模型 (LLM) 的最新进展取得了显著的性能，利用强化学习从人类反馈 (RLHF) 显著增强了生成和对齐能力。然而，RLHF 遇到了许多挑战，包括目标不匹配问题，导致其在自然语言理解 (NLU) 任务中的表现不佳。为了解决这一限制，我们提出了一种新的强化学习框架，该框架通过标签敏感奖励 (RLLR) 进行了增强，以放大 LLM 在 NLU 任务中的性能。通过将标签敏感对纳入强化学习，我们的方法旨在在 RL 过程中熟练地捕捉细微的标签敏感语义特征，从而增强自然语言理解。在八个任务中对五种不同的基础模型进行的实验展示了令人鼓舞的结果。与监督微调模型 (SFT) 相比，RLLR 的平均性能提高了 1.54%。与 RLHF 模型相比，平均改进率为 0.69%。这些结果揭示了我们的方法对 NLU 任务中的 LLM 的有效性。代码和数据可从此 https URL 获取。

Title: Enhancing Consistency and Role-Specific Knowledge Capturing by Rebuilding Fictional Character's Persona

Authors: Jeiyoon Park, Chanjun Park, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.19778
Pdf URL: https://arxiv.org/pdf/2405.19778
Copy Paste: [[2405.19778]] Enhancing Consistency and Role-Specific Knowledge Capturing by Rebuilding Fictional Character's Persona(https://arxiv.org/abs/2405.19778)
Keywords: language model, gpt, agent
Abstract: With the recent introduction of the Assistants API, it is expected that document-based language models will be actively used in various domains, especially role-playing. However, a key challenge lies in utilizing the protagonist's persona: the Assistants API often fails to achieve this with its search because the information it extracts differs each time, and it often omits important information such as the protagonist's backstory or relationships. It is hard to maintain a consistent persona simply by using the persona document as input to the Assistants API. To address the challenge of achieving stable persona consistency, we propose CharacterGPT, a novel persona reconstruction framework to alleviate the shortcomings of the Assistants API. Our method involves Character Persona Training (CPT), an effective persona rebuilding process that updates the character persona by extracting the character's traits from a given summary of the novel for each character, as if the story in the novel were progressing. In our experiments, we ask each character to take the Big Five Inventory personality test in various settings and analyze the results. To assess whether it can think outside the box, we let each character generate short novels. Extensive experiments and human evaluation demonstrate that CharacterGPT presents new possibilities for role-playing agent research.
摘要：随着 Assistants API 的近期推出，预计基于文档的语言模型将在各个领域得到积极应用，尤其是角色扮演。然而，一个关键挑战在于利用主角的人物角色：Assistant API 的搜索通常无法实现，因为每次的信息提取部分都不同，而且它经常会忽略主角的背景故事或人际关系等重要信息。仅使用人物角色文档作为 Assistants API 的输入很难保持一致的人物角色。为了应对实现稳定人物角色一致性的挑战，我们提出了 CharacterGPT，这是一种新颖的人物角色重建框架，可以缓解 Assistants API 的缺点。我们的方法涉及人物角色训练 (CPT)，这是一种有效的人物角色重建过程，它通过从小说中每个角色的给定摘要中提取人物特征来更新人物角色，就像小说中的故事进展一样。在我们的实验中，我们要求每个角色在各种设置下参加 Big Five Inventory 人格测试并分析结果。为了评估它是否能够跳出框框思考，我们让每个角色创作短篇小说。大量实验和人工评估表明，CharacterGPT 为角色扮演代理研究提供了新的可能性。

Title: From Symbolic Tasks to Code Generation: Diversification Yields Better Task Performers

Authors: Dylan Zhang, Justin Wang, Francois Charton
Subjects: cs.CL, cs.AI, cs.LG, cs.LO, cs.PL
Abstract URL: https://arxiv.org/abs/2405.19787
Pdf URL: https://arxiv.org/pdf/2405.19787
Copy Paste: [[2405.19787]] From Symbolic Tasks to Code Generation: Diversification Yields Better Task Performers(https://arxiv.org/abs/2405.19787)
Keywords: language model
Abstract: Instruction tuning -- tuning large language models on instruction-output pairs -- is a promising technique for making models better adapted to the real world. Yet, the key factors driving the model's capability to understand and follow instructions not seen during training remain under-explored. Our investigation begins with a series of synthetic experiments within the theoretical framework of a Turing-complete algorithm called Markov algorithm, which allows fine-grained control over the instruction-tuning data. Generalization and robustness with respect to the training distribution emerge once a diverse enough set of tasks is provided, even though very few examples are provided for each task. We extend these initial results to a real-world application scenario of code generation and find that a more diverse instruction set, extending beyond code-related tasks, improves the performance of code generation. Our observations suggest that a more diverse semantic space for instruction-tuning sets greatly improves the model's ability to follow instructions and perform tasks.
摘要：指令调优（根据指令输出对调优大型语言模型）是一种很有前途的技术，可以使模型更好地适应现实世界。然而，驱动模型理解和遵循训练期间未见的指令的能力的关键因素仍未得到充分探索。我们的研究始于图灵完备算法（称为马尔可夫算法）的理论框架内的一系列综合实验，该算法允许对指令调优数据进行细粒度控制。一旦提供了足够多样化的任务集，即使每个任务提供的示例很少，也会出现相对于训练分布的泛化和鲁棒性。我们将这些初步结果扩展到代码生成的实际应用场景，并发现更多样化的指令集（超出代码相关任务的范围）可以提高代码生成的性能。我们的观察表明，指令调优集的语义空间更加多样化，极大地提高了模型遵循指令和执行任务的能力。
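
For readers unfamiliar with the formalism named in the abstract, the following is a minimal sketch of a Markov algorithm interpreter: an ordered list of string-rewriting rules, where the first applicable rule rewrites the leftmost match, and execution halts on a terminating rule or when no rule applies. The names `run_markov`, `UNARY_ADD`, and `BIN_TO_UNARY`, the rule encoding, and the step limit are illustrative assumptions; the abstract does not describe the authors' actual tasks or data format.

```python
from typing import List, Tuple

# Each rule is (pattern, replacement, is_terminating). Hypothetical encoding.
Rule = Tuple[str, str, bool]


def run_markov(rules: List[Rule], text: str, max_steps: int = 10_000) -> str:
    """Apply the first applicable rule to its leftmost match until halting."""
    for _ in range(max_steps):
        for pattern, replacement, terminating in rules:
            if pattern in text:
                # Rewrite only the leftmost occurrence of the pattern.
                text = text.replace(pattern, replacement, 1)
                if terminating:
                    return text
                break  # restart scanning from the first rule
        else:
            return text  # no rule applies: the algorithm halts
    raise RuntimeError("step limit reached without halting")


# Toy task 1: unary addition, e.g. "||+|||" becomes "|||||".
UNARY_ADD: List[Rule] = [("+", "", False)]

# Toy task 2: binary-to-unary conversion, e.g. "101" becomes "|||||".
BIN_TO_UNARY: List[Rule] = [
    ("|0", "0||", False),
    ("1", "0|", False),
    ("0", "", False),
]

if __name__ == "__main__":
    print(run_markov(UNARY_ADD, "||+|||"))
    print(run_markov(BIN_TO_UNARY, "101"))
```

Because a task is fully specified by its rule list, varying the rules is one plausible way to vary the "instruction" while holding the number of examples per task fixed.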

Title: PDDLEGO: Iterative Planning in Textual Environments

Authors: Li Zhang, Peter Jansen, Tianyi Zhang, Peter Clark, Chris Callison-Burch, Niket Tandon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.19793
Pdf URL: https://arxiv.org/pdf/2405.19793
Copy Paste: [[2405.19793]] PDDLEGO: Iterative Planning in Textual Environments(https://arxiv.org/abs/2405.19793)
Keywords: llm
Abstract: Planning in textual environments has been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially insufficient information to plan for the end-goal. We propose PDDLEGO, which iteratively constructs a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43% more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98%) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4%).
摘要：事实证明，即使对于当前模型而言，在文本环境中进行规划也是一项长期挑战。最近一项前景光明的研究使用 LLM 来生成环境的正式表示，该表示可以由符号规划器解决。但是，现有方法依赖于完全观察到的环境，其中所有实体状态最初都是已知的，因此可以构建一次性表示，从而得出完整的计划。相比之下，我们处理部分观察到的环境，其中最初没有足够的信息来规划最终目标。我们提出了 PDDLEGO，它迭代地构建规划表示，可以得出给定子目标的部分计划。通过实现子目标，可以获得更多信息来增强表示，最终实现最终目标。我们表明，在 Coin Collector 模拟中，由少样本 PDDLEGO 生成的计划比端到端生成计划的效率高 43%，在更复杂的 Cooking World 模拟中具有强大的性能（98%），其中端到端 LLM 无法生成连贯的计划（4%）。

Title: SLM as Guardian: Pioneering AI Safety with Small Language Models

Authors: Ohjoon Kwon, Donghyeon Jeon, Nayoung Choi, Gyu-Hwung Cho, Changbong Kim, Hyunwoo Lee, Inho Kang, Sun Kim, Taiwoo Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.19795
Pdf URL: https://arxiv.org/pdf/2405.19795
Copy Paste: [[2405.19795]] SLM as Guardian: Pioneering AI Safety with Small Language Models(https://arxiv.org/abs/2405.19795)
Keywords: language model, llm
Abstract: Most prior safety research on large language models (LLMs) has focused on enhancing the alignment of LLMs to better suit the safety requirements of humans. However, internalizing such safeguard features into larger models brought challenges of higher training cost and unintended degradation of helpfulness. To overcome such challenges, a modular approach employing a smaller LLM to detect harmful user queries is regarded as a convenient solution in designing LLM-based systems with safety requirements. In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and the taxonomy of harmfulness categories, and then propose a multi-task learning mechanism fusing the two tasks into a single model. We demonstrate the effectiveness of our approach, providing harmful query detection and safeguard response performance on par with or surpassing that of publicly available LLMs.
摘要：先前大多数大型语言模型 (LLM) 的安全性研究都集中于增强 LLM 的一致性，以更好地满足人类的安全要求。然而，将这些保障措施特征内化到更大的模型中带来了更高的训练成本和意外降低有用性的挑战。为了克服这些挑战，采用较小的 LLM 来检测有害用户查询的模块化方法被认为是设计具有安全要求的基于 LLM 的系统的一种便捷解决方案。在本文中，我们利用较小的 LLM 进行有害查询检测和保障措施响应生成。我们介绍了我们的安全要求和有害类别的分类，然后提出了一种将这两个任务融合到单个模型中的多任务学习机制。我们证明了我们方法的有效性，与公开的 LLM 相比，它提供了相当或超越的有害查询检测和保障措施响应性能。

Title: Unsupervised Mutual Learning of Dialogue Discourse Parsing and Topic Segmentation

Authors: Jiahui Xu, Feng Jiang, Anningzhe Gao, Haizhou Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.19799
Pdf URL: https://arxiv.org/pdf/2405.19799
Copy Paste: [[2405.19799]] Unsupervised Mutual Learning of Dialogue Discourse Parsing and Topic Segmentation(https://arxiv.org/abs/2405.19799)
Keywords: language model, gpt, llm, chat
Abstract: The advancement of large language models (LLMs) has propelled the development of dialogue systems. Unlike the popular ChatGPT-like assistant model, which only satisfies the user's preferences, task-oriented dialogue systems have also faced new requirements and challenges in the broader business field. They are expected to provide correct responses at each dialogue turn while achieving the overall goal defined by the task. By understanding rhetorical structures and topic structures via topic segmentation and discourse parsing, a dialogue system may plan better to achieve both objectives. However, while both structures belong to discourse structure in linguistics, rhetorical structure and topic structure are mostly modeled separately or with one assisting the other in prior work. The interaction between these two structures has not been considered for joint modeling and mutual learning. Furthermore, unsupervised learning techniques to achieve the above are not well explored. To fill this gap, we propose an unsupervised mutual learning framework of two structures leveraging the global and local connections between them. We extend the topic modeling between non-adjacent discourse units to ensure global structural relevance with rhetorical structures. We also incorporate rhetorical structures into the topic structure through a graph neural network model to ensure local coherence consistency. Finally, we utilize the similarity between the two fused structures for mutual learning. The experimental results demonstrate that our methods outperform all strong baselines on two dialogue rhetorical datasets (STAC and Molweni), as well as dialogue topic datasets (Doc2Dial and TIAGE).
摘要：大型语言模型 (LLM) 的进步推动了对话系统的发展。与流行的 ChatGPT 类助手模型不同，它只能满足用户的偏好，而面向任务的对话系统在更广泛的业务领域也面临着新的要求和挑战。它们需要在每个对话回合提供正确的响应，同时实现任务定义的总体目标。通过主题分割和篇章解析来理解修辞结构和主题结构，对话系统可以更好地规划以实现这两个目标。然而，虽然这两种结构在语言学中都属于篇章结构，但在先前的工作中，修辞结构和主题结构大多是单独建模或相互辅助的。这两种结构之间的相互作用尚未被考虑用于联合建模和相互学习。此外，实现上述目标的无监督学习技术尚未得到很好的探索。为了填补这一空白，我们提出了一个无监督的双向学习框架，利用两种结构之间的全局和局部联系。我们扩展了非相邻话语单元之间的主题建模，以确保修辞结构的全局结构相关性。我们还通过图神经网络模型将修辞结构纳入主题结构，以确保局部连贯性。最后，我们利用两个融合结构之间的相似性进行相互学习。实验结果表明，我们的方法在两个对话修辞数据集（STAC 和 Molweni）以及对话主题数据集（Doc2Dial 和 TIAGE）上的表现优于所有强基线。

Title: Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs Distillation

Authors: Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.19842
Pdf URL: https://arxiv.org/pdf/2405.19842
Copy Paste: [[2405.19842]] Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs Distillation(https://arxiv.org/abs/2405.19842)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) exhibit enhanced reasoning at larger scales, driving efforts to distill these capabilities into smaller models via teacher-student learning. Previous works simply fine-tune student models on teachers' generated Chain-of-Thoughts (CoTs) data. Although these methods enhance in-domain (IND) reasoning performance, they struggle to generalize to out-of-domain (OOD) tasks. We believe that the widespread spurious correlations between questions and answers may lead the model to preset a specific answer which restricts the diversity and generalizability of its reasoning process. In this paper, we propose Cascading Decomposed CoTs Distillation (CasCoD) to address these issues by decomposing the traditional single-step learning process into two cascaded learning steps. Specifically, by restructuring the training objectives -- removing the answer from outputs and concatenating the question with the rationale as input -- CasCoD's two-step learning process ensures that students focus on learning rationales without interference from the preset answers, thus improving reasoning generalizability. Extensive experiments demonstrate the effectiveness of CasCoD on both IND and OOD benchmark reasoning datasets. Code can be found at this https URL.
摘要：大型语言模型 (LLM) 在更大规模上表现出增强的推理能力，推动了通过师生学习将这些能力提炼到更小的模型中的努力。以前的研究只是根据教师生成的思维链 (CoT) 数据对学生模型进行微调。虽然这些方法提高了领域内 (IND) 推理性能，但它们很难推广到领域外 (OOD) 任务。我们认为，问题和答案之间普遍存在的虚假相关性可能会导致模型预设一个特定的答案，从而限制其推理过程的多样性和普遍性。在本文中，我们提出了级联分解 CoT 提炼 (CasCoD) 来解决这些问题，它将传统的单步学习过程分解为两个级联学习步骤。具体来说，通过重组训练目标——从输出中删除答案并将问题与基本原理连接起来作为输入——CasCoD 的两步学习过程确保学生专注于学习基本原理而不受预设答案的干扰，从而提高推理的普遍性。大量实验证明了 CasCoD 在 IND 和 OOD 基准推理数据集上的有效性。代码可在此 https URL 中找到。
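
The restructuring of training objectives described in the abstract can be illustrated with a small data-preparation sketch: one teacher CoT sample is split into a first pair whose target is the rationale alone, and a second pair whose input is the question concatenated with the rationale and whose target is the answer. The `CoTExample` type, the `build_cascaded_pairs` helper, and the newline separator are hypothetical choices made for illustration; the actual prompt templates and loss weighting are not given in the abstract.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CoTExample:
    """One teacher-generated sample (hypothetical container)."""
    question: str
    rationale: str
    answer: str


def build_cascaded_pairs(
    example: CoTExample,
) -> Tuple[Tuple[str, str], Tuple[str, str]]:
    """Return two (input, target) pairs for the two cascaded learning steps."""
    # Step 1: the answer is removed from the output, so the student is
    # trained to produce only the rationale.
    rationale_pair = (example.question, example.rationale)
    # Step 2: the question is concatenated with the rationale as input,
    # and the answer becomes the target.
    answer_pair = (example.question + "\n" + example.rationale, example.answer)
    return rationale_pair, answer_pair


if __name__ == "__main__":
    sample = CoTExample(
        question="What is 3 + 4 * 2?",
        rationale="Multiply first: 4 * 2 = 8. Then add: 3 + 8 = 11.",
        answer="11",
    )
    for source, target in build_cascaded_pairs(sample):
        print(repr(source), "->", repr(target))
```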

Title: Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model

Authors: Chaochen Gao, Xing Wu, Qi Fu, Songlin Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.19846
Pdf URL: https://arxiv.org/pdf/2405.19846
Copy Paste: [[2405.19846]] Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model(https://arxiv.org/abs/2405.19846)
Keywords: language model
Abstract: Large language models, initially pre-trained with a limited context length, can better handle longer texts by continuing training on a corpus with extended contexts. However, obtaining effective long-context data is challenging due to the scarcity and uneven distribution of long documents across different domains. To address this issue, we propose a Query-centric data synthesis method, abbreviated as Quest. Quest is an interpretable method based on the observation that documents retrieved by similar queries are relevant but low-redundant, thus well-suited for synthesizing long-context data. The method is also scalable and capable of constructing large amounts of long-context data. Using Quest, we synthesize a long-context dataset up to 128k context length, significantly outperforming other data synthesis methods on multiple long-context benchmark datasets. In addition, we further verify that the Quest method is predictable through scaling law experiments, making it a reliable solution for advancing long-context models.
摘要：大型语言模型最初以有限的上下文长度进行预训练，通过在具有扩展上下文的语料库上继续训练，可以更好地处理较长的文本。然而，由于长文档在不同领域的稀缺性和分布不均，获取有效的长上下文数据具有挑战性。为了解决这个问题，我们提出了一种以查询为中心的数据合成方法，简称 Quest。Quest 是一种可解释的方法，它基于这样的观察：通过相似的查询检索到的文档是相关的但冗余度低，因此非常适合合成长上下文数据。该方法还具有可扩展性，能够构建大量的长上下文数据。使用 Quest，我们合成了一个长达 128k 上下文长度的长上下文数据集，在多个长上下文基准数据集上的表现显著优于其他数据合成方法。此外，我们进一步通过缩放律实验验证了 Quest 方法是可预测的，这使其成为推进长上下文模型的可靠解决方案。

Title: DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories

Authors: Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2405.19856
Pdf URL: https://arxiv.org/pdf/2405.19856
Copy Paste: [[2405.19856]] DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories(https://arxiv.org/abs/2405.19856)
Keywords: language model, gpt, llm, prompt
Abstract: How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. To address the knowledge gap, we propose a new benchmark named DevEval, which has three advances. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose repository-level code generation and evaluate 8 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs' coding abilities in real-world code repositories. For example, in our experiments, the highest Pass@1 of gpt-4-turbo is only 53.04%. We also analyze LLMs' failed cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, prompts, and LLMs' predictions have been released.
摘要：如何评估大型语言模型 (LLM) 的编码能力仍是一个悬而未决的问题。我们发现现有的基准测试与真实世界的代码库不太一致，不足以评估 LLM 的编码能力。为了弥补知识差距，我们提出了一个名为 DevEval 的新基准测试，它有三个进步。（1）DevEval 在多个维度上与真实世界的代码库保持一致，例如代码分布和依赖关系分布。（2）DevEval 由 13 位开发人员注释，包含全面的注释（例如需求、原始存储库、参考代码和参考依赖项）。（3）DevEval 包含来自 117 个存储库的 1,874 个测试样本，涵盖 10 个流行领域（例如 Internet、数据库）。基于 DevEval，我们提出了存储库级代码生成，并在 DevEval 上评估了 8 个流行的 LLM（例如，gpt-4、gpt-3.5、StarCoder 2、DeepSeek Coder、CodeLLaMa）。我们的实验揭示了这些 LLM 在真实代码库中的编码能力。例如，在我们的实验中，gpt-4-turbo 的最高 Pass@1 仅为 53.04%。我们还分析了 LLM 的失败案例并总结了它们的缺点。我们希望 DevEval 能够促进真实代码库中 LLM 的开发。DevEval、提示和 LLM 的预测已经发布。
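
The abstract reports Pass@1 without saying how it is computed. A common choice in code-generation evaluation is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that pass the tests. The sketch below assumes that estimator; whether DevEval uses exactly this procedure is an assumption, and the function names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no results given")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Three hypothetical problems with 10 samples each.
    print(benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of sampled completions that pass.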

Title: Is In-Context Learning Sufficient for Instruction Following in LLMs?

Authors: Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.19874
Pdf URL: https://arxiv.org/pdf/2405.19874
Copy Paste: [[2405.19874]] Is In-Context Learning Sufficient for Instruction Following in LLMs?(https://arxiv.org/abs/2405.19874)
Keywords: llm
Abstract: In-context learning (ICL) allows LLMs to learn from examples without changing their weights, which is a particularly promising capability for long-context LLMs that can potentially learn from many examples. Recently, Lin et al. (2024) proposed URIAL, a method using only three in-context examples to align base LLMs, achieving non-trivial instruction following performance. In this work, we show that, while effective, ICL alignment with URIAL still underperforms compared to instruction fine-tuning on established benchmarks such as MT-Bench and AlpacaEval 2.0 (LC), especially with more capable base LMs. Unlike for tasks such as classification, translation, or summarization, adding more ICL demonstrations for long-context LLMs does not systematically improve instruction following performance. To address this limitation, we derive a greedy selection approach for ICL examples that noticeably improves performance, yet without bridging the gap to instruction fine-tuning. Finally, we provide a series of ablation studies to better understand the reasons behind the remaining gap, and we show how some aspects of ICL depart from the existing knowledge and are specific to the instruction tuning setting. Overall, our work advances the understanding of ICL as an alignment technique. We provide our code at this https URL.
摘要：上下文学习 (ICL) 允许 LLM 从示例中学习而不改变其权重，这对于可以从许多示例中学习的长上下文 LLM 来说是一项特别有前途的功能。最近，Lin 等人 (2024) 提出了 URIAL，这是一种仅使用三个上下文示例来对齐基础 LLM 的方法，可实现非平凡的指令跟踪性能。在这项工作中，我们表明，虽然有效，但使用 URIAL 的 ICL 对齐与已建立的基准（例如 MT-Bench 和 AlpacaEval 2.0 (LC)）上的指令微调相比仍然表现不佳，尤其是对于功能更强大的基础 LM。与分类、翻译或摘要等任务不同，为长上下文 LLM 添加更多 ICL 演示并不能系统地提高指令跟踪性能。为了解决这一限制，我们推导出一种针对 ICL 示例的贪婪选择方法，该方法可显着提高性能，但不会弥合与指令微调之间的差距。最后，我们提供了一系列消融研究，以更好地了解剩余差距背后的原因，并展示了 ICL 的某些方面如何偏离现有知识并特定于指令调整设置。总体而言，我们的工作推进了对 ICL 作为一种对齐技术的理解。我们在此 https URL 上提供我们的代码。

Title: Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads

Authors: Avelina Asada Hadji-Kyriacou, Ognjen Arandjelovic
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.20053
Pdf URL: https://arxiv.org/pdf/2405.20053
Copy Paste: [[2405.20053]] Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads(https://arxiv.org/abs/2405.20053)
Keywords: language model, gpt, hallucination
Abstract: Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model's reasoning capabilities and introduce artifacts such as hallucinations where the model may fabricate facts. To address this issue we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference Optimization (cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All evaluation suite and demonstrate that our method produces models which achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone.
摘要：预训练语言模型 (LM) 表现出强大的零样本和上下文学习能力；然而，它们的行为通常难以控制。通过利用人类反馈强化学习 (RLHF)，可以微调无监督 LM 以遵循指令并产生反映人类偏好的输出。尽管 RLHF 有诸多好处，但它已被证明可能会损害语言模型的推理能力，并引入诸如幻觉之类的伪像，模型可能会捏造事实。为了解决这个问题，我们引入了直接偏好头 (DPH)，这是一个微调框架，使 LM 能够通过辅助奖励头学习人类偏好信号，而不会直接影响语言建模头的输出分布。我们对目标函数进行了理论分析，发现它与保守直接偏好优化 (cDPO) 密切相关。最后，我们在 GLUE、RACE 和 GPT4All 评估套件上评估我们的模型，并证明我们的方法产生的模型比单独使用监督微调 (SFT) 或直接偏好优化 (DPO) 进行微调的模型获得更高的分数。

Title: Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning

Authors: Elena Grazia Gado, Tommaso Martorella, Luca Zunino, Paola Mejia-Domenzain, Vinitra Swamy, Jibril Frej, Tanja Käser
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2405.20079
Pdf URL: https://arxiv.org/pdf/2405.20079
Copy Paste: [[2405.20079]] Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning(https://arxiv.org/abs/2405.20079)
Keywords: language model, llm
Abstract: Intelligent Tutoring Systems (ITS) enhance personalized learning by predicting student answers to provide immediate and customized instruction. However, recent research has primarily focused on the correctness of the answer rather than the student's performance on specific answer choices, limiting insights into students' thought processes and potential misconceptions. To address this gap, we present MCQStudentBert, an answer forecasting model that leverages the capabilities of Large Language Models (LLMs) to integrate contextual understanding of students' answering history along with the text of the questions and answers. By predicting the specific answer choices students are likely to make, practitioners can easily extend the model to new answer choices or remove answer choices for the same multiple-choice question (MCQ) without retraining the model. In particular, we compare MLP, LSTM, BERT, and Mistral 7B architectures to generate embeddings from students' past interactions, which are then incorporated into a finetuned BERT's answer-forecasting mechanism. We apply our pipeline to a dataset of language learning MCQ, gathered from an ITS with over 10,000 students to explore the predictive accuracy of MCQStudentBert, which incorporates student interaction patterns, in comparison to correct answer prediction and traditional mastery-learning feature-based approaches. This work opens the door to more personalized content, modularization, and granular support.
摘要：智能辅导系统 (ITS) 通过预测学生答案来提供即时和定制化的指导，从而增强个性化学习。然而，最近的研究主要关注答案的正确性，而不是学生在特定答案选择上的表现，这限制了对学生思维过程和潜在误解的洞察。为了解决这一差距，我们提出了 MCQStudentBert，这是一个答案预测模型，它利用大型语言模型 (LLM) 的功能，将对学生回答历史的上下文理解与问题和答案的文本相结合。通过预测学生可能做出的特定答案选择，从业者可以轻松地将模型扩展到新的答案选择或删除同一多项选择题 (MCQ) 的答案选择，而无需重新训练模型。具体来说，我们比较了 MLP、LSTM、BERT 和 Mistral 7B 架构，以从学生过去的互动中生成嵌入，然后将其纳入经过微调的 BERT 答案预测机制中。我们将我们的管道应用于语言学习 MCQ 数据集，该数据集来自拥有 10,000 多名学生的 ITS，旨在探索 MCQStudentBert 的预测准确性，该数据集结合了学生的互动模式，并与正确答案预测和传统的基于掌握学习特征的方法进行了比较。这项工作为更加个性化的内容、模块化和细粒度支持打开了大门。

Title: The Fine-Tuning Paradox: Boosting Translation Quality Without Sacrificing LLM Abilities

Authors: David Stap, Eva Hasler, Bill Byrne, Christof Monz, Ke Tran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.20089
Pdf URL: https://arxiv.org/pdf/2405.20089
Copy Paste: [[2405.20089]] The Fine-Tuning Paradox: Boosting Translation Quality Without Sacrificing LLM Abilities(https://arxiv.org/abs/2405.20089)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) for machine translation has shown improvements in overall translation quality. However, it is unclear what impact fine-tuning has on desirable LLM behaviors that are not present in neural machine translation models, such as steerability, inherent document-level translation abilities, and the ability to produce less literal translations. We perform an extensive translation evaluation on the LLaMA and Falcon families of models, with model sizes ranging from 7 billion up to 65 billion parameters. Our results show that while fine-tuning improves the general translation quality of LLMs, several abilities degrade. In particular, we observe a decline in the ability to perform formality steering, to produce technical translations through few-shot examples, and to perform document-level translation. On the other hand, we observe that the model produces less literal translations after fine-tuning on parallel data. We show that by including monolingual data as part of the fine-tuning data we can maintain the abilities while simultaneously enhancing overall translation quality. Our findings emphasize the need for fine-tuning strategies that preserve the benefits of LLMs for machine translation.
摘要：对大型语言模型 (LLM) 进行机器翻译的微调已显示出整体翻译质量的提高。然而，尚不清楚微调对神经机器翻译模型中不存在的理想 LLM 行为有何影响，例如可操纵性、固有的文档级翻译能力以及产生更少直译的能力。我们对 LLaMA 和 Falcon 系列模型进行了广泛的翻译评估，模型大小从 70 亿到 650 亿个参数不等。我们的结果表明，虽然微调提高了 LLM 的总体翻译质量，但一些能力却下降了。特别是，我们观察到执行形式指导、通过少量示例生成技术翻译以及执行文档级翻译的能力下降。另一方面，我们观察到该模型在对并行数据进行微调后产生的直译更少。我们表明，通过将单语数据作为微调数据的一部分，我们可以保持这些能力，同时提高整体翻译质量。我们的研究结果强调了微调策略的必要性，以保留 LLM 对机器翻译的优势。

Title: Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation

Authors: Jingchang Chen, Hongxuan Tang, Zheng Chu, Qianglong Chen, Zekun Wang, Ming Liu, Bing Qin
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2405.20092
Pdf URL: https://arxiv.org/pdf/2405.20092
Copy Paste: [[2405.20092]] Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation(https://arxiv.org/abs/2405.20092)
Keywords: language model, gpt
Abstract: Despite recent progress made by large language models in code generation, they still struggle with programs that meet complex requirements. Recent work utilizes plan-and-solve decomposition to decrease the complexity and leverage self-tests to refine the generated program. Yet, planning deep-inside requirements in advance can be challenging, and the tests need to be accurate to accomplish self-improvement. To this end, we propose FunCoder, a code generation framework incorporating the divide-and-conquer strategy with functional consensus. Specifically, FunCoder recursively branches off sub-functions as smaller goals during code generation, represented by a tree hierarchy. These sub-functions are then composited to attain more complex objectives. Additionally, we designate functions via a consensus formed by identifying similarities in program behavior, mitigating error propagation. FunCoder outperforms state-of-the-art methods by +9.8% on average in HumanEval, MBPP, xCodeEval and MATH with GPT-3.5 and GPT-4. Moreover, our method demonstrates superiority on smaller models: With FunCoder, StableCode-3b surpasses GPT-3.5 by +18.6% and achieves 97.7% of GPT-4's performance on HumanEval. Further analysis reveals that our proposed dynamic function decomposition is capable of handling complex requirements, and the functional consensus prevails over self-testing in correctness evaluation.
摘要：尽管大型语言模型在代码生成方面取得了近期进展，但它们仍然难以编写满足复杂要求的程序。近期的研究利用规划和解决分解来降低复杂性，并利用自测来改进生成的程序。然而，提前规划深层需求可能具有挑战性，并且测试需要准确才能实现自我改进。为此，我们提出了 FunCoder，这是一个将分而治之策略与功能共识相结合的代码生成框架。具体而言，FunCoder 在代码生成过程中以递归方式将子函数分支为较小的目标，以树状层次结构表示。然后将这些子函数组合起来以实现更复杂的目标。此外，我们通过识别程序行为的相似性形成的共识来指定函数，从而减轻错误传播。在 GPT-3.5 和 GPT-4 中，FunCoder 在 HumanEval、MBPP、xCodeEval 和 MATH 中的表现平均比最先进的方法高出 9.8%。此外，我们的方法在较小的模型上表现出了优越性：借助 FunCoder，StableCode-3b 超越 GPT-3.5 +18.6%，并在 HumanEval 上达到 GPT-4 性能的 97.7%。进一步分析表明，我们提出的动态函数分解能够处理复杂的需求，并且在正确性评估方面，函数共识优于自我测试。
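
As a rough illustration of selecting a function "via a consensus formed by identifying similarities in program behavior", the sketch below runs several candidate implementations on the same inputs and keeps the one whose observed behavior agrees most with the other candidates. This is a simplified reading under stated assumptions: the agreement score, the input sampling, and the names `select_by_consensus` and `_observe` are illustrative and are not taken from the FunCoder implementation.

```python
from typing import Any, Callable, List, Sequence, Tuple


def _observe(fn: Callable[[Any], Any], x: Any) -> Tuple[str, Any]:
    """Record a candidate's behavior on one input, including failures."""
    try:
        return ("ok", fn(x))
    except Exception as exc:
        return ("error", type(exc).__name__)


def select_by_consensus(
    candidates: Sequence[Callable[[Any], Any]],
    inputs: Sequence[Any],
) -> Callable[[Any], Any]:
    """Return the candidate whose outputs agree most with the other candidates."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    behaviors: List[List[Tuple[str, Any]]] = [
        [_observe(fn, x) for x in inputs] for fn in candidates
    ]

    def agreement(i: int, j: int) -> int:
        # Number of inputs on which candidates i and j behave identically.
        return sum(a == b for a, b in zip(behaviors[i], behaviors[j]))

    scores = [
        sum(agreement(i, j) for j in range(len(candidates)) if j != i)
        for i in range(len(candidates))
    ]
    # Ties are broken in favor of the earliest candidate.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


if __name__ == "__main__":
    # Three hypothetical sampled implementations of "square a number".
    sampled = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
    chosen = select_by_consensus(sampled, inputs=[0, 1, 2, 3])
    print(chosen(5))
```

Unlike self-testing, this selection needs no expected outputs: an outlier is rejected only because it disagrees with the majority behavior.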

Title: Language Models Need Inductive Biases to Count Inductively

Authors: Yingshan Chang, Yonatan Bisk
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.20131
Pdf URL: https://arxiv.org/pdf/2405.20131
Copy Paste: [[2405.20131]] Language Models Need Inductive Biases to Count Inductively(https://arxiv.org/abs/2405.20131)
Keywords: language model
Abstract: Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of counting, investigating length generalization does occur throughout the literature. In the "train short, test long" paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or the maximum stack size induced by a pushdown automaton. In general problem solving, length refers to the number of hops in a deductive reasoning chain or the recursion depth. For all cases, counting is central to task success. And crucially, generalizing counting inductively is central to success on OOD instances. This work provides extensive empirical results on training language models to count. We experiment with architectures ranging from RNNs and Transformers to State-Space Models and RWKV. We present carefully-designed task formats, auxiliary tasks and positional embeddings to avoid limitations in generalization with OOD-position and OOD-vocabulary. We find that while traditional RNNs trivially achieve inductive counting, Transformers have to rely on positional embeddings to count out-of-domain. As counting is the basis for many arguments concerning the expressivity of Transformers, our finding calls for the community to reexamine the application scope of primitive functions defined in formal characterizations. Finally, modern RNNs also largely underperform traditional RNNs in generalizing counting inductively. We discuss how design choices that enable parallelized training of modern RNNs cause them to lose merits of a recurrent nature.
摘要：计数是泛化的一个基本例子，无论是从定义自然数的皮亚诺公理的数学视角，还是从儿童学习计数的认知科学文献的角度来看。对于这两种情况，学习计数意味着学习无限计数的论点都是成立的。虽然很少有论文试图将 Transformer“推理”提炼为最简单的计数情况，但研究长度泛化确实贯穿了整个文献。在 NLP 的“短训练，长测试”范式中，长度指的是训练句子的长度。在形式语言识别中，长度指的是输入序列长度，或由下推自动机引起的最大堆栈大小。在一般问题解决中，长度指的是演绎推理链中的跳数或递归深度。对于所有情况，计数都是任务成功的关键。而且至关重要的是，归纳地推广计数是 OOD 实例成功的关键。这项工作提供了大量关于训练语言模型进行计数的经验结果。我们尝试了从 RNN、Transformer、状态空间模型到 RWKV 的各种架构。我们提出了精心设计的任务格式、辅助任务和位置嵌入，以避免 OOD 位置和 OOD 词汇的泛化限制。我们发现，虽然传统的 RNN 可以轻松实现归纳计数，但 Transformers 必须依靠位置嵌入来进行域外计数。由于计数是许多有关 Transformers 表达能力的论点的基础，我们的发现呼吁社区重新审视形式表征中定义的原始函数的应用范围。最后，现代 RNN 在归纳计数泛化方面也远远不如传统 RNN。我们讨论了支持现代 RNN 并行训练的设计选择如何导致它们失去循环性的优点。

Title: GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning

Authors: Costas Mavromatis, George Karypis
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.20139
Pdf URL: https://arxiv.org/pdf/2405.20139
Copy Paste: [[2405.20139]] GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning(https://arxiv.org/abs/2405.20139)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural questions by grounding the reasoning to the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for KGQA as they can handle the complex graph information stored in the KG. In this work, we introduce GNN-RAG, a novel method for combining language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and answer candidates are extracted to represent KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for ultimate KGQA. Furthermore, we develop a retrieval augmentation (RA) technique to further boost KGQA performance with GNN-RAG. Experimental results show that GNN-RAG achieves state-of-the-art performance in two widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching GPT-4 performance with a 7B tuned LLM. In addition, GNN-RAG excels on multi-hop and multi-entity questions, outperforming competing approaches by 8.9 to 15.5 percentage points at answer F1.
摘要：知识图谱 (KG) 以三元组 (头、关系、尾) 的形式表示人工制作的事实知识，这些三元组共同形成一个图。基于知识图谱的问答 (KGQA) 是回答自然问题的任务，将推理建立在知识图谱提供的信息之上。大型语言模型 (LLM) 是 QA 任务的最新模型，因为它们具有出色的自然语言理解能力。另一方面，图神经网络 (GNN) 已广泛用于 KGQA，因为它们可以处理存储在知识图谱中的复杂图信息。在这项工作中，我们引入了 GNN-RAG，这是一种将 LLM 的语言理解能力与 GNN 的推理能力以检索增强生成 (RAG) 风格相结合的新方法。首先，GNN 在密集的知识图谱子图上进行推理以检索给定问题的答案候选。其次，提取知识图谱中连接问题实体和答案候选的最短路径来表示知识图谱推理路径。提取的路径被转写为文本，并作为使用 RAG 的 LLM 推理的输入。在我们的 GNN-RAG 框架中，GNN 充当密集子图推理器来提取有用的图信息，而 LLM 利用其自然语言处理能力实现最终的 KGQA。此外，我们开发了一种检索增强 (RA) 技术，以进一步提高 GNN-RAG 的 KGQA 性能。实验结果表明，GNN-RAG 在两个广泛使用的 KGQA 基准 (WebQSP 和 CWQ) 中实现了最佳性能，仅使用 7B 微调 LLM 即可超越或匹敌 GPT-4 的性能。此外，GNN-RAG 在多跳和多实体问题上表现出色，在答案 F1 上比竞争方法高出 8.9 至 15.5 个百分点。
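
The path-extraction step described in this abstract (shortest KG paths from question entities to answer candidates, then verbalized for the LLM) can be sketched as below. This is only an illustration under assumptions: the function names, the directed head-to-tail traversal, the arrow-based verbalization format, and the toy triples are not taken from the paper.

```python
from collections import deque


def shortest_path(triples, source, target):
    """Breadth-first search over (head, relation, tail) triples.

    Returns the list of triples on one shortest path, or None if unreachable.
    Edges are followed from head to tail only (an assumption of this sketch).
    """
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


def verbalize(path):
    """Turn a path into plain text that could be placed in an LLM prompt."""
    return " ; ".join(f"{h} -> {r} -> {t}" for h, r, t in path)


# Toy knowledge graph (illustrative triples only).
triples = [
    ("Jamaica", "official_language", "English"),
    ("Jamaica", "located_in", "Caribbean"),
    ("English", "language_family", "Germanic"),
]
path = shortest_path(triples, source="Jamaica", target="Germanic")
prompt_context = verbalize(path)
```

In a full pipeline the `target` values would come from the GNN's answer candidates, and `prompt_context` would be prepended to the question.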

Title: Reasoning about concepts with LLMs: Inconsistencies abound

Authors: Rosario Uceda-Sosa, Karthikeyan Natesan Ramamurthy, Maria Chang, Moninder Singh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.20163
Pdf URL: https://arxiv.org/pdf/2405.20163
Copy Paste: [[2405.20163]] Reasoning about concepts with LLMs: Inconsistencies abound(https://arxiv.org/abs/2405.20163)
Keywords: language model, llm, prompt
Abstract: The ability to summarize and organize knowledge into abstract concepts is key to learning and reasoning. Many industrial applications rely on the consistent and systematic use of concepts, especially when dealing with decision-critical knowledge. However, we demonstrate that, when methodically questioned, large language models (LLMs) often display and demonstrate significant inconsistencies in their knowledge. Computationally, the basic aspects of the conceptualization of a given domain can be represented as Is-A hierarchies in a knowledge graph (KG) or ontology, together with a few properties or axioms that enable straightforward reasoning. We show that even simple ontologies can be used to reveal conceptual inconsistencies across several LLMs. We also propose strategies that domain experts can use to evaluate and improve the coverage of key domain concepts in LLMs of various sizes. In particular, we have been able to significantly enhance the performance of LLMs of various sizes with openly available weights using simple knowledge-graph (KG) based prompting strategies.
摘要：将知识总结并组织成抽象概念的能力是学习和推理的关键。许多工业应用依赖于概念的一致和系统使用，尤其是在处理决策关键知识时。然而，我们证明，当系统地询问时，大型语言模型 (LLM) 通常会显示并证明其知识存在重大不一致。从计算上讲，给定领域概念化的基本方面可以表示为知识图谱 (KG) 或本体中的 Is-A 层次结构，以及一些支持直接推理的属性或公理。我们表明，即使是简单的本体也可用于揭示多个 LLM 中的概念不一致。我们还提出了领域专家可以用来评估和改进各种规模的 LLM 中关键领域概念覆盖范围的策略。特别是，我们已经能够使用基于简单知识图谱 (KG) 的提示策略显著提高具有公开可用权重的各种规模的 LLM 的性能。
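
As a rough illustration of how a simple Is-A hierarchy can expose inconsistent answers, the sketch below compares a model's yes/no answers against what the hierarchy entails by transitivity. The data layout, the function names, and the toy answers are assumptions for illustration, not the authors' protocol.

```python
def ancestors(is_a, concept):
    """Collect every concept reachable from `concept` via Is-A links."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def find_inconsistencies(is_a, answers):
    """Return (child, ancestor) pairs entailed by the hierarchy but denied.

    `answers` maps (child, ancestor) to the model's yes/no reply as a bool;
    pairs that were never asked are ignored.
    """
    return sorted(
        (child, ancestor)
        for child in is_a
        for ancestor in ancestors(is_a, child)
        if answers.get((child, ancestor)) is False
    )


# Toy hierarchy and hypothetical model answers.
is_a = {"sparrow": ["bird"], "bird": ["animal"]}
answers = {
    ("sparrow", "bird"): True,
    ("bird", "animal"): True,
    ("sparrow", "animal"): False,  # contradicts transitivity of Is-A
}
violations = find_inconsistencies(is_a, answers)
```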

Title: InstructionCP: A fast approach to transfer Large Language Models into target language

Authors: Kuang-Ming Chen, Hung-yi Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.20175
Pdf URL: https://arxiv.org/pdf/2405.20175
Copy Paste: [[2405.20175]] InstructionCP: A fast approach to transfer Large Language Models into target language(https://arxiv.org/abs/2405.20175)
Keywords: language model, llm
Abstract: The rapid development of large language models (LLMs) in recent years has largely focused on English, resulting in models that respond exclusively in English. To adapt these models to other languages, continual pre-training (CP) is often employed, followed by supervised fine-tuning (SFT) to maintain conversational abilities. However, CP and SFT can reduce a model's ability to filter harmful content. We propose Instruction Continual Pre-training (InsCP), which integrates instruction tags into the CP process to prevent loss of conversational proficiency while acquiring new languages. Our experiments demonstrate that InsCP retains conversational and Reinforcement Learning from Human Feedback (RLHF) abilities. Empirical evaluations on language alignment, reliability, and knowledge benchmarks confirm the efficacy of InsCP. Notably, this approach requires only 0.1 billion tokens of high-quality instruction-following data, thereby reducing resource consumption.
摘要：近年来，大型语言模型 (LLM) 的快速发展主要集中在英语上，导致模型仅以英语做出响应。为了使这些模型适应其他语言，通常采用持续预训练 (CP)，然后进行监督微调 (SFT) 以保持对话能力。然而，CP 和 SFT 会降低模型过滤有害内容的能力。我们提出了指令持续预训练 (InsCP)，将指令标签集成到 CP 过程中，以防止在学习新语言时失去对话能力。我们的实验表明，InsCP 保留了对话和从人类反馈中强化学习 (RLHF) 的能力。对语言对齐、可靠性和知识基准的实证评估证实了 InsCP 的有效性。值得注意的是，这种方法只需要 1 亿个 (0.1 billion) 高质量指令遵循数据 token，从而减少了资源消耗。

Title: Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning CodeLLMs

Authors: Zichao Hu, Junyi Jessy Li, Arjun Guha, Joydeep Biswas
Subjects: cs.CL, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2405.20179
Pdf URL: https://arxiv.org/pdf/2405.20179
Copy Paste: [[2405.20179]] Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning CodeLLMs(https://arxiv.org/abs/2405.20179)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have shown great promise at generating robot programs from natural language given domain-specific robot application programming interfaces (APIs). However, the performance gap between proprietary LLMs and smaller open-weight LLMs remains wide. This raises a question: Can we fine-tune smaller open-weight LLMs for generating domain-specific robot programs to close the performance gap with proprietary LLMs? While Self-Instruct is a promising solution by generating a diverse set of training data, it cannot verify the correctness of these programs. In contrast, a robot simulator with a well-defined world can identify execution errors but limits the diversity of programs that it can verify. In this work, we introduce Robo-Instruct, which brings the best of both worlds -- it promotes the diversity of Self-Instruct while providing the correctness of simulator-based checking. Robo-Instruct introduces RoboSim to synthesize a consistent world state on the fly by inferring properties relevant to the program being checked, and simulating actions accordingly. Furthermore, the instructions and programs generated by Self-Instruct may be subtly inconsistent -- such as the program missing a step implied by the instruction. Robo-Instruct further addresses this with InstAlign, an instruction-program alignment procedure that revises the task instruction to reflect the actual results of the generated program. Given a few seed task descriptions and the robot APIs, Robo-Instruct is capable of generating a training dataset using only a small open-weight model. This dataset can then be used to fine-tune small open-weight language models, enabling them to match or even exceed the performance of several proprietary LLMs, such as GPT-3.5-Turbo and Gemini-Pro.
摘要：大型语言模型 (LLM) 在利用特定领域的机器人应用程序编程接口 (API) 从自然语言生成机器人程序方面表现出了巨大的潜力。然而，专有 LLM 和较小的开放权重 LLM 之间的性能差距仍然很大。这就提出了一个问题：我们能否微调较小的开放权重 LLM 来生成特定领域的机器人程序，以缩小与专有 LLM 的性能差距？虽然 Self-Instruct 是一个有前途的解决方案，它可以生成一组多样化的训练数据，但它无法验证这些程序的正确性。相比之下，具有明确定义世界的机器人模拟器可以识别执行错误，但限制了它可以验证的程序的多样性。在这项工作中，我们引入了 Robo-Instruct，它兼具了两全其美的优势——它促进了 Self-Instruct 的多样性，同时提供了基于模拟器的检查的正确性。Robo-Instruct 引入了 RoboSim，通过推断与被检查程序相关的属性并相应地模拟操作来动态合成一致的世界状态。此外，Self-Instruct 生成的指令和程序可能存在细微的不一致——例如程序缺少指令所暗示的步骤。Robo-Instruct 使用 InstAlign 进一步解决了这个问题，这是一个指令程序对齐程序，可以修改任务指令以反映生成程序的实际结果。给定一些种子任务描述和机器人 API，Robo-Instruct 能够仅使用小型开放权重模型生成训练数据集。然后可以使用该数据集对小型开放权重语言模型进行微调，使它们能够达到甚至超过几个专有 LLM（例如 GPT-3.5-Turbo 和 Gemini-Pro）的性能。

Title: TAIA: Large Language Models are Out-of-Distribution Data Learners

Authors: Shuyang Jiang, Yusheng Liao, Ya Zhang, Yu Wang, Yanfeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.20192
Pdf URL: https://arxiv.org/pdf/2405.20192
Copy Paste: [[2405.20192]] TAIA: Large Language Models are Out-of-Distribution Data Learners(https://arxiv.org/abs/2405.20192)
Keywords: language model, llm
Abstract: Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or harmless content generation, it is nearly impossible to obtain a large volume of high-quality data that matches the downstream distribution. To improve the performance of LLMs in data-scarce domains with domain-mismatched data, we re-evaluated the Transformer architecture and discovered that not all parameter updates during fine-tuning contribute positively to downstream performance. Our analysis reveals that within the self-attention and feed-forward networks, only the fine-tuned attention parameters are particularly beneficial when the training set's distribution does not fully align with the test set. Based on this insight, we propose an effective inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA). We empirically validate TAIA using two general instruction-tuning datasets and evaluate it on seven downstream tasks involving math, reasoning, and knowledge understanding across LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive experiments demonstrate that TAIA achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios, with significant performance gains. The high tolerance of TAIA to data mismatches makes it resistant to jailbreaking tuning and enhances specialized tasks using general data.
摘要：对特定任务的问答对进行微调是增强指令调整的大型语言模型 (LLM) 在下游任务上性能的主要方法。然而，在某些专业领域，例如医疗保健或无害内容生成，几乎不可能获得与下游分布相匹配的大量高质量数据。为了提高数据稀缺领域中数据不匹配的 LLM 的性能，我们重新评估了 Transformer 架构，发现微调期间并非所有参数更新都会对下游性能产生积极影响。我们的分析表明，在自注意力和前馈网络中，当训练集的分布与测试集不完全一致时，只有微调的注意力参数特别有益。基于这一见解，我们提出了一种有效的推理时间干预方法：训练所有参数，但仅使用注意力进行推理 (Training All parameters but Inferring with only Attention, TAIA)。我们使用两个通用指令调优数据集对 TAIA 进行了实证验证，并在七个下游任务上对其进行了评估，这些任务涉及数学、推理和知识理解，涉及不同参数大小和微调技术的 LLM。我们的全面实验表明，在大多数情况下，TAIA 都比完全微调模型和基础模型取得了显著的改进，性能得到了显著提升。TAIA 对数据不匹配的高容忍度使其能够抵抗越狱调优，并使用通用数据增强专门任务。
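
One possible reading of "training all parameters but inferring with only attention" is to keep the fine-tuned attention weights at inference time and fall back to the base weights everywhere else. The sketch below shows that merge on plain dictionaries. The function name `taia_merge`, the `"attn"` naming convention, and the toy values are assumptions; the paper's actual procedure may differ.

```python
def taia_merge(base_state, finetuned_state, attention_marker="attn"):
    """Build an inference-time parameter set from two checkpoints.

    Parameters whose name contains `attention_marker` are taken from the
    fine-tuned checkpoint; all others are taken from the base checkpoint.
    The naming convention is an assumption of this sketch.
    """
    merged = {}
    for name, base_value in base_state.items():
        if attention_marker in name:
            merged[name] = finetuned_state[name]
        else:
            merged[name] = base_value
    return merged


# Toy "checkpoints": parameter name -> value (floats stand in for tensors).
base_state = {
    "layer0.attn.q_proj": 0.10,
    "layer0.attn.k_proj": 0.20,
    "layer0.ffn.up_proj": 0.30,
}
finetuned_state = {
    "layer0.attn.q_proj": 0.15,
    "layer0.attn.k_proj": 0.25,
    "layer0.ffn.up_proj": 0.90,
}
inference_state = taia_merge(base_state, finetuned_state)
```

With real checkpoints the same loop would run over framework state dictionaries, with the marker adjusted to that model's parameter names.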

Title: TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language Models

Authors: Chen Zhang, Chengguang Tang, Dading Chong, Ke Shi, Guohua Tang, Feng Jiang, Haizhou Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.20215
Pdf URL: https://arxiv.org/pdf/2405.20215
Copy Paste: [[2405.20215]] TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language Models(https://arxiv.org/abs/2405.20215)
Keywords: language model, llm
Abstract: Mainstream approaches to aligning large language models (LLMs) heavily rely on human preference data, particularly when models require periodic updates. The standard process for iterative alignment of LLMs involves collecting new human feedback for each update. However, the data collection process is costly and challenging to scale. To address this issue, we introduce the "TS-Align" framework, which fine-tunes a policy model using pairwise feedback data automatically mined from its outputs. This automatic mining process is efficiently accomplished through the collaboration between a large-scale teacher model and a small-scale student model. The policy fine-tuning process can be iteratively repeated using on-policy generations within our proposed teacher-student collaborative framework. Through extensive experiments, we demonstrate that our final aligned policy outperforms the base policy model with an average win rate of 69.7% across seven conversational or instruction-following datasets. Furthermore, we show that the ranking capability of the teacher is effectively distilled into the student through our pipeline, resulting in a small-scale yet effective reward model for policy model alignment.
摘要：对齐大型语言模型 (LLM) 的主流方法严重依赖人类偏好数据，尤其是当模型需要定期更新时。LLM 迭代对齐的标准过程包括每次更新时收集新的人类反馈。然而，数据收集过程成本高昂且难以扩展。为了解决这个问题，我们引入了“TS-Align”框架，该框架使用从其输出中自动挖掘的成对反馈数据来微调策略模型。这个自动挖掘过程是通过大型教师模型和小型学生模型之间的协作高效完成的。策略微调过程可以在我们提出的师生协作框架内使用策略生成迭代重复。通过大量实验，我们证明我们最终的对齐策略优于基础策略模型，在七个对话或指令遵循数据集中的平均胜率为 69.7%。此外，我们表明教师的排名能力通过我们的管道有效地提炼到学生身上，从而形成了一个小规模但有效的策略模型对齐奖励模型。

Title: Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use

Authors: Franz Louis Cesista, Rui Aguiar, Jason Kim, Paolo Acilo
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2405.20245
Pdf URL: https://arxiv.org/pdf/2405.20245
Copy Paste: [[2405.20245]] Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use(https://arxiv.org/abs/2405.20245)
Keywords: language model, llm
Abstract: Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks. The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpass current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, General Line Items Recognition Metric (GLIRM), that is more aligned with practical BDIE use cases compared to existing metrics, such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs + RASG is oftentimes superior given real-world applications and constraints of BDIE.
摘要：商业文档信息提取 (BDIE) 是将大量非结构化信息（原始文本、扫描文档等）转换为下游系统可以解析和使用的结构化格式的问题。它有两个主要任务：关键信息提取 (KIE) 和行项目识别 (LIR)。在本文中，我们认为 BDIE 最好建模为工具使用问题，其中工具是这些下游系统。然后，我们介绍了检索增强结构化生成 (RASG)，这是一种新颖的 BDIE 通用框架，它在 BDIE 基准测试中的 KIE 和 LIR 任务上均取得了最新 (SOTA) 结果。本文的贡献有三方面：(1) 我们通过消融基准测试表明，在 BDIE 基准测试中，具有 RASG 的大型语言模型 (LLM) 已经可以与没有 RASG 的当前 SOTA 大型多模态模型 (LMM) 相媲美甚至超越它们。 (2) 我们提出了一种新的行项目识别指标类别，即一般行项目识别指标 (GLIRM)，与现有指标（例如 ANLS*、DocILE 和 GriTS）相比，该指标更符合实际的 BDIE 用例。 (3) 我们提供了一种启发式算法，用于反向计算预测行项目和表格的边界框，而无需视觉编码器。最后，我们声称，虽然 LMM 有时可能会提供边际性能优势，但考虑到 BDIE 的实际应用和约束，LLM + RASG 往往更胜一筹。

Title: Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization

Authors: Yuchi Liu, Jaskirat Singh, Gaowen Liu, Ali Payani, Liang Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.20252
Pdf URL: https://arxiv.org/pdf/2405.20252
Copy Paste: [[2405.20252]] Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization(https://arxiv.org/abs/2405.20252)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) have shown great progress in responding to user questions, allowing for a multitude of diverse applications. Yet, the quality of LLM outputs heavily depends on the prompt design, where a good prompt might enable the LLM to answer a very challenging question correctly. Therefore, recent works have developed many strategies for improving the prompt, including both manual crafting and in-domain optimization. However, their efficacy in unrestricted scenarios remains questionable, as the former depends on human design for specific questions and the latter usually generalizes poorly to unseen scenarios. To address these problems, we give LLMs the freedom to design the best prompts according to themselves. Specifically, we include a hierarchy of LLMs, first constructing a prompt with precise instructions and accurate wording in a hierarchical manner, and then using this prompt to generate the final answer to the user query. We term this pipeline Hierarchical Multi-Agent Workflow, or HMAW. In contrast with prior works, HMAW imposes no human restriction and requires no training, and is completely task-agnostic while capable of adjusting to the nuances of the underlying task. Through both quantitative and qualitative experiments across multiple benchmarks, we verify that despite its simplicity, the proposed approach can create detailed and suitable prompts, further boosting the performance of current LLMs.
摘要：大型语言模型 (LLM) 在回答用户问题方面取得了巨大进步，可用于多种不同的应用。然而，LLM 输出的质量在很大程度上取决于提示设计，一个好的提示可能使 LLM 能够正确回答一个非常具有挑战性的问题。因此，最近的研究已经开发出许多改进提示的策略，包括手工制作和领域内优化。然而，它们在无限制场景中的有效性仍然值得怀疑，因为前者依赖于人类对特定问题的设计，而后者通常很难推广到看不见的场景。为了解决这些问题，我们让 LLM 自由地根据自身设计最佳提示。具体来说，我们包括一个 LLM 层次结构，首先以分层方式构建具有精确指令和准确措辞的提示，然后使用此提示生成对用户查询的最终答案。我们将此管道称为分层多智能体工作流 (HMAW)。与之前的研究相比，HMAW 不受人为限制，无需训练，完全与任务无关，同时能够适应底层任务的细微差别。通过跨多个基准的定量和定性实验，我们验证了尽管该方法很简单，但它可以创建详细且合适的提示，从而进一步提高当前 LLM 的性能。

Title: Evaluating Large Language Model Biases in Persona-Steered Generation

Authors: Andy Liu, Mona Diab, Daniel Fried
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.20253
Pdf URL: https://arxiv.org/pdf/2405.20253
Copy Paste: [[2405.20253]] Evaluating Large Language Model Biases in Persona-Steered Generation(https://arxiv.org/abs/2405.20253)
Keywords: language model, llm
Abstract: The task of persona-steered text generation requires large language models (LLMs) to generate text that reflects the distribution of views that an individual fitting a persona could have. People have multifaceted personas, but prior work on bias in LLM-generated opinions has only explored multiple-choice settings or one-dimensional personas. We define an incongruous persona as a persona with multiple traits where one trait makes its other traits less likely in human survey data, e.g. political liberals who support increased military spending. We find that LLMs are 9.7% less steerable towards incongruous personas than congruous ones, sometimes generating the stereotypical stance associated with its demographic rather than the target stance. Models that we evaluate that are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but present significantly less diverse views of personas. We also find variance in LLM steerability that cannot be predicted from multiple-choice opinion evaluation. Our results show the importance of evaluating models in open-ended text generation, as it can surface new LLM opinion biases. Moreover, such a setup can shed light on our ability to steer models toward a richer and more diverse range of viewpoints.
摘要：角色引导文本生成任务需要大型语言模型 (LLM) 来生成反映符合角色的个人可能具有的观点分布的文本。人们具有多面的角色，但之前关于 LLM 生成意见偏见的研究仅探索了多项选择设置或一维角色。我们将不协调的角色定义为具有多种特征的角色，其中一种特征使其其他特征在人类调查数据中不太可能出现，例如支持增加军费开支的政治自由主义者。我们发现 LLM 对不协调角色的可引导性比一致角色低 9.7%，有时会产生与其人口统计相关的刻板立场，而不是目标立场。我们评估的模型通过人类反馈强化学习 (RLHF) 进行了微调，可引导性更强，尤其是对与政治自由主义者和女性相关的立场，但呈现的角色观点多样性明显较低。我们还发现 LLM 可引导性存在差异，而这无法从多项选择意见评估中预测出来。我们的结果表明，在开放式文本生成中评估模型非常重要，因为它可以揭示新的 LLM 观点偏见。此外，这样的设置可以揭示我们引导模型走向更丰富、更多样化观点的能力。

Title: Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

Authors: Ruochen Zhao, Wenxuan Zhang, Yew Ken Chia, Deli Zhao, Lidong Bing
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.20267
Pdf URL: https://arxiv.org/pdf/2405.20267
Copy Paste: [[2405.20267]] Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions(https://arxiv.org/abs/2405.20267)
Keywords: llm, chat, agent
Abstract: As LLMs evolve on a daily basis, there is an urgent need for a trustworthy evaluation method that can provide robust evaluation results in a timely fashion. Currently, as static benchmarks are prone to contamination concerns, users tend to trust human voting platforms, such as Chatbot Arena. However, human annotations require extensive manual efforts. To provide an automatic, robust, and trustworthy evaluation framework, we innovatively propose the Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents. Firstly, an examiner LLM devises queries. Then, a pair of candidate LLMs engage in a multi-round peer-battle around the query, during which the LLM's true performance gaps become visible. Finally, a committee of LLM judges collectively discuss and determine the winner, which alleviates bias and promotes fairness. In our extensive experiment on the 17 newest LLMs, Auto-Arena shows the highest correlation with human preferences, providing a promising alternative to human evaluation platforms.
摘要：随着 LLM 的日新月异，迫切需要一种可靠的评估方法，能够及时提供可靠的评估结果。目前，由于静态基准容易受到污染问题的影响，用户倾向于信任人工投票平台，例如 Chatbot Arena。但是，人工注释需要大量的人工工作。为了提供一个自动、可靠且值得信赖的评估框架，我们创新地提出了 LLM 的 Auto-Arena，它使用 LLM 代理自动完成整个评估过程。首先，审查员 LLM 设计查询。然后，一对候选 LLM 围绕查询进行多轮同行竞争，在此期间 LLM 的真实表现差距变得明显。最后，LLM 评委委员会共同讨论并确定获胜者，以减轻偏见并促进公平。在我们对 17 个最新 LLM 进行的广泛实验中，Auto-Arena 与人类偏好的相关性最高，为人类评估平台提供了一个有希望的替代方案。

Title: Who Writes the Review, Human or AI?

Authors: Panagiotis C. Theocharopoulos, Spiros V. Georgakopoulos, Sotiris K. Tasoulis, Vassilis P. Plagianakos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.20285
Pdf URL: https://arxiv.org/pdf/2405.20285
Copy Paste: [[2405.20285]] Who Writes the Review, Human or AI?(https://arxiv.org/abs/2405.20285)
Keywords: language model
Abstract: With the increasing use of Artificial Intelligence in Natural Language Processing, concerns have been raised regarding the detection of AI-generated text in various domains. This study aims to investigate this issue by proposing a methodology to accurately distinguish AI-generated and human-written book reviews. Our approach utilizes transfer learning, enabling the model to identify generated text across different topics while improving its ability to detect variations in writing style and vocabulary. To evaluate the effectiveness of the proposed methodology, we developed a dataset consisting of real book reviews and AI-generated reviews using the recently proposed Vicuna open-source language model. The experimental results demonstrate that it is feasible to detect the original source of text, achieving an accuracy rate of 96.86%. Our efforts are oriented toward the exploration of the capabilities and limitations of Large Language Models in the context of text identification. Expanding our knowledge in these aspects will be valuable for effectively navigating similar models in the future and ensuring the integrity and authenticity of human-generated content.
摘要：随着人工智能在自然语言处理中的应用越来越广泛，人们对各个领域中人工智能生成的文本的检测产生了担忧。本研究旨在通过提出一种准确区分人工智能生成的书评和人类撰写的书评的方法来调查这个问题。我们的方法利用迁移学习，使模型能够识别不同主题的生成文本，同时提高其检测写作风格和词汇变化的能力。为了评估所提出方法的有效性，我们使用最近提出的 Vicuna 开源语言模型开发了一个由真实书评和人工智能生成的评论组成的数据集。实验结果表明，检测文本的原始来源是可行的，准确率达到 96.86%。我们的努力方向是探索大型语言模型在文本识别方面的功能和局限性。扩展我们在这些方面的知识将有助于将来有效地驾驭类似的模型并确保人类生成内容的完整性和真实性。

Title: Group Robust Preference Optimization in Reward-free RLHF

Authors: Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2405.20304
Pdf URL: https://arxiv.org/pdf/2405.20304
Copy Paste: [[2405.20304]] Group Robust Preference Optimization in Reward-free RLHF(https://arxiv.org/abs/2405.20304)
Keywords: language model, llm
Abstract: Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers' groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a "one-size-fits-all" approach, i.e., they indiscriminately assume and optimize a single preference model, thus not being robust to unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups' preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
摘要：将大型语言模型 (LLM) 调整为特定任务通常需要通过强化学习和人工反馈 (RLHF) 对偏好数据进行微调。虽然这些数据通常来自不同的标注者群体（例如，不同的人口统计、种族、公司团队等），但传统的 RLHF 方法采用“一刀切”的方法，即它们不加区分地假设和优化单一的偏好模型，因此对不同群体的独特特征和需求不具有鲁棒性。为了解决这一限制，我们提出了一种新颖的群体鲁棒偏好优化 (GRPO) 方法，以使 LLM 与各个群体的偏好稳健地保持一致。我们的方法建立在无奖励直接偏好优化方法的基础上，但与以前的方法不同，它寻求一种鲁棒策略，以最大化最坏情况下的群体表现。为了实现这一点，GRPO 自适应地、按顺序加权不同群体的重要性，优先考虑累积损失更差的群体。我们从理论上研究了 GRPO 的可行性，并分析了其对对数线性策略类的收敛性。通过使用基于多样化群体的全局意见数据对带有 GRPO 的 LLM 进行微调，与非稳健基线相比，我们显著提高了表现最差群体的表现，减少了群体之间的损失不平衡，并且提高了概率准确性。
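
The abstract says GRPO adaptively weights groups and prioritizes those with worse cumulative loss, without giving the update rule. The sketch below uses a standard multiplicative-weights (exponentiated) update as a stand-in; the rule, the step size, and the toy loss values are assumptions and may not match the paper's algorithm.

```python
import math


def update_group_weights(weights, group_losses, step_size=1.0):
    """Multiplicative-weights step: groups with higher loss gain weight."""
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def robust_objective(weights, group_losses):
    """Weighted loss that the policy update would minimize at this step."""
    return sum(w * loss for w, loss in zip(weights, group_losses))


# Three hypothetical groups, starting from uniform weights.
weights = [1 / 3, 1 / 3, 1 / 3]
loss_history = [[0.2, 0.9, 0.4], [0.2, 0.8, 0.5]]
for losses in loss_history:
    weights = update_group_weights(weights, losses)

objective = robust_objective(weights, loss_history[-1])
```

After both steps each weight is proportional to the exponential of that group's cumulative loss, so the second group dominates the objective. This is the sense in which the worst-performing group is prioritized.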

Title: S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs

Authors: Wei Zhong, Manasa Bharadwaj
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.20314
Pdf URL: https://arxiv.org/pdf/2405.20314
Copy Paste: [[2405.20314]] S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs(https://arxiv.org/abs/2405.20314)
Keywords: llm
Abstract: Speculative decoding (SD) has attracted a significant amount of research attention due to the substantial speedup it can achieve for LLM inference. However, despite the high speedups they offer, speculative decoding methods often achieve optimal performance on high-end devices or with a substantial GPU memory overhead. Given limited memory and the necessity of quantization, a high-performing model on a high-end GPU can slow down by up to 7 times. To this end, we propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping. When compared against recent effective open-source SD systems, our method has achieved one of the top performance-memory ratios while requiring minimal architecture changes and training data. Leveraging our memory efficiency, we created a smaller yet more effective SD model based on Phi-3. It is 1.4 to 2 times faster than the quantized EAGLE model and operates in half-precision while using less VRAM.
摘要：推测解码 (SD) 吸引了大量研究关注，因为它可以实现 LLM 推理的大幅加速。然而，尽管推测解码方法提供了很高的加速，但它们通常在高端设备或 GPU 内存开销较大的情况下才能实现最佳性能。鉴于内存有限且需要量化，高端 GPU 上的高性能模型可能会减慢多达 7 倍。为此，我们提出了 Skippy 同步推测解码 (或 S3D)，这是一种基于同时多令牌解码和中间层跳过的经济高效的自推测 SD 方法。与最近有效的开源 SD 系统相比，我们的方法实现了最高的性能内存比之一，同时只需要最少的架构更改和训练数据。利用我们的内存效率，我们基于 Phi-3 创建了一个更小但更有效的 SD 模型。它比量化的 EAGLE 模型快 1.4 到 2 倍，并且以半精度运行，同时使用更少的 VRAM。

Title: ANAH: Analytical Annotation of Hallucinations in Large Language Models

Authors: Ziwei Ji, Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.20315
Pdf URL: https://arxiv.org/pdf/2405.20315
Copy Paste: [[2405.20315]] ANAH: Analytical Annotation of Hallucinations in Large Language Models(https://arxiv.org/abs/2405.20315)
Keywords: language model, gpt, llm, hallucination
Abstract: Reducing the 'hallucination' problem of Large Language Models (LLMs) is crucial for their wide applications. A comprehensive and fine-grained measurement of the hallucination is the first key step for the governance of this issue but is under-explored in the community. Thus, we present ANAH, a bilingual dataset that offers ANalytical Annotation of Hallucinations in LLMs within Generative Question Answering. Each answer sentence in our dataset undergoes rigorous annotation, involving the retrieval of a reference fragment, the judgment of the hallucination type, and the correction of hallucinated content. ANAH consists of ~12k sentence-level annotations for ~4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline. Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that the hallucinations of LLMs progressively accumulate in the answer and use ANAH to train and evaluate hallucination annotators. We conduct extensive experiments on studying generative and discriminative annotators and show that, although current open-source LLMs have difficulties in fine-grained hallucination annotation, the generative annotator trained with ANAH can surpass all open-source LLMs and GPT-3.5, obtain performance competitive with GPT-4, and exhibit better generalization ability on unseen questions.
摘要：减少大型语言模型 (LLM) 的“幻觉”问题对于其广泛应用至关重要。对幻觉进行全面而细致的测量是治理这一问题的第一步，但社区对此的探索还不够。因此，我们提出了 ANAH，这是一个双语数据集，可在生成式问答中提供 LLM 幻觉的分析性标注 (ANalytical Annotation of Hallucinations)。我们数据集中的每个答案句子都经过严格的注释，包括检索参考片段、判断幻觉类型和纠正幻觉内容。ANAH 由人机交互管道构建，包含约 4.3k 个 LLM 响应的约 12k 个句子级注释，涵盖 700 多个主题。得益于幻觉标注的细粒度，我们可以定量地确认 LLM 的幻觉在答案中逐渐积累，并使用 ANAH 来训练和评估幻觉标注器。我们对研究生成式和判别式标注器进行了大量实验，结果表明，尽管当前的开源 LLM 在细粒度幻觉标注方面存在困难，但使用 ANAH 训练的生成式标注器可以超越所有开源 LLM 和 GPT-3.5，获得与 GPT-4 相媲美的性能，并在未见过的问题上表现出更好的泛化能力。

Title: CausalQuest: Collecting Natural Causal Questions for AI Agents

Authors: Roberto Ceraolo, Dmitrii Kharlapenko, Amélie Reymond, Rada Mihalcea, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2405.20318
Pdf URL: https://arxiv.org/pdf/2405.20318
Copy Paste: [[2405.20318]] CausalQuest: Collecting Natural Causal Questions for AI Agents(https://arxiv.org/abs/2405.20318)
Keywords: language model, llm, agent
Abstract: Humans have an innate drive to seek out causality. Whether fuelled by curiosity or specific goals, we constantly question why things happen, how they are interconnected, and many other related phenomena. To develop AI agents capable of addressing this natural human quest for causality, we urgently need a comprehensive dataset of natural causal questions. Unfortunately, existing datasets either contain only artificially-crafted questions that do not reflect real AI usage scenarios or have limited coverage of questions from specific sources. To address this gap, we present CausalQuest, a dataset of 13,500 naturally occurring questions sourced from social networks, search engines, and AI assistants. We formalize the definition of causal questions and establish a taxonomy for finer-grained classification. Through a combined effort of human annotators and large language models (LLMs), we carefully label the dataset. We find that 42% of the questions humans ask are indeed causal, with the majority seeking to understand the causes behind given effects. Using this dataset, we train efficient classifiers (up to 2.85B parameters) for the binary task of identifying causal questions, achieving high performance with F1 scores of up to 0.877. We conclude with a rich set of future research directions that can build upon our data and models.
摘要：人类天生就具有寻找因果关系的动力。无论是出于好奇心还是特定目标，我们都会不断质疑事情发生的原因、它们之间的相互联系以及许多其他相关现象。为了开发能够解决人类对因果关系的自然探索的人工智能代理，我们迫切需要一个全面的自然因果问题数据集。不幸的是，现有的数据集要么只包含人工设计的问题，无法反映真实的人工智能使用场景，要么对特定来源的问题的覆盖范围有限。为了弥补这一差距，我们推出了 CausalQuest，这是一个包含 13,500 个自然发生问题的数据集，这些问题来自社交网络、搜索引擎和人工智能助手。我们形式化了因果问题的定义，并建立了更细粒度分类的分类法。通过人类注释者和大型语言模型 (LLM) 的共同努力，我们仔细标记了数据集。我们发现，人类提出的 42% 的问题确实是因果关系，其中大多数问题都在寻求了解给定结果背后的原因。利用此数据集，我们训练了高效的分类器（多达 28.5 亿个参数），用于识别因果问题的二元任务，实现了高达 0.877 的 F1 分数的高性能。我们得出了丰富的未来研究方向，这些方向可以基于我们的数据和模型进行构建。

Title: Xwin-LM: Strong and Scalable Alignment Practice for LLMs

Authors: Bolin Ni, JingCheng Hu, Yixuan Wei, Houwen Peng, Zheng Zhang, Gaofeng Meng, Han Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.20335
Pdf URL: https://arxiv.org/pdf/2405.20335
Copy Paste: [[2405.20335]] Xwin-LM: Strong and Scalable Alignment Practice for LLMs(https://arxiv.org/abs/2405.20335)
Keywords: language model, gpt, llm, prompt
Abstract: In this work, we present Xwin-LM, a comprehensive suite of alignment methodologies for large language models (LLMs). This suite encompasses several key techniques, including supervised finetuning (SFT), reward modeling (RM), rejection sampling finetuning (RS), and direct preference optimization (DPO). The key components are as follows: (1) Xwin-LM-SFT, models initially finetuned with high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn preference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward models trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B parameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt is linked to 64 unique responses generated by Xwin-LM-SFT and scored by Xwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses from Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set using the DPO algorithm. Our evaluations on AlpacaEval and MT-bench show consistent and significant improvements across the pipeline, demonstrating the strength and scalability of Xwin-LM. The repository will be continually updated to foster community research.
摘要：在本文中，我们提出了 Xwin-LM，一套用于大型语言模型 (LLM) 的全面对齐方法。这套方法包含几种关键技术，包括监督微调 (SFT)、奖励建模 (RM)、拒绝抽样微调 (RS) 和直接偏好优化 (DPO)。关键组件包括：（1）Xwin-LM-SFT，最初使用高质量指令数据进行微调的模型；（2）Xwin-Pair，使用 GPT-4 精心注释的大规模多轮偏好数据集；（3）Xwin-RM，在 Xwin-Pair 上训练的奖励模型，以 7B、13B 和 70B 参数的规模开发；（4）Xwin-Set，一个多向偏好数据集，其中每个提示都链接到由 Xwin-LM-SFT 生成并由 Xwin-RM 评分的 64 个唯一响应；（5）Xwin-LM-RS，使用 Xwin-Set 中得分最高的响应进行微调的模型；（6）Xwin-LM-DPO，使用 DPO 算法在 Xwin-Set 上进一步优化的模型。我们在 AlpacaEval 和 MT-bench 上的评估表明，整个流程都有持续显著的改进，证明了 Xwin-LM 的实力和可扩展性。该存储库将不断更新，以促进社区研究。
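
The selection step behind rejection sampling finetuning (component 5 above) can be sketched as picking, for each prompt, the candidate response with the highest reward score. This is a minimal illustration under stated assumptions, not the authors' code: `select_best_responses` and `toy_score` are hypothetical names, and the toy scorer merely stands in for a trained reward model such as Xwin-RM.

```python
from typing import Callable, Dict, List


def select_best_responses(
    candidates: Dict[str, List[str]],
    score: Callable[[str, str], float],
) -> Dict[str, str]:
    """For each prompt, keep the single response the scorer rates highest."""
    return {
        prompt: max(responses, key=lambda r: score(prompt, r))
        for prompt, responses in candidates.items()
    }


def toy_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: counts words shared with the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(word in prompt_words for word in response.lower().split()))


# Toy candidate pool; in the paper's setting each prompt has 64 sampled responses.
pool = {
    "name a prime number": ["seven is a prime number", "bananas"],
    "say hello": ["goodbye", "hello there"],
}
finetune_targets = select_best_responses(pool, toy_score)
```

The resulting prompt-to-response mapping would then serve as supervised finetuning data; how ties or low-scoring pools are handled is left open here because the abstract does not specify it.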