2024-06-10

Title: Large Language Model Confidence Estimation via Black-Box Access

Authors: Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, Prasanna Sattigeri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04370
Pdf URL: https://arxiv.org/pdf/2406.04370
Copy Paste: [[2406.04370]] Large Language Model Confidence Estimation via Black-Box Access(https://arxiv.org/abs/2406.04370)
Keywords: language model, llm
Abstract: Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with simply black-box or query access to them. We propose a simple and extensible framework where we engineer novel features and train an (interpretable) model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating confidence of flan-ul2, llama-13b and mistral-7b, consistently outperforming existing black-box confidence estimation approaches on benchmark datasets such as TriviaQA, SQuAD, CoQA and Natural Questions, by over 10% (on AUROC) in some cases. Additionally, our interpretable approach provides insight into features that are predictive of confidence, leading to the interesting and useful discovery that our confidence models built for one LLM generalize zero-shot across others on a given dataset.
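As a reading aid for the AUROC figure quoted in this abstract, the following is a minimal sketch of how AUROC can be computed for a confidence estimator. It is not the authors' code: the function name `auroc` and the toy confidences and correctness labels are assumptions made for illustration only.

```python
def auroc(scores, labels):
    """Pairwise AUROC: the probability that a correct response (label 1)
    receives a higher confidence than an incorrect one (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical confidences for five responses and whether each was correct.
confidences = [0.9, 0.4, 0.35, 0.6, 0.5]
correct = [1, 1, 0, 1, 0]
print(auroc(confidences, correct))
```

An AUROC of 0.5 corresponds to confidences that carry no information about correctness, and 1.0 to confidences that rank every correct response above every incorrect one.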

Title: Phased Instruction Fine-Tuning for Large Language Models

Authors: Wei Pang, Chuan Zhou, Xiao-Hua Zhou, Xiaojie Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04371
Pdf URL: https://arxiv.org/pdf/2406.04371
Copy Paste: [[2406.04371]] Phased Instruction Fine-Tuning for Large Language Models(https://arxiv.org/abs/2406.04371)
Keywords: language model, gpt
Abstract: Instruction Fine-Tuning, a method enhancing pre-trained language models' capabilities from mere next-word prediction to complex instruction following, often employs a one-off training approach on a diverse instruction dataset. However, this method may not effectively enhance models' adherence to instructions due to the simultaneous handling of varying instruction complexities. To address this, we propose a novel phased instruction fine-tuning (Phased IFT) method, grounded in the hypothesis of progressive alignment, which posits that the transition of a pre-trained language model from simple next-word prediction to sophisticated instruction following is a gradual learning process. Specifically, we obtain the score of difficulty for each instruction via GPT-4, stratify the instruction data into subsets of increasing difficulty, and sequentially uptrain on these subsets using the standard supervised loss. Through extensive experiments on the pre-trained models Llama-2 7B/13B, and Mistral-7B using the 52K Alpaca instruction data, we demonstrate that Phased IFT significantly surpasses the traditional one-off instruction fine-tuning (One-off IFT) method in win rate, empirically validating the progressive alignment hypothesis. Our findings suggest that Phased IFT offers a simple yet effective pathway for elevating the instruction-following capabilities of pre-trained language models. Models and datasets from our experiments are freely available at this https URL.
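The score, stratify, and train-in-order procedure described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the equal-size split, the `difficulty` field, and the stand-in `train_fn` are all hypothetical, and the abstract does not say how the subsets are actually delimited.

```python
def stratify_by_difficulty(examples, num_phases):
    """Sort examples by difficulty score and cut them into num_phases
    contiguous subsets of (nearly) equal size, easiest first.
    Equal-size splitting is an assumption of this sketch."""
    ordered = sorted(examples, key=lambda ex: ex["difficulty"])
    size, extra = divmod(len(ordered), num_phases)
    phases, start = [], 0
    for i in range(num_phases):
        end = start + size + (1 if i < extra else 0)
        phases.append(ordered[start:end])
        start = end
    return phases


def phased_training(model_state, phases, train_fn):
    """Continue training on each subset in order of increasing difficulty."""
    for phase in phases:
        model_state = train_fn(model_state, phase)
    return model_state


# Toy data: the difficulty scores stand in for the GPT-4 ratings.
data = [
    {"id": "a", "difficulty": 3},
    {"id": "b", "difficulty": 1},
    {"id": "c", "difficulty": 5},
    {"id": "d", "difficulty": 2},
    {"id": "e", "difficulty": 4},
]
phases = stratify_by_difficulty(data, num_phases=3)

# Stand-in for a supervised fine-tuning step: record the order examples are seen.
seen = phased_training([], phases, lambda state, phase: state + [ex["id"] for ex in phase])
print(seen)
```

In a real run, `train_fn` would be a supervised fine-tuning pass that resumes from the previous phase's checkpoint.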

Title: Exploring the Latest LLMs for Leaderboard Extraction

Authors: Salomon Kabongo, Jennifer D'Souza, Sören Auer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04383
Pdf URL: https://arxiv.org/pdf/2406.04383
Copy Paste: [[2406.04383]] Exploring the Latest LLMs for Leaderboard Extraction(https://arxiv.org/abs/2406.04383)
Keywords: language model, gpt, llm
Abstract: The rapid advancements in Large Language Models (LLMs) have opened new avenues for automating complex tasks in AI research. This paper investigates the efficacy of different LLMs (Mistral 7B, Llama-2, GPT-4-Turbo, and GPT-4.o) in extracting leaderboard information from empirical AI research articles. We explore three types of contextual inputs to the models: DocTAET (Document Title, Abstract, Experimental Setup, and Tabular Information), DocREC (Results, Experiments, and Conclusions), and DocFULL (entire document). Our comprehensive study evaluates the performance of these models in generating (Task, Dataset, Metric, Score) quadruples from research papers. The findings reveal significant insights into the strengths and limitations of each model and context type, providing valuable guidance for future AI research automation efforts.

Title: MoralBench: Moral Evaluation of LLMs

Authors: Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, Yongfeng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04428
Pdf URL: https://arxiv.org/pdf/2406.04428
Copy Paste: [[2406.04428]] MoralBench: Moral Evaluation of LLMs(https://arxiv.org/abs/2406.04428)
Keywords: language model, llm
Abstract: In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for a myriad of applications, from natural language processing to decision-making support systems. However, as these models become increasingly integrated into societal frameworks, the imperative to ensure they operate within ethical and moral boundaries has never been more critical. This paper introduces a novel benchmark designed to measure and compare the moral reasoning capabilities of LLMs. We present the first comprehensive dataset specifically curated to probe the moral dimensions of LLM outputs, addressing a wide range of ethical dilemmas and scenarios reflective of real-world complexities. The main contribution of this work lies in the development of benchmark datasets and metrics for assessing the moral identity of LLMs, which accounts for nuance, contextual sensitivity, and alignment with human ethical standards. Our methodology involves a multi-faceted approach, combining quantitative analysis with qualitative insights from ethics scholars to ensure a thorough evaluation of model performance. By applying our benchmark across several leading LLMs, we uncover significant variations in moral reasoning capabilities of different models. These findings highlight the importance of considering moral reasoning in the development and evaluation of LLMs, as well as the need for ongoing research to address the biases and limitations uncovered in our study. We publicly release the benchmark at this https URL and also open-source the code of the project at this https URL.

Title: MAIRA-2: Grounded Radiology Report Generation

Authors: Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, Felix Meissen, Mercy Ranjit, Shaury Srivastav, Julia Gong, Fabian Falck, Ozan Oktay, Anja Thieme, Matthew P. Lungren, Maria Teodora Wetscherek, Javier Alvarez-Valle, Stephanie L. Hyland
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2406.04449
Pdf URL: https://arxiv.org/pdf/2406.04449
Copy Paste: [[2406.04449]] MAIRA-2: Grounded Radiology Report Generation(https://arxiv.org/abs/2406.04449)
Keywords: language model, llm, hallucination
Abstract: Radiology reporting is a complex task that requires detailed image understanding, integration of multiple inputs, including comparison with prior imaging, and precise language generation. This makes it ideal for the development and use of generative multimodal models. Here, we extend report generation to include the localisation of individual findings on the image - a task we call grounded report generation. Prior work indicates that grounding is important for clarifying image understanding and interpreting AI-generated text. Therefore, grounded reporting stands to improve the utility and transparency of automated report drafting. To enable evaluation of grounded reporting, we propose a novel evaluation framework - RadFact - leveraging the reasoning capabilities of large language models (LLMs). RadFact assesses the factuality of individual generated sentences, as well as correctness of generated spatial localisations when present. We introduce MAIRA-2, a large multimodal model combining a radiology-specific image encoder with an LLM, trained for the new task of grounded report generation on chest X-rays. MAIRA-2 uses more comprehensive inputs than explored previously: the current frontal image, the current lateral image, the prior frontal image and prior report, as well as the Indication, Technique and Comparison sections of the current report. We demonstrate that these additions significantly improve report quality and reduce hallucinations, establishing a new state of the art on findings generation (without grounding) on MIMIC-CXR while demonstrating the feasibility of grounded reporting as a novel and richer task.

Title: Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs

Authors: Shang Zhou, Feng Yao, Chengyu Dong, Zihan Wang, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04460
Pdf URL: https://arxiv.org/pdf/2406.04460
Copy Paste: [[2406.04460]] Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs(https://arxiv.org/abs/2406.04460)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Controlling the attribute intensity of text generation is crucial across scenarios (e.g., writing conciseness, chatting emotion, and explanation clarity). The remarkable capabilities of large language models (LLMs) have revolutionized text generation, prompting us to explore such smooth control of LLM generation. Specifically, we propose metrics to assess the range, calibration, and consistency of the generated text's attribute intensity in response to varying control values, as well as its relevance to the intended context. To quantify the attribute intensity and context relevance, we propose an effective evaluation framework leveraging the Elo rating system and GPT4, both renowned for their robust alignment with human judgment. We look into two viable training-free methods for achieving smooth control of LLMs: (1) Prompting with semantic shifters, and (2) Modifying internal model representations. The evaluations of these two methods are conducted on 5 different attributes with various models. Our code and dataset can be obtained from this https URL.
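For readers unfamiliar with the Elo rating system mentioned in this abstract, the standard update rule can be sketched as below. How the paper maps pairwise GPT4 judgments onto Elo updates is not stated in the abstract, so the K-factor of 32, the starting rating of 1500, and the function names are assumptions for illustration.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings after one comparison.
    score_a is 1.0 if A is judged stronger, 0.0 if B is, 0.5 for a tie.
    k=32 is a common default, assumed here."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two generated texts start at the same rating; a judge prefers text A
# (for example, as showing the attribute more intensely).
rating_a, rating_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(rating_a, rating_b)
```

Repeating such updates over many pairwise comparisons yields a scalar rating per text, which can then be compared against the requested control value.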

Title: PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning

Authors: Tianrong Zhang, Zhaohan Xi, Ting Wang, Prasenjit Mitra, Jinghui Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04478
Pdf URL: https://arxiv.org/pdf/2406.04478
Copy Paste: [[2406.04478]] PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning(https://arxiv.org/abs/2406.04478)
Keywords: language model, prompt
Abstract: Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performances. Meanwhile, the soaring cost to train PLMs as well as their amazing generalizability have jointly contributed to few-shot fine-tuning and prompting as the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only utilizes two extra sets of soft tokens which approximate the trigger and counteract it respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance. Experiments with various backdoor attacks validate the effectiveness of the proposed method, and the performances when domain shift is present further show PromptFix's applicability to models pretrained on unknown data sources, which is the common case in prompt tuning scenarios.

Title: Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs

Authors: Claire Jin, Sudha Rao, Xiangyu Peng, Portia Botchway, Jessica Quaye, Chris Brockett, Bill Dolan
Subjects: cs.CL, cs.AI, cs.HC, cs.SE
Abstract URL: https://arxiv.org/abs/2406.04482
Pdf URL: https://arxiv.org/pdf/2406.04482
Copy Paste: [[2406.04482]] Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs(https://arxiv.org/abs/2406.04482)
Keywords: language model, llm, hallucination, prompt
Abstract: Advancements in large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and non-player characters (NPCs). However, LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs. Automated techniques for detecting such game bugs are still lacking. To address this, we propose a systematic LLM-based method for automatically identifying such bugs from player game logs, eliminating the need for collecting additional data such as post-play surveys. Applied to a text-based game DejaBoom!, our approach effectively identifies bugs inherent in LLM-powered interactive games, surpassing unstructured LLM-powered bug-catching methods and filling the gap in automated detection of logical and design flaws.

Title: Time Sensitive Knowledge Editing through Efficient Finetuning

Authors: Xiou Ge, Ali Mousavi, Edouard Grave, Armand Joulin, Kun Qian, Benjamin Han, Mostafa Arefiyan, Yunyao Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04496
Pdf URL: https://arxiv.org/pdf/2406.04496
Copy Paste: [[2406.04496]] Time Sensitive Knowledge Editing through Efficient Finetuning(https://arxiv.org/abs/2406.04496)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive capability in different tasks and are bringing transformative changes to many domains. However, keeping the knowledge in LLMs up-to-date remains a challenge once pretraining is complete. It is thus essential to design effective methods to both update obsolete knowledge and induce new knowledge into LLMs. Existing locate-and-edit knowledge editing (KE) methods suffer from two limitations. First, the post-edit LLMs produced by such methods generally have poor capability in answering complex queries that require multi-hop reasoning. Second, the long run-time of such locate-and-edit methods to perform knowledge edits makes them infeasible for large scale KE in practice. In this paper, we explore Parameter-Efficient Fine-Tuning (PEFT) techniques as an alternative for KE. We curate a more comprehensive temporal KE dataset with both knowledge update and knowledge injection examples for KE performance benchmarking. We further probe the effect of fine-tuning on a range of layers in an LLM for the multi-hop QA task. We find that PEFT performs better than locate-and-edit techniques for time-sensitive knowledge edits.

Title: NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

Authors: Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04520
Pdf URL: https://arxiv.org/pdf/2406.04520
Copy Paste: [[2406.04520]] NATURAL PLAN: Benchmarking LLMs on Natural Language Planning(https://arxiv.org/abs/2406.04520)
Keywords: gpt, llm
Abstract: We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for a tool-use environment for evaluating LLMs on Planning. We observe that NATURAL PLAN is a challenging benchmark for state-of-the-art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro could only achieve 31.1% and 34.8% solve rates respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in planning in natural language for SoTA LLMs. We also conduct extensive ablation studies on NATURAL PLAN to further shed light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long contexts in improving LLM planning.

Title: Proofread: Fixes All Errors with One Tap

Authors: Renjie Liu, Yanxiang Zhang, Yun Zhu, Haicheng Sun, Yuanbo Zhang, Michael Xuelin Huang, Shanqing Cai, Lei Meng, Shumin Zhai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04523
Pdf URL: https://arxiv.org/pdf/2406.04523
Copy Paste: [[2406.04523]] Proofread: Fixes All Errors with One Tap(https://arxiv.org/abs/2406.04523)
Keywords: language model, llm
Abstract: The impressive capabilities of Large Language Models (LLMs) provide a powerful approach to reimagine users' typing experience. This paper demonstrates Proofread, a novel Gboard feature powered by a server-side LLM, enabling seamless sentence-level and paragraph-level corrections with a single tap. We describe the complete system in this paper, from data generation and metrics design to model tuning and deployment. To obtain models with sufficient quality, we implement a careful data synthesis pipeline tailored to online use cases, design multifaceted metrics, and employ a two-stage tuning approach to acquire the dedicated LLM for the feature: Supervised Fine Tuning (SFT) for foundational quality, followed by Reinforcement Learning (RL) tuning for targeted refinement. Specifically, we find that sequential tuning on Rewrite and proofread tasks yields the best quality in the SFT stage, and propose global and direct rewards in the RL tuning stage to seek further improvement. Extensive experiments on a human-labeled golden set showed our tuned PaLM2-XS model achieved an 85.56% good ratio. We launched the feature to Pixel 8 devices by serving the model on TPU v5 in Google Cloud, with thousands of daily active users. Serving latency was significantly reduced by quantization, bucket inference, text segmentation, and speculative decoding. Our demo can be seen on YouTube (this https URL).

Title: llmNER: (Zero|Few)-Shot Named Entity Recognition, Exploiting the Power of Large Language Models

Authors: Fabián Villena, Luis Miranda, Claudio Aracena
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04528
Pdf URL: https://arxiv.org/pdf/2406.04528
Copy Paste: [[2406.04528]] llmNER: (Zero|Few)-Shot Named Entity Recognition, Exploiting the Power of Large Language Models(https://arxiv.org/abs/2406.04528)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) allow us to generate high-quality human-like text. One interesting task in natural language processing (NLP) is named entity recognition (NER), which seeks to detect mentions of relevant information in documents. This paper presents llmNER, a Python library for implementing zero-shot and few-shot NER with LLMs; by providing an easy-to-use interface, llmNER can compose prompts, query the model, and parse the completion returned by the LLM. Also, the library enables the user to perform prompt engineering efficiently by providing a simple interface to test multiple variables. We validated our software on two NER tasks to show the library's flexibility. llmNER aims to push the boundaries of in-context learning research by removing the barrier of the prompting and parsing steps.

Title: Creating an AI Observer: Generative Semantic Workspaces

Authors: Pavan Holur, Shreyas Rajesh, David Chong, Vwani Roychowdhury
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04555
Pdf URL: https://arxiv.org/pdf/2406.04555
Copy Paste: [[2406.04555]] Creating an AI Observer: Generative Semantic Workspaces(https://arxiv.org/abs/2406.04555)
Keywords: llm
Abstract: An experienced human Observer reading a document -- such as a crime report -- creates a succinct plot-like "Working Memory" comprising different actors, their prototypical roles and states at any point, their evolution over time based on their interactions, and even a map of missing Semantic parts anticipating them in the future. An equivalent AI Observer currently does not exist. We introduce the [G]enerative [S]emantic [W]orkspace (GSW) -- comprising an "Operator" and a "Reconciler" -- that leverages advancements in LLMs to create a generative-style Semantic framework, as opposed to a traditionally predefined set of lexicon labels. Given a text segment $C_n$ that describes an ongoing situation, the Operator instantiates actor-centric Semantic maps (termed "Workspace instance" $\mathcal{W}_n$). The Reconciler resolves differences between $\mathcal{W}_n$ and a "Working memory" $\mathcal{M}_n^*$ to generate the updated $\mathcal{M}_{n+1}^*$. GSW outperforms well-known baselines on several tasks (~94% vs. FST, GLEN, BertSRL - multi-sentence Semantics extraction, ~15% vs. NLI-BERT, ~35% vs. QA). By mirroring the real Observer, GSW provides the first step towards Spatial Computing assistants capable of understanding individual intentions and predicting future behavior.

Title: SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models

Authors: Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04566
Pdf URL: https://arxiv.org/pdf/2406.04566
Copy Paste: [[2406.04566]] SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models(https://arxiv.org/abs/2406.04566)
Keywords: language model, llm
Abstract: Spatial reasoning is a crucial component of both biological and artificial intelligence. In this work, we present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning. To support our study, we created and contribute a novel Spatial Reasoning Characterization (SpaRC) framework and Spatial Reasoning Paths (SpaRP) datasets, to enable an in-depth understanding of the spatial relations and compositions as well as the usefulness of spatial reasoning chains. We found that all the state-of-the-art LLMs do not perform well on the datasets -- their performances are consistently low across different setups. The spatial reasoning capability improves substantially as model sizes scale up. Finetuning both large language models (e.g., Llama-2-70B) and smaller ones (e.g., Llama-2-13B) can significantly improve their F1-scores by 7--32 absolute points. We also found that the top proprietary LLMs still significantly outperform their open-source counterparts in topological spatial understanding and reasoning.

Title: Extroversion or Introversion? Controlling The Personality of Your Large Language Models

Authors: Yanquan Chen, Zhen Wu, Junjie Guo, Shujian Huang, Xinyu Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04583
Pdf URL: https://arxiv.org/pdf/2406.04583
Copy Paste: [[2406.04583]] Extroversion or Introversion? Controlling The Personality of Your Large Language Models(https://arxiv.org/abs/2406.04583)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) exhibit robust capabilities in text generation and comprehension, mimicking human behavior and exhibiting synthetic personalities. However, some LLMs have displayed offensive personalities, propagating toxic discourse. Existing literature neglects the origin and evolution of LLM personalities, as well as effective personality control. To fill these gaps, our study embarked on a comprehensive investigation into LLM personality control. We investigated several typical methods to influence LLMs, including three training methods: Continual Pre-training, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF), along with inference phase considerations (prompts). Our investigation revealed a hierarchy of effectiveness in control: Prompt > SFT > RLHF > Continual Pre-train. Notably, SFT exhibits a higher control success rate compared to prompt induction. While prompts prove highly effective, we found that prompt-induced personalities are less robust than those trained, making them more prone to showing conflicting personalities under reverse personality prompt induction. Besides, harnessing the strengths of both SFT and prompt, we proposed Prompt Induction post Supervised Fine-tuning (PISF), which emerges as the most effective and robust strategy for controlling LLMs' personality, displaying high efficacy, high success rates, and high robustness. Even under reverse personality prompt induction, LLMs controlled by PISF still exhibit stable and robust personalities.

Title: Learning Task Decomposition to Assist Humans in Competitive Programming

Authors: Jiaxin Wen, Ruiqi Zhong, Pei Ke, Zhihong Shao, Hongning Wang, Minlie Huang
Subjects: cs.CL, cs.PL
Abstract URL: https://arxiv.org/abs/2406.04604
Pdf URL: https://arxiv.org/pdf/2406.04604
Copy Paste: [[2406.04604]] Learning Task Decomposition to Assist Humans in Competitive Programming(https://arxiv.org/abs/2406.04604)
Keywords: language model
Abstract: When using language models (LMs) to solve complex problems, humans might struggle to understand the LM-generated solutions and repair the flawed ones. To assist humans in repairing them, we propose to automatically decompose complex solutions into multiple simpler pieces that correspond to specific subtasks. We introduce a novel objective for learning task decomposition, termed assistive value (AssistV), which measures the feasibility and speed for humans to repair the decomposed solution. We collect a dataset of human repair experiences on different decomposed solutions. Utilizing the collected data as in-context examples, we then learn to critique, refine, and rank decomposed solutions to improve AssistV. We validate our method on competitive programming problems: in 177 hours of human study, our method enables non-experts to solve 33.3\% more problems, speeds them up by 3.3x, and empowers them to match unassisted experts.
摘要：在使用语言模型 (LM) 解决复杂问题时，人类可能难以理解语言模型生成的解决方案并修复有缺陷的解决方案。为了帮助人类修复这些问题，我们建议将复杂的解决方案自动分解为多个与特定子任务相对应的更简单的部分。我们引入了一个学习任务分解的新目标，称为辅助值 (AssistV)，它衡量人类修复分解解决方案的可行性和速度。我们收集了人类对不同分解解决方案的修复经验数据集。利用收集的数据作为上下文示例，我们然后学习批评、改进和排名分解的解决方案以改进 AssistV。我们在竞争性编程问题上验证了我们的方法：在 177 小时的人类研究中，我们的方法使非专家能够多解决 33.3\% 的问题，将他们的速度提高了 3.3 倍，并使他们能够与无人协助的专家匹敌。

Title: LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model

Authors: Zhi Zhou, Jiang-Xin Shi, Peng-Xiao Song, Xiao-Wen Yang, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04614
Pdf URL: https://arxiv.org/pdf/2406.04614
Copy Paste: [[2406.04614]] LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model(https://arxiv.org/abs/2406.04614)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs), including both proprietary and open-source models, have showcased remarkable capabilities in addressing a wide range of downstream tasks. Nonetheless, when it comes to practical Chinese legal tasks, these models fail to meet the actual requirements. Proprietary models do not ensure data privacy for sensitive legal cases, while open-source models demonstrate unsatisfactory performance due to their lack of legal knowledge. To address this problem, we introduce LawGPT, the first open-source model specifically designed for Chinese legal applications. LawGPT comprises two key components: legal-oriented pre-training and legal supervised fine-tuning. Specifically, we employ large-scale Chinese legal documents for legal-oriented pre-training to incorporate legal domain knowledge. To further improve the model's performance on downstream legal tasks, we create a knowledge-driven instruction dataset for legal supervised fine-tuning. Our experimental results demonstrate that LawGPT outperforms the open-source LLaMA 7B model. Our code and resources are publicly available at this https URL and have received 5.7K stars on GitHub.
摘要：大型语言模型（LLM），包括专有模型和开源模型，在解决各种下游任务方面都表现出了卓越的能力。然而，当涉及到实际的中国法律任务时，这些模型却无法满足实际要求。专有模型无法确保敏感法律案件的数据隐私，而开源模型由于缺乏法律知识而表现不佳。为了解决这个问题，我们推出了 LawGPT，这是第一个专门为中国法律应用设计的开源模型。LawGPT 包含两个关键部分：面向法律的预训练和法律监督微调。具体来说，我们使用大规模中文法律文件进行面向法律的预训练，以融入法律领域的知识。为了进一步提高模型在下游法律任务上的性能，我们创建了一个知识驱动的指令数据集，用于法律监督微调。我们的实验结果表明 LawGPT 优于开源 LLaMA 7B 模型。我们的代码和资源在此 https URL 上公开提供，并在 GitHub 上获得了 5.7K 颗星。

Title: Key-Element-Informed sLLM Tuning for Document Summarization

Authors: Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04625
Pdf URL: https://arxiv.org/pdf/2406.04625
Copy Paste: [[2406.04625]] Key-Element-Informed sLLM Tuning for Document Summarization(https://arxiv.org/abs/2406.04625)
Keywords: language model, llm, hallucination
Abstract: Remarkable advances in large language models (LLMs) have enabled high-quality text summarization. However, this capability is currently accessible only through LLMs of substantial size or proprietary LLMs with usage fees. In response, smaller-scale LLMs (sLLMs), which are easily accessible and low in cost, have been extensively studied, yet they often miss key information and entities (i.e., they have low relevance), particularly when input documents are long. We hence propose a key-element-informed instruction tuning for summarization, called KEITSum, which identifies key elements in documents and instructs the sLLM to generate summaries capturing these key elements. Experimental results on dialogue and news datasets demonstrate that an sLLM with KEITSum indeed provides high-quality summarization with higher relevance and fewer hallucinations, competitive with proprietary LLMs.
摘要：大型语言模型 (LLM) 的显著进步使得高质量文本摘要成为可能。然而，目前只有通过大规模的 LLM 或需要付费的专有 LLM 才能实现此功能。为此，人们广泛研究了易于访问且成本低廉的小型 LLM (sLLM)，但它们经常会缺少关键信息和实体，即相关性低，尤其是在输入文档很长的情况下。因此，我们提出了一种针对摘要的关键元素信息指令调整，即所谓的 KEITSum，它可以识别文档中的关键元素并指示 sLLM 生成捕捉这些关键元素的摘要。在对话和新闻数据集上的实验结果表明，带有 KEITSum 的 sLLM 确实提供了高质量摘要，相关性更高，幻觉更少，可与专有 LLM 相媲美。

Title: Low-Resource Cross-Lingual Summarization through Few-Shot Learning with Large Language Models

Authors: Gyutae Park, Seojin Hwang, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04630
Pdf URL: https://arxiv.org/pdf/2406.04630
Copy Paste: [[2406.04630]] Low-Resource Cross-Lingual Summarization through Few-Shot Learning with Large Language Models(https://arxiv.org/abs/2406.04630)
Keywords: language model, gpt, llm
Abstract: Cross-lingual summarization (XLS) aims to generate a summary in a target language different from the source language document. While large language models (LLMs) have shown promising zero-shot XLS performance, their few-shot capabilities on this task remain unexplored, especially for low-resource languages with limited parallel data. In this paper, we investigate the few-shot XLS performance of various models, including Mistral-7B-Instruct-v0.2, GPT-3.5, and GPT-4. Our experiments demonstrate that few-shot learning significantly improves the XLS performance of LLMs, particularly GPT-3.5 and GPT-4, in low-resource settings. However, the open-source model Mistral-7B-Instruct-v0.2 struggles to adapt effectively to the XLS task with limited examples. Our findings highlight the potential of few-shot learning for improving XLS performance and the need for further research in designing LLM architectures and pre-training objectives tailored for this task. We provide a future work direction to explore more effective few-shot learning strategies and to investigate the transfer learning capabilities of LLMs for cross-lingual summarization.
摘要：跨语言摘要 (XLS) 旨在生成与源语言文档不同的目标语言摘要。虽然大型语言模型 (LLM) 已表现出良好的零样本 XLS 性能，但它们在此任务上的少样本能力仍未被探索，尤其是对于具有有限并行数据的低资源语言。在本文中，我们研究了各种模型的少样本 XLS 性能，包括 Mistral-7B-Instruct-v0.2、GPT-3.5 和 GPT-4。我们的实验表明，在低资源环境下，少样本学习显著提高了 LLM 的 XLS 性能，尤其是 GPT-3.5 和 GPT-4。然而，开源模型 Mistral-7B-Instruct-v0.2 难以在示例有限的情况下有效适应 XLS 任务。我们的研究结果强调了少样本学习在提高 XLS 性能方面的潜力，以及在设计针对此任务的 LLM 架构和预训练目标方面需要进一步研究。我们提供了未来的工作方向，以探索更有效的少量学习策略并研究 LLM 用于跨语言摘要的迁移学习能力。

Title: Large Language Model-guided Document Selection

Authors: Xiang Kong, Tom Gunter, Ruoming Pang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04638
Pdf URL: https://arxiv.org/pdf/2406.04638
Copy Paste: [[2406.04638]] Large Language Model-guided Document Selection(https://arxiv.org/abs/2406.04638)
Keywords: language model, llm, prompt
Abstract: Large Language Model (LLM) pre-training exhausts an ever-growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al., 2023], we explore a promising direction for scalable general-domain document selection: employing a prompted LLM as a document grader, we distill quality labels into a classifier model, which is applied autonomously at scale to a large, and already heavily-filtered, web-crawl-derived corpus. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple benchmarks show that: 1. Filtering allows us to quality-match a model trained on the full corpus across diverse benchmarks with at most 70% of the FLOPs, 2. More capable LLM labelers and classifier models lead to better results that are less sensitive to the labeler's prompt, 3. In-context learning helps to boost the performance of less-capable labeling models. In all cases we use open-source datasets, models, recipes, and evaluation frameworks, so that results can be reproduced by the community.
摘要：大型语言模型 (LLM) 预训练会耗尽不断增长的计算预算，但最近的研究表明，仔细选择文档可以仅使用一小部分 FLOP 实现相当的模型质量。受到一些努力的启发，这些努力表明特定领域的训练文档选择实际上是一个可解释的过程 [Gunasekar et al., 2023]，以及一些研究表明指令微调的 LLM 是熟练的零样本数据标记器 [Gilardi et al.,2023]，我们探索了一个可扩展的通用领域文档选择的有希望的方向；使用提示的 LLM 作为文档评分器，我们将质量标签提炼到分类器模型中，该模型被大规模应用于大型且已经经过大量过滤的网络爬取衍生语料库。按照这个分类器的指导，我们删除了 75% 的语料库并在剩余数据上训练 LLM。多个基准测试的结果表明：1. 通过过滤，我们可以对在各种基准测试中在完整语料库上训练的模型进行质量匹配，最多可达到 70% 的 FLOP；2. 更强大的 LLM 标注器和分类器模型可产生更好的结果，这些结果对标注器的提示不太敏感；3. 上下文学习有助于提高能力较弱的标注模型的性能。在所有情况下，我们都使用开源数据集、模型、配方和评估框架，以便社区可以重现结果。

Title: DiNeR: a Large Realistic Dataset for Evaluating Compositional Generalization

Authors: Chengang Hu, Xiao Liu, Yansong Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04669
Pdf URL: https://arxiv.org/pdf/2406.04669
Copy Paste: [[2406.04669]] DiNeR: a Large Realistic Dataset for Evaluating Compositional Generalization(https://arxiv.org/abs/2406.04669)
Keywords: language model, llm
Abstract: Most of the existing compositional generalization datasets are synthetically-generated, resulting in a lack of natural language variation. While there have been recent attempts to introduce non-synthetic datasets for compositional generalization, they suffer from either limited data scale or a lack of diversity in the forms of combinations. To better investigate compositional generalization with more linguistic phenomena and compositional diversity, we propose the DIsh NamE Recognition (DiNeR) task and create a large realistic Chinese dataset. Given a recipe instruction, models are required to recognize the dish name composed of diverse combinations of food, actions, and flavors. Our dataset consists of 3,811 dishes and 228,114 recipes, and involves plenty of linguistic phenomena such as anaphora, omission and ambiguity. We provide two strong baselines based on T5 and large language models (LLMs). This work contributes a challenging task, baseline methods to tackle the task, and insights into compositional generalization in the context of dish name recognition. Code and data are available at this https URL.
摘要：现有的组合泛化数据集大部分都是人工合成的，因此缺乏自然语言变化。虽然最近有人尝试引入非人工数据集进行组合泛化，但它们要么数据规模有限，要么组合形式缺乏多样性。为了更好地研究具有更多语言现象和组合多样性的组合泛化，我们提出了菜名识别 (DiNeR) 任务并创建了一个大型逼真的中文数据集。给定一个菜谱说明，模型需要识别由食物、动作和口味的多种组合组成的菜名。我们的数据集包含 3,811 道菜和 228,114 道菜谱，涉及大量语言现象，如首指、省略和歧义。我们基于 T5 和大型语言模型 (LLM) 提供了两个强大的基线。这项工作贡献了一项具有挑战性的任务、解决该任务的基线方法以及对菜名识别背景下的组合泛化的见解。代码和数据可在此 https URL 上获得。

Title: MATTER: Memory-Augmented Transformer Using Heterogeneous Knowledge Sources

Authors: Dongkyu Lee, Chandana Satya Prakash, Jack FitzGerald, Jens Lehmann
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04670
Pdf URL: https://arxiv.org/pdf/2406.04670
Copy Paste: [[2406.04670]] MATTER: Memory-Augmented Transformer Using Heterogeneous Knowledge Sources(https://arxiv.org/abs/2406.04670)
Keywords: language model, long context
Abstract: Leveraging external knowledge is crucial for achieving high performance in knowledge-intensive tasks, such as question answering. The retrieve-and-read approach is widely adopted for integrating external knowledge into a language model. However, this approach suffers from increased computational cost and latency due to the long context length, which grows proportionally with the amount of retrieved knowledge. Furthermore, existing retrieval-augmented models typically retrieve information from a single type of knowledge source, limiting their scalability to diverse knowledge sources with varying structures. In this work, we introduce an efficient memory-augmented transformer called MATTER, designed to retrieve relevant knowledge from multiple heterogeneous knowledge sources. Specifically, our model retrieves and reads from both unstructured sources (paragraphs) and semi-structured sources (QA pairs) in the form of fixed-length neural memories. We demonstrate that our model outperforms existing efficient retrieval-augmented models on popular QA benchmarks in terms of both accuracy and speed. Furthermore, MATTER achieves competitive results compared to conventional retrieve-and-read models while having 100x the throughput during inference.
摘要：利用外部知识对于在知识密集型任务（例如问答）中实现高性能至关重要。检索和阅读方法被广泛用于将外部知识集成到语言模型中。然而，这种方法的计算成本和延迟会增加，因为上下文长度较长，并且会随着检索到的知识数量成比例增长。此外，现有的检索增强模型通常从单一类型的知识源检索信息，这限制了它们对具有不同结构的各种知识源的可扩展性。在这项工作中，我们引入了一种名为 MATTER 的高效记忆增强转换器，旨在从多个异构知识源检索相关知识。具体而言，我们的模型以固定长度的神经记忆的形式从非结构化源（段落）和半结构化源（QA 对）中检索和读取。我们证明，我们的模型在流行 QA 基准测试中的准确性和速度方面都优于现有的高效检索增强模型。此外，MATTER 与传统的读取和检索模型相比，取得了有竞争力的结果，同时在推理过程中具有 100 倍的吞吐量。

Title: Mixture-of-Agents Enhances Large Language Model Capabilities

Authors: Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04692
Pdf URL: https://arxiv.org/pdf/2406.04692
Copy Paste: [[2406.04692]] Mixture-of-Agents Enhances Large Language Model Capabilities(https://arxiv.org/abs/2406.04692)
Keywords: language model, gpt, llm, agent
Abstract: Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieve state-of-the-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs leads AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.
摘要：大型语言模型 (LLM) 的最新进展展示了其在自然语言理解和生成任务中的强大能力。随着 LLM 数量的不断增长，如何利用多个 LLM 的集体专业知识是一个令人兴奋的开放方向。为了实现这一目标，我们提出了一种新方法，通过混合代理 (MoA) 方法利用多个 LLM 的集体优势。在我们的方法中，我们构建了一个分层的 MoA 架构，其中每一层都包含多个 LLM 代理。每个代理都将前一层代理的所有输出作为生成其响应的辅助信息。MoA 模型在 AlpacaEval 2.0、MT-Bench 和 FLASK 上实现了最先进的性能，超越了 GPT-4 Omni。例如，我们仅使用开源 LLM 的 MoA 以相当大的差距领先于 AlpacaEval 2.0，得分为 65.1%，而 GPT-4 Omni 得分为 57.5%。
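
The layered flow described in the MoA abstract (each agent receives every output from the previous layer as auxiliary information) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Agent` callable signature, `run_layers`, and `make_stub` are hypothetical names, and the stub agents stand in for real LLM calls, which the abstract does not specify.

```python
from typing import Callable, List, Sequence

# Hypothetical agent interface (an assumption, not from the paper):
# (user prompt, outputs of the previous layer) -> response text.
Agent = Callable[[str, Sequence[str]], str]


def run_layers(prompt: str, layers: List[List[Agent]]) -> List[str]:
    """Run the prompt through each layer; every agent sees all previous-layer outputs."""
    previous: List[str] = []  # the first layer has no auxiliary information
    for layer in layers:
        # Build the new outputs from the old ones, then replace them in one step.
        previous = [agent(prompt, previous) for agent in layer]
    return previous


def make_stub(name: str) -> Agent:
    """Stand-in for an LLM call; it only reports how many prior outputs it saw."""
    def agent(prompt: str, previous: Sequence[str]) -> str:
        return f"{name} saw {len(previous)} prior outputs for: {prompt}"
    return agent


# Two layers: three proposer stubs, then a single final stub.
layers = [[make_stub("a1"), make_stub("a2"), make_stub("a3")], [make_stub("final")]]
final_outputs = run_layers("Explain MoA", layers)
```

Ending with a single-agent layer is one simple way to obtain one final answer; how the paper actually aggregates the last layer is not stated in the abstract.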

Title: AICoderEval: Improving AI Domain Code Generation of Large Language Models

Authors: Yinghui Xia, Yuyan Chen, Tianyu Shi, Jun Wang, Jinsong Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04712
Pdf URL: https://arxiv.org/pdf/2406.04712
Copy Paste: [[2406.04712]] AICoderEval: Improving AI Domain Code Generation of Large Language Models(https://arxiv.org/abs/2406.04712)
Keywords: language model, llm, agent
Abstract: Automated code generation is a pivotal capability of large language models (LLMs). However, assessing this capability in real-world scenarios remains challenging. Previous methods focus more on low-level code generation, such as model loading, than on generating high-level code for real-world tasks, such as image-to-text and text classification, in various domains. Therefore, we construct AICoderEval, a dataset focused on real-world tasks in various domains based on HuggingFace, PyTorch, and TensorFlow, along with comprehensive metrics for evaluating and enhancing LLMs' task-specific code generation capability. AICoderEval contains test cases and complete programs for automated evaluation of these tasks, covering domains such as natural language processing, computer vision, and multimodal learning. To facilitate research in this area, we open-source the AICoderEval dataset at this https URL. After that, we propose CoderGen, an agent-based framework, to help LLMs generate code related to real-world tasks on the constructed AICoderEval. Moreover, we train a more powerful task-specific code generation model, named AICoder, which is refined on llama-3 based on AICoderEval. Our experiments demonstrate the effectiveness of CoderGen in improving LLMs' task-specific code generation capability (by 12.00\% on pass@1 for the original model and 9.50\% on pass@1 for the ReAct Agent). AICoder also outperforms current code generation LLMs, indicating the high quality of the AICoderEval benchmark.
摘要：自动代码生成是大型语言模型 (LLM) 的关键功能。然而，在现实场景中评估这种能力仍然具有挑战性。以前的方法更多地关注低级代码生成，例如模型加载，而不是生成适合各种领域中实际任务的高级代码，例如图像到文本、文本分类。因此，我们基于 HuggingFace、PyTorch 和 TensorFlow 构建了 AICoderEval，这是一个专注于各种领域中实际任务的数据集，以及用于评估和增强 LLM 特定任务代码生成能力的综合指标。AICoderEval 包含用于自动评估这些任务的测试用例和完整程序，涵盖自然语言处理、计算机视觉和多模态学习等领域。为了促进该领域的研究，我们在 \url{this https URL} 开源了 AICoderEval 数据集。之后，我们提出了基于代理的框架 CoderGen，以帮助 LLM 在构建的 AICoderEval 上生成与实际任务相关的代码。此外，我们还训练了一个更强大的任务特定代码生成模型 AICoder，该模型在 AICoderEval 的基础上在 llama-3 上进行了改进。我们的实验证明了 CoderGen 在提高 LLM 的任务特定代码生成能力方面的有效性（原始模型在 pass@1 上提高了 12.00\%，ReAct Agent 在 pass@1 上提高了 9.50\%）。AICoder 的表现也优于当前的代码生成 LLM，表明 AICoderEval 基准测试的质量很高。

Title: CRAG -- Comprehensive RAG Benchmark

Authors: Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] CRAG -- Comprehensive RAG Benchmark(https://arxiv.org/abs/)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate the knowledge deficiencies of Large Language Models (LLMs). Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.
摘要：检索增强生成 (RAG) 最近成为缓解大型语言模型 (LLM) 知识缺乏问题的一种有前途的解决方案。然而，现有的 RAG 数据集不能充分代表现实世界问答 (QA) 任务的多样性和动态性。为了弥补这一差距，我们引入了综合 RAG 基准 (CRAG)，这是一个包含 4,409 个问答对和模拟 API 的事实问答基准，用于模拟网络和知识图谱 (KG) 搜索。CRAG 旨在封装五个领域和八个问题类别中的各种问题，反映从热门到长尾的不同实体流行度，以及从几年到几秒的时间动态。我们对这个基准的评估突出了与完全值得信赖的 QA 之间的差距。虽然大多数先进的 LLM 在 CRAG 上的准确率 <=34%，但以简单的方式添加 RAG 只能将准确率提高到 44%。最先进的行业 RAG 解决方案只能回答 63% 的问题，没有任何幻觉。 CRAG 还表明，在回答有关动态性更高、流行度更低或复杂度更高的事实的问题时，准确率要低得多，这为未来的研究方向指明了方向。CRAG 基准为 KDD Cup 2024 挑战赛奠定了基础，在比赛开始后的前 50 天内吸引了数千名参与者和参赛作品。我们致力于维护 CRAG，为研究社区提供服务，推动 RAG 解决方案和通用 QA 解决方案的发展。

Title: CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models

Authors: Ling Shi, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04752
Pdf URL: https://arxiv.org/pdf/2406.04752
Copy Paste: [[2406.04752]] CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models(https://arxiv.org/abs/2406.04752)
Keywords: language model, llm
Abstract: Large language models (LLMs) possess numerous beneficial capabilities, yet their potential inclinations harbor unpredictable risks that may materialize in the future. We hence propose CRiskEval, a Chinese dataset meticulously designed for gauging the risk proclivities inherent in LLMs, such as resource acquisition and malicious coordination, as part of efforts for proactive preparedness. To curate CRiskEval, we define a new risk taxonomy with 7 types of frontier risks and 4 safety levels: extremely hazardous, moderately hazardous, neutral and safe. We follow the philosophy of tendency evaluation to empirically measure the stated desire of LLMs via fine-grained multiple-choice question answering. The dataset consists of 14,888 questions that simulate scenarios related to the 7 predefined types of frontier risks. Each question is accompanied by 4 answer choices that state opinions or behavioral tendencies corresponding to the question. All answer choices are manually annotated with one of the defined risk levels so that we can easily build a fine-grained frontier risk profile for each assessed LLM. Extensive evaluation with CRiskEval on a spectrum of prevalent Chinese LLMs has unveiled a striking revelation: most models exhibit risk tendencies of more than 40% (weighted tendency to the four risk levels). Furthermore, a subtle increase in the model's inclination toward urgent self-sustainability, power seeking and other dangerous goals becomes evident as the size of models increases. To promote further research on the frontier risk evaluation of LLMs, we publicly release our dataset at this https URL.
摘要：大型语言模型 (LLM) 拥有众多有益的功能，但其潜在倾向却蕴含着未来可能出现的不可预测的风险。因此，我们提出了 CRiskEval，这是一个精心设计的中文数据集，用于衡量 LLM 固有的风险倾向，例如资源获取和恶意协调，作为主动准备工作的一部分。为了策划 CRiskEval，我们定义了一个新的风险分类法，其中包含 7 种前沿风险和 4 个安全级别，包括极其危险、中等危险、中性和安全。我们遵循倾向评估的理念，通过细粒度的多项选择题回答来实证测量 LLM 的既定愿望。该数据集包含 14,888 个问题，模拟与预定义的 7 种前沿风险相关的情景。每个问题都附有 4 个答案选项，陈述与问题相对应的意见或行为倾向。所有答案选项都手动标注了定义的风险级别之一，以便我们可以轻松地为每个评估的 LLM 构建细粒度的前沿风险概况。使用 CRiskEval 对中国流行的 LLM 进行广泛评估，发现了一个惊人的事实：大多数模型的风险倾向超过 40%（四个风险水平的加权倾向）。此外，随着模型规模的增加，模型对紧急自我维持、权力追求和其他危险目标的倾向略有增加。为了促进对 LLM 前沿风险评估的进一步研究，我们在此 https URL 上公开发布了我们的数据集。

Title: Think out Loud: Emotion Deducing Explanation in Dialogues

Authors: Jiangnan Li, Zheng Lin, Lanrui Wang, Qingyi Si, Yanan Cao, Mo Yu, Peng Fu, Weiping Wang, Jie Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04758
Pdf URL: https://arxiv.org/pdf/2406.04758
Copy Paste: [[2406.04758]] Think out Loud: Emotion Deducing Explanation in Dialogues(https://arxiv.org/abs/2406.04758)
Keywords: language model, llm
Abstract: Humans convey emotions through daily dialogues, making emotion understanding a crucial step of affective intelligence. To understand emotions in dialogues, machines are asked to recognize the emotion of an utterance (Emotion Recognition in Dialogues, ERD) and then, based on that emotion, find the utterances that caused it (Emotion Cause Extraction in Dialogues, ECED). This setting requires ERD first and ECED second, ignoring the way emotion and cause complement each other. To fix this, some new tasks have been proposed to extract them simultaneously. Although current research on these tasks has produced excellent results, simply identifying emotion-related factors through classification modeling does not capture, in an explainable way, the specific thinking process by which causes stimulate the emotion. This thinking process, which is especially reflected in the reasoning ability of Large Language Models (LLMs), is under-explored. To this end, we propose a new task, "Emotion Deducing Explanation in Dialogues" (EDEN). EDEN recognizes emotions and causes through explicit reasoning. That is, models need to generate an explanation text that first summarizes the causes, then analyzes the inner activities of the speakers triggered by those causes using common sense, and finally guesses the emotion accordingly. To support the study of EDEN, we construct two EDEN datasets by human effort, based on the existing resources in ECED. We further evaluate different models on EDEN and find that LLMs are more competent than conventional PLMs. Besides, EDEN can help LLMs achieve better recognition of emotions and causes, which opens a new research direction of explainable emotion understanding in dialogues.
摘要：人类通过日常对话传递情感，情感理解是情感智能的重要一步。理解对话中的情感，需要机器识别话语中的情感（对话中的情感识别，ERD）；基于情感，寻找与情感相关的因果话语（对话中的情感原因提取，ECED）。两个任务的设置需要先进行ERD，再进行ECED，忽略了情感与原因之间的互补性。针对这一问题，一些新的任务被提出来同时提取它们。虽然目前对这些任务的研究已经取得了很好的成果，但仅仅通过分类建模来识别与情感相关的因素，缺乏以可解释的方式实现激发情感的原因的具体思维过程。这种思维过程尤其体现在大型语言模型（LLM）的推理能力中，尚未得到充分探索。为此，我们提出了一项新任务“对话中的情感推导解释”（EDEN）。EDEN以明确的思维方式识别情感和原因。即模型需要生成解释性文本，首先概括原因，然后利用常识分析原因引发的说话者内心活动，最后猜测情绪。为了支持对 EDEN 的研究，我们基于 ECED 中现有的资源，人工构建了两个 EDEN 数据集。我们进一步在 EDEN 上评估了不同的模型，发现 LLM 比传统的 PLM 更胜一筹。此外，EDEN 可以帮助 LLM 更好地识别情绪和原因，这为对话中可解释的情绪理解开辟了新的研究方向。

Title: WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Authors: Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04770
Pdf URL: https://arxiv.org/pdf/2406.04770
Copy Paste: [[2406.04770]] WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild(https://arxiv.org/abs/2406.04770)
Keywords: language model, gpt, llm, chat
Abstract: We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias, by converting outcomes of "slightly better/worse" to "tie" if the winner response exceeds the loser one by more than $K$ characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.
摘要：我们推出了 WildBench，这是一个自动评估框架，旨在使用具有挑战性的真实用户查询对大型语言模型 (LLM) 进行基准测试。WildBench 由从超过一百万个人机聊天机器人对话日志中精心挑选出的 1,024 个任务组成。为了使用 WildBench 进行自动评估，我们开发了两个指标，即 WB-Reward 和 WB-Score，它们可以使用 GPT-4-turbo 等高级 LLM 进行计算。WildBench 评估使用特定于任务的检查表来系统地评估模型输出，并提供结构化的解释来证明分数和比较的合理性，从而产生更可靠和可解释的自动判断。WB-Reward 在模型响应之间采用细粒度的成对比较，产生五种潜在结果：好得多、稍好、稍差、差得多或平局。与以前使用单个基线模型的评估不同，我们选择了三个不同性能水平的基线模型，以确保全面的成对评估。此外，我们提出了一种缓解长度偏差的简单方法，即如果获胜者的响应比失败者的响应多出 $K$ 个字符，则将“略好/略差”的结果转换为“平局”。WB-Score 单独评估模型输出的质量，使其成为一种快速且经济高效的评估指标。WildBench 结果显示，在困难任务上与 Chatbot Arena 中人工投票的 Elo 评级具有很强的相关性。具体来说，WB-Reward 与排名靠前的模型实现了 0.98 的 Pearson 相关性。此外，WB-Score 达到 0.95，超过了 ArenaHard 的 0.91 和 AlpacaEval2.0 的长度控制胜率 0.89，以及常规胜率 0.87。
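
The length-bias rule stated in the WildBench abstract (turn a "slightly better/worse" outcome into a tie when the winning response is more than $K$ characters longer than the losing one) might be sketched as below. The outcome labels, the function name `mitigate_length_bias`, and the example value of `k` are illustrative assumptions; the abstract does not give the value of $K$ or any implementation details.

```python
def mitigate_length_bias(outcome: str, response_a: str, response_b: str, k: int) -> str:
    """Return the adjusted outcome, expressed from response A's perspective.

    Label strings are hypothetical: "much_better", "slightly_better", "tie",
    "slightly_worse", "much_worse".
    """
    if outcome == "slightly_better":      # A narrowly wins
        winner, loser = response_a, response_b
    elif outcome == "slightly_worse":     # B narrowly wins
        winner, loser = response_b, response_a
    else:
        return outcome                    # decisive outcomes and ties are left unchanged

    # A narrow win that coincides with a much longer answer is treated as a tie.
    if len(winner) - len(loser) > k:
        return "tie"
    return outcome


# Example with an assumed threshold of 500 characters.
adjusted = mitigate_length_bias("slightly_better", "x" * 600, "y" * 50, k=500)
```

Only narrow wins are adjusted in this sketch, matching the abstract's wording; "much better/worse" outcomes pass through regardless of length.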

Title: SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals

Authors: Ruihan Yang, Jiangjie Chen, Yikai Zhang, Siyu Yuan, Aili Chen, Kyle Richardson, Yanghua Xiao, Deqing Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04784
Pdf URL: https://arxiv.org/pdf/2406.04784
Copy Paste: [[2406.04784]] SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals(https://arxiv.org/abs/2406.04784)
Keywords: language model, llm, agent
Abstract: Language agents powered by large language models (LLMs) are increasingly valuable as decision-making tools in domains such as gaming and programming. However, these agents often face challenges in achieving high-level goals without detailed instructions and in adapting to environments where feedback is delayed. In this paper, we present SelfGoal, a novel automatic approach designed to enhance agents' capabilities to achieve high-level goals with limited human prior and environmental feedback. The core concept of SelfGoal involves adaptively breaking down a high-level goal into a tree structure of more practical subgoals during the interaction with environments while identifying the most useful subgoals and progressively updating this structure. Experimental results demonstrate that SelfGoal significantly enhances the performance of language agents across various tasks, including competitive, cooperative, and deferred feedback environments. Project page: this https URL.
摘要：大型语言模型 (LLM) 驱动的语言代理作为游戏和编程等领域的决策工具越来越有价值。然而，这些代理在没有详细说明的情况下实现高级目标以及适应延迟反馈的环境时经常面临挑战。在本文中，我们介绍了 SelfGoal，这是一种新颖的自动化方法，旨在增强代理在有限的人类先验和环境反馈下实现高级目标的能力。SelfGoal 的核心概念涉及在与环境交互过程中自适应地将高级目标分解为更实用的子目标的树形结构，同时识别最有用的子目标并逐步更新此结构。实验结果表明，SelfGoal 显著提高了语言代理在各种任务（包括竞争、合作和延迟反馈环境）中的表现。项目页面：此 https URL。

Title: BERTs are Generative In-Context Learners

Authors: David Samuel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04823
Pdf URL: https://arxiv.org/pdf/2406.04823
Copy Paste: [[2406.04823]] BERTs are Generative In-Context Learners(https://arxiv.org/abs/2406.04823)
Keywords: language model, gpt
Abstract: This paper explores the in-context learning capabilities of masked language models, challenging the common view that this ability does not 'emerge' in them. We present an embarrassingly simple inference technique that enables DeBERTa to operate as a generative model without any additional training. Our findings demonstrate that DeBERTa can match and even surpass GPT-3, its contemporary that famously introduced the paradigm of in-context learning. The comparative analysis reveals that the masked and causal language models behave very differently, as they clearly outperform each other on different categories of tasks. This suggests that there is great potential for a hybrid training approach that takes advantage of the strengths of both training objectives.
摘要：本文探讨了掩码语言模型的上下文学习能力，挑战了这种能力不会“出现”在掩码语言模型中的普遍观点。我们提出了一种非常简单的推理技术，使 DeBERTa 无需任何额外训练即可作为生成模型运行。我们的研究结果表明，DeBERTa 可以匹敌甚至超越 GPT-3，后者是其同时代产品，以引入上下文学习范式而闻名。比较分析表明，掩码语言模型和因果语言模型的表现非常不同，因为它们在不同类别的任务上明显胜过对方。这表明，利用两种训练目标的优势的混合训练方法具有巨大的潜力。

Title: Annotating FrameNet via Structure-Conditioned Language Generation

Authors: Xinyue Cui, Swabha Swayamdipta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04834
Pdf URL: https://arxiv.org/pdf/2406.04834
Copy Paste: [[2406.04834]] Annotating FrameNet via Structure-Conditioned Language Generation(https://arxiv.org/abs/2406.04834)
Keywords: language model, prompt
Abstract: Despite the remarkable generative capabilities of language models in producing naturalistic language, their effectiveness on explicit manipulation and generation of linguistic structures remains understudied. In this paper, we investigate the task of generating new sentences preserving a given semantic structure, following the FrameNet formalism. We propose a framework to produce novel frame-semantically annotated sentences following an overgenerate-and-filter approach. Our results show that conditioning on rich, explicit semantic information tends to produce generations with high human acceptance, under both prompting and finetuning. Our generated frame-semantic structured annotations are effective at training data augmentation for frame-semantic role labeling in low-resource settings; however, we do not see benefits under higher resource settings. Our study concludes that while generating high-quality, semantically rich data might be within reach, the downstream utility of such generations remains to be seen, highlighting the outstanding challenges with automating linguistic annotation tasks.
摘要：尽管语言模型在生成自然语言方面具有出色的生成能力，但它们在显式操作和语言结构生成方面的有效性仍未得到充分研究。在本文中，我们研究了按照 FrameNet 形式生成保留给定语义结构的新句子的任务。我们提出了一个框架，按照过度生成和过滤方法生成新的框架语义注释句子。我们的结果表明，在提示和微调的情况下，以丰富、明确的语义信息为条件往往会产生具有高度人类接受度的生成。我们生成的框架语义结构化注释可有效地在资源匮乏的环境中增强框架语义角色标记的训练数据；然而，在资源较多的环境中，我们看不到好处。我们的研究得出结论，虽然生成高质量、语义丰富的数据可能触手可及，但这种生成的下游效用仍有待观察，这凸显了语言注释任务自动化的突出挑战。

Title: Revisiting Catastrophic Forgetting in Large Language Model Tuning

Authors: Hongyu Li, Liang Ding, Meng Fang, Dacheng Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04836
Pdf URL: https://arxiv.org/pdf/2406.04836
Copy Paste: [[2406.04836]] Revisiting Catastrophic Forgetting in Large Language Model Tuning(https://arxiv.org/abs/2406.04836)
Keywords: language model, llm
Abstract: Catastrophic Forgetting (CF) occurs when a model forgets previously acquired knowledge while learning new data. It compromises the effectiveness of large language models (LLMs) during fine-tuning, yet the underlying causes have not been thoroughly investigated. This paper takes the first step to reveal the direct link between the flatness of the model loss landscape and the extent of CF in the field of LLMs. Based on this, we introduce sharpness-aware minimization to mitigate CF by flattening the loss landscape. Experiments on three widely-used fine-tuning datasets, spanning different model scales, demonstrate the effectiveness of our method in alleviating CF. Analyses show that our method complements existing anti-forgetting strategies, further enhancing the resistance of LLMs to CF.
摘要：灾难性遗忘 (CF) 是指模型在学习新数据时会忘记先前获得的知识。它会在微调期间损害大型语言模型 (LLM) 的有效性，但其根本原因尚未得到彻底研究。本文迈出了第一步，揭示了 LLM 领域中模型损失景观的平坦度与 CF 程度之间的直接联系。在此基础上，我们引入了锐度感知最小化，通过平坦损失景观来缓解 CF。在三个广泛使用的微调数据集上进行的实验涵盖了不同的模型规模，证明了我们的方法在缓解 CF 方面的有效性。分析表明，我们很好地补充了现有的防遗忘策略，进一步增强了 LLM 对 CF 的抵抗力。
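
As a rough illustration of the sharpness-aware minimization mentioned above, the sketch below applies the generic two-step update (perturb the weights along the normalized gradient, then descend using the gradient taken at the perturbed point) to a toy quadratic loss. It is a minimal sketch and not the paper's training setup: the function name `sam_step`, the loss, and the values of `lr` and `rho` are assumptions chosen for illustration.

```python
import numpy as np


def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One sharpness-aware minimization (SAM) update on parameters w."""
    g = grad_fn(w)                                 # gradient at the current weights
    eps = rho * g / (np.linalg.norm(g) + 1e-12)    # step toward the nearby worst case
    g_sharp = grad_fn(w + eps)                     # gradient at the perturbed weights
    return w - lr * g_sharp                        # descend using the perturbed gradient


# Toy quadratic loss L(w) = 0.5 * w^T A w, which is sharp along the first axis
A = np.diag([10.0, 1.0])
loss = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w, grad)
```

Setting `rho=0` reduces the update to plain gradient descent, which makes the extra perturbation step easy to isolate when comparing the two.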

Title: FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models

Authors: Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen
Subjects: cs.CL, cs.AI, cs.DC, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2406.04845
Pdf URL: https://arxiv.org/pdf/2406.04845
Copy Paste: [[2406.04845]] FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models(https://arxiv.org/abs/2406.04845)
Keywords: language model, llm
Abstract: Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM). Following this training paradigm, the community has invested substantial effort across diverse aspects, including frameworks, performance, and privacy. However, an unpleasant fact is that there are currently no realistic datasets and benchmarks for FedLLM, and previous works all rely on artificially constructed datasets, failing to capture properties in real-world scenarios. Addressing this, we propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics, to offer a comprehensive testbed for the FedLLM community. FedLLM-Bench encompasses three datasets (e.g., user-annotated multilingual dataset) for federated instruction tuning and one dataset (e.g., user-annotated preference dataset) for federated preference alignment, with the number of clients ranging from 38 to 747. Our datasets incorporate several representative diversities: language, quality, quantity, instruction, length, embedding, and preference, capturing properties in real-world scenarios. Based on FedLLM-Bench, we conduct experiments on all datasets to benchmark existing FL methods and provide empirical insights (e.g., multilingual collaboration). We believe that our FedLLM-Bench can benefit the FedLLM community by reducing required efforts, providing a practical testbed, and promoting fair comparisons. Code and datasets are available at this https URL.
摘要：联邦学习使多方能够协作训练大型语言模型而无需直接共享其数据（FedLLM）。遵循这种训练范式，社区从框架、性能和隐私等各个方面付出了巨大努力。然而，一个令人不快的事实是，目前没有针对 FedLLM 的现实数据集和基准，以前的工作都依赖于人工构建的数据集，无法捕捉真实场景中的属性。为了解决这个问题，我们提出了 FedLLM-Bench，它涉及 8 种训练方法、4 个训练数据集和 6 个评估指标，为 FedLLM 社区提供全面的测试平台。FedLLM-Bench 包含三个用于联邦指令调整的数据集（例如，用户注释的多语言数据集）和一个用于联邦偏好对齐的数据集（例如，用户注释的偏好数据集），其客户端数量范围从 38 到 747。我们的数据集包含几个代表性的多样性：语言、质量、数量、指令、长度、嵌入和偏好，捕捉真实场景中的属性。基于 FedLLM-Bench，我们在所有数据集上进行实验，以对现有的 FL 方法进行基准测试并提供经验见解（例如，多语言协作）。我们相信我们的 FedLLM-Bench 可以减少所需的工作量、提供实用的测试平台并促进公平比较，从而使 FedLLM 社区受益。代码和数据集可在此 https URL 上找到。

Title: Do Language Models Exhibit Human-like Structural Priming Effects?

Authors: Jaap Jumelet, Willem Zuidema, Arabella Sinclair
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04847
Pdf URL: https://arxiv.org/pdf/2406.04847
Copy Paste: [[2406.04847]] Do Language Models Exhibit Human-like Structural Priming Effects?(https://arxiv.org/abs/2406.04847)
Keywords: language model
Abstract: We explore which linguistic factors -- at the sentence and token level -- play an important role in influencing language model predictions, and investigate whether these are reflective of results found in humans and human corpora (Gries and Kootstra, 2017). We make use of the structural priming paradigm, where recent exposure to a structure facilitates processing of the same structure. We don't only investigate whether, but also where priming effects occur, and what factors predict them. We show that these effects can be explained via the inverse frequency effect, known in human priming, where rarer elements within a prime increase priming effects, as well as lexical dependence between prime and target. Our results provide an important piece in the puzzle of understanding how properties within their context affect structural prediction in language models.
摘要：我们探索哪些语言因素（在句子和标记级别）在影响语言模型预测方面发挥重要作用，并研究这些因素是否反映了人类和人类语料库中发现的结果（Gries 和 Kootstra，2017 年）。我们利用结构启动范式，即最近接触某个结构会促进对同一结构的处理。我们不仅研究启动效应是否发生，还研究启动效应发生的位置，以及哪些因素可以预测它们。我们表明，这些影响可以通过人类启动中已知的逆频率效应来解释，其中启动中较稀有的元素会增加启动效应，以及启动和目标之间的词汇依赖性。我们的研究结果为理解其上下文中的属性如何影响语言模型中的结构预测提供了重要的线索。

Title: Uncertainty Aware Learning for Language Model Alignment

Authors: Yikun Wang, Rui Zheng, Liang Ding, Qi Zhang, Dahua Lin, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04854
Pdf URL: https://arxiv.org/pdf/2406.04854
Copy Paste: [[2406.04854]] Uncertainty Aware Learning for Language Model Alignment(https://arxiv.org/abs/2406.04854)
Keywords: language model, llm
Abstract: As instruction-tuned large language models (LLMs) evolve, aligning pretrained foundation models presents increasing challenges. Existing alignment strategies, which typically leverage diverse and high-quality data sources, often overlook the intrinsic uncertainty of tasks, learning all data samples equally. This may lead to suboptimal data efficiency and model performance. In response, we propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios, by introducing the sample uncertainty (elicited from more capable LLMs). We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Analysis shows that our UAL indeed facilitates better token clustering in the feature space, validating our hypothesis. Extensive experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning. Notably, LLMs aligned in a mixed scenario have achieved an average improvement of 10.62% on high-entropy tasks (i.e., AlpacaEval leaderboard), and 1.81% on complex low-entropy tasks (i.e., MetaMath and GSM8K).
摘要：随着指令调优的大型语言模型 (LLM) 的发展，对齐预训练的基础模型面临着越来越大的挑战。现有的对齐策略通常利用多样化和高质量的数据源，但往往会忽略任务的内在不确定性，平等地学习所有数据样本。这可能会导致数据效率和模型性能不理想。作为回应，我们提出了不确定性感知学习 (UAL)，通过引入样本不确定性（从功能更强大的 LLM 中引出），来改善不同任务场景的模型对齐。我们以一种简单的方式实现 UAL——根据单个样本的不确定性自适应地设置训练的标签平滑值。分析表明，我们的 UAL 确实有助于在特征空间中更好地进行标记聚类，从而验证了我们的假设。在广泛使用的基准上进行的大量实验表明，我们的 UAL 显着且持续地优于标准监督微调。值得注意的是，在混合场景中对齐的 LLM 在高熵任务（即 AlpacaEval 排行榜）上实现了平均 10.62% 的提升，在复杂的低熵任务（即 MetaMath 和 GSM8K）上实现了 1.81% 的提升。
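
The idea of setting the label smoothing value per sample according to its uncertainty can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear mapping in `smoothing_from_uncertainty` and the cap `max_smoothing=0.2` are hypothetical, and the abstract does not specify how uncertainty is turned into a smoothing value.

```python
import numpy as np


def smoothed_cross_entropy(logits, targets, smoothing):
    """Cross-entropy with a separate label-smoothing value per sample.

    logits:    (n, k) array of unnormalized scores
    targets:   (n,) array of integer class indices
    smoothing: (n,) array of per-sample smoothing values in [0, 1)
    """
    n, k = logits.shape
    # Numerically stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Soft targets: s / k on every class, plus (1 - s) on the gold class
    soft = np.tile((smoothing / k)[:, None], (1, k))
    soft[np.arange(n), targets] += 1.0 - smoothing
    return -(soft * log_probs).sum(axis=1)


def smoothing_from_uncertainty(uncertainty, max_smoothing=0.2):
    """Hypothetical mapping: scale uncertainty in [0, 1] linearly to a smoothing value."""
    return max_smoothing * np.clip(uncertainty, 0.0, 1.0)


# Two identical predictions; the second sample is marked as maximally uncertain
logits = np.array([[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
targets = np.array([0, 0])
uncertainty = np.array([0.0, 1.0])
losses = smoothed_cross_entropy(logits, targets, smoothing_from_uncertainty(uncertainty))
```

With zero uncertainty the loss is ordinary cross-entropy; for the uncertain sample part of the target mass is moved to the other classes, so a confident prediction is penalized more.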

Title: ComplexTempQA: A Large-Scale Dataset for Complex Temporal Question Answering

Authors: Raphael Gruber, Abdelrahman Abdallah, Michael Färber, Adam Jatowt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04866
Pdf URL: https://arxiv.org/pdf/2406.04866
Copy Paste: [[2406.04866]] ComplexTempQA: A Large-Scale Dataset for Complex Temporal Question Answering(https://arxiv.org/abs/2406.04866)
Keywords: language model
Abstract: We introduce ComplexTempQA, a large-scale dataset consisting of over 100 million question-answer pairs designed to tackle the challenges in temporal question answering. ComplexTempQA significantly surpasses existing benchmarks like HOTPOTQA, TORQUE, and TEQUILA in scale and scope. Utilizing data from Wikipedia and Wikidata, the dataset covers questions spanning over two decades and offers an unmatched breadth of topics. We introduce a unique taxonomy that categorizes questions as attributes, comparisons, and counting questions, each revolving around events, entities, and time periods. One standout feature of ComplexTempQA is the high complexity of its questions, which demand effective capabilities for answering such as across-time comparison, temporal aggregation, and multi-hop reasoning involving temporal event ordering and entity recognition. Additionally, each question is accompanied by detailed metadata, including specific time scopes, allowing for comprehensive evaluation and enhancement of the temporal reasoning abilities of large language models. ComplexTempQA serves both as a testing ground for developing sophisticated AI models and as a foundation for advancing research in question answering, information retrieval, and language understanding. Dataset and code are freely available at: this https URL.
摘要：我们推出了 ComplexTempQA，这是一个由超过 1 亿个问答对组成的大型数据集，旨在解决时间问答中的挑战。ComplexTempQA 在规模和范围上大大超越了 HOTPOTQA、TORQUE 和 TEQUILA 等现有基准。该数据集利用来自维基百科和 Wikidata 的数据，涵盖了超过二十年的问题，并提供了无与伦比的主题广度。我们引入了一种独特的分类法，将问题分为属性、比较和计数问题，每个问题都围绕事件、实体和时间段展开。ComplexTempQA 的一个突出特点是其问题非常复杂，这需要有效的回答能力，例如跨时间比较、时间聚合以及涉及时间事件排序和实体识别的多跳推理。此外，每个问题都附有详细的元数据，包括特定的时间范围，从而可以全面评估和增强大型语言模型的时间推理能力。 ComplexTempQA 既是开发复杂 AI 模型的试验场，也是推进问答、信息检索和语言理解研究的基础。数据集和代码可在此 https URL 免费获取。

Title: A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques

Authors: Megh Thakkar, Quentin Fournier, Matthew D Riemer, Pin-Yu Chen, Amal Zouaq, Payel Das, Sarath Chandar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04879
Pdf URL: https://arxiv.org/pdf/2406.04879
Copy Paste: [[2406.04879]] A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques(https://arxiv.org/abs/2406.04879)
Keywords: language model, llm
Abstract: Large language models are first pre-trained on trillions of tokens and then instruction-tuned or aligned to specific preferences. While pre-training remains out of reach for most researchers due to the compute required, fine-tuning has become affordable thanks to parameter-efficient methods such as LoRA and QLoRA. Alignment is known to be sensitive to the many factors involved, including the quantity and quality of data, the alignment method, and the adapter rank. However, there has not yet been an extensive study of their effect on downstream performance. To address this gap, we conduct an in-depth investigation of the impact of popular choices for three crucial axes: (i) the alignment dataset (HH-RLHF and BeaverTails), (ii) the alignment technique (SFT and DPO), and (iii) the model (LLaMA-1, Vicuna-v1.3, Mistral-7b, and Mistral-7b-Instruct). Our extensive setup spanning over 300 experiments reveals consistent trends and unexpected findings. We observe how more informative data helps with preference alignment, cases where supervised fine-tuning outperforms preference optimization, and how aligning to a distinct preference boosts performance on downstream tasks. Through our in-depth analyses, we put forward key guidelines to help researchers perform more effective parameter-efficient LLM alignment.
摘要：大型语言模型首先在数万亿个 token 上进行预训练，然后根据特定偏好进行指令调整或对齐。虽然由于需要计算，大多数研究人员无法进行预训练，但由于 LoRA 和 QLoRA 等参数高效的方法，微调已经变得经济实惠。众所周知，对齐对所涉及的许多因素都很敏感，包括数据的数量和质量、对齐方法和适配器等级。然而，目前还没有对它们对下游性能的影响进行广泛的研究。为了解决这一差距，我们对三个关键轴的流行选择的影响进行了深入调查：(i) 对齐数据集 (HH-RLHF 和 BeaverTails)、(ii) 对齐技术 (SFT 和 DPO) 和 (iii) 模型 (LLaMA-1、Vicuna-v1.3、Mistral-7b 和 Mistral-7b-Instruct)。我们广泛的设置涵盖了 300 多个实验，揭示了一致的趋势和意想不到的发现。我们观察到，信息量更大的数据如何有助于偏好对齐，监督微调优于偏好优化的情况，以及如何根据特定偏好进行对齐以提升下游任务的性能。通过深入分析，我们提出了关键指导方针，以帮助研究人员进行更有效的参数高效的 LLM 对齐。

Title: Through the Thicket: A Study of Number-Oriented LLMs derived from Random Forest Models

Authors: Michał Romaszewski, Przemysław Sekuła, Przemysław Głomb, Michał Cholewa, Katarzyna Kołodziej
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04926
Pdf URL: https://arxiv.org/pdf/2406.04926
Copy Paste: [[2406.04926]] Through the Thicket: A Study of Number-Oriented LLMs derived from Random Forest Models(https://arxiv.org/abs/2406.04926)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown exceptional performance in text processing. Notably, LLMs can synthesize information from large datasets and explain their decisions similarly to human reasoning through a chain of thought (CoT). An emerging application of LLMs is the handling and interpreting of numerical data, where fine-tuning enhances their performance over basic inference methods. This paper proposes a novel approach to training LLMs using knowledge transfer from a random forest (RF) ensemble, leveraging its efficiency and accuracy. By converting RF decision paths into natural language statements, we generate outputs for LLM fine-tuning, enhancing the model's ability to classify and explain its decisions. Our method includes verifying these rules through established classification metrics, ensuring their correctness. We also examine the impact of preprocessing techniques on the representation of numerical data and their influence on classification accuracy and rule correctness.
摘要：大型语言模型 (LLM) 在文本处理方面表现出色。值得注意的是，LLM 可以从大型数据集中合成信息，并通过思维链 (CoT) 以类似于人类推理的方式解释其决策。LLM 的一个新兴应用是处理和解释数值数据，其中微调可以提高其相对于基本推理方法的性能。本文提出了一种使用随机森林 (RF) 集成的知识迁移来训练 LLM 的新方法，利用其效率和准确性。通过将 RF 决策路径转换为自然语言语句，我们为 LLM 微调生成输出，从而增强模型分类和解释其决策的能力。我们的方法包括通过已建立的分类指标验证这些规则，确保其正确性。我们还研究了预处理技术对数值数据表示的影响及其对分类准确性和规则正确性的影响
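
Converting a decision path into a natural language statement can be illustrated with a few lines of plain Python. This is a sketch only: the tuple representation of a path, the function name `path_to_statement`, the sentence template, and the feature names are all assumptions, and the paper's actual wording of the generated rules may differ.

```python
def path_to_statement(conditions, prediction):
    """Render one decision path as an English sentence.

    conditions: list of (feature_name, operator, threshold) tuples,
                where operator is "<=" or ">"
    prediction: class label reached at the leaf
    """
    clauses = [
        f"{name} is {'at most' if op == '<=' else 'greater than'} {thr:g}"
        for name, op, thr in conditions
    ]
    return f"If {' and '.join(clauses)}, then the class is {prediction}."


# Hypothetical path followed by one tree of the forest for one sample
path = [("petal_length", "<=", 2.45), ("sepal_width", ">", 3.0)]
statement = path_to_statement(path, "setosa")
print(statement)
```

Statements of this kind, one per tree and sample, could then serve as target outputs when fine-tuning the LLM.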

Title: TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models

Authors: Ping Yu, Kaitao Song, Fengchen He, Ming Chen, Jianfeng Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04941
Pdf URL: https://arxiv.org/pdf/2406.04941
Copy Paste: [[2406.04941]] TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models(https://arxiv.org/abs/2406.04941)
Keywords: language model, llm
Abstract: The recent unprecedented advancements in Large Language Models (LLMs) have propelled the medical community by establishing advanced medical-domain models. However, due to the limited collection of medical datasets, there are only a few comprehensive benchmarks available to gauge progress in this area. In this paper, we introduce a new medical question-answering (QA) dataset, called TCMD, that contains extensive manual instruction for solving Traditional Chinese Medicine (TCM) examination tasks. Specifically, TCMD collects a large number of questions across diverse domains with their annotated medical subjects and thus supports us in comprehensively assessing the capability of LLMs in the TCM domain. Extensive evaluation of various general LLMs and medical-domain-specific LLMs is conducted. Moreover, we also analyze the robustness of current LLMs in solving TCM QA tasks by introducing randomness. The inconsistency of the experimental results also reveals the shortcomings of current LLMs in solving QA tasks. We also expect that our dataset can further facilitate the development of LLMs in the TCM area.
摘要：大型语言模型 (LLM) 近年来取得了前所未有的进步，通过建立先进的医学领域模型推动了医学界的发展。然而，由于医学数据集的收集有限，只有少数全面的基准可用于衡量该领域的进展。在本文中，我们介绍了一个新的医学问答 (QA) 数据集，其中包含大量用于解决传统中医检查任务的手动指令，称为 TCMD。具体来说，我们的 TCMD 收集了不同领域的大量问题及其注释的医学主题，从而支持我们全面评估 LLM 在中医领域的能力。对各种通用 LLM 和医学领域特定的 LLM 进行了广泛的评估。此外，我们还通过引入随机性来分析当前 LLM 在解决中医 QA 任务中的稳健性。实验结果的不一致性也揭示了当前 LLM 在解决 QA 任务方面的缺点。我们还希望我们的数据集能够进一步促进中医领域 LLM 的发展。

Title: BAMO at SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

Authors: Baktash Ansari, Mohammadmostafa Rostamkhani, Sauleh Eetemadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] BAMO at SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense(https://arxiv.org/abs/)
Keywords: language model, gpt, prompt, agent
Abstract: This paper outlines our approach to SemEval 2024 Task 9, BRAINTEASER: A Novel Task Defying Common Sense. The task aims to evaluate the ability of language models to think creatively. The dataset comprises multi-choice questions that challenge models to think "outside of the box". We fine-tune 2 models, BERT and RoBERTa Large. Next, we employ a Chain of Thought (CoT) zero-shot prompting approach with 6 large language models, such as GPT-3.5, Mixtral, and Llama2. Finally, we utilize ReConcile, a technique that employs a "round table conference" approach with multiple agents for zero-shot learning, to generate consensus answers among 3 selected language models. Our best method achieves an overall accuracy of 85 percent on the sentence puzzles subtask.
摘要：本文概述了我们完成 SemEval 2024 任务 9 BRAINTEASER：一项违背常识的新任务的方法。该任务旨在评估语言模型的创造性思维能力。数据集包含多项选择题，挑战模型“跳出框框”思考。我们对 BERT 和 RoBERTa Large 两个模型进行了微调。接下来，我们采用思路链 (CoT) 零样本提示方法，使用 6 个大型语言模型，例如 GPT-3.5、Mixtral 和 Llama2。最后，我们利用 ReConcile，一种采用“圆桌会议”方法与多个代理进行零样本学习的技术，在 3 个选定的语言模型之间生成共识答案。我们的最佳方法在句子谜题子任务上实现了 85% 的整体准确率。

Title: Quantifying Geospatial in the Common Crawl Corpus

Authors: Ilya Ilyankou, Meihui Wang, James Haworth, Stefano Cavazzi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04952
Pdf URL: https://arxiv.org/pdf/2406.04952
Copy Paste: [[2406.04952]] Quantifying Geospatial in the Common Crawl Corpus(https://arxiv.org/abs/2406.04952)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that between 1 in 5 and 1 in 6 documents contain geospatial information such as coordinates and street addresses. Our findings provide quantitative insights into the nature and extent of geospatial data within Common Crawl, and web crawl data in general. Furthermore, we formulate questions to guide future investigations into the geospatial content of available web crawl datasets and its influence on LLMs.
摘要：大型语言模型 (LLM) 展现出新兴的地理空间功能，这源于它们对大量未标记文本数据集的预训练，这些数据集通常来自 Common Crawl 语料库。然而，CC 中的地理空间内容在很大程度上仍未被探索，影响了我们对 LLM 空间推理的理解。本文使用强大的语言模型 Gemini 研究了最近发布的 Common Crawl 中地理空间数据的流行程度。通过分析文档样本并手动修改结果，我们估计每 5 份到每 6 份文档中就有 1 份包含地理空间信息，例如坐标和街道地址。我们的研究结果为 Common Crawl 以及一般网络爬虫数据的地理空间数据的性质和范围提供了定量见解。此外，我们提出了问题来指导未来对可用网络爬虫数据集的地理空间内容及其对 LLM 的影响的研究。

Title: MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter

Authors: Jitai Hao, WeiWei Sun, Xin Xin, Qi Meng, Zhumin Chen, Pengjie Ren, Zhaochun Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04984
Pdf URL: https://arxiv.org/pdf/2406.04984
Copy Paste: [[2406.04984]] MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter(https://arxiv.org/abs/2406.04984)
Keywords: language model, llm
Abstract: Parameter-Efficient Fine-tuning (PEFT) facilitates the fine-tuning of Large Language Models (LLMs) under limited resources. However, the fine-tuning performance with PEFT on complex, knowledge-intensive tasks is limited due to the constrained model capacity, which originates from the limited number of additional trainable parameters. To overcome this limitation, we introduce a novel mechanism that fine-tunes LLMs with larger yet memory-efficient adapters. This is achieved by leveraging the inherent activation sparsity in the Feed-Forward Networks (FFNs) of LLMs and utilizing the larger capacity of Central Processing Unit (CPU) memory compared to that of the Graphics Processing Unit (GPU). We store and update the parameters of larger adapters on the CPU. Moreover, we employ a Mixture of Experts (MoE)-like architecture to mitigate unnecessary CPU computations and reduce the communication volume between the GPU and CPU. This is particularly beneficial over the limited bandwidth of PCI Express (PCIe). Our method can achieve fine-tuning results comparable to those obtained with larger memory capacities, even when operating under more limited resources such as a single GPU with 24GB of memory, with acceptable loss in training efficiency. Our code is available at this https URL.
摘要：参数高效微调 (PEFT) 有助于在有限资源下对大型语言模型 (LLM) 进行微调。然而，由于模型容量受限，PEFT 在复杂、知识密集型任务上的微调性能有限，这源于有限数量的额外可训练参数。为了克服这一限制，我们引入了一种新颖的机制，该机制使用更大但内存高效的适配器对 LLM 进行微调。这是通过利用 LLM 前馈网络 (FFN) 中固有的激活稀疏性并利用与图形处理单元 (GPU) 相比更大的中央处理单元 (CPU) 内存容量来实现的。我们在 CPU 上存储和更新较大适配器的参数。此外，我们采用类似混合专家 (MoE) 的架构来减轻不必要的 CPU 计算并减少 GPU 和 CPU 之间的通信量。这在 PCI Express (PCIe) 的有限带宽上尤其有益。我们的方法可以实现与使用更大内存容量所获得的结果相当的微调结果，即使在资源更有限的情况下（例如 24GB 内存单 GPU 设置）也可以实现，并且训练效率的损失是可以接受的。我们的代码可在此 https URL 上找到。
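
Why activation sparsity reduces the number of parameters that need to be touched per step can be seen in a small NumPy sketch. This is not the MEFT implementation: the top-k selection rule, the function name `sparse_ffn`, and the shapes are assumptions made for illustration, and no CPU/GPU transfer is actually performed here.

```python
import numpy as np


def sparse_ffn(x, w_in, w_out, k):
    """Feed-forward pass that keeps only the k largest hidden activations.

    x:     (d,) input vector
    w_in:  (h, d) first-layer weights
    w_out: (h, d) second-layer weights, one row per hidden neuron
    """
    hidden = np.maximum(w_in @ x, 0.0)     # ReLU activations, many of them zero
    active = np.argsort(hidden)[-k:]       # indices of the k largest activations
    # Only the rows of w_out for active neurons are read; in a CPU-offloaded
    # setup the remaining rows would not need to be moved to the GPU.
    return hidden[active] @ w_out[active], active


rng = np.random.default_rng(0)
d, h = 8, 64
x = rng.normal(size=d)
w_in, w_out = rng.normal(size=(h, d)), rng.normal(size=(h, d))

dense = np.maximum(w_in @ x, 0.0) @ w_out
full, _ = sparse_ffn(x, w_in, w_out, k=h)         # k = h recovers the dense output
approx, active = sparse_ffn(x, w_in, w_out, k=16) # reads only 16 of 64 rows
```

Neurons with zero activation contribute nothing to the output, so skipping them is exact; dropping small nonzero activations as well trades a little accuracy for less data movement.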

Title: Language models emulate certain cognitive profiles: An investigation of how predictability measures interact with individual differences

Authors: Patrick Haller, Lena S. Bolliger, Lena A. Jäger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04988
Pdf URL: https://arxiv.org/pdf/2406.04988
Copy Paste: [[2406.04988]] Language models emulate certain cognitive profiles: An investigation of how predictability measures interact with individual differences(https://arxiv.org/abs/2406.04988)
Keywords: language model
Abstract: To date, most investigations on surprisal and entropy effects in reading have been conducted on the group level, disregarding individual differences. In this work, we revisit the predictive power of surprisal and entropy measures estimated from a range of language models (LMs) on data of human reading times as a measure of processing effort by incorporating information of language users' cognitive capacities. To do so, we assess the predictive power of surprisal and entropy estimated from generative LMs on reading data obtained from individuals who also completed a wide range of psychometric tests. Specifically, we investigate if modulating surprisal and entropy relative to cognitive scores increases prediction accuracy of reading times, and we examine whether LMs exhibit systematic biases in the prediction of reading times for cognitively high- or low-performing groups, revealing what type of psycholinguistic subject a given LM emulates. Our study finds that in most cases, incorporating cognitive capacities increases predictive power of surprisal and entropy on reading times, and that generally, high performance in the psychometric tests is associated with lower sensitivity to predictability effects. Finally, our results suggest that the analyzed LMs emulate readers with lower verbal intelligence, suggesting that for a given target group (i.e., individuals with high verbal intelligence), these LMs provide less accurate predictability estimates.
摘要：到目前为止，大多数关于阅读中惊奇和熵效应的研究都是在群体层面进行的，忽略了个体差异。在这项工作中，我们通过结合语言使用者的认知能力信息，重新审视了从一系列语言模型 (LM) 估计的惊奇和熵度量对人类阅读时间数据的预测能力，将其作为处理工作量的衡量标准。为此，我们评估了从生成性 LM 估计的惊奇和熵对从完成各种心理测试的个人获得的阅读数据的预测能力。具体来说，我们调查调节相对于认知分数的惊奇和熵是否会提高阅读时间的预测准确性，并检查 LM 是否在预测认知表现高或低的群体的阅读时间时表现出系统性偏差，从而揭示给定 LM 模拟哪种类型的心理语言学主题。我们的研究发现，在大多数情况下，结合认知能力可以提高惊讶和熵对阅读时间的预测能力，并且一般来说，心理测试中的高绩效与对可预测性效应的较低敏感度相关。最后，我们的结果表明，所分析的 LM 模仿了语言智力较低的读者，这表明对于给定的目标群体（即语言智力高的人），这些 LM 提供的可预测性估计不太准确。
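
For readers unfamiliar with the two predictability measures, surprisal is the negative log-probability of the word that actually occurs, and entropy is the expected surprisal over the model's next-word distribution. The sketch below computes both in bits; the toy distribution and the function names are hypothetical and stand in for probabilities that would come from a language model.

```python
import math


def surprisal(probs, token):
    """Surprisal in bits of the observed token: -log2 p(token | context)."""
    return -math.log2(probs[token])


def entropy(probs):
    """Shannon entropy in bits of the next-token distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Hypothetical next-token distribution for some context
next_token = {"cat": 0.5, "dog": 0.25, "fish": 0.25}

s = surprisal(next_token, "dog")   # 2.0 bits
h = entropy(next_token)            # 1.5 bits
```

Surprisal depends on which word was actually read, whereas entropy is a property of the context alone, which is why the two can affect reading times differently.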

Title: Compositional Generalization with Grounded Language Models

Authors: Sondre Wold, Étienne Simon, Lucas Georges Gabriel Charpentier, Egor V. Kostylev, Erik Velldal, Lilja Øvrelid
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04989
Pdf URL: https://arxiv.org/pdf/2406.04989
Copy Paste: [[2406.04989]] Compositional Generalization with Grounded Language Models(https://arxiv.org/abs/2406.04989)
Keywords: language model
Abstract: Grounded language models use external sources of information, such as knowledge graphs, to meet some of the general challenges associated with pre-training. By extending previous work on compositional generalization in semantic parsing, we allow for a controlled evaluation of the degree to which these models learn and generalize from patterns in knowledge graphs. We develop a procedure for generating natural language questions paired with knowledge graphs that targets different aspects of compositionality and further avoids grounding the language models in information already encoded implicitly in their weights. We evaluate existing methods for combining language models with knowledge graphs and find them to struggle with generalization to sequences of unseen lengths and to novel combinations of seen base components. While our experimental results provide some insight into the expressive power of these models, we hope our work and released datasets motivate future research on how to better combine language models with structured knowledge representations.
摘要：扎根语言模型使用外部信息源（例如知识图谱）来应对与预训练相关的一些一般挑战。通过扩展先前关于语义解析中组合泛化的工作，我们可以对这些模型从知识图谱中的模式中学习和泛化的程度进行受控评估。我们开发了一种生成与知识图谱配对的自然语言问题的程序，该程序针对组合性的不同方面，并进一步避免将语言模型扎根于已隐式编码在其权重中的信息中。我们评估了现有的将语言模型与知识图谱相结合的方法，发现它们难以泛化到看不见的长度序列和可见基础组件的新组合。虽然我们的实验结果提供了一些关于这些模型的表达能力的见解，但我们希望我们的工作和发布的数据集能够激发未来关于如何更好地将语言模型与结构化知识表示相结合的研究。

Title: Scenarios and Approaches for Situated Natural Language Explanations

Authors: Pengshuo Qiu, Frank Rudzicz, Zining Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05035
Pdf URL: https://arxiv.org/pdf/2406.05035
Copy Paste: [[2406.05035]] Scenarios and Approaches for Situated Natural Language Explanations(https://arxiv.org/abs/2406.05035)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) can be used to generate natural language explanations (NLE) that are adapted to different users' situations. However, there is yet to be a quantitative evaluation of the extent of such adaptation. To bridge this gap, we collect a benchmarking dataset, Situation-Based Explanation (SBE). This dataset contains 100 explanandums. Each explanandum is paired with explanations targeted at three distinct audience types (such as educators, students, and professionals), enabling us to assess how well the explanations meet the specific informational needs and contexts of these diverse groups, e.g., students, teachers, and parents. For each "explanandum paired with an audience" situation, we include a human-written explanation. These allow us to compute scores that quantify how the LLMs adapt the explanations to the situations. On an array of pretrained language models with varying sizes, we examine three categories of prompting methods: rule-based prompting, meta-prompting, and in-context learning prompting. We find that 1) language models can generate prompts that result in explanations more precisely aligned with the target situations, 2) explicitly modeling an "assistant" persona by prompting "You are a helpful assistant..." is not a necessary prompt technique for situated NLE tasks, and 3) in-context learning prompts can only help LLMs learn the demonstration template but cannot improve their inference performance. SBE and our analysis facilitate future research towards generating situated natural language explanations.
摘要：大型语言模型 (LLM) 可用于生成适应不同用户情况的自然语言解释 (NLE)。但是，目前尚未对这种适应程度进行定量评估。为了弥补这一差距，我们收集了一个基准数据集，即基于情境的解释。该数据集包含 100 个解释项。每个解释项都与针对三种不同受众类型（例如教育工作者、学生和专业人士）的解释配对，使我们能够评估解释如何满足这些不同群体（例如学生、教师和家长）的特定信息需求和背景。对于每个“与受众配对的解释项”情况，我们都包括人工编写的解释。这使我们能够计算出量化 LLM 如何根据情况调整解释的分数。在一系列大小各异的预训练语言模型上，我们研究了三类提示方法：基于规则的提示、元提示和情境学习提示。我们发现：1）语言模型可以生成提示，从而产生与目标情况更精确一致的解释；2）通过提示“您是一位乐于助人的助手……”明确地模拟“助手”角色并不是情境化 NLE 任务的必要提示技术；3）上下文学习提示只能帮助 LLM 学习演示模板，但无法提高其推理性能。SBE 和我们的分析有助于未来研究生成情境化自然语言解释。

Title: Are Large Language Models More Empathetic than Humans?

Authors: Anuradha Welivita, Pearl Pu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05063
Pdf URL: https://arxiv.org/pdf/2406.05063
Copy Paste: [[2406.05063]] Are Large Language Models More Empathetic than Humans?(https://arxiv.org/abs/2406.05063)
Keywords: language model, gpt, llm, prompt, chat
Abstract: With the emergence of large language models (LLMs), investigating if they can surpass humans in areas such as emotion recognition and empathetic responding has become a focal point of research. This paper presents a comprehensive study exploring the empathetic responding capabilities of four state-of-the-art LLMs: GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8x7B-Instruct in comparison to a human baseline. We engaged 1,000 participants in a between-subjects user study, assessing the empathetic quality of responses generated by humans and the four LLMs to 2,000 emotional dialogue prompts meticulously selected to cover a broad spectrum of 32 distinct positive and negative emotions. Our findings reveal a statistically significant superiority of the empathetic responding capability of LLMs over humans. GPT-4 emerged as the most empathetic, marking an approximately 31% increase in responses rated as "Good" compared to the human benchmark. It was followed by LLaMA-2, Mixtral-8x7B, and Gemini-Pro, which showed increases of approximately 24%, 21%, and 10% in "Good" ratings, respectively. We further analyzed the response ratings at a finer granularity and discovered that some LLMs are significantly better at responding to specific emotions compared to others. The suggested evaluation framework offers a scalable and adaptable approach for assessing the empathy of new LLMs, avoiding the need to replicate this study's findings in future research.
摘要：随着大型语言模型 (LLM) 的出现，研究它们是否能在情绪识别和共情反应等领域超越人类已成为研究的焦点。本文介绍了一项全面的研究，探讨了四种最先进的 LLM：GPT-4、LLaMA-2-70B-Chat、Gemini-1.0-Pro 和 Mixtral-8x7B-Instruct 的共情反应能力，并与人类基线进行了比较。我们让 1,000 名参与者参与了一项受试者间用户研究，评估了人类和四种 LLM 对 2,000 个情感对话提示产生的反应的共情质量，这些提示经过精心挑选，涵盖了 32 种不同的积极和消极情绪。我们的研究结果表明，LLM 的共情反应能力在统计上显著优于人类。GPT-4 成为最具共情能力的模型，与人类基准相比，被评为“良好”的回应增加了约 31%。紧随其后的是 LLaMA-2、Mixtral-8x7B 和 Gemini-Pro，它们的“良好”评分分别增加了约 24%、21% 和 10%。我们进一步以更精细的粒度分析了响应评级，发现一些 LLM 在响应特定情绪方面比其他 LLM 明显更好。建议的评估框架提供了一种可扩展且适应性强的方法来评估新 LLM 的同理心，避免了在未来的研究中重复这项研究的结果。

Title: SUMIE: A Synthetic Benchmark for Incremental Entity Summarization

Authors: Eunjeong Hwang, Yichao Zhou, Beliz Gunel, James Bradley Wendt, Sandeep Tata
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05079
Pdf URL: https://arxiv.org/pdf/2406.05079
Copy Paste: [[2406.05079]] SUMIE: A Synthetic Benchmark for Incremental Entity Summarization(https://arxiv.org/abs/2406.05079)
Keywords: language model, llm
Abstract: No existing dataset adequately tests how well language models can incrementally update entity summaries - a crucial ability as these models rapidly advance. The Incremental Entity Summarization (IES) task is vital for maintaining accurate, up-to-date knowledge. To address this, we introduce SUMIE, a fully synthetic dataset designed to expose real-world IES challenges. This dataset effectively highlights problems like incorrect entity association and incomplete information presentation. Unlike common synthetic datasets, ours captures the complexity and nuances found in real-world data. We generate informative and diverse attributes, summaries, and unstructured paragraphs in sequence, ensuring high quality. The alignment between generated summaries and paragraphs exceeds 96%, confirming the dataset's quality. Extensive experiments demonstrate the dataset's difficulty - state-of-the-art LLMs struggle to update summaries with an F1 higher than 80.4%. We will open source the benchmark and the evaluation metrics to help the community make progress on IES tasks.
摘要：现有的数据集无法充分测试语言模型如何逐步更新实体摘要 - 这是这些模型快速发展过程中的一项关键能力。增量实体摘要 (IES) 任务对于保持准确、最新的知识至关重要。为了解决这个问题，我们引入了 SUMIE，这是一个完全合成的数据集，旨在揭示现实世界的 IES 挑战。这个数据集有效地突出了实体关联不正确和信息呈现不完整等问题。与常见的合成数据集不同，我们的数据集捕捉到了现实世界数据中的复杂性和细微差别。我们按顺序生成信息丰富且多样化的属性、摘要和非结构化段落，确保高质量。生成的摘要和段落之间的一致性超过 96%，证实了数据集的质量。大量实验证明了数据集的难度 - 最先进的 LLM 很难更新 F1 高于 80.4% 的摘要。我们将开源基准和评估指标，以帮助社区在 IES 任务上取得进展。

Title: Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Authors: Maciej Besta, Ales Kubicek, Roman Niggli, Robert Gerstenberger, Lucas Weitzendorf, Mingyuan Chi, Patrick Iff, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Marcin Chrapek, Michał Podstawski, Torsten Hoefler
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2406.05085
Pdf URL: https://arxiv.org/pdf/2406.05085
Copy Paste: [[2406.05085]] Multi-Head RAG: Solving Multi-Aspect Problems with LLMs(https://arxiv.org/abs/2406.05085)
Keywords: language model, llm, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of the Transformer's multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving motivation is that different attention heads can learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, synthetic datasets, and real-world use cases to demonstrate MRAG's effectiveness, showing improvements of up to 20% in relevance over standard RAG baselines. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarking tools like RAGAS as well as different classes of data stores.
摘要：检索增强生成 (RAG) 通过将文档检索到 LLM 上下文中来提供更准确和更相关的响应，增强了大型语言模型 (LLM) 的能力。现有的 RAG 解决方案并不关注可能需要获取具有实质不同内容的多个文档的查询。此类查询经常发生，但具有挑战性，因为这些文档的嵌入在嵌入空间中可能相距甚远，因此很难检索所有文档。本文介绍了多头 RAG (MRAG)，这是一种旨在通过一个简单但强大的想法解决这一差距的新方案：利用 Transformer 的多头注意层（而不是解码器层）的激活作为获取多方面文档的键。驱动动机是不同的注意头可以学习捕获不同的数据方面。利用相应的激活会产生表示数据项和查询各个方面的嵌入，从而提高复杂查询的检索准确性。我们提供了评估方法和指标、合成数据集和实际用例来证明 MRAG 的有效性，与标准 RAG 基线相比，相关性提高了 20%。MRAG 可以与现有的 RAG 框架和基准测试工具（如 RAGAS）以及不同类别的数据存储无缝集成。
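
Illustrative sketch (not from the paper): the abstract's core idea, using per-head activations as separate retrieval keys, can be shown with a toy example. The abstract does not say how the per-head results are combined, so the union-of-top-hits rule below is an assumption, as are all function names and the hand-written vectors, which stand in for real attention-head activations.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def split_heads(vec, num_heads):
    """Cut one concatenated vector into num_heads equal per-head segments."""
    if len(vec) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    size = len(vec) // num_heads
    return [vec[i * size:(i + 1) * size] for i in range(num_heads)]


def single_vector_retrieve(query_vec, doc_vecs, top_k=1):
    """Baseline: rank documents by similarity of the whole vector."""
    ranked = sorted(doc_vecs, key=lambda d: (-cosine(query_vec, doc_vecs[d]), d))
    return ranked[:top_k]


def multi_head_retrieve(query_vec, doc_vecs, num_heads, top_k=1):
    """Assumed combination rule: take the top_k hits per head, then union them."""
    q_heads = split_heads(query_vec, num_heads)
    selected = []
    for h in range(num_heads):
        ranked = sorted(
            doc_vecs,
            key=lambda d: (-cosine(q_heads[h], split_heads(doc_vecs[d], num_heads)[h]), d),
        )
        for doc_id in ranked[:top_k]:
            if doc_id not in selected:
                selected.append(doc_id)
    return selected


# Toy data: two heads of two dimensions each. The query has one aspect per head.
query = [1.0, 0.0, 0.0, 1.0]
docs = {
    "cars": [1.0, 0.0, 1.0, 0.0],   # matches the query only in head 0
    "ships": [0.0, 1.0, 0.0, 1.0],  # matches the query only in head 1
    "mixed": [1.0, 1.0, 1.0, 1.0],  # moderately similar everywhere
}

baseline_hits = single_vector_retrieve(query, docs)
multi_head_hits = multi_head_retrieve(query, docs, num_heads=2)
print(baseline_hits, multi_head_hits)
```

In this toy setup the whole-vector baseline prefers the document that is moderately similar overall, while the per-head lookup surfaces one document per aspect. This mirrors the motivation stated in the abstract and makes no claim about MRAG's actual scoring.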

Title: An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models

Authors: Xiongtao Zhou, Jie He, Yuhua Ke, Guangyao Zhu, Víctor Gutiérrez-Basulto, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05130
Pdf URL: https://arxiv.org/pdf/2406.05130
Copy Paste: [[2406.05130]] An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models(https://arxiv.org/abs/2406.05130)
Keywords: language model, llm, hallucination
Abstract: Multimodal large language models (MLLMs) fine-tuned with multimodal instruction datasets have demonstrated remarkable capabilities in multimodal tasks. However, fine-tuning all parameters of MLLMs has become challenging as they usually contain billions of parameters. To address this issue, we study parameter-efficient fine-tuning (PEFT) methods for MLLMs. We aim to identify effective methods for enhancing the performance of MLLMs in scenarios where only a limited number of parameters are trained. This paper conducts empirical studies using four popular PEFT methods to fine-tune the LLM component of open-source MLLMs. We present a comprehensive analysis that encompasses various aspects, including the impact of PEFT methods on various models, parameters and location of the PEFT module, size of fine-tuning data, model stability based on PEFT methods, MLLM's generalization, and hallucination. We evaluated four PEFT methods on seven datasets from two different categories: unseen and seen datasets. Across all experiments, we show that the adapter is the best-performing PEFT method. At the same time, fine-tuning the connector layers leads to improved performance in most MLLMs. Code and data are available at this https URL.
摘要：使用多模态指令数据集进行微调的多模态大型语言模型 (MLLM) 在多模态任务中表现出了卓越的能力。然而，微调 MLLM 的所有参数已经变得具有挑战性，因为它们通常包含数十亿个参数。为了解决这个问题，我们研究了 MLLM 的参数高效微调 (PEFT) 方法。我们的目标是在仅训练有限数量参数的情况下找到提高 MLLM 性能的有效方法。本文使用四种流行的 PEFT 方法对开源 MLLM 的 LLM 组件进行实证研究。我们提出了一个全面的分析，涵盖了各个方面，包括 PEFT 方法对各种模型的影响、PEFT 模块的参数和位置、微调数据的大小、基于 PEFT 方法的模型稳定性、MLLM 的泛化和幻觉。我们在两个不同类别的七个数据集上评估了四种 PEFT 方法：未见数据集和已见数据集。在所有实验中，我们表明适配器是性能最佳的 PEFT 方法。同时，微调连接器层可提高大多数 MLLM 的性能。代码和数据可在此 https URL 上获取。