2024-06-27

Title: Role of Dependency Distance in Text Simplification: A Human vs ChatGPT Simplification Comparison

Authors: Sumi Lee, Gondy Leroy, David Kauchak, Melissa Just
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.17787
Pdf URL: https://arxiv.org/pdf/2406.17787
Copy Paste: [[2406.17787]] Role of Dependency Distance in Text Simplification: A Human vs ChatGPT Simplification Comparison(https://arxiv.org/abs/2406.17787)
Keywords: gpt, chat
Abstract: This study investigates human and ChatGPT text simplification and its relationship to dependency distance. A set of 220 sentences, with increasing grammatical difficulty as measured in a prior user study, were simplified by a human expert and using ChatGPT. We found that the three sentence sets all differed in mean dependency distances: the highest in the original sentence set, followed by ChatGPT simplified sentences, and the human simplified sentences showed the lowest mean dependency distance.
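Summary: This study examines text simplification by humans and by ChatGPT and its relationship to dependency distance. A set of 220 sentences, whose grammatical difficulty increased as measured in a prior user study, was simplified both by a human expert and by ChatGPT. The three sentence sets all differed in mean dependency distance: the original sentences had the highest, followed by the ChatGPT-simplified sentences, and the human-simplified sentences had the lowest.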
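As a rough illustration of the metric this entry is about, the sketch below computes mean dependency distance under one common definition: the average absolute difference between each word's position and the position of its syntactic head, with the root excluded. Whether the paper uses exactly this definition is an assumption, and the `heads` encoding and the example parse are hand-made for illustration only.

```python
def mean_dependency_distance(heads):
    """Mean dependency distance of one sentence.

    heads[i] is the 1-based position of the head of token i+1;
    0 marks the root, which has no incoming dependency and is skipped.
    """
    distances = [
        abs(position - head)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    return sum(distances) / len(distances) if distances else 0.0


# Hand-made parse of "The dog chased the cat":
# The -> dog, dog -> chased, chased = root, the -> cat, cat -> chased
heads = [2, 3, 0, 5, 3]
mdd = mean_dependency_distance(heads)
print(mdd)
```

In practice the head indices would come from a dependency parser rather than being written by hand.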

Title: Spanish and LLM Benchmarks: is MMLU Lost in Translation?

Authors: Irene Plaza, Nina Melero, Cristina del Pozo, Javier Conde, Pedro Reviriego, Marina Mayor-Rocher, María Grandury
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.17789
Pdf URL: https://arxiv.org/pdf/2406.17789
Copy Paste: [[2406.17789]] Spanish and LLM Benchmarks: is MMLU Lost in Translation?(https://arxiv.org/abs/2406.17789)
Keywords: language model, gpt, llm, chat
Abstract: The evaluation of Large Language Models (LLMs) is a key element in their continuous improvement process and many benchmarks have been developed to assess the performance of LLMs in different tasks and topics. As LLMs become adopted worldwide, evaluating them in languages other than English is increasingly important. However, most LLM benchmarks are simply translated using an automated tool and then run in the target language. This means that the results depend not only on the LLM performance in that language but also on the quality of the translation. In this paper, we consider the case of the well-known Massive Multitask Language Understanding (MMLU) benchmark. Selected categories of the benchmark are translated into Spanish using Azure Translator and ChatGPT4 and run on ChatGPT4. Next, the results are processed to identify the test items that produce different answers in Spanish and English. Those are then analyzed manually to understand if the automatic translation caused the change. The results show that a significant fraction of the failing items can be attributed to mistakes in the translation of the benchmark. These results make a strong case for improving benchmarks in languages other than English by at least revising the translations of the items and preferably by adapting the tests to the target language by experts.
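Summary: Evaluating large language models (LLMs) is a key element of their continuous improvement, and many benchmarks have been developed to assess LLM performance across tasks and topics. As LLMs are adopted worldwide, evaluating them in languages other than English is increasingly important. However, most LLM benchmarks are simply translated with an automated tool and then run in the target language, so the results depend not only on the LLM's performance in that language but also on the quality of the translation. This paper considers the well-known Massive Multitask Language Understanding (MMLU) benchmark. Selected categories of the benchmark are translated into Spanish using Azure Translator and ChatGPT4 and run on ChatGPT4. The results are then processed to identify test items that produce different answers in Spanish and English, and those items are analyzed manually to determine whether the automatic translation caused the change. The results show that a significant fraction of the failing items can be attributed to errors in the translation of the benchmark. These results make a strong case for improving benchmarks in languages other than English, at least by revising the item translations and preferably by having experts adapt the tests to the target language.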

Title: Understanding the Role of User Profile in the Personalization of Large Language Models

Authors: Bin Wu, Zhengyan Shi, Hossein A. Rahmani, Varsha Ramineni, Emine Yilmaz
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2406.17803
Pdf URL: https://arxiv.org/pdf/2406.17803
Copy Paste: [[2406.17803]] Understanding the Role of User Profile in the Personalization of Large Language Models(https://arxiv.org/abs/2406.17803)
Keywords: language model, llm
Abstract: Utilizing user profiles to personalize Large Language Models (LLMs) has been shown to enhance the performance on a wide range of tasks. However, the precise role of user profiles and their effect mechanism on LLMs remains unclear. This study first confirms that the effectiveness of user profiles is primarily due to personalization information rather than semantic information. Furthermore, we investigate how user profiles affect the personalization of LLMs. Within the user profile, we reveal that it is the historical personalized response produced or approved by users that plays a pivotal role in personalizing LLMs. This discovery unlocks the potential of LLMs to incorporate a greater number of user profiles within the constraints of limited input length. As for the position of user profiles, we observe that user profiles integrated into different positions of the input context do not contribute equally to personalization. Instead, user profiles placed closer to the beginning have a greater effect on the personalization of LLMs. Our findings reveal the role of user profiles for the personalization of LLMs, and showcase how incorporating user profiles impacts performance, providing insight into how to leverage user profiles effectively.
Summary: Using user profiles to personalize large language models (LLMs) has been shown to improve performance on a wide range of tasks. However, the precise role of user profiles and the mechanism by which they affect LLMs remain unclear. This study first confirms that the effectiveness of user profiles is due primarily to personalization information rather than semantic information. It then investigates how user profiles affect the personalization of LLMs. Within the user profile, the historical personalized responses produced or approved by users play the pivotal role in personalizing LLMs. This finding unlocks the potential for LLMs to incorporate more user profiles within a limited input length. As for position, user profiles integrated at different positions in the input context do not contribute equally to personalization; profiles closer to the beginning have a greater effect. The findings reveal the role of user profiles in LLM personalization and show how incorporating them affects performance, providing insight into how to use user profiles effectively.

Title: Can LLMs Generate Visualizations with Dataless Prompts?

Authors: Darius Coelho, Harshit Barot, Naitik Rathod, Klaus Mueller
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2406.17805
Pdf URL: https://arxiv.org/pdf/2406.17805
Copy Paste: [[2406.17805]] Can LLMs Generate Visualizations with Dataless Prompts?(https://arxiv.org/abs/2406.17805)
Keywords: language model, gpt, llm, prompt
Abstract: Recent advancements in large language models have revolutionized information access, as these models harness data available on the web to address complex queries, becoming the preferred information source for many users. In certain cases, queries are about publicly available data, which can be effectively answered with data visualizations. In this paper, we investigate the ability of large language models to provide accurate data and relevant visualizations in response to such queries. Specifically, we investigate the ability of GPT-3 and GPT-4 to generate visualizations with dataless prompts, where no data accompanies the query. We evaluate the results of the models by comparing them to visualization cheat sheets created by visualization experts.
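Summary: Recent advances in large language models have transformed information access: these models draw on data available on the web to handle complex queries and have become the preferred information source for many users. In some cases, queries concern publicly available data and can be answered effectively with data visualizations. This paper investigates the ability of large language models to provide accurate data and relevant visualizations in response to such queries. Specifically, it investigates the ability of GPT-3 and GPT-4 to generate visualizations from dataless prompts, where no data accompanies the query. The models' results are evaluated by comparing them with visualization cheat sheets created by visualization experts.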

Title: MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?

Authors: Xirui Li, Hengguang Zhou, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Cho-Jui Hsieh
Subjects: cs.CL, cs.AI, cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2406.17806
Pdf URL: https://arxiv.org/pdf/2406.17806
Copy Paste: [[2406.17806]] MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?(https://arxiv.org/abs/2406.17806)
Keywords: language model, llm
Abstract: Humans are prone to cognitive distortions -- biased thinking patterns that lead to exaggerated responses to specific stimuli, albeit in very different contexts. This paper demonstrates that advanced Multimodal Large Language Models (MLLMs) exhibit similar tendencies. While these models are designed to respond to queries under safety mechanisms, they sometimes reject harmless queries in the presence of certain visual stimuli, disregarding the benign nature of their contexts. As the initial step in investigating this behavior, we identify three types of stimuli that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation. To systematically evaluate MLLMs' oversensitivity to these stimuli, we propose the Multimodal OverSenSitivity Benchmark (MOSSBench). This toolkit consists of 300 manually collected benign multimodal queries, cross-verified by third-party reviewers (AMT). Empirical studies using MOSSBench on 20 MLLMs reveal several insights: (1) Oversensitivity is prevalent among SOTA MLLMs, with refusal rates reaching up to 76% for harmless queries. (2) Safer models are more oversensitive: increasing safety may inadvertently raise caution and conservatism in the model's responses. (3) Different types of stimuli tend to cause errors at specific stages -- perception, intent reasoning, and safety judgement -- in the response process of MLLMs. These findings highlight the need for refined safety mechanisms that balance caution with contextually appropriate responses, improving the reliability of MLLMs in real-world applications. We make our project available at this https URL.
Summary: Humans are prone to cognitive distortions, biased thinking patterns that lead to exaggerated responses to specific stimuli, albeit in very different contexts. This paper shows that advanced multimodal large language models (MLLMs) exhibit similar tendencies. Although these models are designed to respond to queries under safety mechanisms, they sometimes reject harmless queries in the presence of certain visual stimuli, ignoring the benign nature of the context. As a first step in investigating this behavior, the authors identify three types of stimuli that trigger oversensitivity in existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation. To evaluate MLLMs' oversensitivity to these stimuli systematically, they propose the Multimodal OverSenSitivity Benchmark (MOSSBench), a toolkit of 300 manually collected benign multimodal queries cross-verified by third-party reviewers (AMT). Empirical studies with MOSSBench on 20 MLLMs yield several insights: (1) Oversensitivity is prevalent among SOTA MLLMs, with refusal rates for harmless queries reaching up to 76%. (2) Safer models are more oversensitive: increasing safety may inadvertently raise the caution and conservatism of the model's responses. (3) Different types of stimuli tend to cause errors at specific stages of the MLLM response process (perception, intent reasoning, and safety judgement). These findings highlight the need for refined safety mechanisms that balance caution with contextually appropriate responses, improving the reliability of MLLMs in real-world applications. The project is available at this https URL.

Title: Enhancing Commentary Strategies for Imperfect Information Card Games: A Study of Large Language Models in Guandan Commentary

Authors: Meiling Tao, Xuechen Liang, Yiling Tao, Tianyu Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.17807
Pdf URL: https://arxiv.org/pdf/2406.17807
Copy Paste: [[2406.17807]] Enhancing Commentary Strategies for Imperfect Information Card Games: A Study of Large Language Models in Guandan Commentary(https://arxiv.org/abs/2406.17807)
Keywords: language model, gpt, llm
Abstract: Recent advancements in large language models (LLMs) have unlocked the potential for generating high-quality game commentary. However, producing insightful and engaging commentary for complex games with incomplete information remains a significant challenge. In this paper, we introduce a novel commentary method that combines Reinforcement Learning (RL) and LLMs, tailored specifically for the Chinese card game Guandan. Our system leverages RL to generate intricate card-playing scenarios and employs LLMs to generate corresponding commentary text, effectively emulating the strategic analysis and narrative prowess of professional commentators. The framework comprises a state commentary guide, a Theory of Mind (ToM)-based strategy analyzer, and a style retrieval module, which seamlessly collaborate to deliver detailed and context-relevant game commentary in the Chinese language environment. We empower LLMs with ToM capabilities and refine both retrieval and information filtering mechanisms. This facilitates the generation of personalized commentary content. Our experimental results showcase the substantial enhancement in performance achieved by the proposed commentary framework when applied to open-source LLMs, surpassing the performance of GPT-4 across multiple evaluation metrics.
Summary: Recent advances in large language models (LLMs) have unlocked the potential for generating high-quality game commentary. However, producing insightful and engaging commentary for complex games with incomplete information remains a significant challenge. This paper introduces a new commentary method that combines reinforcement learning (RL) and LLMs, tailored to the Chinese card game Guandan. The system uses RL to generate intricate card-playing scenarios and LLMs to generate the corresponding commentary text, emulating the strategic analysis and narrative skill of professional commentators. The framework comprises a state commentary guide, a Theory of Mind (ToM)-based strategy analyzer, and a style retrieval module, which work together to deliver detailed, context-relevant game commentary in a Chinese-language setting. The authors give LLMs ToM capabilities and refine the retrieval and information-filtering mechanisms, which facilitates the generation of personalized commentary. Experimental results show that the proposed framework substantially improves performance when applied to open-source LLMs, surpassing GPT-4 on multiple evaluation metrics.

Title: Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache

Authors: Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.17808
Pdf URL: https://arxiv.org/pdf/2406.17808
Copy Paste: [[2406.17808]] Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache(https://arxiv.org/abs/2406.17808)
Keywords: language model, llm, long context
Abstract: The context window within a transformer provides a form of active memory for the current task, which can be useful for few-shot learning and conditional generation, both of which depend heavily on previous context tokens. However, as the context length grows, the computational cost increases quadratically. Recent works have shown that saving a few initial tokens along with a fixed-sized sliding window leads to stable streaming generation with linear complexity in transformer-based Large Language Models (LLMs). However, they make suboptimal use of the fixed window by naively evicting all tokens unconditionally from the key-value (KV) cache once they reach the end of the window, resulting in tokens being forgotten and no longer able to affect subsequent predictions. To overcome this limitation, we propose a novel mechanism for storing longer sliding window contexts with the same total cache size by keeping separate cascading sub-cache buffers whereby each subsequent buffer conditionally accepts a fraction of the relatively more important tokens evicted from the previous buffer. Our method results in a dynamic KV cache that can store tokens from the more distant past than a fixed, static sliding window approach. Our experiments show improvements of 5.6% on long context generation (LongBench), 1.2% in streaming perplexity (PG19), and 0.6% in language understanding (MMLU STEM) using LLMs given the same fixed cache size. Additionally, we provide an efficient implementation that improves the KV cache latency from 1.33ms per caching operation to 0.54ms, a 59% speedup over previous work.
Summary: The context window in a transformer provides a form of active memory for the current task, which is useful for few-shot learning and conditional generation, both of which depend heavily on previous context tokens. However, as the context length grows, the computational cost increases quadratically. Recent work has shown that, in transformer-based large language models (LLMs), keeping a few initial tokens together with a fixed-size sliding window yields stable streaming generation with linear complexity. These methods nevertheless make suboptimal use of the fixed window: they naively and unconditionally evict every token from the key-value (KV) cache once it reaches the end of the window, so tokens are forgotten and can no longer affect subsequent predictions. To overcome this limitation, the authors propose a mechanism that stores longer sliding-window contexts within the same total cache size by keeping separate cascading sub-cache buffers, where each subsequent buffer conditionally accepts a fraction of the relatively more important tokens evicted from the previous buffer. The result is a dynamic KV cache that can store tokens from further in the past than a fixed, static sliding window. Experiments with LLMs at the same fixed cache size show improvements of 5.6% on long-context generation (LongBench), 1.2% in streaming perplexity (PG19), and 0.6% in language understanding (MMLU STEM). The authors also provide an efficient implementation that reduces KV cache latency from 1.33 ms to 0.54 ms per caching operation, a 59% speedup over previous work.
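The cascading idea in this entry can be sketched with a toy buffer chain. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract says later buffers accept a fraction of the "relatively more important" evicted tokens, whereas the sketch below simply accepts every `accept_every`-th evicted token. The class name, parameters, and integer "tokens" are all hypothetical.

```python
from collections import deque


class ToyCascadingCache:
    """Chain of fixed-size FIFO buffers; tokens evicted from one
    buffer are offered to the next, which keeps only some of them."""

    def __init__(self, num_cascades, cascade_size, accept_every=2):
        self.buffers = [deque() for _ in range(num_cascades)]
        self.cascade_size = cascade_size
        self.accept_every = accept_every
        self.offered = [0] * num_cascades  # tokens offered to each buffer

    def add(self, token):
        self._offer(0, token)

    def _offer(self, level, token):
        if level >= len(self.buffers):
            return  # evicted from the last buffer: forgotten for good
        self.offered[level] += 1
        # Assumption: a fixed acceptance rate stands in for the
        # importance-based selection described in the abstract.
        if level > 0 and self.offered[level] % self.accept_every != 0:
            return
        buffer = self.buffers[level]
        buffer.append(token)
        if len(buffer) > self.cascade_size:
            self._offer(level + 1, buffer.popleft())


cache = ToyCascadingCache(num_cascades=2, cascade_size=4)
for token in range(16):
    cache.add(token)

recent = list(cache.buffers[0])  # most recent tokens, kept densely
older = list(cache.buffers[1])   # older tokens, kept sparsely
print(recent, older)
```

With the same total of eight slots, a plain sliding window over tokens 0 to 15 would retain only tokens 8 to 15, while the toy cascade still holds a token as old as 5, at the price of gaps in the older region.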

Title: Improving Arithmetic Reasoning Ability of Large Language Models through Relation Tuples, Verification and Dynamic Feedback

Authors: Zhongtao Miao, Kaiyan Zhao, Yoshimasa Tsuruoka
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.17873
Pdf URL: https://arxiv.org/pdf/2406.17873
Copy Paste: [[2406.17873]] Improving Arithmetic Reasoning Ability of Large Language Models through Relation Tuples, Verification and Dynamic Feedback(https://arxiv.org/abs/2406.17873)
Keywords: language model
Abstract: Current representations used in reasoning steps of large language models can mostly be categorized into two main types: (1) natural language, which is difficult to verify; and (2) non-natural language, usually programming code, which is difficult for people who are unfamiliar with coding to read. In this paper, we propose to use a semi-structured form to represent reasoning steps of large language models. Specifically, we use relation tuples, which are not only human-readable but also machine-friendly and easier to verify than natural language. We implement a framework that includes three main components: (1) introducing relation tuples into the reasoning steps of large language models; (2) implementing an automatic verification process of reasoning steps with a local code interpreter based on relation tuples; and (3) integrating a simple and effective dynamic feedback mechanism, which we found helpful for self-improvement of large language models. The experimental results on various arithmetic datasets demonstrate the effectiveness of our method in improving the arithmetic reasoning ability of large language models. The source code is available at this https URL.
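Summary: The representations currently used in the reasoning steps of large language models fall mostly into two main types: (1) natural language, which is difficult to verify; and (2) non-natural language, usually programming code, which is hard to read for people unfamiliar with coding. This paper proposes a semi-structured form for representing the reasoning steps of large language models. Specifically, it uses relation tuples, which are human-readable, machine-friendly, and easier to verify than natural language. The authors implement a framework with three main components: (1) introducing relation tuples into the reasoning steps of large language models; (2) automatically verifying the reasoning steps with a local code interpreter based on the relation tuples; and (3) integrating a simple and effective dynamic feedback mechanism, which they found helpful for the self-improvement of large language models. Experimental results on various arithmetic datasets demonstrate the effectiveness of the method in improving the arithmetic reasoning ability of large language models. The source code is available at this https URL.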
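To make the verification idea in this entry concrete, here is a minimal sketch assuming each reasoning step is a tuple of (variable name, arithmetic expression, value claimed by the model). The tuple layout, the function names, and the example problem are hypothetical; the paper's actual tuple format and interpreter are not given in the abstract.

```python
import ast
import operator

# Only basic arithmetic is allowed when re-computing a step.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression, env):
    """Evaluate an arithmetic expression over previously defined names."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return env[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression: " + expression)

    return walk(ast.parse(expression, mode="eval"))


def verify(steps):
    """Return (name, claimed, recomputed) for every step whose claim is wrong."""
    env, failures = {}, []
    for name, expression, claimed in steps:
        recomputed = evaluate(expression, env)
        if abs(recomputed - claimed) > 1e-9:
            failures.append((name, claimed, recomputed))
        env[name] = recomputed  # later steps build on the verified value
    return failures


# Hypothetical model output for "3 boxes of 4 apples, 5 eaten: how many left?"
steps = [
    ("apples", "3 * 4", 12),
    ("eaten", "5", 5),
    ("left", "apples - eaten", 8),  # deliberately wrong claim
]
failures = verify(steps)
print(failures)
```

A failure list like this is the kind of signal that could be fed back to the model as feedback for another attempt.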

Title: CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design

Authors: Nafis Neehal, Bowen Wang, Shayom Debopadhaya, Soham Dan, Keerthiram Murugesan, Vibha Anand, Kristin P. Bennett
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.17888
Pdf URL: https://arxiv.org/pdf/2406.17888
Copy Paste: [[2406.17888]] CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design(https://arxiv.org/abs/2406.17888)
Keywords: language model, gpt, prompt
Abstract: CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models' ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial's start from all participants. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from this http URL, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. Two LM-based evaluation methods are developed to compare the actual baseline feature lists against LM-generated responses. "ListMatch-LM" and "ListMatch-BERT" use GPT-4o and BERT scores (at various thresholds), respectively, for evaluation. To establish baseline results, advanced prompt engineering techniques using LLaMa3-70B-Instruct and GPT-4o in zero-shot and three-shot learning settings are applied to generate potential baseline features. The performance of GPT-4o as an evaluator is validated through human-in-the-loop evaluations on the CT-Pub dataset, where clinical experts confirm matches between actual and LM-generated features. The results highlight a promising direction with significant potential for improvement, positioning CTBench as a useful tool for advancing research on AI in CT design and potentially enhancing the efficacy and robustness of CTs.
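Summary: CTBench is introduced as a benchmark for assessing language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates the ability of AI models to determine the baseline features of a clinical trial (CT), which include demographic and other relevant features collected from all participants at the start of the trial. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from this http URL, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from the relevant publications. Two LM-based evaluation methods are developed to compare the actual baseline feature lists with LM-generated responses: "ListMatch-LM" and "ListMatch-BERT" use GPT-4o and BERT scores (at various thresholds), respectively. To establish baseline results, advanced prompt engineering techniques with LLaMa3-70B-Instruct and GPT-4o in zero-shot and three-shot settings are applied to generate candidate baseline features. The performance of GPT-4o as an evaluator is validated through human-in-the-loop evaluation on the CT-Pub dataset, in which clinical experts confirm matches between actual and LM-generated features. The results highlight a promising direction with significant room for improvement, positioning CTBench as a useful tool for advancing research on AI in CT design and potentially enhancing the efficacy and robustness of CTs.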

Title: PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning

Authors: Shiva Kumar Pentyala, Zhichao Wang, Bin Bi, Kiran Ramnath, Xiang-Bo Mao, Regunathan Radhakrishnan, Sitaram Asur, Na (Claire) Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.17923
Pdf URL: https://arxiv.org/pdf/2406.17923
Copy Paste: [[2406.17923]] PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning(https://arxiv.org/abs/2406.17923)
Keywords: language model, llm
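Abstract: Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. The LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications. However, this sequential training pipeline leads to an alignment tax that degrades the LLM performance. This paper introduces PAFT, a new PArallel training paradigm for effective LLM Fine-Tuning, which independently performs SFT and preference alignment (e.g., DPO and ORPO) with the same pre-trained model on respective datasets. The model produced by SFT and the model from preference alignment are then merged into a final model by parameter fusing for use in downstream applications. This work reveals important findings that preference alignment like DPO naturally results in a sparse model while SFT leads to a natural dense model which needs to be sparsified for effective model merging. This paper introduces an effective interference resolution which reduces the redundancy by sparsifying the delta parameters. The LLM resulting from the new training paradigm achieved Rank #1 on the HuggingFace Open LLM Leaderboard. Comprehensive evaluation shows the effectiveness of the parallel training paradigm.
Summary: Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment before they are usable in downstream applications. However, this sequential training pipeline leads to an alignment tax that degrades LLM performance. This paper introduces PAFT, a new parallel training paradigm for effective LLM fine-tuning, which performs SFT and preference alignment (e.g., DPO and ORPO) independently, starting from the same pre-trained model and using the respective datasets. The model produced by SFT and the model produced by preference alignment are then merged into a final model through parameter fusing for use in downstream applications. The work reports the important finding that preference alignment such as DPO naturally yields a sparse model, whereas SFT yields a naturally dense model that must be sparsified for effective model merging. The paper introduces an effective interference resolution that reduces redundancy by sparsifying the delta parameters. The LLM resulting from the new training paradigm ranked first on the HuggingFace Open LLM Leaderboard. Comprehensive evaluation shows the effectiveness of the parallel training paradigm.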
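The merging step described in this entry can be sketched on flat parameter lists. This is a minimal sketch under assumptions that go beyond the abstract: sparsification keeps the largest-magnitude delta entries, and fusing simply adds both deltas to the shared base. The function names and toy weights are hypothetical, and the paper's actual fusing method may differ.

```python
def delta(finetuned, base):
    """Per-parameter change introduced by fine-tuning."""
    return [f - b for f, b in zip(finetuned, base)]


def sparsify(deltas, keep_fraction):
    """Zero all but the largest-magnitude entries (assumed criterion).

    Entries tied with the threshold are all kept, so slightly more
    than keep_fraction of the entries may survive.
    """
    keep = max(1, round(len(deltas) * keep_fraction))
    threshold = sorted((abs(d) for d in deltas), reverse=True)[keep - 1]
    return [d if abs(d) >= threshold else 0.0 for d in deltas]


def merge(base, *delta_lists):
    """Add every delta to the shared pre-trained weights."""
    return [b + sum(ds) for b, ds in zip(base, zip(*delta_lists))]


# Toy weights (illustrative numbers, not from the paper).
base = [1.0, 1.0, 1.0, 1.0]
sft_model = [1.5, 1.125, 0.25, 1.0625]  # dense change from SFT
dpo_model = [1.0, 1.25, 1.0, 1.0]       # already sparse change

sft_delta = sparsify(delta(sft_model, base), keep_fraction=0.5)
dpo_delta = delta(dpo_model, base)
merged = merge(base, sft_delta, dpo_delta)
print(sft_delta, merged)
```

Sparsifying the dense SFT delta before adding it leaves the second parameter to be set by the preference-aligned model alone, which is the kind of interference the abstract says sparsification is meant to reduce.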

Title: Do they mean 'us'? Interpreting Referring Expressions in Intergroup Bias

Authors: Venkata S Govindarajan, Matianyu Zang, Kyle Mahowald, David Beaver, Junyi Jessy Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.17947
Pdf URL: https://arxiv.org/pdf/2406.17947
Copy Paste: [[2406.17947]] Do they mean 'us'? Interpreting Referring Expressions in Intergroup Bias(https://arxiv.org/abs/2406.17947)
Keywords: llm, prompt
Abstract: The variations between in-group and out-group speech (intergroup bias) are subtle and could underlie many social phenomena like stereotype perpetuation and implicit bias. In this paper, we model the intergroup bias as a tagging task on English sports comments from forums dedicated to fandom for NFL teams. We curate a unique dataset of over 6 million game-time comments from opposing perspectives (the teams in the game), each comment grounded in a non-linguistic description of the events that precipitated these comments (live win probabilities for each team). Expert and crowd annotations justify modeling the bias through tagging of implicit and explicit referring expressions and reveal the rich, contextual understanding of language and the world required for this task. For large-scale analysis of intergroup variation, we use LLMs for automated tagging, and discover that some LLMs perform best when prompted with linguistic descriptions of the win probability at the time of the comment, rather than numerical probability. Further, large-scale tagging of comments using LLMs uncovers linear variations in the form of referent across win probabilities that distinguish in-group and out-group utterances. Code and data are available at this https URL .
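Summary: The variations between in-group and out-group speech (intergroup bias) are subtle and may underlie many social phenomena such as stereotype perpetuation and implicit bias. This paper models intergroup bias as a tagging task on English sports comments from forums dedicated to fans of NFL teams. The authors curate a unique dataset of over 6 million game-time comments from opposing perspectives (the teams in the game), with each comment grounded in a non-linguistic description of the events that precipitated it (live win probabilities for each team). Expert and crowd annotations justify modeling the bias through tagging of implicit and explicit referring expressions and reveal the rich, contextual understanding of language and the world that the task requires. For large-scale analysis of intergroup variation, LLMs are used for automated tagging, and some LLMs perform best when prompted with a linguistic description of the win probability at the time of the comment rather than the numerical probability. Large-scale tagging of comments with LLMs further uncovers linear variations in the form of referent across win probabilities that distinguish in-group from out-group utterances. Code and data are available at this https URL.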

Title: NormTab: Improving Symbolic Reasoning in LLMs Through Tabular Data Normalization

Authors: Md Mahadi Hasan Nahid, Davood Rafiei
Subjects: cs.CL, cs.AI, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2406.17961
Pdf URL: https://arxiv.org/pdf/2406.17961
Copy Paste: [[2406.17961]] NormTab: Improving Symbolic Reasoning in LLMs Through Tabular Data Normalization(https://arxiv.org/abs/2406.17961)
Keywords: language model, llm
Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in parsing textual data and generating code. However, their performance in tasks involving tabular data, especially those requiring symbolic reasoning, faces challenges due to the structural variance and inconsistency in table cell values often found in web tables. In this paper, we introduce NormTab, a novel framework aimed at enhancing the symbolic reasoning performance of LLMs by normalizing web tables. We study table normalization as a stand-alone, one-time preprocessing step using LLMs to support symbolic reasoning on tabular data. Our experimental evaluation, conducted on challenging web table datasets such as WikiTableQuestion and TabFact, demonstrates that leveraging NormTab significantly improves symbolic reasoning performance, showcasing the importance and effectiveness of web table normalization for enhancing LLM-based symbolic reasoning tasks.
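Summary: In recent years, large language models (LLMs) have demonstrated remarkable capabilities in parsing textual data and generating code. However, their performance on tasks involving tabular data, especially those requiring symbolic reasoning, is challenged by the structural variance and inconsistent cell values often found in web tables. This paper introduces NormTab, a framework that aims to enhance the symbolic reasoning performance of LLMs by normalizing web tables. Table normalization is studied as a stand-alone, one-time preprocessing step that uses LLMs to support symbolic reasoning over tabular data. Experimental evaluation on challenging web table datasets such as WikiTableQuestion and TabFact shows that NormTab significantly improves symbolic reasoning performance, demonstrating the importance and effectiveness of web table normalization for LLM-based symbolic reasoning tasks.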

Title: SimsChat: A Customisable Persona-Driven Role-Playing Agent

Authors: Bohao Yang, Dong Liu, Chen Tang, Chenghao Xiao, Kun Zhao, Chao Li, Lin Yuan, Guang Yang, Lanxiao Huang, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.17962
Pdf URL: https://arxiv.org/pdf/2406.17962
Copy Paste: [[2406.17962]] SimsChat: A Customisable Persona-Driven Role-Playing Agent(https://arxiv.org/abs/2406.17962)
Keywords: language model, llm, chat, agent
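Abstract: Large Language Models (LLMs) possess the remarkable capability to understand human instructions and generate high-quality text, enabling them to act as agents that simulate human behaviours. This capability allows LLMs to emulate human beings in a more advanced manner, beyond merely replicating simple human behaviours. However, there has been little exploration of leveraging LLMs to craft characters from several aspects. In this work, we introduce the Customisable Conversation Agent Framework, which employs LLMs to simulate real-world characters that can be freely customised according to different user preferences. The customisable framework is helpful for designing customisable characters and role-playing agents according to human preferences. We first propose the SimsConv dataset, which comprises 68 different customised characters, 1,360 multi-turn role-playing dialogues, and encompasses 13,971 interaction dialogues in total. The characters are created from several real-world elements, such as career, aspiration, trait, and skill. Building on these foundations, we present SimsChat, a freely customisable role-playing agent. It incorporates different real-world scenes and topic-specific character interaction dialogues, simulating characters' life experiences in various scenarios and topic-specific interactions with specific emotions. Experimental results show that our proposed framework achieves desirable performance and provides helpful guidelines for building better simulacra of human beings in the future. Our data and code are available at this https URL.
Summary: Large language models (LLMs) have a remarkable ability to understand human instructions and generate high-quality text, which enables them to act as agents that simulate human behaviour. This ability allows LLMs to emulate human beings in a more advanced manner than merely replicating simple human behaviours. However, there has been little exploration of using LLMs to craft characters from multiple aspects. This work introduces the Customisable Conversation Agent Framework, which uses LLMs to simulate real-world characters that can be freely customised to different user preferences. The framework helps in designing customisable characters and role-playing agents according to human preferences. The authors first propose the SimsConv dataset, which comprises 68 different customised characters and 1,360 multi-turn role-playing dialogues, with 13,971 interaction dialogues in total. The characters are created from several real-world elements, such as career, aspiration, trait, and skill. Building on this foundation, the authors present SimsChat, a freely customisable role-playing agent. It incorporates different real-world scenes and topic-specific character interaction dialogues, simulating characters' life experiences in various scenarios and topic-specific interactions with specific emotions. Experimental results show that the proposed framework achieves desirable performance and provides helpful guidelines for building better simulacra of human beings in the future. The data and code are available at this https URL.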

Title: Unmasking the Imposters: In-Domain Detection of Human vs. Machine-Generated Tweets

Authors: Bryan E. Tuck, Rakesh M. Verma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.17967
Pdf URL: https://arxiv.org/pdf/2406.17967
Copy Paste: [[2406.17967]] Unmasking the Imposters: In-Domain Detection of Human vs. Machine-Generated Tweets(https://arxiv.org/abs/2406.17967)
Keywords: language model, gpt, llm
Abstract: The rapid development of large language models (LLMs) has significantly improved the generation of fluent and convincing text, raising concerns about their misuse on social media platforms. We present a methodology using Twitter datasets to examine the generative capabilities of four LLMs: Llama 3, Mistral, Qwen2, and GPT4o. We evaluate 7B and 8B parameter base-instruction models of the three open-source LLMs and validate the impact of further fine-tuning and "uncensored" versions. Our findings show that "uncensored" models with additional in-domain fine-tuning dramatically reduce the effectiveness of automated detection methods. This study addresses a gap by exploring smaller open-source models and the effects of "uncensoring," providing insights into how fine-tuning and content moderation influence machine-generated text detection.
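Summary: The rapid development of large language models (LLMs) has significantly improved the generation of fluent and convincing text, raising concerns about their misuse on social media platforms. The authors present a methodology that uses Twitter datasets to examine the generative capabilities of four LLMs: Llama 3, Mistral, Qwen2, and GPT4o. They evaluate 7B and 8B parameter base-instruction models of the three open-source LLMs and validate the impact of further fine-tuning and of "uncensored" versions. The findings show that "uncensored" models with additional in-domain fine-tuning dramatically reduce the effectiveness of automated detection methods. The study addresses a gap by exploring smaller open-source models and the effects of "uncensoring," providing insight into how fine-tuning and content moderation influence the detection of machine-generated text.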

Title: Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective

Authors: Hanqi Yan, Yanzheng Xiang, Guangyi Chen, Yifei Wang, Lin Gui, Yulan He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.17969
Pdf URL: https://arxiv.org/pdf/2406.17969
Copy Paste: [[2406.17969]] Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective(https://arxiv.org/abs/2406.17969)
Keywords: language model, llm
Abstract: To better interpret the intrinsic mechanism of large language models (LLMs), recent studies focus on monosemanticity on its basic units. A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correlation between neurons and concepts. Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity. To explore this question, we revisit monosemanticity from the feature decorrelation perspective and advocate for its encouragement. We experimentally observe that the current conclusion by wang2024learning, which suggests that decreasing monosemanticity enhances model performance, does not hold when the model changes. Instead, we demonstrate that monosemanticity consistently exhibits a positive correlation with model capacity, in the preference alignment process. Consequently, we apply feature correlation as a proxy for monosemanticity and incorporate a feature decorrelation regularizer into the dynamic preference optimization process. The experiments show that our method not only enhances representation diversity and activation sparsity but also improves preference alignment performance.
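Summary: To better interpret the intrinsic mechanisms of large language models (LLMs), recent studies have focused on the monosemanticity of their basic units. A monosemantic neuron is dedicated to a single, specific concept, forming a one-to-one correspondence between neurons and concepts. Despite extensive research on probing monosemanticity, it remains unclear whether monosemanticity benefits or harms model capacity. To explore this question, the authors revisit monosemanticity from the perspective of feature decorrelation and advocate encouraging it. They observe experimentally that the current conclusion of wang2024learning, which suggests that decreasing monosemanticity enhances model performance, does not hold when the model changes. Instead, they show that monosemanticity is consistently positively correlated with model capacity in the preference alignment process. They therefore use feature correlation as a proxy for monosemanticity and incorporate a feature decorrelation regularizer into the dynamic preference optimization process. Experiments show that the method not only increases representation diversity and activation sparsity but also improves preference alignment performance.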

Title: Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts

Authors: Xuyang Wu, Yuan Wang, Hsin-Tai Wu, Zhiqiang Tao, Yi Fang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2406.17974
Pdf URL: https://arxiv.org/pdf/2406.17974
Copy Paste: [[2406.17974]] Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts(https://arxiv.org/abs/2406.17974)
Keywords: language model, prompt
Abstract: Large vision-language models (LVLMs) have recently achieved significant progress, demonstrating strong capabilities in open-world visual understanding. However, it is not yet clear how LVLMs address demographic biases in real life, especially the disparities across attributes such as gender, skin tone, and age. In this paper, we empirically investigate visual fairness in several mainstream LVLMs and audit their performance disparities across sensitive demographic attributes, based on public fairness benchmark datasets (e.g., FACET). To disclose the visual bias in LVLMs, we design a fairness evaluation framework with direct questions and single-choice question-instructed prompts on visual question-answering/classification tasks. The zero-shot prompting results indicate that, despite enhancements in visual understanding, both open-source and closed-source LVLMs exhibit prevalent fairness issues across different instruct prompts and demographic attributes.
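Summary: Large vision-language models (LVLMs) have recently made significant progress, showing strong capabilities in open-world visual understanding. However, it is not yet clear how LVLMs handle demographic biases in real life, especially disparities across attributes such as gender, skin tone, and age. In this paper, based on public fairness benchmark datasets (e.g., FACET), we empirically study visual fairness in several mainstream LVLMs and audit their performance disparities across sensitive demographic attributes. To expose the visual bias in LVLMs, we design a fairness evaluation framework that uses direct questions and single-choice question-instructed prompts on visual question-answering and classification tasks. The zero-shot prompting results indicate that, despite improved visual understanding, both open-source and closed-source LVLMs exhibit widespread fairness issues across different instruction prompts and demographic attributes.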

Title: Inherent Challenges of Post-Hoc Membership Inference for Large Language Models

Authors: Matthieu Meeus, Shubham Jain, Marek Rei, Yves-Alexandre de Montjoye
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2406.17975
Pdf URL: https://arxiv.org/pdf/2406.17975
Copy Paste: [[2406.17975]] Inherent Challenges of Post-Hoc Membership Inference for Large Language Models(https://arxiv.org/abs/2406.17975)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are often trained on vast amounts of undisclosed data, motivating the development of post-hoc Membership Inference Attacks (MIAs) to gain insight into their training data composition. However, in this paper, we identify inherent challenges in post-hoc MIA evaluation due to potential distribution shifts between collected member and non-member datasets. Using a simple bag-of-words classifier, we demonstrate that datasets used in recent post-hoc MIAs suffer from significant distribution shifts, in some cases achieving near-perfect distinction between members and non-members. This implies that previously reported high MIA performance may be largely attributable to these shifts rather than model memorization. We confirm that randomized, controlled setups eliminate such shifts and thus enable the development and fair evaluation of new MIAs. However, we note that such randomized setups are rarely available for the latest LLMs, making post-hoc data collection still required to infer membership for real-world LLMs. As a potential solution, we propose a Regression Discontinuity Design (RDD) approach for post-hoc data collection, which substantially mitigates distribution shifts. Evaluating various MIA methods on this RDD setup yields performance barely above random guessing, in stark contrast to previously reported results. Overall, our findings highlight the challenges in accurately measuring LLM memorization and the need for careful experimental design in (post-hoc) membership inference tasks.
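Summary: Large language models (LLMs) are often trained on vast amounts of undisclosed data, which has motivated the development of post-hoc membership inference attacks (MIAs) to gain insight into the composition of their training data. In this paper, however, we identify inherent challenges in post-hoc MIA evaluation that arise from potential distribution shifts between the collected member and non-member datasets. Using a simple bag-of-words classifier, we show that the datasets used in recent post-hoc MIAs suffer from significant distribution shifts, in some cases allowing near-perfect separation of members from non-members. This implies that previously reported high MIA performance may be largely attributable to these shifts rather than to model memorization. We confirm that randomized, controlled setups eliminate such shifts and thus enable the development and fair evaluation of new MIAs. We note, however, that such randomized setups are rarely available for the latest LLMs, so post-hoc data collection is still required to infer membership for real-world LLMs. As a potential solution, we propose a Regression Discontinuity Design (RDD) approach to post-hoc data collection, which substantially mitigates distribution shifts. Evaluating various MIA methods on this RDD setup yields performance only slightly above random guessing, in stark contrast to previously reported results. Overall, our findings highlight the challenges of accurately measuring LLM memorization and the need for careful experimental design in (post-hoc) membership inference tasks.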
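The bag-of-words check described above can be sketched with a tiny multinomial Naive Bayes classifier that never queries the target LLM. If such a classifier separates members from non-members well above chance on held-out text, the two sets differ in distribution. The toy texts, the labels (1 = member, 0 = non-member), and the choice of Naive Bayes are all assumptions made for illustration; the abstract does not specify the classifier beyond "bag-of-words".

```python
import math
from collections import Counter


def train_nb(texts, labels):
    """Fit word counts per class (1 = member, 0 = non-member)."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter(labels)
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def predict_nb(model, text):
    """Return the more likely class under Laplace-smoothed Naive Bayes."""
    counts, docs, vocab = model
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen in training
                score += math.log((counts[label][token] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, texts, labels):
    return sum(predict_nb(model, t) == y for t, y in zip(texts, labels)) / len(texts)


# Hypothetical toy data: members and non-members come from different years,
# so surface vocabulary alone gives the classifier a strong signal.
train_texts = [
    "the committee met in 2019",
    "report published 2019 on trade",
    "covid vaccine rollout in 2024",
    "chatbot release announced 2024",
]
train_labels = [1, 1, 0, 0]
model = train_nb(train_texts, train_labels)
held_out = ["summit held in 2019", "new model launched 2024"]
print(accuracy(model, held_out, [1, 0]))
```

High accuracy from a model-free classifier like this one is the warning sign: an MIA evaluated on the same split could score well without measuring memorization at all.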

Title: EDEN: Empathetic Dialogues for English learning

Authors: Li Siyan, Teresa Shao, Zhou Yu, Julia Hirschberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.17982
Pdf URL: https://arxiv.org/pdf/2406.17982
Copy Paste: [[2406.17982]] EDEN: Empathetic Dialogues for English learning(https://arxiv.org/abs/2406.17982)
Keywords: chat
Abstract: Dialogue systems have been used as conversation partners in English learning, but few have studied whether these systems improve learning outcomes. Student passion and perseverance, or grit, has been associated with language learning success. Recent work establishes that as students perceive their English teachers to be more supportive, their grit improves. Hypothesizing that the same pattern applies to English-teaching chatbots, we create EDEN, a robust open-domain chatbot for spoken conversation practice that provides empathetic feedback. To construct EDEN, we first train a specialized spoken utterance grammar correction model and a high-quality social chit-chat conversation model. We then conduct a preliminary user study with a variety of strategies for empathetic feedback. Our experiment suggests that using adaptive empathetic feedback leads to higher perceived affective support, which, in turn, predicts increased student grit.
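Summary: Dialogue systems have been used as conversation partners in English learning, but few studies have examined whether these systems improve learning outcomes. Student passion and perseverance, or grit, has been associated with success in language learning. Recent work shows that when students perceive their English teachers as more supportive, their grit improves. Hypothesizing that the same pattern applies to English-teaching chatbots, we create EDEN, a robust open-domain chatbot for spoken conversation practice that provides empathetic feedback. To build EDEN, we first train a specialized grammar correction model for spoken utterances and a high-quality social chit-chat conversation model. We then conduct a preliminary user study of various strategies for empathetic feedback. Our experiment suggests that adaptive empathetic feedback leads to higher perceived affective support, which in turn predicts increased student grit.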

Title: Multi-step Knowledge Retrieval and Inference over Unstructured Data

Authors: Aditya Kalyanpur, Kailash Saravanakumar, Victor Barres, CJ McFate, Lori Moon, Nati Seifu, Maksim Eremeev, Jose Barrera, Eric Brown, David Ferrucci
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Multi-step Knowledge Retrieval and Inference over Unstructured Data(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: The advent of Large Language Models (LLMs) and Generative AI has revolutionized natural language applications across various domains. However, high-stakes decision-making tasks in fields such as medical, legal and finance require a level of precision, comprehensiveness, and logical consistency that pure LLM or Retrieval-Augmented-Generation (RAG) approaches often fail to deliver. At Elemental Cognition (EC), we have developed a neuro-symbolic AI platform to tackle these problems. The platform integrates fine-tuned LLMs for knowledge extraction and alignment with a robust symbolic reasoning engine for logical inference, planning and interactive constraint solving. We describe Cora, a Collaborative Research Assistant built on this platform, that is designed to perform complex research and discovery tasks in high-stakes domains. This paper discusses the multi-step inference challenges inherent in such domains, critiques the limitations of existing LLM-based methods, and demonstrates how Cora's neuro-symbolic approach effectively addresses these issues. We provide an overview of the system architecture, key algorithms for knowledge extraction and formal reasoning, and present preliminary evaluation results that highlight Cora's superior performance compared to well-known LLM and RAG baselines.
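Summary: The advent of large language models (LLMs) and generative AI has revolutionized natural language applications across many domains. However, high-stakes decision-making tasks in fields such as medicine, law, and finance require a level of precision, comprehensiveness, and logical consistency that pure LLM or retrieval-augmented generation (RAG) approaches often fail to deliver. At Elemental Cognition (EC), we have developed a neuro-symbolic AI platform to address these problems. The platform integrates fine-tuned LLMs for knowledge extraction and alignment with a robust symbolic reasoning engine for logical inference, planning, and interactive constraint solving. We describe Cora, a Collaborative Research Assistant built on this platform and designed to perform complex research and discovery tasks in high-stakes domains. This paper discusses the multi-step inference challenges inherent in such domains, critiques the limitations of existing LLM-based methods, and shows how Cora's neuro-symbolic approach effectively addresses them. We give an overview of the system architecture and the key algorithms for knowledge extraction and formal reasoning, and present preliminary evaluation results that highlight Cora's superior performance compared with well-known LLM and RAG baselines.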

Title: Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models

Authors: Vikas Yadav, Hyuk Joon Kwon, Vijay Srinivasan, Hongxia Jin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.17990
Pdf URL: https://arxiv.org/pdf/2406.17990
Copy Paste: [[2406.17990]] Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models(https://arxiv.org/abs/2406.17990)
Keywords: language model
Abstract: Question Answer Generation (QAG) is an effective data augmentation technique to improve the accuracy of question answering systems, especially in low-resource domains. While recent pretrained and large language model-based QAG methods have made substantial progress, they face the critical issue of redundant QA pair generation, affecting downstream QA systems. Implicit diversity techniques such as sampling and diverse beam search are proven effective solutions but often yield smaller diversity. We present explicit diversity conditions for QAG, focusing on spatial aspects, question types, and entities, substantially increasing diversity in QA generation. Our work emphasizes the need for explicit diversity conditions for generating diverse question-answer synthetic data by showing significant improvements in downstream QA tasks over existing widely adopted implicit diversity techniques. In particular, training the downstream QA model on QA pairs generated with explicit diversity conditions yields an average 4.1% exact match and 4.5% F1 improvement over QAG from implicit sampling techniques on SQuADDU. The need for explicit diversity conditions is even greater in low-resource datasets (SubjQA), where average downstream QA performance improvements are around 12% EM.
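Summary: Question Answer Generation (QAG) is an effective data augmentation technique that improves the accuracy of question answering systems, especially in low-resource domains. Although recent QAG methods based on pretrained and large language models have made substantial progress, they face the key problem of generating redundant QA pairs, which affects downstream QA systems. Implicit diversity techniques such as sampling and diverse beam search have proven to be effective solutions, but they often yield less diversity. We propose explicit diversity conditions for QAG that focus on spatial aspects, question types, and entities, substantially increasing the diversity of QA generation. Our work underlines the need for explicit diversity conditions when generating diverse synthetic question-answer data by showing significant improvements on downstream QA tasks over widely adopted implicit diversity techniques. In particular, when QA pairs generated with explicit diversity conditions are used to train the downstream QA model, they yield an average improvement of 4.1% in exact match and 4.5% in F1 over QAG with implicit sampling techniques on SQuADDU. The need for explicit diversity conditions is even greater on low-resource datasets (SubjQA), where the average downstream QA improvement is around 12% EM.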
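For readers unfamiliar with the exact match (EM) and F1 figures quoted above, the following is a minimal sketch of the two metrics in the style commonly used for extractive QA. The normalization here (lowercasing and stripping punctuation) is a simplification and an assumption; the official SQuAD evaluation script also removes articles, and the paper's exact scoring setup is not described in the abstract.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return cleaned.split()


def exact_match(prediction, gold):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris.", "paris"))         # 1.0
print(token_f1("in Paris France", "Paris"))   # 0.5
```

EM rewards only fully correct spans, while F1 gives partial credit for overlapping tokens, which is why the two numbers are usually reported together.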

Title: Catching Chameleons: Detecting Evolving Disinformation Generated using Large Language Models

Authors: Bohan Jiang, Chengshuai Zhao, Zhen Tan, Huan Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.17992
Pdf URL: https://arxiv.org/pdf/2406.17992
Copy Paste: [[2406.17992]] Catching Chameleons: Detecting Evolving Disinformation Generated using Large Language Models(https://arxiv.org/abs/2406.17992)
Keywords: language model, llm, prompt
Abstract: Despite recent advancements in detecting disinformation generated by large language models (LLMs), current efforts overlook the ever-evolving nature of this disinformation. In this work, we investigate a challenging yet practical research problem of detecting evolving LLM-generated disinformation. Disinformation evolves constantly through the rapid development of LLMs and their variants. As a consequence, the detection model faces significant challenges. First, it is inefficient to train separate models for each disinformation generator. Second, the performance decreases in scenarios when evolving LLM-generated disinformation is encountered in sequential order. To address this problem, we propose DELD (Detecting Evolving LLM-generated Disinformation), a parameter-efficient approach that jointly leverages the general fact-checking capabilities of pre-trained language models (PLM) and the independent disinformation generation characteristics of various LLMs. In particular, the learned characteristics are concatenated sequentially to facilitate knowledge accumulation and transformation. DELD addresses the issue of label scarcity by integrating the semantic embeddings of disinformation with trainable soft prompts to elicit model-specific knowledge. Our experiments show that DELD significantly outperforms state-of-the-art methods. Moreover, our method provides critical insights into the unique patterns of disinformation generation across different LLMs, offering valuable perspectives in this line of research.
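Summary: Despite recent progress in detecting disinformation generated by large language models (LLMs), current efforts overlook the constantly evolving nature of this disinformation. In this work, we study a challenging yet practical research problem: detecting evolving LLM-generated disinformation. Disinformation evolves continuously with the rapid development of LLMs and their variants, so detection models face significant challenges. First, training a separate model for each disinformation generator is inefficient. Second, performance degrades when evolving LLM-generated disinformation is encountered sequentially. To address this problem, we propose DELD (Detecting Evolving LLM-generated Disinformation), a parameter-efficient approach that jointly leverages the general fact-checking capabilities of pre-trained language models (PLMs) and the independent disinformation generation characteristics of various LLMs. In particular, the learned characteristics are concatenated sequentially to facilitate knowledge accumulation and transformation. DELD addresses label scarcity by combining the semantic embeddings of disinformation with trainable soft prompts to elicit model-specific knowledge. Our experiments show that DELD significantly outperforms state-of-the-art methods. In addition, our method provides key insights into the distinctive patterns of disinformation generation across different LLMs, offering valuable perspectives for this line of research.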

Title: Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher

Authors: Hyunjong Ok, Jegwang Ryu, Jaeho Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18002
Pdf URL: https://arxiv.org/pdf/2406.18002
Copy Paste: [[2406.18002]] Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher(https://arxiv.org/abs/2406.18002)
Keywords: llm
Abstract: How can sLLMs efficiently utilize the supervision of LLMs to improve their generative quality? This question has been well studied in scenarios where there is no restriction on the number of LLM supervisions one can use, giving birth to many decoding algorithms that utilize supervision without further training. However, it is still unclear what is an effective strategy under the limited supervision scenario, where we assume that no more than a few tokens can be generated by LLMs. To this end, we develop an algorithm to effectively aggregate the sLLM and LLM predictions on initial tokens so that the generated tokens can more accurately condition the subsequent token generation by sLLM only. Critically, we find that it is essential to adaptively overtrust or disregard the LLM prediction based on the confidence of the sLLM. Through our experiments on a wide range of models and datasets, we demonstrate that our method provides a consistent improvement over conventional decoding strategies.
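Summary: How can sLLMs efficiently use the supervision of LLMs to improve their generation quality? This question has been well studied in settings where the number of LLM supervisions is unrestricted, giving rise to many decoding algorithms that exploit supervision without further training. However, it remains unclear what strategy is effective under limited supervision, where we assume that LLMs can generate no more than a few tokens. To this end, we develop an algorithm that effectively aggregates the sLLM and LLM predictions on the initial tokens, so that the generated tokens can more accurately condition the subsequent tokens generated by the sLLM alone. Critically, we find that it is essential to adaptively overtrust or disregard the LLM prediction based on the confidence of the sLLM. Through experiments on a wide range of models and datasets, we show that our method consistently improves over conventional decoding strategies.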

Title: Automated Clinical Data Extraction with Knowledge Conditioned LLMs

Authors: Diya Li, Asim Kadav, Aijing Gao, Rui Li, Richard Bourgon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18027
Pdf URL: https://arxiv.org/pdf/2406.18027
Copy Paste: [[2406.18027]] Automated Clinical Data Extraction with Knowledge Conditioned LLMs(https://arxiv.org/abs/2406.18027)
Keywords: language model, llm
Abstract: The extraction of lung lesion information from clinical and medical imaging reports is crucial for research on and clinical care of lung-related diseases. Large language models (LLMs) can be effective at interpreting unstructured text in reports, but they often hallucinate due to a lack of domain-specific knowledge, leading to reduced accuracy and posing challenges for use in clinical settings. To address this, we propose a novel framework that aligns generated internal knowledge with external knowledge through in-context learning (ICL). Our framework employs a retriever to identify relevant units of internal or external knowledge and a grader to evaluate the truthfulness and helpfulness of the retrieved internal-knowledge rules, to align and update the knowledge bases. Our knowledge-conditioned approach also improves the accuracy and reliability of LLM outputs by addressing the extraction task in two stages: (i) lung lesion finding detection and primary structured field parsing, followed by (ii) further parsing of lesion description text into additional structured fields. Experiments with expert-curated test datasets demonstrate that this ICL approach can increase the F1 score for key fields (lesion size, margin and solidity) by an average of 12.9% over existing ICL methods.
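Summary: Extracting lung lesion information from clinical and medical imaging reports is crucial for research on, and clinical care of, lung-related diseases. Large language models (LLMs) can interpret unstructured text in reports effectively, but they often hallucinate because they lack domain-specific knowledge, which reduces accuracy and poses challenges for use in clinical settings. To address this, we propose a novel framework that aligns generated internal knowledge with external knowledge through in-context learning (ICL). Our framework uses a retriever to identify relevant units of internal or external knowledge and a grader to evaluate the truthfulness and helpfulness of the retrieved internal-knowledge rules, in order to align and update the knowledge bases. Our knowledge-conditioned approach also improves the accuracy and reliability of LLM outputs by handling the extraction task in two stages: (i) lung lesion finding detection and primary structured field parsing, followed by (ii) further parsing of the lesion description text into additional structured fields. Experiments on expert-curated test datasets show that, compared with existing ICL methods, this ICL approach increases the F1 score for key fields (lesion size, margin, and solidity) by an average of 12.9%.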

Title: LLMs for Doctors: Leveraging Medical LLMs to Assist Doctors, Not Replace Them

Authors: Wenya Xie, Qingying Xiao, Yu Zheng, Xidong Wang, Junying Chen, Ke Ji, Anningzhe Gao, Xiang Wan, Feng Jiang, Benyou Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18034
Pdf URL: https://arxiv.org/pdf/2406.18034
Copy Paste: [[2406.18034]] LLMs for Doctors: Leveraging Medical LLMs to Assist Doctors, Not Replace Them(https://arxiv.org/abs/2406.18034)
Keywords: language model, llm
Abstract: The recent success of Large Language Models (LLMs) has had a significant impact on the healthcare field, providing patients with medical advice, diagnostic information, and more. However, due to a lack of professional medical knowledge, patients are easily misled by erroneous information generated by LLMs, which may result in serious medical problems. To address this issue, we focus on tuning the LLMs to be medical assistants who collaborate with more experienced doctors. We first conduct a two-stage survey by inspiration-feedback to gain a broad understanding of the real needs of doctors for medical assistants. Based on this, we construct a Chinese medical dataset called DoctorFLAN to support the entire workflow of doctors, which includes 92K Q&A samples from 22 tasks and 27 specialists. Moreover, we evaluate LLMs in doctor-oriented scenarios by constructing the DoctorFLAN-test containing 550 single-turn Q&A and DotaBench containing 74 multi-turn conversations. The evaluation results indicate that being a medical assistant still poses challenges for existing open-source models, but DoctorFLAN can help them significantly. It demonstrates that the doctor-oriented dataset and benchmarks we construct can complement existing patient-oriented work and better promote medical LLM research.
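Summary: The recent success of large language models (LLMs) has had a significant impact on healthcare, providing patients with medical advice, diagnostic information, and more. However, because they lack professional medical knowledge, patients are easily misled by erroneous information generated by LLMs, which may lead to serious medical problems. To address this issue, we focus on tuning LLMs to be medical assistants that collaborate with more experienced doctors. We first conduct a two-stage inspiration-feedback survey to gain a broad understanding of doctors' real needs for medical assistants. On this basis, we construct a Chinese medical dataset called DoctorFLAN to support doctors' entire workflow; it contains 92K Q&A samples from 22 tasks and 27 specialists. In addition, we evaluate LLMs in doctor-oriented scenarios by constructing DoctorFLAN-test, which contains 550 single-turn Q&A items, and DotaBench, which contains 74 multi-turn conversations. The evaluation results indicate that acting as a medical assistant remains challenging for existing open-source models, but DoctorFLAN can help them substantially. This shows that the doctor-oriented dataset and benchmarks we construct can complement existing patient-oriented work and better promote research on medical LLMs.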

Title: PharmGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry

Authors: Linqing Chen, Weilei Wang, Zilong Bai, Peng Xu, Yan Fang, Jie Fang, Wentao Wu, Lizhi Zhou, Ruiji Zhang, Yubin Xia, Chaobo Xu, Ran Hu, Licong Xu, Qijun Cai, Haoran Hua, Jing Sun, Jin Liu, Tian Qiu, Haowen Liu, Meng Hu, Xiuwen Li, Fei Gao, Yufu Wang, Lin Tie, Chaochao Wang, Jianping Lu, Cheng Sun, Yixin Wang, Shengjie Yang, Yuancheng Li, Lu Jin, Lisha Zhang, Fu Bian, Changyang Tu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] PharmGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry(https://arxiv.org/abs/)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision, areas where general-purpose LLMs often fall short. In this study, we introduce PharmGPT, a suite of multilingual LLMs with 13 billion and 70 billion parameters, specifically trained on a comprehensive corpus of hundreds of billions of tokens tailored to the Bio-Pharmaceutical and Chemical sectors. Our evaluation shows that PharmGPT matches or surpasses existing general models on key benchmarks, such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. This advancement establishes a new benchmark for LLMs in the Bio-Pharmaceutical and Chemical fields, addressing the existing gap in specialized language modeling. Furthermore, this suggests a promising path for enhanced research and development in these specialized areas, paving the way for more precise and effective applications of NLP in specialized domains.
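Summary: Large language models (LLMs) have revolutionized natural language processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains such as biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminology, specialized knowledge, and a high demand for precision, requirements that general-purpose LLMs often fail to meet. In this study, we introduce PharmGPT, a suite of multilingual LLMs with 13 billion and 70 billion parameters, trained specifically on a comprehensive corpus of hundreds of billions of tokens tailored to the bio-pharmaceutical and chemical sectors. Our evaluation shows that PharmGPT matches or surpasses existing general models on key benchmarks such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. This advance establishes a new benchmark for LLMs in the bio-pharmaceutical and chemical fields and addresses the existing gap in specialized language modeling. It also points to a promising path for strengthening research and development in these specialized areas, paving the way for more precise and effective applications of NLP in specialized domains.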

Title: Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources

Authors: Yiming Li, Deepthi Viswaroopan, William He, Jianfu Li, Xu Zuo, Hua Xu, Cui Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18049
Pdf URL: https://arxiv.org/pdf/2406.18049
Copy Paste: [[2406.18049]] Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources(https://arxiv.org/abs/2406.18049)
Keywords: language model, gpt, llm
Abstract: Adverse event (AE) extraction following COVID-19 vaccines from text data is crucial for monitoring and analyzing the safety profiles of immunizations. Traditional deep learning models are adept at learning intricate feature representations and dependencies in sequential data, but often require extensive labeled data. In contrast, large language models (LLMs) excel in understanding contextual information, but exhibit unstable performance on named entity recognition tasks, possibly due to their broad but unspecific training. This study aims to evaluate the effectiveness of LLMs and traditional deep learning models in AE extraction, and to assess the impact of ensembling these models on performance. In this study, we utilized reports and posts from the VAERS (n=621), Twitter (n=9,133), and Reddit (n=131) as our corpora. Our goal was to extract three types of entities: "vaccine", "shot", and "ae". We explored and fine-tuned (except GPT-4) multiple LLMs, including GPT-2, GPT-3.5, GPT-4, and Llama-2, as well as traditional deep learning models like RNN and BioBERT. To enhance performance, we created ensembles of the three models with the best performance. For evaluation, we used strict and relaxed F1 scores to evaluate the performance for each entity type, and micro-average F1 was used to assess the overall performance. The ensemble model achieved the highest performance in "vaccine", "shot", and "ae" with strict F1-scores of 0.878, 0.930, and 0.925, respectively, along with a micro-average score of 0.903. In conclusion, this study demonstrates the effectiveness and robustness of ensembling fine-tuned traditional deep learning models and LLMs, for extracting AE-related information. This study contributes to the advancement of biomedical natural language processing, providing valuable insights into improving AE extraction from text data for pharmacovigilance and public health surveillance.
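Summary: Extracting adverse events (AEs) following COVID-19 vaccination from text data is crucial for monitoring and analyzing the safety of immunizations. Traditional deep learning models are good at learning intricate feature representations and dependencies in sequential data, but they usually require large amounts of labeled data. In contrast, large language models (LLMs) excel at understanding contextual information but show unstable performance on named entity recognition tasks, possibly because their training is broad but unspecific. This study aims to evaluate the effectiveness of LLMs and traditional deep learning models in AE extraction and to assess the impact of ensembling these models on performance. We used reports and posts from VAERS (n=621), Twitter (n=9,133), and Reddit (n=131) as our corpora. Our goal was to extract three types of entities: "vaccine", "shot", and "ae". We explored and fine-tuned multiple LLMs (except GPT-4), including GPT-2, GPT-3.5, GPT-4, and Llama-2, as well as traditional deep learning models such as RNN and BioBERT. To improve performance, we created ensembles of the three best-performing models. For evaluation, we used strict and relaxed F1 scores to assess performance for each entity type and micro-average F1 to assess overall performance. The ensemble model achieved the highest performance on "vaccine", "shot", and "ae", with strict F1 scores of 0.878, 0.930, and 0.925, respectively, and a micro-average score of 0.903. In conclusion, this study demonstrates the effectiveness and robustness of ensembling fine-tuned traditional deep learning models and LLMs for extracting AE-related information. It contributes to the advancement of biomedical natural language processing and provides valuable insights into improving AE extraction from text data for pharmacovigilance and public health surveillance.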
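The difference between strict, relaxed, and micro-average F1 can be made concrete with a small sketch. Entities are represented here as hypothetical `(start, end, type)` spans with half-open offsets. Treating "relaxed" as "same type and overlapping span" is an assumption, since the abstract does not define it; micro-averaging pools true positives, false positives, and false negatives over all entity types before computing F1.

```python
def spans_match(pred, gold, relaxed):
    """Strict: identical boundaries and type. Relaxed: same type, overlapping."""
    p_start, p_end, p_type = pred
    g_start, g_end, g_type = gold
    if p_type != g_type:
        return False
    if relaxed:
        return p_start < g_end and g_start < p_end
    return (p_start, p_end) == (g_start, g_end)


def confusion_counts(preds, golds, relaxed=False):
    """Greedily pair predictions with unmatched gold spans; return (tp, fp, fn)."""
    unmatched = list(golds)
    tp = 0
    for pred in preds:
        hit = next((g for g in unmatched if spans_match(pred, g, relaxed)), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    return tp, len(preds) - tp, len(unmatched)


def f1_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(preds, golds, relaxed=False):
    """Pool counts over all entity types, then compute a single F1."""
    return f1_score(*confusion_counts(preds, golds, relaxed))


# Hypothetical annotations: the predicted "ae" span is one token too short.
gold = [(0, 2, "vaccine"), (5, 6, "shot"), (8, 10, "ae")]
pred = [(0, 2, "vaccine"), (5, 6, "shot"), (8, 9, "ae")]
print(micro_f1(pred, gold), micro_f1(pred, gold, relaxed=True))
```

Per-type scores follow from the same functions by filtering both lists to a single entity type before calling `micro_f1`.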

Title: AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

Authors: Yifan Yang, Kai Zhen, Ershad Banijamal, Athanasios Mouchtaris, Zheng Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
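Summary: Fine-tuning large language models (LLMs) has achieved remarkable performance on a variety of natural language processing tasks, but it requires more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, designed specifically to improve the performance and convergence of ZO methods. To improve dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the divergence frequently observed in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models confirm the effectiveness of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.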
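To show what "fine-tuning with only forward passes" means in practice, here is a minimal sketch of a two-point randomized zeroth-order gradient estimate, the kind of estimator commonly used in zeroth-order optimization. It evaluates the loss at two perturbed points and never backpropagates. The function name `zo_gradient` and the `queries` parameter are hypothetical; `queries` only indicates where a query-number schedule could plug in, and the abstract does not describe AdaZeta's actual schedule or adapter.

```python
import random


def zo_gradient(loss, theta, eps=1e-3, queries=1, rng=None):
    """Estimate the gradient of `loss` at `theta` from forward evaluations only.

    For each query, draw z ~ N(0, I) and use
        g ~= (loss(theta + eps*z) - loss(theta - eps*z)) / (2*eps) * z,
    then average over all queries to reduce variance.
    """
    rng = rng or random.Random(0)
    grad = [0.0] * len(theta)
    for _ in range(queries):
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = loss([t + eps * zi for t, zi in zip(theta, z)])
        minus = loss([t - eps * zi for t, zi in zip(theta, z)])
        scale = (plus - minus) / (2 * eps)
        for i, zi in enumerate(z):
            grad[i] += scale * zi / queries
    return grad


def quadratic(theta):
    """Toy loss with known gradient 2 * theta."""
    return sum(t * t for t in theta)


# More queries per step give a lower-variance estimate of the true gradient.
print(zo_gradient(quadratic, [1.0, -2.0], queries=1))
print(zo_gradient(quadratic, [1.0, -2.0], queries=5000))
```

The single-query estimate is noisy, which hints at why divergence is a concern for such methods; averaging over more queries trades extra forward passes for stability.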

Title: Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need

Authors: Yang Wang, Alberto Garcia Hernandez, Roman Kyslyi, Nicholas Kersting
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18064
Pdf URL: https://arxiv.org/pdf/2406.18064
Copy Paste: [[2406.18064]] Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need(https://arxiv.org/abs/2406.18064)
Keywords: language model, gpt, llm, chat, retrieval-augmented generation
Abstract: We present a comprehensive evaluation of answer quality in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system that is designed to assess correctness, completeness, and honesty. We further map the grading of the aforementioned quality aspects into a binary score, indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business settings where a clear decision opinion is essential. Our assessment applies vRAG-Eval to two Large Language Models (LLMs), evaluating the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find a substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations require significant resources.
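Summary: We present a comprehensive evaluation of answer quality in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grades for these quality aspects to a binary score indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business settings, where a clear decision is essential. Our assessment applies vRAG-Eval to two large language models (LLMs), evaluating the quality of answers generated by a vanilla RAG application. We compare these evaluations with the judgments of human experts and find substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluation requires significant resources.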
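The mapping from graded scores to accept or reject decisions, and the agreement figure computed on those decisions, can be sketched as follows. The 1 to 5 grade scale, the threshold of 4, and the sample grades are all hypothetical; the abstract does not state vRAG-Eval's actual scale or cutoff.

```python
def to_decision(grade, threshold=4):
    """Collapse a numeric grade into accept (True) or reject (False)."""
    return grade >= threshold


def decision_agreement(grades_a, grades_b, threshold=4):
    """Fraction of items on which two graders reach the same binary decision."""
    if len(grades_a) != len(grades_b):
        raise ValueError("both graders must score the same items")
    same = sum(
        to_decision(a, threshold) == to_decision(b, threshold)
        for a, b in zip(grades_a, grades_b)
    )
    return same / len(grades_a)


# Hypothetical grades from an LLM judge and a human expert on six answers.
llm_grades = [5, 4, 2, 3, 5, 1]
human_grades = [4, 5, 3, 4, 5, 2]
print(decision_agreement(llm_grades, human_grades))
```

Note that two graders can disagree on the raw grade yet agree on the decision (for example 5 versus 4), so binary agreement is generally higher than exact-grade agreement.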

Title: Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction

Authors: Yice Zhang, Jie Zeng, Weiming Hu, Ziyi Wang, Shiwei Chen, Ruifeng Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18078
Pdf URL: https://arxiv.org/pdf/2406.18078
Copy Paste: [[2406.18078]] Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction(https://arxiv.org/abs/2406.18078)
Keywords: language model
Abstract: Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review, which is the most representative and challenging task in aspect-based sentiment analysis. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. To tackle this issue, we propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels, aiming to filter out mismatches and thereby enhance the effectiveness of self-training. We highlight two critical aspects to ensure the scorer's effectiveness and reliability: the quality of the training dataset and its model architecture. To this end, we create a human-annotated comparison dataset and train a generative model on it using ranking-based objectives. Extensive experiments on public ASQP datasets reveal that using our scorer can greatly and consistently improve the effectiveness of self-training. Moreover, we explore the possibility of replacing humans with large language models for comparison dataset annotation, and experiments demonstrate its feasibility. We release our code and data at this https URL.
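Summary: Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review, and is the most representative and challenging task in aspect-based sentiment analysis. A key challenge in ASQP is the scarcity of labeled data, which limits the performance of existing methods. To address this issue, we propose a self-training framework with a pseudo-label scorer, in which the scorer assesses the match between reviews and their pseudo-labels in order to filter out mismatches and thereby make self-training more effective. We highlight two key aspects that ensure the scorer's effectiveness and reliability: the quality of the training dataset and the model architecture. To this end, we create a human-annotated comparison dataset and train a generative model on it using ranking-based objectives. Extensive experiments on public ASQP datasets show that using our scorer greatly and consistently improves the effectiveness of self-training. We also explore the possibility of replacing humans with large language models for annotating the comparison dataset, and experiments demonstrate its feasibility. We release our code and data at this https URL.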

Title: Octo-planner: On-device Language Model for Planner-Action Agents

Authors: Wei Chen, Zhiyuan Li, Zhen Guo, Yikang Shen
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Octo-planner: On-device Language Model for Planner-Action Agents(https://arxiv.org/abs/)
Keywords: language model, gpt, llm, agent
Abstract: AI agents have become increasingly significant in various domains, enabling autonomous decision-making and problem-solving. To function effectively, these agents require a planning process that determines the best course of action and then executes the planned actions. In this paper, we present an efficient on-device Planner-Action framework that separates planning and action execution into two distinct components: a planner agent based on Phi-3 Mini, a 3.8 billion parameter LLM optimized for edge devices, and an action agent using the Octopus model for function execution. The planner agent first responds to user queries by decomposing tasks into a sequence of sub-steps, which are then executed by the action agent. To optimize performance on resource-constrained devices, we employ model fine-tuning instead of in-context learning, reducing computational costs and energy consumption while improving response times. Our approach involves using GPT-4 to generate diverse planning queries and responses based on available functions, with subsequent validations to ensure data quality. We fine-tune the Phi-3 Mini model on this curated dataset, achieving a 97% success rate in our in-domain test environment. To address multi-domain planning challenges, we developed a multi-LoRA training method that merges weights from LoRAs trained on distinct function subsets. This approach enables flexible handling of complex, multi-domain queries while maintaining computational efficiency on resource-constrained devices. To support further research, we have open-sourced our model weights at this https URL. For the demo, please refer to this https URL.
摘要：AI 代理在各个领域变得越来越重要，可以实现自主决策和解决问题。为了有效运行，这些代理需要一个规划过程来确定最佳行动方案，然后执行计划的行动。在本文中，我们提出了一个高效的设备上 Planner-Action 框架，将规划和行动执行分为两个不同的组件：基于 Phi-3 Mini（针对边缘设备优化的 38 亿参数 LLM）的规划代理，以及使用 Octopus 模型执行函数的动作代理。规划代理首先通过将任务分解为一系列子步骤来响应用户查询，然后由动作代理执行这些子步骤。为了优化资源受限设备上的性能，我们采用模型微调而不是上下文学习，从而降低计算成本和能耗，同时缩短响应时间。我们的方法包括使用 GPT-4 根据可用功能生成不同的规划查询和响应，然后进行后续验证以确保数据质量。我们在这个精选数据集上对 Phi-3 Mini 模型进行了微调，在我们的域内测试环境中实现了 97% 的成功率。为了应对多域规划挑战，我们开发了一种多 LoRA 训练方法，该方法合并了在不同函数子集上训练的 LoRA 的权重。这种方法可以灵活处理复杂的多域查询，同时在资源受限的设备上保持计算效率。为了支持进一步的研究，我们在此 https URL 上开源了我们的模型权重。有关演示，请参阅此 https URL。
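
Note: the abstract above mentions merging weights from LoRAs trained on distinct function subsets but does not say how the merge is computed. The following is a minimal sketch under the assumption that each adapter contributes a low-rank update B·A and that the updates are combined as a weighted sum on top of the base weight. The function names, the toy matrices, and the equal weights are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_loras(
    base: Matrix,
    adapters: Sequence[Tuple[Matrix, Matrix]],
    weights: Sequence[float],
) -> Matrix:
    """Return base + sum_i weights[i] * (B_i @ A_i).

    Assumption: a weighted sum of low-rank updates; the paper may merge differently.
    """
    merged = [row[:] for row in base]  # copy so the base weights stay untouched
    for (b, a), w in zip(adapters, weights):
        delta = matmul(b, a)  # dense update contributed by this adapter
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += w * value
    return merged


# Toy 2x2 base weight and two rank-1 adapters (hypothetical values).
base = [[1.0, 0.0], [0.0, 1.0]]
adapter_1 = ([[1.0], [0.0]], [[0.0, 2.0]])  # B is 2x1, A is 1x2
adapter_2 = ([[0.0], [1.0]], [[4.0, 0.0]])
merged = merge_loras(base, [adapter_1, adapter_2], weights=[0.5, 0.5])
print(merged)
```

Merging into a single dense weight means only one set of weights has to be kept in memory at inference time, which fits the on-device setting the abstract describes.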

Title: Multilingual Knowledge Graph Completion from Pretrained Language Models with Knowledge Constraints

Authors: Ran Song, Shizhu He, Shengxiang Gao, Li Cai, Kang Liu, Zhengtao Yu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18085
Pdf URL: https://arxiv.org/pdf/2406.18085
Copy Paste: [[2406.18085]] Multilingual Knowledge Graph Completion from Pretrained Language Models with Knowledge Constraints(https://arxiv.org/abs/2406.18085)
Keywords: language model
Abstract: Multilingual Knowledge Graph Completion (mKGC) aims to solve queries like (h, r, ?) in different languages by inferring a tail entity t, thereby improving multilingual knowledge graphs. Previous studies leverage multilingual pretrained language models (PLMs) and the generative paradigm to achieve mKGC. Although multilingual pretrained language models contain extensive knowledge of different languages, their pretraining tasks cannot be directly aligned with the mKGC tasks. Moreover, the majority of KGs and PLMs currently available exhibit a pronounced English-centric bias. This makes it difficult for mKGC to achieve good results, particularly in the context of low-resource languages. To overcome these problems, this paper introduces global and local knowledge constraints for mKGC. The former is used to constrain the reasoning of answer entities, while the latter is used to enhance the representation of query contexts. The proposed method makes the pretrained model better adapt to the mKGC task. Experimental results on public datasets demonstrate that our method outperforms the previous SOTA on Hits@1 and Hits@10 by an average of 12.32% and 16.03%, respectively, which indicates that the proposed method significantly improves mKGC.
摘要：多语言知识图谱补全 (mKGC) 旨在通过推理尾部实体 t 来解决不同语言中的 (h, r, ?) 等查询，从而改进多语言知识图谱。先前的研究利用多语言预训练语言模型 (PLM) 和生成范式来实现 mKGC。虽然多语言预训练语言模型包含不同语言的广泛知识，但其预训练任务不能直接与 mKGC 任务对齐。此外，目前可用的大多数 KG 和 PLM 都表现出明显的以英语为中心。这使得 mKGC 难以取得良好的效果，特别是在资源匮乏的语言环境中。为了克服之前的问题，本文为 mKGC 引入了全局和局部知识约束。前者用于约束答案实体的推理，而后者用于增强查询上下文的表示。所提出的方法使得预训练模型更好地适应 mKGC 任务。在公开数据集上的实验结果表明，我们的方法在 Hits@1 和 Hits@10 上比之前的 SOTA 平均高出 12.32% 和 16.03%，这表明我们提出的方法在 mKGC 上有显著的增强。
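
Note: the abstract above reports results in Hits@1 and Hits@10. As a reminder of what that metric measures, here is a minimal sketch: for each query (h, r, ?), take the rank the model assigns to the correct tail entity (1 is best), and report the fraction of queries whose rank is at most k. The example ranks are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct tail entity is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical ranks of the gold tail entity for five queries.
example_ranks = [1, 3, 12, 1, 7]
hits1 = hits_at_k(example_ranks, 1)
hits10 = hits_at_k(example_ranks, 10)
print(hits1, hits10)
```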

Title: LLM-Driven Multimodal Opinion Expression Identification

Authors: Bonian Jia, Huiyao Chen, Yueheng Sun, Meishan Zhang, Min Zhang
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.18088
Pdf URL: https://arxiv.org/pdf/2406.18088
Copy Paste: [[2406.18088]] LLM-Driven Multimodal Opinion Expression Identification(https://arxiv.org/abs/2406.18088)
Keywords: language model, llm
Abstract: Opinion Expression Identification (OEI) is essential in NLP for applications ranging from voice assistants to depression diagnosis. This study extends OEI to encompass multimodal inputs, underlining the significance of auditory cues in delivering emotional subtleties beyond the capabilities of text. We introduce a novel multimodal OEI (MOEI) task, integrating text and speech to mirror real-world scenarios. Utilizing CMU MOSEI and IEMOCAP datasets, we construct the CI-MOEI dataset. Additionally, Text-to-Speech (TTS) technology is applied to the MPQA dataset to obtain the CIM-OEI dataset. We design a template for the OEI task to take full advantage of the generative power of large language models (LLMs). Advancing further, we propose an LLM-driven method, STOEI, which combines the speech and text modalities to identify opinion expressions. Our experiments demonstrate that MOEI significantly improves performance, while our method outperforms existing methods by 9.20% and obtains SOTA results.
摘要：观点表达识别 (OEI) 在 NLP 中至关重要，可用于从语音助手到抑郁症诊断等各种应用。本研究将 OEI 扩展到涵盖多模态输入，强调了听觉线索在传递文本无法传递的情感细微差别方面的重要性。我们引入了一种新颖的多模态 OEI (MOEI) 任务，将文本和语音相结合以反映真实世界场景。利用 CMU MOSEI 和 IEMOCAP 数据集，我们构建了 CI-MOEI 数据集。此外，将文本转语音 (TTS) 技术应用于 MPQA 数据集以获得 CIM-OEI 数据集。我们为 OEI 任务设计了一个模板，以充分利用大型语言模型 (LLM) 的生成能力。进一步发展，我们提出了一种 LLM 驱动的方法 STOEI，它结合了语音和文本模态来识别观点表达。我们的实验表明，MOEI 显著提高了性能，而我们的方法比现有方法高出 9.20% 并获得 SOTA 结果。

Title: Shimo Lab at "Discharge Me!": Discharge Summarization by Prompt-Driven Concatenation of Electronic Health Record Sections

Authors: Yunzhen He, Hiroaki Yamagiwa, Hidetoshi Shimodaira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Shimo Lab at "Discharge Me!": Discharge Summarization by Prompt-Driven Concatenation of Electronic Health Record Sections(https://arxiv.org/abs/)
Keywords: prompt
Abstract: In this paper, we present our approach to the shared task "Discharge Me!" at the BioNLP Workshop 2024. The primary goal of this task is to reduce the time and effort clinicians spend on writing detailed notes in the electronic health record (EHR). Participants develop a pipeline to generate the "Brief Hospital Course" and "Discharge Instructions" sections from the EHR. Our approach involves a first step of extracting the relevant sections from the EHR. We then add explanatory prompts to these sections and concatenate them with separate tokens to create the input text. To train a text generation model, we perform LoRA fine-tuning on the ClinicalT5-large model. On the final test data, our approach achieved a ROUGE-1 score of 0.394, which is comparable to the top solutions.
摘要：在本文中，我们介绍了我们在 BioNLP Workshop 2024 上完成共享任务“出院！”的方法。此任务的主要目标是减少临床医生在电子健康记录 (EHR) 中撰写详细笔记所花费的时间和精力。参与者开发了一个流程，以从 EHR 中生成“简要医院课程”和“出院说明”部分。我们的方法涉及从 EHR 中提取相关部分的第一步。然后，我们在这些部分中添加解释性提示，并将它们与单独的标记连接起来以创建输入文本。为了训练文本生成模型，我们对 ClinicalT5-large 模型执行 LoRA 微调。在最终测试数据上，我们的方法获得了 $0.394$ 的 ROUGE-1 分数，与顶级解决方案相当。
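
Note: the abstract above reports a ROUGE-1 score but does not say which variant (precision, recall, or F1) or which tokenizer the shared task uses. The sketch below assumes lowercased whitespace tokens and computes all three from clipped unigram overlap. The two sentences are invented for illustration and do not come from the task data.

```python
from collections import Counter
from typing import Dict


def rouge1(candidate: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 (assumes whitespace tokenization)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical generated sentence vs. hypothetical reference sentence.
scores = rouge1(
    "the patient was discharged home",
    "the patient was discharged in stable condition",
)
print(scores)
```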

Title: BADGE: BADminton report Generation and Evaluation with LLM

Authors: Shang-Hsuan Chiang, Lin-Wei Chao, Kuang-Da Wang, Chih-Chuan Wang, Wen-Chih Peng
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2406.18116
Pdf URL: https://arxiv.org/pdf/2406.18116
Copy Paste: [[2406.18116]] BADGE: BADminton report Generation and Evaluation with LLM(https://arxiv.org/abs/2406.18116)
Keywords: language model, gpt, llm, prompt
Abstract: Badminton enjoys widespread popularity, and reports on matches generally include details such as player names, game scores, and ball types, providing audiences with a comprehensive view of the games. However, writing these reports can be a time-consuming task. This challenge led us to explore whether a Large Language Model (LLM) could automate the generation and evaluation of badminton reports. We introduce a novel LLM-based framework named BADGE, designed for this purpose. Our method consists of two main phases: Report Generation and Report Evaluation. Initially, badminton-related data is processed by the LLM, which then generates a detailed report of the match. We tested different Input Data Types, In-Context Learning (ICL) settings, and LLMs, finding that GPT-4 performs best when using the CSV data type and Chain of Thought prompting. Following report generation, the LLM evaluates and scores the reports to assess their quality. Our comparisons between the scores evaluated by GPT-4 and human judges show a tendency to prefer GPT-4 generated reports. Since the application of LLMs in badminton reporting remains largely unexplored, our research serves as a foundational step for future advancements in this area. Moreover, our method can be extended to other sports games, thereby enhancing sports promotion. For more details, please refer to this https URL.
摘要：羽毛球运动广受欢迎，比赛报告通常包括球员姓名、比赛得分和球类等详细信息，为观众提供全面的比赛信息。然而，撰写这些报告可能是一项耗时的任务。这一挑战促使我们探索大型语言模型 (LLM) 是否可以自动生成和评估羽毛球报告。我们引入了一个名为 BADGE 的新框架，该框架是使用 LLM 为此目的而设计的。我们的方法包括两个主要阶段：报告生成和报告评估。最初，羽毛球相关数据由 LLM 处理，然后生成比赛的详细报告。我们测试了不同的输入数据类型、上下文学习 (ICL) 和 LLM，发现 GPT-4 在使用 CSV 数据类型和思路链提示时表现最佳。报告生成后，LLM 评估和评分报告以评估其质量。我们对 GPT-4 和人类评委评估的分数进行了比较，结果显示人们倾向于更喜欢 GPT-4 生成的报告。由于 LLM 在羽毛球报道中的应用仍未得到广泛探索，我们的研究为该领域的未来发展奠定了基础。此外，我们的方法可以扩展到其他体育赛事，从而增强体育推广。有关更多详细信息，请参阅此 https URL。

Title: ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs

Authors: Ahmed Heakl, Youssef Zaghloul, Mennatullah Ali, Rania Hossam, Walid Gomaa
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2406.18120
Pdf URL: https://arxiv.org/pdf/2406.18120
Copy Paste: [[2406.18120]] ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs(https://arxiv.org/abs/2406.18120)
Keywords: language model, llm
Abstract: Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of 56% in English translation over the state-of-the-art and 9.3% in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability is crucial for enabling seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources. Code: this http URL, Models: this http URL.
摘要：受近年来埃及阿拉伯语和英语之间代码转换现象的广泛增加的启发，本文探讨了机器翻译 (MT) 和自动语音识别 (ASR) 系统的复杂性，重点是将代码转换的埃及阿拉伯语-英语翻译成英语或埃及阿拉伯语。我们的目标是介绍开发这些系统所采用的方法，利用大型语言模型，如 LLama 和 Gemma。在 ASR 领域，我们探索了 Whisper 模型在代码转换的埃及阿拉伯语识别中的应用，详细介绍了我们的实验程序，包括数据预处理和训练技术。通过实施将 ASR 与 MT 相结合的连续语音到文本翻译系统，我们旨在克服资源有限和埃及阿拉伯语方言独特特征所带来的挑战。根据既定指标的评估展示了令人鼓舞的结果，我们的方法使英语翻译比最先进的方法提高了 56%，阿拉伯语翻译提高了 9.3%。由于代码转换是口语中固有的现象，因此 ASR 系统能够有效处理这一现象至关重要。此功能对于实现各个领域的无缝交互至关重要，包括商务谈判、文化交流和学术讨论。我们的模型和代码可作为开源资源使用。代码：this http URL，模型：this http URL。

Title: Poisoned LangChain: Jailbreak LLMs by LangChain

Authors: Ziqiu Wang, Jun Liu, Shengkai Zhang, Yang Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18122
Pdf URL: https://arxiv.org/pdf/2406.18122
Copy Paste: [[2406.18122]] Poisoned LangChain: Jailbreak LLMs by LangChain(https://arxiv.org/abs/2406.18122)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: With the development of natural language processing (NLP), large language models (LLMs) are becoming increasingly popular. LLMs are increasingly integrated into everyday life, raising public concerns about their security vulnerabilities. Consequently, the security of large language models is becoming critically important. Currently, the techniques for attacking and defending LLMs are continuously evolving. One significant type of attack is the jailbreak attack, which is designed to evade model safety mechanisms and induce the generation of inappropriate content. Existing jailbreak attacks primarily rely on crafting inducement prompts for direct jailbreaks, which are less effective against large models with robust filtering and high comprehension abilities. Given the increasing demand for real-time capabilities in large language models, real-time updates and iterations of new knowledge have become essential. Retrieval-Augmented Generation (RAG), an advanced technique to compensate for the model's lack of new knowledge, is gradually becoming mainstream. As RAG enables the model to utilize external knowledge bases, it provides a new avenue for jailbreak attacks. In this paper, we are the first to propose the concept of indirect jailbreak, and we implement Retrieval-Augmented Generation via LangChain. Building on this, we further design a novel method of indirect jailbreak attack, termed Poisoned-LangChain (PLC), which leverages a poisoned external knowledge base to interact with large language models, thereby causing the large models to generate malicious non-compliant dialogues. We tested this method on six different large language models across three major categories of jailbreak issues. The experiments demonstrate that PLC successfully implemented indirect jailbreak attacks under three different scenarios, achieving success rates of 88.56%, 79.04%, and 82.69% respectively.
摘要：随着自然语言处理（NLP）的发展，大型语言模型（LLM）越来越受到人们的青睐。LLM 越来越融入日常生活，其安全漏洞也越来越受到人们的关注。因此，大型语言模型的安全性变得至关重要。目前，针对 LLM 的攻击和防御技术不断发展，其中一种重要的攻击方法是越狱攻击，旨在规避模型安全机制并诱导生成不当内容。现有的越狱攻击主要依靠制作诱导提示直接越狱，这对于具有强大过滤能力和高理解能力的大型模型来说效果不佳。随着大型语言模型对实时性的需求越来越大，实时更新和迭代新知识变得至关重要。检索增强生成（RAG）作为一种弥补模型缺乏新知识的先进技术，正逐渐成为主流。由于 RAG 使模型能够利用外部知识库，因此它为越狱攻击提供了新的途径。本文首次提出了间接越狱的概念，并通过 LangChain 实现了检索增强生成。在此基础上，我们进一步设计了一种新的间接越狱攻击方法，称为 Poisoned-LangChain (PLC)，该方法利用中毒的外部知识库与大型语言模型进行交互，从而导致大型模型生成恶意的不符合要求的对话。我们在三类主要越狱问题上对六种不同的大型语言模型进行了此方法的测试。实验表明，PLC 在三种不同场景下成功实施了间接越狱攻击，成功率分别为 88.56%、79.04% 和 82.69%。

Title: ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models

Authors: Ahmed Heakl, Youssef Mohamed, Noran Mohamed, Ali Sharkaway, Ahmed Zaky
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2406.18125
Pdf URL: https://arxiv.org/pdf/2406.18125
Copy Paste: [[2406.18125]] ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models(https://arxiv.org/abs/2406.18125)
Keywords: language model, llm
Abstract: The increasing reliance on online recruitment platforms coupled with the adoption of AI technologies has highlighted the critical need for efficient resume classification methods. However, challenges such as small datasets, lack of standardized resume templates, and privacy concerns hinder the accuracy and effectiveness of existing classification models. In this work, we address these challenges by presenting a comprehensive approach to resume classification. We curated a large-scale dataset of 13,389 resumes from diverse sources and employed Large Language Models (LLMs) such as BERT and Gemma1.1 2B for classification. Our results demonstrate significant improvements over traditional machine learning approaches, with our best model achieving a top-1 accuracy of 92% and a top-5 accuracy of 97.5%. These findings underscore the importance of dataset quality and advanced model architectures in enhancing the accuracy and robustness of resume classification systems, thus advancing the field of online recruitment practices.
摘要：对在线招聘平台的日益依赖以及人工智能技术的采用凸显了对高效简历分类方法的迫切需求。然而，数据集太小、缺乏标准化简历模板以及隐私问题等挑战阻碍了现有分类模型的准确性和有效性。在本文中，我们通过提出一种全面的简历分类方法来应对这些挑战。我们从不同来源整理了一个包含 13,389 份简历的大规模数据集，并使用 BERT 和 Gemma1.1 2B 等大型语言模型 (LLM) 进行分类。我们的结果表明，与传统机器学习方法相比，我们的分类方法有显著改进，我们的最佳模型实现了 92% 的 top-1 准确率和 97.5% 的 top-5 准确率。这些发现强调了数据集质量和先进的模型架构对于提高简历分类系统的准确性和稳健性的重要性，从而推动了在线招聘实践领域的发展。

Title: ConvoCache: Smart Re-Use of Chatbot Responses

Authors: Conor Atkins, Ian Wood, Mohamed Ali Kaafar, Hassan Asghar, Nardine Basta, Michal Kepkowski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18133
Pdf URL: https://arxiv.org/pdf/2406.18133
Copy Paste: [[2406.18133]] ConvoCache: Smart Re-Use of Chatbot Responses(https://arxiv.org/abs/2406.18133)
Keywords: llm, prompt, chat
Abstract: We present ConvoCache, a conversational caching system that solves the problem of slow and expensive generative AI models in spoken chatbots. ConvoCache finds a semantically similar prompt in the past and reuses the response. In this paper we evaluate ConvoCache on the DailyDialog dataset. We find that ConvoCache can apply a UniEval coherence threshold of 90% and respond to 89% of prompts using the cache with an average latency of 214ms, replacing LLM and voice synthesis that can take over 1s. To further reduce latency we test prefetching and find limited usefulness. Prefetching with 80% of a request leads to a 63% hit rate, and a drop in overall coherence. ConvoCache can be used with any chatbot to reduce costs by reducing usage of generative AI by up to 89%.
摘要：我们推出了 ConvoCache，这是一种对话缓存系统，可解决语音聊天机器人中生成式 AI 模型速度慢且成本高的问题。ConvoCache 会在过去找到语义相似的提示并重用该响应。在本文中，我们在 DailyDialog 数据集上评估了 ConvoCache。我们发现 ConvoCache 可以应用 90% 的 UniEval 一致性阈值，并使用缓存以平均 214 毫秒的延迟响应 89% 的提示，取代可能需要超过 1 秒的 LLM 和语音合成。为了进一步减少延迟，我们测试了预取并发现其用处有限。对 80% 的请求进行预取会导致命中率为 63%，并且整体一致性会下降。ConvoCache 可与任何聊天机器人一起使用，通过将生成式 AI 的使用率降低高达 89% 来降低成本。
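
Note: the abstract above describes reusing the response of a semantically similar past prompt. The sketch below illustrates only that general lookup pattern. It assumes a toy bag-of-words embedding and a cosine-similarity threshold, whereas the paper gates reuse on a UniEval coherence threshold and presumably uses a real sentence encoder. The class name, threshold value, and example dialogue are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ResponseCache:
    """Returns a cached response when a past prompt is similar enough."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def add(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def lookup(self, prompt: str) -> Optional[str]:
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Below the threshold the caller would fall back to the generative model.
        return best_response if best_score >= self.threshold else None


cache = ResponseCache(threshold=0.7)
cache.add("how are you today", "I'm doing well, thanks!")
hit = cache.lookup("how are you doing today")
miss = cache.lookup("what time does the train leave")
print(hit, miss)
```

A linear scan is fine for a toy cache; a deployment with many entries would more likely use an approximate nearest-neighbour index.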

Title: Assessing "Implicit" Retrieval Robustness of Large Language Models

Authors: Xiaoyu Shen, Rexhina Blloshmi, Dawei Zhu, Jiahuan Pei, Wei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18134
Pdf URL: https://arxiv.org/pdf/2406.18134
Copy Paste: [[2406.18134]] Assessing "Implicit" Retrieval Robustness of Large Language Models(https://arxiv.org/abs/2406.18134)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-augmented generation has gained popularity as a framework to enhance large language models with external knowledge. However, its effectiveness hinges on the retrieval robustness of the model. If the model lacks retrieval robustness, its performance is constrained by the accuracy of the retriever, resulting in significant compromises when the retrieved context is irrelevant. In this paper, we evaluate the "implicit" retrieval robustness of various large language models, instructing them to directly output the final answer without explicitly judging the relevance of the retrieved context. Our findings reveal that fine-tuning on a mix of gold and distracting context significantly enhances the model's robustness to retrieval inaccuracies, while still maintaining its ability to extract correct answers when retrieval is accurate. This suggests that large language models can implicitly handle relevant or irrelevant retrieved context by learning solely from the supervision of the final answer in an end-to-end manner. Introducing an additional process for explicit relevance judgment can be unnecessary and disrupts the end-to-end approach.
摘要：检索增强生成作为一种利用外部知识增强大型语言模型的框架，已经越来越受欢迎。然而，它的有效性取决于模型的检索鲁棒性。如果模型缺乏检索鲁棒性，其性能就会受到检索器准确性的限制，当检索到的上下文不相关时，会导致重大妥协。在本文中，我们评估了各种大型语言模型的“隐式”检索鲁棒性，指示它们直接输出最终答案，而不明确判断检索到的上下文的相关性。我们的研究结果表明，在黄金和分散注意力的上下文混合上进行微调可以显著增强模型对检索不准确的鲁棒性，同时仍保持在检索准确时提取正确答案的能力。这表明，大型语言模型可以通过以端到端的方式仅从最终答案的监督中学习，隐式地处理相关或不相关的检索上下文。引入额外的过程来进行明确的相关性判断可能是不必要的，而且会破坏端到端的方法。

Title: Automatic Speech Recognition for Hindi

Authors: Anish Saha, A.G. Ramakrishnan
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.18135
Pdf URL: https://arxiv.org/pdf/2406.18135
Copy Paste: [[2406.18135]] Automatic Speech Recognition for Hindi(https://arxiv.org/abs/2406.18135)
Keywords: language model
Abstract: Automatic speech recognition (ASR) is a key area in computational linguistics, focusing on developing technologies that enable computers to convert spoken language into text. This field combines linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, require handling real and unrestricted text. Text-to-speech systems directly work with real text, while ASR systems rely on language models trained on large text corpora. High-quality transcribed data is essential for training predictive models. The research involved two main components: developing a web application and designing a web interface for speech recognition. The web application, created with JavaScript and Node.js, manages large volumes of audio files and their transcriptions, facilitating collaborative human correction of ASR transcripts. It operates in real-time using a client-server architecture. The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects human speech presence, aiding efficient speech processing and reducing unnecessary processing during non-speech intervals, thus saving computation and network bandwidth in VoIP applications. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations.
摘要：自动语音识别 (ASR) 是计算语言学的一个关键领域，专注于开发使计算机能够将口语转换为文本的技术。该领域结合了语言学和机器学习。ASR 模型通过监督学习将语音音频映射到转录本，需要处理真实且不受限制的文本。文本转语音系统直接处理真实文本，而 ASR 系统则依赖于在大型文本语料库上训练的语言模型。高质量的转录数据对于训练预测模型至关重要。该研究涉及两个主要部分：开发 Web 应用程序和设计语音识别的 Web 界面。使用 JavaScript 和 Node.js 创建的 Web 应用程序管理大量音频文件及其转录本，促进 ASR 转录本的协作人工校正。它使用客户端-服务器架构实时运行。语音识别的 Web 界面从运行 Web 应用程序的任何设备录制 16 kHz 单声道音频，执行语音活动检测 (VAD)，并将音频发送到识别引擎。 VAD 可检测人类语音的存在，帮助高效处理语音，减少非语音间隔期间不必要的处理，从而节省 VoIP 应用中的计算和网络带宽。研究的最后阶段测试了一个神经网络，以将语音信号准确地与隐马尔可夫模型 (HMM) 状态对齐。这包括实施一种利用节点共激活的先验统计数据的新型反向传播方法。
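
Note: the abstract above says the web interface records 16 kHz mono audio and runs voice activity detection (VAD) before sending audio to the recognizer, but it does not name the VAD algorithm. The sketch below assumes the simplest common approach, a short-time energy threshold over fixed-length frames. The frame length, the threshold, and the synthetic signal are hypothetical choices for illustration.

```python
import math
from typing import List, Sequence

SAMPLE_RATE = 16_000  # 16 kHz mono, as stated in the abstract


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[start:start + frame_len]) / frame_len
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech(
    samples: Sequence[float], frame_ms: int = 20, threshold: float = 0.01
) -> List[bool]:
    """Flag each frame as active when its energy exceeds the threshold (assumed rule)."""
    frame_len = SAMPLE_RATE * frame_ms // 1000
    return [energy > threshold for energy in frame_energies(samples, frame_len)]


# Synthetic test signal: 100 ms silence, 100 ms of a 220 Hz tone, 100 ms silence.
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(1600)]
flags = detect_speech(silence + tone + silence)
print(flags)
```

Only frames flagged as active would need to be sent to the recognition engine, which is where the bandwidth and computation savings mentioned in the abstract come from.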

Title: LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

Authors: Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2406.18139
Pdf URL: https://arxiv.org/pdf/2406.18139
Copy Paste: [[2406.18139]] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference(https://arxiv.org/abs/2406.18139)
Keywords: language model, llm, long context, prompt
Abstract: Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships and related textual contexts. The predominance of image tokens means traditional optimizations for LLMs' KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill, the model prioritizes more textual attention over image features, and based on the multimodal interaction observation, a new proposed text-prior method is explored to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies using KV pairs merging. LOOK-M demonstrates that with a significant reduction in KV Cache memory usage, such as reducing it by 80% in some cases, it not only achieves up to 1.5x faster decoding but also maintains or even enhances performance across a variety of long context multimodal tasks.
摘要：长上下文多模态大型语言模型 (MLLM) 需要大量计算资源进行推理，因为随着输入长度的增加，其多模态键值 (KV) 缓存的增长对内存和时间效率提出了挑战。与仅管理文本上下文的单模态 LLM 不同，长上下文 MLLM 的 KV 缓存包括来自具有时间和空间关系以及相关文本上下文的多个图像的表示。图像标记的主导地位意味着 LLM 的 KV 缓存的传统优化不适合多模态长上下文设置，并且之前没有研究解决这一挑战。在这项工作中，我们引入了 LOOK-M，这是一种开创性的、无需微调的方法，可有效减少多模态 KV 缓存大小，同时保持与完整缓存相当的性能。我们观察到，在提示预填充期间，该模型优先考虑文本注意力而不是图像特征，并且基于多模态交互观察，探索了一种新的文本优先方法来压缩 KV 缓存。此外，为了减轻图像上下文信息的退化，我们提出了几种使用 KV 对合并的补偿策略。LOOK-M 表明，通过显着减少 KV Cache 内存使用量（例如在某些情况下减少 80%），它不仅可以实现高达 1.5 倍的解码速度，而且还可以在各种长上下文多模态任务中保持甚至提高性能。

Title: NeBuLa: A discourse aware Minecraft Builder

Authors: Akshay Chaturvedi, Kate Thompson, Nicholas Asher
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.18164
Pdf URL: https://arxiv.org/pdf/2406.18164
Copy Paste: [[2406.18164]] NeBuLa: A discourse aware Minecraft Builder(https://arxiv.org/abs/2406.18164)
Keywords: llm
Abstract: When engaging in collaborative tasks, humans efficiently exploit the semantic structure of a conversation to optimize verbal and nonverbal interactions. But in recent "language to code" or "language to action" models, this information is lacking. We show how incorporating the prior discourse and nonlinguistic context of a conversation situated in a nonlinguistic environment can improve the "language to action" component of such interactions. We fine-tune an LLM to predict actions based on prior context; our model, NeBuLa, doubles the net-action F1 score over the baseline on this task of Jayannavar et al. (2020). We also investigate our model's ability to construct shapes and understand location descriptions using a synthetic dataset.
摘要：在进行协作任务时，人类可以有效地利用对话的语义结构来优化口头和非口头交互。但在最近的“语言到代码”或“语言到行动”模型中，缺乏这些信息。我们展示了如何结合非语言环境中对话的先前话语和非语言背景来改善此类交互的“语言到行动”部分。我们对 LLM 进行了微调，以根据先前背景预测动作；我们的模型 NeBuLa 在 Jayannavar 等人（2020 年）的这项任务上将净动作 F1 得分提高了一倍。我们还研究了我们的模型使用合成数据集构建形状和理解位置描述的能力。

Title: UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs

Authors: Wenhao Li, Mingbao Lin, Yunshan Zhong, Shuicheng Yan, Rongrong Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18173
Pdf URL: https://arxiv.org/pdf/2406.18173
Copy Paste: [[2406.18173]] UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs(https://arxiv.org/abs/2406.18173)
Keywords: language model, llm, long context, chat
Abstract: Managing long texts is challenging for large language models (LLMs) due to limited context window sizes. This study introduces UIO-LLMs, an unbiased incremental optimization approach for memory-enhanced transformers under long-context settings. We initially conceptualize the process as a streamlined encoder-decoder framework where the weights-shared encoder and decoder respectively encapsulate a context segment into memories and leverage these memories to predict outputs of the subsequent segment. Subsequently, by treating our memory-enhanced transformers as fully-connected recurrent neural networks (RNNs), we refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm, which incorporates innovative incremental optimization techniques. These techniques not only diminish time complexity but also address the bias in gradient computation through an unbiased optimization process. UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters, while keeping the inference cost nearly linear as context length increases.
摘要：由于上下文窗口大小有限，管理长文本对于大型语言模型 (LLM) 来说具有挑战性。本研究介绍了 UIO-LLM，这是一种在长上下文设置下用于记忆增强型 Transformer 的无偏增量优化方法。我们最初将该过程概念化为一个精简的编码器-解码器框架，其中权重共享的编码器和解码器分别将上下文段封装到内存中，并利用这些内存来预测后续段的输出。随后，通过将我们的记忆增强型 Transformer 视为全连接的循环神经网络 (RNN)，我们使用截断时间反向传播 (TBPTT) 算法改进训练过程，该算法结合了创新的增量优化技术。这些技术不仅降低了时间复杂度，而且还通过无偏优化过程解决了梯度计算中的偏差。 UIO-LLM 成功处理了长上下文，例如用最少 2% 的附加参数将 Llama2-7b-chat 的上下文窗口从 4K 扩展到 100K 个标记，同时随着上下文长度的增加保持推理成本几乎是线性的。

Title: Selective Prompting Tuning for Personalized Conversations with LLMs

Authors: Qiushi Huang, Xubo Liu, Tom Ko, Bo Wu, Wenwu Wang, Yu Zhang, Lilian Tang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.18187
Pdf URL: https://arxiv.org/pdf/2406.18187
Copy Paste: [[2406.18187]] Selective Prompting Tuning for Personalized Conversations with LLMs(https://arxiv.org/abs/2406.18187)
Keywords: language model, llm, prompt
Abstract: In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models' (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to yield responses that are similar to the ground truths in datasets, while direct fine-tuning tends to produce repetitive or overly generic replies. To alleviate those issues, we propose Selective Prompt Tuning (SPT), which softly prompts LLMs for personalized conversations in a selective way. Concretely, SPT initializes a set of soft prompts and uses a trainable dense retriever to adaptively select suitable soft prompts for LLMs according to different input contexts, where the prompt retriever is dynamically updated through feedback from the LLMs. Additionally, we propose context-prompt contrastive learning and prompt fusion learning to encourage the SPT to enhance the diversity of personalized conversations. Experiments on the CONVAI2 dataset demonstrate that SPT significantly enhances response diversity by up to 90%, along with improvements in other critical performance indicators. Those results highlight the efficacy of SPT in fostering engaging and personalized dialogue generation. The SPT model code (this https URL) is publicly available for further exploration.
摘要：在对话式 AI 中，通过人物角色档案和情境理解来个性化对话至关重要。尽管大型语言模型 (LLM) 提高了响应连贯性，但有效的人物角色整合仍然是一个挑战。在这项工作中，我们首先研究了两种常见的个性化 LLM 方法：文本提示和直接微调。我们观察到，文本提示通常难以产生与数据集中的基本事实相似的响应，而直接微调往往会产生重复或过于通用的答复。为了缓解这些问题，我们提出了 Selective Prompt Tuning (SPT)，它以选择性的方式软提示 LLM 进行个性化对话。具体来说，SPT 初始化一组软提示，并使用可训练的密集检索器根据不同的输入上下文自适应地为 LLM 选择合适的软提示，其中提示检索器通过来自 LLM 的反馈动态更新。此外，我们提出了上下文提示对比学习和提示融合学习，以鼓励 SPT 增强个性化对话的多样性。在 CONVAI2 数据集上的实验表明，SPT 显著提高了响应多样性，最高可达 90%，同时还提高了其他关键性能指标。这些结果凸显了 SPT 在促进引人入胜和个性化对话生成方面的功效。SPT 模型代码（此 https URL）已公开，可供进一步探索。

Title: Methodology of Adapting Large English Language Models for Specific Cultural Contexts

Authors: Wenjing Zhang, Siqi Xiao, Xuejiao Lei, Ning Wang, Huazheng Zhang, Meijuan An, Bikun Yang, Zhaoxiang Liu, Kai Wang, Shiguo Lian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18192
Pdf URL: https://arxiv.org/pdf/2406.18192
Copy Paste: [[2406.18192]] Methodology of Adapting Large English Language Models for Specific Cultural Contexts(https://arxiv.org/abs/2406.18192)
Keywords: language model, llm
Abstract: The rapid growth of large language models (LLMs) has emerged as a prominent trend in the field of artificial intelligence. However, current state-of-the-art LLMs are predominantly based on English. They encounter limitations when directly applied to tasks in specific cultural domains, due to deficiencies in domain-specific knowledge and misunderstandings caused by differences in cultural values. To address this challenge, our paper proposes a rapid adaptation method for large models in specific cultural contexts, which leverages instruction-tuning based on specific cultural knowledge and safety values data. Taking Chinese as the specific cultural context and using LLaMA3-8B as the experimental English LLM, the evaluation results demonstrate that the adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values, while maintaining its original expertise advantages.
摘要：大型语言模型（LLM）的快速发展已成为人工智能领域的一大趋势。然而，目前最先进的LLM主要基于英语。由于缺乏领域特定知识以及文化价值观差异导致的误解，它们在直接应用于特定文化领域的任务时会受到限制。针对这一挑战，本文提出了一种特定文化语境中大型模型的快速适配方法，该方法利用基于特定文化知识和安全价值观数据的指令调优。以中文为特定文化语境，以LLaMA3-8B为实验性英文LLM，评估结果表明，适配后的LLM在保持原有专业优势的同时，显著提升了其在领域特定知识和对安全价值观的适应能力。

Title: SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding

Authors: Zhenglin Wang, Jialong Wu, Yilong Lai, Congzhi Zhang, Deyu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18200
Pdf URL: https://arxiv.org/pdf/2406.18200
Copy Paste: [[2406.18200]] SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding(https://arxiv.org/abs/2406.18200)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) demonstrate remarkable emergent abilities across various tasks, yet fall short on complex reasoning and planning tasks. Tree-search-based reasoning methods address this by surpassing the capabilities of chain-of-thought prompting, encouraging exploration of intermediate steps. However, such methods introduce significant inference latency due to the systematic exploration and evaluation of multiple thought paths. This paper introduces SeeD, a novel and efficient inference framework to optimize runtime speed and GPU memory management concurrently. By employing a scheduled speculative execution, SeeD efficiently handles multiple iterations for the thought generation and the state evaluation, leveraging a rounds-scheduled strategy to manage draft model dispatching. Extensive experimental evaluations on three reasoning datasets demonstrate superior speedup performance of SeeD, providing a viable path for batched inference in training-free speculative decoding.
摘要：大型语言模型 (LLM) 在各种任务中表现出非凡的突发能力，但在复杂的推理和规划任务方面却有所欠缺。基于树搜索的推理方法通过超越思路链提示的能力、鼓励探索中间步骤来解决此问题。然而，由于系统地探索和评估多种思维路径，此类方法引入了显著的推理延迟。本文介绍了 SeeD，这是一种新颖而高效的推理框架，可同时优化运行时速度和 GPU 内存管理。通过采用预定的推测执行，SeeD 可有效处理思维生成和状态评估的多次迭代，利用轮次调度策略来管理草稿模型调度。对三个推理数据集的大量实验评估证明了 SeeD 卓越的加速性能，为无训练推测解码中的批量推理提供了可行的途径。

Title: A Closer Look into Mixture-of-Experts in Large Language Models

Authors: Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.18219
Pdf URL: https://arxiv.org/pdf/2406.18219
Copy Paste: [[2406.18219]] A Closer Look into Mixture-of-Experts in Large Language Models(https://arxiv.org/abs/2406.18219)
Keywords: language model
Abstract: Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including (1) Neurons act like fine-grained experts. (2) The router of MoE usually selects experts with larger output norms. (3) The expert diversity increases as the layer increases, while the last layer is an outlier. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at this https URL.
摘要：混合专家 (MoE) 因其独特的属性和卓越的性能而受到越来越多的关注，尤其是在语言任务方面。通过稀疏地激活每个 token 的参数子集，MoE 架构可以在不牺牲计算效率的情况下增加模型大小，从而实现性能和训练成本之间的更好权衡。然而，MoE 的底层机制仍然缺乏进一步的探索，其模块化程度仍然存在疑问。在本文中，我们初步尝试了解基于 MoE 的大型语言模型的内部工作原理。具体来说，我们全面研究了三种最近基于 MoE 的模型的参数和行为特征，并揭示了一些有趣的观察结果，包括 (1) 神经元表现得像细粒度专家。 (2) MoE 的路由器通常会选择具有更大输出范数的专家。 (3) 专家多样性随着层的增加而增加，而最后一层是异常值。基于这些观察，我们还为广泛的 MoE 从业者提供了建议，例如路由器设计和专家分配。我们希望这项工作能够为 MoE 框架和其他模块化架构的未来研究提供启示。代码可在此 https URL 上获取。
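
The sparse activation described in this abstract can be illustrated with a minimal top-k routing sketch. It is not taken from any of the three models studied in the paper: the function names (`top_k_route`, `moe_forward`), the toy experts, and the choice to renormalize the softmax over only the selected experts are all assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(token, experts, router_logits, k=2):
    """Run only the k selected experts and mix their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, routes

# Hypothetical toy experts: each one scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
output, routes = moe_forward([1.0, -1.0], experts,
                             router_logits=[0.1, 2.0, -1.0, 1.0], k=2)
print(routes)   # which experts ran, and with what mixing weights
print(output)
```

Only two of the four experts are evaluated for the token, which is the sense in which model size can grow without a matching growth in per-token compute.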

Title: Enhancing Data Privacy in Large Language Models through Private Association Editing

Authors: Davide Venditti, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Cristina Giannone, Andrea Favalli, Raniero Romagnoli, Fabio Massimo Zanzotto
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18221
Pdf URL: https://arxiv.org/pdf/2406.18221
Copy Paste: [[2406.18221]] Enhancing Data Privacy in Large Language Models through Private Association Editing(https://arxiv.org/abs/2406.18221)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are powerful tools with extensive applications, but their tendency to memorize private information raises significant concerns as private data leakage can easily happen. In this paper, we introduce Private Association Editing (PAE), a novel defense approach for private data leakage. PAE is designed to effectively remove Personally Identifiable Information (PII) without retraining the model. Our approach consists of a four-step procedure: detecting memorized PII, applying PAE cards to mitigate memorization of private data, verifying resilience to targeted data extraction (TDE) attacks, and ensuring consistency in the post-edit LLMs. The versatility and efficiency of PAE, which allows for batch modifications, significantly enhance data privacy in LLMs. Experimental results demonstrate the effectiveness of PAE in mitigating private data leakage. We believe PAE will serve as a critical tool in the ongoing effort to protect data privacy in LLMs, encouraging the development of safer models for real-world applications.
摘要：大型语言模型 (LLM) 是一种功能强大的工具，应用范围广泛，但它们容易记住私人信息，这引起了人们的极大担忧，因为私人数据很容易泄露。在本文中，我们介绍了一种新的私人数据泄露防御方法——私人关联编辑 (PAE)。PAE 旨在有效地删除个人身份信息 (PII)，而无需重新训练模型。我们的方法包括四个步骤：检测记忆的 PII、应用 PAE 卡来减轻对私人数据的记忆、验证对有针对性的数据提取 (TDE) 攻击的抵御能力以及确保编辑后 LLM 的一致性。PAE 的多功能性和效率允许批量修改，从而显著增强了 LLM 中的数据隐私。实验结果证明了 PAE 在减轻私人数据泄露方面的有效性。我们相信 PAE 将成为保护 LLM 中数据隐私的持续努力中的重要工具，鼓励为实际应用开发更安全的模型。

Title: Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets

Authors: Simon Münker, Kai Kugler, Achim Rettinger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18239
Pdf URL: https://arxiv.org/pdf/2406.18239
Copy Paste: [[2406.18239]] Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets(https://arxiv.org/abs/2406.18239)
Keywords: prompt
Abstract: Filtering and annotating textual data are routine tasks in many areas, like social media or news analytics. Automating these tasks allows analyses to scale in speed and breadth of content covered and decreases the manual effort required. Technical advancements in Natural Language Processing, specifically the success of large foundation models, have made available a new tool for automating such annotation processes: a text-to-text interface that is given written guidelines but no training samples. In this work, we assess these advancements in-the-wild by empirically testing them in an annotation task on German Twitter data about social and political European crises. We compare the prompt-based results with our human annotation and preceding classification approaches, including Naive Bayes and a BERT-based fine-tuning/domain adaptation pipeline. Our results show that the prompt-based approach - despite being limited by local computation resources during the model selection - is comparable with the fine-tuned BERT but without any annotated training data. Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
摘要：过滤和注释文本数据是社交媒体或新闻分析等许多领域的常规任务。自动执行这些任务可以扩展分析速度和涵盖内容的广度，并减少所需的手动工作量。由于自然语言处理技术的进步，特别是大型基础模型的成功，一种新的工具已经问世，它可以使用给定书面指南的文本到文本界面来自动化此类注释过程，而无需提供训练样本。在这项工作中，我们通过在有关欧洲社会和政治危机的德国 Twitter 数据的注释任务中对这些进步进行实证测试来评估这些进步。我们将基于提示的结果与我们的人工注释和先前的分类方法（包括朴素贝叶斯和基于 BERT 的微调/领域自适应管道）进行了比较。我们的结果表明，基于提示的方法 - 尽管在模型选择期间受到本地计算资源的限制 - 但仍可与微调的 BERT 相媲美，但没有任何带注释的训练数据。我们的研究结果强调了 NLP 领域正在发生的范式转变，即下游任务的统一和对预先标记的训练数据的需求的消除。

Title: LLaMIPa: An Incremental Discourse Parser

Authors: Kate Thompson, Akshay Chaturvedi, Julie Hunter, Nicholas Asher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18256
Pdf URL: https://arxiv.org/pdf/2406.18256
Copy Paste: [[2406.18256]] LLaMIPa: An Incremental Discourse Parser(https://arxiv.org/abs/2406.18256)
Keywords: language model, llm
Abstract: This paper provides the first discourse parsing experiments with a large language model (LLM) finetuned on corpora annotated in the style of SDRT (Asher, 1993; Asher and Lascarides, 2003). The result is a discourse parser, LLaMIPa (LLaMA Incremental Parser), which is able to more fully exploit discourse context, leading to substantial performance gains over approaches that use encoder-only models to provide local, context-sensitive representations of discourse units. Furthermore, it is able to process discourse data incrementally, which is essential for the eventual use of discourse information in downstream tasks.
摘要：本文首次使用大型语言模型 (LLM) 进行了篇章解析实验，该模型在以 SDRT 风格注释的语料库上进行了微调（Asher，1993；Asher 和 Lascarides，2003）。结果是一个篇章解析器 LLaMIPa（LLaMA 增量解析器），它能够更充分地利用篇章上下文，与使用仅编码器模型提供篇章单元的本地、上下文敏感表示的方法相比，性能有了显著提升。此外，它能够增量地处理篇章数据，这对于最终在下游任务中使用篇章信息至关重要。

Title: Detecting Machine-Generated Texts: Not Just "AI vs Humans" and Explainability is Complicated

Authors: Jiazhou Ji, Ruizhe Li, Shujun Li, Jie Guo, Weidong Qiu, Zheng Huang, Chiyu Chen, Xiaoyu Jiang, Xinru Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18259
Pdf URL: https://arxiv.org/pdf/2406.18259
Copy Paste: [[2406.18259]] Detecting Machine-Generated Texts: Not Just "AI vs Humans" and Explainability is Complicated(https://arxiv.org/abs/2406.18259)
Keywords: llm
Abstract: As LLMs rapidly advance, concerns are increasing about the actual authorship of texts we see online and in the real world. The task of distinguishing LLM-authored texts is complicated by the nuanced and overlapping behaviors of both machines and humans. In this paper, we challenge the current practice of considering LLM-generated text detection a binary classification task of differentiating human from AI. Instead, we introduce a novel ternary text classification scheme, adding an "undecided" category for texts that could be attributed to either source, and we show that this new category is crucial to understand how to make the detection result more explainable to lay users. This research shifts the paradigm from merely classifying to explaining machine-generated texts, emphasizing the need for detectors to provide clear and understandable explanations to users. Our study involves creating four new datasets comprising texts from various LLMs and human authors. Based on the new datasets, we performed binary classification tests to ascertain the most effective SOTA detection methods and identified SOTA LLMs capable of producing harder-to-detect texts. We constructed a new dataset of texts generated by two top-performing LLMs and human authors, and asked three human annotators to produce ternary labels with explanation notes. This dataset was used to investigate how three top-performing SOTA detectors behave in the new ternary classification context. Our results highlight why the "undecided" category is much needed from the viewpoint of explainability. Additionally, we conducted an analysis of the explainability of the three best-performing detectors and the explanation notes of the human annotators, revealing insights about the complexity of explainable detection of machine-generated texts. Finally, we propose guidelines for developing future detection systems with improved explanatory power.
摘要：随着大型语言模型 (LLM) 的快速发展，人们越来越担心我们在网上和现实世界中看到的文本的实际作者身份的风险。区分 LLM 撰写的文本的任务因机器和人类的细微和重叠行为而变得复杂。在本文中，我们对当前将 LLM 生成的文本检测视为区分人类和人工智能的二元分类任务的做法提出了挑战。相反，我们引入了一种新颖的三元文本分类方案，为可以归因于任一来源的文本添加了一个“未定”类别，并且我们表明这个新类别对于理解如何使检测结果更容易向普通用户解释至关重要。这项研究将范式从仅仅分类转变为解释机器生成的文本，强调检测器需要向用户提供清晰易懂的解释。我们的研究涉及创建四个新的数据集，这些数据集由来自不同 LLM 和人类作者的文本组成。基于新数据集，我们进行了二元分类测试，以确定最有效的 SOTA 检测方法，并确定了能够生成更难检测的文本的 SOTA LLM。我们构建了一个由两个表现最好的 LLM 和人类作者生成的文本的新数据集，并要求三个人类注释者生成带有解释说明的三元标签。该数据集用于调查三个表现最好的 SOTA 检测器在新的三元分类环境中的表现。我们的结果强调了为什么从可解释性的角度看“未定”类别是十分必要的。此外，我们对三个表现最好的检测器的可解释性和人类注释者的解释说明进行了分析，揭示了对机器生成文本的可解释检测的复杂性的见解。最后，我们提出了开发具有更高解释能力的未来检测系统的指导方针。

Title: "Vorbești Românește?" A Recipe to Train Powerful Romanian LLMs with English Instructions

Authors: Mihai Masala, Denis C. Ilie-Ablachim, Alexandru Dima, Dragos Corlatescu, Miruna Zavelca, Ovio Olaru, Simina Terian-Dan, Andrei Terian-Dan, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] "Vorbești Românește?" A Recipe to Train Powerful Romanian LLMs with English Instructions(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage research on Romanian LLMs while concurrently creating a generalizable recipe, adequate for other low or less-resourced languages.
摘要：近年来，大型语言模型 (LLM) 在各种任务上的表现几乎与人类一样。虽然一些 LLM 是在多语言数据上训练的，但大多数训练数据都是英语；因此，它们在英语方面的表现远远超过其他语言。据我们所知，我们是第一个收集和翻译大量文本、说明和基准，并训练、评估和发布针对罗马尼亚语的开源 LLM 的人。我们在四个不同的类别上评估我们的方法，包括学术基准、MT-Bench（手动翻译）和专业构建的适合罗马尼亚语的历史、文化和社会基准。我们通过获得全面的最先进结果来论证 RoLLM 的实用性和高性能。我们公开发布所有资源（即数据、训练和评估代码、模型），以支持和鼓励对罗马尼亚语 LLM 的研究，同时创建一种可推广的配方，适用于其他资源较少或资源较少的语言。

Title: Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs

Authors: Lei Zhang, Yunshui Li, Jiaming Li, Xiaobo Xia, Jiaxi Yang, Run Luo, Minzheng Wang, Longze Chen, Junhao Liu, Min Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18294
Pdf URL: https://arxiv.org/pdf/2406.18294
Copy Paste: [[2406.18294]] Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs(https://arxiv.org/abs/2406.18294)
Keywords: language model, llm, prompt
Abstract: Some recently developed code large language models (Code LLMs) have been pre-trained on repository-level code data (Repo-Code LLMs), enabling these models to recognize repository structures and utilize cross-file information for code completion. However, in real-world development scenarios, simply concatenating the entire code repository often exceeds the context window limits of these Repo-Code LLMs, leading to significant performance degradation. In this study, we conducted extensive preliminary experiments and analyses on six Repo-Code LLMs. The results indicate that maintaining the topological dependencies of files and increasing the code file content in the completion prompts can improve completion accuracy; pruning the specific implementations of functions in all dependent files does not significantly reduce the accuracy of completions. Based on these findings, we propose a strategy named Hierarchical Context Pruning (HCP) to construct completion prompts with high informational code content. HCP models the code repository at the function level, maintaining the topological dependencies between code files while removing a large amount of irrelevant code content, which significantly reduces the input length for repository-level code completion. We applied the HCP strategy in experiments with six Repo-Code LLMs, and the results demonstrate that our proposed method can significantly enhance completion accuracy while substantially reducing the length of input. Our code and data are available at this https URL.
摘要：一些最近开发的代码大型语言模型（Code LLM）已经在存储库级代码数据（Repo-Code LLM）上进行了预训练，使这些模型能够识别存储库结构并利用跨文件信息进行代码补全。然而，在实际开发场景中，简单地连接整个代码存储库通常会超出这些Repo-Code LLM的上下文窗口限制，导致性能显著下降。在本研究中，我们对六个Repo-Code LLM进行了大量的初步实验和分析。结果表明，保持文件的拓扑依赖关系并增加补全提示中的代码文件内容可以提高补全准确率；修剪所有依赖文件中函数的具体实现不会显著降低补全准确率。基于这些发现，我们提出了一种名为分层上下文修剪（HCP）的策略来构建具有高信息量代码内容的补全提示。HCP在函数级别对代码存储库进行建模，在保持代码文件之间的拓扑依赖关系的同时删除大量不相关的代码内容，显著减少了存储库级代码补全的输入长度。我们在六个 Repo-Code LLM 的实验中应用了 HCP 策略，结果表明，我们提出的方法可以显著提高完成准确率，同时大幅缩短输入长度。我们的代码和数据可在此 https URL 上找到。
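
One finding in this abstract is that pruning function implementations in dependent files does not significantly hurt completion accuracy. The sketch below illustrates only that single idea for Python source files, using the standard `ast` module (it needs Python 3.9+ for `ast.unparse`). The helper name `prune_function_bodies` is hypothetical, and the sketch does not model the topological file dependencies that HCP also maintains; it is not the authors' implementation.

```python
import ast

def prune_function_bodies(source: str) -> str:
    """Keep signatures and docstrings; replace each function body with `...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            new_body = []
            # Preserve the docstring statement if the function has one.
            if ast.get_docstring(node, clean=False) is not None:
                new_body.append(node.body[0])
            # Stand-in for the removed implementation.
            new_body.append(ast.Expr(value=ast.Constant(value=Ellipsis)))
            node.body = new_body
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

# Hypothetical dependent file that would be placed in a completion prompt.
example = '''
import os

def load(path):
    """Read a file."""
    with open(path) as f:
        return f.read()

class Repo:
    def size(self):
        return len(os.listdir("."))
'''

pruned = prune_function_bodies(example)
print(pruned)
```

The pruned text still shows which names, signatures, and classes exist in the dependency, while dropping the statements that take up most of the context window.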

Title: FactFinders at CheckThat! 2024: Refining Check-worthy Statement Detection with LLMs through Data Pruning

Authors: Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18297
Pdf URL: https://arxiv.org/pdf/2406.18297
Copy Paste: [[2406.18297]] FactFinders at CheckThat! 2024: Refining Check-worthy Statement Detection with LLMs through Data Pruning(https://arxiv.org/abs/2406.18297)
Keywords: language model, gpt, llm, prompt
Abstract: The rapid dissemination of information through social media and the Internet has posed a significant challenge for fact-checking, including identifying check-worthy claims that fact-checkers should pay attention to, i.e. filtering claims needing fact-checking from a large pool of sentences. This challenge has stressed the need to focus on determining the priority of claims, specifically which claims are worth fact-checking. Despite advancements in this area in recent years, the application of large language models (LLMs), such as GPT, has only recently drawn attention in studies. However, many open-source LLMs remain underexplored. Therefore, this study investigates the application of eight prominent open-source LLMs with fine-tuning and prompt engineering to identify check-worthy statements from political transcriptions. Further, we propose a two-step data pruning approach to automatically identify high-quality training data instances for effective learning. The efficiency of our approach is demonstrated through evaluations on the English language dataset as part of the check-worthiness estimation task of CheckThat! 2024. Further, the experiments conducted with data pruning demonstrate that competitive performance can be achieved with only about 44% of the training data. Our team ranked first in the check-worthiness estimation task in the English language.
摘要：通过社交媒体和互联网快速传播的信息给事实核查带来了重大挑战，其中包括识别事实核查人员应注意的值得核查的声明，即从大量句子中筛选出需要事实核查的声明。这一挑战强调需要专注于确定声明的优先级，特别是哪些声明值得进行事实核查。尽管近年来该领域取得了进展，但大型语言模型 (LLM)（如 GPT）的应用直到最近才引起研究的关注。然而，许多开源 LLM 仍未得到充分探索。因此，本研究调查了八个著名的开源 LLM 结合微调和提示工程的应用，以从政治转录文本中识别值得核查的陈述。此外，我们提出了一种两步数据修剪方法来自动识别高质量的训练数据实例，以实现有效的学习。我们的方法的效率通过对英语语言数据集的评估得到了证明，这是 CheckThat! 2024 的核查价值评估任务的一部分。此外，通过数据修剪进行的实验表明，仅使用约 44% 的训练数据即可实现具有竞争力的性能。我们的团队在英语核查价值评估任务中排名第一。

Title: S3: A Simple Strong Sample-effective Multimodal Dialog System

Authors: Elisei Rykov, Egor Malkershin, Alexander Panchenko
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18305
Pdf URL: https://arxiv.org/pdf/2406.18305
Copy Paste: [[2406.18305]] S3: A Simple Strong Sample-effective Multimodal Dialog System(https://arxiv.org/abs/2406.18305)
Keywords: language model
Abstract: In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results on two compelling leaderboards: MMMU and AI Journey Contest 2023. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed effective data mixture for training such an architecture demonstrates that a multimodal model based on a strong language model and trained on a small amount of multimodal data can perform efficiently in the task of multimodal dialog.
摘要：在这项工作中，我们为多模态对话任务提出了一个概念简单但功能强大的基线，即 S3 模型，该模型在两个引人注目的排行榜上取得了接近最先进的结果：MMMU 和 AI Journey Contest 2023。该系统基于预先训练的大型语言模型、预先训练的图像和音频模态编码器以及可训练的模态投影仪。为训练这种架构而提出的有效数据混合表明，基于强语言模型并在少量多模态数据上训练的多模态模型可以在多模态对话任务中高效执行。

Title: AI-native Memory: A Pathway from LLMs Towards AGI

Authors: Jingbo Shang, Zai Zheng, Xiang Ying, Felix Tao, Mindverse Team
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18312
Pdf URL: https://arxiv.org/pdf/2406.18312
Copy Paste: [[2406.18312]] AI-native Memory: A Pathway from LLMs Towards AGI(https://arxiv.org/abs/2406.18312)
Keywords: language model, llm, long context, retrieval-augmented generation, agent
Abstract: Large language models (LLMs) have shown the world sparks of artificial general intelligence (AGI). One opinion, especially from some startups working on LLMs, argues that an LLM with nearly unlimited context length can realize AGI. However, they might be too optimistic about the long-context capability of (existing) LLMs: (1) Recent literature has shown that their effective context length is significantly smaller than their claimed context length; and (2) Our reasoning-in-a-haystack experiments further demonstrate that simultaneously finding the relevant information from a long context and conducting (simple) reasoning is nearly impossible. In this paper, we envision a pathway from LLMs to AGI through the integration of memory. We believe that AGI should be a system where LLMs serve as core processors. In addition to raw data, the memory in this system would store a large number of important conclusions derived from reasoning processes. Compared with retrieval-augmented generation (RAG), which merely processes raw data, this approach not only connects semantically related information more closely, but also simplifies complex inferences at the time of querying. As an intermediate stage, the memory will likely be in the form of natural language descriptions, which can be directly consumed by users too. Ultimately, every agent/person should have its own large personal model, a deep neural network model (thus AI-native) that parameterizes and compresses all types of memory, even those that cannot be described in natural language. Finally, we discuss the significant potential of AI-native memory as the transformative infrastructure for (proactive) engagement, personalization, distribution, and social in the AGI era, as well as the incurred privacy and security challenges with preliminary solutions.
摘要：大型语言模型（LLM）已经向世界展示了通用人工智能（AGI）的光芒。一种观点，尤其是来自一些从事 LLM 的初创公司的观点，认为具有几乎无限上下文长度的 LLM 可以实现 AGI。然而，他们可能对（现有）LLM 的长上下文能力过于乐观——（1）最近的文献表明，它们的有效上下文长度明显小于它们声称的上下文长度；（2）我们的大海捞针实验进一步表明，同时从长上下文中找到相关信息并进行（简单）推理几乎是不可能的。在本文中，我们设想了一条通过整合记忆从 LLM 到 AGI 的途径。我们认为 AGI 应该是一个以 LLM 为核心处理器的系统。除了原始数据外，该系统中的记忆还将存储从推理过程得出的大量重要结论。与仅仅处理原始数据的检索增强生成 (RAG) 相比，这种方法不仅将语义相关的信息联系得更紧密，而且简化了查询时的复杂推理。作为中间阶段，记忆将很可能以自然语言描述的形式存在，用户也可以直接使用。最终，每个代理/人都应该有自己的大型个人模型，一个深度神经网络模型（因此是 AI-native），它可以参数化和压缩所有类型的记忆，甚至那些无法用自然语言描述的记忆。最后，我们讨论了 AI 原生记忆作为 AGI 时代（主动）参与、个性化、分发和社交的变革性基础设施的巨大潜力，以及由此带来的隐私和安全挑战以及初步解决方案。

Title: MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Authors: Meng Fang, Xiangpeng Wan, Fei Lu, Fei Xing, Kai Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18321
Pdf URL: https://arxiv.org/pdf/2406.18321
Copy Paste: [[2406.18321]] MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data(https://arxiv.org/abs/2406.18321)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and DBRX-Instruct, and closed-source models from the GPT series and Gemini models. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly with the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.
摘要：大型语言模型 (LLM) 显著提高了自然语言理解能力，并表现出强大的问题解决能力。尽管取得了这些成功，但由于需要复杂的推理，大多数 LLM 仍然难以解决数学问题。本文使用新开发的“MathOdyssey”数据集研究了 LLM 的数学问题解决能力。该数据集包括高中和大学水平的各种数学问题，由知名机构的专家创建，以在高级问题解决场景中严格测试 LLM，并涵盖更广泛的学科领域。通过将 MathOdyssey 数据集作为资源提供给 AI 社区，我们旨在为理解和提高 AI 在复杂数学问题解决方面的能力做出贡献。我们对开源模型（例如 Llama-3 和 DBRX-Instruct）以及 GPT 系列和 Gemini 模型的闭源模型进行了基准测试。我们的结果表明，虽然 LLM 在常规和中等难度的任务上表现良好，但它们在奥林匹克级问题和复杂的大学级问题方面面临重大挑战。我们的分析显示，开源模型和闭源模型之间的性能差距正在缩小，但仍然存在重大挑战，特别是在最棘手的问题上。这项研究强调了持续进行研究以增强 LLM 的数学推理能力的必要性。数据集、结果和代码均已公开。

Title: PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models

Authors: Huixuan Zhang, Yun Lin, Xiaojun Wan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18326
Pdf URL: https://arxiv.org/pdf/2406.18326
Copy Paste: [[2406.18326]] PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models(https://arxiv.org/abs/2406.18326)
Keywords: language model, llm
Abstract: Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks. This inclusion can lead to misleadingly high scores on model leaderboards, yet result in disappointing performance in real-world applications. To address this benchmark contamination problem, we first propose a set of requirements that practical contamination detection methods should follow. Following these proposed requirements, we introduce PaCoST, a Paired Confidence Significance Testing method that effectively detects benchmark contamination in LLMs. Our method constructs a counterpart for each piece of data with the same distribution, and performs statistical analysis of the corresponding confidence to test whether the model is significantly more confident under the original benchmark. We validate the effectiveness of PaCoST and apply it on popular open-source models and benchmarks. We find that almost all models and benchmarks we tested are suspected of being contaminated to some degree. We finally call for new LLM evaluation methods.
摘要：众所周知，大型语言模型 (LLM) 是在大量数据上进行训练的，这些数据可能会有意或无意地包含来自常用基准的数据。这种包含可能会导致模型在排行榜上获得高分，但在实际应用中却导致令人失望的表现。为了解决这个基准污染问题，我们首先提出了一组实际污染检测方法应遵循的要求。根据这些提出的要求，我们引入了 PaCoST，一种配对置信显著性测试，以有效检测 LLM 中的基准污染。我们的方法为具有相同分布的每段数据构建一个对应部分，并对相应的置信度进行统计分析，以测试模型在原始基准下是否明显更自信。我们验证了 PaCoST 的有效性，并将其应用于流行的开源模型和基准。我们发现，我们测试的几乎所有模型和基准都被怀疑或多或少受到了污染。我们最后呼吁新的 LLM 评估方法。
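
The paired comparison described in this abstract can be sketched as a one-sided paired t-test on confidence scores. The abstract does not say which statistical test PaCoST uses, so the t-test, the function name `paired_t_statistic`, and every confidence value below are assumptions chosen for illustration, not results from the paper.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(original_conf, counterpart_conf):
    """Return (t, degrees_of_freedom) for the paired differences.

    A large positive t suggests the model is more confident on the
    original benchmark items than on their rewritten counterparts.
    """
    if len(original_conf) != len(counterpart_conf):
        raise ValueError("inputs must be paired item by item")
    diffs = [o - c for o, c in zip(original_conf, counterpart_conf)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Hypothetical model confidences on six benchmark items and their counterparts.
original = [0.92, 0.88, 0.95, 0.81, 0.90, 0.87]
counterpart = [0.71, 0.80, 0.78, 0.79, 0.66, 0.75]

t, df = paired_t_statistic(original, counterpart)

# One-sided critical value of Student's t for df = 5 at alpha = 0.05.
T_CRITICAL_DF5 = 2.015
print(f"t = {t:.3f}, df = {df}, flagged = {t > T_CRITICAL_DF5}")
```

Pairing each item with its own counterpart removes item-level difficulty from the comparison, so the test looks only at the confidence gap that the original wording adds.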

Title: Themis: Towards Flexible and Interpretable NLG Evaluation

Authors: Xinyu Hu, Li Lin, Mingqi Gao, Xunjian Yin, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18365
Pdf URL: https://arxiv.org/pdf/2406.18365
Copy Paste: [[2406.18365]] Themis: Towards Flexible and Interpretable NLG Evaluation(https://arxiv.org/abs/2406.18365)
Keywords: language model, gpt, llm
Abstract: The evaluation of natural language generation (NLG) tasks is a significant and longstanding research issue. With the recent emergence of powerful large language models (LLMs), some studies have turned to LLM-based automatic evaluation methods, which demonstrate great potential to become a new evaluation paradigm following traditional string-based and model-based metrics. However, despite the improved performance of existing methods, they still possess some deficiencies, such as dependency on references and limited evaluation flexibility. Therefore, in this paper, we meticulously construct a large-scale NLG evaluation corpus NLG-Eval with human and GPT-4 annotations to alleviate the lack of relevant data in this field. Furthermore, we propose Themis, an LLM dedicated to NLG evaluation, which has been trained with our designed multi-perspective consistency and rating-oriented preference alignment methods. Themis can conduct flexible and interpretable evaluations without references, and it exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
摘要：自然语言生成 (NLG) 任务的评估是一个重要且长期存在的研究问题。随着最近强大的大型语言模型 (LLM) 的出现，一些研究转向基于 LLM 的自动评估方法，这些方法显示出成为继传统的基于字符串和基于模型的指标之后的新评估范式的巨大潜力。然而，尽管现有方法的性能有所提高，但它们仍然存在一些不足之处，例如对参考的依赖和评估灵活性有限。因此，在本文中，我们精心构建了一个带有人工和 GPT-4 注释的大规模 NLG 评估语料库 NLG-Eval，以缓解该领域相关数据的缺乏。此外，我们提出了 Themis，一个专用于 NLG 评估的 LLM，它已经用我们设计的多视角一致性和面向评级的偏好对齐方法进行了训练。Themis 可以在没有参考的情况下进行灵活且可解释的评估，并且在各种 NLG 任务上表现出卓越的评估性能，同时很好地推广到未见过的任务并超越了包括 GPT-4 在内的其他评估模型。

Title: Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

Authors: Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have the capacity to store and recall facts. Through experimentation with open-source models, we observe that this ability to retrieve facts can be easily manipulated by changing contexts, even without altering their factual meanings. These findings highlight that LLMs might behave like an associative memory model where certain tokens in the contexts serve as clues to retrieving facts. We mathematically explore this property by studying how transformers, the building blocks of LLMs, can complete such memory tasks. We study a simple latent concept association problem with a one-layer transformer and we show theoretically and empirically that the transformer gathers information using self-attention and uses the value matrix for associative memory.
摘要：大型语言模型 (LLM) 具有存储和回忆事实的能力。通过对开源模型进行实验，我们观察到，这种检索事实的能力可以通过改变上下文轻松操纵，甚至无需改变其事实含义。这些发现强调，LLM 可能表现得像一个联想记忆模型，其中上下文中的某些标记可作为检索事实的线索。我们通过研究 LLM 的构建块转换器如何完成此类记忆任务，从数学上探索了这一特性。我们研究了具有单层转换器的简单潜在概念关联问题，并从理论和经验上证明了转换器使用自注意力收集信息并使用值矩阵进行联想记忆。

Title: LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Authors: Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, Alberto Testoni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18403
Pdf URL: https://arxiv.org/pdf/2406.18403
Copy Paste: [[2406.18403]] LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks(https://arxiv.org/abs/2406.18403)
Keywords: llm
Abstract: There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; when they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.
摘要：越来越多的人开始使用 LLM 生成的判断来评估 NLP 模型，而不是使用人类的判断。在没有与人类数据进行比较的情况下，这引发了人们对这些评估的有效性的担忧；如果使用专有模型进行评估，这也会引发对可重复性的担忧。我们提供了 JUDGE-BENCH，这是一个包含 20 个带有人类注释的 NLP 数据集的集合，并全面评估了 11 个当前的 LLM，涵盖了开放权重和专有模型，以了解它们复制注释的能力。我们的评估表明，每个 LLM 在与人类判断的相关性方面在数据集之间都表现出很大的差异。我们得出结论，LLM 尚未准备好系统地取代 NLP 中的人类判断。
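
The abstract reports that an LLM's correlation with human judgments varies widely across datasets. A minimal sketch of measuring that per dataset follows, assuming Spearman rank correlation as the agreement measure (the abstract does not name one). The dataset names and scores are hypothetical and are not drawn from JUDGE-BENCH.

```python
import math
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical (human scores, LLM scores) per dataset.
datasets = {
    "toy_summarization": ([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]),
    "toy_dialogue": ([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]),
}
per_dataset = {name: spearman(h, m) for name, (h, m) in datasets.items()}
print(per_dataset)
```

Reporting the spread of these per-dataset values, instead of one pooled number, is what exposes the variance the abstract describes.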

Title: IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons

Authors: Dan Shi, Renren Jin, Tianhao Shen, Weilong Dong, Xinwei Wu, Deyi Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18406
Pdf URL: https://arxiv.org/pdf/2406.18406
Copy Paste: [[2406.18406]] IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons(https://arxiv.org/abs/2406.18406)
Keywords: language model, llm
Abstract: It is widely acknowledged that large language models (LLMs) encode a vast reservoir of knowledge after being trained on mass data. Recent studies disclose knowledge conflicts in LLM generation, wherein outdated or incorrect parametric knowledge (i.e., encoded knowledge) contradicts new knowledge provided in the context. To mitigate such knowledge conflicts, we propose a novel framework, IRCAN (Identifying and Reweighting Context-Aware Neurons) to capitalize on neurons that are crucial in processing contextual cues. Specifically, IRCAN first identifies neurons that significantly contribute to context processing, utilizing a context-aware attribution score derived from integrated gradients. Subsequently, the identified context-aware neurons are strengthened via reweighting. In doing so, we steer LLMs to generate context-sensitive outputs with respect to the new knowledge provided in the context. Extensive experiments conducted across a variety of models and tasks demonstrate that IRCAN not only achieves remarkable improvements in handling knowledge conflicts but also offers a scalable, plug-and-play solution that can be integrated seamlessly with existing models.
摘要：众所周知，大型语言模型 (LLM) 在经过大量数据训练后，会编码大量知识。最近的研究揭示了 LLM 生成中的知识冲突，其中过时或不正确的参数知识（即编码知识）与上下文中提供的新知识相矛盾。为了缓解这种知识冲突，我们提出了一个新框架 IRCAN（识别和重新加权上下文感知神经元），以利用在处理上下文线索方面至关重要的神经元。具体而言，IRCAN 首先利用从积分梯度得出的上下文感知归因分数来识别对上下文处理有重大贡献的神经元。随后，通过重新加权来增强已识别的上下文感知神经元。在此过程中，我们引导 LLM 根据上下文中提供的新知识生成上下文敏感的输出。在各种模型和任务中进行的大量实验表明，IRCAN 不仅在处理知识冲突方面取得了显着的改进，而且还提供了可扩展的即插即用解决方案，可以与现有模型无缝集成。
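
The two steps named in the IRCAN abstract (attribution via integrated gradients, then reweighting) can be illustrated on a toy model. The sketch below is not the authors' implementation. It assumes a hypothetical three-neuron hidden layer feeding a sigmoid output, uses the activations without context as the integration baseline, and amplifies the output weight of the top-scoring neuron by an assumed factor `beta`. All names and numbers are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def output(hidden, w_out):
    """Toy model head: probability given to the context-supported answer."""
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))


def integrated_gradients(h_base, h_ctx, w_out, steps=50):
    """Attribute output(h_ctx) - output(h_base) to each hidden neuron.

    Integrates d(output)/d(h_i) along the straight path from the baseline
    activations to the with-context activations (midpoint Riemann sum).
    """
    n = len(h_ctx)
    avg_grads = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        h = [b + alpha * (c - b) for b, c in zip(h_base, h_ctx)]
        s = output(h, w_out)
        for i in range(n):
            # Derivative of sigmoid(w . h) with respect to h_i.
            avg_grads[i] += w_out[i] * s * (1.0 - s) / steps
    return [(c - b) * g for b, c, g in zip(h_base, h_ctx, avg_grads)]


def reweight(w_out, scores, top_k=1, beta=2.0):
    """Scale the output weights of the top_k highest-scoring neurons by beta."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:top_k])
    return [w * beta if i in top else w for i, w in enumerate(w_out)]


# Hypothetical activations of the same neurons without and with the context.
h_without_context = [0.2, 0.5, 0.1]
h_with_context = [0.2, 0.6, 1.5]
w_out = [0.4, -0.3, 0.8]

scores = integrated_gradients(h_without_context, h_with_context, w_out)
new_w = reweight(w_out, scores, top_k=1, beta=2.0)

print("attribution scores:", scores)
print("before:", output(h_with_context, w_out))
print("after: ", output(h_with_context, new_w))
```

In this toy setting the third neuron changes most when the context is present and receives the largest score, so strengthening it raises the probability of the context-supported answer.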

Title: Cascading Large Language Models for Salient Event Graph Generation

Authors: Xingwei Tan, Yuxiang Zhou, Gabriele Pergola, Yulan He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.18449
Pdf URL: https://arxiv.org/pdf/2406.18449
Copy Paste: [[2406.18449]] Cascading Large Language Models for Salient Event Graph Generation(https://arxiv.org/abs/2406.18449)
Keywords: language model, llm, prompt
Abstract: Generating event graphs from long documents is challenging due to the inherent complexity of multiple tasks involved such as detecting events, identifying their relationships, and reconciling unstructured input with structured graphs. Recent studies typically consider all events with equal importance, failing to distinguish salient events crucial for understanding narratives. This paper presents CALLMSAE, a CAscading Large Language Model framework for SAlient Event graph generation, which leverages the capabilities of LLMs and eliminates the need for costly human annotations. We first identify salient events by prompting LLMs to generate summaries, from which salient events are identified. Next, we develop an iterative code refinement prompting strategy to generate event relation graphs, removing hallucinated relations and recovering missing edges. Fine-tuning contextualised graph generation models on the LLM-generated graphs outperforms the models trained on CAEVO-generated data. Experimental results on a human-annotated test set show that the proposed method generates salient and more accurate graphs, outperforming competitive baselines.
摘要：从长文档生成事件图是一项挑战，因为涉及多项任务的固有复杂性，例如检测事件、识别它们的关系以及将非结构化输入与结构化图进行协调。最近的研究通常认为所有事件都具有同等重要性，未能区分对理解叙述至关重要的突出事件。本文介绍了 CALLMSAE，这是一种用于生成突出事件图的级联大型语言模型框架，它利用了 LLM 的功能并消除了对昂贵的人工注释的需求。我们首先通过提示 LLM 生成摘要来识别突出事件，然后从中识别突出事件。接下来，我们开发了一种迭代代码细化提示策略来生成事件关系图，消除幻觉关系并恢复缺失的边缘。在 LLM 生成的图上微调上下文化图生成模型优于在 CAEVO 生成的数据上训练的模型。在人工注释的测试集上的实验结果表明，所提出的方法生成了显著且更准确的图形，其表现优于竞争基线。

Title: Role-Play Zero-Shot Prompting with Large Language Models for Open-Domain Human-Machine Conversation

Authors: Ahmed Njifenjou, Virgile Sucal, Bassam Jabaian, Fabrice Lefèvre
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2406.18460
Pdf URL: https://arxiv.org/pdf/2406.18460
Copy Paste: [[2406.18460]] Role-Play Zero-Shot Prompting with Large Language Models for Open-Domain Human-Machine Conversation(https://arxiv.org/abs/2406.18460)
Keywords: language model, llm, prompt, agent
Abstract: Recently, various methods have been proposed to create open-domain conversational agents with Large Language Models (LLMs). These models are able to answer user queries, but in a one-way Q&A format rather than a true conversation. Fine-tuning on particular datasets is the usual way to modify their style to increase conversational ability, but this is expensive and usually only available in a few languages. In this study, we explore role-play zero-shot prompting as an efficient and cost-effective solution for open-domain conversation, using capable multilingual LLMs (Beeching et al., 2023) trained to obey instructions. We design a prompting system that, when combined with an instruction-following model - here Vicuna (Chiang et al., 2023) - produces conversational agents that match and even surpass fine-tuned models in human evaluation in French in two different tasks.
摘要：最近，人们提出了各种方法来使用大型语言模型 (LLM) 创建开放域对话代理。这些模型能够回答用户查询，但采用单向问答形式，而不是真正的对话。对特定数据集进行微调是修改其风格以提高对话能力的常用方法，但这种方法成本高昂，而且通常只适用于少数语言。在本研究中，我们探索角色扮演零样本提示作为一种高效且经济高效的开放域对话解决方案，使用经过训练能够服从指令的多功能 LLM（Beeching 等人，2023 年）。我们设计了一个提示系统，当与指令遵循模型（此处为 Vicuna（Chiang 等人，2023 年））相结合时，产生的对话代理在两个不同的法语任务中与人类评估中的微调模型相匹配甚至超越微调模型。

Title: WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Authors: Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18495
Pdf URL: https://arxiv.org/pdf/2406.18495
Copy Paste: [[2406.18495]] WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs(https://arxiv.org/abs/2406.18495)
Keywords: gpt, llm, prompt
Abstract: We introduce WildGuard -- an open, light-weight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Together, WildGuard serves the increasing needs for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all the three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.
摘要：我们推出了 WildGuard——一种开放的轻量级 LLM 安全审核工具，可实现三个目标：(1) 识别用户提示中的恶意意图，(2) 检测模型响应的安全风险，以及 (3) 确定模型拒绝率。总之，WildGuard 满足了对 LLM 交互的自动安全审核和评估日益增长的需求，提供了一站式工具，具有更高的准确性和广泛的覆盖范围，涵盖 13 个风险类别。虽然现有的开放审核工具（如 Llama-Guard2）在对直接模型交互进行分类方面得分相当高，但它们远远落后于提示的 GPT-4，特别是在识别对抗性越狱和评估模型的拒绝方面，这是评估模型响应中安全行为的关键指标。为了应对这些挑战，我们构建了 WildGuardMix，这是一个大规模且经过精心平衡的多任务安全审核数据集，其中包含 92K 个标记示例，涵盖原始（直接）提示和对抗性越狱，并搭配各种拒绝和顺从响应。 WildGuardMix 是 WildGuardTrain（WildGuard 的训练数据）和 WildGuardTest（一个高质量的人工注释审核测试集，包含 5K 个标记项目，涵盖广泛的风险场景）的组合。通过对 WildGuardTest 和 10 个现有公共基准的广泛评估，我们表明，与 10 个强大的现有开源审核模型相比，WildGuard 在所有三个任务中都建立了开源安全审核方面的最先进性能（例如，拒绝检测提高了 26.4%）。重要的是，WildGuard 的性能与 GPT-4 相当，有时甚至超过 GPT-4（例如，及时识别危害性提高了 3.9%）。WildGuard 是 LLM 界面中非常有效的安全审核器，将越狱攻击的成功率从 79.8% 降低到 2.4%。

Title: Is In-Context Learning a Type of Gradient-Based Learning? Evidence from the Inverse Frequency Effect in Structural Priming

Authors: Zhenghao Zhou, Robert Frank, R. Thomas McCoy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18501
Pdf URL: https://arxiv.org/pdf/2406.18501
Copy Paste: [[2406.18501]] Is In-Context Learning a Type of Gradient-Based Learning? Evidence from the Inverse Frequency Effect in Structural Priming(https://arxiv.org/abs/2406.18501)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown the emergent capability of in-context learning (ICL). One line of research has explained ICL as functionally performing gradient descent. In this paper, we introduce a new way of diagnosing whether ICL is functionally equivalent to gradient-based learning. Our approach is based on the inverse frequency effect (IFE) -- a phenomenon in which an error-driven learner is expected to show larger updates when trained on infrequent examples than frequent ones. The IFE has previously been studied in psycholinguistics because humans show this effect in the context of structural priming (the tendency for people to produce sentence structures they have encountered recently); the IFE has been used as evidence that human structural priming must involve error-driven learning mechanisms. In our experiments, we simulated structural priming within ICL and found that LLMs display the IFE, with the effect being stronger in larger models. We conclude that ICL is indeed a type of gradient-based learning, supporting the hypothesis that a gradient component is implicitly computed in the forward pass during ICL. Our results suggest that both humans and LLMs make use of gradient-based, error-driven processing mechanisms.
摘要：大型语言模型 (LLM) 已展现出上下文学习 (ICL) 的新兴能力。有一项研究将 ICL 解释为功能上执行梯度下降。在本文中，我们介绍了一种诊断 ICL 是否在功能上等同于基于梯度的学习的新方法。我们的方法基于逆频率效应 (IFE)——一种现象，即错误驱动的学习者在训练不常见示例时，预计会比训练常见示例时显示更大的更新。IFE 之前已在心理语言学中得到研究，因为人类在结构启动（人们倾向于产生他们最近遇到的句子结构）的背景下表现出这种效应；IFE 已被用作人类结构启动必须涉及错误驱动学习机制的证据。在我们的实验中，我们在 ICL 中模拟了结构启动，发现 LLM 显示了 IFE，并且这种效果在较大的模型中更强。我们得出结论，ICL 确实是一种基于梯度的学习，支持了梯度分量在 ICL 期间的前向传递中隐式计算的假设。我们的结果表明，人类和 LLM 都使用基于梯度、错误驱动的处理机制。
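
The inverse frequency effect described above follows directly from how an explicit error-driven learner updates. The sketch below is a minimal illustration and not the paper's experimental setup. It assumes a hypothetical learner with a single logit that encodes the probability of producing one sentence structure, and applies one gradient step on the negative log-likelihood after the structure is observed once. The learning rate and starting probabilities are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prime_update(logit, lr=0.5):
    """Return the change in P(structure) after one observed use of it.

    For loss = -log(sigmoid(logit)) the gradient with respect to the logit
    is -(1 - p), so the step size grows as the structure gets less expected.
    """
    p_before = sigmoid(logit)
    new_logit = logit + lr * (1.0 - p_before)
    return sigmoid(new_logit) - p_before


# Structure the learner already expects 80% of the time vs. 20% of the time.
frequent_shift = prime_update(logit=math.log(0.8 / 0.2))
infrequent_shift = prime_update(logit=math.log(0.2 / 0.8))

print("shift after a frequent prime:  ", frequent_shift)
print("shift after an infrequent prime:", infrequent_shift)
```

With these values the infrequent structure gains more probability from a single exposure than the frequent one, which is the pattern the abstract reports looking for in in-context learning.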

Title: WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Authors: Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.18510
Pdf URL: https://arxiv.org/pdf/2406.18510
Copy Paste: [[2406.18510]] WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models(https://arxiv.org/abs/2406.18510)
Keywords: language model, llm, prompt, chat
Abstract: We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are open. With WildTeaming we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbreak contribute to achieving balanced safety behaviors of models.
摘要：我们引入了 WildTeaming，这是一个自动 LLM 安全红队框架，它可以挖掘野外用户与聊天机器人的交互，以发现 5.7K 个独特的新型越狱策略集群，然后组合多种策略来系统地探索新型越狱。与之前通过招募人类工作者、基于梯度的优化或使用 LLM 进行迭代修订进行红队的工作相比，我们的工作调查了没有被特别指示破解系统的聊天机器人用户的越狱行为。WildTeaming 揭示了前沿 LLM 以前未被发现的漏洞，与最先进的越狱方法相比，对抗性攻击的多样性和成功率提高了 4.6 倍。虽然存在许多用于越狱评估的数据集，但很少有用于越狱训练的开源数据集，因为即使模型权重是开放的，安全训练数据也已关闭。我们利用 WildTeaming 创建了 WildJailbreak，这是一个大型开源合成安全数据集，包含 262K 个原始（直接请求）和对抗（复杂越狱）提示响应对。为了缓解夸大的安全行为，WildJailbreak 提供了两种对比类型的查询：1) 有害查询（原始和对抗）和 2) 形式上类似于有害查询但不包含危害的良性查询。由于 WildJailbreak 大大提升了现有安全资源的质量和规模，它以独特的方式使我们能够在安全训练过程中检查数据的扩展效应以及数据属性和模型功能的相互作用。通过大量实验，我们确定了实现安全行为理想平衡的训练属性：适当的保护措施而不会过度拒绝、有效处理原始和对抗查询以及一般能力的下降（如果有的话）最小。WildJailbeak 的所有组件都有助于实现模型的平衡安全行为。

Title: "Is ChatGPT a Better Explainer than My Professor?": Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline

Authors: Grace Li, Milad Alshomary, Smaranda Muresan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] "Is ChatGPT a Better Explainer than My Professor?": Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline(https://arxiv.org/abs/)
Keywords: language model, gpt, llm, chat
Abstract: Explanations form the foundation of knowledge sharing and build upon communication principles, social dynamics, and learning theories. We focus specifically on conversational approaches for explanations because the context is highly adaptive and interactive. Our research leverages previous work on explanatory acts, a framework for understanding the different strategies that explainers and explainees employ in a conversation to explain, understand, and engage with the other party. We use the 5-Levels dataset, which was constructed from the WIRED YouTube series by Wachsmuth et al. and later annotated with explanatory acts by Booshehri et al. These annotations provide a framework for understanding how explainers and explainees structure their responses. With the rise of generative AI in the past year, we hope to better understand the capabilities of Large Language Models (LLMs) and how they can augment expert explainers' capabilities in conversational settings. To achieve this goal, the 5-Levels dataset (we use Booshehri et al.'s 2023 version annotated with explanatory acts) allows us to audit the ability of LLMs to engage in explanation dialogues. To evaluate the effectiveness of LLMs in generating explainer responses, we asked human annotators to compare 3 different strategies: human explainer response, GPT4 standard response, and GPT4 response with Explanation Moves.
摘要：解释是知识共享的基础，建立在沟通原则、社会动态和学习理论之上。我们特别关注对话式解释方法，因为上下文具有高度的适应性和交互性。我们的研究利用了之前关于解释行为的研究，这是一个框架，用于理解解释者和被解释者在对话中用于解释、理解和与对方互动的不同策略。我们使用的 5 级数据集由 Wachsmuth 等人从 WIRED YouTube 系列中构建，后来由 Booshehri 等人用解释行为进行了注释。这些注释提供了一个框架，用于理解解释者和被解释者在制定响应时如何构建他们的响应。随着过去一年生成式人工智能的兴起，我们希望更好地了解大型语言模型 (LLM) 的功能以及它们如何在对话环境中增强专家解释者的能力。为了实现这一目标，5 级数据集（我们使用 Booshehri 等人的 2023 年注释数据集，其中包含解释性动作。）使我们能够审核 LLM 参与解释对话的能力。为了评估 LLM 在生成解释者响应方面的有效性，我们比较了 3 种不同的策略，我们要求人类注释者评估 3 种不同的策略：人类解释者响应、GPT4 标准响应、带有解释动作的 GPT4 响应。

Title: APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

Authors: Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2406.18518
Pdf URL: https://arxiv.org/pdf/2406.18518
Copy Paste: [[2406.18518]] APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets(https://arxiv.org/abs/2406.18518)
Keywords: gpt, agent
Abstract: The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data point in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains. The dataset is available on Huggingface: this https URL and the project homepage: this https URL
摘要：函数调用代理模型的进步需要多样化、可靠和高质量的数据集。本文介绍了 APIGen，这是一种自动化数据生成管道，旨在为函数调用应用程序合成可验证的高质量数据集。我们利用 APIGen 并收集了 21 个不同类别的 3,673 个可执行 API，以可扩展和结构化的方式生成多样化的函数调用数据集。我们的数据集中的每个数据都经过三个分层阶段的验证：格式检查、实际函数执行和语义验证，以确保其可靠性和正确性。我们证明，使用我们精选的数据集训练的模型，即使只有 7B 个参数，也可以在伯克利函数调用基准上实现最先进的性能，优于多个 GPT-4 模型。此外，我们的 1B 模型实现了卓越的性能，超越了 GPT-3.5-Turbo 和 Claude-3 Haiku。我们发布了一个包含 60,000 个高质量条目的数据集，旨在推动函数调用代理领域的发展。该数据集可在 Huggingface 上找到：此 https URL 和项目主页：此 https URL
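
The three hierarchical verification stages listed in the APIGen abstract (format checking, actual function execution, semantic verification) can be sketched as a short filter. This is an illustration under stated assumptions and not the released pipeline. The entry schema (`query`, `call`, `expected`), the one-function `API_REGISTRY`, and the equality-based semantic check are all hypothetical stand-ins, and the abstract does not say how semantic verification is performed.

```python
import json

# Hypothetical registry of executable APIs; the real collection is far larger.
API_REGISTRY = {"add": lambda a, b: a + b}


def check_format(raw):
    """Stage 1: the entry must be JSON with the fields this sketch expects."""
    try:
        entry = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(entry, dict) or not isinstance(entry.get("query"), str):
        return None
    call = entry.get("call")
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return None
    if call["name"] not in API_REGISTRY:
        return None
    if not isinstance(call.get("arguments"), dict):
        return None
    return entry


def execute(entry):
    """Stage 2: run the call for real; any exception rejects the entry."""
    call = entry["call"]
    try:
        return True, API_REGISTRY[call["name"]](**call["arguments"])
    except Exception:
        return False, None


def check_semantics(entry, result):
    """Stage 3 (placeholder): does the result answer the query as intended?

    Assumption: a stored expected value stands in for whatever semantic
    checker the authors use.
    """
    return result == entry.get("expected")


def verify(raw):
    """Return 'accepted' or the name of the first stage that rejects raw."""
    entry = check_format(raw)
    if entry is None:
        return "format"
    ok, result = execute(entry)
    if not ok:
        return "execution"
    if not check_semantics(entry, result):
        return "semantic"
    return "accepted"


sample = json.dumps({
    "query": "What is 2 plus 3?",
    "call": {"name": "add", "arguments": {"a": 2, "b": 3}},
    "expected": 5,
})
print(verify(sample))
```

Ordering the stages from cheapest to most expensive means malformed entries are discarded before any function is run, and only entries that execute cleanly reach the semantic check.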

Title: CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Authors: Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2406.18521
Pdf URL: https://arxiv.org/pdf/2406.18521
Copy Paste: [[2406.18521]] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs(https://arxiv.org/abs/2406.18521)
Keywords: language model, gpt, llm, chat
Abstract: Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: this https URL
摘要：在将多模态大型语言模型 (MLLM) 应用于分析科学论文或财务报告等实际任务时，图表理解起着关键作用。然而，现有数据集通常侧重于过于简单和同质的图表以及基于模板的问题，导致对进展的衡量过于乐观。我们证明，尽管开源模型在这些基准测试中似乎优于强大的专有模型，但使用略有不同的图表或问题进行的简单压力测试可能会使性能下降高达 34.5%。在这项工作中，我们提出了 CharXiv，这是一个全面的评估套件，涉及来自 arXiv 论文的 2,323 张自然、具有挑战性和多样性的图表。CharXiv 包括两种类型的问题：1) 关于检查基本图表元素的描述性问题和 2) 需要综合图表中复杂视觉元素信息的推理问题。为了确保质量，所有图表和问题都经过人工专家的精心挑选、策划和验证。我们的结果表明，最强大的专有模型（即 GPT-4o，准确率为 47.1%）和最强大的开源模型（即 InternVL Chat V1.5，准确率为 29.2%）之间的推理能力存在巨大、之前被低估的差距。所有模型都远远落后于人类 80.5% 的表现，凸显了现有 MLLM 在图表理解能力方面的弱点。我们希望 CharXiv 能够通过提供更现实、更可靠的进度衡量标准，促进未来对 MLLM 图表理解的研究。项目页面和排行榜：此 https URL

Title: PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation

Authors: Christoph Leiter, Steffen Eger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation(https://arxiv.org/abs/)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have revolutionized the field of NLP. Notably, their in-context learning capabilities also enable their use as evaluation metrics for natural language generation, making them particularly advantageous in low-resource scenarios and time-restricted applications. In this work, we introduce PrExMe, a large-scale prompt exploration for metrics, where we evaluate more than 720 prompt templates for open-source LLM-based metrics on machine translation (MT) and summarization datasets, totalling over 6.6M evaluations. This extensive comparison (1) serves as a benchmark of the performance of recent open-source LLMs as metrics and (2) explores the stability and variability of different prompting strategies. We discover that, on the one hand, there are scenarios for which prompts are stable. For instance, some LLMs show idiosyncratic preferences and prefer to grade generated texts with textual labels, while others prefer to return numeric scores. On the other hand, the stability of prompts and model rankings can be susceptible to seemingly innocuous changes. For example, changing the requested output format from "0 to 100" to "-1 to +1" can strongly affect the rankings in our evaluation. Our study contributes to understanding the impact of different prompting approaches on LLM-based metrics for MT and summarization evaluation, highlighting the most stable prompting patterns and potential limitations.
摘要：大型语言模型 (LLM) 彻底改变了 NLP 领域。值得注意的是，它们的上下文学习能力还使它们能够用作自然语言生成的评估指标，这使得它们在资源匮乏的场景和时间受限的应用中特别有利。在这项工作中，我们引入了 PrExMe，这是一种大规模的指标提示探索，我们在机器翻译 (MT) 和摘要数据集上评估了 720 多个基于开源 LLM 的指标提示模板，总计超过 660 万次评估。这种广泛的比较 (1) 可作为近期开源 LLM 作为指标的性能基准，(2) 可探索不同提示策略的稳定性和可变性。我们发现，一方面，有些场景的提示是稳定的。例如，一些 LLM 表现出特殊的偏好，倾向于使用文本标签对生成的文本进行评分，而另一些则更喜欢返回数字分数。另一方面，提示和模型排名的稳定性可能会受到看似无害的变化的影响。例如，将要求的输出格式从“0 到 100”更改为“-1 到 +1”会严重影响我们评估中的排名。我们的研究有助于了解不同提示方法对基于 LLM 的机器翻译和摘要评估指标的影响，突出最稳定的提示模式和潜在的局限性。

Title: Symbolic Learning Enables Self-Evolving Agents

Authors: Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.18532
Pdf URL: https://arxiv.org/pdf/2406.18532
Copy Paste: [[2406.18532]] Symbolic Learning Enables Self-Evolving Agents(https://arxiv.org/abs/2406.18532)
Keywords: language model, llm, prompt, agent
Abstract: The AI community has been exploring a pathway to artificial general intelligence (AGI) by developing "language agents", which are complex large language model (LLM) pipelines involving both prompting techniques and tool usage methods. While language agents have demonstrated impressive capabilities for many real-world tasks, a fundamental limitation of current language agent research is that it is model-centric, or engineering-centric. That is to say, progress on the prompts, tools, and pipelines of language agents requires substantial manual engineering effort from human experts rather than automatic learning from data. We believe the transition from model-centric, or engineering-centric, to data-centric, i.e., the ability of language agents to autonomously learn and evolve in environments, is the key for them to possibly achieve AGI. In this work, we introduce agent symbolic learning, a systematic framework that enables language agents to optimize themselves on their own in a data-centric way using symbolic optimizers. Specifically, we consider agents as symbolic networks where learnable weights are defined by prompts, tools, and the way they are stacked together. Agent symbolic learning is designed to optimize the symbolic network within language agents by mimicking two fundamental algorithms in connectionist learning: back-propagation and gradient descent. Instead of dealing with numeric weights, agent symbolic learning works with natural language simulacrums of weights, loss, and gradients. We conduct proof-of-concept experiments on both standard benchmarks and complex real-world tasks and show that agent symbolic learning enables language agents to update themselves after being created and deployed in the wild, resulting in "self-evolving agents".
摘要：人工智能社区一直在探索通用人工智能 (AGI) 的途径，即开发“语言代理”，即涉及提示技术和工具使用方法的复杂大型语言模型 (LLM) 管道。虽然语言代理在许多现实世界的任务中都表现出令人印象深刻的能力，但当前语言代理研究的一个根本限制是它们以模型为中心或以工程为中心。也就是说，语言代理的提示、工具和管道方面的进展需要人类专家进行大量的手动工程工作，而不是自动从数据中学习。我们认为从以模型为中心或以工程为中心过渡到以数据为中心，即语言代理在环境中自主学习和发展的能力，是它们可能实现 AGI 的关键。在这项工作中，我们引入了代理符号学习，这是一个系统框架，使语言代理能够使用符号优化器以数据为中心自行优化自身。具体来说，我们将代理视为符号网络，其中可学习的权重由提示、工具及其堆叠方式定义。代理符号学习旨在通过模仿联结学习中的两种基本算法来优化语言代理中的符号网络：反向传播和梯度下降。代理符号学习不处理数字权重，而是处理权重、损失和梯度的自然语言模拟。我们在标准基准和复杂的现实世界任务上进行了概念验证实验，并表明代理符号学习使语言代理能够在创建和部署后进行自我更新，从而产生“自我进化的代理”。