2024-06-19

Title: Reframing linguistic bootstrapping as joint inference using visually-grounded grammar induction models

Authors: Eva Portelance, Siva Reddy, Timothy J. O'Donnell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11977
Pdf URL: https://arxiv.org/pdf/2406.11977
Copy Paste: [[2406.11977]] Reframing linguistic bootstrapping as joint inference using visually-grounded grammar induction models(https://arxiv.org/abs/2406.11977)
Keywords: language model
Abstract: Semantic and syntactic bootstrapping posit that children use their prior knowledge of one linguistic domain, say syntactic relations, to help later acquire another, such as the meanings of new words. Empirical results supporting both theories may tempt us to believe that these are different learning strategies, where one may precede the other. Here, we argue that they are instead both contingent on a more general learning strategy for language acquisition: joint learning. Using a series of neural visually-grounded grammar induction models, we demonstrate that both syntactic and semantic bootstrapping effects are strongest when syntax and semantics are learnt simultaneously. Joint learning results in better grammar induction, realistic lexical category learning, and better interpretations of novel sentence and verb meanings. Joint learning makes language acquisition easier for learners by mutually constraining the hypothesis spaces for both syntax and semantics. Studying the dynamics of joint inference over many input sources and modalities represents an important new direction for language modeling and learning research in both cognitive sciences and AI, as it may help us explain how language can be acquired in more constrained learning settings.

Title: Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

Authors: Kenneth Li, Yiming Wang, Fernanda Viégas, Martin Wattenberg
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11978
Pdf URL: https://arxiv.org/pdf/2406.11978
Copy Paste: [[2406.11978]] Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner(https://arxiv.org/abs/2406.11978)
Keywords: language model, gpt, agent
Abstract: We present an approach called Dialogue Action Tokens (DAT) that adapts language model agents to plan goal-directed dialogues. The core idea is to treat each utterance as an action, thereby converting dialogues into games where existing approaches such as reinforcement learning can be applied. Specifically, we freeze a pretrained language model and train a small planner model that predicts a continuous action vector, used for controlled generation in each round. This design avoids the problem of language degradation under reward optimization. When evaluated on the Sotopia platform for social simulations, the DAT-steered LLaMA model surpasses GPT-4's performance. We also apply DAT to steer an attacker language model in a novel multi-turn red-teaming setting, revealing a potential new attack surface.

Title: FinTruthQA: A Benchmark Dataset for Evaluating the Quality of Financial Information Disclosure

Authors: Ziyue Xu, Peilin Zhou, Xinyu Shi, Jiageng Wu, Yikang Jiang, Bin Ke, Jie Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12009
Pdf URL: https://arxiv.org/pdf/2406.12009
Copy Paste: [[2406.12009]] FinTruthQA: A Benchmark Dataset for Evaluating the Quality of Financial Information Disclosure(https://arxiv.org/abs/2406.12009)
Keywords: language model, gpt
Abstract: Accurate and transparent financial information disclosure is crucial in the fields of accounting and finance, ensuring market efficiency and investor confidence. Among many information disclosure platforms, the Chinese stock exchanges' investor interactive platform provides a novel and interactive way for listed firms to disclose information of interest to investors through an online question-and-answer (Q&A) format. However, it is common for listed firms to respond to questions with limited or no substantive information, and automatically evaluating the quality of financial information disclosure on large amounts of Q&A pairs is challenging. This paper builds a benchmark, FinTruthQA, that can be used to evaluate advanced natural language processing (NLP) techniques for the automatic quality assessment of information disclosure in financial Q&A data. FinTruthQA comprises 6,000 real-world financial Q&A entries, and each Q&A was manually annotated based on four conceptual dimensions of accounting. We benchmarked various NLP techniques on FinTruthQA, including statistical machine learning models, pre-trained language models and their fine-tuned versions, as well as the large language model GPT-4. Experiments showed that existing NLP models have strong predictive ability for real question identification and question relevance tasks, but are suboptimal for answer relevance and answer readability tasks. By establishing this benchmark, we provide a robust foundation for the automatic evaluation of information disclosure, significantly enhancing the transparency and quality of financial reporting. FinTruthQA can be used by auditors, regulators, and financial analysts for real-time monitoring and data-driven decision-making, as well as by researchers for advanced studies in accounting and finance, ultimately fostering greater trust and efficiency in the financial markets.

Title: CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling

Authors: Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, Jackie Chi Kit Cheung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12018
Pdf URL: https://arxiv.org/pdf/2406.12018
Copy Paste: [[2406.12018]] CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling(https://arxiv.org/abs/2406.12018)
Keywords: language model, llm
Abstract: Long sequence modeling has gained broad interest as large language models (LLMs) continue to advance. Recent research has identified that a large portion of hidden states within the key-value caches of Transformer models can be discarded (also termed evicted) without affecting the perplexity performance in generating long sequences. However, we show that these methods, despite preserving perplexity performance, often drop information that is important for solving downstream tasks, a problem which we call information neglect. To address this issue, we introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states. In addition, we design a method for chunked sequence processing to further improve efficiency. Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget, while preserving language modeling perplexity.
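The eviction idea described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the attention weights from instruction tokens to cached positions are already available as a plain nested list (a hypothetical layout), and the function name `select_states_to_keep` and the `budget` parameter are invented for illustration. It only shows the general pattern of ranking cached states by how much attention the instruction pays to them and keeping the top ones within a fixed memory budget.

```python
from typing import List


def select_states_to_keep(
    instruction_attention: List[List[float]],
    budget: int,
) -> List[int]:
    """Return the cache positions to keep, in their original order.

    instruction_attention[i][j] is assumed to be the attention weight from
    instruction token i to cached position j (hypothetical input layout).
    """
    num_positions = len(instruction_attention[0])
    # Total attention each cached position receives from the instruction.
    scores = [
        sum(row[j] for row in instruction_attention)
        for j in range(num_positions)
    ]
    # Rank positions by score and keep the `budget` highest-scoring ones.
    ranked = sorted(range(num_positions), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])


# Toy example: two instruction tokens attending over four cached positions.
attention = [
    [0.10, 0.60, 0.10, 0.20],
    [0.05, 0.15, 0.10, 0.70],
]
kept = select_states_to_keep(attention, budget=2)
print(kept)
```

Everything outside the kept positions would be evicted. A real system would work per layer and per attention head on tensors rather than Python lists.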

Title: LiLiuM: eBay's Large Language Models for e-commerce

Authors: Christian Herold, Michael Kozielski, Leonid Ekimov, Pavel Petrushkov, Pierre-Yves Vandenbussche, Shahram Khadivi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12023
Pdf URL: https://arxiv.org/pdf/2406.12023
Copy Paste: [[2406.12023]] LiLiuM: eBay's Large Language Models for e-commerce(https://arxiv.org/abs/2406.12023)
Keywords: language model, llm
Abstract: We introduce the LiLiuM series of large language models (LLMs): 1B, 7B, and 13B parameter models developed 100% in-house to fit eBay's specific needs in the e-commerce domain. This gives eBay full control over all aspects of the models including license, data, vocabulary, and architecture. We expect these models to be used as a foundation for fine-tuning and instruction-tuning, eliminating dependencies on external models. The LiLiuM LLMs have been trained on 3 trillion tokens of multilingual text from the general and e-commerce domains. They perform similarly to the popular LLaMA-2 models on English natural language understanding (NLU) benchmarks. At the same time, we outperform LLaMA-2 on non-English NLU tasks, machine translation, and e-commerce-specific downstream tasks. As part of our data mixture, we utilize the newly released RedPajama-V2 dataset for training and share our insights regarding data filtering and deduplication. We also discuss in detail how to serialize structured data for use in autoregressive language modeling. We provide insights on the effects of including code and parallel machine translation data in pre-training. Furthermore, we develop our own tokenizer and model vocabulary, customized towards e-commerce. This way, we can achieve up to 34% speed-up in text generation on eBay-specific downstream tasks compared to LLaMA-2. Finally, in relation to LLM pretraining, we show that checkpoint averaging can further improve over the best individual model checkpoint.
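The checkpoint averaging mentioned at the end of the abstract above can be sketched as follows. The abstract does not say how the averaging was done, so this is an assumption-laden illustration: checkpoints are modeled as plain dictionaries mapping a parameter name to a flat list of floats, all checkpoints are weighted equally, and the name `average_checkpoints` is hypothetical.

```python
from typing import Dict, List

# Hypothetical checkpoint layout: parameter name -> flattened weights.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of checkpoints that share the same keys and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # Group the i-th weight of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(vals) / len(checkpoints) for vals in columns]
    return averaged


# Toy example with two tiny checkpoints.
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
avg = average_checkpoints(ckpts)
print(avg)
```

In a deep learning framework the same loop would run over tensors in the saved parameter dictionaries instead of Python lists.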

Title: Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models

Authors: Yuqing Wang, Yun Zhao, Sara Alessandra Keller, Anne de Hond, Marieke M. van Buchem, Malvika Pillai, Tina Hernandez-Boussard
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12033
Pdf URL: https://arxiv.org/pdf/2406.12033
Copy Paste: [[2406.12033]] Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models(https://arxiv.org/abs/2406.12033)
Keywords: language model, gpt, llm, prompt
Abstract: The advancement of large language models (LLMs) has demonstrated strong capabilities across various applications, including mental health analysis. However, existing studies have focused on predictive performance, leaving the critical issue of fairness underexplored, posing significant risks to vulnerable populations. Despite acknowledging potential biases, previous works have lacked thorough investigations into these biases and their impacts. To address this gap, we systematically evaluate biases across seven social factors (e.g., gender, age, religion) using ten LLMs with different prompting methods on eight diverse mental health datasets. Our results show that GPT-4 achieves the best overall balance in performance and fairness among LLMs, although it still lags behind domain-specific models like MentalRoBERTa in some cases. Additionally, our tailored fairness-aware prompts can effectively mitigate bias in mental health predictions, highlighting the great potential for fair analysis in this field.

Title: Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

Authors: Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, Alan Ritter
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12034
Pdf URL: https://arxiv.org/pdf/2406.12034
Copy Paste: [[2406.12034]] Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts(https://arxiv.org/abs/2406.12034)
Keywords: language model, llm
Abstract: We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert modules using self-generated synthetic data, each equipped with a shared base LLM and incorporating self-optimized routing. This allows for dynamic and capability-specific handling of various target tasks, enhancing overall capabilities, without extensive human-labeled data and added parameters. Our empirical results reveal that specializing LLMs may exhibit potential trade-offs in performances on non-specialized tasks. On the other hand, our Self-MoE demonstrates substantial improvements over the base LLM across diverse benchmarks such as knowledge, reasoning, math, and coding. It also consistently outperforms other methods, including instance merging and weight merging, while offering better flexibility and interpretability by design with semantic experts and routing. Our findings highlight the critical role of modularity and the potential of self-improvement in achieving efficient, scalable, and adaptable systems.

Title: MedCalc-Bench: Evaluating Large Language Models for Medical Calculations

Authors: Nikhil Khandekar, Qiao Jin, Guangzhi Xiong, Soren Dunn, Serina S Applebaum, Zain Anwar, Maame Sarfo-Gyamfi, Conrad W Safranek, Abid A Anwar, Andrew Zhang, Aidan Gilson, Maxwell B Singer, Amisha Dave, Andrew Taylor, Aidong Zhang, Qingyu Chen, Zhiyong Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12036
Pdf URL: https://arxiv.org/pdf/2406.12036
Copy Paste: [[2406.12036]] MedCalc-Bench: Evaluating Large Language Models for Medical Calculations(https://arxiv.org/abs/2406.12036)
Keywords: language model, llm
Abstract: As opposed to evaluating computation and logic-based reasoning, current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning. While such qualitative capabilities are vital to medical diagnosis, in real-world scenarios, doctors frequently use clinical calculators that follow quantitative equations and rule-based reasoning paradigms for evidence-based decision support. To this end, we propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs. MedCalc-Bench contains an evaluation set of over 1000 manually reviewed instances from 55 different medical calculation tasks. Each instance in MedCalc-Bench consists of a patient note, a question requesting to compute a specific medical value, a ground truth answer, and a step-by-step explanation showing how the answer is obtained. While our evaluation results show the potential of LLMs in this area, none of them are effective enough for clinical settings. Common issues include extracting the incorrect entities, not using the correct equation or rules for a calculation task, or incorrectly performing the arithmetic for the computation. We hope our study highlights the quantitative knowledge and reasoning gaps in LLMs within medical settings, encouraging future improvements of LLMs for various clinical calculation tasks.

Title: Soft Prompting for Unlearning in Large Language Models

Authors: Karuna Bhaila, Minh-Hao Van, Xintao Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12038
Pdf URL: https://arxiv.org/pdf/2406.12038
Copy Paste: [[2406.12038]] Soft Prompting for Unlearning in Large Language Models(https://arxiv.org/abs/2406.12038)
Keywords: language model, llm, prompt
Abstract: The widespread popularity of Large Language Models (LLMs), partly due to their unique ability to perform in-context learning, has also brought to light the importance of ethical and safety considerations when deploying these pre-trained models. In this work, we focus on investigating machine unlearning for LLMs motivated by data protection regulations. In contrast to the growing literature on fine-tuning methods to achieve unlearning, we focus on a comparatively lightweight alternative called soft prompting to realize the unlearning of a subset of training data. With losses designed to enforce forgetting as well as utility preservation, our framework Soft Prompting for Unlearning (SPUL) learns prompt tokens that can be appended to an arbitrary query to induce unlearning of specific examples at inference time without updating LLM parameters. We conduct a rigorous evaluation of the proposed method and our results indicate that SPUL can significantly improve the trade-off between utility and forgetting in the context of text classification with LLMs. We further validate our method using multiple LLMs to highlight the scalability of our framework and provide detailed insights into the choice of hyperparameters and the influence of the size of unlearning data. Our implementation is available at this https URL.

Title: Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning

Authors: Zhihan Zhang, Zhenwen Liang, Wenhao Yu, Dian Yu, Mengzhao Jia, Dong Yu, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12050
Pdf URL: https://arxiv.org/pdf/2406.12050
Copy Paste: [[2406.12050]] Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning(https://arxiv.org/abs/2406.12050)
Keywords: language model
Abstract: Supervised fine-tuning enhances the problem-solving abilities of language models across various mathematical reasoning tasks. To maximize such benefits, existing research focuses on broadening the training set with various data augmentation techniques, which is effective for standard single-round question-answering settings. Our work introduces a novel technique aimed at cultivating a deeper understanding of the training problems at hand, enhancing performance not only in standard settings but also in more complex scenarios that require reflective thinking. Specifically, we propose reflective augmentation, a method that embeds problem reflection into each training instance. It trains the model to consider alternative perspectives and engage with abstractions and analogies, thereby fostering a thorough comprehension through reflective reasoning. Extensive experiments validate the achievement of our aim, underscoring the unique advantages of our method and its complementary nature relative to existing augmentation techniques.

Title: UniGLM: Training One Unified Language Model for Text-Attributed Graphs

Authors: Yi Fang, Dongzhe Fan, Sirui Ding, Ninghao Liu, Qiaoyu Tan
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12052
Pdf URL: https://arxiv.org/pdf/2406.12052
Copy Paste: [[2406.12052]] UniGLM: Training One Unified Language Model for Text-Attributed Graphs(https://arxiv.org/abs/2406.12052)
Keywords: language model
Abstract: Representation learning on text-attributed graphs (TAGs), where nodes are represented by textual descriptions, is crucial for textual and relational knowledge systems and recommendation systems. Currently, state-of-the-art embedding methods for TAGs primarily focus on fine-tuning language models (e.g., BERT) using structure-aware training signals. While effective, these methods are tailored for individual TAGs and cannot generalize across various graph scenarios. Given the shared textual space, leveraging multiple TAGs for joint fine-tuning, aligning text and graph structure from different aspects, would be more beneficial. Motivated by this, we introduce a novel Unified Graph Language Model (UniGLM) framework, the first graph embedding model that generalizes well to both in-domain and cross-domain TAGs. Specifically, UniGLM is trained over multiple TAGs with different domains and scales using self-supervised contrastive learning. UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module that is devised to accelerate training by minimizing repetitive encoding calculations. Extensive empirical results across 9 benchmark TAGs demonstrate UniGLM's efficacy against leading embedding baselines in terms of generalization (various downstream tasks and backbones) and transfer learning (in and out of domain scenarios). The code is available at this https URL.

Title: InternalInspector $I^2$: Robust Confidence Estimation in LLMs through Internal States

Authors: Mohammad Beigi, Ying Shen, Runing Yang, Zihao Lin, Qifan Wang, Ankith Mohan, Jianfeng He, Ming Jin, Chang-Tien Lu, Lifu Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12053
Pdf URL: https://arxiv.org/pdf/2406.12053
Copy Paste: [[2406.12053]] InternalInspector $I^2$: Robust Confidence Estimation in LLMs through Internal States(https://arxiv.org/abs/2406.12053)
Keywords: language model, llm, hallucination
Abstract: Despite their vast capabilities, Large Language Models (LLMs) often struggle with generating reliable outputs, frequently producing high-confidence inaccuracies known as hallucinations. Addressing this challenge, our research introduces InternalInspector, a novel framework designed to enhance confidence estimation in LLMs by leveraging contrastive learning on internal states including attention states, feed-forward states, and activation states of all layers. Unlike existing methods that primarily focus on the final activation state, InternalInspector conducts a comprehensive analysis across all internal states of every layer to accurately identify both correct and incorrect prediction processes. Benchmarked against existing confidence estimation methods across various natural language understanding and generation tasks, including factual question answering, commonsense reasoning, and reading comprehension, InternalInspector achieves significantly higher accuracy in aligning the estimated confidence scores with the correctness of the LLM's predictions, as well as lower calibration error. Furthermore, InternalInspector excels at HaluEval, a hallucination detection benchmark, outperforming other internal-based confidence estimation methods in this task.
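The abstract above reports lower calibration error without naming a specific metric. As a point of reference only, the sketch below computes expected calibration error (ECE), one commonly used calibration metric. Treating ECE as the relevant metric is an assumption, as are the equal-width binning, the default of 10 bins, and the function name.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float],
    correct: List[bool],
    num_bins: int = 10,
) -> float:
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    n = len(confidences)
    bins: List[list] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


# Toy example: four predictions with their confidence scores and correctness.
ece = expected_calibration_error(
    confidences=[0.95, 0.95, 0.55, 0.55],
    correct=[True, True, True, False],
)
print(round(ece, 4))
```

A perfectly calibrated estimator would have accuracy equal to average confidence in every bin, giving an ECE of zero.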

Title: Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Authors: Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Leo Anthony Celi, Hugo Aerts, Thomas Hartvigsen, Danielle Bitterman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12066
Pdf URL: https://arxiv.org/pdf/2406.12066
Copy Paste: [[2406.12066]] Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks(https://arxiv.org/abs/2406.12066)
Keywords: language model, llm
Abstract: Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop ranging from 1-10%. Furthermore, we identify a potential source of this fragility as the contamination of test data in widely used pre-training datasets. All code is accessible at this https URL, and a HuggingFace leaderboard is available at this https URL.
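The brand-to-generic swap described in the abstract above can be pictured with a small sketch. It is illustrative only and not the RABBITS pipeline, which relies on physician expert annotations. The two-entry mapping (Advil to ibuprofen, Tylenol to acetaminophen) and the function name `swap_brand_for_generic` are assumptions made for this example.

```python
import re
from typing import Dict

# Hypothetical, hand-written mapping used only for this illustration.
BRAND_TO_GENERIC: Dict[str, str] = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
}


def swap_brand_for_generic(text: str, mapping: Dict[str, str] = BRAND_TO_GENERIC) -> str:
    """Replace each whole-word brand name in `text` with its generic name."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(brand) for brand in mapping) + r")\b"
    )
    return pattern.sub(lambda match: mapping[match.group(1)], text)


question = "A patient took Advil and Tylenol for a headache."
swapped = swap_brand_for_generic(question)
print(swapped)
```

Running a benchmark once on the original questions and once on the swapped ones would then expose any accuracy gap of the kind the abstract reports.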

Title: Satyrn: A Platform for Analytics Augmented Generation

Authors: Marko Sterbentz, Cameron Barrie, Shubham Shahi, Abhratanu Dutta, Donna Hooshmand, Harper Pack, Kristian J. Hammond
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12069
Pdf URL: https://arxiv.org/pdf/2406.12069
Copy Paste: [[2406.12069]] Satyrn: A Platform for Analytics Augmented Generation(https://arxiv.org/abs/2406.12069)
Keywords: language model, gpt, llm, retrieval augmented generation
Abstract: Large language models (LLMs) are capable of producing documents, and retrieval augmented generation (RAG) has shown itself to be a powerful method for improving accuracy without sacrificing fluency. However, not all information can be retrieved from text. We propose an approach that uses the analysis of structured data to generate fact sets that are used to guide generation in much the same way that retrieved documents are used in RAG. This analytics augmented generation (AAG) approach supports the ability to utilize standard analytic techniques to generate facts that are then converted to text and passed to an LLM. We present a neurosymbolic platform, Satyrn, that leverages AAG to produce accurate, fluent, and coherent reports grounded in large scale databases. In our experiments, we find that Satyrn generates reports in which over 86% of claims are accurate while maintaining high levels of fluency and coherence, even when using smaller language models such as Mistral-7B, as compared to GPT-4 Code Interpreter, in which just 57% of claims are accurate.

Title: COMMUNITY-CROSS-INSTRUCT: Unsupervised Instruction Generation for Aligning Large Language Models to Online Communities

Authors: Zihao He, Rebecca Dorn, Siyi Guo, Minh Duc Chu, Kristina Lerman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12074
Pdf URL: https://arxiv.org/pdf/2406.12074
Copy Paste: [[2406.12074]] COMMUNITY-CROSS-INSTRUCT: Unsupervised Instruction Generation for Aligning Large Language Models to Online Communities(https://arxiv.org/abs/2406.12074)
Keywords: language model, llm
Abstract: Social scientists use surveys to probe the opinions and beliefs of populations, but these methods are slow, costly, and prone to biases. Recent advances in large language models (LLMs) enable creating computational representations or "digital twins" of populations that generate human-like responses mimicking the population's language, styles, and attitudes. We introduce Community-Cross-Instruct, an unsupervised framework for aligning LLMs to online communities to elicit their beliefs. Given a corpus of a community's online discussions, Community-Cross-Instruct automatically generates instruction-output pairs by an advanced LLM to (1) finetune a foundational LLM to faithfully represent that community, and (2) evaluate the alignment of the finetuned model to the community. We demonstrate the method's utility in accurately representing political and fitness communities on Reddit. Unlike prior methods requiring human-authored instructions, Community-Cross-Instruct generates instructions in a fully unsupervised manner, enhancing scalability and generalization across domains. This work enables cost-effective and automated surveying of diverse online communities.

Title: When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives

Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Wenlin Yao, Hassan Foroosh, Dong Yu, Fei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12084
Pdf URL: https://arxiv.org/pdf/2406.12084
Copy Paste: [[2406.12084]] When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives(https://arxiv.org/abs/2406.12084)
Keywords: gpt, llm, hallucination
Abstract: Reasoning is most powerful when an LLM accurately aggregates relevant information. We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives. To succeed at this task, an LLM must infer points from actions, identify related entities, attribute points accurately to players and teams, and compile key statistics to draw conclusions. We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives. By synthesizing data, we can rigorously evaluate LLMs' reasoning capabilities under complex scenarios with varying narrative lengths and density of information. Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns. Open-source models like Llama-3 further suffer from significant score hallucinations. Finally, the effectiveness of reasoning is influenced by narrative complexity, information density, and domain-specific terms, highlighting the challenges in analytical reasoning tasks.
摘要：当 LLM 准确地汇总相关信息时，推理能力最强。我们通过要求 LLM 分析体育叙事来研究信息聚合在推理中的关键作用。要成功完成这项任务，LLM 必须从动作中推断得分，识别相关实体，准确地将得分归因于球员和球队，并汇编关键统计数据以得出结论。我们使用真实的 NBA 篮球数据进行了全面的实验，并提出了一种合成比赛叙事的新方法 SportsGen。通过合成数据，我们可以在叙事长度和信息密度各异的复杂场景下严格评估 LLM 的推理能力。我们的研究结果表明，包括 GPT-4o 在内的大多数模型由于频繁的得分模式，往往无法准确汇总篮球比分。像 Llama-3 这样的开源模型还存在严重的得分幻觉。最后，推理的有效性受到叙事复杂性、信息密度和领域特定术语的影响，这凸显了分析推理任务的挑战。

Title: Who's asking? User personas and the mechanics of latent misalignment

Authors: Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12094
Pdf URL: https://arxiv.org/pdf/2406.12094
Copy Paste: [[2406.12094]] Who's asking? User personas and the mechanics of latent misalignment(https://arxiv.org/abs/2406.12094)
Keywords: prompt
Abstract: Despite investments in improving model safety, studies show that misaligned capabilities remain latent in safety-tuned models. In this work, we shed light on the mechanics of this phenomenon. First, we show that even when model generations are safe, harmful content can persist in hidden representations and can be extracted by decoding from earlier layers. Then, we show that whether the model divulges such content depends significantly on its perception of who it is talking to, which we refer to as user persona. In fact, we find manipulating user persona to be even more effective for eliciting harmful content than direct attempts to control model refusal. We study both natural language prompting and activation steering as control methods and show that activation steering is significantly more effective at bypassing safety filters. We investigate why certain personas break model safeguards and find that they enable the model to form more charitable interpretations of otherwise dangerous queries. Finally, we show we can predict a persona's effect on refusal given only the geometry of its steering vector.
摘要：尽管在提高模型安全性方面投入了大量资金，但研究表明，在安全调整模型中，错位的能力仍然存在。在这项工作中，我们阐明了这种现象的机制。首先，我们表明，即使模型生成是安全的，有害内容仍会以隐藏的表示形式存在，并可以通过从早期层解码来提取。然后，我们表明，模型是否泄露此类内容在很大程度上取决于它对与谁交谈的感知，我们将其称为用户角色。事实上，我们发现操纵用户角色比直接尝试控制模型拒绝更能有效地引出有害内容。我们研究了自然语言提示和激活转向作为控制方法，并表明激活转向在绕过安全过滤器方面明显更有效。我们调查了某些角色为何破坏模型保护措施，并发现它们使模型能够对原本危险的查询形成更宽容的解释。最后，我们表明，我们只需给出角色的转向矢量的几何形状，就可以预测角色对拒绝的影响。

Title: End-to-end Text-to-SQL Generation within an Analytics Insight Engine

Authors: Karime Maamari, Amine Mhedhbi
Subjects: cs.CL, cs.AI, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12104
Pdf URL: https://arxiv.org/pdf/2406.12104
Copy Paste: [[2406.12104]] End-to-end Text-to-SQL Generation within an Analytics Insight Engine(https://arxiv.org/abs/2406.12104)
Keywords: language model
Abstract: Recent advancements in Text-to-SQL have pushed database management systems towards greater democratization of data access. Today's language models are at the core of these advancements. They enable impressive Text-to-SQL generation as experienced in the development of Distyl AI's Analytics Insight Engine. Its early deployment with enterprise customers has highlighted three core challenges. First, data analysts expect support with authoring SQL queries of very high complexity. Second, requests are ad-hoc and, as such, require low latency. Finally, generation requires an understanding of domain-specific terminology and practices. The design and implementation of our Text-to-SQL generation pipeline, powered by large language models, tackles these challenges. The core tenets of our approach rely on external knowledge that we extract in a pre-processing phase, on retrieving the appropriate external knowledge at query generation time, and on decomposing SQL query generation following a hierarchical CTE-based structure. Finally, an adaptation framework leverages feedback to update the external knowledge, in turn improving query generation over time. We give an overview of our end-to-end approach and highlight the operators generating SQL during inference.
摘要：文本到 SQL 的最新进展推动了数据库管理系统朝着数据访问更加民主化的方向发展。当今的语言模型是这些进步的核心。它们实现了令人印象深刻的文本到 SQL 生成，就像 Distyl AI 的 Analytics Insight Engine 开发中所经历的那样。它在企业客户中的早期部署凸显了三个核心挑战。首先，数据分析师希望能够编写非常复杂的 SQL 查询。其次，请求是临时的，因此需要低延迟。最后，生成需要了解特定领域的术语和实践。我们的文本到 SQL 生成管道的设计和实现由大型语言模型提供支持，解决了这些挑战。我们方法的核心原则依赖于我们在预处理阶段提取的外部知识、在查询生成时检索适当的外部知识以及按照基于 CTE 的分层结构分解 SQL 查询生成。最后，适应框架利用反馈来更新外部知识，从而随着时间的推移改进查询生成。我们概述了我们的端到端方法并重点介绍了推理过程中生成 SQL 的运算符。

Title: Can LLMs Learn Macroeconomic Narratives from Social Media?

Authors: Almog Gueta, Amir Feder, Zorik Gekhman, Ariel Goldstein, Roi Reichart
Subjects: cs.CL, cs.CE
Abstract URL: https://arxiv.org/abs/2406.12109
Pdf URL: https://arxiv.org/pdf/2406.12109
Copy Paste: [[2406.12109]] Can LLMs Learn Macroeconomic Narratives from Social Media?(https://arxiv.org/abs/2406.12109)
Keywords: language model, llm
Abstract: This study empirically tests the Narrative Economics hypothesis, which posits that narratives (ideas that are spread virally and affect public beliefs) can influence economic fluctuations. We introduce two curated datasets containing posts from X (formerly Twitter) which capture economy-related narratives (Data will be shared upon paper acceptance). Employing Natural Language Processing (NLP) methods, we extract and summarize narratives from the tweets. We test their predictive power for macroeconomic forecasting by incorporating the tweets' or the extracted narratives' representations in downstream financial prediction tasks. Our work highlights the challenges in improving macroeconomic models with narrative data, paving the way for the research community to realistically address this important challenge. From a scientific perspective, our investigation offers valuable insights and NLP tools for narrative extraction and summarization using Large Language Models (LLMs), contributing to future research on the role of narratives in economics.
摘要：本研究实证检验了“叙事经济学”假设，该假设认为叙事（病毒式传播并影响公众信念的想法）会影响经济波动。我们引入了两个精选数据集，其中包含来自 X（以前称为 Twitter）的帖子，这些帖子捕获了与经济相关的叙事（数据将在论文被接受后共享）。采用自然语言处理 (NLP) 方法，我们从推文中提取和总结叙事。我们通过将推文或提取的叙事的表示合并到下游金融预测任务中来测试它们对“宏观经济”预测的预测能力。我们的工作强调了使用叙事数据改进宏观经济模型的挑战，为研究界切实应对这一重要挑战铺平了道路。从科学的角度来看，我们的研究为使用大型语言模型 (LLM) 进行叙事提取和总结提供了有价值的见解和 NLP 工具，为未来研究叙事在经济学中的作用做出了贡献。

Title: Enhancing Text Classification through LLM-Driven Active Learning and Human Annotation

Authors: Hamidreza Rouzegar, Masoud Makrehchi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12114
Pdf URL: https://arxiv.org/pdf/2406.12114
Copy Paste: [[2406.12114]] Enhancing Text Classification through LLM-Driven Active Learning and Human Annotation(https://arxiv.org/abs/2406.12114)
Keywords: language model, gpt, llm
Abstract: In the context of text classification, the financial burden of annotation exercises for creating training data is a critical issue. Active learning techniques, particularly those rooted in uncertainty sampling, offer a cost-effective solution by pinpointing the most instructive samples for manual annotation. Similarly, Large Language Models (LLMs) such as GPT-3.5 provide an alternative for automated annotation but come with concerns regarding their reliability. This study introduces a novel methodology that integrates human annotators and LLMs within an Active Learning framework. We conducted evaluations on three public datasets: IMDB for sentiment analysis, a Fake News dataset for authenticity discernment, and a Movie Genres dataset for multi-label classification. The proposed framework integrates human annotation with the output of LLMs, depending on the model uncertainty levels. This strategy achieves an optimal balance between cost efficiency and classification performance. The empirical results show a substantial decrease in the costs associated with data annotation while either maintaining or improving model accuracy.
摘要：在文本分类的背景下，创建训练数据的注释练习的财务负担是一个关键问题。主动学习技术，特别是那些植根于不确定性抽样的技术，通过精确定位最有指导意义的样本进行手动注释，提供了一种经济有效的解决方案。同样，GPT-3.5 等大型语言模型 (LLM) 为自动注释提供了一种替代方案，但也带来了对其可靠性的担忧。本研究介绍了一种将人工注释者和 LLM 集成在主动学习框架中的新方法。我们对三个公共数据集进行了评估。IMDB 用于情绪分析，假新闻数据集用于真实性辨别，电影类型数据集用于多标签分类。所提出的框架将人工注释与 LLM 的输出相结合，具体取决于模型的不确定性水平。该策略在成本效率和分类性能之间实现了最佳平衡。实证结果表明，在保持或提高模型准确性的同时，与数据注释相关的成本大幅下降。
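
The uncertainty-based routing described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how uncertainty is measured or in which direction samples are routed, so the entropy measure, the `threshold` value, and the rule "uncertain samples go to humans, confident ones to the LLM" are all hypothetical choices, and `route_samples` is an invented helper name.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route_samples(predictions, threshold):
    """Split sample ids by classifier uncertainty.

    predictions: dict mapping sample id -> list of class probabilities.
    threshold:   hypothetical entropy cut-off (not taken from the paper).
    Assumption:  uncertain samples go to human annotators,
                 confident ones to the LLM annotator.
    """
    to_human, to_llm = [], []
    for sample_id, probs in predictions.items():
        if entropy(probs) >= threshold:
            to_human.append(sample_id)
        else:
            to_llm.append(sample_id)
    return to_human, to_llm


# Toy classifier outputs for three unlabeled samples.
preds = {"a": [0.5, 0.5], "b": [0.99, 0.01], "c": [0.6, 0.4]}
to_human, to_llm = route_samples(preds, threshold=0.5)
print(to_human, to_llm)
```

In a full active learning loop, the newly labeled samples would be added to the training set and the classifier retrained before the next routing round.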

Title: Decoding the Narratives: Analyzing Personal Drug Experiences Shared on Reddit

Authors: Layla Bouzoubaa, Elham Aghakhani, Max Song, Minh Trinh, Rezvaneh Rezapour
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12117
Pdf URL: https://arxiv.org/pdf/2406.12117
Copy Paste: [[2406.12117]] Decoding the Narratives: Analyzing Personal Drug Experiences Shared on Reddit(https://arxiv.org/abs/2406.12117)
Keywords: gpt, prompt
Abstract: Online communities such as drug-related subreddits serve as safe spaces for people who use drugs (PWUD), fostering discussions on substance use experiences, harm reduction, and addiction recovery. Users' shared narratives on these forums provide insights into the likelihood of developing a substance use disorder (SUD) and recovery potential. Our study aims to develop a multi-level, multi-label classification model to analyze online user-generated texts about substance use experiences. For this purpose, we first introduce a novel taxonomy to assess the nature of posts, including their intended connections (Inquisition or Disclosure), subjects (e.g., Recovery, Dependency), and specific objectives (e.g., Relapse, Quality, Safety). Using various multi-label classification algorithms on a set of annotated data, we show that GPT-4, when prompted with instructions, definitions, and examples, outperformed all other models. We apply this model to label an additional 1,000 posts and analyze the categories of linguistic expression used within posts in each class. Our analysis shows that topics such as Safety, Combination of Substances, and Mental Health see more disclosure, while discussions about physiological Effects focus on harm reduction. Our work enriches the understanding of PWUD's experiences and informs the broader knowledge base on SUD and drug use.
摘要：诸如与毒品相关的 subreddits 之类的在线社区为吸毒者 (PWUD) 提供了安全的空间，促进了有关药物使用体验、减少伤害和成瘾康复的讨论。用户在这些论坛上分享的叙述提供了有关发展药物使用障碍 (SUD) 的可能性和康复潜力的见解。我们的研究旨在开发一个多层次、多标签分类模型来分析有关药物使用体验的在线用户生成文本。为此，我们首先引入了一种新颖的分类法来评估帖子的性质，包括它们的预期联系（调查或披露）、主题（例如，恢复、依赖）和特定目标（例如，复发、质量、安全）。我们对一组带注释的数据使用各种多标签分类算法，结果表明，当提示说明、定义和示例时，GPT-4 的表现优于所有其他模型。我们应用此模型标记另外 1,000 个帖子，并分析每个类别帖子中使用的语言表达类别。我们的分析显示，安全、药物组合和心理健康等主题的披露较多，而关于生理影响的讨论则侧重于减少伤害。我们的工作丰富了对 PWUD 经历的理解，并丰富了有关 SUD 和药物使用的更广泛知识库。

Title: AI "News" Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian

Authors: Giovanni Puccetti, Anna Rogers, Chiara Alzetta, Felice Dell'Orletta, Andrea Esuli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12128
Pdf URL: https://arxiv.org/pdf/2406.12128
Copy Paste: [[2406.12128]] AI "News" Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian(https://arxiv.org/abs/2406.12128)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are increasingly used as "content farm" models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real "content farm". We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem.
摘要：大型语言模型 (LLM) 越来越多地被用作“内容农场”模型 (CFM)，以生成可以作为真实新闻文章的合成文本。即使对于没有高质量单语 LLM 的语言，这种情况也已经发生了。我们表明，对主要以英语训练的 Llama (v1) 进行微调，只需 40K 篇意大利新闻文章，就足以生成类似新闻的文本，而意大利母语人士很难将其识别为合成文本。我们研究了三种 LLM 和三种检测合成文本的方法（对数似然、DetectGPT 和监督分类），发现它们的表现都优于人类评分者，但它们在现实世界中都不切实际（需要访问标记似然信息或大量 CFM 文本数据集）。我们还探讨了创建代理 CFM 的可能性：在与真实“内容农场”使用的数据集类似的数据集上进行微调的 LLM。我们发现，即使少量的微调数据也足以创建一个成功的检测器，但我们需要知道使用哪个基础 LLM，这是一个重大挑战。我们的结果表明，目前没有实用的方法来检测“野生”的合成新闻类文本，而生成它们太容易了。我们强调了对这个问题进行更多 NLP 研究的紧迫性。

Title: LLMs Are Prone to Fallacies in Causal Inference

Authors: Nitish Joshi, Abulhair Saparov, Yixin Wang, He He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12158
Pdf URL: https://arxiv.org/pdf/2406.12158
Copy Paste: [[2406.12158]] LLMs Are Prone to Fallacies in Causal Inference(https://arxiv.org/abs/2406.12158)
Keywords: llm, prompt
Abstract: Recent work shows that causal facts can be effectively extracted from LLMs through prompting, facilitating the creation of causal graphs for causal inference tasks. However, it is unclear if this success is limited to explicitly-mentioned causal facts in the pretraining data which the model can memorize. Thus, this work investigates: Can LLMs infer causal relations from other relational data in text? To disentangle the role of memorized causal facts vs inferred causal relations, we finetune LLMs on synthetic data containing temporal, spatial and counterfactual relations, and measure whether the LLM can then infer causal relations. We find that: (a) LLMs are susceptible to inferring causal relations from the order of two entity mentions in text (e.g. X mentioned before Y implies X causes Y); (b) if the order is randomized, LLMs still suffer from the post hoc fallacy, i.e. X occurs before Y (temporal relation) implies X causes Y. We also find that while LLMs can correctly deduce the absence of causal relations from temporal and spatial relations, they have difficulty inferring causal relations from counterfactuals, questioning their understanding of causality.
摘要：最近的研究表明，通过提示可以从 LLM 中有效地提取因果事实，从而有助于为因果推理任务创建因果图。然而，目前尚不清楚这种成功是否仅限于模型可以记住的预训练数据中明确提到的因果事实。因此，这项工作调查：LLM 能否从文本中的其他关系数据推断因果关系？为了理清记忆的因果事实与推断的因果关系的作用，我们在包含时间、空间和反事实关系的合成数据上对 LLM 进行微调，并测量 LLM 是否能够推断因果关系。我们发现：(a) LLM 容易从文本中两个实体提及的顺序推断因果关系（例如，在 Y 之前提到的 X 意味着 X 导致 Y）； (b) 如果顺序是随机的，LLM 仍然会遭受事后谬误，即 X 发生在 Y 之前（时间关系）意味着 X 导致 Y。我们还发现，虽然 LLM 可以从时间和空间关系中正确地推断出因果关系的缺失，但它们很难从反事实中推断出因果关系，这使它们对因果关系的理解受到质疑。

Title: Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance

Authors: Anna C. Marbut, John W. Chandler, Travis J. Wheeler
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12159
Pdf URL: https://arxiv.org/pdf/2406.12159
Copy Paste: [[2406.12159]] Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance(https://arxiv.org/abs/2406.12159)
Keywords: language model
Abstract: It is generally thought that transformer-based large language models benefit from pre-training by learning generic linguistic knowledge that can be focused on a specific task during fine-tuning. However, we propose that much of the benefit from pre-training may be captured by geometric characteristics of the latent space representations, divorced from any specific linguistic knowledge. In this work we explore the relationship between GLUE benchmarking task performance and a variety of measures applied to the latent space resulting from BERT-type contextual language models. We find that there is a strong linear relationship between a measure of quantized cell density and average GLUE performance and that these measures may be predictive of otherwise surprising GLUE performance for several non-standard BERT-type models from the literature. These results may be suggestive of a strategy for decreasing pre-training requirements, wherein model initialization can be informed by the geometric characteristics of the model's latent space.
摘要：人们普遍认为，基于 Transformer 的大型语言模型可以通过学习通用语言知识从预训练中受益，这些知识可以在微调期间专注于特定任务。然而，我们认为，预训练的大部分好处可能与任何特定的语言知识无关，而是通过潜在空间表示的几何特征来获取的。在这项工作中，我们探索了 GLUE 基准测试任务性能与 BERT 型上下文语言模型产生的潜在空间的各种测量之间的关系。我们发现，量化单元密度的度量和平均 GLUE 性能之间存在很强的线性关系，并且这些度量可以预测文献中几种非标准 BERT 型模型原本令人惊讶的 GLUE 性能。这些结果可能暗示了一种降低预训练要求的策略，其中模型初始化可以根据模型潜在空间的几何特征来决定。

Title: Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models

Authors: Lulu Zhao, Weihao Zeng, Xiaofeng Shi, Hua Zhou, Donglin Hao, Yonghua Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Recently, both closed-source LLMs and open-source communities have made significant strides, outperforming humans in various general domains. However, their performance in specific professional fields such as medicine, especially within the open-source community, remains suboptimal due to the complexity of medical knowledge. We propose Aquila-Med, a bilingual medical LLM based on Aquila, addressing these challenges through continued pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). We construct a large-scale Chinese and English medical dataset for continued pre-training and a high-quality SFT dataset, covering extensive medical specialties. Additionally, we develop a high-quality Direct Preference Optimization (DPO) dataset for further alignment. Aquila-Med achieves notable results across single-turn, multi-turn dialogues, and medical multiple-choice questions, demonstrating the effectiveness of our approach. We open-source the datasets and the entire training process, contributing valuable resources to the research community. Our models and datasets will be released at this https URL.
摘要：最近，闭源 LLM 和开源社区都取得了重大进展，在各个通用领域都超越了人类。然而，由于医学知识的复杂性，它们在医学等特定专业领域的表现仍然不理想，尤其是在开源社区中。我们提出了基于 Aquila 的双语医学 LLM Aquila-Med，通过持续预训练、监督微调 (SFT) 和从人类反馈中强化学习 (RLHF) 来解决这些挑战。我们构建了一个大规模的中英文医学数据集用于持续预训练和一个高质量的 SFT 数据集，涵盖了广泛的医学专业。此外，我们还开发了一个高质量的直接偏好优化 (DPO) 数据集以进行进一步的对齐。Aquila-Med 在单轮、多轮对话和医学多项选择题中取得了显著的成果，证明了我们方法的有效性。我们开源数据集和整个训练过程，为研究界贡献了宝贵的资源。我们的模型和数据集将在此 https URL 上发布。

Title: Debate as Optimization: Adaptive Conformal Prediction and Diverse Retrieval for Event Extraction

Authors: Sijia Wang, Lifu Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12197
Pdf URL: https://arxiv.org/pdf/2406.12197
Copy Paste: [[2406.12197]] Debate as Optimization: Adaptive Conformal Prediction and Diverse Retrieval for Event Extraction(https://arxiv.org/abs/2406.12197)
Keywords: language model, llm, agent
Abstract: We propose a multi-agent debate as optimization (DAO) system for event extraction, where the primary objective is to iteratively refine large language model (LLM) outputs through debating without parameter tuning. In DAO, we introduce two novel modules: the Diverse-RAG (DRAG) module and the Adaptive Conformal Prediction (AdaCP) module. DRAG systematically retrieves supporting information that best fits the debate discussion, while AdaCP enhances the accuracy and reliability of event extraction by effectively rejecting less promising answers. Experimental results demonstrate a significant reduction in the performance gap between supervised approaches and tuning-free LLM-based methods by 18.1% and 17.8% on ACE05 and 17.9% and 15.2% on CASIE for event detection and argument extraction respectively.
摘要：我们提出了一个用于事件提取的多智能体辩论优化 (DAO) 系统，其主要目标是通过辩论迭代优化大型语言模型 (LLM) 输出，而无需进行参数调整。在 DAO 中，我们引入了两个新模块：Diverse-RAG (DRAG) 模块和自适应共形预测 (AdaCP) 模块。DRAG 系统地检索最适合辩论讨论的支持信息，而 AdaCP 通过有效拒绝不太有希望的答案来提高事件提取的准确性和可靠性。实验结果表明，在事件检测和参数提取方面，监督方法和无需调整的基于 LLM 的方法之间的性能差距显著缩小，在 ACE05 上分别缩小了 18.1% 和 17.8%，在 CASIE 上分别缩小了 17.9% 和 15.2%。
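
The idea of rejecting less promising answers with conformal prediction can be illustrated with a minimal sketch of standard split conformal prediction. This is not the paper's AdaCP module, whose adaptive details the abstract does not give; the nonconformity scores, the event-type labels, and the function names below are all hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal cut-off from held-out nonconformity scores.

    calibration_scores: scores of the correct answers on a calibration set
                        (lower means more plausible).
    alpha:              target miscoverage rate, e.g. 0.25.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: reject nothing.
        return math.inf
    return sorted(calibration_scores)[k - 1]


def filter_answers(candidates, threshold):
    """Keep candidate answers whose nonconformity score is within the cut-off."""
    return [answer for answer, score in candidates.items() if score <= threshold]


# Hypothetical calibration scores and candidate event types.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
threshold = conformal_threshold(calibration, alpha=0.25)
kept = filter_answers({"Attack": 0.2, "Transport": 0.9, "Meet": 0.6}, threshold)
print(threshold, kept)
```

Under the usual exchangeability assumption, the retained set contains the correct answer with probability at least 1 - alpha.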

Title: Knowledge Fusion By Evolving Weights of Language Models

Authors: Guodong Du, Jing Li, Hanting Liu, Runhua Jiang, Shuyang Yu, Yifei Guo, Sim Kuan Goh, Ho-Kin Tang
Subjects: cs.CL, cs.AI, cs.CV, cs.NE
Abstract URL: https://arxiv.org/abs/2406.12208
Pdf URL: https://arxiv.org/pdf/2406.12208
Copy Paste: [[2406.12208]] Knowledge Fusion By Evolving Weights of Language Models(https://arxiv.org/abs/2406.12208)
Keywords: language model
Abstract: Fine-tuning pre-trained language models, particularly large language models, demands extensive computing resources and can result in varying performance outcomes across different domains and datasets. This paper examines the approach of integrating multiple models from diverse training scenarios into a unified model. This unified model excels across various data domains and exhibits the ability to generalize well on out-of-domain data. We propose a knowledge fusion method named Evolver, inspired by evolutionary algorithms, which does not need further training or additional training data. Specifically, our method involves aggregating the weights of different language models into a population and subsequently generating offspring models through mutation and crossover operations. These offspring models are then evaluated against their parents, allowing for the preservation of those models that show enhanced performance on development datasets. Importantly, our model evolving strategy can be seamlessly integrated with existing model merging frameworks, offering a versatile tool for model enhancement. Experimental results on mainstream language models (i.e., encoder-only, decoder-only, encoder-decoder) reveal that Evolver outperforms previous state-of-the-art models by large margins. The code is publicly available at this https URL.
摘要：对预训练语言模型（尤其是大型语言模型）进行微调需要大量计算资源，并且可能导致不同领域和数据集的性能结果不同。本文研究了将来自不同训练场景的多个模型集成到统一模型中的方法。这种统一模型在各种数据领域中都表现出色，并表现出对域外数据进行良好推广的能力。我们提出了一种名为 Evolver 的知识融合方法，该方法受到进化算法的启发，不需要进一步训练或额外的训练数据。具体而言，我们的方法涉及将不同语言模型的权重聚合到一个群体中，然后通过突变和交叉操作生成后代模型。然后根据其父级对这些后代模型进行评估，从而保留那些在开发数据集上表现出增强性能的模型。重要的是，我们的模型演化策略可以与现有的模型合并框架无缝集成，为模型增强提供了一种多功能工具。主流语言模型（即仅编码器、仅解码器、编码器-解码器）的实验结果表明，Evolver 的表现远远优于之前最先进的模型。该代码可在{此 https URL} 上公开获取。
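
The mutate, cross over, and select loop described in this abstract might look roughly like the sketch below. It assumes a differential-evolution-style mutation on flattened weight vectors; the abstract does not specify the exact operators, so the scale factor `f`, the crossover rate `cr`, the toy `score` function, and the name `evolve` are all illustrative assumptions and not the released Evolver code.

```python
import random


def evolve(population, score, generations=20, f=0.5, cr=0.5, seed=0):
    """Evolve a population of flat weight vectors without gradient training.

    population: list of at least four weight vectors (lists of floats).
    score:      callable returning a higher-is-better development-set score.
    f, cr:      hypothetical mutation scale and crossover rate.
    """
    rng = random.Random(seed)
    pop = [list(w) for w in population]  # copy so the inputs stay untouched
    for _ in range(generations):
        for i, parent in enumerate(pop):
            others = [w for j, w in enumerate(pop) if j != i]
            a, b, c = rng.sample(others, 3)
            # Mutation: shift one model by the scaled difference of two others.
            mutant = [x + f * (y - z) for x, y, z in zip(a, b, c)]
            # Crossover: take each coordinate from the mutant or the parent.
            child = [m if rng.random() < cr else p for m, p in zip(mutant, parent)]
            # Selection: the child replaces the parent only if it scores no worse.
            if score(child) >= score(parent):
                pop[i] = child
    return max(pop, key=score)


# Toy stand-in for a development-set metric: closeness to a target vector.
target = [1.0, -2.0, 0.5]


def score(w):
    return -sum((x - t) ** 2 for x, t in zip(w, target))


initial = [[0.0, 0.0, 0.0], [2.0, -1.0, 1.0], [1.5, -3.0, 0.0], [0.5, -2.5, 1.5]]
best = evolve(initial, score)
print(score(best))
```

Because a child only ever replaces a parent that it matches or beats, the best score in the population never decreases from one generation to the next.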

Title: LLM-Oracle Machines

Authors: Jie Wang
Subjects: cs.CL, cs.AI, cs.FL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] LLM-Oracle Machines(https://arxiv.org/abs/)
Keywords: language model, llm, hallucination
Abstract: Contemporary AI applications leverage large language models (LLMs) for their knowledge and inference capabilities in natural language processing tasks. This approach aligns with the concept of oracle Turing machines (OTMs). To capture the essence of these computations, including those desired but not yet in practice, we extend the notion of OTMs by employing a cluster of LLMs as the oracle. We present four variants: basic, augmented, fault-avoidance, and $\epsilon$-fault. The first two variants are commonly observed, whereas the latter two are specifically designed to ensure reliable outcomes by addressing LLM hallucinations, biases, and inconsistencies.
摘要：当代人工智能应用利用大型语言模型 (LLM) 来获取自然语言处理任务中的知识和推理能力。这种方法与 Oracle Turing 机 (OTM) 的概念一致。为了捕捉这些计算的本质，包括那些需要但尚未付诸实践的计算，我们通过使用一组 LLM 作为 Oracle 来扩展 OTM 的概念。我们提出了四种变体：基本、增强、故障避免和 $\epsilon$ 故障。前两种变体很常见，而后两种变体专门设计用于通过解决 LLM 幻觉、偏见和不一致来确保可靠的结果。

Title: Is persona enough for personality? Using ChatGPT to reconstruct an agent's latent personality from simple descriptions

Authors: Yongyi Ji, Zhisheng Tang, Mayank Kejriwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12216
Pdf URL: https://arxiv.org/pdf/2406.12216
Copy Paste: [[2406.12216]] Is persona enough for personality? Using ChatGPT to reconstruct an agent's latent personality from simple descriptions(https://arxiv.org/abs/2406.12216)
Keywords: language model, gpt, llm, chat, agent
Abstract: Personality, a fundamental aspect of human cognition, contains a range of traits that influence behaviors, thoughts, and emotions. This paper explores the capabilities of large language models (LLMs) in reconstructing these complex cognitive attributes based only on simple descriptions containing socio-demographic and personality type information. Utilizing the HEXACO personality framework, our study examines the consistency of LLMs in recovering and predicting underlying (latent) personality dimensions from simple descriptions. Our experiments reveal a significant degree of consistency in personality reconstruction, although some inconsistencies and biases, such as a tendency to default to positive traits in the absence of explicit information, are also observed. Additionally, socio-demographic factors like age and number of children were found to influence the reconstructed personality dimensions. These findings have implications for building sophisticated agent-based simulacra using LLMs and highlight the need for further research on robust personality generation in LLMs.
摘要：人格是人类认知的一个基本方面，包含一系列影响行为、思想和情感的特征。本文探讨了大型语言模型 (LLM) 仅基于包含社会人口和人格类型信息的简单描述重建这些复杂认知属性的能力。利用 HEXACO 人格框架，我们的研究考察了 LLM 在从简单描述中恢复和预测潜在 (潜在) 人格维度方面的一致性。我们的实验表明，人格重建具有相当大的一致性，尽管也观察到一些不一致和偏见，例如在没有明确信息的情况下倾向于默认积极特征。此外，还发现年龄和子女数量等社会人口因素会影响重建的人格维度。这些发现对于使用 LLM 构建复杂的基于代理的模拟具有重要意义，并强调需要进一步研究 LLM 中稳健的人格生成。

Title: On-Policy Fine-grained Knowledge Feedback for Hallucination Mitigation

Authors: Xueru Wen, Xinyu Lu, Xinyan Guan, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12221
Pdf URL: https://arxiv.org/pdf/2406.12221
Copy Paste: [[2406.12221]] On-Policy Fine-grained Knowledge Feedback for Hallucination Mitigation(https://arxiv.org/abs/2406.12221)
Keywords: language model, llm, hallucination
Abstract: Hallucination occurs when large language models (LLMs) exhibit behavior that deviates from the boundaries of their knowledge during the response generation process. Previous learning-based methods focus on detecting knowledge boundaries and finetuning models with instance-level feedback, but they suffer from inaccurate signals due to off-policy data sampling and coarse-grained feedback. In this paper, we introduce Reinforcement Learning for Hallucination (RLFH), a fine-grained feedback-based online reinforcement learning method for hallucination mitigation. Unlike previous learning-based methods, RLFH enables LLMs to explore the boundaries of their internal knowledge and provide on-policy, fine-grained feedback on these explorations. To construct fine-grained feedback for learning reliable generation behavior, RLFH decomposes the outcomes of large models into atomic facts, provides statement-level evaluation signals, and traces back the signals to the tokens of the original responses. Finally, RLFH adopts the online reinforcement algorithm with these token-level rewards to adjust model behavior for hallucination mitigation. For effective on-policy optimization, RLFH also introduces an LLM-based fact assessment framework to verify the truthfulness and helpfulness of atomic facts without human intervention. Experiments on HotpotQA, SQuADv2, and Biography benchmarks demonstrate that RLFH can balance their usage of internal knowledge during the generation process to eliminate the hallucination behavior of LLMs.
摘要：当大型语言模型 (LLM) 在响应生成过程中表现出偏离其知识边界的行为时，就会出现幻觉。以前的基于学习的方法侧重于检测知识边界和使用实例级反馈微调模型，但由于离策略数据采样和粗粒度反馈，它们会受到不准确信号的影响。在本文中，我们引入了 \textit{\b{R}einforcement \b{L}earning \b{f}or \b{H}allucination} (RLFH)，这是一种基于细粒度反馈的在线强化学习方法，用于缓解幻觉。与以前基于学习的方法不同，RLFH 使 LLM 能够探索其内部知识的边界，并对这些探索提供基于策略的细粒度反馈。为了构建细粒度反馈以学习可靠的生成行为，RLFH 将大型模型的结果分解为原子事实，提供语句级评估信号，并将信号追溯到原始响应的标记。最后，RLFH 采用在线强化算法和这些 token 级奖励来调整模型行为以缓解幻觉。为了有效地进行策略优化，RLFH 还引入了基于 LLM 的事实评估框架，以在无需人工干预的情况下验证原子事实的真实性和有用性。在 HotpotQA、SQuADv2 和 Biography 基准上的实验表明，RLFH 可以在生成过程中平衡内部知识的使用，以消除 LLM 的幻觉行为。

Title: ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations

Authors: Yunze Xiao, Yujia Hu, Kenny Tsu Wei Choo, Roy Ka-wei Lee
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2406.12223
Pdf URL: https://arxiv.org/pdf/2406.12223
Copy Paste: [[2406.12223]] ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations(https://arxiv.org/abs/2406.12223)
Keywords: language model, llm
Abstract: Detecting hate speech and offensive language is essential for maintaining a safe and respectful digital environment. This study examines the limitations of state-of-the-art large language models (LLMs) in identifying offensive content within systematically perturbed data, with a focus on Chinese, a language particularly susceptible to such perturbations. We introduce ToxiCloakCN, an enhanced dataset derived from ToxiCN, augmented with homophonic substitutions and emoji transformations, to test the robustness of LLMs against these cloaking perturbations. Our findings reveal that existing models significantly underperform in detecting offensive content when these perturbations are applied. We provide an in-depth analysis of how different types of offensive content are affected by these perturbations and explore the alignment between human and model explanations of offensiveness. Our work highlights the urgent need for more advanced techniques in offensive language detection to combat the evolving tactics used to evade detection mechanisms.
摘要：检测仇恨言论和攻击性语言对于维护安全和尊重的数字环境至关重要。本研究考察了最先进的大型语言模型 (LLM) 在识别系统性扰动数据中的攻击性内容方面的局限性，重点研究了中文，中文是一种特别容易受到此类扰动的语言。我们引入了 \textsf{ToxiCloakCN}，这是一个从 ToxiCN 派生的增强数据集，增强了同音替换和表情符号转换，以测试 LLM 对这些隐藏扰动的鲁棒性。我们的研究结果表明，当应用这些扰动时，现有模型在检测攻击性内容方面的表现明显不佳。我们深入分析了不同类型的攻击性内容如何受到这些扰动的影响，并探索了人类和模型对攻击性的解释之间的一致性。我们的工作强调了迫切需要更先进的攻击性语言检测技术来对抗用于逃避检测机制的不断发展的策略。

Title: MCSD: An Efficient Language Model with Diverse Fusion

Authors: Hua Yang, Duohai Li, Shiman Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12230
Pdf URL: https://arxiv.org/pdf/2406.12230
Copy Paste: [[2406.12230]] MCSD: An Efficient Language Model with Diverse Fusion(https://arxiv.org/abs/2406.12230)
Keywords: language model
Abstract: Transformers excel in Natural Language Processing (NLP) due to their prowess in capturing long-term dependencies but suffer from exponential resource consumption with increasing sequence lengths. To address these challenges, we propose the MCSD model, an efficient language model with linear scaling and fast inference speed. The MCSD model leverages diverse feature fusion, primarily through the multi-channel slope and decay (MCSD) block, to robustly represent features. This block comprises slope and decay sections that extract features across diverse temporal receptive fields, facilitating capture of both local and global information. In addition, the MCSD block conducts element-wise fusion of diverse features to further enhance the delicate feature extraction capability. For inference, we formulate the inference process into a recurrent representation, slashing space complexity to $O(1)$ and time complexity to $O(N)$. Our experiments show that MCSD attains higher throughput and lower GPU memory consumption compared to Transformers, while maintaining comparable performance to larger-scale language learning models on benchmark tests. These attributes position MCSD as a promising base for edge deployment and embodied intelligence.
摘要：Transformer 在自然语言处理 (NLP) 中表现出色，因为它们能够捕捉长期依赖关系，但随着序列长度的增加，其资源消耗呈指数级增长。为了应对这些挑战，我们提出了 MCSD 模型，这是一种具有线性缩放和快速推理速度的高效语言模型。MCSD 模型利用多样化的特征融合，主要通过多通道斜率和衰减 (MCSD) 块来稳健地表示特征。该块包括斜率和衰减部分，可跨不同的时间感受野提取特征，从而有助于捕获局部和全局信息。此外，MCSD 块对各种特征进行元素级融合，以进一步增强精细的特征提取能力。对于推理，我们将推理过程公式化为递归表示，分别将空间复杂度削减至 $O(1)$ 和时间复杂度削减至 $O(N)$。我们的实验表明，与 Transformer 相比，MCSD 实现了更高的吞吐量和更低的 GPU 内存消耗，同时在基准测试中保持与更大规模语言学习模型相当的性能。这些属性使 MCSD 成为边缘部署和体现智能的有希望的基础。
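
Note: the abstract's claim that a recurrent formulation brings inference down to $O(1)$ space can be illustrated with a generic exponential-decay recurrence. The sketch below is not the MCSD block itself (the abstract does not give its equations); the function names and the scalar decay factor are assumptions chosen for illustration. It shows that a decayed sum over the whole history can be computed either by re-reading all past inputs at every step or by carrying a single running state.

```python
import math


def decay_parallel(xs, decay):
    """Reference form: at each step t, re-read the full history x_0..x_t.

    y_t = sum_{s <= t} decay**(t - s) * x_s
    Needs the whole input history in memory.
    """
    return [
        sum(decay ** (t - s) * xs[s] for s in range(t + 1))
        for t in range(len(xs))
    ]


def decay_recurrent(xs, decay):
    """Recurrent form: carry one scalar state, so memory is constant.

    h_t = decay * h_{t-1} + x_t, with h_{-1} = 0, gives the same y_t.
    """
    state = 0.0
    outputs = []
    for x in xs:
        state = decay * state + x  # O(1) state update per token
        outputs.append(state)
    return outputs
```

Both functions return the same sequence; only the recurrent one avoids storing past inputs, which is the general mechanism behind constant-memory recurrent inference.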

Title: PFID: Privacy First Inference Delegation Framework for LLMs

Authors: Haoyan Yang, Zhitao Li, Yong Zhang, Jianzong Wang, Ning Cheng, Ming Li, Jing Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12238
Pdf URL: https://arxiv.org/pdf/2406.12238
Copy Paste: [[2406.12238]] PFID: Privacy First Inference Delegation Framework for LLMs(https://arxiv.org/abs/2406.12238)
Keywords: llm, prompt
Abstract: This paper introduces PFID, a novel privacy-preservation framework for LLMs that addresses critical privacy concerns by localizing user data through model sharding and singular value decomposition. When users interact with LLM systems, their prompts can be exposed to eavesdroppers, inside or outside the LLM system provider, who are interested in collecting users' input. In this work, we propose a framework that camouflages user input to alleviate these privacy issues. The framework places model shards on both the client and the public server, and sends compressed hidden states, instead of prompts, to and from servers. Clients hold back information that can re-privatize the hidden states, so that overall system performance is comparable to that of traditional LLM services. The framework is designed to be communication-efficient: computation can be delegated to the local client, which lightens the server's computation burden. We conduct extensive experiments on machine translation tasks to verify our framework's performance.
摘要：本文介绍了一种用于 LLM 的新型隐私保护框架 PFID，该框架通过模型分片和奇异值分解将用户数据本地化，从而解决了关键的隐私问题。当用户与 LLM 系统交互时，他们的提示可能会暴露给 LLM 系统提供商内部或外部有意收集用户输入的窃听者。在这项工作中，我们提出了一个框架来伪装用户输入，以缓解隐私问题。我们的框架建议将模型分片放在客户端和公共服务器上，我们向服务器发送压缩的隐藏状态而不是提示。客户端保留了可以重新私有化隐藏状态的信息，因此整体系统性能与传统 LLM 服务相当。我们的框架设计为通信高效，计算可以委托给本地客户端，从而减轻服务器的计算负担。我们对机器翻译任务进行了广泛的实验，以验证我们框架的性能。

Title: Mitigate Negative Transfer with Similarity Heuristic Lifelong Prompt Tuning

Authors: Chenyuan Wu, Gangwei Jiang, Defu Lian
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12251
Pdf URL: https://arxiv.org/pdf/2406.12251
Copy Paste: [[2406.12251]] Mitigate Negative Transfer with Similarity Heuristic Lifelong Prompt Tuning(https://arxiv.org/abs/2406.12251)
Keywords: prompt
Abstract: Lifelong prompt tuning has significantly advanced parameter-efficient lifelong learning with its efficiency and minimal storage demands on various tasks. Our empirical studies, however, highlight certain transferability constraints in current methodologies: a universal algorithm that guarantees consistent positive transfer across all tasks is currently unattainable, especially when dealing with dissimilar tasks that may engender negative transfer. Identifying the misalignment between algorithm selection and task specificity as the primary cause of negative transfer, we present the Similarity Heuristic Lifelong Prompt Tuning (SHLPT) framework. This strategy partitions tasks into two distinct subsets by harnessing a learnable similarity metric, thereby facilitating fruitful transfer from tasks regardless of their similarity or dissimilarity. Additionally, SHLPT incorporates a parameter pool to combat catastrophic forgetting effectively. Our experiments show that SHLPT outperforms state-of-the-art techniques on lifelong learning benchmarks and demonstrates robustness against negative transfer in diverse task sequences.
摘要：终身快速调整凭借其效率和对各种任务的最小存储需求，显著推进了参数高效的终身学习。然而，我们的实证研究强调了当前方法中的某些可转移性限制：目前无法实现保证在所有任务中一致正迁移的通用算法，尤其是在处理可能产生负迁移的不同任务时。我们将算法选择和任务特异性之间的不一致确定为负迁移的主要原因，提出了相似性启发式终身快速调整 (SHLPT) 框架。这种创新策略通过利用可学习的相似性度量将任务划分为两个不同的子集，从而促进任务的有效迁移，无论它们相似或不同。此外，SHLPT 结合了一个参数池来有效对抗灾难性遗忘。我们的实验表明，SHLPT 在终身学习基准中的表现优于最先进的技术，并在不同的任务序列中表现出对负迁移的鲁棒性。

Title: A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning

Authors: Lijie Hu, Liang Liu, Shu Yang, Xin Chen, Hongru Xiao, Mengdi Li, Pan Zhou, Muhammad Asif Ali, Di Wang
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12255
Pdf URL: https://arxiv.org/pdf/2406.12255
Copy Paste: [[2406.12255]] A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning(https://arxiv.org/abs/2406.12255)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) holds a significant place in augmenting the reasoning performance of large language models (LLMs). While some studies focus on improving CoT accuracy through methods like retrieval enhancement, a rigorous explanation for why CoT achieves such success is still lacking. In this paper, we analyze CoT methods under two different settings by asking the following questions: (1) For zero-shot CoT, why does prompting the model with "let's think step by step" significantly impact its outputs? (2) For few-shot CoT, why does providing examples before questioning the model substantially improve its reasoning ability? To answer these questions, we conduct a top-down explainable analysis from the Hopfieldian view and propose a Read-and-Control approach for controlling the accuracy of CoT. Through extensive experiments on seven datasets for three different tasks, we demonstrate that our framework can decipher the inner workings of CoT, localize reasoning errors, and provide control that steers the model toward the correct reasoning path.
摘要：思维链 (CoT) 在增强大型语言模型 (LLM) 的推理性能方面占有重要地位。虽然一些研究专注于通过检索增强等方法提高 CoT 准确性，但 CoT 取得如此成功的严格解释仍不清楚。在本文中，我们通过提出以下问题分析了两种不同设置下的 CoT 方法：(1) 对于零样本 CoT，为什么用“让我们一步一步思考”提示模型会显著影响其输出？(2) 对于少样本 CoT，为什么在质疑模型之前提供示例可以显着提高其推理能力？为了回答这些问题，我们从 Hopfieldian 的角度进行了自上而下的可解释分析，并提出了一种控制 CoT 准确性的读取和控制方法。通过对三个不同任务的七个数据集进行大量实验，我们证明了我们的框架可以解密 CoT 的内部工作原理，提供推理错误定位和控制以得出正确的推理路径。

Title: Defending Against Social Engineering Attacks in the Age of LLMs

Authors: Lin Ai, Tharindu Kumarage, Amrita Bhattacharjee, Zizhou Liu, Zheng Hui, Michael Davinroy, James Cook, Laura Cassani, Kirill Trapeznikov, Matthias Kirchner, Arslan Basharat, Anthony Hoogs, Joshua Garland, Huan Liu, Julia Hirschberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12263
Pdf URL: https://arxiv.org/pdf/2406.12263
Copy Paste: [[2406.12263]] Defending Against Social Engineering Attacks in the Age of LLMs(https://arxiv.org/abs/2406.12263)
Keywords: language model, llm, chat
Abstract: The proliferation of Large Language Models (LLMs) poses challenges in detecting and mitigating digital deception, as these models can emulate human conversational patterns and facilitate chat-based social engineering (CSE) attacks. This study investigates the dual capabilities of LLMs as both facilitators and defenders against CSE threats. We develop a novel dataset, SEConvo, simulating CSE scenarios in academic and recruitment contexts, and designed to examine how LLMs can be exploited in these situations. Our findings reveal that, while off-the-shelf LLMs generate high-quality CSE content, their detection capabilities are suboptimal, leading to increased operational costs for defense. In response, we propose ConvoSentinel, a modular defense pipeline that improves detection at both the message and the conversation levels, offering enhanced adaptability and cost-effectiveness. The retrieval-augmented module in ConvoSentinel identifies malicious intent by comparing messages to a database of similar conversations, enhancing CSE detection at all stages. Our study highlights the need for advanced strategies to leverage LLMs in cybersecurity.
摘要：大型语言模型 (LLM) 的激增对检测和减轻数字欺骗提出了挑战，因为这些模型可以模拟人类的对话模式并促进基于聊天的社会工程 (CSE) 攻击。本研究调查了 LLM 作为 CSE 威胁的促进者和防御者的双重能力。我们开发了一个新数据集 SEConvo，模拟学术和招聘环境中的 CSE 场景，旨在研究如何在这些情况下利用 LLM。我们的研究结果表明，虽然现成的 LLM 可以生成高质量的 CSE 内容，但它们的检测能力并不理想，导致防御的运营成本增加。作为回应，我们提出了 ConvoSentinel，这是一种模块化防御管道，可在消息和对话级别改进检测，提供增强的适应性和成本效益。ConvoSentinel 中的检索增强模块通过将消息与类似对话的数据库进行比较来识别恶意意图，从而增强所有阶段的 CSE 检测。我们的研究强调了在网络安全中利用 LLM 的高级策略的必要性。

Title: Towards a Client-Centered Assessment of LLM Therapists by Client Simulation

Authors: Jiashuo Wang, Yang Xiao, Yanran Li, Changhe Song, Chunpu Xu, Chenhao Tan, Wenjie Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12266
Pdf URL: https://arxiv.org/pdf/2406.12266
Copy Paste: [[2406.12266]] Towards a Client-Centered Assessment of LLM Therapists by Client Simulation(https://arxiv.org/abs/2406.12266)
Keywords: gpt, llm
Abstract: Although there is a growing belief that LLMs can be used as therapists, exploration of LLMs' capabilities and inefficacy, particularly from the client's perspective, is limited. This work focuses on a client-centered assessment of LLM therapists with the involvement of simulated clients, a standard approach in clinical medical education. However, there are two challenges when applying the approach to assess LLM therapists at scale. Ethically, asking humans to frequently mimic clients and exposing them to potentially harmful LLM outputs can be risky and unsafe. Technically, it can be difficult to consistently compare the performances of different LLM therapists interacting with the same client. To this end, we adopt LLMs to simulate clients and propose ClientCAST, a client-centered approach to assessing LLM therapists by client simulation. Specifically, the simulated client is used to interact with LLM therapists and complete questionnaires related to the interaction. Based on the questionnaire results, we assess LLM therapists from three client-centered aspects: session outcome, therapeutic alliance, and self-reported feelings. We conduct experiments to examine the reliability of ClientCAST and use it to evaluate LLM therapists implemented by Claude-3, GPT-3.5, LLaMA3-70B, and Mixtral 8x7B. Code is released at this https URL.
摘要：尽管越来越多的人相信 LLM 可以用作治疗师，但对 LLM 的能力和无效性的探索，尤其是从客户的角度来看，是有限的。这项工作侧重于以客户为中心的 LLM 治疗师评估，其中模拟客户参与其中，这是临床医学教育中的标准方法。然而，在大规模应用该方法评估 LLM 治疗师时存在两个挑战。从伦理上讲，要求人类频繁模仿客户并让他们接触可能有害的 LLM 输出可能是有风险和不安全的。从技术上讲，很难始终如一地比较与同一客户互动的不同 LLM 治疗师的表现。为此，我们采用 LLM 来模拟客户，并提出了 ClientCAST，这是一种通过客户模拟来评估 LLM 治疗师的客户中心方法。具体来说，模拟客户用于与 LLM 治疗师互动并完成与互动相关的问卷。根据问卷结果，我们从三个以客户为中心的方面评估 LLM 治疗师：会话结果、治疗联盟和自我报告的感受。我们进行实验来检验 ClientCAST 的可靠性，并用它来评估由 Claude-3、GPT-3.5、LLaMA3-70B 和 Mixtral 8*7B 实施的 LLM 治疗师。代码发布在此 https URL 上。

Title: Unveiling Implicit Table Knowledge with Question-Then-Pinpoint Reasoner for Insightful Table Summarization

Authors: Kwangwook Seo, Jinyoung Yeo, Dongha Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12269
Pdf URL: https://arxiv.org/pdf/2406.12269
Copy Paste: [[2406.12269]] Unveiling Implicit Table Knowledge with Question-Then-Pinpoint Reasoner for Insightful Table Summarization(https://arxiv.org/abs/2406.12269)
Keywords: language model, llm
Abstract: Implicit knowledge hidden within the explicit table cells, such as data insights, is the key to generating a high-quality table summary. However, unveiling such implicit knowledge is a non-trivial task. Due to the complex nature of structured tables, it is challenging even for large language models (LLMs) to mine the implicit knowledge in an insightful and faithful manner. To address this challenge, we propose a novel table reasoning framework Question-then-Pinpoint. Our work focuses on building a plug-and-play table reasoner that can self-question the insightful knowledge and answer it by faithfully pinpointing evidence on the table to provide explainable guidance for the summarizer. To train a reliable reasoner, we collect table knowledge by guiding a teacher LLM to follow the coarse-to-fine reasoning paths and refine it through two quality enhancement strategies to selectively distill the high-quality knowledge to the reasoner. Extensive experiments on two table summarization datasets, including our newly proposed InsTaSumm, validate the general effectiveness of our framework.
摘要：隐藏在显式表格单元格中的隐性知识（例如数据洞察）是生成高质量表格摘要的关键。然而，揭示这种隐性知识并非易事。由于结构化表格的复杂性，即使对于大型语言模型 (LLM) 来说，以富有洞察力和忠实的方式挖掘隐性知识也具有挑战性。为了应对这一挑战，我们提出了一种新颖的表格推理框架“先问后找”。我们的工作重点是构建一个即插即用的表格推理器，它可以自我质疑富有洞察力的知识并通过忠实地在表格上找到证据来回答它，从而为总结者提供可解释的指导。为了训练一个可靠的推理器，我们通过指导教师 LLM 遵循从粗到细的推理路径来收集表格知识，并通过两种质量增强策略对其进行细化，以有选择地将高质量知识提炼给推理器。在两个表格摘要数据集（包括我们新提出的 InsTaSumm）上进行的大量实验验证了我们框架的总体有效性。

Title: SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Authors: Somnath Banerjee, Soham Tripathy, Sayan Layek, Shanu Kumar, Animesh Mukherjee, Rima Hazra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12274
Pdf URL: https://arxiv.org/pdf/2406.12274
Copy Paste: [[2406.12274]] SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models(https://arxiv.org/abs/2406.12274)
Keywords: language model
Abstract: Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms, increasing the likelihood of generating unsafe content. In addition, incorporating new knowledge through editing techniques to language models can further compromise safety. To address these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries. SafeInfer comprises two phases: the safety amplification phase, which employs safe demonstration examples to adjust the model's hidden states and increase the likelihood of safer outputs, and the safety-guided decoding phase, which influences token selection based on safety-optimized distributions, ensuring the generated content complies with ethical guidelines. Further, we present HarmEval, a novel benchmark for extensive safety evaluations, designed to address potential misuse scenarios in accordance with the policies of leading AI tech giants.
摘要：安全对齐的语言模型通常表现出脆弱和不平衡的安全机制，增加了生成不安全内容的可能性。此外，通过编辑技术将新知识融入语言模型可能会进一步危及安全性。为了解决这些问题，我们提出了 SafeInfer，这是一种上下文自适应的解码时间安全对齐策略，用于生成对用户查询的安全响应。SafeInfer 包括两个阶段：安全放大阶段，它使用安全的演示示例来调整模型的隐藏状态并增加更安全输出的可能性；以及安全引导解码阶段，它根据安全优化的分布影响令牌选择，确保生成的内容符合道德准则。此外，我们提出了 HarmEval，这是一种用于广泛安全评估的新基准，旨在根据领先的 AI 技术巨头的政策解决潜在的滥用情况。

Title: What Matters in Learning Facts in Language Models? Multifaceted Knowledge Probing with Diverse Multi-Prompt Datasets

Authors: Xin Zhao, Naoki Yoshinaga, Daisuke Oba
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12277
Pdf URL: https://arxiv.org/pdf/2406.12277
Copy Paste: [[2406.12277]] What Matters in Learning Facts in Language Models? Multifaceted Knowledge Probing with Diverse Multi-Prompt Datasets(https://arxiv.org/abs/2406.12277)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) face issues in handling factual knowledge, making it vital to evaluate their true ability to understand facts. In this study, we introduce knowledge probing frameworks, BELIEF(-ICL), to evaluate the knowledge understanding ability of not only encoder-based PLMs but also decoder-based PLMs from diverse perspectives. BELIEFs utilize a multi-prompt dataset to evaluate PLMs' accuracy, consistency, and reliability in factual knowledge understanding. To provide a more reliable evaluation with BELIEFs, we semi-automatically create MyriadLAMA, which has more diverse prompts than existing datasets. We validate the effectiveness of BELIEFs in correctly and comprehensively evaluating PLMs' factual understanding ability through extensive evaluations. We further investigate key factors in learning facts in LLMs, and reveal the limitations of prompt-based knowledge probing. The dataset is anonymously publicized.
摘要：大型语言模型 (LLM) 在处理事实知识时面临问题，因此评估其理解事实的真实能力至关重要。在本研究中，我们引入了知识探测框架 BELIEF(-ICL)，从不同角度评估基于编码器的 PLM 和基于解码器的 PLM 的知识理解能力。BELIEF 利用多提示数据集来评估 PLM 在事实知识理解方面的准确性、一致性和可靠性。为了使用 BELIEF 提供更可靠的评估，我们半自动地创建了 MyriadLAMA，它比现有数据集具有更多样化的提示。我们通过广泛的评估验证了 BELIEF 在正确全面地评估 PLM 的事实理解能力方面的有效性。我们进一步研究了 LLM 中学习事实的关键因素，并揭示了基于提示的知识探测的局限性。该数据集是匿名公开的。

Title: Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

Authors: Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12295
Pdf URL: https://arxiv.org/pdf/2406.12295
Copy Paste: [[2406.12295]] Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding(https://arxiv.org/abs/2406.12295)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) demonstrate impressive performance in diverse applications, yet they face significant drawbacks, including high inference latency, expensive training cost, and generation of hallucination. Collaborative decoding between large and small language models (SLMs) offers a novel approach to address these challenges. Inspired by dual-process cognitive theory, we integrate these methods into a unified framework termed Fast and Slow Generating (FS-GEN). This paper explores several techniques within the FS-GEN framework, including speculative decoding, contrastive decoding, and emulator or proxy fine-tuning. We provide a comprehensive analysis of these methodologies, offering insights into their similarities and differences under this framework. Our study delves into the differential knowledge capabilities of LLMs versus SLMs through the FS-GEN lens, revealing that fewer than 20% of collaborative interactions are required across various methods. These interactions adhere to a scaling law relative to the parameter ratios, thereby facilitating predictable collaboration. Furthermore, we investigate the specific positions where collaboration is most effective from an uncertainty perspective, yielding novel insights that could refine FS-GEN methods. Our findings reveal that the essential difference between models of different sizes lies in the uncertainty of the next token prediction, where interventions by larger models are most needed to assist the smaller ones. Code for Reproduction: this https URL
摘要：大型语言模型 (LLM) 在各种应用中表现出色，但它们也面临重大缺点，包括高推理延迟、昂贵的训练成本和产生幻觉。大型和小型语言模型 (SLM) 之间的协作解码提供了一种应对这些挑战的新方法。受双过程认知理论的启发，我们将这些方法整合到一个统一的框架中，称为快速和慢速生成 (FS-GEN)。本文探讨了 FS-GEN 框架内的几种技术，包括推测解码、对比解码和模拟器或代理微调。我们对这些方法进行了全面的分析，深入了解了它们在该框架下的相似之处和不同之处。我们的研究通过 FS-GEN 的视角深入研究了 LLM 与 SLM 的差异知识能力，发现各种方法之间需要的协作交互不到 20%。这些交互遵循相对于参数比率的缩放定律，从而促进可预测的协作。此外，我们从不确定性的角度研究了协作最有效的具体位置，从而产生了可以改进 FS-GEN 方法的新见解。我们的研究结果表明，不同规模的模型之间的本质区别在于下一个 token 预测的不确定性，此时最需要较大模型的干预来协助较小模型。复现代码：此 https URL
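
Note: speculative decoding, one of the collaborative techniques the abstract groups under FS-GEN, can be sketched in a few lines for the greedy case. The code below is a toy illustration, not the paper's implementation: `small` and `large` are hypothetical stand-ins for a small and a large model's greedy next-token functions, and the verification loop calls `large` once per position where a real system would use one batched forward pass. The small model drafts a few tokens, the large model checks them, and the large model's token replaces the first mismatch. The counter records how often the large model had to intervene.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]


def greedy_speculative_decode(
    small: NextToken,
    large: NextToken,
    prompt: List[str],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[str], int]:
    """Return (generated tokens, number of large-model interventions)."""
    tokens = list(prompt)
    generated = 0
    interventions = 0
    while generated < max_new_tokens:
        # 1) The small model drafts up to draft_len tokens autoregressively.
        draft: List[str] = []
        for _ in range(min(draft_len, max_new_tokens - generated)):
            draft.append(small(tokens + draft))
        # 2) The large model verifies the draft position by position.
        for i, proposed in enumerate(draft):
            target = large(tokens + draft[:i])
            if target != proposed:
                # Keep the agreed prefix, then take the large model's token.
                tokens.extend(draft[:i])
                tokens.append(target)
                generated += i + 1
                interventions += 1
                break
        else:
            # Every drafted token matched the large model's greedy choice.
            tokens.extend(draft)
            generated += len(draft)
    return tokens[len(prompt):], interventions


# Toy models (hypothetical): the large model recites TEXT; the small model
# agrees with it everywhere except on one word.
TEXT = "the quick brown fox jumps over the lazy dog".split()


def large(context: List[str]) -> str:
    return TEXT[len(context) % len(TEXT)]


def small(context: List[str]) -> str:
    token = large(context)
    return "cat" if token == "fox" else token
```

With greedy verification the output matches what the large model would have produced on its own; the intervention count is one simple way to measure how much collaboration a given model pair needs.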

Title: Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?

Authors: Seungbin Yang, ChaeHun Park, Taehee Kim, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12307
Pdf URL: https://arxiv.org/pdf/2406.12307
Copy Paste: [[2406.12307]] Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?(https://arxiv.org/abs/2406.12307)
Keywords: language model, llm
Abstract: Recent advancements in integrating large language models (LLMs) with tools have allowed the models to interact with real-world environments. However, these tool-augmented LLMs often encounter incomplete scenarios when users provide partial information or the necessary tools are unavailable. Recognizing and managing such scenarios is crucial for LLMs to ensure their reliability, but this problem remains understudied. This study examines whether LLMs can identify incomplete conditions and appropriately determine when to refrain from using tools. To this end, we construct a dataset by manipulating instances from two existing datasets, removing either necessary tools or information essential for tool invocation. We confirm that most LLMs struggle to identify the additional information required to use specific tools and to recognize the absence of appropriate tools. Our research can contribute to advancing reliable LLMs by addressing scenarios that commonly arise during interactions between humans and LLMs.
摘要：大型语言模型 (LLM) 与工具集成的最新进展使模型能够与真实环境交互。然而，当用户提供部分信息或必要的工具不可用时，这些工具增强的 LLM 经常会遇到不完整的场景。识别和管理此类场景对于 LLM 确保其可靠性至关重要，但这种探索仍未得到充分研究。本研究探讨了 LLM 是否可以识别不完整条件并适当地确定何时停止使用工具。为此，我们通过删除必要的工具或工具调用的基本信息来处理来自两个数据集的实例，从而处理数据集。我们确认，大多数 LLM 都面临着识别使用特定工具所需的额外信息以及缺乏适当工具的挑战。我们的研究可以通过解决人与 LLM 交互过程中经常出现的场景来促进可靠的 LLM 的发展。

Title: PRePair: Pointwise Reasoning Enhance Pairwise Evaluating for Robust Instruction-Following Assessments

Authors: Hawon Jeong, ChaeHun Park, Jimin Hong, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12319
Pdf URL: https://arxiv.org/pdf/2406.12319
Copy Paste: [[2406.12319]] PRePair: Pointwise Reasoning Enhance Pairwise Evaluating for Robust Instruction-Following Assessments(https://arxiv.org/abs/2406.12319)
Keywords: language model, llm
Abstract: Pairwise evaluation using large language models (LLMs) is widely used for evaluating natural language generation (NLG) tasks. However, the reliability of LLMs is often compromised by biases, such as favoring verbosity and an authoritative tone. In this study, we compare two LLM-based evaluation approaches, pointwise and pairwise. Our findings demonstrate that pointwise evaluators exhibit more robustness against undesirable preferences. Further analysis reveals that pairwise evaluators can accurately identify the shortcomings of low-quality outputs even when their judgment is incorrect. These results indicate that LLMs are more severely influenced by their bias in a pairwise evaluation setup. To mitigate this, we propose a hybrid method that integrates pointwise reasoning into pairwise evaluation. Experimental results show that our method enhances the robustness of pairwise evaluators against adversarial samples while preserving accuracy on normal samples.
摘要：使用大型语言模型 (LLM) 的成对评估被广泛用于评估自然语言生成 (NLG) 任务。然而，LLM 的可靠性往往会受到偏见的影响，例如偏爱冗长和权威语气。在研究中，我们重点比较了两种基于 LLM 的评估方法，即逐点和成对。我们的研究结果表明，逐点评估器对不良偏好表现出更强的鲁棒性。进一步的分析表明，即使成对评估者的判断不正确，他们也能准确识别出低质量输出的缺点。这些结果表明，在成对评估设置中，LLM 受其偏见的影响更为严重。为了缓解这种情况，我们提出了一种将逐点推理集成到成对评估中的混合方法。实验结果表明，我们的方法增强了成对评估器对对抗性样本的鲁棒性，同时保持了对正常样本的准确性。

Title: SNAP: Unlearning Selective Knowledge in Large Language Models with Negative Instructions

Authors: Minseok Choi, Daniel Rim, Dohyun Lee, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12329
Pdf URL: https://arxiv.org/pdf/2406.12329
Copy Paste: [[2406.12329]] SNAP: Unlearning Selective Knowledge in Large Language Models with Negative Instructions(https://arxiv.org/abs/2406.12329)
Keywords: language model, gpt, llm, chat
Abstract: Instruction-following large language models (LLMs), such as ChatGPT, have become increasingly popular with the general audience, many of whom are incorporating them into their daily routines. However, these LLMs inadvertently disclose personal or copyrighted information, which calls for a machine unlearning method to remove selective knowledge. Previous attempts sought to forget the link between the target information and its associated entities, but this instead led to undesirable responses about the target, compromising the end-user experience. In this work, we propose SNAP, an innovative framework designed to selectively unlearn information by 1) training an LLM with negative instructions to generate obliterated responses, 2) augmenting hard positives to retain the original LLM performance, and 3) applying the novel Wasserstein regularization to ensure adequate deviation from the initial weights of the LLM. We evaluate our framework on various NLP benchmarks and demonstrate that our approach retains the original LLM capabilities while successfully unlearning the specified information.
摘要：遵循指令的大型语言模型 (LLM)（例如 ChatGPT）越来越受到普通用户的欢迎，许多人将其融入日常生活中。然而，这些 LLM 会无意中泄露个人或版权信息，这就需要一种机器学习方法来消除选择性知识。以前的尝试试图忘记目标信息与其相关实体之间的联系，但这反而会导致对目标产生不良反应，从而损害最终用户体验。在这项工作中，我们提出了 SNAP，这是一个创新的框架，旨在通过以下方式选择性地取消学习信息：1) 使用负面指令训练 LLM 以生成消除的响应，2) 增强硬正例以保留原始 LLM 性能，3) 应用新颖的 Wasserstein 正则化以确保与 LLM 的初始权重有足够的偏差。我们在各种 NLP 基准上评估了我们的框架，并证明我们的方法保留了原始 LLM 功能，同时成功地取消了指定的信息。

Title: Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding

Authors: Weizhi Fei, Xueyan Niu, Guoqing Xie, Yanhua Zhang, Bo Bai, Lei Deng, Wei Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12331
Pdf URL: https://arxiv.org/pdf/2406.12331
Copy Paste: [[2406.12331]] Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding(https://arxiv.org/abs/2406.12331)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Current Large Language Models (LLMs) face inherent limitations due to their pre-defined context lengths, which impede their capacity for multi-hop reasoning within extensive textual contexts. While existing techniques like Retrieval-Augmented Generation (RAG) have attempted to bridge this gap by sourcing external information, they fall short when direct answers are not readily available. We introduce a novel approach that re-imagines information retrieval through dynamic in-context editing, inspired by recent breakthroughs in knowledge editing. By treating lengthy contexts as malleable external knowledge, our method interactively gathers and integrates relevant information, thereby enabling LLMs to perform sophisticated reasoning steps. Experimental results demonstrate that our method effectively empowers context-limited LLMs, such as Llama2, to engage in multi-hop reasoning with improved performance, which outperforms state-of-the-art context window extrapolation methods and even compares favorably to more advanced commercial long-context models. Our interactive method not only enhances reasoning capabilities but also mitigates the associated training and computational costs, making it a pragmatic solution for enhancing LLMs' reasoning within expansive contexts.
摘要：当前的大型语言模型 (LLM) 因其预定义的上下文长度而面临固有的局限性，这阻碍了它们在广泛的文本上下文中进行多跳推理的能力。虽然现有技术（如检索增强生成 (RAG)）已尝试通过获取外部信息来弥补这一差距，但当直接答案不易获得时，它们就会失效。我们引入了一种新方法，通过动态上下文编辑重新构想信息检索，这种方法受到知识编辑领域最新突破的启发。通过将冗长的上下文视为可塑的外部知识，我们的方法以交互方式收集和整合相关信息，从而使 LLM 能够执行复杂的推理步骤。实验结果表明，我们的方法有效地使上下文受限的 LLM（如 Llama2）能够以更高的性能进行多跳推理，其性能优于最先进的上下文窗口外推方法，甚至可以与更先进的商业长上下文模型相媲美。我们的交互式方法不仅增强了推理能力，而且还降低了相关的训练和计算成本，使其成为在广泛背景下增强 LLM 推理能力的实用解决方案。

Title: Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters

Authors: Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12335
Pdf URL: https://arxiv.org/pdf/2406.12335
Copy Paste: [[2406.12335]] Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters(https://arxiv.org/abs/2406.12335)
Keywords: language model, llm, chat
Abstract: Scaling the context size of large language models (LLMs) enables them to perform various new tasks, e.g., book summarization. However, the memory cost of the Key and Value (KV) cache in attention significantly limits the practical applications of LLMs. Recent works have explored token pruning for KV cache reduction in LLMs, relying solely on attention scores as a token importance indicator. However, our investigation into value vector norms revealed a notably non-uniform pattern questioning their reliance only on attention scores. Inspired by this, we propose a new method: Value-Aware Token Pruning (VATP) which uses both attention scores and the $ \ell_{1} $ norm of value vectors to evaluate token importance. Extensive experiments on LLaMA2-7B-chat and Vicuna-v1.5-7B across 16 LongBench tasks demonstrate VATP's superior performance.
摘要：扩展大型语言模型 (LLM) 的上下文大小使它们能够执行各种新任务，例如图书摘要。然而，注意力机制中键和值 (KV) 缓存的内存成本严重限制了 LLM 的实际应用。最近的研究探索了标记剪枝以减少 LLM 中的 KV 缓存，仅依靠注意力分数作为标记重要性指标。然而，我们对值向量范数的研究揭示了一个明显不均匀的模式，质疑它们仅依赖注意力分数。受此启发，我们提出了一种新方法：值感知标记剪枝 (VATP)，它同时使用注意力分数和值向量的 $ \ell_{1} $ 范数来评估标记重要性。在 16 个 LongBench 任务中对 LLaMA2-7B-chat 和 Vicuna-v1.5-7B 进行的大量实验证明了 VATP 的卓越性能。
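
Note: the idea of scoring tokens by both attention and value-vector magnitude can be sketched as follows. The abstract only says that VATP uses both signals; combining them as a product, and the function names below, are assumptions made for illustration rather than the paper's confirmed formula. The inputs are one accumulated attention score per cached token and that token's value vector.

```python
from typing import List, Sequence


def l1_norm(vector: Sequence[float]) -> float:
    """Sum of absolute values of the vector's entries."""
    return sum(abs(x) for x in vector)


def value_aware_scores(
    attn_scores: Sequence[float],
    value_vectors: Sequence[Sequence[float]],
) -> List[float]:
    """Per-token importance: attention score times L1 norm of the value vector.

    The product is an assumed way of combining the two signals.
    """
    if len(attn_scores) != len(value_vectors):
        raise ValueError("need one value vector per attention score")
    return [a * l1_norm(v) for a, v in zip(attn_scores, value_vectors)]


def tokens_to_keep(
    attn_scores: Sequence[float],
    value_vectors: Sequence[Sequence[float]],
    budget: int,
) -> List[int]:
    """Indices of the `budget` highest-scoring tokens, in original order."""
    scores = value_aware_scores(attn_scores, value_vectors)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

For example, with attention scores `[0.5, 0.3, 0.2]` and value vectors `[[0.1, -0.1], [1.0, 2.0], [0.5, 0.5]]`, a budget of 2 keeps tokens 1 and 2, whereas ranking by attention alone would keep tokens 0 and 1: the first token attracts attention but carries a near-zero value vector.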

Title: Interpreting Bias in Large Language Models: A Feature-Based Approach

Authors: Nirmalendu Prakash, Lee Ka Wei Roy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12347
Pdf URL: https://arxiv.org/pdf/2406.12347
Copy Paste: [[2406.12347]] Interpreting Bias in Large Language Models: A Feature-Based Approach(https://arxiv.org/abs/2406.12347)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) such as Mistral and LLaMA have showcased remarkable performance across various natural language processing (NLP) tasks. Despite their success, these models inherit social biases from the diverse datasets on which they are trained. This paper investigates the propagation of biases within LLMs through a novel feature-based analytical approach. Drawing inspiration from causal mediation analysis, we hypothesize the evolution of bias-related features and validate them using interpretability techniques like activation and attribution patching. Our contributions are threefold: (1) We introduce and empirically validate a feature-based method for bias analysis in LLMs, applied to LLaMA-2-7B, LLaMA-3-8B, and Mistral-7B-v0.3 with templates from a professions dataset. (2) We extend our method to another form of gender bias, demonstrating its generalizability. (3) We differentiate the roles of MLPs and attention heads in bias propagation and implement targeted debiasing using a counterfactual dataset. Our findings reveal the complex nature of bias in LLMs and emphasize the necessity for tailored debiasing strategies, offering a deeper understanding of bias mechanisms and pathways for effective mitigation.
摘要：大型语言模型 (LLM)（例如 Mistral 和 LLaMA）在各种自然语言处理 (NLP) 任务中都表现出色。尽管这些模型取得了成功，但它们从训练它们的各种数据集中继承了社会偏见。本文通过一种新颖的基于特征的分析方法研究了 LLM 中偏见的传播。从因果中介分析中汲取灵感，我们假设了偏见相关特征的演变，并使用激活和归因修补等可解释性技术对其进行了验证。我们的贡献有三方面：(1) 我们引入并实证验证了一种基于特征的 LLM 偏见分析方法，该方法应用于 LLaMA-2-7B、LLaMA-3-8B 和 Mistral-7B-v0.3，并使用来自专业数据集的模板。(2) 我们将我们的方法扩展到另一种形式的性别偏见，证明了它的普遍性。(3) 我们区分了 MLP 和注意力头在偏见传播中的作用，并使用反事实数据集实施有针对性的去偏见。我们的研究结果揭示了法学硕士 (LLM) 中偏见的复杂性，并强调了制定个性化去偏见策略的必要性，从而更深入地了解偏见机制和有效缓解途径。

Title: Cross-Lingual Unlearning of Selective Knowledge in Multilingual Language Models

Authors: Minseok Choi, Kyunghyun Min, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12354
Pdf URL: https://arxiv.org/pdf/2406.12354
Copy Paste: [[2406.12354]] Cross-Lingual Unlearning of Selective Knowledge in Multilingual Language Models(https://arxiv.org/abs/2406.12354)
Keywords: language model
Abstract: Pretrained language models memorize vast amounts of information, including private and copyrighted data, raising significant safety concerns. Retraining these models after excluding sensitive data is prohibitively expensive, making machine unlearning a viable, cost-effective alternative. Previous research has focused on machine unlearning for monolingual models, but we find that unlearning in one language does not necessarily transfer to others. This vulnerability makes models susceptible to low-resource language attacks, where sensitive information remains accessible in less dominant languages. This paper presents a pioneering approach to machine unlearning for multilingual language models, selectively erasing information across different languages while maintaining overall performance. Specifically, our method employs an adaptive unlearning scheme that assigns language-dependent weights to address different language performances of multilingual language models. Empirical results demonstrate the effectiveness of our framework compared to existing unlearning baselines, setting a new standard for secure and adaptable multilingual language models.
摘要：预训练语言模型会记住大量信息，包括私人和受版权保护的数据，这引发了重大的安全隐患。在排除敏感数据后重新训练这些模型的成本过高，因此机器学习是一种可行且经济高效的替代方案。先前的研究主要集中在单语模型的机器学习，但我们发现一种语言的机器学习并不一定能转移到其他语言。这种弱点使模型容易受到低资源语言攻击，在这种情况下，敏感信息在非主流语言中仍然可以访问。本文介绍了一种用于多语言语言模型的机器学习的开创性方法，可以有选择地删除不同语言的信息，同时保持整体性能。具体而言，我们的方法采用了一种自适应机器学习方案，该方案分配与语言相关的权重来解决多语言语言模型的不同语言性能。与现有的机器学习基线相比，实证结果证明了我们的框架的有效性，为安全且适应性强的多语言语言模型树立了新标准。

Title: WebCanvas: Benchmarking Web Agents in Online Environments

Authors: Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12373
Pdf URL: https://arxiv.org/pdf/2406.12373
Copy Paste: [[2406.12373]] WebCanvas: Benchmarking Web Agents in Online Environments(https://arxiv.org/abs/2406.12373)
Keywords: agent
Abstract: For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) a novel evaluation metric that reliably captures critical intermediate actions or states necessary for task completion while disregarding noise caused by insignificant events or changed web elements; (2) a benchmark dataset called Mind2Web-Live, a refined version of the original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) lightweight and generalizable annotation tools and testing pipelines that enable the community to collect and maintain a high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
摘要：为了使 Web 代理具有实际用途，它们必须适应不断发展的 Web 环境，其特点是用户界面和内容频繁更新。然而，大多数现有基准仅捕获 Web 的静态方面。为了弥补这一差距，我们引入了 WebCanvas，这是一种创新的 Web 代理在线评估框架，可有效解决 Web 交互的动态特性。WebCanvas 包含三个主要组件，以促进现实评估：（1）一种新颖的评估指标，可以可靠地捕获完成任务所需的关键中间动作或状态，同时忽略由无关紧要的事件或更改的 Web 元素引起的噪音。（2）一个名为 Mind2Web-Live 的基准数据集，它是原始 Mind2Web 静态数据集的精炼版本，包含 542 个任务和 2439 个中间评估状态；（3）轻量级且可通用的注释工具和测试管道，使社区能够收集和维护高质量、最新的数据集。基于 WebCanvas，我们开源了一个具有可扩展推理模块的代理框架，为社区进行在线推理和评估提供了基础。我们表现最佳的代理在 Mind2Web-Live 测试集上实现了 23.1% 的任务成功率和 48.8% 的任务完成率。此外，我们还分析了不同网站、域和实验环境中的性能差异。我们鼓励社区为在线代理评估贡献更多见解，从而推动这一研究领域的发展。
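The two headline numbers in the WebCanvas abstract measure different things, which is why they can differ so much. The following is a minimal Python sketch, assuming (the abstract does not spell this out) that the task completion rate is the share of intermediate evaluation states reached and the task success rate is the share of tasks whose states are all reached. `TaskResult`, both function names, and the sample values are hypothetical and not taken from the WebCanvas code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task; a hypothetical record, not the WebCanvas schema."""
    states_total: int    # intermediate evaluation states defined for the task
    states_reached: int  # how many of them the agent actually reached


def completion_rate(results):
    """Share of all intermediate states that were reached (assumed definition)."""
    total = sum(r.states_total for r in results)
    reached = sum(r.states_reached for r in results)
    return reached / total if total else 0.0


def success_rate(results):
    """Share of tasks in which every intermediate state was reached (assumed)."""
    if not results:
        return 0.0
    solved = sum(1 for r in results if r.states_reached == r.states_total)
    return solved / len(results)


# Made-up results for four tasks, for illustration only.
results = [TaskResult(4, 4), TaskResult(5, 2), TaskResult(3, 0), TaskResult(4, 3)]
print(f"completion rate: {completion_rate(results):.3f}")
print(f"success rate:    {success_rate(results):.3f}")
```

Under these assumed definitions, partially solved tasks raise the completion rate but not the success rate, so the former is never lower than the latter when aggregated this way.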

Title: QOG:Question and Options Generation based on Language Model

Authors: Jincheng Zhou
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12381
Pdf URL: https://arxiv.org/pdf/2406.12381
Copy Paste: [[2406.12381]] QOG:Question and Options Generation based on Language Model(https://arxiv.org/abs/2406.12381)
Keywords: language model
Abstract: Question-Options Generation (QOG) is a task that involves generating a set of question-options pairs given context. This task has various applications, including fine-tuning large models, information retrieval, and automated multiple-choice question generation for education. In this paper, we develop QOG models using three different methods based on fine-tuning sequence-to-sequence language models (LMs). Experiments demonstrate that the end-to-end QOG model is computationally efficient and stable during both training and inference, outperforming other methods. Furthermore, our analysis indicates that our QOG models are competitive on the QOG task compared to the large language model Llama 3-8B.
摘要：问题选项生成 (QOG) 是一项根据上下文生成一组问题选项对的任务。此任务有多种应用，包括微调大型模型、信息检索和教育领域的自动多项选择题生成。在本文中，我们使用三种不同的方法开发 QOG 模型，这些方法基于微调序列到序列语言模型 (LM)。实验表明，端到端 QOG 模型在训练和推理过程中都具有计算效率和稳定性，优于其他方法。此外，我们的分析表明，与大型语言模型 Llama 3-8B 相比，我们的 QOG 模型在 QOG 任务上具有竞争力。

Title: From Instance Training to Instruction Learning: Task Adapters Generation from Instructions

Authors: Huanxuan Liao, Yao Xu, Shizhu He, Yuanzhe Zhang, Yanchao Hao, Shengping Liu, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12382
Pdf URL: https://arxiv.org/pdf/2406.12382
Copy Paste: [[2406.12382]] From Instance Training to Instruction Learning: Task Adapters Generation from Instructions(https://arxiv.org/abs/2406.12382)
Keywords: language model, llm
Abstract: Large language models (LLMs) have acquired the ability to solve general tasks by utilizing instruction finetuning (IFT). However, IFT still relies heavily on instance training of extensive task data, which greatly limits the adaptability of LLMs to real-world scenarios where labeled task instances are scarce and broader task generalization becomes paramount. Contrary to LLMs, humans acquire skills and complete tasks not merely through repeated practice but also by understanding and following instructional guidelines. This paper is dedicated to simulating human learning to address the shortcomings of instance training, focusing on instruction learning to enhance cross-task generalization. Within this context, we introduce Task Adapters Generation from Instructions (TAGI), which automatically constructs the task-specific model in a parameter generation manner based on the given task instructions without retraining for unseen tasks. Specifically, we utilize knowledge distillation to enhance the consistency between TAGI developed through Learning with Instruction and task-specific models developed through Training with Instance, by aligning the labels, output logits, and adapter parameters between them. TAGI is endowed with cross-task generalization capabilities through a two-stage training process that includes hypernetwork pretraining and finetuning. We evaluate TAGI on the Super-Natural Instructions and P3 datasets. The experimental results demonstrate that TAGI can match or even outperform traditional meta-trained models and other hypernetwork models, while significantly reducing computational requirements.
摘要：大型语言模型 (LLM) 已经通过利用指令微调 (IFT) 获得了解决一般任务的能力。然而，IFT 仍然严重依赖于大量任务数据的实例训练，这极大地限制了 LLM 对现实世界场景的适应性，因为现实世界中标记的任务实例稀缺，更广泛的任务泛化变得至关重要。与 LLM 相反，人类不仅通过反复练习获得技能并完成任务，而且还通过理解和遵循教学指南来获得技能。本文致力于模拟人类学习以解决实例训练的不足，重点研究指令学习以增强跨任务泛化。在此背景下，我们引入了指令任务适配器生成 (TAGI)，它根据给定的任务指令以参数生成的方式自动构建特定于任务的模型，而无需对未见过的任务进行重新训练。具体而言，我们利用知识蒸馏来增强通过指令学习开发的 TAGI 与通过实例训练开发的任务特定模型之间的一致性，通过对齐它们之间的标签、输出逻辑和适配器参数。 TAGI 通过超网络预训练和微调两阶段训练过程，具备跨任务泛化能力。我们在 Super-Natural Instructions 和 P3 数据集上对 TAGI 进行了评估。实验结果表明，TAGI 可以匹敌甚至超越传统的元训练模型和其他超网络模型，同时显著降低计算要求。

Title: IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models

Authors: Qiyao Wang, Jianguo Huang, Shule Lu, Yuan Lin, Kan Xu, Liang Yang, Hongfei Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12386
Pdf URL: https://arxiv.org/pdf/2406.12386
Copy Paste: [[2406.12386]] IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models(https://arxiv.org/abs/2406.12386)
Keywords: language model, gpt, llm
Abstract: The rapid development of Large Language Models (LLMs) in vertical domains, including intellectual property (IP), lacks a specific evaluation benchmark for assessing their understanding, application, and reasoning abilities. To fill this gap, we introduce IPEval, the first evaluation benchmark tailored for IP agency and consulting tasks. IPEval comprises 2657 multiple-choice questions across four major dimensions: creation, application, protection, and management of IP. These questions span patent rights (inventions, utility models, designs), trademarks, copyrights, trade secrets, and other related laws. Evaluation methods include zero-shot, 5-few-shot, and Chain of Thought (CoT) for seven LLM types, predominantly in English or Chinese. Results show superior English performance by models like the GPT series and Qwen series, while Chinese-centric LLMs excel in Chinese tests, although specialized IP LLMs lag behind general-purpose ones. Regional and temporal aspects of IP underscore the need for LLMs to grasp legal nuances and evolving laws. IPEval aims to accurately gauge LLM capabilities in IP and spur development of specialized models. Website: this https URL
摘要：大型语言模型 (LLM) 在垂直领域（包括知识产权 (IP)）发展迅速，但缺乏评估其理解、应用和推理能力的特定评估基准。为了填补这一空白，我们推出了 IPEval，这是第一个针对 IP 代理和咨询任务量身定制的评估基准。IPEval 包含 2657 个多项选择题，涉及四个主要维度：IP 的创建、应用、保护和管理。这些问题涵盖专利权（发明、实用新型、设计）、商标、版权、商业秘密和其他相关法律。评估方法包括零样本、5 次样本和思路链 (CoT)，适用于七种 LLM 类型，主要以英语或中文为主。结果显示，GPT 系列和 Qwen 系列等模型的英语表现优异，而以中文为中心的 LLM 在中文测试中表现出色，尽管专门的 IP LLM 落后于通用 LLM。IP 的区域性和时间性凸显了 LLM 需要掌握法律细微差别和不断发展的法律。IPEval 旨在准确衡量 LLM 在 IP 方面的能力并促进专业模型的开发。网站：此 https URL

Title: Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models

Authors: Jie Chen, Yupeng Zhang, Bingning Wang, Wayne Xin Zhao, Ji-Rong Wen, Weipeng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12397
Pdf URL: https://arxiv.org/pdf/2406.12397
Copy Paste: [[2406.12397]] Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models(https://arxiv.org/abs/2406.12397)
Keywords: language model, llm
Abstract: Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs). Studies have shown that synthetic data can effectively improve the performance of LLMs on downstream benchmarks. However, despite its potential benefits, our analysis suggests that there may be inherent flaws in synthetic data. The uniform format of synthetic data can lead to pattern overfitting and cause significant shifts in the output distribution, thereby reducing the model's instruction-following capabilities. Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. The empirical results demonstrate the effectiveness of our approach, which can reverse the instruction-following issues caused by pattern overfitting without compromising performance on benchmarks at relatively low cost. Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
摘要：合成数据已被提出作为解决大型语言模型 (LLM) 训练中高质量数据稀缺问题的解决方案。研究表明，合成数据可以有效提高 LLM 在下游基准测试中的表现。然而，尽管合成数据具有潜在的好处，但我们的分析表明，合成数据可能存在固有缺陷。合成数据的统一格式可能导致模式过度拟合，并导致输出分布发生显著变化，从而降低模型的指令跟踪能力。我们的工作深入研究了与问答 (Q-A) 对（一种普遍的合成数据类型）相关的这些特定缺陷，并提出了一种基于反学习技术的方法来缓解这些缺陷。实证结果证明了我们方法的有效性，它可以逆转由模式过度拟合引起的指令跟踪问题，而不会以相对较低的成本损害基准测试的性能。我们的工作为有效使用合成数据提供了关键见解，旨在促进更稳健、更高效的 LLM 训练。

Title: QueerBench: Quantifying Discrimination in Language Models Toward Queer Identities

Authors: Mae Sosto, Alberto Barrón-Cedeño
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2406.12399
Pdf URL: https://arxiv.org/pdf/2406.12399
Copy Paste: [[2406.12399]] QueerBench: Quantifying Discrimination in Language Models Toward Queer Identities(https://arxiv.org/abs/2406.12399)
Keywords: language model, llm
Abstract: With the increasing role of Natural Language Processing (NLP) in various applications, challenges concerning bias and stereotype perpetuation are accentuated, which often leads to hate speech and harm. Despite existing studies on sexism and misogyny, issues like homophobia and transphobia remain underexplored and often adopt binary perspectives, putting the safety of LGBTQIA+ individuals at high risk in online spaces. In this paper, we assess the potential harm caused by sentence completions generated by English large language models (LLMs) concerning LGBTQIA+ individuals. This is achieved using QueerBench, our new assessment framework, which employs a template-based approach and a Masked Language Modeling (MLM) task. The analysis indicates that large language models tend to exhibit discriminatory behaviour more frequently towards individuals within the LGBTQIA+ community, reaching a difference gap of 7.2% in the QueerBench score of harmfulness.
摘要：随着自然语言处理 (NLP) 在各种应用中的作用越来越大，偏见和刻板印象的延续问题也日益突出，这往往会导致仇恨言论和伤害。尽管已有关于性别歧视和厌女症的研究，但恐同症和跨性别恐惧症等问题仍未得到充分探索，而且往往采用二元视角，使 LGBTQIA+ 人群在网络空间的安全面临高风险。在本文中，我们评估了英语大型语言模型 (LLM) 生成的句子完成对 LGBTQIA+ 人群造成的潜在伤害。这是使用我们的新评估框架 QueerBench 实现的，它采用基于模板的方法和掩蔽语言建模 (MLM) 任务。分析表明，大型语言模型往往更频繁地对 LGBTQIA+ 社区中的个体表现出歧视行为，在 QueerBench 危害性评分中达到 7.2% 的差异差距。

Title: Flee the Flaw: Annotating the Underlying Logic of Fallacious Arguments Through Templates and Slot-filling

Authors: Irfan Robbani (1), Paul Reisert (2), Naoya Inoue (1,3), Surawat Pothong (1), Camélia Guerraoui (4,3,5), Wenzhi Wang (4,3), Shoichi Naito (6,4,3), Jungmin Choi (3), Kentaro Inui (7,4,3) ((1) JAIST, (2) Beyond Reason, (3) RIKEN, (4) Tohoku University, (5) INSA Lyon, (6) Ricoh Company, Ltd., (7) MBZUAI)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12402
Pdf URL: https://arxiv.org/pdf/2406.12402
Copy Paste: [[2406.12402]] Flee the Flaw: Annotating the Underlying Logic of Fallacious Arguments Through Templates and Slot-filling(https://arxiv.org/abs/2406.12402)
Keywords: language model
Abstract: Prior research in computational argumentation has mainly focused on scoring the quality of arguments, with less attention on explicating logical errors. In this work, we introduce four sets of explainable templates for common informal logical fallacies designed to explicate a fallacy's implicit logic. Using our templates, we conduct an annotation study on 400 fallacious arguments taken from the LOGIC dataset and achieve a high agreement score (Krippendorff's alpha of 0.54) and reasonable coverage (0.83). Finally, we conduct an experiment for detecting the structure of fallacies and discover that state-of-the-art language models struggle with detecting fallacy templates (0.47 accuracy). To facilitate research on fallacies, we make our dataset and guidelines publicly available.
摘要：先前的计算论证研究主要侧重于对论证质量进行评分，而较少关注解释逻辑错误。在这项工作中，我们引入了四组可解释的常见非正式逻辑谬误模板，旨在解释谬误的隐含逻辑。使用我们的模板，我们对从 LOGIC 数据集中提取的 400 个谬误论证进行了注释研究，并获得了较高的一致性分数（Krippendorf 的 alpha 为 0.54）和合理的覆盖率（0.83）。最后，我们进行了一项检测谬误结构的实验，发现最先进的语言模型在检测谬误模板方面存在困难（准确率为 0.47）。为了促进对谬误的研究，我们公开了我们的数据集和指南。

Title: PDSS: A Privacy-Preserving Framework for Step-by-Step Distillation of Large Language Models

Authors: Tao Fan, Yan Kang, Weijing Chen, Hanlin Gu, Yuanfeng Song, Lixin Fan, Kai Chen, Qiang Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12403
Pdf URL: https://arxiv.org/pdf/2406.12403
Copy Paste: [[2406.12403]] PDSS: A Privacy-Preserving Framework for Step-by-Step Distillation of Large Language Models(https://arxiv.org/abs/2406.12403)
Keywords: language model, llm, prompt
Abstract: In the context of real-world applications, leveraging large language models (LLMs) for domain-specific tasks often faces two major challenges: domain-specific knowledge privacy and constrained resources. To address these issues, we propose PDSS, a privacy-preserving framework for step-by-step distillation of LLMs. PDSS works on a server-client architecture, wherein the client transmits perturbed prompts to the server's LLM for rationale generation. The generated rationales are then decoded by the client and used to enrich the training of a task-specific small language model (SLM) within a multi-task learning paradigm. PDSS introduces two privacy protection strategies: the Exponential Mechanism Strategy and the Encoder-Decoder Strategy, balancing prompt privacy and rationale usability. Experiments demonstrate the effectiveness of PDSS in various text generation tasks, enabling the training of task-specific SLMs with enhanced performance while prioritizing data privacy protection.
摘要：在实际应用中，利用大型语言模型 (LLM) 执行特定领域的任务通常面临两大挑战：领域特定知识隐私和资源受限。为了解决这些问题，我们提出了 PDSS，这是一个用于逐步提炼 LLM 的隐私保护框架。PDSS 在服务器-客户端架构上工作，其中客户端将扰动的提示传输到服务器的 LLM 以生成理由。然后，客户端解码生成的理由，并将其用于丰富多任务学习范式中特定于任务的小型语言模型 (SLM) 的训练。PDSS 引入了两种隐私保护策略：指数机制策略和编码器-解码器策略，在提示隐私和理由可用性之间取得平衡。实验证明了 PDSS 在各种文本生成任务中的有效性，使特定于任务的 SLM 的训练能够增强性能，同时优先考虑数据隐私保护。

Title: Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models

Authors: Hongbang Yuan, Yubo Chen, Pengfei Cao, Zhuoran Jin, Kang Liu, Jun Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12416
Pdf URL: https://arxiv.org/pdf/2406.12416
Copy Paste: [[2406.12416]] Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models(https://arxiv.org/abs/2406.12416)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have achieved remarkable success but still tend to generate factually erroneous responses, a phenomenon known as hallucination. A recent trend is to use preference learning to fine-tune models to align with factuality. However, existing work primarily evaluates fine-tuned models on in-domain (ID) datasets, and the factuality on out-of-domain (OOD) datasets remains underexplored. In this paper, we conduct a comprehensive evaluation of the factuality of different models tuned by various preference learning algorithms and demonstrate that their performance on OOD datasets either increases minimally or decreases. Subsequently, we reveal that the main cause of a model's failure to uphold factuality under a distribution shift is under-alignment, rather than over-alignment, by analyzing the token distribution shift of the models before and after tuning. Finally, we propose APEFT (Atomic Preference Enhanced Factuality Tuning), a framework that enhances a model's awareness of factuality at the granularity of individual facts. Extensive experiments demonstrate that APEFT improves model performance by an average of 3.45% on both ID and OOD datasets, demonstrating its effectiveness.
摘要：大型语言模型 (LLM) 取得了显著成功，但仍然倾向于产生事实上的错误反应，这种现象称为幻觉。最近的趋势是使用偏好学习来微调模型以与事实性保持一致。然而，现有工作主要评估域内 (ID) 数据集上的微调模型，而域外 (OOD) 数据集上的事实性仍未得到充分探索。在本文中，我们对通过各种偏好学习算法调整的不同模型的事实性进行了全面评估，并证明它们在 OOD 数据集上的性能要么略有提高，要么有所下降。随后，通过分析调整前后模型的标记分布偏移，我们揭示了模型在分布偏移下无法保持事实性的主要原因是对齐不足，而不是对齐过度。最后，我们提出了 APEFT（Atomic Preference Enhanced Factuality Tuning），这是一个在单个事实粒度上增强模型对事实性认识的框架。大量实验表明，APEFT 在 ID 和 OOD 数据集上平均将模型性能提高了 3.45%，这是非常有效的。

Title: MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling

Authors: Philipp Seeberger, Dominik Wagner, Korbinian Riedhammer
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12420
Pdf URL: https://arxiv.org/pdf/2406.12420
Copy Paste: [[2406.12420]] MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling(https://arxiv.org/abs/2406.12420)
Keywords: prompt
Abstract: With the advancement of multimedia technologies, news documents and user-generated content are often represented as multiple modalities, making Multimedia Event Extraction (MEE) an increasingly important challenge. However, recent MEE methods employ weak alignment strategies and data augmentation with simple classification models, which ignore the capabilities of natural language-formulated event templates for the challenging Event Argument Extraction (EAE) task. In this work, we focus on EAE and address this issue by introducing a unified template filling model that connects the textual and visual modalities via textual prompts. This approach enables the exploitation of cross-ontology transfer and the incorporation of event-specific semantics. Experiments on the M2E2 benchmark demonstrate the effectiveness of our approach. Our system surpasses the current SOTA on textual EAE by +7% F1, and performs generally better than the second-best systems for multimedia EAE.
摘要：随着多媒体技术的进步，新闻文档和用户生成的内容通常以多种模态表示，这使得多媒体事件提取 (MEE) 成为越来越重要的挑战。然而，最近的 MEE 方法采用弱对齐策略和数据增强以及简单的分类模型，这忽略了自然语言制定的事件模板在具有挑战性的事件参数提取 (EAE) 任务中的能力。在这项工作中，我们专注于 EAE，并通过引入一个统一的模板填充模型来解决这个问题，该模型通过文本提示连接文本和视觉模态。这种方法能够利用跨本体传输并结合事件特定的语义。在 M2E2 基准上的实验证明了我们方法的有效性。我们的系统在文本 EAE 上的表现比目前的 SOTA 高出 +7% F1，并且通常比多媒体 EAE 的第二好系统表现更好。

Title: PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

Authors: Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.12428
Pdf URL: https://arxiv.org/pdf/2406.12428
Copy Paste: [[2406.12428]] PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems(https://arxiv.org/abs/2406.12428)
Keywords: language model, llm
Abstract: Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that our approach improves latency while maintaining the quality of response content. Additionally, we show that latency can be further reduced by generating speech in multiple sequences. Demo samples are available at this https URL.
摘要：同时处理文本和语音的多模态语言模型具有应用于口头对话系统的潜力。然而，目前的模型在响应生成延迟方面面临两大挑战：（1）生成口头响应需要先生成书面响应，（2）语音序列比文本序列长得多。本研究通过扩展语言模型的输入和输出序列来解决这些问题，以支持文本和语音的并行生成。我们在口头问答任务上的实验表明，我们的方法在保持响应内容质量的同时改善了延迟。此外，我们还表明，通过生成多个序列的语音可以进一步减少延迟。演示示例可在此 https URL 上找到。

Title: PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers

Authors: Myeonghwa Lee, Seonho An, Min-Soo Kim
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12430
Pdf URL: https://arxiv.org/pdf/2406.12430
Copy Paste: [[2406.12430]] PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers(https://arxiv.org/abs/2406.12430)
Keywords: language model, llm, retrieval augmented generation
Abstract: In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define Decision QA as the task of finding the best decision, $d_{best}$, given a decision-making question $Q$, business rules $R$, and a database $D$. Since there is no benchmark that can examine Decision QA, we propose a Decision QA benchmark, DQA. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called the iterative plan-then-retrieval augmented generation (PlanRAG). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario. We release our code and benchmark at this https URL.
摘要：在本文中，我们进行了一项研究，利用 LLM 作为需要复杂数据分析的决策解决方案。我们将决策问答定义为针对决策问题 $Q$、业务规则 $R$ 和数据库 $D$ 回答最佳决策 $d_{best}$ 的任务。由于没有可以检验决策问答的基准，我们提出了决策问答基准 DQA。它有两个场景，定位和建筑，由两个视频游戏（Europa Universalis IV 和 Victoria 3）构建，它们的目标与决策问答几乎相同。为了有效地解决决策问答问题，我们还提出了一种称为迭代计划然后检索增强生成 (PlanRAG) 的新 RAG 技术。我们基于 PlanRAG 的 LM 生成决策计划作为第一步，检索器生成数据分析查询作为第二步。所提出的方法在定位场景中比最先进的迭代 RAG 方法高出 15.8%，在建筑场景中比最先进的迭代 RAG 方法高出 7.4%。我们在此 https URL 上发布我们的代码和基准。

Title: Abstraction-of-Thought Makes Language Models Better Reasoners

Authors: Ruixin Hong, Hongming Zhang, Xiaoman Pan, Dong Yu, Changshui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12442
Pdf URL: https://arxiv.org/pdf/2406.12442
Copy Paste: [[2406.12442]] Abstraction-of-Thought Makes Language Models Better Reasoners(https://arxiv.org/abs/2406.12442)
Keywords: language model, chain-of-thought
Abstract: Abstract reasoning, the ability to reason from the abstract essence of a problem, serves as a key to generalization in human reasoning. However, eliciting abstract reasoning from language models remains unexplored. This paper seeks to bridge this gap by introducing a novel structured reasoning format called Abstraction-of-Thought (AoT). The uniqueness of AoT lies in its explicit requirement for varying levels of abstraction within the reasoning process. This approach can prompt language models to first reason at the abstract level before incorporating concrete details, a step that is overlooked by the prevailing step-by-step Chain-of-Thought (CoT) method. To align models with the AoT format, we present AoT Collection, a generic finetuning dataset consisting of 348k high-quality samples with AoT reasoning processes, collected via an automated and scalable pipeline. We finetune a wide range of language models with AoT Collection and conduct extensive evaluations on 23 unseen tasks from the challenging benchmark Big-Bench Hard. Experimental results indicate that models aligned to the AoT reasoning format substantially outperform those aligned to CoT in many reasoning tasks.
摘要：抽象推理是从问题的抽象本质进行推理的能力，是人类推理泛化的关键。然而，引导语言模型进行抽象推理仍未得到探索。本文试图通过引入一种称为思维抽象 (AoT) 的新型结构化推理格式来弥合这一差距。AoT 的独特之处在于它明确要求推理过程中具有不同程度的抽象。这种方法可以引导语言模型首先在抽象层面上进行思考，然后再结合具体细节，而这一点被现行的逐步思维链 (CoT) 方法所忽视。为了使模型与 AoT 格式保持一致，我们提出了 AoT Collection，这是一个通用的微调数据集，由 348k 个具有 AoT 推理过程的高质量样本组成，这些样本通过自动化和可扩展的管道收集。我们使用 AoT Collection 对各种语言模型进行微调，并对具有挑战性的基准 Big-Bench Hard 中的 23 个未见过的任务进行了广泛的评估。实验结果表明，在许多推理任务中，与 AoT 推理格式一致的模型表现明显优于与 CoT 一致的模型。

Title: Adaptive Token Biaser: Knowledge Editing via Biasing Key Entities

Authors: Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Hongcheng Gao, Yilong Xu, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12468
Pdf URL: https://arxiv.org/pdf/2406.12468
Copy Paste: [[2406.12468]] Adaptive Token Biaser: Knowledge Editing via Biasing Key Entities(https://arxiv.org/abs/2406.12468)
Keywords: language model, llm, prompt
Abstract: The parametric knowledge memorized by large language models (LLMs) becomes outdated quickly. In-context editing (ICE) is currently the most effective method for updating the knowledge of LLMs. Recent advancements involve enhancing ICE by modifying the decoding strategy, obviating the need for altering internal model structures or adjusting external prompts. However, this enhancement operates across the entire sequence generation, encompassing a plethora of non-critical tokens. In this work, we introduce Adaptive Token Biaser (ATBias), a new decoding technique designed to enhance ICE. It focuses on the tokens that are most related to knowledge during decoding, biasing their logits by matching key entities related to new and parametric knowledge. Experimental results show that ATBias significantly enhances ICE performance, achieving up to a 32.3% improvement over state-of-the-art ICE methods while incurring only half the latency. ATBias not only improves the knowledge editing capabilities of ICE but can also be widely applied to LLMs with negligible cost.
摘要：大型语言模型 (LLM) 所记忆的参数知识很快就会过时。上下文编辑 (ICE) 目前是更新 LLM 知识的最有效方法。最近的进展包括通过修改解码策略来增强 ICE，从而无需更改内部模型结构或调整外部提示。然而，这种增强作用贯穿整个序列生成，涵盖了大量非关键标记。在这项工作中，我们引入了 Adaptive Token Biaser (ATBias)，这是一种旨在增强 ICE 的新解码技术。它专注于解码过程中与知识最相关的标记，通过匹配与新知识和参数知识相关的关键实体来偏置它们的逻辑。实验结果表明，ATBias 显著提高了 ICE 性能，与最先进的 ICE 方法相比，性能提高了 32.3%，而延迟仅为原来的一半。ATBias 不仅提高了 ICE 的知识编辑能力，而且可以以极低的成本广泛应用于 LLM。
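The idea of biasing only entity-related logits during decoding can be illustrated with a toy example. This is a minimal Python sketch, not the ATBias algorithm itself: the abstract does not say how entities are matched or how the bias is adapted, so the fixed `strength`, the sign convention (raise tokens of the new fact, lower tokens of the outdated one), and all names below are assumptions made for illustration.

```python
def bias_logits(logits, new_entity_tokens, old_entity_tokens, strength=2.0):
    """Return a copy of `logits` with entity-related tokens shifted.

    logits: dict mapping token string -> raw logit for the next position.
    new_entity_tokens: tokens from entities in the edited (in-context) fact.
    old_entity_tokens: tokens from entities in the model's outdated answer.
    strength: hypothetical bias magnitude; all other tokens are left untouched.
    """
    biased = dict(logits)  # copy so the caller's logits are not modified
    for tok in biased:
        if tok in new_entity_tokens:
            biased[tok] += strength
        elif tok in old_entity_tokens:
            biased[tok] -= strength
    return biased


def greedy_token(logits):
    """Pick the token with the highest logit."""
    return max(logits, key=logits.get)


# Toy next-token distribution; the tokens and values are made up.
logits = {"Paris": 3.1, "Lyon": 2.4, "the": 1.0}
edited = bias_logits(logits, new_entity_tokens={"Lyon"}, old_entity_tokens={"Paris"})
print(greedy_token(logits), "->", greedy_token(edited))
```

Only the two entity tokens change, which mirrors the abstract's point that non-critical tokens need not be touched during decoding.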

Title: Fighting Randomness with Randomness: Mitigating Optimisation Instability of Fine-Tuning using Delayed Ensemble and Noisy Interpolation

Authors: Branislav Pecher, Jan Cegin, Robert Belanec, Jakub Simko, Ivan Srba, Maria Bielikova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12471
Pdf URL: https://arxiv.org/pdf/2406.12471
Copy Paste: [[2406.12471]] Fighting Randomness with Randomness: Mitigating Optimisation Instability of Fine-Tuning using Delayed Ensemble and Noisy Interpolation(https://arxiv.org/abs/2406.12471)
Keywords: language model
Abstract: While fine-tuning of pre-trained language models generally helps to overcome the lack of labelled training samples, it also displays model performance instability. This instability mainly originates from randomness in initialisation or data shuffling. To address this, researchers either modify the training process or augment the available samples, which typically results in increased computational costs. We propose a new mitigation strategy, called Delayed Ensemble with Noisy Interpolation (DENI), that leverages the strengths of ensembling, noise regularisation and model interpolation, while retaining computational efficiency. We compare DENI with 9 representative mitigation strategies across 3 models, 4 tuning strategies and 7 text classification datasets. We show that: 1) DENI outperforms the best performing mitigation strategy (Ensemble), while using only a fraction of its cost; 2) the mitigation strategies are beneficial for parameter-efficient fine-tuning (PEFT) methods, outperforming full fine-tuning in specific cases; and 3) combining DENI with data augmentation often leads to even more effective instability mitigation.
摘要：虽然预训练语言模型的微调通常有助于克服标记训练样本的缺乏，但它也显示出模型性能不稳定。这种不稳定性主要源于初始化或数据混洗的随机性。为了解决这个问题，研究人员要么修改训练过程，要么增加可用样本，这通常会导致计算成本增加。我们提出了一种新的缓解策略，称为延迟集成噪声插值 (DENI)，它利用集成、噪声正则化和模型插值的优势，同时保持计算效率。我们在 3 个模型、4 种调整策略和 7 个文本分类数据集中将 DENI 与 9 种代表性缓解策略进行了比较。我们表明：1) DENI 优于表现最佳的缓解策略 (Ensemble)，同时仅使用其成本的一小部分；2) 缓解策略有利于参数高效微调 (PEFT) 方法，在特定情况下优于完全微调； 3）将 DENI 与数据增强相结合通常可以更有效地缓解不稳定性。

Title: The Power of LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions

Authors: Stefan Sylvius Wagner, Maike Behrendt, Marc Ziegele, Stefan Harmeling
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12480
Pdf URL: https://arxiv.org/pdf/2406.12480
Copy Paste: [[2406.12480]] The Power of LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions(https://arxiv.org/abs/2406.12480)
Keywords: llm, prompt, agent
Abstract: Stance detection holds great potential for enhancing the quality of online political discussions, as it has been shown to be useful for summarizing discussions, detecting misinformation, and evaluating opinion distributions. Usually, transformer-based models are used directly for stance detection, which require large amounts of data. However, the broad range of debate questions in online political discussion creates a variety of possible scenarios that the model is faced with and thus makes data acquisition for model training difficult. In this work, we show how to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions: (i) We generate synthetic data for specific debate questions by prompting a Mistral-7B model and show that fine-tuning with the generated synthetic data can substantially improve the performance of stance detection. (ii) We examine the impact of combining synthetic data with the most informative samples from an unlabelled dataset. First, we use the synthetic data to select the most informative samples; second, we combine these samples with the synthetic data for fine-tuning. This approach reduces labelling effort and consistently surpasses the performance of the baseline model that is trained with fully labeled data. Overall, we show in comprehensive experiments that LLM-generated data greatly improves stance detection performance for online political discussions.
摘要：立场检测在提高在线政治讨论的质量方面具有巨大潜力，因为它已被证明可用于总结讨论、检测错误信息和评估意见分布。通常，基于 Transformer 的模型直接用于立场检测，这需要大量数据。然而，在线政治讨论中广泛的辩论问题为模型带来了各种可能的场景，从而使模型训练的数据获取变得困难。在这项工作中，我们展示了如何利用 LLM 生成的合成数据来训练和改进在线政治讨论的立场检测代理：(i) 我们通过提示 Mistral-7B 模型为特定辩论问题生成合成数据，并表明使用生成的合成数据进行微调可以显着提高立场检测的性能。(ii) 我们研究了将合成数据与未标记数据集中最具信息量的样本相结合的影响。首先，我们使用合成数据来选择最具信息量的样本，其次，我们将这些样本和合成数据结合起来进行微调。这种方法减少了标记工作量，并且始终优于使用完全标记数据训练的基线模型的性能。总体而言，我们通过综合实验表明，LLM 生成的数据极大地提高了在线政治讨论的立场检测性能。

Title: LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization

Authors: Masafumi Enomoto, Kunihiro Takeoka, Kosuke Akimoto, Kiril Gashteovski, Masafumi Oyamada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12494
Pdf URL: https://arxiv.org/pdf/2406.12494
Copy Paste: [[2406.12494]] LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization(https://arxiv.org/abs/2406.12494)
Keywords: language model, llm
Abstract: Open-Domain Multi-Document Summarization (ODMDS), which aims to generate a summary as an answer to a user's query by synthesizing relevant content from multiple documents in a large collection, is crucial for addressing diverse information needs. Existing approaches that first find relevant passages and then generate a summary using a language model are inadequate for ODMDS. This is because open-ended queries often require additional context for the retrieved passages to cover the topic comprehensively, making it challenging to retrieve all relevant passages initially. While iterative retrieval methods have been explored for multi-hop question answering (MQA), they are impractical for ODMDS due to high latency from repeated large language model (LLM) inference for reasoning. To address this issue, we propose LightPAL, a lightweight passage retrieval method for ODMDS that constructs a graph representing passage relationships using an LLM during indexing and employs random walk instead of iterative reasoning and retrieval at inference time. Experiments on ODMDS benchmarks show that LightPAL outperforms baseline retrievers in summary quality while being significantly more efficient than an iterative MQA approach.
摘要：开放域多文档摘要 (ODMDS) 对于满足多样化的信息需求至关重要，其目的是生成摘要作为对用户查询的回答，综合大型文档集合中多个文档的相关内容。现有的方法首先找到相关段落，然后使用语言模型生成摘要，但这些方法不足以满足 ODMDS 的要求。这是因为开放式查询通常需要额外的上下文来使检索到的段落全面涵盖主题，这使得最初检索所有相关段落变得具有挑战性。虽然已经探索了用于多跳问答 (MQA) 的迭代检索方法，但由于重复的大型语言模型 (LLM) 推理会导致高延迟，因此它们不适用于 ODMDS。为了解决这个问题，我们提出了 LightPAL，这是一种用于 ODMDS 的轻量级段落检索方法，它在索引期间使用 LLM 构建表示段落关系的图，并在推理时采用随机游走而不是迭代推理和检索。 ODMDS 基准测试的实验表明，LightPAL 在摘要质量方面优于基线检索器，同时效率明显高于迭代 MQA 方法。

Title: Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency

Authors: Leonidas Gee, Milan Gritta, Gerasimos Lampouras, Ignacio Iacobacci
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12502
Pdf URL: https://arxiv.org/pdf/2406.12502
Copy Paste: [[2406.12502]] Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency(https://arxiv.org/abs/2406.12502)
Keywords: language model
Abstract: Code Language Models have been trained to generate accurate solutions, typically with no regard for runtime. On the other hand, previous works that explored execution optimisation have observed corresponding drops in functional correctness. To that end, we introduce Code-Optimise, a framework that incorporates both correctness (passed, failed) and runtime (quick, slow) as learning signals via self-generated preference data. Our framework is both lightweight and robust as it dynamically selects solutions to reduce overfitting while avoiding a reliance on larger models for learning signals. Code-Optimise achieves significant improvements in pass@k while decreasing the competitive baseline runtimes by an additional 6% for in-domain data and up to 3% for out-of-domain data. As a byproduct, the average length of the generated solutions is reduced by up to 48% on MBPP and 23% on HumanEval, resulting in faster and cheaper inference. The generated data and codebase will be open-sourced at www.open-source.link.
摘要：代码语言模型经过训练可以生成准确的解决方案，通常不考虑运行时间。另一方面，之前探索执行优化的研究发现功能正确性会相应下降。为此，我们引入了 Code-Optimise，这是一个通过自生成偏好数据将正确性（通过、失败）和运行时间（快速、慢速）作为学习信号的框架。我们的框架既轻量又强大，因为它可以动态选择解决方案以减少过度拟合，同时避免依赖更大的模型来获取学习信号。Code-Optimise 在 pass@k 方面取得了显著的改进，同时将域内数据的竞争性基线运行时间额外减少了 6%，域外数据的竞争性基线运行时间最多减少了 3%。作为副产品，生成的解决方案的平均长度在 MBPP 上减少了 48%，在 HumanEval 上减少了 23%，从而实现了更快、更便宜的推理。生成的数据和代码库将在 www.open-source.link 上开源。

Title: FuseGen: PLM Fusion for Data-generation based Zero-shot Learning

Authors: Tianyuan Zou, Yang Liu, Peng Li, Jianqing Zhang, Jingjing Liu, Ya-Qin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12527
Pdf URL: https://arxiv.org/pdf/2406.12527
Copy Paste: [[2406.12527]] FuseGen: PLM Fusion for Data-generation based Zero-shot Learning(https://arxiv.org/abs/2406.12527)
Keywords: language model
Abstract: Data generation-based zero-shot learning, although effective in training Small Task-specific Models (STMs) via synthetic datasets generated by Pre-trained Language Models (PLMs), is often limited by the low quality of such synthetic datasets. Previous solutions have primarily focused on single PLM settings, where synthetic datasets are typically restricted to specific sub-spaces and often deviate from real-world distributions, leading to severe distribution bias. To mitigate such bias, we propose FuseGen, a novel data generation-based zero-shot learning framework that introduces a new criterion for subset selection from synthetic datasets by utilizing multiple PLMs and trained STMs. The chosen subset provides in-context feedback to each PLM, enhancing dataset quality through iterative data generation. Trained STMs are then used for sample re-weighting as well, further improving data quality. Extensive experiments across diverse tasks demonstrate that FuseGen substantially outperforms existing methods and is highly effective in boosting STM performance in a PLM-agnostic way. Code is provided in this https URL.
摘要：基于数据生成的零样本学习虽然能够通过由预训练语言模型 (PLM) 生成的合成数据集有效地训练小型任务特定模型 (STM)，但通常受限于此类合成数据集的低质量。以前的解决方案主要侧重于单个 PLM 设置，其中合成数据集通常局限于特定子空间，并且经常偏离真实世界分布，从而导致严重的分布偏差。为了减轻这种偏差，我们提出了 FuseGen，这是一种基于数据生成的新型零样本学习框架，它通过利用多个 PLM 和训练过的 STM 引入了从合成数据集中选择子集的新标准。所选子集为每个 PLM 提供上下文反馈，通过迭代数据生成提高数据集质量。然后，训练过的 STM 也用于样本重新加权，从而进一步提高数据质量。在各种任务中进行的大量实验表明，FuseGen 的表现大大优于现有方法，能够以与 PLM 无关的方式高效地提升 STM 性能。代码在此 https URL 中提供。

Title: Unified Active Retrieval for Retrieval Augmented Generation

Authors: Qinyuan Cheng, Xiaonan Li, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12534
Pdf URL: https://arxiv.org/pdf/2406.12534
Copy Paste: [[2406.12534]] Unified Active Retrieval for Retrieval Augmented Generation(https://arxiv.org/abs/2406.12534)
Keywords: retrieval augmented generation, retrieval-augmented generation
Abstract: In Retrieval-Augmented Generation (RAG), retrieval is not always helpful and applying it to every instruction is sub-optimal. Therefore, determining whether to retrieve is crucial for RAG, which is usually referred to as Active Retrieval. However, existing active retrieval methods face two challenges: 1. They usually rely on a single criterion, which struggles with handling various types of instructions. 2. They depend on specialized and highly differentiated procedures, and thus combining them makes the RAG system more complicated and leads to higher response latency. To address these challenges, we propose Unified Active Retrieval (UAR). UAR contains four orthogonal criteria and casts them into plug-and-play classification tasks, which achieves multifaceted retrieval timing judgements with negligible extra inference cost. We further introduce the Unified Active Retrieval Criteria (UAR-Criteria), designed to process diverse active retrieval scenarios through a standardized procedure. Experiments on four representative types of user instructions show that UAR significantly outperforms existing work on the retrieval timing judgement and the performance of downstream tasks, which shows the effectiveness of UAR and its helpfulness to downstream tasks.
摘要：在检索增强生成 (RAG) 中，检索并非总是有帮助的，将其应用于每条指令并不是最优的。因此，确定是否检索对于 RAG 至关重要，这通常被称为主动检索。然而，现有的主动检索方法面临两个挑战：1. 它们通常依赖于单一标准，这难以处理各种类型的指令。2. 它们依赖于专门化和高度差异化的程序，因此将它们结合起来会使 RAG 系统更加复杂并导致更高的响应延迟。为了应对这些挑战，我们提出了统一主动检索 (UAR)。UAR 包含四个正交标准并将它们投射到即插即用的分类任务中，从而以可忽略不计的额外推理成本实现多方面的检索时间判断。我们进一步介绍了统一主动检索标准 (UAR-Criteria)，旨在通过标准化程序处理不同的主动检索场景。在四种具有代表性的用户指令上进行的实验表明，UAR 在检索时间判断和下游任务的表现上明显优于现有工作，证明了 UAR 的有效性及其对下游任务的帮助。

Title: Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models

Authors: Philipp Mondorf, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12546
Pdf URL: https://arxiv.org/pdf/2406.12546
Copy Paste: [[2406.12546]] Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models(https://arxiv.org/abs/2406.12546)
Keywords: language model
Abstract: Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character's identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from individual statements, but also the ability to assess the truthfulness of statements by reasoning through various hypothetical scenarios. As such, knights and knaves puzzles serve as compelling examples of suppositional reasoning. In this paper, we introduce TruthQuest, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical statements involved. Evaluations on TruthQuest show that large language models like Llama 3 and Mixtral-8x7B exhibit significant difficulties solving these tasks. A detailed error analysis of the models' output reveals that lower-performing models exhibit a diverse range of reasoning errors, frequently failing to grasp the concept of truth and lies. In comparison, more proficient models primarily struggle with accurately inferring the logical implications of potentially false statements.
摘要：骑士和流氓问题代表了一种经典的逻辑谜题类型，其中角色要么说真话，要么撒谎。目标是根据每个角色的陈述逻辑地推断出他们的身份。挑战来自说真话或撒谎的行为，这会影响每个陈述的逻辑含义。解决这些谜题不仅需要从单个陈述中直接推断，还需要能够通过推理各种假设场景来评估陈述的真实性。因此，骑士和流氓谜题是假设推理的有力例子。在本文中，我们介绍了 $\textit{TruthQuest}$，这是基于骑士和流氓谜题原理的假设推理基准。我们的基准提出了不同复杂度的问题，考虑到字符数量和所涉及的逻辑语句类型。对 $\textit{TruthQuest}$ 的评估表明，像 Llama 3 和 Mixtral-8x7B 这样的大型语言模型在解决这些任务时表现出很大的困难。对模型输出的详细错误分析表明，表现较差的模型表现出各种各样的推理错误，经常无法掌握真相和谎言的概念。相比之下，更熟练的模型主要难以准确推断潜在错误陈述的逻辑含义。
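
To make the suppositional reasoning described above concrete, here is a minimal brute-force sketch of a knights and knaves solver. It is an illustration only, not the TruthQuest benchmark or its evaluation code: the function name `solve`, the encoding of statements as Python callables, and the two-character puzzle are all assumptions made for this example. The solver supposes each possible assignment of identities in turn and keeps the assignments under which every knight's claim is true and every knave's claim is false.

```python
from itertools import product

def solve(names, statements):
    """Return every assignment (name -> True if knight) consistent with all statements.

    `statements` maps a speaker to a function that takes an assignment and
    returns whether the speaker's claim is true under that assignment.
    (Hypothetical encoding, chosen for this sketch.)
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A knight's claim must be true; a knave's claim must be false.
        if all(claim(world) == world[speaker] for speaker, claim in statements.items()):
            solutions.append(world)
    return solutions

# Hypothetical puzzle: A says "B is a knave"; B says "A and I are both knights".
puzzle = {
    "A": lambda w: not w["B"],
    "B": lambda w: w["A"] and w["B"],
}
print(solve(["A", "B"], puzzle))  # → [{'A': True, 'B': False}]
```

Enumeration costs 2^n checks for n characters, which is fine for small puzzles and shows why each added character increases the number of hypothetical scenarios to reason through.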

Title: P-Tailor: Customizing Personality Traits for Language Models via Mixture of Specialized LoRA Experts

Authors: Yuhao Dan, Jie Zhou, Qin Chen, Junfeng Tian, Liang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12548
Pdf URL: https://arxiv.org/pdf/2406.12548
Copy Paste: [[2406.12548]] P-Tailor: Customizing Personality Traits for Language Models via Mixture of Specialized LoRA Experts(https://arxiv.org/abs/2406.12548)
Keywords: language model, llm, agent
Abstract: Personalized large language models (LLMs) have attracted great attention in many applications, such as intelligent education and emotional support. Most work focuses on controlling the character settings based on the profile (e.g., age, skill, experience, and so on). Conversely, the psychological theory-based personality traits with implicit expression and behavior are not well modeled, limiting their potential application in more specialized fields such as psychological counseling agents. In this paper, we propose a mixture-of-experts (MoE)-based personalized LLM, named P-Tailor, to model the Big Five Personality Traits. Particularly, we learn specialized LoRA experts to represent various traits, such as openness, conscientiousness, extraversion, agreeableness and neuroticism. Then, we integrate P-Tailor with a personality specialization loss, promoting experts to specialize in distinct personality traits, thereby enhancing the efficiency of model parameter utilization. Due to the lack of datasets, we also curate a high-quality personality crafting dataset (PCD) to learn and develop the ability to exhibit different personality traits across various topics. We conduct extensive experiments to verify the great performance and effectiveness of P-Tailor in manipulation of the fine-grained personality traits of LLMs.
摘要：个性化大型语言模型 (LLM) 在智能教育和情感支持等许多应用中引起了极大关注。大多数工作侧重于根据个人资料（例如年龄、技能、经验等）控制角色设置。相反，具有隐性表达和行为的基于心理理论的性格特征建模不佳，限制了它们在心理咨询代理等更专业领域的潜在应用。在本文中，我们提出了一种基于专家混合 (MoE) 的个性化 LLM，称为 P-tailor，以对大五人格特质进行建模。具体来说，我们学习专门的 LoRA 专家来表示各种特征，例如开放性、尽责性、外向性、亲和性和神经质。然后，我们将 P-Tailor 与个性专业化损失相结合，促使专家专注于不同的个性特征，从而提高模型参数利用效率。由于缺乏数据集，我们还策划了一个高质量的个性塑造数据集 (PCD)，以学习和发展在不同主题中展现不同个性特征的能力。我们进行了大量的实验来验证 P-Tailor 在操纵 LLM 细粒度人格特质方面的出色性能和有效性。

Title: MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts

Authors: Dominik Macko, Jakub Kopal, Robert Moro, Ivan Srba
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12549
Pdf URL: https://arxiv.org/pdf/2406.12549
Copy Paste: [[2406.12549]] MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts(https://arxiv.org/abs/2406.12549)
Keywords: llm
Abstract: Recent LLMs are able to generate high-quality multilingual texts, indistinguishable to humans from authentic human-written ones. Research in machine-generated text detection is however mostly focused on the English language and longer texts, such as news articles, scientific papers or student essays. Social-media texts are usually much shorter and often feature informal language, grammatical errors, or distinct linguistic items (e.g., emoticons, hashtags). There is a gap in studying the ability of existing methods in detection of such texts, reflected also in the lack of existing multilingual benchmark datasets. To fill this gap we propose the first multilingual (22 languages) and multi-platform (5 social media platforms) dataset for benchmarking machine-generated text detection in the social-media domain, called MultiSocial. It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual LLMs. We use this benchmark to compare existing detection methods in zero-shot as well as fine-tuned form. Our results indicate that the fine-tuned detectors can be trained on social-media texts without difficulty and that the platform selection for training matters.
摘要：最近的 LLM 能够生成高质量的多语言文本，人类无法区分这些文本与真实的人类书写文本。然而，机器生成文本检测的研究主要集中在英语和较长的文本上，例如新闻文章、科学论文或学生论文。社交媒体文本通常要短得多，并且经常包含非正式语言、语法错误或不同的语言项目（例如表情符号、主题标签）。在研究现有方法检测此类文本的能力方面存在差距，这也反映在缺乏现有的多语言基准数据集上。为了填补这一空白，我们提出了第一个多语言（22 种语言）和多平台（5 个社交媒体平台）数据集，用于对社交媒体领域的机器生成文本检测进行基准测试，称为 MultiSocial。它包含 472,097 篇文本，其中约 58k 篇是人类书写的，7 个多语言 LLM 中的每一个都生成了大约相同数量的内容。我们使用此基准来比较零样本和微调形式的现有检测方法。我们的结果表明，经过微调的检测器在社交媒体文本上进行训练没有任何问题，并且训练平台的选择很重要。

Title: RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation

Authors: Shuting Wang, Xin Xu, Mang Wang, Weipeng Chen, Yutao Zhu, Zhicheng Dou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12566
Pdf URL: https://arxiv.org/pdf/2406.12566
Copy Paste: [[2406.12566]] RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation(https://arxiv.org/abs/2406.12566)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) effectively addresses issues of static knowledge and hallucination in large language models. Existing studies mostly focus on question scenarios with clear user intents and concise answers. However, it is prevalent that users issue broad, open-ended queries with diverse sub-intents, for which they desire rich and long-form answers covering multiple relevant aspects. To tackle this important yet underexplored problem, we propose a novel RAG framework, namely RichRAG. It includes a sub-aspect explorer to identify potential sub-aspects of input questions, a multi-faceted retriever to build a candidate pool of diverse external documents related to these sub-aspects, and a generative list-wise ranker, which is a key module to provide the top-k most valuable documents for the final generator. These ranked documents sufficiently cover various query aspects and are aware of the generator's preferences, hence incentivizing it to produce rich and comprehensive responses for users. The training of our ranker involves a supervised fine-tuning stage to ensure the basic coverage of documents, and a reinforcement learning stage to align downstream LLM's preferences to the ranking of documents. Experimental results on two publicly available datasets prove that our framework effectively and efficiently provides comprehensive and satisfying responses to users.
摘要：检索增强生成 (RAG) 有效地解决了大型语言模型中的静态知识和幻觉问题。现有研究主要关注具有明确用户意图和简洁答案的问题场景。然而，用户普遍会发出具有各种子意图的广泛、开放式查询，他们希望得到涵盖多个相关方面的丰富而长篇的答案。为了解决这个重要但尚未得到充分探索的问题，我们提出了一个新颖的 RAG 框架，即 RichRAG。它包括一个子方面探索器，用于识别输入问题的潜在子方面，一个多方面检索器，用于构建与这些子方面相关的各种外部文档的候选池，以及一个生成列表排序器，它是为最终生成器提供前 k 个最有价值文档的关键模块。这些排名文档充分涵盖了各种查询方面，并且了解生成器的偏好，从而激励它为用户生成丰富而全面的响应。我们的排序器训练涉及监督微调阶段，以确保文档的基本覆盖范围，以及强化学习阶段，以使下游 LLM 的偏好与文档的排名保持一致。在两个公开数据集上的实验结果证明，我们的框架有效且高效地为用户提供了全面且令人满意的响应。

Title: Applying Ensemble Methods to Model-Agnostic Machine-Generated Text Detection

Authors: Ivan Ong, Boon King Quek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12570
Pdf URL: https://arxiv.org/pdf/2406.12570
Copy Paste: [[2406.12570]] Applying Ensemble Methods to Model-Agnostic Machine-Generated Text Detection(https://arxiv.org/abs/2406.12570)
Keywords: language model, gpt, llm
Abstract: In this paper, we study the problem of detecting machine-generated text when the large language model (LLM) it is possibly derived from is unknown. We do so by applying ensembling methods to the outputs from DetectGPT classifiers (Mitchell et al. 2023), a zero-shot model for machine-generated text detection which is highly accurate when the generative (or base) language model is the same as the discriminative (or scoring) language model. We find that simple summary statistics of DetectGPT sub-model outputs yield an AUROC of 0.73 (relative to 0.61) while retaining its zero-shot nature, and that supervised learning methods sharply boost the accuracy to an AUROC of 0.94 but require a training dataset. This suggests the possibility of further generalisation to create a highly-accurate, model-agnostic machine-generated text detector.
摘要：在本文中，我们研究了当机器生成文本可能源自的大型语言模型 (LLM) 未知时检测机器生成文本的问题。我们通过将集成方法应用于 DetectGPT 分类器 (Mitchell 等人，2023) 的输出来实现这一点，DetectGPT 分类器是一种用于机器生成文本检测的零样本模型，当生成（或基础）语言模型与判别（或评分）语言模型相同时，该模型的准确率很高。我们发现，DetectGPT 子模型输出的简单汇总统计数据在保留其零样本特性的同时产生了 0.73 的 AUROC（相对于 0.61），并且监督学习方法将准确率大幅提高到 0.94 的 AUROC，但需要训练数据集。这表明可以进一步推广以创建高度准确、与模型无关的机器生成文本检测器。

Title: Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models

Authors: Eldar Kurtic, Amir Moeini, Dan Alistarh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12572
Pdf URL: https://arxiv.org/pdf/2406.12572
Copy Paste: [[2406.12572]] Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models(https://arxiv.org/abs/2406.12572)
Keywords: language model, llm
Abstract: We introduce Mathador-LM, a new benchmark for evaluating the mathematical reasoning on large language models (LLMs), combining ruleset interpretation, planning, and problem-solving. This benchmark is inspired by the Mathador game, where the objective is to reach a target number using basic arithmetic operations on a given set of base numbers, following a simple set of rules. We show that, across leading LLMs, we obtain stable average performance while generating benchmark instances dynamically, following a target difficulty level. Thus, our benchmark alleviates concerns about test-set leakage into training data, an issue that often undermines popular benchmarks. Additionally, we conduct a comprehensive evaluation of both open and closed-source state-of-the-art LLMs on Mathador-LM. Our findings reveal that contemporary models struggle with Mathador-LM, scoring significantly lower than average 5th graders. This stands in stark contrast to their strong performance on popular mathematical reasoning benchmarks.
摘要：我们引入了 Mathador-LM，这是一种用于评估大型语言模型 (LLM) 上的数学推理的新基准，结合了规则集解释、规划和问题解决。该基准的灵感来自 Mathador 游戏，其目标是按照一组简单的规则，使用给定基数的基本算术运算达到目标数字。我们表明，在领先的 LLM 中，我们获得了稳定的平均性能，同时按照目标难度级别动态生成基准实例。因此，我们的基准缓解了对测试集泄漏到训练数据中的担忧，这一问题通常会破坏流行的基准。此外，我们对 Mathador-LM 上的开源和闭源最新 LLM 进行了全面评估。我们的研究结果表明，当代模型在 Mathador-LM 上表现不佳，得分明显低于五年级学生的平均水平。这与他们在流行的数学推理基准上的出色表现形成了鲜明对比。

Title: Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling

Authors: Yao-Ching Yu, Chun-Chih Kuo, Ziqi Ye, Yu-Cheng Chang, Yueh-Se Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12585
Pdf URL: https://arxiv.org/pdf/2406.12585
Copy Paste: [[2406.12585]] Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling(https://arxiv.org/abs/2406.12585)
Keywords: language model, llm
Abstract: Ensembling multiple models has always been an effective approach to push the limits of existing performance and is widely used in classification tasks by simply averaging the classification probability vectors from multiple classifiers to achieve better accuracy. However, in the thriving open-source Large Language Model (LLM) community, ensembling methods are rare and typically limited to ensembling the full-text outputs of LLMs, such as selecting the best output using a ranker, which leads to underutilization of token-level probability information. In this paper, we treat the Generation of each token by LLMs as a Classification (GaC) for ensembling. This approach fully exploits the probability information at each generation step and better prevents LLMs from producing early incorrect tokens that lead to snowballing errors. In experiments, we ensemble state-of-the-art LLMs on several benchmarks, including exams, mathematics and reasoning, and observe that our method breaks the existing community performance ceiling. Furthermore, we observed that most of the tokens in the answer are simple and do not affect the correctness of the final answer. Therefore, we also experimented with ensembling only key tokens, and the results showed better performance with lower latency across benchmarks.
摘要：集成多个模型一直是突破现有性能极限的有效方法，并广泛用于分类任务，通过简单地平均来自多个分类器的分类概率向量来实现更好的准确性。然而，在蓬勃发展的开源大型语言模型 (LLM) 社区中，集成方法很少见，通常仅限于集成 LLM 的全文输出，例如使用排序器选择最佳输出，这导致 token 级概率信息的利用不足。在本文中，我们将 LLM 的每个 token 的生成视为一个分类 (GaC) 进行集成。这种方法充分利用了每个生成步骤中的概率信息，并更好地防止 LLM 产生导致滚雪球错误的早期错误 token。在实验中，我们在几个基准测试（包括考试、数学和推理）上集成了最先进的 LLM，并观察到我们的方法打破了现有社区的性能上限。此外，我们观察到答案中的大多数 token 都很简单，不会影响最终答案的正确性。因此，我们还尝试仅组合关键标记，结果显示在基准测试中性能更佳，延迟更低。
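
The core idea in the abstract above, treating each generation step as a classification and averaging probability vectors across models, can be sketched in a few lines. This is a simplified illustration and not the authors' implementation: the function name `ensemble_next_token` and the toy distributions are hypothetical, and the sketch assumes all models share one vocabulary, whereas the abstract does not say how differing vocabularies are handled.

```python
def ensemble_next_token(distributions):
    """Average per-model next-token probabilities and return the top token.

    `distributions` is a list of dicts mapping token -> probability, one per
    model. Assumption for this sketch: every model uses the same vocabulary;
    a token missing from a model's dict is treated as probability 0.
    """
    vocab = set().union(*distributions)
    averaged = {
        tok: sum(d.get(tok, 0.0) for d in distributions) / len(distributions)
        for tok in vocab
    }
    best = max(averaged, key=averaged.get)
    return best, averaged

# Toy example: the two models disagree on the most likely next token.
model_a = {"4": 0.5, "5": 0.4, "the": 0.1}
model_b = {"4": 0.3, "5": 0.6, "the": 0.1}
best, averaged = ensemble_next_token([model_a, model_b])
print(best)  # → 5
```

In a full generation loop, the chosen token would be appended to the shared context and the step repeated, so every model continues from the same prefix.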

Title: Low-Redundant Optimization for Large Language Model Alignment

Authors: Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, Jingyuan Wang, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Low-Redundant Optimization for Large Language Model Alignment(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Large language models (LLMs) still struggle to align with human preference in complex tasks and scenarios. They are prone to overfitting to unexpected patterns or superficial styles in the training data. We conduct an empirical study that only selects the top-10% most updated parameters in LLMs for alignment training, and see improvements in the convergence process and final performance. This indicates the existence of redundant neurons in LLMs for alignment training. To reduce their influence, we propose a low-redundant alignment method named ALLO, focusing on optimizing the most related neurons with the most useful supervised signals. Concretely, we first identify the neurons that are related to the human preference data by a gradient-based strategy, then identify the alignment-related key tokens by reward models for computing loss. Besides, we also decompose the alignment process into the forgetting and learning stages, where we first forget the tokens with unaligned knowledge and then learn aligned knowledge, by updating different ratios of neurons, respectively. Experimental results on 10 datasets have shown the effectiveness of ALLO. Our code and data are available at this https URL.
摘要：大型语言模型 (LLM) 在复杂任务和场景中仍然难以与人类偏好保持一致。它们很容易过度拟合训练数据中意想不到的模式或肤浅的风格。我们进行了一项实证研究，仅选择 LLM 中更新最多的前 10% 参数进行对齐训练，并看到收敛过程和最终性能的改善。这表明 LLM 中存在用于对齐训练的冗余神经元。为了减少其影响，我们提出了一种名为 \textbf{ALLO} 的低冗余对齐方法，专注于用最有用的监督信号优化最相关的神经元。具体来说，我们首先通过基于梯度的策略识别与人类偏好数据相关的神经元，然后通过奖励模型识别与对齐相关的关键标记以计算损失。此外，我们还将对齐过程分解为遗忘和学习阶段，其中我们首先通过更新不同比例的神经元来忘记具有未对齐知识的标记，然后学习对齐的知识。在 10 个数据集上的实验结果证明了 ALLO 的有效性。我们的代码和数据可在 \url{此 https URL} 上找到。
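
As a rough illustration of the empirical study mentioned in the abstract (selecting only the top-10% most updated parameters), the sketch below ranks parameters by how much they changed between two snapshots. It is not the ALLO method, which the abstract says uses a gradient-based strategy to identify neurons. Measuring "most updated" by absolute change is an assumption of this example, and `top_updated_indices` and the toy values are hypothetical.

```python
def top_updated_indices(before, after, fraction=0.10):
    """Return sorted indices of the parameters with the largest absolute change.

    Assumption for this sketch: "most updated" means largest |after - before|.
    At least one index is always returned.
    """
    deltas = [abs(a - b) for b, a in zip(before, after)]
    k = max(1, int(len(deltas) * fraction))
    ranked = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    return sorted(ranked[:k])

# Toy example with 10 parameters: only index 1 moved substantially.
before = [0.0] * 10
after = [0.01, 0.50, 0.02, 0.00, 0.01, 0.03, 0.00, 0.02, 0.01, 0.00]
print(top_updated_indices(before, after))  # → [1]
```

The returned indices could then serve as a mask so that later training updates only those parameters and leaves the rest frozen.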

Title: Bridging Local Details and Global Context in Text-Attributed Graphs

Authors: Yaoke Wang, Yun Zhu, Wenqiao Zhang, Yueting Zhuang, Yunfei Li, Siliang Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12608
Pdf URL: https://arxiv.org/pdf/2406.12608
Copy Paste: [[2406.12608]] Bridging Local Details and Global Context in Text-Attributed Graphs(https://arxiv.org/abs/2406.12608)
Keywords: language model
Abstract: Representation learning on text-attributed graphs (TAGs) is vital for real-world applications, as they combine semantic textual and contextual structural information. Research in this field generally consists of two main perspectives, local-level encoding and global-level aggregating, which respectively refer to textual node information unification (e.g., using Language Models) and structure-augmented modeling (e.g., using Graph Neural Networks). Most existing works focus on combining different information levels but overlook the interconnections, i.e., the contextual textual information among nodes, which provides semantic insights to bridge local and global levels. In this paper, we propose GraphBridge, a multi-granularity integration framework that bridges local and global perspectives by leveraging contextual textual information, enhancing fine-grained understanding of TAGs. Besides, to tackle scalability and efficiency challenges, we introduce a graph-aware token reduction module. Extensive experiments across various models and datasets show that our method achieves state-of-the-art performance, while our graph-aware token reduction module significantly enhances efficiency and solves scalability issues.
摘要：文本属性图 (TAG) 上的表示学习对于实际应用至关重要，因为它们结合了语义文本和上下文结构信息。该领域的研究通常包括两个主要视角：局部级编码和全局级聚合，分别指文本节点信息统一（例如，使用语言模型）和结构增强建模（例如，使用图神经网络）。大多数现有工作侧重于结合不同的信息级别，但忽略了互连，即节点之间的上下文文本信息，这为连接局部和全局层面提供了语义见解。在本文中，我们提出了 GraphBridge，这是一个多粒度集成框架，它通过利用上下文文本信息来连接局部和全局视角，增强对 TAG 的细粒度理解。此外，为了应对可扩展性和效率挑战，我们引入了一个图形感知令牌减少模块。在各种模型和数据集上进行的大量实验表明，我们的方法达到了最佳性能，而我们的图形感知标记减少模块显着提高了效率并解决了可扩展性问题。

Title: What makes two models think alike?

Authors: Jeanne Salle, Louis Jalouzot, Nur Lan, Emmanuel Chemla, Yair Lakretz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] What makes two models think alike?(https://arxiv.org/abs/)
Keywords: gpt
Abstract: Do architectural differences significantly affect the way models represent and process language? We propose a new approach, based on metric-learning encoding models (MLEMs), as a first step to answer this question. The approach provides a feature-based comparison of how any two layers of any two models represent linguistic information. We apply the method to BERT, GPT-2 and Mamba. Unlike previous methods, MLEMs offer a transparent comparison, by identifying the specific linguistic features responsible for similarities and differences. More generally, the method uses formal, symbolic descriptions of a domain, and uses these to compare neural representations. As such, the approach can straightforwardly be extended to other domains, such as speech and vision, and to other neural systems, including human brains.
摘要：架构差异是否会显著影响模型表示和处理语言的方式？我们提出了一种基于度量学习编码模型 (MLEM) 的新方法，作为回答这个问题的第一步。该方法提供了基于特征的比较，比较任意两个模型的任意两层如何表示语言信息。我们将该方法应用于 BERT、GPT-2 和 Mamba。与以前的方法不同，MLEM 通过识别导致相似性和差异性的特定语言特征来提供透明的比较。更一般地说，该方法使用领域的形式化、符号化描述，并使用这些描述来比较神经表征。因此，该方法可以直接扩展到其他领域，例如语音和视觉，以及其他神经系统，包括人类大脑。

Title: Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Authors: Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12624
Pdf URL: https://arxiv.org/pdf/2406.12624
Copy Paste: [[2406.12624]] Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges(https://arxiv.org/abs/2406.12624)
Keywords: language model, gpt, llm, prompt
Abstract: Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which we found to have a high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge model's alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have an excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.
摘要：作为评估大型语言模型 (LLM) 的方法，LLM-as-a-judge 范式为解决与人工评估相关的可扩展性挑战提供了一种有希望的解决方案，它正在迅速获得关注。然而，关于这种范式的优点和缺点以及它可能存在哪些潜在偏见，仍有许多悬而未决的问题。在本文中，我们对各种 LLM 作为裁判的表现进行了全面研究。我们利用 TriviaQA 作为评估 LLM 客观知识推理的基准，并与人工注释一起对其进行评估，我们发现人工注释具有很高的注释者间一致性。我们的研究包括 9 个裁判模型和 9 个应试者模型——包括基础模型和指令调整模型。我们评估了裁判模型在不同模型大小、系列和裁判提示中的一致性。除其他结果外，我们的研究重新发现了使用 Cohen 的 kappa 作为一致性指标的重要性，而不是简单的百分比一致性，这表明具有高百分比一致性的裁判仍然可以给出截然不同的分数。我们发现 Llama-3 70B 和 GPT-4 Turbo 都与人类有极好的匹配度，但在对考生进行排名方面，它们的表现不如 JudgeLM-7B 和词汇评判 Contains，后者与人类的匹配度最多低 34 分。通过错误分析和其他各种研究，包括教学长度和宽大偏见的影响，我们希望为未来使用 LLM 作为评判者提供宝贵的经验教训。
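
The abstract's point that high percent agreement can coexist with poor alignment is easy to see in a worked example. The sketch below computes both metrics using the standard definition of Cohen's kappa, (p_o - p_e) / (1 - p_e); the labels and the always-lenient judge are made-up data for illustration, not results from the paper.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement p_e equals 1,
    i.e. when both raters use a single identical label for every item.
    """
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Made-up data: the judge marks every answer "correct", the human does not.
human = ["correct"] * 9 + ["incorrect"]
judge = ["correct"] * 10
print(percent_agreement(human, judge))  # → 0.9
print(cohens_kappa(human, judge))       # → 0.0
```

Here the judge agrees with the human on 90% of items yet has a kappa of zero, because a judge that always answers "correct" agrees that often by chance alone.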

Title: Ask-before-Plan: Proactive Language Agents for Real-World Planning

Authors: Xuan Zhang, Yang Deng, Zifeng Ren, See-Kiong Ng, Tat-Seng Chua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12639
Pdf URL: https://arxiv.org/pdf/2406.12639
Copy Paste: [[2406.12639]] Ask-before-Plan: Proactive Language Agents for Real-World Planning(https://arxiv.org/abs/2406.12639)
Keywords: language model, llm, agent
Abstract: The evolution of large language models (LLMs) has enhanced the planning capabilities of language agents in diverse real-world scenarios. Despite these advancements, the potential of LLM-powered agents to comprehend ambiguous user instructions for reasoning and decision-making is still under exploration. In this work, we introduce a new task, Proactive Agent Planning, which requires language agents to predict clarification needs based on user-agent conversation and agent-environment interaction, invoke external tools to collect valid information, and generate a plan to fulfill the user's demands. To study this practical problem, we establish a new benchmark dataset, Ask-before-Plan. To tackle the deficiency of LLMs in proactive planning, we propose a novel multi-agent framework, Clarification-Execution-Planning (CEP), which consists of three agents specialized in clarification, execution, and planning. We introduce the trajectory tuning scheme for the clarification agent and static execution agent, as well as the memory recollection mechanism for the dynamic execution agent. Extensive evaluations and comprehensive analyses conducted on the Ask-before-Plan dataset validate the effectiveness of our proposed framework.
摘要：大型语言模型 (LLM) 的发展增强了语言代理在各种现实场景中的规划能力。尽管取得了这些进步，但由 LLM 驱动的代理在理解模糊用户指令以进行推理和决策方面的潜力仍在探索中。在这项工作中，我们引入了一项新任务，即主动代理规划，它要求语言代理根据用户代理对话和代理环境交互来预测澄清需求，调用外部工具来收集有效信息，并生成满足用户需求的计划。为了研究这个实际问题，我们建立了一个新的基准数据集 Ask-before-Plan。为了解决 LLM 在主动规划方面的不足，我们提出了一种新颖的多代理框架，即澄清-执行-规划 (\texttt{CEP})，它由三个专门从事澄清、执行和规划的代理组成。我们介绍了澄清代理和静态执行代理的轨迹调整方案，以及动态执行代理的记忆回忆机制。对 Ask-before-Plan 数据集进行的广泛评估和全面分析验证了我们提出的框架的有效性。

Title: DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence?

Authors: Zhouhong Gu, Lin Zhang, Xiaoxuan Zhu, Jiangjie Chen, Wenhao Huang, Yikai Zhang, Shusen Wang, Zheyu Ye, Yan Gao, Hongwei Feng, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12641
Pdf URL: https://arxiv.org/pdf/2406.12641
Copy Paste: [[2406.12641]] DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence?(https://arxiv.org/abs/2406.12641)
Keywords: language model, llm, long context, prompt
Abstract: Detecting evidence within the context is a key step in reasoning tasks. Evaluating and enhancing the capabilities of LLMs in evidence detection will strengthen context-based reasoning performance. This paper proposes a benchmark called DetectBench for verifying the ability to detect and piece together implicit evidence within a long context. DetectBench contains 3,928 multiple-choice questions, with an average of 994 tokens per question. Each question contains an average of 4.55 pieces of implicit evidence, and solving the problem typically requires 7.62 logical jumps to find the correct answer. To enhance the performance of LLMs in evidence detection, this paper proposes Detective Reasoning Prompt and Finetune. Experiments demonstrate that the existing LLMs' abilities to detect evidence in long contexts are far inferior to humans. However, the Detective Reasoning Prompt effectively enhances the capability of powerful LLMs in evidence detection, while the Finetuning method shows significant effects in enhancing the performance of weaker LLMs. Moreover, when the abilities of LLMs in evidence detection are improved, their final reasoning performance is also enhanced accordingly.
摘要：在上下文中检测证据是推理任务过程中的关键步骤，评估和增强LLM在证据检测方面的能力将增强基于上下文的推理性能。本文提出了一个基准测试DetectBench，用于验证在长上下文中检测和拼凑隐性证据的能力。DetectBench包含3928道多项选择题，平均每道题有994个token，每道题平均包含4.55个隐性证据，解决问题通常需要7.62次逻辑跳跃才能找到正确答案。为了提升LLM在证据检测方面的表现，本文提出了Detective Reasoning Prompt和Finetune。实验表明，现有的LLM在长上下文中检测证据的能力远不如人类，但Detective Reasoning Prompt有效提升了功能强大的LLM在证据检测方面的能力，而Finetuning方法对提升功能较弱的LLM的性能有明显的效果。而且，当LLM的证据检测能力提高时，其最终的推理性能也会相应提高。

Title: Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models

Authors: Devichand Budagam, Sankalp KJ, Ashutosh Kumar, Vinija Jain, Aman Chadha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12644
Pdf URL: https://arxiv.org/pdf/2406.12644
Copy Paste: [[2406.12644]] Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models(https://arxiv.org/abs/2406.12644)
Keywords: language model, llm, prompt
Abstract: Assessing the effectiveness of large language models (LLMs) in addressing diverse tasks is essential for comprehending their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across datasets, not considering the varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five unique prompting strategies, arranged from the simplest to the most complex, to assess LLMs more precisely and to offer a clearer perspective. This taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to datasets as well as LLMs based on the rules of the taxonomy, providing a nuanced understanding of their ability to solve diverse tasks and offering a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of appropriate prompting strategies for each task. This study compares manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This paper leads to the development of a universal evaluation metric that can be used to evaluate both the complexity of the datasets and the capabilities of LLMs. The implementation of both manual HPF and adaptive HPF is publicly available.
摘要：评估大型语言模型 (LLM) 在解决各种任务方面的有效性对于了解其优缺点至关重要。传统的评估技术通常会在数据集中统一应用单一的提示策略，而不考虑任务复杂程度的不同。我们引入了分层提示分类法 (HPT)，该分类法采用分层提示框架 (HPF)，由五种独特的提示策略组成，从最简单到最复杂排列，以更准确地评估 LLM 并提供更清晰的视角。该分类法根据分类规则为数据集和 LLM 分配一个分数，称为分层提示分数 (HP-Score)，从而提供对它们解决各种任务的能力的细致了解，并提供任务复杂性的通用衡量标准。此外，我们引入了自适应分层提示框架，它可以自动为每个任务选择合适的提示策略。本研究使用四个指令调整的 LLM（即 Llama 3 8B、Phi 3 3.8B、Mistral 7B 和 Gemma 7B）在四个数据集（BoolQ、CommonSenseQA (CSQA)、IWSLT-2017 en-fr (IWSLT) 和 SamSum）比较了手动和自适应分层提示框架。实验证明了 HPT 的有效性，提供了一种比较不同任务和 LLM 功能的可靠方法。本文开发了一种通用评估指标，可用于评估数据集的复杂性和 LLM 的功能。手动 HPF 和自适应 HPF 的实现都是公开的。

Title: Evaluating Transparency of Machine Generated Fact Checking Explanations

Authors: Rui Xing, Timothy Baldwin, Jey Han Lau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12645
Pdf URL: https://arxiv.org/pdf/2406.12645
Copy Paste: [[2406.12645]] Evaluating Transparency of Machine Generated Fact Checking Explanations(https://arxiv.org/abs/2406.12645)
Keywords: language model
Abstract: An important factor when it comes to generating fact-checking explanations is the selection of evidence: intuitively, high-quality explanations can only be generated given the right evidence. In this work, we investigate the impact of human-curated vs. machine-selected evidence for explanation generation using large language models. To assess the quality of explanations, we focus on transparency (whether an explanation cites sources properly) and utility (whether an explanation is helpful in clarifying a claim). Surprisingly, we found that large language models generate similar or higher quality explanations using machine-selected evidence, suggesting carefully curated evidence (by humans) may not be necessary. That said, even with the best model, the generated explanations are not always faithful to the sources, suggesting further room for improvement in explanation generation for fact-checking.
摘要：在生成事实核查解释时，一个重要因素是证据的选择：直观地说，只有在有正确证据的情况下才能生成高质量的解释。在这项工作中，我们研究了人工挑选的证据与机器选择的证据对使用大型语言模型生成解释的影响。为了评估解释的质量，我们关注透明度（解释是否正确引用来源）和实用性（解释是否有助于澄清主张）。令人惊讶的是，我们发现大型语言模型使用机器选择的证据可以生成类似或更高质量的解释，这表明精心挑选的证据（由人类）可能不是必要的。话虽如此，即使使用最好的模型，生成的解释也并不总是忠实于来源，这表明在事实核查的解释生成方面还有进一步改进的空间。

Title: CollabStory: Multi-LLM Collaborative Story Generation and Authorship Analysis

Authors: Saranya Venkatraman, Nafis Irtiza Tripto, Dongwon Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12665
Pdf URL: https://arxiv.org/pdf/2406.12665
Copy Paste: [[2406.12665]] CollabStory: Multi-LLM Collaborative Story Generation and Authorship Analysis(https://arxiv.org/abs/2406.12665)
Keywords: language model, llm
Abstract: The rise of unifying frameworks that enable seamless interoperability of Large Language Models (LLMs) has made LLM-LLM collaboration for open-ended tasks a possibility. Despite this, there have not been efforts to explore such collaborative writing. We take the next step beyond human-LLM collaboration to explore this multi-LLM scenario by generating the first exclusively LLM-generated collaborative stories dataset called CollabStory. We focus on single-author (N=1) to multi-author (up to N=5) scenarios, where multiple LLMs co-author stories. We generate over 32k stories using open-source instruction-tuned LLMs. Further, we take inspiration from the PAN tasks that have set the standard for human-human multi-author writing tasks and analysis. We extend their authorship-related tasks for multi-LLM settings and present baselines for LLM-LLM collaboration. We find that current baselines are not able to handle this emerging scenario. Thus, CollabStory is a resource that could help propel an understanding as well as the development of techniques to discern the use of multiple LLMs. This is crucial to study in the context of writing tasks since LLM-LLM collaboration could potentially overwhelm ongoing challenges related to plagiarism detection, credit assignment, maintaining academic integrity in educational settings, and addressing copyright infringement concerns. We make our dataset and code available.
摘要：统一框架的兴起使得大型语言模型 (LLM) 之间的无缝互操作成为可能，这使得 LLM-LLM 协作可以用于开放式任务。尽管如此，人们还没有努力去探索这种协作写作。我们迈出了人类与 LLM 协作的下一步，通过生成第一个专门由 LLM 生成的协作故事数据集 CollabStory 来探索这种多 LLM 场景。我们专注于单作者 ($N=1$) 到多作者（最多 $N=5$）场景，其中多个 LLM 共同创作故事。我们使用开源指令调整的 LLM 生成超过 32k 个故事。此外，我们从 PAN 任务中汲取灵感，这些任务为人类与人类的多作者写作任务和分析设定了标准。我们将其与作者相关的任务扩展到多 LLM 设置，并提出了 LLM-LLM 协作的基线。我们发现当前的基线无法处理这种新兴场景。因此，CollabStory 是一种资源，可以帮助推动理解以及开发辨别多个 LLM 使用的技术。这对于写作任务的研究至关重要，因为 LLM-LLM 合作可能会压倒与抄袭检测、学分分配、在教育环境中维护学术诚信以及解决版权侵权问题相关的持续挑战。我们在 \texttt{\url{此 https URL}} 上提供我们的数据集和代码。

Title: Estimating Knowledge in Large Language Models Without Generating a Single Token

Authors: Daniela Gottesman, Mor Geva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12673
Pdf URL: https://arxiv.org/pdf/2406.12673
Copy Paste: [[2406.12673]] Estimating Knowledge in Large Language Models Without Generating a Single Token(https://arxiv.org/abs/2406.12673)
Keywords: language model, llm
Abstract: To evaluate knowledge in large language models (LLMs), current methods query the model and then evaluate its generated responses. In this work, we ask whether evaluation can be done before the model has generated any text. Concretely, is it possible to estimate how knowledgeable a model is about a certain entity, only from its internal computation? We study this question with two tasks: given a subject entity, the goal is to predict (a) the ability of the model to answer common questions about the entity, and (b) the factuality of responses generated by the model about the entity. Experiments with a variety of LLMs show that KEEN, a simple probe trained over internal subject representations, succeeds at both tasks - strongly correlating with both the QA accuracy of the model per-subject and FActScore, a recent factuality metric in open-ended generation. Moreover, KEEN naturally aligns with the model's hedging behavior and faithfully reflects changes in the model's knowledge after fine-tuning. Lastly, we show a more interpretable yet equally performant variant of KEEN, which highlights a small set of tokens that correlates with the model's lack of knowledge. Being simple and lightweight, KEEN can be leveraged to identify gaps and clusters of entity knowledge in LLMs, and guide decisions such as augmenting queries with retrieval.
摘要：为了评估大型语言模型 (LLM) 中的知识，当前的方法会查询模型，然后评估其生成的响应。在这项工作中，我们询问是否可以在模型生成任何文本之前进行评估。具体来说，是否可以仅从内部计算来估计模型对某个实体的了解程度？我们通过两个任务研究这个问题：给定一个主题实体，目标是预测 (a) 模型回答有关该实体的常见问题的能力，以及 (b) 模型生成的有关该实体的响应的真实性。使用各种 LLM 进行的实验表明，KEEN（一种针对内部主题表示进行训练的简单探测器）在两个任务中都取得了成功 - 与模型每个主题的 QA 准确性和 FActScore（开放式生成中的最新事实性指标）密切相关。此外，KEEN 自然地与模型的对冲行为保持一致，并忠实地反映了微调后模型知识的变化。最后，我们展示了 KEEN 的一个更易于解释但性能同样出色的变体，它突出显示了与模型缺乏知识相关的一小部分标记。KEEN 简单而轻量，可用于识别 LLM 中的实体知识差距和集群，并指导决策，例如通过检索增强查询。

Title: Vernacular? I Barely Know Her: Challenges with Style Control and Stereotyping

Authors: Ankit Aich, Tingting Liu, Salvatore Giorgi, Kelsey Isman, Lyle Ungar, Brenda Curtis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12679
Pdf URL: https://arxiv.org/pdf/2406.12679
Copy Paste: [[2406.12679]] Vernacular? I Barely Know Her: Challenges with Style Control and Stereotyping(https://arxiv.org/abs/2406.12679)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are increasingly being used in educational and learning applications. Research has demonstrated that controlling for style, to fit the needs of the learner, fosters increased understanding, promotes inclusion, and helps with knowledge distillation. To understand the capabilities and limitations of contemporary LLMs in style control, we evaluated five state-of-the-art models: GPT-3.5, GPT-4, GPT-4o, Llama-3, and Mistral-instruct-7B across two style control tasks. We observed significant inconsistencies in the first task, with model performances averaging between 5th and 8th grade reading levels for tasks intended for first-graders, and standard deviations up to 27.6. For our second task, we observed a statistically significant improvement in performance from 0.02 to 0.26. However, we find that even without stereotypes in reference texts, LLMs often generated culturally insensitive content during their tasks. We provide a thorough analysis and discussion of the results.
摘要：大型语言模型 (LLM) 在教育和学习应用中的使用越来越多。研究表明，控制风格以适应学习者的需求，可以促进理解、促进包容并有助于知识提炼。为了了解当代 LLM 在风格控制方面的能力和局限性，我们在两个风格控制任务中评估了五个最先进的模型：GPT-3.5、GPT-4、GPT-4o、Llama-3 和 Mistral-instruct- 7B。我们在第一项任务中观察到了显著的不一致，对于针对一年级学生的任务，模型性能平均介于 5 年级和 8 年级的阅读水平之间，标准差高达 27.6。对于我们的第二项任务，我们观察到性能从 0.02 到 0.26 的统计显着提高。然而，我们发现即使参考文本中没有刻板印象，LLM 在其任务中也经常生成文化不敏感的内容。我们对结果进行了彻底的分析和讨论。

Title: Measuring Psychological Depth in Language Models

Authors: Fabrice Harel-Canada, Hanyu Zhou, Sreya Mupalla, Zeynep Yildiz, Amit Sahai, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12680
Pdf URL: https://arxiv.org/pdf/2406.12680
Copy Paste: [[2406.12680]] Measuring Psychological Depth in Language Models(https://arxiv.org/abs/2406.12680)
Keywords: language model, gpt, llm, prompt
Abstract: Evaluations of creative stories generated by large language models (LLMs) often focus on objective properties of the text, such as its style, coherence, and toxicity. While these metrics are indispensable, they do not speak to a story's subjective, psychological impact from a reader's perspective. We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM's ability to produce authentic and narratively complex stories that provoke emotion, empathy, and engagement. We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff's alpha). We also explore techniques for automating the PDS to easily scale future analyses. GPT-4o, combined with a novel Mixture-of-Personas (MoP) prompting strategy, achieves an average Spearman correlation of 0.51 with human judgment while Llama-3-70B scores as high as 0.68 for empathy. Finally, we compared the depth of stories authored by both humans and LLMs. Surprisingly, GPT-4 stories either surpassed or were statistically indistinguishable from highly-rated human-written stories sourced from Reddit. By shifting the focus from text to reader, the Psychological Depth Scale is a validated, automated, and systematic means of measuring the capacity of LLMs to connect with humans through the stories they tell.
摘要：对大型语言模型 (LLM) 生成的创意故事的评估通常侧重于文本的客观属性，例如其风格、连贯性和毒性。虽然这些指标不可或缺，但它们并不能从读者的角度反映故事的主观心理影响。我们引入了心理深度量表 (PDS)，这是一个植根于文学理论的新框架，用于衡量 LLM 创作真实且叙事复杂的故事的能力，这些故事可以激发情感、同理心和参与度。我们通过证明人类可以根据 PDS (0.72 Krippendorff 的 alpha) 一致地评估故事来实证验证我们的框架。我们还探索了自动化 PDS 的技术，以便轻松扩展未来的分析。GPT-4o 结合新颖的混合角色 (MoP) 提示策略，与人类判断的平均 Spearman 相关性达到 $0.51$，而 Llama-3-70B 的同理心得分高达 0.68。最后，我们比较了人类和 LLM 撰写的故事的深度。令人惊讶的是，GPT-4 故事要么超越了 Reddit 上评分很高的人类撰写的故事，要么在统计上与它们没有区别。通过将焦点从文本转移到读者，心理深度量表是一种经过验证、自动化和系统化的方法，可以衡量 LLM 通过他们讲述的故事与人类建立联系的能力。

Title: Using LLMs to Aid Annotation and Collection of Clinically-Enriched Data in Bipolar Disorder and Schizophrenia

Authors: Ankit Aich, Avery Quynh, Pamela Osseyi, Amy Pinkham, Philip Harvey, Brenda Curtis, Colin Depp, Natalie Parde
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12687
Pdf URL: https://arxiv.org/pdf/2406.12687
Copy Paste: [[2406.12687]] Using LLMs to Aid Annotation and Collection of Clinically-Enriched Data in Bipolar Disorder and Schizophrenia(https://arxiv.org/abs/2406.12687)
Keywords: language model, llm
Abstract: NLP in mental health has been primarily social media focused. Real world practitioners also have high case loads and often domain-specific variables, of which modern LLMs lack context. We take a dataset made by recruiting 644 participants, including individuals diagnosed with Bipolar Disorder (BD), Schizophrenia (SZ), and Healthy Controls (HC). Participants undertook tasks derived from a standardized mental health instrument, and the resulting data were transcribed and annotated by experts across five clinical variables. This paper demonstrates the application of contemporary language models in sequence-to-sequence tasks to enhance mental health research. Specifically, we illustrate how these models can facilitate the deployment of mental health instruments, data collection, and data annotation with high accuracy and scalability. We show that small models are capable of annotation for domain-specific clinical variables and data collection for mental-health instruments, and that they perform better than commercial large models.
摘要：心理健康领域的 NLP 主要关注社交媒体。现实世界的从业者也有大量的案例，并且通常具有特定领域的变量，而现代 LLM 缺乏这些变量的背景。我们采用了一个由 644 名参与者组成的数据集，其中包括被诊断为双相情感障碍 (BD)、精神分裂症 (SZ) 和健康对照 (HC) 的个人。参与者执行了来自标准化心理健康工具的任务，专家将结果数据转录并注释到五个临床变量。本文展示了当代语言模型在序列到序列任务中的应用，以加强心理健康研究。具体来说，我们说明了这些模型如何以高精度和可扩展性促进心理健康工具的部署、数据收集和数据注释。我们表明，小型模型能够注释特定领域的临床变量、心理健康工具的数据收集，并且比商用大型模型表现更好。

Title: MAGIC: Generating Self-Correction Guideline for In-Context Text-to-SQL

Authors: Arian Askari, Christian Poelitz, Xinye Tang
Subjects: cs.CL, cs.AI, cs.DB, cs.HC
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MAGIC: Generating Self-Correction Guideline for In-Context Text-to-SQL(https://arxiv.org/abs/)
Keywords: language model, llm, prompt, agent
Abstract: Self-correction in text-to-SQL is the process of prompting a large language model (LLM) to revise its previously incorrectly generated SQL, and commonly relies on manually crafted self-correction guidelines by human experts that are not only labor-intensive to produce but also limited by the human ability in identifying all potential error patterns in LLM responses. We introduce MAGIC, a novel multi-agent method that automates the creation of the self-correction guideline. MAGIC uses three specialized agents: a manager, a correction, and a feedback agent. These agents collaborate on the failures of an LLM-based method on the training set to iteratively generate and refine a self-correction guideline tailored to LLM mistakes, mirroring human processes but without human involvement. Our extensive experiments show that MAGIC's guideline outperforms those created by human experts. We empirically find that the guideline produced by MAGIC enhances the interpretability of the corrections made, providing insights in analyzing the reason behind the failures and successes of LLMs in self-correction. We make all agent interactions publicly available to the research community, to foster further research in this area, offering a synthetic dataset for future explorations into automatic self-correction guideline generation.
摘要：文本到 SQL 中的自我纠正是促使大型语言模型 (LLM) 修改其先前错误生成的 SQL 的过程，并且通常依赖于人类专家手动制定的自我纠正指南，这些指南不仅需要大量劳动力，而且还受到人类识别 LLM 响应中所有潜在错误模式的能力的限制。我们引入了 MAGIC，这是一种新颖的多智能体方法，可以自动创建自我纠正指南。MAGIC 使用三个专门的智能体：管理器、纠正和反馈智能体。这些智能体协作处理基于 LLM 的方法在训练集上的失败，以迭代方式生成和完善针对 LLM 错误的自我纠正指南，反映人类过程，但无需人类参与。我们广泛的实验表明，MAGIC 的指南优于专家创建的指南。我们通过经验发现，MAGIC 生成的指南增强了所做纠正的可解释性，为分析 LLM 在自我纠正中失败和成功的原因提供了见解。我们将所有代理交互公开给研究界，以促进该领域的进一步研究，并为未来探索自动自我纠正指南生成提供合成数据集。

Title: Jailbreak Paradox: The Achilles' Heel of LLMs

Authors: Abhinav Rao, Monojit Choudhury, Somak Aditya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Jailbreak Paradox: The Achilles' Heel of LLMs(https://arxiv.org/abs/)
Keywords: gpt, llm
Abstract: We introduce two paradoxes concerning jailbreak of foundation models: First, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a pareto-dominant sense) model is jailbroken or not. We provide formal proofs for these paradoxes and a short case study on Llama and GPT4-o to demonstrate this. We discuss broader theoretical and practical repercussions of these results.
摘要：我们介绍了有关基础模型越狱的两个悖论：首先，构建完美的越狱分类器是不可能的，其次，较弱的模型无法一致地检测出较强的（在帕累托占优意义上）模型是否越狱。我们为这些悖论提供了正式证明，并对 Llama 和 GPT4-o 进行了简短的案例研究以证明这一点。我们讨论了这些结果的更广泛的理论和实践影响。

Title: Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction

Authors: Haoqiu Yan, Yongxin Zhu, Kai Zheng, Bing Liu, Haoyu Cao, Deqiang Jiang, Linli Xu
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.12707
Pdf URL: https://arxiv.org/pdf/2406.12707
Copy Paste: [[2406.12707]] Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction(https://arxiv.org/abs/2406.12707)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM)-enhanced agents are becoming increasingly prevalent in Human-AI communication, offering vast potential from entertainment to professional domains. However, current multi-modal dialogue systems overlook the acoustic information present in speech, which is crucial for understanding human communication nuances. This oversight can lead to misinterpretations of speakers' intentions, resulting in inconsistent or even contradictory responses within dialogues. To bridge this gap, in this paper, we propose PerceptiveAgent, an empathetic multi-modal dialogue system designed to discern deeper or more subtle meanings beyond the literal interpretations of words through the integration of speech modality perception. Employing LLMs as a cognitive core, PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language. Experimental results indicate that PerceptiveAgent excels in contextual understanding by accurately discerning the speakers' true intentions in scenarios where the linguistic meaning is either contrary to or inconsistent with the speaker's true feelings, producing more nuanced and expressive spoken dialogues. Code is publicly available.
摘要：大型语言模型 (LLM) 增强型代理在人机通信中越来越普遍，从娱乐到专业领域都具有巨大的潜力。然而，当前的多模态对话系统忽略了语音中存在的声学信息，而这些信息对于理解人类交流的细微差别至关重要。这种疏忽可能会导致对说话者意图的误解，从而导致对话中的回答不一致甚至相互矛盾。为了弥补这一差距，在本文中，我们提出了 PerceptiveAgent，这是一种富有同理心的多模态对话系统，旨在通过整合语音模态感知来辨别超越字面解释的更深层或更微妙的含义。PerceptiveAgent 使用 LLM 作为认知核心，可以感知输入语音中的声学信息，并根据自然语言描述的说话风格生成富有同理心的回答。实验结果表明，PerceptiveAgent 在语境理解方面表现出色，能够在语言含义与说话者的真实感受相反或不一致的情况下准确辨别说话者的真实意图，从而产生更细致入微、更富有表现力的口头对话。代码公开发布在：\url{此 https URL}。

Title: AgentReview: Exploring Peer Review Dynamics with LLM Agents

Authors: Yiqiao Jin, Qinlin Zhao, Yiyang Wang, Hao Chen, Kaijie Zhu, Yijia Xiao, Jindong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12708
Pdf URL: https://arxiv.org/pdf/2406.12708
Copy Paste: [[2406.12708]] AgentReview: Exploring Peer Review Dynamics with LLM Agents(https://arxiv.org/abs/2406.12708)
Keywords: language model, llm, agent
Abstract: Peer review is fundamental to the integrity and advancement of scientific publication. Traditional methods of peer review analyses often rely on exploration and statistics of existing peer review data, which do not adequately address the multivariate nature of the process or account for the latent variables, and are further constrained by privacy concerns due to the sensitive nature of the data. We introduce AgentReview, the first large language model (LLM) based peer review simulation framework, which effectively disentangles the impacts of multiple latent factors and addresses the privacy issue. Our study reveals significant insights, including a notable 37.1% variation in paper decisions due to reviewers' biases, supported by sociological theories such as the social influence theory, altruism fatigue, and authority bias. We believe that this study could offer valuable insights to improve the design of peer review mechanisms.
摘要：同行评审是科学出版的完整性和进步的基础。传统的同行评审分析方法通常依赖于对现有同行评审数据的探索和统计，这些方法不能充分解决过程的多变量性质，不能解释潜在变量，而且由于数据的敏感性，隐私问题会进一步限制这些方法的使用。我们推出了 AgentReview，这是第一个基于大型语言模型 (LLM) 的同行评审模拟框架，它有效地解开了多个潜在因素的影响并解决了隐私问题。我们的研究揭示了重要的见解，包括由于审稿人的偏见导致论文决策出现 37.1% 的显著差异，这得到了社会学理论（如社会影响理论、利他主义疲劳和权威偏见）的支持。我们相信这项研究可以提供有价值的见解来改进同行评审机制的设计。

Title: On the Robustness of Language Models for Tabular Question Answering

Authors: Kushal Raj Bhandari, Sixue Xing, Soham Dan, Jianxi Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12719
Pdf URL: https://arxiv.org/pdf/2406.12719
Copy Paste: [[2406.12719]] On the Robustness of Language Models for Tabular Question Answering(https://arxiv.org/abs/2406.12719)
Keywords: language model, llm
Abstract: Large Language Models (LLMs), originally shown to ace various text comprehension tasks, have also remarkably been shown to tackle table comprehension tasks without specific training. While previous research has explored LLM capabilities with tabular dataset tasks, our study assesses the influence of in-context learning, model scale, instruction tuning, and domain biases on Tabular Question Answering (TQA). We evaluate the robustness of LLMs on Wikipedia-based WTQ and financial report-based TAT-QA TQA datasets, focusing on their ability to robustly interpret tabular data under various augmentations and perturbations. Our findings indicate that instructions significantly enhance performance, with recent models like Llama3 exhibiting greater robustness over earlier versions. However, data contamination and practical reliability issues persist, especially with WTQ. We highlight the need for improved methodologies, including structure-aware self-attention mechanisms and better handling of domain-specific tabular data, to develop more reliable LLMs for table comprehension.
摘要：大型语言模型 (LLM) 最初被证明能够出色地完成各种文本理解任务，现在也表现出无需特定训练即可完成表格理解任务的出色能力。虽然先前的研究已经探索了 LLM 在表格数据集任务中的能力，但我们的研究评估了上下文学习、模型规模、指令调整和领域偏差对表格问答 (TQA) 的影响。我们评估了基于维基百科的 WTQ 和基于财务报告的 TAT-QA TQA 数据集上 LLM 的稳健性，重点关注它们在各种增强和扰动下稳健地解释表格数据的能力。我们的研究结果表明，指令显著提高了性能，最近的模型（如 Llama3）比早期版本表现出更高的稳健性。然而，数据污染和实际可靠性问题仍然存在，尤其是 WTQ。我们强调需要改进方法，包括结构感知的自注意力机制和更好地处理特定领域的表格数据，以开发更可靠的表格理解 LLM。

Title: Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction

Authors: Atharva Naik, Kexun Zhang, Nathaniel Robinson, Aravind Mysore, Clayton Marr, Hong Sng Rebecca Byrnes, Anna Cai, Kalvin Chang, David Mortensen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Historical linguists have long written a kind of incompletely formalized "program" that converts reconstructed words in an ancestor language into words in one of its attested descendants and that consists of a series of ordered string rewrite functions (called sound laws). They do this by observing pairs of words in the reconstructed language (protoforms) and the descendant language (reflexes) and constructing a program that transforms protoforms into reflexes. However, writing these programs is error-prone and time-consuming. Prior work has successfully scaffolded this process computationally, but fewer researchers have tackled Sound Law Induction (SLI), which we approach in this paper by casting it as Programming by Examples. We propose a language-agnostic solution that utilizes the programming ability of Large Language Models (LLMs) by generating Python sound law programs from sound change examples. We evaluate the effectiveness of our approach for various LLMs, propose effective methods to generate additional language-agnostic synthetic data to fine-tune LLMs for SLI, and compare our method with existing automated SLI methods showing that while LLMs lag behind them they can complement some of their weaknesses.
摘要：历史语言学家长期以来一直编写一种不完全形式化的“程序”，将祖先语言中重构的单词转换为其已证实的后代语言中的单词，这些后代语言由一系列有序的字符串重写函数（称为声音定律）组成。他们通过观察重构语言（原始形式）和后代语言（反射）中的单词对并构建将原始形式转换为反射的程序来实现这一点。但是，编写这些程序容易出错且耗时。先前的工作已成功地在计算上构建了这一过程，但很少有研究人员处理声音定律归纳 (SLI)，我们在本文中将其视为示例编程。我们提出了一种与语言无关的解决方案，该解决方案利用大型语言模型 (LLM) 的编程能力，通过从声音变化示例中生成 Python 声音定律程序。我们评估了我们的方法对各种 LLM 的有效性，提出了有效的方法来生成额外的与语言无关的合成数据来微调 SLI 的 LLM，并将我们的方法与现有的自动化 SLI 方法进行了比较，结果表明虽然 LLM 落后于它们，但可以弥补它们的一些弱点。

Title: Large Language Model as a Universal Clinical Multi-task Decoder

Authors: Yujiang Wu, Hongjian Song, Jiawen Zhang, Xumeng Wen, Shun Zheng, Jiang Bian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12738
Pdf URL: https://arxiv.org/pdf/2406.12738
Copy Paste: [[2406.12738]] Large Language Model as a Universal Clinical Multi-task Decoder(https://arxiv.org/abs/2406.12738)
Keywords: language model
Abstract: The development of effective machine learning methodologies for enhancing the efficiency and accuracy of clinical systems is crucial. Despite significant research efforts, managing a plethora of diversified clinical tasks and adapting to emerging new tasks remain significant challenges. This paper presents a novel paradigm that employs a pre-trained large language model as a universal clinical multi-task decoder. This approach leverages the flexibility and diversity of language expressions to handle task topic variations and associated arguments. The introduction of a new task simply requires the addition of a new instruction template. We validate this framework across hundreds of tasks, demonstrating its robustness in facilitating multi-task predictions, performing on par with traditional multi-task learning and single-task learning approaches. Moreover, it shows exceptional adaptability to new tasks, with impressive zero-shot performance in some instances and superior data efficiency in few-shot scenarios. This novel approach offers a unified solution to manage a wide array of new and emerging tasks in clinical applications.
摘要：开发有效的机器学习方法以提高临床系统的效率和准确性至关重要。尽管进行了大量研究，但管理大量多样化的临床任务并适应新出现的任务仍然是重大挑战。本文提出了一种新范式，该范式采用预先训练的大型语言模型作为通用临床多任务解码器。这种方法利用语言表达的灵活性和多样性来处理任务主题变化和相关参数。引入新任务只需添加新的指令模板。我们在数百个任务中验证了该框架，证明了其在促进多任务预测方面的稳健性，其性能与传统的多任务学习和单任务学习方法相当。此外，它表现出对新任务的出色适应性，在某些情况下具有令人印象深刻的零样本性能，在少数样本场景中具有出色的数据效率。这种新方法提供了一种统一的解决方案来管理临床应用中各种新兴任务。

Title: Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages

Authors: Fabian David Schmidt, Philipp Borchert, Ivan Vulić, Goran Glavaš
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12739
Pdf URL: https://arxiv.org/pdf/2406.12739
Copy Paste: [[2406.12739]] Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages(https://arxiv.org/abs/2406.12739)
Keywords: language model, llm
Abstract: LLMs have become a go-to solution not just for text generation, but also for natural language understanding (NLU) tasks. Acquiring extensive knowledge through language modeling on web-scale corpora, they excel on English NLU, yet struggle to extend their NLU capabilities to underrepresented languages. In contrast, machine translation models (MT) produce excellent multilingual representations, resulting in strong translation performance even for low-resource languages. MT encoders, however, lack the knowledge necessary for comprehensive NLU that LLMs obtain through language modeling training on immense corpora. In this work, we get the best of both worlds by integrating MT encoders directly into LLM backbones via sample-efficient self-distillation. The resulting MT-LLMs preserve the inherent multilingual representational alignment from the MT encoder, allowing lower-resource languages to tap into the rich knowledge embedded in English-centric LLMs. Merging the MT encoder and LLM in a single model, we mitigate the propagation of translation errors and inference overhead of MT decoding inherent to discrete translation-based cross-lingual transfer (e.g., translate-test). Evaluation spanning three prominent NLU tasks and 127 predominantly low-resource languages renders MT-LLMs highly effective in cross-lingual transfer. MT-LLMs substantially and consistently outperform translate-test based on the same MT model, showing that we truly unlock multilingual language understanding for LLMs.
摘要：LLM 不仅成为文本生成的首选解决方案，而且成为自然语言理解 (NLU) 任务的首选解决方案。通过对网络规模语料库进行语言建模，它们获得了广泛的知识，在英语 NLU 方面表现出色，但仍难以将其 NLU 功能扩展到代表性不足的语言。相比之下，机器翻译模型 (MT) 可以生成出色的多语言表示，即使对于资源匮乏的语言也能实现出色的翻译性能。然而，MT 编码器缺乏 LLM 通过对庞大语料库进行语言建模训练获得的全面 NLU 所需的知识。在这项工作中，我们通过样本高效的自我提炼将 MT 编码器直接集成到 LLM 主干中，从而获得了两全其美的结果。由此产生的 MT-LLM 保留了 MT 编码器固有的多语言表示对齐，使资源较少的语言能够利用以英语为中心的 LLM 中嵌入的丰富知识。将 MT 编码器和 LLM 合并到一个模型中，我们可以减轻基于离散翻译的跨语言传输（例如，翻译测试）固有的翻译错误传播和 MT 解码的推理开销。对三个主要 NLU 任务和 127 种资源匮乏的语言进行的评估表明，MT-LLM 在跨语言传输方面非常有效。基于相同的 MT 模型，MT-LLM 的表现显著且始终优于翻译测试，表明我们真正为 LLM 解锁了多语言语言理解。

Title: Rationale-based Ensemble of Multiple QA Strategies for Zero-shot Knowledge-based VQA

Authors: Miaoyu Li, Haoxin Li, Zilin Du, Boyang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12746
Pdf URL: https://arxiv.org/pdf/2406.12746
Copy Paste: [[2406.12746]] Rationale-based Ensemble of Multiple QA Strategies for Zero-shot Knowledge-based VQA(https://arxiv.org/abs/2406.12746)
Keywords: llm
Abstract: Knowledge-based Visual Question-answering (K-VQA) necessitates the use of background knowledge beyond what is depicted in the image. Current zero-shot K-VQA methods usually translate an image to a single type of textual decision context and use a text-based model to answer the question based on it, which conflicts with the fact that K-VQA questions often require the combination of multiple question-answering strategies. In light of this, we propose Rationale-based Ensemble of Answer Context Tactics (REACT) to achieve a dynamic ensemble of multiple question-answering tactics, comprising Answer Candidate Generation (ACG) and Rationale-based Strategy Fusion (RSF). In ACG, we generate three distinctive decision contexts to provide different strategies for each question, resulting in the generation of three answer candidates. RSF generates automatic and mechanistic rationales from decision contexts for each candidate, allowing the model to select the correct answer from all candidates. We conduct comprehensive experiments on the OK-VQA and A-OKVQA datasets, and our method significantly outperforms state-of-the-art LLM-based baselines on all datasets.
摘要：基于知识的视觉问答（K-VQA）需要使用图像所描绘内容以外的背景知识。当前的零样本 K-VQA 方法通常将图像转换为单一类型的文本决策环境，并使用基于文本的模型基于该环境回答问题，这与 K-VQA 问题通常需要结合多种问答策略的事实相冲突。鉴于此，我们提出了基于原理的答案环境策略集成（REACT），以实现多种问答策略的动态集成，包括答案候选生成（ACG）和基于原理的策略融合（RSF）。在 ACG 中，我们生成三个不同的决策环境来为每个问题提供不同的策略，从而生成三个答案候选。RSF 从每个候选的决策环境生成自动和机械的原理，允许模型从所有候选中选择正确答案。我们对 OK-VQA 和 A-OKVQA 数据集进行了全面的实验，我们的方法在所有数据集上都显著优于最先进的基于 LLM 的基线。

Title: OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

Authors: Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12753
Pdf URL: https://arxiv.org/pdf/2406.12753
Copy Paste: [[2406.12753]] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI(https://arxiv.org/abs/2406.12753)
Keywords: language model, gpt, llm
Abstract: The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.
摘要：大型语言模型 (LLM) 和大型多模态模型 (LMM) 的进步显著加速了人工智能 (AI) 的发展，逐渐展示了曾经专属于人类智力的解决问题和科学发现 (即 AI4Science) 的潜在认知推理能力。为了全面评估当前模型在认知推理能力方面的表现，我们推出了 OlympicArena，它包含 11,163 个双语问题，涵盖纯文本和交错文本图像模态。这些挑战涵盖了七个领域和 62 场国际奥运会比赛的广泛学科，并经过严格数据泄露检查。我们认为，奥运会比赛问题中的挑战非常适合评估 AI 的认知推理，因为它们具有复杂性和跨学科性，这对于应对复杂的科学挑战和促进发现至关重要。除了使用仅答案标准评估各个学科的表现之外，我们还从多个角度进行了详细的实验和分析。我们深入研究了模型的认知推理能力、它们在不同模式下的表现以及它们在流程级评估中的结果，这对于需要复杂推理和冗长解决方案的任务至关重要。我们广泛的评估表明，即使是像 GPT-4o 这样的高级模型也只能达到 39.97% 的整体准确率，这说明了当前人工智能在复杂推理和多模式集成方面的局限性。通过 OlympicArena，我们旨在推动人工智能向超级智能迈进，使其能够应对科学和其他领域更复杂的挑战。我们还提供了一套全面的资源来支持人工智能研究，包括基准数据集、开源注释平台、详细的评估工具和具有自动提交功能的排行榜。

Title: Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba

Authors: Ruiqi He, Yushu He, Longju Bai, Jiarui Liu, Zhenjie Sun, Zenghao Tang, He Wang, Hanchen Xia, Naihao Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12754
Pdf URL: https://arxiv.org/pdf/2406.12754
Copy Paste: [[2406.12754]] Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba(https://arxiv.org/abs/2406.12754)
Keywords: gpt, llm
Abstract: Existing humor datasets and evaluations predominantly focus on English, lacking resources for culturally nuanced humor in non-English languages like Chinese. To address this gap, we construct Chumor, a dataset sourced from Ruo Zhi Ba (RZB), a Chinese Reddit-like platform dedicated to sharing intellectually challenging and culturally specific jokes. We annotate explanations for each joke and evaluate human explanations against two state-of-the-art LLMs, GPT-4o and ERNIE Bot, through A/B testing by native Chinese speakers. Our evaluation shows that Chumor is challenging even for SOTA LLMs, and the human explanations for Chumor jokes are significantly better than explanations generated by the LLMs.
摘要：现有的幽默数据集和评估主要集中在英语上，缺乏针对中文等非英语语言的文化差异性幽默的资源。为了弥补这一差距，我们构建了 Chumor，这是一个来自若指吧 (RZB) 的数据集，这是一个类似 Reddit 的中国平台，致力于分享具有智力挑战性和文化特异性的笑话。我们为每个笑话注释了解释，并通过由以中文为母语的人进行的 A/B 测试，将人工解释与两款最先进的 LLM（GPT-4o 和 ERNIE Bot）进行评估。我们的评估表明，即使对于 SOTA LLM 来说，Chumor 也具有挑战性，而且 Chumor 笑话的人工解释明显优于 LLM 生成的解释。

Title: Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries

Authors: Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, Amir Globerson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12775
Pdf URL: https://arxiv.org/pdf/2406.12775
Copy Paste: [[2406.12775]] Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries(https://arxiv.org/abs/2406.12775)
Keywords: language model, llm
Abstract: Large language models (LLMs) can solve complex multi-step problems, but little is known about how these computations are implemented internally. Motivated by this, we study how LLMs answer multi-hop queries such as "The spouse of the performer of Imagine is". These queries require two information extraction steps: a latent one for resolving the first hop ("the performer of Imagine") into the bridge entity (John Lennon), and one for resolving the second hop ("the spouse of John Lennon") into the target entity (Yoko Ono). Understanding how the latent step is computed internally is key to understanding the overall computation. By carefully analyzing the internal computations of transformer-based LLMs, we discover that the bridge entity is resolved in the early layers of the model. Then, only after this resolution, the two-hop query is solved in the later layers. Because the second hop commences in later layers, there could be cases where these layers no longer encode the necessary knowledge for correctly predicting the answer. Motivated by this, we propose a novel "back-patching" analysis method whereby a hidden representation from a later layer is patched back to an earlier layer. We find that in up to 57% of previously incorrect cases there exists a back-patch that results in the correct generation of the answer, showing that the later layers indeed sometimes lack the needed functionality. Overall our methods and findings open further opportunities for understanding and improving latent reasoning in transformer-based LLMs.
摘要：大型语言模型 (LLM) 可以解决复杂的多步骤问题，但人们对这些计算的内部实现方式知之甚少。受此启发，我们研究了 LLM 如何回答多跳查询，例如“Imagine 的表演者的配偶是”。这些查询需要两个信息提取步骤：一个潜在步骤，用于将第一跳（“Imagine 的表演者”）解析为桥接实体（约翰·列侬），另一个用于将第二跳（“约翰·列侬的配偶”）解析为目标实体（小野洋子）。了解潜在步骤的内部计算方式是理解整体计算的关键。通过仔细分析基于转换器的 LLM 的内部计算，我们发现桥接实体在模型的早期层中得到解析。然后，只有在这个解析之后，两跳查询才会在后面的层中得到解决。由于第二跳从后面的层开始，因此可能存在这些层不再编码正确预测答案所需的知识的情况。受此启发，我们提出了一种新颖的“反向修补”分析方法，即将后一层的隐藏表示修补回前一层。我们发现，在多达 57% 的先前错误案例中，存在一个反向修补，可生成正确的答案，这表明后一层确实有时缺乏所需的功能。总的来说，我们的方法和发现为理解和改进基于 Transformer 的 LLM 中的潜在推理提供了更多机会。
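
The back-patching idea described in this abstract (re-running the forward pass with an earlier layer's hidden state overwritten by one recorded at a later layer) can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a hypothetical stack of scalar functions rather than a transformer, and the names `run` and `back_patch` are illustrative assumptions.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy "layer": maps one scalar hidden state to the next.
Layer = Callable[[float], float]


def run(layers: List[Layer], x: float,
        patch: Optional[Tuple[int, float]] = None) -> List[float]:
    """Run the stack and return the hidden state after every layer.

    If patch=(i, value) is given, the output of layer i is overwritten
    with value before the remaining layers are applied.
    """
    states = []
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if patch is not None and patch[0] == i:
            h = patch[1]
        states.append(h)
    return states


def back_patch(layers: List[Layer], x: float, src: int, dst: int) -> float:
    """Copy the hidden state from later layer src into earlier layer dst,
    then return the final output of the patched run."""
    if not 0 <= dst < src < len(layers):
        raise ValueError("need 0 <= dst < src < number of layers")
    clean = run(layers, x)                       # first pass: record states
    patched = run(layers, x, patch=(dst, clean[src]))
    return patched[-1]


toy_layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3]
clean_states = run(toy_layers, 1.0)
patched_output = back_patch(toy_layers, 1.0, src=2, dst=0)
```

In a real transformer the patched quantity would be a hidden vector at a specific token position, and the analysis would check whether any (src, dst) pair turns a previously incorrect answer into a correct one.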

Title: UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions

Authors: Xunzhi Wang, Zhuowei Zhang, Qiongyu Li, Gaonan Chen, Mengting Hu, Zhiyu li, Bitong Luo, Hang Gao, Zhixin Han, Haotian Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12784
Pdf URL: https://arxiv.org/pdf/2406.12784
Copy Paste: [[2406.12784]] UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions(https://arxiv.org/abs/2406.12784)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: The rapid development of large language models (LLMs) has shown promising practical results. However, their low interpretability often leads to errors in unforeseen circumstances, limiting their utility. Many works have focused on creating comprehensive evaluation systems, but previous benchmarks have primarily assessed problem-solving abilities while neglecting the response's uncertainty, which may result in unreliability. Recent methods for measuring LLM reliability are resource-intensive and unable to test black-box models. To address this, we propose UBENCH, a comprehensive benchmark for evaluating LLM reliability. UBENCH includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities. Experimental results show that UBENCH has achieved state-of-the-art performance, while its single-sampling method significantly saves computational resources compared to baseline methods that require multiple samplings. Additionally, based on UBENCH, we evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding, closely followed by GPT-4. We also explore the impact of Chain-of-Thought prompts, role-playing prompts, option order, and temperature on LLM reliability, analyzing the varying effects on different LLMs.
摘要：大型语言模型 (LLM) 的快速发展已显示出良好的实际效果。然而，它们的低可解释性往往会导致不可预见的情况下出现错误，从而限制了它们的实用性。许多工作都集中在创建全面的评估系统上，但之前的基准主要评估解决问题的能力，而忽略了响应的不确定性，这可能会导致不可靠性。最近用于测量 LLM 可靠性的方法资源密集型，无法测试黑盒模型。为了解决这个问题，我们提出了 UBENCH，这是一个用于评估 LLM 可靠性的综合基准。UBENCH 包括 3,978 个多项选择题，涵盖知识、语言、理解和推理能力。实验结果表明，UBENCH 已达到最佳性能，而其单次采样方法与需要多次采样的基线方法相比显著节省了计算资源。此外，基于 UBENCH，我们评估了 15 种流行的 LLM 的可靠性，发现 GLM4 最为出色，其次是 GPT-4。我们还探讨了思路链提示、角色扮演提示、选项顺序和温度对 LLM 可靠性的影响，分析了对不同 LLM 的不同影响。

Title: Generating Educational Materials with Different Levels of Readability using LLMs

Authors: Chieh-Yang Huang, Jing Wei, Ting-Hao 'Kenneth' Huang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2406.12787
Pdf URL: https://arxiv.org/pdf/2406.12787
Copy Paste: [[2406.12787]] Generating Educational Materials with Different Levels of Readability using LLMs(https://arxiv.org/abs/2406.12787)
Keywords: gpt, llm, prompt
Abstract: This study introduces the leveled-text generation task, aiming to rewrite educational materials to specific readability levels while preserving meaning. We assess the capability of GPT-3.5, LLaMA-2 70B, and Mixtral 8x7B to generate content at various readability levels through zero-shot and few-shot prompting. Evaluating 100 processed educational materials reveals that few-shot prompting significantly improves performance in readability manipulation and information preservation. LLaMA-2 70B performs better in achieving the desired difficulty range, while GPT-3.5 maintains original meaning. However, manual inspection highlights concerns such as misinformation introduction and inconsistent edit distribution. These findings emphasize the need for further research to ensure the quality of generated educational content.
摘要：本研究引入了分级文本生成任务，旨在将教育材料重写为特定的可读性级别，同时保留其含义。我们评估了 GPT-3.5、LLaMA-2 70B 和 Mixtral 8x7B 通过零样本和少样本提示生成不同可读性级别内容的能力。对 100 份经过处理的教育材料进行评估后发现，少样本提示显著提高了可读性操作和信息保存方面的表现。LLaMA-2 70B 在实现所需难度范围方面表现更好，而 GPT-3.5 则保持了原意。然而，人工检查突出了诸如引入错误信息和编辑分布不一致等问题。这些发现强调需要进一步研究以确保生成的教育内容的质量。

Title: ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Authors: Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12793
Pdf URL: https://arxiv.org/pdf/2406.12793
Copy Paste: [[2406.12793]] ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools(https://arxiv.org/abs/2406.12793)
Keywords: language model, gpt, long context, chat
Abstract: We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillion tokens, mostly in Chinese and English, along with a small set of corpora from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks, and 4) outperforms GPT-4 in Chinese alignments as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use -- including web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using a Python interpreter. Over time, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in the year 2023 alone. The open models can be accessed through this https URL and this https URL.
摘要：我们介绍了 ChatGLM，这是我们一直在开发的不断发展的大型语言模型系列。本报告主要关注 GLM-4 语言系列，其中包括 GLM-4、GLM-4-Air 和 GLM-4-9B。它们代表了我们最强大的模型，这些模型采用了前三代 ChatGLM 的所有见解和经验教训进行训练。到目前为止，GLM-4 模型已在十万亿个 token 上进行了预训练，这些 token 主要为中文和英文，还使用了来自 24 种语言的一小组语料库，主要针对中文和英文使用进行了对齐。高质量的对齐是通过多阶段后训练过程实现的，该过程涉及监督微调和从人工反馈中学习。测评结果表明，GLM-4 1）在MMLU、GSM8K、MATH、BBH、GPQA、HumanEval等通用指标方面与GPT-4旗鼓相当或优于GPT-4；2）在IFEval的测量中，在指令跟踪方面与GPT-4-Turbo接近；3）在长上下文任务中与GPT-4 Turbo（128K）和Claude 3匹敌；4）在AlignBench的测量中，在中文对齐方面优于GPT-4。GLM-4 All Tools模型进一步理解用户意图，并自主决定何时使用哪种工具（包括网页浏览器、Python解释器、文本转图像模型和用户定义函数）来有效地完成复杂任务。在实际应用中，它在通过网页浏览获取在线信息、使用Python解释器解决数学问题等任务上匹敌甚至超越GPT-4 All Tools。在此过程中，我们开源了一系列模型，包括 ChatGLM-6B（三代）、GLM-4-9B（128K、1M）、GLM-4V-9B、WebGLM 和 CodeGeeX，仅在 2023 年，Hugging face 上的下载量就超过 1000 万次。可以通过此 https URL 和此 https URL 访问开放模型。

Title: Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Authors: Zhe Yang, Yichang Zhang, Tianyu Liu, Jian Yang, Junyang Lin, Chang Zhou, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12809
Pdf URL: https://arxiv.org/pdf/2406.12809
Copy Paste: [[2406.12809]] Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?(https://arxiv.org/abs/2406.12809)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues (e.g., LLMs can react differently to disturbances like rephrasing or inconsequential order change). In addition to these inconsistencies, we also observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of consistency score to quantitatively measure this inconsistency and analyze the potential for improvement in consistency by relative consistency score. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2% but is still inconsistent on specific questions due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions also exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.
摘要：大型语言模型（LLM）已经展示了令人印象深刻的能力，但仍然存在不一致性问题（例如，LLM 对改写或无关紧要的顺序变化等干扰的反应可能不同）。除了这些不一致性之外，我们还观察到，虽然 LLM 能够解决难题，但在解决较简单的问题时却可能自相矛盾地失败。为了评估这种难易不一致性，我们开发了 ConsisEval 基准，其中每个条目包含一对具有严格难度顺序的问题。此外，我们引入了一致性得分的概念来定量衡量这种不一致性，并通过相对一致性得分分析一致性改进的潜力。基于对现有多种模型的综合实验，我们发现：（1）GPT-4 的一致性得分最高，为 92.2%，但由于冗余信息、问题误解等因素的干扰，对特定问题仍然不一致；（2）能力较强的模型通常表现出更高的一致性，但也存在例外；（3）硬数据增强了微调和上下文学习的一致性。我们的数据和代码将在 GitHub 上公开。
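
The abstract does not spell out how the consistency score is computed. One plausible reading, taken here purely as an assumption, is the fraction of question pairs in which the easy question is answered correctly among those pairs whose hard question was answered correctly. The function name and the `(easy_correct, hard_correct)` input format are hypothetical.

```python
from typing import List, Optional, Tuple


def consistency_score(results: List[Tuple[bool, bool]]) -> Optional[float]:
    """Assumed definition: P(easy correct | hard correct) over question pairs.

    Each item is (easy_correct, hard_correct) for one easy/hard pair.
    Returns None when no hard question was solved, since the ratio is
    undefined in that case.
    """
    easy_given_hard = [easy for easy, hard in results if hard]
    if not easy_given_hard:
        return None
    return sum(easy_given_hard) / len(easy_given_hard)


# Four hypothetical pairs: the hard question is solved in three of them,
# and the matching easy question is solved in two of those three.
pairs = [(True, True), (False, True), (True, False), (True, True)]
score = consistency_score(pairs)
```

Under this reading, a score below 1.0 signals exactly the hard-to-easy inconsistency the benchmark targets: the model solved a hard question yet missed its easier counterpart.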

Title: Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

Authors: Pinzhen Chen, Simon Yu, Zhicheng Guo, Barry Haddow
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12822
Pdf URL: https://arxiv.org/pdf/2406.12822
Copy Paste: [[2406.12822]] Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?(https://arxiv.org/abs/2406.12822)
Keywords: language model
Abstract: Large language models, particularly multilingual ones, are designed, claimed, and expected to cater to native speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may mismatch this intention owing to a heavy reliance on translation, which can introduce translation artefacts and defects. It remains unknown whether the nature of the instruction data has an impact on the model output; on the other hand, it remains questionable whether translated test sets can capture such nuances. Due to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates these issues by using controlled native or translated data during instruction tuning and evaluation stages and observing model results. Experiments on eight base models and eight different benchmarks reveal that native or generation benchmarks display a notable difference between native and translated instruction data especially when model performance is high, whereas other types of test sets cannot. Finally, we demonstrate that regularization is beneficial to bridging this gap on structured but not generative tasks.
摘要：大型语言模型，尤其是多语言模型，其设计、宣称和预期都是为了迎合不同语言的母语人士。我们假设，由于严重依赖翻译，当前对这些模型进行微调和评估的做法可能与这一意图不符，从而引入翻译伪影和缺陷。目前尚不清楚指令数据的性质是否会对模型输出产生影响；另一方面，翻译测试集是否能捕捉到这些细微差别仍值得怀疑。由于在两个阶段经常使用翻译数据，这种缺陷可能会被忽视。这项工作通过在指令调整和评估阶段使用受控的本机或翻译数据并观察模型结果来调查这些问题。对八个基础模型和八个不同基准的实验表明，本机或生成基准在本机和翻译指令数据之间显示出显着差异，尤其是在模型性能较高时，而其他类型的测试集则不能。最后，我们证明正则化有利于弥合结构化任务而非生成任务上的这一差距。

Title: From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries

Authors: Hitesh Wadhwa, Rahul Seetharaman, Somyaa Aggarwal, Reshmi Ghosh, Samyadeep Basu, Soundararajan Srinivasan, Wenlong Zhao, Shreyas Chaudhari, Ehsan Aghazadeh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.12824
Pdf URL: https://arxiv.org/pdf/2406.12824
Copy Paste: [[2406.12824]] From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries(https://arxiv.org/abs/2406.12824)
Keywords: language model, prompt, chat, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) enriches the ability of language models to reason using external context to augment responses for a given user prompt. This approach has risen in popularity due to practical applications of language models in search, question answering, and chat-bots. However, the exact nature of how this approach works isn't clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that language models take a shortcut and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in language models with: (i) Causal Mediation Analysis to show that the parametric memory is minimally utilized when answering a question and (ii) Attention Contributions and Knockouts to show that the last token residual stream does not get enriched from the subject token in the question, but gets enriched from other informative tokens in the context. We find this pronounced shortcut behaviour true across both the LLaMa and Phi families of models.
摘要：检索增强生成 (RAG) 丰富了语言模型使用外部上下文进行推理以增强给定用户提示的响应的能力。这种方法由于在搜索、问答和聊天机器人等各种语言模型应用中的实际应用而越来越受欢迎。然而，这种方法的确切工作原理尚不清楚。在本文中，我们从机制上检查了 RAG 管道，以强调语言模型走捷径，并且强烈倾向于仅使用上下文信息来回答问题，同时尽量减少对参数记忆的依赖。我们通过以下方法探究语言模型中的这种机械行为：(i) 因果中介分析表明在回答问题时参数记忆的利用率最低；(ii) 注意力贡献和淘汰赛表明最后一个标记残差流不会从问题中的主题标记中丰富，而是从上下文中的其他信息标记中丰富。我们发现这种明显的捷径行为在 LLaMa 和 Phi 系列模型中都存在。

Title: What Are the Odds? Language Models Are Capable of Probabilistic Reasoning

Authors: Akshay Paruchuri, Jake Garrison, Shun Liao, John Hernandez, Jacob Sunshine, Tim Althoff, Xin Liu, Daniel McDuff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.12830
Pdf URL: https://arxiv.org/pdf/2406.12830
Copy Paste: [[2406.12830]] What Are the Odds? Language Models Are Capable of Probabilistic Reasoning(https://arxiv.org/abs/2406.12830)
Keywords: language model
Abstract: Language models (LMs) are capable of remarkably complex linguistic tasks; however, numerical reasoning is an area in which they frequently struggle. An important but rarely evaluated form of reasoning is understanding probability distributions. In this paper, we focus on evaluating the probabilistic reasoning capabilities of LMs using idealized and real-world statistical distributions. We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities. We evaluate three ways to provide context to LMs: 1) anchoring examples from within a distribution or family of distributions, 2) real-world context, 3) summary statistics on which to base a Normal approximation. Models can make inferences about distributions, and can be further aided by the incorporation of real-world context, example shots and simplified assumptions, even if these assumptions are incorrect or misspecified. To conduct this work, we developed a comprehensive benchmark distribution dataset with associated question-answer pairs that we will release publicly.
摘要：语言模型 (LM) 能够完成非常复杂的语言任务；然而，数字推理是它们经常遇到困难的领域。一种重要但很少被评估的推理形式是理解概率分布。在本文中，我们专注于使用理想化和现实世界的统计分布来评估 LM 的概率推理能力。我们对最先进的 LM 进行了三项任务的系统评估：估计百分位数、抽取样本和计算概率。我们评估了三种为 LM 提供上下文的方法：1) 从分布或分布系列中锚定示例，2) 现实世界上下文，3) 用作正态近似基础的汇总统计数据。模型可以对分布进行推断，并且可以通过结合现实世界背景、示例镜头和简化假设来进一步辅助，即使这些假设不正确或指定错误。为了开展这项工作，我们开发了一个全面的基准分布数据集以及相关的问答对，我们将公开发布。
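
As an illustration of the third context type (summary statistics supporting a Normal approximation), the percentile-estimation and probability tasks have closed-form reference answers once a mean and standard deviation are given. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper names and the example mean and standard deviation are hypothetical and are not taken from the paper's benchmark.

```python
from statistics import NormalDist


def percentile_of(value: float, mean: float, std: float) -> float:
    """Percentile (0-100) of value under a Normal(mean, std) approximation."""
    return 100.0 * NormalDist(mean, std).cdf(value)


def value_at_percentile(pct: float, mean: float, std: float) -> float:
    """Inverse task: the value sitting at the given percentile (0 < pct < 100)."""
    return NormalDist(mean, std).inv_cdf(pct / 100.0)


def prob_between(lo: float, hi: float, mean: float, std: float) -> float:
    """Probability mass of the interval [lo, hi] under the approximation."""
    dist = NormalDist(mean, std)
    return dist.cdf(hi) - dist.cdf(lo)


# Hypothetical summary statistics, chosen only for illustration.
MEAN, STD = 100.0, 15.0
median_pct = percentile_of(100.0, MEAN, STD)
one_sigma_mass = prob_between(85.0, 115.0, MEAN, STD)
```

Reference values computed this way are only as good as the Normal assumption itself, which is why the abstract notes that such simplified assumptions may be incorrect or misspecified for real-world distributions.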

Title: LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation

Authors: Seyedarmin Azizi, Souvik Kundu, Massoud Pedram
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.12832
Pdf URL: https://arxiv.org/pdf/2406.12832
Copy Paste: [[2406.12832]] LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation(https://arxiv.org/abs/2406.12832)
Keywords: language model, llm
Abstract: Low-rank adaptation (LoRA) has become the default approach to fine-tune large language models (LLMs) due to its significant reduction in trainable parameters. However, trainable parameter demand for LoRA increases with increasing model embedding dimensions, leading to high compute costs. Additionally, its backward updates require storing high-dimensional intermediate activations and optimizer states, demanding high peak GPU memory. In this paper, we introduce large model fine-tuning via spectrally decomposed low-dimensional adaptation (LaMDA), a novel approach to fine-tuning large language models, which leverages low-dimensional adaptation to achieve significant reductions in trainable parameters and peak GPU memory footprint. LaMDA freezes a first projection matrix (PMA) in the adaptation path while introducing a low-dimensional trainable square matrix, resulting in substantial reductions in trainable parameters and peak GPU memory usage. LaMDA gradually freezes a second projection matrix (PMB) during the early fine-tuning stages, reducing the compute cost associated with weight updates to enhance parameter efficiency further. We also present an enhancement, LaMDA++, incorporating a "lite-weight" adaptive rank allocation for the LoRA path via normalized spectrum analysis of pre-trained model weights. We evaluate LaMDA/LaMDA++ across various tasks, including natural language understanding with the GLUE benchmark, text summarization, natural language generation, and complex reasoning on different LLMs. Results show that LaMDA matches or surpasses the performance of existing alternatives while requiring up to 17.7x fewer parameter updates and up to 1.32x lower peak GPU memory usage during fine-tuning. Code will be publicly available.
摘要：低秩自适应 (LoRA) 已成为微调大型语言模型 (LLM) 的默认方法，因为它可以显著减少可训练参数。然而，随着模型嵌入维度的增加，LoRA 对可训练参数的需求也随之增加，从而导致计算成本高昂。此外，其向后更新需要存储高维中间激活和优化器状态，这需要高峰值 GPU 内存。在本文中，我们介绍了通过谱分解低维自适应 (LaMDA) 进行大型模型微调，这是一种微调大型语言模型的新方法，它利用低维自适应来显著减少可训练参数和峰值 GPU 内存占用。LaMDA 在自适应路径中冻结第一个投影矩阵 (PMA)，同时引入低维可训练方阵，从而大幅减少可训练参数和峰值 GPU 内存使用量。LaMDA 在早期微调阶段逐渐冻结第二个投影矩阵 (PMB)，从而降低与权重更新相关的计算成本，从而进一步提高参数效率。我们还提出了一种增强功能 LaMDA++，它通过对预训练模型权重进行归一化频谱分析，为 LoRA 路径引入了“轻量级”自适应等级分配。我们在各种任务中评估了 LaMDA/LaMDA++，包括使用 GLUE 基准的自然语言理解、文本摘要、自然语言生成和不同 LLM 上的复杂推理。结果表明，LaMDA 的性能达到或超过了现有替代方案，同时在微调期间需要的参数更新减少了 17.7 倍，峰值 GPU 内存使用量减少了 1.32 倍。代码将公开提供。
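
To make the parameter argument in this abstract concrete, the sketch below counts trainable parameters for a single adapted weight matrix. The shapes are assumptions, not taken from the paper: LoRA is modeled as two trainable rank-r factors of sizes d_in x r and r x d_out, and the LaMDA path as a frozen d_in x r projection (PMA), a trainable r x r square matrix, and an r x d_out projection (PMB) that is trainable only until it is frozen. The function names are hypothetical, and the example does not attempt to reproduce the reported 17.7x figure.

```python
def lora_trainable(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for standard LoRA on one d_in x d_out weight:
    both low-rank factors are updated, so the count grows with d_in + d_out."""
    return r * d_in + r * d_out


def lamda_trainable(d_out: int, r: int, pmb_frozen: bool) -> int:
    """Trainable parameters under the assumed LaMDA layout: PMA is always
    frozen, the r x r square matrix is always trained, and PMB (r x d_out)
    contributes only while it is still unfrozen."""
    square = r * r
    pmb = 0 if pmb_frozen else r * d_out
    return square + pmb


# Hypothetical layer size and rank, for illustration only.
D_IN, D_OUT, RANK = 4096, 4096, 16
lora_count = lora_trainable(D_IN, D_OUT, RANK)
lamda_early = lamda_trainable(D_OUT, RANK, pmb_frozen=False)
lamda_late = lamda_trainable(D_OUT, RANK, pmb_frozen=True)
```

Under these assumptions the LoRA count scales with the embedding dimension, while the late-stage LaMDA count depends only on the rank, which matches the abstract's motivation that LoRA's trainable-parameter demand grows with model embedding dimensions.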