2024-02-21

Title: Turn Waste into Worth: Rectifying Top-$k$ Router of MoE

Authors: Zhiyuan Zeng, Qipeng Guo, Zhaoye Fei, Zhangyue Yin, Yunhua Zhou, Linyang Li, Tianxiang Sun, Hang Yan, Dahua Lin, Xipeng Qiu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12399
Pdf URL: https://arxiv.org/pdf/2402.12399
Copy Paste: [[2402.12399]] Turn Waste into Worth: Rectifying Top-$k$ Router of MoE(https://arxiv.org/abs/2402.12399)
Keywords: language model
Abstract: Sparse Mixture of Experts (MoE) models are popular for training large language models due to their computational efficiency. However, the commonly used top-$k$ routing mechanism suffers from redundant computation and memory costs due to unbalanced routing. Some experts overflow, and the excess tokens are dropped, while other experts are vacant and are padded with zeros, which negatively impacts model performance. To address the dropped tokens and padding, we propose the Rectify-Router, comprising the Intra-GPU Rectification and the Fill-in Rectification. The Intra-GPU Rectification handles dropped tokens, efficiently routing them to experts within the GPU where they are located to avoid inter-GPU communication. The Fill-in Rectification addresses padding by replacing padding tokens with the tokens that have high routing scores. Our experimental results demonstrate that the Intra-GPU Rectification and the Fill-in Rectification effectively handle dropped tokens and padding, respectively. Furthermore, the combination of them achieves superior performance, surpassing the accuracy of the vanilla top-1 router by 4.7%.
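Note: the overflow and padding problem described in the abstract above can be illustrated with a minimal sketch of top-1 routing under a fixed per-expert capacity. This is not the Rectify-Router itself; the function name, the score layout (one row of expert scores per token), and the first-come-first-served overflow rule are assumptions made for illustration.

```python
from collections import defaultdict


def top1_route_with_capacity(scores, capacity):
    """Assign each token to its highest-scoring expert under a capacity limit.

    scores[t][e] is the (hypothetical) router score of token t for expert e.
    Returns (assignments, dropped, padding):
      assignments: expert id -> list of accepted token ids
      dropped:     token ids that overflowed their chosen expert
      padding:     expert id -> number of unused (zero-padded) slots
    """
    num_experts = len(scores[0])
    assignments = defaultdict(list)
    dropped = []
    for token, row in enumerate(scores):
        expert = max(range(num_experts), key=lambda e: row[e])
        if len(assignments[expert]) < capacity:
            assignments[expert].append(token)
        else:
            dropped.append(token)  # expert is full, so the token is dropped
    # Slots left empty on each expert would be filled with zero padding.
    padding = {e: capacity - len(assignments[e]) for e in range(num_experts)}
    return dict(assignments), dropped, padding


scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
assignments, dropped, padding = top1_route_with_capacity(scores, capacity=2)
print(assignments, dropped, padding)
```

With these toy scores, expert 0 overflows (one token is dropped) while expert 1 has an unused slot, which is the imbalance the abstract's two rectification steps are meant to address.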

Title: ModelGPT: Unleashing LLM's Capabilities for Tailored Model Generation

Authors: Zihao Tang, Zheqi Lv, Shengyu Zhang, Fei Wu, Kun Kuang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12408
Pdf URL: https://arxiv.org/pdf/2402.12408
Copy Paste: [[2402.12408]] ModelGPT: Unleashing LLM's Capabilities for Tailored Model Generation(https://arxiv.org/abs/2402.12408)
Keywords: language model, gpt, llm
Abstract: The rapid advancement of Large Language Models (LLMs) has revolutionized various sectors by automating routine tasks, marking a step toward the realization of Artificial General Intelligence (AGI). However, they still struggle to accommodate the diverse and specific needs of users and simplify the utilization of AI models for the average user. In response, we propose ModelGPT, a novel framework designed to determine and generate AI models specifically tailored to the data or task descriptions provided by the user, leveraging the capabilities of LLMs. Given user requirements, ModelGPT is able to provide tailored models up to 270x faster than previous paradigms (e.g., all-parameter or LoRA fine-tuning). Comprehensive experiments on NLP, CV, and Tabular datasets attest to the effectiveness of our framework in making AI models more accessible and user-friendly. Our code is available at https://github.com/IshiKura-a/ModelGPT.

Title: EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs

Authors: Song Guo, Fan Wu, Lei Zhang, Xiawu Zheng, Shengchuan Zhang, Fei Chao, Yiyu Shi, Rongrong Ji
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12419
Pdf URL: https://arxiv.org/pdf/2402.12419
Copy Paste: [[2402.12419]] EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs(https://arxiv.org/abs/2402.12419)
Keywords: llm
Abstract: Existing methods for fine-tuning sparse LLMs often suffer from resource-intensive requirements and high retraining costs. Additionally, many fine-tuning methods rely on approximations or heuristic optimization strategies, which may lead to suboptimal solutions. To address these issues, we propose an efficient and fast framework for fine-tuning sparse LLMs based on minimizing reconstruction error. Our approach involves sampling a small dataset for calibration and utilizing backpropagation to iteratively minimize the reconstruction error block by block, aiming for optimal solutions. Extensive experiments on various benchmarks consistently demonstrate the superiority of our method over other baselines. For instance, on the Wikitext2 dataset with LlamaV1-7B at 70% sparsity, our proposed EBFT achieves a perplexity of 16.88, surpassing the state-of-the-art DSnoT with a perplexity of 75.14. Moreover, with a structured sparsity ratio of 26%, EBFT achieves a perplexity of 16.27, outperforming LoRA (perplexity 16.44). Furthermore, the fine-tuning process of EBFT for LlamaV1-7B only takes approximately 30 minutes, and the entire framework can be executed on a single 16GB GPU. The source code is available at https://github.com/sunggo/EBFT.

Title: Simulacra as Conscious Exotica

Authors: Murray Shanahan
Subjects: cs.AI
Abstract URL: https://arxiv.org/abs/2402.12422
Pdf URL: https://arxiv.org/pdf/2402.12422
Copy Paste: [[2402.12422]] Simulacra as Conscious Exotica(https://arxiv.org/abs/2402.12422)
Keywords: language model, agent
Abstract: The advent of conversational agents with increasingly human-like behaviour throws old philosophical questions into new light. Does it, or could it, ever make sense to speak of AI agents built out of generative language models in terms of consciousness, given that they are "mere" simulacra of human behaviour, and that what they do can be seen as "merely" role play? Drawing on the later writings of Wittgenstein, this paper attempts to tackle this question while avoiding the pitfalls of dualistic thinking.

Title: Tables as Images? Exploring the Strengths and Limitations of LLMs on Multimodal Representations of Tabular Data

Authors: Naihao Deng, Zhenjie Sun, Ruiqi He, Aman Sikka, Yulong Chen, Lin Ma, Yue Zhang, Rada Mihalcea
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2402.12424
Pdf URL: https://arxiv.org/pdf/2402.12424
Copy Paste: [[2402.12424]] Tables as Images? Exploring the Strengths and Limitations of LLMs on Multimodal Representations of Tabular Data(https://arxiv.org/abs/2402.12424)
Keywords: llm, prompt
Abstract: In this paper, we investigate the effectiveness of various LLMs in interpreting tabular data through different prompting strategies and data formats. Our analysis extends across six benchmarks for table-related tasks such as question-answering and fact-checking. We introduce for the first time the assessment of LLMs' performance on image-based table representations. Specifically, we compare five text-based and three image-based table representations, demonstrating the influence of representation and prompting on LLM performance. Our study provides insights into the effective use of LLMs on table-related tasks.

Title: Understanding Fine-grained Distortions in Reports of Scientific Findings

Authors: Amelie Wührl, Dustin Wright, Roman Klinger, Isabelle Augenstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12431
Pdf URL: https://arxiv.org/pdf/2402.12431
Copy Paste: [[2402.12431]] Understanding Fine-grained Distortions in Reports of Scientific Findings(https://arxiv.org/abs/2402.12431)
Keywords: llm, prompt
Abstract: Distorted science communication harms individuals and society as it can lead to unhealthy behavior change and decrease trust in scientific institutions. Given the rapidly increasing volume of science communication in recent years, a fine-grained understanding of how findings from scientific publications are reported to the general public, and methods to detect distortions from the original work automatically, are crucial. Prior work focused on individual aspects of distortions or worked with unpaired data. In this work, we make three foundational contributions towards addressing this problem: (1) annotating 1,600 instances of scientific findings from academic papers paired with corresponding findings as reported in news articles and tweets with respect to four characteristics: causality, certainty, generality and sensationalism; (2) establishing baselines for automatically detecting these characteristics; and (3) analyzing the prevalence of changes in these characteristics in both human-annotated and large-scale unlabeled data. Our results show that scientific findings frequently undergo subtle distortions when reported. Tweets distort findings more often than science news reports. Detecting fine-grained distortions automatically poses a challenging task. In our experiments, fine-tuned task-specific models consistently outperform few-shot LLM prompting.

Title: Neuro-mimetic Task-free Unsupervised Online Learning with Continual Self-Organizing Maps

Authors: Hitesh Vaidya, Travis Desell, Ankur Mali, Alexander Ororbia
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2402.12465
Pdf URL: https://arxiv.org/pdf/2402.12465
Copy Paste: [[2402.12465]] Neuro-mimetic Task-free Unsupervised Online Learning with Continual Self-Organizing Maps(https://arxiv.org/abs/2402.12465)
Keywords: agent
Abstract: An intelligent system capable of continual learning is one that can process and extract knowledge from potentially infinitely long streams of pattern vectors. The major challenge that makes crafting such a system difficult is known as catastrophic forgetting: an agent, such as one based on artificial neural networks (ANNs), struggles to retain previously acquired knowledge when learning from new samples. Furthermore, ensuring that knowledge is preserved for previous tasks becomes more challenging when input is not supplemented with task boundary information. Although forgetting in the context of ANNs has been studied extensively, there still exists far less work investigating it in terms of unsupervised architectures such as the venerable self-organizing map (SOM), a neural model often used in clustering and dimensionality reduction. While the internal mechanisms of SOMs could, in principle, yield sparse representations that improve memory retention, we observe that, when a fixed-size SOM processes continuous data streams, it experiences concept drift. In light of this, we propose a generalization of the SOM, the continual SOM (CSOM), which is capable of online unsupervised learning under a low memory budget. Our results on benchmarks including MNIST, Kuzushiji-MNIST, and Fashion-MNIST show an almost twofold increase in accuracy, and on CIFAR-10 the CSOM achieves a state-of-the-art result in the (online) unsupervised class-incremental learning setting.
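Note: for readers unfamiliar with the self-organizing map mentioned in the abstract above, here is a minimal sketch of one online update of a classic 1-D SOM. It is the textbook algorithm, not the CSOM proposed in the paper; the function name, the learning rate, and the Gaussian neighborhood width are illustrative assumptions.

```python
import math


def som_step(weights, x, lr=0.5, sigma=1.0):
    """Apply one online update of a classic 1-D self-organizing map.

    weights: list of unit weight vectors (lists of floats), updated in place.
    x:       input pattern vector (list of floats).
    Returns the index of the best-matching unit (BMU).
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # The BMU is the unit whose weight vector is closest to the input.
    bmu = min(range(len(weights)), key=lambda i: sq_dist(weights[i], x))
    for i, w in enumerate(weights):
        # Gaussian neighborhood: units near the BMU on the grid move more.
        h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
        for d in range(len(w)):
            w[d] += lr * h * (x[d] - w[d])
    return bmu


units = [[0.0, 0.0], [1.0, 1.0]]
print(som_step(units, [0.9, 0.9]), units)
```

Because every incoming pattern pulls the BMU and its neighbors toward itself, a fixed-size map trained on a non-stationary stream keeps overwriting earlier units, which is the kind of drift the abstract refers to.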

Title: In deep reinforcement learning, a pruned network is a good network

Authors: Johan Obando-Ceron, Aaron Courville, Pablo Samuel Castro
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12479
Pdf URL: https://arxiv.org/pdf/2402.12479
Copy Paste: [[2402.12479]] In deep reinforcement learning, a pruned network is a good network(https://arxiv.org/abs/2402.12479)
Keywords: agent
Abstract: Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks and exhibit a type of "scaling law", using only a small fraction of the full network parameters.
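Note: gradual magnitude pruning, as named in the abstract above, can be sketched as a sparsity schedule plus a step that zeroes the smallest-magnitude weights. The abstract does not give the schedule or the hyperparameters; the cubic ramp below is one commonly used form and is an assumption here, as are the function names and the flat weight list.

```python
def target_sparsity(step, start_step, end_step, final_sparsity):
    """Cubic ramp from 0 to final_sparsity between start_step and end_step.

    The cubic form is an assumed schedule, not taken from the abstract.
    """
    if step <= start_step:
        return 0.0
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity * (1.0 - (1.0 - progress) ** 3)


def magnitude_prune(weights, sparsity):
    """Return a copy of weights with the smallest-magnitude fraction zeroed."""
    num_to_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_to_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


weights = [0.5, -0.1, 2.0, 0.05]
for step in (0, 50, 100):
    s = target_sparsity(step, start_step=0, end_step=100, final_sparsity=0.5)
    print(step, s, magnitude_prune(weights, s))
```

In a training loop the pruning step would be reapplied periodically as the target sparsity rises, so the network is thinned gradually rather than all at once.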

Title: Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?

Authors: Nishant Balepur, Abhilasha Ravichander, Rachel Rudinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12483
Pdf URL: https://arxiv.org/pdf/2402.12483
Copy Paste: [[2402.12483]] Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?(https://arxiv.org/abs/2402.12483)
Keywords: language model, llm, prompt
Abstract: Multiple-choice question answering (MCQA) is often used to evaluate large language models (LLMs). To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices. In three MCQA datasets and four LLMs, this prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. To help explain this behavior, we conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that the choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs use the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from choices, and surprisingly can sometimes even match the original question. We hope to motivate the use of stronger baselines in MCQA benchmarks, the design of robust MCQA datasets, and further efforts to explain LLM decision-making.
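Note: the two quantities compared in the abstract above, a choices-only prompt and a majority baseline, can be sketched as follows. The prompt layout and both function names are hypothetical; the paper's actual prompt format is not given in the abstract.

```python
from collections import Counter


def choices_only_prompt(choices):
    """Format the answer options without the question text (assumed layout)."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    return "\n".join(lines) + "\nAnswer:"


def majority_baseline_accuracy(gold_labels):
    """Accuracy obtained by always predicting the most frequent gold label."""
    most_common_count = Counter(gold_labels).most_common(1)[0][1]
    return most_common_count / len(gold_labels)


print(choices_only_prompt(["red", "blue"]))
print(majority_baseline_accuracy(["A", "B", "A", "C"]))
```

A model that beats the majority baseline on such prompts is extracting signal from the options alone, which is the behavior the paper investigates.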

Title: Do Pre-Trained Language Models Detect and Understand Semantic Underspecification? Ask the DUST!

Authors: Frank Wildenburg, Michael Hanna, Sandro Pezzelle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12486
Pdf URL: https://arxiv.org/pdf/2402.12486
Copy Paste: [[2402.12486]] Do Pre-Trained Language Models Detect and Understand Semantic Underspecification? Ask the DUST!(https://arxiv.org/abs/2402.12486)
Keywords: language model, prompt
Abstract: In everyday language use, speakers frequently utter and interpret sentences that are semantically underspecified, namely, whose content is insufficient to fully convey their message or interpret them univocally. For example, to interpret the underspecified sentence "Don't spend too much", which leaves implicit what (not) to spend, additional linguistic context or outside knowledge is needed. In this work, we propose a novel Dataset of semantically Underspecified Sentences grouped by Type (DUST) and use it to study whether pre-trained language models (LMs) correctly identify and interpret underspecified sentences. We find that newer LMs are reasonably able to identify underspecified sentences when explicitly prompted. However, interpreting them correctly is much harder for any LM. Our experiments show that when interpreting underspecified sentences, LMs exhibit little uncertainty, contrary to what theoretical accounts of underspecification would predict. Overall, our study reveals limitations in current models' processing of sentence semantics and highlights the importance of using naturalistic data and communicative scenarios when evaluating LMs' language capabilities.

Title: Towards Cross-Domain Continual Learning

Authors: Marcus de Carvalho, Mahardhika Pratama, Jie Zhang, Chua Haoyan, Edward Yapp
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2402.12490
Pdf URL: https://arxiv.org/pdf/2402.12490
Copy Paste: [[2402.12490]] Towards Cross-Domain Continual Learning(https://arxiv.org/abs/2402.12490)
Keywords: agent
Abstract: Continual learning is a process that involves training learning agents to sequentially master a stream of tasks or classes without revisiting past data. The challenge lies in leveraging previously acquired knowledge to learn new tasks efficiently, while avoiding catastrophic forgetting. Existing methods primarily focus on single domains, restricting their applicability to specific problems. In this work, we introduce a novel approach called Cross-Domain Continual Learning (CDCL) that removes the restriction to single supervised domains. Our method combines inter- and intra-task cross-attention mechanisms within a compact convolutional network. This integration enables the model to maintain alignment with features from previous tasks, thereby delaying the data drift that may occur between tasks, while performing unsupervised domain adaptation (UDA) between related domains. By leveraging an intra-task-specific pseudo-labeling method, we ensure accurate input pairs for both labeled and unlabeled samples, enhancing the learning process. To validate our approach, we conduct extensive experiments on public UDA datasets, showcasing its positive performance on cross-domain continual learning challenges. Additionally, our work introduces incremental ideas that contribute to the advancement of this field. We make our code and models available to encourage further exploration and reproduction of our results: https://github.com/Ivsucram/CDCL

Title: Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection

Authors: Ruibo Chen, Yihan Wu, Lichang Chen, Guodong Liu, Qi He, Tianyi Xiong, Chenxi Liu, Junfeng Guo, Heng Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12501
Pdf URL: https://arxiv.org/pdf/2402.12501
Copy Paste: [[2402.12501]] Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection(https://arxiv.org/abs/2402.12501)
Keywords: language model, gpt, llm
Abstract: Data selection in instruction tuning emerges as a pivotal process for acquiring high-quality data and training instruction-following large language models (LLMs), but it is still a new and unexplored research area for vision-language models (VLMs). Existing data selection approaches on LLMs either rely on single unreliable scores, or use downstream tasks for selection, which is time-consuming and can lead to potential over-fitting on the chosen evaluation datasets. To address this challenge, we introduce a novel dataset selection method, Self-Filter, that utilizes the VLM itself as a filter. This approach is inspired by the observation that VLMs benefit from training with the most challenging instructions. Self-Filter operates in two stages. In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM. In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity. Comprehensive experiments on LLaVA and MiniGPT-4 show that Self-Filter, using only about 15% of the samples, achieves better results than training on the full data, and outperforms competitive baselines.

Title: Induced Model Matching: How Restricted Models Can Help Larger Ones

Authors: Usama Muneeb, Mesrob I. Ohannessian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.12513
Pdf URL: https://arxiv.org/pdf/2402.12513
Copy Paste: [[2402.12513]] Induced Model Matching: How Restricted Models Can Help Larger Ones(https://arxiv.org/abs/2402.12513)
Keywords: language model
Abstract: We consider scenarios where a very accurate predictive model using restricted features is available at the time of training of a larger, full-featured model. This restricted model may be thought of as "side-information", derived either from an auxiliary exhaustive dataset or from the same dataset, by forcing the restriction. How can the restricted model be useful to the full model? We propose an approach for transferring the knowledge of the restricted model to the full model, by aligning the full model's context-restricted performance with that of the restricted model. We call this methodology Induced Model Matching (IMM) and first illustrate its general applicability by using logistic regression as a toy example. We then explore IMM's use in language modeling, the application that initially inspired it, and where it offers an explicit foundation in contrast to the implicit use of restricted models in techniques such as noising. We demonstrate the methodology on both LSTM and transformer full models, using $N$-grams as restricted models. To further illustrate the potential of the principle whenever it is much cheaper to collect restricted rather than full information, we conclude with a simple RL example where POMDP policies can improve learned MDP policies via IMM.

Title: The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning

Authors: Anya Sims, Cong Lu, Yee Whye Teh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12527
Pdf URL: https://arxiv.org/pdf/2402.12527
Copy Paste: [[2402.12527]] The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning(https://arxiv.org/abs/2402.12527)
Keywords: llm, agent
Abstract: Offline reinforcement learning aims to enable agents to be trained from pre-collected datasets; however, this comes with the added challenge of estimating the value of behavior not covered in the dataset. Model-based methods offer a solution by allowing agents to collect additional synthetic data via rollouts in a learned dynamics model. The prevailing theoretical understanding is that this can then be viewed as online reinforcement learning in an approximate dynamics model, and any remaining gap is therefore assumed to be due to the imperfect dynamics model. Surprisingly, however, we find that if the learned dynamics model is replaced by the true error-free dynamics, existing model-based methods completely fail. This reveals a major misconception. Our subsequent investigation finds that the general procedure used in model-based algorithms results in the existence of a set of edge-of-reach states which trigger pathological value overestimation and collapse in Bellman-based algorithms. We term this the edge-of-reach problem. Based on this, we fill some gaps in existing theory and also explain how prior model-based methods are inadvertently addressing the true underlying edge-of-reach problem. Finally, we propose Reach-Aware Value Learning (RAVL), a simple and robust method that directly addresses the edge-of-reach problem and achieves strong performance across both proprioceptive and pixel-based benchmarks. Code open-sourced at: https://github.com/anyasims/edge-of-reach.

Title: Parallel Structures in Pre-training Data Yield In-Context Learning

Authors: Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, He He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12530
Pdf URL: https://arxiv.org/pdf/2402.12530
Copy Paste: [[2402.12530]] Parallel Structures in Pre-training Data Yield In-Context Learning(https://arxiv.org/abs/2402.12530)
Keywords: language model, prompt
Abstract: Pre-trained language models (LMs) are capable of in-context learning (ICL): they can adapt to a task with only a few examples given in the prompt without any parameter update. However, it is unclear where this capability comes from as there is a stark distribution shift between pre-training text and ICL prompts. In this work, we study what patterns of the pre-training data contribute to ICL. We find that LMs' ICL ability depends on parallel structures in the pre-training data: pairs of phrases following similar templates in the same context window. Specifically, we detect parallel structures by checking whether training on one phrase improves prediction of the other, and conduct ablation experiments to study their effect on ICL. We show that removing parallel structures in the pre-training data reduces LMs' ICL accuracy by 51% (vs 2% from random ablation). This drop persists even when excluding common patterns such as n-gram repetitions and long-range dependency, showing the diversity and generality of parallel structures. A closer look at the detected parallel structures indicates that they cover diverse linguistic tasks and span long distances in the data.

Title: TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness

Authors: Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12545
Pdf URL: https://arxiv.org/pdf/2402.12545
Copy Paste: [[2402.12545]] TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness(https://arxiv.org/abs/2402.12545)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications. However, concerns have arisen regarding the trustworthiness of LLM outputs, particularly in closed-book question-answering tasks, where non-experts may struggle to identify inaccuracies due to the absence of contextual or ground truth information. This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLM's response aligns with its intrinsic knowledge. Additionally, TrustScore can seamlessly integrate with fact-checking methods, which assess alignment with external knowledge sources. The experimental results show that TrustScore achieves strong correlations with human judgments, surpassing existing reference-free metrics and achieving results on par with reference-based metrics.

Title: Creating a Fine Grained Entity Type Taxonomy Using LLMs

Authors: Michael Gunn, Dohyun Park, Nidhish Kamath
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12557
Pdf URL: https://arxiv.org/pdf/2402.12557
Copy Paste: [[2402.12557]] Creating a Fine Grained Entity Type Taxonomy Using LLMs(https://arxiv.org/abs/2402.12557)
Keywords: gpt, llm, prompt
Abstract: In this study, we investigate the potential of GPT-4 and its advanced iteration, GPT-4 Turbo, in autonomously developing a detailed entity type taxonomy. Our objective is to construct a comprehensive taxonomy, starting from a broad classification of entity types - including objects, time, locations, organizations, events, actions, and subjects - similar to existing manually curated taxonomies. This classification is then progressively refined through iterative prompting techniques, leveraging GPT-4's internal knowledge base. The result is an extensive taxonomy comprising over 5000 nuanced entity types, which demonstrates remarkable quality upon subjective evaluation. We employed a straightforward yet effective prompting strategy, enabling the taxonomy to be dynamically expanded. The practical applications of this detailed taxonomy are diverse and significant. It facilitates the creation of new, more intricate branches through pattern-based combinations and notably enhances information extraction tasks, such as relation extraction and event argument extraction. Our methodology not only introduces an innovative approach to taxonomy creation but also opens new avenues for applying such taxonomies in various computational linguistics and AI-related fields.

Title: CausalGym: Benchmarking causal interpretability methods on linguistic tasks

Authors: Aryaman Arora, Dan Jurafsky, Christopher Potts
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12560
Pdf URL: https://arxiv.org/pdf/2402.12560
Copy Paste: [[2402.12560]] CausalGym: Benchmarking causal interpretability methods on linguistic tasks(https://arxiv.org/abs/2402.12560)
Keywords: language model
Abstract: Language models (LMs) have proven to be powerful tools for psycholinguistic research, but most prior work has focused on purely behavioural measures (e.g., surprisal comparisons). At the same time, research in model interpretability has begun to illuminate the abstract causal mechanisms shaping LM behavior. To help bring these strands of research closer together, we introduce CausalGym. We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of interpretability methods to causally affect model behaviour. To illustrate how CausalGym can be used, we study the pythia models (14M--6.9B) and assess the causal efficacy of a wide range of interpretability methods, including linear probing and distributed alignment search (DAS). We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena in pythia-1b: negative polarity item licensing and filler--gap dependencies. Our analysis shows that the mechanism implementing both of these tasks is learned in discrete stages, not gradually.

Title: Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models

Authors: Loka Li, Guangyi Chen, Yusheng Su, Zhenhao Chen, Yixuan Zhang, Eric Xing, Kun Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12563
Pdf URL: https://arxiv.org/pdf/2402.12563
Copy Paste: [[2402.12563]] Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models(https://arxiv.org/abs/2402.12563)
Keywords: language model, llm, prompt
Abstract: The recent success of Large Language Models (LLMs) has catalyzed an increasing interest in their self-correction capabilities. This paper presents a comprehensive investigation into the intrinsic self-correction of LLMs, attempting to address the ongoing debate about its feasibility. Our research has identified an important latent factor - the "confidence" of LLMs - during the self-correction process. Overlooking this factor may cause the models to over-criticize themselves, resulting in unreliable conclusions regarding the efficacy of self-correction. We have experimentally observed that LLMs possess the capability to understand the "confidence" in their own responses. It motivates us to develop an "If-or-Else" (IoE) prompting framework, designed to guide LLMs in assessing their own "confidence", facilitating intrinsic self-corrections. We conduct extensive experiments and demonstrate that our IoE-based Prompt can achieve a consistent improvement regarding the accuracy of self-corrected responses over the initial answers. Our study not only sheds light on the underlying factors affecting self-correction in LLMs, but also introduces a practical framework that utilizes the IoE prompting principle to efficiently improve self-correction capabilities with "confidence". The code is available at https://github.com/MBZUAI-CLeaR/IoE-Prompting.git.

Title: GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Authors: Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C. Wallace, Zachary C. Lipton, Jeffrey P. Bigham
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12566
Pdf URL: https://arxiv.org/pdf/2402.12566
Copy Paste: [[2402.12566]] GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence(https://arxiv.org/abs/2402.12566)
Keywords: language model, llm
Abstract: LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in 8 different LLM outputs when summarizing documents from diverse domains. To ensure that most errors are flagged by the system, we propose a method that can increase the error recall while minimizing impact on precision. We will release our tool (GenAudit) and fact-checking model for public use.

Title: Offline Multi-task Transfer RL with Representational Penalization

Authors: Avinandan Bose, Simon Shaolei Du, Maryam Fazel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.12570
Pdf URL: https://arxiv.org/pdf/2402.12570
Copy Paste: [[2402.12570]] Offline Multi-task Transfer RL with Representational Penalization(https://arxiv.org/abs/2402.12570)
Keywords: agent
Abstract: We study the problem of representation transfer in offline Reinforcement Learning (RL), where a learner has access to episodic data from a number of source tasks collected a priori, and aims to learn a shared representation to be used in finding a good policy for a target task. Unlike in online RL where the agent interacts with the environment while learning a policy, in the offline setting there cannot be such interactions in either the source tasks or the target task; thus multi-task offline RL can suffer from incomplete coverage. We propose an algorithm to compute pointwise uncertainty measures for the learnt representation, and establish a data-dependent upper bound for the suboptimality of the learnt policy for the target task. Our algorithm leverages the collective exploration done by source tasks to mitigate poor coverage at some points by a few tasks, thus overcoming the limitation of needing uniformly good coverage for a meaningful transfer by existing offline algorithms. We complement our theoretical results with empirical evaluation on a rich-observation MDP which requires many samples for complete coverage. Our findings illustrate the benefits of penalizing and quantifying the uncertainty in the learnt representation.

Title: Evolving AI Collectives to Enhance Human Diversity and Enable Self-Regulation

Authors: Shiyang Lai, Yujin Potter, Junsol Kim, Richard Zhuang, Dawn Song, James Evans
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2402.12590
Pdf URL: https://arxiv.org/pdf/2402.12590
Copy Paste: [[2402.12590]] Evolving AI Collectives to Enhance Human Diversity and Enable Self-Regulation(https://arxiv.org/abs/2402.12590)
Keywords: language model
Abstract: Large language models steer their behaviors based on texts generated by others. This capacity and their increasing prevalence in online settings portend that they will intentionally or unintentionally "program" one another and form emergent AI subjectivities, relationships, and collectives. Here, we call upon the research community to investigate these "society-like" properties of interacting artificial intelligences to increase their rewards and reduce their risks for human society and the health of online environments. We use a simple model and its outputs to illustrate how such emergent, decentralized AI collectives can expand the bounds of human diversity and reduce the risk of toxic, anti-social behavior online. Finally, we discuss opportunities for AI self-moderation and address ethical issues and design challenges associated with creating and maintaining decentralized AI collectives.

Title: Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation

Authors: Joseph Marvin Imperial, Gail Forey, Harish Tayyar Madabushi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12593
Pdf URL: https://arxiv.org/pdf/2402.12593
Copy Paste: [[2402.12593]] Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation(https://arxiv.org/abs/2402.12593)
Keywords: language model, gpt
Abstract: Domain experts across engineering, healthcare, and education follow strict standards for producing quality content such as technical manuals, medication instructions, and children's reading materials. However, current work in controllable text generation has yet to explore using these standards as references for control. Towards this end, we introduce Standardize, a retrieval-style in-context learning-based framework to guide large language models to align with expert-defined standards. Focusing on English language standards in the education domain as a use case, we consider the Common European Framework of Reference for Languages (CEFR) and Common Core Standards (CCS) for the task of open-ended content generation. Our findings show that models can gain a 40% to 100% increase in precise accuracy for Llama2 and GPT-4, respectively, demonstrating that extracting knowledge artifacts from standards and integrating them into the generation process can effectively guide models to produce better standard-aligned content.

Title: Reflect-RL: Two-Player Online RL Fine-Tuning for LMs

Authors: Runlong Zhou, Simon S. Du, Beibin Li
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.12621
Pdf URL: https://arxiv.org/pdf/2402.12621
Copy Paste: [[2402.12621]] Reflect-RL: Two-Player Online RL Fine-Tuning for LMs(https://arxiv.org/abs/2402.12621)
Keywords: language model, gpt, prompt
Abstract: As language models (LMs) demonstrate their capabilities in various fields, their application to tasks requiring multi-round interactions has become increasingly popular. These tasks usually have complex dynamics, so supervised fine-tuning (SFT) on a limited offline dataset does not yield good performance. However, only a few works have attempted to directly train LMs within interactive decision-making environments. We aim to create an effective mechanism to fine-tune LMs with online reinforcement learning (RL) in these environments. We propose Reflect-RL, a two-player system to fine-tune an LM using online RL, where a frozen reflection model assists the policy model. To generate data for the warm-up SFT stage, we use negative example generation to enhance the error-correction ability of the reflection model. Furthermore, we designed single-prompt action enumeration and applied curriculum learning to allow the policy model to learn more efficiently. Empirically, we verify that Reflect-RL outperforms SFT and online RL without reflection. Test results indicate that GPT-2-xl after Reflect-RL also outperforms untuned pre-trained LMs, such as Mistral 7B.

Title: Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation

Authors: Kristian Lum, Jacy Reese Anthis, Chirag Nagpal, Alexander D'Amour
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2402.12649
Pdf URL: https://arxiv.org/pdf/2402.12649
Copy Paste: [[2402.12649]] Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation(https://arxiv.org/abs/2402.12649)
Keywords: language model, llm
Abstract: Bias benchmarks are a popular method for studying the negative impacts of bias in LLMs, yet there has been little empirical investigation of whether these benchmarks are actually indicative of how harm may manifest in the real world. In this work, we study the correspondence between such decontextualized "trick tests" and evaluations that are more grounded in Realistic Use and Tangible Effects (i.e., RUTEd evaluations). We explore this correlation in the context of gender-occupation bias--a popular genre of bias evaluation. We compare three de-contextualized evaluations adapted from the current literature to three analogous RUTEd evaluations applied to long-form content generation. We conduct each evaluation for seven instruction-tuned LLMs. For the RUTEd evaluations, we conduct repeated trials of three text generation tasks: children's bedtime stories, user personas, and English language learning exercises. We found no correspondence between trick tests and RUTEd evaluations. Specifically, selecting the least biased model based on the de-contextualized results coincides with selecting the model with the best performance on RUTEd evaluations only as often as random chance. We conclude that evaluations that are not based in realistic use are likely insufficient to mitigate and assess bias and real-world harms.

Title: OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Authors: Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2402.12654
Pdf URL: https://arxiv.org/pdf/2402.12654
Copy Paste: [[2402.12654]] OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification(https://arxiv.org/abs/2402.12654)
Keywords: hallucination
Abstract: There has been an increasing interest in large speech models that can perform multiple speech processing tasks in a single model. Such models usually adopt the encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 25% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference. OWSM-CTC also improves the long-form ASR result with 20x speed-up. We will publicly release our codebase, pre-trained model, and training logs to promote open science in speech foundation models.
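As a rough illustration of why a CTC-based encoder-only model can decode without an autoregressive loop, the sketch below shows generic greedy CTC decoding: take the best label per frame, merge consecutive repeats, then drop blanks. It is not taken from the OWSM-CTC codebase; the blank index, the toy vocabulary, and the function name `ctc_greedy_decode` are assumptions made for this example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Collapse a per-frame argmax path into an output label sequence."""
    # 1. Pick the best label independently for every frame (no autoregression).
    path = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    # 2. Merge consecutive repeats, then 3. drop blank symbols.
    return [label for label, _ in groupby(path) if label != blank]


# Toy input: 6 frames over a vocabulary {0: blank, 1: "a", 2: "b"}.
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # a (a new token, because a blank separates it)
    [0.1, 0.1, 0.8],    # b
    [0.6, 0.2, 0.2],    # blank
]
decoded = ctc_greedy_decode(frames)
print(decoded)  # [1, 1, 2]
```

Every frame is scored independently, so all frames can be processed in parallel, which is the general property behind the inference speed-ups reported for non-autoregressive models.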

Title: HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts

Authors: Hao Zhao, Zihan Qiu, Huijia Wu, Zili Wang, Zhaofeng He, Jie Fu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12656
Pdf URL: https://arxiv.org/pdf/2402.12656
Copy Paste: [[2402.12656]] HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts(https://arxiv.org/abs/2402.12656)
Keywords: language model
Abstract: The Mixture of Experts (MoE) for language models has been proven effective in augmenting the capacity of models by dynamically routing each input token to a specific subset of experts for processing. Despite this success, most existing methods face a challenge in balancing sparsity against the availability of expert knowledge: enhancing performance through increased use of expert knowledge often results in diminishing sparsity during expert selection. To mitigate this contradiction, we propose HyperMoE, a novel MoE framework built upon Hypernetworks. This framework integrates the computational processes of MoE with the concept of knowledge transferring in multi-task learning. Specific modules generated based on the information of unselected experts serve as supplementary information, which allows the knowledge of experts not selected to be used while maintaining selection sparsity. Our comprehensive empirical evaluations across multiple datasets and backbones establish that HyperMoE significantly outperforms existing MoE methods under identical conditions concerning the number of experts.
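For readers unfamiliar with the routing step this abstract refers to, here is a minimal sketch of standard top-k routing, in which each token is sent to the k highest-scoring experts and their outputs are mixed with renormalized router weights. This shows only the baseline mechanism, not HyperMoE's hypernetwork modules; the toy experts and the names `top_k_route` and `moe_layer` are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)  # unselected experts are never run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Four toy "experts" that merely scale the input; purely illustrative.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
mixed = moe_layer([1.0, -1.0], logits, experts, k=2)
print(routes)
print(mixed)
```

Only k of the experts are evaluated per token, which is the sparsity the abstract describes; the knowledge of the remaining experts is simply unused in this baseline.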

Title: The FinBen: An Holistic Financial Benchmark for Large Language Models

Authors: Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, Zhiyang Deng, Yuechen Jiang, Zhiyuan Yao, Haohang Li, Yangyang Yu, Gang Hu, Jiajia Huang, Xiao-Yang Liu, Alejandro Lopez-Lira, Benyou Wang, Yanzhao Lai, Hao Wang, Min Peng, Sophia Ananiadou, Jimin Huang
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2402.12659
Pdf URL: https://arxiv.org/pdf/2402.12659
Copy Paste: [[2402.12659]] The FinBen: An Holistic Financial Benchmark for Large Language Models(https://arxiv.org/abs/2402.12659)
Keywords: language model, gpt, llm, chat
Abstract: LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of thorough evaluations and the complexity of financial tasks. This, along with the rapid development of LLMs, highlights the urgent need for a systematic financial evaluation benchmark for LLMs. In this paper, we introduce FinBen, the first comprehensive open-sourced evaluation benchmark, specifically designed to thoroughly assess the capabilities of LLMs in the financial domain. FinBen encompasses 35 datasets across 23 financial tasks, organized into three spectrums of difficulty inspired by the Cattell-Horn-Carroll theory, to evaluate LLMs' cognitive abilities in inductive reasoning, associative memory, quantitative reasoning, crystallized intelligence, and more. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals insights into their strengths and limitations within the financial domain. The findings indicate that GPT-4 leads in quantification, extraction, numerical reasoning, and stock trading, while Gemini shines in generation and forecasting; however, both struggle with complex extraction and forecasting, showing a clear need for targeted enhancements. Instruction tuning boosts simple task performance but falls short in improving complex reasoning and forecasting abilities. FinBen seeks to continuously evaluate LLMs in finance, fostering AI development with regular updates of tasks and models.

Title: SoftQE: Learned Representations of Queries Expanded by LLMs

Authors: Varad Pimpalkhute, John Heyer, Xusen Yin, Sameer Gupta
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12663
Pdf URL: https://arxiv.org/pdf/2402.12663
Copy Paste: [[2402.12663]] SoftQE: Learned Representations of Queries Expanded by LLMs(https://arxiv.org/abs/2402.12663)
Keywords: language model, llm
Abstract: We investigate the integration of Large Language Models (LLMs) into query encoders to improve dense retrieval without increasing latency and cost, by circumventing the dependency on LLMs at inference time. SoftQE incorporates knowledge from LLMs by mapping embeddings of input queries to those of the LLM-expanded queries. While improvements over various strong baselines on in-domain MS-MARCO metrics are marginal, SoftQE improves performance by 2.83 absolute percentage points on average on five out-of-domain BEIR tasks.
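The idea of mapping raw-query embeddings to those of LLM-expanded queries can be pictured as a small regression problem. The sketch below assumes a mean-squared-error objective and a toy linear encoder; neither is stated in the abstract. The vectors `x` (features of the raw query) and `target` (embedding of the expanded query, computed offline) are made-up values for illustration.

```python
def encode(W, x):
    """Toy linear query encoder: embedding = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mse(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distill_step(W, x, target, lr=0.1):
    """One gradient-descent step pulling encode(W, x) toward target."""
    pred = encode(W, x)
    n = len(pred)
    # d(mse)/dW[i][j] = (2 / n) * (pred[i] - target[i]) * x[j]
    return [
        [w - lr * 2.0 / n * (p - t) * xj for w, xj in zip(row, x)]
        for row, p, t in zip(W, pred, target)
    ]


x = [1.0, 0.5]               # hypothetical features of the raw query
target = [0.2, -0.4, 0.9]    # hypothetical embedding of the LLM-expanded query
W = [[0.0, 0.0] for _ in target]

losses = []
for _ in range(50):
    losses.append(mse(encode(W, x), target))
    W = distill_step(W, x, target)

print(losses[0], losses[-1])  # the loss shrinks as the encoder learns the mapping
```

Because the targets are computed ahead of time, the LLM is needed only while training; at query time the encoder runs alone, which matches the abstract's point about avoiding LLM latency and cost at inference.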

Title: Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies

Authors: Xiangyu Liu, Chenghao Deng, Yanchao Sun, Yongyuan Liang, Furong Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.12673
Pdf URL: https://arxiv.org/pdf/2402.12673
Copy Paste: [[2402.12673]] Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies(https://arxiv.org/abs/2402.12673)
Keywords: prompt
Abstract: In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond only worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic hardness in achieving sublinear regret when the baseline policy is from a general continuous policy class, $\Pi$. This finding prompts us to refine the baseline policy class $\Pi$ prior to test time, aiming for efficient adaptation within a finite policy class $\Tilde{\Pi}$, which can resort to an adversarial bandit subroutine. In light of the importance of a small, finite $\Tilde{\Pi}$, we propose a novel training-time algorithm to iteratively discover non-dominated policies, forming a near-optimal and minimal $\Tilde{\Pi}$, thereby ensuring both robustness and test-time efficiency. Empirical validation on MuJoCo corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios.
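To make the "adversarial bandit subroutine" over a finite policy class concrete, the sketch below implements EXP3, a standard adversarial bandit algorithm, treating each candidate policy as an arm. The abstract does not say which bandit algorithm the authors use, so EXP3 is an assumption here, and the reward function is a toy stand-in for the episodic return of a policy under attack.

```python
import math
import random


def exp3(num_arms, reward_fn, rounds, gamma=0.1, seed=0):
    """Run EXP3 and return how often each arm (candidate policy) was played."""
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    pulls = [0] * num_arms
    for t in range(rounds):
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)      # must lie in [0, 1]
        estimate = reward / probs[arm]  # importance-weighted reward estimate
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale to avoid overflow; the induced probabilities are unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
        pulls[arm] += 1
    return pulls


# Toy setting: three candidate policies, of which only policy 2 earns reward.
def toy_reward(t, arm):
    return 1.0 if arm == 2 else 0.0


pulls = exp3(num_arms=3, reward_fn=toy_reward, rounds=2000)
print(pulls)
```

The cost of such a subroutine grows with the number of arms, which is consistent with the abstract's emphasis on keeping the refined policy class small and finite.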

Title: XRL-Bench: A Benchmark for Evaluating and Comparing Explainable Reinforcement Learning Techniques

Authors: Yu Xiong, Zhipeng Hu, Ye Huang, Runze Wu, Kai Guan, Xingchen Fang, Ji Jiang, Tianze Zhou, Yujing Hu, Haoyu Liu, Tangjie Lyu, Changjie Fan
Subjects: cs.AI
Abstract URL: https://arxiv.org/abs/2402.12685
Pdf URL: https://arxiv.org/pdf/2402.12685
Copy Paste: [[2402.12685]] XRL-Bench: A Benchmark for Evaluating and Comparing Explainable Reinforcement Learning Techniques(https://arxiv.org/abs/2402.12685)
Keywords: agent
Abstract: Reinforcement Learning (RL) has demonstrated substantial potential across diverse fields, yet understanding its decision-making process, especially in real-world scenarios where rationality and safety are paramount, is an ongoing challenge. This paper delves into Explainable RL (XRL), a subfield of Explainable AI (XAI) aimed at unravelling the complexities of RL models. Our focus rests on state-explaining techniques, a crucial subset within XRL methods, as they reveal the underlying factors influencing an agent's actions at any given time. Despite their significant role, the lack of a unified evaluation framework hinders assessment of their accuracy and effectiveness. To address this, we introduce XRL-Bench, a unified standardized benchmark tailored for the evaluation and comparison of XRL methods, encompassing three main modules: standard RL environments, explainers based on state importance, and standard evaluators. XRL-Bench supports both tabular and image data for state explanation. We also propose TabularSHAP, an innovative and competitive XRL method. We demonstrate the practical utility of TabularSHAP in real-world online gaming services and offer an open-source benchmark platform for the straightforward implementation and evaluation of XRL methods. Our contributions facilitate the continued progression of XRL technology.

Title: Tree-Planted Transformers: Large Language Models with Implicit Syntactic Supervision

Authors: Ryo Yoshida, Taiga Someya, Yohei Oseki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12691
Pdf URL: https://arxiv.org/pdf/2402.12691
Copy Paste: [[2402.12691]] Tree-Planted Transformers: Large Language Models with Implicit Syntactic Supervision(https://arxiv.org/abs/2402.12691)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success thanks to scalability on large text corpora, but have some drawbacks in training efficiency. In contrast, Syntactic Language Models (SLMs) can be trained efficiently to reach relatively high performance thanks to syntactic supervision, but have trouble with scalability. Thus, given these complementary advantages of LLMs and SLMs, it is necessary to develop an architecture that integrates the scalability of LLMs with the training efficiency of SLMs, namely Syntactic Large Language Models (SLLMs). In this paper, we propose a novel method dubbed tree-planting: implicitly "plant" trees into attention weights of Transformer LMs to reflect syntactic structures of natural language. Specifically, Transformer LMs trained with tree-planting will be called Tree-Planted Transformers (TPT), which learn syntax on small treebanks via tree-planting and then scale on large text corpora via continual learning with syntactic scaffolding. Targeted syntactic evaluations on the SyntaxGym benchmark demonstrated that TPTs, despite the lack of explicit syntactic supervision, significantly outperformed various SLMs with explicit syntactic supervision that generate hundreds of syntactic structures in parallel, suggesting that tree-planting and TPTs are a promising foundation for SLLMs.

Title: FormulaQA: A Question Answering Dataset for Formula-Based Numerical Reasoning

Authors: Xiao Li, Sichen Liu, Bolin Zhu, Yin Zhu, Yiwei Liu, Gong Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12692
Pdf URL: https://arxiv.org/pdf/2402.12692
Copy Paste: [[2402.12692]] FormulaQA: A Question Answering Dataset for Formula-Based Numerical Reasoning(https://arxiv.org/abs/2402.12692)
Keywords: llm, chain-of-thought
Abstract: The application of formulas is a fundamental ability of humans when addressing numerical reasoning problems. However, existing numerical reasoning datasets seldom explicitly indicate the formulas employed during the reasoning steps. To bridge this gap, we propose a question answering dataset for formula-based numerical reasoning called FormulaQA, built from junior high school physics examinations. We further conduct evaluations on LLMs with sizes ranging from 7B to over 100B parameters utilizing zero-shot and few-shot chain-of-thought methods, and we explore the approach of using retrieval-augmented LLMs when providing an external formula database. We also fine-tune smaller models with sizes not exceeding 2B. Our empirical findings underscore the significant potential for improvement in existing models when applied to our complex, formula-driven FormulaQA.
摘要：公式的应用是人类解决数字推理问题的基本能力。然而，现有的数值推理数据集很少明确指出推理步骤中使用的公式。为了弥补这一差距，我们提出了一个用于基于公式的数值推理的问答数据集，称为 FormulaQA，来自初中物理考试。我们进一步利用零样本和少样本思维链方法对参数大小从 7B 到超过 100B 参数的 LLM 进行评估，并探索了在提供外部公式数据库时使用检索增强的 LLM 的方法。我们还对尺寸不超过2B的较小模型进行微调。我们的实证研究结果强调了当应用于我们复杂的、公式驱动的 FormulaQA 时，现有模型具有巨大的改进潜力。

Title: Are Large Language Models Rational Investors?

Authors: Yuhang Zhou, Yuchen Ni, Xiang Liu, Jian Zhang, Sen Liu, Guangnan Ye, Hongfeng Chai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12713
Pdf URL: https://arxiv.org/pdf/2402.12713
Copy Paste: [[2402.12713]] Are Large Language Models Rational Investors?(https://arxiv.org/abs/2402.12713)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are progressively being adopted in financial analysis to harness their extensive knowledge base for interpreting complex market data and trends. However, their application in the financial domain is challenged by intrinsic biases (i.e., risk-preference bias) and a superficial grasp of market intricacies, underscoring the need for a thorough assessment of their financial insight. This study introduces a novel framework, Financial Bias Indicators (FBI), to critically evaluate the financial rationality of LLMs, focusing on their ability to discern and navigate the subtleties of financial information and to identify any irrational biases that might skew market analysis. Our research adopts an innovative methodology to measure financial rationality, integrating principles of behavioral finance to scrutinize the biases and decision-making patterns of LLMs. We conduct a comprehensive evaluation of 19 leading LLMs, considering factors such as model scale, training datasets, and input strategies. The findings reveal varying degrees of financial irrationality among the models, influenced by their design and training. Models trained specifically on financial datasets might exhibit greater irrationality, and it's possible that even larger financial language models (FinLLMs) could display more biases than smaller, more generalized models. These outcomes provide profound insights into how these elements affect the financial rationality of LLMs, indicating that targeted training and structured input methods could improve model performance. This work enriches our understanding of LLMs' strengths and weaknesses in financial applications, laying the groundwork for the development of more dependable and rational financial analysis tools.
摘要：大型语言模型 (LLM) 逐渐被金融分析所采用，以利用其广泛的知识库来解释复杂的市场数据和趋势。然而，它们在金融领域的应用受到内在偏差（即风险偏好偏差）和对市场复杂性的肤浅把握的挑战，这凸显了对其金融洞察力进行彻底评估的必要性。本研究引入了一种新颖的框架，即财务偏差指标（FBI），以批判性地评估法学硕士的财务合理性，重点关注他们辨别和驾驭财务信息微妙之处的能力，以及识别可能扭曲市场分析的任何非理性偏差的能力。我们的研究采用创新的方法来衡量财务合理性，结合行为金融学原理来审查法学硕士的偏见和决策模式。我们对 19 所领先的法学硕士进行了综合评估，考虑了模型规模、训练数据集、输入策略等因素。研究结果表明，受模型设计和训练的影响，模型之间存在不同程度的财务不合理性。专门针对金融数据集训练的模型可能会表现出更大的不合理性，而且即使是更大的金融语言模型 (FinLLM) 也可能比更小、更通用的模型显示出更多的偏差。这一结果为这些因素如何影响法学硕士的财务合理性提供了深刻的见解，表明有针对性的培训和结构化输入方法可以提高模型性能。这项工作丰富了我们对法学硕士在金融应用方面的优势和劣势的理解，为开发更可靠、更合理的金融分析工具奠定了基础。

Title: UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation

Authors: Shubhashis Roy Dipta, Sai Vallurupalli
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12730
Pdf URL: https://arxiv.org/pdf/2402.12730
Copy Paste: [[2402.12730]] UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation(https://arxiv.org/abs/2402.12730)
Keywords: language model, llm
Abstract: This paper describes the system we developed for SemEval-2024 Task 1, "Semantic Textual Relatedness for African and Asian Languages." The aim of the task is to build a model that can identify semantic textual relatedness (STR) between two sentences of a target language belonging to a collection of African and Asian languages. We participated in Subtasks A and C and explored supervised and cross-lingual training leveraging large language models (LLMs). Pre-trained large language models have been extensively used for machine translation and semantic similarity. Using a combination of machine translation and sentence embedding LLMs, we developed a unified STR model, TranSem, for subtask A and fine-tuned the T5 family of models on the STR data, FineSem, for use in subtask C. Our model results for 7 languages in subtask A were better than the official baseline for 3 languages and on par with the baseline for the remaining 4 languages. Across the 12 languages in subtask C, our model placed 1st for Afrikaans, 2nd for Indonesian, and 3rd for English, with low performance on the remaining 9 languages.
摘要：本文介绍了我们为 SemEval-2024 任务 1“非洲和亚洲语言的语义文本相关性”开发的系统。该任务的目的是建立一个模型，可以识别属于非洲和亚洲语言集合的目标语言的两个句子之间的语义文本相关性（STR）。我们参与了子任务 A 和 C，并探索利用大型语言模型 (LLM) 的监督和跨语言训练。预训练的大型语言模型已广泛用于机器翻译和语义相似性。结合使用机器翻译和句子嵌入 LLM，我们为子任务 A 开发了统一的 STR 模型 TranSem，并在 STR 数据上微调了 T5 系列模型 FineSem，以用于子任务 C。我们的模型结果为 7子任务 A 中的语言优于 3 种语言的官方基线，并与其余 4 种语言的基线持平。我们在子任务 C 中针对 12 种语言的模型结果显示，非洲人排名第一，印度尼西亚人排名第二，英语排名第三，其余 9 种语言的表现较低。

Title: Can Large Language Models be Used to Provide Psychological Counselling? An Analysis of GPT-4-Generated Responses Using Role-play Dialogues

Authors: Michimasa Inaba, Mariko Ukiyo, Keiko Takamizo
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2402.12738
Pdf URL: https://arxiv.org/pdf/2402.12738
Copy Paste: [[2402.12738]] Can Large Language Models be Used to Provide Psychological Counselling? An Analysis of GPT-4-Generated Responses Using Role-play Dialogues(https://arxiv.org/abs/2402.12738)
Keywords: language model, gpt
Abstract: Mental health care poses an increasingly serious challenge to modern societies. In this context, there has been a surge in research that utilizes information technologies to address mental health problems, including those aiming to develop counseling dialogue systems. However, there is a need for more evaluations of the performance of counseling dialogue systems that use large language models. For this study, we collected counseling dialogue data via role-playing scenarios involving expert counselors, and the utterances were annotated with the intentions of the counselors. To determine the feasibility of a dialogue system in real-world counseling scenarios, third-party counselors evaluated the appropriateness of responses from human counselors and those generated by GPT-4 in identical contexts in role-play dialogue data. Analysis of the evaluation results showed that the responses generated by GPT-4 were competitive with those of human counselors.
摘要：精神卫生保健对现代社会提出了日益严峻的挑战。在此背景下，利用信息技术解决心理健康问题的研究激增，其中包括旨在开发咨询对话系统的研究。然而，需要对使用大型语言模型的咨询对话系统的性能进行更多评估。在这项研究中，我们通过专家咨询师参与的角色扮演场景收集了咨询对话数据，并且用咨询师的意图对话语进行了注释。为了确定对话系统在现实咨询场景中的可行性，第三方咨询师评估了人类咨询师的响应和 GPT-4 在角色扮演对话数据的相同上下文中生成的响应的适当性。评估结果分析表明，GPT-4 生成的响应与人类咨询师的响应具有竞争力。

Title: Me LLaMA: Foundation Large Language Models for Medical Applications

Authors: Qianqian Xie, Qingyu Chen, Aokun Chen, Cheng Peng, Yan Hu, Fongci Lin, Xueqing Peng, Jimin Huang, Jeffrey Zhang, Vipina Keloth, Huan He, Lucila Ohno-Machado, Yonghui Wu, Hua Xu, Jiang Bian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12749
Pdf URL: https://arxiv.org/pdf/2402.12749
Copy Paste: [[2402.12749]] Me LLaMA: Foundation Large Language Models for Medical Applications(https://arxiv.org/abs/2402.12749)
Keywords: language model, gpt, llm, chat
Abstract: Recent large language models (LLMs) like ChatGPT and LLaMA have shown great promise in many AI applications. However, their performance on medical tasks is suboptimal and can be further improved by training on large domain-specific datasets. This study introduces Me LLaMA, a medical LLM family including foundation models - Me LLaMA 13/70B and their chat-enhanced versions - Me LLaMA 13/70B-chat, developed through the continual pre-training and instruction tuning of LLaMA2 using large medical data. Our domain-specific data suite for training and evaluation includes a large-scale continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a medical evaluation benchmark (MIBE) across six tasks with 14 datasets. Our extensive evaluation using MIBE shows that Me LLaMA models surpass existing open-source medical LLMs in zero-shot and few-shot learning and outperform commercial giants like ChatGPT on 6 out of 8 datasets and GPT-4 on 3 out of 8 datasets. In addition, we empirically investigated the catastrophic forgetting problem, and our results show that Me LLaMA models outperform other medical LLMs. Me LLaMA is one of the first and largest open-source foundational LLMs designed for the medical domain, using both biomedical and clinical data. It exhibits superior performance across both general and medical tasks compared to other medical LLMs, rendering it an attractive choice for medical AI applications. All resources are available at: https://github.com/BIDS-Xu-Lab/Me-LLaMA.
摘要：最近的大型语言模型（LLM），如 ChatGPT 和 LLaMA，在许多人工智能应用中显示出了巨大的前景。然而，它们在医疗任务上的表现并不理想，可以通过对大型特定领域数据集进行训练来进一步提高。本研究介绍了 Me LLaMA，一个医学 LLM 家族，包括基础模型 - Me LLaMA 13/70B 及其聊天增强版本 - Me LLaMA 13/70B-chat，通过使用大量医学数据对 LLaMA2 进行持续预训练和指令调整而开发出来。我们用于训练和评估的特定领域数据套件，包括具有 129B 个令牌的大规模连续预训练数据集、具有 214k 样本的指令调整数据集，以及跨 6 个任务和 14 个数据集的医学评估基准 (MIBE)。我们使用 MIBE 进行的广泛评估表明，Me LLaMA 模型在零样本和少样本学习方面超越了现有的开源医学 LLM，并在 8 个数据集中的 6 个上优于 ChatGPT 等商业巨头，在 8 个数据集中的 3 个上优于 GPT-4。此外，我们还实证研究了灾难性遗忘问题，结果表明 Me LLaMA 模型优于其他医学法学硕士。 Me LLaMA 是第一个也是最大的开源基础法学硕士之一，专为医学领域设计，使用生物医学和临床数据。与其他医学法学硕士相比，它在一般任务和医疗任务中表现出卓越的性能，使其成为医疗人工智能应用的有吸引力的选择。所有资源均可在以下网址获取：https://github.com/BIDS-Xu-Lab/Me-LLaMA。

Title: Acknowledgment of Emotional States: Generating Validating Responses for Empathetic Dialogue

Authors: Zi Haur Pang, Yahui Fu, Divesh Lala, Keiko Ochi, Koji Inoue, Tatsuya Kawahara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12770
Pdf URL: https://arxiv.org/pdf/2402.12770
Copy Paste: [[2402.12770]] Acknowledgment of Emotional States: Generating Validating Responses for Empathetic Dialogue(https://arxiv.org/abs/2402.12770)
Keywords: gpt, chat
Abstract: In the realm of human-AI dialogue, the facilitation of empathetic responses is important. Validation is one of the key communication techniques in psychology, which entails recognizing, understanding, and acknowledging others' emotional states, thoughts, and actions. This study introduces the first framework designed to engender empathetic dialogue with validating responses. Our approach incorporates a tripartite module system: 1) validation timing detection, 2) users' emotional state identification, and 3) validating response generation. Utilizing the Japanese EmpatheticDialogues dataset - a text-based dialogue dataset consisting of 8 emotional categories from Plutchik's wheel of emotions - the Task Adaptive Pre-Training (TAPT) BERT-based model outperforms both the random baseline and ChatGPT, in terms of F1-score, in all modules. Our model's efficacy is further confirmed on the TUT Emotional Storytelling Corpus (TESC), a speech-based dialogue dataset, where it surpasses both the random baseline and ChatGPT. This consistent performance across both textual and speech-based dialogues underscores the effectiveness of our framework in fostering empathetic human-AI communication.
摘要：在人机对话领域，促进同理心反应非常重要。验证是心理学中的关键沟通技巧之一，它需要识别、理解和承认他人的情绪状态、想法和行为。本研究介绍了第一个旨在引发同理心对话并验证响应的框架。我们的方法采用了三方模块系统：1）验证定时检测，2）用户情绪状态识别，以及3）验证响应生成。利用 Japanese EmpatheticDialogues 数据集（基于文本的对话数据集，由 Plutchik 情绪轮中的 8 个情绪类别组成），基于任务自适应预训练 (TAPT) BERT 的模型在 F1 分数方面优于随机基线和 ChatGPT 性能，在所有模块中。通过超越随机基线和 ChatGPT，我们模型的有效性在 TUT 情感讲故事语料库 (TESC)（基于语音的对话数据集）的应用中得到了进一步验证。这种基于文本和基于语音的对话的一致表现强调了我们的框架在促进具有同理心的人类与人工智能交流方面的有效性。

Title: Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations

Authors: Guan-Ting Lin, Cheng-Han Chiang, Hung-yi Lee
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2402.12786
Pdf URL: https://arxiv.org/pdf/2402.12786
Copy Paste: [[2402.12786]] Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations(https://arxiv.org/abs/2402.12786)
Keywords: language model, llm
Abstract: In spoken dialogue, even if two current turns are the same sentence, their responses might still differ when they are spoken in different styles. The spoken styles, containing paralinguistic and prosodic information, mark the most significant difference between text and speech modality. When using text-only LLMs to model spoken dialogue, text-only LLMs cannot give different responses based on the speaking style of the current turn. In this paper, we focus on enabling LLMs to listen to the speaking styles and respond properly. Our goal is to teach the LLM that "even if the sentences are identical, if they are spoken in different styles, their corresponding responses might be different". Since there is no suitable dataset for achieving this goal, we collect a speech-to-speech dataset, StyleTalk, with the following desired characteristics: when two current speeches have the same content but are spoken in different styles, their responses will be different. To teach LLMs to understand and respond properly to the speaking styles, we propose the Spoken-LLM framework that can model the linguistic content and the speaking styles. We train Spoken-LLM using the StyleTalk dataset and devise a two-stage training pipeline to help the Spoken-LLM better learn the speaking styles. Based on extensive experiments, we show that Spoken-LLM outperforms text-only baselines and prior speech LLM methods.
摘要：在口语对话中，即使当前的两个回合是同一个句子，当他们以不同的风格说话时，他们的反应仍然可能不同。包含副语言和韵律信息的口语风格标志着文本和语音模态之间最显着的差异。当使用纯文本 LLM 来模拟口语对话时，纯文本 LLM 无法根据当前回合的说话风格给出不同的响应。在本文中，我们的重点是让法学硕士能够倾听演讲风格并做出正确的反应。我们的目标是教导法学硕士“即使句子相同，如果以不同的风格说出来，它们相应的反应也可能不同”。由于没有合适的数据集来实现这一目标，我们收集了一个语音到语音数据集StyleTalk，它具有以下期望的特征：当两个当前语音具有相同的内容但以不同的风格说话时，它们的响应会不同。为了教导法学硕士理解并正确应对说话风格，我们提出了 Spoken-LLM 框架，可以对语言内容和说话风格进行建模。我们使用 StyleTalk 数据集训练 Spoken-LLM，并设计一个两阶段训练流程来帮助 Spoken-LLM 更好地学习说话风格。基于大量实验，我们表明 Spoken-LLM 优于纯文本基线和先前的语音 LLM 方法。

Title: Few shot clinical entity recognition in three languages: Masked language models outperform LLM prompting

Authors: Marco Naguib, Xavier Tannier, Aurélie Névéol
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12801
Pdf URL: https://arxiv.org/pdf/2402.12801
Copy Paste: [[2402.12801]] Few shot clinical entity recognition in three languages: Masked language models outperform LLM prompting(https://arxiv.org/abs/2402.12801)
Keywords: language model, llm, prompt
Abstract: Large Language Models are becoming the go-to solution for many natural language processing tasks, including in specialized domains where their few-shot capacities are expected to yield high performance in low-resource settings. Herein, we aim to assess the performance of Large Language Models for few-shot clinical entity recognition in multiple languages. We evaluate named entity recognition in English, French and Spanish using 8 in-domain (clinical) and 6 out-of-domain gold standard corpora. We assess the performance of 10 auto-regressive language models using prompting and 16 masked language models used for text encoding in a biLSTM-CRF supervised tagger. We create a few-shot set-up by limiting the amount of annotated data available to 100 sentences. Our experiments show that although larger prompt-based models tend to achieve competitive F-measure for named entity recognition outside the clinical domain, this level of performance does not carry over to the clinical domain where lighter supervised taggers relying on masked language models perform better, even with the performance drop incurred from the few-shot set-up. In all experiments, the CO2 impact of masked language models is lower than that of auto-regressive models. Results are consistent over the three languages and suggest that few-shot learning using Large Language Models is not production ready for named entity recognition in the clinical domain. Instead, models could be used for speeding up the production of gold standard annotated data.
摘要：大型语言模型正在成为许多自然语言处理任务的首选解决方案，包括在特殊领域，在这些领域中，它们的少数镜头能力有望在资源匮乏的环境中产生高性能。在这里，我们的目标是评估大型语言模型在多种语言中少量临床实体识别的性能。我们使用 8 个域内（临床）和 6 个域外黄金标准语料库评估英语、法语和西班牙语的命名实体识别。我们评估了 10 个使用提示的自回归语言模型和 16 个用于 biLSTM-CRF 监督标注器中文本编码的屏蔽语言模型的性能。我们通过将可用的注释数据量限制为 100 个句子来创建少量镜头设置。我们的实验表明，虽然较大的基于提示的模型往往会在临床领域之外的命名实体识别方面实现有竞争力的 F 度量，但这种性能水平并不会延续到临床领域，在临床领域，依赖于掩码语言模型的轻型监督标注器表现更好，即使由于几次镜头设置而导致性能下降。在所有实验中，掩蔽语言模型的 CO2 影响均低于自回归模型。三种语言的结果是一致的，这表明使用大型语言模型的小样本学习尚未准备好用于临床领域的命名实体识别。相反，模型可以用于加速黄金标准注释数据的生成。

Title: SymBa: Symbolic Backward Chaining for Multi-step Natural Language Reasoning

Authors: Jinu Lee, Wonseok Hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12806
Pdf URL: https://arxiv.org/pdf/2402.12806
Copy Paste: [[2402.12806]] SymBa: Symbolic Backward Chaining for Multi-step Natural Language Reasoning(https://arxiv.org/abs/2402.12806)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have recently demonstrated remarkable reasoning ability as in Chain-of-thought prompting, but faithful multi-step reasoning remains a challenge. We specifically focus on backward chaining, where the query is recursively decomposed using logical rules until proven. To address the limitations of current backward chaining implementations, we propose SymBa (Symbolic Backward Chaining). In SymBa, the symbolic top-down solver controls the entire proof process and the LLM is called to generate a single reasoning step only when the solver encounters a dead end. By this novel solver-LLM integration, while being able to produce an interpretable, structured proof, SymBa achieves significant improvement in performance, proof faithfulness, and efficiency in diverse multi-step reasoning benchmarks (ProofWriter, Birds-Electricity, GSM8k, CLUTRR-TF, ECtHR Article 6) compared to backward chaining baselines.
摘要：大型语言模型（LLM）最近在思想链提示中表现出了卓越的推理能力，但忠实的多步骤推理仍然是一个挑战。我们特别关注反向链接，其中使用逻辑规则递归地分解查询直到得到证明。为了解决当前反向链接实现的局限性，我们提出了 SymBa（符号反向链接）。在SymBa中，符号自上而下的求解器控制整个证明过程，并且仅当求解器遇到死胡同时才调用LLM来生成单个推理步骤。通过这种新颖的求解器-LLM 集成，在能够生成可解释的结构化证明的同时，SymBa 在各种多步骤推理基准（ProofWriter、Birds-Electricity、GSM8k、CLUTRR-TF）中实现了性能、证明可信度和效率的显着提高，ECtHR 第 6 条）与向后链接基线的比较。
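
The control flow described in the SymBa abstract (a symbolic solver that decomposes goals with rules and calls a model only at a dead end) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `prove` and `fallback`, the rule encoding, and the depth limit are all assumptions made for illustration, and the `fallback` callable merely stands in for an LLM call.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical rule encoding: each goal maps to alternative bodies,
# and a body is the list of subgoals that together prove the goal.
Rules = Dict[str, List[List[str]]]
# Hypothetical stand-in for an LLM: given a goal with no matching rule,
# it returns one list of subgoals (empty list = "this holds"), or None.
Fallback = Callable[[str], Optional[List[str]]]


def prove(goal: str, facts: Set[str], rules: Rules,
          fallback: Optional[Fallback] = None,
          depth: int = 0, max_depth: int = 8) -> bool:
    """Backward chaining: recursively decompose `goal` until facts are reached."""
    if goal in facts:
        return True
    if depth >= max_depth:  # crude guard against cyclic rules
        return False
    bodies = rules.get(goal)
    if bodies is None and fallback is not None:
        # Dead end for the symbolic solver: request a single reasoning step.
        step = fallback(goal)
        bodies = [step] if step is not None else None
    for body in bodies or []:
        if all(prove(sub, facts, rules, fallback, depth + 1, max_depth)
               for sub in body):
            return True
    return False


facts = {"bird(tweety)"}
rules = {"can_fly(tweety)": [["bird(tweety)", "healthy(tweety)"]]}
# Without a fallback the solver is stuck on healthy(tweety).
stuck = prove("can_fly(tweety)", facts, rules)
# A toy fallback that asserts healthy(tweety) lets the proof complete.
helped = prove("can_fly(tweety)", facts, rules,
               fallback=lambda g: [] if g == "healthy(tweety)" else None)
```

The solver keeps control of the proof tree throughout; the fallback contributes at most one step per dead end, which is the division of labour the abstract describes.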

Title: Scalable Decentralized Algorithms for Online Personalized Mean Estimation

Authors: Franco Galante, Giovanni Neglia, Emilio Leonardi
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2402.12812
Pdf URL: https://arxiv.org/pdf/2402.12812
Copy Paste: [[2402.12812]] Scalable Decentralized Algorithms for Online Personalized Mean Estimation(https://arxiv.org/abs/2402.12812)
Keywords: agent
Abstract: In numerous settings, agents lack sufficient data to directly learn a model. Collaborating with other agents may help, but it introduces a bias-variance trade-off when local data distributions differ. A key challenge is for each agent to identify clients with similar distributions while learning the model, a problem that remains largely unresolved. This study focuses on a simplified version of the overarching problem, where each agent collects samples from a real-valued distribution over time to estimate its mean. Existing algorithms face impractical space and time complexities (quadratic in the number of agents |A|). To address scalability challenges, we propose a framework where agents self-organize into a graph, allowing each agent to communicate with only a selected number of peers r. We introduce two collaborative mean estimation algorithms: one draws inspiration from belief propagation, while the other employs a consensus-based approach, with complexity of O(r |A| log |A|) and O(r |A|), respectively. We establish conditions under which both algorithms yield asymptotically optimal estimates and offer a theoretical characterization of their performance.
摘要：在许多情况下，智能体缺乏足够的数据来直接学习模型。与其他代理合作可能会有所帮助，但当本地数据分布不同时，它会引入偏差-方差权衡。对于每个代理来说，一个关键的挑战是在学习模型的同时识别具有相似分布的客户，这个问题在很大程度上仍未解决。本研究重点关注总体问题的简化版本，其中每个代理随时间从实值分布中收集样本以估计其平均值。现有算法面临着不切实际的空间和时间复杂性（代理 A 数量的二次方）。为了解决可扩展性挑战，我们提出了一个框架，其中代理自组织成一个图，允许每个代理仅与选定数量的对等点 r 进行通信。我们引入了两种协作均值估计算法：一种从置信传播中汲取灵感，另一种采用基于共识的方法，复杂度分别为 O( r |A| log |A|) 和 O(r |A|)。我们建立了两种算法产生渐近最优估计的条件，并提供了它们性能的理论表征。
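
To make the O(r |A|) per-round cost of a consensus-based approach concrete, here is a minimal sketch of plain consensus averaging, not the paper's algorithm. It assumes all agents share the same underlying mean, that the communication graph is a fixed ring (so r = 2), and that there are at least three agents; the function names are hypothetical.

```python
from typing import List


def ring_neighbours(i: int, n: int) -> List[int]:
    """Assumed topology: agent i talks to its two ring neighbours (r = 2)."""
    return [(i - 1) % n, (i + 1) % n]


def consensus_round(estimates: List[float]) -> List[float]:
    """One synchronous round: each agent averages itself with its r peers.

    Every agent reads r values, so a round costs O(r |A|) operations in total.
    """
    n = len(estimates)
    updated = []
    for i in range(n):
        peers = ring_neighbours(i, n)
        total = estimates[i] + sum(estimates[j] for j in peers)
        updated.append(total / (len(peers) + 1))
    return updated


def run_consensus(local_means: List[float], rounds: int) -> List[float]:
    """Repeat consensus rounds starting from each agent's local estimate."""
    estimates = list(local_means)
    for _ in range(rounds):
        estimates = consensus_round(estimates)
    return estimates


# Four agents with different local estimates converge to the global average.
final = run_consensus([0.0, 3.0, 6.0, 9.0], rounds=100)
```

Because the averaging weights on a ring are symmetric, the sum of the estimates is preserved in every round, so all agents drift toward the average of the initial local means. The harder part addressed by the paper, deciding which peers actually share a similar distribution, is deliberately left out of this sketch.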

Title: On Sensitivity of Learning with Limited Labelled Data to the Effects of Randomness: Impact of Interactions and Systematic Choices

Authors: Branislav Pecher, Ivan Srba, Maria Bielikova
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12817
Pdf URL: https://arxiv.org/pdf/2402.12817
Copy Paste: [[2402.12817]] On Sensitivity of Learning with Limited Labelled Data to the Effects of Randomness: Impact of Interactions and Systematic Choices(https://arxiv.org/abs/2402.12817)
Keywords: prompt
Abstract: While learning with limited labelled data can improve performance when the labels are lacking, it is also sensitive to the effects of uncontrolled randomness introduced by so-called randomness factors (e.g., varying order of data). We propose a method to systematically investigate the effects of randomness factors while taking the interactions between them into consideration. To measure the true effects of an individual randomness factor, our method mitigates the effects of other factors and observes how the performance varies across multiple runs. Applying our method to multiple randomness factors across in-context learning and fine-tuning approaches on 7 representative text classification tasks and meta-learning on 3 tasks, we show that: 1) disregarding interactions between randomness factors in existing works caused inconsistent findings due to incorrect attribution of the effects of randomness factors, such as disproving the consistent sensitivity of in-context learning to sample order even with random sample selection; and 2) besides mutual interactions, the effects of randomness factors, especially sample order, are also dependent on more systematic choices unexplored in existing works, such as number of classes, samples per class or choice of prompt format.
摘要：虽然在缺乏标签时使用有限的标记数据进行学习可以提高性能，但它对所谓的随机因素（例如，数据顺序变化）引入的不受控制的随机性的影响也很敏感。我们提出了一种系统研究随机因素影响的方法，同时考虑它们之间的相互作用。为了测量单个随机因素的真实影响，我们的方法减轻了其他因素的影响，并观察多次运行中性能如何变化。将我们的方法应用于 7 个代表性文本分类任务的上下文学习和微调方法以及 3 个任务的元学习中的多个随机因素，我们表明：1）忽略现有作品中随机因素之间的相互作用，导致结果不一致，因为对随机因素影响的错误归因，例如反驳情境学习对样本顺序的一致敏感性，即使是随机样本选择； 2）除了相互作用之外，随机因素的影响，特别是样本顺序，还取决于现有作品中未探索的更系统的选择，例如类的数量、每类的样本或提示格式的选择。

Title: Fine-Tuning, Prompting, In-Context Learning and Instruction-Tuning: How Many Labelled Samples Do We Need?

Authors: Branislav Pecher, Ivan Srba, Maria Bielikova
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12819
Pdf URL: https://arxiv.org/pdf/2402.12819
Copy Paste: [[2402.12819]] Fine-Tuning, Prompting, In-Context Learning and Instruction-Tuning: How Many Labelled Samples Do We Need?(https://arxiv.org/abs/2402.12819)
Keywords: language model, prompt
Abstract: When solving a task with limited labelled data, researchers can either use a general large language model without further update, or use the few examples to tune a specialised smaller model. When enough labels are available, the specialised models outperform the general ones on many NLP tasks. In this work, we aim to investigate how many labelled samples are required for the specialised models to achieve this superior performance, while taking the results variance into consideration. Observing the behaviour of prompting, in-context learning, fine-tuning and instruction-tuning, and identifying their break-even points as the number of labelled training samples increases across three tasks of varying complexity, we find that the specialised models often need only a few samples (100-1000) to be on par with or better than the general ones. At the same time, the amount of required labelled data strongly depends on the task complexity and results variance.
摘要：当解决带有有限标记数据的任务时，研究人员可以使用通用的大型语言模型而无需进一步更新，或者使用少数示例来调整专门的较小模型。当有足够的标签可用时，专用模型在许多 NLP 任务上的表现优于通用模型。在这项工作中，我们的目标是研究专用模型需要多少标记样本才能实现这种优越的性能，同时考虑结果方差。观察提示、上下文学习、微调和指令调整的行为，确定在三个不同复杂性的任务中增加标记训练样本数量时的盈亏平衡点，我们发现专门的模型通常只需要很少的样本（100-1000 美元）与一般产品相当或更好。同时，所需的标记数据量很大程度上取决于任务复杂性和结果方差。

Title: Identifying Factual Inconsistency in Summaries: Towards Effective Utilization of Large Language Model

Authors: Liyan Xu, Zhenlin Su, Mo Yu, Jin Xu, Jinho D. Choi, Jie Zhou, Fei Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12821
Pdf URL: https://arxiv.org/pdf/2402.12821
Copy Paste: [[2402.12821]] Identifying Factual Inconsistency in Summaries: Towards Effective Utilization of Large Language Model(https://arxiv.org/abs/2402.12821)
Keywords: language model, llm
Abstract: Factual inconsistency poses a significant hurdle for the commercial deployment of abstractive summarizers. In this Large Language Model (LLM) era, this work focuses on two important questions: what is the best way to leverage LLMs for factual inconsistency detection, and how can we distill a smaller LLM with both high efficiency and efficacy? Three zero-shot paradigms are first proposed and evaluated across five diverse datasets: direct inference on the entire summary or each summary window; entity verification through question generation and answering. Experiments suggest that the LLM itself is capable of resolving this task training-free under the proper paradigm design, surpassing strong trained baselines by 2.8% on average. To further promote practical utility, we then propose training strategies aimed at distilling a smaller open-source LLM that learns to score the entire summary at once with high accuracy, which outperforms the zero-shot approaches using much larger LLMs, serving as an effective and efficient ready-to-use scorer.
摘要：事实不一致给抽象摘要器的商业部署带来了重大障碍。在这个大语言模型（LLM）时代，这项工作主要围绕两个重要问题：利用LLM进行事实不一致检测的最佳方式是什么，以及我们如何才能高效、高效地提炼出更小的LLM？首先提出了三种零样本范式，并在五个不同的数据集上进行了评估：对整个摘要或每个摘要窗口进行直接推理；通过问题生成和回答进行实体验证。实验表明，LLM 本身能够在适当的范式设计下免训练地解决此任务，平均超过经过严格训练的基线 2.8%。为了进一步提高实用性，我们提出了旨在提炼较小的开源 LLM 的训练策略，该策略学会一次以高精度对整个摘要进行评分，这比零样本方法要大得多，成为一种有效且高效的 LLM 方法。即用型记分器。

Title: PANDA: Preference Adaptation for Enhancing Domain-Specific Abilities of LLMs

Authors: An Liu, Zonghan Yang, Zhenhe Zhang, Qingyuan Hu, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12835
Pdf URL: https://arxiv.org/pdf/2402.12835
Copy Paste: [[2402.12835]] PANDA: Preference Adaptation for Enhancing Domain-Specific Abilities of LLMs(https://arxiv.org/abs/2402.12835)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) have demonstrated considerable capabilities across various natural language tasks, they often fall short of the performance achieved by domain-specific state-of-the-art models. One potential approach to enhance domain-specific capabilities of LLMs involves fine-tuning them using corresponding datasets. However, this method can be both resource- and time-intensive, and is not applicable to closed-source commercial LLMs. In this paper, we propose Preference Adaptation for Enhancing Domain-specific Abilities of LLMs (PANDA), a method designed to augment the domain-specific capabilities of LLMs by leveraging insights from the response preference of expert models without requiring fine-tuning. Our experimental results reveal that PANDA significantly enhances the domain-specific ability of LLMs on text classification and interactive decision tasks. Moreover, an LLM with PANDA even outperforms the expert model it learns from on 4 tasks of ScienceWorld. This finding highlights the potential of exploring tuning-free approaches to achieve weak-to-strong generalization.
摘要：虽然大型语言模型 (LLM) 在各种自然语言任务中表现出了相当大的能力，但它们通常达不到特定领域的最先进模型所达到的性能。增强法学硕士特定领域能力的一种潜在方法是使用相应的数据集对其进行微调。然而，这种方法可能既耗费资源又耗费时间，并且不适用于闭源商业法学硕士。在本文中，我们提出了增强法学硕士领域特定能力的偏好适应（PANDA），这是一种旨在通过利用专家模型响应偏好的见解来增强法学硕士领域特定能力的方法，而无需进行微调。我们的实验结果表明，PANDA 显着增强了法学硕士在文本分类和交互式决策任务上的特定领域能力。此外，使用 PANDA 的法学硕士甚至优于在 ScienceWorld 的 4 项任务上学习的专家模型。这一发现凸显了探索免调优方法来实现弱到强泛化的潜力。

Title: ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

Authors: Fajri Koto, Haonan Li, Sara Shatnawi, Jad Doughman, Abdelrahman Boda Sadallah, Aisha Alraeesi, Khalid Almubarak, Zaid Alyafeai, Neha Sengupta, Shady Shehata, Nizar Habash, Preslav Nakov, Timothy Baldwin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12840
Pdf URL: https://arxiv.org/pdf/2402.12840
Copy Paste: [[2402.12840]] ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic(https://arxiv.org/abs/2402.12840)
Keywords: language model
Abstract: The focus of language model evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models. While state-of-the-art models are partially trained on large Arabic texts, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present ArabicMMLU, the first multi-task language understanding benchmark for Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA), and is carefully constructed by collaborating with native speakers in the region. Our comprehensive evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models. Notably, BLOOMZ, mT0, LLama2, and Falcon struggle to achieve a score of 50%, while even the top-performing Arabic-centric model only achieves a score of 62.3%.
摘要：在预训练大型模型的进步的推动下，语言模型评估的重点已转向推理和知识密集型任务。虽然最先进的模型部分是在大型阿拉伯文本上进行训练的，但由于相关数据集的可用性有限，评估其在阿拉伯语中的性能仍然具有挑战性。为了弥补这一差距，我们推出了ArabicMMLU，这是第一个阿拉伯语多任务语言理解基准，来源于北非、黎凡特和海湾地区不同国家不同教育水平的学校考试。我们的数据包含现代标准阿拉伯语 (MSA) 的 40 个任务和 14,575 个多项选择题，是通过与该地区的母语人士合作精心构建的。我们对 35 个模型的综合评估揭示了巨大的改进空间，特别是在最好的开源模型中。值得注意的是，BLOOMZ、mT0、LLama2 和 Falcon 很难达到 50% 的分数，而即使是表现最好的以阿拉伯语为中心的模型也只能达到 62.3% 的分数。

Title: PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning

Authors: Gyeongman Kim, Doohyuk Jang, Eunho Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12842
Pdf URL: https://arxiv.org/pdf/2402.12842
Copy Paste: [[2402.12842]] PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning(https://arxiv.org/abs/2402.12842)
Keywords: language model, gpt, llm, prompt
Abstract: Recent advancements in large language models (LLMs) have raised concerns about inference costs, increasing the need for research into model compression. While knowledge distillation (KD) is a prominent method for this, research on KD for generative language models like LLMs is relatively sparse, and the approach of distilling student-friendly knowledge, which has shown promising performance in KD for classification models, remains unexplored in generative language models. To explore this approach, we propose PromptKD, a simple yet effective method that utilizes prompt tuning - for the first time in KD - to enable generative language models to transfer student-friendly knowledge. Unlike previous works in classification that require fine-tuning the entire teacher model for extracting student-friendly knowledge, PromptKD achieves similar effects by adding a small number of prompt tokens and tuning only the prompt with student guidance. Extensive experiments on instruction-following datasets using the GPT-2 model family show that PromptKD achieves state-of-the-art performance while adding only 0.0007% of the teacher's parameters as prompts. Further analysis suggests that distilling student-friendly knowledge alleviates exposure bias effectively throughout the entire training process, leading to performance enhancements.
摘要：大型语言模型 (LLM) 的最新进展引起了人们对推理成本的担忧，增加了对模型压缩研究的需求。虽然知识蒸馏 (KD) 是一种重要的方法，但针对法学硕士等生成语言模型的 KD 研究相对较少，并且在分类模型的 KD 中显示出有希望的性能的蒸馏学生友好型知识的方法仍未得到探索。生成语言模型。为了探索这种方法，我们提出了 PromptKD，这是一种简单而有效的方法，它首次在 KD 中利用提示调整来使生成语言模型能够传输学生友好的知识。与之前的分类工作需要微调整个教师模型以提取学生友好的知识不同，PromptKD 通过添加少量提示标记并在学生指导下仅调整提示来实现类似的效果。使用 GPT-2 模型系列对指令跟踪数据集进行的大量实验表明，PromptKD 实现了最先进的性能，同时仅添加了 0.0007% 的教师参数作为提示。进一步的分析表明，提炼对学生友好的知识可以有效地减轻整个培训过程中的暴露偏差，从而提高绩效。

Title: MORE-3S:Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces

Authors: Tianyu Zheng, Ge Zhang, Xingwei Qu, Ming Kuang, Stephen W. Huang, Zhaofeng He
Subjects: cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2402.12845
Pdf URL: https://arxiv.org/pdf/2402.12845
Copy Paste: [[2402.12845]] MORE-3S:Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces(https://arxiv.org/abs/2402.12845)
Keywords: language model
Abstract: Drawing upon the intuition that aligning different modalities to the same semantic embedding space would allow models to understand states and actions more easily, we propose a new perspective to the offline reinforcement learning (RL) challenge. More concretely, we transform it into a supervised learning task by integrating multimodal and pre-trained language models. Our approach incorporates state information derived from images and action-related data obtained from text, thereby bolstering RL training performance and promoting long-term strategic thinking. We emphasize the contextual understanding of language and demonstrate how decision-making in RL can benefit from aligning states' and actions' representation with languages' representation. Our method significantly outperforms current baselines as evidenced by evaluations conducted on Atari and OpenAI Gym environments. This contributes to advancing offline RL performance and efficiency while providing a novel perspective on offline RL. Our code and data are available at https://github.com/Zheng0428/MORE_.
摘要：凭借将不同模态对齐到相同语义嵌入空间将使模型更容易理解状态和动作的直觉，我们提出了应对离线强化学习（RL）挑战的新视角。更具体地说，我们通过集成多模态和预训练的语言模型将其转化为监督学习任务。我们的方法结合了从图像中获取的状态信息和从文本中获取的与动作相关的数据，从而提高了 RL 训练性能并促进了长期战略思维。我们强调对语言的上下文理解，并展示强化学习中的决策如何通过将状态和动作的表示与语言的表示保持一致而受益。在 Atari 和 OpenAI Gym 环境中进行的评估证明，我们的方法明显优于当前的基线。这有助于提高离线强化学习的性能和效率，同时提供离线强化学习的新颖视角。我们的代码和数据可在 https://github.com/Zheng0428/MORE_ 获取。

Title: Instruction-tuned Language Models are Better Knowledge Learners

Authors: Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Wen-tau Yih, Srinivasan Iyer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12847
Pdf URL: https://arxiv.org/pdf/2402.12847
Copy Paste: [[2402.12847]] Instruction-tuned Language Models are Better Knowledge Learners(https://arxiv.org/abs/2402.12847)
Keywords: language model, llm
Abstract: In order for large language model (LLM)-based assistants to effectively adapt to evolving information needs, it must be possible to update their factual knowledge through continued training on new data. The standard recipe for doing so involves continued pre-training on new documents followed by instruction-tuning on question-answer (QA) pairs. However, we find that LLMs trained with this recipe struggle to answer questions, even though the perplexity of documents is minimized. We found that QA pairs are generally straightforward, while documents are more complex, weaving many factual statements together in an intricate manner. Therefore, we hypothesize that it is beneficial to expose LLMs to QA pairs before continued pre-training on documents so that the process of encoding knowledge from complex documents takes into account how this knowledge is accessed through questions. Based on this, we propose pre-instruction-tuning (PIT), a method that instruction-tunes on questions prior to training on documents. This contrasts with standard instruction-tuning, which learns how to extract knowledge after training on documents. Extensive experiments and ablation studies demonstrate that PIT significantly enhances the ability of LLMs to absorb knowledge from new documents, outperforming standard instruction-tuning by 17.8%.
摘要：为了让基于大语言模型（LLM）的助理能够有效地适应不断变化的信息需求，必须能够通过对新数据的持续培训来更新他们的事实知识。这样做的标准方法包括对新文档进行持续的预训练，然后对问答 (QA) 对进行指令调整。然而，我们发现，尽管文档的复杂性被最小化，但接受此方法培训的法学硕士很难回答问题。我们发现 QA 对通常很简单，而文档则更为复杂，以复杂的方式将许多事实陈述编织在一起。因此，我们假设在继续对文档进行预训练之前让法学硕士接触 QA 对是有益的，这样从复杂文档中编码知识的过程就会考虑到如何通过问题访问这些知识。基于此，我们提出了指令前调优（PIT），这是一种在文档训练之前对问题进行指令调优的方法。这与标准指令调整形成鲜明对比，标准指令调整在文档训练后学习如何提取知识。大量实验和消融研究表明，PIT 显着增强了法学硕士从新文档中吸收知识的能力，比标准指令调整高出 17.8%。

Title: MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models

Authors: Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, Kang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12851
Pdf URL: https://arxiv.org/pdf/2402.12851
Copy Paste: [[2402.12851]] MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models(https://arxiv.org/abs/2402.12851)
Keywords: language model, gpt, llm
Abstract: Fine-tuning is often necessary to enhance the adaptability of Large Language Models (LLM) to downstream tasks. Nonetheless, the process of updating billions of parameters demands significant computational resources and training time, which poses a substantial obstacle to the widespread application of large-scale models in various scenarios. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) has emerged as a prominent paradigm in recent research. However, current PEFT approaches that employ a limited set of global parameters (such as LoRA, which adds low-rank approximation matrices to all weights) face challenges in flexibly combining different computational modules in downstream tasks. In this work, we introduce a novel PEFT method: MoELoRA. We consider LoRA as Mixture of Experts (MoE), and to mitigate the random routing phenomenon observed in MoE, we propose the utilization of contrastive learning to encourage experts to learn distinct features. We conducted experiments on 11 tasks in math reasoning and common-sense reasoning benchmarks. With the same number of parameters, our approach outperforms LoRA significantly. In math reasoning, MoELoRA achieved an average performance that was 4.2% higher than LoRA, and demonstrated competitive performance compared to the 175B GPT-3.5 on several benchmarks.
摘要：为了增强大型语言模型（LLM）对下游任务的适应性，通常需要进行微调。然而，更新数十亿个参数的过程需要大量的计算资源和训练时间，这对大规模模型在各种场景中的广泛应用构成了重大障碍。为了解决这个问题，参数高效微调（PEFT）已成为最近研究中的一个突出范例。然而，当前采用有限全局参数集的 PEFT 方法（例如 LoRA，它将低秩近似矩阵添加到所有权重）面临着在下游任务中灵活组合不同计算模块的挑战。在这项工作中，我们介绍了一种新颖的 PEFT 方法：MoELoRA。我们将 LoRA 视为专家混合 (MoE)，为了减轻 MoE 中观察到的随机路由现象，我们建议利用对比学习来鼓励专家学习不同的特征。我们对数学推理和常识推理基准的 11 项任务进行了实验。在参数数量相同的情况下，我们的方法显着优于 LoRA。在数学推理方面，MoELoRA 的平均性能比 LoRA 高出 4.2%，并且在多个基准测试中与 175B GPT-3.5 相比表现出具有竞争力的性能。

Title: Backward Lens: Projecting Language Model Gradients into the Vocabulary Space

Authors: Shahar Katz, Yonatan Belinkov, Mor Geva, Lior Wolf
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12865
Pdf URL: https://arxiv.org/pdf/2402.12865
Copy Paste: [[2402.12865]] Backward Lens: Projecting Language Model Gradients into the Vocabulary Space(https://arxiv.org/abs/2402.12865)
Keywords: language model
Abstract: Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models' vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs' backward pass and gradients. We first prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs. We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.
摘要：了解基于 Transformer 的语言模型 (LM) 如何学习和回忆信息是深度学习社区的一个关键目标。最近的可解释性方法将从前向传递中获得的权重和隐藏状态投影到模型的词汇表中，有助于揭示信息在语言模型中的流动方式。在这项工作中，我们将此方法扩展到 LM 的后向传播和梯度。我们首先证明梯度矩阵可以被转换为其前向和后向传递输入的低秩线性组合。然后，我们开发方法将这些梯度投影到词汇项中，并探索新信息如何存储在 LM 神经元中的机制。
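
The low-rank claim in the abstract above can be illustrated on a single linear layer. The following is a minimal sketch, not the authors' code: it assumes a bias-free layer `Y = X @ W` with a squared-error loss against arbitrary targets `T` (the layer sizes, the loss, and all variable names are illustrative assumptions). Under these assumptions the weight gradient is a sum of outer products of each token's forward-pass input and backward-pass input, so its rank is at most the number of tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out = 3, 8, 6          # illustrative sizes (assumption)

X = rng.normal(size=(n_tokens, d_in))    # forward-pass inputs, one row per token
W = rng.normal(size=(d_in, d_out))       # layer weights
T = rng.normal(size=(n_tokens, d_out))   # arbitrary regression targets (assumption)


def loss(weights):
    """Squared-error loss of the linear layer Y = X @ weights."""
    return 0.5 * np.sum((X @ weights - T) ** 2)


delta = X @ W - T                        # backward-pass inputs: dL/dY, one row per token
grad = X.T @ delta                       # dL/dW for the linear layer

# The same gradient written as a sum of per-token outer products.
outer_sum = sum(np.outer(X[i], delta[i]) for i in range(n_tokens))

# Central finite-difference check of one gradient entry.
eps = 1e-6
E = np.zeros_like(W)
E[2, 4] = eps
fd = (loss(W + E) - loss(W - E)) / (2 * eps)

rank = np.linalg.matrix_rank(grad)       # bounded by n_tokens, not by d_in or d_out
```

Here `grad` is an 8 x 6 matrix, yet it is assembled from only three outer products, which is why its rank cannot exceed the number of tokens in the batch.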

Title: Exploring the Impact of Table-to-Text Methods on Augmenting LLM-based Question Answering with Domain Hybrid Data

Authors: Dehai Min, Nan Hu, Rihui Jin, Nuo Lin, Jiaoyan Chen, Yongrui Chen, Yu Li, Guilin Qi, Yun Li, Nijun Li, Qianren Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12869
Pdf URL: https://arxiv.org/pdf/2402.12869
Copy Paste: [[2402.12869]] Exploring the Impact of Table-to-Text Methods on Augmenting LLM-based Question Answering with Domain Hybrid Data(https://arxiv.org/abs/2402.12869)
Keywords: language model, llm
Abstract: Augmenting Large Language Models (LLMs) for Question Answering (QA) with domain-specific data has attracted wide attention. However, domain data often exists in a hybrid format, including text and semi-structured tables, posing challenges for the seamless integration of information. Table-to-Text Generation is a promising solution because it facilitates the transformation of hybrid data into a uniformly text-formatted corpus. Although this technique has been widely studied by the NLP community, there is currently no comparative analysis of how corpora generated by different table-to-text methods affect the performance of QA systems. In this paper, we address this research gap in two steps. First, we innovatively integrate table-to-text generation into the framework of enhancing LLM-based QA systems with domain hybrid data. Then, we apply this framework to real-world industrial data to conduct extensive experiments on two types of QA systems (DSFT and RAG frameworks) with four representative methods: Markdown format, Template serialization, TPLM-based method, and LLM-based method. Based on the experimental results, we draw some empirical findings and explore the underlying reasons behind the success of some methods. We hope the findings of this work will provide a valuable reference for the academic and industrial communities in developing robust QA systems.
摘要：使用特定领域数据增强用于问答 (QA) 的大型语言模型 (LLM) 已引起广泛关注。然而，领域数据通常以混合格式存在，包括文本和半结构化表格，这给信息的无缝集成带来了挑战。表到文本生成是一种很有前途的解决方案，它可以促进混合数据转换为统一文本格式的语料库。尽管该技术已被 NLP 社区广泛研究，但目前还没有关于不同表到文本方法生成的语料库如何影响 QA 系统性能的比较分析。在本文中，我们分两步解决这一研究空白。首先，我们创新地将表到文本生成集成到使用领域混合数据增强基于 LLM 的 QA 系统的框架中。然后，我们在实际工业数据中利用该框架，对两种类型的问答系统（DSFT 和 RAG 框架）进行了广泛的实验，其中包括四种代表性方法：Markdown 格式、模板序列化、基于 TPLM 的方法和基于 LLM 的方法。基于实验结果，我们得出了一些实证结果，并探讨了一些方法成功背后的根本原因。我们希望这项工作的结果能为学术界和工业界开发强大的质量保证系统提供有价值的参考。

Title: Skill or Luck? Return Decomposition via Advantage Functions

Authors: Hsiao-Ru Pan, Bernhard Schölkopf
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.12874
Pdf URL: https://arxiv.org/pdf/2402.12874
Copy Paste: [[2402.12874]] Skill or Luck? Return Decomposition via Advantage Functions(https://arxiv.org/abs/2402.12874)
Keywords: agent
Abstract: Learning from off-policy data is essential for sample-efficient reinforcement learning. In the present work, we build on the insight that the advantage function can be understood as the causal effect of an action on the return, and show that this allows us to decompose the return of a trajectory into parts caused by the agent's actions (skill) and parts outside of the agent's control (luck). Furthermore, this decomposition enables us to naturally extend Direct Advantage Estimation (DAE) to off-policy settings (Off-policy DAE). The resulting method can learn from off-policy trajectories without relying on importance sampling techniques or truncating off-policy actions. We draw connections between Off-policy DAE and previous methods to demonstrate how it can speed up learning and when the proposed off-policy corrections are important. Finally, we use the MinAtar environments to illustrate how ignoring off-policy corrections can lead to suboptimal policy optimization performance.
摘要：从离策略数据中学习对于样本高效的强化学习至关重要。在目前的工作中，我们基于这样的见解，即优势函数可以理解为动作对回报的因果影响，并表明这使我们能够将轨迹的回报分解为由代理的动作（技能）引起的部分）以及代理人无法控制的部分（运气）。此外，这种分解使我们能够自然地将直接优势估计（DAE）扩展到离策略设置（Off-policy DAE）。由此产生的方法可以从离策略轨迹中学习，而无需依赖重要性采样技术或截断离策略动作。我们在离策略 DAE 和以前的方法之间建立联系，以证明它如何加速学习以及所提出的离策略修正何时很重要。最后，我们使用 MinAtar 环境来说明忽略离策略修正如何导致策略优化性能不佳。
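
The skill/luck split described in the abstract above can be written out as a telescoping identity. This is a sketch under stated assumptions, not a quotation of the paper: rewards are taken to be deterministic given $(s_t, a_t)$, the episode ends at step $T$ with $V^\pi(s_T) = 0$, and the symbol $B^\pi$ for the luck term is notation chosen here for illustration.

```latex
% Assumptions: r_t = r(s_t, a_t) is deterministic given (s_t, a_t);
% the episode terminates at step T with V^\pi(s_T) = 0.
\begin{align*}
A^\pi(s_t, a_t) &= Q^\pi(s_t, a_t) - V^\pi(s_t)
  && \text{(skill: effect of the chosen action)} \\
B^\pi(s_t, a_t, s_{t+1}) &= V^\pi(s_{t+1})
  - \mathbb{E}_{s' \sim p(\cdot \mid s_t, a_t)}\!\left[ V^\pi(s') \right]
  && \text{(luck: effect of the sampled transition)} \\
r_t &= Q^\pi(s_t, a_t) - \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s_t, a_t)}\!\left[ V^\pi(s') \right] \\
    &= A^\pi(s_t, a_t) + \gamma B^\pi(s_t, a_t, s_{t+1})
       + V^\pi(s_t) - \gamma V^\pi(s_{t+1}) \\
\sum_{t=0}^{T-1} \gamma^t r_t
    &= V^\pi(s_0)
     + \sum_{t=0}^{T-1} \gamma^t A^\pi(s_t, a_t)
     + \sum_{t=0}^{T-1} \gamma^{t+1} B^\pi(s_t, a_t, s_{t+1})
\end{align*}
```

The last line follows because the $V^\pi(s_t) - \gamma V^\pi(s_{t+1})$ terms telescope to $V^\pi(s_0) - \gamma^T V^\pi(s_T) = V^\pi(s_0)$. The identity holds for any trajectory, whichever policy generated it, which is consistent with the abstract's point that no importance sampling is needed.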

Title: Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Authors: Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma
Subjects: cs.LG, cs.CC, stat.ML
Abstract URL: https://arxiv.org/abs/2402.12875
Pdf URL: https://arxiv.org/pdf/2402.12875
Copy Paste: [[2402.12875]] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems(https://arxiv.org/abs/2402.12875)
Keywords: language model, llm
Abstract: Instructing the model to generate a sequence of intermediate steps, a.k.a. a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $\mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
摘要：指示模型生成一系列中间步骤，即思想链 (CoT)，是提高大型语言模型 (LLM) 在算术和符号推理任务上的准确性的高效方法。然而，CoT 背后的机制仍不清楚。这项工作从表现力的角度提供了对纯解码器 Transformer 的 CoT 功能的理论理解。从概念上讲，CoT 使模型能够执行固有的串行计算，而这是 Transformer 所缺乏的，尤其是在深度较低时。给定输入长度$n$，之前的工作表明，具有有限精度$\mathsf{poly}(n)$嵌入大小的恒定深度变换器只能解决$\mathsf{TC}^0$中的问题，而无需CoT。我们首先展示了具有恒定位精度的恒定深度变换器的更严格的表达能力上限，它只能解决$\mathsf{AC}^0$（$\mathsf{TC}^0$的真子集）中的问题。然而，通过 CoT 的 $T$ 步长，使用恒定位精度和 $O(\log n)$ 嵌入大小的恒定深度变压器可以解决大小为 $T$ 的布尔电路可解决的任何问题。根据经验，启用 CoT 可以显着提高难以并行计算的任务的准确性，包括排列组的组成、迭代平方和电路值问题，特别是对于低深度变压器。

Title: GRAFFORD: A Benchmark Dataset for Testing the Knowledge of Object Affordances of Language and Vision Models

Authors: Sayantan Adak, Daivik Agrawal, Animesh Mukherjee, Somak Aditya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12881
Pdf URL: https://arxiv.org/pdf/2402.12881
Copy Paste: [[2402.12881]] GRAFFORD: A Benchmark Dataset for Testing the Knowledge of Object Affordances of Language and Vision Models(https://arxiv.org/abs/2402.12881)
Keywords: language model
Abstract: We investigate the knowledge of object affordances in pre-trained language models (LMs) and pre-trained Vision-Language models (VLMs). Transformers-based large pre-trained language models (PTLM) learn contextual representation from massive amounts of unlabeled text and are shown to perform impressively in downstream NLU tasks. In parallel, a growing body of literature shows that PTLMs fail inconsistently and non-intuitively, showing a lack of reasoning and grounding. To take a first step toward quantifying the effect of grounding (or lack thereof), we curate a novel and comprehensive dataset of object affordances -- GrAFFORD, characterized by 15 affordance classes. Unlike affordance datasets collected in vision and language domains, we annotate in-the-wild sentences with objects and affordances. Experimental results reveal that PTLMs exhibit limited reasoning abilities when it comes to uncommon object affordances. We also observe that pre-trained VLMs do not necessarily capture object affordances effectively. Through few-shot fine-tuning, we demonstrate improvement in affordance knowledge in PTLMs and VLMs. Our research contributes a novel dataset for language grounding tasks, and presents insights into LM capabilities, advancing the understanding of object affordances. Codes and data are available at https://github.com/sayantan11995/Affordance
摘要：我们研究了预训练语言模型（LM）和预训练视觉语言模型（VLM）中对象可供性的知识。基于 Transformer 的大型预训练语言模型 (PTLM) 从大量未标记文本中学习上下文表示，并在下游 NLU 任务中表现出色。与此同时，越来越多的文献表明，PTLM 的失败不一致且不直观，表现出缺乏推理和基础。为了迈出量化接地（或缺乏接地）影响的第一步，我们策划了一个新颖且全面的对象可供性数据集——GrAFFORD，其特征为 15 个可供性类别。与在视觉和语言领域收集的可供性数据集不同，我们用对象和可供性来注释野外句子。实验结果表明，当涉及不常见的对象可供性时，PTLM 表现出有限的推理能力。我们还观察到，预先训练的 VLM 不一定能够有效地捕获对象可供性。通过几次微调，我们展示了 PTLM 和 VLM 中可供性知识的改进。我们的研究为语言基础任务提供了一个新颖的数据集，并提出了对 LM 功能的见解，促进了对对象可供性的理解。代码和数据可在 https://github.com/sayantan11995/Affordance 获取

Title: OPDAI at SemEval-2024 Task 6: Small LLMs can Accelerate Hallucination Detection with Weakly Supervised Data

Authors: Chengcheng Wei, Ze Chen, Songtan Fang, Jiarong He, Max Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12913
Pdf URL: https://arxiv.org/pdf/2402.12913
Copy Paste: [[2402.12913]] OPDAI at SemEval-2024 Task 6: Small LLMs can Accelerate Hallucination Detection with Weakly Supervised Data(https://arxiv.org/abs/2402.12913)
Keywords: gpt, llm, hallucination, prompt
Abstract: This paper describes a unified system for hallucination detection of LLMs, which won second prize in the model-agnostic track of SemEval-2024 Task 6 and also achieved considerable results in the model-aware track. This task aims to detect hallucination with LLMs for three different text-generation tasks without labeled training data. We utilize prompt engineering and few-shot learning to verify the performance of different LLMs on the validation data. Then we select the LLMs with better performance to generate high-quality weakly supervised training data, which not only satisfies the consistency of different LLMs, but also satisfies the consistency of the optimal LLM with different sampling parameters. Furthermore, we finetune different LLMs using the constructed training data and find that a relatively small LLM can achieve a competitive level of performance in hallucination detection compared to the large LLMs and the prompt-based approaches using GPT-4.
摘要：这篇论文主要描述了一种用于LLM幻觉检测的统一系统，该系统在SemEval-2024 Task 6的模型无关赛道上获得了二等奖，并且在模型感知赛道上也取得了可观的成果。该任务旨在利用法学硕士检测三种不同文本生成任务的幻觉，而无需标记训练数据。我们利用即时工程和少量学习来验证不同法学硕士在验证数据上的表现。然后我们选择性能较好的LLM来生成高质量的弱监督训练数据，这不仅满足不同LLM的一致性，而且满足不同采样参数下最优LLM的一致性。此外，我们通过使用构建的训练数据对不同的 LLM 进行微调，并发现与大型 LLM 和使用 GPT-4 的基于提示的方法相比，相对较小的 LLM 可以在幻觉检测方面实现具有竞争力的性能水平。

Title: Large Language Model-based Human-Agent Collaboration for Complex Task Solving

Authors: Xueyang Feng, Zhi-Yuan Chen, Yujia Qin, Yankai Lin, Xu Chen, Zhiyuan Liu, Ji-Rong Wen
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2402.12914
Pdf URL: https://arxiv.org/pdf/2402.12914
Copy Paste: [[2402.12914]] Large Language Model-based Human-Agent Collaboration for Complex Task Solving(https://arxiv.org/abs/2402.12914)
Keywords: language model, llm, agent
Abstract: In recent developments within the research community, the integration of Large Language Models (LLMs) in creating fully autonomous agents has garnered significant interest. Despite this, LLM-based agents frequently demonstrate notable shortcomings in adjusting to dynamic environments and fully grasping human needs. In this work, we introduce the problem of LLM-based human-agent collaboration for complex task-solving, exploring their synergistic potential. In addition, we propose a Reinforcement Learning-based Human-Agent Collaboration method, ReHAC. This approach includes a policy model designed to determine the most opportune stages for human intervention within the task-solving process. We construct a human-agent collaboration dataset to train this policy model in an offline reinforcement learning environment. Our validation tests confirm the model's effectiveness. The results demonstrate that the synergistic efforts of humans and LLM-based agents significantly improve performance in complex tasks, primarily through well-planned, limited human intervention. Datasets and code are available at: https://github.com/XueyangFeng/ReHAC.
摘要：在研究界的最新发展中，大型语言模型（LLM）在创建完全自主代理中的集成引起了人们的极大兴趣。尽管如此，基于法学硕士的智能体在适应动态环境和充分把握人类需求方面经常表现出明显的缺点。在这项工作中，我们介绍了基于法学硕士的人机协作解决复杂任务的问题，探索它们的协同潜力。此外，我们提出了一种基于强化学习的人机协作方法，ReHAC。该方法包括一个政策模型，旨在确定任务解决过程中人为干预的最合适阶段。我们构建了一个人机协作数据集，以在离线强化学习环境中训练该策略模型。我们的验证测试证实了模型的有效性。结果表明，人类和基于法学硕士的智能体的协同努力显着提高了复杂任务的性能，这主要是通过精心计划的、有限的人为干预。数据集和代码可访问：https://github.com/XueyangFeng/ReHAC。

Title: Discovering Behavioral Modes in Deep Reinforcement Learning Policies Using Trajectory Clustering in Latent Space

Authors: Sindre Benjamin Remman, Anastasios M. Lekkas
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12939
Pdf URL: https://arxiv.org/pdf/2402.12939
Copy Paste: [[2402.12939]] Discovering Behavioral Modes in Deep Reinforcement Learning Policies Using Trajectory Clustering in Latent Space(https://arxiv.org/abs/2402.12939)
Keywords: agent
Abstract: Understanding the behavior of deep reinforcement learning (DRL) agents is crucial for improving their performance and reliability. However, the complexity of their policies often makes them challenging to understand. In this paper, we introduce a new approach for investigating the behavior modes of DRL policies, which involves utilizing dimensionality reduction and trajectory clustering in the latent space of neural networks. Specifically, we use Pairwise Controlled Manifold Approximation Projection (PaCMAP) for dimensionality reduction and TRACLUS for trajectory clustering to analyze the latent space of a DRL policy trained on the Mountain Car control task. Our methodology helps identify diverse behavior patterns and suboptimal choices by the policy, thus allowing for targeted improvements. We demonstrate how our approach, combined with domain knowledge, can enhance a policy's performance in specific regions of the state space.
摘要：了解深度强化学习 (DRL) 代理的行为对于提高其性能和可靠性至关重要。然而，他们的政策的复杂性往往使他们难以理解。在本文中，我们介绍了一种研究 DRL 策略行为模式的新方法，其中涉及利用神经网络潜在空间中的降维和轨迹聚类。具体来说，我们使用成对控制流形近似投影 (PaCMAP) 进行降维，使用 TRACLUS 进行轨迹聚类，以分析在 Mountain Car 控制任务上训练的 DRL 策略的潜在空间。我们的方法有助于识别政策的不同行为模式和次优选择，从而实现有针对性的改进。我们展示了我们的方法如何与领域知识相结合，可以提高政策在国家空间特定区域的绩效。

Title: GumbelSoft: Diversified Language Model Watermarking via the GumbelMax-trick

Authors: Jiayi Fu, Xuandong Zhao, Ruihan Yang, Yuansen Zhang, Jiangjie Chen, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12948
Pdf URL: https://arxiv.org/pdf/2402.12948
Copy Paste: [[2402.12948]] GumbelSoft: Diversified Language Model Watermarking via the GumbelMax-trick(https://arxiv.org/abs/2402.12948)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) excel at generating human-like text, but also raise concerns about misuse in fake news and academic dishonesty. Decoding-based watermark, particularly the GumbelMax-trick-based watermark (GM watermark), is a standout solution for safeguarding machine-generated texts due to its notable detectability. However, GM watermark encounters a major challenge with generation diversity, always yielding identical outputs for the same prompt, negatively impacting generation diversity and user experience. To overcome this limitation, we propose a new type of GM watermark, the Logits-Addition watermark, and its three variants, specifically designed to enhance diversity. Among these, the GumbelSoft watermark (a softmax variant of the Logits-Addition watermark) demonstrates superior performance in high diversity settings, with its AUROC score outperforming those of the two alternative variants by 0.1 to 0.3 and surpassing other decoding-based watermarking methods by a minimum of 0.1.
摘要：大型语言模型（LLM）能够出色地生成类似人类的文本，但也引起了人们对假新闻和学术不诚实中滥用的担忧。基于解码的水印，特别是基于 GumbelMax-trick 的水印（GM 水印），由于其显着的可检测性，是保护机器生成文本的出色解决方案。然而，GM 水印遇到了世代多样性的重大挑战，总是在相同的提示下产生相同的输出，对世代多样性和用户体验产生负面影响。为了克服这个限制，我们提出了一种新型的 GM 水印，即 Logits-Addition 水印及其三个变体，专门用于增强多样性。其中，GumbelSoft 水印（Logits-Addition 水印的 softmax 变体）在高多样性设置中表现出卓越的性能，其 AUROC 分数比两种替代变体高出 0.1 到 0.3，并超过其他基于解码的水印方法最小值 0.1。
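
The diversity problem mentioned in the abstract above comes from how the Gumbel-max trick is keyed. The following is a minimal sketch of the generic trick, not the paper's watermark: the function names, the SHA-256 keying scheme, and the toy logits are all illustrative assumptions. Taking the argmax of `logits + Gumbel noise` samples from the softmax distribution, but when the noise is derived deterministically from a secret key and the context, the same context always yields the same token.

```python
import hashlib

import numpy as np


def keyed_gumbel_noise(key: str, context: tuple, vocab_size: int) -> np.ndarray:
    """Pseudo-random Gumbel(0, 1) noise derived from a key and a context.

    The SHA-256 keying is a hypothetical choice made for this sketch.
    """
    digest = hashlib.sha256(f"{key}|{context}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    u = np.random.default_rng(seed).uniform(size=vocab_size)
    u = np.clip(u, 1e-12, 1.0 - 1e-12)   # keep both logarithms finite
    return -np.log(-np.log(u))


def gumbel_max_sample(logits: np.ndarray, key: str, context: tuple) -> int:
    """Pick argmax(logits + keyed Gumbel noise); deterministic given key and context."""
    noise = keyed_gumbel_noise(key, context, len(logits))
    return int(np.argmax(logits + noise))


logits = np.array([2.0, 1.0, 0.0, -1.0])           # toy next-token logits (assumption)
probs = np.exp(logits) / np.exp(logits).sum()      # softmax target distribution

# Same key and same context: the choice never changes.
first = gumbel_max_sample(logits, "secret", (5, 17))
second = gumbel_max_sample(logits, "secret", (5, 17))

# Across many different contexts, choices follow the softmax distribution.
counts = np.zeros(len(logits))
for i in range(20000):
    counts[gumbel_max_sample(logits, "secret", (i,))] += 1
freqs = counts / counts.sum()
```

In this sketch `first` and `second` always agree, which mirrors the identical-output behaviour the abstract describes, while `freqs` stays close to `probs`, showing that the keyed noise does not distort the sampling distribution on average.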

Title: GlórIA - A Generative and Open Large Language Model for Portuguese

Authors: Ricardo Lopes, João Magalhães, David Semedo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12969
Pdf URL: https://arxiv.org/pdf/2402.12969
Copy Paste: [[2402.12969]] GlórIA - A Generative and Open Large Language Model for Portuguese(https://arxiv.org/abs/2402.12969)
Keywords: language model, llm
Abstract: Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful large language models (LLMs). These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware LAnguage Modeling Evaluation for Portuguese), the first Portuguese zero-shot language-modeling benchmark. Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling and that it can generate sound, knowledge-rich, and coherent PT-PT text. The model also exhibits strong potential for various downstream tasks.
摘要：自然语言任务取得了重大进展，这很大程度上归功于强大的大型语言模型（LLM）的出现。这些模型在广泛且多样化的语料库上进行了预先训练，越来越能够理解语言的复杂性。尽管许多高资源语言的法学硕士数量众多，但此类模型对于欧洲葡萄牙语的可用性仍然有限。我们介绍 Gl'orIA，一个强大的欧洲葡萄牙语解码器法学硕士。为了预训练 Gl'orIA，我们组装了一个全面的 PT-PT 文本语料库，其中包含来自各种来源的 350 亿个标记。我们介绍我们的预训练方法，然后评估模型在多个下游任务上的有效性。此外，为了评估我们模型的语言建模能力，我们引入了 CALAME-PT（葡萄牙语上下文感知语言建模评估），这是第一个葡萄牙语零样本语言建模基准。评估表明，Gl'orIA 在语言建模方面显着优于现有的开放 PT 解码器模型，并且它可以生成声音丰富、知识丰富且连贯的 PT-PT 文本。该模型还对各种下游任务表现出强大的潜力。

Title: The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis

Authors: Miaoran Zhang, Vagrant Gautam, Mingyang Wang, Jesujoba O. Alabi, Xiaoyu Shen, Dietrich Klakow, Marius Mosbach
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12976
Pdf URL: https://arxiv.org/pdf/2402.12976
Copy Paste: [[2402.12976]] The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis(https://arxiv.org/abs/2402.12976)
Keywords: language model, gpt, chat
Abstract: In-context learning is a popular inference strategy where large language models solve a task using only a few labelled demonstrations without needing any parameter updates. Compared to work on monolingual (English) in-context learning, multilingual in-context learning is under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.
摘要：上下文学习是一种流行的推理策略，其中大型语言模型仅使用几个标记的演示来解决任务，而不需要任何参数更新。与单语（英语）情境学习的工作相比，多语言情境学习的研究还不够深入，我们对示范在这种情境中的作用缺乏深入的了解。为了解决这一差距，我们对多语言上下文学习进行了多维分析，试验了来自不同模型系列的 5 个模型、涵盖分类和生成任务的 9 个数据集以及 56 种类型不同的语言。我们的结果表明，演示的有效性因模型、任务和语言的不同而存在显着差异。我们还发现 Llama 2-Chat、GPT-3.5 和 GPT-4 对演示的质量基本上不敏感。相反，精心设计的模板通常会完全消除某些任务和语言演示的好处。这些发现表明，示威活动的重要性可能被高估了。我们的工作强调需要跨多个轴进行精细评估，以更好地理解情境学习。

Title: Can GNN be Good Adapter for LLMs?

Authors: Xuanwen Huang, Kaiqiao Han, Yang Yang, Dezheng Bao, Quanjin Tao, Ziwei Chai, Qi Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12984
Pdf URL: https://arxiv.org/pdf/2402.12984
Copy Paste: [[2402.12984]] Can GNN be Good Adapter for LLMs?(https://arxiv.org/abs/2402.12984)
Keywords: language model, gpt, llm, prompt
Abstract: Recently, large language models (LLMs) have demonstrated superior capabilities in understanding and zero-shot learning on textual data, promising significant advances for many text-related domains. In the graph domain, various real-world scenarios also involve textual data, where tasks and node features can be described by text. These text-attributed graphs (TAGs) have broad applications in social media, recommendation systems, etc. Thus, this paper explores how to utilize LLMs to model TAGs. Previous methods for TAG modeling are based on million-scale LMs. When scaled up to billion-scale LLMs, they face huge challenges in computational costs. Additionally, they ignore the zero-shot inference capabilities of LLMs. Therefore, we propose GraphAdapter, which uses a graph neural network (GNN) as an efficient adapter in collaboration with LLMs to tackle TAGs. In terms of efficiency, the GNN adapter introduces only a few trainable parameters and can be trained with low computation costs. The entire framework is trained using auto-regression on node text (next token prediction). Once trained, GraphAdapter can be seamlessly fine-tuned with task-specific prompts for various downstream tasks. Through extensive experiments across multiple real-world TAGs, GraphAdapter based on Llama 2 gains an average improvement of approximately 5% in terms of node classification. Furthermore, GraphAdapter can also adapt to other language models, including RoBERTa and GPT-2. The promising results demonstrate that GNNs can serve as effective adapters for LLMs in TAG modeling.
摘要：最近，大型语言模型（LLM）在文本数据的理解和零样本学习方面表现出了卓越的能力，有望在许多文本相关领域取得重大进展。在图领域，各种现实场景也涉及文本数据，其中任务和节点特征可以通过文本来描述。这些文本属性图（TAG）在社交媒体、推荐系统等领域有着广泛的应用。因此，本文探讨了如何利用 LLM 来建模 TAG。以前的 TAG 建模方法是基于百万级 LM 的。当扩大到十亿规模的法学硕士时，他们面临着计算成本的巨大挑战。此外，他们还忽略了法学硕士的零样本推理能力。因此，我们提出了 GraphAdapter，它使用图神经网络（GNN）作为与法学硕士合作的高效适配器来处理 TAG。在效率方面，GNN 适配器仅引入了少量可训练参数，并且可以以较低的计算成本进行训练。整个框架是使用节点文本的自动回归（下一个标记预测）进行训练的。经过训练后，GraphAdapter 可以通过针对各种下游任务的特定于任务的提示进行无缝微调。通过对多个真实世界 TAG 的广泛实验，基于 Llama 2 的 GraphAdapter 在节点分类方面平均提高了约 5%。此外，GraphAdapter还可以适配其他语言模型，包括RoBERTa、GPT-2。这些令人鼓舞的结果表明，GNN 可以作为 TAG 建模中 LLM 的有效适配器。

Title: TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification

Authors: Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, Seong Joon Oh
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2402.12991
Pdf URL: https://arxiv.org/pdf/2402.12991
Copy Paste: [[2402.12991]] TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification(https://arxiv.org/abs/2402.12991)
Keywords: language model, llm, prompt, chat
Abstract: Large Language Model (LLM) services and models often come with legal rules on who can use them and how they must use them. Assessing the compliance of the released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel problem of Black-box Identity Verification (BBIV). The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. TRAP remains effective even if the LLM has minor changes that do not significantly alter the original function.
摘要：大型语言模型 (LLM) 服务和模型通常附带关于谁可以使用它们以及如何使用它们的法律规则。评估已发布的法学硕士的合规性至关重要，因为这些规则保护法学硕士贡献者的利益并防止滥用。在这种背景下，我们描述了黑盒身份验证（BBIV）的新问题。目标是确定第三方应用程序是否通过其聊天功能使用某个LLM。我们提出了一种称为“目标随机对抗提示”(TRAP) 的方法，用于识别正在使用的特定 LLM。我们重新利用了最初为越狱而提出的对抗性后缀，以从目标法学硕士获得预定义的答案，而其他模型则给出随机答案。即使在单次交互后，TRAP 也能以超过 95% 的真阳性率和低于 0.2% 的假阳性率检测目标 LLM。即使 LLM 发生了不会显着改变原始功能的微小变化，TRAP 仍然有效。

Title: Phonotactic Complexity across Dialects

Authors: Ryan Soh-Eun Shim, Kalvin Chang, David R. Mortensen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12998
Pdf URL: https://arxiv.org/pdf/2402.12998
Copy Paste: [[2402.12998]] Phonotactic Complexity across Dialects(https://arxiv.org/abs/2402.12998)
Keywords: language model
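Abstract: Received wisdom in linguistic typology holds that if the structure of a language becomes more complex in one dimension, it will simplify in another, building on the assumption that all languages are equally complex (Joseph and Newmeyer, 2012). We study this claim on a micro-level, using a tightly controlled sample of Dutch dialects (across 366 collection sites) and Min dialects (across 60 sites), which enables a fairer comparison across varieties. Even at the dialect level, we find empirical evidence for a tradeoff between word length and a computational measure of phonotactic complexity from an LSTM-based phone-level language model, a result previously documented only at the language level. A generalized additive model (GAM) shows that dialects with low phonotactic complexity concentrate around the capital regions, which we hypothesize to correspond to prior hypotheses that language varieties of greater or more diverse populations show reduced phonotactic complexity. We also experiment with incorporating the auxiliary task of predicting syllable constituency, but do not find an increase in the negative correlation observed.
Summary: Received wisdom in linguistic typology holds that if the structure of a language becomes more complex in one dimension, it becomes simpler in another, building on the assumption that all languages are equally complex (Joseph and Newmeyer, 2012). We study this claim at the micro level using a tightly controlled sample of Dutch dialects (366 collection sites) and Min dialects (60 collection sites), which makes comparisons across varieties fairer. Even at the dialect level, an LSTM-based phone-level language model yields empirical evidence for a tradeoff between word length and a computational measure of phonotactic complexity, a result previously documented only at the language level. A generalized additive model (GAM) shows that dialects with low phonotactic complexity concentrate around the capital regions, which we hypothesize corresponds to prior hypotheses that language varieties of larger or more diverse populations show reduced phonotactic complexity. We also experiment with incorporating the auxiliary task of predicting syllable constituency, but find no increase in the observed negative correlation.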

Title: Investigating the Impact of Model Instability on Explanations and Uncertainty

Authors: Sara Vera Marjanović, Isabelle Augenstein, Christina Lioma
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.13006
Pdf URL: https://arxiv.org/pdf/2402.13006
Copy Paste: [[2402.13006]] Investigating the Impact of Model Instability on Explanations and Uncertainty(https://arxiv.org/abs/2402.13006)
Keywords: language model
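Abstract: Explainable AI methods facilitate the understanding of model behaviour, yet small, imperceptible perturbations to inputs can vastly distort explanations. As these explanations are typically evaluated holistically, before model deployment, it is difficult to assess when a particular explanation is trustworthy. Some studies have tried to create confidence estimators for explanations, but none have investigated an existing link between uncertainty and explanation quality. We artificially simulate epistemic uncertainty in text input by introducing noise at inference time. In this large-scale empirical study, we insert different levels of noise perturbations and measure the effect on the output of pre-trained language models and different uncertainty metrics. Realistic perturbations have minimal effect on performance and explanations, yet masking has a drastic effect. We find that high uncertainty doesn't necessarily imply low explanation plausibility; the correlation between the two metrics can be moderately positive when noise is exposed during the training process. This suggests that noise-augmented models may be better at identifying salient tokens when uncertain. Furthermore, when predictive and epistemic uncertainty measures are over-confident, the robustness of a saliency map to perturbation can indicate model stability issues. Integrated Gradients shows the overall greatest robustness to perturbation, while still showing model-specific patterns in performance; however, this phenomenon is limited to smaller Transformer-based language models.
Summary: Explainable AI methods help in understanding model behaviour, yet small, imperceptible perturbations to inputs can greatly distort explanations. Because these explanations are typically evaluated holistically before model deployment, it is difficult to assess when a particular explanation is trustworthy. Some studies have tried to create confidence estimators for explanations, but none has investigated an existing link between uncertainty and explanation quality. We artificially simulate epistemic uncertainty in text input by introducing noise at inference time. In this large-scale empirical study, we insert different levels of noise perturbations and measure the effect on the output of pre-trained language models and on different uncertainty metrics. Realistic perturbations have minimal effect on performance and explanations, but masking has a drastic effect. We find that high uncertainty does not necessarily imply low explanation plausibility; the correlation between the two metrics can be moderately positive when noise is exposed during training. This suggests that noise-augmented models may be better at identifying salient tokens when uncertain. Furthermore, when predictive and epistemic uncertainty measures are over-confident, the robustness of a saliency map to perturbation can indicate model stability issues. Integrated Gradients shows the greatest overall robustness to perturbation while still showing model-specific patterns in performance; however, this phenomenon is limited to smaller Transformer-based language models.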

Title: Code Needs Comments: Enhancing Code LLMs with Comment Augmentation

Authors: Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13013
Pdf URL: https://arxiv.org/pdf/2402.13013
Copy Paste: [[2402.13013]] Code Needs Comments: Enhancing Code LLMs with Comment Augmentation(https://arxiv.org/abs/2402.13013)
Keywords: language model, llm
Abstract: Programming skill is a crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data on code-focused LLMs' performance by assessing the comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language. We conducted experiments on three code-focused LLMs and observed consistent improvements in performance on two widely used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used for generating comments and the model further trained on the data without augmentation.
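Summary: Programming skill is a key ability for Large Language Models (LLMs) and requires a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data on the performance of code-focused LLMs by assessing comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, combined with a data filtering strategy that filters out code data poorly correlated with natural language. We ran experiments on three code-focused LLMs and observed consistent performance improvements on two widely used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used to generate the comments and the model further trained on the data without augmentation.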
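The abstract above uses comment density as a proxy for PL-NL alignment but does not define the metric. As a rough illustration only, the sketch below assumes one possible definition: the share of non-whitespace characters in a Python source string that belong to `#` comments. The function name `comment_density` and this definition are hypothetical and are not taken from the paper.

```python
import io
import tokenize


def comment_density(source: str) -> float:
    """Share of non-whitespace characters that sit inside '#' comments.

    Hypothetical metric for illustration; the paper's exact definition
    is not given in the abstract.
    """
    total = sum(1 for ch in source if not ch.isspace())
    if total == 0:
        return 0.0

    comment_chars = 0
    # The tokenizer separates real comments from '#' inside string literals.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += sum(1 for ch in tok.string if not ch.isspace())
    return comment_chars / total


if __name__ == "__main__":
    print(comment_density("x = 1  # set x\n"))
```

A filtering step in this spirit could drop files whose density falls below a chosen threshold; the threshold itself would have to be tuned and is not specified here.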

Title: Understanding the effects of language-specific class imbalance in multilingual fine-tuning

Authors: Vincent Jung, Lonneke van der Plas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13016
Pdf URL: https://arxiv.org/pdf/2402.13016
Copy Paste: [[2402.13016]] Understanding the effects of language-specific class imbalance in multilingual fine-tuning(https://arxiv.org/abs/2402.13016)
Keywords: language model, llm
Abstract: We study the effect of one type of imbalance often present in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show evidence that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighting approach to imbalance by calculating class weights separately for each language and show that this helps mitigate those detrimental effects. These results create awareness of the negative effects of language-specific class imbalance in multilingual fine-tuning and the way in which the model learns to rely on the separation of languages to perform the task.
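Summary: We study the effect of one type of imbalance that often appears in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighting approach to imbalance by calculating class weights separately for each language, and show that this helps mitigate those detrimental effects. These results raise awareness of the negative effects of language-specific class imbalance in multilingual fine-tuning and of the way the model learns to rely on the separation of languages to perform the task.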
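To make the idea of computing class weights separately for each language concrete, here is a minimal sketch. It assumes the common inverse-frequency ("balanced") heuristic, weight = n_samples / (n_classes * class_count), applied within each language; the abstract does not state which weighting formula the authors use, and the function name `per_language_class_weights` is hypothetical.

```python
from collections import Counter, defaultdict


def per_language_class_weights(labels, languages):
    """Return {language: {label: weight}} using inverse class frequency
    computed inside each language (an assumed heuristic, not the paper's
    confirmed formula)."""
    counts = defaultdict(Counter)
    for label, lang in zip(labels, languages):
        counts[lang][label] += 1

    weights = {}
    for lang, counter in counts.items():
        n_samples = sum(counter.values())
        n_classes = len(counter)
        weights[lang] = {
            label: n_samples / (n_classes * n) for label, n in counter.items()
        }
    return weights


if __name__ == "__main__":
    labels = ["pos", "pos", "pos", "neg", "pos", "neg", "neg", "neg"]
    languages = ["en"] * 4 + ["de"] * 4
    print(per_language_class_weights(labels, languages))
```

In this toy data the labels are balanced globally (four of each), so a single global weighting would assign every example the same weight, while the per-language weights up-weight the minority label in each language.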

Title: SoMeLVLM: A Large Vision Language Model for Social Media Processing

Authors: Xinnong Zhang, Haoyu Kuang, Xinyi Mou, Hanjia Lyu, Kun Wu, Siming Chen, Jiebo Luo, Xuanjing Huang, Zhongyu Wei
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2402.13022
Pdf URL: https://arxiv.org/pdf/2402.13022
Copy Paste: [[2402.13022]] SoMeLVLM: A Large Vision Language Model for Social Media Processing(https://arxiv.org/abs/2402.13022)
Keywords: language model, prompt
Abstract: The growth of social media, characterized by its multimodal nature, has led to the emergence of diverse phenomena and challenges, which calls for an effective approach to uniformly solve automated tasks. The powerful Large Vision Language Models make it possible to handle a variety of tasks simultaneously, but even with carefully designed prompting methods, the general domain models often fall short in aligning with the unique speaking style and context of social media tasks. In this paper, we introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM), which is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation. SoMeLVLM is designed to understand and generate realistic social media behavior. We have developed a 654k multimodal social media instruction-tuning dataset to support our cognitive framework and fine-tune our model. Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks. Further analysis shows its significant advantages over baselines in terms of cognitive abilities.
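Summary: The growth of social media, characterized by its multimodal nature, has led to the emergence of diverse phenomena and challenges, which calls for an effective approach that solves automated tasks in a unified way. Powerful Large Vision Language Models make it possible to handle a variety of tasks simultaneously, but even with carefully designed prompting methods, general-domain models often fail to align with the unique speaking style and context of social media tasks. In this paper, we introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM), a cognitive framework equipped with five key capabilities: knowledge & comprehension, application, analysis, evaluation, and creation. SoMeLVLM is designed to understand and generate realistic social media behavior. We developed a 654k multimodal social media instruction-tuning dataset to support our cognitive framework and fine-tune our model. Our experiments show that SoMeLVLM achieves state-of-the-art performance on multiple social media tasks. Further analysis shows significant advantages over baselines in terms of cognitive abilities.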

Title: Heterogeneous Graph Reasoning for Fact Checking over Texts and Tables

Authors: Haisong Gong, Weizhi Xu, Shu Wu, Qiang Liu, Liang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13028
Pdf URL: https://arxiv.org/pdf/2402.13028
Copy Paste: [[2402.13028]] Heterogeneous Graph Reasoning for Fact Checking over Texts and Tables(https://arxiv.org/abs/2402.13028)
Keywords: language model
Abstract: Fact checking aims to predict claim veracity by reasoning over multiple evidence pieces. It usually involves evidence retrieval and veracity reasoning. In this paper, we focus on the latter, reasoning over unstructured text and structured table information. Previous works have primarily relied on fine-tuning pretrained language models or training homogeneous-graph-based models. Despite their effectiveness, we argue that they fail to explore the rich semantic information underlying the evidence with different structures. To address this, we propose a novel word-level Heterogeneous-graph-based model for Fact Checking over unstructured and structured information, namely HeterFC. Our approach leverages a heterogeneous evidence graph, with words as nodes and thoughtfully designed edges representing different evidence properties. We perform information propagation via a relational graph neural network, facilitating interactions between claims and evidence. An attention-based method is utilized to integrate information, combined with a language model for generating predictions. We introduce a multitask loss function to account for potential inaccuracies in evidence retrieval. Comprehensive experiments on the large fact checking dataset FEVEROUS demonstrate the effectiveness of HeterFC. Code will be released at: https://github.com/Deno-V/HeterFC.
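Summary: Fact checking aims to predict the veracity of a claim by reasoning over multiple pieces of evidence. It usually involves evidence retrieval and veracity reasoning. In this paper, we focus on the latter, reasoning over unstructured text and structured table information. Previous work has relied mainly on fine-tuning pretrained language models or training homogeneous-graph-based models. Despite their effectiveness, we argue that they fail to explore the rich semantic information underlying evidence with different structures. To address this, we propose HeterFC, a novel word-level heterogeneous-graph-based model for fact checking over unstructured and structured information. Our approach uses a heterogeneous evidence graph, with words as nodes and carefully designed edges representing different evidence properties. We propagate information through a relational graph neural network, facilitating interactions between claims and evidence. An attention-based method integrates the information and is combined with a language model to generate predictions. We introduce a multitask loss function to account for potential inaccuracies in evidence retrieval. Comprehensive experiments on the large fact checking dataset FEVEROUS demonstrate the effectiveness of HeterFC. Code will be released at: https://github.com/Deno-V/HeterFC.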

Title: Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models

Authors: Che Zhang, Zhenyang Xiao, Chengcheng Han, Yixin Lian, Yuejian Fang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13035
Pdf URL: https://arxiv.org/pdf/2402.13035
Copy Paste: [[2402.13035]] Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models(https://arxiv.org/abs/2402.13035)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have made significant strides in reasoning capabilities, with ongoing efforts to refine their reasoning through self-correction. However, recent studies suggest that self-correction can be limited or even counterproductive without external accurate knowledge, raising questions about the limits and effectiveness of self-correction. In this paper, we aim to enhance LLMs' self-checking capabilities by meticulously designing training data, thereby improving the accuracy of self-correction. We conduct a detailed analysis of error types in mathematical reasoning and develop a tailored prompt, termed "Step CoT Check". Then we construct a checking-correction dataset for training models. After integrating the original CoT data and checking-correction data for training, we observe that models could improve their self-checking capabilities, thereby enhancing their self-correction capacity and eliminating the need for external feedback or ground truth labels to ascertain the endpoint of correction. We compare the performance of models fine-tuned with the "Step CoT Check" prompt against those refined using other prompts within the context of checking-correction data. The "Step CoT Check" outperforms the other two check formats in models with larger parameters, providing more precise feedback and thus achieving a higher rate of correctness. For reproducibility, all the datasets and code are provided at https://github.com/bammt/Learn-to-check.
Summary: Large language models (LLMs) have made significant progress in reasoning capabilities, with ongoing efforts to refine their reasoning through self-correction. However, recent studies suggest that without accurate external knowledge, self-correction can be limited or even counterproductive, raising questions about its limits and effectiveness. This paper aims to enhance LLMs' self-checking capabilities through carefully designed training data, thereby improving the accuracy of self-correction. We analyze error types in mathematical reasoning in detail and develop a tailored prompt, termed "Step CoT Check". We then construct a checking-correction dataset for training models. After combining the original CoT data and the checking-correction data for training, we observe that models can improve their self-checking capabilities, which enhances their self-correction capacity and removes the need for external feedback or ground truth labels to determine the endpoint of correction. We compare models fine-tuned with the "Step CoT Check" prompt against models fine-tuned with other prompts in the context of checking-correction data. "Step CoT Check" outperforms the other two check formats in models with larger parameters, providing more precise feedback and thus achieving a higher rate of correctness. For reproducibility, all datasets and code are provided at https://github.com/bammt/Learn-to-check.

Title: SiLLM: Large Language Models for Simultaneous Machine Translation

Authors: Shoutao Guo, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13036
Pdf URL: https://arxiv.org/pdf/2402.13036
Copy Paste: [[2402.13036]] SiLLM: Large Language Models for Simultaneous Machine Translation(https://arxiv.org/abs/2402.13036)
Keywords: language model, llm, agent
Abstract: Simultaneous Machine Translation (SiMT) generates translations while reading the source sentence, necessitating a policy to determine the optimal timing for reading and generating words. Despite the remarkable performance achieved by Large Language Models (LLM) across various NLP tasks, existing SiMT methods predominantly focus on conventional transformers, employing a single model to concurrently determine the policy and generate the translations. However, given the complexity of SiMT, it is challenging to effectively address both tasks with a single model. Therefore, there is a need to decouple the SiMT task into policy-decision and translation sub-tasks. We propose SiLLM, which delegates the two sub-tasks to separate agents, thereby incorporating LLM into SiMT. The policy-decision agent is managed by a conventional SiMT model, responsible for determining the translation policy. The translation agent, leveraging the capabilities of LLM, generates translation using the partial source sentence. The two agents collaborate to accomplish SiMT. To facilitate the application of token-level policies determined by conventional SiMT models to LLM, we propose a word-level policy adapted for LLM. Experiments on two datasets demonstrate that, with a small amount of data for fine-tuning LLM, SiLLM attains state-of-the-art performance.
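Summary: Simultaneous Machine Translation (SiMT) generates translations while reading the source sentence, so it needs a policy to determine the optimal timing for reading and generating words. Despite the remarkable performance of Large Language Models (LLMs) across various NLP tasks, existing SiMT methods focus mainly on conventional transformers, using a single model to determine the policy and generate the translation at the same time. Given the complexity of SiMT, however, it is challenging to address both tasks effectively with a single model. The SiMT task therefore needs to be decoupled into policy-decision and translation sub-tasks. We propose SiLLM, which delegates the two sub-tasks to separate agents, thereby incorporating LLMs into SiMT. The policy-decision agent is managed by a conventional SiMT model and is responsible for determining the translation policy. The translation agent uses the capabilities of an LLM to generate the translation from the partial source sentence. The two agents collaborate to accomplish SiMT. To make it easier to apply the token-level policies determined by conventional SiMT models to LLMs, we propose a word-level policy adapted for LLMs. Experiments on two datasets show that, with a small amount of data for fine-tuning the LLM, SiLLM attains state-of-the-art performance.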

Title: Align Your Intents: Offline Imitation Learning via Optimal Transport

Authors: Maksim Bobrin, Nazar Buzun, Dmitrii Krylov, Dmitry V. Dylov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13037
Pdf URL: https://arxiv.org/pdf/2402.13037
Copy Paste: [[2402.13037]] Align Your Intents: Offline Imitation Learning via Optimal Transport(https://arxiv.org/abs/2402.13037)
Keywords: agent
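Abstract: Offline reinforcement learning (RL) addresses the problem of sequential decision-making by learning an optimal policy through pre-collected data, without interacting with the environment. As yet, it has remained somewhat impractical, because one rarely knows the reward explicitly and it is hard to distill it retrospectively. Here, we show that an imitating agent can still learn the desired behavior merely from observing the expert, despite the absence of explicit rewards or action labels. In our method, AILOT (Aligned Imitation Learning via Optimal Transport), we use a special representation of states in the form of intents that incorporate pairwise spatial distances within the data. Given such representations, we define an intrinsic reward function via the optimal transport distance between the expert's and the agent's trajectories. We report that AILOT outperforms state-of-the-art offline imitation learning algorithms on D4RL benchmarks and improves the performance of other offline RL algorithms in sparse-reward tasks.
Summary: Offline reinforcement learning (RL) addresses sequential decision-making by learning an optimal policy from pre-collected data, without interacting with the environment. So far it has remained somewhat impractical, because the reward is rarely known explicitly and is hard to distill retrospectively. Here, we show that an imitating agent can still learn the desired behavior merely by observing the expert, even without explicit rewards or action labels. Our method, AILOT (Aligned Imitation Learning via Optimal Transport), uses a special representation of states in the form of intents that incorporate pairwise spatial distances within the data. Given such representations, we define an intrinsic reward function via the optimal transport distance between the expert's and the agent's trajectories. We report that AILOT outperforms state-of-the-art offline imitation learning algorithms on D4RL benchmarks and improves the performance of other offline RL algorithms in sparse-reward tasks.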

Title: Text-Guided Molecule Generation with Diffusion Language Model

Authors: Haisong Gong, Qiang Liu, Shu Wu, Liang Wang
Subjects: cs.LG, cs.AI, cs.CE, cs.CL, q-bio.BM
Abstract URL: https://arxiv.org/abs/2402.13040
Pdf URL: https://arxiv.org/pdf/2402.13040
Copy Paste: [[2402.13040]] Text-Guided Molecule Generation with Diffusion Language Model(https://arxiv.org/abs/2402.13040)
Keywords: language model
Abstract: Text-guided molecule generation is a task where molecules are generated to match specific textual descriptions. Recently, most existing SMILES-based molecule generation methods rely on an autoregressive architecture. In this work, we propose the Text-Guided Molecule Generation with Diffusion Language Model (TGM-DLM), a novel approach that leverages diffusion models to address the limitations of autoregressive methods. TGM-DLM updates token embeddings within the SMILES string collectively and iteratively, using a two-phase diffusion generation process. The first phase optimizes embeddings from random noise, guided by the text description, while the second phase corrects invalid SMILES strings to form valid molecular representations. We demonstrate that TGM-DLM outperforms MolT5-Base, an autoregressive model, without the need for additional data resources. Our findings underscore the remarkable effectiveness of TGM-DLM in generating coherent and precise molecules with specific properties, opening new avenues in drug discovery and related scientific domains. Code will be released at: https://github.com/Deno-V/tgm-dlm.
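Summary: Text-guided molecule generation is the task of generating molecules that match specific textual descriptions. Most existing SMILES-based molecule generation methods rely on an autoregressive architecture. In this work, we propose Text-Guided Molecule Generation with Diffusion Language Model (TGM-DLM), a novel approach that uses diffusion models to address the limitations of autoregressive methods. TGM-DLM updates the token embeddings within the SMILES string collectively and iteratively, using a two-phase diffusion generation process. The first phase optimizes embeddings from random noise under the guidance of the text description, while the second phase corrects invalid SMILES strings to form valid molecular representations. We show that TGM-DLM outperforms MolT5-Base, an autoregressive model, without requiring additional data resources. Our findings underscore the remarkable effectiveness of TGM-DLM in generating coherent and precise molecules with specific properties, opening new avenues in drug discovery and related scientific domains. Code will be released at: https://github.com/Deno-V/tgm-dlm.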

Title: Effective and Efficient Conversation Retrieval for Dialogue State Tracking with Implicit Text Summaries

Authors: Seanie Lee, Jianpeng Chen, Joris Driesen, Alexandru Coca, Anders Johannsen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13043
Pdf URL: https://arxiv.org/pdf/2402.13043
Copy Paste: [[2402.13043]] Effective and Efficient Conversation Retrieval for Dialogue State Tracking with Implicit Text Summaries(https://arxiv.org/abs/2402.13043)
Keywords: language model, gpt, llm, prompt
Abstract: Few-shot dialogue state tracking (DST) with Large Language Models (LLM) relies on an effective and efficient conversation retriever to find similar in-context examples for prompt learning. Previous works use raw dialogue context as search keys and queries, and a retriever is fine-tuned with annotated dialogues to achieve superior performance. However, the approach is less suited for scaling to new domains or new annotation languages, where fine-tuning data is unavailable. To address this problem, we handle the task of conversation retrieval based on text summaries of the conversations. An LLM-based conversation summarizer is adopted for query and key generation, which enables effective maximum inner product search. To avoid the extra inference cost brought by LLM-based conversation summarization, we further distill a lightweight conversation encoder which produces query embeddings without decoding summaries for test conversations. We validate our retrieval approach on MultiWOZ datasets with GPT-Neo-2.7B and LLaMA-7B/30B. The experimental results show a significant improvement over relevant baselines in real few-shot DST settings.
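Summary: Few-shot dialogue state tracking (DST) with Large Language Models (LLMs) relies on an effective and efficient conversation retriever to find similar in-context examples for prompt learning. Previous work uses the raw dialogue context as search keys and queries, and fine-tunes a retriever on annotated dialogues to achieve superior performance. However, this approach is less suited to scaling to new domains or new annotation languages, where fine-tuning data is unavailable. To address this problem, we handle conversation retrieval based on text summaries of the conversations. An LLM-based conversation summarizer is used for query and key generation, which enables effective maximum inner product search. To avoid the extra inference cost of LLM-based conversation summarization, we further distill a lightweight conversation encoder that produces query embeddings without decoding summaries for test conversations. We validate our retrieval approach on MultiWOZ datasets with GPT-Neo-2.7B and LLaMA-7B/30B. The experimental results show a significant improvement over relevant baselines in real few-shot DST settings.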
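The retrieval step mentioned above, maximum inner product search over query and key embeddings, can be sketched as a brute-force scan. This is a toy illustration under stated assumptions: the vectors below are made-up numbers standing in for embeddings of conversation summaries, the function name `top_k_inner_product` is hypothetical, and a real system would likely use an approximate index rather than a full scan.

```python
def top_k_inner_product(query, keys, k=1):
    """Return the indices of the k keys with the largest inner product
    with the query, best first (exhaustive search)."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical 2-d embeddings of three stored conversations.
    keys = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
    query = [1.0, 0.0]
    print(top_k_inner_product(query, keys, k=2))
```

The returned indices would select which stored dialogues are placed in the prompt as in-context examples.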

Title: Stable Knowledge Editing in Large Language Models

Authors: Zihao Wei, Liang Pang, Hanxing Ding, Jingcheng Deng, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13048
Pdf URL: https://arxiv.org/pdf/2402.13048
Copy Paste: [[2402.13048]] Stable Knowledge Editing in Large Language Models(https://arxiv.org/abs/2402.13048)
Keywords: language model, gpt, chat
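Abstract: Efficient knowledge editing of large language models is crucial for replacing obsolete information or incorporating specialized knowledge on a large scale. However, previous methods implicitly assume that knowledge is localized and isolated within the model, an assumption that oversimplifies the interconnected nature of model knowledge. The premise of localization results in incomplete knowledge editing, whereas the assumption of isolation may impair both other knowledge and general abilities. These assumptions make the performance of knowledge editing methods unstable. To transcend them, we introduce StableKE, a method that adopts a novel perspective based on knowledge augmentation rather than knowledge localization. To overcome the expense of human labeling, StableKE integrates two automated knowledge augmentation strategies: a Semantic Paraphrase Enhancement strategy, which diversifies knowledge descriptions to facilitate the teaching of new information to the model, and a Contextual Description Enrichment strategy, which expands the surrounding knowledge to prevent the forgetting of related information. StableKE surpasses other knowledge editing methods, demonstrating stability on both edited knowledge and multi-hop knowledge, while also preserving unrelated knowledge and general abilities. Moreover, StableKE can edit knowledge on ChatGPT.
Summary: Efficient knowledge editing of large language models is crucial for replacing outdated information or incorporating specialized knowledge at scale. However, previous methods implicitly assume that knowledge is localized and isolated within the model, an assumption that oversimplifies the interconnected nature of model knowledge. The premise of localization leads to incomplete knowledge editing, while the assumption of isolation may impair both other knowledge and general abilities. Together they make the performance of knowledge editing methods unstable. To move beyond these assumptions, we introduce StableKE, a method that adopts a novel perspective based on knowledge augmentation rather than knowledge localization. To avoid the cost of human labeling, StableKE integrates two automated knowledge augmentation strategies: Semantic Paraphrase Enhancement, which diversifies knowledge descriptions to make it easier to teach the model new information, and Contextual Description Enrichment, which expands the surrounding knowledge to prevent related information from being forgotten. StableKE surpasses other knowledge editing methods, showing stability on both edited knowledge and multi-hop knowledge while preserving unrelated knowledge and general abilities. In addition, StableKE can edit knowledge on ChatGPT.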

Title: Identifying Semantic Induction Heads to Understand In-Context Learning

Authors: Jie Ren, Qipeng Guo, Hang Yan, Dongrui Liu, Xipeng Qiu, Dahua Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13055
Pdf URL: https://arxiv.org/pdf/2402.13055
Copy Paste: [[2402.13055]] Identifying Semantic Induction Heads to Understand In-Context Learning(https://arxiv.org/abs/2402.13055)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have demonstrated remarkable performance, the lack of transparency in their inference logic raises concerns about their trustworthiness. To gain a better understanding of LLMs, we conduct a detailed analysis of the operations of attention heads and aim to better understand the in-context learning of LLMs. Specifically, we investigate whether attention heads encode two types of relationships between tokens present in natural languages: the syntactic dependency parsed from sentences and the relation within knowledge graphs. We find that certain attention heads exhibit a pattern where, when attending to head tokens, they recall tail tokens and increase the output logits of those tail tokens. More crucially, the formulation of such semantic induction heads has a close correlation with the emergence of the in-context learning ability of language models. The study of semantic attention heads advances our understanding of the intricate operations of attention heads in transformers, and further provides new insights into the in-context learning of LLMs.
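Summary: Although large language models (LLMs) have demonstrated remarkable performance, the lack of transparency in their inference logic raises concerns about their trustworthiness. To understand LLMs better, we analyze the operations of attention heads in detail, aiming to better understand the in-context learning of LLMs. Specifically, we investigate whether attention heads encode two types of relationships between tokens present in natural languages: the syntactic dependency parsed from sentences and the relation within knowledge graphs. We find that certain attention heads exhibit a pattern in which, when attending to head tokens, they recall tail tokens and increase the output logits of those tail tokens. More crucially, the formulation of such semantic induction heads is closely correlated with the emergence of the in-context learning ability of language models. The study of semantic attention heads advances our understanding of the intricate operations of attention heads in transformers and provides new insights into the in-context learning of LLMs.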

Title: Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Authors: Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13064
Pdf URL: https://arxiv.org/pdf/2402.13064
Copy Paste: [[2402.13064]] Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models(https://arxiv.org/abs/2402.13064)
Keywords: language model, llm
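Abstract: We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy by decomposing human knowledge and capabilities into various fields, sub-fields and, ultimately, distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN allows for easy customization, and new fields or skills can be added by simply incorporating a new node into our taxonomy.
Summary: We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN uses only a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy semi-automatically, with the help of LLMs, by decomposing human knowledge and capabilities into fields, sub-fields and, ultimately, distinct disciplines. We then generate a comprehensive list of subjects for every discipline and design a syllabus tailored to each subject, again using LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we can generate diverse instructions that broadly cover the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) show that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN is easy to customize: new fields or skills can be added simply by incorporating a new node into the taxonomy.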

Title: Event-level Knowledge Editing

Authors: Hao Peng, Xiaozhi Wang, Chunyang Li, Kaisheng Zeng, Jiangshan Duo, Yixin Cao, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13093
Pdf URL: https://arxiv.org/pdf/2402.13093
Copy Paste: [[2402.13093]] Event-level Knowledge Editing(https://arxiv.org/abs/2402.13093)
Keywords: language model, llm
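Abstract: Knowledge editing aims at updating knowledge of large language models (LLMs) to prevent them from becoming outdated. Existing work edits LLMs at the level of factual knowledge triplets. However, natural knowledge updates in the real world come from the occurrences of new events rather than direct changes in factual triplets. In this paper, we propose a new task setting: event-level knowledge editing, which directly edits new events into LLMs and improves over conventional triplet-level editing in two respects. (1) Efficiency: a single event edit leads to updates in multiple entailed knowledge triplets. (2) Completeness: beyond updating factual knowledge, event-level editing also requires considering the event influences and updating LLMs' knowledge about future trends. We construct a high-quality event-level editing benchmark ELKEN, consisting of 1,515 event edits, 6,449 questions about factual knowledge, and 10,150 questions about future tendencies. We systematically evaluate the performance of various knowledge editing methods and LLMs on this benchmark. We find that ELKEN poses significant challenges to existing knowledge editing approaches. Our code and dataset are publicly released to facilitate further research.
Summary: Knowledge editing aims to update the knowledge of large language models (LLMs) to prevent them from becoming outdated. Existing work edits LLMs at the level of factual knowledge triplets. However, natural knowledge updates in the real world come from the occurrence of new events rather than from direct changes in factual triplets. In this paper, we propose a new task setting, event-level knowledge editing, which edits new events directly into LLMs and improves over conventional triplet-level editing in two respects. (1) Efficiency: a single event edit leads to updates in multiple entailed knowledge triplets. (2) Completeness: beyond updating factual knowledge, event-level editing also requires considering the influence of the event and updating the LLM's knowledge about future trends. We construct ELKEN, a high-quality event-level editing benchmark consisting of 1,515 event edits, 6,449 questions about factual knowledge, and 10,150 questions about future tendencies. We systematically evaluate various knowledge editing methods and LLMs on this benchmark. We find that ELKEN poses significant challenges to existing knowledge editing approaches. Our code and dataset are publicly released to facilitate further research.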

Title: ELAD: Explanation-Guided Large Language Models Active Distillation

Authors: Yifei Zhang, Bo Pan, Chen Ling, Yuntong Hu, Liang Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13098
Pdf URL: https://arxiv.org/pdf/2402.13098
Copy Paste: [[2402.13098]] ELAD: Explanation-Guided Large Language Models Active Distillation(https://arxiv.org/abs/2402.13098)
Keywords: language model, llm
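Abstract: The deployment and application of Large Language Models (LLMs) are hindered by their memory inefficiency, computational demands, and the high costs of API inferences. Traditional distillation methods, which transfer the capabilities of LLMs to smaller models, often fail to determine whether the knowledge has been sufficiently transferred, potentially resulting in high costs or incomplete distillation. In this paper, we propose an Explanation-Guided LLMs Active Distillation (ELAD) framework that employs an active learning strategy to optimize the balance between annotation costs and model performance. To make sample selection more efficient, we introduce an explanation-guided sample selection method that identifies samples challenging its reasoning by exploiting uncertainties in explanation steps. Additionally, we present a customized LLM-annotated explanation revision technique where the teacher model detects and corrects flaws in the student model's reasoning. Our experiments across various reasoning datasets demonstrate that our framework significantly enhances the efficiency of LLM knowledge distillation.
Summary: The deployment and application of Large Language Models (LLMs) are hindered by their memory inefficiency, computational demands, and the high cost of API inference. Traditional distillation methods, which transfer the capabilities of LLMs to smaller models, often cannot determine whether the knowledge has been sufficiently transferred, which can result in high costs or incomplete distillation. In this paper, we propose an Explanation-Guided LLMs Active Distillation (ELAD) framework that uses an active learning strategy to optimize the balance between annotation costs and model performance. To make sample selection more efficient, we introduce an explanation-guided sample selection method that identifies samples challenging its reasoning by exploiting uncertainties in the explanation steps. We also present a customized LLM-annotated explanation revision technique in which the teacher model detects and corrects flaws in the student model's reasoning. Our experiments across various reasoning datasets show that the framework significantly improves the efficiency of LLM knowledge distillation.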

Title: CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models

Authors: Yizhi LI, Ge Zhang, Xingwei Qu, Jiali Li, Zhaoqun Li, Zekun Wang, Hao Li, Ruibin Yuan, Yinghao Ma, Kai Zhang, Wangchunshu Zhou, Yiming Liang, Lei Zhang, Lei Ma, Jiajun Zhang, Zuowen Li, Stephen W. Huang, Chenghua Lin, Wenhu Chen, Jie Fu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13109
Pdf URL: https://arxiv.org/pdf/2402.13109
Copy Paste: [[2402.13109]] CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models(https://arxiv.org/abs/2402.13109)
Keywords: language model, llm
Abstract: The advancement of large language models (LLMs) has enhanced the ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following. Yet, their effectiveness often diminishes in low-resource languages like Chinese, exacerbated by biased evaluations from data leakage, casting doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate evaluation bias, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance, totaling 45,000 data instances. Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts. This work aims to uncover the current limitations of LLMs in handling Chinese tasks, pushing towards the development of more culturally informed and linguistically diverse models with the released data and benchmark (https://yizhilll.github.io/CIF-Bench/).
摘要：大语言模型 (LLM) 的进步增强了通过指令跟踪泛化各种未见过的自然语言处理 (NLP) 任务的能力。然而，它们的有效性在中文等资源匮乏的语言中往往会减弱，数据泄露造成的偏见评估会加剧这种情况，让人怀疑它们对新语言领域的真正普遍性。为此，我们引入了中文指令跟踪基准（CIF-Bench），旨在评估法学硕士对中文的零样本泛化性。 CIF-Bench 包含 150 个任务和 15,000 个输入输出对，由母语人士开发，用于测试 20 个类别的复杂推理和中国文化细微差别。为了减轻评估偏差，我们仅公开一半的数据集，其余部分保密，并引入多样化的指令以最大限度地减少分数差异，总共 45,000 个数据实例。我们对 28 个选定的法学硕士的评估显示出明显的性能差距，最佳模型得分仅为 52.9%，凸显了法学硕士在不太熟悉的语言和任务环境中的局限性。这项工作旨在揭示法学硕士目前在处理中文任务方面的局限性，利用已发布的数据和基准推动开发更具文化背景和语言多样性的模型（https://yizhilll.github.io/CIF-Bench/）。

Title: A Survey on Knowledge Distillation of Large Language Models

Authors: Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13116
Pdf URL: https://arxiv.org/pdf/2402.13116
Copy Paste: [[2402.13116]] A Survey on Knowledge Distillation of Large Language Models(https://arxiv.org/abs/2402.13116)
Keywords: language model, gpt, llm
Abstract: This survey presents an in-depth exploration of knowledge distillation (KD) techniques within the realm of Large Language Models (LLMs), spotlighting the pivotal role of KD in transferring sophisticated capabilities from proprietary giants such as GPT-4 to accessible, open-source models like LLaMA and Mistral. Amidst the evolving AI landscape, this work elucidates the critical disparities between proprietary and open-source LLMs, demonstrating how KD serves as an essential conduit for imbuing the latter with the former's advanced functionalities and nuanced understandings. Our survey is meticulously structured around three foundational pillars: algorithm, skill, and verticalization -- providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts. This work aims to provide an insightful guide for researchers and practitioners, offering a detailed overview of current methodologies in knowledge distillation and proposing future research directions. By bridging the gap between proprietary and open-source LLMs, this survey underscores the potential for more accessible, efficient, and sustainable AI solutions, fostering a more inclusive and equitable landscape in AI advancements. An associated Github repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs.
摘要：这项调查对大型语言模型 (LLM) 领域内的知识蒸馏 (KD) 技术进行了深入探索，突出了 KD 在将复杂功能从 GPT-4 等专有巨头转移到可访问的开源软件方面的关键作用LLaMA 和 Mistral 等型号。在不断发展的人工智能领域，这项工作阐明了专有法学硕士和开源法学硕士之间的关键差异，展示了 KD 如何作为向后者灌输前者的先进功能和细致入微的理解的重要渠道。我们的调查围绕三个基本支柱精心构建：算法、技能和垂直化——对 KD 机制、特定认知能力的增强及其在不同领域的实际影响进行全面检查。至关重要的是，该调查探讨了数据增强 (DA) 和 KD 之间错综复杂的相互作用，说明了 DA 如何作为 KD 框架内的强大范式出现，以提高法学硕士的表现。通过利用 DA 生成上下文丰富、特定于技能的训练数据，KD 超越了传统界限，使开源模型能够近似其专有模型的上下文适应性、道德一致性和深刻的语义洞察特征。这项工作旨在为研究人员和从业者提供富有洞察力的指南，详细概述当前知识蒸馏的方法并提出未来的研究方向。通过弥合专有法学硕士和开源法学硕士之间的差距，这项调查强调了更容易获得、更高效和更可持续的人工智能解决方案的潜力，从而在人工智能进步中培育更具包容性和公平的格局。相关的 Github 存储库位于 https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs。

Title: TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning

Authors: Xiang Li, Yunshi Lan, Chao Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13125
Pdf URL: https://arxiv.org/pdf/2402.13125
Copy Paste: [[2402.13125]] TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning(https://arxiv.org/abs/2402.13125)
Keywords: language model, llm
Abstract: Recently, numerous new benchmarks have been established to evaluate the performance of large language models (LLMs), either by computing a holistic score or by employing another LLM as a judge. However, these approaches suffer from data leakage, due to the open access of the benchmarks, and from an inflexible evaluation process. To address this issue, we introduce TreeEval, a benchmark-free evaluation method for LLMs that lets a high-performance LLM host an irreproducible evaluation session and thereby essentially avoids data leakage. Moreover, this LLM acts as an examiner that raises a series of questions under a topic using a tree planning strategy, which considers the current evaluation status to decide the next question to generate and ensures the completeness and efficiency of the evaluation process. We evaluate 6 models of different parameter sizes, including 7B, 13B, and 33B, and ultimately achieve the highest correlation coefficient with AlpacaEval2.0 using only around 45 questions. We also conduct further analysis to show the robustness and reliability of TreeEval. Our code can be accessed at https://github.com/Ashura5/TreeEval.
摘要：最近，已经建立了许多新的基准，通过计算整体分数或聘请另一个 LLM 作为评判来评估大型语言模型 (LLM) 的性能。然而，由于基准测试的开放性和不灵活的评估过程，这些方法存在数据泄露的问题。为了解决这个问题，我们引入了 $\textbf{TreeEval}$，这是一种针对 LLM 的无基准评估方法，可以让高性能的 LLM 主持不可重复的评估会话，并从根本上避免数据泄漏。此外，该法学硕士作为考官在一个主题下提出一系列问题，并采用树木规划策略，考虑当前的评估状态来决定下一个问题的生成，并确保评估过程的完整性和效率。我们评估了不同参数大小的 $6$ 模型，包括 $7$B、$13$B 和 $33$B，最终仅使用大约 $45$ 的问题，通过 AlpacaEval2.0 获得了最高的相关系数。我们还进行了更多分析来展示 TreeEval 的稳健性和可靠性。我们的代码可以通过提供的 https://github.com/Ashura5/TreeEval 访问。

Title: The Hidden Space of Transformer Language Adapters

Authors: Jesujoba O. Alabi, Marius Mosbach, Matan Eyal, Dietrich Klakow, Mor Geva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13137
Pdf URL: https://arxiv.org/pdf/2402.13137
Copy Paste: [[2402.13137]] The Hidden Space of Transformer Language Adapters(https://arxiv.org/abs/2402.13137)
Keywords: language model
Abstract: We analyze the operation of transformer language adapters, which are small modules trained on top of a frozen language model to adapt its predictions to new target languages. We show that adapted predictions mostly evolve in the source language the model was trained on, while the target language becomes pronounced only in the very last layers of the model. Moreover, the adaptation process is gradual and distributed across layers, so that it is possible to skip small groups of adapters without decreasing adaptation performance. Last, we show that adapters operate on top of the model's frozen representation space while largely preserving its structure, rather than on an 'isolated' subspace. Our findings provide a deeper view into the adaptation process of language models to new languages, showcasing the constraints imposed on it by the underlying model and introducing practical implications for enhancing its efficiency.
摘要：我们分析了 Transformer 语言适配器的操作，这些适配器是在冻结语言模型之上训练的小模块，以使其预测适应新的目标语言。我们表明，适应的预测主要在模型训练所用的源语言中演变，而目标语言仅在模型的最后一层变得明显。此外，适应过程是渐进的并且跨层分布的，其中可以跳过小组适配器而不降低适应性能。最后，我们展示了适配器在模型的冻结表示空间之上运行，同时很大程度上保留了其结构，而不是在“孤立”的子空间上运行。我们的研究结果提供了对语言模型对新语言的适应过程的更深入的了解，展示了底层模型对其施加的限制，并引入了提高其效率的实际意义。

Title: Defending Jailbreak Prompts via In-Context Adversarial Game

Authors: Yujun Zhou, Yufei Han, Haomin Zhuang, Taicheng Guo, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2402.13148
Pdf URL: https://arxiv.org/pdf/2402.13148
Copy Paste: [[2402.13148]] Defending Jailbreak Prompts via In-Context Adversarial Game(https://arxiv.org/abs/2402.13148)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications. However, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning. ICAG leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents. This continuous improvement process strengthens defenses against newly generated jailbreak prompts. Our empirical studies affirm ICAG's efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios. Moreover, ICAG demonstrates remarkable transferability to other LLMs, indicating its potential as a versatile defense mechanism.
摘要：大型语言模型 (LLM) 在不同的应用程序中展示了卓越的功能。然而，对其安全性的担忧，尤其是容易受到越狱攻击的担忧仍然存在。受到深度学习和 LLM 代理学习过程中的对抗训练的启发，我们引入了上下文对抗游戏（ICAG）来防御越狱，而无需进行微调。 ICAG 利用代理学习来进行对抗性游戏，旨在动态扩展知识以防御越狱。与依赖静态数据集的传统方法不同，ICAG 采用迭代过程来增强防御和攻击代理。这种持续改进过程加强了对新生成的越狱提示的防御。我们的实证研究证实了 ICAG 的有效性，其中受 ICAG 保护的法学硕士在各种攻击场景中的越狱成功率显着降低。此外，ICAG 展示了向其他法学硕士的卓越可转移性，表明其作为多功能防御机制的潜力。

Title: Benchmarking Retrieval-Augmented Generation for Medicine

Authors: Guangzhi Xiong, Qiao Jin, Zhiyong Lu, Aidong Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13178
Pdf URL: https://arxiv.org/pdf/2402.13178
Copy Paste: [[2402.13178]] Benchmarking Retrieval-Augmented Generation for Medicine(https://arxiv.org/abs/2402.13178)
Keywords: language model, gpt, llm, hallucination, prompt, retrieval-augmented generation, chain-of-thought
Abstract: While large language models (LLMs) have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still face challenges with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG setting for various medical purposes. To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of different corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work. Overall, MedRAG improves the accuracy of six different LLMs by up to 18% over chain-of-thought prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our results show that the combination of various medical corpora and retrievers achieves the best performance. In addition, we discovered a log-linear scaling property and the "lost-in-the-middle" effects in medical RAG. We believe our comprehensive evaluations can serve as practical guidelines for implementing RAG systems for medicine.
摘要：虽然大型语言模型 (LLM) 在广泛的医学问答 (QA) 任务中取得了最先进的性能，但它们仍然面临着幻觉和过时知识的挑战。检索增强生成（RAG）是一种很有前途的解决方案，并已被广泛采用。然而，RAG 系统可能涉及多个灵活的组件，并且缺乏有关各种医疗目的的最佳 RAG 设置的最佳实践。为了系统地评估此类系统，我们提出了医学信息检索增强生成评估 (MIRAGE)，这是首个同类基准，包括来自 5 个医学 QA 数据集的 7,663 个问题。使用 MIRAGE，我们通过本工作中引入的 MedRAG 工具包，对不同语料库、检索器和骨干 LLM 的 41 种组合进行了超过 1.8 万亿个提示标记的大规模实验。总体而言，与思维链提示相比，MedRAG 将六种不同 LLM 的准确性提高了 18%，将 GPT-3.5 和 Mixtral 的性能提升至 GPT-4 水平。我们的结果表明，各种医学语料库和检索器的组合达到了最佳性能。此外，我们还发现了医疗 RAG 中的对数线性缩放特性和“中间丢失”效应。我们相信我们的综合评估可以作为实施医学 RAG 系统的实用指南。

Title: Order-Optimal Regret in Distributed Kernel Bandits using Uniform Sampling with Shared Randomness

Authors: Nikola Pavlovic, Sudeep Salgia, Qing Zhao
Subjects: cs.LG, cs.DC, stat.ML
Abstract URL: https://arxiv.org/abs/2402.13182
Pdf URL: https://arxiv.org/pdf/2402.13182
Copy Paste: [[2402.13182]] Order-Optimal Regret in Distributed Kernel Bandits using Uniform Sampling with Shared Randomness(https://arxiv.org/abs/2402.13182)
Keywords: agent
Abstract: We consider distributed kernel bandits where $N$ agents aim to collaboratively maximize an unknown reward function that lies in a reproducing kernel Hilbert space. Each agent sequentially queries the function to obtain noisy observations at the query points. Agents can share information through a central server, with the objective of minimizing regret accumulated over time $T$ and aggregated over agents. We develop the first algorithm that achieves the optimal regret order (as defined by centralized learning) with a communication cost that is sublinear in both $N$ and $T$. The key features of the proposed algorithm are uniform exploration at the local agents and shared randomness with the central server. Working together with the sparse approximation of the GP model, these two key components make it possible to preserve the learning rate of the centralized setting at a diminishing rate of communication.
摘要：我们考虑分布式内核强盗，其中 $N$ 代理旨在协作最大化位于再生内核希尔伯特空间中的未知奖励函数。每个代理顺序查询函数以获得查询点处的噪声观测值。代理可以通过中央服务器共享信息，目的是最大限度地减少随着时间 $T$ 积累并在代理上聚合的遗憾。我们开发了第一个算法，该算法实现了最佳后悔顺序（由集中学习定义），其通信成本在 $N$ 和 $T$ 中都是次线性的。该算法的主要特点是本地代理的统一探索和与中央服务器共享随机性。这两个关键组件与 GP 模型的稀疏近似一起工作，使得在通信速率递减的情况下保持集中设置的学习速率成为可能。

Title: What if LLMs Have Different World Views: Simulating Alien Civilizations with LLM-based Agents

Authors: Mingyu Jin, Beichen Wang, Zhaoqian Xue, Suiyuan Zhu, Wenyue Hua, Hua Tang, Kai Mei, Mengnan Du, Yongfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13184
Pdf URL: https://arxiv.org/pdf/2402.13184
Copy Paste: [[2402.13184]] What if LLMs Have Different World Views: Simulating Alien Civilizations with LLM-based Agents(https://arxiv.org/abs/2402.13184)
Keywords: language model, llm, agent
Abstract: In this study, we introduce "CosmoAgent," an innovative artificial intelligence framework utilizing Large Language Models (LLMs) to simulate complex interactions between human and extraterrestrial civilizations, with a special emphasis on Stephen Hawking's cautionary advice about not sending radio signals haphazardly into the universe. The goal is to assess the feasibility of peaceful coexistence while considering potential risks that could threaten well-intentioned civilizations. Employing mathematical models and state transition matrices, our approach quantitatively evaluates the development trajectories of civilizations, offering insights into future decision-making at critical points of growth and saturation. Furthermore, the paper acknowledges the vast diversity in potential living conditions across the universe, which could foster unique cosmologies, ethical codes, and worldviews among various civilizations. Recognizing the Earth-centric bias inherent in current LLM designs, we propose the novel concept of using LLMs with diverse ethical paradigms and simulating interactions between entities with distinct moral principles. This innovative research provides a new way to understand complex inter-civilizational dynamics, expanding our perspective while pioneering novel strategies for conflict resolution, crucial for preventing interstellar conflicts. We have also released the code and datasets to enable further academic investigation into this interesting area of research. The code is available at https://github.com/agiresearch/AlienAgent.
摘要：在这项研究中，我们介绍了“CosmoAgent”，这是一种创新的人工智能框架，利用大型语言模型（LLM）来模拟人类与外星文明之间的复杂交互，特别强调斯蒂芬·霍金关于不要随意向宇宙发送无线电信号的警告建议。目标是评估和平共处的可行性，同时考虑可能威胁善意文明的潜在风险。我们的方法采用数学模型和状态转换矩阵，定量评估文明的发展轨迹，为未来在增长和饱和关键点的决策提供见解。此外，该论文承认整个宇宙潜在的生活条件存在巨大多样性，这可能会在不同文明之间培育独特的宇宙观、道德准则和世界观。认识到当前法学硕士设计中固有的以地球为中心的偏见，我们提出了使用具有不同道德范式的法学硕士的新概念，并模拟具有不同道德原则的实体之间的互动。这项创新研究为理解复杂的文明间动态提供了一种新方法，扩大了我们的视野，同时开创了解决冲突的新策略，这对于预防星际冲突至关重要。我们还发布了代码和数据集，以便对这个有趣的研究领域进行进一步的学术调查。该代码可从 https://github.com/agiresearch/AlienAgent 获取。

Title: Question Calibration and Multi-Hop Modeling for Temporal Question Answering

Authors: Chao Xue, Di Liang, Pengfei Wang, Jing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13188
Pdf URL: https://arxiv.org/pdf/2402.13188
Copy Paste: [[2402.13188]] Question Calibration and Multi-Hop Modeling for Temporal Question Answering(https://arxiv.org/abs/2402.13188)
Keywords: language model
Abstract: Many models that leverage knowledge graphs (KGs) have recently demonstrated remarkable success in question answering (QA) tasks. In the real world, many facts contained in KGs are time-constrained, so temporal KGQA has received increasing attention. Despite the fruitful efforts of previous models in temporal KGQA, they still have several limitations. (I) They adopt pre-trained language models (PLMs) to obtain question representations, but PLMs tend to focus on entity information and ignore entity transfer caused by temporal constraints, and so fail to learn specific temporal representations of entities. (II) They neither emphasize the graph structure between entities nor explicitly model the multi-hop relationships in the graph, which makes it difficult to answer complex multi-hop questions. To alleviate these problems, we propose a novel Question Calibration and Multi-Hop Modeling (QC-MHM) approach. Specifically, we first calibrate the question representation by fusing the question and the time-constrained concepts in the KG. Then, we construct a GNN layer to perform multi-hop message passing. Finally, the question representation is combined with the embedding output by the GNN to generate the final prediction. Empirical results verify that the proposed model achieves better performance than the state-of-the-art models on the benchmark dataset. Notably, the Hits@1 and Hits@10 results of QC-MHM on the complex questions of the CronQuestions dataset improve by an absolute 5.1% and 1.2%, respectively, over the best-performing baseline. Moreover, QC-MHM can generate interpretable and trustworthy predictions.
摘要：许多利用知识图（KG）的模型最近在问答（QA）任务中取得了显着的成功。在现实世界中，KG 中包含的许多事实都是受时间限制的，因此时态 KGQA 受到越来越多的关注。尽管先前的模型在时间 KGQA 方面取得了卓有成效的努力，但它们仍然存在一些局限性。（一）它们采用预训练的语言模型（PLM）来获取问题表示，而PLM往往关注实体信息而忽略由时间约束引起的实体转移，最终无法学习实体的特定时间表示。（二）它们既不强调实体之间的图结构，也不明确地建模图中的多跳关系，这将导致难以解决复杂的多跳问答。为了缓解这个问题，我们提出了一种新颖的问题校准和多跳建模（QC-MHM）方法。具体来说，我们首先通过融合问题和 KG 中的时间约束概念来校准问题表示。然后，我们构建GNN层来完成多跳消息传递。最后，问题表示与 GNN 的嵌入输出相结合，生成最终预测。实证结果验证了所提出的模型比基准数据集中最先进的模型具有更好的性能。值得注意的是，与表现最佳的基线相比，QC-MHM 在 CronQuestions 数据集的复杂问题上的 Hits@1 和 Hits@10 结果绝对提高了 5.1% 和 1.2%。此外，QC-MHM 可以生成可解释且值得信赖的预测。
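
The Hits@1 and Hits@10 figures quoted in the abstract above are ranking metrics. As a minimal sketch, assuming each question comes with a ranked list of predicted answers and a set of gold answers (the function name `hits_at_k` and the toy data are hypothetical, not taken from the paper), Hits@k can be computed as the fraction of questions with at least one gold answer among the top k predictions:

```python
def hits_at_k(ranked_predictions, gold_answers, k):
    """Fraction of questions with a gold answer in the top-k predictions.

    ranked_predictions: list of lists, best prediction first (assumed format).
    gold_answers: list of sets of acceptable answers, one set per question.
    """
    if not gold_answers:
        raise ValueError("need at least one question")
    hits = 0
    for preds, gold in zip(ranked_predictions, gold_answers):
        # A question counts as a hit if any top-k prediction is a gold answer.
        if any(p in gold for p in preds[:k]):
            hits += 1
    return hits / len(gold_answers)


# Toy data, purely illustrative.
ranked = [["2009", "2010"], ["Obama", "Bush"]]
gold = [{"2010"}, {"Obama"}]
print(hits_at_k(ranked, gold, k=1))
print(hits_at_k(ranked, gold, k=2))
```

An "absolute" improvement of 5.1% in Hits@1 then means the metric itself rises by 0.051, not that it rises by 5.1% of its previous value.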

Title: How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena

Authors: Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13208
Pdf URL: https://arxiv.org/pdf/2402.13208
Copy Paste: [[2402.13208]] How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena(https://arxiv.org/abs/2402.13208)
Keywords: language model
Abstract: The attention mechanism, a cornerstone of state-of-the-art neural models, faces computational hurdles in processing long sequences due to its quadratic complexity. Consequently, research efforts in the last few years focused on finding more efficient alternatives. Among them, Hyena (Poli et al., 2023) stands out for achieving competitive results in both language modeling and image classification, while offering sub-quadratic memory and computational complexity. Building on these promising results, we propose ConfHyena, a Conformer whose encoder self-attentions are replaced with an adaptation of Hyena for speech processing, where the long input sequences cause high computational costs. Through experiments in automatic speech recognition (for English) and translation (from English into 8 target languages), we show that our best ConfHyena model significantly reduces the training time by 27%, at the cost of minimal quality degradation (~1%), which, in most cases, is not statistically significant.
摘要：注意力机制是最先进的神经模型的基石，由于其二次复杂性，在处理长序列时面临计算障碍。因此，过去几年的研究工作集中在寻找更有效的替代方案。其中，Hyena (Poli et al., 2023) 因在语言建模和图像分类方面取得了有竞争力的结果而脱颖而出，同时提供了次二次内存和计算复杂性。基于这些有希望的结果，我们提出了 ConfHyena，这是一种 Conformer，其编码器自我注意力被用于语音处理的 Hyena 的适应所取代，其中长输入序列导致较高的计算成本。通过自动语音识别（英语）和翻译（从英语到 8 种目标语言）的实验，我们表明我们最好的 ConfHyena 模型将训练时间显着减少了 27%，而代价是质量下降最小（~1%），在大多数情况下，这在统计上并不显着。

Title: Bayesian Reward Models for LLM Alignment

Authors: Adam X. Yang, Maxime Robeyns, Thomas Coste, Jun Wang, Haitham Bou-Ammar, Laurence Aitchison
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.13210
Pdf URL: https://arxiv.org/pdf/2402.13210
Copy Paste: [[2402.13210]] Bayesian Reward Models for LLM Alignment(https://arxiv.org/abs/2402.13210)
Keywords: language model, llm, prompt
Abstract: To ensure that large language model (LLM) responses are helpful and non-toxic, we usually fine-tune a reward model on human preference data. We then select policy responses with high rewards (best-of-n sampling) or further optimize the policy to produce responses with high rewards (reinforcement learning from human feedback). However, this process is vulnerable to reward overoptimization or hacking, in which the responses selected have high rewards due to errors in the reward model rather than a genuine preference. This is especially problematic as the prompt or response diverges from the training data. It should be possible to mitigate these issues by training a Bayesian reward model, which signals higher uncertainty further from the training data distribution. Therefore, we trained Bayesian reward models using Laplace-LoRA (Yang et al., 2024) and found that the resulting uncertainty estimates can successfully mitigate reward overoptimization in best-of-n sampling.
摘要：为了确保大语言模型（LLM）响应是有用且无毒的，我们通常会根据人类偏好数据微调奖励模型。然后，我们选择具有高奖励的政策响应（n最佳抽样）或进一步优化政策以产生具有高奖励的响应（根据人类反馈进行强化学习）。然而，这个过程很容易受到奖励过度优化或黑客攻击，其中选择的响应由于奖励模型中的错误而不是真正的偏好而具有高奖励。当提示或响应与训练数据不同时，这尤其成问题。通过训练贝叶斯奖励模型应该可以缓解这些问题，该模型从训练数据分布中进一步发出更高的不确定性信号。因此，我们使用 Laplace-LoRA (Yang et al., 2024) 训练贝叶斯奖励模型，并发现由此产生的不确定性估计可以成功地缓解 best-of-n 采样中的奖励过度优化。
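
The abstract above does not say how the uncertainty estimates enter best-of-n selection. One plausible reading, sketched here under that assumption, is to rank candidates by a pessimistic score, the reward mean minus a multiple of the reward standard deviation. The function name `best_of_n`, the penalty weight `k`, and the toy numbers are all hypothetical.

```python
def best_of_n(candidates, k=1.0):
    """Pick the response with the highest uncertainty-penalized reward.

    candidates: list of (response, reward_mean, reward_std) tuples, where
    mean and std are assumed to come from a Bayesian reward model.
    k: penalty weight (hypothetical knob); k=0 recovers plain best-of-n.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: c[1] - k * c[2])[0]


# Toy data: "a" has the higher mean reward but is far more uncertain,
# as might happen for a response far from the reward model's training data.
candidates = [("a", 2.0, 1.5), ("b", 1.5, 0.2)]
print(best_of_n(candidates, k=0.0))  # plain best-of-n picks "a"
print(best_of_n(candidates, k=1.0))  # penalized selection picks "b"
```

The design intent is that responses whose high reward is likely a reward-model error tend to also have high uncertainty, so the penalty pushes selection back toward responses the reward model can score reliably.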

Title: Can Large Language Models be Good Emotional Supporter? Mitigating Preference Bias on Emotional Support Conversation

Authors: Dongjin Kang, Sunghwan Kim, Taeyoon Kwon, Seungjun Moon, Hyunsouk Cho, Youngjae Yu, Dongha Lee, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13211
Pdf URL: https://arxiv.org/pdf/2402.13211
Copy Paste: [[2402.13211]] Can Large Language Models be Good Emotional Supporter? Mitigating Preference Bias on Emotional Support Conversation(https://arxiv.org/abs/2402.13211)
Keywords: language model, llm
Abstract: Emotional Support Conversation (ESC) is a task aimed at alleviating individuals' emotional distress through daily conversation. Given the task's inherent complexity and non-intuitive nature, the ESConv dataset incorporates support strategies to facilitate the generation of appropriate responses. Despite the remarkable conversational ability of large language models (LLMs), previous studies have suggested that they often struggle to provide useful emotional support. Hence, this work first analyzes the results of LLMs on ESConv, revealing challenges in selecting the correct strategy and a notable preference for a specific strategy. Motivated by these observations, we explore the impact of the inherent preference in LLMs on providing emotional support, and we observe that exhibiting a high preference for specific strategies hinders effective emotional support and degrades robustness in predicting the appropriate strategy. Moreover, we conduct a methodological study to offer insights into the approaches LLMs need in order to serve as proficient emotional supporters. Our findings emphasize that (1) low preference for specific strategies hinders the progress of emotional support, (2) external assistance helps reduce preference bias, and (3) LLMs alone cannot become good emotional supporters. These insights suggest promising avenues for future research to enhance the emotional intelligence of LLMs.
摘要：情感支持对话（ESC）是一项旨在通过日常对话缓解个人情绪困扰的任务。鉴于其固有的复杂性和非直观性，ESConv 数据集包含支持策略以促进生成适当的响应。最近，尽管大型语言模型（LLM）具有出色的会话能力，但之前的研究表明，它们常常难以提供有用的情感支持。因此，这项工作首先分析了 ESConv 上法学硕士的结果，揭示了选择正确策略的挑战以及对特定策略的显着偏好。受这些启发，我们探索了法学硕士固有偏好对提供情感支持的影响，因此，我们观察到对特定策略表现出高度偏好会阻碍有效的情感支持，从而加剧其预测适当策略的鲁棒性。此外，我们还进行了一项方法学研究，以深入了解法学硕士作为熟练的情感支持者的必要方法。我们的研究结果强调，（1）对特定策略的低偏好阻碍了情感支持的进展，（2）外部援助有助于减少偏好偏差，（3）法学硕士本身无法成为良好的情感支持者。这些见解为未来提高法学硕士情商的研究提供了有希望的途径。

Title: Soft Self-Consistency Improves Language Model Agents

Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.13212
Pdf URL: https://arxiv.org/pdf/2402.13212
Copy Paste: [[2402.13212]] Soft Self-Consistency Improves Language Model Agents(https://arxiv.org/abs/2402.13212)
Keywords: language model, llm, agent
Abstract: Generations from large language models (LLMs) can be improved by sampling and scoring multiple solutions to select a final answer. Current "sample and select" methods such as self-consistency (SC) rely on majority voting to score answers. However, when tasks have many distinct and valid answers, selection by voting requires a large number of samples. This makes SC prohibitively expensive for interactive tasks that involve generating multiple actions (answers) sequentially. After establishing that majority voting fails to provide consistent gains on such tasks, we demonstrate how to increase success rates by softening the scoring criterion. We introduce Soft Self-Consistency (Soft-SC), which replaces SC's discontinuous scoring with a continuous score computed from model likelihoods, allowing for selection even when actions are sparsely distributed. Soft-SC improves both performance and efficiency on long-horizon interactive tasks, requiring half as many samples as SC for comparable or better performance. For a fixed number of samples, Soft-SC leads to a 1.3% increase over SC in absolute success rate on writing bash programs, a 6.6% increase on online shopping (WebShop), and a 4.7% increase for an interactive household game (ALFWorld). Finally, we show that Soft-SC can be applied to both open-source and black-box models.
摘要：通过对多个解决方案进行采样和评分以选择最终答案，可以改进大型语言模型 (LLM) 的生成。目前的“抽样和选择”方法，例如自我一致性（SC），依靠多数投票来获得答案。然而，当任务有许多不同且有效的答案时，通过投票进行选择需要大量样本。这使得 SC 对于涉及顺序生成多个操作（答案）的交互式任务来说过于昂贵。在确定多数投票无法在此类任务上提供一致的收益后，我们演示了如何通过软化评分标准来提高成功率。我们引入了软自一致性（Soft-SC），它将 SC 的不连续评分替换为根据模型似然计算的连续评分，即使在动作分布稀疏的情况下也允许进行选择。 Soft-SC 提高了长期交互任务的性能和效率，需要的样本数量是 SC 的一半，才能获得可比或更好的性能。对于固定数量的样本，Soft-SC 使得编写 bash 程序的绝对成功率比 SC 提高 1.3%，在线购物（WebShop）提高 6.6%，互动家庭游戏（ALFWorld）提高 4.7% ）。最后，我们证明 Soft-SC 可以应用于开源模型和黑盒模型。
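
The contrast between majority voting and likelihood-based scoring described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each sampled action comes with per-token log-probabilities, and it aggregates them by the mean token probability, which is only one possible choice. The names `majority_vote` and `soft_sc_select` and the toy data are hypothetical.

```python
import math
from collections import Counter


def majority_vote(actions):
    """Self-consistency style selection: the most frequent sampled action."""
    return Counter(actions).most_common(1)[0][0]


def soft_sc_select(actions, token_logprobs):
    """Select the action with the highest continuous likelihood score.

    token_logprobs[i] holds the per-token log-probabilities of actions[i].
    Scoring by mean token probability is an assumption of this sketch.
    """
    def score(logprobs):
        return sum(math.exp(lp) for lp in logprobs) / len(logprobs)

    scores = [score(lps) for lps in token_logprobs]
    best = max(range(len(actions)), key=scores.__getitem__)
    return actions[best], scores[best]


# Three distinct samples: every action gets exactly one vote, so voting
# cannot separate them, while the likelihood score still can.
actions = ["click[back]", "click[buy]", "search[red shoes]"]
token_logprobs = [[-1.5, -0.9], [-0.1, -0.2], [-0.7, -0.4, -0.6]]
print(majority_vote(actions))
print(soft_sc_select(actions, token_logprobs)[0])
```

When sampled actions rarely coincide, vote counts are all one and the voted answer is effectively arbitrary, which is the sparse-action case the abstract describes.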

Title: Softmax Probabilities (Mostly) Predict Large Language Model Correctness on Multiple-Choice Q&A

Authors: Benjamin Plaut, Khanh Nguyen, Tu Trinh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.13213
Pdf URL: https://arxiv.org/pdf/2402.13213
Copy Paste: [[2402.13213]] Softmax Probabilities (Mostly) Predict Large Language Model Correctness on Multiple-Choice Q&A(https://arxiv.org/abs/2402.13213)
Keywords: language model, llm
Abstract: Although large language models (LLMs) perform impressively on many tasks, overconfidence remains a problem. We hypothesized that on multiple-choice Q&A tasks, wrong answers would be associated with smaller maximum softmax probabilities (MSPs) compared to correct answers. We comprehensively evaluate this hypothesis on ten open-source LLMs and five datasets, and find strong evidence for our hypothesis among models which perform well on the original Q&A task. For the six LLMs with the best Q&A performance, the AUROC derived from the MSP was better than random chance with p < 10^{-4} in 59/60 instances. Among those six LLMs, the average AUROC ranged from 60% to 69%. Leveraging these findings, we propose a multiple-choice Q&A task with an option to abstain and show that performance can be improved by selectively abstaining based on the MSP of the initial model response. We also run the same experiments with pre-softmax logits instead of softmax probabilities and find similar (but not identical) results.
摘要：尽管大型语言模型 (LLM) 在许多任务上表现出色，但过度自信仍然是一个问题。我们假设，在多项选择问答任务中，与正确答案相比，错误答案与较小的最大 softmax 概率 (MSP) 相关。我们在十个开源法学硕士和五个数据集上全面评估了这一假设，并在原始问答任务中表现良好的模型中找到了支持我们假设的有力证据。对于具有最佳问答性能的六个法学硕士，从 MSP 得出的 AUROC 优于随机机会，在 59/60 的实例中 p < 10^{-4}。在这六名法学硕士中，平均 AUROC 范围在 60% 到 69% 之间。利用这些发现，我们提出了一个带有弃权选项的多项选择问答任务，并表明根据初始模型响应的 MSP 有选择地弃权可以提高性能。我们还使用 pre-softmax logits 而不是 softmax 概率进行相同的实验，并发现类似（但不相同）的结果。
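
The quantities named in the abstract above can be made concrete with a small sketch. It assumes access to the model's logits over the answer options; the function names, the threshold, and the toy numbers are hypothetical. The maximum softmax probability (MSP) is the largest entry of the softmax, AUROC measures how well MSP separates correct from wrong answers, and abstention declines to answer when MSP falls below a threshold.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp(logits):
    """Maximum softmax probability of one multiple-choice prediction."""
    return max(softmax(logits))


def auroc(scores, correct):
    """Probability that a correct answer has a higher score than a wrong one.

    Ties count as half. 0.5 corresponds to random chance.
    """
    pos = [s for s, y in zip(scores, correct) if y]
    neg = [s for s, y in zip(scores, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def answer_or_abstain(logits, threshold):
    """Return the index of the chosen option, or None to abstain."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None


# Toy check: a confident prediction is answered, a flat one is abstained on.
print(answer_or_abstain([5.0, 0.0, 0.0, 0.0], threshold=0.5))
print(answer_or_abstain([0.1, 0.0, 0.0, 0.0], threshold=0.5))
```

The threshold would in practice be tuned on held-out data; the value 0.5 here is only for illustration.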

Title: RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian

Authors: Adrian Cosma, Bogdan Iordache, Paolo Rosso
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13222
Pdf URL: https://arxiv.org/pdf/2402.13222
Copy Paste: [[2402.13222]] RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian(https://arxiv.org/abs/2402.13222)
Keywords: language model, llm, prompt
Abstract: Recently, large language models (LLMs) have become increasingly powerful and capable of solving a plethora of tasks through proper instructions in natural language. However, the vast majority of testing suites assume that the instructions are written in English, the de facto prompting language. Code intelligence and problem solving remain difficult tasks, even for the most advanced LLMs. Currently, there are no datasets to measure the generalization power of code-generation models in a language other than English. In this work, we present RoCode, a competitive programming dataset consisting of 2,642 problems written in Romanian, 11k solutions in C, C++ and Python, and comprehensive testing suites for each problem. The purpose of RoCode is to provide a benchmark for evaluating the code intelligence of language models trained on Romanian / multilingual text, as well as a fine-tuning set for pretrained Romanian models. Through our results and review of related works, we argue for the need to develop code models for languages other than English.
摘要：最近，大型语言模型（LLM）变得越来越强大，并且能够通过自然语言的正确指令解决大量任务。然而，绝大多数测试套件都假设说明是用英语编写的，这是事实上的提示语言。即使对于最先进的法学硕士来说，代码智能和问题解决仍然是一项艰巨的任务。目前，没有数据集可以衡量英语以外语言的代码生成模型的泛化能力。在这项工作中，我们提出了 RoCode，一个有竞争力的编程数据集，包含用罗马尼亚语编写的 2,642 个问题、用 C、C++ 和 Python 编写的 11k 个解决方案以及每个问题的综合测试套件。 RoCode 的目的是为评估罗马尼亚语/多语言文本训练的语言模型的代码智能提供基准，并为预训练的罗马尼亚语模型提供微调集。通过我们的结果和对相关工作的回顾，我们认为需要为英语以外的语言开发代码模型。

Title: AgentMD: Empowering Language Agents for Risk Prediction with Large-Scale Clinical Tool Learning

Authors: Qiao Jin, Zhizheng Wang, Yifan Yang, Qingqing Zhu, Donald Wright, Thomas Huang, W John Wilbur, Zhe He, Andrew Taylor, Qingyu Chen, Zhiyong Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13225
Pdf URL: https://arxiv.org/pdf/2402.13225
Copy Paste: [[2402.13225]] AgentMD: Empowering Language Agents for Risk Prediction with Large-Scale Clinical Tool Learning(https://arxiv.org/abs/2402.13225)
Keywords: language model, gpt, prompt, chain-of-thought, agent
Abstract: Clinical calculators play a vital role in healthcare by offering accurate evidence-based predictions for various purposes such as prognosis. Nevertheless, their widespread utilization is frequently hindered by usability challenges, poor dissemination, and restricted functionality. Augmenting large language models with extensive collections of clinical calculators presents an opportunity to overcome these obstacles and improve workflow efficiency, but the scalability of the manual curation process poses a significant challenge. In response, we introduce AgentMD, a novel language agent capable of curating and applying clinical calculators across various clinical contexts. Using the published literature, AgentMD has automatically curated a collection of 2,164 diverse clinical calculators with executable functions and structured documentation, collectively named RiskCalcs. Manual evaluations show that RiskCalcs tools achieve an accuracy of over 80% on three quality metrics. At inference time, AgentMD can automatically select and apply the relevant RiskCalcs tools given any patient description. On the newly established RiskQA benchmark, AgentMD significantly outperforms chain-of-thought prompting with GPT-4 (87.7% vs. 40.9% in accuracy). We also applied AgentMD to real-world clinical notes to analyze both population-level and risk-level patient characteristics. In summary, our study illustrates the utility of language agents augmented with clinical calculators for healthcare analytics and patient care.
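Summary: Clinical calculators play a vital role in healthcare by providing accurate, evidence-based predictions for purposes such as prognosis. However, their widespread use is often hindered by usability challenges, poor dissemination, and limited functionality. Augmenting large language models with large collections of clinical calculators offers a way to overcome these obstacles and improve workflow efficiency, but the scalability of the manual curation process poses a significant challenge. In response, we introduce AgentMD, a novel language agent that can curate and apply clinical calculators across a variety of clinical contexts. Using the published literature, AgentMD automatically curated 2,164 diverse clinical calculators with executable functions and structured documentation, collectively named RiskCalcs. Manual evaluation shows that RiskCalcs tools achieve over 80% accuracy on three quality metrics. At inference time, AgentMD can automatically select and apply the relevant RiskCalcs tools given any patient description. On the newly established RiskQA benchmark, AgentMD significantly outperforms chain-of-thought prompting with GPT-4 (87.7% vs. 40.9% accuracy). We also applied AgentMD to real-world clinical notes to analyze population-level and risk-level patient characteristics. In summary, our study illustrates the utility of language agents augmented with clinical calculators for healthcare analytics and patient care.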

Title: Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Authors: Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.13228
Pdf URL: https://arxiv.org/pdf/2402.13228
Copy Paste: [[2402.13228]] Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive(https://arxiv.org/abs/2402.13228)
Keywords: language model, llm
Abstract: Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we also find that DPOP significantly outperforms DPO across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. By fine-tuning with DPOP, we create and release Smaug-34B and Smaug-72B, which achieve state-of-the-art open-source performance. Notably, Smaug-72B is nearly 2% better than any other open-source model on the HuggingFace Open LLM Leaderboard and becomes the first open-source LLM to surpass an average accuracy of 80%.
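Summary: Direct Preference Optimisation (DPO) can significantly improve the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of choosing one response over another. In this work, we first show theoretically that the standard DPO loss can reduce the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially those in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure that avoids this failure mode. Surprisingly, we also find that DPOP significantly outperforms DPO across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. By fine-tuning with DPOP, we create and release Smaug-34B and Smaug-72B, which achieve state-of-the-art open-source performance. Notably, Smaug-72B is nearly 2% better than any other open-source model on the HuggingFace Open LLM Leaderboard and is the first open-source LLM to surpass an average accuracy of 80%.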
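
The failure mode described in this entry can be reproduced with a few numbers. The sketch below implements the standard per-pair DPO loss, -log sigmoid(beta * margin), where the margin is the policy-versus-reference log-ratio of the preferred completion minus that of the dispreferred one. It then shows a case where the loss falls even though the preferred completion becomes less likely. The log-probabilities and beta = 0.1 are hypothetical values chosen for illustration, and the DPOP loss itself is not reproduced here because the abstract does not state it.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_w / logp_l: policy log-probabilities of the preferred and
    dispreferred completions; ref_*: the same under the reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical reference log-probabilities for one preference pair.
ref_w, ref_l = -10.0, -10.0

# Before training: the policy equals the reference, so the margin is zero.
loss_before = dpo_loss(-10.0, -10.0, ref_w, ref_l)

# After training: the preferred completion became LESS likely (-12 < -10),
# but the dispreferred one dropped further, so the margin still grew.
loss_after = dpo_loss(-12.0, -20.0, ref_w, ref_l)

print("loss before:", loss_before)
print("loss after: ", loss_after)
print("loss fell while preferred log-prob fell:", loss_after < loss_before)
```

Because the loss depends only on the difference of the two log-ratios, any update that pushes the dispreferred completion down faster than the preferred one is rewarded, which is the behavior the entry says DPOP is designed to avoid.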

Title: Investigating Cultural Alignment of Large Language Models

Authors: Badr AlKhamissi, Muhammad ElNokrashy, Mai AlKhamissi, Mona Diab
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2402.13231
Pdf URL: https://arxiv.org/pdf/2402.13231
Copy Paste: [[2402.13231]] Investigating Cultural Alignment of Large Language Models(https://arxiv.org/abs/2402.13231)
Keywords: language model, llm, prompt
Abstract: The intricate relationship between language and culture has long been a subject of exploration within the realm of linguistic anthropology. Large Language Models (LLMs), promoted as repositories of collective human knowledge, raise a pivotal question: do these models genuinely encapsulate the diverse knowledge adopted by different cultures? Our study reveals that these models demonstrate greater cultural alignment along two dimensions -- firstly, when prompted with the dominant language of a specific culture, and secondly, when pretrained with a refined mixture of languages employed by that culture. We quantify cultural alignment by simulating sociological surveys, comparing model responses to those of actual survey participants as references. Specifically, we replicate a survey conducted in various regions of Egypt and the United States through prompting LLMs with different pretraining data mixtures in both Arabic and English with the personas of the real respondents and the survey questions. Further analysis reveals that misalignment becomes more pronounced for underrepresented personas and for culturally sensitive topics, such as those probing social values. Finally, we introduce Anthropological Prompting, a novel method leveraging anthropological reasoning to enhance cultural alignment. Our study emphasizes the necessity for a more balanced multilingual pretraining dataset to better represent the diversity of human experience and the plurality of different cultures with many implications on the topic of cross-lingual transfer.
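Summary: The intricate relationship between language and culture has long been a subject of exploration in linguistic anthropology. Large language models (LLMs), promoted as repositories of collective human knowledge, raise a pivotal question: do these models genuinely encapsulate the diverse knowledge adopted by different cultures? Our study shows that these models exhibit greater cultural alignment along two dimensions: first, when prompted in the dominant language of a specific culture, and second, when pretrained on a refined mixture of the languages used by that culture. We quantify cultural alignment by simulating sociological surveys and comparing model responses against those of actual survey participants as references. Specifically, we replicate a survey conducted in various regions of Egypt and the United States by prompting LLMs with different Arabic and English pretraining data mixtures, using the personas of the real respondents and the survey questions. Further analysis shows that misalignment becomes more pronounced for underrepresented personas and for culturally sensitive topics, such as those probing social values. Finally, we introduce Anthropological Prompting, a novel method that uses anthropological reasoning to enhance cultural alignment. Our study emphasizes the need for more balanced multilingual pretraining datasets that better represent the diversity of human experience and the plurality of cultures, with many implications for cross-lingual transfer.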

Title: TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Authors: Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, Yu'an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, Kathleen McKeown
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.13249
Pdf URL: https://arxiv.org/pdf/2402.13249
Copy Paste: [[2402.13249]] TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization(https://arxiv.org/abs/2402.13249)
Keywords: gpt, llm, hallucination
Abstract: Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.
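Summary: In recent years, single-document news summarization has made substantial progress on faithfulness, driven by research on evaluating factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark for topic-focused dialogue summarization, with summaries generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries, along with detailed explanations of the factually inconsistent sentences. Our analysis shows that existing LLMs produce a significant number of factual errors in the dialogue domain, regardless of model size. On the other hand, when LLMs, including GPT-4, serve as binary factuality evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we analyze hallucination types using a curated error taxonomy. We find that model-generated summaries contain diverse errors and error distributions, and that non-LLM-based metrics capture all error types better than LLM-based evaluators.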

Title: BiMediX: Bilingual Medical Mixture of Experts LLM

Authors: Sara Pieri, Sahal Shaji Mullappilly, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.13253
Pdf URL: https://arxiv.org/pdf/2402.13253
Copy Paste: [[2402.13253]] BiMediX: Bilingual Medical Mixture of Experts LLM(https://arxiv.org/abs/2402.13253)
Keywords: llm, chat
Abstract: In this paper, we introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details such as patient symptoms and medical history, multiple-choice question answering, and open-ended question answering. We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations. We also introduce a comprehensive evaluation benchmark for Arabic medical LLMs. Furthermore, we introduce BiMed1.3M, an extensive Arabic-English bilingual instruction set covering 1.3 million diverse medical interactions, resulting in over 632 million healthcare-specialized tokens for instruction tuning. Our BiMed1.3M dataset includes 250k synthesized multi-turn doctor-patient chats and maintains a 1:2 Arabic-to-English ratio. Our model outperforms state-of-the-art Med42 and Meditron by average absolute gains of 2.5% and 4.1%, respectively, computed across multiple medical evaluation benchmarks in English, while running inference 8 times faster. Moreover, our BiMediX outperforms the generic Arabic-English bilingual LLM, Jais-30B, by average absolute gains of 10% on our Arabic medical benchmark and 15% on bilingual evaluations across multiple datasets. Our project page with source code and trained model is available at https://github.com/mbzuai-oryx/BiMediX .
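Summary: In this paper, we introduce BiMediX, the first bilingual medical mixture-of-experts LLM, designed for seamless interaction in both English and Arabic. Our model supports a wide range of medical interactions in English and Arabic, including multi-turn chats that ask for further details such as patient symptoms and medical history, multiple-choice question answering, and open-ended question answering. We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations. We also introduce a comprehensive evaluation benchmark for Arabic medical LLMs. In addition, we introduce BiMed1.3M, an extensive Arabic-English bilingual instruction set covering 1.3 million diverse medical interactions, yielding over 632 million healthcare-specialized tokens for instruction tuning. The BiMed1.3M dataset contains 250k synthesized multi-turn doctor-patient chats and maintains a 1:2 Arabic-to-English ratio. Our model outperforms the state-of-the-art Med42 and Meditron by average absolute gains of 2.5% and 4.1%, respectively, computed across multiple English medical evaluation benchmarks, while running inference 8 times faster. Moreover, BiMediX outperforms the generic Arabic-English bilingual LLM Jais-30B by average absolute gains of 10% on our Arabic medical benchmark and 15% on bilingual evaluations across multiple datasets. Our project page with source code and trained model is available at https://github.com/mbzuai-oryx/BiMediX .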