2024-06-18

Title: QCQA: Quality and Capacity-aware grouped Query Attention

Authors: Vinay Joshi, Prashant Laddha, Shambhavi Sinha, Om Ji Omer, Sreenivas Subramoney
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10247
Pdf URL: https://arxiv.org/pdf/2406.10247
Copy Paste: [[2406.10247]] QCQA: Quality and Capacity-aware grouped Query Attention(https://arxiv.org/abs/2406.10247)
Keywords: language model, llm
Abstract: Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs), restricting both the speed and length of text generation. Approaches such as Multi-Query Attention (MQA) and Grouped Query Attention (GQA) mitigate these challenges by grouping query heads and consequently reducing the number of corresponding key and value heads. However, MQA and GQA decrease the KV-cache size requirements at the expense of LLM accuracy (quality of text generation). These methods do not ensure an optimal tradeoff between KV-cache size and text generation quality due to the absence of quality-aware grouping of query heads. To address this issue, we propose Quality and Capacity-Aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a computationally efficient and inexpensive fitness function. We demonstrate that QCQA achieves a significantly better tradeoff between KV-cache capacity and LLM accuracy compared to GQA. For the Llama2 7B model, QCQA achieves 20% higher accuracy than GQA with similar KV-cache size requirements in the absence of fine-tuning. After fine-tuning both QCQA and GQA, for a similar KV-cache size, QCQA provides 10.55% higher accuracy than GQA. Furthermore, QCQA requires 40% less KV-cache size than GQA to attain similar accuracy. The proposed quality and capacity-aware grouping of query heads can serve as a new paradigm for KV-cache optimization in autoregressive LLM inference.
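
To make the capacity side of this tradeoff concrete, the sketch below estimates the KV-cache size as a function of how query heads are grouped. It is a minimal illustration, not the paper's method: the helper names are hypothetical, the same grouping is assumed for every layer, and the uneven grouping is invented for the example. The shape values (32 layers, 32 query heads, head dimension 128, FP16) follow the public Llama 2 7B configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one batch of sequences."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def num_kv_heads(grouping):
    """grouping[i] is the group id of query head i; each group shares one K/V head."""
    return len(set(grouping))


# Llama-2-7B-like shape: 32 layers, 32 query heads, head dimension 128, FP16.
LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

mha = list(range(Q_HEADS))               # every query head keeps its own K/V head
gqa = [h // 4 for h in range(Q_HEADS)]   # uniform groups of 4 query heads
# Hypothetical non-uniform grouping (illustrative only): 8 groups of unequal size.
uneven_sizes = [1, 1, 2, 2, 4, 6, 8, 8]
uneven = [g for g, size in enumerate(uneven_sizes) for _ in range(size)]

for name, grouping in [("MHA", mha), ("GQA", gqa), ("uneven", uneven)]:
    heads = num_kv_heads(grouping)
    size = kv_cache_bytes(LAYERS, heads, HEAD_DIM, SEQ_LEN)
    print(f"{name}: {heads} K/V heads, {size / 2**30:.2f} GiB")
```

The cache size depends only on the number of groups, not on which query heads share a group. That is why a quality-aware choice of grouping can change accuracy at a fixed memory budget.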

Title: On the Worst Prompt Performance of Large Language Models

Authors: Bowen Cao, Deng Cai, Zhisong Zhang, Yuexian Zou, Wai Lam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10248
Pdf URL: https://arxiv.org/pdf/2406.10248
Copy Paste: [[2406.10248]] On the Worst Prompt Performance of Large Language Models(https://arxiv.org/abs/2406.10248)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts, which raises significant concerns about their reliability in real-world scenarios. Existing studies often divide prompts into task-level instructions and case-level inputs and primarily focus on evaluating and improving robustness against variations in task-level instructions. However, this setup fails to fully address the diversity of real-world user queries and assumes the existence of task-specific datasets. To address these limitations, we introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries and emphasizes the importance of using the worst prompt performance to gauge the lower bound of model performance. Extensive experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance; for instance, a difference of 45.48% between the worst and best performance for the Llama-2-70B-chat model, with its worst performance dipping as low as 9.38%. We further illustrate the difficulty in identifying the worst prompt from both model-agnostic and model-dependent perspectives, emphasizing the absence of a shortcut to characterize the worst prompt. We also attempt to enhance the worst prompt performance using existing prompt engineering and prompt consistency methods, but find that their impact is limited. These findings underscore the need to create more resilient LLMs that can maintain high performance across diverse prompts.
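
One plausible way to compute a worst-prompt metric is sketched below: for each case, take the minimum score over its semantically equivalent paraphrases, then average over cases. The aggregation rule, the function name, and the toy scores are assumptions for illustration and may differ from the benchmark's exact definition.

```python
from statistics import mean


def prompt_robustness(scores):
    """scores[case] holds one score per paraphrase of the same query."""
    per_case = list(scores.values())
    return {
        # Lower bound: the model is judged on its weakest paraphrase per case.
        "worst": mean(min(s) for s in per_case),
        # Upper bound: the model is judged on its strongest paraphrase per case.
        "best": mean(max(s) for s in per_case),
        "average": mean(mean(s) for s in per_case),
    }


# Hypothetical scores (1.0 = response judged a win, 0.0 = a loss).
toy_scores = {
    "case_a": [1.0, 0.0, 1.0],
    "case_b": [1.0, 1.0, 1.0],
    "case_c": [0.0, 0.0, 1.0],
}
report = prompt_robustness(toy_scores)
print(report)
print("best minus worst:", report["best"] - report["worst"])
```

A large gap between `best` and `worst` with a healthy `average` is the pattern the abstract describes: average-case numbers can hide cases where a single rephrasing makes the model fail.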

Title: The Impact of Quantization on Retrieval-Augmented Generation: An Analysis of Small LLMs

Authors: Mert Yazan, Suzan Verberne, Frederik Situmeang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10251
Pdf URL: https://arxiv.org/pdf/2406.10251
Copy Paste: [[2406.10251]] The Impact of Quantization on Retrieval-Augmented Generation: An Analysis of Small LLMs(https://arxiv.org/abs/2406.10251)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Post-training quantization reduces the computational demand of Large Language Models (LLMs) but can weaken some of their capabilities. Since LLM abilities emerge with scale, smaller LLMs are more sensitive to quantization. In this paper, we explore how quantization affects smaller LLMs' ability to perform retrieval-augmented generation (RAG), specifically in longer contexts. We chose personalization for evaluation because it is a challenging domain to perform using RAG as it requires long-context reasoning over multiple documents. We compare the original FP16 and the quantized INT4 performance of multiple 7B and 8B LLMs on two tasks while progressively increasing the number of retrieved documents to test how quantized models fare against longer contexts. To better understand the effect of retrieval, we evaluate three retrieval models in our experiments. Our findings reveal that if a 7B LLM performs the task well, quantization does not impair its performance and long-context reasoning capabilities. We conclude that it is possible to utilize RAG with quantized smaller LLMs.
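
For readers unfamiliar with what moving from FP16 to INT4 involves, the sketch below shows a generic symmetric round-to-nearest 4-bit quantizer with a single scale per weight group. This is an assumption chosen for illustration; the abstract does not state which quantization scheme the authors used, and production methods are more elaborate.

```python
def quantize_int4(weights):
    """Map floats to integer codes in [-8, 7] using one shared scale."""
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.35, 0.1, 0.0]   # hypothetical weight group
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The rounding error per weight is bounded by half the scale, so groups containing one large outlier lose the most precision on their small weights.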

Title: Towards Signal Processing In Large Language Models

Authors: Prateek Verma, Mert Pilanci
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.10254
Pdf URL: https://arxiv.org/pdf/2406.10254
Copy Paste: [[2406.10254]] Towards Signal Processing In Large Language Models(https://arxiv.org/abs/2406.10254)
Keywords: language model, gpt, llm
Abstract: This paper introduces the idea of applying signal processing inside a Large Language Model (LLM). With the recent explosion of generative AI, our work can help bridge two fields together, namely the field of signal processing and large language models. We draw parallels between classical Fourier-Transforms and Fourier Transform-like learnable time-frequency representations for every intermediate activation signal of an LLM. Once we decompose every activation signal across tokens into a time-frequency representation, we learn how to filter and reconstruct them, with all components learned from scratch, to predict the next token given the previous context. We show that for GPT-like architectures, our work achieves faster convergence and significantly increases performance by adding a minuscule number of extra parameters when trained for the same epochs. We hope this work paves the way for algorithms exploring signal processing inside the signals found in neural architectures like LLMs and beyond.

Title: Explicit Word Density Estimation for Language Modelling

Authors: Jovan Andonov, Octavian Ganea, Paulina Grnarova, Gary Bécigneul, Thomas Hofmann
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10256
Pdf URL: https://arxiv.org/pdf/2406.10256
Copy Paste: [[2406.10256]] Explicit Word Density Estimation for Language Modelling(https://arxiv.org/abs/2406.10256)
Keywords: language model
Abstract: Language Modelling has been a central part of Natural Language Processing for a very long time and in the past few years LSTM-based language models have been the go-to method for commercial language modeling. Recently, it has been shown that when looking at language modelling from a matrix factorization point of view, the final Softmax layer limits the expressiveness of the model, by putting an upper bound on the rank of the resulting matrix. Additionally, a new family of neural networks called NeuralODEs has been introduced as a continuous alternative to Residual Networks. Moreover, it has been shown that there is a connection between these models and Normalizing Flows. In this work we propose a new family of language models based on NeuralODEs and the continuous analogue of Normalizing Flows and manage to improve on some of the baselines.
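
The rank limitation mentioned above can be written out briefly. The notation is an assumption introduced here, not taken from the abstract: $N$ contexts with hidden states $h_c \in \mathbb{R}^d$ stacked into $H \in \mathbb{R}^{N \times d}$, and $M$ vocabulary words with output embeddings $w_x \in \mathbb{R}^d$ stacked into $W \in \mathbb{R}^{M \times d}$.

```latex
\[
  P_\theta(x \mid c)
    = \frac{\exp\left(h_c^\top w_x\right)}{\sum_{x'} \exp\left(h_c^\top w_{x'}\right)},
  \qquad
  A_\theta = H W^\top \in \mathbb{R}^{N \times M},
  \qquad
  \operatorname{rank}(A_\theta) \le d .
\]
```

Because the logit matrix $A_\theta$ is a product of two factors with inner dimension $d$, its rank cannot exceed $d$, however large $N$ and $M$ are. If the matrix of true log-probabilities has higher rank, a single Softmax layer over $d$-dimensional states cannot represent it exactly.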

Title: Flextron: Many-in-One Flexible Large Language Model

Authors: Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10260
Pdf URL: https://arxiv.org/pdf/2406.10260
Copy Paste: [[2406.10260]] Flextron: Many-in-One Flexible Large Language Model(https://arxiv.org/abs/2406.10260)
Keywords: language model, gpt, llm
Abstract: Training modern LLMs is extremely resource intensive, and customizing them for various deployment scenarios characterized by limited compute and memory resources through repeated training is impractical. In this paper, we introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment. The Flextron architecture utilizes a nested elastic structure to rapidly adapt to specific user-defined latency and accuracy targets during inference with no additional fine-tuning required. It is also input-adaptive, and can automatically route tokens through its sub-networks for improved performance and efficiency. We present a sample-efficient training method and associated routing algorithms for systematically transforming an existing trained LLM into a Flextron model. We evaluate Flextron on the GPT-3 and Llama-2 families of LLMs, and demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.

Title: FoodSky: A Food-oriented Large Language Model that Passes the Chef and Dietetic Examination

Authors: Pengfei Zhou, Weiqing Min, Chaoran Fu, Ying Jin, Mingyu Huang, Xiangyang Li, Shuhuan Mei, Shuqiang Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10261
Pdf URL: https://arxiv.org/pdf/2406.10261
Copy Paste: [[2406.10261]] FoodSky: A Food-oriented Large Language Model that Passes the Chef and Dietetic Examination(https://arxiv.org/abs/2406.10261)
Keywords: language model, llm, retrieval augmented generation
Abstract: Food is foundational to human life, serving not only as a source of nourishment but also as a cornerstone of cultural identity and social interaction. As the complexity of global dietary needs and preferences grows, food intelligence is needed to enable food perception and reasoning for various tasks, ranging from recipe generation and dietary recommendation to diet-disease correlation discovery and understanding. Towards this goal, for powerful capabilities across various domains and tasks in Large Language Models (LLMs), we introduce Food-oriented LLM FoodSky to comprehend food data through perception and reasoning. Considering the complexity and typicality of Chinese cuisine, we first construct one comprehensive Chinese food corpus FoodEarth from various authoritative sources, which can be leveraged by FoodSky to achieve deep understanding of food-related data. We then propose Topic-based Selective State Space Model (TS3M) and the Hierarchical Topic Retrieval Augmented Generation (HTRAG) mechanism to enhance FoodSky in capturing fine-grained food semantics and generating context-aware food-relevant text, respectively. Our extensive evaluations demonstrate that FoodSky significantly outperforms general-purpose LLMs in both chef and dietetic examinations, with an accuracy of 67.2% and 66.4% on the Chinese National Chef Exam and the National Dietetic Exam, respectively. FoodSky not only promises to enhance culinary creativity and promote healthier eating patterns, but also sets a new standard for domain-specific LLMs that address complex real-world issues in the food domain. An online demonstration of FoodSky is available at this http URL.

Title: Improving Language Models for Emotion Analysis: Insights from Cognitive Science

Authors: Constant Bonard (UNIBE), Gustave Cortal (LMF, LISN)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10265
Pdf URL: https://arxiv.org/pdf/2406.10265
Copy Paste: [[2406.10265]] Improving Language Models for Emotion Analysis: Insights from Cognitive Science(https://arxiv.org/abs/2406.10265)
Keywords: language model
Abstract: We propose leveraging cognitive science research on emotions and communication to improve language models for emotion analysis. First, we present the main emotion theories in psychology and cognitive science. Then, we introduce the main methods of emotion annotation in natural language processing and their connections to psychological theories. We also present the two main types of analyses of emotional communication in cognitive pragmatics. Finally, based on the cognitive science research presented, we propose directions for improving language models for emotion analysis. We suggest that these research efforts pave the way for constructing new annotation schemes and a possible benchmark for emotional understanding, considering different facets of human emotion and communication.

Title: COVID-19 Twitter Sentiment Classification Using Hybrid Deep Learning Model Based on Grid Search Methodology

Authors: Jitendra Tembhurne, Anant Agrawal, Kirtan Lakhotia
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2406.10266
Pdf URL: https://arxiv.org/pdf/2406.10266
Copy Paste: [[2406.10266]] COVID-19 Twitter Sentiment Classification Using Hybrid Deep Learning Model Based on Grid Search Methodology(https://arxiv.org/abs/2406.10266)
Keywords: prompt
Abstract: In the contemporary era, social media platforms amass an extensive volume of social data contributed by their users. In order to promptly grasp the opinions and emotional inclinations of individuals regarding a product or event, it becomes imperative to perform sentiment analysis on the user-generated content. Microblog comments often encompass both lengthy and concise text entries, presenting a complex scenario. This complexity is particularly pronounced in extensive textual content due to its rich content and intricate word interrelations compared to shorter text entries. Sentiment analysis of public opinion shared on social networking websites such as Facebook or Twitter has evolved and found diverse applications. However, several challenges remain to be tackled in this field. Hybrid methodologies have emerged as promising models for mitigating sentiment analysis errors, particularly when dealing with progressively intricate training data. In this article, to investigate COVID-19 vaccination hesitancy, we propose eight different hybrid deep learning models for sentiment classification with the aim of improving the overall accuracy of the model. Sentiment prediction is achieved using an embedding, a deep learning model, and a grid search (GS) algorithm on a Twitter COVID-19 dataset. According to the study, public sentiment towards COVID-19 immunization appears to be improving with time, as evidenced by the gradual decline in vaccine reluctance. Through extensive evaluation, the proposed model reported an increased accuracy of 98.86%, outperforming other models. Specifically, the combination of BERT, CNN and GS yields the highest accuracy, while the combination of GloVe, BiLSTM, CNN and GS follows closely behind with an accuracy of 98.17%. In addition, the proposed model reports an increase in accuracy in the range of 2.11% to 14.46% in comparison with existing works.

Title: Unused information in token probability distribution of generative LLM: improving LLM reading comprehension through calculation of expected values

Authors: Krystian Zawistowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10267
Pdf URL: https://arxiv.org/pdf/2406.10267
Copy Paste: [[2406.10267]] Unused information in token probability distribution of generative LLM: improving LLM reading comprehension through calculation of expected values(https://arxiv.org/abs/2406.10267)
Keywords: gpt, llm, prompt
Abstract: LLM text decoding is a key component of perceived LLM quality. We present two experiments showing that decoding methods could be improved by manipulating token probabilities. First, we test a few LLMs on the SummEval summary scoring dataset to measure reading comprehension. We compare scores from greedy decoding to expected values over the next-token distribution. We scale logits by a large temperature to increase the entropy of scores. This allows a strong improvement in performance on SummEval (in terms of correlation with human judgement). We see improvement from 6-8% to 13-28% for 7B Mistral and from 20%-46% to 37%-56% for Mixtral, beating the GPT 4 0314 result on two metrics. Part of the gain seems related to positional bias. Second, we use a probability-based tree sampling algorithm to examine all of the most probable generations for a given prompt.
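
The difference between a greedy score and an expected-value score can be sketched as follows, assuming the model rates a summary by emitting one of the tokens "1" to "5". The logits are invented for the example and the helper names are hypothetical; only the idea (temperature-scaled softmax followed by a probability-weighted mean) comes from the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; larger temperatures flatten the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_score(logits, values):
    """Score attached to the single most likely token."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return values[best]


def expected_score(logits, values, temperature=1.0):
    """Probability-weighted mean of the candidate scores."""
    probs = softmax(logits, temperature)
    return sum(p * v for p, v in zip(probs, values))


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


values = [1, 2, 3, 4, 5]
logits = [0.0, 0.5, 2.0, 1.9, 0.2]   # hypothetical logits for tokens "1".."5"

print("greedy:", greedy_score(logits, values))
print("expected (T=1):", expected_score(logits, values))
print("expected (T=10):", expected_score(logits, values, temperature=10.0))
```

Greedy decoding discards the fact that "4" is almost as likely as "3" here, while the expected value keeps that information and yields a continuous score, which is what makes correlation with human ratings more informative.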

Title: Markov Constraint as Large Language Model Surrogate

Authors: Alexandre Bonlarron, Jean-Charles Régin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10269
Pdf URL: https://arxiv.org/pdf/2406.10269
Copy Paste: [[2406.10269]] Markov Constraint as Large Language Model Surrogate(https://arxiv.org/abs/2406.10269)
Keywords: language model, llm
Abstract: This paper presents NgramMarkov, a variant of the Markov constraints. It is dedicated to text generation in constraint programming (CP). It involves a set of n-grams (i.e., sequence of n words) associated with probabilities given by a large language model (LLM). It limits the product of the probabilities of the n-gram of a sentence. The propagator of this constraint can be seen as an extension of the ElementaryMarkov constraint propagator, incorporating the LLM distribution instead of the maximum likelihood estimation of n-grams. It uses a gliding threshold, i.e., it rejects n-grams whose local probabilities are too low, to guarantee balanced solutions. It can also be combined with a "look-ahead" approach to remove n-grams that are very unlikely to lead to acceptable sentences for a fixed-length horizon. This idea is based on the MDDMarkovProcess constraint propagator, but without explicitly using an MDD (Multi-Valued Decision Diagram). The experimental results show that the generated text is valued in a similar way to the LLM perplexity function. Using this new constraint dramatically reduces the number of candidate sentences produced, improves computation times, and allows larger corpora or smaller n-grams to be used. A real-world problem has been solved for the first time using 4-grams instead of 5-grams.
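
The two checks described above, a bound on the product of n-gram probabilities and a local threshold that rejects individual low-probability n-grams, can be sketched as a plain acceptance test. This is one simple reading of the abstract, not the paper's propagator: the function names, the bigram table, and the threshold values are all hypothetical, and a real constraint propagator would prune domains during search rather than test finished sentences.

```python
import math


def ngrams(words, n):
    """All consecutive n-word windows of the sentence."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def accept(words, probs, n, min_sentence_logprob, min_local_prob):
    """Accept a sentence only if every n-gram and the whole product are likely enough."""
    total = 0.0
    for gram in ngrams(words, n):
        p = probs.get(gram, 0.0)
        # Local check: reject as soon as one n-gram is too unlikely (or unseen).
        if p <= 0.0 or p < min_local_prob:
            return False
        total += math.log(p)   # sum of logs equals the log of the product
    # Global check: bound on the product of all n-gram probabilities.
    return total >= min_sentence_logprob


# Hypothetical bigram probabilities standing in for LLM-derived values.
bigram_probs = {
    ("the", "cat"): 0.5,
    ("cat", "sleeps"): 0.4,
    ("cat", "flies"): 0.01,
}

print(accept(["the", "cat", "sleeps"], bigram_probs, 2, math.log(0.1), 0.05))
print(accept(["the", "cat", "flies"], bigram_probs, 2, math.log(0.1), 0.05))
```

Working in log space avoids underflow when many small probabilities are multiplied, and the early return on the local check is what lets unlikely candidates be dropped before the full sentence is scored.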

Title: Beyond Words: On Large Language Models Actionability in Mission-Critical Risk Analysis

Authors: Matteo Esposito, Francesco Palagiano, Valentina Lenarduzzi
Subjects: cs.CL, cs.AI, cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2406.10273
Pdf URL: https://arxiv.org/pdf/2406.10273
Copy Paste: [[2406.10273]] Beyond Words: On Large Language Models Actionability in Mission-Critical Risk Analysis(https://arxiv.org/abs/2406.10273)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: Context. Risk analysis assesses potential risks in specific scenarios. Risk analysis principles are context-less; the same methodology can be applied to a risk connected to health and information technology security. Risk analysis requires a vast knowledge of national and international regulations and standards and is time and effort-intensive. A large language model can summarize information in less time than a human and can be fine-tuned to specific tasks. Aim. Our empirical study aims to investigate the effectiveness of Retrieval-Augmented Generation (RAG) and fine-tuned LLMs in risk analysis. To our knowledge, no prior study has explored their capabilities in risk analysis. Method. We manually curated \totalscenarios unique scenarios leading to \totalsamples representative samples from over 50 mission-critical analyses archived by the industrial context team in the last five years. We compared the base GPT-3.5 and GPT-4 models versus their Retrieval-Augmented Generation and fine-tuned counterparts. We employ two human experts (HEs) as competitors of the models and three other human experts to review the models' and the former human experts' analyses. The reviewers analyzed 5,000 scenario analyses. Results and Conclusions. HEs demonstrated higher accuracy, but LLMs are quicker and more actionable. Moreover, our findings show that RAG-assisted LLMs have the lowest hallucination rates, effectively uncovering hidden risks and complementing human expertise. Thus, the choice of model depends on specific needs, with fine-tuned models (FTMs) for accuracy, RAG for hidden risks discovery, and base models for comprehensiveness and actionability. Therefore, experts can leverage LLMs as an effective complementary companion in risk analysis within a condensed timeframe. They can also save costs by averting unnecessary expenses associated with implementing unwarranted countermeasures.

Title: Prompt-Based Length Controlled Generation with Multiple Control Types

Authors: Renlong Jie, Xiaojun Meng, Lifeng Shang, Xin Jiang, Qun Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10278
Pdf URL: https://arxiv.org/pdf/2406.10278
Copy Paste: [[2406.10278]] Prompt-Based Length Controlled Generation with Multiple Control Types(https://arxiv.org/abs/2406.10278)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have attracted great attention given their strong performance on a wide range of NLP tasks. In practice, users often expect generated texts to fall within a specific length range, making length controlled generation an important topic, especially for GPT-style models. Existing length control methods mostly focus on a simple control type of "equal to" a target length. Different from them, we propose a prompt-based method to achieve length controlled generation under different control types with high accuracy. In particular, we adopt reinforcement learning (RL) and sample filtering with the reward signal given by rule-based reward models, which enhances the length control ability of models by rewarding outputs that follow certain control instructions. In addition, we introduce a standard prompt extractor to parse arbitrary users' input into standard control instructions. Experiments show that our method significantly improves the accuracy of prompt-based length control on popular summarization datasets like CNNDM and NYT under multiple control types. Moreover, both the standard prompt extractor and RL-tuned model show strong generalization to unseen control prompt templates.
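
A rule-based reward for several length-control types might look like the sketch below. The abstract names only the "equal to" type, so the other type names (`less`, `more`, `between`), the penalty shape (negative distance to the allowed range), and the filtering helper are assumptions made for illustration.

```python
def length_reward(length, control, target, upper=None):
    """Return 0.0 when the instruction is satisfied, else minus the violation size."""
    if control == "equal":
        violation = abs(length - target)
    elif control == "less":          # length must not exceed target
        violation = max(0, length - target)
    elif control == "more":          # length must be at least target
        violation = max(0, target - length)
    elif control == "between":       # target <= length <= upper
        if upper is None:
            raise ValueError("'between' needs an upper bound")
        violation = max(0, target - length, length - upper)
    else:
        raise ValueError(f"unknown control type: {control}")
    return -float(violation)


def filter_samples(samples, control, target, upper=None):
    """Sample filtering: keep only outputs whose length satisfies the instruction."""
    return [s for s in samples
            if length_reward(len(s.split()), control, target, upper) == 0.0]


candidates = ["a short one", "a somewhat longer candidate summary here"]
print(length_reward(12, "equal", 10))
print(length_reward(12, "between", 10, 15))
print(filter_samples(candidates, "less", 4))
```

Because the reward is computed by rules rather than a learned model, it is exact for every control type, which makes it suitable both as an RL signal and as a hard filter on sampled outputs.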

Title: Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

Authors: Francisco Eiras, Aleksandar Petrov, Phillip H.S. Torr, M. Pawan Kumar, Adel Bibi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10288
Pdf URL: https://arxiv.org/pdf/2406.10288
Copy Paste: [[2406.10288]] Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models(https://arxiv.org/abs/2406.10288)
Keywords: language model, prompt
Abstract: Fine-tuning large language models on small, high-quality datasets can enhance their performance on specific downstream tasks. Recent research shows that fine-tuning on benign, instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. Although critical, understanding and mitigating safety risks in well-defined tasks remains distinct from the instruction-following context due to structural differences in the data. Our work explores the risks associated with fine-tuning closed models - where providers control how user data is utilized in the process - across diverse task-specific data. We demonstrate how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, showing this is more effective than existing baselines at re-establishing safety alignment while maintaining similar task performance.

Title: VeraCT Scan: Retrieval-Augmented Fake News Detection with Justifiable Reasoning

Authors: Cheng Niu, Yang Guan, Yuanhao Wu, Juno Zhu, Juntong Song, Randy Zhong, Kaihua Zhu, Siliang Xu, Shizhe Diao, Tong Zhang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2406.10289
Pdf URL: https://arxiv.org/pdf/2406.10289
Copy Paste: [[2406.10289]] VeraCT Scan: Retrieval-Augmented Fake News Detection with Justifiable Reasoning(https://arxiv.org/abs/2406.10289)
Keywords: gpt
Abstract: The proliferation of fake news poses a significant threat not only by disseminating misleading information but also by undermining the very foundations of democracy. The recent advance of generative artificial intelligence has further exacerbated the challenge of distinguishing genuine news from fabricated stories. In response to this challenge, we introduce VeraCT Scan, a novel retrieval-augmented system for fake news detection. This system operates by extracting the core facts from a given piece of news and subsequently conducting an internet-wide search to identify corroborating or conflicting reports. The credibility of the sources is then leveraged for information verification. Besides determining the veracity of news, the system also provides transparent evidence and reasoning to support its conclusions, improving the interpretability of and trust in the results. In addition to GPT-4 Turbo, Llama-2 13B is also fine-tuned for news content understanding, information verification, and reasoning. Both implementations have demonstrated state-of-the-art accuracy in the realm of fake news detection.

Title: MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases

Authors: Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10290
Pdf URL: https://arxiv.org/pdf/2406.10290
Copy Paste: [[2406.10290]] MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases(https://arxiv.org/abs/2406.10290)
Keywords: language model, llm
Abstract: The deployment of Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices has gained significant attention due to the benefits of enhanced privacy, stability, and personalization. However, the hardware constraints of mobile devices necessitate the use of models with fewer parameters and model compression techniques like quantization. Currently, there is limited understanding of quantization's impact on various task performances, including LLM tasks, LMM tasks, and, critically, trust and safety. There is a lack of adequate tools for systematically testing these models on mobile devices. To address these gaps, we introduce MobileAIBench, a comprehensive benchmarking framework for evaluating mobile-optimized LLMs and LMMs. MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices. Our two-part open-source framework includes a library for running evaluations on desktops and an iOS app for on-device latency and hardware utilization measurements. Our thorough analysis aims to accelerate mobile AI research and deployment by providing insights into the performance and feasibility of deploying LLMs and LMMs on mobile platforms.

Title: RelevAI-Reviewer: A Benchmark on AI Reviewers for Survey Paper Relevance

Authors: Paulo Henrique Couto, Quang Phuoc Ho, Nageeta Kumari, Benedictus Kent Rachmat (TAU, LISN), Thanh Gia Hieu Khuong (TAU, LISN), Ihsan Ullah, Lisheng Sun-Hosoya (TAU, LISN)
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10294
Pdf URL: https://arxiv.org/pdf/2406.10294
Copy Paste: [[2406.10294]] RelevAI-Reviewer: A Benchmark on AI Reviewers for Survey Paper Relevance(https://arxiv.org/abs/2406.10294)
Keywords: language model, llm, prompt
Abstract: Recent advancements in Artificial Intelligence (AI), particularly the widespread adoption of Large Language Models (LLMs), have significantly enhanced text analysis capabilities. This technological evolution offers considerable promise for automating the review of scientific papers, a task traditionally managed through peer review by fellow researchers. Despite its critical role in maintaining research quality, the conventional peer-review process is often slow and subject to biases, potentially impeding the swift propagation of scientific knowledge. In this paper, we propose RelevAI-Reviewer, an automatic system that conceptualizes the task of survey paper review as a classification problem, aimed at assessing the relevance of a paper in relation to a specified prompt, analogous to a "call for papers". To address this, we introduce a novel dataset comprised of 25,164 instances. Each instance contains one prompt and four candidate papers, each varying in relevance to the prompt. The objective is to develop a machine learning (ML) model capable of determining the relevance of each paper and identifying the most pertinent one. We explore various baseline approaches, including traditional ML classifiers like Support Vector Machine (SVM) and advanced language models such as BERT. Preliminary findings indicate that the BERT-based end-to-end classifier surpasses other conventional ML methods in performance. We present this problem as a public challenge to foster engagement and interest in this area of research.

Title: Robustness of Structured Data Extraction from In-plane Rotated Documents using Multi-Modal Large Language Models (LLM)

Authors: Anjanava Biswas, Wrick Talukdar
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2406.10295
Pdf URL: https://arxiv.org/pdf/2406.10295
Copy Paste: [[2406.10295]] Robustness of Structured Data Extraction from In-plane Rotated Documents using Multi-Modal Large Language Models (LLM)(https://arxiv.org/abs/2406.10295)
Keywords: language model, gpt, llm, hallucination
Abstract: Multi-modal large language models (LLMs) have shown remarkable performance in various natural language processing tasks, including data extraction from documents. However, the accuracy of these models can be significantly affected by document in-plane rotation, also known as skew, a common issue in real-world scenarios for scanned documents. This study investigates the impact of document skew on the data extraction accuracy of three state-of-the-art multi-modal LLMs: Anthropic Claude V3 Sonnet, GPT-4-Turbo, and Llava:v1.6. We focus on extracting specific entities from synthetically generated sample documents with varying degrees of skewness. The results demonstrate that document skew adversely affects the data extraction accuracy of all the tested LLMs, with the severity of the impact varying across models. We identify the safe in-plane rotation angles (SIPRA) for each model and investigate the effects of skew on model hallucinations. Furthermore, we explore existing skew detection and correction mechanisms and discuss their potential limitations. We propose alternative approaches, including developing new multi-modal architectures that are inherently more robust to document skew and incorporating skewing techniques during the pre-training phase of the models. Additionally, we highlight the need for more comprehensive testing on a wider range of document quality and conditions to fully understand the challenges and opportunities associated with using multi-modal LLMs for information extraction in real-world scenarios.

Title: CLST: Cold-Start Mitigation in Knowledge Tracing by Aligning a Generative Language Model as a Students' Knowledge Tracer

Authors: Heeseok Jung, Jaesang Yoo, Yohaan Yoon, Yeonju Jang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10296
Pdf URL: https://arxiv.org/pdf/2406.10296
Copy Paste: [[2406.10296]] CLST: Cold-Start Mitigation in Knowledge Tracing by Aligning a Generative Language Model as a Students' Knowledge Tracer(https://arxiv.org/abs/2406.10296)
Keywords: language model, llm
Abstract: Knowledge tracing (KT), wherein students' problem-solving histories are used to estimate their current levels of knowledge, has attracted significant interest from researchers. However, most existing KT models were developed with an ID-based paradigm, which exhibits limitations in cold-start performance. These limitations can be mitigated by leveraging the vast quantities of external knowledge possessed by generative large language models (LLMs). In this study, we propose CLST (cold-start mitigation in knowledge tracing by aligning a generative language model as a students' knowledge tracer), a framework that utilizes a generative LLM as a knowledge tracer. After collecting data from math, social studies, and science subjects, we framed the KT task as a natural language processing task, wherein problem-solving data are expressed in natural language, and fine-tuned the generative LLM using the formatted KT dataset. Subsequently, we evaluated the performance of CLST in situations of data scarcity using various baseline models for comparison. The results indicate that CLST significantly enhanced performance with a dataset of fewer than 100 students in terms of prediction, reliability, and cross-domain generalization.

Title: SememeLM: A Sememe Knowledge Enhanced Method for Long-tail Relation Representation

Authors: Shuyi Li, Shaojuan Wu, Xiaowang Zhang, Zhiyong Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10297
Pdf URL: https://arxiv.org/pdf/2406.10297
Copy Paste: [[2406.10297]] SememeLM: A Sememe Knowledge Enhanced Method for Long-tail Relation Representation(https://arxiv.org/abs/2406.10297)
Keywords: language model
Abstract: Recognizing relations between two words is a fundamental task with broad applications. Unlike extracting relations from text, identifying relations among words without their contexts is difficult, and it is even more difficult for long-tail relations because of inadequate semantic features. Existing approaches based on language models (LMs) utilize the rich knowledge of LMs to enhance the semantic features of relations. However, they capture common relations while overlooking less frequent but meaningful ones, since the knowledge of LMs relies heavily on training data, which often represents common relations, whereas long-tail relations are often uncommon in training data. Using external knowledge to enrich LMs is appealing but not trivial, because collecting a corpus containing long-tail relations is hardly feasible. In this paper, we propose a sememe knowledge enhanced method (SememeLM) to enhance the representation of long-tail relations, in which sememes can break the contextual constraints between words. First, we present a sememe relation graph and propose a graph encoding method. Moreover, since an external knowledge base may contain a large amount of irrelevant knowledge, noise is introduced. We therefore propose a consistency alignment module, which aligns the introduced knowledge with LMs, reduces the noise, and integrates the knowledge into the language model. Finally, we conduct experiments on word analogy datasets, which evaluate the ability to distinguish subtle differences in relation representations, including long-tail relations. Extensive experiments show that our approach outperforms some state-of-the-art methods.

Title: A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations

Authors: Jinqiang Wang, Huansheng Ning, Yi Peng, Qikai Wei, Daniel Tesfai, Wenwei Mao, Tao Zhu, Runhe Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10303
Pdf URL: https://arxiv.org/pdf/2406.10303
Copy Paste: [[2406.10303]] A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations(https://arxiv.org/abs/2406.10303)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated surprising performance across various natural language processing tasks. Recently, medical LLMs enhanced with domain-specific knowledge have exhibited excellent capabilities in medical consultation and diagnosis. These models can smoothly simulate doctor-patient dialogues and provide professional medical advice. Most medical LLMs are developed through continued training of open-source general LLMs, which requires significantly fewer computational resources than training LLMs from scratch. Additionally, this approach offers better protection of patient privacy compared to API-based solutions. This survey systematically explores how to train medical LLMs based on general LLMs. It covers: (a) how to acquire a training corpus and construct customized medical training sets, (b) how to choose an appropriate training paradigm, (c) how to choose a suitable evaluation benchmark, and (d) existing challenges and promising future research directions. This survey can provide guidance for the development of LLMs focused on various medical applications, such as medical education, diagnostic planning, and clinical assistants.

Title: What is the best model? Application-driven Evaluation for Large Language Models

Authors: Shiguo Lian, Kaikai Zhao, Xinhui Liu, Xuejiao Lei, Bikun Yang, Wenjing Zhang, Kai Wang, Zhaoxiang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10307
Pdf URL: https://arxiv.org/pdf/2406.10307
Copy Paste: [[2406.10307]] What is the best model? Application-driven Evaluation for Large Language Models(https://arxiv.org/abs/2406.10307)
Keywords: language model, llm, prompt
Abstract: General large language models enhanced with supervised fine-tuning and reinforcement learning from human feedback are increasingly popular in academia and industry as they generalize foundation models to various practical tasks in a prompt manner. To assist users in selecting the best model in practical application scenarios, i.e., choosing the model that meets the application requirements while minimizing cost, we introduce A-Eval, an application-driven LLM evaluation benchmark for general large language models. First, we categorize evaluation tasks into five main categories and 27 sub-categories from a practical application perspective. Next, we construct a dataset comprising 678 question-and-answer pairs through a process of collecting, annotating, and reviewing. Then, we design an objective and effective evaluation method and evaluate a series of LLMs of different scales on A-Eval. Finally, we reveal interesting laws regarding model scale and task difficulty level and propose a feasible method for selecting the best model. Through A-Eval, we provide clear empirical and engineering guidance for selecting the best model, reducing barriers to selecting and using LLMs and promoting their application and development. Our benchmark is publicly available at this https URL.

Title: TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs

Authors: Zhuofeng Li, Zixing Gou, Xiangnan Zhang, Zhongyuan Liu, Sirui Li, Yuntong Hu, Chen Ling, Zheng Zhang, Liang Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10310
Pdf URL: https://arxiv.org/pdf/2406.10310
Copy Paste: [[2406.10310]] TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs(https://arxiv.org/abs/2406.10310)
Keywords: language model
Abstract: Text-Attributed Graphs (TAGs) augment graph structures with natural language descriptions, facilitating detailed depictions of data and their interconnections across various real-world settings. However, existing TAG datasets predominantly feature textual information only at the nodes, with edges typically represented by mere binary or categorical attributes. This lack of rich textual edge annotations significantly limits the exploration of contextual relationships between entities, hindering deeper insights into graph-structured data. To address this gap, we introduce Textual-Edge Graphs Datasets and Benchmark (TEG-DB), a comprehensive and diverse collection of benchmark textual-edge datasets featuring rich textual descriptions on nodes and edges. The TEG-DB datasets are large-scale and encompass a wide range of domains, from citation networks to social networks. In addition, we conduct extensive benchmark experiments on TEG-DB to assess the extent to which current techniques, including pre-trained language models, graph neural networks, and their combinations, can utilize textual node and edge information. Our goal is to elicit advancements in textual-edge graph research, specifically in developing methodologies that exploit rich textual node and edge descriptions to enhance graph analysis and provide deeper insights into complex real-world networks. The entire TEG-DB project is publicly available as an open-source repository on GitHub at this https URL.

Title: CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models

Authors: Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Meijuan An, Bikun Yang, KaiKai Zhao, Kai Wang, Shiguo Lian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10311
Pdf URL: https://arxiv.org/pdf/2406.10311
Copy Paste: [[2406.10311]] CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models(https://arxiv.org/abs/2406.10311)
Keywords: language model, llm
Abstract: With the profound development of large language models (LLMs), their safety concerns have garnered increasing attention. However, there is a scarcity of Chinese safety benchmarks for LLMs, and the existing safety taxonomies are inadequate, lacking comprehensive safety detection capabilities in authentic Chinese scenarios. In this work, we introduce CHiSafetyBench, a dedicated safety benchmark for evaluating LLMs' capabilities in identifying risky content and refusing to answer risky questions in Chinese contexts. CHiSafetyBench incorporates a dataset that covers a hierarchical Chinese safety taxonomy consisting of 5 risk areas and 31 categories. This dataset comprises two types of tasks, multiple-choice questions and question answering, which evaluate LLMs from the perspectives of risk content identification and the ability to refuse to answer risky questions, respectively. Utilizing this benchmark, we validate the feasibility of automatic evaluation as a substitute for human evaluation and conduct comprehensive automatic safety assessments on mainstream Chinese LLMs. Our experiments reveal the varying performance of different models across various safety domains, indicating that all models possess considerable potential for improvement in Chinese safety capabilities. Our dataset is publicly available at this https URL.

Title: GenQA: Generating Millions of Instructions from a Handful of Prompts

Authors: Jiuhai Chen, Rifaa Qadri, Yuxin Wen, Neel Jain, John Kirchenbauer, Tianyi Zhou, Tom Goldstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10323
Pdf URL: https://arxiv.org/pdf/2406.10323
Copy Paste: [[2406.10323]] GenQA: Generating Millions of Instructions from a Handful of Prompts(https://arxiv.org/abs/2406.10323)
Keywords: llm, prompt, chat
Abstract: Most public instruction finetuning datasets are relatively small compared to the closed source datasets used to train industry models. To study questions about finetuning at scale, such as curricula and learning rate cooldown schedules, there is a need for industrial-scale datasets. However, this scale necessitates a data generation process that is almost entirely automated. In this work, we study methods for generating large instruction datasets from a single prompt. With little human oversight, we get LLMs to write diverse sets of instruction examples ranging from simple completion tasks to complex multi-turn dialogs across a variety of subject areas. When finetuning a Llama-3 8B base model, our dataset meets or exceeds both WizardLM and Ultrachat on both knowledge-intensive leaderboard tasks as well as conversational evaluations. We release our dataset, the "generator" prompts that created it, and our finetuned model checkpoints.

Title: EWEK-QA: Enhanced Web and Efficient Knowledge Graph Retrieval for Citation-based Question Answering Systems

Authors: Mohammad Dehghan, Mohammad Ali Alomrani, Sunyam Bagga, David Alfonso-Hermelo, Khalil Bibi, Abbas Ghaddar, Yingxue Zhang, Xiaoguang Li, Jianye Hao, Qun Liu, Jimmy Lin, Boxing Chen, Prasanna Parthasarathi, Mahdi Biparva, Mehdi Rezagholizadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10393
Pdf URL: https://arxiv.org/pdf/2406.10393
Copy Paste: [[2406.10393]] EWEK-QA: Enhanced Web and Efficient Knowledge Graph Retrieval for Citation-based Question Answering Systems(https://arxiv.org/abs/2406.10393)
Keywords: llm
Abstract: Emerging citation-based QA systems are gaining more attention, especially in generative AI search applications. The extracted knowledge provided to these systems is vital for both accuracy (completeness of information) and efficiency (extracting the information in a timely manner). In this regard, citation-based QA systems suffer from two shortcomings. First, they usually rely only on the web as a source of extracted knowledge, and adding other external knowledge sources can hamper the efficiency of the system. Second, web-retrieved contents are usually obtained by simple heuristics such as fixed lengths or breakpoints, which might split information into pieces. To mitigate these issues, we propose our enhanced web and efficient knowledge graph (KG) retrieval solution (EWEK-QA) to enrich the content of the extracted knowledge fed to the system. This is done by designing an adaptive web retriever and incorporating KG triples in an efficient manner. We demonstrate the effectiveness of EWEK-QA over the open-source state-of-the-art (SoTA) web-based and KG baseline models using a comprehensive set of quantitative and human evaluation experiments. Our model is able to: first, improve on the web-retriever baseline in terms of extracting more relevant passages (>20%), the coverage of answer span (>25%), and self-containment (>35%); second, obtain and integrate KG triples into its pipeline very efficiently (by avoiding any LLM calls) to outperform the web-only and KG-only SoTA baselines significantly in 7 quantitative QA tasks and our human evaluation.

Title: Self-Reflection Outcome is Sensitive to Prompt Construction

Authors: Fengyuan Liu, Nouar AlDahoul, Gregory Eady, Yasir Zaki, Bedoor AlShebli, Talal Rahwan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10400
Pdf URL: https://arxiv.org/pdf/2406.10400
Copy Paste: [[2406.10400]] Self-Reflection Outcome is Sensitive to Prompt Construction(https://arxiv.org/abs/2406.10400)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) demonstrate impressive zero-shot and few-shot reasoning capabilities. Some propose that such capabilities can be improved through self-reflection, i.e., letting LLMs reflect on their own output to identify and correct mistakes in the initial responses. However, despite some evidence showing the benefits of self-reflection, recent studies offer mixed results. Here, we aim to reconcile these conflicting findings by first demonstrating that the outcome of self-reflection is sensitive to prompt wording; e.g., LLMs are more likely to conclude that they have made a mistake when explicitly prompted to find mistakes. Consequently, idiosyncrasies in reflection prompts may lead LLMs to change correct responses unnecessarily. We show that most prompts used in the self-reflection literature are prone to this bias. We then propose different ways of constructing prompts that are conservative in identifying mistakes and show that self-reflection using such prompts results in higher accuracy. Our findings highlight the importance of prompt engineering in self-reflection tasks. We release our code at this https URL.

Title: SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Authors: Tu Anh Dinh, Carlos Mullov, Leonard Bärmann, Zhaolin Li, Danni Liu, Simon Reiß, Jueun Lee, Nathan Lerzer, Fabian Ternava, Jianfeng Gao, Alexander Waibel, Tamim Asfour, Michael Beigl, Rainer Stiefelhagen, Carsten Dachsbacher, Klemens Böhm, Jan Niehues
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10421
Pdf URL: https://arxiv.org/pdf/2406.10421
Copy Paste: [[2406.10421]] SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading(https://arxiv.org/abs/2406.10421)
Keywords: language model, llm
Abstract: With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks that can evaluate the ability of LLMs in different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases, or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper we propose SciEx, a benchmark consisting of university computer science exam questions, to evaluate LLMs' ability to solve scientific tasks. SciEx is (1) multilingual, containing both English and German exams, (2) multi-modal, containing questions that involve images, and (3) varied, containing different types of free-form questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are free-form, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for current LLMs: the best LLM achieves only a 59.4% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent as graders, achieving 0.948 Pearson correlation with expert grading.
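For readers unfamiliar with the agreement metric quoted above, the following is a minimal sketch of how a Pearson correlation between expert grades and LLM-as-a-judge grades could be computed. The function name `pearson` and the two grade lists are hypothetical illustrations; they are not data or code from the SciEx paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical grades (percent) for five exam answers; NOT from the paper.
expert_grades = [80.0, 55.0, 30.0, 90.0, 65.0]
llm_judge_grades = [75.0, 60.0, 35.0, 85.0, 70.0]

r = pearson(expert_grades, llm_judge_grades)
print(f"Pearson r between expert and LLM-judge grades: {r:.3f}")
```

A value near 1 means the automatic grader ranks and spaces answers much as the experts do; it does not by itself show that the absolute grades agree.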

Title: Enhancing In-Context Learning with Semantic Representations for Relation Extraction

Authors: Peitao Han, Lis Kanashiro Pereira, Fei Cheng, Wan Jou She, Eiji Aramaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10432
Pdf URL: https://arxiv.org/pdf/2406.10432
Copy Paste: [[2406.10432]] Enhancing In-Context Learning with Semantic Representations for Relation Extraction(https://arxiv.org/abs/2406.10432)
Keywords: gpt
Abstract: In this work, we employ two semantic representations enhanced with Abstract Meaning Representation (AMR) for in-context learning (ICL) on relation extraction (RE): one that explores the AMR structure generated for a sentence at the subgraph level (shortest AMR path), and another that explores the full AMR structure generated for a sentence. In both cases, we demonstrate that all settings benefit from the fine-grained semantic structure of AMR. We evaluate our model on four RE datasets. Our results show that our model can outperform the GPT-based baselines, achieving SOTA performance on two of the datasets and competitive performance on the other two.

Title: Domain-Specific Shorthand for Generation Based on Context-Free Grammar

Authors: Andriy Kanyuka, Elias Mahfoud
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10442
Pdf URL: https://arxiv.org/pdf/2406.10442
Copy Paste: [[2406.10442]] Domain-Specific Shorthand for Generation Based on Context-Free Grammar(https://arxiv.org/abs/2406.10442)
Keywords: language model, gpt, llm
Abstract: The generation of structured data in formats such as JSON, YAML and XML is a critical task in Generative AI (GenAI) applications. These formats, while widely used, contain many redundant constructs that lead to inflated token usage. This inefficiency is particularly evident when employing large language models (LLMs) like GPT-4, where generating extensive structured data incurs increased latency and operational costs. We introduce a domain-specific shorthand (DSS) format, underpinned by a context-free grammar (CFG), and demonstrate its usage to reduce the number of tokens required for structured data generation. The method involves creating a shorthand notation that captures essential elements of the output schema with fewer tokens, ensuring it can be unambiguously converted to and from its verbose form. It employs a CFG to facilitate efficient shorthand generation by the LLM, and to create parsers to translate the shorthand back into standard structured formats. The application of our approach to data visualization with LLMs demonstrates a significant (3x to 5x) reduction in generated tokens, leading to significantly lower latency and cost. This paper outlines the development of the DSS and the accompanying CFG, and the implications of this approach for GenAI applications, presenting a scalable solution to the token inefficiency problem in structured data generation.
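The round trip between a shorthand and its verbose form, as described in the abstract above, can be illustrated with a minimal sketch. The shorthand syntax (`type|key=value|...`), the field names, and both helper functions are hypothetical and chosen only for illustration; they are not the DSS or the grammar defined in the paper.

```python
import json

# Hypothetical grammar for the toy shorthand (not the paper's CFG):
#   spec  := type ("|" field)*
#   field := key "=" value

def expand_shorthand(shorthand: str) -> dict:
    """Expand the toy shorthand into a verbose, JSON-ready dict."""
    chart_type, *fields = shorthand.split("|")
    spec = {"chart": {"type": chart_type}, "encoding": {}}
    for field in fields:
        key, _, value = field.partition("=")
        if key in ("x", "y"):
            # Axis fields become nested encoding objects in the verbose form.
            spec["encoding"][key] = {"field": value}
        else:
            spec[key] = value
    return spec

def compress_spec(spec: dict) -> str:
    """Convert the verbose dict back into the toy shorthand."""
    parts = [spec["chart"]["type"]]
    for axis, enc in spec["encoding"].items():
        parts.append(f"{axis}={enc['field']}")
    for key, value in spec.items():
        if key not in ("chart", "encoding"):
            parts.append(f"{key}={value}")
    return "|".join(parts)

shorthand = "bar|x=month|y=sales|title=Monthly Sales"
verbose = expand_shorthand(shorthand)
print(json.dumps(verbose, indent=2))
print(compress_spec(verbose))
```

In this toy case the shorthand string is much shorter than the serialized JSON, which is the effect the abstract attributes to DSS; the actual token savings depend on the tokenizer and schema.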

Title: CancerLLM: A Large Language Model in Cancer Domain

Authors: Mingchen Li, Anne Blaes, Steven Johnson, Hongfang Liu, Hua Xu, Rui Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10459
Pdf URL: https://arxiv.org/pdf/2406.10459
Copy Paste: [[2406.10459]] CancerLLM: A Large Language Model in Cancer Domain(https://arxiv.org/abs/2406.10459)
Keywords: language model, llm
Abstract: Medical Large Language Models (LLMs) such as ClinicalCamel 70B and Llama3-OpenBioLLM 70B have demonstrated impressive performance on a wide variety of medical NLP tasks. However, there is still no large language model (LLM) specifically designed for the cancer domain. Moreover, these LLMs typically have billions of parameters, making them computationally expensive for healthcare systems. Thus, in this study, we propose CancerLLM, a model with 7 billion parameters and a Mistral-style architecture, pre-trained on 2,676,642 clinical notes and 515,524 pathology reports covering 17 cancer types, followed by fine-tuning on three cancer-relevant tasks: cancer phenotype extraction, cancer diagnosis generation, and cancer treatment plan generation. Our evaluation demonstrated that CancerLLM achieves state-of-the-art results compared to other existing LLMs, with an average F1 score improvement of 8.1%. Additionally, CancerLLM outperforms other models on two proposed robustness testbeds. This illustrates that CancerLLM can be effectively applied to clinical AI systems, enhancing clinical research and healthcare delivery in the field of cancer.

Title: Personalized Pieces: Efficient Personalized Large Language Models through Collaborative Efforts

Authors: Zhaoxuan Tan, Zheyuan Liu, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10471
Pdf URL: https://arxiv.org/pdf/2406.10471
Copy Paste: [[2406.10471]] Personalized Pieces: Efficient Personalized Large Language Models through Collaborative Efforts(https://arxiv.org/abs/2406.10471)
Keywords: language model, llm
Abstract: Personalized large language models (LLMs) aim to tailor interactions, content, and recommendations to individual user preferences. While parameter-efficient fine-tuning (PEFT) methods excel in performance and generalization, they are costly and limit communal benefits when used individually. To this end, we introduce Personalized Pieces (Per-Pcs), a framework that allows users to safely share and assemble personalized PEFT efficiently with collaborative efforts. Per-Pcs involves selecting sharers, breaking their PEFT into pieces, and training gates for each piece. These pieces are added to a pool, from which target users can select and assemble personalized PEFT using their history data. This approach preserves privacy and enables fine-grained user modeling without excessive storage and computation demands. Experimental results show Per-Pcs outperforms non-personalized and PEFT retrieval baselines, offering performance comparable to OPPU with significantly lower resource use across six tasks. Further analysis highlights Per-Pcs's robustness concerning sharer count and selection strategy, pieces sharing ratio, and scalability in computation time and storage space. Per-Pcs's modularity promotes safe sharing, making LLM personalization more efficient, effective, and widely accessible through collaborative efforts.

Title: From Words to Worlds: Transforming One-line Prompt into Immersive Multi-modal Digital Stories with Communicative LLM Agent

Authors: Samuel S. Sohn, Danrui Li, Sen Zhang, Che-Jui Chang, Mubbasir Kapadia
Subjects: cs.CL, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2406.10478
Pdf URL: https://arxiv.org/pdf/2406.10478
Copy Paste: [[2406.10478]] From Words to Worlds: Transforming One-line Prompt into Immersive Multi-modal Digital Stories with Communicative LLM Agent(https://arxiv.org/abs/2406.10478)
Keywords: language model, llm, prompt, agent
Abstract: Digital storytelling, essential in entertainment, education, and marketing, faces challenges in production scalability and flexibility. The StoryAgent framework, introduced in this paper, utilizes Large Language Models and generative tools to automate and refine digital storytelling. Employing a top-down story drafting and bottom-up asset generation approach, StoryAgent tackles key issues such as manual intervention, interactive scene orchestration, and narrative consistency. This framework enables efficient production of interactive and consistent narratives across multiple modalities, democratizing content creation and enhancing engagement. Our results demonstrate the framework's capability to produce coherent digital stories without reference videos, marking a significant advancement in automated digital storytelling.

Title: Do Large Language Models Discriminate in Hiring Decisions on the Basis of Race, Ethnicity, and Gender?

Authors: Haozhe An, Christabel Acquaye, Colin Wang, Zongxia Li, Rachel Rudinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10486
Pdf URL: https://arxiv.org/pdf/2406.10486
Copy Paste: [[2406.10486]] Do Large Language Models Discriminate in Hiring Decisions on the Basis of Race, Ethnicity, and Gender?(https://arxiv.org/abs/2406.10486)
Keywords: language model, llm, prompt
Abstract: We examine whether large language models (LLMs) exhibit race- and gender-based name discrimination in hiring decisions, similar to classic findings in the social sciences (Bertrand and Mullainathan, 2004). We design a series of templatic prompts to LLMs to write an email to a named job applicant informing them of a hiring decision. By manipulating the applicant's first name, we measure the effect of perceived race, ethnicity, and gender on the probability that the LLM generates an acceptance or rejection email. We find that the hiring decisions of LLMs in many settings are more likely to favor White applicants over Hispanic applicants. In aggregate, the groups with the highest and lowest acceptance rates respectively are masculine White names and masculine Hispanic names. However, the comparative acceptance rates by group vary under different templatic settings, suggesting that LLMs' race- and gender-sensitivity may be idiosyncratic and prompt-sensitive.

Title: Large Language Models as Event Forecasters

Authors: Libo Zhang, Yue Ning
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10492
Pdf URL: https://arxiv.org/pdf/2406.10492
Copy Paste: [[2406.10492]] Large Language Models as Event Forecasters(https://arxiv.org/abs/2406.10492)
Keywords: language model, llm, prompt
Abstract: Key elements of human events are extracted as quadruples that consist of subject, relation, object, and timestamp. This representation can be extended to a quintuple by adding a fifth element: a textual summary that briefly describes the event. These quadruples or quintuples, when organized within a specific domain, form a temporal knowledge graph (TKG). Current learning frameworks focus on a few TKG-related tasks, such as predicting an object given a subject and a relation or forecasting the occurrences of multiple types of events (i.e., relation) in the next time window. They typically rely on complex structural and sequential models like graph neural networks (GNNs) and recurrent neural networks (RNNs) to update intermediate embeddings. However, these methods often neglect the contextual information inherent in each quintuple, which can be effectively captured through concise textual descriptions. In this paper, we investigate how large language models (LLMs) can streamline the design of TKG learning frameworks while maintaining competitive accuracy in prediction and forecasting tasks. We develop multiple prompt templates to frame the object prediction (OP) task as a standard question-answering (QA) task, suitable for instruction fine-tuning with an encoder-decoder generative LLM. For multi-event forecasting (MEF), we design simple yet effective prompt templates for each TKG quintuple. This novel approach removes the need for GNNs and RNNs, instead utilizing an encoder-only LLM to generate fixed intermediate embeddings, which are subsequently processed by a prediction head with a self-attention mechanism to forecast potential future relations. Extensive experiments on multiple real-world datasets using various evaluation metrics validate the effectiveness and robustness of our approach.
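Framing object prediction as question answering, as the abstract above describes, can be sketched roughly as follows. The `Quintuple` type, the template wording, and the example event are all hypothetical; the paper's actual prompt templates are not given in the abstract.

```python
from typing import NamedTuple

class Quintuple(NamedTuple):
    """One TKG event: a quadruple plus a short textual summary."""
    subject: str
    relation: str
    obj: str
    timestamp: str
    summary: str

def object_prediction_prompt(event: Quintuple) -> tuple[str, str]:
    """Turn an event into a (question, answer) pair for instruction fine-tuning.

    The object is withheld from the question and used as the target answer.
    """
    question = (
        f"Context: {event.summary}\n"
        f"Question: On {event.timestamp}, which entity was the object of "
        f"the relation '{event.relation}' with subject '{event.subject}'?"
    )
    return question, event.obj

# Hypothetical event, made up for illustration only.
event = Quintuple(
    subject="Country A",
    relation="sign agreement",
    obj="Country B",
    timestamp="2024-01-15",
    summary="Officials met to finalize a bilateral trade deal.",
)
question, answer = object_prediction_prompt(event)
print(question)
print("Answer:", answer)
```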

Title: CroPrompt: Cross-task Interactive Prompting for Zero-shot Spoken Language Understanding

Authors: Libo Qin, Fuxuan Wei, Qiguang Chen, Jingxuan Zhou, Shijue Huang, Jiasheng Si, Wenpeng Lu, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10505
Pdf URL: https://arxiv.org/pdf/2406.10505
Copy Paste: [[2406.10505]] CroPrompt: Cross-task Interactive Prompting for Zero-shot Spoken Language Understanding(https://arxiv.org/abs/2406.10505)
Keywords: language model, prompt
Abstract: Slot filling and intent detection are two highly correlated tasks in spoken language understanding (SLU). Recent SLU research attempts to explore zero-shot prompting techniques in large language models to alleviate the data scarcity problem. Nevertheless, the existing prompting work ignores the cross-task interaction information for SLU, which leads to sub-optimal performance. To solve this problem, we present the pioneering work of Cross-task Interactive Prompting (CroPrompt) for SLU, which enables the model to interactively leverage the information exchange across the correlated tasks in SLU. Additionally, we further introduce a multi-task self-consistency mechanism to mitigate the error propagation caused by the intent information injection. We conduct extensive experiments on the standard SLU benchmark and the results reveal that CroPrompt consistently outperforms the existing prompting approaches. In addition, the multi-task self-consistency mechanism can effectively ease the error propagation issue, thereby enhancing the performance. We hope this work can inspire more research on cross-task prompting for SLU.

Title: Large Language Model Enhanced Clustering for News Event Detection

Authors: Adane Nega Tarekegn, Fazle Rabbi, Bjørnar Tessem
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10552
Pdf URL: https://arxiv.org/pdf/2406.10552
Copy Paste: [[2406.10552]] Large Language Model Enhanced Clustering for News Event Detection(https://arxiv.org/abs/2406.10552)
Keywords: language model, llm
Abstract: The news landscape is continuously evolving, with an ever-increasing volume of information from around the world. Automated event detection within this vast data repository is essential for monitoring, identifying, and categorizing significant news occurrences across diverse platforms. This paper presents an event detection framework that leverages Large Language Models (LLMs) combined with clustering analysis to detect news events from the Global Database of Events, Language, and Tone (GDELT). The framework enhances event clustering through both pre-event detection tasks (keyword extraction and text embedding) and post-event detection tasks (event summarization and topic labeling). We also evaluate the impact of various textual embeddings on the quality of clustering outcomes, ensuring robust news categorization. Additionally, we introduce a novel Cluster Stability Assessment Index (CSAI) to assess the validity and robustness of clustering results. CSAI utilizes latent feature vectors to provide a new way of measuring clustering quality. Our experiments indicate that combining LLM embeddings with clustering algorithms yields the best results, demonstrating greater robustness in terms of CSAI scores. Moreover, post-event detection tasks generate meaningful insights, facilitating effective interpretation of event clustering results. Overall, our experimental results indicate that the proposed framework offers valuable insights and could enhance the accuracy and depth of news reporting.

Title: Facts-and-Feelings: Capturing both Objectivity and Subjectivity in Table-to-Text Generation

Authors: Tathagata Dey, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10560
Pdf URL: https://arxiv.org/pdf/2406.10560
Copy Paste: [[2406.10560]] Facts-and-Feelings: Capturing both Objectivity and Subjectivity in Table-to-Text Generation(https://arxiv.org/abs/2406.10560)
Keywords: language model, llm, prompt
Abstract: Table-to-text generation, a long-standing challenge in natural language generation, has remained unexplored through the lens of subjectivity. Subjectivity here encompasses the comprehension of information derived from the table that cannot be described solely by objective data. Given the absence of pre-existing datasets, we introduce the Ta2TS dataset with 3849 data instances. We fine-tune sequence-to-sequence models on the linearized tables and prompt popular large language models. We analyze the results from quantitative and qualitative perspectives to ensure the capture of subjectivity and factual consistency. The analysis shows that the fine-tuned LMs can perform close to the prompted LLMs. Both kinds of models can capture the tabular data, generating texts with 85.15% BERTScore and 26.28% Meteor score. To the best of our knowledge, we provide the first-of-its-kind dataset on tables with multiple genres and subjectivity included, and present the first comprehensive analysis and comparison of the performance of different LLMs on this task.

Title: We Care: Multimodal Depression Detection and Knowledge Infused Mental Health Therapeutic Response Generation

Authors: Palash Moon, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10561
Pdf URL: https://arxiv.org/pdf/2406.10561
Copy Paste: [[2406.10561]] We Care: Multimodal Depression Detection and Knowledge Infused Mental Health Therapeutic Response Generation(https://arxiv.org/abs/2406.10561)
Keywords: language model, gpt, llm, agent
Abstract: The detection of depression through non-verbal cues has gained significant attention. Previous research predominantly centred on identifying depression within the confines of controlled laboratory environments, often under the supervision of psychologists or counsellors. Unfortunately, datasets generated in such controlled settings may struggle to account for individual behaviours in real-life situations. In response to this limitation, we present the Extended D-vlog dataset, encompassing a collection of 1,261 YouTube vlogs. Additionally, the emergence of large language models (LLMs) like GPT3.5 and GPT4 has sparked interest in their potential to act like mental health professionals. Yet the readiness of these LLMs for use in real-life settings is still a concern, as they can give wrong responses that can harm users. We introduce a virtual agent serving as an initial contact for mental health patients, offering Cognitive Behavioral Therapy (CBT)-based responses. It comprises two core functions: 1. Identifying depression in individuals, and 2. Delivering CBT-based therapeutic responses. Our Mistral model achieved impressive scores of 70.1% and 30.9% for distortion assessment and classification, along with a BERT score of 88.7%. Moreover, utilizing the TVLT model on our Multimodal Extended D-vlog Dataset yielded outstanding results, with an impressive F1-score of 67.8%.

Title: Concentrate Attention: Towards Domain-Generalizable Prompt Optimization for Language Models

Authors: Chengzhengxu Li, Xiaoming Liu, Zhaohan Zhang, Yichen Wang, Chen Liu, Yu Lan, Chao Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10584
Pdf URL: https://arxiv.org/pdf/2406.10584
Copy Paste: [[2406.10584]] Concentrate Attention: Towards Domain-Generalizable Prompt Optimization for Language Models(https://arxiv.org/abs/2406.10584)
Keywords: language model, prompt
Abstract: Recent advances in prompt optimization have notably enhanced the performance of pre-trained language models (PLMs) on downstream tasks. However, the potential of optimized prompts for domain generalization has been under-explored. To explore the nature of prompt generalization on unknown domains, we conduct pilot experiments and find that (i) prompts gaining more attention weight from PLMs' deep layers are more generalizable and (ii) prompts with more stable attention distributions in PLMs' deep layers are more generalizable. Thus, we offer a fresh objective for domain-generalizable prompt optimization named "Concentration", which represents the "lookback" attention from the current decoding token to the prompt tokens, to increase the attention strength on prompts and reduce the fluctuation of the attention distribution. We adapt this new objective to popular soft prompt and hard prompt optimization methods, respectively. Extensive experiments demonstrate that our idea improves the accuracy of the compared prompt optimization methods by 1.42% for soft prompt generalization and 2.16% for hard prompt generalization in the multi-source domain generalization setting, while maintaining satisfactory in-domain performance. The promising results validate the effectiveness of our proposed prompt optimization objective and provide key insights into domain-generalizable prompts.
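The two quantities the abstract above associates with "Concentration" (attention strength on the prompt and its fluctuation) can be illustrated with a small sketch. It assumes that strength is the attention mass the current decoding token places on prompt positions and that fluctuation is the standard deviation of that mass across inputs; both definitions, the function names, and the toy numbers are assumptions, not the paper's exact formulation.

```python
from statistics import mean, pstdev

def lookback_strength(attn_row, prompt_positions):
    """Attention mass from the decoding token onto the prompt tokens.

    attn_row is one attention distribution over the input sequence
    (non-negative weights summing to 1).
    """
    return sum(attn_row[i] for i in prompt_positions)

def concentration_stats(attn_rows, prompt_positions):
    """Return (mean strength, fluctuation) over several inputs.

    Fluctuation is taken here as the population standard deviation of
    the per-input strength, which is an illustrative assumption.
    """
    strengths = [lookback_strength(row, prompt_positions) for row in attn_rows]
    return mean(strengths), pstdev(strengths)

# Toy attention rows for two inputs; positions 0 and 1 hold the prompt.
attn_rows = [
    [0.5, 0.2, 0.3],
    [0.4, 0.4, 0.2],
]
strength, fluctuation = concentration_stats(attn_rows, prompt_positions=[0, 1])
print(strength, fluctuation)
```

Under these assumptions, a prompt optimizer following the abstract's objective would prefer prompts with higher `strength` and lower `fluctuation`.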

Title: BlockPruner: Fine-grained Pruning for Large Language Models

Authors: Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10594
Pdf URL: https://arxiv.org/pdf/2406.10594
Copy Paste: [[2406.10594]] BlockPruner: Fine-grained Pruning for Large Language Models(https://arxiv.org/abs/2406.10594)
Keywords: language model, llm
Abstract: With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks. It then assesses the importance of these blocks using perplexity measures and applies a heuristic search for iterative pruning. We applied BlockPruner to LLMs of various sizes and architectures and validated its performance across a wide range of downstream tasks. Experimental results show that BlockPruner achieves more granular and effective pruning compared to state-of-the-art baselines.
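The idea of scoring MHA and MLP blocks by perplexity and pruning them iteratively, as described in the abstract above, can be sketched as a greedy loop. The greedy strategy, the function names, and the toy perplexity model are assumptions for illustration; the abstract only states that a heuristic search over perplexity measures is used.

```python
def iterative_block_pruning(blocks, perplexity_fn, num_to_prune):
    """Greedily remove the block whose removal hurts perplexity the least.

    blocks: identifiers of MHA/MLP blocks currently in the model.
    perplexity_fn: callable taking the list of kept blocks and returning
        the perplexity of the model restricted to those blocks.
    """
    remaining = list(blocks)
    pruned = []
    for _ in range(min(num_to_prune, len(remaining))):
        # Evaluate the model with each candidate block removed in turn.
        best = min(
            remaining,
            key=lambda b: perplexity_fn([x for x in remaining if x != b]),
        )
        remaining.remove(best)
        pruned.append(best)
    return remaining, pruned

# Toy stand-in for a real perplexity evaluation: each missing block adds a
# fixed, made-up penalty to a base perplexity of 10.
TOY_PENALTY = {"L0.MHA": 0.1, "L0.MLP": 2.0, "L1.MHA": 0.05, "L1.MLP": 1.5}

def toy_perplexity(kept_blocks):
    kept = set(kept_blocks)
    return 10.0 + sum(p for b, p in TOY_PENALTY.items() if b not in kept)

kept, pruned = iterative_block_pruning(list(TOY_PENALTY), toy_perplexity, 2)
print("kept:", kept)
print("pruned:", pruned)
```

With a real model, `perplexity_fn` would run the pruned network on a calibration set, so each greedy step costs one evaluation per remaining block.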

Title: Multilingual Large Language Models and Curse of Multilinguality

Authors: Daniil Gurgurov, Tanja Bäumel, Tatiana Anikina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10602
Pdf URL: https://arxiv.org/pdf/2406.10602
Copy Paste: [[2406.10602]] Multilingual Large Language Models and Curse of Multilinguality(https://arxiv.org/abs/2406.10602)
Keywords: language model, gpt, llm
Abstract: Multilingual Large Language Models (LLMs) have gained wide popularity among Natural Language Processing (NLP) researchers and practitioners. These models, trained on huge datasets, show proficiency across various languages and demonstrate effectiveness in numerous downstream tasks. This paper navigates the landscape of multilingual LLMs, providing an introductory overview of their technical aspects. It explains underlying architectures, objective functions, pre-training data sources, and tokenization methods. This work explores the unique features of different model types: encoder-only (mBERT, XLM-R), decoder-only (XGLM, PaLM, BLOOM, GPT-3), and encoder-decoder models (mT5, mBART). Additionally, it addresses one of the significant limitations of multilingual LLMs, the curse of multilinguality, and discusses current attempts to overcome it.

Title: StructBench: An Autogenerated Benchmark for Evaluating Large Language Model's Ability in Structure-Rich Text Understanding

Authors: Zhouhong Gu, Haoning Ye, Zeyang Zhou, Hongwei Feng, Yanghua Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] StructBench: An Autogenerated Benchmark for Evaluating Large Language Model's Ability in Structure-Rich Text Understanding(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Given the substantial volumes of structured data held by many companies, enabling Large Language Models (LLMs) to directly understand structured text in non-structured forms could significantly enhance their capabilities across various business scenarios. To this end, we propose an evaluation data generation method for assessing LLMs' ability to understand structure-rich text, which generates structured data of controllable complexity based on manually crafted question templates and generation rules. Building on this generation method, we introduce StructBench, a benchmark comprising 6,032 questions across 8 different structured languages and 29 specific tasks. Furthermore, considering human proficiency in rule-based tasks, we also present StructBench-Hard, which includes 3,016 questions designed to further examine the gap between LLMs and human performance. Results indicate that the best-performing LLM currently achieves an accuracy of 65.0% on StructBench-Hard, while human accuracy reaches up to 95.7%. Moreover, while fine-tuning using StructBench can enhance existing LLMs' understanding of all structured languages, it does not necessarily improve performance across all task types. The benchmark and generation codes are open sourced in this https URL

Title: On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models

Authors: Sree Harsha Tanneru, Dan Ley, Chirag Agarwal, Himabindu Lakkaraju
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10625
Pdf URL: https://arxiv.org/pdf/2406.10625
Copy Paste: [[2406.10625]] On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models(https://arxiv.org/abs/2406.10625)
Keywords: language model, llm, chain-of-thought
Abstract: As Large Language Models (LLMs) are increasingly being employed in real-world applications in critical domains such as healthcare, it is important to ensure that the Chain-of-Thought (CoT) reasoning generated by these models faithfully captures their underlying behavior. While LLMs are known to generate CoT reasoning that is appealing to humans, prior studies have shown that these explanations do not accurately reflect the actual behavior of the underlying LLMs. In this work, we explore the promise of three broad approaches commonly employed to steer the behavior of LLMs to enhance the faithfulness of the CoT reasoning generated by LLMs: in-context learning, fine-tuning, and activation editing. Specifically, we introduce novel strategies for in-context learning, fine-tuning, and activation editing aimed at improving the faithfulness of the CoT reasoning. We then carry out extensive empirical analyses with multiple benchmark datasets to explore the promise of these strategies. Our analyses indicate that these strategies offer limited success in improving the faithfulness of the CoT reasoning, with only slight performance enhancements in controlled scenarios. Activation editing demonstrated minimal success, while fine-tuning and in-context learning achieved marginal improvements that failed to generalize across diverse reasoning and truthful question-answering benchmarks. In summary, our work underscores the inherent difficulty in eliciting faithful CoT reasoning from LLMs, suggesting that the current array of approaches may not be sufficient to address this complex challenge.

Title: Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models

Authors: Rui Ye, Jingyi Chai, Xiangrui Liu, Yaodong Yang, Yanfeng Wang, Siheng Chen
Subjects: cs.CL, cs.AI, cs.CR, cs.MA
Abstract URL: https://arxiv.org/abs/2406.10630
Pdf URL: https://arxiv.org/pdf/2406.10630
Copy Paste: [[2406.10630]] Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models(https://arxiv.org/abs/2406.10630)
Keywords: language model, llm
Abstract: Federated learning (FL) enables multiple parties to collaboratively fine-tune a large language model (LLM) without direct data sharing. Ideally, by training on decentralized data that is aligned with human preferences and safety principles, federated instruction tuning (FedIT) can produce an LLM that behaves in a helpful and safe manner. In this paper, we reveal for the first time the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method. Specifically, malicious clients can automatically generate attack data without manual effort and attack the FedIT system by training their local LLMs on such attack data. Unfortunately, this safety attack not only compromises the safety alignment of an LLM trained via FedIT, but also cannot be effectively defended against by many existing FL defense methods. To address this, we further propose a post-hoc defense method that relies on a fully automated pipeline: generation of defense data and further fine-tuning of the LLM. Extensive experiments show that our safety attack method can significantly compromise the LLM's safety alignment (e.g., reduce the safety rate by 70%), which cannot be effectively defended by existing defense methods (at most 4% absolute improvement), while our safety defense method can significantly enhance the attacked LLM's safety alignment (at most 69% absolute improvement).
摘要：联邦学习 (FL) 使多方能够协作微调大型语言模型 (LLM)，而无需直接共享数据。理想情况下，通过对符合人类偏好和安全原则的分散数据进行训练，联邦指令调整可以使 LLM 以有益和安全的方式运行。在本文中，我们首次通过提出一种简单、隐秘但有效的安全攻击方法揭示了 FedIT 中安全对齐的脆弱性。具体来说，恶意客户端可以自动生成攻击数据而无需人工干预，并通过在这些攻击数据上训练其本地 LLM 来攻击 FedIT 系统。不幸的是，这种提出的安全攻击不仅会损害通过 FedIT 训练的 LLM 的安全对齐，而且许多现有的 FL 防御方法也无法有效防御。针对此问题，我们进一步提出了一种事后防御方法，该方法可以依赖于完全自动化的管道：生成防御数据并进一步微调 LLM。大量实验表明，我们的安全攻击方法可以显著损害 LLM 的安全性一致性（例如，降低 70% 的安全率），现有的防御方法无法有效防御（最多 4% 的绝对改进），而我们的安全防御方法可以显著增强被攻击的 LLM 的安全性一致性（最多 69% 的绝对改进）。

Title: DIEKAE: Difference Injection for Efficient Knowledge Augmentation and Editing of Large Language Models

Authors: Alessio Galatolo, Meriem Beloucif, Katie Winkle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10660
Pdf URL: https://arxiv.org/pdf/2406.10660
Copy Paste: [[2406.10660]] DIEKAE: Difference Injection for Efficient Knowledge Augmentation and Editing of Large Language Models(https://arxiv.org/abs/2406.10660)
Keywords: language model
Abstract: Pretrained Language Models (PLMs) store extensive knowledge within their weights, enabling them to recall vast amounts of information. However, relying on this parametric knowledge brings limitations such as outdated information or gaps in the training data. This work addresses these problems by distinguishing between two separate solutions: knowledge editing and knowledge augmentation. We introduce Difference Injection for Efficient Knowledge Augmentation and Editing (DIEKÆ), a new method that decouples knowledge processing from the PLM (LLaMA2-7B, in particular) by adopting a series of encoders. These encoders handle external knowledge and inject it into the PLM layers, significantly reducing computational costs and improving performance of the PLM. We propose a novel training technique for these encoders that does not require back-propagation through the PLM, thus greatly reducing the memory and time required to train them. Our findings demonstrate that our method is faster and more efficient than multiple baselines in knowledge augmentation and editing during both training and inference. We have released our code and data at this https URL.
摘要：预训练语言模型 (PLM) 在其权重中存储了大量知识，使它们能够回忆大量信息。然而，依赖这些参数知识会带来一些限制，例如信息过时或训练数据存在缺口。这项工作通过区分两个独立的解决方案来解决这些问题：知识编辑和知识增强。我们引入了差异注入以实现有效的知识增强和编辑 (DIEKÆ)，这是一种新方法，通过采用一系列编码器将知识处理与 PLM（特别是 LLaMA2-7B）分离。这些编码器处理外部知识并将其注入 PLM 层，从而显著降低计算成本并提高 PLM 的性能。我们为这些编码器提出了一种新颖的训练技术，它不需要通过 PLM 进行反向传播，从而大大减少了训练它们所需的内存和时间。我们的研究结果表明，与训练和推理过程中知识增强和编辑的多个基线相比，我们的方法更快、更高效。我们已在此 https URL 上发布了我们的代码和数据。

Title: Augmenting Biomedical Named Entity Recognition with General-domain Resources

Authors: Yu Yin, Hyunjae Kim, Xiao Xiao, Chih Hsuan Wei, Jaewoo Kang, Zhiyong Lu, Hua Xu, Meng Fang, Qingyu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10671
Pdf URL: https://arxiv.org/pdf/2406.10671
Copy Paste: [[2406.10671]] Augmenting Biomedical Named Entity Recognition with General-domain Resources(https://arxiv.org/abs/2406.10671)
Keywords: language model
Abstract: Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. In this paper, we proposed GERBERA, a simple-yet-effective method that utilized a general-domain NER dataset for training. Specifically, we performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with multiple additional BioNER datasets. Specifically, our models consistently outperformed the baselines in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight biomedical entity types sourced from five different corpora. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.
摘要：训练基于神经网络的生物医学命名实体识别 (BioNER) 模型通常需要大量且昂贵的人工注释。虽然一些研究已经采用了多任务学习和多个 BioNER 数据集来减少人力投入，但这种方法并不能持续提高性能，并且可能会在不同的生物医学语料库中引入标签歧义。我们旨在通过从易于获取且与生物医学数据集概念重叠较少的资源中进行迁移学习来应对这些挑战。在本文中，我们提出了 GERBERA，这是一种简单而有效的方法，它利用通用领域 NER 数据集进行训练。具体来说，我们执行多任务学习来使用目标 BioNER 数据集和通用领域数据集来训练预训练的生物医学语言模型。随后，我们专门针对 BioNER 数据集对模型进行了微调。我们在八个实体类型的五个数据集上系统地评估了 GERBERA，总共包含 81,410 个实例。尽管使用较少的生物医学资源，但我们的模型与使用多个额外 BioNER 数据集训练的基线模型相比表现出了卓越的性能。具体而言，我们的模型在八种实体类型中的六种中始终优于基线，在来自五个不同语料库的八种生物医学实体类型中，平均比最佳基线性能提高了 0.9%。我们的方法在以数据有限的为特征的 BioNER 数据集上特别有效地提高了性能，JNLPBA-RNA 数据集上的 F1 分数提高了 4.7%。

Title: MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding

Authors: Baixuan Xu, Weiqi Wang, Haochen Shi, Wenxuan Ding, Huihao Jing, Tianqing Fang, Jiaxin Bai, Long Chen, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10701
Pdf URL: https://arxiv.org/pdf/2406.10701
Copy Paste: [[2406.10701]] MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding(https://arxiv.org/abs/2406.10701)
Keywords: language model
Abstract: Improving user experience and providing personalized search results in E-commerce platforms rely heavily on understanding purchase intention. However, existing methods for acquiring large-scale intentions bank on distilling large language models with human annotation for verification. Such an approach tends to generate product-centric intentions, overlook valuable visual information from product images, and incur high costs for scalability. To address these issues, we introduce MIND, a multimodal framework that allows Large Vision-Language Models (LVLMs) to infer purchase intentions from multimodal product metadata and prioritize human-centric ones. Using Amazon Review data, we apply MIND and create a multimodal intention knowledge base, which contains 1,264,441 intentions derived from 126,142 co-buy shopping records across 107,215 products. Extensive human evaluations demonstrate the high plausibility and typicality of our obtained intentions and validate the effectiveness of our distillation framework and filtering mechanism. Additional experiments reveal that our obtained intentions significantly enhance large language models in two intention comprehension tasks.
摘要：在电子商务平台中，改善用户体验和提供个性化搜索结果在很大程度上依赖于对购买意图的理解。然而，现有的获取大规模意图的方法依赖于提炼大型语言模型，并进行人工注释以进行验证。这种方法往往会产生以产品为中心的意图，忽略产品图像中有价值的视觉信息，并且可扩展性成本高昂。为了解决这些问题，我们引入了 MIND，这是一个多模态框架，允许大型视觉语言模型 (LVLM) 从多模态产品元数据中推断购买意图并优先考虑以人为中心的意图。使用亚马逊评论数据，我们应用 MIND 并创建了一个多模态意图知识库，其中包含 1,264,441 个意图，这些意图来自 107,215 种产品的 126,142 条共同购买购物记录。大量的人工评估证明了我们获得的意图具有很高的合理性和典型性，并验证了我们的提炼框架和过滤机制的有效性。额外的实验表明，我们获得的意图在两个意图理解任务中显著增强了大型语言模型。

Title: SparseCL: Sparse Contrastive Learning for Contradiction Retrieval

Authors: Haike Xu, Zongyu Lin, Yizhou Sun, Kai-Wei Chang, Piotr Indyk
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2406.10746
Pdf URL: https://arxiv.org/pdf/2406.10746
Copy Paste: [[2406.10746]] SparseCL: Sparse Contrastive Learning for Contradiction Retrieval(https://arxiv.org/abs/2406.10746)
Keywords: gpt
Abstract: Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications like fact checking and data cleaning. For retrieving arguments that contradict a query from large document corpora, existing methods such as similarity search and cross-encoder models exhibit significant limitations. The former struggles to capture the essence of contradiction because it inherently favors similarity, while the latter suffers from computational inefficiency, especially when the corpus is large. To address these challenges, we introduce SparseCL, a novel approach that leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences. Our method utilizes a combined metric of cosine similarity and a sparsity function to efficiently identify and retrieve documents that contradict a given query. This approach dramatically enhances the speed of contradiction detection by reducing the need for exhaustive document comparisons to simple vector calculations. We validate our model using the Arguana dataset, a benchmark dataset specifically geared towards contradiction retrieval, as well as synthetic contradictions generated from the MSMARCO and HotpotQA datasets using GPT-4. Our experiments demonstrate the efficacy of our approach not only in contradiction retrieval, with more than 30% accuracy improvements on MSMARCO and HotpotQA across different model architectures, but also in applications such as cleaning corrupted corpora to restore high-quality QA retrieval. This paper outlines a promising direction for improving the accuracy and efficiency of contradiction retrieval in large-scale text corpora.
摘要：矛盾检索是指识别和提取明确不同意或反驳查询内容的文档，这对于许多下游应用（如事实核查和数据清理）非常重要。为了从大型文档语料库中检索查询的矛盾论据，现有方法（如相似性搜索和交叉编码器模型）表现出明显的局限性。前者由于其固有的偏向相似性而难以捕捉矛盾的本质，而后者则存在计算效率低下的问题，尤其是在语料库规模较大的情况下。为了应对这些挑战，我们引入了一种新方法：SparseCL，它利用经过特殊训练的句子嵌入，旨在保留句子之间微妙的矛盾细微差别。我们的方法利用余弦相似度和稀疏函数的组合度量来有效地识别和检索与给定查询相矛盾的文档。这种方法通过将对详尽文档比较的需求减少到简单的向量计算，大大提高了矛盾检测的速度。我们使用 Arguana 数据集（一个专门针对矛盾检索的基准数据集）以及使用 GPT-4 从 MSMARCO 和 HotpotQA 数据集生成的合成矛盾来验证我们的模型。我们的实验证明了我们的方法不仅在矛盾检索方面有效，在不同模型架构中 MSMARCO 和 HotpotQA 的准确率提高了 30% 以上，而且在清理损坏的语料库以恢复高质量 QA 检索等应用中也有效。本文概述了提高大规模文本语料库中矛盾检索准确率和效率的有希望的方向。
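
The combined metric described in the SparseCL abstract can be illustrated with a minimal sketch. The abstract does not say which sparsity function is used or how the two terms are weighted, so the Hoyer sparsity measure, its application to the difference of the two embeddings, the weight `alpha`, and all function names below are assumptions made for illustration only.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def hoyer_sparsity(v):
    """Hoyer sparsity: about 0 for a uniform-magnitude vector, about 1 for a one-hot vector.

    Assumed choice of sparsity function; returns 0.0 for an all-zero vector.
    """
    l2 = np.linalg.norm(v)
    if l2 == 0.0:
        return 0.0
    root_n = np.sqrt(v.size)
    return float((root_n - np.abs(v).sum() / l2) / (root_n - 1.0))

def contradiction_score(query, doc, alpha=1.0):
    """Hypothetical combined score: similarity plus sparsity of the embedding difference."""
    return cosine(query, doc) + alpha * hoyer_sparsity(query - doc)

def rank_documents(query, docs, alpha=1.0):
    """Return document indices ordered from highest to lowest combined score."""
    scores = np.array([contradiction_score(query, d, alpha) for d in docs])
    return np.argsort(-scores)

# Toy embeddings: one document differs from the query in a single coordinate
# (sparse difference), the other differs a little in every coordinate.
q = np.ones(8)
d_dense = q + 0.5 * np.array([1, -1, 1, -1, 1, -1, 1, -1])
d_sparse = q.copy()
d_sparse[0] += 2.0
order = rank_documents(q, [d_dense, d_sparse])
```

In this toy case cosine similarity alone slightly prefers `d_dense`, while the combined score ranks `d_sparse` first because its difference from the query is concentrated in one coordinate. Scoring stays a per-document vector computation, which matches the efficiency argument in the abstract.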

Title: GNOME: Generating Negotiations through Open-Domain Mapping of Exchanges

Authors: Darshan Deshpande, Shambhavi Sinha, Anirudh Ravi Kumar, Debaditya Pal, Jonathan May
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10764
Pdf URL: https://arxiv.org/pdf/2406.10764
Copy Paste: [[2406.10764]] GNOME: Generating Negotiations through Open-Domain Mapping of Exchanges(https://arxiv.org/abs/2406.10764)
Keywords: language model
Abstract: Language Models have previously shown strong negotiation capabilities in closed domains where the negotiation strategy prediction scope is constrained to a specific setup. In this paper, we first show that these models are not generalizable beyond their original training domain despite their wide-scale pretraining. Following this, we propose an automated framework called GNOME, which processes existing human-annotated, closed-domain datasets using Large Language Models and produces synthetic open-domain dialogues for negotiation. GNOME improves the generalizability of negotiation systems while reducing the expensive and subjective task of manual data curation. Through our experimental setup, we create a benchmark comparing encoder and decoder models trained on existing datasets against datasets created through GNOME. Our results show that models trained on our dataset not only perform better than previous state of the art models on domain specific strategy prediction, but also generalize better to previously unseen domains.
摘要：语言模型此前已在封闭领域中表现出强大的谈判能力，其中谈判策略预测范围受限于特定设置。在本文中，我们首先表明，尽管这些模型进行了大规模预训练，但它们无法在其原始训练领域之外推广。随后，我们提出了一个名为 GNOME 的自动化框架，它使用大型语言模型处理现有的人工注释的封闭域数据集，并生成用于谈判的合成开放域对话。GNOME 提高了谈判系统的通用性，同时减少了昂贵且主观的手动数据管理任务。通过我们的实验设置，我们创建了一个基准，将现有数据集上训练的编码器和解码器模型与通过 GNOME 创建的数据集进行比较。我们的结果表明，在我们的数据集上训练的模型不仅在特定领域的策略预测方面表现优于以前最先进的模型，而且可以更好地推广到以前未见过的领域。

Title: Quantifying Generative Media Bias with a Corpus of Real-world and Generated News Articles

Authors: Filip Trhlik, Pontus Stenetorp
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10773
Pdf URL: https://arxiv.org/pdf/2406.10773
Copy Paste: [[2406.10773]] Quantifying Generative Media Bias with a Corpus of Real-world and Generated News Articles(https://arxiv.org/abs/2406.10773)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly being utilised across a range of tasks and domains, with a burgeoning interest in their application within the field of journalism. This trend raises concerns due to our limited understanding of LLM behaviour in this domain, especially with respect to political bias. Existing studies predominantly focus on LLMs undertaking political questionnaires, which offers only limited insights into their biases and operational nuances. To address this gap, our study establishes a new curated dataset that contains 2,100 human-written articles and utilises their descriptions to generate 56,700 synthetic articles using nine LLMs. This enables us to analyse shifts in properties between human-authored and machine-generated articles, with this study focusing on political bias, detecting it using both supervised models and LLMs. Our findings reveal significant disparities between base and instruction-tuned LLMs, with instruction-tuned models exhibiting consistent political bias. Furthermore, we are able to study how LLMs behave as classifiers, observing their display of political bias even in this role. Overall, for the first time within the journalistic domain, this study outlines a framework and provides a structured dataset for quantifiable experiments, serving as a foundation for further research into LLM political bias and its implications.
摘要：大型语言模型 (LLM) 越来越多地用于各种任务和领域，人们对其在新闻领域的应用兴趣日益浓厚。由于我们对 LLM 在该领域的行为了解有限，尤其是在政治偏见方面，这一趋势引起了人们的担忧。现有研究主要关注进行政治问卷调查的 LLM，这只能提供有限的见解来了解它们的偏见和操作细节。为了解决这一差距，我们的研究建立了一个新的精选数据集，其中包含 2,100 篇人工撰写的文章，并利用它们的描述使用九个 LLM 生成 56,700 篇合成文章。这使我们能够分析人工撰写的文章和机器生成的文章之间的属性变化，本研究重点关注政治偏见，使用监督模型和 LLM 来检测它。我们的研究结果揭示了基础和指令调整的 LLM 之间存在显着差异，指令调整的模型表现出一致的政治偏见。此外，我们能够研究 LLM 作为分类器的行为方式，观察它们在这一角色中表现出的政治偏见。总体而言，这项研究首次在新闻领域概述了一个框架，并为可量化的实验提供了结构化的数据集，为进一步研究 LLM 的政治偏见及其影响奠定了基础。

Title: Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Authors: Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10774
Pdf URL: https://arxiv.org/pdf/2406.10774
Copy Paste: [[2406.10774]] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference(https://arxiv.org/abs/2406.10774)
Keywords: language model, llm
Abstract: As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe that the criticality of a token highly depends on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 7.03x self-attention speedup, which reduces inference latency by 2.23x while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at this http URL .
摘要：随着对长上下文大型语言模型 (LLM) 的需求不断增长，上下文窗口高达 128K 或 1M 个标记的模型变得越来越普遍。然而，长上下文 LLM 推理具有挑战性，因为随着序列长度的增加，推理速度会显著下降。这种速度减慢主要是由于在自注意力期间加载大型 KV 缓存造成的。之前的研究表明，一小部分关键标记将主导注意力结果。然而，我们观察到标记的关键性高度依赖于查询。为此，我们提出了一种查询感知的 KV 缓存选择算法 Quest。Quest 跟踪 KV 缓存页面中的最小和最大键值，并使用查询向量估计给定页面的关键性。通过仅加载 Top-K 关键 KV 缓存页面进行注意力，Quest 显著加快了自注意力速度，同时不牺牲准确性。我们表明，Quest 可以实现高达 7.03 倍的自注意力加速，从而将推理延迟降低 2.23 倍，同时在具有长依赖性的任务上表现良好，准确率损失可忽略不计。代码可从此 http URL 获取。
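
The page-selection idea in the Quest abstract can be sketched in a few lines of NumPy. The abstract states only that per-page minimal and maximal Key values are tracked and combined with the Query to estimate criticality; the specific per-channel upper bound used below, the single-head layout, and all function names are assumptions for illustration, not the released implementation.

```python
import numpy as np

def page_min_max(keys, page_size):
    """Elementwise min and max of the keys in each page. keys: (seq_len, head_dim)."""
    n_pages = -(-keys.shape[0] // page_size)  # ceiling division
    pages = [keys[p * page_size:(p + 1) * page_size] for p in range(n_pages)]
    mins = np.stack([pg.min(axis=0) for pg in pages])
    maxs = np.stack([pg.max(axis=0) for pg in pages])
    return mins, maxs

def page_upper_bounds(query, mins, maxs):
    """Assumed criticality estimate: an upper bound on q.k for any key in the page.

    Per channel, q_i * k_i cannot exceed max(q_i * min_i, q_i * max_i).
    """
    return np.maximum(query * mins, query * maxs).sum(axis=1)

def select_top_k_pages(query, mins, maxs, k):
    """Indices of the k pages with the largest estimated criticality, in page order."""
    scores = page_upper_bounds(query, mins, maxs)
    k = min(k, scores.size)
    return np.sort(np.argsort(-scores)[:k])

def sparse_attention(query, keys, values, page_size, k):
    """Single-query softmax attention restricted to the selected pages."""
    mins, maxs = page_min_max(keys, page_size)
    pages = select_top_k_pages(query, mins, maxs, k)
    idx = np.concatenate([
        np.arange(p * page_size, min((p + 1) * page_size, keys.shape[0]))
        for p in pages
    ])
    logits = keys[idx] @ query / np.sqrt(query.size)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[idx]
```

Only the min/max metadata is touched for every page; full keys and values are read for the selected pages alone, which is where the memory-traffic saving would come from. Selecting all pages reproduces dense attention.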

Title: RoseLoRA: Row and Column-wise Sparse Low-rank Adaptation of Pre-trained Language Model for Knowledge Editing and Fine-tuning

Authors: Haoyu Wang, Tianci Liu, Tuo Zhao, Jing Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10777
Pdf URL: https://arxiv.org/pdf/2406.10777
Copy Paste: [[2406.10777]] RoseLoRA: Row and Column-wise Sparse Low-rank Adaptation of Pre-trained Language Model for Knowledge Editing and Fine-tuning(https://arxiv.org/abs/2406.10777)
Keywords: language model
Abstract: Pre-trained language models, trained on large-scale corpora, demonstrate strong generalizability across various NLP tasks. Fine-tuning these models for specific tasks typically involves updating all parameters, which is resource-intensive. Parameter-efficient fine-tuning (PEFT) methods, such as the popular LoRA family, introduce low-rank matrices to learn only a few parameters efficiently. However, during inference, the product of these matrices updates all pre-trained parameters, complicating tasks like knowledge editing that require selective updates. We propose a novel PEFT method, which conducts row and column-wise sparse low-rank adaptation (RoseLoRA), to address this challenge. RoseLoRA identifies and updates only the most important parameters for a specific task, maintaining efficiency while preserving other model knowledge. By adding a sparsity constraint on the product of low-rank matrices and converting it to row and column-wise sparsity, we ensure efficient and precise model updates. Our theoretical analysis guarantees the lower bound of the sparsity with respect to the matrix product. Extensive experiments on five benchmarks across twenty datasets demonstrate that RoseLoRA outperforms baselines in both general fine-tuning and knowledge editing tasks.
摘要：在大型语料库上训练的预训练语言模型在各种 NLP 任务中表现出很强的通用性。针对特定任务对这些模型进行微调通常涉及更新所有参数，这会耗费大量资源。参数高效微调 (PEFT) 方法（例如流行的 LoRA 系列）引入了低秩矩阵，以便高效地学习少数参数。然而，在推理过程中，这些矩阵的乘积会更新所有预训练参数，使需要选择性更新的知识编辑等任务变得复杂。我们提出了一种新颖的 PEFT 方法，该方法进行行和列稀疏低秩自适应 (row and column-wise sparse low-rank adaptation, RoseLoRA)，以应对这一挑战。RoseLoRA 仅识别和更新特定任务最重要的参数，在保持效率的同时保留其他模型知识。通过在低秩矩阵乘积上添加稀疏性约束并将其转换为行和列稀疏性，我们确保高效而精确的模型更新。我们的理论分析保证了矩阵乘积的稀疏性下限。在 20 个数据集上的五个基准上进行的大量实验表明，RoseLoRA 在一般微调和知识编辑任务中均优于基线。
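
Why sparsity in the low-rank factors carries over to their product can be checked numerically. The sketch below assumes a LoRA-style update B @ A with B of shape (d, r) and A of shape (r, k), keeps only the largest-magnitude entries in each column of B and each row of A, and uses the elementary counting bound nnz(B @ A) <= sum over l of nnz(B[:, l]) * nnz(A[l, :]). The magnitude-based pruning rule, the `keep` fraction, and the function names are illustrative assumptions; the abstract does not specify RoseLoRA's importance criterion or the exact form of its bound.

```python
import numpy as np

def keep_top_fraction(vec, keep):
    """Zero all but the ceil(keep * len) largest-magnitude entries of a 1-D vector."""
    n_keep = int(np.ceil(keep * vec.size))
    out = np.zeros_like(vec)
    idx = np.argsort(-np.abs(vec))[:n_keep]
    out[idx] = vec[idx]
    return out

def sparsify_factors(B, A, keep):
    """Column-wise sparsity on B (d x r) and row-wise sparsity on A (r x k)."""
    B_s = np.stack([keep_top_fraction(B[:, l], keep) for l in range(B.shape[1])], axis=1)
    A_s = np.stack([keep_top_fraction(A[l, :], keep) for l in range(A.shape[0])], axis=0)
    return B_s, A_s

def product_sparsity_lower_bound(B, A):
    """Lower bound on the fraction of zero entries in B @ A.

    Each rank-one term B[:, l] A[l, :] has at most nnz(B[:, l]) * nnz(A[l, :]) nonzeros.
    """
    nnz_bound = sum(
        np.count_nonzero(B[:, l]) * np.count_nonzero(A[l, :])
        for l in range(B.shape[1])
    )
    return max(0.0, 1.0 - nnz_bound / (B.shape[0] * A.shape[1]))

def sparsity(M):
    """Fraction of entries of M that are exactly zero."""
    return 1.0 - np.count_nonzero(M) / M.size
```

With d = k = 64, r = 4, and keep = 0.1, each column of B and each row of A retains 7 entries, so at most 4 * 49 = 196 of the 4096 entries of the update can be nonzero, and the rest of the pre-trained weights stay untouched.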

Title: ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation

Authors: Yurun Song, Junchen Zhao, Ian G. Harris, Sangeetha Abdu Jyothi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10785
Pdf URL: https://arxiv.org/pdf/2406.10785
Copy Paste: [[2406.10785]] ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation(https://arxiv.org/abs/2406.10785)
Keywords: language model, gpt
Abstract: This study introduces an approach to optimize Parameter Efficient Fine Tuning (PEFT) for Pretrained Language Models (PLMs) by implementing a Shared Low Rank Adaptation (ShareLoRA). By strategically deploying ShareLoRA across different layers and adapting it for the Query, Key, and Value components of self-attention layers, we achieve a substantial reduction in the number of training parameters and memory usage. Importantly, ShareLoRA not only maintains model performance but also exhibits robustness in both classification and generation tasks across a variety of models, including RoBERTa, GPT-2, LLaMA and LLaMA2. It demonstrates superior transfer learning capabilities compared to standard LoRA applications and mitigates overfitting by sharing weights across layers. Our findings affirm that ShareLoRA effectively boosts parameter efficiency while ensuring scalable and high-quality performance across different language model architectures.
摘要：本研究介绍了一种通过实施共享低秩自适应 (ShareLoRA) 来优化预训练语言模型 (PLM) 的参数高效微调 (PEFT) 的方法。通过在不同层上战略性地部署 ShareLoRA 并使其适应自注意力层的查询、键和值组件，我们大幅减少了训练参数的数量和内存使用量。重要的是，ShareLoRA 不仅保持了模型性能，而且在各种模型（包括 RoBERTa、GPT-2、LLaMA 和 LLaMA2）的分类和生成任务中都表现出了稳健性。与标准 LoRA 应用程序相比，它展示了卓越的迁移学习能力，并通过跨层共享权重来减轻过度拟合。我们的研究结果证实，ShareLoRA 有效地提高了参数效率，同时确保了跨不同语言模型架构的可扩展和高质量性能。

Title: Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis

Authors: Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, Jiliang Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10794
Pdf URL: https://arxiv.org/pdf/2406.10794
Copy Paste: [[2406.10794]] Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis(https://arxiv.org/abs/2406.10794)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs into outputting harmful content. Although there are diverse jailbreak attack strategies, there is no unified understanding of why some methods succeed and others fail. This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks. We hypothesize that successful attacks share a similar property: they are effective in moving the representation of the harmful prompt in the direction of the harmless prompts. We incorporate hidden representations into the objective of existing jailbreak attacks to move the attacks along the acceptance direction, and conduct experiments to validate the above hypothesis using the proposed objective. We hope this study provides new insights into how LLMs understand harmfulness information.
摘要：大型语言模型 (LLM) 容易受到一种称为越狱的攻击，这种攻击会误导 LLM 输出有害内容。尽管越狱攻击策略多种多样，但对于某些方法成功而其他方法失败的原因并没有统一的认识。本文探讨了有害和无害提示在 LLM 表示空间中的行为，以研究成功越狱攻击的内在属性。我们假设成功的攻击具有一些相似的属性：它们可以有效地将有害提示的表示移向无害提示的方向。我们将隐藏的表示利用到现有越狱攻击的目标中，使攻击沿着接受方向移动，并使用提出的目标进行实验以验证上述假设。我们希望这项研究为理解 LLM 如何理解有害信息提供新的见解。
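
The hypothesis in this abstract, that successful attacks move a harmful prompt's representation toward the harmless prompts, can be made concrete with a small measurement sketch. The abstract does not define the acceptance direction, so the difference-of-means definition below, the use of raw hidden-state vectors, and the function names are assumptions for illustration.

```python
import numpy as np

def acceptance_direction(harmless_reps, harmful_reps):
    """Assumed definition: unit vector from the harmful mean to the harmless mean.

    Each input is an array of shape (n_prompts, hidden_dim).
    """
    d = harmless_reps.mean(axis=0) - harmful_reps.mean(axis=0)
    return d / np.linalg.norm(d)

def movement_along_direction(original_rep, attacked_rep, direction):
    """Signed distance the attack moved the representation along the direction."""
    return float((attacked_rep - original_rep) @ direction)

# Toy 2-D representations standing in for model hidden states.
harmless = np.array([[2.0, 0.0], [2.0, 2.0]])
harmful = np.array([[0.0, 0.0], [0.0, 2.0]])
direction = acceptance_direction(harmless, harmful)
shift = movement_along_direction(np.array([0.0, 1.0]), np.array([1.5, 1.0]), direction)
```

Under the stated hypothesis, a larger positive `shift` would be expected for attacks that succeed; a value near zero or negative would indicate the attack left the prompt on the harmful side.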

Title: KGPA: Robustness Evaluation for Large Language Models via Cross-Domain Knowledge Graphs

Authors: Aihua Pei (1), Zehua Yang (1), Shunan Zhu (1), Ruoxi Cheng (2), Ju Jia (2), Lina Wang (3) ((1) Waseda University, (2) Southeast University, (3) Wuhan University)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10802
Pdf URL: https://arxiv.org/pdf/2406.10802
Copy Paste: [[2406.10802]] KGPA: Robustness Evaluation for Large Language Models via Cross-Domain Knowledge Graphs(https://arxiv.org/abs/2406.10802)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Existing frameworks for assessing robustness of large language models (LLMs) overly depend on specific benchmarks, increasing costs and failing to evaluate performance of LLMs in professional domains due to dataset limitations. This paper proposes a framework that systematically evaluates the robustness of LLMs under adversarial attack scenarios by leveraging knowledge graphs (KGs). Our framework generates original prompts from the triplets of knowledge graphs and creates adversarial prompts by poisoning, assessing the robustness of LLMs through the results of these adversarial attacks. We systematically evaluate the effectiveness of this framework and its modules. Experiments show that adversarial robustness of the ChatGPT family ranks as GPT-4-turbo > GPT-4o > GPT-3.5-turbo, and the robustness of large language models is influenced by the professional domains in which they operate.
摘要：现有的大型语言模型（LLM）鲁棒性评估框架过度依赖于特定的基准，不仅增加了成本，而且由于数据集的限制无法评估LLM在专业领域的性能。本文提出了一个利用知识图谱（KG）系统地评估对抗攻击场景下LLM鲁棒性的框架。我们的框架从知识图谱三元组生成原始提示，并通过投毒创建对抗提示，通过对抗攻击的结果来评估LLM的鲁棒性。我们系统地评估了该框架及其模块的有效性。实验表明，ChatGPT家族的对抗鲁棒性排序为GPT-4-turbo > GPT-4o > GPT-3.5-turbo，并且大型语言模型的鲁棒性受其所运行的专业领域影响。

Title: Post-hoc Utterance Refining Method by Entity Mining for Faithful Knowledge Grounded Conversations

Authors: Yoonna Jang, Suhyune Son, Jeongwoo Lee, Junyoung Son, Yuna Hur, Jungwoo Lim, Hyeonseok Moon, Kisu Yang, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10809
Pdf URL: https://arxiv.org/pdf/2406.10809
Copy Paste: [[2406.10809]] Post-hoc Utterance Refining Method by Entity Mining for Faithful Knowledge Grounded Conversations(https://arxiv.org/abs/2406.10809)
Keywords: hallucination
Abstract: Despite the striking advances in recent language generation performance, model-generated responses have suffered from the chronic problem of hallucinations that are either untrue or unfaithful to a given source. Especially in the task of knowledge grounded conversation, the models are required to generate informative responses, but hallucinated utterances lead to miscommunication. In particular, entity-level hallucination that causes critical misinformation and undesirable conversation is one of the major concerns. To address this issue, we propose a post-hoc refinement method called REM. It aims to enhance the quality and faithfulness of hallucinated utterances by refining them based on the source knowledge. If the generated utterance has a low source-faithfulness score with the given knowledge, REM mines the key entities in the knowledge and implicitly uses them for refining the utterances. We verify that our method reduces entity hallucination in the utterance. Also, we show the adaptability and efficacy of REM with extensive experiments and generative results. Our code is available at this https URL.
摘要：尽管最近语言生成性能取得了显著进步，但模型生成的响应一直受到幻觉的长期问题困扰，这些幻觉要么不真实，要么与给定的来源不符。特别是在基于知识的对话任务中，模型需要生成信息丰富的响应，但幻觉话语会导致沟通不畅。特别是，导致严重错误信息和不良对话的实体级幻觉是主要问题之一。为了解决这个问题，我们提出了一种事后改进方法，称为 REM。它旨在通过基于源知识对幻觉话语进行改进来提高其质量和忠实度。如果生成的话语与给定知识的源忠实度得分较低，REM 会挖掘知识中的关键实体并隐式地使用它们来改进话语。我们验证了我们的方法可以减少话语中的实体幻觉。此外，我们通过大量实验和生成结果展示了 REM 的适应性和有效性。我们的代码可在此 https URL 上找到。

Title: LLMFactor: Extracting Profitable Factors through Prompts for Explainable Stock Movement Prediction

Authors: Meiyun Wang, Kiyoshi Izumi, Hiroki Sakaji
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2406.10811
Pdf URL: https://arxiv.org/pdf/2406.10811
Copy Paste: [[2406.10811]] LLMFactor: Extracting Profitable Factors through Prompts for Explainable Stock Movement Prediction(https://arxiv.org/abs/2406.10811)
Keywords: language model, llm, prompt
Abstract: Recently, Large Language Models (LLMs) have attracted significant attention for their exceptional performance across a broad range of tasks, particularly in text analysis. However, the finance sector presents a distinct challenge due to its dependence on time-series data for complex forecasting tasks. In this study, we introduce a novel framework called LLMFactor, which employs Sequential Knowledge-Guided Prompting (SKGP) to identify factors that influence stock movements using LLMs. Unlike previous methods that relied on keyphrases or sentiment analysis, this approach focuses on extracting factors more directly related to stock market dynamics, providing clear explanations for complex temporal changes. Our framework directs the LLMs to create background knowledge through a fill-in-the-blank strategy and then discerns potential factors affecting stock prices from related news. Guided by background knowledge and identified factors, we leverage historical stock prices in textual format to predict stock movement. An extensive evaluation of the LLMFactor framework across four benchmark datasets from both the U.S. and Chinese stock markets demonstrates its superiority over existing state-of-the-art methods and its effectiveness in financial time-series forecasting.
摘要：最近，大型语言模型 (LLM) 因其在广泛任务中的出色表现而备受关注，尤其是在文本分析方面。然而，金融行业面临着独特的挑战，因为它依赖时间序列数据来完成复杂的预测任务。在本研究中，我们引入了一个名为 LLMFactor 的新框架，该框架采用顺序知识引导提示 (SKGP) 来使用 LLM 识别影响股票走势的因素。与以前依赖关键短语或情绪分析的方法不同，这种方法侧重于提取与股市动态更直接相关的因素，为复杂的时间变化提供清晰的解释。我们的框架指导 LLM 通过填空策略创建背景知识，然后从相关新闻中辨别影响股价的潜在因素。在背景知识和已识别因素的指导下，我们利用文本格式的历史股价来预测股价走势。对来自美国和中国股票市场的四个基准数据集的 LLMFactor 框架进行的广泛评估证明了其优于现有的最先进方法以及在金融时间序列预测中的有效性。

Title: Self-Evolution Fine-Tuning for Policy Optimization

Authors: Ruijun Chen, Jiehao Liang, Shiping Gao, Fanqi Wan, Xiaojun Quan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Self-Evolution Fine-Tuning for Policy Optimization(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: The alignment of large language models (LLMs) is crucial not only for unlocking their potential in specific tasks but also for ensuring that responses meet human expectations and adhere to safety and ethical principles. Current alignment methodologies face considerable challenges. For instance, supervised fine-tuning (SFT) requires extensive, high-quality annotated samples, while reinforcement learning from human feedback (RLHF) is complex and often unstable. In this paper, we introduce self-evolution fine-tuning (SEFT) for policy optimization, with the aim of eliminating the need for annotated samples while retaining the stability and efficiency of SFT. SEFT first trains an adaptive reviser to elevate low-quality responses while maintaining high-quality ones. The reviser then gradually guides the policy's optimization by fine-tuning it with enhanced responses. One of the prominent features of this method is its ability to leverage unlimited amounts of unannotated data for policy optimization through supervised fine-tuning. Our experiments on AlpacaEval 2.0 and MT-Bench demonstrate the effectiveness of SEFT. We also provide a comprehensive analysis of its advantages over existing alignment techniques.
摘要：大型语言模型 (LLM) 的对齐不仅对于释放其在特定任务中的潜力至关重要，而且对于确保响应符合人类期望并遵守安全和道德原则也至关重要。当前的对齐方法面临着巨大的挑战。例如，监督微调 (SFT) 需要大量高质量的注释样本，而从人类反馈中进行强化学习 (RLHF) 则很复杂且通常不稳定。在本文中，我们引入了自进化微调 (SEFT) 来进行策略优化，目的是消除对注释样本的需求，同时保持 SFT 的稳定性和效率。SEFT 首先训练一个自适应修订器来提升低质量响应，同时保持高质量响应。然后，修订器通过使用增强的响应对其进行微调来逐步指导策略的优化。该方法的一个突出特点是它能够通过监督微调利用无限量的未注释数据进行策略优化。我们在 AlpacaEval 2.0 和 MT-Bench 上的实验证明了 SEFT 的有效性。我们还对其相对于现有对齐技术的优势进行了全面的分析。

Title: A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery

Authors: Yu Zhang, Xiusi Chen, Bowen Jin, Sheng Wang, Shuiwang Ji, Wei Wang, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10833
Pdf URL: https://arxiv.org/pdf/2406.10833
Copy Paste: [[2406.10833]] A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery(https://arxiv.org/abs/2406.10833)
Keywords: language model, llm
Abstract: In many scientific fields, large language models (LLMs) have revolutionized the way text and other modalities of data (e.g., molecules and proteins) are handled, achieving superior performance in various applications and augmenting the scientific discovery process. Nevertheless, previous surveys on scientific LLMs often concentrate on one or two fields or a single modality. In this paper, we aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs regarding their architectures and pre-training techniques. To this end, we comprehensively survey over 250 scientific LLMs, discuss their commonalities and differences, and summarize pre-training datasets and evaluation tasks for each field and modality. Moreover, we investigate how LLMs have been deployed to benefit scientific discovery. Resources related to this survey are available at this https URL.
摘要：在许多科学领域，大型语言模型 (LLM) 彻底改变了文本和其他数据模态（例如分子和蛋白质）的处理方式，在各种应用中取得了卓越的性能并增强了科学发现过程。然而，以前对科学 LLM 的调查通常集中在一到两个领域或单一模态上。在本文中，我们旨在通过揭示科学 LLM 在架构和预训练技术方面的跨领域和跨模态联系，提供更全面的研究前景视图。为此，我们全面调查了 250 多个科学 LLM，讨论了它们的共同点和不同点，并总结了每个领域和模态的预训练数据集和评估任务。此外，我们还研究了如何部署 LLM 来促进科学发现。与此调查相关的资源可在此 https URL 上找到。

Title: Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning

Authors: Joykirat Singh, Akshay Nambi, Vibhav Vineet
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10834
Pdf URL: https://arxiv.org/pdf/2406.10834
Copy Paste: [[2406.10834]] Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning(https://arxiv.org/abs/2406.10834)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have been applied to Math Word Problems (MWPs) with transformative impacts, revolutionizing how these complex problems are approached and solved in various domains including educational settings. However, the evaluation of these models often prioritizes final accuracy, overlooking the crucial aspect of reasoning capabilities. This work addresses this gap by focusing on the ability of LLMs to detect and correct reasoning mistakes. We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Our comprehensive benchmarking reveals significant insights into the strengths and weaknesses of state-of-the-art models, such as GPT-4o, GPT-4, GPT-3.5Turbo, and others. We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models. Additionally, we identify issues related to data contamination and memorization, impacting the reliability of LLMs in real-world applications. Our findings emphasize the importance of rigorous evaluation of reasoning processes and propose future directions to enhance the generalization and robustness of LLMs in mathematical problem-solving.
摘要：大型语言模型 (LLM) 已应用于数学应用题 (MWP)，产生了变革性影响，彻底改变了包括教育环境在内的各个领域处理和解决这些复杂问题的方式。然而，对这些模型的评估通常优先考虑最终准确性，而忽略了推理能力这一关键方面。这项工作通过关注 LLM 检测和纠正推理错误的能力来解决这一差距。我们引入了一个新数据集 MWP-MISTAKE，其中包含通过基于规则的方法和较小的语言模型生成的正确和不正确的推理步骤的 MWP。我们的全面基准测试揭示了对 GPT-4o、GPT-4、GPT-3.5Turbo 等最先进模型的优势和劣势的重要见解。我们强调了 GPT-4o 在错误检测和纠正方面的卓越性能以及较小模型面临的持续挑战。此外，我们还发现了与数据污染和记忆相关的问题，这些问题影响了 LLM 在实际应用中的可靠性。我们的研究结果强调了严格评估推理过程的重要性，并提出了未来提高 LLM 在数学问题解决中的泛化和稳健性的方向。

Title: Large Language Models for Automatic Milestone Detection in Group Discussions

Authors: Zhuoxu Duan, Zhengye Yang, Samuel Westby, Christoph Riedl, Brooke Foucault Welles, Richard J. Radke
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2406.10842
Pdf URL: https://arxiv.org/pdf/2406.10842
Copy Paste: [[2406.10842]] Large Language Models for Automatic Milestone Detection in Group Discussions(https://arxiv.org/abs/2406.10842)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models like GPT have proven widely successful on natural language understanding tasks based on written text documents. In this paper, we investigate an LLM's performance on recordings of a group oral communication task in which utterances are often truncated or not well-formed. We propose a new group task experiment involving a puzzle with several milestones that can be achieved in any order. We investigate methods for processing transcripts to detect if, when, and by whom a milestone has been completed. We demonstrate that iteratively prompting GPT with transcription chunks outperforms semantic similarity search methods using text embeddings, and further discuss the quality and randomness of GPT responses under different context window sizes.
摘要：大型语言模型（如 GPT）在基于书面文本文档的自然语言理解任务中已取得广泛成功。在本文中，我们研究了 LLM 在群体口语交流任务录音中的表现，其中话语经常被截断或格式不正确。我们提出了一个新的群体任务实验，涉及一个具有多个里程碑的谜题，这些里程碑可以按任何顺序实现。我们研究了处理转录本的方法，以检测里程碑是否完成、何时完成以及由谁完成。我们证明，使用转录块迭代提示 GPT 优于使用文本嵌入的语义相似性搜索方法，并进一步讨论了不同上下文窗口大小下 GPT 响应的质量和随机性。

Title: Leading Whitespaces of Language Models' Subword Vocabulary Poses a Confound for Calculating Word Probabilities

Authors: Byung-Doh Oh, William Schuler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10851
Pdf URL: https://arxiv.org/pdf/2406.10851
Copy Paste: [[2406.10851]] Leading Whitespaces of Language Models' Subword Vocabulary Poses a Confound for Calculating Word Probabilities(https://arxiv.org/abs/2406.10851)
Keywords: language model
Abstract: Word-by-word conditional probabilities from Transformer-based language models are increasingly being used to evaluate their predictions over minimal pairs or to model the incremental processing difficulty of human readers. In this paper, we argue that there is a confound posed by the subword tokenization scheme of such language models, which has gone unaddressed thus far. This is due to the fact that tokens in the subword vocabulary of most language models have leading whitespaces and therefore do not naturally define stop probabilities of words. We first prove that this can result in word probabilities that sum to more than one, thereby violating the axiom that $\mathsf{P}(\Omega) = 1$. This property results in a misallocation of word-by-word surprisal, where the unacceptability of the current 'end of word' is incorrectly carried over to the next word. Additionally, this implicit prediction of word boundaries by language models is incongruous with psycholinguistic experiments where human subjects directly observe upcoming word boundaries. We present a simple decoding technique to reaccount the probability of the trailing whitespace into that of the current word, which resolves this confound. As a case study, we show that this results in significantly different estimates of garden-path effects in transitive/intransitive sentences, where a comma is strongly expected before the critical word.
摘要：基于 Transformer 的语言模型的逐字条件概率越来越多地被用于评估它们对最小对的预测或模拟人类读者的增量处理难度。在本文中，我们认为此类语言模型的子词标记化方案存在混淆，迄今为止尚未解决。这是因为大多数语言模型的子词词汇表中的标记都有前导空格，因此不能自然地定义单词的停止概率。我们首先证明这会导致单词概率总和大于 1，从而违反了 $\mathsf{P}(\Omega) = 1$ 的公理。此属性导致逐字意外的错误分配，其中当前“单词结尾”的不可接受性被错误地延续到下一个单词。此外，语言模型对单词边界的这种隐式预测与人类受试者直接观察即将到来的单词边界的心理语言学实验不一致。我们提出了一种简单的解码技术，将尾随空格的概率重新计算为当前单词的概率，从而解决了这一混淆问题。作为案例研究，我们表明，这会导致及物句/不及物句中花园小径效应的估计值存在显著差异，其中关键词前强烈期望有逗号。
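
The confound described in this abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the vocabulary, the probabilities in `TOY_LM`, and the helper names are all hypothetical, and the correction shown (multiply by the probability that the next token starts a new word, divide by the same quantity for the preceding context) is only one plausible reading of "reaccounting the trailing whitespace into the current word".

```python
# Hypothetical toy next-token distributions, keyed by the token context.
# Tokens with a leading space start a new word, as in most subword vocabularies.
TOY_LM = {
    (" the",): {" cat": 0.6, " dog": 0.4},
    (" the", " cat"): {"s": 0.5, " sat": 0.5},
    (" the", " cat", "s"): {" sat": 1.0},
    (" the", " dog"): {" sat": 1.0},
}


def next_probs(context):
    """Return the toy next-token distribution for a context (empty if unknown)."""
    return TOY_LM.get(tuple(context), {})


def boundary_prob(context):
    """Probability that the next token begins a new word (has a leading space)."""
    return sum(p for tok, p in next_probs(context).items() if tok.startswith(" "))


def naive_word_prob(context, word_tokens):
    """Product of the word's subword-token probabilities, with no stop probability."""
    prob, ctx = 1.0, list(context)
    for tok in word_tokens:
        prob *= next_probs(ctx).get(tok, 0.0)
        ctx.append(tok)
    return prob


def corrected_word_prob(context, word_tokens):
    """Move the trailing-whitespace probability into the current word (assumed form)."""
    start = boundary_prob(context)
    end = boundary_prob(list(context) + list(word_tokens))
    return naive_word_prob(context, word_tokens) * end / start


# Candidate next words after " the", each given as its subword tokens.
words = {"cat": [" cat"], "cats": [" cat", "s"], "dog": [" dog"]}
naive_total = sum(naive_word_prob([" the"], toks) for toks in words.values())
fixed_total = sum(corrected_word_prob([" the"], toks) for toks in words.values())
print(round(naive_total, 3), round(fixed_total, 3))
```

In this toy setting the naive word probabilities add up to about 1.3, because the mass of " cat" is counted once for "cat" and again for "cats", while the corrected values add up to about 1.0.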

Title: Step-level Value Preference Optimization for Mathematical Reasoning

Authors: Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10858
Pdf URL: https://arxiv.org/pdf/2406.10858
Copy Paste: [[2406.10858]] Step-level Value Preference Optimization for Mathematical Reasoning(https://arxiv.org/abs/2406.10858)
Keywords: language model, llm
Abstract: Direct Preference Optimization (DPO) using an implicit reward model has proven to be an effective alternative to reinforcement learning from human feedback (RLHF) for fine-tuning preference aligned large language models (LLMs). However, the overall preference annotations of responses do not fully capture the fine-grained quality of model outputs in complex multi-step reasoning tasks, such as mathematical reasoning. To address this limitation, we introduce a novel algorithm called Step-level Value Preference Optimization (SVPO). Our approach employs Monte Carlo Tree Search (MCTS) to automatically annotate step-level preferences for multi-step reasoning. Furthermore, from the perspective of learning-to-rank, we train an explicit value model to replicate the behavior of the implicit reward model, complementing standard preference optimization. This value model enables the LLM to generate higher reward responses with minimal cost during inference. Experimental results demonstrate that our method achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks.
摘要：使用隐式奖励模型的直接偏好优化 (DPO) 已被证明是强化学习从人类反馈 (RLHF) 的有效替代方案，用于微调偏好对齐的大型语言模型 (LLM)。然而，响应的整体偏好注释并不能完全捕捉复杂的多步骤推理任务（例如数学推理）中模型输出的细粒度质量。为了解决这一限制，我们引入了一种称为步骤级价值偏好优化 (SVPO) 的新算法。我们的方法采用蒙特卡洛树搜索 (MCTS) 来自动注释多步骤推理的步骤级偏好。此外，从学习排序的角度来看，我们训练了一个显式价值模型来复制隐式奖励模型的行为，补充了标准偏好优化。该价值模型使 LLM 能够在推理过程中以最小的成本生成更高的奖励响应。实验结果表明，我们的方法在域内和域外数学推理基准上都实现了最先进的性能。

Title: Analyzing Key Neurons in Large Language Models

Authors: Lihu Chen, Adam Dejl, Francesca Toni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10868
Pdf URL: https://arxiv.org/pdf/2406.10868
Copy Paste: [[2406.10868]] Analyzing Key Neurons in Large Language Models(https://arxiv.org/abs/2406.10868)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) possess vast amounts of knowledge within their parameters, prompting research into methods for locating and editing this knowledge. Previous investigations have primarily focused on fill-in-the-blank tasks and locating entity-related (usually single-token facts) information in relatively small-scale language models. However, several key questions remain unanswered: (1) How can we effectively locate query-relevant neurons in contemporary autoregressive LLMs, such as LLaMA and Mistral? (2) How can we address the challenge of long-form text generation? (3) Are there localized knowledge regions in LLMs? In this study, we introduce Neuron Attribution-Inverse Cluster Attribution (NA-ICA), a novel architecture-agnostic framework capable of identifying key neurons in LLMs. NA-ICA allows for the examination of long-form answers beyond single tokens by employing the proxy task of multi-choice question answering. To evaluate the effectiveness of our detected key neurons, we construct two multi-choice QA datasets spanning diverse domains and languages. Empirical evaluations demonstrate that NA-ICA outperforms baseline methods significantly. Moreover, analysis of neuron distributions reveals the presence of visible localized regions, particularly within different domains. Finally, we demonstrate the potential applications of our detected key neurons in knowledge editing and neuron-based prediction.
摘要：大型语言模型 (LLM) 在其参数范围内拥有大量知识，这促使人们研究定位和编辑这些知识的方法。先前的研究主要集中在填空任务和在相对小规模的语言模型中定位与实体相关的（通常是单个标记事实）信息。然而，仍有几个关键问题尚未得到解答：（1）我们如何在当代自回归 LLM（如 LLaMA 和 Mistral）中有效地定位查询相关神经元？（2）我们如何应对长文本生成的挑战？（3）LLM 中是否存在局部知识区域？在本研究中，我们引入了神经元归因-逆聚类归因 (NA-ICA)，这是一种新颖的架构无关框架，能够识别 LLM 中的关键神经元。NA-ICA 允许通过使用多项选择问答的代理任务来检查单个标记以外的长格式答案。为了评估检测到的关键神经元的有效性，我们构建了两个涵盖不同领域和语言的多项选择 QA 数据集。实证评估表明，NA-ICA 的表现明显优于基线方法。此外，对神经元分布的分析揭示了可见局部区域的存在，尤其是在不同域内。最后，我们展示了检测到的关键神经元在知识编辑和基于神经元的预测中的潜在应用。

Title: COOL: Comprehensive Knowledge Enhanced Prompt Learning for Domain Adaptive Few-shot Fake News Detection

Authors: Yi Ouyang, Peng Wu, Li Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10870
Pdf URL: https://arxiv.org/pdf/2406.10870
Copy Paste: [[2406.10870]] COOL: Comprehensive Knowledge Enhanced Prompt Learning for Domain Adaptive Few-shot Fake News Detection(https://arxiv.org/abs/2406.10870)
Keywords: language model, prompt
Abstract: Most Fake News Detection (FND) methods struggle with data scarcity in emerging news domains. Recently, prompt learning based on Pre-trained Language Models (PLMs) has emerged as a promising approach in domain adaptive few-shot learning, since it greatly reduces the need for labeled data by bridging the gap between pre-training and downstream tasks. Furthermore, external knowledge is also helpful in verifying emerging news, as emerging news often involves timely knowledge that may not be contained in the PLM's outdated prior knowledge. To this end, we propose COOL, a Comprehensive knOwledge enhanced prOmpt Learning method for domain adaptive few-shot FND. Specifically, we propose a comprehensive knowledge extraction module to extract both structured and unstructured knowledge that is positively or negatively correlated with news from external sources, and adopt an adversarial contrastive enhanced hybrid prompt learning strategy to model the domain-invariant news-knowledge interaction pattern for FND. Experimental results demonstrate the superiority of COOL over various state-of-the-art methods.
摘要：大多数假新闻检测 (FND) 方法经常面临新兴新闻领域数据稀缺的问题。最近，基于预训练语言模型 (PLM) 的即时学习已成为领域自适应小样本学习中一种很有前途的方法，因为它通过弥合预训练和下游任务之间的差距，大大减少了对标记数据的需求。此外，外部知识也有助于验证新兴新闻，因为新兴新闻通常涉及 PLM 过时的先验知识中可能不包含的及时知识。为此，我们提出了 COOL，一种用于领域自适应小样本 FND 的综合知识增强即时学习方法。具体来说，我们提出了一个综合知识提取模块，用于从外部来源提取与新闻呈正或负相关的结构化和非结构化知识，并采用对抗对比增强混合即时学习策略来为 FND 建模领域不变的新闻知识交互模式。实验结果表明 COOL 优于各种最先进的技术。

Title: Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Authors: Minghan Wang, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10880
Pdf URL: https://arxiv.org/pdf/2406.10880
Copy Paste: [[2406.10880]] Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR(https://arxiv.org/abs/2406.10880)
Keywords: language model, gpt, llm, prompt
Abstract: Recent advancements in multimodal large language models (MLLMs) have made significant progress in integrating information across various modalities, yet real-world applications in educational and scientific domains remain challenging. This paper introduces the Multimodal Scientific ASR (MS-ASR) task, which focuses on transcribing scientific conference videos by leveraging visual information from slides to enhance the accuracy of technical terminologies. Because traditional metrics like WER fall short in assessing performance accurately, we propose severity-aware WER (SWER), which considers the content type and severity of ASR errors. We propose the Scientific Vision Augmented ASR (SciVASR) framework as a baseline method, enabling MLLMs to improve transcript quality through post-editing. Evaluations of state-of-the-art MLLMs, including GPT-4o, show a 45% improvement over speech-only baselines, highlighting the importance of multimodal information integration.
摘要：多模态大型语言模型 (MLLM) 的最新进展在整合各种模态信息方面取得了重大进展，但在教育和科学领域的实际应用仍然具有挑战性。本文介绍了多模态科学 ASR (MS-ASR) 任务，该任务侧重于通过利用幻灯片中的视觉信息来转录科学会议视频，以提高技术术语的准确性。意识到像 WER 这样的传统指标无法准确评估性能，这促使人们提出了考虑 ASR 错误的内容类型和严重程度的严重性感知 WER (SWER)。我们提出了科学视觉增强 ASR (SciVASR) 框架作为基线方法，使 MLLM 能够通过后期编辑来提高成绩单质量。对包括 GPT-4o 在内的最先进 MLLM 的评估表明，与仅限语音的基线相比，成绩提高了 45%，凸显了多模态信息整合的重要性。
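
For context on why plain WER may fall short here, the following minimal sketch computes the standard word error rate (word-level edit distance divided by reference length). It does not implement SWER, whose definition is not given in the abstract; the function name and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "we fine tune the transformer encoder"
# One error on a function word versus one error on a technical term.
minor = word_error_rate(reference, "we fine tune a transformer encoder")
severe = word_error_rate(reference, "we fine tune the transistor encoder")
print(minor == severe)
```

Both hypotheses contain a single substitution, so plain WER scores them identically even though only the second one corrupts a technical term, which is the kind of distinction a severity-aware metric is meant to capture.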

Title: Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals

Authors: Lida Chen, Zujie Liang, Xintao Wang, Jiaqing Liang, Yanghua Xiao, Feng Wei, Jinglei Chen, Zhenghong Hao, Bing Han, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10881
Pdf URL: https://arxiv.org/pdf/2406.10881
Copy Paste: [[2406.10881]] Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals(https://arxiv.org/abs/2406.10881)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have achieved great success, but their occasional content fabrication, or hallucination, limits their practical application. Hallucination arises because LLMs struggle to admit ignorance due to inadequate training on knowledge boundaries. We call it a limitation of LLMs that they cannot accurately express their knowledge boundary, answering questions they know while admitting ignorance to questions they do not know. In this paper, we aim to teach LLMs to recognize and express their knowledge boundary, so they can reduce hallucinations caused by fabricating when they do not know. We propose CoKE, which first probes LLMs' knowledge boundary via internal confidence given a set of questions, and then leverages the probing results to elicit the expression of the knowledge boundary. Extensive experiments show CoKE helps LLMs express knowledge boundaries, answering known questions while declining unknown ones, significantly improving in-domain and out-of-domain performance.
摘要：大型语言模型 (LLM) 取得了巨大的成功，但它们偶尔会出现内容虚构或幻觉，限制了它们的实际应用。幻觉产生的原因是，由于对知识边界的训练不足，LLM 难以承认无知。我们称这是 LLM 的局限性，它们无法准确表达其知识边界，回答它们知道的问题，而承认它们不知道的问题。在本文中，我们的目标是教会 LLM 识别和表达其知识边界，以便它们可以减少因不知道时虚构而导致的幻觉。我们提出了 CoKE，它首先通过给定一组问题的内部置信度来探测 LLM 的知识边界，然后利用探测结果来引出知识边界的表达。大量实验表明，CoKE 有助于 LLM 表达知识边界，回答已知问题而拒绝未知问题，显著提高领域内和领域外的性能。

Title: SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking

Authors: Zhuang Li, Yuncheng Hua, Thuy-Trang Vu, Haolan Zhan, Lizhen Qu, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10882
Pdf URL: https://arxiv.org/pdf/2406.10882
Copy Paste: [[2406.10882]] SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking(https://arxiv.org/abs/2406.10882)
Keywords: language model, llm
Abstract: Recent studies have shown that maintaining a consistent response style by human experts and enhancing data quality in training sets can significantly improve the performance of fine-tuned Large Language Models (LLMs) while reducing the number of training examples needed. However, the precise definition of style and the relationship between style, data quality, and LLM performance remain unclear. This research decomposes response style into presentation and composition styles and finds that, among training data of similar quality, those with higher style consistency lead to better LLM performance. Inspired by this, we introduce Style Consistency-Aware Response Ranking (SCAR), which automatically prioritizes instruction-response pairs in the training set based on their response stylistic consistency. By selecting the most style-consistent examples, ranging from the top 25% to 0.7% of the full dataset, the fine-tuned LLMs can match or even surpass the performance of models trained on the entire dataset in coding and open-ended question-answering benchmarks. Code and data are available at this https URL.
摘要：最近的研究表明，保持人类专家的一致回答风格并提高训练集中的数据质量可以显著提高经过微调的大型语言模型 (LLM) 的性能，同时减少所需的训练示例数量。然而，风格的确切定义以及风格、数据质量和 LLM 性能之间的关系仍不清楚。这项研究将回答风格分解为呈现和撰写风格，并发现在质量相似的训练数据中，风格一致性更高的数据可带来更好的 LLM 性能。受此启发，我们引入了风格一致性感知回答排名 (SCAR)，它根据回答风格一致性自动对训练集中的指令-回答对进行优先排序。通过选择风格最一致的示例（范围从整个数据集的前 25% 到 0.7%），经过微调的 LLM 可以在编码和开放式问答基准测试中匹敌甚至超越在整个数据集上训练的模型的性能。代码和数据可在此 https URL 上获得。

Title: Distilling Opinions at Scale: Incremental Opinion Summarization using XL-OPSUMM

Authors: Sri Raghava Muddu, Rupasai Rangaraju, Tejpalsingh Siledar, Swaroop Nath, Pushpak Bhattacharyya, Swaprava Nath, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Sudhanshu Shekhar Singh, Nikesh Garera
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10886
Pdf URL: https://arxiv.org/pdf/2406.10886
Copy Paste: [[2406.10886]] Distilling Opinions at Scale: Incremental Opinion Summarization using XL-OPSUMM(https://arxiv.org/abs/2406.10886)
Keywords: language model, gpt, llm
Abstract: Opinion summarization in e-commerce encapsulates the collective views of numerous users about a product based on their reviews. Typically, a product on an e-commerce platform has thousands of reviews, each review comprising around 10-15 words. While Large Language Models (LLMs) have shown proficiency in summarization tasks, they struggle to handle such a large volume of reviews due to context limitations. To mitigate this, we propose a scalable framework called Xl-OpSumm that generates summaries incrementally. However, the existing test set, AMASUM, has only 560 reviews per product on average. Due to the lack of a test set with thousands of reviews, we created a new test set called Xl-Flipkart by gathering data from the Flipkart website and generating summaries using GPT-4. Through various automatic evaluations and extensive analysis, we evaluated the framework's efficiency on two datasets, AMASUM and Xl-Flipkart. Experimental results show that our framework, Xl-OpSumm powered by Llama-3-8B-8k, achieves an average ROUGE-1 F1 gain of 4.38% and a ROUGE-L F1 gain of 3.70% over the next best-performing model.
摘要：电子商务中的观点摘要概括了众多用户基于评论对产品的看法。通常，电子商务平台上的产品有数千条评论，每条评论约包含 10-15 个单词。虽然大型语言模型 (LLM) 在摘要任务中表现出色，但由于上下文限制，它们难以处理如此大量的评论。为了缓解这一问题，我们提出了一个名为 Xl-OpSumm 的可扩展框架，该框架可以逐步生成摘要。但是，现有测试集 AMASUM 平均每个产品只有 560 条评论。由于缺少包含数千条评论的测试集，我们通过从 Flipkart 网站收集数据并使用 GPT-4 生成摘要，创建了一个名为 Xl-Flipkart 的新测试集。通过各种自动评估和广泛分析，我们在两个数据集 AMASUM 和 Xl-Flipkart 上评估了该框架的效率。实验结果表明，我们的框架 Xl-OpSumm（由 Llama-3-8B-8k 提供支持）比下一个最佳性能模型实现了平均 4.38% 的 ROUGE-1 F1 增益和 3.70% 的 ROUGE-L F1 增益。

Title: RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models

Authors: Zhuoran Jin, Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.10890
Pdf URL: https://arxiv.org/pdf/2406.10890
Copy Paste: [[2406.10890]] RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models(https://arxiv.org/abs/2406.10890)
Keywords: language model, llm
Abstract: Large language models (LLMs) inevitably memorize sensitive, copyrighted, and harmful knowledge from the training corpus; therefore, it is crucial to erase this knowledge from the models. Machine unlearning is a promising solution for efficiently removing specific knowledge by post hoc modifying models. In this paper, we propose a Real-World Knowledge Unlearning benchmark (RWKU) for LLM unlearning. RWKU is designed based on the following three key factors: (1) For the task setting, we consider a more practical and challenging unlearning setting, where neither the forget corpus nor the retain corpus is accessible. (2) For the knowledge source, we choose 200 real-world famous people as the unlearning targets and show that such popular knowledge is widely present in various LLMs. (3) For the evaluation framework, we design the forget set and the retain set to evaluate the model's capabilities across various real-world applications. Regarding the forget set, we provide four membership inference attack (MIA) methods and nine kinds of adversarial attack probes to rigorously test unlearning efficacy. Regarding the retain set, we assess locality and utility in terms of neighbor perturbation, general ability, reasoning ability, truthfulness, factuality, and fluency. We conduct extensive experiments across two unlearning scenarios, two models and six baseline methods and obtain some meaningful findings. We release our benchmark and code publicly at this http URL for future work.
摘要：大型语言模型 (LLM) 不可避免地会从训练语料库中记住敏感、受版权保护和有害的知识；因此，从模型中删除这些知识至关重要。机器去学习是一种通过事后修改模型来有效删除特定知识的有前途的解决方案。在本文中，我们为 LLM 去学习提出了一个真实世界知识去学习基准 (RWKU)。RWKU 基于以下三个关键因素设计：（1）对于任务设置，我们考虑一个更实际、更具挑战性的去学习设置，其中忘记语料库和保留语料库都不可访问。（2）对于知识源，我们选择 200 位现实世界名人作为去学习目标，并表明此类流行知识广泛存在于各种 LLM 中。（3）对于评估框架，我们设计了忘记集和保留集来评估模型在各种真实应用中的能力。关于忘记集，我们提供了四种成员推理攻击 (MIA) 方法和九种对抗性攻击探测来严格测试去学习效果。对于保留集，我们从邻居扰动、一般能力、推理能力、真实性、事实性和流畅性等方面评估局部性和效用。我们对两种去学习场景、两种模型和六种基线方法进行了广泛的实验，并获得了一些有意义的发现。我们在此 http URL 上公开发布了我们的基准和代码，以供将来使用。

Title: MICL: Improving In-Context Learning through Multiple-Label Words in Demonstration

Authors: Zhu Zixiao, Feng Zijian, Zhou Hanzhang, Qian Junlang, Mao Kezhi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10908
Pdf URL: https://arxiv.org/pdf/2406.10908
Copy Paste: [[2406.10908]] MICL: Improving In-Context Learning through Multiple-Label Words in Demonstration(https://arxiv.org/abs/2406.10908)
Keywords: language model, llm
Abstract: In-context learning (ICL) enables large language models (LLMs) to perform new tasks by using sample-label pairs as demonstrations. However, variations in demonstrations can lead to significantly different performances. Current research mainly focuses on selecting demonstration samples, preassuming the class name to be the label word when creating sample-label pairs. However, the choice of label words is crucial for ICL performance. In addition, we observe that using a single class name in demonstration may not yield optimal results. In this paper, we propose to use multiple label words in one sample-label pair to enhance ICL performance. Further, we select and order sample-label pairs based on LLM's output distribution, aiming to optimize the demonstration examples from both the samples' and labels' perspectives. Evaluation results on seven classification datasets show that the use of multiple label words, strategically organized by their selection, order and quantity, improves ICL performance through diverse label information.
摘要：上下文学习 (ICL) 使大型语言模型 (LLM) 能够通过使用样本-标签对作为演示来执行新任务。然而，演示的变化会导致性能的显著差异。当前的研究主要集中于选择演示样本，在创建样本-标签对时假设类别名称为标签词。然而，标签词的选择对 ICL 性能至关重要。此外，我们观察到在演示中使用单个类别名称可能不会产生最佳结果。在本文中，我们建议在一个样本-标签对中使用多个标签词来增强 ICL 性能。此外，我们根据 LLM 的输出分布选择和排序样本-标签对，旨在从样本和标签的角度优化演示示例。在七个分类数据集上的评估结果表明，使用多个标签词并按其选择、顺序和数量进行策略性组织可以通过多样化的标签信息提高 ICL 性能。

Title: Generating Tables from the Parametric Knowledge of Language Models

Authors: Yevgeni Berkovitch, Oren Glickman, Amit Somech, Tomer Wolfson
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2406.10922
Pdf URL: https://arxiv.org/pdf/2406.10922
Copy Paste: [[2406.10922]] Generating Tables from the Parametric Knowledge of Language Models(https://arxiv.org/abs/2406.10922)
Keywords: language model, gpt, llm, prompt
Abstract: We explore generating factual and accurate tables from the parametric knowledge of large language models (LLMs). While LLMs have demonstrated impressive capabilities in recreating knowledge bases and generating free-form text, we focus on generating structured tabular data, which is crucial in domains like finance and healthcare. We examine the table generation abilities of four state-of-the-art LLMs: GPT-3.5, GPT-4, Llama2-13B, and Llama2-70B, using three prompting methods for table generation: (a) full-table, (b) row-by-row, and (c) cell-by-cell. For evaluation, we introduce a novel benchmark, WikiTabGen, which contains 100 curated Wikipedia tables. Tables are further processed to ensure their factual correctness and manually annotated with short natural language descriptions. Our findings reveal that table generation remains a challenge, with GPT-4 reaching the highest accuracy at 19.6%. Our detailed analysis sheds light on how various table properties, such as size, table popularity, and numerical content, influence generation performance. This work highlights the unique challenges in LLM-based table generation and provides a solid evaluation framework for future research. Our code, prompts and data are all publicly available: this https URL
摘要：我们探索从大型语言模型 (LLM) 的参数知识中生成事实和准确的表格。虽然 LLM 在重建知识库和生成自由格式文本方面表现出了令人印象深刻的能力，但我们专注于生成结构化表格数据，这在金融和医疗保健等领域至关重要。我们检查了四个最先进的 LLM 的表格生成能力：GPT-3.5、GPT-4、Llama2-13B 和 Llama2-70B，使用三种提示表格生成方法：(a) 全表，(b) 逐行；(c) 逐单元格。为了进行评估，我们引入了一个新颖的基准 WikiTabGen，其中包含 100 个精选的维基百科表格。表格经过进一步处理以确保其事实正确性，并用简短的自然语言描述手动注释。我们的研究结果表明，表格生成仍然是一个挑战，GPT-4 的最高准确率达到 19.6%。我们的详细分析揭示了各种表格属性（例如大小、表格流行度和数字内容）如何影响生成性能。这项工作突出了基于 LLM 的表格生成中的独特挑战，并为未来的研究提供了可靠的评估框架。我们的代码、提示和数据都是公开的：此 https URL

Title: E-Bench: Towards Evaluating the Ease-of-Use of Large Language Models

Authors: Zhenyu Zhang, Bingguang Hao, Jinpeng Li, Zekai Zhang, Dongyan Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] E-Bench: Towards Evaluating the Ease-of-Use of Large Language Models(https://arxiv.org/abs/)
Keywords: language model, llm, prompt
Abstract: Most large language models (LLMs) are sensitive to prompts, and a synonymous expression or a typo may lead to unexpected results from the model. Composing an optimal prompt for a specific demand lacks theoretical support and relies entirely on human experimentation, which poses a considerable obstacle to popularizing generative artificial intelligence. However, there is no systematic analysis of the stability of LLMs in resisting prompt perturbations in real-world scenarios. In this work, we propose to evaluate the ease-of-use of LLMs and construct E-Bench, simulating the actual situation of human use from synonymous perturbation (including paraphrasing, simplification, and colloquialism) and typographical perturbation (such as typing). On this basis, we also discuss the combination of these two types of perturbation and analyze the main reasons for performance degradation. Experimental results indicate that with the increase of model size, although the ease-of-use is significantly improved, there is still a long way to go to build a sufficiently user-friendly model.
摘要：大部分大型语言模型（LLM）对提示非常敏感，一个同义词表达或者一个拼写错误都可能导致模型无法预料的结果。针对特定需求编写最优提示缺乏理论支撑，完全依赖人工实验，这对生成式人工智能的普及造成了相当大的阻碍。然而，目前还没有系统分析LLM在现实场景中抵抗提示扰动的稳定性。本文提出对LLM的易用性进行评估，构建E-Bench，从同义词扰动（包括释义、简化、口语化）和印刷扰动（如打字）两个方面模拟人类使用的实际情况。在此基础上，我们还讨论了这两类扰动的组合，并分析了性能下降的主要原因。实验结果表明，随着模型规模的增加，虽然易用性有明显提升，但要构建足够用户友好的模型还有很长的路要走。

Title: Avoiding Copyright Infringement via Machine Unlearning

Authors: Guangyao Dou, Zheyuan Liu, Qing Lyu, Kaize Ding, Eric Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10952
Pdf URL: https://arxiv.org/pdf/2406.10952
Copy Paste: [[2406.10952]] Avoiding Copyright Infringement via Machine Unlearning(https://arxiv.org/abs/2406.10952)
Keywords: language model, llm
Abstract: Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities but also pose risks by learning and generating copyrighted material, leading to significant legal and ethical concerns. To address these issues, it is critical for model owners to be able to unlearn copyrighted content at various time steps. We explore the setting of sequential unlearning, where copyrighted content is removed over multiple time steps - a scenario that has not been rigorously addressed. To tackle this challenge, we propose Stable Sequential Unlearning (SSU), a novel unlearning framework for LLMs, designed to have a more stable process to remove copyrighted content from LLMs throughout different time steps using task vectors, by incorporating additional random labeling loss and applying gradient-based weight saliency mapping. Experiments demonstrate that SSU finds a good balance between unlearning efficacy and maintaining the model's general knowledge compared to existing baselines.
摘要：预训练的大型语言模型 (LLM) 已展示出卓越的能力，但通过学习和生成受版权保护的材料也带来了风险，从而导致了重大的法律和道德问题。为了解决这些问题，模型所有者必须能够以不同的时间步骤来忘记受版权保护的内容。我们探索了顺序忘记学习的设置，其中版权内容会在多个时间步骤中删除 - 这种场景尚未得到严格处理。为了应对这一挑战，我们提出了稳定顺序忘记学习 (SSU)，这是一种用于 LLM 的新型忘记学习框架，旨在通过结合额外的随机标记损失和应用基于梯度的权重显着性映射，使用任务向量在不同时间步骤中更稳定地从 LLM 中删除版权内容。实验表明，与现有基线相比，SSU 在忘记学习功效和保持模型的一般知识之间找到了良好的平衡。

Title: Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence

Authors: Junru Lu, Jiazheng Li, Siyu An, Meng Zhao, Yulan He, Di Yin, Xing Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10957
Pdf URL: https://arxiv.org/pdf/2406.10957
Copy Paste: [[2406.10957]] Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence(https://arxiv.org/abs/2406.10957)
Keywords: language model, llm
Abstract: Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models (LLMs) with human preferences, offering a more straightforward alternative to the complex Reinforcement Learning from Human Feedback (RLHF). Despite its promising efficacy, DPO faces a notable drawback: "verbosity", a common over-optimization phenomenon also observed in RLHF. While previous studies mainly attributed verbosity to biased labels within the data, we propose that the issue also stems from an inherent algorithmic length reliance in DPO. Specifically, we suggest that the discrepancy between sequence-level Kullback-Leibler (KL) divergences between chosen and rejected sequences, used in DPO, results in overestimated or underestimated rewards due to varying token lengths. Empirically, we utilize datasets with different label lengths to demonstrate the presence of biased rewards. We then introduce an effective downsampling approach, named SamPO, to eliminate potential length reliance. Our experimental evaluations, conducted across three LLMs of varying scales and a diverse array of conditional and open-ended benchmarks, highlight the efficacy of SamPO in mitigating verbosity, achieving improvements of 5% to 12% over DPO through debiased rewards. Our codes can be accessed at: this https URL.

Title: DocNet: Semantic Structure in Inductive Bias Detection Models

Authors: Jessica Zhu, Iain Cruickshank, Michel Cukier
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2406.10965
Pdf URL: https://arxiv.org/pdf/2406.10965
Copy Paste: [[2406.10965]] DocNet: Semantic Structure in Inductive Bias Detection Models(https://arxiv.org/abs/2406.10965)
Keywords: language model
Abstract: News will have biases so long as people have opinions. However, as social media becomes the primary entry point for news and partisan gaps increase, it is increasingly important for informed citizens to be able to identify bias. People will be able to take action to avoid polarizing echo chambers if they know how the news they are consuming is biased. In this paper, we explore an often overlooked aspect of bias detection in documents: the semantic structure of news articles. We present DocNet, a novel, inductive, and low-resource document embedding and bias detection model that outperforms large language models. We also demonstrate that the semantic structures of news articles from opposing partisan sides, as represented in document-level graph embeddings, have significant similarities. These results can be used to advance bias detection in low-resource environments. Our code and data are made available at this https URL.

Title: Toward Optimal LLM Alignments Using Two-Player Games

Authors: Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang, Yuanshun Yao, Xiaojun Xu, Zhaoran Wang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Hang Li, Yang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10977
Pdf URL: https://arxiv.org/pdf/2406.10977
Copy Paste: [[2406.10977]] Toward Optimal LLM Alignments Using Two-Player Games(https://arxiv.org/abs/2406.10977)
Keywords: language model, llm, prompt, agent
Abstract: The standard Reinforcement Learning from Human Feedback (RLHF) framework primarily focuses on optimizing the performance of large language models using pre-collected prompts. However, collecting prompts that provide comprehensive coverage is both tedious and challenging, and often fails to include scenarios that LLMs need to improve on the most. In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent. The adversarial agent's task at each step is to generate prompts that expose the weakness of the defensive agent. In return, the defensive agent seeks to improve its responses to these newly identified prompts it struggled with, based on feedback from the reward model. We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents. Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.

Title: Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens

Authors: Weiyao Luo, Suncong Zheng, Heming Xia, Weikang Wang, Yan Lei, Tianyu Liu, Shuang Chen, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10985
Pdf URL: https://arxiv.org/pdf/2406.10985
Copy Paste: [[2406.10985]] Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens(https://arxiv.org/abs/2406.10985)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown promising efficacy across various tasks, becoming powerful tools in numerous aspects of human life. However, Transformer-based LLMs suffer a performance degradation when modeling long-term contexts because they discard some information to reduce computational overhead. In this work, we propose a simple yet effective method to enable LLMs to take a deep breath, encouraging them to summarize information contained within discrete text chunks. Specifically, we segment the text into multiple chunks and insert a special sentinel token at the end of each chunk. We then modify the attention mask to integrate the chunk's information into the corresponding sentinel token. This enables LLMs to interpret information not only from historical individual tokens but also from the sentinel token, which aggregates the chunk's semantic information. Experiments on language modeling and out-of-domain downstream tasks validate the superiority of our approach.
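
The chunk-and-mask idea in this abstract can be sketched as follows. The abstract does not specify the exact mask, so this sketch assumes one plausible rule: ordinary tokens attend causally to everything before them, while each sentinel attends only to its own chunk. The sentinel string `<S>` and both function names are hypothetical.

```python
def insert_sentinels(tokens, chunk_size, sentinel="<S>"):
    """Append a sentinel token after every full chunk of `chunk_size` tokens."""
    out = []
    for i, tok in enumerate(tokens, start=1):
        out.append(tok)
        if i % chunk_size == 0:
            out.append(sentinel)
    return out


def sentinel_attention_mask(seq, sentinel="<S>"):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed rule (not from the paper): regular tokens use a causal mask;
    a sentinel sees only the tokens of its own chunk plus itself.
    """
    n = len(seq)
    mask = [[False] * n for _ in range(n)]
    chunk_start = 0
    for i, tok in enumerate(seq):
        if tok == sentinel:
            for j in range(chunk_start, i + 1):
                mask[i][j] = True
            chunk_start = i + 1  # the next chunk begins after this sentinel
        else:
            for j in range(i + 1):
                mask[i][j] = True
    return mask


seq = insert_sentinels(["a", "b", "c", "d"], chunk_size=2)
mask = sentinel_attention_mask(seq)
for tok, row in zip(seq, mask):
    print(tok, ["x" if allowed else "." for allowed in row])
```

Later tokens can still attend to earlier sentinels under the causal rule, which is how a chunk summary would reach the rest of the sequence in this sketch.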

Title: Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers

Authors: Tianhua Zhang, Kun Li, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10991
Pdf URL: https://arxiv.org/pdf/2406.10991
Copy Paste: [[2406.10991]] Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers(https://arxiv.org/abs/2406.10991)
Keywords: language model
Abstract: Query rewriting is a crucial technique for passage retrieval in open-domain conversational question answering (CQA). It decontextualizes conversational queries into self-contained questions suitable for off-the-shelf retrievers. Existing methods attempt to incorporate the retriever's preference during the training of rewriting models. However, these approaches typically rely on extensive annotations such as in-domain rewrites and/or relevant passage labels, limiting the models' generalization and adaptation capabilities. In this paper, we introduce AdaQR (Adaptive Query Rewriting), a framework for training query rewriting models with limited rewrite annotations from seed datasets and no passage labels at all. Our approach begins by fine-tuning compact large language models using only ~10% of the rewrite annotations from the seed dataset training split. The models are then utilized to generate rewrite candidates for each query instance. A novel approach is then proposed to assess the retriever's preference for these candidates by the probability of answers conditioned on the conversational query, marginalized over the Top-K passages. This serves as the reward for optimizing the rewriter further using Direct Preference Optimization (DPO), a process free of rewrite and retrieval annotations. Experimental results on four open-domain CQA datasets demonstrate that AdaQR not only enhances the in-domain capabilities of the rewriter with limited annotation requirements, but also adapts effectively to out-of-domain datasets.
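
The marginalization step in this abstract can be sketched as below. The sketch assumes each rewrite candidate comes with retrieval scores for its Top-K passages and with an answer probability per passage, and that passage weights are obtained by a softmax over the retrieval scores. These inputs, the softmax weighting, and all names are assumptions for illustration, not details confirmed by the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_answer_prob(retrieval_scores, answer_probs):
    """Sum over Top-K passages of P(passage | rewrite) * P(answer | query, passage)."""
    weights = softmax(retrieval_scores)
    return sum(w * p for w, p in zip(weights, answer_probs))


def preference_pair(candidates):
    """Return (preferred, dispreferred) rewrites by marginal answer probability."""
    ranked = sorted(
        candidates,
        key=lambda c: marginal_answer_prob(c["scores"], c["answer_probs"]),
    )
    return ranked[-1]["rewrite"], ranked[0]["rewrite"]


# Hypothetical candidates for one conversational query.
candidates = [
    {"rewrite": "rewrite A", "scores": [0.0, 0.0], "answer_probs": [0.2, 0.6]},
    {"rewrite": "rewrite B", "scores": [0.0, 0.0], "answer_probs": [0.1, 0.1]},
]
print(preference_pair(candidates))
```

A pair selected this way could then serve as the chosen and rejected examples for DPO training of the rewriter.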

Title: THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation

Authors: Seo Hyun Kim, Kai Tzu-iunn Ong, Taeyoon Kwon, Namyoung Kim, Keummin Ka, SeongHyeon Bae, Yohan Jo, Seung-won Hwang, Dongha Lee, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.10996
Pdf URL: https://arxiv.org/pdf/2406.10996
Copy Paste: [[2406.10996]] THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation(https://arxiv.org/abs/2406.10996)
Keywords: language model, llm
Abstract: Large language models (LLMs) are capable of processing lengthy dialogue histories during prolonged interaction with users without additional memory modules; however, their responses tend to overlook or incorrectly recall information from the past. In this paper, we revisit memory-augmented response generation in the era of LLMs. While prior work focuses on getting rid of outdated memories, we argue that such memories can provide contextual cues that help dialogue systems understand the development of past events and, therefore, benefit response generation. We present Theanine, a framework that augments LLMs' response generation with memory timelines -- series of memories that demonstrate the development and causality of relevant past events. Along with Theanine, we introduce TeaFarm, a counterfactual-driven question-answering pipeline addressing the limitation of G-Eval in long-term conversations. Supplementary videos of our methods and the TeaBag dataset for TeaFarm evaluation are in https://theanine-693b0.web.app/.

Title: Not All Bias is Bad: Balancing Rational Deviations and Cognitive Biases in Large Language Model Reasoning

Authors: Liman Wang, Hanyang Zhong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.10999
Pdf URL: https://arxiv.org/pdf/2406.10999
Copy Paste: [[2406.10999]] Not All Bias is Bad: Balancing Rational Deviations and Cognitive Biases in Large Language Model Reasoning(https://arxiv.org/abs/2406.10999)
Keywords: language model, llm, agent
Abstract: This paper investigates the nuanced role of biases in the decision-making processes of large language models (LLMs). While conventional research typically aims to eliminate all biases, our study reveals that not all biases are detrimental. By examining rational deviations, involving heuristic shortcuts that enhance decision-making efficiency, we highlight their potential benefits when properly balanced. We introduce the concepts of heuristic moderation and an abstention option, allowing LLMs to abstain from answering when uncertain, thereby reducing error rates and improving decision accuracy. Using our newly developed BRD (Balance Rational Deviations) dataset, our findings demonstrate that appropriately scaled bias inspection enhances model performance and aligns LLM decision-making more closely with human reasoning. This balance improves the reliability and trustworthiness of LLMs and suggests new strategies for future enhancements. Our work offers a fresh perspective on leveraging biases constructively to enhance the practical applications of LLMs, from conversational agents to decision support systems and beyond.

Title: Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game

Authors: Prisha Samadarshi, Mariam Mustafa, Anushka Kulkarni, Raven Rothkopf, Tuhin Chakrabarty, Smaranda Muresan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11012
Pdf URL: https://arxiv.org/pdf/2406.11012
Copy Paste: [[2406.11012]] Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game(https://arxiv.org/abs/2406.11012)
Keywords: language model, gpt, llm
Abstract: The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 200 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, GPT-4o, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 8% of the games. Compared to GPT-4o, novice and expert players perform better, with expert human players significantly outperforming GPT-4o. To deepen our understanding we create a taxonomy of the knowledge types required to successfully categorize words in the Connections game, revealing that LLMs struggle with associative, encyclopedic, and linguistic knowledge. Our findings establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in humans and AI systems.

Title: RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

Authors: Yuqing Wang, Yun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11020
Pdf URL: https://arxiv.org/pdf/2406.11020
Copy Paste: [[2406.11020]] RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models(https://arxiv.org/abs/2406.11020)
Keywords: language model, gpt, llm
Abstract: With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly impacting their effectiveness in practical applications. To systematically understand the robustness of LLMs, we present RUPBench, a comprehensive benchmark designed to evaluate LLM robustness across diverse reasoning tasks. Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning, and introduces nine types of textual perturbations at lexical, syntactic, and semantic levels. By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns. Our findings highlight that larger models tend to exhibit greater robustness to perturbations. Additionally, common error types are identified through manual inspection, revealing specific challenges faced by LLMs in different reasoning contexts. This work provides insights into areas where LLMs need further improvement to handle diverse and noisy inputs effectively.

Title: FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture

Authors: Wenyan Li, Xinyu Zhang, Jiaang Li, Qiwei Peng, Raphael Tang, Li Zhou, Weijia Zhang, Guimin Hu, Yifei Yuan, Anders Søgaard, Daniel Hershcovich, Desmond Elliott
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11030
Pdf URL: https://arxiv.org/pdf/2406.11030
Copy Paste: [[2406.11030]] FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture(https://arxiv.org/abs/2406.11030)
Keywords: language model, llm
Abstract: Food is a rich and varied dimension of cultural heritage, crucial to both individuals and social groups. To bridge the gap in the literature on the often-overlooked regional diversity in this domain, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China. We evaluate vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions. FoodieQA comprises three multiple-choice question-answering tasks where models need to answer questions based on multiple images, a single image, and text-only descriptions, respectively. While LLMs excel at text-based question answering, surpassing human accuracy, the open-sourced VLMs still fall short by 41% on multi-image and 21% on single-image VQA tasks, although closed-weights models perform closer to human levels (within 10%). Our findings highlight that understanding food and its cultural implications remains a challenging and under-explored direction.

Title: Scaling Synthetic Logical Reasoning Datasets with Context-Sensitive Declarative Grammars

Authors: Damien Sileo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11035
Pdf URL: https://arxiv.org/pdf/2406.11035
Copy Paste: [[2406.11035]] Scaling Synthetic Logical Reasoning Datasets with Context-Sensitive Declarative Grammars(https://arxiv.org/abs/2406.11035)
Keywords: language model, gpt
Abstract: Logical reasoning remains a challenge for natural language processing, but it can be improved by training language models to mimic theorem provers on procedurally generated problems. Previous work used domain-specific proof generation algorithms, which biases reasoning toward specific proof traces and limits auditability and extensibility. We present a simpler and more general declarative framework with flexible context-sensitive rules binding multiple languages (specifically, simplified English and the TPTP theorem-proving language). We construct first-order logic problems by selecting up to 32 premises and one hypothesis. We demonstrate that using semantic constraints during generation and careful English verbalization of predicates enhances logical reasoning without hurting natural English tasks. We use relatively small DeBERTa-v3 models to achieve state-of-the-art accuracy on the FOLIO human-authored logic dataset, surpassing GPT-4 in accuracy with or without an external solver by 12%.

Title: garak: A Framework for Security Probing Large Language Models

Authors: Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, Nanna Inie
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2406.11036
Pdf URL: https://arxiv.org/pdf/2406.11036
Copy Paste: [[2406.11036]] garak: A Framework for Security Probing Large Language Models(https://arxiv.org/abs/2406.11036)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) are deployed and integrated into thousands of applications, the need for scalable evaluation of how models respond to adversarial attacks grows rapidly. However, LLM security is a moving target: models produce unpredictable output and are constantly updated, and the potential adversary is highly diverse: anyone with access to the internet and a decent command of natural language. Further, what constitutes a security weakness in one context may not be an issue in a different context; one-size-fits-all guardrails remain theoretical. In this paper, we argue that it is time to rethink what constitutes "LLM security", and pursue a holistic approach to LLM security evaluation, where exploration and discovery of issues are central. To this end, this paper introduces garak (Generative AI Red-teaming and Assessment Kit), a framework which can be used to discover and identify vulnerabilities in a target LLM or dialog system. garak probes an LLM in a structured fashion to discover potential vulnerabilities. The outputs of the framework describe a target model's weaknesses, contribute to an informed discussion of what composes vulnerabilities in unique contexts, and can inform alignment and policy discussions for LLM deployment.

Title: Evaluating the Performance of Large Language Models via Debates

Authors: Behrad Moniri, Hamed Hassani, Edgar Dobriban
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11044
Pdf URL: https://arxiv.org/pdf/2406.11044
Copy Paste: [[2406.11044]] Evaluating the Performance of Large Language Models via Debates(https://arxiv.org/abs/2406.11044)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are rapidly evolving and impacting various fields, necessitating the development of effective methods to evaluate and compare their performance. Most current approaches for performance evaluation are either based on fixed, domain-specific questions that lack the flexibility required in many real-world applications where tasks are not always from a single domain, or rely on human input, making them unscalable. We propose an automated benchmarking framework based on debates between LLMs, judged by another LLM. This method assesses not only domain knowledge, but also skills such as problem definition and inconsistency recognition. We evaluate the performance of various state-of-the-art LLMs using the debate framework and achieve rankings that align closely with popular rankings based on human input, eliminating the need for costly human crowdsourcing.

Title: A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners

Authors: Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo J. Taylor, Dan Roth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11050
Pdf URL: https://arxiv.org/pdf/2406.11050
Copy Paste: [[2406.11050]] A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners(https://arxiv.org/abs/2406.11050)
Keywords: language model, llm
Abstract: This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems. Our framework outlines a list of hypotheses where token biases are readily identifiable, with all null hypotheses assuming genuine reasoning capabilities of LLMs. The findings in this study suggest, with statistical guarantee, that most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities.

Title: Can LLMs Understand the Implication of Emphasized Sentences in Dialogue?

Authors: Guan-Ting Lin, Hung-yi Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11065
Pdf URL: https://arxiv.org/pdf/2406.11065
Copy Paste: [[2406.11065]] Can LLMs Understand the Implication of Emphasized Sentences in Dialogue?(https://arxiv.org/abs/2406.11065)
Keywords: language model, gpt, llm
Abstract: Emphasis is a crucial component in human communication, which indicates the speaker's intention and implication beyond pure text in dialogue. While Large Language Models (LLMs) have revolutionized natural language processing, their ability to understand emphasis in dialogue remains unclear. This paper introduces Emphasized-Talk, a benchmark with emphasis-annotated dialogue samples capturing the implications of emphasis. We evaluate various LLMs, both open-source and commercial, to measure their performance in understanding emphasis. Additionally, we propose an automatic evaluation pipeline using GPT-4, which achieves a high correlation with human rating. Our findings reveal that although commercial LLMs generally perform better, there is still significant room for improvement in comprehending emphasized sentences.

Title: Exploring the Limitations of Detecting Machine-Generated Text

Authors: Jad Doughman, Osama Mohammed Afzal, Hawau Olamide Toyin, Shady Shehata, Preslav Nakov, Zeerak Talat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11073
Pdf URL: https://arxiv.org/pdf/2406.11073
Copy Paste: [[2406.11073]] Exploring the Limitations of Detecting Machine-Generated Text(https://arxiv.org/abs/2406.11073)
Keywords: language model
Abstract: Recent improvements in the quality of the generations by large language models have spurred research into identifying machine-generated text. Systems proposed for the task often achieve high performance. However, humans and machines can produce text in different styles and in different domains, and it remains unclear whether machine-generated text detection models favour particular styles or domains. In this paper, we critically examine the classification performance for detecting machine-generated text by evaluating on texts with varying writing styles. We find that classifiers are highly sensitive to stylistic changes and differences in text complexity, and in some cases degrade entirely to random classifiers. We further find that detection systems are particularly susceptible to misclassifying easy-to-read texts, while they have high performance for complex texts.

Title: Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing

Authors: Changbing Yang, Garrett Nicolai, Miikka Silfverberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11085
Pdf URL: https://arxiv.org/pdf/2406.11085
Copy Paste: [[2406.11085]] Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing(https://arxiv.org/abs/2406.11085)
Keywords: llm
Abstract: In this paper, we address the data scarcity problem in automatic data-driven glossing for low-resource languages by coordinating multiple sources of linguistic expertise. We supplement models with translations at both the token and sentence level and leverage the extensive linguistic capability of modern LLMs. Our enhancements lead to an average absolute improvement of 5 percentage points in word-level accuracy over the previous state of the art on a typologically diverse dataset spanning six low-resource languages. The improvements are particularly noticeable for the lowest-resourced language, Gitksan, where we achieve a 10 percentage point improvement. Furthermore, in a simulated ultra-low resource setting for the same six languages, training on fewer than 100 glossed sentences, we establish an average 10 percentage point improvement in word-level accuracy over the previous state-of-the-art system.

Title: RAEmoLLM: Retrieval Augmented LLMs for Cross-Domain Misinformation Detection Using In-Context Learning based on Emotional Information

Authors: Zhiwei Liu, Kailai Yang, Qianqian Xie, Christine de Kock, Sophia Ananiadou, Eduard Hovy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11093
Pdf URL: https://arxiv.org/pdf/2406.11093
Copy Paste: [[2406.11093]] RAEmoLLM: Retrieval Augmented LLMs for Cross-Domain Misinformation Detection Using In-Context Learning based on Emotional Information(https://arxiv.org/abs/2406.11093)
Keywords: llm
Abstract: Misinformation is prevalent in various fields such as education, politics, and health, causing significant harm to society. However, current methods for cross-domain misinformation detection rely on time- and resource-consuming fine-tuning and complex model structures. With the outstanding performance of LLMs, many studies have employed them for misinformation detection. Unfortunately, they focus on in-domain tasks and do not incorporate significant sentiment and emotion features (which we jointly call affect). In this paper, we propose RAEmoLLM, the first retrieval-augmented (RAG) LLM framework to address cross-domain misinformation detection using in-context learning based on affective information. It accomplishes this by applying an emotion-aware LLM to construct a retrieval database of affective embeddings. This database is used by our retrieval module to obtain source-domain samples, which are subsequently used for the inference module's in-context few-shot learning to detect target-domain misinformation. We evaluate our framework on three misinformation benchmarks. Results show that RAEmoLLM achieves significant improvements compared to the zero-shot method on three datasets, with the highest increases of 20.69%, 23.94%, and 39.11% respectively. This work will be released on this https URL.

Title: The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models

Authors: Bolei Ma, Xinpeng Wang, Tiancheng Hu, Anna-Carolina Haensch, Michael A. Hedderich, Barbara Plank, Frauke Kreuter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11096
Pdf URL: https://arxiv.org/pdf/2406.11096
Copy Paste: [[2406.11096]] The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models(https://arxiv.org/abs/2406.11096)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have sparked wide interest in validating and comprehending the human-like cognitive-behavioral traits LLMs may have. These cognitive-behavioral traits typically include Attitudes, Opinions, and Values (AOV). However, measuring AOV embedded within LLMs remains opaque, and different evaluation methods may yield different results. This has led to a lack of clarity on how different studies are related to each other and how they can be interpreted. This paper aims to bridge this gap by providing an overview of recent works on the evaluation of AOV in LLMs. Moreover, we survey related approaches in different stages of the evaluation pipeline in these works. By doing so, we address the potential and challenges with respect to understanding the model, human-AI alignment, and downstream application in social sciences. Finally, we provide practical insights into evaluation methods, model enhancement, and interdisciplinary collaboration, thereby contributing to the evolving landscape of evaluating AOV in LLMs.

Title: InstructCMP: Length Control in Sentence Compression through Instruction-based Large Language Models

Authors: Juseon-Do, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11097
Pdf URL: https://arxiv.org/pdf/2406.11097
Copy Paste: [[2406.11097]] InstructCMP: Length Control in Sentence Compression through Instruction-based Large Language Models(https://arxiv.org/abs/2406.11097)
Keywords: language model, llm
Abstract: Extractive summarization can produce faithful summaries but often requires additional constraints such as a desired summary length. Traditional sentence compression models do not typically consider the constraints because of their restricted model abilities, which require model modifications for coping with them. To bridge this gap, we propose Instruction-based Compression (InstructCMP), an approach to the sentence compression task that can consider the length constraint through instructions by leveraging the zero-shot task-solving abilities of Large Language Models (LLMs). For this purpose, we created new evaluation datasets by transforming traditional sentence compression datasets into an instruction format. By using the datasets, we first reveal that the current LLMs still face challenges in accurately controlling the length for a compressed text. To address this issue, we propose an approach named "length priming," that incorporates additional length information into the instructions without external resources. While the length priming effectively works in a zero-shot setting, a training dataset with the instructions would further improve the ability of length control. Thus, we additionally created a training dataset in an instruction format to fine-tune the model on it. Experimental results and analysis show that applying the length priming significantly improves performances of InstructCMP in both zero-shot and fine-tuning settings without the need of any model modifications.

Title: Grading Massive Open Online Courses Using Large Language Models

Authors: Shahriar Golchin, Nikhil Garuda, Christopher Impey, Matthew Wenger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11102
Pdf URL: https://arxiv.org/pdf/2406.11102
Copy Paste: [[2406.11102]] Grading Massive Open Online Courses Using Large Language Models(https://arxiv.org/abs/2406.11102)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Massive open online courses (MOOCs) offer free education globally to anyone with a computer and internet access. Despite this democratization of learning, the massive enrollment in these courses makes it impractical for one instructor to assess every student's writing assignment. As a result, peer grading, often guided by a straightforward rubric, is the method of choice. While convenient, peer grading often falls short in terms of reliability and validity. In this study, we explore the feasibility of using large language models (LLMs) to replace peer grading in MOOCs. Specifically, we use two LLMs, GPT-4 and GPT-3.5, across three MOOCs: Introductory Astronomy, Astrobiology, and the History and Philosophy of Astronomy. To instruct LLMs, we use three different prompts based on the zero-shot chain-of-thought (ZCoT) prompting technique: (1) ZCoT with instructor-provided correct answers, (2) ZCoT with both instructor-provided correct answers and rubrics, and (3) ZCoT with instructor-provided correct answers and LLM-generated rubrics. Tested on 18 settings, our results show that ZCoT, when augmented with instructor-provided correct answers and rubrics, produces grades that are more aligned with those assigned by instructors compared to peer grading. Finally, our findings indicate a promising potential for automated grading systems in MOOCs, especially in subjects with well-defined rubrics, to improve the learning experience for millions of online learners worldwide.

Title: From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models

Authors: Harsh Nishant Lalai, Aashish Anantha Ramakrishnan, Raj Sanjay Shah, Dongwon Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11106
Pdf URL: https://arxiv.org/pdf/2406.11106
Copy Paste: [[2406.11106]] From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models(https://arxiv.org/abs/2406.11106)
Keywords: language model, llm
Abstract: With the rapid growth of Large Language Models (LLMs), safeguarding textual content against unauthorized use is crucial. Text watermarking offers a vital solution, protecting both LLM-generated and plain-text sources. This paper presents a unified overview of different perspectives behind designing watermarking techniques, through a comprehensive survey of the research literature. Our work has two key advantages: (1) we analyze research based on the specific intentions behind different watermarking techniques, the evaluation datasets used, and watermark addition and removal methods to construct a cohesive taxonomy; (2) we highlight the gaps and open challenges in text watermarking to promote research in protecting text authorship. This extensive coverage and detailed analysis set our work apart, offering valuable insights into the evolving landscape of text watermarking in language models.

Title: Exploring Safety-Utility Trade-Offs in Personalized Language Models

Authors: Anvesh Rao Vijjini, Somnath Basu Roy Chowdhury, Snigdha Chaturvedi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11107
Pdf URL: https://arxiv.org/pdf/2406.11107
Copy Paste: [[2406.11107]] Exploring Safety-Utility Trade-Offs in Personalized Language Models(https://arxiv.org/abs/2406.11107)
Keywords: language model, gpt, llm, prompt
Abstract: As large language models (LLMs) become increasingly integrated into daily applications, it is essential to ensure they operate fairly across diverse user demographics. In this work, we show that LLMs suffer from personalization bias, where their performance is impacted when they are personalized to a user's identity. We quantify personalization bias by evaluating the performance of LLMs along two axes - safety and utility. We measure safety by examining how benign LLM responses are to unsafe prompts with and without personalization. We measure utility by evaluating the LLM's performance on various tasks, including general knowledge, mathematical abilities, programming, and reasoning skills. We find that various LLMs, ranging from open-source models like Llama (Touvron et al., 2023) and Mistral (Jiang et al., 2023) to API-based ones like GPT-3.5 and GPT-4o (Ouyang et al., 2022), exhibit significant variance in performance in terms of safety-utility trade-offs depending on the user's identity. Finally, we discuss several strategies to mitigate personalization bias using preference tuning and prompt-based defenses.

Title: Investigating Annotator Bias in Large Language Models for Hate Speech Detection

Authors: Amit Das, Zheng Zhang, Fatemeh Jamshidi, Vinija Jain, Aman Chadha, Nilanjana Raychawdhary, Mary Sandage, Lauramarie Pope, Gerry Dozier, Cheryl Seals
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11109
Pdf URL: https://arxiv.org/pdf/2406.11109
Copy Paste: [[2406.11109]] Investigating Annotator Bias in Large Language Models for Hate Speech Detection(https://arxiv.org/abs/2406.11109)
Keywords: language model, gpt, llm, chat
Abstract: Data annotation, the practice of assigning descriptive labels to raw data, is pivotal in optimizing the performance of machine learning models. However, it is a resource-intensive process susceptible to biases introduced by annotators. The emergence of sophisticated Large Language Models (LLMs) like ChatGPT presents a unique opportunity to modernize and streamline this complex procedure. While existing research extensively evaluates the efficacy of LLMs as annotators, this paper delves into the biases present in LLMs, specifically GPT 3.5 and GPT 4o, when annotating hate speech data. Our research contributes to understanding biases in four key categories: gender, race, religion, and disability. Specifically targeting highly vulnerable groups within these categories, we analyze annotator biases. Furthermore, we conduct a comprehensive examination of potential factors contributing to these biases by scrutinizing the annotated data. We introduce our custom hate speech detection dataset, HateSpeechCorpus, to conduct this research. Additionally, we perform the same experiments on the ETHOS (Mollas et al., 2022) dataset for comparative analysis. This paper serves as a crucial resource, guiding researchers and practitioners in harnessing the potential of LLMs for data annotation, thereby fostering advancements in this critical field. The HateSpeechCorpus dataset is available here: this https URL

Title: Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification

Authors: Letian Peng, Yi Gu, Chengyu Dong, Zihan Wang, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11115
Pdf URL: https://arxiv.org/pdf/2406.11115
Copy Paste: [[2406.11115]] Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification(https://arxiv.org/abs/2406.11115)
Keywords: llm, prompt
Abstract: For extremely weakly supervised text classification, pioneering research generates pseudo labels by mining texts similar to the class names from the raw corpus, which may end up with very limited or even no samples for the minority classes. Recent works have started to generate the relevant texts by prompting LLMs using the class names or definitions; however, there is a high risk that LLMs cannot generate in-distribution (i.e., similar to the corpus where the text classifier will be applied) data, leading to ungeneralizable classifiers. In this paper, we combine the advantages of these two approaches and propose to bridge the gap via a novel framework, text grafting, which aims to obtain clean and near-distribution weak supervision for minority classes. Specifically, we first use LLM-based logits to mine masked templates from the raw corpus, which have a high potential for data synthesis into the target minority class. Then, the templates are filled by state-of-the-art LLMs to synthesize near-distribution texts falling into minority classes. Text grafting shows significant improvement over direct mining or synthesis on minority classes. We also use analysis and case studies to comprehend the properties of text grafting.

Title: Grammaticality Representation in ChatGPT as Compared to Linguists and Laypeople

Authors: Zhuang Qiu, Xufeng Duan, Zhenguang G. Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11116
Pdf URL: https://arxiv.org/pdf/2406.11116
Copy Paste: [[2406.11116]] Grammaticality Representation in ChatGPT as Compared to Linguists and Laypeople(https://arxiv.org/abs/2406.11116)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) have demonstrated exceptional performance across various linguistic tasks. However, it remains uncertain whether LLMs have developed human-like fine-grained grammatical intuition. This preregistered study (this https URL) presents the first large-scale investigation of ChatGPT's grammatical intuition, building upon a previous study that collected laypeople's grammatical judgments on 148 linguistic phenomena that linguists judged to be grammatical, ungrammatical, or marginally grammatical (Sprouse, Schutze, & Almeida, 2013). Our primary focus was to compare ChatGPT with both laypeople and linguists in the judgement of these linguistic constructions. In Experiment 1, ChatGPT assigned ratings to sentences based on a given reference sentence. Experiment 2 involved rating sentences on a 7-point scale, and Experiment 3 asked ChatGPT to choose the more grammatical sentence from a pair. Overall, our findings demonstrate convergence rates ranging from 73% to 95% between ChatGPT and linguists, with an overall point-estimate of 89%. Significant correlations were also found between ChatGPT and laypeople across all tasks, though the correlation strength varied by task. We attribute these results to the psychometric nature of the judgment tasks and the differences in language processing styles between humans and LLMs.

Title: Dynamic Order Template Prediction for Generative Aspect-Based Sentiment Analysis

Authors: Yonghyun Jun, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11130
Pdf URL: https://arxiv.org/pdf/2406.11130
Copy Paste: [[2406.11130]] Dynamic Order Template Prediction for Generative Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2406.11130)
Keywords: prompt
Abstract: Aspect-based sentiment analysis (ABSA) assesses sentiments towards specific aspects within texts, resulting in detailed sentiment tuples. Previous ABSA models often use static templates to predict all of the elements in the tuples, and these models often fail to accurately capture dependencies between elements. The multi-view prompting method improves the performance of ABSA by predicting tuples with various templates and then ensembling the results. However, this method suffers from inefficiencies and out-of-distribution errors. In this paper, we propose a Dynamic Order Template (DOT) method for ABSA, which dynamically generates necessary views for each instance based on instance-level entropy. By ensuring diverse and relevant view generation, our proposed method improves F1-scores on ASQP and ACOS datasets while significantly reducing inference time.

Title: Are Large Language Models a Good Replacement of Taxonomies?

Authors: Yushi Sun, Hao Xin, Kai Sun, Yifan Ethan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2406.11131
Pdf URL: https://arxiv.org/pdf/2406.11131
Copy Paste: [[2406.11131]] Are Large Language Models a Good Replacement of Taxonomies?(https://arxiv.org/abs/2406.11131)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies validate that LLMs perform well on general knowledge while presenting poor performance on long-tail nuanced knowledge, the community is still doubtful about whether the traditional knowledge graphs should be replaced by LLMs. In this paper, we ask if the schema of knowledge graph (i.e., taxonomy) is made obsolete by LLMs. Intuitively, LLMs should perform well on common taxonomies and at taxonomy levels that are common to people. Unfortunately, there is no comprehensive benchmark that evaluates LLMs over a wide range of taxonomies from common to specialized domains and at levels from root to leaf, so that we can draw a confident conclusion. To narrow the research gap, we constructed a novel taxonomy hierarchical structure discovery benchmark named TaxoGlimpse to evaluate the performance of LLMs over taxonomies. TaxoGlimpse covers ten representative taxonomies from common to specialized domains with in-depth experiments of different levels of entities in this taxonomy from root to leaf. Our comprehensive experiments of eighteen state-of-the-art LLMs under three prompting settings validate that LLMs still cannot capture the knowledge of specialized taxonomies and leaf-level entities well.

Title: RePrompt: Planning by Automatic Prompt Engineering for Large Language Models Agents

Authors: Weizhe Chen, Sven Koenig, Bistra Dilkina
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11132
Pdf URL: https://arxiv.org/pdf/2406.11132
Copy Paste: [[2406.11132]] RePrompt: Planning by Automatic Prompt Engineering for Large Language Models Agents(https://arxiv.org/abs/2406.11132)
Keywords: language model, llm, prompt, chat, agent
Abstract: In the past year, large language models (LLMs) have had remarkable success in domains outside traditional natural language processing, and people are starting to explore the usage of LLMs in more general, application-oriented domains like code generation, travel planning, and robot control. By connecting these high-capacity LLMs with external tools, people are building so-called LLM agents, which are supposed to help people do all kinds of work in everyday life. In all these domains, the prompt to the LLMs has been shown to make a big difference in what the LLM would generate and thus affect the performance of the LLM agents. Therefore, automatic prompt engineering has become an important question for many researchers and users of LLMs. In this paper, we propose a novel method, RePrompt, which does "gradient descent" to optimize the step-by-step instructions in the prompt of the LLM agents based on the chat history obtained from interactions with LLM agents. By optimizing the prompt, the LLM will learn how to plan in specific domains. We have used experiments in PDDL generation and travel planning to show that our method could generally improve the performance for different reasoning tasks when using the updated prompt as the initial prompt.

Title: Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance

Authors: Somnath Banerjee, Avik Halder, Rajarshi Mandal, Sayan Layek, Ian Soboroff, Rima Hazra, Animesh Mukherjee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11139
Pdf URL: https://arxiv.org/pdf/2406.11139
Copy Paste: [[2406.11139]] Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance(https://arxiv.org/abs/2406.11139)
Keywords: language model, gpt, llm
Abstract: The integration of pretrained language models (PLMs) like BERT and GPT has revolutionized NLP, particularly for English, but it has also created linguistic imbalances. This paper strategically identifies the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts. We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada. Our research identifies significant discrepancies in normal and merged models concerning cross-lingual consistency. We employ strategies like 'each language for itself' (ELFI) and 'each language for others' (ELFO) to stress-test these models. Our findings demonstrate the potential for LLMs to overcome linguistic barriers, laying the groundwork for future research in achieving linguistic inclusivity in AI technologies.

Title: GoldCoin: Grounding Large Language Models in Privacy Laws via Contextual Integrity Theory

Authors: Wei Fan, Haoran Li, Zheye Deng, Weiqi Wang, Yangqiu Song
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2406.11149
Pdf URL: https://arxiv.org/pdf/2406.11149
Copy Paste: [[2406.11149]] GoldCoin: Grounding Large Language Models in Privacy Laws via Contextual Integrity Theory(https://arxiv.org/abs/2406.11149)
Keywords: language model, llm
Abstract: Privacy issues arise prominently during the inappropriate transmission of information between entities. Existing research primarily studies privacy by exploring various privacy attacks, defenses, and evaluations within narrowly predefined patterns, while neglecting that privacy is not an isolated, context-free concept limited to traditionally sensitive data (e.g., social security numbers), but intertwined with intricate social contexts that complicate the identification and analysis of potential privacy violations. The advent of Large Language Models (LLMs) offers unprecedented opportunities for incorporating the nuanced scenarios outlined in privacy laws to tackle these complex privacy issues. However, the scarcity of relevant open-source case studies restricts the efficiency of LLMs in aligning with specific legal statutes. To address this challenge, we introduce a novel framework, GoldCoin, designed to efficiently ground LLMs in privacy laws for the judicial assessment of privacy violations. Our framework leverages the theory of contextual integrity as a bridge, creating numerous synthetic scenarios grounded in relevant privacy statutes (e.g., HIPAA), to assist LLMs in comprehending the complex contexts for identifying privacy risks in the real world. Extensive experimental results demonstrate that GoldCoin markedly enhances LLMs' capabilities in recognizing privacy risks across real court cases, surpassing the baselines on different judicial tasks.

Title: How Good are LLMs at Relation Extraction under Low-Resource Scenario? Comprehensive Evaluation

Authors: Dawulie Jinensibieke, Mieradilijiang Maimaiti, Wentao Xiao, Yuanhang Zheng, Xiangbo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11162
Pdf URL: https://arxiv.org/pdf/2406.11162
Copy Paste: [[2406.11162]] How Good are LLMs at Relation Extraction under Low-Resource Scenario? Comprehensive Evaluation(https://arxiv.org/abs/2406.11162)
Keywords: language model, llm
Abstract: Relation Extraction (RE) serves as a crucial technology for transforming unstructured text into structured information, especially within the framework of Knowledge Graph development. Its importance is emphasized by its essential role in various downstream tasks. Besides the conventional RE methods which are based on neural networks and pre-trained language models, large language models (LLMs) are also utilized in the research field of RE. However, on low-resource languages (LRLs), both conventional RE methods and LLM-based methods perform poorly on RE due to the data scarcity issues. To this end, this paper constructs low-resource relation extraction datasets in 10 LRLs in three regions (Central Asia, Southeast Asia and Middle East). The corpora are constructed by translating the original publicly available English RE datasets (NYT10, FewRel and CrossRE) using an effective multilingual machine translation. Then, we use the language perplexity (PPL) to filter out the low-quality data from the translated datasets. Finally, we conduct an empirical study and validate the performance of several open-source LLMs on these generated LRL RE datasets.
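Note: the perplexity (PPL) filtering step mentioned in the abstract above can be pictured with a minimal sketch. The abstract does not say which scoring model or threshold is used, so everything here is an assumption: per-token log-probabilities are taken as already computed by some language model, and `max_ppl` is a hypothetical cutoff. Perplexity is computed as the exponential of the mean negative log-likelihood per token.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_by_perplexity(samples, max_ppl):
    """Keep texts whose perplexity is at most max_ppl (hypothetical threshold).

    samples: list of (text, token_logprobs) pairs; the log-probabilities are
    assumed to come from a language model that is not part of this sketch.
    """
    return [text for text, logprobs in samples if perplexity(logprobs) <= max_ppl]

# Toy data (made up for illustration): one fluent and one garbled translation.
samples = [
    ("fluent sentence", [math.log(0.5), math.log(0.5)]),
    ("garbled sentence", [math.log(0.01), math.log(0.01)]),
]
kept = filter_by_perplexity(samples, max_ppl=10.0)
```

In this toy run the first sample has perplexity 2 and is kept, while the second has perplexity of about 100 and is dropped.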

Title: Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement

Authors: Weimin Xiong, Yifan Song, Xiutian Zhao, Wenhao Wu, Xun Wang, Ke Wang, Cheng Li, Wei Peng, Sujian Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement(https://arxiv.org/abs/)
Keywords: language model, llm, agent
Abstract: Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the Iterative step-level Process Refinement (IPR) framework, which provides detailed step-by-step guidance to enhance agent training. Specifically, we adopt the Monte Carlo method to estimate step-level rewards. During each iteration, the agent explores along the expert trajectory and generates new actions. These actions are then evaluated against the corresponding step of expert trajectory using step-level rewards. Such comparison helps identify discrepancies, yielding contrastive action pairs that serve as training data for the agent. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines. Moreover, our analytical findings highlight the effectiveness of IPR in augmenting action efficiency and its applicability to diverse models.
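Note: the idea of estimating step-level rewards by Monte Carlo sampling and turning the comparison into contrastive pairs, as described in the abstract above, can be sketched roughly as follows. This is not the authors' implementation: the function names, the toy environment, and the sample count are all hypothetical, and a "state" here is simply a number standing in for the situation reached after a step.

```python
import random

def mc_step_reward(state, rollout, n_samples=200, seed=0):
    """Monte Carlo estimate: mean outcome reward over rollouts from `state`."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    total = sum(rollout(state, rng) for _ in range(n_samples))
    return total / n_samples

def toy_rollout(state, rng):
    """Hypothetical environment: `state` is the chance that the task succeeds."""
    return 1.0 if rng.random() < state else 0.0

def contrastive_pair(expert_state, agent_state, rollout):
    """Return (preferred, rejected) when the estimates differ, else None."""
    r_expert = mc_step_reward(expert_state, rollout)
    r_agent = mc_step_reward(agent_state, rollout)
    if r_expert == r_agent:
        return None
    if r_expert > r_agent:
        return (expert_state, agent_state)
    return (agent_state, expert_state)

# The expert step always leads to success, the agent step never does.
pair = contrastive_pair(1.0, 0.0, toy_rollout)
```

Averaging many rollouts gives each intermediate step its own score instead of a single outcome reward for the whole trajectory, which is what makes a step-by-step comparison possible.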

Title: TIFG: Text-Informed Feature Generation with Large Language Models

Authors: Xinhao Zhang, Jinghan Zhang, Fengran Mo, Yuzhong Chen, Kunpeng Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11177
Pdf URL: https://arxiv.org/pdf/2406.11177
Copy Paste: [[2406.11177]] TIFG: Text-Informed Feature Generation with Large Language Models(https://arxiv.org/abs/2406.11177)
Keywords: language model, llm, retrieval augmented generation
Abstract: Textual information of data is of vital importance for data mining and feature engineering. However, existing methods focus on learning the data structures and overlook the textual information along with the data. Consequently, they waste this valuable resource and miss out on the deeper data relationships embedded within the texts. In this paper, we introduce Text-Informed Feature Generation (TIFG), a novel LLM-based text-informed feature generation framework. TIFG utilizes the textual information to generate features by retrieving possible relevant features within external knowledge with Retrieval Augmented Generation (RAG) technology. In this approach, the TIFG can generate new explainable features to enrich the feature space and further mine feature relationships. We design the TIFG to be an automated framework that continuously optimizes the feature generation process, adapts to new data inputs, and improves downstream task performance over iterations. A broad range of experiments in various downstream tasks showcases that our approach can generate high-quality and meaningful features, and is significantly superior to existing methods.

Title: Aligning Large Language Models from Self-Reference AI Feedback with one General Principle

Authors: Rong Bao, Rui Zheng, Shihan Dou, Xiao Wang, Enyu Zhou, Bo Wang, Qi Zhang, Liang Ding, Dacheng Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11190
Pdf URL: https://arxiv.org/pdf/2406.11190
Copy Paste: [[2406.11190]] Aligning Large Language Models from Self-Reference AI Feedback with one General Principle(https://arxiv.org/abs/2406.11190)
Keywords: language model, llm, chat
Abstract: In aligning large language models (LLMs), utilizing feedback from existing advanced AI rather than humans is an important method to scale supervisory signals. However, it is highly challenging for AI to understand human intentions and societal values, and provide accurate preference feedback based on these. Current AI feedback methods rely on powerful LLMs and carefully designed specific principles to describe human intentions, and are easily influenced by position bias. To address these issues, we propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback under simple and general principles such as "best for humanity". Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference, and finally determine which answer better fits human preferences according to the criticism. Additionally, we use a self-consistency method to further reduce the impact of position bias, and employ semantic perplexity to calculate the preference strength differences between different answers. Experimental results show that our method enables 13B and 70B Llama2-Chat annotators to provide high-quality preference feedback, and the policy models trained based on these preference data achieve significant advantages on benchmark datasets through reinforcement learning.
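One common way to reduce position bias in pairwise judging, in the spirit of the self-consistency step mentioned in the abstract above, is to query the judge with both answer orders and keep the verdict only when the two passes agree. The sketch below assumes that reading; the `judge` callable and the toy judges are hypothetical, and the paper's exact procedure may differ.

```python
from typing import Callable, Optional

Judge = Callable[[str, str, str], int]

def consistent_preference(judge: Judge, prompt: str,
                          answer_a: str, answer_b: str) -> Optional[str]:
    """Ask the judge twice with swapped answer order; keep only agreeing verdicts.

    `judge(prompt, first, second)` is a hypothetical callable that returns 0 when
    it prefers the first-shown answer and 1 when it prefers the second-shown one.
    """
    first_pass = judge(prompt, answer_a, answer_b)   # A is shown first
    second_pass = judge(prompt, answer_b, answer_a)  # B is shown first
    winner_1 = answer_a if first_pass == 0 else answer_b
    winner_2 = answer_b if second_pass == 0 else answer_a
    return winner_1 if winner_1 == winner_2 else None

# Toy judges (hypothetical), used only to exercise the function.
def position_biased_judge(prompt: str, first: str, second: str) -> int:
    return 0  # always prefers whichever answer is shown first

def length_judge(prompt: str, first: str, second: str) -> int:
    return 0 if len(first) >= len(second) else 1  # prefers the longer answer
```

A judge that always favors the first-shown answer produces contradictory verdicts across the two passes, so its label is discarded (`None`), while an order-independent judge returns the same winner both times.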

Title: A Survey on Human Preference Learning for Large Language Models

Authors: Ruili Jiang, Kehai Chen, Xuefeng Bai, Zhixuan He, Juntao Li, Muyun Yang, Tiejun Zhao, Liqiang Nie, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11191
Pdf URL: https://arxiv.org/pdf/2406.11191
Copy Paste: [[2406.11191]] A Survey on Human Preference Learning for Large Language Models(https://arxiv.org/abs/2406.11191)
Keywords: language model, llm
Abstract: The recent surge of versatile large language models (LLMs) largely depends on aligning increasingly capable foundation models with human intentions by preference learning, enhancing LLMs with excellent applicability and effectiveness in a wide range of contexts. Despite the numerous related studies conducted, a perspective on how human preferences are introduced into LLMs remains limited, which may prevent a deeper comprehension of the relationships between human preferences and LLMs as well as the realization of their limitations. In this survey, we review the progress in exploring human preference learning for LLMs from a preference-centered perspective, covering the sources and formats of preference feedback, the modeling and usage of preference signals, as well as the evaluation of the aligned LLMs. We first categorize the human feedback according to data sources and formats. We then summarize techniques for human preferences modeling and compare the advantages and disadvantages of different schools of models. Moreover, we present various preference usage methods sorted by the objectives to utilize human preference signals. Finally, we summarize some prevailing approaches to evaluate LLMs in terms of alignment with human intentions and discuss our outlooks on the human intention alignment for LLMs.

Title: Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition

Authors: Yuming Yang, Wantong Zhao, Caishuang Huang, Junjie Ye, Xiao Wang, Huiyuan Zheng, Yang Nan, Yuran Wang, Xueying Xu, Kaixin Huang, Yunke Zhang, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11192
Pdf URL: https://arxiv.org/pdf/2406.11192
Copy Paste: [[2406.11192]] Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition(https://arxiv.org/abs/2406.11192)
Keywords: language model, gpt, llm
Abstract: Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets faces issues due to inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain generalization. To address this, we present B2NERD, a cohesive and efficient dataset for Open NER, normalized from 54 existing English or Chinese datasets using a two-step approach. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly improves LLMs' generalization on Open NER. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages.

Title: MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model

Authors: Jiahao Huo, Yibo Yan, Boren Hu, Yutao Yue, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11193
Pdf URL: https://arxiv.org/pdf/2406.11193
Copy Paste: [[2406.11193]] MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model(https://arxiv.org/abs/2406.11193)
Keywords: language model, llm
Abstract: Projecting visual features into word embedding space has become a significant fusion strategy adopted by Multimodal Large Language Models (MLLMs). However, its internal mechanisms have yet to be explored. Inspired by multilingual research, we identify domain-specific neurons in multimodal large language models. Specifically, we investigate the distribution of domain-specific neurons and the mechanism of how MLLMs process features from diverse domains. Furthermore, we propose a three-stage framework for language model modules in MLLMs when handling projected image features, and verify this hypothesis using the logit lens. Extensive experiments indicate that while current MLLMs exhibit Visual Question Answering (VQA) capability, they may not fully utilize domain-specific information. Manipulating domain-specific neurons properly changes accuracy by at most 10%, shedding light on the development of cross-domain, all-encompassing MLLMs in the future. Our code will be released upon paper notification.

Title: In-Context Editing: Learning Knowledge from Self-Induced Distributions

Authors: Siyuan Qi, Bangcheng Yang, Kailin Jiang, Xiaobo Wang, Jiaqi Li, Yifan Zhong, Yaodong Yang, Zilong Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11194
Pdf URL: https://arxiv.org/pdf/2406.11194
Copy Paste: [[2406.11194]] In-Context Editing: Learning Knowledge from Self-Induced Distributions(https://arxiv.org/abs/2406.11194)
Keywords: language model
Abstract: The existing fine-tuning paradigm for language models is brittle in knowledge editing scenarios, where the model must incorporate new information without extensive retraining. This brittleness often results in overfitting, reduced performance, and unnatural language generation. To address this, we propose Consistent In-Context Editing (ICE), a novel approach that leverages the model's in-context learning capability to tune toward a contextual distribution rather than a one-hot target. ICE introduces a straightforward optimization framework that includes both a target and a procedure, enhancing the robustness and effectiveness of gradient-based tuning methods. We provide analytical insights into ICE across four critical aspects of knowledge editing: accuracy, locality, generalization, and linguistic quality, showing its advantages. Experimental results across four datasets confirm the effectiveness of ICE and demonstrate its potential for continual editing, ensuring that updated information is incorporated while preserving the integrity of the model.

Title: Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models

Authors: Scott Barnett, Zac Brannelly, Stefanus Kurniawan, Sheng Wong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11201
Pdf URL: https://arxiv.org/pdf/2406.11201
Copy Paste: [[2406.11201]] Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models(https://arxiv.org/abs/2406.11201)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have the unique capability to understand and generate human-like text from input queries. When fine-tuned, these models show enhanced performance on domain-specific queries. OpenAI highlights the process of fine-tuning, stating: "To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples, but the right number varies greatly based on the exact use case." This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines, which aim to improve accuracy and relevance by leveraging external corpus data for information retrieval. However, RAG's promise of delivering optimal responses often falls short in complex query scenarios. This study aims to specifically examine the effects of fine-tuning LLMs on their ability to extract and integrate contextual data to enhance the performance of RAG systems across multiple domains. We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding by comparing the accuracy and completeness of fine-tuned models against baseline performances across datasets from multiple domains. Our findings indicate that fine-tuning resulted in a decline in performance compared to the baseline models, contrary to the improvements observed in standalone LLM applications as suggested by OpenAI. This study highlights the need for vigorous investigation and validation of fine-tuned models for domain-specific tasks.

Title: Global Data Constraints: Ethical and Effectiveness Challenges in Large Language Model

Authors: Jin Yang, Zhiqiang Wang, Yanbin Lin, Zunduo Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11214
Pdf URL: https://arxiv.org/pdf/2406.11214
Copy Paste: [[2406.11214]] Global Data Constraints: Ethical and Effectiveness Challenges in Large Language Model(https://arxiv.org/abs/2406.11214)
Keywords: language model, gpt, llm
Abstract: The efficacy and ethical integrity of large language models (LLMs) are profoundly influenced by the diversity and quality of their training datasets. However, the global landscape of data accessibility presents significant challenges, particularly in regions with stringent data privacy laws or limited open-source information. This paper examines the multifaceted challenges associated with acquiring high-quality training data for LLMs, focusing on data scarcity, bias, and low-quality content across various linguistic contexts. We highlight the technical and ethical implications of relying on publicly available but potentially biased or irrelevant data sources, which can lead to the generation of biased or hallucinatory content by LLMs. Through a series of evaluations using GPT-4 and GPT-4o, we demonstrate how these data constraints adversely affect model performance and ethical alignment. We propose and validate several mitigation strategies designed to enhance data quality and model robustness, including advanced data filtering techniques and ethical data collection practices. Our findings underscore the need for a proactive approach in developing LLMs that considers both the effectiveness and ethical implications of data constraints, aiming to foster the creation of more reliable and universally applicable AI systems.

Title: Building another Spanish dictionary, this time with GPT-4

Authors: Miguel Ortega-Martín, Óscar García-Sierra, Alfonso Ardoiz, Juan Carlos Armenteros, Ignacio Garrido, Jorge Álvarez, Camilo Torrón, Iñigo Galdeano, Ignacio Arranz, Oleg Vorontsov, Adrián Alonso
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11218
Pdf URL: https://arxiv.org/pdf/2406.11218
Copy Paste: [[2406.11218]] Building another Spanish dictionary, this time with GPT-4(https://arxiv.org/abs/2406.11218)
Keywords: gpt
Abstract: We present the "Spanish Built Factual Freectianary 2.0" (Spanish-BFF-2) as the second iteration of an AI-generated Spanish dictionary. Previously, we developed the inaugural version of this unique free dictionary employing GPT-3. In this study, we aim to improve the dictionary by using GPT-4-turbo instead. Furthermore, we explore improvements made to the initial version and compare the performance of both models.

Title: ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark

Authors: Hiromi Wakaki, Yuki Mitsufuji, Yoshinori Maeda, Yukiko Nishimura, Silin Gao, Mengjie Zhao, Keiichi Yamada, Antoine Bosselut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11228
Pdf URL: https://arxiv.org/pdf/2406.11228
Copy Paste: [[2406.11228]] ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark(https://arxiv.org/abs/2406.11228)
Keywords: agent
Abstract: We propose a new benchmark, ComperDial, which facilitates the training and evaluation of evaluation metrics for open-domain dialogue systems. ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents submitted to the Commonsense Persona-grounded Dialogue (CPD) challenge. As a result, for any dialogue, our benchmark includes multiple diverse responses with a variety of characteristics to ensure more robust evaluation of learned dialogue metrics. In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores, enabling joint assessment of multi-turn model responses throughout a dialogue. Finally, building off ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations. Our experimental results demonstrate that our novel metric, CPDScore, is more correlated with human judgments than existing metrics. We release both ComperDial and CPDScore to the community to accelerate development of automatic evaluation metrics for open-domain dialogue systems.

Title: MiniConGTS: A Near Ultimate Minimalist Contrastive Grid Tagging Scheme for Aspect Sentiment Triplet Extraction

Authors: Qiao Sun, Liujia Yang, Minghao Ma, Nanyang Ye, Qinying Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11234
Pdf URL: https://arxiv.org/pdf/2406.11234
Copy Paste: [[2406.11234]] MiniConGTS: A Near Ultimate Minimalist Contrastive Grid Tagging Scheme for Aspect Sentiment Triplet Extraction(https://arxiv.org/abs/2406.11234)
Keywords: language model, gpt, chain-of-thought
Abstract: Aspect Sentiment Triplet Extraction (ASTE) aims to co-extract the sentiment triplets in a given corpus. Existing approaches within the pretraining-finetuning paradigm tend to either meticulously craft complex tagging schemes and classification heads, or incorporate external semantic augmentation to enhance performance. In this study, we, for the first time, re-evaluate the redundancy in tagging schemes and the internal enhancement in pretrained representations. We propose a method to improve and utilize pretrained representations by integrating a minimalist tagging scheme and a novel token-level contrastive learning strategy. The proposed approach demonstrates comparable or superior performance compared to state-of-the-art techniques while featuring a more compact design and reduced computational overhead. Additionally, we are the first to formally evaluate GPT-4's performance in few-shot learning and Chain-of-Thought scenarios for this task. The results demonstrate that the pretraining-finetuning paradigm remains highly effective even in the era of large language models.

Title: What Kinds of Tokens Benefit from Distant Text? An Analysis on Long Context Language Modeling

Authors: Yutong Hu, Quzhe Huang, Kangcheng Luo, Yansong Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11238
Pdf URL: https://arxiv.org/pdf/2406.11238
Copy Paste: [[2406.11238]] What Kinds of Tokens Benefit from Distant Text? An Analysis on Long Context Language Modeling(https://arxiv.org/abs/2406.11238)
Keywords: language model, long context
Abstract: As the context length that large language models can handle continues to increase, these models demonstrate an enhanced ability to utilize distant information for tasks such as language modeling. This capability contrasts with human reading and writing habits, where it is uncommon to remember and use particularly distant information, except in cases of foreshadowing. In this paper, we aim to explore which kinds of words benefit more from long contexts in language models. By analyzing the changes in token probabilities with increasing context length, we find that content words (e.g., nouns, adjectives) and the initial tokens of words benefit the most. Frequent patterns in the context (N-grams) also significantly impact predictions. Additionally, the model's prior knowledge plays a crucial role in influencing predictions, especially for rare tokens. We also observe that language models become more confident with longer contexts, resulting in sharper probability distributions. This overconfidence may contribute to the increasing probabilities of tokens with distant contextual information. We hope that our analysis will help the community better understand long-text language modeling and contribute to the design of more reliable long-context models.

Title: Evading AI-Generated Content Detectors using Homoglyphs

Authors: Aldan Creo, Shushanta Pudasaini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11239
Pdf URL: https://arxiv.org/pdf/2406.11239
Copy Paste: [[2406.11239]] Evading AI-Generated Content Detectors using Homoglyphs(https://arxiv.org/abs/2406.11239)
Keywords: language model, gpt, llm
Abstract: The generation of text that is increasingly human-like has been enabled by the advent of large language models (LLMs). As the detection of AI-generated content holds significant importance in the fight against issues such as misinformation and academic cheating, numerous studies have been conducted to develop reliable LLM detectors. While promising results have been demonstrated by such detectors on test data, recent research has revealed that they can be circumvented by employing different techniques. In this article, homoglyph-based ($a \rightarrow {\alpha}$) attacks that can be used to circumvent existing LLM detectors are presented. The efficacy of the attacks is illustrated by analyzing how homoglyphs shift the tokenization of the text, and thus its token log-likelihoods. A comprehensive evaluation is conducted to assess the effectiveness of homoglyphs on state-of-the-art LLM detectors, including Binoculars, DetectGPT, OpenAI's detector, and watermarking techniques, on five different datasets. A significant reduction in the efficiency of all the studied configurations of detectors and datasets, down to an accuracy of 0.5 (random guessing), is demonstrated by the proposed approach. The results show that homoglyph-based attacks can effectively evade existing LLM detectors, and the implications of these findings are discussed along with possible defenses against such attacks.
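To make the homoglyph idea in the abstract above concrete, the sketch below swaps a few Latin letters for visually similar Cyrillic code points and shows that the result differs at the code-point and byte level even though it renders almost identically. The mapping table is a small hypothetical example (the abstract's own illustration uses a Greek alpha), and no detector or tokenizer is modeled here.

```python
# Hypothetical, deliberately tiny mapping from Latin letters to look-alike
# Cyrillic code points; real substitution tables may be much larger.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
}

def substitute_homoglyphs(text: str) -> str:
    """Replace every mapped Latin letter with its look-alike code point."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

original = "a space probe"
attacked = substitute_homoglyphs(original)

# Same number of characters, but different code points and a longer UTF-8
# encoding; a tokenizer therefore sees a different input sequence.
changed = sum(1 for x, y in zip(original, attacked) if x != y)
byte_growth = len(attacked.encode("utf-8")) - len(original.encode("utf-8"))
```

Because each substituted character occupies two bytes in UTF-8 instead of one, `byte_growth` equals `changed` for this table, which is one simple way to see why subword tokenization of the altered text diverges from the original.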

Title: FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation

Authors: Bangzheng Li, Ben Zhou, Xingyu Fu, Fei Wang, Dan Roth, Muhao Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11243
Pdf URL: https://arxiv.org/pdf/2406.11243
Copy Paste: [[2406.11243]] FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation(https://arxiv.org/abs/2406.11243)
Keywords: language model, prompt
Abstract: Language models have shown impressive in-context-learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation, and propose label-agnostic prompt metrics that can better estimate end-task performances. One popular approach is using perplexity as a way to measure models' familiarity with the prompt. Although such metrics show consistent improvements on in-domain tasks, we found that familiarity metrics such as perplexity cannot accurately estimate performance in complicated situations such as task or domain transferring scenarios. In this work, we propose a revised measure called FamiCom, providing a more comprehensive measure for task-agnostic performance estimation. Specifically, FamiCom combines familiarity with complexity, the inherent difficulty of end tasks, which is an important factor missing from current metrics. Experiments show that FamiCom strongly correlates with end-task performances, producing a Spearman's correlation of 0.85, versus 0.43 for familiarity-only metrics. We further apply FamiCom to automatic prompt and demonstration selection, and outperform existing methods and baselines by more than 7.0% in accuracy.
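The abstract above refers to perplexity as a familiarity measure. As a minimal sketch, perplexity can be computed from per-token natural-log probabilities as the exponential of their negative mean. The log-probability values below are made-up placeholders rather than model outputs, and the way FamiCom combines familiarity with complexity is not given in the abstract, so it is not reproduced here.

```python
import math
from typing import Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for two prompts under some model.
familiar_prompt = [math.log(0.5)] * 4       # every token assigned probability 0.5
unfamiliar_prompt = [math.log(0.125)] * 4   # every token assigned probability 0.125

ppl_familiar = perplexity(familiar_prompt)
ppl_unfamiliar = perplexity(unfamiliar_prompt)
```

Lower perplexity indicates that the model finds the prompt more predictable, which is the sense of "familiarity" used by perplexity-based prompt metrics.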

Title: Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs

Authors: Muhammad Arslan Manzoor, Yuxia Wang, Minghan Wang, Preslav Nakov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11250
Pdf URL: https://arxiv.org/pdf/2406.11250
Copy Paste: [[2406.11250]] Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs(https://arxiv.org/abs/2406.11250)
Keywords: language model, llm
Abstract: Empathy plays a pivotal role in fostering prosocial behavior, often triggered by the sharing of personal experiences through narratives. However, modeling empathy using NLP approaches remains challenging due to its deep interconnection with human interaction dynamics. Previous approaches, which involve fine-tuning language models (LMs) on human-annotated empathic datasets, have had limited success. In our pursuit of improving empathy understanding in LMs, we propose several strategies, including contrastive learning with masked LMs and supervised fine-tuning with Large Language Models (LLMs). While these methods show improvements over previous methods, the overall results remain unsatisfactory. To better understand this trend, we performed an analysis which reveals a low agreement among annotators. This lack of consensus hinders training and highlights the subjective nature of the task. We also explore the cultural impact on annotations. To study this, we meticulously collected story pairs in the Urdu language and found that subjectivity in interpreting empathy among annotators appears to be independent of cultural background. The insights from our systematic exploration of LMs' understanding of empathy suggest that there is considerable room for exploration in both task formulation and modeling.

Title: Enhancing Biomedical Knowledge Retrieval-Augmented Generation with Self-Rewarding Tree Search and Proximal Policy Optimization

Authors: Minda Hu, Licheng Zong, Hongru Wang, Jingyan Zhou, Jingjing Li, Yichen Gao, Kam-Fai Wong, Yu Li, Irwin King
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11258
Pdf URL: https://arxiv.org/pdf/2406.11258
Copy Paste: [[2406.11258]] Enhancing Biomedical Knowledge Retrieval-Augmented Generation with Self-Rewarding Tree Search and Proximal Policy Optimization(https://arxiv.org/abs/2406.11258)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG). However, existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries, resulting in sub-optimal performance. To address these limitations, we propose a novel plug-and-play LLM-based retrieval method called Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm. By combining the reasoning capabilities of LLMs with the effectiveness of tree search, SeRTS boosts the zero-shot performance of retrieving high-quality and informative results for RAG. We further enhance retrieval performance by fine-tuning LLMs with Proximal Policy Optimization (PPO) objectives using the trajectories collected by SeRTS as feedback. Controlled experiments using the BioASQ-QA dataset with GPT-3.5-Turbo and Llama2-7b demonstrate that our method significantly improves the performance of the BM25 retriever and surpasses the strong baseline of self-reflection in both efficiency and scalability. Moreover, SeRTS generates higher-quality feedback for PPO training than self-reflection. Our proposed method effectively adapts LLMs to document retrieval tasks, enhancing their ability to retrieve highly relevant documents for RAG in the context of medical knowledge queries. This work presents a significant step forward in leveraging LLMs for accurate and comprehensive biomedical question answering.
摘要：随着检索增强生成 (RAG) 的发展，大型语言模型 (LLM) 在生物医学领域显示出巨大潜力。然而，现有的检索增强方法在处理各种查询和文档方面面临挑战，尤其是对于医学知识查询，导致性能不佳。为了解决这些限制，我们提出了一种基于蒙特卡洛树搜索 (MCTS) 和自奖励范式的新型即插即用 LLM 检索方法，称为自奖励树搜索 (SeRTS)。通过将 LLM 的推理能力与树搜索的有效性相结合，SeRTS 提高了 RAG 检索高质量和信息丰富结果的零样本性能。我们通过使用 SeRTS 收集的轨迹作为反馈，使用近端策略优化 (PPO) 目标对 LLM 进行微调，进一步提高了检索性能。使用 GPT-3.5-Turbo 和 LLama2-7b 在 BioASQ-QA 数据集上进行的受控实验表明，我们的方法显著提高了 BM25 检索器的性能，并且在效率和可扩展性方面都超越了自我反思的强大基线。此外，SeRTS 为 PPO 训练生成的反馈质量高于自我反思。我们提出的方法有效地将 LLM 适应文档检索任务，增强了它们在医学知识查询背景下为 RAG 检索高度相关文档的能力。这项工作为利用 LLM 进行准确而全面的生物医学问答迈出了重要一步。

Title: Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection

Authors: Sungwon Park, Sungwon Han, Meeyoung Cha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11260
Pdf URL: https://arxiv.org/pdf/2406.11260
Copy Paste: [[2406.11260]] Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection(https://arxiv.org/abs/2406.11260)
Keywords: language model, llm, prompt
Abstract: The spread of fake news negatively impacts individuals and is regarded as a significant social challenge that needs to be addressed. A number of algorithmic and insightful features have been identified for detecting fake news. However, with the recent LLMs and their advanced generation capabilities, many of the detectable features (e.g., style-conversion attacks) can be altered, making it more challenging to distinguish from real news. This study proposes adversarial style augmentation, AdStyle, to train a fake news detector that remains robust against various style-conversion attacks. Our model's key mechanism is the careful use of LLMs to automatically generate a diverse yet coherent range of style-conversion attack prompts. This improves the generation of prompts that are particularly difficult for the detector to handle. Experiments show that our augmentation strategy improves robustness and detection performance when tested on fake news benchmark datasets.
摘要：虚假新闻的传播对个人产生负面影响，被视为需要解决的重大社会挑战。已经确定了许多用于检测虚假新闻的算法和有洞察力的特征。然而，随着最近的 LLM 及其先进的生成能力，许多可检测的特征（例如，风格转换攻击）可以被改变，这使得区分真实新闻变得更加困难。这项研究提出了对抗性风格增强 AdStyle，以训练一个能够抵御各种风格转换攻击的虚假新闻检测器。我们模型的关键机制是谨慎使用 LLM 自动生成多样但连贯的风格转换攻击提示。这改进了检测器特别难以处理的提示的生成。实验表明，在虚假新闻基准数据集上测试时，我们的增强策略提高了稳健性和检测性能。

Title: The Fall of ROME: Understanding the Collapse of LLMs in Model Editing

Authors: Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Du Su, Dawei Yin, Huawei Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11263
Pdf URL: https://arxiv.org/pdf/2406.11263
Copy Paste: [[2406.11263]] The Fall of ROME: Understanding the Collapse of LLMs in Model Editing(https://arxiv.org/abs/2406.11263)
Keywords: language model, llm
Abstract: Despite significant progress in model editing methods, their application in real-world scenarios remains challenging as they often cause large language models (LLMs) to collapse. Among them, ROME is particularly concerning, as it could disrupt LLMs with only a single edit. In this paper, we study the root causes of such collapse. Through extensive analysis, we identify two primary factors that contribute to the collapse: i) inconsistent handling of prefixed and unprefixed keys in the parameter update equation may result in very small denominators, causing excessively large parameter updates; ii) the subject of collapse cases is usually the first token, whose unprefixed key distribution significantly differs from the prefixed key distribution in autoregressive transformers, causing the aforementioned issue to materialize. To validate our analysis, we propose a simple yet effective approach: uniformly using prefixed keys during editing phase and adding prefixes during the testing phase. The experimental results show that the proposed solution can prevent model collapse while maintaining the effectiveness of the edits.
摘要：尽管模型编辑方法取得了重大进展，但它们在实际场景中的应用仍然具有挑战性，因为它们经常导致大型语言模型 (LLM) 崩溃。其中，ROME 尤其令人担忧，因为它只需一次编辑就可以破坏 LLM。在本文中，我们研究了这种崩溃的根本原因。通过广泛的分析，我们确定了导致崩溃的两个主要因素：i）参数更新方程中对前缀和非前缀键的处理不一致可能导致分母非常小，从而导致参数更新过大；ii）崩溃案例的主题通常是第一个标记，其非前缀键分布与自回归转换器中的前缀键分布显着不同，导致上述问题成为现实。为了验证我们的分析，我们提出了一种简单而有效的方法：在编辑阶段统一使用前缀键，并在测试阶段添加前缀。实验结果表明，提出的解决方案可以在保持编辑有效性的同时防止模型崩溃。
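
Note: the "very small denominators" mentioned in the abstract can be pictured with the rank-one update usually associated with ROME. The block below is a sketch from the commonly cited form of that update, not an equation taken from this paper. The symbols $W$ (edited weight), $C$ (key covariance), $k_*$ (key), $v_*$ (target value), $\bar{k}^{\,p}$ (prefixed key) and $k^{u}$ (unprefixed key) are notation assumed here for illustration.

```latex
% Commonly stated rank-one edit (assumed notation):
\hat{W} = W + \Lambda \,(C^{-1} k_*)^{\top},
\qquad
\Lambda = \frac{v_* - W k_*}{(C^{-1} k_*)^{\top} k_*}

% Illustration of the inconsistency described in the abstract:
% if one factor of the denominator uses a prefixed key and the
% other an unprefixed key, the scalar
%   (C^{-1} \bar{k}^{\,p})^{\top} k^{u}
% is no longer a positive quadratic form and may be close to zero,
% which makes \lambda, and hence the update, very large.
\Lambda_{\text{inconsistent}}
  = \frac{v_* - W k^{u}}{(C^{-1} \bar{k}^{\,p})^{\top} k^{u}}
```

With a single key used throughout, the denominator is $k_*^{\top} C^{-1} k_*$, which is strictly positive whenever $C$ is positive definite and $k_* \neq 0$. This is consistent with the abstract's proposed remedy of using prefixed keys uniformly.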

Title: Mitigating Large Language Model Hallucination with Faithful Finetuning

Authors: Minda Hu, Bowei He, Yufei Wang, Liangyou Li, Chen Ma, Irwin King
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11267
Pdf URL: https://arxiv.org/pdf/2406.11267
Copy Paste: [[2406.11267]] Mitigating Large Language Model Hallucination with Faithful Finetuning(https://arxiv.org/abs/2406.11267)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have demonstrated remarkable performance on various natural language processing tasks. However, they are prone to generating fluent yet untruthful responses, known as "hallucinations". Hallucinations can lead to the spread of misinformation and cause harm in critical applications. Mitigating hallucinations is challenging as they arise from factors such as noisy data, model overconfidence, lack of knowledge, and the generation process itself. Recent efforts have attempted to address this issue through representation editing and decoding algorithms, reducing hallucinations without major structural changes or retraining. However, these approaches either implicitly edit LLMs' behavior in latent space or suppress the tendency to output unfaithful results during decoding instead of explicitly modeling on hallucination. In this work, we introduce Faithful Finetuning (F2), a novel method that explicitly models the process of faithful question answering through carefully designed loss functions during fine-tuning. We conduct extensive experiments on popular datasets and demonstrate that F2 achieves significant improvements over vanilla models and baselines.
摘要：大型语言模型 (LLM) 在各种自然语言处理任务中表现出色。然而，它们很容易产生流畅但不真实的反应，即所谓的“幻觉”。幻觉会导致错误信息的传播，并对关键应用造成伤害。缓解幻觉具有挑战性，因为它们是由噪声数据、模型过度自信、缺乏知识以及生成过程本身等因素引起的。最近的努力试图通过表示编辑和解码算法来解决这个问题，在不进行重大结构变化或重新训练的情况下减少幻觉。然而，这些方法要么隐式编辑 LLM 在潜在空间中的行为，要么抑制解码过程中输出不真实结果的倾向，而不是明确地对幻觉进行建模。在这项工作中，我们引入了忠实微调 (F2)，这是一种新方法，它通过精心设计的损失函数在微调过程中明确地模拟忠实问答的过程。我们对流行的数据集进行了广泛的实验，并证明 F2 比原始模型和基线取得了显着的改进。

Title: Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers

Authors: Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Hai Yu, Jiaqing Liu, Yukun Ma, Chong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11274
Pdf URL: https://arxiv.org/pdf/2406.11274
Copy Paste: [[2406.11274]] Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers(https://arxiv.org/abs/2406.11274)
Keywords: language model
Abstract: The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies. However, as the demand for understanding complex relationships grows, refining the Transformer's architecture becomes critical. This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models by enabling direct attention between non-adjacent layers. This method improves the model's ability to capture dependencies between high-level abstract features and low-level details. By facilitating direct attention between these diverse feature levels, our approach overcomes the limitations of current Transformers, which often rely on suboptimal intra-layer attention. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer, thus enhancing the diversity of multi-head attention without additional computational burden. Extensive experiments demonstrate that our enhanced Transformer model achieves superior performance in language modeling tasks, highlighting the effectiveness of our skip-layer attention mechanism.
摘要：通过有效地管理长距离依赖关系，Transformer 架构显著推动了深度学习的发展，尤其是在自然语言处理领域。然而，随着对理解复杂关系的需求不断增长，改进 Transformer 的架构变得至关重要。本文引入了跳过层注意力 (SLA)，通过在非相邻层之间实现直接注意力来增强 Transformer 模型。此方法提高了模型捕获高级抽象特征和低级细节之间依赖关系的能力。通过促进这些不同特征级别之间的直接注意力，我们的方法克服了当前 Transformer 的局限性，这些局限性通常依赖于次优的层内注意力。我们的实现通过使给定层中的查询能够与当前层和前一层的键和值交互来扩展 Transformer 的功能，从而增强了多头注意力的多样性，而无需额外的计算负担。大量实验表明，我们增强的 Transformer 模型在语言建模任务中取得了卓越的表现，凸显了我们跳过层注意力机制的有效性。

Title: Self-training Large Language Models through Knowledge Detection

Authors: Wei Jie Yeo, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, Erik Cambria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11275
Pdf URL: https://arxiv.org/pdf/2406.11275
Copy Paste: [[2406.11275]] Self-training Large Language Models through Knowledge Detection(https://arxiv.org/abs/2406.11275)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) often necessitate extensive labeled datasets and training compute to achieve impressive performance across downstream tasks. This paper explores a self-training paradigm, where the LLM autonomously curates its own labels and selectively trains on unknown data samples identified through a reference-free consistency method. Empirical evaluations demonstrate significant improvements in reducing hallucination in generation across multiple subjects. Furthermore, the selective training framework mitigates catastrophic forgetting in out-of-distribution benchmarks, addressing a critical limitation in training LLMs. Our findings suggest that such an approach can substantially reduce the dependency on large labeled datasets, paving the way for more scalable and cost-effective language model training.
摘要：大型语言模型 (LLM) 通常需要大量标记数据集和训练计算才能在下游任务中取得令人印象深刻的性能。本文探讨了一种自训练范式，其中 LLM 自主管理自己的标签并选择性地对通过无参考一致性方法识别的未知数据样本进行训练。实证评估表明，在减少多个受试者的幻觉生成方面取得了显著的进步。此外，选择性训练框架减轻了分布外基准中的灾难性遗忘，解决了训练 LLM 的一个关键限制。我们的研究结果表明，这种方法可以大大减少对大型标记数据集的依赖，为更具可扩展性和成本效益的语言模型训练铺平道路。

Title: Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector

Authors: Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Hongzhi Zhang, Fuzheng Zhang, Di Zhang, Kun Gai, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector(https://arxiv.org/abs/)
Keywords: language model, gpt, llm, hallucination, chat, agent
Abstract: Hallucination detection is a challenging task for large language models (LLMs), and existing studies heavily rely on powerful closed-source LLMs such as GPT-4. In this paper, we propose an autonomous LLM-based agent framework, called HaluAgent, which enables relatively smaller LLMs (e.g. Baichuan2-Chat 7B) to actively select suitable tools for detecting multiple hallucination types such as text, code, and mathematical expression. In HaluAgent, we integrate the LLM with a multi-functional toolbox and design a fine-grained three-stage detection framework along with a memory mechanism. To facilitate the effectiveness of HaluAgent, we leverage existing Chinese and English datasets to synthesize detection trajectories for fine-tuning, which endows HaluAgent with the capability for bilingual hallucination detection. Extensive experiments demonstrate that, using only 2K samples for tuning LLMs, HaluAgent can perform hallucination detection on various types of tasks and datasets, achieving performance comparable to or even higher than GPT-4 without tool enhancements on both in-domain and out-of-domain datasets. We release our dataset and code at this https URL.
摘要：幻觉检测对于大型语言模型 (LLM) 来说是一项具有挑战性的任务，现有的研究严重依赖于功能强大的闭源 LLM，例如 GPT-4。在本文中，我们提出了一个基于 LLM 的自主代理框架 HaluAgent，它使相对较小的 LLM（例如 Baichuan2-Chat 7B）能够主动选择合适的工具来检测多种幻觉类型，例如文本、代码和数学表达式。在 HaluAgent 中，我们集成了 LLM、多功能工具箱，并设计了一个细粒度的三阶段检测框架以及记忆机制。为了提高 HaluAgent 的有效性，我们利用现有的中文和英文数据集来合成检测轨迹进行微调，这赋予了 HaluAgent 双语幻觉检测的能力。大量实验表明，仅使用 2K 样本对 LLM 进行调整，HaluAgent 就可以对各种类型的任务和数据集进行幻觉检测，在域内和域外数据集上均能实现与 GPT-4 相当甚至更高的性能，而无需工具增强。我们在此 https URL 上发布了我们的数据集和代码。

Title: Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs

Authors: Duygu Nur Yaldiz, Yavuz Faruk Bakman, Baturalp Buyukates, Chenyang Tao, Anil Ramakrishna, Dimitrios Dimitriadis, Salman Avestimehr
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11278
Pdf URL: https://arxiv.org/pdf/2406.11278
Copy Paste: [[2406.11278]] Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs(https://arxiv.org/abs/2406.11278)
Keywords: language model, llm
Abstract: In this work, we introduce the Learnable Response Scoring Function (LARS) for Uncertainty Estimation (UE) in generative Large Language Models (LLMs). Current scoring functions for probability-based UE, such as length-normalized scoring and semantic contribution-based weighting, are designed to solve specific aspects of the problem but exhibit limitations, including the inability to handle biased probabilities and under-performance in low-resource languages like Turkish. To address these issues, we propose LARS, a scoring function that leverages supervised data to capture complex dependencies between tokens and probabilities, thereby producing more reliable and calibrated response scores in computing the uncertainty of generations. Our extensive experiments across multiple datasets show that LARS substantially outperforms existing scoring functions considering various probability-based UE methods.
摘要：在本研究中，我们引入了可学习响应评分函数 (LARS)，用于生成式大型语言模型 (LLM) 中的不确定性估计 (UE)。当前基于概率的 UE 评分函数（例如长度归一化评分和基于语义贡献的加权）旨在解决问题的特定方面，但存在局限性，包括无法处理有偏差的概率以及在土耳其语等资源匮乏的语言中表现不佳。为了解决这些问题，我们提出了 LARS，这是一种评分函数，它利用监督数据来捕获标记和概率之间的复杂依赖关系，从而在计算代际不确定性时产生更可靠、更准确的响应分数。我们在多个数据集上进行的大量实验表明，考虑到各种基于概率的 UE 方法，LARS 的表现远远优于现有的评分函数。
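
Note: the abstract names length-normalized scoring as one of the existing hand-designed baselines. The sketch below shows that baseline only; it is not LARS itself, which is a learned function. The function names and example probabilities are hypothetical, chosen for illustration.

```python
import math


def sequence_probability(token_probs):
    """Plain product of token probabilities; shrinks as the answer gets longer."""
    return math.exp(sum(math.log(p) for p in token_probs))


def length_normalized_score(token_probs):
    """Geometric mean of token probabilities: exp of the mean log-probability."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_log_prob)


# Hypothetical per-token probabilities for a short and a long response.
short_answer = [0.9, 0.8]
long_answer = [0.9, 0.8, 0.9, 0.8]

print(sequence_probability(short_answer), sequence_probability(long_answer))
print(length_normalized_score(short_answer), length_normalized_score(long_answer))
```

The raw product ranks the longer response lower purely because of its length, while the normalized score treats the two alike. Neither uses any information beyond the probabilities themselves, which is the kind of limitation a trainable scoring function is meant to address.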

Title: MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models

Authors: Shengkang Wang, Hongzhan Lin, Ziyang Luo, Zhen Ye, Guang Chen, Jing Ma
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models(https://arxiv.org/abs/)
Keywords: language model
Abstract: Large vision-language models (LVLMs) have significantly improved multimodal reasoning tasks, such as visual question answering and image captioning. These models embed multimodal facts within their parameters, rather than relying on external knowledge bases to store factual information explicitly. However, the content discerned by LVLMs may deviate from actual facts due to inherent bias or incorrect inference. To address this issue, we introduce MFC-Bench, a rigorous and comprehensive benchmark designed to evaluate the factual accuracy of LVLMs across three tasks: Manipulation, Out-of-Context, and Veracity Classification. Through our evaluation on MFC-Bench, we benchmarked 12 diverse and representative LVLMs, uncovering that current models still fall short in multimodal fact-checking and demonstrate insensitivity to various forms of manipulated content. We hope that MFC-Bench could raise attention to the trustworthy artificial intelligence potentially assisted by LVLMs in the future. The MFC-Bench and accompanying resources are publicly accessible at this https URL, contributing to ongoing research in the multimodal fact-checking field.
摘要：大型视觉语言模型 (LVLM) 显著改善了多模态推理任务，例如视觉问答和图像字幕。这些模型将多模态事实嵌入其参数中，而不是依赖外部知识库来明确存储事实信息。然而，由于固有偏见或错误推理，LVLM 辨别的内容可能与实际事实有偏差。为了解决这个问题，我们引入了 MFC-Bench，这是一个严格而全面的基准，旨在评估 LVLM 在三个任务中的事实准确性：操纵、脱离上下文和真实性分类。通过对 MFC-Bench 的评估，我们对 12 个多样化且具有代表性的 LVLM 进行了基准测试，发现当前的模型在多模态事实核查方面仍然存在不足，并且对各种形式的操纵内容不敏感。我们希望 MFC-Bench 能够引起人们对未来可能由 LVLM 辅助的可信人工智能的关注。 MFC-Bench 及其附带资源可通过此 https URL 公开访问，为多模式事实核查领域的持续研究做出贡献。

Title: A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models

Authors: Haopeng Zhang, Philip S. Yu, Jiawei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11289
Pdf URL: https://arxiv.org/pdf/2406.11289
Copy Paste: [[2406.11289]] A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models(https://arxiv.org/abs/2406.11289)
Keywords: language model, llm
Abstract: Text summarization research has undergone several significant transformations with the advent of deep neural networks, pre-trained language models (PLMs), and recent large language models (LLMs). This survey thus provides a comprehensive review of the research progress and evolution in text summarization through the lens of these paradigm shifts. It is organized into two main parts: (1) a detailed overview of datasets, evaluation metrics, and summarization methods before the LLM era, encompassing traditional statistical methods, deep learning approaches, and PLM fine-tuning techniques, and (2) the first detailed examination of recent advancements in benchmarking, modeling, and evaluating summarization in the LLM era. By synthesizing existing literature and presenting a cohesive overview, this survey also discusses research trends, open challenges, and proposes promising research directions in summarization, aiming to guide researchers through the evolving landscape of summarization research.
摘要：随着深度神经网络、预训练语言模型 (PLM) 和最近的大型语言模型 (LLM) 的出现，文本摘要研究经历了几次重大转变。因此，本综述通过这些范式转变的视角，全面回顾了文本摘要的研究进展和演变。它分为两个主要部分：（1）详细概述 LLM 时代之前的数据集、评估指标和摘要方法，涵盖传统统计方法、深度学习方法和 PLM 微调技术；（2）首次详细研究 LLM 时代基准测试、建模和评估摘要的最新进展。通过综合现有文献并提出一个有凝聚力的概述，本综述还讨论了研究趋势、未解决的挑战，并提出了摘要方面的有希望的研究方向，旨在指导研究人员了解摘要研究的不断发展。

Title: Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams

Authors: Zheheng Luo, Chenhan Yuan, Qianqian Xie, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11328
Pdf URL: https://arxiv.org/pdf/2406.11328
Copy Paste: [[2406.11328]] Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams(https://arxiv.org/abs/2406.11328)
Keywords: language model, gpt, llm
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated their potential in delivering accurate answers to questions about world knowledge. Despite this, existing benchmarks for evaluating LLMs in healthcare predominantly focus on medical doctors, leaving other critical healthcare professions underrepresented. To fill this research gap, we introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese. EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists. Each question is tagged with its release time and source, ensuring relevance and authenticity. We conducted extensive experiments on 17 LLMs, including proprietary, open-source models, general domain models and medical specific models, evaluating their performance under various settings. Our findings reveal that while leading models like GPT-4 achieve over 75\% accuracy, they still struggle with specialized fields and alternative medicine. Surprisingly, general-purpose LLMs outperformed medical-specific models, and incorporating EMPEC's training data significantly enhanced performance. Additionally, the results on questions released after the models' training cutoff date were consistent with overall performance trends, suggesting that the models' performance on the test set can predict their effectiveness in addressing unseen healthcare-related queries. The transition from traditional to simplified Chinese characters had a negligible impact on model performance, indicating robust linguistic versatility. Our study underscores the importance of expanding benchmarks to cover a broader range of healthcare professions to better assess the applicability of LLMs in real-world healthcare scenarios.
摘要：大型语言模型 (LLM) 的最新进展已证明其在提供有关世界知识问题的准确答案方面具有潜力。尽管如此，现有的评估医疗保健领域 LLM 的基准主要集中在医生身上，而其他关键的医疗保健专业却没有得到充分的重视。为了填补这一研究空白，我们推出了“医务人员汉语考试” (EMPEC)，这是一项开创性的大型传统中文医疗保健知识基准。EMPEC 包括 124 个科目和 20 个医疗保健专业的 157,803 道考试题，包括验光师和听力学家等代表性不足的职业。每个问题都标有其发布时间和来源，以确保相关性和真实性。我们对 17 个 LLM 进行了广泛的实验，包括专有开源模型、通用领域模型和医学特定模型，评估了它们在各种环境下的性能。我们的研究结果表明，虽然像 GPT-4 这样的领先模型实现了超过 75% 的准确率，但它们在专业领域和替代医学方面仍然举步维艰。令人惊讶的是，通用 LLM 的表现优于医学专用模型，而结合 EMPEC 的训练数据可显著提高性能。此外，模型训练截止日期后发布的问题结果与整体性能趋势一致，这表明模型在测试集上的表现可以预测其在解决未见过的医疗相关查询方面的有效性。从繁体中文到简体的转变对模型性能的影响微乎其微，表明语言多功能性很强。我们的研究强调了扩大基准以涵盖更广泛的医疗专业的重要性，以便更好地评估 LLM 在现实世界医疗场景中的适用性。

Title: A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Authors: Leonardo Bertolazzi, Albert Gatt, Raffaella Bernardi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11341
Pdf URL: https://arxiv.org/pdf/2406.11341
Copy Paste: [[2406.11341]] A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences(https://arxiv.org/abs/2406.11341)
Keywords: language model, llm, chain-of-thought
Abstract: The reasoning abilities of Large Language Models (LLMs) are becoming a central focus of study in NLP. In this paper, we consider the case of syllogistic reasoning, an area of deductive reasoning studied extensively in logic and cognitive psychology. Previous research has shown that pre-trained LLMs exhibit reasoning biases, such as $\textit{content effects}$, avoid answering that $\textit{no conclusion follows}$, display human-like difficulties, and struggle with multi-step reasoning. We contribute to this research line by systematically investigating the effects of chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) on syllogistic reasoning, considering syllogisms with conclusions that support or violate world knowledge, as well as ones with multiple premises. Crucially, we go beyond the standard focus on accuracy, with an in-depth analysis of the conclusions generated by the models. Our results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science and that both ICL and SFT improve model performance on valid inferences, although only the latter mitigates most reasoning biases without harming model consistency.
摘要：大型语言模型 (LLM) 的推理能力正成为 NLP 研究的焦点。在本文中，我们考虑三段论推理的情况，三段论推理是逻辑和认知心理学中广泛研究的演绎推理领域。先前的研究表明，预训练的 LLM 表现出推理偏差，例如 $\textit{内容效应}$、避免回答 $\textit{没有结论}$、表现出类似人类的困难以及难以进行多步骤推理。我们通过系统地研究思路链推理、情境学习 (ICL) 和监督微调 (SFT) 对三段论推理的影响，为这一研究方向做出了贡献，考虑了结论支持或违反世界知识的三段论，以及具有多个前提的三段论。至关重要的是，我们超越了对准确性的标准关注，深入分析了模型生成的结论。我们的结果表明，预训练的 LLM 的行为可以通过认知科学中研究的启发式方法来解释，并且 ICL 和 SFT 都可以提高模型在有效推理上的性能，尽管只有后者可以在不损害模型一致性的情况下减轻大多数推理偏差。

Title: Full-ECE: A Metric For Token-level Calibration on Large Language Models

Authors: Han Liu, Yupeng Zhang, Bingning Wang, Weipeng Chen, Xiaolin Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11345
Pdf URL: https://arxiv.org/pdf/2406.11345
Copy Paste: [[2406.11345]] Full-ECE: A Metric For Token-level Calibration on Large Language Models(https://arxiv.org/abs/2406.11345)
Keywords: language model, llm
Abstract: Deep Neural Networks (DNNs) excel in various domains but face challenges in providing accurate uncertainty estimates, which are crucial for high-stakes applications. Large Language Models (LLMs) have recently emerged as powerful tools, demonstrating exceptional performance in language tasks. However, traditional calibration metrics such as Expected Calibration Error (ECE) and classwise-ECE (cw-ECE) are inadequate for LLMs due to their vast vocabularies, data complexity, and distributional focus. To address this, we propose a novel calibration concept called full calibration and introduce its corresponding metric, Full-ECE. Full-ECE evaluates the entire predicted probability distribution, offering a more accurate and robust measure of calibration for LLMs.
摘要：深度神经网络 (DNN) 在各个领域都表现出色，但在提供准确的不确定性估计方面面临挑战，而这对于高风险应用至关重要。大型语言模型 (LLM) 最近已成为强大的工具，在语言任务中表现出色。然而，由于 LLM 词汇量大、数据复杂且分布重点突出，预期校准误差 (ECE) 和类别 ECE (cw-ECE) 等传统校准指标不足以满足其需求。为了解决这个问题，我们提出了一种称为完全校准的新型校准概念，并引入了其相应的指标 Full-ECE。Full-ECE 评估整个预测概率分布，为 LLM 提供更准确、更可靠的校准测量。
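
Note: a minimal sketch of the standard binned Expected Calibration Error, followed by one possible reading of "evaluates the entire predicted probability distribution". The `full_pairs` helper is an assumption made for illustration and may differ from the paper's exact definition of Full-ECE. All function names are hypothetical.

```python
def expected_calibration_error(confidences, hits, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    if len(confidences) != len(hits):
        raise ValueError("confidences and hits must have the same length")
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, hits):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def top1_pairs(distributions, labels):
    """Classic ECE input: one (max probability, was it correct) pair per prediction."""
    confs, hits = [], []
    for dist, label in zip(distributions, labels):
        predicted = max(range(len(dist)), key=dist.__getitem__)
        confs.append(dist[predicted])
        hits.append(1.0 if predicted == label else 0.0)
    return confs, hits


def full_pairs(distributions, labels):
    """Assumed full-distribution input: one pair per vocabulary entry per prediction."""
    confs, hits = [], []
    for dist, label in zip(distributions, labels):
        for token, prob in enumerate(dist):
            confs.append(prob)
            hits.append(1.0 if token == label else 0.0)
    return confs, hits


# Hypothetical next-token distributions over a 3-token vocabulary.
distributions = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
labels = [0, 2]
print(expected_calibration_error(*top1_pairs(distributions, labels)))
print(expected_calibration_error(*full_pairs(distributions, labels)))
```

The top-1 variant looks only at the highest-probability token of each prediction, whereas the full variant also scores the probability mass assigned to every other token. With a vocabulary of tens of thousands of tokens, that remaining mass is most of the distribution.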

Title: Preserving Knowledge in Large Language Model: A Model-Agnostic Self-Decompression Approach

Authors: Zilun Zhang, Yutao Sun, Tiancheng Zhao, Leigang Sha, Ruochen Xu, Kyusong Lee, Jianwei Yin
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Preserving Knowledge in Large Language Model: A Model-Agnostic Self-Decompression Approach(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Humans can retain old knowledge while learning new information, but Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data. Moreover, for Multimodal Large Language Models (MLLMs) which are composed of the LLM base and visual projector (e.g. LLaVA), a significant decline in performance on language benchmarks was observed compared to their single-modality counterparts. To address these challenges, we introduce a novel model-agnostic self-decompression method, Tree Generation (TG), that decompresses knowledge within LLMs into the training corpus. This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps. By incorporating the dumped corpus during SFT for MLLMs, we significantly reduce the forgetting problem.
摘要：人类可以在学习新信息的同时保留旧知识，但大型语言模型 (LLM) 在对特定领域数据进行后预训练或监督微调 (SFT) 时往往会出现灾难性遗忘。此外，对于由 LLM 基础和视觉投影仪（例如 LLaVA）组成的多模态大型语言模型 (MLLM)，与单模态模型相比，其在语言基准测试中的表现显著下降。为了应对这些挑战，我们引入了一种新颖的与模型无关的自解压方法，即树生成 (TG)，它将 LLM 中的知识解压到训练语料库中。本文重点介绍 TG-SFT，它可以为指令调整步骤合成生成 SFT 数据。通过在 MLLM 的 SFT 期间加入转储语料库，我们显著减少了遗忘问题。

Title: $\textit{Refiner}$: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities

Authors: Zhonghao Li, Xuming Hu, Aiwei Liu, Kening Zheng, Sirui Huang, Hui Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] $\textit{Refiner}$: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities(https://arxiv.org/abs/)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) are limited by their parametric knowledge, leading to hallucinations in knowledge-extensive tasks. To address this, Retrieval-Augmented Generation (RAG) incorporates external document chunks to expand LLM knowledge. Furthermore, compressing information from document chunks through extraction or summarization can improve LLM performance. Nonetheless, LLMs still struggle to notice and utilize scattered key information, a problem known as the "lost-in-the-middle" syndrome. Therefore, we typically need to restructure the content for the LLM to recognize the key information. We propose $\textit{Refiner}$, an end-to-end extract-and-restructure paradigm that operates in the post-retrieval process of RAG. $\textit{Refiner}$ leverages a single decoder-only LLM to adaptively extract query-relevant contents verbatim along with the necessary context, and sections them based on their interconnectedness, thereby highlighting information distinction and aligning downstream LLMs with the original context effectively. Experiments show that a trained $\textit{Refiner}$ (with 7B parameters) yields a significant gain for downstream LLMs in answer accuracy, and outperforms other state-of-the-art advanced RAG and concurrent compressing approaches in various single-hop and multi-hop QA tasks. Notably, $\textit{Refiner}$ achieves an 80.5% token reduction and a 1.6-7.0% improvement margin in multi-hop tasks compared to the next best solution. $\textit{Refiner}$ is a plug-and-play solution that can be seamlessly integrated with RAG systems, facilitating its application across diverse open-source frameworks.
摘要：大型语言模型 (LLM) 受到参数知识的限制，导致在知识密集型任务中出现幻觉。为了解决这个问题，检索增强生成 (RAG) 结合了外部文档块来扩展 LLM 知识。此外，通过提取或总结压缩文档块中的信息可以提高 LLM 性能。尽管如此，LLM 仍然难以注意到和利用分散的关键信息，这一问题被称为“迷失在中间”综合症。因此，我们通常需要重组内容，以便 LLM 识别关键信息。我们提出了 $\textit{Refiner}$，这是一种端到端的提取和重构范例，在 RAG 的后检索过程中运行。$\textit{Refiner}$ 利用单个仅解码器的 LLM 自适应地逐字提取与查询相关的内容以及必要的上下文，并根据它们的相互关联性对其进行分段，从而突出信息区别，并有效地将下游 LLM 与原始上下文对齐。实验表明，经过训练的 $\textit{Refiner}$（具有 7B 个参数）在提高下游 LLM 答案准确率方面表现出显著优势，并且在各种单跳和多跳 QA 任务中优于其他最先进的高级 RAG 和并发压缩方法。值得注意的是，与下一个最佳解决方案相比，$\textit{Refiner}$ 在多跳任务中实现了 80.5% 的标记减少和 1.6-7.0% 的改进幅度。$\textit{Refiner}$ 是一种即插即用的解决方案，可以与 RAG 系统无缝集成，从而促进其在各种开源框架中的应用。

Title: Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments

Authors: Han Zhou, Xingchen Wan, Yinhong Liu, Nigel Collier, Ivan Vulić, Anna Korhonen
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11370
Pdf URL: https://arxiv.org/pdf/2406.11370
Copy Paste: [[2406.11370]] Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments(https://arxiv.org/abs/2406.11370)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown promising abilities as cost-effective and reference-free evaluators for assessing language generation quality. In particular, pairwise LLM evaluators, which compare two generated texts and determine the preferred one, have been employed in a wide range of applications. However, LLMs exhibit preference biases and worrying sensitivity to prompt designs. In this work, we first reveal that the predictive preference of LLMs can be highly brittle and skewed, even with semantically equivalent instructions. We find that fairer predictive preferences from LLMs consistently lead to judgments that are better aligned with humans. Motivated by this phenomenon, we propose an automatic Zero-shot Evaluation-oriented Prompt Optimization framework, ZEPO, which aims to produce fairer preference decisions and improve the alignment of LLM evaluators with human judgments. To this end, we propose a zero-shot learning objective based on the preference decision fairness. ZEPO demonstrates substantial performance improvements over state-of-the-art LLM evaluators, without requiring labeled data, on representative meta-evaluation benchmarks. Our findings underscore the critical correlation between preference fairness and human alignment, positioning ZEPO as an efficient prompt optimizer for bridging the gap between LLM evaluators and human judgments.
摘要：大型语言模型 (LLM) 已显示出作为评估语言生成质量的经济高效且无参考的评估器的良好能力。特别是，成对的 LLM 评估器已在广泛的应用中使用，它可以比较两个生成的文本并确定首选文本。然而，LLM 表现出偏好偏差和令人担忧的提示设计敏感性。在这项工作中，我们首先揭示了 LLM 的预测偏好可能非常脆弱和扭曲，即使使用语义等效的指令也是如此。我们发现 LLM 更公平的预测偏好始终会导致与人类更一致的判断。受此现象的启发，我们提出了一个面向零样本评估的自动提示优化框架 ZEPO，旨在产生更公平的偏好决策并提高 LLM 评估器与人类判断的一致性。为此，我们提出了一个基于偏好决策公平性的零样本学习目标。在代表性元评估基准上，ZEPO 无需标记数据，就比最先进的 LLM 评估器表现出显着的性能改进。我们的研究结果强调了偏好公平性和人类一致性之间的关键相关性，将 ZEPO 定位为一种有效的快速优化器，用于弥合 LLM 评估者和人类判断之间的差距。

Title: Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?

Authors: Siyu Yuan, Cheng Jiayang, Lin Qiu, Deqing Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11375
Pdf URL: https://arxiv.org/pdf/2406.11375
Copy Paste: [[2406.11375]] Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?(https://arxiv.org/abs/2406.11375)
Keywords: language model
Abstract: Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human education process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available at this https URL.
摘要：类比推理在人类认知中起着至关重要的作用，使我们能够通过将新概念与熟悉的概念联系起来来理解新概念。人工智能社区以前的研究主要集中在识别和生成类比，然后在人工评估下检查其质量，而忽略了这些类比在现实世界中的实际应用。受人类教育过程的启发，在本文中，我们提出研究教师语言模型 (LM) 创建的类比如何帮助学生 LM 理解科学概念，从而更贴近实际场景。我们的结果表明，自由形式的类比确实可以帮助 LM 理解概念。此外，学生 LM 生成的类比可以提高他们自己在科学问题回答方面的表现，展示他们使用类比自学新知识的能力。资源可在此 https URL 上找到。

Title: A Realistic Evaluation of LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3

Authors: Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11380
Pdf URL: https://arxiv.org/pdf/2406.11380
Copy Paste: [[2406.11380]] A Realistic Evaluation of LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3(https://arxiv.org/abs/2406.11380)
Keywords: language model, llm
Abstract: The zero-shot and few-shot performance of Large Language Models (LLMs) is subject to memorization and data contamination, complicating the assessment of their validity. In literary tasks, the performance of LLMs is often correlated with the degree of book memorization. In this work, we carry out a realistic evaluation of LLMs for quotation attribution in novels, taking the instruction fine-tuned version of Llama3 as an example. We design a task-specific memorization measure and use it to show that Llama3's ability to perform quotation attribution is positively correlated with the degree to which the novel has been memorized. However, Llama3 still performs impressively well on books it has neither memorized nor seen. Data and code will be made publicly available.

Title: MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic

Authors: Yuyan Zhou, Liang Song, Bingning Wang, Weipeng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11385
Pdf URL: https://arxiv.org/pdf/2406.11385
Copy Paste: [[2406.11385]] MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic(https://arxiv.org/abs/2406.11385)
Keywords: language model, gpt, llm
Abstract: The advent of large language models (LLMs) like GPT-4 has catalyzed the exploration of multi-task learning (MTL), in which a single model demonstrates proficiency across diverse tasks. Task arithmetic has emerged as a cost-effective approach for MTL. It enables performance enhancement across multiple tasks by adding their corresponding task vectors to a pre-trained model. However, the current lack of a method that can simultaneously achieve optimal performance, computational efficiency, and data privacy limits the application of task arithmetic to LLMs. In this paper, we propose Model Exclusive Task Arithmetic for merging GPT-scale models (MetaGPT), which formalizes the objective of model merging into a multi-task learning framework, aiming to minimize the average loss difference between the merged model and each individual task model. Since data privacy limits the use of multi-task training data, we leverage LLMs' local linearity and task vectors' orthogonality to separate the data term and scaling coefficients term and derive a model-exclusive task arithmetic method. Our proposed MetaGPT is data-agnostic and bypasses the heavy search process, making it cost-effective and easy to implement for LLMs. Extensive experiments demonstrate that MetaGPT leads to improvements in task arithmetic and achieves state-of-the-art performance on multiple tasks.
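
The task-vector idea mentioned in the abstract can be illustrated with a minimal sketch of plain task arithmetic. This is not the MetaGPT method itself: the scaling coefficients below are hypothetical fixed values, whereas the paper derives its own. The names `task_vector` and `merge` and the toy weights are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Task vector = fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def merge(pretrained: Weights, task_vectors: List[Weights], coeffs: List[float]) -> Weights:
    """Add each task vector, scaled by its coefficient, to the pre-trained weights."""
    merged = {name: list(values) for name, values in pretrained.items()}
    for vector, coeff in zip(task_vectors, coeffs):
        for name, delta in vector.items():
            merged[name] = [m + coeff * d for m, d in zip(merged[name], delta)]
    return merged


base = {"w": [1.0, 2.0]}
finetuned_a = {"w": [1.5, 2.0]}  # model fine-tuned on a first task
finetuned_b = {"w": [1.0, 3.0]}  # model fine-tuned on a second task

vectors = [task_vector(base, finetuned_a), task_vector(base, finetuned_b)]
merged = merge(base, vectors, coeffs=[0.5, 0.5])  # coefficients are illustrative only
print(merged)  # {'w': [1.25, 2.5]}
```

Choosing the coefficients is the hard part in practice. According to the abstract, MetaGPT obtains them without multi-task training data and without a heavy search.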

Title: Large Language Models and Knowledge Graphs for Astronomical Entity Disambiguation

Authors: Golnaz Shapurian
Subjects: cs.CL, astro-ph.IM
Abstract URL: https://arxiv.org/abs/2406.11400
Pdf URL: https://arxiv.org/pdf/2406.11400
Copy Paste: [[2406.11400]] Large Language Models and Knowledge Graphs for Astronomical Entity Disambiguation(https://arxiv.org/abs/2406.11400)
Keywords: language model, gpt, llm
Abstract: This paper presents an experiment conducted during a hackathon, focusing on using large language models (LLMs) and knowledge graph clustering to extract entities and relationships from astronomical text. The study demonstrates an approach to disambiguate entities that can appear in various contexts within the astronomical domain. By collecting excerpts around specific entities and leveraging the GPT-4 language model, relevant entities and relationships are extracted. The extracted information is then used to construct a knowledge graph, which is clustered using the Leiden algorithm. The resulting Leiden communities are utilized to identify the percentage of association of unknown excerpts to each community, thereby enabling disambiguation. The experiment showcases the potential of combining LLMs and knowledge graph clustering techniques for information extraction in astronomical research. The results highlight the effectiveness of the approach in identifying and disambiguating entities, as well as grouping them into meaningful clusters based on their relationships.

Title: Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis

Authors: Neelabh Sinha, Vinija Jain, Aman Chadha
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11402
Pdf URL: https://arxiv.org/pdf/2406.11402
Copy Paste: [[2406.11402]] Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis(https://arxiv.org/abs/2406.11402)
Keywords: language model, gpt, llm, prompt
Abstract: The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging. This work conducts an in-depth experimental analysis of the semantic correctness of outputs of 10 smaller, open LMs across three aspects: task types, application domains and reasoning types, using diverse prompt styles. We demonstrate that the most effective models and prompt styles vary depending on the specific requirements. Our analysis provides a comparative assessment of LMs and prompt styles using a proposed three-tier schema of aspects for their strategic selection based on use-case and other constraints. We also show that, if utilized appropriately, these LMs can compete with, and sometimes outperform, SOTA LLMs like DeepSeek-v2, GPT-3.5-Turbo, and GPT-4o.

Title: HARE: HumAn pRiors, a key to small language model Efficiency

Authors: Lingyun Zhang, Bin jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11410
Pdf URL: https://arxiv.org/pdf/2406.11410
Copy Paste: [[2406.11410]] HARE: HumAn pRiors, a key to small language model Efficiency(https://arxiv.org/abs/2406.11410)
Keywords: language model, llm
Abstract: Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. Additionally, this provides new insights into efficient language model training in resource-constrained environments from the view of human priors.

Title: BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM

Authors: Zhewen Shen, Aditya Joshi, Ruey-Cheng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11418
Pdf URL: https://arxiv.org/pdf/2406.11418
Copy Paste: [[2406.11418]] BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM(https://arxiv.org/abs/2406.11418)
Keywords: language model
Abstract: Children from bilingual backgrounds benefit from interactions with parents and teachers to re-acquire their heritage language. In this paper, we investigate how this insight from behavioral study can be incorporated into the learning of small-scale language models. We introduce BAMBINO-LM, a continual pretraining strategy for BabyLM that uses a novel combination of alternation and PPO-based perplexity reward induced from a parent Italian model. Upon evaluation on zero-shot classification tasks for English and Italian, BAMBINO-LM improves the Italian language capability of a BabyLM baseline. Our ablation analysis demonstrates that employing both the alternation strategy and PPO-based modeling is key to this effectiveness gain. We also show that, as a side effect, the proposed method leads to similar degradation in L1 effectiveness as human children would have had in an equivalent learning scenario.

Title: A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

Authors: Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11430
Pdf URL: https://arxiv.org/pdf/2406.11430
Copy Paste: [[2406.11430]] A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression(https://arxiv.org/abs/2406.11430)
Keywords: language model, llm
Abstract: The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformers-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ norm and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ norm of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy.
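
A minimal sketch of how such a norm-based eviction rule might look is given below. It assumes that the entries with the lowest key norms are the ones to keep, which follows from the abstract's observation that low-norm keys tend to receive high attention. The function name `compress_kv`, the `keep_ratio` parameter, and the toy vectors are hypothetical, and the authors' actual per-layer and per-head procedure may differ.

```python
import math
from typing import List, Tuple

Vector = List[float]


def l2_norm(vector: Vector) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def compress_kv(
    keys: List[Vector], values: List[Vector], keep_ratio: float
) -> Tuple[List[Vector], List[Vector], List[int]]:
    """Keep the cache entries whose key embeddings have the smallest L2 norm."""
    n_keep = max(1, int(len(keys) * keep_ratio))
    # Rank positions by key norm, lowest first.
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]))
    # Restore the original token order among the kept positions.
    kept = sorted(ranked[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [5.0, 5.0]]
values = [[0.0], [1.0], [2.0], [3.0]]
small_keys, small_values, kept_positions = compress_kv(keys, values, keep_ratio=0.5)
print(kept_positions)  # [1, 2]
```

Because the rule depends only on the stored keys, it needs no attention scores at eviction time, which is the property the abstract highlights.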

Title: Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

Authors: Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11431
Pdf URL: https://arxiv.org/pdf/2406.11431
Copy Paste: [[2406.11431]] Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization(https://arxiv.org/abs/2406.11431)
Keywords: language model, llm
Abstract: Superalignment, where humans are weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). Recent work has preliminarily studied this problem by using weak models to supervise strong models. It finds that weakly supervised strong students can consistently outperform weak teachers towards the alignment target, leading to a weak-to-strong generalization phenomenon. However, we are concerned that behind such a promising phenomenon there may exist an issue of weak-to-strong deception, where strong models deceive weak models by appearing well-aligned in areas known to weak models but producing misaligned behaviors in cases weak models do not know. We then take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where there may be some alignment targets conflicting with each other (e.g., helpfulness vs. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension to gain high reward in another alignment dimension. Our experiments on both the reward modeling task and the preference optimization scenario indicate: (1) the weak-to-strong deception exists; (2) the deception phenomenon may intensify as the capability gap between weak and strong models increases. We also discuss potential solutions and find that bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.

Title: Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction

Authors: Zepeng Ding, Ruiyang Ke, Wenhao Huang, Guochao Jiang, Yanda Li, Deqing Yang, Yanghua Xiao, Jiaqing Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11455
Pdf URL: https://arxiv.org/pdf/2406.11455
Copy Paste: [[2406.11455]] Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction(https://arxiv.org/abs/2406.11455)
Keywords: language model, llm
Abstract: Existing research on large language models (LLMs) shows that they can solve information extraction tasks through multi-step planning. However, their extraction behavior on complex sentences and tasks is unstable, giving rise to issues such as false positives and missing elements. We observe that decomposing complex extraction tasks and extracting them step by step can effectively improve LLMs' performance, and that the extraction order of entities significantly affects the final results of LLMs. This paper proposes a two-stage multi-step method for LLM-based information extraction and adopts the RL framework to execute the multi-step planning. We regard sequential extraction as a Markov decision process, build an LLM-based extraction environment, design a decision module to adaptively provide the optimal order for sequential entity extraction on different sentences, and utilize the DDQN algorithm to train the decision model. We also design the rewards and evaluation metrics suitable for the extraction results of LLMs. We conduct extensive experiments on multiple public datasets to demonstrate the effectiveness of our method in improving the information extraction capabilities of LLMs.
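
For readers unfamiliar with DDQN (Double DQN), the sketch below shows the standard Double DQN learning target for a single transition: the online network selects the next action and the target network evaluates it. The mapping of actions to "which entity to extract next" is an assumption based on the abstract, and the function name and numbers are illustrative rather than taken from the paper.

```python
from typing import List


def ddqn_target(
    reward: float,
    done: bool,
    q_online_next: List[float],
    q_target_next: List[float],
    gamma: float = 0.99,
) -> float:
    """Double DQN target: r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    if done:
        # Terminal transition: nothing to bootstrap from.
        return reward
    # The online network chooses the action (here: the next entity to extract).
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # The target network supplies the value of that chosen action.
    return reward + gamma * q_target_next[best_action]


# Three candidate extraction steps in the next state (hypothetical values).
target = ddqn_target(
    reward=1.0,
    done=False,
    q_online_next=[0.2, 0.9, 0.1],
    q_target_next=[0.5, 0.3, 0.8],
    gamma=0.5,
)
print(round(target, 6))  # 1.15
```

Decoupling action selection from action evaluation is what distinguishes Double DQN from plain DQN, which would have used the target network's own maximum (0.8 here) and produced a larger, more optimistic target.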

Title: TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation

Authors: Jinyuan Fang, Zaiqiao Meng, Craig Macdonald
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11460
Pdf URL: https://arxiv.org/pdf/2406.11460
Copy Paste: [[2406.11460]] TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation(https://arxiv.org/abs/2406.11460)
Keywords: retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) offers an effective approach for addressing question answering (QA) tasks. However, the imperfections of the retrievers in RAG models often result in the retrieval of irrelevant information, which can introduce noise and degrade performance, especially when handling multi-hop questions that require multiple steps of reasoning. To enhance the multi-hop reasoning ability of RAG models, we propose TRACE. TRACE constructs knowledge-grounded reasoning chains, which are a series of logically connected knowledge triples, to identify and integrate supporting evidence from the retrieved documents for answering questions. Specifically, TRACE employs a KG Generator to create a knowledge graph (KG) from the retrieved documents, and then uses an Autoregressive Reasoning Chain Constructor to build reasoning chains. Experimental results on three multi-hop QA datasets show that TRACE achieves an average performance improvement of up to 14.03% compared to using all the retrieved documents. Moreover, the results indicate that using reasoning chains as context, rather than the entire documents, is often sufficient to correctly answer questions.

Title: Automating Easy Read Text Segmentation

Authors: Jesús Calleja, Thierry Etchegoyhen, David Ponce
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11464
Pdf URL: https://arxiv.org/pdf/2406.11464
Copy Paste: [[2406.11464]] Automating Easy Read Text Segmentation(https://arxiv.org/abs/2406.11464)
Keywords: language model
Abstract: Easy Read text is one of the main forms of access to information for people with reading difficulties. One of the key characteristics of this type of text is the requirement to split sentences into smaller grammatical segments, to facilitate reading. Automated segmentation methods could foster the creation of Easy Read content, but their viability has yet to be addressed. In this work, we study novel methods for the task, leveraging masked and generative language models, along with constituent parsing. We conduct comprehensive automatic and human evaluations in three languages, analysing the strengths and weaknesses of the proposed alternatives, under scarce resource limitations. Our results highlight the viability of automated ER segmentation and remaining deficiencies compared to expert-driven human segmentation.

Title: Promises, Outlooks and Challenges of Diffusion Language Modeling

Authors: Justin Deschenaux, Caglar Gulcehre
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11473
Pdf URL: https://arxiv.org/pdf/2406.11473
Copy Paste: [[2406.11473]] Promises, Outlooks and Challenges of Diffusion Language Modeling(https://arxiv.org/abs/2406.11473)
Keywords: language model, gpt, llm, prompt
Abstract: Modern autoregressive Large Language Models (LLMs) have achieved outstanding performance on NLP benchmarks, and they are deployed in the real world. However, they still suffer from limitations of the autoregressive training paradigm. For example, autoregressive token generation is notably slow and can be prone to exposure bias. Diffusion-based language models were proposed as an alternative to autoregressive generation to address some of these limitations. We evaluate the recently proposed Score Entropy Discrete Diffusion (SEDD) approach and show that it is a promising alternative to autoregressive generation, although it has some shortcomings too. We empirically demonstrate the advantages and challenges of SEDD, and observe that SEDD generally matches autoregressive models in perplexity and on benchmarks such as HellaSwag, Arc or WinoGrande. Additionally, we show that in terms of inference latency, SEDD can be up to 4.5$\times$ more efficient than GPT-2. While SEDD allows conditioning on tokens at arbitrary positions, SEDD appears slightly weaker than GPT-2 for conditional generation given short prompts. Finally, we reproduced the main results from the original SEDD paper.

Title: How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment

Authors: Heyan Huang, Yinghao Li, Huashan Sun, Yu Bai, Yang Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11474
Pdf URL: https://arxiv.org/pdf/2406.11474
Copy Paste: [[2406.11474]] How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment(https://arxiv.org/abs/2406.11474)
Keywords: language model, llm, prompt
Abstract: Recent studies have demonstrated that In-Context Learning (ICL), through the use of specific demonstrations, can align Large Language Models (LLMs) with human preferences, an approach known as In-Context Alignment (ICA), indicating that models can comprehend human instructions without requiring parameter adjustments. However, the exploration of the mechanism and applicability of ICA remains limited. In this paper, we begin by dividing the context text used in ICA into three categories: format, system prompt, and example. Through ablation experiments, we investigate how much each part contributes to making ICA work. We then examine how variants in these parts impact the model's alignment performance. Our findings indicate that the example part is crucial for enhancing the model's alignment capabilities, with changes in examples significantly affecting alignment performance. We also conduct a comprehensive evaluation of ICA's zero-shot capabilities in various alignment tasks. The results indicate that compared to parameter fine-tuning methods, ICA demonstrates superior performance in knowledge-based tasks and tool-use tasks. However, it still exhibits certain limitations in areas such as multi-turn dialogues and instruction following.

Title: Vocabulary Expansion for Low-resource Cross-lingual Transfer

Authors: Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11477
Pdf URL: https://arxiv.org/pdf/2406.11477
Copy Paste: [[2406.11477]] Vocabulary Expansion for Low-resource Cross-lingual Transfer(https://arxiv.org/abs/2406.11477)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers, vocabulary, and pre-training data, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, the majority of previous work has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion for LLMs in low-resource settings (i.e. languages and compute) has yet to be explored. In this paper, we investigate sample-efficient adaptation strategies from different angles, including target vocabulary size and initialization methods, and the amount of target data available for adaptation. Extensive experiments across typologically diverse languages, tasks and models show that simpler heuristic-based embedding initialization is more efficient and robust to changes in target vocabulary size and adaptation data in low-resource settings, outperforming a popular random initialization and a more sophisticated state-of-the-art approach that relies on external data and model.

Title: Analysing zero-shot temporal relation extraction on clinical notes using temporal consistency

Authors: Vasiliki Kougia, Anastasiia Sedova, Andreas Stephan, Klim Zaporojets, Benjamin Roth
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11486
Pdf URL: https://arxiv.org/pdf/2406.11486
Copy Paste: [[2406.11486]] Analysing zero-shot temporal relation extraction on clinical notes using temporal consistency(https://arxiv.org/abs/2406.11486)
Keywords: gpt, llm, prompt
Abstract: This paper presents the first study of temporal relation extraction in a zero-shot setting focusing on biomedical text. We employ two types of prompts and five LLMs (GPT-3.5, Mixtral, Llama 2, Gemma, and PMC-LLaMA) to obtain responses about the temporal relations between two events. Our experiments demonstrate that LLMs struggle in the zero-shot setting, performing worse than fine-tuned specialized models in terms of F1 score, showing that this is a challenging task for LLMs. We further contribute a novel comprehensive temporal analysis by calculating consistency scores for each LLM. Our findings reveal that LLMs face challenges in providing responses consistent with the temporal properties of uniqueness and transitivity. Moreover, we study the relation between the temporal consistency of an LLM and its accuracy, and whether the latter can be improved by solving temporal inconsistencies. Our analysis shows that even when temporal consistency is achieved, the predictions can remain inaccurate.
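
The two temporal properties named in the abstract can be made concrete with a small sketch. It rests on an assumed reading of the terms: uniqueness means that exactly one relation is affirmed for each event pair, and transitivity means that A before B and B before C imply A before C. The paper's exact definitions and scoring may differ, and the event names, relation labels, and function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def uniqueness_violations(answers: Dict[Pair, Set[str]]) -> List[Pair]:
    """Event pairs for which the model affirmed zero or several relations."""
    return sorted(pair for pair, relations in answers.items() if len(relations) != 1)


def transitivity_violations(before: Set[Pair]) -> List[Tuple[str, str, str]]:
    """Triples (a, b, c) with a<b and b<c predicted, but a<c missing."""
    events = sorted({event for pair in before for event in pair})
    return [
        (a, b, c)
        for a, b, c in permutations(events, 3)
        if (a, b) in before and (b, c) in before and (a, c) not in before
    ]


# Relations the model affirmed per event pair (toy clinical example).
answers = {
    ("admission", "surgery"): {"BEFORE"},
    ("surgery", "discharge"): {"BEFORE", "AFTER"},  # contradictory answers
}
# Pairs predicted as BEFORE; ("admission", "discharge") is missing.
before = {("admission", "surgery"), ("surgery", "discharge")}

print(uniqueness_violations(answers))   # [('surgery', 'discharge')]
print(transitivity_violations(before))  # [('admission', 'surgery', 'discharge')]
```

A consistency score could then be defined as the share of pairs or triples without violations. That definition is also an assumption, not the paper's stated metric.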

Title: CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG

Authors: Boyi Deng, Wenjie Wang, Fengbin Zhu, Qifan Wang, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11497
Pdf URL: https://arxiv.org/pdf/2406.11497
Copy Paste: [[2406.11497]] CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG(https://arxiv.org/abs/2406.11497)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) can alleviate hallucinations of Large Language Models (LLMs) by referencing external documents. However, the misinformation in external documents may mislead LLMs' generation. To address this issue, we explore the task of "credibility-aware RAG", in which LLMs automatically adjust the influence of retrieved documents based on their credibility scores to counteract misinformation. To this end, we introduce a plug-and-play method named Credibility-aware Attention Modification (CrAM). CrAM identifies influential attention heads in LLMs and adjusts their attention scores based on the credibility of the documents, thereby reducing the impact of low-credibility documents. Experiments on Natural Questions and TriviaQA using Llama2-13B, Llama3-8B, and Qwen-7B show that CrAM improves the RAG performance of LLMs against misinformation pollution by over 20%, even surpassing supervised fine-tuning methods.
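
One way to picture the attention adjustment described above is the sketch below. It scales the attention weight on each token by the credibility score of the document the token came from, then renormalizes the row. This is an illustrative assumption rather than the CrAM procedure: the abstract does not spell out the scaling rule or how influential heads are selected, and the function name, document IDs, and scores are hypothetical.

```python
from typing import Dict, List, Optional


def reweight_attention(
    attention: List[float],
    token_doc: List[Optional[str]],
    credibility: Dict[str, float],
) -> List[float]:
    """Scale per-token attention by document credibility, then renormalize."""
    scaled = [
        # Tokens outside any retrieved document (doc is None) keep their weight.
        weight if doc is None else weight * credibility.get(doc, 1.0)
        for weight, doc in zip(attention, token_doc)
    ]
    total = sum(scaled)
    if total == 0.0:
        raise ValueError("all attention mass was removed")
    return [weight / total for weight in scaled]


# One attention row over three tokens: two document tokens and one question token.
attention = [0.25, 0.25, 0.5]
token_doc = ["doc1", "doc2", None]
credibility = {"doc1": 1.0, "doc2": 0.0}  # doc2 is treated as not credible

adjusted = reweight_attention(attention, token_doc, credibility)
print([round(weight, 3) for weight in adjusted])  # [0.333, 0.0, 0.667]
```

In this toy row the low-credibility document loses all of its attention mass, and the remaining weights are rescaled so that the row still sums to one.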

Title: Counterfactual Debating with Preset Stances for Hallucination Elimination of LLMs

Authors: Yi Fang, Moxin Li, Wenjie Wang, Hui Lin, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11514
Pdf URL: https://arxiv.org/pdf/2406.11514
Copy Paste: [[2406.11514]] Counterfactual Debating with Preset Stances for Hallucination Elimination of LLMs(https://arxiv.org/abs/2406.11514)
Keywords: language model, llm, hallucination, agent
Abstract: Large Language Models (LLMs) excel in various natural language processing tasks but struggle with hallucination issues. Existing solutions have considered utilizing LLMs' inherent reasoning abilities to alleviate hallucination, such as self-correction and diverse sampling methods. However, these methods often overtrust LLMs' initial answers due to inherent biases. The key to alleviating this issue lies in overriding LLMs' inherent biases for answer inspection. To this end, we propose a CounterFactual Multi-Agent Debate (CFMAD) framework. CFMAD presets the stances of LLMs to override their inherent biases by compelling LLMs to generate justifications for a predetermined answer's correctness. The LLMs with different predetermined stances are engaged with a skeptical critic for counterfactual debate on the rationality of generated justifications. Finally, the debate process is evaluated by a third-party judge to determine the final answer. Extensive experiments on four datasets of three tasks demonstrate the superiority of CFMAD over existing methods.

Title: Input Conditioned Graph Generation for Language Agents

Authors: Lukas Vierling, Jie Fu, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11555
Pdf URL: https://arxiv.org/pdf/2406.11555
Copy Paste: [[2406.11555]] Input Conditioned Graph Generation for Language Agents(https://arxiv.org/abs/2406.11555)
Keywords: language model, llm, agent
Abstract: Recent progress in Large Language Models (LLMs) and language agents has demonstrated significant promise for various future applications across multiple disciplines. While traditional approaches to language agents often rely on fixed, handcrafted designs, our research aims to develop both learnable and dynamic agents. Our method uses an existing framework that abstracts language agents as graphs. Within this graph framework, we aim to learn a model that can generate edges for every given input to the language agent. This allows us to generate edges that represent the flow of communication within the graph based on the given input, thereby adjusting the internal communication of a language agent. We learn to generate these edges using a pretrained LLM that is fine-tuned with reinforcement learning. This LLM can be fine-tuned on several datasets simultaneously, and we hypothesize that the model learns to adapt to these different domains during training, achieving good overall performance when encountering data from different domains during deployment. We demonstrate that our approach surpasses the previous static approach by nearly 6% accuracy on a combined dataset of MMLU and CMMLU, and by more than 10% when trained with a sparsity-inducing loss. It also performs better in additional experiments conducted with the MMLU and Mini Crossword Puzzles datasets. The code is available at this https URL.

Title: Extrinsic Evaluation of Cultural Competence in Large Language Models

Authors: Shaily Bhatt, Fernando Diaz
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2406.11565
Pdf URL: https://arxiv.org/pdf/2406.11565
Copy Paste: [[2406.11565]] Extrinsic Evaluation of Cultural Competence in Large Language Models(https://arxiv.org/abs/2406.11565)
Keywords: language model, prompt
Abstract: Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive. Prior works have evaluated models' knowledge of cultural norms, values, and artifacts, without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks, open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary across nationalities and feature culturally relevant words, we also find weak correlations between text similarity of outputs for different countries and the cultural values of these countries. Finally, we discuss important considerations in designing comprehensive evaluation of cultural competence in user-facing tasks.

Title: MEMLA: Enhancing Multilingual Knowledge Editing with Neuron-Masked Low-Rank Adaptation

Authors: Jiakuan Xie, Pengfei Cao, Yuheng Chen, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11566
Pdf URL: https://arxiv.org/pdf/2406.11566
Copy Paste: [[2406.11566]] MEMLA: Enhancing Multilingual Knowledge Editing with Neuron-Masked Low-Rank Adaptation(https://arxiv.org/abs/2406.11566)
Keywords: language model, llm
Abstract: Knowledge editing aims to adjust the knowledge within large language models (LLMs) to prevent their responses from becoming obsolete or inaccurate. However, existing works on knowledge editing are primarily conducted in a single language, which is inadequate for multilingual language models. In this paper, we focus on multilingual knowledge editing (MKE), which requires propagating updates across multiple languages. This necessity poses a significant challenge for the task. Furthermore, the limited availability of a comprehensive dataset for MKE exacerbates this challenge, hindering progress in this area. Hence, we introduce the Multilingual Knowledge Editing Benchmark (MKEB), a novel dataset comprising 12 languages and providing a complete evaluation framework. Additionally, we propose a method that enhances Multilingual knowledge Editing with neuron-Masked Low-Rank Adaptation (MEMLA). Specifically, we identify two categories of knowledge neurons to improve editing precision. Moreover, we perform LoRA-based editing with neuron masks to efficiently modify parameters and facilitate the propagation of updates across multiple languages. Experiments demonstrate that our method outperforms existing baselines and significantly enhances the multi-hop reasoning capability of the edited model, with minimal impact on its downstream task performance. The dataset and code will be made publicly available.

Title: Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models

Authors: Sheng Feng, Heyang Liu, Yu Wang, Yanfeng Wang
Subjects: cs.CL, cs.SD, eess.AS, q-bio.NC
Abstract URL: https://arxiv.org/abs/2406.11568
Pdf URL: https://arxiv.org/pdf/2406.11568
Copy Paste: [[2406.11568]] Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models(https://arxiv.org/abs/2406.11568)
Keywords: language model, llm
Abstract: In this paper, we introduce a groundbreaking end-to-end (E2E) framework for decoding invasive brain signals, marking a significant advancement in the field of speech neuroprosthesis. Our methodology leverages the comprehensive reasoning abilities of large language models (LLMs) to facilitate direct decoding. By fully integrating LLMs, we achieve results comparable to the state-of-the-art cascade models. Our findings underscore the immense potential of E2E frameworks in speech neuroprosthesis, particularly as the technology behind brain-computer interfaces (BCIs) and the availability of relevant datasets continue to evolve. This work not only showcases the efficacy of combining LLMs with E2E decoding for enhancing speech neuroprosthesis but also sets a new direction for future research in BCI applications, underscoring the impact of LLMs in decoding complex neural signals for communication restoration. Code will be made available at this https URL.

Title: Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Authors: Yihuai Hong, Lei Yu, Shauli Ravfogel, Haiqin Yang, Mor Geva
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11614
Pdf URL: https://arxiv.org/pdf/2406.11614
Copy Paste: [[2406.11614]] Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces(https://arxiv.org/abs/2406.11614)
Keywords: language model, llm
Abstract: The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance for mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parametric-based evaluations. To support this, we release our code and benchmark at this https URL.

Title: Building Knowledge-Guided Lexica to Model Cultural Variation

Authors: Shreya Havaldar, Salvatore Giorgi, Sunny Rai, Thomas Talhelm, Sharath Chandra Guntuku, Lyle Ungar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11622
Pdf URL: https://arxiv.org/pdf/2406.11622
Copy Paste: [[2406.11622]] Building Knowledge-Guided Lexica to Model Cultural Variation(https://arxiv.org/abs/2406.11622)
Keywords: llm
Abstract: Cultural variation exists between nations (e.g., the United States vs. China), but also within regions (e.g., California vs. Texas, Los Angeles vs. San Francisco). Measuring this regional cultural variation can illuminate how and why people think and behave differently. Historically, it has been difficult to computationally model cultural variation due to a lack of training data and scalability constraints. In this work, we introduce a new research problem for the NLP community: How do we measure variation in cultural constructs across regions using language? We then provide a scalable solution: building knowledge-guided lexica to model cultural variation, encouraging future work at the intersection of NLP and cultural understanding. We also highlight modern LLMs' failure to measure cultural variation or generate culturally varied language.

Title: Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

Authors: Mingyang Song, Mao Zheng, Xuan Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!(https://arxiv.org/abs/)
Keywords: language model, gpt, llm, prompt
Abstract: Leveraging Large Language Models (LLMs) as judges for evaluating the performance of LLMs has recently garnered attention. Nonetheless, this type of approach concurrently introduces potential biases from LLMs, raising concerns about the reliability of the evaluation results. To mitigate this issue, we propose and study two versions of many-shot in-context prompts, Reinforced and Unsupervised ICL, for helping GPT-4o-as-a-Judge in single answer grading. Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the agreement and quality of the evaluation. Furthermore, we first reveal the symbol bias in GPT-4o-as-a-Judge for pairwise comparison and then propose a simple yet effective approach to mitigate it. Experimental results show that advanced long-context LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. Meanwhile, the experimental results further verify the effectiveness of the symbol bias mitigation approach.

Title: The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance

Authors: Kyle Moore, Jesse Roberts, Thao Pham, Oseremhen Ewaleifoh, Doug Fisher
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11634
Pdf URL: https://arxiv.org/pdf/2406.11634
Copy Paste: [[2406.11634]] The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance(https://arxiv.org/abs/2406.11634)
Keywords: language model, llm, prompt
Abstract: Cloze testing is a common method for measuring the behavior of large language models on a number of benchmark tasks. Using the MMLU dataset, we show that the base-rate probability (BRP) differences across answer tokens are significant and affect task performance, i.e., guess A if uncertain. We find that counterfactual prompting does sufficiently mitigate the BRP effect. The BRP effect is found to have a similar effect to test-taking strategies employed by humans, leading to the conflation of task performance and test-taking ability. We propose the Nvr-X-MMLU task, a variation of MMLU, which helps to disambiguate test-taking ability from task performance and reports the latter.

Title: A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4

Authors: Ming Gu, Yan Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11651
Pdf URL: https://arxiv.org/pdf/2406.11651
Copy Paste: [[2406.11651]] A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4(https://arxiv.org/abs/2406.11651)
Keywords: language model, gpt, llm, prompt
Abstract: Dialogue state tracking (DST) is evaluated by exact matching methods, which rely on large amounts of labeled data and ignore semantic consistency, leading to over-evaluation. Currently, leveraging large language models (LLMs) in evaluating natural language processing tasks has achieved promising results. However, using LLMs for DST evaluation is still underexplored. In this paper, we propose a two-dimensional zero-shot evaluation method for DST using GPT-4, which divides the evaluation into two dimensions: accuracy and completeness. Furthermore, we also design two manual reasoning paths in prompting to further improve the accuracy of evaluation. Experimental results show that our method achieves better performance compared to the baselines, and is consistent with traditional exact matching based methods.

Title: Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming

Authors: Vernon Toh Yan Han, Rishabh Bhardwaj, Soujanya Poria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11654
Pdf URL: https://arxiv.org/pdf/2406.11654
Copy Paste: [[2406.11654]] Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming(https://arxiv.org/abs/2406.11654)
Keywords: prompt
Abstract: We propose Ruby Teaming, a method that improves on Rainbow Teaming by including a memory cache as its third dimension. The memory dimension provides cues to the mutator to yield better-quality prompts, both in terms of attack success rate (ASR) and quality diversity. The prompt archive generated by Ruby Teaming has an ASR of 74%, which is 20% higher than the baseline. In terms of quality diversity, Ruby Teaming outperforms Rainbow Teaming by 6% and 3% on Shannon's Evenness Index (SEI) and Simpson's Diversity Index (SDI), respectively.
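For readers unfamiliar with the two diversity metrics named in the Ruby Teaming abstract, the sketch below computes them in their standard textbook forms: Shannon's Evenness Index as entropy divided by the log of the number of categories, and Simpson's Diversity Index as one minus the sum of squared category proportions. How the paper bins prompts into categories is not stated in the abstract, so the input labels and function names are assumptions for illustration, and the paper may normalize differently (for example, by the number of possible rather than observed categories).

```python
import math
from collections import Counter


def shannon_evenness(labels):
    """Shannon entropy of the category distribution divided by ln(#observed categories)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if len(counts) < 2:
        # Evenness is undefined for a single category; 0.0 is a convention chosen here.
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def simpson_diversity(labels):
    """One minus the probability that two draws (with replacement) share a category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


if __name__ == "__main__":
    # Hypothetical category labels for an archive of generated prompts.
    archive = ["fraud", "fraud", "violence", "privacy", "privacy", "privacy"]
    print(round(shannon_evenness(archive), 3), round(simpson_diversity(archive), 3))
```

Both indices increase as prompts spread more evenly across categories, which is why they are used as measures of archive diversity.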

Title: Can LLM be a Personalized Judge?

Authors: Yijiang River Dong, Tiancheng Hu, Nigel Collier
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2406.11657
Pdf URL: https://arxiv.org/pdf/2406.11657
Copy Paste: [[2406.11657]] Can LLM be a Personalized Judge?(https://arxiv.org/abs/2406.11657)
Keywords: language model, llm
Abstract: Ensuring that large language models (LLMs) reflect diverse user values and preferences is crucial as their user bases expand globally. It is therefore encouraging to see the growing interest in LLM personalization within the research community. However, current works often rely on the LLM-as-a-Judge approach for evaluation without thoroughly examining its validity. In this paper, we investigate the reliability of LLM-as-a-Personalized-Judge, asking LLMs to judge user preferences based on personas. Our findings suggest that directly applying LLM-as-a-Personalized-Judge is less reliable than previously assumed, showing low and inconsistent agreement with human ground truth. The personas typically used are often overly simplistic, resulting in low predictive power. To address these issues, we introduce verbal uncertainty estimation into the LLM-as-a-Personalized-Judge pipeline, allowing the model to express low confidence on uncertain judgments. This adjustment leads to much higher agreement (above 80%) on high-certainty samples for binary tasks. Through human evaluation, we find that the LLM-as-a-Personalized-Judge achieves comparable performance to third-party human evaluation and even surpasses human performance on high-certainty samples. Our work indicates that certainty-enhanced LLM-as-a-Personalized-Judge offers a promising direction for developing more reliable and scalable methods for evaluating LLM personalization.
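The idea of reporting agreement only on high-certainty judgments, as described in the abstract above, can be sketched as a simple filter followed by an accuracy computation. The record layout (`pred`, `gold`, `certainty`), the numeric certainty scale, and the threshold are hypothetical; the abstract does not specify how verbal certainty is encoded.

```python
def high_certainty_agreement(records, threshold):
    """Agreement with human labels on the subset the judge marked as certain.

    records:   list of dicts with keys "pred", "gold", and "certainty"
               (a hypothetical layout; certainty is assumed to be numeric).
    threshold: minimum certainty for a sample to be kept.
    Returns (agreement, coverage), or (None, 0.0) if no sample passes the filter.
    """
    kept = [r for r in records if r["certainty"] >= threshold]
    if not kept:
        return None, 0.0
    agreement = sum(r["pred"] == r["gold"] for r in kept) / len(kept)
    coverage = len(kept) / len(records)
    return agreement, coverage


if __name__ == "__main__":
    judged = [
        {"pred": "A", "gold": "A", "certainty": 5},
        {"pred": "B", "gold": "A", "certainty": 2},
        {"pred": "B", "gold": "B", "certainty": 4},
        {"pred": "A", "gold": "B", "certainty": 1},
    ]
    print(high_certainty_agreement(judged, threshold=4))
```

Reporting coverage next to agreement makes the trade-off visible: a stricter threshold can raise agreement while shrinking the share of samples that are actually judged.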

Title: Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting

Authors: Sagnik Mukherjee, Muhammad Farid Adilazuarda, Sunayana Sitaram, Kalika Bali, Alham Fikri Aji, Monojit Choudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11661
Pdf URL: https://arxiv.org/pdf/2406.11661
Copy Paste: [[2406.11661]] Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting(https://arxiv.org/abs/2406.11661)
Keywords: gpt, llm, prompt
Abstract: Socio-demographic prompting is a commonly employed approach to study cultural biases in LLMs as well as for aligning models to certain cultures. In this paper, we systematically probe four LLMs (Llama 3, Mistral v0.2, GPT-3.5 Turbo and GPT-4) with prompts that are conditioned on culturally sensitive and non-sensitive cues, on datasets that are supposed to be culturally sensitive (EtiCor and CALI) or neutral (MMLU and ETHICS). We observe that all models except GPT-4 show significant variations in their responses on both kinds of datasets for both kinds of prompts, casting doubt on the robustness of culturally conditioned prompting as a method for eliciting cultural bias in models or as an alignment strategy. The work also calls for rethinking the control experiment design to tease apart the cultural conditioning of responses from the "placebo effect", i.e., random perturbations of model responses due to arbitrary tokens in the prompt.

Title: See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Authors: Amith Ananthram, Elias Stengel-Eskin, Carl Vondrick, Mohit Bansal, Kathleen McKeown
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2406.11665
Pdf URL: https://arxiv.org/pdf/2406.11665
Copy Paste: [[2406.11665]] See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding(https://arxiv.org/abs/2406.11665)
Keywords: language model, prompt
Abstract: Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from Eastern cultures attend more to scene context. In this work, we present a novel investigation that demonstrates and localizes VLMs' Western bias in image understanding. We evaluate large VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western subset than the Eastern subset of each task. Controlled experimentation tracing the source of this bias highlights the importance of a diverse language mix in text-only pre-training for building equitable VLMs, even when inference is performed in English. Moreover, while prompting in the language of a target culture can lead to reductions in bias, it is not a substitute for building AI more representative of the world's languages.

Title: "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak

Authors: Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Jiayi Mao, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak(https://arxiv.org/abs/)
Keywords: language model, llm, hallucination, prompt
Abstract: "Jailbreak" is a major safety concern of Large Language Models (LLMs), which occurs when malicious prompts lead LLMs to produce harmful outputs, raising issues about the reliability and safety of LLMs. Therefore, an effective evaluation of jailbreaks is crucial to developing mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations, i.e., erroneous outputs that are mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the $\textbf{B}$enchmark for reli$\textbf{AB}$ilit$\textbf{Y}$ and jail$\textbf{B}$reak ha$\textbf{L}$l$\textbf{U}$cination $\textbf{E}$valuation (BabyBLUE). BabyBLUE introduces a specialized validation framework including various evaluators to enhance existing jailbreak benchmarks, ensuring outputs are useful malicious instructions. Additionally, BabyBLUE presents a new dataset as an augmentation to the existing red teaming benchmarks, specifically addressing hallucinations in jailbreaks, aiming to evaluate the true potential of jailbroken LLM outputs to cause harm to human society.

Title: Benchmarking of LLM Detection: Comparing Two Competing Approaches

Authors: Thorsten Pröhl, Erik Putzier, Rüdiger Zarnekow
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11670
Pdf URL: https://arxiv.org/pdf/2406.11670
Copy Paste: [[2406.11670]] Benchmarking of LLM Detection: Comparing Two Competing Approaches(https://arxiv.org/abs/2406.11670)
Keywords: gpt, llm, chat
Abstract: This article gives an overview of the field of LLM text recognition. Different approaches and implemented detectors for the recognition of LLM-generated text are presented. In addition to discussing the implementations, the article focuses on benchmarking the detectors. Although there are numerous software products for the recognition of LLM-generated text, with a focus on ChatGPT-like LLMs, the quality of the recognition (recognition rate) is not clear. Furthermore, while it can be seen that scientific contributions presenting their novel approaches strive for some kind of comparison with other approaches, the construction and independence of the evaluation dataset is often not comprehensible. As a result, discrepancies in the performance evaluation of LLM detectors are often visible due to the different benchmarking datasets. This article describes the creation of an evaluation dataset and uses this dataset to investigate the different detectors. The selected detectors are benchmarked against each other.

Title: Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference

Authors: Donghyeon Joo, Ramyad Hadidi, Soheil Feizi, Bahar Asgari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11674
Pdf URL: https://arxiv.org/pdf/2406.11674
Copy Paste: [[2406.11674]] Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference(https://arxiv.org/abs/2406.11674)
Keywords: language model, llm
Abstract: The increasing size of large language models (LLMs) challenges their usage on resource-constrained platforms. For example, memory on modern GPUs is insufficient to hold LLMs that are hundreds of Gigabytes in size. Offloading is a popular method to escape this constraint by storing weights of an LLM model to host CPU memory and SSD, then loading each weight to GPU before every use. In our case study of offloaded inference, we found that due to the low bandwidth between storage devices and GPU, the latency of transferring large model weights from its offloaded location to GPU memory becomes the critical bottleneck with actual compute taking nearly 0% of runtime. To effectively reduce the weight transfer latency, we propose a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights to non-zero values with high compression ratio and low decompression overhead. Endor achieves this by expressing the positions of non-zero elements with a bitmap. Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x. When direct weight transfer from SSD to GPU is leveraged, Endor achieves 2.25x speedup on OPT-66B and 2.37x speedup on Llama2-70B.
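The bitmap idea mentioned in the Endor abstract, where the positions of non-zero elements are recorded in a bitmap and only the non-zero values are stored, can be illustrated with a minimal pure-Python sketch. This is not Endor's actual on-disk layout or decompression kernel, which the abstract does not describe; the function names, the one-bit-per-element little-endian bit order, and the flat list of values are assumptions for illustration.

```python
def compress(weights):
    """Split a pruned weight row into a position bitmap and its non-zero values."""
    bitmap = bytearray((len(weights) + 7) // 8)  # one bit per element
    values = []
    for i, w in enumerate(weights):
        if w != 0.0:
            bitmap[i // 8] |= 1 << (i % 8)  # mark position i as non-zero
            values.append(w)
    return bytes(bitmap), values, len(weights)


def decompress(bitmap, values, length):
    """Rebuild the dense row by walking the bitmap and consuming values in order."""
    dense = [0.0] * length
    remaining = iter(values)
    for i in range(length):
        if (bitmap[i // 8] >> (i % 8)) & 1:
            dense[i] = next(remaining)
    return dense


if __name__ == "__main__":
    row = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 3.0]
    bitmap, values, length = compress(row)
    assert decompress(bitmap, values, length) == row
    print(list(bitmap), values)
```

The bitmap costs a fixed one bit per element regardless of sparsity, so the transferred size is roughly the non-zero values plus that small fixed overhead, which is the property that makes such a format attractive when transfer bandwidth is the bottleneck.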

Title: R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models

Authors: Shangqing Tu, Yuanchun Wang, Jifan Yu, Yuyang Xie, Yaran Shi, Xiaozhi Wang, Jing Zhang, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11681
Pdf URL: https://arxiv.org/pdf/2406.11681
Copy Paste: [[2406.11681]] R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models(https://arxiv.org/abs/2406.11681)
Keywords: language model, llm
Abstract: Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems. Recently, various Retrieval-Augmented Large Language Models (RALLMs) have been proposed to address this shortcoming. However, existing evaluation tools only provide a few baselines and evaluate them on various domains without mining the depth of domain knowledge. In this paper, we address the challenges of evaluating RALLMs by introducing the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAG workflows in conjunction with LLMs. Our toolkit, which supports popular built-in RAG workflows and allows for the incorporation of customized testing data on the specific domain, is designed to be user-friendly, modular, and extensible. We conduct an evaluation of 21 RALLMs across three task levels and two representative domains, revealing significant variations in the effectiveness of RALLMs across different tasks and domains. Our analysis emphasizes the importance of considering both task and domain requirements when choosing a RAG workflow and LLM combination. We are committed to continuously maintaining our platform at this https URL to support both industry and researchers.

Title: Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack

Authors: Shangqing Tu, Zhuoran Pan, Wenxuan Wang, Zhexin Zhang, Yuliang Sun, Jifan Yu, Hongning Wang, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2406.11682
Pdf URL: https://arxiv.org/pdf/2406.11682
Copy Paste: [[2406.11682]] Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack(https://arxiv.org/abs/2406.11682)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been increasingly applied to various domains, which triggers increasing concerns about LLMs' safety on specialized domains, e.g. medicine. However, testing the domain-specific safety of LLMs is challenging due to the lack of domain knowledge-driven attacks in existing benchmarks. To bridge this gap, we propose a new task, knowledge-to-jailbreak, which aims to generate jailbreaks from domain knowledge to evaluate the safety of LLMs when applied to those domains. We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs and fine-tune a large language model as jailbreak-generator, to produce domain knowledge-specific jailbreaks. Experiments on 13 domains and 8 target LLMs demonstrate the effectiveness of jailbreak-generator in generating jailbreaks that are both relevant to the given knowledge and harmful to the target LLMs. We also apply our method to an out-of-domain knowledge base, showing that jailbreak-generator can generate jailbreaks that are comparable in harmfulness to those crafted by human experts. Data and code: this https URL.

Title: HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing

Authors: Jing Chen, Xinyu Zhu, Cheng Yang, Chufan Shi, Yadong Xi, Yuxiang Zhang, Junjie Wang, Jiashu Pu, Rongsheng Zhang, Yujiu Yang, Tian Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11683
Pdf URL: https://arxiv.org/pdf/2406.11683
Copy Paste: [[2406.11683]] HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing(https://arxiv.org/abs/2406.11683)
Keywords: language model, llm
Abstract: Generative AI has demonstrated unprecedented creativity in the field of computer vision, yet such phenomena have not been observed in natural language processing. In particular, large language models (LLMs) can hardly produce written works at the level of human experts due to the extremely high complexity of literature writing. In this paper, we present HoLLMwood, an automated framework for unleashing the creativity of LLMs and exploring their potential in screenwriting, which is a highly demanding task. Mimicking the human creative process, we assign LLMs to different roles involved in the real-world scenario. In addition to the common practice of treating LLMs as Writer, we also apply LLMs as Editor, who is responsible for providing feedback and revision advice to Writer. Besides, to enrich the characters and deepen the plots, we introduce a role-playing mechanism and adopt LLMs as Actors that can communicate and interact with each other. Evaluations on automatically generated screenplays show that HoLLMwood substantially outperforms strong baselines in terms of coherence, relevance, interestingness and overall quality.

Title: Tokenization Falling Short: The Curse of Tokenization

Authors: Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11687
Pdf URL: https://arxiv.org/pdf/2406.11687
Copy Paste: [[2406.11687]] Tokenization Falling Short: The Curse of Tokenization(https://arxiv.org/abs/2406.11687)
Keywords: language model, llm
Abstract: Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations and largely oblivious to the internal structure of tokens, issues we term the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.

Title: Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs

Authors: Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, Omar Khattab
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11695
Pdf URL: https://arxiv.org/pdf/2406.11695
Copy Paste: [[2406.11695]] Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs(https://arxiv.org/abs/2406.11695)
Keywords: language model, prompt
Abstract: Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel optimizer that outperforms baselines on five of six diverse LM programs using a best-in-class open-source model (Llama-3-8B), by as much as 12.9% accuracy. We will release our new optimizers and benchmark in DSPy at this https URL.

Title: Meta Reasoning for Large Language Models

Authors: Peizhong Gao, Ao Xie, Shaoguang Mao, Wenshan Wu, Yan Xia, Haipeng Mi, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11698
Pdf URL: https://arxiv.org/pdf/2406.11698
Copy Paste: [[2406.11698]] Meta Reasoning for Large Language Models(https://arxiv.org/abs/2406.11698)
Keywords: language model, llm, prompt, tree-of-thought
Abstract: We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs) inspired by human meta-reasoning. Traditional in-context learning-based reasoning techniques, such as Tree-of-Thoughts, show promise but lack consistent state-of-the-art performance across diverse tasks due to their specialized nature. MRP addresses this limitation by guiding LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task, optimizing both performance and computational efficiency. With MRP, LLM reasoning operates in two phases. Initially, the LLM identifies the most appropriate reasoning method using task input cues and objective descriptions of available methods. Subsequently, it applies the chosen method to complete the task. This dynamic strategy mirrors human meta-reasoning, allowing the model to excel in a wide range of problem domains. We evaluate the effectiveness of MRP through comprehensive benchmarks. The results demonstrate that MRP achieves or approaches state-of-the-art performance across diverse tasks. MRP represents a significant advancement in enabling LLMs to identify cognitive challenges across problems and leverage benefits across different reasoning approaches, enhancing their ability to handle diverse and complex problem domains efficiently. Every LLM deserves a Meta-Reasoning Prompting to unlock its full potential and ensure adaptability in an ever-evolving landscape of challenges and applications.
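The two-phase procedure described in the MRP abstract (first select a reasoning method, then apply it) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' prompts or code: the function name `meta_reasoning_prompt`, the prompt wording, the fallback rule, and the `llm` callable (any function mapping a prompt string to a completion string) are all hypothetical.

```python
from typing import Callable, Dict


def meta_reasoning_prompt(
    task: str,
    methods: Dict[str, str],
    llm: Callable[[str], str],
) -> str:
    """Two calls to `llm`: pick a reasoning method, then solve the task with it."""
    # Phase 1: show the task and the method descriptions, ask for a method name.
    menu = "\n".join(f"- {name}: {desc}" for name, desc in methods.items())
    choice = llm(
        f"Task: {task}\nAvailable reasoning methods:\n{menu}\n"
        "Reply with the name of the most suitable method."
    ).strip()
    if choice not in methods:
        # Assumed fallback: use the first listed method if the reply is not a known name.
        choice = next(iter(methods))
    # Phase 2: apply the chosen method's instructions to the task.
    return llm(f"Use the following method: {methods[choice]}\nTask: {task}")
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client library.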

Title: Nemotron-4 340B Technical Report

Authors: Nvidia: Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick LeGresley, Hui Li, Jiwei Liu, Zihan Liu, Eileen Long, Ameya Sunil Mahabaleshwarkar, Somshubra Majumdar, James Maki, Miguel Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan, Sean Narenthiran, Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti, Christopher Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai Prabhumoye, Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Sabavat, Sanjeev Satheesh, Jane Polak Scowcroft, Jason Sewall, Pavel Shamis, Gerald Shen, Mohammad Shoeybi, Dave Sizer, Misha Smelyanskiy, Felipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang Sun, Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Jiaqi Zeng, Jimmy Zhang, Jing Zhang, Vivienne Zhang, Yian Zhang, Chen Zhu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11704
Pdf URL: https://arxiv.org/pdf/2406.11704
Copy Paste: [[2406.11704]] Nemotron-4 340B Technical Report(https://arxiv.org/abs/2406.11704)
Keywords: language model
Abstract: We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and their outputs. These models perform competitively with open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.

Title: Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging

Authors: Priyanka Kargupta, Ishika Agarwal, Dilek Hakkani-Tur, Jiawei Han
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2406.11709
Pdf URL: https://arxiv.org/pdf/2406.11709
Copy Paste: [[2406.11709]] Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging(https://arxiv.org/abs/2406.11709)
Keywords: language model, llm, agent
Abstract: Socratic questioning is an effective teaching strategy, encouraging critical thinking and problem-solving. The conversational capabilities of large language models (LLMs) show great potential for providing scalable, real-time student guidance. However, current LLMs often give away solutions directly, making them ineffective instructors. We tackle this issue in the code debugging domain with TreeInstruct, an Instructor agent guided by a novel state space-based planning algorithm. TreeInstruct asks probing questions to help students independently identify and resolve errors. It estimates a student's conceptual and syntactical knowledge to dynamically construct a question tree based on their responses and current knowledge state, effectively addressing both independent and dependent mistakes concurrently in a multi-turn interaction setting. In addition to using an existing single-bug debugging benchmark, we construct a more challenging multi-bug dataset of 150 coding problems, incorrect solutions, and bug fixes -- all carefully constructed and annotated by experts. Extensive evaluation shows TreeInstruct's state-of-the-art performance on both datasets, proving it to be a more effective instructor than baselines. Furthermore, a real-world case study with five students of varying skill levels further demonstrates TreeInstruct's ability to guide students to debug their code efficiently with minimal turns and highly Socratic questioning.

Title: Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity

Authors: Bingxiang He, Ning Ding, Cheng Qian, Jia Deng, Ganqu Cui, Lifan Yuan, Huan-ang Gao, Huimin Chen, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11721
Pdf URL: https://arxiv.org/pdf/2406.11721
Copy Paste: [[2406.11721]] Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity(https://arxiv.org/abs/2406.11721)
Keywords: llm
Abstract: Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. This line of research has been limited to examining transfer between tasks from a task-pair perspective, with few studies focusing on understanding zero-shot generalization from the perspective of the data itself. To bridge this gap, we first demonstrate through multiple metrics that zero-shot generalization during instruction tuning happens very early. Next, we investigate the facilitation of zero-shot generalization from both data similarity and granularity perspectives, confirming that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization. Finally, we propose a more grounded training data arrangement method, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. We hope our analysis will advance the understanding of zero-shot generalization during instruction tuning and contribute to the development of more aligned LLMs. Our code is released at this https URL.

Title: Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models

Authors: Fangzhi Xu, Qiushi Sun, Kanzhi Cheng, Jun Liu, Yu Qiao, Zhiyong Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11736
Pdf URL: https://arxiv.org/pdf/2406.11736
Copy Paste: [[2406.11736]] Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models(https://arxiv.org/abs/2406.11736)
Keywords: language model, llm
Abstract: One of the primary driving forces contributing to the superior performance of Large Language Models (LLMs) is the extensive availability of human-annotated natural language data, which is used for alignment fine-tuning. This inspired researchers to investigate self-training methods to mitigate the extensive reliance on human annotations. However, the current success of self-training has been primarily observed in natural language scenarios, rather than in the increasingly important neural-symbolic scenarios. To this end, we propose an environment-guided neural-symbolic self-training framework named ENVISIONS. It aims to overcome two main challenges: (1) the scarcity of symbolic data, and (2) the limited proficiency of LLMs in processing symbolic language. Extensive evaluations conducted on three distinct domains demonstrate the effectiveness of our approach. Additionally, we have conducted a comprehensive analysis to uncover the factors contributing to ENVISIONS's success, thereby offering valuable insights for future research in this area. Code will be available at this https URL.

Title: A Semantic-based Layer Freezing Approach to Efficient Fine-Tuning of Language Models

Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11753
Pdf URL: https://arxiv.org/pdf/2406.11753
Copy Paste: [[2406.11753]] A Semantic-based Layer Freezing Approach to Efficient Fine-Tuning of Language Models(https://arxiv.org/abs/2406.11753)
Keywords: language model
Abstract: Finetuning language models (LMs) is crucial for adapting the models to downstream data and tasks. However, full finetuning is usually costly. Existing work, such as parameter-efficient finetuning (PEFT), often focuses on how to finetune but neglects the issue of where to finetune. As a pioneering work on answering where to finetune (at the layer level), we conduct a semantic analysis of the LM inference process. We first propose a virtual transition of the latent representation and then trace its factual transition. Based on the deviation in transitions, we estimate the gain of finetuning each model layer, and further, narrow down the scope for finetuning. We perform extensive experiments across well-known LMs and datasets. The results show that our approach is effective and efficient, and outperforms the existing baselines. Our approach is orthogonal to existing efficient techniques, such as PEFT methods, offering practical value for LM finetuning.

Title: Improving Multi-Agent Debate with Sparse Communication Topology

Authors: Yunxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, Eugene Ie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11776
Pdf URL: https://arxiv.org/pdf/2406.11776
Copy Paste: [[2406.11776]] Improving Multi-Agent Debate with Sparse Communication Topology(https://arxiv.org/abs/2406.11776)
Keywords: language model, gpt, agent
Abstract: Multi-agent debate has proven effective in improving the quality of large language models on reasoning and factuality tasks. While various role-playing strategies in multi-agent debates have been explored, in terms of the communication among agents, existing approaches adopt a brute-force algorithm in which each agent can communicate with all other agents. In this paper, we systematically investigate the effect of communication connectivity in multi-agent systems. Our experiments on GPT and Mistral models reveal that multi-agent debates leveraging sparse communication topology can achieve comparable or superior performance while significantly reducing computational costs. Furthermore, we extend the multi-agent debate framework to multimodal reasoning and alignment labeling tasks, showcasing its broad applicability and effectiveness. Our findings underscore the importance of communication connectivity in enhancing the efficiency and effectiveness of the "society of minds" approach.
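To make the cost argument in the multi-agent debate abstract concrete, the sketch below counts how many peer responses are read per debate round under a fully connected topology versus a sparse one. The ring topology is only an illustrative choice of sparse graph, not necessarily one the paper evaluates, and the helper names `fully_connected`, `ring`, and `messages_per_round` are hypothetical.

```python
from typing import Dict, List

Topology = Dict[int, List[int]]  # agent index -> indices of agents it reads from


def fully_connected(n: int) -> Topology:
    """Every agent reads the responses of all other agents."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def ring(n: int) -> Topology:
    """Each agent reads only its two neighbours on a ring (assumed sparse layout)."""
    return {i: sorted({(i - 1) % n, (i + 1) % n} - {i}) for i in range(n)}


def messages_per_round(topology: Topology) -> int:
    """Total number of peer responses consumed across all agents in one round."""
    return sum(len(neighbours) for neighbours in topology.values())
```

For six agents, the fully connected graph consumes 30 peer responses per round and the ring consumes 12, which is the kind of input-context saving a sparse topology is meant to provide.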

Title: MDCR: A Dataset for Multi-Document Conditional Reasoning

Authors: Peter Baile Chen, Yi Zhang, Chunwei Liu, Sejal Gupta, Yoon Kim, Michael Cafarella
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11784
Pdf URL: https://arxiv.org/pdf/2406.11784
Copy Paste: [[2406.11784]] MDCR: A Dataset for Multi-Document Conditional Reasoning(https://arxiv.org/abs/2406.11784)
Keywords: llm
Abstract: The same real-life questions posed to different individuals may lead to different answers based on their unique situations. For instance, whether a student is eligible for a scholarship depends on eligibility conditions, such as major or degree required. ConditionalQA was proposed to evaluate models' capability of reading a document and answering eligibility questions, considering unmentioned conditions. However, it is limited to questions on single documents, neglecting harder cases that may require cross-document reasoning and optimization, for example, "What is the maximum number of scholarships attainable?" Such questions over multiple documents are more challenging not only because there is more context to understand, but also because the model has to (1) explore all possible combinations of unmentioned conditions and (2) understand the relationship between conditions across documents, to reason about the optimal outcome. To evaluate models' capability of answering such questions, we propose a new dataset MDCR, which can reflect real-world challenges and serve as a new test bed for complex conditional reasoning that requires optimization. We evaluate this dataset using the most recent LLMs and demonstrate their limitations in solving this task. We believe this dataset will facilitate future research in answering optimization questions with unknown conditions.

Title: CELL your Model: Contrastive Explanation Methods for Large Language Models

Authors: Ronny Luss, Erik Miehling, Amit Dhurandhar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11785
Pdf URL: https://arxiv.org/pdf/2406.11785
Copy Paste: [[2406.11785]] CELL your Model: Contrastive Explanation Methods for Large Language Models(https://arxiv.org/abs/2406.11785)
Keywords: language model, llm, prompt
Abstract: The advent of black-box deep neural network classification models has sparked the need to explain their decisions. However, in the case of generative AI such as large language models (LLMs), there is no class prediction to explain. Rather, one can ask why an LLM outputs a particular response to a given prompt. In this paper, we answer this question by proposing, to the best of our knowledge, the first contrastive explanation methods requiring only black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because if the prompt were slightly modified, the LLM would have given a different response that is either less preferable or contradicts the original response. The key insight is that contrastive explanations simply require a distance function that has meaning to the user and not necessarily a real-valued representation of a specific response (viz. class label). We offer two algorithms for finding contrastive explanations: i) a myopic algorithm, which, although effective in creating contrasts, requires many model calls, and ii) a budgeted algorithm, our main algorithmic contribution, which intelligently creates contrasts adhering to a query budget, necessary for longer contexts. We show the efficacy of these methods on diverse natural language tasks such as open-text generation, automated red teaming, and explaining conversational degradation.

Title: Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

Authors: Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11801
Pdf URL: https://arxiv.org/pdf/2406.11801
Copy Paste: [[2406.11801]] Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations(https://arxiv.org/abs/2406.11801)
Keywords: language model, llm
Abstract: Ensuring the safe alignment of large language models (LLMs) with human values is critical as they become integral to applications like translation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, making models vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework enhancing LLM safety across different scenarios: Base models, Supervised fine-tuned models (SFT), and Edited models. Safety Arithmetic involves Harm Direction Removal to avoid harmful content and Safety Alignment to promote safe responses. Additionally, we present NoIntentEdit, a dataset highlighting edit instances that could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility, outperforming existing methods in ensuring safe content generation.

Title: RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Authors: Joao Monteiro, Pierre-Andre Noel, Etienne Marcotte, Sai Rajeswar, Valentina Zantedeschi, David Vazquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11811
Pdf URL: https://arxiv.org/pdf/2406.11811
Copy Paste: [[2406.11811]] RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content(https://arxiv.org/abs/2406.11811)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: this https URL.

Title: How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Authors: Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.11813
Pdf URL: https://arxiv.org/pdf/2406.11813
Copy Paste: [[2406.11813]] How Do Large Language Models Acquire Factual Knowledge During Pretraining?(https://arxiv.org/abs/2406.11813)
Keywords: language model, llm
Abstract: Despite the recent observation that large language models (LLMs) can store substantial factual knowledge, there is a limited understanding of the mechanisms of how they acquire factual knowledge through pretraining. This work addresses this gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining. First, counterintuitively, we observe that pretraining on more data shows no significant improvement in the model's capability to acquire and maintain factual knowledge. Next, there is a power-law relationship between training steps and forgetting of memorization and generalization of factual knowledge, and LLMs trained with duplicated training data exhibit faster forgetting. Third, training LLMs with larger batch sizes can enhance the models' robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs by progressively increasing the probability of factual knowledge presented in the pretraining data at each step. However, this increase is diluted by subsequent forgetting. Based on this interpretation, we demonstrate that we can provide plausible explanations for recently observed behaviors of LLMs, such as the poor performance of LLMs on long-tail knowledge and the benefits of deduplicating the pretraining corpus.
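As an illustration of what a power-law relationship with training steps looks like in practice, the sketch below fits `a * t**(-b)` by linear regression in log-log space. It assumes the retained quantity decays in exactly this form; the abstract does not give the functional form or the measured quantity, and the data here are synthetic, not taken from the paper.

```python
import math


def fit_power_law(steps, values):
    """Least-squares fit of values ~= a * steps**(-b), done in log-log space."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic, noise-free data generated from a = 2.0, b = 0.5 (not from the paper).
steps = [1, 10, 100, 1000]
retained = [2.0 * s ** -0.5 for s in steps]
a, b = fit_power_law(steps, retained)
print(a, b)
```

On a log-log plot a power law is a straight line, so the fitted slope gives the decay exponent directly.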

Title: Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Authors: Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, Wanli Ouyang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11817
Pdf URL: https://arxiv.org/pdf/2406.11817
Copy Paste: [[2406.11817]] Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level(https://arxiv.org/abs/2406.11817)
Keywords: language model, gpt, llm
Abstract: Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5\%$ length-controlled win rate against $\texttt{GPT-4 Preview}$ on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard and OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.
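One common way to penalize response length in DPO is to subtract a length-difference term from the preference margin. The sketch below shows that form for a single preference pair. It is an assumption for illustration: the abstract does not state the exact iLR-DPO objective, and the function name and the `beta` and `alpha` defaults are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def length_regularized_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                                len_w, len_l, beta=0.1, alpha=0.01):
    """Per-pair loss: -log sigmoid(beta * margin - alpha * (len_w - len_l)).

    logp_* are sequence log-probabilities under the policy, ref_logp_* under
    the frozen reference model; w is the chosen response, l the rejected one.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin - alpha * (len_w - len_l)
    return softplus(-z)  # equals -log(sigmoid(z))


# With alpha = 0 this reduces to the standard DPO loss for the pair.
plain = length_regularized_dpo_loss(-5.0, -6.0, -5.5, -5.5, 200, 100, alpha=0.0)
penalized = length_regularized_dpo_loss(-5.0, -6.0, -5.5, -5.5, 200, 100)
print(plain, penalized)
```

When the chosen response is longer than the rejected one, the penalty shrinks the effective margin and raises the loss, so the policy is rewarded less for winning by verbosity alone.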

Title: WPO: Enhancing RLHF with Weighted Preference Optimization

Authors: Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.11827
Pdf URL: https://arxiv.org/pdf/2406.11827
Copy Paste: [[2406.11827]] WPO: Enhancing RLHF with Weighted Preference Optimization(https://arxiv.org/abs/2406.11827)
Keywords: language model, gpt, llm
Abstract: Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction-following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of 48.6% based on Llama-3-8B-Instruct, making it the strongest 8B model on the leaderboard. We will release the code and models at this https URL.
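The reweighting idea can be sketched as a per-pair weight multiplied into the DPO loss. The weight used below, the product of length-normalized sequence probabilities under the current policy, is an assumption chosen for illustration; the abstract does not specify the paper's exact weighting, and the function names are hypothetical.

```python
import math


def pair_weight(logp_w, logp_l, len_w, len_l):
    """Product of length-normalized sequence probabilities under the current policy."""
    return math.exp(logp_w / len_w + logp_l / len_l)


def weighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      len_w, len_l, beta=0.1):
    """DPO loss for one pair, scaled by how likely the pair is under the policy.

    In gradient-based training the weight would typically be treated as a
    constant (no gradient flows through it); that detail is outside this sketch.
    """
    z = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))  # -log(sigmoid(z))
    return pair_weight(logp_w, logp_l, len_w, len_l) * dpo


# A pair the policy finds likely gets a larger weight than an unlikely one.
likely = pair_weight(-10.0, -12.0, 20, 20)
unlikely = pair_weight(-60.0, -70.0, 20, 20)
print(likely, unlikely)
```

Pairs that the current policy would rarely produce contribute less to the update, which is how off-policy data is made to resemble on-policy data.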

Title: Language Modeling with Editable External Knowledge

Authors: Belinda Z. Li, Emmy Liu, Alexis Ross, Abbas Zeitoun, Graham Neubig, Jacob Andreas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.11830
Pdf URL: https://arxiv.org/pdf/2406.11830
Copy Paste: [[2406.11830]] Language Modeling with Editable External Knowledge(https://arxiv.org/abs/2406.11830)
Keywords: language model, retrieval-augmented generation
Abstract: When the world changes, so does the text that humans write about it. How do we build language models that can be easily updated to reflect these changes? One popular approach is retrieval-augmented generation, in which new documents are inserted into a knowledge base and retrieved during prediction for downstream tasks. Most prior work on these systems has focused on improving behavior during prediction through better retrieval or reasoning. This paper introduces ERASE, which instead improves model behavior when new documents are acquired, by incrementally deleting or rewriting other entries in the knowledge base each time a document is added. In two new benchmark datasets evaluating models' ability to answer questions about a stream of news articles or conversations, ERASE improves accuracy relative to conventional retrieval-augmented generation by 7-13% (Mixtral-8x7B) and 6-10% (Llama-3-8B) absolute. Code and data are available at this https URL.
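The contrast with append-only retrieval stores can be shown with a toy knowledge base that rewrites superseded entries at insertion time. This is a sketch under strong assumptions: the abstract does not say how affected entries are identified, so the exact `(subject, relation)` key match, the class name, and the example facts below are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Toy store keyed by (subject, relation); values are the current fact text."""
    facts: dict = field(default_factory=dict)

    def add_document(self, new_facts: dict) -> list:
        """Insert facts from a new document and rewrite entries they supersede.

        Returns the keys whose previous value was replaced.
        """
        rewritten = []
        for key, value in new_facts.items():
            if key in self.facts and self.facts[key] != value:
                rewritten.append(key)  # stale entry is rewritten in place
            self.facts[key] = value
        return rewritten


# Invented example facts, used only to exercise the update path.
kb = KnowledgeBase()
kb.add_document({("Alice", "employer"): "Acme"})
changed = kb.add_document({("Alice", "employer"): "Globex",
                           ("Alice", "city"): "Paris"})
print(changed, kb.facts)
```

An append-only store would keep both employer entries and leave the conflict to be resolved at retrieval time; updating on insertion keeps only the current fact.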