2024-05-07

Title: NL2FOL: Translating Natural Language to First-Order Logic for Logical Fallacy Detection

Authors: Abhinav Lalwani, Lovish Chopra, Christopher Hahn, Caroline Trippel, Zhijing Jin, Mrinmaya Sachan
Subjects: cs.CL, cs.AI, cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2405.02318
Pdf URL: https://arxiv.org/pdf/2405.02318
Copy Paste: [[2405.02318]] NL2FOL: Translating Natural Language to First-Order Logic for Logical Fallacy Detection(https://arxiv.org/abs/2405.02318)
Keywords: language model, llm
Abstract: Logical fallacies are common errors in reasoning that undermine the logic of an argument. Automatically detecting logical fallacies has important applications in tracking misinformation and validating claims. In this paper, we design a process to reliably detect logical fallacies by translating natural language to First-order Logic (FOL) step-by-step using Large Language Models (LLMs). We then utilize Satisfiability Modulo Theory (SMT) solvers to reason about the validity of the formula and classify inputs as either a fallacy or valid statement. Our model also provides a novel means of utilizing LLMs to interpret the output of the SMT solver, offering insights into the counter-examples that illustrate why a given sentence is considered a logical fallacy. Our approach is robust, interpretable and does not require training data or fine-tuning. We evaluate our model on a mixed dataset of fallacies and valid sentences. The results demonstrate improved performance compared to end-to-end LLMs, with our classifier achieving an F1-score of 71% on the Logic dataset. The approach is able to generalize effectively, achieving an F1-score of 73% on the challenge set, LogicClimate, outperforming state-of-the-art models by 21% despite its much smaller size.
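Note: the validity check described in the abstract above can be illustrated with a minimal sketch. It is not the paper's pipeline: it assumes the argument has already been translated into propositional form (not full FOL), and it uses a brute-force truth-table search in place of an SMT solver. The helper names `find_counterexample` and `implies` are hypothetical. An argument counts as valid when no assignment makes all premises true and the conclusion false; any such assignment is a counter-example.

```python
from itertools import product


def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b


def find_counterexample(variables, premises, conclusion):
    """Return an assignment where every premise holds but the conclusion fails.

    Returns None when no such assignment exists, i.e. the argument is valid.
    Brute force over all truth assignments (a stand-in for an SMT solver).
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return env
    return None


# Affirming the consequent:
# "If it rains, the ground is wet. The ground is wet. So it rained."
fallacy = find_counterexample(
    ["rain", "wet"],
    premises=[lambda e: implies(e["rain"], e["wet"]), lambda e: e["wet"]],
    conclusion=lambda e: e["rain"],
)

# Modus ponens:
# "If it rains, the ground is wet. It rains. So the ground is wet."
valid = find_counterexample(
    ["rain", "wet"],
    premises=[lambda e: implies(e["rain"], e["wet"]), lambda e: e["rain"]],
    conclusion=lambda e: e["wet"],
)

print("fallacy counter-example:", fallacy)
print("valid argument counter-example:", valid)
```

The counter-example found for the first argument (wet ground without rain) is the kind of solver output that the abstract says an LLM is then asked to explain.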

Title: Early Transformers: A study on Efficient Training of Transformer Models through Early-Bird Lottery Tickets

Authors: Shravan Cheekati
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2405.02353
Pdf URL: https://arxiv.org/pdf/2405.02353
Copy Paste: [[2405.02353]] Early Transformers: A study on Efficient Training of Transformer Models through Early-Bird Lottery Tickets(https://arxiv.org/abs/2405.02353)
Keywords: gpt
Abstract: The training of Transformer models has revolutionized natural language processing and computer vision, but it remains a resource-intensive and time-consuming process. This paper investigates the applicability of the early-bird ticket hypothesis to optimize the training efficiency of Transformer models. We propose a methodology that combines iterative pruning, masked distance calculation, and selective retraining to identify early-bird tickets in various Transformer architectures, including ViT, Swin-T, GPT-2, and RoBERTa. Our experimental results demonstrate that early-bird tickets can be consistently found within the first few epochs of training or fine-tuning, enabling significant resource optimization without compromising performance. The pruned models obtained from early-bird tickets achieve comparable or even superior accuracy to their unpruned counterparts while substantially reducing memory usage. Furthermore, our comparative analysis highlights the generalizability of the early-bird ticket phenomenon across different Transformer models and tasks. This research contributes to the development of efficient training strategies for Transformer models, making them more accessible and resource-friendly. By leveraging early-bird tickets, practitioners can accelerate the progress of natural language processing and computer vision applications while reducing the computational burden associated with training Transformer models.
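Note: the "masked distance calculation" named in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: it assumes magnitude-based pruning masks and a normalized Hamming distance between masks from consecutive epochs, with a hypothetical threshold of 0.1 for declaring the mask stable. The weights are made-up toy values.

```python
def magnitude_mask(weights, prune_ratio):
    """Binary mask (1 = keep, 0 = prune) that drops the smallest-magnitude weights."""
    n_prune = int(len(weights) * prune_ratio)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def mask_distance(mask_a, mask_b):
    """Normalized Hamming distance between two masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    return sum(a != b for a, b in zip(mask_a, mask_b)) / len(mask_a)


# Toy weight snapshots after three consecutive epochs (illustrative values only).
epoch_weights = [
    [0.9, -0.3, 0.1, 0.05, -0.7, 0.2],
    [1.0, -0.3, 0.5, 0.02, -0.8, 0.1],
    [1.1, -0.35, 0.55, 0.01, -0.85, 0.08],
]

masks = [magnitude_mask(w, prune_ratio=0.5) for w in epoch_weights]
distances = [mask_distance(masks[i], masks[i + 1]) for i in range(len(masks) - 1)]

THRESHOLD = 0.1  # hypothetical stability threshold
for epoch, dist in enumerate(distances, start=1):
    status = "stable, candidate early-bird ticket" if dist < THRESHOLD else "still changing"
    print(f"epoch {epoch - 1} -> {epoch}: distance = {dist:.3f} ({status})")
```

Once the distance stays below the threshold, the mask is treated as settled and the pruned subnetwork can be retrained on its own.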

Title: The Call for Socially Aware Language Technologies

Authors: Diyi Yang, Dirk Hovy, David Jurgens, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02411
Pdf URL: https://arxiv.org/pdf/2405.02411
Copy Paste: [[2405.02411]] The Call for Socially Aware Language Technologies(https://arxiv.org/abs/2405.02411)
Keywords: language model, llm
Abstract: Language technologies have made enormous progress, especially with the introduction of large language models (LLMs). On traditional tasks such as machine translation and sentiment analysis, these models perform at near-human level. These advances can, however, exacerbate a variety of issues that models have traditionally struggled with, such as bias, evaluation, and risks. In this position paper, we argue that many of these issues share a common core: a lack of awareness of the factors, context, and implications of the social environment in which NLP operates, which we call social awareness. While NLP is getting better at solving the formal linguistic aspects, limited progress has been made in adding the social awareness required for language applications to work in all situations for all users. Integrating social awareness into NLP models will make applications more natural, helpful, and safe, and will open up new possibilities. Thus we argue that substantial challenges remain for NLP to develop social awareness and that we are just at the beginning of a new era for the field.

Title: What does the Knowledge Neuron Thesis Have to do with Knowledge?

Authors: Jingcheng Niu, Andrew Liu, Zining Zhu, Gerald Penn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02421
Pdf URL: https://arxiv.org/pdf/2405.02421
Copy Paste: [[2405.02421]] What does the Knowledge Neuron Thesis Have to do with Knowledge?(https://arxiv.org/abs/2405.02421)
Keywords: language model
Abstract: We reassess the Knowledge Neuron (KN) Thesis: an interpretation of the mechanism underlying the ability of large language models to recall facts from a training corpus. This nascent thesis proposes that facts are recalled from the training corpus through the MLP weights in a manner resembling key-value memory, implying in effect that "knowledge" is stored in the network. Furthermore, by modifying the MLP modules, one can control the language model's generation of factual information. The plausibility of the KN thesis has been demonstrated by the success of KN-inspired model editing methods (Dai et al., 2022; Meng et al., 2022). We find that this thesis is, at best, an oversimplification. Not only have we found that we can edit the expression of certain linguistic phenomena using the same model editing methods but, through a more comprehensive evaluation, we have found that the KN thesis does not adequately explain the process of factual expression. While it is possible to argue that the MLP weights store complex patterns that are interpretable both syntactically and semantically, these patterns do not constitute "knowledge." To gain a more comprehensive understanding of the knowledge representation process, we must look beyond the MLP weights and explore recent models' complex layer structures and attention mechanisms.

Title: What is Sentiment Meant to Mean to Language Models?

Authors: Michael Burnham
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.02454
Pdf URL: https://arxiv.org/pdf/2405.02454
Copy Paste: [[2405.02454]] What is Sentiment Meant to Mean to Language Models?(https://arxiv.org/abs/2405.02454)
Keywords: language model, prompt
Abstract: Sentiment analysis is one of the most widely used techniques in text analysis. Recent advancements with Large Language Models have made it more accurate and accessible than ever, allowing researchers to classify text with only a plain English prompt. However, "sentiment" entails a wide variety of concepts depending on the domain and tools used. It has been used to mean emotion, opinions, market movements, or simply a general "good-bad" dimension. This raises a question: What exactly are language models doing when prompted to label documents by sentiment? This paper first overviews how sentiment is defined across different contexts, highlighting that it is a confounded measurement construct in that it entails multiple variables, such as emotional valence and opinion, without disentangling them. I then test three language models across two data sets with prompts requesting sentiment, valence, and stance classification. I find that sentiment labels most strongly correlate with valence labels. I further find that classification improves when researchers more precisely specify their dimension of interest rather than using the less well-defined concept of sentiment. I conclude by encouraging researchers to move beyond "sentiment" when feasible and use a more precise measurement construct.

Title: Semantic Scaling: Bayesian Ideal Point Estimates with Large Language Models

Authors: Michael Burnham
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02472
Pdf URL: https://arxiv.org/pdf/2405.02472
Copy Paste: [[2405.02472]] Semantic Scaling: Bayesian Ideal Point Estimates with Large Language Models(https://arxiv.org/abs/2405.02472)
Keywords: language model
Abstract: This paper introduces "Semantic Scaling," a novel method for ideal point estimation from text. I leverage large language models to classify documents based on their expressed stances and extract survey-like data. I then use item response theory to scale subjects from these data. Semantic Scaling significantly improves on existing text-based scaling methods, and allows researchers to explicitly define the ideological dimensions they measure. This represents the first scaling approach that allows such flexibility outside of survey instruments and opens new avenues of inquiry for populations difficult to survey. Additionally, it works with documents of varying length, and produces valid estimates of both mass and elite ideology. I demonstrate that the method can differentiate between policy preferences and in-group/out-group affect. Among the public, Semantic Scaling outperforms Tweetscores according to human judgement; in Congress, it recaptures the first dimension of DW-NOMINATE while allowing for greater flexibility in resolving construct validity challenges.

Title: Beyond Helpfulness and Harmlessness: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning

Authors: Hyeong Kyu Choi, Yixuan Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.02501
Pdf URL: https://arxiv.org/pdf/2405.02501
Copy Paste: [[2405.02501]] Beyond Helpfulness and Harmlessness: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning(https://arxiv.org/abs/2405.02501)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are trained on massive text corpora, which are encoded with diverse personality traits. This triggers an interesting goal of eliciting a desired personality trait from the LLM, and probing its behavioral preferences. Accordingly, we formalize the persona elicitation task, aiming to customize LLM behaviors to align with a target persona. We present Persona In-Context Learning (PICLe), a novel persona elicitation framework grounded in Bayesian inference. At the core, PICLe introduces a new ICL example selection criterion based on likelihood ratio, which is designed to optimally guide the model in eliciting a specific target persona. We demonstrate the effectiveness of PICLe through extensive comparisons against baseline methods across three contemporary LLMs. Code is available at https://github.com/deeplearning-wisc/picle.
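Note: a likelihood-ratio selection criterion of the kind mentioned in the abstract above might look like the sketch below. The abstract does not say which two distributions are compared, so this assumes a persona-adapted model and a base model. The function `select_examples` and all log-likelihood values are hypothetical; in practice they would come from scoring each candidate with the two models. The authors' actual implementation is in the linked repository.

```python
def select_examples(candidates, k):
    """Pick the k candidates with the highest log-likelihood ratio.

    Each candidate is (text, logp_persona, logp_base), where the two
    log-probabilities are assumed to come from a persona-adapted model
    and a base model respectively.
    """
    scored = [(logp_persona - logp_base, text) for text, logp_persona, logp_base in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up scores for four candidate in-context examples.
candidates = [
    ("I enjoy helping strangers.", -12.0, -15.0),     # ratio  3.0
    ("The weather is mild today.", -10.0, -10.5),     # ratio  0.5
    ("I avoid talking to people.", -20.0, -14.0),     # ratio -6.0
    ("Volunteering gives me energy.", -11.0, -13.0),  # ratio  2.0
]

chosen = select_examples(candidates, k=2)
print(chosen)
```

Examples that the persona-adapted model finds much more likely than the base model rank highest, which is the intuition behind using a ratio rather than a raw likelihood.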

Title: Mothman at SemEval-2024 Task 9: An Iterative System for Chain-of-Thought Prompt Optimization

Authors: Alvin Po-Chun Chen, Ray Groshan, Sean von Bayern
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02517
Pdf URL: https://arxiv.org/pdf/2405.02517
Copy Paste: [[2405.02517]] Mothman at SemEval-2024 Task 9: An Iterative System for Chain-of-Thought Prompt Optimization(https://arxiv.org/abs/2405.02517)
Keywords: language model, prompt, chain-of-thought
Abstract: Extensive research exists on the performance of large language models on logic-based tasks, whereas relatively little has been done on their ability to generate creative solutions on lateral thinking tasks. The BrainTeaser shared task tests lateral thinking and uses adversarial datasets to prevent memorization, resulting in poor performance for out-of-the-box models. We propose a system for iterative, chain-of-thought prompt engineering which optimizes prompts using human evaluation. Using this shared task, we demonstrate our system's ability to significantly improve model performance by optimizing prompts and evaluate the input dataset.

Title: A Literature Review and Framework for Human Evaluation of Generative Large Language Models in Healthcare

Authors: Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V Stolyar, Katelyn Polanska, Karleigh R McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, Yanshan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.02559
Pdf URL: https://arxiv.org/pdf/2405.02559
Copy Paste: [[2405.02559]] A Literature Review and Framework for Human Evaluation of Generative Large Language Models in Healthcare(https://arxiv.org/abs/2405.02559)
Keywords: language model, llm
Abstract: As generative artificial intelligence (AI), particularly Large Language Models (LLMs), continues to permeate healthcare, it remains crucial to supplement traditional automated evaluations with human expert evaluation. Understanding and evaluating the generated texts is vital for ensuring safety, reliability, and effectiveness. However, the cumbersome, time-consuming, and non-standardized nature of human evaluation presents significant obstacles to the widespread adoption of LLMs in practice. This study reviews existing literature on human evaluation methodologies for LLMs within healthcare. We highlight a notable need for a standardized and consistent human evaluation approach. Our extensive literature search, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, spans publications from January 2018 to February 2024. This review provides a comprehensive overview of the human evaluation approaches used in diverse healthcare applications. This analysis examines the human evaluation of LLMs across various medical specialties, addressing factors such as evaluation dimensions, sample types, and sizes, the selection and recruitment of evaluators, frameworks and metrics, the evaluation process, and statistical analysis of the results. Drawing from diverse evaluation strategies highlighted in these studies, we propose a comprehensive and practical framework for human evaluation of generative LLMs, named QUEST: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence. This framework aims to improve the reliability, generalizability, and applicability of human evaluation of generative LLMs in different healthcare applications by defining clear evaluation dimensions and offering detailed guidelines.

Title: Astro-NER -- Astronomy Named Entity Recognition: Is GPT a Good Domain Expert Annotator?

Authors: Julia Evans, Sameer Sadruddin, Jennifer D'Souza
Subjects: cs.CL, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2405.02602
Pdf URL: https://arxiv.org/pdf/2405.02602
Copy Paste: [[2405.02602]] Astro-NER -- Astronomy Named Entity Recognition: Is GPT a Good Domain Expert Annotator?(https://arxiv.org/abs/2405.02602)
Keywords: gpt, llm
Abstract: In this study, we address one of the challenges of developing NER models for scholarly domains, namely the scarcity of suitable labeled data. We experiment with an approach using predictions from a fine-tuned LLM to aid non-domain experts in annotating scientific entities within astronomy literature, with the goal of uncovering whether such a collaborative process can approximate domain expertise. Our results reveal moderate agreement between a domain expert and the LLM-assisted non-experts, as well as fair agreement between the domain expert and the LLM's predictions. In an additional experiment, we compare the performance of fine-tuned and default LLMs on this task. We have also introduced a specialized scientific entity annotation scheme for astronomy, validated by a domain expert. Our approach adopts a scholarly research contribution-centric perspective, focusing exclusively on scientific entities relevant to the research theme. The resultant dataset, containing 5,000 annotated astronomy article titles, is made publicly available.

Title: Identifying Narrative Patterns and Outliers in Holocaust Testimonies Using Topic Modeling

Authors: Maxim Ifergan, Renana Keydar, Omri Abend, Amit Pinchevski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.02650
Pdf URL: https://arxiv.org/pdf/2405.02650
Copy Paste: [[2405.02650]] Identifying Narrative Patterns and Outliers in Holocaust Testimonies Using Topic Modeling(https://arxiv.org/abs/2405.02650)
Keywords: language model
Abstract: The vast collection of Holocaust survivor testimonies presents invaluable historical insights but poses challenges for manual analysis. This paper leverages advanced Natural Language Processing (NLP) techniques to explore the USC Shoah Foundation Holocaust testimony corpus. By treating testimonies as structured question-and-answer sections, we apply topic modeling to identify key themes. We experiment with BERTopic, which leverages recent advances in language modeling technology. We align testimony sections into fixed parts, revealing the evolution of topics across the corpus of testimonies. This highlights both a common narrative schema and divergences between subgroups based on age and gender. We introduce a novel method to identify testimonies within groups that exhibit atypical topic distributions resembling those of other groups. This study offers unique insights into the complex narratives of Holocaust survivors, demonstrating the power of NLP to illuminate historical discourse and identify potential deviations in survivor experiences.

Title: R4: Reinforced Retriever-Reorder-Responder for Retrieval-Augmented Large Language Models

Authors: Taolin Zhang, Dongyang Li, Qizhou Chen, Chengyu Wang, Longtao Huang, Hui Xue, Xiaofeng He, Jun Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02659
Pdf URL: https://arxiv.org/pdf/2405.02659
Copy Paste: [[2405.02659]] R4: Reinforced Retriever-Reorder-Responder for Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2405.02659)
Keywords: language model, llm, hallucination, prompt
Abstract: Retrieval-augmented large language models (LLMs) leverage relevant content retrieved by information retrieval systems to generate correct responses, aiming to alleviate the hallucination problem. However, existing retriever-responder methods typically append relevant documents to the prompt of LLMs to perform text generation tasks without considering the interaction of fine-grained structural semantics between the retrieved documents and the LLMs. This issue is particularly important for accurate response generation as LLMs tend to "lose in the middle" when dealing with input prompts augmented with lengthy documents. In this work, we propose a new pipeline named "Reinforced Retriever-Reorder-Responder" (R$^4$) to learn document orderings for retrieval-augmented LLMs, thereby further enhancing their generation abilities while the large numbers of parameters of LLMs remain frozen. The reordering learning process is divided into two steps according to the quality of the generated responses: document order adjustment and document representation enhancement. Specifically, document order adjustment aims to organize retrieved document orderings into beginning, middle, and end positions based on graph attention learning, which maximizes the reinforced reward of response quality. Document representation enhancement further refines the representations of retrieved documents for responses of poor quality via document-level gradient adversarial learning. Extensive experiments demonstrate that our proposed pipeline achieves better factual question-answering performance on knowledge-intensive tasks compared to strong baselines across various public datasets. The source codes and trained models will be released upon paper acceptance.

Title: Enhancing News Summarization with ELearnFit through Efficient In-Context Learning and Efficient Fine-Tuning

Authors: Che Guan, Andrew Chin, Puya Vahabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02710
Pdf URL: https://arxiv.org/pdf/2405.02710
Copy Paste: [[2405.02710]] Enhancing News Summarization with ELearnFit through Efficient In-Context Learning and Efficient Fine-Tuning(https://arxiv.org/abs/2405.02710)
Keywords: language model, llm, prompt
Abstract: With the deluge of information delivered by the daily news cycle, there is a growing need to effectively and efficiently summarize news feeds for quick consumption. We leverage large language models (LLMs), with their advanced learning and generative abilities as compared to conventional language models, to generate concise and coherent summaries for news articles from the XSum dataset. Our paper focuses on two key aspects of LLMs: Efficient in-context Learning (ELearn) and Parameter Efficient Fine-tuning (EFit). Under ELearn, we find that increasing the number of shots in prompts and utilizing simple templates generally improve the quality of summaries. We also find that utilizing relevant examples in few-shot learning for ELearn does not improve model performance. In addition, we studied EFit using different methods and demonstrate that fine-tuning the first layer of LLMs produces better outcomes as compared to fine-tuning other layers or utilizing LoRA. We also find that leveraging more relevant training samples using selective layers does not result in better performance. By combining ELearn and EFit, we create a new model (ELearnFit) that leverages the benefits of both few-shot learning and fine-tuning and produces superior performance to either model alone. We also use ELearnFit to highlight the trade-offs between prompting and fine-tuning, especially for situations where only a limited number of annotated samples are available. Ultimately, our research provides practical techniques to optimize news summarization during the prompting and fine-tuning stages and enhances the synthesis of news articles.

Title: CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions

Authors: Hanchong Zhang, Ruisheng Cao, Hongshen Xu, Lu Chen, Kai Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02712
Pdf URL: https://arxiv.org/pdf/2405.02712
Copy Paste: [[2405.02712]] CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions(https://arxiv.org/abs/2405.02712)
Keywords: language model, llm, prompt
Abstract: Recently, Large Language Models (LLMs) have been demonstrated to possess impressive capabilities in a variety of domains and tasks. We investigate the issue of prompt design in the multi-turn text-to-SQL task and attempt to enhance the LLMs' reasoning capacity when generating SQL queries. In the conversational context, the current SQL query can be modified from the preceding SQL query with only a few operations due to the context dependency. We introduce our method called CoE-SQL, which prompts LLMs to generate the SQL query based on the previously generated SQL query with an edition chain. We also conduct extensive ablation studies to determine the optimal configuration of our approach. Our approach consistently outperforms different in-context learning baselines and achieves state-of-the-art performance on two benchmarks, SParC and CoSQL, using LLMs, and is also competitive with the SOTA fine-tuned models.

Title: Recall Them All: Retrieval-Augmented Language Models for Long Object List Extraction from Long Documents

Authors: Sneha Singhania, Simon Razniewski, Gerhard Weikum
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2405.02732
Pdf URL: https://arxiv.org/pdf/2405.02732
Copy Paste: [[2405.02732]] Recall Them All: Retrieval-Augmented Language Models for Long Object List Extraction from Long Documents(https://arxiv.org/abs/2405.02732)
Keywords: language model, llm
Abstract: Methods for relation extraction from text mostly focus on high precision, at the cost of limited recall. High recall is crucial, though, to populate long lists of object entities that stand in a specific relation with a given subject. Cues for relevant objects can be spread across many passages in long texts. This poses the challenge of extracting long lists from long texts. We present the L3X method which tackles the problem in two stages: (1) recall-oriented generation using a large language model (LLM) with judicious techniques for retrieval augmentation, and (2) precision-oriented scrutinization to validate or prune candidates. Our L3X method outperforms LLM-only generations by a substantial margin.

Title: Relations Prediction for Knowledge Graph Completion using Large Language Models

Authors: Sakher Khalil Alqaaidi, Krzysztof Kochut
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.02738
Pdf URL: https://arxiv.org/pdf/2405.02738
Copy Paste: [[2405.02738]] Relations Prediction for Knowledge Graph Completion using Large Language Models(https://arxiv.org/abs/2405.02738)
Keywords: language model
Abstract: Knowledge Graphs have been widely used to represent facts in a structured format. Due to their large-scale applications, knowledge graphs suffer from being incomplete. The relation prediction task obtains knowledge graph completion by assigning one or more possible relations to each pair of nodes. In this work, we make use of the knowledge graph node names to fine-tune a large language model for the relation prediction task. By utilizing only the node names, we enable our model to operate sufficiently in the inductive settings. Our experiments show that we accomplish new scores on a widely used knowledge graph benchmark.

Title: Beyond Performance: Quantifying and Mitigating Label Bias in LLMs

Authors: Yuval Reif, Roy Schwartz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02743
Pdf URL: https://arxiv.org/pdf/2405.02743
Copy Paste: [[2405.02743]] Beyond Performance: Quantifying and Mitigating Label Bias in LLMs(https://arxiv.org/abs/2405.02743)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown remarkable adaptability to diverse tasks, by leveraging context prompts containing instructions, or minimal input-output examples. However, recent work revealed they also exhibit label bias -- an undesirable preference toward predicting certain answers over others. Still, detecting and measuring this bias reliably and at scale has remained relatively unexplored. In this study, we evaluate different approaches to quantifying label bias in a model's predictions, conducting a comprehensive investigation across 279 classification tasks and ten LLMs. Our investigation reveals substantial label bias in models both before and after debiasing attempts, as well as highlights the importance of outcomes-based evaluation metrics, which were not previously used in this regard. We further propose a novel label bias calibration method tailored for few-shot prompting, which outperforms recent calibration approaches for both improving performance and mitigating label bias. Our results emphasize that label bias in the predictions of LLMs remains a barrier to their reliability.
摘要：通过利用包含指令的上下文提示或最少的输入输出示例，大型语言模型（LLM）已显示出对不同任务的卓越适应性。然而，最近的研究表明，它们也表现出标签偏差，即预测某些答案而不是其他答案的不良偏好。尽管如此，可靠且大规模地检测和测量这种偏差仍然相对未经探索。在这项研究中，我们评估了量化模型预测中标签偏差的不同方法，对 279 个分类任务和 10 个 LLM 进行了全面调查。我们的调查揭示了去偏尝试之前和之后模型中存在大量标签偏差，并强调了基于结果的评估指标的重要性，而这些指标以前在这方面并未使用过。我们进一步提出了一种专为少样本提示而定制的新型标签偏差校准方法，该方法在提高性能和减轻标签偏差方面都优于最近的校准方法。我们的结果强调，LLM 预测中的标签偏差仍然是其可靠性的障碍。

Title: Enhancing Contextual Understanding in Large Language Models through Contrastive Decoding

Authors: Zheng Zhao, Emilio Monti, Jens Lehmann, Haytham Assem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.02750
Pdf URL: https://arxiv.org/pdf/2405.02750
Copy Paste: [[2405.02750]] Enhancing Contextual Understanding in Large Language Models through Contrastive Decoding(https://arxiv.org/abs/2405.02750)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) tend to inadequately integrate input context during text generation, relying excessively on encoded prior knowledge in model parameters, potentially resulting in generated text with factual inconsistencies or contextually unfaithful content. LLMs utilize two primary knowledge sources: 1) prior (parametric) knowledge from pretraining, and 2) contextual (non-parametric) knowledge from input prompts. The study addresses the open question of how LLMs effectively balance these knowledge sources during the generation process, specifically in the context of open-domain question answering. To address this issue, we introduce a novel approach integrating contrastive decoding with adversarial irrelevant passages as negative samples to enhance robust context grounding during generation. Notably, our method operates at inference time without requiring further training. We conduct comprehensive experiments to demonstrate its applicability and effectiveness, providing empirical evidence showcasing its superiority over existing methodologies. Our code is publicly available at: https://github.com/amazon-science/ContextualUnderstanding-ContrastiveDecoding.
摘要：大型语言模型 (LLM) 在文本生成过程中往往无法充分集成输入上下文，过度依赖模型参数中编码的先验知识，可能导致生成的文本与事实不一致或上下文不忠实的内容。LLM 利用两个主要知识源：1）来自预训练的先验（参数）知识，2）来自输入提示的上下文（非参数）知识。该研究解决了 LLM 如何在生成过程中有效平衡这些知识源的开放性问题，特别是在开放领域问答的背景下。为了解决这个问题，我们引入了一种新方法，将对比解码与对抗性不相关段落作为负样本相结合，以增强生成过程中稳健的上下文基础。值得注意的是，我们的方法在推理时运行，无需进一步训练。我们进行全面的实验来证明其适用性和有效性，提供经验证据来证明其相对于现有方法的优越性。我们的代码公开于：https://github.com/amazon-science/ContextualUnderstanding-ContrastiveDecoding。

Title: Assessing Adversarial Robustness of Large Language Models: An Empirical Study

Authors: Zeyu Yang, Zhao Meng, Xiaochen Zheng, Roger Wattenhofer
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2405.02764
Pdf URL: https://arxiv.org/pdf/2405.02764
Copy Paste: [[2405.02764]] Assessing Adversarial Robustness of Large Language Models: An Empirical Study(https://arxiv.org/abs/2405.02764)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have revolutionized natural language processing, but their robustness against adversarial attacks remains a critical concern. We present a novel white-box style attack approach that exposes vulnerabilities in leading open-source LLMs, including Llama, OPT, and T5. We assess the impact of model size, structure, and fine-tuning strategies on their resistance to adversarial perturbations. Our comprehensive evaluation across five diverse text classification tasks establishes a new benchmark for LLM robustness. The findings of this study have far-reaching implications for the reliable deployment of LLMs in real-world applications and contribute to the advancement of trustworthy AI systems.
摘要：大型语言模型 (LLM) 彻底改变了自然语言处理，但其针对对抗性攻击的鲁棒性仍然是一个关键问题。我们提出了一种新颖的白盒式攻击方法，该方法暴露了领先的开源 LLM 中的漏洞，包括 Llama、OPT 和 T5。我们评估模型大小、结构和微调策略对其抵抗对抗性扰动的影响。我们对五种不同文本分类任务的综合评估为 LLM 的稳健性建立了新的基准。这项研究的结果对于 LLM 在现实世界应用中的可靠部署具有深远的影响，并有助于值得信赖的人工智能系统的进步。

Title: Detecting Edited Knowledge in Language Models

Authors: Paul Youssef, Zhixue Zhao, Jörg Schlötterer, Christin Seifert
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.02765
Pdf URL: https://arxiv.org/pdf/2405.02765
Copy Paste: [[2405.02765]] Detecting Edited Knowledge in Language Models(https://arxiv.org/abs/2405.02765)
Keywords: language model
Abstract: Knowledge editing techniques (KEs) can update language models' obsolete or inaccurate knowledge learned from pre-training. However, KE also faces potential malicious applications, e.g. inserting misinformation and toxic content. Moreover, in the context of responsible AI, it is instructive for end-users to know whether a generated output is driven by edited knowledge or first-hand knowledge from pre-training. To this end, we study detecting edited knowledge in language models by introducing a novel task: given an edited model and a specific piece of knowledge the model generates, our objective is to classify the knowledge as either "non-edited" (based on the pre-training) or "edited" (based on subsequent editing). We initiate the task with two state-of-the-art KEs, two language models, and two datasets. We further propose a simple classifier, RepReg, a logistic regression model that takes hidden state representations as input features. Our results reveal that RepReg establishes a strong baseline, achieving a peak accuracy of 99.81%, and 97.79% in out-of-domain settings. Second, RepReg achieves near-optimal performance with a limited training set (200 training samples), and it maintains its performance even in out-of-domain settings. Last, we find it more challenging to separate edited and non-edited knowledge when they contain the same subject or object.
摘要：知识编辑技术 (KE) 可以更新语言模型从预训练中学到的过时或不准确的知识。然而，KE 也面临潜在的恶意应用，例如插入错误信息和有害内容。此外，在负责任的人工智能背景下，终端用户了解生成的输出是由编辑过的知识还是来自预训练的第一手知识驱动是有益的。为此，我们通过引入一项新任务来研究在语言模型中检测编辑过的知识：给定一个编辑过的模型和该模型生成的特定知识，我们的目标是将知识分类为“未编辑”（基于预训练）或“编辑过”（基于后续编辑）。我们使用两个最先进的 KE、两个语言模型和两个数据集来启动任务。我们进一步提出了一个简单的分类器 RepReg，这是一个将隐藏状态表示作为输入特征的逻辑回归模型。我们的结果表明，RepReg 建立了强大的基线，峰值准确率达到 99.81%，域外设置准确率达到 97.79%。其次，RepReg 在有限的训练集（200 个训练样本）下实现了近乎最佳的性能，即使在域外设置中也能保持其性能。最后，我们发现，当编辑和未编辑的知识包含相同的主题或对象时，将它们区分开来更具挑战性。

Title: NegativePrompt: Leveraging Psychology for Large Language Models Enhancement via Negative Emotional Stimuli

Authors: Xu Wang, Cheng Li, Yi Chang, Jindong Wang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02814
Pdf URL: https://arxiv.org/pdf/2405.02814
Copy Paste: [[2405.02814]] NegativePrompt: Leveraging Psychology for Large Language Models Enhancement via Negative Emotional Stimuli(https://arxiv.org/abs/2405.02814)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large Language Models (LLMs) have become integral to a wide spectrum of applications, ranging from traditional computing tasks to advanced artificial intelligence (AI) applications. This widespread adoption has spurred extensive research into LLMs across various disciplines, including the social sciences. Notably, studies have revealed that LLMs possess emotional intelligence, which can be further developed through positive emotional stimuli. This discovery raises an intriguing question: can negative emotions similarly influence LLMs, potentially enhancing their performance? In response to this question, we introduce NegativePrompt, a novel approach underpinned by psychological principles, involving ten specifically designed negative emotional stimuli. We embark on rigorous experimental evaluations of five LLMs including Flan-T5-Large, Vicuna, Llama 2, ChatGPT, and GPT-4, across a set of 45 tasks. The results are revealing: NegativePrompt markedly enhances the performance of LLMs, evidenced by relative improvements of 12.89% in Instruction Induction tasks and 46.25% in BIG-Bench tasks. Moreover, we conduct attention visualization experiments to decipher the underlying mechanisms of NegativePrompt's influence. Our research contributes significantly to the understanding of LLMs and emotion interaction, demonstrating the practical efficacy of NegativePrompt as an emotion-driven method and offering novel insights for the enhancement of LLMs in real-world applications. The code is available at https://github.com/wangxu0820/NegativePrompt.
摘要：大型语言模型 (LLM) 已成为各种应用程序不可或缺的一部分，从传统计算任务到高级人工智能 (AI) 应用程序。这种广泛的采用刺激了各个学科（包括社会科学）对 LLM 的广泛研究。值得注意的是，研究表明 LLM 拥有情商，可以通过积极的情绪刺激进一步发展。这一发现提出了一个有趣的问题：负面情绪是否会同样影响 LLM，从而有可能提高其性能？为了回答这个问题，我们引入了 NegativePrompt，这是一种以心理学原理为基础的新颖方法，涉及十种专门设计的负面情绪刺激。我们对 5 个 LLM（包括 Flan-T5-Large、Vicuna、Llama 2、ChatGPT 和 GPT-4）进行了严格的实验评估，涉及 45 项任务。结果表明：NegativePrompt 显著提高了 LLM 的性能，指令归纳任务相对提高了 12.89%，BIG-Bench 任务相对提高了 46.25%。此外，我们还进行了注意力可视化实验，以破译 NegativePrompt 影响的潜在机制。我们的研究对理解 LLM 与情绪的相互作用做出了重大贡献，证明了 NegativePrompt 作为情绪驱动方法的实际功效，并为在现实世界应用中增强 LLM 提供了新颖的见解。代码可在 https://github.com/wangxu0820/NegativePrompt 获取。

Title: Stochastic RAG: End-to-End Retrieval-Augmented Generation through Expected Utility Maximization

Authors: Hamed Zamani, Michael Bendersky
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2405.02816
Pdf URL: https://arxiv.org/pdf/2405.02816
Copy Paste: [[2405.02816]] Stochastic RAG: End-to-End Retrieval-Augmented Generation through Expected Utility Maximization(https://arxiv.org/abs/2405.02816)
Keywords: retrieval-augmented generation
Abstract: This paper introduces Stochastic RAG--a novel approach for end-to-end optimization of retrieval-augmented generation (RAG) models that relaxes the simplifying assumptions of marginalization and document independence, made in most prior work. Stochastic RAG casts the retrieval process in RAG as a stochastic sampling without replacement process. Through this formulation, we employ straight-through Gumbel-top-k that provides a differentiable approximation for sampling without replacement and enables effective end-to-end optimization for RAG. We conduct extensive experiments on seven diverse datasets on a wide range of tasks, from open-domain question answering to fact verification to slot-filling for relation extraction and to dialogue systems. By applying this optimization method to a recent and effective RAG model, we advance state-of-the-art results on six out of seven datasets.
摘要：本文介绍了随机 RAG——一种用于检索增强生成 (RAG) 模型端到端优化的新方法，该方法放宽了大多数先前工作中对边缘化和文档独立性的简化假设。随机RAG将RAG中的检索过程视为无替换过程的随机采样。通过这个公式，我们采用直通式 Gumbel-top-k，为无需放回的采样提供可微的近似，并实现 RAG 的有效端到端优化。我们对七个不同的数据集进行了广泛的实验，涉及各种任务，从开放域问答到事实验证，再到用于关系提取的槽填充和对话系统。通过将此优化方法应用于最近有效的 RAG 模型，我们在七个数据集中的六个数据集上取得了最先进的结果。
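
The Gumbel-top-k step named in the Stochastic RAG abstract above can be illustrated with a small sketch. This is not the authors' implementation: it shows only the sampling-without-replacement part (perturb each score with Gumbel noise, keep the k largest) and omits the straight-through gradient, which needs an autodiff framework. The function name `gumbel_top_k` and the example scores are assumptions made for illustration.

```python
import math
import random


def gumbel_top_k(scores, k, rng=random):
    """Sample k distinct indices without replacement.

    `scores` are treated as unnormalized log-probabilities (an assumption
    of this sketch). Adding independent Gumbel(0, 1) noise to each score
    and keeping the k largest perturbed values yields an ordered sample
    without replacement.
    """
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    perturbed = []
    for index, score in enumerate(scores):
        # rng.random() lies in [0, 1); clamp away from 0 so log() is defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((score + gumbel_noise, index))
    # Highest perturbed score first; the first k indices are the sample.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


if __name__ == "__main__":
    # Hypothetical retrieval scores for four documents.
    doc_scores = [2.0, 0.5, 0.1, -1.0]
    print(gumbel_top_k(doc_scores, k=2, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run repeatable, which is convenient when comparing retrieval samples across experiments.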

Title: HuixiangDou-CR: Coreference Resolution in Group Chats

Authors: Huanjun Kong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02817
Pdf URL: https://arxiv.org/pdf/2405.02817
Copy Paste: [[2405.02817]] HuixiangDou-CR: Coreference Resolution in Group Chats(https://arxiv.org/abs/2405.02817)
Keywords: language model, llm, chat
Abstract: How can pronominal references in group chats be eliminated? In this work, we preprocessed 58k items of authentic chat data and manually annotated 2.3k questions. The reliability of this annotation was confirmed by the scaling law. After this, we fine-tuned Qwen models ranging from 0.5B to 32B parameters. The best version improved the F1 score by 29.07. This confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) Created Supervised Fine-Tuning (SFT) training data in alpaca format, along with a set of Low-Rank Adaptation (LoRA) weights, and 2) Developed a method for acquiring high-quality data by leveraging the scaling law principle. The script, raw data in alpaca format, and experiment tracking are open-sourced on GitHub https://github.com/InternLM/HuixiangDou/tree/main/web/tools, HuggingFace https://huggingface.co/tpoisonooo and WandB https://wandb.ai/tpoisonooo/huixiangdou-cr/table?nw=nwusertpoisonooo . Use of the data involved has been authorized by the users.
摘要：如何消除群聊中的代词引用？在这项工作中，我们预处理了 58k 真实聊天数据并手动标注了 2.3k 个问题。该注释的可靠性得到了缩放定律的证实。之后，我们对 Qwen 模型进行了微调，参数范围从 0.5B 到 32B。最优版本的 F1 分数提高了 29.07。这证实了针对下游自然语言处理（NLP）任务微调大型语言模型（LLM）的可行性。我们的贡献是：1) 创建了 alpaca 格式的监督微调 (SFT) 训练数据，以及一组低秩适应 (LoRA) 权重，2) 开发了一种利用缩放定律原理获取高质量数据的方法。脚本、alpaca 格式的原始数据和实验记录均已开源，分别位于 GitHub https://github.com/InternLM/HuixiangDou/tree/main/web/tools、HuggingFace https://huggingface.co/tpoisonooo 和 WandB https://wandb.ai/tpoisonooo/huixiangdou-cr/table?nw=nwusertpoisonooo 。所涉及数据的使用已获得用户授权。

Title: Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models

Authors: Yang Liu, Melissa Xiaohui Qin, Hongming Li, Chao Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.02861
Pdf URL: https://arxiv.org/pdf/2405.02861
Copy Paste: [[2405.02861]] Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models(https://arxiv.org/abs/2405.02861)
Keywords: language model
Abstract: We introduce LexBench, a comprehensive evaluation suite designed to test language models (LMs) on ten semantic phrase processing tasks. Unlike prior studies, it is the first work to propose a framework from the comparative perspective to model the general semantic phrase (i.e., lexical collocation) and three fine-grained semantic phrases, including idiomatic expression, noun compound, and verbal construction. Thanks to LexBench, we assess the performance of 15 LMs across model architectures and parameter scales in classification, extraction, and interpretation tasks. Through the experiments, we first validate the scaling law and find that, as expected, large models outperform smaller ones in most tasks. Second, we investigate further through the scaling semantic relation categorization and find that few-shot LMs still lag behind vanilla fine-tuned models in the task. Third, through human evaluation, we find that the performance of strong models is comparable to the human level regarding semantic phrase processing. Our benchmarking findings can serve future research aiming to improve the generic capability of LMs on semantic phrase comprehension. Our source code and data are available at https://github.com/jacklanda/LexBench
摘要：我们推出了 LexBench，这是一个综合评估套件，能够测试十个语义短语处理任务的语言模型 (LM)。与之前的研究不同，这是第一个从比较角度提出框架来建模一般语义短语（即词汇搭配）和三个细粒度语义短语（包括惯用表达、名词复合词和动词结构）的工作。借助 LexBench，我们评估了 15 个 LM 在分类、提取和解释任务中跨模型架构和参数尺度的性能。通过实验，我们首先验证了缩放定律，并发现，正如预期的那样，在大多数任务中，大型模型比小型模型表现更好。其次，我们通过扩展语义关系分类进一步研究，发现在任务中小样本 LM 仍然落后于普通的微调模型。第三，通过人类评估，我们发现强模型在语义短语处理方面的性能与人类水平相当。我们的基准测试结果可以为未来的研究服务，旨在提高语言模型在语义短语理解方面的通用能力。我们的源代码和数据可在 https://github.com/jacklanda/LexBench 获取

Title: Relay Decoding: Concatenating Large Language Models for Machine Translation

Authors: Chengpeng Fu, Xiaocheng Feng, Yichong Huang, Wenshuai Huo, Baohang Li, Hui Wang, Bin Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02933
Pdf URL: https://arxiv.org/pdf/2405.02933
Copy Paste: [[2405.02933]] Relay Decoding: Concatenating Large Language Models for Machine Translation(https://arxiv.org/abs/2405.02933)
Keywords: language model
Abstract: Leveraging large language models for machine translation has demonstrated promising results. However, it does require the large language models to possess the capability of handling both the source and target languages in machine translation. When it is challenging to find large models that support the desired languages, resorting to continuous learning methods becomes a costly endeavor. To mitigate these expenses, we propose an innovative approach called RD (Relay Decoding), which entails concatenating two distinct large models that individually support the source and target languages. By incorporating a simple mapping layer to facilitate the connection between these two models and utilizing a limited amount of parallel data for training, we successfully achieve superior results in the machine translation task. Experimental results conducted on the Multi30k and WikiMatrix datasets validate the effectiveness of our proposed method.
摘要：利用大型语言模型进行机器翻译已经证明了有希望的结果。然而，它确实需要大型语言模型具备机器翻译中源语言和目标语言的处理能力。当找到支持所需语言的大型模型具有挑战性时，采用持续学习方法就变得成本高昂。为了减轻这些费用，我们提出了一种称为 RD（中继解码）的创新方法，该方法需要连接两个不同的大型模型，分别支持源语言和目标语言。通过合并一个简单的映射层来促进这两个模型之间的连接并利用有限数量的并行数据进行训练，我们成功地在机器翻译任务中取得了优异的结果。在 Multi30k 和 WikiMatrix 数据集上进行的实验结果验证了我们提出的方法的有效性。

Title: Unraveling the Dominance of Large Language Models Over Transformer Models for Bangla Natural Language Inference: A Comprehensive Study

Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Asif Iftekher Fahim, Pronay Debnath, Faisal Muhammad Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.02937
Pdf URL: https://arxiv.org/pdf/2405.02937
Copy Paste: [[2405.02937]] Unraveling the Dominance of Large Language Models Over Transformer Models for Bangla Natural Language Inference: A Comprehensive Study(https://arxiv.org/abs/2405.02937)
Keywords: language model, gpt, llm
Abstract: Natural Language Inference (NLI) is a cornerstone of Natural Language Processing (NLP), providing insights into the entailment relationships between text pairings. It is a critical component of Natural Language Understanding (NLU), demonstrating the ability to extract information from spoken or written interactions. NLI is mainly concerned with determining the entailment relationship between two statements, known as the premise and hypothesis. When the premise logically implies the hypothesis, the pair is labeled "entailment". If the hypothesis contradicts the premise, the pair receives the "contradiction" label. When there is insufficient evidence to establish a connection, the pair is described as "neutral". Despite the success of Large Language Models (LLMs) in various tasks, their effectiveness in NLI remains constrained by issues like low-resource domain accuracy, model overconfidence, and difficulty in capturing human judgment disagreements. This study addresses the underexplored area of evaluating LLMs in low-resourced languages such as Bengali. Through a comprehensive evaluation, we assess the performance of prominent LLMs and state-of-the-art (SOTA) models in Bengali NLP tasks, focusing on natural language inference. Utilizing the XNLI dataset, we conduct zero-shot and few-shot evaluations, comparing LLMs like GPT-3.5 Turbo and Gemini 1.5 Pro with models such as BanglaBERT, Bangla BERT Base, DistilBERT, mBERT, and sahajBERT. Our findings reveal that while LLMs can achieve comparable or superior performance to fine-tuned SOTA models in few-shot scenarios, further research is necessary to enhance our understanding of LLMs in languages with modest resources like Bengali. This study underscores the importance of continued efforts in exploring LLM capabilities across diverse linguistic contexts.
摘要：自然语言推理 (NLI) 是自然语言处理 (NLP) 的基石，可深入了解文本配对之间的蕴涵关系。它是自然语言理解 (NLU) 的重要组成部分，展示了从口头或书面交互中提取信息的能力。NLI 主要关注确定两个陈述之间的蕴含关系，称为前提和假设。当前提在逻辑上暗示了假设时，该对就被标记为“蕴含”。如果假设与前提相矛盾，则该对就会被贴上“矛盾”标签。当没有足够的证据来建立联系时，这一对被描述为“中立”。尽管大型语言模型 (LLM) 在各种任务中取得了成功，但它们在 NLI 中的有效性仍然受到低资源领域准确性、模型过度自信以及难以捕捉人类判断分歧等问题的限制。这项研究解决了在孟加拉语等低资源语言中评估 LLM 这一尚未充分探索的领域。通过综合评估，我们评估了主流 LLM 和最先进的（SOTA）模型在孟加拉语 NLP 任务中的表现，重点是自然语言推理。利用 XNLI 数据集，我们进行零样本和少样本评估，将 GPT-3.5 Turbo 和 Gemini 1.5 Pro 等 LLM 与 BanglaBERT、Bangla BERT Base、DistilBERT、mBERT 和 sahajBERT 等模型进行比较。我们的研究结果表明，虽然 LLM 可以在少样本场景中实现与微调的 SOTA 模型相当或更好的性能，但需要进一步的研究来增强我们对 LLM 在孟加拉语等资源有限的语言中的理解。这项研究强调了继续努力在不同语言背景下探索 LLM 能力的重要性。

Title: Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

Authors: Owen Henkel, Adam Boxer, Libby Hills, Bill Roberts
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.02985
Pdf URL: https://arxiv.org/pdf/2405.02985
Copy Paste: [[2405.02985]] Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education(https://arxiv.org/abs/2405.02985)
Keywords: language model, gpt, llm, prompt
Abstract: This paper reports on a series of experiments with a novel dataset evaluating how well Large Language Models (LLMs) can mark (i.e. grade) open text responses to short answer questions. Specifically, we explore how well different combinations of GPT version and prompt engineering strategies performed at marking real student answers to short answer questions across different domain areas (Science and History) and grade-levels (spanning ages 5-16) using a new, never-used-before dataset from Carousel, a quizzing platform. We found that GPT-4, with basic few-shot prompting, performed well (Kappa, 0.70) and, importantly, very close to human-level performance (0.75). This research builds on prior findings that GPT-4 could reliably score short answer reading comprehension questions at a performance-level very close to that of expert human raters. The proximity to human-level performance across a variety of subjects and grade levels suggests that LLMs could be a valuable tool for supporting low-stakes formative assessment tasks in K-12 education and has important implications for real-world education delivery.
摘要：本文报告了一系列使用新颖数据集进行的实验，这些实验评估了大型语言模型 (LLM) 对简答题的开放文本回答进行标记（即评分）的效果。具体来说，我们使用来自测验平台 Carousel 的全新、此前从未使用过的数据集，探讨了 GPT 版本和提示工程策略的不同组合在不同领域（科学和历史）和年级（跨越 5-16 岁）对简答题的真实学生答案进行评分的效果。我们发现采用基本少样本提示的 GPT-4 表现良好（Kappa，0.70），而且重要的是，非常接近人类水平的表现（0.75）。这项研究建立在先前的研究发现之上，即 GPT-4 可以可靠地对简答阅读理解问题进行评分，其表现水平非常接近人类专家评分者的水平。在各种学科和年级水平上都接近人类水平的表现表明，LLM 可能是支持 K-12 教育中低风险形成性评估任务的宝贵工具，并对现实世界的教育交付具有重要影响。
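
The Kappa figures quoted in the abstract above (0.70 for GPT-4, 0.75 for humans) are agreement statistics. As a rough illustration of how such a number is obtained, here is a minimal sketch of unweighted Cohen's kappa between two raters. The abstract does not say which kappa variant the authors used, so the unweighted form is an assumption, and the function name and the toy marks are invented for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty item set")
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, and this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy marks (1 = correct, 0 = incorrect); not data from the paper.
    model_marks = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    human_marks = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
    print(round(cohen_kappa(model_marks, human_marks), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over plain accuracy when one label dominates.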

Title: MedAdapter: Efficient Test-Time Adaptation of Large Language Models towards Medical Reasoning

Authors: Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Hang Wu, Carl Yang, May D. Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.03000
Pdf URL: https://arxiv.org/pdf/2405.03000
Copy Paste: [[2405.03000]] MedAdapter: Efficient Test-Time Adaptation of Large Language Models towards Medical Reasoning(https://arxiv.org/abs/2405.03000)
Keywords: language model, llm
Abstract: Despite their improved capabilities in generation and reasoning, adapting large language models (LLMs) to the biomedical domain remains challenging due to their immense size and corporate privacy. In this work, we propose MedAdapter, a unified post-hoc adapter for test-time adaptation of LLMs towards biomedical applications. Instead of fine-tuning the entire LLM, MedAdapter effectively adapts the original model by fine-tuning only a small BERT-sized adapter to rank candidate solutions generated by LLMs. Experiments demonstrate that MedAdapter effectively adapts both white-box and black-box LLMs in biomedical reasoning, achieving average performance improvements of 25.48% and 11.31%, respectively, without requiring extensive computational resources or sharing data with third parties. MedAdapter also yields superior performance when combined with train-time adaptation, highlighting a flexible and complementary solution to existing adaptation methods. Faced with the challenges of balancing model performance, computational resources, and data privacy, MedAdapter provides an efficient, privacy-preserving, cost-effective, and transparent solution for adapting LLMs to the biomedical domain.
摘要：尽管大型语言模型（LLM）在生成和推理方面的能力有所提高，但由于其巨大的规模和企业隐私，将其适应生物医学领域仍然具有挑战性。在这项工作中，我们提出了 MedAdapter，这是一种统一的事后适配器，用于 LLM 面向生物医学应用的测试时适应。MedAdapter 不是对整个 LLM 进行微调，而是通过仅微调一个小型 BERT 大小的适配器来对 LLM 生成的候选解决方案进行排名，从而有效地适应原始模型。实验表明，MedAdapter 有效地适应了生物医学推理中的白盒和黑盒 LLM，分别实现了 25.48% 和 11.31% 的平均性能提升，而无需大量计算资源或与第三方共享数据。MedAdapter 在与训练时适应相结合时还能产生卓越的性能，突出了对现有适应方法的灵活和补充解决方案。面对平衡模型性能、计算资源和数据隐私的挑战，MedAdapter 提供了一种高效、保护隐私、经济高效且透明的解决方案，使 LLM 适应生物医学领域。

Title: Exploring prompts to elicit memorization in masked language model-based named entity recognition

Authors: Yuxi Xia, Anastasiia Sedova, Pedro Henrique Luz de Araujo, Vasiliki Kougia, Lisa Nußbaumer, Benjamin Roth
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2405.03004
Pdf URL: https://arxiv.org/pdf/2405.03004
Copy Paste: [[2405.03004]] Exploring prompts to elicit memorization in masked language model-based named entity recognition(https://arxiv.org/abs/2405.03004)
Keywords: language model, prompt
Abstract: Training data memorization in language models impacts model capability (generalization) and safety (privacy risk). This paper focuses on analyzing prompts' impact on detecting the memorization of 6 masked language model-based named entity recognition models. Specifically, we employ a diverse set of 400 automatically generated prompts, and a pairwise dataset where each pair consists of one person's name from the training set and another name out of the set. A prompt completed with a person's name serves as input for getting the model's confidence in predicting this name. Finally, the prompt performance of detecting model memorization is quantified by the percentage of name pairs for which the model has higher confidence for the name from the training set. We show that the performance of different prompts varies by as much as 16 percentage points on the same model, and prompt engineering further increases the gap. Moreover, our experiments demonstrate that prompt performance is model-dependent but does generalize across different name sets. A comprehensive analysis indicates how prompt performance is influenced by prompt properties, contained tokens, and the model's self-attention weights on the prompt.
摘要：语言模型中的训练数据记忆会影响模型能力（泛化）和安全性（隐私风险）。本文重点分析了提示对检测 6 个基于掩码语言模型的命名实体识别模型的记忆的影响。具体来说，我们使用了一组 400 个不同的自动生成的提示和一个成对的数据集，其中每对由训练集中的一个人的名字和该集合中的另一个名字组成。用一个人的名字完成的提示作为输入，用于获取模型预测此名字的信心。最后，检测模型记忆的提示性能通过模型对训练集中的名字具有更高信心的名字对的百分比来量化。我们表明，不同提示的性能在同一模型上相差多达 16 个百分点，而提示工程进一步扩大了差距。此外，我们的实验表明，提示性能依赖于模型，但确实可以推广到不同的名称集。全面的分析表明提示性能如何受到提示属性、所包含的标记以及模型对提示的自注意力权重的影响。
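
The prompt-performance metric described in the abstract above (the percentage of name pairs for which the model is more confident about the in-training name) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the function name, the handling of ties (counted as not higher), and the confidence values are assumptions.

```python
def pairwise_memorization_score(confidence_pairs):
    """Percentage of pairs where the in-training name wins.

    Each pair is (confidence_in_train_name, confidence_out_of_train_name),
    both obtained with the same prompt. Ties count as "not higher", which
    is a choice made for this sketch.
    """
    pairs = list(confidence_pairs)
    if not pairs:
        raise ValueError("at least one name pair is required")
    wins = sum(1 for in_train, out_train in pairs if in_train > out_train)
    return 100.0 * wins / len(pairs)


if __name__ == "__main__":
    # Hypothetical model confidences for four name pairs under one prompt.
    pairs = [(0.9, 0.2), (0.4, 0.6), (0.7, 0.1), (0.5, 0.5)]
    print(pairwise_memorization_score(pairs))
```

Under this reading, a score near 50 means the prompt cannot tell training names from unseen ones, while a score well above 50 suggests the prompt exposes memorization.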

Title: Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation

Authors: Kaize Shi, Xueyao Sun, Qing Li, Guandong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03085
Pdf URL: https://arxiv.org/pdf/2405.03085
Copy Paste: [[2405.03085]] Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation(https://arxiv.org/abs/2405.03085)
Keywords: language model, llm, long context, hallucination, retrieval augmented generation
Abstract: Large Language Models (LLMs) have made significant strides in information acquisition. However, their overreliance on potentially flawed parametric knowledge leads to hallucinations and inaccuracies, particularly when handling long-tail, domain-specific queries. Retrieval Augmented Generation (RAG) addresses this limitation by incorporating external, non-parametric knowledge. Nevertheless, the retrieved long-context documents often contain noisy, irrelevant information alongside vital knowledge, negatively diluting LLMs' attention. Inspired by the supportive role of essential concepts in individuals' reading comprehension, we propose a novel concept-based RAG framework with the Abstract Meaning Representation (AMR)-based concept distillation algorithm. The proposed algorithm compresses the cluttered raw retrieved documents into a compact set of crucial concepts distilled from the informative nodes of AMR by referring to reliable linguistic features. The concepts explicitly constrain LLMs to focus solely on vital information in the inference process. We conduct extensive experiments on open-domain question-answering datasets to empirically evaluate the proposed method's effectiveness. The results indicate that the concept-based RAG framework outperforms other baseline methods, particularly as the number of supporting documents increases, while also exhibiting robustness across various backbone LLMs. This emphasizes the distilled concepts are informative for augmenting the RAG process by filtering out interference information. To the best of our knowledge, this is the first work introducing AMR to enhance the RAG, presenting a potential solution to augment inference performance with semantic-based context compression.
摘要：大型语言模型（LLM）在信息获取方面取得了重大进展。然而，它们过度依赖可能有缺陷的参数知识会导致幻觉和不准确，特别是在处理长尾、特定领域的查询时。检索增强生成（RAG）通过整合外部非参数知识来解决这一限制。然而，检索到的长上下文文档通常包含嘈杂的、不相关的信息以及重要的知识，从而稀释了 LLM 的注意力。受基本概念在个人阅读理解中的支持作用的启发，我们提出了一种新颖的基于概念的 RAG 框架，采用基于抽象含义表示（AMR）的概念蒸馏算法。所提出的算法通过参考可靠的语言特征，将杂乱的原始检索文档压缩为从 AMR 的信息节点中提取的一组紧凑的关键概念。这些概念明确限制 LLM 仅关注推理过程中的重要信息。我们对开放域问答数据集进行了广泛的实验，以实证评估所提出方法的有效性。结果表明，基于概念的 RAG 框架优于其他基线方法，特别是随着支持文档数量的增加，同时在各种骨干 LLM 上也表现出稳健性。这强调了提炼出来的概念能够通过滤除干扰信息为增强 RAG 过程提供有用信息。据我们所知，这是第一个引入 AMR 来增强 RAG 的工作，提出了一种通过基于语义的上下文压缩来增强推理性能的潜在解决方案。

Title: FairMonitor: A Dual-framework for Detecting Stereotypes and Biases in Large Language Models

Authors: Yanhong Bai, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xingjiao Wu, Liang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03098
Pdf URL: https://arxiv.org/pdf/2405.03098
Copy Paste: [[2405.03098]] FairMonitor: A Dual-framework for Detecting Stereotypes and Biases in Large Language Models(https://arxiv.org/abs/2405.03098)
Keywords: language model, llm, agent
Abstract: Detecting stereotypes and biases in Large Language Models (LLMs) is crucial for enhancing fairness and reducing adverse impacts on individuals or groups when these models are applied. Traditional methods, which rely on embedding spaces or are based on probability metrics, fall short in revealing the nuanced and implicit biases present in various contexts. To address this challenge, we propose the FairMonitor framework and adopt a static-dynamic detection method for a comprehensive evaluation of stereotypes and biases in LLMs. The static component consists of a direct inquiry test, an implicit association test, and an unknown situation test, including 10,262 open-ended questions with 9 sensitive factors and 26 educational scenarios. It is effective for evaluating both explicit and implicit biases. Moreover, we utilize a multi-agent system to construct dynamic scenarios for detecting subtle biases in more complex and realistic settings. This component detects the biases based on the interaction behaviors of LLMs across 600 varied educational scenarios. The experimental results show that the cooperation of static and dynamic methods can detect more stereotypes and biases in LLMs.
摘要：检测大型语言模型 (LLM) 中的刻板印象和偏见对于增强公平性并减少应用这些模型时对个人或群体的不利影响至关重要。依赖嵌入空间或基于概率度量的传统方法无法揭示各种上下文中存在的细微差别和隐含偏差。为了应对这一挑战，我们提出了 FairMonitor 框架，并采用静态-动态检测方法对 LLM 的刻板印象和偏见进行综合评估。静态部分由直接询问测试、内隐联想测试和未知情境测试组成，包括 10,262 个开放式问题、9 个敏感因素和 26 个教育场景。它对于评估显性和隐性偏见都很有效。此外，我们利用多智能体系统来构建动态场景，以检测更复杂和现实的环境中的细微偏差。该组件根据 LLM 在 600 个不同教育场景中的交互行为来检测偏差。实验结果表明，静态和动态方法的配合可以检测出 LLM 中更多的刻板印象和偏见。

Title: An Active Inference Agent for Simulating Human Translation Processes in a Hierarchical Architecture: Integrating the Task Segment Framework and the HOF taxonomy

Authors: Michael Carl
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03111
Pdf URL: https://arxiv.org/pdf/2405.03111
Copy Paste: [[2405.03111]] An Active Inference Agent for Simulating Human Translation Processes in a Hierarchical Architecture: Integrating the Task Segment Framework and the HOF taxonomy(https://arxiv.org/abs/2405.03111)
Keywords: agent
Abstract: In this paper, we propose modelling human translation production as a hierarchy of three embedded translation processes. The proposed architecture replicates the temporal dynamics of keystroke production across sensorimotor, cognitive, and phenomenal layers. Utilizing data from the CRITT TPR-DB, the Task Segment Framework, and the HOF taxonomy, we demonstrate the temporal breakdown of the typing flow on distinct timelines within these three layers.
摘要：在本文中，我们建议将人工翻译生产建模为三个嵌入式翻译过程的层次结构。所提出的架构复制了跨感觉运动层、认知层和现象层的击键产生的时间动态。利用来自 CRITT TPR-DB、任务段框架和 HOF 分类法的数据，我们演示了这三层中不同时间线上的打字流程的时间分解。

Title: Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

Authors: Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2405.03133
Pdf URL: https://arxiv.org/pdf/2405.03133
Copy Paste: [[2405.03133]] Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training(https://arxiv.org/abs/2405.03133)
Keywords: language model
Abstract: Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
摘要：专家混合 (MoE) 模型有助于高效扩展；然而，训练路由器网络带来了优化不可微的离散目标的挑战。最近，提出了一种完全可微的 MoE 架构 SMEAR（Muqeeth et al., 2023），它在参数空间中软合并专家；然而，它的有效性仅在分类任务的下游微调中得到证明。在本文中，我们提出了 Lory，这是第一种将此类架构扩展到自回归语言模型预训练的方法。 Lory 介绍了两个关键技术：（1）因果分段路由策略，可以在保持语言模型自回归性质的同时实现专家合并操作的高效率；（2）基于相似性的数据批处理方法，通过对训练实例中的相似文档进行分组来鼓励专家专业化。我们从头开始在 150B 代币上预训练了一系列 Lory 模型，最多有 32 位专家和 30B（1.5B 活跃）参数。实验结果表明，与参数匹配的密集模型相比，在困惑度 (+13.9%) 和各种下游任务 (+1.5%-11.1%) 上都有显着的性能提升。尽管存在段级路由，但与具有令牌级路由的最先进的 MoE 模型相比，Lory 模型仍实现了具有竞争力的性能。我们进一步证明，Lory 中训练有素的专家无需监督即可获得领域级专业化。我们的工作强调了完全可微的 MoE 架构在语言模型预训练方面的潜力，并倡导该领域的未来研究。
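
The abstract above describes two ideas that a small example may make more concrete: merging experts softly in parameter space, and routing per segment so that the merge for a segment never depends on that segment's own tokens. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the use of a per-segment summary vector (for example, a mean hidden state), the linear router, and the uniform weights for the first segment are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_experts(experts, weights):
    """Weighted average of expert weight matrices (each a list of rows)."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [[sum(w * e[r][c] for w, e in zip(weights, experts))
             for c in range(cols)] for r in range(rows)]


def route(segment_summary, router):
    """Router logits are dot products of the summary with one row per expert."""
    logits = [sum(a * b for a, b in zip(row, segment_summary)) for row in router]
    return softmax(logits)


def causal_segment_weights(segment_summaries, router, num_experts):
    """Routing weights for segment k depend only on the summary of segment k-1."""
    # Assumption for this sketch: the first segment has no predecessor,
    # so it uses uniform weights.
    weights = [[1.0 / num_experts] * num_experts]
    for previous in segment_summaries[:-1]:
        weights.append(route(previous, router))
    return weights


# Two toy 2x2 experts, a toy linear router, and two segment summaries.
experts = [[[1.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [0.0, 3.0]]]
router = [[1.0, 0.0], [0.0, 1.0]]
summaries = [[10.0, 0.0], [0.0, 10.0]]

weights = causal_segment_weights(summaries, router, num_experts=2)
merged = [merge_experts(experts, w) for w in weights]
```

In this toy run, the first segment uses the plain average of the two experts, while the second segment's merged matrix is dominated by the first expert, because the router only sees the summary of the preceding segment. Every operation is a weighted sum, so the merge stays differentiable with respect to the router parameters.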

Title: CRAFT: Extracting and Tuning Cultural Instructions from the Wild

Authors: Bin Wang, Geyu Lin, Zhengyuan Liu, Chengwei Wei, Nancy F. Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03138
Pdf URL: https://arxiv.org/pdf/2405.03138
Copy Paste: [[2405.03138]] CRAFT: Extracting and Tuning Cultural Instructions from the Wild(https://arxiv.org/abs/2405.03138)
Keywords: language model, llm
Abstract: Large language models (LLMs) have rapidly evolved as the foundation of various natural language processing (NLP) applications. Despite their wide range of use cases, their understanding of culturally-related concepts and reasoning remains limited. Meanwhile, there is a significant need to enhance these models' cultural reasoning capabilities, especially concerning underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally-related instruction tuning datasets from vast unstructured corpora. We utilize a self-instruction generation pipeline to identify cultural concepts and trigger instruction. By integrating with a general-purpose instruction tuning dataset, our model demonstrates enhanced capabilities in recognizing and understanding regional cultural nuances, thereby enhancing its reasoning capabilities. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvement of up to 6%. Our research opens new avenues for extracting cultural instruction tuning sets directly from unstructured data, setting a precedent for future innovations in the field.
摘要：大型语言模型 (LLM) 作为各种自然语言处理 (NLP) 应用程序的基础迅速发展。尽管用例广泛，但他们对文化相关概念和推理的理解仍然有限。与此同时，非常需要增强这些模型的文化推理能力，特别是对于代表性不足的地区。本文介绍了一种新颖的管道，用于从大量非结构化语料库中提取高质量的、与文化相关的指令调整数据集。我们利用自我指令生成管道来识别文化概念并触发指令。通过与通用指令调整数据集集成，我们的模型展示了识别和理解区域文化细微差别的增强能力，从而增强了其推理能力。我们在新加坡、菲律宾和美国三个地区进行了实验，实现了高达 6% 的性能提升。我们的研究开辟了直接从非结构化数据中提取文化指令调整集的新途径，为该领域的未来创新树立了先例。

Title: Exploring the Potential of the Large Language Models (LLMs) in Identifying Misleading News Headlines

Authors: Md Main Uddin Rony, Md Mahfuzul Haque, Mohammad Ali, Ahmed Shatil Alam, Naeemul Hassan
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2405.03153
Pdf URL: https://arxiv.org/pdf/2405.03153
Copy Paste: [[2405.03153]] Exploring the Potential of the Large Language Models (LLMs) in Identifying Misleading News Headlines(https://arxiv.org/abs/2405.03153)
Keywords: language model, gpt, llm, chat
Abstract: In the digital age, the prevalence of misleading news headlines poses a significant challenge to information integrity, necessitating robust detection mechanisms. This study explores the efficacy of Large Language Models (LLMs) in identifying misleading versus non-misleading news headlines. Utilizing a dataset of 60 articles, sourced from both reputable and questionable outlets across health, science & tech, and business domains, we employ three LLMs (ChatGPT-3.5, ChatGPT-4, and Gemini) for classification. Our analysis reveals significant variance in model performance, with ChatGPT-4 demonstrating superior accuracy, especially in cases with unanimous annotator agreement on misleading headlines. The study emphasizes the importance of human-centered evaluation in developing LLMs that can navigate the complexities of misinformation detection, aligning technical proficiency with nuanced human judgment. Our findings contribute to the discourse on AI ethics, emphasizing the need for models that are not only technically advanced but also ethically aligned and sensitive to the subtleties of human interpretation.
摘要：在数字时代，误导性新闻标题的盛行对信息完整性提出了重大挑战，需要强大的检测机制。本研究探讨了大型语言模型 (LLM) 在识别误导性与非误导性新闻标题方面的功效。利用来自健康、科学技术和商业领域的信誉良好和有问题的媒体的 60 篇文章的数据集，我们采用了三位法学硕士（ChatGPT-3.5、ChatGPT-4 和 Gemini-）进行分类。我们的分析揭示了模型性能的显着差异，ChatGPT-4 表现出卓越的准确性，特别是在注释者一致同意误导性标题的情况下。该研究强调了以人为本的评估在开发法学硕士方面的重要性，该硕士可以解决错误信息检测的复杂性，使技术熟练程度与人类细致入微的判断保持一致。我们的研究结果为人工智能伦理的讨论做出了贡献，强调了对模型的需求，这些模型不仅在技术上先进，而且在伦理上一致并对人类解释的微妙之处敏感。

Title: Oracle-Checker Scheme for Evaluating a Generative Large Language Model

Authors: Yueling Jenny Zeng, Li-C. Wang, Thomas Ibbetson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03170
Pdf URL: https://arxiv.org/pdf/2405.03170
Copy Paste: [[2405.03170]] Oracle-Checker Scheme for Evaluating a Generative Large Language Model(https://arxiv.org/abs/2405.03170)
Keywords: language model, llm
Abstract: This work presents a novel approach, called the oracle-checker scheme, for evaluating the answer given by a generative large language model (LLM). Two types of checkers are presented. The first type of checker follows the idea of property testing. The second type of checker follows the idea of program checking. Their applications are demonstrated in two separate contexts, entity extraction and paraphrase decision, respectively.
摘要：这项工作提出了一种称为预言机检查方案的新颖方法，用于评估生成大语言模型（LLM）给出的答案。提供了两种类型的检查器。第一种类型的检查器遵循属性测试的思想。第二种类型的检查器遵循程序检查的思想。它们的应用程序分别在两个不同的上下文中进行了演示：实体提取和释义决策。

Title: Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions

Authors: Ruizhe Li, Yanjun Gao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.03205
Pdf URL: https://arxiv.org/pdf/2405.03205
Copy Paste: [[2405.03205]] Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions(https://arxiv.org/abs/2405.03205)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs), such as the GPT-4 and LLaMA families, have demonstrated considerable success across diverse tasks, including multiple-choice questions (MCQs). However, these models exhibit a positional bias, and the GPT-2 family shows an even more severe anchored bias, consistently favouring the first choice 'A' in MCQs during inference. This anchored bias challenges the integrity of GPT-2's decision-making process, as it skews performance based on the position rather than the content of the choices in MCQs. In this study, we utilise the mechanistic interpretability approach to identify the internal modules within GPT-2 models responsible for this bias. We focus on the Multi-Layer Perceptron (MLP) layers and attention heads, using the "logit lens" method to trace and modify the specific value vectors that contribute to the bias. By updating these vectors within MLP and recalibrating attention patterns to neutralise the preference for the first choice 'A', we effectively mitigate the anchored bias. Our interventions not only correct the bias but also improve the overall MCQ prediction accuracy for the GPT-2 family across various datasets. This work represents the first comprehensive mechanistic analysis of anchored bias in MCQs within the GPT-2 models, introducing targeted, minimal-intervention strategies that significantly enhance GPT-2 model robustness and accuracy in MCQs. Our code is available at https://github.com/ruizheliUOA/Anchored_Bias_GPT2.
摘要：大型语言模型 (LLM)，例如 GPT-4 和 LLaMA 系列，在多种任务（包括多项选择题 (MCQ)）中表现出了相当大的成功。然而，这些模型表现出位置偏差，特别是 GPT-2 系列中更严重的锚定偏差，它们在推理过程中始终倾向于 MCQ 中的首选“A”。这种锚定偏差挑战了 GPT-2 决策过程的完整性，因为它根据位置而不是 MCQ 中选择的内容来扭曲性能。在本研究中，我们利用机械解释方法来识别 GPT-2 模型中造成这种偏差的内部模块。我们专注于多层感知器（MLP）层和注意力头，使用“logit 透镜”方法来跟踪和修改导致偏差的特定值向量。通过更新 MLP 中的这些向量并重新校准注意力模式以中和对第一选择“A”的偏好，我们有效地减轻了锚定偏差。我们的干预措施不仅纠正了偏差，还提高了 GPT-2 系列在不同数据集上的整体 MCQ 预测准确性。这项工作代表了对 GPT-2 模型中 MCQ 锚定偏差的首次全面机制分析，引入了有针对性的最小干预策略，可显着增强 GPT2 模型在 MCQ 中的稳健性和准确性。我们的代码位于 https://github.com/ruizheliUOA/Anchored_Bias_GPT2。

Title: Vietnamese AI Generated Text Detection

Authors: Quang-Dan Tran, Van-Quan Nguyen, Quang-Huy Pham, K. B. Thang Nguyen, Trong-Hop Do
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.03206
Pdf URL: https://arxiv.org/pdf/2405.03206
Copy Paste: [[2405.03206]] Vietnamese AI Generated Text Detection(https://arxiv.org/abs/2405.03206)
Keywords: language model, llm
Abstract: In recent years, Large Language Models (LLMs) have become integrated into our daily lives, serving as invaluable assistants in completing tasks. Because LLMs are so widely embraced by users, their abuse is inevitable, particularly in using them to generate text content for various purposes, leading to difficulties in distinguishing between text generated by LLMs and that written by humans. In this study, we present a dataset named ViDetect, comprising 6,800 samples of Vietnamese essays, with 3,400 samples authored by humans and the remainder generated by LLMs, serving the purpose of detecting text generated by AI. We conducted evaluations using state-of-the-art methods, including ViT5, BartPho, PhoBERT, mDeberta V3, and mBERT. These results not only contribute to the growing body of research on detecting text generated by AI but also demonstrate the adaptability and effectiveness of different methods in the Vietnamese language context. This research lays the foundation for future advancements in AI-generated text detection and provides valuable insights for researchers in the field of natural language processing.
摘要：近年来，大型语言模型（LLM）已经融入我们的日常生活，成为完成任务的宝贵助手。 LLM 受到用户的广泛欢迎，其滥用是不可避免的，特别是在使用它们生成用于各种目的的文本内容时，导致难以区分 LLM 生成的文本和人类编写的文本。在这项研究中，我们提出了一个名为 ViDetect 的数据集，其中包含 6800 个越南论文样本，其中 3400 个样本由人类撰写，其余样本由法学硕士生成，用于检测人工智能生成的文本。我们使用最先进的方法进行评估，包括 ViT5、BartPho、PhoBERT、mDeberta V3 和 mBERT。这些结果不仅有助于不断增长的检测人工智能生成文本的研究，而且还证明了不同方法在越南语言环境中的适应性和有效性。这项研究为人工智能生成文本检测的未来发展奠定了基础，并为自然语言处理领域的研究人员提供了宝贵的见解。

Title: A Philosophical Introduction to Language Models - Part II: The Way Forward

Authors: Raphaël Millière, Cameron Buckner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03207
Pdf URL: https://arxiv.org/pdf/2405.03207
Copy Paste: [[2405.03207]] A Philosophical Introduction to Language Models - Part II: The Way Forward(https://arxiv.org/abs/2405.03207)
Keywords: language model, llm
Abstract: In this paper, the second of two companion pieces, we explore novel philosophical questions raised by recent progress in large language models (LLMs) that go beyond the classical debates covered in the first part. We focus particularly on issues related to interpretability, examining evidence from causal intervention methods about the nature of LLMs' internal representations and computations. We also discuss the implications of multimodal and modular extensions of LLMs, recent debates about whether such systems may meet minimal criteria for consciousness, and concerns about secrecy and reproducibility in LLM research. Finally, we discuss whether LLM-like systems may be relevant to modeling aspects of human cognition, if their architectural characteristics and learning scenario are adequately constrained.
摘要：在本文（两篇姊妹篇中的第二篇）中，我们探讨了大型语言模型（LLM）的最新进展所提出的新颖的哲学问题，这些问题超出了第一部分中涵盖的经典辩论。我们特别关注与可解释性相关的问题，检查来自因果干预方法的关于法学硕士内部表征和计算性质的证据。我们还讨论了法学硕士的多模式和模块化扩展的影响、最近关于此类系统是否满足意识最低标准的争论，以及对法学硕士研究的保密性和可重复性的担忧。最后，我们讨论如果类 LLM 系统的架构特征和学习场景受到充分约束，它们是否可能与人类认知的建模方面相关。

Title: Lifelong Knowledge Editing for LLMs with Retrieval-Augmented Continuous Prompt Learning

Authors: Qizhou Chen, Taolin Zhang, Dongyang Li, Longtao Huang, Hui Xue, Chengyu Wang, Xiaofeng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03279
Pdf URL: https://arxiv.org/pdf/2405.03279
Copy Paste: [[2405.03279]] Lifelong Knowledge Editing for LLMs with Retrieval-Augmented Continuous Prompt Learning(https://arxiv.org/abs/2405.03279)
Keywords: language model, llm, prompt
Abstract: Model editing aims to correct outdated or erroneous knowledge in large language models (LLMs) without the need for costly retraining. Lifelong model editing is the most challenging task that caters to the continuous editing requirements of LLMs. Prior works primarily focus on single or batch editing; nevertheless, these methods fall short in lifelong editing scenarios due to catastrophic knowledge forgetting and the degradation of model performance. Although retrieval-based methods alleviate these issues, they are impeded by slow and cumbersome processes of integrating the retrieved knowledge into the model. In this work, we introduce RECIPE, a RetriEval-augmented ContInuous Prompt lEarning method, to boost editing efficacy and inference efficiency in lifelong learning. RECIPE first converts knowledge statements into short and informative continuous prompts, prefixed to the LLM's input query embedding, to efficiently refine the response grounded on the knowledge. It further integrates the Knowledge Sentinel (KS) that acts as an intermediary to calculate a dynamic threshold, determining whether the retrieval repository contains relevant knowledge. Our retriever and prompt encoder are jointly trained to achieve editing properties, i.e., reliability, generality, and locality. In our experiments, RECIPE is assessed extensively across multiple LLMs and editing datasets, where it achieves superior editing performance. RECIPE also demonstrates its capability to maintain the overall performance of LLMs alongside showcasing fast editing and inference speed.
摘要：模型编辑旨在纠正大型语言模型 (LLM) 中过时或错误的知识，而不需要进行昂贵的再培训。终身模型编辑是最具挑战性的任务，可以满足法学硕士持续编辑的要求。之前的作品主要集中于单个或批量编辑；然而，由于灾难性的知识遗忘和模型性能的下降，这些方法在终身编辑场景中存在不足。尽管基于检索的方法缓解了这些问题，但它们受到将检索到的知识集成到模型中的缓慢而繁琐的过程的阻碍。在这项工作中，我们引入了 RECIPE，一种 RetriEval 增强的连续提示学习方法，以提高终身学习中的编辑效率和推理效率。 RECIPE 首先将知识陈述转换为简短且内容丰富的连续提示，作为 LLM 输入查询嵌入的前缀，以有效地细化基于知识的响应。它进一步集成知识哨兵（KS）作为中介来计算动态阈值，确定检索存储库是否包含相关知识。我们的检索器和提示编码器经过联合训练，以实现编辑属性，即可靠性、通用性和局部性。在我们的实验中，RECIPE 在多个法学硕士和编辑数据集上进行了广泛的评估，它实现了卓越的编辑性能。 RECIPE 还展示了其保持法学硕士整体性能的能力，同时展示了快速编辑和推理速度。

Title: MedDoc-Bot: A Chat Tool for Comparative Analysis of Large Language Models in the Context of the Pediatric Hypertension Guideline

Authors: Mohamed Yaseen Jabarulla, Steffen Oeltze-Jafra, Philipp Beerbaum, Theodor Uden
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2405.03359
Pdf URL: https://arxiv.org/pdf/2405.03359
Copy Paste: [[2405.03359]] MedDoc-Bot: A Chat Tool for Comparative Analysis of Large Language Models in the Context of the Pediatric Hypertension Guideline(https://arxiv.org/abs/2405.03359)
Keywords: language model, llm, chat
Abstract: This research focuses on evaluating the non-commercial open-source large language models (LLMs) Meditron, MedAlpaca, Mistral, and Llama-2 for their efficacy in interpreting medical guidelines saved in PDF format. As a specific test scenario, we applied these models to the guidelines for hypertension in children and adolescents provided by the European Society of Cardiology (ESC). Leveraging Streamlit, a Python library, we developed a user-friendly medical document chatbot tool (MedDoc-Bot). This tool enables authorized users to upload PDF files and pose questions, generating interpretive responses from four locally stored LLMs. A pediatric expert provides a benchmark for evaluation by formulating questions and responses extracted from the ESC guidelines. The expert rates the model-generated responses based on their fidelity and relevance. Additionally, we evaluated the METEOR and chrF metric scores to assess the similarity of model responses to reference answers. Our study found that Llama-2 and Mistral performed well in metrics evaluation. However, Llama-2 was slower when dealing with text and tabular data. In our human evaluation, we observed that responses created by Mistral, Meditron, and Llama-2 exhibited reasonable fidelity and relevance. This study provides valuable insights into the strengths and limitations of LLMs for future developments in medical document interpretation. Open-Source Code: https://github.com/yaseen28/MedDoc-Bot
摘要：本研究重点评估非商业开源大型语言模型 (LLM) Meditron、MedAlpaca、Mistral 和 Llama-2 在解释以 PDF 格式保存的医疗指南方面的有效性。作为一个特定的测试场景，我们将这些模型应用于欧洲心脏病学会 (ESC) 提供的儿童和青少年高血压指南。利用 Python 库 Streamlit，我们开发了一个用户友好的医疗文档聊天机器人工具 (MedDoc-Bot)。此工具允许授权用户上传 PDF 文件并提出问题，从四个本地存储的 LLM 生成解释性响应。儿科专家通过制定从 ESC 指南中提取的问题和答案来提供评估基准。专家根据模型生成的响应的保真度和相关性对其进行评分。此外，我们评估了 METEOR 和 chrF 指标分数，以评估模型响应与参考答案的相似性。我们的研究发现 Llama-2 和 Mistral 在指标评估中表现良好。但是，Llama-2 在处理文本和表格数据时速度较慢。在我们的人工评估中，我们观察到 Mistral、Meditron 和 Llama-2 产生的响应表现出合理的保真度和相关性。这项研究为 LLM 的优势和局限性提供了宝贵的见解，有助于未来医学文档解释的发展。开源代码：https://github.com/yaseen28/MedDoc-Bot

Title: Explainable Fake News Detection With Large Language Model via Defense Among Competing Wisdom

Authors: Bo Wang, Jing Ma, Hongzhan Lin, Zhiwei Yang, Ruichao Yang, Yuan Tian, Yi Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03371
Pdf URL: https://arxiv.org/pdf/2405.03371
Copy Paste: [[2405.03371]] Explainable Fake News Detection With Large Language Model via Defense Among Competing Wisdom(https://arxiv.org/abs/2405.03371)
Keywords: language model, prompt
Abstract: Most fake news detection methods learn latent feature representations based on neural networks, which makes them black boxes that classify a piece of news without giving any justification. Existing explainable systems generate veracity justifications from investigative journalism, which suffer from delayed debunking and low efficiency. Recent studies simply assume that the justification is equivalent to the majority opinions expressed in the wisdom of crowds. However, the opinions typically contain some inaccurate or biased information since the wisdom of crowds is uncensored. To detect fake news from a sea of diverse, crowded and even competing narratives, in this paper, we propose a novel defense-based explainable fake news detection framework. Specifically, we first propose an evidence extraction module to split the wisdom of crowds into two competing parties and respectively detect salient evidence. To gain concise insights from the evidence, we then design a prompt-based module that utilizes a large language model to generate justifications by inferring reasons towards two possible veracities. Finally, we propose a defense-based inference module to determine veracity via modeling the defense among these justifications. Extensive experiments conducted on two real-world benchmarks demonstrate that our proposed method outperforms state-of-the-art baselines in terms of fake news detection and provides high-quality justifications.
摘要：大多数假新闻检测方法都是基于神经网络学习潜在特征表示，这使得它们成为对新闻进行分类的黑匣子，而无需给出任何理由。现有的可解释系统从调查性新闻中产生真实性理由，但存在揭穿延迟和效率低下的问题。最近的研究简单地假设合理性相当于群体智慧中表达的多数意见。然而，由于群体的智慧未经审查，这些意见通常包含一些不准确或有偏见的信息。为了从多样化、拥挤甚至相互竞争的叙述中检测假新闻，在本文中，我们提出了一种新颖的基于防御的可解释假新闻检测框架。具体来说，我们首先提出了一个证据提取模块，将群体的智慧分成两个竞争方，并分别检测显着证据。为了从证据中获得简洁的见解，我们设计了一个基于提示的模块，该模块利用大型语言模型通过推断两种可能的准确性的原因来生成理由。最后，我们提出了一个基于防御的推理模块，通过对这些理由之间的防御进行建模来确定准确性。在两个现实世界基准上进行的广泛实验表明，我们提出的方法在假新闻检测方面优于最先进的基线，并提供了高质量的理由。

Title: The high dimensional psychological profile and cultural bias of ChatGPT

Authors: Hang Yuan (1), Zhongyue Che (1), Shao Li (1), Yue Zhang, Xiaomeng Hu (2), Siyang Luo (1) ((1) Sun Yat-Sen University, (2) Renmin University of China)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03387
Pdf URL: https://arxiv.org/pdf/2405.03387
Copy Paste: [[2405.03387]] The high dimensional psychological profile and cultural bias of ChatGPT(https://arxiv.org/abs/2405.03387)
Keywords: language model, gpt, chat
Abstract: Given the rapid advancement of large-scale language models, artificial intelligence (AI) models, like ChatGPT, are playing an increasingly prominent role in human society. However, to ensure that artificial intelligence models benefit human society, we must first fully understand the similarities and differences between the human-like characteristics exhibited by artificial intelligence models and real humans, as well as the cultural stereotypes and biases that artificial intelligence models may exhibit in the process of interacting with humans. This study first measured ChatGPT in 84 dimensions of psychological characteristics, revealing differences between ChatGPT and human norms in most dimensions as well as in high-dimensional psychological representations. Additionally, through the measurement of ChatGPT in 13 dimensions of cultural values, it was revealed that ChatGPT's cultural value patterns are dissimilar to those of various countries/regions worldwide. Finally, an analysis of ChatGPT's performance in eight decision-making tasks involving interactions with humans from different countries/regions revealed that ChatGPT exhibits clear cultural stereotypes in most decision-making tasks and shows significant cultural bias in third-party punishment and ultimatum games. The findings indicate that, compared to humans, ChatGPT exhibits a distinct psychological profile and cultural value orientation, and it also shows cultural biases and stereotypes in interpersonal decision-making. Future research endeavors should emphasize enhanced technical oversight and augmented transparency in the database and algorithmic training procedures to foster more efficient cross-cultural communication and mitigate social disparities.
摘要：随着大规模语言模型的快速发展，以 ChatGPT 为代表的人工智能模型在人类社会中发挥着越来越重要的作用。然而，要让人工智能模型真正造福人类社会，首先必须充分理解人工智能模型与真实人类所展现的类人特征的异同，以及人工智能模型在与人类交互过程中可能表现出的文化刻板印象和偏见。本研究首先对 ChatGPT 进行了 84 个心理特征维度的测量，发现 ChatGPT 在大多数维度以及高维心理表征上与人类规范存在差异。此外，通过对 ChatGPT 在 13 个文化价值观维度的测量，发现 ChatGPT 的文化价值观模式与全球各个国家/地区的文化价值观模式存在差异。最后，通过对 ChatGPT 在 8 个涉及与来自不同国家/地区的人类交互的决策任务中的表现的分析，发现 ChatGPT 在大多数决策任务中表现出明显的文化刻板印象，在第三方惩罚和最后通牒博弈中表现出明显的文化偏见。研究结果表明，与人类相比，ChatGPT 表现出独特的心理特征和文化价值取向，并且在人际决策中表现出文化偏见和刻板印象。未来的研究工作应强调加强技术监督和提高数据库和算法训练程序的透明度，以促进更有效的跨文化交流并缓解社会差距。

Title: Gaussian Stochastic Weight Averaging for Bayesian Low-Rank Adaptation of Large Language Models

Authors: Emre Onal, Klemens Flöge, Emma Caldwell, Arsen Sheverdin, Vincent Fortuin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03425
Pdf URL: https://arxiv.org/pdf/2405.03425
Copy Paste: [[2405.03425]] Gaussian Stochastic Weight Averaging for Bayesian Low-Rank Adaptation of Large Language Models(https://arxiv.org/abs/2405.03425)
Keywords: language model, llm
Abstract: Fine-tuned Large Language Models (LLMs) often suffer from overconfidence and poor calibration, particularly when fine-tuned on small datasets. To address these challenges, we propose a simple combination of Low-Rank Adaptation (LoRA) with Gaussian Stochastic Weight Averaging (SWAG), facilitating approximate Bayesian inference in LLMs. Through extensive testing across several Natural Language Processing (NLP) benchmarks, we demonstrate that our straightforward and computationally efficient approach improves model generalization and calibration. We further show that our method exhibits greater robustness against distribution shift, as reflected in its performance on out-of-distribution tasks.
摘要：微调大型语言模型 (LLM) 经常会遭受过度自信和校准不良的困扰，尤其是在小数据集上进行微调时。为了应对这些挑战，我们提出了低秩适应（LoRA）与高斯随机权重平均（SWAG）的简单组合，以促进法学硕士中的近似贝叶斯推理。通过对多个自然语言处理 (NLP) 基准的广泛测试，我们证明了我们简单且计算高效的方法改进了模型泛化和校准。我们进一步表明，我们的方法对分布变化表现出更大的鲁棒性，这反映在其在分布外任务上的性能。
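
To illustrate the idea of combining LoRA with Gaussian Stochastic Weight Averaging, here is a minimal sketch of the diagonal part of SWAG: it keeps running first and second moments of a parameter vector over training snapshots and then samples from the resulting Gaussian. This is a simplification under stated assumptions, not the authors' code. The class name `DiagonalSwag`, the flat parameter list standing in for LoRA adapter weights, the snapshot values, and the omission of the low-rank covariance term used in full SWAG are all assumptions made for illustration.

```python
import math
import random


class DiagonalSwag:
    """Running first and second moments of a flat parameter vector."""

    def __init__(self, num_params):
        self.n = 0
        self.mean = [0.0] * num_params
        self.sq_mean = [0.0] * num_params

    def collect(self, params):
        # Incremental update of both running averages after one snapshot.
        self.n += 1
        for i, p in enumerate(params):
            self.mean[i] += (p - self.mean[i]) / self.n
            self.sq_mean[i] += (p * p - self.sq_mean[i]) / self.n

    def variance(self):
        # Clamp at zero to guard against small negative values from rounding.
        return [max(s - m * m, 0.0) for m, s in zip(self.mean, self.sq_mean)]

    def sample(self, rng, scale=1.0):
        # Draw one parameter vector from N(mean, scale * diag(variance)).
        return [m + math.sqrt(scale * v) * rng.gauss(0.0, 1.0)
                for m, v in zip(self.mean, self.variance())]


# Hypothetical snapshots of two adapter weights taken late in fine-tuning.
swag = DiagonalSwag(num_params=2)
for snapshot in ([1.0, 0.0], [3.0, 0.0], [2.0, 0.0]):
    swag.collect(snapshot)

rng = random.Random(0)
samples = [swag.sample(rng) for _ in range(5)]
```

At prediction time, one would typically load each sampled vector into the adapter, run the model, and average the predictive distributions across samples. Restricting the snapshots to the low-rank adapter weights keeps the stored moments small compared with tracking every weight of the base model.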

Title: MAmmoTH2: Scaling Instructions from the Web

Authors: Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03548
Pdf URL: https://arxiv.org/pdf/2405.03548
Copy Paste: [[2405.03548]] MAmmoTH2: Scaling Instructions from the Web(https://arxiv.org/abs/2405.03548)
Keywords: language model, gpt, llm, chat
Abstract: Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B's (Mistral) performance increases from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new paradigm for building better instruction tuning data.
摘要：指令调优提高了大型语言模型（LLM）的推理能力，其中数据质量和可扩展性是关键因素。大多数指令调优数据来自人类众包或 GPT-4 蒸馏。我们提出了一种范例，可以从预训练网络语料库中有效地收集 1000 万个自然存在的指令数据，以增强 LLM 推理。我们的方法包括（1）回忆相关文档，（2）提取指令-响应对，以及（3）使用开源法学硕士精炼提取的对。我们在此数据集上微调基础 LLM，构建了 MAmmoTH2 模型，这显着提高了推理基准的性能。值得注意的是，MAmmoTH2-7B (Mistral) 的 MATH 性能从 11% 提高到 34%，在 GSM8K 上从 36% 提高到 67%，无需使用任何域内数据进行训练。在公共指令调整数据集上进一步训练 MAmmoTH2 产生了 MAmmoTH2-Plus，在多个推理和聊天机器人基准测试中实现了最先进的性能。我们的工作演示了如何在无需昂贵的人工注释或 GPT-4 蒸馏的情况下收获大规模、高质量的指令数据，为构建更好的指令调整数据提供了新的范例。

Title: AlphaMath Almost Zero: process Supervision without process

Authors: Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.03553
Pdf URL: https://arxiv.org/pdf/2405.03553
Copy Paste: [[2405.03553]] AlphaMath Almost Zero: process Supervision without process(https://arxiv.org/abs/2405.03553)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have substantially enhanced their mathematical reasoning abilities. However, these models still struggle with complex problems that require multiple reasoning steps, frequently leading to logical or numerical errors. While numerical mistakes can largely be addressed by integrating a code interpreter, identifying logical errors within intermediate steps is more challenging. Moreover, manually annotating these steps for training is not only expensive but also demands specialized expertise. In this study, we introduce an innovative approach that eliminates the need for manual annotation by leveraging the Monte Carlo Tree Search (MCTS) framework to generate both the process supervision and evaluation signals automatically. Essentially, when an LLM is well pre-trained, only the mathematical questions and their final answers are required to generate our training data, without requiring the solutions. We proceed to train a step-level value model designed to improve the LLM's inference process in mathematical domains. Our experiments indicate that using automatically generated solutions by LLMs enhanced with MCTS significantly improves the model's proficiency in dealing with intricate mathematical reasoning tasks.
摘要：大型语言模型（LLM）的最新进展极大地增强了他们的数学推理能力。然而，这些模型仍然难以解决需要多个推理步骤的复杂问题，经常导致逻辑或数字错误。虽然数字错误很大程度上可以通过集成代码解释器来解决，但识别中间步骤中的逻辑错误更具挑战性。此外，手动注释这些培训步骤不仅成本高昂，而且需要专业知识。在本研究中，我们引入了一种创新方法，通过利用蒙特卡罗树搜索（MCTS）框架自动生成过程监督和评估信号，从而消除了手动注释的需要。本质上，当法学硕士经过良好的预训练时，只需要数学问题及其最终答案来生成我们的训练数据，而不需要解决方案。我们继续训练一个阶梯级价值模型，旨在改进法学硕士在数学领域的推理过程。我们的实验表明，使用由 MCTS 增强的法学硕士自动生成的解决方案可显着提高模型处理复杂数学推理任务的能力。

Title: Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Authors: Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.03594
Pdf URL: https://arxiv.org/pdf/2405.03594
Copy Paste: [[2405.03594]] Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment(https://arxiv.org/abs/2405.03594)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.
摘要：大型语言模型 (LLM) 彻底改变了自然语言处理 (NLP)，但它们的规模造成了计算瓶颈。我们引入了一种新颖的方法来创建准确、稀疏的高性能 LLM 基础版本，以高达 70% 的稀疏度实现微调任务的完全精度恢复。我们通过结合 SparseGPT 一次性剪枝方法和在 SlimPajama 数据集的子集上与 The Stack 数据集的 Python 子集混合的这些模型的稀疏预训练，为 LLaMA-2 7B 模型实现了这一目标。我们展示了由于 Cerebras CS-3 芯片上的稀疏性而导致的训练加速，与理论缩放非常匹配。此外，我们利用 Neural Magic 的 DeepSparse 引擎在 CPU 上实现高达 3 倍的推理加速，并通过 Neural Magic 的 nm-vllm 引擎在 GPU 上实现高达 1.7 倍的推理加速。上述增益仅通过稀疏性实现，从而通过额外使用量化来实现进一步的增益。具体来说，我们展示了稀疏量化 LLaMA 模型的 CPU 总加速高达 8.6 倍。我们在各种具有挑战性的任务中展示了这些结果，包括聊天、指令跟踪、代码生成、算术推理和总结，以证明它们的通用性。这项工作为在不牺牲准确性的情况下快速创建更小、更快的法学硕士铺平了道路。

Title: GREEN: Generative Radiology Report Evaluation and Error Notation

Authors: Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Bluethgen, Arne Edward Michalson, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, Jean-Benoit Delbrouck
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.03595
Pdf URL: https://arxiv.org/pdf/2405.03595
Copy Paste: [[2405.03595]] GREEN: Generative Radiology Report Evaluation and Error Notation(https://arxiv.org/abs/2405.03595)
Keywords: language model, gpt
Abstract: Evaluating radiology reports is a challenging problem as factual correctness is extremely important due to the need for accurate medical communication about medical images. Existing automatic evaluation metrics either suffer from failing to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts. We validate our GREEN metric by comparing it to GPT-4, as well as to error counts of 6 experts and preferences of 2 experts. Our method demonstrates not only higher correlation with expert error counts, but simultaneously higher alignment with expert preferences when compared to previous approaches.

Title: Towards A Human-in-the-Loop LLM Approach to Collaborative Discourse Analysis

Authors: Clayton Cohn, Caitlin Snyder, Justin Montenegro, Gautam Biswas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03677
Pdf URL: https://arxiv.org/pdf/2405.03677
Copy Paste: [[2405.03677]] Towards A Human-in-the-Loop LLM Approach to Collaborative Discourse Analysis(https://arxiv.org/abs/2405.03677)
Keywords: gpt, llm, prompt
Abstract: LLMs have demonstrated proficiency in contextualizing their outputs using human input, often matching or beating human-level performance on a variety of tasks. However, LLMs have not yet been used to characterize synergistic learning in students' collaborative discourse. In this exploratory work, we take a first step towards adopting a human-in-the-loop prompt engineering approach with GPT-4-Turbo to summarize and categorize students' synergistic learning during collaborative discourse. Our preliminary findings suggest GPT-4-Turbo may be able to characterize students' synergistic learning in a manner comparable to humans and that our approach warrants further investigation.

Title: Large Language Models Reveal Information Operation Goals, Tactics, and Narrative Frames

Authors: Keith Burghardt, Kai Chen, Kristina Lerman
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2405.03688
Pdf URL: https://arxiv.org/pdf/2405.03688
Copy Paste: [[2405.03688]] Large Language Models Reveal Information Operation Goals, Tactics, and Narrative Frames(https://arxiv.org/abs/2405.03688)
Keywords: language model, gpt, llm
Abstract: Adversarial information operations can destabilize societies by undermining fair elections, manipulating public opinions on policies, and promoting scams. Despite their widespread occurrence and potential impacts, our understanding of influence campaigns is limited by manual analysis of messages and subjective interpretation of their observable behavior. In this paper, we explore whether these limitations can be mitigated with large language models (LLMs), using GPT-3.5 as a case study for coordinated campaign annotation. We first use GPT-3.5 to scrutinize 126 identified information operations spanning over a decade. We utilize a number of metrics to quantify the close (if imperfect) agreement between LLM and ground truth descriptions. We next extract coordinated campaigns from two large multilingual datasets from X (formerly Twitter) that respectively discuss the 2022 French election and the 2023 Balikatan Philippine-U.S. military exercise. For each coordinated campaign, we use GPT-3.5 to analyze posts related to a specific concern and extract goals, tactics, and narrative frames, both before and after critical events (such as the date of an election). While GPT-3.5 sometimes disagrees with subjective interpretation, its ability to summarize and interpret demonstrates LLMs' potential to extract higher-order indicators from text to provide a more complete picture of the information campaigns compared to previous methods.