2024-06-07

Title: Ranking Manipulation for Conversational Search Engines

Authors: Samuel Pfrommer, Yatong Bai, Tanmay Gautam, Somayeh Sojoudi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03589
Pdf URL: https://arxiv.org/pdf/2406.03589
Copy Paste: [[2406.03589]] Ranking Manipulation for Conversational Search Engines(https://arxiv.org/abs/2406.03589)
Keywords: language model, llm, prompt
Abstract: Major search engine providers are rapidly incorporating Large Language Model (LLM)-generated content in response to user queries. These conversational search engines operate by loading retrieved website text into the LLM context for summarization and interpretation. Recent research demonstrates that LLMs are highly vulnerable to jailbreaking and prompt injection attacks, which disrupt the safety and quality goals of LLMs using adversarial strings. This work investigates the impact of prompt injections on the ranking order of sources referenced by conversational search engines. To this end, we introduce a focused dataset of real-world consumer product websites and formalize conversational search ranking as an adversarial problem. Experimentally, we analyze conversational search rankings in the absence of adversarial injections and show that different LLMs vary significantly in prioritizing product name, document content, and context position. We then present a tree-of-attacks-based jailbreaking technique which reliably promotes low-ranked products. Importantly, these attacks transfer effectively to state-of-the-art conversational search engines such as this http URL. Given the strong financial incentive for website owners to boost their search ranking, we argue that our problem formulation is of critical importance for future robustness work.
摘要：各大搜索引擎提供商正在迅速整合大型语言模型 (LLM) 生成的内容以响应用户查询。这些对话式搜索引擎通过将检索到的网站文本加载到 LLM 上下文中进行总结和解释来运行。最近的研究表明，LLM 极易受到越狱和即时注入攻击，这些攻击会破坏使用对抗性字符串的 LLM 的安全性和质量目标。这项工作调查了即时注入对对话式搜索引擎引用的来源排名顺序的影响。为此，我们引入了一个现实世界消费品网站的重点数据集，并将对话式搜索排名形式化为对抗性问题。通过实验，我们在没有对抗性注入的情况下分析了对话式搜索排名，并表明不同的 LLM 在优先考虑产品名称、文档内容和上下文位置方面存在很大差异。然后，我们提出了一种基于攻击树的越狱技术，该技术可以可靠地推广排名较低的产品。重要的是，这些攻击有效地转移到最先进的对话式搜索引擎，例如此 http URL。考虑到网站所有者提高其搜索排名的强大经济动机，我们认为我们的问题表述对于未来的稳健性工作至关重要。

Title: Measuring Retrieval Complexity in Question Answering Systems

Authors: Matteo Gabburo, Nicolaas Paul Jedema, Siddhant Garg, Leonardo F. R. Ribeiro, Alessandro Moschitti
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03592
Pdf URL: https://arxiv.org/pdf/2406.03592
Copy Paste: [[2406.03592]] Measuring Retrieval Complexity in Question Answering Systems(https://arxiv.org/abs/2406.03592)
Keywords: llm
Abstract: In this paper, we investigate which questions are challenging for retrieval-based Question Answering (QA). We (i) propose retrieval complexity (RC), a novel metric conditioned on the completeness of retrieved documents, which measures the difficulty of answering questions, and (ii) propose an unsupervised pipeline to measure RC given an arbitrary retrieval system. Our proposed pipeline measures RC more accurately than alternative estimators, including LLMs, on six challenging QA benchmarks. Further investigation reveals that RC scores strongly correlate with both QA performance and expert judgment across five of the six studied benchmarks, indicating that RC is an effective measure of question difficulty. Subsequent categorization of high-RC questions shows that they span a broad set of question shapes, including multi-hop, compositional, and temporal QA, indicating that RC scores can categorize a new subset of complex questions. Our system can also have a major impact on retrieval-based systems by helping to identify more challenging questions on existing datasets.
摘要：在本文中，我们研究了哪些问题对于基于检索的问答系统 (QA) 具有挑战性。我们 (i) 提出了检索复杂度 (RC)，这是一种以检索文档的完整性为条件的新指标，用于衡量回答问题的难度；(ii) 提出了一种无监督的流程，用于在给定任意检索系统的情况下测量 RC。我们提出的流程在六个具有挑战性的 QA 基准上比其他估计器（包括 LLM）更准确地测量了 RC。进一步的研究表明，在六个研究基准中的五个中，RC 分数与 QA 性能和专家判断都具有很强的相关性，这表明 RC 是衡量问题难度的有效指标。随后对高 RC 问题进行分类表明，它们涵盖了广泛的问题形状，包括多跳、组合和时间 QA，这表明 RC 分数可以对一组新的复杂问题进行分类。我们的系统还可以通过帮助识别现有数据集上更具挑战性的问题，对基于检索的系统产生重大影响。

Title: Knowledge-Infused Legal Wisdom: Navigating LLM Consultation through the Lens of Diagnostics and Positive-Unlabeled Reinforcement Learning

Authors: Yang Wu, Chenghao Wang, Ece Gumusel, Xiaozhong Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03600
Pdf URL: https://arxiv.org/pdf/2406.03600
Copy Paste: [[2406.03600]] Knowledge-Infused Legal Wisdom: Navigating LLM Consultation through the Lens of Diagnostics and Positive-Unlabeled Reinforcement Learning(https://arxiv.org/abs/2406.03600)
Keywords: language model, llm
Abstract: The integration of generative Large Language Models (LLMs) into various applications, including the legal domain, has been accelerated by their expansive and versatile nature. However, when facing a legal case, users without a legal background often struggle to formulate professional queries and may inadvertently overlook critical legal factors when presenting their case narrative to LLMs. To address this issue, we propose the Diagnostic Legal Large Language Model (D3LM), which utilizes adaptive lawyer-like diagnostic questions to collect additional case information and then provides high-quality feedback. D3LM incorporates an innovative graph-based Positive-Unlabeled Reinforcement Learning (PURL) algorithm, enabling the generation of critical questions and enhancing user-LLM interactions. Moreover, an integrated LLM-based stopping criterion facilitates precise Court Views Generation (CVG). Our research also introduces a new English-language CVG dataset based on the US case law database, enriching the realm of LLM research and deployment with a vital dimension. D3LM surpasses classical LLMs by delivering outstanding performance and a remarkable user experience in the legal domain.
摘要：生成式大型语言模型 (LLM) 因其广泛性和多功能性而加速了其与法律领域等各种应用的集成。然而，在面对法律案件时，没有法律背景的用户往往难以提出专业查询，并且在向 LLM 呈现其案件叙述时可能会无意中忽略关键的法律因素。为了解决这个问题，我们提出了诊断性法律大型语言模型 (D3LM)，该模型利用自适应律师式诊断问题来收集额外的案件信息，然后提供高质量的反馈。D3LM 采用了创新的基于图的正向无标记强化学习 (PURL) 算法，可以生成关键问题并增强用户与 LLM 的交互。此外，集成的基于 LLM 的停止标准有助于精确生成法院观点 (CVG)。我们的研究还引入了基于美国判例法数据库的新的英语 CVG 数据集，为 LLM 研究和部署领域提供了重要的维度。D3LM 通过在法律领域提供出色的性能和卓越的用户体验超越了传统的 LLM。

Title: TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools

Authors: Avi Caciularu, Alon Jacovi, Eyal Ben-David, Sasha Goldshtein, Tal Schuster, Jonathan Herzig, Gal Elidan, Amir Globerson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03618
Pdf URL: https://arxiv.org/pdf/2406.03618
Copy Paste: [[2406.03618]] TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools(https://arxiv.org/abs/2406.03618)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts. To better evaluate this setting and facilitate modeling efforts, we introduce TACT - Text And Calculations through Tables, a dataset crafted to evaluate LLMs' reasoning and computational abilities using complex instructions. TACT contains challenging instructions that demand stitching information scattered across one or more texts, and performing complex integration on this information to generate the answer. We construct this dataset by leveraging an existing dataset of texts and their associated tables. For each such table, we formulate new queries and gather their respective answers. We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%. To pinpoint the difficulties and thoroughly dissect the problem, we analyze model performance across three components: table-generation, Pandas command-generation, and execution. Unexpectedly, we discover that each component presents substantial challenges for current LLMs. These insights lead us to propose a focused modeling framework, which we refer to as IE as a tool. Specifically, we propose to add "tools" for each of the above steps, and implement each such tool with few-shot prompting. This approach shows an improvement over existing prompting techniques, offering a promising direction for enhancing model capabilities in these tasks.
摘要：大型语言模型 (LLM) 通常在需要跨文本聚合信息的查询上表现不佳。为了更好地评估这种设置并促进建模工作，我们引入了 TACT - 通过表格进行文本和计算，这是一个使用复杂指令评估 LLM 推理和计算能力的数据集。TACT 包含具有挑战性的指令，要求拼接分散在一个或多个文本中的信息，并对这些信息进行复杂的集成以生成答案。我们通过利用现有的文本数据集及其相关表格来构建此数据集。对于每个这样的表，我们制定新的查询并收集它们各自的答案。我们证明所有当代 LLM 在该数据集上的表现都不佳，准确率低于 38%。为了找出困难并彻底剖析问题，我们分析了三个组件的模型性能：表生成、Pandas 命令生成和执行。出乎意料的是，我们发现每个组件都对当前的 LLM 提出了巨大的挑战。这些见解促使我们提出了一个重点建模框架，我们将其称为 IE 工具。具体来说，我们建议为上述每个步骤添加“工具”，并使用少样本提示来实现每个工具。这种方法比现有的提示技术有所改进，为增强这些任务中的模型能力提供了一个有希望的方向。

Title: Is Free Self-Alignment Possible?

Authors: Dyah Adila, Changho Shin, Yijing Zhang, Frederic Sala
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.03642
Pdf URL: https://arxiv.org/pdf/2406.03642
Copy Paste: [[2406.03642]] Is Free Self-Alignment Possible?(https://arxiv.org/abs/2406.03642)
Keywords: language model
Abstract: Aligning pretrained language models (LMs) is a complex and resource-intensive process, often requiring access to large amounts of ground-truth preference data and substantial compute. Are these costs necessary? That is, is it possible to align using only inherent model knowledge and without additional training? We tackle this challenge with AlignEZ, a novel approach that uses (1) self-generated preference data and (2) representation editing to provide nearly cost-free alignment. During inference, AlignEZ modifies LM representations to reduce undesirable and boost desirable components using subspaces identified via self-generated preference pairs. Our experiments reveal that this nearly cost-free procedure significantly narrows the gap between base pretrained and tuned models by an average of 31.6%, observed across six datasets and three model architectures. Additionally, we explore the potential of using AlignEZ as a means of expediting more expensive alignment procedures. Our experiments show that AlignEZ improves DPO models tuned only using a small subset of ground-truth preference data. Lastly, we study the conditions under which improvement using AlignEZ is feasible, providing valuable insights into its effectiveness.
摘要：对齐预训练语言模型 (LM) 是一个复杂且资源密集的过程，通常需要访问大量真实偏好数据和大量计算。这些成本是必要的吗？也就是说，是否可以使用固有模型知识进行对齐，而无需额外的训练？我们使用 AlignEZ 应对这一挑战，这是一种新颖的方法，它使用 (1) 自生成的偏好数据和 (2) 表示编辑来提供几乎免费的对齐。在推理过程中，AlignEZ 会修改 LM 表示以减少不良组件并增加所需组件，方法是使用通过自生成的偏好对识别的子空间。我们的实验表明，这种几乎免费的程序显着缩小了基础预训练模型和调整模型之间的差距，平均缩小了 31.6%，这是在六个数据集和三个模型架构中观察到的。此外，我们还探索了使用 AlignEZ 作为加快更昂贵的对齐程序的一种手段的潜力。我们的实验表明，AlignEZ 改进了仅使用一小部分真实偏好数据进行调整的 DPO 模型。最后，我们研究了使用 AlignEZ 进行改进的可行性条件，为其有效性提供了宝贵的见解。

Title: What Makes Language Models Good-enough?

Authors: Daiki Asami, Saku Sugawara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03666
Pdf URL: https://arxiv.org/pdf/2406.03666
Copy Paste: [[2406.03666]] What Makes Language Models Good-enough?(https://arxiv.org/abs/2406.03666)
Keywords: language model
Abstract: Psycholinguistic research suggests that humans may build a representation of linguistic input that is 'good-enough' for the task at hand. This study examines what architectural features make language models learn human-like good-enough language processing. We focus on the number of layers and self-attention heads in Transformers. We create a good-enough language processing (GELP) evaluation dataset (7,680 examples), which is designed to test the effects of two plausibility types, eight construction types, and three degrees of memory cost on language processing. To annotate GELP, we first conduct a crowdsourcing experiment whose design follows prior psycholinguistic studies. Our model evaluation against the annotated GELP then reveals that the full model as well as models with fewer layers and/or self-attention heads exhibit a good-enough performance. This result suggests that models with shallower depth and fewer heads can learn good-enough language processing.
摘要：心理语言学研究表明，人类可以构建一种“足够好”的语言输入表征，以完成手头的任务。本研究探讨了哪些架构特征使语言模型能够学习类似人类的足够好的语言处理。我们关注 Transformers 中的层数和自注意力头。我们创建了一个足够好的语言处理 (GELP) 评估数据集（7,680 个示例），旨在测试两种可信度类型、八种构造类型和三种内存成本对语言处理的影响。为了注释 GELP，我们首先进行众包实验，其设计遵循先前的心理语言学研究。我们对带注释的 GELP 进行的模型评估表明，完整模型以及层数和/或自注意力头较少的模型都表现出足够好的性能。这个结果表明，深度较浅、头部较少的模型可以学习足够好的语言处理。

Title: Evaluating the World Model Implicit in a Generative Model

Authors: Keyon Vafa, Justin Y. Chen, Jon Kleinberg, Sendhil Mullainathan, Ashesh Rambachan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03689
Pdf URL: https://arxiv.org/pdf/2406.03689
Copy Paste: [[2406.03689]] Evaluating the World Model Implicit in a Generative Model(https://arxiv.org/abs/2406.03689)
Keywords: language model
Abstract: Recent work suggests that large language models may implicitly learn world models. How should we assess this possibility? We formalize this question for the case where the underlying reality is governed by a deterministic finite automaton. This includes problems as diverse as simple logical reasoning, geographic navigation, game-playing, and chemistry. We propose new evaluation metrics for world model recovery inspired by the classic Myhill-Nerode theorem from language theory. We illustrate their utility in three domains: game playing, logic puzzles, and navigation. In all domains, the generative models we consider do well on existing diagnostics for assessing world models, but our evaluation metrics reveal their world models to be far less coherent than they appear. Such incoherence creates fragility: using a generative model to solve related but subtly different tasks can lead it to fail badly. Building generative models that meaningfully capture the underlying logic of the domains they model would be immensely valuable; our results suggest new ways to assess how close a given model is to that goal.
摘要：最近的研究表明，大型语言模型可能隐式地学习世界模型。我们应该如何评估这种可能性？我们将这个问题形式化为底层现实由确定性有限自动机控制的情况。这包括简单的逻辑推理、地理导航、游戏和化学等各种问题。我们提出了新的世界模型恢复评估指标，该指标受到语言理论中经典的 Myhill-Nerode 定理的启发。我们在三个领域说明了它们的实用性：游戏、逻辑谜题和导航。在所有领域中，我们考虑的生成模型在现有的评估世界模型的诊断上表现良好，但我们的评估指标表明它们的世界模型远没有看起来那么连贯。这种不连贯性造成了脆弱性：使用生成模型来解决相关但略有不同的任务可能会导致它严重失败。构建有意义地捕捉它们建模领域底层逻辑的生成模型将非常有价值；我们的结果提出了评估给定模型与该目标的接近程度的新方法。
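The Myhill-Nerode theorem mentioned in the abstract above states that two prefixes lead to the same state of a minimal deterministic finite automaton exactly when no suffix distinguishes them. The following is a minimal sketch of that idea only, not the paper's evaluation metrics. The toy automaton, the function names, and the `max_len` bound are all assumptions made for illustration, and a bounded search is only an approximation of the unbounded definition.

```python
from itertools import product

# Hypothetical toy DFA over {"a", "b"}: accepts strings with an even number of "a".
TRANSITIONS = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}
START, ACCEPTING = "even", {"even"}
ALPHABET = ("a", "b")


def accepts(string):
    """Run the DFA on `string` and report whether it ends in an accepting state."""
    state = START
    for symbol in string:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


def distinguishing_suffix(prefix_1, prefix_2, max_len=4):
    """Return a shortest suffix (length <= max_len) accepted after exactly one prefix, else None."""
    for length in range(max_len + 1):
        for symbols in product(ALPHABET, repeat=length):
            suffix = "".join(symbols)
            if accepts(prefix_1 + suffix) != accepts(prefix_2 + suffix):
                return suffix
    return None
```

In this toy setting, `distinguishing_suffix("ab", "ba")` finds nothing because both prefixes contain one "a", while `distinguishing_suffix("a", "b")` returns the empty suffix because the two prefixes already differ in acceptance.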

Title: M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering

Authors: Anand Subramanian, Viktor Schlegel, Abhinav Ramesh Kashyap, Thanh-Tung Nguyen, Vijay Prakash Dwivedi, Stefan Winkler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03699
Pdf URL: https://arxiv.org/pdf/2406.03699
Copy Paste: [[2406.03699]] M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering(https://arxiv.org/abs/2406.03699)
Keywords: language model, llm
Abstract: There is active research on adapting Large Language Models (LLMs) to perform a variety of tasks in high-stakes domains such as healthcare. Despite their popularity, there is a lack of understanding of the extent and contributing factors that allow LLMs to recall relevant knowledge and combine it with presented information in the clinical and biomedical domain: a fundamental pre-requisite for success on down-stream tasks. Addressing this gap, we use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains. Our multifaceted analysis of the performance of 15 LLMs, further broken down by sub-domain, source of knowledge and model architecture, uncovers success factors such as instruction tuning that lead to improved recall and comprehension. We further show that while recently proposed domain-adapted models may lack adequate knowledge, directly fine-tuning on our collected medical knowledge datasets shows encouraging results, even generalising to unseen specialist sub-domains. We complement the quantitative results with a skill-oriented manual error analysis, which reveals a significant gap between the models' capabilities to simply recall necessary knowledge and to integrate it with the presented context. To foster research and collaboration in this field we share M-QALM, our resources, standardised methodology, and evaluation results, with the research community to facilitate further advancements in clinical knowledge representation learning within language models.
摘要：关于如何将大型语言模型 (LLM) 应用于医疗保健等高风险领域的各种任务，已有大量研究成果。尽管它们非常受欢迎，但人们对于 LLM 能够回忆相关知识并将其与临床和生物医学领域中呈现的信息相结合的程度和影响因素缺乏了解：这是成功完成下游任务的基本先决条件。为了解决这一差距，我们使用多项选择题和抽象问答对三个通用和三个专业生物医学子领域的 22 个数据集进行了大规模实证研究。我们对 15 个 LLM 的性能进行了多方面的分析，进一步按子域、知识来源和模型架构细分，揭示了导致回忆和理解能力提高的成功因素，例如指令调整。我们进一步表明，虽然最近提出的领域适应模型可能缺乏足够的知识，但直接对我们收集的医学知识数据集进行微调显示出令人鼓舞的结果，甚至可以推广到看不见的专业子领域。我们通过以技能为导向的手动错误分析来补充定量结果，该分析揭示了模型在简单回忆必要知识和将其与呈现的上下文相结合的能力之间存在显著差距。为了促进该领域的研究和合作，我们与研究界分享 M-QALM、我们的资源、标准化方法和评估结果，以促进语言模型中临床知识表示学习的进一步发展。

Title: A Survey on Medical Large Language Models: Technology, Application, Trustworthiness, and Future Directions

Authors: Lei Liu, Xiaoyan Yang, Junchi Lei, Xiaoyang Liu, Yue Shen, Zhiqiang Zhang, Peng Wei, Jinjie Gu, Zhixuan Chu, Zhan Qin, Kui Ren
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.03712
Pdf URL: https://arxiv.org/pdf/2406.03712
Copy Paste: [[2406.03712]] A Survey on Medical Large Language Models: Technology, Application, Trustworthiness, and Future Directions(https://arxiv.org/abs/2406.03712)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Large language models (LLMs), such as GPT series models, have received substantial attention due to their impressive capabilities for generating and understanding human-level language. More recently, LLMs have emerged as an innovative and powerful adjunct in the medical field, transforming traditional practices and heralding a new era of enhanced healthcare services. This survey provides a comprehensive overview of Medical Large Language Models (Med-LLMs), outlining their evolution from general to the medical-specific domain (i.e., Technology and Application), as well as their transformative impact on healthcare (e.g., Trustworthiness and Safety). Concretely, starting from the fundamental history and technology of LLMs, we first delve into the progressive adaptation and refinements of general LLM models in the medical domain, especially emphasizing the advanced algorithms that boost the LLMs' performance in handling complicated medical environments, including clinical reasoning, knowledge graph, retrieval-augmented generation, human alignment, and multi-modal learning. Secondly, we explore the extensive applications of Med-LLMs across domains such as clinical decision support, report generation, and medical education, illustrating their potential to streamline healthcare services and augment patient outcomes. Thirdly, recognizing the imperative of responsible innovation, we discuss the challenges of ensuring fairness, accountability, privacy, and robustness in Med-LLM applications. Finally, we offer a concise discussion anticipating possible future trajectories of Med-LLMs, identifying avenues for their prudent expansion. By consolidating the above-mentioned insights, this review seeks to provide a comprehensive investigation of the potential strengths and limitations of Med-LLMs for professionals and researchers, ensuring a responsible landscape in the healthcare setting.
摘要：大型语言模型 (LLM)，例如 GPT 系列模型，因其生成和理解人类语言的出色能力而备受关注。最近，LLM 已成为医学领域的创新而强大的辅助手段，改变了传统做法，并预示着医疗服务增强的新时代的到来。本综述全面概述了医学大型语言模型 (Med-LLM)，概述了它们从通用到医学特定领域的演变（即技术和应用），以及它们对医疗保健的变革性影响（例如可信度和安全性）。具体来说，从 LLM 的基本历史和技术开始，我们首先深入研究通用 LLM 模型在医学领域的逐步适应和改进，特别强调了提升 LLM 在处理复杂医疗环境中性能的高级算法，包括临床推理、知识图谱、检索增强生成、人体对齐和多模态学习。其次，我们探讨了 Med-LLM 在临床决策支持、报告生成和医学教育等领域的广泛应用，说明了它们在简化医疗服务和改善患者治疗效果方面的潜力。最后，我们认识到创新的必要性和负责任性，并讨论了在 Med-LLM 应用中确保公平性、问责制、隐私性和稳健性的挑战。最后，我们进行了简要的讨论，以预测 Med-LLM 未来的可能发展轨迹，确定 Med-LLM 审慎扩展的途径。通过整合上述见解，本评论旨在全面调查 Med-LLM 对专业人士和研究人员的潜在优势和局限性，确保医疗保健环境中负责任的格局。

Title: LLMEmbed: Rethinking Lightweight LLM's Genuine Function in Text Classification

Authors: Chun Liu, Hongguang Zhang, Kainan Zhao, Xinghai Ju, Lin Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03725
Pdf URL: https://arxiv.org/pdf/2406.03725
Copy Paste: [[2406.03725]] LLMEmbed: Rethinking Lightweight LLM's Genuine Function in Text Classification(https://arxiv.org/abs/2406.03725)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: With the boom of Large Language Models (LLMs), prompt-learning has become a promising method studied across various research areas. Recently, many attempts based on prompt-learning have been made to improve the performance of text classification. However, most of these methods are based on heuristic Chain-of-Thought (CoT), and tend to be more complex but less efficient. In this paper, we rethink the LLM-based text classification methodology and propose a simple and effective transfer learning strategy, namely LLMEmbed, to address this classical but challenging task. To illustrate, we first study how to properly extract and fuse the text embeddings via various lightweight LLMs at different network depths to improve their robustness and discrimination, then adapt such embeddings to train the classifier. We perform extensive experiments on publicly available datasets, and the results show that LLMEmbed achieves strong performance while enjoying low training overhead using lightweight LLM backbones compared to recent methods based on larger LLMs, i.e., GPT-3, and sophisticated prompt-based strategies. Our LLMEmbed achieves adequate accuracy on publicly available benchmarks without any fine-tuning while using merely 4% of the model parameters, 1.8% of the electricity consumption and 1.5% of the runtime compared to its counterparts. Code is available at: this https URL.
摘要：随着大型语言模型 (LLM) 的蓬勃发展，提示学习已成为主要在各个研究领域进行研究的一种有前途的方法。最近，已经进行了许多基于提示学习的尝试来提高文本分类的性能。然而，这些方法大多基于启发式的思路链 (CoT)，往往更复杂但效率更低。在本文中，我们重新思考了基于 LLM 的文本分类方法，提出了一种简单有效的迁移学习策略，即 LLMEmbed，以解决这一经典但具有挑战性的任务。为了说明这一点，我们首先研究如何通过不同网络深度的各种轻量级 LLM 正确提取和融合文本嵌入，以提高其鲁棒性和区分度，然后调整这些嵌入来训练分类器。我们在公开可用的数据集上进行了广泛的实验，结果表明，与最近基于更大的 LLM（即 GPT-3）和复杂的基于提示的策略的方法相比，LLMEmbed 使用轻量级 LLM 主干实现了强大的性能，同时具有较低的训练开销。我们的 LLMEmbed 在公开基准上无需任何微调即可实现足够的准确度，同时与同类产品相比仅使用 4% 的模型参数、1.8% 的电力消耗和 1.5% 的运行时间。代码可从此 https URL 获取。

Title: Efficient Knowledge Infusion via KG-LLM Alignment

Authors: Zhouyu Jiang, Ling Zhong, Mengshu Sun, Jun Xu, Rui Sun, Hui Cai, Shuhan Luo, Zhiqiang Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03746
Pdf URL: https://arxiv.org/pdf/2406.03746
Copy Paste: [[2406.03746]] Efficient Knowledge Infusion via KG-LLM Alignment(https://arxiv.org/abs/2406.03746)
Keywords: language model, llm
Abstract: To tackle the problem of domain-specific knowledge scarcity within large language models (LLMs), the knowledge graph retrieval-augmented method has been proven to be an effective and efficient technique for knowledge infusion. However, existing approaches face two primary challenges: knowledge mismatch between publicly available knowledge graphs and the specific domain of the task at hand, and poor information compliance of LLMs with knowledge graphs. In this paper, we leverage a small set of labeled samples and a large-scale corpus to efficiently construct domain-specific knowledge graphs by an LLM, addressing the issue of knowledge mismatch. Additionally, we propose a three-stage KG-LLM alignment strategy to enhance the LLM's capability to utilize information from knowledge graphs. We conduct experiments with a limited-sample setting on two biomedical question-answering datasets, and the results demonstrate that our approach outperforms existing baselines.
摘要：为了解决大型语言模型 (LLM) 中领域特定知识稀缺的问题，知识图谱检索增强方法已被证明是一种有效且高效的知识注入技术。然而，现有方法面临两个主要挑战：公共可用知识图谱与当前任务的特定领域之间的知识不匹配，以及 LLM 与知识图谱的信息合规性较差。在本文中，我们利用一小组标记样本和大规模语料库通过 LLM 高效构建领域特定知识图谱，解决知识不匹配的问题。此外，我们提出了一种三阶段的 KG-LLM 对齐策略来增强 LLM 利用知识图谱中信息的能力。我们在两个生物医学问答数据集上使用有限样本设置进行了实验，结果表明我们的方法优于现有基线。

Title: NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human

Authors: Shuo Huang, William MacLean, Xiaoxi Kang, Anqi Wu, Lizhen Qu, Qiongkai Xu, Zhuang Li, Xingliang Yuan, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Increasing concerns about privacy leakage issues in academia and industry arise when employing NLP models from third-party providers to process sensitive texts. To protect privacy before sending sensitive data to those models, we suggest sanitizing sensitive text using two common strategies used by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To explore the issues and develop a tool for text rewriting, we curate the first corpus, coined NAP^2, through both crowdsourcing and the use of large language models (LLMs). Compared to the prior works based on differential privacy, which lead to a sharp drop in information utility and unnatural texts, the human-inspired approaches result in more natural rewrites and offer an improved balance between privacy protection and data utility, as demonstrated by our extensive experiments.
摘要：在使用第三方提供商的 NLP 模型处理敏感文本时，学术界和业界对隐私泄露问题的担忧日益增加。为了在将敏感数据发送给这些模型之前保护隐私，我们建议使用人类常用的两种策略对敏感文本进行清理：i) 删除敏感表达，ii) 通过抽象来模糊敏感细节。为了探索这些问题并开发文本重写的工具，我们通过众包和使用大型语言模型 (LLM) 整理了第一个语料库 NAP^2。与之前基于差异隐私的研究相比，这些研究导致信息效用和不自然文本的急剧下降，而受人类启发的方法可以实现更自然的重写，并在隐私保护和数据效用之间取得更好的平衡，这一点已通过我们的大量实验得到证明。

Title: XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags

Authors: Faisal Tareque Shohan, Mir Tafseer Nayeem, Samsul Islam, Abu Ubaida Akash, Shafiq Joty
Subjects: cs.CL, cs.AI, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2406.03776
Pdf URL: https://arxiv.org/pdf/2406.03776
Copy Paste: [[2406.03776]] XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags(https://arxiv.org/abs/2406.03776)
Keywords: language model
Abstract: Millions of news articles published online daily can overwhelm readers. Headlines and entity (topic) tags are essential for guiding readers to decide if the content is worth their time. While headline generation has been extensively studied, tag generation remains largely unexplored, yet it offers readers better access to topics of interest. The need for conciseness in capturing readers' attention necessitates improved content selection strategies for identifying salient and relevant segments within lengthy articles, thereby guiding language models effectively. To address this, we propose to leverage auxiliary information such as images and captions embedded in the articles to retrieve relevant sentences and utilize instruction tuning with variations to generate both headlines and tags for news articles in a multilingual context. To make use of the auxiliary information, we have compiled a dataset named XL-HeadTags, which includes 20 languages across 6 diverse language families. Through extensive evaluation, we demonstrate the effectiveness of our plug-and-play multimodal-multilingual retrievers for both tasks. Additionally, we have developed a suite of tools for processing and evaluating multilingual texts, significantly contributing to the research community by enabling more accurate and efficient analysis across languages.
摘要：每天在线发布的数百万篇新闻文章可能会让读者应接不暇。标题和实体（主题）标签对于引导读者决定内容是否值得花时间至关重要。虽然标题生成已经得到了广泛的研究，但标签生成仍然在很大程度上尚未得到探索，但它为读者提供了更好地了解感兴趣的主题的途径。简洁性吸引读者注意力的需求需要改进内容选择策略，以便在冗长的文章中识别突出且相关的部分，从而有效地指导语言模型。为了解决这个问题，我们建议利用文章中嵌入的图像和标题等辅助信息来检索相关句子，并利用带有变化的指令调整来生成多语言环境中新闻文章的标题和标签。为了利用辅助信息，我们编制了一个名为 XL-HeadTags 的数据集，其中包括 6 个不同语系的 20 种语言。通过广泛的评估，我们证明了我们的即插即用多模式多语言检索器对这两项任务的有效性。此外，我们还开发了一套用于处理和评估多语言文本的工具，通过实现更准确、更高效的跨语言分析，为研究界做出了重大贡献。

Title: End-to-End Trainable Soft Retriever for Low-resource Relation Extraction

Authors: Kohei Makino, Makoto Miwa, Yutaka Sasaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03790
Pdf URL: https://arxiv.org/pdf/2406.03790
Copy Paste: [[2406.03790]] End-to-End Trainable Soft Retriever for Low-resource Relation Extraction(https://arxiv.org/abs/2406.03790)
Keywords: prompt
Abstract: This study addresses a crucial challenge in instance-based relation extraction using text generation models: end-to-end training on the target relation extraction task is not applicable to retrievers due to the non-differentiable nature of instance selection. We propose a novel End-to-end TRAinable Soft K-nearest neighbor retriever (ETRASK) based on the neural prompting method, which utilizes a soft, differentiable selection of the $k$ nearest instances. This approach enables the end-to-end training of retrievers in target tasks. On the TACRED benchmark dataset with a low-resource setting where the training data was reduced to 10%, our method achieved a state-of-the-art F1 score of 71.5%. Moreover, ETRASK consistently improved the baseline model by adding instances for all settings. These results highlight the efficacy of our approach in enhancing relation extraction performance, especially in resource-constrained environments. Our findings offer a promising direction for future research with extraction and the broader application of text generation in natural language processing.
摘要：本研究解决了使用文本生成模型进行基于实例的关系提取的一个关键挑战：由于实例选择的不可微性，目标关系提取任务中的端到端训练不适用于检索器。我们提出了一种新颖的端到端可训练软 K 最近邻检索器 (ETRASK)，通过神经提示方法利用软、可微分选择 $k$ 个最近实例。这种方法可以对目标任务中的检索器进行端到端训练。在 TACRED 基准数据集上，在低资源设置下，训练数据减少到 10\%，我们的方法获得了 71.5\% 的最佳 F1 分数。此外，ETRASK 通过为所有设置添加实例不断改进基线模型。这些结果凸显了我们的方法在增强关系提取性能方面的有效性，尤其是在资源受限的环境中。我们的研究结果为未来的提取研究和文本生成在自然语言处理中的更广泛应用提供了一个有希望的方向。
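The abstract above turns on replacing a hard, non-differentiable choice of the $k$ nearest instances with a soft one. The sketch below shows one generic way such a relaxation can be written: repeated softmax rounds that each down-weight what was already picked. It is not the ETRASK implementation, and the function names, the squared-distance score, and the `temperature` parameter are assumptions for illustration. It uses plain Python, so no gradients are actually computed here. The point is only that every step is a smooth function of the distances.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_k_nearest(query, instances, k=2, temperature=0.1):
    """Return k weight vectors; the j-th one softly selects the j-th nearest instance."""
    # Score each instance by negative squared Euclidean distance to the query.
    scores = [-sum((q - x) ** 2 for q, x in zip(query, inst)) for inst in instances]
    selections = []
    for _ in range(k):
        weights = softmax([s / temperature for s in scores])
        selections.append(weights)
        # Penalise instances in proportion to how strongly they were just selected,
        # so the next round favours the next-nearest instance.
        scores = [s + math.log(max(1.0 - w, 1e-12)) for s, w in zip(scores, weights)]
    return selections
```

With a small `temperature` each weight vector is close to one-hot, approximating hard selection. Larger values spread the weight over several instances.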

Title: Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning

Authors: Naibin Gu, Peng Fu, Xiyu Liu, Bowen Shen, Zheng Lin, Weiping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03792
Pdf URL: https://arxiv.org/pdf/2406.03792
Copy Paste: [[2406.03792]] Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning(https://arxiv.org/abs/2406.03792)
Keywords: language model
Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as the predominant technique for fine-tuning in the era of large language models. However, existing PEFT methods still have inadequate training efficiency. Firstly, the utilization of large-scale foundation models during the training process is excessively redundant for certain fine-tuning tasks. Secondly, as the model size increases, the growth in trainable parameters of empirically added PEFT modules becomes non-negligible and redundant, leading to inefficiency. To achieve task-specific efficient fine-tuning, we propose the Light-PEFT framework, which includes two methods: Masked Early Pruning of the Foundation Model and Multi-Granularity Early Pruning of PEFT. The Light-PEFT framework allows for the simultaneous estimation of redundant parameters in both the foundation model and PEFT modules during the early stage of training. These parameters can then be pruned for more efficient fine-tuning. We validate our approach on GLUE, SuperGLUE, QA tasks, and various models. With Light-PEFT, parameters of the foundation model can be pruned by over 40%, while still controlling trainable parameters to be only 25% of the original PEFT method. Compared to utilizing the PEFT method directly, Light-PEFT achieves training and inference speedup, reduces memory usage, and maintains comparable performance and the plug-and-play feature of PEFT.
摘要：参数高效微调 (PEFT) 已成为大型语言模型时代微调的主要技术。然而，现有的 PEFT 方法仍然训练效率不足。首先，在训练过程中使用大规模基础模型对于某些微调任务而言过于冗余。其次，随着模型规模的增加，经验添加的 PEFT 模块的可训练参数的增长变得不可忽略且冗余，从而导致效率低下。为了实现特定任务的高效微调，我们提出了 Light-PEFT 框架，其中包括两种方法：基础模型的屏蔽早期修剪和 PEFT 的多粒度早期修剪。Light-PEFT 框架允许在训练的早期阶段同时估计基础模型和 PEFT 模块中的冗余参数。然后可以修剪这些参数以实现更高效的微调。我们在 GLUE、SuperGLUE、QA 任务和各种模型上验证了我们的方法。使用Light-PEFT，基础模型的参数可以减少40%以上，同时可训练参数仍然控制在原PEFT方法的25%左右。与直接使用PEFT方法相比，Light-PEFT实现了训练和推理的加速，减少了内存使用，同时保持了PEFT的性能和即插即用的特性。

Title: ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search

Authors: Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, Jie Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search(https://arxiv.org/abs/)
Keywords: language model, llm, tree-of-thought
Abstract: Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations, and outperform other self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM.
摘要：LLM 自训练的最新方法主要依赖于 LLM 生成响应并过滤那些具有正确输出答案的响应作为训练数据。这种方法通常会产生低质量的微调训练集（例如，错误的计划或中间推理）。在本文中，我们开发了一种强化自训练方法，称为 ReST-MCTS*，该方法基于将过程奖励指导与树搜索 MCTS* 相结合，以收集更高质量的推理轨迹以及每步值来训练策略和奖励模型。ReST-MCTS* 绕过了通常用于通过基于树搜索的强化学习训练过程奖励的每步手动注释：给定 oracle 最终正确答案，ReST-MCTS* 能够通过估计此步骤有助于得出正确答案的概率来推断正确的过程奖励。这些推断出的奖励有双重目的：它们充当进一步完善过程奖励模型的价值目标，也有助于为策略模型自训练选择高质量的轨迹。我们首先展示了 ReST-MCTS* 中的树搜索策略与先前的 LLM 推理基线（例如 Best-of-N 和 Tree-of-Thought）相比，在相同的搜索预算内实现了更高的准确率。然后我们展示了通过使用此树搜索策略搜索到的轨迹作为训练数据，我们可以不断增强这三种语言模型进行多次迭代，并且优于其他自训练算法（例如 ReST$^\text{EM}$ 和 Self-Rewarding LM）。
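
The abstract describes inferring a process reward for a reasoning step by estimating the probability that the step leads to the correct final answer. A minimal Monte Carlo sketch of that idea follows. It is not the authors' implementation: `estimate_step_value` and `toy_rollout` are hypothetical names, and the toy rollout stands in for sampling completions from a policy model.

```python
import random


def estimate_step_value(rollout, partial_trace, correct_answer, n_rollouts=100, seed=0):
    """Fraction of sampled completions of partial_trace that reach correct_answer."""
    rng = random.Random(seed)
    hits = sum(
        rollout(partial_trace, rng) == correct_answer for _ in range(n_rollouts)
    )
    return hits / n_rollouts


def toy_rollout(partial_trace, rng):
    # Hypothetical stand-in for a policy model: a trace ending in the right
    # intermediate step always finishes correctly, any other trace never does.
    if partial_trace and partial_trace[-1] == "x = 4":
        return "8"
    return rng.choice(["6", "7"])


good_value = estimate_step_value(toy_rollout, ["x = 4"], correct_answer="8")
bad_value = estimate_step_value(toy_rollout, ["x = 3"], correct_answer="8")
```

In this toy setting the helpful step receives a higher estimated value than the unhelpful one. Per the abstract, inferred values of this kind serve both as targets for the process reward model and as a filter for selecting training traces.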

Title: Chaos with Keywords: Exposing Large Language Models Sycophancy to Misleading Keywords and Evaluating Defense Strategies

Authors: Aswin RRV, Nemika Tyagi, Md Nayem Uddin, Neeraj Varshney, Chitta Baral
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03827
Pdf URL: https://arxiv.org/pdf/2406.03827
Copy Paste: [[2406.03827]] Chaos with Keywords: Exposing Large Language Models Sycophancy to Misleading Keywords and Evaluating Defense Strategies(https://arxiv.org/abs/2406.03827)
Keywords: language model, llm, hallucination
Abstract: This study explores the sycophantic tendencies of Large Language Models (LLMs), where these models tend to provide answers that match what users want to hear, even if they are not entirely correct. The motivation behind this exploration stems from the common behavior observed in individuals searching the internet for facts with partial or misleading knowledge. Similar to using web search engines, users may recall fragments of misleading keywords and submit them to an LLM, hoping for a comprehensive response. Our empirical analysis of several LLMs shows the potential danger of these models amplifying misinformation when presented with misleading keywords. Additionally, we thoroughly assess four existing hallucination mitigation strategies to reduce LLMs' sycophantic behavior. Our experiments demonstrate the effectiveness of these strategies for generating factually correct statements. Furthermore, our analyses delve into knowledge-probing experiments on factual keywords and different categories of sycophancy mitigation.
摘要：本研究探讨了大型语言模型 (LLM) 的谄媚倾向，这些模型倾向于提供与用户想要听到的内容相匹配的答案，即使这些答案并不完全正确。这项探索背后的动机源于人们在互联网上搜索事实时观察到的常见行为，即人们在掌握部分或误导性知识的情况下搜索事实。与使用网络搜索引擎类似，用户可能会回忆起误导性关键词的片段并将其提交给 LLM，希望得到全面的答复。我们对几门 LLM 的实证分析表明，当呈现误导性关键词时，这些模型可能会放大错误信息。此外，我们彻底评估了四种现有的幻觉缓解策略，以减少 LLM 的谄媚行为。我们的实验证明了这些策略在生成事实正确陈述方面的有效性。此外，我们的分析深入研究了对事实关键词和不同类别的谄媚缓解的知识探索实验。

Title: Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

Authors: Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03847
Pdf URL: https://arxiv.org/pdf/2406.03847
Copy Paste: [[2406.03847]] Lean Workbook: A large-scale Lean problem set formalized from natural language math problems(https://arxiv.org/abs/2406.03847)
Keywords: language model, llm
Abstract: Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements, and vice versa. Our results indicate that the synthetic data pipeline can provide useful training data and improve the performance of LLMs in translating and understanding complex mathematical problems and proofs. Our final dataset contains about 57K formal-informal question pairs along with searched proof from the math contest forum and 21 new IMO questions. We open-source our code at this https URL and our data at this https URL.
摘要：大型语言模型在各种自然语言处理任务中都表现出了令人印象深刻的能力，尤其是在解决数学问题方面。然而，大型语言模型并不擅长使用 Lean 等形式语言进行数学定理证明。该领域的一个重大挑战是这些形式语言中可用的训练数据稀缺。为了解决这个问题，我们提出了一种新颖的管道，该管道迭代生成和过滤合成数据，将自然语言数学问题翻译成 Lean 4 语句，反之亦然。我们的结果表明，合成数据管道可以提供有用的训练数据，并提高 LLM 在翻译和理解复杂数学问题和证明方面的性能。我们的最终数据集包含大约 57K 个正式-非正式问题对以及从数学竞赛论坛搜索到的证明和 21 个新的 IMO 问题。我们在此 https URL 上开源代码，在此 https URL 上开源代码。

Title: Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism

Authors: Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03853
Pdf URL: https://arxiv.org/pdf/2406.03853
Copy Paste: [[2406.03853]] Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism(https://arxiv.org/abs/2406.03853)
Keywords: language model, llm
Abstract: The recent advancements in large language models (LLMs) have been extraordinary, yet the escalating inference costs associated with them present challenges in real-world applications. To address these challenges, we propose a novel approach called Early-exiting Speculative Decoding (EESD) with lossless acceleration. Specifically, EESD utilizes a segment of the LLM to generate draft tokens, incorporating Early-exiting structures after the first N layers. To enhance the quality of draft tokens, a self-distillation method is integrated. This early-exiting design not only reduces deployment and training costs but also significantly accelerates the token generation speed. Moreover, we introduce a novel sampling mechanism that leverages Thompson Sampling to regulate the generation processes, automatically determining the quantity of draft tokens in each round. The original LLM is then employed to validate these draft tokens through a single forward pass, and thus guarantees that the final output text maintains a distribution consistent with vanilla auto-regressive decoding. The experimental results on both 13B and 70B models demonstrate that our approach decodes tokens at a markedly accelerated rate compared to prior methods, showing the effectiveness of our approach.
摘要：大型语言模型 (LLM) 的最新进展令人瞩目，但与之相关的不断上升的推理成本给实际应用带来了挑战。为了应对这些挑战，我们提出了一种名为早期退出推测解码 (EESD) 的无损加速新方法。具体而言，EESD 利用 LLM 的一部分来生成草稿标记，在前 N 层之后合并早期退出结构。为了提高草稿标记的质量，集成了一种自蒸馏方法。这种早期退出设计不仅降低了部署和培训成本，而且显著加快了标记生成速度。此外，我们引入了一种新颖的采样机制，利用汤普森采样来调节生成过程，自动确定每轮草稿标记的数量。然后使用原始 LLM 通过单次前向传递来验证这些草稿标记，从而保证最终输出文本保持与 vanilla 自回归解码一致的分布。在 13B 和 70B 模型上的实验结果表明，与以前的方法相比，我们的方法解码 token 的速度明显加快，证明了我们方法的有效性。
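
The abstract states that Thompson Sampling regulates how many draft tokens are generated in each round. The sketch below illustrates only the general mechanism with a Beta-Bernoulli bandit over candidate draft lengths. It is an assumption-laden illustration, not the EESD algorithm: the class `DraftLengthBandit`, the choice of arms, and the payoff heuristic (draft length times sampled acceptance rate) are all hypothetical.

```python
import random


class DraftLengthBandit:
    """Thompson sampling over candidate draft lengths (hypothetical sketch)."""

    def __init__(self, lengths, seed=0):
        self.lengths = list(lengths)
        # Beta(1, 1) prior per arm: pseudo-counts of accepted and rejected draft tokens.
        self.accepted = {k: 1.0 for k in self.lengths}
        self.rejected = {k: 1.0 for k in self.lengths}
        self.rng = random.Random(seed)

    def choose(self):
        # Sample an acceptance rate from each arm's posterior and pick the
        # length with the largest sampled payoff (heuristic: length * rate).
        def sampled_payoff(k):
            rate = self.rng.betavariate(self.accepted[k], self.rejected[k])
            return k * rate

        return max(self.lengths, key=sampled_payoff)

    def update(self, k, n_accepted):
        # After the full model verifies a draft of length k, record the outcome.
        self.accepted[k] += n_accepted
        self.rejected[k] += k - n_accepted


bandit = DraftLengthBandit(lengths=[2, 8])
for _ in range(50):
    bandit.update(2, n_accepted=2)  # short drafts fully accepted
    bandit.update(8, n_accepted=0)  # long drafts fully rejected
```

After these updates the posterior for the long arm concentrates near zero acceptance, so `bandit.choose()` favors the short draft length. A real system would feed in the verification results of the full model rather than fixed outcomes.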

Title: Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As

Authors: Eden Avnat, Michal Levy, Daniel Herstain, Elia Yanko, Daniel Ben Joya, Michal Tzuchman Katz, Dafna Eshel, Sahar Laros, Yael Dagan, Shahar Barami, Joseph Mermelstein, Shahar Ovadia, Noam Shomron, Varda Shalev, Raja-Elie E. Abdulnour
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03855
Pdf URL: https://arxiv.org/pdf/2406.03855
Copy Paste: [[2406.03855]] Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As(https://arxiv.org/abs/2406.03855)
Keywords: language model, gpt, llm, chat
Abstract: Clinical problem-solving requires processing of semantic medical knowledge such as illness scripts and numerical medical knowledge of diagnostic tests for evidence-based decision-making. Although large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate non-language evidence-based answers to clinical questions is inherently limited by tokenization. Therefore, we evaluated LLMs' performance on two question types: numeric (correlating findings) and semantic (differentiating entities) while examining differences within and between LLMs in medical aspects and comparing their performance to humans. To generate straightforward multi-choice questions and answers (QAs) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (encompassing data from more than 50,00 peer-reviewed articles) and created the "EBMQA". EBMQA contains 105,000 QAs labeled with medical and non-medical topics and classified into numerical or semantic questions. We benchmarked this dataset using more than 24,500 QAs on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus. We evaluated the LLMs' accuracy on semantic and numerical question types and according to sub-labeled topics. For validation, six medical experts were tested on 100 numerical EBMQA questions. We found that both LLMs performed better on semantic than on numerical QAs, with Claude3 surpassing GPT4 in numerical QAs. However, both LLMs showed inter- and intra-model gaps in different medical aspects and remained inferior to humans. Thus, their medical advice should be treated carefully.
摘要：临床问题解决需要处理语义医学知识（例如疾病脚本）和诊断测试的数字医学知识，以便进行基于证据的决策。由于大型语言模型 (LLM) 在基于语言的临床实践的许多方面都显示出良好的效果，它们生成非语言基于证据的临床问题答案的能力本质上受到标记化的限制。因此，我们评估了 LLM 在两种问题类型上的表现：数字（相关发现）和语义（区分实体），同时检查 LLM 在医学方面内部和之间的差异，并将其表现与人类进行比较。为了基于循证医学 (EBM) 生成简单的多项选择题和答案 (QA)，我们使用了一个全面的医学知识图谱（包含来自 50,00 多篇同行评审文章的数据）并创建了“EBMQA”。EBMQA 包含 105,000 个 QA，这些 QA 标有医学和非医学主题，并分为数字或语义问题。我们使用两款最先进的 LLM 上的 24,500 多个问答对该数据集进行了基准测试：Chat-GPT4 和 Claude3-Opus。我们评估了 LLM 在语义和数字问题类型以及子标签主题上的准确性。为了进行验证，六位医学专家接受了 100 个数字 EBMQA 问题的测试。我们发现这两个 LLM 在语义问答方面的表现都比数字问答更出色，Claude3 在数字问答方面的表现超过了 GPT4。然而，这两个 LLM 在不同的医学方面都表现出了内部和内部差距，并且仍然不如人类。因此，他们的医疗建议应该谨慎对待。

Title: BLSP-Emo: Towards Empathetic Large Speech-Language Models

Authors: Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.03872
Pdf URL: https://arxiv.org/pdf/2406.03872
Copy Paste: [[2406.03872]] BLSP-Emo: Towards Empathetic Large Speech-Language Models(https://arxiv.org/abs/2406.03872)
Keywords: language model, gpt
Abstract: The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generating empathetic responses. BLSP-Emo utilizes existing speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The second stage performs emotion alignment with the pretrained speech-language model on an emotion-aware continuation task constructed from SER data. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses, both in instruction-following tasks and conversations.
摘要：GPT-4o 的近期发布展示了端到端多模态模型的潜力，不仅体现在低延迟方面，还体现在理解和生成富有情感的表达性语音的能力方面。虽然开放研究社区尚不清楚细节，但它可能涉及大量精选数据和计算，而这两者都不容易获得。在本文中，我们介绍了 BLSP-Emo（带情感支持的引导式语言语音预训练），这是一种开发端到端语音语言模型的新方法，该模型能够理解语音中的语义和情感并产生富有同理心的响应。BLSP-Emo 通过两阶段过程利用现有的语音识别 (ASR) 和语音情感识别 (SER) 数据集。第一阶段侧重于语义对齐，这是最近使用 ASR 数据对语音语言模型进行预训练的研究。第二阶段在由 SER 数据构建的情感感知延续任务上使用预训练的语音语言模型执行情感对齐。我们的实验表明，BLSP-Emo 模型在理解语音和提供富有同理心的反应方面表现出色，无论是在指令遵循任务还是对话中。

Title: Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models

Authors: Ziyun Cui, Chang Lei, Wen Wu, Yinan Duan, Diyang Qu, Ji Wu, Runsen Chen, Chao Zhang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.03882
Pdf URL: https://arxiv.org/pdf/2406.03882
Copy Paste: [[2406.03882]] Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models(https://arxiv.org/abs/2406.03882)
Keywords: language model, llm
Abstract: The early detection of suicide risk is important since it enables the intervention to prevent potential suicide attempts. This paper studies the automatic detection of suicide risk based on spontaneous speech from adolescents, and collects a Mandarin dataset with 15 hours of suicide speech from more than a thousand adolescents aged from ten to eighteen for our experiments. To leverage the diverse acoustic and linguistic features embedded in spontaneous speech, both the Whisper speech model and textual large language models (LLMs) are used for suicide risk detection. Both all-parameter finetuning and parameter-efficient finetuning approaches are used to adapt the pre-trained models for suicide risk detection, and multiple audio-text fusion approaches are evaluated to combine the representations of Whisper and the LLM. The proposed system achieves a detection accuracy of 0.807 and an F1-score of 0.846 on the test set with 119 subjects, indicating promising potential for real suicide risk detection applications.
摘要：早期发现自杀风险非常重要，因为它可以帮助我们进行干预，防止潜在的自杀企图。本文研究了基于青少年自发语音的自杀风险自动检测，并收集了包含 15 小时自杀语音的普通话数据集，这些自杀语音来自一千多名年龄在 10 到 18 岁之间的青少年，用于我们的实验。为了利用自发语音中嵌入的各种声学和语言特征，Whisper 语音模型和文本大型语言模型 (LLM) 都用于自杀风险检测。全参数微调和参数高效微调方法都用于调整预训练模型以进行自杀风险检测，并评估了多种音频文本融合方法，以结合 Whisper 和 LLM 的表示。所提出的系统在 119 名受试者的测试集上实现了 0.807 的检测准确率和 0.846 的 F1 分数，表明在实际自杀风险检测应用中具有广阔的潜力。

Title: HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew

Authors: Tzuf Paz-Argaman, Itai Mondshine, Asaf Achi Mordechai, Reut Tsarfaty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03897
Pdf URL: https://arxiv.org/pdf/2406.03897
Copy Paste: [[2406.03897]] HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew(https://arxiv.org/abs/2406.03897)
Keywords: language model, llm
Abstract: While large language models (LLMs) excel in various natural language tasks in English, their performance in lower-resourced languages like Hebrew, especially for generative tasks such as abstractive summarization, remains unclear. The high morphological richness in Hebrew adds further challenges due to the ambiguity in sentence comprehension and the complexities in meaning construction. In this paper, we address this resource and evaluation gap by introducing HeSum, a novel benchmark specifically designed for abstractive text summarization in Modern Hebrew. HeSum consists of 10,000 article-summary pairs sourced from Hebrew news websites written by professionals. Linguistic analysis confirms HeSum's high abstractness and unique morphological challenges. We show that HeSum presents distinct difficulties for contemporary state-of-the-art LLMs, establishing it as a valuable testbed for generative language technology in Hebrew, and for the generative challenges of morphologically rich languages (MRLs) in general.
摘要：虽然大型语言模型 (LLM) 在英语的各种自然语言任务中表现出色，但它们在资源较少的语言（如希伯来语）中的表现，尤其是在抽象摘要等生成任务中的表现仍不清楚。希伯来语的形态丰富度很高，这给句子理解带来了歧义，也给语义构建带来了复杂性，也带来了进一步的挑战。在本文中，我们通过引入 HeSum 来解决这一资源和评估差距，HeSum 是一种专为现代希伯来语抽象文本摘要而设计的新基准。HeSum 由 10,000 篇来自希伯来语新闻网站的文章摘要对组成，这些文章摘要由专业人士撰写。语言分析证实了 HeSum 的高抽象性和独特的形态挑战。我们表明，HeSum 为当代最先进的 LLM 带来了独特的困难，从而使它成为希伯来语生成语言技术和一般 MRL 生成挑战的宝贵试验台。

Title: UltraMedical: Building Specialized Generalists in Biomedicine

Authors: Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, Xingtai Lv, Hu Jinfang, Zhiyuan Liu, Bowen Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03949
Pdf URL: https://arxiv.org/pdf/2406.03949
Copy Paste: [[2406.03949]] UltraMedical: Building Specialized Generalists in Biomedicine(https://arxiv.org/abs/2406.03949)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains and are moving towards more specialized areas. Recent advanced proprietary models such as GPT-4 and Gemini have achieved significant advancements in biomedicine, which have also raised privacy and security challenges. The construction of specialized generalists hinges largely on high-quality datasets, enhanced by techniques like supervised fine-tuning and reinforcement learning from human or AI feedback, and direct preference optimization. However, these leading technologies (e.g., preference learning) are still significantly limited in the open source community due to the scarcity of specialized data. In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple advanced LLMs. By utilizing these datasets, we fine-tune a suite of specialized medical models based on Llama-3 series, demonstrating breathtaking capabilities across various medical benchmarks. Moreover, we develop powerful reward models skilled in biomedical and general reward benchmark, enhancing further online preference learning within the biomedical LLM community.
摘要：大型语言模型 (LLM) 已在各个领域展现出卓越的能力，并正在向更专业的领域迈进。最近的先进专有模型（如 GPT-4 和 Gemini）在生物医学领域取得了重大进展，这也带来了隐私和安全挑战。专业通才的构建在很大程度上取决于高质量的数据集，并通过监督微调和从人类或人工智能反馈中进行强化学习以及直接偏好优化等技术得到增强。然而，由于专业数据的稀缺，这些领先技术（例如偏好学习）在开源社区中仍然受到很大限制。在本文中，我们介绍了 UltraMedical 集合，它由生物医学领域的高质量手动和合成数据集组成，具有跨多个高级 LLM 的偏好注释。通过利用这些数据集，我们对基于 Llama-3 系列的一套专业医学模型进行了微调，展示了在各种医学基准上的惊人能力。此外，我们开发了强大的奖励模型，这些模型在生物医学和一般奖励基准方面都很熟练，从而增强了生物医学 LLM 社区的进一步在线偏好学习。

Title: Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech

Authors: Neemesh Yadav, Sarah Masud, Vikram Goyal, Md Shad Akhtar, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03953
Pdf URL: https://arxiv.org/pdf/2406.03953
Copy Paste: [[2406.03953]] Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech(https://arxiv.org/abs/2406.03953)
Keywords: language model, gpt
Abstract: Employing language models to generate explanations for an incoming implicit hate post is an active area of research. The explanation is intended to make explicit the underlying stereotype and aid content moderators. The training often combines top-k relevant knowledge graph (KG) tuples to provide world knowledge and improve performance on standard metrics. Interestingly, our study presents conflicting evidence for the role of the quality of KG tuples in generating implicit explanations. Consequently, simpler models incorporating external toxicity signals outperform KG-infused models. Compared to the KG-based setup, we observe a comparable performance for SBIC (LatentHatred) datasets with a performance variation of +0.44 (+0.49), +1.83 (-1.56), and -4.59 (+0.77) in BLEU, ROUGE-L, and BERTScore. Further human evaluation and error analysis reveal that our proposed setup produces more precise explanations than zero-shot GPT-3.5, highlighting the intricate nature of the task.
摘要：使用语言模型来为传入的隐性仇恨帖子生成解释是一个活跃的研究领域。解释旨在明确潜在的刻板印象并帮助内容版主。训练通常结合前 k 个相关知识图谱 (KG) 元组来提供世界知识并提高标准指标的性能。有趣的是，我们的研究为 KG 元组的质量在生成隐性解释中的作用提供了相互矛盾的证据。因此，结合外部毒性信号的更简单的模型优于注入 KG 的模型。与基于 KG 的设置相比，我们观察到 SBIC（LatentHatred）数据集的性能相当，在 BLEU、ROUGE-L 和 BERTScore 中的性能变化为 +0.44（+0.49）、+1.83（-1.56）和 -4.59（+0.77）。进一步的人工评估和错误分析表明，我们提出的设置比零样本 GPT-3.5 产生了更精确的解释，凸显了任务的复杂性。

Title: A + B: A General Generator-Reader Framework for Optimizing LLMs to Unleash Synergy Potential

Authors: Wei Tang, Yixin Cao, Jiahao Ying, Bo Wang, Yuyue Zhao, Yong Liao, Pengyuan Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A + B: A General Generator-Reader Framework for Optimizing LLMs to Unleash Synergy Potential(https://arxiv.org/abs/)
Keywords: language model, llm, chat, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is an effective solution to supplement necessary knowledge to large language models (LLMs). Targeting its bottleneck of retriever performance, the "generate-then-read" pipeline has been proposed to replace the retrieval stage with generation from the LLM itself. Although promising, this research direction is underexplored and still cannot work in the scenario when source knowledge is given. In this paper, we formalize a general "A + B" framework with varying combinations of foundation models and types for systematic investigation. We explore the efficacy of the base and chat versions of LLMs and find their different functionalities suitable for generator A and reader B, respectively. Their combinations consistently outperform single models, especially in complex scenarios. Furthermore, we extend the application of the "A + B" framework to scenarios involving source documents through continuous learning, enabling the direct integration of external knowledge into LLMs. This approach not only facilitates effective acquisition of new knowledge but also addresses the challenges of safety and helpfulness post-adaptation. The paper underscores the versatility of the "A + B" framework, demonstrating its potential to enhance the practical application of LLMs across various domains.
摘要：检索增强生成 (RAG) 是向大型语言模型 (LLM) 补充必要知识的有效解决方案。针对其检索器性能瓶颈，提出了“生成然后读取”流程，用 LLM 本身的生成来取代检索阶段。虽然很有前景，但这一研究方向尚未得到充分探索，在给定源知识的场景中仍然无法发挥作用。在本文中，我们形式化了一个通用的“A + B”框架，其中包含各种基础模型和类型的组合，以供系统研究。我们探索了 LLM 的基础版本和聊天版本的功效，发现它们的不同功能分别适用于生成器 A 和读取器 B。它们的组合始终优于单个模型，尤其是在复杂场景中。此外，我们通过持续学习将“A + B”框架的应用扩展到涉及源文档的场景，从而能够将外部知识直接集成到 LLM 中。这种方法不仅有助于有效获取新知识，而且还解决了适应后的安全性和有用性的挑战。该论文强调了“A + B”框架的多功能性，展示了其在增强法学硕士在各个领域的实际应用方面的潜力。

Title: On The Persona-based Summarization of Domain-Specific Documents

Authors: Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Pawan Goyal, Niloy Ganguly, Prasenjit Dey, Ravi Kokku
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2406.03986
Pdf URL: https://arxiv.org/pdf/2406.03986
Copy Paste: [[2406.03986]] On The Persona-based Summarization of Domain-Specific Documents(https://arxiv.org/abs/2406.03986)
Keywords: language model, llm
Abstract: In an ever-expanding world of domain-specific knowledge, the increasing complexity of consuming and storing information necessitates the generation of summaries from large information repositories. However, every persona of a domain has different information requirements and hence needs a different summarization. For example, in the healthcare domain, a persona-based (such as Doctor, Nurse, Patient etc.) approach is imperative to deliver targeted medical information efficiently. Persona-based summarization of domain-specific information by humans is a high cognitive load task and is generally not preferred. The summaries generated by two different humans have high variability and do not scale in cost and subject matter expertise as domains and personas grow. Further, AI-generated summaries using generic Large Language Models (LLMs) may not necessarily offer satisfactory accuracy for different domains unless they have been specifically trained on domain-specific data and can also be very expensive to use in day-to-day operations. Our contribution in this paper is two-fold: 1) We present an approach to efficiently fine-tune a domain-specific small foundation LLM using a healthcare corpus and also show that we can effectively evaluate the summarization quality using AI-based critiquing. 2) We further show that AI-based critiquing has good concordance with Human-based critiquing of the summaries. Hence, such AI-based pipelines to generate domain-specific persona-based summaries can be easily scaled to other domains such as legal, enterprise documents, education etc. in a very efficient and cost-effective manner.
摘要：在领域特定知识不断扩展的世界中，信息消费和存储的复杂性不断增加，需要从大型信息存储库生成摘要。但是，每个领域的每个角色对信息的要求都不同，因此需要不同的摘要。例如，在医疗保健领域，基于角色（如医生、护士、患者等）的方法对于有效传递有针对性的医疗信息至关重要。人类基于角色对领域特定信息进行摘要是一项高认知负荷任务，通常不是首选。两个不同的人生成的摘要具有很大的可变性，并且随着领域和角色的增长，成本和主题专业知识不会扩大。此外，使用通用大型语言模型 (LLM) 的人工智能生成的摘要不一定能为不同领域提供令人满意的准确性，除非它们经过了领域特定数据的专门训练，而且在日常操作中使用成本也非常高。我们在本文中的贡献有两方面：1) 我们提出了一种使用医疗保健语料库有效微调特定领域小型基础 LLM 的方法，并表明我们可以使用基于 AI 的评审有效地评估摘要质量。2) 我们进一步表明，基于 AI 的评审与基于人类的摘要评审具有良好的一致性。因此，这种基于 AI 的流程可以以非常高效且经济的方式轻松扩展到其他领域，例如法律、企业文档、教育等。

Title: Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing

Authors: Hadi Askari, Anshuman Chhabra, Muhao Chen, Prasant Mohapatra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03993
Pdf URL: https://arxiv.org/pdf/2406.03993
Copy Paste: [[2406.03993]] Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing(https://arxiv.org/abs/2406.03993)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have achieved state-of-the-art performance at zero-shot generation of abstractive summaries for given articles. However, little is known about the robustness of such a process of zero-shot summarization. To bridge this gap, we propose relevance paraphrasing, a simple strategy that can be used to measure the robustness of LLMs as summarizers. The relevance paraphrasing approach identifies the most relevant sentences that contribute to generating an ideal summary, and then paraphrases these inputs to obtain a minimally perturbed dataset. Then, by evaluating model performance for summarization on both the original and perturbed datasets, we can assess one aspect of the LLM's robustness. We conduct extensive experiments with relevance paraphrasing on 4 diverse datasets, as well as 4 LLMs of different sizes (GPT-3.5-Turbo, Llama-2-13B, Mistral-7B, and Dolly-v2-7B). Our results indicate that LLMs are not consistent summarizers for the minimally perturbed articles, necessitating further improvements.
摘要：大型语言模型 (LLM) 在零样本生成给定文章的抽象摘要方面取得了最先进的性能。然而，人们对这种零样本摘要过程的稳健性知之甚少。为了弥补这一差距，我们提出了相关性释义，这是一种简单的策略，可用于衡量 LLM 作为摘要器的稳健性。相关性释义方法确定有助于生成理想摘要的最相关句子，然后释义这些输入以获得最小扰动的数据集。然后，通过评估模型在原始数据集和扰动数据集上的摘要性能，我们可以评估 LLM 的稳健性的一个方面。我们对 4 个不同的数据集以及 4 个不同大小的 LLM（GPT-3.5-Turbo、Llama-2-13B、Mistral-7B 和 Dolly-v2-7B）进行了相关性释义的大量实验。我们的结果表明，LLM 并不是受最小扰动文章的一致摘要器，需要进一步改进。
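
The two steps the abstract describes (find the article sentences most relevant to an ideal summary, then paraphrase only those) can be sketched as below. This is an illustration under stated assumptions, not the paper's procedure: relevance is approximated here by Jaccard token overlap with a reference summary, the function names are hypothetical, and `paraphrase` is any caller-supplied callable standing in for a real paraphrasing model.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def most_relevant_sentences(article_sentences, reference_summary, top_k=1):
    """Indices of the top_k sentences by Jaccard overlap with the summary."""
    summary_tokens = tokenize(reference_summary)

    def overlap(i):
        tokens = tokenize(article_sentences[i])
        union = tokens | summary_tokens
        return len(tokens & summary_tokens) / len(union) if union else 0.0

    ranked = sorted(range(len(article_sentences)), key=overlap, reverse=True)
    return sorted(ranked[:top_k])


def perturb(article_sentences, reference_summary, paraphrase, top_k=1):
    """Paraphrase only the most relevant sentences; leave the rest unchanged."""
    chosen = set(most_relevant_sentences(article_sentences, reference_summary, top_k))
    return [paraphrase(s) if i in chosen else s for i, s in enumerate(article_sentences)]


article = [
    "The cat sat on the mat.",
    "Stocks fell sharply on Monday.",
    "Weather was mild.",
]
# str.upper is a placeholder "paraphraser" that makes the edited sentence easy to spot.
perturbed = perturb(article, "Stocks fell on Monday.", paraphrase=str.upper)
```

Summarizing both `article` and `perturbed` with the same model and comparing the resulting scores gives the kind of original-versus-perturbed comparison the abstract refers to.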

Title: Ask LLMs Directly, "What shapes your bias?": Measuring Social Bias in Large Language Models

Authors: Jisu Shin, Hoyun Song, Huije Lee, Soyeong Jeong, Jong C. Park
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Ask LLMs Directly, "What shapes your bias?": Measuring Social Bias in Large Language Models(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Social bias is shaped by the accumulation of social perceptions towards targets across various demographic identities. To fully understand such social bias in large language models (LLMs), it is essential to consider the composite of social perceptions from diverse perspectives among identities. Previous studies have either evaluated biases in LLMs by indirectly assessing the presence of sentiments towards demographic identities in the generated text or measuring the degree of alignment with given stereotypes. These methods have limitations in directly quantifying social biases at the level of distinct perspectives among identities. In this paper, we aim to investigate how social perceptions from various viewpoints contribute to the development of social bias in LLMs. To this end, we propose a novel strategy to intuitively quantify these social perceptions and suggest metrics that can evaluate the social biases within LLMs by aggregating diverse social perceptions. The experimental results show the quantitative demonstration of the social attitude in LLMs by examining social perception. The analysis we conducted shows that our proposed metrics capture the multi-dimensional aspects of social bias, enabling a fine-grained and comprehensive investigation of bias in LLMs.
摘要：社会偏见是由对不同人口身份目标的社会认知的积累形成的。为了充分理解大型语言模型 (LLM) 中的这种社会偏见，必须考虑来自不同身份的不同视角的社会认知的综合。以前的研究要么通过间接评估生成的文本中对人口身份的情绪的存在，要么测量与给定刻板印象的一致程度来评估 LLM 中的偏见。这些方法在直接量化不同身份不同视角的社会偏见方面存在局限性。在本文中，我们旨在研究来自不同观点的社会认知如何导致 LLM 中社会偏见的发展。为此，我们提出了一种新颖的策略来直观地量化这些社会认知，并提出了可以通过汇总不同的社会认知来评估 LLM 中社会偏见的指标。实验结果通过检查社会认知显示了 LLM 中的社会态度的定量展示。我们进行的分析表明，我们提出的指标捕捉到了社会偏见的多维方面，从而能够对 LLM 中的偏见进行细致而全面的调查。

Title: Intention and Face in Dialog

Authors: Adil Soubki, Owen Rambow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04109
Pdf URL: https://arxiv.org/pdf/2406.04109
Copy Paste: [[2406.04109]] Intention and Face in Dialog(https://arxiv.org/abs/2406.04109)
Keywords: agent
Abstract: The notion of face described by Brown and Levinson (1987) has been studied in great detail, but a critical aspect of the framework, that which focuses on how intentions mediate the planning of turns which impose upon face, has received far less attention. We present an analysis of three computational systems trained for classifying both intention and politeness, focusing on how the former influences the latter. In politeness theory, agents attend to the desire to have their wants appreciated (positive face), and a complementary desire to act unimpeded and maintain freedom (negative face). Similar to speech acts, utterances can perform so-called face acts which can either raise or threaten the positive or negative face of the speaker or hearer. We begin by using an existing corpus to train a model which classifies face acts, achieving a new SoTA in the process. We then observe that every face act has an underlying intention that motivates it and perform additional experiments integrating dialog act annotations to provide these intentions by proxy. Our analysis finds that dialog acts improve performance on face act detection for minority classes and points to a close relationship between aspects of face and intent.
摘要：Brown 和 Levinson (1987) 描述的面子概念已得到深入研究，但该框架的一个关键方面，即关注意图如何调节影响面子的转折计划，却很少受到关注。我们对三个经过训练的计算系统进行了分析，这些系统用于对意图和礼貌进行分类，重点关注前者如何影响后者。在礼貌理论中，代理关注的是希望自己的愿望得到重视（正面面子），以及希望不受阻碍地行动并保持自由的补充愿望（负面面子）。与言语行为类似，话语可以执行所谓的面子行为，这些行为可以提高或威胁说话者或听话者的正面或负面面子。我们首先使用现有语料库来训练一个对面子行为进行分类的模型，在此过程中实现新的 SoTA。然后，我们观察到每个面子行为都有一个潜在的动机，并进行额外的实验，整合对话行为注释以通过代理提供这些意图。我们的分析发现，对话行为可以提高少数群体面部行为检测的性能，并指出面部特征和意图之间存在密切的关系。

Title: Uncovering Limitations of Large Language Models in Information Seeking from Tables

Authors: Chaoxu Pang, Yixuan Cao, Chunhao Yang, Ping Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04113
Pdf URL: https://arxiv.org/pdf/2406.04113
Copy Paste: [[2406.04113]] Uncovering Limitations of Large Language Models in Information Seeking from Tables(https://arxiv.org/abs/2406.04113)
Keywords: language model, gpt, llm
Abstract: Tables are recognized for their high information density and widespread usage, serving as essential sources of information. Seeking information from tables (TIS) is a crucial capability for Large Language Models (LLMs), serving as the foundation of knowledge-based Q&A systems. However, this field presently suffers from an absence of thorough and reliable evaluation. This paper introduces a more reliable benchmark for Table Information Seeking (TabIS). To avoid the unreliable evaluation caused by text similarity-based metrics, TabIS adopts a single-choice question format (with two options per question) instead of a text generation format. We establish an effective pipeline for generating options, ensuring their difficulty and quality. Experiments conducted on 12 LLMs reveal that while the performance of GPT-4-turbo is marginally satisfactory, both other proprietary and open-source models perform inadequately. Further analysis shows that LLMs exhibit a poor understanding of table structures, and struggle to balance between TIS performance and robustness against pseudo-relevant tables (common in retrieval-augmented systems). These findings uncover the limitations and potential challenges of LLMs in seeking information from tables. We release our data and code to facilitate further research in this field.
摘要：表格因其高信息密度和广泛使用而受到认可，是重要的信息来源。从表格中查找信息（TIS）是大型语言模型（LLM）的一项关键能力，是知识型问答系统的基础。然而，目前该领域缺乏全面而可靠的评估。本文介绍了一个更可靠的表格信息查找（TabIS）基准。为了避免基于文本相似度的指标导致的不可靠评估，TabIS 采用单选题格式（每个问题有两个选项）而不是文本生成格式。我们建立了一个有效的选项生成流程，确保了选项的难度和质量。在 12 个 LLM 上进行的实验表明，虽然 GPT-4-turbo 的性能勉强令人满意，但其他专有和开源模型的表现都不够好。进一步的分析表明，LLM 对表格结构的理解较差，并且难以在 TIS 性能和对伪相关表（在检索增强系统中很常见）的鲁棒性之间取得平衡。这些发现揭示了 LLM 在从表格中寻找信息方面的局限性和潜在挑战。我们发布数据和代码以促进该领域的进一步研究。
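
As a rough illustration of why a two-option single-choice format avoids text-similarity metrics, scoring reduces to exact-match accuracy, which can be read against a 50% chance baseline. The sketch below is a toy under stated assumptions, not the TabIS evaluation code: the function name `score_single_choice`, the "A"/"B" labels, and the sample answers are all hypothetical.

```python
def score_single_choice(predictions, gold):
    """Return exact-match accuracy for two-option questions.

    `predictions` and `gold` are equal-length lists of option labels
    ("A" or "B"); the names and labels are illustrative assumptions.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Expected accuracy of random guessing when each question has two options.
CHANCE_ACCURACY = 0.5

# Hypothetical model outputs and gold labels for four questions.
accuracy = score_single_choice(["A", "B", "B", "A"], ["A", "B", "A", "A"])
margin_over_chance = accuracy - CHANCE_ACCURACY
```

Reporting the margin over chance alongside raw accuracy is one way to keep a two-option score from looking stronger than it is.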

Title: Are We Done with MMLU?

Authors: Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04127
Pdf URL: https://arxiv.org/pdf/2406.04127
Copy Paste: [[2406.04127]] Are We Done with MMLU?(https://arxiv.org/abs/2406.04127)
Keywords: llm
Abstract: Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation at this https URL.
摘要：也许不是。我们识别并分析了流行的大规模多任务语言理解 (MMLU) 基准中的错误。尽管 MMLU 被广泛采用，但我们的分析表明，许多基本事实错误掩盖了 LLM 的真正能力。例如，我们发现病毒学子集中 57% 的分析问题包含错误。为了解决这个问题，我们引入了一个全面的框架，使用一种新的错误分类法来识别数据集错误。然后，我们创建了 MMLU-Redux，它是 30 个 MMLU 科目中 3,000 个手动重新注释问题的子集。使用 MMLU-Redux，我们展示了与最初报告的模型性能指标的显著差异。我们的结果强烈主张修改 MMLU 的错误问题，以增强其未来作为基准的实用性和可靠性。因此，我们通过此 https URL 打开了 MMLU-Redux 以进行额外注释。

Title: Legal Judgment Reimagined: PredEx and the Rise of Intelligent AI Interpretation in Indian Courts

Authors: Shubham Kumar Nigam, Anurag Sharma, Danush Khanna, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04136
Pdf URL: https://arxiv.org/pdf/2406.04136
Copy Paste: [[2406.04136]] Legal Judgment Reimagined: PredEx and the Rise of Intelligent AI Interpretation in Indian Courts(https://arxiv.org/abs/2406.04136)
Keywords: language model, llm
Abstract: In the era of Large Language Models (LLMs), predicting judicial outcomes poses significant challenges due to the complexity of legal proceedings and the scarcity of expert-annotated datasets. Addressing this, we introduce Prediction with Explanation (PredEx), the largest expert-annotated dataset for legal judgment prediction and explanation in the Indian context, featuring over 15,000 annotations. This groundbreaking corpus significantly enhances the training and evaluation of AI models in legal analysis, with innovations including the application of instruction tuning to LLMs. This method has markedly improved the predictive accuracy and explanatory depth of these models for legal judgments. We employed various transformer-based models, tailored for both general and Indian legal contexts. Through rigorous lexical, semantic, and expert assessments, our models effectively leverage PredEx to provide precise predictions and meaningful explanations, establishing it as a valuable benchmark for both the legal profession and the NLP community.
摘要：在大型语言模型 (LLM) 时代，由于法律诉讼的复杂性和专家注释数据集的稀缺性，预测司法结果带来了重大挑战。为了解决这个问题，我们引入了 \textbf{Pred}iction with \textbf{Ex}planation (\texttt{PredEx})，这是印度背景下最大的专家注释法律判决预测和解释数据集，包含超过 15,000 条注释。这个开创性的语料库显著增强了法律分析中 AI 模型的训练和评估，其创新包括将指令调整应用于 LLM。这种方法显著提高了这些模型对法律判决的预测准确性和解释深度。我们采用了各种基于 Transformer 的模型，针对一般和印度法律背景进行了量身定制。通过严格的词汇、语义和专家评估，我们的模型有效地利用 \texttt{PredEx} 提供精确的预测和有意义的解释，使其成为法律界和 NLP 社区的宝贵基准。

Title: Do Language Models Understand Morality? Towards a Robust Detection of Moral Content

Authors: Luana Bulla, Aldo Gangemi, Misael Mongiovì
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04143
Pdf URL: https://arxiv.org/pdf/2406.04143
Copy Paste: [[2406.04143]] Do Language Models Understand Morality? Towards a Robust Detection of Moral Content(https://arxiv.org/abs/2406.04143)
Keywords: language model, gpt
Abstract: The task of detecting moral values in text has significant implications in various fields, including natural language processing, social sciences, and ethical decision-making. Previously proposed supervised models often suffer from overfitting, leading to hyper-specialized moral classifiers that struggle to perform well on data from different domains. To address this issue, we introduce novel systems that leverage abstract concepts and common-sense knowledge acquired from Large Language Models and Natural Language Inference models during previous stages of training on multiple data sources. By doing so, we aim to develop versatile and robust methods for detecting moral values in real-world scenarios. Our approach uses the GPT 3.5 model as a zero-shot ready-made unsupervised multi-label classifier for moral values detection, eliminating the need for explicit training on labeled data. We compare it with a smaller NLI-based zero-shot model. The results show that the NLI approach achieves competitive results compared to the Davinci model. Furthermore, we conduct an in-depth investigation of the performance of supervised systems in the context of cross-domain multi-label moral value detection. This involves training supervised models on different domains to explore their effectiveness in handling data from different sources and comparing their performance with the unsupervised methods. Our contributions encompass a thorough analysis of both supervised and unsupervised methodologies for cross-domain value detection. We introduce the Davinci model as a state-of-the-art zero-shot unsupervised moral values classifier, pushing the boundaries of moral value detection without the need for explicit training on labeled data. Additionally, we perform a comparative evaluation of our approach with the supervised models, shedding light on their respective strengths and weaknesses.
摘要：在文本中检测道德价值观的任务在各个领域都有重要意义，包括自然语言处理、社会科学和道德决策。以前提出的监督模型经常出现过度拟合的问题，导致道德分类器过于专业化，难以在不同领域的数据上取得良好表现。为了解决这个问题，我们引入了新的系统，利用在多个数据源上进行训练的先前阶段从大型语言模型和自然语言推理模型中获得的抽象概念和常识知识。通过这样做，我们的目标是开发出在现实场景中检测道德价值观的多功能和稳健的方法。我们的方法使用 GPT 3.5 模型作为零样本现成的无监督多标签分类器进行道德价值观检测，从而无需对标记数据进行显式训练。我们将其与较小的基于 NLI 的零样本模型进行了比较。结果表明，与 Davinci 模型相比，NLI 方法取得了有竞争力的结果。此外，我们对跨领域多标签道德价值检测背景下的监督系统的性能进行了深入研究。这涉及在不同领域训练监督模型，以探索它们在处理来自不同来源的数据方面的有效性，并将其性能与无监督方法进行比较。我们的贡献包括对跨领域价值检测的监督和无监督方法的全面分析。我们引入了 Davinci 模型作为最先进的零样本无监督道德价值分类器，突破了道德价值检测的界限，而无需对标记数据进行明确训练。此外，我们还对我们的方法与监督模型进行了比较评估，揭示了它们各自的优缺点。

Title: Every Answer Matters: Evaluating Commonsense with Probabilistic Measures

Authors: Qi Cheng, Michael Boratko, Pranay Kumar Yelugam, Tim O'Gorman, Nalini Singh, Andrew McCallum, Xiang Lorraine Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04145
Pdf URL: https://arxiv.org/pdf/2406.04145
Copy Paste: [[2406.04145]] Every Answer Matters: Evaluating Commonsense with Probabilistic Measures(https://arxiv.org/abs/2406.04145)
Keywords: language model
Abstract: Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Commonsense is also inherently probabilistic with multiple correct answers. The purpose of "boiling water" could be making tea and cooking, but it also could be killing germs. Existing tasks do not capture the probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task that evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation that strongly correlates with human judgments. Humans drastically outperform strong language model baselines on our dataset, indicating this approach is both a challenging and useful evaluation of machine common sense.
摘要：大型语言模型在常识任务上表现出色；然而，这些任务通常以多项选择题的形式提出，允许模型利用系统性偏差。常识本质上也是概率性的，有多个正确答案。“烧水”的目的可能是泡茶和做饭，但也可能是杀死细菌。现有任务没有捕捉到常识的概率性质。为此，我们提出了常识框架完成 (CFC)，这是一种新的生成任务，通过多个开放式生成来评估常识。我们还提出了一种与人类判断密切相关的概率评估方法。人类在我们的数据集上的表现远远超过强语言模型基线，表明这种方法既具有挑战性，又对机器常识有用。

Title: Towards Understanding Task-agnostic Debiasing Through the Lenses of Intrinsic Bias and Forgetfulness

Authors: Guangliang Liu, Milad Afshari, Xitong Zhang, Zhiyu Xue, Avrajit Ghosh, Bidhan Bashyal, Rongrong Wang, Kristen Johnson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04146
Pdf URL: https://arxiv.org/pdf/2406.04146
Copy Paste: [[2406.04146]] Towards Understanding Task-agnostic Debiasing Through the Lenses of Intrinsic Bias and Forgetfulness(https://arxiv.org/abs/2406.04146)
Keywords: language model
Abstract: While task-agnostic debiasing provides notable generalizability and reduced reliance on downstream data, its impact on language modeling ability and the risk of relearning social biases from downstream task-specific data remain as the two most significant challenges when debiasing Pretrained Language Models (PLMs). The impact on language modeling ability can be alleviated given a high-quality and long-contextualized debiasing corpus, but there remains a deficiency in understanding the specifics of relearning biases. We empirically ascertain that the effectiveness of task-agnostic debiasing hinges on the quantitative bias level of both the task-specific data used for downstream applications and the debiased model. We empirically show that the lower bound of the bias level of the downstream fine-tuned model can be approximated by the bias level of the debiased model, in most practical cases. To gain more in-depth understanding about how the parameters of PLMs change during fine-tuning due to the forgetting issue of PLMs, we propose a novel framework which can Propagate Socially-fair Debiasing to Downstream Fine-tuning, ProSocialTuning. Our proposed framework can push the fine-tuned model to approach the bias lower bound during downstream fine-tuning, indicating that the ineffectiveness of debiasing can be alleviated by overcoming the forgetting issue through regularizing successfully debiased attention heads based on the PLMs' bias levels from stages of pretraining and debiasing.
摘要：虽然任务无关去偏提供了显著的通用性并减少了对下游数据的依赖，但它对语言建模能力的影响以及从下游任务特定数据中重新学习社会偏见的风险仍然是去偏预训练语言模型 (PLM) 时面临的两个最重大挑战。如果有高质量和长期语境化的去偏语料库，可以减轻对语言建模能力的影响，但在理解重新学习偏见的具体细节方面仍然存在不足。我们通过经验确定，任务无关去偏的有效性取决于用于下游应用的任务特定数据和去偏模型的定量偏见水平。我们通过经验表明，在大多数实际情况下，下游微调模型的偏见水平下限可以用去偏模型的偏见水平来近似。为了更深入地了解由于 PLM 的遗忘问题，PLM 的参数在微调过程中如何变化，我们提出了一个可以将社会公平去偏传播到下游微调的新框架，即 ProSocialTuning。我们提出的框架可以推动微调模型在下游微调期间接近偏差下限，这表明可以通过基于 PLM 在预训练和去偏阶段的偏差水平来规范成功去偏的注意力头来克服遗忘问题，从而缓解去偏的无效性。

Title: Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness

Authors: Lars Hillebrand, Prabhupad Pradhan, Christian Bauckhage, Rafet Sifa
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04156
Pdf URL: https://arxiv.org/pdf/2406.04156
Copy Paste: [[2406.04156]] Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness(https://arxiv.org/abs/2406.04156)
Keywords: language model
Abstract: We introduce "pointer-guided segment ordering" (SO), a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations in large language models. Our methodology leverages a self-attention-driven pointer network to restore the original sequence of shuffled text segments, addressing the challenge of capturing the structural coherence and contextual dependencies within documents. This pre-training approach is complemented by a fine-tuning methodology that incorporates dynamic sampling, augmenting the diversity of training instances and improving sample efficiency for various downstream applications. We evaluate our method on a diverse set of datasets, demonstrating its efficacy in tasks requiring sequential text classification across scientific literature and financial reporting domains. Our experiments show that pointer-guided pre-training significantly enhances the model's ability to understand complex document structures, leading to state-of-the-art performance in downstream classification tasks.
摘要：我们引入了“指针引导的段排序”（SO），这是一种新颖的预训练技术，旨在增强大型语言模型中段落级文本表示的上下文理解。我们的方法利用自注意力驱动的指针网络来恢复打乱的文本段的原始序列，解决了捕获文档中的结构连贯性和上下文依赖性的挑战。这种预训练方法与结合动态采样的微调方法相辅相成，增加了训练实例的多样性并提高了各种下游应用的采样效率。我们在一组不同的数据集上评估了我们的方法，证明了它在需要跨科学文献和财务报告领域进行顺序文本分类的任务中的有效性。我们的实验表明，指针引导的预训练显著增强了模型理解复杂文档结构的能力，从而在下游分类任务中实现了最先进的性能。
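
To make the segment-ordering objective concrete, the sketch below shows one way a training example could be built: shuffle a document's segments and record, for each original position, where that segment ended up. This is an assumption-based illustration only; the abstract does not specify the data format, and the names `make_segment_ordering_example` and `restore_order` are hypothetical. The pointer network itself is not modelled here.

```python
import random


def make_segment_ordering_example(segments, seed=0):
    """Shuffle segments and return (shuffled, targets).

    targets[k] is the index in `shuffled` that holds the k-th original
    segment, i.e. what a pointer network would be trained to point at
    on decoding step k (an assumed formulation).
    """
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)
    # shuffled[j] is the original segment number order[j].
    shuffled = [segments[i] for i in order]
    targets = [order.index(k) for k in range(len(segments))]
    return shuffled, targets


def restore_order(shuffled, targets):
    """Rebuild the original sequence by following the pointer targets."""
    return [shuffled[position] for position in targets]


# Hypothetical four-segment document.
document = ["Intro.", "Method.", "Results.", "Conclusion."]
shuffled, targets = make_segment_ordering_example(document, seed=1)
restored = restore_order(shuffled, targets)
```

Because the targets are derived from the shuffle itself, no manual labels are needed, which is what makes this kind of objective usable for pre-training.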

Title: Confabulation: The Surprising Value of Large Language Model Hallucinations

Authors: Peiqi Sui, Eamon Duede, Sophie Wu, Richard Jean So
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04175
Pdf URL: https://arxiv.org/pdf/2406.04175
Copy Paste: [[2406.04175]] Confabulation: The Surprising Value of Large Language Model Hallucinations(https://arxiv.org/abs/2406.04175)
Keywords: language model, llm, hallucination
Abstract: This paper presents a systematic defense of large language model (LLM) hallucinations or 'confabulations' as a potential resource instead of a categorically negative pitfall. The standard view is that confabulations are inherently problematic and AI research should eliminate this flaw. In this paper, we argue and empirically demonstrate that measurable semantic characteristics of LLM confabulations mirror a human propensity to utilize increased narrativity as a cognitive resource for sense-making and communication. In other words, it has potential value. Specifically, we analyze popular hallucination benchmarks and reveal that hallucinated outputs display increased levels of narrativity and semantic coherence relative to veridical outputs. This finding reveals a tension in our usually dismissive understandings of confabulation. It suggests, counter-intuitively, that the tendency for LLMs to confabulate may be intimately associated with a positive capacity for coherent narrative-text generation.
摘要：本文系统地为大型语言模型 (LLM) 幻觉或“虚构”辩护，认为它们是一种潜在资源，而不是绝对的负面陷阱。标准观点认为虚构本身存在问题，而人工智能研究应该消除这一缺陷。在本文中，我们论证并通过实证证明，LLM 虚构的可测量语义特征反映了人类倾向于利用增强的叙述性作为认知资源进行理解和交流。换句话说，它具有潜在价值。具体而言，我们分析了流行的幻觉基准，并揭示了幻觉输出相对于真实输出显示出更高的叙述性和语义连贯性。这一发现揭示了我们通常对虚构的理解存在矛盾。与直觉相反，它表明 LLM 虚构的倾向可能与生成连贯叙述文本的积极能力密切相关。

Title: DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning

Authors: Shangqing Tu, Kejian Zhu, Yushi Bai, Zijun Yao, Lei Hou, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04197
Pdf URL: https://arxiv.org/pdf/2406.04197
Copy Paste: [[2406.04197]] DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning(https://arxiv.org/abs/2406.04197)
Keywords: language model, llm
Abstract: The advancement of large language models (LLMs) relies on evaluation using public benchmarks, but data contamination can lead to overestimated performance. Previous research focuses on detecting contamination by determining whether the model has seen the exact same data during training. In this work, we argue that even training on data similar to benchmark data inflates performance on in-distribution tasks without improving overall capacity, which we call in-distribution contamination. To effectively detect in-distribution contamination, we propose DICE, a novel method that leverages the internal states of LLMs to locate and then detect the contamination. DICE first identifies the layer most sensitive to contamination, then trains a classifier based on the internal states of that layer. Experiments reveal DICE's high accuracy in detecting in-distribution contamination across various LLMs and math reasoning datasets. We also show the generalization capability of the trained DICE detector, which is able to detect contamination across multiple benchmarks with similar distributions. Additionally, we find that the DICE detection scores are positively correlated with the performance of ten LLMs fine-tuned by either us or other organizations on four math reasoning datasets (with $R^2$ values between 0.6 and 0.75). This indicates that the in-distribution contamination problem potentially leads to an overestimation of the true capabilities of many existing models. The code and data are available at this https URL.
摘要：大型语言模型 (LLM) 的进步依赖于使用公共基准进行评估，但数据污染会导致性能被高估。先前的研究侧重于通过确定模型在训练期间是否见过完全相同的数据来检测污染。在这项工作中，我们认为，即使使用与基准数据类似的数据进行训练也会在不提高整体容量的情况下提高分布内任务的性能，我们称之为分布内污染。为了有效地检测分布内污染，我们提出了 DICE，这是一种利用 LLM 的内部状态来定位然后检测污染的新方法。DICE 首先确定对污染最敏感的层，然后根据该层的内部状态训练分类器。实验表明，DICE 在检测各种 LLM 和数学推理数据集中的分布内污染方面具有很高的准确性。我们还展示了训练后的 DICE 检测器的泛化能力，它能够在具有相似分布的多个基准中检测污染。此外，我们发现 DICE 检测分数与我们或其他组织在四个数学推理数据集上微调的十个 LLM 的性能呈正相关（$R^2$ 值介于 0.6 和 0.75 之间）。这表明分布内污染问题可能导致对许多现有模型的真实能力的估计过高。代码和数据可在此 https URL 上找到。

Title: Legal Documents Drafting with Fine-Tuned Pre-Trained Large Language Model

Authors: Chun-Hsien Lin, Pu-Jen Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04202
Pdf URL: https://arxiv.org/pdf/2406.04202
Copy Paste: [[2406.04202]] Legal Documents Drafting with Fine-Tuned Pre-Trained Large Language Model(https://arxiv.org/abs/2406.04202)
Keywords: language model, llm
Abstract: With the development of large-scale language models (LLMs), fine-tuning a pre-trained LLM has become a mainstream paradigm for solving downstream natural language processing tasks. However, training a language model for the legal field requires a large number of legal documents so that the model can learn legal terminology and the particular format of legal documents. Typical NLP approaches usually rely on many manually annotated datasets for training. In legal applications, however, it is difficult to obtain a large number of manually annotated datasets, which restricts the use of typical methods for drafting legal documents. The experimental results of this paper show that we can fine-tune a large-scale language model on a large number of annotation-free legal documents without Chinese word segmentation and, more importantly, that a pre-trained LLM can be fine-tuned on a local computer to generate legal document drafts while protecting information privacy and improving information security.
摘要：随着大规模语言模型（LLM）的发展，对预训练的LLM进行微调已成为解决自然语言处理下游任务的主流范式。然而，训练法律领域的语言模型需要大量的法律文献，以便语言模型学习法律术语以及法律文献格式的特殊性。典型的NLP方法通常依赖于大量人工标注的数据集进行训练，然而在法律领域应用中，很难获得大量的人工标注数据集，这限制了典型方法应用于法律文献草稿生成任务。本文的实验结果表明，不仅可以利用大量无需中文分词的无标注法律文献对大规模语言模型进行微调，更重要的是可以在本地计算机上对预训练的LLM进行微调，实现法律文献草稿生成任务，同时实现信息隐私的保护，改善信息安全问题。

Title: ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models

Authors: Yuanyi Ren, Haoran Ye, Hanjun Fang, Xin Zhang, Guojie Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04214
Pdf URL: https://arxiv.org/pdf/2406.04214
Copy Paste: [[2406.04214]] ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models(https://arxiv.org/abs/2406.04214)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are transforming diverse fields and gaining increasing influence as human proxies. This development underscores the urgent need for evaluating value orientations and understanding of LLMs to ensure their responsible integration into public-facing applications. This work introduces ValueBench, the first comprehensive psychometric benchmark for evaluating value orientations and value understanding in LLMs. ValueBench collects data from 44 established psychometric inventories, encompassing 453 multifaceted value dimensions. We propose an evaluation pipeline grounded in realistic human-AI interactions to probe value orientations, along with novel tasks for evaluating value understanding in an open-ended value space. With extensive experiments conducted on six representative LLMs, we unveil their shared and distinctive value orientations and exhibit their ability to approximate expert conclusions in value-related extraction and generation tasks. ValueBench is openly accessible at this https URL.
摘要：大型语言模型 (LLM) 正在改变各个领域，并作为人类代理获得越来越大的影响力。这一发展凸显了评估 LLM 的价值取向和理解的迫切需求，以确保它们负责任地融入面向公众的应用程序中。这项工作引入了 ValueBench，这是第一个用于评估 LLM 中的价值取向和价值理解的综合心理测量基准。ValueBench 从 44 个已建立的心理测量清单中收集数据，涵盖 453 个多方面的价值维度。我们提出了一个基于现实的人机交互的评估流程来探索价值取向，以及在开放式价值空间中评估价值理解的新任务。通过对六个代表性 LLM 进行大量实验，我们揭示了它们共同和独特的价值取向，并展示了它们在与价值相关的提取和生成任务中近似专家结论的能力。ValueBench 可通过此 https URL 公开访问。

Title: mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans

Authors: Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04215
Pdf URL: https://arxiv.org/pdf/2406.04215
Copy Paste: [[2406.04215]] mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans(https://arxiv.org/abs/2406.04215)
Keywords: language model
Abstract: It is very challenging to curate a dataset for language-specific knowledge and common sense in order to evaluate the natural language understanding capabilities of language models. Due to the limited availability of annotators, most current multilingual datasets are created through translation, which cannot evaluate such language-specific aspects. Therefore, we propose Multilingual CommonsenseQA (mCSQA), based on the construction process of CSQA but leveraging language models for more efficient construction, e.g., by asking LMs to generate questions/answers, refine answers, and verify QAs, followed by reduced human effort for verification. The constructed dataset is a benchmark for the cross-lingual language-transfer capabilities of multilingual LMs, and experimental results showed high language-transfer capabilities for questions that LMs could easily solve, but lower transfer capabilities for questions requiring deep knowledge or commonsense. This highlights the necessity of language-specific datasets for evaluation and training. Finally, our method demonstrated that multilingual LMs could create QA including language-specific knowledge, significantly reducing the dataset creation cost compared to manual creation. The datasets are available at this https URL.
摘要：为了评估语言模型的自然语言理解能力，整理一个包含特定语言知识和常识的数据集非常具有挑战性。由于注释器可用性的限制，大多数当前多语言数据集都是通过翻译创建的，无法评估此类特定语言方面。因此，我们提出了基于 CSQA 构建过程的多语言常识问答 (mCSQA)，但利用语言模型进行更高效的构建，例如，通过要求 LM 生成问题/答案、改进答案和验证 QA，然后减少人工验证工作。构建的数据集是多语言 LM 跨语言语言迁移能力的基准，实验结果表明，对于 LM 可以轻松解决的问题，语言迁移能力较高，但对于需要深度知识或常识的问题，迁移能力较低。这凸显了语言特定数据集用于评估和训练的必要性。最后，我们的方法表明，多语言 LM 可以创建包含特定语言知识的 QA，与手动创建相比，显著降低了数据集创建成本。数据集可在此 https URL 上找到。

Title: What Do Language Models Learn in Context? The Structured Task Hypothesis

Authors: Jiaoda Li, Yifan Hou, Mrinmaya Sachan, Ryan Cotterell
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04216
Pdf URL: https://arxiv.org/pdf/2406.04216
Copy Paste: [[2406.04216]] What Do Language Models Learn in Context? The Structured Task Hypothesis(https://arxiv.org/abs/2406.04216)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) exhibit an intriguing ability to learn a novel task from in-context examples presented in a demonstration, termed in-context learning (ICL). Understandably, a swath of research has been dedicated to uncovering the theories underpinning ICL. One popular hypothesis explains ICL by task selection. LLMs identify the task based on the demonstration and generalize it to the prompt. Another popular hypothesis is that ICL is a form of meta-learning, i.e., the models learn a learning algorithm at pre-training time and apply it to the demonstration. Finally, a third hypothesis argues that LLMs use the demonstration to select a composition of tasks learned during pre-training to perform ICL. In this paper, we empirically explore these three hypotheses that explain LLMs' ability to learn in context with a suite of experiments derived from common text classification tasks. We invalidate the first two hypotheses with counterexamples and provide evidence in support of the last hypothesis. Our results suggest an LLM could learn a novel task in context via composing tasks learned during pre-training.
摘要：大型语言模型 (LLM) 表现出一种有趣的能力，即从演示中呈现的上下文示例中学习新任务，这被称为上下文学习 (ICL)。可以理解的是，大量研究致力于揭示 ICL 背后的理论。一种流行的假设通过任务选择来解释 ICL。LLM 根据演示识别任务并将其推广到提示。另一个流行的假设是 ICL 是一种元学习的形式，即模型在训练前学习一种学习算法并将其应用于演示。最后，第三个假设认为 LLM 使用演示来选择在训练前学习的任务组合来执行 ICL。在本文中，我们通过一系列源自常见文本分类任务的实验，实证探索了这三个解释 LLM 在上下文中学习能力的假设。我们用反例推翻了前两个假设，并提供了支持最后一个假设的证据。我们的结果表明，LLM 可以通过组合预训练期间学到的任务来学习上下文中的新任务。

Title: Rethinking LLM and Linguistic Steganalysis: An Efficient Detection of Strongly Concealed Stego

Authors: Yifan Tang, Yihao Wang, Ru Zhang, Jianyi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Rethinking LLM and Linguistic Steganalysis: An Efficient Detection of Strongly Concealed Stego(https://arxiv.org/abs/)
Keywords: llm
Abstract: To detect stego (steganographic text) in complex scenarios, linguistic steganalysis (LS) methods with various motivations have been proposed and have achieved excellent performance. However, with the development of generative steganography, some stegos are strongly concealed; especially since the emergence of LLM-based steganography, existing LS methods have low detection rates or cannot detect them at all. We designed a novel LS method with two modes, called LSGC. In the generation mode, we created an LS-task "description" and used the generation ability of the LLM to explain whether the texts to be detected are stegos. On this basis, we rethought the principles of LS and LLMs, and proposed the classification mode. In this mode, LSGC deletes the LS-task "description" and changes the "causalLM" LLMs to the "sequenceClassification" architecture. The LS features can be extracted with only one pass of the model, and a linear layer with initialization weights is added to obtain the classification probability. Experiments on strongly concealed stegos show that LSGC significantly improves detection and reaches SOTA performance. Additionally, LSGC in classification mode greatly reduces training time while maintaining high performance.
摘要：为了在复杂场景中检测隐写文本，基于各种动机的语言隐写分析（LS）被提出并取得了优异的表现。然而随着生成式隐写的发展，一些隐写文本具有很强的隐蔽性，尤其是在基于LLM的隐写技术出现之后，现有的LS检测率很低甚至无法检测到它们。我们设计了一种具有两种模式的新型LS，称为LSGC。在生成模式下，我们创建了一个LS任务“描述”，并利用LLM的生成能力来解释待检测的文本是否为隐写文本。在此基础上，我们重新思考了LS和LLM的原理，并提出了分类模式。在该模式下，LSGC删除了LS任务“描述”，将“因果LM”LLM改为“序列分类”架构。仅通过模型一次即可提取LS特征，并添加一个带有初始化权重的线性层来获得分类概率。在强隐蔽隐秘信息上的实验表明，LSGC 显著提升了检测能力，达到了 SOTA 性能。此外，分类模式下的 LSGC 在保持高性能的同时，大大减少了训练时间。

Title: BEADs: Bias Evaluation Across Domains

Authors: Shaina Raza, Mizanur Rahman, Michael R. Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.04220
Pdf URL: https://arxiv.org/pdf/2406.04220
Copy Paste: [[2406.04220]] BEADs: Bias Evaluation Across Domains(https://arxiv.org/abs/2406.04220)
Keywords: language model, llm
Abstract: Recent improvements in large language models (LLMs) have significantly enhanced natural language processing (NLP) applications. However, these models can also inherit and perpetuate biases from their training data. Addressing this issue is crucial, yet many existing datasets do not offer evaluation across diverse NLP tasks. To tackle this, we introduce the Bias Evaluations Across Domains (BEADs) dataset, designed to support a wide range of NLP tasks, including text classification, bias entity recognition, bias quantification, and benign language generation. BEADs uses AI-driven annotation combined with experts' verification to provide reliable labels. This method overcomes the limitations of existing datasets that typically depend on crowd-sourcing, expert-only annotations with limited bias evaluations, or unverified AI labeling. Our empirical analysis shows that BEADs is effective in detecting and reducing biases across different language models, with smaller models fine-tuned on BEADs often outperforming LLMs in bias classification tasks. However, these models may still exhibit biases towards certain demographics. Fine-tuning LLMs with our benign language data also reduces biases while preserving the models' knowledge. Our findings highlight the importance of comprehensive bias evaluation and the potential of targeted fine-tuning for reducing the bias of LLMs. We are making BEADs publicly available at this https URL. Warning: This paper contains examples that may be considered offensive.
摘要：大型语言模型 (LLM) 的最新改进显著增强了自然语言处理 (NLP) 应用。然而，这些模型也可以从训练数据中继承和延续偏见。解决这个问题至关重要，但许多现有数据集无法跨各种 NLP 任务提供评估。为了解决这个问题，我们引入了跨领域偏见评估 (BEADs) 数据集，旨在支持广泛的 NLP 任务，包括文本分类、偏见实体识别、偏见量化和良性语言生成。BEADs 使用 AI 驱动的注释结合专家验证来提供可靠的标签。这种方法克服了现有数据集的局限性，这些数据集通常依赖于众包、仅专家注释且偏见评估有限或未经验证的 AI 标记。我们的实证分析表明，BEADs 可以有效地检测和减少不同语言模型之间的偏见，在 BEAD 上微调的小型模型在偏见分类任务中的表现通常优于 LLM。然而，这些模型可能仍然对某些人口统计数据表现出偏见。使用我们的良性语言数据对 LLM 进行微调还可以减少偏差，同时保留模型的知识。我们的研究结果强调了全面偏差评估的重要性，以及有针对性的微调对于减少 LLM 偏差的潜力。我们将在此 https URL 上公开提供 BEAD 警告：本文包含可能被视为冒犯的示例。

Title: Benchmark Data Contamination of Large Language Models: A Survey

Authors: Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04244
Pdf URL: https://arxiv.org/pdf/2406.04244
Copy Paste: [[2406.04244]] Benchmark Data Contamination of Large Language Models: A Survey(https://arxiv.org/abs/2406.04244)
Keywords: language model, gpt, llm
Abstract: The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.
摘要：GPT-4、Claude-3 和 Gemini 等大型语言模型 (LLM) 的快速发展彻底改变了自然语言处理领域。然而，这也导致了一个重大问题，即基准数据污染 (BDC)。当语言模型无意中从其训练数据中引入评估基准信息时，就会发生这种情况，从而导致流程评估阶段的性能不准确或不可靠。本文回顾了 LLM 评估中 BDC 的复杂挑战，并探讨了替代评估方法来减轻与传统基准相关的风险。本文还研究了减轻 BDC 风险的挑战和未来方向，强调了问题的复杂性以及对创新解决方案的需求，以确保 LLM 评估在实际应用中的可靠性。

Title: Transformers need glasses! Information over-squashing in language tasks

Authors: Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G.M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Transformers need glasses! Information over-squashing in language tasks(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.
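
Illustration (not from the paper): the collapse described in the abstract can be sketched with a toy model. The sketch assumes uniform attention, so the last-token representation is just the mean of scalar token values, and it compares a sequence of n ones followed by a zero with the same sequence with one extra one. The helper names `to_half`, `last_token_mean`, and `collapses_in_half` are hypothetical, and the construction is a simplification, not the authors' proof setting.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to IEEE 754 half precision (binary16) and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def last_token_mean(n_ones: int) -> float:
    """Toy 'representation': uniform average over n_ones ones and a final zero.

    Assumption: uniform attention over scalar token values stands in for the
    last-token representation of a real Transformer.
    """
    values = [1.0] * n_ones + [0.0]
    return sum(values) / len(values)


def collapses_in_half(n_ones: int) -> bool:
    """True if two distinct sequences become indistinguishable in binary16."""
    a = last_token_mean(n_ones)       # n ones, then a zero
    b = last_token_mean(n_ones + 1)   # n + 1 ones, then a zero
    return a != b and to_half(a) == to_half(b)


if __name__ == "__main__":
    for n in (5, 100):
        print(n, collapses_in_half(n))
```

In this toy setting the two means differ by 1/((n+1)(n+2)), so the gap shrinks as the sequence grows and eventually falls below the spacing of half-precision values near 1; the two sequences with n = 100 round to the same binary16 value, while those with n = 5 stay distinct.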

Title: Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models

Authors: Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E. Gonzalez, Bin Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04271
Pdf URL: https://arxiv.org/pdf/2406.04271
Copy Paste: [[2406.04271]] Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models(https://arxiv.org/abs/2406.04271)
Keywords: language model, llm, prompt
Abstract: We introduce Buffer of Thoughts (BoT), a novel and versatile thought-augmented reasoning approach for enhancing accuracy, efficiency and robustness of large language models (LLMs). Specifically, we propose meta-buffer to store a series of informative high-level thoughts, namely thought-template, distilled from the problem-solving processes across various tasks. Then for each problem, we retrieve a relevant thought-template and adaptively instantiate it with specific reasoning structures to conduct efficient reasoning. To guarantee the scalability and stability, we further propose buffer-manager to dynamically update the meta-buffer, thus enhancing the capacity of meta-buffer as more tasks are solved. We conduct extensive experiments on 10 challenging reasoning-intensive tasks, and achieve significant performance improvements over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One. Further analysis demonstrates the superior generalization ability and model robustness of our BoT, while requiring only 12% of the cost of multi-query prompting methods (e.g., tree/graph of thoughts) on average. Notably, we find that our Llama3-8B+BoT has the potential to surpass Llama3-70B model. Our project is available at: this https URL

Title: Characterizing Similarities and Divergences in Conversational Tones in Humans and LLMs by Sampling with People

Authors: Dun-Ming Huang, Pol Van Rijn, Ilia Sucholutsky, Raja Marjieh, Nori Jacoby
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2406.04278
Pdf URL: https://arxiv.org/pdf/2406.04278
Copy Paste: [[2406.04278]] Characterizing Similarities and Divergences in Conversational Tones in Humans and LLMs by Sampling with People(https://arxiv.org/abs/2406.04278)
Keywords: language model, gpt, llm
Abstract: Conversational tones -- the manners and attitudes in which speakers communicate -- are essential to effective communication. Amidst the increasing popularization of Large Language Models (LLMs) over recent years, it becomes necessary to characterize the divergences in their conversational tones relative to humans. However, existing investigations of conversational modalities rely on pre-existing taxonomies or text corpora, which suffer from experimenter bias and may not be representative of real-world distributions for the studies' psycholinguistic domains. Inspired by methods from cognitive science, we propose an iterative method for simultaneously eliciting conversational tones and sentences, where participants alternate between two tasks: (1) one participant identifies the tone of a given sentence and (2) a different participant generates a sentence based on that tone. We run 100 iterations of this process with human participants and GPT-4, then obtain a dataset of sentences and frequent conversational tones. In an additional experiment, humans and GPT-4 annotated all sentences with all tones. With data from 1,339 human participants, 33,370 human judgments, and 29,900 GPT-4 queries, we show how our approach can be used to create an interpretable geometric representation of relations between conversational tones in humans and GPT-4. This work demonstrates how combining ideas from machine learning and cognitive science can address challenges in human-computer interactions.

Title: What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages

Authors: Nadav Borenstein, Anej Svete, Robin Chan, Josef Valvoda, Franz Nowak, Isabelle Augenstein, Eleanor Chodroff, Ryan Cotterell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.04289
Pdf URL: https://arxiv.org/pdf/2406.04289
Copy Paste: [[2406.04289]] What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages(https://arxiv.org/abs/2406.04289)
Keywords: language model
Abstract: What can large language models learn? By definition, language models (LM) are distributions over strings. Therefore, an intuitive way of addressing the above question is to formalize it as a matter of learnability of classes of distributions over strings. While prior work in this direction focused on assessing the theoretical limits, in contrast, we seek to understand the empirical learnability. Unlike prior empirical work, we evaluate neural LMs on their home turf -- learning probabilistic languages -- rather than as classifiers of formal languages. In particular, we investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs. We empirically test the learnability of RLMs as a function of various complexity parameters of the RLM and the hidden state size of the neural LM. We find that the RLM rank, which corresponds to the size of linear space spanned by the logits of its conditional distributions, and the expected length of sampled strings are strong and significant predictors of learnability for both RNNs and Transformers. Several other predictors also reach significance, but with differing patterns between RNNs and Transformers.

Title: PaCE: Parsimonious Concept Engineering for Large Language Models

Authors: Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2406.04331
Pdf URL: https://arxiv.org/pdf/2406.04331
Copy Paste: [[2406.04331]] PaCE: Parsimonious Concept Engineering for Large Language Models(https://arxiv.org/abs/2406.04331)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable output, via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Then, given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Finally, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activation as a linear combination of the benign and undesirable components. By removing the latter ones from the activation, we reorient the behavior of LLMs towards alignment goals. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
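
Illustration (not from the paper): the decompose-then-remove step in the abstract can be sketched as follows. The sketch assumes a tiny hand-written dictionary of unit-norm atoms and uses greedy matching pursuit as a stand-in sparse coder; the abstract does not say which solver PaCE uses, and the names `matching_pursuit` and `remove_undesirable` are hypothetical.

```python
from typing import Dict, List, Set

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(activation: Vector, atoms: List[Vector],
                     n_steps: int = 10, tol: float = 1e-9) -> Dict[int, float]:
    """Greedy sparse coding: repeatedly pick the atom most correlated with the residual.

    Assumption: every atom has unit norm, so the inner product is the coefficient.
    """
    residual = list(activation)
    coeffs: Dict[int, float] = {}
    for _ in range(n_steps):
        scores = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        if abs(scores[best]) < tol:
            break  # residual is (numerically) explained by the chosen atoms
        coeffs[best] = coeffs.get(best, 0.0) + scores[best]
        residual = [r - scores[best] * a for r, a in zip(residual, atoms[best])]
    return coeffs


def remove_undesirable(activation: Vector, atoms: List[Vector],
                       undesirable: Set[int]) -> Vector:
    """Subtract the contribution of atoms annotated as undesirable."""
    coeffs = matching_pursuit(activation, atoms)
    steered = list(activation)
    for index, coeff in coeffs.items():
        if index in undesirable:
            steered = [s - coeff * a for s, a in zip(steered, atoms[index])]
    return steered


if __name__ == "__main__":
    # Hypothetical 3-atom dictionary; atom 1 plays the role of an undesirable concept.
    atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(remove_undesirable([2.0, 3.0, 0.0], atoms, undesirable={1}))
```

With an orthonormal dictionary, as in the usage example, matching pursuit recovers the exact coefficients, so removing atom 1 leaves only the benign component. A realistic concept dictionary would be overcomplete and would need a proper sparse solver.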