2024-07-19

Title: GPT Czech Poet: Generation of Czech Poetic Strophes with Language Models

Authors: Michal Chudoba, Rudolf Rosa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.12790
Pdf URL: https://arxiv.org/pdf/2407.12790
Copy Paste: [[2407.12790]] GPT Czech Poet: Generation of Czech Poetic Strophes with Language Models(https://arxiv.org/abs/2407.12790)
Keywords: language model, gpt
Abstract: High-quality automated poetry generation systems are currently only available for a small subset of languages. We introduce a new model for generating poetry in the Czech language, based on fine-tuning a pre-trained Large Language Model. We demonstrate that guiding the generation process by explicitly specifying strophe parameters within the poem text strongly improves the effectiveness of the model. We also find that appropriate tokenization is crucial, showing that tokenization methods based on syllables or individual characters instead of subwords prove superior in generating poetic strophes. We further enhance the results by introducing Forced generation, adding explicit specifications of meter and verse parameters at inference time based on the already generated text. We evaluate a range of setups, showing that our proposed approach achieves high accuracies in rhyming and metric aspects of formal quality of the generated poems.
摘要：目前，高质量的自动诗歌生成系统仅适用于一小部分语言。我们引入了一种用于生成捷克语诗歌的新模型，该模型基于对预先训练的大型语言模型进行微调。我们证明，通过在诗歌文本中明确指定诗节参数来指导生成过程可以大大提高模型的有效性。我们还发现适当的标记化至关重要，表明基于音节或单个字符而不是子词的标记化方法在生成诗歌诗节方面表现更佳。我们通过引入 \textit{强制生成} 进一步增强了结果，在推理时根据已生成的文本添加韵律和诗节参数的明确规范。我们评估了一系列设置，表明我们提出的方法在生成的诗歌形式质量的押韵和韵律方面实现了高精度。

Title: TourLLM: Enhancing LLMs with Tourism Knowledge

Authors: Qikai Wei, Mingzhi Yang, Jinqiang Wang, Wenwei Mao, Jiabo Xu, Huansheng Ning
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12791
Pdf URL: https://arxiv.org/pdf/2407.12791
Copy Paste: [[2407.12791]] TourLLM: Enhancing LLMs with Tourism Knowledge(https://arxiv.org/abs/2407.12791)
Keywords: language model, llm
Abstract: Recently, large language models (LLMs) have demonstrated their effectiveness in various natural language processing (NLP) tasks. However, the lack of tourism knowledge limits the performance of LLMs in tourist attraction presentations and travel planning. To address this challenge, we constructed a supervised fine-tuning dataset for the culture and tourism domain, named Cultour. This dataset consists of three parts: tourism knowledge base QA data, travelogues data, and tourism diversity QA data. Additionally, we propose TourLLM, a Qwen-based model supervised fine-tuned with Cultour, to improve the quality of the information provided about attractions and travel planning. To evaluate the performance of TourLLM, we employed both automatic and human evaluation, and we proposed a human evaluation criterion named CRA (Consistency, Readability, Availability). The experimental results demonstrate the effectiveness of the responses generated by the TourLLM. Our proposed Cultour is accessible at this https URL.
摘要：最近，大型语言模型 (LLM) 已在各种自然语言处理 (NLP) 任务中证明了其有效性。然而，缺乏旅游知识限制了 LLM 在旅游景点介绍和旅行规划方面的表现。为了应对这一挑战，我们为文化和旅游领域构建了一个监督微调数据集，名为 Cultour。该数据集由三部分组成：旅游知识库 QA 数据、旅行日志数据和旅游多样性 QA 数据。此外，我们提出了 TourLLM，这是一个基于 Qwen 的模型，使用 Cultour 进行监督微调，以提高有关景点和旅行规划的信息质量。为了评估 TourLLM 的性能，我们采用了自动和人工评估，并提出了一个名为 CRA（一致性、可读性、可用性）的人工评估标准。实验结果证明了 TourLLM 生成的响应的有效性。我们提出的 Cultour 可通过此 https URL 访问。

Title: Building Understandable Messaging for Policy and Evidence Review (BUMPER) with AI

Authors: Katherine A. Rosenfeld, Maike Sonnewald, Sonia J. Jindal, Kevin A. McCarthy, Joshua L. Proctor
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12812
Pdf URL: https://arxiv.org/pdf/2407.12812
Copy Paste: [[2407.12812]] Building Understandable Messaging for Policy and Evidence Review (BUMPER) with AI(https://arxiv.org/abs/2407.12812)
Keywords: language model, llm
Abstract: We introduce a framework for the use of large language models (LLMs) in Building Understandable Messaging for Policy and Evidence Review (BUMPER). LLMs are proving capable of providing interfaces for understanding and synthesizing large databases of diverse media. This presents an exciting opportunity to supercharge the translation of scientific evidence into policy and action, thereby improving livelihoods around the world. However, these models also pose challenges related to access, trustworthiness, and accountability. The BUMPER framework is built atop a scientific knowledge base (e.g., documentation, code, survey data) by the same scientists (e.g., individual contributor, lab, consortium). We focus on a solution that builds trustworthiness through transparency, scope-limiting, explicit-checks, and uncertainty measures. LLMs are rapidly being adopted and consequences are poorly understood. The framework addresses open questions regarding the reliability of LLMs and their use in high-stakes applications. We provide a worked example in health policy for a model designed to inform measles control programs. We argue that this framework can facilitate accessibility of and confidence in scientific evidence for policymakers, drive a focus on policy-relevance and translatability for researchers, and ultimately increase and accelerate the impact of scientific knowledge used for policy decisions.
摘要：我们在“为政策和证据审查构建可理解的消息传递”（BUMPER）中引入了一个使用大型语言模型（LLM）的框架。LLM 已被证明能够提供理解和综合各种媒体大型数据库的接口。这为将科学证据转化为政策和行动提供了一个激动人心的机会，从而改善了世界各地的生计。然而，这些模型也带来了与访问、可信度和问责制相关的挑战。BUMPER 框架由同一位科学家（例如个人贡献者、实验室、联盟）在科学知识库（例如文档、代码、调查数据）之上构建。我们专注于通过透明度、范围限制、明确检查和不确定性措施来建立可信度的解决方案。LLM 正在迅速被采用，但其后果却知之甚少。该框架解决了有关 LLM 的可靠性及其在高风险应用中的使用的未决问题。我们为旨在为麻疹控制计划提供信息的模型提供了一个卫生政策中的实际示例。我们认为，该框架可以促进政策制定者获取科学证据并增强其对科学证据的信心，推动研究人员关注政策相关性和可转化性，并最终增加和加速用于政策决策的科学知识的影响。

Title: Data Generation using Large Language Models for Text Classification: An Empirical Case Study

Authors: Yinheng Li, Rogerio Bonatti, Sara Abdali, Justin Wagle, Kazuhito Koishida
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12813
Pdf URL: https://arxiv.org/pdf/2407.12813
Copy Paste: [[2407.12813]] Data Generation using Large Language Models for Text Classification: An Empirical Case Study(https://arxiv.org/abs/2407.12813)
Keywords: language model, llm, prompt
Abstract: Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
摘要：近年来，使用大型语言模型 (LLM) 生成用于模型训练的合成数据变得越来越流行。虽然 LLM 能够生成真实的训练数据，但数据生成的有效性受到各种因素的影响，包括提示的选择、任务的复杂性以及生成数据的质量、数量和多样性。在这项工作中，我们专注于使用合成数据进行文本分类任务。具体来说，我们使用在合成数据上训练的自然语言理解 (NLU) 模型来评估来自不同生成方法的合成数据的质量。这项工作对这些因素的影响进行了实证分析，并提出了更好的数据生成实践建议。

Title: SMLT-MUGC: Small, Medium, and Large Texts -- Machine versus User-Generated Content Detection and Comparison

Authors: Anjali Rawal, Hui Wang, Youjia Zheng, Yu-Hsuan Lin, Shanu Sushmita
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12815
Pdf URL: https://arxiv.org/pdf/2407.12815
Copy Paste: [[2407.12815]] SMLT-MUGC: Small, Medium, and Large Texts -- Machine versus User-Generated Content Detection and Comparison(https://arxiv.org/abs/2407.12815)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have gained significant attention due to their ability to mimic human language. Identifying texts generated by LLMs is crucial for understanding their capabilities and mitigating potential consequences. This paper analyzes datasets of varying text lengths: small, medium, and large. We compare the performance of machine learning algorithms on four datasets: (1) small (tweets from Election, FIFA, and Game of Thrones), (2) medium (Wikipedia introductions and PubMed abstracts), and (3) large (OpenAI web text dataset). Our results indicate that LLMs with very large parameters (such as the XL-1542 variant of GPT2 with 1542 million parameters) were harder (74%) to detect using traditional machine learning methods. However, detecting texts of varying lengths from LLMs with smaller parameters (762 million or less) can be done with high accuracy (96% and above). We examine the characteristics of human and machine-generated texts across multiple dimensions, including linguistics, personality, sentiment, bias, and morality. Our findings indicate that machine-generated texts generally have higher readability and closely mimic human moral judgments but differ in personality traits. SVM and Voting Classifier (VC) models consistently achieve high performance across most datasets, while Decision Tree (DT) models show the lowest performance. Model performance drops when dealing with rephrased texts, particularly shorter texts like tweets. This study underscores the challenges and importance of detecting LLM-generated texts and suggests directions for future research to improve detection methods and understand the nuanced capabilities of LLMs.
摘要：大型语言模型 (LLM) 因其模仿人类语言的能力而备受关注。识别 LLM 生成的文本对于了解其功能和减轻潜在后果至关重要。本文分析了不同文本长度的数据集：小、中、大。我们比较了机器学习算法在四个数据集上的性能：(1) 小型（来自 Election、FIFA 和权力的游戏的推文），(2) 中型（维基百科简介和 PubMed 摘要），以及 (3) 大型（OpenAI 网络文本数据集）。我们的结果表明，使用传统机器学习方法检测具有非常大参数的 LLM（例如具有 15.42 亿个参数的 GPT2 的 XL-1542 变体）更难（74%）。但是，从具有较小参数（7.62 亿或更少）的 LLM 中检测不同长度的文本可以达到高精度（96% 及以上）。我们从多个维度研究了人类和机器生成的文本的特征，包括语言学、个性、情感、偏见和道德。我们的研究结果表明，机器生成的文本通常具有更高的可读性，并且与人类的道德判断非常相似，但在个性特征上有所不同。SVM 和投票分类器 (VC) 模型在大多数数据集上始终保持高性能，而决策树 (DT) 模型的性能最低。处理改写的文本时，模型性能会下降，尤其是推文等较短的文本。这项研究强调了检测 LLM 生成的文本的挑战和重要性，并为未来的研究提出了方向，以改进检测方法并了解 LLM 的细微功能。

Title: "I understand why I got this grade": Automatic Short Answer Grading with Feedback

Authors: Dishank Aggarwal, Pushpak Bhattacharyya, Bhaskaran Raman
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2407.12818
Pdf URL: https://arxiv.org/pdf/2407.12818
Copy Paste: [[2407.12818]] "I understand why I got this grade": Automatic Short Answer Grading with Feedback(https://arxiv.org/abs/2407.12818)
Keywords: language model, llm
Abstract: The demand for efficient and accurate assessment methods has intensified as education systems transition to digital platforms. Providing feedback is essential in educational settings and goes beyond simply conveying marks as it justifies the assigned marks. In this context, we present a significant advancement in automated grading by introducing Engineering Short Answer Feedback (EngSAF) -- a dataset of 5.8k student answers accompanied by reference answers and questions for the Automatic Short Answer Grading (ASAG) task. The EngSAF dataset is meticulously curated to cover a diverse range of subjects, questions, and answer patterns from multiple engineering domains. We leverage state-of-the-art large language models' (LLMs) generative capabilities with our Label-Aware Synthetic Feedback Generation (LASFG) strategy to include feedback in our dataset. This paper underscores the importance of enhanced feedback in practical educational settings, outlines dataset annotation and feedback generation processes, conducts a thorough EngSAF analysis, and provides different LLMs-based zero-shot and finetuned baselines for future comparison. Additionally, we demonstrate the efficiency and effectiveness of the ASAG system through its deployment in a real-world end-semester exam at the Indian Institute of Technology Bombay (IITB), showcasing its practical viability and potential for broader implementation in educational institutions.
摘要：随着教育系统向数字平台过渡，对高效、准确的评估方法的需求日益增加。在教育环境中，提供反馈至关重要，它不仅仅是传达分数，还证明了所分配的分数是合理的。在此背景下，我们通过引入工程简答反馈 (EngSAF) 展示了自动评分的重大进步——这是一个包含 5.8k 个学生答案的数据集，并附有自动简答评分 (ASAG) 任务的参考答案和问题。EngSAF 数据集经过精心策划，涵盖了来自多个工程领域的各种主题、问题和答案模式。我们利用最先进的大型语言模型 (LLM) 的生成功能和我们的标签感知合成反馈生成 (LASFG) 策略将反馈纳入我们的数据集。本文强调了增强反馈在实际教育环境中的重要性，概述了数据集注释和反馈生成过程，进行了彻底的 EngSAF 分析，并提供了不同的基于 LLM 的零样本和微调基线以供将来比较。此外，我们通过在印度理工学院孟买分校 (IITB) 的真实期末考试中部署 ASAG 系统来证明其效率和有效性，展示了其在教育机构中更广泛实施的实际可行性和潜力。

Title: PQCache: Product Quantization-based KVCache for Long Context LLM Inference

Authors: Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12820
Pdf URL: https://arxiv.org/pdf/2407.12820
Copy Paste: [[2407.12820]] PQCache: Product Quantization-based KVCache for Long Context LLM Inference(https://arxiv.org/abs/2407.12820)
Keywords: language model, llm, long context
Abstract: As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), a crucial component in LLM inference, has now become the primary memory bottleneck due to limited GPU memory. Current methods selectively determine suitable keys and values for self-attention computation in LLMs to address the issue. However, they either fall short in maintaining model quality or result in high serving latency. Drawing inspiration from advanced embedding retrieval techniques used in the database community, we consider the storage and searching of KVCache as a typical embedding retrieval problem. We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency. During the prefilling phase, we apply PQ to tokens' keys for each LLM layer and head. During the autoregressive decoding phase, for each newly generated token, we first identify important tokens through Maximum Inner-Product Search (MIPS) using PQ codes and centroids, then fetch the corresponding key-value pairs for self-attention computation. Through meticulous design of overlapping and caching, we minimize any additional computation and communication overhead during both phases. Extensive experiments show that PQCache achieves both effectiveness and efficiency. It maintains model quality even when only 1/5 of the tokens are involved in attention, while attaining acceptable system latency.
摘要：随着大型语言模型 (LLM) 领域的不断发展，推理中的上下文长度正在稳步增长。键值缓存 (KVCache) 是 LLM 推理中的关键组件，由于 GPU 内存有限，它现在已成为主要的内存瓶颈。当前的方法选择性地确定适合 LLM 中自注意力计算的键和值来解决此问题。然而，它们要么无法保持模型质量，要么导致高服务延迟。从数据库社区中使用的高级嵌入检索技术中汲取灵感，我们将 KVCache 的存储和搜索视为典型的嵌入检索问题。我们提出了 PQCache，它采用乘积量化 (PQ) 来管理 KVCache，在确保低服务延迟的同时保持模型质量。在预填充阶段，我们将 PQ 应用于每个 LLM 层和头的令牌键。在自回归解码阶段，对于每个新生成的 token，我们首先使用 PQ 代码和质心通过最大内积搜索 (MIPS) 识别重要 token，然后获取相应的键值对进行自注意力计算。通过精心设计重叠和缓存，我们最大限度地减少了两个阶段的任何额外计算和通信开销。大量实验表明，PQCache 兼具有效性和效率。即使只有 1/5 的 token 参与注意力，它也能保持模型质量，同时实现可接受的系统延迟。
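The retrieval step described in the PQCache abstract (score keys through PQ codes and centroids, then fetch only the top-scoring tokens) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration: the toy keys, the hand-picked codebooks (a real system would learn centroids, for example with k-means), and the helper names `encode`, `approx_scores`, and `top_k`. It is not the authors' implementation and omits per-layer/per-head handling, overlapping, and caching.

```python
def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def encode(key, codebooks):
    """Replace each sub-vector of `key` with the index of its nearest centroid."""
    codes = []
    for sub, centroids in zip(split(key, len(codebooks)), codebooks):
        dists = [sum((a - c) ** 2 for a, c in zip(sub, cen)) for cen in centroids]
        codes.append(dists.index(min(dists)))
    return codes


def approx_scores(query, all_codes, codebooks):
    """Approximate query-key inner products using one lookup table per subspace."""
    tables = [
        [dot(sub, cen) for cen in centroids]
        for sub, centroids in zip(split(query, len(codebooks)), codebooks)
    ]
    return [sum(tables[j][c] for j, c in enumerate(codes)) for codes in all_codes]


def top_k(scores, k):
    """Indices of the k highest-scoring keys, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical toy setup: 4-dimensional keys, 2 subspaces, 2 centroids each.
# The codebooks are hand-picked here rather than learned.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, -1.0]],
]
keys = [
    [0.9, 0.1, 1.1, 0.9],
    [0.1, 0.8, -1.0, -0.9],
    [1.0, 0.0, -0.9, -1.1],
    [0.0, 1.1, 1.0, 1.0],
]

codes = [encode(k, codebooks) for k in keys]      # compact codes kept for search
query = [1.0, 0.0, 1.0, 1.0]                      # query of a newly generated token
scores = approx_scores(query, codes, codebooks)   # approximate inner products
selected = top_k(scores, k=2)                     # tokens whose key-value pairs to fetch
```

In this toy case the approximate ranking agrees with the exact inner products, but in general PQ scores are approximations and the selected set can differ from the exact top-k.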

Title: AutoFlow: Automated Workflow Generation for Large Language Model Agents

Authors: Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, Yongfeng Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12821
Pdf URL: https://arxiv.org/pdf/2407.12821
Copy Paste: [[2407.12821]] AutoFlow: Automated Workflow Generation for Large Language Model Agents(https://arxiv.org/abs/2407.12821)
Keywords: language model, llm, agent
Abstract: Recent advancements in Large Language Models (LLMs) have shown significant progress in understanding complex natural language. One important application of LLMs is the LLM-based AI agent, which leverages the abilities of LLMs as well as external tools to solve complex tasks. To make sure LLM agents follow an effective and reliable procedure to solve the given task, manually designed workflows are usually used to guide the working mechanism of agents. However, manually designing the workflows requires considerable effort and domain knowledge, making it difficult to develop and deploy agents at massive scale. To address these issues, we propose AutoFlow, a framework designed to automatically generate workflows for agents to solve complex tasks. AutoFlow takes natural language programs as the format of agent workflows and employs a workflow optimization procedure to iteratively optimize the workflow quality. Besides, this work offers two workflow generation methods: fine-tuning-based and in-context-based methods, making the AutoFlow framework applicable to both open-source and closed-source LLMs. Experimental results show that our framework can produce robust and reliable agent workflows. We believe that the automatic generation and interpretation of workflows in natural language represent a promising paradigm for solving complex tasks, particularly with the rapid development of LLMs. The source code of this work is available at this https URL.
摘要：大型语言模型 (LLM) 的最新进展表明，在理解复杂自然语言方面取得了重大进展。LLM 的一个重要应用是基于 LLM 的 AI 代理，它利用 LLM 以及外部工具的能力来解决复杂任务。为了确保 LLM 代理遵循有效可靠的程序来解决给定的任务，通常使用手动设计的工作流来指导代理的工作机制。然而，手动设计工作流需要大量的努力和领域知识，这使得大规模开发和部署代理变得困难。为了解决这些问题，我们提出了 AutoFlow，这是一个旨在自动生成代理工作流以解决复杂任务的框架。AutoFlow 以自然语言程序作为代理工作流的格式，并采用工作流优化程序来迭代优化工作流质量。此外，这项工作提供了两种工作流生成方法：基于微调和基于上下文的方法，使 AutoFlow 框架适用于开源和闭源 LLM。实验结果表明，我们的框架可以生成稳健可靠的代理工作流。我们相信，用自然语言自动生成和解释工作流程代表了解决复杂任务的一个有前途的范例，尤其是在 LLM 迅速发展的情况下。这项工作的源代码可在此 https URL 上找到。

Title: Lightweight Large Language Model for Medication Enquiry: Med-Pal

Authors: Kabilan Elangovan, Jasmine Chiat Ling Ong, Liyuan Jin, Benjamin Jun Jie Seng, Yu Heng Kwan, Lit Soo Tan, Ryan Jian Zhong, Justina Koi Li Ma, YuHe Ke, Nan Liu, Kathleen M Giacomini, Daniel Shu Wei Ting
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12822
Pdf URL: https://arxiv.org/pdf/2407.12822
Copy Paste: [[2407.12822]] Lightweight Large Language Model for Medication Enquiry: Med-Pal(https://arxiv.org/abs/2407.12822)
Keywords: language model, llm, prompt, chat
Abstract: Large Language Models (LLMs) have emerged as a potential solution to assist digital health development with patient education, commonly medication-related enquiries. We trained and validated Med-Pal, a medication domain-specific LLM-chatbot fine-tuned with a fine-grained and expert curated dataset from a selection of five light-weighted open-source LLMs of smaller parameter size (7 billion or less) regarding computational constraints and prioritizing operational efficiency. A multi-disciplinary team performed a clinical evaluation of LLM responses using the SCORE criteria, focusing on safety, accuracy, bias, reproducibility, and ease of understanding. The best-performing light-weighted LLM was chosen as Med-Pal for further engineering with guard-railing using adversarial prompting. Med-Pal and existing light-weighted LLMs, including pretrained Biomistral and finetuned Meerkat, were validated on an independent dataset on a broad range of medication-related questions (231 in total), 12 different question types across 14 different medication classes. Mistral-7b emerged as the top performer among selected lightweight LLMs, achieving the highest median score of 14 and 71.9% high-quality responses in accuracy and safety domains, and was hence chosen as the backbone LLM for Med-Pal. When compared against Biomistral, Med-Pal outperformed in generating responses appropriate for patient communication, with significant reductions in bias and errors typical of general LLMs. Comparable performance was observed when comparing Med-Pal with Meerkat. Med-Pal showcases the feasibility of developing and employing fine-tuned light-weighted LLMs to enhance digital health communications.
摘要：大型语言模型 (LLM) 已成为一种潜在的解决方案，可通过患者教育（通常是与药物相关的咨询）来协助数字健康发展。我们训练并验证了 Med-Pal，这是一个特定于药物领域的 LLM 聊天机器人，使用细粒度和专家策划的数据集进行了微调，该数据集来自五个轻量级开源 LLM，这些 LLM 的参数规模较小（70 亿或更少），考虑到计算约束并优先考虑操作效率。一个多学科团队使用 SCORE 标准对 LLM 的反应进行了临床评估，重点关注安全性、准确性、偏差、可重复性和易于理解性。表现最佳的轻量级 LLM 被选为 Med-Pal，以便使用对抗性提示进行进一步的工程设计。Med-Pal 和现有的轻量级 LLM（包括预训练的 Biomistral 和微调的 Meerkat）在独立数据集上针对广泛的药物相关问题（共 231 个）进行了验证，12 种不同类型的问题涵盖 14 种不同的药物类别。 Mistral-7b 在选定的轻量级 LLM 中表现最佳，在准确性和安全性领域取得了最高的中位数分数 14 和 71.9% 的高质量响应，因此被选为 Med-Pal 的主干 LLM。与 Biomistral 相比，Med-pal 在生成适合患者沟通的响应方面表现出色，显著减少了一般 LLM 常见的偏差和错误。将 Med-Pal 与 Meerkat 进行比较时观察到了可比的性能。Med-Pal 展示了开发和使用微调的轻量级 LLM 来增强数字健康通信的可行性。

Title: WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models

Authors: Kangyun Ning, Yisong Su, Xueqiang Lv, Yuanzhe Zhang, Jian Liu, Kang Liu, Jinan Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12823
Pdf URL: https://arxiv.org/pdf/2407.12823
Copy Paste: [[2407.12823]] WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models(https://arxiv.org/abs/2407.12823)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Although Large Language Models (LLMs) excel in NLP tasks, they still need external tools to extend their ability. Current research on tool learning with LLMs often assumes mandatory tool use, which does not always align with real-world situations, where the necessity for tools is uncertain, and incorrect or unnecessary use of tools can damage the general abilities of LLMs. Therefore, we propose to explore whether LLMs can discern their ability boundaries and use tools flexibly. We then introduce the Whether-or-not tool usage Evaluation benchmark (WTU-Eval) to assess LLMs with eleven datasets, where six of them are tool-usage datasets, and five are general datasets. LLMs are prompted to use tools according to their needs. The results of eight LLMs on WTU-Eval reveal that LLMs frequently struggle to determine tool use in general datasets, and LLMs' performance in tool-usage datasets improves when their ability is similar to ChatGPT. In both datasets, incorrect tool usage significantly impairs LLMs' performance. To mitigate this, we also develop the finetuning dataset to enhance tool decision-making. Fine-tuning Llama2-7B results in a 14% average performance improvement and a 16.8% decrease in incorrect tool usage. We will release the WTU-Eval benchmark.
摘要：尽管大型语言模型 (LLM) 在 NLP 任务中表现出色，但它们仍然需要外部工具来扩展其能力。当前关于使用 LLM 进行工具学习的研究通常假设强制使用工具，但这并不总是与现实世界的情况相符，在现实世界中，工具的必要性是不确定的，不正确或不必要的工具使用会损害 LLM 的一般能力。因此，我们建议探索 LLM 是否能够辨别其能力边界并灵活使用工具。然后，我们引入了是否使用工具评估基准 (WTU-Eval) 来评估具有 11 个数据集的 LLM，其中 6 个是工具使用数据集，5 个是一般数据集。提示 LLM 根据需要使用工具。8 个 LLM 在 WTU-Eval 上的结果表明，LLM 经常难以确定一般数据集中的工具使用情况，并且当 LLM 的能力与 ChatGPT 相似时，其在工具使用数据集中的表现会提高。在这两个数据集中，不正确的工具使用都会严重损害 LLM 的性能。为了缓解这种情况，我们还开发了微调数据集来增强工具决策能力。微调 Llama2-7B 可使平均性能提高 14%，错误工具使用率降低 16.8%。我们将发布 WTU-Eval 基准。

Title: Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models

Authors: Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, Pau Rodríguez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12824
Pdf URL: https://arxiv.org/pdf/2407.12824
Copy Paste: [[2407.12824]] Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models(https://arxiv.org/abs/2407.12824)
Keywords: language model, llm, prompt
Abstract: An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be determined by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose AUROC adaptation (AurA), an intervention that can be applied to any pre-trained LLM to mitigate toxicity. As the intervention is proportional to the ability of each neuron to discriminate toxic content, it is free of any model-dependent hyperparameters. We show that AurA can achieve up to 2.2× reduction in toxicity with only a 0.72 perplexity increase. We also show that AurA is effective with models of different scale (from 1.5B to 40B parameters), and its effectiveness in mitigating toxic language, while preserving common-sense zero-shot abilities, holds across all scales. AurA can be combined with pre-prompting strategies, boosting its average mitigation potential from 1.28× to 2.35×. Moreover, AurA can counteract adversarial pre-prompts that maliciously elicit toxic content, making it an effective method for deploying safer and less toxic models.
摘要：大型语言模型 (LLM) 的一个重要问题是它们会产生不良的毒性语言。在这项工作中，我们表明，负责毒性的神经元可以通过它们区分有毒句子的能力来确定，并且可以通过按比例降低它们的激活水平来减轻毒性语言。我们提出了 AUROC 适应 (AurA)，这是一种可以应用于任何预训练的 LLM 以减轻毒性的干预措施。由于干预与每个神经元区分有毒内容的能力成正比，因此它不受任何依赖于模型的超参数的影响。我们表明，AurA 最多可以将毒性降低 $2.2 \times$，而困惑度仅增加 $0.72$。我们还表明，AurA 对不同规模的模型（从 1.5B 到 40B 个参数）都有效，并且它在减轻毒性语言方面的有效性，同时保留了常识性的零样本能力，适用于所有规模。 AurA 可以与预提示策略相结合，将其平均缓解潜力从 1.28 次提高到 2.35 次。此外，AurA 可以抵消恶意引发有毒内容的对抗性预提示，使其成为部署更安全、毒性更小的模型的有效方法。
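The AurA abstract describes scaling down a neuron's activation in proportion to how well that neuron discriminates toxic from non-toxic text, measured by AUROC. The sketch below illustrates that idea for a single neuron. The activation values are made up, and the linear mapping from AUROC to a dampening factor (1.0 at chance level, 0.0 for a perfect discriminator) is an assumption chosen for illustration; the abstract does not state the exact formula.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def dampening_factor(auc):
    """Assumed mapping: no change at or below chance (0.5), full suppression at 1.0."""
    return 1.0 - 2.0 * max(auc - 0.5, 0.0)


# Hypothetical activations of one neuron on toxic and non-toxic sentences.
toxic_acts = [0.9, 0.8, 0.7]
benign_acts = [0.1, 0.2, 0.75]

neuron_auroc = auroc(toxic_acts, benign_acts)   # 8 of 9 pairs ranked correctly
factor = dampening_factor(neuron_auroc)

# At inference time the neuron's activation would be multiplied by this factor.
dampened = [a * factor for a in toxic_acts]
```

Because the factor is derived from each neuron's own AUROC, this kind of rule needs no separately tuned strength parameter, which matches the abstract's claim of being free of model-dependent hyperparameters.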

Title: Assessing the Effectiveness of GPT-4o in Climate Change Evidence Synthesis and Systematic Assessments: Preliminary Insights

Authors: Elphin Tom Joe, Sai Dileep Koneru, Christine J Kirchhoff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12826
Pdf URL: https://arxiv.org/pdf/2407.12826
Copy Paste: [[2407.12826]] Assessing the Effectiveness of GPT-4o in Climate Change Evidence Synthesis and Systematic Assessments: Preliminary Insights(https://arxiv.org/abs/2407.12826)
Keywords: language model, gpt, llm
Abstract: In this research short, we examine the potential of using GPT-4o, a state-of-the-art large language model (LLM), to undertake evidence synthesis and systematic assessment tasks. Traditional workflows for such tasks involve large groups of domain experts who manually review and synthesize vast amounts of literature. The exponential growth of scientific literature and recent advances in LLMs provide an opportunity to complement these traditional workflows with new age tools. We assess the efficacy of GPT-4o to do these tasks on a sample from the dataset created by the Global Adaptation Mapping Initiative (GAMI) where we check the accuracy of climate change adaptation related feature extraction from the scientific literature across three levels of expertise. Our results indicate that while GPT-4o can achieve high accuracy in low-expertise tasks like geographic location identification, its performance in intermediate and high-expertise tasks, such as stakeholder identification and assessment of depth of the adaptation response, is less reliable. The findings motivate the need for designing assessment workflows that utilize the strengths of models like GPT-4o while also providing refinements to improve their performance on these tasks.
摘要：在这篇简短的研究中，我们研究了使用 GPT-4o（一种最先进的大型语言模型 (LLM)）进行证据综合和系统评估任务的潜力。此类任务的传统工作流程涉及大量领域专家，他们手动审查和综合大量文献。科学文献的迅猛增长和 LLM 的最新进展为用新时代工具补充这些传统工作流程提供了机会。我们评估了 GPT-4o 在全球适应制图倡议 (GAMI) 创建的数据集样本上执行这些任务的有效性，我们检查了从三个专业水平的科学文献中提取气候变化适应相关特征的准确性。我们的结果表明，虽然 GPT-4o 可以在地理位置识别等低专业任务中实现高精度，但它们在中级和高专业任务（例如利益相关者识别和适应响应深度评估）中的表现不太可靠。这些发现促使我们需要设计评估工作流程，利用 GPT-4o 等模型的优势，同时提供改进以提高其在这些任务上的表现。

Title: Why Does New Knowledge Create Messy Ripple Effects in LLMs?

Authors: Jiaxin Qin, Zixuan Zhang, Chi Han, Manling Li, Pengfei Yu, Heng Ji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12828
Pdf URL: https://arxiv.org/pdf/2407.12828
Copy Paste: [[2407.12828]] Why Does New Knowledge Create Messy Ripple Effects in LLMs?(https://arxiv.org/abs/2407.12828)
Keywords: language model, llm
Abstract: Extensive previous research has focused on post-training knowledge editing (KE) for language models (LMs) to ensure that knowledge remains accurate and up-to-date. One desired property and open question in KE is to let edited LMs correctly handle ripple effects, where the LM is expected to accurately answer questions about logically related knowledge. In this paper, we answer the question of why most KE methods still create messy ripple effects. We conduct extensive analysis and identify a salient indicator, GradSim, that effectively reveals when and why updated knowledge ripples in LMs. GradSim is computed as the cosine similarity between the gradients of the original fact and its related knowledge. We observe a strong positive correlation between ripple effect performance and GradSim across different LMs, KE methods, and evaluation metrics. Further investigations into three counter-intuitive failure cases (Negation, Over-Ripple, Multi-Lingual) of ripple effects demonstrate that these failures are often associated with very low GradSim. This finding validates that GradSim is an effective indicator of when knowledge ripples in LMs.
摘要：先前的大量研究集中于语言模型 (LM) 的训练后知识编辑 (KE)，以确保知识保持准确和最新。KE 中的一个期望属性和未解决的问题是让编辑后的 LM 正确处理涟漪效应，其中 LM 有望准确回答其逻辑相关知识。在本文中，我们回答了为什么大多数 KE 方法仍然会产生混乱的涟漪效应的问题。我们进行了广泛的分析并确定了一个显着的指标 GradSim，它有效地揭示了更新的知识何时以及为何在 LM 中产生涟漪。GradSim 是通过原始事实与其相关知识的梯度之间的余弦相似度计算的。我们观察到，在不同的 LM、KE 方法和评估指标中，涟漪效应性能与 GradSim 之间存在很强的正相关性。对涟漪效应的三种反直觉失败案例（否定、过度涟漪、多语言）的进一步研究表明，这些失败通常与非常低的 GradSim 有关。这一发现证实了 GradSim 是 LM 中知识波动的有效指标。
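GradSim, as defined in the abstract above, is the cosine similarity between the gradient of the original fact and the gradient of a related piece of knowledge. A minimal sketch of that computation follows. The gradient vectors are hypothetical flattened values invented for illustration; in practice they would come from backpropagating a loss on each fact through the language model, a step this sketch does not perform.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical flattened gradients (not taken from any real model).
grad_original = [0.5, -1.0, 2.0]    # gradient for the edited fact
grad_related = [1.0, -2.0, 4.0]     # points in the same direction as the original
grad_unrelated = [2.0, 1.0, 0.0]    # orthogonal to the original

grad_sim_related = cosine_similarity(grad_original, grad_related)
grad_sim_unrelated = cosine_similarity(grad_original, grad_unrelated)
```

Under the abstract's finding, a pair like the first (similarity near 1) would be expected to ripple well after an edit, while a pair like the second (similarity near 0) would be a likely failure case.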

Title: Knowledge-based Consistency Testing of Large Language Models

Authors: Sai Sathiesh Rajan, Ezekiel Soremekun, Sudipta Chattopadhyay
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12830
Pdf URL: https://arxiv.org/pdf/2407.12830
Copy Paste: [[2407.12830]] Knowledge-based Consistency Testing of Large Language Models(https://arxiv.org/abs/2407.12830)
Keywords: language model, gpt, llm
Abstract: In this work, we systematically expose and measure the inconsistency and knowledge gaps of Large Language Models (LLMs). Specifically, we propose an automated testing framework (called KONTEST) which leverages a knowledge graph to construct test cases. KONTEST probes and measures the inconsistencies in the LLM's knowledge of the world via a combination of semantically-equivalent queries and test oracles (metamorphic or ontological oracle). KONTEST further mitigates knowledge gaps via a weighted LLM model ensemble. Using four state-of-the-art LLMs (Falcon, Gemini, GPT3.5, and Llama2), we show that KONTEST generates 19.2% error-inducing inputs (1917 errors from 9983 test inputs). It also reveals a 16.5% knowledge gap across all tested LLMs. KONTEST's mitigation method reduces the LLM knowledge gap by 32.48%. Our ablation study further shows that GPT3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
摘要：在这项工作中，我们系统地揭示和衡量大型语言模型 (LLM) 的不一致性和知识差距。具体来说，我们提出了一个自动化测试框架 (称为 KONTEST)，它利用知识图谱来构建测试用例。KONTEST 通过语义等效查询和测试预言机 (变质或本体预言机) 的组合来探测和衡量 LLM 对世界知识的不一致性。KONTEST 通过加权 LLM 模型集成进一步缓解知识差距。使用四个最先进的 LLM (Falcon、Gemini、GPT3.5 和 Llama2)，我们表明 KONTEST 产生了 19.2% 的错误诱导输入（9983 个测试输入中产生了 1917 个错误）。它还揭示了所有测试的 LLM 之间存在 16.5% 的知识差距。KONTEST 的缓解方法将 LLM 知识差距减少了 32.48%。我们的消融研究进一步表明，GPT3.5 不适合基于知识的一致性测试，因为它在知识构建方面的有效性仅为 60%-68%。
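The 19.2% figure in the KONTEST abstract follows directly from the two counts it reports, which a quick check confirms. Only the numbers stated in the abstract are used; the variable names are illustrative.

```python
# Counts reported in the abstract.
errors = 1917
test_inputs = 9983

# Share of test inputs that induced an error, as a percentage to one decimal place.
error_rate_percent = round(100 * errors / test_inputs, 1)
```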

Title: Truth is Universal: Robust Detection of Lies in LLMs

Authors: Lennart Bürger, Fred A. Hamprecht, Boaz Nadler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12831
Pdf URL: https://arxiv.org/pdf/2407.12831
Copy Paste: [[2407.12831]] Truth is Universal: Robust Detection of Lies in LLMs(https://arxiv.org/abs/2407.12831)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities. In particular, LLMs are capable of "lying", knowingly outputting false statements. Hence, it is of interest and importance to develop methods to detect when LLMs lie. Indeed, several authors trained classifiers to detect LLM lies based on their internal model activations. However, other researchers showed that these classifiers may fail to generalise, for example to negated statements. In this work, we aim to develop a robust method to detect when an LLM is lying. To this end, we make the following key contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B and LLaMA3-8B. Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection; (ii) Building upon (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, distinguishing simple true and false statements with 94% accuracy and detecting more complex real-world lies with 95% accuracy.

Title: ESQA: Event Sequences Question Answering

Authors: Irina Abdullaeva, Andrei Filatov, Mikhail Orlov, Ivan Karpukhin, Viacheslav Vasilev, Denis Dimitrov, Andrey Kuznetsov, Ivan Kireev, Andrey Savchenko
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12833
Pdf URL: https://arxiv.org/pdf/2407.12833
Copy Paste: [[2407.12833]] ESQA: Event Sequences Question Answering(https://arxiv.org/abs/2407.12833)
Keywords: language model, llm
Abstract: Event sequences (ESs) arise in many practical domains including finance, retail, social networks, and healthcare. In the context of machine learning, event sequences can be seen as a special type of tabular data with annotated timestamps. Despite the importance of ESs modeling and analysis, little effort was made in adapting large language models (LLMs) to the ESs domain. In this paper, we highlight the common difficulties of ESs processing and propose a novel solution capable of solving multiple downstream tasks with little or no finetuning. In particular, we solve the problem of working with long sequences and improve time and numeric features processing. The resulting method, called ESQA, effectively utilizes the power of LLMs and, according to extensive experiments, achieves state-of-the-art results in the ESs domain.

Title: Regurgitative Training: The Value of Real Data in Training Large Language Models

Authors: Jinghui Zhang, Dandan Qiao, Mochen Yang, Qiang Wei
Subjects: cs.CL, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2407.12835
Pdf URL: https://arxiv.org/pdf/2407.12835
Copy Paste: [[2407.12835]] Regurgitative Training: The Value of Real Data in Training Large Language Models(https://arxiv.org/abs/2407.12835)
Keywords: language model, gpt, llm
Abstract: What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs means that a substantial amount of content online will be generated by LLMs rather than humans, which will inevitably enter the training datasets of next-generation LLMs. We evaluate the implications of such "regurgitative training" on LLM performance. Through fine-tuning GPT-3.5 with data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training clearly handicaps the performance of LLMs. The same performance loss of regurgitative training is observed on transformer models that we train from scratch. We find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data as compared to real data. Based on these mechanisms, we propose and evaluate three different strategies to mitigate the performance loss of regurgitative training. First, we devise data-driven metrics to gauge the quality of each LLM-generated data instance, and then carry out an ordered training process where high-quality data are added before low-quality ones. Second, we combine data generated by multiple different LLMs (as an attempt to increase lexical diversity). Third, we train an AI detection classifier to differentiate between LLM- and human-generated data, and include LLM-generated data in the order of resemblance to human-generated data. All three strategies can improve the performance of regurgitative training to some extent but are not always able to fully close the gap from training with real data. Our results highlight the value of real, human-generated data in training LLMs, which cannot be easily substituted by synthetic, LLM-generated data.
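
Note: the first mitigation strategy in the abstract above (an ordered training process in which high-quality LLM-generated instances are added before low-quality ones) can be pictured with the minimal sketch below. The abstract does not say how quality is scored or how stages are formed, so `quality_score`, `num_stages`, and the equal-size split are all illustrative assumptions, not the authors' procedure.

```python
def ordered_training_stages(instances, quality_score, num_stages):
    """Split instances into training stages, highest-scoring first.

    quality_score is a hypothetical callable returning a number per
    instance (higher means better); the paper's data-driven metrics
    are not described in the abstract.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ranked = sorted(instances, key=quality_score, reverse=True)
    if not ranked:
        return []
    stage_size = -(-len(ranked) // num_stages)  # ceiling division
    return [ranked[i:i + stage_size] for i in range(0, len(ranked), stage_size)]


# Toy data: (text, made-up quality score) pairs.
toy = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
stages = ordered_training_stages(toy, quality_score=lambda item: item[1], num_stages=2)
```

With the toy data, the first stage holds the two highest-scoring instances and the second stage holds the rest; a training loop would consume the stages in that order.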

Title: OSPC: Artificial VLM Features for Hateful Meme Detection

Authors: Peter Grönquist
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12836
Pdf URL: https://arxiv.org/pdf/2407.12836
Copy Paste: [[2407.12836]] OSPC: Artificial VLM Features for Hateful Meme Detection(https://arxiv.org/abs/2407.12836)
Keywords: language model, gpt
Abstract: The digital revolution and the advent of the world wide web have transformed human communication, notably through the emergence of memes. While memes are a popular and straightforward form of expression, they can also be used to spread misinformation and hate due to their anonymity and ease of use. In response to these challenges, this paper introduces a solution developed by team 'Baseline' for the AI Singapore Online Safety Prize Challenge. Focusing on computational efficiency and feature engineering, the solution achieved an AUROC of 0.76 and an accuracy of 0.69 on the test dataset. As key features, the solution leverages the inherent probabilistic capabilities of large Vision-Language Models (VLMs) to generate task-adapted feature encodings from text, and applies a distilled quantization tailored to the specific cultural nuances present in Singapore. This type of processing and fine-tuning can be adapted to various visual and textual understanding and classification tasks, and even applied on private VLMs such as OpenAI's GPT. Finally, it can eliminate the need for extensive model training on large GPUs for resource-constrained applications, also offering a solution when little or no data is available.

Title: Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Authors: Laura Manrique-Gómez, Tony Montes, Rubén Manrique
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2407.12838
Pdf URL: https://arxiv.org/pdf/2407.12838
Copy Paste: [[2407.12838]] Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction(https://arxiv.org/abs/2407.12838)
Keywords: language model, llm
Abstract: This paper presents two significant contributions: first, a novel dataset of 19th-century Latin American press texts, which addresses the lack of specialized corpora for historical and linguistic analysis in this region. Second, it introduces a framework for OCR error correction and linguistic surface form detection in digitized corpora, utilizing a Large Language Model. This framework is adaptable to various contexts and, in this paper, is specifically applied to the newly created dataset.

Title: What to do if language models disagree? Black-box model ensembling for textual and visual question answering

Authors: Yuxi Xia, Klim Zaporojets, Benjamin Roth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12841
Pdf URL: https://arxiv.org/pdf/2407.12841
Copy Paste: [[2407.12841]] What to do if language models disagree? Black-box model ensembling for textual and visual question answering(https://arxiv.org/abs/2407.12841)
Keywords: language model, gpt, llm, chat
Abstract: A diverse range of large language models (LLMs), e.g., ChatGPT, and visual question answering (VQA) models, e.g., BLIP, have been developed for solving textual and visual question answering tasks. However, both LLMs and VQA models encounter challenges when applied to task-specific datasets. Fine-tuning these models is either difficult, as it requires access via APIs, rendering them as black-boxes, or costly due to the need of tuning a large number of parameters. To address this, we introduce InfoSel, a data-efficient and lightweight ensemble method that learns to dynamically pick the winner from existing black-box models for predictions on both textual and multimodal visual question answering tasks. Unlike traditional ensemble models, InfoSel does not rely on prediction probabilities or confidences, which typically are not available in black-box models. Experimental results on four datasets demonstrate that our approach achieves an absolute increase of up to +5.27% in the F1-score compared to standalone LLMs. Remarkably, this improvement is achieved by utilizing only 1K training instances and 110M model parameters for training task-specific ensemble models.

Title: NutriBench: A Dataset for Evaluating Large Language Models in Carbohydrate Estimation from Meal Descriptions

Authors: Andong Hua, Mehak Preet Dhaliwal, Ryan Burke, Yao Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12843
Pdf URL: https://arxiv.org/pdf/2407.12843
Copy Paste: [[2407.12843]] NutriBench: A Dataset for Evaluating Large Language Models in Carbohydrate Estimation from Meal Descriptions(https://arxiv.org/abs/2407.12843)
Keywords: language model, gpt, llm, retrieval-augmented generation, chain-of-thought
Abstract: Accurate nutrition estimation helps people make informed decisions about their dietary choices and is crucial for preventing serious health issues. We present NutriBench, the first publicly available natural language meal description based nutrition benchmark. NutriBench consists of 5,000 human-verified meal descriptions with macro-nutrient labels, including carbohydrates, proteins, fats, and calories. The data is divided into 15 subsets varying in complexity based on the number, servings, and popularity of the food items in the meal and the specificity of serving size descriptions. We conducted an extensive evaluation of seven popular and state-of-the-art Large Language Models (LLMs), including GPT-3.5, Llama-3, and a medical domain-specific model with standard, Chain-of-Thought and Retrieval-Augmented Generation strategies on our benchmark for carbohydrate estimation. We also conducted a human study involving expert and non-expert participants and found that LLMs can provide more accurate and faster predictions over a range of complex queries. We present a thorough analysis and comparison of different LLMs, highlighting the opportunities and challenges of using LLMs for nutrition estimation in real-life scenarios. Our benchmark is publicly available at: this https URL

Title: $\texttt{metabench}$ -- A Sparse Benchmark to Measure General Ability in Large Language Models

Authors: Alex Kipnis, Konstantinos Voudouris, Luca M. Schulze Buschoff, Eric Schulz
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2407.12844
Pdf URL: https://arxiv.org/pdf/2407.12844
Copy Paste: [[2407.12844]] $\texttt{metabench}$ -- A Sparse Benchmark to Measure General Ability in Large Language Models(https://arxiv.org/abs/2407.12844)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) vary in their abilities on a range of tasks. Initiatives such as the $\texttt{Open LLM Leaderboard}$ aim to quantify these differences with several large benchmarks (sets of test items to which an LLM can respond either correctly or incorrectly). However, high correlations within and between benchmark scores suggest that (1) there exists a small set of common underlying abilities that these benchmarks measure, and (2) items tap into redundant information and the benchmarks may thus be considerably compressed. We use data from $n > 5000$ LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with $d=28,632$ items in total). From them we distill a sparse benchmark, $\texttt{metabench}$, that has less than $3\%$ of the original size of all six benchmarks combined. This new sparse benchmark goes beyond point scores by yielding estimators of the underlying benchmark-specific abilities. We show that these estimators (1) can be used to reconstruct each original $\textit{individual}$ benchmark score with, on average, $1.5\%$ root mean square error (RMSE), (2) reconstruct the original $\textit{total}$ score with $0.8\%$ RMSE, and (3) have a single underlying common factor whose Spearman correlation with the total score is $r = 0.93$.
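
Note: the two quantities the abstract above reports, root mean square error (RMSE) of reconstructed scores and Spearman correlation with the total score, can be computed as in the minimal sketch below. It uses the standard textbook definitions; the function names and the toy scores are illustrative assumptions and are not taken from the paper or its code.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length score lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy values only: reconstructed vs. original benchmark scores (percent).
reconstructed = [61.0, 72.5, 48.0, 80.0]
original = [60.0, 74.0, 47.0, 81.5]
error = rmse(reconstructed, original)
rho = spearman(reconstructed, original)
```

Here `error` is in the same units as the scores (percentage points), and `rho` is 1.0 because the toy reconstruction preserves the ordering of the four models.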

Title: Identifying the Source of Generation for Large Language Models

Authors: Bumjin Park, Jaesik Choi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12846
Pdf URL: https://arxiv.org/pdf/2407.12846
Copy Paste: [[2407.12846]] Identifying the Source of Generation for Large Language Models(https://arxiv.org/abs/2407.12846)
Keywords: language model, llm
Abstract: Large language models (LLMs) memorize text from several sources of documents. In pretraining, LLM trains to maximize the likelihood of text but neither receives the source of the text nor memorizes the source. Accordingly, LLM can not provide document information on the generated content, and users do not obtain any hint of reliability, which is crucial for factuality or privacy infringement. This work introduces token-level source identification in the decoding step, which maps the token representation to the reference document. We propose a bi-gram source identifier, a multi-layer perceptron with two successive token representations as input for better generalization. We conduct extensive experiments on Wikipedia and PG19 datasets with several LLMs, layer locations, and identifier sizes. The overall results show a possibility of token-level source identifiers for tracing the document, a crucial problem for the safe use of LLMs.

Title: Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments

Authors: Roland Daynauth, Jason Mars
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2407.12847
Pdf URL: https://arxiv.org/pdf/2407.12847
Copy Paste: [[2407.12847]] Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments(https://arxiv.org/abs/2407.12847)
Keywords: language model, gpt, llm
Abstract: The SLAM paper demonstrated that on-device Small Language Models (SLMs) are a viable and cost-effective alternative to API-based Large Language Models (LLMs), such as OpenAI's GPT-4, offering comparable performance and stability. However, SLAM also identified discrepancies between human preferences and traditional auto-evaluators. This follow-up paper explores methods to align LLM evaluator preferences with human evaluations by addressing biases, particularly toward higher token counts. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. Recalibration significantly improves the alignment of the LLM evaluator with human evaluations across multiple use cases. For instance, Spearman's rank correlation score in the Recommendation use case improved from -27.27 to 44.55. These results highlight the importance of accounting for biases in automated evaluations to ensure fair and accurate model assessments. The recalibration process enhances the reliability of automated evaluators, leading to better AI models that align with human values and expectations. This study provides a robust methodology for future research into bias correction and emphasizes the feasibility and benefits of developing human-aligned AI evaluation systems.

Title: Applicability of Large Language Models and Generative Models for Legal Case Judgement Summarization

Authors: Aniket Deroy, Kripabandhu Ghosh, Saptarshi Ghosh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12848
Pdf URL: https://arxiv.org/pdf/2407.12848
Copy Paste: [[2407.12848]] Applicability of Large Language Models and Generative Models for Legal Case Judgement Summarization(https://arxiv.org/abs/2407.12848)
Keywords: language model, llm, hallucination
Abstract: Automatic summarization of legal case judgements, which are known to be long and complex, has traditionally been tried via extractive summarization models. In recent years, generative models including abstractive summarization models and Large language models (LLMs) have gained huge popularity. In this paper, we explore the applicability of such models for legal case judgement summarization. We applied various domain specific abstractive summarization models and general domain LLMs as well as extractive summarization models over two sets of legal case judgements from the United Kingdom (UK) Supreme Court and the Indian (IN) Supreme Court and evaluated the quality of the generated summaries. We also perform experiments on a third dataset of legal documents of a different type, Government reports from the United States (US). Results show that abstractive summarization models and LLMs generally perform better than the extractive methods as per traditional metrics for evaluating summary quality. However, detailed investigation shows the presence of inconsistencies and hallucinations in the outputs of the generative models, and we explore ways to reduce the hallucinations and inconsistencies in the summaries. Overall, the investigation suggests that further improvements are needed to enhance the reliability of abstractive models and LLMs for legal case judgement summarization. At present, a human-in-the-loop technique is more suitable for performing manual checks to identify inconsistencies in the generated summaries.

Title: Limits to Predicting Online Speech Using Large Language Models

Authors: Mina Remeli, Moritz Hardt, Robert C. Williamson
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12850
Pdf URL: https://arxiv.org/pdf/2407.12850
Copy Paste: [[2407.12850]] Limits to Predicting Online Speech Using Large Language Models(https://arxiv.org/abs/2407.12850)
Keywords: language model, prompt
Abstract: We study the predictability of online speech on social media, and whether predictability improves with information outside a user's own posts. Recent work suggests that the predictive information contained in posts written by a user's peers can surpass that of the user's own posts. Motivated by the success of large language models, we empirically test this hypothesis. We define unpredictability as a measure of the model's uncertainty, i.e., its negative log-likelihood on future tokens given context. As the basis of our study, we collect a corpus of 6.25M posts from more than five thousand X (previously Twitter) users and their peers. Across three large language models ranging in size from 1 billion to 70 billion parameters, we find that predicting a user's posts from their peers' posts performs poorly. Moreover, the value of the user's own posts for prediction is consistently higher than that of their peers'. Across the board, we find that the predictability of social media posts remains low, comparable to predicting financial news without context. We extend our investigation with a detailed analysis about the causes of unpredictability and the robustness of our findings. Specifically, we observe that a significant amount of predictive uncertainty comes from hashtags and @-mentions. Moreover, our results replicate if instead of prompting the model with additional context, we finetune on additional context.
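
Note: the abstract above defines unpredictability as the model's negative log-likelihood on future tokens given context. The minimal sketch below shows that quantity for a list of per-token probabilities. Averaging over tokens and converting to bits are assumptions made here for illustration; the abstract does not state the normalization or units, and the probabilities are toy values rather than model outputs.

```python
import math


def mean_negative_log_likelihood(token_probs):
    """Average negative log-likelihood (in nats) over the predicted tokens.

    token_probs[i] is the probability the model assigned to the token that
    actually occurred at position i, given the preceding context.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def nats_to_bits(nats):
    """Convert a quantity measured in nats to bits."""
    return nats / math.log(2)


# Toy example: a confident prediction and an uncertain one.
confident = mean_negative_log_likelihood([0.9, 0.8, 0.95])
uncertain = mean_negative_log_likelihood([0.1, 0.05, 0.2])
```

Lower values mean the text was easier to predict, so `confident` comes out smaller than `uncertain`. Comparing this number with and without extra context (for example, peers' posts) is the kind of comparison the abstract describes.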

Title: Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Authors: Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12854
Pdf URL: https://arxiv.org/pdf/2407.12854
Copy Paste: [[2407.12854]] Scaling Retrieval-Based Language Models with a Trillion-Token Datastore(https://arxiv.org/abs/2407.12854)
Keywords: language model
Abstract: Scaling laws with respect to the amount of training data and the number of parameters allow us to predict the cost-benefit trade-offs of pretraining language models (LMs) in different configurations. In this paper, we consider another dimension of scaling: the amount of data available at inference time. Specifically, we find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation, such that a smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget. We carry out our study by constructing a 1.4 trillion-token datastore named MassiveDS, which is the largest and the most diverse open-sourced datastore for retrieval-based LMs to date, and designing an efficient pipeline for studying datastore scaling in a computationally accessible manner. Finally, we analyze the effect of improving the retriever, datastore quality filtering, and other design choices on our observed scaling trends. Overall, our results show that datastore size should be considered as an integral part of LM efficiency and performance trade-offs. To facilitate future research, we open-source our datastore and code at this https URL.

Title: Large Language Models can impersonate politicians and other public figures

Authors: Steffen Herbold, Alexander Trautsch, Zlata Kikteva, Annette Hautli-Janisz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12855
Pdf URL: https://arxiv.org/pdf/2407.12855
Copy Paste: [[2407.12855]] Large Language Models can impersonate politicians and other public figures(https://arxiv.org/abs/2407.12855)
Keywords: language model, llm
Abstract: Modern AI technology like Large language models (LLMs) has the potential to pollute the public information sphere with made-up content, which poses a significant threat to the cohesion of societies at large. A wide range of research has shown that LLMs are capable of generating text of impressive quality, including persuasive political speech, text with a pre-defined style, and role-specific content. But there is a crucial gap in the literature: We lack large-scale and systematic studies of how capable LLMs are in impersonating political and societal representatives and how the general public judges these impersonations in terms of authenticity, relevance and coherence. We present the results of a study based on a cross-section of British society that shows that LLMs are able to generate responses to debate questions that were part of a broadcast political debate programme in the UK. The impersonated responses are judged to be more authentic and relevant than the original responses given by people who were impersonated. This shows two things: (1) LLMs can be made to contribute meaningfully to the public political debate and (2) there is a dire need to inform the general public of the potential harm this can have on society.

Title: AI AI Bias: Large Language Models Favor Their Own Generated Content

Authors: Walter Laurito, Benjamin Davis, Peli Grietzer, Tomáš Gavenčiak, Ada Böhm, Jan Kulveit
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12856
Pdf URL: https://arxiv.org/pdf/2407.12856
Copy Paste: [[2407.12856]] AI AI Bias: Large Language Models Favor Their Own Generated Content(https://arxiv.org/abs/2407.12856)
Keywords: language model, gpt, llm, agent
Abstract: Are large language models (LLMs) biased towards text generated by LLMs over text authored by humans, leading to possible anti-human bias? Utilizing a classical experimental design inspired by employment discrimination studies, we tested widely-used LLMs, including GPT-3.5 and GPT4, in binary-choice scenarios. These involved LLM-based agents selecting between products and academic papers described either by humans or LLMs under identical conditions. Our results show a consistent tendency for LLM-based AIs to prefer LLM-generated content. This suggests the possibility of AI systems implicitly discriminating against humans, giving AI agents an unfair advantage.

Title: Automated Peer Reviewing in Paper SEA: Standardization, Evaluation, and Analysis

Authors: Jianxiang Yu, Zichen Ding, Jiaqi Tan, Kangyang Luo, Zhenmin Weng, Chenghua Gong, Long Zeng, Renjing Cui, Chengcheng Han, Qiushi Sun, Zhiyong Wu, Yunshi Lan, Xiang Li
Subjects: cs.CL, cs.DL, cs.IR
Abstract URL: https://arxiv.org/abs/2407.12857
Pdf URL: https://arxiv.org/pdf/2407.12857
Copy Paste: [[2407.12857]] Automated Peer Reviewing in Paper SEA: Standardization, Evaluation, and Analysis(https://arxiv.org/abs/2407.12857)
Keywords: language model, gpt, llm
Abstract: In recent years, the rapid increase in scientific papers has overwhelmed traditional review mechanisms, resulting in varying quality of publications. Although existing methods have explored the capabilities of Large Language Models (LLMs) for automated scientific reviewing, their generated contents are often generic or partial. To address the issues above, we introduce an automated paper reviewing framework, SEA. It comprises three modules: Standardization, Evaluation, and Analysis, which are represented by models SEA-S, SEA-E, and SEA-A, respectively. Initially, SEA-S distills data standardization capabilities of GPT-4 for integrating multiple reviews for a paper. Then, SEA-E utilizes standardized data for fine-tuning, enabling it to generate constructive reviews. Finally, SEA-A introduces a new evaluation metric called mismatch score to assess the consistency between paper contents and reviews. Moreover, we design a self-correction strategy to enhance the consistency. Extensive experimental results on datasets collected from eight venues show that SEA can generate valuable insights for authors to improve their papers.

Title: Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)

Authors: Krishnaram Kenthapadi, Mehrnoosh Sameki, Ankur Taly
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12858
Pdf URL: https://arxiv.org/pdf/2407.12858
Copy Paste: [[2407.12858]] Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)(https://arxiv.org/abs/2407.12858)
Keywords: language model, llm, hallucination
Abstract: With the ongoing rapid adoption of Artificial Intelligence (AI)-based systems in high-stakes domains, ensuring the trustworthiness, safety, and observability of these systems has become crucial. It is essential to evaluate and monitor AI systems not only for accuracy and quality-related metrics but also for robustness, bias, security, interpretability, and other responsible AI dimensions. We focus on large language models (LLMs) and other generative AI models, which present additional challenges such as hallucinations, harmful and manipulative content, and copyright infringement. In this survey article accompanying our KDD 2024 tutorial, we highlight a wide range of harms associated with generative AI systems, and survey state of the art approaches (along with open challenges) to address these harms.
摘要：随着基于人工智能 (AI) 的系统在高风险领域的持续快速应用，确保这些系统的可信度、安全性和可观察性变得至关重要。评估和监控 AI 系统不仅要考虑准确性和质量相关指标，还要考虑稳健性、偏差、安全性、可解释性和其他负责任的 AI 维度。我们专注于大型语言模型 (LLM) 和其他生成式 AI 模型，它们带来了额外的挑战，例如幻觉、有害和操纵性内容以及版权侵权。在这篇伴随我们 KDD 2024 教程的调查文章中，我们重点介绍了与生成式 AI 系统相关的各种危害，并调查了解决这些危害的最新方法（以及开放的挑战）。

Title: Automated Question Generation on Tabular Data for Conversational Data Exploration

Authors: Ritwik Chaudhuri, Rajmohan C, Kirushikesh DB, Arvind Agarwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12859
Pdf URL: https://arxiv.org/pdf/2407.12859
Copy Paste: [[2407.12859]] Automated Question Generation on Tabular Data for Conversational Data Exploration(https://arxiv.org/abs/2407.12859)
Keywords: language model
Abstract: Exploratory data analysis (EDA) is an essential step for analyzing a dataset to derive insights. Several EDA techniques have been explored in the literature. Many of them leverage visualizations through various plots. However, these plots are not easy for a non-technical user to interpret, and producing appropriate visualizations is also difficult when there are a large number of columns. A few other works provide a view of some interesting slices of data, but it is still difficult for the user to draw relevant insights from them. Of late, conversational data exploration has been gaining a lot of traction among non-technical users. It helps the user explore the dataset without deep technical knowledge about the data. Towards this, we propose a system that recommends interesting questions in natural language based on relevant slices of a dataset in a conversational setting. Specifically, given a dataset, we pick a select set of interesting columns and identify interesting slices of such columns and column combinations based on a few interestingness measures. We use our own fine-tuned variation of a pre-trained language model (T5) to generate natural language questions in a specific manner. We then slot-fill values in the generated questions and rank them for recommendations. We show the utility of our proposed system in a conversational setting with a collection of real datasets.
摘要：探索性数据分析 (EDA) 是分析数据集以获得见解的重要步骤。文献中已经探索了几种 EDA 技术。其中许多技术通过各种图表利用可视化。但对于非技术用户来说，解释它们并不容易，而且当有大量列时，生成适当的可视化也很困难。很少有其他作品提供一些有趣的数据片段的视图，但用户仍然很难从中得出相关的见解。最近，对话式数据探索在非技术用户中获得了很大的关注。它可以帮助用户探索数据集，而无需对数据有深入的技术知识。为此，我们提出了一个系统，该系统基于对话环境中数据集的相关片段，以自然语言推荐有趣的问题。具体而言，给定一个数据集，我们选择一组有趣的列，并根据一些兴趣度度量来识别这些列和列组合的有趣片段。我们使用我们自己对预训练语言模型 (T5) 的微调变体以特定方式生成自然语言问题。然后，我们在生成的问题中填入值，并对其进行排序以提供建议。我们利用一组真实数据集在对话环境中展示了我们提出的系统的实用性。

Title: STAGE: Simplified Text-Attributed Graph Embeddings Using Pre-trained LLMs

Authors: Aaron Zolnai-Lucas, Jack Boylan, Chris Hokamp, Parsa Ghaffari
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12860
Pdf URL: https://arxiv.org/pdf/2407.12860
Copy Paste: [[2407.12860]] STAGE: Simplified Text-Attributed Graph Embeddings Using Pre-trained LLMs(https://arxiv.org/abs/2407.12860)
Keywords: language model, llm, prompt
Abstract: We present Simplified Text-Attributed Graph Embeddings (STAGE), a straightforward yet effective method for enhancing node features in Graph Neural Network (GNN) models that encode Text-Attributed Graphs (TAGs). Our approach leverages Large Language Models (LLMs) to generate embeddings for textual attributes. STAGE achieves competitive results on various node classification benchmarks while remaining simpler to implement than current state-of-the-art (SoTA) techniques. We show that utilizing pre-trained LLMs as embedding generators provides robust features for ensemble GNN training, enabling pipelines that are simpler than current SoTA approaches, which require multiple expensive training and prompting stages. We also implement diffusion-pattern GNNs in an effort to make this pipeline scalable to graphs beyond academic benchmarks.
摘要：我们提出了简化的文本属性图嵌入 (STAGE)，这是一种简单而有效的方法，用于增强编码文本属性图 (TAG) 的图神经网络 (GNN) 模型中的节点特征。我们的方法利用大型语言模型 (LLM) 来生成文本属性的嵌入。STAGE 在各种节点分类基准上取得了有竞争力的结果，同时相对于当前最先进 (SoTA) 技术而言，还保持了实施的简单性。我们表明，利用预先训练的 LLM 作为嵌入生成器可以为集成 GNN 训练提供强大的功能，从而使管道比当前的 SoTA 方法更简单，因为 SoTA 方法需要多个昂贵的训练和提示阶段。我们还实现了扩散模式 GNN，以使该管道可扩展到学术基准之外的图。

Title: CiteME: Can Language Models Accurately Cite Scientific Claims?

Authors: Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, Matthias Bethge
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2407.12861
Pdf URL: https://arxiv.org/pdf/2407.12861
Copy Paste: [[2407.12861]] CiteME: Can Language Models Accurately Cite Scientific Claims?(https://arxiv.org/abs/2407.12861)
Keywords: language model, gpt, agent
Abstract: Thousands of new scientific papers are published each month. Such information overload complicates researcher efforts to stay current with the state-of-the-art as well as to verify and correctly attribute claims. We pose the following research question: Given a text excerpt referencing a paper, could an LM act as a research assistant to correctly identify the referenced paper? We advance efforts to answer this question by building a benchmark that evaluates the abilities of LMs in citation attribution. Our benchmark, CiteME, consists of text excerpts from recent machine learning papers, each referencing a single other paper. CiteME use reveals a large gap between frontier LMs and human performance, with LMs achieving only 4.2-18.5% accuracy and humans 69.7%. We close this gap by introducing CiteAgent, an autonomous system built on the GPT-4o LM that can also search and read papers, which achieves an accuracy of 35.3% on CiteME. Overall, CiteME serves as a challenging testbed for open-ended claim attribution, driving the research community towards a future where any claim made by an LM can be automatically verified and discarded if found to be incorrect.
摘要：每月都会发表数千篇新的科学论文。这种信息过载使研究人员难以跟上最新技术，也难以验证和正确归因主张。我们提出以下研究问题：给定引用论文的文本摘录，LM 能否充当研究助理来正确识别引用的论文？我们通过建立一个基准来评估 LM 在引文归因方面的能力，从而进一步努力回答这个问题。我们的基准 CiteME 包括最近机器学习论文的文本摘录，每个摘录都引用了一篇其他论文。CiteME 的使用揭示了前沿 LM 与人类表现之间的巨大差距，LM 的准确率仅为 4.2-18.5%，而人类的准确率达到 69.7%。我们通过引入 CiteAgent 来缩小这一差距，CiteAgent 是一个基于 GPT-4o LM 构建的自主系统，它还可以搜索和阅读论文，在 CiteME 上的准确率达到 35.3\%。总体而言，CiteME 作为开放式声明归因的具有挑战性的试验台，推动研究界走向一个未来：LM 提出的任何声明都可以被自动验证，如果发现不正确，则被丢弃。

Title: Analyzing Large language models chatbots: An experimental approach using a probability test

Authors: Melise Peruchini, Julio Monteiro Teixeira
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12862
Pdf URL: https://arxiv.org/pdf/2407.12862
Copy Paste: [[2407.12862]] Analyzing Large language models chatbots: An experimental approach using a probability test(https://arxiv.org/abs/2407.12862)
Keywords: language model, gpt, llm, prompt, chat
Abstract: This study consists of qualitative empirical research, conducted through exploratory tests with two different Large Language Model (LLM) chatbots: ChatGPT and Gemini. The methodological procedure involved exploratory tests based on prompts designed around a probability question. The "Linda Problem", widely recognized in cognitive psychology, was used as a basis to create the tests, along with a new problem developed specifically for this experiment, the "Mary Problem". The object of analysis is the dataset of outputs provided by each chatbot interaction. The purpose of the analysis is to verify whether the chatbots mainly employ logical reasoning that aligns with probability theory or whether they are more frequently affected by the stereotypical textual descriptions in the prompts. The findings provide insights into the approach each chatbot employs in handling logic and textual constructions, suggesting that, while the analyzed chatbots perform satisfactorily on a well-known probabilistic problem, they exhibit significantly lower performance on new tests that require direct application of probabilistic logic.
摘要：本研究包括定性实证研究，通过对两个不同的大型语言模型 (LLM) 聊天机器人 ChatGPT 和 Gemini 进行探索性测试进行。方法程序涉及基于概率问题设计的提示的探索性测试。在认知心理学中被广泛认可的“琳达问题”被用作创建测试的基础，同时专门为这个实验开发了一个新问题，即“玛丽问题”。分析的对象是包含每个聊天机器人交互提供的输出的数据集。分析的目的是验证聊天机器人是否主要采用符合概率论的逻辑推理，或者它们是否更频繁地受到提示中的刻板文本描述的影响。研究结果提供了关于每个聊天机器人在处理逻辑和文本结构时所采用的方法的见解，表明虽然所分析的聊天机器人在一个众所周知的概率问题上表现令人满意，但它们在需要直接应用概率逻辑的新测试中表现出明显较低的性能。
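
Note on the probability test: the classic Linda problem asks whether a described person is more likely to be (A) a bank teller or (A and B) a bank teller who is also active in the feminist movement. Probability theory requires P(A and B) <= P(A), so ranking the conjunction as more likely is a fallacy however well the description fits the stereotype. The sketch below only illustrates that rule; the counts are made-up toy numbers, and the function name is hypothetical, not taken from the paper.

```python
from fractions import Fraction


def conjunction_rule_holds(p_a, p_a_and_b):
    """Return True when P(A and B) <= P(A), as probability theory requires."""
    return p_a_and_b <= p_a


# Hypothetical toy population of 100 people who match the description.
population = 100
bank_tellers = 5           # event A
feminist_bank_tellers = 4  # event A and B, necessarily a subset of A

p_teller = Fraction(bank_tellers, population)
p_teller_and_feminist = Fraction(feminist_bank_tellers, population)

# A consistent answer ranks the single event at least as likely as the conjunction.
print(conjunction_rule_holds(p_teller, p_teller_and_feminist))
```

An answer that assigns the conjunction a higher probability than the single event, for example 1/10 versus 1/20, fails this check regardless of the scenario text.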

Title: Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models

Authors: Jung Hyun Lee, June Yong Yang, Byeongho Heo, Dongyoon Han, Kang Min Yoo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12863
Pdf URL: https://arxiv.org/pdf/2407.12863
Copy Paste: [[2407.12863]] Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models(https://arxiv.org/abs/2407.12863)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive problem-solving capabilities in mathematics through step-by-step reasoning chains. However, they are susceptible to reasoning errors that impact the quality of subsequent reasoning chains and the final answer due to language models' autoregressive token-by-token generating nature. Recent works have proposed adopting external verifiers to guide the generation of reasoning paths, but existing works utilize models that have been trained with step-by-step labels to assess the correctness of token-by-token reasoning chains. Consequently, they struggle to recognize discriminative details of tokens within a reasoning path and lack the ability to evaluate whether an intermediate reasoning path is on a promising track toward the correct final answer. To amend the lack of sound and token-grained math-verification signals, we devise a novel training scheme for verifiers that apply token-level supervision with the expected cumulative reward (i.e., value). Furthermore, we propose a practical formulation of the cumulative reward by reducing it to finding the probability of future correctness of the final answer and thereby enabling the empirical estimation of the value. Experimental results on mathematical reasoning benchmarks show that Token-Supervised Value Model (TVM) can outperform step-by-step verifiers on GSM8K and MATH with Mistral and Llama.
摘要：大型语言模型 (LLM) 通过逐步推理链展示了令人印象深刻的数学问题解决能力。然而，由于语言模型的自回归逐个标记生成特性，它们容易受到推理错误的影响，从而影响后续推理链的质量和最终答案。最近的研究提出采用外部验证器来指导推理路径的生成，但现有的研究利用经过逐步标签训练的模型来评估逐个标记推理链的正确性。因此，他们很难识别推理路径中标记的判别细节，并且缺乏评估中间推理路径是否朝着正确的最终答案前进的能力。为了弥补缺乏合理和标记粒度的数学验证信号的问题，我们为验证器设计了一种新颖的训练方案，该方案应用具有预期累积奖励（即价值）的标记级监督。此外，我们提出了一种实用的累积奖励公式，将其简化为寻找最终答案未来正确性的概率，从而实现价值的经验估计。数学推理基准上的实验结果表明，Token-Supervised Value Model (TVM) 可以胜过 GSM8K 和 MATH 上的 Mistral 和 Llama 的分步验证器。
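
Note on the value target: the abstract states that the cumulative reward reduces to the probability that the final answer will be correct, which makes the value empirically estimable. One way to picture this, offered as an assumption and not as the authors' procedure, is a Monte Carlo estimate: sample several reasoning paths, mark each as correct or incorrect, and label every token prefix with the fraction of sampled paths sharing that prefix that end correctly. The function name and toy paths are hypothetical.

```python
from collections import defaultdict


def estimate_prefix_values(sampled_paths):
    """Estimate, for every token prefix, the fraction of sampled paths
    starting with that prefix whose final answer is correct.

    sampled_paths: list of (tokens, is_correct) pairs.
    Returns a dict mapping prefix tuples to values in [0, 1].
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tokens, is_correct in sampled_paths:
        for end in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:end])
            totals[prefix] += 1
            correct[prefix] += int(is_correct)
    return {prefix: correct[prefix] / totals[prefix] for prefix in totals}


# Toy samples for the question "what is 2 + 3?" (illustrative only).
paths = [
    (["2", "+", "3", "=", "5"], True),
    (["2", "+", "3", "=", "6"], False),
    (["2", "*", "3", "=", "6"], False),
]
values = estimate_prefix_values(paths)
print(values[("2", "+")], values[("2", "*")])
```

In this toy data the prefix that switches to multiplication already has an estimated value of zero at its second token, which is the kind of token-level signal a step-level label would not expose.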

Title: GRAD-SUM: Leveraging Gradient Summarization for Optimal Prompt Engineering

Authors: Derek Austin, Elliott Chartock
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12865
Pdf URL: https://arxiv.org/pdf/2407.12865
Copy Paste: [[2407.12865]] GRAD-SUM: Leveraging Gradient Summarization for Optimal Prompt Engineering(https://arxiv.org/abs/2407.12865)
Keywords: language model, llm, prompt
Abstract: Prompt engineering for large language models (LLMs) is often a manual time-intensive process that involves generating, evaluating, and refining prompts iteratively to ensure high-quality outputs. While there has been work on automating prompt engineering, the solutions generally are either tuned to specific tasks with given answers or are quite costly. We introduce GRAD-SUM, a scalable and flexible method for automatic prompt engineering that builds on gradient-based optimization techniques. Our approach incorporates user-defined task descriptions and evaluation criteria, and features a novel gradient summarization module to generalize feedback effectively. Our results demonstrate that GRAD-SUM consistently outperforms existing methods across various benchmarks, highlighting its versatility and effectiveness in automatic prompt optimization.
摘要：大型语言模型 (LLM) 的提示工程通常是一个耗时的手动过程，涉及生成、评估和迭代优化提示以确保高质量的输出。虽然已经开展了有关自动提示工程的研究，但解决方案通常要么针对给定答案的特定任务进行调整，要么成本相当高昂。我们引入了 GRAD-SUM，这是一种基于梯度优化技术的可扩展且灵活的自动提示工程方法。我们的方法结合了用户定义的任务描述和评估标准，并具有新颖的梯度汇总模块，可以有效地概括反馈。我们的结果表明，GRAD-SUM 在各种基准测试中始终优于现有方法，凸显了其在自动提示优化方面的多功能性和有效性。

Title: Beyond KV Caching: Shared Attention for Efficient LLMs

Authors: Bingli Liao, Danilo Vasconcellos Vargas
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12866
Pdf URL: https://arxiv.org/pdf/2407.12866
Copy Paste: [[2407.12866]] Beyond KV Caching: Shared Attention for Efficient LLMs(https://arxiv.org/abs/2407.12866)
Keywords: language model, llm
Abstract: The efficiency of large language models (LLMs) remains a critical challenge, particularly in contexts where computational resources are limited. Traditional attention mechanisms in these models, while powerful, require significant computational and memory resources due to the necessity of recalculating and storing attention weights across different layers. This paper introduces a novel Shared Attention (SA) mechanism, designed to enhance the efficiency of LLMs by directly sharing computed attention weights across multiple layers. Unlike previous methods that focus on sharing intermediate Key-Value (KV) caches, our approach utilizes the isotropic tendencies of attention distributions observed in advanced LLMs post-pretraining to reduce both the computational flops and the size of the KV cache required during inference. We empirically demonstrate that implementing SA across various LLMs results in minimal accuracy loss on standard benchmarks. Our findings suggest that SA not only conserves computational resources but also maintains robust model performance, thereby facilitating the deployment of more efficient LLMs in resource-constrained environments.
摘要：大型语言模型 (LLM) 的效率仍然是一个关键挑战，特别是在计算资源有限的环境中。这些模型中的传统注意力机制虽然功能强大，但由于需要在不同层之间重新计算和存储注意力权重，因此需要大量的计算和内存资源。本文介绍了一种新颖的共享注意力 (SA) 机制，旨在通过直接在多个层之间共享计算的注意力权重来提高 LLM 的效率。与以前专注于共享中间键值 (KV) 缓存的方法不同，我们的方法利用在高级 LLM 预训练后观察到的注意力分布的各向同性趋势来减少计算翻转和推理期间所需的 KV 缓存的大小。我们通过经验证明，在各种 LLM 中实施 SA 会导致标准基准上的准确度损失最小。我们的研究结果表明，SA 不仅节省了计算资源，而且还保持了强大的模型性能，从而有助于在资源受限的环境中部署更高效的 LLM。
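
Note on the mechanism: the abstract describes sharing computed attention weights across multiple layers. A minimal single-head sketch of that idea, in plain Python, computes the causal softmax weights once and applies them to each layer's value vectors. Which layers share weights, how heads are handled, and the causal masking are assumptions here; the abstract does not specify them, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries, keys):
    """Scaled dot-product attention weights with a causal mask."""
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        # Token i may only attend to positions 0..i.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        weights.append(softmax(scores) + [0.0] * (len(keys) - i - 1))
    return weights


def apply_weights(weights, values):
    """Weighted sum of value vectors for every query position."""
    width = len(values[0])
    return [[sum(w * v[c] for w, v in zip(row, values)) for c in range(width)]
            for row in weights]


def shared_attention_stack(queries, keys, values_per_layer):
    """Compute the weights once and reuse them for every layer's values."""
    shared = causal_attention_weights(queries, keys)
    return [apply_weights(shared, values) for values in values_per_layer]
```

In this sketch, layers that reuse the shared weights never compute their own query-key scores, which is where the compute saving would come from; any effect on accuracy is an empirical question the sketch does not address.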

Title: Bilingual Adaptation of Monolingual Foundation Models

Authors: Gurpreet Gosal, Yishi Xu, Gokul Ramakrishnan, Rituraj Joshi, Avraham Sheinin, Zhiming (Charles)Chen, Biswajit Mishra, Natalia Vassilieva, Joel Hestness, Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Onkar Pandit, Samta Kamboj, Rahul Pal, Parvez Mullah, Soundar Doraiswamy, Mohamed El Karim Chami
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12869
Pdf URL: https://arxiv.org/pdf/2407.12869
Copy Paste: [[2407.12869]] Bilingual Adaptation of Monolingual Foundation Models(https://arxiv.org/abs/2407.12869)
Keywords: language model, llm
Abstract: We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix, followed by full model continual pretraining on a bilingual corpus. By continually pretraining on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic. Our approach results in significant improvements in Arabic and slight enhancements in English, demonstrating cost-effective cross-lingual transfer. We also perform extensive ablations on embedding initialization techniques, data mix ratios, and learning rates and release a detailed training recipe.
摘要：我们提出了一种将单语大型语言模型 (LLM) 改编为另一种语言的有效方法，解决了灾难性遗忘和标记器限制的挑战。我们将这项研究重点放在将 Llama 2 改编为阿拉伯语。我们的两阶段方法首先是扩大词汇量并仅训练嵌入矩阵，然后在双语语料库上对整个模型进行持续预训练。通过在阿拉伯语和英语语料库上不断进行预训练，该模型在获得阿拉伯语能力的同时，保留了其英语熟练程度。我们的方法显著提高了阿拉伯语水平，并略微提高了英语水平，展示了具有成本效益的跨语言迁移。我们还对嵌入初始化技术、数据混合比率和学习率进行了广泛的消融，并发布了详细的训练方案。

Title: MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation

Authors: Xiaohan Wang, Dian Li, Yilin Zhao, Sinbadliu, Hui Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12871
Pdf URL: https://arxiv.org/pdf/2407.12871
Copy Paste: [[2407.12871]] MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation(https://arxiv.org/abs/2407.12871)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Utilizing complex tools with Large Language Models (LLMs) is a critical component for grounding AI agents in various real-world scenarios. The core challenge of manipulating tools lies in understanding their usage and functionality. The prevailing approach involves few-shot prompting with demonstrations or fine-tuning on expert trajectories. However, for complex tools and tasks, mere in-context demonstrations may fail to cover sufficient knowledge. Training-based methods are also constrained by the high cost of dataset construction and limited generalizability. In this paper, we introduce a new tool learning methodology (MetaTool) that is generalizable for mastering any reusable toolset. Our approach includes a self-supervised data augmentation technique that enables LLMs to gain a comprehensive understanding of various tools, thereby improving their ability to complete tasks effectively. We develop a series of meta-tasks that involve predicting masked factors of tool execution. These self-supervised tasks enable the automatic generation of high-quality QA data concerning tool comprehension. By incorporating meta-task data into the instruction tuning process, the proposed MetaTool model achieves significant superiority to open-source models and is comparable to GPT-4/GPT-3.5 on multiple tool-oriented tasks.
摘要：将复杂工具与大型语言模型 (LLM) 结合使用是将 AI 代理应用于各种现实场景的关键组件。操作工具的核心挑战在于了解它们的用法和功能。主流方法包括通过演示进行少量提示或根据专家轨迹进行微调。但是，对于复杂的工具和任务，仅仅在上下文中演示可能无法涵盖足够的知识。基于训练的方法还受到数据集构建成本高和通用性有限的限制。在本文中，我们介绍了一种新的工具学习方法 (MetaTool)，该方法可推广用于掌握任何可重复使用的工具集。我们的方法包括一种自监督数据增强技术，使 LLM 能够全面了解各种工具，从而提高它们有效完成任务的能力。我们开发了一系列元任务，涉及预测工具执行的掩蔽因素。这些自监督任务能够自动生成有关工具理解的高质量 QA 数据。通过将元任务数据纳入指令调整过程，提出的 MetaTool 模型显著优于开源模型，并且在多个面向工具的任务上可与 GPT-4/GPT-3.5 相媲美。

Title: Evaluating Large Language Models with fmeval

Authors: Pola Schwöbel, Luca Franceschi, Muhammad Bilal Zafar, Keerthan Vasist, Aman Malhotra, Tomer Shenhar, Pinal Tailor, Pinar Yilmaz, Michael Diamond, Michele Donini
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12872
Pdf URL: https://arxiv.org/pdf/2407.12872
Copy Paste: [[2407.12872]] Evaluating Large Language Models with fmeval(https://arxiv.org/abs/2407.12872)
Keywords: language model, llm
Abstract: fmeval is an open source library to evaluate large language models (LLMs) in a range of tasks. It helps practitioners evaluate their model for task performance and along multiple responsible AI dimensions. This paper presents the library and exposes its underlying design principles: simplicity, coverage, extensibility and performance. We then present how these were implemented in the scientific and engineering choices taken when developing fmeval. A case study demonstrates a typical use case for the library: picking a suitable model for a question answering task. We close by discussing limitations and further work in the development of the library. fmeval can be found at this https URL.
摘要：fmeval 是一个开源库，用于评估一系列任务中的大型语言模型 (LLM)。它可以帮助从业者评估其模型的任务性能以及多个负责任的 AI 维度。本文介绍了该库并揭示了其底层设计原则：简单性、覆盖范围、可扩展性和性能。然后，我们介绍了在开发 fmeval 时在科学和工程选择中如何实现这些原则。案例研究展示了该库的一个典型用例：为问答任务选择合适的模型。最后，我们讨论了该库开发的局限性和进一步的工作。fmeval 可以在这个 https URL 上找到。

Title: Evaluation of RAG Metrics for Question Answering in the Telecom Domain

Authors: Sujoy Roychowdhury, Sumit Soman, H G Ranjani, Neeraj Gunda, Vansh Chhabra, Sai Krishna Bala
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12873
Pdf URL: https://arxiv.org/pdf/2407.12873
Copy Paste: [[2407.12873]] Evaluation of RAG Metrics for Question Answering in the Telecom Domain(https://arxiv.org/abs/2407.12873)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) is widely used to enable Large Language Models (LLMs) to perform Question Answering (QA) tasks in various domains. However, RAG based on open-source LLMs for specialized domains faces challenges in evaluating generated responses. A popular framework in the literature is RAG Assessment (RAGAS), a publicly available library which uses LLMs for evaluation. One disadvantage of RAGAS is the lack of detail on how the numerical values of the evaluation metrics are derived. One of the outcomes of this work is a modified version of this package for a few metrics (faithfulness, context relevance, answer relevance, answer correctness, answer similarity and factual correctness) through which we provide the intermediate outputs of the prompts using any LLM. Next, we analyse the expert evaluations of the output of the modified RAGAS package and observe the challenges of using it in the telecom domain. We also study the effect of the metrics under correct vs. wrong retrieval and observe that a few of the metrics have higher values for correct retrieval. We also study differences in metrics between base embeddings and those domain-adapted via pre-training and fine-tuning. Finally, we comment on the suitability and challenges of using these metrics for in-the-wild telecom QA tasks.
摘要：检索增强生成 (RAG) 被广泛用于使大型语言模型 (LLM) 在各个领域执行问答 (QA) 任务。然而，基于开源 LLM 的 RAG 在评估生成的响应方面面临挑战。文献中流行的框架是 RAG 评估 (RAGAS)，这是一个使用 LLM 进行评估的公开库。RAGAS 的一个缺点是缺乏评估指标数值推导的细节。这项工作的成果之一是针对少数指标（忠实度、上下文相关性、答案相关性、答案正确性、答案相似性和事实正确性）对该包进行了修改，通过该修改，我们使用任何 LLM 提供提示的中间输出。接下来，我们分析修改后的 RAGAS 包输出的专家评估，并观察在电信领域使用它的挑战。我们还研究了正确检索和错误检索下指标的影响，并观察到少数指标在正确检索时具有更高的值。我们还研究了基础嵌入和通过预训练和微调适应的领域之间的指标差异。最后，我们评论了使用这些指标进行野外电信 QA 任务的适用性和挑战。

Title: SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning

Authors: Chenyang Zhao, Xueying Jia, Vijay Viswanathan, Tongshuang Wu, Graham Neubig
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12874
Pdf URL: https://arxiv.org/pdf/2407.12874
Copy Paste: [[2407.12874]] SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning(https://arxiv.org/abs/2407.12874)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. However, prompting often leads models to make predictions with lower accuracy compared to finetuning a model with ample training data. On the other hand, while finetuning LLMs on task-specific data generally improves their performance, abundant annotated datasets are not available for all tasks. Previous work has explored generating task-specific data from state-of-the-art LLMs and using this data to finetune smaller models, but this approach requires access to a language model other than the one being trained, which introduces cost, scalability challenges, and legal hurdles associated with continuously relying on more powerful LLMs. In response to these, we propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM, then use these input-output pairs to finetune the student LLM itself. In our empirical evaluation of the Natural Instructions V2 benchmark, we find that SELF-GUIDE improves the performance of LLM by a substantial margin. Specifically, we report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics. This sheds light on the promise of self-synthesized data guiding LLMs towards becoming task-specific experts without any external learning signals.
摘要：大型语言模型 (LLM) 在提供适当的自然语言提示时有望解决各种任务。然而，与使用充足的训练数据微调模型相比，提示通常会导致模型的预测准确率较低。另一方面，虽然在特定任务数据上微调 LLM 通常会提高其性能，但并非所有任务都有丰富的注释数据集。先前的研究探索了从最先进的 LLM 生成特定任务的数据并使用这些数据来微调较小的模型，但这种方法需要访问除正在训练的语言模型之外的语言模型，这会带来成本、可扩展性挑战以及与持续依赖更强大的 LLM 相关的法律障碍。针对这些问题，我们提出了 SELF-GUIDE，这是一种多阶段机制，我们从学生 LLM 合成特定任务的输入输出对，然后使用这些输入输出对来微调学生 LLM 本身。在我们对 Natural Instructions V2 基准的实证评估中，我们发现 SELF-GUIDE 大幅提高了 LLM 的性能。具体来说，我们报告称，在基准指标中，分类任务的绝对改进约为 15%，生成任务的绝对改进约为 18%。这揭示了自我合成数据在无需任何外部学习信号的情况下引导 LLM 成为特定任务专家的前景。

Title: Review-Feedback-Reason (ReFeR): A Novel Framework for NLG Evaluation and Reasoning

Authors: Yaswanth Narsupalli, Abhranil Chandra, Sreevatsa Muppirala, Manish Gupta, Pawan Goyal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12877
Pdf URL: https://arxiv.org/pdf/2407.12877
Copy Paste: [[2407.12877]] Review-Feedback-Reason (ReFeR): A Novel Framework for NLG Evaluation and Reasoning(https://arxiv.org/abs/2407.12877)
Keywords: language model, gpt, llm, agent
Abstract: Assessing the quality of Natural Language Generation (NLG) outputs, such as those produced by large language models (LLMs), poses significant challenges. Traditional approaches involve either resource-intensive human evaluations or automatic metrics, which often exhibit a low correlation with human judgment. In this study, we propose Review-Feedback-Reason (ReFeR), a novel evaluation framework for NLG using LLM agents. We rigorously test ReFeR using two pre-existing benchmark datasets on diverse NLG tasks. The proposed framework not only enhances the accuracy of NLG evaluation, surpassing previous benchmarks by ~20%, but also generates constructive feedback and significantly improves collective reasoning. This feedback is then leveraged for the creation of instruction-tuning datasets, which, when used to fine-tune smaller models like Mistral-7B, make them extremely good evaluators, yielding a better correlation with human evaluations and performance nearly on par with GPT-3.5. We highlight the effectiveness of our methodology through its application on three reasoning benchmarks, where it outperforms most of the state-of-the-art methods, and also outperforms the reasoning capabilities of models like GPT-3.5 Turbo by ~11.67% and GPT-4 by ~1% on average.
摘要：评估自然语言生成 (NLG) 输出（例如由大型语言模型 (LLM) 生成的输出）的质量是一项重大挑战。传统方法涉及资源密集型的人工评估或自动指标，而这些指标通常与人类判断的相关性较低。在本研究中，我们提出了 Review-Feedback-Reason (ReFeR)，这是一种使用 LLM 代理的 NLG 新型评估框架。我们使用两个预先存在的基准数据集在不同的 NLG 任务上对 ReFeR 进行了严格测试。所提出的框架不仅提高了 NLG 评估的准确性，比之前的基准高出 $\sim$20\%，而且还产生了建设性的反馈并显著提高了集体推理能力。然后利用此反馈来创建指令调整数据集，当用于微调 Mistral-7B 等较小的模型时，它们会成为非常好的评估器，与人类评估的相关性更高，性能几乎与 GPT-3.5 相当。我们通过在三个推理基准上的应用强调了我们的方法的有效性，它优于大多数最先进的方法，并且平均比 GPT-3.5 Turbo 等模型的推理能力高出 $\sim$11.67\% 和 GPT-4 高出 $\sim$1\%。

Title: Do LLMs have Consistent Values?

Authors: Naama Rozen, Gal Elidan, Amir Globerson, Ella Daniel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12878
Pdf URL: https://arxiv.org/pdf/2407.12878
Copy Paste: [[2407.12878]] Do LLMs have Consistent Values?(https://arxiv.org/abs/2407.12878)
Keywords: language model, llm, prompt
Abstract: Values are a basic driving force underlying human behavior. Large Language Models (LLM) technology is constantly improving towards human-like dialogue. However, little research has been done to study the values exhibited in text generated by LLMs. Here we study this question by turning to the rich literature on value structure in psychology. We ask whether LLMs exhibit the same value structure that has been demonstrated in humans, including the ranking of values, and correlation between values. We show that the results of this analysis strongly depend on how the LLM is prompted, and that under a particular prompting strategy (referred to as 'Value Anchoring') the agreement with human data is quite compelling. Our results serve both to improve our understanding of values in LLMs, as well as introduce novel methods for assessing consistency in LLM responses.
摘要：价值观是人类行为的基本驱动力。大型语言模型 (LLM) 技术正在不断改进，以实现类似人类的对话。然而，很少有研究对 LLM 生成的文本中所展现的价值观进行研究。在这里，我们通过丰富的心理学价值观结构文献来研究这个问题。我们想知道 LLM 是否表现出与人类相同的价值观结构，包括价值观的排序和价值观之间的相关性。我们表明，这种分析的结果在很大程度上取决于 LLM 的提示方式，并且在特定的提示策略（称为“价值锚定”）下，与人类数据的一致性非常引人注目。我们的研究结果既有助于提高我们对 LLM 中价值观的理解，也有助于引入评估 LLM 响应一致性的新方法。

Title: Large Visual-Language Models Are Also Good Classifiers: A Study of In-Context Multimodal Fake News Detection

Authors: Ye Jiang, Yimin Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12879
Pdf URL: https://arxiv.org/pdf/2407.12879
Copy Paste: [[2407.12879]] Large Visual-Language Models Are Also Good Classifiers: A Study of In-Context Multimodal Fake News Detection(https://arxiv.org/abs/2407.12879)
Keywords: language model, gpt, llm, prompt
Abstract: Large visual-language models (LVLMs) exhibit exceptional performance in visual-language reasoning across diverse cross-modal benchmarks. Despite these advances, recent research indicates that Large Language Models (LLMs), like GPT-3.5-turbo, underachieve compared to well-trained smaller models, such as BERT, in Fake News Detection (FND), prompting inquiries into LVLMs' efficacy in FND tasks. Although performance could improve through fine-tuning LVLMs, the substantial parameters and requisite pre-trained weights render it a resource-heavy endeavor for FND applications. This paper initially assesses the FND capabilities of two notable LVLMs, CogVLM and GPT4V, in comparison to a smaller yet adeptly trained CLIP model in a zero-shot context. The findings demonstrate that LVLMs can attain performance competitive with that of the smaller model. Next, we integrate standard in-context learning (ICL) with LVLMs, noting improvements in FND performance, though limited in scope and consistency. To address this, we introduce the In-context Multimodal Fake News Detection (IMFND) framework, enriching in-context examples and test inputs with predictions and corresponding probabilities from a well-trained smaller model. This strategic integration directs the LVLMs' focus towards news segments associated with higher probabilities, thereby improving their analytical accuracy. The experimental results suggest that the IMFND framework significantly boosts the FND efficiency of LVLMs, achieving enhanced accuracy over the standard ICL approach across three publicly available FND datasets.
摘要：大型视觉语言模型 (LVLM) 在各种跨模态基准的视觉语言推理中表现出色。尽管取得了这些进展，但最近的研究表明，与训练有素的小型模型（如 BERT）相比，大型语言模型 (LLM)（如 GPT-3.5-turbo）在假新闻检测 (FND) 方面表现不佳，这促使人们开始质疑 LVLM 在 FND 任务中的有效性。虽然可以通过微调 LVLM 来提高性能，但大量的参数和必要的预训练权重使其成为 FND 应用的一项资源密集型工作。本文首先评估了两个著名的 LVLM CogVLM 和 GPT4V 的 FND 能力，并与零样本环境中较小但训练有素的 CLIP 模型进行了比较。研究结果表明，LVLM 可以获得与较小模型相媲美的性能。接下来，我们将标准的上下文学习 (ICL) 与 LVLM 相结合，注意到 FND 性能有所改进，尽管范围和一致性有限。为了解决这个问题，我们引入了 \textbf{I}n-context \textbf{M}ultimodal \textbf{F}ake \textbf{N}ews \textbf{D}etection (IMFND) 框架，使用来自训练良好的较小模型的预测和相应概率丰富上下文示例和测试输入。这种战略整合将 LVLM 的重点引向与更高概率相关的新闻片段，从而提高其分析准确性。实验结果表明，IMFND 框架显著提高了 LVLM 的 FND 效率，在三个公开可用的 FND 数据集上实现了比标准 ICL 方法更高的准确性。

Title: InstructAV: Instruction Fine-tuning Large Language Models for Authorship Verification

Authors: Yujia Hu, Zhiqiang Hu, Chun-Wei Seah, Roy Ka-Wei Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12882
Pdf URL: https://arxiv.org/pdf/2407.12882
Copy Paste: [[2407.12882]] InstructAV: Instruction Fine-tuning Large Language Models for Authorship Verification(https://arxiv.org/abs/2407.12882)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in a wide range of NLP tasks. However, when it comes to authorship verification (AV) tasks, which involve determining whether two given texts share the same authorship, even advanced models like ChatGPT exhibit notable limitations. This paper introduces a novel approach, termed InstructAV, for authorship verification. This approach utilizes LLMs in conjunction with a parameter-efficient fine-tuning (PEFT) method to simultaneously improve accuracy and explainability. The distinctiveness of InstructAV lies in its ability to align classification decisions with transparent and understandable explanations, representing a significant progression in the field of authorship verification. Through comprehensive experiments conducted across various datasets, InstructAV demonstrates its state-of-the-art performance on the AV task, offering high classification accuracy coupled with enhanced explanation reliability.
摘要：大型语言模型 (LLM) 在各种 NLP 任务中表现出色。然而，当涉及到作者验证 (AV) 任务时，即确定两个给定文本是否具有相同的作者，即使是像 ChatGPT 这样的高级模型也表现出明显的局限性。本文介绍了一种用于作者验证的新方法，称为 InstructAV。该方法结合使用 LLM 和参数高效微调 (PEFT) 方法来同时提高准确性和可解释性。InstructAV 的独特之处在于它能够将分类决策与透明易懂的解释相结合，代表了作者验证领域的重大进步。通过对各种数据集进行的全面实验，InstructAV 展示了其在 AV 任务上的最佳性能，提供高分类准确性和增强的解释可靠性。

Title: BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

Authors: Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan O. Arik, Danqi Chen, Tao Yu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2407.12883
Pdf URL: https://arxiv.org/pdf/2407.12883
Copy Paste: [[2407.12883]] BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval(https://arxiv.org/abs/2407.12883)
Keywords: language model, llm, chain-of-thought
Abstract: Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines) where keyword or semantic-based retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning to identify relevant documents that go beyond surface form matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. BRIGHT is constructed from 1,398 real-world queries collected from diverse domains (such as economics, psychology, robotics, software engineering, and earth sciences), sourced from naturally occurring or carefully curated human data. Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard [38], which achieves a score of 59.0 nDCG@10, produces a score of 18.0 nDCG@10 on BRIGHT. We further demonstrate that augmenting queries with Chain-of-Thought reasoning generated by large language models (LLMs) improves performance by up to 12.2 points. Moreover, BRIGHT is robust against data leakage during pretraining of the benchmarked models, as we validate by showing similar performance even when documents from the benchmark are included in the training data. We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings. Our code and data are available at this https URL.
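
The nDCG@10 scores quoted in the BRIGHT abstract can be made concrete with a small sketch. The functions below compute nDCG@k using the common linear-gain, log2-discount formulation. The function names and the assumption that the ideal ranking is derived from the same list of labels are illustrative choices, and BRIGHT's own evaluation code may differ in details.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels of a ranking."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    Assumption: every relevant document appears somewhere in ranked_relevances,
    so sorting the labels in descending order gives the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A ranking that places the only relevant document first scores 1.0, and the score shrinks as that document moves down the list. Benchmark figures such as 59.0 or 18.0 are typically this quantity averaged over queries and multiplied by 100.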

Title: Whitening Not Recommended for Classification Tasks in LLMs

Authors: Ali Forooghi, Shaghayegh Sadeghi, Jianguo Lu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.12886
Pdf URL: https://arxiv.org/pdf/2407.12886
Copy Paste: [[2407.12886]] Whitening Not Recommended for Classification Tasks in LLMs(https://arxiv.org/abs/2407.12886)
Keywords: language model, llm
Abstract: Sentence embedding is a cornerstone of NLP. Whitening has been claimed to be an effective operation for improving the quality of embeddings obtained from Large Language Models (LLMs). However, we find that the efficacy of whitening is model-dependent and task-dependent. In particular, whitening degrades embeddings for classification tasks. The conclusion is supported by extensive experiments. We also explored a variety of whitening operations, including PCA, ZCA, PCA-Cor, ZCA-Cor and Cholesky whitening. A by-product of our research is an embedding evaluation platform for LLMs called SentEval+.
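
To illustrate what a whitening operation does to a set of embeddings, here is a minimal pure-Python sketch of one Cholesky-based construction: center the vectors, factor the sample covariance as C C^T, and map each vector x to C^{-1} x so that the result has identity covariance. The abstract does not define its whitening operations, so this is an assumption for illustration; the paper's Cholesky variant may factor the inverse covariance instead, and all function names here are hypothetical.

```python
import math


def mean_center(rows):
    """Subtract the per-dimension mean from every vector."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mu[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered vectors."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def cholesky(a):
    """Lower-triangular C with C C^T = a (a must be symmetric positive definite)."""
    d = len(a)
    c = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = math.sqrt(a[i][i] - s)
            else:
                c[i][j] = (a[i][j] - s) / c[j][j]
    return c


def forward_solve(c, x):
    """Solve C z = x for z by forward substitution (C lower triangular)."""
    z = []
    for i in range(len(x)):
        s = sum(c[i][k] * z[k] for k in range(i))
        z.append((x[i] - s) / c[i][i])
    return z


def whiten(rows):
    """Return centered vectors transformed to have identity sample covariance."""
    centered = mean_center(rows)
    c = cholesky(covariance(centered))
    return [forward_solve(c, r) for r in centered]
```

After whitening, every dimension has unit variance and the dimensions are uncorrelated. Whether that helps or hurts a downstream classifier is exactly the empirical question the abstract raises.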

Title: Explainable Biomedical Hypothesis Generation via Retrieval Augmented Generation enabled Large Language Models

Authors: Alexander R. Pelletier, Joseph Ramirez, Irsyad Adam, Simha Sankar, Yu Yan, Ding Wang, Dylan Steinecke, Wei Wang, Peipei Ping
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12888
Pdf URL: https://arxiv.org/pdf/2407.12888
Copy Paste: [[2407.12888]] Explainable Biomedical Hypothesis Generation via Retrieval Augmented Generation enabled Large Language Models(https://arxiv.org/abs/2407.12888)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: The vast amount of biomedical information available today presents a significant challenge for investigators seeking to digest, process, and understand these findings effectively. Large Language Models (LLMs) have emerged as powerful tools to navigate this complex and challenging data landscape. However, LLMs may produce hallucinated responses, making Retrieval Augmented Generation (RAG) crucial for obtaining accurate information. In this protocol, we present RUGGED (Retrieval Under Graph-Guided Explainable disease Distinction), a comprehensive workflow designed to support investigators with knowledge integration and hypothesis generation, identifying validated paths forward. Relevant biomedical information from publications and knowledge bases is reviewed, integrated, and extracted via text-mining association analysis and explainable graph prediction models on disease nodes, forecasting potential links among drugs and diseases. These analyses, along with biomedical texts, are integrated into a framework that facilitates user-directed mechanism elucidation as well as hypothesis exploration through RAG-enabled LLMs. A clinical use-case demonstrates RUGGED's ability to evaluate and recommend therapeutics for Arrhythmogenic Cardiomyopathy (ACM) and Dilated Cardiomyopathy (DCM), analyzing prescribed drugs for molecular interactions and unexplored uses. The platform minimizes LLM hallucinations, offers actionable insights, and improves the investigation of novel therapeutics.

Title: Halu-J: Critique-Based Hallucination Judge

Authors: Binjie Wang, Steffi Chern, Ethan Chern, Pengfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12943
Pdf URL: https://arxiv.org/pdf/2407.12943
Copy Paste: [[2407.12943]] Halu-J: Critique-Based Hallucination Judge(https://arxiv.org/abs/2407.12943)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) frequently generate non-factual content, known as hallucinations. Existing retrieval-augmented hallucination detection approaches typically address this by framing it as a classification task, evaluating hallucinations based on their consistency with retrieved evidence. However, this approach usually lacks detailed explanations for these evaluations and does not assess the reliability of these explanations. Furthermore, deficiencies in retrieval systems can lead to irrelevant or partially relevant evidence retrieval, impairing the detection process. Moreover, while real-world hallucination detection requires analyzing multiple pieces of evidence, current systems usually treat all evidence uniformly without considering its relevance to the content. To address these challenges, we introduce Halu-J, a critique-based hallucination judge with 7 billion parameters. Halu-J enhances hallucination detection by selecting pertinent evidence and providing detailed critiques. Our experiments indicate that Halu-J outperforms GPT-4o in multiple-evidence hallucination detection and matches its capability in critique generation and evidence selection. We also introduce ME-FEVER, a new dataset designed for multiple-evidence hallucination detection. Our code and dataset can be found at this https URL.

Title: A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks

Authors: Shubham Vatsal, Harsh Dubey
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.12994
Pdf URL: https://arxiv.org/pdf/2407.12994
Copy Paste: [[2407.12994]] A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks(https://arxiv.org/abs/2407.12994)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown remarkable performance on many different Natural Language Processing (NLP) tasks. Prompt engineering plays a key role in extending the existing abilities of LLMs to achieve significant performance gains on various NLP tasks. Prompt engineering requires composing natural language instructions, called prompts, to elicit knowledge from LLMs in a structured way. Unlike previous state-of-the-art (SoTA) models, prompt engineering does not require extensive parameter re-training or fine-tuning for the given NLP task and thus operates solely on the embedded knowledge of LLMs. Additionally, LLM enthusiasts can intelligently extract LLMs' knowledge through a basic natural language conversational exchange or prompt engineering, allowing more and more people, even those without a deep mathematical machine learning background, to experiment with LLMs. With prompt engineering gaining popularity in the last two years, researchers have come up with numerous techniques for designing prompts to improve the accuracy of information extraction from LLMs. In this paper, we summarize different prompting techniques and group them by the NLP tasks they have been used for. We further highlight the performance of these prompting strategies on various datasets belonging to each NLP task, discuss the corresponding LLMs used, present a taxonomy diagram, and discuss the possible SoTA for specific datasets. In total, we read and survey 44 research papers covering 39 different prompting methods on 29 different NLP tasks, most of which have been published in the last two years.

Title: Establishing Knowledge Preference in Language Models

Authors: Sizhe Zhou, Sha Li, Yu Meng, Yizhu Jiao, Heng Ji, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13048
Pdf URL: https://arxiv.org/pdf/2407.13048
Copy Paste: [[2407.13048]] Establishing Knowledge Preference in Language Models(https://arxiv.org/abs/2407.13048)
Keywords: language model
Abstract: Language models are known to encode a great amount of factual knowledge through pretraining. However, such knowledge might be insufficient to cater to user requests, requiring the model to integrate external knowledge sources and adhere to user-provided specifications. When answering questions about ongoing events, the model should use recent news articles to update its response; when asked to provide recommendations, the model should prioritize user specifications over retrieved product reviews; when some facts are edited in the model, the updated facts should override all prior knowledge learned by the model even if they are conflicting. In all of the cases above, the model faces a decision between its own parametric knowledge, (retrieved) contextual knowledge, and user instruction knowledge. In this paper, we (1) unify such settings into the problem of knowledge preference and define a three-level preference hierarchy over these knowledge sources; (2) compile a collection of existing datasets IfQA, MQuAKE, and MRQA covering a combination of settings (with/without user specifications, with/without context documents) to systematically evaluate how well models obey the intended knowledge preference; and (3) propose a dataset synthesis method that composes diverse question-answer pairs with user assumptions and related context to directly fine-tune LMs for instilling the hierarchy of knowledge. We demonstrate that a 7B model, fine-tuned on only a few thousand examples automatically generated by our proposed method, effectively achieves superior performance (more than 18% improvement across all evaluation benchmarks) in adhering to the desired knowledge preference hierarchy.
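
The idea of a preference hierarchy over knowledge sources can be sketched as a simple resolver. The ordering used here (user instruction, then retrieved context, then parametric knowledge) is read off the examples in the abstract rather than stated as a formal definition there, and the function name and signature are hypothetical. The paper instills this behavior by fine-tuning, not by an explicit rule like this one.

```python
def resolve_answer(instruction_fact=None, context_fact=None, parametric_fact=None):
    """Pick the fact from the highest-priority knowledge source that provides one.

    Assumed priority (illustrative): instruction > context > parametric.
    Returns a (source_name, fact) pair, or (None, None) if no source has a fact.
    """
    candidates = (
        ("instruction", instruction_fact),
        ("context", context_fact),
        ("parametric", parametric_fact),
    )
    for source, fact in candidates:
        if fact is not None:
            return source, fact
    return None, None
```

For instance, when a retrieved news article and the model's memorized knowledge disagree and the user has given no specification, the resolver returns the context fact.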

Title: Dynamic Sentiment Analysis with Local Large Language Models using Majority Voting: A Study on Factors Affecting Restaurant Evaluation

Authors: Junichiro Niimi
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2407.13069
Pdf URL: https://arxiv.org/pdf/2407.13069
Copy Paste: [[2407.13069]] Dynamic Sentiment Analysis with Local Large Language Models using Majority Voting: A Study on Factors Affecting Restaurant Evaluation(https://arxiv.org/abs/2407.13069)
Keywords: language model, llm
Abstract: User-generated content (UGC) on online platforms allows marketing researchers to understand consumer preferences for products and services. With the advance of large language models (LLMs), some studies have used these models for annotation and sentiment analysis. However, the relationship between the accuracy and the hyper-parameters of LLMs is yet to be thoroughly examined. In addition, the variability and reproducibility of results across LLM trials have rarely been considered in the existing literature. Since actual human annotation uses majority voting to resolve disagreements among annotators, this study introduces a majority voting mechanism to a sentiment analysis model using local LLMs. Through a series of three analyses of online restaurant reviews, we demonstrate that majority voting over multiple attempts with a medium-sized model produces more robust results than a single attempt with a large model. Furthermore, we conducted additional analysis to investigate the effect of each aspect on the overall evaluation.
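
The majority voting mechanism described in this abstract can be sketched in a few lines. The classifier is passed in as a callable, `classify_once`, which is a hypothetical stand-in for one sampled LLM call; the number of attempts and the tie-breaking rule (first label seen wins) are assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label that appeared first."""
    if not labels:
        raise ValueError("majority_vote needs at least one label")
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


def classify_with_voting(classify_once, text, attempts=5):
    """Query the (stochastic) classifier several times and aggregate by majority."""
    votes = [classify_once(text) for _ in range(attempts)]
    return majority_vote(votes)
```

Using an odd number of attempts avoids ties for binary labels. Aggregating several samples also reduces the run-to-run variability that the abstract identifies as a reproducibility concern.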

Title: AlcLaM: Arabic Dialectal Language Model

Authors: Murtadha Ahmed, Saghir Alfasly, Bo Wen, Jamaal Qasem, Mohammed Ahmed, Yunfeng Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13097
Pdf URL: https://arxiv.org/pdf/2407.13097
Copy Paste: [[2407.13097]] AlcLaM: Arabic Dialectal Language Model(https://arxiv.org/abs/2407.13097)
Keywords: language model
Abstract: Pre-trained Language Models (PLMs) are integral to many modern natural language processing (NLP) systems. Although multilingual models cover a wide range of languages, they often grapple with challenges like high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on modern standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Named AlcLaM, our model was trained using only 13 GB of text, which represents just 7.8%, 10.2%, and 21.3% of the data used by existing models such as CAMeL, MARBERT, and ArBERT, respectively. Remarkably, AlcLaM demonstrates superior performance on a variety of Arabic NLP tasks despite the limited training data. AlcLaM is available at GitHub this https URL and HuggingFace this https URL.
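
As a quick arithmetic check on the AlcLaM figures, the snippet below backs out the training-data sizes implied for the other models. It assumes the three percentages express AlcLaM's 13 GB as a share of each model's training text, which is an interpretation of the abstract's wording; the variable names are illustrative.

```python
ALCLAM_GB = 13.0

# Share of each existing model's training data that 13 GB represents (from the abstract).
shares = {"CAMeL": 0.078, "MARBERT": 0.102, "ArBERT": 0.213}

# Implied corpus size in GB for each model: 13 GB divided by the share.
implied_gb = {name: ALCLAM_GB / share for name, share in shares.items()}

for name, size in implied_gb.items():
    print(f"{name}: about {size:.0f} GB")
```

Under that reading, the comparison models were trained on roughly 167 GB, 127 GB, and 61 GB of text, respectively.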

Title: Retrieve, Summarize, Plan: Advancing Multi-hop Question Answering with an Iterative Approach

Authors: Zhouyu Jiang, Mengshu Sun, Lei Liang, Zhiqiang Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13101
Pdf URL: https://arxiv.org/pdf/2407.13101
Copy Paste: [[2407.13101]] Retrieve, Summarize, Plan: Advancing Multi-hop Question Answering with an Iterative Approach(https://arxiv.org/abs/2407.13101)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Multi-hop question answering is a challenging task with distinct industrial relevance, and Retrieval-Augmented Generation (RAG) methods based on large language models (LLMs) have become a popular approach to tackle this task. Owing to the potential inability to retrieve all necessary information in a single iteration, a series of iterative RAG methods has been recently developed, showing significant performance improvements. However, existing methods still face two critical challenges: context overload resulting from multiple rounds of retrieval, and over-planning and repetitive planning due to the lack of a recorded retrieval trajectory. In this paper, we propose a novel iterative RAG method called ReSP, equipped with a dual-function summarizer. This summarizer compresses information from retrieved documents, targeting both the overarching question and the current sub-question concurrently. Experimental results on the multi-hop question-answering datasets HotpotQA and 2WikiMultihopQA demonstrate that our method significantly outperforms the state-of-the-art, and exhibits excellent robustness concerning context length.

Title: Translate-and-Revise: Boosting Large Language Models for Constrained Translation

Authors: Pengcheng Huang, Yongyu Mu, Yuzhang Wu, Bei Li, Chunyang Xiao, Tong Xiao, Jingbo Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13164
Pdf URL: https://arxiv.org/pdf/2407.13164
Copy Paste: [[2407.13164]] Translate-and-Revise: Boosting Large Language Models for Constrained Translation(https://arxiv.org/abs/2407.13164)
Keywords: language model, llm, prompt
Abstract: Imposing constraints on machine translation systems presents a challenging issue because these systems are not trained to make use of constraints in generating adequate, fluent translations. In this paper, we leverage the capabilities of large language models (LLMs) for constrained translation, given that LLMs can easily adapt to this task by taking translation instructions and constraints as prompts. However, LLMs cannot always guarantee the adequacy of translation, and, in some cases, ignore the given constraints. This is in part because LLMs might be overly confident in their predictions, overriding the influence of the constraints. To overcome this overriding behaviour, we propose to add a revision process that encourages LLMs to correct the outputs by prompting them about the constraints that have not yet been met. We evaluate our approach on four constrained translation tasks, encompassing both lexical and structural constraints in multiple constraint domains. Experiments show a 15% improvement in constraint-based translation accuracy over standard LLMs, and the approach also significantly outperforms state-of-the-art neural machine translation (NMT) methods.
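
The revision process described in this abstract can be sketched as a loop that checks which constraints are still unmet and prompts the model again about them. Everything concrete here is an assumption for illustration: `llm` is a hypothetical callable that maps a prompt string to text, constraints are simplified to required target-side terms checked by case-insensitive substring match, and the prompt wording and round limit are invented. The paper also covers structural constraints, which this sketch does not handle.

```python
def unmet_constraints(translation, required_terms):
    """Return the required terms that do not occur in the translation."""
    lowered = translation.lower()
    return [term for term in required_terms if term.lower() not in lowered]


def translate_and_revise(source, required_terms, llm, max_rounds=3):
    """Translate once, then revise while some lexical constraints remain unmet."""
    prompt = f"Translate: {source}\nUse these terms: {', '.join(required_terms)}"
    translation = llm(prompt)
    for _ in range(max_rounds):
        missing = unmet_constraints(translation, required_terms)
        if not missing:
            break  # every constraint is satisfied, no further revision needed
        prompt = (
            f"Source: {source}\nDraft: {translation}\n"
            f"The draft omits these required terms: {', '.join(missing)}. Revise it."
        )
        translation = llm(prompt)
    return translation
```

Naming only the missing terms in the revision prompt mirrors the abstract's idea of prompting about constraints that have not yet been met, rather than repeating the full instruction.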

Title: Retrieval-Augmented Generation for Natural Language Processing: A Survey

Authors: Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13193
Pdf URL: https://arxiv.org/pdf/2407.13193
Copy Paste: [[2407.13193]] Retrieval-Augmented Generation for Natural Language Processing: A Survey(https://arxiv.org/abs/2407.13193)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) have demonstrated great success in various fields, benefiting from their huge number of parameters that store knowledge. However, LLMs still suffer from several key issues, such as hallucination, difficulty updating knowledge, and a lack of domain-specific expertise. The emergence of retrieval-augmented generation (RAG), which leverages an external knowledge database to augment LLMs, makes up for these drawbacks. This paper reviews all significant techniques of RAG, especially the retriever and retrieval fusion. In addition, tutorial code is provided for implementing the representative techniques in RAG. This paper further discusses RAG training, including RAG with and without datastore updates. Then, we introduce the application of RAG in representative natural language processing tasks and industrial scenarios. Finally, this paper discusses the future directions and challenges of RAG to promote its development.

Title: Transformer-based Single-Cell Language Model: A Survey

Authors: Wei Lan, Guohang He, Mingyang Liu, Qingfeng Chen, Junyue Cao, Wei Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13205
Pdf URL: https://arxiv.org/pdf/2407.13205
Copy Paste: [[2407.13205]] Transformer-based Single-Cell Language Model: A Survey(https://arxiv.org/abs/2407.13205)
Keywords: language model
Abstract: Transformers have achieved significant success in natural language processing owing to their outstanding parallel processing capabilities and highly flexible attention mechanism. In addition, a growing number of transformer-based studies have been proposed to model single-cell data. In this review, we attempt to systematically summarize the single-cell language models and applications based on transformers. First, we provide a detailed introduction to the structure and principles of transformers. Then, we review the single-cell language models and large language models for single-cell data analysis. Moreover, we explore the datasets and applications of single-cell language models in downstream tasks such as batch correction, cell clustering, cell type annotation, gene regulatory network inference and perturbation response. Further, we discuss the challenges of single-cell language models and provide promising research directions. We hope this review will serve as an up-to-date reference for researchers interested in single-cell language models.

Title: Evaluating Large Language Models for Anxiety and Depression Classification using Counseling and Psychotherapy Transcripts

Authors: Junwei Sun, Siqi Ma, Yiran Fan, Peter Washington
Subjects: cs.CL, cs.CY, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2407.13228
Pdf URL: https://arxiv.org/pdf/2407.13228
Copy Paste: [[2407.13228]] Evaluating Large Language Models for Anxiety and Depression Classification using Counseling and Psychotherapy Transcripts(https://arxiv.org/abs/2407.13228)
Keywords: language model, gpt, llm, prompt
Abstract: We aim to evaluate the efficacy of traditional machine learning and large language models (LLMs) in classifying anxiety and depression from long conversational transcripts. We fine-tune both established transformer models (BERT, RoBERTa, Longformer) and a more recent large model (Mistral-7B), train a Support Vector Machine with feature engineering, and assess GPT models through prompting. We observe that state-of-the-art models fail to enhance classification outcomes compared to traditional machine learning methods.

Title: PM-LLM-Benchmark: Evaluating Large Language Models on Process Mining Tasks

Authors: Alessandro Berti, Humam Kourani, Wil M.P. van der Aalst
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2407.13244
Pdf URL: https://arxiv.org/pdf/2407.13244
Copy Paste: [[2407.13244]] PM-LLM-Benchmark: Evaluating Large Language Models on Process Mining Tasks(https://arxiv.org/abs/2407.13244)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have the potential to semi-automate some process mining (PM) analyses. While commercial models are already adequate for many analytics tasks, the competitive level of open-source LLMs in PM tasks is unknown. In this paper, we propose PM-LLM-Benchmark, the first comprehensive benchmark for PM focusing on domain knowledge (process-mining-specific and process-specific) and on different implementation strategies. We also focus on the challenges in creating such a benchmark, which relate to the public availability of the data and to evaluation biases by the LLMs. Overall, we observe that most of the considered LLMs can perform some process mining tasks at a satisfactory level, but tiny models that would run on edge devices are still inadequate. We also conclude that while the proposed benchmark is useful for identifying LLMs that are adequate for process mining tasks, further research is needed to overcome the evaluation biases and perform a more thorough ranking of the competitive LLMs.

Title: Are Large Language Models Capable of Generating Human-Level Narratives?

Authors: Yufei Tian, Tenghao Huang, Miri Liu, Derek Jiang, Alexander Spangher, Muhao Chen, Jonathan May, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13248
Pdf URL: https://arxiv.org/pdf/2407.13248
Copy Paste: [[2407.13248]] Are Large Language Models Capable of Generating Human-Level Narratives?(https://arxiv.org/abs/2407.13248)
Keywords: language model, llm
Abstract: This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression. We introduce a novel computational framework to analyze narratives through three discourse-level aspects: i) story arcs, ii) turning points, and iii) affective dimensions, including arousal and valence. By leveraging expert and automatic annotations, we uncover significant discrepancies between LLM-written and human-written stories. While human-written stories are suspenseful, arousing, and diverse in narrative structures, LLM stories are homogeneously positive and lack tension. Next, we measure narrative reasoning skills as a precursor to generative capacities, concluding that most LLMs fall short of human abilities in discourse understanding. Finally, we show that explicit integration of the aforementioned discourse features can enhance storytelling, as demonstrated by an improvement of over 40% in neural storytelling in terms of diversity, suspense, and arousal.

Title: SpeciaLex: A Benchmark for In-Context Specialized Lexicon Learning

Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13297
Pdf URL: https://arxiv.org/pdf/2407.13297
Copy Paste: [[2407.13297]] SpeciaLex: A Benchmark for In-Context Specialized Lexicon Learning(https://arxiv.org/abs/2407.13297)
Keywords: language model, llm
Abstract: Specialized lexicons are collections of words with associated constraints such as special definitions, specific roles, and intended target audiences. These constraints are necessary for content generation and documentation tasks (e.g., writing technical manuals or children's books), where the goal is to reduce the ambiguity of text content and increase its overall readability for a specific audience. Understanding how large language models can capture these constraints can help researchers build better, more impactful tools for wider use beyond the NLP community. Towards this end, we introduce SpeciaLex, a benchmark for evaluating a language model's ability to follow specialized lexicon-based constraints across 18 diverse subtasks with 1,285 test instances covering the core tasks of Checking, Identification, Rewriting, and Open Generation. We present an empirical evaluation of 15 open and closed-source LLMs and discuss how factors such as model scale, openness, setup, and recency affect performance on the benchmark.

Title: Robust ASR Error Correction with Conservative Data Filtering

Authors: Takuma Udagawa, Masayuki Suzuki, Masayasu Muraoka, Gakuto Kurata
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2407.13300
Pdf URL: https://arxiv.org/pdf/2407.13300
Copy Paste: [[2407.13300]] Robust ASR Error Correction with Conservative Data Filtering(https://arxiv.org/abs/2407.13300)
Keywords: language model, llm
Abstract: Error correction (EC) based on large language models is an emerging technology to enhance the performance of automatic speech recognition (ASR) systems. Generally, training data for EC are collected by automatically pairing a large set of ASR hypotheses (as sources) and their gold references (as targets). However, the quality of such pairs is not guaranteed, and we observed various types of noise which can make the EC models brittle, e.g. inducing overcorrection in out-of-domain (OOD) settings. In this work, we propose two fundamental criteria that EC training data should satisfy: namely, EC targets should (1) improve linguistic acceptability over sources and (2) be inferable from the available context (e.g. source phonemes). Through these criteria, we identify low-quality EC pairs and train the models not to make any correction in such cases, the process we refer to as conservative data filtering. In our experiments, we focus on Japanese ASR using a strong Conformer-CTC as the baseline and finetune Japanese LLMs for EC. Through our evaluation on a suite of 21 internal benchmarks, we demonstrate that our approach can significantly reduce overcorrection and improve both the accuracy and quality of ASR results in the challenging OOD settings.
摘要：基于大型语言模型的纠错 (EC) 是一种新兴技术，可提高自动语音识别 (ASR) 系统的性能。通常，EC 的训练数据是通过自动配对大量 ASR 假设（作为源）及其黄金参考（作为目标）来收集的。然而，这种配对的质量无法保证，我们观察到各种类型的噪声会使 EC 模型变得脆弱，例如在域外 (OOD) 设置中引起过度校正。在这项工作中，我们提出了 EC 训练数据应满足的两个基本标准：即 EC 目标应 (1) 提高对源的语言可接受性，以及 (2) 可从可用上下文（例如源音素）推断出来。通过这些标准，我们识别出低质量的 EC 对并训练模型在这种情况下不进行任何校正，我们将此过程称为保守数据过滤。在我们的实验中，我们专注于使用强大的 Conformer-CTC 作为基线的日语 ASR，并对日语 LLM 进行 EC 微调。通过对 21 个内部基准进行评估，我们证明我们的方法可以显著减少过度校正，并提高具有挑战性的 OOD 设置中 ASR 结果的准确性和质量。
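The conservative data filtering described in the abstract above can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `acceptability` and `inferable` are hypothetical scorer callbacks, and the toy implementations are simple stand-ins, not the paper's actual criteria (which rely on linguistic acceptability judgments and context such as source phonemes).

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (ASR hypothesis as source, gold reference as target)


def conservative_filter(
    pairs: List[Pair],
    acceptability: Callable[[str], float],
    inferable: Callable[[str, str], bool],
) -> List[Pair]:
    """Keep a correction only if the target (1) is more acceptable than the
    source and (2) is inferable from the source. Otherwise the target is
    replaced by the source, so a model trained on the pair learns to make
    no correction."""
    filtered = []
    for source, target in pairs:
        ok = acceptability(target) > acceptability(source) and inferable(source, target)
        filtered.append((source, target if ok else source))
    return filtered


# Toy stand-ins, purely illustrative (NOT the paper's criteria).
def toy_acceptability(text: str) -> float:
    known = {"i", "like", "green", "tea", "recognize", "speech"}
    words = text.lower().split()
    return sum(w in known for w in words) / max(len(words), 1)


def toy_inferable(source: str, target: str) -> bool:
    # Treat a target as inferable if it has the same number of words.
    return len(source.split()) == len(target.split())


pairs = [
    ("i like green tee", "i like green tea"),    # passes both criteria
    ("recognize speech", "wreck a nice beach"),  # fails both criteria
]
result = conservative_filter(pairs, toy_acceptability, toy_inferable)
```

In this toy run the first pair keeps its correction, while the second is rewritten to copy the source, which is the "do not correct" signal the abstract describes.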

Title: CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis

Authors: Junying Chen, Chi Gui, Anningzhe Gao, Ke Ji, Xidong Wang, Xiang Wan, Benyou Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.13301
Pdf URL: https://arxiv.org/pdf/2407.13301
Copy Paste: [[2407.13301]] CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis(https://arxiv.org/abs/2407.13301)
Keywords: language model, gpt, llm, agent
Abstract: The field of medical diagnosis has undergone a significant transformation with the advent of large language models (LLMs), yet the challenges of interpretability within these models remain largely unaddressed. This study introduces Chain-of-Diagnosis (CoD) to enhance the interpretability of LLM-based medical diagnostics. CoD transforms the diagnostic process into a diagnostic chain that mirrors a physician's thought process, providing a transparent reasoning pathway. Additionally, CoD outputs the disease confidence distribution to ensure transparency in decision-making. This interpretability makes model diagnostics controllable and aids in identifying critical symptoms for inquiry through the entropy reduction of confidences. With CoD, we developed DiagnosisGPT, capable of diagnosing 9604 diseases. Experimental results demonstrate that DiagnosisGPT outperforms other LLMs on diagnostic benchmarks. Moreover, DiagnosisGPT provides interpretability while ensuring controllability in diagnostic rigor.
摘要：随着大型语言模型 (LLM) 的出现，医学诊断领域发生了重大转变，但这些模型中的可解释性挑战仍然基本上未得到解决。本研究引入了诊断链 (CoD) 来增强基于 LLM 的医学诊断的可解释性。CoD 将诊断过程转变为反映医生思维过程的诊断链，提供透明的推理途径。此外，CoD 输出疾病置信度分布以确保决策的透明度。这种可解释性使模型诊断可控，并通过置信度的熵减少帮助识别关键症状以供查询。借助 CoD，我们开发了 DiagnosisGPT，能够诊断 9604 种疾病。实验结果表明，DiagnosisGPT 在诊断基准上优于其他 LLM。此外，DiagnosisGPT 提供可解释性，同时确保诊断严谨性的可控性。
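The abstract above mentions choosing symptoms for inquiry through entropy reduction of the disease confidences. A minimal sketch of that idea follows, under stated assumptions: the disease names, the confidence values, and the per-symptom post-inquiry distributions are all made up, and `best_symptom` is a hypothetical helper, not the DiagnosisGPT procedure.

```python
import math
from typing import Dict, Tuple


def entropy(confidences: Dict[str, float]) -> float:
    """Shannon entropy (in bits) of a disease-confidence distribution."""
    return -sum(p * math.log2(p) for p in confidences.values() if p > 0)


def best_symptom(
    current: Dict[str, float],
    after_inquiry: Dict[str, Dict[str, float]],
) -> Tuple[str, float]:
    """Return the symptom whose hypothetical post-inquiry distribution
    reduces entropy the most, together with that reduction."""
    base = entropy(current)
    gains = {s: base - entropy(dist) for s, dist in after_inquiry.items()}
    symptom = max(gains, key=gains.get)
    return symptom, gains[symptom]


# Made-up confidences: two equally likely diseases before any inquiry.
current = {"flu": 0.5, "cold": 0.5}
# Made-up distributions the model might hold after asking about each symptom.
after_inquiry = {
    "fever": {"flu": 0.9, "cold": 0.1},
    "sneezing": {"flu": 0.5, "cold": 0.5},
}
symptom, gain = best_symptom(current, after_inquiry)
```

Here asking about fever sharpens the distribution while asking about sneezing leaves it unchanged, so fever is the more informative inquiry.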

Title: Why do you cite? An investigation on citation intents and decision-making classification processes

Authors: Lorenzo Paolini (Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy), Sahar Vahdati (Nature-inspired machine intelligence group, this http URL center, Technical University of Dresden, Germany Institute for Applied Computer Science, InfAI - Dresden, Germany), Angelo Di Iorio (Department of Computer Science and Engineering, University of Bologna, Bologna, Italy), Robert Wardenga (Institute for Applied Computer Science, InfAI - Dresden, Germany), Ivan Heibi (Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy, Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy), Silvio Peroni (Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy, Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13329
Pdf URL: https://arxiv.org/pdf/2407.13329
Copy Paste: [[2407.13329]] Why do you cite? An investigation on citation intents and decision-making classification processes(https://arxiv.org/abs/2407.13329)
Keywords: language model
Abstract: Identifying the reason for which an author cites another work is essential to understand the nature of scientific contributions and to assess their impact. Citations are one of the pillars of scholarly communication and most metrics employed to analyze these conceptual links are based on quantitative observations. Behind the act of referencing another scholarly work there is a whole world of meanings that needs to be proficiently and effectively revealed. This study emphasizes the importance of reliably classifying citation intents to provide more comprehensive and insightful analyses in research assessment. We address this task by presenting a study utilizing advanced Ensemble Strategies for Citation Intent Classification (CIC) incorporating Language Models (LMs) and employing Explainable AI (XAI) techniques to enhance the interpretability and trustworthiness of models' predictions. Our approach involves two ensemble classifiers that utilize fine-tuned SciBERT and XLNet LMs as baselines. We further demonstrate the critical role of section titles as a feature in improving model performance. The study also introduces a web application developed with Flask and currently available at this http URL, aimed at classifying citation intents. One of our models sets a new state of the art (SOTA) with an 89.46% Macro-F1 score on the SciCite benchmark. The integration of XAI techniques provides insights into the decision-making processes, highlighting the contributions of individual words for level-0 classifications, and of individual models for the metaclassification. The findings suggest that the inclusion of section titles significantly enhances classification performance in the CIC task. Our contributions provide useful insights for developing more robust datasets and methodologies, thus fostering a deeper understanding of scholarly communication.
摘要：确定作者引用另一部作品的原因对于理解科学贡献的性质和评估其影响至关重要。引用是学术交流的支柱之一，用于分析这些概念联系的大多数指标都是基于定量观察的。在引用另一部学术著作的行为背后，有整个世界的意义需要熟练而有效地揭示。这项研究强调了可信地分类引用意图的重要性，以便在研究评估中提供更全面、更有见地的分析。我们通过展示一项研究来解决这一任务，该研究利用先进的引文意图分类 (CIC) 集成策略，结合语言模型 (LM) 并采用可解释人工智能 (XAI) 技术来增强模型预测的可解释性和可信度。我们的方法涉及两个集成分类器，它们利用经过微调的 SciBERT 和 XLNet LM 作为基线。我们进一步证明了章节标题作为提高模型性能特征的关键作用。本研究还介绍了一个用 Flask 开发的 Web 应用程序，目前可从此 http URL 获取，旨在对引用意图进行分类。我们的一个模型在 SciCite 基准上获得了 89.46% 的 Macro-F1 分数，成为最新 (SOTA)。XAI 技术的集成提供了对决策过程的洞察，突出了单个单词对 0 级分类的贡献，以及单个模型对元分类的贡献。研究结果表明，加入章节标题可显著提高 CIC 任务中的分类性能。我们的贡献为开发更强大的数据集和方法提供了有用的见解，从而促进了对学术交流的更深入理解。

Title: Learning-From-Mistakes Prompting for Indigenous Language Translation

Authors: You-Cheng Liao, Chen-Jui Yu, Chi-Yi Lin, He-Feng Yun, Yen-Hsiang Wang, Hsiao-Min Li, Yao-Chung Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13343
Pdf URL: https://arxiv.org/pdf/2407.13343
Copy Paste: [[2407.13343]] Learning-From-Mistakes Prompting for Indigenous Language Translation(https://arxiv.org/abs/2407.13343)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Using large language models, this paper presents techniques to improve extremely low-resourced indigenous language translations. Our approaches are grounded in (1) a datastore consisting of a limited number of parallel translation examples, (2) the inherent capabilities of LLMs like GPT-3.5, and (3) a word-level translation dictionary. We harness the potential of LLMs and in-context learning techniques in such a setting for using LLMs as universal translators for extremely low-resourced languages. Our methodology hinges on utilizing LLMs as language compilers for selected language pairs, hypothesizing that they could internalize syntactic structures to facilitate accurate translation. We introduce three techniques: KNNPrompting with Retrieved Prompting Context, Chain-of-Thought Prompting and Learning-from-Mistakes Prompting, with the last method addressing past errors. The evaluation results suggest that, even with limited corpora, LLMs can effectively translate extremely low-resource languages when paired with proper prompting.
摘要：本文使用大型语言模型，介绍了改进资源极其匮乏的本土语言翻译的技术。我们的方法基于以下几点：(1) 一个由有限数量的并行翻译示例组成的数据存储，(2) GPT-3.5 等 LLM 的固有功能，以及 (3) 一个词级翻译词典。我们在这样的环境中利用 LLM 和上下文学习技术的潜力，将 LLM 用作资源极其匮乏的语言的通用翻译器。我们的方法取决于利用 LLM 作为选定语言对的语言编译器，假设它们可以内化句法结构以促进准确翻译。我们介绍了三种技术：使用检索到的提示上下文的 KNNPrompting、思维链提示和从错误中学习提示，最后一种方法解决了过去的错误。评估结果表明，即使语料库有限，如果与适当的提示搭配使用，LLM 也可以有效地翻译资源极其匮乏的语言。

Title: From Words to Worlds: Compositionality for Cognitive Architectures

Authors: Ruchira Dhar, Anders Søgaard
Subjects: cs.CL, cs.AI, cs.LG, cs.SC
Abstract URL: https://arxiv.org/abs/2407.13419
Pdf URL: https://arxiv.org/pdf/2407.13419
Copy Paste: [[2407.13419]] From Words to Worlds: Compositionality for Cognitive Architectures(https://arxiv.org/abs/2407.13419)
Keywords: language model, llm
Abstract: Large language models (LLMs) are very performant connectionist systems, but do they exhibit more compositionality? More importantly, is that part of why they perform so well? We present empirical analyses across four LLM families (12 models) and three task categories, including a novel task introduced below. Our findings reveal a nuanced relationship in learning of compositional strategies by LLMs -- while scaling enhances compositional abilities, instruction tuning often has a reverse effect. Such disparity brings forth some open issues regarding the development and improvement of large language models in alignment with human cognitive capacities.
摘要：大型语言模型 (LLM) 是性能非常出色的联结系统，但它们是否表现出更多的组合性？更重要的是，这是它们表现如此出色的部分原因吗？我们对四个 LLM 系列（12 个模型）和三个任务类别进行了实证分析，包括下面介绍的一项新任务。我们的研究结果揭示了 LLM 在学习组合策略方面存在微妙的关系——虽然扩展可以增强组合能力，但指令调整通常会产生相反的效果。这种差异带来了一些悬而未决的问题，即如何根据人类的认知能力开发和改进大型语言模型。

Title: End-To-End Clinical Trial Matching with Large Language Models

Authors: Dyke Ferber, Lars Hilgers, Isabella C. Wiest, Marie-Elisabeth Leßmann, Jan Clusmann, Peter Neidlinger, Jiefu Zhu, Georg Wölflein, Jacqueline Lammert, Maximilian Tschochohei, Heiko Böhme, Dirk Jäger, Mihaela Aldea, Daniel Truhn, Christiane Höper, Jakob Nikolas Kather
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13463
Pdf URL: https://arxiv.org/pdf/2407.13463
Copy Paste: [[2407.13463]] End-To-End Clinical Trial Matching with Large Language Models(https://arxiv.org/abs/2407.13463)
Keywords: language model, gpt, llm
Abstract: Matching cancer patients to clinical trials is essential for advancing treatment and patient care. However, the inconsistent format of medical free text documents and complex trial eligibility criteria make this process extremely challenging and time-consuming for physicians. We investigated whether the entire trial matching process - from identifying relevant trials among 105,600 oncology-related clinical trials on this http URL to generating criterion-level eligibility matches - could be automated using Large Language Models (LLMs). Using GPT-4o and a set of 51 synthetic Electronic Health Records (EHRs), we demonstrate that our approach identifies relevant candidate trials in 93.3% of cases and achieves a preliminary accuracy of 88.0% when matching patient-level information at the criterion level against a baseline defined by human experts. Utilizing LLM feedback reveals that 39.3% of the criteria that were initially considered incorrect are either ambiguous or inaccurately annotated, leading to a total model accuracy of 92.7% after refining our human baseline. In summary, we present an end-to-end pipeline for clinical trial matching using LLMs, demonstrating high precision in screening and matching trials to individual patients, even outperforming qualified medical doctors. Our fully end-to-end pipeline can operate autonomously or with human supervision and is not restricted to oncology, offering a scalable solution for enhancing patient-trial matching in real-world settings.
摘要：将癌症患者与临床试验进行匹配对于推进治疗和患者护理至关重要。然而，医学自由文本文档格式不一致和试验资格标准复杂，使得这一过程对医生来说极具挑战性且耗时。我们调查了整个试验匹配过程——从在此 http URL 上的 105,600 个肿瘤相关临床试验中识别相关试验到生成标准级资格匹配——是否可以使用大型语言模型 (LLM) 实现自动化。使用 GPT-4o 和一组 51 个合成电子健康记录 (EHR)，我们证明我们的方法在 93.3% 的病例中识别出相关候选试验，并且在将标准级别的患者级信息与人类专家定义的基线进行匹配时实现了 88.0% 的初步准确率。利用 LLM 反馈发现，最初被认为不正确的 39.3% 的标准要么含糊不清，要么注释不准确，在改进我们的人类基线后，模型总准确率达到 92.7%。总之，我们提出了一种使用 LLM 进行临床试验匹配的端到端流程，在筛选和匹配试验到个体患者方面表现出很高的精确度，甚至超过了合格医生的表现。我们的完全端到端流程可以自主运行或在人工监督下运行，并且不仅限于肿瘤学，为在现实环境中增强患者试验匹配提供了一种可扩展的解决方案。

Title: Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation

Authors: Damien Sileo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13481
Pdf URL: https://arxiv.org/pdf/2407.13481
Copy Paste: [[2407.13481]] Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation(https://arxiv.org/abs/2407.13481)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) can suggest missing elements from items listed in a prompt, which can be used for list completion or recommendations based on users' history. However, their performance degrades when presented with too many items, as they start to suggest items already included in the input list. This occurs at around 100 items for mid-2024 flagship LLMs. We evaluate this phenomenon on both synthetic problems (e.g., finding missing numbers in a given range of shuffled integers) and realistic movie recommendation scenarios. We refer to this issue as attention overflow, as preventing repetition requires attending to all items simultaneously. Although iterative loops can mitigate this problem, their costs increase with the repetition rate, affecting the language models' ability to derive novelty from lengthy inputs.
摘要：大型语言模型 (LLM) 可以从提示中列出的项目中建议缺失的元素，这可用于列表完成或基于用户历史记录的推荐。但是，当呈现的项目太多时，它们的性能会下降，因为它们开始建议输入列表中已经包含的项目。对于 2024 年中期旗舰 LLM，这种情况发生在大约 100 个项目中。我们在合成问题（例如，在给定范围的打乱整数中查找缺失数字）和现实电影推荐场景中评估了这种现象。我们将此问题称为“注意力溢出”，因为防止重复需要同时关注所有项目。虽然迭代循环可以缓解这个问题，但它们的成本会随着重复率的增加而增加，从而影响语言模型从长输入中获取新颖性的能力。
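The synthetic setup mentioned in the abstract above (missing numbers in a shuffled integer range) and the notion of a repetition rate can be sketched as follows. This is an illustrative assumption of how such a probe might look: `make_missing_number_task` and `repetition_rate` are hypothetical helpers, and the `suggestions` list stands in for model output rather than coming from a real LLM.

```python
import random
from typing import List, Set, Tuple


def make_missing_number_task(
    n: int, n_missing: int, seed: int = 0
) -> Tuple[List[int], Set[int]]:
    """Build a shuffled list of the integers 0..n-1 with some removed."""
    rng = random.Random(seed)
    missing = set(rng.sample(range(n), n_missing))
    shown = [x for x in range(n) if x not in missing]
    rng.shuffle(shown)
    return shown, missing


def repetition_rate(shown: List[int], suggestions: List[int]) -> float:
    """Fraction of suggestions that repeat an item already in the input list."""
    if not suggestions:
        return 0.0
    shown_set = set(shown)
    return sum(s in shown_set for s in suggestions) / len(suggestions)


shown, missing = make_missing_number_task(n=100, n_missing=5)
# Stand-in for model output: two truly missing items and two repeats.
suggestions = sorted(missing)[:2] + shown[:2]
rate = repetition_rate(shown, suggestions)
```

A rate near zero means the suggestions are novel; a rising rate as the input list grows is the degradation the abstract calls attention overflow.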

Title: Combining Constraint Programming Reasoning with Large Language Model Predictions

Authors: Florian Régin, Elisabetta De Maria, Alexandre Bonlarron
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13490
Pdf URL: https://arxiv.org/pdf/2407.13490
Copy Paste: [[2407.13490]] Combining Constraint Programming Reasoning with Large Language Model Predictions(https://arxiv.org/abs/2407.13490)
Keywords: language model, llm
Abstract: Constraint Programming (CP) and Machine Learning (ML) face challenges in text generation due to CP's struggle with implementing "meaning" and ML's difficulty with structural constraints. This paper proposes a solution by combining both approaches and embedding a Large Language Model (LLM) in CP. The LLM handles word generation and meaning, while CP manages structural constraints. This approach builds on GenCP, an improved version of On-the-fly Constraint Programming Search (OTFS) using LLM-generated domains. Compared to Beam Search (BS), a standard NLP method, this combined approach (GenCP with LLM) is faster and produces better results, ensuring all constraints are satisfied. This fusion of CP and ML presents new possibilities for enhancing text generation under constraints.
摘要：约束编程 (CP) 和机器学习 (ML) 在文本生成方面面临挑战，因为 CP 难以实现“含义”，而 ML 难以处理结构约束。本文提出了一种解决方案，即结合两种方法并在 CP 中嵌入大型语言模型 (LLM)。LLM 处理单词生成和含义，而 CP 管理结构约束。这种方法建立在 GenCP 的基础上，GenCP 是使用 LLM 生成域的即时约束编程搜索 (OTFS) 的改进版本。与标准 NLP 方法 Beam Search (BS) 相比，这种组合方法（GenCP 与 LLM）速度更快，结果更好，确保满足所有约束。CP 和 ML 的融合为增强约束下的文本生成提供了新的可能性。

Title: Enhancing Biomedical Knowledge Discovery for Diseases: An End-To-End Open-Source Framework

Authors: Christos Theodoropoulos, Andrei Catalin Coman, James Henderson, Marie-Francine Moens
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13492
Pdf URL: https://arxiv.org/pdf/2407.13492
Copy Paste: [[2407.13492]] Enhancing Biomedical Knowledge Discovery for Diseases: An End-To-End Open-Source Framework(https://arxiv.org/abs/2407.13492)
Keywords: language model
Abstract: The ever-growing volume of biomedical publications creates a critical need for efficient knowledge discovery. In this context, we introduce an open-source end-to-end framework designed to construct knowledge around specific diseases directly from raw text. To facilitate research in disease-related knowledge discovery, we create two annotated datasets focused on Rett syndrome and Alzheimer's disease, enabling the identification of semantic relations between biomedical entities. Extensive benchmarking explores various ways to represent relations and entity representations, offering insights into optimal modeling strategies for semantic relation detection and highlighting language models' competence in knowledge discovery. We also conduct probing experiments using different layer representations and attention scores to explore transformers' ability to capture semantic relations.
摘要：生物医学出版物数量不断增长，对高效的知识发现产生了迫切的需求。在此背景下，我们引入了一个开源端到端框架，旨在直接从原始文本构建特定疾病的知识。为了促进与疾病相关的知识发现研究，我们创建了两个带注释的数据集，重点关注雷特综合征和阿尔茨海默病，从而能够识别生物医学实体之间的语义关系。广泛的基准测试探索了表示关系和实体表示的各种方式，为语义关系检测的最佳建模策略提供了见解，并强调了语言模型在知识发现方面的能力。我们还使用不同的层表示和注意力分数进行探索性实验，以探索 Transformer 捕获语义关系的能力。

Title: Can Open-Source LLMs Compete with Commercial Models? Exploring the Few-Shot Performance of Current GPT Models in Biomedical Tasks

Authors: Samy Ateia, Udo Kruschwitz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13511
Pdf URL: https://arxiv.org/pdf/2407.13511
Copy Paste: [[2407.13511]] Can Open-Source LLMs Compete with Commercial Models? Exploring the Few-Shot Performance of Current GPT Models in Biomedical Tasks(https://arxiv.org/abs/2407.13511)
Keywords: language model, gpt, llm, chat, retrieval augmented generation
Abstract: Commercial large language models (LLMs), like OpenAI's GPT-4 powering ChatGPT and Anthropic's Claude 3 Opus, have dominated natural language processing (NLP) benchmarks across different domains. New competing Open-Source alternatives like Mixtral 8x7B or Llama 3 have emerged and seem to be closing the gap while often offering higher throughput and being less costly to use. Open-Source LLMs can also be self-hosted, which makes them interesting for enterprise and clinical use cases where sensitive data should not be processed by third parties. We participated in the 12th BioASQ challenge, which is a retrieval augmented generation (RAG) setting, and explored the performance of current GPT models Claude 3 Opus, GPT-3.5-turbo and Mixtral 8x7b with in-context learning (zero-shot, few-shot) and QLoRa fine-tuning. We also explored how additional relevant knowledge from Wikipedia added to the context-window of the LLM might improve their performance. Mixtral 8x7b was competitive in the 10-shot setting, both with and without fine-tuning, but failed to produce usable results in the zero-shot setting. QLoRa fine-tuning and Wikipedia context did not lead to measurable performance gains. Our results indicate that the performance gap between commercial and open-source models in RAG setups exists mainly in the zero-shot setting and can be closed by simply collecting few-shot examples for domain-specific use cases. The code needed to rerun these experiments is available through GitHub.
摘要：商业大型语言模型 (LLM)，例如支持 ChatGPT 的 OpenAI GPT-4 和 Anthropic 的 Claude 3 Opus，在不同领域的自然语言处理 (NLP) 基准测试中占据主导地位。Mixtral 8x7B 或 Llama 3 等新的开源替代方案已经出现，似乎正在缩小差距，同时通常提供更高的吞吐量并且使用成本更低。开源 LLM 还可以自托管，这使得它们对于敏感数据不应由第三方处理的企业和临床用例很有吸引力。我们参加了第 12 届 BioASQ 挑战赛，这是一个检索增强生成 (RAG) 设置，并探索了当前 GPT 模型 Claude 3 Opus、GPT-3.5-turbo 和 Mixtral 8x7b 在上下文学习（零样本、少样本）和 QLoRa 微调方面的性能。我们还探讨了将维基百科中的其他相关知识添加到 LLM 的上下文窗口如何提高其性能。Mixtral 8x7b 在 10 次测试设置中具有竞争力，无论是否进行微调，但在零次测试设置中未能产生可用的结果。QLoRa 微调和维基百科上下文并未带来可衡量的性能提升。我们的结果表明，RAG 设置中的商业模型和开源模型之间的性能差距主要存在于零次测试设置中，只需为特定领域的用例收集少量样本即可缩小差距。重新运行这些实验所需的代码可通过 GitHub 获得。

Title: Research on Tibetan Tourism Viewpoints information generation system based on LLM

Authors: Jinhu Qi, Shuai Yan, Wentao Zhang, Yibo Zhang, Zirui Liu, Ke Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13561
Pdf URL: https://arxiv.org/pdf/2407.13561
Copy Paste: [[2407.13561]] Research on Tibetan Tourism Viewpoints information generation system based on LLM(https://arxiv.org/abs/2407.13561)
Keywords: language model, llm
Abstract: Tibet, ensconced within China's territorial expanse, is distinguished by its labyrinthine and heterogeneous topography, a testament to its profound historical heritage, and the cradle of a unique religious ethos. The very essence of these attributes, however, has impeded the advancement of Tibet's tourism service infrastructure, rendering existing smart tourism services inadequate for the region's visitors. This study delves into the ramifications of informational disparities at tourist sites on Tibetan tourism and addresses the challenge of establishing the Large Language Model (LLM) evaluation criteria. It introduces an innovative approach, the DualGen Bridge AI system, employing supervised fine-tuning techniques to bolster model functionality and enhance optimization processes. Furthermore, it pioneers a multi-structured generative results assessment framework. Empirical validation confirms the efficacy of this framework. The study also explores the application of the supervised fine-tuning method within the proprietary DualGen Bridge AI, aimed at refining the generation of tourist site information. The study's findings offer valuable insights for optimizing system performance and provide support and inspiration for the application of LLM technology in Tibet's tourism services and beyond, potentially revolutionizing the smart tourism industry with advanced, tailored information generation capabilities.
摘要：西藏位于中国领土辽阔的土地上，地形复杂多样，是其深厚历史遗产的见证，也是独特宗教精神的发源地。然而，这些属性的本质阻碍了西藏旅游服务基础设施的发展，导致现有的智能旅游服务无法满足该地区游客的需求。本研究深入探讨了旅游景点信息差异对西藏旅游业的影响，并解决了建立大型语言模型 (LLM) 评估标准的挑战。它引入了一种创新方法，即 DualGen Bridge AI 系统，采用监督微调技术来增强模型功能并增强优化过程。此外，它开创了一种多结构生成结果评估框架。实证验证证实了该框架的有效性。本研究还探讨了在专有 DualGen Bridge AI 中应用监督微调方法，旨在改进旅游景点信息的生成。该研究的结果为优化系统性能提供了宝贵的见解，并为 LLM 技术在西藏旅游服务及其他领域的应用提供了支持和启发，有可能以先进的、定制的信息生成能力彻底改变智能旅游行业。

Title: dzFinNlp at AraFinNLP: Improving Intent Detection in Financial Conversational Agents

Authors: Mohamed Lichouri, Khaled Lounnas, Mohamed Zakaria Amziane
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13565
Pdf URL: https://arxiv.org/pdf/2407.13565
Copy Paste: [[2407.13565]] dzFinNlp at AraFinNLP: Improving Intent Detection in Financial Conversational Agents(https://arxiv.org/abs/2407.13565)
Keywords: agent
Abstract: In this paper, we present our dzFinNlp team's contribution for intent detection in financial conversational agents, as part of the AraFinNLP shared task. We experimented with various models and feature configurations, including traditional machine learning methods like LinearSVC with TF-IDF, as well as deep learning models like Long Short-Term Memory (LSTM). Additionally, we explored the use of transformer-based models for this task. Our experiments show promising results, with our best model achieving a micro F1-score of 93.02% and 67.21% on the ArBanking77 dataset, in the development and test sets, respectively.
摘要：在本文中，我们介绍了 dzFinNlp 团队在金融对话代理意图检测方面的贡献，这是 AraFinNLP 共享任务的一部分。我们尝试了各种模型和特征配置，包括传统的机器学习方法（如具有 TF-IDF 的 LinearSVC）以及深度学习模型（如长短期记忆 (LSTM)）。此外，我们还探索了使用基于 Transformer 的模型来完成这项任务。我们的实验显示出了令人鼓舞的结果，我们的最佳模型在开发和测试集上分别在 ArBanking77 数据集上实现了 93.02% 和 67.21% 的微 F1 分数。
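For readers unfamiliar with the metric reported above, here is a minimal sketch of a micro-averaged F1 computation. The intent labels are made up for illustration and are not taken from ArBanking77; for single-label classification like this, micro-F1 coincides with accuracy.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Micro-averaged F1: pool true positives, false positives and false
    negatives over all labels, then compute a single F1 score."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up intent labels for four utterances.
y_true = ["balance", "transfer", "card", "card"]
y_pred = ["balance", "card", "card", "card"]
score = micro_f1(y_true, y_pred)
```

Three of the four predictions are correct, so the pooled counts are 3 true positives, 1 false positive and 1 false negative.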

Title: Large Language Models as Reliable Knowledge Bases?

Authors: Danna Zheng, Mirella Lapata, Jeff Z. Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13578
Pdf URL: https://arxiv.org/pdf/2407.13578
Copy Paste: [[2407.13578]] Large Language Models as Reliable Knowledge Bases?(https://arxiv.org/abs/2407.13578)
Keywords: language model, gpt, llm
Abstract: The NLP community has recently shown a growing interest in leveraging Large Language Models (LLMs) for knowledge-intensive tasks, viewing LLMs as potential knowledge bases (KBs). However, the reliability and extent to which LLMs can function as KBs remain underexplored. While previous studies suggest LLMs can encode knowledge within their parameters, the amount of parametric knowledge alone is not sufficient to evaluate their effectiveness as KBs. This study defines criteria that a reliable LLM-as-KB should meet, focusing on factuality and consistency, and covering both seen and unseen knowledge. We develop several metrics based on these criteria and use them to evaluate 26 popular LLMs, while providing a comprehensive analysis of the effects of model size, instruction tuning, and in-context learning (ICL). Our results paint a worrying picture. Even a high-performant model like GPT-3.5-turbo is not factual or consistent, and strategies like ICL and fine-tuning are unsuccessful at making LLMs better KBs.
摘要：NLP 社区最近表现出对利用大型语言模型 (LLM) 执行知识密集型任务的兴趣日益浓厚，将 LLM 视为潜在的知识库 (KB)。然而，LLM 作为 KB 的可靠性和功能范围仍未得到充分探索。虽然之前的研究表明 LLM 可以在其参数内编码知识，但仅凭参数知识的数量不足以评估其作为 KB 的有效性。本研究定义了可靠的 LLM-as-KB 应满足的标准，重点关注事实性和一致性，涵盖可见和不可见的知识。我们根据这些标准制定了几个指标，并使用它们来评估 26 个流行的 LLM，同时对模型大小、指令调整和上下文学习 (ICL) 的影响进行了全面分析。我们的结果描绘了一幅令人担忧的画面。即使是像 GPT-3.5-turbo 这样的高性能模型也不是事实或一致的，而 ICL 和微调等策略无法使 LLM 成为更好的 KB。

Title: Towards Zero-Shot Multimodal Machine Translation

Authors: Matthieu Futeral, Cordelia Schmid, Benoît Sagot, Rachel Bawden
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13579
Pdf URL: https://arxiv.org/pdf/2407.13579
Copy Paste: [[2407.13579]] Towards Zero-Shot Multimodal Machine Translation(https://arxiv.org/abs/2407.13579)
Keywords: language model
Abstract: Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We obtain disambiguation performance close to state-of-the-art MMT models trained additionally on fully supervised examples. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible.
摘要：当前的多模态机器翻译 (MMT) 系统依赖于完全监督的数据（即，使用带有翻译和附带图像的句子来训练模型）。但是，这种类型的数据收集成本很高，限制了 MMT 向不存在此类数据的其他语言对的扩展。在这项工作中，我们提出了一种方法，无需使用完全监督的数据来训练 MMT 系统，而只需使用多模态英语数据。我们的方法称为 ZeroMMT，它通过在两个目标的混合上训练强大的纯文本机器翻译 (MT) 模型来调整它：视觉条件屏蔽语言建模和原始和新 MMT 输出之间的 Kullback-Leibler 散度。我们根据标准 MMT 基准和最近发布的 CoMMuTE 进行评估，CoMMuTE 是一个对比基准，旨在评估模型使用图像来消除英语句子歧义的效果如何。我们获得的消歧性能接近在完全监督的示例上额外训练的最先进的 MMT 模型。为了证明我们的方法可以推广到没有完全监督训练数据的语言，我们将 CoMMuTE 评估数据集扩展到三种新语言：阿拉伯语、俄语和中文。我们进一步表明，我们可以在推理时使用无分类器指导来控制消歧能力和翻译保真度之间的权衡，而无需任何额外数据。我们的代码、数据和经过训练的模型都是公开的。
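Two ingredients named in the abstract above, a Kullback-Leibler divergence between output distributions and classifier-free guidance at inference time, can be sketched in a few lines. This is only an illustration under assumptions: the logits are made up, the KL direction shown is one possible choice, and treating the text-only logits as the unconditional branch with the common `uncond + scale * (cond - uncond)` form is an assumption rather than the paper's confirmed formulation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two categorical distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def guided_logits(
    text_only: List[float], with_image: List[float], scale: float
) -> List[float]:
    """Common classifier-free guidance form: move from the text-only logits
    toward (or beyond) the image-conditioned logits by a factor `scale`."""
    return [t + scale * (i - t) for t, i in zip(text_only, with_image)]


# Made-up next-token logits over a three-word vocabulary.
text_only = [2.0, 1.0, 0.0]
with_image = [1.0, 2.0, 0.0]

kl_same = kl_divergence(softmax(text_only), softmax(text_only))
kl_diff = kl_divergence(softmax(text_only), softmax(with_image))
same_as_image = guided_logits(text_only, with_image, 1.0)
amplified = guided_logits(text_only, with_image, 2.0)
```

A scale of 1.0 reproduces the image-conditioned logits, while larger values push further in the direction the image suggests, which is one way to trade translation fidelity against disambiguation.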

Title: PLANTS: A Novel Problem and Dataset for Summarization of Planning-Like (PL) Tasks

Authors: Vishal Pallagani, Biplav Srivastava, Nitin Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13597
Pdf URL: https://arxiv.org/pdf/2407.13597
Copy Paste: [[2407.13597]] PLANTS: A Novel Problem and Dataset for Summarization of Planning-Like (PL) Tasks(https://arxiv.org/abs/2407.13597)
Keywords: language model
Abstract: Text summarization is a well-studied problem that deals with deriving insights from unstructured text consumed by humans, and it has found extensive business applications. However, many real-life tasks involve generating a series of actions to achieve specific goals, such as workflows, recipes, dialogs, and travel plans. We refer to them as planning-like (PL) tasks, noting that the main commonality they share is control flow information, which may be partially specified. Their structure presents an opportunity to create more practical summaries to help users make quick decisions. We investigate this observation by introducing a novel plan summarization problem, presenting a dataset, and providing a baseline method for generating PL summaries. Using quantitative metrics and qualitative user studies to establish baselines, we evaluate the plan summaries from our method and large language models. We believe the novel problem and dataset can reinvigorate research in summarization, which some consider a solved problem.
摘要：文本摘要是一个研究得很好的问题，它涉及从人类使用的非结构化文本中获取见解，并且已经找到了广泛的商业应用。然而，许多现实生活中的任务涉及生成一系列操作以实现特定目标，例如工作流、食谱、对话和旅行计划。我们将它们称为规划类 (PL) 任务，并指出它们的主要共同点是控制流信息，而这些信息可能只是部分指定的。它们的结构提供了一个创建更实用的摘要的机会，以帮助用户快速做出决策。我们通过引入一个新颖的计划摘要问题、呈现一个数据集并提供生成 PL 摘要的基线方法来调查这一观察结果。使用定量指标和定性用户研究来建立基线，我们评估了我们的方法和大型语言模型中的计划摘要。我们相信这个新问题和数据集可以重振摘要研究，有些人认为这是一个已解决的问题。

Title: Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13623
Pdf URL: https://arxiv.org/pdf/2407.13623
Copy Paste: [[2407.13623]] Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies(https://arxiv.org/abs/2407.13623)
Keywords: language model, llm
Abstract: Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. Intuitively, larger vocabularies enable more efficient tokenization by representing sentences with fewer tokens, but they also increase the risk of under-fitting representations for rare tokens. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the same result that the optimal vocabulary size depends on the available compute budget and that larger models deserve larger vocabularies. However, most LLMs use vocabulary sizes that are too small. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work emphasizes the necessity of jointly considering model parameters and vocabulary size for efficient scaling.

Title: Weak-to-Strong Reasoning

Authors: Yuqing Yang, Yan Ma, Pengfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13647
Pdf URL: https://arxiv.org/pdf/2407.13647
Copy Paste: [[2407.13647]] Weak-to-Strong Reasoning(https://arxiv.org/abs/2407.13647)
Keywords: language model, llm
Abstract: When large language models (LLMs) exceed human-level capabilities, it becomes increasingly challenging to provide full-scale and accurate supervision for these models. Weak-to-strong learning, which leverages a less capable model to unlock the latent abilities of a stronger model, proves valuable in this context. Yet, the efficacy of this approach for complex reasoning tasks is still untested. Furthermore, tackling reasoning tasks under the weak-to-strong setting currently lacks efficient methods to avoid blindly imitating the weak supervisor, including its errors. In this paper, we introduce a progressive learning framework that enables the strong model to autonomously refine its training data, without requiring input from either a more advanced model or human-annotated data. This framework begins with supervised fine-tuning on a selective small but high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself. Extensive experiments on the GSM8K and MATH datasets demonstrate that our method significantly enhances the reasoning capabilities of Llama2-70b using three separate weak models. This method is further validated in a forward-looking experimental setup, where Llama3-8b-instruct effectively supervises Llama3-70b on the highly challenging OlympicArena dataset. This work paves the way for a more scalable and sophisticated strategy to enhance AI reasoning powers. All relevant code and resources are available at this https URL.

Title: FuLG: 150B Romanian Corpus for Language Model Pretraining

Authors: Vlad-Andrei Bădoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu, Alexandru Agache, Costin Raiciu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13657
Pdf URL: https://arxiv.org/pdf/2407.13657
Copy Paste: [[2407.13657]] FuLG: 150B Romanian Corpus for Language Model Pretraining(https://arxiv.org/abs/2407.13657)
Keywords: language model
Abstract: Research in the field of language models is rapidly evolving, with many open models being released to the public. Openly available pretraining corpora usually focus on only a handful of languages, with many others either missing completely or extremely underrepresented. In this report, we introduce FuLG, a hundred-fifty-billion-token Romanian corpus extracted from CommonCrawl. We present our methodology for filtering FuLG and compare it via ablation studies against existing Romanian corpora.

Title: DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

Authors: Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, Junxian He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13690
Pdf URL: https://arxiv.org/pdf/2407.13690
Copy Paste: [[2407.13690]] DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving(https://arxiv.org/abs/2407.13690)
Keywords: language model, gpt
Abstract: Solving mathematical problems requires advanced reasoning abilities and presents notable challenges for large language models. Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Hypothesizing that difficult queries are crucial to learn complex reasoning, we propose Difficulty-Aware Rejection Tuning (DART), a method that allocates difficult queries more trials during the synthesis phase, enabling more extensive training on difficult samples. Utilizing DART, we have created new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous ones. Remarkably, our synthesis process solely relies on a 7B-sized open-weight model, without reliance on the commonly used proprietary GPT-4. We fine-tune various base models, ranging from 7B to 70B in size, on our datasets, resulting in a series of strong models called DART-MATH. In comprehensive in-domain and out-of-domain evaluation on 6 mathematical benchmarks, DART-MATH outperforms vanilla rejection tuning significantly, being superior or comparable to prior work, despite using much smaller datasets and no proprietary models. Furthermore, our results position our synthetic datasets as the most effective and cost-efficient publicly available resources for advancing mathematical problem-solving.
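The abstract says DART "allocates difficult queries more trials" but does not spell out the allocation rule. The sketch below shows one plausible reading, offered as an assumption rather than the authors' implementation: keep sampling each query until a target number of correct responses is collected (or a trial cap is hit), so harder queries automatically consume more trials. The names `difficulty_aware_sampling`, `make_toy_sampler`, `target_correct`, and `max_trials` are hypothetical, and the toy sampler stands in for a real model.

```python
def difficulty_aware_sampling(queries, sample_fn, is_correct,
                              target_correct=2, max_trials=50):
    """Sample each query until `target_correct` correct responses are kept.

    Hard queries need more trials to reach the target, so they receive more
    sampling budget than under a fixed number of samples per query.
    """
    kept_responses = {}
    trials_used = {}
    for query in queries:
        kept = []
        trials = 0
        while len(kept) < target_correct and trials < max_trials:
            response = sample_fn(query)
            trials += 1
            if is_correct(query, response):
                kept.append(response)  # rejection step: keep only correct answers
        kept_responses[query] = kept
        trials_used[query] = trials
    return kept_responses, trials_used


def make_toy_sampler(period):
    """Hypothetical stand-in for a model: every period[q]-th attempt is correct."""
    attempts = {}

    def sample_fn(query):
        attempts[query] = attempts.get(query, 0) + 1
        return "right" if attempts[query] % period[query] == 0 else "wrong"

    return sample_fn


# Toy setup: the "hard" query is answered correctly only once every 5 attempts.
period = {"easy": 1, "hard": 5}
kept, trials = difficulty_aware_sampling(
    queries=["easy", "hard"],
    sample_fn=make_toy_sampler(period),
    is_correct=lambda query, response: response == "right",
)
print(trials)
```

With a fixed budget of, say, two samples per query, the hard query in this toy setup would contribute no correct responses at all, which is the bias towards easy queries that the abstract describes.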

Title: Prover-Verifier Games improve legibility of LLM outputs

Authors: Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, Yuri Burda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13692
Pdf URL: https://arxiv.org/pdf/2407.13692
Copy Paste: [[2407.13692]] Prover-Verifier Games improve legibility of LLM outputs(https://arxiv.org/abs/2407.13692)
Keywords: language model, llm, chain-of-thought
Abstract: One way to increase confidence in the outputs of Large Language Models (LLMs) is to support them with reasoning that is clear and easy to check -- a property we call legibility. We study legibility in the context of solving grade-school math problems and show that optimizing chain-of-thought solutions only for answer correctness can make them less legible. To mitigate the loss in legibility, we propose a training algorithm inspired by the Prover-Verifier Game from Anil et al. (2021). Our algorithm iteratively trains small verifiers to predict solution correctness, "helpful" provers to produce correct solutions that the verifier accepts, and "sneaky" provers to produce incorrect solutions that fool the verifier. We find that the helpful prover's accuracy and the verifier's robustness to adversarial attacks increase over the course of training. Furthermore, we show that legibility training transfers to time-constrained humans tasked with verifying solution correctness. Over the course of LLM training, human accuracy increases when checking the helpful prover's solutions, and decreases when checking the sneaky prover's solutions. Hence, training for checkability by small verifiers is a plausible technique for increasing output legibility. Our results suggest legibility training against small verifiers as a practical avenue for increasing legibility of large LLMs to humans, and thus could help with alignment of superhuman models.

Title: Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation

Authors: Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13696
Pdf URL: https://arxiv.org/pdf/2407.13696
Copy Paste: [[2407.13696]] Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation(https://arxiv.org/abs/2407.13696)
Keywords: language model, llm
Abstract: Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and upending the ability to properly choose the appropriate benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of conclusions. To address these inconsistencies, we propose a set of best practices for BAT and demonstrate how utilizing these methodologies greatly improves BAT robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers. Our findings underscore the necessity for standardized BAT, ensuring the robustness and validity of benchmark evaluations in the evolving landscape of language model research. BenchBench Package: this https URL Leaderboard: this https URL
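The abstract names rank correlation as a typical agreement metric for BAT. As an illustration only, the sketch below computes Kendall's tau (the tau-a variant, with no tie correction) between the scores that two benchmarks assign to the same set of models. It does not use or reproduce the BenchBench package; the function name `kendall_tau` and all scores are made up for the example.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same models.

    Counts model pairs that both benchmarks order the same way (concordant)
    versus the opposite way (discordant). Tied pairs count as neither.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both benchmarks must score the same models")
    if len(scores_a) < 2:
        raise ValueError("need at least two models to compare rankings")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four models on a new and an established benchmark.
new_benchmark = [0.61, 0.55, 0.72, 0.40]
established_benchmark = [70.0, 66.0, 68.0, 50.0]
tau = kendall_tau(new_benchmark, established_benchmark)
print(f"Kendall tau = {tau:.3f}")
```

The result depends on which models are included in the comparison, so a single coefficient computed on a handful of models should be read with caution.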

Title: ANHALTEN: Cross-Lingual Transfer for German Token-Level Reference-Free Hallucination Detection

Authors: Janek Herrlein, Chia-Chien Hung, Goran Glavaš
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13702
Pdf URL: https://arxiv.org/pdf/2407.13702
Copy Paste: [[2407.13702]] ANHALTEN: Cross-Lingual Transfer for German Token-Level Reference-Free Hallucination Detection(https://arxiv.org/abs/2407.13702)
Keywords: hallucination
Abstract: Research on token-level reference-free hallucination detection has predominantly focused on English, primarily due to the scarcity of robust datasets in other languages. This has hindered systematic investigations into the effectiveness of cross-lingual transfer for this important NLP application. To address this gap, we introduce ANHALTEN, a new evaluation dataset that extends the English hallucination detection dataset to German. To the best of our knowledge, this is the first work that explores cross-lingual transfer for token-level reference-free hallucination detection. ANHALTEN contains gold annotations in German that are parallel (i.e., directly comparable to the original English instances). We benchmark several prominent cross-lingual transfer approaches, demonstrating that larger context length leads to better hallucination detection in German, even without succeeding context. Importantly, we show that the sample-efficient few-shot transfer is the most effective approach in most setups. This highlights the practical benefits of minimal annotation effort in the target language for reference-free hallucination detection. Aiming to catalyze future research on cross-lingual token-level reference-free hallucination detection, we make ANHALTEN publicly available: this https URL

Title: Understanding Reference Policies in Direct Preference Optimization

Authors: Yixin Liu, Pengfei Liu, Arman Cohan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.13709
Pdf URL: https://arxiv.org/pdf/2407.13709
Copy Paste: [[2407.13709]] Understanding Reference Policies in Direct Preference Optimization(https://arxiv.org/abs/2407.13709)
Keywords: language model, llm
Abstract: Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO - its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO's effectiveness. Therefore, we address three related research questions in this work. First, we explore the optimal strength of the KL-divergence constraint in DPO, which penalizes deviations from the reference policy, and find that DPO is sensitive to this strength. Next, we examine the necessity of reference policies for instruction fine-tuning by providing both theoretical and empirical comparisons between DPO and related learning objectives, demonstrating DPO's superiority. Additionally, we investigate whether DPO benefits from stronger reference policies, finding that a stronger reference policy can lead to improved performance, but only when it is similar to the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO and offer insights for best practices, while also identifying open research questions for future studies.
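To make the role of the reference policy and the constraint strength concrete, the sketch below evaluates the DPO loss as it is commonly stated, for a single preference pair, from sequence log-probabilities. It is a minimal illustration, not code from the paper: the function name `dpo_loss`, the log-probability values, and the two `beta` settings are assumptions chosen for the example.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy prefers the chosen response
    over the rejected one, relative to the reference policy. `beta` scales
    this margin and controls how strongly deviations from the reference
    policy are penalized.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probabilities: the policy favors the chosen response
# more strongly than the reference policy does.
example = dict(policy_logp_chosen=-4.0, policy_logp_rejected=-9.0,
               ref_logp_chosen=-5.0, ref_logp_rejected=-7.0)
for beta in (0.1, 0.5):
    print(f"beta={beta}: loss={dpo_loss(**example, beta=beta):.4f}")
```

When the policy equals the reference policy the margin is zero and the loss is log 2 for any `beta`, so the reference policy fixes the starting point from which improvement is measured.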

Title: Baba Is AI: Break the Rules to Beat the Benchmark

Authors: Nathan Cloos, Meagan Jens, Michelangelo Naim, Yen-Ling Kuo, Ignacio Cases, Andrei Barbu, Christopher J. Cueva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13729
Pdf URL: https://arxiv.org/pdf/2407.13729
Copy Paste: [[2407.13729]] Baba Is AI: Break the Rules to Beat the Benchmark(https://arxiv.org/abs/2407.13729)
Keywords: language model, gpt, agent
Abstract: Humans solve problems by following existing rules and procedures, and also by leaps of creativity to redefine those rules and objectives. To probe these abilities, we developed a new benchmark based on the game Baba Is You where an agent manipulates both objects in the environment and rules, represented by movable tiles with words written on them, to reach a specified goal and win the game. We test three state-of-the-art multi-modal large language models (OpenAI GPT-4o, Google Gemini-1.5-Pro and Gemini-1.5-Flash) and find that they fail dramatically when generalization requires manipulating and combining the rules of the game.

Title: LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation

Authors: David Schlangen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13744
Pdf URL: https://arxiv.org/pdf/2407.13744
Copy Paste: [[2407.13744]] LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation(https://arxiv.org/abs/2407.13744)
Keywords: llm, prompt
Abstract: Natural Language Processing has moved rather quickly from modelling specific tasks to taking more general pre-trained models and fine-tuning them for specific tasks, to a point where we now have what appear to be inherently generalist models. This paper argues that the resultant loss of clarity on what these models model leads to metaphors like "artificial general intelligences" that are not helpful for evaluating their strengths and weaknesses. The proposal is to see their generality, and their potential value, in their ability to approximate specialist function, based on a natural language specification. This framing brings to the fore questions of the quality of the approximation, but beyond that, also questions of discoverability, stability, and protectability of these functions. As the paper will show, this framing hence brings together in one conceptual framework various aspects of evaluation, both from a practical and a theoretical perspective, as well as questions often relegated to a secondary status (such as "prompt injection" and "jailbreaking").

Title: Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models

Authors: Zhuo Chen, Jiawei Liu, Haotan Liu, Qikai Cheng, Fan Zhang, Wei Lu, Xiaozhong Liu
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2407.13757
Pdf URL: https://arxiv.org/pdf/2407.13757
Copy Paste: [[2407.13757]] Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models(https://arxiv.org/abs/2407.13757)
Keywords: language model, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is applied to solve hallucination problems and real-time constraints of large language models, but it also introduces vulnerabilities to retrieval corruption attacks. Existing research mainly explores the unreliability of RAG in white-box and closed-domain QA tasks. In this paper, we aim to reveal the vulnerabilities of Retrieval-Augmented Generation (RAG) models when faced with black-box attacks for opinion manipulation. We explore the impact of such attacks on user cognition and decision-making, providing new insight to enhance the reliability and security of RAG models. We manipulate the ranking results of the retrieval model in RAG with instruction and use these results as data to train a surrogate model. By applying adversarial retrieval attack methods to the surrogate model, black-box transfer attacks on RAG are further realized. Experiments conducted on opinion datasets across multiple topics show that the proposed attack strategy can significantly alter the opinion polarity of the content generated by RAG. This demonstrates the model's vulnerability and, more importantly, reveals the potential negative impact on user cognition and decision-making, making it easier to mislead users into accepting incorrect or biased information.

Title: Latent Causal Probing: A Formal Perspective on Probing with Causal Models of Data

Authors: Charles Jin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13765
Pdf URL: https://arxiv.org/pdf/2407.13765
Copy Paste: [[2407.13765]] Latent Causal Probing: A Formal Perspective on Probing with Causal Models of Data(https://arxiv.org/abs/2407.13765)
Keywords: language model
Abstract: As language models (LMs) deliver increasing performance on a range of NLP tasks, probing classifiers have become an indispensable technique in the effort to better understand their inner workings. A typical setup involves (1) defining an auxiliary task consisting of a dataset of text annotated with labels, then (2) supervising small classifiers to predict the labels from the representations of a pretrained LM as it processed the dataset. A high probing accuracy is interpreted as evidence that the LM has learned to perform the auxiliary task as an unsupervised byproduct of its original pretraining objective. Despite the widespread usage of probes, however, the robust design and analysis of probing experiments remains a challenge. We develop a formal perspective on probing using structural causal models (SCM). Specifically, given an SCM which explains the distribution of tokens observed during training, we frame the central hypothesis as whether the LM has learned to represent the latent variables of the SCM. Empirically, we extend a recent study of LMs in the context of a synthetic grid-world navigation task, where having an exact model of the underlying causal structure allows us to draw strong inferences from the result of probing experiments. Our techniques provide robust empirical evidence for the ability of LMs to learn the latent causal concepts underlying text.