2024-09-04

Title: Automating Knowledge Discovery from Scientific Literature via LLMs: A Dual-Agent Approach with Progressive Ontology Prompting

Authors: Yuting Hu, Dancheng Liu, Qingyun Wang, Charles Yu, Heng Ji, Jinjun Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00054
Pdf URL: https://arxiv.org/pdf/2409.00054
Copy Paste: [[2409.00054]] Automating Knowledge Discovery from Scientific Literature via LLMs: A Dual-Agent Approach with Progressive Ontology Prompting(https://arxiv.org/abs/2409.00054)
Keywords: language model, llm, prompt, agent
Abstract: To address the challenge of automating knowledge discovery from a vast volume of literature, in this paper, we introduce a novel framework based on large language models (LLMs) that combines a progressive ontology prompting (POP) algorithm with a dual-agent system, named LLM-Duo, designed to enhance the automation of knowledge extraction from scientific articles. The POP algorithm utilizes a prioritized breadth-first search (BFS) across a predefined ontology to generate structured prompt templates and action orders, thereby guiding LLMs to discover knowledge in an automatic manner. Additionally, our LLM-Duo employs two specialized LLM agents: an explorer and an evaluator. These two agents work collaboratively and adversarially to enhance the reliability of the discovery and annotation processes. Experiments demonstrate that our method outperforms advanced baselines, enabling more accurate and complete annotations. To validate the effectiveness of our method in real-world scenarios, we employ our method in a case study of speech-language intervention discovery. Our method identifies 2,421 interventions from 64,177 research articles in the speech-language therapy domain. We curate these findings into a publicly accessible intervention knowledge base that holds significant potential to benefit the speech-language therapy community.
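
Example (illustrative, not from the paper): the abstract describes POP as a prioritized breadth-first search over a predefined ontology that yields prompt templates and an action order. A minimal sketch of that idea follows, assuming a toy ontology, integer priorities, and prompt wording that are all hypothetical. The paper's actual ontology, priority rule, and templates are not given in the abstract.

```python
import heapq
from itertools import count

# Hypothetical toy ontology: concept -> list of (relation, child concept, priority).
# Lower priority numbers are visited first among nodes at the same depth.
ONTOLOGY = {
    "Intervention": [
        ("targets", "Disorder", 1),
        ("delivered_by", "Provider", 2),
    ],
    "Disorder": [("affects", "AgeGroup", 1)],
    "Provider": [],
    "AgeGroup": [],
}


def prioritized_bfs_prompts(ontology, root):
    """Return prompt templates in prioritized breadth-first order."""
    tie = count()  # unique counter so heap entries never compare beyond it
    heap = [(0, 0, next(tie), None, None, root)]
    seen = {root}
    prompts = []
    while heap:
        depth, _prio, _, parent, relation, node = heapq.heappop(heap)
        if parent is None:
            prompts.append(f"Identify every {node} described in the article.")
        else:
            prompts.append(
                f"For each {parent} found, what {node} is linked by '{relation}'?"
            )
        for rel, child, prio in ontology.get(node, []):
            if child not in seen:
                seen.add(child)
                heapq.heappush(heap, (depth + 1, prio, next(tie), node, rel, child))
    return prompts


order = prioritized_bfs_prompts(ONTOLOGY, "Intervention")
```

Sorting the heap on (depth, priority) keeps the traversal level by level, as in plain BFS, while letting higher-priority siblings be prompted first.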

Title: Understanding Literary Texts by LLMs: A Case Study of Ancient Chinese Poetry

Authors: Cheng Zhao, Bin Wang, Zhen Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00060
Pdf URL: https://arxiv.org/pdf/2409.00060
Copy Paste: [[2409.00060]] Understanding Literary Texts by LLMs: A Case Study of Ancient Chinese Poetry(https://arxiv.org/abs/2409.00060)
Keywords: language model, llm
Abstract: The birth and rapid development of large language models (LLMs) have caused quite a stir in the field of literature. Once considered unattainable, AI's role in literary creation is increasingly becoming a reality. In genres such as poetry, jokes, and short stories, numerous AI tools have emerged, offering refreshing new perspectives. However, it is difficult to further improve the quality of these works. This is primarily because understanding and appreciating a good literary work involves a considerable threshold, such as knowledge of literary theory, aesthetic sensibility, and interdisciplinary knowledge. Therefore, authoritative data in this area is quite lacking. Additionally, evaluating literary works is often complex and hard to fully quantify, which directly hinders the further development of AI creation. To address this issue, this paper attempts to explore the mysteries of literary texts from the perspective of LLMs, using ancient Chinese poetry as an example for experimentation. First, we collected a variety of ancient poems from different sources and had experts annotate a small portion of them. Then, we designed a range of comprehension metrics based on LLMs to evaluate all these poems. Finally, we analyzed the correlations and differences between various poem collections to identify literary patterns. Through our experiments, we observed a series of enlightening phenomena that provide technical support for the future development of high-level literary creation based on LLMs.

Title: Urban Mobility Assessment Using LLMs

Authors: Prabin Bhandari, Antonios Anastasopoulos, Dieter Pfoser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00063
Pdf URL: https://arxiv.org/pdf/2409.00063
Copy Paste: [[2409.00063]] Urban Mobility Assessment Using LLMs(https://arxiv.org/abs/2409.00063)
Keywords: language model, llm, prompt
Abstract: Understanding urban mobility patterns and analyzing how people move around cities helps improve the overall quality of life and supports the development of more livable, efficient, and sustainable urban areas. A challenging aspect of this work is the collection of mobility data by means of user tracking or travel surveys, given the associated privacy concerns, noncompliance, and high cost. This work proposes an innovative AI-based approach for synthesizing travel surveys by prompting large language models (LLMs), aiming to leverage their vast amount of relevant background knowledge and text generation capabilities. Our study evaluates the effectiveness of this approach across various U.S. metropolitan areas by comparing the results against existing survey data at different granularity levels. These levels include (i) pattern level, which compares aggregated metrics like the average number of locations traveled and travel time, (ii) trip level, which focuses on comparing trips as whole units using transition probabilities, and (iii) activity chain level, which examines the sequence of locations visited by individuals. Our work covers several proprietary and open-source LLMs, revealing that open-source base models like Llama-2, when fine-tuned on even a limited amount of actual data, can generate synthetic data that closely mimics the actual travel survey data, and as such provides an argument for using such data in mobility studies.
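
Example (illustrative, not from the paper): the trip-level comparison in the abstract relies on transition probabilities between locations. A minimal sketch of estimating them from activity chains is shown below. The location labels and the three sample chains are hypothetical, and the paper may define trips and locations differently.

```python
from collections import Counter, defaultdict


def transition_probabilities(chains):
    """Estimate P(next location | current location) from activity chains."""
    counts = defaultdict(Counter)
    for chain in chains:
        # Each consecutive pair of locations in a chain is one observed trip.
        for origin, destination in zip(chain, chain[1:]):
            counts[origin][destination] += 1
    return {
        origin: {dest: n / sum(dests.values()) for dest, n in dests.items()}
        for origin, dests in counts.items()
    }


# Hypothetical activity chains, one list per surveyed person.
chains = [
    ["home", "work", "home"],
    ["home", "school", "home"],
    ["home", "work", "shop", "home"],
]
probs = transition_probabilities(chains)
```

Computing the same table for the synthetic and the real survey would allow a row-by-row comparison of the two sets of trips.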

Title: Learning to Plan Long-Term for Language Modeling

Authors: Florian Mai, Nathan Cornille, Marie-Francine Moens
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00070
Pdf URL: https://arxiv.org/pdf/2409.00070
Copy Paste: [[2409.00070]] Learning to Plan Long-Term for Language Modeling(https://arxiv.org/abs/2409.00070)
Keywords: language model
Abstract: Modern language models predict the next token in the sequence by considering the past text through a powerful function such as attention. However, language models have no explicit mechanism that allows them to spend computation time for planning long-distance future text, leading to a suboptimal token prediction. In this paper, we propose a planner that predicts a latent plan for many sentences into the future. By sampling multiple plans at once, we condition the language model on an accurate approximation of the distribution of text continuations, which leads to better next token prediction accuracy. In effect, this allows trading computation time for prediction accuracy.

Title: Are LLM-based methods good enough for detecting unfair terms of service?

Authors: Mirgita Frasheri, Arian Bakhtiarnia, Lukas Esterle, Aleksandros Iosifidis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00077
Pdf URL: https://arxiv.org/pdf/2409.00077
Copy Paste: [[2409.00077]] Are LLM-based methods good enough for detecting unfair terms of service?(https://arxiv.org/abs/2409.00077)
Keywords: language model, gpt, llm, chat
Abstract: Countless terms of service (ToS) are signed every day by users all over the world while interacting with all kinds of apps and websites. More often than not, these online contracts spanning double-digit pages are signed blindly by users who simply want immediate access to the desired service. What would normally require a consultation with a legal team has now become a mundane activity consisting of a few clicks where users potentially sign away their rights, for instance in terms of their data privacy, to countless online entities/companies. Large language models (LLMs) are good at parsing long text-based documents, and could potentially be adopted to help users when dealing with dubious clauses in ToS and their underlying privacy policies. To investigate the utility of existing models for this task, we first build a dataset consisting of 12 questions applied individually to a set of privacy policies crawled from popular websites. Thereafter, a series of open-source and commercial chatbots, such as ChatGPT, are queried on each question, with the answers being compared to a given ground truth. Our results show that some open-source models are able to provide a higher accuracy compared to some commercial models. However, the best performance is recorded from a commercial chatbot (ChatGPT4). Overall, all models perform only slightly better than random at this task. Consequently, their performance needs to be significantly improved before they can be adopted at large for this purpose.

Title: Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical, Introspective Multi-Agent Framework for Open-Domain Question Answering

Authors: Sagar Srinivas Sakhinana, Geethan Sannidhi, Venkataramana Runkana
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00082
Pdf URL: https://arxiv.org/pdf/2409.00082
Copy Paste: [[2409.00082]] Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical, Introspective Multi-Agent Framework for Open-Domain Question Answering(https://arxiv.org/abs/2409.00082)
Keywords: gpt, prompt, retrieval augmented generation, agent
Abstract: In the chemical and process industries, Process Flow Diagrams (PFDs) and Piping and Instrumentation Diagrams (P&IDs) are critical for design, construction, and maintenance. Recent advancements in Generative AI, such as Large Multimodal Models (LMMs) like GPT4 (Omni), have shown promise in understanding and interpreting process diagrams for Visual Question Answering (VQA). However, proprietary models pose data privacy risks, and their computational complexity prevents knowledge editing for domain-specific customization on consumer hardware. To overcome these challenges, we propose a secure, on-premises enterprise solution using a hierarchical, multi-agent Retrieval Augmented Generation (RAG) framework for open-domain question answering (ODQA) tasks, offering enhanced data privacy, explainability, and cost-effectiveness. Our novel multi-agent framework employs introspective and specialized sub-agents using open-source, small-scale multimodal models with the ReAct (Reason+Act) prompting technique for PFD and P&ID analysis, integrating multiple information sources to provide accurate and contextually relevant answers. Our approach, supported by iterative self-correction, aims to deliver superior performance in ODQA tasks. We conducted rigorous experimental studies, and the empirical results validated the effectiveness of the proposed approach.

Title: Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models

Authors: Seyed Amir Ahmad Safavi-Naini, Shuhaib Ali, Omer Shahab, Zahra Shahhoseini, Thomas Savage, Sara Rafiee, Jamil S Samaan, Reem Al Shabeeb, Farah Ladak, Jamie O Yang, Juan Echavarria, Sumbal Babar, Aasma Shaukat, Samuel Margolis, Nicholas P Tatonetti, Girish Nadkarni, Bara El Kurdi, Ali Soroush
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00084
Pdf URL: https://arxiv.org/pdf/2409.00084
Copy Paste: [[2409.00084]] Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models(https://arxiv.org/abs/2409.00084)
Keywords: language model, gpt, llm, prompt
Abstract: Background and Aims: This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology. Methods: We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images, to systematically assess the impact of model configurations and parameters and prompt engineering strategies utilizing GPT-3.5. Next, we assessed the performance of proprietary and open-source LLMs (versions), including GPT (3.5, 4, 4o, 4omini), Claude (3, 3.5), Gemini (1.0), Mistral, Llama (2, 3, 3.1), Mixtral, and Phi (3), across different interfaces (web and API), computing environments (cloud and local), and model precisions (with and without quantization). Finally, we assessed accuracy using a semiautomated pipeline. Results: Among the proprietary models, GPT-4o (73.7%) and Claude3.5-Sonnet (74.0%) achieved the highest accuracy, whereas Llama3-70b (54.7%) and Mixtral8x7b (54.3%) were the most accurate open-source models. Among the quantized open-source models, the 6-bit quantized Phi3-14b (48.7%) performed best. The scores of the quantized models were comparable to those of the full-precision models Llama2-7b, Llama2-13b, and Gemma2-9b. Notably, VLM performance on image-containing questions did not improve when the images were provided and worsened when LLM-generated captions were provided. In contrast, a 10% increase in accuracy was observed when images were accompanied by one-sentence human-crafted image descriptions. Conclusion: While LLMs exhibit robust zero-shot performance in medical reasoning, the integration of visual data remains a challenge for VLMs. Effective deployment involves carefully determining optimal model configurations, encouraging users to consider either the high performance of proprietary models or the flexible adaptability of open-source models.

Title: Genetic Approach to Mitigate Hallucination in Generative IR

Authors: Hrishikesh Kulkarni, Nazli Goharian, Ophir Frieder, Sean MacAvaney
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2409.00085
Pdf URL: https://arxiv.org/pdf/2409.00085
Copy Paste: [[2409.00085]] Genetic Approach to Mitigate Hallucination in Generative IR(https://arxiv.org/abs/2409.00085)
Keywords: language model, hallucination
Abstract: Generative language models hallucinate. That is, at times, they generate factually flawed responses. These inaccuracies are particularly insidious because the responses are fluent and well-articulated. We focus on the task of Grounded Answer Generation (part of Generative IR), which aims to produce direct answers to a user's question based on results retrieved from a search engine. We address hallucination by adapting an existing genetic generation approach with a new 'balanced fitness function' consisting of a cross-encoder model for relevance and an n-gram overlap metric to promote grounding. Our balanced fitness function approach quadruples the grounded answer generation accuracy while maintaining high relevance.
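
Example (illustrative, not from the paper): the abstract describes a "balanced fitness function" that combines a cross-encoder relevance score with an n-gram overlap metric for grounding. The sketch below shows one way such a combination could look. The equal weighting, the bigram choice, and `toy_relevance` (a word-overlap stand-in for a real cross-encoder) are assumptions made only for this example.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(answer, source, n=2):
    """Fraction of the answer's n-grams that also occur in the source text."""
    answer_grams = ngrams(answer.lower().split(), n)
    if not answer_grams:
        return 0.0
    source_grams = ngrams(source.lower().split(), n)
    return len(answer_grams & source_grams) / len(answer_grams)


def balanced_fitness(answer, question, source, relevance_fn, weight=0.5):
    """Weighted mix of a relevance score and a grounding score, both in [0, 1]."""
    relevance = relevance_fn(question, answer)
    grounding = ngram_overlap(answer, source)
    return weight * relevance + (1 - weight) * grounding


def toy_relevance(question, answer):
    # Hypothetical stand-in for a cross-encoder: share of question words reused.
    q, a = set(question.lower().split()), set(answer.lower().split())
    return len(q & a) / len(q) if q else 0.0


source = "the eiffel tower is 330 metres tall and stands in paris"
question = "how tall is the eiffel tower"
grounded = "the eiffel tower is 330 metres tall"
ungrounded = "the eiffel tower is 500 metres tall"

score_grounded = balanced_fitness(grounded, question, source, toy_relevance)
score_ungrounded = balanced_fitness(ungrounded, question, source, toy_relevance)
```

Both candidate answers look equally relevant to the question here, so only the overlap term separates the grounded answer from the one with the invented figure.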

Title: On-Device Language Models: A Comprehensive Review

Authors: Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, Ziyuan Ling
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00088
Pdf URL: https://arxiv.org/pdf/2409.00088
Copy Paste: [[2409.00088]] On-Device Language Models: A Comprehensive Review(https://arxiv.org/abs/2409.00088)
Keywords: language model, llm
Abstract: The advent of large language models (LLMs) revolutionized natural language processing applications, and running LLMs on edge devices has become increasingly attractive for reasons including reduced latency, data localization, and personalized user experiences. This comprehensive review examines the challenges of deploying computationally expensive LLMs on resource-constrained devices and explores innovative solutions across multiple domains. The paper investigates the development of on-device language models, their efficient architectures, including parameter sharing and modular designs, as well as state-of-the-art compression techniques like quantization, pruning, and knowledge distillation. Hardware acceleration strategies and collaborative edge-cloud deployment approaches are analyzed, highlighting the intricate balance between performance and resource utilization. Case studies of on-device language models from major mobile manufacturers demonstrate real-world applications and potential benefits. The review also addresses critical aspects such as adaptive learning, multi-modal capabilities, and personalization. By identifying key research directions and open challenges, this paper provides a roadmap for future advancements in on-device language models, emphasizing the need for interdisciplinary efforts to realize the full potential of ubiquitous, intelligent computing while ensuring responsible and ethical deployment. For a comprehensive review of research work and educational resources on on-device large language models (LLMs), please visit this https URL. To download and run on-device LLMs, visit this https URL.

Title: Evaluating ChatGPT on Nuclear Domain-Specific Data

Authors: Muhammad Anwar, Mischa de Costa, Issam Hammad, Daniel Lau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00090
Pdf URL: https://arxiv.org/pdf/2409.00090
Copy Paste: [[2409.00090]] Evaluating ChatGPT on Nuclear Domain-Specific Data(https://arxiv.org/abs/2409.00090)
Keywords: language model, gpt, llm, chat, retrieval augmented generation
Abstract: This paper examines the application of ChatGPT, a large language model (LLM), for question-and-answer (Q&A) tasks in the highly specialized field of nuclear data. The primary focus is on evaluating ChatGPT's performance on a curated test dataset, comparing the outcomes of a standalone LLM with those generated through a Retrieval Augmented Generation (RAG) approach. LLMs, despite their recent advancements, are prone to generating incorrect or 'hallucinated' information, which is a significant limitation in applications requiring high accuracy and reliability. This study explores the potential of utilizing RAG in LLMs, a method that integrates external knowledge bases and sophisticated retrieval techniques to enhance the accuracy and relevance of generated outputs. In this context, the paper evaluates ChatGPT's ability to answer domain-specific questions, employing two methodologies: A) direct response from the LLM, and B) response from the LLM within a RAG framework. The effectiveness of these methods is assessed through a dual mechanism of human and LLM evaluation, scoring the responses for correctness and other metrics. The findings underscore the improvement in performance when incorporating a RAG pipeline in an LLM, particularly in generating more accurate and contextually appropriate responses for nuclear domain-specific queries. Additionally, the paper highlights alternative approaches to further refine and improve the quality of answers in such specialized domains.

Title: Classification of Safety Events at Nuclear Sites using Large Language Models

Authors: Mishca de Costa, Muhammad Anwar, Daniel Lau, Issam Hammad
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00091
Pdf URL: https://arxiv.org/pdf/2409.00091
Copy Paste: [[2409.00091]] Classification of Safety Events at Nuclear Sites using Large Language Models(https://arxiv.org/abs/2409.00091)
Keywords: language model, llm, prompt
Abstract: This paper proposes the development of a Large Language Model (LLM) based machine learning classifier designed to categorize Station Condition Records (SCRs) at nuclear power stations into safety-related and non-safety-related categories. The primary objective is to augment the existing manual review process by enhancing the efficiency and accuracy of the safety classification process at nuclear stations. The paper discusses experiments performed to classify a labeled SCR dataset and evaluates the performance of the classifier. It explores the construction of several prompt variations and their observed effects on the LLM's decision-making process. Additionally, it introduces a numerical scoring mechanism that could offer a more nuanced and flexible approach to SCR safety classification. This method represents an innovative step in nuclear safety management, providing a scalable tool for the identification of safety events.

Title: PatentGPT: A Large Language Model for Patent Drafting Using Knowledge-based Fine-tuning Method

Authors: Runtao Ren, Jian Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00092
Pdf URL: https://arxiv.org/pdf/2409.00092
Copy Paste: [[2409.00092]] PatentGPT: A Large Language Model for Patent Drafting Using Knowledge-based Fine-tuning Method(https://arxiv.org/abs/2409.00092)
Keywords: language model, gpt, llm
Abstract: As humanity stands on the brink of a new era of technological innovation, the ability to rapidly transform creative ideas into protected intellectual property (IP) is more crucial than ever. However, the conventional processes for patent drafting are fraught with challenges, demanding a nuanced understanding of advanced field knowledge and technical concepts. Existing large language models (LLMs), while powerful, often fall short in this IP creation domain due to their lack of the specialized knowledge and context-awareness necessary for generating technically accurate patent documents. To bridge this critical gap, we propose a groundbreaking framework for Knowledge Fine-Tuning (KFT) of LLMs, designed to endow AI with the ability to autonomously mine, understand, and apply domain-specific knowledge. Our model, PatentGPT, leverages a unique combination of knowledge graph-based pre-training, domain-specific supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). Through extensive evaluation, PatentGPT has demonstrated outstanding performance, scoring up to approximately 400% higher in patent-related benchmark tests compared to state-of-the-art models. By using the KFT method to strengthen the model's capability to not only assist but also augment human creativity and innovation, our approach sets a new standard for AI-driven intellectual property generation, paving the way for more efficient and effective invention processes.

Title: Examining Independence in Ensemble Sentiment Analysis: A Study on the Limits of Large Language Models Using the Condorcet Jury Theorem

Authors: Baptiste Lefort, Eric Benhamou, Jean-Jacques Ohana, Beatrice Guez, David Saltiel, Thomas Jacquot
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00094
Pdf URL: https://arxiv.org/pdf/2409.00094
Copy Paste: [[2409.00094]] Examining Independence in Ensemble Sentiment Analysis: A Study on the Limits of Large Language Models Using the Condorcet Jury Theorem(https://arxiv.org/abs/2409.00094)
Keywords: language model, gpt, llm, chat
Abstract: This paper explores the application of the Condorcet Jury theorem to the domain of sentiment analysis, specifically examining the performance of various large language models (LLMs) compared to simpler natural language processing (NLP) models. The theorem posits that a majority vote classifier should enhance predictive accuracy, provided that individual classifiers' decisions are independent. Our empirical study tests this theoretical framework by implementing a majority vote mechanism across different models, including advanced LLMs such as ChatGPT 4. Contrary to expectations, the results reveal only marginal improvements in performance when incorporating larger models, suggesting a lack of independence among them. This finding aligns with the hypothesis that despite their complexity, LLMs do not significantly outperform simpler models in reasoning tasks within sentiment analysis, showing the practical limits of model independence in the context of advanced NLP tasks.
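
Example (illustrative, not from the paper): the Condorcet Jury theorem mentioned in the abstract can be checked numerically. Assuming an odd number of voters that are fully independent and share the same individual accuracy p, the probability that the majority vote is correct is a binomial tail sum. The function name and the sample values below are chosen only for this sketch.

```python
from math import comb


def majority_accuracy(n_voters, p_correct):
    """P(majority is right) for n independent voters, each right with prob. p."""
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


single = majority_accuracy(1, 0.6)
three = majority_accuracy(3, 0.6)
eleven = majority_accuracy(11, 0.6)
```

With p = 0.6, three independent voters reach 0.648 and the value keeps rising as voters are added. That gain depends on the independence assumption, which is why only marginal improvements from an ensemble are read in the abstract as a sign of correlated models.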

Title: Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data

Authors: Juncheng Xie, Shensian Syu, Hung-yi Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00096
Pdf URL: https://arxiv.org/pdf/2409.00096
Copy Paste: [[2409.00096]] Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data(https://arxiv.org/abs/2409.00096)
Keywords: language model, gpt, llm, prompt
Abstract: Instruction fine-tuning is crucial for today's large language models (LLMs) to learn to follow instructions and align with human preferences. Conventionally, supervised data, including the instruction and the correct response, is required for instruction fine-tuning. To obtain such data, some researchers prompted well-trained models like GPT-4 to generate instructions and correct responses. In this paper, we propose a novel approach that uses the first half of a random text from OpenWebText as the instruction and GPT-3.5-turbo or GPT-4-turbo to complete the text as the response. Despite the data being "non-instructional", we found that pre-trained LLMs fine-tuned on this data can gain instruction-following capabilities. This observation is verified by fine-tuning several well-known pre-trained LLMs (e.g., LLaMA-2-7B, LLaMA-3-8B, LLaMA-3-70B, Mistral-7B-v0.1). The "non-instructional data" also improved some models that underwent supervised fine-tuning and human preference alignment. Our LLaMA-3-70B-Instruct fine-tuned through "non-instructional data" is comparable with LLaMA-3.1-70B-Instruct on the Arena Hard leaderboard. We analyzed the "non-instructional data" and ensured it is devoid of content related to instruction fine-tuning. Our findings will inspire further investigation into how to develop instruction-following capabilities without explicit instruction-related data.
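
Example (illustrative, not from the paper): the data construction described in the abstract takes the first half of a text as the "instruction" and a model's continuation as the "response". A minimal sketch follows. Splitting on whitespace-separated words and the `echo_completion` stand-in are assumptions, since the abstract does not say how the halfway point is measured, and no real model API is called here.

```python
def make_pair(text, complete_fn):
    """Split a passage in half; the first half plays the role of the instruction."""
    words = text.split()
    half = len(words) // 2
    instruction = " ".join(words[:half])
    # complete_fn is any callable that continues a text prefix.
    response = complete_fn(instruction)
    return {"instruction": instruction, "response": response}


def echo_completion(prefix):
    # Hypothetical stand-in for a text-completing model; not a real API.
    return "<continuation of: " + prefix + ">"


pair = make_pair("The quick brown fox jumps over the lazy dog today", echo_completion)
```

In practice `complete_fn` would wrap a call to whichever completion model is used, and the resulting pairs would be written out in the fine-tuning format the training stack expects.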

Title: Large Language Models for Disease Diagnosis: A Scoping Review

Authors: Shuang Zhou, Zidu Xu, Mian Zhang, Chunpu Xu, Yawen Guo, Zaifu Zhan, Sirui Ding, Jiashuo Wang, Kaishuai Xu, Yi Fang, Liqiao Xia, Jeremy Yeung, Daochen Zha, Mingquan Lin, Rui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00097
Pdf URL: https://arxiv.org/pdf/2409.00097
Copy Paste: [[2409.00097]] Large Language Models for Disease Diagnosis: A Scoping Review(https://arxiv.org/abs/2409.00097)
Keywords: language model, llm
Abstract: Automatic disease diagnosis has become increasingly valuable in clinical practice. The advent of large language models (LLMs) has catalyzed a paradigm shift in artificial intelligence, with growing evidence supporting the efficacy of LLMs in diagnostic tasks. Despite the growing attention in this field, many critical research questions remain under-explored. For instance, what diseases and LLM techniques have been investigated for diagnostic tasks? How can suitable LLM techniques and evaluation methods be selected for clinical decision-making? To answer these questions, we performed a comprehensive analysis of LLM-based methods for disease diagnosis. This scoping review examined the types of diseases, associated organ systems, relevant clinical data, LLM techniques, and evaluation methods reported in existing studies. Furthermore, we offered guidelines for data preprocessing and the selection of appropriate LLM techniques and evaluation strategies for diagnostic tasks. We also assessed the limitations of current research and delineated the challenges and future directions in this research field. In summary, our review outlined a blueprint for LLM-based disease diagnosis, helping to streamline and guide future research endeavors.
摘要：自动疾病诊断在临床实践中变得越来越有价值。大型语言模型 (LLM) 的出现催化了人工智能的范式转变，越来越多的证据支持 LLM 在诊断任务中的有效性。尽管该领域受到越来越多的关注，但仍有许多关键的研究问题尚未得到充分探索。例如，哪些疾病和 LLM 技术已被用于诊断任务？如何选择合适的 LLM 技术和评估方法进行临床决策？为了回答这些问题，我们对基于 LLM 的疾病诊断方法进行了全面分析。本范围审查研究了现有研究中报告的疾病类型、相关器官系统、相关临床数据、LLM 技术和评估方法。此外，我们为数据预处理和选择适当的 LLM 技术和评估策略进行诊断任务提供了指导方针。我们还评估了当前研究的局限性，并描述了该研究领域的挑战和未来方向。总之，我们的审查概述了基于 LLM 的疾病诊断蓝图，有助于简化和指导未来的研究工作。

Title: Nuance Matters: Probing Epistemic Consistency in Causal Reasoning

Authors: Shaobo Cui, Junyou Li, Luca Mouchel, Yiyang Feng, Boi Faltings
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00103
Pdf URL: https://arxiv.org/pdf/2409.00103
Copy Paste: [[2409.00103]] Nuance Matters: Probing Epistemic Consistency in Causal Reasoning(https://arxiv.org/abs/2409.00103)
Keywords: language model, gpt, llm
Abstract: To address this gap, our study introduces the concept of causal epistemic consistency, which focuses on the self-consistency of Large Language Models (LLMs) in differentiating intermediates with nuanced differences in causal reasoning. We propose a suite of novel metrics -- intensity ranking concordance, cross-group position agreement, and intra-group clustering -- to evaluate LLMs on this front. Through extensive empirical studies on 21 high-profile LLMs, including GPT-4, Claude3, and LLaMA3-70B, we find evidence that current models struggle to maintain epistemic consistency in identifying the polarity and intensity of intermediates in causal reasoning. Additionally, we explore the potential of using internal token probabilities as an auxiliary tool to maintain causal epistemic consistency. In summary, our study bridges a critical gap in AI research by investigating the self-consistency over fine-grained intermediates involved in causal reasoning.
摘要：为了解决这一差距，我们的研究引入了因果认识论一致性的概念，该概念侧重于大型语言模型 (LLM) 在区分因果推理中具有细微差异的中间体方面的自洽性。我们提出了一套新颖的指标——强度排名一致性、跨组位置一致性和组内聚类——来评估 LLM 在这方面的表现。通过对 21 个备受瞩目的 LLM（包括 GPT-4、Claude3 和 LLaMA3-70B）进行广泛的实证研究，我们有确凿的证据表明，当前的模型在识别因果推理中中间体的极性和强度方面难以保持认识论一致性。此外，我们还探索了使用内部标记概率作为辅助工具来保持因果认识论一致性的潜力。总之，我们的研究通过研究因果推理中涉及的细粒度中间体的自洽性，弥补了人工智能研究中的一个关键空白。

Title: Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation

Authors: Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Björn W. Schuller, Amir Hussain
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00105
Pdf URL: https://arxiv.org/pdf/2409.00105
Copy Paste: [[2409.00105]] Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation(https://arxiv.org/abs/2409.00105)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Foundational Large Language Models (LLMs) have changed the way we perceive technology. They have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools. At the same time, researchers are striving to identify the limitations of these tools to improve them further. Currently identified flaws include hallucination, biases, and bypassing restricted commands to generate harmful content. In the present work, we have identified a fundamental limitation related to the image generation ability of LLMs, and termed it The NO Syndrome. This negation blindness refers to LLMs' inability to correctly comprehend NO-related natural language prompts to generate the desired images. Interestingly, all tested LLMs including GPT-4, Gemini, and Copilot were found to be suffering from this syndrome. To demonstrate the generalization of this limitation, we carried out simulation experiments and conducted entropy-based and benchmark statistical analysis tests on various LLMs in multiple languages, including English, Hindi, and French. We conclude that the NO syndrome is a significant flaw in current LLMs that needs to be addressed. A related finding of this study showed a consistent discrepancy between image and textual responses as a result of this NO syndrome. We posit that the introduction of a negation context-aware reinforcement learning based feedback loop between the LLM's textual response and generated image could help ensure the generated text is based on both the LLM's correct contextual understanding of the negation query and the generated visual output.
摘要：基础大型语言模型 (LLM) 改变了我们看待技术的方式。它们已被证明在从诗歌写作和编码到论文生成和解谜等任务中表现出色。随着图像生成功能的加入，它们已成为更全面、用途更广的人工智能工具。与此同时，研究人员正在努力找出这些工具的局限性，以进一步改进它们。目前发现的缺陷包括幻觉、偏见和绕过受限命令生成有害内容。在目前的研究中，我们发现了与 LLM 的图像生成能力相关的一个基本限制，并将其称为 NO 综合症。这种否定盲目性是指 LLM 无法正确理解与 NO 相关的自然语言提示以生成所需的图像。有趣的是，包括 GPT-4、Gemini 和 Copilot 在内的所有测试过的 LLM 都被发现患有这种综合症。为了证明这种限制的普遍性，我们进行了模拟实验，并对多种语言（包括英语、印地语和法语）的各种 LLM 进行了基于熵和基准统计分析测试。我们得出结论，NO 综合症是当前 LLM 的一个重大缺陷，需要加以解决。本研究的相关发现表明，由于这种 NO 综合症，图像和文本响应之间存在一致的差异。我们假设，在 LLM 的文本响应和生成的图像之间引入基于否定上下文感知强化学习的反馈回路，可以帮助确保生成的文本既基于 LLM 对否定查询的正确上下文理解，也基于生成的视觉输出。

Title: Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis

Authors: Aishik Nagar, Shantanu Jaiswal, Cheston Tan
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00106
Pdf URL: https://arxiv.org/pdf/2409.00106
Copy Paste: [[2409.00106]] Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis(https://arxiv.org/abs/2409.00106)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Vision-language models (VLMs) have shown impressive zero- and few-shot performance on real-world visual question answering (VQA) benchmarks, alluding to their capabilities as visual reasoning engines. However, the benchmarks being used conflate "pure" visual reasoning with world knowledge, and also have questions that involve a limited number of reasoning steps. Thus, it remains unclear whether a VLM's apparent visual reasoning performance is due to its world knowledge, or due to actual visual reasoning capabilities. To clarify this ambiguity, we systematically benchmark and dissect the zero-shot visual reasoning capabilities of VLMs through synthetic datasets that require minimal world knowledge, and allow for analysis over a broad range of reasoning steps. We focus on two novel aspects of zero-shot visual reasoning: i) evaluating the impact of conveying scene information as either visual embeddings or purely textual scene descriptions to the underlying large language model (LLM) of the VLM, and ii) comparing the effectiveness of chain-of-thought prompting to standard prompting for zero-shot visual reasoning. We find that the underlying LLMs, when provided textual scene descriptions, consistently perform better compared to being provided visual embeddings. In particular, 18% higher accuracy is achieved on the PTR dataset. We also find that CoT prompting performs marginally better than standard prompting only for the comparatively large GPT-3.5-Turbo (175B) model, and does worse for smaller-scale models. This suggests the emergence of CoT abilities for visual reasoning in LLMs at larger scales even when world knowledge is limited. Overall, we find limitations in the abilities of VLMs and LLMs for more complex visual reasoning, and highlight the important role that LLMs can play in visual reasoning.
摘要：视觉语言模型 (VLM) 在现实世界的视觉问答 (VQA) 基准测试中表现出令人印象深刻的零样本和少样本性能，暗示了它们作为视觉推理引擎的能力。然而，所使用的基准测试将“纯”视觉推理与世界知识混为一谈，并且还包含涉及有限数量推理步骤的问题。因此，尚不清楚 VLM 的明显视觉推理性能是归功于其世界知识，还是归功于实际的视觉推理能力。为了澄清这种模糊性，我们通过需要最少世界知识的合成数据集系统地对 VLM 的零样本视觉推理能力进行基准测试和剖析，并允许对广泛的推理步骤进行分析。我们重点关注零样本视觉推理的两个新方面：i）评估将场景信息作为视觉嵌入或纯文本场景描述传达给 VLM 的底层大型语言模型 (LLM) 的影响，以及 ii）比较思路链提示与标准提示对零样本视觉推理的有效性。我们发现，当提供文本场景描述时，底层 LLM 的表现始终优于提供视觉嵌入。特别是在 PTR 数据集上实现了 18% 的准确率提高。我们还发现，CoT 提示仅在相对较大的 GPT-3.5-Turbo (175B) 模型上的表现略优于标准提示，而在较小规模的模型上表现较差。这表明，即使世界知识有限，CoT 能力也会在更大规模的 LLM 中出现，用于视觉推理。总体而言，我们发现 VLM 和 LLM 在更复杂的视觉推理方面的能力存在局限性，并强调了 LLM 在视觉推理中可以发挥的重要作用。

Title: Toward Large Language Models as a Therapeutic Tool: Comparing Prompting Techniques to Improve GPT-Delivered Problem-Solving Therapy

Authors: Daniil Filienko, Yinzhou Wang, Caroline El Jazmi, Serena Xie, Trevor Cohen, Martine De Cock, Weichao Yuwen
Subjects: cs.CL, cs.AI, cs.ET, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00112
Pdf URL: https://arxiv.org/pdf/2409.00112
Copy Paste: [[2409.00112]] Toward Large Language Models as a Therapeutic Tool: Comparing Prompting Techniques to Improve GPT-Delivered Problem-Solving Therapy(https://arxiv.org/abs/2409.00112)
Keywords: language model, gpt, llm, prompt
Abstract: While Large Language Models (LLMs) are being quickly adapted to many domains, including healthcare, their strengths and pitfalls remain under-explored. In our study, we examine the effects of prompt engineering to guide Large Language Models (LLMs) in delivering parts of a Problem-Solving Therapy (PST) session via text, particularly during the symptom identification and assessment phase for personalized goal setting. We present evaluation results of the models' performances by automatic metrics and experienced medical professionals. We demonstrate that the models' capability to deliver protocolized therapy can be improved with the proper use of prompt engineering methods, albeit with limitations. To our knowledge, this study is among the first to assess the effects of various prompting techniques in enhancing a generalist model's ability to deliver psychotherapy, focusing on overall quality, consistency, and empathy. Exploring LLMs' potential in delivering psychotherapy holds promise with the current shortage of mental health professionals amid significant needs, enhancing the potential utility of AI-based and AI-enhanced care services.
摘要：虽然大型语言模型 (LLM) 正在迅速适应许多领域，包括医疗保健，但它们的优势和缺陷仍未得到充分探索。在我们的研究中，我们研究了提示工程对大型语言模型 (LLM) 通过文本提供问题解决治疗 (PST) 部分会话的影响，特别是在个性化目标设定的症状识别和评估阶段。我们通过自动指标和经验丰富的医疗专业人员展示了模型性能的评估结果。我们证明，通过正确使用提示工程方法，可以提高模型提供协议化治疗的能力，尽管存在局限性。据我们所知，这项研究是首批评估各种提示技术对增强通才模型提供心理治疗能力的影响的研究之一，重点关注整体质量、一致性和同理心。在目前心理健康专业人员短缺且需求巨大的情况下，探索 LLM 在提供心理治疗方面的潜力很有希望，可以增强基于 AI 和 AI 增强的护理服务的潜在效用。

Title: When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options

Authors: Gracjan Góral, Emilia Wiśnios
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00113
Pdf URL: https://arxiv.org/pdf/2409.00113
Copy Paste: [[2409.00113]] When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options(https://arxiv.org/abs/2409.00113)
Keywords: language model, llm
Abstract: This paper examines the zero-shot ability of Large Language Models (LLMs) to detect multiple-choice questions with no correct answer, a crucial aspect of educational assessment quality. We explore this ability not only as a measure of subject matter knowledge but also as an indicator of critical thinking within LLMs. Our experiments, utilizing a range of LLMs on diverse questions, highlight the significant performance gap between questions with a single correct answer and those without. Llama-3.1-405B stands out by successfully identifying the lack of a valid answer in many instances. These findings suggest that LLMs should prioritize critical thinking over blind instruction following and caution against their use in educational settings where questions with incorrect answers might lead to inaccurate evaluations. This research sets a benchmark for assessing critical thinking in LLMs and emphasizes the need for ongoing model alignment to ensure genuine user comprehension and assistance.
摘要：本文研究了大型语言模型 (LLM) 检测无正确答案的多项选择题的零样本能力，这是教育评估质量的一个关键方面。我们不仅将这种能力作为学科知识的衡量标准，而且将其作为 LLM 中批判性思维的指标。我们的实验利用一系列 LLM 来处理各种问题，突出了只有一个正确答案的问题与没有正确答案的问题之间的显著性能差距。Llama-3.1-405B 脱颖而出，在许多情况下成功识别出缺乏有效答案。这些发现表明，LLM 应该优先考虑批判性思维，而不是盲目跟随教学，并谨慎不要在教育环境中使用它们，因为错误答案的问题可能会导致不准确的评估。这项研究为评估 LLM 中的批判性思维设定了一个基准，并强调需要持续进行模型调整，以确保真正的用户理解和帮助。

Title: FedMCP: Parameter-Efficient Federated Learning with Model-Contrastive Personalization

Authors: Qianyi Zhao, Chen Qu, Cen Chen, Mingyuan Fan, Yanhao Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00116
Pdf URL: https://arxiv.org/pdf/2409.00116
Copy Paste: [[2409.00116]] FedMCP: Parameter-Efficient Federated Learning with Model-Contrastive Personalization(https://arxiv.org/abs/2409.00116)
Keywords: language model
Abstract: With increasing concerns and regulations on data privacy, fine-tuning pretrained language models (PLMs) in federated learning (FL) has become a common paradigm for NLP tasks. Despite being extensively studied, the existing methods for this problem still face two primary challenges. First, the huge number of parameters in large-scale PLMs leads to excessive communication and computational overhead. Second, the heterogeneity of data and tasks across clients poses a significant obstacle to achieving the desired fine-tuning performance. To address the above problems, we propose FedMCP, a novel parameter-efficient fine-tuning method with model-contrastive personalization for FL. Specifically, FedMCP adds two lightweight adapter modules, i.e., the global adapter and the private adapter, to the frozen PLMs within clients. In a communication round, each client sends only the global adapter to the server for federated aggregation. Furthermore, FedMCP introduces a model-contrastive regularization term between the two adapters. This, on the one hand, encourages the global adapter to assimilate universal knowledge and, on the other hand, the private adapter to capture client-specific knowledge. By leveraging both adapters, FedMCP can effectively provide fine-tuned personalized models tailored to individual clients. Extensive experiments on highly heterogeneous cross-task, cross-silo datasets show that FedMCP achieves substantial performance improvements over state-of-the-art FL fine-tuning approaches for PLMs.
摘要：随着对数据隐私的关注和监管日益严格，在联邦学习 (FL) 中微调预训练语言模型 (PLM) 已成为 NLP 任务的常见范例。尽管得到了广泛的研究，但现有的针对该问题的方法仍然面临两个主要挑战。首先，大规模 PLM 中的大量参数导致过多的通信和计算开销。其次，不同客户端之间数据和任务的异构性对实现所需的微调性能构成了重大障碍。为了解决上述问题，我们提出了 FedMCP，一种用于 FL 的具有模型对比个性化的新型参数高效微调方法。具体而言，FedMCP 向客户端内冻结的 PLM 添加了两个轻量级适配器模块，即全局适配器和私有适配器。在一轮通信中，每个客户端仅将全局适配器发送到服务器进行联邦聚合。此外，FedMCP 在两个适配器之间引入了一个模型对比正则化项。一方面，这鼓励全局适配器吸收通用知识，另一方面，鼓励私有适配器捕获特定于客户的知识。通过利用这两个适配器，FedMCP 可以有效地提供针对单个客户量身定制的微调个性化模型。在高度异构的跨任务、跨孤岛数据集上进行的大量实验表明，与最先进的 PLM FL 微调方法相比，FedMCP 实现了显着的性能改进。
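Note: the communication step described in the FedMCP abstract (each client uploads only its global adapter, and the private adapter stays local) can be illustrated with a minimal sketch. The `Client` class, the flat weight lists, and the plain element-wise averaging are assumptions made for illustration; the abstract does not specify the aggregation rule, and the model-contrastive regularization term is not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Client:
    """Hypothetical client holding two adapter weight vectors."""
    global_adapter: List[float]   # uploaded to the server each round
    private_adapter: List[float]  # never leaves the client


def aggregate_global_adapters(clients: List[Client]) -> List[float]:
    """Average the global adapters element-wise (FedAvg-style; an assumption)."""
    n = len(clients)
    dim = len(clients[0].global_adapter)
    return [sum(c.global_adapter[i] for c in clients) / n for i in range(dim)]


def communication_round(clients: List[Client]) -> None:
    """One round: collect global adapters, average them, broadcast the result."""
    averaged = aggregate_global_adapters(clients)
    for c in clients:
        c.global_adapter = list(averaged)  # private_adapter is left untouched


# Toy weights, made up for this example.
clients = [
    Client(global_adapter=[1.0, 2.0], private_adapter=[0.1, 0.2]),
    Client(global_adapter=[3.0, 4.0], private_adapter=[0.3, 0.4]),
]
communication_round(clients)
```

After the round, both clients share the same global adapter while their private adapters still differ, which is the separation of universal and client-specific knowledge the abstract describes.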

Title: ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings

Authors: Jangyeong Jeon, Sangyeon Cho, Minuk Ma, Junyoung Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00120
Pdf URL: https://arxiv.org/pdf/2409.00120
Copy Paste: [[2409.00120]] ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings(https://arxiv.org/abs/2409.00120)
Keywords: language model
Abstract: This paper examines the Code-Switching (CS) phenomenon where two languages intertwine within a single utterance. There exists a noticeable need for research on the CS between English and Korean. We highlight that the current Equivalence Constraint (EC) theory for CS in other languages may only partially capture English-Korean CS complexities due to the intrinsic grammatical differences between the languages. We introduce a novel Koglish dataset tailored for English-Korean CS scenarios to mitigate such challenges. First, we constructed the Koglish-GLUE dataset to demonstrate the importance and need for CS datasets in various tasks. We found that various multilingual foundation language models produce different outcomes when trained on a monolingual versus a CS dataset. Motivated by this, we hypothesized that SimCSE, which has shown strengths in monolingual sentence embedding, would have limitations in CS scenarios. We construct a novel Koglish-NLI (Natural Language Inference) dataset using a CS augmentation-based approach to verify this. From this CS-augmented dataset Koglish-NLI, we propose a unified contrastive learning and augmentation method for code-switched embeddings, ConCSE, highlighting the semantics of CS sentences. Experimental results validate the proposed ConCSE with an average performance enhancement of 1.77% on the Koglish-STS (Semantic Textual Similarity) tasks.
摘要：本文探讨了两种语言在单一话语中交织在一起的代码转换 (CS) 现象。英语和韩语之间的代码转换存在明显的研究需求。我们强调，由于语言之间存在内在的语法差异，当前其他语言代码转换的等价约束 (EC) 理论可能只能部分捕捉英语-韩语代码转换的复杂性。我们引入了一个针对英语-韩语代码转换场景量身定制的新型 Koglish 数据集来缓解此类挑战。首先，我们构建了 Koglish-GLUE 数据集来证明代码转换数据集在各种任务中的重要性和必要性。我们发现，在单语言和 CS 数据集上训练时，各种基础多语言语言模型的结果有所不同。受此启发，我们假设 SimCSE 在单语句子嵌入方面表现出色，但在 CS 场景中会受到限制。我们使用基于 CS 增强的方法构建了一个新型 Koglish-NLI（自然语言推理）数据集来验证这一点。从这个 CS 增强数据集 Koglish-NLI 中，我们提出了一种统一的对比学习和增强代码转换嵌入方法 ConCSE，突出了 CS 句子的语义。实验结果验证了所提出的 ConCSE，在 Koglish-STS（语义文本相似性）任务上的平均性能提升了 1.77%。
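Note: the ConCSE abstract builds on SimCSE-style contrastive learning of sentence embeddings. The sketch below shows a generic in-batch InfoNCE loss of that kind, not ConCSE's actual objective, which the abstract does not spell out. Pairing an English sentence with its code-switched counterpart as the positive, the toy 2-d embeddings, and the temperature of 0.05 are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors: List[Sequence[float]],
                  positives: List[Sequence[float]],
                  temperature: float = 0.05) -> float:
    """In-batch InfoNCE: positives[i] is the positive for anchors[i];
    every other positives[j] in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings: hypothetical English anchors and code-switched positives.
english = [[1.0, 0.0], [0.0, 1.0]]
code_switched = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(english, code_switched)
shuffled_loss = info_nce_loss(english, code_switched[::-1])
```

The loss is near zero when each anchor is closest to its own positive and grows when the pairs are shuffled, which is the signal a contrastive encoder is trained on.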

Title: Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs

Authors: Ziyan Cui, Ning Li, Huaikang Zhou
Subjects: cs.CL, cs.AI, econ.GN
Abstract URL: https://arxiv.org/abs/2409.00128
Pdf URL: https://arxiv.org/pdf/2409.00128
Copy Paste: [[2409.00128]] Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs(https://arxiv.org/abs/2409.00128)
Keywords: language model, gpt, llm
Abstract: Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) like GPT-4 have shown promise in replicating human-like responses in various psychological experiments. However, the extent to which LLMs can effectively replace human subjects across diverse experimental contexts remains unclear. Here, we conduct a large-scale study replicating 154 psychological experiments from top social science journals with 618 main effects and 138 interaction effects using GPT-4 as a simulated participant. We find that GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies, closely mirroring human responses in both direction and significance. However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies. Additionally, there is a 71.6 percent rate of unexpected significant results where the original studies reported null findings, suggesting potential overestimation or false positives. Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings. While LLMs can complement human studies, they cannot yet fully replace the nuanced insights provided by human subjects.
摘要：人工智能 (AI) 越来越多地被融入到科学研究中，尤其是在社会科学领域，了解人类行为至关重要。大型语言模型 (LLM)（如 GPT-4）在各种心理实验中已显示出复制类似人类反应的潜力。然而，LLM 在不同实验环境中能有效取代人类受试者的程度仍不清楚。在这里，我们进行了一项大规模研究，复制了来自顶级社会科学期刊的 154 项心理学实验，其中有 618 个主效应和 138 个交互效应，使用 GPT-4 作为模拟参与者。我们发现 GPT-4 成功复制了原始研究中观察到的 76.0% 的主效应和 47.0% 的交互效应，在方向和重要性方面都与人类反应非常相似。然而，只有 19.44% 的 GPT-4 复制置信区间包含原始效应大小，大多数复制的效应大小超过了原始研究的 95% 置信区间。此外，原始研究报告了无效结果，其中意外显著结果的比例为 71.6%，表明可能存在高估或假阳性。我们的研究结果证明了 LLM 作为心理学研究的强大工具的潜力，但也强调了在解释 AI 驱动的发现时需要谨慎。虽然 LLM 可以补充人类研究，但它们还不能完全取代人类受试者提供的细致入微的见解。
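Note: one criterion in the replication abstract is whether a replicated confidence interval contains the original effect size. The check can be sketched as follows. The function names and the four (interval, effect) pairs are made up for illustration and are not data from the study.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def ci_contains(ci: Interval, original_effect: float) -> bool:
    """True if the original effect size lies inside the replicated interval."""
    low, high = ci
    return low <= original_effect <= high


def coverage_rate(pairs: List[Tuple[Interval, float]]) -> float:
    """Fraction of replications whose interval covers the original effect."""
    hits = sum(ci_contains(ci, effect) for ci, effect in pairs)
    return hits / len(pairs)


# Toy values only: (replicated 95% CI, original effect size).
toy_pairs = [
    ((0.10, 0.50), 0.30),   # covered
    ((0.60, 0.90), 0.20),   # replicated effect overshoots the original
    ((-0.10, 0.20), 0.00),  # covered
    ((0.40, 0.80), 0.10),   # replicated effect overshoots the original
]
rate = coverage_rate(toy_pairs)
```

A low coverage rate together with a high rate of same-direction significant effects is the pattern the abstract reports: effects replicate in direction but not in magnitude.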

Title: Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems

Authors: Ding Kai, Ma Zhenguo, Yan Xiaoran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00131
Pdf URL: https://arxiv.org/pdf/2409.00131
Copy Paste: [[2409.00131]] Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems(https://arxiv.org/abs/2409.00131)
Keywords: language model, llm, prompt
Abstract: This study focuses on improving the performance of lightweight Large Language Models (LLMs) in mathematical reasoning tasks. We introduce a novel method for measuring mathematical logic similarity and design an automatic screening mechanism to construct a set of reference problems that integrate both semantic and logical similarity. By employing carefully crafted positive and negative example prompts, we guide the model towards adopting sound reasoning logic. To the best of our knowledge, this is the first attempt to utilize retrieval-enhanced generation for mathematical problem-solving. Experimental results demonstrate that our method achieves a 15.8% improvement over the Chain of Thought approach on the SVAMP dataset and a 21.5% improvement on the GSM8K dataset. Further application of this method to a large-scale model with 175 billion parameters yields performance comparable to the best results on both aforementioned datasets. Finally, we conduct an analysis of errors during the reasoning process, providing valuable insights and directions for future research on reasoning tasks using large language models.
摘要：本研究致力于提升轻量级大型语言模型（LLM）在数学推理任务中的表现。我们引入了一种新的数学逻辑相似度度量方法，并设计了一种自动筛选机制，构造了一组融合语义和逻辑相似度的参考问题。通过使用精心设计的正反例提示，我们引导模型采用合理的推理逻辑。据我们所知，这是首次尝试利用检索增强生成来解决数学问题。实验结果表明，我们的方法在 SVAMP 数据集上比思路链方法提高了 15.8%，在 GSM8K 数据集上提高了 21.5%。将该方法进一步应用于具有 1750 亿个参数的大型模型，可获得与上述两个数据集上的最佳结果相当的性能。最后，我们对推理过程中的错误进行了分析，为未来使用大型语言模型进行推理任务的研究提供了有价值的见解和方向。

Title: A Survey for Large Language Models in Biomedicine

Authors: Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00133
Pdf URL: https://arxiv.org/pdf/2409.00133
Copy Paste: [[2409.00133]] A Survey for Large Language Models in Biomedicine(https://arxiv.org/abs/2409.00133)
Keywords: language model, llm
Abstract: Recent breakthroughs in large language models (LLMs) offer unprecedented natural language understanding and generation capabilities. However, existing surveys on LLMs in biomedicine often focus on specific applications or model architectures, lacking a comprehensive analysis that integrates the latest advancements across various biomedical domains. This review, based on an analysis of 484 publications sourced from databases including PubMed, Web of Science, and arXiv, provides an in-depth examination of the current landscape, applications, challenges, and prospects of LLMs in biomedicine, distinguishing itself by focusing on the practical implications of these models in real-world biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot learning across a broad spectrum of biomedical tasks, including diagnostic assistance, drug discovery, and personalized medicine, among others, with insights drawn from 137 key studies. Then, we discuss adaptation strategies of LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to enhance their performance in specialized biomedical contexts where zero-shot approaches fall short, such as medical question answering and efficient processing of biomedical literature. Finally, we discuss the challenges that LLMs face in the biomedicine domain, including data privacy concerns, limited model interpretability, issues with dataset quality, and ethics due to the sensitive nature of biomedical data, the need for highly reliable model outputs, and the ethical implications of deploying AI in healthcare. To address these challenges, we also identify future research directions for LLMs in biomedicine, including federated learning methods to preserve data privacy and integrating explainable AI methodologies to enhance the transparency of LLMs.
摘要：大型语言模型 (LLM) 的最新突破提供了前所未有的自然语言理解和生成能力。然而，现有的关于生物医学领域 LLM 的研究通常侧重于特定的应用或模型架构，缺乏整合各个生物医学领域最新进展的全面分析。这篇评论基于对来自 PubMed、Web of Science 和 arXiv 等数据库的 484 篇出版物的分析，深入研究了生物医学领域 LLM 的当前状况、应用、挑战和前景，其独特之处在于关注这些模型在现实世界的生物医学环境中的实际意义。首先，我们根据 137 项关键研究的见解，探索 LLM 在广泛生物医学任务（包括诊断辅助、药物发现和个性化医疗等）中在零样本学习方面的能力。然后，我们讨论了 LLM 的适应策略，包括单模和多模 LLM 的微调方法，以提高它们在零样本无法实现的专门生物医学环境中的性能，例如医学问答和生物医学文献的有效处理。最后，我们讨论了 LLM 在生物医学领域面临的挑战，包括数据隐私问题、模型可解释性有限、数据集质量问题以及由于生物医学数据的敏感性而产生的道德问题、对高度可靠的模型输出的需求以及在医疗保健中部署人工智能的伦理影响。为了应对这些挑战，我们还确定了生物医学领域 LLM 的未来研究方向，包括联邦学习方法来保护数据隐私，以及集成可解释的人工智能方法来提高 LLM 的透明度。

Title: HoneyComb: A Flexible LLM-Based Agent System for Materials Science

Authors: Huan Zhang, Yu Song, Ziyu Hou, Santiago Miret, Bang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00135
Pdf URL: https://arxiv.org/pdf/2409.00135
Copy Paste: [[2409.00135]] HoneyComb: A Flexible LLM-Based Agent System for Materials Science(https://arxiv.org/abs/2409.00135)
Keywords: language model, llm, hallucination, agent
Abstract: The emergence of specialized large language models (LLMs) has shown promise in addressing complex tasks for materials science. Many LLMs, however, often struggle with distinct complexities of material science tasks, such as materials science computational tasks, and often rely heavily on outdated implicit knowledge, leading to inaccuracies and hallucinations. To address these challenges, we introduce HoneyComb, the first LLM-based agent system specifically designed for materials science. HoneyComb leverages a novel, high-quality materials science knowledge base (MatSciKB) and a sophisticated tool hub (ToolHub) to enhance its reasoning and computational capabilities tailored to materials science. MatSciKB is a curated, structured knowledge collection based on reliable literature, while ToolHub employs an Inductive Tool Construction method to generate, decompose, and refine API tools for materials science. Additionally, HoneyComb leverages a retriever module that adaptively selects the appropriate knowledge source or tools for specific tasks, thereby ensuring accuracy and relevance. Our results demonstrate that HoneyComb significantly outperforms baseline models across various tasks in materials science, effectively bridging the gap between current LLM capabilities and the specialized needs of this domain. Furthermore, our adaptable framework can be easily extended to other scientific domains, highlighting its potential for broad applicability in advancing scientific research and applications.
摘要：专门的大型语言模型 (LLM) 的出现已显示出在解决材料科学复杂任务方面的潜力。然而，许多 LLM 经常难以应对材料科学任务的独特复杂性，例如材料科学计算任务，并且经常严重依赖过时的隐性知识，导致不准确和幻觉。为了应对这些挑战，我们推出了 HoneyComb，这是第一个专为材料科学设计的基于 LLM 的代理系统。HoneyComb 利用新颖的高质量材料科学知识库 (MatSciKB) 和复杂的工具中心 (ToolHub) 来增强其针对材料科学的推理和计算能力。MatSciKB 是基于可靠文献的精选结构化知识集合，而 ToolHub 采用归纳工具构造方法来生成、分解和细化材料科学的 API 工具。此外，HoneyComb 利用检索器模块，自适应地为特定任务选择适当的知识源或工具，从而确保准确性和相关性。我们的结果表明，HoneyComb 在材料科学的各项任务中的表现都远超基线模型，有效弥补了当前 LLM 能力与该领域的专业需求之间的差距。此外，我们的适应性框架可以轻松扩展到其他科学领域，凸显了其在推进科学研究和应用方面的广泛适用性潜力。

Title: PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action

Authors: Yijia Shao, Tianshi Li, Weiyan Shi, Yanchen Liu, Diyi Yang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2409.00138
Pdf URL: https://arxiv.org/pdf/2409.00138
Copy Paste: [[2409.00138]] PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action(https://arxiv.org/abs/2409.00138)
Keywords: language model, gpt, prompt, agent
Abstract: As language models (LMs) are widely utilized in personalized communication scenarios (e.g., sending emails, writing social media posts) and endowed with a certain level of agency, ensuring they act in accordance with the contextual privacy norms becomes increasingly critical. However, quantifying the privacy norm awareness of LMs and the emerging privacy risk in LM-mediated communication is challenging due to (1) the contextual and long-tailed nature of privacy-sensitive cases, and (2) the lack of evaluation approaches that capture realistic application scenarios. To address these challenges, we propose PrivacyLens, a novel framework designed to extend privacy-sensitive seeds into expressive vignettes and further into agent trajectories, enabling multi-level evaluation of privacy leakage in LM agents' actions. We instantiate PrivacyLens with a collection of privacy norms grounded in privacy literature and crowdsourced seeds. Using this dataset, we reveal a discrepancy between LM performance in answering probing questions and their actual behavior when executing user instructions in an agent setup. State-of-the-art LMs, like GPT-4 and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases, even when prompted with privacy-enhancing instructions. We also demonstrate the dynamic nature of PrivacyLens by extending each seed into multiple trajectories to red-team LM privacy leakage risk. Dataset and code are available at this https URL.
摘要：由于语言模型 (LM) 广泛应用于个性化通信场景（例如发送电子邮件、撰写社交媒体帖子）并具有一定程度的代理权，因此确保它们按照上下文隐私规范行事变得越来越重要。然而，量化 LM 的隐私规范意识和 LM 介导通信中新出现的隐私风险具有挑战性，因为 (1) 隐私敏感案例的上下文和长尾性质，以及 (2) 缺乏捕捉现实应用场景的评估方法。为了应对这些挑战，我们提出了 PrivacyLens，这是一个新颖的框架，旨在将隐私敏感种子扩展到富有表现力的小插图中，并进一步扩展到代理轨迹中，从而能够对 LM 代理行为中的隐私泄露进行多层次评估。我们用一组基于隐私文献和众包种子的隐私规范来实例化 PrivacyLens。使用这个数据集，我们揭示了 LM 在回答探索性问题时的性能与在代理设置中执行用户指令时的实际行为之间的差异。最先进的 LM（如 GPT-4 和 Llama-3-70B）在 25.68% 和 38.69% 的情况下泄露敏感信息，即使在提示隐私增强指令时也是如此。我们还通过将每个种子扩展为多个轨迹来展示 PrivacyLens 的动态特性，以红队 LM 隐私泄露风险。数据集和代码可在此 https URL 上找到。

Title: Dynamic Depth Decoding: Faster Speculative Decoding for LLMs

Authors: Oscar Brown, Zhengjie Wang, Andrea Do, Nikhil Mathew, Cheng Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00142
Pdf URL: https://arxiv.org/pdf/2409.00142
Copy Paste: [[2409.00142]] Dynamic Depth Decoding: Faster Speculative Decoding for LLMs(https://arxiv.org/abs/2409.00142)
Keywords: language model, llm
Abstract: The acceleration of Large Language Models (LLMs) with speculative decoding provides a significant runtime improvement without any loss of accuracy. Currently, EAGLE-2 is the state-of-the-art speculative decoding method, improving on EAGLE with a dynamic draft tree. We introduce Dynamic Depth Decoding (DDD), which optimises EAGLE-2's tree drafting method using a dynamic depth. This extends the average speedup that EAGLE-2 achieves over EAGLE by 44%, giving DDD an average speedup of 3.16x.
摘要：通过推测解码加速大型语言模型 (LLM) 可显著提高运行时间，且不会损失任何准确性。目前，EAGLE-2 是最先进的推测解码方法，它通过动态草稿树改进了 EAGLE。我们引入了动态深度解码 (DDD)，它使用动态深度优化了 EAGLE-2 的树草稿方法。这使 EAGLE-2 相对于 EAGLE 的平均加速比提高了 44%，使 DDD 的平均加速比达到 3.16x。

Title: MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

Authors: Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, Zhi Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00147
Pdf URL: https://arxiv.org/pdf/2409.00147
Copy Paste: [[2409.00147]] MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models(https://arxiv.org/abs/2409.00147)
Keywords: language model, llm
Abstract: The rapid development of large language models (LLMs) has spurred extensive research into their domain-specific capabilities, particularly mathematical reasoning. However, most open-source LLMs focus solely on mathematical reasoning, neglecting the integration with visual injection, despite the fact that many mathematical tasks rely on visual inputs such as geometric diagrams, charts, and function plots. To fill this gap, we introduce MultiMath-7B, a multimodal large language model that bridges the gap between math and vision. MultiMath-7B is trained through a four-stage process, focusing on vision-language alignment, visual and math instruction-tuning, and process-supervised reinforcement learning. We also construct a novel, diverse and comprehensive multimodal mathematical dataset, MultiMath-300K, which spans K-12 levels with image captions and step-wise solutions. MultiMath-7B achieves state-of-the-art (SOTA) performance among open-source models on existing multimodal mathematical benchmarks and also excels on text-only mathematical benchmarks. Our model and dataset are available at this https URL.
摘要：大型语言模型 (LLM) 的快速发展促使人们对其特定领域功能（尤其是数学推理）进行了广泛的研究。然而，大多数开源 LLM 仅专注于数学推理，而忽略了与视觉注入的集成，尽管许多数学任务都依赖于视觉输入，例如几何图、图表和函数图。为了填补这一空白，我们引入了 MultiMath-7B，这是一种多模态大型语言模型，可以弥合数学与视觉之间的差距。MultiMath-7B 通过四个阶段的过程进行训练，重点是视觉语言对齐、视觉和数学指令调整以及过程监督强化学习。我们还构建了一个新颖、多样且全面的多模态数学数据集 MultiMath-300K，它涵盖了 K-12 级别，并带有图像标题和分步解决方案。MultiMath-7B 在现有的多模态数学基准测试中，在开源模型中实现了最先进 (SOTA) 的性能，并且在纯文本数学基准测试中也表现出色。我们的模型和数据集可在此 https URL 上找到。

Title: Speaker Tagging Correction With Non-Autoregressive Language Models

Authors: Grigor Kirakosyan, Davit Karamyan
Subjects: cs.CL, cs.AI, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2409.00151
Pdf URL: https://arxiv.org/pdf/2409.00151
Copy Paste: [[2409.00151]] Speaker Tagging Correction With Non-Autoregressive Language Models(https://arxiv.org/abs/2409.00151)
Keywords: language model
Abstract: Speech applications dealing with conversations require not only recognizing the spoken words but also determining who spoke when. The task of assigning words to speakers is typically addressed by merging the outputs of two separate systems, namely, an automatic speech recognition (ASR) system and a speaker diarization (SD) system. In practical settings, speaker diarization systems can experience significant degradation in performance due to a variety of factors, including uniform segmentation with a high temporal resolution, inaccurate word timestamps, incorrect clustering and estimation of speaker numbers, as well as background noise. Therefore, it is important to automatically detect errors and make corrections if possible. We used a second-pass speaker tagging correction system based on a non-autoregressive language model to correct mistakes in words placed at the borders of sentences spoken by different speakers. We first show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets: TAL and test set of Fisher. Additionally, we evaluated our system in the Post-ASR Speaker Tagging Correction challenge and observed significant improvements in cpWER compared to baseline methods.
摘要：处理对话的语音应用程序不仅需要识别说出的单词，还需要确定谁在何时说话。将单词分配给说话者的任务通常通过合并两个独立系统的输出来解决，即自动语音识别 (ASR) 系统和说话者分类 (SD) 系统。在实际设置中，说话者分类系统的性能可能会因各种因素而显著下降，包括具有高时间分辨率的均匀分割、不准确的单词时间戳、说话者数量的不正确聚类和估计以及背景噪音。因此，自动检测错误并在可能的情况下进行更正非常重要。我们使用基于非自回归语言模型的第二遍说话者标记校正系统来纠正不同说话者所说的句子边界处的单词错误。我们首先表明，所采用的纠错方法可降低两个数据集上的单词分类错误率 (WDER)：TAL 和 Fisher 测试集。此外，我们在 Post-ASR 说话人标记校正挑战中对我们的系统进行了评估，并发现与基线方法相比，cpWER 有显著改善。

Title: Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder

Authors: Jihyun Mun, Sunhee Kim, Minhwa Chung
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00158
Pdf URL: https://arxiv.org/pdf/2409.00158
Copy Paste: [[2409.00158]] Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder(https://arxiv.org/abs/2409.00158)
Keywords: language model
Abstract: Autism Spectrum Disorder (ASD) is a lifelong condition that significantly influences an individual's communication abilities and social interactions. Early diagnosis and intervention are critical due to the profound impact of ASD's characteristic behaviors on foundational developmental stages. However, limitations of standardized diagnostic tools necessitate the development of objective and precise diagnostic methodologies. This paper proposes an end-to-end framework for automatically predicting the social communication severity of children with ASD from raw speech data. This framework incorporates an automatic speech recognition model, fine-tuned with speech data from children with ASD, followed by the application of fine-tuned pre-trained language models to generate a final prediction score. Achieving a Pearson Correlation Coefficient of 0.6566 with human-rated scores, the proposed method showcases its potential as an accessible and objective tool for the assessment of ASD.
摘要：自闭症谱系障碍 (ASD) 是一种终身疾病，会严重影响个人的沟通能力和社交互动。早期诊断和干预至关重要，因为 ASD 的特征行为会对基础发展阶段产生深远影响。然而，标准化诊断工具的局限性使得必须开发客观而精确的诊断方法。本文提出了一个端到端框架，用于根据原始语音数据自动预测 ASD 儿童的社交沟通严重程度。该框架包含一个自动语音识别模型，使用 ASD 儿童的语音数据进行微调，然后应用微调的预训练语言模型来生成最终预测分数。与人工评分相比，所提出的方法的皮尔逊相关系数达到 0.6566，展示了其作为评估 ASD 的便捷客观工具的潜力。
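
Note: The abstract above reports agreement with human ratings as a Pearson Correlation Coefficient. As a reminder of what that number measures, here is a minimal sketch of the standard formula. The function name `pearson` and the sample scores are hypothetical and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (dev_x * dev_y)


# Hypothetical predicted and human-rated scores, invented for illustration.
predicted = [1.0, 2.0, 3.0, 4.0]
human = [2.0, 4.0, 6.0, 8.0]
r = pearson(predicted, human)
```

A value near 1 means the predicted scores rise and fall with the human ratings; a value near 0 means no linear relationship.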

Title: LLMs hallucinate graphs too: a structural perspective

Authors: Erwan Le Merrer, Gilles Tredan
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2409.00159
Pdf URL: https://arxiv.org/pdf/2409.00159
Copy Paste: [[2409.00159]] LLMs hallucinate graphs too: a structural perspective(https://arxiv.org/abs/2409.00159)
Keywords: llm, hallucination, prompt
Abstract: It is known that LLMs do hallucinate, that is, they return incorrect information as facts. In this paper, we introduce the possibility to study these hallucinations under a structured form: graphs. Hallucinations in this context are incorrect outputs when prompted for well known graphs from the literature (e.g. Karate club, Les Misérables, graph atlas). These hallucinated graphs have the advantage of being much richer than the factual accuracy -- or not -- of a fact; this paper thus argues that such rich hallucinations can be used to characterize the outputs of LLMs. Our first contribution observes the diversity of topological hallucinations from major modern LLMs. Our second contribution is the proposal of a metric for the amplitude of such hallucinations: the Graph Atlas Distance, that is the average graph edit distance from several graphs in the graph atlas set. We compare this metric to the Hallucination Leaderboard, a hallucination rank that leverages 10,000 times more prompts to obtain its ranking.
摘要：众所周知，LLM 确实会产生幻觉，也就是说，它们会将不正确的信息作为事实返回。在本文中，我们介绍了在结构化形式下研究这些幻觉的可能性：图表。在这种情况下，幻觉是提示文献中众所周知的图表（例如空手道俱乐部、悲惨世界、图表集）时的错误输出。这些幻觉图表的优势在于，它们比事实的准确性（或不准确）要丰富得多；因此，本文认为，这种丰富的幻觉可用于表征 LLM 的输出。我们的第一个贡献观察了现代主要 LLM 中拓扑幻觉的多样性。我们的第二个贡献是提出了一种衡量此类幻觉幅度的指标：图表集距离，即图表集集合中几个图表的平均图表编辑距离。我们将此指标与幻觉排行榜进行比较，幻觉排行榜利用 10,000 倍以上的提示来获得其排名。
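
Note: The abstract above defines the Graph Atlas Distance as an average graph edit distance over several reference graphs. The sketch below illustrates that idea under strong simplifying assumptions that are not from the paper: both graphs have the same number of nodes, only edge insertions and deletions are counted, and node relabelings are searched by brute force, so it is usable only for very small graphs. The names `edge_edit_distance` and `mean_distance` are hypothetical.

```python
from itertools import permutations


def edge_edit_distance(n, edges_a, edges_b):
    """Minimum number of edge insertions/deletions needed to turn graph B
    into graph A, minimized over all relabelings of B's nodes.
    Both graphs are undirected with nodes 0..n-1 (a simplifying assumption)."""
    a = {frozenset(e) for e in edges_a}
    best = None
    for perm in permutations(range(n)):
        relabeled = {frozenset((perm[u], perm[v])) for u, v in edges_b}
        cost = len(a ^ relabeled)  # edges present in exactly one graph
        if best is None or cost < best:
            best = cost
    return best


def mean_distance(cases):
    """Average distance over (n, reference_edges, returned_edges) triples."""
    return sum(edge_edit_distance(n, ref, out) for n, ref, out in cases) / len(cases)


# Hypothetical reference graphs and model outputs, invented for illustration.
path = [(0, 1), (1, 2)]
relabeled_path = [(0, 1), (0, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]
score = mean_distance([(3, path, relabeled_path), (3, path, triangle)])
```

Here the relabeled path is isomorphic to the reference (distance 0) and the triangle has one extra edge (distance 1), so the average is 0.5.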

Title: Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

Authors: Jiayi Zhou, Jiaming Ji, Juntao Dai, Yaodong Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00162
Pdf URL: https://arxiv.org/pdf/2409.00162
Copy Paste: [[2409.00162]] Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback(https://arxiv.org/abs/2409.00162)
Keywords: language model, llm, prompt
Abstract: Aligning the behavior of large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and fine-tuning the LLMs to maximize RM feedback. Despite its effectiveness and popularity, RLHF is prone to biased local optimization. That is, the RM fails to provide feedback that accurately aligns with human preference, causing LLMs to explore unexpected generalizations and fail to achieve alignment objectives. To mitigate this issue, we propose a novel sequence-to-sequence (seq2seq) reward modeling method. Its key insight is that learning from language feedback rather than scalar feedback improves RLHF without additional annotations. We replace the reward modeling target, binary maximum likelihood estimation (MLE), with sequence MLE. This method enables richer and more fine-grained language feedback without additional annotations, models, or training stages. Our experiments demonstrate its effectiveness, specifically in reducing the refusal-to-response paradigm in single-turn safety dialogues and the long-response bias in text summarization tasks. Further analysis shows that seq2seq RM improves RLHF performance across 2B and 7B LLMs on 3 NLP tasks, achieving an average win rate of 76.9%. We further show that seq2seq RM can still improve the performance of RLHF under out-of-distribution prompts.
摘要：将大型语言模型 (LLM) 的行为与人类意图和价值观保持一致仍然是一项关键挑战。通过人类反馈强化学习 (RLHF)，通过训练奖励模型 (RM) 来调整 LLM，以适应人类偏好，并微调 LLM 以最大化 RM 反馈。尽管 RLHF 有效且受欢迎，但它容易出现有偏见的局部优化。这意味着 RM 无法提供与人类偏好准确一致的反馈，导致 LLM 探索意外的泛化，并无法实现对齐目标。为了缓解这个问题，我们提出了一种新颖的序列到序列 (seq2seq) 奖励建模方法。其关键见解是，从语言反馈而不是标量反馈中学习可以在没有额外注释的情况下改进 RLHF。我们用序列 MLE 替换了二元最大似然估计 (MLE) 中的奖励建模目标。这种方法可以在没有额外注释、模型或训练阶段的情况下实现更丰富、更细粒度的语言反馈。我们的实验证明了它的有效性，具体来说，它减少了单轮安全对话中的拒绝响应范式和文本摘要任务中的长响应偏差。我们进一步分析发现，seq2seq RM 在 3 个 NLP 任务中提高了 2B 和 7B LLM 中的 RLHF 性能，平均胜率为 76.9%。我们进一步表明，seq2seq RM 仍然可以在分布外提示下提高 RLHF 的性能。

Title: The creative psychometric item generator: a framework for item generation and validation using large language models

Authors: Antonio Laverghetta Jr., Simone Luchini, Averie Linell, Roni Reiter-Palmon, Roger Beaty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00202
Pdf URL: https://arxiv.org/pdf/2409.00202
Copy Paste: [[2409.00202]] The creative psychometric item generator: a framework for item generation and validation using large language models(https://arxiv.org/abs/2409.00202)
Keywords: language model, llm, prompt
Abstract: Increasingly, large language models (LLMs) are being used to automate workplace processes requiring a high degree of creativity. While much prior work has examined the creativity of LLMs, there has been little research on whether they can generate valid creativity assessments for humans despite the increasingly central role of creativity in modern economies. We develop a psychometrically inspired framework for creating test items (questions) for a classic free-response creativity test: the creative problem-solving (CPS) task. Our framework, the creative psychometric item generator (CPIG), uses a mixture of LLM-based item generators and evaluators to iteratively develop new prompts for writing CPS items, such that items from later iterations will elicit more creative responses from test takers. We find strong empirical evidence that CPIG generates valid and reliable items and that this effect is not attributable to known biases in the evaluation process. Our findings have implications for employing LLMs to automatically generate valid and reliable creativity tests for humans and AI.
摘要：大型语言模型 (LLM) 越来越多地被用于自动化需要高度创造力的工作流程。虽然之前的许多研究都对 LLM 的创造力进行了研究，但关于它们是否能为人类生成有效的创造力评估的研究却很少，尽管创造力在现代经济中发挥着越来越重要的作用。我们开发了一个受心理测量启发的框架，用于创建经典自由反应创造力测试的测试项目（问题）：创造性问题解决 (CPS) 任务。我们的框架，即创造性心理测量项目生成器 (CPIG)，使用基于 LLM 的项目生成器和评估器的混合来迭代开发编写 CPS 项目的新提示，以便后续迭代中的项目将引起测试者更具创造性的反应。我们发现强有力的经验证据表明 CPIG 生成有效且可靠的项目，并且这种影响不能归因于评估过程中已知的偏见。我们的研究结果对于使用 LLM 自动为人类和人工智能生成有效且可靠的创造力测试具有重要意义。

Title: Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs

Authors: Mazal Bethany, Emet Bethany, Brandon Wherry, Cho-Yu Chiang, Nishant Vishwamitra, Anthony Rios, Peyman Najafirad
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00209
Pdf URL: https://arxiv.org/pdf/2409.00209
Copy Paste: [[2409.00209]] Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs(https://arxiv.org/abs/2409.00209)
Keywords: language model, gpt, llm, prompt
Abstract: Event detection and text reasoning have become critical applications across various domains. While LLMs have recently demonstrated impressive progress in reasoning abilities, they often struggle with event detection, particularly due to the absence of training methods that consider causal relationships between event triggers and types. To address this challenge, we propose a novel approach for instruction fine-tuning LLMs for event detection. Our method introduces Semantic Causal Graphs (SCGs) to capture both causal relationships and contextual information within text. Building off of SCGs, we propose SCG Instructions for fine-tuning LLMs by focusing on event triggers and their relationships to event types, and employ Low-Rank Adaptation (LoRA) to help preserve the general reasoning abilities of LLMs. Our evaluations demonstrate that training LLMs with SCG Instructions outperforms standard instruction fine-tuning by an average of 35.69% on Event Trigger Classification. Notably, our fine-tuned Mistral 7B model also outperforms GPT-4 on key event detection metrics by an average of 31.01% on Event Trigger Identification, 37.40% on Event Trigger Classification, and 16.43% on Event Classification. We analyze the retention of general capabilities, observing only a minimal average drop of 2.03 points across six benchmarks. This comprehensive study investigates multiple LLMs for the event detection task across various datasets, prompting strategies, and training approaches.
摘要：事件检测和文本推理已成为各个领域的关键应用。虽然 LLM 最近在推理能力方面取得了令人瞩目的进步，但它们在事件检测方面往往举步维艰，特别是由于缺乏考虑事件触发器和类型之间因果关系的训练方法。为了应对这一挑战，我们提出了一种新颖的方法来对用于事件检测的 LLM 进行指令微调。我们的方法引入了语义因果图 (SCG) 来捕获文本中的因果关系和上下文信息。在 SCG 的基础上，我们提出了 SCG 指令来微调 LLM，重点关注事件触发器及其与事件类型的关系，并采用低秩自适应 (LoRA) 来帮助保持 LLM 的一般推理能力。我们的评估表明，使用 SCG 指令训练 LLM 在事件触发器分类方面的表现比标准指令微调平均高出 35.69%。值得注意的是，我们经过微调的 Mistral 7B 模型在关键事件检测指标上的表现也优于 GPT-4，在事件触发器识别上平均高出 31.01%，在事件触发器分类上高出 37.40%，在事件分类上高出 16.43%。我们分析了一般能力的保留率，发现在六个基准测试中平均下降幅度仅为 2.03 分。这项综合研究调查了跨各种数据集、提示策略和训练方法的事件检测任务的多个 LLM。

Title: Enhancing Document-level Argument Extraction with Definition-augmented Heuristic-driven Prompting for LLMs

Authors: Tongyue Sun, Jiayi Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00214
Pdf URL: https://arxiv.org/pdf/2409.00214
Copy Paste: [[2409.00214]] Enhancing Document-level Argument Extraction with Definition-augmented Heuristic-driven Prompting for LLMs(https://arxiv.org/abs/2409.00214)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Event Argument Extraction (EAE) is pivotal for extracting structured information from unstructured text, yet it remains challenging due to the complexity of real-world document-level EAE. We propose a novel Definition-augmented Heuristic-driven Prompting (DHP) method to enhance the performance of Large Language Models (LLMs) in document-level EAE. Our method integrates argument extraction-related definitions and heuristic rules to guide the extraction process, reducing error propagation and improving task accuracy. We also employ the Chain-of-Thought (CoT) method to simulate human reasoning, breaking down complex problems into manageable sub-problems. Experiments have shown that our method achieves a certain improvement in performance over existing prompting methods and few-shot supervised learning on document-level EAE datasets. The DHP method enhances the generalization capability of LLMs and reduces reliance on large annotated datasets, offering a novel research perspective for document-level EAE.
摘要：事件论元抽取 (EAE) 是从非结构化文本中提取结构化信息的关键，但由于现实世界文档级 EAE 的复杂性，它仍然具有挑战性。我们提出了一种新颖的定义增强启发式驱动提示 (DHP) 方法来增强大型语言模型 (LLM) 在文档级 EAE 中的性能。我们的方法集成了与论元抽取相关的定义和启发式规则来指导抽取过程，减少了错误传播并提高了任务准确性。我们还采用思维链 (CoT) 方法来模拟人类推理，将复杂问题分解为可管理的子问题。实验表明，我们的方法在文档级 EAE 数据集上比现有的提示方法和少样本监督学习取得了一定的性能提升。DHP 方法增强了 LLM 的泛化能力并减少了对大型注释数据集的依赖，为文档级 EAE 提供了一个新颖的研究视角。

Title: ProGRes: Prompted Generative Rescoring on ASR n-Best

Authors: Ada Defne Tur, Adel Moumen, Mirco Ravanelli
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.00217
Pdf URL: https://arxiv.org/pdf/2409.00217
Copy Paste: [[2409.00217]] ProGRes: Prompted Generative Rescoring on ASR n-Best(https://arxiv.org/abs/2409.00217)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have shown their ability to improve the performance of speech recognizers by effectively rescoring the n-best hypotheses generated during the beam search process. However, the best way to exploit recent generative instruction-tuned LLMs for hypothesis rescoring is still unclear. This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated through appropriately-prompted LLMs. Specifically, we introduce a new zero-shot method for ASR n-best rescoring, which combines confidence scores, LLM sequence scoring, and prompt-based hypothesis generation. We compare Llama-3-Instruct, GPT-3.5 Turbo, and GPT-4 Turbo as prompt-based generators with Llama-3 as sequence scorer LLM. We evaluated our approach using different speech recognizers and observed significant relative improvement in the word error rate (WER) ranging from 5% to 25%.
摘要：大型语言模型 (LLM) 已证明其能够通过有效地重新评分在波束搜索过程中生成的 n 个最佳假设来提高语音识别器的性能。然而，利用最近的生成指令调整的 LLM 进行假设重新评分的最佳方法仍不清楚。本文提出了一种新方法，该方法使用指令调整的 LLM 通过适当提示的 LLM 生成的新假设动态扩展 n 个最佳语音识别假设。具体来说，我们引入了一种新的零样本 ASR n 个最佳重新评分方法，该方法结合了置信度分数、LLM 序列评分和基于提示的假设生成。我们将 Llama-3-Instruct、GPT-3.5 Turbo 和 GPT-4 Turbo 作为基于提示的生成器与 Llama-3 作为序列评分器 LLM 进行了比较。我们使用不同的语音识别器评估了我们的方法，并观察到词错误率 (WER) 的显着相对改善，范围从 5% 到 25%。

Title: Can Large Language Models Address Open-Target Stance Detection?

Authors: Abu Ubaida Akash, Ahmed Fahmy, Amine Trabelsi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00222
Pdf URL: https://arxiv.org/pdf/2409.00222
Copy Paste: [[2409.00222]] Can Large Language Models Address Open-Target Stance Detection?(https://arxiv.org/abs/2409.00222)
Keywords: language model, gpt, llm
Abstract: Stance detection (SD) assesses a text's position towards a target, typically labeled as "favor," "against," or "neutral." We introduce Open-Target Stance Detection (OTSD), where targets are neither seen during training nor provided as input. Evaluating Large Language Models (LLMs) like GPT-3.5, Llama 3, and Mistral, we compare their performance with the Target-Stance Extraction (TSE) approach, which has the advantage of using predefined targets. LLMs perform better than TSE in target generation both when the real target is explicitly mentioned in the text and when it is not. For stance detection, LLMs perform better in explicit scenarios but fail in non-explicit ones.
摘要：立场检测 (SD) 评估文本对目标的立场，通常标记为“赞成”、“反对”或“中立”。我们引入了开放目标立场检测 (OTSD)，其中目标在训练期间既不可见也不作为输入提供。通过评估 GPT-3.5、Llama 3 和 Mistral 等大型语言模型 (LLM)，我们将其性能与目标立场提取 (TSE) 方法进行了比较，后者的优势在于使用预定义目标。当文本中明确提及或未明确提及真实目标时，LLM 在目标生成方面的表现优于 TSE。对于立场检测，LLM 在明确场景中表现更好，但在非明确场景中表现不佳。

Title: Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data

Authors: Spencer Whitehead, Jacob Phillips, Sean Hendryx
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2409.00238
Pdf URL: https://arxiv.org/pdf/2409.00238
Copy Paste: [[2409.00238]] Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data(https://arxiv.org/abs/2409.00238)
Keywords: language model, hallucination
Abstract: Multimodal language models can exhibit hallucinations in their outputs, which limits their reliability. The ability to automatically detect these errors is important for mitigating them, but has been less explored and existing efforts do not localize hallucinations, instead framing this as a classification task. In this work, we first pose multimodal hallucination detection as a sequence labeling task where models must localize hallucinated text spans and present a strong baseline model. Given the high cost of human annotations for this task, we propose an approach to improve the sample efficiency of these models by creating corrupted grounding data, which we use for pre-training. Leveraging phrase grounding data, we generate hallucinations to replace grounded spans and create hallucinated text. Experiments show that pre-training on this data improves sample efficiency when fine-tuning, and that the learning signal from the grounding data plays an important role in these improvements.
摘要：多模态语言模型的输出可能会出现幻觉，这限制了它们的可靠性。自动检测这些错误的能力对于减轻这些错误非常重要，但目前对此的研究较少，现有的努力并没有定位幻觉，而是将其作为分类任务。在这项工作中，我们首先将多模态幻觉检测作为序列标记任务，其中模型必须定位幻觉文本跨度并呈现强大的基线模型。考虑到这项任务的人工注释成本很高，我们提出了一种方法，通过创建损坏的基础数据来提高这些模型的样本效率，我们将这些数据用于预训练。利用短语基础数据，我们生成幻觉来替换基础跨度并创建幻觉文本。实验表明，在微调时对这些数据进行预训练可以提高样本效率，并且来自基础数据的学习信号在这些改进中起着重要作用。

Title: DiverseDialogue: A Methodology for Designing Chatbots with Human-Like Diversity

Authors: Xiaoyu Lin, Xinkai Yu, Ankit Aich, Salvatore Giorgi, Lyle Ungar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00262
Pdf URL: https://arxiv.org/pdf/2409.00262
Copy Paste: [[2409.00262]] DiverseDialogue: A Methodology for Designing Chatbots with Human-Like Diversity(https://arxiv.org/abs/2409.00262)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large Language Models (LLMs), which simulate human users, are frequently employed to evaluate chatbots in applications such as tutoring and customer service. Effective evaluation necessitates a high degree of human-like diversity within these simulations. In this paper, we demonstrate that conversations generated by GPT-4o mini, when used as simulated human participants, systematically differ from those between actual humans across multiple linguistic features. These features include topic variation, lexical attributes, and both the average behavior and diversity (variance) of the language used. To address these discrepancies, we propose an approach that automatically generates prompts for user simulations by incorporating features derived from real human interactions, such as age, gender, emotional tone, and the topics discussed. We assess our approach using differential language analysis combined with deep linguistic inquiry. Our method of prompt optimization, tailored to target specific linguistic features, shows significant improvements. Specifically, it enhances the human-likeness of LLM chatbot conversations, increasing their linguistic diversity. On average, we observe a 54 percent reduction in the error of average features between human and LLM-generated conversations. This method of constructing chatbot sets with human-like diversity holds great potential for enhancing the evaluation process of user-facing bots.
摘要：大型语言模型 (LLM) 模拟人类用户，经常用于评估辅导和客户服务等应用中的聊天机器人。有效的评估需要这些模拟中具有高度的类人多样性。在本文中，我们证明，当用作模拟人类参与者时，GPT-4o mini 生成的对话在多个语言特征上与实际人类之间的对话有系统性差异。这些特征包括主题变化、词汇属性以及所用语言的平均行为和多样性（方差）。为了解决这些差异，我们提出了一种方法，通过结合从真实人类互动中得出的特征（例如年龄、性别、情绪基调和讨论的主题）来自动生成用户模拟的提示。我们使用差异语言分析结合深度语言探究来评估我们的方法。我们针对特定语言特征定制的提示优化方法显示出显着的改进。具体来说，它增强了 LLM 聊天机器人对话的人性化，增加了它们的语言多样性。平均而言，我们观察到人类对话与 LLM 生成的对话之间的平均特征误差减少了 54%。这种构建具有类似人类多样性的聊天机器人集的方法对于增强面向用户的机器人的评估过程具有巨大潜力。

Title: Leveraging a Cognitive Model to Measure Subjective Similarity of Human and GPT-4 Written Content

Authors: Tyler Malloy, Maria José Ferreira, Fei Fang, Cleotilde Gonzalez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00269
Pdf URL: https://arxiv.org/pdf/2409.00269
Copy Paste: [[2409.00269]] Leveraging a Cognitive Model to Measure Subjective Similarity of Human and GPT-4 Written Content(https://arxiv.org/abs/2409.00269)
Keywords: language model, gpt, llm
Abstract: Cosine similarity between two documents can be computed using token embeddings formed by Large Language Models (LLMs) such as GPT-4, and used to categorize those documents across a range of uses. However, these similarities are ultimately dependent on the corpora used to train these LLMs, and may not reflect subjective similarity of individuals or how their biases and constraints impact similarity metrics. This lack of cognitively-aware personalization of similarity metrics can be particularly problematic in educational and recommendation settings where there is a limited number of individual judgements of category or preference, and biases can be particularly relevant. To address this, we rely on an integration of an Instance-Based Learning (IBL) cognitive model with LLM embeddings to develop the Instance-Based Individualized Similarity (IBIS) metric. This similarity metric is beneficial in that it takes into account individual biases and constraints in a manner that is grounded in the cognitive mechanisms of decision making. To evaluate the IBIS metric, we also introduce a dataset of human categorizations of emails as being either dangerous (phishing) or safe (ham). This dataset is used to demonstrate the benefits of leveraging a cognitive model to measure the subjective similarity of human participants in an educational setting.
摘要：可以使用由大型语言模型 (LLM)（例如 GPT-4）形成的标记嵌入来计算两个文档之间的余弦相似度，并用于在各种用途中对这些文档进行分类。但是，这些相似度最终取决于用于训练这些 LLM 的语料库，可能无法反映个人的主观相似度或他们的偏见和约束如何影响相似度指标。这种缺乏认知意识的相似度指标个性化在教育和推荐环境中尤其成问题，因为在教育和推荐环境中，个人对类别或偏好的判断数量有限，偏见可能特别重要。为了解决这个问题，我们依靠基于实例的学习 (IBL) 认知模型与 LLM 嵌入的集成来开发基于实例的个性化相似度 (IBIS) 指标。这种相似度指标的好处在于它以决策认知机制为基础的方式考虑了个人的偏见和约束。为了评估 IBIS 指标，我们还引入了一个数据集，该数据集包含人类将电子邮件归类为危险（网络钓鱼）或安全（普通）的类别。该数据集用于展示利用认知模型衡量教育环境中人类参与者的主观相似性的好处。
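
Note: The abstract above starts from cosine similarity between document embeddings. A minimal sketch of that baseline computation follows. The vectors are hypothetical stand-ins for embeddings, and the sketch does not implement the IBIS metric itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings", invented for illustration.
email_a = [1.0, 2.0, 0.0]
email_b = [2.0, 4.0, 0.0]
email_c = [0.0, 0.0, 3.0]
same_direction = cosine_similarity(email_a, email_b)
orthogonal = cosine_similarity(email_a, email_c)
```

Because the score depends only on the direction of the vectors, it reflects whatever the embedding model learned from its training corpus, which is the limitation the abstract sets out to address.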

Title: OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under Billion Parameters

Authors: Zexin Chen, Chengxi Li, Xiangyu Xie, Parijat Dube
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00286
Pdf URL: https://arxiv.org/pdf/2409.00286
Copy Paste: [[2409.00286]] OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under Billion Parameters(https://arxiv.org/abs/2409.00286)
Keywords: language model, llm
Abstract: This paper explores the potential of a small, domain-specific language model trained exclusively on sports-related data. We investigate whether extensive training data with specially designed small model structures can overcome model size constraints. The study introduces the OnlySports collection, comprising OnlySportsLM, OnlySports Dataset, and OnlySports Benchmark. Our approach involves: 1) creating a massive 600-billion-token OnlySports Dataset from FineWeb, 2) optimizing the RWKV architecture for sports-related tasks, resulting in a 196M-parameter model with a 20-layer, 640-dimension structure, 3) training OnlySportsLM on part of the OnlySports Dataset, and 4) testing the resultant model on the OnlySports Benchmark. OnlySportsLM achieves a 37.62%/34.08% accuracy improvement over previous 135M/360M state-of-the-art models and matches the performance of larger models such as SmolLM 1.7B and Qwen 1.5B in the sports domain. Additionally, the OnlySports collection presents a comprehensive workflow for building high-quality, domain-specific language models, providing a replicable blueprint for efficient AI development across various specialized fields.
摘要：本文探讨了专门针对体育相关数据进行训练的小型领域特定语言模型的潜力。我们研究了使用专门设计的小型模型结构的大量训练数据是否可以克服模型大小限制。这项研究介绍了 OnlySports 集合，包括 OnlySportsLM、OnlySports 数据集和 OnlySports 基准。我们的方法包括：1) 从 FineWeb 创建一个包含 6000 亿个标记的庞大 OnlySports 数据集，2) 针对体育相关任务优化 RWKV 架构，从而生成一个具有 20 层、640 维结构的 196M 参数模型，3) 在 OnlySports 数据集的一部分上训练 OnlySportsLM，以及 4) 在 OnlySports 基准上测试生成的模型。 OnlySportsLM 的准确率比之前的 135M/360M 先进模型提高了 37.62%/34.08%，在体育领域的表现堪比 SmolLM 1.7B 和 Qwen 1.5B 等大型模型。此外，OnlySports 系列还提供了构建高质量领域特定语言模型的全面工作流程，为各个专业领域的高效 AI 开发提供了可复制的蓝图。

Title: REFFLY: Melody-Constrained Lyrics Editing Model

Authors: Songyan Zhao, Bingxuan Li, Yufei Tian, Nanyun Peng
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.00292
Pdf URL: https://arxiv.org/pdf/2409.00292
Copy Paste: [[2409.00292]] REFFLY: Melody-Constrained Lyrics Editing Model(https://arxiv.org/abs/2409.00292)
Keywords: gpt
Abstract: Automatic melody-to-lyric generation aims to produce lyrics that align with a given melody. Although previous methods can generate lyrics based on high-level control signals, such as keywords or genre, they often struggle with three challenges: (1) lack of controllability, as prior works are only able to produce lyrics from scratch, with little or no control over the content; (2) inability to generate fully structured songs with the desired format; and (3) failure to align prominent words in the lyrics with prominent notes in the melody, resulting in poor lyrics-melody alignment. In this work, we introduce REFFLY (REvision Framework For Lyrics), the first revision framework designed to edit arbitrary forms of plain-text drafts into high-quality, full-fledged song lyrics. Our approach ensures that the generated lyrics retain the original meaning of the draft, align with the melody, and adhere to the desired song structures. We demonstrate that REFFLY performs well in diverse task settings, such as lyrics revision and song translation. Experimental results show that our model outperforms strong baselines, such as Lyra (Tian et al. 2023) and GPT-4, by 25% in both musicality and text quality.
摘要：自动旋律到歌词生成旨在生成与给定旋律相符的歌词。尽管以前的工作可以根据高级控制信号（例如关键字或流派）生成歌词，但它们通常面临三个挑战：（1）缺乏可控性，因为以前的工作只能从头开始生成歌词，对内容几乎没有或根本没有控制权；（2）无法生成具有所需格式的完整结构化的歌曲；（3）无法将歌词中的突出单词与旋律中的突出音符对齐，导致歌词旋律对齐不佳。在这项工作中，我们引入了 REFFLY（歌词修订框架），这是第一个旨在将任意形式的纯文本草稿编辑成高质量、完整的歌词的修订框架。我们的方法确保生成的歌词保留草稿的原始含义、与旋律一致并遵循所需的歌曲结构。我们证明了 REFFLY 在歌词修改和歌曲翻译等各种任务设置中都表现良好。实验结果表明，我们的模型在音乐性和文本质量方面均比 Lyra（Tian 等人，2023 年）和 GPT-4 等强大的基线模型高出 25%。

Title: From Prediction to Application: Language Model-based Code Knowledge Tracing with Domain Adaptive Pre-Training and Automatic Feedback System with Pedagogical Prompting for Comprehensive Programming Education

Authors: Unggi Lee, Jiyeong Bae, Yeonji Jung, Minji Kang, Gyuri Byun, Yeonseo Lee, Dohee Kim, Sookbun Lee, Jaekwon Park, Taekyung Ahn, Gunho Lee, Hyeoncheol Kim
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2409.00323
Pdf URL: https://arxiv.org/pdf/2409.00323
Copy Paste: [[2409.00323]] From Prediction to Application: Language Model-based Code Knowledge Tracing with Domain Adaptive Pre-Training and Automatic Feedback System with Pedagogical Prompting for Comprehensive Programming Education(https://arxiv.org/abs/2409.00323)
Keywords: language model, prompt
Abstract: Knowledge Tracing (KT) is a critical component in online learning, but traditional approaches face limitations in interpretability and cross-domain adaptability. This paper introduces Language Model-based Code Knowledge Tracing (CodeLKT), an innovative application of Language model-based Knowledge Tracing (LKT) to programming education. CodeLKT leverages pre-trained language models to process learning data, demonstrating superior performance over existing KT and Code KT models. We explore Domain Adaptive Pre-Training (DAPT) and Task Adaptive Pre-Training (TAPT), showing enhanced performance in the coding domain and investigating cross-domain transfer between mathematics and coding. Additionally, we present a theoretically informed integrated system combining CodeLKT with large language models to generate personalized, in-depth feedback to support students' programming learning. This work advances the field of Code Knowledge Tracing by expanding the knowledge base with a language model-based approach and offering practical implications for programming education through data-informed feedback.
摘要：知识追踪 (KT) 是在线学习的重要组成部分，但传统方法在可解释性和跨领域适应性方面存在局限性。本文介绍了基于语言模型的代码知识追踪 (CodeLKT)，这是基于语言模型的知识追踪 (LKT) 在编程教育中的创新应用。CodeLKT 利用预先训练的语言模型来处理学习数据，与现有的 KT 和 Code KT 模型相比，其性能更优异。我们探索了领域自适应预训练 (DAPT) 和任务自适应预训练 (TAPT)，它们在编码领域表现出色，并研究了数学和编码之间的跨领域转移。此外，我们提出了一个理论丰富的集成系统，将 CodeLKT 与大型语言模型相结合，以生成个性化的深入反馈来支持学生的编程学习。这项工作通过基于语言模型的方法扩展知识库，并通过数据丰富的反馈为编程教育提供实际意义，推动了代码知识追踪领域的发展。

Title: WikiCausal: Corpus and Evaluation Framework for Causal Knowledge Graph Construction

Authors: Oktie Hassanzadeh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00331
Pdf URL: https://arxiv.org/pdf/2409.00331
Copy Paste: [[2409.00331]] WikiCausal: Corpus and Evaluation Framework for Causal Knowledge Graph Construction(https://arxiv.org/abs/2409.00331)
Keywords: language model
Abstract: Recently, there has been an increasing interest in the construction of general-domain and domain-specific causal knowledge graphs. Such knowledge graphs enable reasoning for causal analysis and event prediction, and so have a range of applications across different domains. While great progress has been made toward automated construction of causal knowledge graphs, the evaluation of such solutions has either focused on low-level tasks (e.g., cause-effect phrase extraction) or on ad hoc evaluation data and small manual evaluations. In this paper, we present a corpus, task, and evaluation framework for causal knowledge graph construction. Our corpus consists of Wikipedia articles for a collection of event-related concepts in Wikidata. The task is to extract causal relations between event concepts from the corpus. The evaluation is performed in part using existing causal relations in Wikidata to measure recall, and in part using Large Language Models to avoid the need for manual or crowd-sourced evaluation. We evaluate a pipeline for causal knowledge graph construction that relies on neural models for question answering and concept linking, and show how the corpus and the evaluation framework allow us to effectively find the right model for each task. The corpus and the evaluation framework are publicly available.
摘要：最近，人们对构建通用领域和特定领域的因果知识图谱的兴趣日益浓厚。此类知识图谱能够进行因果分析和事件预测推理，因此在不同领域具有广泛的应用。虽然因果知识图谱的自动构建取得了巨大进展，但对此类解决方案的评估要么集中在低级任务（例如，因果短语提取），要么集中在临时评估数据和小型手动评估上。在本文中，我们提出了因果知识图谱构建的语料库、任务和评估框架。我们的语料库由维基百科文章组成，其中包含维基数据中与事件相关的概念集合。任务是从语料库中提取事件概念之间的因果关系。评估部分使用维基数据中现有的因果关系来衡量召回率，部分使用大型语言模型来避免手动或众包评估的需要。我们评估了依赖神经模型进行问答和概念链接的因果知识图谱构建流程，并展示了语料库和评估框架如何让我们有效地找到适合每项任务的正确模型。语料库和评估框架均已公开。

Title: Evaluating the Effectiveness of Large Language Models in Representing and Understanding Movement Trajectories

Authors: Yuhan Ji, Song Gao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00335
Pdf URL: https://arxiv.org/pdf/2409.00335
Copy Paste: [[2409.00335]] Evaluating the Effectiveness of Large Language Models in Representing and Understanding Movement Trajectories(https://arxiv.org/abs/2409.00335)
Keywords: language model, gpt, llm
Abstract: This research focuses on assessing the ability of AI foundation models in representing the trajectories of movements. We utilize one of the large language models (LLMs) (i.e., GPT-J) to encode the string format of trajectories and then evaluate the effectiveness of the LLM-based representation for trajectory data analysis. The experiments demonstrate that while the LLM-based embeddings can preserve certain trajectory distance metrics (i.e., the correlation coefficients exceed 0.74 between the Cosine distance derived from GPT-J embeddings and the Hausdorff and Dynamic Time Warping distances on raw trajectories), challenges remain in restoring numeric values and retrieving spatial neighbors in movement trajectory analytics. In addition, the LLMs can understand the spatiotemporal dependency contained in trajectories and have good accuracy in location prediction tasks. This research highlights the need for improvement in terms of capturing the nuances and complexities of the underlying geospatial data and integrating domain knowledge to support various GeoAI applications using LLMs.
摘要：本研究重点评估人工智能基础模型表示运动轨迹的能力。我们利用大型语言模型 (LLM) 之一（即 GPT-J）对轨迹的字符串格式进行编码，然后评估基于 LLM 的表示对轨迹数据分析的有效性。实验表明，虽然基于 LLM 的嵌入可以保留某些轨迹距离指标（即，从 GPT-J 嵌入导出的余弦距离与原始轨迹上的豪斯多夫和动态时间扭曲距离之间的相关系数超过 0.74），但在运动轨迹分析中，恢复数值和检索空间邻居仍然存在挑战。此外，LLM 可以理解轨迹中包含的时空依赖性，并且在位置预测任务中具有良好的准确性。这项研究强调了在捕捉底层地理空间数据的细微差别和复杂性以及整合领域知识以使用 LLM 支持各种 GeoAI 应用程序方面需要改进。
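
Illustration (not from the paper): the abstract compares a cosine distance on embeddings with the Hausdorff and Dynamic Time Warping (DTW) distances on raw trajectories. The sketch below shows textbook versions of the three measures in plain Python. The function names and the toy trajectories are assumptions made for illustration, and the paper's own implementation may differ.

```python
import math

def euclid(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sequences."""
    def directed(x, y):
        # Largest distance from a point in x to its nearest point in y.
        return max(min(euclid(p, q) for q in y) for p in x)
    return max(directed(a, b), directed(b, a))

def dtw(a, b):
    """Dynamic Time Warping cost with Euclidean point-to-point cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclid(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors (e.g. embeddings)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy trajectories (hypothetical): two parallel lines one unit apart.
traj_a = [(0, 0), (1, 0), (2, 0)]
traj_b = [(0, 1), (1, 1), (2, 1)]
h = hausdorff(traj_a, traj_b)
w = dtw(traj_a, traj_b)
```

Under this setup, a study like the one described would compute such distances for many trajectory pairs and correlate them with the cosine distances between the corresponding embeddings.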

Title: Does Alignment Tuning Really Break LLMs' Internal Confidence?

Authors: Hongseok Oh, Wonseok Hwang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00352
Pdf URL: https://arxiv.org/pdf/2409.00352
Copy Paste: [[2409.00352]] Does Alignment Tuning Really Break LLMs' Internal Confidence?(https://arxiv.org/abs/2409.00352)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration. This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidences and calibration errors and (2) future research into algorithms that can help LLMs to achieve both instruction-following and calibration without sacrificing either.
摘要：大型语言模型 (LLM) 已取得显著进展，但其实际应用需要可靠的校准。本研究从四个维度对 LLM 的校准退化进行了全面分析：模型、校准指标、任务和置信度提取方法。初步分析表明，对齐和校准之间的关系并不总是一种权衡，但在更严格的分析条件下，我们发现对齐过程始终会损害校准。这突出表明需要 (1) 在测量模型置信度和校准误差时采取谨慎的方法，以及 (2) 未来研究可以帮助 LLM 同时实现指令遵循和校准而不牺牲任何一个的算法。
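
Illustration (not from the paper): the abstract refers to calibration metrics without naming them. One widely used metric is the expected calibration error (ECE), sketched below with equal-width confidence bins. The function name, the bin count, and the toy data are assumptions, and the study may use other metrics or binning schemes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |mean confidence - accuracy| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical example: the model claims 90% confidence but is right 75% of the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Hypothetical example: 50% confidence and 50% accuracy, i.e. well calibrated.
calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
```

A lower value means the stated confidences track the observed accuracy more closely. Comparing the value before and after alignment tuning is one way to quantify the degradation the abstract describes.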

Title: Predicting the Target Word of Game-playing Conversations using a Low-Rank Dialect Adapter for Decoder Models

Authors: Dipankar Srirag, Aditya Joshi, Jacob Eisenstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00358
Pdf URL: https://arxiv.org/pdf/2409.00358
Copy Paste: [[2409.00358]] Predicting the Target Word of Game-playing Conversations using a Low-Rank Dialect Adapter for Decoder Models(https://arxiv.org/abs/2409.00358)
Keywords: llm
Abstract: Dialect adapters that improve the performance of LLMs for NLU tasks on certain sociolects/dialects/national varieties ('dialects' for the sake of brevity) have been reported for encoder models. In this paper, we extend the idea of dialect adapters to decoder models in our architecture called LoRDD. Using MD-3, a publicly available dataset of word game-playing conversations between dialectal speakers, our task is Target Word Prediction (TWP) from a masked conversation. LoRDD combines task adapters and dialect adapters where the latter employ contrastive learning on pseudo-parallel conversations from MD-3. Our results for en-IN conversations on two models (Mistral and Gemma) show that LoRDD outperforms four baselines on TWP, while bridging the performance gap with en-US by 12% on word similarity and 25% on accuracy. The focused contribution of LoRDD is in its promise for dialect adaptation of decoder models.
摘要：据报道，编码器模型中的方言适配器可以提高 LLM 在某些社会方言/方言/民族变体（为简洁起见，简称为“方言”）上的 NLU 任务的性能。在本文中，我们将方言适配器的概念扩展到我们架构中的解码器模型，称为 LoRDD。使用 MD-3（方言使用者之间文字游戏对话的公开数据集），我们的任务是从屏蔽对话中进行目标词预测 (TWP)。LoRDD 结合了任务适配器和方言适配器，后者对来自 MD-3 的伪平行对话采用对比学习。我们在两个模型（Mistral 和 Gemma）上对 en-IN 对话的结果表明，LoRDD 在 TWP 上的表现优于四个基线，同时在词汇相似度和准确度上缩小了与 en-US 的性能差距，分别提高了 12% 和 25%。LoRDD 的重点贡献在于它有望实现解码器模型的方言适应。

Title: An Empirical Study on Information Extraction using Large Language Models

Authors: Ridong Han, Chaohao Yang, Tao Peng, Prayag Tiwari, Xiang Wan, Lu Liu, Benyou Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00369
Pdf URL: https://arxiv.org/pdf/2409.00369
Copy Paste: [[2409.00369]] An Empirical Study on Information Extraction using Large Language Models(https://arxiv.org/abs/2409.00369)
Keywords: language model, gpt, llm, prompt
Abstract: Human-like large language models (LLMs), especially the most powerful and popular ones in OpenAI's GPT family, have proven to be very helpful for many natural language processing (NLP) related tasks. Therefore, various attempts have been made to apply LLMs to information extraction (IE), which is a fundamental NLP task that involves extracting information from unstructured plain text. To demonstrate the latest representative progress in LLMs' information extraction ability, we assess the information extraction ability of GPT-4 (the latest version of GPT at the time of writing this paper) from four perspectives: Performance, Evaluation Criteria, Robustness, and Error Types. Our results suggest a visible performance gap between GPT-4 and state-of-the-art (SOTA) IE methods. To alleviate this problem, considering the LLMs' human-like characteristics, we propose and analyze the effects of a series of simple prompt-based methods, which can be generalized to other LLMs and NLP tasks. Rich experiments show our methods' effectiveness and some of their remaining issues in improving GPT-4's information extraction ability.
摘要：类人大型语言模型 (LLM)，尤其是 OpenAI 的 GPT 系列中最强大和最受欢迎的模型，已被证明对许多自然语言处理 (NLP) 相关任务非常有帮助。因此，人们进行了各种尝试将 LLM 应用于信息提取 (IE)，这是一项基本的 NLP 任务，涉及从非结构化纯文本中提取信息。为了展示 LLM 信息提取能力的最新代表性进展，我们从四个角度评估了 GPT-4（撰写本文时 GPT 的最新版本）的信息提取能力：性能、评估标准、鲁棒性和错误类型。我们的结果表明 GPT-4 与最先进的 (SOTA) IE 方法之间存在明显的性能差距。为了缓解这个问题，考虑到 LLM 的类人特性，我们提出并分析了一系列简单的基于提示的方法的效果，这些方法可以推广到其他 LLM 和 NLP 任务。丰富的实验表明我们的方法在提高 GPT-4 信息提取能力方面的有效性及其一些剩余问题。

Title: Rethinking Backdoor Detection Evaluation for Language Models

Authors: Jun Yan, Wenjie Jacky Mo, Xiang Ren, Robin Jia
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2409.00399
Pdf URL: https://arxiv.org/pdf/2409.00399
Copy Paste: [[2409.00399]] Rethinking Backdoor Detection Evaluation for Language Models(https://arxiv.org/abs/2409.00399)
Keywords: language model
Abstract: Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. Backdoor detection methods aim to detect whether a released model contains a backdoor, so that practitioners can avoid such vulnerabilities. While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild. In this paper, we examine the robustness of backdoor detectors by manipulating different factors during backdoor planting. We find that the success of existing methods highly depends on how intensely the model is trained on poisoned data during backdoor planting. Specifically, backdoors planted with either more aggressive or more conservative training are significantly more difficult to detect than the default ones. Our results highlight a lack of robustness of existing backdoor detectors and the limitations in current benchmark construction.
摘要：后门攻击是指模型在受到攻击者指定的触发器时表现出恶意行为，这对依赖公开发布的语言模型的从业者来说是一个重大的安全风险。后门检测方法旨在检测已发布的模型是否包含后门，以便从业者可以避免此类漏洞。虽然现有的后门检测方法在标准基准上检测后门模型的准确率很高，但它们是否能够在野外稳健地识别后门尚不清楚。在本文中，我们通过在后门植入过程中操纵不同的因素来检查后门检测器的稳健性。我们发现现有方法的成功在很大程度上取决于在后门植入过程中模型在中毒数据上训练的强度。具体而言，使用更积极或更保守的训练植入的后门比默认后门更难检测。我们的结果强调了现有后门检测器缺乏稳健性以及当前基准构建的局限性。

Title: LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

Authors: Zhiyuan Hu, Yuliang Liu, Jinman Zhao, Suyuchen Wang, Yan Wang, Wei Shen, Qing Gu, Anh Tuan Luu, See-Kiong Ng, Zhiwei Jiang, Bryan Hooi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00509
Pdf URL: https://arxiv.org/pdf/2409.00509
Copy Paste: [[2409.00509]] LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models(https://arxiv.org/abs/2409.00509)
Keywords: language model, gpt, llm, long context
Abstract: Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window in LLMs through post-pretraining is highly resource-intensive. To address this, we introduce **LongRecipe**, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces the computational training resources by over 85% compared to full sequence training. Furthermore, LongRecipe also preserves the original LLM's capabilities in general tasks. Ultimately, *we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training using a single GPU with 80G memory.* Our code is released at this https URL.
摘要：大型语言模型 (LLM) 在处理长上下文任务时面临巨大挑战，因为它们在预训练期间的有效上下文窗口大小有限，这限制了它们对扩展序列进行泛化的能力。同时，通过后预训练扩展 LLM 中的上下文窗口非常耗费资源。为了解决这个问题，我们引入了 **LongRecipe**，这是一种用于扩展 LLM 上下文窗口的高效训练策略，包括有效的 token 分析、位置索引转换和训练优化策略。它在保持训练效率的同时模拟长序列输入，并显著提高模型对长程依赖关系的理解。在三种类型的 LLM 上的实验表明，LongRecipe 可以利用长序列，而只需要目标上下文窗口大小的 30%，与全序列训练相比，计算训练资源减少了 85% 以上。此外，LongRecipe 还保留了原始 LLM 在一般任务中的能力。最终，*我们可以将开源 LLM 的有效上下文窗口从 8k 扩展到 128k，使用具有 80G 内存的单个 GPU 仅进行一天的专门训练即可实现接近 GPT-4 的性能。*我们的代码发布在[链接](此 https URL)。

Title: Post-OCR Text Correction for Bulgarian Historical Documents

Authors: Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov
Subjects: cs.CL, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00527
Pdf URL: https://arxiv.org/pdf/2409.00527
Copy Paste: [[2409.00527]] Post-OCR Text Correction for Bulgarian Historical Documents(https://arxiv.org/abs/2409.00527)
Keywords: llm
Abstract: The digitization of historical documents is crucial for preserving the cultural heritage of society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem, as standard OCR tools are not tailored to deal with historical orthography or with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with diagonal attention loss and copy and coverage mechanisms, to improve post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019 Bulgarian dataset. We release our data and code at this https URL.
摘要：历史文献的数字化对于保存社会的文化遗产至关重要。此过程中的一个重要步骤是使用光学字符识别 (OCR) 将扫描的图像转换为文本，这可以实现进一步的搜索、信息提取等。不幸的是，这是一个难题，因为标准 OCR 工具并非为处理历史正字法和具有挑战性的布局而量身定制。因此，在处理此类文档时，在 OCR 输出上应用额外的文本校正步骤是标准做法。在这项工作中，我们专注于保加利亚语，并创建了第一个基准数据集，用于评估用第一个标准化保加利亚语正字法：19 世纪的 Drinov 正字法编写的历史保加利亚文献的 OCR 文本校正。我们进一步开发了一种通过利用大量当代保加利亚文献文本自动生成此正字法以及随后的 Ivanchev 正字法的合成数据的方法。然后，我们使用最先进的 LLM 和编码器-解码器框架，并通过对角注意力损失和复制和覆盖机制对其进行增强，以改进 OCR 后文本校正。所提出的方法减少了识别过程中引入的错误，并将文档质量提高了 25%，与 ICDAR 2019 保加利亚数据集上的最新技术相比，提高了 16%。我们在此 https URL 上发布了我们的数据和代码。

Title: Large Language Models-Enabled Digital Twins for Precision Medicine in Rare Gynecological Tumors

Authors: Jacqueline Lammert, Nicole Pfarr, Leonid Kuligin, Sonja Mathes, Tobias Dreyer, Luise Modersohn, Patrick Metzger, Dyke Ferber, Jakob Nikolas Kather, Daniel Truhn, Lisa Christine Adams, Keno Kyrill Bressem, Sebastian Lange, Kristina Schwamborn, Martin Boeker, Marion Kiechle, Ulrich A. Schatz, Holger Bronger, Maximilian Tschochohei
Subjects: cs.CL, cs.AI, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2409.00544
Pdf URL: https://arxiv.org/pdf/2409.00544
Copy Paste: [[2409.00544]] Large Language Models-Enabled Digital Twins for Precision Medicine in Rare Gynecological Tumors(https://arxiv.org/abs/2409.00544)
Keywords: language model, llm
Abstract: Rare gynecological tumors (RGTs) present major clinical challenges due to their low incidence and heterogeneity. The lack of clear guidelines leads to suboptimal management and poor prognosis. Molecular tumor boards accelerate access to effective therapies by tailoring treatment based on biomarkers, beyond cancer type. Unstructured data that requires manual curation hinders efficient use of biomarker profiling for therapy matching. This study explores the use of large language models (LLMs) to construct digital twins for precision medicine in RGTs. Our proof-of-concept digital twin system integrates clinical and biomarker data from institutional and published cases (n=21) and literature-derived data (n=655 publications with n=404,265 patients) to create tailored treatment plans for metastatic uterine carcinosarcoma, identifying options potentially missed by traditional, single-source analysis. LLM-enabled digital twins efficiently model individual patient trajectories. Shifting to a biology-based rather than organ-based tumor definition enables personalized care that could advance RGT management and thus enhance patient outcomes.
摘要：罕见妇科肿瘤 (RGT) 发病率低且异质性强，在临床上面临巨大挑战。缺乏明确的指导方针导致治疗不理想且预后不良。分子肿瘤委员会通过根据生物标志物（而不仅限于癌症类型）量身定制治疗方案，加速获得有效疗法。需要手动管理的非结构化数据阻碍了生物标志物分析在治疗匹配中的有效使用。本研究探讨了使用大型语言模型 (LLM) 构建数字孪生以用于 RGT 中的精准医疗。我们的概念验证数字孪生系统整合了来自机构和已发表病例（n=21）的临床和生物标志物数据以及文献数据（n=655 篇出版物，n=404,265 名患者），以创建转移性子宫癌肉瘤的定制治疗计划，识别传统的单一来源分析可能遗漏的选项。支持 LLM 的数字孪生可以有效地模拟个体患者的轨迹。转向基于生物学而不是基于器官的肿瘤定义可以实现个性化护理，从而可以推进 RGT 管理并改善患者的治疗效果。

Title: Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness

Authors: Wenxuan Wang
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2409.00551
Pdf URL: https://arxiv.org/pdf/2409.00551
Copy Paste: [[2409.00551]] Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness(https://arxiv.org/abs/2409.00551)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs), such as ChatGPT, have rapidly penetrated into people's work and daily lives over the past few years, due to their extraordinary conversational skills and intelligence. ChatGPT has become the fastest-growing software in terms of user numbers in human history and become an important foundational model for the next generation of artificial intelligence applications. However, the generations of LLMs are not entirely reliable, often producing content with factual errors, biases, and toxicity. Given their vast number of users and wide range of application scenarios, these unreliable responses can lead to many serious negative impacts. This thesis introduces the exploratory works in the field of language model reliability during the PhD study, focusing on the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives. First, to measure the correctness of LLMs, we introduce two testing frameworks, FactChecker and LogicAsker, to evaluate factual knowledge and logical reasoning accuracy, respectively. Second, for the non-toxicity of LLMs, we introduce two works for red-teaming LLMs. Third, to evaluate the fairness of LLMs, we introduce two evaluation frameworks, BiasAsker and XCulturalBench, to measure the social bias and cultural bias of LLMs, respectively.
摘要：大型语言模型（LLM）以 ChatGPT 为代表，凭借其卓越的对话能力和智能，在过去几年中迅速渗透到人们的工作和生活中。ChatGPT 已成为人类历史上用户数量增长最快的软件，并成为下一代人工智能应用的重要基础模型。然而，LLM 的生成内容并不完全可靠，经常会产生带有事实错误、偏见和毒性的内容。鉴于其庞大的用户数量和广泛的应用场景，这些不可靠的响应可能导致许多严重的负面影响。本论文介绍了博士期间在语言模型可靠性领域的探索性工作，从软件测试和自然语言处理两个角度关注 LLM 的正确性、无毒性和公平性。首先，为了衡量 LLM 的正确性，我们引入了两个测试框架 FactChecker 和 LogicAsker，分别用于评估事实知识和逻辑推理的准确性。其次，针对 LLM 的无毒性，我们介绍了两项对 LLM 进行红队测试的工作。第三，为了评估 LLM 的公平性，我们引入了两个评估框架 BiasAsker 和 XCulturalBench，分别用于衡量 LLM 的社会偏见和文化偏见。

Title: Learning to Ask: When LLMs Meet Unclear Instruction

Authors: Wenxuan Wang, Juluan Shi, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-tse Huang, Michael R. Lyu
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2409.00557
Pdf URL: https://arxiv.org/pdf/2409.00557
Copy Paste: [[2409.00557]] Learning to Ask: When LLMs Meet Unclear Instruction(https://arxiv.org/abs/2409.00557)
Keywords: language model, llm, hallucination, prompt
Abstract: Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs' tool use under imperfect instructions, we meticulously examine the real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that due to the next-token prediction training objective, LLMs tend to arbitrarily generate the missing arguments, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs' performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that AwN significantly outperforms existing frameworks for tool learning on NoisyToolBench. We will release all related code and datasets to support future research.
摘要：现代大型语言模型 (LLM) 具备调用函数的能力，可以利用外部工具来解决一系列仅靠语言技能无法完成的任务。然而，这些工具的有效执行不仅在很大程度上依赖于 LLM 的高级功能，而且还依赖于精确的用户指令，而这在现实世界中往往无法得到保证。为了评估 LLM 在指令不完善的情况下使用工具的性能，我们仔细检查了用户查询的真实世界指令，分析了错误模式，并构建了一个具有挑战性的工具使用基准，称为 Noisy ToolBench (NoisyToolBench)。我们发现，由于下一个标记预测训练目标，LLM 倾向于任意生成遗漏的参数，这可能会导致幻觉和风险。为了解决这个问题，我们提出了一个新框架 Ask-when-Needed (AwN)，当用户因指令不明确而遇到障碍时，它会提示 LLM 向用户提问。此外，为了减少用户与 LLM 交互所涉及的手工劳动，并从准确性和效率的角度评估 LLM 在工具利用方面的表现，我们设计了一个名为 ToolEvaluator 的自动评估工具。我们的实验表明，AwN 在 NoisyToolBench 中的表现明显优于现有的工具学习框架。我们将发布所有相关代码和数据集以支持未来的研究。

Title: Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

Authors: Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00598
Pdf URL: https://arxiv.org/pdf/2409.00598
Copy Paste: [[2409.00598]] Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models(https://arxiv.org/abs/2409.00598)
Keywords: language model, llm, prompt
Abstract: Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful prompts, like "how to kill a mosquito," which are actually harmless. Frequent false refusals not only frustrate users but also provoke a public backlash against the very values alignment seeks to protect. In this paper, we propose the first method to auto-generate diverse, content-controlled, and model-dependent pseudo-harmful prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately labels controversial prompts. We evaluate 20 LLMs on PHTest, uncovering new insights due to its scale and labeling. Our findings reveal a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Moreover, we show that many jailbreak defenses significantly increase the false refusal rates, thereby undermining usability. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs. Our code and dataset are available at this https URL
摘要：安全对齐的大型语言模型 (LLM) 有时会错误地拒绝伪有害提示，例如“如何杀死蚊子”，而这些提示实际上是无害的。频繁的错误拒绝不仅让用户感到沮丧，而且还会激起公众对对齐所寻求保护的价值观的强烈反对。在本文中，我们提出了第一种自动生成多样化、内容控制和模型相关的伪有害提示的方法。使用此方法，我们构建了一个名为 PHTest 的评估数据集，它比现有数据集大十倍，涵盖了更多的错误拒绝模式，并单独标记了有争议的提示。我们在 PHTest 上评估了 20 个 LLM，由于其规模和标签而发现了新的见解。我们的研究结果揭示了最小化错误拒绝和提高对越狱攻击的安全性之间的权衡。此外，我们表明许多越狱防御措施会显著增加错误拒绝率，从而破坏可用性。我们的方法和数据集可以帮助开发人员评估和微调更安全、更可用的 LLM。我们的代码和数据集可在此 https URL 上获取

Title: TinyAgent: Function Calling at the Edge

Authors: Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00608
Pdf URL: https://arxiv.org/pdf/2409.00608
Copy Paste: [[2409.00608]] TinyAgent: Function Calling at the Edge(https://arxiv.org/abs/2409.00608)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling for driving agentic systems at the edge. We first show how to enable accurate function calling for open-source models via the LLMCompiler framework. We then systematically curate a high-quality dataset for function calling, which we use to fine-tune two small language models, TinyAgent-1.1B and 7B. For efficient inference, we introduce a novel tool retrieval method to reduce the input prompt length and utilize quantization to further accelerate the inference speed. As a driving application, we demonstrate a local Siri-like system for Apple's MacBook that can execute user commands through text or voice input. Our results show that our models can achieve, and even surpass, the function-calling capabilities of larger models like GPT-4-Turbo, while being fully deployed at the edge. We open-source our dataset, models, and installable package and provide a demo video for our MacBook assistant agent.
摘要：最近的大型语言模型 (LLM) 已经实现了高级代理系统的开发，该系统可以集成各种工具和 API，通过函数调用满足用户查询。然而，由于这些 LLM 的模型规模和计算需求很大，因此它们通常需要基于云的基础设施，因此尚未探索在边缘部署这些 LLM。为此，我们提出了 TinyAgent，这是一个端到端框架，用于训练和部署特定于任务的小型语言模型代理，该代理能够进行函数调用，以在边缘驱动代理系统。我们首先展示如何通过 LLMCompiler 框架为开源模型启用准确的函数调用。然后，我们系统地整理了一个高质量的函数调用数据集，我们用它来微调两个小型语言模型 TinyAgent-1.1B 和 7B。为了高效推理，我们引入了一种新颖的工具检索方法来减少输入提示长度，并利用量化进一步加快推理速度。作为驱动应用程序，我们演示了一个适用于 Apple MacBook 的本地类似 Siri 的系统，它可以通过文本或语音输入执行用户命令。我们的结果表明，我们的模型可以达到甚至超越 GPT-4-Turbo 等大型模型的函数调用能力，同时完全部署在边缘。我们开源了我们的数据集、模型和可安装包，并为我们的 MacBook 助手代理提供了演示视频。

Title: Does Knowledge Localization Hold True? Surprising Differences Between Entity and Relation Perspectives in Language Models

Authors: Yifan Wei, Xiaoyan Yu, Yixuan Weng, Huanhuan Ma, Yuanzhe Zhang, Jun Zhao, Kang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00617
Pdf URL: https://arxiv.org/pdf/2409.00617
Copy Paste: [[2409.00617]] Does Knowledge Localization Hold True? Surprising Differences Between Entity and Relation Perspectives in Language Models(https://arxiv.org/abs/2409.00617)
Keywords: language model
Abstract: Large language models encapsulate knowledge and have demonstrated superior performance on various natural language processing tasks. Recent studies have localized this knowledge to specific model parameters, such as the MLP weights in intermediate layers. This study investigates the differences between entity and relational knowledge through knowledge editing. Our findings reveal that entity and relational knowledge cannot be directly transferred or mapped to each other. This result is unexpected, as logically, modifying the entity or the relation within the same knowledge triplet should yield equivalent outcomes. To further elucidate the differences between entity and relational knowledge, we employ causal analysis to investigate how relational knowledge is stored in pre-trained models. Contrary to prior research suggesting that knowledge is stored in MLP weights, our experiments demonstrate that relational knowledge is also significantly encoded in attention modules. This insight highlights the multifaceted nature of knowledge storage in language models, underscoring the complexity of manipulating specific types of knowledge within these models.
摘要：大型语言模型封装了知识，并在各种自然语言处理任务中表现出色。最近的研究将这些知识定位到特定的模型参数中，例如中间层的 MLP 权重。本研究通过知识编辑研究了实体知识和关系知识之间的差异。我们的研究结果表明，实体知识和关系知识不能直接相互转移或映射。这个结果是出乎意料的，因为从逻辑上讲，修改同一知识三元组中的实体或关系应该会产生相同的结果。为了进一步阐明实体知识和关系知识之间的差异，我们采用因果分析来研究关系知识如何存储在预训练模型中。与先前的研究表明知识存储在 MLP 权重中相反，我们的实验表明关系知识也显著编码在注意力模块中。这一见解凸显了语言模型中知识存储的多面性，强调了在这些模型中操纵特定类型知识的复杂性。

Title: Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation

Authors: Jasper Dekoninck, Maximilian Baader, Martin Vechev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00696
Pdf URL: https://arxiv.org/pdf/2409.00696
Copy Paste: [[2409.00696]] Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation(https://arxiv.org/abs/2409.00696)
Keywords: language model, llm
Abstract: Rating-based human evaluation has become an essential tool to accurately evaluate the impressive performance of large language models (LLMs). However, current rating systems suffer from several critical limitations. Specifically, they fail to account for human biases that significantly influence evaluation results, require large and expensive preference datasets to obtain accurate ratings, and do not facilitate meaningful comparisons of model ratings across different tasks. To address these issues, we introduce Polyrating, an expressive and flexible rating system based on maximum a posteriori estimation that enables a more nuanced and thorough analysis of model performance at lower costs. Polyrating can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. Furthermore, Polyrating can reduce the cost of human evaluations by up to 41% for new models and up to 77% for new tasks by leveraging existing benchmark scores. Lastly, Polyrating enables direct comparisons of ratings across different tasks, providing a comprehensive understanding of an LLM's strengths, weaknesses, and relative performance across different applications.
摘要：基于评级的人工评估已成为准确评估大型语言模型 (LLM) 出色性能的重要工具。然而，当前的评级系统存在几个关键限制。具体来说，它们未能考虑到对评估结果有重大影响的人为偏见，需要大量昂贵的偏好数据集才能获得准确的评级，并且不利于对不同任务之间的模型评级进行有意义的比较。为了解决这些问题，我们引入了 Polyrating，这是一种基于最大后验估计的富有表现力和灵活性的评级系统，可以以更低的成本对模型性能进行更细致和彻底的分析。Polyrating 可以检测和量化影响人类偏好的偏见，确保更公平的模型比较。此外，通过利用现有的基准分数，Polyrating 可以将新模型的人工评估成本降低高达 41%，将新任务的人工评估成本降低高达 77%。最后，Polyrating 可以直接比较不同任务之间的评级，从而全面了解 LLM 在不同应用程序中的优势、劣势和相对性能。
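
Illustration (not from the paper): the abstract says Polyrating is based on maximum a posteriori (MAP) estimation. The sketch below shows the general idea on a much simpler model: Bradley-Terry ratings fitted to pairwise preferences with a zero-mean Gaussian prior, using plain gradient ascent. It does not model the biases or benchmark-based priors the abstract mentions, and the function name, hyperparameters, and toy data are all assumptions.

```python
import math

def fit_map_ratings(games, n_models, prior_var=1.0, lr=0.1, steps=2000):
    """MAP Bradley-Terry ratings from (winner, loser) index pairs.

    Maximizes sum(log sigmoid(r_w - r_l)) - sum(r_i^2) / (2 * prior_var).
    """
    ratings = [0.0] * n_models
    for _ in range(steps):
        # Gradient of the Gaussian log-prior pulls every rating toward zero.
        grad = [-r / prior_var for r in ratings]
        for winner, loser in games:
            p_win = 1.0 / (1.0 + math.exp(-(ratings[winner] - ratings[loser])))
            # Gradient of the log-likelihood of the observed outcome.
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        ratings = [r + lr * g for r, g in zip(ratings, grad)]
    return ratings

# Hypothetical preference data: model 0 beats model 1 in three of four comparisons.
games = [(0, 1), (0, 1), (0, 1), (1, 0)]
ratings = fit_map_ratings(games, n_models=2)
```

The prior keeps ratings finite even when one model wins every comparison, which is one reason a MAP formulation suits settings with few human judgments.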

Title: Generating Media Background Checks for Automated Source Critical Reasoning

Authors: Michael Schlichtkrull
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00781
Pdf URL: https://arxiv.org/pdf/2409.00781
Copy Paste: [[2409.00781]] Generating Media Background Checks for Automated Source Critical Reasoning(https://arxiv.org/abs/2409.00781)
Keywords: llm
Abstract: Not everything on the internet is true. This unfortunate fact requires both humans and models to perform complex reasoning about credibility when working with retrieved information. In NLP, this problem has seen little attention. Indeed, retrieval-augmented models are not typically expected to distrust retrieved documents. Human experts overcome the challenge by gathering signals about the context, reliability, and tendency of source documents - that is, they perform source criticism. We propose a novel NLP task focused on finding and summarising such signals. We introduce a new dataset of 6,709 "media background checks" derived from Media Bias / Fact Check, a volunteer-run website documenting media bias. We test open-source and closed-source LLM baselines with and without retrieval on this dataset, finding that retrieval greatly improves performance. We furthermore carry out human evaluation, demonstrating that 1) media background checks are helpful for humans, and 2) media background checks are helpful for retrieval-augmented models.
摘要：互联网上并非一切都是真实的。这一不幸的事实要求人类和模型在处理检索到的信息时对可信度进行复杂的推理。在 NLP 中，这个问题很少受到关注。事实上，检索增强模型通常不会被期望不信任检索到的文档。人类专家通过收集有关源文档的背景、可靠性和趋势的信号来克服这一挑战 - 也就是说，他们执行来源批评。我们提出了一项新颖的 NLP 任务，专注于查找和总结此类信号。我们引入了一个由 6,709 个“媒体背景调查”组成的新数据集，该数据集来自 Media Bias / Fact Check，这是一个由志愿者运营的网站，记录了媒体偏见。我们在这个数据集上测试了有和没有检索的开源和闭源 LLM 基线，发现检索大大提高了性能。我们还进行了人工评估，证明 1) 媒体背景调查对人类有帮助，2) 媒体背景调查对检索增强模型有帮助。

Title: The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

Authors: Bocheng Chen, Hanqing Guo, Guangjing Wang, Yuanda Wang, Qiben Yan
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2409.00787
Pdf URL: https://arxiv.org/pdf/2409.00787
Copy Paste: [[2409.00787]] The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs(https://arxiv.org/abs/2409.00787)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated great capabilities in natural language understanding and generation, largely attributed to the intricate alignment process using human feedback. While alignment has become an essential training component that leverages data collected from user queries, it inadvertently opens up an avenue for a new type of user-guided poisoning attack. In this paper, we present a novel exploration into the latent vulnerabilities of the training pipeline in recent LLMs, revealing a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections. Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism to degrade model performance associated with a particular keyword, all while remaining inconspicuous. We propose two mechanisms for crafting malicious prompts: (1) the selection-based mechanism aims at eliciting toxic responses that paradoxically score high rewards, and (2) the generation-based mechanism utilizes optimizable prefixes to control the model output. By injecting 1% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used. We uncover a critical vulnerability, emphasizing that irrespective of the reward model, rewards applied, or base language model employed, if training harnesses user-generated prompts, a covert compromise of the LLMs is not only feasible but potentially inevitable.
摘要：大型语言模型 (LLM) 在自然语言理解和生成方面表现出了强大的能力，这在很大程度上归功于使用人工反馈的复杂对齐过程。虽然对齐已成为利用从用户查询中收集的数据的重要训练组件，但它无意中为一种新型用户引导的投毒攻击开辟了道路。在本文中，我们对最近的 LLM 中训练管道的潜在漏洞进行了新颖的探索，揭示了一种通过用户提供的提示来渗透对齐训练保护的微妙但有效的投毒攻击。我们的攻击即使在黑盒设置中没有关于目标 LLM 的明确知识，也会巧妙地改变奖励反馈机制，以降低与特定关键字相关的模型性能，同时保持不引人注意。我们提出了两种制作恶意提示的机制：(1) 基于选择的机制旨在引发有害反应，这些反应反而会获得高额奖励，(2) 基于生成的机制利用可优化的前缀来控制模型输出。通过恶意用户将 1\% 的这些特制提示注入数据，我们发现，当使用特定触发词时，毒性分数最高可提高两倍。我们发现了一个关键漏洞，强调无论奖励模型、应用的奖励或使用的基础语言模型如何，如果训练利用用户生成的提示，那么 LLM 的隐蔽攻击不仅是可行的，而且可能是不可避免的。

Title: Modeling Text-Label Alignment for Hierarchical Text Classification

Authors: Ashish Kumar, Durga Toshniwal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00788
Pdf URL: https://arxiv.org/pdf/2409.00788
Copy Paste: [[2409.00788]] Modeling Text-Label Alignment for Hierarchical Text Classification(https://arxiv.org/abs/2409.00788)
Keywords: gpt
Abstract: Hierarchical Text Classification (HTC) aims to categorize text data based on a structured label hierarchy, resulting in predicted labels forming a sub-hierarchy tree. The semantics of the text should align with the semantics of the labels in this sub-hierarchy. With the sub-hierarchy changing for each sample, the dynamic nature of text-label alignment poses challenges for existing methods, which typically process text and labels independently. To overcome this limitation, we propose a Text-Label Alignment (TLA) loss specifically designed to model the alignment between text and labels. We obtain a set of negative labels for a given text and its positive label set. By leveraging contrastive learning, the TLA loss pulls the text closer to its positive label and pushes it away from its negative label in the embedding space. This process aligns text representations with related labels while distancing them from unrelated ones. Building upon this framework, we introduce the Hierarchical Text-Label Alignment (HTLA) model, which leverages BERT as the text encoder and GPTrans as the graph encoder and integrates text-label embeddings to generate hierarchy-aware representations. Experimental results on benchmark datasets and comparison with existing baselines demonstrate the effectiveness of HTLA for HTC.
摘要：分层文本分类 (HTC) 旨在根据结构化标签层次结构对文本数据进行分类，从而预测形成子层次树的标签。文本的语义应与此子层次结构中标签的语义保持一致。由于每个样本的子层次结构都会发生变化，文本标签对齐的动态特性对现有方法提出了挑战，因为现有方法通常独立处理文本和标签。为了克服这一限制，我们提出了一种文本标签对齐 (TLA) 损失，专门用于对文本和标签之间的对齐进行建模。我们为给定的文本及其正标签集获得一组负标签。通过利用对比学习，TLA 损失将文本拉近其正标签，并将其推离嵌入空间中的负标签。此过程将文本表示与相关标签对齐，同时将其与不相关标签拉开距离。在此框架的基础上，我们引入了分层文本标签对齐 (HTLA) 模型，该模型利用 BERT 作为文本编码器，GPTrans 作为图形编码器，并集成文本标签嵌入以生成层次感知表示。基准数据集上的实验结果和与现有基线的比较证明了 HTLA 对 HTC 的有效性。
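
The contrastive idea behind the TLA loss (pull the text toward its positive labels, push it away from negative labels) can be illustrated with a minimal sketch. This is an InfoNCE-style objective chosen for illustration only; the cosine similarity, the temperature value, and the function names are assumptions and not the formulation used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def alignment_loss(text_vec, positive_labels, negative_labels, temperature=0.1):
    """InfoNCE-style loss (assumed form): small when the text embedding is
    closer to its positive label embeddings than to the negative ones."""
    pos = [math.exp(cosine(text_vec, p) / temperature) for p in positive_labels]
    neg = [math.exp(cosine(text_vec, n) / temperature) for n in negative_labels]
    denom = sum(pos) + sum(neg)
    # Average, over positives, of -log(softmax mass assigned to that positive).
    return -sum(math.log(p / denom) for p in pos) / len(pos)


# Toy 2-d embeddings (made up): the text points along the first axis.
text = [1.0, 0.0]
aligned = alignment_loss(text, [[1.0, 0.1]], [[0.0, 1.0]])
misaligned = alignment_loss(text, [[0.0, 1.0]], [[1.0, 0.1]])
print(aligned < misaligned)
```

In a real model the embeddings would come from the text and graph encoders and the loss would be minimized by gradient descent; here the vectors are fixed only to show the direction of the objective.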

Title: Comparing Discrete and Continuous Space LLMs for Speech Recognition

Authors: Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00800
Pdf URL: https://arxiv.org/pdf/2409.00800
Copy Paste: [[2409.00800]] Comparing Discrete and Continuous Space LLMs for Speech Recognition(https://arxiv.org/abs/2409.00800)
Keywords: language model, llm
Abstract: This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. We further classify LLMs based on their input and autoregressive feedback into continuous and discrete-space models. Using specialized encoders and comparative analysis with a Joint-Training-From-Scratch Language Model (JTFS LM) and pre-trained LLaMA2-7b, we provide a detailed examination of their effectiveness. Our work marks the first extensive comparison of speech representations in LLM-based ASR and explores various modeling techniques. We present an open-source result that achieves a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder, offering valuable insights for advancing ASR and natural language processing (NLP) research.
摘要：本文研究了基于大型语言模型 (LLM) 的自动语音识别 (ASR) 中的离散和连续语音表示，并根据特征连续性和训练方法将它们分为四类：对于离散和连续类型，分为监督和无监督。我们进一步根据输入和自回归反馈将 LLM 分为连续和离散空间模型。使用专门的编码器和与联合训练从头语言模型 (JTFS LM) 和预训练的 LLaMA2-7b 的比较分析，我们对它们的有效性进行了详细的检查。我们的工作标志着首次对基于 LLM 的 ASR 中的语音表示进行了广泛的比较，并探索了各种建模技术。我们展示了使用 HuBERT 编码器在 LibriSpeech 上实现的最先进的 1.69\% 字错误率 (WER) 的开源成果，为推进 ASR 和自然语言处理 (NLP) 研究提供了宝贵的见解。
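
For readers unfamiliar with the metric reported above, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The following is a minimal sketch of that standard definition; whitespace tokenization and a non-empty reference are assumptions, and it is not the evaluation script used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words.

    Assumes whitespace tokenization and a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```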

Title: LanguaShrink: Reducing Token Overhead with Psycholinguistics

Authors: Xuechen Liang, Meiling Tao, Yinghui Xia, Tianyu Shi, Jun Wang, JingSong Yang
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2409.00855
Pdf URL: https://arxiv.org/pdf/2409.00855
Copy Paste: [[2409.00855]] LanguaShrink: Reducing Token Overhead with Psycholinguistics(https://arxiv.org/abs/2409.00855)
Keywords: language model, llm, prompt, chat
Abstract: As large language models (LLMs) improve their capabilities in handling complex tasks, the issues of computational cost and efficiency due to long prompts are becoming increasingly prominent. To accelerate model inference and reduce costs, we propose an innovative prompt compression framework called LanguaShrink. Inspired by the observation that LLM performance depends on the density and position of key information in the input prompts, LanguaShrink leverages psycholinguistic principles and the Ebbinghaus memory curve to achieve task-agnostic prompt compression. This effectively reduces prompt length while preserving essential information. The framework introduces part-of-speech priority compression and data distillation techniques, using smaller models to learn compression targets and employing a KL-regularized reinforcement learning strategy for training, following the training method of OpenChat (Wang et al., 2023). Additionally, we adopt a chunk-based compression algorithm to achieve adjustable compression rates. We evaluate our method on multiple datasets, including LongBench, ZeroScrolls, Arxiv Articles, and a newly constructed novel test set. Experimental results show that LanguaShrink maintains semantic similarity while achieving up to 26 times compression. Compared to existing prompt compression methods, LanguaShrink improves end-to-end latency by a factor of 1.43.
摘要：随着大型语言模型（LLM）处理复杂任务的能力不断提升，长提示带来的计算成本和效率问题也日益凸显。为了加速模型推理并降低成本，我们提出了一种创新的提示压缩框架——LanguaShrink。受LLM性能取决于输入提示中关键信息的密度和位置这一观察结果的启发，LanguaShrink利用心理语言学原理和艾宾浩斯记忆曲线实现了与任务无关的提示压缩。这有效地减少了提示长度，同时保留了基本信息。我们参考了OpenChat的训练方法。该框架引入了词性优先压缩和数据蒸馏技术，使用较小的模型来学习压缩目标，并采用KL正则化的强化学习策略进行训练。\cite{wang2023openchat} 此外，我们采用基于块的压缩算法来实现可调的压缩率。我们在多个数据集上评估了我们的方法，包括LongBench、ZeroScrolls、Arxiv文章和一个新构建的新型测试集。实验结果表明，LanguaShrink 在保持语义相似度的同时，实现了高达 26 倍的压缩，相比现有的快速压缩方法，LanguaShrink 将端到端延迟提升了 1.43 倍。

Title: Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering

Authors: Derian Boer, Fabian Koch, Stefan Kramer
Subjects: cs.CL, cs.AI, cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2409.00861
Pdf URL: https://arxiv.org/pdf/2409.00861
Copy Paste: [[2409.00861]] Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering(https://arxiv.org/abs/2409.00861)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) frequently lack domain-specific knowledge, and even fine-tuned models tend to hallucinate. Hence, more reliable models that can include external knowledge are needed. We present a pipeline, 4StepFocus, and specifically a preprocessing step, that can substantially improve the answers of LLMs. This is achieved by providing guided access to external knowledge, making use of the model's ability to capture relational context and conduct rudimentary reasoning by itself. The method narrows down potentially correct answers through triplet-based searches in a semi-structured knowledge base in a direct, traceable fashion, before switching to latent representations for ranking those candidates based on unstructured data. This distinguishes it from related methods that are purely based on latent representations. 4StepFocus consists of the following steps: 1) triplet generation for extraction of relational data by an LLM, 2) substitution of variables in those triplets to narrow down answer candidates employing a knowledge graph, 3) sorting remaining candidates with a vector similarity search involving associated non-structured data, 4) reranking the best candidates by the LLM with background data provided. Experiments on a medical, a product recommendation, and an academic paper search test set demonstrate that this approach is indeed a powerful augmentation. It not only adds relevant traceable background information from information retrieval, but also improves performance considerably in comparison to state-of-the-art methods. This paper presents a novel, largely unexplored direction and therefore provides a wide range of future work opportunities. The source code is available at this https URL.
摘要：大型语言模型 (LLM) 通常缺乏特定领域的知识，甚至经过微调的模型也容易产生幻觉。因此，需要更可靠的模型来包含外部知识。我们提出了一个流程 4StepFocus，特别是一个预处理步骤，可以大大改善 LLM 的答案。这是通过提供对外部知识的引导访问来实现的，利用模型捕获关系上下文和自行进行基本推理的能力。该方法通过以直接、可追溯的方式在半结构化知识库中进行基于三元组的搜索来缩小可能正确的答案的范围，然后切换到潜在表示以根据非结构化数据对这些候选者进行排名。这使它有别于纯粹基于潜在表示的相关方法。 4StepFocus 包括以下步骤：1) LLM 生成三元组以提取关系数据，2) 使用知识图谱替换这些三元组中的变量以缩小答案候选范围，3) 使用涉及相关非结构化数据的向量相似性搜索对剩余候选进行排序，4) LLM 使用提供的背景数据对最佳候选进行重新排序。在医疗、产品推荐和学术论文搜索测试集上的实验表明，这种方法确实是一种强大的增强方法。它不仅增加了来自信息检索的相关可追溯背景信息，而且与最先进的方法相比，性能也大大提高。本文提出了一个新颖的、很大程度上未开发的方向，因此提供了广泛的未来工作机会。使用的源代码可在此 https URL 上找到。
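
The prefiltering idea in steps 2 and 3 can be sketched on a toy example. Everything below is hypothetical: the tiny knowledge graph, the documents, and the function names are invented for illustration, step 1 (LLM triplet generation) is replaced by hand-written triplets, step 3 uses word overlap as a stand-in for a vector similarity search, and step 4 (LLM reranking) is omitted.

```python
# Toy knowledge graph as (head, relation, tail) triples; all data is made up.
KG = {
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("paracetamol", "treats", "fever"),
}
# Unstructured text attached to each entity (also made up).
DOCS = {
    "aspirin": "pain reliever that thins blood",
    "ibuprofen": "anti-inflammatory pain reliever gentle on blood clotting",
    "paracetamol": "reduces fever",
}


def candidates_for(triplet, var="?x"):
    """Step 2: entities that satisfy one triplet when substituted for var."""
    head, relation, tail = triplet
    found = set()
    for kg_head, kg_relation, kg_tail in KG:
        if kg_relation != relation:
            continue
        if head == var and kg_tail == tail:
            found.add(kg_head)
        elif tail == var and kg_head == head:
            found.add(kg_tail)
    return found


def prefilter(triplets):
    """Keep only entities that satisfy every triplet."""
    sets = [candidates_for(t) for t in triplets]
    return set.intersection(*sets) if sets else set()


def rank(candidates, query):
    """Step 3 stand-in: order candidates by word overlap with the query."""
    query_words = set(query.lower().split())
    return sorted(
        candidates,
        key=lambda c: (-len(query_words & set(DOCS.get(c, "").split())), c),
    )


triplets = [("?x", "treats", "headache"), ("?x", "interacts_with", "warfarin")]
print(prefilter(triplets))
print(rank(prefilter([("?x", "treats", "headache")]), "anti-inflammatory pain reliever"))
```

The point of the sketch is the ordering of operations: the symbolic filter shrinks the candidate set in a traceable way before any similarity-based ranking is applied.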

Title: Self-evolving Agents with reflective and memory-augmented abilities

Authors: Xuechen Liang, Meiling Tao, Yinghui Xia, Tianyu Shi, Jun Wang, JingSong Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00872
Pdf URL: https://arxiv.org/pdf/2409.00872
Copy Paste: [[2409.00872]] Self-evolving Agents with reflective and memory-augmented abilities(https://arxiv.org/abs/2409.00872)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have made significant advances in the field of natural language processing, but they still face challenges such as continuous decision-making. In this research, we propose a novel framework that integrates iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve. The framework significantly enhances the agents' capabilities in handling multi-tasking and long-span information.
摘要：大型语言模型（LLM）在自然语言处理领域取得了重大进展，但仍面临持续决策等挑战。本研究提出了一种新颖的框架，通过整合迭代反馈、反射机制和基于艾宾浩斯遗忘曲线的记忆优化机制，显著增强了智能体处理多任务和长跨度信息的能力。
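
A memory mechanism based on the Ebbinghaus forgetting curve can be sketched with the commonly used exponential approximation R = exp(-t / S), where t is the time since last access and S is a memory strength. The abstract does not give the exact rule, so the retention formula, the threshold, and the memory record layout below are all assumptions.

```python
import math


def retention(elapsed, strength):
    """Exponential approximation of the forgetting curve: R = exp(-t / S)."""
    return math.exp(-elapsed / strength)


def prune(memories, now, threshold=0.3):
    """Keep memories whose estimated retention is still above the threshold.

    Each memory is a dict with hypothetical keys 'text', 'last_access',
    and 'strength' (larger strength means slower forgetting).
    """
    return [
        m for m in memories
        if retention(now - m["last_access"], m["strength"]) >= threshold
    ]


memories = [
    {"text": "rarely used fact", "last_access": 0.0, "strength": 1.0},
    {"text": "frequently recalled fact", "last_access": 0.0, "strength": 10.0},
]
print([m["text"] for m in prune(memories, now=5.0)])
```

One plausible extension, also an assumption, is to increase a memory's strength each time it is recalled so that useful memories decay more slowly.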

Title: User-Specific Dialogue Generation with User Profile-Aware Pre-Training Model and Parameter-Efficient Fine-Tuning

Authors: Atsushi Otsuka, Kazuya Matsuo, Ryo Ishii, Narichika Nomoto, Hiroaki Sugiyama
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00887
Pdf URL: https://arxiv.org/pdf/2409.00887
Copy Paste: [[2409.00887]] User-Specific Dialogue Generation with User Profile-Aware Pre-Training Model and Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2409.00887)
Keywords: prompt
Abstract: This paper addresses user-specific dialogue. In contrast to previous research on personalized dialogue, which focused on achieving virtual user dialogue as defined by persona descriptions, user-specific dialogue aims to reproduce real-user dialogue beyond persona-based dialogue. Fine-tuning using the target user's dialogue history is an efficient learning method for a user-specific model. However, it is prone to overfitting and model destruction due to the small amount of data. Therefore, we propose a learning method for user-specific models that combines parameter-efficient fine-tuning with a pre-trained dialogue model that includes user profiles. Parameter-efficient fine-tuning adds a small number of parameters to the entire model, so the model can be trained efficiently even on small amounts of data and is robust to model destruction. In addition, the pre-trained model, which is learned by adding simple prompts for automatically inferred user profiles, can generate speech with enhanced knowledge of the user's profile, even when there is little training data during fine-tuning. In experiments, we compared the proposed model with large-language-model utterance generation using prompts containing users' personal information. Experiments reproducing real users' utterances revealed that the proposed model can generate utterances with higher reproducibility than the compared methods, even with a small model.
摘要：本文讨论的是用户特定对话。与之前专注于实现角色描述所定义的虚拟用户对话的个性化对话研究不同，用户特定对话旨在重现超越角色对话的真实用户对话。使用目标用户的对话历史进行微调是用户特定模型的有效学习方法。然而，由于数据量少，它容易过度拟合和模型破坏。因此，我们提出了一种用户特定模型的学习方法，将参数高效微调与包含用户资料的预训练对话模型相结合。参数高效微调为整个模型添加了少量参数，因此即使少量的训练数据也可以得到有效训练，并且对模型破坏具有鲁棒性。此外，通过为自动推断的用户资料添加简单提示来学习的预训练模型，即使在微调期间训练数据很少的情况下，也可以生成对用户资料有增强了解的语音。在实验中，我们将提出的模型与使用包含用户个人信息的提示的大型语言模型话语生成进行了比较。重现真实用户话语的实验表明，即使使用小模型，所提出的模型也可以生成比比较方法具有更高可重复性的话语。

Title: Self-Judge: Selective Instruction Following with Alignment Self-Evaluation

Authors: Hai Ye, Hwee Tou Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00935
Pdf URL: https://arxiv.org/pdf/2409.00935
Copy Paste: [[2409.00935]] Self-Judge: Selective Instruction Following with Alignment Self-Evaluation(https://arxiv.org/abs/2409.00935)
Keywords: language model, gpt, llm, chat
Abstract: Pre-trained large language models (LLMs) can be tailored to adhere to human instructions through instruction tuning. However, due to shifts in the distribution of test-time data, they may not always execute instructions accurately, potentially generating factual errors or misaligned content when acting as chat assistants. To enhance the reliability of LLMs in following instructions, we propose the study of selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low. We train judge models that can predict numerical quality scores for model responses. To address data scarcity, we introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores. Our method leverages the model's inherent self-evaluation capability to extract information about response quality from labeled instruction-tuning data. It incorporates a gold reference answer to facilitate self-evaluation and recalibrates by assessing the semantic similarity between the response sample and the gold reference. During the training phase, we implement self-distillation as a regularization technique to enhance the capability of reference-free estimation. To validate alignment evaluation on general instruction-following tasks, we collect large-scale high-quality instructions from Hugging Face for model training and evaluation. Extensive experiments on five open-source models show that our method correlates much more with GPT-4 than strong baselines, e.g., supervised models distilled from GPT-4 and GPT-3.5-turbo. Our analysis shows our model's strong generalization across domains. Additionally, our judge models serve as good reward models, e.g., boosting WizardLM-13B-V1.2 from 89.17 to 92.48 and from 12.03 to 15.90 in version v1 and v2 of AlpacaEval respectively using best-of-32 sampling with our judge models.
摘要：预训练的大型语言模型 (LLM) 可以通过指令调整进行定制，以遵循人类指令。然而，由于测试时间数据分布的变化，它们可能并不总是准确地执行指令，在充当聊天助手时可能会产生事实错误或内容不一致。为了提高 LLM 在遵循指令方面的可靠性，我们提出了选择性指令遵循的研究，即如果预期的响应质量较低，系统将拒绝执行指令。我们训练可以预测模型响应数字质量分数的判断模型。为了解决数据稀缺问题，我们引入了 Self-J，这是一种新颖的自训练框架，用于开发判断模型，而无需人工注释的质量分数。我们的方法利用模型固有的自我评估能力，从标记的指令调整数据中提取有关响应质量的信息。它结合了黄金参考答案以促进自我评估，并通过评估响应样本和黄金参考之间的语义相似性进行重新校准。在训练阶段，我们实施自我提炼作为一种正则化技术，以增强无参考估计的能力。为了验证一般指令遵循任务的对齐评估，我们从 Hugging Face 收集了大规模高质量指令，用于模型训练和评估。对五个开源模型进行的大量实验表明，我们的方法与 GPT-4 的相关性远高于强基线，例如从 GPT-4 和 GPT-3.5-turbo 中提炼出的监督模型。我们的分析表明我们的模型在各个领域都具有很强的泛化能力。此外，我们的判断模型可以作为良好的奖励模型，例如，使用我们的判断模型进行 32 个最佳采样，分别将 AlpacaEval 版本 v1 和 v2 中的 WizardLM-13B-V1.2 从 89.17 提升到 92.48，从 12.03 提升到 15.90。
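
The two uses of a judge model mentioned above, best-of-N sampling and selective instruction following, can be sketched generically. The judge here is a hypothetical scoring callable (a toy word-count scorer stands in for a trained model), and the threshold is an arbitrary illustrative value; none of this reproduces the Self-J implementation.

```python
def best_of_n(candidates, judge):
    """Return the candidate response with the highest judge score."""
    return max(candidates, key=judge)


def selective_answer(candidates, judge, threshold):
    """Answer with the best candidate, or decline (None) if even the best
    candidate scores below the quality threshold."""
    best = best_of_n(candidates, judge)
    return best if judge(best) >= threshold else None


def toy_judge(response):
    """Hypothetical stand-in for a judge model: longer answers score higher."""
    return len(response.split())


samples = ["yes", "yes because it rains", "yes it does"]
print(best_of_n(samples, toy_judge))
print(selective_answer(samples, toy_judge, threshold=10))
```

With a real judge model, `judge` would wrap a forward pass that predicts a numerical quality score for an instruction and response pair.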

Title: Large Language Models for Automatic Detection of Sensitive Topics

Authors: Ruoyu Wen, Stephanie Elena Crowe, Kunal Gupta, Xinyue Li, Mark Billinghurst, Simon Hoermann, Dwain Allan, Alaeddin Nassani, Thammathip Piumsomboon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.00940
Pdf URL: https://arxiv.org/pdf/2409.00940
Copy Paste: [[2409.00940]] Large Language Models for Automatic Detection of Sensitive Topics(https://arxiv.org/abs/2409.00940)
Keywords: language model, gpt, llm
Abstract: Sensitive information detection is crucial in content moderation to maintain safe online communities. Assisting in this traditionally manual process could relieve human moderators from overwhelming and tedious tasks, allowing them to focus solely on flagged content that may pose potential risks. Rapidly advancing large language models (LLMs) are known for their capability to understand and process natural language and so present a potential solution to support this process. This study explores the capabilities of five LLMs for detecting sensitive messages in the mental well-being domain within two online datasets and assesses their performance in terms of accuracy, precision, recall, F1 scores, and consistency. Our findings indicate that LLMs have the potential to be integrated into the moderation workflow as a convenient and precise detection tool. The best-performing model, GPT-4o, achieved an average accuracy of 99.5% and an F1-score of 0.99. We discuss the advantages and potential challenges of using LLMs in the moderation workflow and suggest that future research should address the ethical considerations of utilising this technology.
摘要：敏感信息检测对于内容审核至关重要，可以维护安全的在线社区。协助完成这一传统的手动过程可以减轻人工审核人员繁重而繁琐的任务，使他们能够专注于可能带来潜在风险的标记内容。快速发展的大型语言模型 (LLM) 以其理解和处理自然语言的能力而闻名，因此为支持这一过程提供了潜在的解决方案。本研究探讨了五个 LLM 在两个在线数据集中检测心理健康领域敏感信息的能力，并评估了它们在准确度、精确度、召回率、F1 分数和一致性方面的性能。我们的研究结果表明，LLM 有可能作为一种方便而精确的检测工具集成到审核工作流程中。表现最佳的模型 GPT-4o 实现了 99.5% 的平均准确率和 0.99 的 F1 分数。我们讨论了在审核工作流程中使用 LLM 的优势和潜在挑战，并建议未来的研究应解决使用该技术的道德问题。
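
The evaluation metrics named above follow standard definitions for binary classification. A minimal sketch, assuming labels are encoded as 1 (sensitive) and 0 (not sensitive); the label encoding and the example predictions are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Five messages: two true positives, one miss, one false alarm, one true negative.
print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```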

Title: What does it take to get state of the art in simultaneous speech-to-speech translation?

Authors: Vincent Wilmet, Johnson Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00965
Pdf URL: https://arxiv.org/pdf/2409.00965
Copy Paste: [[2409.00965]] What does it take to get state of the art in simultaneous speech-to-speech translation?(https://arxiv.org/abs/2409.00965)
Keywords: hallucination
Abstract: This paper presents an in-depth analysis of the latency characteristics observed in the performance of simultaneous speech-to-speech models, focusing particularly on hallucination-induced latency spikes. By systematically experimenting with various input parameters and conditions, we propose methods to minimize latency spikes and improve overall performance. The findings suggest that a combination of careful input management and strategic parameter adjustments can significantly improve the latency behavior of speech-to-speech models.
摘要：本文深入分析了同步语音转语音模型性能中观察到的延迟特征，特别关注幻觉引起的延迟峰值。通过系统地试验各种输入参数和条件，我们提出了最小化延迟峰值和提高整体性能的方法。研究结果表明，谨慎的输入管理和战略性参数调整相结合可以显著增强语音转语音模型的延迟行为。

Title: DataSculpt: Crafting Data Landscapes for LLM Post-Training through Multi-objective Partitioning

Authors: Keer Lu, Zheng Liang, Xiaonan Nie, Da Pan, Shusen Zhang, Keshi Zhao, Weipeng Chen, Zenan Zhou, Guosheng Dong, Wentao Zhang, Bin Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.00997
Pdf URL: https://arxiv.org/pdf/2409.00997
Copy Paste: [[2409.00997]] DataSculpt: Crafting Data Landscapes for LLM Post-Training through Multi-objective Partitioning(https://arxiv.org/abs/2409.00997)
Keywords: language model, llm, long context
Abstract: The effectiveness of long-context modeling is important for Large Language Models (LLMs) in various applications. Despite their potential, LLMs' efficacy in processing long context does not consistently meet expectations, posing significant challenges for efficient management of prolonged sequences in training. This difficulty is compounded by the scarcity of comprehensive and diverse training datasets suitable for long sequences, which stems from inherent length biases across different data sources, and the logistical complexities associated with massive data management for training in extended contexts. In this work, we introduce DataSculpt, a data construction framework designed to strategically augment the data architecture for extended-context training. Our thorough evaluations demonstrate DataSculpt's remarkable capacity to boost long-context training performance, achieving improvements including an 18.09% increase in retrieval augmentation, 21.23% in summarization, 21.27% in reading comprehension, and a 3.81% rise in code completion, all while preserving the models' overall proficiency with a 4.88% improvement.
摘要：长上下文建模的有效性对于各种应用中的大型语言模型 (LLM) 非常重要。尽管 LLM 具有潜力，但其处理长上下文的效率并不总是符合预期，这对训练中有效管理长序列提出了重大挑战。由于缺乏适合长序列的全面而多样的训练数据集（这源于不同数据源之间固有的长度偏差），以及在扩展上下文中进行训练的海量数据管理相关的后勤复杂性，这一困难进一步加剧。在这项工作中，我们引入了 DataSculpt，这是一个数据构建框架，旨在战略性地增强扩展上下文训练的数据架构。我们的全面评估表明，DataSculpt 具有显著的提升长上下文训练性能的能力，实现了包括检索增强 18.09%、总结 21.23%、阅读理解 21.27% 和代码完成 3.81% 的改进，同时保持了模型的整体熟练程度，提高了 4.88%。

Title: Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning

Authors: Chongjie Si, Zhiyi Shi, Shifan Zhang, Xiaokang Yang, Hanspeter Pfister, Wei Shen
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2409.01035
Pdf URL: https://arxiv.org/pdf/2409.01035
Copy Paste: [[2409.01035]] Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning(https://arxiv.org/abs/2409.01035)
Keywords: language model
Abstract: Large language models demonstrate impressive performance on downstream tasks, yet require extensive resources when all parameters are fully fine-tuned. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions, which are critical for transitioning large models from pre-trained states to task-specific enhancements in PEFT. We propose a framework to clearly define these directions and explore their properties and practical utilization challenges. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of task-specific directions during the fine-tuning process, thereby enhancing model performance on targeted tasks. Extensive experiments have conclusively demonstrated the effectiveness of LoRA-Dash, and in-depth analyses further reveal the underlying mechanisms of LoRA-Dash. The code is available at this https URL.
摘要：大型语言模型在下游任务上表现出色，但在完全微调所有参数时需要大量资源消耗。为了缓解这种情况，已经开发了参数高效微调 (PEFT) 策略，例如 LoRA。在本文中，我们深入研究了任务特定方向的概念——这对于将大型模型从预训练状态过渡到 PEFT 中的任务特定增强至关重要。我们提出了一个框架来明确定义这些方向并探索它们的属性和实际使用挑战。然后，我们介绍了一种新方法 LoRA-Dash，旨在最大限度地发挥任务特定方向在微调过程中的影响，从而提高模型在目标任务上的性能。大量实验已最终证明了 LoRA-Dash 的有效性，深入分析进一步揭示了 LoRA-Dash 的底层机制。代码可在此 https URL 上找到。
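
As background for the LoRA family that this entry builds on, the standard LoRA update keeps the pre-trained weight W frozen and learns a low-rank correction, W' = W + (alpha / r) * B A. The sketch below shows only that generic update in plain Python; it does not implement LoRA-Dash or its task-specific directions, and the matrices are toy values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, alpha, rank):
    """Effective weight under LoRA: W + (alpha / rank) * (B @ A).

    w has shape (d_out, d_in), b has shape (d_out, rank),
    and a has shape (rank, d_in).
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight (toy)
b = [[1.0], [2.0]]             # trainable, d_out x rank
a = [[3.0, 4.0]]               # trainable, rank x d_in
print(lora_weight(w, b, a, alpha=2.0, rank=1))
```

Only B and A are trained, so the number of trainable parameters scales with the rank rather than with the full size of W.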

Title: NYK-MS: A Well-annotated Multi-modal Metaphor and Sarcasm Understanding Benchmark on Cartoon-Caption Dataset

Authors: Ke Chang, Hao Li, Junzhao Zhang, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01037
Pdf URL: https://arxiv.org/pdf/2409.01037
Copy Paste: [[2409.01037]] NYK-MS: A Well-annotated Multi-modal Metaphor and Sarcasm Understanding Benchmark on Cartoon-Caption Dataset(https://arxiv.org/abs/2409.01037)
Keywords: language model, gpt, llm
Abstract: Metaphor and sarcasm are common figurative expressions in everyday communication, especially on the Internet and in the memes popular among teenagers. We create a new benchmark named NYK-MS (NewYorKer for Metaphor and Sarcasm), which contains 1,583 samples for metaphor understanding tasks and 1,578 samples for sarcasm understanding tasks. The tasks include whether a sample contains metaphor/sarcasm, which word or object carries the metaphor/sarcasm, what it satirizes, and why it contains metaphor/sarcasm. All 7 tasks are well annotated by at least 3 annotators. We annotate the dataset over several rounds to improve consistency and quality, and use a GUI and GPT-4V to raise our efficiency. Based on the benchmark, we conduct extensive experiments. In the zero-shot experiments, we show that Large Language Models (LLMs) and Large Multi-modal Models (LMMs) cannot perform the classification tasks well, and that performance on the other 5 tasks improves as model scale increases. In the experiments on traditional pre-trained models, we show improvements from augmentation and alignment methods, which demonstrates that our benchmark is consistent with previous datasets and requires the model to understand both modalities.
摘要：隐喻和讽刺是人们交流中常见的比喻表达，尤其是在互联网上或青少年中流行的表情包中。我们创建了一个新的基准 NYK-MS (NewYorKer for Metaphor and Sarcasm)，其中包含 1,583 个隐喻理解任务样本和 1,578 个讽刺理解任务样本。这些任务包括是否包含隐喻/讽刺、哪个单词或物体包含隐喻/讽刺、它讽刺什么以及为什么包含隐喻/讽刺，所有 7 个任务都由至少 3 名注释者进行了良好的注释。我们对数据集进行了多轮注释以提高一致性和质量，并使用 GUI 和 GPT-4V 来提高效率。基于基准，我们进行了大量实验。在零样本实验中，我们表明大型语言模型 (LLM) 和大型多模态模型 (LMM) 不能很好地完成分类任务，并且随着规模的增加，其他 5 个任务的性能有所提高。在传统预训练模型的实验中，我们展示了使用增强和对齐方法的增强，这证明了我们的基准与以前的数据集一致，并且要求模型同时理解两种模态。

Title: A Perspective on Literary Metaphor in the Context of Generative AI

Authors: Imke van Heerden, Anil Bas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01053
Pdf URL: https://arxiv.org/pdf/2409.01053
Copy Paste: [[2409.01053]] A Perspective on Literary Metaphor in the Context of Generative AI(https://arxiv.org/abs/2409.01053)
Keywords: language model
Abstract: At the intersection of creative text generation and literary theory, this study explores the role of literary metaphor and its capacity to generate a range of meanings. In this regard, literary metaphor is vital to the development of any particular language. To investigate whether the inclusion of original figurative language improves textual quality, we trained an LSTM-based language model in Afrikaans. The network produces phrases containing compellingly novel figures of speech. Specifically, the emphasis falls on how AI might be utilised as a defamiliarisation technique, which disrupts expected uses of language to augment poetic expression. Providing a literary perspective on text generation, the paper raises thought-provoking questions on aesthetic value, interpretation and evaluation.
摘要：在创造性文本生成和文学理论的交叉点上，本研究探讨了文学隐喻的作用及其产生一系列含义的能力。在这方面，文学隐喻对任何特定语言的发展都至关重要。为了研究加入原始比喻语言是否会提高文本质量，我们用南非语训练了一个基于 LSTM 的语言模型。该网络生成的短语包含引人注目的新颖修辞手法。具体来说，重点在于如何利用人工智能作为一种陌生化技术，打破语言的预期用法来增强诗意表达。本文从文学的角度对文本生成进行了探讨，提出了关于审美价值、解释和评价的发人深省的问题。

Title: Pre-Trained Language Models for Keyphrase Prediction: A Review

Authors: Muhammad Umair, Tangina Sultana, Young-Koo Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01087
Pdf URL: https://arxiv.org/pdf/2409.01087
Copy Paste: [[2409.01087]] Pre-Trained Language Models for Keyphrase Prediction: A Review(https://arxiv.org/abs/2409.01087)
Keywords: language model
Abstract: Keyphrase Prediction (KP) is essential for identifying keyphrases in a document that can summarize its content. Recent Natural Language Processing (NLP) advances have produced more efficient KP models using deep learning techniques. However, the lack of a comprehensive exploration that jointly covers keyphrase extraction and generation with pre-trained language models is a critical gap in the literature, which motivates our survey to bridge this deficiency and offer a unified, in-depth analysis that addresses limitations of previous surveys. This paper extensively examines pre-trained language models for keyphrase prediction (PLM-KP), which are trained on large text corpora via different learning techniques (supervised, unsupervised, semi-supervised, and self-supervised), to provide insights into the two corresponding NLP tasks, namely Keyphrase Extraction (KPE) and Keyphrase Generation (KPG). We introduce appropriate taxonomies for PLM-KPE and KPG to highlight these two main tasks of NLP. Moreover, we point out some promising future directions for predicting keyphrases.
摘要：关键词预测 (KP) 对于识别文档中可以概括其内容的关键词至关重要。然而，最近的自然语言处理 (NLP) 进展已经使用深度学习技术开发了更高效的 KP 模型。使用预训练语言模型对关键词提取和生成进行全面探索的局限性凸显了文献中的一个关键差距，迫使我们的调查论文弥补这一不足，并提供统一而深入的分析来解决以前调查中的局限性。本文广泛研究了用于关键词预测的预训练语言模型 (PLM-KP) 的主题，这些模型通过不同的学习（监督、无监督、半监督和自监督）技术在大型文本语料库上进行训练，以提供对 NLP 中这两类任务的各自见解，确切地说，是关键词提取 (KPE) 和关键词生成 (KPG)。我们为 PLM-KPE 和 KPG 引入了适当的分类法，以突出 NLP 的这两个主要任务。此外，我们指出了预测关键短语的一些有希望的未来方向。

Title: Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference

Authors: Barys Liskavets, Maxim Ushakov, Shuvendu Roy, Mark Klibanov, Ali Etemad, Shane Luke
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.01227
Pdf URL: https://arxiv.org/pdf/2409.01227
Copy Paste: [[2409.01227]] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference(https://arxiv.org/abs/2409.01227)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have triggered a new stream of research focusing on compressing the context length to reduce the computational cost while ensuring the retention of helpful information for LLMs to answer the given question. Token-based removal methods are among the most prominent approaches in this direction, but they risk losing the semantics of the context through intermediate token removal, especially under high compression ratios, while also facing challenges in computational efficiency. In this work, we propose context-aware prompt compression (CPC), a sentence-level prompt compression technique whose key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. To train this encoder, we generate a new dataset consisting of questions, positives, and negatives, where positives are sentences relevant to the question, while negatives are irrelevant context sentences. We train the encoder in a contrastive setup to learn context-aware sentence representations. Our method considerably outperforms prior works on prompt compression on benchmark datasets and is up to 10.93x faster at inference compared to the best token-level compression method. We also find better improvement for shorter length constraints in most benchmarks, showing the effectiveness of our proposed solution in the compression of relevant information in a shorter context. Finally, we release the code and the dataset for quick reproducibility and further development: this https URL.
摘要：大型语言模型 (LLM) 引发了新的研究潮流，重点是压缩上下文长度以降低计算成本，同时确保保留有用的信息，以便 LLM 回答给定的问题。基于标记的删除方法是这个方向最突出的方法之一，但存在因中间标记删除而导致上下文语义丢失的风险，尤其是在高压缩率下，同时还面临计算效率方面的挑战。在这项工作中，我们提出了上下文感知提示压缩 (CPC)，这是一种句子级提示压缩技术，其关键创新是一种新颖的上下文感知句子编码器，可为给定问题的每个句子提供相关性分数。为了训练这个编码器，我们生成了一个由问题、正例和负例对组成的新数据集，其中正例是与问题相关的句子，而负例是无关的上下文句子。我们在对比设置中训练编码器以学习上下文感知句子表示。我们的方法在基准数据集上的提示压缩方面的表现远远优于之前的成果，与最佳标记级压缩方法相比，推理速度提高了 10.93 倍。我们还发现，在大多数基准测试中，较短的长度约束有更好的改进，表明我们提出的解决方案在较短上下文中压缩相关信息方面非常有效。最后，我们发布了代码和数据集，以便快速重现和进一步开发：此 https URL。
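
Sentence-level prompt compression can be sketched as: score every context sentence for relevance to the question, keep the highest-scoring sentences that fit a length budget, and emit them in their original order. In the sketch below the relevance score is plain word overlap, a stand-in assumption for the learned context-aware sentence encoder described above; the function names, the greedy selection, and the word-count budget are likewise illustrative choices.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., !, ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def relevance(sentence, question):
    """Stand-in relevance score: fraction of question words in the sentence."""
    s_words = set(re.findall(r"\w+", sentence.lower()))
    q_words = set(re.findall(r"\w+", question.lower()))
    return len(s_words & q_words) / len(q_words) if q_words else 0.0


def compress(context, question, max_words):
    """Greedily keep the most relevant sentences within a word budget."""
    sentences = split_sentences(context)
    by_relevance = sorted(
        range(len(sentences)),
        key=lambda i: -relevance(sentences[i], question),
    )
    kept, used = [], 0
    for i in by_relevance:
        n_words = len(sentences[i].split())
        if used + n_words <= max_words:
            kept.append(i)
            used += n_words
    # Restore original order so the compressed context stays readable.
    return " ".join(sentences[i] for i in sorted(kept))


context = ("Paris is the capital of France. Bananas are yellow. "
           "The Eiffel Tower is in Paris.")
print(compress(context, "What is the capital of France?", max_words=6))
```

Because whole sentences are kept or dropped, the surviving text remains grammatical, which is the property that distinguishes this family from token-level removal.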

Title: THInC: A Theory-Driven Framework for Computational Humor Detection

Authors: Victor De Marez, Thomas Winters, Ayla Rigouts Terryn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01232
Pdf URL: https://arxiv.org/pdf/2409.01232
Copy Paste: [[2409.01232]] THInC: A Theory-Driven Framework for Computational Humor Detection(https://arxiv.org/abs/2409.01232)
Keywords: language model
Abstract: Humor is a fundamental aspect of human communication and cognition, as it plays a crucial role in social engagement. Although theories about humor have evolved over centuries, there is still no agreement on a single, comprehensive humor theory. Likewise, computationally recognizing humor remains a significant challenge despite recent advances in large language models. Moreover, most computational approaches to detecting humor are not based on existing humor theories. This paper contributes to bridging this long-standing gap between humor theory research and computational humor detection by creating an interpretable framework for humor classification, grounded in multiple humor theories, called THInC (Theory-driven Humor Interpretation and Classification). THInC ensembles interpretable GA2M classifiers, each representing a different humor theory. We engineered a transparent flow to actively create proxy features that quantitatively reflect different aspects of theories. An implementation of this framework achieves an F1 score of 0.85. The associative interpretability of the framework enables analysis of proxy efficacy, alignment of joke features with theories, and identification of globally contributing features. This paper marks a pioneering effort in creating a humor detection framework that is informed by diverse humor theories and offers a foundation for future advancements in theory-driven humor classification. It also serves as a first step in automatically comparing humor theories in a quantitative manner.
摘要：幽默是人类交流和认知的一个基本方面，因为它在社交中起着至关重要的作用。尽管关于幽默的理论已经发展了几个世纪，但仍然没有就单一、全面的幽默理论达成一致。同样，尽管大型语言模型最近取得了进展，但计算识别幽默仍然是一项重大挑战。此外，大多数检测幽默的计算方法都不是建立在现有的幽默理论之上的。本文通过创建一个基于多种幽默理论的可解释幽默分类框架（称为 THInC（理论驱动的幽默解释和分类）），有助于弥合幽默理论研究和计算幽默检测之间长期存在的差距。THInC 集成了可解释的 GA2M 分类器，每个分类器代表不同的幽默理论。我们设计了一个透明的流程来主动创建定量反映理论不同方面的代理特征。该框架的实现实现了 0.85 的 F1 分数。该框架的关联可解释性使得能够分析代理效力、将笑话特征与理论对齐以及识别全局贡献特征。本文开创性地创建了一个幽默检测框架，该框架以多种幽默理论为基础，为理论驱动的幽默分类的未来发展奠定了基础。它也是自动定量比较幽默理论的第一步。

Title: Path-Consistency: Prefix Enhancement for Efficient Inference in LLM

Authors: Jiace Zhu, Yingtao Shen, Jie Zhao, An Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01281
Pdf URL: https://arxiv.org/pdf/2409.01281
Copy Paste: [[2409.01281]] Path-Consistency: Prefix Enhancement for Efficient Inference in LLM(https://arxiv.org/abs/2409.01281)
Keywords: language model, llm
Abstract: To enhance the reasoning capabilities of large language models (LLMs), self-consistency has gained significant popularity by combining multiple sampling with majority voting. However, the state-of-the-art self-consistency approaches consume substantial computational resources and lead to significant additional time costs due to the multiple sampling. This prevents its full potential from being realized in scenarios where computational resources are critical. To improve the inference efficiency, this paper introduces path-consistency, a method that leverages the confidence of answers generated in earlier branches to identify the prefix of the most promising path. By dynamically guiding the generation of subsequent branches based on this prefix, path-consistency mitigates both the errors and redundancies from random or less useful sampling in self-consistency. As a result, it can significantly accelerate the inference process by reducing the number of tokens generated. Our extensive empirical evaluation shows that path-consistency achieves significant acceleration in inference latency ranging from 7.8% to 40.5%, while maintaining or even improving task accuracy across different datasets, including mathematical reasoning, common sense reasoning, symbolic reasoning, and code generation.
摘要：为了增强大型语言模型 (LLM) 的推理能力，自洽性通过将多重采样与多数表决相结合而广受欢迎。然而，最先进的自洽性方法消耗大量计算资源，并由于多重采样而导致大量额外的时间成本。这阻碍了其在计算资源至关重要的场景中发挥全部潜力。为了提高推理效率，本文引入了路径一致性，这是一种利用早期分支中生成的答案的置信度来识别最有希望路径的前缀的方法。通过动态引导基于此前缀的后续分支的生成，路径一致性可以减轻自洽性中随机或不太有用的采样带来的错误和冗余。因此，它可以通过减少生成的标记数量来显著加速推理过程。我们广泛的实证评估表明，路径一致性在推理延迟方面实现了显著的加速，范围从 7.8% 到 40.5%，同时保持甚至提高不同数据集（包括数学推理、常识推理、符号推理和代码生成）的任务准确性。
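Illustrative sketch (not the authors' implementation): self-consistency takes a majority vote over sampled answers, and the entry above describes using the confidence of early answers to pick a promising prefix for later branches. The sketch assumes vote share as the confidence measure; `majority_vote`, `pick_prefix`, the 0.6 threshold, and the half-length prefix are hypothetical choices.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def pick_prefix(branches, threshold=0.6, keep_fraction=0.5):
    """Choose a reasoning prefix to reuse for later samples.

    branches: list of (reasoning_steps, answer) pairs from early samples.
    Returns the leading steps of the first path that reached the majority
    answer, or None when the vote is not confident enough.
    """
    answer, confidence = majority_vote([a for _, a in branches])
    if confidence < threshold:
        return None
    for steps, branch_answer in branches:
        if branch_answer == answer:
            keep = max(1, int(len(steps) * keep_fraction))
            return steps[:keep]
    return None
```

Later branches would then be generated as continuations of the returned prefix rather than from scratch, which is where the token savings would come from.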

Title: Language Models Benefit from Preparation with Elicited Knowledge

Authors: Jiacan Yu, Hannah An, Lenhart K. Schubert
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01345
Pdf URL: https://arxiv.org/pdf/2409.01345
Copy Paste: [[2409.01345]] Language Models Benefit from Preparation with Elicited Knowledge(https://arxiv.org/abs/2409.01345)
Keywords: language model, prompt
Abstract: The zero-shot chain of thought (CoT) approach is often used in question answering (QA) by language models (LMs) for tasks that require multiple reasoning steps, typically enhanced by the prompt "Let's think step by step." However, some QA tasks hinge more on accessing relevant knowledge than on chaining reasoning steps. We introduce a simple general prompting technique, called PREP, that involves using two instances of LMs: the first (LM1) generates relevant information, and the second (LM2) answers the question based on this information. PREP is designed to be general and independent of the user's domain knowledge, making it applicable across various QA tasks without the need for specialized prompt engineering. To evaluate the effectiveness of our prompting method, we create a dataset of 100 binary-choice questions, derived from an extensive schematic dataset on artifact parts and material composition. These questions ask which of two artifacts is less likely to share materials with another artifact. Such questions probe the LM's knowledge of shared materials in the part structure of different artifacts. We test our method on our dataset and three published commonsense reasoning datasets. The average accuracy of our method is consistently higher than that of all the other tested methods across all the tested datasets.
摘要：零样本思维链 (CoT) 方法通常由语言模型 (LM) 用于问答 (QA) 中，用于需要多个推理步骤的任务，通常通过提示“让我们一步一步思考”来增强。然而，一些 QA 任务更多地依赖于访问相关知识，而不是链接推理步骤。我们引入了一种简单的通用提示技术，称为 PREP，它涉及使用两个 LM 实例：第一个 (LM1) 生成相关信息，第二个 (LM2) 根据此信息回答问题。PREP 旨在通用且独立于用户的领域知识，使其适用于各种 QA 任务，而无需专门的提示工程。为了评估我们的提示方法的有效性，我们创建了一个包含 100 个二元选择问题的数据集，这些问题来自关于文物部件和材料成分的大量示意图数据集。这些问题询问两个文物中哪一个不太可能与另一个文物共享材料。这些问题探究了 LM 对不同文物零件结构中共享材料的知识。我们在我们的数据集和三个已发布的常识推理数据集上测试了我们的方法。在所有测试数据集上，我们的方法的平均准确率始终高于所有其他测试方法。
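Illustrative sketch (not the authors' prompts): the entry above describes PREP as two LM instances, where LM1 generates relevant information and LM2 answers the question using it. The sketch models each LM as a plain callable from prompt string to response string; `prep_answer` and the prompt wording are assumptions.

```python
def prep_answer(question, lm1, lm2):
    """Two-stage prompting: elicit knowledge first, then answer with it."""
    # Stage 1: LM1 is asked only for relevant background information.
    knowledge_prompt = (
        "List facts relevant to answering this question:\n" + question
    )
    knowledge = lm1(knowledge_prompt)
    # Stage 2: LM2 sees the elicited information together with the question.
    answer_prompt = (
        "Information:\n" + knowledge + "\n\nQuestion: " + question + "\nAnswer:"
    )
    return lm2(answer_prompt)
```

Any real model client can be wrapped as `lm1` and `lm2`; the two callables may also point at the same underlying model.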

Title: CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification

Authors: Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.01366
Pdf URL: https://arxiv.org/pdf/2409.01366
Copy Paste: [[2409.01366]] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification(https://arxiv.org/abs/2409.01366)
Keywords: language model, llm
Abstract: Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements. Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference. Existing methods typically employ thresholding-based sparsification based on the statistics of activation tensors. However, these methods do not explicitly model the impact of activation sparsification on performance, leading to suboptimal performance degradation. To address this issue, this paper reformulates the activation sparsification problem by introducing a new objective that optimizes the sparsification decisions. Building on this reformulation, we propose CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification. First, channel-wise thresholding assigns a unique threshold to each activation channel in the feed-forward network (FFN) layers. Then, selective sparsification involves applying thresholding-based activation sparsification to specific layers within the attention modules. Finally, we detail the implementation of sparse kernels to accelerate LLM inference. Experimental results demonstrate that the proposed CHESS achieves lower performance degradation over 8 downstream tasks while activating fewer parameters compared to existing methods, thus speeding up the LLM inference by up to 1.27x.
摘要：由于计算开销和内存需求巨大，在边缘设备上部署大型语言模型 (LLM) 面临巨大挑战。激活稀疏化可以通过减少推理期间激活的神经元数量来缓解这些挑战。现有方法通常采用基于激活张量统计数据的阈值稀疏化。然而，这些方法没有明确模拟激活稀疏化对性能的影响，导致性能下降不理想。为了解决这个问题，本文通过引入一个优化稀疏化决策的新目标重新表述了激活稀疏化问题。在此重新表述的基础上，我们提出了 CHESS，这是一种通过通道阈值和选择性稀疏化的通用激活稀疏化方法。首先，通道阈值为前馈网络 (FFN) 层中的每个激活通道分配一个唯一的阈值。然后，选择性稀疏化涉及将基于阈值的激活稀疏化应用于注意模块内的特定层。最后，我们详细介绍了稀疏核的实现以加速 LLM 推理。实验结果表明，与现有方法相比，所提出的 CHESS 在 8 个下游任务中激活更少的参数时实现了更低的性能下降，从而将 LLM 推理速度提高了 1.27 倍。
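Illustrative sketch (not the CHESS algorithm itself): the entry above says channel-wise thresholding assigns a separate threshold to each FFN activation channel. The sketch assumes each threshold is a magnitude quantile taken from calibration activations, which is a simplification; the paper derives its thresholds from its own objective. `channel_thresholds`, `sparsify`, and `sparsity` are hypothetical names.

```python
def channel_thresholds(calib, sparsity=0.5):
    """Compute one magnitude threshold per channel.

    calib: list of activation rows (one per token), each row a list with
    one value per channel. The threshold for a channel is the magnitude
    at the requested quantile of that channel's calibration values.
    """
    n_channels = len(calib[0])
    thresholds = []
    for c in range(n_channels):
        mags = sorted(abs(row[c]) for row in calib)
        idx = min(int(len(mags) * sparsity), len(mags) - 1)
        thresholds.append(mags[idx])
    return thresholds


def sparsify(row, thresholds):
    """Zero every activation whose magnitude is below its channel threshold."""
    return [x if abs(x) >= t else 0.0 for x, t in zip(row, thresholds)]
```

The point of per-channel thresholds is visible even in this toy form: a single global cutoff would treat a channel with typical magnitude 0.3 and one with typical magnitude 7 identically.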

Title: GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI

Authors: Xiangyuan Xue, Zeyu Lu, Di Huang, Wanli Ouyang, Lei Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01392
Pdf URL: https://arxiv.org/pdf/2409.01392
Copy Paste: [[2409.01392]] GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI(https://arxiv.org/abs/2409.01392)
Keywords: llm, agent
Abstract: Much previous AI research has focused on developing monolithic models to maximize their intelligence and capability, with the primary goal of enhancing performance on specific tasks. In contrast, this paper explores an alternative approach: collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models. The core innovation of GenAgent lies in representing workflows with code, alongside constructing workflows with collaborative agents in a step-by-step manner. We implement GenAgent on the ComfyUI platform and propose a new benchmark, OpenComfy. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations, showing its capability to generate complex workflows with superior effectiveness and stability.
摘要：之前的许多 AI 研究都专注于开发单体模型，以最大限度地发挥其智能和能力，主要目标是提高特定任务的性能。相比之下，本文探讨了一种替代方法：使用工作流集成模型、数据源和管道来解决复杂多样任务的协作 AI 系统。我们介绍了 GenAgent，这是一个基于 LLM 的框架，可自动生成复杂的工作流，与单体模型相比，它具有更大的灵活性和可扩展性。GenAgent 的核心创新在于用代码表示工作流，同时逐步构建与协作代理的工作流。我们在 ComfyUI 平台上实现了 GenAgent，并提出了一个新的基准 OpenComfy。结果表明，GenAgent 在运行级和任务级评估中均优于基线方法，显示出其能够生成具有卓越有效性和稳定性的复杂工作流。

Title: PoliPrompt: A High-Performance Cost-Effective LLM-Based Text Classification Framework for Political Science

Authors: Menglin Liu, Ge Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01466
Pdf URL: https://arxiv.org/pdf/2409.01466
Copy Paste: [[2409.01466]] PoliPrompt: A High-Performance Cost-Effective LLM-Based Text Classification Framework for Political Science(https://arxiv.org/abs/2409.01466)
Keywords: language model, llm, prompt
Abstract: Recent advancements in large language models (LLMs) have opened new avenues for enhancing text classification efficiency in political science, surpassing traditional machine learning methods that often require extensive feature engineering, human labeling, and task-specific training. However, their effectiveness in achieving high classification accuracy remains questionable. This paper introduces a three-stage in-context learning approach that leverages LLMs to improve classification accuracy while minimizing experimental costs. Our method incorporates automatic enhanced prompt generation, adaptive exemplar selection, and a consensus mechanism that resolves discrepancies between two weaker LLMs, refined by an advanced LLM. We validate our approach using datasets from the BBC news reports, Kavanaugh Supreme Court confirmation, and 2018 election campaign ads. The results show significant improvements in classification F1 score (+0.36 for zero-shot classification) with manageable economic costs (-78% compared with human labeling), demonstrating that our method effectively addresses the limitations of traditional machine learning while offering a scalable and reliable solution for text analysis in political science.
摘要：大型语言模型 (LLM) 的最新进展为提高政治科学文本分类效率开辟了新途径，超越了通常需要大量特征工程、人工标记和特定任务训练的传统机器学习方法。然而，它们在实现高分类准确度方面的有效性仍然值得怀疑。本文介绍了一种三阶段上下文学习方法，利用 LLM 来提高分类准确度，同时最大限度地降低实验成本。我们的方法结合了自动增强提示生成、自适应样本选择和解决两个较弱 LLM 之间差异的共识机制，并通过高级 LLM 进行了改进。我们使用来自 BBC 新闻报道、卡瓦诺最高法院确认和 2018 年竞选广告的数据集验证了我们的方法。结果显示分类 F1 分数显着提高（零样本分类为 +0.36），经济成本可控（与人工标记相比为 -78%），表明我们的方法有效地解决了传统机器学习的局限性，同时为政治科学中的文本分析提供了可扩展且可靠的解决方案。
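Illustrative sketch (not the authors' pipeline): the entry above mentions a consensus mechanism in which discrepancies between two weaker LLMs are resolved by an advanced LLM. The sketch models each classifier as a callable; `consensus_label` and the signature of `strong` are assumptions.

```python
def consensus_label(text, weak_a, weak_b, strong):
    """Label text with two cheap classifiers, escalating only on disagreement.

    weak_a, weak_b: callables mapping text to a label.
    strong: callable mapping (text, candidate_labels) to a label.
    Returns (label, escalated) where escalated records whether the
    stronger model was consulted.
    """
    label_a = weak_a(text)
    label_b = weak_b(text)
    if label_a == label_b:
        return label_a, False
    # The two weak models disagree, so the stronger model decides.
    return strong(text, (label_a, label_b)), True
```

Because the stronger model is called only on disagreements, its cost scales with the disagreement rate rather than with the dataset size.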

Title: Masked Mixers for Language Generation and Retrieval

Authors: Benjamin L. Badger
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.01482
Pdf URL: https://arxiv.org/pdf/2409.01482
Copy Paste: [[2409.01482]] Masked Mixers for Language Generation and Retrieval(https://arxiv.org/abs/2409.01482)
Keywords: language model
Abstract: Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit that there is a downside to the use of attention: most information present in the input is necessarily lost. In support of this idea we observe poor input representation accuracy in transformers, but find more accurate representation in what we term masked mixers, which replace self-attention with masked convolutions. Applied to TinyStories, the masked mixer learns causal language tasks more efficiently than early transformer implementations and somewhat less efficiently than optimized, current implementations. The most efficient learning algorithm observed for this dataset is a transformer-masked mixer hybrid, suggesting that these models learn in an orthogonal manner. We hypothesized that the information loss exhibited by transformers would be much more detrimental to retrieval than generation, and to test this we introduce an efficient training approach for retrieval models based on existing generative model embeddings. With this method, embeddings from masked mixers are found to result in far better summary-to-story retrieval compared to embeddings from transformers.
摘要：如今，注意力机制在语言模型中几乎无处不在，它可以选择性地关注输入元素的严格子集。我们认为使用注意力机制存在缺点：输入中存在的大部分信息必然会丢失。为了支持这一想法，我们观察到 Transformer 中的输入表示准确度较差，但在我们称之为掩蔽混合器（用掩蔽卷积取代自注意力）中发现更准确的表示。应用于 TinyStories 后，掩蔽混合器学习因果语言任务的效率比早期的 Transformer 实现更高，但比优化的当前实现效率略低。对于该数据集观察到的最有效的学习算法是 Transformer-掩蔽混合器混合器，这表明这些模型以正交方式学习。我们假设 Transformer 表现出的信息丢失对检索的危害远大于生成，为了测试这一点，我们引入了一种基于现有生成模型嵌入的检索模型的有效训练方法。通过这种方法，我们发现，与来自 transformer 的嵌入相比，来自 masked mixers 的嵌入可以实现更好的摘要到故事检索。
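Illustrative sketch (not the author's model code): the entry above describes replacing self-attention with masked convolutions. One way to read this, assumed here, is a learned token-mixing matrix whose entries above the diagonal are masked out, so position i only mixes positions 0..i. `causal_mix` is a hypothetical name and the weights are fixed inputs rather than trained parameters.

```python
def causal_mix(tokens, weights):
    """Mix token vectors across positions under a causal (lower-triangular) mask.

    tokens: list of T vectors, each a list of floats of equal length.
    weights: T x T mixing matrix; entries with column j > row i are ignored.
    """
    n_tokens = len(tokens)
    dim = len(tokens[0])
    out = []
    for i in range(n_tokens):
        mixed = [0.0] * dim
        # Only positions up to and including i contribute: this is the mask.
        for j in range(i + 1):
            for d in range(dim):
                mixed[d] += weights[i][j] * tokens[j][d]
        out.append(mixed)
    return out
```

Unlike attention, the mixing weights here do not depend on the input, which is the structural difference the entry contrasts with transformers.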

Title: The Compressor-Retriever Architecture for Language Model OS

Authors: Yuan Yang, Siheng Xiong, Ehsan Shareghi, Faramarz Fekri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01495
Pdf URL: https://arxiv.org/pdf/2409.01495
Copy Paste: [[2409.01495]] The Compressor-Retriever Architecture for Language Model OS(https://arxiv.org/abs/2409.01495)
Keywords: language model, llm, long context, chat, retrieval-augmented generation, agent
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced their capacity to aggregate and process information across multiple modalities, enabling them to perform a wide range of tasks such as multimodal data querying, tool usage, web interactions, and handling long documents. These capabilities pave the way for transforming LLMs from mere chatbots into general-purpose agents capable of interacting with the real world. This paper explores the concept of using a language model as the core component of an operating system (OS), effectively acting as a CPU that processes data stored in a context window, which functions as RAM. A key challenge in realizing such an LM OS is managing the life-long context and ensuring statefulness across sessions, a feature limited by the current session-based interaction paradigm due to context window size limit. To address this, we introduce compressor-retriever, a model-agnostic architecture designed for life-long context management. Unlike other long-context solutions such as retrieval-augmented generation, our approach exclusively uses the base model's forward function to compress and retrieve context, ensuring end-to-end differentiability. Preliminary experiments demonstrate the effectiveness of this architecture in in-context learning tasks, marking a step towards the development of a fully stateful LLM OS. Project repo available at: this https URL
摘要：大型语言模型 (LLM) 的最新进展显著增强了它们跨多种模态聚合和处理信息的能力，使它们能够执行各种任务，例如多模态数据查询、工具使用、Web 交互和处理长文档。这些功能为将 LLM 从单纯的聊天机器人转变为能够与现实世界交互的通用代理铺平了道路。本文探讨了使用语言模型作为操作系统 (OS) 核心组件的概念，有效地充当处理存储在上下文窗口中的数据的 CPU，上下文窗口充当 RAM。实现这种 LM OS 的一个关键挑战是管理终身上下文并确保跨会话的状态性，由于上下文窗口大小限制，当前基于会话的交互范式限制了此功能。为了解决这个问题，我们引入了压缩器-检索器，这是一种专为终身上下文管理而设计的模型无关架构。与其他长上下文解决方案（例如检索增强生成）不同，我们的方法仅使用基础模型的前向函数来压缩和检索上下文，从而确保端到端可区分性。初步实验证明了该架构在上下文学习任务中的有效性，标志着朝着开发完全有状态的 LLM OS 迈出了一步。项目存储库位于：此 https URL

Title: DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models

Authors: Rajat Rawat, Hudson McBride, Dhiyaan Nirmal, Rajarshi Ghosh, Jong Moon, Dhruv Alamuri, Sean O'Brien, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01497
Pdf URL: https://arxiv.org/pdf/2409.01497
Copy Paste: [[2409.01497]] DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models(https://arxiv.org/abs/2409.01497)
Keywords: language model, llm
Abstract: As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. Furthermore, to ensure the perturbations were accurate, we also propose a filtering strategy that validates each perturbation. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
摘要：随着大型语言模型 (LLM) 在医疗保健领域越来越受欢迎，人们越来越担心它们容易受到人口统计学偏见的影响。我们推出了 DiversityMedQA，这是一种新颖的基准，旨在评估 LLM 对不同患者人口统计学特征（例如性别和种族）的医疗查询的响应。通过扰动 MedQA 数据集（包含医学委员会考试问题）中的问题，我们创建了一个基准，可以捕捉不同患者资料中医疗诊断的细微差异。我们的研究结果表明，在针对这些人口统计学差异进行测试时，模型性能存在显著差异。此外，为了确保扰动准确，我们还提出了一种验证每个扰动的过滤策略。通过发布 DiversityMedQA，我们提供了一种资源，用于评估和减轻 LLM 医学诊断中的人口统计学偏见。

Title: S$^3$c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners

Authors: Yuchen Yan, Jin Jiang, Yang Liu, Yixin Cao, Xin Xu, Mengdi Zhang, Xunliang Cai, Jian Shao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01524
Pdf URL: https://arxiv.org/pdf/2409.01524
Copy Paste: [[2409.01524]] S$^3$c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners(https://arxiv.org/abs/2409.01524)
Keywords: language model, llm
Abstract: Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S$^3$c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We propose a method that employs a step-level sampling approach to construct step-wise self-correction data for achieving this ability. Additionally, we implement a training strategy that uses the data constructed above to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.
摘要：自我纠正是一种可以激发大型语言模型 (LLM) 潜在推理能力的新方法。它涉及在 LLM 解决推理问题时在推理过程中检测和纠正错误。然而，最近的研究并不认为自我纠正是 LLM 自发的内在能力。相反，这种纠正是通过事后生成、外部知识引入、多模型协作和类似技术实现的。在本文中，我们提出了一系列称为 S$^3$c-Math 的数学 LLM，它们能够对数学推理进行自发的分步自我纠正。这种能力有助于 LLM 识别其正在进行的推理是否倾向于包含错误，并同时纠正这些错误以产生更可靠的响应。我们提出了一种方法，该方法采用分步采样方法来构建分步自我纠正数据以实现这种能力。此外，我们实施了一种训练策略，该策略使用上述构建的数据为 LLM 配备自发的分步自我纠正能力。我们的数据和方法已被证明在各种基础 LLM 上都是有效的，在 GSM8K、MATH 和其他数学基准的评估中始终显示出显著的进步。据我们所知，我们是第一个在数学推理中引入 LLM 自发的分步自我纠正能力的人。

Title: It is Time to Develop an Auditing Framework to Promote Value Aware Chatbots

Authors: Yanchen Wang, Lisa Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01539
Pdf URL: https://arxiv.org/pdf/2409.01539
Copy Paste: [[2409.01539]] It is Time to Develop an Auditing Framework to Promote Value Aware Chatbots(https://arxiv.org/abs/2409.01539)
Keywords: gpt, llm, chat
Abstract: The launch of ChatGPT in November 2022 marked the beginning of a new era in AI: the availability of generative AI tools for everyone to use. ChatGPT and other similar chatbots boast a wide range of capabilities, from answering student homework questions to creating music and art. Given the large amounts of human data chatbots are built on, it is inevitable that they will inherit human errors and biases. These biases have the potential to inflict significant harm or increase inequity on different subpopulations. Because chatbots do not have an inherent understanding of societal values, they may create new content that is contrary to established norms. Examples of concerning generated content include child pornography, inaccurate facts, and discriminatory posts. In this position paper, we argue that the speed of advancement of this technology requires us, as computer and data scientists, to mobilize and develop a values-based auditing framework containing a community-established standard set of measurements to monitor the health of different chatbots and LLMs. To support our argument, we use a simple audit template to share the results of basic audits we conduct that are focused on measuring potential bias in search engine style tasks, code generation, and story generation. We identify responses from GPT 3.5 and GPT 4 that are both consistent and not consistent with values derived from existing law. While the findings come as no surprise, they do underscore the urgency of developing a robust auditing framework for openly sharing results in a consistent way so that mitigation strategies can be developed by the academic community, government agencies, and companies when our values are not being adhered to. We conclude this paper with recommendations for value-based strategies for improving the technologies.
摘要：2022 年 11 月推出的 ChatGPT 标志着人工智能新时代的开始，即每个人都可以使用生成式人工智能工具。ChatGPT 和其他类似的聊天机器人拥有广泛的功能，从回答学生家庭作业问题到创作音乐和艺术。鉴于聊天机器人建立在大量人类数据之上，它们不可避免地会继承人类的错误和偏见。这些偏见有可能对不同亚群造成重大伤害或增加不平等。由于聊天机器人对社会价值观没有内在的理解，它们可能会创建与既定规范相悖的新内容。令人担忧的生成内容的例子包括儿童色情、不准确的事实和歧视性帖子。在这篇立场文件中，我们认为，这项技术的发展速度要求我们作为计算机和数据科学家，动员和开发一个基于价值观的审计框架，其中包含一套社区建立的标准测量方法，以监测不同聊天机器人和 LLM 的健康状况。为了支持我们的论点，我们使用一个简单的审计模板来分享我们进行的基本审计结果，这些审计的重点是衡量搜索引擎样式任务、代码生成和故事生成中的潜在偏见。我们从 GPT 3.5 和 GPT 4 中发现，这些回应既与现有法律得出的价值观一致，也与现有法律得出的价值观不一致。虽然这些发现并不令人意外，但它们确实强调了开发一个强大的审计框架的紧迫性，以便以一致的方式公开分享结果，以便学术界、政府机构和公司在我们的价值观没有得到遵守时制定缓解策略。我们在本文的最后提出了基于价值观的改进技术策略的建议。

Title: Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs

Authors: Zhuo Li, Yuhao Du, Jinpeng Hu, Xiang Wan, Anningzhe Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01552
Pdf URL: https://arxiv.org/pdf/2409.01552
Copy Paste: [[2409.01552]] Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs(https://arxiv.org/abs/2409.01552)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have shown success in generating high-quality responses. To better align LLMs with human preference, various works have been proposed based on specific optimization processes, which, however, are not suitable for Black-Box LLMs like GPT-4 because their parameters are inaccessible. For Black-Box LLMs, performance is highly dependent on the quality of the provided prompts. Existing methods to enhance response quality often involve a prompt refinement model, yet these approaches potentially suffer from semantic inconsistencies between the refined and original prompts, and typically overlook the relationship between them. To address these challenges, we introduce a self-instructed in-context learning framework that empowers LLMs to deliver more effective responses by generating reliable derived prompts to construct informative contextual environments. Our approach incorporates a self-instructed reinforcement learning mechanism, enabling direct interaction with the response model during derived prompt generation for better alignment. We then formulate querying as an in-context learning task, using responses from LLMs combined with the derived prompts to establish a contextual demonstration for the original prompt. This strategy ensures alignment with the original query, reduces discrepancies from refined prompts, and maximizes the LLMs' in-context learning capability. Extensive experiments demonstrate that the proposed method not only generates more reliable derived prompts but also significantly enhances LLMs' ability to deliver more effective responses, including Black-Box models such as GPT-4.
摘要：大型语言模型 (LLM) 已成功生成高质量的响应。为了与具有人类偏好的 LLM 更好地对齐，人们提出了各种基于特定优化过程的工作，但由于无法访问参数，这些工作不适用于 GPT-4 等黑盒 LLM。在黑盒 LLM 的情况下，它们的性能高度依赖于所提供提示的质量。现有的提高响应质量的方法通常涉及提示细化模型，但这些方法可能存在细化提示和原始提示之间的语义不一致问题，并且通常忽略它们之间的关系。为了应对这些挑战，我们引入了一个自指导的上下文学习框架，该框架通过生成可靠的派生提示来构建信息丰富的上下文环境，使 LLM 能够提供更有效的响应。我们的方法结合了一种自指导的强化学习机制，可以在派生提示生成过程中与响应模型直接交互，以实现更好的对齐。然后，我们将查询制定为上下文学习任务，使用 LLM 的响应与派生提示相结合，为原始提示建立上下文演示。此策略可确保与原始查询保持一致，减少与精炼提示的差异，并最大限度地提高 LLM 的上下文学习能力。大量实验表明，所提出的方法不仅可以生成更可靠的派生提示，而且还显着增强了 LLM 提供更有效响应的能力，包括 GPT-4 等黑盒模型。

Title: Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture

Authors: Chen-Chi Chang, Ching-Yuan Chen, Hung-Shin Lee, Chih-Cheng Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01556
Pdf URL: https://arxiv.org/pdf/2409.01556
Copy Paste: [[2409.01556]] Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture(https://arxiv.org/abs/2409.01556)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This study introduces a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge, with a specific focus on Hakka culture as a case study. Leveraging Bloom's Taxonomy, the study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. This benchmark extends beyond traditional single-dimensional evaluations by providing a deeper analysis of LLMs' abilities to handle culturally specific content, ranging from basic recall of facts to higher-order cognitive tasks such as creative synthesis. Additionally, the study integrates Retrieval-Augmented Generation (RAG) technology to address the challenges of minority cultural knowledge representation in LLMs, demonstrating how RAG enhances the models' performance by dynamically incorporating relevant external information. The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge. However, the findings also reveal the limitations of RAG in creative tasks, underscoring the need for further optimization. This benchmark provides a robust tool for evaluating and comparing LLMs in culturally diverse contexts, offering valuable insights for future research and development in AI-driven cultural knowledge preservation and dissemination.
摘要：本研究引入了一个全面的基准，旨在评估大型语言模型 (LLM) 在理解和处理文化知识方面的表现，特别关注客家文化作为案例研究。该研究利用布鲁姆分类法，开发了一个多维框架，系统地评估六个认知领域的 LLM：记忆、理解、应用、分析、评估和创造。该基准超越了传统的单维评估，更深入地分析了 LLM 处理特定文化内容的能力，从基本的事实回忆到创造性综合等高阶认知任务。此外，该研究整合了检索增强生成 (RAG) 技术来解决 LLM 中少数民族文化知识表示的挑战，展示了 RAG 如何通过动态整合相关外部信息来提高模型的性能。结果强调了 RAG 在提高所有认知领域的准确性方面的有效性，特别是在需要精确检索和应用文化知识的任务中。然而，研究结果也揭示了 RAG 在创造性任务中的局限性，强调需要进一步优化。该基准为评估和比较文化多元化背景下的 LLM 提供了强大的工具，为未来人工智能驱动的文化知识保存和传播的研究和开发提供了宝贵的见解。

Title: An Implementation of Werewolf Agent That does not Truly Trust LLMs

Authors: Takehiro Sato, Shintaro Ozaki, Daisaku Yokoyama
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01575
Pdf URL: https://arxiv.org/pdf/2409.01575
Copy Paste: [[2409.01575]] An Implementation of Werewolf Agent That does not Truly Trust LLMs(https://arxiv.org/abs/2409.01575)
Keywords: language model, llm, agent
Abstract: Werewolf is an incomplete information game, which poses several challenges when creating a computer agent as a player, given the agent's lack of understanding of the situation and lack of individuality in its utterances (e.g., computer agents are not capable of characterful utterance or situational lying). We propose a werewolf agent that solves some of those difficulties by combining a Large Language Model (LLM) and a rule-based algorithm. In particular, our agent uses a rule-based algorithm to select an output either from an LLM or from a template prepared beforehand, based on the results of analyzing conversation history using an LLM. It allows the agent to refute in specific situations, identify when to end the conversation, and behave with a persona. As a result, this approach mitigated conversational inconsistencies and facilitated logical utterance. We also conducted a qualitative evaluation, which resulted in our agent being perceived as more human-like compared to an unmodified LLM. The agent is freely available to help advance research on the Werewolf game.
摘要：狼人杀是一种信息不完全的游戏，由于对情况和话语个性缺乏了解（例如，计算机代理无法进行个性话语或情境撒谎），因此在创建计算机代理作为玩家时面临一些挑战。我们提出了一种狼人代理，通过结合大型语言模型 (LLM) 和基于规则的算法来解决其中的一些困难。具体来说，我们的代理使用基于规则的算法，根据使用 LLM 分析对话历史的结果，从 LLM 或预先准备的模板中选择输出。它允许代理在特定情况下反驳，确定何时结束对话，并以角色的方式行事。这种方法减轻了对话不一致的问题，并因此促进了逻辑话语。我们还进行了定性评估，结果显示我们的代理与未修改的 LLM 相比更像人类。该代理可免费使用，以促进狼人杀游戏领域的研究。
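Illustrative sketch (not the authors' rules): the entry above describes a rule-based step that chooses between an LLM-generated utterance and a prepared template, based on an LLM's analysis of the conversation history. The analysis keys `accused` and `conversation_stalled`, the template names, and `choose_utterance` are all hypothetical.

```python
def choose_utterance(analysis, llm_reply, templates):
    """Pick the agent's next utterance with simple hand-written rules.

    analysis: dict of flags produced by analyzing the conversation history.
    llm_reply: free-form utterance proposed by the language model.
    templates: dict of prepared utterances keyed by situation.
    """
    # Rule 1: when the agent has been accused, use the prepared refutation.
    if analysis.get("accused"):
        return templates["refute"]
    # Rule 2: when the discussion is going nowhere, move to end it.
    if analysis.get("conversation_stalled"):
        return templates["end"]
    # Default: no rule fired, so the LLM's own utterance is used.
    return llm_reply
```

The rules act as a guard around the LLM output, which matches the title's framing of not fully trusting the model.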

Title: AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models

Authors: Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01579
Pdf URL: https://arxiv.org/pdf/2409.01579
Copy Paste: [[2409.01579]] AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2409.01579)
Keywords: language model
Abstract: Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries like multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate and then construct triplets of the query, retrieved documents, and its compression rate. Then, we use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational Multi-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.
摘要：检索到的文档中含有噪声会阻碍 RAG 检测答案线索，并使推理过程变得缓慢且昂贵。因此，上下文压缩对于提高其准确性和效率是必不可少的。现有的上下文压缩方法使用提取或生成模型来保留与查询最相关的句子，或应用信息瓶颈理论来保留足够的信息。然而，这些方法可能面临过度压缩或计算成本高的问题。我们观察到检索器经常将相关文档排在顶部，但由于查询复杂性和检索质量的影响，回答查询所需的确切文档数量是不确定的：像多跳问题这样的复杂查询可能比简单的查询需要保留更多的文档，而低质量的检索可能需要依赖更多的文档来生成准确的输出。因此，确定所需的最小文档数量（压缩率）仍然是 RAG 面临的挑战。在本文中，我们介绍了 AdaComp，一种低成本的提取上下文压缩方法，它根据查询复杂性和检索质量自适应地确定压缩率。具体来说，我们首先将 RAG 系统回答当前查询所需的最少 top-k 个文档注释为压缩率，然后构建查询、检索到的文档及其压缩率的三元组。然后，我们使用这个三元组数据集来训练压缩率预测器。在三个 QA 数据集和一个对话式多文档 QA 数据集上进行的实验表明，AdaComp 显著降低了推理成本，同时保持了与未压缩模型几乎相同的性能，实现了效率和性能之间的平衡。
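Illustrative sketch (not the authors' annotation code): the entry above describes labeling each query with the minimum number of top-ranked documents the RAG system needs to answer it, then forming (query, documents, rate) triplets for training a predictor. The sketch assumes a checker callable `answers_correctly`, assumes that k = 0 is allowed, and falls back to all documents when nothing works; these choices and the function names are hypothetical.

```python
def min_top_k(query, ranked_docs, answers_correctly):
    """Smallest k such that the top-k documents suffice to answer the query.

    answers_correctly: callable (query, docs) -> bool, e.g. a wrapper that
    runs the RAG system and compares its output with a gold answer.
    """
    for k in range(len(ranked_docs) + 1):
        if answers_correctly(query, ranked_docs[:k]):
            return k
    # Assumption: if no prefix works, keep every retrieved document.
    return len(ranked_docs)


def build_triplet(query, ranked_docs, answers_correctly):
    """Build one (query, documents, compression rate) training example."""
    return query, ranked_docs, min_top_k(query, ranked_docs, answers_correctly)
```

A predictor trained on such triplets would then choose k at inference time without running the checker.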

Title: Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models

Authors: Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01584
Pdf URL: https://arxiv.org/pdf/2409.01584
Copy Paste: [[2409.01584]] Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models(https://arxiv.org/abs/2409.01584)
Keywords: language model, llm
Abstract: As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and the demand for explanations generated by LVLMs is expected to grow. However, pre-training of the Vision Encoder and the integrated training of LLMs with the Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can fully realize their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks that create datasets using machine translation carry cultural differences and biases, which remain problematic when they are used as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset, which takes into account nuances and country-specific phrases, was then used to evaluate the explanation-generation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English than in English. In addition, it was observed that LVLMs struggle to effectively manage the knowledge learned from English data.
摘要：随着大规模视觉语言模型 (LVLM) 性能的提高，它们越来越能够以多种语言做出响应，并且人们预计对 LVLM 生成的解释的需求将会增长。然而，Vision Encoder 的预训练和 LLM 与 Vision Encoder 的集成训练主要使用英语训练数据进行，因此尚不确定 LVLM 在生成非英语语言解释时是否能够完全发挥其潜力。此外，使用机器翻译创建数据集的多语言 QA 基准存在文化差异和偏见，这仍然是评估任务中的问题。为了应对这些挑战，本研究创建了一个多语言扩展数据集，而不依赖机器翻译。然后使用这个考虑到细微差别和特定国家短语的数据集来评估 LVLM 的生成解释能力。此外，本研究还研究了资源丰富的英语中的指令调整是否会提高其他语言的性能。我们的研究结果表明，与英语相比，LVLM 在非英语语言中的表现更差。此外，研究还发现，LVLM 很难有效地管理从英语数据中学到的知识。

Title: Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01586
Pdf URL: https://arxiv.org/pdf/2409.01586
Copy Paste: [[2409.01586]] Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation(https://arxiv.org/abs/2409.01586)
Keywords: language model
Abstract: The harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for fine-tuning-as-a-service of large language models. While existing defenses \citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that harmful perturbation over the model weights appears to be the root cause of the broken alignment that results from harmful fine-tuning. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage's optimization. The regularizer ensures that the model's harmful loss reduction before/after simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at this https URL.
摘要：有害微调问题 \citep{qi2023fine} 对大型语言模型的微调即服务构成了严重的安全隐患。虽然已经提出了现有的防御措施 \citep{huang2024vaccine,rosati2024representation} 来缓解该问题，但它们的性能仍然远远不能令人满意，问题的根源也尚未被完全揭示。本文我们首次在文献中表明，模型权重上的有害扰动应该是有害微调破坏对齐的根本原因。为了减轻有害扰动的负面影响，我们提出了一个对齐阶段解决方案，称为 Booster。从技术上讲，除了原始的对齐损失之外，我们还在对齐阶段的优化中附加了一个损失正则化器。正则化器确保模型在模拟有害扰动之前/之后的有害损失减少被减弱，从而减轻后续的微调风险。实证结果表明，Booster 可以有效降低微调模型的有害分数，同时保持下游任务的性能。我们的代码可在此 https URL 处获取。

Title: From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning

Authors: Wei Chen, Zhen Huang, Liang Xie, Binbin Lin, Houqiang Li, Le Lu, Xinmei Tian, Deng Cai, Yonggang Zhang, Wenxiao Wan, Xu Shen, Jieping Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01658
Pdf URL: https://arxiv.org/pdf/2409.01658
Copy Paste: [[2409.01658]] From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning(https://arxiv.org/abs/2409.01658)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially provided the correct answer. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue, but this typically degrades LLMs' general capability. To address the challenge, we propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective. Specifically, SPT first reveals and verifies a small percentage (<5%) of the basic modules, which significantly affect a particular behavior of LLMs, i.e., sycophancy. Subsequently, SPT fine-tunes only these identified modules while freezing the rest. To verify the effectiveness of the proposed SPT, we conduct comprehensive experiments, demonstrating that SPT significantly mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT introduces limited or even no side effects on the general capability of LLMs. Our results shed light on how to precisely, effectively, and efficiently explain and improve the targeted ability of LLMs.
摘要：大型语言模型 (LLM) 倾向于优先遵循用户提示，而不是提供真实的响应，从而导致谄媚问题。当受到用户的质疑时，LLM 倾向于承认错误并提供不准确的响应，即使它们最初提供了正确的答案。最近的研究建议采用监督微调 (SFT) 来缓解谄媚问题，而这通常会导致 LLM 的一般能力下降。为了应对这一挑战，我们提出了一种新颖的监督精确调整 (SPT)，其中感兴趣区域模块针对给定目标进行调整。具体而言，SPT 首先揭示并验证一小部分 (<5%) 的基本模块，这些模块会显著影响 LLM 的特定行为，即谄媚。随后，SPT 仅微调这些已识别的模块，同时冻结其余模块。为了验证所提出的 SPT 的有效性，我们进行了全面的实验，结果表明 SPT 显著缓解了 LLM 的谄媚问题（甚至比 SFT 更好）。此外，SPT 对 LLM 的一般能力几乎没有副作用。我们的研究结果为如何准确、有效、高效地解释和提高 LLM 的目标能力提供了启示。

Title: Interpreting and Improving Large Language Models in Arithmetic Calculation

Authors: Wei Zhang, Chaoqun Wan, Yonggang Zhang, Yiu-ming Cheung, Xinmei Tian, Xu Shen, Jieping Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01659
Pdf URL: https://arxiv.org/pdf/2409.01659
Copy Paste: [[2409.01659]] Interpreting and Improving Large Language Models in Arithmetic Calculation(https://arxiv.org/abs/2409.01659)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated remarkable potential across numerous applications and have shown an emergent ability to tackle complex reasoning tasks, such as mathematical computations. However, even for the simplest arithmetic calculations, the intrinsic mechanisms behind LLMs remain mysterious, making it challenging to ensure reliability. In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations. Through comprehensive experiments, we find that LLMs frequently involve a small fraction (< 5%) of attention heads, which play a pivotal role in focusing on operands and operators during calculation processes. Subsequently, the information from these operands is processed through multi-layer perceptrons (MLPs), progressively leading to the final solution. These pivotal heads/MLPs, though identified on a specific dataset, exhibit transferability across different datasets and even distinct tasks. This insight prompted us to investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs' computational performance. We empirically find that such precise tuning can yield notable enhancements on mathematical prowess, without compromising the performance on non-mathematical tasks. Our work serves as a preliminary exploration into the arithmetic calculation abilities inherent in LLMs, laying a solid foundation to reveal more intricate mathematical tasks.
摘要：大型语言模型 (LLM) 已在众多应用中展现出非凡的潜力，并展现出处理复杂推理任务（例如数学计算）的新兴能力。然而，即使对于最简单的算术计算，LLM 背后的内在机制仍然神秘莫测，因此很难确保可靠性。在这项工作中，我们深入研究了 LLM 执行计算的具体机制。通过全面的实验，我们发现 LLM 通常涉及一小部分（< 5%）的注意力头，它们在计算过程中对关注操作数和运算符起着关键作用。随后，来自这些操作数的信息通过多层感知器 (MLP) 进行处理，逐步得到最终解决方案。这些关键的头/MLP 虽然是在特定数据集上识别的，但它们在不同数据集甚至不同任务之间表现出可迁移性。这一见解促使我们研究选择性微调这些基本头/MLP 以提高 LLM 计算性能的潜在好处。我们通过实验发现，这种精确的调整可以显著提高数学能力，而不会损害非数学任务的表现。我们的工作是对 LLM 固有的算术计算能力的初步探索，为揭示更复杂的数学任务奠定了坚实的基础。

Title: In Defense of RAG in the Era of Long-Context Language Models

Authors: Tan Yu, Anbang Xu, Rama Akkiraju
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01666
Pdf URL: https://arxiv.org/pdf/2409.01666
Copy Paste: [[2409.01666]] In Defense of RAG in the Era of Long-Context Language Models(https://arxiv.org/abs/2409.01666)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: By overcoming the limited context windows of early-generation LLMs, retrieval-augmented generation (RAG) has been a reliable solution for context-based answer generation. Recently, the emergence of long-context LLMs has allowed models to incorporate much longer text sequences, making RAG less attractive. Recent studies show that long-context LLMs significantly outperform RAG in long-context applications. Unlike existing work favoring long-context LLMs over RAG, we argue that an extremely long context in LLMs suffers from a diminished focus on relevant information and leads to potential degradation in answer quality. This paper revisits RAG in long-context answer generation. We propose an order-preserve retrieval-augmented generation (OP-RAG) mechanism, which significantly improves the performance of RAG for long-context question-answer applications. With OP-RAG, as the number of retrieved chunks increases, the answer quality initially rises and then declines, forming an inverted U-shaped curve. There exist sweet spots where OP-RAG can achieve higher answer quality with far fewer tokens than a long-context LLM taking the whole context as input. Extensive experiments on public benchmark data demonstrate the superiority of our OP-RAG.
摘要：检索增强生成 (RAG) 克服了早期生成 LLM 中有限的上下文限制，在过去一直是基于上下文的答案生成的可靠解决方案。最近，长上下文 LLM 的出现使模型能够合并更长的文本序列，使得 RAG 的吸引力降低。最近的研究表明，在长上下文应用中，长上下文 LLM 的表现明显优于 RAG。与现有偏爱长上下文 LLM 而不是 RAG 的研究不同，我们认为 LLM 中极长的上下文会降低对相关信息的关注，并导致答案质量的潜在下降。本文重新审视了长上下文答案生成中的 RAG。我们提出了一种保序检索增强生成 (OP-RAG) 机制，该机制显著提高了 RAG 在长上下文问答应用中的性能。使用 OP-RAG 时，随着检索到的块数量的增加，答案质量最初上升，然后下降，形成倒 U 型曲线。存在最佳点，即 OP-RAG 可以用比以整个上下文为输入的长上下文 LLM 少得多的标记实现更高的答案质量。在公共基准上进行的大量实验证明了我们的 OP-RAG 的优越性。
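
The abstract above does not spell out what "order-preserve" means. The sketch below illustrates one natural reading, which is an assumption here: chunks are selected by relevance score but placed in the prompt in their original document order rather than in score order. The function names and the scores are hypothetical.

```python
from typing import List, Sequence


def op_rag_select(chunk_scores: Sequence[float], top_k: int) -> List[int]:
    """Return indices of the top_k highest-scoring chunks in document order.

    chunk_scores[i] is the relevance score of the i-th chunk of the source
    document, so the index doubles as the chunk's original position.
    """
    by_score = sorted(
        range(len(chunk_scores)), key=lambda i: chunk_scores[i], reverse=True
    )
    # Plain RAG would use by_score[:top_k] as is; here the selected chunks
    # are re-sorted by position so the original order is preserved.
    return sorted(by_score[:top_k])


def build_context(chunks: Sequence[str], chunk_scores: Sequence[float], top_k: int) -> str:
    """Concatenate the selected chunks in their original order."""
    return "\n".join(chunks[i] for i in op_rag_select(chunk_scores, top_k))
```

Sweeping `top_k` with such a helper is one way to look for the inverted U-shaped quality curve the abstract mentions.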

Title: LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection

Authors: Yifeng Wang, Zhouhong Gu, Siwei Zhang, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01787
Pdf URL: https://arxiv.org/pdf/2409.01787
Copy Paste: [[2409.01787]] LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection(https://arxiv.org/abs/2409.01787)
Keywords: language model, llm, prompt
Abstract: Explainable fake news detection predicts the authenticity of news items with annotated explanations. Today, Large Language Models (LLMs) are known for their powerful natural language understanding and explanation generation abilities. However, applying LLMs to explainable fake news detection still faces two main challenges. Firstly, fake news appears reasonable and could easily mislead LLMs, leaving them unable to understand the complex news-faking process. Secondly, utilizing LLMs for this task would generate both correct and incorrect explanations, which necessitates abundant human labor in the loop. In this paper, we propose LLM-GAN, a novel framework that utilizes prompting mechanisms to enable an LLM to act as both Generator and Detector for realistic fake news generation and detection. Our results demonstrate LLM-GAN's effectiveness in both prediction performance and explanation quality. We further showcase the integration of LLM-GAN into a cloud-native AI platform to provide a better fake news detection service in the cloud.
摘要：可解释的假新闻检测通过带注释的解释来预测新闻的真实性。如今，大型语言模型 (LLM) 以其强大的自然语言理解和解释生成能力而闻名。然而，提出可解释的假新闻检测的 LLM 仍然面临两个主要挑战。首先，假新闻看似合理，很容易误导 LLM，使它们无法理解复杂的新闻造假过程。其次，使用 LLM 执行此任务会生成正确和不正确的解释，这需要大量的人工投入。在本文中，我们提出了 LLM-GAN，这是一个新颖的框架，它利用提示机制使 LLM 成为生成器和检测器，并用于真实的假新闻生成和检测。我们的结果证明了 LLM-GAN 在预测性能和解释质量方面的有效性。我们进一步展示了 LLM-GAN 与云原生 AI 平台的集成，以在云端提供更好的假新闻检测服务。

Title: Training on the Benchmark Is Not All You Need

Authors: Shiwen Ni, Xiangtao Kong, Chengming Li, Xiping Hu, Ruifeng Xu, Jia Zhu, Min Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01790
Pdf URL: https://arxiv.org/pdf/2409.01790
Copy Paste: [[2409.01790]] Training on the Benchmark Is Not All You Need(https://arxiv.org/abs/2409.01790)
Keywords: language model, llm
Abstract: The success of Large Language Models (LLMs) relies heavily on the huge amount of data learned in the pre-training phase. The opacity of the pre-training process and the training data makes the results of many benchmark tests unreliable. If a model has been trained on a benchmark test set, it can seriously hinder the healthy development of the field. To test the capabilities of large language models automatically and efficiently, numerous mainstream benchmarks adopt a multiple-choice format. As the swapping of the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model's log probability distribution over the derived data sets. If the maximum of the set of log probabilities is also an outlier, it indicates that the data is leaked. Our method is able to work under black-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark, and we find that the Qwen family of LLMs has the highest degree of data leakage.
摘要：大型语言模型（LLM）的成功很大程度上依赖于在预训练阶段学习到的海量预训练数据。预训练过程和训练数据的不透明性导致很多基准测试的结果变得不可靠。如果任何模型已经在基准测试集上训练过，都会严重阻碍该领域的健康发展。为了自动化和高效地测试大型语言模型的能力，很多主流的基准测试都采用了多项选择题的形式。由于多项选择题选项内容的交换不会影响问题本身的含义，我们基于此性质提出了一种简单有效的数据泄漏检测方法。具体而言，我们对数据中选项的内容进行混洗以生成相应的派生数据集，然后基于模型在派生数据集上的对数概率分布来检测数据泄漏。如果对数概率集合中存在最大值且为异常值，则表明数据被泄漏了。我们的方法可以在黑盒条件下工作，无需访问模型训练数据或权重，可以有效地识别模型预训练数据中基准测试集中的数据泄漏，包括正常场景和选项可能被有意或无意打乱的复杂场景。通过基于两个 LLM 和基准设计的实验，我们证明了我们方法的有效性。此外，我们在四个基准数据集上评估了 31 个主流开源 LLM 的数据泄漏程度，并给出了每个基准的泄漏 LLM 排名，我们发现 Qwen 系列 LLM 的数据泄漏程度最高。
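
The detection idea in the abstract above (shuffle the options, score every ordering, flag a maximum that is also an outlier) can be sketched as follows. This is a rough illustration under stated assumptions: `log_prob` is a hypothetical callable standing in for a real model's sequence log-probability, and the z-score rule with `z_threshold=3.0` is a simple outlier test chosen for the sketch, not necessarily the criterion used in the paper.

```python
import itertools
import statistics
from typing import Callable, List, Sequence


def option_orderings(options: Sequence[str]) -> List[List[str]]:
    """All orderings of the option contents (24 for four options)."""
    return [list(p) for p in itertools.permutations(options)]


def looks_leaked(
    question: str,
    options: Sequence[str],
    log_prob: Callable[[str, List[str]], float],
    z_threshold: float = 3.0,
) -> bool:
    """Flag an item when one ordering scores far above all the others.

    Needs at least two options so that there are other orderings to
    compare against.
    """
    scores = [log_prob(question, order) for order in option_orderings(options)]
    best = max(scores)
    rest = list(scores)
    rest.remove(best)  # compare the maximum against the remaining orderings
    mean = statistics.mean(rest)
    spread = statistics.pstdev(rest)
    if spread == 0:
        return best > mean
    return (best - mean) / spread > z_threshold
```

The intuition is that a model which memorized the benchmark item assigns one specific ordering a much higher log probability than the rest, while an unexposed model scores all orderings similarly.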

Title: Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations

Authors: Ike Ebubechukwu, Johane Takeuchi, Antonello Ceravola, Frank Joublin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01808
Pdf URL: https://arxiv.org/pdf/2409.01808
Copy Paste: [[2409.01808]] Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations(https://arxiv.org/abs/2409.01808)
Keywords: gpt, chat
Abstract: As dialogue systems and chatbots increasingly integrate into everyday interactions, the need for efficient and accurate evaluation methods becomes paramount. This study explores the comparative performance of human and AI assessments across a range of dialogue scenarios, focusing on seven key performance indicators (KPIs): Coherence, Innovation, Concreteness, Goal Contribution, Commonsense Contradiction, Incorrect Fact, and Redundancy. Utilizing the GPT-4o API, we generated a diverse dataset of conversations and conducted a two-part experimental analysis. In Experiment 1, we evaluated multi-party conversations on Coherence, Innovation, Concreteness, and Goal Contribution, revealing that GPT models align closely with human judgments. Notably, both human and AI evaluators exhibited a tendency towards binary judgment rather than linear scaling, highlighting a shared challenge in these assessments. Experiment 2 extended the work of Finch et al. (2023) by focusing on dyadic dialogues and assessing Commonsense Contradiction, Incorrect Fact, and Redundancy. The results indicate that while GPT-4o demonstrates strong performance in maintaining factual accuracy and commonsense reasoning, it still struggles with reducing redundancy and self-contradiction. Our findings underscore the potential of GPT models to closely replicate human evaluation in dialogue systems, while also pointing to areas for improvement. This research offers valuable insights for advancing the development and implementation of more refined dialogue evaluation methodologies, contributing to the evolution of more effective and human-like AI communication tools.
摘要：随着对话系统和聊天机器人越来越多地融入日常互动，对高效和准确的评估方法的需求变得至关重要。本研究探讨了人类和人工智能评估在一系列对话场景中的比较表现，重点关注七个关键绩效指标 (KPI)：连贯性、创新性、具体性、目标贡献、常识矛盾、错误事实和冗余。利用 GPT-4o API，我们生成了一个多样化的对话数据集并进行了两部分的实验分析。在实验 1 中，我们评估了多方对话的连贯性、创新性、具体性和目标贡献，结果显示 GPT 模型与人类判断紧密相关。值得注意的是，人类和人工智能评估者都表现出二元判断而不是线性缩放的倾向，这突显了这些评估中共同的挑战。实验 2 扩展了 Finch 等人 (2023) 的工作，重点关注二元对话并评估常识矛盾、错误事实和冗余。结果表明，尽管 GPT-4o 在保持事实准确性和常识推理方面表现出色，但它在减少冗余和自相矛盾方面仍然举步维艰。我们的研究结果强调了 GPT 模型在对话系统中紧密复制人类评估的潜力，同时也指出了需要改进的地方。这项研究为推进更精细的对话评估方法的开发和实施提供了宝贵的见解，有助于开发更有效、更像人类的人工智能通信工具。

Title: AgentRE: An Agent-Based Framework for Navigating Complex Information Landscapes in Relation Extraction

Authors: Yuchen Shi, Guochao Jiang, Tian Qiu, Deqing Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01854
Pdf URL: https://arxiv.org/pdf/2409.01854
Copy Paste: [[2409.01854]] AgentRE: An Agent-Based Framework for Navigating Complex Information Landscapes in Relation Extraction(https://arxiv.org/abs/2409.01854)
Keywords: language model, llm, agent
Abstract: Relation extraction (RE) in complex scenarios faces challenges such as diverse relation types and ambiguous relations between entities within a single sentence, leading to the poor performance of pure "text-in, text-out" language models (LMs). To address these challenges, in this paper, we propose an agent-based RE framework, namely AgentRE, which fully leverages the potential of large language models (LLMs), including memory, retrieval and reflection, to achieve RE in complex scenarios. Specifically, three major modules are built into AgentRE, serving as tools that help the agent acquire and process various useful information, thereby obtaining improved RE performance. Our extensive experimental results on two datasets in English and Chinese demonstrate AgentRE's superior performance, especially in low-resource scenarios. Additionally, the trajectories generated by AgentRE can be refined to construct a high-quality training dataset incorporating different reasoning methods, which can be used to fine-tune smaller models. Code is available at this https URL.
摘要：复杂场景中的关系抽取 (RE) 面临着诸如关系类型多样、单个句子内实体之间关系模糊等挑战，导致纯“文本输入、文本输出”语言模型 (LM) 性能不佳。为了应对这些挑战，本文提出了一个基于代理的 RE 框架，即 AgentRE，它充分利用包括记忆、检索和反射在内的大型语言模型 (LLM) 的潜力，在复杂场景中实现 RE。具体来说，AgentRE 中内置了三个主要模块，作为帮助代理获取和处理各种有用信息的工具，从而获得更高的 RE 性能。我们在英文和中文两个数据集上的大量实验结果表明我们的 AgentRE 性能优越，尤其是在资源匮乏的场景中。此外，可以细化由 AgentRE 生成的轨迹以构建包含不同推理方法的高质量训练数据集，这些数据集可用于微调较小的模型。代码可在此 https URL 上获得。

Title: Investigating Expert-in-the-Loop LLM Discourse Patterns for Ancient Intertextual Analysis

Authors: Ray Umphrey, Jesse Roberts, Lindsey Roberts
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01882
Pdf URL: https://arxiv.org/pdf/2409.01882
Copy Paste: [[2409.01882]] Investigating Expert-in-the-Loop LLM Discourse Patterns for Ancient Intertextual Analysis(https://arxiv.org/abs/2409.01882)
Keywords: language model, llm
Abstract: This study explores the potential of large language models (LLMs) for identifying and examining intertextual relationships within biblical, Koine Greek texts. By evaluating the performance of LLMs on various intertextuality scenarios, the study demonstrates that these models can detect direct quotations, allusions, and echoes between texts. The LLM's ability to generate novel intertextual observations and connections highlights its potential to uncover new insights. However, the model also struggles with long query passages and with the inclusion of false intertextual dependencies, emphasizing the importance of expert evaluation. The expert-in-the-loop methodology presented offers a scalable approach for intertextual research into the complex web of intertextuality within and beyond the biblical corpus.
摘要：本研究探索了大型语言模型 (LLM) 在识别和检查圣经通用希腊语文本中的互文关系方面的潜力。通过评估 LLM 在各种互文性场景中的表现，该研究表明这些模型可以检测文本之间的直接引用、典故和呼应。LLM 能够生成新颖的互文性观察和联系，凸显了其发现新见解的潜力。然而，该模型还难以处理长查询段落和包含错误的互文性依赖关系，这强调了专家评估的重要性。所提出的专家在环方法提供了一种可扩展的方法，用于对圣经语料库内外复杂的互文性网络进行互文性研究。

Title: What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

Authors: Zhi Chen, Qiguang Chen, Libo Qin, Qipeng Guo, Haijun Lv, Yicheng Zou, Wanxiang Che, Hang Yan, Kai Chen, Dahua Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.01893
Pdf URL: https://arxiv.org/pdf/2409.01893
Copy Paste: [[2409.01893]] What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices(https://arxiv.org/abs/2409.01893)
Keywords: language model, llm, long context, agent
Abstract: Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: this https URL.
摘要：具有扩展上下文窗口的大型语言模型 (LLM) 的最新进展显著改善了信息提取、问答和复杂规划场景等任务。为了在长上下文任务中取得成功，已经做了大量工作来通过合成数据增强模型的长上下文能力。现有方法通常利用 Self-Instruct 框架来生成指令调整数据，以更好地提高长上下文能力。然而，我们的初步实验表明，生成的样本中不到 35% 是多跳的，超过 40% 的质量较差，限制了全面理解和进一步研究。为了提高合成数据的质量，我们提出了多智能体交互式多跳生成 (MIMG) 框架，其中包括质量验证代理、单跳问题生成代理、多问题采样策略和多跳问题合并代理。该框架提高了数据质量，高质量、多跳和多样化数据的比例超过 85%。此外，我们通过对各种模型的大量实验系统地研究了文档选择、问题合并和验证技术的策略。我们的研究结果表明，我们合成的高质量长上下文指令数据显著提高了模型性能，甚至超越了在大量人工注释数据上训练的模型。我们的代码可在以下网址获取：此 https URL。

Title: Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation

Authors: Jack Krolik, Herprit Mahal, Feroz Ahmad, Gaurav Trivedi, Bahador Saket
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.01941
Pdf URL: https://arxiv.org/pdf/2409.01941
Copy Paste: [[2409.01941]] Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation(https://arxiv.org/abs/2409.01941)
Keywords: language model, llm
Abstract: This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q&A) systems, a crucial form of Natural Language Processing. Traditionally, human evaluation has been indispensable for assessing the quality of these responses. However, manual evaluation by medical professionals is time-consuming and costly. Our study examines whether LLMs can reliably replicate human evaluations by using questions derived from patient data, thereby saving valuable time for medical experts. While the findings suggest promising results, further research is needed to address more specific or complex questions that were beyond the scope of this initial investigation.
摘要：本文探讨了使用大型语言模型 (LLM) 自动评估医学问答 (Q&A) 系统中的响应的潜力，这是自然语言处理的一种重要形式。传统上，人工评估对于评估这些响应的质量是必不可少的。然而，医疗专业人员的人工评估既费时又费钱。我们的研究检查了 LLM 是否可以通过使用来自患者数据的问题可靠地复制人工评估，从而为医疗专家节省宝贵的时间。虽然研究结果表明结果很有希望，但仍需要进一步研究来解决超出本次初步调查范围的更具体或更复杂的问题。

Title: FuzzCoder: Byte-level Fuzzing Test via Large Language Model

Authors: Liqun Yang, Jian Yang, Chaoren Wei, Guanglin Niu, Ge Zhang, Yunli Wang, Linzheng Chai, Wanxu Xia, Hongcheng Guo, Shun Zhang, Jiaheng Liu, Yuwei Yin, Junran Peng, Jiaxin Ma, Liang Sun, Zhoujun Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.01944
Pdf URL: https://arxiv.org/pdf/2409.01944
Copy Paste: [[2409.01944]] FuzzCoder: Byte-level Fuzzing Test via Large Language Model(https://arxiv.org/abs/2409.01944)
Keywords: language model, llm
Abstract: Fuzzing is an important dynamic program analysis technique designed for finding vulnerabilities in complex software. Fuzzing involves presenting a target program with crafted malicious input to cause crashes, buffer overflows, memory errors, and exceptions. Crafting malicious inputs in an efficient manner is a difficult open problem and the best approaches often apply uniform random mutations to pre-existing valid inputs. In this work, we propose to adopt fine-tuned large language models (FuzzCoder) to learn patterns in the input files from successful attacks to guide future fuzzing explorations. Specifically, we develop a framework to leverage code LLMs to guide the mutation process of inputs in fuzzing. The mutation process is formulated as sequence-to-sequence modeling, where the LLM receives a sequence of bytes and then outputs the mutated byte sequence. FuzzCoder is fine-tuned on the created instruction dataset (Fuzz-Instruct), where the successful fuzzing history is collected from the heuristic fuzzing tool. FuzzCoder can predict mutation locations and strategies in input files to trigger abnormal behaviors of the program. Experimental results show that FuzzCoder based on AFL (American Fuzzy Lop) gains significant improvements in terms of effective proportion of mutation (EPM) and number of crashes (NC) for various input formats including ELF, JPG, MP3, and XML.
摘要：模糊测试是一种重要的动态程序分析技术，旨在查找复杂软件中的漏洞。模糊测试涉及向目标程序提供精心设计的恶意输入，以导致崩溃、缓冲区溢出、内存错误和异常。以有效的方式制作恶意输入是一个难以解决的开放性问题，最好的方法通常是将均匀的随机突变应用于预先存在的有效输入。在这项工作中，我们建议采用经过微调的大型语言模型 (FuzzCoder) 从成功的攻击中学习输入文件中的模式，以指导未来的模糊测试探索。具体来说，我们开发了一个框架来利用代码 LLM 来指导模糊测试中输入的突变过程。突变过程被表述为序列到序列建模，其中 LLM 接收字节序列，然后输出突变的字节序列。FuzzCoder 在创建的指令数据集 (Fuzz-Instruct) 上进行微调，其中从启发式模糊测试工具收集成功的模糊测试历史记录。FuzzCoder 可以预测输入文件中的突变位置和策略位置以触发程序的异常行为。实验结果表明，基于AFL（American Fuzzy Lop）的FuzzCoder对于ELF、JPG、MP3、XML等多种输入格式，在有效变异比例（EPM）和崩溃次数（NC）方面获得了显著的提升。
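
To make "predicting mutation locations and strategies" in the abstract above concrete, the sketch below applies a list of (position, strategy) pairs to a byte sequence. The plan would come from the fine-tuned model in the described setup; here it is written by hand. The strategy names `bitflip`, `increment`, and `set_max` are hypothetical labels for this illustration and are not the paper's or AFL's actual operator set.

```python
from typing import Iterable, Tuple


def apply_mutations(data: bytes, plan: Iterable[Tuple[int, str]]) -> bytes:
    """Apply (position, strategy) mutations to a copy of the input bytes."""
    buf = bytearray(data)
    for pos, strategy in plan:
        if not 0 <= pos < len(buf):
            continue  # skip positions that fall outside the input
        if strategy == "bitflip":
            buf[pos] ^= 0x01  # flip the lowest bit
        elif strategy == "increment":
            buf[pos] = (buf[pos] + 1) % 256  # wrap around at one byte
        elif strategy == "set_max":
            buf[pos] = 0xFF
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return bytes(buf)
```

A fuzzing loop would feed each mutated input to the target program and record whether it triggers a crash, which is the signal the abstract's EPM and NC metrics are built on.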

Title: BEAVER: An Enterprise Benchmark for Text-to-SQL

Authors: Peter Baile Chen, Fabian Wenz, Yi Zhang, Moe Kayali, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2409.02038
Pdf URL: https://arxiv.org/pdf/2409.02038
Copy Paste: [[2409.02038]] BEAVER: An Enterprise Benchmark for Text-to-SQL(https://arxiv.org/abs/2409.02038)
Keywords: llm, prompt
Abstract: Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the poor performance is largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the "dark web", (2) schemas of enterprise tables are more complex than the schemas in public data, which makes the SQL-generation task inherently harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset BEAVER, sourced from real enterprise data warehouses together with natural language queries and their correct SQL statements which we collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will help future researchers build more sophisticated text-to-SQL systems which can do better on this important class of data.
摘要：现有的文本到 SQL 基准测试主要使用来自网络的公开可用表以及包含问题和 SQL 语句对的人工生成的测试构建而成。它们通常显示出非常好的结果，并让人们认为 LLM 在文本到 SQL 任务中是有效的。在本文中，我们将现成的 LLM 应用于包含企业数据仓库数据的基准测试。在这种环境下，即使使用标准的提示工程和 RAG 技术，LLM 的性能也很差。正如我们将展示的那样，性能不佳的原因主要是由于三个特点：(1) 公共 LLM 无法在企业数据仓库上进行训练，因为它们主要在“暗网”中；(2) 企业表的模式比公共数据中的模式更复杂，这导致 SQL 生成任务天生就更难；(3) 面向业务的问题通常更复杂，需要跨多个表和聚合进行连接。因此，我们提出了一个新的数据集 BEAVER，它来自真实的企业数据仓库以及我们从实际用户历史记录中收集的自然语言查询及其正确的 SQL 语句。我们利用最近的 LLM 评估了此数据集，并证明了它们在此任务上的表现不佳。我们希望此数据集能够帮助未来的研究人员构建更复杂的文本到 SQL 系统，以便更好地处理此类重要数据。

Title: OLMoE: Open Mixture-of-Experts Language Models

Authors: Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.02060
Pdf URL: https://arxiv.org/pdf/2409.02060
Copy Paste: [[2409.02060]] OLMoE: Open Mixture-of-Experts Language Models(https://arxiv.org/abs/2409.02060)
Keywords: language model, chat
Abstract: We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
摘要：我们推出了 OLMoE，这是一种完全开放的、最先进的语言模型，利用稀疏混合专家 (MoE)。OLMoE-1B-7B 有 70 亿 (B) 个参数，但每个输入标记仅使用 1B。我们在 5 万亿个标记上对其进行预训练，并进一步调整它以创建 OLMoE-1B-7B-Instruct。我们的模型优于所有具有类似活动参数的可用模型，甚至超过了 Llama2-13B-Chat 和 DeepSeekMoE-16B 等更大的模型。我们展示了关于 MoE 训练的各种实验，分析了我们模型中表现出高度专业化的路由，并开源了我们工作的所有方面：模型权重、训练数据、代码和日志。
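
The gap between total and active parameters in the abstract above (7B total, 1B per input token, roughly one seventh) comes from sparse routing: each token is processed by only a few experts. The sketch below shows generic top-k routing in plain Python. The number of experts, the value of k, and the choice to normalize weights over the selected experts only are assumptions for illustration, not OLMoE's actual configuration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def top_k_route(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Requires k >= 1. Ties keep the lower expert index first.
    """
    top = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(
    x: float,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int,
) -> float:
    """Weighted sum of the selected experts' outputs; other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))
```

Because only the selected experts are evaluated, compute per token scales with k rather than with the total number of experts.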

Title: Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models

Authors: Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02076
Pdf URL: https://arxiv.org/pdf/2409.02076
Copy Paste: [[2409.02076]] Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models(https://arxiv.org/abs/2409.02076)
Keywords: language model, prompt
Abstract: The abilities of long-context language models (LMs) are often evaluated using the "Needle-in-a-Haystack" (NIAH) test, which comprises tasks designed to assess a model's ability to identify specific information ("needle") within large text sequences ("haystack"). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation--a critical aspect for applications such as design proposals and creative writing. To address this gap, we have introduced a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints and evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two different generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on Spinning the Golden Thread, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
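Summary: The abilities of long-context language models (LMs) are often evaluated with the "Needle-in-a-Haystack" (NIAH) test, which comprises tasks designed to assess a model's ability to identify specific information (the "needle") within large text sequences (the "haystack"). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation, a critical aspect of applications such as design proposals and creative writing. To address this gap, we introduce a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints, and we evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on Spinning the Golden Thread, which raises concerns about their ability to generate coherent long-form text that follows instructions. In addition, as the length of the generated text increases, the performance of all models drops significantly.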

Title: Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text

Authors: Michael Burnham, Kayla Kahn, Ryan Yank Wang, Rachel X. Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02078
Pdf URL: https://arxiv.org/pdf/2409.02078
Copy Paste: [[2409.02078]] Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text(https://arxiv.org/abs/2409.02078)
Keywords: language model, prompt
Abstract: Social scientists quickly adopted large language models due to their ability to annotate documents without supervised training, an ability known as zero-shot learning. However, due to their compute demands, cost, and often proprietary nature, these models are often at odds with replication and open science standards. This paper introduces the Political DEBATE (DeBERTa Algorithm for Textual Entailment) language models for zero-shot and few-shot classification of political documents. These models are not only as good as, or better than, state-of-the-art large language models at zero-shot and few-shot classification, but are orders of magnitude more efficient and completely open source. By training the models on a simple random sample of 10-25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents and state-of-the-art generative models with complex, engineered prompts. Additionally, we release the PolNLI dataset used to train these models -- a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
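Summary: Social scientists quickly adopted large language models because of their ability to annotate documents without supervised training, an ability known as zero-shot learning. However, because of their compute demands, cost, and often proprietary nature, these models are frequently at odds with replication and open science standards. This paper introduces the Political DEBATE (DeBERTa Algorithm for Textual Entailment) language models for zero-shot and few-shot classification of political documents. These models are not only as good as, or better than, state-of-the-art large language models at zero-shot and few-shot classification, but are also orders of magnitude more efficient and completely open source. When trained on a simple random sample of 10-25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents, as well as state-of-the-art generative models with complex, engineered prompts. In addition, we release the PolNLI dataset used to train these models, a corpus of over 200,000 political documents with highly accurate labels across more than 800 classification tasks.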

Title: CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Authors: Ingo Ziegler, Abdullatif Köksal, Desmond Elliott, Hinrich Schütze
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.02098
Pdf URL: https://arxiv.org/pdf/2409.02098
Copy Paste: [[2409.02098]] CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation(https://arxiv.org/abs/2409.02098)
Keywords: language model, llm
Abstract: Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
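Summary: Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets from a small number of user-written few-shot examples that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Finally, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which can then be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA, and commonsense QA, as well as summarization. Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.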
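The retrieval step described above, finding corpus documents similar to a handful of few-shot examples, can be sketched roughly as follows. This is not the CRAFT implementation. It swaps in a plain bag-of-words cosine similarity for whatever retrieval model the authors use. The function names, the toy corpus, and the choice to score each document by its best match against any few-shot example are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector (a simple stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(few_shots, corpus, top_n):
    """Rank corpus documents by their best similarity to any few-shot example."""
    shot_vecs = [bow(s) for s in few_shots]
    scored = []
    for doc in corpus:
        vec = bow(doc)
        scored.append((max(cosine(vec, s) for s in shot_vecs), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Hypothetical few-shot example and corpus, for illustration only.
few_shots = ["Which organelle produces ATP in a cell?"]
corpus = [
    "Mitochondria produce most of the ATP used by a cell.",
    "The stock market closed higher on Friday.",
    "Chloroplasts are the organelle where photosynthesis takes place in a plant cell.",
]
hits = retrieve(few_shots, corpus, top_n=2)
```

In the pipeline the abstract outlines, the retrieved documents would then be passed to an instruction-tuned LLM that rewrites them into task-formatted samples. That augmentation step is omitted here because it depends on a specific model.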