2024-08-05

Title: UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation

Authors: Juzheng Zhang, Yatao Bian, Yongqiang Chen, Quanming Yao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.00863
Pdf URL: https://arxiv.org/pdf/2408.00863
Copy Paste: [[2408.00863]] UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation(https://arxiv.org/abs/2408.00863)
Keywords: language model, llm
Abstract: The remarkable success of Large Language Models (LLMs) across diverse tasks has driven the research community to extend their capabilities to molecular applications. However, most molecular LLMs employ adapter-based architectures that do not treat molecule and text modalities equally and lack a supervision signal for the molecule modality. To address these issues, we introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture that expands the vocabulary of LLM with molecule tokens. Specifically, we introduce a Vector Quantization-driven tokenizer that incorporates a Q-Former to bridge the modality gap between molecule and text. This tokenizer transforms molecules into sequences of molecule tokens with causal dependency, encapsulating high-level molecular and textual information. Equipped with this tokenizer, UniMoT can unify molecule and text modalities under a shared token representation and an autoregressive training paradigm, enabling it to interpret molecules as a foreign language and generate them as text. Following a four-stage training scheme, UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks. Extensive experiments demonstrate that UniMoT achieves state-of-the-art performance across a wide range of molecule comprehension and generation tasks.
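To make the vector-quantization step in the abstract above more concrete, here is a minimal sketch of nearest-codebook lookup, the basic operation that turns continuous vectors into discrete token ids. The `quantize` helper, the 2-D codebook, and the feature values are all hypothetical; UniMoT's actual tokenizer also involves a Q-Former and learned codebooks, which this sketch does not attempt to reproduce.

```python
from math import dist

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    token_ids = []
    for vec in features:
        # Nearest neighbour under Euclidean distance; ties go to the lower index.
        best = min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))
        token_ids.append(best)
    return token_ids

# Hypothetical 2-D codebook with three entries (real codebooks are learned).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
# Hypothetical continuous features standing in for molecule query outputs.
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 1.8)]
molecule_tokens = quantize(features, codebook)
```

Each resulting id could then be treated as an extra vocabulary entry of the language model, which is the general idea behind a tokenizer-based architecture.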

Title: Leveraging Large Language Models (LLMs) for Traffic Management at Urban Intersections: The Case of Mixed Traffic Scenarios

Authors: Sari Masri, Huthaifa I. Ashqar, Mohammed Elhenawy
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2408.00948
Pdf URL: https://arxiv.org/pdf/2408.00948
Copy Paste: [[2408.00948]] Leveraging Large Language Models (LLMs) for Traffic Management at Urban Intersections: The Case of Mixed Traffic Scenarios(https://arxiv.org/abs/2408.00948)
Keywords: language model, gpt, llm
Abstract: Urban traffic management faces significant challenges due to dynamic environments, and traditional algorithms fail to adapt quickly to these environments in real time or to predict possible conflicts. This study explores the ability of a Large Language Model (LLM), specifically GPT-4o-mini, to improve traffic management at urban intersections. We used GPT-4o-mini to analyze scenarios, predict positions, and detect and resolve conflicts at an intersection in real time for various basic scenarios. The key aim of this study is to investigate whether LLMs can logically reason about and understand the scenarios to enhance traffic efficiency and safety by providing real-time analysis. The study highlights the potential of LLMs in urban traffic management for creating more intelligent and more adaptive systems. Results showed that GPT-4o-mini was able to effectively detect and resolve conflicts in heavy traffic, congestion, and mixed-speed conditions. The complex scenario of multiple intersections with obstacles and pedestrians saw successful conflict management as well. Results show that the integration of LLMs promises to improve the effectiveness of traffic control for safer and more efficient urban intersection management.

Title: PERSOMA: PERsonalized SOft ProMpt Adapter Architecture for Personalized Language Prompting

Authors: Liam Hebert, Krishna Sayana, Ambarish Jash, Alexandros Karatzoglou, Sukhdeep Sodhi, Sumanth Doddapaneni, Yanli Cai, Dima Kuzmin
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2408.00960
Pdf URL: https://arxiv.org/pdf/2408.00960
Copy Paste: [[2408.00960]] PERSOMA: PERsonalized SOft ProMpt Adapter Architecture for Personalized Language Prompting(https://arxiv.org/abs/2408.00960)
Keywords: language model, llm, prompt
Abstract: Understanding the nuances of a user's extensive interaction history is key to building accurate and personalized natural language systems that can adapt to evolving user preferences. To address this, we introduce PERSOMA, Personalized Soft Prompt Adapter architecture. Unlike previous personalized prompting methods for large language models, PERSOMA offers a novel approach to efficiently capture user history. It achieves this by resampling and compressing interactions as free form text into expressive soft prompt embeddings, building upon recent research utilizing embedding representations as input for LLMs. We rigorously validate our approach by evaluating various adapter architectures, first-stage sampling strategies, parameter-efficient tuning techniques like LoRA, and other personalization methods. Our results demonstrate PERSOMA's superior ability to handle large and complex user histories compared to existing embedding-based and text-prompt-based techniques.

Title: Automatic Extraction of Relationships among Motivations, Emotions and Actions from Natural Language Texts

Authors: Fei Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.00966
Pdf URL: https://arxiv.org/pdf/2408.00966
Copy Paste: [[2408.00966]] Automatic Extraction of Relationships among Motivations, Emotions and Actions from Natural Language Texts(https://arxiv.org/abs/2408.00966)
Keywords: language model
Abstract: We propose a new graph-based framework that explicitly reveals relationships among motivations, emotions and actions in natural language texts. A directed acyclic graph is designed to describe human nature. Nurture beliefs are incorporated to connect outside events with the human-nature graph. No annotation resources are required, thanks to the power of large language models. The Amazon Fine Foods Reviews dataset is used as the corpus, with a focus on food-related motivations. In total, 92,990 relationship graphs are generated, of which 63% make logical sense. We further analyze the error types to guide optimization in future research.

Title: Fairness in Large Language Models in Three Hours

Authors: Thang Doan Viet, Zichong Wang, Minh Nhat Nguyen, Wenbin Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.00992
Pdf URL: https://arxiv.org/pdf/2408.00992
Copy Paste: [[2408.00992]] Fairness in Large Language Models in Three Hours(https://arxiv.org/abs/2408.00992)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable success across various domains but often lack fairness considerations, potentially leading to discriminatory outcomes against marginalized populations. Unlike fairness in traditional machine learning, fairness in LLMs involves unique backgrounds, taxonomies, and fulfillment techniques. This tutorial provides a systematic overview of recent advances in the literature concerning fair LLMs, beginning with real-world case studies to introduce LLMs, followed by an analysis of bias causes therein. The concept of fairness in LLMs is then explored, summarizing the strategies for evaluating bias and the algorithms designed to promote fairness. Additionally, resources for assessing bias in LLMs, including toolkits and datasets, are compiled, and current research challenges and open questions in the field are discussed. The repository is available at this https URL.

Title: Leveraging Large Language Models for Mobile App Review Feature Extraction

Authors: Quim Motger, Alessio Miaschi, Felice Dell'Orletta, Xavier Franch, Jordi Marco
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2408.01063
Pdf URL: https://arxiv.org/pdf/2408.01063
Copy Paste: [[2408.01063]] Leveraging Large Language Models for Mobile App Review Feature Extraction(https://arxiv.org/abs/2408.01063)
Keywords: language model
Abstract: Mobile app review analysis presents unique challenges due to the low quality, subjective bias, and noisy content of user-generated documents. Extracting features from these reviews is essential for tasks such as feature prioritization and sentiment analysis, but it remains a challenging task. Meanwhile, encoder-only models based on the Transformer architecture have shown promising results for classification and information extraction tasks for multiple software engineering processes. This study explores the hypothesis that encoder-only large language models can enhance feature extraction from mobile app reviews. By leveraging crowdsourced annotations from an industrial context, we redefine feature extraction as a supervised token classification task. Our approach includes extending the pre-training of these models with a large corpus of user reviews to improve contextual understanding and employing instance selection techniques to optimize model fine-tuning. Empirical evaluations demonstrate that this method improves the precision and recall of extracted features and enhances performance efficiency. Key contributions include a novel approach to feature extraction, annotated datasets, extended pre-trained models, and an instance selection mechanism for cost-effective fine-tuning. This research provides practical methods and empirical evidence in applying large language models to natural language processing tasks within mobile app reviews, offering improved performance in feature extraction.
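The abstract above frames feature extraction as supervised token classification. As a rough illustration of what that framing implies downstream, the sketch below turns per-token labels into feature phrases. The BIO label names (`B-FEAT`, `I-FEAT`, `O`), the `extract_features` helper, and the sample review are assumptions for illustration; the paper's actual tag scheme and models are not given in the abstract.

```python
def extract_features(tokens, labels):
    """Collect feature phrases from BIO token labels (B-FEAT, I-FEAT, O)."""
    phrases, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-FEAT":
            # A new feature starts; flush any phrase still being built.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif label == "I-FEAT" and current:
            # Continue the phrase opened by a preceding B-FEAT.
            current.append(token)
        else:
            # Outside a feature (or a stray I-FEAT): close the open phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

# Hypothetical review tokens with labels a classifier might predict.
tokens = ["I", "love", "the", "dark", "mode", "but", "sync", "fails"]
labels = ["O", "O", "O", "B-FEAT", "I-FEAT", "O", "B-FEAT", "O"]
features = extract_features(tokens, labels)
```

In practice the labels would come from a fine-tuned encoder-only model; here they are hard-coded so the decoding step can be read on its own.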

Title: Adaptive Contrastive Decoding in Retrieval-Augmented Generation for Handling Noisy Contexts

Authors: Youna Kim, Hyuhng Joon Kim, Cheonbok Park, Choonghyun Park, Hyunsoo Cho, Junyeob Kim, Kang Min Yoo, Sang-goo Lee, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01084
Pdf URL: https://arxiv.org/pdf/2408.01084
Copy Paste: [[2408.01084]] Adaptive Contrastive Decoding in Retrieval-Augmented Generation for Handling Noisy Contexts(https://arxiv.org/abs/2408.01084)
Keywords: language model, llm, retrieval-augmented generation
Abstract: When using large language models (LLMs) in knowledge-intensive tasks, such as open-domain question answering, external context can bridge the gap between external knowledge and the LLM's parametric knowledge. Recent research has developed contrastive decoding approaches that amplify contextual knowledge over the LLM's parametric knowledge. While these approaches can yield truthful responses when relevant context is provided, they are vulnerable when faced with noisy contexts. We extend the scope of previous studies to encompass noisy contexts and propose adaptive contrastive decoding (ACD) to leverage contextual influence effectively. ACD demonstrates improvements in open-domain question answering tasks compared to baselines, especially in robustness, by remaining undistracted by noisy contexts in retrieval-augmented generation.
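For readers unfamiliar with contrastive decoding, the sketch below shows the generic, non-adaptive form that this line of work builds on: next-token logits computed with the retrieved context are pushed away from logits computed without it. The formula `(1 + alpha) * with_context - alpha * without_context`, the fixed `alpha`, and the toy logits are assumptions for illustration. The abstract does not say how ACD adapts the contextual influence, so that part is deliberately left out.

```python
from math import exp

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(with_context, without_context, alpha):
    """Amplify what the context adds over the model's parametric prediction."""
    return [(1 + alpha) * c - alpha * p
            for c, p in zip(with_context, without_context)]

# Hypothetical next-token logits over a three-word vocabulary.
with_ctx = [2.0, 1.0, 0.0]      # model conditioned on question + context
without_ctx = [1.0, 2.0, 0.0]   # model conditioned on the question only
adjusted = contrastive_logits(with_ctx, without_ctx, alpha=0.5)
probs = softmax(adjusted)
```

With `alpha = 0` the adjusted logits equal the context-conditioned logits, so an adaptive scheme could in principle lower the weight when the context looks noisy; how ACD does this is described in the paper, not here.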

Title: Bridging Information Gaps in Dialogues With Grounded Exchanges Using Knowledge Graphs

Authors: Phillip Schneider, Nektarios Machner, Kristiina Jokinen, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01088
Pdf URL: https://arxiv.org/pdf/2408.01088
Copy Paste: [[2408.01088]] Bridging Information Gaps in Dialogues With Grounded Exchanges Using Knowledge Graphs(https://arxiv.org/abs/2408.01088)
Keywords: language model
Abstract: Knowledge models are fundamental to dialogue systems for enabling conversational interactions, which require handling domain-specific knowledge. Ensuring effective communication in information-providing conversations entails aligning user understanding with the knowledge available to the system. However, dialogue systems often face challenges arising from semantic inconsistencies in how information is expressed in natural language compared to how it is represented within the system's internal knowledge. To address this problem, we study the potential of large language models for conversational grounding, a mechanism to bridge information gaps by establishing shared knowledge between dialogue participants. Our approach involves annotating human conversations across five knowledge domains to create a new dialogue corpus called BridgeKG. Through a series of experiments on this dataset, we empirically evaluate the capabilities of large language models in classifying grounding acts and identifying grounded information items within a knowledge graph structure. Our findings offer insights into how these models use in-context learning for conversational grounding tasks and common prediction errors, which we illustrate with examples from challenging dialogues. We discuss how the models handle knowledge graphs as a semantic layer between unstructured dialogue utterances and structured information items.

Title: BioRAG: A RAG-LLM Framework for Biological Question Reasoning

Authors: Chengrui Wang, Qingqing Long, Xiao Meng, Xunxin Cai, Chengjun Wu, Zhen Meng, Xuezhi Wang, Yuanchun Zhou
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2408.01107
Pdf URL: https://arxiv.org/pdf/2408.01107
Copy Paste: [[2408.01107]] BioRAG: A RAG-LLM Framework for Biological Question Reasoning(https://arxiv.org/abs/2408.01107)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Question answering for life science research, a field characterized by a rapid pace of discovery, evolving insights, and complex interactions among knowledge entities, presents unique challenges in maintaining a comprehensive knowledge warehouse and accurate information retrieval. To address these issues, we introduce BioRAG, a novel Retrieval-Augmented Generation (RAG) framework built on Large Language Models (LLMs). Our approach starts with parsing, indexing, and segmenting an extensive collection of 22 million scientific papers as the basic knowledge, followed by training a specialized embedding model tailored to this domain. Additionally, we enhance the vector retrieval process by incorporating a domain-specific knowledge hierarchy, which aids in modeling the intricate interrelationships among each query and context. For queries requiring the most current information, BioRAG deconstructs the question and employs an iterative retrieval process combined with a search engine for step-by-step reasoning. Rigorous experiments have demonstrated that our model outperforms fine-tuned LLMs, LLMs with search engines, and other scientific RAG frameworks across multiple life science question-answering tasks.

Title: IAI Group at CheckThat! 2024: Transformer Models and Data Augmentation for Checkworthy Claim Detection

Authors: Peter Røysland Aarnes, Vinay Setty, Petra Galuščáková
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01118
Pdf URL: https://arxiv.org/pdf/2408.01118
Copy Paste: [[2408.01118]] IAI Group at CheckThat! 2024: Transformer Models and Data Augmentation for Checkworthy Claim Detection(https://arxiv.org/abs/2408.01118)
Keywords: chain-of-thought
Abstract: This paper describes the IAI group's participation in automated check-worthiness estimation for claims, within the framework of the 2024 CheckThat! Lab "Task 1: Check-Worthiness Estimation". The task involves the automated detection of check-worthy claims in English, Dutch, and Arabic political debates and Twitter data. We utilized various pre-trained generative decoder and encoder transformer models, employing methods such as few-shot chain-of-thought reasoning, fine-tuning, data augmentation, and transfer learning from one language to another. Despite variable success in terms of performance, our models achieved notable placements on the organizer's leaderboard: ninth-best in English, third-best in Dutch, and the top placement in Arabic, utilizing multilingual datasets for enhancing the generalizability of check-worthiness detection. Despite a significant drop in performance on the unlabeled test dataset compared to the development test dataset, our findings contribute to the ongoing efforts in claim detection research, highlighting the challenges and potential of language-specific adaptations in claim verification systems.

Title: Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer

Authors: Robert Belanec, Simon Ostermann, Ivan Srba, Maria Bielikova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01119
Pdf URL: https://arxiv.org/pdf/2408.01119
Copy Paste: [[2408.01119]] Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer(https://arxiv.org/abs/2408.01119)
Keywords: language model, llm, prompt
Abstract: Prompt tuning is a modular and efficient solution for training large language models (LLMs). One of its main advantages is task modularity, making it suitable for multi-task problems. However, current soft-prompt-based methods often sacrifice multi-task modularity, requiring the training process to be fully or partially repeated for each newly added task. While recent work on task vectors applied arithmetic operations on full model weights to achieve the desired multi-task performance, a similar approach for soft-prompts is still missing. To this end, we introduce Task Prompt Vectors, created as the element-wise difference between the weights of tuned soft-prompts and their random initialization. Experimental results on 12 NLU datasets show that task prompt vectors can be used in low-resource settings to effectively initialize prompt tuning on similar tasks. In addition, we show that task prompt vectors are independent of the random initialization of prompt tuning. This allows prompt arithmetic with the pre-trained vectors from different tasks. In this way, by arithmetic addition of task prompt vectors from multiple tasks, we are able to outperform a state-of-the-art baseline in some cases.
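The two operations named in the abstract above, the element-wise difference between a tuned soft prompt and its initialization, and the arithmetic addition of several such vectors, can be sketched in a few lines. The helper names, the 2x2 prompt shape, and all weight values are hypothetical; real soft prompts are much larger tensors, and how the combined prompt is then used or tuned further is described in the paper rather than here.

```python
def task_prompt_vector(tuned, init):
    """Element-wise difference between tuned soft-prompt weights and their init."""
    return [[t - i for t, i in zip(tuned_row, init_row)]
            for tuned_row, init_row in zip(tuned, init)]

def apply_task_vectors(init, vectors):
    """Add one or more task prompt vectors onto an initialization."""
    result = [row[:] for row in init]  # copy so the init stays unchanged
    for vec in vectors:
        for r, vec_row in enumerate(vec):
            for c, delta in enumerate(vec_row):
                result[r][c] += delta
    return result

# Hypothetical soft prompt of 2 tokens x 2 dimensions.
init = [[0.5, -0.5], [1.0, 0.0]]
tuned_task_a = [[1.0, -0.5], [1.0, 0.5]]
tuned_task_b = [[0.5, 0.0], [0.75, 0.0]]

vec_a = task_prompt_vector(tuned_task_a, init)
vec_b = task_prompt_vector(tuned_task_b, init)
# Multi-task initialization obtained by adding both task vectors.
combined = apply_task_vectors(init, [vec_a, vec_b])
```

Applying a single vector back onto its own initialization recovers the tuned prompt, which is a quick sanity check on the difference operation.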

Title: CFBench: A Comprehensive Constraints-Following Benchmark for LLMs

Authors: Tao Zhang, Yanjun Shen, Wenjing Luo, Yan Zhang, Hao Liang, Tao Zhang, Fan Yang, Mingan Lin, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01122
Pdf URL: https://arxiv.org/pdf/2408.01122
Copy Paste: [[2408.01122]] CFBench: A Comprehensive Constraints-Following Benchmark for LLMs(https://arxiv.org/abs/2408.01122)
Keywords: language model, llm
Abstract: The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented constraints or narrow scenarios, but they overlook the comprehensiveness and authenticity of constraints from the user's perspective. To bridge this gap, we propose CFBench, a large-scale Comprehensive Constraints Following Benchmark for LLMs, featuring 1,000 curated samples that cover more than 200 real-life scenarios and over 50 NLP tasks. CFBench meticulously compiles constraints from real-world instructions and constructs an innovative systematic framework for constraint types, which includes 10 primary categories and over 25 subcategories, and ensures each constraint is seamlessly integrated within the instructions. To make certain that the evaluation of LLM outputs aligns with user perceptions, we propose an advanced methodology that integrates multi-dimensional assessment criteria with requirement prioritization, covering various perspectives of constraints, instructions, and requirement fulfillment. Evaluating current leading LLMs on CFBench reveals substantial room for improvement in constraints following, and we further investigate influencing factors and enhancement strategies. The data and code are publicly available at this https URL

Title: DERA: Dense Entity Retrieval for Entity Alignment in Knowledge Graphs

Authors: Zhichun Wang, Xuan Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01154
Pdf URL: https://arxiv.org/pdf/2408.01154
Copy Paste: [[2408.01154]] DERA: Dense Entity Retrieval for Entity Alignment in Knowledge Graphs(https://arxiv.org/abs/2408.01154)
Keywords: language model
Abstract: Entity Alignment (EA) aims to match equivalent entities in different Knowledge Graphs (KGs), which is essential for knowledge fusion and integration. Recently, embedding-based EA has attracted significant attention and many approaches have been proposed. Early approaches primarily focus on learning entity embeddings from the structural features of KGs, defined by relation triples. Later methods incorporated entities' names and attributes as auxiliary information to enhance embeddings for EA. However, these approaches often used different techniques to encode structural and attribute information, limiting their interaction and mutual enhancement. In this work, we propose a dense entity retrieval framework for EA, leveraging language models to uniformly encode various features of entities and facilitate nearest entity search across KGs. Alignment candidates are first generated through entity retrieval, which are subsequently reranked to determine the final alignments. We conduct comprehensive experiments on both cross-lingual and monolingual EA datasets, demonstrating that our approach achieves state-of-the-art performance compared to existing EA methods.
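The retrieve-then-rerank pipeline described in the abstract above can be illustrated with a toy example: candidates are first fetched by nearest-neighbour search over embeddings, then reordered by a second scorer. Everything concrete here is an assumption made for illustration: the 2-D embeddings, the entity names, and especially the word-overlap reranker, which is a simple stand-in and not the language-model-based encoder or reranker the paper uses.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve(query_vec, candidates, k):
    """Stage 1: return the k candidate names closest to the query embedding."""
    ranked = sorted(candidates,
                    key=lambda name: cosine(query_vec, candidates[name]),
                    reverse=True)
    return ranked[:k]

def rerank(query_name, names):
    """Stage 2 (stand-in): order candidates by shared lowercase words."""
    query_words = set(query_name.lower().split())
    def overlap(name):
        return len(query_words & set(name.lower().split()))
    return sorted(names, key=overlap, reverse=True)

# Hypothetical entities of a target knowledge graph with toy embeddings.
target_kg = {
    "Paris Hilton": [0.9, 0.1],
    "City of Paris": [0.8, 0.3],
    "Berlin": [0.1, 0.9],
}
query_name, query_vec = "Paris city", [1.0, 0.0]

candidates = retrieve(query_vec, target_kg, k=2)
reranked = rerank(query_name, candidates)
alignment = reranked[0]
```

The example is built so that the embedding search alone prefers the wrong entity and the second stage corrects it, which is the usual motivation for reranking a retrieved shortlist.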

Title: Misinforming LLMs: vulnerabilities, challenges and opportunities

Authors: Bo Zhou, Daniel Geißler, Paul Lukowicz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01168
Pdf URL: https://arxiv.org/pdf/2408.01168
Copy Paste: [[2408.01168]] Misinforming LLMs: vulnerabilities, challenges and opportunities(https://arxiv.org/abs/2408.01168)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have made significant advances in natural language processing, but their underlying mechanisms are often misunderstood. Despite exhibiting coherent answers and apparent reasoning behaviors, LLMs rely on statistical patterns in word embeddings rather than true cognitive processes. This leads to vulnerabilities such as "hallucination" and misinformation. The paper argues that current LLM architectures are inherently untrustworthy due to their reliance on correlations of sequential patterns of word embedding vectors. However, ongoing research into combining generative transformer-based models with fact bases and logic programming languages may lead to the development of trustworthy LLMs capable of generating statements based on given truth and explaining their self-reasoning process.

Title: High-Throughput Phenotyping of Clinical Text Using Large Language Models

Authors: Daniel B. Hier, S. Ilyas Munzir, Anne Stahlfeld, Tayo Obafemi-Ajayi, Michael D. Carrithers
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01214
Pdf URL: https://arxiv.org/pdf/2408.01214
Copy Paste: [[2408.01214]] High-Throughput Phenotyping of Clinical Text Using Large Language Models(https://arxiv.org/abs/2408.01214)
Keywords: language model, gpt
Abstract: High-throughput phenotyping automates the mapping of patient signs to standardized ontology concepts and is essential for precision medicine. This study evaluates the automation of phenotyping of clinical summaries from the Online Mendelian Inheritance in Man (OMIM) database using large language models. Due to their rich phenotype data, these summaries can be surrogates for physician notes. We conduct a performance comparison of GPT-4 and GPT-3.5-Turbo. Our results indicate that GPT-4 surpasses GPT-3.5-Turbo in identifying, categorizing, and normalizing signs, achieving concordance with manual annotators comparable to inter-rater agreement. Despite some limitations in sign normalization, the extensive pre-training of GPT-4 results in high performance and generalizability across several phenotyping tasks while obviating the need for manually annotated training data. Large language models are expected to be the dominant method for automating high-throughput phenotyping of clinical text.

Title: RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Authors: Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2408.01262
Pdf URL: https://arxiv.org/pdf/2408.01262
Copy Paste: [[2408.01262]] RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework(https://arxiv.org/abs/2408.01262)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems have demonstrated their advantages in alleviating the hallucination of Large Language Models (LLMs). Existing RAG benchmarks mainly focus on evaluating whether LLMs can correctly answer general-knowledge questions. However, they are unable to evaluate the effectiveness of a RAG system in dealing with data from different vertical domains. This paper introduces RAGEval, a framework for automatically generating evaluation datasets to evaluate the knowledge usage ability of different LLMs in different scenarios. Specifically, RAGEval summarizes a schema from seed documents, applies the configurations to generate diverse documents, and constructs question-answering pairs according to both articles and configurations. We propose three novel metrics, Completeness, Hallucination, and Irrelevance, to carefully evaluate the responses generated by LLMs. By benchmarking RAG models in vertical domains, RAGEval can better evaluate the knowledge usage ability of LLMs, avoiding the confusion in existing QA datasets about the source of the knowledge used to answer a question, that is, whether it comes from parameterized memory or from retrieval.

Title: The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models

Authors: Hannah Chen, Yangfeng Ji, David Evans
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2408.01285
Pdf URL: https://arxiv.org/pdf/2408.01285
Copy Paste: [[2408.01285]] The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models(https://arxiv.org/abs/2408.01285)
Keywords: language model, llm
Abstract: Large language models (LLMs) are now being considered and even deployed for applications that support high-stakes decision-making, such as recruitment and clinical decisions. While several methods have been proposed for measuring bias, there remains a gap between predictions, which are what the proposed methods consider, and how they are used to make decisions. In this work, we introduce Rank-Allocational-Based Bias Index (RABBI), a model-agnostic bias measure that assesses potential allocational harms arising from biases in LLM predictions. We compare RABBI and current bias metrics on two allocation decision tasks. We evaluate their predictive validity across ten LLMs and utility for model selection. Our results reveal that commonly-used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes, whereas RABBI exhibits a strong correlation with allocation disparities. Our work highlights the need to account for how models are used in contexts with limited resource constraints.
摘要：大型语言模型 (LLM) 目前正在被考虑，甚至被部署用于支持高风险决策（例如招聘和临床决策）的应用。虽然已经提出了几种测量偏差的方法，但预测（即所提出的方法所考虑的）与它们如何用于决策之间仍然存在差距。在这项工作中，我们引入了基于排名分配的偏差指数 (RABBI)，这是一种与模型无关的偏差测量，用于评估由 LLM 预测中的偏差引起的潜在分配危害。我们在两个分配决策任务上比较了 RABBI 和当前偏差指标。我们评估了它们在十个 LLM 中的预测有效性和模型选择的效用。我们的结果表明，基于平均绩效差距和分布距离的常用偏差指标无法可靠地捕捉分配结果中的群体差异，而 RABBI 与分配差异表现出很强的相关性。我们的工作强调了需要考虑如何在资源受限的情况下使用模型。
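
Note: the abstract does not give the RABBI formula, so the sketch below does not implement it. It only illustrates the gap the abstract describes: two groups can have the same average score while a top-k allocation selects from only one of them. The function names and the toy scores are made up for illustration.

```python
def mean_gap(scores_a, scores_b):
    """Difference in average predicted score between two groups."""
    return sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b)

def selection_rate_gap(scores_a, scores_b, k):
    """Difference in selection rates when only the k highest scores are chosen."""
    pool = [(s, "a") for s in scores_a] + [(s, "b") for s in scores_b]
    top = sorted(pool, key=lambda item: item[0], reverse=True)[:k]
    chosen_a = sum(1 for _, group in top if group == "a")
    chosen_b = k - chosen_a
    return chosen_a / len(scores_a) - chosen_b / len(scores_b)

# Toy scores: both groups average 0.475, but group a holds the two highest scores.
group_a = [0.9, 0.8, 0.1, 0.1]
group_b = [0.6, 0.5, 0.5, 0.3]

average_gap = mean_gap(group_a, group_b)
allocation_gap = selection_rate_gap(group_a, group_b, k=2)
```

Here the average gap is zero up to floating-point rounding while the selection-rate gap is 0.5, which is the kind of disparity an average-based metric would miss.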

Title: Reconsidering Token Embeddings with the Definitions for Pre-trained Language Models

Authors: Ying Zhang, Dongyuan Li, Manabu Okumura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01308
Pdf URL: https://arxiv.org/pdf/2408.01308
Copy Paste: [[2408.01308]] Reconsidering Token Embeddings with the Definitions for Pre-trained Language Models(https://arxiv.org/abs/2408.01308)
Keywords: language model
Abstract: Learning token embeddings based on token co-occurrence statistics has proven effective for both pre-training and fine-tuning in natural language processing. However, recent studies have pointed out that the distribution of learned embeddings degenerates into anisotropy, and even pre-trained language models (PLMs) suffer from a loss of semantics-related information in embeddings for low-frequency tokens. This study first analyzes fine-tuning dynamics of a PLM, BART-large, and demonstrates its robustness against degeneration. On the basis of this finding, we propose DefinitionEMB, a method that utilizes definitions to construct isotropically distributed and semantics-related token embeddings for PLMs while maintaining original robustness during fine-tuning. Our experiments demonstrate the effectiveness of leveraging definitions from Wiktionary to construct such embeddings for RoBERTa-base and BART-large. Furthermore, the constructed embeddings for low-frequency tokens improve the performance of these models across various GLUE and four text summarization datasets.
摘要：基于 token 共现统计的 token 嵌入学习已被证明对自然语言处理中的预训练和微调都很有效。然而，最近的研究指出，学习到的嵌入的分布退化为各向异性，甚至预训练语言模型 (PLM) 也会在低频 token 的嵌入中丢失与语义相关的信息。本研究首先分析了 PLM BART-large 的微调动态，并证明了其对退化的鲁棒性。基于这一发现，我们提出了 DefinitionEMB，这种方法利用定义来为 PLM 构建各向同性分布和语义相关的 token 嵌入，同时在微调过程中保持原有的鲁棒性。我们的实验证明了利用 Wiktionary 中的定义来为 RoBERTa-base 和 BART-large 构建此类嵌入的有效性。此外，为低频 token 构建的嵌入提高了这些模型在各种 GLUE 和四个文本摘要数据集上的性能。
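
Note: one common proxy for the anisotropy mentioned in the abstract is the average cosine similarity between pairs of embedding vectors: values near 1 mean the vectors crowd into a narrow cone, values near 0 or below mean they are spread out. The sketch below uses that proxy on toy 2-D vectors; it is an assumption that this matches the measure used in the paper.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings: one set packed into a narrow cone, one spread over all directions.
narrow_cone = [[1.0, 0.1], [1.0, -0.1], [1.0, 0.0]]
spread_out = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

anisotropic_score = mean_pairwise_cosine(narrow_cone)
isotropic_score = mean_pairwise_cosine(spread_out)
```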

Title: FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only

Authors: He Zhu, Junyou Su, Tianle Lun, Yicheng Tao, Wenjia Zhang, Zipei Fan, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01323
Pdf URL: https://arxiv.org/pdf/2408.01323
Copy Paste: [[2408.01323]] FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only(https://arxiv.org/abs/2408.01323)
Keywords: language model, gpt, llm
Abstract: Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotations or costly API calls of proprietary LLMs. To address these challenges, we introduce FANNO, a fully autonomous, open-sourced framework that revolutionizes the annotation process without the need for pre-existing annotated data. Utilizing a Mistral-7b-instruct model, FANNO efficiently produces diverse and high-quality datasets through a structured process involving document pre-screening, instruction generation, and response generation. Experiments on Open LLM Leaderboard and AlpacaEval benchmark show that FANNO can generate high-quality data with diversity and complexity for free, comparable to human-annotated or cleaned datasets like Alpaca-GPT4-Cleaned.
摘要：指令微调是利用大型语言模型 (LLM) 提高任务性能的关键进步。然而，指令数据集的注释传统上既昂贵又费力，通常依赖于手动注释或专有 LLM 的昂贵 API 调用。为了应对这些挑战，我们引入了 FANNO，这是一个完全自主的开源框架，它彻底改变了注释过程，而无需预先存在的注释数据。利用 Mistral-7b-instruct 模型，FANNO 通过涉及文档预筛选、指令生成和响应生成的结构化流程高效地生成多样化和高质量的数据集。在 Open LLM Leaderboard 和 AlpacaEval 基准上的实验表明，FANNO 可以免费生成具有多样性和复杂性的高质量数据，可与 Alpaca-GPT4-Cleaned 等人工注释或清理的数据集相媲美。

Title: Transformers are Universal In-context Learners

Authors: Takashi Furuya, Maarten V. de Hoop, Gabriel Peyré
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2408.01367
Pdf URL: https://arxiv.org/pdf/2408.01367
Copy Paste: [[2408.01367]] Transformers are Universal In-context Learners(https://arxiv.org/abs/2408.01367)
Keywords: prompt
Abstract: Transformers are deep architectures that define "in-context mappings" which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for vision transformers). This work studies in particular the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically and uniformly address the expressivity of these architectures, we consider the case that the mappings are conditioned on a context represented by a probability distribution of tokens (discrete for a finite number of tokens). The related notion of smoothness corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens. Additionally, it operates with a fixed embedding dimension of tokens (this dimension does not increase with precision) and a fixed number of heads (proportional to the dimension). The use of MLP layers between multi-head attention layers is also explicitly controlled.
摘要：Transformer 是一种深度架构，它定义了“上下文映射”，能够根据给定的一组 token（例如 NLP 应用程序中的提示或视觉 Transformer 的一组补丁）预测新 token。这项工作特别研究了这些架构处理任意大量上下文 token 的能力。为了在数学上统一地解决这些架构的表达能力，我们考虑了映射以 token 的概率分布（对于有限数量的 token 是离散的）表示的上下文为条件的情况。相关的平滑度概念对应于这些上下文之间的 Wasserstein 距离的连续性。我们证明深度 Transformer 是通用的，并且可以在紧凑的 token 域上均匀地将连续的上下文映射近似为任意精度。与现有发现相比，我们的结果的一个关键方面是，对于固定精度，单个 Transformer 可以对任意（甚至无限）数量的 token 进行操作。此外，它使用固定的 token 嵌入维度（此维度不会随着精度的提高而增加）和固定数量的 head（与维度成比例）。多头注意力层之间 MLP 层的使用也受到明确控制。
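
Note: the idea of conditioning on "a probability distribution of tokens" can be made concrete with a single softmax attention head. The notation below ($\Gamma_\theta$, $Q$, $K$, $V$, the residual term, and the absence of a scaling factor) is assumed for illustration and may differ from the conventions in the paper.

```latex
% One attention head written as a map on a context measure \mu and a query token x
\Gamma_\theta(\mu, x) = x + \int
  \frac{\exp\left(\langle Q x, K y \rangle\right)}
       {\int \exp\left(\langle Q x, K z \rangle\right) \, d\mu(z)}
  \, V y \, d\mu(y)

% For a finite context, \mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}.
% The factors 1/n cancel between numerator and denominator, giving the usual softmax attention:
\Gamma_\theta(\mu, x) = x + \sum_{i=1}^{n}
  \frac{\exp\left(\langle Q x, K x_i \rangle\right)}
       {\sum_{j=1}^{n} \exp\left(\langle Q x, K x_j \rangle\right)}
  \, V x_i
```

Because the first expression depends on the context only through $\mu$, the same parameters apply to any number of tokens, and continuity in $\mu$ can be stated with respect to the Wasserstein distance, as the abstract describes.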

Title: Coalitions of Large Language Models Increase the Robustness of AI Agents

Authors: Prattyush Mangal, Carol Mak, Theo Kanakis, Timothy Donovan, Dave Braines, Edward Pyzer-Knapp
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01380
Pdf URL: https://arxiv.org/pdf/2408.01380
Copy Paste: [[2408.01380]] Coalitions of Large Language Models Increase the Robustness of AI Agents(https://arxiv.org/abs/2408.01380)
Keywords: language model, llm, agent
Abstract: The emergence of Large Language Models (LLMs) has fundamentally altered the way we interact with digital systems and has led to the pursuit of LLM-powered AI agents to assist in daily workflows. LLMs, whilst powerful and capable of demonstrating some emergent properties, are not logical reasoners and often struggle to perform well at all sub-tasks carried out by an AI agent to plan and execute a workflow. While existing studies tackle this lack of proficiency by generalised pretraining at a huge scale or by specialised fine-tuning for tool use, we assess if a system comprising a coalition of pretrained LLMs, each exhibiting specialised performance at individual sub-tasks, can match the performance of single model agents. The coalition of models approach showcases its potential for building robustness and reducing the operational costs of these AI agents by leveraging traits exhibited by specific models. Our findings demonstrate that fine-tuning can be mitigated by considering a coalition of pretrained models, and we believe that this approach can be applied to other non-agentic systems which utilise LLMs.
摘要：大型语言模型 (LLM) 的出现从根本上改变了我们与数字系统的交互方式，并促使人们寻求由 LLM 驱动的 AI 代理来协助日常工作流程。LLM 虽然功能强大且能够展示一些新兴特性，但它们并不是逻辑推理者，并且通常难以在 AI 代理执行的所有子任务中表现出色，无法规划和执行工作流程。虽然现有研究通过大规模的通用预训练或针对工具使用进行专门的微调来解决这种缺乏熟练度的问题，但我们评估了一个由预训练的 LLM 联盟组成的系统，每个 LLM 在各个子任务中都表现出专门的性能，是否可以与单个模型代理的性能相匹配。模型联盟方法展示了其通过利用特定模型所表现出的特性来构建稳健性和降低这些 AI 代理的运营成本的潜力。我们的研究结果表明，可以通过考虑预训练模型联盟来减轻微调，并相信这种方法可以应用于其他使用 LLM 的非代理系统。

Title: Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs

Authors: Yilun Hua, Yoav Artzi
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.01417
Pdf URL: https://arxiv.org/pdf/2408.01417
Copy Paste: [[2408.01417]] Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs(https://arxiv.org/abs/2408.01417)
Keywords: language model, gpt, llm, prompt
Abstract: Humans spontaneously use increasingly efficient language as interactions progress, by adapting and forming ad-hoc conventions. This phenomenon has been studied extensively using reference games, showing properties of human language that go beyond relaying intents. It remains unexplored whether multimodal large language models (MLLMs) similarly increase communication efficiency during interactions, and what mechanisms they may adopt for this purpose. We introduce ICCA, an automated framework to evaluate such conversational adaptation as an in-context behavior in MLLMs. We evaluate several state-of-the-art MLLMs, and observe that while they may understand the increasingly efficient language of their interlocutor, they do not spontaneously make their own language more efficient over time. This latter ability can only be elicited in some models (e.g., GPT-4) with heavy-handed prompting. This shows that this property of linguistic interaction does not arise from current training regimes, even though it is a common hallmark of human language. ICCA is available at this https URL.
摘要：随着互动的进行，人类会通过调整和形成临时惯例，自发地使用越来越高效的语言。这种现象已通过参考游戏进行了广泛研究，展示了人类语言超越传达意图的特性。多模态大型语言模型 (MLLM) 是否同样会提高交互过程中的沟通效率，以及它们可能为此采用哪些机制，仍未得到探索。我们引入了 ICCA，这是一个自动化框架，用于评估这种对话适应性，即 MLLM 中的上下文行为。我们评估了几种最先进的 MLLM，并观察到虽然它们可能理解对话者越来越高效的语言，但它们不会随着时间的推移自发地使自己的语言变得更高效。后一种能力只能在某些模型（例如 GPT-4）中通过严厉的提示来引发。这表明，这种语言交互特性并非来自当前的训练机制，尽管它是人类语言的共同特征。ICCA 可在此 https URL 上找到。

Title: DebateQA: Evaluating Question Answering on Debatable Knowledge

Authors: Rongwu Xu, Xuan Qi, Zehan Qi, Wei Xu, Zhijiang Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01419
Pdf URL: https://arxiv.org/pdf/2408.01419
Copy Paste: [[2408.01419]] DebateQA: Evaluating Question Answering on Debatable Knowledge(https://arxiv.org/abs/2408.01419)
Keywords: language model, llm, chat, retrieval-augmented generation
Abstract: The rise of large language models (LLMs) has enabled us to seek answers to inherently debatable questions on LLM chatbots, necessitating a reliable way to evaluate their ability. However, traditional QA benchmarks that assume fixed answers are inadequate for this purpose. To address this, we introduce DebateQA, a dataset of 2,941 debatable questions, each accompanied by multiple human-annotated partial answers that capture a variety of perspectives. We develop two metrics: Perspective Diversity, which evaluates the comprehensiveness of perspectives, and Dispute Awareness, which assesses if the LLM acknowledges the question's debatable nature. Experiments demonstrate that both metrics align with human preferences and are stable across different underlying models. Using DebateQA with the two metrics, we assess 12 popular LLMs and retrieval-augmented generation methods. Our findings reveal that while LLMs generally excel at recognizing debatable issues, their ability to provide comprehensive answers encompassing diverse perspectives varies considerably.
摘要：大型语言模型 (LLM) 的兴起使我们能够在 LLM 聊天机器人上寻找固有争议问题的答案，因此需要一种可靠的方法来评估它们的能力。然而，传统的 QA 基准假设固定答案不足以达到这一目的。为了解决这个问题，我们推出了 DebateQA，这是一个包含 2,941 个有争议问题的数据集，每个问题都附有多个人工注释的部分答案，这些答案可以捕捉各种观点。我们开发了两个指标：视角多样性（评估观点的全面性）和争议意识（评估 LLM 是否承认问题的争议性）。实验表明，这两个指标都符合人类偏好，并且在不同的底层模型中都是稳定的。使用具有两个指标的 DebateQA，我们评估了 12 个流行的 LLM 和检索增强生成方法。我们的研究结果表明，虽然 LLM 通常擅长识别有争议的问题，但它们提供涵盖不同观点的全面答案的能力差异很大。

Title: Prompt Recursive Search: A Living Framework with Adaptive Growth in LLM Auto-Prompting

Authors: Xiangyu Zhao, Chengqian Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01423
Pdf URL: https://arxiv.org/pdf/2408.01423
Copy Paste: [[2408.01423]] Prompt Recursive Search: A Living Framework with Adaptive Growth in LLM Auto-Prompting(https://arxiv.org/abs/2408.01423)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) exhibit remarkable proficiency in addressing a diverse array of tasks within the Natural Language Processing (NLP) domain, with various prompt design strategies significantly augmenting their capabilities. However, these prompts, while beneficial, each possess inherent limitations. The primary prompt design methodologies are twofold: The first, exemplified by the Chain of Thought (CoT), involves manually crafting prompts specific to individual datasets, hence termed Expert-Designed Prompts (EDPs). Once these prompts are established, they are unalterable, and their effectiveness is capped by the expertise of the human designers. When applied to LLMs, the static nature of EDPs results in a uniform approach to both simple and complex problems within the same dataset, leading to the inefficient use of tokens for straightforward issues. The second method involves prompts autonomously generated by the LLM, known as LLM-Derived Prompts (LDPs), which provide tailored solutions to specific problems, mitigating the limitations of EDPs. However, LDPs may encounter a decline in performance when tackling complex problems due to the potential for error accumulation during the solution planning process. To address these challenges, we have conceived a novel Prompt Recursive Search (PRS) framework that leverages the LLM to generate solutions specific to the problem, thereby conserving tokens. The framework incorporates an assessment of problem complexity and an adjustable structure, ensuring a reduction in the likelihood of errors. We have substantiated the efficacy of the PRS framework through extensive experiments using LLMs with different numbers of parameters across a spectrum of datasets in various domains. Compared to the CoT method, the PRS method has increased the accuracy on the BBH dataset by 8% using the Llama3-7B model, achieving a 22% improvement.
摘要：大型语言模型 (LLM) 在解决自然语言处理 (NLP) 领域的各种任务方面表现出非凡的能力，各种提示设计策略大大增强了它们的能力。然而，这些提示虽然有益，但每个提示都具有固有的局限性。主要的提示设计方法有两种：第一种，以思路链 (CoT) 为例，涉及手动制作特定于各个数据集的提示，因此称为专家设计提示 (EDP)。一旦建立这些提示，它们就无法更改，并且它们的有效性受到人类设计者专业知识的限制。当应用于 LLM 时，EDP 的静态性质导致对同一数据集中的简单和复杂问题采用统一的方法，导致对简单问题的标记使用效率低下。第二种方法涉及由 LLM 自主生成的提示，称为 LLM 派生提示 (LDP)，它为特定问题提供量身定制的解决方案，从而减轻了 EDP 的局限性。然而，由于解决方案规划过程中可能会出现错误积累，LDP 在解决复杂问题时可能会遇到性能下降的问题。为了应对这些挑战，我们构思了一种新颖的即时递归搜索 (PRS) 框架，该框架利用 LLM 生成针对问题的解决方案，从而节省 token。该框架结合了对问题复杂性的评估和可调整的结构，确保降低出错的可能性。我们使用具有不同数量参数的 LLM 在各个领域的一系列数据集上进行了大量实验，证实了 PRS 框架的有效性。与 CoT 方法相比，PRS 方法使用 Llama3-7B 模型将 BBH 数据集上的准确率提高了 8%，实现了 22% 的提升。
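
Note: the abstract describes assessing problem complexity and then adapting the solution structure, but gives no algorithm. The sketch below is a generic recursive "answer directly if simple, otherwise split and recombine" loop, with plain Python callables standing in for LLM calls. All names (`recursive_solve`, `is_simple`, `split`, `combine`, `max_depth`) are hypothetical and should not be read as the PRS implementation.

```python
def recursive_solve(problem, is_simple, answer, split, combine, depth=0, max_depth=3):
    """Answer simple problems directly; split harder ones and recombine the parts.

    is_simple, answer, split, and combine stand in for LLM calls in a real system.
    max_depth bounds the recursion so that planning cannot grow without limit.
    """
    if depth >= max_depth or is_simple(problem):
        return answer(problem)
    partial_results = [
        recursive_solve(part, is_simple, answer, split, combine, depth + 1, max_depth)
        for part in split(problem)
    ]
    return combine(partial_results)

# Toy stand-ins: the "problem" is a list of numbers to add up.
direct_calls = []

def toy_answer(numbers):
    direct_calls.append(list(numbers))  # record each direct (non-recursive) answer
    return sum(numbers)

def toy_split(numbers):
    middle = len(numbers) // 2
    return [numbers[:middle], numbers[middle:]]

result = recursive_solve(
    problem=[1, 2, 3, 4, 5, 6, 7, 8],
    is_simple=lambda numbers: len(numbers) <= 2,
    answer=toy_answer,
    split=toy_split,
    combine=sum,
)
```

In this toy run a problem that is already simple costs one direct call, while the eight-number problem is split twice before being answered, which mirrors the abstract's point about not spending extra tokens on straightforward cases.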