2024-08-12

Title: APE: Active Learning-based Tooling for Finding Informative Few-shot Examples for LLM-based Entity Matching

Authors: Kun Qian, Yisi Sang, Farima Fatahi Bayat, Anton Belyi, Xianqi Chu, Yash Govind, Samira Khorshidi, Rahul Khot, Katherine Luna, Azadeh Nikfarjam, Xiaoguang Qi, Fei Wu, Xianhan Zhang, Yunyao Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04637
Pdf URL: https://arxiv.org/pdf/2408.04637
Copy Paste: [[2408.04637]] APE: Active Learning-based Tooling for Finding Informative Few-shot Examples for LLM-based Entity Matching(https://arxiv.org/abs/2408.04637)
Keywords: language model, llm, prompt
Abstract: Prompt engineering is an iterative procedure often requiring extensive manual effort to formulate suitable instructions for effectively directing large language models (LLMs) in specific tasks. Incorporating few-shot examples is a vital and effective approach to providing LLMs with precise instructions, leading to improved LLM performance. Nonetheless, identifying the most informative demonstrations for LLMs is labor-intensive, frequently entailing sifting through an extensive search space. In this demonstration, we showcase a human-in-the-loop tool called APE (Active Prompt Engineering) designed for refining prompts through active learning. Drawing inspiration from active learning, APE iteratively selects the most ambiguous examples for human feedback, which will be transformed into few-shot examples within the prompt. The demo recording can be found with the submission or be viewed at this https URL.
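Illustration (not from the paper): the abstract says APE picks the most ambiguous examples for human feedback but does not give the selection rule. A minimal sketch, assuming each candidate record pair already has a model-estimated match probability and that "ambiguous" means closest to 0.5. The function name, pair ids, and scores are all hypothetical.

```python
from typing import List, Tuple


def select_most_ambiguous(scored_pairs: List[Tuple[str, float]], k: int) -> List[str]:
    """Return the ids of the k pairs whose match probability is closest to 0.5.

    This is generic uncertainty sampling, assumed here for illustration;
    the abstract does not state APE's actual ambiguity criterion.
    """
    ranked = sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))
    return [pair_id for pair_id, _ in ranked[:k]]


# Hypothetical candidate pairs with made-up match probabilities.
candidates = [("pair-a", 0.97), ("pair-b", 0.52), ("pair-c", 0.08), ("pair-d", 0.41)]
ambiguous = select_most_ambiguous(candidates, k=2)
print(ambiguous)
```

In a loop of this kind, the selected pairs would be shown to a human, and the labeled pairs would then be appended to the prompt as few-shot examples before the next round of scoring.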

Title: Affective Computing in the Era of Large Language Models: A Survey from the NLP Perspective

Authors: Yiqun Zhang, Xiaocui Yang, Xingle Xu, Zeran Gao, Yijie Huang, Shiyi Mu, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song, Ge Yu
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2408.04638
Pdf URL: https://arxiv.org/pdf/2408.04638
Copy Paste: [[2408.04638]] Affective Computing in the Era of Large Language Models: A Survey from the NLP Perspective(https://arxiv.org/abs/2408.04638)
Keywords: language model, gpt, llm, prompt, chat, agent
Abstract: Affective Computing (AC), integrating computer science, psychology, and cognitive science knowledge, aims to enable machines to recognize, interpret, and simulate human emotions. To create more value, AC can be applied to diverse scenarios, including social media, finance, healthcare, education, etc. Affective Computing (AC) includes two mainstream tasks, i.e., Affective Understanding (AU) and Affective Generation (AG). Fine-tuning Pre-trained Language Models (PLMs) for AU tasks has succeeded considerably. However, these models lack generalization ability, requiring specialized models for specific tasks. Additionally, traditional PLMs face challenges in AG, particularly in generating diverse and emotionally rich responses. The emergence of Large Language Models (LLMs), such as the ChatGPT series and LLaMA models, brings new opportunities and challenges, catalyzing a paradigm shift in AC. LLMs possess capabilities of in-context learning, common sense reasoning, and advanced sequence generation, which present unprecedented opportunities for AU. To provide a comprehensive overview of AC in the LLMs era from an NLP perspective, we summarize the development of LLMs research in this field, aiming to offer new insights. Specifically, we first summarize the traditional tasks related to AC and introduce the preliminary study based on LLMs. Subsequently, we outline the relevant techniques of popular LLMs to improve AC tasks, including Instruction Tuning and Prompt Engineering. For Instruction Tuning, we discuss full parameter fine-tuning and parameter-efficient methods such as LoRA, P-Tuning, and Prompt Tuning. In Prompt Engineering, we examine Zero-shot, Few-shot, Chain of Thought (CoT), and Agent-based methods for AU and AG. To clearly understand the performance of LLMs on different Affective Computing tasks, we further summarize the existing benchmarks and evaluation methods.

Title: Abstractive summarization from Audio Transcription

Authors: Ilia Derkach
Subjects: cs.CL, cs.IR, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2408.04639
Pdf URL: https://arxiv.org/pdf/2408.04639
Copy Paste: [[2408.04639]] Abstractive summarization from Audio Transcription(https://arxiv.org/abs/2408.04639)
Keywords: language model
Abstract: Large language models are currently gaining popularity, and their achievements are used in many areas, ranging from text translation to generating answers to queries. However, the main problem with these new machine learning algorithms is that training such models requires large computing resources that only large IT companies have. To avoid this problem, a number of methods (LoRA, quantization) have been proposed so that existing models can be effectively fine-tuned for specific tasks. In this paper, we propose an end-to-end (E2E) audio summarization model using these techniques. In addition, this paper examines the effectiveness of these approaches to the problem under consideration and draws conclusions about the applicability of these methods.

Title: LLMs for Enhanced Agricultural Meteorological Recommendations

Authors: Ji-jun Park, Soo-joon Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04640
Pdf URL: https://arxiv.org/pdf/2408.04640
Copy Paste: [[2408.04640]] LLMs for Enhanced Agricultural Meteorological Recommendations(https://arxiv.org/abs/2408.04640)
Keywords: language model, gpt, llm, prompt, chat, chain-of-thought
Abstract: Agricultural meteorological recommendations are crucial for enhancing crop productivity and sustainability by providing farmers with actionable insights based on weather forecasts, soil conditions, and crop-specific data. This paper presents a novel approach that leverages large language models (LLMs) and prompt engineering to improve the accuracy and relevance of these recommendations. We designed a multi-round prompt framework to iteratively refine recommendations using updated data and feedback, implemented on ChatGPT, Claude2, and GPT-4. Our method was evaluated against baseline models and a Chain-of-Thought (CoT) approach using manually collected datasets. The results demonstrate significant improvements in accuracy and contextual relevance, with our approach achieving up to 90% accuracy and high GPT-4 scores. Additional validation through real-world pilot studies further confirmed the practical benefits of our method, highlighting its potential to transform agricultural practices and decision-making.

Title: GPT-3 Powered Information Extraction for Building Robust Knowledge Bases

Authors: Ritabrata Roy Choudhury, Soumik Dey
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04641
Pdf URL: https://arxiv.org/pdf/2408.04641
Copy Paste: [[2408.04641]] GPT-3 Powered Information Extraction for Building Robust Knowledge Bases(https://arxiv.org/abs/2408.04641)
Keywords: language model, gpt, chat
Abstract: This work uses the state-of-the-art language model GPT-3 to offer a novel method of information extraction for knowledge base development. The suggested method attempts to solve the difficulties associated with obtaining relevant entities and relationships from unstructured text in order to extract structured information. We conduct experiments on a huge corpus of text from diverse fields to assess the performance of our suggested technique. The evaluation measures, which are frequently employed in information extraction tasks, include precision, recall, and F1-score. The findings demonstrate that GPT-3 can be used to efficiently and accurately extract pertinent and correct information from text, hence increasing the precision and productivity of knowledge base creation. We also assess how well our suggested approach performs in comparison to the most advanced information extraction techniques already in use. The findings show that by utilizing only a small number of instances in in-context learning, our suggested strategy yields competitive outcomes with notable savings in terms of data annotation and engineering expense. Additionally, we use our proposed method to retrieve Biomedical information, demonstrating its practicality in a real-world setting. All things considered, our suggested method offers a viable way to overcome the difficulties involved in obtaining structured data from unstructured text in order to create knowledge bases. It can greatly increase the precision and effectiveness of information extraction, which is necessary for many applications including chatbots, recommendation engines, and question-answering systems.
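Illustration (not from the paper): the abstract names precision, recall, and F1-score as its evaluation measures. A minimal sketch of how these are commonly computed for extraction output, assuming predictions and gold annotations are sets of (subject, relation, object) triples compared by exact match. The triples below are invented for the example.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def precision_recall_f1(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of extracted triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations and model output.
gold = {
    ("aspirin", "treats", "headache"),
    ("aspirin", "inhibits", "COX-1"),
    ("ibuprofen", "treats", "fever"),
    ("ibuprofen", "inhibits", "COX-2"),
}
predicted = {
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "fever"),
    ("ibuprofen", "treats", "headache"),  # not in gold: a false positive
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching is the strictest choice; evaluations of this kind sometimes relax it (for example, by normalizing entity strings first), which the abstract does not specify.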

Title: Risks, Causes, and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey

Authors: Md Nazmus Sakib, Md Athikul Islam, Royal Pathak, Md Mashrur Arifin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04643
Pdf URL: https://arxiv.org/pdf/2408.04643
Copy Paste: [[2408.04643]] Risks, Causes, and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey(https://arxiv.org/abs/2408.04643)
Keywords: language model, gpt, llm, chat
Abstract: Recent advancements in Large Language Models (LLMs), such as ChatGPT and LLaMA, have significantly transformed Natural Language Processing (NLP) with their outstanding abilities in text generation, summarization, and classification. Nevertheless, their widespread adoption introduces numerous challenges, including issues related to academic integrity, copyright, environmental impacts, and ethical considerations such as data bias, fairness, and privacy. The rapid evolution of LLMs also raises concerns regarding the reliability and generalizability of their evaluations. This paper offers a comprehensive survey of the literature on these subjects, systematically gathered and synthesized from Google Scholar. Our study provides an in-depth analysis of the risks associated with specific LLMs, identifying sub-risks, their causes, and potential solutions. Furthermore, we explore the broader challenges related to LLMs, detailing their causes and proposing mitigation strategies. Through this literature analysis, our survey aims to deepen the understanding of the implications and complexities surrounding these powerful models.

Title: Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course

Authors: Sebastian Kahl, Felix Löffler, Martin Maciol, Fabian Ridder, Marius Schmitz, Jennifer Spanagel, Jens Wienkamp, Christopher Burgahn, Malte Schilling
Subjects: cs.CL, cs.AI, cs.CY, cs.RO
Abstract URL: https://arxiv.org/abs/2408.04645
Pdf URL: https://arxiv.org/pdf/2408.04645
Copy Paste: [[2408.04645]] Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course(https://arxiv.org/abs/2408.04645)
Keywords: language model, llm, prompt
Abstract: This study evaluates the performance of Large Language Models (LLMs) as an Artificial Intelligence-based tutor for a university course. In particular, different advanced techniques are utilized, such as prompt engineering, Retrieval-Augmented-Generation (RAG), and fine-tuning. We assessed the different models and applied techniques using common similarity metrics like BLEU-4, ROUGE, and BERTScore, complemented by a small human evaluation of helpfulness and trustworthiness. Our findings indicate that RAG combined with prompt engineering significantly enhances model responses and produces better factual answers. In the context of education, RAG appears to be an ideal technique, as it is based on enriching the input of the model with additional information and material that is usually already available for a university course. Fine-tuning, on the other hand, can produce quite small yet still strong expert models, but poses the danger of overfitting. Our study further asks how we measure the performance of LLMs and how well current measurements represent correctness or relevance. We find high correlation among similarity metrics and a bias of most of these metrics towards shorter responses. Overall, our research points to both the potential and challenges of integrating LLMs in educational settings, suggesting a need for balanced training approaches and advanced evaluation frameworks.

Title: Efficacy of Large Language Models in Systematic Reviews

Authors: Aaditya Shah, Shridhar Mehendale, Siddha Kanthi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04646
Pdf URL: https://arxiv.org/pdf/2408.04646
Copy Paste: [[2408.04646]] Efficacy of Large Language Models in Systematic Reviews(https://arxiv.org/abs/2408.04646)
Keywords: language model, gpt, llm, prompt
Abstract: This study investigates the effectiveness of Large Language Models (LLMs) in interpreting existing literature through a systematic review of the relationship between Environmental, Social, and Governance (ESG) factors and financial performance. The primary objective is to assess how LLMs can replicate a systematic review on a corpus of ESG-focused papers. We compiled and hand-coded a database of 88 relevant papers published from March 2020 to May 2024. Additionally, we used a set of 238 papers from a previous systematic review of ESG literature from January 2015 to February 2020. We evaluated two current state-of-the-art LLMs, Meta AI's Llama 3 8B and OpenAI's GPT-4o, on the accuracy of their interpretations relative to human-made classifications on both sets of papers. We then compared these results to a "Custom GPT" and a fine-tuned GPT-4o Mini model using the corpus of 238 papers as training data. The fine-tuned GPT-4o Mini model outperformed the base LLMs by 28.3% on average in overall accuracy on prompt 1. At the same time, the "Custom GPT" showed a 3.0% and 15.7% improvement on average in overall accuracy on prompts 2 and 3, respectively. Our findings reveal promising results for investors and agencies to leverage LLMs to summarize complex evidence related to ESG investing, thereby enabling quicker decision-making and a more efficient market.

Title: Distinguishing Chatbot from Human

Authors: Gauri Anil Godghase, Rishit Agrawal, Tanush Obili, Mark Stamp
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04647
Pdf URL: https://arxiv.org/pdf/2408.04647
Copy Paste: [[2408.04647]] Distinguishing Chatbot from Human(https://arxiv.org/abs/2408.04647)
Keywords: language model, gpt, llm, chat
Abstract: There have been many recent advances in the fields of generative Artificial Intelligence (AI) and Large Language Models (LLM), with the Generative Pre-trained Transformer (GPT) model being a leading "chatbot." LLM-based chatbots have become so powerful that it may seem difficult to differentiate between human-written and machine-generated text. To analyze this problem, we have developed a new dataset consisting of more than 750,000 human-written paragraphs, with a corresponding chatbot-generated paragraph for each. Based on this dataset, we apply Machine Learning (ML) techniques to determine the origin of text (human or chatbot). Specifically, we consider two methodologies for tackling this issue: feature analysis and embeddings. Our feature analysis approach involves extracting a collection of features from the text for classification. We also explore the use of contextual embeddings and transformer-based architectures to train classification models. Our proposed solutions offer high classification accuracy and serve as useful tools for textual analysis, resulting in a better understanding of chatbot-generated text in this era of advanced AI technology.

Title: PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models

Authors: Alexey Tikhonov
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2408.04648
Pdf URL: https://arxiv.org/pdf/2408.04648
Copy Paste: [[2408.04648]] PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models(https://arxiv.org/abs/2408.04648)
Keywords: language model, llm
Abstract: We present PLUGH (this https URL), a modern benchmark that currently consists of 5 tasks, each with 125 input texts extracted from 48 different games and representing 61 different (non-isomorphic) spatial graphs to assess the abilities of Large Language Models (LLMs) for spatial understanding and reasoning. Our evaluation of API-based and open-sourced LLMs shows that while some commercial LLMs exhibit strong reasoning abilities, open-sourced competitors can demonstrate almost the same level of quality; however, all models still have significant room for improvement. We identify typical reasons for LLM failures and discuss possible ways to deal with them. Datasets and evaluation code are released (this https URL).

Title: Chain of Stance: Stance Detection with Large Language Models

Authors: Junxia Ma, Changjiang Wang, Hanwen Xing, Dongming Zhao, Yazhou Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04649
Pdf URL: https://arxiv.org/pdf/2408.04649
Copy Paste: [[2408.04649]] Chain of Stance: Stance Detection with Large Language Models(https://arxiv.org/abs/2408.04649)
Keywords: language model, llm, prompt
Abstract: Stance detection is an active task in natural language processing (NLP) that aims to identify the author's stance towards a particular target within a text. Given the remarkable language understanding capabilities and encyclopedic prior knowledge of large language models (LLMs), how to explore the potential of LLMs in stance detection has received significant attention. Unlike existing LLM-based approaches that focus solely on fine-tuning with large-scale datasets, we propose a new prompting method, called Chain of Stance (CoS). In particular, it positions LLMs as expert stance detectors by decomposing the stance detection process into a series of intermediate, stance-related assertions that culminate in the final judgment. This approach leads to significant improvements in classification performance. We conducted extensive experiments using four SOTA LLMs on the SemEval 2016 dataset, covering the zero-shot and few-shot learning setups. The results indicate that the proposed method achieves state-of-the-art results with an F1 score of 79.84 in the few-shot setting.

Title: Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools

Authors: Jung In Park, Mahyar Abbasian, Iman Azimi, Dawn Bounds, Angela Jun, Jaesu Han, Robert McCarron, Jessica Borelli, Jia Li, Mona Mahmoudi, Carmen Wiedenhoeft, Amir Rahmani
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04650
Pdf URL: https://arxiv.org/pdf/2408.04650
Copy Paste: [[2408.04650]] Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools(https://arxiv.org/abs/2408.04650)
Keywords: language model, gpt, llm, chat, agent
Abstract: Objective: This study aims to develop and validate an evaluation framework to ensure the safety and reliability of mental health chatbots, which are increasingly popular due to their accessibility, human-like interactions, and context-aware support. Materials and Methods: We created an evaluation framework with 100 benchmark questions and ideal responses, and five guideline questions for chatbot responses. This framework, validated by mental health experts, was tested on a GPT-3.5-turbo-based chatbot. Automated evaluation methods explored included large language model (LLM)-based scoring, an agentic approach using real-time data, and embedding models to compare chatbot responses against ground truth standards. Results: The results highlight the importance of guidelines and ground truth for improving LLM evaluation accuracy. The agentic method, dynamically accessing reliable information, demonstrated the best alignment with human assessments. Adherence to a standardized, expert-validated framework significantly enhanced chatbot response safety and reliability. Discussion: Our findings emphasize the need for comprehensive, expert-tailored safety evaluation metrics for mental health chatbots. While LLMs have significant potential, careful implementation is necessary to mitigate risks. The superior performance of the agentic approach underscores the importance of real-time data access in enhancing chatbot reliability. Conclusion: The study validated an evaluation framework for mental health chatbots, proving its effectiveness in improving safety and reliability. Future work should extend evaluations to accuracy, bias, empathy, and privacy to ensure holistic assessment and responsible integration into healthcare. Standardized evaluations will build trust among users and professionals, facilitating broader adoption and improved mental health support through technology.

Title: Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding

Authors: Balaji Muralidharan, Hayden Beadles, Reza Marzban, Kalyan Sashank Mupparaju
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04651
Pdf URL: https://arxiv.org/pdf/2408.04651
Copy Paste: [[2408.04651]] Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding(https://arxiv.org/abs/2408.04651)
Keywords: language model, llm
Abstract: This project investigates the efficacy of Large Language Models (LLMs) in understanding and extracting scientific knowledge across specific domains and creates a deep learning framework: Knowledge AI. As a part of this framework, we employ pre-trained models and fine-tune them on datasets in the scientific domain. The models are adapted for four key Natural Language Processing (NLP) tasks: summarization, text generation, question answering, and named entity recognition. Our results indicate that domain-specific fine-tuning significantly enhances model performance in each of these tasks, thereby improving their applicability for scientific contexts. This adaptation enables non-experts to efficiently query and extract information within targeted scientific fields, demonstrating the potential of fine-tuned LLMs as a tool for knowledge discovery in the sciences.

Title: Leveraging Large Language Models with Chain-of-Thought and Prompt Engineering for Traffic Crash Severity Analysis and Inference

Authors: Hao Zhen, Yucheng Shi, Yongcan Huang, Jidong J. Yang, Ninghao Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04652
Pdf URL: https://arxiv.org/pdf/2408.04652
Copy Paste: [[2408.04652]] Leveraging Large Language Models with Chain-of-Thought and Prompt Engineering for Traffic Crash Severity Analysis and Inference(https://arxiv.org/abs/2408.04652)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Harnessing the power of Large Language Models (LLMs), this study explores the use of three state-of-the-art LLMs, specifically GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B, for crash severity inference, framing it as a classification task. We generate textual narratives from original traffic crash tabular data using a pre-built template infused with domain knowledge. Additionally, we incorporated Chain-of-Thought (CoT) reasoning to guide the LLMs in analyzing the crash causes and then inferring the severity. This study also examines the impact of prompt engineering specifically designed for crash severity inference. The LLMs were tasked with crash severity inference to: (1) evaluate the models' capabilities in crash severity analysis, (2) assess the effectiveness of CoT and domain-informed prompt engineering, and (3) examine the reasoning abilities with the CoT framework. Our results showed that LLaMA3-70B consistently outperformed the other models, particularly in zero-shot settings. The CoT and Prompt Engineering techniques significantly enhanced performance, improving logical reasoning and addressing alignment issues. Notably, the CoT offers valuable insights into LLMs' reasoning processes, unleashing their capacity to consider diverse factors such as environmental conditions, driver behavior, and vehicle characteristics in severity analysis and inference.

Title: Strong and weak alignment of large language models with human values

Authors: Mehdi Khamassi, Marceau Nahon, Raja Chatila
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04655
Pdf URL: https://arxiv.org/pdf/2408.04655
Copy Paste: [[2408.04655]] Strong and weak alignment of large language models with human values(https://arxiv.org/abs/2408.04655)
Keywords: language model, gpt, llm, prompt, chat, agent
Abstract: Minimizing negative impacts of Artificial Intelligence (AI) systems on human societies without human supervision requires them to be able to align with human values. However, most current work only addresses this issue from a technical point of view, e.g., improving current methods relying on reinforcement learning from human feedback, neglecting what it means and is required for alignment to occur. Here, we propose to distinguish strong and weak value alignment. Strong alignment requires cognitive abilities (either human-like or different from humans) such as understanding and reasoning about agents' intentions and their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations presenting a risk that human values may be flouted. To illustrate this distinction, we present a series of prompts showing ChatGPT's, Gemini's and Copilot's failures to recognize some of these situations. We moreover analyze word embeddings to show that the nearest neighbors of some human values in LLMs differ from humans' semantic representations. We then propose a new thought experiment that we call "the Chinese room with a word transition dictionary", in extension of John Searle's famous proposal. We finally mention current promising research directions towards a weak alignment, which could produce statistically satisfying answers in a number of common situations, however so far without ensuring any truth value.

Title: Winning Amazon KDD Cup'24

Authors: Chris Deotte, Ivan Sorokin, Ahmet Erdem, Benedikt Schifferer, Gilberto Titericz Jr, Simon Jegou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04658
Pdf URL: https://arxiv.org/pdf/2408.04658
Copy Paste: [[2408.04658]] Winning Amazon KDD Cup'24(https://arxiv.org/abs/2408.04658)
Keywords: language model, llm
Abstract: This paper describes the winning solution of all 5 tasks for the Amazon KDD Cup 2024 Multi Task Online Shopping Challenge for LLMs. The challenge was to build a useful assistant, answering questions in the domain of online shopping. The competition contained 57 diverse tasks, covering 5 different task types (e.g. multiple choice) and across 4 different tracks (e.g. multi-lingual). Our solution is a single model per track. We fine-tune Qwen2-72B-Instruct on our own training dataset. As the competition released only 96 example questions, we developed our own training dataset by processing multiple public datasets or using Large Language Models for data augmentation and synthetic data generation. We apply wise-ft to account for distribution shifts and ensemble multiple LoRA adapters in one model. We employed Logits Processors to constrain the model output on relevant tokens for the tasks. AWQ 4-bit Quantization and vLLM are used during inference to predict the test dataset within the time constraints of 20 to 140 minutes depending on the track. Our solution achieved first place in each individual track and first place overall in Amazon's KDD Cup 2024.
摘要：本文介绍了亚马逊 KDD Cup 2024 面向 LLM 的多任务在线购物挑战赛所有 5 项任务的获胜解决方案。挑战是构建一个有用的助手，回答在线购物领域的问题。比赛包含 57 项不同的任务，涵盖 5 种不同的任务类型（例如多项选择）和 4 种不同的赛道（例如多语言）。我们的解决方案是每个赛道一个模型。我们在自己的训练数据集上对 Qwen2-72B-Instruct 进行了微调。由于比赛只发布了 96 个示例问题，我们通过处理多个公共数据集或使用大型语言模型进行数据增强和合成数据生成来开发自己的训练数据集。我们应用 wise-ft 来考虑分布偏移，并在一个模型中集成多个 LoRA 适配器。我们使用 Logits 处理器将模型输出限制在与任务相关的标记上。在推理过程中使用 AWQ 4 位量化和 vLLM 来预测测试数据集，时间限制为 20 到 140 分钟，具体取决于赛道。我们的解决方案在每个单项比赛中均夺得第一，并获得了 2024 年亚马逊 KDD 杯的总体第一名。
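Illustrative note on the wise-ft step mentioned above: WiSE-FT is usually described as a linear interpolation between the pretrained weights and the fine-tuned weights. The sketch below shows only that interpolation on toy parameter lists. The function name, the dictionary-of-lists checkpoint format, and the value of `alpha` are assumptions made for illustration; the abstract does not say how the authors applied it to Qwen2-72B-Instruct or to their LoRA adapters.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]  # hypothetical toy format: parameter name -> flat values


def interpolate_weights(pretrained: Checkpoint, finetuned: Checkpoint, alpha: float) -> Checkpoint:
    """Return (1 - alpha) * pretrained + alpha * finetuned, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both checkpoints must have the same parameter names")
    merged: Checkpoint = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        if len(base) != len(tuned):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # alpha = 0 keeps the pretrained weights, alpha = 1 keeps the fine-tuned ones.
        merged[name] = [(1.0 - alpha) * b + alpha * t for b, t in zip(base, tuned)]
    return merged


if __name__ == "__main__":
    base = {"w": [0.0, 2.0]}
    tuned = {"w": [1.0, 4.0]}
    print(interpolate_weights(base, tuned, alpha=0.5))
```

Values of `alpha` between 0 and 1 trade off in-distribution accuracy against robustness to distribution shift; in practice the value would be chosen on held-out data.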

Title: XMainframe: A Large Language Model for Mainframe Modernization

Authors: Anh T. V. Dau, Hieu Trung Dao, Anh Tuan Nguyen, Hieu Trung Tran, Phong X. Nguyen, Nghi D. Q. Bui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04660
Pdf URL: https://arxiv.org/pdf/2408.04660
Copy Paste: [[2408.04660]] XMainframe: A Large Language Model for Mainframe Modernization(https://arxiv.org/abs/2408.04660)
Keywords: language model, gpt, llm
Abstract: Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves the creation of an extensive data collection pipeline to produce high-quality training datasets, enhancing XMainframe's performance in this specialized domain. Additionally, we present MainframeBench, a comprehensive benchmark for assessing mainframe knowledge, including multiple-choice questions, question answering, and COBOL code summarization. Our empirical evaluations demonstrate that XMainframe consistently outperforms existing state-of-the-art LLMs across these tasks. Specifically, XMainframe achieves 30% higher accuracy than DeepSeek-Coder on multiple-choice questions, doubles the BLEU score of Mixtral-Instruct 8x7B on question answering, and scores six times higher than GPT-3.5 on COBOL summarization. Our work highlights the potential of XMainframe to drive significant advancements in managing and modernizing legacy systems, thereby enhancing productivity and saving time for software developers.
摘要：尽管大型机操作系统诞生于 20 世纪 40 年代，但它们仍在为金融和政府等关键行业提供支持。然而，这些系统通常被视为过时，需要大量维护和现代化改造。应对这一挑战需要能够理解和与遗留代码库交互的创新工具。为此，我们推出了 XMainframe，这是一种最先进的大型语言模型 (LLM)，专门设计用于了解大型机遗留系统和 COBOL 代码库。我们的解决方案涉及创建广泛的数据收集管道以生成高质量的训练数据集，从而增强 XMainframe 在这一专业领域的性能。此外，我们还推出了 MainframeBench，这是一个用于评估大型机知识的综合基准，包括多项选择题、问答和 COBOL 代码摘要。我们的实证评估表明，XMainframe 在这些任务中始终优于现有的最先进的 LLM。具体来说，XMainframe 在多项选择题上的准确率比 DeepSeek-Coder 高 30%，在问答题上的 BLEU 得分是 Mixtral-Instruct 8x7B 的两倍，在 COBOL 摘要上的得分比 GPT-3.5 高六倍。我们的工作凸显了 XMainframe 在管理和现代化遗留系统方面取得重大进步的潜力，从而提高软件开发人员的生产力并节省时间。

Title: MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities

Authors: Ali Riza Durmaz, Akhil Thomas, Lokesh Mishra, Rachana Niranjan Murthy, Thomas Straub
Subjects: cs.CL, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2408.04661
Pdf URL: https://arxiv.org/pdf/2408.04661
Copy Paste: [[2408.04661]] MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities(https://arxiv.org/abs/2408.04661)
Keywords: language model
Abstract: While large language models learn sound statistical representations of the language and information therein, ontologies are symbolic knowledge representations that can complement the former ideally. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology where ontological concepts from the mechanics of materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its eminently fine-granular annotation. Specifically, 179 distinct classes are manually annotated by three raters within four publications, amounting to a total of 2191 entities that were annotated and curated. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and perform fine-tuning of pre-trained models to showcase the feasibility of named-entity recognition model training. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.
摘要：虽然大型语言模型可以学习语言及其信息的可靠统计表示，但本体是符号知识表示，可以理想地补充前者。在这个关键交叉点的研究依赖于将本体和文本语料库交织在一起的数据集，以便对神经符号模型进行训练和全面基准测试。我们介绍了 MaterioMiner 数据集和链接的材料力学本体，其中材料力学领域的本体概念与文献语料库中的文本实体相关联。该数据集的另一个显着特征是其非常细粒度的注释。具体而言，三位评分者在四个出版物中手动注释了 179 个不同的类别，总共注释和整理了 2191 个实体。提出了因果组合-过程-微观结构-属性关系的符号表示的概念工作。我们探索了三位评分者之间的注释一致性，并对预训练模型进行了微调，以展示命名实体识别模型训练的可行性。重复使用数据集可以促进材料语言模型的训练和基准测试、自动本体构建以及从文本数据生成知识图谱。

Title: Citekit: A Modular Toolkit for Large Language Model Citation Generation

Authors: Jiajun Shen, Tong Zhou, Suifeng Zhao, Yubo Chen, Kang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04662
Pdf URL: https://arxiv.org/pdf/2408.04662
Copy Paste: [[2408.04662]] Citekit: A Modular Toolkit for Large Language Model Citation Generation(https://arxiv.org/abs/2408.04662)
Keywords: language model, llm
Abstract: Enabling Large Language Models (LLMs) to generate citations in Question-Answering (QA) tasks is an emerging paradigm aimed at enhancing the verifiability of their responses when LLMs are utilizing external references to generate an answer. However, there is currently no unified framework to standardize and fairly compare different citation generation methods, leading to difficulties in reproducing different methods and a comprehensive assessment. To cope with the problems above, we introduce Citekit, an open-source and modular toolkit designed to facilitate the implementation and evaluation of existing citation generation methods, while also fostering the development of new approaches to improve citation quality in LLM outputs. This tool is highly extensible, allowing users to utilize 4 main modules and 14 components to construct a pipeline, evaluating an existing method or innovative designs. Our experiments with two state-of-the-art LLMs and 11 citation generation baselines demonstrate varying strengths of different modules in answer accuracy and citation quality improvement, as well as the challenge of enhancing granularity. Based on our analysis of the effectiveness of components, we propose a new method, self-RAG \snippet, obtaining a balanced answer accuracy and citation quality. Citekit is released at this https URL.
摘要：使大型语言模型 (LLM) 能够在问答 (QA) 任务中生成引文是一种新兴范式，旨在提高 LLM 利用外部参考生成答案时其响应的可验证性。然而，目前没有统一的框架来标准化和公平地比较不同的引文生成方法，导致难以重现不同的方法和进行全面的评估。为了解决上述问题，我们引入了 \name，这是一个开源的模块化工具包，旨在促进现有引文生成方法的实施和评估，同时也促进了开发新方法来提高 LLM 输出中的引文质量。该工具具有高度的可扩展性，允许用户利用 4 个主要模块和 14 个组件来构建管道，评估现有方法或创新设计。我们对两个最先进的 LLM 和 11 个引文生成基线进行的实验表明，不同模块在答案准确性和引文质量改进方面具有不同的优势，以及提高粒度的挑战。基于我们对组件有效性的分析，我们提出了一种新方法，即 self-RAG \snippet，以获得平衡的答案准确性和引用质量。Citekit 在此 https URL 上发布。

Title: Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

Authors: Avshalom Manevich, Reut Tsarfaty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04664
Pdf URL: https://arxiv.org/pdf/2408.04664
Copy Paste: [[2408.04664]] Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)(https://arxiv.org/abs/2408.04664)
Keywords: language model, llm, hallucination
Abstract: Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities. However, LVLMs struggle with object hallucinations due to their reliance on text cues and learned object co-occurrence biases. While most research quantifies these hallucinations, mitigation strategies are still lacking. Our study introduces a Language Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on LLM distribution confidence levels, effectively reducing object hallucinations. We demonstrate the advantages of LCD in leading LVLMs, showing up to 4% improvement in POPE F1 scores and up to 36% reduction in CHAIR scores on the COCO validation set, while also improving captioning quality scores. Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models. Our findings highlight the potential of further exploration of LVLM-specific decoding algorithms.
摘要：大型视觉语言模型 (LVLM) 是大型语言模型 (LLM) 的扩展，有助于处理图像和文本输入，从而扩展 AI 功能。然而，由于 LVLM 依赖文本提示和学习到的对象共现偏差，因此难以应对对象幻觉。虽然大多数研究都量化了这些幻觉，但缓解策略仍然缺乏。我们的研究引入了一种语言对比解码 (LCD) 算法，该算法根据 LLM 分布置信度调整 LVLM 输出，从而有效减少对象幻觉。我们展示了 LCD 在领先 LVLM 方面的优势，显示 POPE F1 分数提高了 4%，COCO 验证集上的 CHAIR 分数降低了 36%，同时还提高了字幕质量分数。我们的方法有效地改进了 LVLM，而无需复杂的后处理或重新训练，并且很容易应用于不同的模型。我们的研究结果凸显了进一步探索 LVLM 特定解码算法的潜力。

Title: LLM-based MOFs Synthesis Condition Extraction using Few-Shot Demonstrations

Authors: Lei Shi, Zhimeng Liu, Yi Yang, Weize Wu, Yuyang Zhang, Hongbo Zhang, Jing Lin, Siyu Wu, Zihan Chen, Ruiming Li, Nan Wang, Zipeng Liu, Huobin Tan, Hongyi Gao, Yue Zhang, Ge Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04665
Pdf URL: https://arxiv.org/pdf/2408.04665
Copy Paste: [[2408.04665]] LLM-based MOFs Synthesis Condition Extraction using Few-Shot Demonstrations(https://arxiv.org/abs/2408.04665)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: The extraction of Metal-Organic Frameworks (MOFs) synthesis conditions from literature text has been challenging but crucial for the logical design of new MOFs with desirable functionality. The recent advent of large language models (LLMs) provides a disruptively new solution to this long-standing problem, and recent studies have reported over 90% F1 in extracting correct conditions from MOFs literature. We argue in this paper that most existing synthesis extraction practices with LLMs stay with primitive zero-shot learning, which could lead to degraded extraction and application performance due to the lack of specialized knowledge. This work pioneers and optimizes the few-shot in-context learning paradigm for LLM extraction of material synthesis conditions. First, we propose a human-AI joint data curation process to secure high-quality ground-truth demonstrations for few-shot learning. Second, we apply a BM25 algorithm based on the retrieval-augmented generation (RAG) technique to adaptively select few-shot demonstrations for each MOF's extraction. Over a dataset randomly sampled from 84,898 well-defined MOFs, the proposed few-shot method achieves much higher average F1 performance (0.93 vs. 0.81, +14.8%) than the native zero-shot LLM using the same GPT-4 model, under fully automatic evaluation that is more objective than the previous human evaluation. The proposed method is further validated through real-world material experiments: compared with the baseline zero-shot LLM, the proposed few-shot approach increases the MOFs structural inference performance (R^2) by 29.4% on average.
摘要：从文献文本中提取金属有机框架 (MOF) 的合成条件一直是一项挑战，但对于具有理想功能的新 MOF 的逻辑设计至关重要。大型语言模型 (LLM) 的出现为这个长期存在的问题提供了颠覆性的新解决方案，最新研究报告称，从 MOF 文献中提取正确条件的 F1 超过 90%。我们在本文中指出，大多数现有的使用 LLM 的合成提取实践都停留在原始的零样本学习上，这可能会导致由于缺乏专业知识而导致提取和应用性能下降。这项工作开创并优化了用于 LLM 提取材料合成条件的少样本上下文学习范式。首先，我们提出了一种人机联合数据管理流程，以确保少样本学习的高质量真实演示。其次，我们应用基于检索增强生成 (RAG) 技术的 BM25 算法来自适应地选择每个 MOF 提取的少样本演示。在从 84,898 个定义明确的 MOF 中随机抽样的数据集上，所提出的少样本方法在全自动评估下（比之前的人工评估更客观）实现了比使用相同 GPT-4 模型的原生零样本 LLM 高得多的平均 F1 性能（0.93 vs. 0.81，+14.8%）。所提出的方法通过真实材料实验得到进一步验证：与基线零样本 LLM 相比，所提出的少样本方法将 MOF 结构推理性能（R^2）平均提高了 29.4%。
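Illustrative note on the BM25-based demonstration selection described above: the sketch below scores a pool of annotated examples against a query paragraph with the standard Okapi BM25 formula and returns the top-k as few-shot demonstrations. Whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, the function names, and the toy pool are all assumptions for illustration, not the authors' pipeline.

```python
import math
from collections import Counter
from typing import List, Tuple


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))
    q_terms = query.lower().split()
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for t in q_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_demonstrations(query: str, pool: List[Tuple[str, str]], k: int = 2) -> List[Tuple[str, str]]:
    """Return the k (text, annotation) pairs whose text is most similar to the query."""
    scores = bm25_scores(query, [text for text, _ in pool])
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical pool of (paragraph, extracted conditions) pairs.
    pool = [
        ("ZIF-8 was synthesized by mixing zinc nitrate and 2-methylimidazole in methanol", "A"),
        ("The weather was sunny", "B"),
        ("MOF-5 was prepared from zinc nitrate and terephthalic acid in DMF at 100 C", "C"),
    ]
    for text, label in select_demonstrations("zinc nitrate in methanol at room temperature", pool, k=2):
        print(label, text)
```

The selected pairs would then be placed in the prompt ahead of the paragraph to be extracted. As a check on the reported numbers, the +14.8% figure is the relative gain, (0.93 - 0.81) / 0.81 ≈ 0.148.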

Title: LLMs are Not Just Next Token Predictors

Authors: Stephen M. Downes, Patrick Forber, Alex Grzankowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04666
Pdf URL: https://arxiv.org/pdf/2408.04666
Copy Paste: [[2408.04666]] LLMs are Not Just Next Token Predictors(https://arxiv.org/abs/2408.04666)
Keywords: llm, prompt
Abstract: LLMs are statistical models of language learning through stochastic gradient descent with a next token prediction objective, prompting a popular view among AI modelers: LLMs are just next token predictors. While LLMs are engineered using next token prediction, and trained based on their success at this task, our view is that a reduction to just next token predictor sells LLMs short. Moreover, there are important explanations of LLM behavior and capabilities that are lost when we engage in this kind of reduction. In order to draw this out, we will make an analogy with a once prominent research program in biology explaining evolution and development from the gene's eye view.
摘要：LLM 是通过随机梯度下降进行语言学习的统计模型，具有下一个标记预测目标。这在 AI 建模者中引发了一种流行的观点：LLM 只是下一个标记预测器。虽然 LLM 是使用下一个标记预测设计的，并根据其在此任务中的成功进行训练，但我们认为，将 LLM 简化为下一个标记预测器会低估其价值。此外，当我们进行这种简化时，会失去对 LLM 行为和能力的重要解释。为了阐明这一点，我们将与曾经著名的生物学研究项目进行类比，该项目从基因的角度解释进化和发展。

Title: LLM Stability: A detailed analysis with some surprises

Authors: Berk Atil, Alexa Chittams, Liseng Fu, Ferhan Ture, Lixinyu Xu, Breck Baldwin
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2408.04667
Pdf URL: https://arxiv.org/pdf/2408.04667
Copy Paste: [[2408.04667]] LLM Stability: A detailed analysis with some surprises(https://arxiv.org/abs/2408.04667)
Keywords: llm
Abstract: A concerning property of our nearly magical LLMs involves the variation of results given the exact same input and deterministic hyper-parameters. While AI has always had a certain level of noisiness from inputs outside of training data, we have generally had deterministic results for any particular input; that is no longer true. While most LLM practitioners are "in the know", we are unaware of any work that attempts to quantify current LLM stability. We suspect no one has taken the trouble because it is just too boring a paper to execute and write. But we have done it and there are some surprises. What kinds of surprises? The evaluated LLMs are rarely deterministic at the raw output level; they are much more deterministic at the parsed output/answer level but still rarely 100% stable across 5 re-runs with same data input. LLM accuracy variation is not normally distributed. Stability varies based on task.
摘要：我们近乎神奇的 LLM 的一个令人担忧的特性涉及在给定完全相同的输入和确定性超参数的情况下结果的变化。虽然人工智能总是会受到训练数据之外的输入的一定程度的干扰，但我们通常对任何特定输入都有确定性的结果；这不再是事实。虽然大多数 LLM 从业者都“知情”，但我们不知道有任何工作试图量化当前的 LLM 稳定性。我们怀疑没有人愿意付出努力，因为这篇论文太无聊了，无法执行和编写。但我们已经做到了，并且有一些惊喜。什么样的惊喜？评估的 LLM 在原始输出级别很少是确定性的；它们在解析的输出/答案级别更具确定性，但在使用相同数据输入进行 5 次重新运行后仍然很少 100% 稳定。LLM 准确度变化不是正态分布的。稳定性因任务而异。
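Illustrative note on quantifying stability across re-runs: one simple way to measure the effect described above is to repeat the same evaluation several times and report (a) the share of items on which every run gives the same parsed answer and (b) the spread of accuracy across runs. The metric names and functions below are assumptions for illustration; the abstract does not define the metrics the authors used.

```python
from typing import List, Sequence, Tuple


def agreement_rate(runs: Sequence[Sequence[str]]) -> float:
    """Fraction of items for which all runs produced the same parsed answer."""
    if not runs:
        raise ValueError("need at least one run")
    n_items = len(runs[0])
    if any(len(r) != n_items for r in runs):
        raise ValueError("all runs must cover the same items in the same order")
    if n_items == 0:
        return 1.0
    # zip(*runs) groups the answers given to one item across all runs.
    stable = sum(1 for answers in zip(*runs) if len(set(answers)) == 1)
    return stable / n_items


def accuracy_spread(runs: Sequence[Sequence[str]], gold: Sequence[str]) -> Tuple[float, float]:
    """Return (min accuracy, max accuracy) over the runs."""
    accs: List[float] = [
        sum(a == g for a, g in zip(run, gold)) / len(gold) for run in runs
    ]
    return min(accs), max(accs)


if __name__ == "__main__":
    runs = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"]]
    gold = ["A", "B", "C"]
    print(agreement_rate(runs))         # share of items with identical answers in all runs
    print(accuracy_spread(runs, gold))  # lowest and highest accuracy across runs
```

The same functions can be applied to raw output strings instead of parsed answers to compare the two levels of determinism the abstract distinguishes.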

Title: Forecasting Live Chat Intent from Browsing History

Authors: Se-eun Yoon, Ahmad Bin Rabiah, Zaid Alibadi, Surya Kallumadi, Julian McAuley
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04668
Pdf URL: https://arxiv.org/pdf/2408.04668
Copy Paste: [[2408.04668]] Forecasting Live Chat Intent from Browsing History(https://arxiv.org/abs/2408.04668)
Keywords: language model, llm, chat, agent
Abstract: Customers reach out to online live chat agents with various intents, such as asking about product details or requesting a return. In this paper, we propose the problem of predicting user intent from browsing history and address it through a two-stage approach. The first stage classifies a user's browsing history into high-level intent categories. Here, we represent each browsing history as a text sequence of page attributes and use the ground-truth class labels to fine-tune pretrained Transformers. The second stage provides a large language model (LLM) with the browsing history and predicted intent class to generate fine-grained intents. For automatic evaluation, we use a separate LLM to judge the similarity between generated and ground-truth intents, which closely aligns with human judgments. Our two-stage approach yields significant performance gains compared to generating intents without the classification stage.
摘要：客户联系在线实时聊天代理时会遇到各种意图，例如询问产品详细信息或要求退货。在本文中，我们提出了从浏览历史预测用户意图的问题，并通过两阶段方法解决该问题。第一阶段将用户的浏览历史分为高级意图类别。在这里，我们将每个浏览历史表示为页面属性的文本序列，并使用真实类别标签来微调预训练的 Transformers。第二阶段为大型语言模型 (LLM) 提供浏览历史和预测的意图类别，以生成细粒度的意图。对于自动评估，我们使用单独的 LLM 来判断生成的意图和真实意图之间的相似性，这与人类判断非常接近。与没有分类阶段的生成意图相比，我们的两阶段方法可显着提高性能。

Title: Prompt and Prejudice

Authors: Lorenzo Berlincioni, Luca Cultrera, Federico Becattini, Marco Bertini, Alberto Del Bimbo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04671
Pdf URL: https://arxiv.org/pdf/2408.04671
Copy Paste: [[2408.04671]] Prompt and Prejudice(https://arxiv.org/abs/2408.04671)
Keywords: language model, llm, prompt
Abstract: This paper investigates the impact of using first names in Large Language Models (LLMs) and Vision Language Models (VLMs), particularly when prompted with ethical decision-making tasks. We propose an approach that appends first names to ethically annotated text scenarios to reveal demographic biases in model outputs. Our study involves a curated list of more than 300 names representing diverse genders and ethnic backgrounds, tested across thousands of moral scenarios. Following the auditing methodologies from social sciences, we propose a detailed analysis involving popular LLMs/VLMs to contribute to the field of responsible AI by emphasizing the importance of recognizing and mitigating biases in these systems. Furthermore, we introduce a novel benchmark, the Practical Scenarios Benchmark (PSB), designed to assess the presence of biases involving gender or demographic prejudices in everyday decision-making scenarios as well as practical scenarios where an LLM might be used to make sensible decisions (e.g., granting mortgages or insurance). This benchmark allows for a comprehensive comparison of model behaviors across different demographic categories, highlighting the risks and biases that may arise in practical applications of LLMs and VLMs.
摘要：本文研究了在大型语言模型 (LLM) 和视觉语言模型 (VLM) 中使用名字的影响，特别是在执行道德决策任务时。我们提出了一种方法，将名字附加到带有道德注释的文本场景中，以揭示模型输出中的人口统计学偏见。我们的研究涉及一份精选的 300 多个名字列表，这些名字代表了不同的性别和种族背景，并在数千种道德场景中进行了测试。遵循社会科学的审计方法，我们提出了一项涉及流行 LLM/VLM 的详细分析，通过强调识别和减轻这些系统中的偏见的重要性，为负责任的人工智能领域做出贡献。此外，我们引入了一个新颖的基准，即实践场景基准 (PSB)，旨在评估日常决策场景中是否存在涉及性别或人口统计学偏见的偏见，以及 LLM 可能用于做出明智决策的实际场景（例如，发放抵押贷款或保险）。该基准可以对不同人口类别的模型行为进行全面比较，突出 LLM 和 VLM 实际应用中可能出现的风险和偏见。

Title: AutoFAIR : Automatic Data FAIRification via Machine Reading

Authors: Tingyan Ma, Wei Liu, Bin Lu, Xiaoying Gan, Yunqiang Zhu, Luoyi Fu, Chenghu Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04673
Pdf URL: https://arxiv.org/pdf/2408.04673
Copy Paste: [[2408.04673]] AutoFAIR : Automatic Data FAIRification via Machine Reading(https://arxiv.org/abs/2408.04673)
Keywords: language model
Abstract: The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles emerge as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lacks efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automatically. First, we align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, we utilize Web Reader to automatically extract metadata based on language models, even in the absence of structured data webpage schemas. Subsequently, FAIR Alignment is employed to make metadata comply with FAIR principles by ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in findability, accessibility, interoperability, and reusability of data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.
摘要：数据的爆炸式增长推动了数据驱动的研究，促进了各个领域的进步。FAIR 原则成为指导标准，旨在增强数据的可查找性、可访问性、互操作性和可重用性。然而，当前的努力主要集中在手动数据 FAIRification 上，这只能处理目标数据并且效率低下。为了解决这个问题，我们提出了 AutoFAIR，这是一种旨在自动增强数据 FAIRness 的架构。首先，我们将每个数据和元数据操作与特定的 FAIR 指标对齐，以指导机器可执行的操作。然后，我们利用 Web Reader 基于语言模型自动提取元数据，即使在没有结构化数据网页模式的情况下也是如此。随后，通过本体指导和语义匹配，采用 FAIR 对齐使元数据符合 FAIR 原则。最后，通过将 AutoFAIR 应用于各种数据，尤其是在山地灾害领域，我们观察到数据的可查找性、可访问性、互操作性和可重用性得到了显着改善。应用 AutoFAIR 之前和之后的 FAIRness 分数表明数据价值增强。

Title: ACL Ready: RAG Based Assistant for the ACL Checklist

Authors: Michael Galarnyk, Rutwik Routu, Kosha Bheda, Priyanshu Mehta, Agam Shah, Sudheer Chava
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2408.04675
Pdf URL: https://arxiv.org/pdf/2408.04675
Copy Paste: [[2408.04675]] ACL Ready: RAG Based Assistant for the ACL Checklist(https://arxiv.org/abs/2408.04675)
Keywords: language model
Abstract: The ARR Responsible NLP Research checklist website states that the "checklist is designed to encourage best practices for responsible research, addressing issues of research ethics, societal impact and reproducibility." Answering the questions is an opportunity for authors to reflect on their work and make sure any shared scientific assets follow best practices. Ideally, considering the checklist before submission can favorably impact the writing of a research paper. However, the checklist is often filled out at the last moment. In this work, we introduce ACLReady, a retrieval-augmented language model application that can be used to empower authors to reflect on their work and assist authors with the ACL checklist. To test the effectiveness of the system, we conducted a qualitative study with 13 users, which shows that 92% of users found the application useful and easy to use and that 77% of users found that the application provided the information they expected. Our code is publicly available under the CC BY-NC 4.0 license on GitHub.
摘要：ARR 负责任的 NLP 研究清单网站指出，“清单旨在鼓励负责任的研究最佳实践，解决研究伦理、社会影响和可重复性问题。”回答这些问题是作者反思其工作并确保任何共享的科学资产遵循最佳实践的机会。理想情况下，在提交之前考虑清单可以对研究论文的写作产生积极影响。然而，清单通常是在最后一刻填写的。在这项工作中，我们引入了 ACLReady，这是一款检索增强语言模型应用程序，可用于帮助作者反思其工作并协助作者完成 ACL 清单。为了测试系统的有效性，我们对 13 位用户进行了一项定性研究，结果表明 92% 的用户认为该应用程序有用且易于使用，77% 的用户认为该应用程序提供了他们期望的信息。我们的代码在 GitHub 上根据 CC BY-NC 4.0 许可公开提供。

Title: CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding

Authors: Sophia Ho, Jinsol Park, Patrick Wang
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2408.04678
Pdf URL: https://arxiv.org/pdf/2408.04678
Copy Paste: [[2408.04678]] CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding(https://arxiv.org/abs/2408.04678)
Keywords: llm
Abstract: We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively "compacted". REST is a drafting technique for speculative decoding based on retrieving exact n-gram matches of the most recent n tokens generated by the target LLM from a datastore. The key idea of CREST is to only store a subset of the smallest and most common n-grams in the datastore with the hope of achieving comparable performance with less storage space. We found that storing a subset of n-grams both reduces storage space and improves performance. CREST matches REST's accepted token length with 10.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance length than REST using the same storage space on the HumanEval and MT Bench benchmarks.
摘要：我们提出了 CREST（基于紧凑检索的推测解码），这是 REST 的重新设计，使其能够有效地“压缩”。REST 是一种推测解码的起草技术，基于从数据存储中检索目标 LLM 生成的最近 n 个标记的精确 n-gram 匹配。CREST 的关键思想是只在数据存储中存储最小和最常见的 n-gram 的子集，希望以更少的存储空间实现可比的性能。我们发现存储 n-gram 的子集既可以减少存储空间，又可以提高性能。CREST 的存储空间比 REST 少 10.6-13.5 倍，与 REST 的接受标记长度相当，并且在 HumanEval 和 MT Bench 基准测试中使用相同的存储空间，接受长度比 REST 高 16.5-17.1%。
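Illustrative note on retrieval-based drafting with a compacted datastore: the sketch below maps short context n-grams to the tokens that followed them in a corpus, keeps only the most frequent contexts, and proposes a draft token by matching the longest available suffix of the current context. The dictionary datastore, the `keep` cutoff, the single-token draft, and all function names are assumptions for illustration; the abstract does not specify the data structure or selection rule used by REST or CREST.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Datastore = Dict[Tuple[str, ...], Counter]


def build_datastore(corpus: Sequence[Sequence[str]], max_n: int = 2, keep: int = 1000) -> Datastore:
    """Map each context n-gram (length 1..max_n) to counts of the token that follows it,
    keeping only the `keep` most frequently seen contexts."""
    continuations: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)
    for tokens in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n):
                continuations[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    ranked = sorted(continuations.items(), key=lambda kv: sum(kv[1].values()), reverse=True)
    return dict(ranked[:keep])


def draft_next(datastore: Datastore, context: Sequence[str], max_n: int = 2) -> Optional[str]:
    """Propose one draft token from the longest context suffix found in the datastore."""
    for n in range(min(max_n, len(context)), 0, -1):
        key = tuple(context[-n:])
        if key in datastore:
            return datastore[key].most_common(1)[0][0]
    return None  # no match: fall back to ordinary decoding


if __name__ == "__main__":
    corpus = [
        ["def", "add", "(", "a", ",", "b", ")"],
        ["def", "sub", "(", "a", ",", "b", ")"],
    ]
    store = build_datastore(corpus, max_n=2, keep=100)
    print(draft_next(store, ["x", "(", "a"]))
```

In speculative decoding, a draft produced this way is only a proposal: the target LLM verifies it and accepts or rejects the drafted tokens, so a smaller datastore affects speed but not the final output distribution.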

Title: Towards Linguistic Neural Representation Learning and Sentence Retrieval from Electroencephalogram Recordings

Authors: Jinzhao Zhou, Yiqun Duan, Ziyi Zhao, Yu-Cheng Chang, Yu-Kai Wang, Thomas Do, Chin-Teng Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04679
Pdf URL: https://arxiv.org/pdf/2408.04679
Copy Paste: [[2408.04679]] Towards Linguistic Neural Representation Learning and Sentence Retrieval from Electroencephalogram Recordings(https://arxiv.org/abs/2408.04679)
Keywords: language model, llm
Abstract: Decoding linguistic information from non-invasive brain signals using EEG has gained increasing research attention due to its vast applicational potential. Recently, a number of works have adopted a generative-based framework to decode electroencephalogram (EEG) signals into sentences by utilizing the powerful generative capacity of pretrained large language models (LLMs). However, this approach has several drawbacks that hinder the further development of linguistic applications for brain-computer interfaces (BCIs). Specifically, the ability of the EEG encoder to learn semantic information from EEG data remains questionable, and the LLM decoder's tendency to generate sentences based on its training memory can be hard to avoid. These issues necessitate a novel approach for converting EEG signals into sentences. In this paper, we propose a novel two-step pipeline that addresses these limitations and enhances the validity of linguistic EEG decoding research. We first confirm that word-level semantic information can be learned from EEG data recorded during natural reading by training a Conformer encoder via a masked contrastive objective for word-level classification. To achieve sentence decoding results, we employ a training-free retrieval method to retrieve sentences based on the predictions from the EEG encoder. Extensive experiments and ablation studies were conducted in this paper for a comprehensive evaluation of the proposed approach. Visualization of the top prediction candidates reveals that our model effectively groups EEG segments into semantic categories with similar meanings, thereby validating its ability to learn patterns from unspoken EEG recordings. Despite the exploratory nature of this work, these results suggest that our method holds promise for providing more reliable solutions for converting EEG signals into text.
摘要：由于具有巨大的应用潜力，利用脑电图从非侵入性脑信号中解码语言信息已引起越来越多的研究关注。最近，许多研究采用了基于生成的框架，利用预训练大型语言模型 (LLM) 的强大生成能力将脑电图 (EEG) 信号解码为句子。然而，这种方法有几个缺点，阻碍了脑机接口 (BCI) 语言应用的进一步发展。具体来说，EEG 编码器从 EEG 数据中学习语义信息的能力仍然存在疑问，而且 LLM 解码器根据其训练记忆生成句子的倾向很难避免。这些问题需要一种将 EEG 信号转换成句子的新方法。在本文中，我们提出了一种新颖的两步流程，以解决这些限制并提高语言 EEG 解码研究的有效性。我们首先确认，可以通过对 Conformer 编码器进行掩蔽对比目标的词级分类来训练，从而从自然阅读期间记录的 EEG 数据中学习词级语义信息。为了实现句子解码结果，我们采用无需训练的检索方法，根据 EEG 编码器的预测检索句子。本文进行了广泛的实验和消融研究，以全面评估所提出的方法。对顶级预测候选者的可视化表明，我们的模型有效地将 EEG 片段分组为具有相似含义的语义类别，从而验证了其从未说出口的 EEG 记录中学习模式的能力。尽管这项工作具有探索性，但这些结果表明，我们的方法有望提供更可靠的解决方案，将 EEG 信号转换为文本。

Title: Dynamic Fog Computing for Enhanced LLM Execution in Medical Applications

Authors: Philipp Zagar, Vishnu Ravi, Lauren Aalami, Stephan Krusche, Oliver Aalami, Paul Schmiedmayer
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2408.04680
Pdf URL: https://arxiv.org/pdf/2408.04680
Copy Paste: [[2408.04680]] Dynamic Fog Computing for Enhanced LLM Execution in Medical Applications(https://arxiv.org/abs/2408.04680)
Keywords: language model, llm
Abstract: The ability of large language models (LLMs) to transform, interpret, and comprehend vast quantities of heterogeneous data presents a significant opportunity to enhance data-driven care delivery. However, the sensitive nature of protected health information (PHI) raises valid concerns about data privacy and trust in remote LLM platforms. In addition, the cost associated with cloud-based artificial intelligence (AI) services continues to impede widespread adoption. To address these challenges, we propose a shift in the LLM execution environment from opaque, centralized cloud providers to a decentralized and dynamic fog computing architecture. By executing open-weight LLMs in more trusted environments, such as the user's edge device or a fog layer within a local network, we aim to mitigate the privacy, trust, and financial challenges associated with cloud-based LLMs. We further present SpeziLLM, an open-source framework designed to facilitate rapid and seamless leveraging of different LLM execution layers and lowering barriers to LLM integration in digital health applications. We demonstrate SpeziLLM's broad applicability across six digital health applications, showcasing its versatility in various healthcare settings.
摘要：大型语言模型 (LLM) 能够转换、解释和理解大量异构数据，这为增强数据驱动的医疗服务提供了重要机会。然而，受保护的健康信息 (PHI) 的敏感性引发了人们对远程 LLM 平台的数据隐私和信任的合理担忧。此外，基于云的人工智能 (AI) 服务的成本继续阻碍其广泛采用。为了应对这些挑战，我们建议将 LLM 执行环境从不透明的集中式云提供商转变为分散式动态雾计算架构。通过在更受信任的环境（例如用户的边缘设备或本地网络中的雾层）中执行开放式 LLM，我们旨在减轻与基于云的 LLM 相关的隐私、信任和财务挑战。我们进一步介绍了 SpeziLLM，这是一个开源框架，旨在促进快速无缝地利用不同的 LLM 执行层并降低数字健康应用程序中 LLM 集成的障碍。我们展示了 SpeziLLM 在六种数字健康应用程序中的广泛适用性，展示了其在各种医疗保健环境中的多功能性。

Title: Conversational AI Powered by Large Language Models Amplifies False Memories in Witness Interviews

Authors: Samantha Chan, Pat Pataranutaporn, Aditya Suri, Wazeer Zulfikar, Pattie Maes, Elizabeth F. Loftus
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2408.04681
Pdf URL: https://arxiv.org/pdf/2408.04681
Copy Paste: [[2408.04681]] Conversational AI Powered by Large Language Models Amplifies False Memories in Witness Interviews(https://arxiv.org/abs/2408.04681)
Keywords: language model, llm, chat
Abstract: This study examines the impact of AI on human false memories -- recollections of events that did not occur or deviate from actual occurrences. It explores false memory induction through suggestive questioning in Human-AI interactions, simulating crime witness interviews. Four conditions were tested: control, survey-based, pre-scripted chatbot, and generative chatbot using a large language model (LLM). Participants (N=200) watched a crime video, then interacted with their assigned AI interviewer or survey, answering questions including five misleading ones. False memories were assessed immediately and after one week. Results show the generative chatbot condition significantly increased false memory formation, inducing over 3 times more immediate false memories than the control and 1.7 times more than the survey method. 36.4% of users' responses to the generative chatbot were misled through the interaction. After one week, the number of false memories induced by generative chatbots remained constant. However, confidence in these false memories remained higher than the control after one week. Moderating factors were explored: users who were less familiar with chatbots but more familiar with AI technology, and more interested in crime investigations, were more susceptible to false memories. These findings highlight the potential risks of using advanced AI in sensitive contexts, like police interviews, emphasizing the need for ethical considerations.
摘要：本研究考察了人工智能对人类错误记忆的影响——对未发生或偏离实际发生的事件的回忆。它探索了在人机交互中通过暗示性提问来诱导错误记忆，模拟犯罪目击者访谈。测试了四种条件：对照组、基于调查的组、预先编写脚本的聊天机器人和使用大型语言模型 (LLM) 的生成聊天机器人。参与者 (N=200) 观看了一段犯罪视频，然后与指定的人工智能采访者或调查员互动，回答包括五个误导性问题在内的问题。立即和一周后对错误记忆进行了评估。结果显示，生成聊天机器人条件显著增加了错误记忆的形成，诱导的即时错误记忆比对照组多 3 倍以上，比调查方法多 1.7 倍。36.4% 的用户对生成聊天机器人的回应被互动误导了。一周后，生成聊天机器人诱导的错误记忆数量保持不变。然而，一周后，对这些错误记忆的信心仍然高于对照组。研究还探讨了调节因素：对聊天机器人不太熟悉但对人工智能技术更熟悉、对犯罪调查更感兴趣的用户更容易产生虚假记忆。这些发现凸显了在警方采访等敏感情况下使用先进人工智能的潜在风险，强调了道德考量的重要性。

Title: ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Authors: Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04682
Pdf URL: https://arxiv.org/pdf/2408.04682
Copy Paste: [[2408.04682]] ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities(https://arxiv.org/abs/2408.04682)
Keywords: language model, llm, prompt
Abstract: Recent advancements in large language models (LLMs) have sparked a growing research interest in tool-assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on evaluating either over stateless web services (RESTful API) based on a single-turn user prompt, or on an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization and Insufficient Information defined in ToolSandbox are challenging even for the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at this https URL.
摘要：大型语言模型 (LLM) 的最新进展引发了人们对工具辅助 LLM 解决现实世界挑战的研究兴趣，这需要对工具使用能力进行全面评估。虽然以前的研究主要集中在基于单轮用户提示或非策略对话轨迹对无状态 Web 服务 (RESTful API) 进行评估，但 ToolSandbox 包括有状态的工具执行、工具之间的隐式状态依赖关系、支持策略对话评估的内置用户模拟器以及针对任意轨迹的中间和最终里程碑的动态评估策略。我们表明开源和专有模型具有显著的性能差距，ToolSandbox 中定义的复杂任务（如状态依赖、规范化和信息不足）甚至对最强大的 SOTA LLM 也构成了挑战，为工具使用 LLM 功能提供了全新的见解。ToolSandbox 评估框架发布于此 https URL

Title: Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles

Authors: Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, Hui Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04686
Pdf URL: https://arxiv.org/pdf/2408.04686
Copy Paste: [[2408.04686]] Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles(https://arxiv.org/abs/2408.04686)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have significantly enhanced the performance of numerous applications, from intelligent conversations to text generation. However, their inherent security vulnerabilities have become an increasingly significant challenge, especially with respect to jailbreak attacks. Attackers can circumvent the security mechanisms of these LLMs, breaching security constraints and causing harmful outputs. Focusing on multi-turn semantic jailbreak attacks, we observe that existing methods lack specific considerations for the role of multi-turn dialogues in attack strategies, leading to semantic deviations during continuous interactions. Therefore, in this paper, we establish a theoretical foundation for multi-turn attacks by considering their support in jailbreak attacks, and based on this, propose a context-based contextual fusion black-box jailbreak attack method, named Context Fusion Attack (CFA). This approach involves filtering and extracting key terms from the target, constructing contextual scenarios around these terms, dynamically integrating the target into the scenarios, replacing malicious key terms within the target, and thereby concealing the direct malicious intent. Through comparisons on various mainstream LLMs and red team datasets, we have demonstrated CFA's superior success rate, divergence, and harmfulness compared to other multi-turn attack strategies, particularly showcasing significant advantages on Llama3 and GPT-4.
摘要：大型语言模型 (LLM) 显著提升了从智能对话到文本生成等众多应用的性能。然而，它们固有的安全漏洞已成为日益严重的挑战，尤其是在越狱攻击方面。攻击者可以绕过这些 LLM 的安全机制，突破安全约束并造成有害输出。针对多轮语义越狱攻击，我们观察到现有方法缺乏对多轮对话在攻击策略中的作用的具体考虑，导致持续交互过程中出现语义偏差。因此，本文通过考虑多轮攻击在越狱攻击中的支持，为多轮攻击建立了理论基础，并在此基础上提出了一种基于上下文的上下文融合黑盒越狱攻击方法，即上下文融合攻击 (CFA)。该方法包括从目标中过滤和提取关键词，围绕这些关键词构建上下文场景，将目标动态集成到场景中，替换目标中的恶意关键词，从而隐藏直接恶意意图。通过在各种主流 LLM 和红队数据集上的比较，我们证明了 CFA 相较于其他多回合攻击策略具有更优越的成功率、发散度和危害性，尤其在 Llama3 和 GPT-4 上展现出明显的优势。

Title: Improving Relational Database Interactions with Large Language Models: Column Descriptions and Their Impact on Text-to-SQL Performance

Authors: Niklas Wretblad, Oskar Holmström, Erik Larsson, Axel Wiksäter, Oscar Söderlund, Hjalmar Öhman, Ture Pontén, Martin Forsberg, Martin Sörme, Fredrik Heintz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04691
Pdf URL: https://arxiv.org/pdf/2408.04691
Copy Paste: [[2408.04691]] Improving Relational Database Interactions with Large Language Models: Column Descriptions and Their Impact on Text-to-SQL Performance(https://arxiv.org/abs/2408.04691)
Keywords: language model, gpt, llm
Abstract: Relational databases often suffer from uninformative descriptors of table contents, such as ambiguous columns and hard-to-interpret values, impacting both human users and Text-to-SQL models. This paper explores the use of large language models (LLMs) to generate informative column descriptions as a semantic layer for relational databases. Using the BIRD-Bench development set, we created ColSQL, a dataset with gold-standard column descriptions generated and refined by LLMs and human annotators. We evaluated several instruction-tuned models, finding that GPT-4o and Command R+ excelled in generating high-quality descriptions. Additionally, we applied an LLM-as-a-judge to evaluate model performance. Although this method does not align well with human evaluations, we included it to explore its potential and to identify areas for improvement. More work is needed to improve the reliability of automatic evaluations for this task. We also find that detailed column descriptions significantly improve Text-to-SQL execution accuracy, especially when columns are uninformative. This study establishes LLMs as effective tools for generating detailed metadata, enhancing the usability of relational databases.
摘要：关系数据库经常会受到表内容描述不具信息量的困扰，例如模糊的列和难以解释的值，这会影响人类用户和文本到 SQL 模型。本文探讨了使用大型语言模型 (LLM) 生成信息丰富的列描述作为关系数据库的语义层。使用 BIRD-Bench 开发集，我们创建了 \textsc{ColSQL}，这是一个由 LLM 和人工注释者生成和细化黄金标准列描述的数据集。我们评估了几个指令调整模型，发现 GPT-4o 和 Command R+ 在生成高质量描述方面表现出色。此外，我们应用了 LLM-as-a-judge 来评估模型性能。虽然这种方法与人工评估不太一致，但我们将其纳入其中以探索其潜力并确定需要改进的领域。需要做更多的工作来提高此任务自动评估的可靠性。我们还发现详细的列描述可以显着提高文本到 SQL 的执行准确性，尤其是在列不具信息量的情况下。这项研究确立了 LLM 作为生成详细元数据的有效工具的地位，增强了关系数据库的可用性。

Title: Understanding the Performance and Estimating the Cost of LLM Fine-Tuning

Authors: Yuchen Xia, Jiho Kim, Yuhan Chen, Haojie Ye, Souvik Kundu, Cong (Callie) Hao, Nishil Talati
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04693
Pdf URL: https://arxiv.org/pdf/2408.04693
Copy Paste: [[2408.04693]] Understanding the Performance and Estimating the Cost of LLM Fine-Tuning(https://arxiv.org/abs/2408.04693)
Keywords: language model, llm
Abstract: Due to the cost-prohibitive nature of training Large Language Models (LLMs), fine-tuning has emerged as an attractive alternative for specializing LLMs for specific tasks using limited compute resources in a cost-effective manner. In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand their accuracy and runtime performance on a single GPU. Our evaluation provides unique insights into the training efficacy of sparse and dense versions of MoE models, as well as their runtime characteristics, including maximum batch size, execution time breakdown, end-to-end throughput, GPU hardware utilization, and load distribution. Our study identifies the optimization of the MoE layer as crucial for further improving the performance of LLM fine-tuning. Using our profiling results, we also develop and validate an analytical model to estimate the cost of LLM fine-tuning on the cloud. This model, based on parameters of the model and GPU architecture, estimates LLM throughput and the cost of training, aiding practitioners in industry and academia to budget the cost of fine-tuning a specific model.
摘要：由于训练大型语言模型 (LLM) 的成本过高，微调已成为一种有吸引力的替代方案，它使用有限的计算资源以经济高效的方式专门针对特定任务对 LLM 进行专门处理。在本文中，我们描述了基于稀疏混合专家 (MoE) 的 LLM 微调，以了解它们在单个 GPU 上的准确性和运行时性能。我们的评估提供了对稀疏和密集版本的 MoE 模型的训练效果以及它们的运行时特性的独特见解，包括最大批量大小、执行时间细分、端到端吞吐量、GPU 硬件利用率和负载分配。我们的研究表明，优化 MoE 层对于进一步提高 LLM 微调的性能至关重要。使用我们的分析结果，我们还开发并验证了一个分析模型来估算云端 LLM 微调的成本。该模型基于模型和 GPU 架构的参数，估算 LLM 吞吐量和训练成本，帮助行业和学术界的从业者预算微调特定模型的成本。
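
The idea of budgeting a fine-tuning run from measured throughput can be illustrated with a little arithmetic. The sketch below is not the paper's analytical model (which estimates throughput from model and GPU architecture parameters); it simply assumes the throughput is already known, and the function name, parameters, and all numbers are hypothetical.

```python
def estimate_finetune_cost(num_tokens, tokens_per_second, gpu_hourly_usd, epochs=1):
    """Rough cloud cost in USD: total tokens processed divided by throughput,
    converted to hours and multiplied by an assumed hourly GPU price."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    seconds = num_tokens * epochs / tokens_per_second
    hours = seconds / 3600.0
    return hours * gpu_hourly_usd


# Hypothetical inputs, chosen only to make the arithmetic easy to follow:
# 36M tokens, 2 epochs, 1,000 tokens/s, 2 USD per GPU-hour.
cost = estimate_finetune_cost(
    num_tokens=36_000_000,
    tokens_per_second=1_000,
    gpu_hourly_usd=2.0,
    epochs=2,
)
print(f"Estimated cost: {cost:.2f} USD")  # 72,000 s = 20 h, so 40.00 USD
```

In practice the throughput term is the hard part to predict, which is where a model based on batch size and hardware characteristics, as described in the abstract, would come in.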

Title: Hybrid Student-Teacher Large Language Model Refinement for Cancer Toxicity Symptom Extraction

Authors: Reza Khanmohammadi, Ahmed I. Ghanem, Kyle Verdecchia, Ryan Hall, Mohamed Elshaikh, Benjamin Movsas, Hassan Bagher-Ebadian, Bing Luo, Indrin J. Chetty, Tuka Alhanai, Kundan Thind, Mohammad M. Ghassemi
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2408.04775
Pdf URL: https://arxiv.org/pdf/2408.04775
Copy Paste: [[2408.04775]] Hybrid Student-Teacher Large Language Model Refinement for Cancer Toxicity Symptom Extraction(https://arxiv.org/abs/2408.04775)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: Large Language Models (LLMs) offer significant potential for clinical symptom extraction, but their deployment in healthcare settings is constrained by privacy concerns, computational limitations, and operational costs. This study investigates the optimization of compact LLMs for cancer toxicity symptom extraction using a novel iterative refinement approach. We employ a student-teacher architecture, utilizing Zephyr-7b-beta and Phi3-mini-128 as student models and GPT-4o as the teacher, to dynamically select between prompt refinement, Retrieval-Augmented Generation (RAG), and fine-tuning strategies. Our experiments on 294 clinical notes covering 12 post-radiotherapy toxicity symptoms demonstrate the effectiveness of this approach. The RAG method proved most efficient, improving average accuracy scores from 0.32 to 0.73 for Zephyr-7b-beta and from 0.40 to 0.87 for Phi3-mini-128 during refinement. In the test set, both models showed an approximate 0.20 increase in accuracy across symptoms. Notably, this improvement was achieved at a cost 45 times lower than GPT-4o for Zephyr and 79 times lower for Phi-3. These results highlight the potential of iterative refinement techniques in enhancing the capabilities of compact LLMs for clinical applications, offering a balance between performance, cost-effectiveness, and privacy preservation in healthcare settings.
摘要：大型语言模型 (LLM) 为临床症状提取提供了巨大潜力，但它们在医疗保健环境中的部署受到隐私问题、计算限制和运营成本的限制。本研究使用一种新颖的迭代细化方法研究了紧凑型 LLM 的优化，以提取癌症毒性症状。我们采用学生-老师架构，利用 Zephyr-7b-beta 和 Phi3-mini-128 作为学生模型，使用 GPT-4o 作为老师，在即时细化、检索增强生成 (RAG) 和微调策略之间动态选择。我们对涵盖 12 种放疗后毒性症状的 294 份临床记录进行的实验证明了这种方法的有效性。RAG 方法被证明是最有效的，在细化过程中，Zephyr-7b-beta 的平均准确度得分从 0.32 提高到 0.73，Phi3-mini-128 的平均准确度得分从 0.40 提高到 0.87。在测试集中，两种模型在各种症状上的准确率都提高了约 0.20。值得注意的是，实现这一改进的成本是 Zephyr 比 GPT-4o 低 45 倍，Phi-3 比 GPT-4o 低 79 倍。这些结果凸显了迭代细化技术在增强紧凑型 LLM 在临床应用方面的潜力，在医疗环境中实现了性能、成本效益和隐私保护之间的平衡。

Title: FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers

Authors: Joshua Nathaniel Williams, J. Zico Kolter
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04816
Pdf URL: https://arxiv.org/pdf/2408.04816
Copy Paste: [[2408.04816]] FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers(https://arxiv.org/abs/2408.04816)
Keywords: language model, prompt
Abstract: The widespread use of large language models has resulted in a multitude of tokenizers and embedding spaces, making knowledge transfer in prompt discovery tasks difficult. In this work, we propose FUSE (Flexible Unification of Semantic Embeddings), an inexpensive approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers. We introduce a third-order tensor-based representation of a model's embedding space that aligns semantic embeddings that have been split apart by different tokenizers, and use this representation to derive an approximation of the gradient of one model's outputs with respect to another model's embedding space. We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
摘要：大型语言模型的广泛使用导致了大量的标记器和嵌入空间，使得快速发现任务中的知识转移变得困难。在这项工作中，我们提出了 FUSE（语义嵌入的灵活统一），这是一种廉价的方法，用于近似一个适配器层，该适配器层从一个模型的文本嵌入空间映射到另一个模型的文本嵌入空间，甚至跨不同的标记器。我们引入了模型嵌入空间的基于三阶张量的表示，该表示对齐了被不同标记器分开的语义嵌入，并使用此表示来推导一个模型的输出相对于另一个模型的嵌入空间的梯度的近似值。我们通过对视觉语言和因果语言模型进行多目标优化来展示我们的方法的有效性，用于图像字幕和基于情感的图像字幕。

Title: SCOI: Syntax-augmented Coverage-based In-context Example Selection for Machine Translation

Authors: Chenming Tang, Zhixiang Wang, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04872
Pdf URL: https://arxiv.org/pdf/2408.04872
Copy Paste: [[2408.04872]] SCOI: Syntax-augmented Coverage-based In-context Example Selection for Machine Translation(https://arxiv.org/abs/2408.04872)
Keywords: language model, llm
Abstract: In-context learning (ICL) greatly improves the performance of large language models (LLMs) on various down-stream tasks, where the improvement highly depends on the quality of demonstrations. In this work, we introduce syntactic knowledge to select better in-context examples for machine translation (MT). We propose a new strategy, namely Syntax-augmented COverage-based In-context example selection (SCOI), leveraging the deep syntactic structure beyond conventional word matching. Specifically, we measure the set-level syntactic coverage by computing the coverage of polynomial terms with the help of a simplified tree-to-polynomial algorithm, and lexical coverage using word overlap. Furthermore, we devise an alternate selection approach to combine both coverage measures, taking advantage of syntactic and lexical information. We conduct experiments with two multi-lingual LLMs on six translation directions. Empirical results show that our proposed SCOI obtains the highest average COMET score among all learning-free methods, indicating that combining syntactic and lexical coverage successfully helps to select better in-context examples for MT.
摘要：上下文学习 (ICL) 极大地提高了大型语言模型 (LLM) 在各种下游任务上的性能，而这种改进在很大程度上取决于演示的质量。在这项工作中，我们引入了句法知识来为机器翻译 (MT) 选择更好的上下文示例。我们提出了一种新策略，即基于语法增强 COverage 的上下文示例选择 (SCOI)，它利用超越传统单词匹配的深层句法结构。具体而言，我们通过借助简化的树到多项式算法计算多项式项的覆盖率，并使用单词重叠计算词汇覆盖率，来测量集合级句法覆盖率。此外，我们设计了一种替代选择方法来结合两种覆盖率测量，利用句法和词汇信息。我们对两个多语言 LLM 在六个翻译方向上进行了实验。实证结果表明，我们提出的 SCOI 在所有免学习方法中获得了最高的平均 COMET 分数，这表明成功结合句法和词汇覆盖有助于为 MT 选择更好的上下文示例。
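
Coverage-based example selection can be sketched for the lexical (word-overlap) side alone. The code below is a simplified greedy selector written for illustration, not the SCOI algorithm: it omits the syntactic tree-to-polynomial coverage and the alternating combination of the two measures, and the function names and whitespace tokenization are assumptions.

```python
def lexical_coverage(source, examples):
    """Fraction of distinct source words that appear in at least one example."""
    src_words = set(source.lower().split())
    if not src_words:
        return 0.0
    covered = set()
    for example in examples:
        covered |= src_words & set(example.lower().split())
    return len(covered) / len(src_words)


def greedy_select(source, pool, k):
    """Pick up to k examples, each time adding the one that raises
    set-level coverage of the source sentence the most."""
    selected = []
    remaining = list(pool)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda ex: lexical_coverage(source, selected + [ex]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy sentences, invented for this sketch.
source = "the cat sat on the mat"
pool = ["a dog ran", "the cat slept", "sat on the mat"]
chosen = greedy_select(source, pool, k=2)
print(chosen)                            # ['sat on the mat', 'the cat slept']
print(lexical_coverage(source, chosen))  # 1.0
```

Scoring the selected set as a whole, rather than each example in isolation, is what lets the second pick favor an example that covers words the first one missed.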

Title: Unsupervised Episode Detection for Large-Scale News Events

Authors: Priyanka Kargupta, Yunyi Zhang, Yizhu Jiao, Siru Ouyang, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04873
Pdf URL: https://arxiv.org/pdf/2408.04873
Copy Paste: [[2408.04873]] Unsupervised Episode Detection for Large-Scale News Events(https://arxiv.org/abs/2408.04873)
Keywords: language model
Abstract: Episodic structures are inherently interpretable and adaptable to evolving large-scale key events. However, state-of-the-art automatic event detection methods overlook event episodes and, therefore, struggle with these crucial characteristics. This paper introduces a novel task, episode detection, aimed at identifying episodes from a news corpus containing key event articles. An episode describes a cohesive cluster of core entities (e.g., "protesters", "police") performing actions at a specific time and location. Furthermore, an episode is a significant part of a larger group of episodes under a particular key event. Automatically detecting episodes is challenging because, unlike key events and atomic actions, we cannot rely on explicit mentions of times and locations to distinguish between episodes or use semantic similarity to merge inconsistent episode co-references. To address these challenges, we introduce EpiMine, an unsupervised episode detection framework that (1) automatically identifies the most salient, key-event-relevant terms and segments, (2) determines candidate episodes in an article based on natural episodic partitions estimated through shifts in discriminative term combinations, and (3) refines and forms final episode clusters using large language model-based reasoning on the candidate episodes. We construct three diverse, real-world event datasets annotated at the episode level. EpiMine outperforms all baselines on these datasets by an average 59.2% increase across all metrics.
摘要：情节结构本质上是可解释的，并且可适应不断发展的大规模关键事件。然而，最先进的自动事件检测方法忽略了事件情节，因此难以掌握这些关键特征。本文介绍了一项新任务，即情节检测，旨在从包含关键事件文章的新闻语料库中识别情节。情节描述了在特定时间和地点执行操作的核心实体（例如“抗议者”、“警察”）的凝聚力集群。此外，情节是特定关键事件下更大情节群的重要组成部分。自动检测情节具有挑战性，因为与关键事件和原子动作不同，我们不能依靠明确提及的时间和地点来区分情节，也不能使用语义相似性来合并不一致的情节共同引用。为了应对这些挑战，我们推出了 EpiMine，这是一个无监督的情节检测框架，它 (1) 自动识别最突出、与关键事件相关的术语和片段，(2) 根据通过判别性术语组合的变化估计出的自然情节分区确定文章中的候选情节，以及 (3) 使用基于大型语言模型的推理对候选情节进行细化并形成最终情节集群。我们构建了三个不同的现实世界事件数据集，并在情节级别进行了注释。在这些数据集上，EpiMine 在所有指标上的表现均优于所有基准，平均提高了 59.2%。

Title: Communicate to Play: Pragmatic Reasoning for Efficient Cross-Cultural Communication in Codenames

Authors: Isadora White, Sashrika Pandey, Michelle Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04900
Pdf URL: https://arxiv.org/pdf/2408.04900
Copy Paste: [[2408.04900]] Communicate to Play: Pragmatic Reasoning for Efficient Cross-Cultural Communication in Codenames(https://arxiv.org/abs/2408.04900)
Keywords: llm, prompt
Abstract: Cultural differences in common ground may result in pragmatic failure and misunderstandings during communication. We develop our method Rational Speech Acts for Cross-Cultural Communication (RSA+C3) to resolve cross-cultural differences in common ground. To measure the success of our method, we study RSA+C3 in the collaborative referential game of Codenames Duet and show that our method successfully improves collaboration between simulated players of different cultures. Our contributions are threefold: (1) creating Codenames players using contrastive learning of an embedding space and LLM prompting that are aligned with human patterns of play, (2) studying culturally induced differences in common ground reflected in our trained models, and (3) demonstrating that our method RSA+C3 can ease cross-cultural communication in gameplay by inferring sociocultural context from interaction. Our code is publicly available at this http URL.
摘要：共同点上的文化差异可能会导致沟通过程中的语用失误和误解。我们开发了跨文化交流的理性言语行为 (RSA+C3) 方法来解决共同点上的跨文化差异。为了衡量我们方法的成功性，我们在协作参考游戏 Codenames Duet 中研究了 RSA+C3，并表明我们的方法成功地改善了不同文化模拟玩家之间的协作。我们的贡献有三方面：(1) 使用与人类游戏模式一致的嵌入空间对比学习和 LLM 提示来创建 Codenames 玩家，(2) 研究我们训练的模型中反映的文化诱导共同点差异，(3) 证明我们的方法 RSA+C3 可以通过从互动中推断社会文化背景来简化游戏中的跨文化交流。我们的代码在此 http URL 上公开提供。

Title: GlitchProber: Advancing Effective Detection and Mitigation of Glitch Tokens in Large Language Models

Authors: Zhibo Zhang, Wuxia Bai, Yuxi Li, Mark Huasong Meng, Kailong Wang, Ling Shi, Li Li, Jun Wang, Haoyu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04905
Pdf URL: https://arxiv.org/pdf/2408.04905
Copy Paste: [[2408.04905]] GlitchProber: Advancing Effective Detection and Mitigation of Glitch Tokens in Large Language Models(https://arxiv.org/abs/2408.04905)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved unprecedented success in the field of natural language processing. However, the black-box nature of their internal mechanisms has brought many concerns about their trustworthiness and interpretability. Recent research has discovered a class of abnormal tokens in the model's vocabulary space and named them "glitch tokens". Those tokens, once included in the input, may induce the model to produce incorrect, irrelevant, or even harmful results, drastically undermining the reliability and practicality of LLMs. In this work, we aim to enhance the understanding of glitch tokens and propose techniques for their detection and mitigation. We first reveal the characteristic features induced by glitch tokens on LLMs, which are evidenced by significant deviations in the distributions of attention patterns and dynamic information from intermediate model layers. Based on the insights, we develop GlitchProber, a tool for efficient glitch token detection and mitigation. GlitchProber utilizes small-scale sampling, principal component analysis for accelerated feature extraction, and a simple classifier for efficient vocabulary screening. Taking one step further, GlitchProber rectifies abnormal model intermediate layer values to mitigate the destructive effects of glitch tokens. Evaluated on five mainstream open-source LLMs, GlitchProber demonstrates higher efficiency, precision, and recall compared to existing approaches, with an average F1 score of 0.86 and an average repair rate of 50.06%. GlitchProber unveils a novel path to address the challenges posed by glitch tokens and inspires future research toward more robust and interpretable LLMs.
摘要：大型语言模型 (LLM) 在自然语言处理领域取得了前所未有的成功。然而，其内部机制的黑箱性质带来了许多对其可信度和可解释性的担忧。最近的研究发现了模型词汇空间中的一类异常标记，并将其命名为“故障标记”。这些标记一旦被纳入输入，可能会导致模型产生不正确、不相关甚至有害的结果，从而严重破坏 LLM 的可靠性和实用性。在这项工作中，我们旨在增强对故障标记的理解，并提出检测和缓解它们的技术。我们首先揭示了故障标记在 LLM 上引起的特征，这些特征表现为注意力模式和来自中间模型层的动态信息分布的显著偏差。基于这些见解，我们开发了 GlitchProber，这是一种高效的故障标记检测和缓解工具。GlitchProber 利用小规模采样、主成分分析来加速特征提取，以及一个简单的分类器来高效筛选词汇表。更进一步，GlitchProber 纠正了异常的模型中间层值，以减轻故障标记的破坏性影响。在五个主流开源 LLM 上进行的评估中，GlitchProber 与现有方法相比表现出更高的效率、精确度和召回率，平均 F1 得分为 0.86，平均修复率为 50.06%。GlitchProber 揭示了一条解决故障标记挑战的新途径，并启发未来研究更强大、更可解释的 LLM。

Title: Towards a Generative Approach for Emotion Detection and Reasoning

Authors: Ankita Bhaumik, Tomek Strzalkowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04906
Pdf URL: https://arxiv.org/pdf/2408.04906
Copy Paste: [[2408.04906]] Towards a Generative Approach for Emotion Detection and Reasoning(https://arxiv.org/abs/2408.04906)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated impressive performance in mathematical and commonsense reasoning tasks using chain-of-thought (CoT) prompting techniques. But can they perform emotional reasoning by concatenating 'Let's think step-by-step' to the input prompt? In this paper we investigate this question along with introducing a novel approach to zero-shot emotion detection and emotional reasoning using LLMs. Existing state-of-the-art zero-shot approaches rely on textual entailment models to choose the most appropriate emotion label for an input text. We argue that this strongly restricts the model to a fixed set of labels which may not be suitable or sufficient for many applications where emotion analysis is required. Instead, we propose framing the problem of emotion analysis as a generative question-answering (QA) task. Our approach uses a two-step methodology of generating relevant context or background knowledge to answer the emotion detection question step-by-step. Our paper is the first work on using a generative approach to jointly address the tasks of emotion detection and emotional reasoning for texts. We evaluate our approach on two popular emotion detection datasets and also release the fine-grained emotion labels and explanations for further training and fine-tuning of emotional reasoning systems.
摘要：大型语言模型 (LLM) 在使用思路链 (CoT) 提示技术的数学和常识推理任务中表现出色。但是，它们能否通过将“让我们一步一步思考”连接到输入提示来进行情感推理？在本文中，我们研究了这个问题，并介绍了一种使用 LLM 进行零样本情感检测和情感推理的新方法。现有的最先进的零样本方法依赖于文本蕴涵模型来为输入文本选择最合适的情感标签。我们认为，这严重限制了模型到一组固定的标签，这可能不适合或不足以满足许多需要情感分析的应用。相反，我们建议将情感分析问题定义为生成性问答 (QA) 任务。我们的方法使用两步方法生成相关的上下文或背景知识来逐步回答情感检测问题。我们的论文是第一篇使用生成方法共同解决文本情感检测和情感推理任务的研究。我们在两个流行的情绪检测数据集上评估了我们的方法，并发布了细粒度的情绪标签和解释，以便进一步训练和微调情绪推理系统。

Title: HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction

Authors: Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, Dhagash Mehta
Subjects: cs.CL, cs.LG, q-fin.ST, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2408.04948
Pdf URL: https://arxiv.org/pdf/2408.04948
Copy Paste: [[2408.04948]] HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction(https://arxiv.org/abs/2408.04948)
Keywords: language model, llm, retrieval augmented generation
Abstract: Extracting and interpreting intricate information from unstructured text data arising in financial applications, such as earnings call transcripts, presents substantial challenges to large language models (LLMs), even when following current best practices for Retrieval Augmented Generation (RAG) (referred to as VectorRAG techniques, which utilize vector databases for information retrieval), due to challenges such as domain-specific terminology and complex document formats. We introduce a novel approach, called HybridRAG, that combines Knowledge Graph (KG) based RAG techniques (called GraphRAG) with VectorRAG techniques to enhance question-answer (Q&A) systems for information extraction from financial documents, and that is shown to be capable of generating accurate and contextually relevant answers. Using experiments on a set of financial earnings call transcripts, which come in Q&A format and hence provide a natural set of ground-truth Q&A pairs, we show that HybridRAG, which retrieves context from both a vector database and a KG, outperforms both traditional VectorRAG and GraphRAG individually when evaluated at both the retrieval and generation stages in terms of retrieval accuracy and answer generation. The proposed technique has applications beyond the financial domain.
摘要：由于领域特定术语和文档格式复杂等问题，即使使用当前最佳实践来使用检索增强生成 (RAG)（称为 VectorRAG 技术，利用向量数据库进行信息检索），从财务应用中产生的非结构化文本数据（例如收益电话会议记录）中提取和解释复杂信息对大型语言模型 (LLM) 也提出了巨大挑战。我们引入了一种基于知识图谱 (KG) 的 RAG 技术（称为 GraphRAG）和 VectorRAG 技术的组合的新方法，称为 HybridRAG，以增强问答 (Q&A) 系统从财务文档中提取信息的能力，该系统已被证明能够生成准确且与上下文相关的答案。通过对一组以问答形式呈现的财务收益电话会议记录文档进行实验，从而提供一组自然的真实问答对，我们表明，从向量数据库和知识图谱中检索上下文的 HybridRAG 在检索和生成阶段的检索准确性和答案生成方面均优于传统的 VectorRAG 和 GraphRAG。所提出的技术的应用范围超出了金融领域
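
The core idea of retrieving context from both a vector store and a knowledge graph can be sketched as a simple merge step. This is an illustrative outline only, not the paper's implementation: the two retrievers are stand-in callables, the passages are toy data, and the deduplicate-then-concatenate strategy is an assumption.

```python
def hybrid_context(question, vector_retriever, graph_retriever, top_k=3):
    """Combine the top passages from two retrievers into one context string,
    keeping vector hits first and dropping exact duplicates."""
    vector_hits = vector_retriever(question)[:top_k]
    graph_hits = graph_retriever(question)[:top_k]
    merged, seen = [], set()
    for passage in vector_hits + graph_hits:
        if passage not in seen:
            seen.add(passage)
            merged.append(passage)
    return "\n".join(merged)


# Hypothetical stand-ins for a vector database lookup and a KG query.
def toy_vector_retriever(question):
    return ["Revenue grew 10% year over year.", "Margins were flat."]


def toy_graph_retriever(question):
    return ["(Company) -[REPORTED]-> (Revenue growth 10%)", "Margins were flat."]


context = hybrid_context(
    "How did revenue change?", toy_vector_retriever, toy_graph_retriever
)
print(context)
```

The merged string would then be placed in the LLM prompt alongside the question; how the two sources are weighted or ordered is a design choice this sketch does not settle.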

Title: Generalisation First, Memorisation Second? Memorisation Localisation for Natural Language Classification Tasks

Authors: Verna Dankers, Ivan Titov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04965
Pdf URL: https://arxiv.org/pdf/2408.04965
Copy Paste: [[2408.04965]] Generalisation First, Memorisation Second? Memorisation Localisation for Natural Language Classification Tasks(https://arxiv.org/abs/2408.04965)
Keywords: language model
Abstract: Memorisation is a natural part of learning from real-world data: neural models pick up on atypical input-output combinations and store those training examples in their parameter space. That this happens is well-known, but how and where are questions that remain largely unanswered. Given a multi-layered neural model, where does memorisation occur in the millions of parameters? Related work reports conflicting findings: a dominant hypothesis based on image classification is that lower layers learn generalisable features and that deeper layers specialise and memorise. Work from NLP suggests this does not apply to language models, but has been mainly focused on memorisation of facts. We expand the scope of the localisation question to 12 natural language classification tasks and apply 4 memorisation localisation techniques. Our results indicate that memorisation is a gradual process rather than a localised one, establish that memorisation is task-dependent, and give nuance to the generalisation first, memorisation second hypothesis.
摘要：记忆是从现实世界数据中学习的自然组成部分：神经模型会拾取非典型的输入输出组合并将这些训练示例存储在其参数空间中。这种情况是众所周知的，但如何以及在哪里发生的问题仍然在很大程度上没有得到解答。给定一个多层神经模型，在数百万个参数中记忆发生在哪里？相关工作报告了相互矛盾的发现：基于图像分类的主导假设是较低层学习可概括的特征，而较深的层专门化和记忆。来自 NLP 的研究表明这并不适用于语言模型，但主要集中在事实的记忆上。我们将定位问题的范围扩展到 12 个自然语言分类任务，并应用 4 种记忆定位技术。我们的结果表明，记忆是一个渐进的过程，而不是一个局部的过程，确定记忆依赖于任务，并为泛化第一、记忆第二的假设提供细微差别。

Title: Get Confused Cautiously: Textual Sequence Memorization Erasure with Selective Entropy Maximization

Authors: Zhaohan Zhang, Ziquan Liu, Ioannis Patras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04983
Pdf URL: https://arxiv.org/pdf/2408.04983
Copy Paste: [[2408.04983]] Get Confused Cautiously: Textual Sequence Memorization Erasure with Selective Entropy Maximization(https://arxiv.org/abs/2408.04983)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been found to memorize and recite some of the textual sequences from their training set verbatim, raising broad concerns about privacy and copyright issues when using LLMs. This Textual Sequence Memorization (TSM) phenomenon leads to a high demand to regulate LLM output to prevent it from generating certain memorized text to meet user requirements. However, our empirical study reveals that existing methods for TSM erasure fail to forget massive memorized samples without substantially jeopardizing the model utility. To achieve a better trade-off between the effectiveness of TSM erasure and model utility in LLMs, our paper proposes a new framework based on Entropy Maximization with Selective Optimization (EMSO), where the updated weights are chosen with a novel contrastive gradient metric without any participation of additional model or data. Our analysis shows that training with the entropy maximization loss has a more stable optimization process and better keeps model utility than existing methods. The contrastive gradient metric localizes the most influential weight for TSM erasure by taking both the gradient magnitude and direction into consideration. Extensive experiments across three model scales demonstrate that our method excels in handling large-scale forgetting requests while preserving model ability in language generation and reasoning.
摘要：大型语言模型 (LLM) 被发现会逐字记忆和背诵其训练集中的一些文本序列，这引起了人们对使用 LLM 时的隐私和版权问题的广泛担忧。这种文本序列记忆 (TSM) 现象导致对规范 LLM 输出的需求很高，以防止其生成某些记忆的文本来满足用户要求。然而，我们的实证研究表明，现有的 TSM 擦除方法无法在不严重损害模型效用的情况下忘记大量记忆样本。为了在 LLM 中更好地平衡 TSM 擦除的有效性和模型效用，我们的论文提出了一种基于熵最大化和选择性优化 (EMSO) 的新框架，其中使用新颖的对比梯度度量选择更新后的权重，而无需任何额外模型或数据的参与。我们的分析表明，与现有方法相比，使用熵最大化损失进行训练具有更稳定的优化过程并且更好地保持了模型效用。对比梯度度量通过同时考虑梯度幅度和方向来定位对 TSM 擦除影响最大的权重。在三个模型尺度上进行的大量实验表明，我们的方法在处理大规模遗忘请求方面表现出色，同时保留了模型在语言生成和推理方面的能力。
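
An entropy maximization loss can be written down in a few lines. The sketch below shows only that loss on a single next-token distribution, using plain Python; it does not implement EMSO's selective optimization or its contrastive gradient metric, and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_maximization_loss(logits):
    """Negative entropy of the predicted distribution.
    Minimizing this loss pushes the distribution toward uniform."""
    return -entropy(softmax(logits))


# A confident (memorized-looking) prediction versus a maximally uncertain one.
peaked = entropy_maximization_loss([10.0, 0.0, 0.0, 0.0])
flat = entropy_maximization_loss([0.0, 0.0, 0.0, 0.0])
print(peaked > flat)  # True: the flat distribution has the lower loss
```

Because the loss needs no target label, it can be applied to memorized sequences without supplying replacement text, which fits the abstract's claim that no additional model or data is involved.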

Title: ProFuser: Progressive Fusion of Large Language Models

Authors: Tianyuan Shi, Fanqi Wan, Canbin Huang, Xiaojun Quan, Chenliang Li, Ming Yan, Ji Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04998
Pdf URL: https://arxiv.org/pdf/2408.04998
Copy Paste: [[2408.04998]] ProFuser: Progressive Fusion of Large Language Models(https://arxiv.org/abs/2408.04998)
Keywords: language model, llm, chat
Abstract: While fusing the capacities and advantages of various large language models (LLMs) offers a pathway to construct more powerful and versatile models, a fundamental challenge is to properly select the advantageous model during training. Existing fusion methods primarily focus on the training mode that uses cross entropy on ground truth in a teacher-forcing setup to measure a model's advantage, which may provide limited insight into model advantage. In this paper, we introduce a novel approach that enhances the fusion process by incorporating both the training and inference modes. Our method evaluates model advantage not only through cross entropy during training but also by considering inference outputs, providing a more comprehensive assessment. To combine the two modes effectively, we introduce ProFuser to progressively transition from inference mode to training mode. To validate ProFuser's effectiveness, we fused three models, including vicuna-7b-v1.5, Llama-2-7b-chat, and mpt-7b-8k-chat, and demonstrated improved performance in knowledge, reasoning, and safety compared to baseline methods.

Title: Investigating a Benchmark for Training-set free Evaluation of Linguistic Capabilities in Machine Reading Comprehension

Authors: Viktor Schlegel, Goran Nenadic, Riza Batista-Navarro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.05023
Pdf URL: https://arxiv.org/pdf/2408.05023
Copy Paste: [[2408.05023]] Investigating a Benchmark for Training-set free Evaluation of Linguistic Capabilities in Machine Reading Comprehension(https://arxiv.org/abs/2408.05023)
Keywords: language model
Abstract: Performance of NLP systems is typically evaluated by collecting a large-scale dataset by means of crowd-sourcing to train a data-driven model and evaluate it on a held-out portion of the data. This approach has been shown to suffer from spurious correlations and the lack of challenging examples that represent the diversity of natural language. Instead, we examine a framework for evaluating optimised models in a training-set-free setting on synthetically generated challenge sets. We find that despite the simplicity of the generation method, the data can compete with crowd-sourced datasets with regard to naturalness and lexical diversity for the purpose of evaluating the linguistic capabilities of MRC models. We conduct further experiments and show that state-of-the-art language model-based MRC systems can learn to succeed on the challenge set correctly, although without capturing the general notion of the evaluated phenomenon.

Title: Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil

Authors: Marcelo Sartori Locatelli, Matheus Prado Miranda, Igor Joaquim da Silva Costa, Matheus Torres Prates, Victor Thomé, Mateus Zaparoli Monteiro, Tomas Lacerda, Adriana Pagano, Eduardo Rios Neto, Wagner Meira Jr., Virgilio Almeida
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2408.05035
Pdf URL: https://arxiv.org/pdf/2408.05035
Copy Paste: [[2408.05035]] Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil(https://arxiv.org/abs/2408.05035)
Keywords: language model, gpt, llm
Abstract: The Exame Nacional do Ensino Médio (ENEM) is a pivotal test for Brazilian students, required for admission to a significant number of universities in Brazil. The test consists of four objective high-school level tests on Math, Humanities, Natural Sciences and Languages, and one written essay. Students' answers to the test and to the accompanying socioeconomic status questionnaire are made public every year (albeit anonymized) due to transparency policies from the Brazilian Government. In the context of large language models (LLMs), these data lend themselves nicely to comparing different groups of humans with AI, as we can have access to human and machine answer distributions. We leverage these characteristics of the ENEM dataset and compare GPT-3.5 and 4, and MariTalk, a model trained using Portuguese data, to humans, aiming to ascertain how their answers relate to real societal groups and what that may reveal about the model biases. We divide the human groups by using socioeconomic status (SES), and compare their answer distribution with LLMs for each question and for the essay. We find no significant biases when comparing LLM performance to humans on the multiple-choice Brazilian Portuguese tests, as the distance between model and human answers is mostly determined by the human accuracy. A similar conclusion is found by looking at the generated text as, when analyzing the essays, we observe that human and LLM essays differ in a few key factors, one being the choice of words where model essays were easily separable from human ones. The texts also differ syntactically, with LLM-generated essays exhibiting, on average, shorter sentences and fewer thought units, among other differences. These results suggest that, for Brazilian Portuguese in the ENEM context, LLM outputs represent no group of humans, being significantly different from the answers from Brazilian students across all tests.

Title: RT-Surv: Improving Mortality Prediction After Radiotherapy with Large Language Model Structuring of Large-Scale Unstructured Electronic Health Records

Authors: Sangjoon Park, Chan Woo Wee, Seo Hee Choi, Kyung Hwan Kim, Jee Suk Chang, Hong In Yoon, Ik Jae Lee, Yong Bae Kim, Jaeho Cho, Ki Chang Keum, Chang Geol Lee, Hwa Kyung Byun, Woong Sub Koom
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.05074
Pdf URL: https://arxiv.org/pdf/2408.05074
Copy Paste: [[2408.05074]] RT-Surv: Improving Mortality Prediction After Radiotherapy with Large Language Model Structuring of Large-Scale Unstructured Electronic Health Records(https://arxiv.org/abs/2408.05074)
Keywords: language model, llm
Abstract: Accurate patient selection is critical in radiotherapy (RT) to prevent ineffective treatments. Traditional survival prediction models, relying on structured data, often lack precision. This study explores the potential of large language models (LLMs) to structure unstructured electronic health record (EHR) data, thereby improving survival prediction accuracy through comprehensive clinical information integration. Data from 34,276 patients treated with RT at Yonsei Cancer Center between 2013 and 2023 were analyzed, encompassing both structured and unstructured data. An open-source LLM was used to structure the unstructured EHR data via single-shot learning, with its performance compared against a domain-specific medical LLM and a smaller variant. Survival prediction models were developed using statistical, machine learning, and deep learning approaches, incorporating both structured and LLM-structured data. Clinical experts evaluated the accuracy of the LLM-structured data. The open-source LLM achieved 87.5% accuracy in structuring unstructured EHR data without additional training, significantly outperforming the domain-specific medical LLM, which reached only 35.8% accuracy. Larger LLMs were more effective, particularly in extracting clinically relevant features like general condition and disease extent, which closely correlated with patient survival. Incorporating LLM-structured clinical features into survival prediction models significantly improved accuracy, with the C-index of deep learning models increasing from 0.737 to 0.820. These models also became more interpretable by emphasizing clinically significant factors. This study shows that general-domain LLMs, even without specific medical training, can effectively structure large-scale unstructured EHR data, substantially enhancing the accuracy and interpretability of clinical predictive models.
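
The abstract above reports a C-index rising from 0.737 to 0.820. For readers unfamiliar with the metric, the following is a minimal sketch of the standard Harrell concordance index on right-censored survival data; it illustrates what the number measures and is not taken from the paper. The function name and the toy data in the usage note are assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times:       observed follow-up time per patient
    events:      1 if death was observed, 0 if the patient was censored
    risk_scores: model output, higher meaning higher predicted risk

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time. The pair is concordant
    when the model assigned patient i the higher risk; tied risk scores
    count as half. Pairs with tied times are skipped in this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means the predicted risks order every comparable pair correctly, and 0.5 is what constant or random scores achieve. For example, `concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.5, 0.1])` ranks all five comparable pairs correctly.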

Title: Generating novel experimental hypotheses from language models: A case study on cross-dative generalization

Authors: Kanishka Misra, Najoung Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.05086
Pdf URL: https://arxiv.org/pdf/2408.05086
Copy Paste: [[2408.05086]] Generating novel experimental hypotheses from language models: A case study on cross-dative generalization(https://arxiv.org/abs/2408.05086)
Keywords: language model
Abstract: Neural network language models (LMs) have been shown to successfully capture complex linguistic knowledge. However, their utility for understanding language acquisition is still debated. We contribute to this debate by presenting a case study where we use LMs as simulated learners to derive novel experimental hypotheses to be tested with humans. We apply this paradigm to study cross-dative generalization (CDG): productive generalization of novel verbs across dative constructions (she pilked me the ball/she pilked the ball to me) -- acquisition of which is known to involve a large space of contextual features -- using LMs trained on child-directed speech. We specifically ask: "what properties of the training exposure facilitate a novel verb's generalization to the (unmodeled) alternate construction?" To answer this, we systematically vary the exposure context in which a novel dative verb occurs in terms of the properties of the theme and recipient, and then analyze the LMs' usage of the novel verb in the unmodeled dative construction. We find LMs to replicate known patterns of children's CDG, as a precondition to exploring novel hypotheses. Subsequent simulations reveal a nuanced role of the features of the novel verbs' exposure context on the LMs' CDG. We find CDG to be facilitated when the first postverbal argument of the exposure context is pronominal, definite, short, and conforms to the prototypical animacy expectations of the exposure dative. These patterns are characteristic of harmonic alignment in datives, where the argument with features ranking higher on the discourse prominence scale tends to precede the other. This gives rise to a novel hypothesis that CDG is facilitated insofar as the features of the exposure context -- in particular, its first postverbal argument -- are harmonically aligned. We conclude by proposing future experiments that can test this hypothesis in children.

Title: Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models

Authors: Zikai Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.05093
Pdf URL: https://arxiv.org/pdf/2408.05093
Copy Paste: [[2408.05093]] Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models(https://arxiv.org/abs/2408.05093)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) have attracted significant attention since their inception, finding applications across various academic and industrial domains. However, these models often suffer from the "hallucination problem", where outputs, though grammatically and logically coherent, lack factual accuracy or are entirely fabricated. A particularly troubling issue discovered and widely discussed recently is the numerical comparison error where multiple LLMs incorrectly infer that "9.11 > 9.9". We discovered that the order in which LLMs generate answers and reasoning impacts their consistency. Specifically, results vary significantly when an LLM generates an answer first and then provides the reasoning versus generating the reasoning process first and then the conclusion. Inspired by this, we propose a new benchmark method for assessing LLM consistency: comparing responses generated through these two different approaches. This benchmark effectively identifies instances where LLMs fabricate answers and subsequently generate justifications. Furthermore, we introduce a novel and straightforward prompt strategy designed to mitigate this issue. Experimental results demonstrate that this strategy improves performance across various LLMs compared to direct questioning. This work not only sheds light on a critical flaw in LLMs but also offers a practical solution to enhance their reliability.
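
The benchmark idea described above, comparing an answer-first response with a reasoning-first response to the same question, can be sketched as follows. The prompt wording, the `ask` and `extract_answer` callables, and the stub model are all hypothetical stand-ins; the abstract does not specify the paper's actual prompts or scoring.

```python
from typing import Callable, Iterable

# Hypothetical prompt templates; the paper's wording is not given in the abstract.
ANSWER_FIRST = "{question}\nState your final answer first, then explain your reasoning."
REASONING_FIRST = "{question}\nExplain your reasoning step by step, then state your final answer."

def is_order_consistent(question: str,
                        ask: Callable[[str], str],
                        extract_answer: Callable[[str], str]) -> bool:
    """True when both generation orders yield the same extracted answer."""
    first = extract_answer(ask(ANSWER_FIRST.format(question=question)))
    second = extract_answer(ask(REASONING_FIRST.format(question=question)))
    return first == second

def consistency_rate(questions: Iterable[str],
                     ask: Callable[[str], str],
                     extract_answer: Callable[[str], str]) -> float:
    """Fraction of questions answered identically under both orders."""
    results = [is_order_consistent(q, ask, extract_answer) for q in questions]
    return sum(results) / len(results)

# Stub standing in for a real LLM call: it changes its answer with the order.
def stub_model(prompt: str) -> str:
    if "final answer first" in prompt:
        return "Answer: 9.11"
    return "Comparing tenths, 9 is greater than 1. Answer: 9.9"

def last_token(text: str) -> str:
    """Toy answer extractor: take the last whitespace-separated token."""
    return text.split()[-1]

flagged = not is_order_consistent("Which is larger, 9.11 or 9.9?",
                                  stub_model, last_token)
```

Here `flagged` is `True` because the stub gives different answers under the two orders, which is the kind of inconsistency the benchmark is meant to surface. In practice `ask` would wrap a real model client.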

Title: Unlocking Decoding-time Controllability: Gradient-Free Multi-Objective Alignment with Contrastive Prompts

Authors: Tingchen Fu, Yupeng Hou, Julian McAuley, Rui Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.05094
Pdf URL: https://arxiv.org/pdf/2408.05094
Copy Paste: [[2408.05094]] Unlocking Decoding-time Controllability: Gradient-Free Multi-Objective Alignment with Contrastive Prompts(https://arxiv.org/abs/2408.05094)
Keywords: language model, prompt
Abstract: The task of multi-objective alignment aims at balancing and controlling the different alignment objectives (e.g., helpfulness, harmlessness and honesty) of large language models to meet the personalized requirements of different users. However, previous methods tend to train multiple models to deal with various user preferences, with the number of trained models growing linearly with the number of alignment objectives and the number of different preferences. Meanwhile, existing methods generally extend poorly and require significant re-training for each new alignment objective considered. Considering the limitations of previous approaches, we propose MCA (Multi-objective Contrastive Alignment), which constructs an expert prompt and an adversarial prompt for each objective to contrast at decoding time and balances the objectives by combining the contrasts. Our approach is verified to be superior to previous methods in obtaining a well-distributed Pareto front among different alignment objectives.
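
One way to picture "contrasting an expert prompt with an adversarial prompt at decoding time and combining the contrasts" is the sketch below. It assumes, purely for illustration, that each objective contributes a weighted difference between the next-token logits obtained under its expert prompt and under its adversarial prompt, and that these differences are added to the logits of the plain prompt. The abstract does not state the actual combination rule, so the formula, the function name, and the toy numbers are all assumptions.

```python
def combine_contrasts(base_logits, expert_logits, adversarial_logits, weights):
    """Add weighted (expert - adversarial) logit differences to the base logits.

    base_logits:        next-token logits under the plain prompt
    expert_logits:      one logit vector per objective, under its expert prompt
    adversarial_logits: one logit vector per objective, under its adversarial prompt
    weights:            one user preference weight per objective

    Assumed rule: combined[k] = base[k] + sum_i w_i * (expert_i[k] - adv_i[k]).
    """
    combined = list(base_logits)
    for exp, adv, w in zip(expert_logits, adversarial_logits, weights):
        for k in range(len(combined)):
            combined[k] += w * (exp[k] - adv[k])
    return combined

# Toy example: a three-token vocabulary and two objectives (hypothetical values).
base = [1.0, 1.0, 1.0]
expert = [[2.0, 0.0, 0.0],    # e.g. a helpfulness expert prompt
          [0.0, 2.0, 0.0]]    # e.g. a harmlessness expert prompt
adversarial = [[0.0, 0.0, 2.0],
               [0.0, 0.0, 1.0]]
weights = [1.0, 0.5]          # the user weights helpfulness more heavily

combined = combine_contrasts(base, expert, adversarial, weights)
next_token = max(range(len(combined)), key=combined.__getitem__)  # greedy pick
```

Changing `weights` shifts the balance between objectives without any gradient update, which is the gradient-free, decoding-time control the title refers to.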

Title: MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

Authors: Junhao Xu, Zhenlin Liang, Yi Liu, Yichao Hu, Jian Li, Yajun Zheng, Meng Cai, Hua Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.05101
Pdf URL: https://arxiv.org/pdf/2408.05101
Copy Paste: [[2408.05101]] MooER: LLM-based Speech Recognition and Translation Models from Moore Threads(https://arxiv.org/abs/2408.05101)
Keywords: llm
Abstract: In this paper, we present MooER, an LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model from Moore Threads. A 5000h pseudo-labeled dataset containing open-source and self-collected speech data is used for training. We achieve performance comparable to other open-source models trained with up to hundreds of thousands of hours of labeled speech data. Meanwhile, experiments conducted on the Covost2 Zh2en test set suggest that our model outperforms other open-source Speech LLMs, obtaining a BLEU score of 25.2. The main contributions of this paper are summarized as follows. First, this paper presents a training strategy for encoders and LLMs on speech-related tasks (including ASR and AST) using a small amount of pseudo-labeled data without any extra manual annotation or selection. Second, we release our ASR and AST models and plan to open-source our training code and strategy in the near future. Moreover, a model trained on 8wh scale training data is planned to be released later on.

Title: How Well Do LLMs Identify Cultural Unity in Diversity?

Authors: Jialin Li, Junli Wang, Junjie Hu, Ming Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.05102
Pdf URL: https://arxiv.org/pdf/2408.05102
Copy Paste: [[2408.05102]] How Well Do LLMs Identify Cultural Unity in Diversity?(https://arxiv.org/abs/2408.05102)
Keywords: language model, llm, prompt
Abstract: Much work on the cultural awareness of large language models (LLMs) focuses on the models' sensitivity to geo-cultural diversity. However, in addition to cross-cultural differences, there also exists common ground across cultures. For instance, a bridal veil in the United States plays a culturally relevant role similar to that of a honggaitou in China. In this study, we introduce a benchmark dataset CUNIT for evaluating decoder-only LLMs in understanding the cultural unity of concepts. Specifically, CUNIT consists of 1,425 evaluation examples building upon 285 traditional culture-specific concepts across 10 countries. Based on a systematic manual annotation of culturally relevant features per concept, we calculate the cultural association between any pair of cross-cultural concepts. Built upon this dataset, we design a contrastive matching task to evaluate the LLMs' capability to identify highly associated cross-cultural concept pairs. We evaluate 3 strong LLMs, using 3 popular prompting strategies, under the settings of either giving all extracted concept features or no features at all on CUNIT. Interestingly, we find that cultural associations across countries regarding clothing concepts largely differ from those regarding food. Our analysis shows that LLMs are still limited in capturing cross-cultural associations between concepts compared to humans. Moreover, geo-cultural proximity shows a weak influence on model performance in capturing cross-cultural associations.

Title: A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning

Authors: Ye Yuan, Chengwu Liu, Jingyang Yuan, Gongbo Sun, Siqi Li, Ming Zhang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2408.05141
Pdf URL: https://arxiv.org/pdf/2408.05141
Copy Paste: [[2408.05141]] A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning(https://arxiv.org/abs/2408.05141)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) is a framework enabling large language models (LLMs) to enhance their accuracy and reduce hallucinations by integrating external knowledge bases. In this paper, we introduce a hybrid RAG system enhanced through a comprehensive suite of optimizations that significantly improve retrieval quality, augment reasoning capabilities, and refine numerical computation ability. We refined the text chunks and tables in web pages, added attribute predictors to reduce hallucinations, employed an LLM Knowledge Extractor and a Knowledge Graph Extractor, and finally built a reasoning strategy with all the references. We evaluated our system on the CRAG dataset through the Meta CRAG KDD Cup 2024 Competition. Both the local and online evaluations demonstrate that our system significantly enhances complex reasoning capabilities. In local evaluations, we have significantly improved accuracy and reduced error rates compared to the baseline model, achieving a notable increase in scores. Meanwhile, we have attained outstanding results in online assessments, demonstrating the performance and generalization capabilities of the proposed system. The source code for our system is released at this https URL.

Title: TaSL: Task Skill Localization and Consolidation for Language Model Continual Learning

Authors: Yujie Feng, Xu Chu, Yongxin Xu, Zexin Lu, Bo Liu, Philip S. Yu, Xiao-Ming Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.05200
Pdf URL: https://arxiv.org/pdf/2408.05200
Copy Paste: [[2408.05200]] TaSL: Task Skill Localization and Consolidation for Language Model Continual Learning(https://arxiv.org/abs/2408.05200)
Keywords: language model, llm
Abstract: Language model continual learning (CL) has recently garnered significant interest due to its potential to adapt large language models (LLMs) to dynamic real-world environments without re-training. A key challenge in this field is catastrophic forgetting, where models lose previously acquired knowledge when learning new tasks. Existing methods commonly employ multiple parameter-efficient fine-tuning (PEFT) blocks to acquire task-specific knowledge for each task, but these approaches lack efficiency and overlook the potential for knowledge transfer through task interaction. In this paper, we present a novel CL framework for language models called Task Skill Localization and Consolidation (TaSL), which enhances knowledge transfer without relying on memory replay. TaSL first divides the model into 'skill units' based on parameter dependencies, enabling more granular control. It then employs a novel group-wise skill localization technique to identify the importance distribution of skill units for a new task. By comparing this importance distribution with those from previous tasks, we implement a fine-grained skill consolidation strategy that retains task-specific knowledge, thereby preventing forgetting, and updates task-shared knowledge, which facilitates bi-directional knowledge transfer. As a result, TaSL achieves a superior balance between retaining previous knowledge and excelling in new tasks. TaSL also shows strong generalizability, suitable for general models and customizable for PEFT methods like LoRA. Additionally, it demonstrates notable extensibility, allowing integration with memory replay to further enhance performance. Extensive experiments on two CL benchmarks, with varying model sizes (from 220M to 7B), demonstrate the effectiveness of TaSL and its variants across different settings.