2024-10-22

Title: Agent Skill Acquisition for Large Language Models via CycleQD

Authors: So Kuroki, Taishi Nakamura, Takuya Akiba, Yujin Tang
Subjects: cs.CL, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2410.14735
Pdf URL: https://arxiv.org/pdf/2410.14735
Copy Paste: [[2410.14735]] Agent Skill Acquisition for Large Language Models via CycleQD(https://arxiv.org/abs/2410.14735)
Keywords: language model, gpt, agent
Abstract: Training large language models to acquire specific skills remains a challenging endeavor. Conventional training approaches often struggle with data distribution imbalances and inadequacies in objective functions that do not align well with task-specific performance. To address these challenges, we introduce CycleQD, a novel approach that leverages the Quality Diversity framework through a cyclic adaptation of the algorithm, along with a model merging based crossover and an SVD-based mutation. In CycleQD, each task's performance metric is alternated as the quality measure while the others serve as the behavioral characteristics. This cyclic focus on individual tasks allows for concentrated effort on one task at a time, eliminating the need for data ratio tuning and simplifying the design of the objective function. Empirical results from AgentBench indicate that applying CycleQD to LLAMA3-8B-INSTRUCT based models not only enables them to surpass traditional fine-tuning methods in coding, operating systems, and database tasks, but also achieves performance on par with GPT-3.5-TURBO, which potentially has many more parameters, across these domains. Crucially, this enhanced performance is achieved while retaining robust language capabilities, as evidenced by its performance on widely adopted language benchmark tasks. We highlight the key design choices in CycleQD, detailing how these contribute to its effectiveness. Furthermore, our method is general and can be applied to image segmentation models, highlighting its applicability across different domains.
摘要：训练大型语言模型以获得特定技能仍然是一项艰巨的任务。传统的训练方法通常会遇到数据分布不平衡和目标函数不足的问题，这些不足与特定任务的性能不太匹配。为了应对这些挑战，我们引入了 CycleQD，这是一种新方法，它通过算法的循环调整以及基于模型合并的交叉和基于 SVD 的突变来利用质量多样性框架。在 CycleQD 中，每个任务的性能指标交替作为质量度量，而其他指标则作为行为特征。这种对单个任务的循环关注允许一次将精力集中在一项任务上，从而无需调整数据比率并简化目标函数的设计。AgentBench 的实证结果表明，将 CycleQD 应用于基于 LLAMA3-8B-INSTRUCT 的模型不仅使它们能够在编码、操作系统和数据库任务中超越传统的微调方法，而且在这些领域中实现了与可能包含更多参数的 GPT-3.5-TURBO 相当的性能。至关重要的是，这种增强的性能是在保留强大的语言能力的同时实现的，这一点从其在广泛采用的语言基准测试任务上的表现就可以看出。我们重点介绍了 CycleQD 中的关键设计选择，详细说明了这些选择如何有助于提高其有效性。此外，我们的方法是通用的，可以应用于图像分割模型，突出了其在不同领域的适用性。
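Editor's sketch (not from the paper): the cyclic role assignment described in the abstract can be illustrated with a toy archive in which one task's score is the quality measure and the remaining scores are binned into behavioral-characteristic cells. The task names, the bin count, and the helper names `cycle_roles`, `cell_of`, and `try_insert` are assumptions made for illustration; the model-merging crossover and SVD-based mutation are omitted.

```python
def cycle_roles(task_names, generation):
    """Pick the quality task for this generation; the rest act as behaviors."""
    quality = task_names[generation % len(task_names)]
    behaviors = [t for t in task_names if t != quality]
    return quality, behaviors


def cell_of(scores, behaviors, bins=4):
    """Discretise behavior scores (assumed to lie in [0, 1]) into a grid cell."""
    return tuple(min(int(scores[t] * bins), bins - 1) for t in behaviors)


def try_insert(archive, scores, model_id, quality, behaviors, bins=4):
    """Keep a model only if it beats the incumbent of its cell on the quality task."""
    cell = cell_of(scores, behaviors, bins)
    incumbent = archive.get(cell)
    if incumbent is None or scores[quality] > incumbent[0]:
        archive[cell] = (scores[quality], model_id)
        return True
    return False


# Hypothetical usage: three tasks, roles rotate every generation.
tasks = ["coding", "os", "db"]
quality, behaviors = cycle_roles(tasks, generation=0)
archive = {}
try_insert(archive, {"coding": 0.3, "os": 0.5, "db": 1.0}, "m1", quality, behaviors)
```

Because only one metric is the optimization target at any time, the sketch needs no weighting between tasks, which is the property the abstract attributes to the cyclic design.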

Title: Eliciting Uncertainty in Chain-of-Thought to Mitigate Bias against Forecasting Harmful User Behaviors

Authors: Anthony Sicilia, Malihe Alikhani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.14744
Pdf URL: https://arxiv.org/pdf/2410.14744
Copy Paste: [[2410.14744]] Eliciting Uncertainty in Chain-of-Thought to Mitigate Bias against Forecasting Harmful User Behaviors(https://arxiv.org/abs/2410.14744)
Keywords: language model, llm, chain-of-thought
Abstract: Conversation forecasting tasks a model with predicting the outcome of an unfolding conversation. For instance, it can be applied in social media moderation to predict harmful user behaviors before they occur, allowing for preventative interventions. While large language models (LLMs) have recently been proposed as an effective tool for conversation forecasting, it's unclear what biases they may have, especially against forecasting the (potentially harmful) outcomes we request them to predict during moderation. This paper explores to what extent model uncertainty can be used as a tool to mitigate potential biases. Specifically, we ask three primary research questions: 1) how does LLM forecasting accuracy change when we ask models to represent their uncertainty; 2) how does LLM bias change when we ask models to represent their uncertainty; 3) how can we use uncertainty representations to reduce or completely mitigate biases without many training data points. We address these questions for 5 open-source language models tested on 2 datasets designed to evaluate conversation forecasting for social media moderation.
摘要：对话预测要求模型预测正在进行的对话的结果。例如，它可以应用于社交媒体审核，在有害用户行为发生之前进行预测，从而进行预防性干预。虽然大型语言模型 (LLM) 最近被提议作为一种有效的对话预测工具，但尚不清楚它们可能存在哪些偏见，尤其是在预测我们要求它们在审核期间预测的（潜在有害）结果方面。本文探讨了模型不确定性在多大程度上可以用作减轻潜在偏见的工具。具体来说，我们提出了三个主要研究问题：1) 当我们要求模型表示其不确定性时，LLM 预测准确性如何变化；2) 当我们要求模型表示其不确定性时，LLM 偏差如何变化；3) 我们如何使用不确定性表示来减少或完全减轻偏见，而无需太多训练数据点。我们针对 5 个开源语言模型解决了这些问题，这些模型在 2 个数据集上进行了测试，旨在评估社交媒体审核的对话预测。

Title: SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation

Authors: Junyu Luo, Xiao Luo, Xiusi Chen, Zhiping Xiao, Wei Ju, Ming Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.14745
Pdf URL: https://arxiv.org/pdf/2410.14745
Copy Paste: [[2410.14745]] SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation(https://arxiv.org/abs/2410.14745)
Keywords: language model, gpt, llm
Abstract: Supervised fine-tuning (SFT) is crucial in adapting large language models (LLMs) to a specific domain or task. However, only a limited amount of labeled data is available in practical applications, which poses a severe challenge for SFT in yielding satisfactory results. Therefore, a data-efficient framework that can fully exploit labeled and unlabeled data for LLM fine-tuning is highly anticipated. Towards this end, we introduce a semi-supervised fine-tuning framework named SemiEvol for LLM adaptation that works in a propagate-and-select manner. For knowledge propagation, SemiEvol adopts a bi-level approach, propagating knowledge from labeled data to unlabeled data through both in-weight and in-context methods. For knowledge selection, SemiEvol incorporates a collaborative learning mechanism, selecting higher-quality pseudo-response samples. We conducted experiments using GPT-4o-mini and Llama-3.1 on seven general or domain-specific datasets, demonstrating significant improvements in model performance on target data. Furthermore, we compared SemiEvol with SFT and self-evolution methods, highlighting its practicality in hybrid data scenarios.
摘要：监督微调 (SFT) 对于将大型语言模型 (LLM) 适应特定领域或任务至关重要。然而，实际应用中可用的标记数据数量有限，这对 SFT 产生令人满意的结果提出了严峻的挑战。因此，人们非常期待一个能够充分利用标记和未标记数据进行 LLM 微调的数据高效框架。为此，我们引入了一个半监督微调框架 SemiEvol，以从传播和选择的方式进行 LLM 适应。对于知识传播，SemiEvol 采用双层方法，通过权重和上下文方法将知识从标记数据传播到未标记数据。对于知识选择，SemiEvol 结合了协作学习机制，选择更高质量的伪响应样本。我们使用 GPT-4o-mini 和 Llama-3.1 在七个通用或特定领域的数据集上进行了实验，结果表明模型在目标数据上的性能有显著提高。此外，我们将 SemiEvol 与 SFT 和自进化方法进行了比较，强调了它在混合数据场景中的实用性。

Title: Accounting for Sycophancy in Language Model Uncertainty Estimation

Authors: Anthony Sicilia, Mert Inan, Malihe Alikhani
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2410.14746
Pdf URL: https://arxiv.org/pdf/2410.14746
Copy Paste: [[2410.14746]] Accounting for Sycophancy in Language Model Uncertainty Estimation(https://arxiv.org/abs/2410.14746)
Keywords: language model
Abstract: Effective human-machine collaboration requires machine learning models to externalize uncertainty, so users can reflect and intervene when necessary. For language models, these representations of uncertainty may be impacted by sycophancy bias: proclivity to agree with users, even if they are wrong. For instance, models may be over-confident in (incorrect) problem solutions suggested by a user. We study the relationship between sycophancy and uncertainty estimation for the first time. We propose a generalization of the definition of sycophancy bias to measure downstream impacts on uncertainty estimation, and also propose a new algorithm (SyRoUP) to account for sycophancy in the uncertainty estimation process. Unlike previous works on sycophancy, we study a broad array of user behaviors, varying both correctness and confidence of user suggestions to see how model answers (and their certainty) change. Our experiments across conversation forecasting and question-answering tasks show that user confidence plays a critical role in modulating the effects of sycophancy, and that SyRoUP can better predict these effects. From these results, we argue that externalizing both model and user uncertainty can help to mitigate the impacts of sycophancy bias.
摘要：有效的人机协作需要机器学习模型将不确定性外化，以便用户可以在必要时进行反思和干预。对于语言模型，这些不确定性的表示可能会受到谄媚偏见的影响：倾向于同意用户的意见，即使他们是错误的。例如，模型可能对用户建议的（不正确的）问题解决方案过于自信。我们首次研究了谄媚与不确定性估计之间的关系。我们提出了谄媚偏见定义的概括，以衡量对不确定性估计的下游影响，并提出了一种新算法（SyRoUP）来解释不确定性估计过程中的谄媚。与以前关于谄媚的研究不同，我们研究了广泛的用户行为，改变了用户建议的正确性和置信度，以了解模型答案（及其确定性）如何变化。我们在对话预测和问答任务中的实验表明，用户信心在调节谄媚的影响方面起着关键作用，而 SyRoUP 可以更好地预测这些影响。根据这些结果，我们认为外部化模型和用户的不确定性有助于减轻谄媚偏见的影响。

Title: Enabling Scalable Evaluation of Bias Patterns in Medical LLMs

Authors: Hamed Fayyaz, Raphael Poulain, Rahmatollah Beheshti
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.14763
Pdf URL: https://arxiv.org/pdf/2410.14763
Copy Paste: [[2410.14763]] Enabling Scalable Evaluation of Bias Patterns in Medical LLMs(https://arxiv.org/abs/2410.14763)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown impressive potential in helping with numerous medical challenges. Deploying LLMs in high-stakes applications such as medicine, however, brings in many concerns. One major area of concern relates to biased behaviors of LLMs in medical applications, leading to unfair treatment of individuals. To pave the way for the responsible and impactful deployment of Med LLMs, rigorous evaluation is a key prerequisite. Due to the huge complexity and variability of different medical scenarios, existing work in this domain has primarily relied on using manually crafted datasets for bias evaluation. In this study, we present a new method to scale up such bias evaluations by automatically generating test cases based on rigorous medical evidence. We specifically target the challenges of a) domain-specificity of bias characterization, b) hallucinating while generating the test cases, and c) various dependencies between the health outcomes and sensitive attributes. To that end, we offer new methods to address these challenges integrated with our generative pipeline, using medical knowledge graphs, medical ontologies, and customized general LLM evaluation frameworks in our method. Through a series of extensive experiments, we show that the test cases generated by our proposed method can effectively reveal bias patterns in Med LLMs at larger and more flexible scales than human-crafted datasets. We publish a large bias evaluation dataset using our pipeline, which is dedicated to a few medical case studies. A live demo of our application for vignette generation is available at this https URL. Our code is also available at this https URL.
摘要：大型语言模型 (LLM) 在帮助解决众多医学挑战方面表现出了惊人的潜力。然而，在医学等高风险应用中部署 LLM 带来了许多担忧。一个主要关注领域涉及 LLM 在医学应用中的偏见行为，导致对个人的不公平待遇。为了为负责任和有影响力地部署医学 LLM 铺平道路，严格的评估是一个关键先决条件。由于不同医疗场景的复杂性和多变性，该领域的现有工作主要依赖于使用手工制作的数据集进行偏见评估。在本研究中，我们提出了一种新方法，通过基于严格的医学证据自动生成测试用例来扩大此类偏见评估的规模。我们专门针对以下挑战：a) 偏见表征的领域特异性，b) 生成测试用例时产生幻觉，以及 c) 健康结果和敏感属性之间的各种依赖关系。为此，我们提供了新方法来应对这些挑战，这些方法与我们的生成流程集成，在我们的方法中使用医学知识图、医学本体和定制的通用 LLM 评估框架。通过一系列广泛的实验，我们表明，我们提出的方法生成的测试用例可以有效地揭示医学 LLM 中的偏见模式，其规模比人工制作的数据集更大、更灵活。我们使用我们的管道发布了一个大型偏见评估数据集，该数据集专用于一些医学案例研究。我们的病例情景 (vignette) 生成应用程序的现场演示可在此 https URL 上找到。我们的代码也可在此 https URL 上找到。

Title: Cross-Document Event-Keyed Summarization

Authors: William Walden, Pavlo Kuchmiichuk, Alexander Martin, Chihsheng Jin, Angela Cao, Claire Sun, Curisia Allen, Aaron Steven White
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.14795
Pdf URL: https://arxiv.org/pdf/2410.14795
Copy Paste: [[2410.14795]] Cross-Document Event-Keyed Summarization(https://arxiv.org/abs/2410.14795)
Keywords: llm, prompt
Abstract: Event-keyed summarization (EKS) requires generating a summary about a specific event described in a document, given the document and an event representation extracted from it. In this work, we extend EKS to the cross-document setting (CDEKS), in which summaries must synthesize information from accounts of the same event given by multiple sources. We introduce SEAMUS (Summaries of Events Across Multiple Sources), a high-quality dataset for CDEKS based on an expert reannotation of the FAMUS dataset for cross-document argument extraction. We present a suite of baselines on SEAMUS, covering both smaller, fine-tuned models, as well as zero- and few-shot prompted LLMs, along with detailed ablations, and a human evaluation study, showing SEAMUS to be a valuable benchmark for this new task.
摘要：事件键控摘要 (EKS) 需要根据文档和从中提取的事件表示生成文档中描述的特定事件的摘要。在这项工作中，我们将 EKS 扩展到跨文档设置 (CDEKS)，其中摘要必须综合来自多个来源提供的同一事件的描述信息。我们引入了 SEAMUS（跨多个来源的事件摘要），这是一个高质量的 CDEKS 数据集，基于专家对 FAMUS 数据集的重新注释，用于跨文档参数提取。我们在 SEAMUS 上提出了一套基线，涵盖了较小的、经过微调的模型，以及零样本和少量样本提示的 LLM，以及详细的消融和人工评估研究，表明 SEAMUS 是这项新任务的宝贵基准。

Title: Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus

Authors: Raviraj Joshi, Kanishk Singla, Anusha Kamath, Raunak Kalani, Rakesh Paul, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar, Eileen Long
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.14815
Pdf URL: https://arxiv.org/pdf/2410.14815
Copy Paste: [[2410.14815]] Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus(https://arxiv.org/abs/2410.14815)
Keywords: llm
Abstract: Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-based synthetic pre-training corpora for improving LLMs in low-resource languages. We conduct our study in the context of the low-resource Indic language Hindi. We introduce Nemotron-Mini-Hindi 4B, a bilingual SLM supporting both Hindi and English, based on Nemotron-Mini 4B. The model is trained using a mix of real and synthetic Hindi + English tokens, with continuous pre-training performed on 400B tokens. We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks. Additionally, we observe that the continued pre-training approach enhances the model's overall factual accuracy.
摘要：多语言 LLM 支持多种语言；但是，对于资源匮乏的语言，其性能并不理想。在这项工作中，我们强调了持续预训练多语言 LLM 以及使用基于翻译的合成预训练语料库来改进资源匮乏语言的 LLM 的重要性。我们在资源匮乏的印度语 Hindi 的背景下开展研究。我们引入了 Nemotron-Mini-Hindi 4B，这是一款基于 Nemotron-Mini 4B 的双语 SLM，支持 Hindi 和英语。该模型使用真实和合成 Hindi + 英语标记的混合进行训练，并对 400B 个标记进行持续预训练。我们证明基础模型和指导模型都在 Hindi 基准上取得了最先进的结果，同时在英语任务上保持竞争力。此外，我们观察到持续的预训练方法提高了模型的整体事实准确性。

Title: SPRIG: Improving Large Language Model Performance by System Prompt Optimization

Authors: Lechen Zhang, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, David Jurgens
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2410.14826
Pdf URL: https://arxiv.org/pdf/2410.14826
Copy Paste: [[2410.14826]] SPRIG: Improving Large Language Model Performance by System Prompt Optimization(https://arxiv.org/abs/2410.14826)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have shown impressive capabilities in many scenarios, but their performance depends, in part, on the choice of prompt. Past research has focused on optimizing prompts specific to a task. However, much less attention has been given to optimizing the general instructions included in a prompt, known as a system prompt. To address this gap, we propose SPRIG, an edit-based genetic algorithm that iteratively constructs prompts from prespecified components to maximize the model's performance in general scenarios. We evaluate the performance of system prompts on a collection of 47 different types of tasks to ensure generalizability. Our study finds that a single optimized system prompt performs on par with task prompts optimized for each individual task. Moreover, combining system and task-level optimizations leads to further improvement, which showcases their complementary nature. Experiments also reveal that the optimized system prompts generalize effectively across model families, parameter sizes, and languages. This study provides insights into the role of system-level instructions in maximizing LLM potential.
摘要：大型语言模型 (LLM) 在许多场景中都表现出令人印象深刻的功能，但它们的性能在一定程度上取决于提示的选择。过去的研究主要集中在优化特定于任务的提示。然而，人们很少关注优化提示中包含的一般指令，即系统提示。为了解决这一差距，我们提出了 SPRIG，这是一种基于编辑的遗传算法，它从预先指定的组件迭代构建提示，以最大限度地提高模型在一般场景中的性能。我们评估了系统提示在 47 种不同类型的任务集合上的性能，以确保通用性。我们的研究发现，单个优化的系统提示的性能与针对每个单独任务优化的任务提示相当。此外，结合系统和任务级优化可以带来进一步的改进，这展示了它们的互补性。实验还表明，优化的系统提示可以有效地跨模型系列、参数大小和语言进行推广。这项研究深入了解了系统级指令在最大化 LLM 潜力方面的作用。
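Editor's sketch (not from the paper): an edit-based genetic search over prompts built from a fixed component pool can be illustrated as follows. The edit set (add, delete, replace), the population size, and `toy_score` are assumptions for illustration; in practice the score would come from evaluating a model on benchmark tasks, and SPRIG's actual edit operations may differ.

```python
import random


def mutate(prompt, components, rng):
    """Apply one random edit (add, delete, or replace) to a tuple of components."""
    prompt = list(prompt)
    edit = rng.choice(["add", "delete", "replace"])
    if edit == "add" or not prompt:
        # An empty prompt can only grow, so fall back to "add".
        prompt.insert(rng.randrange(len(prompt) + 1), rng.choice(components))
    elif edit == "delete":
        prompt.pop(rng.randrange(len(prompt)))
    else:
        prompt[rng.randrange(len(prompt))] = rng.choice(components)
    return tuple(prompt)


def optimize(components, score, generations=20, population=8, seed=0):
    """Keep the best `population` prompts each generation (parents survive)."""
    rng = random.Random(seed)
    pool = [()]  # start from the empty system prompt
    for _ in range(generations):
        children = [mutate(rng.choice(pool), components, rng) for _ in range(population)]
        candidates = list(dict.fromkeys(pool + children))  # de-duplicate, keep order
        pool = sorted(candidates, key=score, reverse=True)[:population]
    return pool[0]


def toy_score(prompt):
    """Hypothetical stand-in for benchmark accuracy: reward two components, penalise length."""
    useful = {"Think step by step.", "Be concise."}
    return len(set(prompt) & useful) - 0.1 * len(prompt)


pool_of_components = ["Think step by step.", "Be concise.", "You are a pirate."]
best_prompt = optimize(pool_of_components, toy_score)
```

Since parents are kept in the candidate pool, the best score never decreases from one generation to the next.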

Title: DFlow: Diverse Dialogue Flow Simulation with Large Language Models

Authors: Wanyu Du, Song Feng, James Gung, Lijia Sun, Yi Zhang, Saab Mansour, Yanjun Qi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.14853
Pdf URL: https://arxiv.org/pdf/2410.14853
Copy Paste: [[2410.14853]] DFlow: Diverse Dialogue Flow Simulation with Large Language Models(https://arxiv.org/abs/2410.14853)
Keywords: language model, gpt, llm, agent
Abstract: Developing language model-based dialogue agents requires effective data to train models that can follow specific task logic. However, most existing data augmentation methods focus on increasing diversity in language, topics, or dialogue acts at the utterance level, largely neglecting a critical aspect of task logic diversity at the dialogue level. This paper proposes a novel data augmentation method designed to enhance the diversity of synthetic dialogues by focusing on task execution logic. Our method uses LLMs to generate decision tree-structured task plans, which enables the derivation of diverse dialogue trajectories for a given task. Each trajectory, referred to as a "dialog flow", guides the generation of a multi-turn dialogue that follows a unique trajectory. We apply this method to generate a task-oriented dialogue dataset comprising 3,886 dialogue flows across 15 different domains. We validate the effectiveness of this dataset using the next action prediction task, where models fine-tuned on our dataset outperform strong baselines, including GPT-4. Upon acceptance of this paper, we plan to release the code and data publicly.
摘要：开发基于语言模型的对话代理需要有效的数据来训练能够遵循特定任务逻辑的模型。然而，大多数现有的数据增强方法都侧重于在话语层面增加语言、主题或对话行为的多样性，而在很大程度上忽略了对话层面任务逻辑多样性的一个关键方面。本文提出了一种新颖的数据增强方法，旨在通过关注任务执行逻辑来增强合成对话的多样性。我们的方法使用 LLM 生成决策树结构的任务计划，从而为给定任务推导出不同的对话轨迹。每个轨迹（称为“对话流”）都指导生成遵循独特轨迹的多轮对话。我们应用此方法生成一个面向任务的对话数据集，该数据集包含 15 个不同领域的 3,886 个对话流。我们使用下一个动作预测任务验证了该数据集的有效性，其中在我们的数据集上微调的模型优于包括 GPT-4 在内的强大基线。本文被接受后，我们计划公开发布代码和数据。
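Editor's sketch (not from the paper): if a task plan is a decision tree, each root-to-leaf path gives one dialogue trajectory. The nested-dict plan format and the flight-booking example below are assumptions for illustration; in the paper the plans are generated by LLMs.

```python
def dialogue_flows(node, prefix=()):
    """Yield every root-to-leaf path of a nested-dict task plan as a tuple of actions."""
    path = prefix + (node["action"],)
    children = node.get("children", [])
    if not children:
        yield path
        return
    for child in children:
        yield from dialogue_flows(child, path)


# Hypothetical task plan for a flight-booking agent.
plan = {
    "action": "ask_destination",
    "children": [
        {
            "action": "check_availability",
            "children": [
                {"action": "confirm_booking"},
                {"action": "offer_alternative_date"},
            ],
        },
        {"action": "escalate_to_human"},
    ],
}

flows = list(dialogue_flows(plan))
```

Each tuple in `flows` could then condition the generation of one multi-turn dialogue, so the number of distinct flows grows with the branching of the plan rather than with surface-level paraphrasing.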

Title: Which LLMs are Difficult to Detect? A Detailed Analysis of Potential Factors Contributing to Difficulties in LLM Text Detection

Authors: Shantanu Thorat, Tianbao Yang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.14875
Pdf URL: https://arxiv.org/pdf/2410.14875
Copy Paste: [[2410.14875]] Which LLMs are Difficult to Detect? A Detailed Analysis of Potential Factors Contributing to Difficulties in LLM Text Detection(https://arxiv.org/abs/2410.14875)
Keywords: llm
Abstract: As LLMs increase in accessibility, LLM-generated texts have proliferated across several fields, such as scientific, academic, and creative writing. However, LLMs are not created equal; they may have different architectures and training datasets. Thus, some LLMs may be more challenging to detect than others. Using two datasets spanning four total writing domains, we train AI-generated (AIG) text classifiers using the LibAUC library - a deep learning library for training classifiers with imbalanced datasets. Our results in the Deepfake Text dataset show that AIG-text detection varies across domains, with scientific writing being relatively challenging. In the Rewritten Ivy Panda (RIP) dataset focusing on student essays, we find that the OpenAI family of LLMs was substantially difficult for our classifiers to distinguish from human texts. Additionally, we explore factors that could explain the difficulties in detecting OpenAI-generated texts.
摘要：随着 LLM 的可访问性不断提高，LLM 生成的文本已在多个领域激增，例如科学、学术和创意写作。然而，LLM 并非生来平等；它们可能具有不同的架构和训练数据集。因此，一些 LLM 可能比其他 LLM 更难检测。使用两个涵盖四个写作领域的数据集，我们使用 LibAUC 库（一个用于训练具有不平衡数据集的分类器的深度学习库）训练 AI 生成的 (AIG) 文本分类器。我们在 Deepfake Text 数据集中的结果表明，AIG 文本检测在不同领域有所不同，科学写作相对具有挑战性。在专注于学生论文的 Rewritten Ivy Panda (RIP) 数据集中，我们发现 OpenAI 系列 LLM 对于我们的分类器来说很难与人类文本区分开来。此外，我们还探讨了可能解释检测 OpenAI 生成的文本困难的可能因素。

Title: From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items

Authors: Melissa Roemmele, Andrew S. Gordon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.14897
Pdf URL: https://arxiv.org/pdf/2410.14897
Copy Paste: [[2410.14897]] From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items(https://arxiv.org/abs/2410.14897)
Keywords: llm, prompt
Abstract: LLMs can now perform a variety of complex writing tasks. They also excel in answering questions pertaining to natural language inference and commonsense reasoning. Composing these questions is itself a skilled writing task, so in this paper we consider LLMs as authors of commonsense assessment items. We prompt LLMs to generate items in the style of a prominent benchmark for commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine the outcome according to analyses facilitated by the LLMs and human annotation. We find that LLMs that succeed in answering the original COPA benchmark are also more successful in authoring their own items.
摘要：LLM 现在可以执行各种复杂的写作任务。他们还擅长回答与自然语言推理和常识推理有关的问题。编写这些问题本身就是一项熟练的写作任务，因此在本文中，我们将 LLM 视为常识评估项目的作者。我们提示 LLM 以常识推理的著名基准“合理替代方案选择”（COPA）的风格生成项目。我们根据 LLM 和人工注释促进的分析来检查结果。我们发现，成功回答原始 COPA 基准的 LLM 在创作自己的项目方面也更成功。

Title: SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation

Authors: Junda Wang, Yujan Ting, Eric Z. Chen, Hieu Tran, Hong Yu, Weijing Huang, Terrence Chen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.14948
Pdf URL: https://arxiv.org/pdf/2410.14948
Copy Paste: [[2410.14948]] SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation(https://arxiv.org/abs/2410.14948)
Keywords: language model, gpt, llm
Abstract: Multimodal large language models (MLLMs) have made significant strides, yet they face challenges in the medical domain due to limited specialized knowledge. While recent medical MLLMs demonstrate strong performance in lab settings, they often struggle in real-world applications, highlighting a substantial gap between research and practice. In this paper, we seek to address this gap at various stages of the end-to-end learning pipeline, including data collection, model fine-tuning, and evaluation. At the data collection stage, we introduce SemiHVision, a dataset that combines human annotations with automated augmentation techniques to improve both medical knowledge representation and diagnostic reasoning. For model fine-tuning, we trained PMC-Cambrian-8B-AN over 2400 H100 GPU hours, resulting in performance that surpasses public medical models like HuatuoGPT-Vision-34B (79.0% vs. 66.7%) and private general models like Claude3-Opus (55.7%) on traditional benchmarks such as SLAKE and VQA-RAD. In the evaluation phase, we observed that traditional benchmarks cannot accurately reflect realistic clinical task capabilities. To overcome this limitation and provide more targeted guidance for model evaluation, we introduce the JAMA Clinical Challenge, a novel benchmark specifically designed to evaluate diagnostic reasoning. On this benchmark, PMC-Cambrian-AN achieves state-of-the-art performance with a GPT-4 score of 1.29, significantly outperforming HuatuoGPT-Vision-34B (1.13) and Claude3-Opus (1.17), demonstrating its superior diagnostic reasoning abilities.
摘要：多模态大型语言模型 (MLLM) 取得了重大进展，但由于专业知识有限，它们在医学领域面临挑战。虽然最近的医学 MLLM 在实验室环境中表现出色，但它们在实际应用中往往举步维艰，凸显了研究与实践之间的巨大差距。在本文中，我们试图在端到端学习流程的各个阶段解决这一差距，包括数据收集、模型微调和评估。在数据收集阶段，我们引入了 SemiHVision，这是一个将人工注释与自动增强技术相结合的数据集，可改善医学知识表示和诊断推理。对于模型微调，我们在 2400 H100 GPU 小时以上训练了 PMC-Cambrian-8B-AN，其性能在传统基准（如 SLAKE 和 VQA-RAD）上超越了 HuatuoGPT-Vision-34B 等公共医学模型（79.0% vs. 66.7%）和 Claude3-Opus 等私有通用模型（55.7%）。在评估阶段，我们发现传统基准无法准确反映现实的临床任务能力。为了克服这一限制并为模型评估提供更有针对性的指导，我们引入了 JAMA Clinical Challenge，这是一个专门用于评估诊断推理的新型基准。在此基准上，PMC-Cambrian-AN 以 1.29 的 GPT-4 得分实现了最佳性能，显著优于 HuatuoGPT-Vision-34B（1.13）和 Claude3-Opus（1.17），展示了其卓越的诊断推理能力。

Title: ChronoFact: Timeline-based Temporal Fact Verification

Authors: Anab Maulana Barik, Wynne Hsu, Mong Li Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.14964
Pdf URL: https://arxiv.org/pdf/2410.14964
Copy Paste: [[2410.14964]] ChronoFact: Timeline-based Temporal Fact Verification(https://arxiv.org/abs/2410.14964)
Keywords: language model
Abstract: Automated fact verification plays an essential role in fostering trust in the digital space. Despite the growing interest, the verification of temporal facts has not received much attention in the community. Temporal fact verification brings new challenges where cues of the temporal information need to be extracted and temporal reasoning involving various temporal aspects of the text must be applied. In this work, we propose an end-to-end solution for temporal fact verification that considers the temporal information in claims to obtain relevant evidence sentences and harnesses the power of a large language model for temporal reasoning. Recognizing that temporal facts often involve events, we model these events in the claim and evidence sentences. We curate two temporal fact datasets to learn time-sensitive representations that encapsulate not only the semantic relationships among the events, but also their chronological proximity. This allows us to retrieve the top-k relevant evidence sentences and provide the context for a large language model to perform temporal reasoning and output whether a claim is supported or refuted by the retrieved evidence sentences. Experimental results demonstrate that the proposed approach significantly enhances the accuracy of temporal claim verification, thereby advancing the current state of the art in automated fact verification.
摘要：自动事实验证在数字空间中培养信任方面发挥着至关重要的作用。尽管人们对时间事实的验证兴趣日益浓厚，但社区对此的关注度并不高。时间事实验证带来了新的挑战，需要提取时间信息的线索，并应用涉及文本各个时间方面的时间推理。在这项工作中，我们提出了一种端到端的时间事实验证解决方案，该解决方案考虑声明中的时间信息以获得相关证据句子，并利用大型语言模型的功能进行时间推理。认识到时间事实通常涉及事件，我们在声明和证据句子中对这些事件进行建模。我们整理了两个时间事实数据集来学习时间敏感的表示，这些表示不仅封装了事件之间的语义关系，还封装了它们的时间接近度。这使我们能够检索前 k 个相关证据句子，并为大型语言模型提供上下文以执行时间推理，并输出声明是否由检索到的证据句子支持或反驳。实验结果表明，所提出的方法显著提高了时间声明验证的准确性，从而推动了当前自动事实验证的发展。
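Editor's sketch (not from the paper): the idea of ranking evidence by both semantic relatedness and chronological proximity can be illustrated with a hand-written score. The paper learns time-sensitive representations; the Jaccard word overlap, the reciprocal year distance, and the weight `alpha` used here are simple stand-ins chosen for illustration only.

```python
import heapq


def top_k_evidence(claim_terms, claim_year, evidence, k=2, alpha=0.5):
    """Return the k (text, year) pairs with the highest combined score."""

    def score(item):
        text, year = item
        terms = set(text.lower().split())
        overlap = len(terms & claim_terms) / len(terms | claim_terms)  # Jaccard similarity
        proximity = 1.0 / (1.0 + abs(year - claim_year))  # 1.0 when the years match
        return alpha * overlap + (1 - alpha) * proximity

    return heapq.nlargest(k, evidence, key=score)


# Hypothetical claim: "the bridge opened in 1990".
evidence = [
    ("the bridge opened", 1990),
    ("the bridge opened", 1950),
    ("a museum closed", 1990),
]
retrieved = top_k_evidence({"bridge", "opened"}, 1990, evidence)
```

In this toy data the sentence with matching wording but a distant year ranks below an unrelated sentence from the claim's year, which shows why the temporal term matters; the retrieved sentences would then be passed to a language model as context.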

Title: CAP: Data Contamination Detection via Consistency Amplification

Authors: Yi Zhao, Jing Li, Linyi Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15005
Pdf URL: https://arxiv.org/pdf/2410.15005
Copy Paste: [[2410.15005]] CAP: Data Contamination Detection via Consistency Amplification(https://arxiv.org/abs/2410.15005)
Keywords: language model, llm
Abstract: Large language models (LLMs) are widely used, but concerns about data contamination challenge the reliability of LLM evaluations. Existing contamination detection methods are often task-specific or require extra prerequisites, limiting practicality. We propose a novel framework, Consistency Amplification-based Data Contamination Detection (CAP), which introduces the Performance Consistency Ratio (PCR) to measure dataset leakage by leveraging LM consistency. To the best of our knowledge, this is the first method to explicitly differentiate between fine-tuning and contamination, which is crucial for detecting contamination in domain-specific models. Additionally, CAP is applicable to various benchmarks and works for both white-box and black-box models. We validate CAP's effectiveness through experiments on seven LLMs and four domain-specific benchmarks. Our findings also show that composite benchmarks from various dataset sources are particularly prone to unintentional contamination. Codes will be publicly available soon.
摘要：大型语言模型 (LLM) 被广泛使用，但对数据污染的担忧对 LLM 评估的可靠性提出了挑战。现有的污染检测方法通常是针对特定任务的，或者需要额外的先决条件，从而限制了实用性。我们提出了一个新颖的框架，即基于一致性放大的数据污染检测 (CAP)，它引入了性能一致性比率 (PCR) 来利用 LM 一致性来测量数据集泄漏。据我们所知，这是第一种明确区分微调和污染的方法，这对于检测领域特定模型中的污染至关重要。此外，CAP 适用于各种基准，适用于白盒和黑盒模型。我们通过对七个 LLM 和四个领域特定基准的实验验证了 CAP 的有效性。我们的研究结果还表明，来自各种数据集来源的复合基准特别容易受到无意污染。代码将很快公开。

Title: Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model

Authors: Jiahao Wang, Amer Shalaby
Subjects: cs.CL, cs.AI, cs.CY, cs.IR, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2410.15016
Pdf URL: https://arxiv.org/pdf/2410.15016
Copy Paste: [[2410.15016]] Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model(https://arxiv.org/abs/2410.15016)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Users of the transit system flood social networks daily with messages that contain valuable insights crucial for improving service quality. These posts help transit agencies quickly identify emerging issues. Parsing topics and sentiments is key to gaining comprehensive insights to foster service excellence. However, the volume of messages makes manual analysis impractical, and standard NLP techniques like Term Frequency-Inverse Document Frequency (TF-IDF) fall short in nuanced interpretation. Traditional sentiment analysis separates topics and sentiments before integrating them, often missing the interaction between them. This incremental approach complicates classification and reduces analytical productivity. To address these challenges, we propose a novel approach to extracting and analyzing transit-related information, including sentiment and sarcasm detection, identification of unusual system problems, and location data from social media. Our method employs Large Language Models (LLMs), specifically Llama 3, for a streamlined analysis free from pre-established topic labels. To enhance the model's domain-specific knowledge, we utilize Retrieval-Augmented Generation (RAG), integrating external knowledge sources into the information extraction pipeline. We validated our method through extensive experiments comparing its performance with traditional NLP approaches on user tweet data from a real-world transit system. Our results demonstrate the potential of LLMs to transform social media data analysis in the public transit domain, providing actionable insights and enhancing transit agencies' responsiveness by extracting a broader range of information.
摘要：公共交通系统的用户每天都会向社交网络发送大量消息，这些消息包含对改善服务质量至关重要的宝贵见解。这些帖子可帮助公共交通机构快速识别新出现的问题。解析主题和情绪是获得全面见解以促进卓越服务的关键。然而，消息的数量使得手动分析变得不切实际，而词频-逆文档频率 (TF-IDF) 等标准 NLP 技术在细致入微的解释方面存在不足。传统的情绪分析在整合主题和情绪之前会将它们分开，通常会忽略它们之间的相互作用。这种增量方法使分类变得复杂并降低了分析效率。为了应对这些挑战，我们提出了一种提取和分析与公共交通相关的信息的新方法，包括情绪和讽刺检测、异常系统问题的识别以及来自社交媒体的位置数据。我们的方法采用大型语言模型 (LLM)，特别是 Llama 3，以实现不受预先设定的主题标签影响的简化分析。为了增强模型的领域特定知识，我们利用检索增强生成 (RAG)，将外部知识源集成到信息提取管道中。我们通过大量实验验证了我们的方法，比较了该方法与传统 NLP 方法在现实世界交通系统用户推文数据上的表现。我们的结果表明 LLM 有潜力改变公共交通领域的社交媒体数据分析，通过提取更广泛的信息提供可操作的见解并增强交通机构的响应能力。

Title: DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Authors: Md Mubtasim Ahasan, Md Fahim, Tasnim Mohiuddin, A K M Mahbubur Rahman, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Md Mofijul Islam, Amin Ahsan Ali
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.15017
Pdf URL: https://arxiv.org/pdf/2410.15017
Copy Paste: [[2410.15017]] DM-Codec: Distilling Multimodal Representations for Speech Tokenization(https://arxiv.org/abs/2410.15017)
Keywords: language model
Abstract: Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset. The code, samples, and model checkpoints are available at this https URL.
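The abstract above reports results in terms of Word Error Rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard WER definition: word-level edit distance divided by the number of reference words. The function name and the whitespace tokenization are assumptions made for illustration; this is not the evaluation code used by the DM-Codec authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()   # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

WIL additionally needs the number of correctly aligned words, so it is not derivable from the edit distance alone and is omitted here.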

Title: A Survey of Ontology Expansion for Conversational Understanding

Authors: Jinggui Liang, Yuxia Wu, Yuan Fang, Hao Fei, Lizi Liao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15019
Pdf URL: https://arxiv.org/pdf/2410.15019
Copy Paste: [[2410.15019]] A Survey of Ontology Expansion for Conversational Understanding(https://arxiv.org/abs/2410.15019)
Keywords: agent
Abstract: In the rapidly evolving field of conversational AI, Ontology Expansion (OnExp) is crucial for enhancing the adaptability and robustness of conversational agents. Traditional models rely on static, predefined ontologies, limiting their ability to handle new and unforeseen user needs. This survey paper provides a comprehensive review of the state-of-the-art techniques in OnExp for conversational understanding. It categorizes the existing literature into three main areas: (1) New Intent Discovery, (2) New Slot-Value Discovery, and (3) Joint OnExp. By examining the methodologies, benchmarks, and challenges associated with these areas, we highlight several emerging frontiers in OnExp to improve agent performance in real-world scenarios and discuss their corresponding challenges. This survey aspires to be a foundational reference for researchers and practitioners, promoting further exploration and innovation in this crucial domain.

Title: Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention

Authors: Yuzhe Weng, Haotian Wang, Tian Gao, Kewei Li, Shutong Niu, Jun Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15029
Pdf URL: https://arxiv.org/pdf/2410.15029
Copy Paste: [[2410.15029]] Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention(https://arxiv.org/abs/2410.15029)
Keywords: llm
Abstract: In multimodal sentiment analysis, collecting text data is often more challenging than video or audio due to higher annotation costs and inconsistent automatic speech recognition (ASR) quality. To address this challenge, our study has developed a robust model that effectively integrates multimodal sentiment information, even in the absence of the text modality. Specifically, we have developed a Double-Flow Self-Distillation Framework, including Unified Modality Cross-Attention (UMCA) and Modality Imagination Autoencoder (MIA), which excels at processing both scenarios with complete modalities and those with a missing text modality. In detail, when the text modality is missing, our framework uses an LLM-based model to simulate the text representation from the audio modality, while the MIA module supplements information from the other two modalities to make the simulated text representation similar to the real text representation. To further align the simulated and real representations, and to enable the model to capture the continuous nature of sample orders in sentiment valence regression tasks, we have also introduced the Rank-N Contrast (RNC) loss function. When tested on CMU-MOSEI, our model achieved outstanding performance on MAE and significantly outperformed other models when the text modality is missing. The code is available at: this https URL

Title: Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging

Authors: Mingxin Li, Zhijie Nie, Yanzhao Zhang, Dingkun Long, Richong Zhang, Pengjun Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15035
Pdf URL: https://arxiv.org/pdf/2410.15035
Copy Paste: [[2410.15035]] Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging(https://arxiv.org/abs/2410.15035)
Keywords: language model
Abstract: Text embeddings are vital for tasks such as text retrieval and semantic textual similarity (STS). Recently, the advent of pretrained language models, along with unified benchmarks like the Massive Text Embedding Benchmark (MTEB), has facilitated the development of versatile general-purpose text embedding models. Advanced embedding models are typically developed using large-scale multi-task data and joint training across multiple tasks. However, our experimental analysis reveals two significant drawbacks of joint training: 1) Task Conflict: Gradients from different tasks interfere with each other, leading to negative transfer. 2) Data Imbalance: Disproportionate data distribution introduces biases that negatively impact performance across tasks. To overcome these challenges, we explore model merging, a technique that combines independently trained models to mitigate gradient conflicts and balance data distribution. We introduce a novel method, Self Positioning, which efficiently searches for optimal model combinations within the interpolation space of task vectors using stochastic gradient descent. Our experiments demonstrate that Self Positioning significantly enhances multi-task performance on the MTEB dataset, achieving an absolute improvement of 0.7 points. It outperforms traditional resampling methods while reducing computational costs. This work offers a robust approach to building generalized text embedding models with superior performance across diverse embedding-related tasks.
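The abstract above refers to an "interpolation space of task vectors". A common reading of that phrase is that each fine-tuned model contributes a task vector (its parameters minus the base parameters), and a merged model is the base plus a weighted sum of those vectors. The sketch below illustrates only that interpolation step under this assumption; the function name, the dictionary layout, and the fixed weights are hypothetical, and the search over weights with stochastic gradient descent that Self Positioning performs is not shown.

```python
import numpy as np


def merge_with_task_vectors(base, finetuned_models, weights):
    """Return base + sum_i weights[i] * (finetuned_models[i] - base), per parameter."""
    merged = {}
    for name, base_param in base.items():
        # Task vector of model i for this parameter: finetuned - base.
        delta = sum(
            w * (model[name] - base_param)
            for model, w in zip(finetuned_models, weights)
        )
        merged[name] = base_param + delta
    return merged


# Toy parameters standing in for real model weights (hypothetical values).
base = {"w": np.zeros(2)}
retrieval_model = {"w": np.array([1.0, 0.0])}
sts_model = {"w": np.array([0.0, 2.0])}

merged = merge_with_task_vectors(base, [retrieval_model, sts_model], [0.5, 0.5])
```

With equal weights of 0.5, the merged parameter lies halfway along each task vector; a search procedure would instead tune the weights against a multi-task objective.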

Title: mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

Authors: Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15037
Pdf URL: https://arxiv.org/pdf/2410.15037
Copy Paste: [[2410.15037]] mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation(https://arxiv.org/abs/2410.15037)
Keywords: language model, llm, prompt
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent works have addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We employ established machine translation methods to compile the benchmark, coupled with a quality assurance process. Furthermore, we provide expert human translations for 15 diverse natural languages (NLs). We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs, offering insights into the current landscape of cross-lingual code generation.

Title: Are LLMs Good Zero-Shot Fallacy Classifiers?

Authors: Fengjun Pan, Xiaobao Wu, Zongrui Li, Anh Tuan Luu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15050
Pdf URL: https://arxiv.org/pdf/2410.15050
Copy Paste: [[2410.15050]] Are LLMs Good Zero-Shot Fallacy Classifiers?(https://arxiv.org/abs/2410.15050)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Fallacies are defective arguments with faulty reasoning. Detecting and classifying them is a crucial NLP task to prevent misinformation, manipulative claims, and biased decisions. However, existing fallacy classifiers are limited by the requirement for sufficient labeled data for training, which hinders their out-of-distribution (OOD) generalization abilities. In this paper, we focus on leveraging Large Language Models (LLMs) for zero-shot fallacy classification. To elicit fallacy-related knowledge and reasoning abilities of LLMs, we propose diverse single-round and multi-round prompting schemes, applying different task-specific instructions such as extraction, summarization, and Chain-of-Thought reasoning. With comprehensive experiments on benchmark datasets, we suggest that LLMs could be potential zero-shot fallacy classifiers. In general, LLMs under single-round prompting schemes have achieved acceptable zero-shot performances compared to the best full-shot baselines and can outperform them in all OOD inference scenarios and some open-domain tasks. Our novel multi-round prompting schemes can effectively bring about more improvements, especially for small LLMs. Our analysis further underlines the future research on zero-shot fallacy classification. Codes and data are available at: this https URL.

Title: Toward Robust RALMs: Revealing the Impact of Imperfect Retrieval on Retrieval-Augmented Language Models

Authors: Seong-Il Park, Jay-Yoon Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15107
Pdf URL: https://arxiv.org/pdf/2410.15107
Copy Paste: [[2410.15107]] Toward Robust RALMs: Revealing the Impact of Imperfect Retrieval on Retrieval-Augmented Language Models(https://arxiv.org/abs/2410.15107)
Keywords: language model, hallucination
Abstract: Retrieval Augmented Language Models (RALMs) have gained significant attention for their ability to generate accurate answers and improve efficiency. However, RALMs are inherently vulnerable to imperfect information due to their reliance on an imperfect retriever or knowledge source. We identify three common scenarios (unanswerable, adversarial, and conflicting) where retrieved document sets can confuse RALMs with plausible real-world examples. We present the first comprehensive investigation to assess how well RALMs detect and handle such problematic scenarios. Among these scenarios, to systematically examine adversarial robustness, we propose a new adversarial attack method, Generative model-based ADVersarial attack (GenADV), and a novel metric, Robustness under Additional Document (RAD). Our findings reveal that RALMs often fail to identify the unanswerability or contradiction of a document set, which frequently leads to hallucinations. Moreover, we show that the addition of an adversary significantly degrades RALM performance, with the model becoming even more vulnerable when the two scenarios overlap (adversarial+unanswerable). Our research identifies critical areas for assessing and enhancing the robustness of RALMs, laying the foundation for the development of more robust models.

Title: Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models

Authors: Qitan Lv, Jie Wang, Hanzhu Chen, Bin Li, Yongdong Zhang, Feng Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15116
Pdf URL: https://arxiv.org/pdf/2410.15116
Copy Paste: [[2410.15116]] Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models(https://arxiv.org/abs/2410.15116)
Keywords: language model, hallucination
Abstract: Generation of plausible but incorrect factual information, often termed hallucination, has attracted significant research interest. Retrieval-augmented language models (RALMs), which enhance models with up-to-date knowledge, have emerged as a promising method to reduce hallucination. However, existing RALMs may instead exacerbate hallucination when retrieving lengthy contexts. To address this challenge, we propose COFT, a novel COarse-to-Fine highlighTing method to focus on different granularity-level key texts, thereby avoiding getting lost in lengthy contexts. Specifically, COFT consists of three components: recaller, scorer, and selector. First, the recaller applies a knowledge graph to extract potential key entities in a given context. Second, the scorer measures the importance of each entity by calculating its contextual weight. Finally, the selector selects high contextual weight entities with a dynamic threshold algorithm and highlights the corresponding paragraphs, sentences, or words in a coarse-to-fine manner. Extensive experiments on the knowledge hallucination benchmark demonstrate the effectiveness of COFT, leading to superior performance, by over 30% in the F1 score metric. Moreover, COFT also exhibits remarkable versatility across various long-form tasks, such as reading comprehension and question answering.

Title: MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science

Authors: Junho Kim, Yeachan Kim, Jun-Hyung Park, Yerim Oh, Suho Kim, SangKeun Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15126
Pdf URL: https://arxiv.org/pdf/2410.15126
Copy Paste: [[2410.15126]] MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science(https://arxiv.org/abs/2410.15126)
Keywords: language model
Abstract: We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), specifically designed to efficiently adapt the pre-trained language models (PLMs) for materials science. Unlike previous adaptation strategies that solely focus on constructing domain-specific corpus, MELT comprehensively considers both the corpus and the training strategy, given that materials science corpus has distinct characteristics from other domains. To this end, we first construct a comprehensive materials knowledge base from the scientific corpus by building semantic graphs. Leveraging this extracted knowledge, we integrate a curriculum into the adaptation process that begins with familiar and generalized concepts and progressively moves toward more specialized terms. We conduct extensive experiments across diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation convincingly supports the strength of MELT, demonstrating superior performance compared to existing continued pre-training methods. The in-depth analysis also shows that MELT enables PLMs to effectively represent materials entities compared to the existing adaptation methods, thereby highlighting its broad applicability across a wide spectrum of materials science.

Title: Augmenting the Veracity and Explanations of Complex Fact Checking via Iterative Self-Revision with LLMs

Authors: Xiaocheng Zhang, Xi Wang, Yifei Lu, Zhuangzhuang Ye, Jianing Wang, Mengjiao Bao, Peng Yan, Xiaohong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15135
Pdf URL: https://arxiv.org/pdf/2410.15135
Copy Paste: [[2410.15135]] Augmenting the Veracity and Explanations of Complex Fact Checking via Iterative Self-Revision with LLMs(https://arxiv.org/abs/2410.15135)
Keywords: language model, llm
Abstract: Explanation generation plays a more pivotal role than fact verification in producing interpretable results and facilitating comprehensive fact-checking, which has recently garnered considerable attention. However, previous studies on explanation generation have shown several limitations, such as being confined to English scenarios, involving overly complex inference processes, and not fully unleashing the potential of the mutual feedback between veracity labels and explanation texts. To address these issues, we construct two complex fact-checking datasets for Chinese scenarios: CHEF-EG and TrendFact. These datasets involve complex facts in areas such as health, politics, and society, presenting significant challenges for fact verification methods. In response to these challenges, we propose a unified framework called FactISR (Augmenting Fact-Checking via Iterative Self-Revision) to perform mutual feedback between veracity and explanations by leveraging the capabilities of large language models (LLMs). FactISR uses a single model to address tasks such as fact verification and explanation generation. Its self-revision mechanism can further improve the consistency between veracity labels, explanation texts, and evidence, as well as eliminate irrelevant noise. We conducted extensive experiments with baselines and FactISR on the proposed datasets. The experimental results demonstrate the effectiveness of our method.

Title: Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning

Authors: David Schulte, Felix Hamborg, Alan Akbik
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.15148
Pdf URL: https://arxiv.org/pdf/2410.15148
Copy Paste: [[2410.15148]] Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning(https://arxiv.org/abs/2410.15148)
Keywords: language model
Abstract: Intermediate task transfer learning can greatly improve model performance. If, for example, one has little training data for emotion detection, first fine-tuning a language model on a sentiment classification dataset may improve performance strongly. But which task to choose for transfer learning? Prior methods producing useful task rankings are infeasible for large source pools, as they require forward passes through all source language models. We overcome this by introducing Embedding Space Maps (ESMs), light-weight neural networks that approximate the effect of fine-tuning a language model. We conduct the largest study on NLP task transferability and task selection with 12k source-target pairs. We find that applying ESMs on a prior method reduces execution time and disk space usage by factors of 10 and 278, respectively, while retaining high selection performance (avg. regret@5 score of 2.95).

Title: Evaluating Deep Unlearning in Large Language Models

Authors: Ruihan Wu, Chhavi Yadav, Russ Salakhutdinov, Kamalika Chaudhuri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15153
Pdf URL: https://arxiv.org/pdf/2410.15153
Copy Paste: [[2410.15153]] Evaluating Deep Unlearning in Large Language Models(https://arxiv.org/abs/2410.15153)
Keywords: language model, llm
Abstract: Machine unlearning is a key requirement of many data protection regulations such as GDPR. Prior work on unlearning has mostly considered superficial unlearning tasks where a single or a few related pieces of information are required to be removed. However, the task of unlearning a fact is much more challenging in recent large language models (LLMs), because the facts in LLMs can be deduced from each other. In this work, we investigate whether current unlearning methods for LLMs succeed beyond superficial unlearning of facts. Specifically, we formally propose a framework and a definition for deep unlearning facts that are interrelated. We design the metric, recall, to quantify the extent of deep unlearning. To systematically evaluate deep unlearning, we construct a synthetic dataset EDU-RELAT, which consists of a synthetic knowledge base of family relationships and biographies, together with a realistic logical rule set that connects them. We use this dataset to test four unlearning methods in four LLMs at different sizes. Our findings reveal that in the task of deep unlearning only a single fact, they either fail to properly unlearn with high recall, or end up unlearning many other irrelevant facts. Our dataset and code are publicly available at: this https URL.

Title: An Electoral Approach to Diversify LLM-based Multi-Agent Collective Decision-Making

Authors: Xiutian Zhao, Ke Wang, Wei Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15168
Pdf URL: https://arxiv.org/pdf/2410.15168
Copy Paste: [[2410.15168]] An Electoral Approach to Diversify LLM-based Multi-Agent Collective Decision-Making(https://arxiv.org/abs/2410.15168)
Keywords: language model, llm, agent
Abstract: Modern large language models (LLMs) have exhibited cooperative synergy on complex task-solving, and collective decision-making (CDM) is a pivotal component in LLM-based multi-agent collaboration frameworks. Our survey of 52 recent such systems uncovers a severe lack of diversity, with a heavy reliance on dictatorial and plurality voting for CDM. Through the lens of social choice theory, we scrutinize widely adopted CDM methods and identify their limitations. To enrich the current landscape of LLM-based CDM, we present GEDI, an electoral CDM module that incorporates various ordinal preferential voting mechanisms. Our empirical case study across three benchmarks shows that the integration of certain CDM methods can markedly improve the reasoning capabilities and robustness of some leading LLMs, all without requiring intricate system designs. Additionally, we find that some CDM mechanisms generate positive synergies even with as few as three agents. The voting-based methods also demonstrate robustness against single points of failure, as well as diversity in terms of hit-rate@k and subject-wise impacts.
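To make the contrast between plurality voting and ordinal preferential voting concrete, the sketch below implements the Borda count, a textbook ordinal rule from social choice theory. It is offered only as an example of the family of mechanisms the abstract describes; the abstract does not say which rules GEDI includes, and the function name, agent rankings, and alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import defaultdict


def borda_count(rankings):
    """Pick a winner from full rankings (best first) using the Borda count.

    Each ballot gives n-1 points to its top choice, n-2 to the next, and so on.
    Ties are broken alphabetically (an arbitrary choice for determinism).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    # max() returns the first maximal element, so sorting gives alphabetical tie-breaks.
    return max(sorted(scores), key=lambda candidate: scores[candidate])


# Three hypothetical agents each rank three candidate answers.
agent_rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "B", "A"],
]
winner = borda_count(agent_rankings)
```

Here every candidate receives exactly one first-place vote, so plurality voting is tied, while the Borda count selects "B" (scores: A=2, B=4, C=3) because it uses the full preference order of each agent.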

Title: Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation

Authors: Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15173
Pdf URL: https://arxiv.org/pdf/2410.15173
Copy Paste: [[2410.15173]] Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation(https://arxiv.org/abs/2410.15173)
Keywords: gpt, llm, prompt, chain-of-thought
Abstract: The thematic fit estimation task measures the compatibility between a predicate (typically a verb), an argument (typically a noun phrase), and a specific semantic role assigned to the argument. Previous state-of-the-art work has focused on modeling thematic fit through distributional or neural models of event representation, trained in a supervised fashion with indirect labels. In this work, we assess whether pre-trained autoregressive LLMs possess consistent, expressible knowledge about thematic fit. We evaluate both closed and open state-of-the-art LLMs on several psycholinguistic datasets, along three axes: (1) Reasoning Form: multi-step logical reasoning (chain-of-thought prompting) vs. simple prompting. (2) Input Form: providing context (generated sentences) vs. raw tuples <predicate, argument, role>. (3) Output Form: categorical vs. numeric. Our results show that chain-of-thought reasoning is more effective on datasets with self-explanatory semantic role labels, especially Location. Generated sentences helped only in a few settings, and lowered results in many others. Predefined categorical (compared to numeric) output raised GPT's results across the board with few exceptions, but lowered Llama's. We saw that semantically incoherent generated sentences, which the models lack the ability to consistently filter out, hurt reasoning and overall performance too. Our GPT-powered methods set a new state of the art on all tested datasets.

Title: Fine-tuning foundational models to code diagnoses from veterinary health records

Authors: Mayla R. Boguslav, Adam Kiehl, David Kott, G. Joseph Strecker, Tracy Webb, Nadia Saklou, Terri Ward, Michael Kirby
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15186
Pdf URL: https://arxiv.org/pdf/2410.15186
Copy Paste: [[2410.15186]] Fine-tuning foundational models to code diagnoses from veterinary health records(https://arxiv.org/abs/2410.15186)
Keywords: language model, llm
Abstract: Veterinary medical records represent a large data resource for application to veterinary and One Health clinical research efforts. Use of the data is limited by interoperability challenges including inconsistent data formats and data siloing. Clinical coding using standardized medical terminologies enhances the quality of medical records and facilitates their interoperability with veterinary and human health records from other sites. Previous studies, such as DeepTag and VetTag, evaluated the application of Natural Language Processing (NLP) to automate veterinary diagnosis coding, employing long short-term memory (LSTM) and transformer models to infer a subset of Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) diagnosis codes from free-text clinical notes. This study expands on these efforts by incorporating all 7,739 distinct SNOMED-CT diagnosis codes recognized by the Colorado State University (CSU) Veterinary Teaching Hospital (VTH) and by leveraging the increasing availability of pre-trained large language models (LLMs). Ten freely available pre-trained LLMs were fine-tuned on the free-text notes from 246,473 manually coded veterinary patient visits included in the CSU VTH's electronic health records (EHRs), which resulted in superior performance relative to previous efforts. The most accurate results were obtained when expansive labeled data were used to fine-tune relatively large clinical LLMs, but the study also showed that comparable results can be obtained using more limited resources and non-clinical LLMs. The results of this study contribute to the improvement of the quality of veterinary EHRs by investigating accessible methods for automated coding and support both animal and human health research by paving the way for more integrated and comprehensive health databases that span species and institutions.

Title: On the Diversity of Synthetic Data and its Impact on Training Large Language Models

Authors: Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15226
Pdf URL: https://arxiv.org/pdf/2410.15226
Copy Paste: [[2410.15226]] On the Diversity of Synthetic Data and its Impact on Training Large Language Models(https://arxiv.org/abs/2410.15226)
Keywords: language model, llm, agent
Abstract: The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has focused predominantly on the quality and quantity of real data, our work enables the measurement of diversity in synthetic data and explores its impact on LLM performance. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages by introducing a new diversity metric, LLM cluster-agent, designed to evaluate the diversity of synthetic datasets. Through a series of controlled experiments with models of 350M and 1.4B parameters, we demonstrate that the proposed cluster-based LLM scoring of diversity correlates positively with both pre-training and supervised fine-tuning performance. Our findings also reveal that synthetic data diversity in pre-training affects supervised fine-tuning more significantly than pre-training itself, even for smaller models. We hope this study advances our understanding of the optimal use of synthetic data in LLM training and opens new avenues for efficient data generation processes.

Title: Lossless KV Cache Compression to 2%

Authors: Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15252
Pdf URL: https://arxiv.org/pdf/2410.15252
Copy Paste: [[2410.15252]] Lossless KV Cache Compression to 2%(https://arxiv.org/abs/2410.15252)
Keywords: language model
Abstract: Large language models have revolutionized data processing in numerous domains, with their ability to handle extended context reasoning receiving notable recognition. To speed up inference, maintaining a key-value (KV) cache memory is essential. Nonetheless, the growing demands for KV cache memory create significant hurdles for efficient implementation. This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size while maintaining comparable performance levels. CLLA integrates multiple aspects of KV cache compression, including attention head/dimension reduction, layer sharing, and quantization techniques, into a cohesive framework. Our extensive experiments demonstrate that CLLA achieves lossless performance on most tasks while utilizing minimal KV cache, marking a significant advancement in practical KV cache compression.
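To make the "less than 2%" figure concrete, the following is a minimal arithmetic sketch of how head/dimension reduction, layer sharing, and quantization multiply together. It is not the CLLA architecture: the model shape, the sharing factor, the reduced head width, and the 4-bit storage are all hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem, batch=1):
    """Bytes needed to cache keys and values for one batch of sequences."""
    # The factor 2 accounts for storing both a key and a value vector per token.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical baseline: 32 layers, 32 KV heads of width 128, fp16, 4096 tokens.
baseline = kv_cache_bytes(32, 32, 128, 4096, bytes_per_elem=2)

# Hypothetical compressed setup (assumed numbers, not taken from the paper):
# one KV cache shared per group of 4 layers, head width reduced 4x,
# and entries stored in 4 bits (0.5 bytes) instead of 16.
compressed = kv_cache_bytes(8, 32, 32, 4096, bytes_per_elem=0.5)

ratio = compressed / baseline
print(f"baseline: {baseline / 2**30:.2f} GiB, compressed/baseline: {ratio:.4f}")
```

Under these assumed factors the three reductions compound to 1/4 x 1/4 x 1/4 = 1/64 of the baseline, which is why combining techniques can reach a ratio that none of them achieves alone.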

Title: Back to School: Translation Using Grammar Books

Authors: Jonathan Hus, Antonios Anastasopoulos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15263
Pdf URL: https://arxiv.org/pdf/2410.15263
Copy Paste: [[2410.15263]] Back to School: Translation Using Grammar Books(https://arxiv.org/abs/2410.15263)
Keywords: language model, gpt, llm, prompt
Abstract: Machine translation systems for high resource languages perform exceptionally well and produce high quality translations. Unfortunately, the vast majority of languages are not considered high resource and lack the quantity of parallel sentences needed to train such systems. These under-represented languages are not without resources, however, and bilingual dictionaries and grammar books are available as linguistic reference material. With current large language models (LLMs) supporting near book-length contexts, we can begin to use the available material to ensure advancements are shared among all of the world's languages. In this paper, we demonstrate incorporating grammar books in the prompt of GPT-4 to improve machine translation and evaluate the performance on 16 typologically diverse low-resource languages, using a combination of reference material to show that the machine translation performance of LLMs can be improved using this method.

Title: BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression

Authors: Yuankai Li, Jia-Chen Gu, Di Wu, Kai-Wei Chang, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15277
Pdf URL: https://arxiv.org/pdf/2410.15277
Copy Paste: [[2410.15277]] BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression(https://arxiv.org/abs/2410.15277)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across documents. To accelerate inference, reduce costs, and minimize distractions, this paper presents BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context learning. To enable learning compression for multi-hop reasoning, we curate synthetic data by extracting atomic proposition expressions that encapsulate distinct factoids from the source documents to compose synthetic summaries. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries and enables a range of LLMs to achieve exceptional open-domain question answering (QA) performance. For example, on HotpotQA, BRIEF improves the compression rate by 2 times compared to the state-of-the-art baseline, while outperforming it by 3.00% EM and 4.16% F1 with Flan-UL2 as the reader LM. It also generates more concise summaries than proprietary GPT-3.5, while demonstrating nearly identical QA performance.
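The BRIEF abstract reports compression rate, EM, and F1. As a rough illustration of how such numbers are commonly computed for open-domain QA, here is a simplified sketch. The normalization is deliberately minimal (lowercasing and whitespace splitting only, whereas standard evaluation scripts typically also strip punctuation and articles), and defining compression rate as original length divided by summary length is an assumption, since the abstract does not give the formula.

```python
from collections import Counter


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def compression_rate(original_tokens, summary_tokens):
    """Assumed definition: how many times shorter the summary is."""
    return original_tokens / summary_tokens


print(exact_match("Paris", "paris"), token_f1("Paris France", "Paris"))
```

A prediction with one extra token still earns partial F1 credit while scoring zero on EM, which is why the two metrics are usually reported together.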

Title: Training Language Models to Critique With Multi-agent Feedback

Authors: Tian Lan, Wenwei Zhang, Chengqi Lyu, Shuaibin Li, Chen Xu, Heyan Huang, Dahua Lin, Xian-Ling Mao, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15287
Pdf URL: https://arxiv.org/pdf/2410.15287
Copy Paste: [[2410.15287]] Training Language Models to Critique With Multi-agent Feedback(https://arxiv.org/abs/2410.15287)
Keywords: language model, gpt, llm, agent
Abstract: Critique ability, a meta-cognitive capability of humans, presents significant challenges for LLMs to improve. Recent works primarily rely on supervised fine-tuning (SFT) using critiques generated by a single LLM like GPT-4. However, these model-generated critiques often exhibit flaws due to the inherent complexity of the critique. Consequently, fine-tuning LLMs on such flawed critiques typically limits the model's performance and propagates these flaws into the learned model. To overcome these challenges, this paper proposes a novel data generation pipeline, named MultiCritique, that improves the critique ability of LLMs by utilizing multi-agent feedback in both the SFT and reinforcement learning (RL) stages. First, our data generation pipeline aggregates high-quality critiques from multiple agents instead of a single model, with crucial information as input for simplifying the critique. Furthermore, our pipeline improves the preference accuracy of critique quality through multi-agent feedback, facilitating the effectiveness of RL in improving the critique ability of LLMs. Based on our proposed MultiCritique data generation pipeline, we construct the MultiCritiqueDataset for the SFT and RL fine-tuning stages. Extensive experimental results on two benchmarks demonstrate: 1) the superior quality of our constructed SFT dataset compared to existing critique datasets; 2) additional improvements to the critique ability of LLMs brought by the RL stage. Notably, our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models, approaching the performance of advanced 70B LLMs and GPT-4. Codes, datasets and model weights will be publicly available.

Title: Redefining Proactivity for Information Seeking Dialogue

Authors: Jing Yang Lee, Seokhwan Kim, Kartik Mehta, Jiun-Yu Kao, Yu-Hsiang Lin, Arpit Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15297
Pdf URL: https://arxiv.org/pdf/2410.15297
Copy Paste: [[2410.15297]] Redefining Proactivity for Information Seeking Dialogue(https://arxiv.org/abs/2410.15297)
Keywords: llm, prompt, chain-of-thought, agent
Abstract: Information-Seeking Dialogue (ISD) agents aim to provide accurate responses to user queries. While proficient in directly addressing user queries, these agents, as well as LLMs in general, predominantly exhibit reactive behavior, lacking the ability to generate proactive responses that actively engage users in sustained conversations. However, existing definitions of proactive dialogue in this context do not focus on how each response actively engages the user and sustains the conversation. Hence, we present a new definition of proactivity that focuses on enhancing the 'proactiveness' of each generated response via the introduction of new information related to the initial query. To this end, we construct a proactive dialogue dataset comprising 2,000 single-turn conversations, and introduce several automatic metrics to evaluate response 'proactiveness' which achieved high correlation with human annotation. Additionally, we introduce two innovative Chain-of-Thought (CoT) prompts, the 3-step CoT and the 3-in-1 CoT prompts, which consistently outperform standard prompts by up to 90% in the zero-shot setting.

Title: Does ChatGPT Have a Poetic Style?

Authors: Melanie Walsh, Anna Preus, Elizabeth Gronski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15299
Pdf URL: https://arxiv.org/pdf/2410.15299
Copy Paste: [[2410.15299]] Does ChatGPT Have a Poetic Style?(https://arxiv.org/abs/2410.15299)
Keywords: gpt, llm, prompt, chat
Abstract: Generating poetry has become a popular application of LLMs, perhaps especially of OpenAI's widely-used chatbot ChatGPT. What kind of poet is ChatGPT? Does ChatGPT have its own poetic style? Can it successfully produce poems in different styles? To answer these questions, we prompt the GPT-3.5 and GPT-4 models to generate English-language poems in 24 different poetic forms and styles, about 40 different subjects, and in response to 3 different writing prompt templates. We then analyze the resulting 5.7k poems, comparing them to a sample of 3.7k poems from the Poetry Foundation and the Academy of American Poets. We find that the GPT models, especially GPT-4, can successfully produce poems in a range of both common and uncommon English-language forms in superficial yet noteworthy ways, such as by producing poems of appropriate lengths for sonnets (14 lines), villanelles (19 lines), and sestinas (39 lines). But the GPT models also exhibit their own distinct stylistic tendencies, both within and outside of these specific forms. Our results show that GPT poetry is much more constrained and uniform than human poetry, showing a strong penchant for rhyme, quatrains (4-line stanzas), iambic meter, first-person plural perspectives (we, us, our), and specific vocabulary like "heart," "embrace," "echo," and "whisper."

Title: LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content

Authors: Mohamed Bayan Kmainasi, Ali Ezzat Shahroor, Maram Hasanain, Sahinur Rahman Laskar, Naeemul Hassan, Firoj Alam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15308
Pdf URL: https://arxiv.org/pdf/2410.15308
Copy Paste: [[2410.15308]] LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content(https://arxiv.org/abs/2410.15308)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable success as general-purpose task solvers across various fields, including NLP, healthcare, finance, and law. However, their capabilities remain limited when addressing domain-specific problems, particularly in downstream NLP tasks. Research has shown that models fine-tuned on instruction-based downstream NLP datasets outperform those that are not fine-tuned. While most efforts in this area have primarily focused on resource-rich languages like English and broad domains, little attention has been given to multilingual settings and specific domains. To address this gap, this study focuses on developing a specialized LLM, LlamaLens, for analyzing news and social media content in a multilingual context. To the best of our knowledge, this is the first attempt to tackle both domain specificity and multilinguality, with a particular focus on news and social media. Our experimental setup includes 19 tasks, represented by 52 datasets covering Arabic, English, and Hindi. We demonstrate that LlamaLens outperforms the current state-of-the-art (SOTA) on 16 testing sets, and achieves comparable performance on 10 sets. We make the models and resources publicly available for the research community.

Title: Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Authors: Alan Dao (Gia Tuan Dao), Dinh Bach Vu, Huy Hoang Ha
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.15316
Pdf URL: https://arxiv.org/pdf/2410.15316
Copy Paste: [[2410.15316]] Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant(https://arxiv.org/abs/2410.15316)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.

Title: Causality for Large Language Models

Authors: Anpeng Wu, Kun Kuang, Minqin Zhu, Yingrong Wang, Yujia Zheng, Kairong Han, Baohong Li, Guangyi Chen, Fei Wu, Kun Zhang
Subjects: cs.CL, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2410.15319
Pdf URL: https://arxiv.org/pdf/2410.15319
Copy Paste: [[2410.15319]] Causality for Large Language Models(https://arxiv.org/abs/2410.15319)
Keywords: language model, llm, hallucination, prompt
Abstract: Recent breakthroughs in artificial intelligence have driven a paradigm shift, where large language models (LLMs) with billions or trillions of parameters are trained on vast datasets, achieving unprecedented success across a series of language tasks. However, despite these successes, LLMs still rely on probabilistic modeling, which often captures spurious correlations rooted in linguistic patterns and social stereotypes, rather than the true causal relationships between entities and events. This limitation renders LLMs vulnerable to issues such as demographic biases, social stereotypes, and LLM hallucinations. These challenges highlight the urgent need to integrate causality into LLMs, moving beyond correlation-driven paradigms to build more reliable and ethically aligned AI systems. While many existing surveys and studies focus on utilizing prompt engineering to activate LLMs for causal knowledge or developing benchmarks to assess their causal reasoning abilities, most of these efforts rely on human intervention to activate pre-trained models. How to embed causality into the training process of LLMs and build more general and intelligent models remains unexplored. Recent research highlights that LLMs function as causal parrots, capable of reciting causal knowledge without truly understanding or applying it. These prompt-based methods are still limited to human interventional improvements. This survey aims to address this gap by exploring how causality can enhance LLMs at every stage of their lifecycle, from token embedding learning and foundation model training to fine-tuning, alignment, inference, and evaluation, paving the way for more interpretable, reliable, and causally-informed models. Additionally, we further outline six promising future directions to advance LLM development, enhance their causal reasoning capabilities, and address the current limitations these models face.

Title: A Survey of Uncertainty Estimation in LLMs: Theory Meets Practice

Authors: Hsiu-Yuan Huang, Yutong Yang, Zhaoxi Zhang, Sanwoo Lee, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15326
Pdf URL: https://arxiv.org/pdf/2410.15326
Copy Paste: [[2410.15326]] A Survey of Uncertainty Estimation in LLMs: Theory Meets Practice(https://arxiv.org/abs/2410.15326)
Keywords: language model, llm
Abstract: As large language models (LLMs) continue to evolve, understanding and quantifying the uncertainty in their predictions is critical for enhancing application credibility. However, the existing literature relevant to LLM uncertainty estimation often relies on heuristic approaches, lacking systematic classification of the methods. In this survey, we clarify the definitions of uncertainty and confidence, highlighting their distinctions and implications for model predictions. On this basis, we integrate theoretical perspectives, including Bayesian inference, information theory, and ensemble strategies, to categorize various classes of uncertainty estimation methods derived from heuristic approaches. Additionally, we address challenges that arise when applying these methods to LLMs. We also explore techniques for incorporating uncertainty into diverse applications, including out-of-distribution detection, data annotation, and question clarification. Our review provides insights into uncertainty estimation from both definitional and theoretical angles, contributing to a comprehensive understanding of this critical aspect in LLMs. We aim to inspire the development of more reliable and effective uncertainty estimation approaches for LLMs in real-world scenarios.

Title: BERTtime Stories: Investigating the Role of Synthetic Story Data in Language pre-training

Authors: Nikitas Theodoropoulos, Giorgos Filandrianos, Vassilis Lyberatos, Maria Lymperaiou, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15365
Pdf URL: https://arxiv.org/pdf/2410.15365
Copy Paste: [[2410.15365]] BERTtime Stories: Investigating the Role of Synthetic Story Data in Language pre-training(https://arxiv.org/abs/2410.15365)
Keywords: language model, gpt
Abstract: We describe our contribution to the Strict and Strict-Small tracks of the 2nd iteration of the BabyLM Challenge. The shared task is centered around efficient pre-training given data constraints motivated by human development. In response, we study the effect of synthetic story data in language pre-training using TinyStories: a recently introduced dataset of short stories. Initially, we train GPT-Neo models on subsets of TinyStories, while varying the amount of available data. We find that, even with access to less than 100M words, the models are able to generate high-quality, original completions to a given story, and acquire substantial linguistic knowledge. To measure the effect of synthetic story data, we train LTG-BERT encoder models on a combined dataset of: a subset of TinyStories, story completions generated by GPT-Neo, and a subset of the BabyLM dataset. Our experimentation reveals that synthetic data can occasionally offer modest gains, but overall have a negative influence on linguistic understanding. Our work offers an initial study on synthesizing story data in low resource settings and underscores their potential for augmentation in data-constrained language modeling. We publicly release our models and implementation on our GitHub.

Title: CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges

Authors: Haitao Li, Junjie Chen, Qingyao Ai, Zhumin Chu, Yujia Zhou, Qian Dong, Yiqun Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15393
Pdf URL: https://arxiv.org/pdf/2410.15393
Copy Paste: [[2410.15393]] CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges(https://arxiv.org/abs/2410.15393)
Keywords: language model, llm
Abstract: The use of large language models (LLMs) as automated evaluation tools to assess the quality of generated natural language, known as LLMs-as-Judges, has demonstrated promising capabilities and is rapidly gaining widespread attention. However, when applied to pairwise comparisons of candidate responses, LLM-based evaluators often exhibit selection bias. Specifically, their judgments may become inconsistent when the option positions or ID tokens are swapped, compromising the effectiveness and fairness of the evaluation result. To address this challenge, we introduce CalibraEval, a novel label-free method for mitigating selection bias during inference. Specifically, CalibraEval reformulates debiasing as an optimization task aimed at adjusting observed prediction distributions to align with unbiased prediction distributions. To solve this optimization problem, we propose a non-parametric order-preserving algorithm (NOA). This algorithm leverages the partial order relationships between model prediction distributions, thereby eliminating the need for explicit labels and precise mathematical functions. Evaluations of LLMs in multiple representative benchmarks demonstrate that CalibraEval effectively mitigates selection bias and improves performance compared to existing debiasing methods. This work marks a step toward building more robust and unbiased automated evaluation frameworks, paving the way for improved reliability in AI-driven assessments.
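The selection bias described in the CalibraEval abstract can be observed with a simple position-swap check. The sketch below only measures the inconsistency; it is not the NOA calibration algorithm. The `judge` interface (returning 0 for the response shown first, 1 for the second) and both toy judges are hypothetical stand-ins for an LLM evaluator.

```python
def swap_consistency(judge, pairs):
    """Fraction of pairs whose winner is unchanged when the options swap places."""
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for first, second in pairs:
        # judge returns 0 if it prefers the response shown first, else 1.
        winner_original = (first, second)[judge(first, second)]
        winner_swapped = (second, first)[judge(second, first)]
        consistent += winner_original == winner_swapped
    return consistent / len(pairs)


def position_biased_judge(a, b):
    """Toy judge with pure selection bias: always picks the first slot."""
    return 0


def length_judge(a, b):
    """Toy judge that prefers the longer response (ties go to the first slot)."""
    return 0 if len(a) >= len(b) else 1


pairs = [("short", "a longer answer"), ("detailed reply here", "ok")]
print(swap_consistency(position_biased_judge, pairs))
print(swap_consistency(length_judge, pairs))
```

A judge whose verdict depends only on content keeps the same winner after the swap, while a position-driven judge flips on every pair with distinct responses.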

Title: A Comprehensive Evaluation of Cognitive Biases in LLMs

Authors: Simon Malberg, Roman Poletukhin, Carolin M. Schuster, Georg Groh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15413
Pdf URL: https://arxiv.org/pdf/2410.15413
Copy Paste: [[2410.15413]] A Comprehensive Evaluation of Cognitive Biases in LLMs(https://arxiv.org/abs/2410.15413)
Keywords: language model, llm
Abstract: We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code to encourage future research on biases in LLMs.

Title: Evaluating Consistencies in LLM responses through a Semantic Clustering of Question Answering

Authors: Yanggyu Lee, Jihie Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15440
Pdf URL: https://arxiv.org/pdf/2410.15440
Copy Paste: [[2410.15440]] Evaluating Consistencies in LLM responses through a Semantic Clustering of Question Answering(https://arxiv.org/abs/2410.15440)
Keywords: language model, llm
Abstract: In the realm of Large Language Model (LLM) functionalities, providing reliable information is paramount, yet reports suggest that LLM outputs lack consistency. This inconsistency, often attributed to randomness in token sampling, undermines user trust, as it leads to varying responses even for identical queries. In this paper, we present a new approach for evaluating the semantic consistency of LLMs, including a comparison of alternative techniques. Our approach evaluates whether LLM responses are semantically congruent for a given question, recognizing that syntactically different sentences may convey the same meaning. Heretofore, two main approaches have been explored to enhance LLM consistency: leveraging external knowledge as context, as in the RAG pattern, or using Zero-shot-CoT to improve the performance of the LLM itself. We apply our evaluation approach to these techniques and compare the impact of these methods on LLM response consistency across different domains of question answering tasks. Using the TruthfulQA dataset to assess LLM responses, the study induces N responses per question from the LLM and clusters semantically equivalent sentences to measure semantic consistency across 37 categories. Through this, it quantitatively analyzes the effectiveness of the aforementioned methods in improving LLM performance before and after their adoption.
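The idea of sampling N responses per question and clustering semantically equivalent ones can be sketched as follows. The abstract does not specify the equivalence test or the exact consistency score, so both are assumptions here: a word-set comparison stands in for a real semantic-equivalence model, and consistency is taken as the share of responses in the largest cluster.

```python
def cluster_responses(responses, equivalent):
    """Greedy clustering: each response joins the first cluster whose
    representative (its first member) is judged equivalent to it."""
    clusters = []
    for response in responses:
        for cluster in clusters:
            if equivalent(cluster[0], response):
                cluster.append(response)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([response])
    return clusters


def consistency(responses, equivalent):
    """Share of responses in the largest cluster (1.0 means fully consistent)."""
    clusters = cluster_responses(responses, equivalent)
    return max(len(c) for c in clusters) / len(responses)


def same_word_set(a, b):
    """Hypothetical placeholder for a semantic-equivalence model."""
    def words(text):
        return set(text.lower().strip(".").split())
    return words(a) == words(b)


answers = [
    "Paris is the capital.",
    "The capital is Paris.",
    "It is Lyon.",
    "paris is the capital",
]
score = consistency(answers, same_word_set)
print(score)
```

Swapping `same_word_set` for an entailment or embedding-based check would leave the clustering and scoring code unchanged, which is the main reason for passing the equivalence test in as a function.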

Title: CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts

Authors: Malvina Nikandrou, Georgios Pantazopoulos, Nikolas Vitsakis, Ioannis Konstas, Alessandro Suglia
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.15453
Pdf URL: https://arxiv.org/pdf/2410.15453
Copy Paste: [[2410.15453]] CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts(https://arxiv.org/abs/2410.15453)
Keywords: language model
Abstract: As Vision and Language models (VLMs) become accessible across the globe, it is important that they demonstrate cultural knowledge. In this paper, we introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts and evaluate the capacity for cultural adaptation through contextual information. This allows us to distinguish between parametric knowledge acquired during training and contextual knowledge provided during inference via visual and textual descriptions. Our evaluation of several state-of-the-art open VLMs shows large performance disparities between culture-specific and common concepts in the parametric setting. Moreover, experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions. Our findings reveal limitations in the cultural understanding and adaptability of current VLMs that need to be addressed toward more culturally inclusive models.
摘要：随着视觉和语言模型 (VLM) 在全球范围内普及，展示文化知识变得非常重要。在本文中，我们介绍了 CROPE，这是一个视觉问答基准，旨在探究特定文化概念的知识并通过上下文信息评估文化适应能力。这使我们能够区分训练期间获得的参数知识和通过视觉和文本描述推理期间提供的上下文知识。我们对几种最先进的开放式 VLM 的评估表明，在参数设置中，特定文化概念和常见概念之间存在很大的性能差异。此外，使用上下文知识的实验表明，模型难以有效利用多模态信息并将特定文化概念与其描述联系起来。我们的研究结果揭示了当前 VLM 在文化理解和适应性方面的局限性，需要解决这些问题以建立更具文化包容性的模型。

Title: A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia

Authors: Lance Calvin Lim Gamboa, Mark Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15464
Pdf URL: https://arxiv.org/pdf/2410.15464
Copy Paste: [[2410.15464]] A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia(https://arxiv.org/abs/2410.15464)
Keywords: language model
Abstract: Work on bias in pretrained language models (PLMs) focuses on bias evaluation and mitigation and fails to tackle the question of bias attribution and this http URL propose a novel metric, the $\textit{bias attribution score}$, which draws from information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the utility of this metric by applying it on multilingual PLMs, including models from Southeast Asia which have not yet been thoroughly examined in bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses also reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping among other discursive categories, suggesting that these are topics where PLMs strongly reproduce bias from pretraining data and where PLMs should be used with more caution.
摘要：预训练语言模型 (PLM) 中的偏见研究侧重于偏见评估和缓解，未能解决偏见归因问题，此 http URL 提出了一种新颖的指标，即 $\textit{偏见归因分数}$，它借鉴信息论来衡量标记级对 PLM 中偏见行为的贡献。然后，我们通过将此指标应用于多语言 PLM（包括尚未在偏见评估文献中彻底检查过的东南亚模型）来证明其实用性。我们的结果证实了东南亚 PLM 中存在性别歧视和恐同偏见。可解释性和语义分析还表明，PLM 偏见受到与犯罪、亲密关系和帮助等话语类别相关的词语的强烈诱导，这表明这些主题是 PLM 强烈重现预训练数据中的偏见的主题，因此应更加谨慎地使用 PLM。

Title: Keep Guessing? When Considering Inference Scaling, Mind the Baselines

Authors: Gal Yona, Or Honovich, Omer Levy, Roee Aharoni
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15466
Pdf URL: https://arxiv.org/pdf/2410.15466
Copy Paste: [[2410.15466]] Keep Guessing? When Considering Inference Scaling, Mind the Baselines(https://arxiv.org/abs/2410.15466)
Keywords: language model, llm, prompt
Abstract: Scaling inference compute in large language models (LLMs) through repeated sampling consistently increases the coverage (fraction of problems solved) as the number of samples increases. We conjecture that this observed improvement is partially due to the answer distribution of standard evaluation benchmarks, which is skewed towards a relatively small set of common answers. To test this conjecture, we define a baseline that enumerates answers according to their prevalence in the training set. Experiments spanning two domains -- mathematical reasoning and factual knowledge -- reveal that this baseline outperforms repeated model sampling for some LLMs, while the coverage for others is on par with that of a mixture strategy that obtains $k$ answers by using only $10$ model samples and similarly guessing the remaining $k-10$ attempts via enumeration. Our baseline enables a more accurate measurement of how much repeated sampling improves coverage in such settings beyond prompt-agnostic guessing.
摘要：通过重复采样扩展大型语言模型 (LLM) 中的推理计算会随着样本数量的增加而持续增加覆盖率（解决问题的比例）。我们推测，观察到的这种改进部分归因于标准评估基准的答案分布，该分布偏向于一组相对较小的常见答案。为了检验这个猜想，我们定义了一个基线，根据答案在训练集中的普遍性来枚举答案。跨越两个领域——数学推理和事实知识——的实验表明，对于某些 LLM，该基线优于重复模型采样，而对于其他 LLM，其覆盖率与混合策略相当，该策略仅使用 $10$ 个模型样本并通过枚举同样猜测剩余的 $k-10$ 次尝试来获得 $k$ 个答案。我们的基线可以更准确地测量重复采样在这种设置下对覆盖率的改善程度，而不仅仅是提示不可知的猜测。
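The enumeration baseline and the mixture strategy from the abstract above can be sketched as follows. This is a minimal illustration, assuming answers are plain strings and coverage is the fraction of problems whose gold answer appears among the guesses. The function names and toy data are hypothetical and not taken from the paper, and the handling of duplicate guesses is left unspecified here.

```python
from collections import Counter


def enumeration_baseline(train_answers, k):
    """Return the k most frequent training-set answers, most common first."""
    return [answer for answer, _ in Counter(train_answers).most_common(k)]


def mixture_guesses(model_samples, enumerated, k, n_model=10):
    """Use up to n_model model samples, then fill the remaining
    attempts (up to k in total) with enumerated answers."""
    guesses = list(model_samples[:n_model])
    remaining = max(k - len(guesses), 0)
    return (guesses + enumerated[:remaining])[:k]


def coverage(gold_answers, guesses_per_problem):
    """Fraction of problems whose gold answer is among that problem's guesses."""
    solved = sum(
        gold in guesses
        for gold, guesses in zip(gold_answers, guesses_per_problem)
    )
    return solved / len(gold_answers)


# Toy data (hypothetical): answers seen in training and gold test answers.
train_answers = ["2", "1", "2", "0", "2", "1", "4"]
test_gold = ["2", "1", "7", "0"]

# The baseline is prompt-agnostic: every problem gets the same guesses.
guesses = enumeration_baseline(train_answers, k=2)
baseline_coverage = coverage(test_gold, [guesses] * len(test_gold))
```

Comparing `baseline_coverage` with the coverage of repeated model sampling at the same k shows how much of the gain comes from the model rather than from a skewed answer distribution.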

Title: Hey GPT, Can You be More Racist? Analysis from Crowdsourced Attempts to Elicit Biased Content from Generative AI

Authors: Hangzhi Guo, Pranav Narayanan Venkit, Eunchae Jang, Mukund Srinath, Wenbo Zhang, Bonam Mingole, Vipul Gupta, Kush R. Varshney, S. Shyam Sundar, Amulya Yadav
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2410.15467
Pdf URL: https://arxiv.org/pdf/2410.15467
Copy Paste: [[2410.15467]] Hey GPT, Can You be More Racist? Analysis from Crowdsourced Attempts to Elicit Biased Content from Generative AI(https://arxiv.org/abs/2410.15467)
Keywords: language model, gpt, llm, prompt
Abstract: The widespread adoption of large language models (LLMs) and generative AI (GenAI) tools across diverse applications has amplified the importance of addressing societal biases inherent within these technologies. While the NLP community has extensively studied LLM bias, research investigating how non-expert users perceive and interact with biases from these systems remains limited. As these technologies become increasingly prevalent, understanding this question is crucial to inform model developers in their efforts to mitigate bias. To address this gap, this work presents the findings from a university-level competition, which challenged participants to design prompts for eliciting biased outputs from GenAI tools. We quantitatively and qualitatively analyze the competition submissions and identify a diverse set of biases in GenAI and strategies employed by participants to induce bias in GenAI. Our finding provides unique insights into how non-expert users perceive and interact with biases from GenAI tools.
摘要：大型语言模型 (LLM) 和生成式人工智能 (GenAI) 工具在各种应用中的广泛采用，凸显了解决这些技术固有的社会偏见的重要性。虽然 NLP 社区已经广泛研究了 LLM 偏见，但研究非专家用户如何感知和应对这些系统的偏见的研究仍然有限。随着这些技术变得越来越普遍，理解这个问题对于模型开发人员如何减轻偏见至关重要。为了弥补这一差距，这项工作展示了一项大学级竞赛的结果，该竞赛要求参赛者设计提示以引出 GenAI 工具的偏见输出。我们对竞赛提交的内容进行了定量和定性分析，并确定了 GenAI 中的一系列不同偏见以及参赛者用来在 GenAI 中引发偏见的策略。我们的发现提供了关于非专家用户如何感知和应对 GenAI 工具偏见的独特见解。

Title: "What is the value of {templates}?" Rethinking Document Information Extraction Datasets for LLMs

Authors: Ran Zmigrod, Pranav Shetty, Mathieu Sibue, Zhiqiang Ma, Armineh Nourbakhsh, Xiaomo Liu, Manuela Veloso
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15484
Pdf URL: https://arxiv.org/pdf/2410.15484
Copy Paste: [[2410.15484]] "What is the value of {templates}?" Rethinking Document Information Extraction Datasets for LLMs(https://arxiv.org/abs/2410.15484)
Keywords: language model, llm, prompt
Abstract: The rise of large language models (LLMs) for visually rich document understanding (VRDU) has kindled a need for prompt-response, document-based datasets. As annotating new datasets from scratch is labor-intensive, the existing literature has generated prompt-response datasets from available resources using simple templates. For the case of key information extraction (KIE), one of the most common VRDU tasks, past work has typically employed the template "What is the value for the {key}?". However, given the variety of questions encountered in the wild, simple and uniform templates are insufficient for creating robust models in research and industrial contexts. In this work, we present K2Q, a diverse collection of five datasets converted from KIE to a prompt-response format using a plethora of bespoke templates. The questions in K2Q can span multiple entities and be extractive or boolean. We empirically compare the performance of seven baseline generative models on K2Q with zero-shot prompting. We further compare three of these models when training on K2Q versus training on simpler templates to motivate the need of our work. We find that creating diverse and intricate KIE questions enhances the performance and robustness of VRDU models. We hope this work encourages future studies on data quality for generative model training.
摘要：用于视觉丰富文档理解 (VRDU) 的大型语言模型 (LLM) 的兴起引发了对基于文档的即时响应数据集的需求。由于从头开始注释新数据集需要大量劳动力，现有文献使用简单的模板从可用资源生成即时响应数据集。对于密钥信息提取 (KIE)（最常见的 VRDU 任务之一），过去的工作通常使用模板“{key} 的值是什么？”。然而，考虑到在现实中遇到的各种问题，简单而统一的模板不足以在研究和工业环境中创建强大的模型。在这项工作中，我们提出了 K2Q，这是一个由五个数据集组成的多样化集合，使用大量定制模板从 KIE 转换为即时响应格式。K2Q 中的问题可以跨越多个实体，可以是提取的或布尔的。我们通过实证研究比较了七个基线生成模型在 K2Q 上与零样本提示的性能。我们进一步比较了在 K2Q 上进行训练和在更简单的模板上进行训练时这三种模型，以激发我们工作的需要。我们发现创建多样化和复杂的 KIE 问题可以增强 VRDU 模型的性能和鲁棒性。我们希望这项工作能够鼓励未来对生成模型训练数据质量的研究。

Title: Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?

Authors: Nishant Balepur, Feng Gu, Abhilasha Ravichander, Shi Feng, Jordan Boyd-Graber, Rachel Rudinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15512
Pdf URL: https://arxiv.org/pdf/2410.15512
Copy Paste: [[2410.15512]] Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?(https://arxiv.org/abs/2410.15512)
Keywords: llm
Abstract: Question answering (QA) -- producing correct answers for input questions -- is popular, but we test a reverse question answering (RQA) task: given an input answer, generate a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and assessing reasoning consistency. 16 LLMs run QA and RQA with trivia questions/answers, showing: 1) Versus QA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not from knowledge gaps alone; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to give valid multi-hop questions. By finding question and answer types yielding RQA errors, we suggest improvements for LLM RQA reasoning.
摘要：问答 (QA) —— 为输入的问题提供正确答案 —— 很受欢迎，但我们测试了一个反向问答 (RQA) 任务：给定一个输入答案，用该答案生成一个问题。过去的工作分别测试 QA 和 RQA，但我们联合测试它们，比较它们的难度，帮助基准设计，并评估推理一致性。16 个 LLM 使用琐事问题/答案运行 QA 和 RQA，结果显示：1) 与 QA 相比，LLM 在 RQA 中对数字答案的准确性要低得多，但在 RQA 中对文本答案的准确性略高；2) LLM 通常在 QA 中准确回答自己在 RQA 中提出的无效问题，因此 RQA 错误不仅仅是来自知识差距；3) RQA 错误与问题难度相关，与 Dolma 语料库中的答案频率呈负相关；4) LLM 难以给出有效的多跳问题。通过查找产生 RQA 错误的问题和答案类型，我们建议改进 LLM RQA 推理。

Title: M-RewardBench: Evaluating Reward Models in Multilingual Settings

Authors: Srishti Gureja, Lester James V. Miranda, Shayekh Bin Islam, Rishabh Maheshwary, Drishti Sharma, Gusti Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, Marzieh Fadaee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.15522
Pdf URL: https://arxiv.org/pdf/2410.15522
Copy Paste: [[2410.15522]] M-RewardBench: Evaluating Reward Models in Multilingual Settings(https://arxiv.org/abs/2410.15522)
Keywords: language model, llm, chat
Abstract: Reward models (RMs) have driven the state-of-the-art performance of LLMs today by enabling the integration of human feedback into the language modeling process. However, RMs are primarily trained and evaluated in English, and their capabilities in multilingual settings remain largely understudied. In this work, we conduct a systematic evaluation of several reward models in multilingual settings. We first construct the first-of-its-kind multilingual RM evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances for 23 typologically diverse languages, that tests the chat, safety, reasoning, and translation capabilities of RMs. We then rigorously evaluate a wide range of reward models on M-RewardBench, offering fresh insights into their performance across diverse languages. We identify a significant gap in RMs' performances between English and non-English languages and show that RM preferences can change substantially from one language to another. We also present several findings on how different multilingual aspects impact RM performance. Specifically, we show that the performance of RMs is improved with improved translation quality. Similarly, we demonstrate that the models exhibit better performance for high-resource languages. We release M-RewardBench dataset and the codebase in this study to facilitate a better understanding of RM evaluation in multilingual settings.
摘要：奖励模型 (RM) 通过将人工反馈集成到语言建模过程中，推动了当今 LLM 的领先性能。然而，RM 主要以英语进行训练和评估，其在多语言环境中的能力仍未得到充分研究。在这项工作中，我们对多语言环境中的几种奖励模型进行了系统评估。我们首先构建了首创的多语言 RM 评估基准 M-RewardBench，它由 23 种类型学上不同的语言的 2.87k 个偏好实例组成，用于测试 RM 的聊天、安全、推理和翻译能力。然后，我们在 M-RewardBench 上严格评估了各种奖励模型，为它们在不同语言中的表现提供了新的见解。我们发现 RM 在英语和非英语语言之间的表现存在显著差距，并表明 RM 的偏好在不同语言之间可能会有很大差异。我们还提出了一些关于不同多语言方面如何影响 RM 性能的发现。具体而言，我们表明 RM 的性能随着翻译质量的提高而提高。同样，我们证明了模型在高资源语言方面表现出更好的性能。我们在本研究中发布了 M-RewardBench 数据集和代码库，以便更好地理解多语言环境中的 RM 评估。

Title: Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage

Authors: Kaige Xie, Philippe Laban, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15531
Pdf URL: https://arxiv.org/pdf/2410.15531
Copy Paste: [[2410.15531]] Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage(https://arxiv.org/abs/2410.15531)
Keywords: chat, retrieval-augmented generation
Abstract: Evaluating retrieval-augmented generation (RAG) systems remains challenging, particularly for open-ended questions that lack definitive answers and require coverage of multiple sub-topics. In this paper, we introduce a novel evaluation framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question. We propose decomposing questions into sub-questions and classifying them into three types -- core, background, and follow-up -- to reflect their roles and importance. Using this categorization, we introduce a fine-grained evaluation protocol that provides insights into the retrieval and generation characteristics of RAG systems, including three commercial generative answer engines: this http URL, Perplexity AI, and Bing Chat. Interestingly, we find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions, revealing clear opportunities for improvement. Further, sub-question coverage metrics prove effective for ranking responses, achieving 82% accuracy compared to human preference annotations. Lastly, we also demonstrate that leveraging core sub-questions enhances both retrieval and answer generation in a RAG system, resulting in a 74% win rate over the baseline that lacks sub-questions.
摘要：评估检索增强生成 (RAG) 系统仍然具有挑战性，特别是对于缺乏明确答案且需要覆盖多个子主题的开放式问题。在本文中，我们介绍了一种基于子问题覆盖率的新型评估框架，该框架衡量 RAG 系统解决问题不同方面的能力。我们建议将问题分解为子问题，并将它们分为三种类型——核心、背景和后续——以反映它们的作用和重要性。使用这种分类，我们引入了一种细粒度的评估协议，该协议提供了对 RAG 系统的检索和生成特征的洞察，包括三个商业生成答案引擎：此 http URL、Perplexity AI 和 Bing Chat。有趣的是，我们发现虽然所有答案引擎都比背景或后续问题更频繁地覆盖核心子问题，但它们仍然错过了大约 50% 的核心子问题，这表明有明显的改进机会。此外，子问题覆盖率指标被证明对排名响应有效，与人类偏好注释相比，准确率达到 82%。最后，我们还证明，利用核心子问题可以增强 RAG 系统中的检索和答案生成，从而比缺少子问题的基线高出 74% 的胜率。
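A per-type coverage computation in the spirit of the abstract above might look like the sketch below. It assumes the question has already been decomposed into typed sub-questions and that some judge has already decided which sub-questions a response covers. Both inputs, and the name `coverage_by_type`, are hypothetical and not the authors' implementation.

```python
def coverage_by_type(sub_questions, covered_ids):
    """Compute the fraction of covered sub-questions per type.

    sub_questions: list of (sub_question_id, type) pairs, where type is
                   "core", "background", or "follow-up".
    covered_ids:   set of sub-question ids the response was judged to cover.
    """
    totals, hits = {}, {}
    for sq_id, sq_type in sub_questions:
        totals[sq_type] = totals.get(sq_type, 0) + 1
        hits[sq_type] = hits.get(sq_type, 0) + (1 if sq_id in covered_ids else 0)
    return {sq_type: hits[sq_type] / totals[sq_type] for sq_type in totals}


# Toy decomposition of one open-ended question (hypothetical ids).
sub_questions = [
    ("q1", "core"),
    ("q2", "core"),
    ("q3", "background"),
    ("q4", "follow-up"),
]
covered = {"q1", "q3"}
result = coverage_by_type(sub_questions, covered)
```

Averaging the "core" value over many questions would give a number comparable in kind to the roughly 50% miss rate on core sub-questions reported in the abstract.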

Title: Grammatical Error Correction for Low-Resource Languages: The Case of Zarma

Authors: Mamadou K. Keita, Christopher Homan, Sofiane Abdoulaye Hamani, Adwoa Bremang, Marcos Zampieri, Habibatou Abdoulaye Alfari, Elysabhete Amadou Ibrahim, Dennis Owusu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.15539
Pdf URL: https://arxiv.org/pdf/2410.15539
Copy Paste: [[2410.15539]] Grammatical Error Correction for Low-Resource Languages: The Case of Zarma(https://arxiv.org/abs/2410.15539)
Keywords: language model, llm
Abstract: Grammatical error correction (GEC) is important for improving written materials for low-resource languages like Zarma -- spoken by over 5 million people in West Africa. Yet it remains a challenging problem. This study compares rule-based methods, machine translation (MT) models, and large language models (LLMs) for GEC in Zarma. We evaluate each approach's effectiveness on our manually-built dataset of over 250,000 examples using synthetic and human-annotated data. Our experiments show that the MT-based approach using the M2M100 model outperforms others, achieving a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluations, and scoring 3.0 out of 5.0 in logical/grammar error correction during manual evaluations (MEs) by native speakers. The rule-based method achieved perfect detection (100%) and high suggestion accuracy (96.27%) for spelling corrections but struggled with context-level errors. LLMs like MT5-small showed moderate performance with a detection rate of 90.62% and a suggestion accuracy of 57.15%. Our work highlights the potential of MT models to enhance GEC in low-resource languages, paving the way for more inclusive NLP tools.
摘要：语法错误纠正 (GEC) 对于改进像 Zarma 这样的低资源语言的书面材料非常重要——西非有超过 500 万人使用这种语言。然而，这仍然是一个具有挑战性的问题。本研究比较了基于规则的方法、机器翻译 (MT) 模型和大型语言模型 (LLM) 在 Zarma 中的 GEC。我们使用合成和人工注释的数据在我们手动构建的超过 250,000 个示例的数据集上评估每种方法的有效性。我们的实验表明，使用 M2M100 模型的基于 MT 的方法优于其他方法，在自动评估中实现了 95.82% 的检测率和 78.90% 的建议准确率，在母语人士的 ME 期间的逻辑/语法错误纠正中获得了 5.0 分中的 3.0 分。基于规则的方法在拼写纠正方面实现了完美的检测（100%）和高建议准确率（96.27%），但在上下文级错误方面却遇到了困难。 MT5-small 等 LLM 表现中等，检测率为 90.62%，建议准确率为 57.15%。我们的工作凸显了 MT 模型在低资源语言中增强 GEC 的潜力，为更具包容性的 NLP 工具铺平了道路。

Title: WHoW: A Cross-domain Approach for Analysing Conversation Moderation

Authors: Ming-Bin Chen, Lea Frermann, Jey Han Lau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15551
Pdf URL: https://arxiv.org/pdf/2410.15551
Copy Paste: [[2410.15551]] WHoW: A Cross-domain Approach for Analysing Conversation Moderation(https://arxiv.org/abs/2410.15551)
Keywords: gpt, agent
Abstract: We propose WHoW, an evaluation framework for analyzing the facilitation strategies of moderators across different domains/scenarios by examining their motives (Why), dialogue acts (How) and target speaker (Who). Using this framework, we annotated 5,657 moderation sentences with human judges and 15,494 sentences with GPT-4o from two domains: TV debates and radio panel discussions. Comparative analysis demonstrates the framework's cross-domain generalisability and reveals distinct moderation strategies: debate moderators emphasise coordination and facilitate interaction through questions and instructions, while panel discussion moderators prioritize information provision and actively participate in discussions. Our analytical framework works for different moderation scenarios, enhances our understanding of moderation behaviour through automatic large-scale analysis, and facilitates the development of moderator agents.
摘要：我们提出了 WHoW，这是一个评估框架，通过检查主持人的动机（为什么）、对话行为（如何）和目标发言人（谁）来分析不同领域/场景中主持人的促进策略。使用这个框架，我们用人类评判者注释了 5,657 个主持句子，用 GPT-4o 注释了 15,494 个句子，这些句子来自两个领域：电视辩论和广播小组讨论。比较分析证明了该框架的跨领域通用性，并揭示了不同的主持策略：辩论主持人强调协调并通过问题和指示促进互动，而小组讨论主持人则优先提供信息并积极参与讨论。我们的分析框架适用于不同的主持场景，通过自动大规模分析增强了我们对主持行为的理解，并促进了主持人代理的发展。

Title: Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following

Authors: Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, Sinong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15553
Pdf URL: https://arxiv.org/pdf/2410.15553
Copy Paste: [[2410.15553]] Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following(https://arxiv.org/abs/2410.15553)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including instruction following, which is crucial for aligning model outputs with user expectations. However, evaluating LLMs' ability to follow instructions remains challenging due to the complexity and subjectivity of human language. Current benchmarks primarily focus on single-turn, monolingual instructions, which do not adequately reflect the complexities of real-world applications that require handling multi-turn and multilingual interactions. To address this gap, we introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which utilizes a hybrid framework combining LLM and human annotators, expands upon the IFEval by incorporating multi-turn sequences and translating the English prompts into another 7 languages, resulting in a dataset of 4,501 multilingual conversations, where each has three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All the models tested showed a higher rate of failure in executing instructions correctly with each additional turn. For example, o1-preview drops from 0.877 at the first turn to 0.707 at the third turn in terms of average accuracy over all languages. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities. We release Multi-IF prompts and the evaluation code base to encourage further research in this critical area.
摘要：大型语言模型 (LLM) 在各种任务中都表现出令人印象深刻的能力，包括指令遵循，这对于使模型输出与用户期望保持一致至关重要。然而，由于人类语言的复杂性和主观性，评估 LLM 遵循指令的能力仍然具有挑战性。当前的基准测试主要侧重于单轮、单语指令，这不能充分反映需要处理多轮和多语言交互的现实世界应用程序的复杂性。为了解决这一差距，我们引入了 Multi-IF，这是一个新的基准测试，旨在评估 LLM 遵循多轮和多语言指令的能力。Multi-IF 采用结合 LLM 和人工注释者的混合框架，通过合并多轮序列并将英语提示翻译成另外 7 种语言来扩展 IFEval，从而产生一个包含 4,501 个多语言对话的数据集，每个对话都有三个轮次。我们对 Multi-IF 上 14 个最先进的 LLM 的评估表明，它提出的任务比现有基准测试更具挑战性。所有测试模型都表明，随着每次轮次的增加，正确执行指令的失败率会更高。例如，就所有语言的平均准确率而言，o1-preview 从第一次轮次的 0.877 下降到第三次轮次的 0.707。此外，非拉丁文字的语言（印地语、俄语和中文）通常表现出更高的错误率，这表明模型的多语言能力可能存在局限性。我们发布了 Multi-IF 提示和评估代码库，以鼓励对这一关键领域的进一步研究。

Title: Stacking Small Language Models for Generalizability

Authors: Laurence Liang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.15570
Pdf URL: https://arxiv.org/pdf/2410.15570
Copy Paste: [[2410.15570]] Stacking Small Language Models for Generalizability(https://arxiv.org/abs/2410.15570)
Keywords: language model, llm
Abstract: Recent advances show that large language models (LLMs) generalize strong performance across different natural language benchmarks. However, the large size of LLMs makes training and inference expensive and impractical to run in resource-limited settings. This paper introduces a new approach called fine-tuning stacks of language models (FSLM), which involves stacking small language models (SLM) as an alternative to LLMs. By fine-tuning each SLM to perform a specific task, this approach breaks down high level reasoning into multiple lower-level steps that specific SLMs are responsible for. As a result, FSLM allows for lower training and inference costs, and also improves model interpretability as each SLM communicates with the subsequent one through natural language. By evaluating FSLM on common natural language benchmarks, this paper highlights promising early results toward generalizable performance using FSLM as a cost-effective alternative to LLMs.
摘要：最近的进展表明，大型语言模型 (LLM) 在不同的自然语言基准上具有出色的表现。然而，LLM 的规模庞大使得训练和推理成本高昂，并且在资源有限的环境中运行不切实际。本文介绍了一种称为微调语言模型堆栈 (FSLM) 的新方法，该方法涉及堆叠小型语言模型 (SLM) 作为 LLM 的替代方案。通过微调每个 SLM 以执行特定任务，这种方法将高级推理分解为特定 SLM 负责的多个低级步骤。因此，FSLM 可以降低训练和推理成本，并且由于每个 SLM 通过自然语言与后续 SLM 进行通信，因此还可以提高模型的可解释性。通过在常见的自然语言基准上评估 FSLM，本文强调了使用 FSLM 作为 LLM 的经济高效的替代方案在实现可推广性能方面有希望的早期结果。

Title: Leveraging Retrieval-Augmented Generation for Culturally Inclusive Hakka Chatbots: Design Insights and User Perceptions

Authors: Chen-Chi Chang, Han-Pi Chang, Hung-Shin Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15572
Pdf URL: https://arxiv.org/pdf/2410.15572
Copy Paste: [[2410.15572]] Leveraging Retrieval-Augmented Generation for Culturally Inclusive Hakka Chatbots: Design Insights and User Perceptions(https://arxiv.org/abs/2410.15572)
Keywords: language model, llm, chat, retrieval-augmented generation
Abstract: In an era where cultural preservation is increasingly intertwined with technological innovation, this study introduces a groundbreaking approach to promoting and safeguarding the rich heritage of Taiwanese Hakka culture through the development of a Retrieval-Augmented Generation (RAG)-enhanced chatbot. Traditional large language models (LLMs), while powerful, often fall short in delivering accurate and contextually rich responses, particularly in culturally specific domains. By integrating external databases with generative AI models, RAG technology bridges this gap, empowering chatbots to not only provide precise answers but also resonate deeply with the cultural nuances that are crucial for authentic interactions. This study delves into the intricate process of augmenting the chatbot's knowledge base with targeted cultural data, specifically curated to reflect the unique aspects of Hakka traditions, language, and practices. Through dynamic information retrieval, the RAG-enhanced chatbot becomes a versatile tool capable of handling complex inquiries that demand an in-depth understanding of Hakka cultural context. This is particularly significant in an age where digital platforms often dilute cultural identities, making the role of culturally aware AI systems more critical than ever. System usability studies conducted as part of our research reveal a marked improvement in both user satisfaction and engagement, highlighting the chatbot's effectiveness in fostering a deeper connection with Hakka culture. The feedback underscores the potential of RAG technology to not only enhance user experience but also to serve as a vital instrument in the broader mission of ethnic mainstreaming and cultural celebration.

Title: Neural Search Space in Gboard Decoder

Authors: Yanxiang Zhang, Yuanbo Zhang, Haicheng Sun, Yun Wang, Billy Dou, Gary Sivek, Shumin Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15575
Pdf URL: https://arxiv.org/pdf/2410.15575
Copy Paste: [[2410.15575]] Neural Search Space in Gboard Decoder(https://arxiv.org/abs/2410.15575)
Keywords: language model
Abstract: Gboard Decoder produces suggestions by looking for paths that best match input touch points on the context aware search space, which is backed by the language Finite State Transducers (FST). The language FST is currently an N-gram language model (LM). However, N-gram LMs, limited in context length, are known to have a sparsity problem under device model size constraints. In this paper, we propose Neural Search Space, which substitutes the N-gram LM with a Neural Network LM (NN-LM) and dynamically constructs the search space during decoding. Specifically, we integrate the long range context awareness of NN-LM into the search space by converting its outputs, given context, into the language FST at runtime. This involves language FST structure redesign, pruning strategy tuning, and data structure optimizations. Online experiments demonstrate improved quality results, reducing Words Modified Ratio by [0.26%, 1.19%] on various locales with acceptable latency increases. This work opens new avenues for further improving keyboard decoding quality by enhancing neural LM more directly.
摘要：Gboard 解码器通过在上下文感知搜索空间上寻找与输入触点最匹配的路径来生成建议，该搜索空间由语言有限状态转换器 (FST) 支持。语言 FST 目前是 N-gram 语言模型 (LM)。但是，N-gram 语言模型在上下文长度上受到限制，在设备模型大小约束下存在稀疏性问题。在本文中，我们提出了神经搜索空间，它用神经网络语言模型 (NN-LM) 替代 N-gram 语言模型，并在解码过程中动态构建搜索空间。具体而言，我们通过在运行时将给定上下文的输出转换为语言 FST，将 NN-LM 的长距离上下文感知集成到搜索空间中。这涉及语言 FST 结构重新设计、修剪策略调整和数据结构优化。在线实验表明质量结果有所提高，在各种语言环境中将单词修改率降低了 [0.26%，1.19%]，延迟增加在可接受范围内。这项工作通过更直接地增强神经 LM 为进一步提高键盘解码质量开辟了新的途径。

Title: A Survey of Conversational Search

Authors: Fengran Mo, Kelong Mao, Ziliang Zhao, Hongjin Qian, Haonan Chen, Yiruo Cheng, Xiaoxi Li, Yutao Zhu, Zhicheng Dou, Jian-Yun Nie
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2410.15576
Pdf URL: https://arxiv.org/pdf/2410.15576
Copy Paste: [[2410.15576]] A Survey of Conversational Search(https://arxiv.org/abs/2410.15576)
Keywords: language model, llm
Abstract: As a cornerstone of modern information access, search engines have become indispensable in everyday life. With the rapid advancements in AI and natural language processing (NLP) technologies, particularly large language models (LLMs), search engines have evolved to support more intuitive and intelligent interactions between users and systems. Conversational search, an emerging paradigm for next-generation search engines, leverages natural language dialogue to facilitate complex and precise information retrieval, thus attracting significant attention. Unlike traditional keyword-based search engines, conversational search systems enhance user experience by supporting intricate queries, maintaining context over multi-turn interactions, and providing robust information integration and processing capabilities. Key components such as query reformulation, search clarification, conversational retrieval, and response generation work in unison to enable these sophisticated interactions. In this survey, we explore the recent advancements and potential future directions in conversational search, examining the critical modules that constitute a conversational search system. We highlight the integration of LLMs in enhancing these systems and discuss the challenges and opportunities that lie ahead in this dynamic field. Additionally, we provide insights into real-world applications and robust evaluations of current conversational search systems, aiming to guide future research and development in conversational search.

Title: AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection

Authors: Xiaoman Xu, Xiangrun Li, Taihang Wang, Ye Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15591
Pdf URL: https://arxiv.org/pdf/2410.15591
Copy Paste: [[2410.15591]] AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection(https://arxiv.org/abs/2410.15591)
Keywords: language model, prompt
Abstract: Detecting fake news in large datasets is challenging due to its diversity and complexity, with traditional approaches often focusing on textual features while underutilizing semantic and emotional elements. Current methods also rely heavily on large annotated datasets, limiting their effectiveness in more nuanced analysis. To address these challenges, this paper introduces the Emotion-Aware Multimodal Fusion Prompt Learning (AMPLE) framework, which combines text sentiment analysis with multimodal data and hybrid prompt templates. This framework extracts emotional elements from texts by leveraging sentiment analysis tools. It then employs Multi-Head Cross-Attention (MCA) mechanisms and similarity-aware fusion methods to integrate multimodal data. The proposed AMPLE framework demonstrates strong performance on two public datasets in both few-shot and data-rich settings, with results indicating the potential of emotional aspects in fake news detection. Furthermore, the study explores the impact of integrating large language models with this method for text sentiment extraction, revealing substantial room for further improvement. The code can be found at: this https URL

Title: Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding

Authors: Yeonjoon Jung, Jaeseong Lee, Seungtaek Choi, Dohyeon Lee, Minsoo Kim, Seung-won Hwang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.15609
Pdf URL: https://arxiv.org/pdf/2410.15609
Copy Paste: [[2410.15609]] Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding(https://arxiv.org/abs/2410.15609)
Keywords: language model
Abstract: Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can significantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific noises. In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. Experimental results and analyses demonstrate the effectiveness of our proposed methods in enhancing the robustness and generalizability of SLU models against unseen ASR systems by introducing more diverse and plausible ASR noises in advance.

Title: Guardians of Discourse: Evaluating LLMs on Multilingual Offensive Language Detection

Authors: Jianfei He, Lilin Wang, Jiaying Wang, Zhenyu Liu, Hongbin Na, Zimu Wang, Wei Wang, Qi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15623
Pdf URL: https://arxiv.org/pdf/2410.15623
Copy Paste: [[2410.15623]] Guardians of Discourse: Evaluating LLMs on Multilingual Offensive Language Detection(https://arxiv.org/abs/2410.15623)
Keywords: language model, gpt, llm, prompt
Abstract: Identifying offensive language is essential for maintaining safety and sustainability in the social media era. Though large language models (LLMs) have demonstrated encouraging potential in social media analytics, they lack thorough evaluation in offensive language detection, particularly in multilingual environments. For the first time, we evaluate multilingual offensive language detection by LLMs in three languages (English, Spanish, and German) with three LLMs (GPT-3.5, Flan-T5, and Mistral), in both monolingual and multilingual settings. We further examine the impact of different prompt languages and augmented translation data for the task in non-English contexts. Furthermore, we discuss the impact of the inherent bias in LLMs and the datasets on mispredictions related to sensitive topics.

Title: Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement

Authors: Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15633
Pdf URL: https://arxiv.org/pdf/2410.15633
Copy Paste: [[2410.15633]] Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement(https://arxiv.org/abs/2410.15633)
Keywords: language model, llm, long context
Abstract: The expansion of large language models to effectively handle instructions with extremely long contexts has yet to be fully investigated. The primary obstacle lies in constructing a high-quality long instruction-following dataset devised for long context alignment. Existing studies have attempted to scale up the available data volume by synthesizing long instruction-following samples. However, indiscriminately increasing the quantity of data without a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the final performance. To bridge this gap, we aim to address the unique challenge of long-context alignment, i.e., modeling the long-range dependencies for handling instructions and lengthy input contexts. We propose GATEAU, a novel framework designed to identify the influential and high-quality samples enriched with long-range dependency relations by utilizing crafted Homologous Models' Guidance (HMG) and Contextual Awareness Measurement (CAM). Specifically, HMG attempts to measure the difficulty of generating corresponding responses due to the long-range dependencies, using the perplexity scores of the response from two homologous models with different context windows. Also, the role of CAM is to measure the difficulty of understanding the long input contexts due to long-range dependencies by evaluating whether the model's attention is focused on important segments. Built upon both proposed methods, we select the most challenging samples as the influential data to effectively frame the long-range dependencies, thereby achieving better performance of LLMs. Comprehensive experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
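The HMG idea described above compares the perplexity of one response under two homologous models with different context windows. A minimal sketch of that comparison follows, assuming per-token log-probabilities are already available from each model. The function names and the plain difference used as a score are illustrative assumptions, not the formula from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a response from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_gap(short_ctx_logprobs, long_ctx_logprobs):
    """Hypothetical score: how much harder the response is for the short-context model."""
    return perplexity(short_ctx_logprobs) - perplexity(long_ctx_logprobs)

# Made-up log-probabilities of the same response under the two models.
short_ctx = [-2.0, -1.5, -2.5]
long_ctx = [-0.5, -0.4, -0.6]
print(perplexity_gap(short_ctx, long_ctx))
```

Under this assumed score, a large positive gap would mark a sample whose response depends on context that only the longer window can see.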

Title: Can Large Language Models Invent Algorithms to Improve Themselves?

Authors: Yoichi Ishibashi, Taro Yano, Masafumi Oyamada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15639
Pdf URL: https://arxiv.org/pdf/2410.15639
Copy Paste: [[2410.15639]] Can Large Language Models Invent Algorithms to Improve Themselves?(https://arxiv.org/abs/2410.15639)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown remarkable performance improvements and are rapidly gaining adoption in industry. However, the methods for improving LLMs are still designed by humans, which restricts the invention of new model-improving algorithms to human expertise and imagination. To address this, we propose the Self-Developing framework, which enables LLMs to autonomously generate and learn model-improvement algorithms. In this framework, the seed model generates, applies, and evaluates model-improving algorithms, continuously improving both the seed model and the algorithms themselves. In mathematical reasoning tasks, Self-Developing not only creates models that surpass the seed model but also consistently outperforms models created using human-designed algorithms. Additionally, these LLM-discovered algorithms demonstrate strong effectiveness, including transferability to out-of-domain models.

Title: SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis

Authors: Aidan Wong, He Cao, Zijing Liu, Yu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15641
Pdf URL: https://arxiv.org/pdf/2410.15641
Copy Paste: [[2410.15641]] SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis(https://arxiv.org/abs/2410.15641)
Keywords: language model, llm, prompt
Abstract: The increasing integration of large language models (LLMs) across various fields has heightened concerns about their potential to propagate dangerous information. This paper specifically explores the security vulnerabilities of LLMs within the field of chemistry, particularly their capacity to provide instructions for synthesizing hazardous substances. We evaluate the effectiveness of several prompt injection attack methods, including red-teaming, explicit prompting, and implicit prompting. Additionally, we introduce a novel attack technique named SMILES-prompting, which uses the Simplified Molecular-Input Line-Entry System (SMILES) to reference chemical substances. Our findings reveal that SMILES-prompting can effectively bypass current safety mechanisms. These findings highlight the urgent need for enhanced domain-specific safeguards in LLMs to prevent misuse and improve their potential for positive social impact.

Title: Resource-Efficient Medical Report Generation using Large Language Models

Authors: Abdullah, Ameer Hamza, Seong Tae Kim
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2410.15642
Pdf URL: https://arxiv.org/pdf/2410.15642
Copy Paste: [[2410.15642]] Resource-Efficient Medical Report Generation using Large Language Models(https://arxiv.org/abs/2410.15642)
Keywords: language model, llm
Abstract: Medical report generation is the task of automatically writing radiology reports for chest X-ray images. Manually composing these reports is a time-consuming process that is also prone to human errors. Generating medical reports can therefore help reduce the burden on radiologists. In other words, we can promote greater clinical automation in the medical domain. In this work, we propose a new framework leveraging vision-enabled Large Language Models (LLMs) for the task of medical report generation. We introduce a lightweight solution that achieves better or comparable performance compared to previous solutions on the task of medical report generation. We conduct extensive experiments exploring different model sizes and enhancement approaches, such as prefix tuning, to improve the text generation abilities of the LLMs. We evaluate our approach on a prominent large-scale radiology report dataset, MIMIC-CXR. Our results demonstrate the capability of our resource-efficient framework to generate patient-specific reports with strong medical contextual understanding and high precision.

Title: Scalable Data Ablation Approximations for Language Models through Modular Training and Merging

Authors: Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.15661
Pdf URL: https://arxiv.org/pdf/2410.15661
Copy Paste: [[2410.15661]] Scalable Data Ablation Approximations for Language Models through Modular Training and Merging(https://arxiv.org/abs/2410.15661)
Keywords: language model, llm
Abstract: Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive since the full effect is seen only after training the models; this can lead practitioners to settle for sub-optimal data mixtures. We propose an efficient method for approximating data ablations which trains individual models on subsets of a training corpus and reuses them across evaluations of combinations of subsets. In continued pre-training experiments, we find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data. From this finding, we posit that researchers and practitioners can conduct inexpensive simulations of data ablations by maintaining a pool of models that were each trained on partitions of a large training corpus, and assessing candidate data mixtures by evaluating parameter averages of combinations of these models. This approach allows for substantial improvements in amortized training efficiency -- scaling only linearly with respect to new data -- by enabling reuse of previous training computation, opening new avenues for improving model performance through rigorous, incremental data assessment and mixing.
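The reuse scheme described above can be sketched as follows: keep a pool of models, each trained on one data partition, and approximate a model trained on a mixture by uniformly averaging the parameters of the corresponding pool members. This is a toy illustration in which each "model" is a dictionary of flat weight lists. The names `average_parameters`, `pool`, and `part_a` through `part_c` are hypothetical, and uniform averaging is an assumption about the weighting.

```python
from itertools import combinations

def average_parameters(models):
    """Uniformly average parameter dicts that share the same names and shapes."""
    if not models:
        raise ValueError("need at least one model")
    keys = models[0].keys()
    for m in models:
        if m.keys() != keys:
            raise ValueError("all models must have the same parameter names")
    return {
        name: [sum(vals) / len(models) for vals in zip(*(m[name] for m in models))]
        for name in keys
    }

# Toy pool: one already-trained model per data partition (values are made up).
pool = {
    "part_a": {"w": [1.0, 2.0], "b": [0.0]},
    "part_b": {"w": [3.0, 4.0], "b": [1.0]},
    "part_c": {"w": [5.0, 6.0], "b": [2.0]},
}

# Each candidate two-partition mixture reuses the pool; nothing is retrained.
proxies = {
    combo: average_parameters([pool[name] for name in combo])
    for combo in combinations(sorted(pool), 2)
}
print(proxies[("part_a", "part_b")])
```

Each proxy would then be scored on the evaluation set (for example by perplexity) in place of a model trained from scratch on that mixture.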

Title: RAC: Efficient LLM Factuality Correction with Retrieval Augmentation

Authors: Changmao Li, Jeffrey Flanigan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.15667
Pdf URL: https://arxiv.org/pdf/2410.15667
Copy Paste: [[2410.15667]] RAC: Efficient LLM Factuality Correction with Retrieval Augmentation(https://arxiv.org/abs/2410.15667)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) exhibit impressive results across a wide range of natural language processing (NLP) tasks, yet they can often produce factually incorrect outputs. This paper introduces a simple but effective low-latency post-correction method, Retrieval Augmented Correction (RAC), aimed at enhancing the factual performance of LLMs without requiring additional fine-tuning. Our method is general and can be used with any instruction-tuned LLM, and has greatly reduced latency compared to prior approaches. RAC decomposes the LLM's output into atomic facts and applies a fine-grained verification and correction process with retrieved content to verify and correct the LLM-generated output. Our extensive experiments show that RAC yields up to 30% improvements over state-of-the-art baselines across two popular factuality evaluation datasets, validating its efficacy and robustness both with and without the integration of Retrieval-Augmented Generation (RAG) across different LLMs. Our code is at this https URL

Title: Learning to Generate and Evaluate Fact-checking Explanations with Transformers

Authors: Darius Feher, Abdullah Khered, Hao Zhang, Riza Batista-Navarro, Viktor Schlegel
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2410.15669
Pdf URL: https://arxiv.org/pdf/2410.15669
Copy Paste: [[2410.15669]] Learning to Generate and Evaluate Fact-checking Explanations with Transformers(https://arxiv.org/abs/2410.15669)
Keywords: hallucination
Abstract: In an era increasingly dominated by digital platforms, the spread of misinformation poses a significant challenge, highlighting the need for solutions capable of assessing information veracity. Our research contributes to the field of Explainable Artificial Intelligence (XAI) by developing transformer-based fact-checking models that contextualise and justify their decisions by generating human-accessible explanations. Importantly, we also develop models for automatic evaluation of explanations for fact-checking verdicts across different dimensions such as (self)-contradiction, hallucination, convincingness and overall quality. By introducing human-centred evaluation methods and developing specialised datasets, we emphasise the need for aligning Artificial Intelligence (AI)-generated explanations with human judgements. This approach not only advances theoretical knowledge in XAI but also holds practical implications by enhancing the transparency, reliability and users' trust in AI-driven fact-checking systems. Furthermore, the development of our metric learning models is a first step towards potentially increasing efficiency and reducing reliance on extensive manual assessment. Based on experimental results, our best performing generative model achieved a ROUGE-1 score of 47.77, demonstrating superior performance in generating fact-checking explanations, particularly when provided with high-quality evidence. Additionally, the best performing metric learning model showed a moderately strong correlation with human judgements on objective dimensions such as (self)-contradiction and hallucination, achieving a Matthews Correlation Coefficient (MCC) of around 0.7.
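For readers unfamiliar with the Matthews Correlation Coefficient reported above, here is a minimal sketch of the standard binary-label formula, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The labels below are made up, and whether the paper uses a binary or a multi-class variant is not stated in the abstract, so the binary case is an assumption.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Binary MCC for 0/1 labels; returns 0.0 when the denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Made-up example: human judgements vs. predictions from a metric model.
human = [1, 1, 1, 0, 0, 0]
model = [1, 1, 0, 0, 0, 1]
print(matthews_corrcoef(human, model))
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect agreement).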

Title: Revealing and Mitigating the Local Pattern Shortcuts of Mamba

Authors: Wangjie You, Zecheng Tang, Juntao Li, Lili Yao, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15678
Pdf URL: https://arxiv.org/pdf/2410.15678
Copy Paste: [[2410.15678]] Revealing and Mitigating the Local Pattern Shortcuts of Mamba(https://arxiv.org/abs/2410.15678)
Keywords: language model, llm
Abstract: Large language models (LLMs) have advanced significantly due to the attention mechanism, but their quadratic complexity and linear memory demands limit their performance on long-context tasks. Recently, researchers introduced Mamba, an advanced model built upon State Space Models (SSMs) that offers linear complexity and constant memory. Although Mamba is reported to match or surpass the performance of attention-based models, our analysis reveals a performance gap: Mamba excels in tasks that involve localized key information but faces challenges with tasks that require handling distributed key information. Our controlled experiments suggest that this inconsistency arises from Mamba's reliance on local pattern shortcuts, which enable the model to remember local key information within its limited memory but hinder its ability to retain more dispersed information. Therefore, we introduce a global selection module into the Mamba model to address this issue. Experiments on both existing and proposed synthetic tasks, as well as real-world tasks, demonstrate the effectiveness of our method. Notably, with the introduction of only 4M extra parameters, our approach enables the Mamba model (130M) to achieve a significant improvement on tasks with distributed information, increasing its performance from 0 to 80.54 points.

Title: DomainSum: A Hierarchical Benchmark for Fine-Grained Domain Shift in Abstractive Text Summarization

Authors: Haohan Yuan, Haopeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15687
Pdf URL: https://arxiv.org/pdf/2410.15687
Copy Paste: [[2410.15687]] DomainSum: A Hierarchical Benchmark for Fine-Grained Domain Shift in Abstractive Text Summarization(https://arxiv.org/abs/2410.15687)
Keywords: language model, llm
Abstract: Most research on abstractive summarization focuses on single-domain applications, often neglecting how domain shifts between documents affect performance and the generalization ability of summarization models. To address this issue, we introduce DomainSum, a hierarchical benchmark designed to capture fine-grained domain shifts in abstractive summarization. We categorize these shifts into three levels: genre, style, and topic, and demonstrate through comprehensive benchmark analysis that they follow a hierarchical structure. Furthermore, we evaluate the domain generalization capabilities of commonly used pre-trained language models (PLMs) and large language models (LLMs) in in-domain and cross-domain settings.

Title: Efficient Terminology Integration for LLM-based Translation in Specialized Domains

Authors: Sejoon Kim, Mingi Sung, Jeonghwan Lee, Hyunkuk Lim, Jorge Froilan Gimenez Perez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15690
Pdf URL: https://arxiv.org/pdf/2410.15690
Copy Paste: [[2410.15690]] Efficient Terminology Integration for LLM-based Translation in Specialized Domains(https://arxiv.org/abs/2410.15690)
Keywords: llm
Abstract: Traditional machine translation methods typically involve training models directly on large parallel corpora, with limited emphasis on specialized terminology. However, in specialized fields such as patent, finance, or biomedical domains, terminology is crucial for translation, with many terms that need to be translated following agreed-upon conventions. In this paper we introduce a methodology that efficiently trains models with a smaller amount of data while preserving the accuracy of terminology translation. We achieve this through a systematic process of term extraction and glossary creation using the Trie Tree algorithm, followed by data reconstruction to teach the LLM how to integrate these specialized terms. This methodology enhances the model's ability to handle specialized terminology and ensures high-quality translations, particularly in fields where term consistency is crucial. Our approach has demonstrated exceptional performance, achieving the highest translation score among participants in the WMT patent task to date, showcasing its effectiveness and broad applicability in specialized translation domains where general methods often fall short.
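To illustrate how a trie can support glossary lookup of the kind mentioned above, here is a minimal character-level sketch that finds the longest glossary term starting at each position of a source sentence. The class names, the sample terms, and their German renderings are illustrative assumptions. The abstract does not specify how the authors build or query their trie.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.translation = None  # set only on nodes that end a glossary term

class GlossaryTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, term, translation):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.translation = translation

    def find_terms(self, text):
        """Return (start, end, term, translation) for the longest match at each position."""
        matches = []
        i = 0
        while i < len(text):
            node, j = self.root, i
            best_end, best_translation = None, None
            while j < len(text) and text[j] in node.children:
                node = node.children[text[j]]
                j += 1
                if node.translation is not None:
                    best_end, best_translation = j, node.translation
            if best_end is not None:
                matches.append((i, best_end, text[i:best_end], best_translation))
                i = best_end  # skip past the matched term
            else:
                i += 1
        return matches

# Illustrative glossary; matching is on raw characters, so substrings also match.
glossary = GlossaryTrie()
glossary.add("prior art", "Stand der Technik")
glossary.add("prior art search", "Recherche zum Stand der Technik")
glossary.add("claim", "Anspruch")

matches = glossary.find_terms("the prior art search covers each claim")
print([(term, target) for _, _, term, target in matches])
```

Preferring the longest match keeps "prior art search" from being split into "prior art" plus a leftover word, which matters when multi-word terms have their own agreed translation.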

Title: Tokenization as Finite-State Transduction

Authors: Marco Cognetta, Naoaki Okazaki
Subjects: cs.CL, cs.FL
Abstract URL: https://arxiv.org/abs/2410.15696
Pdf URL: https://arxiv.org/pdf/2410.15696
Copy Paste: [[2410.15696]] Tokenization as Finite-State Transduction(https://arxiv.org/abs/2410.15696)
Keywords: language model
Abstract: Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, fit within this framework. For BPE, this is particularly surprising given its resemblance to context-free grammar and the fact that it does not tokenize strings from left to right. An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern. Here, patterns are encoded at the character level, which creates a mismatch between the constraints and the model's subword vocabulary. While past work has focused only on constraining outputs without regard to the underlying tokenization algorithm, our framework allows for simultaneously constraining the model outputs to match a specified pattern while also adhering to the underlying tokenizer's canonical tokenization.
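As background for the MaxMatch (WordPiece) scheme named above, the following minimal sketch shows greedy left-to-right longest-match tokenization of a single word. It is the plain procedural algorithm, not the finite-state construction from the paper. The toy vocabulary and the `##` continuation prefix follow the common WordPiece convention and are assumptions for illustration.

```python
def maxmatch_tokenize(word, vocab, unk="[UNK]"):
    """Greedy left-to-right longest-match tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # WordPiece-style continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry matches at this position
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, chosen only for this example.
vocab = {"un", "aff", "##aff", "##able", "##a", "##ff"}
print(maxmatch_tokenize("unaffable", vocab))
```

Because the procedure always commits to the longest match, a character-level pattern constraint can disagree with what the tokenizer would actually produce, which is the mismatch the abstract refers to.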

Title: Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding

Authors: Derong Xu, Ziheng Zhang, Zhihong Zhu, Zhenxi Lin, Qidong Liu, Xian Wu, Tong Xu, Xiangyu Zhao, Yefeng Zheng, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15702
Pdf URL: https://arxiv.org/pdf/2410.15702
Copy Paste: [[2410.15702]] Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding(https://arxiv.org/abs/2410.15702)
Keywords: language model, llm, hallucination
Abstract: The impressive capabilities of large language models (LLMs) have attracted extensive interest in applying LLMs to the medical field. However, the complex nature of clinical environments presents significant hallucination challenges for LLMs, hindering their widespread adoption. In this paper, we address these hallucination issues in the context of Medical Information Extraction (MIE) tasks by introducing ALternate Contrastive Decoding (ALCD). We begin by redefining MIE tasks as an identify-and-classify process. We then separate the identification and classification functions of LLMs by selectively masking the optimization of tokens during fine-tuning. During the inference stage, we alternately contrast output distributions derived from sub-task models. This approach aims to selectively enhance the identification and classification capabilities while minimizing the influence of other inherent abilities in LLMs. Additionally, we propose an alternate adaptive constraint strategy to more effectively adjust the scale and scope of contrastive tokens. Through comprehensive experiments on two different backbones and six diverse medical information extraction tasks, ALCD demonstrates significant improvements in resolving hallucination issues compared to conventional decoding methods.
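The abstract does not give the ALCD scoring rule, so the sketch below shows only the generic form of contrastive decoding that such methods build on: score each candidate token by the difference between its log-probability under a main model and under a contrast model, restricted to tokens the main model finds plausible. The function name, the `plausibility` threshold, and all probabilities are illustrative assumptions, and the alternation between identification and classification sub-task models is not modelled here.

```python
import math

def contrastive_scores(logp_main, logp_contrast, plausibility=0.1):
    """Score tokens by log p_main - log p_contrast, keeping only plausible tokens.

    A token is plausible if p_main(token) >= plausibility * max p_main.
    """
    cutoff = math.log(plausibility) + max(logp_main.values())
    return {
        tok: lp - logp_contrast[tok]
        for tok, lp in logp_main.items()
        if lp >= cutoff
    }

# Made-up next-token distributions from two models over the same candidates.
p_main = {"tumor": 0.5, "lesion": 0.39, "the": 0.1, "zzz": 0.01}
p_contrast = {"tumor": 0.2, "lesion": 0.3, "the": 0.4999, "zzz": 0.0001}

scores = contrastive_scores(
    {t: math.log(p) for t, p in p_main.items()},
    {t: math.log(p) for t, p in p_contrast.items()},
)
best = max(scores, key=scores.get)
print(best, sorted(scores))
```

The plausibility filter matters: without it, the rare token "zzz" would win on ratio alone even though the main model assigns it almost no probability.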

Title: Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Authors: Quang Hieu Pham, Hoang Ngo, Anh Tuan Luu, Dat Quoc Nguyen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2410.15737
Pdf URL: https://arxiv.org/pdf/2410.15737
Copy Paste: [[2410.15737]] Who's Who: Large Language Models Meet Knowledge Conflicts in Practice(https://arxiv.org/abs/2410.15737)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine models' behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers. The WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLMs' performance in RAG settings.

Title: Learning-to-Defer for Extractive Question Answering

Authors: Montreuil Yannis, Carlier Axel, Ng Lai Xing, Ooi Wei Tsang
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2410.15761
Pdf URL: https://arxiv.org/pdf/2410.15761
Copy Paste: [[2410.15761]] Learning-to-Defer for Extractive Question Answering(https://arxiv.org/abs/2410.15761)
Keywords: language model
Abstract: Pre-trained language models have profoundly impacted the field of extractive question-answering, leveraging large-scale textual corpora to enhance contextual language understanding. Despite their success, these models struggle in complex scenarios that demand nuanced interpretation or inferential reasoning beyond immediate textual cues. Furthermore, their size poses deployment challenges on resource-constrained devices. Addressing these limitations, we introduce an adapted two-stage Learning-to-Defer mechanism that enhances decision-making by enabling selective deference to human experts or larger models without retraining language models in the context of question-answering. This approach not only maintains computational efficiency but also significantly improves model reliability and accuracy in ambiguous contexts. We establish the theoretical soundness of our methodology by proving Bayes and $(\mathcal{H}, \mathcal{R})$-consistency of our surrogate loss function, guaranteeing the optimality of the final solution. Empirical evaluations on the SQuADv2 dataset illustrate performance gains from integrating human expertise and leveraging larger models. Our results further demonstrate that deferring a minimal number of queries allows the smaller model to achieve performance comparable to its larger counterparts while preserving computing efficiency, thus broadening the applicability of pre-trained language models in diverse operational environments.

Title: Improve Dense Passage Retrieval with Entailment Tuning

Authors: Lu Dai, Hao Liu, Hui Xiong
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2410.15801
Pdf URL: https://arxiv.org/pdf/2410.15801
Copy Paste: [[2410.15801]] Improve Dense Passage Retrieval with Entailment Tuning(https://arxiv.org/abs/2410.15801)
Keywords: retrieval-augmented generation
Abstract: A retrieval module can be plugged into many downstream NLP tasks to improve their performance, such as open-domain question answering and retrieval-augmented generation. The key to a retrieval system is to calculate relevance scores for query-passage pairs. However, the definition of relevance is often ambiguous. We observed that a major class of relevance aligns with the concept of entailment in NLI tasks. Based on this observation, we designed a method called entailment tuning to improve the embedding of dense retrievers. Specifically, we unify the form of retrieval data and NLI data using existence claims as a bridge. Then, we train retrievers to predict the claims entailed in a passage with a variant task of masked prediction. Our method can be efficiently plugged into current dense retrieval methods, and experiments show the effectiveness of our method.

Title: Using GPT Models for Qualitative and Quantitative News Analytics in the 2024 US Presidental Election Process

Authors: Bohdan M. Pavlyshenko
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.15884
Pdf URL: https://arxiv.org/pdf/2410.15884
Copy Paste: [[2410.15884]] Using GPT Models for Qualitative and Quantitative News Analytics in the 2024 US Presidental Election Process(https://arxiv.org/abs/2410.15884)
Keywords: gpt, retrieval-augmented generation
Abstract: The paper considers an approach of using the Google Search API and the GPT-4o model for qualitative and quantitative analyses of news through retrieval-augmented generation (RAG). This approach was applied to analyze news about the 2024 US presidential election process. Different news sources for different time periods have been analyzed. Quantitative scores generated by the GPT model have been analyzed using Bayesian regression to derive trend lines. The distributions found for the regression parameters allow for the analysis of uncertainty in the election process. The obtained results demonstrate that by using GPT models for news analysis, one can get informative analytics and provide key insights that can be applied in further analyses of election processes.
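Summary: This paper considers an approach that uses the Google Search API and the GPT-4o model to analyze news qualitatively and quantitatively through retrieval-augmented generation (RAG). The approach was applied to news about the 2024 US presidential election process. Different news sources over different time periods were analyzed. The quantitative scores generated by the GPT model were analyzed with Bayesian regression to derive trend lines. The distributions of the regression parameters can be used to analyze uncertainty in the election process. The results show that using GPT models for news analysis yields informative analytics and key insights that can be applied in further analyses of election processes.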

Title: Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection

Authors: Koji Inoue, Divesh Lala, Gabriel Skantze, Tatsuya Kawahara
Subjects: cs.CL, cs.HC, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.15929
Pdf URL: https://arxiv.org/pdf/2410.15929
Copy Paste: [[2410.15929]] Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection(https://arxiv.org/abs/2410.15929)
Keywords: agent
Abstract: In human conversations, short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments. This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.
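Summary: In human conversation, short backchannel utterances such as "yeah" and "oh" play a crucial role in keeping dialogue smooth and engaging. These backchannels signal attentiveness and understanding without interrupting the speaker, so predicting them accurately is essential for building more natural conversational agents. This paper proposes a new method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. Whereas existing methods rely on turn-based or artificially balanced datasets, our method predicts both the timing and the type of backchannels in a continuous, frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics, then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results show that our model outperforms baseline methods on both timing and type prediction and achieves robust performance in real-time environments. This research is an important step toward more responsive, human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.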

Title: CausalGraph2LLM: Evaluating LLMs for Causal Queries

Authors: Ivaxi Sheth, Bahare Fatemi, Mario Fritz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15939
Pdf URL: https://arxiv.org/pdf/2410.15939
Copy Paste: [[2410.15939]] CausalGraph2LLM: Evaluating LLMs for Causal Queries(https://arxiv.org/abs/2410.15939)
Keywords: language model, gpt, llm
Abstract: Causality is essential in scientific research, enabling researchers to interpret true relationships between variables. These causal relationships are often represented by causal graphs, which are directed acyclic graphs. With the recent advancements in Large Language Models (LLMs), there is an increasing interest in exploring their capabilities in causal reasoning and their potential use to hypothesize causal graphs. These tasks necessitate the LLMs to encode the causal graph effectively for subsequent downstream tasks. In this paper, we propose a comprehensive benchmark, CausalGraph2LLM, encompassing a variety of causal graph settings to assess the causal graph understanding capability of LLMs. We categorize the causal queries into two types: graph-level and node-level queries. We benchmark both open-sourced and closed models for our study. Our findings reveal that while LLMs show promise in this domain, they are highly sensitive to the encoding used. Even capable models like GPT-4 and Gemini-1.5 exhibit sensitivity to encoding, with deviations of about 60%. We further demonstrate this sensitivity for downstream causal intervention tasks. Moreover, we observe that LLMs can often display biases when presented with contextual information about a causal graph, potentially stemming from their parametric memory.

Title: Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs

Authors: Yanzhu Guo, Simone Conia, Zelin Zhou, Min Li, Saloni Potdar, Henry Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.15956
Pdf URL: https://arxiv.org/pdf/2410.15956
Copy Paste: [[2410.15956]] Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs(https://arxiv.org/abs/2410.15956)
Keywords: language model, llm
Abstract: Current Large Language Models (LLMs) are predominantly designed with English as the primary language, and even the few that are multilingual tend to exhibit strong English-centric biases. Much like speakers who might produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in both vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark in French and Chinese, revealing a tendency towards English-influenced patterns. To mitigate this issue, we also propose a simple and effective alignment method to improve the naturalness of an LLM in a target language and domain, achieving consistent improvements in naturalness without compromising the performance on general-purpose benchmarks. Our work highlights the importance of developing multilingual metrics, resources and methods for the new wave of multilingual LLMs.
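Summary: Current large language models (LLMs) are designed primarily with English as the main language, and even the few multilingual models tend to show strong English-centric biases. Much like speakers who may produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received little attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics that assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark in French and Chinese, revealing a tendency toward English-influenced patterns. To mitigate this issue, we also propose a simple and effective alignment method that improves the naturalness of an LLM in a target language and domain, achieving consistent gains in naturalness without compromising performance on general-purpose benchmarks. Our work highlights the importance of developing multilingual metrics, resources, and methods for the new wave of multilingual LLMs.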

Title: Self-Explained Keywords Empower Large Language Models for Code Generation

Authors: Lishui Fan, Mouxiang Chen, Zhongxin Liu
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2410.15966
Pdf URL: https://arxiv.org/pdf/2410.15966
Copy Paste: [[2410.15966]] Self-Explained Keywords Empower Large Language Models for Code Generation(https://arxiv.org/abs/2410.15966)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved impressive performance in code generation. However, due to the long-tail distribution of LLMs' training data, low-frequency terms are typically underrepresented in the training process. Consequently, LLMs often misunderstand or overlook problem-specific, low-frequency keywords during code generation, compromising the accuracy of the generated code. To address this, we propose a novel technique named SEK (Self-Explained Keywords), which empowers an LLM for better code generation by extracting and explaining the key terms in the problem description with the LLM itself and ranking them based on frequency. Comprehensive experiments across three benchmarks, i.e., HumanEval(+), MBPP(+), and APPS, with five representative LLMs, show that SEK can significantly improve LLMs in code generation, yielding substantial and consistent gains. For instance, SEK improves the Pass@1 of DeepSeek-Coder-V2-Instruct from 85.4% to 93.3% on the HumanEval benchmark. Further analysis confirms that SEK enables the LLMs to shift their attention from low-frequency keywords to their corresponding high-frequency counterparts.
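The Pass@1 figures quoted above can be read with the help of a small sketch. The abstract does not say how Pass@1 was computed, so this is an assumption: the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The function name `pass_at_k` is illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the problem
    c: samples that passed the tests
    k: evaluation budget (k = 1 gives pass@1)
    Assumption: this is the standard unbiased estimator, not necessarily
    the exact evaluation procedure used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With one sample per problem (n = 1, k = 1), the estimate reduces to the fraction of problems whose single generated solution passes.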

Title: Large Language Models for Cross-lingual Emotion Detection

Authors: Ram Mohan Rao Kadiyala
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.15974
Pdf URL: https://arxiv.org/pdf/2410.15974
Copy Paste: [[2410.15974]] Large Language Models for Cross-lingual Emotion Detection(https://arxiv.org/abs/2410.15974)
Keywords: language model, llm
Abstract: This paper presents a detailed system description of our entry for the WASSA 2024 Task 2, focused on cross-lingual emotion detection. We utilized a combination of large language models (LLMs) and their ensembles to effectively understand and categorize emotions across different languages. Our approach not only outperformed other submissions by a large margin, but also demonstrated the strength of integrating multiple models to enhance performance. Additionally, we conducted a thorough comparison of the benefits and limitations of each model used. An error analysis is included along with suggested areas for future improvement. This paper aims to offer a clear and comprehensive understanding of advanced techniques in emotion detection, making it accessible even to those new to the field.

Title: Augmenting Legal Decision Support Systems with LLM-based NLI for Analyzing Social Media Evidence

Authors: Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Kanwal Mehreen, Subhasya Tippareddy, Ashay Srivastava
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.15990
Pdf URL: https://arxiv.org/pdf/2410.15990
Copy Paste: [[2410.15990]] Augmenting Legal Decision Support Systems with LLM-based NLI for Analyzing Social Media Evidence(https://arxiv.org/abs/2410.15990)
Keywords: llm
Abstract: This paper presents our system description and error analysis of our entry for NLLP 2024 shared task on Legal Natural Language Inference (L-NLI) \citep{hagag2024legallenssharedtask2024}. The task required classifying these relationships as entailed, contradicted, or neutral, indicating any association between the review and the complaint. Our system emerged as the winning submission, significantly outperforming other entries with a substantial margin and demonstrating the effectiveness of our approach in legal text analysis. We provide a detailed analysis of the strengths and limitations of each model and approach tested, along with a thorough error analysis and suggestions for future improvements. This paper aims to contribute to the growing field of legal NLP by offering insights into advanced techniques for natural language inference in legal contexts, making it accessible to both experts and newcomers in the field.
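Summary: This paper presents the system description and error analysis of our entry for the NLLP 2024 shared task on Legal Natural Language Inference (L-NLI) \citep{hagag2024legallenssharedtask2024}. The task required classifying these relationships as entailed, contradicted, or neutral, indicating any association between the review and the complaint. Our system was the winning submission, outperforming the other entries by a wide margin and demonstrating the effectiveness of our approach to legal text analysis. We provide a detailed analysis of the strengths and limitations of each model and approach tested, along with a thorough error analysis and suggestions for future improvement. This paper aims to contribute to the growing field of legal NLP by offering insights into advanced techniques for natural language inference in legal contexts, making it accessible to both experts and newcomers in the field.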

Title: 1024m at SMM4H 2024: Tasks 3, 5 & 6 -- Ensembles of Transformers and Large Language Models for Medical Text Classification

Authors: Ram Mohan Rao Kadiyala, M.V.P. Chandra Sekhara Rao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.15998
Pdf URL: https://arxiv.org/pdf/2410.15998
Copy Paste: [[2410.15998]] 1024m at SMM4H 2024: Tasks 3, 5 & 6 -- Ensembles of Transformers and Large Language Models for Medical Text Classification(https://arxiv.org/abs/2410.15998)
Keywords: language model
Abstract: Social media is a great source of data from users reporting information regarding their health and how various things have affected them. This paper presents various approaches using Transformers and Large Language Models and their ensembles, along with their performance, advantages, and drawbacks, for several tasks of SMM4H'24: classifying texts on the impact of nature and outdoor spaces on the author's mental health (Task 3), binary classification of tweets reporting their children's health disorders such as asthma, autism, ADHD, and speech disorder (Task 5), and binary classification of users self-reporting their age (Task 6).

Title: Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering

Authors: Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Kam-Fai Wong, Pasquale Minervini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.15999
Pdf URL: https://arxiv.org/pdf/2410.15999
Copy Paste: [[2410.15999]] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering(https://arxiv.org/abs/2410.15999)
Keywords: language model, llm
Abstract: Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context -- this phenomenon, known as context-memory knowledge conflicts, can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. Analysing the internal activations of LLMs, we find that they can internally register the signals of knowledge conflict at mid-layers. Such signals allow us to detect whether a knowledge conflict occurs and use inference-time intervention strategies to resolve it. In this work, we propose SpARE, a training-free representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. SpARE identifies the functional features that control the knowledge selection behaviours and applies them to edit the internal activations of LLMs at inference time. Our experimental results show that SpARE can effectively control the usage of either knowledge source to resolve knowledge conflict in open-domain question-answering tasks, surpassing existing representation engineering methods (+10%) as well as contrastive decoding methods (+15%).

Title: Exploring Continual Fine-Tuning for Enhancing Language Ability in Large Language Model

Authors: Divyanshu Aggarwal, Sankarshan Damle, Navin Goyal, Satya Lokam, Sunayana Sitaram
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16006
Pdf URL: https://arxiv.org/pdf/2410.16006
Copy Paste: [[2410.16006]] Exploring Continual Fine-Tuning for Enhancing Language Ability in Large Language Model(https://arxiv.org/abs/2410.16006)
Keywords: language model, llm
Abstract: A common challenge towards the adaptability of Large Language Models (LLMs) is their ability to learn new languages over time without hampering the model's performance on languages in which the model is already proficient (usually English). Continual fine-tuning (CFT) is the process of sequentially fine-tuning an LLM to enable the model to adapt to downstream tasks with varying data distributions and time shifts. This paper focuses on the language adaptability of LLMs through CFT. We study a two-phase CFT process in which an English-only end-to-end fine-tuned LLM from Phase 1 (predominantly Task Ability) is sequentially fine-tuned on a multilingual dataset -- comprising task data in new languages -- in Phase 2 (predominantly Language Ability). We observe that the "similarity" of Phase 2 tasks with Phase 1 determines the LLM's adaptability. For similar phase-wise datasets, the LLM after Phase 2 does not show deterioration in task ability. In contrast, when the phase-wise datasets are not similar, the LLM's task ability deteriorates. We test our hypothesis on the open-source \mis\ and \llm\ models with multiple phase-wise dataset pairs. To address the deterioration, we analyze tailored variants of two CFT methods: layer freezing and generative replay. Our findings demonstrate their effectiveness in enhancing the language ability of LLMs while preserving task performance, in comparison to relevant baselines.

Title: ComPO: Community Preferences for Language Model Personalization

Authors: Sachin Kumar, Chan Young Park, Yulia Tsvetkov, Noah A. Smith, Hannaneh Hajishirzi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16027
Pdf URL: https://arxiv.org/pdf/2410.16027
Copy Paste: [[2410.16027]] ComPO: Community Preferences for Language Model Personalization(https://arxiv.org/abs/2410.16027)
Keywords: language model
Abstract: Conventional algorithms for training language models (LMs) with human feedback rely on preferences that are assumed to account for an "average" user, disregarding subjectivity and finer-grained variations. Recent studies have raised concerns that aggregating such diverse and often contradictory human feedback to finetune models results in generic models that generate outputs not preferred by many user groups, as they tend to average out styles and norms. To address this issue, we draw inspiration from recommendation systems and propose ComPO, a method to personalize preference optimization in LMs by contextualizing the probability distribution of model outputs with the preference provider. Focusing on group-level preferences rather than individuals, we collect and release ComPRed, a question answering dataset with community-level preferences from Reddit. This dataset facilitates studying diversity in preferences without incurring privacy concerns associated with individual feedback. Our experiments reveal that conditioning language models on a community identifier (i.e., subreddit name) during preference tuning substantially enhances model performance. Conversely, replacing this context with random subreddit identifiers significantly diminishes performance, highlighting the effectiveness of our approach in tailoring responses to communities' preferences.

Title: TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling

Authors: Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.16033
Pdf URL: https://arxiv.org/pdf/2410.16033
Copy Paste: [[2410.16033]] TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling(https://arxiv.org/abs/2410.16033)
Keywords: language model
Abstract: Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning but presents challenges due to balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, UltraFeedback, GSM8K, HH-RLHF, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves a 65% win rate at maximum lengths of 192 and 384 tokens, outperforming standard BoN with the same computational cost. Furthermore, TreeBoN achieves around a 60% win rate across longer responses, showcasing its scalability and alignment efficacy.
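The plain Best-of-N baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of standard BoN sampling only, not of TreeBoN's tree search or its DPO-based token-level rewards. The `generate` and `reward` callables are hypothetical stand-ins for a language model sampler and a reward model.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidate responses and return the highest-reward one.

    generate(prompt) -> one sampled response (hypothetical sampler)
    reward(prompt, response) -> scalar score (hypothetical reward model)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Cost grows linearly with n: every candidate is generated in full
    # before any is discarded, which is the overhead the abstract notes.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))
```

Because every candidate is decoded to completion before scoring, pruning weak partial responses early, as the abstract describes for TreeBoN, is one way to spend the same budget more selectively.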

Title: Large Language Models Know What To Say But Not When To Speak

Authors: Muhammad Umair, Vasanth Sarathy, JP de Ruiter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16044
Pdf URL: https://arxiv.org/pdf/2410.16044
Copy Paste: [[2410.16044]] Large Language Models Know What To Say But Not When To Speak(https://arxiv.org/abs/2410.16044)
Keywords: language model, llm
Abstract: Turn-taking is a fundamental mechanism in human communication that ensures smooth and coherent verbal interactions. Recent advances in Large Language Models (LLMs) have motivated their use in improving the turn-taking capabilities of Spoken Dialogue Systems (SDS), such as their ability to respond at appropriate times. However, existing models often struggle to predict opportunities for speaking -- called Transition Relevance Places (TRPs) -- in natural, unscripted conversations, focusing only on turn-final TRPs and not within-turn TRPs. To address these limitations, we introduce a novel dataset of participant-labeled within-turn TRPs and use it to evaluate the performance of state-of-the-art LLMs in predicting opportunities for speaking. Our experiments reveal the current limitations of LLMs in modeling unscripted spoken interactions, highlighting areas for improvement and paving the way for more naturalistic dialogue systems.
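Summary: Turn-taking is a fundamental mechanism of human communication that ensures smooth and coherent verbal interaction. Recent advances in large language models (LLMs) have motivated their use in improving the turn-taking capabilities of Spoken Dialogue Systems (SDS), such as the ability to respond at appropriate times. However, existing models often struggle to predict opportunities for speaking, called Transition Relevance Places (TRPs), in natural, unscripted conversations, focusing only on turn-final TRPs and not on within-turn TRPs. To address these limitations, we introduce a new dataset of participant-labeled within-turn TRPs and use it to evaluate how well state-of-the-art LLMs predict opportunities for speaking. Our experiments reveal the current limitations of LLMs in modeling unscripted spoken interaction, highlighting areas for improvement and paving the way for more naturalistic dialogue systems.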

Title: Surprise! Uniform Information Density Isn't the Whole Story: Predicting Surprisal Contours in Long-form Discourse

Authors: Eleftheria Tsipidi, Franz Nowak, Ryan Cotterell, Ethan Wilcox, Mario Giulianelli, Alex Warstadt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16062
Pdf URL: https://arxiv.org/pdf/2410.16062
Copy Paste: [[2410.16062]] Surprise! Uniform Information Density Isn't the Whole Story: Predicting Surprisal Contours in Long-form Discourse(https://arxiv.org/abs/2410.16062)
Keywords: language model
Abstract: The Uniform Information Density (UID) hypothesis posits that speakers tend to distribute information evenly across linguistic units to achieve efficient communication. Of course, information rate in texts and discourses is not perfectly uniform. While these fluctuations can be viewed as theoretically uninteresting noise on top of a uniform target, another explanation is that UID is not the only functional pressure regulating information content in a language. Speakers may also seek to maintain interest, adhere to writing conventions, and build compelling arguments. In this paper, we propose one such functional pressure; namely that speakers modulate information rate based on location within a hierarchically-structured model of discourse. We term this the Structured Context Hypothesis and test it by predicting the surprisal contours of naturally occurring discourses extracted from large language models using predictors derived from discourse structure. We find that hierarchical predictors are significant predictors of a discourse's information contour and that deeply nested hierarchical predictors are more predictive than shallow ones. This work takes an initial step beyond UID to propose testable hypotheses for why the information rate fluctuates in predictable ways.
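
As a small illustration of the quantity this abstract refers to: the surprisal of a token is the negative log-probability a language model assigns to it in context, and a surprisal contour is the sequence of those values over a text. The sketch below assumes the per-token probabilities are already available; the values and the helper names `surprisal_contour` and `uid_variance` are hypothetical and are not taken from the paper.

```python
import math


def surprisal_contour(token_probs):
    """Return per-token surprisal in bits: -log2 p(token | context)."""
    return [-math.log2(p) for p in token_probs]


def uid_variance(contour):
    """Variance of surprisal around its mean; 0 means a perfectly uniform rate."""
    mean = sum(contour) / len(contour)
    return sum((s - mean) ** 2 for s in contour) / len(contour)


# Hypothetical token probabilities, chosen as powers of two for readability.
probs = [0.5, 0.25, 0.5, 0.125]
contour = surprisal_contour(probs)
print(contour)                # per-token surprisal in bits
print(uid_variance(contour))  # deviation from a flat information rate
```

Under a strict reading of UID the variance would be close to zero; the abstract's hypothesis is that part of the remaining variation is predictable from discourse structure.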

Title: Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context

Authors: Maggie Mi, Aline Villavicencio, Nafise Sadat Moosavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16069
Pdf URL: https://arxiv.org/pdf/2410.16069
Copy Paste: [[2410.16069]] Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context(https://arxiv.org/abs/2410.16069)
Keywords: llm
Abstract: Human processing of idioms relies on understanding the contextual sentences in which idioms occur, as well as language-intrinsic features such as frequency and speaker-intrinsic factors like familiarity. While LLMs have shown high performance on idiomaticity detection tasks, this success may be attributed to reasoning shortcuts in existing datasets. To this end, we construct a novel, controlled contrastive dataset designed to test whether LLMs can effectively use context to disambiguate idiomatic meaning. Additionally, we explore how collocational frequency and sentence probability influence model performance. Our findings reveal that LLMs often fail to resolve idiomaticity when it is required to attend to the surrounding context, and that models perform better on sentences that have higher likelihood. The collocational frequency of expressions also impacts performance. We make our code and dataset publicly available.

Title: Fine-Tuning LLMs for Reliable Medical Question-Answering Services

Authors: Ali Anaissi, Ali Braytee, Junaid Akram
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.16088
Pdf URL: https://arxiv.org/pdf/2410.16088
Copy Paste: [[2410.16088]] Fine-Tuning LLMs for Reliable Medical Question-Answering Services(https://arxiv.org/abs/2410.16088)
Keywords: language model, llm
Abstract: We present an advanced approach to medical question-answering (QA) services, using fine-tuned Large Language Models (LLMs) to improve the accuracy and reliability of healthcare information. Our study focuses on optimizing models like LLaMA-2 and Mistral, which have shown great promise in delivering precise, reliable medical answers. By leveraging comprehensive datasets, we applied fine-tuning techniques such as rsDoRA+ and ReRAG. rsDoRA+ enhances model performance through a combination of decomposed model weights, varied learning rates for low-rank matrices, and rank stabilization, leading to improved efficiency. ReRAG, which integrates retrieval on demand and question rewriting, further refines the accuracy of the responses. This approach enables healthcare providers to access fast, dependable information, aiding in more efficient decision-making and fostering greater patient trust. Our work highlights the potential of fine-tuned LLMs to significantly improve the quality and accessibility of medical information services, ultimately contributing to better healthcare outcomes for all.

Title: Analysing the Residual Stream of Language Models Under Knowledge Conflicts

Authors: Yu Zhao, Xiaotang Du, Giwon Hong, Aryo Pradipta Gema, Alessio Devoto, Hongru Wang, Xuanli He, Kam-Fai Wong, Pasquale Minervini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16090
Pdf URL: https://arxiv.org/pdf/2410.16090
Copy Paste: [[2410.16090]] Analysing the Residual Stream of Language Models Under Knowledge Conflicts(https://arxiv.org/abs/2410.16090)
Keywords: language model, llm
Abstract: Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context. Such conflicts can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. In this work, we investigate whether LLMs can identify knowledge conflicts and whether it is possible to know which source of knowledge the model will rely on by analysing the residual stream of the LLM. Through probing tasks, we find that LLMs can internally register the signal of knowledge conflict in the residual stream, which can be accurately detected by probing the intermediate model activations. This allows us to detect conflicts within the residual stream before generating the answers without modifying the input or model parameters. Moreover, we find that the residual stream shows significantly different patterns when the model relies on contextual knowledge versus parametric knowledge to resolve conflicts. This pattern can be employed to estimate the behaviour of LLMs when conflict happens and prevent unexpected answers before producing the answers. Our analysis offers insights into how LLMs internally manage knowledge conflicts and provides a foundation for developing methods to control the knowledge selection processes.

Title: Do LLMs write like humans? Variation in grammatical and rhetorical styles

Authors: Alex Reinhart, David West Brown, Ben Markey, Michael Laudenbach, Kachatad Pantusen, Ronald Yurko, Gordon Weinberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16107
Pdf URL: https://arxiv.org/pdf/2410.16107
Copy Paste: [[2410.16107]] Do LLMs write like humans? Variation in grammatical and rhetorical styles(https://arxiv.org/abs/2410.16107)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are capable of writing grammatical text that follows instructions, answers questions, and solves problems. As they have advanced, it has become difficult to distinguish their output from human-written text. While past research has found some differences in surface features such as word choice and punctuation, and developed classifiers to detect LLM output, none has studied the rhetorical styles of LLMs. Using several variants of Llama 3 and GPT-4o, we construct two parallel corpora of human- and LLM-written texts from common prompts. Using Douglas Biber's set of lexical, grammatical, and rhetorical features, we identify systematic differences between LLMs and humans and between different LLMs. These differences persist when moving from smaller models to larger ones, and are larger for instruction-tuned models than base models. This demonstrates that despite their advanced abilities, LLMs struggle to match human styles, and hence more advanced linguistic features can detect patterns in their behavior not previously recognized.

Title: A Psycholinguistic Evaluation of Language Models' Sensitivity to Argument Roles

Authors: Eun-Kyoung Rosa Lee, Sathvik Nair, Naomi Feldman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16139
Pdf URL: https://arxiv.org/pdf/2410.16139
Copy Paste: [[2410.16139]] A Psycholinguistic Evaluation of Language Models' Sensitivity to Argument Roles(https://arxiv.org/abs/2410.16139)
Keywords: language model
Abstract: We present a systematic evaluation of large language models' sensitivity to argument roles, i.e., who did what to whom, by replicating psycholinguistic studies on human argument role processing. In three experiments, we find that language models are able to distinguish verbs that appear in plausible and implausible contexts, where plausibility is determined through the relation between the verb and its preceding arguments. However, none of the models capture the same selective patterns that human comprehenders exhibit during real-time verb prediction. This indicates that language models' capacity to detect verb plausibility does not arise from the same mechanism that underlies human real-time sentence processing.

Title: 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs

Authors: Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16144
Pdf URL: https://arxiv.org/pdf/2410.16144
Copy Paste: [[2410.16144]] 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs(https://arxiv.org/abs/2410.16144)
Keywords: language model, llm
Abstract: Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM deployment across a broad range of devices. In this work, we introduce this http URL, a tailored software stack designed to unlock the full potential of 1-bit LLMs. Specifically, we develop a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments demonstrate that this http URL achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model sizes. The code is available at this https URL.
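
To make "ternary" concrete: in BitNet b1.58 every weight takes one of the values -1, 0, or +1 (about log2(3) ≈ 1.58 bits), so a dot product needs only additions and subtractions plus one final scaling. The sketch below uses an absmean-style quantizer as an assumption for illustration; it is not the CPU kernel described in the abstract, and the function names `ternarize` and `ternary_dot` are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Quantize weights to {-1, 0, +1} using the mean absolute value as scale.

    Assumption: absmean scaling followed by round-and-clip.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def ternary_dot(q, scale, x):
    """Dot product with ternary weights: only adds and subtracts, then one multiply."""
    acc = 0.0
    for qi, xi in zip(q, x):
        if qi == 1:
            acc += xi
        elif qi == -1:
            acc -= xi
        # qi == 0 contributes nothing and is skipped
    return acc * scale


# Hypothetical weights and activations.
q, scale = ternarize([0.9, -0.05, -1.2, 0.4])
print(q)
print(ternary_dot(q, scale, [1.0, 2.0, 3.0, 4.0]))
```

Because no weight multiplications are needed in the inner loop, kernels for such models can be organized around table lookups or add/subtract accumulation, which is the kind of optimization a CPU-oriented stack can exploit.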

Title: Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Authors: Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.16153
Pdf URL: https://arxiv.org/pdf/2410.16153
Copy Paste: [[2410.16153]] Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages(https://arxiv.org/abs/2410.16153)
Keywords: language model, llm
Abstract: Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the importance of English data proportions, language popularity, and the number of multimodal training samples on overall performance. We fully open-source our data, code, and trained checkpoints, to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.

Title: A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns

Authors: Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16155
Pdf URL: https://arxiv.org/pdf/2410.16155
Copy Paste: [[2410.16155]] A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns(https://arxiv.org/abs/2410.16155)
Keywords: language model, agent
Abstract: With the development of large language models, they are widely used as agents in various fields. A key component of agents is memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single-agent attacks and shared memory attacks. However, real-world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology text-based attack evaluation framework. TMCHT involves one attacker agent attempting to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) Non-complete graph structure, (2) Large-scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address these issues, we propose an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix to make poisoned samples more easily retrieved and optimizes the replication suffix to make poisoned samples have contagious ability. We demonstrate the superiority of our approach in TMCHT, with 23.51%, 18.95%, and 52.93% improvements in line topology, star topology, and 100-agent settings, respectively. We encourage the community to pay attention to the security of multi-agent systems.

Title: From Tokens to Materials: Leveraging Language Models for Scientific Discovery

Authors: Yuwei Wan, Tong Xie, Nan Wu, Wenjie Zhang, Chunyu Kit, Bram Hoex
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2410.16165
Pdf URL: https://arxiv.org/pdf/2410.16165
Copy Paste: [[2410.16165]] From Tokens to Materials: Leveraging Language Models for Scientific Discovery(https://arxiv.org/abs/2410.16165)
Keywords: language model, gpt
Abstract: Exploring the predictive capabilities of language models in material science is an ongoing interest. This study investigates the application of language model embeddings to enhance material property prediction in materials science. By evaluating various contextual embedding methods and pre-trained models, including Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT), we demonstrate that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties. Our findings reveal that information-dense embeddings from the third layer of MatBERT, combined with a context-averaging approach, offer the most effective method for capturing material-property relationships from the scientific literature. We also identify a crucial "tokenizer effect," highlighting the importance of specialized text processing techniques that preserve complete compound names while maintaining consistent token counts. These insights underscore the value of domain-specific training and tokenization in materials science applications and offer a promising pathway for accelerating the discovery and development of new materials through AI-driven approaches.

Title: Exploring Pretraining via Active Forgetting for Improving Cross Lingual Transfer for Decoder Language Models

Authors: Divyanshu Aggarwal, Ashutosh Sathe, Sunayana Sitaram
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16168
Pdf URL: https://arxiv.org/pdf/2410.16168
Copy Paste: [[2410.16168]] Exploring Pretraining via Active Forgetting for Improving Cross Lingual Transfer for Decoder Language Models(https://arxiv.org/abs/2410.16168)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities in a multitude of NLP tasks. However, the efficacy of such models on languages other than English is often limited. Prior works have shown that encoder-only models such as BERT or XLM-RoBERTa show impressive cross lingual transfer of their capabilities from English to other languages. In this work, we propose a pretraining strategy that uses active forgetting to achieve similar cross lingual transfer in decoder-only LLMs. We show that LLMs pretrained with active forgetting are highly effective when adapting to new and unseen languages. Through extensive experimentation, we find that LLMs pretrained with active forgetting are able to learn better multilingual representations, which translate to better performance in many downstream tasks.

Title: MagicPIG: LSH Sampling for Efficient LLM Generation

Authors: Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.16179
Pdf URL: https://arxiv.org/pdf/2410.16179
Copy Paste: [[2410.16179]] MagicPIG: LSH Sampling for Efficient LLM Generation(https://arxiv.org/abs/2410.16179)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) with long context windows have gained significant attention. However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various dynamic sparse or TopK-based attention approximation methods have been proposed to leverage the common insight that attention is sparse. In this paper, we first show that TopK attention itself suffers from quality degradation in certain downstream tasks because attention is not always as sparse as expected. Rather than selecting the keys and values with the highest attention scores, sampling with theoretical guarantees can provide a better estimation for attention output. To make the sampling-based approximation practical in LLM generation, we propose MagicPIG, a heterogeneous system based on Locality Sensitive Hashing (LSH). MagicPIG significantly reduces the workload of attention computation while preserving high accuracy for diverse tasks. MagicPIG stores the LSH hash tables and runs the attention computation on the CPU, which allows it to serve longer contexts and larger batch sizes with high approximation accuracy. MagicPIG can improve decoding throughput by 1.9x to 3.9x across various GPU hardware and achieve 110ms decoding latency on a single RTX 4090 for Llama-3.1-8B-Instruct model with a context of 96k tokens. The code is available at this https URL.
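
The hashing idea behind LSH-based key selection can be sketched with random-hyperplane signatures: vectors pointing in similar directions tend to share a signature, so a query only needs to look at keys in its own bucket. This is a generic illustration under that assumption and not MagicPIG's implementation; the sampling-based attention estimator is not shown, and `make_hyperplanes`, `simhash`, and `candidate_keys` are hypothetical names.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash(vec, hyperplanes):
    """Signature bit i is 1 when vec lies on the non-negative side of plane i."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def candidate_keys(query, keys, hyperplanes):
    """Indices of keys whose signature matches the query's signature."""
    q_sig = simhash(query, hyperplanes)
    return [i for i, k in enumerate(keys) if simhash(k, hyperplanes) == q_sig]


# Hypothetical query and keys: a rescaled copy of the query and its negation.
planes = make_hyperplanes(dim=4, n_bits=8)
query = [1.0, 0.5, -0.2, 0.3]
keys = [[2.0 * v for v in query], [-v for v in query]]
print(candidate_keys(query, keys, planes))
```

A key with the same direction as the query always lands in the same bucket, while the negated key falls on the opposite side of every hyperplane; real systems use several hash tables to trade recall against the number of candidates.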

Title: RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Authors: Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16184
Pdf URL: https://arxiv.org/pdf/2410.16184
Copy Paste: [[2410.16184]] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style(https://arxiv.org/abs/2410.16184)
Keywords: language model
Abstract: Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance. To this end, we introduce RM-Bench, a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases. Extensive experiments demonstrate that RM-Bench strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language models effectively. We evaluate nearly 40 reward models on RM-Bench. Our results reveal that even state-of-the-art models achieve an average performance of only 46.6%, which falls short of random-level accuracy (50%) when faced with style bias interference. These findings highlight the significant room for improvement in current reward models. Related code and data are available at this https URL.

Title: Contamination Report for Multilingual Benchmarks

Authors: Sanchit Ahuja, Varun Gumma, Sunayana Sitaram
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16186
Pdf URL: https://arxiv.org/pdf/2410.16186
Copy Paste: [[2410.16186]] Contamination Report for Multilingual Benchmarks(https://arxiv.org/abs/2410.16186)
Keywords: language model, llm
Abstract: Benchmark contamination refers to the presence of test datasets in Large Language Model (LLM) pre-training or post-training data. Contamination can lead to inflated scores on benchmarks, compromising evaluation results and making it difficult to determine the capabilities of models. In this work, we study the contamination of popular multilingual benchmarks in LLMs that support multiple languages. We use the Black Box test to determine whether 7 frequently used multilingual benchmarks are contaminated in 7 popular open and closed LLMs and find that almost all models show signs of being contaminated with almost all the benchmarks we test. Our findings can help the community determine the best set of benchmarks to use for multilingual evaluation.

Title: Information for Conversation Generation: Proposals Utilising Knowledge Graphs

Authors: Alex Clay, Ernesto Jiménez-Ruiz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.16196
Pdf URL: https://arxiv.org/pdf/2410.16196
Copy Paste: [[2410.16196]] Information for Conversation Generation: Proposals Utilising Knowledge Graphs(https://arxiv.org/abs/2410.16196)
Keywords: llm, hallucination
Abstract: LLMs are frequently used tools for conversational generation. Without additional information LLMs can generate lower quality responses due to lacking relevant content and hallucinations, as well as the perception of poor emotional capability, and an inability to maintain a consistent character. Knowledge graphs are commonly used forms of external knowledge and may provide solutions to these challenges. This paper introduces three proposals, utilizing knowledge graphs to enhance LLM generation. Firstly, dynamic knowledge graph embeddings and recommendation could allow for the integration of new information and the selection of relevant knowledge for response generation. Secondly, storing entities with emotional values as additional features may provide knowledge that is better emotionally aligned with the user input. Thirdly, integrating character information through narrative bubbles would maintain character consistency, as well as introducing a structure that would readily incorporate new information.

Title: Pre-training Distillation for Large Language Models: A Design Space Exploration

Authors: Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.16215
Pdf URL: https://arxiv.org/pdf/2410.16215
Copy Paste: [[2410.16215]] Pre-training Distillation for Large Language Models: A Design Space Exploration(https://arxiv.org/abs/2410.16215)
Keywords: language model, llm
Abstract: Knowledge distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. Previous work applying KD in the field of large language models (LLMs) typically focused on the post-training phase, where the student LLM learns directly from instructions and corresponding responses generated by the teacher model. In this paper, we extend KD to the pre-training phase of LLMs, named pre-training distillation (PD). We first conduct a preliminary experiment using GLM-4-9B as the teacher LLM to distill a 1.9B parameter student LLM, validating the effectiveness of PD. Considering the key impact factors of distillation, we systematically explore the design space of pre-training distillation across four aspects: logits processing, loss selection, scaling law, and offline or online logits. We conduct extensive experiments to explore the design space of pre-training distillation and find better configurations and interesting conclusions, such as larger student LLMs generally benefiting more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. We hope our exploration of the design space will inform future practices in pre-training distillation.
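
For readers unfamiliar with logit-based distillation, a common formulation has the student minimize the KL divergence between temperature-softened teacher and student distributions over the vocabulary. The sketch below shows that generic loss for a single token position; it is an assumption for illustration and not the specific configuration the paper arrives at, and the logits and function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened distributions for one position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over a 3-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
print(kd_kl_loss(teacher, student, temperature=2.0))
```

Choices such as the temperature, whether teacher logits are truncated before the softmax, and how this term is mixed with the ordinary language-modeling loss are the kind of design-space questions the abstract groups under logits processing and loss selection.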

Title: On Creating an English-Thai Code-switched Machine Translation in Medical Domain

Authors: Parinthapat Pengpun, Krittamate Tiankanon, Amrest Chinkamol, Jiramet Kinchagawat, Pitchaya Chairuengjitjaras, Pasit Supholkhan, Pubordee Aussavavirojekul, Chiraphat Boonnag, Kanyakorn Veerakanjana, Hirunkul Phimsiri, Boonthicha Sae-jia, Nattawach Sataudom, Piyalitt Ittichaiwong, Peerat Limkonchotiwat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16221
Pdf URL: https://arxiv.org/pdf/2410.16221
Copy Paste: [[2410.16221]] On Creating an English-Thai Code-switched Machine Translation in Medical Domain(https://arxiv.org/abs/2410.16221)
Keywords: gpt
Abstract: Machine translation (MT) in the medical domain plays a pivotal role in enhancing healthcare quality and disseminating medical knowledge. Despite advancements in English-Thai MT technology, common MT approaches often underperform in the medical field due to their inability to precisely translate medical terminologies. Our research prioritizes not merely improving translation accuracy but also maintaining medical terminology in English within the translated text through code-switched (CS) translation. We developed a method to produce CS medical translation data, fine-tuned a CS translation model with this data, and evaluated its performance against strong baselines, such as Google Neural Machine Translation (NMT) and GPT-3.5/GPT-4. Our model demonstrated competitive performance in automatic metrics and was highly favored in human preference evaluations. Our evaluation result also shows that medical professionals significantly prefer CS translations that maintain critical English terms accurately, even if it slightly compromises fluency. Our code and test set are publicly available this https URL.

Title: Building A Coding Assistant via the Retrieval-Augmented Language Model

Authors: Xinze Li, Hanbin Wang, Zhenghao Liu, Shi Yu, Shuo Wang, Shuo Wang, Yukun Yan, Yukai Fu, Yu Gu, Ge Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16229
Pdf URL: https://arxiv.org/pdf/2410.16229
Copy Paste: [[2410.16229]] Building A Coding Assistant via the Retrieval-Augmented Language Model(https://arxiv.org/abs/2410.16229)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Pretrained language models have shown strong effectiveness in code-related tasks, such as code retrieval, code generation, code summarization, and code completion tasks. In this paper, we propose COde assistaNt viA retrieval-augmeNted language model (CONAN), which aims to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding. Specifically, it consists of a code structure aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G). CONAN-R pretrains CodeT5 using Code-Documentation Alignment and Masked Entity Prediction tasks to make language models code structure-aware and learn effective representations for code snippets and documentation. Then CONAN-G designs a dual-view code representation mechanism for implementing a retrieval-augmented code generation model. CONAN-G regards the code documentation descriptions as prompts, which help language models better understand the code semantics. Our experiments show that CONAN achieves convincing performance on different code generation tasks and significantly outperforms previous retrieval augmented code generation models. Our further analyses show that CONAN learns tailored representations for both code snippets and documentation by aligning code-documentation data pairs and capturing structural semantics by masking and predicting entities in the code data. Additionally, the retrieved code snippets and documentation provide necessary information from both program language and natural language to assist the code generation process. CONAN can also be used as an assistant for Large Language Models (LLMs), providing LLMs with external knowledge in shorter code document lengths to improve their effectiveness on various code tasks. It shows the ability of CONAN to extract necessary information and help filter out the noise from retrieved code documents.

Title: Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping

Authors: Ryan Li, Yanzhe Zhang, Diyi Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.16232
Pdf URL: https://arxiv.org/pdf/2410.16232
Copy Paste: [[2410.16232]] Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping(https://arxiv.org/abs/2410.16232)
Keywords: language model, agent
Abstract: Sketches are a natural and accessible medium for UI designers to conceptualize early-stage ideas. However, existing research on UI/UX automation often requires high-fidelity inputs like Figma designs or detailed screenshots, limiting accessibility and impeding efficient design iteration. To bridge this gap, we introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes. Beyond end-to-end benchmarking, Sketch2Code supports interactive agent evaluation that mimics real-world design workflows, where a VLM-based agent iteratively refines its generations by communicating with a simulated user, either passively receiving feedback instructions or proactively asking clarification questions. We comprehensively analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs; even the most capable models struggle to accurately interpret sketches and formulate effective questions that lead to steady improvement. Nevertheless, a user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception, highlighting the need to develop more effective paradigms for multi-turn conversational agents.

Title: ToW: Thoughts of Words Improve Reasoning in Large Language Models

Authors: Zhikun Xu, Ming Shen, Jacob Dineen, Zhaonan Li, Xiao Ye, Shijie Lu, Aswin RRV, Chitta Baral, Ben Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16235
Pdf URL: https://arxiv.org/pdf/2410.16235
Copy Paste: [[2410.16235]] ToW: Thoughts of Words Improve Reasoning in Large Language Models(https://arxiv.org/abs/2410.16235)
Keywords: language model, hallucination
Abstract: We introduce thoughts of words (ToW), a novel training-time data-augmentation method for next-word prediction. ToW views next-word prediction as a core reasoning task and injects fine-grained thoughts explaining what the next word should be and how it is related to the previous contexts in pre-training texts. Our formulation addresses two fundamental drawbacks of existing next-word prediction learning schemes: they induce factual hallucination and are inefficient for models to learn the implicit reasoning processes in raw texts. While there are many ways to acquire such thoughts of words, we explore the first step of acquiring ToW annotations through distilling from larger models. After continual pre-training with only 70K ToW annotations, we effectively improve models' reasoning performances by 7% to 9% on average and reduce model hallucination by up to 10%. At the same time, ToW is entirely agnostic to tasks and applications, introducing no additional biases on labels or semantics.

Title: Analyzing Context Contributions in LLM-based Machine Translation

Authors: Emmanouil Zaranis, Nuno M. Guerreiro, André F. T. Martins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16246
Pdf URL: https://arxiv.org/pdf/2410.16246
Copy Paste: [[2410.16246]] Analyzing Context Contributions in LLM-based Machine Translation(https://arxiv.org/abs/2410.16246)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have achieved state-of-the-art performance in machine translation (MT) and demonstrated the ability to leverage in-context learning through few-shot examples. However, the mechanisms by which LLMs use different parts of the input context remain largely unexplored. In this work, we provide a comprehensive analysis of context utilization in MT, studying how LLMs use various context parts, such as few-shot examples and the source text, when generating translations. We highlight several key findings: (1) the source part of few-shot examples appears to contribute more than its corresponding targets, irrespective of translation direction; (2) finetuning LLMs with parallel data alters the contribution patterns of different context parts; and (3) there is a positional bias where earlier few-shot examples have higher contributions to the translated sequence. Finally, we demonstrate that inspecting anomalous context contributions can potentially uncover pathological translations, such as hallucinations. Our findings shed light on the internal workings of LLM-based MT which go beyond those known for standard encoder-decoder MT models.

Title: Can Knowledge Editing Really Correct Hallucinations?

Authors: Baixiang Huang, Canyu Chen, Xiongxiao Xu, Ali Payani, Kai Shu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.16251
Pdf URL: https://arxiv.org/pdf/2410.16251
Copy Paste: [[2410.16251]] Can Knowledge Editing Really Correct Hallucinations?(https://arxiv.org/abs/2410.16251)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) suffer from hallucinations, referring to the non-factual information in generated content, despite their superior capacities across tasks. Meanwhile, knowledge editing has been developed as a new popular paradigm to correct the erroneous factual knowledge encoded in LLMs with the advantage of avoiding retraining from scratch. However, one common issue of existing evaluation datasets for knowledge editing is that they do not ensure LLMs actually generate hallucinated answers to the evaluation questions before editing. When LLMs are evaluated on such datasets after being edited by different techniques, it is hard to directly adopt the performance to assess the effectiveness of different knowledge editing methods in correcting hallucinations. Thus, the fundamental question remains insufficiently validated: Can knowledge editing really correct hallucinations in LLMs? We proposed HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations. First, we rigorously construct a massive hallucination dataset with 9 domains, 26 topics and more than 6,000 hallucinations. Then, we assess the performance of knowledge editing methods in a holistic way on five dimensions including Efficacy, Generalization, Portability, Locality, and Robustness. Through HalluEditBench, we have provided new insights into the potentials and limitations of different knowledge editing methods in correcting hallucinations, which could inspire future improvements and facilitate the progress in the field of knowledge editing.

Title: CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

Authors: Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.16256
Pdf URL: https://arxiv.org/pdf/2410.16256
Copy Paste: [[2410.16256]] CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution(https://arxiv.org/abs/2410.16256)
Keywords: language model, llm
Abstract: Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.