2024-10-08

Title: Revisiting the Superficial Alignment Hypothesis

Authors: Mohit Raghavendra, Vaskar Nath, Sean Hendryx
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03717
Pdf URL: https://arxiv.org/pdf/2410.03717
Copy Paste: [[2410.03717]] Revisiting the Superficial Alignment Hypothesis(https://arxiv.org/abs/2410.03717)
Keywords: language model
Abstract: The Superficial Alignment Hypothesis posits that almost all of a language model's abilities and knowledge are learned during pre-training, while post-training is about giving a model the right style and format. We re-examine these claims by empirically studying the scaling behavior of post-training with increasing finetuning examples and evaluating them using objective task-specific standardized benchmarks. Through experiments with the Llama-3, Mistral, and Llama-2 model families of multiple sizes, we observe that, similar to the pre-training scaling laws, post-training task performance scales as a power law against the number of finetuning examples. This power law relationship holds across a broad array of capabilities, including mathematical reasoning, coding, instruction following, and multihop reasoning. In addition, for tasks like math and multihop reasoning, we observe that a handful of examples merely align the model stylistically but do not saturate performance on the benchmarks. Model performance is instead correlated with its reasoning ability, and it improves significantly with more examples, illustrating the need for holistic evaluation programs leveraging objective benchmarks in addition to measurement of alignment to human preferences. We also observe that language models are not necessarily limited to using knowledge learned during pre-training. With appropriate post-training, a model's ability to integrate new knowledge greatly improves on downstream tasks like multihop question-answering. Taken together, these results shed new light on the Superficial Alignment Hypothesis, suggesting that it is, at best, an over-simplification.
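
The power-law claim in the abstract above can be illustrated with a small fit. This is a minimal sketch, not the authors' code: it assumes the benchmark error rate follows error = a * n^(-b) in the number of finetuning examples n, and the data points are synthetic values generated for illustration, not results from the paper.

```python
import math

def fit_power_law(n_examples, errors):
    """Fit error = a * n**(-b) by least squares in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in n_examples]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of log(error) vs log(n)
    intercept = mean_y - slope * mean_x    # equals log(a)
    return math.exp(intercept), -slope

# Synthetic (hypothetical) data generated from a known power law.
sizes = [100, 1_000, 10_000, 100_000]
errs = [0.8 * n ** -0.25 for n in sizes]
a, b = fit_power_law(sizes, errs)
print(f"a={a:.3f}, b={b:.3f}")
```

On real measurements the points would be noisy, so the fitted exponent b is only an estimate; a straight line in log-log space is the usual visual check for power-law behavior.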

Title: Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language

Authors: Sagar Tamang, Dibya Jyoti Bora
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03718
Pdf URL: https://arxiv.org/pdf/2410.03718
Copy Paste: [[2410.03718]] Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language(https://arxiv.org/abs/2410.03718)
Keywords: language model, gpt, llm
Abstract: Training of a tokenizer plays an important role in the performance of deep learning models. This research aims to understand the performance of tokenizers in five state-of-the-art (SOTA) large language models (LLMs) in the Assamese language of India. The research is important to understand the multi-lingual support for a low-resourced language such as Assamese. Our research reveals that the tokenizer of SUTRA from Two AI performs the best with an average Normalized Sequence Length (NSL) value of 0.45, closely followed by the tokenizer of GPT-4o from OpenAI with an average NSL value of 0.54, followed by Gemma 2, Meta Llama 3.1, and Mistral Large Instruct 2407 with average NSL values of 0.82, 1.4, and 1.48, respectively.
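
The abstract above reports Normalized Sequence Length (NSL) without defining it. The sketch below assumes the common definition, in which NSL is the ratio of a tokenizer's token count to a baseline tokenizer's token count on the same text, averaged over a corpus; the paper's exact baseline and averaging may differ. The two toy tokenizers and the sample texts are hypothetical stand-ins, not the models evaluated in the paper.

```python
def normalized_sequence_length(tokenize, baseline_tokenize, texts):
    """Average ratio of token counts: candidate tokenizer vs. baseline tokenizer."""
    ratios = [len(tokenize(t)) / len(baseline_tokenize(t)) for t in texts]
    return sum(ratios) / len(ratios)

# Hypothetical toy tokenizers: whitespace words vs. individual characters.
word_tokenizer = str.split
char_tokenizer = list

texts = ["a bc", "de f g"]
nsl = normalized_sequence_length(word_tokenizer, char_tokenizer, texts)
print(nsl)
```

Under this definition a lower value means the tokenizer encodes the text more compactly than the baseline, and a value above 1 means it needs more tokens than the baseline.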

Title: Thematic Analysis with Open-Source Generative AI and Machine Learning: A New Method for Inductive Qualitative Codebook Development

Authors: Andrew Katz, Gabriella Coloyan Fleming, Joyce Main
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2410.03721
Pdf URL: https://arxiv.org/pdf/2410.03721
Copy Paste: [[2410.03721]] Thematic Analysis with Open-Source Generative AI and Machine Learning: A New Method for Inductive Qualitative Codebook Development(https://arxiv.org/abs/2410.03721)
Keywords: prompt, retrieval-augmented generation
Abstract: This paper aims to answer one central question: to what extent can open-source generative text models be used in a workflow to approximate thematic analysis in social science research? To answer this question, we present the Generative AI-enabled Theme Organization and Structuring (GATOS) workflow, which uses open-source machine learning techniques, natural language processing tools, and generative text models to facilitate thematic analysis. To establish validity of the method, we present three case studies applying the GATOS workflow, leveraging these models and techniques to inductively create codebooks similar to traditional procedures using thematic analysis. Specifically, we investigate the extent to which a workflow comprising open-source models and tools can inductively produce codebooks that approach the known space of themes and sub-themes. To address the challenge of gleaning insights from these texts, we combine open-source generative text models, retrieval-augmented generation, and prompt engineering to identify codes and themes in large volumes of text, i.e., generate a qualitative codebook. The process mimics an inductive coding process that researchers might use in traditional thematic analysis by reading text one unit of analysis at a time, considering existing codes already in the codebook, and then deciding whether or not to generate a new code based on whether the extant codebook provides adequate thematic coverage. We demonstrate this workflow using three synthetic datasets from hypothetical organizational research settings: a study of teammate feedback in teamwork settings, a study of organizational cultures of ethical behavior, and a study of employee perspectives about returning to their offices after the pandemic. We show that the GATOS workflow is able to identify themes in the text that were used to generate the original synthetic datasets.

Title: Realtime, multimodal invasive ventilation risk monitoring using language models and BoXHED

Authors: Arash Pakbin, Aaron Su, Donald K.K. Lee, Bobak J. Mortazavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03725
Pdf URL: https://arxiv.org/pdf/2410.03725
Copy Paste: [[2410.03725]] Realtime, multimodal invasive ventilation risk monitoring using language models and BoXHED(https://arxiv.org/abs/2410.03725)
Keywords: language model, prompt
Abstract: Objective: Realtime monitoring of invasive ventilation (iV) in intensive care units (ICUs) plays a crucial role in ensuring prompt interventions and better patient outcomes. However, conventional methods often overlook valuable insights embedded within clinical notes, relying solely on tabular data. In this study, we propose an innovative approach to enhance iV risk monitoring by incorporating clinical notes into the monitoring pipeline, using language models for text summarization. Results: We achieve superior performance in all metrics reported by the state-of-the-art in iV risk monitoring, namely: an AUROC of 0.86, an AUC-PR of 0.35, and an AUCt of up to 0.86. We also demonstrate that our methodology allows for more lead time in flagging iV for certain time buckets. Conclusion: Our study underscores the potential of integrating clinical notes and language models into realtime iV risk monitoring, paving the way for improved patient care and informed clinical decision-making in ICU settings.

Title: Neurosymbolic AI approach to Attribution in Large Language Models

Authors: Deepa Tilwani, Revathy Venkataramanan, Amit P. Sheth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03726
Pdf URL: https://arxiv.org/pdf/2410.03726
Copy Paste: [[2410.03726]] Neurosymbolic AI approach to Attribution in Large Language Models(https://arxiv.org/abs/2410.03726)
Keywords: language model, llm, hallucination
Abstract: Attribution in large language models (LLMs) remains a significant challenge, particularly in ensuring the factual accuracy and reliability of the generated outputs. Current methods for citation or attribution, such as those employed by tools like this http URL and Bing Search-integrated LLMs, attempt to ground responses by providing real-time search results and citations. However, so far, these approaches suffer from issues such as hallucinations, biases, surface-level relevance matching, and the complexity of managing vast, unfiltered knowledge sources. While tools like this http URL dynamically integrate web-based information and citations, they often rely on inconsistent or unreliable sources such as blog posts, which limits their overall reliability. We argue that these challenges can be mitigated by integrating Neurosymbolic AI (NesyAI), which combines the strengths of neural networks with structured symbolic reasoning. NesyAI offers transparent, interpretable, and dynamic reasoning processes, addressing the limitations of current attribution methods by incorporating structured symbolic knowledge with flexible, neural-based learning. This paper explores how NesyAI frameworks can enhance existing attribution models, offering more reliable, interpretable, and adaptable systems for LLMs.

Title: FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"

Authors: Yifei Ming, Senthil Purushwalkam, Shrey Pandit, Zixuan Ke, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03727
Pdf URL: https://arxiv.org/pdf/2410.03727
Copy Paste: [[2410.03727]] FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"(https://arxiv.org/abs/2410.03727)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Ensuring faithfulness to context in large language models (LLMs) and retrieval-augmented generation (RAG) systems is crucial for reliable deployment in real-world applications, as incorrect or unsupported information can erode user trust. Despite advancements on standard benchmarks, faithfulness hallucination, where models generate responses misaligned with the provided context, remains a significant challenge. In this work, we introduce FaithEval, a novel and comprehensive benchmark tailored to evaluate the faithfulness of LLMs in contextual scenarios across three diverse tasks: unanswerable, inconsistent, and counterfactual contexts. These tasks simulate real-world challenges where retrieval mechanisms may surface incomplete, contradictory, or fabricated information. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework, employing both LLM-based auto-evaluation and human validation. Our extensive study across a wide range of open-source and proprietary models reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness. Project is available at: this https URL.

Title: Progress Report: Towards European LLMs

Authors: Mehdi Ali, Michael Fromm, Klaudia Thellmann, Jan Ebert, Alexander Arno Weber, Richard Rutmann, Charvi Jain, Max Lübbering, Daniel Steinigen, Johannes Leveling, Katrin Klug, Jasper Schulze Buschhoff, Lena Jurkschat, Hammam Abdelwahab, Benny Jörg Stein, Karl-Heinz Sylla, Pavel Denisov, Nicolo Brandizzi, Qasid Saleem, Bhowmick Anirban, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Alex Jude, Lalith Manjunath, Samuel Weinbach, Carolin Penke, Shima Asaadi, Fabio Barth, Rafet Sifa, Fabian Küch, René Jäkel, Georg Rehm, Stefan Kesselheim, Joachim Köhler, Nicolas Flores-Herr
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03730
Pdf URL: https://arxiv.org/pdf/2410.03730
Copy Paste: [[2410.03730]] Progress Report: Towards European LLMs(https://arxiv.org/abs/2410.03730)
Keywords: gpt, llm
Abstract: We present preliminary results of the project OpenGPT-X. At present, the project has developed two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, data processing techniques, tokenizer optimization, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.

Title: Unsupervised Human Preference Learning

Authors: Sumuk Shashidhar, Abhinav Chinta, Vaibhav Sahai, Dilek Hakkani Tur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03731
Pdf URL: https://arxiv.org/pdf/2410.03731
Copy Paste: [[2410.03731]] Unsupervised Human Preference Learning(https://arxiv.org/abs/2410.03731)
Keywords: language model, agent
Abstract: Large language models demonstrate impressive reasoning abilities but struggle to provide personalized content due to their lack of individual user preference information. Existing methods, such as in-context learning and parameter-efficient fine-tuning, fall short in capturing the complexity of human preferences, especially given the small, personal datasets individuals possess. In this paper, we propose a novel approach utilizing small parameter models as preference agents to generate natural language rules that guide a larger, pre-trained model, enabling efficient personalization. Our method involves a small, local "steering wheel" model that directs the outputs of a much larger foundation model, producing content tailored to an individual's preferences while leveraging the extensive knowledge and capabilities of the large model. Importantly, this personalization is achieved without the need to fine-tune the large model. Experimental results on email and article datasets demonstrate that our technique significantly outperforms baseline personalization methods. By allowing foundation models to adapt to individual preferences in a data- and compute-efficient manner, our approach paves the way for highly personalized language model applications.

Title: Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling

Authors: David Grangier, Simin Fan, Skyler Seto, Pierre Ablin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03735
Pdf URL: https://arxiv.org/pdf/2410.03735
Copy Paste: [[2410.03735]] Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling(https://arxiv.org/abs/2410.03735)
Keywords: language model
Abstract: Specialist language models (LMs) focus on a specific task or domain on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is only available in limited amounts for most tasks. In this work, we build specialist models from large generalist training sets instead. We adjust the training distribution of the generalist data with guidance from the limited domain-specific data. We explore several approaches, with clustered importance sampling standing out. This method clusters the generalist dataset and samples from these clusters based on their frequencies in the smaller specialist dataset. It is scalable, suitable for pretraining and continued pretraining, and works well in multi-task settings. Our findings demonstrate improvements across different domains in terms of language modeling perplexity and accuracy on multiple-choice question tasks. We also present ablation studies that examine the impact of dataset sizes, clustering configurations, and model sizes.
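
The clustered importance sampling idea described in the abstract above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes examples are already embedded as numeric vectors and that cluster centroids have already been computed on the generalist set (for example by k-means). All function names and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def nearest(centroids, x):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(centroids[k], x)))

def cluster_weights(centroids, specialist):
    """Fraction of specialist examples that fall into each cluster."""
    counts = Counter(nearest(centroids, x) for x in specialist)
    total = sum(counts.values())
    return [counts[k] / total for k in range(len(centroids))]

def sample_generalist(centroids, generalist, specialist, n, seed=0):
    """Draw n generalist examples, picking clusters by specialist frequency."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for x in generalist:
        buckets[nearest(centroids, x)].append(x)
    weights = cluster_weights(centroids, specialist)
    # Only clusters that have generalist members and nonzero weight can be drawn.
    usable = [k for k in range(len(centroids)) if buckets[k] and weights[k] > 0]
    picks = rng.choices(usable, weights=[weights[k] for k in usable], k=n)
    return [rng.choice(buckets[k]) for k in picks]

# Toy 2-D "embeddings" (hypothetical data).
centroids = [(0.0, 0.0), (10.0, 10.0)]
generalist = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 9.0)]
specialist = [(9.0, 9.0), (10.0, 10.0), (11.0, 10.0)]
weights = cluster_weights(centroids, specialist)
sample = sample_generalist(centroids, generalist, specialist, n=5)
```

Here every specialist point lies near the second centroid, so the resampled generalist data is drawn only from that cluster; with a more mixed specialist set the draw would follow the specialist histogram over clusters.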

Title: ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation

Authors: Fillipe dos Santos Silva, Gabriel Kenzo Kakimoto, Julio Cesar dos Reis, Marcelo S. Reis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03738
Pdf URL: https://arxiv.org/pdf/2410.03738
Copy Paste: [[2410.03738]] ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation(https://arxiv.org/abs/2410.03738)
Keywords: language model
Abstract: Cluster analysis plays a crucial role in various domains and applications, such as customer segmentation in marketing. These contexts often involve multimodal data, including both tabular and textual datasets, making it challenging to represent hidden patterns for obtaining meaningful clusters. This study introduces ERASMO, a framework designed to fine-tune a pretrained language model on textually encoded tabular data and generate embeddings from the fine-tuned model. ERASMO employs a textual converter to transform tabular data into a textual format, enabling the language model to process and understand the data more effectively. Additionally, ERASMO produces contextually rich and structurally representative embeddings through techniques such as random feature sequence shuffling and number verbalization. Extensive experimental evaluations were conducted using multiple datasets and baseline approaches. Our results demonstrate that ERASMO fully leverages the specific context of each tabular dataset, leading to more precise and nuanced embeddings for accurate clustering. This approach enhances clustering performance by capturing complex relationship patterns within diverse tabular data.

Title: Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model

Authors: Aidan Gilson, Xuguang Ai, Qianqian Xie, Sahana Srinivasan, Krithi Pushpanathan, Maxwell B. Singer, Jimin Huang, Hyunjae Kim, Erping Long, Peixing Wan, Luciano V. Del Priore, Lucila Ohno-Machado, Hua Xu, Dianbo Liu, Ron A. Adelman, Yih-Chung Tham, Qingyu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03740
Pdf URL: https://arxiv.org/pdf/2410.03740
Copy Paste: [[2410.03740]] Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model(https://arxiv.org/abs/2410.03740)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are poised to revolutionize healthcare. Ophthalmology-specific LLMs remain scarce and underexplored. We introduced an open-source, specialized LLM for ophthalmology, termed Language Enhanced Model for Eye (LEME). LEME was initially pre-trained on the Llama2 70B framework and further fine-tuned with a corpus of ~127,000 non-copyrighted training instances curated from ophthalmology-specific case reports, abstracts, and open-source study materials. We benchmarked LEME against eight other LLMs, namely, GPT-3.5, GPT-4, three Llama2 models (7B, 13B, 70B), PMC-LLAMA 13B, Meditron 70B, and EYE-Llama (another ophthalmology-specific LLM). Evaluations included four internal validation tasks: abstract completion, fill-in-the-blank, multiple-choice questions (MCQ), and short-answer QA. External validation tasks encompassed long-form QA, MCQ, patient EHR summarization, and clinical QA. Evaluation metrics included Rouge-L scores, accuracy, and expert evaluation of correctness, completeness, and readability. In internal validations, LEME consistently outperformed its counterparts, achieving Rouge-L scores of 0.20 in abstract completion (all p<0.05), 0.82 in fill-in-the-blank (all p<0.0001), and 0.22 in short-answer QA (all p<0.0001, except versus GPT-4). In external validations, LEME excelled in long-form QA with a Rouge-L of 0.19 (all p<0.0001), ranked second in MCQ accuracy (0.68; all p<0.0001), and scored highest in EHR summarization and clinical QA (ranging from 4.24 to 4.83 out of 5 for correctness, completeness, and readability). LEME's emphasis on robust fine-tuning and the use of non-copyrighted data represents a breakthrough in open-source ophthalmology-specific LLMs, offering the potential to revolutionize execution of clinical tasks while democratizing research collaboration.

Title: Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

Authors: Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, Yiqun Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03742
Pdf URL: https://arxiv.org/pdf/2410.03742
Copy Paste: [[2410.03742]] Beyond Scalar Reward Model: Learning Generative Judge from Preference Data(https://arxiv.org/abs/2410.03742)
Keywords: language model, llm, prompt
Abstract: Learning from preference feedback is a common practice for aligning large language models (LLMs) with human values. Conventionally, preference data is learned and encoded into a scalar reward model that connects a value head with an LLM to produce a scalar score as preference or reward. However, scalar models lack interpretability and are known to be susceptible to biases in datasets. This paper investigates leveraging the generation capability of LLMs to address both limitations in one shot. Specifically, we prompt the pre-trained LLM to generate positive and negative judgments, both supported with rationales in natural language form. The self-generated contrastive judgment pairs are used to train the generative judge with Direct Preference Optimization (DPO). This proposal of training the generative Judge using self-generated Contrastive judgments (Con-J) ensures natural interpretability due to the generated rationales together with the judgments, as well as high robustness against bias without the need for an additional reward head. Experimental results show that the performance of Con-J is comparable to the scalar reward model trained on the same collection of preference data, and demonstrate its superior interpretability and robustness in encoding human preferences.
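
The abstract above trains its generative judge with Direct Preference Optimization (DPO). As a reminder of what that objective computes, here is a minimal sketch of the standard per-pair DPO loss. The function name, argument names, and example log-probabilities are illustrative and not taken from the paper; in Con-J the "chosen" and "rejected" sequences would be the self-generated judgment pairs.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    Computes -log(sigmoid(beta * margin)), where margin is the policy's
    log-ratio advantage for the chosen response over the rejected one,
    measured relative to a frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(z)) == log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin))

# Hypothetical log-probabilities for one preference pair.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)   # policy identical to reference
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)  # policy now favors the chosen response
```

When the policy matches the reference the margin is zero and the loss equals log 2; the loss falls as the policy assigns relatively more probability to the chosen judgment.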

Title: Mitigating Training Imbalance in LLM Fine-Tuning via Selective Parameter Merging

Authors: Yiming Ju, Ziyi Ni, Xingrun Xing, Zhixiong Zeng, Hanyu Zhao, Siqi Fan, Zheng Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03743
Pdf URL: https://arxiv.org/pdf/2410.03743
Copy Paste: [[2410.03743]] Mitigating Training Imbalance in LLM Fine-Tuning via Selective Parameter Merging(https://arxiv.org/abs/2410.03743)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) is crucial for adapting Large Language Models (LLMs) to specific tasks. In this work, we demonstrate that the order of training data can lead to significant training imbalances, potentially resulting in performance degradation. Consequently, we propose to mitigate this imbalance by merging SFT models fine-tuned with different data orders, thereby enhancing the overall effectiveness of SFT. Additionally, we introduce a novel technique, "parameter-selection merging," which outperforms traditional weighted-average methods on five datasets. Further, through analysis and ablation studies, we validate the effectiveness of our method and identify the sources of performance improvements.
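
The abstract above contrasts "parameter-selection merging" with weighted averaging but does not define the selection rule. The sketch below shows one possible reading, labeled here as an assumption: for each parameter position, the merged model copies the value from one of the fine-tuned models chosen at random, rather than averaging across them. Models are represented as flat lists of floats, and all names and values are hypothetical.

```python
import random

def weighted_average_merge(models, weights):
    """Baseline: each merged parameter is a weighted mean across models."""
    return [sum(w * m[i] for w, m in zip(weights, models))
            for i in range(len(models[0]))]

def parameter_selection_merge(models, seed=0):
    """Assumed variant: each merged parameter is copied from one randomly
    selected model, so no parameter value is blended."""
    rng = random.Random(seed)
    return [rng.choice(models)[i] for i in range(len(models[0]))]

# Two hypothetical SFT runs trained with different data orders.
models = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
averaged = weighted_average_merge(models, [0.5, 0.5])
selected = parameter_selection_merge(models)
```

The difference to note is that averaging produces values that appear in neither source model, while selection keeps every parameter at a value that some fine-tuned model actually learned.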

Title: Khattat: Enhancing Readability and Concept Representation of Semantic Typography

Authors: Ahmed Hussein, Alaa Elsetohy, Sama Hadhoud, Tameem Bakr, Yasser Rohaim, Badr AlKhamissi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03748
Pdf URL: https://arxiv.org/pdf/2410.03748
Copy Paste: [[2410.03748]] Khattat: Enhancing Readability and Concept Representation of Semantic Typography(https://arxiv.org/abs/2410.03748)
Keywords: language model, llm
Abstract: Designing expressive typography that visually conveys a word's meaning while maintaining readability is a complex task, known as semantic typography. It involves selecting an idea, choosing an appropriate font, and balancing creativity with legibility. We introduce an end-to-end system that automates this process. First, a Large Language Model (LLM) generates imagery ideas for the word, useful for abstract concepts like freedom. Then, the FontCLIP pre-trained model automatically selects a suitable font based on its semantic understanding of font attributes. The system identifies optimal regions of the word for morphing and iteratively transforms them using a pre-trained diffusion model. A key feature is our OCR-based loss function, which enhances readability and enables simultaneous stylization of multiple characters. We compare our method with other baselines, demonstrating great readability enhancement and versatility across multiple languages and writing scripts.

Title: Recent Advances in Speech Language Models: A Survey

Authors: Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.03751
Pdf URL: https://arxiv.org/pdf/2410.03751
Copy Paste: [[2410.03751]] Recent Advances in Speech Language Models: A Survey(https://arxiv.org/abs/2410.03751)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) -- end-to-end models that generate speech without converting from text -- have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize the evaluation metrics for SpeechLMs, and discuss the challenges and future research directions in this rapidly evolving field.

Title: Enhancing Retrieval in QA Systems with Derived Feature Association

Authors: Keyush Shah, Abhishek Goyal, Isaac Wasserman
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2410.03754
Pdf URL: https://arxiv.org/pdf/2410.03754
Copy Paste: [[2410.03754]] Enhancing Retrieval in QA Systems with Derived Feature Association(https://arxiv.org/abs/2410.03754)
Keywords: llm, long context, retrieval augmented generation
Abstract: Retrieval augmented generation (RAG) has become the standard in long context question answering (QA) systems. However, typical implementations of RAG rely on a rather naive retrieval mechanism, in which texts whose embeddings are most similar to that of the query are deemed most relevant. This has consequences in subjective QA tasks, where the most relevant text may not directly contain the answer. In this work, we propose a novel extension to RAG systems, which we call Retrieval from AI Derived Documents (RAIDD). RAIDD leverages the full power of the LLM in the retrieval process by deriving inferred features, such as summaries and example questions, from the documents at ingest. We demonstrate that this approach significantly improves the performance of RAG systems on long-context QA tasks.
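The retrieval idea in the abstract above (index features derived from each document at ingest, then match queries against those features as well as the raw text) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, and the hard-coded `DERIVED` questions stand in for features that an LLM would generate; both are assumptions made for this example.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents, derive):
    """Index each source text plus its derived texts; every entry points back to its source id."""
    index = []
    for doc_id, text in documents.items():
        for entry in [text] + derive(text):
            index.append((embed(entry), doc_id))
    return index

def retrieve(index, query, top_n=1):
    """Rank source documents by their best-matching indexed entry."""
    q = embed(query)
    best = {}
    for vec, doc_id in index:
        best[doc_id] = max(cosine(q, vec), best.get(doc_id, 0.0))
    return sorted(best, key=best.get, reverse=True)[:top_n]

documents = {
    "bio": "Mitochondria produce ATP through oxidative phosphorylation.",
    "sport": "Paris hosted the 2024 Summer Olympics.",
}
# Hypothetical derived features (example questions); an LLM would produce these at ingest.
DERIVED = {
    documents["bio"]: ["how do cells make energy"],
    documents["sport"]: ["where were the recent olympic games held"],
}
index = build_index(documents, lambda text: DERIVED[text])
result = retrieve(index, "how do cells make energy")
```

The query shares no words with either source text, so matching against the raw documents alone would score both at zero; the derived question is what links the query to the "bio" document.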

Title: HiReview: Hierarchical Taxonomy-Driven Automatic Literature Review Generation

Authors: Yuntong Hu, Zhuofeng Li, Zheng Zhang, Chen Ling, Raasikh Kanjiani, Boxin Zhao, Liang Zhao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03761
Pdf URL: https://arxiv.org/pdf/2410.03761
Copy Paste: [[2410.03761]] HiReview: Hierarchical Taxonomy-Driven Automatic Literature Review Generation(https://arxiv.org/abs/2410.03761)
Keywords: language model, llm
Abstract: In this work, we present HiReview, a novel framework for hierarchical taxonomy-driven automatic literature review generation. With the exponential growth of academic documents, manual literature reviews have become increasingly labor-intensive and time-consuming, while traditional summarization models struggle to generate comprehensive document reviews effectively. Large language models (LLMs), with their powerful text processing capabilities, offer a potential solution; however, research on incorporating LLMs for automatic document generation remains limited. To address key challenges in large-scale automatic literature review generation (LRG), we propose a two-stage taxonomy-then-generation approach that combines graph-based hierarchical clustering with retrieval-augmented LLMs. First, we retrieve the most relevant sub-community within the citation network, then generate a hierarchical taxonomy tree by clustering papers based on both textual content and citation relationships. In the second stage, an LLM generates coherent and contextually accurate summaries for clusters or topics at each hierarchical level, ensuring comprehensive coverage and logical organization of the literature. Extensive experiments demonstrate that HiReview significantly outperforms state-of-the-art methods, achieving superior hierarchical organization, content relevance, and factual accuracy in automatic literature review generation tasks.

Title: Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression

Authors: Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, Grace Li Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03765
Pdf URL: https://arxiv.org/pdf/2410.03765
Copy Paste: [[2410.03765]] Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression(https://arxiv.org/abs/2410.03765)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable breakthroughs. However, the huge number of parameters in LLMs requires a significant amount of memory storage in inference, which prevents their practical deployment in many applications. To reduce memory storage of LLMs, singular value decomposition (SVD) provides a promising solution to approximate weight matrices for compressing LLMs. In this paper, we take a step further to explore parameter sharing across different layers with SVD to achieve more effective compression for LLMs. Specifically, weight matrices in different layers are decomposed and represented as a linear combination of a set of shared basis vectors and unique coefficients. The types of weight matrices and the layer selection for basis sharing are examined when compressing LLMs to maintain performance. Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches and parameter sharing techniques, especially under large compression ratios. Code is available at: this https URL
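The decomposition described in the abstract above can be written out as a short worked formula. The notation is assumed for illustration and is not taken from the paper: $W^{(l)} \in \mathbb{R}^{d_1 \times d_2}$ is the weight matrix of layer $l$ in a group of $n$ layers, $b_1, \dots, b_k \in \mathbb{R}^{d_1}$ are the shared basis vectors, and $C^{(l)} \in \mathbb{R}^{k \times d_2}$ holds the coefficients unique to layer $l$. One way to obtain such a basis, again an assumption since the abstract does not spell out the procedure, would be a truncated SVD of the concatenation $[W^{(1)} \cdots W^{(n)}]$.

```latex
% Each column j of a layer's weight matrix is a linear combination of the shared basis vectors.
\begin{align}
W^{(l)}_{:,j} &\approx \sum_{i=1}^{k} C^{(l)}_{i,j}\, b_i, \qquad l = 1, \dots, n, \\
% Parameter count for the group of n layers, before and after sharing.
\underbrace{n\, d_1 d_2}_{\text{original}} &\;\longrightarrow\;
\underbrace{k\, d_1}_{\text{shared basis}} + \underbrace{n\, k\, d_2}_{\text{per-layer coefficients}} .
\end{align}
```

Under these assumptions the basis cost $k\, d_1$ is paid once per group rather than once per layer, which is where the saving over a separate rank-$k$ factorization of each layer ($n\,k\,(d_1 + d_2)$ parameters) would come from.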

Title: Reasoning Elicitation in Language Models via Counterfactual Feedback

Authors: Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, Javier González
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03767
Pdf URL: https://arxiv.org/pdf/2410.03767
Copy Paste: [[2410.03767]] Reasoning Elicitation in Language Models via Counterfactual Feedback(https://arxiv.org/abs/2410.03767)
Keywords: language model
Abstract: Despite the increasing effectiveness of language models, their reasoning capabilities remain underdeveloped. In particular, causal reasoning through counterfactual question answering is lacking. This work aims to bridge this gap. We first derive novel metrics that balance accuracy in factual and counterfactual questions, capturing a more complete view of the reasoning abilities of language models than traditional factual-only based metrics. Second, we propose several fine-tuning approaches that aim to elicit better reasoning mechanisms, in the sense of the proposed metrics. Finally, we evaluate the performance of the fine-tuned language models in a variety of realistic scenarios. In particular, we investigate to what extent our fine-tuning approaches systemically achieve better generalization with respect to the base models in several problems that require, among others, inductive and deductive reasoning capabilities.

Title: Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

Authors: Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03768
Pdf URL: https://arxiv.org/pdf/2410.03768
Copy Paste: [[2410.03768]] Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs(https://arxiv.org/abs/2410.03768)
Keywords: llm, agent
Abstract: The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render collusion practically undetectable. This underscores the need for evaluation frameworks to monitor and mitigate steganographic collusion capabilities. We address a crucial gap in the literature by demonstrating, for the first time, that robust steganographic collusion in LLMs can arise indirectly from optimization pressure. To investigate this problem we design two approaches -- a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method -- for reliably eliciting sophisticated LLM-generated linguistic text steganography. Importantly, we find that emergent steganographic collusion can be robust to both passive steganalytic oversight of model outputs and active mitigation through communication paraphrasing. We contribute a novel model evaluation framework and discuss limitations and future work. Our findings imply that effective risk mitigation from steganographic collusion post-deployment requires innovation in passive and active oversight techniques.

Title: SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

Authors: Tianhao Li, Jingyu Lu, Chuangxin Chu, Tianyu Zeng, Yujia Zheng, Mei Li, Haotian Huang, Bin Wu, Zuoxian Liu, Kai Ma, Xuejing Yuan, Xingkai Wang, Keyan Ding, Huajun Chen, Qiang Zhang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2410.03769
Pdf URL: https://arxiv.org/pdf/2410.03769
Copy Paste: [[2410.03769]] SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks(https://arxiv.org/abs/2410.03769)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) have had a transformative impact on a variety of scientific tasks across disciplines such as biology, chemistry, medicine, and physics. However, ensuring the safety alignment of these models in scientific research remains an underexplored area, with existing benchmarks primarily focusing on textual content and overlooking key scientific representations such as molecular, protein, and genomic languages. Moreover, the safety mechanisms of LLMs in scientific tasks are insufficiently studied. To address these limitations, we introduce SciSafeEval, a comprehensive benchmark designed to evaluate the safety alignment of LLMs across a range of scientific tasks. SciSafeEval spans multiple scientific languages - including textual, molecular, protein, and genomic - and covers a wide range of scientific domains. We evaluate LLMs in zero-shot, few-shot and chain-of-thought settings, and introduce a 'jailbreak' enhancement feature that challenges LLMs equipped with safety guardrails, rigorously testing their defenses against malicious intention. Our benchmark surpasses existing safety datasets in both scale and scope, providing a robust platform for assessing the safety and performance of LLMs in scientific contexts. This work aims to facilitate the responsible development and deployment of LLMs, promoting alignment with safety and ethical standards in scientific research.

Title: A Two-Stage Proactive Dialogue Generator for Efficient Clinical Information Collection Using Large Language Model

Authors: Xueshen Li, Xinlong Hou, Nirupama Ravi, Ziyi Huang, Yu Gan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03770
Pdf URL: https://arxiv.org/pdf/2410.03770
Copy Paste: [[2410.03770]] A Two-Stage Proactive Dialogue Generator for Efficient Clinical Information Collection Using Large Language Model(https://arxiv.org/abs/2410.03770)
Keywords: language model, agent
Abstract: Efficient patient-doctor interaction is among the key factors for a successful disease diagnosis. During the conversation, the doctor could query complementary diagnostic information, such as the patient's symptoms, previous surgery, and other related information that goes beyond medical evidence data (test results) to enhance disease diagnosis. However, this procedure is usually time-consuming and inefficient, and could potentially be optimized through computer-assisted systems. As such, we propose a diagnostic dialogue system to automate the patient information collection procedure. By exploiting medical history and conversation logic, our conversation agents, particularly the doctor agent, can pose multi-round clinical queries to effectively collect the most relevant disease diagnostic information. Moreover, benefiting from our two-stage recommendation structure, carefully designed ranking criteria, and interactive patient agent, our model is able to overcome the challenges of under-exploration and inflexibility in dialogue generation. Our experimental results on a real-world medical conversation dataset show that our model can generate clinical queries that mimic the conversation style of real doctors, with efficient fluency, professionalism, and safety, while effectively collecting relevant disease diagnostic information.

Title: Precision Knowledge Editing: Enhancing Safety in Large Language Models

Authors: Xuying Li, Zhuo Li, Yuji Kosuga, Yasuhiro Yoshida, Victor Bian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03772
Pdf URL: https://arxiv.org/pdf/2410.03772
Copy Paste: [[2410.03772]] Precision Knowledge Editing: Enhancing Safety in Large Language Models(https://arxiv.org/abs/2410.03772)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but they also pose risks related to the generation of toxic or harmful content. This work introduces Precision Knowledge Editing (PKE), an advanced technique that builds upon existing knowledge editing methods to more effectively identify and modify toxic parameter regions within LLMs. By leveraging neuron weight tracking and activation pathway tracing, PKE achieves finer granularity in toxic content management compared to previous methods like Detoxifying Instance Neuron Modification (DINM). Our experiments demonstrate that PKE significantly reduces the attack success rate (ASR) across various models, including Llama2-7b and Llama-3-8b-instruct, while maintaining overall model performance. We also compared the performance of some closed-source models (gpt-4-0613 and Claude 3 Sonnet) in our experiments, and found that models adjusted using our method far outperformed the closed-source models in terms of safety. This research contributes to the ongoing efforts to make LLMs safer and more reliable for real-world applications.

Title: Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling

Authors: Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, Linqi Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03777
Pdf URL: https://arxiv.org/pdf/2410.03777
Copy Paste: [[2410.03777]] Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling(https://arxiv.org/abs/2410.03777)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) exhibit varying strengths and weaknesses across different tasks, prompting recent studies to explore the benefits of ensembling models to leverage their complementary advantages. However, existing LLM ensembling methods often overlook model compatibility and struggle with inefficient alignment of probabilities across the entire vocabulary. In this study, we empirically investigate the factors influencing ensemble performance, identifying model performance, vocabulary size, and response style as key determinants, revealing that compatibility among models is essential for effective ensembling. This analysis leads to the development of a simple yet effective model selection strategy that identifies compatible models. Additionally, we introduce Union Top-k Ensembling (UniTE), a novel approach that efficiently combines models by focusing on the union of the top-k tokens from each model, thereby avoiding the need for full vocabulary alignment and reducing computational overhead. Extensive evaluations across multiple benchmarks demonstrate that UniTE significantly enhances performance compared to existing methods, offering a more efficient framework for LLM ensembling.
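The core step named in the abstract above, combining models over the union of each model's top-k next-token candidates rather than over the full vocabulary, might look like the sketch below. It is a simplified illustration, not the UniTE algorithm itself: it assumes both models emit the same token strings, averages probabilities with equal weights, and treats a token absent from a model's distribution as probability 0. How the paper handles differing tokenizers is not stated in the abstract.

```python
def top_k(dist, k):
    """Return the k most probable tokens of a {token: probability} distribution."""
    return sorted(dist, key=dist.get, reverse=True)[:k]

def union_top_k_ensemble(dists, k):
    """Average next-token probabilities over the union of each model's top-k tokens."""
    union = set()
    for dist in dists:
        union.update(top_k(dist, k))
    # Equal-weight average (assumption); tokens a model never scored count as 0.
    avg = {tok: sum(d.get(tok, 0.0) for d in dists) / len(dists) for tok in union}
    total = sum(avg.values())
    return {tok: p / total for tok, p in avg.items()}  # renormalize over the union

# Toy next-token distributions for two hypothetical models.
model_a = {"cat": 0.5, "dog": 0.3, "car": 0.1, "sky": 0.1}
model_b = {"dog": 0.5, "fox": 0.3, "cat": 0.15, "sky": 0.05}

combined = union_top_k_ensemble([model_a, model_b], k=2)
next_token = max(combined, key=combined.get)
```

With k=2 the union holds three tokens ("cat", "dog", "fox") instead of the five in the joint vocabulary, and "dog" is selected because it has the highest averaged probability.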

Title: Reward-RAG: Enhancing RAG with Reward Driven Supervision

Authors: Thang Nguyen, Peter Chin, Yu-Wing Tai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03780
Pdf URL: https://arxiv.org/pdf/2410.03780
Copy Paste: [[2410.03780]] Reward-RAG: Enhancing RAG with Reward Driven Supervision(https://arxiv.org/abs/2410.03780)
Keywords: language model, gpt, retrieval-augmented generation
Abstract: In this paper, we introduce Reward-RAG, a novel approach designed to enhance the Retrieval-Augmented Generation (RAG) model through Reward-Driven Supervision. Unlike previous RAG methodologies, which focus on training language models (LMs) to utilize external knowledge retrieved from external sources, our method adapts retrieval information to specific domains by employing CriticGPT to train a dedicated reward model. This reward model generates synthesized datasets for fine-tuning the RAG encoder, aligning its outputs more closely with human preferences. The versatility of our approach allows it to be effectively applied across various domains through domain-specific fine-tuning. We evaluate Reward-RAG on publicly available benchmarks from multiple domains, comparing it to state-of-the-art methods. Our experimental results demonstrate significant improvements in performance, highlighting the effectiveness of Reward-RAG in improving the relevance and quality of generated responses. These findings underscore the potential of integrating reward models with RAG to achieve superior outcomes in natural language generation tasks.

Title: Searching for Best Practices in Medical Transcription with Large Language Model

Authors: Jiafeng Li, Yanda Mu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03797
Pdf URL: https://arxiv.org/pdf/2410.03797
Copy Paste: [[2410.03797]] Searching for Best Practices in Medical Transcription with Large Language Model(https://arxiv.org/abs/2410.03797)
Keywords: language model, llm
Abstract: The transcription of medical monologues, especially those containing a high density of specialized terminology and delivered with a distinct accent, presents a significant challenge for existing automated systems. This paper introduces a novel approach leveraging a Large Language Model (LLM) to generate highly accurate medical transcripts from audio recordings of doctors' monologues, specifically focusing on Indian accents. Our methodology integrates advanced language modeling techniques to lower the Word Error Rate (WER) and ensure the precise recognition of critical medical terms. Through rigorous testing on a comprehensive dataset of medical recordings, our approach demonstrates substantial improvements in both overall transcription accuracy and the fidelity of key medical terminologies. These results suggest that our proposed system could significantly aid in clinical documentation processes, offering a reliable tool for healthcare providers to streamline their transcription needs while maintaining high standards of accuracy.
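For readers unfamiliar with the Word Error Rate (WER) metric mentioned in the abstract above, the standard definition is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch follows; the example sentences are invented for illustration and do not come from the paper's dataset.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or free match when words agree
            ))
        prev = curr
    return prev[-1] / len(ref)

# Invented example: two medical terms are mis-transcribed.
reference = "patient has acute myocardial infarction"
hypothesis = "patient has a cute myocardial infraction"
wer = word_error_rate(reference, hypothesis)
```

Here three edits are needed against five reference words ("a" is inserted, "acute" becomes "cute", "infarction" becomes "infraction"), so the WER is 0.6. Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.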

Title: Self-Powered LLM Modality Expansion for Large Speech-Text Models

Authors: Tengfei Yu, Xuebo Liu, Zhiyi Hou, Liang Ding, Dacheng Tao, Min Zhang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.03798
Pdf URL: https://arxiv.org/pdf/2410.03798
Copy Paste: [[2410.03798]] Self-Powered LLM Modality Expansion for Large Speech-Text Models(https://arxiv.org/abs/2410.03798)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit remarkable performance across diverse tasks, indicating their potential for expansion into large speech-text models (LSMs) by integrating speech capabilities. Although unified speech-text pre-training and multimodal data instruction-tuning offer considerable benefits, these methods generally entail significant resource demands and tend to overfit specific tasks. This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning. We explore the instruction-following dynamics within LSMs, identifying a critical issue termed speech anchor bias, a tendency for LSMs to over-rely on speech inputs, mistakenly interpreting the entire speech modality as directives, thereby neglecting textual instructions. To counteract this bias, we introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning. Our experiments across a range of speech-based tasks demonstrate that self-powered LSM mitigates speech anchor bias and improves the fusion of speech and text modalities in LSMs. Data, code and scripts are freely available at this https URL.

Title: Mixture of Attentions For Speculative Decoding

Authors: Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03804
Pdf URL: https://arxiv.org/pdf/2410.03804
Copy Paste: [[2410.03804]] Mixture of Attentions For Speculative Decoding(https://arxiv.org/abs/2410.03804)
Keywords: language model, llm
Abstract: The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models including the lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single device deployment and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In a single-device scenario, we demonstrate state-of-the-art speedups improving EAGLE-2 by 9.5% and its acceptance length by 25%. In a client-server setting, our experiments demonstrate: 1) state-of-the-art latencies with minimal calls to the server for different network conditions, and 2) in the event of a complete disconnection, our approach can maintain higher accuracy compared to other SD methods and demonstrates advantages over API calls to LLMs, which would otherwise be unable to continue the generation process.
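The draft-then-verify loop behind speculative decoding, and the "acceptance length" the abstract above reports, can be illustrated with a toy example. This sketch uses greedy acceptance (a drafted token is kept only if it equals the large model's own greedy choice) and two hypothetical stand-in "models" defined as plain functions. It does not represent the Mixture of Attentions architecture, and practical systems typically verify all drafted positions in one batched forward pass and may use probabilistic acceptance instead.

```python
def speculative_step(prefix, draft_next, target_next, num_draft=4):
    """One draft-then-verify step with greedy acceptance.

    Returns (new_tokens, accepted), where accepted counts the drafted tokens kept.
    """
    # The small model proposes num_draft tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # The large model checks each drafted position in order.
    new_tokens, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            new_tokens.append(expected)  # first mismatch: keep the target's token and stop
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
        ctx.append(tok)
    new_tokens.append(target_next(ctx))  # everything accepted: the target adds one more token
    return new_tokens, len(proposal)

TEXT = "the quick brown fox jumps over the lazy dog".split()

def target_next(ctx):
    """Toy target model (hypothetical): recites TEXT word by word."""
    return TEXT[len(ctx)] if len(ctx) < len(TEXT) else "<eos>"

def draft_next(ctx):
    """Toy draft model (hypothetical): agrees with the target except at position 3."""
    return "cat" if len(ctx) == 3 else target_next(ctx)

tokens, accepted = speculative_step(["the"], draft_next, target_next)
```

In this run the draft proposes "quick brown cat jumps"; the first two tokens are accepted, "cat" is rejected and replaced by the target's "fox", so the step emits three tokens with an acceptance length of 2. The output always matches what the target model alone would have generated greedily.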

Title: Misinformation with Legal Consequences (MisLC): A New Task Towards Harnessing Societal Harm of Misinformation

Authors: Chu Fei Luo, Radin Shayanfar, Rohan Bhambhoria, Samuel Dahan, Xiaodan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03829
Pdf URL: https://arxiv.org/pdf/2410.03829
Copy Paste: [[2410.03829]] Misinformation with Legal Consequences (MisLC): A New Task Towards Harnessing Societal Harm of Misinformation(https://arxiv.org/abs/2410.03829)
Keywords: language model, retrieval-augmented generation
Abstract: Misinformation, defined as false or inaccurate information, can result in significant societal harm when it is spread with malicious or even innocuous intent. The rapid online information exchange necessitates advanced detection mechanisms to mitigate misinformation-induced harm. Existing research, however, has predominantly focused on assessing veracity, overlooking the legal implications and social consequences of misinformation. In this work, we take a novel angle to consolidate the definition of misinformation detection using legal issues as a measurement of societal ramifications, aiming to bring interdisciplinary efforts to tackle misinformation and its consequence. We introduce a new task: Misinformation with Legal Consequence (MisLC), which leverages definitions from a wide range of legal domains covering 4 broader legal topics and 11 fine-grained legal issues, including hate speech, election laws, and privacy regulations. For this task, we advocate a two-step dataset curation approach that utilizes crowd-sourced checkworthiness and expert evaluations of misinformation. We provide insights about the MisLC task through empirical evidence, from the problem definition to experiments and expert involvement. While the latest large language models and retrieval-augmented generation are effective baselines for the task, we find they are still far from replicating expert performance.

Title: ORAssistant: A Custom RAG-based Conversational Assistant for OpenROAD

Authors: Aviral Kaintura, Palaniappan R, Shui Song Luar, Indira Iyer Almeida
Subjects: cs.CL, cs.AR
Abstract URL: https://arxiv.org/abs/2410.03845
Pdf URL: https://arxiv.org/pdf/2410.03845
Copy Paste: [[2410.03845]] ORAssistant: A Custom RAG-based Conversational Assistant for OpenROAD(https://arxiv.org/abs/2410.03845)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Open-source Electronic Design Automation (EDA) tools are rapidly transforming chip design by addressing key barriers of commercial EDA tools such as complexity, costs, and access. Recent advancements in Large Language Models (LLMs) have further enhanced efficiency in chip design by providing user assistance across a range of tasks like setup, decision-making, and flow automation. This paper introduces ORAssistant, a conversational assistant for OpenROAD, based on Retrieval-Augmented Generation (RAG). ORAssistant aims to improve the user experience for the OpenROAD flow, from RTL to GDSII, by providing context-specific responses in prose format to common user queries, including installation, command usage, flow setup, and execution. Currently, ORAssistant integrates OpenROAD, OpenROAD-flow-scripts, Yosys, OpenSTA, and KLayout. The data model is built from publicly available documentation and GitHub resources. The proposed architecture is scalable, supporting extensions to other open-source tools, operating modes, and LLMs. We use Google Gemini as the base LLM to build and test ORAssistant. Early evaluation results of the RAG-based model show notable improvements in performance and accuracy compared to non-fine-tuned LLMs.

Title: Using Prompts to Guide Large Language Models in Imitating a Real Person's Language Style

Authors: Ziyang Chen, Stylios Moscholios
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.03848
Pdf URL: https://arxiv.org/pdf/2410.03848
Copy Paste: [[2410.03848]] Using Prompts to Guide Large Language Models in Imitating a Real Person's Language Style(https://arxiv.org/abs/2410.03848)
Keywords: language model, gpt, llm, prompt, tree-of-thought
Abstract: Large language models (LLMs), such as the GPT series and the Llama series, have demonstrated strong capabilities in natural language processing, contextual understanding, and text generation. In recent years, researchers have been trying to enhance the abilities of LLMs in performing various tasks, and numerous studies have shown that well-designed prompts can significantly improve the performance of LLMs on these tasks. This study compares the language style imitation ability of three different large language models under the guidance of the same zero-shot prompt. It also compares the imitation ability of the same large language model when guided by three different prompts individually. Additionally, by applying a Tree-of-Thoughts (ToT) Prompting method to Llama 3, a conversational AI with the language style of a real person was created. In this study, three evaluation methods were used to evaluate LLMs and prompts. The results show that Llama 3 performs best at imitating language styles, and that the ToT prompting method is the most effective at guiding it in imitating language styles. Using a ToT framework, Llama 3 was guided to interact with users in the language style of a specific individual without altering its core parameters, thereby creating a text-based conversational AI that reflects the language style of the individual.
摘要：大型语言模型（LLM），如GPT系列、Llama系列，在自然语言处理、上下文理解、文本生成等方面展现出了强大的能力。近年来，研究人员试图提升LLM在各种任务上的能力，大量研究证明，精心设计的提示可以显著提高LLM在这些任务上的表现。本研究比较了在同一个零样本提示引导下，三种不同的大型语言模型的语言风格模仿能力，也比较了同一个大型语言模型在三种不同提示的引导下模仿能力。此外，通过将思维树（ToT）提示方法应用于Llama 3，创建了一个具有真人语言风格的对话式AI。本研究采用了三种评测方法对LLM和提示进行评估。结果表明，Llama 3在语言风格模仿方面表现最佳，而ToT提示方法是引导其模仿语言风格的最有效方法。使用ToT框架，引导Llama 3在不改变其核心参数的情况下，以特定个体的语言风格与用户进行交互，从而创建一个反映个体语言风格的基于文本的对话式AI。

Title: Detecting Machine-Generated Long-Form Content with Latent-Space Variables

Authors: Yufei Tian, Zeyu Pan, Nanyun Peng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03856
Pdf URL: https://arxiv.org/pdf/2410.03856
Copy Paste: [[2410.03856]] Detecting Machine-Generated Long-Form Content with Latent-Space Variables(https://arxiv.org/abs/2410.03856)
Keywords: language model, gpt, llm, prompt
Abstract: The increasing capability of large language models (LLMs) to generate fluent long-form texts is presenting new challenges in distinguishing machine-generated outputs from human-written ones, which is crucial for ensuring authenticity and trustworthiness of expressions. Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts, including different prompting and decoding strategies, and adversarial attacks. We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts by training a latent-space model on sequences of events or topics derived from human-written texts. In three different domains, machine-generated texts, which are originally inseparable from human texts on the token level, can be better distinguished with our latent-space model, leading to a 31% improvement over strong baselines such as DetectGPT. Our analysis further reveals that, unlike humans, modern LLMs like GPT-4 generate event triggers and their transitions differently, an inherent disparity that helps our method to robustly detect machine-generated texts.
摘要：大型语言模型 (LLM) 生成流畅长文本的能力不断增强，这对区分机器生成的输出和人类书写的输出提出了新的挑战，而区分机器生成的输出和人类书写的输出对于确保表达的真实性和可信度至关重要。现有的零样本检测器主要关注 token 级分布，而这些分布易受现实世界领域变化的影响，包括不同的提示和解码策略以及对抗性攻击。我们提出了一种更为稳健的方法，该方法结合事件转换等抽象元素作为关键决定因素，通过对源自人类书写文本的事件或主题序列训练潜在空间模型来检测机器文本与人类文本。在三个不同的领域中，机器生成的文本在 token 级别上原本与人类文本密不可分，我们的潜在空间模型可以更好地区分它们，与 DetectGPT 等强基线相比，性能提高了 31%。我们的分析进一步表明，与人类不同，GPT-4 等现代 LLM 以不同的方式生成事件触发器及其转换，这种固有的差异有助于我们的方法稳健地检测机器生成的文本。

Title: You Know What I'm Saying -- Jailbreak Attack via Implicit Reference

Authors: Tianyu Wu, Lingrui Mei, Ruibin Yuan, Lujun Li, Wei Xue, Yike Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03857
Pdf URL: https://arxiv.org/pdf/2410.03857
Copy Paste: [[2410.03857]] You Know What I'm Saying -- Jailbreak Attack via Implicit Reference(https://arxiv.org/abs/2410.03857)
Keywords: language model, gpt, llm
Abstract: While recent advancements in large language model (LLM) alignment have enabled the effective identification of malicious objectives involving scene nesting and keyword rewriting, our study reveals that these methods remain inadequate at detecting malicious objectives expressed through context within nested harmless objectives. This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR). AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context. This method employs multiple related harmless objectives to generate malicious content without triggering refusal responses, thereby effectively bypassing existing detection. Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models, including GPT-4o, Claude-3.5-Sonnet, and Qwen-2-72B. Notably, we observe an inverse scaling phenomenon, where larger models are more vulnerable to this attack method. These findings underscore the urgent need for defense mechanisms capable of understanding and preventing contextual attacks. Furthermore, we introduce a cross-model attack strategy that leverages less secure models to generate malicious contexts, thereby further increasing the ASR when targeting other models. Our code and jailbreak artifacts can be found at this https URL.
摘要：虽然大型语言模型 (LLM) 对齐方面的最新进展使得能够有效识别涉及场景嵌套和关键字重写的恶意目标，但我们的研究表明，这些方法仍然不足以检测通过嵌套无害目标中的上下文表达的恶意目标。这项研究发现了一个以前被忽视的漏洞，我们称之为通过隐式引用攻击 (AIR)。AIR 将恶意目标分解为允许的目标，并通过上下文中的隐式引用将它们链接起来。此方法采用多个相关的无害目标来生成恶意内容而不会触发拒绝响应，从而有效地绕过现有检测。此 http URL 实验证明了 AIR 在最先进的 LLM 中的有效性，在大多数模型（包括 GPT-4o、Claude-3.5-Sonnet 和 Qwen-2-72B）上实现了超过 90% 的攻击成功率 (ASR)。值得注意的是，我们观察到一种逆缩放现象，其中较大的模型更容易受到这种攻击方法的攻击。这些发现强调了迫切需要能够理解和防止上下文攻击的防御机制。此外，我们引入了一种跨模型攻击策略，利用不太安全的模型来生成恶意上下文，从而进一步提高针对其他此 http URL 代码时的 ASR，并且可以在此 https URL 中找到越狱工件。

Title: SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

Authors: John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, Ofir Press
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2410.03859
Pdf URL: https://arxiv.org/pdf/2410.03859
Copy Paste: [[2410.03859]] SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?(https://arxiv.org/abs/2410.03859)
Keywords: agent
Abstract: Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHub repositories. However, SWE-bench uses only Python repositories, with problem statements presented predominantly as text and lacking visual elements such as images. This limited coverage motivates our inquiry into how existing systems might perform on unrepresented software engineering domains (e.g., front-end, game development, DevOps), which use different programming languages and paradigms. Therefore, we propose SWE-bench Multimodal (SWE-bench M), to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software. SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. Each SWE-bench M task instance contains at least one image in its problem statement or unit tests. Our analysis finds that top-performing SWE-bench systems struggle with SWE-bench M, revealing limitations in visual problem-solving and cross-language generalization. Lastly, we show that SWE-agent's flexible language-agnostic features enable it to substantially outperform alternatives on SWE-bench M, resolving 12% of task instances compared to 6% for the next best system.
摘要：软件工程的自主系统现在能够修复错误和开发功能。这些系统通常在 SWE-bench（Jimenez 等人，2024a）上进行评估，该评估评估了它们从 GitHub 存储库解决软件问题的能力。但是，SWE-bench 仅使用 Python 存储库，问题陈述主要以文本形式呈现，缺少图像等视觉元素。这种有限的覆盖范围促使我们研究现有系统在使用不同编程语言和范式的未代表的软件工程领域（例如前端、游戏开发、DevOps）上的表现如何。因此，我们提出了 SWE-bench Multimodal（SWE-bench M），以评估系统修复可视化、面向用户的 JavaScript 软件中的错误的能力。SWE-bench M 具有从 17 个 JavaScript 库中收集的 617 个任务实例，这些库用于 Web 界面设计、图表、数据可视化、语法突出显示和交互式映射。每个 SWE-bench M 任务实例在其问题陈述或单元测试中至少包含一个图像。我们的分析发现，性能最佳的 SWE-bench 系统在处理 SWE-bench M 时遇到困难，这揭示了视觉问题解决和跨语言泛化方面的局限性。最后，我们表明 SWE-agent 灵活的语言无关功能使其在 SWE-bench M 上的表现远远优于其他系统，解决了 12% 的任务实例，而排名第二的系统解决了 6%。

Title: Can Language Models Reason about Individualistic Human Values and Preferences?

Authors: Liwei Jiang, Taylor Sorensen, Sydney Levine, Yejin Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03868
Pdf URL: https://arxiv.org/pdf/2410.03868
Copy Paste: [[2410.03868]] Can Language Models Reason about Individualistic Human Values and Preferences?(https://arxiv.org/abs/2410.03868)
Keywords: language model
Abstract: Recent calls for pluralistic alignment emphasize that AI systems should address the diverse needs of all people. Yet, efforts in this space often require sorting people into fixed buckets of pre-specified diversity-defining dimensions (e.g., demographics, personalities, communication styles), risking smoothing out or even stereotyping the rich spectrum of individualistic variations. To achieve an authentic representation of diversity that respects individuality, we propose individualistic alignment. While individualistic alignment can take various forms, in this paper, we introduce IndieValueCatalog, a dataset transformed from the influential World Values Survey (WVS), to study language models (LMs) on the specific challenge of individualistic value reasoning. Specifically, given a sample of an individual's value-expressing statements, models are tasked with predicting their value judgments in novel cases. With IndieValueCatalog, we reveal critical limitations in frontier LMs' abilities to reason about individualistic human values, with accuracies ranging only between 55% and 65%. Moreover, our results highlight that a precise description of individualistic values cannot be approximated only via demographic information. We also identify a partiality of LMs in reasoning about global individualistic values, as measured by our proposed Value Inequity Index (σINEQUITY). Finally, we train a series of Individualistic Value Reasoners (IndieValueReasoner) using IndieValueCatalog to enhance models' individualistic value reasoning capability, revealing new patterns and dynamics in global human values. We outline future research challenges and opportunities for advancing individualistic alignment.
摘要：最近对多元化对齐的呼吁强调，人工智能系统应该满足所有人的不同需求。然而，这一领域的努力通常需要将人们分类到预先指定的多样性定义维度的固定桶中（例如，人口统计、个性、沟通风格），这可能会平滑甚至刻板化丰富的个人差异。为了实现尊重个性的真实多样性表现，我们提出了个人对齐。虽然个人对齐可以采取多种形式，但在本文中，我们引入了 IndieValueCatalog，这是一个从有影响力的世界价值观调查 (WVS) 转换而来的数据集，用于研究语言模型 (LM) 在个人价值推理的特定挑战上。具体来说，给定一个个人表达价值观的陈述样本，模型的任务是预测他们在新情况下的价值判断。通过 IndieValueCatalog，我们揭示了前沿 LM 在推理个人主义人类价值观方面的关键局限性，准确率仅在 55% 到 65% 之间。此外，我们的研究结果强调，个人主义价值观的精确描述不能仅通过人口统计信息来近似。我们还确定了 LM 在推理全球个人主义价值观方面的偏袒，这可以通过我们提出的价值不公平指数 ({\sigma}INEQUITY) 来衡量。最后，我们使用 IndieValueCatalog 训练了一系列个人主义价值观推理器 (IndieValueReasoner)，以增强模型的个人主义价值观推理能力，揭示全球人类价值观的新模式和动态。我们概述了未来推进个人主义一致性的研究挑战和机遇。

Title: Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

Authors: Wenxuan Wang, Kuiyi Gao, Zihan Jia, Youliang Yuan, Jen-tse Huang, Qiuzhi Liu, Shuai Wang, Wenxiang Jiao, Zhaopeng Tu
Subjects: cs.CL, cs.AI, cs.CR, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2410.03869
Pdf URL: https://arxiv.org/pdf/2410.03869
Copy Paste: [[2410.03869]] Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step(https://arxiv.org/abs/2410.03869)
Keywords: gpt, prompt
Abstract: Text-based image generation models, such as Stable Diffusion and DALL-E 3, hold significant potential in content creation and publishing workflows, making them a focus of attention in recent years. Despite their remarkable capability to generate diverse and vivid images, considerable efforts are being made to prevent the generation of harmful content, such as abusive, violent, or pornographic material. To assess the safety of existing models, we introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process. Specifically, for malicious queries that cannot bypass the safeguards with a single prompt, we intentionally decompose the query into multiple sub-queries. The image generation models are then prompted to generate and iteratively edit images based on these sub-queries. To evaluate the effectiveness of our CoJ attack method, we constructed a comprehensive dataset, CoJ-Bench, encompassing nine safety scenarios, three types of editing operations, and three editing elements. Experiments on four widely-used image generation services provided by GPT-4V, GPT-4o, Gemini 1.5 and Gemini 1.5 Pro demonstrate that our CoJ attack method can successfully bypass the safeguards of models in over 60% of cases, which significantly outperforms other jailbreaking methods (i.e., 14%). Further, to enhance these models' safety against our CoJ attack method, we also propose an effective prompting-based method, Think Twice Prompting, that can successfully defend against over 95% of CoJ attacks. We release our dataset and code to facilitate AI safety research.
摘要：基于文本的图像生成模型（例如 Stable Diffusion 和 DALL-E 3）在内容创建和发布工作流程中具有巨大潜力，因此成为近年来的焦点。尽管它们具有生成多样化和生动图像的出色能力，但仍在做出相当大的努力来防止生成有害内容，例如辱骂、暴力或色情材料。为了评估现有模型的安全性，我们引入了一种称为 Chain-of-Jailbreak (CoJ) 攻击的新越狱方法，该方法通过逐步的编辑过程破坏图像生成模型。具体而言，对于无法通过单个提示绕过保护措施的恶意查询，我们有意将查询分解为多个子查询。然后提示图像生成模型根据这些子查询生成并迭代编辑图像。为了评估我们的 CoJ 攻击方法的有效性，我们构建了一个综合数据集 CoJ-Bench，涵盖九种安全场景、三种编辑操作和三种编辑元素。在 GPT-4V、GPT-4o、Gemini 1.5 和 Gemini 1.5 Pro 提供的四种广泛使用的图像生成服务上进行的实验表明，我们的 CoJ 攻击方法可以在超过 60% 的情况下成功绕过模型的安全措施，这大大优于其他越狱方法（即 14%）。此外，为了增强这些模型对我们的 CoJ 攻击方法的安全性，我们还提出了一种有效的基于提示的方法 Think Twice Prompting，可以成功防御超过 95% 的 CoJ 攻击。我们发布了我们的数据集和代码以促进 AI 安全研究。

Title: KidLM: Advancing Language Models for Children -- Early Insights and Future Directions

Authors: Mir Tafseer Nayeem, Davood Rafiei
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2410.03884
Pdf URL: https://arxiv.org/pdf/2410.03884
Copy Paste: [[2410.03884]] KidLM: Advancing Language Models for Children -- Early Insights and Future Directions(https://arxiv.org/abs/2410.03884)
Keywords: language model
Abstract: Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children's unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.
摘要：最近的研究强调了大型语言模型在为儿童创建教育工具方面的潜力，但在保持儿童特定属性（例如语言细微差别、认知需求和安全标准）方面仍然存在重大挑战。在本文中，我们探讨了开发儿童特定语言模型的基本步骤，强调了高质量预训练数据的必要性。我们引入了一种新颖的以用户为中心的数据收集流程，该流程涉及收集和验证专门为儿童编写的语料库（有时由儿童编写）。此外，我们提出了一个新的训练目标，即分层掩蔽，它根据我们特定领域的儿童语言数据动态调整掩蔽概率，使模型能够优先考虑更适合儿童的词汇和概念。实验评估表明，我们的模型在理解低年级文本方面表现出色，通过避免刻板印象来保持安全性，并捕捉儿童的独特偏好。此外，我们为儿童特定语言建模的未来研究和开发提供了可行的见解。
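
The abstract above describes Stratified Masking only at a high level (masking probabilities that depend on domain-specific child language data). The following is a minimal sketch of one possible reading, not the authors' implementation: tokens are assigned to strata, and each stratum gets its own masking rate. The stratum names, the `domain_vocab` and `stopwords` sets, and the rates are all hypothetical.

```python
import random

def stratified_mask(tokens, domain_vocab, stopwords, rates, rng, mask_token="[MASK]"):
    """Mask each token with a probability chosen by its stratum.

    Strata (hypothetical): "stopword", "domain" (word found in the
    child-language vocabulary), and "other".
    """
    masked = []
    for tok in tokens:
        if tok in stopwords:
            p = rates["stopword"]
        elif tok in domain_vocab:
            p = rates["domain"]
        else:
            p = rates["other"]
        # rng.random() is in [0.0, 1.0), so p=1.0 always masks and p=0.0 never does.
        masked.append(mask_token if rng.random() < p else tok)
    return masked

# Usage: extreme rates make the behaviour easy to see.
rng = random.Random(0)
example = stratified_mask(
    ["the", "dragon", "sleeps"],
    domain_vocab={"dragon"},
    stopwords={"the"},
    rates={"stopword": 0.0, "domain": 1.0, "other": 0.0},
    rng=rng,
)
```

In practice the rates would sit strictly between 0 and 1, with higher values for the words the model should be pushed to predict.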

Title: PersonalSum: A User-Subjective Guided Personalized Summarization Dataset for Large Language Models

Authors: Lemei Zhang, Peng Liu, Marcus Tiedemann Oekland Henriksboe, Even W. Lauvrak, Jon Atle Gulla, Heri Ramampiaro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03905
Pdf URL: https://arxiv.org/pdf/2410.03905
Copy Paste: [[2410.03905]] PersonalSum: A User-Subjective Guided Personalized Summarization Dataset for Large Language Models(https://arxiv.org/abs/2410.03905)
Keywords: language model, llm
Abstract: With the rapid advancement of Natural Language Processing in recent years, numerous studies have shown that generic summaries generated by Large Language Models (LLMs) can sometimes surpass those annotated by experts, such as journalists, according to human evaluations. However, there is limited research on whether these generic summaries meet the individual needs of ordinary people. The biggest obstacle is the lack of human-annotated datasets from the general public. Existing work on personalized summarization often relies on pseudo datasets created from generic summarization datasets or controllable tasks that focus on specific named entities or other aspects, such as the length and specificity of generated summaries, collected from hypothetical tasks without the annotators' initiative. To bridge this gap, we propose a high-quality, personalized, manually annotated abstractive summarization dataset called PersonalSum. This dataset is the first to investigate whether the focus of public readers differs from the generic summaries generated by LLMs. It includes user profiles, personalized summaries accompanied by source sentences from given articles, and machine-generated generic summaries along with their sources. We investigate several personal signals - entities/topics, plot, and structure of articles - that may affect the generation of personalized summaries using LLMs in a few-shot in-context learning scenario. Our preliminary results and analysis indicate that entities/topics are merely one of the key factors that impact the diverse preferences of users, and personalized summarization remains a significant challenge for existing LLMs.
摘要：近年来，随着自然语言处理的快速发展，大量研究表明，大型语言模型 (LLM) 生成的通用摘要有时在人工评估方面可以超越新闻记者等专家所标注的摘要。然而，关于这些通用摘要是否符合普通人个性化需求的研究有限。最大的障碍是缺乏来自普通大众的人工标注数据集。现有的个性化摘要工作通常依赖于由通用摘要数据集或可控任务创建的伪数据集，这些伪数据集侧重于特定命名实体或其他方面，例如生成的摘要的长度和特异性，这些方面是从没有标注者主动参与的假设任务中收集的。为了弥补这一差距，我们提出了一个高质量、个性化、手动标注的抽象摘要数据集，称为 PersonalSum。该数据集是第一个研究公众读者的关注点是否与 LLM 生成的通用摘要不同的数据集。它包括用户个人资料、带有给定文章源句的个性化摘要以及机器生成的通用摘要及其来源。我们研究了几个个人信号——实体/主题、情节和文章结构——这些信号可能会影响在少量上下文学习场景中使用 LLM 生成个性化摘要。我们的初步结果和分析表明，实体/主题只是影响用户多样化偏好的关键因素之一，个性化摘要仍然是现有 LLM 面临的重大挑战。

Title: ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities

Authors: Ying Su, Zhan Ling, Haochen Shi, Jiayang Cheng, Yauwai Yim, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03907
Pdf URL: https://arxiv.org/pdf/2410.03907
Copy Paste: [[2410.03907]] ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities(https://arxiv.org/abs/2410.03907)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) have been adopted to process textual task descriptions and accomplish procedural planning in embodied AI tasks because of their powerful reasoning ability. However, there is still a lack of study on how vision language models (VLMs) behave when multi-modal task inputs are considered. Counterfactual planning, which evaluates the model's reasoning ability over alternative task situations, is also underexploited. To evaluate planning ability in both its multi-modal and counterfactual aspects, we propose ActPlan-1K. ActPlan-1K is a multi-modal planning benchmark constructed based on ChatGPT and household activity simulator iGibson2. The benchmark consists of 153 activities and 1,187 instances. Each instance describing one activity has a natural language task description and multiple environment images from the simulator. The gold plan of each instance consists of action sequences over the objects in the provided scenes. Both correctness and commonsense satisfaction are evaluated on typical VLMs. It turns out that current VLMs still struggle to generate human-level procedural plans for both normal activities and counterfactual activities. We further provide automatic evaluation metrics by fine-tuning a BLEURT model to facilitate future research on our benchmark.
摘要：大型语言模型（LLM）因其强大的推理能力而被采用来处理文本任务描述并完成具身化 AI 任务中的程序规划。然而，对于视觉语言模型（VLM）在考虑多模态任务输入时的行为方式的研究仍然不足。评估模型对替代任务情况的推理能力的反事实规划也尚未得到充分开发。为了评估多模态和反事实方面的规划能力，我们提出了 ActPlan-1K。ActPlan-1K 是基于 ChatGPT 和家庭活动模拟器 iGibson2 构建的多模态规划基准。该基准包括 153 项活动和 1,187 个实例。描述一项活动的每个实例都有一个自然语言任务描述和来自模拟器的多个环境图像。每个实例的黄金计划是在提供的场景中对对象执行的动作序列。在典型的 VLM 上评估正确性和常识满意度。事实证明，当前的 VLM 仍然难以为正常活动和反事实活动生成人类级别的程序计划。我们通过对 BLEURT 模型进行微调来进一步提供自动评估指标，以方便未来对我们的基准进行研究。

Title: Still Not Quite There! Evaluating Large Language Models for Comorbid Mental Health Diagnosis

Authors: Amey Hengle, Atharva Kulkarni, Shantanu Patankar, Madhumitha Chandrasekaran, Sneha D'Silva, Jemima Jacob, Rashmi Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03908
Pdf URL: https://arxiv.org/pdf/2410.03908
Copy Paste: [[2410.03908]] Still Not Quite There! Evaluating Large Language Models for Comorbid Mental Health Diagnosis(https://arxiv.org/abs/2410.03908)
Keywords: language model, gpt
Abstract: In this study, we introduce ANGST, a novel, first-of-its-kind benchmark for depression-anxiety comorbidity classification from social media posts. Unlike contemporary datasets that often oversimplify the intricate interplay between different mental health disorders by treating them as isolated conditions, ANGST enables multi-label classification, allowing each post to be simultaneously identified as indicating depression and/or anxiety. Comprising 2876 posts meticulously annotated by expert psychologists and an additional 7667 silver-labeled posts, ANGST offers a more representative sample of online mental health discourse. Moreover, we benchmark ANGST using various state-of-the-art language models, ranging from Mental-BERT to GPT-4. Our results provide significant insights into the capabilities and limitations of these models in complex diagnostic scenarios. While GPT-4 generally outperforms other models, none achieve an F1 score exceeding 72% in multi-class comorbid classification, underscoring the ongoing challenges in applying language models to mental health diagnostics.
摘要：在本研究中，我们引入了 ANGST，这是一项新颖的、首创的基准，用于从社交媒体帖子中对抑郁-焦虑共病进行分类。与当代数据集不同，当代数据集通常将不同的心理健康障碍视为孤立的疾病，从而过度简化它们之间复杂的相互作用，而 ANGST 可以进行多标签分类，允许同时识别每篇帖子是抑郁还是焦虑。ANGST 包含 2876 篇由专业心理学家精心注释的帖子和另外 7667 篇银标帖子，它提供了更具代表性的在线心理健康话语样本。此外，我们使用各种最先进的语言模型（从 Mental-BERT 到 GPT-4）对 ANGST 进行基准测试。我们的结果提供了有关这些模型在复杂诊断场景中的能力和局限性的重要见解。虽然 GPT-4 通常优于其他模型，但没有一个模型在多类共病分类中达到超过 72% 的 F1 得分，这凸显了将语言模型应用于心理健康诊断的持续挑战。

Title: Structured List-Grounded Question Answering

Authors: Mujeen Sung, Song Feng, James Gung, Raphael Shu, Yi Zhang, Saab Mansour
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03950
Pdf URL: https://arxiv.org/pdf/2410.03950
Copy Paste: [[2410.03950]] Structured List-Grounded Question Answering(https://arxiv.org/abs/2410.03950)
Keywords: language model, gpt
Abstract: Document-grounded dialogue systems aim to answer user queries by leveraging external information. Previous studies have mainly focused on handling free-form documents, often overlooking structured data such as lists, which can represent a range of nuanced semantic relations. Motivated by the observation that even advanced language models like GPT-3.5 often miss semantic cues from lists, this paper aims to enhance question answering (QA) systems for better interpretation and use of structured lists. To this end, we introduce the LIST2QA dataset, a novel benchmark to evaluate the ability of QA systems to respond effectively using list information. This dataset is created from unlabeled customer service documents using language models and model-based filtering processes to enhance data quality, and can be used to fine-tune and evaluate QA models. Apart from directly generating responses through fine-tuned models, we further explore the explicit use of Intermediate Steps for Lists (ISL), aligning list items with user backgrounds to better reflect how humans interpret list items before generating responses. Our experimental results demonstrate that models trained on LIST2QA with our ISL approach outperform baselines across various metrics. Specifically, our fine-tuned Flan-T5-XL model shows increases of 3.1% in ROUGE-L, 4.6% in correctness, 4.5% in faithfulness, and 20.6% in completeness compared to models without applying filtering and the proposed ISL method.
摘要：基于文档的对话系统旨在利用外部信息来回答用户查询。先前的研究主要侧重于处理自由格式的文档，通常忽略了列表等结构化数据，而列表可以表示一系列细微的语义关系。受 GPT-3.5 等高级语言模型也经常错过列表中的语义线索这一观察结果的启发，本文旨在增强问答 (QA) 系统，以更好地解释和使用结构化列表。为此，我们引入了 LIST2QA 数据集，这是一个用于评估 QA 系统使用列表信息有效响应能力的新基准。该数据集由未标记的客户服务文档创建，使用语言模型和基于模型的过滤过程来提高数据质量，可用于微调和评估 QA 模型。除了通过微调模型直接生成响应外，我们还进一步探索了列表中间步骤 (ISL) 的明确使用，将列表项与用户背景对齐，以更好地反映人类在生成响应之前如何解释列表项。我们的实验结果表明，使用我们的 ISL 方法在 LIST2QA 上训练的模型在各个指标上的表现都优于基线。具体来说，与未应用过滤和所提出的 ISL 方法的模型相比，我们经过微调的 Flan-T5-XL 模型在 ROUGE-L 上提高了 3.1%，在正确性上提高了 4.6%，在忠实度上提高了 4.5%，在完整性上提高了 20.6%。

Title: LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity

Authors: Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Ling Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03953
Pdf URL: https://arxiv.org/pdf/2410.03953
Copy Paste: [[2410.03953]] LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity(https://arxiv.org/abs/2410.03953)
Keywords: language model, llm, prompt, agent
Abstract: Combining large language models during training or at inference time has shown substantial performance gain over component LLMs. This paper presents LLM-TOPLA, a diversity-optimized LLM ensemble method with three unique properties: (i) We introduce the focal diversity metric to capture the diversity-performance correlation among component LLMs of an ensemble. (ii) We develop a diversity-optimized ensemble pruning algorithm to select the top-k sub-ensembles from a pool of N base LLMs. Our pruning method recommends top-performing LLM subensembles of size S, often much smaller than N. (iii) We generate new output for each prompt query by utilizing a learn-to-ensemble approach, which learns to detect and resolve the output inconsistency among all component LLMs of an ensemble. Extensive evaluation on four different benchmarks shows good performance gain over the best LLM ensemble methods: (i) In constrained solution set problems, LLM-TOPLA outperforms the best-performing ensemble (Mixtral) by 2.2% in accuracy on MMLU and the best-performing LLM ensemble (MoreAgent) on GSM8k by 2.1%. (ii) In generative tasks, LLM-TOPLA outperforms the top-2 performers (Llama70b/Mixtral) on SearchQA by 3.9x in F1, and on XSum by more than 38 in ROUGE-1. Our code and dataset, which contains outputs of 8 modern LLMs on 4 benchmarks, is available at this https URL.
摘要：在训练或推理时组合大型语言模型已显示出比组件 LLM 显著的性能提升。本文介绍了 LLM-TOPLA，一种多样性优化的 LLM 集成方法，具有三个独特属性：（i）我们引入了焦点多样性指标来捕捉集成中组件 LLM 之间的多样性-性能相关性。（ii）我们开发了一种多样性优化的集成剪枝算法，从 $N$ 个基础 LLM 池中选择前 k 个子集成。我们的剪枝方法推荐大小为 $S$（通常远小于 $N$）的表现最佳的 LLM 子集成。（iii）我们利用学习集成方法为每个提示查询生成新的输出，该方法学习检测并解决集成中所有组件 LLM 之间的输出不一致问题。在四个不同基准上进行的广泛评估表明，与最佳 LLM 集成方法相比，其性能提升显著：（i）在受限解集问题中，LLM-TOPLA 在 MMLU 上的准确率比表现最佳的集成 (Mixtral) 高出 2.2%，在 GSM8k 上比表现最佳的 LLM 集成 (MoreAgent) 高出 2.1%。（ii）在生成任务中，LLM-TOPLA 在 SearchQA 上的 F1 比排名前两名的 (Llama70b/Mixtral) 高出 $3.9\mathrm{x}$，在 ROUGE-1 上的 XSum 高出 $38$ 以上。我们的代码和数据集包含 4 个基准上的 8 个现代 LLM 的输出，可从此 https URL 获得
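
To make the pruning step in property (ii) concrete, here is a minimal sketch of selecting the top-k sub-ensembles of size S from N models by a diversity score. It uses average pairwise disagreement on held-out predictions as a simple stand-in; this is not the focal diversity metric from the paper, whose definition the abstract does not give. The model names and predictions are hypothetical.

```python
from itertools import combinations

def pairwise_disagreement(preds_a, preds_b):
    """Fraction of examples on which two models give different answers."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def ensemble_diversity(members, preds):
    """Average pairwise disagreement over all member pairs (needs >= 2 members)."""
    pairs = list(combinations(members, 2))
    return sum(pairwise_disagreement(preds[a], preds[b]) for a, b in pairs) / len(pairs)

def top_k_subensembles(preds, size, k):
    """Score every sub-ensemble of the given size and return the k most diverse.

    Ties are broken by member names so the result is deterministic.
    """
    scored = [
        (ensemble_diversity(combo, preds), combo)
        for combo in combinations(sorted(preds), size)
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:k]

# Hypothetical predictions of three models on four validation examples.
preds = {
    "m1": [0, 1, 1, 0],
    "m2": [0, 1, 1, 0],
    "m3": [1, 1, 0, 0],
}
best = top_k_subensembles(preds, size=2, k=1)
```

Exhaustive enumeration grows combinatorially with N, so this brute-force form is only practical for small pools. Diversity alone also ignores accuracy, so a real pruning criterion would have to account for both.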

Title: Grounding Language in Multi-Perspective Referential Communication

Authors: Zineng Tang, Lingjun Mao, Alane Suhr
Subjects: cs.CL, cs.AI, cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2410.03959
Pdf URL: https://arxiv.org/pdf/2410.03959
Copy Paste: [[2410.03959]] Grounding Language in Multi-Perspective Referential Communication(https://arxiv.org/abs/2410.03959)
Keywords: agent
Abstract: We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9% to 69.3% in communicative success and even outperforming the strongest proprietary model.
摘要：我们引入了一项任务和数据集，用于在多智能体环境中生成和理解指称表达。在此任务中，共享场景中的两个智能体必须考虑彼此的视觉视角（可能与它们自己的视角不同），才能生成和理解对场景中对象的引用以及它们之间的空间关系。我们收集了一个包含 2,970 个人类书写的指称表达的数据集，每个表达都与人类理解判断配对，并评估了自动模型作为与人类伙伴配对的说话者和听众的性能，发现模型在参考生成和理解方面的表现都落后于人类智能体配对。最后，我们尝试训练一个开放权重说话者模型，该模型在与听众配对时具有交流成功的证据，结果交流成功率从 58.9% 提高到 69.3%，甚至超过了最强大的专有模型。

Title: On the Influence of Gender and Race in Romantic Relationship Prediction from Large Language Models

Authors: Abhilasha Sancheti, Haozhe An, Rachel Rudinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03996
Pdf URL: https://arxiv.org/pdf/2410.03996
Copy Paste: [[2410.03996]] On the Influence of Gender and Race in Romantic Relationship Prediction from Large Language Models(https://arxiv.org/abs/2410.03996)
Keywords: language model
Abstract: We study the presence of heteronormative biases and prejudice against interracial romantic relationships in large language models by performing controlled name-replacement experiments for the task of relationship prediction. We show that models are less likely to predict romantic relationships for (a) same-gender character pairs than for different-gender pairs; and (b) intra/inter-racial character pairs involving Asian names as compared to Black, Hispanic, or White names. We examine the contextualized embeddings of first names and find that gender for Asian names is less discernible than for non-Asian names. We discuss the social implications of our findings, underlining the need to prioritize the development of inclusive and equitable technology.
摘要：我们通过对关系预测任务进行受控的姓名替换实验，研究大型语言模型中是否存在异性恋偏见和对跨种族恋爱关系的偏见。我们发现，模型预测恋爱关系的可能性较小：(a) 同性字符对的恋爱关系预测能力低于异性字符对；(b) 涉及亚洲姓名的种族内/跨种族字符对的恋爱关系预测能力低于黑人、西班牙裔或白人姓名。我们研究了名字的语境化嵌入，发现亚洲姓名的性别比非亚洲姓名更难辨别。我们讨论了研究结果的社会影响，强调需要优先发展包容性和公平的技术。

Title: Take It Easy: Label-Adaptive Self-Rationalization for Fact Verification and Explanation Generation

Authors: Jing Yang, Anderson Rocha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04002
Pdf URL: https://arxiv.org/pdf/2410.04002
Copy Paste: [[2410.04002]] Take It Easy: Label-Adaptive Self-Rationalization for Fact Verification and Explanation Generation(https://arxiv.org/abs/2410.04002)
Keywords: language model, gpt
Abstract: Computational methods to aid journalists in the task of fact-checking often require adapting a model to specific domains and generating explanations. However, most automated fact-checking methods rely on three-class datasets, which do not accurately reflect real-world misinformation. Moreover, fact-checking explanations are often generated based on text summarization of evidence, failing to address the relationship between the claim and the evidence. To address these issues, we extend the self-rationalization method, typically used in natural language inference (NLI) tasks, to fact verification. We propose a label-adaptive learning approach: first, we fine-tune a model to learn veracity prediction with annotated labels (step-1 model). Then, we fine-tune the step-1 model again to learn self-rationalization, using the same data and additional annotated explanations. Our results show that our label-adaptive approach improves veracity prediction by more than ten percentage points (Macro F1) on both the PubHealth and AVeriTec datasets, outperforming the GPT-4 model. Furthermore, to address the high cost of explanation annotation, we generated 64 synthetic explanations from three large language models (GPT-4-turbo, GPT-3.5-turbo, and Llama-3-8B) and few-shot fine-tuned our step-1 model. The few-shot synthetic explanation fine-tuned model performed comparably to the fully fine-tuned self-rationalization model, demonstrating the potential of low-budget learning with synthetic data. Our label-adaptive self-rationalization approach presents a promising direction for future research on real-world explainable fact-checking with different labeling schemes.
摘要：帮助记者完成任务的计算方法通常需要将模型调整到特定领域并生成解释。然而，大多数自动化事实核查方法依赖于三类数据集，而这些数据集并不能准确反映现实世界的错误信息。此外，事实核查解释通常是基于证据的文本摘要生成的，无法解决主张与证据之间的关系。为了解决这些问题，我们将自我合理化方法（通常用于自然语言推理 (NLI) 任务）扩展到事实验证。我们提出了一种标签自适应学习方法：首先，我们微调模型以学习带有注释标签的真实性预测（步骤 1 模型）。然后，我们再次微调步骤 1 模型以学习自我合理化，使用相同的数据和额外的注释解释。我们的结果表明，我们的标签自适应方法在 PubHealth 和 AVeriTec 数据集上将准确性预测提高了 10 个百分点以上（Macro F1），优于 GPT-4 模型。此外，为了解决解释注释的高成本问题，我们从三个大型语言模型（GPT-4-turbo、GPT-3.5-turbo 和 Llama-3-8B）生成了 64 个合成解释，并对我们的第 1 步模型进行了少量微调。少量合成解释微调模型的表现与完全微调的自合理化模型相当，展示了使用合成数据进行低成本学习的潜力。我们的标签自适应自合理化方法为未来使用不同标签方案进行现实世界可解释事实核查的研究提供了一个有希望的方向。

Title: A Simple yet Effective Training-free Prompt-free Approach to Chinese Spelling Correction Based on Large Language Models

Authors: Houquan Zhou, Zhenghua Li, Bo Zhang, Chen Li, Shaopeng Lai, Ji Zhang, Fei Huang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04027
Pdf URL: https://arxiv.org/pdf/2410.04027
Copy Paste: [[2410.04027]] A Simple yet Effective Training-free Prompt-free Approach to Chinese Spelling Correction Based on Large Language Models(https://arxiv.org/abs/2410.04027)
Keywords: language model, llm, prompt
Abstract: This work proposes a simple training-free prompt-free approach to leverage large language models (LLMs) for the Chinese spelling correction (CSC) task, which is totally different from all previous CSC approaches. The key idea is to use an LLM as a pure language model in a conventional manner. The LLM goes through the input sentence from the beginning, and at each inference step, produces a distribution over its vocabulary for deciding the next token, given a partial sentence. To ensure that the output sentence remains faithful to the input sentence, we design a minimal distortion model that utilizes pronunciation or shape similarities between the original and replaced characters. Furthermore, we propose two useful reward strategies to address practical challenges specific to the CSC task. Experiments on five public datasets demonstrate that our approach significantly improves LLM performance, enabling them to compete with state-of-the-art domain-general CSC models.
摘要：本研究提出了一种简单的无需训练、无需提示的方法，利用大型语言模型 (LLM) 完成中文拼写纠正 (CSC) 任务，这与之前的所有 CSC 方法完全不同。关键思想是以传统方式将 LLM 用作纯语言模型。LLM 从头开始检查输入句子，并在每个推理步骤中生成其词汇表的分布，以在给定部分句子的情况下决定下一个标记。为了确保输出句子忠实于输入句子，我们设计了一个最小失真模型，该模型利用原始字符和替换字符之间的发音或形状相似性。此外，我们提出了两种有用的奖励策略来解决特定于 CSC 任务的实际挑战。在五个公共数据集上进行的实验表明，我们的方法显着提高了 LLM 性能，使其能够与最先进的领域通用 CSC 模型相媲美。
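
The decoding idea described in this abstract (an LLM used as a plain language model, combined with a distortion model that keeps the output faithful to the input) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the bigram table stands in for an LLM, the confusion set and the probabilities are invented, and the search is greedy. The paper's actual distortion model and reward strategies are not reproduced here.

```python
import math

# Hypothetical toy bigram "language model": P(next_char | previous_char).
TOY_LM = {("苹", "果"): 0.9, ("苹", "过"): 0.01}
# Hypothetical confusion sets: characters with similar pronunciation or shape.
CONFUSABLE = {"过": ["果"]}
# Assumed distortion probabilities: keep the character vs. swap in a similar one.
P_KEEP, P_SIMILAR = 0.9, 0.1


def lm_logprob(prev: str, char: str) -> float:
    """Log-probability of `char` following `prev` (small default if unseen)."""
    return math.log(TOY_LM.get((prev, char), 1e-3))


def distortion_logprob(original: str, candidate: str) -> float:
    """Log-probability of observing `original` when `candidate` was intended."""
    return math.log(P_KEEP if candidate == original else P_SIMILAR)


def correct(sentence: str) -> str:
    """Greedy left-to-right decoding over the input characters."""
    output = ""
    for original in sentence:
        candidates = [original] + CONFUSABLE.get(original, [])
        prev = output[-1] if output else ""
        best = max(
            candidates,
            key=lambda c: lm_logprob(prev, c) + distortion_logprob(original, c),
        )
        output += best
    return output
```

In this toy setup the language-model score for "果" after "苹" outweighs the distortion penalty for replacing "过", so the misspelled final character is corrected while untouched characters are kept.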

Title: SyllableLM: Learning Coarse Semantic Units for Speech Language Models

Authors: Alan Baade, Puyuan Peng, David Harwath
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2410.04029
Pdf URL: https://arxiv.org/pdf/2410.04029
Copy Paste: [[2410.04029]] SyllableLM: Learning Coarse Semantic Units for Speech Language Models(https://arxiv.org/abs/2410.04029)
Keywords: language model
Abstract: Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed sized convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
摘要：语言模型需要标记化输入。然而，音频和视觉等连续数据的标记化策略通常基于简单的启发式方法，例如固定大小的卷积或离散聚类，这些方法不一定与数据的语义结构一致。特别是对于语音，波形的高分辨率（16,000 个样本/秒或更高）带来了重大挑战，因为基于语音的语言模型必须使用比基于文本的语言模型多几倍的每个单词的标记。在这项工作中，我们引入了一种可控的自监督技术，将语音表示合并为更粗的音节类单元，同时仍保留语义信息。我们通过 1) 通过分析预训练编码器损失中的相关性来提取噪声边界，以及 2) 使用新颖的蒸馏技术迭代改进模型表示来实现这一点。我们的方法以低至 5Hz 和 60bps 的速度产生可控速率的语义单元，并在音节分割和聚类中实现 SotA。使用这些粗略标记，我们成功训练了 SyllableLM，这是一种语音语言模型 (SpeechLM)，它在一系列口语语言建模任务上的表现与当前的 SotA SpeechLM 相当甚至更好。SyllableLM 还实现了效率的显著提升，训练计算量减少了 30 倍，时钟推理速度提高了 4 倍。

Title: Neuron-Level Sequential Editing for Large Language Models

Authors: Houcheng Jiang, Junfeng Fang, Tianyu Zhang, An Zhang, Ruipeng Wang, Tao Liang, Xiang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04045
Pdf URL: https://arxiv.org/pdf/2410.04045
Copy Paste: [[2410.04045]] Neuron-Level Sequential Editing for Large Language Models(https://arxiv.org/abs/2410.04045)
Keywords: language model, llm
Abstract: This work explores sequential model editing in large language models (LLMs), a critical task that involves modifying internal knowledge within LLMs continuously through multi-round editing, each incorporating updates or corrections to adjust the model outputs without the need for costly retraining. Existing model editing methods, especially those that alter model parameters, typically focus on single-round editing and often face significant challenges in sequential model editing, most notably issues of model forgetting and failure. To address these challenges, we introduce a new model editing method, namely Neuron-level Sequential Editing (NSE), tailored for supporting sequential model editing. Specifically, we optimize the target layer's hidden states using the model's original weights to prevent model failure. Furthermore, we iteratively select neurons in multiple layers for editing based on their activation values to mitigate model forgetting. Our empirical experiments demonstrate that NSE significantly outperforms current parameter-modifying model editing methods, marking a substantial advancement in the field of sequential model editing. Our code is released at this https URL.
摘要：这项工作探索了大型语言模型 (LLM) 中的顺序模型编辑，这是一项关键任务，涉及通过多轮编辑不断修改 LLM 中的内部知识，每次编辑都包含更新或更正以调整模型输出，而无需进行昂贵的重新训练。现有的模型编辑方法，尤其是那些改变模型参数的方法，通常侧重于单轮编辑，并且在顺序模型编辑中经常面临重大挑战，最明显的是模型遗忘和失败的问题。为了应对这些挑战，我们引入了一种新的模型编辑方法，即神经元级顺序编辑 (Neuron-level Sequential Editing, NSE)，专门用于支持顺序模型编辑。具体来说，我们使用模型的原始权重优化目标层的隐藏状态以防止模型失败。此外，我们根据激活值迭代地选择多层中的神经元进行编辑，以减轻模型遗忘。我们的实证实验表明，NSE 明显优于当前的修改参数模型编辑方法，标志着顺序模型编辑领域的重大进步。我们的代码发布在此 https URL 上。

Title: Large Language Models can Achieve Social Balance

Authors: Pedro Cisneros-Velarde
Subjects: cs.CL, cs.AI, cs.MA, cs.SI, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2410.04054
Pdf URL: https://arxiv.org/pdf/2410.04054
Copy Paste: [[2410.04054]] Large Language Models can Achieve Social Balance(https://arxiv.org/abs/2410.04054)
Keywords: language model, llm, agent
Abstract: Social balance is a concept in sociology which states that if every three individuals in a population achieve certain structures of positive or negative interactions, then the whole population ends up in one faction of positive interactions or divided between two or more antagonistic factions. In this paper, we consider a group of interacting large language models (LLMs) and study how, after continuous interactions, they can achieve social balance. Across three different LLM models, we found that social balance depends on (i) whether interactions are updated based on "relationships", "appraisals", or "opinions"; (ii) whether agents update their interactions based on homophily or influence from their peers; and (iii) the number of simultaneous interactions the LLMs consider. When social balance is achieved, its particular structure of positive or negative interactions depends on these three conditions and differs across LLM models and sizes. The stability of interactions and the justification for their update also vary across models. Thus, social balance is driven by the pre-training and alignment particular to each LLM model.
摘要：社会平衡是社会学中的一个概念，它指出，如果一个群体中的每三个个体都实现了某种正向或负向互动结构，那么整个群体最终会形成一个正向互动派系，或者分裂为两个或多个对立派系。在本文中，我们考虑了一组相互作用的大型语言模型 (LLM)，并研究它们在持续互动后如何实现社会平衡。在三种不同的 LLM 模型中，我们发现社会平衡取决于 (i) 互动是基于“关系”、“评价”还是“意见”进行更新；(ii) 代理是基于同质性还是同侪影响来更新互动；以及 (iii) LLM 考虑的同时互动数量。当实现社会平衡时，其特定的正向或负向互动结构取决于这三个条件，并且在不同的 LLM 模型和规模下有所不同。互动的稳定性及其更新的理由也因模型而异。因此，社会平衡由每个 LLM 模型特有的预训练和对齐驱动。

Title: Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks

Authors: Jiayi He, Hehai Lin, Qingyun Wang, Yi Fung, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04055
Pdf URL: https://arxiv.org/pdf/2410.04055
Copy Paste: [[2410.04055]] Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks(https://arxiv.org/abs/2410.04055)
Keywords: language model, llm
Abstract: While Vision-Language Models (VLMs) have shown remarkable abilities in visual and language reasoning tasks, they invariably generate flawed responses. Self-correction that instructs models to refine their outputs presents a promising solution to this issue. Previous studies have mainly concentrated on Large Language Models (LLMs), while the self-correction abilities of VLMs, particularly concerning both visual and linguistic information, remain largely unexamined. This study investigates the self-correction capabilities of VLMs during both inference and fine-tuning stages. We introduce a Self-Correction Learning (SCL) approach that enables VLMs to learn from their self-generated self-correction data through Direct Preference Optimization (DPO) without relying on external feedback, facilitating self-improvement. Specifically, we collect preferred and disfavored samples based on the correctness of initial and refined responses, which are obtained by two-turn self-correction with VLMs during the inference stage. Experimental results demonstrate that although VLMs struggle to self-correct effectively during iterative inference without additional fine-tuning and external feedback, they can enhance their performance and avoid previous mistakes through preference fine-tuning when their self-generated self-correction data are categorized into preferred and disfavored samples. This study emphasizes that self-correction is not merely a refinement process; rather, it should enhance the reasoning abilities of models through additional training, enabling them to generate high-quality responses directly without further refinement.
摘要：虽然视觉语言模型 (VLM) 在视觉和语言推理任务中表现出非凡的能力，但它们总是会产生有缺陷的响应。指示模型改进其输出的自我校正为解决此问题提供了一个有希望的解决方案。以前的研究主要集中在大型语言模型 (LLM) 上，而 VLM 的自我校正能力，特别是关于视觉和语言信息的自我校正能力，在很大程度上仍未得到检验。本研究调查了 VLM 在推理和微调阶段的自我校正能力。我们引入了一种自我校正学习 (SCL) 方法，使 VLM 能够通过直接偏好优化 (DPO) 从其自生成的自我校正数据中学习，而无需依赖外部反馈，从而促进自我改进。具体来说，我们根据初始和改进响应的正确性收集优选和不受欢迎的样本，这些响应是在推理阶段通过 VLM 的两轮自我校正获得的。实验结果表明，尽管在没有额外微调和外部反馈的情况下，VLM 在迭代推理过程中难以有效地自我修正，但当其自生成的自我修正数据被归类为偏好样本和不偏好样本时，它们可以通过偏好微调来提高性能并避免以前的错误。这项研究强调，自我修正不仅仅是一个改进过程；相反，它应该通过额外的训练来增强模型的推理能力，使它们能够直接生成高质量的响应而无需进一步改进。
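
As a rough illustration of how preferred and disfavored samples might be collected from two-turn self-correction, the sketch below builds DPO-style preference pairs from an initial and a refined response. The pairing rule (keep only cases where exactly one of the two responses is correct), the exact-match correctness check against a reference answer, and all names are assumptions for illustration. The abstract does not specify these details.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SelfCorrectionSample:
    question: str
    initial: str    # first-turn response
    refined: str    # second-turn (self-corrected) response
    reference: str  # reference answer used only to judge correctness (assumed)


def is_correct(response: str, reference: str) -> bool:
    """Hypothetical correctness check: exact match after trimming whitespace."""
    return response.strip() == reference.strip()


def build_preference_pair(sample: SelfCorrectionSample) -> Optional[dict]:
    """Return a chosen/rejected pair, or None if correctness does not differ."""
    initial_ok = is_correct(sample.initial, sample.reference)
    refined_ok = is_correct(sample.refined, sample.reference)
    if initial_ok == refined_ok:
        return None  # no preference signal under this assumed rule
    if initial_ok:
        chosen, rejected = sample.initial, sample.refined
    else:
        chosen, rejected = sample.refined, sample.initial
    return {"prompt": sample.question, "chosen": chosen, "rejected": rejected}
```

Pairs in this prompt/chosen/rejected shape are the usual input to preference-optimization training. Whether the paper filters or weights pairs differently is not stated in the abstract.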

Title: LoRTA: Low Rank Tensor Adaptation of Large Language Models

Authors: Ignacio Hounie, Charilaos Kanatsoulis, Arnuv Tandon, Alejandro Ribeiro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04060
Pdf URL: https://arxiv.org/pdf/2410.04060
Copy Paste: [[2410.04060]] LoRTA: Low Rank Tensor Adaptation of Large Language Models(https://arxiv.org/abs/2410.04060)
Keywords: language model
Abstract: Low Rank Adaptation (LoRA) is a popular Parameter Efficient Fine Tuning (PEFT) method that effectively adapts large pre-trained models for downstream tasks. LoRA parameterizes model updates using low-rank matrices at each layer, significantly reducing the number of trainable parameters and, consequently, resource requirements during fine-tuning. However, the lower bound on the number of trainable parameters remains high due to the use of the low-rank matrix model. In this paper, we address this limitation by proposing a novel approach that employs a low rank tensor parametrization for model updates. The proposed low rank tensor model can significantly reduce the number of trainable parameters, while also allowing for finer-grained control over adapter size. Our experiments on Natural Language Understanding, Instruction Tuning, Preference Optimization and Protein Folding benchmarks demonstrate that our method is both efficient and effective for fine-tuning large language models, achieving a substantial reduction in the number of parameters while maintaining comparable performance.
摘要：低秩自适应 (LoRA) 是一种流行的参数高效微调 (PEFT) 方法，可有效地使大型预训练模型适应下游任务。LoRA 使用每层的低秩矩阵对模型更新进行参数化，从而显著减少可训练参数的数量，从而减少微调期间的资源需求。然而，由于使用低秩矩阵模型，可训练参数数量的下限仍然很高。在本文中，我们通过提出一种采用低秩张量参数化进行模型更新的新方法来解决这一限制。所提出的低秩张量模型可以显著减少可训练参数的数量，同时还允许对适配器大小进行更细粒度的控制。我们在自然语言理解、指令调整、偏好优化和蛋白质折叠基准上的实验表明，我们的方法对于微调大型语言模型既高效又有效，在保持可比性能的同时大幅减少参数数量。
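
To make the parameter-count argument concrete, the following sketch compares the number of trainable parameters of per-matrix LoRA with a rank-r CP (canonical polyadic) factorization of a single tensor that stacks all updates. The 4-way tensor layout (input dimension, output dimension, layer, matrix type), the model sizes, and the function names are assumptions for illustration. The paper's actual tensor parametrization may differ.

```python
def lora_param_count(d_in: int, d_out: int, n_layers: int, n_matrices: int, rank: int) -> int:
    """LoRA: each adapted matrix gets its own factors A (rank x d_in) and B (d_out x rank)."""
    return n_layers * n_matrices * rank * (d_in + d_out)


def cp_tensor_param_count(d_in: int, d_out: int, n_layers: int, n_matrices: int, rank: int) -> int:
    """Rank-r CP model of a d_in x d_out x n_layers x n_matrices tensor:
    one factor matrix per mode, each with `rank` columns (assumed layout)."""
    return rank * (d_in + d_out + n_layers + n_matrices)


# Hypothetical model sizes, chosen only to show the scale of the difference.
lora = lora_param_count(d_in=4096, d_out=4096, n_layers=32, n_matrices=4, rank=8)
cp = cp_tensor_param_count(d_in=4096, d_out=4096, n_layers=32, n_matrices=4, rank=8)
print(lora, cp)  # 8388608 65824
```

Under these assumptions the LoRA count grows with the product of layers and matrices, while the CP count grows only with their sum, which is why sharing factors across layers can lower the floor on trainable parameters.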

Title: ECon: On the Detection and Resolution of Evidence Conflicts

Authors: Cheng Jiayang, Chunkit Chan, Qianqian Zhuang, Lin Qiu, Tianhang Zhang, Tengxiao Liu, Yangqiu Song, Yue Zhang, Pengfei Liu, Zheng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04068
Pdf URL: https://arxiv.org/pdf/2410.04068
Copy Paste: [[2410.04068]] ECon: On the Detection and Resolution of Evidence Conflicts(https://arxiv.org/abs/2410.04068)
Keywords: language model, gpt, llm
Abstract: The rise of large language models (LLMs) has significantly influenced the quality of information in decision-making systems, leading to the prevalence of AI-generated content and challenges in detecting misinformation and managing conflicting information, or "inter-evidence conflicts." This study introduces a method for generating diverse, validated evidence conflicts to simulate real-world misinformation scenarios. We evaluate conflict detection methods, including Natural Language Inference (NLI) models, factual consistency (FC) models, and LLMs, on these conflicts (RQ1) and analyze LLMs' conflict resolution behaviors (RQ2). Our key findings include: (1) NLI and LLM models exhibit high precision in detecting answer conflicts, though weaker models suffer from low recall; (2) FC models struggle with lexically similar answer conflicts, while NLI and LLM models handle these better; and (3) stronger models like GPT-4 show robust performance, especially with nuanced conflicts. For conflict resolution, LLMs often favor one piece of conflicting evidence without justification and rely on internal knowledge if they have prior beliefs.
摘要：大型语言模型 (LLM) 的兴起极大地影响了决策系统中的信息质量，导致人工智能生成内容的盛行，以及检测错误信息和管理冲突信息或“证据间冲突”的挑战。本研究介绍了一种生成多样化、经过验证的证据冲突的方法，以模拟现实世界的错误信息场景。我们在这些冲突上评估了冲突检测方法，包括自然语言推理 (NLI) 模型、事实一致性 (FC) 模型和 LLM (RQ1)，并分析了 LLM 的冲突解决行为 (RQ2)。我们的主要发现包括：(1) NLI 和 LLM 模型在检测答案冲突方面表现出高精度，但较弱的模型召回率较低；(2) FC 模型难以处理词汇相似的答案冲突，而 NLI 和 LLM 模型可以更好地处理这些问题；(3) 更强大的模型（如 GPT-4）表现出稳健的性能，尤其是在处理细微冲突时。在冲突解决方面，LLM 往往会在没有给出理由的情况下偏向某一条相互冲突的证据，并且在已有先验信念时依赖其内部知识。

Title: PAD: Personalized Alignment at Decoding-Time

Authors: Ruizhe Chen, Xiaotian Zhang, Meng Luo, Wenhao Chai, Zuozhu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04070
Pdf URL: https://arxiv.org/pdf/2410.04070
Copy Paste: [[2410.04070]] PAD: Personalized Alignment at Decoding-Time(https://arxiv.org/abs/2410.04070)
Keywords: llm
Abstract: Aligning with personalized preferences, which vary significantly across cultural, educational, and political differences, poses a significant challenge due to the computational costs and data demands of traditional alignment methods. In response, this paper presents Personalized Alignment at Decoding-time (PAD), a novel framework designed to align LLM outputs with diverse personalized preferences during the inference phase, eliminating the need for additional training. By introducing a unique personalized reward modeling strategy, this framework decouples the text generation process from personalized preferences, facilitating the generation of generalizable token-level personalized rewards. The PAD algorithm leverages these rewards to guide the decoding process, dynamically tailoring the base model's predictions to personalized preferences. Extensive experimental results demonstrate that PAD not only outperforms existing training-based alignment methods in terms of aligning with diverse preferences but also shows significant generalizability to preferences unseen during training and scalability across different base models. This work advances the capability of LLMs to meet user needs in real-time applications, presenting a substantial step forward in personalized LLM alignment.
摘要：由于传统对齐方法的计算成本和数据需求，与因文化、教育和政治差异而存在巨大差异的个性化偏好对齐是一项重大挑战。为此，本文提出了解码时个性化对齐 (PAD)，这是一种新颖的框架，旨在在推理阶段将 LLM 输出与不同的个性化偏好对齐，从而无需进行额外的训练。通过引入独特的个性化奖励建模策略，该框架将文本生成过程与个性化偏好分离，从而促进可推广的 token 级个性化奖励的生成。PAD 算法利用这些奖励来指导解码过程，动态调整基础模型的预测以适应个性化偏好。大量实验结果表明，PAD 不仅在与不同偏好对齐方面优于现有的基于训练的对齐方法，而且对训练期间未见的偏好表现出显著的通用性，并且在不同基础模型之间具有可扩展性。这项工作提高了 LLM 满足实时应用中用户需求的能力，在个性化 LLM 对齐方面迈出了实质性的一步。
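
A minimal sketch of reward-guided decoding in the spirit of this abstract: at each step, the base model's token log-probabilities are combined with token-level personalized rewards before a token is chosen. The additive combination, the weight `beta`, and the toy values are assumptions for illustration and are not taken from the paper.

```python
from typing import Dict


def guided_next_token(
    base_logprobs: Dict[str, float],
    token_rewards: Dict[str, float],
    beta: float,
) -> str:
    """Pick the token maximizing base log-probability plus beta times its reward.

    Tokens without a reward entry are treated as having reward 0.0 (assumed).
    """
    return max(
        base_logprobs,
        key=lambda tok: base_logprobs[tok] + beta * token_rewards.get(tok, 0.0),
    )


# Toy example: the base model prefers "formal", the user's reward prefers "casual".
base = {"formal": -0.5, "casual": -1.0}
rewards = {"casual": 1.0}
```

With `beta = 0.0` the base model's choice is returned unchanged. Raising `beta` shifts the choice toward tokens the personalized reward favors, without any update to the base model's weights.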

Title: On Eliciting Syntax from Language Models via Hashing

Authors: Yiran Wang, Masao Utiyama
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.04074
Pdf URL: https://arxiv.org/pdf/2410.04074
Copy Paste: [[2410.04074]] On Eliciting Syntax from Language Models via Hashing(https://arxiv.org/abs/2410.04074)
Keywords: language model
Abstract: Unsupervised parsing, also known as grammar induction, aims to infer syntactic structure from raw text. Recently, binary representation has exhibited remarkable information-preserving capabilities at both lexicon and syntax levels. In this paper, we explore the possibility of leveraging this capability to deduce parsing trees from raw text, relying solely on the implicitly induced grammars within models. To achieve this, we upgrade the bit-level CKY from zero-order to first-order to encode the lexicon and syntax in a unified binary representation space, switch training from supervised to unsupervised under the contrastive hashing framework, and introduce a novel loss function to impose stronger yet balanced alignment signals. Our model shows competitive performance on various datasets; we therefore claim that our method is effective and efficient enough to acquire high-quality parsing trees from pre-trained language models at a low cost.
摘要：无监督解析，也称为语法归纳，旨在从原始文本中推断句法结构。最近，二进制表示在词汇和语法层面都表现出了出色的信息保存能力。在本文中，我们探索了利用此功能从原始文本中推断解析树的可能性，仅依靠模型中隐式归纳的语法。为了实现这一点，我们将位级 CKY 从零阶升级到一阶，以在统一的二进制表示空间中对词汇和语法进行编码，在对比哈希框架下将训练从监督转换为无监督，并引入了一种新颖的损失函数来施加更强但更平衡的对齐信号。我们的模型在各种数据集上都表现出了竞争力，因此，我们声称我们的方法足够有效和高效，可以以低成本从预训练的语言模型中获取高质量的解析树。

Title: GlobeSumm: A Challenging Benchmark Towards Unifying Multi-lingual, Cross-lingual and Multi-document News Summarization

Authors: Yangfan Ye, Xiachong Feng, Xiaocheng Feng, Weitao Ma, Libo Qin, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04087
Pdf URL: https://arxiv.org/pdf/2410.04087
Copy Paste: [[2410.04087]] GlobeSumm: A Challenging Benchmark Towards Unifying Multi-lingual, Cross-lingual and Multi-document News Summarization(https://arxiv.org/abs/2410.04087)
Keywords: llm, prompt
Abstract: News summarization in today's global scene can be daunting with its flood of multilingual content and varied viewpoints from different sources. However, current studies often neglect such real-world scenarios as they tend to focus solely on either single-language or single-document tasks. To bridge this gap, we aim to unify Multi-lingual, Cross-lingual and Multi-document Summarization into a novel task, i.e., MCMS, which encapsulates the real-world requirements all-in-one. Nevertheless, the lack of a benchmark inhibits researchers from adequately studying this invaluable problem. To tackle this, we have meticulously constructed the GLOBESUMM dataset by first collecting a wealth of multilingual news reports and restructuring them into event-centric format. Additionally, we introduce the method of protocol-guided prompting for high-quality and cost-effective reference annotation. In MCMS, we also highlight the challenge of conflicts between news reports, in addition to the issues of redundancies and omissions, further enhancing the complexity of GLOBESUMM. Through extensive experimental analysis, we validate the quality of our dataset and elucidate the inherent challenges of the task. We firmly believe that GLOBESUMM, given its challenging nature, will greatly contribute to the multilingual communities and the evaluation of LLMs.
摘要：在当今的全球新闻场景中，新闻摘要可能令人望而生畏，因为其内容多语言化，而且来自不同来源的观点也各不相同。然而，当前的研究往往忽略了这种现实世界的场景，因为它们往往只关注单语言或单文档任务。为了弥补这一差距，我们的目标是将多语言、跨语言和多文档摘要统一为一个新任务，即 MCMS，它将现实世界的需求全部囊括在内。然而，缺乏基准阻碍了研究人员充分研究这一宝贵问题。为了解决这个问题，我们精心构建了 GLOBESUMM 数据集，首先收集了大量多语言新闻报道，并将其重组为以事件为中心的格式。此外，我们引入了协议引导提示的方法，以实现高质量且经济高效的参考注释。在 MCMS 中，除了冗余和遗漏问题之外，我们还强调了新闻报道之间冲突的挑战，这进一步增加了 GLOBESUMM 的复杂性。通过大量的实验分析，我们验证了数据集的质量，并阐明了该任务的固有挑战。我们坚信，鉴于 GLOBESUMM 的挑战性，它将为多语言社区和 LLM 评估做出巨大贡献。

Title: BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts

Authors: Maria-Eleni Zoumpoulidi, Georgios Paraskevopoulos, Alexandros Potamianos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04094
Pdf URL: https://arxiv.org/pdf/2410.04094
Copy Paste: [[2410.04094]] BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts(https://arxiv.org/abs/2410.04094)
Keywords: language model, llm, prompt
Abstract: Despite the continuous progress of Large Language Models (LLMs) across various tasks, their performance on mathematical problems and reasoning tasks remains limited. This limitation can be attributed, among other factors, to the inherent difficulty of these problems and the fact that solutions often consist of multiple steps, potentially of varying nature, making it challenging for a single prompting technique to execute all required steps. To address this, we introduce BloomWise, a new prompting technique, inspired by Bloom's Taxonomy, aiming to improve LLMs' performance in solving such problems by encouraging them to approach the problem starting from simple, i.e., remembering, and progressing to higher cognitive skills, i.e., analyzing, until the correct solution is reached. The decision regarding the need to employ more sophisticated cognitive skills is based on self-evaluation performed by the LLM. Thus, we encourage the LLM to deploy the appropriate cognitive processes. In extensive experiments across 4 popular math reasoning datasets, we have demonstrated the effectiveness of our proposed approach. We also present extensive ablations, analyzing the strengths of each module within our system.
摘要：尽管大型语言模型 (LLM) 在各种任务上不断取得进展，但它们在数学问题和推理任务上的表现仍然有限。这种限制可以归因于这些问题固有的难度，以及解决方案通常由多个步骤组成，这些步骤可能性质各异，这使得单一的提示技术难以执行所有必需的步骤。为了解决这个问题，我们引入了 BloomWise，这是一种新的提示技术，灵感来自布鲁姆分类法，旨在通过鼓励他们从简单（即记忆）开始解决问题，然后逐步发展到更高的认知技能（即分析），直到得到正确的解决方案，从而提高 LLM 在解决此类问题方面的表现。是否需要采用更复杂的认知技能的决定取决于 LLM 的自我评估。因此，我们鼓励 LLM 部署适当的认知过程。在 4 个流行的数学推理数据集上的大量实验中，我们证明了我们提出的方法的有效性。我们还进行了广泛的消融，分析了我们系统中每个模块的优势。

Title: A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models

Authors: Zhihao Wang, Shiyu Liu, Jianheng Huang, Zheng Wang, Yixuan Liao, Xiaoxin Chen, Junfeng Yao, Jinsong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04103
Pdf URL: https://arxiv.org/pdf/2410.04103
Copy Paste: [[2410.04103]] A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models(https://arxiv.org/abs/2410.04103)
Keywords: language model, llm
Abstract: Due to the continuous emergence of new data, version updates have become an indispensable requirement for Large Language Models (LLMs). The training paradigms for version updates of LLMs include pre-training from scratch (PTFS) and continual pre-training (CPT). Preliminary experiments demonstrate that PTFS achieves better pre-training performance, while CPT has lower training cost. Moreover, their performance and training cost gaps widen progressively with version updates. To investigate the underlying reasons for this phenomenon, we analyze the effect of learning rate adjustments during the two stages of CPT: preparing an initialization checkpoint and continual pre-training based on this checkpoint. We find that a large learning rate in the first stage and a complete learning rate decay process in the second stage are crucial for version updates of LLMs. Hence, we propose a learning rate path switching training paradigm. Our paradigm comprises one main path, where we pre-train an LLM with the maximal learning rate, and multiple branching paths, each of which corresponds to an update of the LLM with newly-added training data. Extensive experiments demonstrate the effectiveness and generalization of our paradigm. In particular, when training four versions of LLMs, our paradigm reduces the total training cost to 58% of that of PTFS, while maintaining comparable pre-training performance.
摘要：由于新数据的不断涌现，版本更新成为大型语言模型（LLM）不可或缺的需求。LLM 版本更新的训练范式包括从头预训练（PTFS）和持续预训练（CPT）。初步实验表明，PTFS 的预训练性能更佳，而 CPT 的训练成本较低。并且，随着版本的更新，二者的性能和训练成本差距不断扩大。为了探究这种现象的根本原因，我们分析了 CPT 两个阶段（准备初始化检查点和基于该检查点的持续预训练）中学习率调整的影响。我们发现，第一阶段较大的学习率和第二阶段完整的学习率衰减过程对于 LLM 的版本更新至关重要。因此，我们提出了一种学习率路径切换训练范式。我们的范式包含一条主路径，我们在其中以最大学习率预训练 LLM，以及多条分支路径，每条分支路径对应于使用新添加的训练数据对 LLM 进行更新。大量实验证明了我们范式的有效性和泛化能力。特别是，在训练四个版本的 LLM 时，我们的范式与 PTFS 相比将总训练成本降低至 58%，同时保持了相当的预训练性能。

Title: Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment

Authors: Chengfeng Dou, Ying Zhang, Zhi Jin, Wenpin Jiao, Haiyan Zhao, Yongqiang Zhao, Zhengwei Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04112
Pdf URL: https://arxiv.org/pdf/2410.04112
Copy Paste: [[2410.04112]] Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment(https://arxiv.org/abs/2410.04112)
Keywords: language model, llm, agent
Abstract: This research examines the use of Reinforcement Learning from AI Feedback (RLAIF) techniques to improve healthcare dialogue models, with the aim of tackling the challenges of preference-aligned data annotation while reducing the reliance on medical experts. We argue that the primary challenges in current RLAIF research for healthcare are the limitations of automated evaluation methods and the difficulties in accurately representing physician preferences. To address these challenges, we present a new evaluation framework based on standardized patient examinations. This framework is designed to objectively assess the effectiveness of large language models (LLMs) in guiding users and following instructions, enabling a comprehensive comparison across different models. Furthermore, our investigation of effective ways to express physician preferences using Constitutional AI algorithms highlighted the particular effectiveness of flowcharts. Utilizing this finding, we introduce an innovative agent-based approach for annotating preference data. This approach autonomously creates medical dialogue flows tailored to the patient's condition, demonstrates strong generalization abilities, and reduces the need for expert involvement. Our results show that the agent-based approach outperforms existing RLAIF annotation methods in standardized patient examinations and surpasses current open source medical dialogue LLMs in various test scenarios.
摘要：本研究探讨了使用强化学习人工智能反馈 (RLAIF) 技术来改进医疗对话模型，旨在解决偏好对齐数据注释的挑战，同时减少对医疗专家的依赖。我们认为，当前医疗保健 RLAIF 研究的主要挑战是自动评估方法的局限性以及准确表示医生偏好的困难。为了应对这些挑战，我们提出了一个基于标准化患者检查的新评估框架。该框架旨在客观评估大型语言模型 (LLM) 在指导用户和遵循指令方面的有效性，从而能够对不同模型进行全面比较。此外，我们对使用宪法人工智能算法表达医生偏好的有效方法的研究突出了流程图的特殊有效性。利用这一发现，我们引入了一种创新的基于代理的方法来注释偏好数据。这种方法可以自主创建适合患者病情的医疗对话流程，具有强大的泛化能力，并减少了对专家参与的需求。我们的结果表明，基于代理的方法在标准化患者检查中优于现有的 RLAIF 注释方法，并且在各种测试场景中超越了当前的开源医学对话 LLM。

Title: From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression

Authors: Eunseong Choi, Sunkyung Lee, Minjin Choi, June Park, Jongwuk Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04139
Pdf URL: https://arxiv.org/pdf/2410.04139
Copy Paste: [[2410.04139]] From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression(https://arxiv.org/abs/2410.04139)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved significant performance gains using advanced prompting techniques over various tasks. However, the increasing length of prompts leads to high computational costs and often obscures crucial information. Prompt compression has been proposed to alleviate these issues, but it faces challenges in (i) capturing the global context and (ii) training the compressor effectively. To tackle these challenges, we introduce a novel prompt compression method, namely Reading To Compressing (R2C), utilizing the Fusion-in-Decoder (FiD) architecture to identify the important information in the prompt. Specifically, the cross-attention scores of the FiD are used to discern essential chunks and sentences from the prompt. R2C effectively captures the global context without compromising semantic consistency while avoiding the need for pseudo-labels to train the compressor. Empirical results show that R2C retains key contexts, enhancing the LLM performance by 6% in out-of-domain evaluations while reducing the prompt length by 80%.
摘要：大型语言模型 (LLM) 通过使用先进的提示技术在各种任务上取得了显著的性能提升。然而，提示长度的增加会导致高昂的计算成本，并且经常会掩盖关键信息。已经提出了提示压缩来缓解这些问题，但它面临着 (i) 捕获全局上下文和 (ii) 有效训练压缩器的挑战。为了应对这些挑战，我们引入了一种新颖的提示压缩方法，即阅读压缩 (R2C)，利用解码器融合 (FiD) 架构来识别提示中的重要信息。具体来说，FiD 的交叉注意力分数用于从提示中辨别出重要的块和句子。R2C 有效地捕获了全局上下文，而不会损害语义一致性，同时避免了训练压缩器时使用伪标签的必要性。实证结果表明，R2C 保留了关键上下文，在域外评估中将 LLM 性能提高了 6%，同时将提示长度缩短了 80%。
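
The selection step described here, using importance scores to decide which chunks of a prompt to keep, can be sketched as below. The scores are passed in as plain numbers standing in for cross-attention scores. The whitespace token count, the greedy budget rule, and the function name are assumptions for illustration rather than the R2C procedure itself.

```python
from typing import List


def compress_prompt(chunks: List[str], scores: List[float], token_budget: int) -> List[str]:
    """Keep the highest-scoring chunks that fit in the budget, in original order.

    `scores[i]` is an importance score for `chunks[i]` (a stand-in for a
    cross-attention score). Tokens are counted by whitespace splitting (assumed).
    """
    by_importance = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in by_importance:
        n_tokens = len(chunks[i].split())
        if used + n_tokens <= token_budget:
            kept.append(i)
            used += n_tokens
    # Restore the original ordering so the compressed prompt stays readable.
    return [chunks[i] for i in sorted(kept)]
```

For example, with chunks `["a b c", "d e", "f g h i"]`, scores `[0.1, 0.7, 0.2]`, and a budget of 5 tokens, the second chunk is kept first, the third does not fit, and the first fills the remaining budget.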

Title: Toxic Subword Pruning for Dialogue Response Generation on Large Language Models

Authors: Hongyuan Lu, Wai Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04155
Pdf URL: https://arxiv.org/pdf/2410.04155
Copy Paste: [[2410.04155]] Toxic Subword Pruning for Dialogue Response Generation on Large Language Models(https://arxiv.org/abs/2410.04155)
Keywords: language model, llm
Abstract: How to defend large language models (LLMs) from generating toxic content is an important research area. Yet, most research has focused on various model training techniques to remediate LLMs by updating their weights. A typical related research area is safety alignment. This, however, is often costly and tedious, and can expose the model to even more problems such as catastrophic forgetting if the training is not carefully handled by experienced NLP practitioners. We thus propose a simple yet effective and novel algorithm, namely Toxic Subword Pruning (ToxPrune), to prune the subwords contained in toxic words from BPE in trained LLMs. In contrast to previous work that shows pruning BPE tokens to be harmful to the task of machine translation, we surprisingly found it useful for preventing toxic content from being generated by LLMs. Our findings further suggest that ToxPrune simultaneously and clearly improves the toxic language model NSFW-3B on the task of dialogue response generation. We surprisingly found that ToxPrune can even clearly improve the official Llama-3.1-6B on the metric of dialogue diversity. Extensive automatic results and human evaluation indicate that ToxPrune could be helpful both for remediating toxic LLMs and for improving non-toxic LLMs on the task of dialogue response generation. (We plan to release the resources to facilitate future work.)
摘要：如何保护大型语言模型 (LLM) 不生成有害内容是一个重要的研究领域。然而，大多数研究都集中在各种模型训练技术上，通过更新 LLM 的权重来修复它们。一个典型的相关研究领域是安全对齐。然而，这通常成本高昂且繁琐，如果训练不是由经验丰富的 NLP 从业者仔细处理的，可能会使模型面临更多问题，例如灾难性遗忘。因此，我们提出了一种简单但有效且新颖的算法，即 Toxic Subword Pruning (ToxPrune)，用于在经过训练的 LLM 中修剪来自 BPE 的有害词所包含的子词。与之前的研究相比，该研究证明修剪 BPE 标记对机器翻译任务有害，我们惊讶地发现它在防止 LLM 上生成有害内容方面很有用。幸运的是，我们的研究结果表明 ToxPrune 同时明显改善了对话响应生成任务中的有害语言模型 NSFW-3B。我们意外地发现，ToxPrune 甚至可以在对话多样性指标上明显改善官方 Llama-3.1-6B。大量自动结果和人工评估表明，ToxPrune 既有助于修复有毒 LLM，又有助于在对话响应生成任务上改进无毒 LLM。（我们计划发布资源以促进未来的工作。）
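
One simple way to picture pruning subwords at generation time, with no change to model weights, is to mask the logits of the pruned tokens so they can never be selected. The sketch below is a toy illustration under that assumption. The vocabulary, the pruned set, and the greedy selection are hypothetical, and the abstract does not describe how ToxPrune is implemented.

```python
import math
from typing import List, Set


def prune_logits(logits: List[float], vocab: List[str], pruned: Set[str]) -> List[float]:
    """Set the logit of every pruned subword to -inf so it cannot be generated."""
    return [
        -math.inf if token in pruned else logit
        for token, logit in zip(vocab, logits)
    ]


def greedy_token(logits: List[float], vocab: List[str]) -> str:
    """Return the vocabulary entry with the highest logit."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best_index]


# Toy vocabulary with a placeholder standing in for a toxic subword.
vocab = ["hello", "badword", "friend"]
logits = [1.0, 3.0, 2.0]
safe_logits = prune_logits(logits, vocab, pruned={"badword"})
```

Without masking, greedy selection would return the placeholder toxic subword. After masking, the next-best token is chosen instead.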

Title: DiDOTS: Knowledge Distillation from Large-Language-Models for Dementia Obfuscation in Transcribed Speech

Authors: Dominika Woszczyk, Soteris Demetriou
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2410.04188
Pdf URL: https://arxiv.org/pdf/2410.04188
Copy Paste: [[2410.04188]] DiDOTS: Knowledge Distillation from Large-Language-Models for Dementia Obfuscation in Transcribed Speech(https://arxiv.org/abs/2410.04188)
Keywords: llm, hallucination, prompt
Abstract: Dementia is a sensitive neurocognitive disorder affecting tens of millions of people worldwide and its cases are expected to triple by 2050. Alarmingly, recent advancements in dementia classification make it possible for adversaries to violate affected individuals' privacy and infer their sensitive condition from speech transcriptions. Existing obfuscation methods in text have never been applied to dementia and depend on the availability of large labeled datasets, which are challenging to collect for sensitive medical attributes. In this work, we bridge this research gap and tackle the above issues by leveraging Large Language Models (LLMs) with diverse prompt designs (zero-shot, few-shot, and knowledge-based) to obfuscate dementia in speech transcripts. Our evaluation shows that LLMs are more effective dementia obfuscators than competing methods. However, they have billions of parameters, which renders them hard to train, store and share, and they are also fragile, suffering from hallucination, refusal and contradiction effects, among others. To further mitigate these, we propose a novel method, DiDOTS. DiDOTS distills knowledge from LLMs using a teacher-student paradigm and parameter-efficient fine-tuning. DiDOTS has one order of magnitude fewer parameters than its teacher LLM and can be fine-tuned using three orders of magnitude fewer parameters than full fine-tuning. Our evaluation shows that, compared to prior work, DiDOTS retains the performance of LLMs, achieving 1.3x and 2.2x improvement in privacy performance on two datasets, while humans rate it as better in preserving utility even when compared to state-of-the-art paraphrasing models.

Title: Consistent Autoformalization for Constructing Mathematical Libraries

Authors: Lan Zhang, Xin Quan, Andre Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04194
Pdf URL: https://arxiv.org/pdf/2410.04194
Copy Paste: [[2410.04194]] Consistent Autoformalization for Constructing Mathematical Libraries(https://arxiv.org/abs/2410.04194)
Keywords: language model, llm, retrieval augmented generation
Abstract: Autoformalization is the task of automatically translating mathematical content written in natural language to a formal language expression. The growing language interpretation capabilities of Large Language Models (LLMs), including in formal languages, are lowering the barriers for autoformalization. However, LLMs alone are not capable of consistently and reliably delivering autoformalization, in particular as the complexity and specialization of the target domain grows. As the field evolves in the direction of systematically applying autoformalization to large mathematical libraries, the need to improve syntactic, terminological and semantic control increases. This paper proposes the coordinated use of three mechanisms, most-similar retrieval augmented generation (MS-RAG), denoising steps, and auto-correction with syntax error feedback (Auto-SEF), to improve autoformalization quality. The empirical analysis, across different models, demonstrates that these mechanisms can deliver autoformalization results which are syntactically, terminologically and semantically more consistent. These mechanisms can be applied across different LLMs and have been shown to deliver improved results across different model types.

Title: CS4: Measuring the Creativity of Large Language Models Automatically by Controlling the Number of Story-Writing Constraints

Authors: Anirudh Atmakuru, Jatin Nainani, Rohith Siddhartha Reddy Bheemreddy, Anirudh Lakkaraju, Zonghai Yao, Hamed Zamani, Haw-Shiuan Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04197
Pdf URL: https://arxiv.org/pdf/2410.04197
Copy Paste: [[2410.04197]] CS4: Measuring the Creativity of Large Language Models Automatically by Controlling the Number of Story-Writing Constraints(https://arxiv.org/abs/2410.04197)
Keywords: language model, llm, prompt
Abstract: Evaluating the creativity of large language models (LLMs) in story writing is difficult because LLM-generated stories could seemingly look creative but be very similar to some existing stories in their huge and proprietary training corpus. To overcome this challenge, we introduce a novel benchmark dataset with varying levels of prompt specificity: CS4 (Comparing the Skill of Creating Stories by Controlling the Synthesized Constraint Specificity). By increasing the number of requirements/constraints in the prompt, we can increase the prompt specificity and hinder LLMs from retelling high-quality narratives in their training data. Consequently, CS4 empowers us to indirectly measure the LLMs' creativity without human annotations. Our experiments on LLaMA, Gemma, and Mistral not only highlight the creativity challenges LLMs face when dealing with highly specific prompts but also reveal that different LLMs perform very differently under different numbers of constraints and achieve different balances between the model's instruction-following ability and narrative coherence. Additionally, our experiments on OLMo suggest that Learning from Human Feedback (LHF) can help LLMs select better stories from their training data but has limited influence in boosting LLMs' ability to produce creative stories that are unseen in the training corpora. The benchmark is released at this https URL.

Title: LongGenBench: Long-context Generation Benchmark

Authors: Xiang Liu, Peijie Dong, Xuming Hu, Xiaowen Chu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04199
Pdf URL: https://arxiv.org/pdf/2410.04199
Copy Paste: [[2410.04199]] LongGenBench: Long-context Generation Benchmark(https://arxiv.org/abs/2410.04199)
Keywords: language model, llm
Abstract: Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text that spans across lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating that LLMs respond with a single, cohesive long-context answer. Upon extensive evaluation using LongGenBench, we observe that: (1) both API accessed and open source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API accessed models, and the Qwen2 series exhibiting the least degradation in LongGenBench among open source models.

Title: Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension

Authors: Ning Wang, Zekun Li, Tongxin Bai, Guoqi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04211
Pdf URL: https://arxiv.org/pdf/2410.04211
Copy Paste: [[2410.04211]] Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension(https://arxiv.org/abs/2410.04211)
Keywords: language model
Abstract: Modeling long sequences is crucial for various large-scale models; however, extending existing architectures to handle longer sequences presents significant technical and resource challenges. In this paper, we propose an efficient and flexible attention architecture that enables the extension of context lengths in large language models with reduced computational resources and fine-tuning time compared to other excellent methods. Specifically, we introduce correlation-aware selection and merging mechanisms to facilitate efficient sparse attention. In addition, we also propose a novel data augmentation technique involving positional encodings to enhance generalization to unseen positions. The results are as follows: First, using a single A100, we achieve fine-tuning on Llama2-7B with a sequence length of 32K, which is more efficient than other methods that rely on subsets for regression. Second, we present a comprehensive method for extending context lengths across the pre-training, fine-tuning, and inference phases. During pre-training, our attention mechanism partially breaks translation invariance during token selection, so we apply positional encodings only to the selected tokens. This approach achieves relatively high performance and significant extrapolation capabilities. For fine-tuning, we introduce Cyclic, Randomly Truncated, and Dynamically Growing NTK Positional Embedding (CRD NTK). This design allows fine-tuning with a sequence length of only 16K, enabling models such as Llama2-7B and Mistral-7B to perform inference with context lengths of up to 1M or even arbitrary lengths. Our method achieves 100% accuracy on the passkey task with a context length of 4M and maintains stable perplexity at a 1M context length. This represents at least a 64-fold reduction in resource requirements compared to traditional full-attention mechanisms, while still achieving competitive performance.

Title: Persona Knowledge-Aligned Prompt Tuning Method for Online Debate

Authors: Chunkit Chan, Cheng Jiayang, Xin Liu, Yauwai Yim, Yuxin Jiang, Zheye Deng, Haoran Li, Yangqiu Song, Ginny Y. Wong, Simon See
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04239
Pdf URL: https://arxiv.org/pdf/2410.04239
Copy Paste: [[2410.04239]] Persona Knowledge-Aligned Prompt Tuning Method for Online Debate(https://arxiv.org/abs/2410.04239)
Keywords: language model, gpt, prompt, chat
Abstract: Debate is the process of exchanging viewpoints or convincing others on a particular issue. Recent research has provided empirical evidence that the persuasiveness of an argument is determined not only by language usage but also by communicator characteristics. Researchers have paid much attention to aspects of languages, such as linguistic features and discourse structures, but combining argument persuasiveness and impact with the social personae of the audience has not been explored due to the difficulty and complexity. We have observed the impressive simulation and personification capability of ChatGPT, indicating a giant pre-trained language model may function as an individual to provide personae and exert unique influences based on diverse background knowledge. Therefore, we propose a persona knowledge-aligned framework for argument quality assessment tasks from the audience side. This is the first work that leverages the emergence of ChatGPT and injects such audience personae knowledge into smaller language models via prompt tuning. The performance of our pipeline demonstrates significant and consistent improvement compared to competitive architectures.

Title: Adaptive Question Answering: Enhancing Language Model Proficiency for Addressing Knowledge Conflicts with Source Citations

Authors: Sagi Shaier, Ari Kobren, Philip Ogren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04241
Pdf URL: https://arxiv.org/pdf/2410.04241
Copy Paste: [[2410.04241]] Adaptive Question Answering: Enhancing Language Model Proficiency for Addressing Knowledge Conflicts with Source Citations(https://arxiv.org/abs/2410.04241)
Keywords: language model, prompt
Abstract: Resolving knowledge conflicts is a crucial challenge in Question Answering (QA) tasks, as the internet contains numerous conflicting facts and opinions. While some research has made progress in tackling ambiguous settings where multiple valid answers exist, these approaches often neglect to provide source citations, leaving users to evaluate the factuality of each answer. On the other hand, existing work on citation generation has focused on unambiguous settings with single answers, failing to address the complexity of real-world scenarios. Despite the importance of both aspects, no prior research has combined them, leaving a significant gap in the development of QA systems. In this work, we bridge this gap by proposing the novel task of QA with source citation in ambiguous settings, where multiple valid answers exist. To facilitate research in this area, we create a comprehensive framework consisting of: (1) five novel datasets, obtained by augmenting three existing reading comprehension datasets with citation meta-data across various ambiguous settings, such as distractors and paraphrasing; (2) the first ambiguous multi-hop QA dataset featuring real-world, naturally occurring contexts; (3) two new metrics to evaluate models' performances; and (4) several strong baselines using rule-based, prompting, and finetuning approaches over five large language models. We hope that this new task, datasets, metrics, and baselines will inspire the community to push the boundaries of QA research and develop more trustworthy and interpretable systems.

Title: Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia

Authors: Tomás Feith, Akhil Arora, Martin Gerlach, Debjit Paul, Robert West
Subjects: cs.CL, cs.AI, cs.IR, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2410.04254
Pdf URL: https://arxiv.org/pdf/2410.04254
Copy Paste: [[2410.04254]] Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia(https://arxiv.org/abs/2410.04254)
Keywords: gpt, llm, prompt
Abstract: Links are a fundamental part of information networks, turning isolated pieces of knowledge into a network of information that is much richer than the sum of its parts. However, adding a new link to the network is not trivial: it requires not only the identification of a suitable pair of source and target entities but also the understanding of the content of the source to locate a suitable position for the link in the text. The latter problem has not been addressed effectively, particularly in the absence of text spans in the source that could serve as anchors to insert a link to the target entity. To bridge this gap, we introduce and operationalize the task of entity insertion in information networks. Focusing on the case of Wikipedia, we empirically show that this problem is both relevant and challenging for editors. We compile a benchmark dataset in 105 languages and develop a framework for entity insertion called LocEI (Localized Entity Insertion) and its multilingual variant XLocEI. We show that XLocEI outperforms all baseline models (including state-of-the-art prompt-based ranking with LLMs such as GPT-4) and that it can be applied in a zero-shot manner on languages not seen during training with minimal performance drop. These findings are important for applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.

Title: AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text

Authors: Ximing Lu, Melanie Sclar, Skyler Hallinan, Niloofar Mireshghallah, Jiacheng Liu, Seungju Han, Allyson Ettinger, Liwei Jiang, Khyathi Chandu, Nouha Dziri, Yejin Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04265
Pdf URL: https://arxiv.org/pdf/2410.04265
Copy Paste: [[2410.04265]] AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text(https://arxiv.org/abs/2410.04265)
Keywords: language model, gpt, llm, chat
Abstract: Creativity has long been considered one of the most difficult aspects of human intelligence for AI to mimic. However, the rise of Large Language Models (LLMs), like ChatGPT, has raised questions about whether AI can match or even surpass human creativity. We present CREATIVITY INDEX as the first step to quantify the linguistic creativity of a text by reconstructing it from existing text snippets on the web. CREATIVITY INDEX is motivated by the hypothesis that the seemingly remarkable creativity of LLMs may be attributable in large part to the creativity of human-written texts on the web. To compute CREATIVITY INDEX efficiently, we introduce DJ SEARCH, a novel dynamic programming algorithm that can search verbatim and near-verbatim matches of text snippets from a given document against the web. Experiments reveal that the CREATIVITY INDEX of professional human authors is on average 66.2% higher than that of LLMs, and that alignment reduces the CREATIVITY INDEX of LLMs by an average of 30.1%. In addition, we find that distinguished authors like Hemingway exhibit measurably higher CREATIVITY INDEX compared to other human writers. Finally, we demonstrate that CREATIVITY INDEX can be used as a surprisingly effective criterion for zero-shot machine text detection, surpassing the strongest existing zero-shot system, DetectGPT, by a significant margin of 30.2%, and even outperforming the strongest supervised system, GhostBuster, in five out of six domains.

Title: RoQLlama: A Lightweight Romanian Adapted Language Model

Authors: George-Andrei Dima, Andrei-Marius Avram, Cristian-George Crăciun, Dumitru-Clementin Cercel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04269
Pdf URL: https://arxiv.org/pdf/2410.04269
Copy Paste: [[2410.04269]] RoQLlama: A Lightweight Romanian Adapted Language Model(https://arxiv.org/abs/2410.04269)
Keywords: language model, llm, prompt
Abstract: The remarkable achievements obtained by open-source large language models (LLMs) in recent years have predominantly been concentrated on tasks involving the English language. In this paper, we aim to advance the performance of Llama2 models on Romanian tasks. We tackle the problem of reduced computing resources by using QLoRA for training. We release RoQLlama-7b, a quantized LLM, which shows equal or improved results compared to its full-sized counterpart when tested on seven Romanian downstream tasks in the zero-shot setup. Also, it consistently achieves higher average scores across all few-shot prompts. Additionally, we introduce a novel Romanian dataset, namely RoMedQA, which contains single-choice medical questions in Romanian.

Title: Evaluating Language Model Character Traits

Authors: Francis Rhys Ward, Zejia Yang, Alex Jackson, Randy Brown, Chandler Smith, Grace Colverd, Louis Thomson, Raymond Douglas, Patrik Bartak, Andrew Rowan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04272
Pdf URL: https://arxiv.org/pdf/2410.04272
Copy Paste: [[2410.04272]] Evaluating Language Model Character Traits(https://arxiv.org/abs/2410.04272)
Keywords: language model, prompt
Abstract: Language models (LMs) can exhibit human-like behaviour, but it is unclear how to describe this behaviour without undue anthropomorphism. We formalise a behaviourist view of LM character traits: qualities such as truthfulness, sycophancy, or coherent beliefs and intentions, which may manifest as consistent patterns of behaviour. Our theory is grounded in empirical demonstrations of LMs exhibiting different character traits, such as accurate and logically coherent beliefs, and helpful and harmless intentions. We find that the consistency with which LMs exhibit certain character traits varies with model size, fine-tuning, and prompting. In addition to characterising LM character traits, we evaluate how these traits develop over the course of an interaction. We find that traits such as truthfulness and harmfulness can be stationary, i.e., consistent over an interaction, in certain contexts, but may be reflective in different contexts, meaning they mirror the LM's behavior in the preceding interaction. Our formalism enables us to describe LM behaviour precisely in intuitive language, without undue anthropomorphism.

Title: Mechanistic Behavior Editing of Language Models

Authors: Joykirat Singh, Subhabrata Dutta, Tanmoy Chakraborty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04277
Pdf URL: https://arxiv.org/pdf/2410.04277
Copy Paste: [[2410.04277]] Mechanistic Behavior Editing of Language Models(https://arxiv.org/abs/2410.04277)
Keywords: language model, llm, prompt
Abstract: Large Language Models trained on web-scale text acquire language generation abilities that can solve a wide range of tasks, particularly when task knowledge is refined into the generative prior using in-context examples. However, spurious features learned from noisy data hinder their generalizability. Supervised finetuning can introduce task specificity, but introduces data inefficiency. Prior studies indicate that (i) noisy neural circuitries coexist with generalizable ones within LLMs, and (ii) finetuning typically enhances (or suppresses) existing abilities without introducing newer ones. Building upon these, we propose TaRot, a novel method for task adaptation. TaRot intervenes in the neural circuitries using learnable rotation matrices that are optimized using Bayesian Optimization, on labelled samples in the order of standard few-shot prompting examples. Experiments on multiple classification and generation tasks using LLMs of varying sizes reveal the efficacy of TaRot, improving upon both zero-shot and few-shot performance, with average improvements (across models and tasks) of 23.81% and 11.15%, respectively. The source code is available at this https URL

Title: Calibrating Expressions of Certainty

Authors: Peiqi Wang, Barbara D. Lam, Yingcheng Liu, Ameneh Asgari-Targhi, Rameswar Panda, William M. Wells, Tina Kapur, Polina Golland
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.04315
Pdf URL: https://arxiv.org/pdf/2410.04315
Copy Paste: [[2410.04315]] Calibrating Expressions of Certainty(https://arxiv.org/abs/2410.04315)
Keywords: language model
Abstract: We present a novel approach to calibrating linguistic expressions of certainty, e.g., "Maybe" and "Likely". Unlike prior work that assigns a single score to each certainty phrase, we model uncertainty as distributions over the simplex to capture their semantics more accurately. To accommodate this new representation of certainty, we generalize existing measures of miscalibration and introduce a novel post-hoc calibration method. Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration.

Title: ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model

Authors: Shuhao Gu, Mengdi Zhao, Bowen Zhang, Liangdong Wang, Jijie Li, Guang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04335
Pdf URL: https://arxiv.org/pdf/2410.04335
Copy Paste: [[2410.04335]] ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model(https://arxiv.org/abs/2410.04335)
Keywords: language model, llm
Abstract: The tokenizer is an essential component of large language models (LLMs), and a tokenizer with a high compression rate can improve the model's representation and processing efficiency. However, a tokenizer cannot ensure a high compression rate in all scenarios, and an increase in the average input and output lengths will increase the training and inference costs of the model. Therefore, it is crucial to find ways to improve the model's efficiency with minimal cost while maintaining the model's performance. In this work, we propose a method to improve model representation and processing efficiency by replacing the tokenizers of LLMs. We propose replacing and reinitializing the parameters of the model's input and output layers with the parameters of the original model, and training these parameters while keeping other parameters fixed. We conducted experiments on different LLMs, and the results show that our method can maintain the performance of the model after replacing the tokenizer, while significantly improving the decoding speed for long texts.
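
The ReTok abstract above says only that the new input and output layers are reinitialized from the original model's parameters; it does not give the rule. The sketch below shows one common choice, stated here as an assumption: a token shared by both vocabularies copies its old vector, and a new token takes the mean of the old vectors of the pieces the old tokenizer splits it into. All names (`init_new_embeddings`, `old_tokenize`, the toy vocabulary) are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def init_new_embeddings(new_tokens, old_vocab, old_emb, old_tokenize):
    """Build an embedding row for each new token from the old embedding table."""
    new_emb = []
    for token in new_tokens:
        if token in old_vocab:
            # Shared token: copy the original vector unchanged.
            new_emb.append(list(old_emb[old_vocab[token]]))
        else:
            # New token: average the vectors of its old-tokenizer pieces.
            pieces = old_tokenize(token)
            new_emb.append(mean_vector([old_emb[old_vocab[p]] for p in pieces]))
    return new_emb

# Toy example: the "old tokenizer" splits text into single characters.
old_vocab = {"a": 0, "b": 1, "c": 2}
old_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
char_tokenize = lambda text: list(text)

new_emb = init_new_embeddings(["a", "ab", "c"], old_vocab, old_emb, char_tokenize)
```

After such an initialization, the abstract's recipe would train only these input and output rows while the remaining parameters stay fixed.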

Title: Inference Scaling for Long-Context Retrieval Augmented Generation

Authors: Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04343
Pdf URL: https://arxiv.org/pdf/2410.04343
Copy Paste: [[2410.04343]] Inference Scaling for Long-Context Retrieval Augmented Generation(https://arxiv.org/abs/2410.04343)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance. In this work, we investigate inference scaling for retrieval augmented generation (RAG), exploring strategies beyond simply increasing the quantity of knowledge. We focus on two inference scaling strategies: in-context learning and iterative prompting. These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs' ability to effectively acquire and utilize contextual information. We address two key questions: (1) How does RAG performance benefit from the scaling of inference computation when optimally configured? (2) Can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters? Our observations reveal that increasing inference computation leads to nearly linear gains in RAG performance when optimally allocated, a relationship we describe as the inference scaling laws for RAG. Building on this, we further develop the computation allocation model to estimate RAG performance across different inference configurations. The model predicts optimal inference parameters under various computation constraints, which align closely with the experimental results. By applying these optimal configurations, we demonstrate that scaling inference compute on long-context LLMs achieves up to 58.9% gains on benchmark datasets compared to standard RAG.
摘要：推理计算的扩展释放了长上下文大型语言模型 (LLM) 在不同环境中的潜力。对于知识密集型任务，增加的计算通常用于整合更多外部知识。然而，如果不能有效利用这些知识，仅仅扩展上下文并不总能提高性能。在这项工作中，我们研究了检索增强生成 (RAG) 的推理扩展，探索了不仅仅是增加知识数量的策略。我们专注于两种推理扩展策略：上下文学习和迭代提示。这些策略提供了额外的灵活性来扩展测试时间计算（例如，通过增加检索到的文档或生成步骤），从而增强了 LLM 有效获取和利用上下文信息的能力。我们解决了两个关键问题：(1) 在最佳配置下，RAG 性能如何从推理计算的扩展中受益？(2) 我们能否通过对 RAG 性能和推理参数之间的关系进行建模来预测给定预算的最佳测试时间计算分配？我们的观察表明，在最佳分配的情况下，增加推理计算会导致 RAG 性能几乎线性提升，我们将这种关系描述为 RAG 的推理扩展定律。在此基础上，我们进一步开发了计算分配模型，以估算不同推理配置下的 RAG 性能。该模型预测了各种计算约束下的最佳推理参数，这与实验结果非常吻合。通过应用这些最佳配置，我们证明与标准 RAG 相比，在长上下文 LLM 上扩展推理计算在基准数据集上实现了高达 58.9% 的增益。

Title: Ordinal Preference Optimization: Aligning Human Preferences via NDCG

Authors: Yang Zhao, Yixin Wang, Mingzhang Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04346
Pdf URL: https://arxiv.org/pdf/2410.04346
Copy Paste: [[2410.04346]] Ordinal Preference Optimization: Aligning Human Preferences via NDCG(https://arxiv.org/abs/2410.04346)
Keywords: language model, llm
Abstract: Aligning Large Language Models (LLMs) with diverse human preferences is a pivotal technique for controlling model behaviors and enhancing generation quality. Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and their variants optimize language models by pairwise comparisons. However, when multiple responses are available, these approaches fall short of leveraging the extensive information in the ranking given by the reward models or human feedback. In this work, we propose a novel listwise approach named Ordinal Preference Optimization (OPO), which employs the Normalized Discounted Cumulative Gain (NDCG), a widely-used ranking metric, to better utilize relative proximity within ordinal multiple responses. We develop an end-to-end preference optimization algorithm by approximating NDCG with a differentiable surrogate loss. This approach builds a connection between ranking models in information retrieval and the alignment problem. In aligning multi-response datasets assigned with ordinal rewards, OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval. Moreover, we demonstrate that increasing the pool of negative samples can enhance model performance by reducing the adverse effects of trivial negatives.
摘要：将大型语言模型 (LLM) 与多样化的人类偏好对齐是控制模型行为和提高生成质量的关键技术。从人类反馈中强化学习 (RLHF)、直接偏好优化 (DPO) 及其变体通过成对比较来优化语言模型。然而，当有多个响应可用时，这些方法无法利用奖励模型或人类反馈给出的排名中的大量信息。在这项工作中，我们提出了一种名为序数偏好优化 (OPO) 的新型列表方法，它采用广泛使用的排名指标归一化折扣累积增益 (NDCG)，以更好地利用序数多个响应内的相对接近度。我们通过使用可微分替代损失来近似 NDCG，开发了一种端到端偏好优化算法。这种方法建立了信息检索中的排名模型与对齐问题之间的联系。在对齐分配有序数奖励的多响应数据集时，OPO 在评估集和 AlpacaEval 等通用基准上的表现优于现有的成对和列表方法。此外，我们证明增加负样本池可以通过减少琐碎负样本的不利影响来提高模型性能。
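
The NDCG metric named in the abstract above can be sketched as follows. This is only a minimal illustration of the standard (non-differentiable) metric, assuming the common exponential-gain form; it is not the differentiable surrogate loss the authors propose, and the function names `dcg` and `ndcg` are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: sum of (2^rel - 1) / log2(rank + 1), ranks starting at 1."""
    # enumerate() is 0-based, so the 1-based rank is i + 1 and the discount is log2(i + 2).
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(scores, rewards):
    """NDCG of the ordering induced by model `scores`, judged against ordinal `rewards`."""
    # Rank responses by model score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_rewards = [rewards[i] for i in order]
    # The ideal ordering sorts responses by their true rewards.
    ideal_dcg = dcg(sorted(rewards, reverse=True))
    return dcg(ranked_rewards) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Because the metric depends on a sort, it is piecewise constant in the scores, which is presumably why the abstract speaks of approximating it with a differentiable surrogate.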

Title: TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights

Authors: Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, Meng Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04350
Pdf URL: https://arxiv.org/pdf/2410.04350
Copy Paste: [[2410.04350]] TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights(https://arxiv.org/abs/2410.04350)
Keywords: language model, llm, prompt
Abstract: Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is treated as a single arm, ignoring the importance differences between tokens, which may affect optimization efficiency and make it difficult to achieve optimal results. In this work, we propose that the optimal data for DPO has equal expected rewards for each token in winning and losing responses, as there is no difference in token importance. However, since the optimal dataset is unavailable in practice, we propose using the original dataset for importance sampling to achieve unbiased optimization. Accordingly, we propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. Inspired by previous works, we estimate the token importance weights using the difference in prediction probabilities from a pair of contrastive LLMs. We explore three methods to construct these contrastive LLMs: (1) guiding the original LLM with contrastive prompts, (2) training two separate LLMs using winning and losing responses, and (3) performing forward and reverse DPO training with winning and losing responses. Experiments show that TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks. We also visualize the estimated weights, demonstrating their ability to identify key token positions.
摘要：直接偏好优化 (DPO) 因其简单性和有效性而被广泛用于大型语言模型 (LLM) 的偏好对齐。然而，DPO 是一个老虎机问题，其中将整个响应视为一个单臂，忽略了 token 之间的重要性差异，这可能会影响优化效率并使其难以获得最佳结果。在这项工作中，我们提出 DPO 的最佳数据在获胜和失败的响应中对每个 token 的预期奖励相等，因为 token 的重要性没有差异。然而，由于实践中无法获得最佳数据集，我们建议使用原始数据集进行重要性抽样以实现无偏优化。因此，我们提出了一个名为 TIS-DPO 的 token 级重要性抽样 DPO 目标，它根据每个 token 的奖励为其分配重要性权重。受前人研究的启发，我们使用一对对比 LLM 的预测概率差异来估计 token 重要性权重。我们探索了三种构建这些对比性 LLM 的方法：(1) 使用对比性提示引导原始 LLM，(2) 使用获胜和失败答案训练两个单独的 LLM，以及 (3) 使用获胜和失败答案进行正向和反向 DPO 训练。实验表明，TIS-DPO 在无害性和有用性对齐和总结任务上的表现明显优于各种基线方法。我们还可视化了估计的权重，展示了它们识别关键标记位置的能力。
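
To make the idea of token-level weighting concrete, here is a rough sketch of a DPO-style loss in which each token's log-ratio is multiplied by a weight. It is an assumption-laden simplification, not the TIS-DPO objective from the paper: the weights are taken as given inputs (the abstract estimates them from contrastive LLMs), and the names `weighted_log_ratio` and `token_weighted_dpo_loss` are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Weighted sum over tokens of log pi(token) - log pi_ref(token)."""
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))

def token_weighted_dpo_loss(win, lose, beta=0.1):
    """win / lose: (policy_logps, ref_logps, weights) for the preferred / rejected response.

    With all weights equal to 1 this reduces to the standard sequence-level DPO loss.
    """
    margin = beta * (weighted_log_ratio(*win) - weighted_log_ratio(*lose))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

Setting every weight to 1 recovers the usual DPO loss, in which the whole response is scored as a single unit.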

Title: Lens: Rethinking Multilingual Enhancement for Large Language Models

Authors: Weixiang Zhao, Yulin Hu, Jiahe Guo, Xingyu Sui, Tongtong Wu, Yang Deng, Yanyan Zhao, Bing Qin, Wanxiang Che, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04407
Pdf URL: https://arxiv.org/pdf/2410.04407
Copy Paste: [[2410.04407]] Lens: Rethinking Multilingual Enhancement for Large Language Models(https://arxiv.org/abs/2410.04407)
Keywords: language model, llm
Abstract: Despite the growing global demand for large language models (LLMs) that serve users from diverse linguistic backgrounds, most cutting-edge LLMs remain predominantly English-centric. This creates a performance gap across languages, restricting access to advanced AI services for non-English speakers. Current methods to enhance multilingual capabilities largely rely on data-driven post-training techniques, such as multilingual instruction tuning or continual pre-training. However, these approaches encounter significant challenges, including the scarcity of high-quality multilingual datasets and the limited enhancement of multilingual capabilities. They often suffer from off-target issues and catastrophic forgetting of central language abilities. To this end, we propose Lens, a novel approach to enhance multilingual capabilities of LLMs by leveraging their internal language representation spaces. Specifically, Lens operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs. Using the central language as a pivot, the target language is drawn closer to it within the language-agnostic subspace, allowing it to inherit well-established semantic representations. Meanwhile, in the language-specific subspace, the representations of the target and central languages are pushed apart, enabling the target language to express itself distinctly. Extensive experiments on one English-centric and two multilingual LLMs demonstrate that Lens effectively improves multilingual performance without sacrificing the original central language capabilities of the backbone model, achieving superior results with far fewer computational resources compared to existing post-training approaches.
摘要：尽管全球对服务于来自不同语言背景用户的大型语言模型 (LLM) 的需求日益增长，但大多数前沿的 LLM 仍然以英语为中心。这造成了不同语言之间的性能差距，限制了非英语人士获得高级 AI 服务。当前增强多语言能力的方法在很大程度上依赖于数据驱动的后训练技术，例如多语言指令调整或持续预训练。然而，这些方法面临着重大挑战，包括高质量多语言数据集的稀缺和多语言能力的有限增强。它们经常遭受脱靶问题和对核心语言能力的灾难性遗忘。为此，我们提出了 Lens，这是一种通过利用其内部语言表示空间来增强 LLM 多语言能力的新方法。具体而言，Lens 通过操纵 LLM 顶层语言无关和语言特定子空间内的隐藏表示来运行。使用中心语言作为支点，目标语言在语言无关子空间中向中心语言靠拢，从而使目标语言能够继承完善的语义表示。同时，在特定语言子空间中，目标语言和中心语言的表示被推开，使目标语言能够清晰地表达自己。在一个以英语为中心和两个多语言的 LLM 上进行的大量实验表明，Lens 有效地提高了多语言性能，而不会牺牲骨干模型原有的中心语言能力，与现有的后训练方法相比，它以更少的计算资源实现了卓越的结果。

Title: Hyper-multi-step: The Truth Behind Difficult Long-context Tasks

Authors: Yijiong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04422
Pdf URL: https://arxiv.org/pdf/2410.04422
Copy Paste: [[2410.04422]] Hyper-multi-step: The Truth Behind Difficult Long-context Tasks(https://arxiv.org/abs/2410.04422)
Keywords: language model, llm
Abstract: Long-context language models (LCLMs), characterized by their extensive context windows, are becoming increasingly popular. Meanwhile, many long-context benchmarks present challenging tasks that even the most advanced LCLMs struggle to complete. However, the underlying sources of various challenging long-context tasks have seldom been studied. To bridge this gap, we conduct experiments indicating that their difficulty stems primarily from two basic issues: "multi-matching retrieval," which requires the simultaneous retrieval of multiple items, and "logic-based retrieval," which necessitates logical judgment within retrieval criteria. These two problems, while seemingly straightforward, actually exceed the capabilities of LCLMs because they are proven to be hyper-multi-step (demanding numerous steps to solve) in nature. This finding could explain why LLMs struggle with more advanced long-context tasks, providing a more accurate perspective for rethinking solutions for them.
摘要：长上下文语言模型 (LCLM) 以其广泛的上下文窗口为特征，正变得越来越流行。同时，许多长上下文基准测试提出了具有挑战性的任务，即使是最先进的 LCLM 也难以完成。然而，各种具有挑战性的长上下文任务的根本原因很少被研究。为了弥补这一差距，我们进行了实验，以表明它们的难度主要源于两个基本问题：“多匹配检索”，这需要同时检索多个项目，以及“基于逻辑的检索”，这需要在检索标准内进行逻辑判断。这两个问题看似简单，但实际上超出了 LCLM 的能力，因为它们被证明本质上是超多步骤的（需要许多步骤才能解决）。这一发现可以解释为什么 LLM 难以完成更高级的长上下文任务，为重新思考它们的解决方案提供了更准确的视角。

Title: DAdEE: Unsupervised Domain Adaptation in Early Exit PLMs

Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04424
Pdf URL: https://arxiv.org/pdf/2410.04424
Copy Paste: [[2410.04424]] DAdEE: Unsupervised Domain Adaptation in Early Exit PLMs(https://arxiv.org/abs/2410.04424)
Keywords: language model
Abstract: Pre-trained Language Models (PLMs) exhibit good accuracy and generalization ability across various tasks using self-supervision, but their large size results in high inference latency. Early Exit (EE) strategies handle the issue by allowing the samples to exit from classifiers attached to the intermediary layers, but they do not generalize well, as exit classifiers can be sensitive to domain changes. To address this, we propose Unsupervised Domain Adaptation in EE framework (DADEE) that employs multi-level adaptation using knowledge distillation. DADEE utilizes GAN-based adversarial adaptation at each layer to achieve domain-invariant representations, reducing the domain gap between the source and target domain across all layers. The attached exits not only speed up inference but also enhance domain adaptation by reducing catastrophic forgetting and mode collapse, making it more suitable for real-world scenarios. Experiments on tasks such as sentiment analysis, entailment classification, and natural language inference demonstrate that DADEE consistently outperforms not only early exit methods but also various domain adaptation methods under domain shift scenarios. The anonymized source code is available at this https URL.
摘要：预训练语言模型 (PLM) 使用自监督在各种任务中表现出良好的准确性和泛化能力，但它们的规模较大，导致推理延迟较高。早期退出 (EE) 策略通过允许样本从附加到中间层的分类器中退出来解决这个问题，但它们的泛化效果不佳，因为退出分类器对域变化很敏感。为了解决这个问题，我们提出了 EE 框架中的无监督域自适应 (DADEE)，它采用知识蒸馏的多层自适应。DADEE 在每一层都利用基于 GAN 的对抗性自适应来实现域不变表示，从而减少所有层中源域和目标域之间的域差距。附加的退出不仅可以加快推理速度，还可以通过减少灾难性遗忘和模式崩溃来增强域自适应，使其更适合现实世界的场景。在情绪分析、蕴涵分类和自然语言推理等任务上的实验表明，在领域转移场景下，DADEE 不仅始终优于早期退出方法，而且也优于各种领域自适应方法。匿名源代码可在此 https URL 上获取。
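
The early-exit mechanism described above can be illustrated with a small sketch. It assumes a confidence-threshold exit rule (a common choice, but not confirmed by the abstract as DAdEE's criterion) and takes precomputed per-layer logits as input; `early_exit_predict` and its parameters are hypothetical names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(layer_logits, threshold=0.9):
    """Return (predicted_class, exit_layer) for one sample.

    layer_logits: logits from the exit classifier at each layer, shallow to deep.
    The sample exits at the first layer whose top probability reaches `threshold`;
    otherwise the prediction of the final layer is used.
    """
    if not layer_logits:
        raise ValueError("layer_logits must contain at least one layer")
    for layer, logits in enumerate(layer_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            break  # confident enough: skip the remaining layers
    return probs.index(confidence), layer
```

A lower threshold lets more samples leave at shallow layers, trading accuracy for latency. The abstract's concern is that exit classifiers tuned on one domain may be miscalibrated on another.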

Title: MindScope: Exploring cognitive biases in large language models through Multi-Agent Systems

Authors: Zhentao Xie, Jiabao Zhao, Yilei Wang, Jinxin Shi, Yanhong Bai, Xingjiao Wu, Liang He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04452
Pdf URL: https://arxiv.org/pdf/2410.04452
Copy Paste: [[2410.04452]] MindScope: Exploring cognitive biases in large language models through Multi-Agent Systems(https://arxiv.org/abs/2410.04452)
Keywords: language model, gpt, llm, retrieval-augmented generation, agent
Abstract: Detecting cognitive biases in large language models (LLMs) is a fascinating task that aims to probe the existing cognitive biases within these models. Current methods for detecting cognitive biases in language models generally suffer from incomplete detection capabilities and a restricted range of detectable bias types. To address this issue, we introduce the 'MindScope' dataset, which distinctively integrates static and dynamic elements. The static component comprises 5,170 open-ended questions spanning 72 cognitive bias categories. The dynamic component leverages a rule-based, multi-agent communication framework to facilitate the generation of multi-round dialogues. This framework is flexible and readily adaptable for various psychological experiments involving LLMs. In addition, we introduce a multi-agent detection method applicable to a wide range of detection tasks, which integrates Retrieval-Augmented Generation (RAG), competitive debate, and a reinforcement learning-based decision module. Demonstrating substantial effectiveness, this method has been shown to improve detection accuracy by as much as 35.10% compared to GPT-4. Codes and appendix are available at this https URL.
摘要：在大型语言模型 (LLM) 中检测认知偏差是一项有趣的任务，旨在探究这些模型中现有的认知偏差。当前用于检测语言模型中认知偏差的方法通常存在检测能力不完整和可检测偏差类型范围受限的问题。为了解决这个问题，我们引入了“MindScope”数据集，它独特地整合了静态和动态元素。静态部分包括 5,170 个开放式问题，涵盖 72 个认知偏差类别。动态部分利用基于规则的多智能体通信框架来促进多轮对话的生成。该框架灵活且易于适应涉及 LLM 的各种心理实验。此外，我们引入了一种适用于广泛检测任务的多智能体检测方法，该方法集成了检索增强生成 (RAG)、竞争性辩论和基于强化学习的决策模块。该方法表现出显著的有效性，与 GPT-4 相比，检测准确率提高了 35.10%。代码和附录可在此 https URL 上找到。

Title: CopyLens: Dynamically Flagging Copyrighted Sub-Dataset Contributions to LLM Outputs

Authors: Qichao Ma, Rui-Jie Zhu, Peiye Liu, Renye Yan, Fahong Zhang, Ling Liang, Meng Li, Zhaofei Yu, Zongwei Wang, Yimao Cai, Tiejun Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04454
Pdf URL: https://arxiv.org/pdf/2410.04454
Copy Paste: [[2410.04454]] CopyLens: Dynamically Flagging Copyrighted Sub-Dataset Contributions to LLM Outputs(https://arxiv.org/abs/2410.04454)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have become pervasive due to their knowledge absorption and text-generation capabilities. Concurrently, the copyright issue for pretraining datasets has been a pressing concern, particularly when generation includes specific styles. Previous methods either focus on the defense of identical copyrighted outputs or find interpretability by individual tokens with computational burdens. However, the gap between them exists, where direct assessments of how dataset contributions impact LLM outputs are missing. Once the model providers ensure copyright protection for data holders, a more mature LLM community can be established. To address these limitations, we introduce CopyLens, a new framework to analyze how copyrighted datasets may influence LLM responses. Specifically, a two-stage approach is employed: First, based on the uniqueness of pretraining data in the embedding space, token representations are initially fused for potential copyrighted texts, followed by a lightweight LSTM-based network to analyze dataset contributions. With such a prior, a contrastive-learning-based non-copyright OOD detector is designed. Our framework can dynamically face different situations and bridge the gap between current copyright detection methods. Experiments show that CopyLens improves efficiency and accuracy by 15.2% over our proposed baseline, 58.7% over prompt engineering methods, and 0.21 AUC over OOD detection baselines.
摘要：大型语言模型 (LLM) 因其知识吸收和文本生成能力而变得无处不在。同时，预训练数据集的版权问题一直是一个紧迫的问题，特别是当生成包含特定风格时。以前的方法要么专注于保护相同的版权输出，要么通过计算负担来寻找单个标记的可解释性。然而，它们之间存在差距，即缺少对数据集贡献如何影响 LLM 输出的直接评估。一旦模型提供者确保数据持有者的版权保护，就可以建立一个更成熟的 LLM 社区。为了解决这些限制，我们引入了 CopyLens，这是一个新的框架，用于分析版权数据集如何影响 LLM 响应。具体来说，采用两阶段方法：首先，基于嵌入空间中预训练数据的唯一性，首先融合潜在版权文本的标记表示，然后使用基于轻量级 LSTM 的网络来分析数据集贡献。有了这样的先验，设计了一个基于对比学习的非版权 OOD 检测器。我们的框架可以动态地应对不同的情况，并弥合当前版权检测方法之间的差距。实验表明，CopyLens 的效率和准确率比我们提出的基线提高了 15.2%，比及时工程方法提高了 58.7%，AUC 比 OOD 检测基线提高了 0.21。

Title: SWEb: A Large Web Dataset for the Scandinavian Languages

Authors: Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul Dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04456
Pdf URL: https://arxiv.org/pdf/2410.04456
Copy Paste: [[2410.04456]] SWEb: A Large Web Dataset for the Scandinavian Languages(https://arxiv.org/abs/2410.04456)
Keywords: language model
Abstract: This paper presents the hitherto largest pretraining dataset for the Scandinavian languages: the Scandinavian WEb (SWEb), comprising over one trillion tokens. The paper details the collection and processing pipeline, and introduces a novel model-based text extractor that significantly reduces complexity in comparison with rule-based approaches. We also introduce a new cloze-style benchmark for evaluating language models in Swedish, and use this test to compare models trained on the SWEb data to models trained on FineWeb, with competitive results. All data, models and code are shared openly.
摘要：本文介绍了迄今为止最大的斯堪的纳维亚语言预训练数据集：斯堪的纳维亚语 WEb (SWEb)，包含超过一万亿个标记。本文详细介绍了收集和处理流程，并介绍了一种新颖的基于模型的文本提取器，与基于规则的方法相比，它显著降低了复杂性。我们还引入了一种新的完形填空式基准来评估瑞典语语言模型，并使用此测试将使用 SWEb 数据训练的模型与使用 FineWeb 训练的模型进行比较，结果具有竞争力。所有数据、模型和代码均公开共享。

Title: Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information

Authors: Yongheng Zhang, Qiguang Chen, Jingxuan Zhou, Peng Wang, Jiasheng Si, Jin Wang, Wenpeng Lu, Libo Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04463
Pdf URL: https://arxiv.org/pdf/2410.04463
Copy Paste: [[2410.04463]] Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information(https://arxiv.org/abs/2410.04463)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) has become a vital technique for enhancing the performance of Large Language Models (LLMs), attracting increasing attention from researchers. One stream of approaches focuses on the iterative enhancement of LLMs by continuously verifying and refining their reasoning outputs for desired quality. Despite its impressive results, this paradigm faces two critical issues: (1) Simple verification methods: The current paradigm relies solely on a single verification method. (2) Wrong Information Ignorance: Traditional paradigms directly ignore wrong information during reasoning and refine the logic paths from scratch each time. To address these challenges, we propose Wrong-of-Thought (WoT), which includes two core modules: (1) Multi-Perspective Verification: A multi-perspective verification method for accurately refining the reasoning process and result, and (2) Wrong Information Utilization: Utilizing wrong information to alert LLMs and reduce the probability of LLMs making the same mistakes. Experiments on 8 popular datasets and 5 LLMs demonstrate that WoT surpasses all previous baselines. In addition, WoT exhibits powerful capabilities in difficult computation tasks.
摘要：思路链（CoT）已成为提升大型语言模型（LLM）性能的重要技术，受到越来越多研究者的关注。其中一类方法侧重于通过不断验证和改进推理输出以达到所需质量来迭代增强LLM。尽管取得了令人瞩目的成果，但该范式面临两个关键问题：（1）验证方法简单：当前范式仅依赖于单一的验证方法。（2）忽略错误信息：传统范式在推理过程中直接忽略错误信息，每次都从头开始改进逻辑路径。为了应对这些挑战，我们提出了思路错误（WoT），它包含两个核心模块：（1）多视角验证：一种用于准确改进推理过程和结果的多视角验证方法；（2）错误信息利用：利用错误信息提醒LLM并降低LLM犯同样错误的概率。在 8 个流行数据集和 5 个 LLM 上的实验表明，WoT 超越了之前的所有基线。此外，WoT 在困难的计算任务中表现出强大的能力。

Title: Revisiting In-context Learning Inference Circuit in Large Language Models

Authors: Hakaze Cho, Mariko Kato, Yoshihiro Sakai, Naoya Inoue
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.04468
Pdf URL: https://arxiv.org/pdf/2410.04468
Copy Paste: [[2410.04468]] Revisiting In-context Learning Inference Circuit in Large Language Models(https://arxiv.org/abs/2410.04468)
Keywords: language model
Abstract: In-context Learning (ICL) is an emerging few-shot learning paradigm on Language Models (LMs) whose inner mechanisms are unexplored. Existing works describe the inner processing of ICL, but they struggle to capture all the inference phenomena in large language models. Therefore, this paper proposes a comprehensive circuit to model the inference dynamics and tries to explain the observed phenomena of ICL. In detail, we divide ICL inference into 3 major operations: (1) Summarize: LMs encode every input text (demonstrations and queries) into linear representations in the hidden states with sufficient information to solve ICL tasks. (2) Semantics Merge: LMs merge the encoded representations of demonstrations with their corresponding label tokens to produce joint representations of labels and demonstrations. (3) Feature Retrieval and Copy: LMs search the joint representations similar to the query representation on a task subspace, and copy the searched representations into the query. Then, language model heads capture these copied label representations to a certain extent and decode them into predicted labels. The proposed inference circuit successfully captures many phenomena observed during the ICL process, making it a comprehensive and practical explanation of the ICL inference process. Moreover, ablation analysis shows that disabling the proposed steps seriously damages the ICL performance, suggesting the proposed inference circuit is a dominating mechanism. Additionally, we confirm and list some bypass mechanisms that solve ICL tasks in parallel with the proposed circuit.
摘要：上下文学习 (ICL) 是一种新兴的语言模型 (LM) 小样本学习范式，其内部机制尚未被探索。目前已有研究描述了 ICL 的内部处理，但它们难以捕捉大型语言模型中的所有推理现象。因此，本文提出了一种全面的回路来模拟推理动态并尝试解释观察到的 ICL 现象。具体来说，我们将 ICL 推理分为 3 个主要操作：(1) 总结：LM 将每个输入文本（演示和查询）编码为隐藏状态的线性表示，并提供足够的信息来解决 ICL 任务。(2) 语义合并：LM 将演示的编码表示与其相应的标签标记合并以生成标签和演示的联合表示。(3) 特征检索和复制：LM 在任务子空间上搜索与查询表示相似的联合表示，并将搜索到的表示复制到查询中。然后，语言模型头在一定程度上捕获这些复制的标签表示并将其解码为预测标签。所提出的推理电路成功捕获了 ICL 过程中观察到的许多现象，使其成为 ICL 推理过程的全面而实用的解释。此外，通过禁用所提出的步骤进行的消融分析严重损害了 ICL 性能，这表明所提出的推理电路是一种主导机制。此外，我们确认并列出了一些与所提出的电路并行解决 ICL 任务的旁路机制。

Title: Collapsed Language Models Promote Fairness

Authors: Jingxuan Xu, Wuyang Chen, Linyi Li, Yao Zhao, Yunchao Wei
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2410.04472
Pdf URL: https://arxiv.org/pdf/2410.04472
Copy Paste: [[2410.04472]] Collapsed Language Models Promote Fairness(https://arxiv.org/abs/2410.04472)
Keywords: language model, prompt
Abstract: To mitigate societal biases implicitly encoded in recent successful pretrained language models, a diverse array of approaches have been proposed to encourage model fairness, focusing on prompting, data augmentation, regularized fine-tuning, and more. Despite this development, it is nontrivial to reach a principled understanding of fairness and an effective algorithm that can consistently debias language models. In this work, by rigorous evaluations of Neural Collapse -- a learning phenomenon that happens in last-layer representations and classifiers in deep networks -- on fairness-related words, we find that debiased language models exhibit collapsed alignment between token representations and word embeddings. More importantly, this observation inspires us to design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods, while still preserving the performance of language models on standard natural language understanding tasks. We attach our code at this https URL.
摘要：为了减轻近期成功的预训练语言模型中隐含的社会偏见，人们提出了各种各样的方法来促进模型公平性，这些方法侧重于提示、数据增强、正则化微调等。尽管取得了进展，但要对公平性达成原则性的理解，并找到能够持续消除语言模型偏见的有效算法并非易事。在这项工作中，通过对公平性相关词汇的神经崩溃（深度网络的最后一层表示和分类器中发生的学习现象）进行严格评估，我们发现去偏语言模型在标记表示和词嵌入之间表现出崩溃对齐。更重要的是，这一观察结果启发我们设计一种原则性的微调方法，该方法可以在广泛的去偏方法中有效地提高公平性，同时仍保持语言模型在标准自然语言理解任务上的性能。我们将代码附加在此 https URL 上。

Title: Fine-Grained Prediction of Reading Comprehension from Eye Movements

Authors: Omer Shubi, Yoav Meiri, Cfir Avraham Hadar, Yevgeni Berzak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04484
Pdf URL: https://arxiv.org/pdf/2410.04484
Copy Paste: [[2410.04484]] Fine-Grained Prediction of Reading Comprehension from Eye Movements(https://arxiv.org/abs/2410.04484)
Keywords: language model
Abstract: Can human reading comprehension be assessed from eye movements in reading? In this work, we address this longstanding question using large-scale eyetracking data over textual materials that are geared towards behavioral analyses of reading comprehension. We focus on a fine-grained and largely unaddressed task of predicting reading comprehension from eye movements at the level of a single question over a passage. We tackle this task using three new multimodal language models, as well as a battery of prior models from the literature. We evaluate the models' ability to generalize to new textual items, new participants, and the combination of both, in two different reading regimes, ordinary reading and information seeking. The evaluations suggest that although the task is highly challenging, eye movements contain useful signals for fine-grained prediction of reading comprehension. Code and data will be made publicly available.
摘要：人类的阅读理解能力是否可以通过阅读中的眼球运动来评估？在这项研究中，我们使用针对文本材料的大规模眼球追踪数据来解决这个长期存在的问题，这些数据面向阅读理解的行为分析。我们专注于一项精细且基本未解决的任务，即从文章中单个问题的眼球运动预测阅读理解能力。我们使用三个新的多模态语言模型以及文献中的一系列先前模型来解决这项任务。我们评估了这些模型在两种不同的阅读模式（普通阅读和信息搜索）中推广到新文本项目、新参与者以及两者结合的能力。评估结果表明，尽管这项任务极具挑战性，但眼球运动包含有用的信号，可用于精细预测阅读理解能力。代码和数据将公开。

Title: Leveraging Large Language Models for Suicide Detection on Social Media with Limited Labels

Authors: Vy Nguyen, Chau Pham
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.04501
Pdf URL: https://arxiv.org/pdf/2410.04501
Copy Paste: [[2410.04501]] Leveraging Large Language Models for Suicide Detection on Social Media with Limited Labels(https://arxiv.org/abs/2410.04501)
Keywords: language model, llm, prompt
Abstract: The increasing frequency of suicidal thoughts highlights the importance of early detection and intervention. Social media platforms, where users often share personal experiences and seek help, could be utilized to identify individuals at risk. However, the large volume of daily posts makes manual review impractical. This paper explores the use of Large Language Models (LLMs) to automatically detect suicidal content in text-based social media posts. We propose a novel method for generating pseudo-labels for unlabeled data by prompting LLMs, along with traditional classification fine-tuning techniques to enhance label accuracy. To create a strong suicide detection model, we develop an ensemble approach involving prompting with Qwen2-72B-Instruct, and using fine-tuned models such as Llama3-8B, Llama3.1-8B, and Gemma2-9B. We evaluate our approach on the dataset of the Suicide Ideation Detection on Social Media Challenge, a track of the IEEE Big Data 2024 Big Data Cup. Additionally, we conduct a comprehensive analysis to assess the impact of different models and fine-tuning strategies on detection performance. Experimental results show that the ensemble model significantly improves the detection accuracy, by 5 percentage points compared with the individual models. It achieves a weighted F1 score of 0.770 on the public test set, and 0.731 on the private test set, providing a promising solution for identifying suicidal content in social media. Our analysis shows that the choice of LLMs affects the prompting performance, with larger models providing better accuracy. Our code and checkpoints are publicly available at this https URL.
摘要：自杀想法出现的频率越来越高，这凸显了早期发现和干预的重要性。社交媒体平台可以用来识别有自杀风险的个人，用户经常在社交媒体平台上分享个人经历并寻求帮助。然而，每天发布的帖子数量庞大，人工审核不切实际。本文探讨了使用大型语言模型 (LLM) 自动检测基于文本的社交媒体帖子中的自杀内容。我们提出了一种新方法，通过提示 LLM 为未标记数据生成伪标签，并结合传统的分类微调技术来提高标签准确性。为了创建一个强大的自杀检测模型，我们开发了一种集成方法，包括使用 Qwen2-72B-Instruct 进行提示，并使用微调模型，例如 Llama3-8B、Llama3.1-8B 和 Gemma2-9B。我们在社交媒体自杀意念检测挑战赛的数据集上评估了我们的方法，该挑战赛是 IEEE 大数据 2024 大数据杯的一个赛道。此外，我们还进行了全面分析，以评估不同模型和微调策略对检测性能的影响。实验结果表明，集成模型显著提高了检测准确率，与单个模型相比提高了 5%。它在公共测试集上的权重 F1 得分为 0.770，在私人测试集上的权重 F1 得分为 0.731，为识别社交媒体中的自杀内容提供了一种有前途的解决方案。我们的分析表明，LLM 的选择会影响提示性能，模型越大，准确率越高。我们的代码和检查点可在此 https URL 上公开获取。
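
As an illustration of the two ingredients mentioned above, the sketch below combines per-model predictions by majority vote and scores them with a support-weighted F1. The abstract does not say how the ensemble aggregates its members, so majority voting is an assumption here, as are the function names `majority_vote` and `weighted_f1`.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine equal-length label lists (one per model) by per-sample majority vote.

    Ties go to the label that appears first among the votes for that sample.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions_per_model)]

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support in y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        total += count * f1
    return total / len(y_true)
```

For example, `weighted_f1(y_true, majority_vote([preds_a, preds_b, preds_c]))` scores a three-model ensemble, where the `preds_*` lists are hypothetical per-model outputs.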

Title: ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Authors: Yibo Yan, Shen Wang, Jiahao Huo, Hang Li, Boyan Li, Jiamin Su, Xiong Gao, Yi-Fan Zhang, Tianlong Xu, Zhendong Chu, Aoxiao Zhong, Kun Wang, Hui Xiong, Philip S. Yu, Xuming Hu, Qingsong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04509
Pdf URL: https://arxiv.org/pdf/2410.04509
Copy Paste: [[2410.04509]] ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection(https://arxiv.org/abs/2410.04509)
Keywords: language model, gpt, llm
Abstract: As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to revolutionize artificial intelligence is particularly promising, especially in addressing mathematical reasoning tasks. Current mathematical benchmarks predominantly focus on evaluating MLLMs' problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, for enhancing reasoning capability in complicated settings. To fill this gap, we formally formulate the new task: multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in such a task. ErrorRadar evaluates two sub-tasks: error step identification and error categorization, providing a comprehensive framework for evaluating MLLMs' complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with rigorous annotation and rich metadata such as problem type and error category. Through extensive experiments, we evaluated both open-source and closed-source representative MLLMs, benchmarking their performance against educational expert evaluators. Results indicate that significant challenges remain, as the best-performing model, GPT-4o, is still around 10% behind human evaluation. The dataset will be available upon acceptance.
摘要：随着多模态大型语言模型 (MLLM) 领域的不断发展，它们在人工智能领域具有革命性潜力，尤其是在解决数学推理任务方面。当前的数学基准测试主要侧重于评估 MLLM 的解决问题能力，但在解决更复杂的场景（例如错误检测）方面存在一个关键的差距，无法提高复杂环境中的推理能力。为了填补这一空白，我们正式制定了新任务：多模态错误检测，并推出了 ErrorRadar，这是第一个旨在评估 MLLM 在此类任务中的能力的基准测试。ErrorRadar 评估两个子任务：错误步骤识别和错误分类，为评估 MLLM 的复杂数学推理能力提供了一个全面的框架。它由 2,500 个高质量的多模态 K-12 数学问题组成，这些问题是从教育机构中的真实学生互动中收集的，具有严格的注释和丰富的元数据，例如问题类型和错误类别。通过大量实验，我们评估了开源和闭源代表性 MLLM，并根据教育专家评估者对它们的性能进行了基准测试。结果表明，仍然存在重大挑战，因为性能最佳的 GPT-4o 仍然落后于人类评估约 10%。数据集将在接受后提供。

Title: DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination

Authors: Xuan Gong, Tianshi Ming, Xinpeng Wang, Zhihua Wei
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.04514
Pdf URL: https://arxiv.org/pdf/2410.04514
Copy Paste: [[2410.04514]] DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination(https://arxiv.org/abs/2410.04514)
Keywords: language model, gpt, llm, hallucination
Abstract: Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of the LLM decoder on image tokens is highly consistent with that of the visual encoder, and both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute the unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to overemphasize the redundant information and generate object hallucination. To address the issue, we propose DAMRO, a novel training-free strategy that Dives into the Attention Mechanism of LVLMs to Reduce Object hallucination. Specifically, our approach employs the classification token (CLS) of ViT to filter out high-attention outlier tokens scattered in the background and then eliminates their influence during the decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs. The code of our method will be released soon.
摘要：尽管大型视觉语言模型 (LVLM) 取得了巨大成功，但它们不可避免地存在幻觉。众所周知，LVLM 中的视觉编码器和大型语言模型 (LLM) 解码器都是基于 Transformer 的，允许模型提取视觉信息并通过注意力机制生成文本输出。我们发现 LLM 解码器在图像标记上的注意力分布与视觉编码器高度一致，并且两种分布都倾向于关注特定的背景标记，而不是图像中引用的对象。我们将意外的注意力分布归因于视觉编码器本身的固有缺陷，这会误导 LLM 过度强调冗余信息并产生对象幻觉。为了解决这个问题，我们提出了 DAMRO，这是一种新颖的免训练策略，它利用 LVLM 的注意力机制来减少对象幻觉。具体来说，我们的方法采用 ViT 的分类标记 (CLS) 来过滤掉散布在背景中的高注意力异常标记，然后在解码阶段消除它们的影响。我们使用 POPE、CHAIR、MME 和 GPT-4V 辅助评估等各种基准在包括 LLaVA-1.5、LLaVA-NeXT 和 InstructBLIP 在内的 LVLM 上评估了我们的方法。结果表明，我们的方法显著降低了这些异常标记的影响，从而有效缓解了 LVLM 的幻觉。我们方法的代码将很快发布。

Title: RevMUX: Data Multiplexing with Reversible Adapters for Efficient LLM Batch Inference

Authors: Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04519
Pdf URL: https://arxiv.org/pdf/2410.04519
Copy Paste: [[2410.04519]] RevMUX: Data Multiplexing with Reversible Adapters for Efficient LLM Batch Inference(https://arxiv.org/abs/2410.04519)
Keywords: language model, llm
Abstract: Large language models (LLMs) have brought a great breakthrough to the natural language processing (NLP) community, while also posing the challenge of handling concurrent customer queries due to their high throughput demands. Data multiplexing addresses this by merging multiple inputs into a single composite input, allowing more efficient inference through a shared forward pass. However, as distinguishing individual inputs within a composite input is challenging, conventional methods typically require training the entire backbone, yet still suffer from performance degradation. In this paper, we introduce RevMUX, a parameter-efficient data multiplexing framework that incorporates a reversible design in the multiplexer, which can be reused by the demultiplexer to perform reverse operations and restore individual samples for classification. Extensive experiments on four datasets and three types of LLM backbones demonstrate the effectiveness of RevMUX for enhancing LLM inference efficiency while retaining a satisfactory classification performance.
摘要：大型语言模型 (LLM) 为自然语言处理 (NLP) 社区带来了重大突破，同时也带来了处理并发客户查询的挑战，因为它们对吞吐量要求很高。数据复用通过将多个输入合并为单个复合输入来解决此问题，从而允许通过共享前向传递进行更高效的推理。然而，由于区分个体和复合输入具有挑战性，传统方法通常需要训练整个主干，但仍然会遭受性能下降的困扰。在本文中，我们介绍了 RevMUX，这是一个参数高效的数据复用框架，它在复用器中采用了可逆设计，可以被解复用器重用以执行反向操作并恢复单个样本进行分类。在四个数据集和三种类型的 LLM 主干上进行的大量实验证明了 RevMUX 在提高 LLM 推理效率的同时保持令人满意的分类性能的有效性。

Title: Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning

Authors: Yanrui Du, Sendong Zhao, Jiawei Cao, Ming Ma, Danyang Zhao, Fenglei Fan, Ting Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04524
Pdf URL: https://arxiv.org/pdf/2410.04524
Copy Paste: [[2410.04524]] Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning(https://arxiv.org/abs/2410.04524)
Keywords: language model, llm
Abstract: Instruction Fine-Tuning (IFT) has become an essential method for adapting base Large Language Models (LLMs) into variants for professional and private use. However, researchers have raised concerns over a significant decrease in LLMs' security following IFT, even when the IFT process involves entirely benign instructions (termed Benign IFT). Our study represents a pioneering effort to mitigate the security risks arising from Benign IFT. Specifically, we conduct a Module Robustness Analysis, aiming to investigate how LLMs' internal modules contribute to their security. Based on our analysis, we propose a novel IFT strategy, called the Modular Layer-wise Learning Rate (ML-LR) strategy. In our analysis, we implement a simple security feature classifier that serves as a proxy to measure the robustness of modules (e.g., $Q$/$K$/$V$). Our findings reveal that the module robustness shows clear patterns, varying regularly with the module type and the layer depth. Leveraging these insights, we develop a proxy-guided search algorithm to identify a robust subset of modules, termed Mods$_{Robust}$. During IFT, the ML-LR strategy employs differentiated learning rates for Mods$_{Robust}$ and the remaining modules. Our experimental results show that in security assessments, the application of our ML-LR strategy significantly mitigates the rise in harmfulness of LLMs following Benign IFT. Notably, our ML-LR strategy has little impact on the usability or expertise of LLMs following Benign IFT. Furthermore, we have conducted comprehensive analyses to verify the soundness and flexibility of our ML-LR strategy.
摘要：指令微调 (IFT) 已成为将基础大型语言模型 (LLM) 改编为专业和私人用途变体的重要方法。然而，研究人员担心，即使 IFT 过程涉及完全良性的指令（称为良性 IFT），LLM 的安全性在 IFT 之后也会显著下降。我们的研究代表了一项开创性的努力，旨在减轻良性 IFT 带来的安全风险。具体来说，我们进行了模块稳健性分析，旨在研究 LLM 的内部模块如何有助于其安全性。基于我们的分析，我们提出了一种新颖的 IFT 策略，称为模块化逐层学习率 (ML-LR) 策略。在我们的分析中，我们实现了一个简单的安全特征分类器，作为衡量模块（例如 $Q$/$K$/$V$ 等）稳健性的代理。我们的研究结果表明，模块稳健性显示出清晰的模式，并随模块类型和层深度而有规律地变化。利用这些见解，我们开发了一种代理引导搜索算法来识别一个强大的模块子集，称为 Mods$_{Robust}$。在 IFT 期间，ML-LR 策略对 Mods$_{Robust}$ 和其余模块采用差异化学习率。我们的实验结果表明，在安全评估中，应用我们的 ML-LR 策略可显著缓解良性 IFT 后 LLM 危害性的上升。值得注意的是，我们的 ML-LR 策略对良性 IFT 后 LLM 的可用性或专业性影响不大。此外，我们还进行了全面的分析，以验证我们的 ML-LR 策略的合理性和灵活性。

Title: FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering

Authors: Siqiao Xue, Tingting Chen, Fan Zhou, Qingyang Dai, Zhixuan Chu, Hongyuan Mei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04526
Pdf URL: https://arxiv.org/pdf/2410.04526
Copy Paste: [[2410.04526]] FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering(https://arxiv.org/abs/2410.04526)
Keywords: language model, gpt, llm
Abstract: In this paper, we introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). Our benchmark aims to evaluate the abilities of multimodal large language models (MLLMs) in answering questions that require advanced financial knowledge and sophisticated reasoning. It includes 1,758 meticulously collected question-answer pairs from university textbooks and exams, spanning 8 major subfields in finance including corporate finance, asset management, and financial engineering. Some of the QA pairs are written in Chinese or French, while a majority of them are in English. These questions are presented in a mixed format combining text and heterogeneous image types, such as charts, tables, and diagrams. We evaluate a range of state-of-the-art MLLMs on our benchmark, and our analysis shows that FAMMA poses a significant challenge for these models. Even advanced systems like GPT-4o and Claude-35-Sonnet achieve only 42% accuracy. Additionally, the open-source Qwen2-VL lags notably behind its proprietary counterparts. Lastly, we explore GPT o1-style reasoning chains to enhance the models' reasoning capabilities, which significantly improve error correction. Our FAMMA benchmark will facilitate future research to develop expert systems in financial QA. The leaderboard is available at this https URL.
摘要：在本文中，我们介绍了 FAMMA，这是一个用于金融多语言多模态问答 (QA) 的开源基准。我们的基准旨在评估多模态大型语言模型 (MLLM) 回答需要高级金融知识和复杂推理的问题的能力。它包括从大学教科书和考试中精心收集的 1,758 个问答对，涵盖金融领域的 8 个主要子领域，包括公司金融、资产管理和金融工程。一些问答对是用中文或法语写的，而大多数是用英文写的。这些问题以混合格式呈现，结合了文本和异构图像类型，例如图表、表格和图表。我们在基准上评估了一系列最先进的 MLLM，我们的分析表明 FAMMA 对这些模型构成了重大挑战。即使是像 GPT-4o 和 Claude-35-Sonnet 这样的先进系统也只能达到 42% 的准确率。此外，开源 Qwen2-VL 明显落后于其专有同类产品。最后，我们探索 GPT o1 风格的推理链，以增强模型的推理能力，从而显著提高纠错能力。我们的 FAMMA 基准将促进未来研究开发金融 QA 专家系统。排行榜可在此 https URL 上找到。

Title: How Does the Disclosure of AI Assistance Affect the Perceptions of Writing?

Authors: Zhuoyan Li, Chen Liang, Jing Peng, Ming Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04545
Pdf URL: https://arxiv.org/pdf/2410.04545
Copy Paste: [[2410.04545]] How Does the Disclosure of AI Assistance Affect the Perceptions of Writing?(https://arxiv.org/abs/2410.04545)
Keywords: language model
Abstract: Recent advances in generative AI technologies like large language models have boosted the incorporation of AI assistance in writing workflows, leading to the rise of a new paradigm of human-AI co-creation in writing. To understand how people perceive writings that are produced under this paradigm, in this paper, we conduct an experimental study to understand whether and how the disclosure of the level and type of AI assistance in the writing process would affect people's perceptions of the writing on various aspects, including their evaluation on the quality of the writing and their ranking of different writings. Our results suggest that disclosing the AI assistance in the writing process, especially if AI has provided assistance in generating new content, decreases the average quality ratings for both argumentative essays and creative stories. This decrease in the average quality ratings often comes with an increased level of variations in different individuals' quality evaluations of the same writing. Indeed, factors such as an individual's writing confidence and familiarity with AI writing assistants are shown to moderate the impact of AI assistance disclosure on their writing quality evaluations. We also find that disclosing the use of AI assistance may significantly reduce the proportion of writings produced with AI's content generation assistance among the top-ranked writings.
摘要：大型语言模型等生成式人工智能技术的最新进展推动了人工智能辅助写作工作流程的整合，从而催生了人机共同创作写作的新范式。为了了解人们如何看待在这种范式下创作的作品，本文进行了一项实验研究，以了解在写作过程中披露人工智能辅助的水平和类型是否以及如何影响人们对写作各个方面的看法，包括他们对写作质量的评价和对不同作品的排名。我们的结果表明，在写作过程中披露人工智能辅助，尤其是如果人工智能在生成新内容方面提供了帮助，会降低议论文和创意故事的平均质量评分。平均质量评分的下降通常伴随着不同个体对同一作品的质量评价差异的增加。事实上，个人的写作信心和对人工智能写作助手的熟悉程度等因素被证明会缓和人工智能辅助披露对他们写作质量评价的影响。我们还发现，披露使用人工智能辅助的情况可能会显著降低在排名靠前的文章中使用人工智能内容生成辅助生成的文章的比例。

Title: Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Authors: Tianjian Li, Haoran Xu, Weiting Tan, Dongwei Jiang, Kenton Murray, Daniel Khashabi
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2410.04579
Pdf URL: https://arxiv.org/pdf/2410.04579
Copy Paste: [[2410.04579]] Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets(https://arxiv.org/abs/2410.04579)
Keywords: language model
Abstract: Data availability across domains often follows a long-tail distribution: a few domains have abundant data, while most face data scarcity. This imbalance poses challenges in training language models uniformly across all domains. In our study, we focus on multilingual settings, where data sizes vary significantly between high- and low-resource languages. Common strategies to address this include upsampling low-resource languages (Temperature Sampling) or upweighting their loss (Scalarization). Although often considered equivalent, this assumption has not been proven, which motivates our study. Through both theoretical and empirical analysis, we identify the conditions under which these approaches are equivalent and when they diverge. Specifically, we demonstrate that these two methods are equivalent under full gradient descent, but this equivalence breaks down with stochastic gradient descent. Empirically, we observe that Temperature Sampling converges more quickly but is prone to overfitting. We argue that this faster convergence is likely due to the lower variance in gradient estimations, as shown theoretically. Based on these insights, we propose Cooldown, a strategy that reduces sampling temperature during training, accelerating convergence without overfitting to low-resource languages. Our method is competitive with existing data re-weighting and offers computational efficiency.
摘要：跨领域的数据可用性通常遵循长尾分布：少数领域拥有丰富的数据，而大多数领域面临数据稀缺。这种不平衡对在所有领域统一训练语言模型提出了挑战。在我们的研究中，我们专注于多语言设置，其中高资源语言和低资源语言之间的数据大小差异很大。解决此问题的常见策略包括对低资源语言进行上采样（温度采样）或增加其损失（标量化）。虽然通常被认为是等效的，但这一假设尚未得到证实，这促使我们进行研究。通过理论和实证分析，我们确定了这些方法等效的条件以及它们何时发散。具体而言，我们证明这两种方法在完全梯度下降下是等效的，但这种等效性在随机梯度下降下不成立。从经验上讲，我们观察到温度采样收敛速度更快，但容易过度拟合。我们认为，这种更快的收敛可能是由于梯度估计的方差较低，正如理论上所表明的那样。基于这些见解，我们提出了 Cooldown 策略，该策略可在训练过程中降低采样温度，从而加快收敛速度，而不会过度拟合低资源语言。我们的方法与现有的数据重新加权相比具有竞争力，并且具有计算效率。
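
The full-gradient equivalence described in the abstract above can be illustrated with a minimal sketch. It assumes the common convention that temperature sampling draws language i with probability proportional to n_i ** (1 / tau); the corpus sizes, gradients, and function names are hypothetical and are not taken from the paper.

```python
def temperature_probs(sizes, tau):
    """Sampling probability per language: p_i proportional to n_i ** (1 / tau)."""
    scaled = [n ** (1.0 / tau) for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def scalarization_weights(sizes, tau):
    """Per-language loss weights that, under size-proportional sampling,
    reproduce the expected gradient of temperature sampling."""
    proportional = temperature_probs(sizes, 1.0)
    target = temperature_probs(sizes, tau)
    return [t / p for t, p in zip(target, proportional)]


# Hypothetical corpus sizes (examples per language) and per-language mean gradients.
sizes = [1_000_000, 50_000, 2_000]
mean_grads = [0.2, -1.0, 3.0]
tau = 5.0

p_temp = temperature_probs(sizes, tau)   # upsamples low-resource languages
p_prop = temperature_probs(sizes, 1.0)   # plain size-proportional sampling
weights = scalarization_weights(sizes, tau)

# Expected (full-batch) gradient under each scheme; the two values coincide.
grad_sampling = sum(p * g for p, g in zip(p_temp, mean_grads))
grad_scalarized = sum(p * w * g for p, w, g in zip(p_prop, weights, mean_grads))
```

The sketch only matches the two schemes in expectation. The abstract's point is that with stochastic mini-batches the gradient estimates differ in variance, which this toy example does not model.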

Title: Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval

Authors: Pengcheng Jiang, Cao Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04585
Pdf URL: https://arxiv.org/pdf/2410.04585
Copy Paste: [[2410.04585]] Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval(https://arxiv.org/abs/2410.04585)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) have demonstrated significant potential in clinical decision support. Yet LLMs still suffer from hallucinations and lack fine-grained contextual medical knowledge, limiting their high-stakes healthcare applications such as clinical diagnosis. Traditional retrieval-augmented generation (RAG) methods attempt to address these limitations but frequently retrieve sparse or irrelevant information, undermining prediction accuracy. We introduce KARE, a novel framework that integrates knowledge graph (KG) community-level retrieval with LLM reasoning to enhance healthcare predictions. KARE constructs a comprehensive multi-source KG by integrating biomedical databases, clinical literature, and LLM-generated insights, and organizes it using hierarchical graph community detection and summarization for precise and contextually relevant information retrieval. Our key innovations include: (1) a dense medical knowledge structuring approach enabling accurate retrieval of relevant information; (2) a dynamic knowledge retrieval mechanism that enriches patient contexts with focused, multi-faceted medical insights; and (3) a reasoning-enhanced prediction framework that leverages these enriched contexts to produce both accurate and interpretable clinical predictions. Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions. In addition to its impressive prediction accuracy, our framework leverages the reasoning capabilities of LLMs, enhancing the trustworthiness of clinical predictions.
摘要：大型语言模型 (LLM) 在临床决策支持方面表现出巨大潜力。然而，LLM 仍然存在幻觉并且缺乏细粒度的上下文医学知识，限制了它们在临床诊断等高风险医疗保健应用。传统的检索增强生成 (RAG) 方法试图解决这些限制，但经常检索稀疏或不相关的信息，从而损害预测准确性。我们推出了 KARE，这是一个新颖的框架，它将知识图谱 (KG) 社区级检索与 LLM 推理相结合，以增强医疗保健预测。KARE 通过整合生物医学数据库、临床文献和 LLM 生成的见解构建了一个全面的多源 KG，并使用分层图社区检测和摘要对其进行组织，以进行精确和上下文相关的信息检索。我们的主要创新包括：(1) 一种密集的医学知识结构化方法，能够准确检索相关信息；(2) 一种动态知识检索机制，通过有针对性的、多方面的医学见解丰富患者背景；（3）推理增强预测框架，利用这些丰富的背景来产生准确且可解释的临床预测。大量实验表明，在死亡率和再入院率预测方面，KARE 在 MIMIC-III 上的表现比领先模型高出 10.8-15.0%，在 MIMIC-IV 上的表现比领先模型高出 12.6-12.7%。除了令人印象深刻的预测准确性外，我们的框架还利用了 LLM 的推理能力，提高了临床预测的可信度。

Title: ProtocoLLM: Automatic Evaluation Framework of LLMs on Domain-Specific Scientific Protocol Formulation Tasks

Authors: Seungjun Yi, Jaeyoung Lim, Juyong Yoon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04601
Pdf URL: https://arxiv.org/pdf/2410.04601
Copy Paste: [[2410.04601]] ProtocoLLM: Automatic Evaluation Framework of LLMs on Domain-Specific Scientific Protocol Formulation Tasks(https://arxiv.org/abs/2410.04601)
Keywords: language model, gpt, llm, prompt
Abstract: Automated generation of scientific protocols executable by robots can significantly accelerate scientific research processes. Large Language Models (LLMs) excel at Scientific Protocol Formulation Tasks (SPFT), but the evaluation of their capabilities relies on human evaluation. Here, we propose a flexible, automatic framework to evaluate LLMs' capabilities on SPFT: ProtocoLLM. This framework prompts the target model and GPT-4 to extract pseudocode from biology protocols using only predefined lab actions and evaluates the output of the target model using LLAM-EVAL, with the pseudocode generated by GPT-4 serving as a baseline and Llama-3 acting as the evaluator. Our adaptable prompt-based evaluation method, LLAM-EVAL, offers significant flexibility in terms of evaluation model, material, and criteria, and is free of cost. We evaluate GPT variations, Llama, Mixtral, Gemma, Cohere, and Gemini. Overall, we find that GPT and Cohere are powerful scientific protocol formulators. We also introduce BIOPROT 2.0, a dataset with biology protocols and corresponding pseudocodes, which can aid LLMs in formulation and evaluation of SPFT. Our work is extensible to assess LLMs on SPFT across various domains and other fields that require protocol generation for specific goals.
摘要：机器人可执行的科学协议的自动生成可以显著加速科学研究进程。大型语言模型 (LLM) 在科学协议制定任务 (SPFT) 方面表现出色，但对其能力的评估依赖于人工评估。在这里，我们提出了一个灵活的自动化框架来评估 LLM 在 SPFT 上的能力：ProtocoLLM。该框架提示目标模型和 GPT-4 仅使用预定义的实验室操作从生物学协议中提取伪代码，并使用 LLAM-EVAL 评估目标模型的输出，GPT-4 生成的伪代码作为基线，Llama-3 充当评估器。我们适应性强的基于提示的评估方法 LLAM-EVAL 在评估模型、材料、标准方面提供了很大的灵活性，而且是免费的。我们评估了 GPT 变体，Llama、Mixtral、Gemma、Cohere 和 Gemini。总体而言，我们发现 GPT 和 Cohere 是一个强大的科学协议制定器。我们还引入了 BIOPROT 2.0，这是一个包含生物学协议和相应伪代码的数据集，可帮助 LLM 制定和评估 SPFT。我们的工作可扩展到评估不同领域和其他需要为特定目标生成协议的领域的 SPFT LLM。

Title: LRQ-Fact: LLM-Generated Relevant Questions for Multimodal Fact-Checking

Authors: Alimohammad Beigi, Bohan Jiang, Dawei Li, Tharindu Kumarage, Zhen Tan, Pouya Shaeri, Huan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04616
Pdf URL: https://arxiv.org/pdf/2410.04616
Copy Paste: [[2410.04616]] LRQ-Fact: LLM-Generated Relevant Questions for Multimodal Fact-Checking(https://arxiv.org/abs/2410.04616)
Keywords: language model, llm
Abstract: Human fact-checkers have specialized domain knowledge that allows them to formulate precise questions to verify information accuracy. However, this expert-driven approach is labor-intensive and is not scalable, especially when dealing with complex multimodal misinformation. In this paper, we propose a fully-automated framework, LRQ-Fact, for multimodal fact-checking. Firstly, the framework leverages Vision-Language Models (VLMs) and Large Language Models (LLMs) to generate comprehensive questions and answers for probing multimodal content. Next, a rule-based decision-maker module evaluates both the original content and the generated questions and answers to assess the overall veracity. Extensive experiments on two benchmarks show that LRQ-Fact improves detection accuracy for multimodal misinformation. Moreover, we evaluate its generalizability across different model backbones, offering valuable insights for further refinement.
摘要：人工事实核查员拥有专业领域知识，可以制定精确的问题来验证信息的准确性。然而，这种专家驱动的方法劳动密集型且不可扩展，特别是在处理复杂的多模态错误信息时。在本文中，我们提出了一个用于多模态事实核查的全自动框架 LRQ-Fact。首先，该框架利用视觉语言模型 (VLM) 和大型语言模型 (LLM) 来生成全面的问题和答案，以探究多模态内容。接下来，基于规则的决策者模块会评估原始内容和生成的问题和答案，以评估整体准确性。在两个基准上进行的大量实验表明，LRQ-Fact 提高了多模态错误信息的检测准确性。此外，我们评估了它在不同模型主干中的通用性，为进一步改进提供了宝贵的见解。

Title: Evaluation of Code LLMs on Geospatial Code Generation

Authors: Piotr Gramacki, Bruno Martins, Piotr Szymański
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2410.04617
Pdf URL: https://arxiv.org/pdf/2410.04617
Copy Paste: [[2410.04617]] Evaluation of Code LLMs on Geospatial Code Generation(https://arxiv.org/abs/2410.04617)
Keywords: language model, llm
Abstract: Software development support tools have been studied for a long time, with recent approaches using Large Language Models (LLMs) for code generation. These models can generate Python code for data science and machine learning applications. LLMs are helpful for software engineers because they increase productivity in daily work. An LLM can also serve as a "mentor" for inexperienced software developers, and be a viable learning support. High-quality code generation with LLMs can also be beneficial in geospatial data science. However, this domain poses different challenges, and code generation LLMs are typically not evaluated on geospatial tasks. Here, we show how we constructed an evaluation benchmark for code generation models, based on a selection of geospatial tasks. We categorised geospatial tasks based on their complexity and required tools. Then, we created a dataset with tasks that test model capabilities in spatial reasoning, spatial data processing, and geospatial tools usage. The dataset consists of specific coding problems that were manually created for high quality. For every problem, we proposed a set of test scenarios that make it possible to automatically check the generated code for correctness. In addition, we tested a selection of existing code generation LLMs for code generation in the geospatial domain. We share our dataset and reproducible evaluation code on a public GitHub repository, arguing that this can serve as an evaluation benchmark for new LLMs in the future. Our dataset will hopefully contribute to the development of new models capable of solving geospatial coding tasks with high accuracy. These models will enable the creation of coding assistants tailored for geospatial applications.
摘要：软件开发支持工具已经被研究了很长时间，最近的方法是使用大型语言模型 (LLM) 进行代码生成。这些模型可以为数据科学和机器学习应用程序生成 Python 代码。LLM 对软件工程师很有帮助，因为它们可以提高日常工作效率。LLM 还可以作为缺乏经验的软件开发人员的“导师”，并成为可行的学习支持。使用 LLM 进行高质量的代码生成对地理空间数据科学也有好处。然而，这个领域带来了不同的挑战，代码生成 LLM 通常不会在地理空间任务上进行评估。在这里，我们展示了如何根据一系列地理空间任务构建代码生成模型的评估基准。我们根据地理空间任务的复杂性和所需工具对其进行分类。然后，我们创建了一个数据集，其中包含测试模型在空间推理、空间数据处理和地理空间工具使用方面的能力的任务。该数据集由手动创建的特定编码问题组成，以实现高质量。对于每个问题，我们提出了一组测试场景，可以自动检查生成的代码是否正确。此外，我们还测试了一系列现有的代码生成 LLM，用于地理空间领域的代码生成。我们在公共 GitHub 存储库上分享了我们的数据集和可重现的评估代码，认为这可以作为未来新 LLM 的评估基准。我们的数据集有望为开发能够高精度解决地理空间编码任务的新模型做出贡献。这些模型将使创建针对地理空间应用的编码助手成为可能。

Title: Control Large Language Models via Divide and Conquer

Authors: Bingxuan Li, Yiwei Wang, Tao Meng, Kai-Wei Chang, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04628
Pdf URL: https://arxiv.org/pdf/2410.04628
Copy Paste: [[2410.04628]] Control Large Language Models via Divide and Conquer(https://arxiv.org/abs/2410.04628)
Keywords: language model, llm, prompt
Abstract: This paper investigates controllable generation for large language models (LLMs) with prompt-based control, focusing on Lexically Constrained Generation (LCG). We systematically evaluate the performance of LLMs on satisfying lexical constraints with prompt-based control, as well as their efficacy in downstream applications. We conclude that LLMs face significant challenges in consistently satisfying lexical constraints with prompt-based control. We identified three key limitations of LLMs for LCG, including (1) position bias, where LLMs tend to satisfy constraints that appear in specific positions within the input; (2) low responsiveness to decoding parameters, which have minimal impact on the control of LLMs; and (3) difficulty handling the inherent complexity of certain constraints (e.g., compound words). To address these issues, we introduce a Divide and Conquer Generation strategy, effective for both white-box and black-box LLMs, to enhance LLMs' performance in LCG tasks, which demonstrates over 90% improvement on success rate in the most challenging LCG task. Our analysis provides valuable insights into the performance of LLMs in LCG with prompt-based control, and our proposed strategy offers a pathway to more sophisticated and customized text generation applications.
摘要：本文研究了基于提示控制的大型语言模型 (LLM) 的可控生成，重点研究了词汇约束生成 (LCG)。我们系统地评估了 LLM 在基于提示控制下满足词汇约束的性能，以及它们在下游应用中的有效性。我们得出结论，LLM 在始终如一地满足基于提示控制的词汇约束方面面临重大挑战。我们确定了 LCG 的 LLM 的三个主要限制，包括 (1) 位置偏差，其中 LLM 倾向于满足出现在输入中特定位置的约束；(2) 对解码参数的响应性低，这对 LLM 的控制影响最小；(3) 难以处理某些约束（例如复合词）的固有复杂性。为了解决这些问题，我们引入了一种分而治之的生成策略，该策略对白盒和黑盒 LLM 都有效，以提高 LLM 在 LCG 任务中的表现，在最具挑战性的 LCG 任务中，成功率提高了 90% 以上。我们的分析为基于提示的控制的 LCG 中的 LLM 性能提供了宝贵的见解，我们提出的策略为更复杂和定制的文本生成应用程序提供了途径。

Title: Contrastive Learning to Improve Retrieval for Real-world Fact Checking

Authors: Aniruddh Sriram, Fangyuan Xu, Eunsol Choi, Greg Durrett
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.04657
Pdf URL: https://arxiv.org/pdf/2410.04657
Copy Paste: [[2410.04657]] Contrastive Learning to Improve Retrieval for Real-world Fact Checking(https://arxiv.org/abs/2410.04657)
Keywords: gpt
Abstract: Recent work on fact-checking addresses a realistic setting where models incorporate evidence retrieved from the web to decide the veracity of claims. A bottleneck in this pipeline is in retrieving relevant evidence: traditional methods may surface documents directly related to a claim, but fact-checking complex claims requires more inferences. For instance, a document about how a vaccine was developed is relevant to addressing claims about what it might contain, even if it does not address them directly. We present Contrastive Fact-Checking Reranker (CFR), an improved retriever for this setting. By leveraging the AVeriTeC dataset, which annotates subquestions for claims with human written answers from evidence documents, we fine-tune Contriever with a contrastive objective based on multiple training signals, including distillation from GPT-4, evaluating subquestion answers, and gold labels in the dataset. We evaluate our model on both retrieval and end-to-end veracity judgments about claims. On the AVeriTeC dataset, we find a 6% improvement in veracity classification accuracy. We also show our gains can be transferred to FEVER, ClaimDecomp, HotpotQA, and a synthetic dataset requiring retrievers to make inferences.
摘要：最近关于事实核查的研究解决了一个现实场景，其中模型结合从网络检索到的证据来决定主张的真实性。这个流程的一个瓶颈在于检索相关证据：传统方法可能会显示与主张直接相关的文件，但对复杂主张进行事实核查需要更多的推理。例如，一份关于疫苗如何开发的文件与解决关于它可能包含的内容的主张相关，即使它没有直接解决这些问题。我们提出了对比事实核查重排器 (CFR)，这是一种针对此设置的改进检索器。通过利用 AVeriTeC 数据集（该数据集使用来自证据文档的人工书面答案注释主张的子问题），我们根据多个训练信号（包括从 GPT-4 中提取、评估子问题答案和数据集中的黄金标签）以对比目标对 Contriever 进行微调。我们根据检索和端到端对主张的真实性判断来评估我们的模型。在 AVeriTeC 数据集上，我们发现真实性分类准确率提高了 6%。我们还表明，我们的成果可以转移到 FEVER、ClaimDecomp、HotpotQA 和需要检索者进行推理的合成数据集。
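
For readers unfamiliar with contrastive fine-tuning of retrievers, the following is a minimal sketch of an InfoNCE-style loss for one query. It is a common form of contrastive objective and an assumption here, not necessarily the exact loss used for CFR; the scores, the temperature value, and the function name are hypothetical.

```python
import math


def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """Negative log-probability of the positive document under a softmax
    over similarity scores (one positive, several negatives)."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


# Hypothetical query-document similarity scores.
loss_good = info_nce_loss(0.9, [0.1, 0.2])   # positive clearly ranked first
loss_bad = info_nce_loss(0.3, [0.1, 0.2])    # positive barely ahead of negatives
```

The loss shrinks as the positive document's score pulls away from the negatives, which is the training signal a contrastive reranker relies on.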

Title: Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates

Authors: Chaithanya Bandi, Hari Bandi, Abir Harrasse
Subjects: cs.CL, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2410.04663
Pdf URL: https://arxiv.org/pdf/2410.04663
Copy Paste: [[2410.04663]] Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates(https://arxiv.org/abs/2410.04663)
Keywords: language model, llm, agent
Abstract: This paper explores optimal architectures for evaluating the outputs of large language models (LLMs) using LLMs themselves. We propose a novel framework that interprets LLMs as advocates within an ensemble of interacting agents, allowing them to defend their answers and reach conclusions through a judge and jury system. This approach offers a more dynamic and comprehensive evaluation process compared to traditional human-based assessments or automated metrics. We discuss the motivation behind this framework, its key components, and comparative advantages. We also present a probabilistic model to evaluate the error reduction achieved by iterative advocate systems. Finally, we outline experiments to validate the effectiveness of multi-advocate architectures and discuss future research directions.
摘要：本文探讨了使用 LLM 本身评估大型语言模型 (LLM) 输出的最佳架构。我们提出了一个新颖的框架，将 LLM 解释为交互代理集合中的倡导者，允许它们通过法官和陪审团系统捍卫自己的答案并得出结论。与传统的基于人工的评估或自动化指标相比，这种方法提供了更动态、更全面的评估流程。我们讨论了该框架背后的动机、其关键组成部分和比较优势。我们还提出了一个概率模型来评估迭代倡导系统实现的错误减少。最后，我们概述了验证多倡导者架构有效性的实验，并讨论了未来的研究方向。

Title: MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

Authors: Lei Wang, Shan Dong, Yuhui Xu, Hanze Dong, Yalu Wang, Amrita Saha, Ee-Peng Lim, Caiming Xiong, Doyen Sahoo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04698
Pdf URL: https://arxiv.org/pdf/2410.04698
Copy Paste: [[2410.04698]] MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs(https://arxiv.org/abs/2410.04698)
Keywords: language model, llm, long context
Abstract: Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although some recent benchmarks have been developed to evaluate the long-context capabilities of LLMs, there is a lack of benchmarks evaluating the mathematical reasoning abilities of LLMs over long contexts, which is crucial for LLMs' application in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks like Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay demands models with both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over long contexts, achieving only 51.26% accuracy at 128K tokens. This highlights the significant room for improvement on the MathHay benchmark.
摘要：最近的大型语言模型 (LLM) 已在长上下文场景中展现出多种能力。尽管最近已经开发了一些基准来评估 LLM 的长上下文能力，但仍然缺乏评估 LLM 在长上下文中的数学推理能力的基准，而这对于 LLM 在实际场景中的应用至关重要。在本文中，我们介绍了 MathHay，这是一个旨在评估 LLM 长上下文数学推理能力的自动化基准。与之前的基准如 Needle in a Haystack 主要关注长文本中的信息检索不同，MathHay 要求模型既具有信息搜索能力，又具有复杂的数学推理能力。我们在 MathHay 上进行了广泛的实验，以评估八个表现最佳的 LLM 的长上下文数学推理能力。即使是表现最好的模型 Gemini-1.5-Pro-002，在长上下文中的数学推理方面仍然举步维艰，在 128K 个 token 上仅达到 51.26% 的准确率。这凸显了 MathHay 基准测试的巨大改进空间。

Title: The LLM Effect: Are Humans Truly Using LLMs, or Are They Being Influenced By Them Instead?

Authors: Alexander S. Choi, Syeda Sabrina Akter, JP Singh, Antonios Anastasopoulos
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2410.04699
Pdf URL: https://arxiv.org/pdf/2410.04699
Copy Paste: [[2410.04699]] The LLM Effect: Are Humans Truly Using LLMs, or Are They Being Influenced By Them Instead?(https://arxiv.org/abs/2410.04699)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown capabilities close to human performance in various analytical tasks, leading researchers to use them for time- and labor-intensive analyses. However, their capability to handle highly specialized and open-ended tasks in domains like policy studies remains in question. This paper investigates the efficiency and accuracy of LLMs in specialized tasks through a structured user study focusing on Human-LLM partnership. The study, conducted in two stages (Topic Discovery and Topic Assignment), integrates LLMs with expert annotators to observe the impact of LLM suggestions on what is usually human-only analysis. Results indicate that LLM-generated topic lists have significant overlap with human-generated topic lists, with minor hiccups in missing document-specific topics. LLM suggestions may significantly improve task completion speed, but at the same time they introduce anchoring bias, potentially affecting the depth and nuance of the analysis. This raises a critical question about the trade-off between increased efficiency and the risk of biased analysis.
摘要：大型语言模型 (LLM) 在各种分析任务中表现出接近人类表现的能力，这使得研究人员将其用于耗时耗力的分析。然而，它们在政策研究等领域处理高度专业化和开放式任务的能力仍是一个问题。本文通过一项侧重于人机合作的结构化用户研究，调查了 LLM 在专业任务中的效率和准确性。这项研究分为两个阶段——主题发现和主题分配——将 LLM 与专家注释者相结合，以观察 LLM 建议对通常仅由人类进行的分析的影响。结果表明，LLM 生成的主题列表与人类生成的主题列表有很大的重叠，在缺少特定于文档的主题方面存在轻微的障碍。然而，LLM 建议可能会显著提高任务完成速度，但同时会引入锚定偏差，可能会影响分析的深度和细微差别，这提出了一个关键问题，即提高效率与偏向性分析风险之间的权衡。

Title: Rule-based Data Selection for Large Language Models

Authors: Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, Hong Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.04715
Pdf URL: https://arxiv.org/pdf/2410.04715
Copy Paste: [[2410.04715]] Rule-based Data Selection for Large Language Models(https://arxiv.org/abs/2410.04715)
Keywords: language model, llm
Abstract: The quality of training data significantly impacts the performance of large language models (LLMs). There are increasing studies using LLMs to rate and select data based on several human-crafted metrics (rules). However, these conventional rule-based approaches often depend too heavily on human heuristics, lack effective metrics for assessing rules, and exhibit limited adaptability to new tasks. In our study, we introduce an innovative rule-based framework that utilizes the orthogonality of score vectors associated with rules as a novel metric for rule evaluations. Our approach includes an automated pipeline that first uses LLMs to generate a diverse set of rules, encompassing various rating dimensions to evaluate data quality. Then it rates a batch of data based on these rules and uses the determinantal point process (DPP) from random matrix theory to select the most orthogonal score vectors, thereby identifying a set of independent rules. These rules are subsequently used to evaluate all data, selecting samples with the highest average scores for downstream tasks such as LLM training. We verify the effectiveness of our method through two experimental setups: 1) comparisons with ground truth ratings and 2) benchmarking LLMs trained with the chosen data. Our comprehensive experiments cover a range of scenarios, including general pre-training and domain-specific fine-tuning in areas such as IMDB, Medical, Math, and Code. The outcomes demonstrate that our DPP-based rule rating method consistently outperforms other approaches, including rule-free rating, uniform sampling, importance resampling, and QuRating, in terms of both rating precision and model performance.

Title: $\textbf{Only-IF}$: Revealing the Decisive Effect of Instruction Diversity on Generalization

Authors: Dylan Zhang, Justin Wang, Francois Charton
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2410.04717
Pdf URL: https://arxiv.org/pdf/2410.04717
Copy Paste: [[2410.04717]] $\textbf{Only-IF}$: Revealing the Decisive Effect of Instruction Diversity on Generalization(https://arxiv.org/abs/2410.04717)
Keywords: language model, llm
Abstract: Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization $\textbf{only emerges}$ when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model's adaptability. We further extend our analysis to real-world scenarios, including fine-tuning of $\textit{\textbf{specialist}}$ and $\textit{\textbf{generalist}}$ models. In both cases, we demonstrate that 1) better performance can be achieved by increasing the diversity of an established dataset while keeping the data size constant, and 2) when scaling up the data, diversifying the semantics of instructions is more effective than simply increasing the quantity of similar data. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.

Title: Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models

Authors: Xinyu Liu, Runsong Zhao, Pengcheng Huang, Chunyang Xiao, Bei Li, Jingang Wang, Tong Xiao, Jingbo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04727
Pdf URL: https://arxiv.org/pdf/2410.04727
Copy Paste: [[2410.04727]] Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models(https://arxiv.org/abs/2410.04727)
Keywords: language model, prompt
Abstract: Numerous recent works aim to extend the effective context length of language models, and various methods, tasks, and benchmarks exist to measure a model's effective memorization length. However, through thorough investigation, we find limitations in existing evaluations of models' memorization capability. We provide an extensive survey of these limitations in this work and propose a new method called the forgetting curve to measure the memorization capability of long-context models. We show that the forgetting curve has the advantage of being robust to the tested corpus and the experimental settings, of not relying on prompts, and of being applicable to any model size. We apply our forgetting curve to a large variety of models involving both transformer and RNN/SSM based architectures. Our measurement provides empirical evidence for the effectiveness of transformer extension techniques while raising questions about the effective length of RNN/SSM based models. We also examine the difference between our measurement and existing benchmarks as well as popular metrics for various models. Our code and results can be found at this https URL.

Title: Efficient transformer with reinforced position embedding for language models

Authors: Yen-Che Hsiao, Abhishek Dutta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04731
Pdf URL: https://arxiv.org/pdf/2410.04731
Copy Paste: [[2410.04731]] Efficient transformer with reinforced position embedding for language models(https://arxiv.org/abs/2410.04731)
Keywords: language model
Abstract: In this paper, we propose an efficient transformer architecture that uses reinforced positional embedding to obtain superior performance with half the number of encoder decoder layers. We demonstrate that concatenating positional encoding with trainable token embeddings, normalizing columns in the token embedding matrix, and using the normalized token embedding matrix as the value of the attention layer improve the training and validation loss and the training time in an encoder-decoder Transformer model for a Portuguese-English translation task with 10 epochs or 12 hours of training across 10 trials. Our method, with roughly a threefold parameter reduction compared to the baseline model, yields a mean training loss of 1.21, a mean validation loss of 1.51, and an average training time of 1352.27 seconds per epoch, surpassing the baseline model with the same embedding dimension that employs addition of positional encoding and token embeddings, which achieves a mean training loss of 1.96, a validation loss of 2.18, and an average training time of 4297.79 seconds per epoch. Additionally, we evaluated our proposed architecture and the baseline across 14 diverse translation datasets from TensorFlow. The results indicate that our method consistently achieves lower or comparable training and validation losses, suggesting enhanced learning efficiency.
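To make the embedding step in this abstract concrete, here is a minimal sketch of concatenating a positional encoding with token embeddings after column normalization. The abstract does not specify the encoding or the norm, so the sinusoidal encoding, the L2 column norm, and all function names below are assumptions for illustration; the reuse of the normalized matrix as the attention value is not shown.

```python
import math

def sinusoidal_encoding(num_positions, dim):
    """Standard sinusoidal positional encoding (assumed here, dim should be even)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def normalize_columns(matrix):
    """Scale each column of the embedding matrix to unit L2 norm."""
    num_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(num_cols)]
    return [[row[j] / norms[j] for j in range(num_cols)] for row in matrix]

def embed_by_concatenation(token_ids, embedding_table, pos_dim):
    """Concatenate (rather than add) token embeddings and positional encodings."""
    positions = sinusoidal_encoding(len(token_ids), pos_dim)
    return [embedding_table[tok] + positions[p] for p, tok in enumerate(token_ids)]

# Hypothetical 2-token vocabulary with 2-dimensional embeddings.
raw_table = [[3.0, 0.0], [4.0, 2.0]]
table = normalize_columns(raw_table)
inputs = embed_by_concatenation([1, 0], table, pos_dim=4)
print(len(inputs[0]))  # embedding width plus positional width
```

With concatenation the model input width is the sum of the two widths, whereas the baseline described in the abstract adds the two vectors and keeps a single width.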

Title: TableRAG: Million-Token Table Understanding with Language Models

Authors: Si-An Chen, Lesly Miculicich, Julian Martin Eisenschlos, Zifeng Wang, Zilong Wang, Yanfei Chen, Yasuhisa Fujii, Hsuan-Tien Lin, Chen-Yu Lee, Tomas Pfister
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.04739
Pdf URL: https://arxiv.org/pdf/2410.04739
Copy Paste: [[2410.04739]] TableRAG: Million-Token Table Understanding with Language Models(https://arxiv.org/abs/2410.04739)
Keywords: language model, prompt, retrieval-augmented generation
Abstract: Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to the positional bias or context length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's effectiveness at scale. Our results demonstrate that TableRAG's retrieval design achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.

Title: Document-level Causal Relation Extraction with Knowledge-guided Binary Question Answering

Authors: Zimu Wang, Lei Xia, Wei Wang, Xinya Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04752
Pdf URL: https://arxiv.org/pdf/2410.04752
Copy Paste: [[2410.04752]] Document-level Causal Relation Extraction with Knowledge-guided Binary Question Answering(https://arxiv.org/abs/2410.04752)
Keywords: language model, llm, hallucination
Abstract: As an essential task in information extraction (IE), Event-Event Causal Relation Extraction (ECRE) aims to identify and classify the causal relationships between event mentions in natural language texts. However, existing research on ECRE has highlighted two critical challenges: the lack of document-level modeling and causal hallucinations. In this paper, we propose a Knowledge-guided binary Question Answering (KnowQA) method with event structures for ECRE, consisting of two stages: Event Structure Construction and Binary Question Answering. We conduct extensive experiments under both zero-shot and fine-tuning settings with large language models (LLMs) on the MECI and MAVEN-ERE datasets. Experimental results demonstrate the usefulness of event structures on document-level ECRE and the effectiveness of KnowQA by achieving state-of-the-art performance on the MECI dataset. We observe not only the effectiveness but also the high generalizability and low inconsistency of our method, particularly with complete event structures after fine-tuning the models.

Title: Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge

Authors: Jiahuan Li, Yiqing Cao, Shujian Huang, Jiajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04784
Pdf URL: https://arxiv.org/pdf/2410.04784
Copy Paste: [[2410.04784]] Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge(https://arxiv.org/abs/2410.04784)
Keywords: language model, llm
Abstract: Having been trained on massive pretraining data, large language models have shown excellent performance on many knowledge-intensive tasks. However, pretraining data tends to contain misleading and even conflicting information, and it is intriguing to understand how LLMs handle these noisy data during training. In this study, we systematically analyze LLMs' learning preferences for data with conflicting knowledge. We find that pretrained LLMs establish learning preferences similar to humans, i.e., preferences towards formal texts and texts with fewer spelling errors, resulting in faster learning and more favorable treatment of knowledge in data with such features when facing conflicts. This finding is generalizable across models and languages and is more evident in larger models. An in-depth analysis reveals that LLMs tend to trust data with features that signify consistency with the majority of data, and it is possible to instill new preferences and erase old ones by manipulating the degree of consistency with the majority data.

Title: GARLIC: LLM-Guided Dynamic Progress Control with Hierarchical Weighted Graph for Long Document QA

Authors: Xinyu Wang, Yanzheng Xiang, Lin Gui, Yulan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04790
Pdf URL: https://arxiv.org/pdf/2410.04790
Copy Paste: [[2410.04790]] GARLIC: LLM-Guided Dynamic Progress Control with Hierarchical Weighted Graph for Long Document QA(https://arxiv.org/abs/2410.04790)
Keywords: language model, llm, retrieval-augmented generation
Abstract: In the past, Retrieval-Augmented Generation (RAG) methods split text into chunks to enable language models to handle long documents. Recent tree-based RAG methods are able to retrieve detailed information while preserving global context. However, with the advent of more powerful LLMs, such as Llama 3.1, which offer better comprehension and support for longer inputs, we found that even recent tree-based RAG methods perform worse than directly feeding the entire document into Llama 3.1, although RAG methods still hold an advantage in reducing computational costs. In this paper, we propose a new retrieval method, called LLM-Guided Dynamic Progress Control with Hierarchical Weighted Graph (GARLIC), which outperforms previous state-of-the-art baselines, including Llama 3.1, while retaining the computational efficiency of RAG methods. Our method introduces several improvements: (1) Rather than using a tree structure, we construct a Hierarchical Weighted Directed Acyclic Graph with many-to-many summarization, where the graph edges are derived from attention mechanisms, and each node focuses on a single event or very few events. (2) We introduce a novel retrieval method that leverages the attention weights of LLMs rather than dense embedding similarity. Our method allows for searching the graph along multiple paths and can terminate at any depth. (3) We use the LLM to control the retrieval process, enabling it to dynamically adjust the amount and depth of information retrieved for different queries. Experimental results show that our method outperforms previous state-of-the-art baselines, including Llama 3.1, on two single-document and two multi-document QA datasets, while maintaining similar computational complexity to traditional RAG methods.

Title: Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models

Authors: Dahyun Kim, Sukyung Lee, Yungi Kim, Attapol Rutherford, Chanjun Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04795
Pdf URL: https://arxiv.org/pdf/2410.04795
Copy Paste: [[2410.04795]] Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models(https://arxiv.org/abs/2410.04795)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) has highlighted the need for robust evaluation frameworks that assess their core capabilities, such as reasoning, knowledge, and commonsense, leading to the inception of certain widely-used benchmark suites such as the H6 benchmark. However, these benchmark suites are primarily built for the English language, and comparable suites are lacking for languages that are under-represented in LLM development, such as Thai. On the other hand, developing LLMs for Thai should also include enhancing cultural understanding as well as core capabilities. To address this dual challenge in Thai LLM research, we propose two key benchmarks: Thai-H6 and Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI). Through a thorough evaluation of various LLMs with multi-lingual capabilities, we provide a comprehensive analysis of the proposed benchmarks and how they contribute to Thai LLM development. Furthermore, we will make both the datasets and evaluation code publicly available to encourage further research and development for Thai LLMs.

Title: LPZero: Language Model Zero-cost Proxy Search from Zero

Authors: Peijie Dong, Lujun Li, Xiang Liu, Zhenheng Tang, Xuebo Liu, Qiang Wang, Xiaowen Chu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04808
Pdf URL: https://arxiv.org/pdf/2410.04808
Copy Paste: [[2410.04808]] LPZero: Language Model Zero-cost Proxy Search from Zero(https://arxiv.org/abs/2410.04808)
Keywords: language model, gpt
Abstract: In spite of its outstanding performance, Neural Architecture Search (NAS) is criticized for massive computation. Recently, Zero-shot NAS has emerged as a promising approach by exploiting Zero-cost (ZC) proxies, which markedly reduce computational demands. Despite this, existing ZC proxies heavily rely on expert knowledge and incur significant trial-and-error costs. Particularly in NLP tasks, most existing ZC proxies fail to surpass the performance of the naive baseline. To address these challenges, we introduce a novel framework, LPZero, which is the first to automatically design ZC proxies for various tasks, achieving higher ranking consistency than human-designed proxies. Specifically, we model the ZC proxy as a symbolic equation and incorporate a unified proxy search space that encompasses existing ZC proxies, which are composed of a predefined set of mathematical symbols. To heuristically search for the best ZC proxy, LPZero incorporates genetic programming to find the optimal symbolic composition. We propose a Rule-based Pruning Strategy (RPS), which preemptively eliminates unpromising proxies, thereby mitigating the risk of proxy degradation. Extensive experiments on FlexiBERT, GPT-2, and LLaMA-7B demonstrate LPZero's superior ranking ability and performance on downstream tasks compared to current approaches.

Title: MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models

Authors: Kaichen Huang, Jiahao Huo, Yibo Yan, Kun Wang, Yutao Yue, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04819
Pdf URL: https://arxiv.org/pdf/2410.04819
Copy Paste: [[2410.04819]] MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models(https://arxiv.org/abs/2410.04819)
Keywords: language model, llm
Abstract: In recent years, multimodal large language models (MLLMs) have significantly advanced, integrating more modalities into diverse applications. However, the lack of explainability remains a major barrier to their use in scenarios requiring decision transparency. Current neuron-level explanation paradigms mainly focus on knowledge localization or language- and domain-specific analyses, leaving the exploration of multimodality largely unaddressed. To tackle these challenges, we propose MINER, a transferable framework for mining modality-specific neurons (MSNs) in MLLMs, which comprises four stages: (1) modality separation, (2) importance score calculation, (3) importance score aggregation, (4) modality-specific neuron selection. Extensive experiments across six benchmarks and two representative MLLMs show that (I) deactivating only 2% of MSNs significantly reduces MLLM performance (0.56 to 0.24 for Qwen2-VL, 0.69 to 0.31 for Qwen2-Audio), (II) different modalities mainly converge in the lower layers, (III) MSNs influence how key information from various modalities converges to the last token, and (IV) two intriguing phenomena, semantic probing and semantic telomeres, merit further investigation. The source code is available at this URL.

Title: As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss

Authors: Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Wang Chen, Anh Tuan Luu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04834
Pdf URL: https://arxiv.org/pdf/2410.04834
Copy Paste: [[2410.04834]] As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss(https://arxiv.org/abs/2410.04834)
Keywords: llm
Abstract: Direct Preference Optimization (DPO) has emerged as a more computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO), eliminating the need for reward models and online sampling. Despite these benefits, DPO and its variants remain sensitive to hyper-parameters and prone to instability, particularly on mathematical datasets. We argue that these issues arise from the unidirectional likelihood-derivative negative feedback inherent in the log-likelihood loss function. To address this, we propose a novel LLM alignment loss that establishes a stable Bidirectional Negative Feedback (BNF) during optimization. Our proposed BNF loss eliminates the need for pairwise contrastive losses and does not require any extra tunable hyper-parameters or pairwise preference data, streamlining the alignment pipeline to be as simple as supervised fine-tuning. We conduct extensive experiments across two challenging QA benchmarks and four reasoning benchmarks. The experimental results show that BNF achieves comparable performance to the best methods on QA benchmarks, while its performance decrease on the four reasoning benchmarks is significantly lower compared to the best methods, thus striking a better balance between value alignment and reasoning ability. In addition, we further validate the performance of BNF on non-pairwise datasets, and conduct in-depth analysis of log-likelihood and logit shifts across different preference optimization methods.

Title: Rationale-Aware Answer Verification by Pairwise Self-Evaluation

Authors: Akira Kawabata, Saku Sugawara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04838
Pdf URL: https://arxiv.org/pdf/2410.04838
Copy Paste: [[2410.04838]] Rationale-Aware Answer Verification by Pairwise Self-Evaluation(https://arxiv.org/abs/2410.04838)
Keywords: language model, llm
Abstract: Answer verification identifies correct solutions among candidates generated by large language models (LLMs). Current approaches typically train verifier models by labeling solutions as correct or incorrect based solely on whether the final answer matches the gold answer. However, this approach neglects any flawed rationale in the solution yielding the correct answer, undermining the verifier's ability to distinguish between sound and flawed rationales. We empirically show that in StrategyQA, only 19% of LLM-generated solutions with correct answers have valid rationales, thus leading to an unreliable verifier. Furthermore, we demonstrate that training a verifier on valid rationales significantly improves its ability to distinguish valid and flawed rationales. To make a better verifier without extra human supervision, we introduce REPS (Rationale Enhancement through Pairwise Selection), a method for selecting valid rationales from candidates by iteratively applying pairwise self-evaluation using the same LLM that generates the solutions. Verifiers trained on solutions selected by REPS outperform those trained using conventional training methods on three reasoning benchmarks (ARC-Challenge, DROP, and StrategyQA). Our results suggest that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers, which would be critical for models assisting humans in solving complex reasoning tasks.
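The idea of iteratively applying pairwise evaluation to pick one rationale can be sketched as a knockout-style selection. The abstract does not describe how REPS pairs or orders candidates, so the round structure below is an assumption, and `prefer` is a hypothetical stand-in for a call that asks the LLM which of two rationales is more valid.

```python
from typing import Callable, List

def select_by_pairwise_rounds(candidates: List[str],
                              prefer: Callable[[str, str], str]) -> str:
    """Repeatedly compare candidates in pairs until one remains.

    prefer(a, b) must return either a or b. In a real pipeline it would wrap
    an LLM self-evaluation prompt; here it is a placeholder.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            winners.append(prefer(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # the unpaired candidate advances unchanged
        pool = winners
    return pool[0]

# Toy judge for demonstration only: prefers the longer string.
def toy_prefer(a: str, b: str) -> str:
    return a if len(a) >= len(b) else b

best = select_by_pairwise_rounds(["a", "abc", "ab", "abcd", "x"], toy_prefer)
print(best)
```

A knockout needs one comparison per eliminated candidate, which keeps the number of judge calls linear in the number of candidates.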

Title: Intent Classification for Bank Chatbots through LLM Fine-Tuning

Authors: Bibiána Lajčinová, Patrik Valábek, Michal Spišiak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.04925
Pdf URL: https://arxiv.org/pdf/2410.04925
Copy Paste: [[2410.04925]] Intent Classification for Bank Chatbots through LLM Fine-Tuning(https://arxiv.org/abs/2410.04925)
Keywords: language model, llm, chat
Abstract: This study evaluates the application of large language models (LLMs) for intent classification within a chatbot with predetermined responses designed for banking industry websites. Specifically, the research examines the effectiveness of fine-tuning SlovakBERT compared to employing multilingual generative models, such as Llama 8b instruct and Gemma 7b instruct, in both their pre-trained and fine-tuned versions. The findings indicate that SlovakBERT outperforms the other models in terms of in-scope accuracy and out-of-scope false positive rate, establishing it as the benchmark for this application.

Title: Activation Scaling for Steering and Interpreting Language Models

Authors: Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, Aaron Schein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.04962
Pdf URL: https://arxiv.org/pdf/2410.04962
Copy Paste: [[2410.04962]] Activation Scaling for Steering and Interpreting Language Models(https://arxiv.org/abs/2410.04962)
Keywords: language model, prompt
Abstract: Given the prompt "Rome is in", can we steer a language model to flip its prediction of an incorrect token "France" to a correct token "Italy" by only multiplying a few relevant activation vectors with scalars? We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings. Concretely, we establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa (effectiveness), and leave other tokens unaffected (faithfulness), all while being sparse (minimality). Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention: activation scaling only modifies the signed magnitude of activation vectors to strengthen, weaken, or reverse the steering directions already encoded in the model. On synthetic tasks, this intervention performs comparably with steering vectors in terms of effectiveness and faithfulness, but is much more minimal allowing us to pinpoint interpretable model components. We evaluate activation scaling from different angles, compare performance on different datasets, and make activation scalars a learnable function of the activation vectors themselves to generalize to varying-length prompts.
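The intervention in this abstract multiplies selected activation vectors by scalars. The toy sketch below shows only that mechanism and not the learned three-term objective. It assumes the final representation is a sum of component activations read out by an unembedding vector per token; every number and name here is hypothetical.

```python
def scaled_logits(activations, scalars, unembed):
    """Sum the scaled activation vectors, then project onto each token's unembedding."""
    dim = len(activations[0])
    residual = [sum(s * act[d] for s, act in zip(scalars, activations))
                for d in range(dim)]
    return {token: sum(r * w for r, w in zip(residual, vec))
            for token, vec in unembed.items()}

# Two hypothetical component activations and a two-token vocabulary.
activations = [[1.0, 0.0],   # component that pushes toward "France"
               [0.0, 0.8]]   # component that pushes toward "Italy"
unembed = {"France": [1.0, 0.0], "Italy": [0.0, 1.0]}

before = scaled_logits(activations, [1.0, 1.0], unembed)  # no intervention
after = scaled_logits(activations, [0.5, 1.0], unembed)   # weaken the first component

print(max(before, key=before.get), max(after, key=after.get))
```

Only one scalar differs from 1 in this example, which mirrors the sparsity (minimality) goal: most components are left untouched.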

Title: SkillMatch: Evaluating Self-supervised Learning of Skill Relatedness

Authors: Jens-Joris Decorte, Jeroen Van Hautte, Thomas Demeester, Chris Develder
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05006
Pdf URL: https://arxiv.org/pdf/2410.05006
Copy Paste: [[2410.05006]] SkillMatch: Evaluating Self-supervised Learning of Skill Relatedness(https://arxiv.org/abs/2410.05006)
Keywords: llm
Abstract: Accurately modeling the relationships between skills is a crucial part of human resources processes such as recruitment and employee development. Yet, no benchmarks exist to evaluate such methods directly. We construct and release SkillMatch, a benchmark for the task of skill relatedness, based on expert knowledge mining from millions of job ads. Additionally, we propose a scalable self-supervised learning technique to adapt a Sentence-BERT model based on skill co-occurrence in job ads. This new method greatly surpasses traditional models for skill relatedness as measured on SkillMatch. By releasing SkillMatch publicly, we aim to contribute a foundation for research towards increased accuracy and transparency of skill-based recommendation systems.
摘要：准确地对技能之间的关系进行建模是招聘和员工发展等人力资源流程的重要组成部分。然而，目前还没有直接评估此类方法的基准。我们基于从数百万个招聘广告中挖掘的专家知识，构建并发布了 SkillMatch，这是技能相关性任务的基准。此外，我们提出了一种可扩展的自监督学习技术，以适应基于招聘广告中技能共现的 Sentence-BERT 模型。这种新方法大大超越了在 SkillMatch 上衡量技能相关性的传统模型。通过公开发布 SkillMatch，我们旨在为提高基于技能的推荐系统的准确性和透明度的研究奠定基础。

Title: Named Clinical Entity Recognition Benchmark

Authors: Wadood M Abdul, Marco AF Pimentel, Muhammad Umar Salman, Tathagata Raha, Clément Christophe, Praveen K Kanithi, Nasir Hayat, Ronnie Rajan, Shadab Khan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05046
Pdf URL: https://arxiv.org/pdf/2410.05046
Copy Paste: [[2410.05046]] Named Clinical Entity Recognition Benchmark(https://arxiv.org/abs/2410.05046)
Keywords: language model
Abstract: This technical report introduces a Named Clinical Entity Recognition Benchmark for evaluating language models in healthcare, addressing the crucial natural language processing (NLP) task of extracting structured information from clinical narratives to support applications like automated coding, clinical trial cohort identification, and clinical decision support. The leaderboard provides a standardized platform for assessing diverse language models, including encoder and decoder architectures, on their ability to identify and classify clinical entities across multiple medical domains. A curated collection of openly available clinical datasets is utilized, encompassing entities such as diseases, symptoms, medications, procedures, and laboratory measurements. Importantly, these entities are standardized according to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, ensuring consistency and interoperability across different healthcare systems and datasets, and a comprehensive evaluation of model performance. Performance of models is primarily assessed using the F1-score, and it is complemented by various assessment modes to provide comprehensive insights into model performance. The report also includes a brief analysis of models evaluated to date, highlighting observed trends and limitations. By establishing this benchmarking framework, the leaderboard aims to promote transparency, facilitate comparative analyses, and drive innovation in clinical entity recognition tasks, addressing the need for robust evaluation methods in healthcare NLP.
摘要：本技术报告介绍了一种用于评估医疗保健领域语言模型的命名临床实体识别基准，解决了从临床叙述中提取结构化信息以支持自动编码、临床试验队列识别和临床决策支持等应用的关键自然语言处理 (NLP) 任务。排行榜提供了一个标准化平台，用于评估各种语言模型（包括编码器和解码器架构）在多个医疗领域识别和分类临床实体的能力。报告使用了精选的公开临床数据集，涵盖疾病、症状、药物、程序和实验室测量等实体。重要的是，这些实体根据观察性医疗结果伙伴关系 (OMOP) 通用数据模型进行了标准化，确保了不同医疗保健系统和数据集之间的一致性和互操作性，以及对模型性能的全面评估。模型性能主要使用 F1 分数进行评估，并辅以各种评估模式，以提供对模型性能的全面见解。报告还包括对迄今为止评估的模型的简要分析，重点介绍了观察到的趋势和局限性。通过建立这个基准框架，排行榜旨在提高透明度，促进比较分析，推动临床实体识别任务的创新，满足医疗保健 NLP 对稳健评估方法的需求。

Title: A test suite of prompt injection attacks for LLM-based machine translation

Authors: Antonio Valerio Miceli-Barone, Zhifan Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05047
Pdf URL: https://arxiv.org/pdf/2410.05047
Copy Paste: [[2410.05047]] A test suite of prompt injection attacks for LLM-based machine translation(https://arxiv.org/abs/2410.05047)
Keywords: llm, prompt
Abstract: LLM-based NLP systems typically work by embedding their input data into prompt templates which contain instructions and/or in-context examples, creating queries which are submitted to an LLM, and then parsing the LLM response in order to generate the system outputs. Prompt Injection Attacks (PIAs) are a type of subversion of these systems where a malicious user crafts special inputs which interfere with the prompt templates, causing the LLM to respond in ways unintended by the system designer. Recently, Sun and Miceli-Barone proposed a class of PIAs against LLM-based machine translation. Specifically, the task is to translate questions from the TruthfulQA test suite, where an adversarial prompt is prepended to the questions, instructing the system to ignore the translation instruction and answer the questions instead. In this test suite, we extend this approach to all the language pairs of the WMT 2024 General Machine Translation task. Moreover, we include further attack formats beyond the one originally studied.
摘要：基于 LLM 的 NLP 系统通常通过将其输入数据嵌入到包含指令和/或上下文示例的提示模板中，创建提交给 LLM 的查询，然后解析 LLM 响应以生成系统输出来工作。提示注入攻击 (PIA) 是这些系统的一种颠覆，其中恶意用户制作特殊输入来干扰提示模板，导致 LLM 以系统设计者意想不到的方式响应。最近，Sun 和 Miceli-Barone 提出了一类针对基于 LLM 的机器翻译的 PIA。具体来说，任务是翻译 TruthfulQA 测试套件中的问题，其中对抗性提示被添加到问题前面，指示系统忽略翻译指令并回答问题。在此测试套件中，我们将这种方法扩展到 WMT 2024 通用机器翻译任务的所有语言对。此外，除了最初研究的攻击格式之外，我们还包括其他攻击格式。

Title: Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes

Authors: Kosuke Nishida, Kyosuke Nishida, Kuniko Saito
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05052
Pdf URL: https://arxiv.org/pdf/2410.05052
Copy Paste: [[2410.05052]] Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes(https://arxiv.org/abs/2410.05052)
Keywords: language model
Abstract: Loss spikes, a phenomenon in which the loss value diverges suddenly, are a fundamental issue in the pre-training of large language models. This paper hypothesizes that the non-uniformity of the norm of the parameters is one of the causes of loss spikes. In training neural networks, the scale of the gradients must be kept constant throughout the layers to avoid the vanishing and exploding gradients problem. However, to meet these requirements in the Transformer model, the norm of the model parameters must be non-uniform, and thus parameters whose norm is smaller are more sensitive to the parameter update. To address this issue, we propose a novel technique, weight scaling as reparameterization (WeSaR). WeSaR introduces a gate parameter per parameter matrix and adjusts it to the value satisfying the requirements. Because of the gate parameter, WeSaR sets the norm of the original parameters uniformly, which results in stable training. Experimental results with Transformer decoders consisting of 130 million, 1.3 billion, and 13 billion parameters showed that WeSaR stabilizes and accelerates training and that it outperformed the compared methods, including popular initialization methods.
摘要：损失尖峰（loss spikes）是一种损失值突然发散的现象，是大型语言模型预训练中的一个基本问题。本文认为参数范数的不均匀性是导致损失尖峰的原因之一。在神经网络的训练中，需要保持各层梯度的尺度恒定，以避免梯度消失和爆炸问题。然而，在 Transformer 模型中，为了满足这些要求，模型参数的范数必须是非均匀的，因此范数较小的参数对参数更新更敏感。为了解决这个问题，我们提出了一种新技术，即权重缩放作为重参数化（WeSaR）。WeSaR 为每个参数矩阵引入一个门参数，并将其调整为满足要求的值。由于门参数的存在，WeSaR 会均匀地设置原始参数的范数，从而实现稳定的训练。使用由 1.3 亿、13 亿和 130 亿个参数组成的 Transformer 解码器的实验结果表明，WeSaR 可以稳定和加速训练，并且其表现优于包括流行初始化方法在内的其他方法。
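One plausible reading of the gate idea in this abstract is that every stored matrix is initialized at the same scale, and a per-matrix scalar gate restores the scale that the layer actually needs. The sketch below follows that reading only; the function names, the common standard deviation of 0.02, and the target value are assumptions for illustration, not details confirmed by the abstract.

```python
import random
import statistics


def init_reparameterized(rows, cols, target_std, common_std=0.02, seed=0):
    """Draw a matrix at a shared scale and return it with a scalar gate.

    Assumption: gate * stored weights should have standard deviation target_std.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, common_std) for _ in range(cols)] for _ in range(rows)]
    gate = target_std / common_std
    return weights, gate


def effective_weights(weights, gate):
    """Weights actually used in the forward pass: gate times the stored matrix."""
    return [[gate * w for w in row] for row in weights]


stored, gate = init_reparameterized(50, 50, target_std=0.1)
used = effective_weights(stored, gate)

stored_std = statistics.pstdev(w for row in stored for w in row)
used_std = statistics.pstdev(w for row in used for w in row)
print(round(gate, 3), round(stored_std, 4), round(used_std, 4))
```

Under this assumption the stored parameters of every matrix share one scale, while the gate carries the layer-specific scale.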

Title: ZEBRA: Zero-Shot Example-Based Retrieval Augmentation for Commonsense Question Answering

Authors: Francesco Maria Molfese, Simone Conia, Riccardo Orlando, Roberto Navigli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05077
Pdf URL: https://arxiv.org/pdf/2410.05077
Copy Paste: [[2410.05077]] ZEBRA: Zero-Shot Example-Based Retrieval Augmentation for Commonsense Question Answering(https://arxiv.org/abs/2410.05077)
Keywords: language model, llm
Abstract: Current Large Language Models (LLMs) have shown strong reasoning capabilities in commonsense question answering benchmarks, but the process underlying their success remains largely opaque. As a consequence, recent approaches have equipped LLMs with mechanisms for knowledge retrieval, reasoning and introspection, not only to improve their capabilities but also to enhance the interpretability of their outputs. However, these methods require additional training, hand-crafted templates or human-written explanations. To address these issues, we introduce ZEBRA, a zero-shot question answering framework that combines retrieval, case-based reasoning and introspection and dispenses with the need for additional training of the LLM. Given an input question, ZEBRA retrieves relevant question-knowledge pairs from a knowledge base and generates new knowledge by reasoning over the relationships in these pairs. This generated knowledge is then used to answer the input question, improving the model's performance and interpretability. We evaluate our approach across 8 well-established commonsense reasoning benchmarks, demonstrating that ZEBRA consistently outperforms strong LLMs and previous knowledge integration approaches, achieving an average accuracy improvement of up to 4.5 points.
摘要：当前的大型语言模型 (LLM) 在常识性问答基准中表现出强大的推理能力，但其成功背后的过程在很大程度上仍然不透明。因此，最近的方法为 LLM 配备了知识检索、推理和自省机制，不仅可以提高其能力，还可以增强其输出的可解释性。然而，这些方法需要额外的训练、手工制作的模板或人工编写的解释。为了解决这些问题，我们引入了 ZEBRA，这是一个零样本问答框架，它结合了检索、基于案例的推理和自省，无需对 LLM 进行额外的训练。给定一个输入问题，ZEBRA 从知识库中检索相关的问题知识对，并通过推理这些对中的关系来生成新知识。然后使用生成的知识来回答输入问题，从而提高模型的性能和可解释性。我们通过 8 个完善的常识推理基准对我们的方法进行了评估，结果表明 ZEBRA 的表现始终优于强大的 LLM 和以前的知识整合方法，平均准确度提高了 4.5 分。

Title: ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Authors: Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, Huan Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05080
Pdf URL: https://arxiv.org/pdf/2410.05080
Copy Paste: [[2410.05080]] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery(https://arxiv.org/abs/2410.05080)
Keywords: language model, llm, prompt, agent
Abstract: The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about the true capabilities of such agents. In this work, we argue that for an agent to fully automate scientific discovery, it must be able to complete all essential tasks in the workflow. Thus, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. These results underscore the limited capacities of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.
摘要：语言模型 (LLM) 的进步激起了人们对开发基于 LLM 的语言代理以实现端到端的科学发现自动化的兴趣，这既激发了人们对此类代理的真正能力的兴奋，也引发了怀疑。在这项工作中，我们认为，要使代理完全自动化科学发现，它必须能够完成工作流程中的所有基本任务。因此，我们呼吁在对端到端自动化做出大胆声明之前，对科学工作流程中各个任务上的代理进行严格评估。为此，我们提出了 ScienceAgentBench，这是评估数据驱动科学发现的语言代理的新基准。为了确保基准的科学真实性和现实世界相关性，我们从四个学科的 44 个同行评审出版物中提取了 102 个任务，并聘请了 9 位主题专家对其进行验证。我们将每个任务的目标输出统一到一个独立的 Python 程序文件中，并使用一系列评估指标来检查生成的程序、执行结果和成本。每项任务都要经过注释者和主题专家的多轮手动验证，以确保其注释质量和科学合理性。我们还提出了两种有效的策略来缓解数据污染问题。使用我们的基准，我们评估了五个开放权重和专有的 LLM，每个都有三个框架：直接提示、OpenHands 和自我调试。对于每个任务，如果尝试三次，表现最好的代理只能独立解决 32.4% 的任务，使用专家提供的知识只能解决 34.3% 的任务。这些结果强调了当前语言代理在生成数据驱动发现代码方面的能力有限，更不用说用于科学研究的端到端自动化了。

Title: Explanation sensitivity to the randomness of large language models: the case of journalistic text classification

Authors: Jeremie Bogaert, Marie-Catherine de Marneffe, Antonin Descampe, Louis Escouflaire, Cedrick Fairon, Francois-Xavier Standaert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05085
Pdf URL: https://arxiv.org/pdf/2410.05085
Copy Paste: [[2410.05085]] Explanation sensitivity to the randomness of large language models: the case of journalistic text classification(https://arxiv.org/abs/2410.05085)
Keywords: language model, llm
Abstract: Large language models (LLMs) perform very well in several natural language processing tasks but raise explainability challenges. In this paper, we examine the effect of random elements in the training of LLMs on the explainability of their predictions. We do so on a task of opinionated journalistic text classification in French. Using a fine-tuned CamemBERT model and an explanation method based on relevance propagation, we find that training with different random seeds produces models with similar accuracy but variable explanations. We therefore claim that characterizing the explanations' statistical distribution is needed for the explainability of LLMs. We then explore a simpler model based on textual features which offers stable explanations but is less accurate. Hence, this simpler model corresponds to a different tradeoff between accuracy and explainability. We show that it can be improved by inserting features derived from CamemBERT's explanations. We finally discuss new research directions suggested by our results, in particular regarding the origin of the sensitivity observed in the training randomness.
摘要：大型语言模型 (LLM) 在许多自然语言处理任务中表现非常出色，但也带来了可解释性挑战。在本文中，我们研究了 LLM 训练中的随机元素对其预测可解释性的影响。我们在一项法语新闻文本分类任务中进行了研究。使用经过微调的 CamemBERT 模型和基于相关性传播的解释方法，我们发现使用不同的随机种子进行训练会产生具有相似准确度但解释不同的模型。因此，我们声称，表征解释的统计分布对于 LLM 的可解释性是必要的。然后，我们探索了一种基于文本特征的更简单的模型，该模型提供了稳定的解释，但准确性较低。因此，这个更简单的模型对应于准确性和可解释性之间的不同权衡。我们表明，可以通过插入从 CamemBERT 的解释中得出的特征来改进它。最后，我们讨论了我们的结果所提出的新研究方向，特别是关于在训练随机性中观察到的敏感性的起源。

Title: Investigating large language models for their competence in extracting grammatically sound sentences from transcribed noisy utterances

Authors: Alina Wróblewska
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05099
Pdf URL: https://arxiv.org/pdf/2410.05099
Copy Paste: [[2410.05099]] Investigating large language models for their competence in extracting grammatically sound sentences from transcribed noisy utterances(https://arxiv.org/abs/2410.05099)
Keywords: language model, llm
Abstract: Selectively processing noisy utterances while effectively disregarding speech-specific elements poses no considerable challenge for humans, as they exhibit remarkable cognitive abilities to separate semantically significant content from speech-specific noise (i.e. filled pauses, disfluencies, and restarts). These abilities may be driven by mechanisms based on acquired grammatical rules that compose abstract syntactic-semantic structures within utterances. Segments without syntactic and semantic significance are consistently disregarded in these structures. The structures, in tandem with lexis, likely underpin language comprehension and thus facilitate effective communication. In our study, grounded in linguistically motivated experiments, we investigate whether large language models (LLMs) can effectively perform analogical speech comprehension tasks. In particular, we examine the ability of LLMs to extract well-structured utterances from transcriptions of noisy dialogues. We conduct two evaluation experiments in the Polish language scenario, using a dataset presumably unfamiliar to LLMs to mitigate the risk of data contamination. Our results show that not all extracted utterances are correctly structured, indicating that either LLMs do not fully acquire syntactic-semantic rules or they acquire them but cannot apply them effectively. We conclude that the ability of LLMs to comprehend noisy utterances is still relatively superficial compared to human proficiency in processing them.
摘要：选择性地处理嘈杂的话语，同时有效地忽略语音特定元素，对人类来说并不构成重大挑战，因为他们表现出非凡的认知能力，可以将语义上重要的内容与语音特定的噪音（即填充的停顿、不流畅和重新开始）区分开来。这些能力可能是由基于习得的语法规则的机制驱动的，这些规则构成了话语中抽象的句法语义结构。在这些结构中，没有句法和语义意义的片段始终被忽略。这些结构与词汇一起，可能支撑着语言理解，从而促进有效的沟通。在我们的研究中，基于语言学驱动的实验，我们研究了大型语言模型 (LLM) 是否可以有效地执行类比语音理解任务。特别是，我们研究了 LLM 从嘈杂对话的转录中提取结构良好的话语的能力。我们在波兰语场景中进行了两项评估实验，使用 LLM 可能不熟悉的数据集来降低数据污染风险。我们的结果表明，并非所有提取的话语都是正确构造的，这表明 LLM 要么没有完全掌握句法语义规则，要么掌握了规则但无法有效应用。我们得出的结论是，与人类处理这些话语的能力相比，LLM 理解嘈杂话语的能力仍然相对肤浅。

Title: SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks

Authors: Fenia Christopoulou, Ronald Cardenas, Gerasimos Lampouras, Haitham Bou-Ammar, Jun Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05102
Pdf URL: https://arxiv.org/pdf/2410.05102
Copy Paste: [[2410.05102]] SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks(https://arxiv.org/abs/2410.05102)
Keywords: language model, llm
Abstract: Preference Optimization (PO) has proven an effective step for aligning language models to human-desired behaviors. Current variants, following the offline Direct Preference Optimization objective, have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected by each word in a sequence equally but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best weight reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments on multiple domains, including sentiment control, dialogue, text summarization and text-to-code generation, illustrate that our approach assigns meaningful weights to tokens according to the target task, generates more responses with the desired preference and improves reasoning tasks by up to 2 percentage points compared to other token- and response-level PO methods.
摘要：偏好优化 (PO) 已被证明是将语言模型与人类期望的行为相一致的有效步骤。当前的变体遵循离线直接偏好优化目标，专注于严格的设置，其中所有标记都为损失函数贡献 KL 散度和奖励信号。然而，人类偏好并不会受到序列中每个单词的同等影响，而是通常取决于特定的单词或短语，例如，有害术语的存在会导致非首选响应。基于这一观察，我们认为在 PO 期间并非所有标记都应具有同等权重，并提出了一个称为 SparsePO 的灵活目标，旨在自动学习在 PO 训练期间对每个标记对应的 KL 散度和奖励进行加权。我们提出了两种不同的权重掩码变体，它们可以从参考模型本身派生出来，也可以在运行中学习。值得注意的是，我们的方法在学习的掩码中引入了稀疏性，使模型能够学习如何在标记级别对奖励和 KL 散度贡献进行最佳加权，从而学习最佳的掩码稀疏度。在情绪控制、对话、文本摘要和文本到代码生成等多个领域进行的大量实验表明，与其他标记和响应级 PO 方法相比，我们的方法根据目标任务为标记分配有意义的权重，生成具有所需偏好的更多响应，并将推理任务提高多达 2 个百分点。

Title: Deciphering the Interplay of Parametric and Non-parametric Memory in Retrieval-augmented Language Models

Authors: Mehrdad Farahani, Richard Johansson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05162
Pdf URL: https://arxiv.org/pdf/2410.05162
Copy Paste: [[2410.05162]] Deciphering the Interplay of Parametric and Non-parametric Memory in Retrieval-augmented Language Models(https://arxiv.org/abs/2410.05162)
Keywords: language model, retrieval-augmented generation
Abstract: Generative language models often struggle with specialized or less-discussed knowledge. A potential solution is found in Retrieval-Augmented Generation (RAG) models, which retrieve information before generating responses. In this study, we explore how the Atlas approach, a RAG model, decides between what it already knows (parametric) and what it retrieves (non-parametric). We use causal mediation analysis and controlled experiments to examine how internal representations influence information processing. Our findings disentangle the effects of parametric knowledge and the retrieved context. They indicate that in cases where the model can choose between both types of information (parametric and non-parametric), it relies more on the context than the parametric knowledge. Furthermore, the analysis investigates the computations involved in how the model uses the information from the context. We find that multiple mechanisms are active within the model and can be detected with mediation analysis: first, the decision of whether the context is relevant, and second, how the encoder computes output representations to support copying when relevant.
摘要：生成语言模型经常难以处理专业或较少讨论的知识。检索增强生成 (RAG) 模型是一种潜在的解决方案，它的作用类似于在生成响应之前检索信息。在本研究中，我们探索 Atlas 方法（一种 RAG 模型）如何在它已知的内容（参数）和它检索的内容（非参数）之间做出决定。我们使用因果中介分析和受控实验来检查内部表征如何影响信息处理。我们的研究结果解开了参数知识和检索到的上下文的影响。它们表明，在模型可以在两种类型的信息（参数和非参数）之间进行选择的情况下，它更多地依赖于上下文而不是参数知识。此外，分析调查了模型如何使用来自上下文的信息所涉及的计算。我们发现模型中有多种机制处于活跃状态，可以通过中介分析检测到：首先，决定上下文是否相关，其次，编码器如何计算输出表示以支持相关时的复制。

Title: ReasoningRank: Teaching Student Models to Rank through Reasoning-Based Knowledge Distillation

Authors: Yuelyu Ji, Zhuochun Li, Rui Meng, Daqing He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05168
Pdf URL: https://arxiv.org/pdf/2410.05168
Copy Paste: [[2410.05168]] ReasoningRank: Teaching Student Models to Rank through Reasoning-Based Knowledge Distillation(https://arxiv.org/abs/2410.05168)
Keywords: language model, llm
Abstract: Reranking documents based on their relevance to a given query is critical in information retrieval. Traditional reranking methods often focus on improving the initial rankings but lack transparency, failing to explain why one document is ranked higher. In this paper, we introduce ReasoningRank, a novel reranking approach that enhances clarity by generating two types of reasoning: explicit reasoning, which explains how a document addresses the query, and comparison reasoning, which justifies the relevance of one document over another. We leverage large language models (LLMs) as teacher models to generate these explanations and distill this knowledge into smaller, more resource-efficient student models. While the student models may not outperform LLMs in speed, they significantly reduce the computational burden by requiring fewer resources, making them more suitable for large-scale or resource-constrained settings. These student models are trained to both generate meaningful reasoning and rerank documents, achieving competitive performance across multiple datasets, including MSMARCO and BRIGHT. Experiments demonstrate that ReasoningRank improves reranking accuracy and provides valuable insights into the decision-making process, offering a structured and interpretable solution for reranking tasks.
摘要：根据文档与给定查询的相关性对其进行重新排序在信息检索中至关重要。传统的重新排序方法通常侧重于提高初始排名，但缺乏透明度，无法解释为什么一个文档排名更高。在本文中，我们介绍了一种新颖的重新排序方法 ReasoningRank，它通过生成两种类型的推理来提高清晰度：显式推理，解释文档如何解决查询，以及比较推理，证明一个文档相对于另一个文档的相关性。我们利用大型语言模型 (LLM) 作为教师模型来生成这些解释，并将这些知识提炼为更小、更节省资源的学生模型。虽然学生模型在速度上可能不会胜过 LLM，但它们通过减少资源需求显著减轻了计算负担，使其更适合大规模或资源受限的环境。这些学生模型经过训练，既可以生成有意义的推理，也可以重新排序文档，在包括 MSMARCO 和 BRIGHT 在内的多个数据集上实现了具有竞争力的性能。实验表明，ReasoningRank 提高了重新排序的准确性并为决策过程提供了有价值的见解，为重新排序任务提供了结构化和可解释的解决方案。

Title: Enhancing Equity in Large Language Models for Medical Applications

Authors: Yuelyu Ji, Wenhe Ma, Sonish Sivarajkumar, Hang Zhang, Eugene Mathew Sadhu, Zhuochun Li, Xizhi Wu, Shyam Visweswaran, Yanshan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05180
Pdf URL: https://arxiv.org/pdf/2410.05180
Copy Paste: [[2410.05180]] Enhancing Equity in Large Language Models for Medical Applications(https://arxiv.org/abs/2410.05180)
Keywords: language model, llm
Abstract: Recent advancements have highlighted the potential of large language models (LLMs) in medical applications, notably in automating Clinical Trial Matching for translational research and providing medical question-answering for clinical decision support. However, our study reveals significant inequities in the use of LLMs, particularly for individuals from specific racial, gender, and underrepresented groups influenced by social determinants of health. These disparities could worsen existing health inequities if LLMs are broadly adopted in healthcare. To address this, we propose and evaluate a novel framework, EquityGuard, designed to detect and mitigate biases in LLM-based medical applications. EquityGuard incorporates a Bias Detection Mechanism capable of identifying and correcting unfair predictions, thus enhancing outcomes and promoting equity across diverse population groups.
摘要：最近的进展凸显了大型语言模型 (LLM) 在医学应用中的潜力，特别是在自动化临床试验匹配以进行转化研究和提供医学问答以支持临床决策方面。然而，我们的研究揭示了 LLM 使用方面存在显著的不平等，特别是对于受健康社会决定因素影响的特定种族、性别和代表性不足群体的个人。如果 LLM 在医疗保健领域得到广泛采用，这些差异可能会加剧现有的健康不平等。为了解决这个问题，我们提出并评估了一个新框架 EquityGuard，旨在检测和减轻基于 LLM 的医疗应用中的偏见。EquityGuard 采用了偏见检测机制，能够识别和纠正不公平的预测，从而改善结果并促进不同人群之间的公平。

Title: RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Authors: Qiyuan Zhang, Yufei Wang, Tiezheng YU, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05193
Pdf URL: https://arxiv.org/pdf/2410.05193
Copy Paste: [[2410.05193]] RevisEval: Improving LLM-as-a-Judge via Response-Adapted References(https://arxiv.org/abs/2410.05193)
Keywords: language model, llm
Abstract: With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing the text generation quality in a wide range of tasks. However, there remains a reliability gap between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the pervasive use of references in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm via response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, and then treats the revised text as the reference (response-adapted reference) for the subsequent evaluation. Extensive experiments demonstrate that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost the classical text metrics, e.g., BLEU and BERTScore, compared to traditional references and even rival the LLM-as-a-Judge. A detailed analysis is also conducted to confirm RevisEval's effectiveness in bias reduction, the impact of inference cost, and reference relevance.
摘要：经过近来研究的巨大努力，LLM-as-a-Judge 已经成为一种经济高效的人工评估替代方案，可用于评估广泛任务中的文本生成质量。然而，LLM-as-a-Judge 与人工评估之间仍然存在可靠性差距。一个重要原因是评估过程中缺乏引导式预言。受经典文本评估中广泛使用的参考作用的启发，我们引入了 RevisEval，一种通过响应适应参考的新型文本生成评估范式。RevisEval 由一个关键观察结果驱动，即理想的参考应该与要评估的响应保持必要的相关性。具体而言，RevisEval 利用大型语言模型 (LLM) 的文本修订功能来自适应地修改响应，然后将修订后的文本作为后续评估的参考（响应适应参考）。大量实验表明，RevisEval 在 NLG 任务和开放式指令遵循任务中的表现优于使用 LLM-as-a-Judge 的传统无参考和基于参考的评估范式。更重要的是，与传统参考相比，我们的响应适应参考可以进一步提高经典文本指标（例如 BLEU 和 BERTScore），甚至可以与 LLM-as-a-Judge 相媲美。还进行了详细分析，以确认 RevisEval 在减少偏差方面的有效性、推理成本的影响以及参考相关性。

Title: Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates

Authors: Avanika Narayan, Mayee F. Chen, Kush Bhatia, Christopher Ré
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05224
Pdf URL: https://arxiv.org/pdf/2410.05224
Copy Paste: [[2410.05224]] Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates(https://arxiv.org/abs/2410.05224)
Keywords: language model, gpt, llm
Abstract: Fine-tuning large language models (LLMs) on instruction datasets is a common way to improve their generative capabilities. However, instruction datasets can be expensive and time-consuming to manually curate, and while LLM-generated data is less labor-intensive, it may violate user privacy agreements or terms of service of LLM providers. Therefore, we seek a way of constructing instruction datasets with samples that are not generated by humans or LLMs but still improve LLM generative capabilities. In this work, we introduce Cookbook, a framework that programmatically generates training data consisting of simple patterns over random tokens, resulting in a scalable, cost-effective approach that avoids legal and privacy issues. First, Cookbook uses a template -- a data generating Python function -- to produce training data that encourages the model to learn an explicit pattern-based rule that corresponds to a desired task. We find that fine-tuning on Cookbook-generated data is able to improve performance on its corresponding task by up to 52.7 accuracy points. Second, since instruction datasets improve performance on multiple downstream tasks simultaneously, Cookbook algorithmically learns how to mix data from various templates to optimize performance on multiple tasks. On the standard multi-task GPT4ALL evaluation suite, Mistral-7B fine-tuned using a Cookbook-generated dataset attains the best accuracy on average compared to other 7B parameter instruction-tuned models and is the best performing model on 3 out of 8 tasks. Finally, we analyze when and why Cookbook improves performance and present a metric that allows us to verify that the improvement is largely explained by the model's generations adhering better to template rules.
摘要：在指令数据集上对大型语言模型 (LLM) 进行微调是提高其生成能力的常用方法。但是，手动整理指令数据集可能既昂贵又耗时，尽管 LLM 生成的数据劳动强度较低，但它可能违反用户隐私协议或 LLM 提供商的服务条款。因此，我们寻求一种构建指令数据集的方法，这些数据集的样本不是由人类或 LLM 生成的，但仍能提高 LLM 的生成能力。在这项工作中，我们引入了 Cookbook，这是一个框架，它以编程方式生成由随机标记上的简单模式组成的训练数据，从而产生一种可扩展、经济高效的方法，避免了法律和隐私问题。首先，Cookbook 使用模板（一个数据生成 Python 函数）来生成训练数据，鼓励模型学习与所需任务相对应的显式模式规则。我们发现，对 Cookbook 生成的数据进行微调能够将其相应任务的性能提高多达 52.7 个准确度点。其次，由于指令数据集可同时提高多个下游任务的性能，因此 Cookbook 算法会学习如何混合来自各种模板的数据以优化多个任务的性能。在标准的多任务 GPT4ALL 评估套件中，使用 Cookbook 生成的数据集进行微调的 Mistral-7B 与其他 7B 参数指令调整模型相比，平均准确率最高，并且在 8 项任务中的 3 项中表现最佳。最后，我们分析了 Cookbook 何时以及为何能提高性能，并提出了一个指标，使我们能够验证这种改进主要是由于模型的生成更好地遵循了模板规则。
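To make the notion of a template as a data generating Python function concrete, here is a minimal sketch of what such a function could look like. The specific rule (report the token that follows a queried token), the function name, and the prompt wording are hypothetical and are not taken from the paper; only the general idea of a simple pattern over random tokens comes from the abstract.

```python
import random

# Hypothetical vocabulary of meaningless tokens.
VOCAB = [f"tok{i}" for i in range(50)]


def following_token_template(vocab, length, rng):
    """Generate one training example from a pattern-based rule over random tokens.

    Rule (an assumption for illustration): the response is the token that
    immediately follows the queried token in the random sequence.
    """
    tokens = rng.sample(vocab, length)   # distinct tokens, so the query is unambiguous
    idx = rng.randrange(length - 1)      # never pick the last token as the query
    query = tokens[idx]
    prompt = " ".join(tokens) + f" | which token follows {query}?"
    return {"prompt": prompt, "response": tokens[idx + 1]}


rng = random.Random(0)
dataset = [following_token_template(VOCAB, 6, rng) for _ in range(3)]
for example in dataset:
    print(example)
```

Because the examples come from a program rather than from people or an LLM, the dataset can be made arbitrarily large at negligible cost, which is the property the abstract emphasizes.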

Title: SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

Authors: Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, Sanqiang Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05248
Pdf URL: https://arxiv.org/pdf/2410.05248
Copy Paste: [[2410.05248]] SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe(https://arxiv.org/abs/2410.05248)
Keywords: language model, llm
Abstract: To induce desired behaviors in large language models (LLMs) for interaction-driven tasks, the instruction-tuning stage typically trains LLMs on instruction-response pairs using the next-token prediction (NTP) loss. Previous work aiming to improve instruction-tuning performance often emphasizes the need for higher-quality supervised fine-tuning (SFT) datasets, which typically involves expensive data filtering with proprietary LLMs or labor-intensive data generation by human annotators. However, these approaches do not fully leverage the datasets' intrinsic properties, resulting in high computational and labor costs, thereby limiting scalability and performance gains. In this paper, we propose SFTMix, a novel recipe that elevates instruction-tuning performance beyond the conventional NTP paradigm, without the need for well-curated datasets. Observing that LLMs exhibit uneven confidence across the semantic representation space, we argue that examples with different confidence levels should play distinct roles during the instruction-tuning process. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels, then applies a Mixup-based regularization to mitigate overfitting on confident examples while propagating supervision signals to improve learning on relatively unconfident ones. This approach enables SFTMix to significantly outperform NTP across a wide range of instruction-following and healthcare domain-specific SFT tasks, demonstrating its adaptability to diverse LLM families and scalability to datasets of any size. Comprehensive ablation studies further verify the robustness of SFTMix's design choices, underscoring its versatility in consistently enhancing performance across different LLMs and datasets in broader natural language processing applications.
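The abstract above describes splitting examples by confidence and applying a Mixup-based regularization between the two groups. The following is a minimal sketch of generic Mixup interpolation under stated assumptions, not the authors' implementation: the median split, the Beta parameter `alpha`, and the toy `hidden`, `labels`, and `confidence` arrays are all hypothetical stand-ins (in an LLM the labels would be next-token targets and the confidences would come from training dynamics).

```python
import numpy as np

def split_by_confidence(confidence):
    """Return (confident_idx, unconfident_idx) using a median split (assumption)."""
    order = np.argsort(confidence)          # ascending confidence
    half = len(order) // 2
    return order[half:], order[:half]

def mixup(h_a, y_a, h_b, y_b, lam):
    """Linearly interpolate representations and (soft) labels with weight lam."""
    mixed_h = lam * h_a + (1.0 - lam) * h_b
    mixed_y = lam * y_a + (1.0 - lam) * y_b
    return mixed_h, mixed_y

rng = np.random.default_rng(0)
n, d, n_classes = 6, 4, 3
hidden = rng.normal(size=(n, d))                              # toy representations
labels = np.eye(n_classes)[rng.integers(0, n_classes, n)]     # toy one-hot targets
confidence = rng.uniform(size=n)                              # toy per-example confidence

confident_idx, unconfident_idx = split_by_confidence(confidence)
alpha = 0.4                                                   # hypothetical Beta parameter
lam = rng.beta(alpha, alpha)
mixed_h, mixed_y = mixup(
    hidden[confident_idx], labels[confident_idx],
    hidden[unconfident_idx], labels[unconfident_idx],
    lam,
)
```

In a training loop, a loss on `mixed_h` against `mixed_y` would be added to the usual next-token prediction loss as a regularizer; how the two terms are weighted is not stated in the abstract.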

Title: Causal Micro-Narratives

Authors: Mourad Heddaya, Qingcheng Zeng, Chenhao Tan, Rob Voigt, Alexander Zentefis
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05252
Pdf URL: https://arxiv.org/pdf/2410.05252
Copy Paste: [[2410.05252]] Causal Micro-Narratives(https://arxiv.org/abs/2410.05252)
Keywords: language model, llm
Abstract: We present a novel approach to classify causal micro-narratives from text. These narratives are sentence-level explanations of the cause(s) and/or effect(s) of a target subject. The approach requires only a subject-specific ontology of causes and effects, and we demonstrate it with an application to inflation narratives. Using a human-annotated dataset spanning historical and contemporary US news articles for training, we evaluate several large language models (LLMs) on this multi-label classification task. The best-performing model--a fine-tuned Llama 3.1 8B--achieves F1 scores of 0.87 on narrative detection and 0.71 on narrative classification. Comprehensive error analysis reveals challenges arising from linguistic ambiguity and highlights how model errors often mirror human annotator disagreements. This research establishes a framework for extracting causal micro-narratives from real-world data, with wide-ranging applications to social science research.
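The abstract reports F1 scores on a multi-label classification task. As a worked illustration of how such a score can be computed, here is a small sketch of micro-averaged F1 over label sets. The abstract does not say which averaging the authors use, and the label names below are hypothetical, so this is only an example of the metric, not a reproduction of their evaluation.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1(gold, pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all sentences."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    return f1_from_counts(tp, fp, fn)

# Hypothetical cause labels for three sentences (an empty set means "no narrative").
gold = [{"demand", "supply"}, {"wages"}, set()]
pred = [{"demand"}, {"wages", "monetary"}, set()]

score = micro_f1(gold, pred)  # tp=2, fp=1, fn=1
```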

Title: GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

Authors: Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, Moshe Tennenholtz
Subjects: cs.CL, cs.AI, cs.CY, cs.GT, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05254
Pdf URL: https://arxiv.org/pdf/2410.05254
Copy Paste: [[2410.05254]] GLEE: A Unified Framework and Benchmark for Language-based Economic Environments(https://arxiv.org/abs/2410.05254)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? Can they mimic human behavior? Do they tend to reach an efficient and fair outcome? What is the role of natural language in the strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions become crucial concerning the economic and societal implications of integrating LLM-based agents into real-world data-driven systems, such as online retail platforms and recommender systems. While the ML community has been exploring the potential of LLMs in such multi-agent setups, varying assumptions, design choices and evaluation criteria across studies make it difficult to draw robust and meaningful conclusions. To address this, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom and economic measures to evaluate agents' performance (self-gain), as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and utilize it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents to human players in various economic contexts; (ii) evaluate agents in both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents.

Title: Differential Transformer

Authors: Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05258
Pdf URL: https://arxiv.org/pdf/2410.05258
Copy Paste: [[2410.05258]] Differential Transformer(https://arxiv.org/abs/2410.05258)
Keywords: language model, hallucination
Abstract: Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which has been considered a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
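The abstract describes attention scores computed as the difference between two separate softmax attention maps. A minimal single-head NumPy sketch of that idea follows. The abstract does not give the exact formulation, so the scalar weight `lam` on the second map, the projection matrices, and the toy shapes are assumptions of this sketch; causal masking and multi-head structure are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, w_q1, w_k1, w_q2, w_k2, w_v, lam=0.5):
    """Attention whose scores are the difference of two softmax maps.

    x: (n_tokens, d_model); each w_*: (d_model, d_head).
    lam is an assumed scalar weight on the second (noise-estimating) map.
    """
    d_head = w_q1.shape[1]
    map1 = softmax((x @ w_q1) @ (x @ w_k1).T / np.sqrt(d_head))
    map2 = softmax((x @ w_q2) @ (x @ w_k2).T / np.sqrt(d_head))
    scores = map1 - lam * map2          # common-mode "noise" cancels in the subtraction
    return scores @ (x @ w_v), scores

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8
x = rng.normal(size=(n_tokens, d_model))
weights = [rng.normal(size=(d_model, d_head)) for _ in range(5)]
out, scores = differential_attention(x, *weights, lam=0.5)
```

Because each softmax row sums to 1, every row of `scores` sums to `1 - lam`, and individual entries can be negative, unlike in standard softmax attention.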

Title: TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

Authors: Qingchen Yu, Shichao Song, Ke Fang, Yunfeng Shi, Zifan Zheng, Hanyu Wang, Simin Niu, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05262
Pdf URL: https://arxiv.org/pdf/2410.05262
Copy Paste: [[2410.05262]] TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles(https://arxiv.org/abs/2410.05262)
Keywords: language model, llm, chain-of-thought
Abstract: As the application of Large Language Models (LLMs) expands, the demand for reliable evaluations increases. Existing LLM evaluation benchmarks primarily rely on static datasets, making it challenging to assess model performance in dynamic interactions with users. Moreover, these benchmarks often depend on specific background knowledge, complicating the measurement of a model's logical reasoning capabilities. Other dynamic evaluation methods based on strong models or manual efforts may introduce biases and incur high costs and time demands, hindering large-scale application. To address these issues, we propose TurtleBench. TurtleBench collects real user guesses from an online Turtle Soup Puzzle platform that we developed. This approach allows for the relatively dynamic generation of evaluation datasets, mitigating the risk of model cheating while aligning assessments more closely with genuine user needs for reasoning capabilities, thus enhancing the reliability of evaluations. TurtleBench includes 1,532 user guesses along with the correctness of guesses after annotation. Using this dataset, we thoroughly evaluated nine of the most advanced LLMs available today. Notably, the OpenAI o1 series models did not achieve leading results in these evaluations. We propose several hypotheses for further research, such as "the latent reasoning of o1 utilizes trivial Chain-of-Thought (CoT) techniques" and "increasing CoT length not only provides reasoning benefits but also incurs noise costs."

Title: Grounding Partially-Defined Events in Multimodal Data

Authors: Kate Sanders, Reno Kriz, David Etter, Hannah Recknor, Alexander Martin, Cameron Carpenter, Jingyang Lin, Benjamin Van Durme
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.05267
Pdf URL: https://arxiv.org/pdf/2410.05267
Copy Paste: [[2410.05267]] Grounding Partially-Defined Events in Multimodal Data(https://arxiv.org/abs/2410.05267)
Keywords: llm, agent
Abstract: How are we able to learn about complex current events just from short snippets of video? While natural language enables straightforward ways to represent under-specified, partially observable events, visual data does not facilitate analogous methods and, consequently, introduces unique challenges in event understanding. With the growing prevalence of vision-capable AI agents, these systems must be able to model events from collections of unstructured video data. To tackle robust event modeling in multimodal settings, we introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task. We propose a corresponding benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities. We propose a collection of LLM-driven approaches to the task of multimodal event analysis, and evaluate them on MultiVENT-G. Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.

Title: Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models

Authors: Fei Wang, Ninareh Mehrabi, Palash Goyal, Rahul Gupta, Kai-Wei Chang, Aram Galstyan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05269
Pdf URL: https://arxiv.org/pdf/2410.05269
Copy Paste: [[2410.05269]] Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models(https://arxiv.org/abs/2410.05269)
Keywords: language model, llm
Abstract: Data is a crucial element in large language model (LLM) alignment. Recent studies have explored using LLMs for efficient data collection. However, LLM-generated data often suffers from quality issues, with underrepresented or absent aspects and low-quality datapoints. To address these problems, we propose Data Advisor, an enhanced LLM-based method for generating data that takes into account the characteristics of the desired dataset. Starting from a set of pre-defined principles in hand, Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation accordingly. Data Advisor can be easily integrated into existing data generation methods to enhance data quality and coverage. Experiments on safety alignment of three representative LLMs (i.e., Mistral, Llama2, and Falcon) demonstrate the effectiveness of Data Advisor in enhancing model safety against various fine-grained safety issues without sacrificing model utility.
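The abstract outlines a loop in which Data Advisor monitors the generated data, identifies weaknesses, and advises the next round of generation. The toy sketch below illustrates that control flow only. Everything in it is a hypothetical stand-in: "weakness" is reduced to the least-represented category, `generate` is a stub where a real system would call an LLM, and the category names are invented for illustration.

```python
from collections import Counter

# Hypothetical safety categories the dataset should cover.
CATEGORIES = ["privacy", "self_harm", "fraud"]

def generate(category, step):
    """Stub generator; a real system would prompt an LLM with the advice."""
    return {"category": category, "prompt": f"example {step} about {category}"}

def advise(dataset, categories):
    """Monitor coverage and return the least-represented category (first on ties)."""
    counts = Counter(item["category"] for item in dataset)
    return min(categories, key=lambda c: counts[c])

def curate(n_items, categories):
    """Generate n_items datapoints, steering each step toward the current weakness."""
    dataset = []
    for step in range(n_items):
        target = advise(dataset, categories)
        dataset.append(generate(target, step))
    return dataset

dataset = curate(6, CATEGORIES)
coverage = Counter(item["category"] for item in dataset)
```

With this greedy rule the categories stay balanced as the dataset grows; the principles and weakness signals used by the actual method are richer than a simple count.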