2024-10-04

Title: CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations

Authors: Yuchen Fan, Xin Zhong, Heng Zhou, Yuchen Zhang, Mingyu Liang, Chengxing Xie, Ermo Hua, Ning Ding, Bowen Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.01945
Pdf URL: https://arxiv.org/pdf/2410.01945
Copy Paste: [[2410.01945]] CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations(https://arxiv.org/abs/2410.01945)
Keywords: prompt, agent
Abstract: Long-Form Question Answering (LFQA) refers to generating in-depth, paragraph-level responses to open-ended questions. Although many LFQA methods have been developed, evaluating LFQA effectively and efficiently remains challenging due to its high complexity and cost. As a result, there is no standard benchmark for LFQA evaluation to date. To address this gap, we make the first attempt by proposing a well-constructed, reference-based benchmark named Chinese exAmination for LFQA Evaluation (CALF), aiming to rigorously assess the performance of automatic evaluation metrics for LFQA. The CALF benchmark is derived from Chinese examination questions that have been translated into English. It includes up to 1476 examples consisting of knowledge-intensive and nuanced responses. Our evaluation comprises three different settings to analyze the behavior of automatic metrics comprehensively. We conducted extensive experiments on 7 traditional evaluation metrics, 3 prompt-based metrics, and 3 trained evaluation metrics, and tested on agent systems for the LFQA evaluation. The results reveal that none of the current automatic evaluation metrics performs comparably to humans, indicating that they cannot capture the dense information contained in long-form responses well. In addition, we provide a detailed analysis of the reasons why automatic evaluation metrics fail when evaluating LFQA, offering valuable insights to advance LFQA evaluation systems. Dataset and associated codes can be accessed at our GitHub repository.
摘要：长篇问答系统 (LFQA) 是指针对开放式问题生成深入的段落级答案。尽管已经开发了许多 LFQA 方法，但由于其复杂性和成本高，有效且高效地评估 LFQA 仍然具有挑战性。因此，到目前为止，还没有 LFQA 评估的标准基准。为了弥补这一空白，我们首次尝试提出了一个精心构建的基于参考的基准，即中文 LFQA 评估考试 (CALF)，旨在严格评估 LFQA 自动评估指标的性能。CALF 基准源自已翻译成英文的中文考试问题。它包括多达 1476 个由知识密集型和细致入微的答案组成的示例。我们的评估包括三种不同的设置，以全面分析自动指标的行为。我们对 7 个传统评估指标、3 个基于提示的指标和 3 个经过训练的评估指标进行了广泛的实验，并在代理系统上进行了 LFQA 评估测试。结果表明，目前所有的自动评估指标都无法与人类相比，这表明它们无法很好地捕捉长篇响应中包含的密集信息。此外，我们还详细分析了自动评估指标在评估 LFQA 时失败的原因，为推进 LFQA 评估系统提供了宝贵的见解。数据集和相关代码可以在我们的 GitHub 存储库中访问。

Title: SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics

Authors: Zhiwen You, Kanyao Han, Haotian Zhu, Bertram Ludäscher, Jana Diesner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.01946
Pdf URL: https://arxiv.org/pdf/2410.01946
Copy Paste: [[2410.01946]] SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics(https://arxiv.org/abs/2410.01946)
Keywords: language model, prompt
Abstract: Prompt-based fine-tuning has become an essential method for eliciting information encoded in pre-trained language models for a variety of tasks, including text classification. For multi-class classification tasks, prompt-based fine-tuning under low-resource scenarios has resulted in performance levels comparable to those of full fine-tuning methods. Previous studies have used crafted prompt templates and verbalizers, mapping from the label terms space to the class space, to solve the classification problem as a masked language modeling task. However, cross-domain and fine-grained prompt-based fine-tuning with an automatically enriched verbalizer remains unexplored, mainly due to the difficulty and costs of manually selecting domain label terms for the verbalizer, which requires humans with domain expertise. To address this challenge, we introduce SciPrompt, a framework designed to automatically retrieve scientific topic-related terms for low-resource text classification tasks. To this end, we select semantically correlated and domain-specific label terms within the context of scientific literature for verbalizer augmentation. Furthermore, we propose a new verbalization strategy that uses correlation scores as additional weights to enhance the prediction performance of the language model during model tuning. Our method outperforms state-of-the-art, prompt-based fine-tuning methods on scientific text classification tasks under few- and zero-shot settings, especially in classifying fine-grained and emerging scientific topics.
摘要：基于提示的微调已成为提取预训练语言模型中编码的信息的重要方法，可用于各种任务，包括文本分类。对于多类分类任务，在资源匮乏的情况下，基于提示的微调可实现与完全微调方法相当的性能水平。先前的研究使用精心设计的提示模板和言语化器，从标签术语空间映射到类别空间，以解决作为掩码语言建模任务的分类问题。然而，使用自动丰富的言语化器进行跨领域和细粒度的基于提示的微调仍未得到探索，这主要是因为手动为言语化器选择领域标签术语的难度和成本很高，这需要具有领域专业知识的人。为了应对这一挑战，我们引入了 SciPrompt，这是一个旨在自动检索科学主题相关术语的框架，用于资源匮乏的文本分类任务。为此，我们在科学文献的背景下选择语义相关且领域特定的标签术语来增强言语化器。此外，我们提出了一种新的言语化策略，该策略使用相关性分数作为附加权重来增强模型调整期间语言模型的预测性能。在少数和零样本设置下的科学文本分类任务中，我们的方法优于最先进的基于提示的微调方法，尤其是在对细粒度和新兴科学主题进行分类时。

Title: TypedThinker: Typed Thinking Improves Large Language Model Reasoning

Authors: Danqing Wang, Jianxin Ma, Fei Fang, Lei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.01952
Pdf URL: https://arxiv.org/pdf/2410.01952
Copy Paste: [[2410.01952]] TypedThinker: Typed Thinking Improves Large Language Model Reasoning(https://arxiv.org/abs/2410.01952)
Keywords: language model, gpt, llm
Abstract: Despite significant advancements in the reasoning capabilities of Large Language Models (LLMs), the lack of diverse reasoning solutions often leaves them trapped in a limited solution search area. In this paper, we propose TypedThinker, a novel framework that enhances LLMs' problem-solving abilities by incorporating multiple reasoning types (deductive, inductive, abductive, and analogical). Our analysis across four benchmarks reveals that different reasoning types uniquely solve distinct sets of problems, highlighting the importance of diverse thinking approaches. TypedThinker addresses two key challenges: selecting appropriate reasoning types for given problems and effectively implementing specific reasoning types. Through self-training on successful experiences, TypedThinker learns an implicit policy for reasoning type selection and application. Experimental results demonstrate significant improvements over baseline models, with accuracy increases of 3.4% for Mistral 7B and 16.7% for LLaMA3 8B across four reasoning benchmarks. Notably, TypedThinker shows effective generalization to new benchmarks and can further enhance the reasoning capability of powerful models like GPT-4o. The code is released at this https URL.
摘要：尽管大型语言模型 (LLM) 的推理能力取得了重大进步，但缺乏多样化的推理解决方案常常使它们陷入有限的解决方案搜索领域。在本文中，我们提出了 TypedThinker，这是一个新颖的框架，通过结合多种推理类型（演绎、归纳、溯因和类比）来增强 LLM 的解决问题的能力。我们对四个基准的分析表明，不同的推理类型可以独特地解决不同的问题集，凸显了多样化思维方法的重要性。TypedThinker 解决了两个关键挑战：为给定的问题选择合适的推理类型并有效地实施特定的推理类型。通过对成功经验的自我训练，TypedThinker 学习了一种推理类型选择和应用的隐性策略。实验结果表明，与基线模型相比有显着改进，在四个推理基准上，Mistral 7B 的准确率提高了 3.4%，LLaMA3 8B 的准确率提高了 16.7%。值得注意的是，TypedThinker 表现出对新基准的有效泛化能力，并可以进一步增强 GPT-4o 等强大模型的推理能力。代码发布在此 https URL 上。

Title: Generate then Refine: Data Augmentation for Zero-shot Intent Detection

Authors: I-Fan Lin, Faegheh Hasibi, Suzan Verberne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.01953
Pdf URL: https://arxiv.org/pdf/2410.01953
Copy Paste: [[2410.01953]] Generate then Refine: Data Augmentation for Zero-shot Intent Detection(https://arxiv.org/abs/2410.01953)
Keywords: language model, llm
Abstract: In this short paper we propose a data augmentation method for intent detection in zero-resource domains. Existing data augmentation methods rely on a few labelled examples for each intent category, which can be expensive in settings with many possible intents. We use a two-stage approach: First, we generate utterances for intent labels using an open-source large language model in a zero-shot setting. Second, we develop a smaller sequence-to-sequence model (the Refiner) to improve the generated utterances. The Refiner is fine-tuned on seen domains and then applied to unseen domains. We evaluate our method by training an intent classifier on the generated data and evaluating it on real (human) data. We find that the Refiner significantly improves the data utility and diversity over the zero-shot LLM baseline for unseen domains and over common baseline approaches. Our results indicate that a two-step approach of a generative LLM in a zero-shot setting and a smaller sequence-to-sequence model can provide high-quality data for intent detection.
摘要：在这篇短文中，我们提出了一种用于零资源域中意图检测的数据增强方法。现有的数据增强方法依赖于每个意图类别的少量标记示例，这在具有许多可能意图的设置中可能代价高昂。我们使用两阶段方法：首先，我们在零样本设置中使用开源大型语言模型为意图标签生成话语。其次，我们开发了一个较小的序列到序列模型（Refiner），以改进生成的话语。Refiner 在可见域上进行微调，然后应用于不可见域。我们通过在生成的数据上训练意图分类器并在真实（人类）数据上对其进行评估来评估我们的方法。我们发现，与不可见域的零样本 LLM 基线和常见的基线方法相比，Refiner 显著提高了数据效用和多样性。我们的结果表明，零样本设置中的生成 LLM 和较小的序列到序列模型的两步方法可以为意图检测提供高质量的数据。

Title: How Reliable Is Human Feedback For Aligning Large Language Models?

Authors: Min-Hsuan Yeh, Leitian Tao, Jeffrey Wang, Xuefeng Du, Yixuan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.01957
Pdf URL: https://arxiv.org/pdf/2410.01957
Copy Paste: [[2410.01957]] How Reliable Is Human Feedback For Aligning Large Language Models?(https://arxiv.org/abs/2410.01957)
Keywords: language model, llm
Abstract: Most alignment research today focuses on designing new learning algorithms using datasets like Anthropic-HH, assuming human feedback data is inherently reliable. However, little attention has been given to the qualitative unreliability of human feedback and its impact on alignment. To address this gap, we conduct a comprehensive study and provide an in-depth analysis of human feedback data. We assess feedback reliability using a committee of gold reward models, revealing that over 25% of the dataset shows low or no agreement with these models, implying a high degree of unreliability. Through a qualitative analysis, we identify six key sources of unreliability, such as mis-labeling, subjective preferences, differing criteria and thresholds for helpfulness and harmlessness, etc. Lastly, to mitigate unreliability, we propose Source-Aware Cleaning, an automatic data-cleaning method guided by the insight of our qualitative analysis, to significantly improve data quality. Extensive experiments demonstrate that models trained on our cleaned dataset, HH-Clean, substantially outperform those trained on the original dataset. We release HH-Clean to support more reliable LLM alignment evaluation in the future.
摘要：当今，大多数对齐研究都侧重于使用 Anthropic-HH 等数据集设计新的学习算法，假设人类反馈数据本质上是可靠的。然而，很少有人关注人类反馈的定性不可靠性及其对对齐的影响。为了解决这一差距，我们进行了一项全面的研究，并对人类反馈数据进行了深入分析。我们使用黄金奖励模型委员会评估反馈可靠性，结果显示超过 25% 的数据集与这些模型的一致性很低或不一致性，这意味着不可靠性程度很高。通过定性分析，我们确定了六个主要的不可靠性来源，例如错误标记、主观偏好、对有用性和无害性的不同标准和阈值等。最后，为了减轻不可靠性，我们提出了源感知清理，这是一种由我们的定性分析洞察力指导的自动数据清理方法，可显著提高数据质量。大量实验表明，在我们清理后的数据集 HH-Clean 上训练的模型远远优于在原始数据集上训练的模型。我们发布了 HH-Clean，以便将来支持更可靠的 LLM 对齐评估。
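
Illustrative sketch (not from the paper): the abstract above describes scoring feedback reliability by agreement with a committee of reward models. A minimal version of that idea, assuming each reward model returns one scalar score per response, could look like the following. The function names, the strict-majority comparison, and the 0.5 threshold are hypothetical choices made for illustration; the paper's actual agreement levels and cleaning rules are not given in the abstract.

```python
from typing import List, Sequence, Tuple


def committee_agreement(chosen_rewards: Sequence[float],
                        rejected_rewards: Sequence[float]) -> float:
    """Fraction of reward models that score the human-chosen response
    strictly above the human-rejected one (one score pair per model)."""
    if not chosen_rewards or len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("need one (chosen, rejected) score pair per reward model")
    votes = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return votes / len(chosen_rewards)


def flag_low_agreement(examples: List[Tuple[Sequence[float], Sequence[float]]],
                       threshold: float = 0.5) -> List[int]:
    """Indices of preference pairs whose agreement is below `threshold`.
    The threshold value is an assumption, not a number from the paper."""
    return [idx for idx, (chosen, rejected) in enumerate(examples)
            if committee_agreement(chosen, rejected) < threshold]
```

Flagged indices would then be candidates for manual inspection or removal; how to act on them is a separate design decision.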

Title: Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions

Authors: Qian Ruan, Ilia Kuznetsov, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02028
Pdf URL: https://arxiv.org/pdf/2410.02028
Copy Paste: [[2410.02028]] Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions(https://arxiv.org/abs/2410.02028)
Keywords: language model, llm
Abstract: Classification is a core NLP task architecture with many potential applications. While large language models (LLMs) have brought substantial advancements in text generation, their potential for enhancing classification tasks remains underexplored. To address this gap, we propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task. Our extensive experiments and systematic comparisons with various training approaches and a representative selection of LLMs yield new insights into their application for EIC. We investigate the generalizability of these findings on five further classification tasks. To demonstrate the proposed methods and address the data shortage for empirical edit analysis, we use our best-performing EIC model to create Re3-Sci2.0, a new large-scale dataset of 1,780 scientific document revisions with over 94k labeled edits. The quality of the dataset is assessed through human evaluation. The new dataset enables an in-depth empirical study of human editing behavior in academic writing. We make our experimental framework, models and data publicly available.
摘要：分类是一种核心 NLP 任务架构，具有许多潜在应用。虽然大型语言模型 (LLM) 在文本生成方面取得了重大进展，但它们在增强分类任务方面的潜力仍未得到充分开发。为了解决这一差距，我们提出了一个框架，用于彻底研究微调 LLM 进行分类，包括基于生成和基于编码的方法。我们在编辑意图分类 (EIC) 中实例化了这个框架，这是一个具有挑战性且尚未得到充分探索的分类任务。我们进行了广泛的实验，并与各种训练方法和具有代表性的 LLM 选择进行了系统比较，为它们在 EIC 中的应用提供了新的见解。我们研究了这些发现在五个进一步分类任务中的普遍性。为了展示所提出的方法并解决实证编辑分析的数据短缺问题，我们使用我们表现最佳的 EIC 模型创建了 Re3-Sci2.0，这是一个新的大型数据集，包含 1,780 个科学文档修订，其中有超过 94k 个带标签的编辑。通过人工评估来评估数据集的质量。新数据集可以对学术写作中的人类编辑行为进行深入的实证研究。我们公开我们的实验框架、模型和数据。

Title: Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

Authors: Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.02052
Pdf URL: https://arxiv.org/pdf/2410.02052
Copy Paste: [[2410.02052]] Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning(https://arxiv.org/abs/2410.02052)
Keywords: language model, gpt, agent
Abstract: Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., powered by GPT-4o, to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent's performance by fine-tuning GPT-4o through self-learning, using R-MCTS generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS's performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs' reasoning and planning capabilities for agentic applications via test-time search and self-learning.
摘要：自主代理在自动执行复杂的多步骤决策任务方面表现出了巨大的潜力。然而，即使是最先进的视觉语言模型 (VLM)，如 GPT-4o，仍然达不到人类水平的表现，特别是在复杂的网络环境和长期规划任务中。为了解决这些限制，我们引入了反射蒙特卡洛树搜索 (R-MCTS)，这是一种新颖的测试时间算法，旨在增强 AI 代理（例如由 GPT-4o 提供支持的代理）动态探索决策空间的能力。R-MCTS 通过以下方式扩展了传统的 MCTS：1) 结合对比反射，允许代理从过去的交互中学习并动态提高其搜索效率；2) 使用多代理辩论来提供可靠的状态评估。此外，我们通过自学习对 GPT-4o 进行微调，使用 R-MCTS 生成的树遍历而无需任何人提供的标签，从而提高了代理的性能。在具有挑战性的 VisualWebArena 基准测试中，与之前的最先进技术相比，我们基于 GPT-4o 的 R-MCTS 代理在各种任务中实现了 6% 到 30% 的相对改进。此外，我们表明，通过微调，可以从测试时搜索中获得的知识可以有效地转移回 GPT-4o。经过微调的 GPT-4o 可达到 R-MCTS 97% 的性能，同时在测试时将计算使用量降低了四倍。此外，定性结果表明，经过微调的 GPT-4o 模型展示了探索环境、评估状态并在检测到当前状态无法成功时回溯到可行状态的能力。此外，我们的工作展示了训练（使用 R-MCTS 收集数据）和测试时间的计算扩展属性。这些结果表明，通过测试时搜索和自学习来增强 VLM 对代理应用程序的推理和规划能力是一个有前途的研究方向。

Title: RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Authors: Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, Gabriel Synnaeve
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02089
Pdf URL: https://arxiv.org/pdf/2410.02089
Copy Paste: [[2410.02089]] RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning(https://arxiv.org/abs/2410.02089)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
摘要：作为代理部署的大型语言模型 (LLM) 可通过多个步骤解决用户指定的任务，同时将所需的手动参与度降至最低。至关重要的是，此类 LLM 需要将其生成建立在任何获得的反馈之上，以可靠地实现预期结果。我们提出了一种端到端强化学习方法，用于教导模型在代码合成领域利用执行反馈，其中最先进的 LLM 与独立采样相比难以迭代改进代码。我们对竞争性编程任务进行了基准测试，我们利用小型（8B 参数）和大型（70B）模型实现了新的领先结果，同时将所需的样本量减少了一个数量级。我们对推理时间行为的分析表明，我们的方法生成的 LLM 可有效利用多个步骤的自动反馈。

Title: Racing Thoughts: Explaining Large Language Model Contextualization Errors

Authors: Michael A. Lepori, Michael Mozer, Asma Ghandeharioun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02102
Pdf URL: https://arxiv.org/pdf/2410.02102
Copy Paste: [[2410.02102]] Racing Thoughts: Explaining Large Language Model Contextualization Errors(https://arxiv.org/abs/2410.02102)
Keywords: language model, llm, prompt
Abstract: The profound success of transformer-based language models can largely be attributed to their ability to integrate relevant contextual information from an input sequence in order to generate a response or complete a task. However, we know very little about the algorithms that a model employs to implement this capability, nor do we understand their failure modes. For example, given the prompt "John is going fishing, so he walks over to the bank. Can he make an ATM transaction?", a model may incorrectly respond "Yes" if it has not properly contextualized "bank" as a geographical feature, rather than a financial institution. We propose the LLM Race Conditions Hypothesis as an explanation of contextualization errors of this form. This hypothesis identifies dependencies between tokens (e.g., "bank" must be properly contextualized before the final token, "?", integrates information from "bank"), and claims that contextualization errors are a result of violating these dependencies. Using a variety of techniques from mechanistic interpretability, we provide correlational and causal evidence in support of the hypothesis, and suggest inference-time interventions to address it.
摘要：基于转换器的语言模型的巨大成功很大程度上可以归因于它们能够整合来自输入序列的相关上下文信息，以生成响应或完成任务。然而，我们对模型实现此功能所采用的算法知之甚少，也不了解它们的失败模式。例如，给出提示“约翰要去钓鱼，所以他走到银行。他可以进行 ATM 交易吗？”，如果模型没有正确地将“银行”语境化为地理特征，而不是金融机构，它可能会错误地回答“是”。我们提出 LLM 竞争条件假设来解释这种形式的语境化错误。该假设确定了标记之间的依赖关系（例如，“银行”必须先正确语境化，然后最后一个标记“？”才能整合来自“银行”的信息），并声称语境化错误是违反这些依赖关系的结果。利用机械可解释性的多种技术，我们提供支持该假设的相关性和因果证据，并建议推理时间干预来解决它。

Title: ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement

Authors: Xiangyu Peng, Congying Xia, Xinyi Yang, Caiming Xiong, Chien-Sheng Wu, Chen Xing
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02108
Pdf URL: https://arxiv.org/pdf/2410.02108
Copy Paste: [[2410.02108]] ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement(https://arxiv.org/abs/2410.02108)
Keywords: language model, llm
Abstract: Post-training Large Language Models (LLMs) with explicit reasoning trajectories can enhance their reasoning abilities. However, acquiring such high-quality trajectory data typically demands meticulous supervision from humans or superior models, which can be either expensive or license-constrained. In this paper, we explore how far an LLM can improve its reasoning by self-synthesizing reasoning paths as training data without any additional supervision. Existing self-synthesizing methods, such as STaR, suffer from poor generalization to out-of-domain (OOD) reasoning tasks. We hypothesize that this is because their self-synthesized reasoning paths are too task-specific, lacking general task-agnostic reasoning guidance. To address this, we propose Reasoning Generalist via Self-Improvement (ReGenesis), a method to self-synthesize reasoning paths as post-training data by progressing from abstract to concrete. More specifically, ReGenesis self-synthesizes reasoning paths by converting general reasoning guidelines into task-specific ones, generating reasoning structures, and subsequently transforming these structures into reasoning paths, without the need for human-designed task-specific examples used in existing methods. We show that ReGenesis achieves superior performance on all in-domain and OOD settings tested compared to existing methods. For six OOD tasks specifically, while previous methods exhibited an average performance decrease of approximately 4.6% after post-training, ReGenesis delivers around 6.1% performance improvement. We also conduct in-depth analysis of our framework and show ReGenesis is effective across various LLMs and design choices.
摘要：对具有明确推理轨迹的大型语言模型 (LLM) 进行后训练可以增强其推理能力。然而，获取这种高质量的轨迹数据通常需要人类或高级模型的细致监督，而这些监督可能成本高昂或受到许可限制。在本文中，我们探讨了 LLM 可以在多大程度上通过将推理路径自合成为训练数据而无需任何额外监督来提高其推理能力。现有的自合成方法，例如 STaR，对域外 (OOD) 推理任务的泛化能力较差。我们假设这是因为它们的自合成推理路径过于针对任务，缺乏一般的与任务无关的推理指导。为了解决这个问题，我们提出了通过自我改进实现推理通才 (ReGenesis)，这是一种通过从抽象到具体的过程将推理路径自合成为后训练数据的方法。更具体地说，ReGenesis 通过将一般推理指南转换为特定于任务的指南，生成推理结构，然后将这些结构转换为推理路径来自我合成推理路径，而无需现有方法中使用的人为设计的特定于任务的示例。我们表明，与现有方法相比，ReGenesis 在所有测试的域内和 OOD 设置上都实现了卓越的性能。具体来说，对于六个 OOD 任务，虽然以前的方法在后期训练后平均性能下降了约 4.6%，但 ReGenesis 的性能提高了约 6.1%。我们还对我们的框架进行了深入分析，并表明 ReGenesis 在各种 LLM 和设计选择中都是有效的。

Title: L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?

Authors: Zecheng Tang, Keyan Zhou, Juntao Li, Baibei Ji, Jianye Hou, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02115
Pdf URL: https://arxiv.org/pdf/2410.02115
Copy Paste: [[2410.02115]] L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?(https://arxiv.org/abs/2410.02115)
Keywords: long context
Abstract: Long-context models (LCMs) have made remarkable strides in recent years, offering users great convenience for handling tasks that involve long context, such as document summarization. As the community increasingly prioritizes the faithfulness of generated results, merely ensuring the accuracy of LCM outputs is insufficient, as it is quite challenging for humans to verify the results from the extremely lengthy context. Yet, although some efforts have been made to assess whether LCMs respond truly based on the context, these works either are limited to specific tasks or heavily rely on external evaluation resources like this http URL. In this work, we introduce L-CiteEval, a comprehensive multi-task benchmark for long-context understanding with citations, aiming to evaluate both the understanding capability and faithfulness of LCMs. L-CiteEval covers 11 tasks from diverse domains, spanning context lengths from 8K to 48K, and provides a fully automated evaluation suite. Through testing with 11 cutting-edge closed-source and open-source LCMs, we find that although these models show minor differences in their generated results, open-source models substantially trail behind their closed-source counterparts in terms of citation accuracy and recall. This suggests that current open-source LCMs are prone to responding based on their inherent knowledge rather than the given context, posing a significant risk to the user experience in practical applications. We also evaluate the RAG approach and observe that RAG can significantly improve the faithfulness of LCMs, albeit with a slight decrease in the generation quality. Furthermore, we discover a correlation between the attention mechanisms of LCMs and the citation generation process.
摘要：近年来，长上下文模型 (LCM) 取得了长足进步，为用户处理涉及长上下文的任务（例如文档摘要）提供了极大的便利。随着社区越来越重视生成结果的忠实性，仅仅确保 LCM 输出的准确性是不够的，因为人类很难从极长的上下文中验证结果。然而，尽管已经做出了一些努力来评估 LCM 是否真正根据上下文做出响应，但这些工作要么仅限于特定任务，要么严重依赖外部评估资源，例如本 http URL 本工作，我们引入了 L-CiteEval，这是一个全面的多任务基准，用于带有引文的长上下文理解，旨在评估 LCM 的理解能力和忠实性。L-CiteEval 涵盖来自不同领域的 11 个任务，上下文长度从 8K 到 48K，并提供一个完全自动化的评估套件。通过对11个最新的闭源和开源LCM进行测试，我们发现虽然这些模型在生成结果上表现出细微的差异，但开源模型在引用准确率和召回率方面远远落后于闭源模型。这表明当前的开源LCM倾向于根据其固有知识而不是给定上下文做出响应，这对实际应用中的用户体验构成了重大风险。我们还评估了RAG方法，发现RAG可以显著提高LCM的忠实度，尽管生成质量略有下降。此外，我们发现LCM的注意力机制与引用生成过程之间存在相关性。

Title: Controlled Generation of Natural Adversarial Documents for Stealthy Retrieval Poisoning

Authors: Collin Zhang, Tingwei Zhang, Vitaly Shmatikov
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02163
Pdf URL: https://arxiv.org/pdf/2410.02163
Copy Paste: [[2410.02163]] Controlled Generation of Natural Adversarial Documents for Stealthy Retrieval Poisoning(https://arxiv.org/abs/2410.02163)
Keywords: llm, retrieval-augmented generation
Abstract: Recent work showed that retrieval based on embedding similarity (e.g., for retrieval-augmented generation) is vulnerable to poisoning: an adversary can craft malicious documents that are retrieved in response to broad classes of queries. We demonstrate that previous, HotFlip-based techniques produce documents that are very easy to detect using perplexity filtering. Even if generation is constrained to produce low-perplexity text, the resulting documents are recognized as unnatural by LLMs and can be automatically filtered from the retrieval corpus. We design, implement, and evaluate a new controlled generation technique that combines an adversarial objective (embedding similarity) with a "naturalness" objective based on soft scores computed using an open-source, surrogate LLM. The resulting adversarial documents (1) cannot be automatically detected using perplexity filtering and/or other LLMs, except at the cost of significant false positives in the retrieval corpus, yet (2) achieve similar poisoning efficacy to easily-detectable documents generated using HotFlip, and (3) are significantly more effective than prior methods for energy-guided generation, such as COLD.
摘要：最近的研究表明，基于嵌入相似性的检索（例如，用于检索增强生成）容易受到毒害：攻击者可以制作恶意文档，这些文档是针对广泛类别的查询而检索的。我们证明，以前基于 HotFlip 的技术生成的文档非常容易通过困惑度过滤检测到。即使生成被限制为生成低困惑度文本，生成的文档也会被 LLM 识别为非自然的，并且可以自动从检索语料库中过滤掉。我们设计、实现和评估了一种新的受控生成技术，该技术将对抗性目标（嵌入相似性）与基于使用开源代理 LLM 计算的软分数的“自然度”目标相结合。所生成的对抗性文档 (1) 无法通过困惑度过滤和/或其他 LLM 自动检测，除非以检索语料库中出现大量假阳性为代价，但 (2) 实现了与使用 HotFlip 生成的易于检测的文档类似的毒害效果，并且 (3) 比之前的能量引导生成方法（如 COLD）更有效。
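
Illustrative sketch (not from the paper): the perplexity filtering mentioned in the abstract above can be pictured as computing a per-document perplexity from token log-probabilities and dropping documents above a cutoff. The snippet assumes the natural-log token probabilities have already been obtained from some language model; the function names and the cutoff are hypothetical, and the paper's actual detector settings are not stated in the abstract.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log token probability."""
    if not token_log_probs:
        raise ValueError("document has no tokens")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def filter_by_perplexity(docs: Dict[str, Sequence[float]],
                         max_perplexity: float) -> List[str]:
    """Keep the ids of documents whose perplexity is at or below the cutoff.
    `docs` maps a document id to its per-token log-probabilities."""
    return [doc_id for doc_id, log_probs in docs.items()
            if perplexity(log_probs) <= max_perplexity]
```

The abstract's point is that a cutoff tight enough to catch naturalness-optimized adversarial documents would also discard many legitimate ones, which a filter of this shape makes easy to see.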

Title: POSIX: A Prompt Sensitivity Index For Large Language Models

Authors: Anwoy Chatterjee, H S V N S Kowndinya Renduchintala, Sumit Bhatia, Tanmoy Chakraborty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02185
Pdf URL: https://arxiv.org/pdf/2410.02185
Copy Paste: [[2410.02185]] POSIX: A Prompt Sensitivity Index For Large Language Models(https://arxiv.org/abs/2410.02185)
Keywords: language model, llm, prompt
Abstract: Despite their remarkable capabilities, Large Language Models (LLMs) are found to be surprisingly sensitive to minor variations in prompts, such as spelling errors, altered wording, or a changed prompt template, often generating significantly divergent outputs in response. However, while assessing the quality of an LLM, the focus often tends to be solely on its performance on downstream tasks, while very little to no attention is paid to prompt sensitivity. To fill this gap, we propose POSIX - a novel PrOmpt Sensitivity IndeX as a reliable measure of prompt sensitivity, thereby offering a more comprehensive evaluation of LLM performance. The key idea behind POSIX is to capture the relative change in log-likelihood of a given response upon replacing the corresponding prompt with a different intent-preserving prompt. We provide thorough empirical evidence demonstrating the efficacy of POSIX in capturing prompt sensitivity and subsequently use it to measure and thereby compare prompt sensitivity of various open-source LLMs. We find that merely increasing the parameter count or instruction tuning does not necessarily reduce prompt sensitivity whereas adding some few-shot exemplars, even just one, almost always leads to a significant decrease in prompt sensitivity. We also find that alterations to the prompt template lead to the highest sensitivity in the case of MCQ-type tasks, whereas paraphrasing results in the highest sensitivity in open-ended generation tasks. The code for reproducing our results is open-sourced at this https URL.
摘要：尽管大型语言模型 (LLM) 功能强大，但我们发现它们对提示的细微变化非常敏感，通常会对提示的细微变化（例如拼写错误、措辞更改或提示模板）产生截然不同的输出。然而，在评估 LLM 的质量时，人们往往只关注其在下游任务中的表现，而很少或根本不关注提示敏感性。为了填补这一空白，我们提出了 POSIX——一种新颖的 PrOmpt 敏感性指数，作为提示敏感性的可靠衡量标准，从而提供对 LLM 性能的更全面评估。POSIX 背后的关键思想是在用不同的意图保留提示替换相应提示时，捕获给定响应的对数似然的相对变化。我们提供详尽的实证证据，证明 POSIX 在捕获提示敏感性方面的有效性，并随后使用它来测量并比较各种开源 LLM 的提示敏感性。我们发现，仅仅增加参数数量或调整指令并不一定会降低提示敏感度，而添加一些少样本样本（即使只有一个）几乎总是会导致提示敏感度显著下降。我们还发现，在 MCQ 类型任务中，修改提示模板会导致敏感度最高，而在开放式生成任务中，释义会导致敏感度最高。用于重现我们结果的代码在此 https URL 中开源。
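
Illustrative sketch (not from the paper): the key idea stated in the abstract above, measuring how much the log-likelihood of a response changes when its prompt is swapped for an intent-preserving variant, can be approximated as below. The `log_prob` callable is a hypothetical stand-in for a model scoring function, and the length normalization and averaging over ordered prompt pairs are assumptions made for illustration; the abstract does not give the exact POSIX formula.

```python
from itertools import permutations
from typing import Callable, Sequence


def sensitivity_index(prompts: Sequence[str],
                      responses: Sequence[str],
                      response_lengths: Sequence[int],
                      log_prob: Callable[[str, str], float]) -> float:
    """Mean length-normalized absolute change in log-likelihood of each
    response when its own prompt is replaced by another variant.

    responses[j] is the response produced for prompts[j];
    log_prob(prompt, response) returns log P(response | prompt).
    """
    n = len(prompts)
    if n < 2 or len(responses) != n or len(response_lengths) != n:
        raise ValueError("need at least two prompt variants with matching responses")
    total = 0.0
    for i, j in permutations(range(n), 2):
        # Swap in prompt i for the response that was generated from prompt j.
        delta = log_prob(prompts[i], responses[j]) - log_prob(prompts[j], responses[j])
        total += abs(delta) / response_lengths[j]
    return total / (n * (n - 1))
```

A value near zero would mean the model assigns nearly the same likelihood to its responses regardless of which variant is used; larger values indicate higher sensitivity under this sketch's definition.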

Title: Can Language Models Take A Hint? Prompting for Controllable Contextualized Commonsense Inference

Authors: Pedro Colon-Hernandez, Nanxi Liu, Chelsea Joe, Peter Chin, Claire Yin, Henry Lieberman, Yida Xin, Cynthia Breazeal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02202
Pdf URL: https://arxiv.org/pdf/2410.02202
Copy Paste: [[2410.02202]] Can Language Models Take A Hint? Prompting for Controllable Contextualized Commonsense Inference(https://arxiv.org/abs/2410.02202)
Keywords: language model, prompt
Abstract: Generating commonsense assertions within a given story context remains a difficult task for modern language models. Previous research has addressed this problem by aligning commonsense inferences with stories and training language generation models accordingly. One of the challenges is determining which topic or entity in the story should be the focus of an inferred assertion. Prior approaches lack the ability to control specific aspects of the generated assertions. In this work, we introduce "hinting," a data augmentation technique that enhances contextualized commonsense inference. "Hinting" employs a prefix prompting strategy using both hard and soft prompts to guide the inference process. To demonstrate its effectiveness, we apply "hinting" to two contextual commonsense inference datasets: ParaCOMET and GLUCOSE, evaluating its impact on both general and context-specific inference. Furthermore, we evaluate "hinting" by incorporating synonyms and antonyms into the hints. Our results show that "hinting" does not compromise the performance of contextual commonsense inference while offering improved controllability.
摘要：对于现代语言模型来说，在给定的故事上下文中生成常识性断言仍然是一项艰巨的任务。先前的研究通过将常识性推理与故事相结合并相应地训练语言生成模型来解决了这个问题。其中一个挑战是确定故事中的哪个主题或实体应该成为推断断言的焦点。先前的方法缺乏控制生成断言的特定方面的能力。在这项工作中，我们引入了“提示”，这是一种增强情境化常识推理的数据增强技术。“提示”采用前缀提示策略，使用硬提示和软提示来指导推理过程。为了证明其有效性，我们将“提示”应用于两个情境常识推理数据集：ParaCOMET 和 GLUCOSE，评估其对一般和情境特定推理的影响。此外，我们通过将同义词和反义词纳入提示来评估“提示”。我们的结果表明，“提示”不会损害情境常识推理的性能，同时提供更好的可控性。

Title: Measuring, Evaluating and Improving Logical Consistency in Large Language Models

Authors: Yinhong Liu, Zhijiang Guo, Tianya Liang, Ehsan Shareghi, Ivan Vulić, Nigel Collier
Subjects: cs.CL, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2410.02205
Pdf URL: https://arxiv.org/pdf/2410.02205
Copy Paste: [[2410.02205]] Measuring, Evaluating and Improving Logical Consistency in Large Language Models(https://arxiv.org/abs/2410.02205)
Keywords: language model, llm
Abstract: Recent research in Large Language Models (LLMs) has shown promising progress related to LLM alignment with human preferences. LLM-empowered decision-making systems are expected to be predictable, reliable and trustworthy, which implies being free from paradoxes or contradictions that could undermine their credibility and validity. However, LLMs still exhibit inconsistent and biased behaviour when making decisions or judgements. In this work, we focus on studying logical consistency of LLMs as a prerequisite for more reliable and trustworthy systems. Logical consistency ensures that decisions are based on a stable and coherent understanding of the problem, reducing the risk of erratic or contradictory outputs. We first propose a universal framework to quantify the logical consistency via three fundamental proxies: transitivity, commutativity and negation invariance. We then evaluate logical consistency, using the defined measures, of a wide range of LLMs, demonstrating that it can serve as a strong proxy for overall robustness. Additionally, we introduce a data refinement and augmentation technique that enhances the logical consistency of LLMs without sacrificing alignment to human preferences. It augments noisy and sparse pairwise-comparison annotations by estimating partially or totally ordered preference rankings using rank aggregation methods. Finally, we show that logical consistency impacts the performance of LLM-based logic-dependent algorithms, where LLMs serve as logical operators.
摘要：大型语言模型 (LLM) 的最新研究显示，在 LLM 与人类偏好的一致性方面取得了令人鼓舞的进展。LLM 赋能的决策系统应具有可预测性、可靠性和可信性，这意味着不存在可能损害其可信度和有效性的悖论或矛盾。然而，LLM 在做出决策或判断时仍然表现出不一致和有偏见的行为。在这项工作中，我们专注于研究 LLM 的逻辑一致性，这是更可靠和更值得信赖的系统的先决条件。逻辑一致性确保决策基于对问题的稳定和连贯的理解，从而降低出现不稳定或矛盾输出的风险。我们首先提出一个通用框架，通过三个基本代理来量化逻辑一致性：传递性、交换性和否定不变性。然后，我们使用定义的度量来评估各种 LLM 的逻辑一致性，表明它可以作为整体稳健性的强大代理。此外，我们引入了一种数据细化和增强技术，该技术可增强 LLM 的逻辑一致性，而不会牺牲与人类偏好的一致性。它通过使用等级聚合方法估计部分或完全有序的偏好排名来增强嘈杂和稀疏的成对比较注释。最后，我们表明逻辑一致性会影响基于 LLM 的逻辑相关算法的性能，其中 LLM 充当逻辑运算符。
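As a rough illustration of the transitivity proxy named in the abstract above, the sketch below measures how many item triples in a set of pairwise judgments are free of preference cycles. The function name `transitivity_rate` and the dictionary encoding of judgments are assumptions made for this example; the abstract does not give the paper's exact metric definition.

```python
from itertools import combinations


def transitivity_rate(items, prefers):
    """Fraction of item triples whose pairwise judgments contain no cycle.

    `prefers[(a, b)]` is True when the judge prefers `a` over `b`.
    This encoding is hypothetical and chosen only for illustration.
    """
    def beats(a, b):
        # Accept a judgment stored in either direction.
        if (a, b) in prefers:
            return prefers[(a, b)]
        return not prefers[(b, a)]

    triples = list(combinations(items, 3))
    consistent = 0
    for a, b, c in triples:
        # A triple violates transitivity if it forms a cycle in either direction.
        cycle_forward = beats(a, b) and beats(b, c) and beats(c, a)
        cycle_backward = beats(a, c) and beats(c, b) and beats(b, a)
        if not (cycle_forward or cycle_backward):
            consistent += 1
    return consistent / len(triples)


# Example judgments: A > B, B > C, but C > A (a cycle), and everything beats D.
judgments = {
    ("A", "B"): True, ("B", "C"): True, ("C", "A"): True,
    ("A", "D"): True, ("B", "D"): True, ("C", "D"): True,
}
rate = transitivity_rate(["A", "B", "C", "D"], judgments)
```

In this toy example only the triple (A, B, C) is cyclic, so three of the four triples count as consistent.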

Title: Calibrate to Discriminate: Improve In-Context Learning with Label-Free Comparative Inference

Authors: Wei Cheng, Tianlu Wang, Yanmin Ji, Fan Yang, Keren Tan, Yiyu Zheng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02210
Pdf URL: https://arxiv.org/pdf/2410.02210
Copy Paste: [[2410.02210]] Calibrate to Discriminate: Improve In-Context Learning with Label-Free Comparative Inference(https://arxiv.org/abs/2410.02210)
Keywords: language model, llm, prompt
Abstract: While in-context learning with large language models (LLMs) has shown impressive performance, we have discovered a unique miscalibration behavior where both correct and incorrect predictions are assigned the same level of confidence. We refer to this phenomenon as indiscriminate miscalibration. We found that traditional calibration metrics, such as Expected Calibrated Errors (ECEs), are unable to capture this behavior effectively. To address this issue, we propose new metrics to measure the severity of indiscriminate miscalibration. Additionally, we develop a novel in-context comparative inference method to alleviate miscalibrations and improve classification performance. Through extensive experiments on five datasets, we demonstrate that our proposed method can achieve more accurate and calibrated predictions compared to regular zero-shot and few-shot prompting.
摘要：虽然使用大型语言模型 (LLM) 进行上下文学习已表现出令人印象深刻的性能，但我们发现了一种独特的错误校准行为，即正确和错误的预测都被分配了相同的置信度。我们将这种现象称为不加区分的错误校准。我们发现传统的校准指标（例如预期校准误差 (ECE)）无法有效捕捉这种行为。为了解决这个问题，我们提出了新的指标来衡量不加区分的错误校准的严重程度。此外，我们开发了一种新颖的上下文比较推理方法来缓解错误校准并提高分类性能。通过对五个数据集进行大量实验，我们证明与常规的零样本和少样本提示相比，我们提出的方法可以实现更准确、更校准的预测。
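To make the claim about calibration metrics concrete, the sketch below computes a standard binned expected calibration error and, next to it, a simple gap between the mean confidence of correct and incorrect predictions. The gap statistic and the names `expected_calibration_error` and `confidence_gap` are assumptions for illustration only; they are not the new metrics the paper proposes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |mean confidence - accuracy| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def confidence_gap(confidences, correct):
    """Mean confidence on correct predictions minus mean on incorrect ones.

    Hypothetical helper: a gap near zero means the model does not
    separate its right answers from its wrong ones by confidence.
    """
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    return sum(right) / len(right) - sum(wrong) / len(wrong)


# A model that is 75% accurate and always reports 0.75 confidence.
confs = [0.75, 0.75, 0.75, 0.75]
hits = [True, True, True, False]
ece = expected_calibration_error(confs, hits)
gap = confidence_gap(confs, hits)
```

In this toy case the ECE is zero because average confidence matches accuracy, yet the gap is also zero: the confidence carries no information about which predictions are right. That is the kind of behavior the abstract describes as hard to see with ECE alone.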

Title: EmbedLLM: Learning Compact Representations of Large Language Models

Authors: Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, Kannan Ramchandran
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02223
Pdf URL: https://arxiv.org/pdf/2410.02223
Copy Paste: [[2410.02223]] EmbedLLM: Learning Compact Representations of Large Language Models(https://arxiv.org/abs/2410.02223)
Keywords: language model, llm
Abstract: With hundreds of thousands of language models available on Huggingface today, efficiently evaluating and utilizing these models across various downstream tasks has become increasingly critical. Many existing methods repeatedly learn task-specific representations of Large Language Models (LLMs), which leads to inefficiencies in both time and computational resources. To address this, we propose EmbedLLM, a framework designed to learn compact vector representations of LLMs that facilitate downstream applications involving many models, such as model routing. We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness. Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency. Additionally, we demonstrate that our method can forecast a model's performance on multiple benchmarks, without incurring additional inference cost. Extensive probing experiments validate that the learned embeddings capture key model characteristics, e.g. whether the model is specialized for coding tasks, even without being explicitly trained on them. We open source our dataset, code and embedder to facilitate further research and application.
摘要：如今，Huggingface 上有数十万种语言模型可用，因此，在各种下游任务中高效评估和利用这些模型变得越来越重要。许多现有方法反复学习大型语言模型 (LLM) 的任务特定表示，这导致时间和计算资源效率低下。为了解决这个问题，我们提出了 EmbedLLM，这是一个旨在学习 LLM 的紧凑向量表示的框架，可促进涉及许多模型的下游应用，例如模型路由。我们引入了一种用于学习此类嵌入的编码器-解码器方法，以及一个用于评估其有效性的系统框架。实证结果表明，EmbedLLM 在模型路由的准确性和延迟方面均优于先前的方法。此外，我们证明了我们的方法可以在多个基准上预测模型的性能，而不会产生额外的推理成本。大量的探索性实验验证了学习到的嵌入可以捕获关键的模型特征，例如模型是否专门用于编码任务，即使没有对其进行明确的训练。我们开源我们的数据集、代码和嵌入器，以促进进一步的研究和应用。

Title: Morphological evaluation of subwords vocabulary used by BETO language model

Authors: Óscar García-Sierra, Ana Fernández-Pampillón Cesteros, Miguel Ortega-Martín
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02283
Pdf URL: https://arxiv.org/pdf/2410.02283
Copy Paste: [[2410.02283]] Morphological evaluation of subwords vocabulary used by BETO language model(https://arxiv.org/abs/2410.02283)
Keywords: language model
Abstract: Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always align with real morphemes, potentially impacting the models' performance, though it remains uncertain when this might occur. In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language. Our evaluation method was built on three quality measures, relevance, cohesion, and morphological accuracy, and a procedure for their assessment. By applying this method to vocabularies created by three subword tokenization algorithms, BPE, Wordpiece, and Unigram, we concluded that these vocabularies generally exhibit very low morphological quality. In this article, we apply this evaluation to the tokenizer of BETO, a BERT language model trained on large Spanish corpora. This evaluation, along with our previous results, helped us conclude that its vocabulary has a low morphological quality, and we also found that training the tokenizer in a larger corpus does not improve the morphological quality of the generated vocabulary. Additionally, this evaluation helps clarify the algorithm used by the tokenizer, that is, Wordpiece, given the inconsistencies between the authors' claims and the model's configuration.
摘要：大型语言模型使用的子词标记化算法效率明显更高，并且可以独立构建必要的单词和子词词汇表，而无需人工干预。但是，这些子词并不总是与真实词素一致，这可能会影响模型的性能，尽管这种情况何时会发生仍不确定。在之前的研究中，我们提出了一种评估词汇表形态质量的方法，重点关注这些词汇表与给定语言的词素之间的重叠。我们的评估方法建立在三个质量指标（相关性、凝聚力和形态准确性）以及评估程序的基础上。通过将这种方法应用于由三种子词标记化算法（BPE、Wordpiece 和 Unigram）创建的词汇表，我们得出结论，这些词汇表通常表现出非常低的形态质量。在本文中，我们将此评估应用于 BETO 的标记器，BETO 是一个在大型西班牙语语料库上训练的 BERT 语言模型。此次评估结合我们之前的结果，帮助我们得出结论：其词汇的形态质量较低，我们还发现在更大的语料库中训练分词器并不能提高生成词汇的形态质量。此外，鉴于作者的说法与模型配置不一致，此次评估有助于澄清分词器（即 Wordpiece）所使用的算法。

Title: Correlation and Navigation in the Vocabulary Key Representation Space of Language Models

Authors: Letian Peng, Chenyang An, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02284
Pdf URL: https://arxiv.org/pdf/2410.02284
Copy Paste: [[2410.02284]] Correlation and Navigation in the Vocabulary Key Representation Space of Language Models(https://arxiv.org/abs/2410.02284)
Keywords: language model, prompt, chain-of-thought
Abstract: Language model (LM) decoding is based on the next-token prediction (NTP) probability distribution. For neural LMs (e.g., Transformer-based), NTP distribution is essentially a softmax-regularized dot product between an encoded input context (query) and fixed vocabulary representations (keys). In this paper, we study the effect of the key distribution on the NTP distribution, with a focus on whether the similarity between keys will trigger spurious correlations in NTP. Through knowledge-probing tasks, we show that in the NTP distribution, the few top-ranked tokens are typically accurate. However, the middle-ranked prediction is highly biased towards the tokens that are distributionally (not necessarily semantically) similar to these top ones. For instance, if "P" is predicted as the top-1 token, "A"-"Z" will all be ranked high in NTP, no matter whether they can lead to correct decoding results. This hurts the sampling diversity and makes the sampling of correct, long-tail results hopeless and noisy. We attempt to alleviate this issue via a novel in-context method that iteratively pushes the query representation away from explored regions. Specifically, we include the explored decoding results in the context and prompt the LM to generate something else, which encourages the LM to produce a query representation that has small dot products with explored keys. Experiments on knowledge-probing tasks show that our method leads to efficient navigation away from explored keys to correct new keys. We further extend our method to open-ended and chain-of-thought (for reasoning) generation. Experiment results show that ICN contributes to better generation diversity and improved self-consistency voting performance. Finally, we discuss potential training issues caused by the fixed key space together with the challenges and possible ways to address them in future research.
摘要：语言模型 (LM) 解码基于下一个标记预测 (NTP) 概率分布。对于神经 LM（例如基于 Transformer 的），NTP 分布本质上是编码输入上下文（查询）和固定词汇表示（键）之间的 softmax 正则化点积。在本文中，我们研究键分布对 NTP 分布的影响，重点研究键之间的相似性是否会在 NTP 中引发虚假相关性。通过知识探测任务，我们表明在 NTP 分布中，少数排名靠前的标记通常是准确的。然而，排名居中的预测高度偏向于在分布上（不一定在语义上）与这些排名靠前的标记相似的标记。例如，如果“P”被预测为排名第一的标记，“A”-“Z”在 NTP 中都会排名靠前，无论它们是否能产生正确的解码结果。这会损害采样多样性，并使正确的长尾结果的采样变得毫无希望和嘈杂。我们尝试通过一种新颖的上下文方法来缓解这个问题，该方法迭代地将查询表示从探索区域推开。具体来说，我们将探索的解码结果包含在上下文中并提示 LM 生成其他内容，这会鼓励 LM 生成具有与探索的键的小点积的查询表示。知识探测任务上的实验表明，我们的方法可以有效地从探索的键导航到正确的新键。我们进一步将我们的方法扩展到开放式和思路链（用于推理）生成。实验结果表明，ICN 有助于提高生成多样性并提高自洽投票性能。最后，我们讨论了固定密钥空间导致的潜在训练问题以及未来研究中的挑战和可能的解决方法。
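The abstract above describes the next-token distribution as a softmax over dot products between a query vector and fixed key vectors. A minimal sketch of that computation follows; the two-dimensional vectors and the token names are made up for illustration and do not come from any real model.

```python
import math


def ntp_distribution(query, keys):
    """Softmax over query-key dot products, one probability per vocabulary key."""
    logits = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary: "Q" has a key close to "P"; "cat" is unrelated.
vocab = ["P", "Q", "cat"]
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [2.0, 0.0]  # a context whose best continuation is "P"

probs = ntp_distribution(query, keys)
ranking = [tok for _, tok in sorted(zip(probs, vocab), reverse=True)]
```

Because the key for "Q" lies close to the key for "P", "Q" is ranked directly behind "P" whether or not it is a sensible continuation. This is a small-scale picture of the key-similarity effect the abstract reports.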

Title: Language Models are Graph Learners

Authors: Zhe Xu, Kaveh Hassani, Si Zhang, Hanqing Zeng, Michihiro Yasunaga, Limei Wang, Dongqi Fu, Ning Yao, Bo Long, Hanghang Tong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02296
Pdf URL: https://arxiv.org/pdf/2410.02296
Copy Paste: [[2410.02296]] Language Models are Graph Learners(https://arxiv.org/abs/2410.02296)
Keywords: language model
Abstract: Language Models (LMs) are increasingly challenging the dominance of domain-specific models, including Graph Neural Networks (GNNs) and Graph Transformers (GTs), in graph learning tasks. Following this trend, we propose a novel approach that empowers off-the-shelf LMs to achieve performance comparable to state-of-the-art GNNs on node classification tasks, without requiring any architectural modification. By preserving the LM's original architecture, our approach retains a key benefit of LM instruction tuning: the ability to jointly train on diverse datasets, fostering greater flexibility and efficiency. To achieve this, we introduce two key augmentation strategies: (1) Enriching LMs' input using topological and semantic retrieval methods, which provide richer contextual information, and (2) guiding the LMs' classification process through a lightweight GNN classifier that effectively prunes class candidates. Our experiments on real-world datasets show that backbone Flan-T5 models equipped with these augmentation strategies outperform state-of-the-art text-output node classifiers and are comparable to top-performing vector-output node classifiers. By bridging the gap between specialized task-specific node classifiers and general LMs, this work paves the way for more versatile and widely applicable graph learning models. We will open-source the code upon publication.
摘要：语言模型 (LM) 正日益挑战领域特定模型（包括图神经网络 (GNN) 和图变换器 (GT)）在图学习任务中的主导地位。顺应这一趋势，我们提出了一种新颖的方法，使现成的 LM 能够在节点分类任务上实现与最先进的 GNN 相当的性能，而无需进行任何架构修改。通过保留 LM 的原始架构，我们的方法保留了 LM 指令调整的一个关键优势：能够在不同的数据集上进行联合训练，从而提高灵活性和效率。为实现这一点，我们引入了两种关键的增强策略：(1) 使用拓扑和语义检索方法丰富 LM 的输入，从而提供更丰富的上下文信息；(2) 通过轻量级 GNN 分类器指导 LM 的分类过程，从而有效地修剪类候选。我们在真实数据集上进行的实验表明，配备这些增强策略的主干 Flan-T5 模型的表现优于最先进的文本输出节点分类器，并且可与表现最佳的向量输出节点分类器相媲美。通过弥合专门的任务特定节点分类器和通用 LM 之间的差距，这项工作为更通用、适用范围更广的图学习模型铺平了道路。我们将在发布后开源代码。

Title: Traffic Light or Light Traffic? Investigating Phrasal Semantics in Large Language Models

Authors: Rui Meng, Ye Liu, Lifu Tu, Daqing He, Yingbo Zhou, Semih Yavuz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02308
Pdf URL: https://arxiv.org/pdf/2410.02308
Copy Paste: [[2410.02308]] Traffic Light or Light Traffic? Investigating Phrasal Semantics in Large Language Models(https://arxiv.org/abs/2410.02308)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Phrases are fundamental linguistic units through which humans convey semantics. This study critically examines the capacity of API-based large language models (LLMs) to comprehend phrase semantics, utilizing three human-annotated datasets. We assess the performance of LLMs in executing phrase semantic reasoning tasks guided by natural language instructions and explore the impact of common prompting techniques, including few-shot demonstrations and Chain-of-Thought reasoning. Our findings reveal that LLMs greatly outperform traditional embedding methods across the datasets; however, they do not show a significant advantage over fine-tuned methods. The effectiveness of advanced prompting strategies shows variability. We conduct detailed error analyses to interpret the limitations faced by LLMs in comprehending phrase semantics. Code and data can be found at this https URL.
摘要：短语是人类传达语义的基本语言单位。本研究利用三个人工注释的数据集，严格审查了基于 API 的大型语言模型 (LLM) 理解短语语义的能力。我们评估了 LLM 在执行由自然语言指令指导的短语语义推理任务时的表现，并探讨了常见提示技术（包括少量演示和思路链推理）的影响。我们的研究结果表明，LLM 在数据集上的表现大大优于传统的嵌入方法；然而，它们并没有显示出比微调方法更显著的优势。高级提示策略的有效性表现出多样性。我们进行了详细的错误分析，以解释 LLM 在理解短语语义方面面临的局限性。代码和数据可在此 https URL 中找到。

Title: Post-edits Are Preferences Too

Authors: Nathaniel Berger, Stefan Riezler, Miriam Exel, Matthias Huck
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02320
Pdf URL: https://arxiv.org/pdf/2410.02320
Copy Paste: [[2410.02320]] Post-edits Are Preferences Too(https://arxiv.org/abs/2410.02320)
Keywords: language model, llm
Abstract: Preference Optimization (PO) techniques are currently among the state-of-the-art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences $s_1$ and $s_2$ and asked for a preference judgment, $s_1 > s_2$; while for post-editing, editors create $s_1$ and know that it should be better than $s_2$. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit-like hypotheses and away from machine translation-like hypotheses. Furthermore, we show that best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.
摘要：偏好优化 (PO) 技术是目前最先进的技术之一，用于根据来自人类注释者的成对偏好反馈对大型语言模型 (LLM) 进行微调。然而，在机器翻译中，这种反馈可能难以获得。此外，Kreutzer 等人 (2018) 表明，对于机器翻译，成对偏好不如其他形式的人类反馈（例如 5 分制评分）可靠。我们检查后期编辑，看它们是否可以通过构造成为可靠的人类偏好来源。在 PO 中，向人类注释者显示序列 $s_1$ 和 $s_2$，并要求其做出偏好判断，%$s_1 > s_2$；而对于后期编辑，编辑者 \emph{创建} $s_1$ 并知道它应该比 $s_2$ 更好。我们尝试将这些隐式偏好用于 PO，并表明它有助于模型向后期编辑类假设转变，远离机器翻译类假设。此外，我们表明，通过对后期编辑进行监督微调（SFT）模型预训练，将类似后期编辑的假设提升到最高输出排名，可以获得最佳效果。
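The idea that a post-edit implies a preference over the machine translation it corrects can be sketched as a small data-preparation step. Everything below is an assumption for illustration: the record layout, the `PreferencePair` type, the choice to skip unchanged outputs, and the logistic pairwise loss, which is a generic stand-in and not necessarily the objective used in the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    source: str    # source-language sentence
    chosen: str    # post-edit, preferred by construction
    rejected: str  # raw machine translation


def pairs_from_post_edits(records):
    """Turn (source, machine_translation, post_edit) triples into preference pairs."""
    pairs = []
    for source, mt, post_edit in records:
        # An unchanged output carries no preference signal, so it is skipped here.
        if post_edit.strip() != mt.strip():
            pairs.append(PreferencePair(source, post_edit, mt))
    return pairs


def pairwise_loss(score_chosen, score_rejected, beta=1.0):
    """Generic logistic preference loss: -log sigmoid(beta * (chosen - rejected))."""
    margin = beta * (score_chosen - score_rejected)
    return math.log1p(math.exp(-margin))


records = [
    ("Das Haus ist klein.", "The house is little.", "The house is small."),
    ("Guten Morgen.", "Good morning.", "Good morning."),
]
pairs = pairs_from_post_edits(records)
```

The loss shrinks as the model scores the post-edit higher than the machine translation, which matches the abstract's goal of moving toward post-edit-like hypotheses.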

Title: Llama SLayer 8B: Shallow Layers Hold the Key to Knowledge Injection

Authors: Tianxiang Chen, Zhentao Tan, Tao Gong, Yue Wu, Qi Chu, Bin Liu, Jieping Ye, Nenghai Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02330
Pdf URL: https://arxiv.org/pdf/2410.02330
Copy Paste: [[2410.02330]] Llama SLayer 8B: Shallow Layers Hold the Key to Knowledge Injection(https://arxiv.org/abs/2410.02330)
Keywords: language model, llm
Abstract: As a way to augment pre-trained large language models (LLMs), knowledge injection is critical to developing vertical-domain large models and has been widely studied. Most current approaches, including parameter-efficient fine-tuning (PEFT) and block expansion methods, apply knowledge uniformly across all LLM layers, which raises the question: are all layers equally crucial for knowledge injection? We begin by evaluating the importance of each layer in finding the optimal layer range for knowledge injection. Intuitively, the more important layers should play a more critical role in knowledge injection and deserve a denser injection. We observe performance dips in question-answering benchmarks after the removal or expansion of the shallow layers, and the degradation shrinks as the layer gets deeper, indicating that the shallow layers hold the key to knowledge injection. This insight leads us to propose the S strategy, a post-pretraining strategy of selectively enhancing shallow layers while pruning the less effective deep ones. Based on this strategy, we introduce Llama Slayer-8B and Llama Slayer-8B-Instruct. We experimented on a code and math corpus and demonstrated the effectiveness of our strategy. Further experiments on a different LLM, Mistral-7B, and on a legal corpus confirmed the general applicability of the approach, underscoring its wide-ranging efficacy. Our code is available at: this https URL
摘要：作为增强预训练大型语言模型 (LLM) 的一种方式，知识注入对于开发垂直领域大型模型至关重要，并且已被广泛研究。尽管大多数当前方法（包括参数高效微调 (PEFT) 和块扩展方法）将知识统一应用于所有 LLM 层，但它提出了一个问题：所有层对于知识注入是否同样重要？我们首先评估每层在寻找知识注入的最佳层范围方面的重要性。直观地说，更重要的层应该在知识注入中发挥更关键的作用，并且值得更密集地注入。我们观察到在删除或扩展浅层后，问答基准的性能会下降，并且随着层的加深，性能下降会缩小，这表明浅层是知识注入的关键。这一见解促使我们提出 S 策略，这是一种后预训练策略，可以选择性地增强浅层，同时修剪效率较低的深层。基于此策略，我们推出了 Llama Slayer-8B 和 Llama Slayer-8B-Instruct。我们对代码语料库 $\&$ math 进行了实验，并证明了我们策略的有效性。在不同的 LLM、Mistral-7B 和法律语料库中进行的进一步实验证实了该方法的普遍适用性，强调了其广泛的功效。我们的代码可在以下网址获取：\此 https URL

Title: How Much Can RAG Help the Reasoning of LLM?

Authors: Jingyu Liu, Jiaen Lin, Yong Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02338
Pdf URL: https://arxiv.org/pdf/2410.02338
Copy Paste: [[2410.02338]] How Much Can RAG Help the Reasoning of LLM?(https://arxiv.org/abs/2410.02338)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has gained significant popularity in modern Large Language Models (LLMs) due to its effectiveness in introducing new knowledge and reducing hallucinations. However, a deep understanding of RAG remains limited: how RAG helps the reasoning process, and whether it can improve reasoning capability, remain open questions. While external documents are typically considered a means of incorporating domain-specific information, they also contain intermediate reasoning results related to the query. This suggests that documents could enhance the reasoning capability of LLMs, which has not been previously explored. In this paper, we investigate this issue in depth and find that while RAG can assist with reasoning, the help is limited. If we conceptualize the reasoning process as a tree with fixed depth, then RAG struggles to assist LLMs in performing deeper reasoning. Additionally, the information in the documents requires preprocessing to filter out noise. We demonstrate that this preprocessing is difficult to achieve by simply fine-tuning the LLM; it often necessitates numerous additional transformer layers to solve the problem. To simplify the problem, we propose DPrompt tuning, which effectively resolves the issue within just a limited number of transformer layers, leading to improved performance.
摘要：检索增强生成 (RAG) 因其在引入新知识和减少幻觉方面的有效性而在现代大型语言模型 (LLM) 中获得了广泛的欢迎。然而，对 RAG 的深入理解仍然有限，RAG 如何帮助推理过程以及 RAG 是否有助于提高推理能力仍是一个问题。虽然外部文档通常被视为一种整合领域特定信息的方法，但它们也包含与查询相关的中间推理结果，这表明文档可以增强 LLM 的推理能力，这在以前尚未被探索过。在本文中，我们深入研究了这个问题，发现虽然 RAG 可以帮助推理，但帮助是有限的。如果我们将推理过程概念化为具有固定深度的树，那么 RAG 很难帮助 LLM 进行更深入的推理。此外，文档中的信息需要预处理以滤除噪音。我们证明这种预处理很难通过简单地微调 LLM 来实现，它通常需要大量额外的转换器层来解决问题。为了简化问题，我们提出了 DPrompt 调整，它可以在有限的变压器层内有效解决问题，从而提高性能。

Title: Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA

Authors: Eduard Tulchinskii, Laida Kushnareva, Kristian Kuznetsov, Anastasia Voznyuk, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02343
Pdf URL: https://arxiv.org/pdf/2410.02343
Copy Paste: [[2410.02343]] Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA(https://arxiv.org/abs/2410.02343)
Keywords: llm
Abstract: A standard way to evaluate the abilities of LLMs involves presenting a multiple-choice question and selecting the option with the highest logit as the model's predicted answer. However, such a format for evaluating LLMs has limitations, since even if the model knows the correct answer, it may struggle to select the corresponding letter simply due to difficulties in following this rigid format. To address this, we introduce new scores that better capture and reveal the model's underlying knowledge: the Query-Key Score (QK-score), derived from the interaction between query and key representations in attention heads, and the Attention Score, based on attention weights. These scores are extracted from specific select-and-copy heads, which show consistent performance across popular Multi-Choice Question Answering (MCQA) datasets. Based on these scores, our method improves knowledge extraction, yielding up to 16% gain for LLaMA2-7B and up to 10% for larger models on popular MCQA benchmarks. At the same time, the accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%, achieving nearly perfect accuracy, therefore demonstrating the method's efficiency in mitigating MCQA format limitations. To support our claims, we conduct experiments on models ranging from 7 billion to 70 billion parameters in both zero- and few-shot setups.
摘要：评估 LLM 能力的标准方法是提出一个多项选择题，并选择 logit 最高的选项作为模型的预测答案。然而，这种评估 LLM 的格式有局限性，因为即使模型知道正确答案，也可能因为难以遵循这种严格的格式而难以选择相应的字母。为了解决这个问题，我们引入了新的分数，可以更好地捕捉和揭示模型的底层知识：查询关键分数 (QK-score)，它来自注意力头中查询和关键表示之间的交互，以及基于注意力权重的注意力分数。这些分数是从特定的 \textit{select-and-copy} 头中提取的，这些头在流行的多项选择题回答 (MCQA) 数据集中表现出一致的性能。基于这些分数，我们的方法改进了知识提取，在流行的 MCQA 基准上，LLaMA2-7B 的知识提取率最高可提高 16%，大型模型的知识提取率最高可提高 10%。同时，在简单的合成数据集（模型明确知道正确答案）上，准确率提高了近 60%，达到了近乎完美的准确率，从而证明了该方法在缓解 MCQA 格式限制方面的有效性。为了支持我们的说法，我们在零样本和少量样本设置中对参数范围从 70 亿到 700 亿的模型进行了实验。

Title: AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models

Authors: Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Xiang Wang, Xiangnan He, Tat-seng Chua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02355
Pdf URL: https://arxiv.org/pdf/2410.02355
Copy Paste: [[2410.02355]] AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models(https://arxiv.org/abs/2410.02355)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) often exhibit hallucinations due to incorrect or outdated knowledge. Hence, model editing methods have emerged to enable targeted knowledge updates. To achieve this, a prevailing paradigm is the locating-then-editing approach, which first locates influential parameters and then edits them by introducing a perturbation. While effective, current studies have demonstrated that this perturbation inevitably disrupts the originally preserved knowledge within LLMs, especially in sequential editing scenarios. To address this, we introduce AlphaEdit, a novel solution that projects perturbation onto the null space of the preserved knowledge before applying it to the parameters. We theoretically prove that this projection ensures the output of post-edited LLMs remains unchanged when queried about the preserved knowledge, thereby mitigating the issue of disruption. Extensive experiments on various LLMs, including LLaMA3, GPT2-XL, and GPT-J, show that AlphaEdit boosts the performance of most locating-then-editing methods by an average of 36.4% with a single line of additional code for projection solely. Our code is available at: this https URL.
摘要：大型语言模型 (LLM) 经常会因为知识不正确或过时而出现幻觉。因此，模型编辑方法应运而生，以实现有针对性的知识更新。为了实现这一点，一种流行的范式是先定位后编辑的方法，该方法首先定位有影响的参数，然后通过引入扰动来编辑它们。虽然有效，但目前的研究表明，这种扰动不可避免地会破坏 LLM 中最初保存的知识，尤其是在顺序编辑场景中。为了解决这个问题，我们引入了 AlphaEdit，这是一种新颖的解决方案，它将扰动投影到保存知识的零空间上，然后再将其应用于参数。我们从理论上证明，这种投影可确保在查询保存的知识时，后编辑的 LLM 的输出保持不变，从而缓解中断问题。在包括 LLaMA3、GPT2-XL 和 GPT-J 在内的各种 LLM 上进行的大量实验表明，AlphaEdit 仅用一行额外的投影代码就能将大多数定位后编辑方法的性能平均提高 36.4%。我们的代码可从此 https URL 获取。
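The null-space idea in the abstract above can be illustrated with a small linear-algebra sketch: if a weight perturbation is projected so that each of its rows is orthogonal to every preserved key vector, the perturbed layer produces the same output for those keys. The helper names, the Gram-Schmidt construction, and the toy matrices are assumptions for this example; the abstract does not specify how AlphaEdit builds its projection.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # drop vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(delta, preserved_keys):
    """Remove from each row of `delta` its component in the span of the keys.

    Afterwards, the projected perturbation maps every preserved key to zero.
    """
    basis = orthonormal_basis(preserved_keys)
    projected = []
    for row in delta:
        w = list(row)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        projected.append(w)
    return projected


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


# Toy setup: two preserved keys in a 3-dimensional input space.
preserved_keys = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
delta = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # hypothetical raw perturbation
delta_projected = project_to_null_space(delta, preserved_keys)
```

Applying `delta_projected` to either preserved key gives a zero vector, so adding it to the weights leaves their outputs untouched, while a key outside their span (for example `[0, 0, 1]`) is still affected.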

Title: From Concrete to Abstract: A Multimodal Generative Approach to Abstract Concept Learning

Authors: Haodong Xie, Rahul Singh Maharjan, Federico Tavella, Angelo Cangelosi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02365
Pdf URL: https://arxiv.org/pdf/2410.02365
Copy Paste: [[2410.02365]] From Concrete to Abstract: A Multimodal Generative Approach to Abstract Concept Learning(https://arxiv.org/abs/2410.02365)
Keywords: agent
Abstract: Understanding and manipulating concrete and abstract concepts is fundamental to human intelligence. Yet, they remain challenging for artificial agents. This paper introduces a multimodal generative approach to high order abstract concept learning, which integrates visual and categorical linguistic information from concrete ones. Our model initially grounds subordinate level concrete concepts, combines them to form basic level concepts, and finally abstracts to superordinate level concepts via the grounding of basic-level concepts. We evaluate the model's language learning ability through language-to-visual and visual-to-language tests with high order abstract concepts. Experimental results demonstrate the proficiency of the model in both language understanding and language naming tasks.
摘要：理解和操作具体和抽象概念是人类智能的基础。然而，这对人工智能体来说仍然具有挑战性。本文介绍了一种多模态生成式高阶抽象概念学习方法，该方法将视觉和分类语言信息与具体概念整合在一起。我们的模型首先将下级具体概念建立基础，将它们组合起来形成基本概念，最后通过基础概念的建立抽象为上级概念。我们通过高阶抽象概念的语言到视觉和视觉到语言测试来评估模型的语言学习能力。实验结果证明了该模型在语言理解和语言命名任务中的熟练程度。

Title: Towards Comprehensive Detection of Chinese Harmful Memes

Authors: Junyu Lu, Bo Xu, Xiaokun Zhang, Hongbo Wang, Haohao Zhu, Dongyu Zhang, Liang Yang, Hongfei Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02378
Pdf URL: https://arxiv.org/pdf/2410.02378
Copy Paste: [[2410.02378]] Towards Comprehensive Detection of Chinese Harmful Memes(https://arxiv.org/abs/2410.02378)
Keywords: llm
Abstract: This paper has been accepted in the NeurIPS 2024 D & B Track. Harmful memes have proliferated on the Chinese Internet, while research on detecting Chinese harmful memes significantly lags behind due to the absence of reliable datasets and effective detectors. To this end, we focus on the comprehensive detection of Chinese harmful memes. We construct ToxiCN MM, the first Chinese harmful meme dataset, which consists of 12,000 samples with fine-grained annotations for various meme types. Additionally, we propose a baseline detector, Multimodal Knowledge Enhancement (MKE), incorporating contextual information of meme content generated by the LLM to enhance the understanding of Chinese memes. During the evaluation phase, we conduct extensive quantitative experiments and qualitative analyses on multiple baselines, including LLMs and our MKE. The experimental results indicate that detecting Chinese harmful memes is challenging for existing models while demonstrating the effectiveness of MKE. The resources for this paper are available at this https URL.
摘要：本论文已被 NeurIPS 2024 D & B Track 接受。有害 meme 在中国互联网上激增，而由于缺乏可靠的数据集和有效的检测器，检测中国有害 meme 的研究明显滞后。为此，我们专注于对中国有害 meme 的全面检测。我们构建了第一个中国有害 meme 数据集 ToxiCN MM，它包含 12,000 个样本，其中包含各种 meme 类型的细粒度注释。此外，我们提出了一个基线检测器多模态知识增强 (MKE)，结合 LLM 生成的 meme 内容的上下文信息来增强对中国 meme 的理解。在评估阶段，我们对多个基线进行了广泛的定量实验和定性分析，包括 LLM 和我们的 MKE。实验结果表明，检测中国有害 meme 对现有模型来说具有挑战性，同时也证明了 MKE 的有效性。本文的资源可在此 https URL 上找到。

Title: Learning the Latent Rules of a Game from Data: A Chess Story

Authors: Ben Fauber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02426
Pdf URL: https://arxiv.org/pdf/2410.02426
Copy Paste: [[2410.02426]] Learning the Latent Rules of a Game from Data: A Chess Story(https://arxiv.org/abs/2410.02426)
Keywords: language model, hallucination
Abstract: We demonstrate that small pretrained foundational generative language models with millions of parameters can learn the latent rules of a process from data associated with the process. Inspired by Stefan Zweig's novella "Schachnovelle," also known as "The Royal Game" in English, we show that 28M and 125M parameter pretrained foundational small language models (SLMs) can be instruction fine-tuned with 1,000-to-1,000,000 examples to learn the rules of chess, propose legal moves, and accurately solve chess problems. We also explore the impact of successive language model fine-tuning epochs on improved outcomes and demonstrate reductions in model hallucinations by increasing the number of instruction fine-tuning examples.
摘要：我们证明，具有数百万个参数的小型预训练基础生成语言模型可以从与过程相关的数据中学习过程的潜在规则。受斯蒂芬·茨威格的中篇小说《Schachnovelle》（英文名“The Royal Game”）的启发，我们证明，28M 和 125M 参数预训练基础小型语言模型 (SLM) 可以通过 1,000 到 1,000,000 个示例进行指令微调，以学习国际象棋规则、提出合法动作并准确解决国际象棋问题。我们还探索了连续语言模型微调时期对改进结果的影响，并通过增加指令微调示例的数量来证明模型幻觉的减少。

Title: Collective Critics for Creative Story Generation

Authors: Minwook Bae, Hyounghun Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02428
Pdf URL: https://arxiv.org/pdf/2410.02428
Copy Paste: [[2410.02428]] Collective Critics for Creative Story Generation(https://arxiv.org/abs/2410.02428)
Keywords: language model, llm
Abstract: Generating a long story of several thousand words with narrative coherence using Large Language Models (LLMs) has been a challenging task. Previous research has addressed this challenge by proposing different frameworks that create a story plan and generate a long story based on that plan. However, these frameworks have mainly focused on maintaining narrative coherence in stories, often overlooking creativity in story planning and the expressiveness of the stories generated from those plans, which are desirable properties to captivate readers' interest. In this paper, we propose the Collective Critics for Creative Story Generation framework (CritiCS), which is composed of a plan refining stage (CrPlan) and a story generation stage (CrText), to integrate a collective revision mechanism that promotes those properties into the long-form story generation process. Specifically, in each stage, a group of LLM critics and one leader collaborate to incrementally refine drafts of plan and story throughout multiple rounds. Extensive human evaluation shows that CritiCS can significantly enhance story creativity and reader engagement, while also maintaining narrative coherence. Furthermore, the design of the framework allows active participation from human writers in any role within the critique process, enabling interactive human-machine collaboration in story writing.

Title: Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization

Authors: Mingyang Wang, Lukas Lange, Heike Adel, Jannik Strötgen, Hinrich Schütze
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02433
Pdf URL: https://arxiv.org/pdf/2410.02433
Copy Paste: [[2410.02433]] Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization(https://arxiv.org/abs/2410.02433)
Keywords: language model
Abstract: To ensure large language models contain up-to-date knowledge, they need to be updated regularly. However, model editing is challenging as it might also affect knowledge that is unrelated to the new data. State-of-the-art methods identify parameters associated with specific knowledge and then modify them via direct weight updates. However, these locate-and-edit methods suffer from heavy computational overhead and lack theoretical validation. In contrast, directly fine-tuning the model on requested edits affects the model's behavior on unrelated knowledge, and significantly damages the model's generation fluency and consistency. To address these challenges, we propose SAUL, a streamlined model editing method that uses sentence concatenation with augmented random facts for generation regularization. Evaluations on three model editing benchmarks show that SAUL is a practical and reliable solution for model editing outperforming state-of-the-art methods while maintaining generation quality and reducing computational overhead.

Title: Response Tuning: Aligning Large Language Models without Instruction

Authors: Seokhyun An, Hyounghun Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02465
Pdf URL: https://arxiv.org/pdf/2410.02465
Copy Paste: [[2410.02465]] Response Tuning: Aligning Large Language Models without Instruction(https://arxiv.org/abs/2410.02465)
Keywords: language model, llm, chat
Abstract: Instruction tuning (supervised fine-tuning using instruction-response pairs) is a foundational step in transitioning pre-trained Large Language Models (LLMs) into helpful and safe chat assistants. Our hypothesis is that establishing an adequate output space can enable such a transition given the capabilities inherent in pre-trained LLMs. To verify this, we propose Response Tuning (RT), which eliminates the instruction-conditioning step in instruction tuning and solely focuses on response space supervision. Our experiments demonstrate that RT models, trained only using responses, can effectively respond to a wide range of instructions and exhibit helpfulness comparable to that of their instruction-tuned counterparts. Furthermore, we observe that controlling the training response distribution can significantly improve their user preference or elicit target behaviors such as refusing assistance for unsafe queries. Our findings illuminate the role of establishing an adequate output space in alignment, highlighting the potential of the extensive inherent capabilities of pre-trained LLMs.

Title: Defining Knowledge: Bridging Epistemology and Large Language Models

Authors: Constanza Fierro, Ruchira Dhar, Filippos Stamatiou, Nicolas Garneau, Anders Søgaard
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02499
Pdf URL: https://arxiv.org/pdf/2410.02499
Copy Paste: [[2410.02499]] Defining Knowledge: Bridging Epistemology and Large Language Models(https://arxiv.org/abs/2410.02499)
Keywords: language model, gpt, llm
Abstract: Knowledge claims are abundant in the literature on large language models (LLMs); but can we say that GPT-4 truly "knows" the Earth is round? To address this question, we review standard definitions of knowledge in epistemology and we formalize interpretations applicable to LLMs. In doing so, we identify inconsistencies and gaps in how current NLP research conceptualizes knowledge with respect to epistemological frameworks. Additionally, we conduct a survey of 100 professional philosophers and computer scientists to compare their preferences in knowledge definitions and their views on whether LLMs can really be said to know. Finally, we suggest evaluation protocols for testing knowledge in accordance with the most relevant definitions.

Title: Mixed-Session Conversation with Egocentric Memory

Authors: Jihyoung Jang, Taeyoung Kim, Hyounghun Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02503
Pdf URL: https://arxiv.org/pdf/2410.02503
Copy Paste: [[2410.02503]] Mixed-Session Conversation with Egocentric Memory(https://arxiv.org/abs/2410.02503)
Keywords: agent
Abstract: Recently introduced dialogue systems have demonstrated high usability. However, they still fall short of reflecting real-world conversation scenarios. Current dialogue systems are unable to replicate the dynamic, continuous, long-term interactions involving multiple partners. This shortfall arises because there have been limited efforts to account for both aspects of real-world dialogues: deeply layered interactions over long-term dialogue and widely expanded conversation networks involving multiple participants. To incorporate both aspects, we introduce Mixed-Session Conversation, a dialogue system designed to construct conversations with various partners in a multi-session dialogue setup. We propose a new dataset called MiSC to implement this system. The dialogue episodes of MiSC consist of 6 consecutive sessions, with four speakers (one main speaker and three partners) appearing in each episode. We also propose a new dialogue model with a novel memory management mechanism, called Egocentric Memory Enhanced Mixed-Session Conversation Agent (EMMA). EMMA collects and retains memories from the main speaker's perspective during conversations with partners, enabling seamless continuity in subsequent interactions. Extensive human evaluations validate that the dialogues in MiSC demonstrate a seamless conversational flow, even when conversation partners change in each session. Evaluation also shows that EMMA trained with MiSC maintains high memorability without contradiction throughout the entire conversation.

Title: Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions

Authors: Angana Borah, Rada Mihalcea
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2410.02584
Pdf URL: https://arxiv.org/pdf/2410.02584
Copy Paste: [[2410.02584]] Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions(https://arxiv.org/abs/2410.02584)
Keywords: language model, llm, agent
Abstract: As Large Language Models (LLMs) continue to evolve, they are increasingly being employed in numerous studies to simulate societies and execute diverse social tasks. However, LLMs are susceptible to societal biases due to their exposure to human-generated data. Given that LLMs are being used to gain insights into various societal aspects, it is essential to mitigate these biases. To that end, our study investigates the presence of implicit gender biases in multi-agent LLM interactions and proposes two strategies to mitigate these biases. We begin by creating a dataset of scenarios where implicit gender biases might arise, and subsequently develop a metric to assess the presence of biases. Our empirical analysis reveals that LLMs generate outputs characterized by strong implicit bias associations (>= 50% of the time). Furthermore, these biases tend to escalate following multi-agent interactions. To mitigate them, we propose two strategies: self-reflection with in-context examples (ICE); and supervised fine-tuning. Our research demonstrates that both methods effectively mitigate implicit biases, with the ensemble of fine-tuning and self-reflection proving to be the most successful.

Title: Agents' Room: Narrative Generation through Multi-step Collaboration

Authors: Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark, Mirella Lapata
Subjects: cs.CL, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2410.02603
Pdf URL: https://arxiv.org/pdf/2410.02603
Copy Paste: [[2410.02603]] Agents' Room: Narrative Generation through Multi-step Collaboration(https://arxiv.org/abs/2410.02603)
Keywords: language model, llm, prompt, agent
Abstract: Writing compelling fiction is a multifaceted process combining elements such as crafting a plot, developing interesting characters, and using evocative language. While large language models (LLMs) show promise for story writing, they currently rely heavily on intricate prompting, which limits their use. We propose Agents' Room, a generation framework inspired by narrative theory, that decomposes narrative writing into subtasks tackled by specialized agents. To illustrate our method, we introduce Tell Me A Story, a high-quality dataset of complex writing prompts and human-written stories, and a novel evaluation framework designed specifically for assessing long narratives. We show that Agents' Room generates stories that are preferred by expert evaluators over those produced by baseline systems by leveraging collaboration and specialization to decompose the complex story writing task into tractable components. We provide extensive analysis with automated and human-based metrics of the generated output.

Title: Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning

Authors: Tianxiang Hu, Pei Zhang, Baosong Yang, Jun Xie, Derek F. Wong, Rui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02631
Pdf URL: https://arxiv.org/pdf/2410.02631
Copy Paste: [[2410.02631]] Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning(https://arxiv.org/abs/2410.02631)
Keywords: language model, llm
Abstract: Achieving consistent high-quality machine translation (MT) across diverse domains remains a significant challenge, primarily due to the limited and imbalanced parallel training data available in various domains. While large language models (LLMs) have demonstrated impressive general understanding and generation abilities, their potential in multi-domain MT is under-explored. We establish a comprehensive benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets respectively covering 15 domains. Our evaluation of prominent LLMs reveals a discernible performance gap against traditional MT systems, highlighting domain overfitting and catastrophic forgetting issues after fine-tuning on domain-limited corpora. To mitigate this, we propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance. This method prompts the LLM to perceive domain information from the source text, which then serves as a helpful hint to guide the translation process. Despite being trained on a small dataset of four domains, our CoT fine-tuning approach achieves notable improvements in translation accuracy and domain robustness over traditional fine-tuning, as evidenced by an average 1.53 BLEU score increase across more than 20 distinct German$\rightarrow$English out-of-domain tests.

Title: Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers

Authors: Shijie Chen, Bernal Jiménez Gutiérrez, Yu Su
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2410.02642
Pdf URL: https://arxiv.org/pdf/2410.02642
Copy Paste: [[2410.02642]] Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers(https://arxiv.org/abs/2410.02642)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. So far, LLM-based re-ranking methods rely on strong generative capabilities, which restricts their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that might not be used to their full potential via generation. To more directly leverage such signals, we propose in-context re-ranking (ICR), a novel method that leverages the change in attention pattern caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Due to the absence of generation, ICR only requires two ($O(1)$) forward passes to re-rank $N$ documents, making it substantially more efficient than generative re-ranking methods that require at least $O(N)$ forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting the latency by more than 60% in practice. Through detailed analyses, we show that ICR's performance is especially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration of novel ways of utilizing open-weight LLMs beyond text generation.
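Note: The calibration step described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation. It assumes the attention mass each document receives has already been aggregated into one number per document, once from a forward pass with the real query and once from a pass with a content-free query, and it assumes calibration is a simple subtraction, which the abstract does not specify. The function names and the example numbers are hypothetical.

```python
from typing import List


def calibrated_scores(query_attn: List[float],
                      content_free_attn: List[float]) -> List[float]:
    """Subtract the content-free baseline from the query-driven attention.

    Both inputs hold one aggregated attention value per document
    (assumed to be precomputed from two separate forward passes).
    """
    if len(query_attn) != len(content_free_attn):
        raise ValueError("both passes must score the same documents")
    return [q - c for q, c in zip(query_attn, content_free_attn)]


def rerank(query_attn: List[float],
           content_free_attn: List[float]) -> List[int]:
    """Return document indices sorted from most to least relevant."""
    scores = calibrated_scores(query_attn, content_free_attn)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical aggregated attention for three documents.
query_attn = [0.40, 0.50, 0.10]         # pass 1: real search query
content_free_attn = [0.10, 0.40, 0.05]  # pass 2: content-free query

# Document 1 gets the most raw attention, but most of that attention is
# also present without a real query, so calibration ranks document 0 first.
raw_order = sorted(range(3), key=lambda i: query_attn[i], reverse=True)
calibrated_order = rerank(query_attn, content_free_attn)
print(raw_order, calibrated_order)
```

The number of forward passes in this sketch is two regardless of how many documents are scored, which is the property the abstract contrasts with generative re-ranking.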

Title: Undesirable Memorization in Large Language Models: A Survey

Authors: Ali Satvaty, Suzan Verberne, Fatih Turkmen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02650
Pdf URL: https://arxiv.org/pdf/2410.02650
Copy Paste: [[2410.02650]] Undesirable Memorization in Large Language Models: A Survey(https://arxiv.org/abs/2410.02650)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: While recent research increasingly showcases the remarkable capabilities of Large Language Models (LLMs), it's vital to confront their hidden pitfalls. Among these challenges, the issue of memorization stands out, posing significant ethical and legal risks. In this paper, we present a Systematization of Knowledge (SoK) on the topic of memorization in LLMs. Memorization is the tendency of a model to store and reproduce phrases or passages from the training data, and it has been shown to be the fundamental issue behind various privacy and security attacks against LLMs. We begin by providing an overview of the literature on memorization, exploring it across five key dimensions: intentionality, degree, retrievability, abstraction, and transparency. Next, we discuss the metrics and methods used to measure memorization, followed by an analysis of the factors that contribute to the memorization phenomenon. We then examine how memorization manifests itself in specific model architectures and explore strategies for mitigating these effects. We conclude our overview by identifying potential research topics for the near future: developing methods for balancing performance and privacy in LLMs, and analyzing memorization in specific contexts, including conversational agents, retrieval-augmented generation, multilingual language models, and diffusion language models.

Title: Measuring and Improving Persuasiveness of Generative Models

Authors: Somesh Singh, Yaman K Singla, Harini SI, Balaji Krishnamurthy
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.02653
Pdf URL: https://arxiv.org/pdf/2410.02653
Copy Paste: [[2410.02653]] Measuring and Improving Persuasiveness of Generative Models(https://arxiv.org/abs/2410.02653)
Keywords: llm, chat
Abstract: LLMs are increasingly being used in workflows involving generating content to be consumed by humans (e.g., marketing) and also in directly interacting with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction, and on the other, they could be misused for spreading misinformation and shaping political opinions. To channel LLMs' impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to measure the persuasion ability of generative models automatically. We investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models' persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California's SB-1047 aim to regulate AI models based on the number of floating point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI's societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at this https URL, to advance our understanding of AI-driven persuasion and its societal implications.

Title: Hate Personified: Investigating the role of LLMs in content moderation

Authors: Sarah Masud, Sahajpreet Singh, Viktor Hangya, Alexander Fraser, Tanmoy Chakraborty
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2410.02657
Pdf URL: https://arxiv.org/pdf/2410.02657
Copy Paste: [[2410.02657]] Hate Personified: Investigating the role of LLMs in content moderation(https://arxiv.org/abs/2410.02657)
Keywords: language model, llm, prompt
Abstract: For subjective tasks such as hate detection, where people perceive hate differently, the Large Language Model's (LLM) ability to represent diverse groups is unclear. By including additional context in prompts, we comprehensively analyze LLM's sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected. Our findings on two LLMs, five languages, and six datasets reveal that mimicking persona-based attributes leads to annotation variability. Meanwhile, incorporating geographical signals leads to better regional alignment. We also find that the LLMs are sensitive to numerical anchors, indicating the ability to leverage community-based flagging efforts and exposure to adversaries. Our work provides preliminary guidelines and highlights the nuances of applying LLMs in culturally sensitive cases.

Title: How to Train Long-Context Language Models (Effectively)

Authors: Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02660
Pdf URL: https://arxiv.org/pdf/2410.02660
Copy Paste: [[2410.02660]] How to Train Long-Context Language Models (Effectively)(https://arxiv.org/abs/2410.02660)
Keywords: language model
Abstract: We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development: instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context tasks, and we evaluate models after SFT with instruction data as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.

Title: Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus

Authors: Craig Messner, Tom Lippincott
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02674
Pdf URL: https://arxiv.org/pdf/2410.02674
Copy Paste: [[2410.02674]] Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus(https://arxiv.org/abs/2410.02674)
Keywords: language model
Abstract: We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags designed to serve as the basis for computational experiments exploring literarily meaningful orthographic variation. We perform an initial broad set of experiments over this dataset using both token (BERT) and character (CANINE)-level contextual language models. We find indications that the "dialect effect" produced by intentional orthographic variation employs multiple linguistic channels, and that these channels can be surfaced to varying degrees given particular language modelling assumptions. Specifically, we find evidence showing that the choice of tokenization scheme meaningfully impacts the type of orthographic information a model is able to surface.

Title: CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs

Authors: Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Vered Shwartz, Yejin Choi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02677
Pdf URL: https://arxiv.org/pdf/2410.02677
Copy Paste: [[2410.02677]] CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs(https://arxiv.org/abs/2410.02677)
Keywords: language model, gpt, llm
Abstract: To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs' cultural knowledge, covering 45 global regions including underrepresented ones such as Bangladesh, Zimbabwe, and Peru. Questions, each verified by five independent annotators, span 17 diverse topics ranging from food preferences to greeting etiquette. We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently. We find that LLMs are sensitive to such differences in setups (e.g., GPT-4o with 27.3% difference). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs, with the best performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to converge to a single answer. Our results also indicate that OpenAI GPT-4o substantially outperforms other proprietary and open source models in questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.

Title: Distilling an End-to-End Voice Assistant Without Instruction Training Data

Authors: William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02678
Pdf URL: https://arxiv.org/pdf/2410.02678
Copy Paste: [[2410.02678]] Distilling an End-to-End Voice Assistant Without Instruction Training Data(https://arxiv.org/abs/2410.02678)
Keywords: language model, llm
Abstract: Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training compute.

Title: DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life

Authors: Yu Ying Chiu, Liwei Jiang, Yejin Choi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02683
Pdf URL: https://arxiv.org/pdf/2410.02683
Copy Paste: [[2410.02683]] DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life(https://arxiv.org/abs/2410.02683)
Keywords: gpt, llm, prompt
Abstract: As we increasingly seek guidance from LLMs for decision-making in daily life, many of these decisions are not clear-cut and depend significantly on the personal values and ethical standards of the users. We present DailyDilemmas, a dataset of 1,360 moral dilemmas encountered in everyday life. Each dilemma includes two possible actions and, for each action, the affected parties and the human values invoked. Based on these dilemmas, we consolidated a set of human values across everyday topics, e.g., interpersonal relationships, workplace, and environmental issues. We evaluated LLMs on these dilemmas to determine what action they will take and the values represented by these actions. Then, we analyzed these values through the lens of five popular theories inspired by sociology, psychology and philosophy. These theories are: World Value Survey, Moral Foundation Theory, Maslow's Hierarchy of Needs, Aristotle's Virtues, and Plutchik Wheel of Emotion. We find that LLMs are most aligned with self-expression over survival values in the World Value Survey and with care over loyalty in Moral Foundation Theory. Interestingly, we find large preference differences across models for some core values such as truthfulness, e.g., the Mixtral-8x7B model tends to neglect it by 9.7% while the GPT-4-turbo model tends to select it by 9.4%. We also study the recent guidance released by OpenAI (ModelSpec) and Anthropic (Constitutional AI) to understand how their released principles reflect their actual value prioritization when facing nuanced moral reasoning in daily-life settings. We find that end users cannot effectively steer such prioritization using system prompts.
摘要：随着我们越来越多地寻求大型语言模型 (LLM) 指导我们做出日常生活中的决策，许多决策并不明确，很大程度上取决于用户的个人价值观和道德标准。我们推出了 DailyDilemmas，这是一套包含日常生活中遇到的 1,360 个道德困境的数据集。每个困境包括两种可能采取的行动，每种行动都涉及受影响的各方和所援引的人类价值观。基于这些困境，我们在日常话题（例如人际关系、工作场所和环境问题）中整合了一套人类价值观。我们针对这些困境对 LLM 进行了评估，以确定它们将采取什么行动以及这些行动所代表的价值观。然后，我们通过五种受社会学、心理学和哲学启发的流行理论来分析这些价值观。这些理论是：世界价值观调查、道德基础理论、马斯洛需求层次理论、亚里士多德美德理论和普拉奇克情绪之轮。我们发现，在世界价值观调查中，LLM 最符合自我表达而非生存价值观；在道德基础理论中，最符合关爱而非忠诚。有趣的是，我们发现模型对某些核心价值观（如诚实）的偏好差异很大，例如，Mixtral-8x7B 模型倾向于忽略它 9.7%，而 GPT-4-turbo 模型倾向于选择它 9.4%。我们还研究了 OpenAI（ModelSpec）和 Anthropic（Constitutional AI）最近发布的指南，以了解它们发布的原则在日常生活中面对细微的道德推理时如何反映其实际价值优先级。我们发现最终用户无法使用系统提示有效地引导这种优先级。

Title: HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router

Authors: Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Ruibin Yuan, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02684
Pdf URL: https://arxiv.org/pdf/2410.02684
Copy Paste: [[2410.02684]] HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router(https://arxiv.org/abs/2410.02684)
Keywords: language model, llm, prompt
Abstract: As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches, which rely heavily on refusal strategies such as training models to completely reject harmful prompts or applying coarse filters, are limited by their binary nature. These methods either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtle harmful content. For example, LLMs may refuse to provide basic, public information about medication due to misuse concerns. Moreover, these refusal-based methods struggle to handle mixed-content scenarios and lack the ability to adapt to context-dependent sensitivities, which can result in over-censorship of benign content. To overcome these challenges, we introduce HiddenGuard, a novel framework for fine-grained, safe generation in LLMs. HiddenGuard incorporates Prism (rePresentation Router for In-Stream Moderation), which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content by leveraging intermediate hidden states. This fine-grained approach allows for more nuanced, context-aware moderation, enabling the model to generate informative responses while selectively redacting or replacing sensitive information, rather than refusing outright. We also contribute a comprehensive dataset with token-level fine-grained annotations of potentially harmful information across diverse contexts. Our experiments demonstrate that HiddenGuard achieves an F1 score of over 90% for detecting and redacting harmful content while preserving the overall utility and informativeness of the model's responses.
摘要：随着大型语言模型 (LLM) 变得越来越强大，确保其安全性和与人类价值观的一致性仍然是一项关键挑战。理想情况下，LLM 应提供信息丰富的响应，同时避免泄露有害或敏感信息。然而，当前的对齐方法严重依赖拒绝策略，例如训练模型完全拒绝有害提示或应用粗略过滤器，这些方法受到二元性质的限制。这些方法要么完全拒绝访问信息，要么在没有足够的细微差别的情况下授予访问权，从而导致过于谨慎的响应或无法检测到细微的有害内容。例如，由于担心滥用，LLM 可能会拒绝提供有关药物的基本公开信息。此外，这些基于拒绝的方法难以处理混合内容场景，并且缺乏适应上下文相关敏感性的能力，这可能导致对良性内容的过度审查。为了克服这些挑战，我们引入了 HiddenGuard，这是一种用于 LLM 中细粒度、安全生成的新框架。 HiddenGuard 结合了 Prism（用于流内审核的 rePresentation Router），它与 LLM 一起运行，通过利用中间隐藏状态实现对有害内容的实时、token 级检测和编辑。这种细粒度方法允许更细致入微、上下文感知的审核，使模型能够生成信息丰富的响应，同时有选择地编辑或替换敏感信息，而不是直接拒绝。我们还提供了一个全面的数据集，其中包含跨不同上下文的潜在有害信息的 token 级细粒度注释。我们的实验表明，HiddenGuard 在检测和编辑有害内容方面取得了超过 90% 的 F1 得分，同时保留了模型响应的整体实用性和信息性。
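
The token-level redaction idea described above can be illustrated with a minimal sketch. This is not the HiddenGuard or Prism implementation: the function name `redact_tokens`, the `[REDACTED]` placeholder, the 0.5 threshold, and the per-token scores are all hypothetical, and in the paper the scores would come from a router over intermediate hidden states rather than being passed in by hand.

```python
from typing import List

def redact_tokens(
    tokens: List[str],
    harm_scores: List[float],
    threshold: float = 0.5,       # hypothetical cut-off, not from the paper
    mask: str = "[REDACTED]",     # hypothetical placeholder string
) -> List[str]:
    """Replace tokens whose harm score reaches the threshold with a mask.

    Consecutive flagged tokens are collapsed into a single mask so the
    surrounding benign text stays readable.
    """
    if len(tokens) != len(harm_scores):
        raise ValueError("tokens and harm_scores must have the same length")
    output: List[str] = []
    previous_flagged = False
    for token, score in zip(tokens, harm_scores):
        flagged = score >= threshold
        if flagged and not previous_flagged:
            output.append(mask)    # start of a flagged span
        elif not flagged:
            output.append(token)   # benign token passes through unchanged
        previous_flagged = flagged
    return output

if __name__ == "__main__":
    print(redact_tokens(["call", "555", "0100", "now"], [0.1, 0.9, 0.8, 0.2]))
```

Compared with a binary refuse-or-answer decision, a per-token decision like this keeps the benign part of a mixed-content response.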

Title: On the Proper Treatment of Tokenization in Psycholinguistics

Authors: Mario Giulianelli, Luca Malagutti, Juan Luis Gastaldi, Brian DuSell, Tim Vieira, Ryan Cotterell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02691
Pdf URL: https://arxiv.org/pdf/2410.02691
Copy Paste: [[2410.02691]] On the Proper Treatment of Tokenization in Psycholinguistics(https://arxiv.org/abs/2410.02691)
Keywords: language model
Abstract: Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to its cognitive cost experienced by readers, as operationalized, for example, by gaze duration on the region. However, the application of modern language models to psycholinguistic studies is complicated by the practice of using tokenization as an intermediate step in training a model. Doing so results in a language model over token strings rather than one over character strings. Vexingly, regions of interest are generally misaligned with these token strings. The paper argues that token-level language models should be (approximately) marginalized into character-level language models before they are used in psycholinguistic studies to compute the surprisal of a region of interest; then, the marginalized character-level language model can be used to compute the surprisal of an arbitrary character substring, which we term a focal area, that the experimenter may wish to use as a predictor. Our proposal of marginalizing a token-level model into a character-level one solves this misalignment issue independently of the tokenization scheme. Empirically, we discover various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
摘要：语言模型在计算心理语言学中被广泛用于测试将语言模型下感兴趣区域（字符子串）的负对数概率（意外度）与读者体验到的认知成本（例如，通过注视该区域的持续时间）联系起来的理论。然而，现代语言模型在心理语言学研究中的应用因使用标记化作为训练模型的中间步骤而变得复杂。这样做会导致语言模型基于标记字符串而不是字符串。令人恼火的是，感兴趣的区域通常与这些标记字符串不一致。本文认为，在将标记级语言模型用于心理语言学研究以计算感兴趣区域的意外度之前，应该（大致）将标记级语言模型边缘化为字符级语言模型；然后，可以使用边缘化的字符级语言模型来计算任意字符子串的意外度，我们将其称为焦点区域，实验者可能希望将其用作预测因子。我们提出的将标记级模型边缘化为字符级模型的提议解决了这一错位问题，与标记化方案无关。从经验上讲，我们发现各种焦点区域的意外性比感兴趣区域本身的意外性更能作为心理测量预测指标。
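
To make the marginalization argument concrete, here is a small sketch under strong simplifying assumptions. The toy distribution `TOKEN_LM` is invented for illustration and enumerates complete token sequences explicitly, which is only feasible for a toy; the paper proposes an approximate marginalization for real models. The point is only that a character-level prefix probability sums over every tokenization, so the surprisal of a character region does not depend on where token boundaries fall.

```python
import math

# Hypothetical toy token-level model: probabilities of complete token sequences.
# Several sequences spell the same characters with different tokenizations.
TOKEN_LM = {
    ("the", " cat"): 0.4,
    ("the", " c", "at"): 0.1,
    ("the", " dog"): 0.3,
    ("th", "e", " dog"): 0.2,
}

def char_prefix_prob(prefix: str) -> float:
    """Probability that the character string starts with `prefix`,
    marginalized over all token sequences."""
    return sum(
        prob
        for tokens, prob in TOKEN_LM.items()
        if "".join(tokens).startswith(prefix)
    )

def surprisal(context: str, region: str) -> float:
    """Surprisal in bits of a character region given the preceding characters."""
    marginal = char_prefix_prob(context)
    if marginal == 0.0:
        raise ValueError("context has zero probability under the toy model")
    joint = char_prefix_prob(context + region)
    if joint == 0.0:
        return math.inf
    return -math.log2(joint / marginal)

if __name__ == "__main__":
    # " cat" is reachable through two tokenizations; both contribute.
    print(surprisal("the", " cat"))
```

Because `surprisal` accepts any character substring, the same function can score a region of interest or a different focal area such as its first few characters.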

Title: HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly

Authors: Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, Danqi Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02694
Pdf URL: https://arxiv.org/pdf/2410.02694
Copy Paste: [[2410.02694]] HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly(https://arxiv.org/abs/2410.02694)
Keywords: language model, prompt
Abstract: There have been many benchmarks for evaluating long-context language models (LCLMs), but developers often rely on synthetic tasks like needle-in-a-haystack (NIAH) or arbitrary subsets of tasks. It remains unclear whether they translate to the diverse downstream applications of LCLMs, and the inconsistency further complicates model comparison. We investigate the underlying reasons behind current practices and find that existing benchmarks often provide noisy signals due to low coverage of applications, insufficient lengths, unreliable metrics, and incompatibility with base models. In this work, we present HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark encompassing seven diverse, application-centric categories. We also address many issues in previous benchmarks by adding controllable lengths up to 128k tokens, model-based evaluation for reliable metrics, and few-shot prompting for robustly evaluating base models. Consequently, we demonstrate that HELMET offers more reliable and consistent rankings of frontier LCLMs. Through a comprehensive study of 51 LCLMs, we find that (1) synthetic tasks like NIAH are not good predictors of downstream performance; (2) the diverse categories in HELMET exhibit distinct trends and low correlation with each other; and (3) while most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning or following complex instructions -- the gap widens with increased lengths. Finally, we recommend using our RAG tasks for fast model development, as they are easy to run and more predictive of other downstream performance; ultimately, we advocate for a holistic evaluation across diverse tasks.
摘要：评估长上下文语言模型 (LCLM) 的基准有很多，但开发人员通常依赖于大海捞针 (NIAH) 之类的合成任务或任意任务子集。目前尚不清楚它们是否适用于 LCLM 的各种下游应用，而且这种不一致性进一步使模型比较复杂化。我们调查了当前实践背后的根本原因，发现现有基准由于应用覆盖率低、长度不足、指标不可靠以及与基础模型不兼容，通常会提供嘈杂信号。在这项工作中，我们提出了 HELMET（如何有效和彻底地评估长上下文模型），这是一个全面的基准，涵盖了七个不同的以应用为中心的类别。我们还通过添加可控长度高达 128k 个标记、基于模型的可靠指标评估和少量提示来稳健地评估基础模型，解决了以前基准中的许多问题。因此，我们证明 HELMET 提供了更可靠和一致的前沿 LCLM 排名。通过对 51 个 LCLM 进行全面研究，我们发现 (1) 像 NIAH 这样的合成任务不能很好地预测下游性能；(2) HELMET 中的不同类别表现出不同的趋势，彼此之间的相关性较低；(3) 虽然大多数 LCLM 都获得了完美的 NIAH 分数，但当任务需要全上下文推理或遵循复杂指令时，开源模型明显落后于封闭模型——随着长度的增加，差距会扩大。最后，我们建议使用我们的 RAG 任务进行快速模型开发，因为它们易于运行并且更能预测其他下游性能；最终，我们提倡对各种任务进行整体评估。

Title: Selective Attention Improves Transformer

Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02703
Pdf URL: https://arxiv.org/pdf/2410.02703
Copy Paste: [[2410.02703]] Selective Attention Improves Transformer(https://arxiv.org/abs/2410.02703)
Keywords: language model
Abstract: Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance in a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention than those without it, at the same validation perplexity.
摘要：注意力机制上下文中不需要的元素会降低性能。我们引入了选择性注意力机制，这是对标准注意力机制的简单无参数更改，可减少对不需要元素的注意力。选择性注意力机制可提高各种模型大小和上下文长度的语言建模性能。例如，在 C4 上使用语言建模目标训练的一系列具有选择性注意力机制的 Transformer 的性能与标准 Transformer 相当，但其注意力模块中的头部和参数大约多 2 倍。选择性注意力机制还可以减小注意力机制上下文缓冲区的大小，从而显著减少推理过程中的内存和计算需求。例如，在 C4 上训练的具有 100M 个参数的 Transformer 的上下文大小分别为 512、1,024 和 2,048 时，与没有选择性注意力机制的 Transformer 相比，配备选择性注意力机制时，其注意力模块所需的内存分别减少了 16 倍、25 倍和 47 倍，验证困惑度也相同。

Title: LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Authors: Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02707
Pdf URL: https://arxiv.org/pdf/2410.02707
Copy Paste: [[2410.02707]] LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations(https://arxiv.org/abs/2410.02707)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.
摘要：大型语言模型 (LLM) 经常会产生错误，包括事实错误、偏见和推理失败，统称为“幻觉”。最近的研究表明，LLM 的内部状态编码了有关其输出真实性的信息，并且可以利用这些信息来检测错误。在这项工作中，我们表明 LLM 的内部表示编码了比以前认识到的更多关于真实性的信息。我们首先发现真实性信息集中在特定的标记中，利用这一特性可以显著提高错误检测性能。然而，我们表明这种错误检测器无法在数据集中推广，这意味着——与之前的说法相反——真实性编码不是通用的，而是多方面的。接下来，我们表明内部表示也可用于预测模型可能犯的错误类型，从而促进制定量身定制的缓解策略。最后，我们揭示了 LLM 的内部编码和外部行为之间的差异：它们可能编码正确的答案，但始终生成错误的答案。总的来说，这些见解从模型的内部角度加深了我们对 LLM 错误的理解，可以指导未来关于加强错误分析和缓解的研究。

Title: UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation

Authors: Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02719
Pdf URL: https://arxiv.org/pdf/2410.02719
Copy Paste: [[2410.02719]] UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation(https://arxiv.org/abs/2410.02719)
Keywords: language model, retrieval-augmented generation
Abstract: We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented Generation (RAG) that utilizes Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate similarity between text chunks. This span uncertainty enhances model calibration, improving robustness and mitigating semantic inconsistencies introduced by random chunking. Leveraging this insight, we propose an efficient unsupervised learning technique to train the retrieval model, alongside an effective data sampling and scaling strategy. UncertaintyRAG outperforms baselines by 2.03% on LLaMA-2-7B, achieving state-of-the-art results while using only 4% of the training data compared to other advanced open-source retrieval models under distribution shift settings. Our method demonstrates strong calibration through span uncertainty, leading to improved generalization and robustness in long-context RAG tasks. Additionally, UncertaintyRAG provides a lightweight retrieval model that can be integrated into any large language model with varying context window lengths, without the need for fine-tuning, showcasing the flexibility of our approach.
摘要：我们提出了 UncertaintyRAG，这是一种用于长上下文检索增强生成 (RAG) 的新方法，它利用基于信噪比 (SNR) 的跨度不确定性来估计文本块之间的相似性。这种跨度不确定性增强了模型校准，提高了稳健性并减轻了随机分块引入的语义不一致。利用这一见解，我们提出了一种有效的无监督学习技术来训练检索模型，同时提出了一种有效的数据采样和缩放策略。UncertaintyRAG 在 LLaMA-2-7B 上的表现比基线高出 2.03%，与其他先进的开源检索模型相比，在分布偏移设置下，仅使用 4% 的训练数据就实现了最先进的结果。我们的方法通过跨度不确定性展示了强大的校准能力，从而提高了长上下文 RAG 任务的泛化能力和稳健性。此外，UncertaintyRAG 提供了一个轻量级的检索模型，可以集成到任何具有不同上下文窗口长度的大型语言模型中，而无需进行微调，展示了我们方法的灵活性。

Title: Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs, and Tensor Factorization

Authors: Ryan C. Barron, Ves Grantcharov, Selma Wanna, Maksim E. Eren, Manish Bhattarai, Nicholas Solovyev, George Tompkins, Charles Nicholas, Kim Ø. Rasmussen, Cynthia Matuszek, Boian S. Alexandrov
Subjects: cs.CL, cs.AI, cs.IR, cs.SE
Abstract URL: https://arxiv.org/abs/2410.02721
Pdf URL: https://arxiv.org/pdf/2410.02721
Copy Paste: [[2410.02721]] Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs, and Tensor Factorization(https://arxiv.org/abs/2410.02721)
Keywords: language model, llm, hallucination, prompt, chat, retrieval-augmented generation, chain-of-thought, agent
Abstract: Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and lack of knowledge attributions. Additionally, fine-tuning LLMs' intrinsic knowledge to highly specific domains is an expensive and time-consuming process. The retrieval-augmented generation (RAG) process has recently emerged as a method capable of optimizing LLM responses by referencing them to a predetermined ontology. It was shown that using a Knowledge Graph (KG) ontology for RAG improves the QA accuracy by taking into account relevant sub-graphs that preserve the information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework that integrates RAG with a KG and a vector store (VS) that stores factual domain-specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific (i) KG (containing structured information) and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to be generalizable to adapt to any specific or specialized domain. In this paper, we demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.
摘要：大型语言模型 (LLM) 是在大型语料库上预先训练的，在许多通用自然语言处理 (NLP) 任务（例如问答 (QA)）中表现出色。尽管 LLM 具有高级语言能力，但在处理特定领域和知识密集型任务时，它们会遭受幻觉、知识截断和知识归属不足的困扰。此外，将 LLM 的内在知识微调到高度特定的领域是一个昂贵且耗时的过程。检索增强生成 (RAG) 过程最近出现了一种能够优化 LLM 响应的方法，方法是将它们引用到预定的本体中。结果表明，使用知识图谱 (KG) 本体进行 RAG 可以提高 QA 准确性，因为它考虑了以结构化方式保存信息的相关子图。在本文中，我们介绍了 SMART-SLIC，这是一个高度特定领域的 LLM 框架，它将 RAG 与 KG 和存储事实领域特定信息的向量存储 (VS) 集成在一起。重要的是，为了避免 KG 中出现幻觉，我们在构建这些高度特定于领域的 KG 和 VS 时没有使用 LLM，而是通过 NLP、数据挖掘和具有自动模型选择的非负张量分解。将我们的 RAG 与特定于领域的：(i) KG（包含结构化信息）和 (ii) VS（包含非结构化信息）配对，可以开发特定于领域的聊天机器人，这些聊天机器人可以归因于信息来源、减轻幻觉、减少微调的需要，并在高度特定于领域的问答任务中表现出色。我们将 SMART-SLIC 与思路链提示代理配对。该框架旨在可推广以适应任何特定或专门领域。在本文中，我们在有关恶意软件分析和异常检测的科学出版物语料库上展示了我们框架的问答能力。

Title: Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation

Authors: Rohin Manvi, Anikait Singh, Stefano Ermon
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02725
Pdf URL: https://arxiv.org/pdf/2410.02725
Copy Paste: [[2410.02725]] Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation(https://arxiv.org/abs/2410.02725)
Keywords: language model, gpt, llm, prompt
Abstract: Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B's win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
摘要：推理时间计算是增强大型语言模型 (LLM) 性能的强大范例，其中 Best-of-N 采样是一种广泛使用的技术。但是，这种方法的计算成本很高，需要 (1) 外部奖励模型和 (2) 生成多个样本。在这项工作中，我们引入了一种新的生成式自我评估方案，旨在自适应地减少生成的样本数量，同时保持甚至提高性能。我们使用生成式奖励模型公式，允许 LLM 在生成中期预测重新启动生成将产生更好响应的概率。这些预测是在没有外部奖励模型的情况下获得的，可用于决定是否生成更多样本、尽早修剪没有希望的样本或挑选最佳样本。此功能非常便宜，因为它涉及生成单个预定义标记。使用由真实的未过滤的 LMSYS 用户提示构建的数据集进行训练，Llama 3.1 8B 在 AlpacaEval 上对抗 GPT-4 的胜率在 16 个样本的情况下从 21% 提高到 34%，在 GSM8K 上的数学成绩从 84% 提高到 91%。通过仅在 LLM 确定这样做有益时进行采样并自适应地调整温度退火，我们证明，仅使用平均 1.2 个样本即可实现使用 16 个样本的 74% 的改进。我们进一步证明，可以在生成早期修剪 50-75% 的样本，同时将性能下降降至最低。总体而言，我们的方法可以在 LLM 推理期间提高计算利用率的效率和可扩展性。
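
The adaptive sampling loop described above can be sketched as follows. This is an illustrative simplification, not the authors' method: `generate` is a hypothetical callable standing in for an LLM that returns a response together with its own estimate that restarting would yield a better response, the 0.2 stopping threshold is invented, and picking the sample with the lowest such estimate is an assumption about how the best sample is chosen. Early pruning and temperature annealing are omitted.

```python
import math
from typing import Callable, Optional, Tuple

def adaptive_best_of_n(
    generate: Callable[[int], Tuple[str, float]],
    max_samples: int = 16,
    stop_threshold: float = 0.2,   # hypothetical value, not from the paper
) -> Tuple[Optional[str], int]:
    """Sample until the model predicts that resampling is unlikely to help.

    generate(i) returns (response, p_better), where p_better is the model's
    self-estimated probability that a restart would produce a better response.
    Returns the chosen response and the number of samples actually drawn.
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    best_response: Optional[str] = None
    best_p = math.inf
    for i in range(max_samples):
        response, p_better = generate(i)
        if p_better < best_p:              # lower p_better means a stronger sample
            best_response, best_p = response, p_better
        if best_p <= stop_threshold:       # further sampling judged not worthwhile
            return best_response, i + 1
    return best_response, max_samples

if __name__ == "__main__":
    scripted = [("draft", 0.9), ("good", 0.1), ("unused", 0.0)]
    print(adaptive_best_of_n(lambda i: scripted[i], max_samples=3))
```

With a fixed Best-of-N budget all 16 samples would be drawn; here the loop stops as soon as one sample is judged hard to beat, which is how the average sample count can fall well below the budget.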

Title: Unified Multi-Modal Interleaved Document Representation for Information Retrieval

Authors: Jaewoo Lee, Joonho Ko, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2410.02729
Pdf URL: https://arxiv.org/pdf/2410.02729
Copy Paste: [[2410.02729]] Unified Multi-Modal Interleaved Document Representation for Information Retrieval(https://arxiv.org/abs/2410.02729)
Keywords: language model
Abstract: Information Retrieval (IR) methods aim to identify relevant documents in response to a given query, which have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within the documents, which overlooks the fact that documents can contain multiple modalities, including texts, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, preventing them from capturing the overall document context and interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse information retrieval scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information interleaved within the documents in a unified way.
摘要：信息检索 (IR) 方法旨在识别与给定查询相关的文档，由于其在各种自然语言任务中的成功应用，该方法引起了广泛关注。然而，现有方法通常仅考虑文档中的文本信息，而忽略了文档可以包含多种模态的事实，包括文本、图像和表格。此外，它们通常将每个长文档分割成多个离散段落进行嵌入，从而无法捕获整体文档上下文和段落之间的交互。我们认为这两个限制导致文档表示法不理想。在这项工作中，为了解决这些问题，我们旨在通过整体嵌入与不同模态交织的文档来生成更全面、更细致的文档表示。具体来说，我们通过利用最近的视觉语言模型的功能来实现这一点，这些模型能够将文本、图像和表格处理并集成到统一的格式和表示中。此外，为了减轻将文档分割成段落所造成的信息损失，我们不再单独表示和检索段落，而是将分割段落的表示合并为一个文档表示，同时我们还引入了重新排序策略，以便在必要时解耦并识别文档中的相关段落。然后，通过对考虑文本和多模态查询的各种信息检索场景进行大量实验，我们表明，由于以统一的方式考虑了文档中交错的多模态信息，我们的方法大大优于相关基线。

Title: Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Authors: Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, Xiangliang Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02736
Pdf URL: https://arxiv.org/pdf/2410.02736
Copy Paste: [[2410.02736]] Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge(https://arxiv.org/abs/2410.02736)
Keywords: language model, llm
Abstract: LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and served as supervised rewards in model training. However, despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. Therefore, we identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge by using automated and principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models have achieved commendable overall performance, significant biases persist in certain specific tasks. Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. Moreover, we also discuss the explicit and implicit influence of these biases and give some suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
摘要：LLM-as-a-Judge 已被广泛用作各种基准的评估方法，并作为模型训练中的监督奖励。然而，尽管它们在许多领域都表现出色，但潜在问题尚未得到充分探索，从而削弱了它们的可靠性和实用范围。因此，我们确定了 12 个关键的潜在偏见，并提出了一个新的自动偏见量化框架 - CALM - 该框架通过自动化和原则指导的修改系统地量化和分析 LLM-as-a-Judge 中的每种偏见。我们的实验涵盖了多个流行的语言模型，结果表明，虽然高级模型取得了值得称赞的整体性能，但在某些特定任务中仍然存在显著的偏见。实证结果表明 LLM-as-a-Judge 的可靠性仍有改进空间。此外，我们还讨论了这些偏见的显性和隐性影响，并为 LLM-as-a-Judge 的可靠应用提出了一些建议。我们的工作强调了利益相关者解决这些问题的必要性，并提醒用户在 LLM-as-a-Judge 应用中要谨慎。
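
One way to picture "quantifying a bias through an automated modification" is a consistency check: apply a modification that should not change the verdict and count how often the judge still prefers the same answer. The sketch below uses answer-order swapping as the modification. It is not the CALM framework; the `Judge` signature, the `position_consistency` function, and the choice of position as the illustrated bias are assumptions made for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: (question, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]

def position_consistency(judge: Judge, items: List[Tuple[str, str, str]]) -> float:
    """Fraction of items where the judge prefers the same underlying answer
    after the two answers swap positions. 1.0 means no measured position bias."""
    if not items:
        raise ValueError("items must not be empty")
    consistent = 0
    for question, answer_1, answer_2 in items:
        original = judge(question, answer_1, answer_2)
        swapped = judge(question, answer_2, answer_1)
        # The same answer wins only if the chosen label flips with the swap.
        if {original, swapped} == {"A", "B"}:
            consistent += 1
    return consistent / len(items)

if __name__ == "__main__":
    items = [("Which is clearer?", "short", "a much longer answer")]
    always_first: Judge = lambda q, a, b: "A"
    print(position_consistency(always_first, items))  # judge ignores content
```

A judge that always picks the first slot scores 0.0 here, while a judge whose choice depends only on the answers themselves scores 1.0.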

Title: Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization

Authors: Lei Xu, Mohammed Asad Karim, Saket Dingliwal, Aparna Elangovan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02741
Pdf URL: https://arxiv.org/pdf/2410.02741
Copy Paste: [[2410.02741]] Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization(https://arxiv.org/abs/2410.02741)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
摘要：大型语言模型 (LLM) 可以使用提示技术跨领域生成流畅的摘要，从而减少了训练摘要应用模型的需要。但是，设计有效的提示来指导 LLM 生成具有适当细节级别和写作风格的摘要仍然是一个挑战。在本文中，我们探讨了如何使用从源文档中提取的显着信息来增强摘要提示。我们表明，在提示中添加关键短语可以提高 ROUGE F1 和召回率，使生成的摘要与参考文献更相似且更完整。关键短语的数量可以控制精确度-召回率的权衡。此外，我们的分析表明，结合短语级显着信息优于单词或句子级。然而，对幻觉的影响在 LLM 中并非普遍积极。为了进行这种分析，我们引入了关键短语信号提取器 (SigExt)，这是一个轻量级模型，可以进行微调以提取显着的关键短语。通过使用 SigExt，我们在数据集和开放权重和专有 LLM 中实现了一致的 ROUGE 改进，而无需任何 LLM 定制。我们的研究结果为利用突出信息构建基于提示的摘要系统提供了见解。
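
The keyphrase-augmented prompting described above can be sketched in a few lines. The prompt wording, the function name `build_summary_prompt`, and the `max_keyphrases` parameter are hypothetical; in the paper the keyphrases come from the SigExt extractor, whereas here they are simply passed in as an already ranked list.

```python
from typing import List

def build_summary_prompt(
    document: str,
    keyphrases: List[str],
    max_keyphrases: int = 5,   # hypothetical knob for the precision-recall trade-off
) -> str:
    """Build a summarization prompt that lists the top-ranked keyphrases."""
    selected = keyphrases[: max(0, max_keyphrases)]
    lines = ["Summarize the following document."]
    if selected:
        lines.append(
            "Make sure the summary covers these keyphrases: "
            + "; ".join(selected)
            + "."
        )
    lines.append("Document:")
    lines.append(document)
    return "\n".join(lines)

if __name__ == "__main__":
    print(
        build_summary_prompt(
            "The council postponed the decision until next month.",
            ["council", "postponed decision", "next month"],
            max_keyphrases=2,
        )
    )
```

Raising `max_keyphrases` asks the model to cover more of the source, which corresponds to the recall side of the trade-off mentioned in the abstract; lowering it keeps the summary more focused.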

Title: Grounding Large Language Models In Embodied Environment With Imperfect World Models

Authors: Haolan Liu, Jishen Zhao
Subjects: cs.CL, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2410.02742
Pdf URL: https://arxiv.org/pdf/2410.02742
Copy Paste: [[2410.02742]] Grounding Large Language Models In Embodied Environment With Imperfect World Models(https://arxiv.org/abs/2410.02742)
Keywords: language model, gpt, llm, retrieval-augmented generation, agent
Abstract: Despite widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3, with performance boosts of 2.04x, 1.54x, and 1.82x across three different benchmarks, respectively. The resulting performance is able to compete with or surpass that of larger counterparts such as GPT-4.
摘要：尽管大型语言模型 (LLM) 在各种应用中取得了广泛的成功，但由于缺乏对现实世界物理细微差别的直接经验，大型语言模型在处理基本物理推理或执行机器人任务时往往会失败。为了解决这些问题，我们提出了一种具有不完美世界模型的基础大型语言模型 (GLIMO)，它利用代理世界模型（例如模拟器）来收集和合成训练数据。GLIMO 结合了基于 LLM 代理的数据生成器，可自动创建高质量和多样化的指令数据集。该生成器包括一个用于时间一致经验采样的迭代自精炼模块、一组多样化的问答指令种子和一个用于反思先前经验的检索增强生成模块。综合实验表明，我们的方法提高了强大的开源 LLM（如 LLaMA-3）的性能，在三个不同的基准测试中分别将性能提高了 2.04 倍、1.54 倍和 1.82 倍。其性能能够与 GPT-4 等更大的模型相媲美，甚至超越它们。

Title: MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Authors: Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02743
Pdf URL: https://arxiv.org/pdf/2410.02743
Copy Paste: [[2410.02743]] MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions(https://arxiv.org/abs/2410.02743)
Keywords: language model, llm
Abstract: Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at this https URL.
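Illustration (not from the paper): the MA-RLHF abstract argues that acting on macro actions shortens the distance between a decision and its delayed reward. The sketch below makes that concrete under loudly stated assumptions: macro actions are taken to be fixed-length token n-grams (the abstract also allows higher-level constructs), a single reward arrives at the end of the sequence, and a discount factor `gamma` stands in for how credit fades with distance. The helper names `group_into_macro_actions` and `discounted_credit` are hypothetical and are not the authors' API.

```python
from typing import List


def group_into_macro_actions(tokens: List[str], n: int) -> List[List[str]]:
    """Split tokens into consecutive groups of n (the last group may be shorter)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


def discounted_credit(num_steps: int, reward: float, gamma: float) -> List[float]:
    """Discounted terminal reward seen from each decision step.

    Assumes one reward delivered after the final step, so step t is
    (num_steps - 1 - t) steps away from it.
    """
    return [reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


tokens = "the quick brown fox jumps over the lazy dog".split()
macro = group_into_macro_actions(tokens, n=3)

# Token level: 9 decision steps. Macro level: 3 decision steps.
token_level = discounted_credit(len(tokens), reward=1.0, gamma=0.9)
macro_level = discounted_credit(len(macro), reward=1.0, gamma=0.9)

print(len(tokens), len(macro))
print(round(token_level[0], 4), round(macro_level[0], 4))
```

With nine tokens grouped in threes, the first decision sits two steps from the reward instead of eight, so more of the reward signal reaches it. This only illustrates the temporal-distance argument; it says nothing about how the paper actually forms macro actions or computes policy gradients.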

Title: Neutral residues: revisiting adapters for model extension

Authors: Franck Signe Talla, Herve Jegou, Edouard Grave
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02744
Pdf URL: https://arxiv.org/pdf/2410.02744
Copy Paste: [[2410.02744]] Neutral residues: revisiting adapters for model extension(https://arxiv.org/abs/2410.02744)
Keywords: language model
Abstract: We address the problem of extending a pretrained large language model to a new domain that was not seen at training time, like adding a language for which the original model has seen little or no training data. Popular solutions like fine-tuning or low-rank adaptation are successful at domain adaptation, but formally they do not add any extra capacity, and they degrade performance in the original domain. Our paper analyzes this extension problem from three angles: data, architecture and training procedure, which are advantageously considered jointly. In particular, we improve adapters and make it possible to learn an entire new language while ensuring that the output of the neural network is almost unchanged in the original domain. For this purpose, we modify the new residual blocks in a way that leads each new residual block to output near-zeros in the original domain. This solution of neutral residues, which borrows architectural components from mixture of experts, is effective: with only 20% extra learnable weights compared to an original model trained on English, we get results that are significantly better than competing approaches (fine-tuning, low-rank or vanilla adapters) in terms of the trade-off between learning a new language and not forgetting English.

Title: CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation

Authors: Han He, Qianchu Liu, Lei Xu, Chaitanya Shivade, Yi Zhang, Sundararajan Srinivasan, Katrin Kirchhoff
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02748
Pdf URL: https://arxiv.org/pdf/2410.02748
Copy Paste: [[2410.02748]] CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation(https://arxiv.org/abs/2410.02748)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (CriSPO), a lightweight model that can be finetuned to extract salient keyphrases. By using CriSPO, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.

Title: SIEVE: General Purpose Data Filtering System Matching GPT-4o Accuracy at 1% the Cost

Authors: Jifan Zhang, Robert Nowak
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02755
Pdf URL: https://arxiv.org/pdf/2410.02755
Copy Paste: [[2410.02755]] SIEVE: General Purpose Data Filtering System Matching GPT-4o Accuracy at 1% the Cost(https://arxiv.org/abs/2410.02755)
Keywords: language model, gpt, llm
Abstract: Creating specialized large language models requires vast amounts of clean, special purpose data for training and fine-tuning. With only a handful of existing large-scale, domain-specific datasets, creation of new datasets is required in most applications. This requires the development of new application-specific filtering of web-scale data. Filtering with a high-performance, general-purpose LLM such as GPT-4o can be highly effective, but this is extremely expensive at web-scale. This paper proposes SIEVE, a lightweight alternative that matches GPT-4o accuracy at a fraction of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight T5 models, using active learning to fine-tune T5 in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks targeting high quality and domain-specific content. Our results demonstrate the effectiveness and efficiency of our method in curating large, high-quality datasets for language model training at a substantially lower cost (1%) than existing techniques. To further validate SIEVE, experiments show that SIEVE and GPT-4o achieve similar accuracy, with human evaluators preferring SIEVE's filtering results to those of GPT-4o.

Title: Erasing Conceptual Knowledge from Language Models

Authors: Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.02760
Pdf URL: https://arxiv.org/pdf/2410.02760
Copy Paste: [[2410.02760]] Erasing Conceptual Knowledge from Language Models(https://arxiv.org/abs/2410.02760)
Keywords: language model, prompt
Abstract: Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at this https URL.