2024-10-18

Title: A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions

Authors: Shailja Gupta, Rajesh Ranjan, Surya Narayan Singh
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2410.12837
Pdf URL: https://arxiv.org/pdf/2410.12837
Copy Paste: [[2410.12837]] A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions(https://arxiv.org/abs/2410.12837)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This paper presents a comprehensive study of Retrieval-Augmented Generation (RAG), tracing its evolution from foundational concepts to the current state of the art. RAG combines retrieval mechanisms with generative language models to enhance the accuracy of outputs, addressing key limitations of LLMs. The study explores the basic architecture of RAG, focusing on how retrieval and generation are integrated to handle knowledge-intensive tasks. A detailed review of the significant technological advancements in RAG is provided, including key innovations in retrieval-augmented language models and applications across various domains such as question-answering, summarization, and knowledge-based tasks. Recent research breakthroughs are discussed, highlighting novel methods for improving retrieval efficiency. Furthermore, the paper examines ongoing challenges such as scalability, bias, and ethical concerns in deployment. Future research directions are proposed, focusing on improving the robustness of RAG models, expanding the scope of application of RAG models, and addressing societal implications. This survey aims to serve as a foundational resource for researchers and practitioners in understanding the potential of RAG and its trajectory in natural language processing.
摘要：本文对检索增强生成 (RAG) 进行了全面研究，追溯了其从基础概念到当前技术水平的演变。RAG 将检索机制与生成语言模型相结合，以提高输出的准确性，解决了 LLM 的主要局限性。该研究探索了 RAG 的基本架构，重点研究了如何整合检索和生成来处理知识密集型任务。本文详细回顾了 RAG 的重大技术进步，包括检索增强语言模型的关键创新以及在问答、总结和基于知识的任务等各个领域的应用。本文讨论了最近的研究突破，重点介绍了提高检索效率的新方法。此外，本文还研究了部署中的可扩展性、偏见和道德问题等持续存在的挑战。本文提出了未来的研究方向，重点是提高 RAG 模型的稳健性、扩大 RAG 模型的应用范围以及解决社会影响。这项调查旨在为研究人员和从业人员提供基础资源，以了解 RAG 的潜力及其在自然语言处理中的发展轨迹。
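
The retrieve-then-generate integration described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not taken from the survey: the bag-of-words cosine scorer stands in for a real retriever, and `generate` is a hypothetical callable wrapping whatever language model is used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real tokenizer)."""
    return text.lower().split()


def score(query, doc):
    """Cosine similarity between bag-of-words count vectors."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=2):
    """Return the k passages most similar to the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]


def build_prompt(query, passages):
    """Place retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def rag_answer(query, corpus, generate, k=2):
    """Retrieval step followed by generation; `generate` is a hypothetical LLM call."""
    return generate(build_prompt(query, retrieve(query, corpus, k)))
```

In practice the scorer would typically be replaced by a dense or sparse retriever over an index; the control flow stays the same.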

Title: Capturing Bias Diversity in LLMs

Authors: Purva Prasad Gosavi, Vaishnavi Murlidhar Kulkarni, Alan F. Smeaton
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12839
Pdf URL: https://arxiv.org/pdf/2410.12839
Copy Paste: [[2410.12839]] Capturing Bias Diversity in LLMs(https://arxiv.org/abs/2410.12839)
Keywords: language model, gpt, llm
Abstract: This paper presents research on enhancing Large Language Models (LLMs) by adding diversity to their generated outputs. Our study introduces a configuration of multiple LLMs that demonstrates the diversity achievable with a single LLM. By developing multiple customised instances of a GPT model, each reflecting biases in specific demographic characteristics including gender, age, and race, we propose, develop and evaluate a framework for a more nuanced and representative AI dialogue which we call BiasGPT. The customised GPT models will ultimately collaborate, merging their diverse perspectives on a topic into an integrated response that captures a broad spectrum of human experiences and viewpoints. In this paper, through experiments, we demonstrate the capabilities of a GPT model to embed different biases which, when combined, can open the possibilities of more inclusive AI technologies.
摘要：本文介绍了通过增加其生成输出的多样性来增强大型语言模型 (LLM) 的研究。我们的研究引入了多个 LLM 的配置，展示了单个 LLM 能够实现的多样性。通过开发 GPT 模型的多个定制实例，每个实例都反映了特定人口统计特征（包括性别、年龄和种族）的偏见，我们提出、开发和评估了一个更细致入微、更具代表性的 AI 对话框架，我们称之为 BiasGPT。定制的 GPT 模型最终将进行协作，将它们对某个主题的不同观点融合成一个综合的响应，以捕捉广泛的人类经验和观点。在本文中，通过实验，我们展示了 GPT 模型嵌入不同偏见的能力，这些偏见结合在一起，可以开启更具包容性的 AI 技术的可能性。

Title: Answering Questions in Stages: Prompt Chaining for Contract QA

Authors: Adam Roegiest, Radha Chitta
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12840
Pdf URL: https://arxiv.org/pdf/2410.12840
Copy Paste: [[2410.12840]] Answering Questions in Stages: Prompt Chaining for Contract QA(https://arxiv.org/abs/2410.12840)
Keywords: language model, prompt
Abstract: Finding answers to legal questions about clauses in contracts is an important form of analysis in many legal workflows (e.g., understanding market trends, due diligence, risk mitigation) but more important is being able to do this at scale. Prior work showed that it is possible to use large language models with simple zero-shot prompts to generate structured answers to questions, which can later be incorporated into legal workflows. Such prompts, while effective on simple and straightforward clauses, fail to perform when the clauses are long and contain information not relevant to the question. In this paper, we propose two-stage prompt chaining to produce structured answers to multiple-choice and multiple-select questions and show that they are more effective than simple prompts on more nuanced legal text. We analyze situations where this technique works well and areas where further refinement is needed, especially when the underlying linguistic variations are more than can be captured by simply specifying possible answers. Finally, we discuss future research that seeks to refine this work by improving stage one results by making them more question-specific.
摘要：寻找有关合同条款的法律问题的答案是许多法律工作流程（例如，了解市场趋势、尽职调查、风险缓解）中的重要分析形式，但更重要的是能够大规模地做到这一点。先前的研究表明，可以使用大型语言模型和简单的零样本提示来生成问题的结构化答案，这些答案稍后可以纳入法律工作流程。虽然这种提示对简单明了的条款很有效，但当条款很长且包含与问题无关的信息时，它们就会失效。在本文中，我们提出了两阶段提示链，以产生多项选择题和多项选择题的结构化答案，并表明它们比简单提示对更细致入微的法律文本更有效。我们分析了这种技术效果良好的情况以及需要进一步改进的领域，尤其是当底层语言变化超出了仅通过指定可能的答案所能捕捉到的范围时。最后，我们讨论了未来的研究，旨在通过改进第一阶段的结果，使其更针对问题，从而改进这项工作。
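
A two-stage prompt chain of the kind the abstract describes might look like the sketch below. The abstract does not spell out what each stage does, so the split used here (stage one condenses the clause with respect to the question, stage two maps that summary onto the allowed options) is an assumption, as are the prompt wording and the hypothetical `llm` callable.

```python
def two_stage_answer(clause, question, options, llm):
    """Chain two prompts: condense the clause, then pick from fixed options.

    `llm` is a hypothetical callable taking a prompt string and returning text.
    """
    # Stage 1 (assumed): strip away clause text irrelevant to the question.
    stage1_prompt = (
        "Summarize only the parts of the clause relevant to the question.\n"
        f"Clause: {clause}\nQuestion: {question}"
    )
    summary = llm(stage1_prompt)

    # Stage 2 (assumed): answer from the summary, restricted to the options.
    stage2_prompt = (
        f"Summary: {summary}\nQuestion: {question}\n"
        f"Options: {', '.join(options)}\n"
        "Reply with every option that applies, separated by commas."
    )
    raw = llm(stage2_prompt)

    # Keep only replies that match a known option, giving a structured answer.
    picked = {part.strip().lower() for part in raw.split(",")}
    return [opt for opt in options if opt.lower() in picked]
```

Returning a list covers both question types: a multiple-choice answer is a list of length one, a multiple-select answer may hold several options.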

Title: UniAutoML: A Human-Centered Framework for Unified Discriminative and Generative AutoML with Large Language Models

Authors: Jiayi Guo, Liyun Zhang, Yiqin Shen
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12841
Pdf URL: https://arxiv.org/pdf/2410.12841
Copy Paste: [[2410.12841]] UniAutoML: A Human-Centered Framework for Unified Discriminative and Generative AutoML with Large Language Models(https://arxiv.org/abs/2410.12841)
Keywords: language model, llm
Abstract: Automated Machine Learning (AutoML) has simplified complex ML processes such as data pre-processing, model selection, and hyper-parameter searching. However, traditional AutoML frameworks focus solely on discriminative tasks, often falling short in tackling AutoML for generative models. Additionally, these frameworks lack interpretability and user engagement during the training process, primarily due to the absence of human-centered design. This leads to a lack of transparency in final decision-making and limited user control, potentially reducing trust in and adoption of AutoML methods. To address these limitations, we introduce UniAutoML, a human-centered AutoML framework that leverages Large Language Models (LLMs) to unify AutoML for both discriminative (e.g., Transformers and CNNs for classification or regression tasks) and generative tasks (e.g., fine-tuning diffusion models or LLMs). The human-centered design of UniAutoML innovatively features a conversational user interface (CUI) that facilitates natural language interactions, providing users with real-time guidance, feedback, and progress updates for better interpretability. This design enhances transparency and user control throughout the AutoML training process, allowing users to seamlessly break down or modify the model being trained. To mitigate potential risks associated with LLM-generated content, UniAutoML incorporates a safety guardline that filters inputs and censors outputs. We evaluated UniAutoML's performance and usability through experiments on eight diverse datasets and user studies involving 25 participants, demonstrating that UniAutoML not only enhances performance but also improves user control and trust. Our human-centered design bridges the gap between AutoML capabilities and user understanding, making ML more accessible to a broader audience.
摘要：自动机器学习 (AutoML) 简化了复杂的 ML 流程，例如数据预处理、模型选择和超参数搜索。然而，传统的 AutoML 框架仅专注于判别性任务，在处理生成模型的 AutoML 时往往力不从心。此外，这些框架在训练过程中缺乏可解释性和用户参与度，这主要是由于缺乏以人为本的设计。这导致最终决策缺乏透明度，用户控制有限，从而可能降低对 AutoML 方法的信任和采用。为了解决这些限制，我们推出了 UniAutoML，这是一个以人为本的 AutoML 框架，它利用大型语言模型 (LLM) 将 AutoML 统一用于判别性任务（例如，用于分类或回归任务的 Transformers 和 CNN）和生成性任务（例如，微调扩散模型或 LLM）。 UniAutoML 以人为本的设计创新地采用了对话式用户界面 (CUI)，促进了自然语言交互，为用户提供实时指导、反馈和进度更新，以提高可解释性。这种设计增强了整个 AutoML 训练过程的透明度和用户控制，使用户可以无缝分解或修改正在训练的模型。为了减轻与 LLM 生成内容相关的潜在风险，UniAutoML 采用了一条安全防护线，可以过滤输入并审查输出。我们通过对八个不同数据集的实验和涉及 25 名参与者的用户研究评估了 UniAutoML 的性能和可用性，结果表明 UniAutoML 不仅提高了性能，还提高了用户控制和信任度。我们以人为本的设计弥合了 AutoML 功能与用户理解之间的差距，使 ML 更容易被更广泛的受众所接受。

Title: Exploring Prompt Engineering: A Systematic Review with SWOT Analysis

Authors: Aditi Singh, Abul Ehtesham, Gaurav Kumar Gupta, Nikhil Kumar Chatta, Saket Kumar, Tala Talaei Khoei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12843
Pdf URL: https://arxiv.org/pdf/2410.12843
Copy Paste: [[2410.12843]] Exploring Prompt Engineering: A Systematic Review with SWOT Analysis(https://arxiv.org/abs/2410.12843)
Keywords: language model, llm, prompt
Abstract: In this paper, we conduct a comprehensive SWOT analysis of prompt engineering techniques within the realm of Large Language Models (LLMs). Emphasizing linguistic principles, we examine various techniques to identify their strengths, weaknesses, opportunities, and threats. Our findings provide insights into enhancing AI interactions and improving language model comprehension of human prompts. The analysis covers techniques including template-based approaches and fine-tuning, addressing the problems and challenges associated with each. The conclusion offers future research directions aimed at advancing the effectiveness of prompt engineering in optimizing human-machine communication.
摘要：在本文中，我们对大型语言模型 (LLM) 领域的提示工程技术进行了全面的 SWOT 分析。我们强调语言学原理，研究各种技术以确定它们的优势、劣势、机会和威胁。我们的研究结果为增强人工智能交互和提高语言模型对人类提示的理解提供了见解。分析涵盖了包括基于模板的方法和微调在内的技术，解决了与每种方法相关的问题和挑战。结论提供了未来的研究方向，旨在提高提示工程在优化人机通信方面的有效性。

Title: TextLap: Customizing Language Models for Text-to-Layout Planning

Authors: Jian Chen, Ruiyi Zhang, Yufan Zhou, Jennifer Healey, Jiuxiang Gu, Zhiqiang Xu, Changyou Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12844
Pdf URL: https://arxiv.org/pdf/2410.12844
Copy Paste: [[2410.12844]] TextLap: Customizing Language Models for Text-to-Layout Planning(https://arxiv.org/abs/2410.12844)
Keywords: language model, gpt, llm
Abstract: Automatic generation of graphical layouts is crucial for many real-world applications, including designing posters, flyers, advertisements, and graphical user interfaces. Given the incredible ability of large language models (LLMs) in both natural language understanding and generation, we believe that we could customize an LLM to help people create compelling graphical layouts starting with only text instructions from the user. We call our method TextLap (text-based layout planning). It uses a curated instruction-based layout planning dataset (InsLap) to customize LLMs to act as a graphic designer. We demonstrate the effectiveness of TextLap and show that it outperforms strong baselines, including GPT-4 based methods, for image generation and graphical design benchmarks.
摘要：自动生成图形布局对于许多实际应用至关重要，包括设计海报、传单、广告和图形用户界面。鉴于大型语言模型 (LLM) 在自然语言理解和生成方面的强大能力，我们相信我们可以定制一个 LLM，以帮助人们从用户的文本指令开始创建引人注目的图形布局。我们将我们的方法称为 TextLap（基于文本的布局规划）。它使用精选的基于指令的布局规划数据集 (InsLap) 来定制 LLM 作为图形设计师。我们展示了 TextLap 的有效性，并表明它在图像生成和图形设计基准方面优于强大的基线，包括基于 GPT-4 的方法。

Title: Toward Relieving Clinician Burden by Automatically Generating Progress Notes using Interim Hospital Data

Authors: Sarvesh Soni, Dina Demner-Fushman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12845
Pdf URL: https://arxiv.org/pdf/2410.12845
Copy Paste: [[2410.12845]] Toward Relieving Clinician Burden by Automatically Generating Progress Notes using Interim Hospital Data(https://arxiv.org/abs/2410.12845)
Keywords: language model
Abstract: Regular documentation of progress notes is one of the main contributors to clinician burden. The abundance of structured chart information in medical records further exacerbates the burden; however, it also presents an opportunity to automate the generation of progress notes. In this paper, we propose a task to automate progress note generation using structured or tabular information present in electronic health records. To this end, we present a novel framework and a large dataset, ChartPNG, for the task which contains 7089 annotation instances (each having a pair of progress notes and interim structured chart data) across 1616 patients. We establish baselines on the dataset using large language models from general and biomedical domains. We perform both automated (where the best performing Biomistral model achieved a BERTScore F1 of 80.53 and MEDCON score of 19.61) and manual (where we found that the model was able to leverage relevant structured data with 76.9% accuracy) analyses to identify the challenges with the proposed task and opportunities for future research.
摘要：定期记录进度记录是造成临床医生负担的主要原因之一。医疗记录中大量的结构化图表信息进一步加剧了负担，然而，这也为自动生成进度记录提供了机会。在本文中，我们提出了一项任务，使用电子健康记录中存在的结构化或表格信息自动生成进度记录。为此，我们为该任务提供了一个新颖的框架和一个大型数据集 ChartPNG，其中包含 1616 名患者的 $7089$ 个注释实例（每个实例都有一对进度记录和临时结构化图表数据）。我们使用来自通用和生物医学领域的大型语言模型在数据集上建立基线。我们执行自动（其中表现最佳的 Biomistral 模型实现了 BERTScore F1 为 $80.53$ 和 MEDCON 得分为 $19.61$）和手动（我们发现该模型能够以 $76.9\%$ 的准确率利用相关结构化数据）分析，以确定拟议任务的挑战和未来研究的机会。

Title: Accurate and Regret-aware Numerical Problem Solver for Tabular Question Answering

Authors: Yuxiang Wang, Jianzhong Qi, Junhao Gan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12846
Pdf URL: https://arxiv.org/pdf/2410.12846
Copy Paste: [[2410.12846]] Accurate and Regret-aware Numerical Problem Solver for Tabular Question Answering(https://arxiv.org/abs/2410.12846)
Keywords: language model, llm
Abstract: Question answering on free-form tables (a.k.a. TableQA) is a challenging task because of the flexible structure and the complex schema of tables. Recent studies use Large Language Models (LLMs) for this task, exploiting their capability in understanding the questions and tabular data, which are typically given in natural language and contain many textual fields, respectively. While this approach has shown promising results, it overlooks the challenges brought by numerical values, which are common in tabular data and which LLMs are known to struggle with. We aim to address this issue and answer numerical questions. We propose a model named TabLaP that uses LLMs as a planner rather than an answer generator, exploiting LLMs' capability in multi-step reasoning while leaving the actual numerical calculations to a Python interpreter for accurate calculation. Recognizing the inaccurate nature of LLMs, we further make a first attempt to quantify the trustworthiness of the answers produced by TabLaP, such that users can use TabLaP in a regret-aware manner. Experimental results on two benchmark datasets show that TabLaP is substantially more accurate than the state-of-the-art models, improving the answer accuracy by 5.7% and 5.8% on the two datasets, respectively.
摘要：由于表格结构灵活、模式复杂，自由格式表格（又称 TableQA）的问答是一项具有挑战性的任务。最近的研究使用大型语言模型 (LLM) 来完成这项任务，利用它们理解问题和表格数据的能力，这些问题和表格数据通常以自然语言给出，并分别包含许多文本字段。虽然这种方法已经显示出有希望的结果，但它忽略了表格数据中常见的数值带来的挑战，而众所周知，LLM 很难处理这些值。我们的目标是解决这个问题并回答数字问题。我们提出了一个名为 TabLaP 的模型，该模型使用 LLM 作为规划器而不是答案生成器，利用 LLM 在多步推理中的能力，同时将实际的数值计算留给 Python 解释器进行精确计算。认识到 LLM 的不准确性，我们进一步首次尝试量化 TabLaP 生成的答案的可信度，以便用户可以以后悔的方式使用 TabLaP。在两个基准数据集上的实验结果表明，TabLaP 比最先进的模型准确率高得多，在两个数据集上分别将答案准确率提高了 5.7% 和 5.8%。
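
The planner/interpreter split in the abstract can be sketched as follows. This is not TabLaP's implementation: `planner` is a hypothetical callable standing in for the LLM, it is assumed to return a plain arithmetic expression, and the restricted evaluator is one possible way to hand that expression to Python without executing arbitrary code.

```python
import ast
import operator

# Only basic arithmetic is allowed in planner output (an assumption of this sketch).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr):
    """Evaluate an arithmetic expression; reject anything else."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return ev(ast.parse(expr, mode="eval"))


def answer_numeric(question, table, planner):
    """The LLM plans the calculation; the interpreter performs it."""
    expression = planner(question, table)  # e.g. "(120 + 80) / 2"
    return safe_eval(expression)
```

Keeping the arithmetic out of the model means a wrong answer can only come from a wrong plan, never from a miscalculation.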

Title: ACCEPT: Adaptive Codebook for Composite and Efficient Prompt Tuning

Authors: Yu-Chen Lin, Wei-Hua Li, Jun-Cheng Chen, Chu-Song Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12847
Pdf URL: https://arxiv.org/pdf/2410.12847
Copy Paste: [[2410.12847]] ACCEPT: Adaptive Codebook for Composite and Efficient Prompt Tuning(https://arxiv.org/abs/2410.12847)
Keywords: language model, prompt
Abstract: Prompt Tuning has been a popular Parameter-Efficient Fine-Tuning method, owing to its remarkable performance with few updated parameters on various large-scale pretrained Language Models (PLMs). Traditionally, each prompt has been considered indivisible and updated independently, causing the number of parameters to increase proportionally as prompt length grows. To address this issue, we propose Adaptive Codebook for Composite and Efficient Prompt Tuning (ACCEPT). In our method, we refer to the concept of product quantization (PQ), allowing all soft prompts to share a set of learnable codebook vectors in each subspace, with each prompt differentiated by a set of adaptive weights. We achieve superior performance on 17 diverse natural language tasks including natural language understanding (NLU) and question answering (QA) tasks by tuning only 0.3% of parameters of the PLMs. Our approach also excels in few-shot and large model settings, highlighting its significant potential.
摘要：提示调优是一种流行的参数高效微调方法，这归功于它在各种大规模预训练语言模型 (PLM) 上只需少量更新参数即可实现卓越性能。传统上，每个提示都被视为不可分割并独立更新，因此参数会随着提示长度的增加而按比例增加。为了解决这个问题，我们提出了用于复合高效提示调优的自适应码本 (ACCEPT)。在我们的方法中，我们参考了乘积量化 (PQ) 的概念，允许所有软提示在每个子空间中共享一组可学习的码本向量，每个提示由一组自适应权重进行区分。我们仅通过调整 PLM 的 0.3% 参数，就在 17 种不同的自然语言任务（包括自然语言理解 (NLU) 和问答 (QA) 任务）上实现了卓越性能。我们的方法在小样本和大型模型设置中也表现出色，凸显了其巨大的潜力。
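
The codebook-sharing idea can be sketched in plain Python. This is a rough illustration under stated assumptions rather than the ACCEPT formulation: each prompt vector is assumed to be split into equal subspaces, every subspace owns a small shared codebook, and a prompt's sub-vector is a weighted sum of that codebook's vectors. The function names and the parameter-count formula are this sketch's own.

```python
def compose_prompt(codebooks, weights):
    """Build one soft-prompt vector from shared codebooks.

    codebooks[s][k] is the k-th code vector (a list of floats) in subspace s.
    weights[s][k] is this prompt's weight on that code vector.
    """
    vector = []
    for codebook, w in zip(codebooks, weights):
        d_sub = len(codebook[0])
        # Weighted sum of the shared code vectors, one coordinate at a time.
        vector.extend(
            sum(w[k] * codebook[k][j] for k in range(len(codebook)))
            for j in range(d_sub)
        )
    return vector


def param_counts(prompt_len, dim, n_sub, n_codes):
    """Trainable parameters: independent prompts vs. shared codebooks + weights."""
    independent = prompt_len * dim
    shared = n_codes * dim + prompt_len * n_sub * n_codes
    return independent, shared
```

Under these assumptions each extra prompt token costs `n_sub * n_codes` weights instead of `dim` values, which is cheaper whenever `n_sub * n_codes < dim`.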

Title: Prompt Engineering a Schizophrenia Chatbot: Utilizing a Multi-Agent Approach for Enhanced Compliance with Prompt Instructions

Authors: Per Niklas Waaler, Musarrat Hussain, Igor Molchanov, Lars Ailo Bongo, Brita Elvevåg
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2410.12848
Pdf URL: https://arxiv.org/pdf/2410.12848
Copy Paste: [[2410.12848]] Prompt Engineering a Schizophrenia Chatbot: Utilizing a Multi-Agent Approach for Enhanced Compliance with Prompt Instructions(https://arxiv.org/abs/2410.12848)
Keywords: language model, gpt, llm, prompt, chat, agent
Abstract: Patients with schizophrenia often present with cognitive impairments that may hinder their ability to learn about their condition. These individuals could benefit greatly from education platforms that leverage the adaptability of Large Language Models (LLMs) such as GPT-4. While LLMs have the potential to make topical mental health information more accessible and engaging, their black-box nature raises concerns about ethics and safety. Prompting offers a way to produce semi-scripted chatbots with responses anchored in instructions and validated information, but prompt-engineered chatbots may drift from their intended identity as the conversation progresses. We propose a Critical Analysis Filter for achieving better control over chatbot behavior. In this system, a team of LLM agents is prompt-engineered to critically analyze and refine the chatbot's response and deliver real-time feedback to the chatbot. To test this approach, we develop an informational schizophrenia chatbot and converse with it (with the filter deactivated) until it oversteps its scope. Once drift has been observed, AI agents are used to automatically generate sample conversations in which the chatbot is being enticed to talk about out-of-bounds topics. We manually assign to each response a compliance score that quantifies the chatbot's compliance with its instructions, specifically the rules about accurately conveying sources and being transparent about limitations. Activating the Critical Analysis Filter resulted in an acceptable compliance score (>=2) in 67.0% of responses, compared to only 8.7% when the filter was deactivated. These results suggest that a self-reflection layer could enable LLMs to be used effectively and safely in mental health platforms, maintaining adaptability while reliably limiting their scope to appropriate use cases.
摘要：精神分裂症患者通常存在认知障碍，这可能会阻碍他们了解自身病情的能力。这些人可以从利用大型语言模型 (LLM)（如 GPT-4）的适应性的教育平台中受益匪浅。虽然 LLM 有可能使主题心理健康信息更易于获取和吸引人，但其黑箱性质引发了道德和安全方面的担忧。提示提供了一种生成半脚本聊天机器人的方法，其响应以指令和经过验证的信息为基础，但提示设计的聊天机器人可能会随着对话的进行而偏离其预期的身份。我们提出了一个批判性分析过滤器，以更好地控制聊天机器人的行为。在这个系统中，一组提示的 LLM 代理经过提示设计，可以批判性地分析和改进聊天机器人的响应并向聊天机器人提供实时反馈。为了测试这种方法，我们开发了一个信息性精神分裂症聊天机器人并与其交谈（停用过滤器），直到它超出其范围。一旦发现偏差，就会使用 AI 代理自动生成示例对话，其中聊天机器人被诱导谈论超出范围的话题。我们手动为每个响应分配一个合规分数，以量化聊天机器人对其指令的遵守情况；特别是关于准确传达来源和透明限制的规则。激活批判性分析过滤器后，67.0% 的响应的合规分数为可接受的（>=2），而停用过滤器后，只有 8.7% 的响应为可接受的合规分数。这些结果表明，自我反思层可以使 LLM 有效安全地用于心理健康平台，保持适应性，同时可靠地将其范围限制在适当的用例中。

Title: RecurFormer: Not All Transformer Heads Need Self-Attention

Authors: Ruiqing Yan, Linghan Zheng, Xingbo Du, Han Zou, Yufeng Guo, Jianfei Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12850
Pdf URL: https://arxiv.org/pdf/2410.12850
Copy Paste: [[2410.12850]] RecurFormer: Not All Transformer Heads Need Self-Attention(https://arxiv.org/abs/2410.12850)
Keywords: language model, llm
Abstract: Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention mechanism's memory overhead. We observe that certain attention heads exhibit a distribution where the attention weights concentrate on tokens near the query token, termed recency aware, which focuses on local and short-range dependencies. Leveraging this insight, we propose RecurFormer, a novel architecture that replaces these attention heads with linear recurrent neural networks (RNNs), specifically the Mamba architecture. This replacement reduces the cache size without evicting tokens, thus maintaining generation quality. RecurFormer retains the ability to model long-range dependencies through the remaining attention heads and allows for reusing pre-trained Transformer-based LLM weights with continual training. Experiments demonstrate that RecurFormer matches the original model's performance while significantly enhancing inference efficiency. Our approach provides a practical solution to the computational challenges of Transformer-based LLM inference, making it highly attractive for tasks involving long inputs.
摘要：基于 Transformer 的大型语言模型 (LLM) 在建模复杂语言模式方面表现出色，但在推理过程中面临着巨大的计算成本，尤其是在长输入的情况下，这是由于注意力机制的内存开销造成的。我们观察到某些注意力头表现出一种分布，其中注意力权重集中在查询标记附近的标记上，这被称为新近度感知，它专注于局部和短距离依赖关系。利用这一见解，我们提出了 RecurFormer，这是一种新颖的架构，它用线性循环神经网络 (RNN) 替换这些注意力头，特别是 Mamba 架构。这种替换减少了缓存大小而不驱逐标记，从而保持了生成质量。RecurFormer 保留了通过剩余的注意力头建模长距离依赖关系的能力，并允许通过持续训练重复使用预先训练的基于 Transformer 的 LLM 权重。实验表明，RecurFormer 与原始模型的性能相匹配，同时显著提高了推理效率。我们的方法为基于 Transformer 的 LLM 推理的计算挑战提供了一个实用的解决方案，使其对于涉及长输入的任务非常有吸引力。
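
One way to make the "recency aware" observation concrete is to measure how much of a head's attention mass falls on the tokens closest to the query. The abstract does not give the selection rule RecurFormer uses, so the metric, the `window` size, and the `threshold` below are all hypothetical choices for illustration.

```python
def recency_ratio(attn, window):
    """Average share of attention that lands within `window` tokens of the query.

    attn[i][j] is the causal attention weight from query i to key j (j <= i),
    with each row summing to 1.
    """
    total = 0.0
    for i, row in enumerate(attn):
        start = max(0, i - window + 1)
        total += sum(row[start : i + 1])
    return total / len(attn)


def select_recency_heads(heads, window, threshold):
    """Indices of heads whose attention is concentrated near the query token."""
    return [
        h for h, attn in enumerate(heads)
        if recency_ratio(attn, window) >= threshold
    ]
```

Heads picked out this way would be the candidates for replacement by a recurrent block, while the remaining heads keep full attention for long-range dependencies.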

Title: VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models

Authors: Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, Joseph E Gonzalez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12851
Pdf URL: https://arxiv.org/pdf/2410.12851
Copy Paste: [[2410.12851]] VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models(https://arxiv.org/abs/2410.12851)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize, but struggle to quantify. These "vibes" - such as tone, formatting, or writing style - influence user preferences, yet traditional evaluations focus primarily on the single axis of correctness. We introduce VibeCheck, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model ("vibes") that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs, then utilizes a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found in human discovery and run VibeCheck on pairwise preference data from real-world user conversations with Llama-3-70b vs. GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks including summarization, math, and captioning to provide insight into differences in model behavior. Some of the vibes we find are that Command X prefers to add concrete intros and conclusions when summarizing in comparison to TNGL, Llama-405b often over-explains its thought process on math problems compared to GPT-4o, and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash.
摘要：大型语言模型 (LLM) 的输出通常会表现出微妙而独特的特征，用户可以直观地识别这些特征，但很难量化。这些“氛围”——例如语气、格式或写作风格——会影响用户偏好，但传统的评估主要关注正确性的单一轴。我们引入了 VibeCheck，这是一个通过发现模型的识别特征（“氛围”）来自动比较一对 LLM 的系统，这些特征定义明确、具有差异性且与用户一致。VibeCheck 会从模型输出中迭代地发现氛围，然后利用一组 LLM 评委定量测量每种氛围的效用。我们验证了 VibeCheck 生成的氛围与人类发现的氛围一致，并对来自现实世界用户对话的成对偏好数据运行 VibeCheck，其中 llama-3-70b VS GPT-4。VibeCheck 表明 Llama 具有友好、有趣且有些争议的氛围。这些氛围以 80% 的准确率预测模型身份，以 61% 的准确率预测人类偏好。最后，我们对各种模型和任务（包括总结、数学和字幕）运行 VibeCheck，以深入了解模型行为的差异。我们发现的一些差异是，与 TNGL 相比，Command X 在总结时更喜欢添加具体的介绍和结论；与 GPT-4o 相比，Llama-405b 经常过度解释其在数学问题上的思维过程；与 Gemini-1.5-Flash 相比，GPT-4 在字幕时更喜欢关注场景的情绪和情感。

Title: The Large Language Model GreekLegalRoBERTa

Authors: Vasileios Saketos, Despina-Athanasia Pantazi, Manolis Koubarakis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12852
Pdf URL: https://arxiv.org/pdf/2410.12852
Copy Paste: [[2410.12852]] The Large Language Model GreekLegalRoBERTa(https://arxiv.org/abs/2410.12852)
Keywords: language model
Abstract: We develop four versions of GreekLegalRoBERTa, which are four large language models trained on Greek legal and nonlegal text. We show that our models surpass the performance of GreekLegalBERT, GreekLegalBERT-v2, and GreekBERT in two tasks involving Greek legal documents: named entity recognition and multi-class legal topic classification. We view our work as a contribution to the study of domain-specific NLP tasks in low-resource languages, like Greek, using modern NLP techniques and methodologies.
摘要：我们开发了四个版本的 GreekLegalRoBERTa，它们是四个大型语言模型，经过希腊法律和非法律文本的训练。我们表明，我们的模型在涉及希腊法律文件的两项任务中的表现超越了 GreekLegalBERT、Greek-LegalBERT-v2 和 GreekBERT：命名实体识别和多类法律主题分类。我们认为我们的工作是对使用现代 NLP 技术和方法研究希腊语等资源匮乏的语言领域特定 NLP 任务的贡献。

Title: Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks

Authors: Mahmood Hegazy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12853
Pdf URL: https://arxiv.org/pdf/2410.12853
Copy Paste: [[2410.12853]] Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks(https://arxiv.org/abs/2410.12853)
Keywords: language model, gpt, llm, prompt, chain-of-thought, agent
Abstract: Large language models (LLMs) excel in natural language generation but often confidently produce incorrect responses, especially in tasks like mathematical reasoning. Chain-of-thought prompting, self-verification, and multi-agent debate are among the strategies proposed to improve the reasoning and factual accuracy of LLMs. Building on Du et al.'s multi-agent debate framework, we find that multi-agent debate helps at any model scale, and that diversity of thought elicits stronger reasoning in debating LLMs. Across various model sizes, performance on mathematical reasoning tasks benefits most when diverse trained models are used. Remarkably, after 4 rounds of debate, a diverse set of medium-capacity models (Gemini-Pro, Mixtral 7BX8, and PaLM 2-M) outperforms GPT-4 on the GSM-8K benchmark, scoring 91% accuracy. By comparison, when 3 instances of Gemini-Pro are used, performance only reaches 82%. Finally, this diverse set of medium-capacity models sets a new state-of-the-art performance on the ASDiv benchmark (94%). These results underscore the idea that the future of AI is agentic, with diverse cooperating agents yielding emergent capabilities beyond even the most powerful individual models.
摘要：大型语言模型 (LLM) 在自然语言生成方面表现出色，但经常会自信地产生错误的答案，尤其是在数学推理等任务中。思路提示、自我验证和多智能体辩论是提高 LLM 推理和事实准确性的策略之一。基于 Du 等人的多智能体辩论框架，我们发现多智能体辩论在任何模型规模上都有帮助，并且思维的多样性会在辩论 LLM 中引发更强的推理。在各种模型大小中，当使用多样化的训练模型时，数学推理任务的性能受益最大。值得注意的是，经过 4 轮辩论后，一组多样化的中等容量模型（Gemini-Pro、Mixtral 7BX8 和 PaLM 2-M）在 GSM-8K 基准上的表现优于 GPT-4，准确率为 91%。相比之下，当使用 3 个 Gemini-Pro 实例时，性能仅达到 82%。最后，这组多样化的中等容量模型在 ASDiv 基准测试中创下了新的最高性能（94%）。这些结果强调了这样一种观点，即人工智能的未来是代理性的，各种合作的代理会产生超越最强大的单个模型的新兴能力。
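
The debate loop the abstract builds on can be sketched roughly as below. The details are assumptions: each agent is a hypothetical callable `agent(question, other_answers)` wrapping a different model, every round lets each agent revise after seeing its peers' previous answers, and the final answer is taken by majority vote.

```python
from collections import Counter


def debate(question, agents, rounds=4):
    """Run a simple multi-agent debate and return the majority answer.

    Each agent is a callable (question, other_answers) -> answer string.
    """
    # Initial, independent answers.
    answers = [agent(question, []) for agent in agents]

    # Revision rounds: every agent sees the others' previous answers.
    for _ in range(rounds):
        answers = [
            agent(question, answers[:i] + answers[i + 1 :])
            for i, agent in enumerate(agents)
        ]

    # Majority vote over the final answers.
    return Counter(answers).most_common(1)[0][0]
```

Passing agents backed by different models, rather than several copies of one model, is what the abstract refers to as diversity of thought.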

Title: TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

Authors: Weibin Liao, Xu Chu, Yasha Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12854
Pdf URL: https://arxiv.org/pdf/2410.12854
Copy Paste: [[2410.12854]] TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees(https://arxiv.org/abs/2410.12854)
Keywords: language model, llm, prompt, tree-of-thought
Abstract: In the domain of complex reasoning tasks, such as mathematical reasoning, recent advancements have proposed the use of Direct Preference Optimization (DPO) to suppress output of dispreferred responses, thereby enhancing the long-chain reasoning capabilities of large language models (LLMs). To this end, these studies employed LLMs to generate preference trees via Tree-of-thoughts (ToT) and sample the paired preference responses required by the DPO algorithm. However, the DPO algorithm based on binary preference optimization is unable to learn multiple responses with varying degrees of preference/dispreference provided by the preference trees, resulting in incomplete preference learning. In this work, we introduce Tree Preference Optimization (TPO), which does not sample paired preference responses from the preference tree; instead, it directly learns from the entire preference tree during fine-tuning. Specifically, TPO formulates the language model alignment as a Preference List Ranking problem, where the policy can potentially learn more effectively from a ranked preference list of responses given the prompt. In addition, to further assist LLMs in identifying discriminative steps within long-chain reasoning and increase the relative reward margin in the preference list, TPO utilizes Adaptive Step Reward to adjust the reward values of each step in the trajectory for performing fine-grained preference optimization. We carry out extensive experiments on mathematical reasoning tasks to evaluate TPO. The experimental results indicate that TPO consistently outperforms DPO across three public large language models on four datasets.
摘要：在数学推理等复杂推理任务领域，最近的进展提出使用直接偏好优化 (DPO) 来抑制不喜欢的响应的输出，从而增强大型语言模型 (LLM) 的长链推理能力。为此，这些研究使用 LLM 通过思维树 (ToT) 生成偏好树，并对 DPO 算法所需的成对偏好响应进行采样。然而，基于二元偏好优化的 DPO 算法无法学习偏好树提供的具有不同偏好/不喜欢程度的多个响应，导致偏好学习不完整。在这项工作中，我们引入了树偏好优化 (TPO)，它不会从偏好树中采样成对的偏好响应；相反，它在微调过程中直接从整个偏好树中学习。具体来说，TPO 将语言模型对齐表述为偏好列表排序问题，其中策略可以从给定提示的排序偏好响应列表中更有效地学习。此外，为了进一步帮助 LLM 识别长链推理中的判别步骤并增加偏好列表中的相对奖励幅度，TPO 利用自适应步骤奖励来调整轨迹中每一步的奖励值，以进行细粒度的偏好优化。我们对数学推理任务进行了广泛的实验来评估 TPO。实验结果表明，TPO 在四个数据集上的三个公共大型语言模型中始终优于 DPO。

Title: JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework

Authors: Fan Liu, Yue Feng, Zhao Xu, Lixin Su, Xinyu Ma, Dawei Yin, Hao Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12855
Pdf URL: https://arxiv.org/pdf/2410.12855
Copy Paste: [[2410.12855]] JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework(https://arxiv.org/abs/2410.12855)
Keywords: gpt, llm, prompt, agent
Abstract: Despite advancements in enhancing LLM safety against jailbreak attacks, evaluating LLM defenses remains a challenge, with current methods often lacking explainability and generalization to complex scenarios, leading to incomplete assessments (e.g., direct judgment without reasoning, low F1 score of GPT-4 in complex cases, bias in multilingual scenarios). To address this, we present JAILJUDGE, a comprehensive benchmark featuring diverse risk scenarios, including synthetic, adversarial, in-the-wild, and multilingual prompts, along with high-quality human-annotated datasets. The JAILJUDGE dataset includes 35k+ instruction-tuning data with reasoning explainability and JAILJUDGETEST, a 4.5k+ labeled set for risk scenarios, and a 6k+ multilingual set across ten languages. To enhance evaluation with explicit reasoning, we propose the JailJudge MultiAgent framework, which enables explainable, fine-grained scoring (1 to 10). This framework supports the construction of instruction-tuning ground truth and facilitates the development of JAILJUDGE Guard, an end-to-end judge model that provides reasoning and eliminates API costs. Additionally, we introduce JailBoost, an attacker-agnostic attack enhancer, and GuardShield, a moderation defense, both leveraging JAILJUDGE Guard. Our experiments demonstrate the state-of-the-art performance of JailJudge methods (JailJudge MultiAgent, JAILJUDGE Guard) across diverse models (e.g., GPT-4, Llama-Guard) and zero-shot scenarios. JailBoost and GuardShield significantly improve jailbreak attack and defense tasks under zero-shot settings, with JailBoost enhancing performance by 29.24% and GuardShield reducing defense ASR from 40.46% to 0.15%.
摘要：尽管在增强 LLM 抵御越狱攻击的安全性方面取得了进展，但评估 LLM 防御措施仍然是一个挑战，当前的方法通常缺乏可解释性和对复杂场景的泛化能力，导致评估不完整（例如，直接判断而没有推理，复杂情况下 GPT-4 的 F1 分数较低，多语言场景中的偏见）。为了解决这个问题，我们提出了 JAILJUDGE，这是一个全面的基准，具有多种风险场景，包括合成、对抗、野外和多语言提示，以及高质量的人工注释数据集。JAILJUDGE 数据集包括超过 35k+ 具有推理可解释性的指令调整数据和 JAILJUDGETEST，一个 4.5k+ 带有风险场景标签的集，以及一个涵盖十种语言的 6k+ 多语言集。为了通过显式推理增强评估，我们提出了 JailJudge MultiAgent 框架，该框架可以实现可解释的细粒度评分（1 到 10）。该框架支持构建指令调整基本事实，并促进 JAILJUDGE Guard 的开发，JAILJUDGE Guard 是一种端到端的判断模型，可提供推理并消除 API 成本。此外，我们还引入了与攻击者无关的攻击增强器 JailBoost 和适度防御 GuardShield，两者都利用了 JAILJUDGE Guard。我们的实验展示了 JailJudge 方法（JailJudge MultiAgent、JAILJUDGE Guard）在不同模型（例如 GPT-4、Llama-Guard）和零次攻击场景中的最先进性能。JailBoost 和 GuardShield 显著改善了零次攻击和防御任务，其中 JailBoost 将性能提高了 29.24%，GuardShield 将防御 ASR 从 40.46% 降低到 0.15%。

Title: Optimized Biomedical Question-Answering Services with LLM and Multi-BERT Integration

Authors: Cheng Qian, Xianglong Shi, Shanshan Yao, Yichen Liu, Fengming Zhou, Zishu Zhang, Junaid Akram, Ali Braytee, Ali Anaissi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12856
Pdf URL: https://arxiv.org/pdf/2410.12856
Copy Paste: [[2410.12856]] Optimized Biomedical Question-Answering Services with LLM and Multi-BERT Integration(https://arxiv.org/abs/2410.12856)
Keywords: language model, llm
Abstract: We present a refined approach to biomedical question-answering (QA) services by integrating large language models (LLMs) with Multi-BERT configurations. By enhancing the ability to process and prioritize vast amounts of complex biomedical data, this system aims to support healthcare professionals in delivering better patient outcomes and informed decision-making. Through innovative use of BERT and BioBERT models, combined with a multi-layer perceptron (MLP) layer, we enable more specialized and efficient responses to the growing demands of the healthcare sector. Our approach not only addresses the challenge of overfitting by freezing one BERT model while training another but also improves the overall adaptability of QA services. The use of extensive datasets, such as BioASQ and BioMRC, demonstrates the system's ability to synthesize critical information. This work highlights how advanced language models can make a tangible difference in healthcare, providing reliable and responsive tools for professionals to manage complex information, ultimately serving the broader goal of improved care and data-driven insights.
摘要：我们通过将大型语言模型 (LLM) 与 Multi-BERT 配置集成，提出了一种改进的生物医学问答 (QA) 服务方法。通过增强处理和优先处理大量复杂生物医学数据的能力，该系统旨在支持医疗保健专业人员提供更好的患者治疗结果和明智的决策。通过创新地使用 BERT 和 BioBERT 模型，结合多层感知器 (MLP) 层，我们可以更专业、更有效地响应医疗保健行业日益增长的需求。我们的方法不仅通过冻结一个 BERT 模型同时训练另一个 BERT 模型来解决过度拟合的挑战，而且还提高了 QA 服务的整体适应性。使用大量数据集（例如 BioASQ 和 BioMRC）证明了系统综合关键信息的能力。这项工作强调了高级语言模型如何在医疗保健领域发挥切实的作用，为专业人员提供可靠且响应迅速的工具来管理复杂信息，最终实现改善护理和数据驱动洞察的更广泛目标。

Title: Enterprise Benchmarks for Large Language Model Evaluation

Authors: Bing Zhang, Mikio Takeuchi, Ryo Kawahara, Shubhi Asthana, Md. Maruf Hossain, Guang-Jie Ren, Kate Soule, Yada Zhu
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2410.12857
Pdf URL: https://arxiv.org/pdf/2410.12857
Copy Paste: [[2410.12857]] Enterprise Benchmarks for Large Language Model Evaluation(https://arxiv.org/abs/2410.12857)
Keywords: language model, llm, prompt
Abstract: The advancement of large language models (LLMs) has led to a greater challenge of having a rigorous and systematic evaluation of complex tasks performed, especially in enterprise applications. Therefore, LLMs need to be able to benchmark enterprise datasets for various tasks. This work presents a systematic exploration of benchmarking strategies tailored to LLM evaluation, focusing on the utilization of domain-specific datasets and consisting of a variety of NLP tasks. The proposed evaluation framework encompasses 25 publicly available datasets from diverse enterprise domains like financial services, legal, cyber security, and climate and sustainability. The diverse performance of 13 models across different enterprise tasks highlights the importance of selecting the right model based on the specific requirements of each task. Code and prompts are available on GitHub.
摘要：大型语言模型 (LLM) 的进步带来了更大的挑战，即对执行的复杂任务进行严格而系统的评估，尤其是在企业应用中。因此，LLM 需要能够对各种任务的企业数据集进行基准测试。这项工作系统地探索了针对 LLM 评估量身定制的基准测试策略，重点关注领域特定数据集的利用，并包含各种 NLP 任务。提出的评估框架涵盖了来自金融服务、法律、网络安全以及气候和可持续性等不同企业领域的 25 个公开可用的数据集。13 个模型在不同企业任务中的不同表现凸显了根据每个任务的具体要求选择正确模型的重要性。代码和提示可在 GitHub 上找到。

Title: Large Language Models for Medical OSCE Assessment: A Novel Approach to Transcript Analysis

Authors: Ameer Hamza Shakur, Michael J. Holcomb, David Hein, Shinyoung Kang, Thomas O. Dalton, Krystle K. Campbell, Daniel J. Scott, Andrew R. Jamieson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12858
Pdf URL: https://arxiv.org/pdf/2410.12858
Copy Paste: [[2410.12858]] Large Language Models for Medical OSCE Assessment: A Novel Approach to Transcript Analysis(https://arxiv.org/abs/2410.12858)
Keywords: language model, gpt, llm, prompt, retrieval augmented generation, chain-of-thought
Abstract: Grading Objective Structured Clinical Examinations (OSCEs) is a time-consuming and expensive process, traditionally requiring extensive manual effort from human experts. In this study, we explore the potential of Large Language Models (LLMs) to assess skills related to medical student communication. We analyzed 2,027 video-recorded OSCE examinations from the University of Texas Southwestern Medical Center (UTSW), spanning four years (2019-2022), and several different medical cases or "stations." Specifically, our focus was on evaluating students' ability to summarize patients' medical history: we targeted the rubric item 'did the student summarize the patients' medical history?' from the communication skills rubric. After transcribing speech audio captured by OSCE videos using Whisper-v3, we studied the performance of various LLM-based approaches for grading students on this summarization task based on their examination transcripts. Using various frontier-level open-source and proprietary LLMs, we evaluated different techniques such as zero-shot chain-of-thought prompting, retrieval augmented generation, and multi-model ensemble methods. Our results show that frontier LLM models like GPT-4 achieved remarkable alignment with human graders, demonstrating a Cohen's kappa agreement of 0.88 and indicating strong potential for LLM-based OSCE grading to augment the current grading process. Open-source models also showed promising results, suggesting potential for widespread, cost-effective deployment. Further, we present a failure analysis identifying conditions where LLM grading may be less reliable in this context and recommend best practices for deploying LLMs in medical education settings.
摘要：客观结构化临床考试 (OSCE) 的评分是一个耗时且昂贵的过程，传统上需要人类专家的大量手动工作。在本研究中，我们探索了大型语言模型 (LLM) 评估医学生沟通相关技能的潜力。我们分析了德克萨斯大学西南医学中心 (UTSW) 的 2,027 场视频录制的 OSCE 考试，跨越四年（2019-2022 年），以及几个不同的医疗案例或“站点”。具体来说，我们的重点是评估学生总结患者病史的能力：我们针对沟通技巧评分标准中的评分标准项目“学生是否总结了患者的病史？”。在使用 Whisper-v3 转录 OSCE 视频捕获的语音音频后，我们研究了各种基于 LLM 的方法在根据学生的考试成绩单对学生进行总结任务评分方面的表现。我们使用各种前沿级开源和专有 LLM 评估了不同的技术，例如零样本思维链提示、检索增强生成和多模型集成方法。我们的结果表明，前沿 LLM 模型（如 GPT-4）与人类评分员实现了显著的一致性，Cohen 的 kappa 一致性为 0.88，表明基于 LLM 的 OSCE 评分具有增强当前评分过程的巨大潜力。开源模型也显示出有希望的结果，表明具有广泛、经济高效的部署潜力。此外，我们进行了故障分析，确定了在这种情况下 LLM 评分可能不太可靠的条件，并推荐了在医学教育环境中部署 LLM 的最佳实践。
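
Note: the agreement statistic quoted in the abstract above, Cohen's kappa, compares the observed agreement between two raters with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that formula only. The function name and the grade lists are hypothetical illustrations, not the study's code or data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical grades (1 = student summarized the history, 0 = did not).
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
model = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
print(round(cohens_kappa(human, model), 3))
```

In this toy example the raters agree on 9 of 10 items (p_o = 0.9) while chance agreement is p_e = 0.54, so kappa is about 0.78, lower than the raw agreement rate.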

Title: Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

Authors: Yimin Tang, Yurong Xu, Ning Yan, Masood Mortazavi
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2410.12859
Pdf URL: https://arxiv.org/pdf/2410.12859
Copy Paste: [[2410.12859]] Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism(https://arxiv.org/abs/2410.12859)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: Transformers have a quadratic scaling of computational complexity with input size, which limits the input context window size of large language models (LLMs) in both training and inference. Meanwhile, retrieval-augmented generation (RAG) based models can better handle longer contexts by using a retrieval system to filter out unnecessary information. However, most RAG methods only perform retrieval based on the initial query, which may not work well with complex questions that require deeper reasoning. We introduce a novel approach, Inner Loop Memory Augmented Tree Retrieval (ILM-TR), involving inner-loop queries, based not only on the query question itself but also on intermediate findings. At inference time, our model retrieves information from the RAG system, integrating data from lengthy documents at various levels of abstraction. Based on the information retrieved, the LLM generates texts stored in an area named Short-Term Memory (STM) which is then used to formulate the next query. This retrieval process is repeated until the text in STM converges. Our experiments demonstrate that retrieval with STM offers improvements over traditional retrieval-augmented LLMs, particularly in long context tests such as Multi-Needle In A Haystack (M-NIAH) and BABILong.
摘要：Transformer 的计算复杂度随输入大小呈二次方增长，这限制了大型语言模型 (LLM) 在训练和推理中的输入上下文窗口大小。同时，基于检索增强生成 (RAG) 的模型可以使用检索系统过滤掉不必要的信息，从而更好地处理较长的上下文。然而，大多数 RAG 方法仅根据初始查询执行检索，这可能不适用于需要更深层次推理的复杂问题。我们引入了一种新方法，即内循环记忆增强树检索 (ILM-TR)，该方法涉及内循环查询，不仅基于查询问题本身，还基于中间发现。在推理时，我们的模型从 RAG 系统中检索信息，整合来自不同抽象级别的长文档的数据。根据检索到的信息，LLM 生成存储在名为短期记忆 (STM) 的区域中文本，然后将其用于制定下一个查询。此检索过程重复进行，直到 STM 中的文本收敛。我们的实验表明，使用 STM 进行检索比传统的检索增强型 LLM 有了改进，特别是在长上下文测试中，例如大海捞针 (M-NIAH) 和 BABILong。
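
Note: the inner-loop idea described in the abstract above (retrieve, write findings to short-term memory, build the next query from the question plus that memory, stop when the memory no longer changes) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' ILM-TR implementation: `retrieve` is a hypothetical word-overlap retriever, `summarize` is a stand-in for the LLM call, and the tree-structured index from the paper is omitted.

```python
def retrieve(query, corpus, k=2):
    """Toy retriever (assumption): rank passages by word overlap with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(p.lower().split())), p) for p in corpus]
    ranked = sorted(scored, key=lambda sp: sp[0], reverse=True)
    return [p for score, p in ranked if score > 0][:k]

def summarize(question, passages, stm):
    """Stand-in for the LLM call: merge newly retrieved passages into memory."""
    merged = list(stm)
    for p in passages:
        if p not in merged:
            merged.append(p)
    return merged

def inner_loop_retrieval(question, corpus, max_rounds=5):
    """Repeat retrieval with question + short-term memory until STM converges."""
    stm = []
    for _ in range(max_rounds):
        query = " ".join([question] + stm)  # next query uses intermediate findings
        new_stm = summarize(question, retrieve(query, corpus), stm)
        if new_stm == stm:  # memory unchanged: treat as converged
            break
        stm = new_stm
    return stm

corpus = [
    "alice works at acme",
    "acme is based in oslo",
    "bob likes tea",
    "oslo is in norway",
]
memory = inner_loop_retrieval("where does alice work", corpus)
print(memory)
```

In this example the initial query only matches the first passage. The second passage is reached in the next round because the memory contributes the word "acme" to the query, which is the behavior a single-shot retrieval on the initial question would miss.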

Title: LLMD: A Large Language Model for Interpreting Longitudinal Medical Records

Authors: Robert Porter, Adam Diehl, Benjamin Pastel, J. Henry Hinnefeld, Lawson Nerenberg, Pye Maung, Sebastien Kerbrat, Gillian Hanson, Troy Astorino, Stephen J. Tarsa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12860
Pdf URL: https://arxiv.org/pdf/2410.12860
Copy Paste: [[2410.12860]] LLMD: A Large Language Model for Interpreting Longitudinal Medical Records(https://arxiv.org/abs/2410.12860)
Keywords: language model, gpt, llm
Abstract: We introduce LLMD, a large language model designed to analyze a patient's medical history based on their medical records. Along with domain knowledge, LLMD is trained on a large corpus of records collected over time and across facilities, as well as tasks and labels that make nuanced connections among them. This approach is critical to an accurate picture of patient health, and has distinctive advantages over models trained on knowledge alone, unlabeled records, structured EHR data, or records from a single health system. The recipe for LLMD continues pretraining a foundational model on both domain knowledge and the contents of millions of records. These span an average of 10 years of care and as many as 140 care sites per patient. LLMD is then instruction fine-tuned on structuring and abstraction tasks. The former jointly identify and normalize document metadata, provenance information, clinical named-entities, and ontology mappings, while the latter roll these into higher-level representations, such as a continuous era of time a patient was on a medication. LLMD is deployed within a layered validation system that includes continual random audits and review by experts, e.g. based on uncertainty, disease-specific rules, or use-case. LLMD exhibits large gains over both more-powerful generalized models and domain-specific models. On medical knowledge benchmarks, LLMD-8B achieves state-of-the-art accuracy on PubMedQA text responses, besting orders-of-magnitude larger models. On production tasks, we show that LLMD significantly outperforms all other models evaluated, and among alternatives, large general-purpose LLMs like GPT-4o are more accurate than models emphasizing medical knowledge. We find strong evidence that accuracy on today's medical benchmarks is not the most significant factor when analyzing real-world patient data, an insight with implications for future medical LLMs.
摘要：我们引入了 LLMD，这是一种大型语言模型，旨在根据患者的医疗记录分析患者的病史。除了领域知识外，LLMD 还基于随时间推移和跨设施收集的大量记录语料库以及在它们之间建立细微联系的任务和标签进行训练。这种方法对于准确了解患者的健康状况至关重要，并且与仅基于知识、未标记记录、结构化 EHR 数据或来自单个医疗系统的记录进行训练的模型相比具有独特的优势。LLMD 的秘诀是继续在领域知识和数百万条记录的内容上对基础模型进行预训练。这些记录平均涵盖 10 年的护理和每位患者多达 140 个护理站点。然后，LLMD 针对结构化和抽象任务进行指令微调。前者共同识别和规范化文档元数据、出处信息、临床命名实体和本体映射，而后者将它们转化为更高级的表示，例如患者服药的连续时间段。 LLMD 部署在一个分层验证系统中，该系统包括专家的持续随机审核和审查，例如基于不确定性、特定于疾病的规则或用例。LLMD 比更强大的通用模型和领域特定模型都表现出巨大的进步。在医学知识基准测试中，LLMD-8B 在 PubMedQA 文本响应中实现了最先进的准确度，超过了数量级更大的模型。在生产任务中，我们表明 LLMD 明显优于所有其他评估模型，并且在替代方案中，像 GPT-4o 这样的大型通用 LLM 比强调医学知识的模型更准确。我们发现强有力的证据表明，在分析现实世界的患者数据时，当今医学基准的准确性并不是最重要的因素，这一见解对未来的医学 LLM 具有重要意义。'

Title: Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs

Authors: Divyanshu Kumar, Umang Jain, Sahil Agarwal, Prashanth Harshangi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12864
Pdf URL: https://arxiv.org/pdf/2410.12864
Copy Paste: [[2410.12864]] Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs(https://arxiv.org/abs/2410.12864)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are being adopted across a wide range of tasks, including decision-making processes in industries where bias in AI systems is a significant concern. Recent research indicates that LLMs can harbor implicit biases even when they pass explicit bias evaluations. Building upon the frameworks of the LLM Implicit Association Test (IAT) Bias and LLM Decision Bias, this study highlights that newer or larger language models do not automatically exhibit reduced bias; in some cases, they displayed higher bias scores than their predecessors, such as in Meta's Llama series and OpenAI's GPT models. This suggests that increasing model complexity without deliberate bias mitigation strategies can unintentionally amplify existing biases. The variability in bias scores within and across providers underscores the need for standardized evaluation metrics and benchmarks for bias assessment. The lack of consistency indicates that bias mitigation is not yet a universally prioritized goal in model development, which can lead to unfair or discriminatory outcomes. By broadening the detection of implicit bias, this research provides a more comprehensive understanding of the biases present in advanced models and underscores the critical importance of addressing these issues to ensure the development of fair and responsible AI systems.
摘要：大型语言模型 (LLM) 被广泛用于各种任务，包括人工智能系统偏见备受关注的行业中的决策过程。最近的研究表明，即使通过了显性偏见评估，LLM 也可能存在隐性偏见。基于 LLM 隐性联想测试 (IAT) 偏见和 LLM 决策偏见的框架，这项研究强调，较新或较大的语言模型并不会自动表现出减少的偏见；在某些情况下，它们显示出比其前辈更高的偏见分数，例如 Meta 的 Llama 系列和 OpenAI 的 GPT 模型。这表明，在没有刻意的偏见缓解策略的情况下增加模型复杂性可能会无意中放大现有的偏见。提供商内部和提供商之间的偏见分数差异凸显了对偏见评估的标准化评估指标和基准的需求。缺乏一致性表明，偏见缓解尚未成为模型开发中普遍优先的目标，这可能导致不公平或歧视性的结果。通过扩大对隐性偏见的检测，这项研究提供了对高级模型中存在的偏见的更全面的理解，并强调了解决这些问题对于确保公平和负责任的人工智能系统发展的关键重要性。

Title: ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction

Authors: Yanlin Zhang, Ning Li, Quan Gan, Weinan Zhang, David Wipf, Minjie Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12865
Pdf URL: https://arxiv.org/pdf/2410.12865
Copy Paste: [[2410.12865]] ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction(https://arxiv.org/abs/2410.12865)
Keywords: language model, llm
Abstract: Crafting effective features is a crucial yet labor-intensive and domain-specific task within machine learning pipelines. Fortunately, recent advancements in Large Language Models (LLMs) have shown promise in automating various data science tasks, including feature engineering. But despite this potential, evaluations thus far are primarily based on the end performance of a complete ML pipeline, providing limited insight into precisely how LLMs behave relative to human experts in feature engineering. To address this gap, we propose ELF-Gym, a framework for Evaluating LLM-generated Features. We curated a new dataset from historical Kaggle competitions, including 251 "golden" features used by top-performing teams. ELF-Gym then quantitatively evaluates LLM-generated features by measuring their impact on downstream model performance as well as their alignment with expert-crafted features through semantic and functional similarity assessments. This approach provides a more comprehensive evaluation of disparities between LLMs and human experts, while offering valuable insights into specific areas where LLMs may have room for improvement. For example, using ELF-Gym we empirically demonstrate that, in the best-case scenario, LLMs can semantically capture approximately 56% of the golden features, but at the more demanding implementation level this overlap drops to 13%. Moreover, in other cases LLMs may fail completely, particularly on datasets that require complex features, indicating broad potential pathways for improvement.
摘要：在机器学习流程中，制作有效的特征是一项至关重要但又劳动密集且特定领域的任务。幸运的是，大型语言模型 (LLM) 的最新进展已显示出在自动化各种数据科学任务（包括特征工程）方面的前景。但尽管有这种潜力，迄今为止的评估主要基于完整 ML 流程的最终性能，因此无法准确了解 LLM 在特征工程方面相对于人类专家的表现。为了解决这一差距，我们提出了 ELF-Gym，这是一个用于评估 LLM 生成的特征的框架。我们从历史 Kaggle 竞赛中整理了一个新数据集，其中包括表现最佳的团队使用的 251 个“黄金”特征。然后，ELF-Gym 通过语义和功能相似性评估来衡量 LLM 生成的特征对下游模型性能的影响以及它们与专家制作的特征的一致性，从而定量评估 LLM 生成的特征。这种方法可以更全面地评估 LLM 和人类专家之间的差异，同时为 LLM 可能有改进空间的特定领域提供有价值的见解。例如，我们使用 ELF-Gym 通过经验证明，在最佳情况下，LLM 可以在语义上捕获大约 56% 的黄金特征，但在要求更高的实施级别，这种重叠会下降到 13%。此外，在其他情况下，LLM 可能会完全失败，特别是在需要复杂特征的数据集上，这表明存在广泛的潜在改进途径。

Title: Empowering Dysarthric Speech: Leveraging Advanced LLMs for Accurate Speech Correction and Multimodal Emotion Analysis

Authors: Kaushal Attaluri, Anirudh CHVS, Sireesha Chittepu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12867
Pdf URL: https://arxiv.org/pdf/2410.12867
Copy Paste: [[2410.12867]] Empowering Dysarthric Speech: Leveraging Advanced LLMs for Accurate Speech Correction and Multimodal Emotion Analysis(https://arxiv.org/abs/2410.12867)
Keywords: language model, gpt, llm
Abstract: Dysarthria is a motor speech disorder caused by neurological damage that affects the muscles used for speech production, leading to slurred, slow, or difficult-to-understand speech. It affects millions of individuals worldwide, including those with conditions such as stroke, traumatic brain injury, cerebral palsy, Parkinson's disease, and multiple sclerosis. Dysarthria presents a major communication barrier, impacting quality of life and social interaction. This paper introduces a novel approach to recognizing and translating dysarthric speech, empowering individuals with this condition to communicate more effectively. We leverage advanced large language models for accurate speech correction and multimodal emotion analysis. Dysarthric speech is first converted to text using the OpenAI Whisper model, followed by sentence prediction using fine-tuned open-source models and benchmark models like GPT-4.o, LLaMA 3.1 70B and Mistral 8x7B on Groq AI accelerators. The dataset used combines the TORGO dataset with Google speech data, manually labeled for emotional context. Our framework identifies emotions such as happiness, sadness, neutrality, surprise, anger, and fear, while reconstructing intended sentences from distorted speech with high accuracy. This approach demonstrates significant advancements in the recognition and interpretation of dysarthric speech.
摘要：构音障碍是一种由神经损伤引起的运动性言语障碍，会影响用于发声的肌肉，导致言语不清、缓慢或难以理解。它影响着全世界数百万人，包括患有中风、创伤性脑损伤、脑瘫、帕金森病和多发性硬化症等疾病的人。构音障碍是沟通的主要障碍，影响生活质量和社交互动。本文介绍了一种识别和翻译构音障碍语音的新方法，使患有这种疾病的人能够更有效地沟通。我们利用先进的大型语言模型进行准确的语音校正和多模态情感分析。首先使用 OpenAI Whisper 模型将构音障碍语音转换为文本，然后使用经过微调的开源模型和 Groq AI 加速器上的基准模型（如 GPT-4.o、LLaMA 3.1 70B 和 Mistral 8x7B）进行句子预测。使用的数据集将 TORGO 数据集与 Google 语音数据相结合，并手动标记情感背景。我们的框架可以识别快乐、悲伤、中立、惊讶、愤怒和恐惧等情绪，同时以高精度从失真的语音中重建预期的句子。这种方法在识别和解释构音障碍语音方面取得了重大进展。

Title: Language Model Preference Evaluation with Multiple Weak Evaluators

Authors: Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Hui Xiong, Ranjay Krishna
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12869
Pdf URL: https://arxiv.org/pdf/2410.12869
Copy Paste: [[2410.12869]] Language Model Preference Evaluation with Multiple Weak Evaluators(https://arxiv.org/abs/2410.12869)
Keywords: language model, gpt, llm
Abstract: Despite the remarkable success of Large Language Models (LLMs), evaluating their outputs' quality regarding preference remains a critical challenge. Existing works usually leverage a powerful LLM (e.g., GPT4) as the judge for comparing LLMs' outputs pairwise, yet such a model-based evaluator is vulnerable to conflicting preferences, i.e., output A is better than B, B than C, but C than A, causing contradictory evaluation results. To improve model-based preference evaluation, we introduce GED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for better, non-contradictory evaluation results. In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process to eliminate cyclic inconsistencies, ensuring a directed acyclic graph (DAG) structure. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments across ten benchmark datasets show that GED outperforms baseline methods in model ranking, response selection, and model alignment tasks. Notably, GED combines weaker evaluators like Llama3-8B, Mistral-7B, and Qwen2-7B to surpass the performance of stronger evaluators like Qwen2-72B, highlighting its ability to enhance evaluation reliability and improve model performance.
摘要：尽管大型语言模型 (LLM) 取得了显著的成功，但评估其输出的偏好质量仍然是一项关键挑战。现有研究通常利用强大的 LLM（例如 GPT4）作为判断标准，以成对比较 LLM 的输出，但这种基于模型的评估器容易受到偏好冲突的影响，即输出 A 优于 B，B 优于 C，但 C 优于 A，从而导致相互矛盾的评估结果。为了改进基于模型的偏好评估，我们引入了 GED（偏好图集成和去噪），这是一种利用多个基于模型的评估器构建偏好图的新方法，然后对这些图进行集成和去噪，以获得更好的、不矛盾的评估结果。具体来说，我们的方法包括两个主要阶段：将评估聚合到统一图中，并应用去噪过程消除循环不一致，确保有向无环图 (DAG) 结构。我们为我们的框架提供了理论保证，证明了其在恢复基本事实偏好结构方面的有效性。在十个基准数据集上进行的大量实验表明，GED 在模型排名、响应选择和模型对齐任务中的表现优于基线方法。值得注意的是，GED 结合了 Llama3-8B、Mistral-7B 和 Qwen2-7B 等较弱的评估器，超越了 Qwen2-72B 等较强的评估器的性能，凸显了其增强评估可靠性和提高模型性能的能力。
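
Note: the two stages named in the abstract above (aggregate evaluations into one preference graph, then remove cyclic inconsistencies so that a DAG remains) can be illustrated with a small sketch. The denoising rule used here, repeatedly deleting the lowest-weight edge of any remaining cycle, is a simple heuristic chosen for illustration and is an assumption. The abstract does not specify GED's actual denoising procedure, and all function names are hypothetical.

```python
from collections import defaultdict

def aggregate(evaluations):
    """Merge pairwise votes from all evaluators into net edges (winner, loser) -> weight."""
    votes = defaultdict(int)
    for prefs in evaluations:
        for winner, loser in prefs:
            votes[(winner, loser)] += 1
    graph = {}
    for (u, v), w in votes.items():
        net = w - votes.get((v, u), 0)  # cancel opposing votes
        if net > 0:
            graph[(u, v)] = net
    return graph

def find_cycle(graph):
    """Return the edge list of one directed cycle, or None if the graph is acyclic."""
    adj = defaultdict(list)
    for u, v in graph:
        adj[u].append(v)
    state = {}  # node -> 1 while on the DFS stack, 2 once finished

    def dfs(node, path):
        state[node] = 1
        path.append(node)
        for nxt in adj[node]:
            if state.get(nxt) == 1:  # back edge closes a cycle
                cyc = path[path.index(nxt):] + [nxt]
                return list(zip(cyc, cyc[1:]))
            if nxt not in state:
                found = dfs(nxt, path)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in list(adj):
        if node not in state:
            found = dfs(node, [])
            if found:
                return found
    return None

def denoise(graph):
    """Heuristic (assumption): drop the weakest edge of each cycle until none remain."""
    graph = dict(graph)
    while True:
        cycle = find_cycle(graph)
        if cycle is None:
            return graph
        del graph[min(cycle, key=lambda e: graph[e])]

# Three hypothetical evaluators; the first two each produce a cyclic preference.
evaluations = [
    [("A", "B"), ("B", "C"), ("C", "A")],
    [("A", "B"), ("B", "C"), ("C", "A")],
    [("A", "B"), ("B", "C"), ("A", "C")],
]
merged = aggregate(evaluations)
dag = denoise(merged)
print(dag)
```

Here the merged graph still contains the cycle A > B > C > A, with the edge C > A supported by a net single vote. Removing that edge leaves the consistent ordering A > B > C.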

Title: Skill Learning Using Process Mining for Large Language Model Plan Generation

Authors: Andrei Cosmin Redis, Mohammadreza Fani Sani, Bahram Zarrin, Andrea Burattin
Subjects: cs.CL, cs.AI, cs.DB, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12870
Pdf URL: https://arxiv.org/pdf/2410.12870
Copy Paste: [[2410.12870]] Skill Learning Using Process Mining for Large Language Model Plan Generation(https://arxiv.org/abs/2410.12870)
Keywords: language model, llm
Abstract: Large language models (LLMs) hold promise for generating plans for complex tasks, but their effectiveness is limited by sequential execution, lack of control flow models, and difficulties in skill retrieval. Addressing these issues is crucial for improving the efficiency and interpretability of plan generation as LLMs become more central to automation and decision-making. We introduce a novel approach to skill learning in LLMs by integrating process mining techniques, leveraging process discovery for skill acquisition, process models for skill storage, and conformance checking for skill retrieval. Our methods enhance text-based plan generation by enabling flexible skill discovery, parallel execution, and improved interpretability. Experimental results suggest the effectiveness of our approach, with our skill retrieval method surpassing state-of-the-art accuracy baselines under specific conditions.
摘要：大型语言模型 (LLM) 有望为复杂任务生成计划，但其有效性受到顺序执行、缺乏控制流模型和技能检索困难的限制。随着 LLM 在自动化和决策中变得越来越重要，解决这些问题对于提高计划生成的效率和可解释性至关重要。我们通过集成流程挖掘技术、利用流程发现获取技能、利用流程模型存储技能以及利用一致性检查进行技能检索，引入了一种在 LLM 中学习技能的新方法。我们的方法通过实现灵活的技能发现、并行执行和改进的可解释性来增强基于文本的计划生成。实验结果表明了我们的方法的有效性，我们的技能检索方法在特定条件下超越了最先进的准确度基线。

Title: Beyond Right and Wrong: Mitigating Cold Start in Knowledge Tracing Using Large Language Model and Option Weight

Authors: JongWoo Kim, SeongYeub Chu, Bryan Wong, Mun Yi
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12872
Pdf URL: https://arxiv.org/pdf/2410.12872
Copy Paste: [[2410.12872]] Beyond Right and Wrong: Mitigating Cold Start in Knowledge Tracing Using Large Language Model and Option Weight(https://arxiv.org/abs/2410.12872)
Keywords: language model, llm
Abstract: Knowledge Tracing (KT) is vital in educational data mining, enabling personalized learning by tracking learners' knowledge states and forecasting their academic outcomes. This study introduces the LOKT (Large Language Model Option-weighted Knowledge Tracing) model, which uses large language models (LLMs) to address the cold start problem, where only limited historical data is available. While traditional KT models have incorporated option weights, our research extends this by integrating these weights into an LLM-based KT framework. Moving beyond the binary classification of correct and incorrect responses, we emphasize that different types of incorrect answers offer valuable insights into a learner's knowledge state. By converting these responses into text-based ordinal categories, we enable LLMs to assess learner understanding with greater clarity, although our approach focuses on the final knowledge state rather than the progression of learning over time. Using five public datasets, we demonstrate that the LOKT model sustains high predictive accuracy even with limited data, effectively addressing both "learner cold-start" and "system cold-start" scenarios. These findings showcase LOKT's potential to enhance LLM-based learning tools and support early-stage personalization.
摘要：知识追踪 (KT) 在教育数据挖掘中至关重要，它通过追踪学习者的知识状态并预测他们的学业成果来实现个性化学习。本研究引入了 LOKT（大型语言模型选项加权知识追踪）模型来解决使用大型语言模型 (LLM) 时历史数据有限的冷启动问题。虽然传统的 KT 模型已经纳入了选项权重，但我们的研究通过将这些权重集成到基于 LLM 的 KT 框架中来扩展这一功能。除了正确和错误答案的二元分类之外，我们强调不同类型的错误答案可以为学习者的知识状态提供有价值的见解。通过将这些答案转换为基于文本的序数类别，我们使 LLM 能够更清晰地评估学习者的理解，尽管我们的方法侧重于最终的知识状态，而不是随着时间的推移而进行的学习进展。使用五个公共数据集，我们证明 LOKT 模型即使在数据有限的情况下也能保持较高的预测准确性，有效地解决了“学习者冷启动”和“系统冷启动”场景。这些发现展示了 LOKT 增强基于 LLM 的学习工具和支持早期个性化的潜力。

Title: In-context KV-Cache Eviction for LLMs via Attention-Gate

Authors: Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12876
Pdf URL: https://arxiv.org/pdf/2410.12876
Copy Paste: [[2410.12876]] In-context KV-Cache Eviction for LLMs via Attention-Gate(https://arxiv.org/abs/2410.12876)
Keywords: language model, llm
Abstract: The KV-Cache technique has become the standard for the inference of large language models (LLMs). It caches states of self-attention to avoid recomputation. Yet, it is widely criticized that KV-Cache can become a bottleneck of the LLM inference system, especially when confronted with ultra-large models and long-context queries. A natural remedy is to discard the KV-Cache for less important tokens, with StreamingLLM as an example, but the static eviction strategies used cannot flexibly adapt to varying contexts. Remedies like H2O leverage accumulative attention scores to perform dynamic eviction but suffer from the attention bias issue in capturing contextual information. This paper bridges this gap by devising a parameterized KV-Cache eviction mechanism, dubbed Attention-Gate, which accepts the whole context as input and yields eviction flags for each token to realize in-context eviction. The subsequent self-attention module proceeds according to the flags and only the KV states for the remaining tokens need to be cached. The Attention-Gates can vary among different heads and layers and be trivially plugged into pre-trained LLMs, tuned by cost-effective continual pre-training or supervised fine-tuning objectives to learn what to discard. The computational and memory overhead introduced by Attention-Gates is minimal. Our method is validated across multiple tasks, demonstrating both efficiency and adaptability. After a highly efficient continual pre-training, it achieves higher average accuracy and evicts more tokens compared to traditional training-free methods. In supervised fine-tuning, it not only evicts many tokens but also outperforms LoRA-finetuned LLMs on some datasets, such as RTE, where it improves accuracy by 13.9% while evicting 62.8% of tokens, showing that effective eviction of redundant tokens can even enhance performance.
摘要：KV-Cache 技术已成为大型语言模型 (LLM) 推理的标准。它缓存自注意力状态以避免重新计算。然而，人们普遍批评 KV-Cache 可能成为 LLM 推理系统的瓶颈，尤其是在面对超大型模型和长上下文查询时。一种自然的补救措施是丢弃不太重要的标记的 KV-Cache，以 StreamingLLM 为例，但使用的静态驱逐策略无法灵活地适应不同的上下文。像 H2O 这样的补救措施利用累积注意力分数来执行动态驱逐，但在捕获上下文信息时存在注意力偏差问题。本文通过设计一种参数化的 KV-Cache 驱逐机制（称为 Attention-Gate）来弥补这一差距，它接受整个上下文作为输入并为每个标记产生驱逐标志以实现上下文驱逐。后续的自注意力模块根据标志进行，只需要缓存剩余 token 的 KV 状态。注意门可以在不同的头和层之间变化，并且可以简单地插入到预先训练的 LLM 中，通过经济高效的持续预训练或监督微调目标进行调整以获取要丢弃的内容。注意门引入的计算和内存开销很小。我们的方法在多个任务中得到了验证，展示了效率和适应性。经过高效的持续预训练后，与传统的无训练方法相比，它实现了更高的平均准确率并驱逐了更多的 token。在监督微调中，它不仅驱逐了许多 token，而且在某些数据集上的表现也优于 LoRA 微调的 LLM，例如 RTE，它将准确率提高了 13.9%，同时驱逐了 62.8% 的 token，这表明有效驱逐冗余 token 甚至可以提高性能。
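
Note: the mechanism described in the abstract above (a parameterized gate reads the whole context, emits one keep/evict flag per token, and attention then runs only over the cached states of the kept tokens) can be sketched in a few lines. This is a toy under stated assumptions, not the paper's Attention-Gate architecture: the gate here is a hypothetical linear-plus-sigmoid scorer over each token and the mean-pooled context, the weights are hand-picked rather than learned, and keys and values are set equal to the hidden states for brevity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gate_flags(hidden, w_tok, w_ctx, bias=0.0, threshold=0.5):
    """Toy gate (assumption): one keep flag per token, conditioned on the whole context."""
    dim = len(hidden[0])
    ctx = [sum(h[d] for h in hidden) / len(hidden) for d in range(dim)]  # mean-pooled context
    flags = []
    for h in hidden:
        score = 1 / (1 + math.exp(-(dot(w_tok, h) + dot(w_ctx, ctx) + bias)))
        flags.append(score >= threshold)  # True = keep in cache, False = evict
    return flags

def attend(query, keys, values):
    """Softmax attention of a single query over the cached keys and values."""
    scale = math.sqrt(len(query))
    logits = [dot(query, k) / scale for k in keys]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

# Four token states; hand-picked gate weights stand in for learned parameters.
hidden = [[2.0, 0.0], [-1.0, 0.5], [1.5, 1.0], [-2.0, 0.0]]
flags = gate_flags(hidden, w_tok=[1.0, 0.0], w_ctx=[0.1, 0.0])

# Only the kept tokens' key/value states enter the cache.
cache = [(h, h) for h, keep in zip(hidden, flags) if keep]
keys, values = zip(*cache)
out = attend([1.0, 0.0], keys, values)
print(flags, len(cache), out)
```

With these weights the gate keeps the first and third tokens and evicts the other two, so the cache holds half as many key/value pairs. The sketch assumes at least one token is kept; a practical version would need a fallback for an empty cache.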

Title: Improving Instruction-Following in Language Models through Activation Steering

Authors: Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12877
Pdf URL: https://arxiv.org/pdf/2410.12877
Copy Paste: [[2410.12877]] Improving Instruction-Following in Language Models through Activation Steering(https://arxiv.org/abs/2410.12877)
Keywords: language model
Abstract: The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a modular approach to activation steering. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion, providing inference-time control over instruction following. Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present. Additionally, we explore the compositionality of activation steering, successfully applying multiple instructions simultaneously. Finally, we demonstrate that steering vectors computed on instruction-tuned models can transfer to improve base models. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
摘要：遵循指令的能力对于语言模型的众多实际应用至关重要。为了追求更深入的见解和更强大的功能，我们从语言模型中得出特定于指令的向量表示，并使用它们来相应地引导模型。这些向量被计算为有指令和无指令的输入之间的激活差异，从而实现模块化的激活转向方法。我们展示了这种方法如何增强模型对输出格式、长度和单词包含等约束的遵守，从而提供对指令跟随的推理时间控制。我们在四个模型上进行的实验展示了我们如何使用激活向量来引导模型遵循约束，即使没有明确的指令，并在存在指令时提高性能。此外，我们探索了激活转向的组合性，成功地同时应用了多个指令。最后，我们证明了在指令调整模型上计算的转向向量可以转移以改进基础模型。我们的研究结果表明，激活转向为语言生成中的细粒度控制提供了一种实用且可扩展的方法。
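
Note: the abstract above describes steering vectors as the difference in activations between inputs with and without an instruction. A minimal sketch of that computation follows. Averaging over a set of example activations and scaling the vector by a coefficient `alpha` before adding it to a hidden state are assumptions made for illustration, and the activation values are invented rather than taken from any model.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]

def steering_vector(with_instr, without_instr):
    """Difference of mean activations: inputs with the instruction minus inputs without."""
    mu_with = mean_vector(with_instr)
    mu_without = mean_vector(without_instr)
    return [a - b for a, b in zip(mu_with, mu_without)]

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a hidden state at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Invented activations for the same prompts with and without an instruction.
with_instr = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
without_instr = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]

v = steering_vector(with_instr, without_instr)
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)
print(v, steered)
```

Because each instruction yields its own vector, several vectors could be added to the same hidden state, which is one way to read the compositionality result mentioned in the abstract.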

Title: Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models

Authors: Sahar Iravani, Tim O. F. Conrad
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12878
Pdf URL: https://arxiv.org/pdf/2410.12878
Copy Paste: [[2410.12878]] Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models(https://arxiv.org/abs/2410.12878)
Keywords: language model, llm, chain-of-thought
Abstract: Table processing, a key task in natural language processing, has significantly benefited from recent advancements in language models (LMs). However, the capabilities of LMs in table-to-text generation, which transforms structured data into coherent narrative text, require an in-depth investigation, especially with current open-source models. This study explores the effectiveness of various in-context learning strategies in LMs across benchmark datasets, focusing on the impact of providing examples to the model. More importantly, we examine a real-world use case, offering valuable insights into practical applications. To complement traditional evaluation metrics, we employ a large language model (LLM) self-evaluation approach using chain-of-thought reasoning and assess its correlation with human-aligned metrics like BERTScore. Our findings highlight the significant impact of examples in improving table-to-text generation and suggest that, while LLM self-evaluation has potential, its current alignment with human judgment could be enhanced. This points to the need for more reliable evaluation methods.
摘要：表格处理是自然语言处理中的一项关键任务，语言模型 (LM) 的最新进展极大地促进了该任务的发展。然而，语言模型在表格到文本生成（将结构化数据转换为连贯的叙述文本）方面的能力需要深入研究，尤其是对于当前的开源模型。本研究探讨了 LM 在基准数据集中各种上下文学习策略的有效性，重点关注向模型提供示例的影响。更重要的是，我们研究了一个现实世界的用例，为实际应用提供了宝贵的见解。为了补充传统的评估指标，我们采用了一种使用思路链推理的大型语言模型 (LLM) 自我评估方法，并评估了它与 BERTScore 等人类一致指标的相关性。我们的研究结果强调了示例在改进表格到文本生成方面的重大影响，并表明，虽然 LLM 自我评估具有潜力，但它目前与人类判断的一致性可以得到增强。这表明需要更可靠的评估方法。

Title: Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models

Authors: Somnath Banerjee, Sayan Layek, Hari Shrawgi, Rajarshi Mandal, Avik Halder, Shanu Kumar, Sagnik Basu, Parag Agrawal, Rima Hazra, Animesh Mukherjee
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2410.12880
Pdf URL: https://arxiv.org/pdf/2410.12880
Copy Paste: [[2410.12880]] Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models(https://arxiv.org/abs/2410.12880)
Keywords: language model, llm
Abstract: As LLMs are increasingly deployed in global applications, the importance of cultural sensitivity becomes paramount, ensuring that users from diverse backgrounds feel respected and understood. Cultural harm can arise when these models fail to align with specific cultural norms, resulting in misrepresentations or violations of cultural values. This work addresses the challenges of ensuring cultural sensitivity in LLMs, especially in small-parameter models that often lack the extensive training data needed to capture global cultural nuances. We present two key contributions: (1) A cultural harm test dataset, created to assess model outputs across different cultural contexts through scenarios that expose potential cultural insensitivities, and (2) A culturally aligned preference dataset, aimed at restoring cultural sensitivity through fine-tuning based on feedback from diverse annotators. These datasets facilitate the evaluation and enhancement of LLMs, ensuring their ethical and safe deployment across different cultural landscapes. Our results show that integrating culturally aligned feedback leads to a marked improvement in model behavior, significantly reducing the likelihood of generating culturally insensitive or harmful content. Ultimately, this work paves the way for more inclusive and respectful AI systems, fostering a future where LLMs can safely and ethically navigate the complexities of diverse cultural landscapes.
摘要：随着 LLM 在全球应用中的部署越来越多，文化敏感性变得至关重要，确保来自不同背景的用户感到受到尊重和理解。当这些模型不符合特定的文化规范时，可能会产生文化伤害，导致文化价值观被歪曲或违反。这项工作解决了确保 LLM 中文化敏感性的挑战，特别是在小参数模型中，这些模型通常缺乏捕捉全球文化细微差别所需的大量训练数据。我们提出了两个关键贡献：(1) 文化伤害测试数据集，通过暴露潜在文化不敏感性的场景来评估不同文化背景下的模型输出；(2) 文化一致的偏好数据集，旨在通过基于不同注释者的反馈进行微调来恢复文化敏感性。这些数据集有助于评估和增强 LLM，确保它们在不同文化环境中合乎道德且安全地部署。我们的结果表明，整合文化一致的反馈可显着改善模型行为，大大降低生成文化不敏感或有害内容的可能性。最终，这项工作为更具包容性和尊重性的人工智能系统铺平了道路，促进了法学硕士未来能够安全且合乎道德地应对多元文化景观的复杂性。

Title: Scaling Laws for Multilingual Language Models

Authors: Yifei He, Alon Benhaim, Barun Patra, Praneetha Vaddamanu, Sanchit Ahuja, Parul Chopra, Vishrav Chaudhary, Han Zhao, Xia Song
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12883
Pdf URL: https://arxiv.org/pdf/2410.12883
Copy Paste: [[2410.12883]] Scaling Laws for Multilingual Language Models(https://arxiv.org/abs/2410.12883)
Keywords: language model
Abstract: We propose a novel scaling law for general-purpose decoder-only language models (LMs) trained on multilingual data, addressing the problem of balancing languages during multilingual pretraining. A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. To address this, we shift the focus from individual languages to language families. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio, independent of other languages in the mixture. This insight simplifies the complexity of multilingual scaling and makes the analysis scalable to an arbitrary number of languages. Building on this hypothesis, we derive a power-law relationship that links performance with dataset size, model size and sampling ratios. This relationship enables us to predict performance across various combinations of the above three quantities, and derive the optimal sampling ratios at different model scales. To demonstrate the effectiveness and accuracy of our proposed scaling law, we perform a large-scale empirical study, training more than 100 models on 23 languages spanning 5 language families. Our experiments show that the optimal sampling ratios derived from small models (85M parameters) generalize effectively to models that are several orders of magnitude larger (1.2B parameters), offering a resource-efficient approach for multilingual LM training at scale.
摘要：我们为在多语言数据上训练的通用解码器专用语言模型 (LM) 提出了一种新颖的缩放定律，解决了在多语言预训练期间平衡语言的问题。研究多语言缩放的主要挑战是由于跨语言迁移而难以分析单个语言的表现。为了解决这个问题，我们将重点从单个语言转移到语言家族。我们引入并验证了一个假设，即每个语言家族的测试交叉熵损失仅由其自己的采样率决定，与混合中的其他语言无关。这一见解简化了多语言缩放的复杂性，并使分析可扩展到任意数量的语言。基于这一假设，我们得出了一个幂律关系，将性能与数据集大小、模型大小和采样率联系起来。这种关系使我们能够预测上述三个量的各种组合的性能，并得出不同模型规模的最佳采样率。为了证明我们提出的缩放定律的有效性和准确性，我们进行了一项大规模实证研究，在 5 个语言家族的 23 种语言上训练了 100 多个模型。我们的实验表明，从小型模型（8500 万个参数）得出的最佳采样率可以有效推广到几个数量级大（12 亿个参数）的模型，从而为大规模多语言 LM 训练提供了一种资源高效的方法。

Title: AT-RAG: An Adaptive RAG Model Enhancing Query Efficiency with Topic Filtering and Iterative Reasoning

Authors: Mohammad Reza Rezaei, Maziar Hafezi, Amit Satpathy, Lovell Hodge, Ebrahim Pourjafari
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12886
Pdf URL: https://arxiv.org/pdf/2410.12886
Copy Paste: [[2410.12886]] AT-RAG: An Adaptive RAG Model Enhancing Query Efficiency with Topic Filtering and Iterative Reasoning(https://arxiv.org/abs/2410.12886)
Keywords: gpt, llm
Abstract: Recent advancements in QA with LLMs, like GPT-4, have shown limitations in handling complex multi-hop queries. We propose AT-RAG, a novel multistep RAG incorporating topic modeling for efficient document retrieval and reasoning. Using BERTopic, our model dynamically assigns topics to queries, improving retrieval accuracy and efficiency. We evaluated AT-RAG on multi-hop QA benchmark datasets and a medical case study QA. Results show significant improvements in correctness, completeness, and relevance compared to existing methods. AT-RAG reduces retrieval time while maintaining high precision, making it suitable for general QA tasks and complex domain-specific challenges such as medical QA. The integration of topic filtering and iterative reasoning enables our model to handle intricate queries efficiently, which makes it suitable for applications that require nuanced information retrieval and decision-making.
摘要：GPT-4 等 LLM 问答领域的最新进展已显示出其在处理复杂多跳查询方面的局限性。我们提出了 AT-RAG，这是一种新颖的多步骤 RAG，结合了主题建模，可实现高效的文档检索和推理。使用 BERTopic，我们的模型可以动态地将主题分配给查询，从而提高检索准确性和效率。我们在多跳基准数据集 QA 和医学案例研究 QA 上评估了 AT-RAG。结果显示，与现有方法相比，正确性、完整性和相关性有显著提高。AT-RAG 在保持高精度的同时缩短了检索时间，使其适用于一般任务 QA 和复杂的特定领域挑战，例如医学 QA。主题过滤和迭代推理的集成使我们的模型能够高效处理复杂查询，这使其适用于需要细致入微的信息检索和决策的应用程序。
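
The topic-filtering step described in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: keyword overlap stands in for BERTopic's topic assignment, word overlap stands in for embedding similarity, and the `topic_keywords` table and documents are invented. The iterative reasoning stage is not shown.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())

def assign_topic(query, topic_keywords):
    """Pick the topic whose keyword set overlaps the query most.

    A stand-in for a real topic model such as BERTopic.
    """
    q = tokens(query)
    return max(topic_keywords, key=lambda t: len(q & topic_keywords[t]))

def retrieve(query, docs, topic_keywords, k=2):
    """Filter documents to the query's topic, then rank survivors by overlap."""
    topic = assign_topic(query, topic_keywords)
    pool = [d for d in docs if d["topic"] == topic]
    q = tokens(query)
    ranked = sorted(pool, key=lambda d: len(q & tokens(d["text"])), reverse=True)
    return ranked[:k]

# Hypothetical topics and documents.
topic_keywords = {
    "medicine": {"dose", "patient", "drug"},
    "finance": {"loan", "rate", "bank"},
}
docs = [
    {"topic": "medicine", "text": "recommended drug dose for an adult patient"},
    {"topic": "finance", "text": "bank loan rate table"},
    {"topic": "medicine", "text": "patient intake form"},
]

query = "what drug dose suits this patient"
hits = retrieve(query, docs, topic_keywords, k=2)
```

Filtering before ranking shrinks the candidate pool, which is one plausible reading of how topic assignment could reduce retrieval time.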

Title: REFINE on Scarce Data: Retrieval Enhancement through Fine-Tuning via Model Fusion of Embedding Models

Authors: Ambuje Gupta, Mrinal Rawat, Andreas Stolcke, Roberto Pieraccini
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2410.12890
Pdf URL: https://arxiv.org/pdf/2410.12890
Copy Paste: [[2410.12890]] REFINE on Scarce Data: Retrieval Enhancement through Fine-Tuning via Model Fusion of Embedding Models(https://arxiv.org/abs/2410.12890)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: Retrieval augmented generation (RAG) pipelines are commonly used in tasks such as question-answering (QA), relying on retrieving relevant documents from a vector store computed using a pretrained embedding model. However, if the retrieved context is inaccurate, the answers generated using the large language model (LLM) may contain errors or hallucinations. Although pretrained embedding models have advanced, adapting them to new domains remains challenging. Fine-tuning is a potential solution, but industry settings often lack the necessary fine-tuning data. To address these challenges, we propose REFINE, a novel technique that generates synthetic data from available documents and then uses a model fusion approach to fine-tune embeddings for improved retrieval performance in new domains, while preserving out-of-domain capability. We conducted experiments on two public datasets, SQUAD and RAG-12000, and a proprietary TOURISM dataset. Results demonstrate that even standard fine-tuning with the proposed data augmentation technique outperforms the vanilla pretrained model. Furthermore, when combined with model fusion, the proposed approach achieves superior performance, with a 5.76% improvement in recall on the TOURISM dataset, and 6.58% and 0.32% enhancement on SQUAD and RAG-12000 respectively.
摘要：检索增强生成 (RAG) 管道通常用于问答 (QA) 等任务，依赖于从使用预训练嵌入模型计算的向量存储中检索相关文档。但是，如果检索到的上下文不准确，则使用大型语言模型 (LLM) 生成的答案可能包含错误或幻觉。尽管预训练嵌入模型已经取得了进步，但将其适应新领域仍然具有挑战性。微调是一种潜在的解决方案，但行业环境通常缺乏必要的微调数据。为了应对这些挑战，我们提出了 REFINE，这是一种新技术，它从可用文档生成合成数据，然后使用模型融合方法微调嵌入，以提高新领域的检索性能，同时保留域外能力。我们对两个公共数据集：SQUAD 和 RAG-12000 以及专有 TOURISM 数据集进行了实验。结果表明，即使使用所提出的数据增强技术进行标准微调，其性能也优于原始预训练模型。此外，与模型融合相结合时，所提出的方法可实现卓越的性能，在 TOURISM 数据集上的召回率提高了 5.76%，在 SQUAD 和 RAG-12000 上的召回率分别提高了 6.58% 和 0.32%。

Title: Multi-trait User Simulation with Adaptive Decoding for Conversational Task Assistants

Authors: Rafael Ferreira, David Semedo, João Magalhães
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2410.12891
Pdf URL: https://arxiv.org/pdf/2410.12891
Copy Paste: [[2410.12891]] Multi-trait User Simulation with Adaptive Decoding for Conversational Task Assistants(https://arxiv.org/abs/2410.12891)
Keywords: language model
Abstract: Conversational systems must be robust to user interactions that naturally exhibit diverse conversational traits. Capturing and simulating these diverse traits coherently and efficiently presents a complex challenge. This paper introduces Multi-Trait Adaptive Decoding (mTAD), a method that generates diverse user profiles at decoding-time by sampling from various trait-specific Language Models (LMs). mTAD provides an adaptive and scalable approach to user simulation, enabling the creation of multiple user profiles without the need for additional fine-tuning. By analyzing real-world dialogues from the Conversational Task Assistant (CTA) domain, we identify key conversational traits and develop a framework to generate profile-aware dialogues that enhance conversational diversity. Experimental results validate the effectiveness of our approach in modeling single traits using specialized LMs, which can capture less common patterns, even in out-of-domain tasks. Furthermore, the results demonstrate that mTAD is a robust and flexible framework for combining diverse user simulators.
摘要：对话系统必须对自然表现出多样化对话特征的用户交互具有鲁棒性。连贯而有效地捕捉和模拟这些多样化特征是一项复杂的挑战。本文介绍了多特征自适应解码 (mTAD)，这是一种通过从各种特定于特征的语言模型 (LM) 中采样在解码时生成多样化用户配置文件的方法。mTAD 提供了一种自适应且可扩展的用户模拟方法，无需额外微调即可创建多个用户配置文件。通过分析来自对话任务助手 (CTA) 领域的真实对话，我们确定了关键的对话特征并开发了一个框架来生成可增强对话多样性的配置文件感知对话。实验结果验证了我们的方法在使用专门的 LM 建模单一特征方面的有效性，即使在域外任务中也可以捕捉不太常见的模式。此外，结果表明 mTAD 是一个用于组合不同用户模拟器的强大而灵活的框架。

Title: MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation

Authors: Aniket Deroy, Subhankar Maity, Sudeshna Sarkar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12893
Pdf URL: https://arxiv.org/pdf/2410.12893
Copy Paste: [[2410.12893]] MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation(https://arxiv.org/abs/2410.12893)
Keywords: language model, gpt, llm, prompt
Abstract: Automatic question generation is a critical task that involves evaluating question quality by considering factors such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack. However, human evaluations are costly and impractical for large-scale samples of generated questions. Therefore, we propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation process for questions generated by automated question generation systems. We experimented with several state-of-the-art LLMs, such as GPT-4, Gemini, and Llama2-70b. We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR, tending to be closer to the human baseline scores. Furthermore, we observed that Pearson's correlation coefficient between GPT-4 and human experts improved when using our proposed feedback-based approach, MIRROR, compared to direct prompting for evaluation. Error analysis shows that our proposed approach, MIRROR, significantly helps to improve relevance and appropriateness.
摘要：自动生成问题是一项关键任务，它涉及通过考虑参与度、教学价值和激发批判性思维的能力等因素来评估问题质量。这些方面需要像人类一样的理解和判断能力，而自动化系统目前缺乏这些能力。然而，对于大量生成的样本问题，人工评估成本高昂且不切实际。因此，我们提出了一个新系统 MIRROR（多 LLM 迭代审查和响应以优化评分），它利用大型语言模型 (LLM) 来自动化自动生成问题系统生成的问题的评估过程。我们尝试了几种最先进的 LLM，例如 GPT-4、Gemini 和 Llama2-70b。我们观察到，当使用基于反馈的方法 MIRROR 时，人类评估指标（即相关性、适当性、新颖性、复杂性和语法性）的分数会有所提高，趋向于更接近人类基线分数。此外，我们观察到，与直接提示评估相比，使用我们提出的基于反馈的方法 MIRROR 时，GPT-4 与人类专家之间的 Pearson 相关系数有所提高。错误分析表明，我们提出的方法 MIRROR 显著有助于提高相关性和适当性。
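
The agreement measure mentioned in this abstract, Pearson's correlation between model and human ratings, can be computed as below. This is a minimal sketch: the three score lists are invented for illustration and are not results from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical 1-5 ratings for five generated questions.
human = [1, 2, 3, 4, 5]
direct_prompting = [3, 3, 3, 4, 3]
with_feedback = [1, 2, 3, 4, 4]

r_direct = pearson(human, direct_prompting)
r_feedback = pearson(human, with_feedback)
```

A higher coefficient for the feedback-based ratings would indicate closer agreement with the human raters, which is the kind of comparison the abstract reports.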

Title: Large Language Models and the Rationalist Empiricist Debate

Authors: David King
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12895
Pdf URL: https://arxiv.org/pdf/2410.12895
Copy Paste: [[2410.12895]] Large Language Models and the Rationalist Empiricist Debate(https://arxiv.org/abs/2410.12895)
Keywords: language model, llm
Abstract: To many, Chomsky's debates with Quine and Skinner are an updated version of the Rationalist Empiricist debates of the 17th century, the consensus being that Chomsky's Rationalism was victorious. This dispute has reemerged with the advent of Large Language Models, with some arguing that LLMs vindicate rationalism because of the necessity of building in innate biases to make them work. The necessity of building in innate biases is taken to prove that empiricism hasn't got the conceptual resources to explain linguistic competence. Such claims depend on the nature of the empiricism one is endorsing. Externalized Empiricism has no difficulties with innate apparatus once it is determined empirically (Quine 1969). Thus, externalized empiricism is not refuted because of the need to build in innate biases in LLMs. Furthermore, the relevance of LLMs to the rationalist empiricist debate in relation to humans is dubious. For any claim about whether LLMs learn in an empiricist manner to be relevant to humans, it needs to be shown that LLMs and humans learn in the same way. Two key features distinguish humans and LLMs. Humans learn despite a poverty of stimulus and LLMs learn because of an incredibly rich stimulus. Human linguistic outputs are grounded in sensory experience and LLMs are not. These differences in how the two learn indicate that they use different underlying competencies to produce their output. Therefore, any claims about whether LLMs learn in an empiricist manner are not relevant to whether humans learn in an empiricist manner.
摘要：对许多人来说，乔姆斯基与奎因和斯金纳的辩论是 17 世纪理性主义经验主义辩论的更新版本。人们一致认为乔姆斯基的理性主义取得了胜利。随着大型语言模型的出现，这一争论再次出现。一些人认为，法学硕士证明了理性主义的正确性，因为必须建立内在偏见才能使其发挥作用。建立内在偏见的必要性被用来证明经验主义没有概念资源来解释语言能力。这种说法取决于人们所认可的经验主义的性质。一旦内在机制通过经验确定，外化经验主义就不会遇到任何困难（奎因 1969）。因此，外化经验主义不会因为需要在法学硕士中建立内在偏见而被驳斥。此外，法学硕士与理性主义经验主义关于人类的争论的相关性也值得怀疑。要说法学硕士是否以经验主义的方式学习与人类相关，就需要证明法学硕士和人类的学习方式相同。人类和法学硕士有两个关键特征。人类在刺激贫乏的情况下仍能学习，而法学硕士则因为刺激极其丰富而学习。人类的语言输出以感官体验为基础，而法学硕士则不是。两者学习方式的差异表明，它们都使用不同的底层能力来产生输出。因此，任何关于法学硕士是否以经验主义的方式学习的说法都与人类是否以经验主义的方式学习无关。

Title: A Survey on Data Synthesis and Augmentation for Large Language Models

Authors: Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12896
Pdf URL: https://arxiv.org/pdf/2410.12896
Copy Paste: [[2410.12896]] A Survey on Data Synthesis and Augmentation for Large Language Models(https://arxiv.org/abs/2410.12896)
Keywords: language model, llm
Abstract: The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.
摘要：大型语言模型 (LLM) 的成功本质上与大量、多样化和高质量的训练和评估数据可用性有关。然而，高质量数据的增长速度远远超过了训练数据集的扩展速度，导致数据枯竭危机迫在眉睫。这凸显了提高数据效率和探索新数据源的迫切需要。在这种背景下，合成数据已成为一种有前途的解决方案。目前，数据生成主要包括两种主要方法：数据增强和合成。本文全面回顾和总结了 LLM 整个生命周期中的数据生成技术，包括数据准备、预训练、微调、指令调整、偏好对齐和应用。此外，我们讨论了这些方法目前面临的限制，并探讨了未来发展和研究的潜在途径。我们的目标是让研究人员清楚地了解这些方法，使他们能够在构建 LLM 时迅速确定合适的数据生成策略，同时为未来的探索提供宝贵的见解。

Title: MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation

Authors: Satya Krishna Gorti, Ilan Gofman, Zhaoyan Liu, Jiapeng Wu, Noël Vouitsis, Guangwei Yu, Jesse C. Cresswell, Rasa Hosseinzadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12916
Pdf URL: https://arxiv.org/pdf/2410.12916
Copy Paste: [[2410.12916]] MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation(https://arxiv.org/abs/2410.12916)
Keywords: language model, gpt
Abstract: Text-to-SQL generation enables non-experts to interact with databases via natural language. Recent advances rely on large closed-source models like GPT-4 that present challenges in accessibility, privacy, and latency. To address these issues, we focus on developing small, efficient, and open-source text-to-SQL models. We demonstrate the benefits of sampling multiple candidate SQL generations and propose our method, MSc-SQL, to critique them using associated metadata. Our sample critiquing model evaluates multiple outputs simultaneously, achieving state-of-the-art performance compared to other open-source models while remaining competitive with larger models at a much lower cost. Full code can be found at this http URL.
摘要：文本到 SQL 生成使非专家能够通过自然语言与数据库交互。最近的进展依赖于 GPT-4 等大型闭源模型，这些模型在可访问性、隐私和延迟方面存在挑战。为了解决这些问题，我们专注于开发小型、高效且开源的文本到 SQL 模型。我们展示了对多个候选 SQL 生成进行采样的好处，并提出了我们的方法 MSc-SQL，使用相关元数据对它们进行批评。我们的样本批评模型同时评估多个输出，与其他开源模型相比，实现了最先进的性能，同时以更低的成本与更大的模型保持竞争力。完整代码可在此 http URL 中找到。

Title: Interpreting token compositionality in LLMs: A robustness analysis

Authors: Nura Aljaafari, Danilo S. Carvalho, André Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12924
Pdf URL: https://arxiv.org/pdf/2410.12924
Copy Paste: [[2410.12924]] Interpreting token compositionality in LLMs: A robustness analysis(https://arxiv.org/abs/2410.12924)
Keywords: language model, llm
Abstract: Understanding the internal mechanisms of large language models (LLMs) is integral to enhancing their reliability, interpretability, and inference processes. We present Constituent-Aware Pooling (CAP), a methodology designed to analyse how LLMs process compositional linguistic structures. Grounded in principles of compositionality, mechanistic interpretability, and information gain theory, CAP systematically intervenes in model activations through constituent-based pooling at various model levels. Our experiments on inverse definition modelling, hypernym and synonym prediction reveal critical insights into transformers' limitations in handling compositional abstractions. No specific layer integrates tokens into unified semantic representations based on their constituent parts. We observe fragmented information processing, which intensifies with model size, suggesting that larger models struggle more with these interventions and exhibit greater information dispersion. This fragmentation likely stems from transformers' training objectives and architectural design, preventing systematic and cohesive representations. Our findings highlight fundamental limitations in current transformer architectures regarding compositional semantics processing and model interpretability, underscoring the critical need for novel approaches in LLM design to address these challenges.
摘要：了解大型语言模型 (LLM) 的内部机制对于增强其可靠性、可解释性和推理过程至关重要。我们提出了成分感知池 (CAP)，这是一种旨在分析 LLM 如何处理组合语言结构的方法。CAP 基于组合性、机械可解释性和信息增益理论的原理，通过基于成分的池化在各个模型级别系统地干预模型激活。我们对逆定义建模、上位词和同义词预测的实验揭示了 Transformer 在处理组合抽象方面的局限性的关键见解。没有特定的层根据其组成部分将标记集成到统一的语义表示中。我们观察到碎片化的信息处理，这种现象随着模型大小的增加而加剧，这表明较大的模型在这些干预下会更加吃力，并且表现出更大的信息分散性。这种碎片化可能源于 Transformer 的训练目标和架构设计，阻碍了系统和有凝聚力的表示。我们的研究结果强调了当前变压器架构在组合语义处理和模型可解释性方面的根本限制，强调了 LLM 设计中迫切需要新方法来应对这些挑战。

Title: Enhancing Mathematical Reasoning in LLMs by Stepwise Correction

Authors: Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12934
Pdf URL: https://arxiv.org/pdf/2410.12934
Copy Paste: [[2410.12934]] Enhancing Mathematical Reasoning in LLMs by Stepwise Correction(https://arxiv.org/abs/2410.12934)
Keywords: language model, gpt, llm, prompt
Abstract: Best-of-N decoding methods instruct large language models (LLMs) to generate multiple solutions, score each using a scoring function, and select the highest-scoring one as the final answer to mathematical reasoning problems. However, this repeated independent process often leads to the same mistakes, making the selected solution still incorrect. We propose a novel prompting method named Stepwise Correction (StepCo) that helps LLMs identify and revise incorrect steps in their generated reasoning paths. It iterates verification and revision phases that employ a process-supervised verifier. The verify-then-revise process not only improves answer correctness but also reduces token consumption, since fewer paths need to be generated. With StepCo, a series of LLMs demonstrate exceptional performance. Notably, using GPT-4o as the backend LLM, StepCo achieves an average accuracy of 94.1 across eight datasets, significantly outperforming the state-of-the-art Best-of-N method by +2.4, while reducing token consumption by 77.8%.
摘要：Best-of-N 解码方法指示大型语言模型 (LLM) 生成多个解决方案，使用评分函数对每个解决方案进行评分，并选择得分最高的解决方案作为数学推理问题的最终答案。然而，这种重复的独立过程往往会导致相同的错误，使得所选的解决方案仍然不正确。我们提出了一种名为逐步校正 (StepCo) 的新型提示方法，可帮助 LLM 识别和修改其生成的推理路径中的错误步骤。它迭代使用过程监督验证器的验证和修订阶段。验证然后修改的过程不仅可以提高答案的正确性，还可以通过减少生成路径所需的路径来减少 token 消耗。借助 StepCo，一系列 LLM 表现出色。值得注意的是，使用 GPT-4o 作为后端 LLM，StepCo 在八个数据集上实现了 94.1 的平均准确率，显著优于最先进的 Best-of-N 方法 +2.4，同时将 token 消耗减少了 77.8%。
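
The contrast between Best-of-N selection and a verify-then-revise loop can be sketched as follows. This is an illustration under stated assumptions, not StepCo itself: `verify_step` and `revise_step` are hypothetical stand-ins for the process-supervised verifier and the LLM revision call, and the toy steps are simple additions so the example runs without a model.

```python
def best_of_n(candidates, score):
    """Best-of-N: score each independently generated solution, keep the top one."""
    return max(candidates, key=score)

def stepwise_correct(steps, verify_step, revise_step, max_rounds=5):
    """Verify-then-revise loop over a single reasoning path.

    verify_step(step) -> bool and revise_step(step) -> step are hypothetical
    callables supplied by the caller.
    """
    steps = list(steps)
    for _ in range(max_rounds):
        bad = [i for i, s in enumerate(steps) if not verify_step(s)]
        if not bad:
            break
        # Revise only the first failing step; the rest are re-checked next round.
        steps[bad[0]] = revise_step(steps[bad[0]])
    return steps

# Toy reasoning path: each step is (a, b, claimed_sum).
path = [(2, 3, 5), (5, 4, 10), (9, 1, 11)]

def verify(step):
    a, b, claimed = step
    return a + b == claimed

def revise(step):
    a, b, _ = step
    return (a, b, a + b)

fixed = stepwise_correct(path, verify, revise)
```

The loop repairs one path in place instead of sampling many independent ones, which is the intuition behind the reduced token consumption the abstract reports.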

Title: Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging

Authors: Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Pang Wei Koh, Jesse Dodge, Pradeep Dasigi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12937
Pdf URL: https://arxiv.org/pdf/2410.12937
Copy Paste: [[2410.12937]] Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging(https://arxiv.org/abs/2410.12937)
Keywords: language model, prompt
Abstract: Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we investigate the effectiveness of adding new skills to preexisting models by training on the new skills in isolation and later merging with the general model (e.g. using task vectors). In experiments focusing on scientific literature understanding, safety, and coding, we find that the parallel-train-then-merge procedure, which is significantly cheaper than retraining the models on updated data mixtures, is often comparably effective. Our experiments also show that parallel training is especially well-suited for enabling safety features in LMs relative to continued finetuning and retraining, as it dramatically improves model compliance with safe prompts while preserving its ability to refuse dangerous or harmful prompts.
摘要：目前，将通用语言模型适应新技能是一个昂贵的过程，必须在创建针对新技能的新指令数据集时重复这一过程，否则可能会导致模型忘记旧技能。在这项工作中，我们研究了通过单独训练新技能并随后与通用模型合并（例如使用任务向量）来向现有模型添加新技能的有效性。在专注于科学文献理解、安全性和编码的实验中，我们发现并行训练然后合并的过程通常同样有效，这比在更新的数据混合上重新训练模型便宜得多。我们的实验还表明，与持续微调和再训练相比，并行训练特别适合在 LM 中启用安全功能，因为它显着提高了模型对安全提示的遵守程度，同时保留了其拒绝危险或有害提示的能力。
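
The "train in isolation, then merge" idea with task vectors, which the abstract mentions as one merging option, can be sketched on toy parameters. This is a minimal illustration, not the authors' procedure: the parameter dictionaries, the single `weight` factor, and the values are assumptions made for the example.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a skill-tuned model and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }

def merge(general, vectors, weight=1.0):
    """Add weighted task vectors onto the general model's parameters."""
    merged = {name: list(values) for name, values in general.items()}
    for vec in vectors:
        for name, delta in vec.items():
            merged[name] = [m + weight * d for m, d in zip(merged[name], delta)]
    return merged

# Hypothetical parameters: one tensor "w" flattened to a list.
base = {"w": [1.0, 2.0]}
skill_tuned = {"w": [1.5, 2.0]}   # base trained on the new skill alone
general = {"w": [1.0, 3.0]}       # existing general-purpose model

skill_vec = task_vector(skill_tuned, base)
merged_full = merge(general, [skill_vec], weight=1.0)
merged_half = merge(general, [skill_vec], weight=0.5)
```

Because the skill is trained separately, the general model is never retrained; only the cheap addition step is repeated when a new skill arrives.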

Title: Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning

Authors: Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, Weipeng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12952
Pdf URL: https://arxiv.org/pdf/2410.12952
Copy Paste: [[2410.12952]] Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning(https://arxiv.org/abs/2410.12952)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have exhibited significant potential in performing diverse tasks, including the ability to call functions or use external tools to enhance their performance. While current research on function calling by LLMs primarily focuses on single-turn interactions, this paper addresses the overlooked necessity for LLMs to engage in multi-turn function calling--critical for handling compositional, real-world queries that require planning with functions, not merely using them. To facilitate this, we introduce an approach, BUTTON, which generates synthetic compositional instruction tuning data via bottom-up instruction construction and top-down trajectory generation. In the bottom-up phase, we generate simple atomic tasks based on real-world scenarios and build compositional tasks using heuristic strategies based on atomic tasks. Corresponding functions are then developed for these compositional tasks. The top-down phase features a multi-agent environment where interactions among simulated humans, assistants, and tools are utilized to gather multi-turn function calling trajectories. This approach ensures task compositionality and allows for effective function and trajectory generation by examining atomic tasks within compositional tasks. We produce a dataset, BUTTONInstruct, comprising 8k data points and demonstrate its effectiveness through extensive experiments across various LLMs.
摘要：大型语言模型 (LLM) 在执行各种任务方面表现出巨大的潜力，包括调用函数或使用外部工具来增强其性能的能力。虽然目前对 LLM 函数调用的研究主要集中在单轮交互上，但本文讨论了 LLM 参与多轮函数调用的被忽视的必要性——这对于处理需要使用函数进行规划而不仅仅是使用函数的组合、真实世界查询至关重要。为了促进这一点，我们引入了一种方法 BUTTON，它通过自下而上的指令构建和自上而下的轨迹生成来生成合成组合指令调整数据。在自下而上的阶段，我们根据真实世界场景生成简单的原子任务，并使用基于原子任务的启发式策略构建组合任务。然后为这些组合任务开发相应的功能。自上而下的阶段以多智能体环境为特色，其中利用模拟人、助手和工具之间的交互来收集多轮函数调用轨迹。这种方法确保了任务的组合性，并通过检查组合任务中的原子任务来实现有效的功能和轨迹生成。我们生成了一个包含 8k 个数据点的数据集 BUTTONInstruct，并通过在各种 LLM 中进行的大量实验证明了其有效性。

Title: Self-Pluralising Culture Alignment for Large Language Models

Authors: Shaoyang Xu, Yongqi Leng, Linhao Yu, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12971
Pdf URL: https://arxiv.org/pdf/2410.12971
Copy Paste: [[2410.12971]] Self-Pluralising Culture Alignment for Large Language Models(https://arxiv.org/abs/2410.12971)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) become increasingly accessible in many countries, it is essential to align them to serve pluralistic human values across cultures. However, pluralistic culture alignment in LLMs remains an open problem. In this paper, we propose CultureSPA, a Self-Pluralising Culture Alignment framework that allows LLMs to simultaneously align to pluralistic cultures. The framework first generates questions on various culture topics, then yields LLM outputs in response to these generated questions under both culture-aware and culture-unaware settings. By comparing culture-aware/unaware outputs, we are able to detect and collect culture-related instances. These instances are employed to fine-tune LLMs to serve pluralistic cultures in either a culture-joint or culture-specific way. Extensive experiments demonstrate that CultureSPA significantly improves the alignment of LLMs to diverse cultures without compromising general abilities. Further improvements can be achieved if CultureSPA is combined with advanced prompt engineering techniques. Comparisons between culture-joint and culture-specific tuning strategies, along with variations in data quality and quantity, illustrate the robustness of our method. We also explore the mechanisms underlying CultureSPA and the relations between different cultures it reflects.
摘要：随着大型语言模型 (LLM) 在许多国家越来越普及，必须将它们与跨文化的多元人类价值观相结合。然而，LLM 中的多元文化一致性仍然是一个悬而未决的问题。在本文中，我们提出了 CultureSPA，这是一个自多元文化一致性框架，允许 LLM 同时与多元文化保持一致。该框架首先生成有关各种文化主题的问题，然后在文化感知和非文化感知的环境下针对这些生成的问题生成 LLM 输出。通过比较文化感知/非文化感知的输出，我们能够检测和收集与文化相关的实例。这些实例用于微调 LLM，以文化联合或文化特定的方式服务于多元文化。大量实验表明，CultureSPA 显著提高了 LLM 与不同文化的一致性，同时又不损害一般能力。如果将 CultureSPA 与先进的快速工程技术相结合，则可以实现进一步的改进。文化联合和文化特定调整策略之间的比较，以及数据质量和数量的变化，说明了我们方法的稳健性。我们还探索了 CultureSPA 背后的机制及其反映的不同文化之间的关系。

Title: Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks

Authors: Rudra Murthy, Prince Kumar, Praveen Venkateswaran, Danish Contractor
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12972
Pdf URL: https://arxiv.org/pdf/2410.12972
Copy Paste: [[2410.12972]] Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks(https://arxiv.org/abs/2410.12972)
Keywords: language model, gpt, llm
Abstract: In this work, we focus our attention on developing a benchmark for instruction-following where it is easy to verify both task performance and instruction-following capabilities. We adapt existing knowledge benchmarks and augment them with instructions that are a) conditional on correctly answering the knowledge task or b) use the space of candidate options in multiple-choice knowledge-answering tasks. This allows us to study model characteristics, such as their change in performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions. In contrast to existing benchmarks for instruction following, we not only measure instruction-following capabilities but also use LLM-free methods to study task performance. We study a series of openly available large language models of varying parameter sizes (1B-405B) and closed-source models, namely GPT-4o-mini and GPT-4o. We find that even large-scale instruction-tuned LLMs fail to follow simple instructions in zero-shot settings. We release our dataset, the benchmark, code, and results for future work.
摘要：在这项工作中，我们将重点放在开发指令跟随基准上，以便于验证任务性能和指令跟随能力。我们调整现有的知识基准，并用以下指令增强它们：a) 以正确回答知识任务为条件，或 b) 在多项选择知识回答任务中使用候选选项空间。这使我们能够研究模型特征，例如在存在修改答案的指令和干扰指令的情况下，它们在知识任务上的表现变化。与现有的指令跟随基准相比，我们不仅测量指令跟随能力，还使用无 LLM 方法来研究任务性能。我们研究了一系列公开可用的不同参数大小的大型语言模型（1B-405B）和闭源模型，即 GPT-4o-mini、GPT-4o。我们发现即使是大规模指令调整的 LLM 也无法在零样本设置中遵循简单的指令。我们发布了我们的数据集、基准、代码和结果以供将来使用。

Title: BenchmarkCards: Large Language Model and Risk Reporting

Authors: Anna Sokol, Nuno Moniz, Elizabeth Daly, Michael Hind, Nitesh Chawla
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12974
Pdf URL: https://arxiv.org/pdf/2410.12974
Copy Paste: [[2410.12974]] BenchmarkCards: Large Language Model and Risk Reporting(https://arxiv.org/abs/2410.12974)
Keywords: language model, llm
Abstract: Large language models (LLMs) offer powerful capabilities but also introduce significant risks. One way to mitigate these risks is through comprehensive pre-deployment evaluations using benchmarks designed to test for specific vulnerabilities. However, the rapidly expanding body of LLM benchmark literature lacks a standardized method for documenting crucial benchmark details, hindering consistent use and informed selection. BenchmarkCards addresses this gap by providing a structured framework specifically for documenting LLM benchmark properties rather than defining the entire evaluation process itself. BenchmarkCards do not prescribe how to measure or interpret benchmark results (e.g., defining "correctness") but instead offer a standardized way to capture and report critical characteristics like targeted risks and evaluation methodologies, including properties such as bias and fairness. This structured metadata facilitates informed benchmark selection, enabling researchers to choose appropriate benchmarks and promoting transparency and reproducibility in LLM evaluation.
摘要：大型语言模型 (LLM) 提供了强大的功能，但也带来了巨大的风险。减轻这些风险的一种方法是使用旨在测试特定漏洞的基准进行全面的部署前评估。然而，迅速扩大的 LLM 基准文献缺乏记录关键基准细节的标准化方法，阻碍了一致使用和明智选择。BenchmarkCards 通过提供专门用于记录 LLM 基准属性的结构化框架而不是定义整个评估过程本身来解决这一差距。BenchmarkCards 不规定如何测量或解释基准结果（例如，定义“正确性”），而是提供一种标准化的方法来捕获和报告关键特征，如目标风险和评估方法，包括偏见和公平性等属性。这种结构化的元数据有助于明智的基准选择，使研究人员能够选择合适的基准并促进 LLM 评估的透明度和可重复性。

Title: Leveraging LLMs for Translating and Classifying Mental Health Data

Authors: Konstantinos Skianis, A. Seza Doğruöz, John Pavlopoulos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12985
Pdf URL: https://arxiv.org/pdf/2410.12985
Copy Paste: [[2410.12985]] Leveraging LLMs for Translating and Classifying Mental Health Data(https://arxiv.org/abs/2410.12985)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly used in medical fields. In mental health support, the early identification of linguistic markers associated with mental health conditions can provide valuable support to mental health professionals, and reduce long waiting times for patients. Despite the benefits of LLMs for mental health support, there is limited research on their application in mental health systems for languages other than English. Our study addresses this gap by focusing on the detection of depression severity in Greek through user-generated posts which are automatically translated from English. Our results show that GPT3.5-turbo is not very successful in identifying the severity of depression in English, and it has a varying performance in Greek as well. Our study underscores the necessity for further research, especially in languages with fewer resources. Also, careful implementation is necessary to ensure that LLMs are used effectively in mental health platforms, and human supervision remains crucial to avoid misdiagnosis.
摘要：大型语言模型 (LLM) 在医学领域的应用越来越广泛。在心理健康支持方面，及早识别与心理健康状况相关的语言标记可以为心理健康专业人员提供宝贵的支持，并减少患者的长时间等待。尽管 LLM 对心理健康支持有好处，但对其在英语以外语言的心理健康系统中的应用研究有限。我们的研究通过关注通过自动从英语翻译成的用户生成帖子检测希腊语中的抑郁症严重程度来解决这一差距。我们的结果表明，GPT3.5-turbo 在识别英语抑郁症严重程度方面不太成功，在希腊语中的表现也各不相同。我们的研究强调了进一步研究的必要性，特别是在资源较少的语言中。此外，必须谨慎实施以确保 LLM 在心理健康平台中得到有效使用，而人工监督对于避免误诊仍然至关重要。

Title: Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models

Authors: Iaroslav Chelombitko, Egor Safronov, Aleksey Komissarov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12989
Pdf URL: https://arxiv.org/pdf/2410.12989
Copy Paste: [[2410.12989]] Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models(https://arxiv.org/abs/2410.12989)
Keywords: language model, llm
Abstract: In the development of Large Language Models (LLMs), considerable attention has been given to the quality of training datasets. However, the role of tokenizers in the LLM training pipeline, particularly for multilingual models, has received less focus. The quality of tokenization can significantly impact a model's ability to handle diverse languages effectively. We introduce Qtok, a tool designed to assess tokenizer quality with a specific emphasis on their performance in multilingual contexts. Our research proposes a set of metrics for evaluating tokenizer quality, including measures of language coverage, token completeness, and distribution across languages and linguistic categories. Qtok applies these metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts. Our analysis revealed significant variations in token distribution across languages and categories, highlighting potential biases and areas for improvement in current tokenization strategies. This research contributes to the field of tokenizer evaluation within multilingual LLM development by providing a systematic approach to assessing tokenizer quality. Our findings highlight the critical role of tokenization in multilingual LLM capability. The Qtok tool and our analysis methodology offer practical means for researchers to evaluate and improve tokenization strategies for multilingual applications. We offer a method to compare tokenizer quality across these metrics, which may be useful when selecting or adjusting tokenizers for specific multilingual LLM applications.
摘要：在大型语言模型 (LLM) 的开发中，训练数据集的质量得到了相当大的关注。然而，标记器在 LLM 训练流程中的作用，尤其是对于多语言模型，却没有受到太多关注。标记化的质量会显著影响模型有效处理多种语言的能力。我们推出了 Qtok，这是一种旨在评估标记器质量的工具，特别强调它们在多语言环境中的性能。我们的研究提出了一套用于评估标记器质量的指标，包括语言覆盖率、标记完整性以及跨语言和语言类别的分布。Qtok 应用这些指标来评估来自 58 个公开可用模型的 13 个不同的标记器，分析它们在不同语言环境中的输出。我们的分析揭示了标记在不同语言和类别之间的分布存在显著差异，突出了当前标记化策略中的潜在偏差和需要改进的领域。这项研究通过提供一种评估标记器质量的系统方法，为多语言 LLM 开发中的标记器评估领域做出了贡献。我们的研究结果强调了标记化在多语言 LLM 能力中的关键作用。 Qtok 工具和我们的分析方法为研究人员提供了评估和改进多语言应用程序的标记化策略的实用方法。我们提供了一种通过这些指标比较标记化器质量的方法，这在为特定的多语言 LLM 应用程序选择或调整标记化器时可能会很有用。

Title: "Let's Argue Both Sides": Argument Generation Can Force Small Models to Utilize Previously Inaccessible Reasoning Capabilities

Authors: Kaveh Eskandari Miandoab, Vasanth Sarathy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12997
Pdf URL: https://arxiv.org/pdf/2410.12997
Copy Paste: [[2410.12997]] "Let's Argue Both Sides": Argument Generation Can Force Small Models to Utilize Previously Inaccessible Reasoning Capabilities(https://arxiv.org/abs/2410.12997)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs), despite achieving state-of-the-art results in a number of evaluation tasks, struggle to maintain their performance when logical reasoning is strictly required to correctly infer a prediction. In this work, we propose Argument Generation as a method of forcing models to utilize their reasoning capabilities when other approaches such as chain-of-thought reasoning prove insufficient. Our method involves the generation of arguments for each possible inference result, and asking the end model to rank the generated arguments. We show that Argument Generation can serve as an appropriate substitute for zero-shot prompting techniques without the requirement to add layers of complexity. Furthermore, we argue that knowledge-probing techniques such as chain-of-thought reasoning and Argument Generation are only useful when further reasoning is required to infer a prediction, making them auxiliary to more common zero-shot approaches. Finally, we demonstrate that our approach forces larger gains in smaller language models, showcasing a complex relationship between model size and prompting methods in foundation models.
摘要：大型语言模型 (LLM) 尽管在许多评估任务中取得了最先进的结果，但当严格要求逻辑推理才能正确推断预测时，它们很难保持其性能。在这项工作中，我们提出了一种论据生成方法，当其他方法（如思路链推理）被证明不足时，它可以强制模型利用其推理能力。我们的方法涉及为每个可能的推理结果生成论据，并要求最终模型对生成的论据进行排序。我们表明，论据生成可以作为零样本提示技术的适当替代品，而无需增加复杂性。此外，我们认为，诸如思路链推理和论据生成之类的知识探索技术仅在需要进一步推理来推断预测时才有用，这使它们成为更常见的零样本方法的辅助。最后，我们证明我们的方法迫使较小的语言模型获得更大的收益，展示了基础模型中模型大小和提示方法之间的复杂关系。

Title: POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization

Authors: Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, Mert R. Sabuncu, Xia Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12999
Pdf URL: https://arxiv.org/pdf/2410.12999
Copy Paste: [[2410.12999]] POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization(https://arxiv.org/abs/2410.12999)
Keywords: language model, gpt, prompt
Abstract: Balancing safety and usefulness in large language models has become a critical challenge in recent years. Models often exhibit unsafe behavior or adopt an overly cautious approach, leading to frequent overrefusal of benign prompts, which reduces their usefulness. Addressing these issues requires methods that maintain safety while avoiding overrefusal. In this work, we examine how the overgeneration of training data using advanced teacher models (e.g., GPT-4o), including responses to both general-purpose and toxic prompts, influences the safety and overrefusal balance of instruction-following language models. Additionally, we present POROver, a strategy to use preference optimization methods in order to reduce overrefusal, via employing a superior teacher model's completions. Our results show that overgenerating completions for general-purpose prompts significantly improves the balance between safety and usefulness. Specifically, the F1 score calculated between safety and usefulness increases from 70.8% to 88.3%. Moreover, overgeneration for toxic prompts substantially reduces overrefusal, decreasing it from 94.4% to 45.2%. Furthermore, preference optimization algorithms, when applied with carefully curated preference data, can effectively reduce a model's overrefusal from 45.2% to 15.0% while maintaining comparable safety levels. Our code and data are available at this https URL.
摘要：近年来，平衡大型语言模型的安全性和实用性已成为一项关键挑战。模型经常表现出不安全的行为或采取过于谨慎的方法，导致经常过度拒绝良性提示，从而降低了它们的实用性。解决这些问题需要既能保持安全性又能避免过度拒绝的方法。在这项工作中，我们研究了使用高级教师模型（例如 GPT-4o）过度生成训练数据（包括对通用提示和有害提示的响应）如何影响遵循指令的语言模型的安全性和过度拒绝平衡。此外，我们提出了 POROver，这是一种使用偏好优化方法来减少过度拒绝的策略，通过使用高级教师模型的完成。我们的结果表明，过度生成通用提示的完成可显著改善安全性和实用性之间的平衡。具体而言，安全性和实用性之间计算的 F1 分数从 70.8% 增加到 88.3%。此外，过度生成有毒提示可大幅减少过度拒绝，从 94.4% 降至 45.2%。此外，偏好优化算法在与精心策划的偏好数据结合使用时，可有效将模型的过度拒绝从 45.2% 降至 15.0%，同时保持相当的安全水平。我们的代码和数据可在此 https URL 上找到。
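
The POROver abstract above reports an F1 score "calculated between safety and usefulness". A minimal sketch of that kind of combined score, assuming it is the harmonic mean of a safety rate and a usefulness rate (the abstract does not spell out the formula, and the input values below are made up for illustration):

```python
def f1_between(safety, usefulness):
    """Harmonic mean of two rates in [0, 1]; assumed form of the combined score."""
    if safety + usefulness == 0:
        return 0.0
    return 2 * safety * usefulness / (safety + usefulness)


# Hypothetical rates, not taken from the paper.
balanced = f1_between(0.8, 0.8)
lopsided = f1_between(1.0, 0.5)
print(balanced, lopsided)
```

The harmonic mean penalizes imbalance: a model that is perfectly safe but refuses half of the benign prompts scores lower than one that is moderately good on both axes.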

Title: LEGAL-UQA: A Low-Resource Urdu-English Dataset for Legal Question Answering

Authors: Faizan Faisal, Umair Yousaf
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13013
Pdf URL: https://arxiv.org/pdf/2410.13013
Copy Paste: [[2410.13013]] LEGAL-UQA: A Low-Resource Urdu-English Dataset for Legal Question Answering(https://arxiv.org/abs/2410.13013)
Keywords: gpt
Abstract: We present LEGAL-UQA, the first Urdu legal question-answering dataset derived from Pakistan's constitution. This parallel English-Urdu dataset includes 619 question-answer pairs, each with corresponding legal article contexts, addressing the need for domain-specific NLP resources in low-resource languages. We describe the dataset creation process, including OCR extraction, manual refinement, and GPT-4-assisted translation and generation of QA pairs. Our experiments evaluate the latest generalist language and embedding models on LEGAL-UQA, with Claude-3.5-Sonnet achieving 99.19% human-evaluated accuracy. We fine-tune mt5-large-UQA-1.0, highlighting the challenges of adapting multilingual models to specialized domains. Additionally, we assess retrieval performance, finding OpenAI's text-embedding-3-large outperforms Mistral's mistral-embed. LEGAL-UQA bridges the gap between global NLP advancements and localized applications, particularly in constitutional law, and lays the foundation for improved legal information access in Pakistan.
摘要：我们介绍了 LEGAL-UQA，这是第一个源自巴基斯坦宪法的乌尔都语法律问答数据集。这个平行的英语-乌尔都语数据集包括 619 个问答对，每个问答对都有相应的法律文章上下文，满足了低资源语言对特定领域 NLP 资源的需求。我们描述了数据集的创建过程，包括 OCR 提取、手动细化以及 GPT-4 辅助翻译和 QA 对的生成。我们的实验评估了 LEGAL-UQA 上最新的通用语言和嵌入模型，其中 Claude-3.5-Sonnet 实现了 99.19% 的人工评估准确率。我们对 mt5-large-UQA-1.0 进行了微调，重点介绍了将多语言模型适应专业领域的挑战。此外，我们评估了检索性能，发现 OpenAI 的 text-embedding-3-large 优于 Mistral 的 mistral-embed。 LEGAL-UQA 弥合了全球 NLP 进步与本地化应用（特别是在宪法领域）之间的差距，并为改善巴基斯坦的法律信息获取奠定了基础。

Title: LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks

Authors: Akshara Prabhakar, Yuanzhi Li, Karthik Narasimhan, Sham Kakade, Eran Malach, Samy Jelassi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13025
Pdf URL: https://arxiv.org/pdf/2410.13025
Copy Paste: [[2410.13025]] LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks(https://arxiv.org/abs/2410.13025)
Keywords: language model, llm
Abstract: Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient fine-tuning of Large Language Models (LLMs). We study how different LoRA modules can be merged to achieve skill composition -- testing the performance of the merged model on a target task that involves combining multiple skills, each skill coming from a single LoRA. This setup is favorable when it is difficult to obtain training data for the target task and when it can be decomposed into multiple skills. First, we identify practically occurring use-cases that can be studied under the realm of skill composition, e.g., solving hard math-word problems with code, or creating a bot to answer questions on proprietary manuals or about domain-specialized corpora. Our main contribution is to show that concatenation of LoRAs (CAT), which optimally averages LoRAs that were individually trained on different skills, outperforms existing model- and data-merging techniques; for instance, on math-word problems, CAT beats these methods by an average of 43% and 12% respectively. Thus, this paper advocates model merging as an efficient way to solve compositional tasks and underscores CAT as a simple, compute-friendly and effective procedure. To our knowledge, this is the first work demonstrating the superiority of model merging over data mixing for binary skill composition tasks.
摘要：低秩自适应 (LoRA) 是一种流行的大型语言模型 (LLM) 参数高效微调技术。我们研究如何合并不同的 LoRA 模块以实现技能组合——测试合并模型在涉及组合多种技能的目标任务上的性能，每种技能都来自单个 LoRA。当难以获得目标任务的训练数据并且可将其分解为多种技能时，此设置是有利的。首先，我们确定可以在技能组合领域进行研究的实际用例，例如使用代码解决困难的数学应用题，创建一个机器人来回答有关专有手册或领域专业语料库的问题。我们的主要贡献是表明 LoRA 的串联 (CAT) 优于现有的模型和数据合并技术，CAT 可以最佳地平均针对不同技能进行单独训练的 LoRA；例如在数学应用题上，CAT 分别比这些方法平均高出 43% 和 12%。因此，本文主张模型合并是解决组合任务的有效方法，并强调 CAT 是一种简单、计算友好且有效的程序。据我们所知，这是第一篇证明模型合并优于数据混合的二元技能组合任务的研究。

Title: When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems

Authors: Asir Saadat, Tasmia Binte Sogir, Md Taukir Azam Chowdhury, Syem Aziz
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13029
Pdf URL: https://arxiv.org/pdf/2410.13029
Copy Paste: [[2410.13029]] When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems(https://arxiv.org/abs/2410.13029)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Large language models (LLMs) are increasingly relied upon to solve complex mathematical word problems. However, being susceptible to hallucination, they may generate inaccurate results when presented with unanswerable questions, raising concerns about their potential harm. While GPT models are now widely used and trusted, how they can effectively abstain from answering unanswerable math problems, and how their abstention capabilities can be enhanced, has not been rigorously investigated. In this paper, we investigate whether GPTs can appropriately respond to unanswerable math word problems by applying prompts typically used in solvable mathematical scenarios. Our experiments utilize the Unanswerable Word Math Problem (UWMP) dataset, directly leveraging GPT model APIs. Evaluation metrics are introduced, which integrate three key factors: abstention, correctness and confidence. Our findings reveal critical gaps in GPT models and the hallucinations they suffer from on unsolvable problems, highlighting the need for improved models capable of better managing uncertainty and complex reasoning in math word problem-solving contexts.
摘要：大型语言模型 (LLM) 越来越依赖于解决复杂的数学应用题。然而，由于容易产生幻觉，它们在面对无法回答的问题时可能会产生不准确的结果，这引发了人们对其潜在危害的担忧。虽然 GPT 模型现在被广泛使用和信任，但如何有效地避免回答无法回答的数学问题以及如何提高它们的弃权能力尚未得到严格研究。在本文中，我们研究了 GPT 是否可以通过应用通常用于可解数学场景的提示来适当地应对无法回答的数学应用题。我们的实验利用无法回答的数学应用题 (UWMP) 数据集，直接利用 GPT 模型 API。引入了评估指标，它整合了三个关键因素：弃权、正确性和信心。我们的研究结果揭示了 GPT 模型中的关键缺陷及其对无法解决的问题产生的幻觉，凸显了需要改进模型，以便更好地管理数学应用题解决环境中的不确定性和复杂推理。

Title: LFOSum: Summarizing Long-form Opinions with Large Language Models

Authors: Mir Tafseer Nayeem, Davood Rafiei
Subjects: cs.CL, cs.AI, cs.ET, cs.HC, cs.IR
Abstract URL: https://arxiv.org/abs/2410.13037
Pdf URL: https://arxiv.org/pdf/2410.13037
Copy Paste: [[2410.13037]] LFOSum: Summarizing Long-form Opinions with Large Language Models(https://arxiv.org/abs/2410.13037)
Keywords: language model, llm
Abstract: Online reviews play a pivotal role in influencing consumer decisions across various domains, from purchasing products to selecting hotels or restaurants. However, the sheer volume of reviews -- often containing repetitive or irrelevant content -- leads to information overload, making it challenging for users to extract meaningful insights. Traditional opinion summarization models face challenges in handling long inputs and large volumes of reviews, while newer Large Language Model (LLM) approaches often fail to generate accurate and faithful summaries. To address those challenges, this paper introduces (1) a new dataset of long-form user reviews, each entity comprising over a thousand reviews, (2) two training-free LLM-based summarization approaches that scale to long inputs, and (3) automatic evaluation metrics. Our dataset of user reviews is paired with in-depth and unbiased critical summaries by domain experts, serving as a reference for evaluation. Additionally, our novel reference-free evaluation metrics provide a more granular, context-sensitive assessment of summary faithfulness. We benchmark several open-source and closed-source LLMs using our methods. Our evaluation reveals that LLMs still face challenges in balancing sentiment and format adherence in long-form summaries, though open-source models can narrow the gap when relevant information is retrieved in a focused manner.
摘要：在线评论在影响消费者从购买产品到选择酒店或餐厅等各个领域的决策方面发挥着关键作用。然而，大量的评论（通常包含重复或不相关的内容）会导致信息过载，使用户难以提取有意义的见解。传统的观点总结模型在处理长输入和大量评论方面面临挑战，而较新的大型语言模型 (LLM) 方法通常无法生成准确而忠实的摘要。为了应对这些挑战，本文介绍了 (1) 一个新的长篇用户评论数据集，每个实体包含一千多条评论，(2) 两种无需训练的基于 LLM 的总结方法，可扩展到长输入，以及 (3) 自动评估指标。我们的用户评论数据集与领域专家的深入和公正的批判性总结相结合，作为评估的参考。此外，我们新颖的无参考评估指标提供了更细致、上下文敏感的总结忠实度评估。我们使用我们的方法对几个开源和闭源 LLM 进行了基准测试。我们的评估表明，尽管开源模型可以在以集中的方式检索相关信息时缩小差距，但 LLM 在平衡长篇摘要中的情感和格式遵守方面仍然面临挑战。

Title: Channel-Wise Mixed-Precision Quantization for Large Language Models

Authors: Zihan Chen, Bike Xie, Jundong Li, Cong Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13056
Pdf URL: https://arxiv.org/pdf/2410.13056
Copy Paste: [[2410.13056]] Channel-Wise Mixed-Precision Quantization for Large Language Models(https://arxiv.org/abs/2410.13056)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve the critical information, thereby minimizing the quantization loss. Experiments on different sizes of LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage. CMPQ thus represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
摘要：大型语言模型 (LLM) 在各种语言任务中都取得了显著的成功，但由于其参数规模大，对内存要求高，因此在边缘设备上部署它们仍然具有挑战性。仅权重量化为减少 LLM 的内存占用提供了一种有前途的解决方案。然而，现有方法主要侧重于整数位量化，这限制了它们对小数位量化任务的适应性，并阻碍了设备上可用存储空间的充分利用。在本文中，我们介绍了通道混合精度量化 (CMPQ)，这是一种新颖的混合精度量化方法，它基于激活分布以通道模式分配量化精度。通过为不同的权重通道分配不同的精度级别，CMPQ 可以适应任何位宽约束。CMPQ 采用非均匀量化策略，并结合两种异常值提取技术，协同保留关键信息，从而最大限度地减少量化损失。在不同大小的 LLM 上进行的实验表明，CMPQ 不仅提高了整数位量化任务的性能，而且在内存使用量略有增加的情况下实现了显著的性能提升。因此，CMPQ 代表了一种自适应且有效的 LLM 量化方法，可在各种设备功能中提供显著的优势。

Title: Is Semantic Chunking Worth the Computational Cost?

Authors: Renyi Qu, Ruixuan Tu, Forrest Bao
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2410.13070
Pdf URL: https://arxiv.org/pdf/2410.13070
Copy Paste: [[2410.13070]] Is Semantic Chunking Worth the Computational Cost?(https://arxiv.org/abs/2410.13070)
Keywords: retrieval-augmented generation
Abstract: Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the actual benefits over simpler fixed-size chunking, where documents are split into consecutive, fixed-size segments, remain unclear. This study systematically evaluates the effectiveness of semantic chunking using three common retrieval-related tasks: document retrieval, evidence retrieval, and retrieval-based answer generation. The results show that the computational costs associated with semantic chunking are not justified by consistent performance gains. These findings challenge the previous assumptions about semantic chunking and highlight the need for more efficient chunking strategies in RAG systems.
摘要：检索增强生成 (RAG) 系统的最新进展使语义分块变得流行，该技术旨在通过将文档划分为语义连贯的片段来提高检索性能。尽管语义分块的采用率越来越高，但与更简单的固定大小分块（将文档划分为连续的固定大小的片段）相比，语义分块的实际优势仍不清楚。本研究使用三个常见的检索相关任务系统地评估了语义分块的有效性：文档检索、证据检索和基于检索的答案生成。结果表明，语义分块相关的计算成本与持续的性能提升不相称。这些发现挑战了先前关于语义分块的假设，并强调了 RAG 系统中需要更有效的分块策略。
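
The fixed-size baseline described in the abstract above, where a document is split into consecutive segments of equal length, can be sketched in a few lines. This is an illustrative version only: the function name, the whitespace tokenization and the chunk size are assumptions, not details from the paper.

```python
def fixed_size_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Whitespace splitting stands in for a real tokenizer here.
words = "RAG systems split documents into chunks before indexing them".split()
chunks = fixed_size_chunks(words, chunk_size=4)
for chunk in chunks:
    print(" ".join(chunk))
```

Semantic chunking would replace the fixed stride with boundaries chosen by a similarity model, which is where the extra computational cost discussed in the abstract comes from.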

Title: PromptExp: Multi-granularity Prompt Explanation of Large Language Models

Authors: Ximing Dong, Shaowei Wang, Dayi Lin, Gopi Krishnan Rajbahadur, Boquan Zhou, Shichao Liu, Ahmed E. Hassan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13073
Pdf URL: https://arxiv.org/pdf/2410.13073
Copy Paste: [[2410.13073]] PromptExp: Multi-granularity Prompt Explanation of Large Language Models(https://arxiv.org/abs/2410.13073)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models excel in tasks like natural language understanding and text generation. Prompt engineering plays a critical role in leveraging LLMs effectively. However, LLMs' black-box nature hinders their interpretability and effective prompt engineering. A wide range of model explanation approaches have been developed for deep learning models. However, these local explanations are designed for single-output tasks like classification and regression, and cannot be directly applied to LLMs, which generate sequences of tokens. Recent efforts in LLM explanation focus on natural language explanations, but they are prone to hallucinations and inaccuracies. To address this, we introduce OurTool, a framework for multi-granularity prompt explanations by aggregating token-level insights. OurTool introduces two token-level explanation approaches: 1. an aggregation-based approach combining local explanation techniques, and 2. a perturbation-based approach with novel techniques to evaluate token masking impact. OurTool supports both white-box and black-box explanations and extends explanations to higher granularity levels, enabling flexible analysis. We evaluate OurTool in case studies such as sentiment analysis, showing the perturbation-based approach performs best using semantic similarity to assess perturbation impact. Furthermore, we conducted a user study to confirm OurTool's accuracy and practical value, and demonstrate its potential to enhance LLM interpretability.
摘要：大型语言模型在自然语言理解和文本生成等任务中表现出色。提示工程在有效利用 LLM 方面起着关键作用。然而，LLM 的黑盒性质阻碍了它的可解释性和有效的提示工程。已经为深度学习模型开发了各种各样的模型解释方法，然而，这些局部解释是为分类和回归等单输出任务设计的，不能直接应用于生成标记序列的 LLM。LLM 解释方面的最新努力集中在自然语言解释上，但它们容易产生幻觉和不准确。为了解决这个问题，我们引入了 OurTool，这是一个通过聚合标记级见解来实现多粒度提示解释的框架。OurTool 引入了两种标记级解释方法：1. 一种结合局部解释技术的基于聚合的方法，以及 2. 一种基于扰动的方法，采用新技术来评估标记掩码影响。OurTool 支持白盒和黑盒解释，并将解释扩展到更高的粒度级别，从而实现灵活的分析。我们在情绪分析等案例研究中评估了 OurTool，结果表明基于扰动的方法使用语义相似性来评估扰动影响效果最佳。此外，我们进行了一项用户研究，以确认 OurTool 的准确性和实用价值，并展示其增强 LLM 可解释性的潜力。

Title: Tuning Language Models by Mixture-of-Depths Ensemble

Authors: Haoyan Luo, Lucia Specia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13077
Pdf URL: https://arxiv.org/pdf/2410.13077
Copy Paste: [[2410.13077]] Tuning Language Models by Mixture-of-Depths Ensemble(https://arxiv.org/abs/2410.13077)
Keywords: language model, llm
Abstract: Transformer-based Large Language Models (LLMs) traditionally rely on final-layer loss for training and final-layer representations for predictions, potentially overlooking the predictive power embedded in intermediate layers. Surprisingly, we find that focusing training efforts on these intermediate layers can yield training losses comparable to those of final layers, with complementary test-time performance. We introduce a novel tuning framework, Mixture-of-Depths (MoD), which trains late layers as ensembles contributing to the final logits through learned routing weights. With the auxiliary distillation loss and additional normalization modules, we ensure that the outputs of the late layers adapt to language modeling. Our MoD framework, which can be integrated with any existing tuning method, shows consistent improvement on various language modelling tasks. Furthermore, by replacing traditional trainable modules with MoD, our approach achieves similar performance with significantly fewer trainable parameters, demonstrating the potential of leveraging predictive power from intermediate representations during training.
摘要：基于 Transformer 的大型语言模型 (LLM) 传统上依赖于最终层损失进行训练，并依赖于最终层表示进行预测，这可能会忽略中间层中嵌入的预测能力。令人惊讶的是，我们发现将训练工作集中在这些中间层上可以产生与最终层相当的训练损失，同时具有互补的测试时间性能。我们引入了一种新颖的调整框架 Mixture-of-Depths (MoD)，它将后期层作为集成进行训练，通过学习的路由权重为最终的逻辑做出贡献。借助辅助蒸馏损失和附加规范化模块，我们确保后期层的输出适应语言建模。我们的 MoD 框架可以与任何现有的调整方法集成，在各种语言建模任务上显示出持续的改进。此外，通过用 MoD 替换传统的可训练模块，我们的方法以明显更少的可训练参数实现了类似的性能，展示了在训练过程中利用中间表示的预测能力的潜力。
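
The idea in the abstract above of late layers "contributing to the final logits through learned routing weights" can be pictured as a weighted sum of per-layer logits. The sketch below is a rough illustration under stated assumptions: the routing scores are normalized with a softmax (the abstract does not say how), the numbers are invented, and the distillation loss and normalization modules the paper mentions are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_logits(layer_logits, routing_scores):
    """Combine per-layer logit vectors using softmax-normalized routing scores."""
    weights = softmax(routing_scores)
    vocab_size = len(layer_logits[0])
    return [
        sum(w * logits[j] for w, logits in zip(weights, layer_logits))
        for j in range(vocab_size)
    ]


# Two hypothetical late layers over a three-token vocabulary.
late_layers = [[2.0, 0.5, -1.0], [1.0, 1.5, 0.0]]
print(ensemble_logits(late_layers, routing_scores=[0.0, 0.0]))
```

With equal routing scores the result is a plain average of the layers; in training, the routing scores would be learned parameters rather than constants.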

Title: Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models

Authors: Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, Shirui Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13080
Pdf URL: https://arxiv.org/pdf/2410.13080
Copy Paste: [[2410.13080]] Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models(https://arxiv.org/abs/2410.13080)
Keywords: language model, llm, hallucination, agent
Abstract: Large language models (LLMs) have demonstrated impressive reasoning abilities, but they still struggle with faithful reasoning due to knowledge gaps and hallucinations. To address these issues, knowledge graphs (KGs) have been utilized to enhance LLM reasoning through their structured knowledge. However, existing KG-enhanced methods, either retrieval-based or agent-based, encounter difficulties in accurately retrieving knowledge and efficiently traversing KGs at scale. In this work, we introduce graph-constrained reasoning (GCR), a novel framework that bridges structured knowledge in KGs with unstructured reasoning in LLMs. To eliminate hallucinations, GCR ensures faithful KG-grounded reasoning by integrating KG structure into the LLM decoding process through KG-Trie, a trie-based index that encodes KG reasoning paths. KG-Trie constrains the decoding process, allowing LLMs to directly reason on graphs and generate faithful reasoning paths grounded in KGs. Additionally, GCR leverages a lightweight KG-specialized LLM for graph-constrained reasoning alongside a powerful general LLM for inductive reasoning over multiple reasoning paths, resulting in accurate reasoning with zero reasoning hallucination. Extensive experiments on several KGQA benchmarks demonstrate that GCR achieves state-of-the-art performance and exhibits strong zero-shot generalizability to unseen KGs without additional training.
摘要：大型语言模型 (LLM) 已展现出令人印象深刻的推理能力，但由于知识差距和幻觉，它们仍然难以进行忠实推理。为了解决这些问题，知识图谱 (KG) 已被用于通过其结构化知识增强 LLM 推理。然而，现有的 KG 增强方法，无论是基于检索还是基于代理，都难以准确检索知识并有效地大规模遍历 KG。在这项工作中，我们引入了图约束推理 (GCR)，这是一个新颖的框架，它将 KG 中的结构化知识与 LLM 中的非结构化推理联系起来。为了消除幻觉，GCR 通过 KG-Trie（一种基于 trie 的索引，对 KG 推理路径进行编码）将 KG 结构集成到 LLM 解码过程中，从而确保忠实的基于 KG 的推理。KG-Trie 限制了解码过程，允许 LLM 直接在图上推理并生成基于 KG 的忠实推理路径。此外，GCR 利用轻量级 KG 专用 LLM 进行图约束推理，同时利用强大的通用 LLM 进行多条推理路径的归纳推理，从而实现准确推理，避免推理错觉。在多个 KGQA 基准上进行的大量实验表明，GCR 实现了最先进的性能，并且无需额外训练即可对未见过的 KG 表现出强大的零样本泛化能力。
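
The abstract above describes KG-Trie as a trie-based index over KG reasoning paths that constrains decoding. A toy sketch of that mechanism follows. Everything in it is hypothetical: the class and function names, the example paths, and the scoring function, which stands in for an LLM's next-token scores. It is not the authors' implementation.

```python
END = "<end>"  # marker for a complete reasoning path


class PathTrie:
    """Trie over token sequences; each inserted path ends with END."""

    def __init__(self):
        self.root = {}

    def insert(self, path):
        node = self.root
        for token in path:
            node = node.setdefault(token, {})
        node[END] = {}

    def allowed_next(self, prefix):
        """Return the set of tokens that may follow prefix (empty if prefix is invalid)."""
        node = self.root
        for token in prefix:
            if token not in node:
                return set()
            node = node[token]
        return set(node)


def constrained_greedy_decode(trie, score):
    """Greedy decoding that only ever picks tokens the trie allows."""
    prefix = []
    while True:
        options = trie.allowed_next(prefix)
        if not options:
            return prefix
        best = max(sorted(options), key=lambda tok: score(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)


trie = PathTrie()
trie.insert(["Alice", "born_in", "Paris"])    # made-up KG paths
trie.insert(["Alice", "works_at", "Acme"])
toy_score = lambda prefix, tok: 2.0 if tok == "works_at" else 1.0
print(constrained_greedy_decode(trie, toy_score))
```

Because every step is restricted to `allowed_next`, the decoded sequence is always a path that exists in the index, which is the sense in which constrained decoding keeps reasoning paths grounded in the graph.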

Title: Reverse-Engineering the Reader

Authors: Samuel Kiegeland, Ethan Gotlieb Wilcox, Afra Amini, David Robert Reich, Ryan Cotterell
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13086
Pdf URL: https://arxiv.org/pdf/2410.13086
Copy Paste: [[2410.13086]] Reverse-Engineering the Reader(https://arxiv.org/abs/2410.13086)
Keywords: language model
Abstract: Numerous previous studies have sought to determine to what extent language models, pretrained on natural language text, can serve as useful models of human cognition. In this paper, we are interested in the opposite question: whether we can directly optimize a language model to be a useful cognitive model by aligning it to human psychometric data. To achieve this, we introduce a novel alignment technique in which we fine-tune a language model to implicitly optimize the parameters of a linear regressor that directly predicts humans' reading times of in-context linguistic units, e.g., phonemes, morphemes, or words, using surprisal estimates derived from the language model. Using words as a test case, we evaluate our technique across multiple model sizes and datasets and find that it improves language models' psychometric predictive power. However, we find an inverse relationship between psychometric power and a model's performance on downstream NLP tasks as well as its perplexity on held-out test data. While this latter trend has been observed before (Oh et al., 2022; Shain et al., 2024), we are the first to induce it by manipulating a model's alignment to psychometric data.
摘要：之前有许多研究试图确定在自然语言文本上预先训练的语言模型在多大程度上可以作为人类认知的有用模型。在本文中，我们感兴趣的是相反的问题：我们是否可以通过将语言模型与人类心理测量数据对齐来直接优化语言模型，使其成为有用的认知模型。为了实现这一点，我们引入了一种新颖的对齐技术，该技术对语言模型进行微调，以隐式优化线性回归器的参数，该回归器使用从语言模型中得出的意外估计值直接预测人类阅读上下文语言单位（例如音素、词素或单词）的时间。我们使用单词作为测试用例，在多种模型大小和数据集中评估我们的技术，发现它可以提高语言模型的心理测量预测能力。然而，我们发现心理测量能力与模型在下游 NLP 任务上的表现以及它在保留测试数据上的困惑度之间存在反比关系。虽然后一种趋势以前也曾被观察到（Oh 等人，2022 年；Shain 等人，2024 年），但我们是第一个通过操纵模型与心理测量数据的一致性来引发这种趋势的人。
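
The abstract above relies on two standard ingredients: surprisal, the negative log probability a language model assigns to a unit in context, and a linear regressor that predicts reading times from it. The sketch below shows only those two pieces, with made-up probabilities and reading times. The paper's actual contribution, fine-tuning the language model so that this regressor fits better, is not reproduced here.

```python
import math


def surprisal(prob):
    """Surprisal in bits: -log2 p(unit | context)."""
    return -math.log2(prob)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical in-context word probabilities and reading times in milliseconds.
word_probs = [0.5, 0.25, 0.125]
reading_times_ms = [210.0, 240.0, 270.0]

surprisals = [surprisal(p) for p in word_probs]
intercept, slope = fit_line(surprisals, reading_times_ms)
print(intercept, slope)
```

A positive slope corresponds to the usual finding that less predictable words take longer to read; psychometric predictive power is then a measure of how well such a regressor explains held-out reading times.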

Title: Learning to Summarize from LLM-generated Feedback

Authors: Hwanjun Song, Taewon Yun, Yuho Lee, Gihun Lee, Jason Cai, Hang Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13116
Pdf URL: https://arxiv.org/pdf/2410.13116
Copy Paste: [[2410.13116]] Learning to Summarize from LLM-generated Feedback(https://arxiv.org/abs/2410.13116)
Keywords: llm, hallucination
Abstract: Developing effective text summarizers remains a challenge due to issues like hallucinations, key information omissions, and verbosity in LLM-generated summaries. This work explores using LLM-generated feedback to improve summary quality by aligning the summaries with human preferences for faithfulness, completeness, and conciseness. We introduce FeedSum, a large-scale dataset containing multi-dimensional LLM feedback on summaries of varying quality across diverse domains. Our experiments show how feedback quality, dimensionality, and granularity influence preference learning, revealing that high-quality, multi-dimensional, fine-grained feedback significantly improves summary generation. We also compare two methods for using this feedback: supervised fine-tuning and direct preference optimization. Finally, we introduce SummLlama3-8b, a model that outperforms the nearly 10x larger Llama3-70b-instruct in generating human-preferred summaries, demonstrating that smaller models can achieve superior performance with appropriate training. The full dataset will be released soon. The SummLlama3-8B model is now available at this https URL.
摘要：由于 LLM 生成的摘要中存在幻觉、关键信息遗漏和冗长等问题，开发有效的文本摘要器仍然是一项挑战。这项工作探索了使用 LLM 生成的反馈来提高摘要质量，方法是将摘要与人类对忠实性、完整性和简洁性的偏好相结合。我们引入了 FeedSum，这是一个大型数据集，其中包含对不同领域中不同质量摘要的多维 LLM 反馈。我们的实验展示了反馈质量、维度和粒度如何影响偏好学习，揭示了高质量、多维、细粒度的反馈显著改善了摘要生成。我们还比较了使用这种反馈的两种方法：监督微调和直接偏好优化。最后，我们介绍了 SummLlama3-8b，该模型在生成人类偏好的摘要方面优于近 10 倍大的 Llama3-70b-instruct，表明较小的模型可以通过适当的训练实现卓越的性能。完整的数据集将很快发布。 SummLlama3-8B 模型现已在此 https URL 上提供。

Title: Retrieval-Enhanced Named Entity Recognition

Authors: Enzo Shiraishi, Raphael Y. de Camargo, Henrique L. P. Silva, Ronaldo C. Prati
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2410.13118
Pdf URL: https://arxiv.org/pdf/2410.13118
Copy Paste: [[2410.13118]] Retrieval-Enhanced Named Entity Recognition(https://arxiv.org/abs/2410.13118)
Keywords: language model, prompt
Abstract: When combined with In-Context Learning, a technique that enables models to adapt to new tasks by incorporating task-specific examples or demonstrations directly within the input prompt, autoregressive language models have achieved good performance in a wide range of tasks and applications. However, this combination has not been properly explored in the context of named entity recognition, where the structure of this task poses unique challenges. We propose RENER (Retrieval-Enhanced Named Entity Recognition), a technique for named entity recognition using autoregressive language models based on In-Context Learning and information retrieval techniques. When presented with an input text, RENER fetches similar examples from a dataset of training examples that are used to enhance a language model to recognize named entities from this input text. RENER is modular and independent of the underlying language model and information retrieval algorithms. Experimental results show that in the CrossNER collection we achieve state-of-the-art performance with the proposed technique and that information retrieval can increase the F-score by up to 11 percentage points.
摘要：当与上下文学习相结合时，自回归语言模型在广泛的任务和应用中都取得了良好的表现。上下文学习是一种通过在输入提示中直接合并特定于任务的示例或演示使模型能够适应新任务的技术。然而，这种组合尚未在命名实体识别的背景下得到充分探索，因为该任务的结构带来了独特的挑战。我们提出了 RENER（检索增强命名实体识别），这是一种使用基于上下文学习和信息检索技术的自回归语言模型进行命名实体识别的技术。当呈现输入文本时，RENER 会从训练示例数据集中获取类似的示例，这些示例用于增强语言模型以从该输入文本中识别命名实体。RENER 是模块化的，独立于底层语言模型和信息检索算法。实验结果表明，在 CrossNER 集合中，我们使用所提出的技术实现了最先进的性能，并且信息检索可以将 F 分数提高多达 11 个百分点。
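
Illustrative sketch (not the RENER implementation): the abstract describes fetching similar training examples and placing them in the prompt as demonstrations. The snippet below shows that general pattern under stated assumptions. The token-overlap (Jaccard) retriever, the prompt layout, and the toy examples are all hypothetical choices; the abstract states that the technique is independent of the retrieval algorithm and the language model, so any retriever could take the place of `jaccard`.

```python
from typing import List, Tuple

# (text, [(entity span, label), ...])
Example = Tuple[str, List[Tuple[str, str]]]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts (a stand-in retrieval score)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, train: List[Example], k: int = 2) -> List[Example]:
    """Return the k training examples most similar to the query text."""
    return sorted(train, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]


def build_prompt(query: str, demos: List[Example]) -> str:
    """Assemble an in-context prompt: instruction, demonstrations, then the query."""
    blocks = ["Extract the named entities from each text."]
    for text, entities in demos:
        tagged = "; ".join(f"{span} -> {label}" for span, label in entities)
        blocks.append(f"Text: {text}\nEntities: {tagged}")
    blocks.append(f"Text: {query}\nEntities:")
    return "\n\n".join(blocks)


# Toy training set, invented for illustration.
train = [
    ("Marie Curie worked in Paris",
     [("Marie Curie", "person"), ("Paris", "location")]),
    ("The transformer model uses attention",
     [("transformer", "algorithm")]),
    ("Alan Turing worked in Manchester",
     [("Alan Turing", "person"), ("Manchester", "location")]),
]

query = "Ada Lovelace worked in London"
demos = retrieve(query, train, k=2)
prompt = build_prompt(query, demos)
print(prompt)
```

The resulting prompt would then be passed to whichever autoregressive language model is in use; that call is omitted because it depends on the model API.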

Title: Data Defenses Against Large Language Models

Authors: William Agnew, Harry H. Jiang, Cella Sum, Maarten Sap, Sauvik Das
Subjects: cs.CL, cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2410.13138
Pdf URL: https://arxiv.org/pdf/2410.13138
Copy Paste: [[2410.13138]] Data Defenses Against Large Language Models(https://arxiv.org/abs/2410.13138)
Keywords: language model, llm, prompt
Abstract: Large language models excel at performing inference over text to extract information, summarize information, or generate additional text. These inference capabilities are implicated in a variety of ethical harms spanning surveillance, labor displacement, and IP/copyright theft. While many policy, legal, and technical mitigations have been proposed to counteract these harms, these mitigations typically require cooperation from institutions that move slower than technical advances (i.e., governments) or that have few incentives to act to counteract these harms (i.e., the corporations that create and profit from these LLMs). In this paper, we define and build "data defenses" -- a novel strategy that directly empowers data owners to block LLMs from performing inference on their data. We create data defenses by developing a method to automatically generate adversarial prompt injections that, when added to input text, significantly reduce the ability of LLMs to accurately infer personally identifying information about the subject of the input text or to use copyrighted text in inference. We examine the ethics of enabling such direct resistance to LLM inference, and argue that making data defenses that resist and subvert LLMs enables the realization of important values such as data ownership, data sovereignty, and democratic control over AI systems. We verify that our data defenses are cheap and fast to generate, work on the latest commercial and open-source LLMs, are resistant to countermeasures, and are robust to several different attack settings. Finally, we consider the security implications of LLM data defenses and outline several future research directions in this area. Our code is available at this https URL and a tool for using our defenses to protect text against LLM inference is at this https URL.
摘要：大型语言模型擅长对文本进行推理，以提取信息、总结信息或生成其他文本。这些推理能力与各种道德危害有关，包括监视、劳动力流失和知识产权/版权盗窃。虽然已经提出了许多政策、法律和技术缓解措施来抵消这些危害，但这些缓解措施通常需要行动迟缓于技术进步的机构（即政府）或几乎没有动力采取行动来抵消这些危害的机构（即创建并从这些 LLM 中获利的公司）的合作。在本文中，我们定义并构建了“数据防御”——一种直接授权数据所有者阻止 LLM 对其数据进行推理的新颖策略。我们通过开发一种自动生成对抗性提示注入的方法来创建数据防御，当将其添加到输入文本中时，会大大降低 LLM 准确推断输入文本主题的个人身份信息或在推理中使用受版权保护的文本的能力。我们研究了实现这种直接抵抗 LLM 推理的伦理问题，并认为建立抵抗和颠覆 LLM 的数据防御可以实现数据所有权、数据主权和对 AI 系统的民主控制等重要价值。我们验证了我们的数据防御生成成本低且速度快，适用于最新的商业和开源 LLM，能够抵抗对策，并且能够抵御多种不同的攻击设置。最后，我们考虑了 LLM 数据防御的安全影响，并概述了该领域的几个未来研究方向。我们的代码可在此 https URL 上找到，使用我们的防御来保护文本免受 LLM 推理的工具可在此 https URL 上找到。

Title: Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead

Authors: Kuleen Sasse, Shan Chen, Jackson Pond, Danielle Bitterman, John Osborne
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.13146
Pdf URL: https://arxiv.org/pdf/2410.13146
Copy Paste: [[2410.13146]] Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead(https://arxiv.org/abs/2410.13146)
Keywords: language model
Abstract: As Vision Language Models (VLMs) gain widespread use, their fairness remains under-explored. In this paper, we analyze demographic biases across five models and six datasets. We find that portrait datasets like UTKFace and CelebA are the best tools for bias detection, finding gaps in performance and fairness between LLaVa and CLIP models. However, scene-based datasets like PATA and VLStereoSet fail to be useful benchmarks for bias due to their construction. As for pronoun-based datasets like VisoGender, we receive mixed signals, as only some subsets of the data are useful in providing insights. To alleviate this problem, we introduce a more difficult version of VisoGender to serve as a more rigorous evaluation. Based on these results, we call for more effective and carefully designed datasets to ensure VLMs are both fair and reliable.
摘要：随着视觉语言模型 (VLM) 得到广泛应用，其公平性仍未得到充分探索。在本文中，我们分析了五种模型和六个数据集中的人口统计学偏差。我们发现像 UTKFace 和 CelebA 这样的肖像数据集是偏差检测的最佳工具，可以发现 LLaVa 和 CLIP 模型之间的性能和公平性差距。然而，像 PATA、VLStereoSet 这样的基于场景的数据集由于其构造而无法成为有用的偏差基准。至于像 VisoGender 这样的基于代词的数据集，我们收到的信号是混合的，因为只有一些数据子集可用于提供见解。为了缓解这个问题，我们引入了一个更困难的 VisoGender 版本，以作为更严格的评估。基于这些结果，我们呼吁更有效和精心设计的数据集，以确保 VLM 既公平又可靠。

Title: Better to Ask in English: Evaluation of Large Language Models on English, Low-resource and Cross-Lingual Settings

Authors: Krishno Dey, Prerona Tarannum, Md. Arid Hasan, Imran Razzak, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13153
Pdf URL: https://arxiv.org/pdf/2410.13153
Copy Paste: [[2410.13153]] Better to Ask in English: Evaluation of Large Language Models on English, Low-resource and Cross-Lingual Settings(https://arxiv.org/abs/2410.13153)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) are trained on massive amounts of data, enabling their application across diverse domains and tasks. Despite their remarkable performance, most LLMs are developed and evaluated primarily in English. Recently, a few multi-lingual LLMs have emerged, but their performance in low-resource languages, especially the most spoken languages in South Asia, is less explored. To address this gap, in this study, we evaluate LLMs such as GPT-4, Llama 2, and Gemini to analyze their effectiveness in English compared to other low-resource languages from South Asia (e.g., Bangla, Hindi, and Urdu). Specifically, we utilized zero-shot prompting and five different prompt settings to extensively investigate the effectiveness of the LLMs in cross-lingual translated prompts. The findings of the study suggest that GPT-4 outperformed Llama 2 and Gemini in all five prompt settings and across all languages. Moreover, all three LLMs performed better for English language prompts than other low-resource language prompts. This study extensively investigates LLMs in low-resource language contexts to highlight the improvements required in LLMs and language-specific resources to develop more generally purposed NLP applications.
摘要：大型语言模型 (LLM) 经过大量数据训练，使其能够应用于不同的领域和任务。尽管性能出色，但大多数 LLM 都是以英语开发和评估的。最近，出现了一些多语言 LLM，但它们在资源匮乏的语言（尤其是南亚使用最广泛的语言）中的表现尚未得到充分探索。为了弥补这一差距，在本研究中，我们评估了 GPT-4、Llama 2 和 Gemini 等 LLM，以分析它们在英语中的有效性与南亚其他资源匮乏的语言（例如孟加拉语、印地语和乌尔都语）相比的有效性。具体来说，我们利用零样本提示和五种不同的提示设置来广泛研究 LLM 在跨语言翻译提示中的有效性。研究结果表明，GPT-4 在所有五种提示设置和所有语言中的表现都优于 Llama 2 和 Gemini。此外，这三个 LLM 在英语语言提示中的表现都优于其他资源匮乏的语言提示。本研究广泛调查了低资源语言环境中的 LLM，以强调 LLM 和特定语言资源所需的改进，以开发更通用的 NLP 应用程序。

Title: SLM-Mod: Small Language Models Surpass LLMs at Content Moderation

Authors: Xianyang Zhan, Agam Goyal, Yilun Chen, Eshwar Chandrasekharan, Koustuv Saha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13155
Pdf URL: https://arxiv.org/pdf/2410.13155
Copy Paste: [[2410.13155]] SLM-Mod: Small Language Models Surpass LLMs at Content Moderation(https://arxiv.org/abs/2410.13155)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown promise in many natural language understanding tasks, including content moderation. However, these models can be expensive to query in real-time and do not allow for a community-specific approach to content moderation. To address these challenges, we explore the use of open-source small language models (SLMs) for community-specific content moderation tasks. We fine-tune and evaluate SLMs (less than 15B parameters) by comparing their performance against much larger open- and closed-sourced models. Using 150K comments from 15 popular Reddit communities, we find that SLMs outperform LLMs at content moderation -- 11.5% higher accuracy and 25.7% higher recall on average across all communities. We further show the promise of cross-community content moderation, which has implications for new communities and the development of cross-platform moderation techniques. Finally, we outline directions for future work on language model based content moderation. Code and links to HuggingFace models can be found at this https URL.
摘要：大型语言模型 (LLM) 在许多自然语言理解任务（包括内容审核）中都表现出良好的前景。但是，这些模型的实时查询成本可能很高，并且不允许采用特定于社区的内容审核方法。为了应对这些挑战，我们探索了将开源小型语言模型 (SLM) 用于特定于社区的内容审核任务。我们通过将 SLM（少于 15B 个参数）的性能与更大的开源和闭源模型进行比较来对其进行微调和评估。使用来自 15 个热门 Reddit 社区的 15 万条评论，我们发现 SLM 在内容审核方面的表现优于 LLM——在所有社区中，准确率平均高出 11.5%，召回率高出 25.7%。我们进一步展示了跨社区内容审核的前景，这对新社区和跨平台审核技术的开发具有重要意义。最后，我们概述了基于语言模型的内容审核未来工作的方向。代码和 HuggingFace 模型的链接可在此 https URL 中找到。

Title: AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

Authors: Hao Sun, Jiayi Wu, Hengyi Cai, Xiaochi Wei, Yue Feng, Bo Wang, Shuaiqiang Wang, Yan Zhang, Dawei Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13181
Pdf URL: https://arxiv.org/pdf/2410.13181
Copy Paste: [[2410.13181]] AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning(https://arxiv.org/abs/2410.13181)
Keywords: language model, llm, agent
Abstract: Recent advancements in large language models (LLMs) have been remarkable. Users face a choice between using cloud-based LLMs for generation quality and deploying local-based LLMs for lower computational cost. The former option is typically costly and inefficient, while the latter usually fails to deliver satisfactory performance for reasoning steps requiring deliberate thought processes. In this work, we propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller local-deployed LLMs. Our framework comprises two primary modules: the local agent instantiated with a relatively smaller LLM, handling less complex reasoning steps, and the cloud agent equipped with a larger LLM, managing more intricate reasoning steps. This collaborative processing is enabled through an adaptive mechanism where the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent, thereby effectively integrating the strengths of both locally-deployed and cloud-based LLMs, resulting in significant enhancements in task completion performance and efficiency. We evaluate AdaSwitch across 7 benchmarks, ranging from mathematical reasoning to complex question answering, using various types of LLMs to instantiate the local and cloud agents. The empirical results show that AdaSwitch effectively improves the performance of the local agent, and sometimes achieves competitive results compared to the cloud agent while utilizing much less computational overhead.
摘要：大型语言模型 (LLM) 的最新进展令人瞩目。用户面临着选择，是使用基于云的 LLM 来提高生成质量，还是部署基于本地的 LLM 来降低计算成本。前一种选择通常成本高昂且效率低下，而后者通常无法为需要深思熟虑的思考过程的推理步骤提供令人满意的性能。在这项工作中，我们提出了一种新颖的 LLM 利用范式，以促进大型基于云的 LLM 和较小的本地部署 LLM 的协同操作。我们的框架包含两个主要模块：使用相对较小的 LLM 实例化的本地代理，处理不太复杂的推理步骤，以及配备较大 LLM 的云代理，管理更复杂的推理步骤。这种协同处理是通过一种自适应机制实现的，其中本地代理自省地识别错误并主动寻求云代理的帮助，从而有效地整合本地部署和基于云的 LLM 的优势，从而显着提高任务完成性能和效率。我们使用各种类型的 LLM 实例化本地和云代理，通过 7 个基准测试评估了 AdaSwitch，这些基准测试涵盖数学推理和复杂问答。实证结果表明，AdaSwitch 有效地提高了本地代理的性能，有时甚至取得了与云代理相当的结果，同时计算开销却少得多。

Title: aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Completion

Authors: Siyuan Jiang, Jia Li, He Zong, Huanyu Liu, Hao Zhu, Shukai Hu, Erlu Li, Jiazheng Ding, Yu Han, Wei Ning, Ge Li
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2410.13187
Pdf URL: https://arxiv.org/pdf/2410.13187
Copy Paste: [[2410.13187]] aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Completion(https://arxiv.org/abs/2410.13187)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been widely used in code completion, and researchers are focusing on scaling up LLMs to improve their accuracy. However, larger LLMs will increase the response time of code completion and decrease the developers' productivity. In this paper, we propose a lightweight and effective LLM for code completion named aiXcoder-7B. Compared to existing LLMs, aiXcoder-7B achieves higher code completion accuracy at a smaller scale (i.e., 7 billion parameters). We attribute the superiority of aiXcoder-7B to three key factors: (1) Multi-objective training. We employ three training objectives, one of which is our proposed Structured Fill-In-the-Middle (SFIM). SFIM considers the syntax structures in code and effectively improves the performance of LLMs for code. (2) Diverse data sampling strategies. They consider inter-file relationships and enhance the capability of LLMs in understanding cross-file contexts. (3) Extensive high-quality data. We establish a rigorous data collection pipeline and consume a total of 1.2 trillion unique tokens for training aiXcoder-7B. This vast volume of data enables aiXcoder-7B to learn a broad distribution of code. We evaluate aiXcoder-7B in five popular code completion benchmarks and a new benchmark collected by this paper. The results show that aiXcoder-7B outperforms the latest six LLMs with similar sizes and even surpasses four larger LLMs (e.g., StarCoder2-15B and CodeLlama-34B), positioning aiXcoder-7B as a lightweight and effective LLM for academia and industry. Finally, we summarize three valuable insights for helping practitioners train the next generations of LLMs for code. aiXcoder-7B has been open-sourced and gained significant attention. As of the submission date, aiXcoder-7B has received 2,193 GitHub Stars.
摘要：大型语言模型 (LLM) 已广泛应用于代码补全，研究人员正致力于扩大 LLM 的规模以提高其准确性。然而，较大的 LLM 会增加代码补全的响应时间并降低开发人员的工作效率。在本文中，我们提出了一种轻量级且有效的代码补全 LLM，名为 aiXcoder-7B。与现有的 LLM 相比，aiXcoder-7B 在规模较小（即 70 亿个参数）的情况下实现了更高的代码补全准确率。我们将 aiXcoder-7B 的优势归因于三个关键因素：（1）多目标训练。我们采用了三个训练目标，其中之一是我们提出的结构化填充中间 (SFIM)。SFIM 考虑了代码中的语法结构，并有效地提高了 LLM 的代码性能。（2）多样化的数据采样策略。它们考虑了文件间关系并增强了 LLM 理解跨文件上下文的能力。（3）大量高质量数据。我们建立了严格的数据收集管道，并使用总共 1.2 万亿个唯一标记来训练 aiXcoder-7B。如此庞大的数据量使 aiXcoder-7B 能够学习广泛的代码分布。我们在五个流行的代码完成基准和本文收集的一个新基准中对 aiXcoder-7B 进行了评估。结果表明，aiXcoder-7B 的表现优于最新六款大小相似的 LLM，甚至超过了四款更大的 LLM（例如 StarCoder2-15B 和 CodeLlama-34B），将 aiXcoder-7B 定位为学术界和工业界轻量级且有效的 LLM。最后，我们总结了三个有价值的见解，以帮助从业者训练下一代代码 LLM。aiXcoder-7B 已经开源并引起了广泛关注。截至提交日期，aiXcoder-7B 已获得 2,193 个 GitHub Stars。
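
Illustrative sketch (not aiXcoder's SFIM implementation): the abstract says only that Structured Fill-In-the-Middle "considers the syntax structures in code". One plausible reading is that the masked middle span is aligned to a complete syntax node rather than a random character range. The snippet below shows that idea using Python's standard `ast` module. The sentinel strings, the prefix-suffix-middle ordering, and the function names are assumptions for illustration, and the offset arithmetic assumes ASCII source text.

```python
import ast
from typing import Tuple

# Hypothetical sentinel strings; real tokenizers define their own special tokens.
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def structured_fim(source: str, node_index: int = 0) -> Tuple[str, str, str]:
    """Split source into (prefix, middle, suffix), where middle is one
    complete statement from the body of the first function found."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    node = func.body[node_index]
    lines = source.splitlines(keepends=True)

    def offset(lineno: int, col: int) -> int:
        # Convert a 1-based line number and column into a character offset.
        return sum(len(line) for line in lines[:lineno - 1]) + col

    start = offset(node.lineno, node.col_offset)
    end = offset(node.end_lineno, node.end_col_offset)
    return source[:start], source[start:end], source[end:]


def to_training_text(prefix: str, middle: str, suffix: str) -> str:
    """Arrange the pieces so the model sees prefix and suffix, then predicts middle."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


code = "def area(w, h):\n    result = w * h\n    return result\n"
prefix, middle, suffix = structured_fim(code, node_index=0)
sample = to_training_text(prefix, middle, suffix)
print(sample)
```

Because the masked span is a whole statement, the target the model learns to generate is always syntactically complete, which is the property a character-level random split would not guarantee.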

Title: MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback

Authors: Zonghai Yao, Aditya Parashar, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Zhichao Yang, Hong Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13191
Pdf URL: https://arxiv.org/pdf/2410.13191
Copy Paste: [[2410.13191]] MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback(https://arxiv.org/abs/2410.13191)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Automatic question generation (QG) is essential for AI and NLP, particularly in intelligent tutoring, dialogue systems, and fact verification. Generating multiple-choice questions (MCQG) for professional exams, like the United States Medical Licensing Examination (USMLE), is particularly challenging, requiring domain expertise and complex multi-hop reasoning for high-quality questions. However, current large language models (LLMs) like GPT-4 struggle with professional MCQG due to outdated knowledge, hallucination issues, and prompt sensitivity, resulting in unsatisfactory quality and difficulty. To address these challenges, we propose MCQG-SRefine, an LLM self-refine-based (Critique and Correction) framework for converting medical cases into high-quality USMLE-style questions. By integrating expert-driven prompt engineering with iterative self-critique and self-correction feedback, MCQG-SRefine significantly enhances human expert satisfaction regarding both the quality and difficulty of the questions. Furthermore, we introduce an LLM-as-Judge-based automatic metric to replace the complex and costly expert evaluation process, ensuring reliable and expert-aligned assessments.
摘要：自动问题生成 (QG) 对于 AI 和 NLP 至关重要，尤其是在智能辅导、对话系统和事实验证方面。为美国医师执照考试 (USMLE) 等专业考试生成多项选择题 (MCQG) 尤其具有挑战性，需要领域专业知识和复杂的多跳推理才能生成高质量的问题。然而，目前的大型语言模型 (LLM)（如 GPT-4）由于知识过时、幻觉问题和提示敏感性而难以处理专业的 MCQG，导致质量和难度不令人满意。为了应对这些挑战，我们提出了 MCQG-SRefine，这是一个基于 LLM 自我改进（批评和纠正）的框架，用于将医疗案例转换为高质量的 USMLE 风格问题。通过将专家驱动的提示工程与迭代自我批评和自我纠正反馈相结合，MCQG-SRefine 显著提高了人类专家对问题质量和难度的满意度。此外，我们引入了基于 LLM-as-Judge 的自动指标来取代复杂且昂贵的专家评估流程，确保可靠且符合专家的评估。

Title: Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models

Authors: Jiatao Li, Xinyu Hu, Xunjian Yin, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13192
Pdf URL: https://arxiv.org/pdf/2410.13192
Copy Paste: [[2410.13192]] Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models(https://arxiv.org/abs/2410.13192)
Keywords: language model, llm, retrieval-augmented generation
Abstract: In retrieval-augmented generation systems, the integration of self-generated documents (SGDs) alongside retrieved content has emerged as a promising strategy for enhancing the performance of large language models. However, previous research primarily focuses on optimizing the use of SGDs, with the inherent properties of SGDs remaining underexplored. Therefore, this paper conducts a comprehensive analysis of different types of SGDs and experiments on various knowledge-intensive tasks. We develop a taxonomy of SGDs grounded in Systemic Functional Linguistics (SFL) to compare the influence of different SGD categories. Our findings offer key insights into what kinds of SGDs most effectively contribute to improving LLM performance. The results and further fusion methods based on SGD categories also provide practical guidelines for taking better advantage of SGDs to achieve significant advancements in knowledge-driven QA tasks with RAG.
摘要：在检索增强生成系统中，将自生成文档 (SGD) 与检索内容相结合已成为提高大型语言模型性能的一种有前途的策略。然而，之前的研究主要侧重于优化 SGD 的使用，而 SGD 的固有属性仍未得到充分探索。因此，本文对不同类型的 SGD 进行了全面分析，并对各种知识密集型任务进行了实验。我们开发了一种基于系统功能语言学 (SFL) 的 SGD 分类法，以比较不同 SGD 类别的影响。我们的研究结果为了解哪些类型的 SGD 最有效地提高了 LLM 的性能提供了关键见解。结果和基于 SGD 类别的进一步融合方法还提供了实用指南，可更好地利用 SGD 来实现 RAG 在知识驱动的 QA 任务中的重大进步。

Title: The Geometry of Numerical Reasoning: Language Models Compare Numeric Properties in Linear Subspaces

Authors: Ahmed Oumar El-Shangiti, Tatsuya Hiraoka, Hilal AlQuabeh, Benjamin Heinzerling, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13194
Pdf URL: https://arxiv.org/pdf/2410.13194
Copy Paste: [[2410.13194]] The Geometry of Numerical Reasoning: Language Models Compare Numeric Properties in Linear Subspaces(https://arxiv.org/abs/2410.13194)
Keywords: language model, llm, prompt
Abstract: This paper investigates whether large language models (LLMs) utilize numerical attributes encoded in a low-dimensional subspace of the embedding space when answering logical comparison questions (e.g., Was Cristiano born before Messi?). We first identified these subspaces using partial least squares regression, which effectively encodes the numerical attributes associated with the entities in comparison prompts. Further, we demonstrate causality by intervening in these subspaces to manipulate hidden states, thereby altering the LLM's comparison outcomes. Experimental results show that our findings hold for different numerical attributes, indicating that LLMs utilize the linearly encoded information for numerical reasoning.
摘要：本文研究大型语言模型 (LLM) 在回答逻辑比较问题（例如，克里斯蒂亚诺出生在梅西之前吗？）时是否利用嵌入空间低维子空间中编码的数字属性。我们首先使用偏最小二乘回归确定这些子空间，该回归有效地对与比较提示中的实体相关的数字属性进行编码。此外，我们通过干预这些子空间来操纵隐藏状态，从而改变 LLM 的比较结果，从而证明因果关系。实验结果表明，我们的研究结果适用于不同的数字属性，表明 LLM 利用线性编码信息进行数字推理。

Title: Meta-DiffuB: A Contextualized Sequence-to-Sequence Text Diffusion Model with Meta-Exploration

Authors: Yun-Yen Chuang, Hung-Min Hsu, Kevin Lin, Chen-Sheng Gu, Ling Zhen Li, Ray-I Chang, Hung-yi Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13201
Pdf URL: https://arxiv.org/pdf/2410.13201
Copy Paste: [[2410.13201]] Meta-DiffuB: A Contextualized Sequence-to-Sequence Text Diffusion Model with Meta-Exploration(https://arxiv.org/abs/2410.13201)
Keywords: language model
Abstract: The diffusion model, a new generative modeling paradigm, has achieved significant success in generating images, audio, video, and text. It has been adapted for sequence-to-sequence text generation (Seq2Seq) through DiffuSeq, termed S2S-Diffusion. Existing S2S-Diffusion models predominantly rely on fixed or hand-crafted rules to schedule noise during the diffusion and denoising processes. However, these models are limited by non-contextualized noise, which fails to fully consider the characteristics of Seq2Seq tasks. In this paper, we propose the Meta-DiffuB framework - a novel scheduler-exploiter S2S-Diffusion paradigm designed to overcome the limitations of existing S2S-Diffusion models. We employ Meta-Exploration to train an additional scheduler model dedicated to scheduling contextualized noise for each sentence. Our exploiter model, an S2S-Diffusion model, leverages the noise scheduled by our scheduler model for updating and generation. Meta-DiffuB achieves state-of-the-art performance compared to previous S2S-Diffusion models and fine-tuned pre-trained language models (PLMs) across four Seq2Seq benchmark datasets. We further investigate and visualize the impact of Meta-DiffuB's noise scheduling on the generation of sentences with varying difficulties. Additionally, our scheduler model can function as a "plug-and-play" model to enhance DiffuSeq without the need for fine-tuning during the inference stage.
摘要：扩散模型是一种新的生成建模范式，在生成图像、音频、视频和文本方面取得了重大成功。它已通过 DiffuSeq 适应序列到序列文本生成 (Seq2Seq)，称为 S2S 扩散。现有的 S2S-扩散模型主要依靠固定或手工制定的规则来在扩散和去噪过程中安排噪声。然而，这些模型受到非语境化噪声的限制，未能充分考虑 Seq2Seq 任务的特点。在本文中，我们提出了 Meta-DiffuB 框架 - 一种新颖的调度器-利用器 S2S-扩散范式，旨在克服现有 S2S-扩散模型的局限性。我们使用 Meta-Exploration 来训练一个额外的调度器模型，专门用于为每个句子调度语境化噪声。我们的利用器模型，即 S2S-扩散模型，利用我们的调度器模型调度的噪声进行更新和生成。与之前的 S2S-Diffusion 模型和经过微调的预训练语言模型 (PLM) 相比，Meta-DiffuB 在四个 Seq2Seq 基准数据集上实现了最佳性能。我们进一步研究并可视化了 Meta-DiffuB 的噪声调度对不同难度的句子生成的影响。此外，我们的调度器模型可以用作“即插即用”模型来增强 DiffuSeq，而无需在推理阶段进行微调。

Title: Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations

Authors: Aryan Shrivastava, Jessica Hullman, Max Lamparth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13204
Pdf URL: https://arxiv.org/pdf/2410.13204
Copy Paste: [[2410.13204]] Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations(https://arxiv.org/abs/2410.13204)
Keywords: language model, prompt
Abstract: There is an increasing interest in using language models (LMs) for automated decision-making, with multiple countries actively testing LMs to aid in military crisis decision-making. To scrutinize relying on LM decision-making in high-stakes settings, we examine the inconsistency of responses in a crisis simulation ("wargame"), similar to reported tests conducted by the US military. Prior work illustrated escalatory tendencies and varying levels of aggression among LMs but was constrained to simulations with pre-defined actions. This was due to the challenges associated with quantitatively measuring semantic differences and evaluating natural language decision-making without relying on pre-defined actions. In this work, we query LMs for free-form responses and use a metric based on BERTScore to measure response inconsistency quantitatively. Leveraging the benefits of BERTScore, we show that the inconsistency metric is robust to linguistic variations that preserve semantic meaning in a question-answering setting across text lengths. We show that all five tested LMs exhibit levels of inconsistency that indicate semantic differences, even when adjusting the wargame setting, anonymizing involved conflict countries, or adjusting the sampling temperature parameter $T$. Further qualitative evaluation shows that models recommend courses of action that share few to no similarities. We also study the impact of different prompt sensitivity variations on inconsistency at temperature $T = 0$. We find that inconsistency due to semantically equivalent prompt variations can exceed response inconsistency from temperature sampling for most studied models across different levels of ablations. Given the high-stakes nature of military deployment, we recommend further consideration be taken before using LMs to inform military decisions or other cases of high-stakes decision-making.
摘要：人们对使用语言模型 (LM) 进行自动决策的兴趣日益浓厚，多个国家正在积极测试 LM 以协助军事危机决策。为了审查在高风险环境中依赖 LM 决策的情况，我们研究了危机模拟（“战争游戏”）中响应的不一致性，类似于美国军方进行的报道测试。先前的工作说明了 LM 之间的升级趋势和不同程度的攻击性，但仅限于具有预定义操作的模拟。这是由于定量测量语义差异和评估自然语言决策而不依赖预定义操作所带来的挑战。在这项工作中，我们查询 LM 以获得自由形式的响应，并使用基于 BERTScore 的指标来定量测量响应不一致性。利用 BERTScore 的优势，我们表明不一致性指标对于在跨文本长度的问答设置中保留语义含义的语言变化具有鲁棒性。我们表明，所有五个经过测试的 LM 都表现出表明语义差异的不一致性水平，即使在调整战争游戏设置、匿名化所涉及的冲突国家或调整采样温度参数 $T$ 时也是如此。进一步的定性评估表明，模型推荐的行动方案几乎没有相似之处。我们还研究了不同提示敏感度变化对温度 $T = 0$ 时不一致性的影响。我们发现，对于大多数研究模型，在不同消融水平上，由于语义等效的提示变化而导致的不一致性可能超过温度采样引起的响应不一致性。鉴于军事部署的高风险性质，我们建议在使用 LM 为军事决策或其他高风险决策提供信息之前，应进一步考虑。
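
Illustrative sketch (not the authors' metric): the abstract describes measuring response inconsistency with a metric based on BERTScore. The snippet below assumes one simple way such a metric could be formed, namely the mean of one minus a pairwise similarity over all pairs of sampled responses. To stay self-contained it uses a token-overlap F1 as a stand-in similarity, which is NOT BERTScore; the `sim` parameter marks where a BERTScore-based function would be plugged in. The function names and example responses are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, List


def token_f1(a: str, b: str) -> float:
    """Stand-in similarity: F1 over overlapping tokens (not BERTScore)."""
    ta, tb = a.lower().split(), b.lower().split()
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def inconsistency(responses: List[str],
                  sim: Callable[[str, str], float] = token_f1) -> float:
    """Mean of (1 - similarity) over all unordered pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - sim(a, b) for a, b in pairs) / len(pairs)


# Invented responses to the same prompt, for illustration only.
same = ["open talks", "open talks", "open talks"]
different = ["open talks", "launch strike"]
mixed = ["impose sanctions now", "impose a blockade now"]

print(inconsistency(same), inconsistency(different), inconsistency(mixed))
```

A score of 0 means every sampled response matched under the chosen similarity, and a score of 1 means no pair shared anything; a surface-overlap stand-in like this one would penalize paraphrases that an embedding-based similarity is designed to tolerate.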

Title: BQA: Body Language Question Answering Dataset for Video Large Language Models

Authors: Shintaro Ozaki, Kazuki Hayashi, Miyu Oba, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13206
Pdf URL: https://arxiv.org/pdf/2410.13206
Copy Paste: [[2410.13206]] BQA: Body Language Question Answering Dataset for Video Large Language Models(https://arxiv.org/abs/2410.13206)
Keywords: language model, llm
Abstract: A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike language or sign language, such nonverbal communication lacks formal rules, requiring complex reasoning based on commonsense understanding. Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as human unconscious actions can easily cause the model to misinterpret their intent. To address this, we propose BQA, a body language question answering dataset, to validate whether a model can correctly interpret emotions from short video clips of body language annotated with 26 emotion labels. We evaluated various VideoLLMs on BQA and revealed that understanding body language is challenging, and our analyses of the wrong answers by VideoLLMs show that certain VideoLLMs made significantly biased answers depending on the age group and ethnicity of the individuals in the video. The dataset is available.
摘要：人类交流很大一部分依赖于非语言暗示，例如面部表情、眼神交流和肢体语言。与语言或手语不同，这种非语言交流缺乏正式规则，需要基于常识理解的复杂推理。让当前的视频大型语言模型 (VideoLLM) 能够准确解读肢体语言是一项关键挑战，因为人类无意识的行为很容易导致模型误解其意图。为了解决这个问题，我们提出了一个数据集 BQA，这是一个肢体语言问答数据集，用于验证模型是否能够从包含 26 个肢体语言视频情绪标签的肢体语言短片中正确解读情绪。我们在 BQA 上评估了各种 VideoLLM，发现理解肢体语言具有挑战性，我们对 VideoLLM 错误答案的分析表明，某些 VideoLLM 会根据视频中个人的年龄组和种族做出明显有偏见的回答。数据集可用。

Title: FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

Authors: Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, Amin Ahmad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13210
Pdf URL: https://arxiv.org/pdf/2410.13210
Copy Paste: [[2410.13210]] FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs(https://arxiv.org/abs/2410.13210)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries, and evaluations of hallucination detection models both suffer from a lack of diversity and recency in the LLM and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground truth annotations by human experts. "Challenging" here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed. Our results show GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations. However, even the best hallucination detection models have accuracies near 50% on FaithBench, indicating lots of room for future improvement. The repo is this https URL
摘要：摘要是大型语言模型 (LLM) 执行的最常见任务之一，尤其是在检索增强生成 (RAG) 等应用中。然而，现有的对 LLM 生成的摘要中幻觉的评估，以及对幻觉检测模型的评估，都存在所考虑的 LLM 和 LLM 系列缺乏多样性和新近性的问题。本文介绍了 FaithBench，这是一个摘要幻觉基准，包含来自 8 个不同系列的 10 个现代 LLM 产生的具有挑战性的幻觉，并由人类专家提供基本事实注释。这里的“具有挑战性”是指流行的、最先进的幻觉检测模型（包括 GPT-4o-as-a-judge）对其存在分歧的摘要。我们的结果表明，GPT-4o 和 GPT-3.5-Turbo 产生的幻觉最少。然而，即使是最好的幻觉检测模型在 FaithBench 上的准确率也接近 50%，这表明未来还有很大的改进空间。 repo 是这个 https URL

Title: CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy

Authors: Mian Zhang, Xianjun Yang, Xinlu Zhang, Travis Labrum, Jamie C. Chiu, Shaun M. Eack, Fei Fang, William Yang Wang, Zhiyu Zoey Chen
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2410.13218
Pdf URL: https://arxiv.org/pdf/2410.13218
Copy Paste: [[2410.13218]] CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy(https://arxiv.org/abs/2410.13218)
Keywords: language model, llm
Abstract: There is a significant gap between patient needs and available mental health support today. In this paper, we aim to thoroughly examine the potential of using Large Language Models (LLMs) to assist professional psychotherapy. To this end, we propose a new benchmark, CBT-BENCH, for the systematic evaluation of cognitive behavioral therapy (CBT) assistance. We include three levels of tasks in CBT-BENCH: I: Basic CBT knowledge acquisition, with the task of multiple-choice questions; II: Cognitive model understanding, with the tasks of cognitive distortion classification, primary core belief classification, and fine-grained core belief classification; III: Therapeutic response generation, with the task of generating responses to patient speech in CBT therapy sessions. These tasks encompass key aspects of CBT that could potentially be enhanced through AI assistance, while also outlining a hierarchy of capability requirements, ranging from basic knowledge recitation to engaging in real therapeutic conversations. We evaluated representative LLMs on our benchmark. Experimental results indicate that while LLMs perform well in reciting CBT knowledge, they fall short in complex real-world scenarios requiring deep analysis of patients' cognitive structures and generating effective responses, suggesting potential future work.
摘要：如今，患者的需求与可用的心理健康支持之间存在巨大差距。在本文中，我们旨在彻底研究使用大型语言模型 (LLM) 协助专业心理治疗的潜力。为此，我们提出了一个新的基准 CBT-BENCH，用于系统评估认知行为疗法 (CBT) 的辅助作用。我们在 CBT-BENCH 中包含三个级别的任务：I：基本 CBT 知识获取，任务是多项选择题；II：认知模型理解，任务是认知扭曲分类、主要核心信念分类和细粒度核心信念分类；III：治疗反应生成，任务是在 CBT 治疗期间对患者言语生成反应。这些任务涵盖了 CBT 的关键方面，可以通过 AI 辅助进行增强，同时也概述了能力要求的层次结构，从基本知识背诵到参与真正的治疗对话。我们根据基准评估了代表性的 LLM。实验结果表明，虽然 LLM 在背诵 CBT 知识方面表现良好，但在需要深入分析患者的认知结构并产生有效反应的复杂现实场景中却表现不佳，这表明未来有潜力开展研究。

Title: Proof Flow: Preliminary Study on Generative Flow Network Language Model Tuning for Formal Reasoning

Authors: Matthew Ho, Vincent Zhu, Xiaoyin Chen, Moksh Jain, Nikolay Malkin, Edwin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13224
Pdf URL: https://arxiv.org/pdf/2410.13224
Copy Paste: [[2410.13224]] Proof Flow: Preliminary Study on Generative Flow Network Language Model Tuning for Formal Reasoning(https://arxiv.org/abs/2410.13224)
Keywords: language model, llm
Abstract: Reasoning is a fundamental substrate for solving novel and complex problems. Deliberate efforts in learning and developing frameworks around System 2 reasoning have made great strides, yet problems of sufficient complexity remain largely out of reach for open models. To address this gap, we examine the potential of Generative Flow Networks as a fine-tuning method for LLMs to unlock advanced reasoning capabilities. In this paper, we present a proof of concept in the domain of formal reasoning, specifically in the Neural Theorem Proving (NTP) setting, where proofs specified in a formal language such as Lean can be deterministically and objectively verified. Unlike classical reward-maximization reinforcement learning, which frequently over-exploits high-reward actions and fails to effectively explore the state space, GFlowNets have emerged as a promising approach for sampling compositional objects, improving generalization, and enabling models to maintain diverse hypotheses. Our early results demonstrate GFlowNet fine-tuning's potential for enhancing model performance in a search setting, which is especially relevant given the paradigm shift towards inference time compute scaling and "thinking slowly."
摘要：推理是解决新颖复杂问题的基础。围绕系统 2 推理的学习和开发框架的努力取得了长足进步，但足够复杂的问题仍然在很大程度上无法通过开放模型解决。为了解决这一差距，我们研究了生成流网络作为 LLM 微调方法的潜力，以解锁高级推理能力。在本文中，我们提出了形式推理领域的概念证明，特别是在神经定理证明 (NTP) 设置中，其中可以确定性和客观地验证以 Lean 等形式语言指定的证明。与经常过度利用高奖励动作并且无法有效探索状态空间的经典奖励最大化强化学习不同，GFlowNets 已成为一种有前途的方法，用于采样组合对象、改进泛化并使模型能够保持多样化的假设。我们的早期结果证明了 GFlowNet 微调在搜索设置中增强模型性能的潜力，这尤其重要，因为范式转向推理时间计算扩展和“慢慢思考”。

Title: Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Authors: Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13232
Pdf URL: https://arxiv.org/pdf/2410.13232
Copy Paste: [[2410.13232]] Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation(https://arxiv.org/abs/2410.13232)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) have recently gained much attention in building autonomous agents. However, the performance of current LLM-based web agents in long-horizon tasks is far from optimal, often yielding errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such an irreversible mistake, as we have an awareness of the potential outcomes (e.g., losing money) of our actions, also known as the "world model". Motivated by this, our study first starts with preliminary analyses, confirming the absence of world models in current LLMs (e.g., GPT-4o, Claude-3.5-Sonnet, etc.). Then, we present a World-model-augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. To overcome the challenges in training LLMs as world models predicting next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, where the prediction objectives are free-form natural language descriptions exclusively highlighting important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training and demonstrate our agents' cost- and time-efficiency compared to recent tree-search-based agents.
摘要：大型语言模型 (LLM) 最近在构建自主代理方面引起了广泛关注。然而，当前基于 LLM 的 Web 代理在长期任务中的表现远非最佳，经常会出现错误，例如反复购买不可退款的机票。相比之下，人类可以避免这种不可逆转的错误，因为我们意识到我们行为的潜在结果（例如赔钱），也称为“世界模型”。受此启发，我们的研究首先从初步分析开始，确认当前 LLM（例如 GPT-4o、Claude-3.5-Sonnet 等）中缺乏世界模型。然后，我们提出了一个世界模型增强 (WMA) Web 代理，它可以模拟其行为的结果以便做出更好的决策。为了克服将 LLM 训练为预测下一个观察结果的世界模型所面临的挑战，例如观察结果中的重复元素和较长的 HTML 输入，我们提出了一种以过渡为中心的观察抽象，其中预测目标是自由形式的自然语言描述，专门强调时间步骤之间的重要状态差异。在 WebArena 和 Mind2Web 上的实验表明，我们的世界模型无需训练即可改善代理的策略选择，并且与最近基于树搜索的代理相比，我们的代理具有成本和时间效率。

Title: SPIN: Self-Supervised Prompt INjection

Authors: Leon Zhou, Junfeng Yang, Chengzhi Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13236
Pdf URL: https://arxiv.org/pdf/2410.13236
Copy Paste: [[2410.13236]] SPIN: Self-Supervised Prompt INjection(https://arxiv.org/abs/2410.13236)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly used in a variety of important applications, yet their safety and reliability remain as major concerns. Various adversarial and jailbreak attacks have been proposed to bypass the safety alignment and cause the model to produce harmful responses. We introduce Self-supervised Prompt INjection (SPIN) which can detect and reverse these various attacks on LLMs. As our self-supervised prompt defense is done at inference-time, it is also compatible with existing alignment and adds an additional layer of safety for defense. Our benchmarks demonstrate that our system can reduce the attack success rate by up to 87.9%, while maintaining the performance on benign user requests. In addition, we discuss the situation of an adaptive attacker and show that our method is still resilient against attackers who are aware of our defense.
摘要：大型语言模型 (LLM) 越来越多地用于各种重要应用，但它们的安全性和可靠性仍然是主要问题。已经提出了各种对抗性和越狱攻击来绕过安全对齐并导致模型产生有害响应。我们引入了自监督提示注入 (SPIN)，它可以检测和逆转对 LLM 的各种攻击。由于我们的自监督提示防御是在推理时完成的，因此它也与现有的对齐兼容并为防御增加了额外的安全层。我们的基准测试表明，我们的系统可以将攻击成功率降低高达 87.9%，同时保持对良性用户请求的性能。此外，我们讨论了自适应攻击者的情况，并表明我们的方法仍然能够抵御知道我们防御的攻击者。

Title: Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis

Authors: Yiyi Chen, Qiongxiu Li, Russa Biswas, Johannes Bjerva
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2410.13237
Pdf URL: https://arxiv.org/pdf/2410.13237
Copy Paste: [[2410.13237]] Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis(https://arxiv.org/abs/2410.13237)
Keywords: language model, llm
Abstract: Language Confusion is a phenomenon where Large Language Models (LLMs) generate text that is neither in the desired language, nor in a contextually appropriate language. This phenomenon presents a critical challenge in text generation by LLMs, often appearing as erratic and unpredictable behavior. We hypothesize that there are linguistic regularities to this inherent vulnerability in LLMs and shed light on patterns of language confusion across LLMs. We introduce a novel metric, Language Confusion Entropy, designed to directly measure and quantify this confusion, based on language distributions informed by linguistic typology and lexical variation. Comprehensive comparisons with the Language Confusion Benchmark (Marchisio et al., 2024) confirm the effectiveness of our metric, revealing patterns of language confusion across LLMs. We further link language confusion to LLM security, and find patterns in the case of multilingual embedding inversion attacks. Our analysis demonstrates that linguistic typology offers theoretically grounded interpretation, and valuable insights into leveraging language similarities as a prior for LLM alignment and security.
摘要：语言混淆是一种现象，即大型语言模型 (LLM) 生成的文本既不是所需语言，也不是上下文合适的语言。这种现象对 LLM 的文本生成提出了严峻挑战，通常表现为不稳定和不可预测的行为。我们假设 LLM 中这种固有漏洞存在语言规律，并揭示了 LLM 中语言混淆的模式。我们引入了一种新颖的指标——语言混淆熵，旨在直接测量和量化这种混淆，该指标基于语言类型学和词汇变化所形成的语言分布。与语言混淆基准 (Marchisio 等人，2024) 的全面比较证实了我们指标的有效性，揭示了 LLM 中语言混淆的模式。我们进一步将语言混淆与 LLM 安全性联系起来，并在多语言嵌入反转攻击的情况下找到了模式。我们的分析表明，语言类型学提供了理论上有依据的解释，以及利用语言相似性作为 LLM 对齐和安全性的先验的宝贵见解。
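
Illustrative sketch (not the paper's definition): the abstract does not give the formula for Language Confusion Entropy, so the snippet below only shows the general idea of an entropy over a language distribution, using plain Shannon entropy. The function name `language_entropy` and the per-token language labels are assumptions made for illustration; the typology- and lexicon-informed distributions described in the abstract are not modeled here.

```python
import math
from collections import Counter


def language_entropy(token_langs):
    """Shannon entropy (in bits) of the language labels in one generation.

    token_langs: list of language codes, one per generated token or line
    (hypothetical output of any language-identification step).
    """
    counts = Counter(token_langs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values()]
    # 0 bits means a single language; higher values mean more mixing.
    return sum(-p * math.log2(p) for p in probs)


# Hypothetical labels: a response entirely in German vs. one half in English.
clean = ["de"] * 8
mixed = ["de"] * 4 + ["en"] * 4
clean_entropy = language_entropy(clean)
mixed_entropy = language_entropy(mixed)
```

Under this simplification, a response in one language scores zero and an even two-language mix scores one bit.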

Title: Atomic Calibration of LLMs in Long-Form Generations

Authors: Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, Nigel Collier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13246
Pdf URL: https://arxiv.org/pdf/2410.13246
Copy Paste: [[2410.13246]] Atomic Calibration of LLMs in Long-Form Generations(https://arxiv.org/abs/2410.13246)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, which estimates the underlying uncertainty of model predictions, is essential to enhance the LLMs' trustworthiness. Existing research on LLM calibration has primarily focused on short-form tasks, providing a single confidence score at the response level (macro calibration). However, this approach is insufficient for long-form generations, where responses often contain more complex statements and may include both accurate and inaccurate information. Therefore, we introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims. We classify confidence elicitation methods into discriminative and generative types and demonstrate that their combination can enhance calibration. Our extensive experiments on various LLMs and datasets show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results. Additionally, atomic calibration reveals insightful patterns in LLM confidence throughout the generation process.
摘要：大型语言模型 (LLM) 经常出现幻觉，给实际应用带来重大挑战。置信度校准可以估计模型预测的潜在不确定性，对于提高 LLM 的可信度至关重要。现有的 LLM 校准研究主要集中在短格式任务上，在响应级别提供单一置信度分数（宏校准）。然而，这种方法对于长格式生成来说是不够的，因为响应通常包含更复杂的语句，可能同时包含准确和不准确的信息。因此，我们引入了原子校准，这是一种新颖的方法，通过将长响应分解为原子声明来在细粒度级别评估事实性校准。我们将置信度引出方法分为判别型和生成型，并证明它们的组合可以增强校准。我们对各种 LLM 和数据集进行的大量实验表明，原子校准非常适合长格式生成，也可以改善宏校准结果。此外，原子校准揭示了整个生成过程中 LLM 置信度的深刻模式。
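
Illustrative sketch: the abstract does not say which calibration metric is used, so this example assumes the standard expected calibration error (ECE) and applies it at the level of atomic claims. The claim confidences and correctness labels are invented for illustration, and `expected_calibration_error` is a hypothetical helper name.

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """ECE over atomic claims with equal-width confidence bins.

    confidences: floats in [0, 1], one per atomic claim.
    labels: 1 if the claim is factually correct, else 0.
    """
    if len(confidences) != len(labels):
        raise ValueError("confidences and labels must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, label))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(l for _, l in items) / len(items)
        # Weight each bin's |accuracy - confidence| gap by its share of claims.
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical long response split into four atomic claims.
claim_confidences = [0.9, 0.9, 0.2, 0.2]
claim_correct = [1, 1, 0, 0]
atomic_ece = expected_calibration_error(claim_confidences, claim_correct)
```

A single response-level score would hide the fact that two of the four claims are wrong. Scoring each claim separately exposes where confidence and correctness diverge.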

Title: A Systematic Investigation of Knowledge Retrieval and Selection for Retrieval Augmented Generation

Authors: Xiangci Li, Jessica Ouyang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13258
Pdf URL: https://arxiv.org/pdf/2410.13258
Copy Paste: [[2410.13258]] A Systematic Investigation of Knowledge Retrieval and Selection for Retrieval Augmented Generation(https://arxiv.org/abs/2410.13258)
Keywords: retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has emerged as a powerful method for enhancing natural language generation by integrating external knowledge into a model's output. While prior work has demonstrated the importance of improving knowledge retrieval for boosting generation quality, the role of knowledge selection remains less clear. In this paper, we perform a comprehensive analysis of how knowledge retrieval and selection influence downstream generation performance in RAG systems. By simulating different retrieval and selection conditions through a controlled mixture of gold and distractor knowledge, we assess the impact of these factors on generation outcomes. Our findings indicate that the downstream generator model's capability, as well as the complexity of the task and dataset, significantly influence the impact of knowledge retrieval and selection on the overall RAG system performance. In typical scenarios, improving the knowledge recall score is key to enhancing generation outcomes, with the knowledge selector providing a limited additional benefit when a strong generator model is used on clear, well-defined tasks. For weaker generator models or more ambiguous tasks and datasets, the knowledge F1 score becomes a critical factor, and the knowledge selector plays a more prominent role in improving overall performance.
摘要：检索增强生成 (RAG) 通过将外部知识集成到模型的输出中，已成为一种增强自然语言生成的强大方法。虽然先前的工作已经证明了改进知识检索对于提高生成质量的重要性，但知识选择的作用仍然不太明确。在本文中，我们对知识检索和选择如何影响 RAG 系统中的下游生成性能进行了全面分析。通过控制黄金知识和干扰知识的混合来模拟不同的检索和选择条件，我们评估了这些因素对生成结果的影响。我们的研究结果表明，下游生成器模型的能力以及任务和数据集的复杂性显著影响知识检索和选择对整个 RAG 系统性能的影响。在典型情况下，提高知识回忆分数是增强生成结果的关键，当强大的生成器模型用于明确、定义良好的任务时，知识选择器提供的额外好处有限。对于较弱的生成器模型或更模糊的任务和数据集，知识 F1 分数成为关键因素，知识选择器在提高整体性能方面发挥着更为突出的作用。
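
Illustrative sketch: the abstract refers to knowledge recall and knowledge F1 without defining them, so this example assumes the usual set-based definitions over gold and selected knowledge items. The item identifiers (`k1`, `d1`, ...) and the helper name `knowledge_scores` are hypothetical.

```python
def knowledge_scores(selected, gold):
    """Return (precision, recall, f1) of selected knowledge against gold knowledge."""
    selected, gold = set(selected), set(gold)
    hits = len(selected & gold)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


gold = ["k1", "k2"]

# Retriever output: all gold knowledge plus two distractors.
retrieved = ["k1", "k2", "d1", "d2"]
p_ret, r_ret, f1_ret = knowledge_scores(retrieved, gold)

# After a (hypothetical) selector drops the distractors.
selected = ["k1", "k2"]
p_sel, r_sel, f1_sel = knowledge_scores(selected, gold)
```

In this toy case the selector leaves recall unchanged and raises only F1. That is the setting in which the abstract reports a selector gives limited additional benefit to a strong generator and matters more for weaker generators or more ambiguous tasks.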

Title: From Babbling to Fluency: Evaluating the Evolution of Language Models in Terms of Human Language Acquisition

Authors: Qiyuan Yang, Pengda Wang, Luke D. Plonsky, Frederick L. Oswald, Hanjie Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13259
Pdf URL: https://arxiv.org/pdf/2410.13259
Copy Paste: [[2410.13259]] From Babbling to Fluency: Evaluating the Evolution of Language Models in Terms of Human Language Acquisition(https://arxiv.org/abs/2410.13259)
Keywords: language model
Abstract: We examine the language capabilities of language models (LMs) from the critical perspective of human language acquisition. Building on classical language development theories, we propose a three-stage framework to assess the abilities of LMs, ranging from preliminary word understanding to complex grammar and complex logical reasoning. Using this framework, we evaluate the generative capacities of LMs using methods from linguistic research. Results indicate that although recent LMs outperform earlier models in overall performance, their developmental trajectory does not strictly follow the path of human language acquisition. Notably, in generation tasks, LMs are more similar to human performance in areas where information is easier to extract from the corpus, such as average word length, clauses, and auxiliary verbs. Newer LMs did not exhibit significant progress in terms of specific dimensions, such as clauses and auxiliary verbs, where the variation across corpora is relatively limited. Register theory offers a plausible explanation for these observations, suggesting that the linguistic features of the training data have a substantial impact on the models' abilities.
摘要：我们从人类语言习得的批判性视角来研究语言模型 (LM) 的语言能力。基于经典语言发展理论，我们提出了一个三阶段框架来评估 LM 的能力，从初步的单词理解到复杂的语法和复杂的逻辑推理。利用这个框架，我们使用语言研究的方法来评估 LM 的生成能力。结果表明，尽管最近的 LM 在整体性能上优于早期的模型，但它们的发展轨迹并不严格遵循人类语言习得的路径。值得注意的是，在生成任务中，LM 在信息更容易从语料库中提取的领域（例如平均单词长度、从句和助动词）与人类的表现更相似。较新的 LM 在特定维度（例如从句和助动词）方面没有表现出显着的进步，因为这些维度在语料库中的差异相对有限。语域理论对这些观察结果提供了一个合理的解释，表明训练数据的语言特征对模型的能力有很大的影响。

Title: Roadmap towards Superhuman Speech Understanding using Large Language Models

Authors: Fan Bu, Yuhao Zhang, Xidong Wang, Benyou Wang, Qun Liu, Haizhou Li
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.13268
Pdf URL: https://arxiv.org/pdf/2410.13268
Copy Paste: [[2410.13268]] Roadmap towards Superhuman Speech Understanding using Large Language Models(https://arxiv.org/abs/2410.13268)
Keywords: language model, gpt, llm, prompt
Abstract: The success of large language models (LLMs) has prompted efforts to integrate speech and audio data, aiming to create general foundation models capable of processing both textual and non-textual inputs. Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs, which preserve non-semantic information and world knowledge for deeper speech understanding. To guide the development of speech LLMs, we propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models capable of integrating non-semantic information with abstract acoustic knowledge for complex tasks. Moreover, we design a benchmark, SAGI Benchmark, that standardizes critical aspects across various tasks in these five levels, uncovering challenges in using abstract acoustic knowledge and completeness of capability. Our findings reveal gaps in handling paralinguistic cues and abstract acoustic knowledge, and we offer future directions. This paper outlines a roadmap for advancing speech LLMs, introduces a benchmark for evaluation, and provides key insights into their current limitations and potential.
摘要：大型语言模型 (LLM) 的成功促使人们努力整合语音和音频数据，旨在创建能够处理文本和非文本输入的通用基础模型。最近的进展，例如 GPT-4o，凸显了端到端语音 LLM 的潜力，它保留了非语义信息和世界知识，以实现更深入的语音理解。为了指导语音 LLM 的发展，我们提出了一个五级路线图，从基本的自动语音识别 (ASR) 到能够将非语义信息与抽象声学知识相结合以完成复杂任务的高级超人模型。此外，我们设计了一个基准 SAGI Benchmark，它标准化了这五个级别中各个任务的关键方面，揭示了使用抽象声学知识和能力完整性方面的挑战。我们的研究结果揭示了处理副语言线索和抽象声学知识方面的差距，并提出了未来的方向。本文概述了语音 LLM 发展的路线图，介绍了评估基准，并提供了对其当前局限性和潜力的关键见解。

Title: Breaking Chains: Unraveling the Links in Multi-Hop Knowledge Unlearning

Authors: Minseok Choi, ChaeHun Park, Dohyun Lee, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13274
Pdf URL: https://arxiv.org/pdf/2410.13274
Copy Paste: [[2410.13274]] Breaking Chains: Unraveling the Links in Multi-Hop Knowledge Unlearning(https://arxiv.org/abs/2410.13274)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) serve as giant information stores, often including personal or copyrighted data, and retraining them from scratch is not a viable option. This has led to the development of various fast, approximate unlearning techniques to selectively remove knowledge from LLMs. Prior research has largely focused on minimizing the probabilities of specific token sequences by reversing the language modeling objective. However, these methods still leave LLMs vulnerable to adversarial attacks that exploit indirect references. In this work, we examine the limitations of current unlearning techniques in effectively erasing a particular type of indirect prompt: multi-hop queries. Our findings reveal that existing methods fail to completely remove multi-hop knowledge when one of the intermediate hops is unlearned. To address this issue, we propose MUNCH, a simple uncertainty-based approach that breaks down multi-hop queries into subquestions and leverages the uncertainty of the unlearned model in final decision-making. Empirical results demonstrate the effectiveness of our framework, and MUNCH can be easily integrated with existing unlearning techniques, making it a flexible and useful solution for enhancing unlearning processes.
摘要：大型语言模型 (LLM) 充当巨大的信息存储库，通常包含个人或受版权保护的数据，从头开始重新训练它们并不是一个可行的选择。这导致了各种快速、近似的去学习技术的开发，以选择性地从 LLM 中删除知识。先前的研究主要集中在通过反转语言建模目标来最小化特定标记序列的概率。然而，这些方法仍然使 LLM 容易受到利用间接引用的对抗性攻击。在这项工作中，我们研究了当前去学习技术在有效消除特定类型的间接提示（多跳查询）方面的局限性。我们的研究结果表明，当其中一个中间跳被去学习时，现有方法无法完全消除多跳知识。为了解决这个问题，我们提出了 MUNCH，这是一种简单的基于不确定性的方法，它将多跳查询分解为子问题，并在最终决策中利用去学习后模型的不确定性。实证结果证明了我们框架的有效性，并且 MUNCH 可以轻松地与现有的去学习技术相结合，使其成为增强去学习过程的灵活而有用的解决方案。

Title: SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

Authors: Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13276
Pdf URL: https://arxiv.org/pdf/2410.13276
Copy Paste: [[2410.13276]] SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs(https://arxiv.org/abs/2410.13276)
Keywords: language model, llm
Abstract: Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic complexity limits the efficiency and scalability of LLMs, especially for those with a long-context window. A promising approach addressing this limitation is to leverage the sparsity in attention. However, existing sparsity-based solutions predominantly rely on predefined patterns or heuristics to approximate sparsity. This practice falls short of fully capturing the dynamic nature of attention sparsity in language-based tasks. This paper argues that attention sparsity should be learned rather than predefined. To this end, we design SeerAttention, a new Attention mechanism that augments the conventional attention with a learnable gate that adaptively selects significant blocks in an attention map and deems the remaining blocks sparse. Such block-level sparsity effectively balances accuracy and speedup. To enable efficient learning of the gating network, we develop a customized FlashAttention implementation that extracts the block-level ground truth of attention map with minimum overhead. SeerAttention not only applies to post-training, but also excels in long-context fine-tuning. Our results show that at post-training stages, SeerAttention significantly outperforms state-of-the-art static or heuristic-based sparse attention methods, while also being more versatile and flexible to adapt to varying context lengths and sparsity ratios. When applied to long-context fine-tuning with YaRN, SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k context length with minimal perplexity loss, offering a 5.67x speedup over FlashAttention-2.
摘要：注意力是现代大型语言模型 (LLM) 的基石。然而，它的二次复杂度限制了 LLM 的效率和可扩展性，尤其是对于那些具有长上下文窗口的 LLM。解决这一限制的一个有前途的方法是利用注意力的稀疏性。然而，现有的基于稀疏性的解决方案主要依赖于预定义的模式或启发式方法来近似稀疏性。这种做法不足以完全捕捉语言任务中注意力稀疏性的动态性质。本文认为注意力稀疏性应该是学习的，而不是预定义的。为此，我们设计了 SeerAttention，一种新的注意力机制，它通过可学习的门增强了传统的注意力，该门自适应地选择注意力图中的重要块，并认为其余块是稀疏的。这种块级稀疏性有效地平衡了准确性和加速。为了实现门控网络的高效学习，我们开发了一个定制的 FlashAttention 实现，以最小的开销提取注意力图的块级基本事实。 SeerAttention 不仅适用于后期训练，在长上下文微调方面也表现出色。我们的结果表明，在后期训练阶段，SeerAttention 的表现明显优于最先进的静态或基于启发式的稀疏注意方法，同时还更加通用和灵活，可以适应不同的上下文长度和稀疏率。当使用 YaRN 进行长上下文微调时，SeerAttention 可以在 32k 上下文长度下实现显著的 90% 稀疏率，同时将困惑度损失降至最低，与 FlashAttention-2 相比，速度提高了 5.67 倍。

Title: BANTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla

Authors: Fabiha Haider, Fariha Tanjim Shifat, Md Farhan Ishmam, Deeparghya Dutta Barua, Md Sakib Ul Rahman Sourove, Md Fahim, Md Farhad Alam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13281
Pdf URL: https://arxiv.org/pdf/2410.13281
Copy Paste: [[2410.13281]] BANTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla(https://arxiv.org/abs/2410.13281)
Keywords: llm, prompt
Abstract: The proliferation of transliterated texts in digital spaces has emphasized the need for detecting and classifying hate speech in languages beyond English, particularly in low-resource languages. As online discourse can perpetuate discrimination based on target groups, e.g. gender, religion, and origin, multi-label classification of hateful content can help in comprehending hate motivation and enhance content moderation. While previous efforts have focused on monolingual or binary hate classification tasks, no work has yet addressed the challenge of multi-label hate speech classification in transliterated Bangla. We introduce BanTH, the first multi-label transliterated Bangla hate speech dataset comprising 37.3k samples. The samples are sourced from YouTube comments, where each instance is labeled with one or more target groups, reflecting the regional demographic. We establish novel transformer encoder-based baselines by further pre-training on transliterated Bangla corpus. We also propose a novel translation-based LLM prompting strategy for transliterated text. Experiments reveal that our further pre-trained encoders are achieving state-of-the-art performance on the BanTH dataset, while our translation-based prompting outperforms other strategies in the zero-shot setting. The introduction of BanTH not only fills a critical gap in hate speech research for Bangla but also sets the stage for future exploration into code-mixed and multi-label classification challenges in underrepresented languages.
摘要：数字空间中音译文本的激增凸显了检测和分类除英语以外的语言仇恨言论的必要性，尤其是资源匮乏的语言。由于在线言论可能会延续基于目标群体（例如性别、宗教和出身）的歧视，因此对仇恨内容进行多标签分类有助于理解仇恨动机并加强内容审核。虽然之前的努力主要集中在单语或二元仇恨分类任务上，但目前还没有一项研究解决音译孟加拉语中多标签仇恨言论分类的挑战。我们推出了 BanTH，这是第一个多标签音译孟加拉语仇恨言论数据集，包含 37.3k 个样本。这些样本来自 YouTube 评论，其中每个实例都标有一个或多个目标群体，反映了区域人口统计。我们通过对音译孟加拉语料库进行进一步预训练，建立了基于 Transformer 编码器的新型基线。我们还提出了一种基于翻译的音译文本 LLM 提示策略。实验表明，我们进一步预训练的编码器在 BanTH 数据集上实现了最佳性能，而我们基于翻译的提示在零样本设置中的表现优于其他策略。BanTH 的引入不仅填补了孟加拉语仇恨言论研究的关键空白，还为未来探索代表性不足的语言中的混合代码和多标签分类挑战奠定了基础。

Title: Learning to Route with Confidence Tokens

Authors: Yu-Neng Chuang, Helen Zhou, Prathusha Kameswara Sarma, Parikshit Gopalan, John Boccio, Sara Bolouki, Xia Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13284
Pdf URL: https://arxiv.org/pdf/2410.13284
Copy Paste: [[2410.13284]] Learning to Route with Confidence Tokens(https://arxiv.org/abs/2410.13284)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated impressive performance on several tasks and are increasingly deployed in real-world applications. However, especially in high-stakes settings, it becomes vital to know when the output of an LLM may be unreliable. Depending on whether an answer is trustworthy, a system can then choose to route the question to another expert, or otherwise fall back on a safe default behavior. In this work, we study the extent to which LLMs can reliably indicate confidence in their answers, and how this notion of confidence can translate into downstream accuracy gains. We propose Self-REF, a lightweight training strategy to teach LLMs to express confidence in whether their answers are correct in a reliable manner. Self-REF introduces confidence tokens into the LLM, from which a confidence score can be extracted. Compared to conventional approaches such as verbalizing confidence and examining token probabilities, we demonstrate empirically that confidence tokens show significant improvements in downstream routing and rejection learning tasks.
摘要：大型语言模型 (LLM) 在多项任务上表现出色，并越来越多地应用于实际应用。然而，尤其是在高风险环境中，了解 LLM 的输出何时可能不可靠变得至关重要。根据答案是否可信，系统可以选择将问题路由给另一位专家，或者以其他方式采用安全的默认行为。在这项工作中，我们研究了 LLM 能够可靠地表明其答案的信心程度，以及这种信心概念如何转化为下游准确度的提高。我们提出了 Self-REF，这是一种轻量级的训练策略，用于教 LLM 以可靠的方式表达对其答案是否正确的信心。Self-REF 将信心标记引入 LLM，从中可以提取信心分数。与口头表达信心和检查标记概率等传统方法相比，我们通过实证证明，信心标记在下游路由和拒绝学习任务中表现出显着的改进。
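
Illustrative sketch: the abstract says a confidence score can be extracted from confidence tokens and used for routing or rejection, but gives no formula. The example below assumes the score is the normalized probability of a "confident" token against an "unconfident" one, followed by a fixed threshold. The function names, the two-token setup, and the threshold value are all assumptions, not details confirmed by the source.

```python
def confidence_score(p_confident, p_unconfident):
    """Normalize the probabilities of two hypothetical confidence tokens."""
    total = p_confident + p_unconfident
    if total <= 0:
        raise ValueError("token probabilities must sum to a positive value")
    return p_confident / total


def route(answer, score, threshold=0.5):
    """Accept the local answer if the score clears the threshold, else defer."""
    if score >= threshold:
        return ("accept", answer)
    # Below threshold: hand off to another expert or a safe default behavior.
    return ("route_to_expert", None)


# Hypothetical token probabilities read off the model's output distribution.
high = confidence_score(0.6, 0.2)
low = confidence_score(0.1, 0.3)
decision_high = route("Paris", high, threshold=0.7)
decision_low = route("Paris", low, threshold=0.7)
```

The threshold sets the trade-off between how many queries the model answers itself and how many are deferred; in practice it would be tuned on held-out data.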

Title: Advancing Large Language Model Attribution through Self-Improving

Authors: Lei Huang, Xiaocheng Feng, Weitao Ma, Liang Zhao, Yuchun Fan, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13298
Pdf URL: https://arxiv.org/pdf/2410.13298
Copy Paste: [[2410.13298]] Advancing Large Language Model Attribution through Self-Improving(https://arxiv.org/abs/2410.13298)
Keywords: language model, llm, hallucination
Abstract: Teaching large language models (LLMs) to generate text with citations to evidence sources can mitigate hallucinations and enhance verifiability in information-seeking systems. However, improving this capability requires high-quality attribution data, which is costly and labor-intensive. Inspired by recent advances in self-improvement that enhance LLMs without manual annotation, we present START, a Self-Taught AttRibuTion framework for iteratively improving the attribution capability of LLMs. First, to prevent models from stagnating due to initially insufficient supervision signals, START leverages the model to self-construct synthetic training data for warming up. To further self-improve the model's attribution ability, START iteratively utilizes fine-grained preference supervision signals constructed from its sampled responses to encourage robust, comprehensive, and attributable generation. Experiments on three open-domain question-answering datasets, covering long-form QA and multi-step reasoning, demonstrate significant performance gains of 25.13% on average without relying on human annotations and more advanced models. Further analysis reveals that START excels in aggregating information across multiple sources.
摘要：教导大型语言模型 (LLM) 生成带有证据来源引文的文本可以减轻幻觉并增强信息搜索系统的可验证性。但是，提高这种能力需要高质量的归因数据，而这需要大量成本和人力。受最近在无需人工注释的情况下增强 LLM 的自我改进方面取得的进展的启发，我们提出了 START，这是一个自学式 AttRibuTion 框架，用于迭代改进 LLM 的归因能力。首先，为了防止模型因最初监督信号不足而停滞不前，START 利用模型自行构建合成训练数据进行预热。为了进一步自我改进模型的归因能力，START 迭代地利用从其采样响应构建的细粒度偏好监督信号来鼓励稳健、全面和可归因的生成。在三个开放域问答数据集（涵盖长篇问答和多步推理）上的实验表明，在不依赖人工注释和更高级模型的情况下，性能平均提升了 25.13%。进一步分析表明，START 在跨多个来源聚合信息方面表现出色。

Title: Reference-Based Post-OCR Processing with LLM for Diacritic Languages

Authors: Thao Do
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.13305
Pdf URL: https://arxiv.org/pdf/2410.13305
Copy Paste: [[2410.13305]] Reference-Based Post-OCR Processing with LLM for Diacritic Languages(https://arxiv.org/abs/2410.13305)
Keywords: language model, llm
Abstract: Extracting fine-grained OCR text from aged documents in diacritic languages remains challenging due to unexpected artifacts, time-induced degradation, and lack of datasets. While standalone spell correction approaches have been proposed, they show limited performance for historical documents due to numerous possible OCR error combinations and differences between modern and classical corpus distributions. We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text, supported by large language models. This technique generates high-precision pseudo-page-to-page labels for diacritic languages, where small strokes pose significant challenges in historical conditions. The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences. Our post-processing method, which generated a large OCR dataset of classical Vietnamese books, achieved a mean grading score of 8.72 on a 10-point scale. This outperformed the state-of-the-art transformer-based Vietnamese spell correction model, which scored 7.03 when evaluated on a sampled subset of the dataset. We also trained a baseline OCR model to assess and compare it with well-known engines. Experimental results demonstrate the strength of our baseline model compared to widely used open-source solutions. The resulting dataset will be released publicly to support future studies.
摘要：由于意外的伪影、时间导致的退化以及缺乏数据集，从变音语言的旧文档中提取细粒度 OCR 文本仍然具有挑战性。虽然已经提出了独立的拼写校正方法，但由于 OCR 错误组合众多，以及现代语料库和古典语料库分布之间的差异，它们对历史文档的性能有限。我们提出了一种方法，利用可用的以内容为中心的电子书作为参考基础来纠正不完美的 OCR 生成文本，并由大型语言模型支持。该技术为变音语言生成高精度伪页到页标签，其中小笔画在历史条件下构成重大挑战。该流程消除了旧文档中的各种类型的噪音，并解决了字符缺失、单词缺失和序列错乱等问题。我们的后处理方法生成了一个大型越南古典书籍 OCR 数据集，在 10 分制中获得了 8.72 的平均评分。这优于最先进的基于 Transformer 的越南语拼写校正模型，后者在对数据集的抽样子集进行评估时得分为 7.03。我们还训练了一个基线 OCR 模型来评估它并与知名引擎进行比较。实验结果表明，与广泛使用的开源解决方案相比，我们的基线模型更胜一筹。结果数据集将公开发布，以支持未来的研究。

Title: Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language

Authors: Xinmeng Hou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13313
Pdf URL: https://arxiv.org/pdf/2410.13313
Copy Paste: [[2410.13313]] Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language(https://arxiv.org/abs/2410.13313)
Keywords: language model, llm
Abstract: This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language, particularly for casual and non-mainstream language uses. We contribute two newly annotated datasets that achieve higher inter-annotator agreement between human and language model (LLM) annotations compared to original datasets based on descriptive instructions. Our experiments show that LLMs can serve as effective alternatives when professional annotators are unavailable. Moreover, smaller models fine-tuned on multi-source LLM-annotated data outperform models trained on larger, single-source human-annotated datasets. These findings highlight the value of structured guidelines in reducing subjective variability, maintaining performance with limited data, and embracing language diversity. Content Warning: This article only analyzes offensive language for academic purposes. Discretion is advised.
摘要：本研究引入了基于人文研究的规范性注释基准，以确保对冒犯性语言（特别是针对非正式和非主流语言使用）进行一致、无偏见的标记。我们贡献了两个新注释的数据集，与基于描述性说明的原始数据集相比，它们在人工和语言模型 (LLM) 注释之间实现了更高的注释者间一致性。我们的实验表明，当没有专业注释者时，LLM 可以作为有效的替代方案。此外，在多源 LLM 注释数据上微调的小型模型优于在较大的单源人工注释数据集上训练的模型。这些发现凸显了结构化指南在减少主观差异、在有限数据下保持性能以及拥抱语言多样性方面的价值。内容警告：本文仅出于学术目的分析冒犯性语言。建议谨慎使用。

Title: Fine-Tuning Language Models on Multiple Datasets for Citation Intention Classification

Authors: Zeren Shui, Petros Karypis, Daniel S. Karls, Mingjian Wen, Saurav Manchanda, Ellad B. Tadmor, George Karypis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13332
Pdf URL: https://arxiv.org/pdf/2410.13332
Copy Paste: [[2410.13332]] Fine-Tuning Language Models on Multiple Datasets for Citation Intention Classification(https://arxiv.org/abs/2410.13332)
Keywords: language model
Abstract: Citation Intention Classification (CIC) tools classify citations by their intention (e.g., background, motivation) and assist readers in evaluating the contribution of scientific literature. Prior research has shown that pretrained language models (PLMs) such as SciBERT can achieve state-of-the-art performance on CIC benchmarks. PLMs are trained via self-supervision tasks on a large corpus of general text and can quickly adapt to CIC tasks via moderate fine-tuning on the corresponding dataset. Despite their advantages, PLMs can easily overfit small datasets during fine-tuning. In this paper, we propose a multi-task learning (MTL) framework that jointly fine-tunes PLMs on a dataset of primary interest together with multiple auxiliary CIC datasets to take advantage of additional supervision signals. We develop a data-driven task relation learning (TRL) method that controls the contribution of auxiliary datasets to avoid negative transfer and expensive hyper-parameter tuning. We conduct experiments on three CIC datasets and show that fine-tuning with additional datasets can improve the PLMs' generalization performance on the primary dataset. PLMs fine-tuned with our proposed framework outperform the current state-of-the-art models by 7% to 11% on small datasets while matching the best-performing model on a large dataset.

Title: Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

Authors: Isack Lee, Haebin Seong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13334
Pdf URL: https://arxiv.org/pdf/2410.13334
Copy Paste: [[2410.13334]] Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems(https://arxiv.org/abs/2410.13334)
Keywords: language model, gpt, llm, prompt
Abstract: Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as "jailbreaks", where malicious inputs can coerce LLMs into generating harmful content. To address these issues, many LLM developers have implemented various safety measures to align these models. This alignment involves several techniques, including data filtering during pre-training, supervised fine-tuning, reinforcement learning from human feedback, and red-teaming exercises. These methods often introduce deliberate biases similar to Political Correctness (PC) to ensure the ethical behavior of LLMs. In this paper, we delve into the intentional biases injected into LLMs for safety purposes and examine methods to circumvent these safety alignment techniques. Notably, these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of PCJailbreak, highlighting the inherent risks posed by these safety-induced biases. Additionally, we propose an efficient defense method, PCDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. PCDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize the urgent need for LLM developers to adopt a more responsible approach when designing and implementing safety measures.

Title: Probing-RAG: Self-Probing to Guide Language Models in Selective Document Retrieval

Authors: Ingeol Baek, Hwan Chang, Byeongjeong Kim, Jimin Lee, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13339
Pdf URL: https://arxiv.org/pdf/2410.13339
Copy Paste: [[2410.13339]] Probing-RAG: Self-Probing to Guide Language Models in Selective Document Retrieval(https://arxiv.org/abs/2410.13339)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances language models by retrieving and incorporating relevant external knowledge. However, traditional retrieve-and-generate processes may not be optimized for real-world scenarios, where queries might require multiple retrieval steps or none at all. In this paper, we propose Probing-RAG, which utilizes the hidden state representations from the intermediate layers of language models to adaptively determine the necessity of additional retrievals for a given query. By employing a pre-trained prober, Probing-RAG effectively captures the model's internal cognition, enabling reliable decision-making about retrieving external documents. Experimental results across five open-domain QA datasets demonstrate that Probing-RAG outperforms previous methods while reducing the number of redundant retrieval steps.
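The adaptive retrieval loop described in this abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the prober is modelled here as a simple logistic classifier over a hidden-state vector, and `get_hidden_state`, `retrieve`, the threshold, and the step limit are all hypothetical placeholders.

```python
import math
from typing import Callable, List, Sequence


def prober_score(hidden_state: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic probe: estimated probability that no further retrieval is needed."""
    z = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def adaptive_retrieve(
    query: str,
    get_hidden_state: Callable[[str, List[str]], Sequence[float]],  # hypothetical model hook
    retrieve: Callable[[str, int], str],                            # hypothetical retriever
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,
    max_steps: int = 3,
) -> List[str]:
    """Retrieve documents one at a time until the probe says the model is ready."""
    docs: List[str] = []
    for step in range(max_steps):
        state = get_hidden_state(query, docs)
        if prober_score(state, weights, bias) >= threshold:
            break  # probe is confident enough: skip further retrieval
        docs.append(retrieve(query, step))
    return docs


# Toy stand-ins: the "hidden state" is just the number of documents seen so far.
def toy_hidden_state(query: str, docs: List[str]) -> Sequence[float]:
    return [float(len(docs))]


def toy_retrieve(query: str, step: int) -> str:
    return f"doc-{step} for {query!r}"


docs = adaptive_retrieve("who wrote Hamlet?", toy_hidden_state, toy_retrieve, weights=[2.0], bias=-3.0)
```

With these toy weights the probe's score rises as documents accumulate, so the loop stops after two retrievals. A query the probe already rates above the threshold would trigger none at all.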

Title: Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models

Authors: Yu Yuan, Lili Zhao, Kai Zhang, Guangting Zheng, Qi Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13343
Pdf URL: https://arxiv.org/pdf/2410.13343
Copy Paste: [[2410.13343]] Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models(https://arxiv.org/abs/2410.13343)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in various natural language processing tasks. However, LLMs may rely on dataset biases as shortcuts for prediction, which can significantly impair their robustness and generalization capabilities. This paper presents Shortcut Suite, a comprehensive test suite designed to evaluate the impact of shortcuts on LLMs' performance, incorporating six shortcut types, five evaluation metrics, and four prompting strategies. Our extensive experiments yield several key findings: 1) LLMs demonstrate varying reliance on shortcuts for downstream tasks, significantly impairing their performance. 2) Larger LLMs are more likely to utilize shortcuts under zero-shot and few-shot in-context learning prompts. 3) Chain-of-thought prompting notably reduces shortcut reliance and outperforms other prompting strategies, while few-shot prompts generally underperform compared to zero-shot prompts. 4) LLMs often exhibit overconfidence in their predictions, especially when dealing with datasets that contain shortcuts. 5) LLMs generally have a lower explanation quality in shortcut-laden datasets, with errors falling into three types: distraction, disguised comprehension, and logical fallacy. Our findings offer new insights for evaluating robustness and generalization in LLMs and suggest potential directions for mitigating the reliance on shortcuts. The code is available at this https URL.

Title: Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement

Authors: Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13344
Pdf URL: https://arxiv.org/pdf/2410.13344
Copy Paste: [[2410.13344]] Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement(https://arxiv.org/abs/2410.13344)
Keywords: language model, llm
Abstract: Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding. Recently, parallel decoding has shown significant promise in enhancing inference efficiency. However, we have identified two key issues with existing parallel decoding frameworks: (1) decoding heads fail to balance prediction accuracy and the parallelism of execution, and (2) parallel decoding is not a universal solution, as it can bring unnecessary overheads at some challenging decoding steps. To address these issues, we propose Cerberus, an adaptive parallel decoding framework that introduces a gating mechanism to enable LLMs to adaptively choose an appropriate decoding approach at each decoding step, along with a new paradigm of decoding heads that introduces sequential knowledge while maintaining execution parallelism. The experimental results demonstrate that Cerberus can achieve up to a 2.12x speedup compared to auto-regressive decoding and outperforms one of the leading parallel decoding frameworks, Medusa, with a 10% - 30% increase in acceleration and superior generation quality.

Title: Representation Learning of Structured Data for Medical Foundation Models

Authors: Vijay Prakash Dwivedi, Viktor Schlegel, Andy T. Liu, Thanh-Tung Nguyen, Abhinav Ramesh Kashyap, Jeng Wei, Wei-Hsian Yin, Stefan Winkler, Robby T. Tan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13351
Pdf URL: https://arxiv.org/pdf/2410.13351
Copy Paste: [[2410.13351]] Representation Learning of Structured Data for Medical Foundation Models(https://arxiv.org/abs/2410.13351)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various domains, including healthcare. However, their ability to effectively represent structured non-textual data, such as the alphanumeric medical codes used in records like ICD-10 or SNOMED-CT, is limited and has been particularly exposed in recent research. This paper examines the challenges LLMs face in processing medical codes due to the shortcomings of current tokenization methods. As a result, we introduce the UniStruct architecture to design a multimodal medical foundation model of unstructured text and structured data, which addresses these challenges by adapting subword tokenization techniques specifically for the structured medical codes. Our approach is validated through model pre-training on both an extensive internal medical database and a public repository of structured medical records. Trained on over 1 billion tokens on the internal medical database, the proposed model achieves up to a 23% improvement in evaluation metrics, with around 2% gain attributed to our proposed tokenization. Additionally, when evaluated on the EHRSHOT public benchmark with a 1/1000 fraction of the pre-training data, the UniStruct model improves performance on over 42% of the downstream tasks. Our approach not only enhances the representation and generalization capabilities of patient-centric models but also bridges a critical gap in representation learning models' ability to handle complex structured medical data, alongside unstructured text.

Title: LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights

Authors: Odysseas S. Chlapanis, Dimitrios Galanis, Ion Androutsopoulos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13352
Pdf URL: https://arxiv.org/pdf/2410.13352
Copy Paste: [[2410.13352]] LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights(https://arxiv.org/abs/2410.13352)
Keywords: language model, gpt, llm
Abstract: We present Legal Argument Reasoning (LAR), a novel task designed to evaluate the legal reasoning capabilities of Large Language Models (LLMs). The task requires selecting the correct next statement (from multiple choice options) in a chain of legal arguments from court proceedings, given the facts of the case. We constructed a dataset (LAR-ECHR) for this task using cases from the European Court of Human Rights (ECHR). We evaluated seven general-purpose LLMs on LAR-ECHR and found that (a) the ranking of the models is aligned with that of LegalBench, an established US-based legal reasoning benchmark, even though LAR-ECHR is based on EU law, (b) LAR-ECHR distinguishes top models more clearly, compared to LegalBench, (c) even the best model (GPT-4o) obtains 75.8% accuracy on LAR-ECHR, indicating significant potential for further model improvement. The process followed to construct LAR-ECHR can be replicated with cases from other legal systems.

Title: Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence

Authors: Markus Huff, Elanur Ulakçı
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13392
Pdf URL: https://arxiv.org/pdf/2410.13392
Copy Paste: [[2410.13392]] Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence(https://arxiv.org/abs/2410.13392)
Keywords: language model, gpt, llm, chat, agent
Abstract: Large language models (LLMs) have shown impressive alignment with human cognitive processes, raising questions about the extent of their similarity to human cognition. This study investigates whether LLMs, specifically ChatGPT, possess metacognitive monitoring abilities akin to humans, particularly in predicting memory performance on an item-by-item basis. We employed a cross-agent prediction model to compare the metacognitive performance of humans and ChatGPT in a language-based memory task involving garden-path sentences preceded by either fitting or unfitting context sentences. Both humans and ChatGPT rated the memorability of these sentences; humans then completed a surprise recognition memory test. Our findings reveal a significant positive relationship between humans' memorability ratings and their actual recognition performance, indicating reliable metacognitive monitoring. In contrast, ChatGPT did not exhibit a similar predictive capability. Bootstrapping analyses demonstrated that none of the GPT models tested (GPT-3.5-turbo, GPT-4-turbo, GPT-4o) could accurately predict human memory performance on a per-item basis. This suggests that, despite their advanced language processing abilities and alignment with human cognition at the object level, current LLMs lack the metacognitive mechanisms that enable humans to anticipate their memory performance. These results highlight a fundamental difference between human and AI cognition at the metacognitive level. Addressing this gap is crucial for developing AI systems capable of effective self-monitoring and adaptation to human needs, thereby enhancing human-AI interactions across domains such as education and personalized learning.

Title: Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

Authors: Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13394
Pdf URL: https://arxiv.org/pdf/2410.13394
Copy Paste: [[2410.13394]] Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs(https://arxiv.org/abs/2410.13394)
Keywords: llm
Abstract: Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments compared to proprietary models, demonstrating the effectiveness of such cross-lingual evaluation in low resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.

Title: Linguistically Grounded Analysis of Language Models using Shapley Head Values

Authors: Marcell Fekete, Johannes Bjerva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13396
Pdf URL: https://arxiv.org/pdf/2410.13396
Copy Paste: [[2410.13396]] Linguistically Grounded Analysis of Language Models using Shapley Head Values(https://arxiv.org/abs/2410.13396)
Keywords: language model
Abstract: Understanding how linguistic knowledge is encoded in language models is crucial for improving their generalisation capabilities. In this paper, we investigate the processing of morphosyntactic phenomena, by leveraging a recently proposed method for probing language models via Shapley Head Values (SHVs). Using the English language BLiMP dataset, we test our approach on two widely used models, BERT and RoBERTa, and compare how linguistic constructions such as anaphor agreement and filler-gap dependencies are handled. Through quantitative pruning and qualitative clustering analysis, we demonstrate that attention heads responsible for processing related linguistic phenomena cluster together. Our results show that SHV-based attributions reveal distinct patterns across both models, providing insights into how language models organize and process linguistic information. These findings support the hypothesis that language models learn subnetworks corresponding to linguistic theory, with potential implications for cross-linguistic model analysis and interpretability in Natural Language Processing (NLP).
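For readers unfamiliar with Shapley values, the attribution idea behind Shapley Head Values can be illustrated on a toy scale. The sketch below computes exact Shapley values by enumerating every ordering of three hypothetical attention heads. The head names and the `toy_accuracy` function are invented for illustration, and exact enumeration is feasible only for a handful of players, so work on real models would presumably rely on an approximation.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal gain from adding p to the players that came before it.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}


def toy_accuracy(active_heads: FrozenSet[str]) -> float:
    """Hypothetical task accuracy when only `active_heads` are left unmasked."""
    score = 0.5                                        # chance-level baseline
    if "h0" in active_heads:
        score += 0.3                                   # h0 helps on its own
    if "h1" in active_heads and "h2" in active_heads:
        score += 0.1                                   # h1 and h2 only help together
    return score


head_values = shapley_values(["h0", "h1", "h2"], toy_accuracy)
```

In this toy game `h0` receives the full 0.3 it contributes alone, while `h1` and `h2` split their joint 0.1 evenly. Heads that only matter together receive matching attributions, which is one way to see why related heads might cluster.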

Title: Think Thrice Before You Act: Progressive Thought Refinement in Large Language Models

Authors: Chengyu Du, Jinyi Han, Yizhou Ying, Aili Chen, Qianyu He, Haokun Zhao, Sirui Xia, Haoran Guo, Jiaqing Liang, Zulong Chen, Liangyue Li, Yanghua Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13413
Pdf URL: https://arxiv.org/pdf/2410.13413
Copy Paste: [[2410.13413]] Think Thrice Before You Act: Progressive Thought Refinement in Large Language Models(https://arxiv.org/abs/2410.13413)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have demonstrated that progressive refinement, rather than providing a single answer, results in more accurate and thoughtful outputs. However, existing methods often rely heavily on supervision signals to evaluate previous responses, making it difficult to assess output quality in more open-ended scenarios effectively. Additionally, these methods are typically designed for specific tasks, which limits their generalization to new domains. To address these limitations, we propose Progressive Thought Refinement (PTR), a framework that enables LLMs to refine their responses progressively. PTR operates in two phases: (1) Thought Data Construction Phase: We propose a collaborative selection strategy using weak and strong models to build a high-quality progressive refinement dataset that ensures logical consistency from thoughts to answers, with the answers gradually refined in each round. (2) Thought-Mask Fine-Tuning Phase: We design a training structure to mask the "thought" and adjust loss weights to encourage LLMs to refine prior thought, teaching them to implicitly understand "how to improve" rather than "what is correct." Experimental results show that PTR significantly enhances LLM performance across ten diverse tasks (avg. from 49.6% to 53.5%) without task-specific fine-tuning. Notably, in more open-ended tasks, LLMs also demonstrate substantial improvements in the quality of responses beyond mere accuracy, suggesting that PTR truly teaches LLMs to self-improve over time.
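The abstract does not give the exact form of the thought-mask loss. As a rough illustration of what masking the thought and reweighting the loss can mean, the sketch below averages a per-token negative log-likelihood in which thought tokens get weight zero and answer tokens get a configurable weight. The function name, the weighting scheme, and the toy probabilities are all assumptions.

```python
import math
from typing import Sequence


def masked_weighted_nll(
    token_probs: Sequence[float],   # model probability of each target token
    is_thought: Sequence[bool],     # True where the token belongs to the "thought"
    answer_weight: float = 1.0,     # hypothetical weight on answer tokens
) -> float:
    """Weighted mean negative log-likelihood with thought tokens masked out."""
    total, weight_sum = 0.0, 0.0
    for p, thought in zip(token_probs, is_thought):
        w = 0.0 if thought else answer_weight   # thought tokens carry no loss
        total += -w * math.log(p)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Toy sequence: one thought token followed by two answer tokens.
loss = masked_weighted_nll([0.1, 0.5, 0.5], [True, False, False])
```

Here the poorly predicted thought token (probability 0.1) has no effect on the loss, so gradient signal would come only from the refined answer.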

Title: MedINST: Meta Dataset of Biomedical Instructions

Authors: Wenhan Han, Meng Fang, Zihan Zhang, Yu Yin, Zirui Song, Ling Chen, Mykola Pechenizkiy, Qingyu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13458
Pdf URL: https://arxiv.org/pdf/2410.13458
Copy Paste: [[2410.13458]] MedINST: Meta Dataset of Biomedical Instructions(https://arxiv.org/abs/2410.13458)
Keywords: language model, llm
Abstract: The integration of large language model (LLM) techniques in the field of medical analysis has brought about significant advancements, yet the scarcity of large, diverse, and well-annotated datasets remains a major challenge. Medical data and tasks, which vary in format, size, and other parameters, require extensive preprocessing and standardization for effective use in training LLMs. To address these challenges, we introduce MedINST, the Meta Dataset of Biomedical Instructions, a novel multi-domain, multi-task instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over 7 million training samples, making it the most comprehensive biomedical instruction dataset to date. Using MedINST as the meta dataset, we curate MedINST32, a challenging benchmark with different task difficulties aiming to evaluate LLMs' generalization ability. We fine-tune several LLMs on MedINST and evaluate on MedINST32, showcasing enhanced cross-task generalization.

Title: Breaking the Manual Annotation Bottleneck: Creating a Comprehensive Legal Case Criticality Dataset through Semi-Automated Labeling

Authors: Ronja Stern, Ken Kawamura, Matthias Stürmer, Ilias Chalkidis, Joel Niklaus
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13460
Pdf URL: https://arxiv.org/pdf/2410.13460
Copy Paste: [[2410.13460]] Breaking the Manual Annotation Bottleneck: Creating a Comprehensive Legal Case Criticality Dataset through Semi-Automated Labeling(https://arxiv.org/abs/2410.13460)
Keywords: language model
Abstract: Predicting case criticality helps legal professionals in the court system manage large volumes of case law. This paper introduces the Criticality Prediction dataset, a new resource for evaluating the potential influence of Swiss Federal Supreme Court decisions on future jurisprudence. Unlike existing approaches that rely on resource-intensive manual annotations, we semi-automatically derive labels leading to a much larger dataset than otherwise possible. Our dataset features a two-tier labeling system: (1) the LD-Label, which identifies cases published as Leading Decisions (LD), and (2) the Citation-Label, which ranks cases by their citation frequency and recency. This allows for a more nuanced evaluation of case importance. We evaluate several multilingual models, including fine-tuned variants and large language models, and find that fine-tuned models consistently outperform zero-shot baselines, demonstrating the need for task-specific adaptation. Our contributions include the introduction of this task and the release of a multilingual dataset to the research community.
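The abstract says the Citation-Label ranks cases by citation frequency and recency but does not give a formula. One hypothetical way to combine the two is an exponentially decayed citation count, sketched below. The half-life, the function names, and the toy citation years are assumptions and do not describe the dataset's actual labeling rule.

```python
from typing import Dict, List


def recency_weighted_score(citation_years: List[int], current_year: int, half_life: float = 5.0) -> float:
    """Sum of citations, each halved in weight for every `half_life` years of age."""
    return sum(0.5 ** ((current_year - year) / half_life) for year in citation_years)


def rank_cases(citations: Dict[str, List[int]], current_year: int) -> List[str]:
    """Case identifiers ordered from most to least critical under the toy score."""
    return sorted(
        citations,
        key=lambda case: recency_weighted_score(citations[case], current_year),
        reverse=True,
    )


# Hypothetical cases mapped to the years in which they were cited.
toy_citations = {
    "case_A": [2010, 2011],   # two old citations
    "case_B": [2022, 2023],   # two recent citations
    "case_C": [2023],         # one recent citation
}
ranking = rank_cases(toy_citations, current_year=2024)
```

Under this toy score a single recent citation outweighs two citations from more than a decade ago, which is the kind of frequency-versus-recency trade-off such a label has to settle.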

Title: IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

Authors: Jielin Song, Siyu Liu, Bin Zhu, Yanghui Rao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13464
Pdf URL: https://arxiv.org/pdf/2410.13464
Copy Paste: [[2410.13464]] IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection(https://arxiv.org/abs/2410.13464)
Keywords: language model, gpt, llm
Abstract: As large language models (LLMs) continue to advance, instruction tuning has become critical for improving their ability to generate accurate and contextually appropriate responses. Although numerous instruction-tuning datasets have been developed to enhance LLM performance, selecting high-quality instruction data from large source datasets typically demands significant human effort. In this work, we introduce IterSelectTune, an efficient, cost-effective iterative training policy for selecting high-quality instruction data with no human involvement and limited reliance on GPT-4. By fine-tuning on approximately 20% of the source data, our method consistently outperforms models fine-tuned on the full dataset across multiple benchmarks and public test datasets. These results highlight the effectiveness of our approach in enhancing LLM performance while reducing the computational resources required for instruction tuning.

Title: Repetition Neurons: How Do Language Models Produce Repetitions?

Authors: Tatsuya Hiraoka, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13497
Pdf URL: https://arxiv.org/pdf/2410.13497
Copy Paste: [[2410.13497]] Repetition Neurons: How Do Language Models Produce Repetitions?(https://arxiv.org/abs/2410.13497)
Keywords: language model
Abstract: This paper introduces repetition neurons, regarded as skill neurons responsible for the repetition problem in text generation tasks. These neurons are progressively activated more strongly as repetition continues, indicating that they perceive repetition as a task to copy the previous context repeatedly, similar to in-context learning. We identify these repetition neurons by comparing activation values before and after the onset of repetition in texts generated by recent pre-trained language models. We analyze the repetition neurons in three English and one Japanese pre-trained language models and observe similar patterns across them.

Title: Enhancing Text Generation in Joint NLG/NLU Learning Through Curriculum Learning, Semi-Supervised Training, and Advanced Optimization Techniques

Authors: Rahimanuddin Shaik, Katikela Sreeharsha Kishore
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13498
Pdf URL: https://arxiv.org/pdf/2410.13498
Copy Paste: [[2410.13498]] Enhancing Text Generation in Joint NLG/NLU Learning Through Curriculum Learning, Semi-Supervised Training, and Advanced Optimization Techniques(https://arxiv.org/abs/2410.13498)
Keywords: language model
Abstract: Text generation is the automated process of producing written or spoken language using computational methods. It involves generating coherent and contextually relevant text based on predefined rules or learned patterns. However, challenges in text generation arise from maintaining coherence, ensuring diversity and creativity, and avoiding biases or inappropriate content. This research paper developed a novel approach to improve text generation in the context of joint Natural Language Generation (NLG) and Natural Language Understanding (NLU) learning. The data is prepared by gathering and preprocessing annotated datasets, including cleaning, tokenization, stemming, and stop-word removal. Feature extraction techniques such as POS tagging, Bag of Words, and Term Frequency-Inverse Document Frequency (TF-IDF) are applied. Transformer-based encoders and decoders are used to capture long-range dependencies and improve source-target sequence modelling. Pre-trained language models like Optimized BERT are incorporated, along with a Hybrid Redfox Artificial Hummingbird Algorithm (HRAHA). Reinforcement learning with policy gradient techniques, semi-supervised training, improved attention mechanisms, and differentiable approximations like the straight-through Gumbel SoftMax estimator are employed to fine-tune the models and handle complex linguistic tasks effectively. The proposed model is implemented using Python.
摘要：文本生成是使用计算方法自动生成书面或口头语言的过程。它涉及根据预定义规则或学习模式生成连贯且上下文相关的文本。然而，文本生成的挑战在于保持连贯性、确保多样性和创造性，以及避免偏见或不适当的内容。这篇研究论文开发了一种新方法，以在联合自然语言生成 (NLG) 和自然语言理解 (NLU) 学习的背景下改进文本生成。数据是通过收集和预处理带注释的数据集来准备的，包括清理、标记化、词干提取和停用词删除。应用了 POS 标记、词袋和词频-逆文档频率 (TF-IDF) 等特征提取技术。基于 Transformer 的编码器和解码器，可捕获长距离依赖关系并改进源-目标序列建模。结合了预训练的语言模型（如优化 BERT）以及混合 Redfox 人工蜂鸟算法 (HRAHA)。采用策略梯度技术、半监督训练、改进的注意力机制和可微分近似（如直通式 Gumbel SoftMax 估计器）的强化学习来微调模型并有效处理复杂的语言任务。所提出的模型使用 Python 实现。
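
The abstract above names the straight-through Gumbel-Softmax estimator without describing it. The sketch below shows only the sampling step in plain Python, as an illustration and not the authors' implementation; the function name `gumbel_softmax_sample` and its parameters are assumptions made for this example.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Return (soft, hard): a relaxed categorical sample and its one-hot argmax.

    In an autodiff framework, the straight-through variant emits `hard` in the
    forward pass while gradients flow through `soft`, commonly written as
    hard + soft - stop_gradient(soft). Plain Python has no autodiff, so this
    sketch only computes the two forward quantities.
    """
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
    gumbels = [
        -math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in logits
    ]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbels)]

    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    soft = [e / total for e in exps]

    # Hard one-hot vector at the argmax of the relaxed sample.
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard
```

Lower temperatures push `soft` towards `hard`; higher temperatures make it smoother. Because the argmax of Gumbel-perturbed logits follows the categorical distribution defined by those logits, repeated calls pick each category with roughly its softmax probability.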

Title: RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

Authors: Xinze Li, Sen Mei, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Hao Chen, Ge Yu, Zhiyuan Liu, Maosong Sun, Chenyan Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13509
Pdf URL: https://arxiv.org/pdf/2410.13509
Copy Paste: [[2410.13509]] RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards(https://arxiv.org/abs/2410.13509)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) by retrieving knowledge from external resources. To adapt LLMs for RAG pipelines, current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge. This supervised fine-tuning (SFT) approach focuses on equipping LLMs to handle diverse RAG tasks using different instructions. However, it trains RAG modules to overfit training signals and overlooks the varying data preferences among agents within the RAG system. In this paper, we propose a Differentiable Data Rewards (DDR) method, which trains RAG systems end to end by aligning data preferences between different RAG modules. DDR works by collecting rewards with a rollout method to optimize each agent. This method prompts agents to sample some potential responses as perturbations, evaluates the impact of these perturbations on the whole RAG system, and subsequently optimizes the agent to produce outputs that improve the performance of the RAG system. Our experiments on various knowledge-intensive tasks demonstrate that DDR significantly outperforms the SFT method, particularly for LLMs with smaller-scale parameters that depend more on the retrieved knowledge. Additionally, DDR exhibits a stronger capability to align the data preferences between RAG modules. The DDR method makes the generation module more effective at extracting key information from documents and at mitigating conflicts between parametric memory and external knowledge. All code is available at this https URL.
摘要：检索增强生成 (RAG) 已证明其通过从外部资源检索知识来缓解大型语言模型 (LLM) 中的幻觉的有效性。为了使 LLM 适应 RAG 管道，当前的方法使用指令调整来优化 LLM，提高其利用检索到的知识的能力。这种监督微调 (SFT) 方法侧重于让 LLM 使用不同的指令来处理不同的 RAG 任务。然而，它训练 RAG 模块以过度拟合训练信号，并忽略了 RAG 系统内代理之间不同的数据偏好。在本文中，我们提出了一种可区分数据奖励 (DDR) 方法，该方法通过协调不同 RAG 模块之间的数据偏好来端到端训练 RAG 系统。DDR 通过收集奖励来使用 rollout 方法优化每个代理。该方法促使代理将一些潜在响应采样为扰动，评估这些扰动对整个 RAG 系统的影响，然后优化代理以产生可提高 RAG 系统性能的输出。我们在各种知识密集型任务上进行的实验表明，DDR 明显优于 SFT 方法，特别是对于具有较小规模参数且更依赖于检索到的知识的 LLM。此外，DDR 表现出更强的能力来协调 RAG 模块之间的数据偏好。DDR 方法使生成模块能够更有效地从文档中提取关键信息并缓解参数记忆与外部知识之间的冲突。所有代码均可在此 https URL 上找到。

Title: GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models

Authors: Aditya Sharma, Aman Dalmia, Mehran Kazemi, Amal Zouaq, Christopher J. Pal
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.13510
Pdf URL: https://arxiv.org/pdf/2410.13510
Copy Paste: [[2410.13510]] GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models(https://arxiv.org/abs/2410.13510)
Keywords: language model
Abstract: Geometry problem-solving demands advanced reasoning abilities to process multimodal inputs and employ mathematical knowledge effectively. Vision-language models (VLMs) have made significant progress in various multimodal tasks. Yet, they still struggle with geometry problems and are significantly limited by their inability to perform mathematical operations not seen during pre-training, such as calculating the cosine of an arbitrary angle, and by difficulties in correctly applying relevant geometry formulas. To overcome these challenges, we present GeoCoder, which leverages modular code-finetuning to generate and execute code using a predefined geometry function library. By executing the code, we achieve accurate and deterministic calculations, contrasting the stochastic nature of autoregressive token prediction, while the function library minimizes errors in formula usage. We also propose a multimodal retrieval-augmented variant of GeoCoder, named RAG-GeoCoder, which incorporates a non-parametric memory module for retrieving functions from the geometry library, thereby reducing reliance on parametric memory. Our modular code-finetuning approach enhances the geometric reasoning capabilities of VLMs, yielding an average improvement of over 16% across various question complexities on the GeomVerse dataset compared to other finetuning methods.
摘要：解决几何问题需要高级推理能力来处理多模态输入并有效运用数学知识。视觉语言模型 (VLM) 在各种多模态任务中取得了重大进展。然而，它们仍然难以解决几何问题，并且受到无法执行预训练期间未见过的数学运算（例如计算任意角度的余弦）以及难以正确应用相关几何公式的严重限制。为了克服这些挑战，我们提出了 GeoCoder，它利用模块化代码微调来生成和执行使用预定义几何函数库的代码。通过执行代码，我们实现了准确和确定性的计算，与自回归标记预测的随机性形成对比，而函数库则最大限度地减少了公式使用中的错误。我们还提出了一种多模态检索增强型 GeoCoder 变体，名为 RAG-GeoCoder，它包含一个非参数内存模块，用于从几何库中检索函数，从而减少对参数内存的依赖。我们的模块化代码微调方法增强了 VLM 的几何推理能力，与其他微调方法相比，在 GeomVerse 数据集上各种问题复杂度上的平均提升超过 16%。
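
To illustrate the idea of routing calculations through a predefined function library instead of predicting the digits token by token, here is a minimal sketch. The function names (`third_side`, `triangle_area_sas`) and the example "generated program" are hypothetical and are not taken from the GeoCoder library.

```python
import math


# A tiny, hypothetical geometry function library. Each function encodes one
# formula so that a generated program cannot misremember it.
def third_side(a, b, angle_deg):
    """Law of cosines: side opposite the angle between sides a and b."""
    angle = math.radians(angle_deg)
    return math.sqrt(a * a + b * b - 2.0 * a * b * math.cos(angle))


def triangle_area_sas(a, b, angle_deg):
    """Area of a triangle from two sides and the included angle."""
    return 0.5 * a * b * math.sin(math.radians(angle_deg))


# What a model-generated program might look like: it only chooses functions
# and arguments, and the arithmetic is executed deterministically.
def generated_program():
    side = third_side(3.0, 4.0, 90.0)
    area = triangle_area_sas(3.0, 4.0, 90.0)
    return side, area
```

Running `generated_program()` gives the same result on every call, which is the contrast with stochastic token prediction that the abstract draws.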

Title: Bias in the Mirror : Are LLMs opinions robust to their own adversarial attacks ?

Authors: Virgile Rennard, Christos Xypolopoulos, Michalis Vazirgiannis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13517
Pdf URL: https://arxiv.org/pdf/2410.13517
Copy Paste: [[2410.13517]] Bias in the Mirror : Are LLMs opinions robust to their own adversarial attacks ?(https://arxiv.org/abs/2410.13517)
Keywords: language model, llm
Abstract: Large language models (LLMs) inherit biases from their training data and alignment processes, influencing their responses in subtle ways. While many studies have examined these biases, little work has explored their robustness during interactions. In this paper, we introduce a novel approach where two instances of an LLM engage in self-debate, arguing opposing viewpoints to persuade a neutral version of the model. Through this, we evaluate how firmly biases hold and whether models are susceptible to reinforcing misinformation or shifting to harmful viewpoints. Our experiments span multiple LLMs of varying sizes, origins, and languages, providing deeper insights into bias persistence and flexibility across linguistic and cultural contexts.
摘要：大型语言模型 (LLM) 从其训练数据和对齐过程中继承了偏见，以微妙的方式影响其响应。虽然许多研究都研究过这些偏见，但很少有研究探索它们在交互过程中的稳健性。在本文中，我们介绍了一种新颖的方法，其中两个 LLM 实例进行自我辩论，争论对立的观点以说服模型的中立版本。通过这种方式，我们评估偏见的牢固程度以及模型是否容易强化错误信息或转向有害观点。我们的实验涵盖了不同规模、来源和语言的多个 LLM，为跨语言和文化背景的偏见持久性和灵活性提供了更深入的见解。

Title: Integrating Temporal Representations for Dynamic Memory Retrieval and Management in Large Language Models

Authors: Yuki Hou, Haruki Tamoto, Homei Miyashita
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13553
Pdf URL: https://arxiv.org/pdf/2410.13553
Copy Paste: [[2410.13553]] Integrating Temporal Representations for Dynamic Memory Retrieval and Management in Large Language Models(https://arxiv.org/abs/2410.13553)
Keywords: language model, retrieval-augmented generation, agent
Abstract: Conventional dialogue agents often struggle with effective memory recall, leading to redundant retrieval and inadequate management of unique user associations. To address this, we propose SynapticRAG, a novel approach integrating synaptic dynamics into Retrieval-Augmented Generation (RAG). SynapticRAG integrates temporal representations into memory vectors, mimicking biological synapses by differentiating events based on occurrence times and dynamically updating memory significance. This model employs temporal scoring for memory connections and a synaptic-inspired propagation control mechanism. Experiments across English, Japanese, and Chinese datasets demonstrate SynapticRAG's superiority over existing methods, including traditional RAG, with up to 14.66% improvement in memory retrieval accuracy. Our approach advances context-aware dialogue AI systems by enhancing long-term context maintenance and specific information extraction from conversations.
摘要：传统对话代理通常难以有效回忆记忆，导致冗余检索和对独特用户关联的管理不足。为了解决这个问题，我们提出了 SynapticRAG，这是一种将突触动力学集成到检索增强生成 (RAG) 中的新方法。SynapticRAG 将时间表征集成到记忆向量中，通过根据发生时间区分事件并动态更新记忆重要性来模仿生物突触。该模型采用时间评分来对记忆连接进行评分，并采用突触启发的传播控制机制。在英语、日语和中文数据集上进行的实验表明，SynapticRAG 优于现有方法（包括传统 RAG），记忆检索准确率提高了 14.66%。我们的方法通过增强长期上下文维护和从对话中提取特定信息来推进上下文感知对话 AI 系统的发展。

Title: Enhancing Fact Retrieval in PLMs through Truthfulness

Authors: Paul Youssef, Jörg Schlötterer, Christin Seifert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13562
Pdf URL: https://arxiv.org/pdf/2410.13562
Copy Paste: [[2410.13562]] Enhancing Fact Retrieval in PLMs through Truthfulness(https://arxiv.org/abs/2410.13562)
Keywords: language model
Abstract: Pre-trained Language Models (PLMs) encode various facts about the world during their pre-training phase, as they are trained to predict the next or missing word in a sentence. There has been interest in quantifying and increasing the number of facts that can be extracted from PLMs, as they have been envisioned as soft knowledge bases that can be queried in natural language. Different approaches exist to enhance fact retrieval from PLMs. Recent work shows that the hidden states of PLMs can be leveraged to determine the truthfulness of the PLMs' inputs. Leveraging this finding to improve factual knowledge retrieval remains unexplored. In this work, we investigate the use of a helper model to improve fact retrieval. The helper model assesses the truthfulness of an input based on the corresponding hidden-state representations from the PLMs. We evaluate this approach on several masked PLMs and show that it enhances fact retrieval by up to 33%. Our findings highlight the potential of hidden-state representations from PLMs for improving their factual knowledge retrieval.
摘要：预训练语言模型 (PLM) 在预训练阶段对有关世界的各种事实进行编码，因为它们被训练来预测句子中的下一个单词或缺失的单词。人们一直对量化和提高可从 PLM 中提取的事实数量感兴趣，因为它们被设想为软知识库，可以用自然语言进行查询。存在不同的方法来增强从 PLM 中检索事实的能力。最近的研究表明，可以利用 PLM 的隐藏状态来确定 PLM 输入的真实性。利用这一发现来改进事实知识检索仍未得到探索。在这项工作中，我们研究了使用辅助模型来改进事实检索。辅助模型根据来自 PLM 的相应隐藏状态表示来评估输入的真实性。我们在几个掩码 PLM 上评估了这种方法，并表明它将事实检索提高了 33% 以上。我们的研究结果强调了 PLM 的隐藏状态表示在改善事实知识检索方面的潜力。

Title: A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

Authors: Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, J.H. Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13639
Pdf URL: https://arxiv.org/pdf/2410.13639
Copy Paste: [[2410.13639]] A Comparative Study on Reasoning Patterns of OpenAI's o1 Model(https://arxiv.org/abs/2410.13639)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Enabling Large Language Models (LLMs) to handle a wider range of complex tasks (e.g., coding, math) has drawn great attention from many researchers. As LLMs continue to evolve, merely increasing the number of model parameters yields diminishing performance improvements and heavy computational costs. Recently, OpenAI's o1 model has shown that inference strategies (i.e., Test-time Compute methods) can also significantly enhance the reasoning capabilities of LLMs. However, the mechanisms behind these methods are still unexplored. In our work, to investigate the reasoning patterns of o1, we compare o1 with existing Test-time Compute methods (BoN, Step-wise BoN, Agent Workflow, and Self-Refine) by using OpenAI's GPT-4o as a backbone on general reasoning benchmarks in three domains (i.e., math, coding, commonsense reasoning). Specifically, first, our experiments show that the o1 model has achieved the best performance on most datasets. Second, as for the methods of searching diverse responses (e.g., BoN), we find the reward models' capability and the search space both limit the upper boundary of these methods. Third, as for the methods that break the problem into many sub-problems, the Agent Workflow has achieved better performance than Step-wise BoN due to the domain-specific system prompt for planning better reasoning processes. Fourth, it is worth mentioning that we have summarized six reasoning patterns of o1, and provided a detailed analysis on several reasoning benchmarks.
摘要：使大型语言模型 (LLM) 能够处理更广泛的复杂任务（例如编码、数学）已引起许多研究人员的极大关注。随着 LLM 的不断发展，仅仅增加模型参数的数量会降低性能提升并增加计算成本。最近，OpenAI 的 o1 模型表明，推理策略（即测试时计算方法）也可以显著增强 LLM 的推理能力。然而，这些方法背后的机制仍未被探索。在我们的工作中，为了研究 o1 的推理模式，我们使用 OpenAI 的 GPT-4o 作为三个领域（即数学、编码、常识推理）的一般推理基准测试的主干，将 o1 与现有的测试时计算方法（BoN、Step-wise BoN、Agent Workflow 和 Self-Refine）进行了比较。具体来说，首先，我们的实验表明 o1 模型在大多数数据集上都取得了最佳性能。其次，对于搜索多样化响应的方法（例如 BoN），我们发现奖励模型的能力和搜索空间都限制了这些方法的上限。第三，对于将问题分解为许多子问题的方法，Agent Workflow 取得了比 Step-wise BoN 更好的性能，因为特定领域的系统提示可以规划更好的推理过程。第四，值得一提的是，我们总结了 o1 的六种推理模式，并对几个推理基准进行了详细分析。
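
The Best-of-N (BoN) method mentioned above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the paper's setup: `generate` and `reward` are hypothetical callables standing in for an LLM sampler and a reward model, and the toy versions below exist only to make the example runnable.

```python
def best_of_n(prompt, generate, reward, n=4):
    """Sample n candidate responses and return the one the reward model prefers."""
    candidates = [generate(prompt, i) for i in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


# Hypothetical stand-ins: a "sampler" that proposes a different number per
# seed, and a "reward model" that prefers answers close to 42.
def toy_generate(prompt, seed):
    return str(40 + seed)


def toy_reward(prompt, answer):
    return -abs(int(answer) - 42)
```

With `n=2` the candidates are "40" and "41", so the correct answer is never in the search space. With `n=4` it is, and the reward function selects it. This mirrors the abstract's observation that both the search space and the reward model's capability bound what such methods can achieve.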

Title: Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation

Authors: Yiming Wang, Pei Zhang, Baosong Yang, Derek F. Wong, Rui Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13640
Pdf URL: https://arxiv.org/pdf/2410.13640
Copy Paste: [[2410.13640]] Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation(https://arxiv.org/abs/2410.13640)
Keywords: llm
Abstract: LLM self-evaluation relies on the LLM's own ability to estimate response correctness, which can greatly improve its deployment reliability. In this research track, we propose the Chain-of-Embedding (CoE) in the latent space to enable LLMs to perform output-free self-evaluation. CoE consists of all progressive hidden states produced during inference, which can be treated as the latent thinking path of LLMs. We find that CoE features differ between correct and incorrect LLM responses, and these discrepancies help us estimate response correctness. Experiments in four diverse domains and on seven LLMs demonstrate the effectiveness of our method. Meanwhile, its label-free design, which requires no training, and its millisecond-level computational cost enable real-time feedback in large-scale scenarios. More importantly, we provide interesting insights into LLM response correctness from the perspective of hidden state changes inside LLMs.
摘要：LLM 自我评估依赖于 LLM 自身估计响应正确性的能力，这可以大大提高其部署可靠性。在本研究轨道中，我们提出了潜在空间中的嵌入链 (CoE)，以使 LLM 能够执行无输出的自我评估。CoE 包括在推理时间内产生的所有渐进式隐藏状态，可以将其视为 LLM 的潜在思维路径。我们发现，当 LLM 正确和错误地响应时，它们的 CoE 特征不同，这些差异有助于我们估计 LLM 响应的正确性。在四个不同领域和七个 LLM 中的实验充分证明了我们方法的有效性。同时，其无标签设计意图无需任何训练和毫秒级计算成本，可确保在大规模场景中的实时反馈。更重要的是，我们从 LLM 内部隐藏状态变化的角度提供了对 LLM 响应正确性的有趣见解。

Title: An Active Learning Framework for Inclusive Generation by Large Language Models

Authors: Sabit Hassan, Anthony Sicilia, Malihe Alikhani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13641
Pdf URL: https://arxiv.org/pdf/2410.13641
Copy Paste: [[2410.13641]] An Active Learning Framework for Inclusive Generation by Large Language Models(https://arxiv.org/abs/2410.13641)
Keywords: language model, llm
Abstract: Ensuring that Large Language Models (LLMs) generate text representative of diverse sub-populations is essential, particularly when key concepts related to under-represented groups are scarce in the training data. We address this challenge with a novel clustering-based active learning framework, enhanced with knowledge distillation. The proposed framework transforms the intermediate outputs of the learner model, enabling effective active learning for generative tasks for the first time. Integrating clustering and knowledge distillation yields more representative models without requiring prior knowledge of the underlying data distribution or excessive human effort. We validate our approach in practice through case studies in counter-narration and style transfer. We construct two new datasets in tandem with model training, showing a performance improvement of 2%-10% over baseline models. Our results also show more consistent performance across various data subgroups and increased lexical diversity, underscoring our model's resilience to skewness in available data. Further, our results show that the data acquired via our approach improves the performance of secondary models not involved in the learning loop, showcasing the practical utility of the framework.
摘要：确保大型语言模型 (LLM) 生成代表不同子群体的文本至关重要，尤其是当与代表性不足的群体相关的关键概念在训练数据中很少时。我们通过一种新颖的基于聚类的主动学习框架来应对这一挑战，并通过知识提炼加以增强。所提出的框架转换了学习者模型的中间输出，首次实现了生成任务的有效主动学习。聚类和知识提炼的集成产生了更具代表性的模型，而无需事先了解底层数据分布和过度的人工努力。我们通过反叙述和风格转换的案例研究在实践中验证了我们的方法。我们在模型训练的同时构建了两个新的数据集，结果显示性能比基线模型提高了 2%-10%。我们的结果还显示，各种数据子组之间的性能更加一致，词汇多样性增加，突显了我们的模型对可用数据偏差的适应性。此外，我们的结果表明，通过我们的方法获取的数据提高了未参与学习循环的次级模型的性能，展示了该框架的实用性。

Title: SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs

Authors: Yuling Gu, Oyvind Tafjord, Hyunwoo Kim, Jared Moore, Ronan Le Bras, Peter Clark, Yejin Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13648
Pdf URL: https://arxiv.org/pdf/2410.13648
Copy Paste: [[2410.13648]] SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs(https://arxiv.org/abs/2410.13648)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: While prior work has explored whether large language models (LLMs) possess a "theory of mind" (ToM) - the ability to attribute mental states to oneself and others - there has been little work testing whether LLMs can implicitly apply such knowledge to predict behavior, or to judge whether an observed behavior is rational. Such skills are critical for appropriate interaction in social environments. We create a new dataset, SimpleToM, containing concise, diverse stories (e.g., "The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier."), each with three questions that test different degrees of ToM reasoning, asking models to predict (a) mental state ("Is Mary aware of the mold?"), (b) behavior ("Will Mary pay for the chips or report the mold?"), and (c) judgment ("Mary paid for the chips. Was that reasonable?"). To our knowledge, SimpleToM is the first dataset to systematically explore downstream reasoning requiring knowledge of mental states in realistic scenarios. Our experimental results are intriguing: while most models can reliably predict mental state on our dataset (a), they often fail to correctly predict the behavior (b), and fare even worse at judging whether given behaviors are reasonable (c), even though being correctly aware of the protagonist's mental state should make such secondary predictions obvious. We further show that we can help models do better at (b) and (c) via interventions such as reminding the model of its earlier mental state answer and mental-state-specific chain-of-thought prompting, raising the action prediction accuracies (e.g., from 49.5% to 93.5% for GPT-4o) and judgment accuracies (e.g., from 15.3% to 94.7% for GPT-4o). While this shows that models can be coaxed to perform well, it requires task-specific interventions, and the natural model performances remain low, a cautionary tale for LLM deployment.
摘要：虽然之前的研究已经探索了大型语言模型 (LLM) 是否具有“心智理论”(ToM) - 将心理状态归因于自己和他人的能力 - 但很少有研究测试 LLM 是否可以隐式地应用此类知识来预测行为，或判断观察到的行为是否合理。这些技能对于在社交环境中进行适当的互动至关重要。我们创建了一个新的数据集 SimpleTom，其中包含简洁、多样的故事（例如，“品客薯片罐头里有发霉的薯片。玛丽在超市拿起罐头，走到收银台。”），每个故事都有三个问题，测试不同程度的 ToM 推理，要求模型预测 (a) 心理状态（“玛丽是否知道发霉？”）、(b) 行为（“玛丽会为薯片付钱还是报告发霉？”）和 (c) 判断（“玛丽付了薯片钱。这合理吗？”）。据我们所知，SimpleToM 是第一个系统地探索下游推理的数据集，需要了解现实场景中的心理状态。我们的实验结果很有趣：虽然大多数模型可以可靠地预测我们数据集上的心理状态 (a)，但它们往往无法正确预测行为 (b)，在判断给定行为是否合理 (c) 方面表现更差，尽管正确意识到主角的心理状态应该使这种次要预测显而易见。我们进一步表明，我们可以通过干预措施帮助模型在 (b) 和 (c) 上做得更好，例如提醒模型其之前的心理状态答案和心理状态特定的思路链提示，提高动作预测准确率（例如，GPT-4o 从 49.5% 提高到 93.5%）和判断准确率（例如，GPT-4o 从 15.3% 提高到 94.7%）。虽然这表明模型可以被诱导以表现良好，但它需要针对特定任务的干预，并且自然模型性能仍然很低，这对 LLM 部署来说是一个警示。

Title: ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization

Authors: Xiutian Zhao, Ke Wang, Wei Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13667
Pdf URL: https://arxiv.org/pdf/2410.13667
Copy Paste: [[2410.13667]] ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization(https://arxiv.org/abs/2410.13667)
Keywords: language model, llm, agent
Abstract: Dialogue agents have been receiving increasing attention for years, and this trend has been further boosted by the recent progress of large language models (LLMs). Stance detection and dialogue summarization are two core tasks of dialogue agents in application scenarios that involve argumentative dialogues. However, research on these tasks is limited by the insufficiency of public datasets, especially for non-English languages. To address this language resource gap in Chinese, we present ORCHID (Oral Chinese Debate), the first Chinese dataset for benchmarking target-independent stance detection and debate summarization. Our dataset consists of 1,218 real-world debates that were conducted in Chinese on 476 unique topics, containing 2,436 stance-specific summaries and 14,133 fully annotated utterances. Besides providing a versatile testbed for future research, we also conduct an empirical study on the dataset and propose an integrated task. The results show the challenging nature of the dataset and suggest a potential of incorporating stance detection in summarization for argumentative dialogue.
摘要：多年来，对话代理越来越受到关注，大型语言模型 (LLM) 的最新进展进一步推动了这一趋势。立场检测和对话摘要是涉及辩论性对话的应用场景中对话代理的两个核心任务。然而，由于公共数据集不足，尤其是非英语语言，对这些任务的研究受到限制。为了解决中文的语言资源缺口，我们提出了 ORCHID（口头中文辩论），这是第一个用于对目标无关的立场检测和辩论摘要进行基准测试的中文数据集。我们的数据集包括 1,218 场以中文进行的现实世界辩论，涉及 476 个独特主题，包含 2,436 个立场特定摘要和 14,133 条完全注释的话语。除了为未来研究提供多功能测试平台外，我们还对数据集进行了实证研究并提出了一项综合任务。结果显示了数据集的挑战性，并表明在辩论性对话摘要中加入立场检测的潜力。

Title: HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings

Authors: Varun Gumma, Anandhita Raghunath, Mohit Jain, Sunayana Sitaram
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13671
Pdf URL: https://arxiv.org/pdf/2410.13671
Copy Paste: [[2410.13671]] HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings(https://arxiv.org/abs/2410.13671)
Keywords: language model, llm, chat, retrieval augmented generation
Abstract: Assessing the capabilities and limitations of large language models (LLMs) has garnered significant interest, yet the evaluation of multiple models in real-world scenarios remains rare. Multilingual evaluation often relies on translated benchmarks, which typically do not capture linguistic and cultural nuances present in the source language. This study provides an extensive assessment of 24 LLMs on real world data collected from Indian patients interacting with a medical chatbot in Indian English and 4 other Indic languages. We employ a uniform Retrieval Augmented Generation framework to generate responses, which are evaluated using both automated techniques and human evaluators on four specific metrics relevant to our application. We find that models vary significantly in their performance and that instruction tuned Indic models do not always perform well on Indic language queries. Further, we empirically show that factual correctness is generally lower for responses to Indic queries compared to English queries. Finally, our qualitative work shows that code-mixed and culturally relevant queries in our dataset pose challenges to evaluated models.
摘要：评估大型语言模型 (LLM) 的能力和局限性引起了人们的极大兴趣，但在现实世界中对多个模型的评估仍然很少见。多语言评估通常依赖于翻译的基准，而这些基准通常不会捕捉到源语言中存在的语言和文化细微差别。本研究对 24 个 LLM 进行了广泛的评估，这些 LLM 是基于从印度患者与印度英语和其他 4 种印度语的医疗聊天机器人互动中收集的真实数据。我们采用统一的检索增强生成框架来生成响应，并使用自动化技术和人工评估人员根据与我们的应用程序相关的四个特定指标对这些响应进行评估。我们发现模型的性能差异很大，并且指令调整的印度语模型在印度语查询上并不总是表现良好。此外，我们通过经验表明，与英语查询相比，对印度语查询的响应的事实正确性通常较低。最后，我们的定性工作表明，我们数据集中的代码混合和文化相关查询对评估模型构成了挑战。

Title: Unconstrained Model Merging for Enhanced LLM Reasoning

Authors: Yiming Zhang, Baoyi He, Shengyu Zhang, Yuhao Fu, Qi Zhou, Zhijie Sang, Zijin Hong, Kejing Yang, Wenjun Wang, Jianbo Yuan, Guangning Han, Linyi Li, Chunlin Ji, Fei Wu, Hongxia Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13699
Pdf URL: https://arxiv.org/pdf/2410.13699
Copy Paste: [[2410.13699]] Unconstrained Model Merging for Enhanced LLM Reasoning(https://arxiv.org/abs/2410.13699)
Keywords: language model, llm
Abstract: Recent advancements in building domain-specific large language models (LLMs) have shown remarkable success, especially in tasks requiring reasoning abilities like logical inference over complex relationships and multi-step problem solving. However, creating a powerful all-in-one LLM remains challenging due to the need for proprietary data and vast computational resources. As a resource-friendly alternative, we explore the potential of merging multiple expert models into a single LLM. Existing studies on model merging mainly focus on generalist LLMs instead of domain experts, or the LLMs under the same architecture and size. In this work, we propose an unconstrained model merging framework that accommodates both homogeneous and heterogeneous model architectures with a focus on reasoning tasks. A fine-grained layer-wise weight merging strategy is designed for homogeneous models merging, while heterogeneous model merging is built upon the probabilistic distribution knowledge derived from instruction-response fine-tuning data. Across 7 benchmarks and 9 reasoning-optimized LLMs, we reveal key findings that combinatorial reasoning emerges from merging which surpasses simple additive effects. We propose that unconstrained model merging could serve as a foundation for decentralized LLMs, marking a notable progression from the existing centralized LLM framework. This evolution could enhance wider participation and stimulate additional advancement in the field of artificial intelligence, effectively addressing the constraints posed by centralized models.
摘要：构建领域特定大型语言模型 (LLM) 的最新进展取得了显著的成功，尤其是在需要推理能力的任务中，例如对复杂关系的逻辑推理和多步骤问题解决。然而，由于需要专有数据和大量计算资源，创建强大的一体化 LLM 仍然具有挑战性。作为一种资源友好的替代方案，我们探索将多个专家模型合并为单个 LLM 的潜力。现有的模型合并研究主要集中在通用 LLM 而不是领域专家，或者相同架构和大小的 LLM。在这项工作中，我们提出了一个不受约束的模型合并框架，该框架可同时适应同构和异构模型架构，重点是推理任务。为同构模型合并设计了一种细粒度的分层权重合并策略，而异构模型合并则建立在从指令响应微调数据中得出的概率分布知识之上。通过 7 个基准和 9 个推理优化的 LLM，我们发现了关键发现：组合推理源自合并，其效果超越了简单的加法效应。我们提出，不受约束的模型合并可以作为分散式 LLM 的基础，标志着现有集中式 LLM 框架的显著进步。这一演变可以增强更广泛的参与度，并促进人工智能领域的进一步发展，从而有效解决集中式模型带来的限制。

Title: On the Role of Attention Heads in Large Language Model Safety

Authors: Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13708
Pdf URL: https://arxiv.org/pdf/2410.13708
Copy Paste: [[2410.13708]] On the Role of Attention Heads in Large Language Model Safety(https://arxiv.org/abs/2410.13708)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented, leading to harmful generations. In light of this, recent research on safety mechanisms has emerged, revealing that when safety representations or components are suppressed, the safety capability of LLMs is compromised. However, existing research tends to overlook the safety impact of multi-head attention mechanisms, despite their crucial role in various model functionalities. Hence, in this paper, we aim to explore the connection between standard attention mechanisms and safety capability to fill this gap in safety-related mechanistic interpretability. We propose a novel metric tailored for multi-head attention, the Safety Head ImPortant Score (Ships), to assess individual heads' contributions to model safety. Based on this, we generalize Ships to the dataset level and further introduce the Safety Attention Head AttRibution Algorithm (Sahara) to attribute the critical safety attention heads inside the model. Our findings show that special attention heads have a significant impact on safety. Ablating a single safety head allows an aligned model (e.g., Llama-2-7b-chat) to respond to 16 times more harmful queries while modifying only 0.006% of the parameters, in contrast to the ~5% modification required in previous studies. More importantly, we demonstrate through comprehensive experiments that attention heads primarily function as feature extractors for safety and that models fine-tuned from the same base model exhibit overlapping safety heads. Together, our attribution approach and findings provide a novel perspective for unpacking the black box of safety mechanisms within large models.
摘要：大型语言模型 (LLM) 在多种语言任务上取得了最先进的性能，但它们的安全护栏可能会被绕过，从而导致有害的生成。鉴于此，最近出现了关于安全机制的研究，表明当安全表示或组件受到抑制时，LLM 的安全能力会受到损害。然而，现有研究往往忽视了多头注意力机制对安全的影响，尽管它们在各种模型功能中发挥着至关重要的作用。因此，在本文中，我们旨在探索标准注意力机制与安全能力之间的联系，以填补安全相关机制可解释性的这一空白。我们提出了一种针对多头注意力量身定制的新指标，即安全头重要性分数 (Ships)，以评估各个头对模型安全的贡献。在此基础上，我们将 Ships 推广到数据集级别，并进一步引入安全注意头归因算法 (Sahara) 来归因模型中的关键安全注意头。我们的研究结果表明，特殊注意头对安全性有重大影响。消除单个安全头可使对齐模型（例如 Llama-2-7b-chat）响应 16 倍以上的有害查询，同时仅修改 0.006% 的参数，而以前的研究需要修改约 5%。更重要的是，我们通过全面的实验证明了注意力头主要用作安全的特征提取器，并且从同一基础模型微调的模型表现出重叠的安全头。总之，我们的归因方法和发现为解开大型模型中安全机制的黑匣子提供了一个新颖的视角。

Title: MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

Authors: Nandan Thakur, Suleman Kazi, Ge Luo, Jimmy Lin, Amin Ahmad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13716
Pdf URL: https://arxiv.org/pdf/2410.13716
Copy Paste: [[2410.13716]] MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems(https://arxiv.org/abs/2410.13716)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Traditional Retrieval-Augmented Generation (RAG) benchmarks rely on different heuristic-based metrics for evaluation, but these require human preferences as ground truth for reference. In contrast, arena-based benchmarks, where two models compete against each other, require an expensive Large Language Model (LLM) as a judge for a reliable evaluation. We present an easy and efficient technique to get the best of both worlds. The idea is to train a learning-to-rank model as a "surrogate" judge, using RAG-based evaluation heuristics as input, to produce a synthetic arena-based leaderboard. Using this idea, we develop MIRAGE-Bench, a standardized arena-based multilingual RAG benchmark for 18 diverse languages on Wikipedia. The benchmark is constructed using MIRACL, a retrieval dataset, and extended for multilingual generation evaluation. MIRAGE-Bench evaluates RAG extensively, coupling heuristic features with an LLM-as-a-judge evaluator. In our work, we benchmark 19 diverse multilingual-focused LLMs and achieve a high correlation (Kendall Tau ($\tau$) = 0.909) between the MIRAGE-Bench leaderboard produced by our surrogate judge, which is learned from heuristic features, and the leaderboard obtained from pairwise evaluations with GPT-4o as a teacher using the Bradley-Terry framework. We observe that proprietary and large open-source LLMs currently dominate in multilingual RAG. MIRAGE-Bench is available at: this https URL.
摘要：传统的检索增强生成 (RAG) 基准测试依赖于不同的基于启发式的指标进行评估，但这些指标需要以人类偏好作为参考事实。相比之下，基于竞技场的基准测试（两个模型相互竞争）需要昂贵的大型语言模型 (LLM) 作为可靠评估的评判者。我们提出了一种简单而有效的技术来兼顾两全其美。这个想法是使用基于 RAG 的评估启发式方法作为输入，训练一个学习排名模型作为“替代”评判者，以生成一个基于竞技场的合成排行榜。利用这个想法，我们开发了 MIRAGE-Bench，这是一个标准化的基于竞技场的多语言 RAG 基准测试，适用于维基百科上的 18 种不同语言。该基准测试使用检索数据集 MIRACL 构建，并扩展为多语言生成评估。MIRAGE-Bench 广泛评估 RAG，结合启发式特征和 LLM 作为评判者评估者。在我们的工作中，我们对 19 个不同的多语言 LLM 进行了基准测试，并使用我们的替代判断器（使用启发式特征和成对评估学习到的替代判断器）以及使用 Bradley-Terry 框架在 MIRAGE-Bench 排行榜上作为教师的 GPT-4o 实现了高相关性（Kendall Tau ($\tau$) = 0.909）。我们观察到专有和大型开源 LLM 目前在多语言 RAG 中占据主导地位。MIRAGE-Bench 可在以下网址获得：此 https URL。
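
The two statistical tools named in the abstract, the Bradley-Terry model for turning pairwise comparisons into a leaderboard and Kendall's tau for comparing two rankings, can be sketched as follows. This is a textbook illustration in plain Python, not the MIRAGE-Bench code; the win counts in the usage note are invented, and `bradley_terry` assumes every model has at least one win.

```python
def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a matrix of pairwise win counts.

    wins[i][j] is how many times model i beat model j (diagonal is zero).
    Uses the standard minorization-maximization update and normalizes the
    strengths to sum to one.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

As a usage sketch, suppose three hypothetical models A, B, and C played ten pairwise matches against each other with `wins = [[0, 8, 9], [2, 0, 7], [1, 3, 0]]`. The fitted strengths from `bradley_terry(wins)` rank A above B above C, and `kendall_tau` between those strengths and a second judge's scores measures how well the two leaderboards agree.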

Title: LLM-Human Pipeline for Cultural Context Grounding of Conversations

Authors: Rajkumar Pujari, Dan Goldwasser
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13727
Pdf URL: https://arxiv.org/pdf/2410.13727
Copy Paste: [[2410.13727]] LLM-Human Pipeline for Cultural Context Grounding of Conversations(https://arxiv.org/abs/2410.13727)
Keywords: llm
Abstract: Conversations often adhere to well-understood social norms that vary across cultures. For example, while "addressing parents by name" is commonplace in the West, it is rare in most Asian cultures. Adherence or violation of such norms often dictates the tenor of conversations. Humans are able to navigate social situations requiring cultural awareness quite adeptly. However, it is a hard task for NLP models. In this paper, we tackle this problem by introducing a "Cultural Context Schema" for conversations. It comprises (1) conversational information such as emotions, dialogue acts, etc., and (2) cultural information such as social norms, violations, etc. We generate ~110k social norm and violation descriptions for ~23k conversations from Chinese culture using LLMs. We refine them using automated verification strategies which are evaluated against culturally aware human judgements. We organize these descriptions into meaningful structures we call "Norm Concepts", using an interactive human-in-loop framework. We ground the norm concepts and the descriptions in conversations using symbolic annotation. Finally, we use the obtained dataset for downstream tasks such as emotion, sentiment, and dialogue act detection. We show that it significantly improves the empirical performance.
摘要：对话通常遵循因文化而异的众所周知的社会规范。例如，虽然“直呼父母的名字”在西方很常见，但在大多数亚洲文化中却很少见。遵守或违反此类规范通常决定了对话的基调。人类能够非常熟练地应对需要文化意识的社交场合。然而，这对 NLP 模型来说是一项艰巨的任务。在本文中，我们通过为对话引入“文化背景模式”来解决这个问题。它包括 (1) 对话信息，如情绪、对话行为等，以及 (2) 文化信息，如社会规范、违规行为等。我们使用 LLM 为来自中国文化的约 23k 次对话生成约 110k 条社会规范和违规描述。我们使用自动验证策略对它们进行改进，这些策略会根据具有文化意识的人类判断进行评估。我们使用交互式人机循环框架将这些描述组织成有意义的结构，我们称之为“规范概念”。我们使用符号注释将规范概念和描述应用于对话中。最后，我们将获得的数据集用于情绪、情感和对话行为检测等下游任务。我们表明它显著提高了实证性能。

Title: Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval

Authors: Yu Xia, Junda Wu, Sungchul Kim, Tong Yu, Ryan A. Rossi, Haoliang Wang, Julian McAuley
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2410.13765
Pdf URL: https://arxiv.org/pdf/2410.13765
Copy Paste: [[2410.13765]] Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval(https://arxiv.org/abs/2410.13765)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been used to generate query expansions that augment original queries to improve information search. Recent studies also explore providing LLMs with initial retrieval results to generate query expansions better grounded in the document corpus. However, these methods mostly focus on enhancing textual similarities between search queries and target documents, overlooking document relations. For queries like "Find me a highly rated camera for wildlife photography compatible with my Nikon F-Mount lenses", existing methods may generate expansions that are semantically similar but structurally unrelated to user intents. To handle such semi-structured queries with both textual and relational requirements, in this paper we propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from a knowledge graph (KG). To further address the limitation of entity-based scoring in existing KG-based methods, we leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR). Extensive experiments on three datasets of diverse domains show the advantages of our method compared against state-of-the-art baselines on textual and relational semi-structured retrieval.
摘要：大型语言模型 (LLM) 已用于生成查询扩展，以增强原始查询，从而改进信息搜索。最近的研究还探索了为 LLM 提供初始检索结果，以生成更基于文档语料库的查询扩展。然而，这些方法主要侧重于增强搜索查询和目标文档之间的文本相似性，而忽略了文档关系。对于像“为我找到一款与我的尼康 F 卡口镜头兼容的用于野生动物摄影的高评价相机”这样的查询，现有方法可能会生成语义相似但结构上与用户意图无关的扩展。为了处理这种具有文本和关系要求的半结构化查询，我们在本文中提出了一个知识感知查询扩展框架，使用知识图谱 (KG) 中的结构化文档关系增强 LLM。为了进一步解决现有基于 KG 的方法中基于实体的评分的局限性，我们利用文档文本作为丰富的 KG 节点表示，并使用基于文档的关系过滤进行知识感知检索 (KAR)。对不同领域的三个数据集进行的大量实验表明，与文本和关系半结构化检索的最新基线相比，我们的方法具有优势。

Title: Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors

Authors: Georgios Chochlakis, Alexandros Potamianos, Kristina Lerman, Shrikanth Narayanan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13776
Pdf URL: https://arxiv.org/pdf/2410.13776
Copy Paste: [[2410.13776]] Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors(https://arxiv.org/abs/2410.13776)
Keywords: language model, llm, prompt
Abstract: In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs). The knowledge acquired during pre-training is crucial for this few-shot capability, providing the model with task priors. However, recent studies have shown that ICL predominantly relies on retrieving task priors rather than "learning" to perform tasks. This limitation is particularly evident in complex subjective domains such as emotion and morality, where priors significantly influence posterior predictions. In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt. Moreover, we evaluate the posterior bias towards certain annotators by grounding our study in appropriate, quantitative measures of LLM priors. Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead. However, aggregation does not explain the entire gap between ICL and the state of the art, meaning other factors in such tasks also account for the observed phenomena. Finally, by rigorously studying annotator-level labels, we find that it is possible for minority annotators to both better align with LLMs and have their perspectives further amplified.
摘要：上下文学习 (ICL) 已成为使用大型语言模型 (LLM) 执行自然语言任务的主要方法。在预训练期间获得的知识对于这种少样本能力至关重要，为模型提供了任务先验。然而，最近的研究表明，ICL 主要依赖于检索任务先验，而不是“学习”执行任务。这种限制在情感和道德等复杂的主观领域尤其明显，其中先验会显著影响后验预测。在这项工作中，我们检查这是否是相应数据集中使用的聚合的结果，其中尝试组合低一致性、不同的注释可能会导致注释伪影，从而在提示中产生有害噪音。此外，我们通过将我们的研究建立在适当的 LLM 先验定量测量的基础上，评估对某些注释者的后验偏差。我们的结果表明，聚合是主观任务建模的一个混杂因素，并主张专注于建模个人。然而，聚合并不能解释 ICL 与最先进技术之间的全部差距，这意味着此类任务中的其他因素也解释了观察到的现象。最后，通过严格研究注释者级别的标签，我们发现少数注释者既可以更好地与 LLM 保持一致，又可以进一步扩大他们的视野。

Title: The Mystery of the Pathological Path-star Task for Language Models

Authors: Arvid Frydenlund
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13779
Pdf URL: https://arxiv.org/pdf/2410.13779
Copy Paste: [[2410.13779]] The Mystery of the Pathological Path-star Task for Language Models(https://arxiv.org/abs/2410.13779)
Keywords: language model
Abstract: The recently introduced path-star task is a minimal task designed to exemplify limitations to the abilities of language models (Bachmann and Nagarajan, 2024). It involves a path-star graph where multiple arms radiate from a single starting node and each node is unique. Given the start node and a specified target node that ends an arm, the task is to generate the arm containing that target node. This is straightforward for a human but surprisingly difficult for language models, which did not outperform the random baseline. The authors hypothesized this is due to a deficiency in teacher-forcing and the next-token prediction paradigm. We demonstrate the task is learnable using teacher-forcing in alternative settings and that the issue is partially due to representation. We introduce a regularization method using structured samples of the same graph but with differing target nodes, improving results across a variety of model types. We provide RASP proofs showing the task is theoretically solvable. Finally, we find settings where an encoder-only model can consistently solve the task.
摘要：最近引入的路径星任务是一项极简任务，旨在说明语言模型能力的局限性（Bachmann 和 Nagarajan，2024 年）。它涉及一个路径星图，其中多个臂从单个起始节点辐射，每个节点都是唯一的。给定起始节点和结束臂的指定目标节点，任务是生成包含该目标节点的臂。这对于人类来说很简单，但对于语言模型来说却出奇地困难，语言模型的表现并不优于随机基线。作者假设这是由于教师强制和下一个标记预测范式的缺陷。我们证明该任务可以在其他设置中使用教师强制来学习，并且该问题部分归因于表示。我们引入了一种正则化方法，使用相同图但具有不同目标节点的结构化样本，从而改善了各种模型类型的结果。我们提供 RASP 证明，表明该任务在理论上是可解的。最后，我们找到了仅编码器模型可以一致解决任务的设置。
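
To make the task definition above concrete, here is a minimal sketch that builds one path-star instance and solves it programmatically. The function names, the edge-list representation, and the parameters are assumptions for illustration. They are not taken from the paper or from Bachmann and Nagarajan (2024).

```python
import random

def make_path_star(num_arms, arm_len, num_node_ids, rng):
    """Build a path-star graph: arms radiate from one start node, all nodes unique.

    Returns the shuffled edge list, the start node, a target node that ends
    one arm, and the expected answer (the arm from start to target).
    """
    ids = rng.sample(range(num_node_ids), 1 + num_arms * arm_len)
    start, rest = ids[0], ids[1:]
    arms = [[start] + rest[k * arm_len:(k + 1) * arm_len] for k in range(num_arms)]
    edges = [(arm[i], arm[i + 1]) for arm in arms for i in range(len(arm) - 1)]
    rng.shuffle(edges)  # the edge order carries no information about the arms
    target_arm = rng.choice(arms)
    return {"edges": edges, "start": start,
            "target": target_arm[-1], "answer": target_arm}

def solve(edges, start, target):
    """Reference solver: follow parent pointers from the target back to the start."""
    parent = {child: par for par, child in edges}
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]

if __name__ == "__main__":
    instance = make_path_star(num_arms=5, arm_len=4, num_node_ids=100,
                              rng=random.Random(0))
    print("start:", instance["start"], "target:", instance["target"])
    print("arm:", solve(instance["edges"], instance["start"], instance["target"]))
```

The solver walks backwards from the target, where every node has exactly one parent. A left-to-right generator instead has to pick the correct arm at its very first step, which is one way to see why the task is easy for a program yet can be hard under next-token prediction.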

Title: PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment

Authors: Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13785
Pdf URL: https://arxiv.org/pdf/2410.13785
Copy Paste: [[2410.13785]] PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment(https://arxiv.org/abs/2410.13785)
Keywords: language model, llm, prompt
Abstract: Alignment of large language models (LLMs) involves training models on preference-contrastive output pairs to adjust their responses according to human preferences. To obtain such contrastive pairs, traditional methods like RLHF and RLAIF rely on limited contrasting patterns, such as varying model variants or decoding temperatures. This singularity leads to two issues: (1) alignment is not comprehensive; and thereby (2) models are susceptible to jailbreaking attacks. To address these issues, we investigate how to construct more comprehensive and diversified contrasting patterns to enhance preference data (RQ1) and verify the impact of the diversification of contrasting patterns on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that integrates diversified contrasting patterns across the prompt, model, and pipeline levels, introducing six contrasting strategies that do not require additional feedback labeling procedures. Regarding RQ2, we conduct thorough experiments demonstrating that PopAlign significantly outperforms existing methods, leading to more comprehensive alignment.
摘要：大型语言模型 (LLM) 的对齐涉及在偏好对比输出对上训练模型，以根据人类偏好调整其响应。为了获得这样的对比对，RLHF 和 RLAIF 等传统方法依赖于有限的对比模式，例如不同的模型变体或解码温度。这种奇点导致两个问题：(1) 对齐不全面；因此 (2) 模型容易受到越狱攻击。为了解决这些问题，我们研究如何构建更全面、更多样化的对比模式来增强偏好数据 (RQ1) 并验证对比模式多样化对模型对齐的影响 (RQ2)。对于 RQ1，我们提出了 PopAlign，这是一个在提示、模型和管道级别集成多样化对比模式的框架，引入了六种不需要额外反馈标记程序的对比策略。关于 RQ2，我们进行了彻底的实验，证明 PopAlign 明显优于现有方法，从而实现更全面的对齐。

Title: Looking Inward: Language Models Can Learn About Themselves by Introspection

Authors: Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.13787
Pdf URL: https://arxiv.org/pdf/2410.13787
Copy Paste: [[2410.13787]] Looking Inward: Language Models Can Learn About Themselves by Introspection(https://arxiv.org/abs/2410.13787)
Keywords: language model, gpt, llm
Abstract: Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
摘要：人类通过观察外部世界和内省来获取知识。内省使人能够访问外部观察者无法访问的当前心理状态（例如，思想和感受）。LLM 可以内省吗？我们将内省定义为获取不包含在训练数据中或源自训练数据而是源自内部状态的知识。这种能力可以增强模型的可解释性。我们不必费力地分析模型的内部工作原理，只需向模型询问其信念、世界模型和目标即可。更具推测性的是，内省模型可能会自我报告它是否具有某些内部状态，例如主观感受或欲望，这可以告诉我们这些状态的道德地位。这种自我报告不会完全由模型的训练数据决定。我们通过微调 LLM 来研究内省，以预测它们在假设情景中自身行为的属性。例如，“给定输入 P，你的输出会倾向于短期还是长期选项？”如果模型 M1 可以自省，那么即使 M2 是在 M1 的真实行为上训练的，它在预测 M1 的行为方面也应该优于另一个模型 M2。这个想法是，M1 对自己的行为倾向有特权访问权，这使它能够比 M2 更好地预测自己（即使 M2 通常更强）。在对 GPT-4、GPT-4o 和 Llama-3 模型（每个模型都经过微调以预测自身）的实验中，我们发现模型 M1 在预测自身方面优于 M2，为自省提供了证据。值得注意的是，即使我们有意修改了 M1 的真实行为，它仍能继续准确地预测其行为。然而，虽然我们成功地在简单任务上引发了自省，但在更复杂的任务或需要分布外泛化的任务上我们却没有成功。

Title: Modeling Future Conversation Turns to Teach LLMs to Ask Clarifying Questions

Authors: Michael J.Q. Zhang, W. Bradley Knox, Eunsol Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13788
Pdf URL: https://arxiv.org/pdf/2410.13788
Copy Paste: [[2410.13788]] Modeling Future Conversation Turns to Teach LLMs to Ask Clarifying Questions(https://arxiv.org/abs/2410.13788)
Keywords: language model, llm
Abstract: Large language models (LLMs) must often respond to highly ambiguous user requests. In such cases, the LLM's best response may be to ask a clarifying question to elicit more information. We observe existing LLMs often respond by presupposing a single interpretation of such ambiguous requests, frustrating users who intended a different interpretation. We speculate this is caused by current preference data labeling practice, where LLM responses are evaluated only on their prior contexts. To address this, we propose to assign preference labels by simulating their expected outcomes in future turns. This allows LLMs to learn to ask clarifying questions when they can generate responses that are tailored to each user interpretation in future turns. In experiments on open-domain QA, we compare systems trained using our proposed preference labeling methods against standard methods, which assign preferences based only on prior context. We evaluate systems based on their ability to ask clarifying questions that can recover each user's interpretation and expected answer, and find that training with our proposed method teaches LLMs to ask clarifying questions, with a 5% improvement in F1 measured against the answer set from different interpretations of each query.
摘要：大型语言模型 (LLM) 通常必须响应高度模糊的用户请求。在这种情况下，LLM 的最佳响应可能是提出一个澄清问题以获取更多信息。我们观察到，现有的 LLM 通常会通过预设对此类模糊请求的单一解释来做出响应，这会让想要做出不同解释的用户感到沮丧。我们推测这是由当前的偏好数据标记实践造成的，其中 LLM 响应仅根据其先前的上下文进行评估。为了解决这个问题，我们建议通过模拟未来轮次的预期结果来分配偏好标签。当 LLM 可以生成针对未来轮次的每个用户解释量身定制的响应时，这允许 LLM 学会提出澄清问题。在开放域 QA 实验中，我们将使用我们提出的偏好标记方法训练的系统与仅基于先前上下文分配偏好的标准方法进行了比较。我们根据系统提出澄清问题的能力对系统进行评估，这些问题可以恢复每个用户的解释和预期答案，我们发现，使用我们提出的方法进行训练，可以训练 LLM 提出澄清问题，与每个查询的不同解释的答案集相比，F1 提高了 5%

Title: BenTo: Benchmark Task Reduction with In-Context Transferability

Authors: Hongyu Zhao, Ming Li, Lichao Sun, Tianyi Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13804
Pdf URL: https://arxiv.org/pdf/2410.13804
Copy Paste: [[2410.13804]] BenTo: Benchmark Task Reduction with In-Context Transferability(https://arxiv.org/abs/2410.13804)
Keywords: language model, llm
Abstract: Evaluating large language models (LLMs) is costly: it requires the generation and examination of LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information to identify the most representative subset of tasks via optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a <4% difference from the evaluation on the original benchmark. Compared to prior works, our method is training-free, gradient-free, and highly efficient, requiring only ICL.
摘要：评估大型语言模型 (LLM) 的成本很高：它需要在各种任务的大规模基准上生成和检查 LLM 输出。本文研究了如何在不影响评估质量的情况下有效减少用于对 LLM 进行基准测试的任务。我们的研究表明，任务可转移性和相关性提供了关键信息，可通过优化设施位置函数来识别最具代表性的任务子集。我们提出了一个实用有效的指标，用于通过上下文学习 (ICL) 估计两个任务之间的可转移性。通过分析成对可转移性，我们可以将现代 LLM 基准测试（例如 MMLU 或 FLAN）中的任务减少到 5%，同时与原始基准测试的评估仅产生不到 4% 的差异。与之前的研究相比，我们的方法无需训练、无需梯度，而且效率极高，只需要 ICL。
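
The abstract above selects a representative task subset by optimizing a facility location function. A common way to do this is greedy selection, sketched below under stated assumptions: `sim[i][j]` is a hypothetical non-negative score for how well task j represents task i, and the matrix values are invented. The paper's ICL-based transferability metric is not reproduced here.

```python
def facility_location_greedy(sim, k):
    """Greedily pick k items that maximize sum_i max_{j in S} sim[i][j].

    sim[i][j] is assumed non-negative. Each step adds the item with the
    largest marginal gain in how well every item is covered.
    """
    n = len(sim)
    selected = []
    best = [0.0] * n  # best[i]: similarity of item i to its closest selected item

    def gain(j):
        return sum(max(sim[i][j] - best[i], 0.0) for i in range(n))

    for _ in range(k):
        candidates = [j for j in range(n) if j not in selected]
        j_star = max(candidates, key=gain)
        selected.append(j_star)
        best = [max(best[i], sim[i][j_star]) for i in range(n)]
    return selected

if __name__ == "__main__":
    # Hypothetical 4-task matrix with two clusters: tasks {0, 1} and {2, 3}.
    sim = [
        [1.0, 0.9, 0.1, 0.1],
        [0.9, 1.0, 0.1, 0.2],
        [0.1, 0.1, 1.0, 0.7],
        [0.1, 0.2, 0.8, 1.0],
    ]
    print(facility_location_greedy(sim, k=2))
```

With this matrix the greedy procedure picks one task from each cluster, which is the behavior wanted from a representative subset. Because the facility location objective is monotone submodular, greedy selection comes with the well-known (1 - 1/e) approximation guarantee.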

Title: A Watermark for Order-Agnostic Language Models

Authors: Ruibo Chen, Yihan Wu, Yanshuo Chen, Chenxi Liu, Junfeng Guo, Heng Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13805
Pdf URL: https://arxiv.org/pdf/2410.13805
Copy Paste: [[2410.13805]] A Watermark for Order-Agnostic Language Models(https://arxiv.org/abs/2410.13805)
Keywords: language model
Abstract: Statistical watermarking techniques are well-established for sequentially decoded language models (LMs). However, these techniques cannot be directly applied to order-agnostic LMs, as the tokens in order-agnostic LMs are not generated sequentially. In this work, we introduce Pattern-mark, a pattern-based watermarking framework specifically designed for order-agnostic LMs. We develop a Markov-chain-based watermark generator that produces watermark key sequences with high-frequency key patterns. Correspondingly, we propose a statistical pattern-based detection algorithm that recovers the key sequence during detection and conducts statistical tests based on the count of high-frequency patterns. Our extensive evaluations on order-agnostic LMs, such as ProteinMPNN and CMLM, demonstrate Pattern-mark's enhanced detection efficiency, generation quality, and robustness, positioning it as a superior watermarking technique for order-agnostic LMs.
摘要：统计水印技术对于顺序解码语言模型 (LM) 来说已经很成熟。但是，这些技术不能直接应用于顺序不可知的 LM，因为顺序不可知的 LM 中的标记不是按顺序生成的。在这项工作中，我们引入了 Pattern-mark，这是一个专为顺序不可知的 LM 设计的基于模式的水印框架。我们开发了一个基于马尔可夫链的水印生成器，该生成器使用高频密钥模式生成水印密钥序列。相应地，我们提出了一种基于统计模式的检测算法，该算法在检测过程中恢复密钥序列并根据高频模式的计数进行统计测试。我们对顺序不可知的 LM（例如 ProteinMPNN 和 CMLM）进行了广泛的评估，证明了 Pattern-mark 增强的检测效率、生成质量和稳健性，使其成为顺序不可知 LM 的卓越水印技术。

Title: De-mark: Watermark Removal in Large Language Models

Authors: Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.13808
Pdf URL: https://arxiv.org/pdf/2410.13808
Copy Paste: [[2410.13808]] De-mark: Watermark Removal in Large Language Models(https://arxiv.org/abs/2410.13808)
Keywords: language model, gpt, chat
Abstract: Watermarking techniques offer a promising way to identify machine-generated content via embedding covert information into the contents generated from language models (LMs). However, the robustness of the watermarking schemes has not been well explored. In this paper, we present De-mark, an advanced framework designed to remove n-gram-based watermarks effectively. Our method utilizes a novel querying strategy, termed random selection probing, which aids in assessing the strength of the watermark and identifying the red-green list within the n-gram watermark. Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of De-mark in watermark removal and exploitation tasks.
摘要：水印技术通过将隐秘信息嵌入语言模型 (LM) 生成的内容中，提供了一种有前途的识别机器生成内容的方法。然而，水印方案的稳健性尚未得到充分探索。在本文中，我们介绍了 De-mark，这是一种旨在有效去除基于 n-gram 的水印的高级框架。我们的方法采用了一种称为随机选择探测的新型查询策略，它有助于评估水印的强度并识别 n-gram 水印中的红绿列表。在 Llama3 和 ChatGPT 等流行 LM 上进行的实验证明了 De-mark 在水印去除和利用任务中的效率和有效性。
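
For readers unfamiliar with the red-green list mentioned above, the following toy sketch shows the general kind of n-gram watermark that such removal attacks target. It is not De-mark and not any specific published implementation. The hashing scheme, the "hard" green-only sampler, and all names and parameters are assumptions chosen for brevity, and `n` is assumed to be at least 2.

```python
import hashlib
import math
import random

def green_list(context, vocab_size, gamma, key):
    """Pseudo-randomly pick a fraction gamma of the vocabulary as 'green',
    seeded by the secret key and the preceding n-1 tokens."""
    digest = hashlib.sha256(f"{key}:{context}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def generate_watermarked(length, vocab_size, n=2, gamma=0.5, key="secret", seed=0):
    """Toy 'model': samples uniformly, but only from the current green list."""
    rng = random.Random(seed)
    ctx = n - 1
    tokens = [rng.randrange(vocab_size) for _ in range(ctx)]
    for _ in range(length):
        greens = sorted(green_list(tuple(tokens[-ctx:]), vocab_size, gamma, key))
        tokens.append(rng.choice(greens))
    return tokens

def detection_z_score(tokens, vocab_size, n=2, gamma=0.5, key="secret"):
    """Count green tokens and compare with the gamma fraction expected by chance."""
    ctx = n - 1
    scored = len(tokens) - ctx
    hits = sum(
        tokens[t] in green_list(tuple(tokens[t - ctx:t]), vocab_size, gamma, key)
        for t in range(ctx, len(tokens))
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))

if __name__ == "__main__":
    watermarked = generate_watermarked(length=50, vocab_size=100)
    plain = [random.Random(1).randrange(100) for _ in range(51)]
    print("watermarked z:", detection_z_score(watermarked, 100))
    print("plain z:", detection_z_score(plain, 100))
```

A detector holding the key sees a large z-score on watermarked text and a score near zero otherwise. An attacker who can infer which tokens are green for a given context, which is what the abstract's probing strategy aims at, can then steer text away from the green list and lower that score.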

Title: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction

Authors: Xuan Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13846
Pdf URL: https://arxiv.org/pdf/2410.13846
Copy Paste: [[2410.13846]] SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction(https://arxiv.org/abs/2410.13846)
Keywords: language model, llm, long context
Abstract: Recent advancements in large language models (LLMs) have extended their capabilities to handle long contexts. However, increasing the number of model layers and the length of input sequences significantly escalates the memory required to store the key-value (KV) cache, posing challenges for efficient inference. To mitigate this issue, we present SimLayerKV, a simple yet effective method that reduces inter-layer KV cache redundancies by selectively dropping cache in identified lazy layers. Our approach is based on the observation that certain layers in long-context LLMs exhibit "lazy" behavior, contributing less to modeling long-range dependencies compared to non-lazy layers. By analyzing attention weight patterns, we find that the behavior of these lazy layers is consistent across tokens during generation for a given input. This insight motivates our SimLayerKV, which identifies lazy layers and reduces their KV cache accordingly. SimLayerKV is training-free, generalizable, and can be implemented with only seven lines of code. We conduct extensive experiments on three representative LLMs (LLaMA2-7B, LLaMA3-8B, and Mistral-7B) across 16 tasks from the LongBench benchmark. The results demonstrate that SimLayerKV achieves a KV cache compression ratio of 5$\times$ with only a 1.2% performance drop when combined with 4-bit quantization. Our code is available at this https URL.
摘要：大型语言模型 (LLM) 的最新进展扩展了其处理长上下文的能力。但是，增加模型层数和输入序列的长度会显著增加存储键值 (KV) 缓存所需的内存，从而对高效推理造成挑战。为了缓解这个问题，我们提出了 SimLayerKV，这是一种简单而有效的方法，通过选择性地删除已识别的惰性层中的缓存来减少层间 KV 缓存冗余。我们的方法基于以下观察：长上下文 LLM 中的某些层表现出“惰性”行为，与非惰性层相比，它们对建模长距离依赖关系的贡献较小。通过分析注意力权重模式，我们发现在给定输入的生成过程中，这些惰性层的行为在各个 token 之间是一致的。这种见解激发了我们的 SimLayerKV，它可以识别惰性层并相应地减少它们的 KV 缓存。SimLayerKV 无需训练、可推广，并且仅用七行代码即可实现。我们对三个代表性的 LLM（例如 LLaMA2-7B、LLaMA3-8B 和 Mistral-7B）进行了广泛的实验，实验范围涵盖 LongBench 基准中的 16 个任务。结果表明，SimLayerKV 实现了 5$\times$ 的 KV 缓存压缩率，与 4 位量化结合时性能仅下降 1.2%。我们的代码可在此 https URL 上找到。

Title: Retrospective Learning from Interactions

Authors: Zizhao Chen, Mustafa Omer Gul, Yiwei Chen, Gloria Geng, Anne Wu, Yoav Artzi
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2410.13852
Pdf URL: https://arxiv.org/pdf/2410.13852
Copy Paste: [[2410.13852]] Retrospective Learning from Interactions(https://arxiv.org/abs/2410.13852)
Keywords: language model, llm
Abstract: Multi-turn interactions between large language models (LLMs) and users naturally include implicit feedback signals. If an LLM responds in an unexpected way to an instruction, the user is likely to signal it by rephrasing the request, expressing frustration, or pivoting to an alternative task. Such signals are task-independent and occupy a relatively constrained subspace of language, allowing the LLM to identify them even if it fails on the actual task. This creates an avenue for continually learning from interactions without additional annotations. We introduce ReSpect, a method to learn from such signals in past interactions via retrospection. We deploy ReSpect in a new multimodal interaction scenario, where humans instruct an LLM to solve an abstract reasoning task with a combinatorial solution space. Through thousands of interactions with humans, we show how ReSpect gradually improves task completion rate from 31% to 82%, all without any external annotation.
摘要：大型语言模型 (LLM) 与用户之间的多轮交互自然包含隐式反馈信号。如果 LLM 以意想不到的方式响应指令，用户可能会通过重新措辞请求、表达沮丧或转向替代任务来发出信号。此类信号与任务无关，占据相对受限的语言子空间，即使 LLM 在实际任务中失败，也可以识别它们。这为从交互中不断学习创造了一条途径，而无需额外的注释。我们引入了 ReSpect，这是一种通过回顾过去交互中的此类信号来学习的方法。我们在新的多模态交互场景中部署了 ReSpect，其中人类指示 LLM 使用组合解决方案空间解决抽象推理任务。通过与人类的数千次互动，我们展示了 ReSpect 如何逐渐将任务完成率从 31% 提高到 82%，所有这些都无需任何外部注释。

Title: Can MLLMs Understand the Deep Implication Behind Chinese Images?

Authors: Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, Kaixin Deng, Guangzeng Han, Qinrui Li, Bingli Wang, Jiaheng Liu, Xingwei Qu, Yifei Zhang, Qixuan Zhao, Yiming Liang, Ziqiang Liu, Feiteng Fang, Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni
Subjects: cs.CL, cs.AI, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2410.13854
Pdf URL: https://arxiv.org/pdf/2410.13854
Copy Paste: [[2410.13854]] Can MLLMs Understand the Deep Implication Behind Chinese Images?(https://arxiv.org/abs/2410.13854)
Keywords: language model, llm, prompt
Abstract: As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need for higher-order capability evaluation of MLLMs is increasing. However, there is a lack of work evaluating MLLMs for higher-order perception and understanding of Chinese visual content. To fill the gap, we introduce the **C**hinese **I**mage **I**mplication understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. CII-Bench stands out in several ways compared to existing benchmarks. Firstly, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect the model's understanding of Chinese traditional culture. Through extensive experiments on CII-Bench across multiple MLLMs, we have made significant findings. First, a substantial gap is observed between the performance of MLLMs and humans on CII-Bench. The highest MLLM accuracy reaches 64.4%, whereas human accuracy averages 78.2%, peaking at an impressive 81.0%. Second, MLLMs perform worse on Chinese traditional culture images, suggesting limitations in their ability to understand high-level semantics and a lack of deep knowledge of Chinese traditional culture. Finally, it is observed that most models exhibit enhanced accuracy when image emotion hints are incorporated into the prompts. We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey towards expert artificial general intelligence (AGI). Our project is publicly available at this https URL.
摘要：随着多模态大型语言模型 (MLLM) 的能力不断提升，对 MLLM 进行高阶能力评估的需求日益增加。然而，目前缺乏对 MLLM 对中文视觉内容的高阶感知和理解的评估工作。为了填补这一空白，我们推出了中文图像隐含理解基准 CII-Bench，旨在评估 MLLM 对中文图像的高阶感知和理解能力。与现有基准相比，CII-Bench 在几个方面脱颖而出。首先，为确保中文语境的真实性，CII-Bench 中的图像来自中文互联网并经过人工审核，相应的答案也是人工制作的。此外，CII-Bench 还融入了代表中国传统文化的图像，例如著名的中国传统画作，这可以深刻反映模型对中国传统文化的理解。通过在多个 MLLM 上对 CII-Bench 进行大量实验，我们取得了重大发现。最初，在 CII-Bench 上，MLLM 和人类的表现之间存在巨大差距。MLLM 的最高准确率达到 64.4%，而人类的平均准确率达到 78.2%，最高达到惊人的 81.0%。随后，MLLM 在中国传统文化图像上的表现较差，这表明它们在理解高级语义方面的能力有限，并且缺乏对中国传统文化的深厚知识基础。最后，观察到当图像情感提示被纳入提示中时，大多数模型都表现出更高的准确率。我们相信 CII-Bench 将使 MLLM 能够更好地理解中文语义和中文特定图像，从而推动专家级通用人工智能 (AGI) 的发展。我们的项目在此 https URL 上公开。