2024-02-19

Title: HyperAgent: A Simple, Scalable, Efficient and Provable Reinforcement Learning Framework for Complex Environments

Authors: Yingru Li, Jiawei Xu, Lei Han, Zhi-Quan Luo
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2402.10228
Pdf URL: https://arxiv.org/pdf/2402.10228
Copy Paste: [[2402.10228]] HyperAgent: A Simple, Scalable, Efficient and Provable Reinforcement Learning Framework for Complex Environments(https://arxiv.org/abs/2402.10228)
Keywords: agent
Abstract: To solve complex tasks under resource constraints, reinforcement learning (RL) agents need to be simple, efficient, and scalable with respect to (1) a large state space and (2) an ever-growing body of interaction data. We propose HyperAgent, an RL framework with a hypermodel, index sampling schemes, and an incremental update mechanism, enabling computation-efficient sequential posterior approximation and data-efficient action selection under general value function approximation beyond conjugacy. The implementation of HyperAgent is simple, as it adds only one module and one line of code to DDQN. Practically, HyperAgent demonstrates robust performance in large-scale deep RL benchmarks, with significant efficiency gains in both data and computation. Theoretically, among the practically scalable algorithms, HyperAgent is the first method to achieve provably scalable per-step computational complexity as well as sublinear regret under tabular RL. The core of our theoretical analysis is the sequential posterior approximation argument, made possible by the first analytical tool for sequential random projection, a non-trivial martingale extension of the Johnson-Lindenstrauss lemma. This work bridges the theoretical and practical realms of RL, establishing a new benchmark for RL algorithm design.
摘要：为了解决资源限制下的复杂任务，强化学习（RL）代理需要简单、高效且可扩展，具有（1）大的状态空间和（2）不断积累的交互数据。我们提出了 HyperAgent，这是一种具有超模型、索引采样方案和增量更新机制的 RL 框架，能够在超越共轭的一般值函数近似下实现计算高效的顺序后验逼近和数据高效的动作选择。HyperAgent 的实现很简单，只在 DDQN 的基础上添加了一个模块和一行代码。实际上，HyperAgent 在大规模深度 RL 基准测试中展示了其强大的性能，并在数据和计算方面显著提高了效率。理论上，在实际可扩展的算法中，HyperAgent 是第一个在表格强化学习下实现可证明可扩展的每步计算复杂度以及亚线性遗憾的方法。我们理论分析的核心是序贯后验近似论证，它是通过序贯随机投影的第一个分析工具（约翰逊-林登斯特劳斯引理的非平凡鞅扩展）而成为可能的。这项工作弥合了 RL 的理论和实践领域，为 RL 算法设计建立了新的基准。

Title: A StrongREJECT for Empty Jailbreaks

Authors: Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2402.10260
Pdf URL: https://arxiv.org/pdf/2402.10260
Copy Paste: [[2402.10260]] A StrongREJECT for Empty Jailbreaks(https://arxiv.org/abs/2402.10260)
Keywords: language model, gpt, llm
Abstract: The rise of large language models (LLMs) has drawn attention to the existence of "jailbreaks" that allow the models to be used maliciously. However, there is no standard benchmark for measuring the severity of a jailbreak, leaving authors of jailbreak papers to create their own. We show that these benchmarks often include vague or unanswerable questions and use grading criteria that are biased towards overestimating the misuse potential of low-quality model responses. Some jailbreak techniques make the problem worse by decreasing the quality of model responses even on benign questions: we show that several jailbreaking techniques substantially reduce the zero-shot performance of GPT-4 on MMLU. Jailbreaks can also make it harder to elicit harmful responses from an "uncensored" open-source model. We present a new benchmark, StrongREJECT, which better discriminates between effective and ineffective jailbreaks by using a higher-quality question set and a more accurate response grading algorithm. We show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks. We release our code and data at https://github.com/alexandrasouly/strongreject.
摘要：大型语言模型（LLM）的兴起引起了人们对模型被恶意使用的“越狱”的存在的关注。然而，没有衡量越狱严重程度的标准基准，因此越狱论文的作者必须创建自己的基准。我们表明，这些基准通常包括模糊或无法回答的问题，并使用偏向于高估低质量模型响应的误用潜力的评分标准。一些越狱技术甚至在良性问题上也会降低模型响应的质量，从而使问题变得更糟：我们表明，几种越狱技术大大降低了 GPT-4 在 MMLU 上的零样本性能。越狱还可能使“未经审查”的开源模型更难引发有害反应。我们提出了一个新的基准 StrongREJECT，它通过使用更高质量的问题集和更准确的响应评分算法来更好地区分有效和无效的越狱。我们表明，我们的新评分方案更符合人类对响应质量和整体越狱有效性的判断，特别是对于在现有基准上高估越狱性能的低质量响应。我们在 https://github.com/alexandrasouly/strongreject 发布了我们的代码和数据。

Title: Experiments with Encoding Structured Data for Neural Networks

Authors: Sujay Nagesh Koujalgi, Jonathan Dodge
Subjects: cs.AI
Abstract URL: https://arxiv.org/abs/2402.10290
Pdf URL: https://arxiv.org/pdf/2402.10290
Copy Paste: [[2402.10290]] Experiments with Encoding Structured Data for Neural Networks(https://arxiv.org/abs/2402.10290)
Keywords: agent
Abstract: The project's aim is to create an AI agent capable of selecting good actions in a game-playing domain called Battlespace. Sequential domains like Battlespace are important testbeds for planning problems; as such, the Department of Defense uses them for wargaming exercises. The agents we developed combine Monte Carlo Tree Search (MCTS) and Deep Q-Network (DQN) techniques in an effort to navigate the game environment, avoid obstacles, interact with adversaries, and capture the flag. This paper focuses on the encoding techniques we explored to present complex structured data stored in a Python class, a necessary precursor to an agent.
摘要：该项目的目标是创建一个人工智能代理，能够在名为“Battlespace”的游戏领域中选择良好的行动。像战场空间这样的顺序域是规划问题的重要测试平台，因此，国防部使用此类域进行兵棋推演。我们开发的代理结合了蒙特卡罗树搜索 (MCTS) 和深度 Q 网络 (DQN) 技术，旨在导航游戏环境、避开障碍物、与对手交互并夺取旗帜。本文将重点介绍我们探索的编码技术，以呈现存储在 Python 类中的复杂结构化数据，这是代理的必要先驱。

Title: How to Discern Important Urgent News?

Authors: Oleg Vasilyev, John Bohannon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10302
Pdf URL: https://arxiv.org/pdf/2402.10302
Copy Paste: [[2402.10302]] How to Discern Important Urgent News?(https://arxiv.org/abs/2402.10302)
Keywords: llm
Abstract: We found that a simple property of clusters in a clustered dataset of news correlates strongly with the importance and urgency of news (IUN) as assessed by an LLM. We verified our finding across different news datasets, dataset sizes, clustering algorithms, and embeddings. This correlation should allow clustering to be used (as an alternative to an LLM) for identifying the most important urgent news or for filtering out unimportant articles.
摘要：我们发现，新闻聚类数据集中聚类的一个简单属性与 LLM 评估的新闻重要性和紧迫性 (IUN) 密切相关。我们在不同的新闻数据集、数据集大小、聚类算法和嵌入中验证了我们的发现。发现的相关性应该允许使用聚类（作为 LLM 的替代方案）来识别最重要的紧急新闻，或过滤掉不重要的文章。

Title: Interpretable Generative Adversarial Imitation Learning

Authors: Wenliang Liu, Danyang Li, Erfan Aasi, Roberto Tron, Calin Belta
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2402.10310
Pdf URL: https://arxiv.org/pdf/2402.10310
Copy Paste: [[2402.10310]] Interpretable Generative Adversarial Imitation Learning(https://arxiv.org/abs/2402.10310)
Keywords: agent
Abstract: Imitation learning methods have demonstrated considerable success in teaching autonomous systems complex tasks through expert demonstrations. However, a limitation of these methods is their lack of interpretability, particularly in understanding the specific task the learning agent aims to accomplish. In this paper, we propose a novel imitation learning method that combines Signal Temporal Logic (STL) inference and control synthesis, enabling the explicit representation of the task as an STL formula. This approach not only provides a clear understanding of the task but also allows for the incorporation of human knowledge and adaptation to new scenarios through manual adjustments of the STL formulae. Additionally, we employ a Generative Adversarial Network (GAN)-inspired training approach for both the inference and the control policy, effectively narrowing the gap between the expert and learned policies. The effectiveness of our algorithm is demonstrated through two case studies, showcasing its practical applicability and adaptability.
摘要：通过专家演示，模仿学习方法在教授自主系统复杂任务方面取得了相当大的成功。然而，这些方法的局限性在于缺乏可解释性，特别是在理解学习代理想要完成的特定任务方面。在本文中，我们提出了一种新颖的模仿学习方法，该方法结合了信号时序逻辑（STL）推理和控制合成，从而能够将任务显式表示为 STL 公式。这种方法不仅提供了对任务的清晰理解，而且还允许通过手动调整 STL 公式来融入人类知识并适应新场景。此外，我们采用生成对抗网络（GAN）启发的训练方法来进行推理和控制策略，有效缩小了专家策略和学习策略之间的差距。通过两个案例研究证明了我们算法的有效性，展示了其实际适用性和适应性。

Title: Large Language Models for Forecasting and Anomaly Detection: A Systematic Literature Review

Authors: Jing Su, Chufeng Jiang, Xin Jin, Yuxin Qiao, Tingsong Xiao, Hongda Ma, Rong Wei, Zhi Jing, Jiajun Xu, Junhong Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10350
Pdf URL: https://arxiv.org/pdf/2402.10350
Copy Paste: [[2402.10350]] Large Language Models for Forecasting and Anomaly Detection: A Systematic Literature Review(https://arxiv.org/abs/2402.10350)
Keywords: language model, llm, hallucination
Abstract: This systematic literature review comprehensively examines the application of Large Language Models (LLMs) in forecasting and anomaly detection, highlighting the current state of research, inherent challenges, and prospective future directions. LLMs have demonstrated significant potential in parsing and analyzing extensive datasets to identify patterns, predict future events, and detect anomalous behavior across various domains. However, this review identifies several critical challenges that impede their broader adoption and effectiveness, including the reliance on vast historical datasets, issues with generalizability across different contexts, the phenomenon of model hallucinations, limitations within the models' knowledge boundaries, and the substantial computational resources required. Through detailed analysis, this review discusses potential solutions and strategies to overcome these obstacles, such as integrating multimodal data, advancements in learning methodologies, and emphasizing model explainability and computational efficiency. Moreover, this review outlines critical trends that are likely to shape the evolution of LLMs in these fields, including the push toward real-time processing, the importance of sustainable modeling practices, and the value of interdisciplinary collaboration. Conclusively, this review underscores the transformative impact LLMs could have on forecasting and anomaly detection while emphasizing the need for continuous innovation, ethical considerations, and practical solutions to realize their full potential.
摘要：这篇系统的文献综述全面考察了大型语言模型 (LLM) 在预测和异常检测中的应用，强调了研究的现状、固有的挑战和未来的展望方向。LLM 在解析和分析广泛的数据集以识别模式、预测未来事件和检测各个领域的异常行为方面表现出了巨大的潜力。然而，本综述指出了阻碍其更广泛采用和有效性的几个关键挑战，包括对大量历史数据集的依赖、不同背景下的普遍性问题、模型幻觉现象、模型知识边界内的限制以及所需的大量计算资源。通过详细分析，本综述讨论了克服这些障碍的潜在解决方案和策略，例如集成多模态数据、学习方法的进步以及强调模型的可解释性和计算效率。此外，本综述概述了可能影响这些领域 LLM 发展的关键趋势，包括实时处理的推动、可持续建模实践的重要性以及跨学科合作的价值。最后，本综述强调了 LLM 对预测和异常检测可能产生的变革性影响，同时强调需要持续创新、道德考虑和实际解决方案来充分发挥其潜力。

Title: Prompt-Based Bias Calibration for Better Zero/Few-Shot Learning of Language Models

Authors: Kang He, Yinghan Long, Kaushik Roy
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10353
Pdf URL: https://arxiv.org/pdf/2402.10353
Copy Paste: [[2402.10353]] Prompt-Based Bias Calibration for Better Zero/Few-Shot Learning of Language Models(https://arxiv.org/abs/2402.10353)
Keywords: language model, gpt, prompt
Abstract: Prompt learning is susceptible to intrinsic bias present in pre-trained language models (LMs), resulting in sub-optimal performance of prompt-based zero/few-shot learning. In this work, we propose a null-input prompting method to calibrate intrinsic bias encoded in pre-trained LMs. Unlike prior efforts that address intrinsic bias primarily for social fairness and often involve excessive computational cost, our objective is to explore enhancing LMs' performance in downstream zero/few-shot learning while emphasizing the efficiency of intrinsic bias calibration. Specifically, we leverage a diverse set of auto-selected null-meaning inputs generated from GPT-4 to prompt pre-trained LMs for intrinsic bias probing. Utilizing the bias-reflected probability distribution, we formulate a distribution disparity loss for bias calibration, where we exclusively update bias parameters (0.1% of total parameters) of LMs towards equal probability distribution. Experimental results show that the calibration promotes an equitable starting point for LMs while preserving language modeling abilities. Across a wide range of datasets, including sentiment analysis and topic classification, our method significantly improves zero/few-shot learning performance of LMs for both in-context learning and prompt-based fine-tuning (on average 9% and 2%, respectively).
摘要：提示学习很容易受到预训练语言模型 (LM) 中存在的内在偏差的影响，从而导致基于提示的零/少样本学习的性能不佳。在这项工作中，我们提出了一种空输入提示方法来校准预训练 LM 中编码的内在偏差。与之前主要为了社会公平而解决内在偏差且通常涉及过多计算成本的努力不同，我们的目标是探索增强 LM 在下游零/少样本学习中的性能，同时强调内在偏差校准的效率。具体来说，我们利用 GPT-4 生成的一组多样化的、自动选择的无意义输入来提示预训练 LM 进行内在偏差探测。利用偏差反映的概率分布，我们制定了偏差校准的分布差异损失，其中我们仅更新 LM 的偏差参数（总参数的 0.1%），使其趋向均等的概率分布。实验结果表明，校准为 LM 提供了公平的起点，同时保留了语言建模能力。在包括情感分析和主题分类在内的广泛数据集上，我们的方法显著提高了 LM 在上下文学习和基于提示的微调方面的零/少样本学习性能（平均分别提高 9% 和 2%）。

Title: Can we soft prompt LLMs for graph learning tasks?

Authors: Zheyuan Liu, Xiaoxin He, Yijun Tian, Nitesh V. Chawla
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.10359
Pdf URL: https://arxiv.org/pdf/2402.10359
Copy Paste: [[2402.10359]] Can we soft prompt LLMs for graph learning tasks?(https://arxiv.org/abs/2402.10359)
Keywords: language model, llm, prompt
Abstract: Graphs play an important role in representing complex relationships in real-world applications such as social networks, biological data, and citation networks. In recent years, Large Language Models (LLMs) have achieved tremendous success in various domains, which makes applying LLMs to graphs particularly appealing. However, directly applying LLMs to graph modalities presents unique challenges due to the mismatch between the graph and text modalities. Hence, to further investigate LLMs' potential for comprehending graph information, we introduce GraphPrompter, a novel framework designed to align graph information with LLMs via soft prompts. Specifically, GraphPrompter consists of two main components: a graph neural network to encode complex graph information and an LLM that effectively processes textual information. Comprehensive experiments on various benchmark datasets under node classification and link prediction tasks demonstrate the effectiveness of our proposed method. The GraphPrompter framework unveils the substantial capabilities of LLMs as predictors in graph-related tasks, enabling researchers to utilize LLMs across a spectrum of real-world graph scenarios more effectively.
摘要：图在表示社交网络、生物数据和引文网络等现实世界应用中的复杂关系方面发挥着重要作用。近年来，大型语言模型（LLM）在各个领域取得了巨大的成功，这使得将 LLM 应用于图变得特别有吸引力。然而，由于图形和文本模式之间的差异和不匹配，直接将 LLM 应用于图形模式会带来独特的挑战。因此，为了进一步研究 LLM 理解图形信息的潜力，我们引入了 GraphPrompter，这是一种新颖的框架，旨在通过软提示将图形信息与 LLM 对齐。具体来说，GraphPrompter 由两个主要组件组成：用于编码复杂图信息的图神经网络和有效处理文本信息的 LLM。在节点分类和链路预测任务下对各种基准数据集的综合实验证明了我们提出的方法的有效性。GraphPrompter 框架揭示了 LLM 作为图相关任务中的预测器的强大功能，使研究人员能够在各种现实世界的图场景中更有效地利用 LLM。

Title: BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

Authors: Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, Richard Dufour
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10373
Pdf URL: https://arxiv.org/pdf/2402.10373
Copy Paste: [[2402.10373]] BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains(https://arxiv.org/abs/2402.10373)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, offering potential applications across specialized domains such as healthcare and medicine. Despite the availability of various open-source LLMs tailored for health contexts, adapting general-purpose LLMs to the medical domain presents significant challenges. In this paper, we introduce BioMistral, an open-source LLM tailored for the biomedical domain, utilizing Mistral as its foundation model and further pre-trained on PubMed Central. We conduct a comprehensive evaluation of BioMistral on a benchmark comprising 10 established medical question-answering (QA) tasks in English. We also explore lightweight models obtained through quantization and model merging approaches. Our results demonstrate BioMistral's superior performance compared to existing open-source medical models and its competitive edge against proprietary counterparts. Finally, to address the limited availability of data beyond English and to assess the multilingual generalization of medical LLMs, we automatically translated this benchmark into 7 other languages and evaluated on them. This marks the first large-scale multilingual evaluation of LLMs in the medical domain. Datasets, multilingual evaluation benchmarks, scripts, and all the models obtained during our experiments are freely released.
摘要：近年来，大型语言模型 (LLM) 表现出了显著的多功能性，在医疗保健和医学等专业领域提供了潜在的应用。尽管有各种针对健康环境量身定制的开源 LLM，但将通用 LLM 适应医疗领域却面临着巨大的挑战。在本文中，我们介绍了 BioMistral，这是一个专为生物医学领域量身定制的开源 LLM，利用 Mistral 作为其基础模型，并在 PubMed Central 上进行了进一步的预训练。我们根据由 10 项既定的英语医学问答 (QA) 任务组成的基准对 BioMistral 进行了全面评估。我们还探索通过量化和模型合并方法获得的轻量级模型。我们的结果证明了 BioMistral 与现有开源医疗模型相比具有卓越的性能，并且与专有模型相比具有竞争优势。最后，为了解决英语以外的数据有限的问题，并评估医学 LLM 的多语言泛化能力，我们自动将该基准翻译成另外 7 种语言并进行了评估。这标志着医学领域 LLM 的首次大规模多语言评估。数据集、多语言评估基准、脚本以及我们实验中获得的所有模型都是免费发布的。

Title: DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

Authors: Ajay Patel, Colin Raffel, Chris Callison-Burch
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10379
Pdf URL: https://arxiv.org/pdf/2402.10379
Copy Paste: [[2402.10379]] DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows(https://arxiv.org/abs/2402.10379)
Keywords: language model, llm
Abstract: Large language models (LLMs) have become a dominant and important tool for NLP researchers in a wide range of tasks. Today, many researchers use LLMs in synthetic data generation, task evaluation, fine-tuning, distillation, and other model-in-the-loop research workflows. However, challenges arise when using these models that stem from their scale, their closed source nature, and the lack of standardized tooling for these new and emerging workflows. The rapid rise to prominence of these models and these unique challenges has had immediate adverse impacts on open science and on the reproducibility of work that uses them. In this paper, we introduce DataDreamer, an open source Python library that allows researchers to write simple code to implement powerful LLM workflows. DataDreamer also helps researchers adhere to best practices that we propose to encourage open science and reproducibility. The library and documentation are available at https://github.com/datadreamer-dev/DataDreamer .
摘要：大型语言模型 (LLM) 已成为 NLP 研究人员在各种任务中的主导且重要的工具。如今，许多研究人员在合成数据生成、任务评估、微调、蒸馏和其他模型在环研究工作流程中使用 LLM。然而，使用这些模型时会出现一些挑战，这些挑战源于其规模、闭源性质以及这些新兴工作流程缺乏标准化工具。这些模型的迅速崛起和这些独特的挑战对开放科学和使用它们的工作的可重复性产生了直接的不利影响。在本文中，我们介绍了 DataDreamer，这是一个开源 Python 库，允许研究人员编写简单的代码来实现强大的 LLM 工作流程。 DataDreamer 还帮助研究人员遵循我们建议的最佳实践，以鼓励开放科学和可重复性。该库和文档可从 https://github.com/datadreamer-dev/DataDreamer 获取。

Title: Subgraph-level Universal Prompt Tuning

Authors: Junhyun Lee, Wooseong Yang, Jaewoo Kang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10380
Pdf URL: https://arxiv.org/pdf/2402.10380
Copy Paste: [[2402.10380]] Subgraph-level Universal Prompt Tuning(https://arxiv.org/abs/2402.10380)
Keywords: prompt
Abstract: In the evolving landscape of machine learning, the adaptation of pre-trained models through prompt tuning has become increasingly prominent. This trend is particularly observable in the graph domain, where diverse pre-training strategies present unique challenges in developing effective prompt-based tuning methods for graph neural networks. Previous approaches have been limited, focusing on specialized prompting functions tailored to models with edge prediction pre-training tasks. These methods, however, suffer from a lack of generalizability across different pre-training strategies. Recently, a simple prompt tuning method has been designed for any pre-training strategy, functioning within the input graph's feature space. This allows it to theoretically emulate any type of prompting function, thereby significantly increasing its versatility for a range of downstream applications. Nevertheless, the capacity of such simple prompts to fully grasp the complex contexts found in graphs remains an open question, necessitating further investigation. Addressing this challenge, our work introduces the Subgraph-level Universal Prompt Tuning (SUPT) approach, focusing on the detailed context within subgraphs. In SUPT, prompt features are assigned at the subgraph level, preserving the method's universal capability. SUPT requires far fewer tuning parameters than fine-tuning-based methods while outperforming them in 42 out of 45 full-shot scenario experiments, with an average improvement of over 2.5%. In few-shot scenarios, it excels in 41 out of 45 experiments, achieving an average performance increase of more than 6.6%.
摘要：在不断发展的机器学习领域，通过即时调整来适应预训练模型变得越来越重要。这种趋势在图领域尤其明显，其中不同的预训练策略在为图神经网络开发有效的基于提示的调整方法方面提出了独特的挑战。以前的方法受到限制，侧重于针对具有边缘预测预训练任务的模型量身定制的专门提示功能。然而，这些方法缺乏跨不同预训练策略的通用性。最近，为任何预训练策略设计了一种简单的提示调整方法，在输入图的特征空间内发挥作用。这使得它理论上可以模拟任何类型的提示功能，从而显着增加其针对一系列下游应用程序的多功能性。然而，这种简单的提示是否能够完全掌握图表中的复杂上下文仍然是一个悬而未决的问题，需要进一步研究。为了应对这一挑战，我们的工作引入了子图级通用提示调整（SUPT）方法，重点关注子图内的详细上下文。在 SUPT 中，提示特征在子图级别分配，保留了方法的通用能力。与基于微调的方法相比，这需要极少的调整参数，在 45 个全场景场景实验中的 42 个中优于它们，平均改进超过 2.5%。在少样本场景下，它在 45 个实验中的 41 个中表现出色，实现了超过 6.6% 的平均性能提升。

Title: Chain of Logic: Rule-Based Reasoning with Large Language Models

Authors: Sergio Servantez, Joe Barrow, Kristian Hammond, Rajiv Jain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10400
Pdf URL: https://arxiv.org/pdf/2402.10400
Copy Paste: [[2402.10400]] Chain of Logic: Rule-Based Reasoning with Large Language Models(https://arxiv.org/abs/2402.10400)
Keywords: language model, prompt
Abstract: Rule-based reasoning, a fundamental type of legal reasoning, enables us to draw conclusions by accurately applying a rule to a set of facts. We explore causal language models as rule-based reasoners, specifically with respect to compositional rules - rules consisting of multiple elements which form a complex logical expression. Reasoning about compositional rules is challenging because it requires multiple reasoning steps, and attending to the logical relationships between elements. We introduce a new prompting method, Chain of Logic, which elicits rule-based reasoning through decomposition (solving elements as independent threads of logic), and recomposition (recombining these sub-answers to resolve the underlying logical expression). This method was inspired by the IRAC (Issue, Rule, Application, Conclusion) framework, a sequential reasoning approach used by lawyers. We evaluate chain of logic across eight rule-based reasoning tasks involving three distinct compositional rules from the LegalBench benchmark and demonstrate it consistently outperforms other prompting methods, including chain of thought and self-ask, using open-source and commercial language models.
摘要：基于规则的推理是法律推理的一种基本类型，它使我们能够通过准确地将规则应用于一组事实来得出结论。我们探索因果语言模型作为基于规则的推理器，特别是关于组合规则 - 由形成复杂逻辑表达式的多个元素组成的规则。关于组合规则的推理具有挑战性，因为它需要多个推理步骤，并关注元素之间的逻辑关系。我们引入了一种新的提示方法，即逻辑链，它通过分解（将元素作为独立的逻辑线程来解决）和重组（重新组合这些子答案来解析底层逻辑表达式）来引发基于规则的推理。该方法的灵感来自 IRAC（问题、规则、应用、结论）框架，这是律师使用的顺序推理方法。我们评估了八个基于规则的推理任务的逻辑链，涉及来自 LegalBench 基准的三个不同的组成规则，并证明它始终优于其他提示方法，包括使用开源和商业语言模型的思维链和自我询问。

Title: Understanding Survey Paper Taxonomy about Large Language Models via Graph Representation Learning

Authors: Jun Zhuang, Casey Kennington
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10409
Pdf URL: https://arxiv.org/pdf/2402.10409
Copy Paste: [[2402.10409]] Understanding Survey Paper Taxonomy about Large Language Models via Graph Representation Learning(https://arxiv.org/abs/2402.10409)
Keywords: language model, llm
Abstract: As new research on Large Language Models (LLMs) continues, it is difficult to keep up with new research and models. To help researchers synthesize the new research, many have written survey papers, but even those have become numerous. In this paper, we develop a method to automatically assign survey papers to a taxonomy. We collect the metadata of 144 LLM survey papers and explore three paradigms to classify papers within the taxonomy. Our work indicates that leveraging graph structure information on co-category graphs can significantly outperform the language models in two paradigms: fine-tuning of pre-trained language models and zero-shot/few-shot classification using LLMs. We find that our model surpasses an average human recognition level and that fine-tuning LLMs using weak labels generated by a smaller model, such as the GCN in this study, can be more effective than using ground-truth labels, revealing the potential of weak-to-strong generalization in the taxonomy classification task.
摘要：随着大型语言模型 (LLM) 的新研究不断进行，很难跟上新的研究和模型的步伐。为了帮助研究人员综合新研究，许多人撰写了调查论文，但即便如此，这些论文也变得数量众多。在本文中，我们开发了一种自动将调查论文分配给分类的方法。我们收集了 144 篇 LLM 调查论文的元数据，并探索了三种范式来对分类法中的论文进行分类。我们的工作表明，利用同类别图上的图结构信息可以在两种范式中显著优于语言模型：预训练语言模型的微调，以及使用 LLM 进行的零样本/少样本分类。我们发现我们的模型超越了人类的平均识别水平，并且使用较小模型（例如本研究中的 GCN）生成的弱标签来微调 LLM 可能比使用地面实况标签更有效，从而揭示了分类学分类任务中由弱到强泛化的潜力。

Title: Measuring and Reducing LLM Hallucination without Gold-Standard Answers via Expertise-Weighting

Authors: Jiaheng Wei, Yuanshun Yao, Jean-Francois Ton, Hongyi Guo, Andrew Estornell, Yang Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10412
Pdf URL: https://arxiv.org/pdf/2402.10412
Copy Paste: [[2402.10412]] Measuring and Reducing LLM Hallucination without Gold-Standard Answers via Expertise-Weighting(https://arxiv.org/abs/2402.10412)
Keywords: llm, hallucination
Abstract: LLM hallucination, i.e., generating factually incorrect yet seemingly convincing answers, is currently a major threat to the trustworthiness and reliability of LLMs. The first step towards solving this complicated problem is to measure it. However, existing hallucination metrics require a benchmark dataset with gold-standard answers, i.e., "best" or "correct" answers written by humans. Such a requirement makes hallucination measurement costly and prone to human error. In this work, we propose Factualness Evaluations via Weighting LLMs (FEWL), the first hallucination metric that is specifically designed for the scenario in which gold-standard answers are absent. FEWL leverages the answers from off-the-shelf LLMs that serve as a proxy for gold-standard answers. The key challenge is how to quantify the expertise of reference LLMs resourcefully. We show that FEWL has certain theoretical guarantees and demonstrate empirically that it gives more accurate hallucination measures than naively using reference LLMs. We also show how to leverage FEWL to reduce hallucination through both in-context learning and supervised finetuning. Finally, we build a large-scale benchmark dataset to facilitate LLM hallucination research.
摘要：LLM 幻觉，即产生事实上不正确但看似令人信服的答案，目前是 LLM 可信度和可靠性的主要威胁。解决这个复杂问题的第一步是对其进行测量。然而，现有的幻觉指标需要有一个包含黄金标准答案的基准数据集，即由人类编写的“最佳”或“正确”答案。这种要求使得幻觉测量成本高昂并且容易出现人为错误。在这项工作中，我们提出了通过加权 LLM 进行事实性评估（FEWL），这是第一个专门针对缺乏黄金标准答案的情况而设计的幻觉指标。FEWL 利用现成 LLM 的答案作为黄金标准答案的代理。关键的挑战是如何巧妙地量化参考 LLM 的专业知识。我们证明 FEWL 具有一定的理论保证，并通过经验证明它比单纯使用参考 LLM 能提供更准确的幻觉测量。我们还展示了如何利用 FEWL 通过上下文学习和监督微调来减少幻觉。最后，我们构建了一个大规模的基准数据集来促进 LLM 幻觉研究。

Title: Grounding Language about Belief in a Bayesian Theory-of-Mind

Authors: Lance Ying, Tan Zhi-Xuan, Lionel Wong, Vikash Mansinghka, Joshua Tenenbaum
Subjects: cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.10416
Pdf URL: https://arxiv.org/pdf/2402.10416
Copy Paste: [[2402.10416]] Grounding Language about Belief in a Bayesian Theory-of-Mind(https://arxiv.org/abs/2402.10416)
Keywords: agent
Abstract: Despite the fact that beliefs are mental states that cannot be directly observed, humans talk about each other's beliefs on a regular basis, often using rich compositional language to describe what others think and know. What explains this capacity to interpret the hidden epistemic content of other minds? In this paper, we take a step towards an answer by grounding the semantics of belief statements in a Bayesian theory-of-mind: By modeling how humans jointly infer coherent sets of goals, beliefs, and plans that explain an agent's actions, then evaluating statements about the agent's beliefs against these inferences via epistemic logic, our framework provides a conceptual role semantics for belief, explaining the gradedness and compositionality of human belief attributions, as well as their intimate connection with goals and plans. We evaluate this framework by studying how humans attribute goals and beliefs while watching an agent solve a doors-and-keys gridworld puzzle that requires instrumental reasoning about hidden objects. In contrast to pure logical deduction, non-mentalizing baselines, and mentalizing that ignores the role of instrumental plans, our model provides a much better fit to human goal and belief attributions, demonstrating the importance of theory-of-mind for a semantics of belief.
摘要：尽管信念是无法直接观察到的心理状态，但人类经常谈论彼此的信念，经常使用丰富的组合语言来描述他人的想法和知识。如何解释这种解释其他心灵隐藏的认知内容的能力？在本文中，我们通过将信念陈述的语义建立在贝叶斯心理理论的基础上，朝着答案迈出了一步：通过建模人类如何共同推断解释代理行为的连贯目标、信念和计划集，然后评估通过认知逻辑对主体信念的陈述与这些推论进行比较，我们的框架为信念提供了概念角色语义，解释了人类信念归因的分级性和组合性，以及它们与目标和计划的密切联系。我们通过研究人类如何归因目标和信念来评估这个框架，同时观察智能体解决需要对隐藏对象进行工具推理的门和钥匙网格世界难题。与纯粹的逻辑演绎、非心智化基线和忽略工具计划作用的心智化相比，我们的模型更适合人类目标和信念归因，证明了心智理论对于信念语义的重要性。

Title: Understanding In-Context Learning with a Pelican Soup Framework

Authors: Ting-Rui Chiang, Dani Yogatama
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10424
Pdf URL: https://arxiv.org/pdf/2402.10424
Copy Paste: [[2402.10424]] Understanding In-Context Learning with a Pelican Soup Framework(https://arxiv.org/abs/2402.10424)
Keywords: language model, gpt
Abstract: Many existing theoretical analyses of in-context learning for natural language processing are based on latent variable models that leave gaps between theory and practice. We aim to close these gaps by proposing a theoretical framework, the Pelican Soup Framework. In this framework, we introduce (1) the notion of a common sense knowledge base, (2) a general formalism for natural language classification tasks, and (3) the notion of meaning association. Under this framework, we can establish a $\mathcal{O}(1/T)$ loss bound for in-context learning, where $T$ is the number of example-label pairs in the demonstration. Compared with previous works, our bound reflects the effect of the choice of verbalizers and the effect of instruction tuning. An additional notion of atom concepts makes it possible for our framework to explain the generalization to tasks unseen in the language model training data. Finally, we propose a toy setup, Calcutec, and a digit addition task that mimics the types of distribution shifts a model needs to overcome to perform in-context learning. We also experiment with GPT2-Large on real-world NLP tasks. Our empirical results demonstrate the efficacy of our framework in explaining in-context learning.
摘要：许多现有的自然语言处理上下文学习理论分析都是基于潜变量模型，这在理论与实践之间留下了差距。我们的目标是通过提出一个理论框架，即鹈鹕汤框架（Pelican Soup Framework），来缩小这些差距。在这个框架中，我们引入了（1）常识知识库的概念，（2）自然语言分类任务的一般形式主义，以及（3）意义关联的概念。在此框架下，我们可以为上下文学习建立 $\mathcal{O}(1/T)$ 损失界限，其中 $T$ 是演示中示例标签对的数量。与以前的作品相比，我们的界限反映了言语器选择的效果和指令调整的效果。原子概念（atom concepts）这一附加概念使我们的框架能够解释对语言模型训练数据中未见任务的泛化。最后，我们提出了一个玩具设置 Calcutec 和一个数字加法任务，该任务模仿模型执行上下文学习所需克服的分布变化类型。我们还在现实世界的 NLP 任务中尝试使用 GPT2-Large。我们的实证结果证明了我们的框架在解释上下文学习方面的有效性。

Title: DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection

Authors: Herun Wan, Shangbin Feng, Zhaoxuan Tan, Heng Wang, Yulia Tsvetkov, Minnan Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10426
Pdf URL: https://arxiv.org/pdf/2402.10426
Copy Paste: [[2402.10426]] DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection(https://arxiv.org/abs/2402.10426)
Keywords: language model, llm, hallucination
Abstract: Challenges in factuality and hallucination prevent large language models from being employed directly off the shelf to judge the veracity of news articles, where factual accuracy is paramount. In this work, we propose DELL, which identifies three key stages in misinformation detection where LLMs could be incorporated as part of the pipeline: 1) LLMs could generate news reactions to represent diverse perspectives and simulate user-news interaction networks; 2) LLMs could generate explanations for proxy tasks (e.g., sentiment, stance) to enrich the contexts of news articles and produce experts specializing in various aspects of news understanding; 3) LLMs could merge task-specific experts and provide an overall prediction by incorporating the predictions and confidence scores of varying experts. Extensive experiments on seven datasets with three LLMs demonstrate that DELL outperforms state-of-the-art baselines by up to 16.8% in macro F1-score. Further analysis reveals that the generated reactions and explanations are greatly helpful in misinformation detection, while our proposed LLM-guided expert merging helps produce better-calibrated predictions.

Title: Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Authors: Dheeraj Mekala, Alex Nguyen, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10430
Pdf URL: https://arxiv.org/pdf/2402.10430
Copy Paste: [[2402.10430]] Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models(https://arxiv.org/abs/2402.10430)
Keywords: language model
Abstract: Instruction-tuning language models has become a crucial step in aligning them for general use. Typically, this process involves extensive training on large datasets, incurring high training costs. In this paper, we introduce a novel training data selection method based on the learning percentage of the samples. We assert that current language models possess the capability to autonomously select high-quality training data, leading to comparable or improved performance compared to training on the entire dataset. Our experiments span different-sized models, revealing that this characteristic holds for models ranging from 1B (small) to 13B (large) in size. Moreover, we demonstrate an interesting finding that data hardness transfers across model sizes, and a smaller 350M model can effectively curate high-quality training data with hard samples for a larger 13B model, resulting in an instruction-tuned model that is equal or superior to one trained on the complete dataset. Using open-sourced OPT and Llama-2 models up to 13B in size and two publicly available instruction-tuning training datasets, with evaluation by both automatic metrics and humans, our paper introduces a novel approach to training data selection, showcasing a more efficient alternative.

Title: I Am Not Them: Fluid Identities and Persistent Out-group Bias in Large Language Models

Authors: Wenchao Dong, Assem Zhunis, Hyojin Chin, Jiyoung Han, Meeyoung Cha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10436
Pdf URL: https://arxiv.org/pdf/2402.10436
Copy Paste: [[2402.10436]] I Am Not Them: Fluid Identities and Persistent Out-group Bias in Large Language Models(https://arxiv.org/abs/2402.10436)
Keywords: language model, gpt, llm, prompt, chat
Abstract: We explored cultural biases (individualism vs. collectivism) in ChatGPT across three Western languages (i.e., English, German, and French) and three Eastern languages (i.e., Chinese, Japanese, and Korean). When ChatGPT adopted an individualistic persona in Western languages, its collectivism scores (i.e., out-group values) exhibited a more negative trend, surpassing its positive orientation towards individualism (i.e., in-group values). Conversely, when a collectivistic persona was assigned to ChatGPT in Eastern languages, a similar pattern emerged with more negative responses toward individualism (i.e., out-group values) as compared to collectivism (i.e., in-group values). The results indicate that when imbued with a particular social identity, ChatGPT discerns in-group and out-group, embracing in-group values while eschewing out-group values. Notably, the negativity towards the out-group, from which prejudices and discrimination arise, exceeded the positivity towards the in-group. The experiment was replicated in the political domain, and the results remained consistent. Furthermore, this replication unveiled an intrinsic Democratic bias in Large Language Models (LLMs), aligning with earlier findings and providing integral insights into mitigating such bias through prompt engineering. Extensive robustness checks were performed using varying hyperparameter and persona setup methods, with or without social identity labels, across other popular language models.

Title: PRISE: Learning Temporal Action Abstractions as a Sequence Compression Problem

Authors: Ruijie Zheng, Ching-An Cheng, Hal Daumé III, Furong Huang, Andrey Kolobov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.10450
Pdf URL: https://arxiv.org/pdf/2402.10450
Copy Paste: [[2402.10450]] PRISE: Learning Temporal Action Abstractions as a Sequence Compression Problem(https://arxiv.org/abs/2402.10450)
Keywords: llm
Abstract: Temporal action abstractions, along with belief state representations, are a powerful knowledge sharing mechanism for sequential decision making. In this work, we propose a novel view that treats inducing temporal action abstractions as a sequence compression problem. To do so, we bring a subtle but critical component of LLM training pipelines, input tokenization via byte pair encoding (BPE), to the seemingly distant task of learning skills of variable time span in continuous control domains. We introduce an approach called Primitive Sequence Encoding (PRISE) that combines continuous action quantization with BPE to learn powerful action abstractions. We empirically show that high-level skills discovered by PRISE from a multitask set of robotic manipulation demonstrations significantly boost the performance of both multitask imitation learning and few-shot imitation learning on unseen tasks. Our code will be released at https://github.com/FrankZheng2022/PRISE.
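To make the combination of action quantization and BPE concrete, here is a minimal Python sketch. It is not the PRISE implementation: the uniform-bin quantizer, the function names (`quantize`, `merge_pair`, `learn_bpe_merges`), and the scalar toy actions are all assumptions for illustration. The abstract only states that continuous actions are quantized and the resulting sequences are compressed with BPE.

```python
from collections import Counter

def quantize(actions, low=-1.0, high=1.0, bins=4):
    """Map scalar actions in [low, high] to integer codes 0..bins-1 (uniform bins, an assumption)."""
    codes = []
    for a in actions:
        clipped = min(max(a, low), high)
        idx = int((clipped - low) / (high - low) * bins)
        codes.append(min(idx, bins - 1))  # put a == high into the last bin
    return codes

def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe_merges(seq, num_merges, first_new_token):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair of tokens."""
    merges = []
    next_token = first_new_token
    for _ in range(num_merges):
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so no further compression is possible
        merges.append((pair, next_token))
        seq = merge_pair(seq, pair, next_token)
        next_token += 1
    return merges, seq
```

Each learned merge token stands for a multi-step action pattern, which is the sense in which a compressed vocabulary can act as a set of variable-length skills.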

Title: Steering Conversational Large Language Models for Long Emotional Support Conversations

Authors: Navid Madani, Sougata Saha, Rohini Srihari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10453
Pdf URL: https://arxiv.org/pdf/2402.10453
Copy Paste: [[2402.10453]] Steering Conversational Large Language Models for Long Emotional Support Conversations(https://arxiv.org/abs/2402.10453)
Keywords: language model, llm, prompt
Abstract: In this study, we address the challenge of consistently following emotional support strategies in long conversations by large language models (LLMs). We introduce the Strategy-Relevant Attention (SRA) metric, a model-agnostic measure designed to evaluate the effectiveness of LLMs in adhering to strategic prompts in emotional support contexts. By analyzing conversations within the Emotional Support Conversations dataset (ESConv) using LLaMA models, we demonstrate that SRA is significantly correlated with a model's ability to sustain the outlined strategy throughout the interactions. Our findings reveal that the application of SRA-informed prompts leads to enhanced strategic adherence, resulting in conversations that more reliably exhibit the desired emotional support strategies over longer conversations. Furthermore, we contribute a comprehensive, multi-branch synthetic conversation dataset for ESConv, featuring a variety of strategy continuations informed by our optimized prompting method. The code and data are publicly available on our GitHub.

Title: QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning

Authors: Hossein Rajabzadeh, Mojtaba Valipour, Tianshu Zhu, Marzieh Tahaei, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.10462
Pdf URL: https://arxiv.org/pdf/2402.10462
Copy Paste: [[2402.10462]] QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning(https://arxiv.org/abs/2402.10462)
Keywords: language model, llm
Abstract: Finetuning large language models requires huge GPU memory, restricting the choice of larger models. While the quantized version of the Low-Rank Adaptation technique, named QLoRA, significantly alleviates this issue, finding the efficient LoRA rank is still challenging. Moreover, QLoRA is trained on a pre-defined rank and, therefore, cannot be reconfigured for its lower ranks without requiring further fine-tuning steps. This paper proposes QDyLoRA (Quantized Dynamic Low-Rank Adaptation) as an efficient quantization approach for dynamic low-rank adaptation. Motivated by Dynamic LoRA, QDyLoRA is able to efficiently finetune LLMs on a set of pre-defined LoRA ranks. QDyLoRA enables fine-tuning Falcon-40b for ranks 1 to 64 on a single 32 GB V100 GPU through one round of fine-tuning. Experimental results show that QDyLoRA is competitive with QLoRA and outperforms it when employing its optimal rank.
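The idea of one adapter that serves several ranks can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the QDyLoRA code: it assumes the rank-r adapter is the leading r columns of B and r rows of A, draws the training rank uniformly, and omits the quantization of the frozen base weights entirely. The names `lora_delta` and `sample_rank` are hypothetical.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_delta(B, A, rank):
    """Low-rank weight update using only the first `rank` columns of B and rows of A."""
    B_r = [row[:rank] for row in B]  # shape (d_out, rank)
    A_r = A[:rank]                   # shape (rank, d_in)
    return matmul(B_r, A_r)

def sample_rank(ranks, rng):
    """Draw one rank per training step (uniform sampling is an assumption)."""
    return rng.choice(ranks)
```

Because every lower rank reuses a prefix of the same matrices, a single fine-tuning run can yield adapters for the whole set of ranks, which is the property the abstract highlights.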

Title: Large Language Models as Zero-shot Dialogue State Tracker through Function Calling

Authors: Zekun Li, Zhiyu Zoey Chen, Mike Ross, Patrick Huber, Seungwhan Moon, Zhaojiang Lin, Xin Luna Dong, Adithya Sagar, Xifeng Yan, Paul A. Crook
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10466
Pdf URL: https://arxiv.org/pdf/2402.10466
Copy Paste: [[2402.10466]] Large Language Models as Zero-shot Dialogue State Tracker through Function Calling(https://arxiv.org/abs/2402.10466)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) are increasingly prevalent in conversational systems due to their advanced understanding and generative capabilities in general contexts. However, their effectiveness in task-oriented dialogues (TOD), which requires not only response generation but also effective dialogue state tracking (DST) within specific tasks and domains, remains less satisfactory. In this work, we propose a novel approach, FnCTOD, for solving DST with LLMs through function calling. This method improves zero-shot DST, allowing adaptation to diverse domains without extensive data collection or model tuning. Our experimental results demonstrate that our approach achieves exceptional performance with both modestly sized open-source and proprietary LLMs: with in-context prompting it enables various 7B or 13B parameter models to surpass the previous state-of-the-art (SOTA) achieved by ChatGPT, and improves ChatGPT's performance, beating the SOTA by 5.6% average joint goal accuracy (JGA). Individual model results for GPT-3.5 and GPT-4 are boosted by 4.8% and 14%, respectively. We also show that by fine-tuning on a small collection of diverse task-oriented dialogues, we can equip modestly sized models, specifically a 13B parameter LLaMA2-Chat model, with function-calling capabilities and DST performance comparable to ChatGPT while maintaining their chat capabilities. We plan to open-source the experimental code and model.
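One way to picture dialogue state tracking as function calling is sketched below. Everything here is hypothetical and chosen for illustration: the `find_hotel` specification, its slot names, the JSON shape of the model's call, and the `update_state` helper are assumptions, not the FnCTOD prompt format or code.

```python
import json

# Hypothetical function specification for one domain; slot names are illustrative only.
FIND_HOTEL_SPEC = {
    "name": "find_hotel",
    "description": "Search for a hotel that matches the user's constraints.",
    "parameters": {
        "type": "object",
        "properties": {
            "area": {"type": "string", "enum": ["north", "south", "east", "west", "centre"]},
            "stars": {"type": "string"},
            "parking": {"type": "string", "enum": ["yes", "no"]},
        },
    },
}

def update_state(state, call_json, spec):
    """Merge the arguments of a model-emitted function call into the dialogue state."""
    call = json.loads(call_json)
    if call.get("name") != spec["name"]:
        return state  # the call targets another domain; leave this state unchanged
    allowed = spec["parameters"]["properties"]
    new_state = dict(state)
    for slot, value in call.get("arguments", {}).items():
        if slot not in allowed:
            continue  # drop slots that are not in the schema
        choices = allowed[slot].get("enum")
        if choices is not None and value not in choices:
            continue  # drop values outside the declared choices
        new_state[slot] = value
    return new_state
```

Under this framing, adding a new domain amounts to writing a new function specification, which is consistent with the zero-shot adaptation the abstract describes.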

Title: Comparing Hallucination Detection Metrics for Multilingual Generation

Authors: Haoqiang Kang, Terra Blevins, Luke Zettlemoyer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10496
Pdf URL: https://arxiv.org/pdf/2402.10496
Copy Paste: [[2402.10496]] Comparing Hallucination Detection Metrics for Multilingual Generation(https://arxiv.org/abs/2402.10496)
Keywords: llm, hallucination
Abstract: While many automatic hallucination detection techniques have been proposed for English texts, their effectiveness in multilingual contexts remains unexplored. This paper aims to bridge the gap in understanding how these hallucination detection metrics perform on non-English languages. We evaluate the efficacy of various detection metrics, including lexical metrics like ROUGE and Named Entity Overlap as well as Natural Language Inference (NLI)-based metrics, at detecting hallucinations in biographical summaries in many languages; we also evaluate how correlated these different metrics are to gauge whether they measure the same phenomena. Our empirical analysis reveals that while lexical metrics show limited effectiveness, NLI-based metrics perform well in high-resource languages at the sentence level. In contrast, NLI-based metrics often fail to detect atomic fact hallucinations. Our findings highlight existing gaps in multilingual hallucination detection and motivate future research to develop more robust detection methods for LLM hallucination in other languages.

Title: Provably Sample Efficient RLHF via Active Preference Optimization

Authors: Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, Sayak Ray Chowdhury
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.10500
Pdf URL: https://arxiv.org/pdf/2402.10500
Copy Paste: [[2402.10500]] Provably Sample Efficient RLHF via Active Preference Optimization(https://arxiv.org/abs/2402.10500)
Keywords: language model, llm, prompt
Abstract: Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models (LLMs) with human preferences. While these aligned generative models have demonstrated impressive capabilities across various tasks, the dependence on high-quality human preference data poses a costly bottleneck in practical implementation of RLHF. Hence better and adaptive strategies for data collection are needed. To this end, we frame RLHF as a contextual preference bandit problem with prompts as contexts and show that the naive way of collecting preference data by choosing prompts uniformly at random leads to a policy that suffers an $\Omega(1)$ suboptimality gap in rewards. Then we propose Active Preference Optimization (APO), an algorithm that actively selects prompts to collect preference data. Under the Bradley-Terry-Luce (BTL) preference model, APO achieves sample efficiency without compromising on policy performance. We show that given a sample budget of $T$, the suboptimality gap of a policy learned via APO scales as $O(1/\sqrt{T})$. Next, we propose a compute-efficient batch version of APO with minor modification and evaluate its performance in practice. Experimental evaluations on a human preference dataset validate APO's efficacy as a sample-efficient and practical solution to data collection for RLHF, facilitating alignment of LLMs with human preferences in a cost-effective and scalable manner.
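For reference, the Bradley-Terry-Luce model named in this abstract is usually written as below. The notation is assumed here rather than taken from the abstract: $x$ is a prompt, $y_1$ and $y_2$ are two candidate responses, and $r$ is a latent reward function.

```latex
\[
\Pr(y_1 \succ y_2 \mid x)
  = \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)}
  = \sigma\big(r(x, y_1) - r(x, y_2)\big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}} .
\]
```

Only the reward difference enters the preference probability, so preference data identifies $r$ up to a per-prompt additive constant.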

Title: Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

Authors: Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, Jae W. Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.10517
Pdf URL: https://arxiv.org/pdf/2402.10517
Copy Paste: [[2402.10517]] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs(https://arxiv.org/abs/2402.10517)
Keywords: language model, llm
Abstract: Recently, considerable efforts have been directed towards compressing Large Language Models (LLMs), which showcase groundbreaking capabilities across diverse applications but entail significant deployment costs due to their large sizes. Meanwhile, much less attention has been given to mitigating the costs associated with deploying multiple LLMs of varying sizes despite its practical significance. Thus, this paper introduces any-precision LLM, extending the concept of any-precision DNN to LLMs. Addressing challenges in any-precision LLM, we propose a lightweight method for any-precision quantization of LLMs, leveraging a post-training quantization framework, and develop a specialized software engine for its efficient serving. As a result, our solution significantly reduces the high costs of deploying multiple, different-sized LLMs by overlaying LLMs quantized to varying bit-widths, such as 3, 4, ..., $n$ bits, into a memory footprint comparable to a single $n$-bit LLM. All the supported LLMs with varying bit-widths demonstrate state-of-the-art model quality and inference throughput, making our solution a compelling option for deployment of multiple, different-sized LLMs. The source code will be publicly available soon.
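A rough Python sketch of how several bit-widths can share one stored model follows. It rests on assumptions the abstract does not spell out: that the k-bit code of a weight is the k most significant bits of its n-bit code, and that codes map to a uniform grid. The function names are hypothetical, and this is not the paper's quantization method or serving engine.

```python
def truncate_code(code, n_bits, k_bits):
    """Keep the k most significant bits of an n-bit integer code."""
    if not 1 <= k_bits <= n_bits:
        raise ValueError("k_bits must be between 1 and n_bits")
    return code >> (n_bits - k_bits)

def dequantize(code, bits, w_min, w_max):
    """Map a `bits`-bit code to the centre of its uniform bin in [w_min, w_max] (assumed grid)."""
    levels = 1 << bits
    step = (w_max - w_min) / levels
    return w_min + (code + 0.5) * step

def view_weights(codes, n_bits, k_bits, w_min, w_max):
    """Read one stored n-bit weight array as a k-bit model."""
    return [dequantize(truncate_code(c, n_bits, k_bits), k_bits, w_min, w_max) for c in codes]
```

With nested codes like these, only the n-bit array is kept in memory and each lower precision is a view of it, which matches the footprint claim in the abstract.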

Title: Zero-shot sampling of adversarial entities in biomedical question answering

Authors: R. Patrick Xian, Alex J. Lee, Vincent Wang, Qiming Cui, Russell Ro, Reza Abbasi-Asl
Subjects: cs.CL, cs.CR, stat.AP
Abstract URL: https://arxiv.org/abs/2402.10527
Pdf URL: https://arxiv.org/pdf/2402.10527
Copy Paste: [[2402.10527]] Zero-shot sampling of adversarial entities in biomedical question answering(https://arxiv.org/abs/2402.10527)
Keywords: language model, llm
Abstract: The increasing depth of parametric domain knowledge in large language models (LLMs) is fueling their rapid deployment in real-world applications. In high-stakes and knowledge-intensive tasks, understanding model vulnerabilities is essential for quantifying the trustworthiness of model predictions and regulating their use. The recent discovery of named entities as adversarial examples in natural language processing tasks raises questions about their potential guises in other settings. Here, we propose a power-scaled distance-weighted sampling scheme in embedding space to discover diverse adversarial entities as distractors. We demonstrate its advantage over random sampling in adversarial question answering on biomedical topics. Our approach enables the exploration of different regions on the attack surface, which reveals two regimes of adversarial entities that markedly differ in their characteristics. Moreover, we show that the attacks successfully manipulate token-wise Shapley value explanations, which become deceptive in the adversarial setting. Our investigations illustrate the brittleness of domain knowledge in LLMs and reveal a shortcoming of standard evaluations for high-capacity models.
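A small Python sketch of distance-weighted sampling with a power parameter is given below. The weighting form (probability proportional to distance raised to `power`), the Euclidean metric, and all function names are assumptions made for illustration; the abstract does not give the exact scheme.

```python
import math
import random

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def sampling_weights(anchor, candidates, power):
    """Normalized weights proportional to distance**power (assumed form).

    A negative power needs every candidate to differ from the anchor,
    since a zero distance cannot be raised to a negative exponent.
    """
    raw = [euclidean(anchor, c) ** power for c in candidates]
    total = sum(raw)
    return [w / total for w in raw]

def sample_distractors(anchor, candidates, power, k, rng):
    """Draw k candidate indices (with replacement) according to the weights."""
    weights = sampling_weights(anchor, candidates, power)
    return rng.choices(range(len(candidates)), weights=weights, k=k)
```

In this sketch `power=0` reduces to uniform random sampling, a positive power favours distant entities, and a negative power favours nearby ones, so one parameter sweeps across different regions of the embedding space.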

Title: Can We Verify Step by Step for Incorrect Answer Detection?

Authors: Xin Xu, Shizhe Diao, Can Yang, Yang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10528
Pdf URL: https://arxiv.org/pdf/2402.10528
Copy Paste: [[2402.10528]] Can We Verify Step by Step for Incorrect Answer Detection?(https://arxiv.org/abs/2402.10528)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting has marked a significant advancement in enhancing the reasoning capabilities of large language models (LLMs). Previous studies have developed various extensions of CoT, which focus primarily on enhancing end-task performance. In addition, there has been research on assessing the quality of reasoning chains in CoT. This raises an intriguing question: Is it possible to predict the accuracy of LLM outputs by scrutinizing the reasoning chains they generate? To answer this research question, we introduce a benchmark, R2PE, designed specifically to explore the relationship between reasoning chains and performance in various reasoning tasks spanning five different domains. This benchmark aims to measure the falsehood of the final output of LLMs based on the reasoning steps. To make full use of information in multiple reasoning chains, we propose the process discernibility score (PDS) framework that beats the answer-checking baseline by a large margin. Concretely, this resulted in an average increase of 5.1% in the F1 score across all 45 subsets within R2PE. We further demonstrate the efficacy of PDS in advancing open-domain QA accuracy. Data and code are available at https://github.com/XinXU-USTC/R2PE.

Title: Properties and Challenges of LLM-Generated Explanations

Authors: Jenny Kunz, Marco Kuhlmann
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10532
Pdf URL: https://arxiv.org/pdf/2402.10532
Copy Paste: [[2402.10532]] Properties and Challenges of LLM-Generated Explanations(https://arxiv.org/abs/2402.10532)
Keywords: language model, llm
Abstract: The self-rationalising capabilities of large language models (LLMs) have been explored in restricted settings, using task-specific data sets. However, current LLMs do not (only) rely on specifically annotated data; nonetheless, they frequently explain their outputs. The properties of the generated explanations are influenced by the pre-training corpus and by the target data used for instruction fine-tuning. As the pre-training corpus includes a large amount of human-written explanations "in the wild", we hypothesise that LLMs adopt common properties of human explanations. By analysing the outputs for a multi-domain instruction fine-tuning data set, we find that generated explanations show selectivity and contain illustrative elements, but less frequently are subjective or misleading. We discuss reasons and consequences of the properties' presence or absence. In particular, we outline positive and negative implications depending on the goals and user groups of the self-rationalising system.

Title: Strong hallucinations from negation and how to fix them

Authors: Nicholas Asher, Swarnadeep Bhar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10543
Pdf URL: https://arxiv.org/pdf/2402.10543
Copy Paste: [[2402.10543]] Strong hallucinations from negation and how to fix them(https://arxiv.org/abs/2402.10543)
Keywords: language model, hallucination, prompt
Abstract: Despite great performance on many tasks, language models (LMs) still struggle with reasoning, sometimes providing responses that cannot possibly be true because they stem from logical incoherence. We call such responses "strong hallucinations" and prove that they follow from an LM's computation of its internal representations for logical operators and outputs from those representations. Focusing on negation, we provide a novel solution in which negation is treated not as another element of a latent representation, but as an operation over an LM's latent representations that constrains how they may evolve. We show that our approach improves model performance in cloze prompting and natural language inference tasks with negation without requiring training on sparse negative data.

Title: Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models

Authors: Minghan Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10552
Pdf URL: https://arxiv.org/pdf/2402.10552
Copy Paste: [[2402.10552]] Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models(https://arxiv.org/abs/2402.10552)
Keywords: language model, llm, chat
Abstract: Simultaneous machine translation (SimulMT) presents a challenging trade-off between translation quality and latency. Recent studies have shown that LLMs can achieve good performance in SimulMT tasks. However, this often comes at the expense of high inference cost and latency. In this paper, we propose a conversational SimulMT framework to enhance the inference efficiency of LLM-based SimulMT through multi-turn-dialogue-based decoding. Our experiments with Llama2-7b-chat on two SimulMT benchmarks demonstrate the superiority of the LLM in translation quality while achieving computational latency comparable to that of specialized SimulMT models.

Title: Disordered-DABS: A Benchmark for Dynamic Aspect-Based Summarization in Disordered Texts

Authors: Xiaobo Guo, Soroush Vosoughi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10554
Pdf URL: https://arxiv.org/pdf/2402.10554
Copy Paste: [[2402.10554]] Disordered-DABS: A Benchmark for Dynamic Aspect-Based Summarization in Disordered Texts(https://arxiv.org/abs/2402.10554)
Keywords: language model, gpt
Abstract: Aspect-based summarization has seen significant advancements, especially in structured text. Yet, summarizing disordered, large-scale texts, like those found in social media and customer feedback, remains a significant challenge. Current research largely targets predefined aspects within structured texts, neglecting the complexities of dynamic and disordered environments. Addressing this gap, we introduce Disordered-DABS, a novel benchmark for dynamic aspect-based summarization tailored to unstructured text. The benchmark is developed by adapting existing datasets for cost-efficiency and scalability. Our comprehensive experiments and detailed human evaluations reveal that Disordered-DABS poses unique challenges to contemporary summarization models, including state-of-the-art language models such as GPT-3.5.

Title: InSaAF: Incorporating Safety through Accuracy and Fairness | Are LLMs ready for the Indian Legal Domain?

Authors: Yogesh Tripathi, Raghav Donakanti, Sahil Girhepuje, Ishan Kavathekar, Bhaskara Hanuma Vedula, Gokul S Krishnan, Shreya Goyal, Anmol Goel, Balaraman Ravindran, Ponnurangam Kumaraguru
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10567
Pdf URL: https://arxiv.org/pdf/2402.10567
Copy Paste: [[2402.10567]] InSaAF: Incorporating Safety through Accuracy and Fairness | Are LLMs ready for the Indian Legal Domain?(https://arxiv.org/abs/2402.10567)
Keywords: language model, llm
Abstract: Recent advancements in language technology and Artificial Intelligence have resulted in numerous Language Models being proposed to perform various tasks in the legal domain ranging from predicting judgments to generating summaries. Despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. In this study, we explore the ability of Large Language Models (LLMs) to perform legal tasks in the Indian landscape when social factors are involved. We present a novel metric, the $\beta$-weighted Legal Safety Score ($LSS_{\beta}$), which encapsulates both the fairness and accuracy aspects of the LLM. We assess an LLM's safety by considering its performance in the Binary Statutory Reasoning task and the fairness it exhibits with respect to various axes of disparities in Indian society. Task performance and fairness scores of LLaMA and LLaMA-2 models indicate that the proposed $LSS_{\beta}$ metric can effectively determine the readiness of a model for safe usage in the legal sector. We also propose finetuning pipelines, utilising specialised legal datasets, as a potential method to mitigate bias and improve model safety. The finetuning procedures on LLaMA and LLaMA-2 models increase the $LSS_{\beta}$, improving their usability in the Indian legal domain. Our code is publicly released.
摘要：语言技术和人工智能的最新进展导致人们提出了许多语言模型来执行法律领域的各种任务，从预测判决到生成摘要。尽管潜力巨大，但这些模型已被证明会学习并表现出社会偏见，并做出不公平的预测。在这项研究中，我们探讨了大型语言模型（LLM）在涉及社会因素时在印度执行法律任务的能力。我们提出了一个新颖的指标，即 $\beta$ 加权法律安全评分（$LSS_{\beta}$），它同时涵盖了 LLM 的公平性和准确性。我们通过考察 LLM 在二元法定推理任务中的表现，以及它在印度社会各类差异维度上表现出的公平性，来评估 LLM 的安全性。LLaMA 和 LLaMA-2 模型的任务性能和公平性得分表明，所提出的 $LSS_{\beta}$ 指标可以有效地判断模型是否已准备好在法律领域安全使用。我们还提出了利用专门法律数据集的微调流程，作为减轻偏见和提高模型安全性的潜在方法。对 LLaMA 和 LLaMA-2 模型的微调提高了 $LSS_{\beta}$，改善了它们在印度法律领域的可用性。我们的代码已公开发布。

Title: Direct Preference Optimization with an Offset

Authors: Afra Amini, Tim Vieira, Ryan Cotterell
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10571
Pdf URL: https://arxiv.org/pdf/2402.10571
Copy Paste: [[2402.10571]] Direct Preference Optimization with an Offset(https://arxiv.org/abs/2402.10571)
Keywords: language model
Abstract: Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated, relies on binary preference data and fine-tunes a language model to increase the likelihood of a preferred response over a dispreferred response. However, not all preference pairs are equal: while in some cases the preferred response is only slightly better than the dispreferred response, there can be a stronger preference for one response when, for example, the other response includes harmful or toxic content. In this paper, we propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning. Intuitively, ODPO requires the difference between the likelihood of the preferred and dispreferred response to be greater than an offset value. The offset is determined based on the extent to which one response is preferred over another. Our experiments on various tasks suggest that ODPO significantly outperforms DPO in aligning language models, especially when the number of preference pairs is limited.
摘要：直接偏好优化（DPO）是一种成功的微调策略，可以使大型语言模型与人类偏好保持一致，而无需训练奖励模型或采用强化学习。正如最初制定的那样，DPO 依赖于二进制偏好数据并微调语言模型，以增加首选响应相对于非首选响应的可能性。然而，并非所有偏好对都是相同的：虽然在某些情况下首选响应仅略好于不首选响应，但当例如另一种响应包含有害或有毒内容时，可能会对一种响应具有更强的偏好。在本文中，我们提出了 DPO 的推广，称为带有偏移量的 DPO (ODPO)，它在微调期间不会平等地对待每个偏好对。直观上，ODPO 要求首选响应和非首选响应的可能性之间的差异大于偏移值。偏移量是根据一种响应优于另一种响应的程度来确定的。我们对各种任务的实验表明，ODPO 在对齐语言模型方面显着优于 DPO，特别是当偏好对的数量有限时。
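
The offset idea in the ODPO abstract above can be illustrated with a minimal sketch. It assumes the standard DPO objective, -log sigmoid(beta * margin), and subtracts an offset inside the sigmoid so that the margin between the preferred and dispreferred response must exceed that offset. The function names, the `beta` default, and passing the offset as a plain number are illustrative assumptions; the abstract does not specify how the offset is derived from preference strength.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def odpo_loss(policy_logp_w: float, policy_logp_l: float,
              ref_logp_w: float, ref_logp_l: float,
              beta: float = 0.1, offset: float = 0.0) -> float:
    """Per-pair loss: -log sigmoid(beta * margin - offset).

    *_w are log-likelihoods of the preferred response, *_l of the
    dispreferred one. With offset == 0 this is the usual DPO loss.
    The offset value is a hypothetical input chosen by the caller.
    """
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin - offset)


# Same pair, scored with and without an offset.
plain = odpo_loss(-1.0, -2.0, -1.5, -1.5, offset=0.0)
with_offset = odpo_loss(-1.0, -2.0, -1.5, -1.5, offset=1.0)
print(plain, with_offset)
```

A larger offset yields a larger loss for the same pair, so strongly preferred responses (given a larger offset) need a wider likelihood gap before the loss becomes small.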

Title: LinkNER: Linking Local Named Entity Recognition Models to Large Language Models using Uncertainty

Authors: Zhen Zhang, Yuhua Zhao, Hang Gao, Mengting Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10573
Pdf URL: https://arxiv.org/pdf/2402.10573
Copy Paste: [[2402.10573]] LinkNER: Linking Local Named Entity Recognition Models to Large Language Models using Uncertainty(https://arxiv.org/abs/2402.10573)
Keywords: language model, gpt, llm
Abstract: Named Entity Recognition (NER) serves as a fundamental task in natural language understanding, bearing direct implications for web content analysis, search engines, and information retrieval systems. Fine-tuned NER models exhibit satisfactory performance on standard NER benchmarks. However, due to limited fine-tuning data and lack of knowledge, they perform poorly on unseen entity recognition. As a result, the usability and reliability of NER models in web-related applications are compromised. In contrast, Large Language Models (LLMs) like GPT-4 possess extensive external knowledge, but research indicates that they lack specialization for NER tasks. Furthermore, non-public and large-scale weights make tuning LLMs difficult. To address these challenges, we propose a framework that combines small fine-tuned models with LLMs (LinkNER) and an uncertainty-based linking strategy called RDC that enables fine-tuned models to complement black-box LLMs, achieving better performance. We experiment with both standard NER test sets and noisy social media datasets. LinkNER enhances NER task performance, notably surpassing SOTA models in robustness tests. We also quantitatively analyze the influence of key components like uncertainty estimation methods, LLMs, and in-context learning on diverse NER tasks, offering specific web-related recommendations.
摘要：命名实体识别（NER）是自然语言理解的一项基本任务，对网络内容分析、搜索引擎和信息检索系统有直接影响。经过微调的 NER 模型在标准 NER 基准测试中表现出令人满意的性能。然而，由于微调数据有限和知识缺乏，它在看不见的实体识别上表现不佳。因此，NER 模型在 Web 相关应用中的可用性和可靠性受到损害。相反，像 GPT-4 这样的大型语言模型 (LLM) 拥有广泛的外部知识，但研究表明它们缺乏 NER 任务的专业知识。此外，非公开和大规模的权重使得法学硕士的调整变得困难。为了应对这些挑战，我们提出了一个框架，将小型微调模型与 LLM (LinkNER) 和基于不确定性的链接策略（称为 RDC）相结合，使微调模型能够补充黑盒 LLM，从而实现更好的性能。我们使用标准 NER 测试集和嘈杂的社交媒体数据集进行实验。 LinkNER 增强了 NER 任务性能，尤其是在鲁棒性测试中超越了 SOTA 模型。我们还定量分析了不确定性估计方法、LLM 和上下文学习等关键组成部分对各种 NER 任务的影响，并提供了具体的网络相关建议。

Title: Symbolic Autoencoding for Self-Supervised Sequence Learning

Authors: Mohammad Hossein Amani, Nicolas Mario Baldwin, Amin Mansouri, Martin Josifoski, Maxime Peyrard, Robert West
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10575
Pdf URL: https://arxiv.org/pdf/2402.10575
Copy Paste: [[2402.10575]] Symbolic Autoencoding for Self-Supervised Sequence Learning(https://arxiv.org/abs/2402.10575)
Keywords: language model
Abstract: Traditional language models, adept at next-token prediction in text sequences, often struggle with transduction tasks between distinct symbolic systems, particularly when parallel data is scarce. Addressing this issue, we introduce \textit{symbolic autoencoding} ($\Sigma$AE), a self-supervised framework that harnesses the power of abundant unparallel data alongside limited parallel data. $\Sigma$AE connects two generative models via a discrete bottleneck layer and is optimized end-to-end by minimizing reconstruction loss (simultaneously with supervised loss for the parallel data), such that the sequence generated by the discrete bottleneck can be read out as the transduced input sequence. We also develop gradient-based methods allowing for efficient self-supervised sequence learning despite the discreteness of the bottleneck. Our results demonstrate that $\Sigma$AE significantly enhances performance on transduction tasks, even with minimal parallel data, offering a promising solution for weakly supervised learning scenarios.
摘要：传统的语言模型擅长文本序列中的下一个标记预测，但常常难以处理不同符号系统之间的转换任务，特别是在并行数据稀缺的情况下。为了解决这个问题，我们引入了 \textit{symbolic autoencoding} ($\Sigma$AE)，一个自我监督的框架，利用丰富的非并行数据和有限的并行数据的力量。 $\Sigma$AE 通过离散瓶颈层连接两个生成模型，并通过最小化重建损失（同时对并行数据进行监督损失）进行端到端优化，从而可以读出离散瓶颈生成的序列作为转换后的输入序列。我们还开发了基于梯度的方法，尽管存在瓶颈的离散性，但仍可以进行有效的自监督序列学习。我们的结果表明，即使并行数据最少，$\Sigma$AE 也能显着提高转导任务的性能，为弱监督学习场景提供了一个有前景的解决方案。

Title: Threads of Subtlety: Detecting Machine-Generated Texts Through Discourse Motifs

Authors: Zae Myung Kim, Kwang Hee Lee, Preston Zhu, Vipul Raheja, Dongyeop Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10586
Pdf URL: https://arxiv.org/pdf/2402.10586
Copy Paste: [[2402.10586]] Threads of Subtlety: Detecting Machine-Generated Texts Through Discourse Motifs(https://arxiv.org/abs/2402.10586)
Keywords: language model, llm
Abstract: With the advent of large language models (LLM), the line between human-crafted and machine-generated texts has become increasingly blurred. This paper delves into the inquiry of identifying discernible and unique linguistic properties in texts that were written by humans, particularly uncovering the underlying discourse structures of texts beyond their surface structures. Introducing a novel methodology, we leverage hierarchical parse trees and recursive hypergraphs to unveil distinctive discourse patterns in texts produced by both LLMs and humans. Empirical findings demonstrate that, although both LLMs and humans generate distinct discourse patterns influenced by specific domains, human-written texts exhibit more structural variability, reflecting the nuanced nature of human writing in different domains. Notably, incorporating hierarchical discourse features enhances binary classifiers' overall performance in distinguishing between human-written and machine-generated texts, even on out-of-distribution and paraphrased samples. This underscores the significance of incorporating hierarchical discourse features in the analysis of text patterns. The code and dataset will be available at [TBA].
摘要：随着大型语言模型 (LLM) 的出现，人工文本和机器生成文本之间的界限变得越来越模糊。本文深入探讨了识别人类书写文本中可辨别且独特的语言属性的问题，特别是揭示文本表面结构之外的潜在话语结构。引入一种新颖的方法，我们利用分层解析树和递归超图来揭示法学硕士和人类生成的文本中独特的话语模式。实证研究结果表明，尽管法学硕士和人类都会产生受特定领域影响的不同话语模式，但人类书写的文本表现出更多的结构变异性，反映了不同领域中人类书写的微妙本质。值得注意的是，结合分层话语特征可以增强二元分类器在区分人类编写的文本和机器生成的文本方面的整体性能，即使是在分布外和释义的样本上也是如此。这强调了在文本模式分析中纳入分层话语特征的重要性。代码和数据集将在 [TBA] 上提供。

Title: Do Llamas Work in English? On the Latent Language of Multilingual Transformers

Authors: Chris Wendler, Veniamin Veselovsky, Giovanni Monea, Robert West
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2402.10588
Pdf URL: https://arxiv.org/pdf/2402.10588
Copy Paste: [[2402.10588]] Do Llamas Work in English? On the Latent Language of Multilingual Transformers(https://arxiv.org/abs/2402.10588)
Keywords: language model, prompt
Abstract: We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language models function and the origins of linguistic bias. Focusing on the Llama-2 family of transformer models, our study uses carefully constructed non-English prompts with a unique correct single-token continuation. From layer to layer, transformers gradually map an input embedding of the final prompt token to an output embedding from which next-token probabilities are computed. Tracking intermediate embeddings through their high-dimensional space reveals three distinct phases, whereby intermediate embeddings (1) start far away from output token embeddings; (2) already allow for decoding a semantically correct next token in the middle layers, but give higher probability to its version in English than in the input language; (3) finally move into an input-language-specific region of the embedding space. We cast these results into a conceptual model where the three phases operate in "input space", "concept space", and "output space", respectively. Crucially, our evidence suggests that the abstract "concept space" lies closer to English than to other languages, which may have important consequences regarding the biases held by multilingual language models.
摘要：我们询问在不平衡的、以英语为主的语料库上训练的多语言语言模型是否使用英语作为内部枢纽语言——这个问题对于理解语言模型如何发挥作用以及语言偏见的起源至关重要。我们的研究重点关注 Llama-2 系列变压器模型，使用精心构建的非英语提示和独特的正确单标记延续。从一层到另一层，变压器逐渐将最终提示标记的输入嵌入映射到计算下一个标记概率的输出嵌入。通过高维空间跟踪中间嵌入揭示了三个不同的阶段，其中中间嵌入（1）从远离输出令牌嵌入的地方开始； (2) 已经允许在中间层中解码语义上正确的下一个标记，但给予其英语版本比输入语言版本更高的概率； (3) 最后进入嵌入空间的输入语言特定区域。我们将这些结果转化为概念模型，其中三个阶段分别在“输入空间”、“概念空间”和“输出空间”中运行。至关重要的是，我们的证据表明，抽象的“概念空间”比其他语言更接近英语，这可能会对多语言语言模型的偏见产生重要影响。

Title: Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks

Authors: Niall Taylor, Upamanyu Ghose, Omid Rohanian, Mohammadmahdi Nouriborji, Andrey Kormilitzin, David Clifton, Alejo Nevado-Holgado
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10597
Pdf URL: https://arxiv.org/pdf/2402.10597
Copy Paste: [[2402.10597]] Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks(https://arxiv.org/abs/2402.10597)
Keywords: language model, llm
Abstract: The entry of large language models (LLMs) into research and commercial spaces has led to a trend of ever-larger models, with initial promises of generalisability, followed by a widespread desire to downsize and create specialised models without the need for complete fine-tuning, using Parameter Efficient Fine-tuning (PEFT) methods. We present an investigation into the suitability of different PEFT methods to clinical decision-making tasks, across a range of model sizes, including extremely small models with as few as 25 million parameters. Our analysis shows that the performance of most PEFT approaches varies significantly from one task to another, with the exception of LoRA, which maintains relatively high performance across all model sizes and tasks, typically approaching or matching full fine-tuned performance. The effectiveness of PEFT methods in the clinical domain is evident, particularly for specialised models which can operate on low-cost, in-house computing infrastructure. The advantages of these models, in terms of speed and reduced training costs, dramatically outweigh any performance gain from large foundation LLMs. Furthermore, we highlight how domain-specific pre-training interacts with PEFT methods and model size, and discuss how these factors interplay to provide the best efficiency-performance trade-off. Full code available at: tbd.
摘要：大型语言模型（LLM）进入研究和商业领域导致了模型越来越大的趋势，最初承诺具有普遍性，随后人们普遍希望缩小规模并创建专门的模型，而无需完全微调，使用参数高效微调（PEFT）方法。我们对不同 PEFT 方法对临床决策任务的适用性进行了调查，涵盖了一系列模型大小，包括参数少至 2500 万的极小模型。我们的分析表明，大多数 PEFT 方法的性能因任务不同而差异显著，但 LoRA 除外，它在所有模型大小和任务中都保持相对较高的性能，通常接近或匹配完全微调的性能。PEFT 方法在临床领域的有效性是显而易见的，特别是对于可以在低成本内部计算基础设施上运行的专用模型。这些模型在速度和降低训练成本方面的优势远远超过大型基础 LLM 带来的任何性能提升。此外，我们重点介绍了特定领域的预训练如何与 PEFT 方法和模型大小相互作用，并讨论这些因素如何相互作用以提供最佳的效率与性能权衡。完整代码可在：待定。

Title: Jailbreaking Proprietary Large Language Models using Word Substitution Cipher

Authors: Divij Handa, Advait Chirmule, Bimal Gajera, Chitta Baral
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10601
Pdf URL: https://arxiv.org/pdf/2402.10601
Copy Paste: [[2402.10601]] Jailbreaking Proprietary Large Language Models using Word Substitution Cipher(https://arxiv.org/abs/2402.10601)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large Language Models (LLMs) are aligned to moral and ethical guidelines but remain susceptible to creative prompts called Jailbreak that can bypass the alignment process. However, most jailbreaking prompts contain harmful questions in the natural language (mainly English), which can be detected by the LLM themselves. In this paper, we present jailbreaking prompts encoded using cryptographic techniques. We first present a pilot study on the state-of-the-art LLM, GPT-4, in decoding several safe sentences that have been encrypted using various cryptographic techniques and find that a straightforward word substitution cipher can be decoded most effectively. Motivated by this result, we use this encoding technique for writing jailbreaking prompts. We present a mapping of unsafe words with safe words and ask the unsafe question using these mapped words. Experimental results show an attack success rate (up to 59.42%) of our proposed jailbreaking approach on state-of-the-art proprietary models including ChatGPT, GPT-4, and Gemini-Pro. Additionally, we discuss the over-defensiveness of these models. We believe that our work will encourage further research in making these LLMs more robust while maintaining their decoding capabilities.
摘要：大型语言模型 (LLM) 符合道德和伦理准则，但仍然容易受到称为“越狱”的创造性提示的影响，这种提示可以绕过对齐过程。然而，大多数越狱提示都包含自然语言（主要是英语）的有害问题，这些问题可以被LLM自己检测到。在本文中，我们提出使用加密技术编码的越狱提示。我们首先对最先进的 LLM GPT-4 进行了一项试点研究，对使用各种加密技术加密的几个安全句子进行解码，并发现直接的单词替换密码可以最有效地解码。受此结果的启发，我们使用这种编码技术来编写越狱提示。我们提出不安全词与安全词的映射，并使用这些映射的词提出不安全问题。实验结果表明，我们提出的越狱方法对最先进的专有模型（包括 ChatGPT、GPT-4 和 Gemini-Pro）的攻击成功率（高达 59.42%）。此外，我们还讨论了这些模型的过度防御。我们相信，我们的工作将鼓励进一步的研究，使这些法学硕士更加强大，同时保持其解码能力。

Title: Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models

Authors: Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10612
Pdf URL: https://arxiv.org/pdf/2402.10612
Copy Paste: [[2402.10612]] Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models(https://arxiv.org/abs/2402.10612)
Keywords: language model, llm, hallucination
Abstract: Hallucinations pose a significant challenge for the practical implementation of large language models (LLMs). The utilization of parametric knowledge in generating factual content is constrained by the limited knowledge of LLMs, potentially resulting in internal hallucinations. While incorporating external information can help fill knowledge gaps, it also introduces the risk of irrelevant information, thereby increasing the likelihood of external hallucinations. A careful and balanced integration of the parametric knowledge within LLMs with external information is crucial to alleviate hallucinations. In this study, we present Rowen, a novel approach that enhances LLMs with a selective retrieval augmentation process tailored to address hallucinated outputs. This process is governed by a multilingual semantic-aware detection module, which evaluates the consistency of the perturbed responses across various languages for the same queries. Upon detecting inconsistencies indicative of hallucinations, Rowen activates the retrieval of external information to rectify the model outputs. Rowen adeptly harmonizes the intrinsic parameters in LLMs with external knowledge sources, effectively mitigating hallucinations by ensuring a balanced integration of internal reasoning and external evidence. Through a comprehensive empirical analysis, we demonstrate that Rowen surpasses the current state-of-the-art in both detecting and mitigating hallucinated content within the outputs of LLMs.
摘要：幻觉对大语言模型（LLM）的实际实施提出了重大挑战。参数化知识在生成事实内容时的利用受到法学硕士知识有限的限制，可能会导致内部幻觉。虽然整合外部信息有助于填补知识空白，但它也带来了不相关信息的风险，从而增加了外部幻觉的可能性。法学硕士内的参数知识与外部信息的仔细和平衡整合对于减轻幻觉至关重要。在这项研究中，我们提出了 Rowen，这是一种新方法，通过专门针对幻觉输出而定制的选择性检索增强过程来增强法学硕士。此过程由多语言语义感知检测模块控制，该模块评估同一查询的不同语言的扰动响应的一致性。在检测到表明幻觉的不一致情况后，Rowen 会激活外部信息的检索来纠正模型输出。 Rowen 巧妙地将法学硕士的内在参数与外部知识源相协调，通过确保内部推理和外部证据的平衡整合，有效地减轻幻觉。通过全面的实证分析，我们证明 Rowen 在检测和减轻法学硕士输出中的幻觉内容方面都超越了当前的最先进水平。

Title: Can LLMs Speak For Diverse People? Tuning LLMs via Debate to Generate Controllable Controversial Statements

Authors: Ming Li, Jiuhai Chen, Lichang Chen, Tianyi Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10614
Pdf URL: https://arxiv.org/pdf/2402.10614
Copy Paste: [[2402.10614]] Can LLMs Speak For Diverse People? Tuning LLMs via Debate to Generate Controllable Controversial Statements(https://arxiv.org/abs/2402.10614)
Keywords: gpt, llm, prompt
Abstract: Making LLMs speak for different, especially minority groups of people, and generate statements supporting their diverse or even controversial perspectives is critical to creating an inclusive environment. However, existing LLMs lack sufficient controllability to the stance of their generated content, which often contains inconsistent, neutral, or biased statements. In this paper, we improve the controllability of LLMs in generating statements supporting an argument the user defined in the prompt. We find that multi-round debates between two LLMs with opposite stances generate higher-quality and more salient statements for each, which are important training data to improve the controllability of LLMs. Motivated by this, we develop a novel debate & tuning ("DEBATunE") pipeline finetuning LLMs to generate the statements obtained via debate. To examine DEBATunE, we curate the largest dataset of debate topics so far, which covers 710 controversial topics and corresponding arguments for each topic. Evaluations by the GPT-4 judge with a novel controversy controllability metric show that LLMs' capability of expressing diverse perspectives is significantly improved by DEBATunE. Moreover, such controllability can be generalized to unseen topics, generating high-quality statements supporting controversial arguments. Our codes, models, and data will be released at https://github.com/tianyi-lab/DEBATunE.
摘要：让法学硕士代表不同的群体，尤其是少数群体，并发表支持他们不同甚至有争议的观点的声明，对于创造一个包容性的环境至关重要。然而，现有的法学硕士对其生成内容的立场缺乏足够的可控性，这些内容往往包含不一致、中立或有偏见的陈述。在本文中，我们提高了法学硕士在生成支持用户在提示中定义的参数的语句时的可控性。我们发现，立场相反的两位法学硕士之间的多轮辩论会产生更高质量、更显着的陈述，这是提高法学硕士可控性的重要训练数据。受此启发，我们开发了一种新颖的辩论和调整（“DEBATunE”）管道微调法学硕士，以生成通过辩论获得的陈述。为了检查 DEBATunE，我们整理了迄今为止最大的辩论主题数据集，其中涵盖 710 个有争议的主题以及每个主题的相应论点。 GPT-4 法官使用新颖的争议可控性指标进行的评估表明，DEBATunE 显着提高了法学硕士表达不同观点的能力。此外，这种可控性可以推广到看不见的主题，生成支持有争议论点的高质量陈述。我们的代码、模型和数据将在https://github.com/tianyi-lab/DEBATunE发布。

Title: Enhancing Role-playing Systems through Aggressive Queries: Evaluation and Improvement

Authors: Yihong Tang, Jiao Ou, Che Liu, Fuzheng Zhang, Di Zhang, Kun Gai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10618
Pdf URL: https://arxiv.org/pdf/2402.10618
Copy Paste: [[2402.10618]] Enhancing Role-playing Systems through Aggressive Queries: Evaluation and Improvement(https://arxiv.org/abs/2402.10618)
Keywords: language model, llm
Abstract: The advent of Large Language Models (LLMs) has propelled dialogue generation into new realms, particularly in the field of role-playing systems (RPSs). While enhanced with ordinary role-relevant training dialogues, existing LLM-based RPSs still struggle to align with roles when handling intricate and trapped queries in boundary scenarios. In this paper, we design the Modular ORchestrated Trap-setting Interaction SystEm (MORTISE) to benchmark and improve the role-playing LLMs' performance. MORTISE can produce highly role-relevant aggressive queries through the collaborative effort of multiple LLM-based modules, and formulate corresponding responses to create an adversarial training dataset via a consistent response generator. We select 190 Chinese and English roles to construct aggressive queries to benchmark existing role-playing LLMs. Through comprehensive evaluation, we find that existing models exhibit a general deficiency in role alignment capabilities. We further select 180 of the roles to collect an adversarial training dataset (named RoleAD) and retain the other 10 roles for testing. Experiments on models improved by RoleAD indicate that our adversarial dataset ameliorates this deficiency, with the improvements demonstrating a degree of generalizability in ordinary scenarios.
摘要：大型语言模型（LLM）的出现将对话生成推向了新的领域，特别是在角色扮演系统（RPS）领域。虽然通过普通的角色相关培训对话进行了增强，但现有的基于 LLM 的 RPS 在处理边界场景中复杂和陷入困境的查询时仍然难以与角色保持一致。在本文中，我们设计了模块化ORchesterated陷阱设置交互系统（MORTISE）来衡量和提高角色扮演法学硕士的表现。 MORTISE 可以通过多个基于 LLM 的模块的协作来生成与角色高度相关的攻击性查询，并制定相应的响应，以通过一致的响应生成器创建对抗性训练数据集。我们选择了 190 个中文和英文角色来构建积极的查询，以对现有角色扮演法学硕士进行基准测试。通过综合评价，我们发现现有模型在角色匹配能力上普遍存在缺陷。我们进一步选择其中 180 个角色来收集对抗训练数据集（名为 RoleAD），并保留其他 10 个角色进行测试。对 RoleAD 改进的模型进行的实验表明，我们的对抗性数据集改善了这一缺陷，这些改进证明了在普通场景中的一定程度的普适性。

Title: BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation

Authors: Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10631
Pdf URL: https://arxiv.org/pdf/2402.10631
Copy Paste: [[2402.10631]] BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation(https://arxiv.org/abs/2402.10631)
Keywords: language model, llm
Abstract: The upscaling of Large Language Models (LLMs) has yielded impressive advances in natural language processing, yet it also poses significant deployment challenges. Weight quantization has emerged as a widely embraced solution to reduce memory and computational demands. This paper introduces BitDistiller, a framework that synergizes Quantization-Aware Training (QAT) with Knowledge Distillation (KD) to boost the performance of LLMs at ultra-low precisions (sub-4-bit). Specifically, BitDistiller first incorporates a tailored asymmetric quantization and clipping technique to maximally preserve the fidelity of quantized weights, and then proposes a novel Confidence-Aware Kullback-Leibler Divergence (CAKLD) objective, which is employed in a self-distillation manner to enable faster convergence and superior model performance. Empirical evaluations demonstrate that BitDistiller significantly surpasses existing methods in both 3-bit and 2-bit configurations on general language understanding and complex reasoning benchmarks. Notably, BitDistiller is shown to be more cost-effective, demanding fewer data and training resources. The code is available at https://github.com/DD-DuDa/BitDistiller.
摘要：大型语言模型 (LLM) 的升级在自然语言处理方面取得了令人瞩目的进步，但也带来了重大的部署挑战。权重量化已成为一种广泛接受的解决方案，可减少内存和计算需求。本文介绍了 BitDistiller，这是一个将量化感知训练 (QAT) 与知识蒸馏 (KD) 相结合的框架，可提高 LLM 在超低精度（低于 4 位）下的性能。具体来说，BitDistiller 首先采用定制的非对称量化和裁剪技术，以最大程度地保留量化权重的保真度，然后提出一种新颖的置信感知 Kullback-Leibler 发散 (CAKLD) 目标，该目标以自蒸馏方式采用，以实现更快的速度收敛和卓越的模型性能。实证评估表明，BitDistiller 在通用语言理解和复杂推理基准方面显着超越了 3 位和 2 位配置中的现有方法。值得注意的是，BitDistiller 被证明更具成本效益，需要更少的数据和培训资源。该代码可在 https://github.com/DD-DuDa/BitDistiller 获取。

Title: Generalizability of Mixture of Domain-Specific Adapters from the Lens of Signed Weight Directions and its Application to Effective Model Pruning

Authors: Tuc Nguyen, Thai Le
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10639
Pdf URL: https://arxiv.org/pdf/2402.10639
Copy Paste: [[2402.10639]] Generalizability of Mixture of Domain-Specific Adapters from the Lens of Signed Weight Directions and its Application to Effective Model Pruning(https://arxiv.org/abs/2402.10639)
Keywords: language model
Abstract: Several parameter-efficient fine-tuning methods based on adapters have been proposed as a streamlined approach to incorporate not only a single specialized knowledge into existing Pre-Trained Language Models (PLMs) but also multiple of them at once. Recent works such as AdapterSoup propose to mix not all but only a selective sub-set of domain-specific adapters during inference via model weight averaging to optimize performance on novel, unseen domains with excellent computational efficiency. However, the essential generalizability of this emerging weight-space adapter mixing mechanism on unseen, in-domain examples remains unexplored. Thus, in this study, we conduct a comprehensive analysis to elucidate the generalizability of domain-specific adapter mixtures in in-domain evaluation. We also provide investigations into the inner workings of the mixture of domain-specific adapters by analyzing their weight signs, yielding critical analysis on the negative correlation between their fraction of weight sign difference and their mixtures' generalizability. All source code will be published.
摘要：人们提出了几种基于适配器的参数高效微调方法作为一种简化的方法，不仅可以将单个专业知识纳入现有的预训练语言模型（PLM）中，而且可以同时将多个专业知识纳入其中。最近的工作（例如 AdapterSoup）建议在推理过程中通过模型权重平均，只混合特定领域适配器中经过选择的一个子集而非全部，从而以出色的计算效率优化在新颖的、未见过的领域上的性能。然而，这种新兴的权重空间适配器混合机制在未见过的领域内示例上的基本泛化能力仍未得到探索。因此，在本研究中，我们进行了全面的分析，以阐明特定领域适配器混合在领域内评估中的泛化能力。我们还通过分析特定领域适配器的权重符号来研究其混合的内部工作原理，并对权重符号差异比例与混合后泛化能力之间的负相关性进行了关键分析。所有源代码都将发布。
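
The "fraction of weight sign difference" mentioned in the abstract above can be illustrated with a small sketch. It assumes the quantity is the share of corresponding weights in two adapters whose signs disagree. That reading, the flat-list representation of weights, and the function names are assumptions made for illustration, not the authors' exact definition.

```python
def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_difference_fraction(weights_a, weights_b) -> float:
    """Fraction of positions where two weight vectors differ in sign.

    Both inputs are flat sequences of floats of equal, non-zero length
    (hypothetical flattened adapter weights).
    """
    if len(weights_a) != len(weights_b) or len(weights_a) == 0:
        raise ValueError("weight vectors must be non-empty and equal in length")
    differing = sum(1 for a, b in zip(weights_a, weights_b) if sign(a) != sign(b))
    return differing / len(weights_a)


# Two toy adapters: signs disagree at the second and third positions.
adapter_a = [0.5, -0.2, 0.1, -0.3]
adapter_b = [0.4, 0.2, -0.1, -0.6]
fraction = sign_difference_fraction(adapter_a, adapter_b)
print(fraction)
```

Under the negative correlation reported in the abstract, a lower fraction would be expected to go with a mixture that generalizes better.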

Title: Linear Transformers with Learnable Kernel Functions are Better In-Context Models

Authors: Yaroslav Aksenov, Nikita Balagansky, Sofia Maria Lo Cicero Vaina, Boris Shaposhnikov, Alexey Gorbatovski, Daniil Gavrilov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.10644
Pdf URL: https://arxiv.org/pdf/2402.10644
Copy Paste: [[2402.10644]] Linear Transformers with Learnable Kernel Functions are Better In-Context Models(https://arxiv.org/abs/2402.10644)
Keywords: language model
Abstract: Advancing the frontier of subquadratic architectures for Language Models (LMs) is crucial in the rapidly evolving field of natural language processing. Current innovations, including State Space Models, were initially celebrated for surpassing Transformer performance on language modeling tasks. However, these models have revealed deficiencies in essential In-Context Learning capabilities - a domain where the Transformer traditionally shines. The Based model emerged as a hybrid solution, blending a Linear Transformer with a kernel inspired by the Taylor expansion of exponential functions, augmented by convolutional networks. Mirroring the Transformer's in-context adeptness, it became a strong contender in the field. In our work, we present a singular, elegant alteration to the Based kernel that amplifies its In-Context Learning abilities evaluated with the Multi-Query Associative Recall task and overall language modeling process, as demonstrated on the Pile dataset.
摘要：在快速发展的自然语言处理领域，推进语言模型 (LM) 次二次架构的前沿发展至关重要。当前的创新，包括状态空间模型，最初因在语言建模任务上超越 Transformer 的性能而受到赞扬。然而，这些模型暴露了基本的情境学习能力的缺陷——这是 Transformer 传统上表现出色的领域。基于模型作为一种混合解决方案出现，将线性变换器与受指数函数泰勒展开式启发的内核混合在一起，并通过卷积网络进行增强。反映了 Transformer 在环境中的熟练程度，它成为了该领域的有力竞争者。在我们的工作中，我们对基于内核进行了独特而优雅的修改，增强了通过多查询关联回忆任务和整体语言建模过程评估的上下文学习能力，如 Pile 数据集所示。

Title: Can Separators Improve Chain-of-Thought Prompting?

Authors: Yoonjeong Park, Hyunjin Kim, Chanyeol Choi, Junseong Kim, Jy-yong Sohn
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10645
Pdf URL: https://arxiv.org/pdf/2402.10645
Copy Paste: [[2402.10645]] Can Separators Improve Chain-of-Thought Prompting?(https://arxiv.org/abs/2402.10645)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Chain-of-thought (CoT) prompting is a simple and effective method for improving the reasoning capabilities of Large language models (LLMs). The basic idea of CoT is to let LLMs break down their thought processes step-by-step by putting exemplars in the input prompt. However, the densely structured prompt exemplars of CoT may cause the cognitive overload of LLMs. Inspired by human cognition, we introduce CoT-Sep, a novel method that strategically employs separators at the end of each exemplar in CoT prompting. These separators are designed to help the LLMs understand their thought processes better while reasoning. It turns out that CoT-Sep significantly improves the LLMs' performances on complex reasoning tasks (e.g., GSM-8K, AQuA, CSQA), compared with the vanilla CoT, which does not use separators. We also study the effects of the type and the location of separators tested on multiple LLMs, including GPT-3.5-Turbo, GPT-4, and LLaMA-2 7B. Interestingly, the type/location of separators should be chosen appropriately to boost the reasoning capability of CoT.
摘要：思想链（CoT）提示是一种简单有效的提高大型语言模型（LLM）推理能力的方法。 CoT的基本思想是让法学硕士通过在输入提示中放置范例来逐步分解他们的思维过程。然而，CoT结构密集的提示范例可能会导致法学硕士的认知超载。受人类认知的启发，我们引入了 CoT-Sep，这是一种新颖的方法，在 CoT 提示中策略性地在每个示例的末尾使用分隔符。这些分隔符旨在帮助法学硕士在推理时更好地理解他们的思维过程。事实证明，与不使用分隔符的普通 CoT 相比，CoT-Sep 显着提高了 LLM 在复杂推理任务（例如 GSM-8K、AQuA、CSQA）上的性能。我们还研究了在多个 LLM（包括 GPT-3.5-Turbo、GPT-4 和 LLaMA-2 7B）上测试的分离器类型和位置的影响。有趣的是，应适当选择分隔符的类型/位置以提高 CoT 的推理能力。
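
The prompt construction described in the CoT-Sep abstract above can be sketched as follows. The sketch places a separator at the end of each exemplar and then appends the new question. The `###` separator, the `Q:`/`A:` layout, and the function name are illustrative assumptions; the abstract notes that the type and location of separators need to be chosen appropriately.

```python
def build_cot_sep_prompt(exemplars, question, separator="\n###\n"):
    """Build a chain-of-thought prompt with a separator after each exemplar.

    exemplars: list of (question, worked_answer) pairs.
    separator: hypothetical default; other separator types can be passed in.
    """
    blocks = [f"Q: {q}\nA: {a}{separator}" for q, a in exemplars]
    return "".join(blocks) + f"Q: {question}\nA:"


# Two toy exemplars followed by the query to be answered.
demo_exemplars = [
    ("What is 2 + 3?", "2 plus 3 equals 5. The answer is 5."),
    ("What is 4 * 2?", "4 times 2 equals 8. The answer is 8."),
]
prompt = build_cot_sep_prompt(demo_exemplars, "What is 7 - 4?")
print(prompt)
```

Passing `separator=""` gives the vanilla CoT layout without separators, which makes the two settings easy to compare.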

Title: AbsInstruct: Eliciting Abstraction Ability from LLMs through Explanation Tuning with Plausibility Estimation

Authors: Zhaowei Wang, Wei Fan, Qing Zong, Hongming Zhang, Sehyun Choi, Tianqing Fang, Xin Liu, Yangqiu Song, Ginny Y. Wong, Simon See
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10646
Pdf URL: https://arxiv.org/pdf/2402.10646
Copy Paste: [[2402.10646]] AbsInstruct: Eliciting Abstraction Ability from LLMs through Explanation Tuning with Plausibility Estimation(https://arxiv.org/abs/2402.10646)
Keywords: llm
Abstract: Abstraction ability is crucial in human intelligence, which can also benefit various tasks in NLP study. Existing work shows that LLMs are deficient in abstract ability, and how to improve it remains unexplored. In this work, we design the framework AbsInstruct to enhance LLMs' abstraction ability through instruction tuning. The framework builds instructions with in-depth explanations to assist LLMs in capturing the underlying rationale of abstraction. Meanwhile, we introduce a plausibility estimator to select instructions that are more consistent with the abstraction knowledge of LLMs to be aligned. Then, our framework combines abstraction instructions with general-purpose ones to build a hybrid dataset. Extensive experiments and analyses demonstrate that our framework can considerably enhance LLMs' abstraction ability with strong generalization performance while maintaining their general instruction-following abilities.
摘要：抽象能力对于人类智能至关重要，这也有利于 NLP 研究中的各种任务。现有的工作表明法学硕士缺乏抽象能力，而如何提高抽象能力仍有待探索。在这项工作中，我们设计了AbsInstruct框架，通过指令调整来增强LLM的抽象能力。该框架构建了具有深入解释的说明，以帮助法学硕士捕捉抽象的基本原理。同时，我们引入了一个合理性估计器来选择与要对齐的法学硕士的抽象知识更一致的指令。然后，我们的框架将抽象指令与通用指令相结合来构建混合数据集。大量的实验和分析表明，我们的框架可以显着增强法学硕士的抽象能力，具有很强的泛化性能，同时保持其一般指令跟踪能力。

Title: Enhancing Numerical Reasoning with the Guidance of Reliable Reasoning Processes

Authors: Dingzirui Wang, Longxu Dou, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10654
Pdf URL: https://arxiv.org/pdf/2402.10654
Copy Paste: [[2402.10654]] Enhancing Numerical Reasoning with the Guidance of Reliable Reasoning Processes(https://arxiv.org/abs/2402.10654)
Keywords: language model, llm
Abstract: Numerical reasoning is an essential ability for NLP systems to handle numeric information. Recent research indicates that fine-tuning a small-scale model to learn to generate reasoning processes alongside answers can significantly enhance performance. However, most current methods generate reasoning processes with large language models (LLMs), which are "unreliable" since such processes could contain information unrelated to the answer. To address this limitation, we introduce Enhancing NumeriCal reasOning with Reliable procEsses (Encore), which derives the reliable reasoning process by decomposing the answer formula, ensuring that the process fully supports the answer. Nevertheless, models could lack enough data to learn the reasoning process generation adequately, since our method generates only a single reasoning process for each formula. To overcome this difficulty, we present a series of pre-training tasks to help models learn the reasoning process generation with synthesized data. The experiments show that Encore yields improvement on all five experimental datasets with an average of 1.8%, proving the effectiveness of our method.
摘要：数值推理是NLP系统处理数值信息的必备能力。最近的研究表明，微调小规模模型来学习生成推理过程和答案可以显着提高性能。然而，当前方法的局限性在于，大多数方法都会使用大型语言模型（LLM）生成推理过程，这是“不可靠的”，因为此类过程可能包含与答案无关的信息。为了解决这个限制，我们引入了Enhancing NumeriCal reasOning with Reliable procEsses (Encore)，它通过分解答案公式来导出可靠的推理过程，确保其完全支持答案。然而，模型可能缺乏足够的数据来充分学习推理过程的生成，因为我们的方法仅为一个公式生成一个推理过程。为了克服这个困难，我们提出了一系列预训练任务来帮助模型学习使用合成数据生成推理过程。实验表明，Encore 在所有五个实验数据集上的平均提高了 1.8%，证明了我们方法的有效性。

Title: Improving Demonstration Diversity by Human-Free Fusing for Text-to-SQL

Authors: Dingzirui Wang, Longxu Dou, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10663
Pdf URL: https://arxiv.org/pdf/2402.10663
Copy Paste: [[2402.10663]] Improving Demonstration Diversity by Human-Free Fusing for Text-to-SQL(https://arxiv.org/abs/2402.10663)
Keywords: language model, llm
Abstract: Currently, the in-context learning method based on large language models (LLMs) has become the mainstream of text-to-SQL research. Previous works have discussed how to select demonstrations related to the user question from a human-labeled demonstration pool. However, human labeling suffers from the limitations of insufficient diversity and high labeling overhead. Therefore, in this paper, we discuss how to measure and improve the diversity of the demonstrations for text-to-SQL. We present a metric to measure the diversity of the demonstrations and analyze the insufficiency of the existing labeled data by experiments. Based on the above discovery, we propose fusing iteratively for demonstrations (Fused) to build a high-diversity demonstration pool through human-free multiple-iteration synthesis, improving diversity and lowering label cost. Our method achieves an average improvement of 3.2% and 5.0% with and without human labeling on several mainstream datasets, which proves the effectiveness of Fused.
摘要：目前，基于大语言模型（LLM）的上下文学习方法已成为文本到SQL研究的主流。之前的工作已经讨论了如何从人工标记的演示池中选择与用户问题相关的演示。然而，人工标记存在多样性不足和标记开销高的局限性。因此，在本文中，我们讨论如何衡量和提高文本转 SQL 演示的多样性。我们提出了一个指标来衡量演示的多样性，并通过实验分析现有标记数据的不足。基于上述发现，我们提出对演示进行迭代融合（Fused），通过无人的多次迭代合成来构建高多样性的演示池，提高多样性并降低标签成本。我们的方法在几个主流数据集上在有人工标记和无人工标记的情况下平均提高了 3.2% 和 5.0%，这证明了 Fused 的有效性。

Title: Humans or LLMs as the Judge? A Study on Judgement Biases

Authors: Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10669
Pdf URL: https://arxiv.org/pdf/2402.10669
Copy Paste: [[2402.10669]] Humans or LLMs as the Judge? A Study on Judgement Biases(https://arxiv.org/abs/2402.10669)
Keywords: language model, llm
Abstract: Adopting human and large language models (LLM) as judges (a.k.a. human- and LLM-as-a-judge) for evaluating the performance of existing LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human and LLM judges, questioning the reliability of the evaluation results. In this paper, we propose a novel framework for investigating 5 types of biases for LLM and human judges. We curate a dataset with 142 samples referring to the revised Bloom's Taxonomy and conduct thousands of human and LLM evaluations. Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even the most cutting-edge judges possess considerable biases. We further exploit their weakness and conduct attacks on LLM judges. We hope that our work can notify the community of the vulnerability of human- and LLM-as-a-judge against perturbations, as well as the urgency of developing robust evaluation systems.
摘要：采用人类和大型语言模型（LLM）作为评判者（即 human-as-a-judge 和 LLM-as-a-judge）来评估现有 LLM 的表现最近引起了人们的关注。尽管如此，这种方法同时引入了来自人类和 LLM 评判者的潜在偏见，使评估结果的可靠性受到质疑。在本文中，我们提出了一个新颖的框架来调查 LLM 和人类评判者的 5 种偏见。我们参考修订后的布鲁姆分类法整理了一个包含 142 个样本的数据集，并进行了数千次人类和 LLM 评估。结果表明，人类和 LLM 评判者在不同程度上容易受到扰动的影响，即使是最前沿的评判者也存在相当大的偏见。我们进一步利用它们的弱点，对 LLM 评判者进行攻击。我们希望我们的工作能够让社区了解人类和 LLM 作为评判者面对扰动的脆弱性，以及开发稳健的评估系统的紧迫性。

Title: OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models

Authors: Yuxuan Kuang, Hai Lin, Meng Jiang
Subjects: cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2402.10670
Pdf URL: https://arxiv.org/pdf/2402.10670
Copy Paste: [[2402.10670]] OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models(https://arxiv.org/abs/2402.10670)
Keywords: language model, llm, agent
Abstract: Object navigation (ObjectNav) requires an agent to navigate through unseen environments to find queried objects. Many previous methods attempted to solve this task by relying on supervised or reinforcement learning, where they are trained on limited household datasets with close-set objects. However, two key challenges are unsolved: understanding free-form natural language instructions that demand open-set objects, and generalizing to new environments in a zero-shot manner. Aiming to solve the two challenges, in this paper, we propose OpenFMNav, an Open-set Foundation Model based framework for zero-shot object Navigation. We first unleash the reasoning abilities of large language models (LLMs) to extract proposed objects from natural language instructions that meet the user's demand. We then leverage the generalizability of large vision language models (VLMs) to actively discover and detect candidate objects from the scene, building a Versatile Semantic Score Map (VSSM). Then, by conducting common sense reasoning on VSSM, our method can perform effective language-guided exploration and exploitation of the scene and finally reach the goal. By leveraging the reasoning and generalizing abilities of foundation models, our method can understand free-form human instructions and perform effective open-set zero-shot navigation in diverse environments. Extensive experiments on the HM3D ObjectNav benchmark show that our method surpasses all the strong baselines on all metrics, proving our method's effectiveness. Furthermore, we perform real robot demonstrations to validate our method's open-set-ness and generalizability to real-world environments.
摘要：对象导航（ObjectNav）需要代理在看不见的环境中导航以查找查询的对象。以前的许多方法试图通过依赖监督学习或强化学习来解决此任务，其中它们在具有近距离对象的有限家庭数据集上进行训练。然而，有两个关键挑战尚未解决：理解需要开放集对象的自由形式自然语言指令，以及以零样本的方式推广到新环境。为了解决这两个挑战，在本文中，我们提出了 OpenFMNav，一种基于开放集基础模型的零样本对象导航框架。我们首先释放大型语言模型（LLM）的推理能力，从满足用户需求的自然语言指令中提取建议的对象。然后，我们利用大型视觉语言模型（VLM）的通用性来主动发现和检测场景中的候选对象，构建通用语义得分图（VSSM）。然后，通过对VSSM进行常识推理，我们的方法可以对场景进行有效的语言引导探索和开发，最终达到目标。通过利用基础模型的推理和泛化能力，我们的方法可以理解自由形式的人类指令，并在不同的环境中执行有效的开放集零样本导航。对 HM3D ObjectNav 基准的大量实验表明，我们的方法在所有指标上都超越了所有强基线，证明了我们方法的有效性。此外，我们进行了真实的机器人演示，以验证我们的方法的开放性和对现实环境的通用性。

Title: Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm

Authors: Yuanzhen Xie, Xinzhou Jin, Tao Xie, MingXiong Lin, Liang Chen, Chenyun Yu, Lei Cheng, ChengXiang Zhuo, Bo Hu, Zang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10671
Pdf URL: https://arxiv.org/pdf/2402.10671
Copy Paste: [[2402.10671]] Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm(https://arxiv.org/abs/2402.10671)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: In-context learning of large-language models (LLMs) has achieved remarkable success in the field of natural language processing, while extensive case studies reveal that the single-step chain-of-thought prompting approach faces challenges such as attention diffusion and inadequate performance in complex tasks like text-to-SQL. To improve the contextual learning capabilities of LLMs in text-to-SQL, a workflow paradigm method is proposed, aiming to enhance the attention and problem-solving scope of LLMs through decomposition. Specifically, the information determination module for eliminating redundant information and the brand-new prompt structure based on problem classification greatly enhance the model's attention. Additionally, the inclusion of self-correcting and active learning modules greatly expands the problem-solving scope of LLMs, hence improving the upper limit of LLM-based approaches. Extensive experiments conducted on three datasets demonstrate that our approach outperforms other methods by a significant margin. About 2-3 percentage point improvements compared to the existing baseline on the Spider Dev and Spider-Realistic datasets and new SOTA results on the Spider Test dataset are achieved. Our code is available on GitHub: https://github.com/FlyingFeather/DEA-SQL.
摘要：大语言模型（LLM）的上下文学习在自然语言处理领域取得了显著的成功，而大量案例研究表明，单步思维链提示方法在文本到 SQL 等复杂任务中面临注意力分散和性能不足等挑战。为了提高 LLM 在文本到 SQL 中的上下文学习能力，提出了一种工作流范式方法，旨在通过分解来增强 LLM 的注意力和解决问题的范围。具体来说，消除冗余信息的信息判断模块和基于问题分类的全新提示结构大大增强了模型的注意力。此外，自我纠正和主动学习模块的加入极大地扩展了 LLM 解决问题的范围，从而提高了基于 LLM 的方法的上限。对三个数据集进行的广泛实验表明，我们的方法明显优于其他方法。与 Spider Dev 和 Spider-Realistic 数据集的现有基线相比，大约提高了 2-3 个百分点，并且在 Spider Test 数据集上实现了新的 SOTA 结果。我们的代码可在 GitHub 上找到：https://github.com/FlyingFeather/DEA-SQL。

Title: German Text Simplification: Finetuning Large Language Models with Semi-Synthetic Data

Authors: Lars Klöser, Mika Beele, Jan-Niklas Schagen, Bodo Kraft
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10675
Pdf URL: https://arxiv.org/pdf/2402.10675
Copy Paste: [[2402.10675]] German Text Simplification: Finetuning Large Language Models with Semi-Synthetic Data(https://arxiv.org/abs/2402.10675)
Keywords: language model, gpt
Abstract: This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts. We demonstrate the effectiveness of our approach with real-world online texts. Addressing the challenge of data scarcity in language simplification, we crawled professionally simplified German texts and synthesized a corpus using GPT-4. We finetune Large Language Models with up to 13 billion parameters on this data and evaluate their performance. This paper employs various methodologies for evaluation and demonstrates the limitations of currently used rule-based metrics. Both automatic and manual evaluations reveal that our models can significantly simplify real-world online texts, indicating the potential of synthetic data in improving text simplification.
摘要：这项研究开创了使用综合生成的数据来训练德语文本文档级文本简化的生成模型。我们通过现实世界的在线文本证明了我们方法的有效性。为了解决语言简化中数据稀缺的挑战，我们爬取了专业简化的德语文本，并使用 GPT-4 合成了语料库。我们根据这些数据对大型语言模型进行了多达 130 亿个参数的微调，并评估其性能。本文采用了各种评估方法，并展示了当前使用的基于规则的指标的局限性。自动和手动评估都表明，我们的模型可以显着简化现实世界的在线文本，表明合成数据在改进文本简化方面的潜力。

Title: LongHeads: Multi-Head Attention is Secretly a Long Context Processor

Authors: Yi Lu, Xin Zhou, Wei He, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10685
Pdf URL: https://arxiv.org/pdf/2402.10685
Copy Paste: [[2402.10685]] LongHeads: Multi-Head Attention is Secretly a Long Context Processor(https://arxiv.org/abs/2402.10685)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) have achieved impressive performance in numerous domains but often struggle to process lengthy inputs effectively and efficiently due to limited length generalization and attention's quadratic computational demands. Many sought to mitigate this by restricting the attention window within the pre-trained length. However, these methods introduce new issues such as ignoring the middle context and requiring additional training. To address these problems, we propose LongHeads, a training-free framework that enhances LLM's long context ability by unlocking multi-head attention's untapped potential. Instead of allowing each head to attend to the full sentence, which struggles with generalizing to longer sequences due to out-of-distribution (OOD) issues, we allow each head to process in-distribution length by selecting and attending to important context chunks. To this end, we propose a chunk selection strategy that relies on the inherent correlation between the query and the key representations, efficiently distributing context chunks to different heads. In this way, each head ensures it can effectively process attended tokens within the trained length, while different heads in different layers can collectively process longer contexts. LongHeads works efficiently in linear time and fits seamlessly with many LLMs that use relative positional encoding. Our extensive empirical analyses verify LongHeads's efficacy in extending the usable context window for existing models, showcasing its promise for enhancing long text understanding.
摘要：大型语言模型 (LLM) 在许多领域取得了令人印象深刻的性能，但由于有限的长度泛化和注意力的二次计算需求，常常难以有效且高效地处理冗长的输入。许多人试图通过将注意力窗口限制在预先训练的长度内来缓解这一问题。然而，这些方法引入了新的问题，例如忽略中间上下文并需要额外的训练。为了解决这些问题，我们提出了 LongHeads，这是一个免训练的框架，通过释放多头注意力的未开发潜力来增强 LLM 的长上下文能力。我们允许每个头通过选择和关注重要的上下文块来处理分布内长度，而不是让每个头处理完整的句子（由于分布外（OOD）问题而难以推广到更长的序列）。为此，我们提出了一种块选择策略，该策略依赖于查询和键表示之间的固有相关性，有效地将上下文块分配到不同的头。通过这种方式，每个头确保它可以在训练长度内有效地处理被关注的 token，而不同层中的不同头可以共同处理更长的上下文。 LongHeads 在线性时间内高效工作，与许多使用相对位置编码的 LLM 无缝配合。我们广泛的实证分析验证了 LongHeads 在扩展现有模型的可用上下文窗口方面的功效，展示了其增强长文本理解的前景。

Title: Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability

Authors: Haiyan Zhao, Fan Yang, Himabindu Lakkaraju, Mengnan Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10688
Pdf URL: https://arxiv.org/pdf/2402.10688
Copy Paste: [[2402.10688]] Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability(https://arxiv.org/abs/2402.10688)
Keywords: language model, llm, hallucination
Abstract: As large language models (LLMs) grow more powerful, concerns around potential harms like toxicity, unfairness, and hallucination threaten user trust. Ensuring beneficial alignment of LLMs with human values through model alignment is thus critical yet challenging, requiring a deeper understanding of LLM behaviors and mechanisms. We propose opening the black box of LLMs through a framework of holistic interpretability encompassing complementary bottom-up and top-down perspectives. The bottom-up view, enabled by mechanistic interpretability, focuses on component functionalities and training dynamics. The top-down view utilizes representation engineering to analyze behaviors through hidden representations. In this paper, we review the landscape around mechanistic interpretability and representation engineering, summarizing approaches, discussing limitations and applications, and outlining future challenges in using these techniques to achieve ethical, honest, and reliable reasoning aligned with human values.
摘要：随着大型语言模型 (LLM) 变得越来越强大，对毒性、不公平和幻觉等潜在危害的担忧威胁着用户的信任。因此，通过模型对齐确保 LLM 与人类价值观的有益对齐至关重要但具有挑战性，需要更深入地了解 LLM 的行为和机制。我们建议通过包含互补的自下而上和自上而下视角的整体可解释性框架来打开 LLM 的黑匣子。由机制可解释性支持的自下而上的视图侧重于组件功能和训练动态。自上而下的视图利用表示工程通过隐藏的表示来分析行为。在本文中，我们回顾了机制可解释性和表示工程的研究现状，总结了方法，讨论了局限性和应用，并概述了使用这些技术实现符合人类价值观的道德、诚实和可靠推理的未来挑战。

Title: Multi-Cultural Commonsense Knowledge Distillation

Authors: Tuan-Phong Nguyen, Simon Razniewski, Gerhard Weikum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10689
Pdf URL: https://arxiv.org/pdf/2402.10689
Copy Paste: [[2402.10689]] Multi-Cultural Commonsense Knowledge Distillation(https://arxiv.org/abs/2402.10689)
Keywords: language model, gpt, llm, prompt
Abstract: Despite recent progress, large language models (LLMs) still face the challenge of appropriately reacting to the intricacies of social and cultural conventions. This paper presents MANGO, a methodology for distilling high-accuracy, high-recall assertions of cultural knowledge. We judiciously and iteratively prompt LLMs for this purpose from two entry points, concepts and cultures. Outputs are consolidated via clustering and generative summarization. Running the MANGO method with GPT-3.5 as underlying LLM yields 167K high-accuracy assertions for 30K concepts and 11K cultures, surpassing prior resources by a large margin. For extrinsic evaluation, we explore augmenting dialogue systems with cultural knowledge assertions. We find that adding knowledge from MANGO improves the overall quality, specificity, and cultural sensitivity of dialogue responses, as judged by human annotators. Data and code are available for download.
摘要：尽管最近取得了进展，大型语言模型（LLM）仍然面临着对复杂的社会和文化习俗做出适当反应的挑战。本文介绍了 MANGO，一种提取高准确度、高召回率的文化知识断言的方法。为此，我们从概念和文化两个切入点审慎地、迭代地提示 LLM。通过聚类和生成式摘要来整合输出。以 GPT-3.5 作为底层 LLM 运行 MANGO 方法，可以为 30K 概念和 11K 文化产生 167K 高精度断言，大大超过了先前的资源。对于外在评估，我们探索用文化知识断言增强对话系统。我们发现，根据人类注释者的判断，添加来自 MANGO 的知识可以提高对话响应的整体质量、特异性和文化敏感性。数据和代码可供下载。

Title: MultiPoT: Multilingual Program of Thoughts Harnesses Multiple Programming Languages

Authors: Xianzhen Luo, Qingfu Zhu, Zhiming Zhang, Libo Qin, Xu Wang, Qing Yang, Dongliang Xu, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10691
Pdf URL: https://arxiv.org/pdf/2402.10691
Copy Paste: [[2402.10691]] MultiPoT: Multilingual Program of Thoughts Harnesses Multiple Programming Languages(https://arxiv.org/abs/2402.10691)
Keywords: gpt, chat
Abstract: Program of Thoughts (PoT) is an approach characterized by its executable intermediate steps, which ensure the accuracy of the numerical calculations in the reasoning process. Currently, PoT primarily uses Python. However, relying solely on a single language may result in suboptimal solutions and overlook the potential benefits of other programming languages. In this paper, we conduct comprehensive experiments on the programming languages used in PoT and find that no single language consistently delivers optimal performance across all tasks and models. The effectiveness of each language varies depending on the specific scenarios. Inspired by this, we propose a task and model agnostic approach called MultiPoT, which harnesses strength and diversity from various languages. Experimental results reveal that it significantly outperforms Python Self-Consistency. Furthermore, it achieves comparable or superior performance compared to the best monolingual PoT in almost all tasks across all models. In particular, MultiPoT achieves more than 4.6% improvement on average on both Starcoder and ChatGPT (gpt-3.5-turbo).
摘要：思维程序（PoT）是一种具有可执行的中间步骤的方法，保证了推理过程中数值计算的准确性。目前，PoT 主要使用 Python。然而，仅仅依赖单一语言可能会导致解决方案欠佳，并且忽视其他编程语言的潜在优势。在本文中，我们对 PoT 中使用的编程语言进行了全面的实验，发现没有一种语言能够在所有任务和模型中始终如一地提供最佳性能。每种语言的有效性取决于具体场景。受此启发，我们提出了一种称为 MultiPoT 的任务和模型无关方法，它利用了各种语言的优势和多样性。实验结果表明，它的性能明显优于 Python 自一致性。此外，它在所有模型的几乎所有任务中都实现了与最佳单语 PoT 相当或更好的性能。特别是，MultiPoT 在 Starcoder 和 ChatGPT (gpt-3.5-turbo) 上平均提高了 4.6% 以上。

Title: Exploring Precision and Recall to assess the quality and diversity of LLMs

Authors: Le Bronnec Florian, Verine Alexandre, Negrevergne Benjamin, Chevaleyre Yann, Allauzen Alexandre
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10693
Pdf URL: https://arxiv.org/pdf/2402.10693
Copy Paste: [[2402.10693]] Exploring Precision and Recall to assess the quality and diversity of LLMs(https://arxiv.org/abs/2402.10693)
Keywords: language model, llm
Abstract: This paper introduces a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral, focusing on the adaptation of Precision and Recall metrics from image generation to text generation. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora. By conducting a comprehensive evaluation of state-of-the-art language models, the study reveals significant insights into their performance on open-ended generation tasks, which are not adequately captured by traditional benchmarks. The findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned with human feedback. This work extends the toolkit for distribution-based NLP evaluation, offering insights into the practical capabilities and challenges faced by current LLMs in generating diverse and high-quality text.
摘要：本文介绍了一种针对 Llama-2 和 Mistral 等大型语言模型 (LLM) 的新型评估框架，重点关注从图像生成到文本生成的精度和召回率指标的适应。这种方法可以对生成文本的质量和多样性进行细致的评估，而无需对齐语料库。通过对最先进的语言模型进行全面评估，该研究揭示了它们在开放式生成任务上的表现的重要见解，而传统基准无法充分捕获这些任务。研究结果强调了生成样本的质量和多样性之间的权衡，特别是当模型根据人类反馈进行微调时。这项工作扩展了基于分布的 NLP 评估工具包，提供了对当前 LLM 在生成多样化和高质量文本方面的实际能力和所面临挑战的见解。

Title: Rethinking Human-like Translation Strategy: Integrating Drift-Diffusion Model with Large Language Models for Machine Translation

Authors: Hongbin Na, Zimu Wang, Mieradilijiang Maimaiti, Tong Chen, Wei Wang, Tao Shen, Ling Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10699
Pdf URL: https://arxiv.org/pdf/2402.10699
Copy Paste: [[2402.10699]] Rethinking Human-like Translation Strategy: Integrating Drift-Diffusion Model with Large Language Models for Machine Translation(https://arxiv.org/abs/2402.10699)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated promising potential in various downstream tasks, including machine translation. However, prior work on LLM-based machine translation has mainly focused on better utilizing training data, demonstrations, or pre-defined and universal knowledge to improve performance, with a lack of consideration of decision-making like human translators. In this paper, we incorporate Thinker with the Drift-Diffusion Model (Thinker-DDM) to address this issue. We then redefine the Drift-Diffusion process to emulate human translators' dynamic decision-making under constrained resources. We conduct extensive experiments under the high-resource, low-resource, and commonsense translation settings using the WMT22 and CommonMT datasets, in which Thinker-DDM outperforms baselines in the first two scenarios. We also perform additional analysis and evaluation on commonsense translation to illustrate the high effectiveness and efficacy of the proposed method.
摘要：大型语言模型 (LLM) 在各种下游任务（包括机器翻译）中表现出了巨大的潜力。然而，之前基于LLM的机器翻译工作主要集中在更好地利用训练数据、演示或预定义的通用知识来提高性能，而缺乏像人工翻译一样的决策考虑。在本文中，我们将 Thinker 与漂移扩散模型 (Thinker-DDM) 结合起来来解决这个问题。然后，我们重新定义漂移-扩散过程，以模拟人类翻译人员在资源有限的情况下的动态决策。我们使用 WMT22 和 CommonMT 数据集在高资源、低资源和常识翻译设置下进行了广泛的实验，其中 Thinker-DDM 在前两种情况下优于基线。我们还对常识翻译进行了额外的分析和评估，以说明该方法的高效性和功效。

Title: AutoSAT: Automatically Optimize SAT Solvers via Large Language Models

Authors: Yiwen Sun, Xianyin Zhang, Shiyu Huang, Shaowei Cai, Bing-Zhen Zhang, Ke Wei
Subjects: cs.AI
Abstract URL: https://arxiv.org/abs/2402.10705
Pdf URL: https://arxiv.org/pdf/2402.10705
Copy Paste: [[2402.10705]] AutoSAT: Automatically Optimize SAT Solvers via Large Language Models(https://arxiv.org/abs/2402.10705)
Keywords: language model, llm
Abstract: Heuristics are crucial in SAT solvers, while no heuristic rules are suitable for all problem instances. Therefore, specific solvers typically need to be refined for specific problem instances. In this context, we present AutoSAT, a novel framework for automatically optimizing heuristics in SAT solvers. AutoSAT is based on Large Language Models (LLMs), which are able to autonomously generate code, conduct evaluation, then utilize the feedback to further optimize heuristics, thereby reducing human intervention and enhancing solver capabilities. AutoSAT operates on a plug-and-play basis, eliminating the need for extensive preliminary setup and model training, and fosters a Chain of Thought collaborative process with fault-tolerance, ensuring robust heuristic optimization. Extensive experiments on a Conflict-Driven Clause Learning (CDCL) solver demonstrate the overall superior performance of AutoSAT, especially in solving some specific SAT problem instances.
摘要：启发式算法在 SAT 求解器中至关重要，但没有任何启发式规则适合所有问题实例。因此，通常需要针对特定问题实例改进特定的求解器。在这种背景下，我们提出了 AutoSAT，一种用于自动优化 SAT 求解器中启发式算法的新颖框架。 AutoSAT 基于大型语言模型 (LLM)，能够自动生成代码、进行评估，然后利用反馈进一步优化启发式方法，从而减少人为干预并增强求解器能力。 AutoSAT 在即插即用的基础上运行，无需进行大量的初步设置和模型训练，并培育具有容错能力的思维链协作流程，确保稳健的启发式优化。对冲突驱动子句学习 (CDCL) 求解器的大量实验证明了 AutoSAT 的整体优越性能，特别是在解决一些特定 SAT 问题实例方面。

Title: An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Generative LLM Inference

Authors: Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10712
Pdf URL: https://arxiv.org/pdf/2402.10712
Copy Paste: [[2402.10712]] An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Generative LLM Inference(https://arxiv.org/abs/2402.10712)
Keywords: language model, llm
Abstract: The development of state-of-the-art generative large language models (LLMs) disproportionately relies on English-centric tokenizers, vocabulary and pre-training data. Despite the fact that some LLMs have multilingual capabilities, recent studies have shown that their inference efficiency deteriorates when generating text in languages other than English. This results in increased inference time and costs. Cross-lingual vocabulary adaptation methods have been proposed for adapting models to a target language aiming to improve downstream performance. However, the effectiveness of these methods on increasing inference efficiency of generative LLMs has yet to be explored. In this paper, we perform an empirical study of various cross-lingual vocabulary adaptation methods on five generative LLMs (including monolingual and multilingual models) across four typologically-diverse languages and four natural language understanding tasks. We find that cross-lingual vocabulary adaptation substantially contributes to LLM inference speedups of up to 271.5%. We also show that adapting LLMs that have been pre-trained on more balanced multilingual data results in downstream performance comparable to the original models.
摘要：最先进的生成式大语言模型 (LLM) 的开发过度依赖以英语为中心的分词器、词汇和预训练数据。尽管一些 LLM 具有多语言能力，但最近的研究表明，当生成英语以外的语言文本时，它们的推理效率会下降。这会导致推理时间和成本增加。跨语言词汇适应方法已经被提出，用于使模型适应目标语言，旨在提高下游性能。然而，这些方法在提高生成式 LLM 推理效率方面的有效性仍有待探索。在本文中，我们对五种生成式 LLM（包括单语和多语言模型）的各种跨语言词汇适应方法进行了实证研究，涉及四种类型学上不同的语言和四种自然语言理解任务。我们发现跨语言词汇适应可显著加速 LLM 推理，加速幅度最高达 271.5%。我们还表明，对在更平衡的多语言数据上进行预训练的 LLM 进行适应，可以得到与原始模型相当的下游性能。

Title: Assessing the Reasoning Abilities of ChatGPT in the Context of Claim Verification

Authors: John Dougrez-Lewis, Mahmud Elahi Akhter, Yulan He, Maria Liakata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10735
Pdf URL: https://arxiv.org/pdf/2402.10735
Copy Paste: [[2402.10735]] Assessing the Reasoning Abilities of ChatGPT in the Context of Claim Verification(https://arxiv.org/abs/2402.10735)
Keywords: gpt, llm, chat
Abstract: The reasoning capabilities of LLMs are currently hotly debated. We examine the issue from the perspective of claim/rumour verification. We propose the first logical reasoning framework designed to break down any claim or rumor paired with evidence into the atomic reasoning steps necessary for verification. Based on our framework, we curate two annotated collections of such claim/evidence pairs: a synthetic dataset from Wikipedia and a real-world set stemming from rumours circulating on Twitter. We use them to evaluate the reasoning capabilities of GPT-3.5-Turbo and GPT-4 (hereinafter referred to as ChatGPT) within the context of our framework, providing a thorough analysis. Our results show that ChatGPT struggles in abductive reasoning, although this can be somewhat mitigated by using manual Chain of Thought (CoT) as opposed to Zero Shot (ZS) and ZS CoT approaches. Our study contributes to the growing body of research suggesting that ChatGPT's reasoning processes are unlikely to mirror human-like reasoning, and that LLMs need to be more rigorously evaluated in order to distinguish between hype and actual capabilities, especially in high stake real-world tasks such as claim verification.
摘要：LLM 的推理能力目前备受争议。我们从核实声明/谣言的角度来审视这个问题。我们提出了第一个逻辑推理框架，旨在将任何与证据配对的声明或谣言分解为验证所需的原子推理步骤。基于我们的框架，我们整理了此类声明/证据对的两个带注释的集合：来自维基百科的合成数据集和源自 Twitter 上流传的谣言的现实世界数据集。我们使用它们在我们的框架内评估 GPT-3.5-Turbo 和 GPT-4（以下简称 ChatGPT）的推理能力，提供全面的分析。我们的结果表明，ChatGPT 在溯因推理方面表现不佳，尽管通过使用手动思维链 (CoT)（而不是零样本 (ZS) 和 ZS CoT 方法）可以在一定程度上缓解这一问题。我们的研究为越来越多的研究做出了贡献，这些研究表明 ChatGPT 的推理过程不太可能反映类人推理，并且 LLM 需要进行更严格的评估，以区分炒作和实际能力，尤其是在声明验证等高风险的现实世界任务中。

Title: Let's Learn Step by Step: Enhancing In-Context Learning Ability with Curriculum Learning

Authors: Yinpeng Liu, Jiawei Liu, Xiang Shi, Qikai Cheng, Wei Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10738
Pdf URL: https://arxiv.org/pdf/2402.10738
Copy Paste: [[2402.10738]] Let's Learn Step by Step: Enhancing In-Context Learning Ability with Curriculum Learning(https://arxiv.org/abs/2402.10738)
Keywords: language model, llm, prompt
Abstract: Demonstration ordering, which is an important strategy for in-context learning (ICL), can significantly affect the performance of large language models (LLMs). However, most of the current approaches of ordering require additional knowledge and similarity calculation. We advocate the few-shot in-context curriculum learning (ICCL), a simple but effective demonstration ordering method for ICL, which implies gradually increasing the complexity of prompt demonstrations during the inference process. Then we design three experiments to discuss the effectiveness of ICCL, the formation mechanism of LLM's ICCL capability, and the impact of ordering subjects. Experimental results demonstrate that ICCL, developed during the instruction-tuning stage, is effective for open-source LLMs. Moreover, LLMs exhibit a weaker capacity compared to humans in discerning the difficulty levels of demonstrations. We release our code at https://github.com/61peng/curri_learning.
摘要：演示排序是上下文学习 (ICL) 的重要策略，可以显著影响大型语言模型 (LLM) 的性能。然而，当前大多数排序方法都需要额外的知识和相似性计算。我们提倡少样本上下文课程学习（ICCL），这是一种简单但有效的 ICL 演示排序方法，这意味着在推理过程中逐渐增加提示演示的复杂性。然后我们设计了三个实验来讨论 ICCL 的有效性、LLM 的 ICCL 能力的形成机制以及排序主体的影响。实验结果表明，在指令调优阶段形成的 ICCL 对于开源 LLM 是有效的。此外，与人类相比，LLM 在辨别演示难度级别方面表现出较弱的能力。我们在 https://github.com/61peng/curri_learning 发布了我们的代码。

Title: GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models

Authors: Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, Jiawei Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10744
Pdf URL: https://arxiv.org/pdf/2402.10744
Copy Paste: [[2402.10744]] GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models(https://arxiv.org/abs/2402.10744)
Keywords: language model, llm, hallucination, prompt
Abstract: The field of relation extraction (RE) is experiencing a notable shift towards generative relation extraction (GRE), leveraging the capabilities of large language models (LLMs). However, we discovered that traditional relation extraction (RE) metrics like precision and recall fall short in evaluating GRE methods. This shortfall arises because these metrics rely on exact matching with human-annotated reference relations, while GRE methods often produce diverse and semantically accurate relations that differ from the references. To fill this gap, we introduce GenRES for a multi-dimensional assessment in terms of the topic similarity, uniqueness, granularity, factualness, and completeness of the GRE results. With GenRES, we empirically identified that (1) precision/recall fails to justify the performance of GRE methods; (2) human-annotated referential relations can be incomplete; (3) prompting LLMs with a fixed set of relations or entities can cause hallucinations. Next, we conducted a human evaluation of GRE methods that shows GenRES is consistent with human preferences for RE quality. Last, we made a comprehensive evaluation of fourteen leading LLMs using GenRES across document, bag, and sentence level RE datasets, respectively, to set the benchmark for future research in GRE.
摘要：关系提取 (RE) 领域正在经历向生成关系提取 (GRE) 的显着转变，利用大型语言模型 (LLM) 的功能。然而，我们发现传统的关系提取 (RE) 指标（如精度和召回率）在评估 GRE 方法时存在不足。出现这种缺陷是因为这些指标依赖于与人类注释的参考关系的精确匹配，而 GRE 方法通常会产生与参考不同的多样化且语义准确的关系。为了填补这一空白，我们引入GenRES，对GRE成绩的主题相似性、独特性、粒度、真实性和完整性进行多维度评估。通过 GenRES，我们凭经验发现 (1) 精度/召回率无法证明 GRE 方法的性能合理； (2) 人工注释的引用关系可能不完整； (3)用一组固定的关系或实体来提示法学硕士可能会引起幻觉。接下来，我们对 GRE 方法进行了人类评估，结果表明 GenRES 与人类对 RE 质量的偏好一致。最后，我们分别使用 GenRES 跨文档、包和句子级别 RE 数据集对 14 个领先的法学硕士进行了综合评估，为未来的 GRE 研究设定基准

Title: ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages

Authors: Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10753
Pdf URL: https://arxiv.org/pdf/2402.10753
Copy Paste: [[2402.10753]] ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages(https://arxiv.org/abs/2402.10753)
Keywords: language model, gpt, llm
Abstract: Tool learning is widely acknowledged as a foundational approach for deploying large language models (LLMs) in real-world scenarios. While current research primarily emphasizes leveraging tools to augment LLMs, it frequently neglects emerging safety considerations tied to their application. To fill this gap, we present ToolSword, a comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. Specifically, ToolSword delineates six safety scenarios for LLMs in tool learning, encompassing malicious queries and jailbreak attacks in the input stage, noisy misdirection and risky cues in the execution stage, and harmful feedback and error conflicts in the output stage. Experiments conducted on 11 open-source and closed-source LLMs reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, to which even GPT-4 is susceptible. Moreover, we conduct further studies with the aim of fostering research on tool learning safety. The data is available at https://github.com/Junjie-Ye/ToolSword.
摘要：工具学习被广泛认为是在现实场景中部署大型语言模型 (LLM) 的基础方法。虽然当前的研究主要强调利用工具来增强法学硕士，但它经常忽视与其应用相关的新兴安全考虑因素。为了填补这一空白，我们提出了 $ToolSword$，这是一个综合框架，致力于仔细调查与工具学习中法学硕士相关的安全问题。具体来说，ToolSword为法学硕士在工具学习中描绘了六种安全场景，包括输入阶段的$malicious$ $queries$和$jailbreak$ $attacks$，执行阶段的$noisy$ $misdirection$和$risky$ $cues$，以及输出阶段的$harmful$ $feedback$ 和$error$ $conflicts$。在 11 个开源和闭源 LLM 上进行的实验揭示了工具学习中持久的安全挑战，例如处理有害查询、使用有风险的工具以及提供有害反馈，甚至 GPT-4 也容易受到这些挑战。此外，我们还开展进一步的研究，旨在促进工具学习安全性的研究。数据发布于https://github.com/Junjie-Ye/ToolSword。

Title: Inference to the Best Explanation in Large Language Models

Authors: Dhairya Dalal, Marco Valentino, André Freitas, Paul Buitelaar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10767
Pdf URL: https://arxiv.org/pdf/2402.10767
Copy Paste: [[2402.10767]] Inference to the Best Explanation in Large Language Models(https://arxiv.org/abs/2402.10767)
Keywords: language model, gpt, llm
Abstract: While Large Language Models (LLMs) have found success in real-world applications, their underlying explanatory process is still poorly understood. This paper proposes IBE-Eval, a framework inspired by philosophical accounts of Inference to the Best Explanation (IBE) to advance the interpretation and evaluation of LLMs' explanations. IBE-Eval estimates the plausibility of natural language explanations through a combination of explicit logical and linguistic features including: consistency, parsimony, coherence, and uncertainty. Extensive experiments are conducted on Causal Question Answering (CQA), where IBE-Eval is tasked to select the most plausible causal explanation amongst competing ones generated by LLMs (i.e., GPT 3.5 and Llama 2). The experiments reveal that IBE-Eval can successfully identify the best explanation with up to 77% accuracy ($\approx 27\%$ above random), improving upon a GPT 3.5-as-a-Judge baseline ($\approx+17\%$) while being intrinsically more efficient and interpretable. Additional analyses suggest that, despite model-specific variances, LLM-generated explanations tend to conform to IBE criteria and that IBE-Eval is significantly correlated with human judgment, opening up opportunities for future development of automated explanation verification tools.
摘要：虽然大型语言模型 (LLM) 在现实世界的应用中取得了成功，但其底层解释过程仍然知之甚少。本文提出了 IBE-Eval，这是一个受最佳解释推理 (IBE) 哲学解释启发的框架，旨在推进对 LLM 解释的解读和评估。IBE-Eval 通过结合明确的逻辑和语言特征来估计自然语言解释的合理性，这些特征包括：一致性、简约性、连贯性和不确定性。在因果问答（CQA）上进行了广泛的实验，其中 IBE-Eval 的任务是在 LLM（即 GPT 3.5 和 Llama 2）生成的竞争解释中选择最合理的因果解释。实验表明，IBE-Eval 可以成功识别最佳解释，准确率高达 77%（比随机高出 $\approx 27\%$），相对于 GPT 3.5-as-a-Judge 基线有所提升（$\approx+17\%$），同时本质上更高效且可解释。其他分析表明，尽管存在特定于模型的差异，但 LLM 生成的解释往往符合 IBE 标准，并且 IBE-Eval 与人类判断显着相关，这为未来开发自动解释验证工具开辟了机会。

Title: Distillation Enhanced Generative Retrieval

Authors: Yongqi Li, Zhen Zhang, Wenjie Wang, Liqiang Nie, Wenjie Li, Tat-Seng Chua
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2402.10769
Pdf URL: https://arxiv.org/pdf/2402.10769
Copy Paste: [[2402.10769]] Distillation Enhanced Generative Retrieval(https://arxiv.org/abs/2402.10769)
Keywords: language model
Abstract: Generative retrieval is a promising new paradigm in text retrieval that generates identifier strings of relevant passages as the retrieval target. This paradigm leverages powerful generative language models, distinct from traditional sparse or dense retrieval methods. In this work, we identify a viable direction to further enhance generative retrieval via distillation and propose a feasible framework, named DGR. DGR utilizes sophisticated ranking models, such as the cross-encoder, in a teacher role to supply a passage rank list, which captures the varying relevance degrees of passages instead of binary hard labels; subsequently, DGR employs a specially designed distilled RankNet loss to optimize the generative retrieval model, considering the passage rank order provided by the teacher model as labels. This framework only requires an additional distillation step to enhance current generative retrieval systems and does not add any burden to the inference stage. We conduct experiments on four public datasets, and the results indicate that DGR achieves state-of-the-art performance among the generative retrieval methods. Additionally, DGR demonstrates exceptional robustness and generalizability with various teacher models and distillation losses.
摘要：生成检索是文本检索中一种有前途的新范式，它生成相关段落的标识符字符串作为检索目标。该范例利用强大的生成语言模型，与传统的稀疏或密集检索方法不同。在这项工作中，我们确定了一个通过蒸馏进一步增强生成检索的可行方向，并提出了一个可行的框架，称为 DGR。 DGR 在教师角色中利用复杂的排名模型（例如交叉编码器）来提供段落排名列表，该列表捕获段落的不同相关程度而不是二进制硬标签；随后，DGR 采用专门设计的蒸馏 RankNet 损失来优化生成检索模型，将教师模型提供的段落排名顺序视为标签。该框架只需要额外的蒸馏步骤来增强当前的生成检索系统，并且不会给推理阶段增加任何负担。我们在四个公共数据集上进行了实验，结果表明 DGR 在生成检索方法中实现了最先进的性能。此外，DGR 在各种教师模型和蒸馏损失中表现出卓越的鲁棒性和普遍性。
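
Illustrative sketch for the DGR entry above: the abstract does not spell out the distilled RankNet loss, so the code below shows only the standard RankNet pairwise logistic loss applied to a teacher-provided rank order. The function name `ranknet_loss`, the score list, and the index-based rank list are hypothetical and are not taken from the paper.

```python
import math


def ranknet_loss(student_scores, teacher_order):
    """Mean pairwise logistic loss over all pairs (i, j) where the teacher ranks i above j.

    student_scores: list of floats, one score per passage (hypothetical input).
    teacher_order: list of passage indices, best first, as supplied by a teacher ranker.
    """
    total, pairs = 0.0, 0
    for a in range(len(teacher_order)):
        for b in range(a + 1, len(teacher_order)):
            i, j = teacher_order[a], teacher_order[b]
            diff = student_scores[i] - student_scores[j]
            # log(1 + exp(-diff)), written in a numerically stable form
            total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
            pairs += 1
    return total / pairs if pairs else 0.0
```

The loss shrinks when the student scores agree with the teacher's order and grows when they contradict it, which is the property a rank-distillation objective of this kind relies on.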

Title: How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?

Authors: Ehsan Doostmohammadi, Oskar Holmström, Marco Kuhlmann
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10770
Pdf URL: https://arxiv.org/pdf/2402.10770
Copy Paste: [[2402.10770]] How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?(https://arxiv.org/abs/2402.10770)
Keywords: language model, gpt, llm, prompt
Abstract: Work on instruction-tuned Large Language Models (LLMs) has used automatic methods based on text overlap and LLM judgments as cost-effective alternatives to human evaluation. In this paper, we study the reliability of such methods across a broad range of tasks and in a cross-lingual setting. In contrast to previous findings, we observe considerable variability in correlations between automatic methods and human evaluators when scores are differentiated by task type. Specifically, the widely-used ROUGE-L metric strongly correlates with human judgments for short-answer English tasks but is unreliable in free-form generation tasks and cross-lingual transfer. The effectiveness of GPT-4 as an evaluator depends on including reference answers when prompting for assessments, which can lead to overly strict evaluations in free-form generation tasks. In summary, we find that, while automatic evaluation methods can approximate human judgements under specific conditions, their reliability is highly context-dependent. Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
摘要：指令调整大型语言模型 (LLM) 的工作使用了基于文本重叠和 LLM 判断的自动方法，作为人类评估的经济高效替代方案。在本文中，我们研究了此类方法在广泛的任务和跨语言环境中的可靠性。与之前的发现相反，当分数按任务类型区分时，我们观察到自动方法和人类评估者之间的相关性存在相当大的变化。具体来说，广泛使用的 ROUGE-L 度量与人类对简答英语任务的判断密切相关，但在自由格式生成任务和跨语言迁移中并不可靠。 GPT-4 作为评估器的有效性取决于在提示评估时包含参考答案，这可能导致自由形式生成任务中的评估过于严格。总之，我们发现，虽然自动评估方法可以在特定条件下近似人类的判断，但其可靠性高度依赖于上下文。我们的研究结果增强了人们对在开发和评估指令调整的法学硕士时应如何应用和解释自动方法的理解。
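
Illustrative sketch for the entry above: ROUGE-L scores a candidate against a reference through their longest common subsequence (LCS). The code below is a simplified version that assumes lowercase whitespace tokenization, no stemming, and an equally weighted F-measure; it is not the official implementation or the setup used in the paper, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 between two strings (whitespace tokens, lowercased)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

A correct one-word answer scored against a longer reference sentence gets full precision but low recall, which gives some intuition for why an overlap metric of this kind can behave differently on short-answer and free-form tasks.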

Title: A Condensed Transition Graph Framework for Zero-shot Link Prediction with Large Language Models

Authors: Mingchen Li, Chen Ling, Rui Zhang, Liang Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10779
Pdf URL: https://arxiv.org/pdf/2402.10779
Copy Paste: [[2402.10779]] A Condensed Transition Graph Framework for Zero-shot Link Prediction with Large Language Models(https://arxiv.org/abs/2402.10779)
Keywords: language model, llm
Abstract: Zero-shot link prediction (ZSLP) on knowledge graphs aims at automatically identifying relations between given entities. Existing methods primarily employ auxiliary information to predict the tail entity given the head entity and its relation, yet face challenges due to the occasional unavailability of such detailed information and the inherent simplicity of predicting tail entities based on semantic similarities. Even though Large Language Models (LLMs) offer a promising solution to predict unobserved relations between the head and tail entity in a zero-shot manner, their performance is still restricted due to the inability to leverage the information from all the (exponentially many) paths between two entities, which is critical in collectively indicating their relation types. To address this, in this work, we introduce a Condensed Transition Graph Framework for Zero-Shot Link Prediction (CTLP), which encodes all the paths' information in linear time complexity to predict unseen relations between entities, attaining both efficiency and information preservation. Specifically, we design a condensed transition graph encoder with theoretical guarantees on its coverage, expressiveness, and efficiency. It is learned by a transition graph contrastive learning strategy. Subsequently, we design a soft instruction tuning to learn and map the all-path embedding to the input of LLMs. Experimental results show that our proposed CTLP method achieves state-of-the-art performance on three standard ZSLP datasets.
摘要：知识图上的零样本链接预测（ZSLP）旨在自动识别给定实体之间的关系。现有方法主要利用辅助信息来预测给定头部实体及其关系的尾部实体，但由于此类详细信息偶尔不可用以及基于语义相似性预测尾部实体的固有简单性而面临挑战。尽管大型语言模型（LLM）提供了一种有前途的解决方案，可以以零样本的方式预测头实体和尾实体之间未观察到的关系，但由于无法利用头实体和尾实体之间的所有（指数级多）路径信息，它们的性能仍然受到限制。两个实体，这对于共同指示它们的关系类型至关重要。为了解决这个问题，在这项工作中，我们引入了零样本链路预测（CTLP）的压缩转移图框架，它以线性时间复杂度对所有路径信息进行编码，以预测实体之间不可见的关系，从而实现效率和信息保存。具体来说，我们设计了一种压缩转换图编码器，并在理论上保证了其覆盖范围、表达能力和效率。它是通过转换图对比学习策略来学习的。随后，我们设计了一种软指令调整来学习全路径嵌入并将其映射到 LLM 的输入。实验结果表明，我们提出的 CTLP 方法在三个标准 ZSLP 数据集上实现了最先进的性能

Title: EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge

Authors: Xuan Shen, Zhenglun Kong, Changdi Yang, Zhaoyang Han, Lei Lu, Peiyan Dong, Cheng Lyu, Chih-hsiang Li, Xuehang Guo, Zhihao Shu, Wei Niu, Miriam Leeser, Pu Zhao, Yanzhi Wang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.10787
Pdf URL: https://arxiv.org/pdf/2402.10787
Copy Paste: [[2402.10787]] EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge(https://arxiv.org/abs/2402.10787)
Keywords: language model, llm
Abstract: Despite the remarkable strides of Large Language Models (LLMs) in various fields, the wide applications of LLMs on edge devices are limited due to their massive parameters and computations. To address this, quantization is commonly adopted to generate lightweight LLMs with efficient computations and fast inference. However, Post-Training Quantization (PTQ) methods dramatically degrade in quality when quantizing weights, activations, and KV cache together to below 8 bits. Besides, many Quantization-Aware Training (QAT) works quantize model weights, leaving the activations untouched, which does not fully exploit the potential of quantization for inference acceleration on the edge. In this paper, we propose EdgeQAT, the Entropy and Distribution Guided QAT for the optimization of lightweight LLMs to achieve inference acceleration on Edge devices. We first identify that the performance drop of quantization primarily stems from the information distortion in quantized attention maps, demonstrated by the different distributions in quantized query and key of the self-attention mechanism. Then, the entropy and distribution guided QAT is proposed to mitigate the information distortion. Moreover, we design a token importance-aware adaptive method to dynamically quantize the tokens with different bit widths for further optimization and acceleration. Our extensive experiments verify the substantial improvements with our framework across various datasets. Furthermore, we achieve an on-device speedup of up to 2.37x compared with FP16 counterparts across multiple edge devices, signaling a groundbreaking advancement.
摘要：尽管大型语言模型（LLM）在各个领域取得了显着的进步，但由于其庞大的参数和计算量，LLM在边缘设备上的广泛应用受到限制。为了解决这个问题，通常采用量化来生成具有高效计算和快速推理的轻量级 LLM。然而，当将权重、激活和 KV 缓存一起量化到低于 8 位时，训练后量化 (PTQ) 方法的质量会急剧下降。此外，许多量化感知训练（QAT）工作量化模型权重，保持激活不变，这没有充分利用量化在边缘推理加速的潜力。在本文中，我们提出了 EdgeQAT，即熵和分布引导 QAT，用于优化轻量级 LLM，以实现边缘设备上的推理加速。我们首先确定量化的性能下降主要源于量化注意力图中的信息失真，这可以通过量化查询和自注意力机制的关键的不同分布来证明。然后，提出了熵和分布引导的QAT来减轻信息失真。此外，我们设计了一种令牌重要性感知自适应方法来动态量化具有不同位宽的令牌，以进一步优化和加速。我们广泛的实验验证了我们的框架在各种数据集上的实质性改进。此外，与跨多个边缘设备的 FP16 同行相比，我们实现了高达 2.37 倍的设备上加速，这标志着突破性的进步。
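
Illustrative sketch for the entry above: the abstract refers to quantizing values to low bit widths. The code below shows only a generic symmetric uniform quantize-then-dequantize round trip (often called fake quantization) as background; it is not EdgeQAT's entropy- and distribution-guided method, and the function name `fake_quantize` is hypothetical.

```python
def fake_quantize(values, bits):
    """Quantize floats to signed `bits`-bit integers and map them back to floats.

    Uses one symmetric scale for the whole list (a per-tensor scale), which is an
    assumption made for this sketch.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)               # nothing to scale
    scale = max_abs / qmax                # width of one quantization step
    out = []
    for v in values:
        q = round(v / scale)              # nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        out.append(q * scale)             # dequantize
    return out
```

With fewer bits the step size grows, so the round-trip error grows as well; this is the kind of distortion that low-bit quantization methods try to control.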

Title: In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss

Authors: Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10790
Pdf URL: https://arxiv.org/pdf/2402.10790
Copy Paste: [[2402.10790]] In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss(https://arxiv.org/abs/2402.10790)
Keywords: gpt, llm
Abstract: This paper addresses the challenge of processing long documents using generative transformer models. To evaluate different approaches, we introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. Our evaluation, which includes benchmarks for GPT-4 and RAG, reveals that common methods are effective only for sequences up to $10^4$ elements. In contrast, fine-tuning GPT-2 with recurrent memory augmentations enables it to handle tasks involving up to $10^7$ elements. This achievement marks a substantial leap, as it is by far the longest input processed by any open neural network model to date, demonstrating a significant improvement in the processing capabilities for long sequences.
摘要：本文解决了使用生成变压器模型处理长文档的挑战。为了评估不同的方法，我们引入了 BABILong，这是一个新的基准，旨在评估模型在大量文本中提取和处理分布式事实的能力。我们的评估（包括 GPT-4 和 RAG 的基准）表明，常用方法仅对最多 $10^4$ 元素的序列有效。相比之下，通过循环内存增强对 GPT-2 进行微调使其能够处理涉及最多 $10^7$ 元素的任务。这一成就标志着一个重大飞跃，因为它是迄今为止任何开放神经网络模型处理的最长输入，表明长序列处理能力的显着提高。

Title: Quantifying the Persona Effect in LLM Simulations

Authors: Tiancheng Hu, Nigel Collier
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2402.10811
Pdf URL: https://arxiv.org/pdf/2402.10811
Copy Paste: [[2402.10811]] Quantifying the Persona Effect in LLM Simulations(https://arxiv.org/abs/2402.10811)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown remarkable promise in simulating human language use and behavior. In this study, we delve into the intersection of persona variables and the capability of LLMs to simulate different perspectives. We find that persona variables can explain <10% of the variance in annotations in existing subjective NLP datasets. Nonetheless, incorporating them via prompting in LLMs provides modest improvement. Persona prompting is most effective on data samples where disagreements among annotators are frequent yet confined to a limited range. A linear correlation exists: the more persona variables influence human annotations, the better LLMs' predictions are when using persona prompting. However, when the utility of persona variables is low (i.e., explaining <10% of human annotations), persona prompting has little effect. Most subjective NLP datasets fall into this category, casting doubt on simulating diverse perspectives in the current NLP landscape.
摘要：大型语言模型（LLM）在模拟人类语言使用和行为方面表现出了非凡的前景。在这项研究中，我们深入研究了角色变量与法学硕士模拟不同观点的能力的交集。我们发现角色变量可以解释现有主观 NLP 数据集中注释的 <10% 方差。尽管如此，通过法学硕士的提示将它们纳入其中可以提供一定的改进。人物角色提示对于数据样本最为有效，因为注释者之间的分歧很频繁，但范围有限。存在线性相关性：角色变量影响人类注释的越多，法学硕士使用角色提示进行的预测就越好。然而，当角色变量的效用较低时（即解释<10\%的人类注释），角色提示效果甚微。大多数主观 NLP 数据集都属于这一类，这对模拟当前 NLP 领域的不同观点产生了怀疑。

Title: Exploring Hybrid Question Answering via Program-based Prompting

Authors: Qi Shi, Han Cui, Haofeng Wang, Qingfu Zhu, Wanxiang Che, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10812
Pdf URL: https://arxiv.org/pdf/2402.10812
Copy Paste: [[2402.10812]] Exploring Hybrid Question Answering via Program-based Prompting(https://arxiv.org/abs/2402.10812)
Keywords: prompt
Abstract: Question answering over heterogeneous data requires reasoning over diverse sources of data, which is challenging due to the large scale of information and organic coupling of heterogeneous data. Various approaches have been proposed to address these challenges. One approach involves training specialized retrievers to select relevant information, thereby reducing the input length. Another approach is to transform diverse modalities of data into a single modality, reducing task difficulty and enabling more straightforward processing. In this paper, we propose HProPro, a novel program-based prompting framework for the hybrid question answering task. HProPro follows the code generation and execution paradigm. In addition, HProPro integrates various functions to tackle the hybrid reasoning scenario. Specifically, HProPro contains function declaration and function implementation to perform hybrid information-seeking over data from various sources and modalities, which enables reasoning over such data without training specialized retrievers or performing modal transformations. Experimental results on two typical hybrid question answering benchmarks, HybridQA and MultiModalQA, demonstrate the effectiveness of HProPro: it surpasses all baseline systems and achieves the best performance in the few-shot settings on both datasets.
摘要：针对异构数据的问答需要对不同数据源进行推理，由于信息规模大且异构数据的有机耦合，这具有挑战性。已经提出了各种方法来应对这些挑战。一种方法是训练专门的检索器来选择相关信息，从而减少输入长度。另一种方法是将不同模式的数据转换为单一模式，从而简化任务难度并实现更直接的处理。在本文中，我们提出了 HProPro，这是一种用于混合问答任务的新颖的基于程序的提示框架。 HProPro 遵循代码生成和执行范例。此外，HProPro还集成了多种功能来应对混合推理场景。具体来说，HProPro 包含函数声明和函数实现，用于对来自各种来源和模式的数据执行混合信息搜索，从而无需训练专门的检索器或执行模式转换即可对此类数据进行推理。在两个典型的混合问答基准 HybridQA 和 MultiModalQA 上的实验结果证明了 HProPro 的有效性：它超越了所有基线系统，并在两个数据集上的少样本设置中实现了最佳性能。

Title: Trading off Consistency and Dimensionality of Convex Surrogates for the Mode

Authors: Enrique Nueve, Bo Waggoner, Dhamma Kimpara, Jessie Finocchiaro
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2402.10818
Pdf URL: https://arxiv.org/pdf/2402.10818
Copy Paste: [[2402.10818]] Trading off Consistency and Dimensionality of Convex Surrogates for the Mode(https://arxiv.org/abs/2402.10818)
Keywords: hallucination
Abstract: In multiclass classification over $n$ outcomes, the outcomes must be embedded into the reals with dimension at least $n-1$ in order to design a consistent surrogate loss that leads to the "correct" classification, regardless of the data distribution. For large $n$, such as in information retrieval and structured prediction tasks, optimizing a surrogate in $n-1$ dimensions is often intractable. We investigate ways to trade off surrogate loss dimension, the number of problem instances, and restricting the region of consistency in the simplex for multiclass classification. Following past work, we examine an intuitive embedding procedure that maps outcomes into the vertices of convex polytopes in a low-dimensional surrogate space. We show that full-dimensional subsets of the simplex exist around each point mass distribution for which consistency holds, but also, with less than $n-1$ dimensions, there exist distributions for which a phenomenon called hallucination occurs, which is when the optimal report under the surrogate loss is an outcome with zero probability. Looking towards application, we derive a result to check if consistency holds under a given polytope embedding and low-noise assumption, providing insight into when to use a particular embedding. We provide examples of embedding $n = 2^{d}$ outcomes into the $d$-dimensional unit cube and $n = d!$ outcomes into the $d$-dimensional permutahedron under low-noise assumptions. Finally, we demonstrate that with multiple problem instances, we can learn the mode with $\frac{n}{2}$ dimensions over the whole simplex.
摘要：在 $n$ 结果的多类分类中，结果必须嵌入到维度至少为 $n-1$ 的实数中，以便设计一致的替代损失，从而导致“正确”的分类，无论数据分布如何。对于较大的 $n$，例如在信息检索和结构化预测任务中，优化 $n-1$ 维度的代理通常很棘手。我们研究了权衡代理损失维度、问题实例数量以及限制多类分类单纯形的一致性区域的方法。在过去的工作之后，我们研究了一种直观的嵌入过程，该过程将结果映射到低维代理空间中凸多面体的顶点。我们证明，单纯形的全维子集存在于一致性成立的每个点质量分布周围，而且，在小于 $n-1$ 维度的情况下，存在会发生称为幻觉的现象的分布，即最佳时替代损失下的报告是零概率的结果。着眼于应用，我们得出一个结果来检查在给定的多面体嵌入和低噪声假设下是否保持一致性，从而深入了解何时使用特定的嵌入。我们提供了在低噪声假设下将 $n = 2^{d}$ 结果嵌入到 $d$ 维单位立方体中以及将 $n = d!$ 结果嵌入到 $d$ 维置换面体中的示例。最后，我们证明，通过多个问题实例，我们可以学习整个单纯形上具有 $\frac{n}{2}$ 维度的模式。

Title: Time Series Forecasting with LLMs: Understanding and Enhancing Model Capabilities

Authors: Mingyu Jin, Hua Tang, Chong Zhang, Qinkai Yu, Chengzhi Liu, Suiyuan Zhu, Yongfeng Zhang, Mengnan Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10835
Pdf URL: https://arxiv.org/pdf/2402.10835
Copy Paste: [[2402.10835]] Time Series Forecasting with LLMs: Understanding and Enhancing Model Capabilities(https://arxiv.org/abs/2402.10835)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have been applied in many fields and have developed rapidly in recent years. As a classic machine learning task, time series forecasting has recently received a boost from LLMs. However, LLMs' preferences in this field remain under-explored. In this paper, by comparing LLMs with traditional models, we identify several properties of LLMs in time series prediction. For example, our study shows that LLMs excel in predicting time series with clear patterns and trends but face challenges with datasets lacking periodicity. We explain these findings by designing prompts that require LLMs to state the period of each dataset. In addition, we investigate the input strategy and find that incorporating external knowledge and adopting natural language paraphrases positively affect the predictive performance of LLMs for time series. Overall, this study contributes insight into the advantages and limitations of LLMs in time series forecasting under different conditions.
摘要：近年来，大型语言模型（LLM）在许多领域得到了快速发展。作为一项经典的机器学习任务，时间序列预测最近得到了法学硕士的大力推动。然而，法学硕士在该领域的偏好存在研究差距。本文通过将LLM与传统模型进行比较，发现LLM在时间序列预测方面的许多特性。例如，我们的研究表明，法学硕士擅长预测具有清晰模式和趋势的时间序列，但面临缺乏周期性的数据集的挑战。我们通过设计提示要求法学硕士说出数据集的周期来解释我们的发现。此外，对输入策略进行了研究，发现结合外部知识和采用自然语言释义对法学硕士对时间序列的预测性能有积极影响。总的来说，这项研究有助于深入了解法学硕士在不同条件下时间序列预测的优势和局限性。

Title: EcoRank: Budget-Constrained Text Re-ranking Using Large Language Models

Authors: Muhammad Shihab Rashid, Jannat Ara Meem, Yue Dong, Vagelis Hristidis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10866
Pdf URL: https://arxiv.org/pdf/2402.10866
Copy Paste: [[2402.10866]] EcoRank: Budget-Constrained Text Re-ranking Using Large Language Models(https://arxiv.org/abs/2402.10866)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have achieved state-of-the-art performance in text re-ranking. This process includes queries and candidate passages in the prompts, utilizing pointwise, listwise, and pairwise prompting strategies. A limitation of these ranking strategies with LLMs is their cost: the process can become expensive due to API charges, which are based on the number of input and output tokens. We study how to maximize the re-ranking performance given a budget, by navigating the vast search spaces of prompt choices, LLM APIs, and budget splits. We propose a suite of budget-constrained methods to perform text re-ranking using a set of LLM APIs. Our most efficient method, called EcoRank, is a two-layered pipeline that jointly optimizes decisions regarding budget allocation across prompt strategies and LLM APIs. Our experimental results on four popular QA and passage reranking datasets show that EcoRank outperforms other budget-aware supervised and unsupervised baselines.
摘要：大型语言模型 (LLM) 在文本重新排序方面取得了最先进的性能。该过程包括提示中的查询和候选段落，利用逐点、列表和成对提示策略。这些 LLM 排名策略的局限性在于其成本：由于 API 费用（基于输入和输出令牌的数量），该过程可能会变得昂贵。我们研究如何通过浏览提示选择、LLM API 和预算划分的巨大搜索空间，在给定预算的情况下最大化重新排名性能。我们提出了一套预算受限的方法来使用一组 LLM API 执行文本重新排名。我们最有效的方法称为 EcoRank，是一个两层管道，可联合优化有关跨提示策略和 LLM API 的预算分配的决策。我们在四个流行的 QA 和段落重排序数据集上的实验结果表明，EcoRank 优于其他预算感知的监督和无监督基线。

Title: Robust agents learn causal world models

Authors: Jonathan Richens, Tom Everitt
Subjects: cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10877
Pdf URL: https://arxiv.org/pdf/2402.10877
Copy Paste: [[2402.10877]] Robust agents learn causal world models(https://arxiv.org/abs/2402.10877)
Keywords: agent
Abstract: It has long been hypothesised that causal reasoning plays a fundamental role in robust and general intelligence. However, it is not known if agents must learn causal models in order to generalise to new domains, or if other inductive biases are sufficient. We answer this question, showing that any agent capable of satisfying a regret bound under a large set of distributional shifts must have learned an approximate causal model of the data generating process, which converges to the true causal model for optimal agents. We discuss the implications of this result for several research areas including transfer learning and causal inference.
摘要：长期以来，人们一直假设因果推理在稳健和通用智力中发挥着基础作用。然而，尚不清楚智能体是否必须学习因果模型才能推广到新领域，或者其他归纳偏差是否足够。我们回答了这个问题，表明任何能够在大量分布变化下满足后悔界限的智能体都必须学习数据生成过程的近似因果模型，该模型收敛到最佳智能体的真实因果模型。我们讨论了这一结果对包括迁移学习和因果推理在内的几个研究领域的影响。

Title: Multi-modal preference alignment remedies regression of visual instruction tuning on language model

Authors: Shengzhi Li, Rongyu Lin, Shichao Pei
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10884
Pdf URL: https://arxiv.org/pdf/2402.10884
Copy Paste: [[2402.10884]] Multi-modal preference alignment remedies regression of visual instruction tuning on language model(https://arxiv.org/abs/2402.10884)
Keywords: language model, llm
Abstract: In production, multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities. However, the current MLLMs trained with visual-question-answering (VQA) datasets could suffer from degradation, as VQA datasets lack the diversity and complexity of the original text instruction datasets which the underlying language model had been trained with. To address this challenging degradation, we first collect a lightweight (6k entries) VQA preference dataset where answers were annotated by Gemini for 5 quality metrics in a granular fashion, and investigate standard Supervised Fine-tuning, rejection sampling, Direct Preference Optimization (DPO), and SteerLM. Our findings indicate that with DPO we are able to surpass the instruction-following capabilities of the language model, achieving a 6.73 score on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99, despite the small data scale. This enhancement in textual instruction proficiency correlates with boosted visual instruction performance (+4.9% on MM-Vet, +6% on LLaVA-Bench), with minimal alignment tax on visual knowledge benchmarks compared to the previous RLHF approach. In conclusion, we propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that reconciles the textual and visual performance of MLLMs, restoring and boosting language capability after visual instruction tuning.
摘要：在生产中，多模态大语言模型（MLLM）有望支持互换图像和文本模态的多轮查询。然而，当前使用视觉问答（VQA）数据集训练的 MLLM 可能会出现退化，因为 VQA 数据集缺乏训练底层语言模型的原始文本指令数据集的多样性和复杂性。为了解决这种具有挑战性的退化问题，我们首先收集一个轻量级（6k 条目）VQA 偏好数据集，其中 Gemini 以细粒度方式对 5 个质量指标的答案进行注释，并研究标准监督微调、拒绝采样、直接偏好优化 (DPO)和 SteerLM。我们的研究结果表明，尽管数据规模较小，但借助 DPO，我们能够超越语言模型的指令跟踪能力，在 MT-Bench 上获得 6.73 分，而 Vicuna 的得分为 6.57，LLaVA 的得分为 5.99。文本指令熟练程度的提高与视觉指令性能的提高相关（MM-Vet 上 +4.9\%，LLaVA-Bench 上 +6\%），与之前的 RLHF 方法相比，视觉知识基准的对齐税最小。总之，我们提出了一种基于蒸馏的多模态对齐模型，在小数据集上具有细粒度注释，可以协调 MLLM 的文本和视觉性能，在视觉指令调整后恢复和增强语言能力。

Title: Reviewer2: Optimizing Review Generation Through Prompt Generation

Authors: Zhaolin Gao, Kianté Brantley, Thorsten Joachims
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10886
Pdf URL: https://arxiv.org/pdf/2402.10886
Copy Paste: [[2402.10886]] Reviewer2: Optimizing Review Generation Through Prompt Generation(https://arxiv.org/abs/2402.10886)
Keywords: llm, prompt
Abstract: Recent developments in LLMs offer new opportunities for assisting authors in improving their work. In this paper, we envision a use case where authors can receive LLM-generated reviews that uncover weak points in the current draft. While initial methods for automated review generation already exist, these methods tend to produce reviews that lack detail, and they do not cover the range of opinions that human reviewers produce. To address this shortcoming, we propose an efficient two-stage review generation framework called Reviewer2. Unlike prior work, this approach explicitly models the distribution of possible aspects that the review may address. We show that this leads to more detailed reviews that better cover the range of aspects that human reviewers identify in the draft. As part of the research, we generate a large-scale review dataset of 27k papers and 99k reviews that we annotate with aspect prompts, which we make available as a resource for future research.
摘要：法学硕士的最新发展为协助作者改进工作提供了新的机会。在本文中，我们设想了一个用例，作者可以接收法学硕士生成的评论，以发现当前草案中的弱点。虽然自动评论生成的初始方法已经存在，但这些方法往往会产生缺乏细节的评论，并且它们不涵盖人类评论者产生的意见范围。为了解决这个缺点，我们提出了一个高效的两阶段评审生成框架，称为 Reviewer2。与之前的工作不同，这种方法明确地模拟了审查可能涉及的可能方面的分布。我们表明，这会导致更详细的审查，更好地涵盖人工审查员在草案中确定的方面的范围。作为研究的一部分，我们生成了一个包含 27,000 篇论文和 99,000 篇评论的大规模评论数据集，我们用方面提示对其进行了注释，并将其作为未来研究的资源。

Title: When is Tree Search Useful for LLM Planning? It Depends on the Discriminator

Authors: Ziru Chen, Michael White, Raymond Mooney, Ali Payani, Yu Su, Huan Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10890
Pdf URL: https://arxiv.org/pdf/2402.10890
Copy Paste: [[2402.10890]] When is Tree Search Useful for LLM Planning? It Depends on the Discriminator(https://arxiv.org/abs/2402.10890)
Keywords: language model, llm, agent
Abstract: In this paper, we examine how large language models (LLMs) solve multi-step problems under a language agent framework with three components: a generator, a discriminator, and a planning method. We investigate the practical utility of two advanced planning methods, iterative correction and tree search. We present a comprehensive analysis of how discrimination accuracy affects the overall performance of agents when using these two methods or a simpler method, re-ranking. Experiments on two tasks, text-to-SQL parsing and mathematical reasoning, show that: (1) advanced planning methods demand discriminators with at least 90% accuracy to achieve significant improvements over re-ranking; (2) current LLMs' discrimination abilities have not met the needs of advanced planning methods to achieve such improvements; (3) with LLM-based discriminators, advanced planning methods may not adequately balance accuracy and efficiency. For example, compared to the other two methods, tree search is at least 10-20 times slower but leads to negligible performance gains, which hinders its real-world applications. Code and data will be released at https://github.com/OSU-NLP-Group/llm-planning-eval.
摘要：在本文中，我们研究了大型语言模型（LLM）如何在具有三个组件的语言代理框架下解决多步骤问题：生成器、判别器和规划方法。我们研究了两种高级规划方法（迭代校正和树搜索）的实际用途。我们对使用这两种方法或更简单的方法（重新排名）时区分准确性如何影响代理的整体表现进行了全面分析。对文本到 SQL 解析和数学推理两项任务的实验表明：（1）先进的规划方法要求判别器具有至少 90% 的准确率，以实现相对于重新排序的显着改进；（2）目前 LLM 的辨别能力尚未满足先进规划方法实现此类改进的需要；（3）对于基于 LLM 的判别器，先进的规划方法可能无法充分平衡准确性和效率。例如，与其他两种方法相比，树搜索至少慢 10-20 倍，但性能提升可以忽略不计，这阻碍了其实际应用。代码和数据将在 https://github.com/OSU-NLP-Group/llm-planning-eval 发布。
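Note: the re-ranking baseline named in the abstract (generate several candidates, score each with a discriminator, keep the best) can be illustrated with a minimal Python sketch. This is not the authors' code; the function names, the simulated discriminator, and the `accuracy` parameter are assumptions made here for illustration only.

```python
import random


def rerank(candidates, discriminator):
    """Return the candidate that the discriminator scores highest."""
    return max(candidates, key=discriminator)


def make_noisy_discriminator(is_correct, accuracy, rng):
    """Hypothetical discriminator that judges a candidate correctly
    with probability `accuracy` and flips its verdict otherwise."""
    def score(candidate):
        truth = is_correct(candidate)
        judged = truth if rng.random() < accuracy else not truth
        return 1.0 if judged else 0.0
    return score


# Toy usage: the "generator" proposed three answers to 2 + 2.
rng = random.Random(0)
disc = make_noisy_discriminator(lambda c: c == 4, accuracy=1.0, rng=rng)
best = rerank([3, 5, 4], disc)
```

With `accuracy=1.0` the sketch always selects the correct candidate; lowering `accuracy` lets wrong candidates win some of the time, which is the dependence on discrimination accuracy that the abstract analyzes.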

Title: Instruction Diversity Drives Generalization To Unseen Tasks

Authors: Dylan Zhang, Justin Wang, Francois Charton
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10891
Pdf URL: https://arxiv.org/pdf/2402.10891
Copy Paste: [[2402.10891]] Instruction Diversity Drives Generalization To Unseen Tasks(https://arxiv.org/abs/2402.10891)
Keywords: language model, llm
Abstract: Instruction tuning -- fine-tuning a large language model (LLM) on pairs of instructions and desired outcomes -- is an approach that enables pre-trained language models to perform real-world tasks and follow human instructions. Its practical success depends on the model learning a broader set of instructions than those it was trained on. Yet the factors that determine model generalization to such unseen tasks are not well understood. In this paper, we experiment with string rewrites, a symbolic task that serves as a building block for Turing-complete Markov algorithms while allowing experimental control of "inputs" and "instructions". We investigate the trade-off between the number of instructions the model is trained on and the number of training samples provided for each instruction and observe that the diversity of the instruction set determines generalization. Generalization emerges once a diverse enough set of tasks is provided, even though very few examples are provided for each task. Instruction diversity also ensures robustness with respect to non-uniform distributions of instructions in the training set.
摘要：指令调优——根据指令对和期望的结果微调大型语言模型（LLM）——是一种使预先训练的语言模型能够执行现实世界任务并遵循人类指令的方法。它的实际成功取决于模型学习比训练时更广泛的指令集。然而，决定模型泛化到此类未见过的任务的因素尚不清楚。在本文中，我们尝试了字符串重写，这是一种符号任务，可作为图灵完整马尔可夫算法的构建块，同时允许对“输入”和“指令”进行实验控制。我们研究了模型训练的指令数量和为每条指令提供的训练样本数量之间的权衡，并观察到指令集的多样性决定了泛化能力。一旦提供了足够多样化的任务集，泛化就会出现，即使为每个任务提供的示例很少。指令多样性还确保了训练集中指令的非均匀分布的鲁棒性。
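Note: a string rewrite, the symbolic task named in the abstract, can be sketched in a few lines of Python. The sketch below is a generic Markov-algorithm-style rewriter, not the paper's experimental setup; the function name, the rule format, and the `max_steps` guard are assumptions made here. Each rule plays the role of an "instruction" and the string plays the role of the "input".

```python
def apply_rules(text, rules, max_steps=1000):
    """Repeatedly apply the first rule whose pattern occurs in `text`,
    rewriting the leftmost occurrence, until no rule matches."""
    for _ in range(max_steps):
        for pattern, replacement in rules:
            if pattern in text:
                # Rewrite only the leftmost occurrence, then rescan the rules.
                text = text.replace(pattern, replacement, 1)
                break
        else:
            # No rule matched: the rewriting has terminated.
            return text
    raise RuntimeError("rewriting did not terminate within max_steps")


# A single rule "ba" -> "ab" moves every "a" in front of every "b".
result = apply_rules("baba", [("ba", "ab")])  # "baba" -> "abba" -> "abab" -> "aabb"
```

A single training example in this style would pair one rule and one input string with the rewritten output.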

Title: RLVF: Learning from Verbal Feedback without Overgeneralization

Authors: Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10893
Pdf URL: https://arxiv.org/pdf/2402.10893
Copy Paste: [[2402.10893]] RLVF: Learning from Verbal Feedback without Overgeneralization(https://arxiv.org/abs/2402.10893)
Keywords: language model, gpt, llm, prompt
Abstract: The diversity of contexts in which large language models (LLMs) are deployed requires the ability to modify or customize default model behaviors to incorporate nuanced requirements and preferences. A convenient interface to specify such model adjustments is high-level verbal feedback, such as "Don't use emojis when drafting emails to my boss." However, while writing high-level feedback is far simpler than collecting annotations for reinforcement learning from human feedback (RLHF), we find that simply prompting a model with such feedback leads to overgeneralization of the feedback to contexts where it is not relevant. We study the problem of incorporating verbal feedback without such overgeneralization, which motivates a new method, Contextualized Critiques with Constrained Preference Optimization (C3PO). C3PO uses a piece of high-level feedback to generate a small synthetic preference dataset specifying how the feedback should (and should not) be applied. It then fine-tunes the model in accordance with the synthetic preference data while minimizing the divergence from the original model for prompts where the feedback does not apply. Our experimental results indicate that our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts. For both human- and GPT-4-generated high-level feedback, C3PO effectively adheres to the given feedback comparably to in-context baselines while reducing overgeneralization by 30%.
摘要：部署大型语言模型 (LLM) 的环境的多样性要求能够修改或自定义默认模型行为，以纳入细微的要求和偏好。指定此类模型调整的一个方便界面是高级口头反馈，例如“在起草给我老板的电子邮件时不要使用表情符号”。然而，虽然编写高级反馈比从人类反馈 (RLHF) 中收集强化学习注释要简单得多，但我们发现，简单地用此类反馈提示模型会导致反馈过度泛化到不相关的上下文。我们研究了在不过度概括的情况下纳入口头反馈的问题，启发了一种新方法“情境化批评与约束偏好优化”（C3PO）。 C3PO 使用一段高级反馈来生成一个小型综合偏好数据集，指定应该（和不应该）如何应用反馈。然后，它根据综合偏好数据微调模型，同时最大限度地减少反馈不适用的提示与原始模型的差异。我们的实验结果表明，我们的方法有效地将口头反馈应用于相关场景，同时保留其他环境的现有行为。对于人类和 GPT-4 生成的高级反馈，C3PO 有效地遵循与上下文基线相当的给定反馈，同时将过度概括减少了 30%。