2024-05-08

Title: Evaluating Large Language Models for Material Selection

Authors: Daniele Grandi, Yash Patawari Jain, Allin Groom, Brandon Cramer, Christopher McComb
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03695
Pdf URL: https://arxiv.org/pdf/2405.03695
Copy Paste: [[2405.03695]] Evaluating Large Language Models for Material Selection(https://arxiv.org/abs/2405.03695)
Keywords: language model, llm, prompt
Abstract: Material selection is a crucial step in conceptual design due to its significant impact on the functionality, aesthetics, manufacturability, and sustainability of the final product. This study investigates the use of Large Language Models (LLMs) for material selection in the product design process and compares the performance of LLMs against expert choices for various design scenarios. By collecting a dataset of expert material preferences, the study provides a basis for evaluating how well LLMs can align with expert recommendations through prompt engineering and hyperparameter tuning. The divergence between LLM and expert recommendations is measured across different model configurations, prompt strategies, and temperature settings. This approach allows for a detailed analysis of factors influencing the LLMs' effectiveness in recommending materials. The results from this study highlight two failure modes, and identify parallel prompting as a useful prompt-engineering method when using LLMs for material selection. The findings further suggest that, while LLMs can provide valuable assistance, their recommendations often vary significantly from those of human experts. This discrepancy underscores the need for further research into how LLMs can be better tailored to replicate expert decision-making in material selection. This work contributes to the growing body of knowledge on how LLMs can be integrated into the design process, offering insights into their current limitations and potential for future improvements.

Title: GOVERN: Gradient Orientation Vote Ensemble for Multi-Teacher Reinforced Distillation

Authors: Wenjie Zhou, Zhenxin Ding, Xiaodong Zhang, Haibo Shi, Junfeng Wang, Dawei Yin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2405.03764
Pdf URL: https://arxiv.org/pdf/2405.03764
Copy Paste: [[2405.03764]] GOVERN: Gradient Orientation Vote Ensemble for Multi-Teacher Reinforced Distillation(https://arxiv.org/abs/2405.03764)
Keywords: language model
Abstract: Pre-trained language models have become an integral component of question-answering systems, achieving remarkable performance. For practical deployment, it is critical to carry out knowledge distillation to preserve high performance under computational constraints. In this paper, we address a key question: given the importance of unsupervised distillation for student performance, how does one effectively ensemble knowledge from multiple teachers at this stage without the guidance of ground-truth labels? We propose a novel algorithm, GOVERN, to tackle this issue. GOVERN has demonstrated significant improvements in both offline and online experiments. The proposed algorithm has been successfully deployed in a real-world commercial question-answering system.

Title: Detecting Anti-Semitic Hate Speech using Transformer-based Large Language Models

Authors: Dengyi Liu, Minghao Wang, Andrew G. Catlin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03794
Pdf URL: https://arxiv.org/pdf/2405.03794
Copy Paste: [[2405.03794]] Detecting Anti-Semitic Hate Speech using Transformer-based Large Language Models(https://arxiv.org/abs/2405.03794)
Keywords: language model, gpt, chat
Abstract: Academic researchers and social media entities grappling with the identification of hate speech face significant challenges, primarily due to the vast scale of data and the dynamic nature of hate speech. Given the ethical and practical limitations of large predictive models like ChatGPT in directly addressing such sensitive issues, our research has explored alternative advanced transformer-based and generative AI technologies since 2019. Specifically, we developed a new data labeling technique and established a proof of concept targeting anti-Semitic hate speech, utilizing a variety of transformer models such as BERT (arXiv:1810.04805), DistilBERT (arXiv:1910.01108), RoBERTa (arXiv:1907.11692), and LLaMA-2 (arXiv:2307.09288), complemented by the LoRA fine-tuning approach (arXiv:2106.09685). This paper delineates and evaluates the comparative efficacy of these cutting-edge methods in tackling the intricacies of hate speech detection, highlighting the need for responsible and carefully managed AI applications within sensitive contexts.

Title: Self-Improving Customer Review Response Generation Based on LLMs

Authors: Guy Azov, Tatiana Pelc, Adi Fledel Alon, Gila Kamhi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.03845
Pdf URL: https://arxiv.org/pdf/2405.03845
Copy Paste: [[2405.03845]] Self-Improving Customer Review Response Generation Based on LLMs(https://arxiv.org/abs/2405.03845)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Previous studies have demonstrated that proactive interaction with user reviews has a positive impact on the perception of app users and encourages them to submit revised ratings. Nevertheless, developers encounter challenges in managing a high volume of reviews, particularly in the case of popular apps with a substantial influx of daily reviews. Consequently, there is a demand for automated solutions aimed at streamlining the process of responding to user reviews. To address this, we have developed a new system for generating automatic responses by leveraging user-contributed documents with the help of retrieval-augmented generation (RAG) and advanced Large Language Models (LLMs). Our solution, named SCRABLE, represents an adaptive customer review response automation that enhances itself with self-optimizing prompts and a judging mechanism based on LLMs. Additionally, we introduce an automatic scoring mechanism that mimics the role of a human evaluator to assess the quality of responses generated in customer review domains. Extensive experiments and analyses conducted on real-world datasets reveal that our method is effective in producing high-quality responses, yielding an improvement of more than 8.5% compared to the baseline. Further validation through manual examination of the generated responses underscores the efficacy of our proposed system.

Title: Long Context Alignment with Short Instructions and Synthesized Positions

Authors: Wenhao Wu, Yizhong Wang, Yao Fu, Xiang Yue, Dawei Zhu, Sujian Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.03939
Pdf URL: https://arxiv.org/pdf/2405.03939
Copy Paste: [[2405.03939]] Long Context Alignment with Short Instructions and Synthesized Positions(https://arxiv.org/abs/2405.03939)
Keywords: language model, gpt, llm, long context
Abstract: Effectively handling instructions with extremely long context remains a challenge for Large Language Models (LLMs), typically necessitating high-quality long data and substantial computational resources. This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of LLMs in the alignment phase without requiring any effort beyond training at the original data length. SkipAlign is developed on the premise that long-range dependencies are fundamental to enhancing an LLM's capacity for long context. Rather than merely expanding the length of input samples, SkipAlign synthesizes long-range dependencies from the perspective of position indices. This is achieved by the strategic insertion of skipped positions within instruction-following samples, which utilizes the semantic structure of the data to effectively expand the context. Through extensive experiments on base models with a variety of context window sizes, SkipAlign demonstrates its effectiveness across a spectrum of long-context tasks. Particularly noteworthy is that, with a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
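The idea of inserting skipped positions can be illustrated with a minimal sketch. It assumes that gaps are placed at randomly chosen token boundaries and that gap sizes are drawn at random within the target window; the function name `skip_position_ids` and all of its parameters are hypothetical and are not taken from the paper, which places skips according to the semantic structure of the sample.

```python
import random


def skip_position_ids(seq_len, max_position, num_skips, seed=0):
    """Return strictly increasing position ids for seq_len tokens.

    Up to num_skips gaps are inserted so that the ids can reach as far as
    max_position - 1 even though only seq_len tokens are present.
    (Illustrative assumption: gap locations and sizes are random.)
    """
    if seq_len > max_position:
        raise ValueError("sequence is longer than the target context window")
    rng = random.Random(seed)
    budget = max_position - seq_len            # positions available to skip
    num_skips = min(num_skips, seq_len - 1)    # never skip before token 0
    skip_points = sorted(rng.sample(range(1, seq_len), num_skips))
    # Sorted random cut points; consecutive differences give the gap sizes,
    # so the gaps sum to at most `budget`.
    cuts = sorted(rng.randint(0, budget) for _ in range(num_skips))
    gaps = [b - a for a, b in zip([0] + cuts, cuts)]
    gap_at = dict(zip(skip_points, gaps))

    ids, pos = [], 0
    for i in range(seq_len):
        pos += gap_at.get(i, 0)   # jump ahead where a skip is scheduled
        ids.append(pos)
        pos += 1
    return ids


if __name__ == "__main__":
    # 8 tokens whose position ids are spread over a 64-position window.
    print(skip_position_ids(seq_len=8, max_position=64, num_skips=3))
```

The sample stays short, but the relative distances between its tokens now resemble those of a much longer input, which is the effect the abstract attributes to skipped positions.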

Title: Utilizing GPT to Enhance Text Summarization: A Strategy to Minimize Hallucinations

Authors: Hassan Shakil, Zeydy Ortiz, Grant C. Forbes
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.04039
Pdf URL: https://arxiv.org/pdf/2405.04039
Copy Paste: [[2405.04039]] Utilizing GPT to Enhance Text Summarization: A Strategy to Minimize Hallucinations(https://arxiv.org/abs/2405.04039)
Keywords: gpt, hallucination
Abstract: In this research, we use the DistilBERT model to generate extractive summaries and the T5 model to generate abstractive summaries. We also generate hybrid summaries by combining the DistilBERT and T5 models. Central to our research is the implementation of a GPT-based refining process to minimize the common problem of hallucinations in AI-generated summaries. We evaluate the unrefined summaries and, after refining, also assess the refined summaries using a range of traditional and novel metrics, demonstrating marked improvements in the accuracy and reliability of the summaries. Results highlight significant improvements in reducing hallucinatory content, thereby increasing the factual integrity of the summaries.

Title: Evaluating Text Summaries Generated by Large Language Models Using OpenAI's GPT

Authors: Hassan Shakil, Atqiya Munawara Mahi, Phuoc Nguyen, Zeydy Ortiz, Mamoun T. Mardini
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.04053
Pdf URL: https://arxiv.org/pdf/2405.04053
Copy Paste: [[2405.04053]] Evaluating Text Summaries Generated by Large Language Models Using OpenAI's GPT(https://arxiv.org/abs/2405.04053)
Keywords: language model, gpt
Abstract: This research examines the effectiveness of OpenAI's GPT models as independent evaluators of text summaries generated by six transformer-based models from Hugging Face: DistilBART, BERT, ProphetNet, T5, BART, and PEGASUS. We evaluated these summaries based on essential properties of a high-quality summary (conciseness, relevance, coherence, and readability) using traditional metrics such as ROUGE and Latent Semantic Analysis (LSA). Uniquely, we also employed GPT not as a summarizer but as an evaluator, allowing it to independently assess summary quality without predefined metrics. Our analysis revealed significant correlations between GPT evaluations and traditional metrics, particularly in assessing relevance and coherence. The results demonstrate GPT's potential as a robust tool for evaluating text summaries, offering insights that complement established metrics and providing a basis for comparative analysis of transformer-based models in natural language processing tasks.

Title: FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference

Authors: Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.04065
Pdf URL: https://arxiv.org/pdf/2405.04065
Copy Paste: [[2405.04065]] FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference(https://arxiv.org/abs/2405.04065)
Keywords: language model, llm, long context
Abstract: Retrieval-Augmented Language Modeling (RALM), which integrates large language models (LLMs) with relevant documents from an external corpus, is a proven method for enabling the LLM to generate information beyond the scope of its pre-training corpus. Previous work that utilizes retrieved content by simply prepending it to the input incurs a high runtime cost, which degrades the inference efficiency of the LLMs because they fail to use the Key-Value (KV) cache efficiently. In this paper, we propose FlashBack, a modular RALM designed to improve the inference efficiency of RALM with an appending context pattern while maintaining decent performance after specific fine-tuning, without heavily disrupting the knowledge integrity of the LLM. FlashBack appends retrieved documents at the end of the context, instead of prepending them, to utilize the KV cache efficiently. Our experiment shows that the inference speed of FlashBack is up to 4× faster than the prepending method on a 7B LLM (Llama 2). By bypassing unnecessary re-computation, it achieves significantly faster inference, and this heightened efficiency will substantially reduce inference cost. Our code will be publicly available.
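Why appending helps the KV cache can be shown with a small counting sketch. It assumes a causal LM that can reuse cached keys and values only for the prefix shared between two consecutive inputs; the helper names and the token counts below are illustrative assumptions, not figures from the paper.

```python
def common_prefix_len(a, b):
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def recomputed_tokens(prev_input, new_input):
    """Tokens whose keys/values must be recomputed for new_input,
    assuming cached entries are reusable only for the shared prefix."""
    return len(new_input) - common_prefix_len(prev_input, new_input)


# Stand-in token ids: a 1000-token running context and two different
# 128-token retrieved documents (the retriever swaps doc_a for doc_b).
context = list(range(1000))
doc_a = [-1] * 128
doc_b = [-2] * 128

# Prepending: the changed document sits at the front, so nothing is shared.
prepend_cost = recomputed_tokens(doc_a + context, doc_b + context)
# Appending: the context is a shared prefix, only the document is recomputed.
append_cost = recomputed_tokens(context + doc_a, context + doc_b)

if __name__ == "__main__":
    print("prepend recomputes", prepend_cost, "tokens")
    print("append recomputes", append_cost, "tokens")
```

In this toy setting, swapping the retrieved document invalidates the entire cache under prepending, while under appending only the document tokens are recomputed. Real systems also grow the context with generated tokens, which this sketch leaves out.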

Title: Optimizing Language Model's Reasoning Abilities with Weak Supervision

Authors: Yongqi Tong, Sizhe Wang, Dawei Li, Yifan Wang, Simeng Han, Zi Lin, Chengsong Huang, Jiaxin Huang, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.04086
Pdf URL: https://arxiv.org/pdf/2405.04086
Copy Paste: [[2405.04086]] Optimizing Language Model's Reasoning Abilities with Weak Supervision(https://arxiv.org/abs/2405.04086)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) have demonstrated proficiency in handling complex queries, much of the past work has depended on datasets extensively annotated by human experts. However, this reliance on fully-supervised annotations poses scalability challenges, particularly as models and data requirements grow. To mitigate this, we explore the potential of enhancing LLMs' reasoning abilities with minimal human supervision. In this work, we introduce self-reinforcement, which begins with Supervised Fine-Tuning (SFT) of the model using a small collection of annotated questions. Then it iteratively improves LLMs by learning from the differences in responses from the SFT and unfinetuned models on unlabeled questions. Our method provides an efficient approach without relying heavily on extensive human-annotated explanations. However, current reasoning benchmarks typically only include golden-reference answers or rationales. Therefore, we present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales across various domains, such as brainteasers, puzzles, riddles, parajumbles, and critical reasoning tasks. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing less supervised data to boost LLMs' inference capabilities. Our experiments underscore the significance of PuzzleBen, as well as the effectiveness of our methodology as a promising direction in future endeavors. Our dataset and code will be published soon on Anonymity Link.

Title: A Causal Explainable Guardrails for Large Language Models

Authors: Zhixuan Chu, Yan Wang, Longfei Li, Zhibo Wang, Zhan Qin, Kui Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.04160
Pdf URL: https://arxiv.org/pdf/2405.04160
Copy Paste: [[2405.04160]] A Causal Explainable Guardrails for Large Language Models(https://arxiv.org/abs/2405.04160)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases. Existing methods for steering LLMs towards desired attributes often assume unbiased representations and rely solely on steering prompts. However, the representations learned from pre-training can introduce semantic biases that influence the steering process, leading to suboptimal results. We propose LLMGuardaril, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. LLMGuardaril systematically identifies and blocks the confounding effects of biases, enabling the extraction of unbiased steering representations. Additionally, it includes an explainable component that provides insights into the alignment between the generated output and the desired direction. Experiments demonstrate LLMGuardaril's effectiveness in steering LLMs towards desired attributes while mitigating biases. Our work contributes to the development of safe and reliable LLMs that align with desired attributes. We discuss the limitations and future research directions, highlighting the need for ongoing research to address the ethical implications of large language models.

Title: MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization

Authors: Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.04163
Pdf URL: https://arxiv.org/pdf/2405.04163
Copy Paste: [[2405.04163]] MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization(https://arxiv.org/abs/2405.04163)
Keywords: language model
Abstract: This work presents a dynamic vocabulary adaptation strategy, MEDVOC, for fine-tuning pre-trained language models (PLMs) like BertSumAbs, BART, and PEGASUS for improved medical text summarization. In contrast to existing domain adaptation approaches in summarization, MEDVOC treats vocabulary as an optimizable parameter and optimizes the PLM vocabulary based on fragment score conditioned only on the downstream task's reference summaries. Unlike previous works on vocabulary adaptation (limited only to classification tasks), optimizing vocabulary based on summarization tasks requires an extremely costly intermediate fine-tuning step on large summarization datasets. To that end, our novel fragment score-based hyperparameter search very significantly reduces this fine-tuning time -- from 450 days to less than 2 days on average. Furthermore, while previous works on vocabulary adaptation are often primarily tied to single PLMs, MEDVOC is designed to be deployable across multiple PLMs (with varying model vocabulary sizes, pre-training objectives, and model sizes) -- bridging the limited vocabulary overlap between the biomedical literature domain and PLMs. MEDVOC outperforms baselines by 15.74% in terms of Rouge-L in zero-shot setting and shows gains of 17.29% in high Out-Of-Vocabulary (OOV) concentrations. Our human evaluation shows MEDVOC generates more faithful medical summaries (88% compared to 59% in baselines). We make the codebase publicly available at this https URL.

Title: D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models

Authors: Duygu Altinok
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.04170
Pdf URL: https://arxiv.org/pdf/2405.04170
Copy Paste: [[2405.04170]] D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models(https://arxiv.org/abs/2405.04170)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have garnered significant attention and widespread usage due to their impressive performance in various tasks. However, they are not without their own set of challenges, including issues such as hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning. Evaluating LLMs in miscellaneous reasoning tasks remains an active area of research. Prior to the breakthrough of LLMs, Transformers had already proven successful in the medical domain, effectively employed for various natural language understanding (NLU) tasks. Following this trend, LLMs have also been trained and utilized in the medical domain, raising concerns regarding factual accuracy, adherence to safety protocols, and inherent limitations. In this paper, we focus on evaluating the natural language inference capabilities of popular open-source and closed-source LLMs using clinical trial reports as the dataset. We present the performance results of each LLM and further analyze their performance on a development set, particularly focusing on challenging instances that involve medical abbreviations and require numerical-quantitative reasoning. Gemini, our leading LLM, achieved a test set F1-score of 0.748, securing the ninth position on the task scoreboard. Our work is the first of its kind, offering a thorough examination of the inference capabilities of LLMs within the medical domain.

Title: Iterative Experience Refinement of Software-Developing Agents

Authors: Chen Qian, Jiahao Li, Yufan Dang, Wei Liu, YiFei Wang, Zihao Xie, Weize Chen, Cheng Yang, Yingli Zhang, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.MA, cs.SE
Abstract URL: https://arxiv.org/abs/2405.04219
Pdf URL: https://arxiv.org/pdf/2405.04219
Copy Paste: [[2405.04219]] Iterative Experience Refinement of Software-Developing Agents(https://arxiv.org/abs/2405.04219)
Keywords: language model, llm, agent
Abstract: Autonomous agents powered by large language models (LLMs) show significant potential for achieving high autonomy in various scenarios such as software development. Recent research has shown that LLM agents can leverage past experiences to reduce errors and enhance efficiency. However, the static experience paradigm, reliant on a fixed collection of past experiences acquired heuristically, lacks iterative refinement and thus hampers agents' adaptability. In this paper, we introduce the Iterative Experience Refinement framework, enabling LLM agents to refine experiences iteratively during task execution. We propose two fundamental patterns: the successive pattern, refining based on nearest experiences within a task batch, and the cumulative pattern, acquiring experiences across all previous task batches. Augmented with our heuristic experience elimination, the method prioritizes high-quality and frequently-used experiences, effectively managing the experience space and enhancing efficiency. Extensive experiments show that while the successive pattern may yield superior results, the cumulative pattern provides more stable performance. Moreover, experience elimination facilitates achieving better performance using just 11.54% of a high-quality subset.

Title: Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore

Authors: Junchao Wu, Runzhe Zhan, Derek F. Wong, Shu Yang, Xuebo Liu, Lidia S. Chao, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.04286
Pdf URL: https://arxiv.org/pdf/2405.04286
Copy Paste: [[2405.04286]] Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore(https://arxiv.org/abs/2405.04286)
Keywords: language model, llm
Abstract: The efficacy of a large language model (LLM) generated text detector depends substantially on the availability of sizable training data. White-box zero-shot detectors, which require no such data, are nonetheless limited by the accessibility of the source model of the LLM-generated text. In this paper, we propose a simple but effective black-box zero-shot detection approach, predicated on the observation that human-written texts typically contain more grammatical errors than LLM-generated texts. This approach entails computing the Grammar Error Correction Score (GECScore) for the given text to distinguish between human-written and LLM-generated text. Extensive experimental results show that our method outperforms current state-of-the-art (SOTA) zero-shot and supervised methods, achieving an average AUROC of 98.7% and showing strong robustness against paraphrase and adversarial perturbation attacks.
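The detection idea can be sketched as follows: correct the grammar of a text, measure how similar the corrected text is to the original, and treat texts that barely change as more likely LLM-generated. This is a minimal sketch under stated assumptions: the corrector is a toy dictionary lookup (`toy_correct`, hypothetical), word-level `difflib.SequenceMatcher` similarity stands in for whatever similarity metric the paper uses, and the threshold value is arbitrary.

```python
from difflib import SequenceMatcher


def gec_score(text, correct):
    """Similarity in [0, 1] between a text and its grammar-corrected form.

    `correct` is any callable that maps a string to a corrected string.
    """
    corrected = correct(text)
    return SequenceMatcher(None, text.split(), corrected.split()).ratio()


def toy_correct(text):
    """Toy stand-in for a grammar error correction model (assumption)."""
    fixes = {"teh": "the", "dont": "don't", "recieve": "receive"}
    return " ".join(fixes.get(word, word) for word in text.split())


def looks_llm_generated(text, correct, threshold=0.95):
    """Flag texts that need almost no correction (threshold is illustrative)."""
    return gec_score(text, correct) >= threshold


if __name__ == "__main__":
    for sample in ["the cat sat on the mat", "teh cat dont sit"]:
        print(sample, "->", round(gec_score(sample, toy_correct), 2))
```

Because the detector only needs a grammar corrector and a similarity function, it requires neither training data nor access to the model that produced the text, which matches the black-box zero-shot setting described in the abstract.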

Title: Accelerating Speculative Decoding using Dynamic Speculation Length

Authors: Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Accelerating Speculative Decoding using Dynamic Speculation Length(https://arxiv.org/abs/)
Keywords: language model
Abstract: Speculative decoding is a promising method for reducing the inference latency of large language models. The effectiveness of the method depends on the speculation length (SL) - the number of tokens generated by the draft model at each iteration. The vast majority of speculative decoding approaches use the same SL for all iterations. In this work, we show that this practice is suboptimal. We introduce DISCO, a DynamIc SpeCulation length Optimization method that uses a classifier to dynamically adjust the SL at each iteration, while provably preserving the decoding quality. Experiments with four benchmarks demonstrate average speedup gains of 10.3% relative to our best baselines.
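A toy greedy speculative decoding loop makes the role of the speculation length concrete. This is a sketch under assumptions, not the DISCO implementation: `target_next`, `draft_next`, and `keep_drafting` are hypothetical callables, the "classifier" is a fixed rule, and verification is written token by token where a real system would use one batched target pass.

```python
def speculative_decode(target_next, draft_next, keep_drafting,
                       prompt, max_new, max_sl=8):
    """Greedy speculative decoding with a per-iteration speculation length.

    target_next(seq) and draft_next(seq) return the next token for seq;
    keep_drafting(seq, n_drafted) decides whether to draft one more token.
    Returns the generated sequence and the number of verification passes.
    """
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new:
        # Draft phase: the classifier, not a fixed SL, decides when to stop.
        draft = []
        while len(draft) < max_sl and keep_drafting(out + draft, len(draft)):
            draft.append(draft_next(out + draft))
        # Verify phase: keep the longest draft prefix the target agrees with.
        target_passes += 1
        accepted = []
        for tok in draft:
            if tok != target_next(out + accepted):
                break
            accepted.append(tok)
        # The target always contributes one token (a correction or a bonus).
        accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new], target_passes


def toy_target(seq):
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    """Toy draft model: agrees with the target except right after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def toy_classifier(seq, n_drafted):
    """Toy stopping rule standing in for a learned classifier."""
    return n_drafted < 3


if __name__ == "__main__":
    tokens, passes = speculative_decode(
        toy_target, toy_draft, toy_classifier, prompt=[0], max_new=12)
    print(tokens, "target passes:", passes)
```

Since every emitted token is either verified or produced by the target, the output equals plain greedy decoding with the target regardless of how the stopping rule behaves; the rule only changes how many target passes are needed.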

Title: Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation

Authors: Atharvan Dogra, Ameet Deshpande, John Nay, Tanmay Rajpurohit, Ashwin Kalyan, Balaraman Ravindran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.04325
Pdf URL: https://arxiv.org/pdf/2405.04325
Copy Paste: [[2405.04325]] Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation(https://arxiv.org/abs/2405.04325)
Keywords: language model, llm, agent
Abstract: Recent developments in large language models (LLMs), while offering a powerful foundation for developing natural language agents, raise safety concerns about them and the autonomous agents built upon them. Deception is one potential capability of AI agents of particular concern, which we refer to as an act or statement that misleads, hides the truth, or promotes a belief that is not true in its entirety or in part. We move away from the conventional understanding of deception through straight-out lying, making objective selfish decisions, or giving false information, as seen in previous AI safety research. We target a specific category of deception achieved through obfuscation and equivocation. We broadly explain the two types of deception by analogizing them with the rabbit-out-of-hat magic trick, where (i) the rabbit either comes out of a hidden trap door or (ii) (our focus) the audience is completely distracted to see the magician bring out the rabbit right in front of them using sleight of hand or misdirection. Our novel testbed framework displays intrinsic deception capabilities of LLM agents in a goal-driven environment when directed to be deceptive in their natural language generations in a two-agent adversarial dialogue system built upon the legislative task of "lobbying" for a bill. Along the lines of a goal-driven environment, we show developing deceptive capacity through a reinforcement learning setup, building it around the theories of language philosophy and cognitive psychology. We find that the lobbyist agent increases its deceptive capabilities by ~40% (relative) through subsequent reinforcement trials of adversarial interactions, and our deception detection mechanism shows a detection capability of up to 92%. Our results highlight potential issues in agent-human interaction, with agents potentially manipulating humans towards their programmed end-goal.
摘要：大型语言模型（LLM）的最新发展虽然为开发自然语言代理提供了强大的基础，但也引起了人们对它们以及基于它们的自主代理的安全担忧。欺骗是特别令人关注的人工智能代理的一种潜在能力，我们将其称为误导、隐藏真相或宣扬完全或部分不真实的信念的行为或声明。正如之前的人工智能安全研究中所见，我们摆脱了对欺骗的传统理解，通过直接撒谎、做出客观的自私决定或提供虚假信息。我们针对的是通过混淆和模棱两可实现的特定类别的欺骗。我们通过将这两种类型的欺骗与“帽子里的兔子”魔术进行类比，来大致解释这两种类型的欺骗，其中（i）兔子要么从隐藏的活板门中出来，要么（ii）（我们的焦点）观众完全分心看到魔术师使用花招或误导将兔子带到他们面前。我们新颖的测试平台框架展示了法学硕士代理人在目标驱动环境中的内在欺骗能力，当他们在基于“游说”法案的立法任务构建的两个代理人对抗性对话系统中被指示在自然语言生成中进行欺骗时。沿着目标驱动的环境，我们展示了通过强化学习设置来开发欺骗能力，并围绕语言哲学和认知心理学理论构建它。我们发现，通过随后的对抗性交互强化试验，说客代理人的欺骗能力提高了约 40%（相对），并且我们的欺骗检测机制显示出高达 92% 的检测能力。我们的结果强调了智能体与人类交互中的潜在问题，智能体可能会操纵人类实现其编程的最终目标。

Title: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Authors: DeepSeek-AI
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.04434
Pdf URL: https://arxiv.org/pdf/2405.04434
Copy Paste: [[2405.04434]] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(https://arxiv.org/abs/2405.04434)
Keywords: language model, chat
Abstract: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at this https URL.
摘要：我们提出了 DeepSeek-V2，一种强大的混合专家 (MoE) 语言模型，具有训练经济、推理高效的特点。它包含 236B 总参数，其中每个 token 激活 21B，支持 128K token 的上下文长度。DeepSeek-V2 采用了包括多头潜在注意力 (MLA) 和 DeepSeekMoE 在内的创新架构。MLA 通过将键值 (KV) 缓存显著压缩为潜在向量来保证高效推理，而 DeepSeekMoE 通过稀疏计算以经济的成本训练强大的模型。与 DeepSeek 67B 相比，DeepSeek-V2 实现了显著增强的性能，同时节省了 42.5% 的训练成本、减少了 93.3% 的 KV 缓存、并将最大生成吞吐量提升至 5.76 倍。我们在由 8.1T 标记组成的高质量多源语料库上对 DeepSeek-V2 进行了预训练，并进一步执行监督微调 (SFT) 和强化学习 (RL) 以充分释放其潜力。评估结果表明，即使只有 21B 激活参数，DeepSeek-V2 及其聊天版本仍然在开源模型中实现了顶级性能。模型检查点可在“此 https URL”中找到。
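
Note: the headline figures in the DeepSeek-V2 abstract are easier to compare once converted to fractions. The short Python sketch below only does arithmetic on the numbers quoted in the abstract; the function and constant names are illustrative and are not taken from the paper or its code release.

```python
# Figures quoted in the abstract; the names below are illustrative only.
TOTAL_PARAMS_B = 236.0        # total parameters, in billions
ACTIVE_PARAMS_B = 21.0        # parameters activated per token, in billions
KV_CACHE_REDUCTION = 0.933    # reported KV-cache reduction vs. DeepSeek 67B
TRAINING_COST_SAVING = 0.425  # reported training-cost saving vs. DeepSeek 67B


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_b / total_b


def remaining_fraction(reduction: float) -> float:
    """Fraction of a quantity left after a relative reduction."""
    return 1.0 - reduction


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    kv_left = remaining_fraction(KV_CACHE_REDUCTION)
    cost_left = remaining_fraction(TRAINING_COST_SAVING)
    print(f"parameters active per token: {share:.1%}")
    print(f"KV cache remaining:          {kv_left:.1%}")
    print(f"training cost remaining:     {cost_left:.1%}")
```

Read this way, roughly 9% of the parameters are active for each token, and the KV cache shrinks to about one fifteenth of the DeepSeek 67B baseline.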

Title: Toward In-Context Teaching: Adapting Examples to Students' Misconceptions

Authors: Alexis Ross, Jacob Andreas
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.04495
Pdf URL: https://arxiv.org/pdf/2405.04495
Copy Paste: [[2405.04495]] Toward In-Context Teaching: Adapting Examples to Students' Misconceptions(https://arxiv.org/abs/2405.04495)
Keywords: language model, llm
Abstract: When a teacher provides examples for a student to study, these examples must be informative, enabling a student to progress from their current state toward a target concept or skill. Good teachers must therefore simultaneously infer what students already know and adapt their teaching to students' changing state of knowledge. There is increasing interest in using computational models, particularly large language models, as pedagogical tools. As students, language models in particular have shown a remarkable ability to adapt to new tasks given small numbers of examples. But how effectively can these models adapt as teachers to students of different types? To study this question, we introduce a suite of models and evaluation methods we call AdapT. AdapT has two components: (1) a collection of simulated Bayesian student models that can be used for evaluation of automated teaching methods; (2) a platform for evaluation with human students, to characterize the real-world effectiveness of these methods. We additionally introduce (3) AToM, a new probabilistic model for adaptive teaching that jointly infers students' past beliefs and optimizes for the correctness of future beliefs. In evaluations of simulated students across three learning domains (fraction arithmetic, English morphology, function learning), AToM systematically outperforms LLM-based and standard Bayesian teaching models. In human experiments, both AToM and LLMs outperform non-adaptive random example selection. Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.
摘要：当教师为学生提供学习示例时，这些示例必须提供丰富的信息，使学生能够从当前状态向目标概念或技能进步。因此，优秀的教师必须同时推断学生已经知道的内容，并根据学生不断变化的知识状态调整教学。人们越来越有兴趣使用计算模型，特别是大型语言模型作为教学工具。作为学生，语言模型尤其表现出了在少量示例的情况下适应新任务的非凡能力。但作为教师，这些模型如何有效地适应不同类型的学生呢？为了研究这个问题，我们引入了一套模型和评估方法，我们称之为 AdapT。 AdapT 有两个组成部分：（1）模拟贝叶斯学生模型的集合，可用于评估自动化教学方法；（2）一个与人类学生一起评估的平台，以表征这些方法在现实世界中的有效性。我们还引入了 (3) AToM，这是一种用于适应性教学的新概率模型，可以共同推断学生过去的信念并优化未来信念的正确性。在对三个学习领域（分数算术、英语形态学、函数学习）的模拟学生进行评估时，AToM 系统地优于基于 LLM 的标准贝叶斯教学模型。在人体实验中，AToM 和 LLM 都优于非自适应随机示例选择。我们的结果凸显了自适应教学任务的难度以及学习自适应模型解决该任务的潜力。

Title: A Transformer with Stack Attention

Authors: Jiaoda Li, Jennifer C. White, Mrinmaya Sachan, Ryan Cotterell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.04515
Pdf URL: https://arxiv.org/pdf/2405.04515
Copy Paste: [[2405.04515]] A Transformer with Stack Attention(https://arxiv.org/abs/2405.04515)
Keywords: language model
Abstract: Natural languages are believed to be (mildly) context-sensitive. Despite underpinning remarkably capable large language models, transformers are unable to model many context-free language tasks. In an attempt to address this limitation in the modeling power of transformer-based language models, we propose augmenting them with a differentiable, stack-based attention mechanism. Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model. We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
摘要：自然语言被认为是（轻度）上下文相关的。尽管支持非常强大的大型语言模型，但 Transformer 无法对许多上下文无关的语言任务进行建模。为了解决基于 Transformer 的语言模型的建模能力的这一限制，我们建议使用可微的、基于堆栈的注意力机制来增强它们。我们基于堆栈的注意力机制可以合并到任何基于转换器的语言模型中，并为模型增加一定程度的可解释性。我们证明，添加基于堆栈的注意力机制使转换器能够对一些但不是全部的确定性上下文无关语言进行建模。

Title: NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

Authors: Shudan Zhang, Hanlin Zhao, Xiao Liu, Qinkai Zheng, Zehan Qi, Xiaotao Gu, Xiaohan Zhang, Yuxiao Dong, Jie Tang
Subjects: cs.CL, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2405.04520
Pdf URL: https://arxiv.org/pdf/2405.04520
Copy Paste: [[2405.04520]] NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts(https://arxiv.org/abs/2405.04520)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have manifested strong ability to generate code for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks on algorithms and data science, and do not sufficiently cover the challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty of creating test cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Compared with manual solutions, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores could still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfactory on NCB. The evaluation toolkit and development set are available at this https URL.
摘要：大型语言模型（LLM）表现出为生产活动生成代码的强大能力。然而，当前的代码合成基准，例如 HumanEval、MBPP 和 DS-1000，主要面向算法和数据科学的入门任务，不足以满足现实世界编码中普遍存在的挑战性要求。为了填补这一空白，我们提出了 NaturalCodeBench (NCB)，这是一个具有挑战性的代码基准测试，旨在反映实际编码任务中的复杂性和多样性场景。 NCB 包含 402 个 Python 和 Java 的高质量问题，这些问题是从在线编码服务的自然用户查询中精心挑选的，涵盖 6 个不同的领域。注意到为现实世界的查询创建测试用例非常困难，我们还引入了半自动化管道来提高测试用例构建的效率。与人工解决方案相比，实现效率提升4倍以上。我们对 39 个法学硕士进行的系统实验发现，HumanEval 分数接近的模型之间的 NCB 性能差距仍然很大，这表明缺乏对实际代码合成场景的关注或对 HumanEval 的过度指定优化。另一方面，即使是性能最好的 GPT-4 在 NCB 上仍然远远不能令人满意。评估工具包和开发集可从此 https URL 获取。

Title: QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Authors: Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han
Subjects: cs.CL, cs.AI, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2405.04532
Pdf URL: https://arxiv.org/pdf/2405.04532
Copy Paste: [[2405.04532]] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving(https://arxiv.org/abs/2405.04532)
Keywords: language model, llm
Abstract: Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in the QoQ algorithm, we introduce progressive quantization that can allow low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen1.5-72B by 2.4x on A100, 3.5x on L40S, compared to TensorRT-LLM. Remarkably, QServe on L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. Code is available at this https URL.
摘要：量化可以加速大型语言模型 (LLM) 推理。除了 INT8 量化之外，研究界正在积极探索更低精度的量化，例如 INT4。尽管如此，最先进的 INT4 量化技术只能加速小批量、边缘 LLM 推理，无法在大批量、基于云的 LLM 服务中提供性能提升。我们发现了一个关键问题：在 GPU 上对权重或部分和进行反量化时，现有的 INT4 量化方法会遭受巨大的运行时开销 (20-90%)。为了应对这一挑战，我们引入了 QoQ，一种具有 4 位权重、8 位激活和 4 位 KV 缓存的 W4A8KV4 量化算法。 QoQ 代表 quattuor-octo-quattuor，在拉丁语中代表 4-8-4。 QoQ 由 QServe 推理库实现，可实现测量加速。驱动 QServe 的关键见解是，在 GPU 上服务的 LLM 的效率受到低吞吐量 CUDA 核心上的操作的严重影响。基于这一见解，我们在 QoQ 算法中引入了渐进式量化，可以在 W4A8 GEMM 中实现较低的反量化开销。此外，我们开发了 SmoothAttention 来有效缓解 4 位 KV 量化带来的精度下降。在 QServe 系统中，我们执行计算感知权重重新排序，并利用寄存器级并行性来减少反量化延迟。我们还利用 KV4 量化带来的性能增益，使融合注意力受到内存限制。因此，QServe 将 Llama-3-8B 在 A100 上可实现的最大服务吞吐量提高了 1.2 倍，在 L40S 上提高了 1.4 倍；与 TensorRT-LLM 相比，Qwen1.5-72B 在 A100 上提高了 2.4 倍，在 L40S 上提高了 3.5 倍。值得注意的是，L40S GPU 上的 QServe 可以实现比 A100 上的 TensorRT-LLM 更高的吞吐量。因此，QServe 有效地将 LLM 服务的美元成本降低了 3 倍。代码可从此 https URL 获取。
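
Note: for readers unfamiliar with 4-bit weights, the sketch below shows plain symmetric round-to-nearest INT4 quantization of one weight group in pure Python. It is a generic textbook scheme chosen for illustration, not QoQ's progressive quantization or SmoothAttention, and the function names and the per-group scale are assumptions. The `dequantize_int4` step stands in for the kind of dequantization work that, according to the abstract, adds runtime overhead on GPUs.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # value range of a signed 4-bit integer


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Map one group of weights to signed 4-bit codes plus a shared scale.

    Symmetric scheme: the largest magnitude in the group maps to +/-7.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # An all-zero (or empty) group gets a dummy scale to avoid dividing by zero.
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [min(INT4_MAX, max(INT4_MIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int4(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights; the error per weight is at most scale / 2."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    group = [0.70, -0.35, 0.10, 0.0]
    codes, scale = quantize_int4(group)
    restored = dequantize_int4(codes, scale)
    for original, code, approx in zip(group, codes, restored):
        print(f"{original:+.3f} -> code {code:+d} -> {approx:+.3f}")
```

Each weight is stored as a 4-bit code and the group shares one floating-point scale, so the rounding error per weight is bounded by half the scale.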