2024-03-05

Title: PRECISE Framework: GPT-based Text For Improved Readability, Reliability, and Understandability of Radiology Reports For Patient-Centered Care

Authors: Satvik Tripathi, Liam Mutter, Meghana Muppuri, Suhani Dheer, Emiliano Garza-Frias, Komal Awan, Aakash Jha, Michael Dezube, Azadeh Tabari, Christopher P. Bridge, Dania Daye
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00788
Pdf URL: https://arxiv.org/pdf/2403.00788
Copy Paste: [[2403.00788]] PRECISE Framework: GPT-based Text For Improved Readability, Reliability, and Understandability of Radiology Reports For Patient-Centered Care(https://arxiv.org/abs/2403.00788)
Keywords: gpt
Abstract: This study introduces and evaluates the PRECISE framework, utilizing OpenAI's GPT-4 to enhance patient engagement by providing clearer and more accessible chest X-ray reports at a sixth-grade reading level. The framework was tested on 500 reports, demonstrating significant improvements in readability, reliability, and understandability. Statistical analyses confirmed the effectiveness of the PRECISE approach, highlighting its potential to foster patient-centric care delivery in healthcare decision-making.
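The abstract above targets a sixth-grade reading level but does not say which readability measure was used. As an illustration only, the sketch below assumes the Flesch-Kincaid Grade Level formula with a crude vowel-group syllable counter. The two sample sentences are invented for the example and do not come from the paper.

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

# Invented sample texts, not taken from the study.
original = "No focal consolidation, pleural effusion, or pneumothorax is identified."
simplified = "Your lungs look clear. There is no fluid or air around them."

if __name__ == "__main__":
    print(f"original:   grade {flesch_kincaid_grade(original):.1f}")
    print(f"simplified: grade {flesch_kincaid_grade(simplified):.1f}")
```

A lower score means easier text. A production pipeline would likely use a dictionary-based syllable counter rather than this heuristic.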

Title: Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models

Authors: Zachary Horvitz, Jingru Chen, Rahul Aditya, Harshvardhan Srivastava, Robert West, Zhou Yu, Kathleen McKeown
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00794
Pdf URL: https://arxiv.org/pdf/2403.00794
Copy Paste: [[2403.00794]] Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models(https://arxiv.org/abs/2403.00794)
Keywords: language model, gpt, llm
Abstract: Humor is a fundamental facet of human cognition and interaction. Yet, despite recent advances in natural language processing, humor detection remains a challenging task that is complicated by the scarcity of datasets that pair humorous texts with similar non-humorous counterparts. In our work, we investigate whether large language models (LLMs) can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to 'unfun' jokes, as judged by humans and as measured on the downstream task of humor detection. We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators and provides challenging adversarial examples for humor classifiers.

Title: Executing Natural Language-Described Algorithms with Large Language Models: An Investigation

Authors: Xin Zheng, Qiming Zhu, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00795
Pdf URL: https://arxiv.org/pdf/2403.00795
Copy Paste: [[2403.00795]] Executing Natural Language-Described Algorithms with Large Language Models: An Investigation(https://arxiv.org/abs/2403.00795)
Keywords: language model, gpt, llm
Abstract: Executing computer programs described in natural language has long been a pursuit of computer science. With the advent of enhanced natural language understanding capabilities exhibited by large language models (LLMs), the path toward this goal has been illuminated. In this paper, we seek to examine the capacity of present-day LLMs to comprehend and execute algorithms outlined in natural language. We established an algorithm test set sourced from Introduction to Algorithms, a well-known textbook that contains many representative, widely used algorithms. To systematically assess LLMs' code execution abilities, we selected 30 algorithms, generated 300 randomly sampled instances in total, and evaluated whether popular LLMs can understand and execute these algorithms. Our findings reveal that LLMs, notably GPT-4, can effectively execute programs described in natural language, as long as no heavy numeric computation is involved. We believe our findings contribute to evaluating LLMs' code execution abilities and would encourage further investigation and application of the computational power of LLMs.

Title: An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning

Authors: Zui Chen, Yezeng Chen, Jiaqi Han, Zhijie Huang, Ji Qi, Yi Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00799
Pdf URL: https://arxiv.org/pdf/2403.00799
Copy Paste: [[2403.00799]] An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning(https://arxiv.org/abs/2403.00799)
Keywords: language model, llm
Abstract: Large language models (LLMs) are displaying emergent abilities for math reasoning tasks, and there is growing attention on enhancing the ability of open-source LLMs through supervised fine-tuning (SFT). In this paper, we aim to explore a general data strategy for supervised data to help optimize and expand math reasoning ability. Firstly, we determine the ability boundary of reasoning paths augmentation by identifying these paths' minimal optimal set. Secondly, we validate that different abilities of the model can be cumulatively enhanced by a Mix of Minimal Optimal Sets (MMOS) of corresponding types of data, while our MMOS models achieve SOTA performance on a series of base models under much lower construction costs. Besides, we point out that GSM-HARD is not really hard and that today's LLMs no longer lack numerical robustness. Also, we provide an Auto Problem Generator for robustness testing and educational applications. Our code and data are publicly available at https://github.com/cyzhh/MMOS.

Title: Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes

Authors: Yezeng Chen, Zui Chen, Yi Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00800
Pdf URL: https://arxiv.org/pdf/2403.00800
Copy Paste: [[2403.00800]] Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes(https://arxiv.org/abs/2403.00800)
Keywords: language model
Abstract: Although large language models demonstrate emergent abilities in solving math word problems, complex multi-step mathematical reasoning remains challenging. To improve model performance on mathematical reasoning tasks, previous work has conducted supervised fine-tuning on open-source models by improving the quality and quantity of data. In this paper, we propose a novel approach, named Brain, that imitates human thought processes to enhance mathematical reasoning abilities, using the Frontal Lobe Model to generate plans and then employing the Parietal Lobe Model to generate and execute code to obtain answers. First, we achieve SOTA performance in comparison with Code LLaMA 7B based models through this method. Second, we find that plans can be explicitly extracted from natural language, code, or formal language. Our code and data are publicly available at https://github.com/cyzhh/Brain.

Title: Abdelhak at SemEval-2024 Task 9 : Decoding Brainteasers, The Efficacy of Dedicated Models Versus ChatGPT

Authors: Abdelhak Kelious, Mounir Okirim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00809
Pdf URL: https://arxiv.org/pdf/2403.00809
Copy Paste: [[2403.00809]] Abdelhak at SemEval-2024 Task 9 : Decoding Brainteasers, The Efficacy of Dedicated Models Versus ChatGPT(https://arxiv.org/abs/2403.00809)
Keywords: gpt, chat
Abstract: This study introduces a dedicated model aimed at solving the BRAINTEASER task (SemEval-2024 Task 9), a novel challenge designed to assess models' lateral thinking capabilities through sentence and word puzzles. Our model demonstrates remarkable efficacy, securing Rank 1 in sentence puzzle solving during the test phase with an overall score of 0.98. Additionally, we explore the comparative performance of ChatGPT, specifically analyzing how variations in temperature settings affect its ability to engage in lateral thinking and problem-solving. Our findings indicate a notable performance disparity between the dedicated model and ChatGPT, underscoring the potential of specialized approaches in enhancing creative reasoning in AI.

Title: LoRA Meets Dropout under a Unified Framework

Authors: Sheng Wang, Liheng Chen, Jiyue Jiang, Boyang Xue, Lingpeng Kong, Chuan Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00812
Pdf URL: https://arxiv.org/pdf/2403.00812
Copy Paste: [[2403.00812]] LoRA Meets Dropout under a Unified Framework(https://arxiv.org/abs/2403.00812)
Keywords: language model, llm
Abstract: With their remarkable capabilities, large language models (LLMs) have emerged as essential elements in numerous NLP applications, while parameter-efficient finetuning, especially LoRA, has gained popularity as a lightweight approach for model customization. Meanwhile, various dropout methods, initially designed for full finetuning with all the parameters updated, alleviate overfitting associated with excessive parameter redundancy. Hence, a possible contradiction arises between the negligible trainable parameters of LoRA and the effectiveness of previous dropout methods, which has been largely overlooked. To fill this gap, we first confirm that parameter-efficient LoRA is also overfitting-prone. We then revisit transformer-specific dropout methods, and establish their equivalence and distinctions mathematically and empirically. Building upon this comparative analysis, we introduce a unified framework for a comprehensive investigation, which instantiates these methods based on dropping position, structural pattern and compensation measure. Through this framework, we reveal their new preferences and performance comparisons when only limited trainable parameters are involved. This framework also allows us to amalgamate the most favorable aspects into a novel dropout method named HiddenKey. Extensive experiments verify the remarkable superiority and sufficiency of HiddenKey across multiple models and tasks, which highlights it as the preferred approach for high-performance and parameter-efficient finetuning of LLMs.
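For readers unfamiliar with the setting, the sketch below shows a plain LoRA forward pass with dropout applied to the input of the low-rank branch. This is the standard LoRA formulation, h = W x + (alpha / r) B A x, and is not the paper's HiddenKey method. The helper names and toy matrices are hypothetical and chosen only for illustration.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def dropout(vector, p, rng, training):
    """Inverted dropout: zero each element with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(vector)
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vector]

def lora_forward(x, W, A, B, alpha, p, rng, training=True):
    """Frozen base projection W x plus the scaled low-rank update B(A(dropout(x)))."""
    rank = len(A)                                   # A has shape (rank, d_in)
    base = matvec(W, x)                             # frozen pretrained weights
    low = matvec(A, dropout(x, p, rng, training))   # down-projection to rank dims
    delta = matvec(B, low)                          # up-projection; B has shape (d_out, rank)
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]

# Toy values (hypothetical): identity base weights, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]
B_zero = [[0.0], [0.0]]   # LoRA initializes B to zero, so the adapter starts as a no-op
B_ones = [[1.0], [1.0]]
x = [2.0, 4.0]
rng = random.Random(0)

if __name__ == "__main__":
    print(lora_forward(x, W, A, B_zero, alpha=2.0, p=0.5, rng=rng))
    print(lora_forward(x, W, A, B_ones, alpha=2.0, p=0.5, rng=rng, training=False))
```

Only A and B are trained in LoRA. The abstract's question is where and how dropout should act when so few parameters are trainable, and this sketch fixes one position (the adapter input) purely as an example.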

Title: UrbanGPT: Spatio-Temporal Large Language Models

Authors: Zhonghang Li, Lianghao Xia, Jiabin Tang, Yong Xu, Lei Shi, Long Xia, Dawei Yin, Chao Huang
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2403.00813
Pdf URL: https://arxiv.org/pdf/2403.00813
Copy Paste: [[2403.00813]] UrbanGPT: Spatio-Temporal Large Language Models(https://arxiv.org/abs/2403.00813)
Keywords: language model, gpt, llm
Abstract: Spatio-temporal prediction aims to forecast and gain insights into the ever-changing dynamics of urban environments across both time and space. Its purpose is to anticipate future patterns, trends, and events in diverse facets of urban life, including transportation, population movement, and crime rates. Although numerous efforts have been dedicated to developing neural network techniques for accurate predictions on spatio-temporal data, many of these methods depend heavily on having sufficient labeled data to generate precise spatio-temporal representations. Unfortunately, data scarcity is pervasive in practical urban sensing scenarios. Consequently, it becomes necessary to build a spatio-temporal model with strong generalization capabilities across diverse spatio-temporal learning scenarios. Taking inspiration from the remarkable achievements of large language models (LLMs), our objective is to create a spatio-temporal LLM that can exhibit exceptional generalization capabilities across a wide range of downstream urban tasks. To achieve this objective, we present UrbanGPT, which seamlessly integrates a spatio-temporal dependency encoder with the instruction-tuning paradigm. This integration enables LLMs to comprehend the complex inter-dependencies across time and space, facilitating more comprehensive and accurate predictions under data scarcity. To validate the effectiveness of our approach, we conduct extensive experiments on various public datasets, covering different spatio-temporal prediction tasks. The results demonstrate that UrbanGPT, with its carefully designed architecture, consistently outperforms state-of-the-art baselines. These findings highlight the potential of building large language models for spatio-temporal learning, particularly in zero-shot scenarios where labeled data is scarce.

Title: DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models

Authors: Wei He, Kai Han, Yehui Tang, Chengcheng Wang, Yujie Yang, Tianyu Guo, Yunhe Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00818
Pdf URL: https://arxiv.org/pdf/2403.00818
Copy Paste: [[2403.00818]] DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models(https://arxiv.org/abs/2403.00818)
Keywords: language model, llm
Abstract: Large language models (LLMs) face a daunting challenge due to the excessive computational and memory requirements of the commonly used Transformer architecture. While state space models (SSMs) are a new type of foundational network architecture offering lower computational complexity, their performance has yet to fully rival that of Transformers. This paper introduces DenseSSM, a novel approach to enhance the flow of hidden information between layers in SSMs. By selectively integrating shallow-layer hidden states into deeper layers, DenseSSM retains fine-grained information crucial for the final output. Even with the added dense connections, DenseSSM maintains training parallelizability and inference efficiency. The proposed method is widely applicable to various SSM types such as RetNet and Mamba. With similar model size, DenseSSM achieves significant improvements, exemplified by DenseRetNet outperforming the original RetNet with up to 5% accuracy improvement on public benchmarks.

Title: Information Flow Routes: Automatically Interpreting Language Models at Scale

Authors: Javier Ferrando, Elena Voita
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00824
Pdf URL: https://arxiv.org/pdf/2403.00824
Copy Paste: [[2403.00824]] Information Flow Routes: Automatically Interpreting Language Models at Scale(https://arxiv.org/abs/2403.00824)
Keywords: language model
Abstract: Information flows by routes inside the network via mechanisms implemented in the model. These routes can be represented as graphs where nodes correspond to token representations and edges to operations inside the network. We automatically build these graphs in a top-down manner, for each prediction leaving only the most important nodes and edges. In contrast to the existing workflows relying on activation patching, we do this through attribution: this allows us to efficiently uncover existing circuits with just a single forward pass. Additionally, the applicability of our method is far beyond patching: we do not need a human to carefully design prediction templates, and we can extract information flow routes for any prediction (not just the ones among the allowed templates). As a result, we can talk about model behavior in general, for specific types of predictions, or different domains. We experiment with Llama 2 and show that the role of some attention heads is overall important, e.g. previous token heads and subword merging heads. Next, we find similarities in Llama 2 behavior when handling tokens of the same part of speech. Finally, we show that some model components can be specialized on domains such as coding or multilingual texts.

Title: LLMGuard: Guarding Against Unsafe LLM Behavior

Authors: Shubh Goyal, Medha Hira, Shubham Mishra, Sukriti Goyal, Arnav Goel, Niharika Dadu, Kirushikesh DB, Sameep Mehta, Nishtha Madaan
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00826
Pdf URL: https://arxiv.org/pdf/2403.00826
Copy Paste: [[2403.00826]] LLMGuard: Guarding Against Unsafe LLM Behavior(https://arxiv.org/abs/2403.00826)
Keywords: language model, llm
Abstract: Although the rise of Large Language Models (LLMs) in enterprise settings brings new opportunities and capabilities, it also brings challenges, such as the risk of generating inappropriate, biased, or misleading content that violates regulations and can raise legal concerns. To alleviate this, we present "LLMGuard", a tool that monitors user interactions with an LLM application and flags content against specific behaviours or conversation topics. To do this robustly, LLMGuard employs an ensemble of detectors.
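The abstract does not describe LLMGuard's detectors or its interface, so the following is only a minimal sketch of the general "ensemble of detectors" pattern. The two toy detectors, the blocked topic list, and all function names are hypothetical and are not taken from the tool.

```python
import re
from typing import Callable, List, Optional

# A detector inspects a message and returns a flag description, or None if nothing is found.
Detector = Callable[[str], Optional[str]]

def email_detector(text: str) -> Optional[str]:
    """Toy detector (hypothetical): flags strings that look like email addresses."""
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text):
        return "possible email address"
    return None

def blocked_topic_detector(text: str) -> Optional[str]:
    """Toy detector (hypothetical): flags messages that mention a blocked topic."""
    blocked = ("politics", "religion")
    hits = [topic for topic in blocked if topic in text.lower()]
    return f"blocked topic: {', '.join(hits)}" if hits else None

def run_ensemble(text: str, detectors: List[Detector]) -> List[str]:
    """Run every detector on the text and collect all raised flags."""
    flags = []
    for detector in detectors:
        result = detector(text)
        if result is not None:
            flags.append(result)
    return flags

DETECTORS: List[Detector] = [email_detector, blocked_topic_detector]

if __name__ == "__main__":
    for message in ("Contact me at a@b.com", "What do you think about politics?", "Hello!"):
        print(message, "->", run_ensemble(message, DETECTORS))
```

Keeping each detector as an independent callable means new checks can be added or removed without touching the monitoring loop.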

Title: Self-Refinement of Language Models from External Proxy Metrics Feedback

Authors: Keshav Ramji, Young-Suk Lee, Ramón Fernandez Astudillo, Md Arafat Sultan, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00827
Pdf URL: https://arxiv.org/pdf/2403.00827
Copy Paste: [[2403.00827]] Self-Refinement of Language Models from External Proxy Metrics Feedback(https://arxiv.org/abs/2403.00827)
Keywords: language model, llm, chat, agent
Abstract: It is often desirable for Large Language Models (LLMs) to capture multiple objectives when providing a response. In document-grounded response generation, for example, agent responses are expected to be relevant to a user's query while also being grounded in a given document. In this paper, we introduce Proxy Metric-based Self-Refinement (ProMiSe), which enables an LLM to refine its own initial response along key dimensions of quality guided by external metrics feedback, yielding an overall better final response. ProMiSe leverages feedback on response quality through principle-specific proxy metrics, and iteratively refines its response one principle at a time. We apply ProMiSe to open source language models Flan-T5-XXL and Llama-2-13B-Chat, to evaluate its performance on document-grounded question answering datasets, MultiDoc2Dial and QuAC, demonstrating that self-refinement improves response quality. We further show that fine-tuning Llama-2-13B-Chat on the synthetic dialogue data generated by ProMiSe yields significant performance improvements over the zero-shot baseline as well as a supervised fine-tuned model on human annotated data.
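The refinement loop described above can be sketched roughly as follows. This is not the authors' implementation: the accept-only-if-the-metric-improves rule, the threshold, the round limit, the word-overlap metrics, and the rule-based `toy_refine` function standing in for the LLM are all assumptions made for illustration.

```python
from typing import Callable, Dict

Metric = Callable[[str], float]        # proxy metric: response -> score in [0, 1]
Refiner = Callable[[str, str], str]    # (response, principle) -> revised response

def refine_by_principle(response: str, metrics: Dict[str, Metric], refine: Refiner,
                        threshold: float = 0.9, max_rounds: int = 3) -> str:
    """Refine one principle at a time; keep a revision only if its proxy metric improves."""
    for principle, metric in metrics.items():
        for _ in range(max_rounds):
            if metric(response) >= threshold:
                break                              # good enough on this principle
            candidate = refine(response, principle)
            if metric(candidate) > metric(response):
                response = candidate               # accept the improvement
            else:
                break                              # no progress; move to next principle
    return response

# Toy setup (hypothetical): a tiny "document" vocabulary and two query keywords.
DOC_WORDS = set("the library opens at 9 am and closes at 5 pm on weekdays".split())
QUERY_WORDS = {"library", "opens"}

def groundedness(response: str) -> float:
    words = response.lower().split()
    return sum(w in DOC_WORDS for w in words) / len(words) if words else 0.0

def relevance(response: str) -> float:
    return len(QUERY_WORDS & set(response.lower().split())) / len(QUERY_WORDS)

def toy_refine(response: str, principle: str) -> str:
    """Stand-in for an LLM revision step: drops the first word not supported by the document."""
    words = response.split()
    if principle == "groundedness":
        for i, word in enumerate(words):
            if word.lower() not in DOC_WORDS:
                return " ".join(words[:i] + words[i + 1:])
    return response

if __name__ == "__main__":
    metrics = {"groundedness": groundedness, "relevance": relevance}
    print(refine_by_principle("maybe the library usually opens at 9 am", metrics, toy_refine))
```

In a real system the refiner would be a prompted LLM and the metrics would be learned or reference-based scorers. The loop structure is the part this sketch is meant to show.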

Title: Deep Learning Detection Method for Large Language Models-Generated Scientific Content

Authors: Bushra Alhijawi, Rawan Jarrar, Aseel AbuAlRub, Arwa Bader
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00828
Pdf URL: https://arxiv.org/pdf/2403.00828
Copy Paste: [[2403.00828]] Deep Learning Detection Method for Large Language Models-Generated Scientific Content(https://arxiv.org/abs/2403.00828)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs), such as GPT-3 and BERT, reshape how textual content is written and communicated. These models have the potential to generate scientific content that is indistinguishable from that written by humans. Hence, LLMs carry severe consequences for the scientific community, which relies on the integrity and reliability of publications. This research paper presents AI-Catcher, a novel method for detecting ChatGPT-generated scientific text. AI-Catcher integrates two deep learning models, a multilayer perceptron (MLP) and a convolutional neural network (CNN). The MLP learns the feature representations of the linguistic and statistical features. The CNN extracts high-level representations of the sequential patterns from the textual content. AI-Catcher is a multimodal model that fuses hidden patterns derived from the MLP and CNN. In addition, a new dataset of ChatGPT-generated scientific text, AIGTxt, is collected to enhance AI-generated text detection tools. AIGTxt contains 3000 records collected from published academic articles across ten domains and divided into three classes: Human-written, ChatGPT-generated, and Mixed text. Several experiments are conducted to evaluate the performance of AI-Catcher. The comparative results demonstrate the capability of AI-Catcher to distinguish between human-written and ChatGPT-generated scientific text more accurately than alternative methods. On average, AI-Catcher improved accuracy by 37.4%.

Title: CLLMs: Consistency Large Language Models

Authors: Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00835
Pdf URL: https://arxiv.org/pdf/2403.00835
Copy Paste: [[2403.00835]] CLLMs: Consistency Large Language Models(https://arxiv.org/abs/2403.00835)
Keywords: language model, llm
Abstract: Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference, as they break the sequential nature of the LLM decoding process and transform it into parallelizable computation. However, in practice, Jacobi decoding achieves little speedup compared to traditional autoregressive (AR) decoding, primarily because it seldom accurately predicts more than one token in a single fixed-point iteration step. To address this, we develop a new approach aimed at realizing fast convergence from any state to the fixed point on a Jacobi trajectory. This is accomplished by refining the target LLM to consistently predict the fixed point given any state as input. Extensive experiments demonstrate the effectiveness of our method, showing 2.4× to 3.4× improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks.
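To make the fixed-point idea concrete, here is a minimal sketch of Jacobi decoding next to ordinary autoregressive decoding. The `toy_next_token` function is a hypothetical stand-in for greedy next-token prediction by an LLM. The consistency training that the paper proposes is not shown.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]

def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical deterministic 'model': the next token depends on the last two tokens."""
    if len(prefix) >= 2:
        return (prefix[-1] * 3 + prefix[-2] + 1) % 7
    return (prefix[-1] + 1) % 7

def autoregressive_decode(prompt: Sequence[int], n: int, f: NextToken) -> List[int]:
    """Standard greedy decoding: one token per step, each conditioned on all previous ones."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(f(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt: Sequence[int], n: int, f: NextToken,
                  init_token: int = 0) -> Tuple[List[int], int]:
    """Jacobi decoding: update all n positions from the previous guess until nothing changes."""
    guess = [init_token] * n
    iterations = 0
    while True:
        iterations += 1
        # Each position i uses only the previous iteration's guess, so on real hardware
        # these n calls can run as a single batched forward pass.
        updated = [f(list(prompt) + guess[:i]) for i in range(n)]
        if updated == guess:            # fixed point reached
            return guess, iterations
        guess = updated

if __name__ == "__main__":
    prompt, n = [1, 2], 6
    tokens, iterations = jacobi_decode(prompt, n, toy_next_token)
    print("jacobi:", tokens, "iterations:", iterations)
    print("autoregressive:", autoregressive_decode(prompt, n, toy_next_token))
```

After k iterations at least the first k tokens match the autoregressive output, so the loop ends within n + 1 iterations. The speedup comes only when several tokens become correct in the same iteration, which is the behaviour the abstract says the refined model is trained toward.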

Title: EyeGPT: Ophthalmic Assistant with Large Language Models

Authors: Xiaolan Chen, Ziwei Zhao, Weiyi Zhang, Pusheng Xu, Le Gao, Mingpu Xu, Yue Wu, Yinwen Li, Danli Shi, Mingguang He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00840
Pdf URL: https://arxiv.org/pdf/2403.00840
Copy Paste: [[2403.00840]] EyeGPT: Ophthalmic Assistant with Large Language Models(https://arxiv.org/abs/2403.00840)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: Artificial intelligence (AI) has gained significant attention in healthcare consultation due to its potential to improve clinical workflow and enhance medical communication. However, owing to the complex nature of medical information, large language models (LLMs) trained with general world knowledge might not possess the capability to tackle medical-related tasks at an expert level. Here, we introduce EyeGPT, a specialized LLM designed specifically for ophthalmology, using three optimization strategies: role-playing, finetuning, and retrieval-augmented generation. In particular, we propose a comprehensive evaluation framework that encompasses a diverse dataset, covering various subspecialties of ophthalmology, different users, and diverse inquiry intents. Moreover, we consider multiple evaluation metrics, including accuracy, understandability, trustworthiness, empathy, and the proportion of hallucinations. By assessing the performance of different EyeGPT variants, we identify the most effective one, which exhibits levels of understandability, trustworthiness, and empathy comparable to those of human ophthalmologists (all Ps > 0.05). Overall, our study provides valuable insights for future research, facilitating comprehensive comparisons and evaluations of different strategies for developing specialized LLMs in ophthalmology. The potential benefits include enhancing the patient experience in eye care and optimizing ophthalmologists' services.

Title: NewsBench: Systematic Evaluation of LLMs for Writing Proficiency and Safety Adherence in Chinese Journalistic Editorial Applications

Authors: Miao Li, Ming-Bin Chen, Bo Tang, Shengbin Hou, Pengyu Wang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Keming Mao, Peng Cheng, Yi Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00862
Pdf URL: https://arxiv.org/pdf/2403.00862
Copy Paste: [[2403.00862]] NewsBench: Systematic Evaluation of LLMs for Writing Proficiency and Safety Adherence in Chinese Journalistic Editorial Applications(https://arxiv.org/abs/2403.00862)
Keywords: language model, gpt, llm
Abstract: This study presents NewsBench, a novel benchmark framework developed to evaluate the capability of Large Language Models (LLMs) in Chinese Journalistic Writing Proficiency (JWP) and their Safety Adherence (SA), addressing the gap between journalistic ethics and the risks associated with AI utilization. Comprising 1,267 tasks across 5 editorial applications and 7 aspects (including safety and journalistic writing with 4 detailed facets), and spanning 24 news topic domains, NewsBench employs two GPT-4 based automatic evaluation protocols validated by human assessment. Our comprehensive analysis of 11 LLMs highlighted GPT-4 and ERNIE Bot as top performers, yet revealed a relative deficiency in adherence to journalistic ethics during creative writing tasks. These findings underscore the need for enhanced ethical guidance in AI-generated journalistic content, marking a step forward in aligning AI capabilities with journalistic standards and safety considerations.

Title: SoftTiger: A Clinical Foundation Model for Healthcare Workflows

Authors: Ye Chen, Igor Couto, Wei Cai, Cong Fu, Bruno Dorneles
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00868
Pdf URL: https://arxiv.org/pdf/2403.00868
Copy Paste: [[2403.00868]] SoftTiger: A Clinical Foundation Model for Healthcare Workflows(https://arxiv.org/abs/2403.00868)
Keywords: language model, gpt, llm, long context
Abstract: We release and introduce SoftTiger, a clinical large language model (CLaM) designed as a foundation model for healthcare workflows. The narrative and unstructured nature of clinical notes is a major obstacle for healthcare intelligentization. We address a critical problem of structuring clinical notes into clinical data, according to international interoperability standards. We collect and annotate data for three critical subtasks, namely, international patient summary, clinical impression and medical encounter. We then perform supervised fine-tuning of a state-of-the-art LLM using public and credentialed clinical data. The training is orchestrated so that the target model first supports basic clinical tasks such as abbreviation expansion and temporal information extraction, and then learns to perform more complex downstream clinical tasks such as impression and encounter summary. Moreover, we address several modeling challenges in the healthcare context, e.g., an extra-long context window. Our blind pairwise evaluation shows that SoftTiger outperforms other popular open-source models and GPT-3.5, is comparable to Gemini-pro, and has only a mild gap from GPT-4. We believe that LLMs may become a stepping stone towards healthcare digitalization and democratization. Therefore, we publicly release SoftTiger models at scales of 13 billion and 70 billion parameters, as well as datasets and code for our innovative scalable evaluation, in the hope of making a significant contribution to the healthcare industry.

Title: Word Order and World Knowledge

Authors: Qinghua Zhao, Vinit Ravishankar, Nicolas Garneau, Anders Søgaard
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00876
Pdf URL: https://arxiv.org/pdf/2403.00876
Copy Paste: [[2403.00876]] Word Order and World Knowledge(https://arxiv.org/abs/2403.00876)
Keywords: language model
Abstract: Word order is an important concept in natural language, and in this work, we study how word order affects the induction of world knowledge from raw text using language models. We use word analogies to probe for such knowledge. Specifically, in addition to the natural word order, we first extract texts in six fixed word orders from each of five languages and then pretrain the language models on these texts. Finally, we analyze the experimental results of the fixed word orders on word analogies and show that i) certain fixed word orders consistently outperform or underperform others, though the specifics vary across languages, and ii) the Wov2Lex hypothesis does not hold in pre-trained language models, and the natural word order typically yields mediocre results. The source code will be made publicly available at https://github.com/lshowway/probing_by_analogy.
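
The word-analogy probe mentioned in the abstract can be illustrated with the standard vector-offset formulation ("a is to b as c is to ?"). This is a minimal sketch: the two-dimensional `toy_vectors` are hypothetical, and the paper's actual probing setup may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by the nearest neighbour of b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The query words themselves are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical embeddings: dimension 0 ~ "royalty", dimension 1 ~ "gender".
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}

answer = solve_analogy("man", "woman", "king", toy_vectors)
print(answer)
```

In a probing study of this kind, the fraction of analogies answered correctly serves as a proxy for how much relational knowledge the learned representations encode.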

Title: DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Authors: Kedi Chen, Qin Chen, Jie Zhou, Yishen He, Liang He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00896
Pdf URL: https://arxiv.org/pdf/2403.00896
Copy Paste: [[2403.00896]] DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models(https://arxiv.org/abs/2403.00896)
Keywords: language model, gpt, llm, hallucination, prompt, chat
Abstract: Although large language models (LLMs) have achieved significant success in recent years, hallucination remains a challenge, and numerous benchmarks have been proposed to detect it. Nevertheless, some of these benchmarks are not naturally generated by LLMs but are intentionally induced. Also, many focus merely on factuality hallucination while ignoring faithfulness hallucination. Additionally, although the dialogue pattern is more widely used in the era of LLMs, current benchmarks concentrate only on sentence-level and passage-level hallucination. In this study, we propose DiaHalu, to our knowledge the first dialogue-level hallucination evaluation benchmark. Initially, we integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5 instances. Subsequently, we manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios. Finally, professional scholars annotate all the samples in the dataset. DiaHalu covers four common multi-turn dialogue domains and five hallucination subtypes, extended from factuality and faithfulness hallucination. Experiments with several well-known LLMs and detection methods on the dataset show that DiaHalu is a challenging benchmark, holding significant value for further research.

Title: MediSwift: Efficient Sparse Pre-trained Biomedical Language Models

Authors: Vithursan Thangarasa, Mahmoud Salem, Shreyas Saxena, Kevin Leong, Joel Hestness, Sean Lie
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00952
Pdf URL: https://arxiv.org/pdf/2403.00952
Copy Paste: [[2403.00952]] MediSwift: Efficient Sparse Pre-trained Biomedical Language Models(https://arxiv.org/abs/2403.00952)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are typically trained on general source data for various domains, but a recent surge in domain-specific LLMs has shown their potential to outperform general-purpose models in domain-specific tasks (e.g., biomedicine). Although domain-specific pre-training enhances efficiency and leads to smaller models, the computational costs of training these LLMs remain high, posing budgeting challenges. We introduce MediSwift, a suite of biomedical LMs that leverage sparse pre-training on domain-specific biomedical text data. By inducing up to 75% weight sparsity during the pre-training phase, MediSwift achieves a 2-2.5x reduction in training FLOPs. Notably, all sparse pre-training was performed on the Cerebras CS-2 system, which is specifically designed to realize the acceleration benefits from unstructured weight sparsity, thereby significantly enhancing the efficiency of the MediSwift models. Through subsequent dense fine-tuning and strategic soft prompting, MediSwift models outperform existing LLMs up to 7B parameters on biomedical tasks, setting new benchmarks with respect to efficiency and accuracy on tasks such as PubMedQA. Our results show that sparse pre-training, along with dense fine-tuning and soft prompting, offers an effective method for creating high-performing, computationally efficient models in specialized domains.
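
As a rough sanity check on how 75% weight sparsity could translate into a 2-2.5x FLOP reduction, the arithmetic below uses an Amdahl-style model. It assumes that only part of the training compute sits in the sparsified weight matrices; that share (`sparsifiable`) is a hypothetical parameter, not a figure from the paper.

```python
def flop_reduction(sparsity: float, sparsifiable: float) -> float:
    """Ratio of dense to sparse training FLOPs under an Amdahl-style model.

    sparsity:      fraction of weights zeroed in the sparsified layers
                   (0.75 is the upper bound quoted in the abstract).
    sparsifiable:  ASSUMED fraction of total FLOPs spent in those layers.
    """
    remaining = (1.0 - sparsifiable) + sparsifiable * (1.0 - sparsity)
    return 1.0 / remaining

for share in (0.7, 0.8):
    ratio = flop_reduction(0.75, share)
    print(f"sparsifiable={share:.1f}: {ratio:.2f}x fewer training FLOPs")
```

Under these assumed shares the model gives roughly 2.1x and 2.5x, which falls within the reported range; the real savings depend on which layers are sparsified and on the hardware.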

Title: AutoRD: An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontologies-enhanced Large Language Models

Authors: Lang Cao, Jimeng Sun, Adam Cross
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.00953
Pdf URL: https://arxiv.org/pdf/2403.00953
Copy Paste: [[2403.00953]] AutoRD: An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontologies-enhanced Large Language Models(https://arxiv.org/abs/2403.00953)
Keywords: language model, llm
Abstract: Objectives: Our objective is to create an end-to-end system called AutoRD, which automates extracting information from clinical text about rare diseases. We have conducted various tests to evaluate the performance of AutoRD and highlighted its strengths and limitations in this paper. Materials and Methods: Our system, AutoRD, is a software pipeline involving data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implement this using large language models and medical knowledge graphs developed from open-source medical ontologies. We quantitatively evaluate our system on entity extraction, relation extraction, and the performance of knowledge graph construction. Results: AutoRD achieves an overall F1 score of 47.3%, a 14.4% improvement compared to the base LLM. In detail, AutoRD achieves an overall entity extraction F1 score of 56.1% (rare_disease: 83.5%, disease: 35.8%, symptom_and_sign: 46.1%, anaphor: 67.5%) and an overall relation extraction F1 score of 38.6% (produces: 34.7%, increases_risk_of: 12.4%, is_a: 37.4%, is_acronym: 44.1%, is_synonym: 16.3%, anaphora: 57.5%). Our qualitative experiment also demonstrates that the performance in constructing the knowledge graph is commendable. Discussion: AutoRD demonstrates the potential of LLM applications in rare disease detection. This improvement is attributed to several design choices, including the integration of ontologies-enhanced LLMs. Conclusion: AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs. It uses ontologies-enhanced LLMs for a robust medical knowledge base. The superior performance of AutoRD is validated by experimental evaluations, demonstrating the potential of LLMs in healthcare.
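
For readers unfamiliar with how extraction F1 scores like those above are computed, here is a minimal sketch. It assumes exact matching on `(text, type)` pairs, which may not be the matching criterion the paper uses, and the example entities are hypothetical.

```python
def f1_score(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over exact-match items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical system output and gold annotations for one document.
predicted = {
    ("Fabry disease", "rare_disease"),
    ("fatigue", "symptom_and_sign"),
    ("diabetes", "disease"),
}
gold = {
    ("Fabry disease", "rare_disease"),
    ("fatigue", "symptom_and_sign"),
    ("angiokeratoma", "symptom_and_sign"),
    ("it", "anaphor"),
}

score = f1_score(predicted, gold)
print(f"entity F1 = {score:.3f}")
```

Here two of three predictions are correct (precision 2/3) and two of four gold entities are found (recall 1/2), so F1 is 4/7.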

Title: MALTO at SemEval-2024 Task 6: Leveraging Synthetic Data for LLM Hallucination Detection

Authors: Federico Borra, Claudio Savelli, Giacomo Rosso, Alkis Koudounas, Flavio Giobergia
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00964
Pdf URL: https://arxiv.org/pdf/2403.00964
Copy Paste: [[2403.00964]] MALTO at SemEval-2024 Task 6: Leveraging Synthetic Data for LLM Hallucination Detection(https://arxiv.org/abs/2403.00964)
Keywords: language model, llm, hallucination
Abstract: In Natural Language Generation (NLG), contemporary Large Language Models (LLMs) face several challenges, such as generating fluent yet inaccurate outputs and reliance on fluency-centric metrics. This often leads to neural networks exhibiting "hallucinations". The SHROOM challenge focuses on automatically identifying these hallucinations in the generated text. To tackle these issues, we introduce two key components, a data augmentation pipeline incorporating LLM-assisted pseudo-labelling and sentence rephrasing, and a voting ensemble from three models pre-trained on Natural Language Inference (NLI) tasks and fine-tuned on diverse datasets.
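
The voting ensemble described above can be sketched as a hard majority vote over the three models' labels. The abstract does not say whether voting is hard or probability-weighted, so hard voting is an assumption here, and the labels and predictions are hypothetical.

```python
from collections import Counter

def majority_vote(predictions: list[str]) -> str:
    """Return the label chosen by the largest number of ensemble members."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label

# Hypothetical outputs of three NLI-style classifiers for two generated texts.
votes_per_sample = [
    ["Hallucination", "Not Hallucination", "Hallucination"],
    ["Not Hallucination", "Not Hallucination", "Hallucination"],
]

ensemble = [majority_vote(votes) for votes in votes_per_sample]
print(ensemble)
```

With three voters and two labels a tie cannot occur, which is one practical reason to use an odd number of models.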

Title: LocalRQA: From Generating Data to Locally Training, Testing, and Deploying Retrieval-Augmented QA Systems

Authors: Xiao Yu, Yunan Lu, Zhou Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.00982
Pdf URL: https://arxiv.org/pdf/2403.00982
Copy Paste: [[2403.00982]] LocalRQA: From Generating Data to Locally Training, Testing, and Deploying Retrieval-Augmented QA Systems(https://arxiv.org/abs/2403.00982)
Keywords: language model, gpt
Abstract: Retrieval-augmented question-answering systems combine retrieval techniques with large language models to provide answers that are more accurate and informative. Many existing toolkits allow users to quickly build such systems using off-the-shelf models, but they fall short in supporting researchers and developers who want to customize the model training, testing, and deployment process. We propose LocalRQA, an open-source toolkit that features a wide selection of model training algorithms, evaluation methods, and deployment tools curated from the latest research. As a showcase, we build QA systems using online documentation obtained from Databricks and Faire's websites. We find that 7B models trained and deployed using LocalRQA reach performance similar to that of systems using OpenAI's text-ada-002 and GPT-4-turbo.

Title: Merging Text Transformer Models from Different Initializations

Authors: Neha Verma, Maha Elbayad
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.00986
Pdf URL: https://arxiv.org/pdf/2403.00986
Copy Paste: [[2403.00986]] Merging Text Transformer Models from Different Initializations(https://arxiv.org/abs/2403.00986)
Keywords: language model
Abstract: Recent work on one-shot permutation-based model merging has shown impressive low- or zero-barrier mode connectivity between models from completely different initializations. However, this line of work has not yet extended to the Transformer architecture, despite its dominant popularity in the language domain. Therefore, in this work, we investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to investigate the relationship between these minima in the loss landscape. The specifics of the architecture, like its residual connections, multi-headed attention, and discrete, sequential input, require specific interventions in order to compute model permutations that remain within the same functional equivalence class. In merging these models with our method, we consistently find lower loss barriers between minima compared to model averaging for several models trained on a masked-language modeling task or fine-tuned on a language understanding benchmark. Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on merging separately trained Transformer models.
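
The "functional equivalence class" that permutation-based merging relies on can be seen on a toy two-layer ReLU network: reordering the hidden units, together with the matching columns of the next layer, leaves the output unchanged. This is a minimal sketch with hypothetical weights; it is not the paper's procedure, which must additionally handle residual connections and multi-headed attention.

```python
def mlp(x, w1, b1, w2):
    """Two-layer network: ReLU(w1 @ x + b1) followed by a linear map w2."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def permute_hidden(w1, b1, w2, perm):
    """Reorder hidden units: rows of w1 and b1, and the columns of w2."""
    w1_p = [w1[i] for i in perm]
    b1_p = [b1[i] for i in perm]
    w2_p = [[row[i] for i in perm] for row in w2]
    return w1_p, b1_p, w2_p

# Hypothetical weights: 2 inputs, 3 hidden units, 1 output.
w1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
b1 = [0.0, 1.0, -0.5]
w2 = [[1.0, -1.0, 2.0]]
x = [2.0, 1.0]

original = mlp(x, w1, b1, w2)
permuted = mlp(x, *permute_hidden(w1, b1, w2, perm=[2, 0, 1]))
print(original, permuted)
```

Because both parameter sets compute the same function, a merging method is free to pick the permutation of one model that best aligns its units with another model's before averaging weights.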

Title: Formulation Comparison for Timeline Construction using LLMs

Authors: Kimihiro Hasegawa, Nikhil Kandukuri, Susan Holm, Yukari Yamakawa, Teruko Mitamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.00990
Pdf URL: https://arxiv.org/pdf/2403.00990
Copy Paste: [[2403.00990]] Formulation Comparison for Timeline Construction using LLMs(https://arxiv.org/abs/2403.00990)
Keywords: llm, prompt
Abstract: Constructing a timeline requires identifying the chronological order of events in an article. In prior timeline construction datasets, temporal orders are typically annotated by either event-to-time anchoring or event-to-event pairwise ordering, both of which suffer from missing temporal information. To mitigate the issue, we develop a new evaluation dataset, TimeSET, consisting of single-document timelines with document-level order annotation. TimeSET features saliency-based event selection and partial ordering, which enable a practical annotation workload. Aiming to build better automatic timeline construction systems, we propose a novel evaluation framework to compare multiple task formulations with TimeSET by prompting open LLMs, i.e., Llama 2 and Flan-T5. Considering that identifying temporal orders of events is a core subtask in timeline construction, we further benchmark open LLMs on existing event temporal ordering datasets to gain a robust understanding of their capabilities. Our experiments show that (1) NLI formulation with Flan-T5 demonstrates strong performance compared with the other formulations, while (2) timeline construction and event temporal ordering are still challenging tasks for few-shot LLMs. Our code and data are available at https://github.com/kimihiroh/timeset.
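
To make the relationship between pairwise ordering and a partially ordered timeline concrete, the sketch below builds one valid timeline from "A before B" pairs with a topological sort. The events and the `build_timeline` helper are hypothetical and are not taken from TimeSET or its code.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def build_timeline(before_pairs):
    """Return one ordering of events consistent with all (earlier, later) pairs."""
    predecessors = {}
    for earlier, later in before_pairs:
        predecessors.setdefault(later, set()).add(earlier)
        predecessors.setdefault(earlier, set())
    # static_order() raises graphlib.CycleError if the pairs contradict each other.
    return list(TopologicalSorter(predecessors).static_order())

# Hypothetical pairwise annotations for a short news article.
before = [
    ("earthquake", "evacuation"),
    ("evacuation", "aid arrives"),
    ("earthquake", "power outage"),
]

timeline = build_timeline(before)
print(timeline)
```

Note that "power outage" is not ordered relative to "evacuation" or "aid arrives", so several timelines are equally valid; this is the sense in which the ordering is partial.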

Title: Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods

Authors: Polina Tsvilodub, Hening Wang, Sharon Grosch, Michael Franke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.00998
Pdf URL: https://arxiv.org/pdf/2403.00998
Copy Paste: [[2403.00998]] Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods(https://arxiv.org/abs/2403.00998)
Keywords: language model, llm
Abstract: This paper systematically compares different methods of deriving item-level predictions of language models for multiple-choice tasks. It compares scoring methods for answer options based on free generation of responses, various probability-based scores, a Likert-scale style rating method, and embedding similarity. In a case study on pragmatic language interpretation, we find that LLM predictions are not robust under variation of method choice, both within a single LLM and across different LLMs. As this variability entails pronounced researcher degrees of freedom in reporting results, knowledge of the variability is crucial to secure robustness of results and research integrity.
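
One way such non-robustness can arise is that two common probability-based scores, the summed and the length-averaged token log-probability, can rank the same answer options differently. The sketch below shows this with hypothetical log-probabilities; it does not reproduce the paper's experiments.

```python
def sum_logprob(token_logprobs):
    """Total log-probability of the option; tends to favour short options."""
    return sum(token_logprobs)

def mean_logprob(token_logprobs):
    """Length-normalised log-probability (average per token)."""
    return sum(token_logprobs) / len(token_logprobs)

def pick(options, score):
    """Return the option name with the highest score."""
    return max(options, key=lambda name: score(options[name]))

# Hypothetical per-token log-probabilities for two answer options.
options = {
    "yes": [-1.2],
    "probably not": [-0.5, -0.6, -0.4],
}

print("sum score picks: ", pick(options, sum_logprob))
print("mean score picks:", pick(options, mean_logprob))
```

With these numbers the summed score prefers the one-token option while the averaged score prefers the three-token option, so the reported "model prediction" depends on the scoring choice.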

Title: Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries

Authors: Zelalem Gero, Chandan Singh, Yiqing Xie, Sheng Zhang, Tristan Naumann, Jianfeng Gao, Hoifung Poon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01002
Pdf URL: https://arxiv.org/pdf/2403.01002
Copy Paste: [[2403.01002]] Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries(https://arxiv.org/abs/2403.01002)
Keywords: language model, llm, prompt
Abstract: Summarizing clinical text is crucial in health decision-support and clinical research. Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation, especially in safety-critical domains such as health. Holistically evaluating text summaries is challenging because they may contain unsubstantiated information. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. It decomposes the evaluation process into a grounded procedure that uses an LLM for relatively simple structuring and scoring tasks, rather than the full task of holistic summary evaluation. Experiments show that AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization. Additionally, AS yields interpretations in the form of a short text span corresponding to each output, which enables efficient human auditing, paving the way towards trustworthy evaluation of clinical information in resource-constrained scenarios. We release our code, prompts, and an open-source benchmark at https://github.com/microsoft/attribute-structuring.

Title: Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks

Authors: Fakhraddin Alwajih, El Moatez Billah Nagoudi, Gagan Bhatia, Abdelrahman Mohamed, Muhammad Abdul-Mageed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01031
Pdf URL: https://arxiv.org/pdf/2403.01031
Copy Paste: [[2403.01031]] Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks(https://arxiv.org/abs/2403.01031)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) have proven effective in a wide range of tasks requiring complex reasoning and linguistic comprehension. However, due to a lack of high-quality multimodal resources in languages other than English, success of MLLMs remains relatively limited to English-based settings. This poses significant challenges in developing comparable models for other languages, including even those with large speaker populations such as Arabic. To alleviate this challenge, we introduce a comprehensive family of Arabic MLLMs, dubbed Peacock, with strong vision and language capabilities. Through comprehensive qualitative and quantitative analysis, we demonstrate the solid performance of our models on various visual reasoning tasks and further show their emerging dialectal potential. Additionally, we introduce Henna, a new benchmark specifically designed for assessing MLLMs on aspects related to Arabic culture, setting the first stone for culturally-aware Arabic MLLMs. The GitHub repository for the Peacock project is available at https://github.com/UBC-NLP/peacock.

Title: Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

Authors: Melanie Subbiah, Sean Zhang, Lydia B. Chilton, Kathleen McKeown
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01061
Pdf URL: https://arxiv.org/pdf/2403.01061
Copy Paste: [[2403.01061]] Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers(https://arxiv.org/abs/2403.01061)
Keywords: language model, gpt, llm
Abstract: We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and LLama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle to interpret difficult subtext. However, at their best, the models can provide thoughtful thematic analysis of stories. We additionally demonstrate that LLM judgments of summary quality do not match the feedback from the writers.

Title: FaiMA: Feature-aware In-context Learning for Multi-domain Aspect-based Sentiment Analysis

Authors: Songhua Yang, Xinke Jiang, Hanjie Zhao, Wenxuan Zeng, Hongde Liu, Yuxiang Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01063
Pdf URL: https://arxiv.org/pdf/2403.01063
Copy Paste: [[2403.01063]] FaiMA: Feature-aware In-context Learning for Multi-domain Aspect-based Sentiment Analysis(https://arxiv.org/abs/2403.01063)
Keywords: language model, llm
Abstract: Multi-domain aspect-based sentiment analysis (ABSA) seeks to capture fine-grained sentiment across diverse domains. While existing research narrowly focuses on single-domain applications constrained by methodological limitations and data scarcity, the reality is that sentiment naturally traverses multiple domains. Although large language models (LLMs) offer a promising solution for ABSA, they are difficult to integrate effectively with established techniques, including graph-based models and linguistics, because modifying their internal architecture is not easy. To alleviate this problem, we propose a novel framework, Feature-aware In-context Learning for Multi-domain ABSA (FaiMA). The core insight of FaiMA is to utilize in-context learning (ICL) as a feature-aware mechanism that facilitates adaptive learning in multi-domain ABSA tasks. Specifically, we employ a multi-head graph attention network as a text encoder optimized by heuristic rules for linguistic, domain, and sentiment features. Through contrastive learning, we optimize sentence representations by focusing on these diverse features. Additionally, we construct an efficient indexing mechanism, allowing FaiMA to stably retrieve highly relevant examples across multiple dimensions for any given input. To evaluate the efficacy of FaiMA, we build the first multi-domain ABSA benchmark dataset. Extensive experimental results demonstrate that FaiMA achieves significant performance improvements in multiple domains compared to baselines, increasing F1 by 2.07% on average. Source code and data sets are anonymously available at https://github.com/SupritYoung/FaiMA.

Title: LLMCRIT: Teaching Large Language Models to Use Criteria

Authors: Weizhe Yuan, Pengfei Liu, Matthias Gallé
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01069
Pdf URL: https://arxiv.org/pdf/2403.01069
Copy Paste: [[2403.01069]] LLMCRIT: Teaching Large Language Models to Use Criteria(https://arxiv.org/abs/2403.01069)
Keywords: language model, llm
Abstract: Humans follow criteria when they execute tasks, and these criteria are directly used to assess the quality of task completion. Therefore, having models learn to use criteria to provide feedback can help humans or models to perform tasks better. However, existing research in this field tends to consider only a limited set of criteria or quality assessment aspects. To fill this gap, we propose a general framework that enables large language models (LLMs) to use comprehensive criteria for a task in delivering natural language feedback on task execution. In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion. We choose three tasks from real-world scenarios to operationalize this idea: paper introduction writing, Python code writing, and Reddit post writing, and evaluate our feedback generation framework using different LLMs. The results reveal the fine-grained effects of incorporating criteria and demonstrations and provide valuable insights on how to teach LLMs to use criteria more effectively.

Title: LAB: Large-Scale Alignment for ChatBots

Authors: Shivchander Sudalairaj, Abhishek Bhandwaldar, Aldo Pareja, Kai Xu, David D. Cox, Akash Srivastava
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.01081
Pdf URL: https://arxiv.org/pdf/2403.01081
Copy Paste: [[2403.01081]] LAB: Large-Scale Alignment for ChatBots(https://arxiv.org/abs/2403.01081)
Keywords: language model, gpt, llm, chat
Abstract: This work introduces LAB (Large-scale Alignment for chatBots), a novel methodology designed to overcome the scalability challenges in the instruction-tuning phase of large language model (LLM) training. Leveraging a taxonomy-guided synthetic data generation process and a multi-phase tuning framework, LAB significantly reduces reliance on expensive human annotations and proprietary models like GPT-4. We demonstrate that LAB-trained models can achieve competitive performance across several benchmarks compared to models trained with traditional human-annotated or GPT-4 generated synthetic data. LAB thus offers a scalable, cost-effective solution for enhancing LLM capabilities and instruction-following behaviors without the drawbacks of catastrophic forgetting, marking a step forward in the efficient training of LLMs for a wide range of applications.

Title: Distilling Text Style Transfer With Self-Explanation From LLMs

Authors: Chiyu Zhang, Honglong Cai, Yuezhang (Music) Li, Yuexin Wu, Le Hou, Muhammad Abdul-Mageed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01106
Pdf URL: https://arxiv.org/pdf/2403.01106
Copy Paste: [[2403.01106]] Distilling Text Style Transfer With Self-Explanation From LLMs(https://arxiv.org/abs/2403.01106)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Text Style Transfer (TST) seeks to alter the style of text while retaining its core content. Given the constraints of limited parallel datasets for TST, we propose CoTeX, a framework that leverages large language models (LLMs) alongside chain-of-thought (CoT) prompting to facilitate TST. CoTeX distills the complex rewriting and reasoning capabilities of LLMs into more streamlined models capable of working with both non-parallel and parallel data. Through experimentation across four TST datasets, CoTeX is shown to surpass traditional supervised fine-tuning and knowledge distillation methods, particularly in low-resource settings. We conduct a comprehensive evaluation, comparing CoTeX against current unsupervised, supervised, in-context learning (ICL) techniques, and instruction-tuned LLMs. Furthermore, CoTeX distinguishes itself by offering transparent explanations for its style transfer process.

Title: MulCogBench: A Multi-modal Cognitive Benchmark Dataset for Evaluating Chinese and English Computational Language Models

Authors: Yunhao Zhang, Xiaohan Zhang, Chong Li, Shaonan Wang, Chengqing Zong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01116
Pdf URL: https://arxiv.org/pdf/2403.01116
Copy Paste: [[2403.01116]] MulCogBench: A Multi-modal Cognitive Benchmark Dataset for Evaluating Chinese and English Computational Language Models(https://arxiv.org/abs/2403.01116)
Keywords: language model
Abstract: Pre-trained computational language models have recently made remarkable progress in harnessing the language abilities which were considered unique to humans. Their success has raised interest in whether these models represent and process language like humans. To answer this question, this paper proposes MulCogBench, a multi-modal cognitive benchmark dataset collected from native Chinese and English participants. It encompasses a variety of cognitive data, including subjective semantic ratings, eye-tracking, functional magnetic resonance imaging (fMRI), and magnetoencephalography (MEG). To assess the relationship between language models and cognitive data, we conducted a similarity-encoding analysis which decodes cognitive data based on its pattern similarity with textual embeddings. Results show that language models share significant similarities with human cognitive data and the similarity patterns are modulated by the data modality and stimuli complexity. Specifically, context-aware models outperform context-independent models as language stimulus complexity increases. The shallow layers of context-aware models are better aligned with the high-temporal-resolution MEG signals whereas the deeper layers show more similarity with the high-spatial-resolution fMRI. These results indicate that language models have a delicate relationship with brain language representations. Moreover, the results between Chinese and English are highly consistent, suggesting the generalizability of these findings across languages.

Title: ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies

Authors: Oren Sultan, Yonatan Bitton, Ron Yosef, Dafna Shahaf
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01139
Pdf URL: https://arxiv.org/pdf/2403.01139
Copy Paste: [[2403.01139]] ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies(https://arxiv.org/abs/2403.01139)
Keywords: language model, llm
Abstract: Analogy-making is central to human cognition, allowing us to adapt to novel situations -- an ability that current AI systems still lack. Most analogy datasets today focus on simple analogies (e.g., word analogies); datasets including complex types of analogies are typically manually curated and very small. We believe that this holds back progress in computational analogy. In this work, we design a data generation pipeline, ParallelPARC (Parallel Paragraph Creator), leveraging state-of-the-art Large Language Models (LLMs) to create complex, paragraph-based analogies, as well as distractors, both simple and challenging. We demonstrate our pipeline and create ProPara-Logy, a dataset of analogies between scientific processes. We publish a gold-set, validated by humans, and a silver-set, generated automatically. We test LLMs' and humans' analogy recognition in binary and multiple-choice settings, and find that humans outperform the best models (~13% gap) after light supervision. We demonstrate that our silver-set is useful for training models. Lastly, we show that challenging distractors confuse LLMs, but not humans. We hope our pipeline will encourage research in this emerging field.

Title: A Survey of AI-generated Text Forensic Systems: Detection, Attribution, and Characterization

Authors: Tharindu Kumarage, Garima Agrawal, Paras Sheth, Raha Moraffah, Aman Chadha, Joshua Garland, Huan Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01152
Pdf URL: https://arxiv.org/pdf/2403.01152
Copy Paste: [[2403.01152]] A Survey of AI-generated Text Forensic Systems: Detection, Attribution, and Characterization(https://arxiv.org/abs/2403.01152)
Keywords: language model, llm
Abstract: We have witnessed lately a rapid proliferation of advanced Large Language Models (LLMs) capable of generating high-quality text. While these LLMs have revolutionized text generation across various domains, they also pose significant risks to the information ecosystem, such as the potential for generating convincing propaganda, misinformation, and disinformation at scale. This paper offers a review of AI-generated text forensic systems, an emerging field addressing the challenges of LLM misuses. We present an overview of the existing efforts in AI-generated text forensics by introducing a detailed taxonomy, focusing on three primary pillars: detection, attribution, and characterization. These pillars enable a practical understanding of AI-generated text, from identifying AI-generated content (detection), determining the specific AI model involved (attribution), and grouping the underlying intents of the text (characterization). Furthermore, we explore available resources for AI-generated text forensics research and discuss the evolving challenges and future directions of forensic systems in an AI era.

Title: BootTOD: Bootstrap Task-oriented Dialogue Representations by Aligning Diverse Responses

Authors: Weihao Zeng, Keqing He, Yejie Wang, Dayuan Fu, Weiran Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01163
Pdf URL: https://arxiv.org/pdf/2403.01163
Copy Paste: [[2403.01163]] BootTOD: Bootstrap Task-oriented Dialogue Representations by Aligning Diverse Responses(https://arxiv.org/abs/2403.01163)
Keywords: language model
Abstract: Pre-trained language models have been successful in many scenarios. However, their usefulness in task-oriented dialogues is limited due to the intrinsic linguistic differences between general text and task-oriented dialogues. Current task-oriented dialogue pre-training methods rely on a contrastive framework, which faces challenges such as selecting true positives and hard negatives, as well as lacking diversity. In this paper, we propose a novel dialogue pre-training model called BootTOD. It learns task-oriented dialogue representations via a self-bootstrapping framework. Unlike contrastive counterparts, BootTOD aligns context and context+response representations and dismisses the requirements of contrastive pairs. BootTOD also uses multiple appropriate response targets to model the intrinsic one-to-many diversity of human conversations. Experimental results show that BootTOD outperforms strong TOD baselines on diverse downstream dialogue tasks.

Title: STAR: Constraint LoRA with Dynamic Active Learning for Data-Efficient Fine-Tuning of Large Language Models

Authors: Linhai Zhang, Jialong Wu, Deyu Zhou, Guoqiang Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01165
Pdf URL: https://arxiv.org/pdf/2403.01165
Copy Paste: [[2403.01165]] STAR: Constraint LoRA with Dynamic Active Learning for Data-Efficient Fine-Tuning of Large Language Models(https://arxiv.org/abs/2403.01165)
Keywords: language model, llm, prompt
Abstract: Though Large Language Models (LLMs) have demonstrated the powerful capabilities of few-shot learning through prompting methods, supervised training is still necessary for complex reasoning tasks. Because of their extensive parameters and memory consumption, both Parameter-Efficient Fine-Tuning (PEFT) methods and Memory-Efficient Fine-Tuning methods have been proposed for LLMs. Nevertheless, the issue of large annotated data consumption, the aim of Data-Efficient Fine-Tuning, remains unexplored. One obvious way is to combine the PEFT method with active learning. However, the experimental results show that such a combination is not trivial and yields inferior results. Through probe experiments, such observation might be explained by two main reasons: uncertainty gap and poor model calibration. Therefore, in this paper, we propose a novel approach to effectively integrate uncertainty-based active learning and LoRA. Specifically, for the uncertainty gap, we introduce a dynamic uncertainty measurement that combines the uncertainty of the base model and the uncertainty of the full model during the iteration of active learning. For poor model calibration, we incorporate the regularization method during LoRA training to keep the model from being over-confident, and the Monte-Carlo dropout mechanism is employed to enhance the uncertainty estimation. Experimental results show that the proposed approach outperforms existing baseline models on three complex reasoning tasks.
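Note: the STAR abstract describes uncertainty-based active learning with a dynamic uncertainty measure and Monte-Carlo dropout, but gives no formulas. The Python sketch below is one possible reading, not the authors' implementation. The linear schedule that shifts weight from the base model to the full (LoRA-adapted) model, the use of predictive entropy, and all function names are assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mc_dropout_entropy(stochastic_passes):
    """Entropy of the mean class distribution over several dropout-enabled passes.

    `stochastic_passes` is a list of probability vectors, one per forward pass.
    """
    n = len(stochastic_passes)
    mean = [sum(column) / n for column in zip(*stochastic_passes)]
    return entropy(mean)


def dynamic_uncertainty(u_base, u_full, step, total_steps):
    """Blend base-model and full-model uncertainty.

    ASSUMPTION: a linear schedule; early rounds trust the base model,
    later rounds trust the fine-tuned model. The paper may use another rule.
    """
    lam = step / total_steps
    return (1.0 - lam) * u_base + lam * u_full


def select_for_annotation(uncertainties, budget):
    """Indices of the `budget` most uncertain unlabeled examples."""
    order = sorted(range(len(uncertainties)),
                   key=lambda i: uncertainties[i], reverse=True)
    return order[:budget]
```

In an active-learning loop, each round would score the unlabeled pool with `dynamic_uncertainty`, send the examples returned by `select_for_annotation` to annotators, and then continue LoRA training on the enlarged labeled set.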

Title: Balancing Exploration and Exploitation in LLM using Soft RLLF for Enhanced Negation Understanding

Authors: Ha-Thanh Nguyen, Ken Satoh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01185
Pdf URL: https://arxiv.org/pdf/2403.01185
Copy Paste: [[2403.01185]] Balancing Exploration and Exploitation in LLM using Soft RLLF for Enhanced Negation Understanding(https://arxiv.org/abs/2403.01185)
Keywords: language model, llm
Abstract: Finetuning approaches in NLP often focus on exploitation rather than exploration, which may lead to suboptimal models. Given the vast search space of natural language, this limited exploration can restrict their performance in complex, high-stakes domains, where accurate negation understanding and logical reasoning abilities are crucial. To address this issue, we leverage Reinforcement Learning from Logical Feedback (RLLF) to create an effective balance between exploration and exploitation in LLMs. Our approach employs an appropriate benchmark dataset for training and evaluation, highlighting the importance of exploration in enhancing negation understanding capabilities. We compare the performance of our RLLF-enhanced LLMs with baseline models trained without RLLF, demonstrating the value of this balanced approach. Furthermore, we showcase the potential of our method in legal AI applications by employing transfer learning and evaluating its impact on negation understanding. Our experimental results exhibit the effectiveness of balancing exploration and exploitation with RLLF in improving LLMs' negation capabilities. This has implications for the development of more accurate, reliable, and logically consistent language models in high-stakes domains.

Title: RAGged Edges: The Double-Edged Sword of Retrieval-Augmented Chatbots

Authors: Philip Feldman, James R. Foulds, Shimei Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01193
Pdf URL: https://arxiv.org/pdf/2403.01193
Copy Paste: [[2403.01193]] RAGged Edges: The Double-Edged Sword of Retrieval-Augmented Chatbots(https://arxiv.org/abs/2403.01193)
Keywords: language model, gpt, llm, hallucination, prompt, chat, retrieval-augmented generation
Abstract: Large language models (LLMs) like ChatGPT demonstrate the remarkable progress of artificial intelligence. However, their tendency to hallucinate -- generate plausible but false information -- poses a significant challenge. This issue is critical, as seen in recent court cases where ChatGPT's use led to citations of non-existent legal rulings. This paper explores how Retrieval-Augmented Generation (RAG) can counter hallucinations by integrating external knowledge with prompts. We empirically evaluate RAG against standard LLMs using prompts designed to induce hallucinations. Our results show that RAG increases accuracy in some cases, but can still be misled when prompts directly contradict the model's pre-trained understanding. These findings highlight the complex nature of hallucinations and the need for more robust solutions to ensure LLM reliability in real-world applications. We offer practical recommendations for RAG deployment and discuss implications for the development of more trustworthy LLMs.

Title: DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling

Authors: Shanghaoran Quan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01197
Pdf URL: https://arxiv.org/pdf/2403.01197
Copy Paste: [[2403.01197]] DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling(https://arxiv.org/abs/2403.01197)
Keywords: language model, llm
Abstract: The performance of the reward model (RM) is a critical factor in improving the effectiveness of the large language model (LLM) during alignment fine-tuning. There remain two challenges in RM training: 1) training the same RM using various categories of data may cause its generalization performance to suffer from multi-task disturbance, and 2) the human annotation consistency rate is generally only 60% to 75%, causing training data to contain a lot of noise. To tackle these two challenges, we introduced the idea of Mixture-of-Experts (MoE) into the field of RM for the first time. We propose the Double-Layer MoE RM (DMoERM). The outer layer MoE is a sparse model. After classifying an input into task categories, we route it to the corresponding inner layer task-specific model. The inner layer MoE is a dense model. We decompose the specific task into multiple capability dimensions and individually fine-tune a LoRA expert on each one. Their outputs are then synthesized by an MLP to compute the final rewards. To minimize costs, we call a public LLM API to obtain the capability preference labels. The validation on manually labeled datasets confirms that our model attains superior consistency with human preference and outstrips advanced generative approaches. Meanwhile, through BoN sampling and RL experiments, we demonstrate that our model outperforms state-of-the-art ensemble methods of RM and mitigates the overoptimization problem. Our code and dataset are available at: https://github.com/quanshr/DMoERM-v1.
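Note: the double-layer routing described in the DMoERM abstract (sparse outer layer picks one task, dense inner layer runs every capability expert, an MLP merges their outputs) can be sketched structurally in Python. This is a toy outline, not the released code at the repository above. The experts here are plain callables standing in for LoRA-tuned models, and every name and parameter shape is hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Expert = Callable[[str], float]  # stand-in for a LoRA expert scoring one capability


def mlp(features: Sequence[float],
        w1: Sequence[Sequence[float]], b1: Sequence[float],
        w2: Sequence[float], b2: float) -> float:
    """One-hidden-layer MLP with ReLU, mapping expert scores to a scalar reward."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def reward(text: str,
           classify_task: Callable[[str], str],
           experts_by_task: Dict[str, List[Expert]],
           mlp_params_by_task: Dict[str, tuple]) -> float:
    # Outer layer (sparse): exactly one task-specific model is selected.
    task = classify_task(text)
    # Inner layer (dense): every capability expert for that task is evaluated.
    features = [expert(text) for expert in experts_by_task[task]]
    # The task's MLP synthesizes the expert outputs into the final reward.
    return mlp(features, *mlp_params_by_task[task])
```

The sparse/dense split means the cost of one reward call grows with the number of capability dimensions in a single task, not with the total number of experts across all tasks.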

Title: API Is Enough: Conformal Prediction for Large Language Models Without Logit-Access

Authors: Jiayuan Su, Jing Luo, Hongwei Wang, Lu Cheng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.01216
Pdf URL: https://arxiv.org/pdf/2403.01216
Copy Paste: [[2403.01216]] API Is Enough: Conformal Prediction for Large Language Models Without Logit-Access(https://arxiv.org/abs/2403.01216)
Keywords: language model, llm
Abstract: This study aims to address the pervasive challenge of quantifying uncertainty in large language models (LLMs) without logit-access. Conformal Prediction (CP), known for its model-agnostic and distribution-free features, is a desired approach for various LLMs and data distributions. However, existing CP methods for LLMs typically assume access to the logits, which are unavailable for some API-only LLMs. In addition, logits are known to be miscalibrated, potentially leading to degraded CP performance. To tackle these challenges, we introduce a novel CP method that (1) is tailored for API-only LLMs without logit-access; (2) minimizes the size of prediction sets; and (3) ensures a statistical guarantee of the user-defined coverage. The core idea of this approach is to formulate nonconformity measures using both coarse-grained (i.e., sample frequency) and fine-grained uncertainty notions (e.g., semantic similarity). Experimental results on both close-ended and open-ended Question Answering tasks show our approach can mostly outperform the logit-based CP baselines.
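Note: to illustrate the coarse-grained (sample-frequency) idea in the abstract above, the following minimal Python sketch runs standard split conformal prediction, scoring an answer by one minus its frequency among sampled responses. It is not the authors' method. The function names, the omission of the fine-grained semantic-similarity term, and the fallback used when the calibration set is too small are assumptions made for illustration.

```python
import math
from collections import Counter


def frequency_scores(samples):
    """Nonconformity per distinct answer: 1 - (share of samples giving that answer)."""
    counts = Counter(samples)
    n = len(samples)
    return {answer: 1.0 - c / n for answer, c in counts.items()}


def calibrate(cal_samples, cal_truths, alpha):
    """Split-conformal threshold from held-out calibration questions.

    cal_samples[i] is the list of answers sampled from the API for question i;
    cal_truths[i] is its reference answer. A reference answer that was never
    sampled receives the maximal score 1.0.
    """
    scores = sorted(frequency_scores(s).get(t, 1.0)
                    for s, t in zip(cal_samples, cal_truths))
    n = len(scores)
    # Finite-sample corrected quantile index: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # ASSUMPTION: with too few calibration points, keep every sampled answer.
        return 1.0
    return scores[k - 1]


def prediction_set(samples, threshold):
    """All sampled answers whose nonconformity does not exceed the threshold."""
    return {a for a, s in frequency_scores(samples).items() if s <= threshold}
```

Only sampled text is needed, so the procedure works with API-only models; a lower `alpha` (higher target coverage) raises the threshold and enlarges the prediction sets.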

Title: IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact

Authors: Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, Chun Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01241
Pdf URL: https://arxiv.org/pdf/2403.01241
Copy Paste: [[2403.01241]] IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact(https://arxiv.org/abs/2403.01241)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel in natural language processing but demand intensive computation. To mitigate this, various quantization methods have been explored, yet they compromise LLM performance. This paper unveils a previously overlooked type of outlier in LLMs. Such outliers are found to allocate most of the attention scores on initial tokens of input, termed as pivot tokens, which is crucial to the performance of quantized LLMs. Given that, we propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple and easy to combine with existing quantization solutions. Besides, IntactKV can be calibrated as additional LLM parameters to boost the quantized LLMs further. Mathematical analysis also proves that IntactKV effectively reduces the upper bound of quantization error. Empirical results show that IntactKV brings consistent improvement and achieves lossless weight-only INT4 quantization on various downstream tasks, leading to the new state-of-the-art for LLM quantization.

Title: Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal

Authors: Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, Jinsong Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01244
Pdf URL: https://arxiv.org/pdf/2403.01244
Copy Paste: [[2403.01244]] Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal(https://arxiv.org/abs/2403.01244)
Keywords: language model, llm
Abstract: Large language models (LLMs) suffer from catastrophic forgetting during continual learning. Conventional rehearsal-based methods rely on previous training data to retain the model's ability, which may not be feasible in real-world applications. When conducting continual learning based on a publicly-released LLM checkpoint, the availability of the original training data may be non-existent. To address this challenge, we propose a framework called Self-Synthesized Rehearsal (SSR) that uses the LLM to generate synthetic instances for rehearsal. Concretely, we first employ the base LLM for in-context learning to generate synthetic instances. Subsequently, we utilize the latest LLM to refine the instance outputs based on the synthetic inputs, preserving its acquired ability. Finally, we select diverse high-quality synthetic instances for rehearsal in future stages. Experimental results demonstrate that SSR achieves superior or comparable performance compared to conventional rehearsal-based approaches while being more data-efficient. Besides, SSR effectively preserves the generalization capabilities of LLMs in general domains.

Title: Accelerating Greedy Coordinate Gradient via Probe Sampling

Authors: Yiran Zhao, Wenyue Zheng, Tianle Cai, Xuan Long Do, Kenji Kawaguchi, Anirudh Goyal, Michael Shieh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01251
Pdf URL: https://arxiv.org/pdf/2403.01251
Copy Paste: [[2403.01251]] Accelerating Greedy Coordinate Gradient via Probe Sampling(https://arxiv.org/abs/2403.01251)
Keywords: language model, llm, prompt
Abstract: Safety of Large Language Models (LLMs) has become a central issue given their rapid progress and wide applications. Greedy Coordinate Gradient (GCG) is shown to be effective in constructing prompts containing adversarial suffixes to break the presumably safe LLMs, but the optimization of GCG is time-consuming and limits its practicality. To reduce the time cost of GCG and enable more comprehensive studies of LLM safety, in this work, we study a new algorithm called Probe sampling to accelerate the GCG algorithm. At the core of the algorithm is a mechanism that dynamically determines how similar a smaller draft model's predictions are to the target model's predictions for prompt candidates. When the target model is similar to the draft model, we rely heavily on the draft model to filter out a large number of potential prompt candidates to reduce the computation time. Probe sampling achieves a speedup of up to 5.6 times using Llama2-7b and leads to equal or improved attack success rate (ASR) on the AdvBench.
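Note: the filtering mechanism in the Probe sampling abstract can be illustrated with a small Python sketch. The abstract does not say how agreement is measured or how it sets the filter size, so the Spearman rank correlation, the linear mapping to a keep count, and all names below are assumptions, not the paper's exact procedure.

```python
def ranks(values):
    """Rank of each value (0 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for equal-length sequences without ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def agreement(draft_probe_losses, target_probe_losses):
    """Map the correlation on a small probe set from [-1, 1] to a score in [0, 1]."""
    return (1.0 + spearman(draft_probe_losses, target_probe_losses)) / 2.0


def filter_candidates(draft_losses, score, min_keep=1):
    """Keep fewer candidates (lowest draft loss first) the higher the agreement.

    ASSUMPTION: the keep count shrinks linearly with the agreement score.
    Returns the indices of the candidates to re-score with the target model.
    """
    n = len(draft_losses)
    keep = max(min_keep, round((1.0 - score) * n))
    order = sorted(range(n), key=lambda i: draft_losses[i])
    return order[:keep]
```

The saving comes from the last step: the expensive target model only evaluates the probe set plus the indices returned by `filter_candidates`, instead of every candidate suffix.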

Title: Improving the Validity of Automatically Generated Feedback via Reinforcement Learning

Authors: Alexander Scarlatos, Digory Smith, Simon Woodhead, Andrew Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01304
Pdf URL: https://arxiv.org/pdf/2403.01304
Copy Paste: [[2403.01304]] Improving the Validity of Automatically Generated Feedback via Reinforcement Learning(https://arxiv.org/abs/2403.01304)
Keywords: language model, gpt, llm
Abstract: Automatically generating feedback via large language models (LLMs) in intelligent tutoring systems and online learning platforms has the potential to improve the learning outcomes of many students. However, both feedback generation and evaluation are challenging: feedback content has to be valid especially in subjects like math, which requires models to understand the problem, the solution, and where the student's error lies. Feedback also has to be pedagogically valid to reflect effective tutoring strategies, such as explaining possible misconceptions and encouraging the student, among other desirable features. In this work, we address both problems of automatically generating and evaluating feedback while considering both correctness and alignment. First, we propose a rubric for evaluating math feedback and show that GPT-4 is able to effectively use it to annotate human-written and LLM-generated feedback. Second, we propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO). We show that our methods significantly increase the correctness and alignment of generated feedback with Llama 2, an open-source LLM, qualitatively analyze our generation and evaluation systems using case studies, and outline several areas for future work.
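Note: the abstract above turns rubric annotations into preference pairs and trains with direct preference optimization (DPO). The Python sketch below shows the standard per-pair DPO loss together with a hypothetical pairing rule. The pairing rule (prefer the higher rubric score, skip ties), the default `beta`, and the function names are assumptions; the paper's augmented dataset construction may differ.

```python
import math
from itertools import combinations


def build_preference_pairs(scored_feedback):
    """Turn (text, rubric_score) items for one problem into (chosen, rejected) pairs.

    ASSUMPTION: the higher rubric score is preferred and ties are skipped.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(scored_feedback, 2):
        if score_a > score_b:
            pairs.append((text_a, text_b))
        elif score_b > score_a:
            pairs.append((text_b, text_a))
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed token log-probability of a full response
    under the trained policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy equals the reference model the loss is log 2; it falls as the policy raises the likelihood of the chosen feedback relative to the rejected one.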

Title: VBART: The Turkish LLM

Authors: Meliksah Turker, Mehmet Erdi Ari, Aydin Han
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.01308
Pdf URL: https://arxiv.org/pdf/2403.01308
Copy Paste: [[2403.01308]] VBART: The Turkish LLM(https://arxiv.org/abs/2403.01308)
Keywords: language model, llm
Abstract: We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained on a large corpus from scratch. VBART are compact LLMs based on good ideas leveraged from BART and mBART models and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering and question generation tasks. They allow fine-tuning for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that having a pre-trained LLM for Turkish outperforms up to 3x multilingual models, improving existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is 7x more efficient than OpenAI's multilingual tokenizer. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevancy of Chinchilla Scaling Law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer and cleaned web corpus of 135 GB are publicly available at huggingface.co/vngrs-ai.

Title: LM4OPT: Unveiling the Potential of Large Language Models in Formulating Mathematical Optimization Problems

Authors: Tasnim Ahmed, Salimur Choudhury
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2403.01342
Pdf URL: https://arxiv.org/pdf/2403.01342
Copy Paste: [[2403.01342]] LM4OPT: Unveiling the Potential of Large Language Models in Formulating Mathematical Optimization Problems(https://arxiv.org/abs/2403.01342)
Keywords: language model, gpt, llm
Abstract: In the rapidly evolving field of natural language processing, the translation of linguistic descriptions into mathematical formulation of optimization problems presents a formidable challenge, demanding intricate understanding and processing capabilities from Large Language Models (LLMs). This study compares prominent LLMs, including GPT-3.5, GPT-4, and Llama-2-7b, in zero-shot and one-shot settings for this task. Our findings show GPT-4's superior performance, particularly in the one-shot scenario. A central part of this research is the introduction of `LM4OPT,' a progressive fine-tuning framework for Llama-2-7b that utilizes noisy embeddings and specialized datasets. However, this research highlights a notable gap in the contextual understanding capabilities of smaller models such as Llama-2-7b compared to larger counterparts, especially in processing lengthy and complex input contexts. Our empirical investigation, utilizing the NL4Opt dataset, unveils that GPT-4 surpasses the baseline performance established by previous research, achieving an F1-score of 0.63, solely based on the problem description in natural language, and without relying on any additional named entity information. GPT-3.5 follows closely, both outperforming the fine-tuned Llama-2-7b. These findings not only benchmark the current capabilities of LLMs in a novel application area but also lay the groundwork for future improvements in mathematical formulation of optimization problems from natural language input.

Title: Evaluating and Mitigating Number Hallucinations in Large Vision-Language Models: A Consistency Perspective

Authors: Huixuan Zhang, Junzhe Zhang, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01373
Pdf URL: https://arxiv.org/pdf/2403.01373
Copy Paste: [[2403.01373]] Evaluating and Mitigating Number Hallucinations in Large Vision-Language Models: A Consistency Perspective(https://arxiv.org/abs/2403.01373)
Keywords: language model, hallucination
Abstract: Large vision language models have demonstrated remarkable efficacy in addressing challenges related to both textual and visual content. Nevertheless, these models are susceptible to various hallucinations. In this paper, we focus on a new form of hallucination, specifically termed as number hallucination, which denotes instances where models fail to accurately identify the quantity of objects in an image. We establish a dataset and employ evaluation metrics to assess number hallucination, revealing a pronounced prevalence of this issue across mainstream large vision language models (LVLMs). Additionally, we delve into a thorough analysis of number hallucination, examining inner and outer inconsistency problems from two related perspectives. We assert that this inconsistency is one cause of number hallucination and propose a consistency training method as a means to alleviate such hallucination, which achieves an average improvement of 8% compared with the direct finetuning method.
摘要：大视觉语言模型在解决与文本和视觉内容相关的挑战方面表现出了显着的功效。然而，这些模型很容易产生各种幻觉。在本文中，我们关注一种新形式的幻觉，具体称为数字幻觉，它表示模型无法准确识别图像中物体数量的情况。我们建立了一个数据集并采用评估指标来评估数字幻觉，揭示了这个问题在主流大视觉语言模型（LVLM）中的明显普遍性。此外，我们还深入分析了数字幻觉，从两个相关的角度审视内部和外部不一致问题。我们断言这种不一致是导致数字幻觉的原因之一，并提出了一种一致性训练方法作为减轻这种幻觉的手段，与直接微调方法相比，该方法平均提高了 8%。

Title: Automatic Question-Answer Generation for Long-Tail Knowledge

Authors: Rohan Kumar, Youngmin Kim, Sunitha Ravi, Haitian Sun, Christos Faloutsos, Ruslan Salakhutdinov, Minji Yoon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01382
Pdf URL: https://arxiv.org/pdf/2403.01382
Copy Paste: [[2403.01382]] Automatic Question-Answer Generation for Long-Tail Knowledge(https://arxiv.org/abs/2403.01382)
Keywords: language model, llm
Abstract: Pretrained Large Language Models (LLMs) have gained significant attention for addressing open-domain Question Answering (QA). While they exhibit high accuracy in answering questions related to common knowledge, LLMs encounter difficulties in learning about uncommon long-tail knowledge (tail entities). Since manually constructing QA datasets demands substantial human resources, the types of existing QA datasets are limited, leaving us with a scarcity of datasets to study the performance of LLMs on tail entities. In this paper, we propose an automatic approach to generate specialized QA datasets for tail entities and present the associated research challenges. We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets, comparing their performance with and without external resources including Wikipedia and Wikidata knowledge graphs.
摘要：预训练大型语言模型 (LLM) 在解决开放域问答 (QA) 方面受到了极大关注。虽然法学硕士在回答与常识相关的问题时表现出很高的准确性，但法学硕士在学习不常见的长尾知识（尾部实体）时遇到了困难。由于手动构建 QA 数据集需要大量人力资源，现有 QA 数据集的类型有限，导致我们缺乏数据集来研究 LLM 在尾部实体上的性能。在本文中，我们提出了一种自动方法来为尾部实体生成专门的 QA 数据集，并提出相关的研究挑战。我们通过在新生成的长尾 QA 数据集上使用预训练的 LLM 进行广泛的实验，比较它们在使用和不使用外部资源（包括维基百科和维基数据知识图谱）的情况下的性能。

Title: Right for Right Reasons: Large Language Models for Verifiable Commonsense Knowledge Graph Question Answering

Authors: Armin Toroghi, Willis Guo, Mohammad Mahdi Abdollah Pour, Scott Sanner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01390
Pdf URL: https://arxiv.org/pdf/2403.01390
Copy Paste: [[2403.01390]] Right for Right Reasons: Large Language Models for Verifiable Commonsense Knowledge Graph Question Answering(https://arxiv.org/abs/2403.01390)
Keywords: language model, llm, hallucination
Abstract: Knowledge Graph Question Answering (KGQA) methods seek to answer Natural Language questions using the relational information stored in Knowledge Graphs (KGs). With the recent advancements of Large Language Models (LLMs) and their remarkable reasoning abilities, there is a growing trend to leverage them for KGQA. However, existing methodologies have only focused on answering factual questions, e.g., "In which city was Silvio Berlusconi's first wife born?", leaving questions involving commonsense reasoning that real-world users may pose more often, e.g., "Do I need separate visas to see the Venus of Willendorf and attend the Olympics this summer?" unaddressed. In this work, we first observe that existing LLM-based methods for KGQA struggle with hallucination on such questions, especially on queries targeting long-tail entities (e.g., non-mainstream and recent entities), thus hindering their applicability in real-world applications especially since their reasoning processes are not easily verifiable. In response, we propose Right for Right Reasons (R3), a commonsense KGQA methodology that allows for a verifiable reasoning procedure by axiomatically surfacing intrinsic commonsense knowledge of LLMs and grounding every factual reasoning step on KG triples. Through experimental evaluations across three different tasks--question answering, claim verification, and preference matching--our findings showcase R3 as a superior approach, outperforming existing methodologies and notably reducing instances of hallucination and reasoning errors.
摘要：知识图问答（KGQA）方法寻求使用知识图（KG）中存储的关系信息来回答自然语言问题。随着大型语言模型 (LLM) 的最新进展及其卓越的推理能力，利用它们进行 KGQA 的趋势越来越明显。然而，现有的方法仅侧重于回答事实问题，例如“西尔维奥·贝卢斯科尼的第一任妻子出生在哪个城市？”，而留下了现实世界用户可能更经常提出的涉及常识推理的问题，例如“我需要单独的签证吗？”去看维伦多夫的维纳斯并参加今年夏天的奥运会？”未解决。在这项工作中，我们首先观察到现有的基于 LLM 的 KGQA 方法在此类问题上与幻觉作斗争，尤其是针对长尾实体（例如非主流和最新实体）的查询，从而阻碍了它们在现实世界应用中的适用性特别是因为他们的推理过程不容易验证。作为回应，我们提出了“Right for Right Reasons”（R3），这是一种常识性的 KGQA 方法，通过公理地呈现法学硕士内在的常识性知识并将每个事实推理步骤都建立在 KG 三元组的基础上，可以实现可验证的推理过程。通过对三个不同任务（问答、主张验证和偏好匹配）的实验评估，我们的研究结果表明 R3 是一种优越的方法，优于现有方法，并显着减少了幻觉和推理错误的情况。

Title: CR-LT-KGQA: A Knowledge Graph Question Answering Dataset Requiring Commonsense Reasoning and Long-Tail Knowledge

Authors: Willis Guo, Armin Toroghi, Scott Sanner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01395
Pdf URL: https://arxiv.org/pdf/2403.01395
Copy Paste: [[2403.01395]] CR-LT-KGQA: A Knowledge Graph Question Answering Dataset Requiring Commonsense Reasoning and Long-Tail Knowledge(https://arxiv.org/abs/2403.01395)
Keywords: language model, llm, hallucination
Abstract: Knowledge graph question answering (KGQA) is a well-established field that seeks to provide factual answers to natural language (NL) questions by leveraging knowledge graphs (KGs). However, existing KGQA datasets suffer from two significant limitations: (1) no existing KGQA dataset requires commonsense reasoning to arrive at an answer and (2) existing KGQA datasets focus on popular entities for which large language models (LLMs) can directly answer without hallucinating and without leveraging the KG. In this work, we seek a novel KGQA dataset that supports commonsense reasoning and focuses on long-tail entities (e.g., non-mainstream and recent entities) where LLMs frequently hallucinate, and thus create the need for novel methodologies that leverage the KG for factual and attributable commonsense inference. We create a novel Commonsense Reasoning (CR) and Long-Tail (LT) KGQA dataset with two subtasks -- question answering and claim verification -- that address both limitations (1) and (2). We construct CR-LT-KGQA by building extensions to existing reasoning datasets StrategyQA and CREAK over Wikidata. While existing KGQA methods are not applicable due to their lack of commonsense inference support, baseline evaluation of LLMs on CR-LT-KGQA demonstrates a high rate of hallucination. Thus, CR-LT-KGQA poses significant challenges for hallucination-prone LLMs, hence paving the way for future commonsense KGQA research to provide accurate and factual answers for long-tail entities in the era of LLMs.
摘要：知识图问答（KGQA）是一个成熟的领域，旨在通过利用知识图（KG）为自然语言（NL）问题提供事实答案。然而，现有的 KGQA 数据集存在两个重大限制：（1）现有的 KGQA 数据集不需要常识推理才能得出答案；（2）现有的 KGQA 数据集关注流行的实体，大型语言模型（LLM）可以直接回答而无需产生幻觉。并且不利用 KG。在这项工作中，我们寻求一种新颖的 KGQA 数据集，该数据集支持常识推理，并专注于法学硕士经常产生幻觉的长尾实体（例如，非主流和最近的实体），因此需要利用 KG 来获取事实的新颖方法。以及可归因的常识性推理。我们创建了一个新颖的常识推理 (CR) 和长尾 (LT) KGQA 数据集，其中包含两个子任务——问答和声明验证——解决了限制 (1) 和 (2)。我们通过在 Wikidata 上构建现有推理数据集 StrategyQA 和 CREAK 的扩展来构建 CR-LT-KGQA。虽然现有的 KGQA 方法由于缺乏常识推理支持而不适用，但对 CR-LT KGQA 的法学硕士的基线评估表明幻觉率很高。因此，CR-LT KGQA 对容易产生幻觉的法学硕士提出了重大挑战，从而为未来常识性 KGQA 研究铺平了道路，为法学硕士时代的长尾实体提供准确和事实的答案。

Title: What Is Missing in Multilingual Visual Reasoning and How to Fix It

Authors: Yueqi Song, Simran Khanuja, Graham Neubig
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01404
Pdf URL: https://arxiv.org/pdf/2403.01404
Copy Paste: [[2403.01404]] What Is Missing in Multilingual Visual Reasoning and How to Fix It(https://arxiv.org/abs/2403.01404)
Keywords: gpt
Abstract: NLP models today strive to support multiple languages and modalities, improving accessibility for diverse users. In this paper, we evaluate their multilingual, multimodal capabilities by testing on a visual reasoning task. We observe that proprietary systems like GPT-4V currently obtain the best performance on this task, while open models lag behind. Surprisingly, GPT-4V exhibits similar performance between English and other languages, indicating the potential for equitable system development across languages. Our analysis of model failures reveals three key aspects that make this task challenging: multilinguality, complex reasoning, and multimodality. To address these challenges, we propose three targeted interventions: a translate-test approach to tackle multilinguality, a visual programming approach to break down complex reasoning, and a novel method that leverages image captioning to address multimodality. Our interventions achieve the best open performance on this task in a zero-shot setting, boosting the open model LLaVA by 13.4%, while also slightly improving GPT-4V's performance.
摘要：如今的 NLP 模型致力于支持多种语言和模式，提高不同用户的可访问性。在本文中，我们通过视觉推理任务测试来评估他们的多语言、多模式能力。我们观察到，像 GPT-4V 这样的专有系统现在在这项任务上获得了最佳性能，但开放模型相比之下表现较差。令人惊讶的是，GPT-4V 在英语和其他语言之间表现出相似的性能，表明跨语言公平系统开发的潜力。我们对模型失败的分析揭示了使这项任务具有挑战性的三个关键方面：多语言性、复杂推理和多模态。为了应对这些挑战，我们提出了三种有针对性的干预措施，包括解决多语言问题的翻译测试方法、分解复杂推理的可视化编程方法以及利用图像字幕解决多模态问题的新颖方法。我们的干预措施在零样本设置下实现了该任务的最佳开放性能，将开放模型 LLaVA 提高了 13.4%，同时也略微提高了 GPT-4V 的性能。

Title: OVEL: Large Language Model as Memory Manager for Online Video Entity Linking

Authors: Haiquan Zhao, Xuwu Wang, Shisong Chen, Zhixu Li, Xin Zheng, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01411
Pdf URL: https://arxiv.org/pdf/2403.01411
Copy Paste: [[2403.01411]] OVEL: Large Language Model as Memory Manager for Online Video Entity Linking(https://arxiv.org/abs/2403.01411)
Keywords: language model, llm
Abstract: In recent years, multi-modal entity linking (MEL) has garnered increasing attention in the research community due to its significance in numerous multi-modal applications. Video, as a popular means of information transmission, has become prevalent in people's daily lives. However, most existing MEL methods primarily focus on linking textual and visual mentions or offline videos' mentions to entities in multi-modal knowledge bases, with limited efforts devoted to linking mentions within online video content. In this paper, we propose a task called Online Video Entity Linking (OVEL), aiming to establish connections between mentions in online videos and a knowledge base with high accuracy and timeliness. To facilitate research on OVEL, we specifically concentrate on live delivery scenarios and construct a live delivery entity linking dataset called LIVE. Besides, we propose an evaluation metric that considers timeliness, robustness, and accuracy. Furthermore, to effectively handle the OVEL task, we leverage a memory block managed by a Large Language Model and retrieve entity candidates from the knowledge base to augment LLM performance on memory management. The experimental results prove the effectiveness and efficiency of our method.
摘要：近年来，多模态实体链接（MEL）由于其在众多多模态应用中的重要性而受到研究界越来越多的关注。视频作为一种流行的信息传播方式，已经深入到人们的日常生活中。然而，大多数现有的 MEL 方法主要侧重于将文本和视觉提及或离线视频的提及链接到多模态知识库中的实体，而致力于链接在线视频内容中的提及的努力有限。在本文中，我们提出了一项名为在线视频实体链接 OVEL 的任务，旨在以高精度和时效性在在线视频中的提及与知识库之间建立连接。为了方便OVEL的研究工作，我们特别关注直播场景并构建了一个名为LIVE的直播实体链接数据集。此外，我们提出了一个考虑永恒性、鲁棒性和准确性的评估指标。此外，为了有效处理 OVEL 任务，我们利用大型语言模型管理的内存块，并从知识库中检索候选实体，以增强 LLM 在内存管理方面的性能。实验结果证明了我们方法的有效性和效率。

Title: Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge

Authors: Heydar Soudani, Evangelos Kanoulas, Faegheh Hasibi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01432
Pdf URL: https://arxiv.org/pdf/2403.01432
Copy Paste: [[2403.01432]] Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge(https://arxiv.org/abs/2403.01432)
Keywords: language model, llm, retrieval augmented generation
Abstract: Large language models (LLMs) memorize a vast amount of factual knowledge, exhibiting strong performance across diverse tasks and domains. However, it has been observed that the performance diminishes when dealing with less-popular or low-frequency concepts and entities, for example in domain-specific applications. The two prominent approaches to enhance the performance of LLMs on low-frequency topics are Retrieval Augmented Generation (RAG) and fine-tuning (FT) over synthetic data. This paper explores and evaluates the impact of RAG and FT on customizing LLMs to handle low-frequency entities in question answering tasks. Our findings indicate that FT significantly boosts the performance across entities of varying popularity, especially in the most and least popular groups, while RAG surpasses other methods. Additionally, the success of both RAG and FT approaches is amplified by advancements in retrieval and data augmentation techniques. We release our data and code at https://github.com/HeydarSoudani/RAGvsFT.
摘要：大型语言模型 (LLM) 可以记忆大量事实知识，在不同的任务和领域中表现出强大的性能。然而，据观察，在处理不太流行或低频的概念和实体时（例如在特定领域的应用程序中），性能会下降。提高法学硕士在低频主题上的表现的两种主要方法是：检索增强生成（RAG）和对合成数据的微调（FT）。本文探讨并评估了 RAG 和 FT 对定制 LLM 处理问答任务低频实体的影响。我们的研究结果表明，FT 显着提高了不同受欢迎程度的实体的性能，尤其是在最受欢迎和最不受欢迎的群体中，而 RAG 则优于其他方法。此外，检索和数据增强技术的进步也放大了 RAG 和 FT 方法的成功。我们在 https://github.com/HeydarSoudani/RAGvsFT 发布了我们的数据和代码。

Title: Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment

Authors: Jingshen Zhang, Jiajun Xie, Xinying Qiu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2403.01456
Pdf URL: https://arxiv.org/pdf/2403.01456
Copy Paste: [[2403.01456]] Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment(https://arxiv.org/abs/2403.01456)
Keywords: language model
Abstract: Item difficulty plays a crucial role in adaptive testing. However, few works have focused on generating questions of varying difficulty levels, especially for multiple-choice (MC) cloze tests. We propose training pre-trained language models (PLMs) as surrogate models to enable item response theory (IRT) assessment, avoiding the need for human test subjects. We also propose two strategies to control the difficulty levels of both the gaps and the distractors using ranking rules to reduce invalid distractors. Experimentation on a benchmark dataset demonstrates that our proposed framework and methods can effectively control and evaluate the difficulty levels of MC cloze tests.
摘要：项目难度在适应性测试中起着至关重要的作用。然而，很少有工作专注于生成不同难度级别的问题，尤其是多项选择（MC）完形填空测试。我们建议训练预先训练的语言模型（PLM）作为替代模型，以实现项目反应理论（IRT）评估，从而避免对人类测试对象的需要。我们还提出了两种策略来控制间隙和干扰项的难度级别，使用排名规则来减少无效干扰项。在基准数据集上的实验表明，我们提出的框架和方法可以有效地控制和评估 MC 完形填空测试的难度级别。
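
Item response theory relates a test taker's ability to the chance of answering an item correctly. A minimal sketch of the standard two-parameter logistic (2PL) response function follows. The abstract does not say which IRT variant the authors fit, so the choice of 2PL and the names `p_correct` and `expected_accuracy` are assumptions made for illustration.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float = 1.0) -> float:
    """2PL IRT: probability that a respondent with `ability` answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def expected_accuracy(abilities: list[float], difficulty: float) -> float:
    """Average success probability of a pool of respondents (e.g. surrogate models)."""
    return sum(p_correct(a, difficulty) for a in abilities) / len(abilities)


# Hypothetical pool of surrogate respondents with spread-out abilities.
pool = [-1.0, 0.0, 1.0]
easy_item = expected_accuracy(pool, difficulty=-2.0)
hard_item = expected_accuracy(pool, difficulty=2.0)
```

Under this model a respondent whose ability equals the item difficulty succeeds half the time, and raising the difficulty lowers the expected accuracy of the whole pool.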

Title: KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations

Authors: Sunjun Kweon, Byungjin Choi, Minkyu Kim, Rae Woong Park, Edward Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01469
Pdf URL: https://arxiv.org/pdf/2403.01469
Copy Paste: [[2403.01469]] KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations(https://arxiv.org/abs/2403.01469)
Keywords: language model
Abstract: We introduce KorMedMCQA, the first Korean multiple-choice question answering (MCQA) benchmark derived from Korean healthcare professional licensing examinations, covering the years 2012 to 2023. This dataset consists of a selection of questions from the license examinations for doctors, nurses, and pharmacists, featuring a diverse array of subjects. We conduct baseline experiments on various large language models, including proprietary/open-source, multilingual/Korean-additional pretrained, and clinical context pretrained models, highlighting the potential for further enhancements. We make our data publicly available on HuggingFace and provide an evaluation script via LM-Harness, inviting further exploration and advancement in Korean healthcare environments.
摘要：我们介绍 KorMedMCQA，这是韩国第一个多项选择题回答 (MCQA) 基准，源自韩国医疗保健专业执照考试，涵盖 2012 年至 2023 年。该数据集包含医生、护士、和药剂师，具有多种学科。我们对各种大型语言模型进行了基线实验，包括专有/开源、多语言/韩语附加预训练模型和临床环境预训练模型，突出了进一步增强的潜力。我们在 HuggingFace 上公开提供数据，并通过 LM-Harness 提供评估脚本，邀请韩国医疗保健环境中的进一步探索和进步。

Title: Infusing Knowledge into Large Language Models with Contextual Prompts

Authors: Kinshuk Vasisht, Balaji Ganesan, Vikas Kumar, Vasudha Bhatnagar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01481
Pdf URL: https://arxiv.org/pdf/2403.01481
Copy Paste: [[2403.01481]] Infusing Knowledge into Large Language Models with Contextual Prompts(https://arxiv.org/abs/2403.01481)
Keywords: language model, llm, prompt
Abstract: Knowledge infusion is a promising method for enhancing Large Language Models for domain-specific NLP tasks rather than pre-training models over large data from scratch. These augmented LLMs typically depend on additional pre-training or knowledge prompts from an existing knowledge graph, which is impractical in many applications. In contrast, knowledge infusion directly from relevant documents is more generalisable and alleviates the need for structured knowledge graphs while also being useful for entities that are usually not found in any knowledge graph. With this motivation, we propose a simple yet generalisable approach for knowledge infusion by generating prompts from the context in the input text. Our experiments show the effectiveness of our approach which we evaluate by probing the fine-tuned LLMs.
摘要：知识注入是一种有前途的方法，可以增强特定领域 NLP 任务的大型语言模型，而不是从头开始对大数据进行预训练模型。这些增强的法学硕士通常依赖于现有知识图的额外预训练或知识提示，这在许多应用中是不切实际的。相比之下，直接从相关文档注入知识更具普适性，并且减轻了对结构化知识图的需求，同时对于通常在任何知识图中找不到的实体也很有用。出于这种动机，我们提出了一种简单但通用的知识注入方法，通过根据输入文本中的上下文生成提示。我们的实验证明了我们方法的有效性，我们通过探索微调的法学硕士来评估该方法。

Title: Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics

Authors: Zhu Liu, Cunliang Kong, Ying Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01509
Pdf URL: https://arxiv.org/pdf/2403.01509
Copy Paste: [[2403.01509]] Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics(https://arxiv.org/abs/2403.01509)
Keywords: language model, llm, prompt
Abstract: Large language models have achieved remarkable success in general language understanding tasks. However, as a family of generative methods with the objective of next token prediction, the semantic evolution with the depth of these models is not fully explored, unlike their predecessors, such as BERT-like architectures. In this paper, we specifically investigate the bottom-up evolution of lexical semantics for a popular LLM, namely Llama2, by probing its hidden states at the end of each layer using a contextualized word identification task. Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction. This is in contrast to models with discriminative objectives, such as masked language modeling, where the higher layers obtain better lexical semantics. The conclusion is further supported by the monotonic increase in performance via the hidden states for the last meaningless symbols, such as punctuation, in the prompting strategy.
摘要：大型语言模型在一般语言理解任务中取得了显着的成功。然而，作为一系列以下一个 token 预测为目标的生成方法，这些模型的语义演化与深度并没有得到充分的探索，不像它们的前辈，例如类似 BERT 的架构。在本文中，我们通过使用上下文单词识别任务探测每层末尾的隐藏状态，专门研究了流行的 LLM（即 Llama2）词汇语义的自下而上的演化。我们的实验表明，较低层的表示编码词汇语义，而语义归纳较弱的较高层负责预测。这与具有判别性目标的模型形成对比，例如掩码语言建模，其中较高层获得更好的词汇语义。通过提示策略中最后一个无意义符号（例如标点符号）的隐藏状态，性能的单调增加进一步支持了该结论。

Title: Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models

Authors: Amal Rannen-Triki, Jorg Bornschein, Razvan Pascanu, Marcus Hutter, Andras György, Alexandre Galashov, Yee Whye Teh, Michalis K. Titsias
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.01518
Pdf URL: https://arxiv.org/pdf/2403.01518
Copy Paste: [[2403.01518]] Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models(https://arxiv.org/abs/2403.01518)
Keywords: language model
Abstract: We consider the problem of fine-tuning the parameters of a language model online at test time, also known as dynamic evaluation. While it is generally known that this approach improves the overall predictive performance, especially when considering distributional shift between training and evaluation data, we here emphasize the perspective that online adaptation turns parameters into temporally changing states and provides a form of context-length extension with memory in weights, more in line with the concept of memory in neuroscience. We pay particular attention to the speed of adaptation (in terms of sample efficiency), sensitivity to the overall distributional drift, and the computational overhead for performing gradient computations and parameter updates. Our empirical study provides insights on when online adaptation is particularly interesting. We highlight that with online adaptation the conceptual distinction between in-context learning and fine-tuning blurs: both are methods to condition the model on previously observed tokens.
摘要：我们考虑在测试时在线微调语言模型参数的问题，也称为动态评估。虽然众所周知，这种方法可以提高整体预测性能，特别是在考虑训练数据和评估数据之间的分布变化时，但我们在这里强调在线适应将参数转变为暂时变化的状态并提供一种具有记忆的上下文长度扩展形式的观点在权重上，更符合神经科学中记忆的概念。我们特别关注适应速度（就样本效率而言）、对整体分布漂移的敏感性以及执行梯度计算和参数更新的计算开销。我们的实证研究提供了关于在线适应何时特别有趣的见解。我们强调，通过在线适应，上下文学习和微调之间的概念区别变得模糊：两者都是根据先前观察到的标记来调节模型的方法。
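
Dynamic evaluation scores each test token with the current parameters and then takes a gradient step on that token, so later predictions benefit from what was just observed. The sketch below illustrates the loop on a deliberately tiny stand-in: a context-free softmax over a small vocabulary rather than a language model. The function `dynamic_eval`, the learning rate, and the toy token stream are all assumptions for illustration, not the setup used in the paper.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_eval(tokens: list[int], vocab_size: int, lr: float = 0.5, adapt: bool = True) -> float:
    """Total negative log-likelihood of `tokens`; with adapt=True, one SGD step follows each token."""
    logits = [0.0] * vocab_size          # the "parameters" of the toy model
    total_nll = 0.0
    for tok in tokens:
        probs = softmax(logits)
        total_nll += -math.log(probs[tok])   # score the token first ...
        if adapt:                            # ... then update on it
            for i in range(vocab_size):
                grad = probs[i] - (1.0 if i == tok else 0.0)  # d(NLL)/d(logit_i)
                logits[i] -= lr * grad
    return total_nll


# Toy stream with a distribution shift halfway through.
stream = [0] * 20 + [1] * 20
static_nll = dynamic_eval(stream, vocab_size=4, adapt=False)
adaptive_nll = dynamic_eval(stream, vocab_size=4, adapt=True)
```

On this stream the adaptive run pays a temporary penalty right after the shift and then recovers, which is the sample-efficiency and drift-sensitivity trade-off the abstract refers to.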

Title: In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation

Authors: Shiqi Chen, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, Junxian He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.01548
Pdf URL: https://arxiv.org/pdf/2403.01548
Copy Paste: [[2403.01548]] In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation(https://arxiv.org/abs/2403.01548)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) frequently hallucinate and produce factual errors, yet our understanding of why they make these errors remains limited. In this study, we delve into the underlying mechanisms of LLM hallucinations from the perspective of inner representations, and discover a salient pattern associated with hallucinations: correct generations tend to have sharper context activations in the hidden states of the in-context tokens, compared to the incorrect ones. Leveraging this insight, we propose an entropy-based metric to quantify the "sharpness" among the in-context hidden states and incorporate it into the decoding process to formulate a constrained decoding approach. Experiments on various knowledge-seeking and hallucination benchmarks demonstrate our approach's consistent effectiveness, for example, achieving up to an 8.6 point improvement on TruthfulQA. We believe this study can improve our understanding of hallucinations and serve as a practical solution for hallucination mitigation.
摘要：大型语言模型（LLM）经常产生幻觉并产生事实错误，但我们对它们为何会犯这些错误的理解仍然有限。在本研究中，我们从内部表征的角度深入探讨了LLM幻觉的潜在机制，并发现了与幻觉相关的显着模式：与传统代相比，正确的一代往往在上下文内标记的隐藏状态中具有更清晰的上下文激活。不正确的。利用这种见解，我们提出了一种基于熵的度量来量化上下文隐藏状态之间的“清晰度”，并将其合并到解码过程中以制定约束解码方法。对各种知识寻求和幻觉基准的实验证明了我们的方法的一致有效性，例如，在 TruthfulQA 上实现了高达 8.6 分的改进。我们相信这项研究可以提高我们对幻觉的理解，并作为缓解幻觉的实用解决方案。
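
One way to read "entropy-based sharpness" is this: normalize a candidate token's activations over the in-context positions into a distribution and take its Shannon entropy, so that low entropy means a sharp, concentrated activation pattern. The sketch below follows that reading only. The abstract gives no formula, so `activation_entropy`, `adjusted_score`, and the weight `alpha` are hypothetical and should not be taken as the paper's metric or decoding rule.

```python
import math


def activation_entropy(activations: list[float]) -> float:
    """Shannon entropy (nats) of non-negative activations normalised to sum to 1."""
    total = sum(activations)
    if total <= 0:
        raise ValueError("activations must contain a positive value")
    probs = [a / total for a in activations]
    return -sum(p * math.log(p) for p in probs if p > 0)


def adjusted_score(log_prob: float, activations: list[float], alpha: float = 1.0) -> float:
    """Hypothetical decoding score: penalise candidates whose activations are diffuse."""
    return log_prob - alpha * activation_entropy(activations)


# Two candidates with equal model log-probability but different activation patterns.
sharp = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
prefers_sharp = adjusted_score(-1.0, sharp) > adjusted_score(-1.0, flat)
```

A uniform pattern over four positions reaches the maximum entropy log(4), so under this sketch the sharper candidate is preferred whenever the log-probabilities tie.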

Title: SERVAL: Synergy Learning between Vertical Models and LLMs towards Oracle-Level Zero-shot Medical Prediction

Authors: Jiahuan Yan, Jintai Chen, Chaowen Hu, Bo Zheng, Yaojun Hu, Jimeng Sun, Jian Wu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.01570
Pdf URL: https://arxiv.org/pdf/2403.01570
Copy Paste: [[2403.01570]] SERVAL: Synergy Learning between Vertical Models and LLMs towards Oracle-Level Zero-shot Medical Prediction(https://arxiv.org/abs/2403.01570)
Keywords: language model, gpt, llm
Abstract: Recent development of large language models (LLMs) has exhibited impressive zero-shot proficiency on generic and common sense questions. However, LLMs' application on domain-specific vertical questions still lags behind, primarily due to hallucination problems and deficiencies in vertical knowledge. Furthermore, the vertical data annotation process often requires labor-intensive expert involvement, thereby presenting an additional challenge in enhancing the model's vertical capabilities. In this paper, we propose SERVAL, a synergy learning pipeline designed for unsupervised development of vertical capabilities in both LLMs and small models by mutual enhancement. Specifically, SERVAL utilizes the LLM's zero-shot outputs as annotations, leveraging its confidence to teach a robust vertical model from scratch. Conversely, the trained vertical model guides the LLM fine-tuning to enhance its zero-shot capability, progressively improving both models through an iterative process. In the medical domain, known for complex vertical knowledge and costly annotations, comprehensive experiments show that, without access to any gold labels, SERVAL with the synergy learning of OpenAI GPT-3.5 and a simple model attains fully-supervised competitive performance across ten widely used medical datasets. These datasets represent vertically specialized medical diagnostic scenarios (e.g., diabetes, heart diseases, COVID-19), highlighting the potential of SERVAL in refining the vertical capabilities of LLMs and training vertical models from scratch, all achieved without the need for annotations.
摘要：最近开发的大型语言模型（LLM）在通用和常识问题上表现出了令人印象深刻的零样本熟练程度。然而，法学硕士在特定领域垂直问题上的应用仍然滞后，这主要是由于羞辱问题和垂直知识的缺乏。此外，垂直数据注释过程通常需要劳动密集型专家的参与，从而对增强模型的垂直能力提出了额外的挑战。在本文中，我们提出了 SERVAL，这是一种协同学习管道，旨在通过相互增强来无监督地开发法学硕士和小型模型的垂直能力。具体来说，SERVAL 利用法学硕士的零样本输出作为注释，利用其信心从头开始教授稳健的垂直模型。相反，经过训练的垂直模型指导LLM微调以增强其零样本能力，通过迭代过程逐步改进两个模型。在以复杂的垂直知识和昂贵的注释而闻名的医学领域，综合实验表明，在没有获得任何黄金标签的情况下，SERVAL 凭借 OpenAI GPT-3.5 的协同学习和简单的模型，在十个广泛使用的医疗领域中获得了完全监督的竞争性能。数据集。这些数据集代表了垂直专业的医疗诊断场景（例如糖尿病、心脏病、COVID-19），凸显了 SERVAL 在完善法学硕士的垂直能力和从头开始训练垂直模型方面的潜力，所有这些都无需注释即可实现。

Title: Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures

Authors: Séamus Lankford
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01580
Pdf URL: https://arxiv.org/pdf/2403.01580
Copy Paste: [[2403.01580]] Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures(https://arxiv.org/abs/2403.01580)
Keywords: llm
Abstract: In the current machine translation (MT) landscape, the Transformer architecture stands out as the gold standard, especially for high-resource language pairs. This research delves into its efficacy for low-resource language pairs including both the English↔Irish and English↔Marathi language pairs. Notably, the study identifies the optimal hyperparameters and subword model type to significantly improve the translation quality of Transformer models for low-resource language pairs. The scarcity of parallel datasets for low-resource languages can hinder MT development. To address this, gaHealth was developed, the first bilingual corpus of health data for the Irish language. Focusing on the health domain, models developed using this in-domain dataset exhibited very significant improvements in BLEU score when compared with models from the LoResMT2021 Shared Task. A subsequent human evaluation using the multidimensional quality metrics error taxonomy showcased the superior performance of the Transformer system in reducing both accuracy and fluency errors compared to an RNN-based counterpart. Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source applications streamlined for the development, fine-tuning, and deployment of neural machine translation models. These tools considerably simplify the setup and evaluation process, making MT more accessible to both developers and translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes eco-friendly natural language processing research by highlighting the environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM demonstrated advancements in translation performance for two low-resource language pairs: English↔Irish and English↔Marathi, compared to baselines from the LoResMT2021 Shared Task.
摘要：在当前的机器翻译 (MT) 领域，Transformer 架构脱颖而出，成为黄金标准，尤其是对于高资源语言对而言。这项研究深入研究了其对低资源语言对的功效，包括英语$\leftrightarrow$爱尔兰语和英语$\leftrightarrow$马拉地语语言对。值得注意的是，该研究确定了最佳超参数和子词模型类型，以显着提高 Transformer 模型对低资源语言对的翻译质量。低资源语言并行数据集的稀缺可能会阻碍机器翻译的发展。为了解决这个问题，我们开发了 gaHealth，这是第一个爱尔兰语健康数据双语语料库。专注于健康领域，与 LoResMT2021 共享任务的模型相比，使用该域内数据集开发的模型在 BLEU 分数上表现出非常显着的改进。随后使用多维质量指标错误分类法进行的人工评估显示，与基于 RNN 的对应系统相比，Transformer 系统在减少准确性和流畅性错误方面具有卓越的性能。此外，本文还介绍了adaptNMT和adaptMLLM，这两个开源应用程序简化了神经机器翻译模型的开发、微调和部署。这些工具大大简化了设置和评估过程，使开发人员和翻译人员更容易使用机器翻译。值得注意的是，adaptNMT 立足于 OpenNMT 生态系统，通过强调模型开发的环境足迹来促进生态友好的自然语言处理研究。与 LoResMT2021 共享任务的基线相比，adaptMLLM 对 MLLM 的微调展示了两种低资源语言对的翻译性能的进步：英语$\leftrightarrow$爱尔兰语和英语$\leftrightarrow$马拉地语。

Title: Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models

Authors: Nguyen Quang Duc, Le Hai Son, Nguyen Duc Nhan, Nguyen Dich Nhat Minh, Le Thanh Huong, Dinh Viet Sang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01616
Pdf URL: https://arxiv.org/pdf/2403.01616
Copy Paste: [[2403.01616]] Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models(https://arxiv.org/abs/2403.01616)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This paper presents our contributions towards advancing the state of Vietnamese language understanding and generation through the development and dissemination of open datasets and pre-trained models for Vietnamese Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs).
摘要：本文介绍了我们通过开发和传播越南语检索增强生成 (RAG) 和大型语言模型 (LLM) 的开放数据集和预训练模型，对促进越南语理解和生成状态做出的贡献。

Title: Hypertext Entity Extraction in Webpage

Authors: Yifei Yang, Tianqiao Liu, Bo Shao, Hai Zhao, Linjun Shou, Ming Gong, Daxin Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01698
Pdf URL: https://arxiv.org/pdf/2403.01698
Copy Paste: [[2403.01698]] Hypertext Entity Extraction in Webpage(https://arxiv.org/abs/2403.01698)
Keywords: gpt
Abstract: Webpage entity extraction is a fundamental natural language processing task in both research and applications. Nowadays, the majority of webpage entity extraction models are trained on structured datasets which strive to retain textual content and its structure information. However, existing datasets all overlook the rich hypertext features (e.g., font color, font size) which show their effectiveness in previous works. To this end, we first collect a Hypertext Entity Extraction Dataset (HEED) from the e-commerce domains, scraping both the text and the corresponding explicit hypertext features with high-quality manual entity annotations. Furthermore, we present the MoE-based Entity Extraction Framework (MoEEF), which efficiently integrates multiple features to enhance model performance by Mixture of Experts and outperforms strong baselines, including the state-of-the-art small-scale models and GPT-3.5-turbo. Moreover, the effectiveness of hypertext features in HEED and several model components in MoEEF are analyzed.
摘要：网页实体提取是研究和应用中的一项基本自然语言处理任务。如今，大多数网页实体提取模型都是在结构化数据集上进行训练的，这些数据集力求保留文本内容及其结构信息。然而，现有的数据集都忽略了丰富的超文本特征（例如字体颜色、字体大小），这些特征在以前的工作中显示了它们的有效性。为此，我们首先从电子商务域收集 \textbf{H}ypertext \textbf{E}ntity \textbf{E}xtraction \textbf{D}ataset (\textit{HEED})，同时抓取文本以及相应的显式超文本特征和高质量的手动实体注释。此外，我们提出了基于 \textbf{Mo}E 的 \textbf{E}ntity \textbf{E}xtraction \textbf{F} 框架（\textit{MoEEF}），它有效地集成了多个特征，通过 Mixture 增强模型性能专家组成，并且优于强大的基线，包括最先进的小规模模型和 GPT-3.5-turbo。此外，还分析了\textit{HEED}中超文本特征和\textit{MoEEF}中几个模型组件的有效性。

Title: Decode Neural signal as Speech

Authors: Yiqian Yang, Yiqun Duan, Qiang Zhang, Renjing Xu, Hui Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01748
Pdf URL: https://arxiv.org/pdf/2403.01748
Copy Paste: [[2403.01748]] Decode Neural signal as Speech(https://arxiv.org/abs/2403.01748)
Keywords: language model
Abstract: Decoding language from brain dynamics is an important open direction in the realm of brain-computer interface (BCI), especially considering the rapid growth of large language models. Compared to invasive signals, which require electrode implantation surgery, non-invasive neural signals (e.g., EEG, MEG) have attracted increasing attention considering their safety and generality. However, the exploration is not adequate in three aspects: 1) previous methods mainly focus on EEG, and none of the previous works address this problem on MEG, which has better signal quality; 2) prior works have predominantly used "teacher-forcing" during generative decoding, which is impractical; 3) prior works are mostly "BART-based" rather than fully auto-regressive, although fully auto-regressive models perform better in other sequence tasks. In this paper, we explore the brain-to-text translation of MEG signals in a speech-decoding formation. Here we are the first to investigate a cross-attention-based "whisper" model for generating text directly from MEG signals without teacher forcing. Our model achieves impressive BLEU-1 scores of 60.30 and 52.89 without pretraining and teacher-forcing on two major datasets (GWilliams and Schoffelen). This paper conducts a comprehensive review to understand how speech decoding formation performs on the neural decoding tasks, including pretraining initialization, training and evaluation set splitting, augmentation, and scaling law.
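Example (illustrative, not from the paper): BLEU-1, the metric reported above, is the clipped unigram precision of a hypothesis against a reference, scaled by a brevity penalty. Below is a minimal single-reference sketch. The function name `bleu1` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the authors' actual evaluation pipeline (tokenizer, corpus-level aggregation).

```python
import math
from collections import Counter


def bleu1(hypothesis: list[str], reference: list[str]) -> float:
    """Sentence-level BLEU-1 against a single reference (illustrative only)."""
    if not hypothesis:
        return 0.0
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    # Clip each hypothesis unigram count by its count in the reference.
    clipped = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    precision = clipped / len(hypothesis)
    # Brevity penalty: penalize hypotheses that are shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(hypothesis))
    return brevity_penalty * precision


if __name__ == "__main__":
    ref = "the cat sat on the mat".split()
    hyp = "the cat sat on a mat".split()
    # Scores are usually reported on a 0-100 scale, as in the abstract.
    print(round(100 * bleu1(hyp, ref), 2))
```

Five of the six hypothesis tokens match the reference and the lengths are equal, so the toy score is 5/6 before scaling.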

Title: Differentially Private Synthetic Data via Foundation Model APIs 2: Text

Authors: Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01749
Pdf URL: https://arxiv.org/pdf/2403.01749
Copy Paste: [[2403.01749]] Differentially Private Synthetic Data via Foundation Model APIs 2: Text(https://arxiv.org/abs/2403.01749)
Keywords: language model, gpt, llm
Abstract: Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
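Example (illustrative, not from the paper): the abstract does not describe how Aug-PE enforces differential privacy internally. As general background only, the Gaussian mechanism releases a statistic after adding noise scaled to how much a single record can change it. The sketch below applies that idea to a histogram in which each private record casts one vote for a synthetic candidate. The voting setup, the function name `noisy_vote_histogram`, and the parameter `sigma` are assumptions for illustration; choosing `sigma` for a target (ε, δ) is out of scope here.

```python
import random


def noisy_vote_histogram(votes, num_candidates, sigma, seed=None):
    """Count votes per candidate, then add independent Gaussian noise to each count."""
    rng = random.Random(seed)
    counts = [0] * num_candidates
    for candidate in votes:  # each private record casts exactly one vote
        counts[candidate] += 1
    return [count + rng.gauss(0.0, sigma) for count in counts]


if __name__ == "__main__":
    # Hypothetical data: five private records, each voting for one of three synthetic texts.
    votes = [0, 2, 2, 1, 2]
    print(noisy_vote_histogram(votes, num_candidates=3, sigma=1.0, seed=0))
```

Because every record contributes to exactly one bin, adding or removing a record changes the histogram by at most 1 in a single coordinate, which is what lets the noise scale be fixed in advance.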

Title: Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models

Authors: Feihu Jin, Yin Liu, Ying Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01754
Pdf URL: https://arxiv.org/pdf/2403.01754
Copy Paste: [[2403.01754]] Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models(https://arxiv.org/abs/2403.01754)
Keywords: language model
Abstract: Parameter-efficient tuning methods such as LoRA could achieve comparable performance to model tuning by tuning a small portion of the parameters. However, substantial computational resources are still required, as this process involves calculating gradients and performing back-propagation throughout the model. Much effort has recently been devoted to utilizing the derivative-free optimization method to eschew the computation of gradients and showcase an augmented level of robustness in few-shot settings. In this paper, we prepend the low-rank modules into each self-attention layer of the model and employ two derivative-free optimization methods to optimize these low-rank modules at each layer alternately. Extensive results on various tasks and language models demonstrate that our proposed method achieves substantial improvement and exhibits clear advantages in memory usage and convergence speed compared to existing gradient-based parameter-efficient tuning and derivative-free optimization methods in few-shot settings.
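Example (illustrative, not from the paper): to make gradient-free tuning of low-rank modules concrete, the toy sketch below fits a rank-1 update `b a^T` to a small target matrix with greedy random search, alternating between the two factors. The random-search optimizer, the squared-error objective, and all names (`loss`, `perturb`, `tune`) are assumptions for illustration. The abstract does not name the two derivative-free optimizers the authors use, and their alternation is across layers rather than across factors.

```python
import random


def low_rank_update(b, a):
    """Rank-1 update delta_W = b a^T (outer product of two vectors)."""
    return [[bi * aj for aj in a] for bi in b]


def loss(b, a, target):
    """Squared error between the rank-1 update and a target matrix."""
    delta = low_rank_update(b, a)
    return sum((delta[i][j] - target[i][j]) ** 2
               for i in range(len(target)) for j in range(len(target[0])))


def perturb(vec, scale, rng):
    """Propose a candidate by adding Gaussian noise; no gradients are computed."""
    return [v + rng.gauss(0.0, scale) for v in vec]


def tune(target, steps=2000, scale=0.1, seed=0):
    rng = random.Random(seed)
    b = [rng.gauss(0.0, 0.1) for _ in target]
    a = [rng.gauss(0.0, 0.1) for _ in target[0]]
    best = loss(b, a, target)
    history = [best]
    for step in range(steps):
        # Alternate: even steps perturb b, odd steps perturb a.
        if step % 2 == 0:
            cand_b, cand_a = perturb(b, scale, rng), a
        else:
            cand_b, cand_a = b, perturb(a, scale, rng)
        cand_loss = loss(cand_b, cand_a, target)
        if cand_loss <= best:  # keep the candidate only if it is not worse
            b, a, best = cand_b, cand_a, cand_loss
        history.append(best)
    return b, a, history


if __name__ == "__main__":
    target = [[1.0, 2.0], [2.0, 4.0]]  # rank-1 target, chosen for illustration
    _, _, history = tune(target)
    print(history[0], history[-1])
```

Only forward evaluations of `loss` are needed, which is the property that lets derivative-free methods avoid back-propagation; the trade-off is that many more evaluations are typically required.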

Title: WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations

Authors: Haolin Deng, Chang Wang, Xin Li, Dezhang Yuan, Junlang Zhan, Tianhua Zhou, Jin Ma, Jun Gao, Ruifeng Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01774
Pdf URL: https://arxiv.org/pdf/2403.01774
Copy Paste: [[2403.01774]] WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations(https://arxiv.org/abs/2403.01774)
Keywords: language model, llm
Abstract: Enhancing the attribution in large language models (LLMs) is a crucial task. One feasible approach is to enable LLMs to cite external sources that support their generations. However, existing datasets and evaluation methods in this domain still exhibit notable limitations. In this work, we formulate the task of attributed query-focused summarization (AQFS) and present WebCiteS, a Chinese dataset featuring 7k human-annotated summaries with citations. WebCiteS derives from real-world user queries and web search results, offering a valuable resource for model training and evaluation. Prior works in attribution evaluation do not differentiate between groundedness errors and citation errors. They also fall short in automatically verifying sentences that draw partial support from multiple sources. We tackle these issues by developing detailed metrics and enabling the automatic evaluator to decompose the sentences into sub-claims for fine-grained verification. Our comprehensive evaluation of both open-source and proprietary models on WebCiteS highlights the challenge LLMs face in correctly citing sources, underscoring the necessity for further improvement. The dataset and code will be open-sourced to facilitate further research in this crucial field.

Title: NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models

Authors: Lizhou Fan, Wenyue Hua, Xiang Li, Kaijie Zhu, Mingyu Jin, Lingyao Li, Haoyang Ling, Jinkui Chi, Jindong Wang, Xin Ma, Yongfeng Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2403.01777
Pdf URL: https://arxiv.org/pdf/2403.01777
Copy Paste: [[2403.01777]] NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models(https://arxiv.org/abs/2403.01777)
Keywords: language model, llm, prompt
Abstract: Understanding the reasoning capabilities of Multimodal Large Language Models (MLLMs) is an important area of research. In this study, we introduce a dynamic benchmark, NPHardEval4V, aimed at addressing the existing gaps in evaluating the pure reasoning abilities of MLLMs. Our benchmark aims to provide a venue to disentangle the effect of various factors, such as image recognition and instruction following, from the overall performance of the models, allowing us to focus solely on evaluating their reasoning abilities. Our findings reveal significant discrepancies in reasoning abilities across different models and highlight the relatively weak performance of MLLMs compared to LLMs in terms of reasoning. We also investigate the impact of different prompting styles, including visual, text, and combined vision and text prompts, on the reasoning abilities of MLLMs, demonstrating the different impacts of multimodal inputs on model performance. Unlike traditional benchmarks, which primarily focus on static evaluations, our benchmark will update on a monthly basis to prevent overfitting and ensure a more accurate evaluation of the models. We believe that this benchmark can aid in understanding and guide the further development of reasoning abilities in MLLMs. The benchmark dataset and code are available at https://github.com/lizhouf/NPHardEval4V.

Title: NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural

Authors: Wilson Wongso, David Samuel Setiawan, Steven Limcorn, Ananto Joyoadikusumo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01817
Pdf URL: https://arxiv.org/pdf/2403.01817
Copy Paste: [[2403.01817]] NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural(https://arxiv.org/abs/2403.01817)
Keywords: language model
Abstract: Indonesia's linguistic landscape is remarkably diverse, encompassing over 700 languages and dialects, making it one of the world's most linguistically rich nations. This diversity, coupled with the widespread practice of code-switching and the presence of low-resource regional languages, presents unique challenges for modern pre-trained language models. In response to these challenges, we developed NusaBERT, building upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects. Through rigorous evaluation across a range of benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks involving multiple languages of Indonesia, paving the way for future natural language understanding research for under-represented languages.

Title: Making Pre-trained Language Models Great on Tabular Prediction

Authors: Jiahuan Yan, Bo Zheng, Hongxia Xu, Yiheng Zhu, Danny Chen, Jimeng Sun, Jian Wu, Jintai Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.01841
Pdf URL: https://arxiv.org/pdf/2403.01841
Copy Paste: [[2403.01841]] Making Pre-trained Language Models Great on Tabular Prediction(https://arxiv.org/abs/2403.01841)
Keywords: language model
Abstract: The transferability of deep neural networks (DNNs) has made significant progress in image and language processing. However, due to the heterogeneity among tables, this DNN benefit is still far from being well exploited on tabular data prediction (e.g., regression or classification tasks). Condensing knowledge from diverse domains, language models (LMs) possess the capability to comprehend feature names from various tables, potentially serving as versatile learners in transferring knowledge across distinct tables and diverse prediction tasks, but their discrete text representation space is inherently incompatible with numerical feature values in tables. In this paper, we present TP-BERTa, a specifically pre-trained LM for tabular data prediction. Concretely, a novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names. Comprehensive experiments demonstrate that our pre-trained TP-BERTa leads the performance among tabular DNNs and is competitive with Gradient Boosted Decision Tree models in the typical tabular data regime.

Title: Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral

Authors: Yiming Cui, Xin Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01851
Pdf URL: https://arxiv.org/pdf/2403.01851
Copy Paste: [[2403.01851]] Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral(https://arxiv.org/abs/2403.01851)
Keywords: language model, llm
Abstract: Mixtral, a representative sparse mixture of experts (SMoE) language model, has received significant attention due to its unique model design and superior performance. Based on Mixtral-8x7B-v0.1, in this paper, we propose Chinese-Mixtral and Chinese-Mixtral-Instruct with improved Chinese language abilities by adopting further pre-training and instruction fine-tuning. Experimental results show that our Chinese-Mixtral and Chinese-Mixtral-Instruct successfully improve Chinese understanding and generation performance while retaining the original English abilities. Then, we discuss several key questions when performing language adaptation on large language models, including the necessity of extending the language-specific vocabulary and the choice of the initialization model (foundation model vs. instruction model), by providing empirical results and analysis. We also present the visualizations of each expert to examine their importance on downstream tasks. Our resources are publicly available through https://github.com/ymcui/Chinese-Mixtral.
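Example (illustrative, not from the paper): as background on the sparse mixture-of-experts design mentioned above, a router scores every expert for each token, only the top-k experts are evaluated, and their outputs are combined with renormalized gate weights. The sketch below uses scalar toy "experts" and k = 2 purely for illustration. Names such as `route_top_k` and `moe_forward` are hypothetical, and this is not the Mixtral implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])  # renormalize over selected experts
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


if __name__ == "__main__":
    # Toy experts: each one just scales its input by a fixed factor.
    experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
    logits = [0.1, 2.0, -1.0, 2.0]
    print(route_top_k(logits))
    print(moe_forward(1.0, experts, logits))
```

The sparsity is what keeps per-token compute well below the total parameter count: with four experts and k = 2, half of the expert parameters are untouched for any given input.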

Title: An Improved Traditional Chinese Evaluation Suite for Foundation Model

Authors: Zhi-Rui Tam, Ya-Ting Pai, Yen-Wei Lee, Sega Cheng, Hong-Han Shuai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01858
Pdf URL: https://arxiv.org/pdf/2403.01858
Copy Paste: [[2403.01858]] An Improved Traditional Chinese Evaluation Suite for Foundation Model(https://arxiv.org/abs/2403.01858)
Keywords: language model
Abstract: We present TMMLU+, a comprehensive dataset designed for Traditional Chinese massive multitask language understanding. TMMLU+ is a multiple-choice question-answering dataset with 66 subjects from elementary to professional level. Compared to its predecessor, TMMLU, TMMLU+ is six times larger and boasts a more balanced subject distribution. We included benchmark results in TMMLU+ from closed-source models and 24 open-weight Chinese large language models with parameter counts ranging from 1.8B to 72B. Our findings reveal that Traditional Chinese models still trail behind their Simplified Chinese counterparts. Additionally, current large language models have yet to surpass human performance in average scores. We publicly release our dataset and the corresponding benchmark source code.

Title: Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family

Authors: Rodrigo Santos, João Rodrigues, Luís Gomes, João Silva, António Branco, Henrique Lopes Cardoso, Tomás Freitas Osório, Bernardo Leite
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01897
Pdf URL: https://arxiv.org/pdf/2403.01897
Copy Paste: [[2403.01897]] Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family(https://arxiv.org/abs/2403.01897)
Keywords: language model
Abstract: To foster the neural encoding of Portuguese, this paper contributes foundation encoder models that represent an expansion of the still very scarce ecosystem of large language models specifically developed for this language that are fully open, in the sense that they are open source and openly distributed for free under an open license for any purpose, thus including research and commercial usages. Like most languages other than English, Portuguese is low-resourced in terms of these foundational language resources, with the inaugural 900 million parameter Albertina and the 335 million parameter Bertimbau being available. Taking these two models as an inaugural set, we present the extension of the ecosystem of state-of-the-art open encoders for Portuguese with a larger, top performance-driven model with 1.5 billion parameters, and a smaller, efficiency-driven model with 100 million parameters. While achieving this primary goal, further results that are relevant for this ecosystem were obtained as well, namely new datasets for Portuguese based on the SuperGLUE benchmark, which we also distribute openly.

Title: To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering

Authors: Giacomo Frisoni, Alessio Cocchieri, Alex Presepi, Gianluca Moro, Zaiqiao Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.01924
Pdf URL: https://arxiv.org/pdf/2403.01924
Copy Paste: [[2403.01924]] To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering(https://arxiv.org/abs/2403.01924)
Keywords: language model, prompt
Abstract: Medical open-domain question answering demands substantial access to specialized knowledge. Recent efforts have sought to decouple knowledge from model parameters, counteracting architectural scaling and allowing for training on common low-resource hardware. The retrieve-then-read paradigm has become ubiquitous, with model predictions grounded on relevant knowledge pieces from external repositories such as PubMed, textbooks, and UMLS. An alternative path, still under-explored but made possible by the advent of domain-specific large language models, entails constructing artificial contexts through prompting. As a result, "to generate or to retrieve" is the modern equivalent of Hamlet's dilemma. This paper presents MedGENIE, the first generate-then-read framework for multiple-choice question answering in medicine. We conduct extensive experiments on MedQA-USMLE, MedMCQA, and MMLU, incorporating a practical perspective by assuming a maximum of 24GB VRAM. MedGENIE sets a new state-of-the-art (SOTA) in the open-book setting of each testbed, even allowing a small-scale reader to outcompete zero-shot closed-book 175B baselines while using up to 706× fewer parameters. Overall, our findings reveal that generated passages are more effective than retrieved counterparts in attaining higher accuracy.
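Example (illustrative arithmetic only): a quick back-of-the-envelope check of the parameter claim above. If the closed-book baseline has 175B parameters and the reader uses up to 706× fewer, the reader has roughly 248M parameters. The snippet only restates the abstract's two numbers; the variable names are illustrative.

```python
BASELINE_PARAMS = 175e9  # "175B" baseline from the abstract
REDUCTION_FACTOR = 706   # "up to 706x fewer parameters"

reader_params = BASELINE_PARAMS / REDUCTION_FACTOR
print(f"{reader_params / 1e6:.0f}M parameters")  # about 248M
```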

Title: IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

Authors: Tahir Javed, Janki Atul Nawale, Eldho Ittan George, Sakshi Joshi, Kaushal Santosh Bhogale, Deovrat Mehendale, Ishvinder Virender Sethi, Aparna Ananthanarayanan, Hafsah Faquih, Pratiti Palit, Sneha Ravishankar, Saranya Sukumaran, Tripura Panchagnula, Sunjay Murali, Kunal Sharad Gandhi, Ambujavalli R, Manickam K M, C Venkata Vaijayanthi, Krishnan Srinivasa Raghavan Karunganni, Pratyush Kumar, Mitesh M Khapra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01926
Pdf URL: https://arxiv.org/pdf/2403.01926
Copy Paste: [[2403.01926]] IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages(https://arxiv.org/abs/2403.01926)
Keywords: prompt
Abstract: We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open-source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all the 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available.
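Example (illustrative arithmetic only): the percentages in the abstract can be turned into approximate hour counts, and the transcribed portion into a share of the total. Because the published percentages are rounded, the per-type hours below are estimates; the variable names are illustrative.

```python
TOTAL_HOURS = 7348
SHARES = {"read": 0.09, "extempore": 0.74, "conversational": 0.17}
TRANSCRIBED_HOURS = 1639

# Approximate hours per speech type, derived from the rounded percentages.
hours_by_type = {kind: TOTAL_HOURS * share for kind, share in SHARES.items()}
transcribed_fraction = TRANSCRIBED_HOURS / TOTAL_HOURS

for kind, hours in hours_by_type.items():
    print(f"{kind}: ~{hours:.0f} h")
print(f"transcribed: {transcribed_fraction:.1%}")
```

This works out to roughly 661 hours of read speech, 5438 of extempore speech, and 1249 of conversational speech, with about 22.3% of the audio transcribed so far.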

Title: Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet?

Authors: Evgeniia Razumovskaia, Ivan Vulić, Anna Korhonen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01929
Pdf URL: https://arxiv.org/pdf/2403.01929
Copy Paste: [[2403.01929]] Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet?(https://arxiv.org/abs/2403.01929)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT), supervised instruction tuning (SIT) and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning. ICL has gained popularity recently with the advent of LLMs due to its simplicity and sample efficiency. Prior research has conducted only limited investigation into how these approaches work for multilingual few-shot learning, and the focus so far has been mostly on their performance. In this work, we present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups. Importantly, performance is only one aspect of the comparison: we also analyse the approaches through the lens of their computational, inference and financial costs. Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements. As another contribution, we analyse the impact of target language adaptation of pretrained LLMs and find that the standard adaptation approaches can (superficially) improve target language generation capabilities, but language understanding elicited through ICL does not improve and remains limited, with low scores especially for low-resource languages.

Title: VariErr NLI: Separating Annotation Error from Human Label Variation

Authors: Leon Weber-Genzel, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01931
Pdf URL: https://arxiv.org/pdf/2403.01931
Copy Paste: [[2403.01931]] VariErr NLI: Separating Annotation Error from Human Label Variation(https://arxiv.org/abs/2403.01931)
Keywords: gpt
Abstract: Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation scheme with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,574 validity judgments on 1,933 explanations for 500 re-annotated NLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform compared to GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.

Title: DECIDER: A Rule-Controllable Decoding Strategy for Language Generation by Imitating Dual-System Cognitive Theory

Authors: Chen Xu, Tian Lan, Changlong Yu, Wei Wang, Jun Gao, Yu Ji, Qunxi Dong, Kun Qian, Piji Li, Wei Bi, Bin Hu
Subjects: cs.CL, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2403.01954
Pdf URL: https://arxiv.org/pdf/2403.01954
Copy Paste: [[2403.01954]] DECIDER: A Rule-Controllable Decoding Strategy for Language Generation by Imitating Dual-System Cognitive Theory(https://arxiv.org/abs/2403.01954)
Keywords: language model
Abstract: Lexicon-based constrained decoding approaches aim to control the meaning or style of the generated text through certain target concepts. Existing approaches over-focus on the targets themselves, leading to a lack of high-level reasoning about how to achieve them. However, humans usually tackle tasks by following certain rules that focus not only on the targets but also on semantically relevant concepts that induce the occurrence of targets. In this work, we present DECIDER, a rule-controllable decoding strategy for constrained language generation inspired by dual-system cognitive theory. Specifically, in DECIDER, a pre-trained language model (PLM) is equipped with a logic reasoner that takes high-level rules as input. Then, DECIDER allows rule signals to flow into the PLM at each decoding step. Extensive experimental results demonstrate that DECIDER can effectively follow given rules to guide generation direction toward the targets in a more human-like manner.

Title: AS-ES Learning: Towards Efficient CoT Learning in Small Models

Authors: Nuwa Xi, Yuhan Chen, Sendong Zhao, Haochun Wang, Bing Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01969
Pdf URL: https://arxiv.org/pdf/2403.01969
Copy Paste: [[2403.01969]] AS-ES Learning: Towards Efficient CoT Learning in Small Models(https://arxiv.org/abs/2403.01969)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) serves as a critical emerging ability in LLMs, especially when it comes to logical reasoning. Attempts have been made to induce such ability in small models as well by distilling from the data with CoT generated by Large Language Models (LLMs). However, existing methods often simply generate and incorporate more data from LLMs and fail to note the importance of efficiently utilizing existing CoT data. We here propose a new training paradigm AS-ES (Abstractive Segments - Extractive Segments) learning, which exploits the inherent information in CoT for iterative generation. Experiments show that our methods surpass the direct seq2seq training on CoT-extensive tasks like MWP and PET summarization, without data augmentation or altering the model itself. Furthermore, we explore the reason behind the inefficiency of small models in learning CoT and provide an explanation of why AS-ES learning works, giving insights into the underlying mechanism of CoT.

Title: Multi-perspective Improvement of Knowledge Graph Completion with Large Language Models

Authors: Derong Xu, Ziheng Zhang, Zhenxi Lin, Xian Wu, Zhihong Zhu, Tong Xu, Xiangyu Zhao, Yefeng Zheng, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01972
Pdf URL: https://arxiv.org/pdf/2403.01972
Copy Paste: [[2403.01972]] Multi-perspective Improvement of Knowledge Graph Completion with Large Language Models(https://arxiv.org/abs/2403.01972)
Keywords: language model, llm
Abstract: Knowledge graph completion (KGC) is a widely used method to tackle incompleteness in knowledge graphs (KGs) by making predictions for missing links. Description-based KGC leverages pre-trained language models to learn entity and relation representations with their names or descriptions, which shows promising results. However, the performance of description-based KGC is still limited by the quality of text and the incomplete structure, as it lacks sufficient entity descriptions and relies solely on relation names, leading to sub-optimal results. To address this issue, we propose MPIKGC, a general framework to compensate for the deficiency of contextualized knowledge and improve KGC by querying large language models (LLMs) from various perspectives, which involves leveraging the reasoning, explanation, and summarization capabilities of LLMs to expand entity descriptions, understand relations, and extract structures, respectively. We conducted extensive evaluation of the effectiveness and improvement of our framework based on four description-based KGC models and four datasets, for both link prediction and triplet classification tasks.
摘要：知识图补全（KGC）是一种广泛使用的方法，通过对缺失链接进行预测来解决知识图（KG）中的不完整性问题。基于描述的 KGC 利用预先训练的语言模型来学习实体和关系表示及其名称或描述，这显示出有希望的结果。然而，基于描述的 KGC 的性能仍然受到文本质量和不完整结构的限制，因为它缺乏足够的实体描述并且仅依赖于关系名称，导致结果次优。为了解决这个问题，我们提出了MPIKGC，一个通用框架，通过从不同角度查询大语言模型（LLM）来弥补情境化知识的不足并改进KGC，其中涉及利用LLM的推理、解释和总结能力来扩展分别是实体描述、理解关系和提取结构。我们基于四个基于描述的 KGC 模型和四个数据集，针对链接预测和三元组分类任务，对我们框架的有效性和改进进行了广泛的评估。

Title: SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis

Authors: Hengxing Cai, Xiaochen Cai, Junhan Chang, Sihang Li, Lin Yao, Changxin Wang, Zhifeng Gao, Yongge Li, Mujie Lin, Shuwen Yang, Jiankun Wang, Yuqi Yin, Yaqi Li, Linfeng Zhang, Guolin Ke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01976
Pdf URL: https://arxiv.org/pdf/2403.01976
Copy Paste: [[2403.01976]] SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis(https://arxiv.org/abs/2403.01976)
Keywords: language model, gpt, llm
Abstract: Recent breakthroughs in Large Language Models (LLMs) have revolutionized natural language understanding and generation, igniting a surge of interest in leveraging these technologies for the nuanced field of scientific literature analysis. Existing benchmarks, however, inadequately evaluate the proficiency of LLMs in the scientific domain, especially in scenarios involving complex comprehension and multimodal data. In response, we introduced SciAssess, a benchmark tailored for the in-depth analysis of scientific literature, crafted to provide a thorough assessment of LLMs' efficacy. SciAssess focuses on evaluating LLMs' abilities in memorization, comprehension, and analysis within scientific contexts. It includes representative tasks from diverse scientific fields, such as general chemistry, organic materials, and alloy materials. And rigorous quality control measures ensure its reliability in terms of correctness, anonymization, and copyright compliance. SciAssess evaluates leading LLMs, including GPT-4, GPT-3.5-turbo, and Gemini, identifying their strengths and areas for improvement and supporting the ongoing development of LLM applications in scientific literature analysis. SciAssess and its resources are made available at https://sci-assess.github.io, offering a valuable tool for advancing LLM capabilities in scientific literature analysis.
摘要：大型语言模型 (LLM) 的最新突破彻底改变了自然语言的理解和生成，激发了人们对利用这些技术进行科学文献分析的微妙领域的兴趣。然而，现有的基准不足以评估法学硕士在科学领域的熟练程度，特别是在涉及复杂理解和多模态数据的场景中。为此，我们推出了 SciAssess，这是一个专为深入分析科学文献而定制的基准，旨在对法学硕士的功效进行全面评估。 SciAssess 专注于评估法学硕士在科学背景下的记忆、理解和分析能力。它包括来自不同科学领域的代表性任务，例如普通化学、有机材料和合金材料。严格的质量控制措施确保其在正确性、匿名性和版权合规性方面的可靠性。 SciAssess 评估领先的法学硕士，包括 GPT-4、GPT-3.5-turbo 和 Gemini，确定其优势和需要改进的领域，并支持法学硕士在科学文献分析中的应用的持续发展。 SciAssess 及其资源可在 https://sci-assess.github.io 上获取，为提升科学文献分析方面的 LLM 能力提供了宝贵的工具。

Title: FakeNewsGPT4: Advancing Multimodal Fake News Detection through Knowledge-Augmented LVLMs

Authors: Xuannan Liu, Peipei Li, Huaibo Huang, Zekun Li, Xing Cui, Jiahao Liang, Lixiong Qin, Weihong Deng, Zhaofeng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01988
Pdf URL: https://arxiv.org/pdf/2403.01988
Copy Paste: [[2403.01988]] FakeNewsGPT4: Advancing Multimodal Fake News Detection through Knowledge-Augmented LVLMs(https://arxiv.org/abs/2403.01988)
Keywords: language model, gpt, prompt
Abstract: The massive generation of multimodal fake news exhibits substantial distribution discrepancies, prompting the need for generalized detectors. However, the insulated nature of training within specific domains restricts the capability of classical detectors to obtain open-world facts. In this paper, we propose FakeNewsGPT4, a novel framework that augments Large Vision-Language Models (LVLMs) with forgery-specific knowledge for manipulation reasoning while inheriting extensive world knowledge as complementary. Knowledge augmentation in FakeNewsGPT4 involves acquiring two types of forgery-specific knowledge, i.e., semantic correlation and artifact trace, and merging them into LVLMs. Specifically, we design a multi-level cross-modal reasoning module that establishes interactions across modalities for extracting semantic correlations. Concurrently, a dual-branch fine-grained verification module is presented to comprehend localized details to encode artifact traces. The generated knowledge is translated into refined embeddings compatible with LVLMs. We also incorporate candidate answer heuristics and soft prompts to enhance input informativeness. Extensive experiments on the public benchmark demonstrate that FakeNewsGPT4 achieves superior cross-domain performance compared to previous methods. Code will be available.
摘要：多模态假新闻的大量产生表现出巨大的分布差异，促使需要通用检测器。然而，特定领域内训练的隔离性质限制了经典检测器获取开放世界事实的能力。在本文中，我们提出了 FakeNewsGPT4，这是一种新颖的框架，它通过用于操纵推理的伪造特定知识来增强大型视觉语言模型（LVLM），同时继承广泛的世界知识作为补充。 FakeNewsGPT4 中的知识增强涉及获取两种类型的伪造特定知识，即语义相关性和工件追踪，并将它们合并到 LVLM 中。具体来说，我们设计了一个多级跨模态推理模块，该模块建立跨模态的交互以提取语义相关性。同时，提出了一个双分支细粒度验证模块来理解局部细节以对工件痕迹进行编码。生成的知识被转化为与 LVLM 兼容的精细嵌入。我们还结合了候选答案启发式和软提示来增强输入信息量。对公共基准的大量实验表明，与之前的方法相比，FakeNewsGPT4 实现了卓越的跨域性能。代码将可用。

Title: LLM-Oriented Retrieval Tuner

Authors: Si Sun, Hanqing Zhang, Zhiyuan Liu, Jie Bao, Dawei Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.01999
Pdf URL: https://arxiv.org/pdf/2403.01999
Copy Paste: [[2403.01999]] LLM-Oriented Retrieval Tuner(https://arxiv.org/abs/2403.01999)
Keywords: language model, gpt, llm
Abstract: Dense Retrieval (DR) is now considered a promising tool to enhance the memorization capacity of Large Language Models (LLMs) such as GPT-3 and GPT-4 by incorporating external memories. However, due to the paradigm discrepancy between text generation of LLM and DR, it is still an open challenge to integrate the retrieval and generation tasks in a shared LLM. In this paper, we propose an efficient LLM-Oriented Retrieval Tuner, namely LMORT, which decouples DR capacity from base LLM and non-invasively coordinates the optimally aligned and uniform layers of the LLM towards a unified DR space, achieving an efficient and effective DR without tuning the LLM itself. The extensive experiments on six BEIR datasets show that our approach could achieve competitive zero-shot retrieval performance compared to a range of strong DR models while maintaining the generation ability of LLM.
摘要：密集检索（DR）现在被认为是一种有前途的工具，可以通过结合外部存储器来增强大型语言模型（LLM）（例如 GPT3 和 GPT-4）的记忆能力。然而，由于LLM和DR的文本生成之间的范式差异，将检索和生成任务集成到共享LLM中仍然是一个开放的挑战。在本文中，我们提出了一种高效的面向LLM的检索调谐器，即LMORT，它将DR容量与基础LLM解耦，并以非侵入方式将LLM的最佳对齐和均匀层协调到统一的DR空间，从而实现高效且有效的DR无需调整 LLM 本身。对六个 BEIR 数据集的广泛实验表明，与一系列强大的 DR 模型相比，我们的方法可以实现有竞争力的零样本检索性能，同时保持 LLM 的生成能力。

Title: Topic Aware Probing: From Sentence Length Prediction to Idiom Identification how reliant are Neural Language Models on Topic?

Authors: Vasudevan Nedumpozhimana, John D. Kelleher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.02009
Pdf URL: https://arxiv.org/pdf/2403.02009
Copy Paste: [[2403.02009]] Topic Aware Probing: From Sentence Length Prediction to Idiom Identification how reliant are Neural Language Models on Topic?(https://arxiv.org/abs/2403.02009)
Keywords: language model
Abstract: Transformer-based Neural Language Models achieve state-of-the-art performance on various natural language processing tasks. However, an open question is the extent to which these models rely on word-order/syntactic or word co-occurrence/topic-based information when processing natural language. This work contributes to this debate by addressing the question of whether these models primarily use topic as a signal, by exploring the relationship between Transformer-based models' (BERT and RoBERTa's) performance on a range of probing tasks in English, from simple lexical tasks such as sentence length prediction to complex semantic tasks such as idiom token identification, and the sensitivity of these tasks to the topic information. To this end, we propose a novel probing method which we call topic-aware probing. Our initial results indicate that Transformer-based models encode both topic and non-topic information in their intermediate layers, but also that the facility of these models to distinguish idiomatic usage is primarily based on their ability to identify and encode topic. Furthermore, our analysis of these models' performance on other standard probing tasks suggests that tasks that are relatively insensitive to the topic information are also tasks that are relatively difficult for these models.
摘要：基于 Transformer 的神经语言模型在各种自然语言处理任务上实现了最先进的性能。然而，一个悬而未决的问题是，这些模型在处理自然语言时在多大程度上依赖于词序/句法或词共现/基于主题的信息。这项工作通过解决这些模型是否主要使用主题作为信号的问题，通过探索基于 Transformer 的模型（BERT 和 RoBERTa）在一系列英语探测任务（从简单的词汇任务）上的表现之间的关系，为这场辩论做出了贡献例如句子长度预测到成语标记识别等复杂语义任务，以及这些任务对主题信息的敏感性。为此，我们提出了一种新颖的探测方法，称为主题感知探测。我们的初步结果表明，基于 Transformer 的模型在其中间层中对主题和非主题信息进行编码，而且这些模型区分惯用用法的能力主要基于它们识别和编码主题的能力。此外，我们对这些模型在其他标准探测任务上的性能的分析表明，对主题信息相对不敏感的任务也是这些模型相对困难的任务。

Title: Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5

Authors: Qiao Wang, Ralph Rose, Naho Orita, Ayaka Sugawara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.02078
Pdf URL: https://arxiv.org/pdf/2403.02078
Copy Paste: [[2403.02078]] Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5(https://arxiv.org/abs/2403.02078)
Keywords: language model, gpt, llm, prompt
Abstract: A common way of assessing language learners' mastery of vocabulary is via multiple-choice cloze (i.e., fill-in-the-blank) questions. But the creation of test items can be laborious for individual teachers or in large-scale language programs. In this paper, we evaluate a new method for automatically generating these types of questions using large language models (LLM). The VocaTT (vocabulary teaching and training) engine is written in Python and comprises three basic steps: pre-processing target word lists, generating sentences and candidate word options using GPT, and finally selecting suitable word options. To test the efficiency of this system, 60 questions were generated targeting academic words. The generated items were reviewed by expert reviewers who judged the well-formedness of the sentences and word options, adding comments to items judged not well-formed. Results showed a 75% rate of well-formedness for sentences and 66.85% rate for suitable word options. This is a marked improvement over the generator used earlier in our research which did not take advantage of GPT's capabilities. Post-hoc qualitative analysis reveals several points for improvement in future work including cross-referencing part-of-speech tagging, better sentence validation, and improving GPT prompts.
摘要：评估语言学习者对词汇掌握程度的常见方法是通过多项选择完形填空（即填空）问题。但对于个别教师或大型语言项目来说，创建测试项目可能很费力。在本文中，我们评估了一种使用大型语言模型（LLM）自动生成此类问题的新方法。 VocaTT（词汇教学和训练）引擎是用Python编写的，包括三个基本步骤：预处理目标单词列表，使用GPT生成句子和候选单词选项，最后选择合适的单词选项。为了测试该系统的效率，针对学术词汇生成了 60 个问题。生成的项目由专家评审员评审，他们判断句子和单词选项的格式是否良好，并对判断为格式不正确的项目添加评论。结果显示，句子的格式正确率为 75%，单词选项的正确率为 66.85%。与我们研究中早期使用的生成器相比，这是一个显着的改进，该生成器没有利用 GPT 的功能。事后定性分析揭示了未来工作中需要改进的几个点，包括交叉引用词性标记、更好的句子验证和改进 GPT 提示。
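
The abstract above names three steps (pre-process target words, generate sentences and candidate options with GPT, select suitable options). As a minimal sketch of the final assembly only, assuming the sentence and distractors have already been produced elsewhere, a multiple-choice cloze item could be built as follows. The function name `make_cloze_item`, the blank marker, and the returned dictionary layout are hypothetical and are not taken from the VocaTT engine.

```python
import random
import re


def make_cloze_item(sentence, target, distractors, seed=None):
    """Blank out the first whole-word occurrence of `target` and shuffle the options.

    Hypothetical helper for illustration; not the VocaTT implementation.
    """
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", flags=re.IGNORECASE)
    if not pattern.search(sentence):
        raise ValueError("target word does not occur in the sentence")
    # Replace only the first occurrence so the stem has exactly one blank.
    stem = pattern.sub("_____", sentence, count=1)
    options = [target] + list(distractors)
    random.Random(seed).shuffle(options)
    return {"stem": stem, "options": options, "answer": options.index(target)}


item = make_cloze_item(
    "The results were consistent across trials.",
    "consistent",
    ["abstract", "marginal", "voluntary"],
    seed=0,
)
print(item["stem"])
print(item["options"], "answer index:", item["answer"])
```

The seed only makes the option order reproducible; the well-formedness checks the abstract describes (expert review of sentences and options) are outside this sketch.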

Title: Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language Models

Authors: Sargam Yadav (1), Abhishek Kaushik (1), Kevin McDaid (1) ((1) Dundalk Institute of Technology, Dundalk)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.02121
Pdf URL: https://arxiv.org/pdf/2403.02121
Copy Paste: [[2403.02121]] Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language Models(https://arxiv.org/abs/2403.02121)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The advent of Large Language Models (LLMs) has advanced the benchmark in various Natural Language Processing (NLP) tasks. However, large amounts of labelled training data are required to train LLMs. Furthermore, data annotation and training are computationally expensive and time-consuming. Zero-shot and few-shot learning have recently emerged as viable options for labelling data using large pre-trained models. Hate speech detection in code-mixed low-resource languages is an active problem area where the use of LLMs has proven beneficial. In this study, we have compiled a dataset of 100 YouTube comments, and weakly labelled them for coarse and fine-grained misogyny classification in code-mixed Hinglish. Weak annotation was applied due to the labor-intensive annotation process. Zero-shot learning, one-shot learning, and few-shot learning and prompting approaches have then been applied to assign labels to the comments and compare them to human-assigned labels. Out of all the approaches, zero-shot classification using the Bidirectional and Auto-Regressive Transformers (BART) large model and few-shot prompting using Generative Pre-trained Transformer-3 (ChatGPT-3) achieve the best results.
摘要：大型语言模型 (LLM) 的出现提高了各种自然语言处理 (NLP) 任务的基准。然而，训练法学硕士需要大量有标签的训练数据。此外，数据注释和训练在计算上是昂贵且耗时的。零次学习和少次学习最近已成为使用大型预训练模型标记数据的可行选择。混合代码低资源语言中的仇恨言论检测是一个活跃的问题领域，LLM 的使用已被证明是有益的。在这项研究中，我们编制了包含 100 条 YouTube 评论的数据集，并在混合代码印度英语中对它们进行了粗粒度和细粒度的厌女症分类的弱标记。由于标注过程耗费大量人力，因此应用了弱标注。然后应用零样本学习、单样本学习、少样本学习和提示方法来为评论分配标签，并将其与人类分配的标签进行比较。在所有方法中，使用双向自回归变压器 (BART) 大模型的零样本分类和使用生成预训练 Transformer-3 (ChatGPT-3) 的少样本提示取得了最佳结果

Title: Using LLMs for the Extraction and Normalization of Product Attribute Values

Authors: Nick Baumann, Alexander Brinkmann, Christian Bizer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.02130
Pdf URL: https://arxiv.org/pdf/2403.02130
Copy Paste: [[2403.02130]] Using LLMs for the Extraction and Normalization of Product Attribute Values(https://arxiv.org/abs/2403.02130)
Keywords: language model, gpt, llm
Abstract: Product offers on e-commerce websites often consist of a textual product title and a textual product description. In order to provide features such as faceted product filtering or content-based product recommendation, the websites need to extract attribute-value pairs from the unstructured product descriptions. This paper explores the potential of using large language models (LLMs), such as OpenAI's GPT-3.5 and GPT-4, to extract and normalize attribute values from product titles and product descriptions. For our experiments, we introduce the WDC Product Attribute-Value Extraction (WDC PAVE) dataset. WDC PAVE consists of product offers from 87 websites that provide schema.org annotations. The offers belong to five different categories, each featuring a specific set of attributes. The dataset provides manually verified attribute-value pairs in two forms: (i) directly extracted values and (ii) normalized attribute values. The normalization of the attribute values requires systems to perform the following types of operations: name expansion, generalization, unit of measurement normalization, and string wrangling. Our experiments demonstrate that GPT-4 outperforms PLM-based extraction methods by 10%, achieving an F1-Score of 91%. For the extraction and normalization of product attribute values, GPT-4 achieves a similar performance to the extraction scenario, while being particularly strong at string wrangling and name expansion.
摘要：电子商务网站上的产品报价通常由文本产品标题和文本产品描述组成。为了提供分面产品过滤或基于内容的产品推荐等功能，网站需要从非结构化产品描述中提取属性值对。本文探讨了使用大型语言模型 (LLM)（例如 OpenAI 的 GPT-3.5 和 GPT-4）从产品标题和产品描述中提取和规范化属性值的潜力。在我们的实验中，我们引入了 WDC 产品属性值提取 (WDC PAVE) 数据集。 WDC PAVE 包含来自 87 个提供 schema.org 注释的网站的产品。这些优惠属于五个不同的类别，每个类别都具有一组特定的属性。该数据集以两种形式提供手动验证的属性值对：（i）直接提取的值和（ii）标准化的属性值。属性值的规范化要求系统执行以下类型的操作：名称扩展、泛化、测量单位规范化和字符串整理。我们的实验表明，GPT-4 的性能比基于 PLM 的提取方法高出 10%，F1 分数达到 91%。对于产品属性值的提取和规范化，GPT-4 实现了与提取场景类似的性能，同时在字符串整理和名称扩展方面特别强大。

Title: EEE-QA: Exploring Effective and Efficient Question-Answer Representations

Authors: Zhanghao Hu, Yijun Yang, Junjie Xu, Yifu Qiu, Pinzhen Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.02176
Pdf URL: https://arxiv.org/pdf/2403.02176
Copy Paste: [[2403.02176]] EEE-QA: Exploring Effective and Efficient Question-Answer Representations(https://arxiv.org/abs/2403.02176)
Keywords: language model
Abstract: Current approaches to question answering rely on pre-trained language models (PLMs) like RoBERTa. This work challenges the existing question-answer encoding convention and explores finer representations. We begin by testing various pooling methods against the use of the begin-of-sentence token as the question representation, aiming for better quality. Next, we explore opportunities to simultaneously embed all answer candidates with the question. This enables cross-reference between answer choices and improves inference throughput via reduced memory usage. Despite their simplicity and effectiveness, these methods have yet to be widely studied in current frameworks. We experiment with different PLMs, and with and without the integration of knowledge graphs. Results demonstrate the memory efficacy of the proposed techniques with little sacrifice in performance. Practically, our work improves throughput by 38-100% with 26-65% speedups on consumer-grade GPUs by allowing for considerably larger batch sizes. Our work sends a message to the community with promising directions in both representation quality and efficiency for the question-answering task in natural language processing.
摘要：当前的问答方法依赖于 RoBERTa 等预先训练的语言模型 (PLM)。这项工作挑战了现有的问答编码约定，并探索了更精细的表示。我们首先测试各种池化方法，与使用句首标记作为问题表示以获得更好的质量进行比较。接下来，我们探索将所有候选答案同时嵌入问题的机会。这可以实现答案选择之间的交叉引用，并通过减少内存使用来提高推理吞吐量。尽管它们简单有效，但这些方法尚未在当前框架中得到广泛研究。我们尝试了不同的 PLM，以及集成或不集成知识图谱的情况。结果证明，所提出的技术的记忆功效几乎没有牺牲性能。实际上，我们的工作通过允许相当大的批量大小，将消费级 GPU 的吞吐量提高了 38-100%，加速了 26-65%。我们的工作向社区传达了一个信息，即自然语言处理中问答任务的表示质量和效率方面有希望的方向。
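
To make the contrast between the begin-of-sentence representation and pooled representations concrete, here is a minimal sketch in plain Python. It assumes the encoder output for a question is a list of per-token vectors with the begin-of-sentence token first; the function names are hypothetical, and the abstract does not say which pooling variants the authors actually tested, so mean and max pooling are shown only as common examples.

```python
def first_token_pool(token_vectors):
    """Use the first (begin-of-sentence) token vector as the representation."""
    return list(token_vectors[0])


def mean_pool(token_vectors):
    """Average every dimension over all tokens."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def max_pool(token_vectors):
    """Take the maximum of every dimension over all tokens."""
    dim = len(token_vectors[0])
    return [max(vec[d] for vec in token_vectors) for d in range(dim)]


# Toy encoder output: two tokens, two dimensions (values are made up).
hidden = [[1.0, 2.0], [3.0, 4.0]]
print(first_token_pool(hidden))  # [1.0, 2.0]
print(mean_pool(hidden))         # [2.0, 3.0]
print(max_pool(hidden))          # [3.0, 4.0]
```

In a real model these operations would run on batched tensors with padding masks; the lists here only show what each pooling choice computes.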

Title: ProTrix: Building Models for Planning and Reasoning over Tables with Sentence Context

Authors: Zirui Wu, Yansong Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.02177
Pdf URL: https://arxiv.org/pdf/2403.02177
Copy Paste: [[2403.02177]] ProTrix: Building Models for Planning and Reasoning over Tables with Sentence Context(https://arxiv.org/abs/2403.02177)
Keywords: gpt
Abstract: Tables play a crucial role in conveying information in various domains, serving as indispensable tools for organizing and presenting data in a structured manner. We propose a Plan-then-Reason framework to answer different types of user queries over tables with sentence context. The framework first plans the reasoning paths over the context, then assigns each step to program-based or textual reasoning to reach the final answer. We construct an instruction tuning set TrixInstruct following the framework. Our dataset covers queries that are program-unsolvable or require combining information from tables and sentences to obtain planning and reasoning abilities. We present ProTrix by finetuning Llama-2-7B on TrixInstruct. Our experiments show that ProTrix generalizes to diverse tabular tasks and achieves comparable performance to GPT-3.5-turbo. We further demonstrate that ProTrix can generate accurate and faithful explanations to answer complex free-form questions. Our work underscores the importance of the planning and reasoning abilities towards a model over tabular tasks with generalizability and interpretability. We will release our dataset and model at https://github.com/WilliamZR/ProTrix.
摘要：表格在各个领域的信息传递中发挥着至关重要的作用，是以结构化方式组织和呈现数据的不可或缺的工具。我们提出了一个 Plan-then-Reason 框架来回答不同类型的用户对具有句子上下文的表的查询。该框架首先规划上下文的推理路径，然后将每个步骤分配给基于程序或文本的推理以得出最终答案。我们按照框架构建了一个指令调优集TrixInstruct。我们的数据集涵盖了程序无法解决的查询或需要结合表格和句子中的信息以获得规划和推理能力的查询。我们通过在 TrixInstruct 上微调 Llama-2-7B 来展示 ProTrix。我们的实验表明，ProTrix 可以推广到各种表格任务，并实现与 GPT-3.5-turbo 相当的性能。我们进一步证明 ProTrix 可以生成准确且忠实的解释来回答复杂的自由形式问题。我们的工作强调了规划和推理能力对于具有普遍性和可解释性的表格任务模型的重要性。我们将在 https://github.com/WilliamZR/ProTrix 发布我们的数据集和模型。

Title: Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models

Authors: Changyu Chen, Xiting Wang, Ting-En Lin, Ang Lv, Yuchuan Wu, Xin Gao, Ji-Rong Wen, Rui Yan, Yongbin Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.02178
Pdf URL: https://arxiv.org/pdf/2403.02178
Copy Paste: [[2403.02178]] Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models(https://arxiv.org/abs/2403.02178)
Keywords: language model
Abstract: In reasoning tasks, even a minor error can cascade into inaccurate results, leading to suboptimal performance of large language models in such domains. Earlier fine-tuning approaches sought to mitigate this by leveraging more precise supervisory signals from human labeling, larger models, or self-sampling, although at a high cost. Conversely, we develop a method that avoids external resources, relying instead on introducing perturbations to the input. Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks. When applied to fine-tuning with GSM8K, this method achieved a 5% improvement in accuracy over standard supervised fine-tuning with only minor code changes and no additional labeling effort. Furthermore, it is complementary to existing methods. When integrated with related data augmentation methods, it leads to an average improvement of 3% in GSM8K accuracy and 1% in MATH accuracy across five datasets of various quality and size, as well as two base models. We further investigate the mechanisms behind this improvement through case studies and quantitative analysis, suggesting that our approach may provide superior support for the model in capturing long-distance dependencies, especially those related to questions. This enhancement could deepen understanding of premises in questions and prior steps. Our code is available on GitHub.
摘要：在推理任务中，即使是很小的错误也可能会导致不准确的结果，从而导致大型语言模型在此类领域的性能不佳。早期的微调方法试图通过利用来自人类标签、更大模型或自我采样的更精确的监督信号来缓解这一问题，尽管成本很高。相反，我们开发了一种避免外部资源的方法，而是依赖于对输入引入扰动。我们的训练方法随机掩盖思维链中的某些标记，我们发现这种技术对于推理任务特别有效。当应用于 GSM8K 微调时，该方法比标准监督微调的精度提高了 5%，只需修改一些代码，并且无需额外的标记工作。此外，它是对现有方法的补充。当与相关的数据增强方法集成时，它可以在五个不同质量和大小的数据集以及两个基础模型中，使 GSM8K 准确率平均提高 3%，数学准确率平均提高 1%。我们通过案例研究和定量分析进一步研究了这种改进背后的机制，表明我们的方法可以为模型捕获长距离依赖关系（尤其是与问题相关的依赖关系）提供卓越的支持。这种增强可以加深对问题和先前步骤中前提的理解。我们的代码可以在 Github 上找到。
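
The core idea in the abstract above, randomly masking tokens inside the chain of thought while leaving the question intact, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name, the `[MASK]` placeholder, the default masking probability, and the choice to keep the unmasked sequence as the training target are all hypothetical.

```python
import random


def mask_reasoning_tokens(question_tokens, cot_tokens, mask_prob=0.2,
                          mask_token="[MASK]", seed=None):
    """Return (inputs, labels) for one training example.

    Only chain-of-thought tokens are perturbed; question tokens are copied
    unchanged. Labels keep the original tokens (an assumption of this sketch).
    """
    rng = random.Random(seed)
    masked_cot = [
        mask_token if rng.random() < mask_prob else tok
        for tok in cot_tokens
    ]
    inputs = list(question_tokens) + masked_cot
    labels = list(question_tokens) + list(cot_tokens)
    return inputs, labels


question = ["Tom", "has", "3", "apples", "and", "buys", "2", "more", "."]
reasoning = ["3", "+", "2", "=", "5", ".", "The", "answer", "is", "5", "."]
inputs, labels = mask_reasoning_tokens(question, reasoning, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

Because the perturbation touches only the input side, a sketch like this needs no extra labeling, which matches the abstract's claim that the method avoids external resources.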

Title: Not all Layers of LLMs are Necessary during Inference

Authors: Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.02181
Pdf URL: https://arxiv.org/pdf/2403.02181
Copy Paste: [[2403.02181]] Not all Layers of LLMs are Necessary during Inference(https://arxiv.org/abs/2403.02181)
Keywords: language model, llm
Abstract: The inference phase of Large Language Models (LLMs) is very expensive. An ideal inference stage of LLMs could utilize fewer computational resources while still maintaining its capabilities (e.g., generalization and in-context learning ability). In this paper, we try to answer the question, "During LLM inference, can we use shallow layers for easy instances; and deep layers for hard ones?" To answer this question, we first indicate that Not all Layers are Necessary during Inference by statistically analyzing the activated layers across tasks. Then, we propose a simple algorithm named AdaInfer to determine the inference termination moment based on the input instance adaptively. More importantly, AdaInfer does not alter LLM parameters and maintains generalizability across tasks. Experiments on well-known LLMs (i.e., Llama2 series and OPT) show that AdaInfer saves an average of 14.8% of computational resources, even up to 50% on sentiment tasks, while maintaining comparable performance. Additionally, this method is orthogonal to other model acceleration techniques, potentially boosting inference efficiency further.
摘要：大型语言模型 (LLM) 的推理阶段非常昂贵。法学硕士的理想推理阶段可以利用更少的计算资源，同时仍然保持其能力（例如泛化和上下文学习能力）。在本文中，我们试图回答这样的问题：“在LLM推理过程中，我们是否可以使用浅层来处理简单的实例；使用深层来处理困难的实例？”为了回答这个问题，我们首先通过统计分析跨任务的激活层来表明推理过程中并非所有层都是必需的。然后，我们提出了一种名为 AdaInfer 的简单算法，用于根据输入实例自适应地确定推理终止时刻。更重要的是，AdaInfer 不会改变 LLM 参数并保持跨任务的通用性。在著名的LLM（即Llama2系列和OPT）上进行的实验表明，AdaInfer平均节省了14.8%的计算资源，在情感任务上甚至节省了50%，同时保持了相当的性能。此外，该方法与其他模型加速技术正交，有可能进一步提高推理效率。
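
The abstract does not describe how AdaInfer decides when to stop, so the following is only a generic early-exit sketch that swaps in a fixed confidence threshold as the stopping rule. It assumes a per-layer probability distribution over candidate outputs is already available for one input; the function name and the threshold value are hypothetical.

```python
def early_exit_layer(layer_probs, threshold=0.9):
    """Pick the first layer whose top probability reaches `threshold`.

    `layer_probs` is a list (one entry per layer, shallow to deep) of
    probability lists over the same set of candidate outputs.
    Returns (layer_index, predicted_index); falls back to the last layer.
    NOTE: a plain confidence threshold is an assumption of this sketch,
    not the stopping criterion used by AdaInfer.
    """
    for layer_index, probs in enumerate(layer_probs):
        top = max(probs)
        if top >= threshold:
            return layer_index, probs.index(top)
    last = layer_probs[-1]
    return len(layer_probs) - 1, last.index(max(last))


# Toy example: confidence grows with depth (values are made up).
per_layer = [
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.95, 0.03, 0.02],
    [0.97, 0.02, 0.01],
]
layer, prediction = early_exit_layer(per_layer, threshold=0.9)
print(f"stopped after layer {layer} with prediction {prediction}")
```

An easy input exits at a shallow layer and a hard one runs the full stack, which is the trade-off the abstract's question about shallow versus deep layers points at. No model parameters change in this scheme.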

Title: PHAnToM: Personality Has An Effect on Theory-of-Mind Reasoning in Large Language Models

Authors: Fiona Anting Tan, Gerard Christopher Yeo, Fanyou Wu, Weijie Xu, Vinija Jain, Aman Chadha, Kokil Jaidka, Yang Liu, See-Kiong Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.02246
Pdf URL: https://arxiv.org/pdf/2403.02246
Copy Paste: [[2403.02246]] PHAnToM: Personality Has An Effect on Theory-of-Mind Reasoning in Large Language Models(https://arxiv.org/abs/2403.02246)
Keywords: language model, gpt, llm, prompt
Abstract: Recent advances in large language models (LLMs) demonstrate that their capabilities are comparable, or even superior, to humans in many tasks in natural language processing. Despite this progress, LLMs are still inadequate at social-cognitive reasoning, which humans are naturally good at. Drawing inspiration from psychological research on the links between certain personality traits and Theory-of-Mind (ToM) reasoning, and from prompt engineering research on the hyper-sensitivity of prompts in affecting LLMs' capabilities, this study investigates how inducing personalities in LLMs using prompts affects their ToM reasoning capabilities. Our findings show that certain induced personalities can significantly affect the LLMs' reasoning capabilities in three different ToM tasks. In particular, traits from the Dark Triad have a larger variable effect on LLMs like GPT-3.5, Llama 2, and Mistral across the different ToM tasks. We find that LLMs that exhibit a higher variance across personality prompts in ToM also tend to be more controllable in personality tests: personality traits in LLMs like GPT-3.5, Llama 2 and Mistral can be controllably adjusted through our personality prompts. In today's landscape where role-play is a common strategy when using LLMs, our research highlights the need for caution, as models that adopt specific personas with personalities potentially also alter their reasoning abilities in an unexpected manner.
摘要：大型语言模型 (LLM) 的最新进展表明，它们的能力在自然语言处理的许多任务中与人类相当，甚至优于人类。尽管取得了这些进展，法学硕士在人类天生擅长的社会认知推理方面仍然存在不足。本研究从关于某些人格特质与心理理论 (ToM) 推理之间联系的心理学研究中汲取灵感，并从关于提示对影响法学硕士能力的超敏感性的提示工程研究中汲取灵感，本研究调查了如何使用提示来诱导法学硕士的个性影响他们的 ToM 推理能力。我们的研究结果表明，某些诱发的个性可以显着影响法学硕士在三种不同的 ToM 任务中的推理能力。特别是，来自黑暗三合会的特征对跨不同 ToM 任务的 GPT-3.5、Llama 2 和 Mistral 等 LLM 具有更大的可变影响。我们发现，在 ToM 中的人格提示中表现出较高方差的法学硕士在人格测试中也往往更可控：法学硕士中的人格特质（如 GPT-3.5、Llama 2 和 Mistral）可以通过我们的人格提示进行可控调整。在当今的环境中，角色扮演是使用法学硕士时的常见策略，我们的研究强调了谨慎的必要性，因为采用具有个性的特定角色的模型也可能以意想不到的方式改变他们的推理能力。

Title: Birbal: An efficient 7B instruct-model fine-tuned with curated datasets

Authors: Ashvini Kumar Jindal, Pawan Kumar Rajpoot, Ankur Parikh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.02247
Pdf URL: https://arxiv.org/pdf/2403.02247
Copy Paste: [[2403.02247]] Birbal: An efficient 7B instruct-model fine-tuned with curated datasets(https://arxiv.org/abs/2403.02247)
Keywords: llm
Abstract: LLMOps incur significant costs due to hardware requirements, hindering their widespread accessibility. Additionally, a lack of transparency in model training methods and data contributes to the majority of models being non-reproducible. To tackle these challenges, the LLM Efficiency Challenge was introduced at NeurIPS Workshop, aiming to adapt foundation models on a diverse set of tasks via fine-tuning on a single GPU (RTX 4090 or A100 with 40GB) within a 24-hour timeframe. In this system description paper, we introduce Birbal, our Mistral-7B based winning model, fine-tuned on a single RTX 4090 for 16 hours. Birbal's success lies in curating high-quality instructions covering diverse tasks, resulting in a 35% performance improvement over the second-best, Qwen-14B based submission.
摘要：由于硬件要求，LLMOps 会产生大量成本，阻碍了其广泛使用。此外，模型训练方法和数据缺乏透明度导致大多数模型不可重现。为了应对这些挑战，NeurIPS Workshop 上推出了 LLM 效率挑战赛，旨在通过在 24 小时时间内在单个 GPU（RTX 4090 或 40GB 的 A100）上进行微调，使基础模型适应多种任务。在本系统描述论文中，我们介绍了 Birbal，这是我们基于 Mistral-7B 的获胜模型，在单个 RTX 4090 上进行了 16 小时的微调。 Birbal 的成功在于策划涵盖不同任务的高质量指令，与第二好的基于 Qwen-14B 的提交相比，性能提高了 35%。

Title: FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction

Authors: Alessandro Scirè, Karim Ghonim, Roberto Navigli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.02270
Pdf URL: https://arxiv.org/pdf/2403.02270
Copy Paste: [[2403.02270]] FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction(https://arxiv.org/abs/2403.02270)
Keywords: language model, llm, hallucination
Abstract: Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization.
摘要：文本摘要领域的最新进展，特别是随着大型语言模型 (LLM) 的出现，已经显示出卓越的性能。然而，一个显着的挑战仍然存在，因为大量自动生成的摘要表现出事实不一致，例如幻觉。针对这个问题，出现了各种评估摘要一致性的方法。然而，这些新引入的指标面临着一些局限性，包括缺乏可解释性、关注简短的文档摘要（例如新闻文章）以及计算不切实际，特别是对于基于 LLM 的指标。为了解决这些缺点，我们提出了基于自然语言推理和声明提取（FENICE）的摘要的事实性评估，这是一种更可解释和更有效的面向事实的指标。 FENICE 利用源文档中的信息与从摘要中提取的一组原子事实（称为声明）之间基于 NLI 的对齐。我们的指标为 AGGREFACT 设定了新的技术水平，这是事实性评估的事实上的基准。此外，我们通过进行长格式摘要的人工注释过程，将我们的评估扩展到更具挑战性的环境。

Title: RIFF: Learning to Rephrase Inputs for Few-shot Fine-tuning of Language Models

Authors: Saeed Najafi, Alona Fyshe
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.02271
Pdf URL: https://arxiv.org/pdf/2403.02271
Copy Paste: [[2403.02271]] RIFF: Learning to Rephrase Inputs for Few-shot Fine-tuning of Language Models(https://arxiv.org/abs/2403.02271)
Keywords: language model, prompt
Abstract: Pre-trained Language Models (PLMs) can be accurately fine-tuned for downstream text processing tasks. Recently, researchers have introduced several parameter-efficient fine-tuning methods that optimize input prompts or adjust a small number of model parameters (e.g., LoRA). In this study, we explore the impact of altering the input text of the original task in conjunction with parameter-efficient fine-tuning methods. To most effectively rewrite the input text, we train a few-shot paraphrase model with a Maximum-Marginal Likelihood objective. Using six few-shot text classification datasets, we show that enriching data with paraphrases at train and test time enhances the performance beyond what can be achieved with parameter-efficient fine-tuning alone.
摘要：预训练语言模型 (PLM) 可以针对下游文本处理任务进行准确微调。最近，研究人员推出了几种参数高效的微调方法，可以优化输入提示或调整少量模型参数（例如 LoRA）。在本研究中，我们结合参数高效的微调方法探讨了改变原始任务的输入文本的影响。为了最有效地重写输入文本，我们训练了具有最大边际似然目标的几次释义模型。使用六个少量文本分类数据集，我们表明，在训练和测试时通过释义丰富数据所增强的性能超出了仅通过参数高效微调所能实现的性能。

Title: Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning

Authors: Yiming Huang, Xiao Liu, Yeyun Gong, Zhibin Gou, Yelong Shen, Nan Duan, Weizhu Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.02333
Pdf URL: https://arxiv.org/pdf/2403.02333
Copy Paste: [[2403.02333]] Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning(https://arxiv.org/abs/2403.02333)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown great potential in complex reasoning tasks, yet their performance is often hampered by the scarcity of high-quality, reasoning-focused training datasets. Addressing this challenge, we propose Key-Point-Driven Data Synthesis (KPDDS), a novel data synthesis framework that synthesizes question-answer pairs by leveraging key points and exemplar pairs from authentic data sources. KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability. As a result, we present KPMath, the most extensive synthetic dataset tailored for mathematical reasoning to date, comprising over one million question-answer pairs. Utilizing KPMath and augmenting it with additional reasoning-intensive corpora, we create the comprehensive KPMath-Plus dataset. Fine-tuning the Mistral-7B model on KPMath-Plus yields a zero-shot PASS@1 accuracy of 39.3% on the MATH test set, a performance that not only outpaces other finetuned 7B models but also exceeds that of certain 34B models. Our ablation studies further confirm the substantial enhancement in mathematical reasoning across various subtopics, marking a significant stride in LLMs' reasoning capabilities.
摘要：大型语言模型 (LLM) 在复杂的推理任务中显示出巨大的潜力，但其性能往往因缺乏高质量、以推理为重点的训练数据集而受到阻碍。为了应对这一挑战，我们提出了关键点驱动数据合成（KPDDS），这是一种新颖的数据合成框架，它通过利用来自真实数据源的关键点和样本对来合成问答对。 KPDDS 通过严格的质量控制和显着的可扩展性确保新颖问题的生成。因此，我们推出了 KPMath，这是迄今为止为数学推理量身定制的最广泛的综合数据集，包含超过一百万个问答对。利用 KPMath 并通过额外的推理密集型语料库对其进行扩充，我们创建了全面的 KPMath-Plus 数据集。在 KPMath-Plus 上微调 Mistral-7B 模型，在 MATH 测试集上获得了 39.3% 的零样本 PASS@1 准确率，这一性能不仅超过了其他微调的 7B 模型，还超过了某些 34B 模型。我们的消融研究进一步证实了各个子主题的数学推理能力的显着增强，标志着法学硕士推理能力的重大进步。