2024-04-30

Title: Empowering Large Language Models for Textual Data Augmentation

Authors: Yichuan Li, Kaize Ding, Jianling Wang, Kyumin Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17642
Pdf URL: https://arxiv.org/pdf/2404.17642
Copy Paste: [[2404.17642]] Empowering Large Language Models for Textual Data Augmentation(https://arxiv.org/abs/2404.17642)
Keywords: language model, llm
Abstract: With the capabilities of understanding and executing natural language instructions, Large language models (LLMs) can potentially act as a powerful tool for textual data augmentation. However, the quality of augmented data depends heavily on the augmentation instructions provided, and the effectiveness can fluctuate across different downstream tasks. While manually crafting and selecting instructions can offer some improvement, this approach faces scalability and consistency issues in practice due to the diversity of downstream tasks. In this work, we address these limitations by proposing a new solution, which can automatically generate a large pool of augmentation instructions and select the most suitable task-informed instructions, thereby empowering LLMs to create high-quality augmented data for different downstream tasks. Empirically, the proposed approach consistently generates augmented data with better quality compared to non-LLM and LLM-based data augmentation methods, leading to the best performance on 26 few-shot learning tasks sourced from a wide range of application domains.

Title: PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games

Authors: Qinglin Zhu, Runcong Zhao, Jinhua Du, Lin Gui, Yulan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17662
Pdf URL: https://arxiv.org/pdf/2404.17662
Copy Paste: [[2404.17662]] PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games(https://arxiv.org/abs/2404.17662)
Keywords: language model, llm, agent
Abstract: Recent advancements in Large Language Models (LLMs) have enhanced the efficacy of agent communication and social interactions. Despite these advancements, building LLM-based agents for reasoning in dynamic environments involving competition and collaboration remains challenging due to the limitations of informed graph-based search methods. We propose PLAYER*, a novel framework based on an anytime sampling-based planner, which utilises sensors and pruners to enable a purely question-driven searching framework for complex reasoning tasks. We also introduce a quantifiable evaluation method using multiple-choice questions and construct the WellPlay dataset with 1,482 QA pairs. Experiments demonstrate PLAYER*'s efficiency and performance enhancements compared to existing methods in complex, dynamic environments with quantifiable results.

Title: CoMM: Collaborative Multi-Agent, Multi-Reasoning-Path Prompting for Complex Problem Solving

Authors: Pei Chen, Boran Han, Shuai Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17729
Pdf URL: https://arxiv.org/pdf/2404.17729
Copy Paste: [[2404.17729]] CoMM: Collaborative Multi-Agent, Multi-Reasoning-Path Prompting for Complex Problem Solving(https://arxiv.org/abs/2404.17729)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) have shown great ability in solving traditional natural language tasks and elementary reasoning tasks with appropriate prompting techniques. However, their ability is still limited in solving complicated science problems. In this work, we aim to push the upper bound of the reasoning capability of LLMs by proposing a collaborative multi-agent, multi-reasoning-path (CoMM) prompting framework. Specifically, we prompt LLMs to play different roles in a problem-solving team, and encourage different role-play agents to collaboratively solve the target task. In particular, we discover that applying different reasoning paths for different roles is an effective strategy to implement few-shot prompting approaches in the multi-agent scenarios. Empirical results demonstrate the effectiveness of the proposed methods on two college-level science problems over competitive baselines. Our further analysis shows the necessity of prompting LLMs to play different roles or experts independently. We release the code at: https://github.com/amazon-science/comm-prompt

Title: Building a Large Japanese Web Corpus for Large Language Models

Authors: Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17733
Pdf URL: https://arxiv.org/pdf/2404.17733
Copy Paste: [[2404.17733]] Building a Large Japanese Web Corpus for Large Language Models(https://arxiv.org/abs/2404.17733)
Keywords: language model, llm
Abstract: Open Japanese large language models (LLMs) have been trained on the Japanese portions of corpora such as CC-100, mC4, and OSCAR. However, these corpora were not created for the quality of Japanese texts. This study builds a large Japanese web corpus by extracting and refining text from the Common Crawl archive (21 snapshots of approximately 63.4 billion pages crawled between 2020 and 2023). This corpus consists of approximately 312.1 billion characters (approximately 173 million pages), which is the largest of all available training corpora for Japanese LLMs, surpassing CC-100 (approximately 25.8 billion characters), mC4 (approximately 239.7 billion characters) and OSCAR 23.10 (approximately 74 billion characters). To confirm the quality of the corpus, we performed continual pre-training on Llama 2 7B, 13B, 70B, Mistral 7B v0.1, and Mixtral 8x7B Instruct as base LLMs and gained consistent (6.6-8.1 points) improvements on Japanese benchmark datasets. We also demonstrate that the improvement on Llama 2 13B brought from the presented corpus was the largest among those from other existing corpora.
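The size comparison in the abstract above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the approximate figures quoted in the abstract; the variable and function names are illustrative and do not come from the paper.

```python
# Approximate corpus sizes quoted in the abstract, in billions of characters.
corpus_chars_b = {
    "this corpus": 312.1,
    "CC-100": 25.8,
    "mC4": 239.7,
    "OSCAR 23.10": 74.0,
}


def size_ratios(sizes, reference="this corpus"):
    """Return how many times larger the reference corpus is than each other corpus."""
    ref = sizes[reference]
    return {name: ref / chars for name, chars in sizes.items() if name != reference}


ratios = size_ratios(corpus_chars_b)

# Average characters per retained page: 312.1 billion characters over 173 million pages.
chars_per_page = 312.1e9 / 173e6

# Share of crawled pages that were retained: 173 million out of roughly 63.4 billion.
retained_fraction = 173e6 / 63.4e9

for name, ratio in ratios.items():
    print(f"the new corpus is about {ratio:.1f} times the size of {name}")
print(f"about {chars_per_page:.0f} characters per retained page")
print(f"about {retained_fraction:.2%} of crawled pages retained")
```

Because the inputs are rounded in the abstract, the printed ratios are approximate as well.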

Title: MRScore: Evaluating Radiology Report Generation with LLM-based Reward System

Authors: Yunyi Liu, Zhanyu Wang, Yingshu Li, Xinyu Liang, Lingqiao Liu, Lei Wang, Luping Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17778
Pdf URL: https://arxiv.org/pdf/2404.17778
Copy Paste: [[2404.17778]] MRScore: Evaluating Radiology Report Generation with LLM-based Reward System(https://arxiv.org/abs/2404.17778)
Keywords: language model, gpt, llm
Abstract: In recent years, automated radiology report generation has experienced significant growth. This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging Large Language Models (LLMs). Conventional NLG (natural language generation) metrics like BLEU are inadequate for accurately assessing the generated radiology reports, as systematically demonstrated by our observations within this paper. To address this challenge, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis. Our framework includes two key components: i) utilizing GPT to generate large amounts of training data, i.e., reports with different qualities, and ii) pairing GPT-generated reports as accepted and rejected samples and training LLMs to produce MRScore as the model reward. Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics. Our code and datasets will be available on GitHub.
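Training a scorer from accepted and rejected report pairs, as described above, is commonly done with a pairwise ranking objective. The sketch below shows the standard Bradley-Terry style loss used for reward models. The abstract does not state which loss MRScore uses, so this is an assumption for illustration, and the function name is hypothetical.

```python
import math


def pairwise_reward_loss(r_accepted, r_rejected):
    """Return -log(sigmoid(r_accepted - r_rejected)) for one accepted/rejected pair.

    The loss is small when the accepted report scores well above the rejected one
    and large when the ordering is reversed.
    """
    diff = r_accepted - r_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Scores here are made-up numbers, not outputs of any real model.
print(pairwise_reward_loss(2.0, 0.0))  # correct ordering, lower loss
print(pairwise_reward_loss(0.0, 2.0))  # wrong ordering, higher loss
```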

Title: Medical Vision-Language Pre-Training for Brain Abnormalities

Authors: Masoud Monajatipoor, Zi-Yi Dou, Aichi Chien, Nanyun Peng, Kai-Wei Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17779
Pdf URL: https://arxiv.org/pdf/2404.17779
Copy Paste: [[2404.17779]] Medical Vision-Language Pre-Training for Brain Abnormalities(https://arxiv.org/abs/2404.17779)
Keywords: language model
Abstract: Vision-language models have become increasingly powerful for tasks that require an understanding of both visual and linguistic elements, bridging the gap between these modalities. In the context of multimodal clinical AI, there is a growing need for models that possess domain-specific knowledge, as existing models often lack the expertise required for medical applications. In this paper, we take brain abnormalities as an example to demonstrate how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset from case reports and published journals and subsequently constructing a high-performance vision-language model tailored to specific medical tasks. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain. We evaluated the resulting model with quantitative and qualitative intrinsic evaluations. The resulting dataset and our code can be found here https://github.com/masoud-monajati/MedVL_pretraining_pipeline

Title: Temporal Scaling Law for Large Language Models

Authors: Yizhe Xiong, Xiansheng Chen, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Jianwei Niu, Guiguang Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17785
Pdf URL: https://arxiv.org/pdf/2404.17785
Copy Paste: [[2404.17785]] Temporal Scaling Law for Large Language Models(https://arxiv.org/abs/2404.17785)
Keywords: language model, llm
Abstract: Recently, Large Language Models (LLMs) have been widely adopted in a wide range of tasks, leading to increasing attention towards research on how scaling LLMs affects their performance. Existing works, termed Scaling Laws, have discovered that the loss of LLMs scales as power laws with model size, computational budget, and dataset size. However, the performance of LLMs throughout the training process remains unexplored. In this paper, we propose the novel concept of Temporal Scaling Law and study the loss of LLMs from the temporal dimension. We first investigate the imbalance of loss across token positions and develop a reciprocal-law across model scales and training stages. We then derive the temporal scaling law by studying the temporal patterns of the reciprocal-law parameters. Results on both in-distribution (IID) data and out-of-distribution (OOD) data demonstrate that our temporal scaling law accurately predicts the performance of LLMs in future training stages. Moreover, the temporal scaling law reveals that LLMs learn uniformly on different token positions, despite the loss imbalance. Experiments on pre-training LLMs at various scales show that this phenomenon verifies the default training paradigm for generative language models, in which no re-weighting strategies are attached during training. Overall, the temporal scaling law provides deeper insight into LLM pre-training.
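The power-law relationship mentioned above for prior Scaling Laws has the generic form loss = a * x^(-b), where x is a quantity such as model size. The sketch below fits a and b by least squares in log-log space. It illustrates only that generic form, not the paper's temporal scaling law or its reciprocal-law, and the data points are synthetic values invented for the example.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - b * log x, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


# Synthetic example: losses generated from a known power law (not real results).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```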

Title: Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

Authors: Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17790
Pdf URL: https://arxiv.org/pdf/2404.17790
Copy Paste: [[2404.17790]] Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities(https://arxiv.org/abs/2404.17790)
Keywords: language model, llm
Abstract: Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpora allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a large Japanese web corpus. Experimental results confirmed that the performance on Japanese tasks drastically improved through continual pre-training, and the performance monotonically increased with the amount of training data up to 100B tokens. Consequently, Swallow achieved superior performance compared to other LLMs that were trained from scratch in English and Japanese. An analysis of the effects of continual pre-training revealed that it was particularly effective for Japanese question answering tasks. Furthermore, to elucidate effective methodologies for cross-lingual continual pre-training from English to Japanese, we investigated the impact of vocabulary expansion and the effectiveness of incorporating parallel corpora. The results showed that the efficiency gained through vocabulary expansion had no negative impact on performance, except for the summarization task, and that the combined use of parallel corpora enhanced translation ability.

Title: Empirical Analysis of Dialogue Relation Extraction with Large Language Models

Authors: Guozheng Li, Zijie Xu, Ziyu Shang, Jiajun Liu, Ke Ji, Yikai Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17802
Pdf URL: https://arxiv.org/pdf/2404.17802
Copy Paste: [[2404.17802]] Empirical Analysis of Dialogue Relation Extraction with Large Language Models(https://arxiv.org/abs/2404.17802)
Keywords: language model, llm
Abstract: Dialogue relation extraction (DRE) aims to extract relations between two arguments within a dialogue, which is more challenging than standard RE due to the higher personal pronoun frequency and lower information density in dialogues. However, existing DRE methods still suffer from two serious issues: (1) they find it hard to capture long and sparse multi-turn information, and (2) they struggle to extract golden relations based on partial dialogues, which motivates us to discover more effective methods that can alleviate the above issues. We notice that the rise of large language models (LLMs) has sparked considerable interest in evaluating their performance across diverse tasks. To this end, we initially investigate the capabilities of different LLMs in DRE, considering both proprietary models and open-source models. Interestingly, we discover that LLMs significantly alleviate two issues in existing DRE methods. Generally, we have the following findings: (1) scaling up model size substantially boosts the overall DRE performance and achieves exceptional results, tackling the difficulty of capturing long and sparse multi-turn information; (2) LLMs suffer a much smaller performance drop from the entire dialogue setting to the partial dialogue setting compared to existing methods; (3) LLMs deliver competitive or superior performance under both full-shot and few-shot settings compared to the current state of the art; (4) LLMs show modest performance on inverse relations but much stronger improvements on general relations, and they can handle dialogues of various lengths, especially longer sequences.

Title: Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors

Authors: Guozheng Li, Peng Wang, Jiajun Liu, Yikai Guo, Ke Ji, Ziyu Shang, Zijie Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17807
Pdf URL: https://arxiv.org/pdf/2404.17807
Copy Paste: [[2404.17807]] Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors(https://arxiv.org/abs/2404.17807)
Keywords: language model, llm, prompt
Abstract: Relation extraction (RE) is an important task that aims to identify the relationships between entities in texts. While large language models (LLMs) have revealed remarkable in-context learning (ICL) capability for general zero and few-shot learning, recent studies indicate that current LLMs still struggle with zero and few-shot RE. Previous studies are mainly dedicated to designing prompt formats and selecting good examples for improving ICL-based RE. Although both factors are vital for ICL, if one can fundamentally boost the ICL capability of LLMs in RE, the zero and few-shot RE performance via ICL would be significantly improved. To this end, we introduce Micre (Meta In-Context learning of LLMs for Relation Extraction), a new meta-training framework for zero and few-shot RE where an LLM is tuned to do ICL on a diverse collection of RE datasets (i.e., learning to learn in context for RE). Through meta-training, the model learns a new RE task in context more effectively by conditioning on a few training examples with no parameter updates or task-specific templates at inference time, enabling better zero and few-shot task generalization. We experiment with Micre on various LLMs with different model scales and 12 public RE datasets, and then evaluate it on unseen RE benchmarks under zero and few-shot settings. Micre delivers comparable or superior performance compared to a range of baselines including supervised fine-tuning and typical in-context learning methods. We find that the gains are particularly significant for larger model scales, and using a diverse set of the meta-training RE datasets is key to improvements. Empirically, we show that Micre can transfer the relation semantic knowledge via relation label name during inference on target RE datasets.

Title: Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal

Authors: Haoran Lian, Yizhe Xiong, Jianwei Niu, Shasha Mo, Zhenpeng Su, Zijia Lin, Peng Liu, Hui Chen, Guiguang Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17808
Pdf URL: https://arxiv.org/pdf/2404.17808
Copy Paste: [[2404.17808]] Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal(https://arxiv.org/abs/2404.17808)
Keywords: language model
Abstract: Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a frequency imbalance for tokens in the text corpus. Since BPE iteratively merges the most frequent token pair in the text corpus while keeping all tokens that have been merged in the vocabulary, it unavoidably holds tokens that primarily represent subwords of complete words and appear infrequently on their own in the text corpus. We term such tokens as Scaffold Tokens. Due to their infrequent appearance in the text corpus, Scaffold Tokens pose a learning imbalance issue for language models. To address that issue, we propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original BPE. This novel approach ensures the exclusion of low-frequency Scaffold Tokens from the token representations for the given texts, thereby mitigating the issue of frequency imbalance and facilitating model training. On extensive experiments across language modeling tasks and machine translation tasks, Scaffold-BPE consistently outperforms the original BPE, well demonstrating its effectiveness and superiority.
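The frequency imbalance described above can be reproduced with plain BPE on a toy corpus. The sketch below implements only the original merge loop, not the Scaffold-BPE removal mechanism, and the word counts are invented for illustration. It shows that intermediate tokens stay in the vocabulary even though they never appear on their own in the final segmentation.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges):
    """Vanilla BPE: repeatedly merge the most frequent adjacent token pair."""
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merged_tokens = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens, freq in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        new_token = best[0] + best[1]
        merged_tokens.append(new_token)
        new_corpus = {}
        for tokens, freq in corpus.items():
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(new_token)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merged_tokens, corpus


def standalone_frequency(corpus):
    """Count how often each token appears in the final segmentation."""
    counts = Counter()
    for tokens, freq in corpus.items():
        for tok in tokens:
            counts[tok] += freq
    return counts


# Toy corpus: "hello" occurs 10 times, "he" occurs twice (made-up counts).
vocab, segmented = train_bpe({"hello": 10, "he": 2}, num_merges=4)
counts = standalone_frequency(segmented)
for token in vocab:
    print(token, counts[token])
```

In this toy run every merged token other than "he" and "hello" ends up with a standalone count of zero, which is the kind of token the abstract calls a Scaffold Token.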

Title: Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction

Authors: Guozheng Li, Peng Wang, Wenjun Ke, Yikai Guo, Ke Ji, Ziyu Shang, Jiajun Liu, Zijie Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17809
Pdf URL: https://arxiv.org/pdf/2404.17809
Copy Paste: [[2404.17809]] Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction(https://arxiv.org/abs/2404.17809)
Keywords: language model, llm
Abstract: Relation extraction (RE) aims to identify relations between entities mentioned in texts. Although large language models (LLMs) have demonstrated impressive in-context learning (ICL) abilities in various tasks, they still suffer from poor performance compared to most supervised fine-tuned RE methods. Utilizing ICL for RE with LLMs encounters two challenges: (1) retrieving good demonstrations from training examples, and (2) enabling LLMs to exhibit strong ICL abilities in RE. On the one hand, retrieving good demonstrations is a non-trivial process in RE, which easily results in low relevance regarding entities and relations. On the other hand, ICL with an LLM achieves poor performance in RE when RE differs in nature from language modeling or the LLM is not large enough. In this work, we propose a novel recall-retrieve-reason RE framework that synergizes LLMs with retrieval corpora (training examples) to enable relevant retrieving and reliable in-context reasoning. Specifically, we distill consistent ontological knowledge from training datasets to let LLMs generate relevant entity pairs grounded by retrieval corpora as valid queries. These entity pairs are then used to retrieve relevant training examples from the retrieval corpora as demonstrations for LLMs to conduct better ICL via instruction tuning. Extensive experiments on different LLMs and RE datasets demonstrate that our method generates relevant and valid entity pairs and boosts ICL abilities of LLMs, achieving competitive or new state-of-the-art performance on sentence-level RE compared to previous supervised fine-tuning methods and ICL-based methods.

Title: Evaluation of Few-Shot Learning for Classification Tasks in the Polish Language

Authors: Tsimur Hadeliya, Dariusz Kajtoch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17832
Pdf URL: https://arxiv.org/pdf/2404.17832
Copy Paste: [[2404.17832]] Evaluation of Few-Shot Learning for Classification Tasks in the Polish Language(https://arxiv.org/abs/2404.17832)
Keywords: gpt
Abstract: We introduce a few-shot benchmark consisting of 7 different classification tasks native to the Polish language. We conducted an empirical comparison with 0 and 16 shots between fine-tuning, linear probing, SetFit, and in-context learning (ICL) using various pre-trained commercial and open-source models. Our findings reveal that ICL achieves the best performance, with commercial models like GPT-3.5 and GPT-4 attaining the highest scores. However, there remains a significant gap of 14 percentage points between our best few-shot learning score and the performance of HerBERT-large fine-tuned on the entire training dataset. Among the techniques, SetFit emerges as the second-best approach, closely followed by linear probing. We observed the worst and most unstable performance with non-linear head fine-tuning. Results for ICL indicate that continual pre-training of models like Mistral-7b or Llama-2-13b on Polish corpora is beneficial. This is confirmed by the improved performance of Bielik-7b and Trurl-13b, respectively. To further support experiments in few-shot learning for Polish, we are releasing handcrafted templates for ICL.

Title: VANER: Leveraging Large Language Model for Versatile and Adaptive Biomedical Named Entity Recognition

Authors: Junyi Bian, Weiqi Zhai, Xiaodi Huang, Jiaxuan Zheng, Shanfeng Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17835
Pdf URL: https://arxiv.org/pdf/2404.17835
Copy Paste: [[2404.17835]] VANER: Leveraging Large Language Model for Versatile and Adaptive Biomedical Named Entity Recognition(https://arxiv.org/abs/2404.17835)
Keywords: language model, llm
Abstract: The prevalent solution for BioNER involves using representation learning techniques coupled with sequence labeling. However, such methods are inherently task-specific, demonstrate poor generalizability, and often require a dedicated model for each dataset. To leverage the versatile capabilities of recent large language models (LLMs), several endeavors have explored generative approaches to entity extraction. Yet, these approaches often fall short of the effectiveness of previous sequence labeling approaches. In this paper, we utilize the open-sourced LLM LLaMA2 as the backbone model, and design specific instructions to distinguish between different types of entities and datasets. By combining the LLM's understanding of instructions with sequence labeling techniques, we use a mix of datasets to train a model capable of extracting various types of entities. Given that the backbone LLM lacks specialized medical knowledge, we also integrate external entity knowledge bases and employ instruction tuning to compel the model to densely recognize carefully curated entities. Our model VANER, trained with a small partition of parameters, significantly outperforms previous LLM-based models and, for the first time, as a model based on LLM, surpasses the majority of conventional state-of-the-art BioNER systems, achieving the highest F1 scores across three datasets.

Title: Toxicity Classification in Ukrainian

Authors: Daryna Dementieva, Valeriia Khylenko, Nikolay Babakov, Georg Groh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17841
Pdf URL: https://arxiv.org/pdf/2404.17841
Copy Paste: [[2404.17841]] Toxicity Classification in Ukrainian(https://arxiv.org/abs/2404.17841)
Keywords: llm, prompt
Abstract: Toxicity detection remains a relevant task, especially in the context of safe and fair LMs development. Nevertheless, labeled binary toxicity classification corpora are not available for all languages, which is understandable given the resource-intensive nature of the annotation process. Ukrainian, in particular, is among the languages lacking such resources. To our knowledge, there has been no existing toxicity classification corpus in Ukrainian. In this study, we aim to fill this gap by investigating cross-lingual knowledge transfer techniques and creating labeled corpora by: (i) translating from an English corpus, (ii) filtering toxic samples using keywords, and (iii) annotating with crowdsourcing. We compare LLM prompting and other cross-lingual transfer approaches with and without fine-tuning, offering insights into the most robust and efficient baselines.

Title: PromptCL: Improving Event Representation via Prompt Template and Contrastive Learning

Authors: Yubo Feng, Lishuang Li, Yi Xiang, Xueyang Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17877
Pdf URL: https://arxiv.org/pdf/2404.17877
Copy Paste: [[2404.17877]] PromptCL: Improving Event Representation via Prompt Template and Contrastive Learning(https://arxiv.org/abs/2404.17877)
Keywords: language model, prompt
Abstract: The representation of events in text plays a significant role in various NLP tasks. Recent research demonstrates that contrastive learning has the ability to improve event comprehension capabilities of Pre-trained Language Models (PLMs) and enhance the performance of event representation learning. However, the efficacy of event representation learning based on contrastive learning and PLMs is limited by the short length of event texts. The length of event texts differs significantly from the text length used in the pre-training of PLMs. As a result, there is inconsistency in the distribution of text length between pre-training and event representation learning, which may undermine the learning process of event representation based on PLMs. In this study, we present PromptCL, a novel framework for event representation learning that effectively elicits the capabilities of PLMs to comprehensively capture the semantics of short event texts. PromptCL utilizes a Prompt template borrowed from prompt learning to expand the input text during Contrastive Learning. This helps in enhancing the event representation learning by providing a structured outline of the event components. Moreover, we propose Subject-Predicate-Object (SPO) word order and Event-oriented Masked Language Modeling (EventMLM) to train PLMs to understand the relationships between event components. Our experimental results demonstrate that PromptCL outperforms state-of-the-art baselines on event related tasks. Additionally, we conduct a thorough analysis and demonstrate that using a prompt results in improved generalization capabilities for event representations. Our code will be available at https://github.com/YuboFeng2023/PromptCL.
摘要：文本中的事件表示在各种 NLP 任务中起着重要作用。最近的研究表明，对比学习能够提高预训练语言模型 (PLM) 的事件理解能力并提高事件表示学习的性能。然而，基于对比学习和 PLM 的事件表示学习的有效性受到事件文本长度较短的限制。事件文本的长度与 PLM 预训练中使用的文本长度有很大不同。因此，预训练和事件表示学习之间的文本长度分布不一致，这可能会破坏基于 PLM 的事件表示学习过程。在本研究中，我们提出了 PromptCL，这是一种新颖的事件表示学习框架，可有效发挥 PLM 全面捕捉短事件文本语义的能力。PromptCL 利用从提示学习中借用的提示模板来扩展对比学习期间的输入文本。这有助于通过提供事件组件的结构化大纲来增强事件表示学习。此外，我们提出了主语-谓语-宾语 (SPO) 词序和面向事件的掩码语言建模 (EventMLM) 来训练 PLM 以理解事件组件之间的关系。我们的实验结果表明，PromptCL 在事件相关任务上的表现优于最先进的基线。此外，我们进行了彻底的分析，并证明使用提示可以提高事件表示的泛化能力。我们的代码将在 https://github.com/YuboFeng2023/PromptCL 上提供。

Title: Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models

Authors: Zhongzhen Huang, Kui Xue, Yongqi Fan, Linjie Mu, Ruoyu Liu, Tong Ruan, Shaoting Zhang, Xiaofan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17897
Pdf URL: https://arxiv.org/pdf/2404.17897
Copy Paste: [[2404.17897]] Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2404.17897)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large-scale language models (LLMs) have achieved remarkable success across various language tasks but suffer from hallucinations and temporal misalignment. To mitigate these shortcomings, retrieval-augmented generation (RAG) has been utilized to provide external knowledge to facilitate answer generation. However, applying such models to the medical domain faces several challenges due to the lack of domain-specific knowledge and the intricacy of real-world scenarios. In this study, we explore LLMs with a RAG framework for knowledge-intensive tasks in the medical field. To evaluate the capabilities of LLMs, we introduce MedicineQA, a multi-round dialogue benchmark that simulates the real-world medication consultation scenario and requires LLMs to answer with retrieved evidence from the medicine database. MedicineQA contains 300 multi-round question-answering pairs, each embedded within a detailed dialogue history, highlighting the challenge posed by this knowledge-intensive task to current LLMs. We further propose a new Distill-Retrieve-Read framework instead of the previous Retrieve-then-Read. Specifically, the distillation and retrieval process utilizes a tool calling mechanism to formulate search queries that emulate the keyword-based inquiries used by search engines. With experimental results, we show that our framework brings notable performance improvements and surpasses previous counterparts in evidence retrieval accuracy. This advancement sheds light on applying RAG to the medical domain.
摘要：大规模语言模型（LLM）在各种语言任务中取得了显著的成功，但存在幻觉和时间错位的问题。为了减轻这些缺点，检索增强生成（RAG）已被用来提供外部知识以促进答案生成。然而，由于缺乏特定领域的知识以及现实场景的复杂性，将此类模型应用于医学领域面临着一些挑战。在本研究中，我们探索将带有 RAG 框架的 LLM 用于医学领域的知识密集型任务。为了评估 LLM 的能力，我们引入了 MedicineQA，这是一个多轮对话基准，它模拟现实世界的药物咨询场景，并要求 LLM 使用从药品数据库中检索到的证据进行回答。MedicineQA 包含 300 个多轮问答对，每个问答对都嵌入了详细的对话历史记录，突出了这项知识密集型任务对当前 LLM 提出的挑战。我们进一步提出了一个新的 Distill-Retrieve-Read 框架来代替之前的 Retrieve-then-Read。具体来说，蒸馏和检索过程利用工具调用机制来构造搜索查询，以模拟搜索引擎使用的基于关键字的查询。通过实验结果，我们表明我们的框架带来了显著的性能改进，并且在证据检索准确性方面超越了以前的同类方法。这一进展为 RAG 在医疗领域的应用提供了启示。

Title: SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models

Authors: Manav Nitin Kapadnis, Sohan Patnaik, Abhilash Nandy, Sourjyadip Ray, Pawan Goyal, Debdoot Sheet
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.17912
Pdf URL: https://arxiv.org/pdf/2404.17912
Copy Paste: [[2404.17912]] SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models(https://arxiv.org/abs/2404.17912)
Keywords: language model, gpt, llm, hallucination
Abstract: Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that don't accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a unique self-supervised loss that leverages similarity between pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image and the generated text, therefore reducing hallucination and continuously enhancing nuanced report generation. SERPENT-VLM outperforms existing baselines such as LLaVA-Med, BiomedGPT, etc., achieving SoTA performance on the IU X-ray and Radiology Objects in COntext (ROCO) datasets, and also proves to be robust against noisy images. A qualitative case study emphasizes the significant advancements towards more sophisticated MLLM frameworks for R2Gen, opening paths for further research into self-supervised refinement in the medical imaging domain.
摘要：放射学报告生成 (R2Gen) 演示了多模式大语言模型 (MLLM) 如何自动创建准确且连贯的放射学报告。现有的方法通常会产生基于文本的报告中的细节，而这些细节不能准确反映图像内容。为了缓解这个问题，我们引入了一种新颖的策略，SERPENT-VLM（使用视觉语言模型的自精炼放射学报告生成），它通过将自精炼机制集成到 MLLM 框架中来改进 R2Gen 任务。我们采用独特的自监督损失，利用池图像表示和生成的放射文本的上下文表示之间的相似性，以及标准因果语言建模目标，来完善图像文本表示。这使得模型能够通过给定图像和生成文本之间的动态交互来仔细检查和对齐生成的文本，从而减少幻觉并不断增强细致入微的报告生成。 SERPENT-VLM 优于 LLaVA-Med、BiomedGPT 等现有基线，在 COntext (ROCO) 数据集中的 IU X 射线和放射学对象上实现了 SoTA 性能，并且还证明对噪声图像具有鲁棒性。定性案例研究强调了 R2Gen 向更复杂的 MLLM 框架取得的重大进步，为进一步研究医学成像领域的自我监督细化开辟了道路。

Title: Transfer Learning Enhanced Single-choice Decision for Multi-choice Question Answering

Authors: Chenhao Cui, Yufan Jiang, Shuangzhi Wu, Zhoujun Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17949
Pdf URL: https://arxiv.org/pdf/2404.17949
Copy Paste: [[2404.17949]] Transfer Learning Enhanced Single-choice Decision for Multi-choice Question Answering(https://arxiv.org/abs/2404.17949)
Keywords: language model
Abstract: Multi-choice Machine Reading Comprehension (MMRC) aims to select the correct answer from a set of options based on a given passage and question. Existing methods employ a pre-trained language model as the encoder and share and transfer knowledge through fine-tuning. These methods mainly focus on designing elaborate mechanisms to capture the relationships among the triplet of passage, question, and answers. Transferring knowledge from other MRC tasks such as SQuAD is non-trivial because of the task-specific format of MMRC, and it has largely been ignored. In this paper, we recast multi-choice as single-choice by training a binary classifier to distinguish whether a certain answer is correct. We then select the option with the highest confidence score as the final answer. Our proposed method gets rid of the multi-choice framework and can leverage resources of other tasks. We construct our model based on the ALBERT-xxlarge model and evaluate it on the RACE and DREAM datasets. Experimental results show that our model performs better than multi-choice methods. In addition, by transferring knowledge from other kinds of MRC tasks, our model achieves state-of-the-art results in both single and ensemble settings.
摘要：多选机器阅读理解 (MMRC) 旨在根据给定的段落和问题从一组选项中选择正确答案。现有的方法采用预训练的语言模型作为编码器，通过微调共享和迁移知识。这些方法主要侧重于设计精妙的机制来有效捕捉段落、问题和答案三元组之间的关系。由于 MMRC 的任务特定性，从其他 MRC 任务（如 SQuAD）迁移知识并不是一件容易的事，但却被忽略了。在本文中，我们通过训练二分类来区分某个答案是否正确，将多选重构为单选。然后选择置信度得分最高的选项作为最终答案。我们提出的方法摆脱了多选框架，可以利用其他任务的资源。我们基于 ALBERT-xxlarge 模型构建模型，并在 RACE 和 DREAM 数据集上对其进行评估。实验结果表明，我们的模型比多选方法表现更好。此外，通过从其他类型的 MRC 任务中转移知识，我们的模型在单一和集成设置中都取得了最先进的结果。
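
The single-choice decision described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: `pick_answer` and `toy_score` are hypothetical names, and `toy_score` is a word-overlap stand-in for the fine-tuned binary classifier (the paper builds on ALBERT-xxlarge), used here only so the example runs without a model.

```python
from typing import Callable, Sequence


def pick_answer(
    passage: str,
    question: str,
    options: Sequence[str],
    score: Callable[[str, str, str], float],
) -> int:
    """Score each option independently and return the index of the best one."""
    if not options:
        raise ValueError("options must be non-empty")
    # Each (passage, question, option) triple is judged on its own,
    # as a binary "is this answer correct?" decision with a confidence.
    scores = [score(passage, question, opt) for opt in options]
    # The final answer is the option with the highest confidence score.
    return max(range(len(options)), key=scores.__getitem__)


def toy_score(passage: str, question: str, option: str) -> float:
    """Hypothetical stand-in for a classifier: fraction of option words found in the passage."""
    words = option.lower().split()
    if not words:
        return 0.0
    passage_words = set(passage.lower().split())
    return sum(w in passage_words for w in words) / len(words)


options = ["on the mat", "in the tree", "under a bed"]
best = pick_answer("the cat sat on the mat", "where did the cat sit", options, toy_score)
print(options[best])
```

Because every option is scored separately, the same scorer could in principle be trained on data from other reading-comprehension tasks, which is the transfer opportunity the abstract points to.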

Title: Automating Customer Needs Analysis: A Comparative Study of Large Language Models in the Travel Industry

Authors: Simone Barandoni, Filippo Chiarello, Lorenzo Cascone, Emiliano Marrale, Salvatore Puccio
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2404.17975
Pdf URL: https://arxiv.org/pdf/2404.17975
Copy Paste: [[2404.17975]] Automating Customer Needs Analysis: A Comparative Study of Large Language Models in the Travel Industry(https://arxiv.org/abs/2404.17975)
Keywords: language model, gpt, llm
Abstract: In the rapidly evolving landscape of Natural Language Processing (NLP), Large Language Models (LLMs) have emerged as powerful tools for many tasks, such as extracting valuable insights from vast amounts of textual data. In this study, we conduct a comparative analysis of LLMs for the extraction of travel customer needs from TripAdvisor posts. Leveraging a diverse range of models, including both open-source and proprietary ones such as GPT-4 and Gemini, we aim to elucidate their strengths and weaknesses in this specialized domain. Through an evaluation process involving metrics such as BERTScore, ROUGE, and BLEU, we assess the performance of each model in accurately identifying and summarizing customer needs. Our findings highlight the efficacy of open-source LLMs, particularly Mistral 7B, in achieving comparable performance to larger closed models while offering affordability and customization benefits. Additionally, we underscore the importance of considering factors such as model size, resource requirements, and performance metrics when selecting the most suitable LLM for customer needs analysis tasks. Overall, this study contributes valuable insights for businesses seeking to leverage advanced NLP techniques to enhance customer experience and drive operational efficiency in the travel industry.
摘要：在快速发展的自然语言处理 (NLP) 领域，大型语言模型 (LLM) 已成为许多任务的强大工具，例如从大量文本数据中提取有价值的见解。在本研究中，我们对法学硕士进行了比较分析，以从 TripAdvisor 帖子中提取旅行客户需求。利用各种模型，包括开源模型和专有模型（例如 GPT-4 和 Gemini），我们的目标是阐明它们在这个专业领域的优势和劣势。通过涉及 BERTScore、ROUGE 和 BLEU 等指标的评估过程，我们评估每个模型在准确识别和总结客户需求方面的表现。我们的研究结果强调了开源法学硕士（尤其是 Mistral 7B）在实现与大型封闭模型相当的性能方面的功效，同时提供可负担性和定制优势。此外，我们强调在为客户需求分析任务选择最合适的法学硕士时考虑模型大小、资源要求和性能指标等因素的重要性。总体而言，这项研究为寻求利用先进的 NLP 技术来增强客户体验并提高旅游行业运营效率的企业提供了宝贵的见解。

Title: Detection of Conspiracy Theories Beyond Keyword Bias in German-Language Telegram Using Large Language Models

Authors: Milena Pustet, Elisabeth Steffen, Helena Mihaljević
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2404.17985
Pdf URL: https://arxiv.org/pdf/2404.17985
Copy Paste: [[2404.17985]] Detection of Conspiracy Theories Beyond Keyword Bias in German-Language Telegram Using Large Language Models(https://arxiv.org/abs/2404.17985)
Keywords: language model, gpt, prompt
Abstract: The automated detection of conspiracy theories online typically relies on supervised learning. However, creating respective training data requires expertise, time and mental resilience, given the often harmful content. Moreover, available datasets are predominantly in English and often keyword-based, introducing a token-level bias into the models. Our work addresses the task of detecting conspiracy theories in German Telegram messages. We compare the performance of supervised fine-tuning approaches using BERT-like models with prompt-based approaches using Llama2, GPT-3.5, and GPT-4 which require little or no additional training data. We use a dataset of ~4,000 messages collected during the COVID-19 pandemic, without the use of keyword filters. Our findings demonstrate that both approaches can be leveraged effectively: For supervised fine-tuning, we report an F1 score of ~0.8 for the positive class, making our model comparable to recent models trained on keyword-focused English corpora. We demonstrate our model's adaptability to intra-domain temporal shifts, achieving F1 scores of ~0.7. Among prompting variants, the best model is GPT-4, achieving an F1 score of ~0.8 for the positive class in a zero-shot setting and equipped with a custom conspiracy theory definition.
摘要：在线阴谋论的自动检测通常依赖于监督学习。然而，鉴于内容往往有害，创建相应的训练数据需要专业知识、时间和心理韧性。此外，可用的数据集主要是英文的，并且通常基于关键字，从而在模型中引入了词元级偏差。我们的工作致力于检测德语 Telegram 消息中的阴谋论。我们比较了使用类似 BERT 模型的监督微调方法与使用 Llama2、GPT-3.5 和 GPT-4 的基于提示的方法的性能，后者需要很少或不需要额外的训练数据。我们使用了在 COVID-19 大流行期间收集的约 4,000 条消息的数据集，未使用关键字过滤器。我们的研究结果表明，这两种方法都可以有效利用：对于监督微调，我们报告的正类 F1 分数约为 0.8，这使得我们的模型与最近在以关键词为中心的英语语料库上训练的模型相当。我们展示了我们的模型对域内时间变化的适应性，实现了约 0.7 的 F1 分数。在提示变体中，最好的模型是 GPT-4，在零样本设置下并配备自定义阴谋论定义时，正类的 F1 分数约为 0.8。

Title: Enhancing Pre-Trained Generative Language Models with Question Attended Span Extraction on Machine Reading Comprehension

Authors: Lin Ai, Zheng Hui, Zizhou Liu, Julia Hirschberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17991
Pdf URL: https://arxiv.org/pdf/2404.17991
Copy Paste: [[2404.17991]] Enhancing Pre-Trained Generative Language Models with Question Attended Span Extraction on Machine Reading Comprehension(https://arxiv.org/abs/2404.17991)
Keywords: language model, gpt, llm
Abstract: Machine Reading Comprehension (MRC) poses a significant challenge in the field of Natural Language Processing (NLP). While mainstream MRC methods predominantly leverage extractive strategies using encoder-only models such as BERT, generative approaches face the issue of out-of-control generation -- a critical problem where answers generated are often incorrect, irrelevant, or unfaithful to the source text. To address these limitations in generative models for MRC, we introduce the Question-Attended Span Extraction (QASE) module. Integrated during the fine-tuning phase of pre-trained generative language models (PLMs), QASE significantly enhances their performance, allowing them to surpass the extractive capabilities of advanced Large Language Models (LLMs) such as GPT-4. Notably, these gains in performance do not come with an increase in computational demands. The efficacy of the QASE module has been rigorously tested across various datasets, consistently achieving or even surpassing state-of-the-art (SOTA) results.
摘要：机器阅读理解（MRC）对自然语言处理（NLP）领域提出了重大挑战。虽然主流 MRC 方法主要利用仅使用编码器的模型（例如 BERT）的提取策略，但生成方法面临着生成失控的问题，这是一个关键问题，生成的答案通常不正确、不相关或不忠实于源文本。为了解决 MRC 生成模型中的这些限制，我们引入了问题参与跨度提取（QASE）模块。 QASE 在预训练生成语言模型 (PLM) 的微调阶段进行集成，显着增强了其性能，使其超越了 GPT-4 等高级大型语言模型 (LLM) 的提取能力。值得注意的是，这些性能提升并不伴随着计算需求的增加。 QASE 模块的功效已经在各种数据集上经过严格测试，始终达到甚至超越了最先进的 (SOTA) 结果。

Title: MediFact at MEDIQA-CORR 2024: Why AI Needs a Human Touch

Authors: Nadia Saeed
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.17999
Pdf URL: https://arxiv.org/pdf/2404.17999
Copy Paste: [[2404.17999]] MediFact at MEDIQA-CORR 2024: Why AI Needs a Human Touch(https://arxiv.org/abs/2404.17999)
Keywords: language model, llm
Abstract: Accurate representation of medical information is crucial for patient safety, yet artificial intelligence (AI) systems, such as Large Language Models (LLMs), encounter challenges in error-free clinical text interpretation. This paper presents a novel approach submitted to the MEDIQA-CORR 2024 shared task (Ben Abacha et al., 2024a), focusing on the automatic correction of single-word errors in clinical notes. Unlike LLMs that rely on extensive generic data, our method emphasizes extracting contextually relevant information from available clinical text data. Leveraging an ensemble of extractive and abstractive question-answering approaches, we construct a supervised learning framework with domain-specific feature engineering. Our methodology incorporates domain expertise to enhance error correction accuracy. By integrating domain expertise and prioritizing meaningful information extraction, our approach underscores the significance of a human-centric strategy in adapting AI for healthcare.
摘要：医疗信息的准确表示对于患者安全至关重要，但大型语言模型 (LLM) 等人工智能 (AI) 系统在无差错临床文本解释方面遇到挑战。本文提出了一种提交给 MEDIQA-CORR 2024 共享任务（Ben Abacha 等人，2024a）的新颖方法，重点关注临床笔记中单字错误的自动纠正。与依赖广泛通用数据的法学硕士不同，我们的方法强调从可用的临床文本数据中提取上下文相关信息。利用提取和抽象问答方法的集合，我们构建了一个具有特定领域特征工程的监督学习框架。我们的方法结合了领域专业知识来提高纠错准确性。通过整合领域专业知识并优先考虑有意义的信息提取，我们的方法强调了以人为本的策略在将人工智能应用于医疗保健方面的重要性。

Title: Utilizing Large Language Models for Information Extraction from Real Estate Transactions

Authors: Yu Zhao, Haoxiang Gao
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2404.18043
Pdf URL: https://arxiv.org/pdf/2404.18043
Copy Paste: [[2404.18043]] Utilizing Large Language Models for Information Extraction from Real Estate Transactions(https://arxiv.org/abs/2404.18043)
Keywords: language model
Abstract: Real estate sales contracts contain crucial information for property transactions, but manual extraction of data can be time-consuming and error-prone. This paper explores the application of large language models, specifically transformer-based architectures, for automated information extraction from real estate contracts. We discuss challenges, techniques, and future directions in leveraging these models to improve efficiency and accuracy in real estate contract analysis.
摘要：房地产销售合同包含房地产交易的关键信息，但手动提取数据可能非常耗时且容易出错。本文探讨了大型语言模型（特别是基于变压器的架构）在房地产合同中自动信息提取的应用。我们讨论利用这些模型来提高房地产合同分析的效率和准确性的挑战、技术和未来方向。

Title: Efficient LLM Inference with Kcache

Authors: Qiaozhi He, Zhihua Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18057
Pdf URL: https://arxiv.org/pdf/2404.18057
Copy Paste: [[2404.18057]] Efficient LLM Inference with Kcache(https://arxiv.org/abs/2404.18057)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures efficient sequence generation by caching previously computed KV states. However, it also introduces significant memory overhead. We discovered that KV Cache is not necessary and proposed a novel KCache technique to alleviate the memory bottleneck issue during the LLM inference process. KCache can be used directly for inference without any training process. Our evaluations show that KCache improves the throughput of popular LLMs by 40% compared with the baseline, while keeping accuracy.
摘要：大型语言模型（LLM）对人工智能应用产生了深远的影响，特别是在长文本理解和生成领域。 KV Cache技术是业界应用最广泛的技术之一。它通过缓存先前计算的 KV 状态来确保高效的序列生成。然而，它也带来了显着的内存开销。我们发现 KV Cache 不是必需的，并提出了一种新颖的 KCache 技术来缓解 LLM 推理过程中的内存瓶颈问题。 KCache 可以直接用于推理，无需任何训练过程。我们的评估表明，KCache 将流行 LLM 的吞吐量与基线相比提高了 40%，同时保持了准确性。

Title: Can Perplexity Predict Fine-Tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali

Authors: Nishant Luitel, Nirajan Bekoju, Anand Kumar Sah, Subarna Shakya
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.18071
Pdf URL: https://arxiv.org/pdf/2404.18071
Copy Paste: [[2404.18071]] Can Perplexity Predict Fine-Tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali(https://arxiv.org/abs/2404.18071)
Keywords: language model, gpt
Abstract: Recent language models use subwording mechanisms to handle Out-of-Vocabulary (OOV) words seen during test time, and their generation capacity is generally measured using perplexity, an intrinsic metric. It is known that increasing the subword granularity results in a decrease of perplexity value. However, studies of how subwording affects the understanding capacity of language models have been few and limited to a handful of languages. To reduce this gap, we used 6 different tokenization schemes to pretrain relatively small language models in Nepali and used the representations learned to finetune on several downstream tasks. Although the byte-level BPE algorithm has been used in recent models like GPT and RoBERTa, we show that on average it is sub-optimal in comparison to algorithms such as SentencePiece in finetuning performance for Nepali. Additionally, similar recent studies have focused on BERT-based language models. We, however, pretrain and finetune sequential transformer-based language models.
摘要：最近的语言模型使用子词机制来处理测试期间出现的词汇外（OOV）单词，并且它们的生成能力通常使用困惑度（一种内在指标）来衡量。众所周知，增加子字粒度会导致困惑度值降低。然而，关于子词如何影响语言模型理解能力的研究却很少，而且仅限于少数语言。为了缩小这一差距，我们使用 6 种不同的标记化方案来预训练尼泊尔语中相对较小的语言模型，并使用学到的表示来微调几个下游任务。尽管字节级 BPE 算法已在 GPT、RoBERTa 等最新模型中使用，但我们表明，与 SentencePiece 等算法相比，它们在尼泊尔语微调性能方面平均不是最优的。此外，最近类似的研究也集中在基于 Bert 的语言模型上。然而，我们预训练和微调基于顺序变压器的语言模型。
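
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below is a minimal illustration, not code from the paper; the function name `perplexity` and the input format (a list of natural-log token probabilities) are assumptions made for this example.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities assigned by a language model."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    # Perplexity is exp of that average.
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 tokens.
print(perplexity([math.log(0.25)] * 4))
```

Since the average is taken per token, the value depends on how the text is split into tokens, so perplexities measured under different tokenization schemes are not directly comparable.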

Title: Contextual Spelling Correction with Language Model for Low-resource Setting

Authors: Nishant Luitel, Nirajan Bekoju, Anand Kumar Sah, Subarna Shakya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18072
Pdf URL: https://arxiv.org/pdf/2404.18072
Copy Paste: [[2404.18072]] Contextual Spelling Correction with Language Model for Low-resource Setting(https://arxiv.org/abs/2404.18072)
Keywords: language model
Abstract: The task of Spell Correction (SC) in low-resource languages presents a significant challenge due to the availability of only a limited corpus of data and no annotated spelling correction datasets. To tackle these challenges, a small-scale word-based transformer LM is trained to provide the SC model with contextual understanding. Further, probabilistic error rules are extracted from the corpus in an unsupervised way to model the tendency of errors to occur (the error model). The LM and the error model are then combined to develop the SC model through the well-known noisy channel framework. The effectiveness of this approach is demonstrated through experiments on the Nepali language where there is access to just an unprocessed corpus of textual data.
摘要：由于只有有限的数据语料库且没有带注释的拼写纠正数据集，低资源语言中的拼写纠正（SC）任务提出了重大挑战。为了应对这些挑战，我们训练了一个基于单词的小型转换器 LM，为 SC 模型提供上下文理解。进一步，以无监督的方式从语料库中提取概率错误规则，对错误发生的趋势进行建模（错误模型）。然后，通过众所周知的噪声通道框架，结合LM和误差模型来开发SC模型。这种方法的有效性通过尼泊尔语言的实验得到了证明，在尼泊尔语言中只能访问未经处理的文本数据语料库。
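
The noisy channel framework mentioned in this abstract picks the candidate word that maximizes the product of a language-model probability and an error-model probability. The sketch below illustrates that general decision rule only. Everything in it is an assumption for illustration: the tables `LM` and `ERR` are made-up toy values in English rather than Nepali, the context is ignored, and in the paper the language model is a word-based transformer and the error rules are extracted from a corpus.

```python
import math
from typing import Sequence

# Hypothetical P(word | context); a real system would query a language model.
LM = {"their": 0.6, "there": 0.3, "three": 0.1}

# Hypothetical P(observed | intended); a real system would use learned error rules.
ERR = {
    ("thier", "their"): 0.7,
    ("thier", "there"): 0.1,
    ("thier", "three"): 0.05,
}

FLOOR = 1e-9  # small probability for unseen events, so log() stays defined


def lm_logprob(word: str) -> float:
    return math.log(LM.get(word, FLOOR))


def error_logprob(observed: str, intended: str) -> float:
    return math.log(ERR.get((observed, intended), FLOOR))


def correct(observed: str, candidates: Sequence[str]) -> str:
    """Noisy channel rule: argmax over c of log P(c) + log P(observed | c)."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda c: lm_logprob(c) + error_logprob(observed, c))


print(correct("thier", ["their", "there", "three"]))
```

Working in log space turns the product of the two probabilities into a sum, which avoids numerical underflow when the individual probabilities are small.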

Title: CRE-LLM: A Domain-Specific Chinese Relation Extraction Framework with Fine-tuned Large Language Model

Authors: Zhengpeng Shi, Haoran Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18085
Pdf URL: https://arxiv.org/pdf/2404.18085
Copy Paste: [[2404.18085]] CRE-LLM: A Domain-Specific Chinese Relation Extraction Framework with Fine-tuned Large Language Model(https://arxiv.org/abs/2404.18085)
Keywords: language model, llm, prompt, chat
Abstract: Domain-Specific Chinese Relation Extraction (DSCRE) aims to extract relations between entities from domain-specific Chinese text. Despite the rapid development of PLMs in recent years, especially LLMs, DSCRE still faces three core challenges: complex network structure design, poor awareness, and high consumption of fine-tuning. Given the impressive performance of large language models (LLMs) in natural language processing, we propose a new framework called CRE-LLM. This framework is based on fine-tuning open-source LLMs, such as Llama-2, ChatGLM2, and Baichuan2. CRE-LLM enhances the logic-awareness and generative capabilities of the model by constructing an appropriate prompt and utilizing open-source LLMs for instruction-supervised fine-tuning. It then directly extracts the relations of the given entities in the input textual data, which improves the CRE approach. To demonstrate the effectiveness of the proposed framework, we conducted extensive experiments on two domain-specific CRE datasets, FinRE and SanWen. The experimental results show that CRE-LLM is significantly superior and robust, achieving state-of-the-art (SOTA) performance on the FinRE dataset. This paper introduces a novel approach to domain-specific relation extraction (DSCRE) tasks that are semantically more complex by combining LLMs with triples. Our code is publicly available.
摘要：特定领域中文关系提取（DSCRE）旨在从特定领域中文文本中提取实体之间的关系。尽管近年来PLM尤其是LLM发展迅速，但DSCRE仍然面临着三个核心挑战：网络结构设计复杂、认知度差、微调消耗高。鉴于大型语言模型（LLM）在自然语言处理中令人印象深刻的性能，我们提出了一个名为 CRE-LLM 的新框架。该框架基于微调开源LLM，例如Llama-2、ChatGLM2和Baichuan2。 CRE-LLM 通过构建适当的提示并利用开源 LLM 进行指令监督微调，增强模型的逻辑意识和生成能力。然后直接提取输入文本数据中给定实体的关系，从而改进了 CRE 方法。为了证明所提出框架的有效性，我们对两个特定领域的 CRE 数据集 FinRE 和 SanWen 进行了广泛的实验。实验结果表明，CRE-LLM 具有明显的优越性和鲁棒性，在 FinRE 数据集上实现了最先进的 (SOTA) 性能。本文介绍了一种新的方法来处理特定领域关系提取 (DSCRE) 任务，该方法通过将 LLM 与三元组相结合，在语义上更加复杂。我们的代码是公开的。

Title: Exploring the Robustness of In-Context Learning with Noisy Labels

Authors: Chen Cheng, Xinzhi Yu, Haodong Wen, Jinsong Sun, Guanzhang Yue, Yihao Zhang, Zeming Wei
Subjects: cs.CL, cs.AI, cs.CR, cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2404.18191
Pdf URL: https://arxiv.org/pdf/2404.18191
Copy Paste: [[2404.18191]] Exploring the Robustness of In-Context Learning with Noisy Labels(https://arxiv.org/abs/2404.18191)
Keywords: language model, llm, prompt
Abstract: Recently, the mysterious In-Context Learning (ICL) ability exhibited by Transformer architectures, especially in large language models (LLMs), has sparked significant research interest. However, the resilience of Transformers' in-context learning capabilities in the presence of noisy samples, prevalent in both training corpora and prompt demonstrations, remains underexplored. In this paper, inspired by prior research that studies ICL ability using simple function classes, we take a closer look at this problem by investigating the robustness of Transformers against noisy labels. Specifically, we first conduct a thorough evaluation and analysis of the robustness of Transformers against noisy labels during in-context learning and show that they exhibit notable resilience against diverse types of noise in demonstration labels. Furthermore, we delve deeper into this problem by exploring whether introducing noise into the training set, akin to a form of data augmentation, enhances such robustness during inference, and find that such noise can indeed improve the robustness of ICL. Overall, our fruitful analysis and findings provide a comprehensive understanding of the resilience of Transformer models against label noises during ICL and provide valuable insights into the research on Transformers in natural language processing. Our code is available at https://github.com/InezYu0928/in-context-learning.
摘要：最近，Transformer 架构所展现出的神秘的上下文学习（ICL）能力，尤其是在大型语言模型（LLM）中，引起了人们的极大研究兴趣。然而，在训练语料库和即时演示中普遍存在噪声样本的情况下，变形金刚的上下文学习能力的弹性仍未得到充分探索。在本文中，受到先前使用简单函数类研究 ICL 能力的研究的启发，我们通过研究 Transformer 针对噪声标签的鲁棒性来仔细研究这个问题。具体来说，我们首先对 Transformers 在上下文学习过程中针对噪声标签的鲁棒性进行了彻底的评估和分析，并表明它们在演示标签中对各种类型的噪声表现出显着的弹性。此外，我们通过探索在训练集中引入噪声（类似于数据增强的形式）是否可以增强推理过程中的鲁棒性来更深入地研究这个问题，并发现这种噪声确实可以提高 ICL 的鲁棒性。总体而言，我们富有成效的分析和发现让我们全面了解 Transformer 模型在 ICL 期间对抗标签噪声的能力，并为 Transformer 在自然语言处理中的研究提供了宝贵的见解。我们的代码可在 https://github.com/InezYu0928/in-context-learning 获取。

Title: TextGram: Towards a better domain-adaptive pretraining

Authors: Sharayu Hiwarkhedkar, Saloni Mittal, Vidula Magdum, Omkar Dhekane, Raviraj Joshi, Geetanjali Kale, Arnav Ladkat
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.18228
Pdf URL: https://arxiv.org/pdf/2404.18228
Copy Paste: [[2404.18228]] TextGram: Towards a better domain-adaptive pretraining(https://arxiv.org/abs/2404.18228)
Keywords: language model
Abstract: For green AI, it is crucial to measure and reduce the carbon footprint emitted during the training of large language models. In NLP, performing pre-training on Transformer models requires significant computational resources. This pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks. Thus, it is important that we select the correct data in the form of domain-specific data from this vast corpus to achieve optimum results aligned with our domain-specific tasks. While training on large unsupervised data is expensive, it can be optimized by performing a data selection step before pretraining. Selecting important data reduces the space overhead and the substantial amount of time required to pre-train the model while maintaining constant accuracy. We investigate the existing selection strategies and propose our own domain-adaptive data selection method - TextGram - that effectively selects essential data from large corpora. We compare and evaluate the results of finetuned models for text classification task with and without data selection. We show that the proposed strategy works better compared to other selection methods.
摘要：对于绿色人工智能来说，测量和减少大型语言模型训练过程中排放的碳足迹至关重要。在 NLP 中，对 Transformer 模型进行预训练需要大量的计算资源。这种预训练涉及使用大量文本数据来获取执行下游任务的先验知识。因此，重要的是我们从这个庞大的语料库中以特定领域数据的形式选择正确的数据，以实现与特定领域任务相一致的最佳结果。虽然对大型无监督数据进行训练的成本很高，但可以通过在预训练之前执行数据选择步骤来对其进行优化。选择重要数据可以减少空间开销和预训练模型所需的大量时间，同时保持恒定的准确性。我们研究了现有的选择策略，并提出了我们自己的领域自适应数据选择方法 - TextGram - 该方法可以有效地从大型语料库中选择基本数据。我们比较和评估有和没有数据选择的文本分类任务的微调模型的结果。我们表明，与其他选择方法相比，所提出的策略效果更好。

Title: From Persona to Personalization: A Survey on Role-Playing Language Agents

Authors: Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, Aili Chen, Nianqi Li, Lida Chen, Caiyu Hu, Siye Wu, Scott Ren, Ziquan Fu, Yanghua Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.18231
Pdf URL: https://arxiv.org/pdf/2404.18231
Copy Paste: [[2404.18231]] From Persona to Personalization: A Survey on Role-Playing Language Agents(https://arxiv.org/abs/2404.18231)
Keywords: language model, llm, agent
Abstract: Recent advancements in large language models (LLMs) have significantly boosted the rise of Role-Playing Language Agents (RPLAs), i.e., specialized AI systems designed to simulate assigned personas. By harnessing multiple advanced abilities of LLMs, including in-context learning, instruction following, and social intelligence, RPLAs achieve a remarkable sense of human likeness and vivid role-playing performance. RPLAs can mimic a wide range of personas, ranging from historical figures and fictional characters to real-life individuals. Consequently, they have catalyzed numerous AI applications, such as emotional companions, interactive video games, personalized assistants and copilots, and digital clones. In this paper, we conduct a comprehensive survey of this field, illustrating the evolution and recent progress in RPLAs integrating with cutting-edge LLM technologies. We categorize personas into three types: 1) Demographic Persona, which leverages statistical stereotypes; 2) Character Persona, focused on well-established figures; and 3) Individualized Persona, customized through ongoing user interactions for personalized services. We begin by presenting a comprehensive overview of current methodologies for RPLAs, followed by the details for each persona type, covering corresponding data sourcing, agent construction, and evaluation. Afterward, we discuss the fundamental risks, existing limitations, and future prospects of RPLAs. Additionally, we provide a brief review of RPLAs in AI applications, which reflects practical user demands that shape and drive RPLA research. Through this work, we aim to establish a clear taxonomy of RPLA research and applications, and facilitate future research in this critical and ever-evolving field, and pave the way for a future where humans and RPLAs coexist in harmony.

Title: LEGENT: Open Platform for Embodied Agents

Authors: Zhili Cheng, Zhitong Wang, Jinyi Hu, Shengding Hu, An Liu, Yuge Tu, Pengkai Li, Lei Shi, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18243
Pdf URL: https://arxiv.org/pdf/2404.18243
Copy Paste: [[2404.18243]] LEGENT: Open Platform for Embodied Agents(https://arxiv.org/abs/2404.18243)
Keywords: language model, gpt, llm, agent
Abstract: Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities.

Title: PatentGPT: A Large Language Model for Intellectual Property

Authors: Zilong Bai, Ruiji Zhang, Linqing Chen, Qijun Cai, Yuan Zhong, Cong Wang, Yan Fang, Jie Fang, Jing Sun, Weikuan Wang, Lizhi Zhou, Haoran Hua, Tian Qiu, Chaochao Wang, Cheng Sun, Jianping Lu, Yixin Wang, Yubin Xia, Meng Hu, Haowen Liu, Peng Xu, Licong Xu, Fu Bian, Xiaolong Gu, Lisha Zhang, Weilei Wang, Changyang Tu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.18255
Pdf URL: https://arxiv.org/pdf/2404.18255
Copy Paste: [[2404.18255]] PatentGPT: A Large Language Model for Intellectual Property(https://arxiv.org/abs/2404.18255)
Keywords: language model, gpt, llm, agent
Abstract: In recent years, large language models have attracted significant attention due to their exceptional performance across a multitude of natural language processing tasks, and have been widely applied in various fields. However, the application of large language models in the Intellectual Property (IP) space is challenging due to the strong need for specialized knowledge, privacy protection, and processing of extremely long text in this field. In this technical report, we present for the first time a low-cost, standardized procedure for training IP-oriented LLMs, meeting the unique requirements of the IP domain. Using this standard process, we have trained the PatentGPT series models based on open-source pretrained models. When evaluated on the open-source IP-oriented benchmark MOZIP, our domain-specific LLMs outperform GPT-4, indicating the effectiveness of the proposed training procedure and the expertise of the PatentGPT models in the IP domain. What is impressive is that our model significantly outperformed GPT-4 on the 2019 China Patent Agent Qualification Examination by achieving a score of 65, reaching the level of human experts. Additionally, the PatentGPT model, which utilizes the SMoE architecture, achieves performance comparable to that of GPT-4 in the IP domain and demonstrates a better cost-performance ratio on long-text tasks, potentially serving as an alternative to GPT-4 within the IP domain.

Title: Parameter-Efficient Tuning Large Language Models for Graph Representation Learning

Authors: Qi Zhu, Da Zheng, Xiang Song, Shichang Zhang, Bowen Jin, Yizhou Sun, George Karypis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.18271
Pdf URL: https://arxiv.org/pdf/2404.18271
Copy Paste: [[2404.18271]] Parameter-Efficient Tuning Large Language Models for Graph Representation Learning(https://arxiv.org/abs/2404.18271)
Keywords: language model, llm, prompt
Abstract: Text-rich graphs, which exhibit rich textual information on nodes and edges, are prevalent across a wide range of real-world business applications. Large Language Models (LLMs) have demonstrated remarkable abilities in understanding text, which also introduces the potential for more expressive modeling in text-rich graphs. Despite these capabilities, efficiently applying LLMs to representation learning on graphs presents significant challenges. Recently, parameter-efficient fine-tuning methods for LLMs have enabled efficient new task generalization with minimal time and memory consumption. Inspired by this, we introduce Graph-aware Parameter-Efficient Fine-Tuning (GPEFT), a novel approach for efficient graph representation learning with LLMs on text-rich graphs. Specifically, we utilize a graph neural network (GNN) to encode structural information from neighboring nodes into a graph prompt. This prompt is then inserted at the beginning of the text sequence. To improve the quality of graph prompts, we pre-train the GNN to assist the frozen LLM in predicting the next token in the node text. Compared with existing joint GNN and LM methods, our method directly generates the node embeddings from large language models with an affordable fine-tuning cost. We validate our approach through comprehensive experiments conducted on 8 different text-rich graphs, observing an average improvement of 2% in hit@1 and Mean Reciprocal Rank (MRR) in link prediction evaluations. Our results demonstrate the efficacy and efficiency of our model, showing that it can be smoothly integrated with various large language models, including OPT, LLaMA and Falcon.
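Note on the metrics: the hit@1 and MRR figures quoted in the GPEFT abstract are standard ranking metrics for link prediction. The sketch below shows how they are usually computed from the 1-based rank of the true target for each query. The function names and the sample ranks are illustrative assumptions, not taken from the paper.

```python
def hit_at_k(ranks, k=1):
    """Fraction of queries whose true target is ranked within the top k.

    `ranks` holds the 1-based rank of the correct item for each query.
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (1.0 means always ranked first)."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Illustrative ranks for three link-prediction queries (made-up numbers).
example_ranks = [1, 2, 4]
print(hit_at_k(example_ranks, k=1))
print(mean_reciprocal_rank(example_ranks))
```

Both metrics lie in [0, 1]; hit@1 only rewards exact top-1 matches, while MRR gives partial credit to near misses.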

Title: Bias Neutralization Framework: Measuring Fairness in Large Language Models with Bias Intelligence Quotient (BiQ)

Authors: Malur Narayan, John Pasmore, Elton Sampaio, Vijay Raghavan, Gabriella Waters
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.18276
Pdf URL: https://arxiv.org/pdf/2404.18276
Copy Paste: [[2404.18276]] Bias Neutralization Framework: Measuring Fairness in Large Language Models with Bias Intelligence Quotient (BiQ)(https://arxiv.org/abs/2404.18276)
Keywords: language model, gpt, llm, chat
Abstract: The burgeoning influence of Large Language Models (LLMs) in shaping public discourse and decision-making underscores the imperative to address inherent biases within these AI systems. In the wake of AI's expansive integration across sectors, addressing racial bias in LLMs has never been more critical. This paper introduces a novel framework called Comprehensive Bias Neutralization Framework (CBNF) which embodies an innovative approach to quantifying and mitigating biases within LLMs. Our framework combines the Large Language Model Bias Index (LLMBI) [Oketunji, A., Anas, M., Saina, D. (2023)] and Bias removaL with No Demographics (BLIND) [Orgad, H., Belinkov, Y. (2023)] methodologies to create a new metric called Bias Intelligence Quotient (BiQ), which detects, measures, and mitigates racial bias in LLMs without reliance on demographic annotations. By introducing a new metric called BiQ that enhances LLMBI with additional fairness metrics, CBNF offers a multi-dimensional metric for bias assessment, underscoring the necessity of a nuanced approach to fairness in AI [Mehrabi et al., 2021]. This paper presents a detailed analysis of Latimer AI (a language model incrementally trained on black history and culture) in comparison to ChatGPT 3.5, illustrating Latimer AI's efficacy in detecting racial, cultural, and gender biases through targeted training and refined bias mitigation strategies [Latimer & Bender, 2023].

Title: Comparing LLM prompting with Cross-lingual transfer performance on Indigenous and Low-resource Brazilian Languages

Authors: David Ifeoluwa Adelani, A. Seza Doğruöz, André Coneglian, Atul Kr. Ojha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18286
Pdf URL: https://arxiv.org/pdf/2404.18286
Copy Paste: [[2404.18286]] Comparing LLM prompting with Cross-lingual transfer performance on Indigenous and Low-resource Brazilian Languages(https://arxiv.org/abs/2404.18286)
Keywords: language model, llm, prompt
Abstract: Large Language Models are transforming NLP for a variety of tasks. However, how LLMs perform NLP tasks for low-resource languages (LRLs) is less explored. In line with the goals of the AmericasNLP workshop, we focus on 12 LRLs from Brazil, 2 LRLs from Africa and 2 high-resource languages (HRLs) (e.g., English and Brazilian Portuguese). Our results indicate that the LLMs perform worse for the part of speech (POS) labeling of LRLs in comparison to HRLs. We explain the reasons behind this failure and provide an error analysis through examples observed in our data set.

Title: FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

Authors: Wei Li, Ren Ma, Jiang Wu, Chenya Gu, Jiahui Peng, Jinyang Len, Songyang Zhang, Hang Yan, Dahua Lin, Conghui He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.18359
Pdf URL: https://arxiv.org/pdf/2404.18359
Copy Paste: [[2404.18359]] FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models(https://arxiv.org/abs/2404.18359)
Keywords: language model, llm
Abstract: In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choice questions across common sense and K-12 educational subjects, meticulously curated to reflect the breadth and depth of everyday and academic knowledge. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities. The insights gleaned from FoundaBench evaluations set a new standard for understanding the fundamental knowledge of LLMs, providing a robust framework for future advancements in the field.

Title: QANA: LLM-based Question Generation and Network Analysis for Zero-shot Key Point Analysis and Beyond

Authors: Tomoki Fukuma, Koki Noda, Toshihide Ubukata, Kousuke Hoso, Yoshiharu Ichikawa, Kyosuke Kambe, Yu Masubuch, Fujio Toriumi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18371
Pdf URL: https://arxiv.org/pdf/2404.18371
Copy Paste: [[2404.18371]] QANA: LLM-based Question Generation and Network Analysis for Zero-shot Key Point Analysis and Beyond(https://arxiv.org/abs/2404.18371)
Keywords: language model, llm
Abstract: The proliferation of social media has led to information overload and increased interest in opinion mining. We propose "Question-Answering Network Analysis" (QANA), a novel opinion mining framework that utilizes Large Language Models (LLMs) to generate questions from users' comments, constructs a bipartite graph based on the comments' answerability to the questions, and applies centrality measures to examine the importance of opinions. We investigate the impact of question generation styles, LLM selections, and the choice of embedding model on the quality of the constructed QA networks by comparing them with annotated Key Point Analysis datasets. QANA achieves performance comparable to previous state-of-the-art supervised models in a zero-shot manner on the Key Point Matching task, while also reducing the computational cost from quadratic to linear. For Key Point Generation, questions with high PageRank or degree centrality align well with manually annotated key points. Notably, QANA enables analysts to assess the importance of key points from various aspects according to their selection of centrality measure. QANA's primary contribution lies in its flexibility to extract key points from a wide range of perspectives, which enhances the quality and impartiality of opinion mining.
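Note on the graph step: the QANA abstract describes building a bipartite comment-question graph and ranking questions by centrality. The sketch below is a minimal pure-Python illustration of that idea, assuming the answerability decisions are already available as (comment, question) pairs. The helper names, the toy edges, and the plain power-iteration PageRank are assumptions for illustration; the paper's actual pipeline (LLM question generation, embedding-based answerability) is not reproduced here.

```python
from collections import defaultdict


def build_bipartite(edges):
    """Build an undirected adjacency map from (comment, question) pairs."""
    adj = defaultdict(set)
    for comment, question in edges:
        adj[("c", comment)].add(("q", question))
        adj[("q", question)].add(("c", comment))
    return adj


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def pagerank(adj, damping=0.85, iters=100):
    """Plain power-iteration PageRank on the undirected graph."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            share = damping * rank[v] / len(adj[v])
            for u in adj[v]:
                new[u] += share
        rank = new
    return rank


# Toy answerability edges: comment c3 can answer both questions (made-up data).
edges = [("c1", "qA"), ("c2", "qA"), ("c3", "qA"), ("c3", "qB")]
graph = build_bipartite(edges)
degrees = degree_centrality(graph)
ranks = pagerank(graph)
questions = sorted((n for n in graph if n[0] == "q"), key=lambda n: -ranks[n])
print(questions)
```

Swapping the sort key from `ranks` to `degrees` changes which notion of importance is used, which is the flexibility the abstract highlights.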

Title: Exploring the Limits of Fine-grained LLM-based Physics Inference via Premise Removal Interventions

Authors: Jordan Meadows, Tamsin James, Andre Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18384
Pdf URL: https://arxiv.org/pdf/2404.18384
Copy Paste: [[2404.18384]] Exploring the Limits of Fine-grained LLM-based Physics Inference via Premise Removal Interventions(https://arxiv.org/abs/2404.18384)
Keywords: language model, llm
Abstract: Language models can hallucinate when performing complex and detailed mathematical reasoning. Physics provides a rich domain for assessing mathematical reasoning capabilities where physical context imbues the use of symbols which need to satisfy complex semantics (e.g., units, tensorial order), leading to instances where inference may be algebraically coherent, yet unphysical. In this work, we assess the ability of Language Models (LMs) to perform fine-grained mathematical and physical reasoning using a curated dataset encompassing multiple notations and Physics subdomains. We improve zero-shot scores using synthetic in-context examples, and demonstrate non-linear degradation of derivation quality with perturbation strength via the progressive omission of supporting premises. We find that the models' mathematical reasoning is not physics-informed in this setting, where physical context is predominantly ignored in favour of reverse-engineering solutions.

Title: MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Authors: Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Xiaojiang Peng, Alexander G. Hauptmann
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2404.18398
Pdf URL: https://arxiv.org/pdf/2404.18398
Copy Paste: [[2404.18398]] MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis(https://arxiv.org/abs/2404.18398)
Keywords: prompt
Abstract: Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on the ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
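Note on the metrics: WER and CER, reported in the MM-TTS abstract, are both edit-distance ratios, computed over words and characters respectively. The sketch below shows the standard definition (edit distance divided by reference length). The function names and sample strings are illustrative assumptions; the paper's own evaluation setup, such as the speech recognizer used to transcribe the synthesized audio, is not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word Error Rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character Error Rate over raw characters."""
    return error_rate(list(reference), list(hypothesis))


print(wer("the cat sat on the mat", "the cat sat on mat"))
print(cer("abcd", "abed"))
```

Because the denominator is the reference length, both rates can exceed 1.0 when the hypothesis contains many insertions.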

Title: Mixture-of-Instructions: Comprehensive Alignment of a Large Language Model through the Mixture of Diverse System Prompting Instructions

Authors: Bowen Xu, Shaoyu Wu, Kai Liu, Lulu Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18410
Pdf URL: https://arxiv.org/pdf/2404.18410
Copy Paste: [[2404.18410]] Mixture-of-Instructions: Comprehensive Alignment of a Large Language Model through the Mixture of Diverse System Prompting Instructions(https://arxiv.org/abs/2404.18410)
Keywords: language model, llm, prompt, chat
Abstract: With the proliferation of large language models (LLMs), the comprehensive alignment of such models across multiple tasks has emerged as a critical area of research. Existing alignment methodologies primarily address single tasks, such as multi-turn dialogue, coding, mathematical problem-solving, and tool usage. However, AI-driven products that leverage language models usually necessitate a fusion of these abilities to function effectively in real-world scenarios. Moreover, the considerable computational resources required for proper alignment of LLMs underscore the need for a more robust, efficient, and encompassing approach to multi-task alignment, ensuring improved generative performance. In response to these challenges, we introduce a novel technique termed Mixture-of-Instructions (MoI), which employs a strategy of instruction concatenation combined with diverse system prompts to boost the alignment efficiency of language models. We have also compiled a diverse set of seven benchmark datasets to rigorously evaluate the alignment efficacy of the MoI-enhanced language model. Our methodology was applied to the open-source Qwen-7B-chat model, culminating in the development of Qwen-SFT-MoI. This enhanced model demonstrates significant advancements in generative capabilities across coding, mathematics, and tool use tasks.

Title: BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers

Authors: Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May D. Wang, Joyce C. Ho, Chao Zhang, Carl Yang
Subjects: cs.CL, cs.AI, cs.IR, q-bio.QM
Abstract URL: https://arxiv.org/abs/2404.18443
Pdf URL: https://arxiv.org/pdf/2404.18443
Copy Paste: [[2404.18443]] BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers(https://arxiv.org/abs/2404.18443)
Keywords: language model
Abstract: Developing effective biomedical retrieval models is important for excelling at knowledge-intensive biomedical tasks but still challenging due to the deficiency of sufficient publicly annotated biomedical data and computational resources. We present BMRetriever, a series of dense retrievers for enhancing biomedical retrieval via unsupervised pre-training on large biomedical corpora, followed by instruction fine-tuning on a combination of labeled datasets and synthetic pairs. Experiments on 5 biomedical tasks across 11 datasets verify BMRetriever's efficacy on various biomedical applications. BMRetriever also exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger, and the 2B variant matching the performance of models with over 5B parameters. The training data and model checkpoints are released at https://huggingface.co/BMRetriever to ensure transparency, reproducibility, and application to new domains.

Title: Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language we Prompt them in

Authors: Utkarsh Agarwal, Kumar Tanmay, Aditi Khandelwal, Monojit Choudhury
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.18460
Pdf URL: https://arxiv.org/pdf/2404.18460
Copy Paste: [[2404.18460]] Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language we Prompt them in(https://arxiv.org/abs/2404.18460)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Ethical reasoning is a crucial skill for Large Language Models (LLMs). However, moral values are not universal, but rather influenced by language and culture. This paper explores how three prominent LLMs -- GPT-4, ChatGPT, and Llama2-70B-Chat -- perform ethical reasoning in different languages and whether their moral judgement depends on the language in which they are prompted. We extend the study of ethical reasoning of LLMs by Rao et al. (2023) to a multilingual setup following their framework of probing LLMs with ethical dilemmas and policies from three branches of normative ethics: deontology, virtue, and consequentialism. We experiment with six languages: English, Spanish, Russian, Chinese, Hindi, and Swahili. We find that GPT-4 is the most consistent and unbiased ethical reasoner across languages, while ChatGPT and Llama2-70B-Chat show significant moral value bias when we move to languages other than English. Interestingly, the nature of this bias varies significantly across languages for all LLMs, including GPT-4.

Title: HFT: Half Fine-Tuning for Large Language Models

Authors: Tingfeng Hui, Zhenyu Zhang, Shuohuan Wang, Weiran Xu, Yu Sun, Hua Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18466
Pdf URL: https://arxiv.org/pdf/2404.18466
Copy Paste: [[2404.18466]] HFT: Half Fine-Tuning for Large Language Models(https://arxiv.org/abs/2404.18466)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) in one or more phases has become a necessary step to unlock various capabilities, enabling LLMs to follow natural language instructions or align with human preferences. However, it carries the risk of catastrophic forgetting during sequential training: the parametric knowledge or the ability learned in previous stages may be overwhelmed by incoming training data. In this paper, we find that by regularly resetting partial parameters, LLMs can restore some of the original knowledge. Inspired by this, we introduce Half Fine-Tuning (HFT) for LLMs, as a substitute for full fine-tuning (FFT), to mitigate the forgetting issues, where half of the parameters are selected to learn new tasks while the other half are frozen to retain previous knowledge. We provide a feasibility analysis from the perspective of optimization and interpret the parameter selection operation as a regularization term. Without changing the model architecture, HFT could be seamlessly integrated into existing fine-tuning frameworks. Extensive experiments and analysis on supervised fine-tuning, direct preference optimization, and continual learning consistently demonstrate the effectiveness, robustness, and efficiency of HFT. Compared with FFT, HFT not only significantly alleviates the forgetting problem, but also achieves the best performance in a series of downstream benchmarks, with an approximately 30% reduction in training time.
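Note on the mechanism: the core idea in the HFT abstract is that half of the parameters are updated while the other half stay frozen. The toy sketch below illustrates this with a dictionary of scalar "parameters" and a plain gradient step. Choosing the trainable half uniformly at random, the parameter names, and the helper functions are all assumptions for illustration; the abstract does not say how the paper selects which parameters to freeze.

```python
import random


def split_half(param_names, seed=0):
    """Randomly split parameter names into a trainable half and a frozen half.

    Random selection is an assumption; the paper's selection rule may differ.
    """
    rng = random.Random(seed)
    names = list(param_names)
    rng.shuffle(names)
    half = len(names) // 2
    return set(names[:half]), set(names[half:])


def apply_update(params, grads, trainable, lr=0.1):
    """Take one gradient step on trainable parameters; frozen ones are kept."""
    return {
        name: (value - lr * grads[name] if name in trainable else value)
        for name, value in params.items()
    }


# Hypothetical parameter blocks and gradients (made-up values).
params = {"attn.q": 1.0, "attn.k": 2.0, "mlp.up": 3.0, "mlp.down": 4.0}
grads = {name: 1.0 for name in params}

trainable, frozen = split_half(params)
updated = apply_update(params, grads, trainable)
for name in sorted(params):
    status = "trained" if name in trainable else "frozen"
    print(name, status, params[name], "->", updated[name])
```

In a real training framework the same effect is usually obtained by disabling gradient computation for the frozen tensors, so the model architecture itself is unchanged.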

Title: MileBench: Benchmarking MLLMs in Long Context

Authors: Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.18532
Pdf URL: https://arxiv.org/pdf/2404.18532
Copy Paste: [[2404.18532]] MileBench: Benchmarking MLLMs in Long Context(https://arxiv.org/abs/2404.18532)
Keywords: language model, gpt, llm, long context
Abstract: Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g., time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 20 models, revealed that while the closed-source GPT-4(Vision) and Gemini 1.5 outperform others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen with an increase in the number of images. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.

Title: Evaluating and Mitigating Linguistic Discrimination in Large Language Models

Authors: Guoliang Dong, Haoyu Wang, Jun Sun, Xinyu Wang
Subjects: cs.CL, cs.AI, cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2404.18534
Pdf URL: https://arxiv.org/pdf/2404.18534
Copy Paste: [[2404.18534]] Evaluating and Mitigating Linguistic Discrimination in Large Language Models(https://arxiv.org/abs/2404.18534)
Keywords: language model, gpt, llm
Abstract: By training on text in various languages, large language models (LLMs) typically possess multilingual support and demonstrate remarkable capabilities in solving tasks described in different languages. However, LLMs can exhibit linguistic discrimination due to the uneven distribution of training data across languages. That is, LLMs struggle to respond consistently when the same task is described in different languages. In this study, we first explore the consistency in the LLMs' outputs responding to queries in various languages from two aspects: safety and quality. We conduct this analysis with two datasets (AdvBench and NQ) based on four LLMs (Llama2-13b, Gemma-7b, GPT-3.5-turbo and Gemini-pro). The results show that LLMs exhibit stronger human alignment capabilities with queries in English, French, Russian, and Spanish (only 1.04% of harmful queries successfully jailbreak on average) compared to queries in Bengali, Georgian, Nepali and Maithili (27.7% of harmful queries jailbreak successfully on average). Moreover, for queries in English, Danish, Czech and Slovenian, LLMs tend to produce responses with a higher quality (an average F1 score of 0.1494) compared to the other languages. Based on these findings, we propose LDFighter, a similarity-based voting method, to mitigate the linguistic discrimination in LLMs. LDFighter ensures consistent service for different language speakers. We evaluate LDFighter with both benign queries and harmful queries. The results show that LDFighter not only significantly reduces the jailbreak success rate but also improves the response quality on average, demonstrating its effectiveness.
摘要：通过对各种语言的文本进行训练，大语言模型（LLM）通常拥有多语言支持，并在解决用不同语言描述的任务方面表现出卓越的能力。然而，由于不同语言的训练数据分布不均匀，法学硕士可能会表现出语言歧视。也就是说，法学硕士在面对相同的任务但用不同的语言描述时很难保持回答的一致性。在本研究中，我们首先从安全性和质量两个方面探讨法学硕士在响应各种语言查询时输出的一致性。我们使用基于四个 LLM（Llama2-13b、Gemma-7b、GPT-3.5-turbo 和 Gemini-pro）的两个数据集（AdvBench 和 NQ）进行此分析。结果表明，与孟加拉语、格鲁吉亚语、尼泊尔语和迈蒂利语查询（平均只有 1.04% 的有害查询成功越狱）相比，法学硕士在英语、法语、俄语和西班牙语查询方面表现出更强的人类对齐能力（27.7% 的有害查询成功越狱）。有害查询平均越狱成功）。此外，对于英语、丹麦语、捷克语和斯洛文尼亚语的查询，与其他语言相比，法学硕士往往会给出更高质量的答复（平均得分为 0.1494 $F_1$）。根据这些发现，我们提出了 LDFighter，一种基于相似性的投票，以减轻法学硕士中的语言歧视。 LDFighter 确保为不同语言使用者提供一致的服务。我们使用良性查询和有害查询来评估 LDFighter。结果表明，LDFighter不仅显着降低了越狱成功率，而且平均提高了响应质量，证明了其有效性。
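
Sketch: the similarity-based voting idea described for LDFighter can be illustrated roughly as below. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the same query has already been answered in several languages and that each answer has been translated into one pivot language, and it uses token-level Jaccard overlap as a stand-in for whatever similarity measure the paper uses. The names `jaccard` and `vote_by_similarity` are hypothetical.

```python
from typing import Dict


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for any text similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def vote_by_similarity(responses: Dict[str, str]) -> str:
    """Return the language key whose response is, on average, most similar to the others."""
    if not responses:
        raise ValueError("need at least one response")
    if len(responses) == 1:
        return next(iter(responses))

    def mean_similarity(lang: str) -> float:
        others = [r for k, r in responses.items() if k != lang]
        return sum(jaccard(responses[lang], r) for r in others) / len(others)

    return max(responses, key=mean_similarity)


# Hypothetical responses, already translated into one pivot language.
example = {
    "en": "I cannot help with that request",
    "fr": "I cannot help with this request",
    "bn": "Sure here are the steps",
}
winner = vote_by_similarity(example)
```

The outlying response (here the one keyed `bn`) has low average similarity to the rest, so it is not selected; the intuition is that a jailbroken or low-quality answer in one language is outvoted by the consistent answers from the others.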

Title: Time Machine GPT

Authors: Felix Drinkall, Eghbal Rahimikia, Janet B. Pierrehumbert, Stefan Zohren
Subjects: cs.CL, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2404.18543
Pdf URL: https://arxiv.org/pdf/2404.18543
Copy Paste: [[2404.18543]] Time Machine GPT(https://arxiv.org/abs/2404.18543)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are often trained on extensive, temporally indiscriminate text corpora, reflecting the lack of datasets with temporal metadata. This approach is not aligned with the evolving nature of language. Conventional methods for creating temporally adapted language models often depend on further pre-training static models on time-specific data. This paper presents a new approach: a series of point-in-time LLMs called Time Machine GPT (TiMaGPT), specifically designed to be nonprognosticative. This ensures they remain uninformed about future factual information and linguistic changes. This strategy is beneficial for understanding language evolution and is of critical importance when applying models in dynamic contexts, such as time-series forecasting, where foresight of future information can prove problematic. We provide access to both the models and training datasets.
摘要：大型语言模型 (LLM) 通常在广泛的、时间上不加区分的文本语料库上进行训练，这反映出缺乏具有时间元数据的数据集。这种方法不符合语言不断发展的本质。创建时间适应语言模型的传统方法通常依赖于针对特定时间数据的进一步预训练静态模型。本文提出了一种新方法：一系列称为时间机器 GPT (TiMaGPT) 的时间点 LLM，专门设计用于非预测性。这确保他们不了解未来的事实信息和语言变化。这种策略有利于理解语言的演化，并且在动态环境中应用模型时至关重要，例如时间序列预测，其中对未来信息的预见可能会出现问题。我们提供对模型和训练数据集的访问。

Title: Can GPT-4 do L2 analytic assessment?

Authors: Stefano Bannò, Hari Krishna Vydana, Kate M. Knill, Mark J. F. Gales
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18557
Pdf URL: https://arxiv.org/pdf/2404.18557
Copy Paste: [[2404.18557]] Can GPT-4 do L2 analytic assessment?(https://arxiv.org/abs/2404.18557)
Keywords: language model, gpt
Abstract: Automated essay scoring (AES) to evaluate second language (L2) proficiency has been a firmly established technology used in educational contexts for decades. Although holistic scoring has seen advancements in AES that match or even exceed human performance, analytic scoring still encounters issues as it inherits flaws and shortcomings from the human scoring process. The recent introduction of large language models presents new opportunities for automating the evaluation of specific aspects of L2 writing proficiency. In this paper, we perform a series of experiments using GPT-4 in a zero-shot fashion on a publicly available dataset annotated with holistic scores based on the Common European Framework of Reference and aim to extract detailed information about their underlying analytic components. We observe significant correlations between the automatically predicted analytic scores and multiple features associated with the individual proficiency components.
摘要：用于评估第二语言 (L2) 熟练程度的自动作文评分 (AES) 已成为教育环境中使用了数十年的成熟技术。尽管 AES 的整体评分已经取得了与人类表现相当甚至超过人类表现的进步，但分析评分仍然遇到问题，因为它继承了人类评分过程的缺陷和缺点。最近引入的大型语言模型为自动化评估第二语言写作能力的特定方面提供了新的机会。在本文中，我们使用 GPT-4 以零样本方式在公开数据集上进行了一系列实验，该数据集基于欧洲共同参考框架标注了整体分数，旨在提取有关其底层分析组件的详细信息。我们观察到自动预测的分析分数和与个人熟练程度相关的多个特征之间存在显着相关性。

Title: Injecting Salesperson's Dialogue Strategies in Large Language Models with Chain-of-Thought Reasoning

Authors: Wen-Yu Chang, Yun-Nung Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.18564
Pdf URL: https://arxiv.org/pdf/2404.18564
Copy Paste: [[2404.18564]] Injecting Salesperson's Dialogue Strategies in Large Language Models with Chain-of-Thought Reasoning(https://arxiv.org/abs/2404.18564)
Keywords: language model, llm, prompt, chat, chain-of-thought, agent
Abstract: Recent research in dialogue systems and corpora has focused on two main categories: task-oriented (TOD) and open-domain (chit-chat) dialogues. TOD systems help users accomplish specific tasks, while open-domain systems aim to create engaging conversations. However, in real-world scenarios, user intents are often revealed during interactions. A recent study introduced SalesBot, which simulates dialogues transitioning from chit-chat to task-oriented scenarios to train sales agents. Unfortunately, the initial data lacked smooth transitions and coherent long-turn dialogues, resulting in poor naturalness in sales-customer interactions. To address these issues, this paper presents SalesBot 2.0, an improved dataset. It leverages commonsense knowledge from large language models (LLMs) through strategic prompting. Additionally, we introduce a novel model called SalesAgent, trained on salesperson's interactions, using chain-of-thought (CoT) reasoning. This model excels in transitioning topics, understanding user intents, and selecting appropriate strategies. Experiments using diverse user simulations validate the effectiveness of our method in controlling dialogue strategies in LLMs. Furthermore, SalesBot 2.0 enhances coherence and reduces aggression, facilitating better model learning for sales-customer interactions.
摘要：对话系统和语料库的最新研究主要集中在两个主要类别：面向任务（TOD）和开放域（闲聊）对话。 TOD 系统帮助用户完成特定任务，而开放域系统旨在创建引人入胜的对话。然而，在现实场景中，用户意图经常在交互过程中显露出来。最近的一项研究介绍了 SalesBot，它可以模拟从闲聊到任务导向场景的对话，以培训销售代理。不幸的是，最初的数据缺乏平滑的过渡和连贯的长轮对话，导致销售与客户互动的自然性很差。为了解决这些问题，本文提出了 SalesBot 2.0，这是一个改进的数据集。它通过战略提示利用来自大型语言模型 (LLM) 的常识知识。此外，我们还引入了一种名为 SalesAgent 的新颖模型，该模型使用思维链 (CoT) 推理对销售人员的交互进行训练。该模型擅长转换主题、理解用户意图和选择适当的策略。使用不同用户模拟的实验验证了我们的方法在控制法学硕士对话策略方面的有效性。此外，SalesBot 2.0 增强了连贯性并减少了攻击性，从而促进了更好的销售与客户交互模型学习。

Title: Analyzing Semantic Change through Lexical Replacements

Authors: Francesco Periti, Pierluigi Cassotti, Haim Dubossarsky, Nina Tahmasebi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18570
Pdf URL: https://arxiv.org/pdf/2404.18570
Copy Paste: [[2404.18570]] Analyzing Semantic Change through Lexical Replacements(https://arxiv.org/abs/2404.18570)
Keywords: language model
Abstract: Modern language models are capable of contextualizing words based on their surrounding context. However, this capability is often compromised due to semantic change that leads to words being used in new, unexpected contexts not encountered during pre-training. In this paper, we model semantic change by studying the effect of unexpected contexts introduced by lexical replacements. We propose a replacement schema where a target word is substituted with lexical replacements of varying relatedness, thus simulating different kinds of semantic change. Furthermore, we leverage the replacement schema as a basis for a novel interpretable model for semantic change. We are also the first to evaluate the use of LLaMa for semantic change detection.
摘要：现代语言模型能够根据周围的上下文将单词置于上下文中。然而，由于语义变化导致单词被用在预训练期间未遇到的新的、意外的上下文中，这种能力常常受到损害。在本文中，我们通过研究 \textit{词汇替换} 引入的意外上下文的影响来对 \textit{语义变化} 进行建模。我们提出了一个 \textit{替换模式}，其中目标单词被不同相关性的词汇替换所替换，从而模拟不同类型的语义变化。此外，我们利用替换模式作为语义更改的新颖 \textit{interpretable} 模型的基础。我们也是第一个评估使用 LLaMa 进行语义变化检测的人。

Title: Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?

Authors: Letitia Parcalabescu, Anette Frank
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.18624
Pdf URL: https://arxiv.org/pdf/2404.18624
Copy Paste: [[2404.18624]] Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?(https://arxiv.org/abs/2404.18624)
Keywords: language model, llm
Abstract: Vision and language models (VLMs) are currently the most generally performant architectures on multimodal tasks. Next to their predictions, they can also produce explanations, either in post-hoc or CoT settings. However, it is not clear how much they use the vision and text modalities when generating predictions or explanations. In this work, we investigate if VLMs rely on modalities differently when generating explanations as opposed to when they provide answers. We also evaluate the self-consistency of VLM decoders in both post-hoc and CoT explanation settings, by extending existing tests and measures to VLM decoders. We find that VLMs are less self-consistent than LLMs. The text contributions in VL decoders are much larger than the image contributions across all measured tasks. The contributions of the image are also significantly larger for explanation generation than for answer generation. This difference is even larger in CoT compared to the post-hoc explanation setting. We also provide an up-to-date benchmarking of state-of-the-art VL decoders on the VALSE benchmark, which to date focused only on VL encoders. We find that VL decoders are still struggling with most phenomena tested by VALSE.
摘要：视觉和语言模型（VLM）是目前多模态任务中性能最普遍的架构。除了预测之外，他们还可以在事后或 CoT 环境中做出解释。然而，尚不清楚他们在生成预测或解释时使用了多少视觉和文本模式。在这项工作中，我们研究了 VLM 在生成解释时与提供答案时是否以不同的方式依赖模式。我们还通过将现有测试和措施扩展到 VLM 解码器，评估了 VLM 解码器在事后和 CoT 解释设置中的自我一致性。我们发现 VLM 的自我一致性不如 LLM。在所有测量任务中，VL 解码器中的文本贡献远大于图像贡献。图像对解释生成的贡献明显大于对答案生成的贡献。与事后解释设置相比，CoT 中的这种差异甚至更大。我们还在 VALSE 基准测试中提供最先进的 VL 解码器的最新基准测试，该基准测试迄今为止仅专注于 VL 编码器。我们发现 VL 解码器仍然难以应对 VALSE 测试的大多数现象。

Title: Revealing the Parametric Knowledge of Language Models: A Unified Framework for Attribution Methods

Authors: Haeun Yu, Pepa Atanasova, Isabelle Augenstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.18655
Pdf URL: https://arxiv.org/pdf/2404.18655
Copy Paste: [[2404.18655]] Revealing the Parametric Knowledge of Language Models: A Unified Framework for Attribution Methods(https://arxiv.org/abs/2404.18655)
Keywords: language model
Abstract: Language Models (LMs) acquire parametric knowledge from their training process, embedding it within their weights. The increasing scalability of LMs, however, poses significant challenges for understanding a model's inner workings and further for updating or correcting this embedded knowledge without the significant cost of retraining. This underscores the importance of unveiling exactly what knowledge is stored and its association with specific model components. Instance Attribution (IA) and Neuron Attribution (NA) offer insights into this training-acquired knowledge, though they have not been compared systematically. Our study introduces a novel evaluation framework to quantify and compare the knowledge revealed by IA and NA. To align the results of the methods we introduce the attribution method NA-Instances to apply NA for retrieving influential training instances, and IA-Neurons to discover important neurons of influential instances discovered by IA. We further propose a comprehensive list of faithfulness tests to evaluate the comprehensiveness and sufficiency of the explanations provided by both methods. Through extensive experiments and analysis, we demonstrate that NA generally reveals more diverse and comprehensive information regarding the LM's parametric knowledge compared to IA. Nevertheless, IA provides unique and valuable insights into the LM's parametric knowledge, which are not revealed by NA. Our findings further suggest the potential of a synergistic approach of combining the diverse findings of IA and NA for a more holistic understanding of an LM's parametric knowledge.
摘要：语言模型 (LM) 从训练过程中获取参数知识，并将其嵌入到权重中。然而，LM 不断增强的可扩展性给理解模型的内部工作原理以及进一步更新或纠正这种嵌入式知识带来了重大挑战，而无需花费大量的再训练成本。这强调了准确揭示存储的知识及其与特定模型组件的关联的重要性。实例归因（IA）和神经元归因（NA）提供了对这种训练获得的知识的见解，尽管它们尚未进行系统比较。我们的研究引入了一种新颖的评估框架来量化和比较 IA 和 NA 揭示的知识。为了对齐方法的结果，我们引入了归因方法 NA-Instances 以应用 NA 来检索有影响力的训练实例，并引入 IA-Neurons 来发现 IA 发现的有影响力实例的重要神经元。我们进一步提出了忠实度测试的综合列表，以评估两种方法提供的解释的全面性和充分性。通过大量的实验和分析，我们证明，与 IA 相比，NA 通常揭示了关于 LM 参数知识的更多样化和更全面的信息。尽管如此，IA 为 LM 的参数知识提供了独特且有价值的见解，这是 NA 所没有透露的。我们的研究结果进一步表明，结合 IA 和 NA 的不同发现的协同方法的潜力，可以更全面地理解 LM 的参数知识。

Title: Where on Earth Do Users Say They Are?: Geo-Entity Linking for Noisy Multilingual User Input

Authors: Tessa Masis, Brendan O'Connor
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.18784
Pdf URL: https://arxiv.org/pdf/2404.18784
Copy Paste: [[2404.18784]] Where on Earth Do Users Say They Are?: Geo-Entity Linking for Noisy Multilingual User Input(https://arxiv.org/abs/2404.18784)
Keywords: llm
Abstract: Geo-entity linking is the task of linking a location mention to the real-world geographic location. In this paper we explore the challenging task of geo-entity linking for noisy, multilingual social media data. There are few open-source multilingual geo-entity linking tools available and existing ones are often rule-based, which break easily in social media settings, or LLM-based, which are too expensive for large-scale datasets. We present a method which represents real-world locations as averaged embeddings from labeled user-input location names and allows for selective prediction via an interpretable confidence score. We show that our approach improves geo-entity linking on a global and multilingual social media dataset, and discuss progress and problems with evaluating at different geographic granularities.
摘要：地理实体链接是将位置提及链接到现实世界地理位置的任务。在本文中，我们探讨了噪声、多语言社交媒体数据的地理实体链接这一具有挑战性的任务。可用的开源多语言地理实体链接工具很少，现有的工具通常是基于规则的，在社交媒体设置中很容易崩溃，或者基于法学硕士，这对于大规模数据集来说过于昂贵。我们提出了一种方法，它将现实世界的位置表示为标记的用户输入位置名称的平均嵌入，并允许通过可解释的置信度分数进行选择性预测。我们展示了我们的方法改进了全球和多语言社交媒体数据集上的地理实体链接，并讨论了在不同地理粒度上进行评估的进展和问题。
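
Sketch: the idea of representing each location as an averaged embedding of its labeled user-input names, with a confidence score that allows abstaining, might look like the following. This is only an illustration under assumptions: the character-trigram `embed` function is a toy stand-in for the paper's encoder, cosine similarity to the best prototype is used as the confidence score, and the threshold value and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

Vector = Dict[str, float]


def embed(text: str) -> Vector:
    """Toy embedding: L2-normalised character-trigram counts (not the paper's encoder)."""
    padded = f"  {text.lower().strip()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {g: c / norm for g, c in counts.items()}


def average(vectors: List[Vector]) -> Vector:
    """Element-wise mean of sparse vectors."""
    total: Counter = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds values key by key
    return {g: x / len(vectors) for g, x in total.items()}


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * b.get(g, 0.0) for g, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_prototypes(labeled: Dict[str, List[str]]) -> Dict[str, Vector]:
    """One averaged embedding per location, built from its labeled name variants."""
    return {loc: average([embed(n) for n in names]) for loc, names in labeled.items()}


def link(mention: str, prototypes: Dict[str, Vector],
         threshold: float = 0.5) -> Tuple[Optional[str], float]:
    """Return (location, confidence); location is None when confidence is below threshold."""
    query = embed(mention)
    scored = sorted(((cosine(query, p), loc) for loc, p in prototypes.items()), reverse=True)
    best_score, best_loc = scored[0]
    return (best_loc if best_score >= threshold else None), best_score


# Hypothetical labeled name variants.
prototypes = build_prototypes({
    "Paris, FR": ["paris", "paris france", "paname"],
    "London, GB": ["london", "london uk", "ldn"],
})
prediction, confidence = link("paris", prototypes, threshold=0.0)
```

Raising `threshold` trades coverage for precision: mentions that resemble no known location name are left unlinked rather than forced onto the nearest prototype.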

Title: Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Authors: Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.18796
Pdf URL: https://arxiv.org/pdf/2404.18796
Copy Paste: [[2404.18796]] Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models(https://arxiv.org/abs/2404.18796)
Keywords: language model, gpt, llm
Abstract: As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT4. While this method has grown in popularity, it is costly and has been shown to introduce intra-model bias, and in this work we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
摘要：随着大型语言模型 (LLM) 变得更加先进，它们已经超出了我们准确评估其质量的能力。不仅找到数据来充分探测特定模型属性很困难，而且单独评估模型自由形式生成的正确性也是一个挑战。为了解决这个问题，许多评估现在依靠法学硕士本身作为评委来对其他法学硕士的输出质量进行评分。评估最常使用单个大型模型，例如 GPT4。虽然这种方法越来越受欢迎，但成本高昂，并且已被证明会引入模型内偏差，并且在这项工作中，我们发现非常大的模型通常是不必要的。我们建议使用法学硕士评估小组（PoLL）来评估模型。在三个不同的法官设置和跨越六个不同数据集的情况下，我们发现使用由大量较小模型组成的 PoLL 优于单个大型法官，并且由于其由不相交的模型系列组成而表现出较少的模型内偏差，并且在价格便宜七倍多。
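
Sketch: pooling the verdicts of several smaller judges, as the PoLL abstract describes, could be illustrated as follows. This is a minimal sketch, not the paper's code: the `Judge` callable interface, the two pooling options (majority vote and mean) and the tie-breaking rule are assumptions made for illustration, and the lambda judges are stubs standing in for real model calls.

```python
from collections import Counter
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical interface: (question, answer) -> integer score from one judge model.
Judge = Callable[[str, str], int]


def majority_vote(votes: Sequence[int]) -> int:
    """Most common verdict; ties go to the lowest score (an arbitrary, conservative choice)."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)


def panel_verdict(question: str, answer: str, judges: List[Judge],
                  pooling: str = "vote") -> float:
    """Collect one score per judge and pool them into a single panel score."""
    if not judges:
        raise ValueError("the panel needs at least one judge")
    votes = [judge(question, answer) for judge in judges]
    if pooling == "vote":
        return float(majority_vote(votes))
    if pooling == "mean":
        return mean(votes)
    raise ValueError(f"unknown pooling: {pooling}")


# Stub judges standing in for calls to models from different families.
panel: List[Judge] = [
    lambda q, a: 1,
    lambda q, a: 1,
    lambda q, a: 0,
]
voted = panel_verdict("Q?", "A.", panel, pooling="vote")
averaged = panel_verdict("Q?", "A.", panel, pooling="mean")
```

Drawing the judges from disjoint model families is what the abstract credits for the reduced intra-model bias; the pooling function itself is deliberately simple.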

Title: Unknown Script: Impact of Script on Cross-Lingual Transfer

Authors: Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18810
Pdf URL: https://arxiv.org/pdf/2404.18810
Copy Paste: [[2404.18810]] Unknown Script: Impact of Script on Cross-Lingual Transfer(https://arxiv.org/abs/2404.18810)
Keywords: language model
Abstract: Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often-overlooked aspect in this domain: the influence of the source language of the base language model on transfer performance. We conduct a series of experiments to determine the effect of the script and tokenizer used in the pre-trained model on the performance of the downstream task. Our findings reveal the importance of the tokenizer as a stronger factor than the sharing of the script, the language typology match, and the model size.
摘要：跨语言迁移已成为语言间知识迁移的有效方式。在本文中，我们探讨了该领域中经常被忽视的一个方面：基础语言模型的源语言对传输性能的影响。我们进行了一系列实验，以确定预训练模型中使用的脚本和分词器对下游任务性能的影响。我们的研究结果揭示了分词器的重要性，它是比脚本共享、语言类型匹配和模型大小更重要的因素。

Title: Benchmarking Benchmark Leakage in Large Language Models

Authors: Ruijie Xu, Zengzhi Wang, Run-Ze Fan, Pengfei Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.18824
Pdf URL: https://arxiv.org/pdf/2404.18824
Copy Paste: [[2404.18824]] Benchmarking Benchmark Leakage in Large Language Models(https://arxiv.org/abs/2404.18824)
Keywords: language model, llm, prompt
Abstract: Amid the expanding use of pre-training data, the phenomenon of benchmark dataset leakage has become increasingly prominent, exacerbated by opaque training processes and the often undisclosed inclusion of supervised data in contemporary Large Language Models (LLMs). This issue skews benchmark effectiveness and fosters potentially unfair comparisons, impeding the field's healthy development. To address this, we introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark, to identify potential data leakages. By analyzing 31 LLMs in the context of mathematical reasoning, we reveal substantial instances of training set and even test set misuse, resulting in potentially unfair comparisons. These findings prompt us to offer several recommendations regarding model documentation, benchmark setup, and future evaluations. Notably, we propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization, promoting transparency and healthy development of LLMs. We have made our leaderboard, pipeline implementation, and model predictions publicly available, fostering future research.
摘要：随着预训练数据的使用范围不断扩大，基准数据集泄露现象也变得越来越突出，不透明的训练过程和当代大型语言模型 (LLM) 中通常未公开的监督数据包含情况加剧了这一现象。这一问题扭曲了基准有效性并助长了潜在的不公平比较，阻碍了该领域的健康发展。为了解决这个问题，我们引入了一个检测管道，利用困惑度和 N-gram 准确度这两个简单且可扩展的指标来衡量模型在基准上的预测精度，以识别潜在的数据泄露。通过在数学推理的背景下分析 31 个 LLM，我们发现了大量训练甚至测试集滥用的情况，导致潜在的不公平比较。这些发现促使我们针对模型文档、基准设置和未来评估提出了一些建议。值得注意的是，我们提出了“基准透明度卡”，以鼓励清晰地记录基准使用情况，促进 LLM 的透明度和健康发展。我们已经公开了我们的排行榜、管道实施和模型预测，以促进未来的研究。
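
Sketch: the two leakage signals named in the abstract, perplexity and N-gram accuracy, can be computed roughly as below. This is an illustrative sketch under assumptions rather than the authors' pipeline: perplexity is taken from per-token natural-log probabilities that some model has already produced, and N-gram accuracy is measured at evenly spaced start positions using a hypothetical `generate(prefix, n)` callable that returns the model's next `n` tokens.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: (prefix tokens, n) -> the model's next n predicted tokens.
Generator = Callable[[List[str], int], List[str]]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean log-probability (natural log) of the observed tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ngram_accuracy(tokens: List[str], generate: Generator,
                   n: int = 5, num_starts: int = 5) -> float:
    """Fraction of start positions where the model reproduces the next n tokens exactly."""
    if n < 1 or num_starts < 1:
        raise ValueError("n and num_starts must be positive")
    if len(tokens) <= n:
        raise ValueError("sample is too short for the requested n")
    last_start = len(tokens) - n
    # Evenly spaced start positions, each leaving at least one prefix token.
    starts = sorted({max(1, round(i * last_start / num_starts))
                     for i in range(1, num_starts + 1)})
    hits = sum(generate(tokens[:s], n) == tokens[s:s + n] for s in starts)
    return hits / len(starts)


# Toy benchmark sample and two stub "models".
sample = "if x plus 3 equals 7 then x equals 4 so the answer is 4".split()
memorised = lambda prefix, n: sample[len(prefix):len(prefix) + n]  # has seen the sample
clueless = lambda prefix, n: ["?"] * n                             # has not

leaked_score = ngram_accuracy(sample, memorised)
clean_score = ngram_accuracy(sample, clueless)
```

An unusually low perplexity or an unusually high N-gram accuracy on benchmark text, relative to comparable unseen text, is the kind of signal that would suggest the benchmark appeared in training data.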

Title: It's Difficult to be Neutral -- Human and LLM-based Sentiment Annotation of Patient Comments

Authors: Petter Mæhlum, David Samuel, Rebecka Maria Norman, Elma Jelin, Øyvind Andresen Bjertnæs, Lilja Øvrelid, Erik Velldal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18832
Pdf URL: https://arxiv.org/pdf/2404.18832
Copy Paste: [[2404.18832]] It's Difficult to be Neutral -- Human and LLM-based Sentiment Annotation of Patient Comments(https://arxiv.org/abs/2404.18832)
Keywords: language model, llm, prompt
Abstract: Sentiment analysis is an important tool for aggregating patient voices, in order to provide targeted improvements in healthcare services. A prerequisite for this is the availability of in-domain data annotated for sentiment. This article documents an effort to add sentiment annotations to free-text comments in patient surveys collected by the Norwegian Institute of Public Health (NIPH). However, annotation can be a time-consuming and resource-intensive process, particularly when it requires domain expertise. We therefore also evaluate a possible alternative to human annotation, using large language models (LLMs) as annotators. We perform an extensive evaluation of the approach for two openly available pretrained LLMs for Norwegian, experimenting with different configurations of prompts and in-context learning, comparing their performance to human annotators. We find that even for zero-shot runs, models perform well above the baseline for binary sentiment, but still cannot compete with human annotators on the full dataset.
摘要：情绪分析是聚合患者声音的重要工具，以便有针对性地改进医疗服务。实现这一点的先决条件是可以获得带有情绪注释的域内数据。本文记录了挪威公共卫生研究所 (NIPH) 收集的患者调查中为自由文本评论添加情感注释的努力。然而，注释可能是一个耗时且资源密集的过程，特别是当它需要领域专业知识时。因此，我们还评估了人工注释的可能替代方案，即使用大型语言模型（LLM）作为注释器。我们对两个公开可用的挪威语预训练法学硕士的方法进行了广泛的评估，尝试了不同的提示配置和上下文学习，将它们的性能与人类注释器进行了比较。我们发现，即使对于零样本运行，模型的表现也远高于二元情感的基线，但仍然无法与完整数据集上的人类注释者竞争。

Title: Truth-value judgment in language models: belief directions are context sensitive

Authors: Stefan F. Schouten, Peter Bloem, Ilia Markov, Piek Vossen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18865
Pdf URL: https://arxiv.org/pdf/2404.18865
Copy Paste: [[2404.18865]] Truth-value judgment in language models: belief directions are context sensitive(https://arxiv.org/abs/2404.18865)
Keywords: language model, llm
Abstract: Recent work has demonstrated that the latent spaces of large language models (LLMs) contain directions predictive of the truth of sentences. Multiple methods recover such directions and build probes that are described as getting at a model's "knowledge" or "beliefs". We investigate this phenomenon, looking closely at the impact of context on the probes. Our experiments establish where in the LLM the probe's predictions can be described as being conditional on the preceding (related) sentences. Specifically, we quantify the responsiveness of the probes to the presence of (negated) supporting and contradicting sentences, and score the probes on their consistency. We also perform a causal intervention experiment, investigating whether moving the representation of a premise along these belief directions influences the position of the hypothesis along that same direction. We find that the probes we test are generally context sensitive, but that contexts which should not affect the truth often still impact the probe outputs. Our experiments show that the types of errors depend on the layer, the (type of) model, and the kind of data. Finally, our results suggest that belief directions are (one of the) causal mediators in the inference process that incorporates in-context information.
摘要：最近的工作表明，大型语言模型（LLM）的潜在空间包含预测句子真实性的方向。多种方法可以恢复这些方向并构建被描述为获取模型的“知识”或“信念”的探针。我们研究了这种现象，仔细观察环境对探测器的影响。我们的实验确定了法学硕士中探针的预测可以被描述为以前面的（相关）句子为条件。具体来说，我们量化探针对（否定）支持和矛盾句子的存在的响应性，并对探针的一致性进行评分。我们还进行了一项因果干预实验，研究沿着这些信念方向移动前提的表示是否会影响沿着同一方向的假设的位置。我们发现我们测试的探针通常是上下文敏感的，但是不应该影响事实的上下文通常仍然会影响探针的输出。我们的实验表明，错误的类型取决于层、模型（类型）和数据类型。最后，我们的结果表明，信念方向是包含上下文信息的推理过程中的（其中之一）因果中介。

Title: More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness

Authors: Aaron J. Li, Satyapriya Krishna, Himabindu Lakkaraju
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.18870
Pdf URL: https://arxiv.org/pdf/2404.18870
Copy Paste: [[2404.18870]] More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness(https://arxiv.org/abs/2404.18870)
Keywords: language model, llm
Abstract: The surge in Large Language Models (LLMs) development has led to improved performance on cognitive tasks as well as an urgent need to align these models with human values in order to safely exploit their power. Despite the effectiveness of preference learning algorithms like Reinforcement Learning From Human Feedback (RLHF) in aligning models with human preferences, their assumed improvements on model trustworthiness have not been thoroughly verified. Toward this end, this study investigates how models that have been aligned with general-purpose preference data on helpfulness and harmlessness perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. For model alignment, we focus on three widely used RLHF variants: Supervised Finetuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Through extensive empirical investigations, we discover that the improvement in trustworthiness by RLHF is far from guaranteed, and there exists a complex interplay between preference data, alignment algorithms, and specific trustworthiness aspects. Together, our results underscore the need for more nuanced approaches for model alignment. By shedding light on the intricate dynamics of these components within model alignment, we hope this research will guide the community towards developing language models that are both capable and trustworthy.
摘要：大型语言模型 (LLM) 开发的激增提高了认知任务的性能，并且迫切需要使这些模型与人类价值观保持一致，以便安全地利用其力量。尽管人类反馈强化学习 (RLHF) 等偏好学习算法在调整人类偏好方面非常有效，但它们对模型可信度的假设改进尚未得到彻底证实。为此，本研究调查了与有用性和无害性的通用偏好数据相一致的模型如何在五个可信度垂直领域表现：毒性、刻板偏见、机器道德、真实性和隐私。对于模型对齐，我们重点关注三种广泛使用的 RLHF 变体：监督微调 (SFT)、近端策略优化 (PPO) 和直接偏好优化 (DPO)。通过广泛的实证研究，我们发现 RLHF 对可信度的提高远不能得到保证，并且偏好数据、对齐算法和特定可信度方面之间存在复杂的相互作用。总之，我们的结果强调需要更细致的模型对齐方法。通过揭示模型对齐中这些组件的复杂动态，我们希望这项研究能够指导社区开发既强大又值得信赖的语言模型。

Title: Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Authors: Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.18911
Pdf URL: https://arxiv.org/pdf/2404.18911
Copy Paste: [[2404.18911]] Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting(https://arxiv.org/abs/2404.18911)
Keywords: language model
Abstract: Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models while maintaining a consistent sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly. Drawing inspiration from early exiting, we propose a novel self-speculative decoding framework Kangaroo, which uses a fixed shallow sub-network as a self-draft model, with the remaining layers serving as the larger target model. We train a lightweight and efficient adapter module on top of the sub-network to bridge the gap between the sub-network and the full model's representation ability. It is noteworthy that the inference latency of the self-draft model may no longer be negligible compared to the large model, necessitating strategies to increase the token acceptance rate while minimizing the drafting steps of the small model. To address this challenge, we introduce an additional early exiting mechanism for generating draft tokens. Specifically, we halt the small model's subsequent prediction during the drafting phase once the confidence level for the current token falls below a certain threshold. Extensive experiments on the Spec-Bench demonstrate the effectiveness of Kangaroo. Under single-sequence verification, Kangaroo achieves speedups up to $1.68\times$ on Spec-Bench, outperforming Medusa-1 with 88.7% fewer additional parameters (67M compared to 591M). The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.
摘要：推测解码已证明其在加速大型语言模型的推理同时保持一致的采样分布方面的有效性。然而，训练单独的草稿模型以获得令人满意的令牌接受率的传统方法可能成本高昂。受到早期退出的启发，我们提出了一种新颖的自推测解码框架 \emph{Kangaroo}，它使用固定的浅层子网络作为自拟模型，其余层作为更大的目标模型。我们在子网络之上训练一个轻量级且高效的适配器模块，以弥合子网络和完整模型表示能力之间的差距。值得注意的是，与大模型相比，自起草模型的推理延迟可能不再可以忽略不计，因此需要采取策略来提高令牌接受率，同时最大限度地减少小模型的起草步骤。为了应对这一挑战，我们引入了一种额外的提前退出机制来生成草稿代币。具体来说，一旦当前令牌的置信水平低于某个阈值，我们就会在起草阶段停止小模型的后续预测。 Spec-Bench 上的大量实验证明了 Kangaroo 的有效性。在单序列验证下，Kangaroo 在 Spec-Bench 上实现了高达 $1.68\times$ 的加速，优于 Medusa-1，附加参数减少了 88.7\%（67M 与 591M 相比）。 Kangaroo 的代码可在 https://github.com/Equationliu/Kangaroo 获取。
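
Sketch: the drafting-phase early exit described above (stop drafting once the draft model's confidence drops below a threshold) can be illustrated with the toy loop below. This is a sketch under assumptions, not the Kangaroo implementation: `draft_step` and `target_step` are hypothetical callables standing in for the shallow sub-network plus adapter and for the full model, the low-confidence token is discarded rather than kept (an assumption), the threshold and step limit are arbitrary, and verification is written as plain greedy matching, done sequentially here for clarity where a real system would verify all drafted tokens in one parallel pass.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces standing in for real model calls.
DraftStep = Callable[[List[int]], Tuple[int, float]]  # context -> (token, confidence)
TargetStep = Callable[[List[int]], int]               # context -> greedy next token


def draft_tokens(context: List[int], draft_step: DraftStep,
                 threshold: float = 0.6, max_steps: int = 6) -> List[int]:
    """Draft up to max_steps tokens, exiting early when confidence falls below threshold."""
    drafted: List[int] = []
    for _ in range(max_steps):
        token, confidence = draft_step(context + drafted)
        if confidence < threshold:
            break  # early exit: drafting further is unlikely to pay off
        drafted.append(token)
    return drafted


def verify(context: List[int], drafted: List[int], target_step: TargetStep) -> List[int]:
    """Greedy verification: keep drafted tokens while they match the target model,
    then append one token from the target (a correction or a bonus token)."""
    accepted: List[int] = []
    for token in drafted:
        expected = target_step(context + accepted)
        if token != expected:
            accepted.append(expected)
            return accepted
        accepted.append(token)
    accepted.append(target_step(context + accepted))
    return accepted


# Toy models: the target's next token is the context length; the draft model
# agrees with it but loses confidence once the context reaches five tokens.
target = lambda ctx: len(ctx)
draft = lambda ctx: (len(ctx), 0.9) if len(ctx) < 5 else (99, 0.3)

prompt = [0, 1, 2]
proposal = draft_tokens(prompt, draft)
output = verify(prompt, proposal, target)
```

Because every emitted token is either confirmed or supplied by the target model, the output of this greedy variant matches what the target alone would produce; the early exit only affects how much drafting work is spent per verification step.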

Title: Holmes: Benchmark the Linguistic Competence of Language Models

Authors: Andreas Waldis, Yotam Perlitz, Leshem Choshen, Yufang Hou, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.18923
Pdf URL: https://arxiv.org/pdf/2404.18923
Copy Paste: [[2404.18923]] Holmes: Benchmark the Linguistic Competence of Language Models(https://arxiv.org/abs/2404.18923)
Keywords: language model, prompt
Abstract: We introduce Holmes, a benchmark to assess the linguistic competence of language models (LMs) - their ability to grasp linguistic phenomena. Unlike prior prompting-based evaluations, Holmes assesses the linguistic competence of LMs via their internal representations using classifier-based probing. In doing so, we disentangle specific phenomena (e.g., part-of-speech of words) from other cognitive abilities, like following textual instructions, and meet recent calls to assess LMs' linguistic competence in isolation. Composing Holmes, we review over 250 probing studies and feature more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version of Holmes designed to lower the high computation load while maintaining high-ranking precision.
摘要：我们引入 Holmes，这是评估语言模型 (LM) 语言能力的基准——它们掌握语言现象的能力。与之前基于提示的评估不同，Holmes 使用基于分类器的探测通过 LM 的内部表征来评估 LM 的语言能力。在此过程中，我们将特定现象（例如单词的词性）与其他认知能力（例如遵循文本指令）分开，并满足最近独立评估 LM 语言能力的要求。在撰写《Holmes》时，我们回顾了 250 多项探索性研究，并提供了 200 多个数据集来评估句法、形态、语义、推理和话语现象。对 50 多个 LM 的分析表明，与已知趋势一致，它们的语言能力与模型大小相关。然而，令人惊讶的是，模型架构和指令调整也显着影响性能，特别是在形态和语法方面。最后，我们提出了 FlashHolmes，这是 Holmes 的简化版本，旨在降低高计算负载，同时保持高排名精度。