2024-03-07

Title: Mad Libs Are All You Need: Augmenting Cross-Domain Document-Level Event Argument Data

Authors: Joseph Gatto, Parker Seegmiller, Omar Sharif, Sarah M. Preum
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.03304
Pdf URL: https://arxiv.org/pdf/2403.03304
Copy Paste: [[2403.03304]] Mad Libs Are All You Need: Augmenting Cross-Domain Document-Level Event Argument Data(https://arxiv.org/abs/2403.03304)
Keywords: llm
Abstract: Document-Level Event Argument Extraction (DocEAE) is an extremely difficult information extraction problem -- with significant limitations in low-resource cross-domain settings. To address this problem, we introduce Mad Lib Aug (MLA), a novel generative DocEAE data augmentation framework. Our approach leverages the intuition that Mad Libs, which are categorically masked documents used as a part of a popular game, can be generated and solved by LLMs to produce data for DocEAE. Using MLA, we achieve a 2.6-point average improvement in overall F1 score. Moreover, this approach achieves a 3.9 and 5.2 point average increase in zero and few-shot event roles compared to augmentation-free baselines across all experiments. To better facilitate analysis of cross-domain DocEAE, we additionally introduce a new metric, Role-Depth F1 (RDF1), which uses statistical depth to identify roles in the target domain which are semantic outliers with respect to roles observed in the source domain. Our experiments show that MLA augmentation can boost RDF1 performance by an average of 5.85 points compared to non-augmented datasets.
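
Sketch: The Mad Lib idea in the abstract above can be illustrated with a toy example. This is a minimal sketch, not the authors' pipeline: the template, the role names, and the fills are hypothetical and hard-coded here, whereas the abstract describes an LLM both generating and solving the masked documents. The point is only that filling categorically masked slots yields a document together with role-labelled argument spans.

```python
import re

# Hypothetical "Mad Lib": a document whose argument spans are replaced by role slots.
TEMPLATE = "On [DATE], [ATTACKER] breached the servers of [VICTIM] using [TOOL]."

# Hypothetical slot fills; in the setting the abstract describes, an LLM would propose these.
FILLS = {
    "DATE": "3 May",
    "ATTACKER": "an unknown group",
    "VICTIM": "a regional bank",
    "TOOL": "stolen credentials",
}

SLOT = re.compile(r"\[([A-Z_]+)\]")


def solve_mad_lib(template, fills):
    """Fill each role slot and record the character span of every filler."""
    pieces = []
    arguments = []
    cursor = 0  # read position in the template
    length = 0  # length of the output text built so far
    for match in SLOT.finditer(template):
        literal = template[cursor:match.start()]
        pieces.append(literal)
        length += len(literal)
        role = match.group(1)
        filler = fills[role]
        arguments.append(
            {"role": role, "text": filler, "start": length, "end": length + len(filler)}
        )
        pieces.append(filler)
        length += len(filler)
        cursor = match.end()
    pieces.append(template[cursor:])
    return "".join(pieces), arguments


document, arguments = solve_mad_lib(TEMPLATE, FILLS)
```

Each entry in `arguments` carries a role label and character offsets into `document`, which is the shape of annotation a document-level argument extraction model is typically trained on.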

Title: Book2Dial: Generating Teacher-Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots

Authors: Junling Wang, Jakub Macina, Nico Daheim, Sankalan Pal Chowdhury, Mrinmaya Sachan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03307
Pdf URL: https://arxiv.org/pdf/2403.03307
Copy Paste: [[2403.03307]] Book2Dial: Generating Teacher-Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots(https://arxiv.org/abs/2403.03307)
Keywords: language model, hallucination, prompt, chat
Abstract: Educational chatbots are a promising tool for assisting student learning. However, the development of effective chatbots in education has been challenging, as high-quality data is seldom available in this domain. In this paper, we propose a framework for generating synthetic teacher-student interactions grounded in a set of textbooks. Our approaches capture one aspect of learning interactions where curious students with partial knowledge interactively ask a teacher questions about the material in the textbook. We highlight various quality criteria that such dialogues should fulfill and compare several approaches relying on either prompting or fine-tuning large language models. We use synthetic dialogues to train educational chatbots and show benefits of further fine-tuning in different educational domains. However, human evaluation shows that our best data synthesis method still suffers from hallucinations and tends to reiterate information from previous conversations. Our findings offer insights for future efforts in synthesizing conversational data that strikes a balance between size and quality. We will open-source our data and code.

Title: Guardrail Baselines for Unlearning in LLMs

Authors: Pratiksha Thaker, Yash Maurya, Virginia Smith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03329
Pdf URL: https://arxiv.org/pdf/2403.03329
Copy Paste: [[2403.03329]] Guardrail Baselines for Unlearning in LLMs(https://arxiv.org/abs/2403.03329)
Keywords: language model, llm, prompt
Abstract: Recent work has demonstrated that fine-tuning is a promising approach to "unlearn" concepts from large language models. However, fine-tuning can be expensive, as it requires both generating a set of examples and running iterations of fine-tuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to fine-tuning. We recommend that researchers investigate these lightweight baselines when evaluating the performance of more computationally intensive fine-tuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to the problem of unlearning, our work suggests the need for evaluation metrics that can better separate the power of guardrails vs. fine-tuning, and highlights scenarios where guardrails themselves may be advantageous for unlearning, such as in generating examples for fine-tuning or unlearning when only API access is available.

Title: DIVERSE: Deciphering Internet Views on the U.S. Military Through Video Comment Stance Analysis, A Novel Benchmark Dataset for Stance Classification

Authors: Iain J. Cruickshank, Lynnette Hui Xian Ng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03334
Pdf URL: https://arxiv.org/pdf/2403.03334
Copy Paste: [[2403.03334]] DIVERSE: Deciphering Internet Views on the U.S. Military Through Video Comment Stance Analysis, A Novel Benchmark Dataset for Stance Classification(https://arxiv.org/abs/2403.03334)
Keywords: language model
Abstract: Stance detection of social media text is a key component of downstream tasks involving the identification of groups of users with opposing opinions on contested topics such as vaccination and within arguments. In particular, stance provides an indication of an opinion towards an entity. This paper introduces DIVERSE, a dataset of over 173,000 YouTube video comments annotated for their stance towards videos of the U.S. military. The stance is annotated through a human-guided, machine-assisted labeling methodology that makes use of weak signals of tone within the sentence as supporting indicators, as opposed to using manual annotations by humans. These weak signals consist of the presence of hate speech and sarcasm, the presence of specific keywords, the sentiment of the text, and the stance inference from two Large Language Models. The weak signals are then consolidated using a data programming model before each comment is annotated with a final stance label. On average, the videos have 200 comments each, and the stance of the comments skews slightly towards the "against" characterization for both the U.S. Army and the videos posted on the channel.
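
Sketch: The consolidation of weak signals described in the abstract above can be illustrated with a simple vote. This is a loudly simplified stand-in: a data programming model estimates how reliable each signal is, whereas the sketch below gives every signal equal weight. The two signal functions, their keyword lists, and the label names ("for", "against", "neutral") are hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a signal returns ABSTAIN when it has no opinion


def keyword_signal(comment):
    """Hypothetical keyword rule."""
    text = comment.lower()
    if "thank you for your service" in text:
        return "for"
    if "waste of tax" in text:
        return "against"
    return ABSTAIN


def sentiment_signal(comment):
    """Hypothetical lexicon-based sentiment rule."""
    text = comment.lower()
    positive = sum(word in text for word in ("great", "proud", "respect"))
    negative = sum(word in text for word in ("shame", "awful", "waste"))
    if positive > negative:
        return "for"
    if negative > positive:
        return "against"
    return ABSTAIN


SIGNALS = [keyword_signal, sentiment_signal]


def consolidate(comment, signals=SIGNALS, default="neutral"):
    """Equal-weight majority vote over non-abstaining signals; ties fall back to default."""
    votes = [label for label in (signal(comment) for signal in signals) if label is not ABSTAIN]
    if not votes:
        return default
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]
```

A learned label model would replace `consolidate` with a weighted combination, so that a noisy signal such as sarcasm detection counts for less than a precise keyword rule.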

Title: Scope of Large Language Models for Mining Emerging Opinions in Online Health Discourse

Authors: Joseph Gatto, Madhusudan Basak, Yash Srivastava, Philip Bohlman, Sarah M. Preum
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2403.03336
Pdf URL: https://arxiv.org/pdf/2403.03336
Copy Paste: [[2403.03336]] Scope of Large Language Models for Mining Emerging Opinions in Online Health Discourse(https://arxiv.org/abs/2403.03336)
Keywords: language model, gpt, llm
Abstract: In this paper, we develop an LLM-powered framework for the curation and evaluation of emerging opinion mining in online health communities. We formulate emerging opinion mining as a pairwise stance detection problem between (title, comment) pairs sourced from Reddit, where post titles contain emerging health-related claims on a topic that is not predefined. The claims are either explicitly or implicitly expressed by the user. We detail (i) a method of claim identification -- the task of identifying if a post title contains a claim and (ii) an opinion mining-driven evaluation framework for stance detection using LLMs. We facilitate our exploration by releasing a novel test dataset, Long COVID-Stance, or LC-Stance, which can be used to evaluate LLMs on the tasks of claim identification and stance detection in online health communities. Long COVID is an emerging post-COVID disorder with uncertain and complex treatment guidelines, thus making it a suitable use case for our task. LC-Stance contains long COVID treatment-related discourse sourced from a Reddit community. Our evaluation shows that GPT-4 significantly outperforms prior work on zero-shot stance detection. We then perform thorough LLM model diagnostics, identifying the role of claim type (i.e. implicit vs explicit claims) and comment length as sources of model error.

Title: Learning to Maximize Mutual Information for Chain-of-Thought Distillation

Authors: Xin Chen, Hanxian Huang, Yanjun Gao, Yi Wang, Jishen Zhao, Ke Ding
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03348
Pdf URL: https://arxiv.org/pdf/2403.03348
Copy Paste: [[2403.03348]] Learning to Maximize Mutual Information for Chain-of-Thought Distillation(https://arxiv.org/abs/2403.03348)
Keywords: language model, chain-of-thought
Abstract: Knowledge distillation, the technique of transferring knowledge from large, complex models to smaller ones, marks a pivotal step towards efficient AI deployment. Distilling Step-by-Step (DSS), a novel method utilizing chain-of-thought (CoT) distillation, has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts. In DSS, the distilled model acquires the ability to generate rationales and predict labels concurrently through a multi-task learning framework. However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction. To this end, we investigate the mutual relationship of the two tasks from an Information Bottleneck perspective and formulate it as maximizing the mutual information of the representation features of the two tasks. We propose a variational approach to solve this optimization problem using a learning-based method. Our experimental results across four datasets demonstrate that our method outperforms the state-of-the-art DSS. Our findings offer insightful guidance for future research on language model distillation as well as applications involving CoT. Code and models will be released soon.
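
Sketch: The quantity the abstract above proposes to maximize is mutual information. As a reminder of what that measures, the snippet below computes I(X;Y) exactly for a small discrete joint distribution. This is only the textbook definition on a toy table. The paper works with learned representation features and a variational approach, neither of which is reproduced here.

```python
import math


def mutual_information(joint):
    """I(X;Y) in nats for a joint probability table joint[x][y] that sums to 1."""
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # terms with zero probability contribute nothing
                total += p * math.log(p / (px[i] * py[j]))
    return total


independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
coupled = [[0.5, 0.0], [0.0, 0.5]]          # X determines Y

mi_independent = mutual_information(independent)
mi_coupled = mutual_information(coupled)
```

For the independent table the result is zero, and for the fully coupled table it equals log 2, the entropy of a fair binary variable.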

Title: Japanese-English Sentence Translation Exercises Dataset for Automatic Grading

Authors: Naoki Miura, Hiroaki Funayama, Seiya Kikuchi, Yuichiroh Matsubayashi, Yuya Iwase, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03396
Pdf URL: https://arxiv.org/pdf/2403.03396
Copy Paste: [[2403.03396]] Japanese-English Sentence Translation Exercises Dataset for Automatic Grading(https://arxiv.org/abs/2403.03396)
Keywords: language model, gpt
Abstract: This paper proposes the task of automatic assessment of Sentence Translation Exercises (STEs), which have been used in the early stage of L2 language learning. We formalize the task as grading student responses for each rubric criterion pre-specified by the educators. We then create a dataset for STE between Japanese and English including 21 questions, along with a total of 3,498 student responses (167 on average). The answer responses were collected from students and crowd workers. Using this dataset, we demonstrate the performance of baselines including finetuned BERT and GPT models with few-shot in-context learning. Experimental results show that the baseline model with finetuned BERT was able to classify correct responses with approximately 90% in F1, but less than 80% for incorrect responses. Furthermore, the GPT models with few-shot learning show poorer results than finetuned BERT, indicating that our newly proposed task presents a challenging issue, even for the state-of-the-art large language models.

Title: Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization

Authors: Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, Ning Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03419
Pdf URL: https://arxiv.org/pdf/2403.03419
Copy Paste: [[2403.03419]] Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization(https://arxiv.org/abs/2403.03419)
Keywords: language model, llm
Abstract: Large language models (LLMs) have revolutionized the role of AI, yet also pose potential risks of propagating unethical content. Alignment technologies have been introduced to steer LLMs towards human preference, gaining increasing attention. Despite notable breakthroughs in this direction, existing methods heavily rely on high-quality positive-negative training pairs, suffering from noisy labels and the marginal distinction between preferred and dispreferred response data. Given recent LLMs' proficiency in generating helpful responses, this work pivots towards a new research focus: achieving alignment using solely human-annotated negative samples, preserving helpfulness while reducing harmfulness. For this purpose, we propose Distributional Dispreference Optimization (D$^2$O), which maximizes the discrepancy between the generated responses and the dispreferred ones to effectively eschew harmful information. We theoretically demonstrate that D$^2$O is equivalent to learning a distributional instead of instance-level preference model reflecting human dispreference against the distribution of negative responses. Besides, D$^2$O integrates an implicit Jeffrey Divergence regularization to balance the exploitation and exploration of reference policies and converges to a non-negative one during training. Extensive experiments demonstrate that our method achieves comparable generation quality and surpasses the latest baselines in producing less harmful and more informative responses with better training stability and faster convergence.
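
Sketch: The abstract above mentions an implicit Jeffrey divergence regularizer. For readers unfamiliar with the term, the Jeffreys divergence is the symmetrized Kullback-Leibler divergence, KL(P||Q) + KL(Q||P). The snippet below computes it for two discrete distributions. It illustrates the divergence only. How D$^2$O applies it to policies is not stated in the abstract and is not modelled here.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jeffreys(p, q):
    """Jeffreys divergence: the sum of the two directed KL divergences."""
    return kl(p, q) + kl(q, p)


p = [0.5, 0.5]
q = [0.9, 0.1]
divergence = jeffreys(p, q)
```

Unlike either directed KL term on its own, the result does not depend on the order of the arguments.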

Title: Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models

Authors: Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Yu Han, Hao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03432
Pdf URL: https://arxiv.org/pdf/2403.03432
Copy Paste: [[2403.03432]] Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models(https://arxiv.org/abs/2403.03432)
Keywords: language model, llm
Abstract: Instruction Tuning has the potential to stimulate or enhance specific capabilities of large language models (LLMs). However, achieving the right balance of data is crucial to prevent catastrophic forgetting and interference between tasks. To address these limitations and enhance training flexibility, we propose the Mixture-of-LoRAs (MoA) architecture which is a novel and parameter-efficient tuning method designed for multi-task learning with LLMs. In this paper, we start by individually training multiple domain-specific LoRA modules using corresponding supervised corpus data. These LoRA modules can be aligned with the expert design principles observed in Mixture-of-Experts (MoE). Subsequently, we combine the multiple LoRAs using an explicit routing strategy and introduce domain labels to facilitate multi-task learning, which helps prevent interference between tasks and ultimately enhances the performance of each individual task. Furthermore, each LoRA model can be iteratively adapted to a new domain, allowing for quick domain-specific adaptation. Experiments on diverse tasks demonstrate superior and robust performance, which can further promote the wide application of domain-specific LLMs.
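
Sketch: The routing idea in the abstract above can be pictured as a frozen base weight plus one low-rank update chosen per input. The snippet below is a minimal sketch under strong assumptions: the domain label is supplied directly (the abstract does not describe how the router is trained), the matrices are tiny hand-written lists, and the class and function names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRA:
    """Low-rank update delta(x) = scale * B (A x), with A of shape r x d_in and B of shape d_out x r."""

    def __init__(self, A, B, scale=1.0):
        self.A = A
        self.B = B
        self.scale = scale

    def delta(self, x):
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


def routed_forward(W, adapters, domain, x):
    """Apply the shared base weight W, then add only the adapter selected by the domain label."""
    base = matvec(W, x)
    update = adapters[domain].delta(x)
    return [b + u for b, u in zip(base, update)]


W = [[1, 0], [0, 1]]  # shared base weight (identity, for readability)
ADAPTERS = {
    "medical": LoRA(A=[[1, 0]], B=[[0], [1]]),  # rank 1: adds x[0] to output[1]
    "finance": LoRA(A=[[0, 1]], B=[[1], [0]]),  # rank 1: adds x[1] to output[0]
}

x = [2, 3]
medical_out = routed_forward(W, ADAPTERS, "medical", x)
finance_out = routed_forward(W, ADAPTERS, "finance", x)
```

Because only the selected adapter contributes, the "medical" and "finance" updates never mix for a given input, which is one way to read the abstract's claim about preventing interference between tasks.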

Title: Magic Markup: Maintaining Document-External Markup with an LLM

Authors: Edward Misback, Zachary Tatlock, Steven L. Tanimoto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03481
Pdf URL: https://arxiv.org/pdf/2403.03481
Copy Paste: [[2403.03481]] Magic Markup: Maintaining Document-External Markup with an LLM(https://arxiv.org/abs/2403.03481)
Keywords: language model, llm, agent
Abstract: Text documents, including programs, typically have human-readable semantic structure. Historically, programmatic access to these semantics has required explicit in-document tagging. Especially in systems where the text has an execution semantics, this means it is an opt-in feature that is hard to support properly. Today, language models offer a new method: metadata can be bound to entities in changing text using a model's human-like understanding of semantics, with no requirements on the document structure. This method expands the applications of document annotation, a fundamental operation in program writing, debugging, maintenance, and presentation. We contribute a system that employs an intelligent agent to re-tag modified programs, enabling rich annotations to automatically follow code as it evolves. We also contribute a formal problem definition, an empirical synthetic benchmark suite, and our benchmark generator. Our system achieves an accuracy of 90% on our benchmarks and can replace a document's tags in parallel at a rate of 5 seconds per tag. While there remains significant room for improvement, we find performance reliable enough to justify further exploration of applications.

Title: A Knowledge Plug-and-Play Test Bed for Open-domain Dialogue Generation

Authors: Xiangci Li, Linfeng Song, Lifeng Jin, Haitao Mi, Jessica Ouyang, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03496
Pdf URL: https://arxiv.org/pdf/2403.03496
Copy Paste: [[2403.03496]] A Knowledge Plug-and-Play Test Bed for Open-domain Dialogue Generation(https://arxiv.org/abs/2403.03496)
Keywords: language model, chat
Abstract: Knowledge-based, open-domain dialogue generation aims to build chit-chat systems that talk to humans using mined support knowledge. Many types and sources of knowledge have previously been shown to be useful as support knowledge. Even in the era of large language models, response generation grounded in knowledge retrieved from additional up-to-date sources remains a practically important approach. While prior work using single-source knowledge has shown a clear positive correlation between the performances of knowledge selection and response generation, there are no existing multi-source datasets for evaluating support knowledge retrieval. Further, prior work has assumed that the knowledge sources available at test time are the same as during training. This unrealistic assumption unnecessarily handicaps models, as new knowledge sources can become available after a model is trained. In this paper, we present a high-quality benchmark named multi-source Wizard of Wikipedia (Ms.WoW) for evaluating multi-source dialogue knowledge selection and response generation. Unlike existing datasets, it contains clean support knowledge, grounded at the utterance level and partitioned into multiple knowledge sources. We further propose a new challenge, dialogue knowledge plug-and-play, which aims to test an already trained dialogue model on using new support knowledge from previously unseen sources in a zero-shot fashion.

Title: CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

Authors: Zexuan Qiu, Jingjing Li, Shijue Huang, Wanjun Zhong, Irwin King
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03514
Pdf URL: https://arxiv.org/pdf/2403.03514
Copy Paste: [[2403.03514]] CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models(https://arxiv.org/abs/2403.03514)
Keywords: language model, llm
Abstract: Developing Large Language Models (LLMs) with robust long-context capabilities has been a recent research focus, resulting in the emergence of long-context LLMs proficient in Chinese. However, the evaluation of these models remains underdeveloped due to a lack of benchmarks. To address this gap, we present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs. CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating models with context window sizes from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels. With CLongEval, we undertake a comprehensive assessment of 6 open-source long-context LLMs and 2 leading commercial counterparts that feature both long-context abilities and proficiency in Chinese. We also provide in-depth analysis based on the empirical results, trying to shed light on the critical capabilities that present challenges in long-context settings. The dataset, evaluation scripts, and model outputs will be released.

Title: Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling

Authors: Chao-Wei Huang, Chen-An Li, Tsu-Yuan Hsu, Chen-Yu Hsu, Yun-Nung Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2403.03516
Pdf URL: https://arxiv.org/pdf/2403.03516
Copy Paste: [[2403.03516]] Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling(https://arxiv.org/abs/2403.03516)
Keywords: language model
Abstract: Dense retrieval methods have demonstrated promising performance in multilingual information retrieval, where queries and documents can be in different languages. However, dense retrievers typically require a substantial amount of paired data, which poses even greater challenges in multilingual scenarios. This paper introduces UMR, an Unsupervised Multilingual dense Retriever trained without any paired data. Our approach leverages the sequence likelihood estimation capabilities of multilingual language models to acquire pseudo labels for training dense retrievers. We propose a two-stage framework which iteratively improves the performance of multilingual dense retrievers. Experimental results on two benchmark datasets show that UMR outperforms supervised baselines, showcasing the potential of training multilingual retrievers without paired data, thereby enhancing their practicality. Our source code, data, and models are publicly available at https://github.com/MiuLab/UMR

Title: Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem

Authors: Yuhong Sun, Zhangyue Yin, Qipeng Guo, Jiawen Wu, Xipeng Qiu, Hui Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03558
Pdf URL: https://arxiv.org/pdf/2403.03558
Copy Paste: [[2403.03558]] Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem(https://arxiv.org/abs/2403.03558)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks. However, they are susceptible to producing unreliable conjectures in ambiguous contexts, a phenomenon called hallucination. This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on the unanswerable math word problem (MWP). To support this approach, we innovatively develop a dataset called Unanswerable Math Word Problem (UMWP) which comprises 5200 questions across five categories. We developed an evaluation methodology combining text similarity and mathematical expression detection to determine whether an LLM considers the question unanswerable. The results of extensive experiments conducted on 31 LLMs, including GPT-3, InstructGPT, LLaMA, and Claude, demonstrate that in-context learning and reinforcement learning with human feedback (RLHF) training significantly enhance the model's ability to avoid hallucination. We show that utilizing MWP is a reliable and effective approach to assess hallucination. Our code and data are available at https://github.com/Yuki-Asuuna/UMWP.
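
Sketch: The evaluation rule in the abstract above combines text similarity with mathematical expression detection. One way such a check might look is sketched below. Everything specific in it is an assumption: the template sentences, the 0.6 threshold, the regular expression, and the use of `difflib` as the similarity measure are illustrative choices, not the paper's.

```python
import re
from difflib import SequenceMatcher

# Hypothetical phrasings that signal "this problem cannot be solved".
UNANSWERABLE_TEMPLATES = [
    "the question cannot be answered",
    "there is not enough information to solve this problem",
]

# Hypothetical detector for a worked expression such as "3 + 4" or "12 / 3.5".
EXPRESSION = re.compile(r"\d+(?:\.\d+)?\s*[-+*/=]\s*\d+(?:\.\d+)?")


def similarity(a, b):
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def judged_unanswerable(response, threshold=0.6):
    """True when some sentence resembles a template and no expression is worked out."""
    sentences = [s.strip() for s in re.split(r"[.!?]\s+|[.!?]$|\n", response) if s.strip()]
    says_unanswerable = any(
        similarity(sentence, template) >= threshold
        for sentence in sentences
        for template in UNANSWERABLE_TEMPLATES
    )
    has_expression = EXPRESSION.search(response) is not None
    return says_unanswerable and not has_expression
```

Requiring both conditions means a response that declares the problem unanswerable but then computes a number anyway is not counted as a refusal.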

Title: Multimodal Large Language Models to Support Real-World Fact-Checking

Authors: Jiahui Geng, Yova Kementchedjhieva, Preslav Nakov, Iryna Gurevych
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03627
Pdf URL: https://arxiv.org/pdf/2403.03627
Copy Paste: [[2403.03627]] Multimodal Large Language Models to Support Real-World Fact-Checking(https://arxiv.org/abs/2403.03627)
Keywords: language model, gpt, llm, prompt
Abstract: Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information. While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied. Here, we aim to bridge this gap. In particular, we propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking. Our methodology is evidence-free, leveraging only these models' intrinsic knowledge and reasoning capabilities. By designing prompts that extract models' predictions, explanations, and confidence levels, we delve into research questions concerning model accuracy, robustness, and reasons for failure. We empirically find that (1) GPT-4V exhibits superior performance in identifying malicious and misleading multimodal claims, with the ability to explain the unreasonable aspects and underlying motives, and (2) existing open-source models exhibit strong biases and are highly sensitive to the prompt. Our study offers insights into combating false multimodal information and building secure, trustworthy multimodal models. To the best of our knowledge, we are the first to evaluate MLLMs for real-world fact-checking.

Title: GPTopic: Dynamic and Interactive Topic Representations

Authors: Arik Reuter, Anton Thielmann, Christoph Weisser, Sebastian Fischer, Benjamin Säfken
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03628
Pdf URL: https://arxiv.org/pdf/2403.03628
Copy Paste: [[2403.03628]] GPTopic: Dynamic and Interactive Topic Representations(https://arxiv.org/abs/2403.03628)
Keywords: language model, gpt, llm, chat
Abstract: Topic modeling seems to be almost synonymous with generating lists of top words to represent topics within large text corpora. However, deducing a topic from such a list of individual terms can require substantial expertise and experience, making topic modeling less accessible to people unfamiliar with the particularities and pitfalls of top-word interpretation. A topic representation limited to top-words might further fall short of offering a comprehensive and easily accessible characterization of the various aspects, facets and nuances a topic might have. To address these challenges, we introduce GPTopic, a software package that leverages Large Language Models (LLMs) to create dynamic, interactive topic representations. GPTopic provides an intuitive chat interface for users to explore, analyze, and refine topics interactively, making topic modeling more accessible and comprehensive. The corresponding code is available here: https://github.com/05ec6602be/GPTopic.
摘要：主题建模似乎几乎与生成热门词列表来表示大型文本语料库中的主题同义。然而，从这样的单个术语列表中推导出主题可能需要大量的专业知识和经验，使得不熟悉顶级词解释的特殊性和陷阱的人更难进行主题建模。仅限于热门词的主题表示可能进一步无法提供主题可能具有的各个方面、方面和细微差别的全面且易于访问的特征。为了应对这些挑战，我们推出了 GPTopic，这是一个利用大型语言模型 (LLM) 创建动态、交互式主题表示的软件包。 GPTopic 提供了直观的聊天界面，供用户交互式地探索、分析和提炼主题，使主题建模更加易于理解和全面。相应的代码可以在这里找到：https://github。 com/05ec6602be/GPTopic。

Title: Apollo: Lightweight Multilingual Medical LLMs towards Democratizing Medical AI to 6B People

Authors: Xidong Wang, Nuo Chen, Junyin Chen, Yan Hu, Yidong Wang, Xiangbo Wu, Anningzhe Gao, Xiang Wan, Haizhou Li, Benyou Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03640
Pdf URL: https://arxiv.org/pdf/2403.03640
Copy Paste: [[2403.03640]] Apollo: Lightweight Multilingual Medical LLMs towards Democratizing Medical AI to 6B People(https://arxiv.org/abs/2403.03640)
Keywords: llm
Abstract: Despite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the multilingual medical benchmark, the released Apollo models, at various relatively small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. In particular, Apollo-7B is the state-of-the-art multilingual medical LLM among models of up to 70B parameters. Additionally, these lite models could be used to improve the multilingual medical capabilities of larger models without fine-tuning, in a proxy-tuning fashion. We will open-source the training corpora, code, model weights and evaluation benchmark.
摘要：尽管全球医学知识库主要以英语为主，但当地语言对于提供定制医疗服务至关重要，特别是在医疗资源有限的地区。为了将医疗人工智能进步的影响范围扩大到更广泛的人群，我们的目标是开发六种最广泛使用的语言的医学法学硕士，涵盖全球 61 亿人口。这项工作最终创建了 ApolloCorpora 多语言医学数据集和 XMedBench 基准。在多语言医学基准测试中，已发布的Apollo模型在各种相对较小的尺寸（即0.5B、1.8B、2B、6B和7B）下，在同等尺寸的模型中取得了最佳性能。特别是，Apollo-7B 是最先进的多语言医学法学硕士，最高可达 70B。此外，这些精简模型可用于提高大型模型的多语言医疗能力，而无需以代理调整方式进行微调。我们将开源训练语料、代码、模型权重和评估基准。

Title: General2Specialized LLMs Translation for E-commerce

Authors: Kaidi Chen, Ben Chen, Dehong Gao, Huangyu Dai, Wen Jiang, Wei Ning, Shanqing Yu, Libin Yang, Xiaoyan Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03689
Pdf URL: https://arxiv.org/pdf/2403.03689
Copy Paste: [[2403.03689]] General2Specialized LLMs Translation for E-commerce(https://arxiv.org/abs/2403.03689)
Keywords: language model, gpt, llm
Abstract: Existing Neural Machine Translation (NMT) models mainly handle translation in the general domain, while overlooking domains with special writing formulas, such as e-commerce and legal documents. Taking e-commerce as an example, the texts usually include a large number of domain-related words and have more grammar problems, which leads to inferior performance of current NMT methods. To address these problems, we collect two domain-related resources, including a set of term pairs (aligned Chinese-English bilingual terms) and a parallel corpus annotated for the e-commerce domain. Furthermore, we propose a two-step fine-tuning paradigm (named G2ST) with self-contrastive semantic enhancement to transfer one general NMT model to the specialized NMT model for e-commerce. The paradigm can be used for NMT models based on large language models (LLMs). Extensive evaluations on real e-commerce titles demonstrate the superior translation quality and robustness of our G2ST approach, as compared with state-of-the-art NMT models such as LLaMA, Qwen, GPT-3.5, and even GPT-4.
摘要：现有的神经机器翻译（NMT）模型主要处理一般领域的翻译，而忽略了具有特殊书写公式的领域，例如电子商务和法律文档。以电子商务为例，文本通常包含大量与领域相关的单词，并且存在更多语法问题，这导致当前 NMT 方法的性能较差。为了解决这些问题，我们收集了两个与领域相关的资源，包括一组术语对（对齐的中英双语术语）和一个针对电子商务领域注释的平行语料库。此外，我们提出了一种具有自对比语义增强功能的两步微调范式（称为 G2ST），将一个通用 NMT 模型转移到专门用于电子商务的 NMT 模型。该范例可用于基于大型语言模型（LLM）的 NMT 模型。对真实电子商务标题的广泛评估表明，与 LLaMA、Qwen、GPT-3.5 甚至 GPT-4 等最先进的 NMT 模型相比，我们的 G2ST 方法具有卓越的翻译质量和稳健性。

Title: Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese

Authors: Yikun Sun, Zhen Wan, Nobuhiro Ueda, Sakiko Yahata, Fei Cheng, Chenhui Chu, Sadao Kurohashi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03690
Pdf URL: https://arxiv.org/pdf/2403.03690
Copy Paste: [[2403.03690]] Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese(https://arxiv.org/abs/2403.03690)
Keywords: language model, gpt, llm
Abstract: The creation of instruction data and evaluation benchmarks for serving large language models often involves enormous human annotation. This issue becomes particularly pronounced when rapidly developing such resources for a non-English language like Japanese. Instead of following the popular practice of directly translating existing English resources into Japanese (e.g., Japanese-Alpaca), we propose an efficient self-instruct method based on GPT-4. We first translate a small number of English instructions into Japanese and post-edit them to obtain native-level quality. GPT-4 then utilizes them as demonstrations to automatically generate Japanese instruction data. We also construct an evaluation benchmark containing 80 questions across 8 categories, using GPT-4 to automatically assess the response quality of LLMs without human references. The empirical results suggest that the models fine-tuned on our GPT-4 self-instruct data significantly outperformed the Japanese-Alpaca across all three base pre-trained models. Our GPT-4 self-instruct data allowed the LLaMA 13B model to defeat GPT-3.5 (Davinci-003) with a 54.37% win rate. The human evaluation exhibits the consistency between GPT-4's assessments and human preference. Our high-quality instruction data and evaluation benchmark have been released here.
摘要：用于服务大型语言模型的指令数据和评估基准的创建通常涉及大量的人工注释。当为日语等非英语语言快速开发此类资源时，这个问题变得尤为明显。我们没有遵循直接将现有英语资源翻译成日语（例如 Japanese-Alpaca）的流行做法，而是提出了一种基于 GPT-4 的高效自指导方法。我们首先将少量英文说明翻译成日文，并进行后期编辑以获得母语水平的质量。然后，GPT-4 利用它们作为演示来自动生成日语指令数据。我们还构建了一个评估基准，包含 8 个类别的 80 个问题，使用 GPT-4 自动评估法学硕士的回答质量，无需人工参考。实证结果表明，在我们的 GPT-4 自指导数据上进行微调的模型在所有三个基本预训练模型中均显着优于日本羊驼。我们的 GPT-4 自指导数据使 LLaMA 13B 模型能够以 54.37% 的胜率击败 GPT-3.5 (Davinci-003)。人类评估表现出 GPT-4 的评估与人类偏好之间的一致性。我们的高质量教学数据和评估基准已经在这里发布。

Title: German also Hallucinates! Inconsistency Detection in News Summaries with the Absinth Dataset

Authors: Laura Mascarell, Ribin Chalumattu, Annette Rios
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03750
Pdf URL: https://arxiv.org/pdf/2403.03750
Copy Paste: [[2403.03750]] German also Hallucinates! Inconsistency Detection in News Summaries with the Absinth Dataset(https://arxiv.org/abs/2403.03750)
Keywords: language model, llm, hallucination
Abstract: The advent of Large Language Models (LLMs) has led to remarkable progress on a wide range of natural language processing tasks. Despite the advances, these large-sized models still suffer from hallucinating information in their output, which poses a major issue in automatic text summarization, as we must guarantee that the generated summary is consistent with the content of the source document. Previous research addresses the challenging task of detecting hallucinations in the output (i.e. inconsistency detection) in order to evaluate the faithfulness of the generated summaries. However, these works primarily focus on English and recent multilingual approaches lack German data. This work presents absinth, a manually annotated dataset for hallucination detection in German news summarization and explores the capabilities of novel open-source LLMs on this task in both fine-tuning and in-context learning settings. We open-source and release the absinth dataset to foster further research on hallucination detection in German.
摘要：大型语言模型 (LLM) 的出现使得各种自然语言处理任务取得了显着进展。尽管取得了进步，这些大型模型在输出中仍然受到幻觉信息的影响，这在自动文本摘要中提出了一个主要问题，因为我们必须保证生成的摘要与源文档的内容一致。先前的研究解决了检测输出中的幻觉（即不一致检测）以评估生成的摘要的真实性的挑战性任务。然而，这些作品主要关注英语，而最近的多语言方法缺乏德语数据。这项工作提出了苦艾酒，这是一个用于德国新闻摘要中幻觉检测的手动注释数据集，并探索了新型开源法学硕士在微调和上下文学习环境中在此任务上的能力。我们开源并发布苦艾酒数据集，以促进德国幻觉检测的进一步研究。

Title: PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion

Authors: Zekai Zhang, Yiduo Guo, Yaobo Liang, Dongyan Zhao, Nan Duan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03788
Pdf URL: https://arxiv.org/pdf/2403.03788
Copy Paste: [[2403.03788]] PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion(https://arxiv.org/abs/2403.03788)
Keywords: language model, gpt, llm, agent
Abstract: The growing dependence on Large Language Models (LLMs) for finishing user instructions necessitates a comprehensive understanding of their robustness to complex task completion in real-world situations. To address this critical need, we propose the PowerPoint Task Completion Robustness benchmark (PPTC-R) to measure LLMs' robustness to the user PPT task instruction and software version. Specifically, we construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels. To assess the robustness of Language Models to software versions, we vary the number of provided APIs to simulate both the newest version and earlier version settings. Subsequently, we test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates these robustness settings, aiming to evaluate how deviations impact LLMs' API calls for task completion. We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark, particularly in the version update and the multilingual settings. However, we find that all LLMs lose their robustness when confronted with multiple challenges (e.g., multi-turn) simultaneously, leading to significant performance drops. We further analyze the robustness behavior and error reasons of LLMs in our benchmark, which provide valuable insights for researchers to understand the LLM's robustness in task completion and develop more robust LLMs and agents. We release the code and data at https://github.com/ZekaiGalaxy/PPTCR.
摘要：人们越来越依赖大型语言模型 (LLM) 来完成用户指令，因此需要全面了解它们在现实情况下完成复杂任务的稳健性。为了满足这一关键需求，我们提出了 PowerPoint 任务完成稳健性基准 (PPTC-R) 来衡量法学硕士对用户 PPT 任务指令和软件版本的稳健性。具体来说，我们通过在句子、语义和多语言级别攻击用户指令来构建对抗性用户指令。为了评估语言模型对软件版本的稳健性，我们改变提供的 API 数量来模拟最新版本和早期版本设置。随后，我们使用包含这些鲁棒性设置的基准测试了 3 个闭源和 4 个开源 LLM，旨在评估偏差如何影响 LLM 完成任务的 API 调用。我们发现 GPT-4 在我们的基准测试中表现出最高的性能和很强的鲁棒性，特别是在版本更新和多语言设置方面。然而，我们发现所有法学硕士在同时面临多个挑战（例如多轮）时都会失去稳健性，导致性能显着下降。我们在基准中进一步分析了法学硕士的稳健性行为和错误原因，这为研究人员了解法学硕士在任务完成方面的稳健性并开发更稳健的法学硕士和代理提供了宝贵的见解。我们在 \url{https://github.com/ZekaiGalaxy/PPTCR} 发布了代码和数据。

Title: Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ

Authors: Carolin Holtermann, Paul Röttger, Timm Dill, Anne Lauscher
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03814
Pdf URL: https://arxiv.org/pdf/2403.03814
Copy Paste: [[2403.03814]] Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ(https://arxiv.org/abs/2403.03814)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are often intended for use in just English (e.g. Llama2, Mistral) or a small handful of high-resource languages (e.g. Mixtral, Qwen). Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages. Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use. For this purpose, we introduce MultiQ, a new silver standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages. With MultiQ, we evaluate language fidelity, i.e. whether models respond in the prompted language, and question answering accuracy. All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use. Most models are more accurate when they respond faithfully. However, differences across models are large, and there is a long tail of languages where models are neither accurate nor faithful. We explore differences in tokenization as a potential explanation for our findings, identifying possible correlations that warrant further investigation.
摘要：大型语言模型 (LLM) 需要为所有人服务，包括全球大多数非英语国家的人。然而，当今的大多数法学硕士，尤其是开放式法学硕士，通常仅用于英语（例如 Llama2、Mistral）或少数高资源语言（例如 Mixtral、Qwen）。最近的研究表明，尽管法学硕士的预期用途受到限制，人们还是会用许多不同的语言来提示法学硕士。因此，在本文中，我们研究了最先进的开放式法学硕士超出其预期用途的基本多语言能力。为此，我们引入了 MultiQ，这是一种新的银标准基准，用于基本开放式问答，包含 137 种不同类型语言的 27,400 个测试问题。通过 MultiQ，我们评估语言保真度，即模型是否以提示语言进行响应，以及问题回答的准确性。我们测试的所有法学硕士都对至少某些超出其预期用途的语言做出了忠实和/或准确的反应。大多数模型在忠实响应时会更加准确。然而，不同模型之间的差异很大，而且在很长的语言中，模型既不准确也不忠实。我们探索标记化的差异作为对我们的发现的潜在解释，确定需要进一步调查的可能相关性。
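The two quantities the MultiQ abstract names, language fidelity and question answering accuracy, can be illustrated with a small sketch. Everything here is an assumption for illustration: the record layout, the function name `fidelity_and_accuracy`, and the idea that the response language and answer correctness were already determined upstream. The abstract does not describe how the benchmark computes them.

```python
from typing import List, Tuple

# Each record is (prompt_language, response_language, answer_is_correct).
# Language identification and answer grading are assumed to happen upstream.
Record = Tuple[str, str, bool]

def fidelity_and_accuracy(records: List[Record]) -> dict:
    """Return overall fidelity, overall accuracy, and accuracy split by fidelity."""
    faithful = [r for r in records if r[0] == r[1]]
    unfaithful = [r for r in records if r[0] != r[1]]

    def acc(rows: List[Record]) -> float:
        # Share of correct answers; 0.0 for an empty group to avoid dividing by zero.
        return sum(1 for r in rows if r[2]) / len(rows) if rows else 0.0

    return {
        "fidelity": len(faithful) / len(records),
        "accuracy": acc(records),
        "accuracy_when_faithful": acc(faithful),
        "accuracy_when_unfaithful": acc(unfaithful),
    }

example = [
    ("de", "de", True),
    ("de", "en", False),
    ("sw", "sw", True),
    ("sw", "en", True),
]
result = fidelity_and_accuracy(example)
print(result)
```

Splitting accuracy by fidelity is one way to check the abstract's observation that models tend to be more accurate when they respond in the prompted language.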

Title: ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

Authors: Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03853
Pdf URL: https://arxiv.org/pdf/2403.03853
Copy Paste: [[2403.03853]] ShortGPT: Layers in Large Language Models are More Redundant Than You Expect(https://arxiv.org/abs/2403.03853)
Keywords: language model, gpt, llm
Abstract: As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.
摘要：随着大型语言模型 (LLM) 性能的不断提高，其规模也显着扩大，当前的 LLM 包含数十亿甚至数万亿的参数。然而，在这项研究中，我们发现LLM的许多层表现出高度相似性，并且某些层在网络功能中发挥的作用可以忽略不计。基于这一观察，我们定义了一个称为区块影响力（BI）的指标来衡量法学硕士中每一层的重要性。然后，我们提出了一种直接的修剪方法：层删除，其中我们根据 LLM 的 BI 分数直接删除其中的冗余层。实验表明，我们的方法（我们称之为 ShortGPT）在模型剪枝方面显着优于先前最先进的（SOTA）方法。此外，ShortGPT 与类量化方法正交，可以进一步减少参数和计算量。与更复杂的修剪技术相比，通过简单的层去除获得更好结果的能力表明模型架构中存在高度冗余。
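The layer-removal idea in the ShortGPT abstract can be sketched as follows. The abstract does not give the formula for Block Influence, so this sketch assumes a score of one minus the mean cosine similarity between the hidden states entering and leaving a layer; the function names and the toy hidden states are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def block_influence(hidden_in: List[Vector], hidden_out: List[Vector]) -> float:
    """Assumed score: 1 - mean cosine similarity over token positions.
    A layer that barely changes its input gets a score near 0."""
    sims = [cosine(x, y) for x, y in zip(hidden_in, hidden_out)]
    return 1.0 - sum(sims) / len(sims)

def layers_to_remove(states: List[List[Vector]], n_remove: int):
    """states[i] holds the hidden states entering layer i, states[i + 1] those leaving it.
    Returns the indices of the n_remove lowest-scoring layers and all scores."""
    scores = [block_influence(states[i], states[i + 1]) for i in range(len(states) - 1)]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_remove]), scores

# Toy example with two token positions and three layers:
# layer 0 is an identity map, layer 1 swaps coordinates, layer 2 only rescales.
states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
removed, scores = layers_to_remove(states, n_remove=2)
print(removed, scores)
```

In this toy setup the identity and rescaling layers score lowest and are the ones selected for removal, while the layer that changes the direction of the hidden states is kept.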

Title: Emojinize: Enriching Any Text with Emoji Translations

Authors: Lars Henning Klein, Roland Aydin, Robert West
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2403.03857
Pdf URL: https://arxiv.org/pdf/2403.03857
Copy Paste: [[2403.03857]] Emojinize: Enriching Any Text with Emoji Translations(https://arxiv.org/abs/2403.03857)
Keywords: language model
Abstract: Emoji have become ubiquitous in written communication, on the Web and beyond. They can emphasize or clarify emotions, add details to conversations, or simply serve decorative purposes. This casual use, however, barely scratches the surface of the expressive power of emoji. To further unleash this power, we present Emojinize, a method for translating arbitrary text phrases into sequences of one or more emoji without requiring human input. By leveraging the power of large language models, Emojinize can choose appropriate emoji by disambiguating based on context (e.g., cricket-bat vs. bat) and can express complex concepts compositionally by combining multiple emoji (e.g., "Emojinize" is translated to input-latin-letters right-arrow grinning-face). In a cloze-test-based user study, we show that Emojinize's emoji translations increase the human guessability of masked words by 55%, whereas human-picked emoji translations do so by only 29%. These results suggest that emoji provide a sufficiently rich vocabulary to accurately translate a wide variety of words. Moreover, annotating words and phrases with Emojinize's emoji translations opens the door to numerous downstream applications, including children learning how to read, adults learning foreign languages, and text understanding for people with learning disabilities.
摘要：表情符号在书面交流、网络及其他领域已经变得无处不在。它们可以强调或澄清情绪，为对话添加细节，或者只是起到装饰作用。然而，这种随意的使用仅仅触及了表情符号表达能力的表面。为了进一步释放这种力量，我们提出了 Emojinize，一种无需人工输入即可将任意文本短语翻译为一个或多个表情符号序列的方法。通过利用大型语言模型的强大功能，Emojinize 可以根据上下文（例如，板球棒与蝙蝠）消除歧义来选择适当的表情符号，并且可以通过组合多个表情符号来表达复杂的概念（例如，“Emojinize”被翻译为输入） -拉丁字母右箭头笑脸）。在基于完形填空测试的用户研究中，我们表明，Emojinize 的表情符号翻译将人类对屏蔽词的猜测性提高了 55%，而人类挑选的表情符号翻译仅提高了 29%。这些结果表明，表情符号提供了足够丰富的词汇量，可以准确翻译各种单词。此外，用 Emojinize 的表情符号翻译注释单词和短语为众多下游应用打开了大门，包括儿童学习如何阅读、成人学习外语以及学习障碍人士的文本理解。

Title: Designing Informative Metrics for Few-Shot Example Selection

Authors: Rishabh Adiga, Lakshminarayanan Subramanian, Varun Chandrasekaran
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.03861
Pdf URL: https://arxiv.org/pdf/2403.03861
Copy Paste: [[2403.03861]] Designing Informative Metrics for Few-Shot Example Selection(https://arxiv.org/abs/2403.03861)
Keywords: language model, gpt, prompt
Abstract: Pretrained language models (PLMs) have shown remarkable few-shot learning capabilities when provided with properly formatted examples. However, selecting the "best" examples remains an open challenge. We propose a complexity-based prompt selection approach for sequence tagging tasks. This approach avoids the training of a dedicated model for selection of examples, and instead uses certain metrics to align the syntactico-semantic complexity of test sentences and examples. We use both sentence- and word-level metrics to match the complexity of examples to the (test) sentence being considered. Our results demonstrate that our approach extracts greater performance from PLMs: it achieves state-of-the-art performance on few-shot NER, achieving a 5% absolute improvement in F1 score on the CoNLL2003 dataset for GPT-4. We also see large gains of up to 28.85 points (F1/Acc.) in smaller models like GPT-J-6B.
摘要：当提供格式正确的示例时，预训练语言模型 (PLM) 显示出出色的小样本学习能力。然而，选择“最佳”示例仍然是一个公开的挑战。我们提出了一种用于序列标记任务的基于复杂性的提示选择方法。这种方法避免了训练用于选择示例的专用模型，而是使用某些指标来调整测试句子和示例的句法语义复杂性。我们使用句子级和单词级指标来将示例的复杂性与正在考虑的（测试）句子相匹配。我们的结果表明，我们的方法从 PLM 中获得了更高的性能：它在少样本 NER 上实现了最先进的性能，在 GPT-4 的 CoNLL2003 数据集上的 F1 分数绝对提高了 5%。我们还发现 GPT-j-6B 等较小型号的性能大幅提升，高达 28.85 点 (F1/Acc.)。
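The selection step described in this abstract, matching the complexity of candidate examples to the test sentence without training a selector model, might look roughly like the sketch below. The abstract does not name its sentence- and word-level metrics, so token count stands in as a deliberately crude, assumed complexity measure; `select_examples` and `token_count` are hypothetical names.

```python
from typing import Callable, List

def token_count(sentence: str) -> float:
    """Stand-in complexity metric (an assumption): number of whitespace tokens."""
    return float(len(sentence.split()))

def select_examples(
    test_sentence: str,
    pool: List[str],
    k: int,
    complexity: Callable[[str], float] = token_count,
) -> List[str]:
    """Pick the k pool sentences whose complexity is closest to the test sentence.
    Ties keep the original pool order because sorted() is stable."""
    target = complexity(test_sentence)
    return sorted(pool, key=lambda s: abs(complexity(s) - target))[:k]

pool = ["a b", "a b c d e", "a b c", "a"]
chosen = select_examples("x y z", pool, k=2)
print(chosen)
```

Passing a different `complexity` callable is where a richer sentence- or word-level metric would plug in.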

Title: X-Shot: A Unified System to Handle Frequent, Few-shot and Zero-shot Learning Simultaneously in Classification

Authors: Hanzi Xu, Muhao Chen, Lifu Huang, Slobodan Vucetic, Wenpeng Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03863
Pdf URL: https://arxiv.org/pdf/2403.03863
Copy Paste: [[2403.03863]] X-Shot: A Unified System to Handle Frequent, Few-shot and Zero-shot Learning Simultaneously in Classification(https://arxiv.org/abs/2403.03863)
Keywords: language model
Abstract: In recent years, few-shot and zero-shot learning, which learn to predict labels with limited annotated instances, have garnered significant attention. Traditional approaches often treat frequent-shot (freq-shot; labels with abundant instances), few-shot, and zero-shot learning as distinct challenges, optimizing systems for just one of these scenarios. Yet, in real-world settings, label occurrences vary greatly. Some of them might appear thousands of times, while others might only appear sporadically or not at all. For practical deployment, it is crucial that a system can adapt to any label occurrence. We introduce a novel classification challenge: X-shot, reflecting a real-world context where freq-shot, few-shot, and zero-shot labels co-occur without predefined limits. Here, X can span from 0 to positive infinity. The crux of X-shot centers on open-domain generalization and devising a system versatile enough to manage various label scenarios. To solve X-shot, we propose BinBin (Binary INference Based on INstruction following) that leverages the Indirect Supervision from a large collection of NLP tasks via instruction following, bolstered by Weak Supervision provided by large language models. BinBin surpasses previous state-of-the-art techniques on three benchmark datasets across multiple domains. To our knowledge, this is the first work addressing X-shot learning, where X remains variable.
摘要：近年来，少样本和零样本学习（通过有限的注释实例学习预测标签）引起了广泛关注。传统方法通常将频繁射击（freq-shot；具有丰富实例的标签）、少量射击和零射击学习视为不同的挑战，仅针对其中一种场景优化系统。然而，在现实世界中，标签的出现情况差异很大。其中一些可能会出现数千次，而另一些可能只会偶尔出现或根本不出现。对于实际部署来说，系统能够适应任何标签的出现至关重要。我们引入了一种新颖的分类挑战：X-shot，反映了现实世界的环境，其中频率射击、少量射击和零射击标签在没有预定义限制的情况下同时出现。这里，X 的范围可以从 0 到正无穷大。 X-shot 的关键在于开放域泛化和设计一个足够通用的系统来管理各种标签场景。为了解决 X-shot 问题，我们提出了 BinBin（基于指令跟随的二进制推理），它通过指令跟随利用来自大量 NLP 任务的间接监督，并得到大型语言模型提供的弱监督的支持。 BinBin 在跨多个领域的三个基准数据集上超越了之前最先进的技术。据我们所知，这是第一个解决 X 镜头学习的工作，其中 X 保持可变。

Title: KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions

Authors: Fangyuan Xu, Kyle Lo, Luca Soldaini, Bailey Kuehl, Eunsol Choi, David Wadden
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03866
Pdf URL: https://arxiv.org/pdf/2403.03866
Copy Paste: [[2403.03866]] KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions(https://arxiv.org/abs/2403.03866)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. To evaluate the capabilities of current LLMs on this task, we construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain. Given a research question, an initial model-generated answer and a set of relevant papers, an expert annotator iteratively issues instructions for the model to revise and improve its answer. We collect 1,260 interaction turns from 234 interaction sessions with three state-of-the-art LLMs. Each turn includes a user instruction, a model response, and a human evaluation of the model response. Through a detailed analysis of the collected responses, we find that all models struggle to incorporate new information into an existing answer, and to perform precise and unambiguous edits. Further, we find that models struggle to judge whether their outputs successfully followed user instructions, with accuracy at least 10 points short of human agreement. Our findings indicate that KIWI will be a valuable resource to measure progress and improve LLMs' instruction-following capabilities for knowledge intensive writing tasks.
摘要：适合遵循用户指令的大型语言模型（LLM）现在被广泛部署为对话代理。在这项工作中，我们研究了一项日益常见的指令遵循任务：提供写作帮助以撰写长篇答案。为了评估当前法学硕士在这项任务上的能力，我们构建了 KIWI，这是科学领域知识密集型写作指令的数据集。给定一个研究问题、模型生成的初始答案和一组相关论文，专家注释者会迭代地向模型发出指令，以修改和改进其答案。我们从与三位最先进的法学硕士进行的 234 次互动会话中收集了 1,260 次互动。每轮包括用户指令、模型响应以及模型响应的人工评估。通过对收集到的答案进行详细分析，我们发现所有模型都难以将新信息融入现有答案中，并执行精确且明确的编辑。此外，我们发现模型很难判断其输出是否成功遵循用户指令，其准确度至少比人类的一致性低 10 个百分点。我们的研究结果表明，KIWI 将成为衡量进度和提高法学硕士在知识密集型写作任务中遵循指令的能力的宝贵资源。

Title: On the Origins of Linear Representations in Large Language Models

Authors: Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam, Victor Veitch
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2403.03867
Pdf URL: https://arxiv.org/pdf/2403.03867
Copy Paste: [[2403.03867]] On the Origins of Linear Representations in Large Language Models(https://arxiv.org/abs/2403.03867)
Keywords: language model
Abstract: Recent works have argued that high-level semantic concepts are encoded "linearly" in the representation space of large language models. In this work, we study the origins of such linear representations. To that end, we introduce a simple latent variable model to abstract and formalize the concept dynamics of the next token prediction. We use this formalism to show that the next token prediction objective (softmax with cross-entropy) and the implicit bias of gradient descent together promote the linear representation of concepts. Experiments show that linear representations emerge when learning from data matching the latent variable model, confirming that this simple structure already suffices to yield linear representations. We additionally confirm some predictions of the theory using the LLaMA-2 large language model, giving evidence that the simplified model yields generalizable insights.
摘要：最近的工作认为，高级语义概念在大型语言模型的表示空间中“线性”编码。在这项工作中，我们研究了这种线性表示的起源。为此，我们引入了一个简单的潜变量模型来抽象和形式化下一个令牌预测的概念动态。我们使用这种形式主义来表明下一个 token 预测目标（具有交叉熵的 softmax）和梯度下降的隐式偏差共同促进了概念的线性表示。实验表明，当从匹配潜变量模型的数据中学习时，会出现线性表示，证实这种简单的结构已经足以产生线性表示。我们还使用 LLaMA-2 大语言模型证实了该理论的一些预测，证明简化模型可以产生可推广的见解。

Title: Learning to Decode Collaboratively with Multiple Language Models

Authors: Shannon Zejiang Shen, Hunter Lang, Bailin Wang, Yoon Kim, David Sontag
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.03870
Pdf URL: https://arxiv.org/pdf/2403.03870
Copy Paste: [[2403.03870]] Learning to Decode Collaboratively with Multiple Language Models(https://arxiv.org/abs/2403.03870)
Keywords: language model, llm
Abstract: We propose a method to teach multiple large language models (LLMs) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the marginal likelihood of a training set under our latent variable model, the base LLM automatically learns when to generate itself and when to call on one of the "assistant" language models to generate, all without direct supervision. Token-level collaboration during decoding allows for a fusion of each model's expertise in a manner tailored to the specific task at hand. Our collaborative decoding is especially useful in cross-domain settings where a generalist base LLM learns to invoke domain expert models. On instruction-following, domain-specific QA, and reasoning tasks, we show that the performance of the joint system exceeds that of the individual models. Through qualitative analysis of the learned latent decisions, we show models trained with our method exhibit several interesting collaboration patterns, e.g., template-filling. Our code is available at https://github.com/clinicalml/co-llm.
摘要：我们提出了一种方法，通过在令牌级别交错各代来教授多个大型语言模型（LLM）进行协作。我们将哪个 LLM 生成下一个标记的决策建模为潜在变量。通过优化潜变量模型下训练集的边际似然，基础法学硕士自动学习何时生成自身以及何时调用“辅助”语言模型之一来生成，所有这些都无需直接监督。解码过程中的令牌级协作允许以针对当前特定任务的方式融合每个模型的专业知识。我们的协作解码在跨领域设置中特别有用，其中通才基础法学硕士学习调用领域专家模型。在指令跟踪、特定领域的 QA 和推理任务上，我们表明联合系统的性能超过了单个模型的性能。通过对学习到的潜在决策进行定性分析，我们展示了用我们的方法训练的模型表现出几种有趣的协作模式，例如模板填充。我们的代码可在 https://github.com/clinicalml/co-llm 获取。
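The latent-variable view in this abstract can be made concrete with a short sketch: if a latent choice selects which model emits the next token, the marginal probability of that token sums over the choices, weighting each model's token probability by the probability of choosing that model. The function names and toy numbers are assumptions; the abstract does not give the exact parameterization used in the paper.

```python
import math
from typing import List, Tuple

def token_marginal(gate_probs: List[float], token_probs: List[float]) -> float:
    """P(token) = sum over models z of P(z) * P_z(token).
    gate_probs[z] is the probability that model z generates this token;
    token_probs[z] is the probability model z assigns to the observed token."""
    return sum(g * p for g, p in zip(gate_probs, token_probs))

def sequence_log_marginal(steps: List[Tuple[List[float], List[float]]]) -> float:
    """Log marginal likelihood of a sequence: sum of per-token log marginals."""
    return sum(math.log(token_marginal(g, p)) for g, p in steps)

# Two models (base and one assistant), two tokens.
steps = [
    ([0.75, 0.25], [0.5, 1.0]),  # base is usually chosen, assistant is more confident
    ([0.5, 0.5], [0.5, 0.5]),
]
first = token_marginal(*steps[0])
total = sequence_log_marginal(steps)
print(first, total)
```

Maximizing this quantity over a training set is what lets the gate be learned without labels saying which model should produce each token.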

Title: SaulLM-7B: A pioneering Large Language Model for Law

Authors: Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, Michael Desa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03883
Pdf URL: https://arxiv.org/pdf/2403.03883
Copy Paste: [[2403.03883]] SaulLM-7B: A pioneering Large Language Model for Law(https://arxiv.org/abs/2403.03883)
Keywords: language model, llm
Abstract: In this paper, we introduce SaulLM-7B, a large language model (LLM) tailored for the legal domain. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens. SaulLM-7B exhibits state-of-the-art proficiency in understanding and processing legal documents. Additionally, we present a novel instructional fine-tuning method that leverages legal datasets to further enhance SaulLM-7B's performance in legal tasks. SaulLM-7B is released under the CC-BY-SA-4.0 License.
摘要：在本文中，我们介绍了 SaulLM-7B，这是一种专为法律领域量身定制的大型语言模型 (LLM)。 SaulLM-7B 拥有 70 亿个参数，是第一个专门为法律文本理解和生成而设计的法学硕士。 SaulLM-7B 以 Mistral 7B 架构为基础，接受了超过 300 亿个代币的英语法律语料库的训练。 SaulLM-7B 在理解和处理法律文件方面表现出最先进的能力。此外，我们提出了一种新颖的教学微调方法，利用法律数据集进一步提高 SaulLM-7B 在法律任务中的表现。 SaulLM-7B 根据 CC-BY-SA-4.0 许可证发布。

Title: FaaF: Facts as a Function for the evaluation of RAG systems

Authors: Vasileios Katranidis, Gabor Barany
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03888
Pdf URL: https://arxiv.org/pdf/2403.03888
Copy Paste: [[2403.03888]] FaaF: Facts as a Function for the evaluation of RAG systems(https://arxiv.org/abs/2403.03888)
Keywords: language model, prompt, retrieval augmented generation
Abstract: Factual recall from a reference source is crucial for evaluating the performance of Retrieval Augmented Generation (RAG) systems, as it directly probes into the quality of both retrieval and generation. However, it still remains a challenge to perform this evaluation reliably and efficiently. Recent work has focused on fact verification via prompting language model (LM) evaluators; however, we demonstrate that these methods are unreliable in the presence of incomplete or inaccurate information. We introduce Facts as a Function (FaaF), a new approach to fact verification that utilizes the function calling abilities of LMs and a framework for RAG factual recall evaluation. FaaF substantially improves the ability of LMs to identify unsupported facts in text with incomplete information whilst improving efficiency and lowering cost by several times, compared to prompt-based approaches.
摘要：来自参考源的事实回忆对于评估检索增强生成（RAG）系统的性能至关重要，因为它直接探究检索和生成的质量。然而，可靠、高效地进行这种评估仍然是一个挑战。最近的工作重点是通过提示语言模型（LM）评估器进行事实验证，但是我们证明这些方法在存在不完整或不准确信息的情况下是不可靠的。我们引入事实作为函数 (FaaF)，这是一种新的事实验证方法，利用 LM 的函数调用能力和 RAG 事实召回评估框架。与基于提示的方法相比，FaaF 极大地提高了 LM 识别文本中不完整信息中不受支持的事实的能力，同时提高了效率并降低了数倍的成本。

Title: From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models

Authors: Luiza Pozzobon, Patrick Lewis, Sara Hooker, Beyza Ermis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.03893
Pdf URL: https://arxiv.org/pdf/2403.03893
Copy Paste: [[2403.03893]] From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models(https://arxiv.org/abs/2403.03893)
Keywords: language model
Abstract: To date, toxicity mitigation in language models has almost entirely been focused on single-language settings. As language models embrace multilingual capabilities, it's crucial our safety measures keep pace. Recognizing this research gap, our approach expands the scope of conventional toxicity mitigation to address the complexities presented by multiple languages. In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques. We also compare finetuning mitigation approaches against retrieval-augmented techniques under both static and continual toxicity mitigation scenarios. This allows us to examine the effects of translation quality and the cross-lingual transfer on toxicity mitigation. We also explore how model size and data quantity affect the success of these mitigation efforts. Covering nine languages, our study represents a broad array of linguistic families and levels of resource availability, ranging from high to mid-resource languages. Through comprehensive experiments, we provide insights into the complexities of multilingual toxicity mitigation, offering valuable insights and paving the way for future research in this increasingly important field. Code and data are available at https://github.com/for-ai/goodtriever.
摘要：迄今为止，语言模型中的毒性缓解几乎完全集中在单一语言设置上。随着语言模型具备多语言功能，我们的安全措施保持同步至关重要。认识到这一研究差距，我们的方法扩大了传统毒性缓解的范围，以解决多种语言带来的复杂性。在缺乏足够的跨语言注释数据集的情况下，我们使用翻译后的数据来评估和增强我们的缓解技术。我们还在静态和连续毒性缓解场景下将微调缓解方法与检索增强技术进行比较。这使我们能够检查翻译质量和跨语言迁移对毒性缓解的影响。我们还探讨了模型大小和数据量如何影响这些缓解措施的成功。我们的研究涵盖九种语言，代表了广泛的语系和资源可用性水平，从高资源语言到中资源语言。通过全面的实验，我们深入了解了多语言毒性缓解的复杂性，提供了宝贵的见解，并为这个日益重要的领域的未来研究铺平了道路。代码和数据可在 https://github.com/for-ai/goodtriever 获取。

Title: Did Translation Models Get More Robust Without Anyone Even Noticing?

Authors: Ben Peters, André F.T. Martins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.03923
Pdf URL: https://arxiv.org/pdf/2403.03923
Copy Paste: [[2403.03923]] Did Translation Models Get More Robust Without Anyone Even Noticing?(https://arxiv.org/abs/2403.03923)
Keywords: language model, llm
Abstract: Neural machine translation (MT) models achieve strong results across a variety of settings, but it is widely believed that they are highly sensitive to "noisy" inputs, such as spelling errors, abbreviations, and other formatting issues. In this paper, we revisit this insight in light of recent multilingual MT models and large language models (LLMs) applied to machine translation. Somewhat surprisingly, we show through controlled experiments that these models are far more robust to many kinds of noise than previous models, even when they perform similarly on clean data. This is notable because, even though LLMs have more parameters and more complex training processes than past models, none of the open ones we consider use any techniques specifically designed to encourage robustness. Next, we show that similar trends hold for social media translation experiments -- LLMs are more robust to social media text. We include an analysis of the circumstances in which source correction techniques can be used to mitigate the effects of noise. Altogether, we show that robustness to many types of noise has increased.
摘要：神经机器翻译 (MT) 模型在各种设置中都取得了出色的结果，但人们普遍认为它们对“嘈杂”输入高度敏感，例如拼写错误、缩写和其他格式问题。在本文中，我们根据最近应用于机器翻译的多语言机器翻译模型和大型语言模型 (LLM) 重新审视这一见解。令人有些惊讶的是，我们通过对照实验表明，这些模型对多种噪声的鲁棒性比以前的模型要强得多，即使它们在干净数据上的表现相似。这是值得注意的，因为尽管 LLM 比过去的模型有更多的参数和更复杂的训练过程，但我们考虑的开放模型都没有使用任何专门设计来鼓励鲁棒性的技术。接下来，我们表明社交媒体翻译实验也存在类似的趋势：LLM 对社交媒体文本更加鲁棒。我们分析了可以使用源校正技术来减轻噪声影响的情况。总而言之，我们表明对多种类型噪声的鲁棒性有所提高。

Title: The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models

Authors: Adithya Bhaskar, Dan Friedman, Danqi Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.03942
Pdf URL: https://arxiv.org/pdf/2403.03942
Copy Paste: [[2403.03942]] The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models(https://arxiv.org/abs/2403.03942)
Keywords: language model
Abstract: Prior work has found that pretrained language models (LMs) fine-tuned with different random seeds can achieve similar in-domain performance but generalize differently on tests of syntactic generalization. In this work, we show that, even within a single model, we can find multiple subnetworks that perform similarly in-domain, but generalize vastly differently. To better understand these phenomena, we investigate if they can be understood in terms of "competing subnetworks": the model initially represents a variety of distinct algorithms, corresponding to different subnetworks, and generalization occurs when it ultimately converges to one. This explanation has been used to account for generalization in simple algorithmic tasks. Instead of finding competing subnetworks, we find that all subnetworks -- whether they generalize or not -- share a set of attention heads, which we refer to as the heuristic core. Further analysis suggests that these attention heads emerge early in training and compute shallow, non-generalizing features. The model learns to generalize by incorporating additional attention heads, which depend on the outputs of the "heuristic" heads to compute higher-level features. Overall, our results offer a more detailed picture of the mechanisms for syntactic generalization in pretrained LMs.
摘要：先前的工作发现，使用不同随机种子进行微调的预训练语言模型（LM）可以实现相似的域内性能，但在句法泛化测试中的泛化能力不同。在这项工作中，我们表明，即使在单个模型中，我们也可以找到在域内执行类似但概括性差异很大的多个子网络。为了更好地理解这些现象，我们研究是否可以从“竞争子网络”的角度来理解它们：模型最初代表各种不同的算法，对应于不同的子网络，当它最终收敛到一个子网络时，就会发生泛化。这种解释已用于解释简单算法任务中的泛化。我们没有找到竞争的子网络，而是发现所有子网络——无论它们是否泛化——共享一组注意力头，我们将其称为启发式核心。进一步的分析表明，这些注意力头在训练的早期就出现并计算浅层的、非泛化的特征。该模型通过合并额外的注意力头来学习泛化，这取决于“启发式”头的输出来计算更高级别的特征。总的来说，我们的结果提供了预训练语言模型中句法泛化机制的更详细的描述。