2024-04-15

Title: MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference

Authors: Mobashir Sadat, Cornelia Caragea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.08066
Pdf URL: https://arxiv.org/pdf/2404.08066
Copy Paste: [[2404.08066]] MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference(https://arxiv.org/abs/2404.08066)
Keywords: language model, llm, prompt
Abstract: The task of scientific Natural Language Inference (NLI) involves predicting the semantic relation between two sentences extracted from research articles. This task was recently proposed along with a new dataset called SciNLI derived from papers published in the computational linguistics domain. In this paper, we aim to introduce diversity in the scientific NLI task and present MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains. The availability of multiple domains makes it possible to study domain shift for scientific NLI. We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs). The highest Macro F1 scores of PLM and LLM baselines are 77.21% and 51.77%, respectively, illustrating that MSciNLI is challenging for both types of models. Furthermore, we show that domain shift degrades the performance of scientific NLI models which demonstrates the diverse characteristics of different domains in our dataset. Finally, we use both scientific NLI datasets in an intermediate task transfer learning setting and show that they can improve the performance of downstream tasks in the scientific domain. We make our dataset and code available on Github.
摘要：科学自然语言推理（NLI）的任务涉及预测从研究文章中提取的两个句子之间的语义关系。这项任务最近与一个名为 SciNLI 的新数据集一起提出，该数据集源自计算语言学领域发表的论文。在本文中，我们旨在引入科学 NLI 任务的多样性，并提出 MSciNLI，这是一个包含从五个新科学领域提取的 132,320 个句子对的数据集。多个域的可用性使得研究科学 NLI 的域转移成为可能。我们通过微调预训练语言模型 (PLM) 和促进大型语言模型 (LLM) 为 MSciNLI 建立强大的基线。 PLM 和 LLM 基线的最高 Macro F1 分数分别为 77.21% 和 51.77%，这说明 MSciNLI 对两种类型的模型都具有挑战性。此外，我们发现领域转移会降低科学 NLI 模型的性能，这表明我们数据集中不同领域的不同特征。最后，我们在中间任务迁移学习环境中使用这两个科学 NLI 数据集，并表明它们可以提高科学领域下游任务的性能。我们在 Github 上提供我们的数据集和代码。
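Note: The abstract above reports Macro F1, which is the unweighted mean of the per-class F1 scores. A minimal sketch of that metric follows. The function name `macro_f1` and the toy labels are assumptions made for illustration, not taken from the MSciNLI code release.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy labels, for illustration only
gold = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
score = macro_f1(gold, pred)
```

Because every class counts equally, a model that ignores a rare class is penalised more heavily than it would be under accuracy.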

Title: SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions

Authors: Stefan Sylvius Wagner, Maike Behrendt, Marc Ziegele, Stefan Harmeling
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.08078
Pdf URL: https://arxiv.org/pdf/2404.08078
Copy Paste: [[2404.08078]] SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions(https://arxiv.org/abs/2404.08078)
Keywords: llm, agent
Abstract: Stance detection is an important task for many applications that analyse or support online political discussions. Common approaches include fine-tuning transformer-based models. However, these models require a large amount of labelled data, which might not be available. In this work, we present two different ways to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions. First, we show that augmenting a small fine-tuning dataset with synthetic data can improve the performance of the stance detection model. Second, we propose a new active learning method called SQBC, based on the "Query-by-Committee" approach. The key idea is to use LLM-generated synthetic data as an oracle to identify the most informative unlabelled samples, which are then selected for manual labelling. Comprehensive experiments show that both ideas can improve the stance detection performance. Curiously, we observed that fine-tuning on actively selected samples can exceed the performance of using the full dataset.
摘要：对于许多分析或支持在线政治讨论的应用程序来说，立场检测是一项重要任务。常见的方法包括微调基于变压器的模型。然而，这些模型需要大量的标记数据，而这些数据可能无法获得。在这项工作中，我们提出了两种不同的方法来利用 LLM 生成的合成数据来训练和改进在线政治讨论的立场检测代理：首先，我们证明用合成数据增强小型微调数据集可以提高立场的性能检测模型。其次，我们提出了一种基于“委员会查询”方法的新主动学习方法，称为 SQBC。关键思想是使用法学硕士生成的合成数据作为预言机来识别信息最丰富的未标记样本，并选择这些样本进行手动标记。综合实验表明，这两种想法都可以提高姿态检测性能。奇怪的是，我们观察到对主动选择的样本进行微调可以超过使用完整数据集的性能。
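Note: The abstract above builds on Query-by-Committee, where the samples on which a committee disagrees most are sent for manual labelling. The sketch below shows the generic vote-entropy variant of that idea. It is not the SQBC procedure itself: how the synthetic data forms the committee is not described in the abstract, so the `committee_votes` input and both function names are assumptions.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy (in nats) of the label distribution voted by the committee."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_most_informative(committee_votes, budget):
    """Return the `budget` sample ids with the highest committee disagreement.

    committee_votes: dict mapping sample id -> list of labels, one per member.
    """
    ranked = sorted(
        committee_votes,
        key=lambda sample_id: vote_entropy(committee_votes[sample_id]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical votes from a three-member committee
votes = {
    "s1": ["pro", "pro", "pro"],      # full agreement, entropy 0
    "s2": ["pro", "con", "pro"],      # partial disagreement
    "s3": ["pro", "con", "neutral"],  # maximal disagreement
}
to_label = select_most_informative(votes, budget=2)
```

Samples with unanimous votes carry little new information, so they are ranked last.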

Title: Data-Augmentation-Based Dialectal Adaptation for LLMs

Authors: Fahim Faisal, Antonios Anastasopoulos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.08092
Pdf URL: https://arxiv.org/pdf/2404.08092
Copy Paste: [[2404.08092]] Data-Augmentation-Based Dialectal Adaptation for LLMs(https://arxiv.org/abs/2404.08092)
Keywords: language model, llm
Abstract: This report presents GMUNLP's participation in the Dialect-Copa shared task at VarDial 2024, which focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects. The task aims to assess how well LLMs can handle non-standard dialectal varieties, as their performance on standard languages is already well-established. We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance on three South Slavic dialects: Chakavian, Cherkano, and Torlak. We conduct experiments using a language-family-focused encoder-based model (BERTić) and a domain-agnostic multilingual model (AYA-101). Our results demonstrate that the proposed data augmentation techniques lead to substantial performance gains across all three test datasets in the open-source model category. This work highlights the practical utility of data augmentation and the potential of LLMs in handling non-standard dialectal varieties, contributing to the broader goal of advancing natural language understanding in low-resource and dialectal settings. Code: https://github.com/ffaisal93/dialect_copa
摘要：本报告介绍了 GMUNLP 参与 VarDial 2024 年方言-Copa 共享任务的情况，该任务的重点是评估大语言模型 (LLM) 对南斯拉夫微方言的常识推理能力。该任务旨在评估法学硕士处理非标准方言品种的能力，因为他们在标准语言上的表现已经很成熟。我们提出了一种方法，结合了不同类型语言模型的优势，并利用数据增强技术来提高三种南斯拉夫方言的任务性能：Chakavian、Cherkano 和 Torlak。我们使用以语言家族为中心的基于编码器的模型 (BERTi\'c) 和与领域无关的多语言模型 (AYA-101) 进行实验。我们的结果表明，所提出的数据增强技术可以在开源模型类别中的所有三个测试数据集上带来显着的性能提升。这项工作强调了数据增强的实际效用以及法学硕士在处理非标准方言变体方面的潜力，有助于实现在资源匮乏和方言环境中促进自然语言理解的更广泛目标。代码：https://github.com/ffaisal93/dialect_copa

Title: HLTCOE at TREC 2023 NeuCLIR Track

Authors: Eugene Yang, Dawn Lawrie, James Mayfield
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2404.08118
Pdf URL: https://arxiv.org/pdf/2404.08118
Copy Paste: [[2404.08118]] HLTCOE at TREC 2023 NeuCLIR Track(https://arxiv.org/abs/2404.08118)
Keywords: language model
Abstract: The HLTCOE team applied PLAID, an mT5 reranker, and document translation to the TREC 2023 NeuCLIR track. For PLAID we included a variety of models and training techniques -- the English model released with ColBERT v2, translate-train (TT), Translate Distill (TD) and multilingual translate-train (MTT). TT trains a ColBERT model with English queries and passages automatically translated into the document language from the MS-MARCO v1 collection. This results in three cross-language models for the track, one per language. MTT creates a single model for all three document languages by combining the translations of MS-MARCO passages in all three languages into mixed-language batches. Thus the model learns about matching queries to passages simultaneously in all languages. Distillation uses scores from the mT5 model over non-English translated document pairs to learn how to score query-document pairs. The team submitted runs to all NeuCLIR tasks: the CLIR and MLIR news task as well as the technical documents task.
摘要：HLTCOE 团队将 PLAID、mT5 重新排序器和文档翻译应用于 TREC 2023 NeuCLIR 赛道。对于 PLAID，我们包含了多种模型和训练技术——随 ColBERT v2 发布的英文模型、translate-train~(TT)、Translate Distill~(TD) 和多语言 translate-train~(MTT)。 TT 使用自动翻译成 MS-MARCO v1 集合中的文档语言的英语查询和段落来训练 ColBERT 模型。这会产生该曲目的三个跨语言模型，每种语言一个。 MTT 通过将所有三种语言的 MS-MARCO 段落的翻译组合成混合语言批次，为所有三种文档语言创建单一模型。因此，该模型可以学习同时将所有语言的查询与段落进行匹配。 Distillation 使用 mT5 模型对非英语翻译文档对的分数来学习如何对查询文档对进行评分。该团队提交了所有 NeuCLIR 任务的运行：CLIR 和 MLIR 新闻任务以及技术文档任务。

Title: Distilling Algorithmic Reasoning from LLMs via Explaining Solution Programs

Authors: Jierui Li, Raymond Mooney
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.08148
Pdf URL: https://arxiv.org/pdf/2404.08148
Copy Paste: [[2404.08148]] Distilling Algorithmic Reasoning from LLMs via Explaining Solution Programs(https://arxiv.org/abs/2404.08148)
Keywords: language model, llm, chain-of-thought
Abstract: Distilling explicit chain-of-thought reasoning paths has emerged as an effective method for improving the reasoning abilities of large language models (LLMs) across various tasks. However, when tackling complex tasks that pose significant challenges for state-of-the-art models, this technique often struggles to produce effective chains of thought that lead to correct answers. In this work, we propose a novel approach to distill reasoning abilities from LLMs by leveraging their capacity to explain solutions. We apply our method to solving competitive-level programming challenges. More specifically, we employ an LLM to generate explanations for a set of <problem, solution-program> pairs, then use <problem, explanation> pairs to fine-tune a smaller language model, which we refer to as the Reasoner, to learn algorithmic reasoning that can generate "how-to-solve" hints for unseen problems. Our experiments demonstrate that learning from explanations enables the Reasoner to more effectively guide program implementation by a Coder, resulting in higher solve rates than strong chain-of-thought baselines on competitive-level programming problems. It also outperforms models that learn directly from <problem, solution-program> pairs. We curated an additional test set in the CodeContests format, which includes 246 more recent problems posted after the models' knowledge cutoff.
摘要：提炼显式的思想链推理路径已成为提高大型语言模型（LLM）跨各种任务的推理能力的有效方法。然而，在处理对最先进模型构成重大挑战的复杂任务时，这种技术通常很难产生有效的思维链来得出正确的答案。在这项工作中，我们提出了一种新方法，通过利用法学硕士解释解决方案的能力来提炼他们的推理能力。我们应用我们的方法来解决竞争级别的编程挑战。更具体地说，我们采用 LLM 来生成一组 <问题，解决方案-程序> 对的解释，然后使用 <问题，解释> 对来微调较小的语言模型（我们将其称为 Reasoner）来学习算法推理可以为未见过的问题生成“如何解决”提示。我们的实验表明，从解释中学习使 Reasoner 能够更有效地指导编码员执行程序，从而在竞争级别的编程问题上比强大的思想链基线获得更高的解决率。它还优于直接从<问题，解决方案-程序>对学习的模型。我们以 CodeContests 格式策划了一个额外的测试集，其中包括模型知识截止后发布的 246 个最新问题。

Title: Measuring Cross-lingual Transfer in Bytes

Authors: Leandro Rodrigues de Souza, Thales Sales Almeida, Roberto Lotufo, Rodrigo Nogueira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.08191
Pdf URL: https://arxiv.org/pdf/2404.08191
Copy Paste: [[2404.08191]] Measuring Cross-lingual Transfer in Bytes(https://arxiv.org/abs/2404.08191)
Keywords: language model
Abstract: Multilingual pretraining has been a successful solution to the challenges posed by the lack of resources for languages. These models can transfer knowledge to target languages with minimal or no examples. Recent research suggests that monolingual models also have a similar capability, but the mechanisms behind this transfer remain unclear. Some studies have explored factors like language contamination and syntactic similarity. An emerging line of research suggests that the representations learned by language models contain two components: a language-specific and a language-agnostic component. The latter is responsible for transferring a more universal knowledge. However, there is a lack of comprehensive exploration of these properties across diverse target languages. To investigate this hypothesis, we conducted an experiment inspired by the work on the Scaling Laws for Transfer. We measured the amount of data transferred from a source language to a target language and found that models initialized from diverse languages perform similarly to a target language in a cross-lingual setting. This was surprising because the amount of data transferred to 10 diverse target languages, such as Spanish, Korean, and Finnish, was quite similar. We also found evidence that this transfer is not related to language contamination or language proximity, which strengthens the hypothesis that the model also relies on language-agnostic knowledge. Our experiments have opened up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining.
摘要：多语言预训练成功解决了语言资源缺乏带来的挑战。这些模型可以将知识转移到目标语言，只需很少的示例或无需示例。最近的研究表明，单语模型也具有类似的能力，但这种迁移背后的机制仍不清楚。一些研究探讨了语言污染和句法相似性等因素。一项新兴的研究表明，语言模型学习的表示包含两个组成部分：特定于语言的组成部分和与语言无关的组成部分。后者负责传递更普遍的知识。然而，缺乏对不同目标语言的这些属性的全面探索。为了研究这个假设，我们进行了一项实验，其灵感来自于转移缩放定律的工作。我们测量了从源语言传输到目标语言的数据量，发现从不同语言初始化的模型在跨语言设置中的表现与目标语言类似。这令人惊讶，因为传输到 10 种不同目标语言（例如西班牙语、韩语和芬兰语）的数据量非常相似。我们还发现证据表明这种转移与语言污染或语言接近无关，这强化了该模型也依赖于与语言无关的知识的假设。我们的实验为测量有多少数据代表预训练期间学到的与语言无关的表示开辟了新的可能性。

Title: Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study

Authors: Wan-Hua Her, Udo Kruschwitz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.08259
Pdf URL: https://arxiv.org/pdf/2404.08259
Copy Paste: [[2404.08259]] Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study(https://arxiv.org/abs/2404.08259)
Keywords: language model
Abstract: Machine Translation has made impressive progress in recent years offering close to human-level performance on many languages, but studies have primarily focused on high-resource languages with broad online presence and resources. With the help of growing Large Language Models, more and more low-resource languages achieve better results through the presence of other languages. However, studies have shown that not all low-resource languages can benefit from multilingual systems, especially those with insufficient training and evaluation data. In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian. We investigate conditions of low-resource languages such as data scarcity and parameter sensitivity and focus on refined solutions that combat low-resource difficulties and creative solutions such as harnessing language similarity. Our experiment entails applying Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance. We demonstrate noisiness in the data and present our approach to carry out text preprocessing extensively. Evaluation was conducted using combined metrics: BLEU, chrF and TER. Statistical significance results with Bonferroni correction show surprisingly high baseline systems, and that Back-translation leads to significant improvement. Furthermore, we present a qualitative analysis of translation errors and system limitations.
摘要：近年来，机器翻译取得了令人印象深刻的进步，在许多语言上提供了接近人类水平的性能，但研究主要集中在具有广泛在线存在和资源的高资源语言上。在不断增长的大型语言模型的帮助下，越来越多的低资源语言通过其他语言的存在取得了更好的结果。然而，研究表明，并非所有资源匮乏的语言都能从多语言系统中受益，尤其是那些训练和评估数据不足的语言。在本文中，我们重新审视最先进的神经机器翻译技术，以开发德语和巴伐利亚语之间的自动翻译系统。我们研究低资源语言的条件，例如数据稀缺和参数敏感性，并专注于解决低资源困难的精细解决方案和利用语言相似性等创造性解决方案。我们的实验需要应用反向翻译和迁移学习来自动生成更多训练数据并实现更高的翻译性能。我们展示了数据中的噪声，并介绍了我们广泛进行文本预处理的方法。使用组合指标进行评估：BLEU、chrF 和 TER。 Bonferroni 校正的统计显着性结果显示出令人惊讶的高基线系统，并且反向翻译带来了显着的改进。此外，我们还对翻译错误和系统限制进行了定性分析。
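Note: The abstract above reports statistical significance with Bonferroni correction. As a reminder of what that correction does, the sketch below tests each of m p-values against alpha / m instead of alpha. The function name and the p-values are made up for illustration and are not results from the paper.

```python
def bonferroni(p_values, alpha=0.05):
    """Return one reject/keep decision per hypothesis under Bonferroni correction.

    Each p-value is compared against alpha divided by the number of tests,
    which bounds the family-wise error rate by alpha.
    """
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values from three system comparisons
decisions = bonferroni([0.01, 0.02, 0.001], alpha=0.05)
```

With three comparisons the per-test threshold drops to roughly 0.0167, so a p-value of 0.02 that would pass an uncorrected 0.05 test is no longer counted as significant.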

Title: Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain

Authors: Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.08262
Pdf URL: https://arxiv.org/pdf/2404.08262
Copy Paste: [[2404.08262]] Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain(https://arxiv.org/abs/2404.08262)
Keywords: language model, llm
Abstract: Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of a model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Further we propose a new benchmark for Japanese business domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business domain benchmark are publicly available.
摘要：之前的几项研究已将特定于语言和特定领域的大语言模型 (LLM) 视为单独的主题。本研究探索非英语语言与高需求行业领域的结合，重点关注日本特定商业法学硕士。这种类型的模型需要业务领域的专业知识、强大的语言技能以及定期更新知识。我们使用新的商业文本和专利数据集从头开始训练了一个包含 130 亿参数的法学硕士，并不断使用最新的商业文档对其进行预训练。此外，我们提出了日本商业领域问答（QA）的新基准，并据此评估我们的模型。结果表明，我们的预训练模型在不丢失一般知识的情况下提高了 QA 准确性，并且持续的预训练增强了对新信息的适应。我们的预训练模型和业务领域基准是公开的。

Title: Relational Prompt-based Pre-trained Language Models for Social Event Detection

Authors: Pu Li, Xiaoyan Yu, Hao Peng, Yantuan Xian, Linqin Wang, Li Sun, Jingyun Zhang, Philip S. Yu
Subjects: cs.CL, cs.AI, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2404.08263
Pdf URL: https://arxiv.org/pdf/2404.08263
Copy Paste: [[2404.08263]] Relational Prompt-based Pre-trained Language Models for Social Event Detection(https://arxiv.org/abs/2404.08263)
Keywords: language model, prompt
Abstract: Social Event Detection (SED) aims to identify significant events from social streams, and has a wide application ranging from public opinion analysis to risk management. In recent years, Graph Neural Network (GNN) based solutions have achieved state-of-the-art performance. However, GNN-based methods often struggle with noisy and missing edges between messages, affecting the quality of learned message embedding. Moreover, these methods statically initialize node embedding before training, which, in turn, limits the ability to learn from message texts and relations simultaneously. In this paper, we approach social event detection from a new perspective based on Pre-trained Language Models (PLMs), and present RPLM_SED (Relational prompt-based Pre-trained Language Models for Social Event Detection). We first propose a new pairwise message modeling strategy to construct social messages into message pairs with multi-relational sequences. Secondly, a new multi-relational prompt-based pairwise message learning mechanism is proposed to learn more comprehensive message representation from message pairs with multi-relational prompts using PLMs. Thirdly, we design a new clustering constraint to optimize the encoding process by enhancing intra-cluster compactness and inter-cluster dispersion, making the message representation more distinguishable. We evaluate the RPLM_SED on three real-world datasets, demonstrating that the RPLM_SED model achieves state-of-the-art performance in offline, online, low-resource, and long-tail distribution scenarios for social event detection tasks.
摘要：社交事件检测（SED）旨在识别社交流中的重大事件，具有广泛的应用范围，从舆情分析到风险管理。近年来，基于图神经网络（GNN）的解决方案已经取得了最先进的性能。然而，基于 GNN 的方法经常会遇到消息之间的噪声和边缘缺失问题，从而影响学习消息嵌入的质量。此外，这些方法在训练之前静态初始化节点嵌入，这反过来又限制了同时从消息文本和关系中学习的能力。在本文中，我们从基于预训练语言模型（PLM）的新角度来处理社交事件检测，并提出了 RPLM_SED（用于社交事件检测的基于关系提示的预训练语言模型）。我们首先提出了一种新的成对消息建模策略，将社交消息构建为具有多关系序列的消息对。其次，提出了一种新的基于多关系提示的成对消息学习机制，以使用 PLM 从具有多关系提示的消息对中学习更全面的消息表示。第三，我们设计了一种新的聚类约束，通过增强簇内紧凑性和簇间分散性来优化编码过程，使消息表示更具可区分性。我们在三个真实数据集上评估 RPLM_SED，证明 RPLM_SED 模型在社交事件检测任务的离线、在线、低资源和长尾分发场景中实现了最先进的性能。

Title: Toward a Theory of Tokenization in LLMs

Authors: Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.08335
Pdf URL: https://arxiv.org/pdf/2404.08335
Copy Paste: [[2404.08335]] Toward a Theory of Tokenization in LLMs(https://arxiv.org/abs/2404.08335)
Keywords: language model, llm
Abstract: While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. When trained on data drawn from certain simple $k^{\text{th}}$-order Markov processes for $k > 1$, transformers exhibit a surprising phenomenon - in the absence of tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With the appropriate tokenization, we show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near optimally. Our analysis provides a justification for the use of tokenization in practice through studying the behavior of transformers on Markovian data.
摘要：虽然有大量研究试图规避语言建模的标记化（Clark 等人，2022；Xue 等人，2022），但目前的共识是，这是设计最新状态的必要的初始步骤。 -艺术表演语言模型。在本文中，我们通过研究变压器在简单数据生成过程中的行为，从理论角度研究了标记化。当对从某些简单的 $k^{\text{th}}$ 阶马尔可夫过程中提取的数据进行训练（$k > 1$）时，变压器表现出令人惊讶的现象 - 在没有标记化的情况下，它们根据经验无法学习正确的分布并根据一元模型预测字符（Makkuva 等人，2024）。然而，通过添加标记化，我们凭经验观察到变压器突破了这一障碍，并且能够对从源中提取的序列的概率进行近乎最优的建模，从而实现较小的交叉熵损失。以此观察为起点，我们研究了有和没有标记化的变压器实现的端到端交叉熵损失。通过适当的标记化，我们表明，即使是 Transformer 学习的最简单的一元模型（通过标记）也能够对从 $k^{\text{th}}$ 阶马尔可夫源抽取的序列的概率进行近乎最佳的建模。我们的分析通过研究马尔可夫数据上变压器的行为，为在实践中使用标记化提供了理由。
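Note: The abstract above contrasts a unigram predictor with the optimal predictor on Markov data. The sketch below computes that gap for a deliberately simplified case: a two-state, first-order chain rather than the $k > 1$ setting studied in the paper. The best character-level unigram model achieves a cross-entropy equal to the entropy of the stationary distribution, while the optimal predictor achieves the entropy rate of the chain. All function names and the switching probabilities are assumptions for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def stationary_two_state(p01, p10):
    """Stationary distribution of a 2-state chain with P(0->1)=p01, P(1->0)=p10."""
    total = p01 + p10
    return (p10 / total, p01 / total)

def unigram_vs_optimal(p01, p10):
    """Per-symbol loss (bits) of the best unigram model vs. the optimal predictor."""
    pi = stationary_two_state(p01, p10)
    # Best unigram model predicts the stationary distribution at every step.
    unigram_loss = entropy_bits(pi)
    # Optimal predictor conditions on the previous symbol: the entropy rate.
    optimal_loss = (pi[0] * entropy_bits((1 - p01, p01))
                    + pi[1] * entropy_bits((p10, 1 - p10)))
    return unigram_loss, optimal_loss

# Hypothetical chain that switches state with probability 0.1
unigram_loss, optimal_loss = unigram_vs_optimal(0.1, 0.1)
```

For this chain the unigram loss is exactly 1 bit per symbol while the entropy rate is under half a bit, which is the kind of gap the abstract says tokenization helps close.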

Title: ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana

Authors: Monica Romero, Sandra Gomez, Iván G. Torre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.08368
Pdf URL: https://arxiv.org/pdf/2404.08368
Copy Paste: [[2404.08368]] ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana(https://arxiv.org/abs/2404.08368)
Keywords: language model
Abstract: Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities of America. The Second AmericasNLP Competition Track 1 of NeurIPS 2022 proposed developing automatic speech recognition (ASR) systems for five indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana. In this paper, we propose a reliable ASR model for each target language by crawling speech corpora spanning diverse sources and applying data augmentation methods that resulted in the winning approach in this competition. To achieve this, we systematically investigated the impact of different hyperparameters by a Bayesian search on the performance of the language models, specifically focusing on the variants of the Wav2vec2.0 XLS-R model: 300M and 1B parameters. Moreover, we performed a global sensitivity analysis to assess the contribution of various hyperparametric configurations to the performances of our best models. Importantly, our results show that freeze fine-tuning updates and dropout rate are more vital parameters than the total number of epochs of lr. Additionally, we release our best models -- with no other ASR model reported until now for two of the languages, Wa'ikhana and Kotiria -- and the many experiments performed, to pave the way for other researchers to continue improving ASR in minority languages. This insight opens up interesting avenues for future work, allowing for the advancement of ASR techniques in the preservation of minority indigenous languages and acknowledging the complexities involved in this important endeavour.
摘要：土著语言是人类交流发展的基本遗产，体现了美国当地社区的独特身份和文化。 NeurIPS 2022 的第二届 AmericasNLP 竞赛 Track 1 建议开发五种本土语言的自动语音识别 (ASR) 系统：盖丘亚语、瓜拉尼语、Bribri、Kotiria 和 Wa'ikhana。在本文中，我们通过抓取跨越不同来源的语音语料库并应用数据增强方法，为每种目标语言提出了一个可靠的 ASR 模型，从而在本次比赛中获胜。为了实现这一目标，我们通过贝叶斯搜索系统地研究了不同超参数对语言模型性能的影响，特别关注 Wav2vec2.0 XLS-R 模型的变体：300M 和 1B 参数。此外，我们进行了全局敏感性分析，以评估各种超参数配置对我们最佳模型性能的贡献。重要的是，我们的结果表明，冻结微调更新和 dropout 率是比 lr 的 epoch 总数更重要的参数。此外，我们还释放了我们最好的模型（到目前为止，还没有其他关于 Wa'ikhana 和 Kotiria 的 ASR 模型的报道），并且进行了许多实验，为其他研究人员继续改进少数民族语言的 ASR 铺平了道路。这一见解为未来的工作开辟了有趣的途径，允许在保护少数民族土著方面推进 ASR 技术，并承认这一重要努力所涉及的复杂性。

Title: Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think

Authors: Xinpeng Wang, Chengzhi Hu, Bolei Ma, Paul Röttger, Barbara Plank
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.08382
Pdf URL: https://arxiv.org/pdf/2404.08382
Copy Paste: [[2404.08382]] Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think(https://arxiv.org/abs/2404.08382)
Keywords: language model, llm
Abstract: Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate the model response is to rank the candidate answers based on the log probability of the first token prediction. An alternative way is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities do not match text answers for instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers. The difference in robustness increases as the mismatch rate becomes greater. As the mismatch reaches over 50%, the text answer is more robust to option order changes than the debiased first token probabilities using state-of-the-art debiasing methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.
摘要：多项选择题 (MCQ) 通常用于评估大型语言模型 (LLM) 的能力。评估模型响应的一种常见方法是根据第一个标记预测的对数概率对候选答案进行排名。另一种方法是检查文本输出。先前的工作表明，第一个标记概率对 MCQ 措辞的变化缺乏鲁棒性，并且第一个标记概率与指令调整模型的文本答案不匹配。因此，在本文中，我们研究了文本答案的稳健性。我们表明，当第一个标记答案与文本答案不匹配时，文本答案对问题扰动的鲁棒性比第一个标记概率更强。随着失配率变大，鲁棒性的差异也增大。当不匹配达到 50% 以上时，文本答案对于选项顺序更改比使用最先进的去偏方法（例如 PriDe）去偏的第一个标记概率更加稳健。我们的研究结果为文本答案评估相对于首次标记概率评估的优势提供了进一步的证据。
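Note: The abstract above compares two ways of reading an MCQ answer: ranking options by first-token log probability, and parsing the generated text. The sketch below computes the mismatch rate between the two. It assumes the text answers have already been parsed into option letters. The function names and the log-probability values are hypothetical and not taken from the paper's code.

```python
def first_token_choice(option_logprobs):
    """Pick the option whose first token has the highest log probability."""
    return max(option_logprobs, key=option_logprobs.get)

def mismatch_rate(first_token_logprobs, text_answers):
    """Fraction of questions where the first-token choice differs from the text answer.

    first_token_logprobs: list of dicts mapping option letter -> log probability.
    text_answers: list of option letters parsed from the model's generated text.
    """
    pairs = list(zip(first_token_logprobs, text_answers))
    mismatches = sum(first_token_choice(lp) != text for lp, text in pairs)
    return mismatches / len(pairs)

# Hypothetical scores for three questions with options A and B
logprobs = [
    {"A": -0.1, "B": -2.3},
    {"A": -1.2, "B": -0.4},  # first token prefers B, but the text answer says A
    {"A": -0.3, "B": -1.5},
]
rate = mismatch_rate(logprobs, ["A", "A", "A"])
```

A high mismatch rate signals that first-token scoring and the model's actual written answer are measuring different things.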

Title: Learning representations of learning representations

Authors: Rita González-Márquez, Dmitry Kobak
Subjects: cs.CL, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.08403
Pdf URL: https://arxiv.org/pdf/2404.08403
Copy Paste: [[2404.08403]] Learning representations of learning representations(https://arxiv.org/abs/2404.08403)
Keywords: language model
Abstract: The ICLR conference is unique among the top machine learning conferences in that all submitted papers are openly available. Here we present the ICLR dataset consisting of abstracts of all 24 thousand ICLR submissions from 2017-2024 with meta-data, decision scores, and custom keyword-based labels. We find that on this dataset, bag-of-words representation outperforms most dedicated sentence transformer models in terms of $k$NN classification accuracy, and the top performing language models barely outperform TF-IDF. We see this as a challenge for the NLP community. Furthermore, we use the ICLR dataset to study how the field of machine learning has changed over the last seven years, finding some improvement in gender balance. Using a 2D embedding of the abstracts' texts, we describe a shift in research topics from 2017 to 2024 and identify hedgehogs and foxes among the authors with the highest number of ICLR submissions.
摘要：ICLR 会议在顶级机器学习会议中是独一无二的，因为所有提交的论文都是公开的。在这里，我们展示了 ICLR 数据集，其中包含 2017 年至 2024 年所有 24,000 份 ICLR 提交的摘要，其中包含元数据、决策分数和基于关键字的自定义标签。我们发现，在这个数据集上，词袋表示在 $k$NN 分类精度方面优于大多数专用句子转换器模型，而表现最好的语言模型几乎不优于 TF-IDF。我们认为这对 NLP 社区来说是一个挑战。此外，我们使用 ICLR 数据集来研究机器学习领域在过去七年中的变化，发现性别平衡方面有所改善。使用摘要文本的 2D 嵌入，我们描述了从 2017 年到 2024 年研究主题的转变，并在 ICLR 提交数量最多的作者中识别出刺猬和狐狸。

Title: Thematic Analysis with Large Language Models: does it work with languages other than English? A targeted test in Italian

Authors: Stefano De Paoli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.08488
Pdf URL: https://arxiv.org/pdf/2404.08488
Copy Paste: [[2404.08488]] Thematic Analysis with Large Language Models: does it work with languages other than English? A targeted test in Italian(https://arxiv.org/abs/2404.08488)
Keywords: language model, llm, prompt
Abstract: This paper proposes a test to perform Thematic Analysis (TA) with a Large Language Model (LLM) on data in a language other than English. While there has been promising initial work on using pre-trained LLMs for TA on data in English, we lack any tests on whether these models can reasonably perform the same analysis with good quality in other languages. In this paper a test will be proposed using an open access dataset of semi-structured interviews in Italian. The test shows that a pre-trained model can perform such a TA on the data, also using prompts in Italian. A comparative test shows the model's capacity to produce themes which have a good resemblance with those produced independently by human researchers. The main implication of this study is that pre-trained LLMs may thus be suitable to support analysis in multilingual situations, so long as the language is supported by the model used.
摘要：本文提出了一种使用大型语言模型（LLM）对非英语语言的数据进行主题分析（TA）的测试。虽然在使用预训练的法学硕士进行英语数据助教方面已经有了初步的有前途的工作，但我们缺乏任何测试来证明这些模型是否可以合理地以其他语言进行高质量的相同分析。在本文中，将使用意大利语半结构化访谈的开放访问数据集进行测试。测试表明，预训练模型可以对数据执行此类 TA，同样使用意大利语提示。比较测试表明，该模型产生的主题与人类研究人员独立产生的主题非常相似。这项研究的主要含义是，只要所使用的模型支持该语言，预训练的法学硕士可能适合支持多语言情况下的分析。

Title: Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-Distillation

Authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Yufeng He, Kaikai An, Baobao Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.08491
Pdf URL: https://arxiv.org/pdf/2404.08491
Copy Paste: [[2404.08491]] Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-Distillation(https://arxiv.org/abs/2404.08491)
Keywords: language model
Abstract: Large-scale multilingual Pretrained Language Models (mPLMs) yield impressive performance on cross-language tasks, yet significant performance disparities exist across different languages within the same mPLM. Previous studies endeavored to narrow these disparities by supervised fine-tuning of the mPLMs with multilingual data. However, obtaining labeled multilingual data is time-consuming, and fine-tuning an mPLM with limited labeled multilingual data merely encapsulates the knowledge specific to the labeled data. Therefore, we introduce ALSACE to leverage the learned knowledge from the well-performing languages to guide under-performing ones within the same mPLM, eliminating the need for additional labeled multilingual data. Experiments show that ALSACE effectively mitigates language-level performance disparity across various mPLMs while showing competitive performance on different multilingual NLU tasks, ranging from full resource to limited resource settings. The code for our approach is available at https://github.com/pkunlp-icler/ALSACE.
摘要：大规模多语言预训练语言模型 (mPLM) 在跨语言任务上产生了令人印象深刻的性能，但同一 mPLM 中不同语言之间存在显着的性能差异。先前的研究试图通过使用多语言数据监督微调 mPLM 来缩小这些差异。然而，获取带标签的多语言数据非常耗时，并且使用有限的带标签多语言数据微调 mPLM 仅仅封装了特定于带标签数据的知识。因此，我们引入 ALSACE，利用从表现良好的语言中学到的知识来指导同一 mPLM 中表现不佳的语言，从而无需额外的标记多语言数据。实验表明，ALSACE 有效地缓解了各种 mPLM 之间的语言级性能差异，同时展示了不同多语言 NLU 任务（从完整资源到有限资源设置）的竞争性能。我们方法的代码可在 https://github.com/pkunlp-icler/ALSACE 获取。

Title: Small Models Are (Still) Effective Cross-Domain Argument Extractors

Authors: William Gantt, Aaron Steven White
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.08579
Pdf URL: https://arxiv.org/pdf/2404.08579
Copy Paste: [[2404.08579]] Small Models Are (Still) Effective Cross-Domain Argument Extractors(https://arxiv.org/abs/2404.08579)
Keywords: gpt, llm
Abstract: Effective ontology transfer has been a major goal of recent work on event argument extraction (EAE). Two methods in particular -- question answering (QA) and template infilling (TI) -- have emerged as promising approaches to this problem. However, detailed explorations of these techniques' ability to actually enable this transfer are lacking. In this work, we provide such a study, exploring zero-shot transfer using both techniques on six major EAE datasets at both the sentence and document levels. Further, we challenge the growing reliance on LLMs for zero-shot extraction, showing that vastly smaller models trained on an appropriate source ontology can yield zero-shot performance superior to that of GPT-3.5 or GPT-4.

Title: Is ChatGPT Transforming Academics' Writing Style?

Authors: Mingmeng Geng, Roberto Trotta
Subjects: cs.CL, cs.AI, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.08627
Pdf URL: https://arxiv.org/pdf/2404.08627
Copy Paste: [[2404.08627]] Is ChatGPT Transforming Academics' Writing Style?(https://arxiv.org/abs/2404.08627)
Keywords: gpt, prompt, chat
Abstract: Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT's writing style in their abstracts by means of a statistical analysis of word frequency changes. Our model is calibrated and validated on a mixture of real abstracts and ChatGPT-modified abstracts (simulated data) after a careful noise analysis. We find that ChatGPT is having an increasing impact on arXiv abstracts, especially in the field of computer science, where the fraction of ChatGPT-revised abstracts is estimated to be approximately 35%, if we take the output of one of the simplest prompts, "revise the following sentences", as a baseline. We conclude with an analysis of both positive and negative aspects of the penetration of ChatGPT into academics' writing style.
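As a rough illustration of how a shift in word frequencies can be turned into an estimated fraction of revised abstracts, the sketch below assumes a simple linear mixture: the observed frequency of each word is a blend of its frequency in unrevised text and its frequency in ChatGPT-revised text. This is an assumed toy model, not the calibration procedure from the paper; the function name `estimate_revised_fraction`, the mixing fraction, and all frequencies are hypothetical, synthetic values.

```python
def estimate_revised_fraction(freq_original, freq_revised, freq_observed):
    """Least-squares estimate of eta in: observed = (1 - eta) * original + eta * revised."""
    numerator = 0.0
    denominator = 0.0
    for f_o, f_r, f_obs in zip(freq_original, freq_revised, freq_observed):
        shift = f_r - f_o                      # how much revision moves this word
        numerator += (f_obs - f_o) * shift
        denominator += shift * shift
    if denominator == 0.0:
        raise ValueError("revised and original frequencies are identical")
    return numerator / denominator


# Synthetic per-word frequencies (hypothetical values, not taken from the paper).
freq_original = [0.010, 0.020, 0.005]
freq_revised = [0.030, 0.010, 0.015]

# Build a synthetic "observed" corpus with a hand-picked mixing fraction.
true_eta = 0.25
freq_observed = [
    (1 - true_eta) * f_o + true_eta * f_r
    for f_o, f_r in zip(freq_original, freq_revised)
]

eta_hat = estimate_revised_fraction(freq_original, freq_revised, freq_observed)
print(f"estimated fraction: {eta_hat:.2f}")
```

On noise-free synthetic data the estimator recovers the mixing fraction that was put in. Real word counts are noisy, which is presumably why the abstract stresses a careful noise analysis before calibration.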

Title: Pre-training Small Base LMs with Fewer Tokens

Authors: Sunny Sanyal, Sujay Sanghavi, Alexandros G. Dimakis
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.08634
Pdf URL: https://arxiv.org/pdf/2404.08634
Copy Paste: [[2404.08634]] Pre-training Small Base LMs with Fewer Tokens(https://arxiv.org/abs/2404.08634)
Keywords: language model, gpt, llm
Abstract: We study the effectiveness of a simple approach to develop a small base language model (LM) starting from an existing large base LM: first inherit a few transformer blocks from the larger LM, and then train this smaller model on a very small subset (0.1%) of the raw pretraining data of the larger model. We call our simple recipe Inheritune and first demonstrate it for building a small base LM with 1.5B parameters using 1B tokens (and the first few layers of a larger LM with 3B parameters); we do this using a single A6000 GPU for less than half a day. Across 9 diverse evaluation datasets as well as the MMLU benchmark, the resulting model compares favorably to publicly available base models of 1B-2B size, some of which have been trained using 50-1000 times more tokens. We investigate Inheritune in a slightly different setting where we train small LMs utilizing larger LMs and their full pre-training dataset. Here we show that smaller LMs trained utilizing some of the layers of GPT-2-medium (355M) and GPT-2-large (770M) can effectively match the validation loss of their bigger counterparts when trained from scratch for the same number of training steps on the OpenWebText dataset with 9B tokens. We analyze our recipe with extensive experiments and demonstrate its efficacy in diverse settings. Our code is available at https://github.com/sanyalsunny111/LLM-Inheritune.
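The two steps of the recipe described in the abstract (inherit the first few transformer blocks, then train on a small subset of the data) can be sketched with plain Python stand-ins. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `inherit_first_blocks` and `sample_subset`, the layer counts (26 and 13), and the toy corpus are all hypothetical, and the training loop itself is omitted.

```python
import copy
import random


def inherit_first_blocks(large_blocks, num_keep):
    """Return deep copies of the first `num_keep` blocks of the larger model."""
    if not 0 < num_keep <= len(large_blocks):
        raise ValueError("num_keep must be between 1 and the number of blocks")
    return [copy.deepcopy(block) for block in large_blocks[:num_keep]]


def sample_subset(documents, fraction, seed=0):
    """Pick a random subset of the pretraining documents (at least one)."""
    rng = random.Random(seed)
    size = max(1, round(len(documents) * fraction))
    return rng.sample(documents, size)


# Toy stand-ins (hypothetical): each "block" is a dict of weights,
# each "document" is a string. Layer counts are illustrative only.
large_blocks = [{"layer": i, "weights": [0.1 * i] * 4} for i in range(26)]
documents = [f"doc-{i}" for i in range(10_000)]

# Step 1: the small model starts from the first blocks of the large one.
small_blocks = inherit_first_blocks(large_blocks, num_keep=13)

# Step 2: training would then use only a 0.1% sample of the original data.
train_subset = sample_subset(documents, fraction=0.001)

print(len(small_blocks), len(train_subset))
```

The deep copy keeps the small model's weights independent of the large model, so later training updates would not modify the source model.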