2024-04-04

Title: Emergent Abilities in Reduced-Scale Generative Language Models

Authors: Sherin Muckatira, Vijeta Deshpande, Vladislav Lialin, Anna Rumshisky
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02204
Pdf URL: https://arxiv.org/pdf/2404.02204
Copy Paste: [[2404.02204]] Emergent Abilities in Reduced-Scale Generative Language Models(https://arxiv.org/abs/2404.02204)
Keywords: language model
Abstract: Large language models can solve new tasks without task-specific fine-tuning. This ability, also known as in-context learning (ICL), is considered an emergent ability and is primarily seen in large language models with billions of parameters. This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data. To explore this, we simplify pre-training data and pre-train 36 causal language models with parameters varying from 1 million to 165 million parameters. We show that models trained on this simplified pre-training data demonstrate enhanced zero-shot capabilities across various tasks in simplified language, achieving performance comparable to that of pre-trained models six times larger on unrestricted language. This suggests that downscaling the language allows zero-shot learning capabilities to emerge in models with limited size. Additionally, we find that these smaller models pre-trained on simplified data demonstrate a power law relationship between the evaluation loss and the three scaling factors: compute, dataset size, and model size.
摘要：大型语言模型可以解决新任务，而无需针对特定任务进行微调。这种能力也称为上下文学习（ICL），被认为是一种新兴能力，主要出现在具有数十亿参数的大型语言模型中。这项研究调查了这些新出现的属性是否与模型大小严格相关，或者可以通过在缩小规模的数据上训练的较小模型来证明。为了探索这一点，我们简化了预训练数据并预训练了 36 个因果语言模型，参数范围从 100 万到 1.65 亿不等。我们表明，在这种简化的预训练数据上训练的模型在简化语言的各种任务中表现出增强的零样本能力，其性能可与在不受限制的语言上预训练的模型相比提高六倍。这表明，缩小语言规模可以在尺寸有限的模型中出现零样本学习能力。此外，我们发现这些在简化数据上预训练的较小模型展示了评估损失与三个缩放因子（计算、数据集大小和模型大小）之间的幂律关系。
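
Note: the power-law relationship mentioned in the abstract above (evaluation loss against compute, dataset size, or model size) can be illustrated with a minimal sketch. It assumes the form y = a * x^(-b) and fits it by ordinary least squares in log-log space. The function name fit_power_law and the data points are hypothetical and synthetic; they are not taken from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y); return (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx                    # slope of the line in log-log space
    intercept = mean_y - slope * mean_x  # equals log(a)
    return math.exp(intercept), -slope

# Synthetic points generated from y = 50 * x**(-0.1); NOT results from the paper.
model_sizes = [1e6, 1e7, 1e8]
losses = [50.0 * s ** -0.1 for s in model_sizes]
a, b = fit_power_law(model_sizes, losses)
```

A straight line on a log-log plot of loss against one scaling factor is the usual visual check for this kind of relationship; the fit above recovers the slope of that line.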

Title: $\texttt{LM}^\texttt{2}$: A Simple Society of Language Models Solves Complex Reasoning

Authors: Gurusha Juneja, Subhabrata Dutta, Tanmoy Chakraborty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02255
Pdf URL: https://arxiv.org/pdf/2404.02255
Copy Paste: [[2404.02255]] $\texttt{LM}^\texttt{2}$: A Simple Society of Language Models Solves Complex Reasoning(https://arxiv.org/abs/2404.02255)
Keywords: language model, llm
Abstract: Despite demonstrating emergent reasoning abilities, Large Language Models (LLMs) often lose track of complex, multi-step reasoning. Existing studies show that providing guidance via decomposing the original question into multiple subproblems elicits more robustness in LLM reasoning -- a decomposer generates the subproblems, and a solver solves each of these subproblems. However, these techniques fail to accommodate coordination between the decomposer and the solver modules (either in a single model or different specialized ones) -- the decomposer does not keep track of the ability of the solver to follow the decomposed reasoning. In this paper, we propose LM2 to address these challenges. LM2 modularizes the decomposition, solution, and verification into three different language models. The decomposer module identifies the key concepts necessary to solve the problem and generates step-by-step subquestions according to the reasoning requirement. The solver model generates the solution to the subproblems that are then checked by the verifier module; depending upon the feedback from the verifier, the reasoning context is constructed using the subproblems and the solutions. These models are trained to coordinate using policy learning. Exhaustive experimentation suggests the superiority of LM2 over existing methods on in- and out-domain reasoning problems, outperforming the best baselines by $8.1\%$ on MATH, $7.71\%$ on JEEBench, and $9.7\%$ on MedQA problems (code available at https://github.com/LCS2-IIITD/Language_Model_Multiplex).
摘要：尽管展示了紧急推理能力，大型语言模型（LLMS）经常无法跟踪复杂的多步骤推理。现有研究表明，通过将原始问题分解为多个子问题来提供指导，可以使 LLM 推理更加稳健——分解器生成子问题，求解器解决每个子问题。然而，这些技术无法适应分解器和求解器模块（无论是在单个模型中还是在不同的专用模型中）之间的协调——分解器无法跟踪求解器遵循分解推理的能力。在本文中，我们提出 LM2 来应对这些挑战。 LM2将分解、求解和验证模块化为三种不同的语言模型。分解器模块识别解决问题所需的关键概念，并根据推理要求生成逐步的子问题。求解器模型生成子问题的解决方案，然后由验证器模块检查；根据验证者的反馈，使用子问题和解决方案构建推理上下文。这些模型经过训练可以使用策略学习进行协调。详尽的实验表明，LM2 在域内和域外推理问题上优于现有方法，在 MATH 上优于最佳基线 $8.1\%$，在 JEEBench 上优于最佳基线 $7.71\%$，在 MedQA 问题上优于最佳基线 $9.7\%$（代码可用）位于 https://github.com/LCS2-IIITD/Language_Model_Multiplex）。
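
Note: the decomposer, solver, and verifier interplay described in the abstract above can be pictured as a simple control loop. The sketch below is only an illustration of that loop under stated assumptions. The function coordinate, its parameters, the retry policy, and the toy stand-in modules are all hypothetical; the policy-learning training the abstract mentions is omitted, and this is not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

Context = List[Tuple[str, str]]  # verified (subquestion, answer) pairs

def coordinate(
    question: str,
    decompose: Callable[[str, Context], Optional[str]],
    solve: Callable[[str, Context], str],
    verify: Callable[[str, str, Context], bool],
    max_steps: int = 8,
    max_retries: int = 2,
) -> Context:
    """Build a reasoning context from subquestions whose answers pass verification."""
    context: Context = []
    for _ in range(max_steps):
        sub = decompose(question, context)   # next subquestion, or None when done
        if sub is None:
            break
        accepted = False
        for _ in range(max_retries + 1):
            answer = solve(sub, context)
            if verify(sub, answer, context):  # only verified steps enter the context
                context.append((sub, answer))
                accepted = True
                break
        if not accepted:                      # assumption: give up if nothing verifies
            break
    return context

# Toy stand-ins for the three language models (hypothetical, for illustration only).
PLAN = ["2 + 3", "5 * 4"]
ANSWERS = {"2 + 3": "5", "5 * 4": "20"}

def toy_decompose(question: str, context: Context) -> Optional[str]:
    return PLAN[len(context)] if len(context) < len(PLAN) else None

def toy_solve(sub: str, context: Context) -> str:
    return ANSWERS[sub]

def toy_verify(sub: str, answer: str, context: Context) -> bool:
    return ANSWERS.get(sub) == answer

trace = coordinate("What is (2 + 3) * 4?", toy_decompose, toy_solve, toy_verify)
```

Passing the running context to the decomposer is what lets it condition the next subquestion on what the solver has already managed to answer.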

Title: LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages

Authors: Nataliia Kholodna, Sahib Julka, Mohammad Khodadadi, Muhammed Nurullah Gumus, Michael Granitzer
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02261
Pdf URL: https://arxiv.org/pdf/2404.02261
Copy Paste: [[2404.02261]] LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages(https://arxiv.org/abs/2404.02261)
Keywords: language model, gpt, llm
Abstract: Low-resource languages face significant barriers in AI development due to limited linguistic resources and expertise for data labeling, rendering them rare and costly. The scarcity of data and the absence of preexisting tools exacerbate these challenges, especially since these languages may not be adequately represented in various NLP datasets. To address this gap, we propose leveraging the potential of LLMs in the active learning loop for data annotation. Initially, we conduct evaluations to assess inter-annotator agreement and consistency, facilitating the selection of a suitable LLM annotator. The chosen annotator is then integrated into a training loop for a classifier using an active learning paradigm, minimizing the amount of queried data required. Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements, as indicated by estimated potential cost savings of at least 42.45 times compared to human annotation. Our proposed solution shows promising potential to substantially reduce both the monetary and computational costs associated with automation in low-resource settings. By bridging the gap between low-resource languages and AI, this approach fosters broader inclusion and shows the potential to enable automation across diverse linguistic landscapes.
摘要：由于语言资源和数据标记专业知识有限，低资源语言在人工智能开发中面临重大障碍，导致它们稀有且昂贵。数据的稀缺和现有工具的缺乏加剧了这些挑战，特别是因为这些语言可能无法在各种 NLP 数据集中得到充分表示。为了解决这一差距，我们建议利用法学硕士在主动学习循环中的潜力进行数据注释。最初，我们进行评估以评估注释者之间的一致性和一致性，以促进选择合适的法学硕士注释者。然后，使用主动学习范例将所选注释器集成到分类器的训练循环中，从而最大限度地减少所需的查询数据量。实证评估，特别是使用 GPT-4-Turbo，展示了近乎最先进的性能，同时显着减少了数据需求，与人工注释相比，预计潜在成本节省至少 42.45 倍。我们提出的解决方案显示出在大幅降低资源匮乏环境中与自动化相关的货币和计算成本方面的巨大潜力。通过弥合低资源语言和人工智能之间的差距，这种方法促进了更广泛的包容性，并展示了跨不同语言环境实现自动化的潜力。

Title: Extracting Norms from Contracts Via ChatGPT: Opportunities and Challenges

Authors: Amanul Haque, Munindar P. Singh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02269
Pdf URL: https://arxiv.org/pdf/2404.02269
Copy Paste: [[2404.02269]] Extracting Norms from Contracts Via ChatGPT: Opportunities and Challenges(https://arxiv.org/abs/2404.02269)
Keywords: gpt, hallucination, chat, agent
Abstract: We investigate the effectiveness of ChatGPT in extracting norms from contracts. Norms provide a natural way to engineer multiagent systems by capturing how to govern the interactions between two or more autonomous parties. We extract norms of commitment, prohibition, authorization, and power, along with associated norm elements (the parties involved, antecedents, and consequents) from contracts. Our investigation reveals ChatGPT's effectiveness and limitations in norm extraction from contracts. ChatGPT demonstrates promising performance in norm extraction without requiring training or fine-tuning, thus obviating the need for annotated data, which is not generally available in this domain. However, we found some limitations of ChatGPT in extracting these norms that lead to incorrect norm extractions. The limitations include oversight of crucial details, hallucination, incorrect parsing of conjunctions, and empty norm elements. Enhanced norm extraction from contracts can foster the development of more transparent and trustworthy formal agent interaction specifications, thereby contributing to the improvement of multiagent systems.
摘要：我们研究了 ChatGPT 从合同中提取规范的有效性。规范通过捕获如何管理两个或多个自治方之间的交互，提供了一种设计多智能体系统的自然方法。我们从合同中提取承诺、禁止、授权和权力的规范，以及相关的规范要素（涉及各方、前因和后果）。我们的调查揭示了 ChatGPT 在从合同中提取规范方面的有效性和局限性。 ChatGPT 在范数提取方面表现出了良好的性能，无需训练或微调，从而无需注释数据，而注释数据在该领域通常不可用。然而，我们发现 ChatGPT 在提取这些范数时存在一些局限性，导致范数提取不正确。这些限制包括对关键细节的忽视、幻觉、连词解析不正确以及空洞的规范元素。增强从合同中提取规范可以促进开发更透明和值得信赖的正式代理交互规范，从而有助于改进多代理系统。

Title: Collapse of Self-trained Language Models

Authors: David Herel, Tomas Mikolov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02305
Pdf URL: https://arxiv.org/pdf/2404.02305
Copy Paste: [[2404.02305]] Collapse of Self-trained Language Models(https://arxiv.org/abs/2404.02305)
Keywords: language model, gpt
Abstract: In various fields of knowledge creation, including science, new ideas often build on pre-existing information. In this work, we explore this concept within the context of language models. Specifically, we explore the potential of self-training models on their own outputs, akin to how humans learn and build on their previous thoughts and actions. While this approach is intuitively appealing, our research reveals its practical limitations. We find that extended self-training of the GPT-2 model leads to a significant degradation in performance, resulting in repetitive and collapsed token output.
摘要：在包括科学在内的知识创造的各个领域，新想法通常建立在现有信息的基础上。在这项工作中，我们在语言模型的背景下探索这个概念。具体来说，我们探索自我训练模型在其自身输出上的潜力，类似于人类如何学习并建立在他们之前的想法和行动的基础上。虽然这种方法直观上很有吸引力，但我们的研究揭示了其实际局限性。我们发现 GPT-2 模型的扩展自我训练会导致性能显着下降，导致代币输出重复和崩溃。

Title: Prompts As Programs: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization

Authors: Tobias Schnabel, Jennifer Neville
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02319
Pdf URL: https://arxiv.org/pdf/2404.02319
Copy Paste: [[2404.02319]] Prompts As Programs: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization(https://arxiv.org/abs/2404.02319)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) can now handle longer and more complex inputs, which facilitate the use of more elaborate prompts. However, prompts often require some tuning to improve performance for deployment. Recent work has proposed automatic prompt optimization methods, but as prompt complexity and LLM strength increase, many prompt optimization techniques are no longer sufficient and a new approach is needed to optimize meta prompt programs. To address this, we introduce SAMMO, a framework for compile-time optimizations of metaprompt programs, which represents prompts as structured objects that allow for a rich set of transformations that can be searched over during optimization. We show that SAMMO generalizes previous methods and improves the performance of complex prompts on (1) instruction tuning, (2) RAG pipeline tuning, and (3) prompt compression, across several different LLMs. We make all code available open-source at https://github.com/microsoft/sammo.
摘要：大型语言模型 (LLM) 现在可以处理更长、更复杂的输入，这有助于使用更详细的提示。但是，提示通常需要进行一些调整才能提高部署性能。最近的工作提出了自动提示优化方法，但随着提示复杂性和LLM强度的增加，许多提示优化技术不再足够，需要一种新的方法来优化{\em元提示程序}。为了解决这个问题，我们引入了 SAMMO，一个用于元提示程序的编译时优化的框架，它将提示表示为结构化对象，允许在优化期间搜索丰富的转换集。我们表明，SAMMO 概括了以前的方法，并在多个不同的 LLM 中提高了 (1) 指令调整、(2) RAG 管道调整和 (3) 提示压缩的复杂提示的性能。我们在 https://github.com/microsoft/sammo 上开源所有代码。

Title: Toward Informal Language Processing: Knowledge of Slang in Large Language Models

Authors: Zhewei Sun, Qian Hu, Rahul Gupta, Richard Zemel, Yang Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02323
Pdf URL: https://arxiv.org/pdf/2404.02323
Copy Paste: [[2404.02323]] Toward Informal Language Processing: Knowledge of Slang in Large Language Models(https://arxiv.org/abs/2404.02323)
Keywords: language model, gpt, llm
Abstract: Recent advancement in large language models (LLMs) has offered a strong potential for natural language systems to process informal language. A representative form of informal language is slang, used commonly in daily conversations and online social media. To date, slang has not been comprehensively evaluated in LLMs due partly to the absence of a carefully designed and publicly accessible benchmark. Using movie subtitles, we construct a dataset that supports evaluation on a diverse set of tasks pertaining to automatic processing of slang. For both evaluation and finetuning, we show the effectiveness of our dataset on two core applications: 1) slang detection, and 2) identification of regional and historical sources of slang from natural sentences. We also show how our dataset can be used to probe the output distributions of LLMs for interpretive insights. We find that while LLMs such as GPT-4 achieve good performance in a zero-shot setting, smaller BERT-like models finetuned on our dataset achieve comparable performance. Furthermore, we show that our dataset enables finetuning of LLMs such as GPT-3.5 that achieve substantially better performance than strong zero-shot baselines. Our work offers a comprehensive evaluation and a high-quality benchmark on English slang based on the OpenSubtitles corpus, serving both as a publicly accessible resource and a platform for applying tools for informal language processing.
摘要：大型语言模型（LLM）的最新进展为自然语言系统处理非正式语言提供了强大的潜力。非正式语言的代表形式是俚语，常用于日常对话和在线社交媒体中。迄今为止，俚语尚未在法学硕士中得到全面评估，部分原因是缺乏精心设计且可公开访问的基准。使用电影字幕，我们构建了一个数据集，支持对与俚语自动处理相关的各种任务进行评估。对于评估和微调，我们展示了我们的数据集在两个核心应用程序上的有效性：1）俚语检测，2）从自然句子中识别俚语的区域和历史来源。我们还展示了如何使用我们的数据集来探测法学硕士的输出分布以获得解释性见解。我们发现，虽然 GPT-4 等 LLM 在零样本设置中实现了良好的性能，但在我们的数据集上进行微调的较小的类 BERT 模型也实现了相当的性能。此外，我们还表明，我们的数据集可以对 GPT-3.5 等 LLM 进行微调，从而实现比强大的零样本基线更好的性能。我们的工作基于 OpenSubtitles 语料库提供了对英语俚语的全面评估和高质量基准，既作为可公开访问的资源，又作为应用非正式语言处理工具的平台。

Title: Comparative Study of Domain Driven Terms Extraction Using Large Language Models

Authors: Sandeep Chataut, Tuyen Do, Bichar Dip Shrestha Gurung, Shiva Aryal, Anup Khanal, Carol Lushbough, Etienne Gnimpieba
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02330
Pdf URL: https://arxiv.org/pdf/2404.02330
Copy Paste: [[2404.02330]] Comparative Study of Domain Driven Terms Extraction Using Large Language Models(https://arxiv.org/abs/2404.02330)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Keywords play a crucial role in bridging the gap between human understanding and machine processing of textual data. They are essential to data enrichment because they form the basis for detailed annotations that provide a more insightful and in-depth view of the underlying data. Keyword/domain driven term extraction is a pivotal task in natural language processing, facilitating information retrieval, document summarization, and content categorization. This review focuses on keyword extraction methods, emphasizing the use of three major Large Language Models (LLMs): Llama2-7B, GPT-3.5, and Falcon-7B. We employed a custom Python package to interface with these LLMs, simplifying keyword extraction. Our study, utilizing the Inspec and PubMed datasets, evaluates the performance of these models. The Jaccard similarity index was used for assessment, yielding scores of 0.64 (Inspec) and 0.21 (PubMed) for GPT-3.5, 0.40 and 0.17 for Llama2-7B, and 0.23 and 0.12 for Falcon-7B. This paper underlines the role of prompt engineering in LLMs for better keyword extraction and discusses the impact of hallucination in LLMs on result evaluation. It also sheds light on the challenges in using LLMs for keyword extraction, including model complexity, resource demands, and optimization techniques.
摘要：关键词在弥合人类理解和机器处理文本数据之间的差距方面发挥着至关重要的作用。它们对于数据丰富至关重要，因为它们构成了详细注释的基础，可以提供对基础数据更有洞察力和更深入的了解。关键词/领域驱动的术语提取是自然语言处理中的一项关键任务，有助于信息检索、文档摘要和内容分类。本次综述重点关注关键词提取方法，强调三种主要大型语言模型（LLM）的使用：Llama2-7B、GPT-3.5 和 Falcon-7B。我们采用了自定义 Python 包来与这些法学硕士进行交互，从而简化了关键字提取。我们的研究利用 Inspec 和 PubMed 数据集评估了这些模型的性能。使用 Jaccard 相似性指数进行评估，GPT-3.5 的得分为 0.64 (Inspec) 和 0.21 (PubMed)，Llama2-7B 的得分为 0.40 和 0.17，Falcon-7B 的得分为 0.23 和 0.12。本文强调了法学硕士中提示工程对于更好地提取关键词的作用，并讨论了法学硕士中的幻觉对结果评估的影响。它还揭示了使用法学硕士进行关键字提取的挑战，包括模型复杂性、资源需求和优化技术。
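
Note: the Jaccard similarity index used for assessment in the abstract above is the size of the intersection of two sets divided by the size of their union. A minimal sketch for keyword lists follows. The function name jaccard, the lowercasing and whitespace stripping, and the example keywords are assumptions for illustration; the abstract does not say how the authors normalized keywords.

```python
def jaccard(predicted, reference):
    """Jaccard index |A & B| / |A | B| between two keyword collections."""
    a = {k.strip().lower() for k in predicted}   # normalization is an assumption
    b = {k.strip().lower() for k in reference}
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

# Hypothetical example: 2 shared keywords out of 4 distinct ones.
model_keywords = ["neural network", "Keyword Extraction", "LLM"]
gold_keywords = ["keyword extraction", "llm", "prompt engineering"]
score = jaccard(model_keywords, gold_keywords)
```

Because the index only rewards exact set overlap, a hallucinated keyword enlarges the union and lowers the score even when every gold keyword is found.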

Title: Multi-BERT: Leveraging Adapters and Prompt Tuning for Low-Resource Multi-Domain Adaptation

Authors: Parham Abed Azad, Hamid Beigy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02335
Pdf URL: https://arxiv.org/pdf/2404.02335
Copy Paste: [[2404.02335]] Multi-BERT: Leveraging Adapters and Prompt Tuning for Low-Resource Multi-Domain Adaptation(https://arxiv.org/abs/2404.02335)
Keywords: prompt
Abstract: The rapid expansion of texts' volume and diversity presents formidable challenges in multi-domain settings. These challenges are also visible in the Persian named entity recognition (NER) settings. Traditional approaches, either employing a unified model for multiple domains or individual models for each domain, frequently pose significant limitations. Single models often struggle to capture the nuances of diverse domains, while utilizing multiple large models can lead to resource constraints, rendering the training of a model for each domain virtually impractical. Therefore, this paper introduces a novel approach composed of one core model with multiple sets of domain-specific parameters. We utilize techniques such as prompt tuning and adapters, combined with the incorporation of additional layers, to add parameters that we can train for the specific domains. This enables the model to perform comparably to individual models for each domain. Experimental results on different formal and informal datasets show that by employing these added parameters, the proposed model significantly surpasses existing practical models in performance. Remarkably, the proposed model requires only one instance for training and storage, yet achieves outstanding results across all domains, even surpassing the state-of-the-art in some. Moreover, we analyze each adaptation strategy, delineating its strengths, weaknesses, and optimal hyper-parameters for the Persian NER settings. Finally, we introduce a document-based domain detection pipeline tailored for scenarios with unknown text domains, enhancing the adaptability and practicality of this paper in real-world applications.
摘要：文本数量和多样性的快速增长给多领域环境带来了巨大的挑战。这些挑战在波斯语名称实体识别 (NER) 设置中也很明显。传统方法，要么为多个领域采用统一模型，要么为每个领域采用单独的模型，通常会带来很大的局限性。单个模型通常很难捕捉不同领域的细微差别，而使用多个大型模型可能会导致资源限制，从而使每个领域的模型训练几乎不切实际。因此，本文介绍了一种由一个核心模型和多组特定领域参数组成的新方法。我们利用提示调整和适配器等技术，结合附加层的结合，来添加我们可以针对特定领域进行训练的参数。这使得该模型的性能与每个领域的单个模型相当。在不同正式和非正式数据集上的实验结果表明，通过使用这些添加的参数，所提出的模型在性能上显着超过了现有的实际模型。值得注意的是，所提出的模型只需要一个实例来进行训练和存储，但在所有领域都取得了出色的结果，甚至在某些领域超越了最先进的技术。此外，我们分析了每种适应策略，描述其优点、缺点以及波斯 NER 设置的最佳超参数。最后，我们引入了一种针对未知文本域场景的基于文档的域检测管道，增强了本文在实际应用中的适应性和实用性。

Title: Two Heads are Better than One: Nested PoE for Robust Defense Against Multi-Backdoors

Authors: Victoria Graf, Qin Liu, Muhao Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02356
Pdf URL: https://arxiv.org/pdf/2404.02356
Copy Paste: [[2404.02356]] Two Heads are Better than One: Nested PoE for Robust Defense Against Multi-Backdoors(https://arxiv.org/abs/2404.02356)
Keywords: language model, llm
Abstract: Data poisoning backdoor attacks can cause undesirable behaviors in large language models (LLMs), and defending against them is of increasing importance. Existing defense mechanisms often assume that only one type of trigger is adopted by the attacker, while defending against multiple simultaneous and independent trigger types necessitates general defense frameworks and is relatively unexplored. In this paper, we propose the Nested Product of Experts (NPoE) defense framework, which involves a mixture of experts (MoE) as a trigger-only ensemble within the PoE defense framework to simultaneously defend against multiple trigger types. During NPoE training, the main model is trained in an ensemble with a mixture of smaller expert models that learn the features of backdoor triggers. At inference time, only the main model is used. Experimental results on sentiment analysis, hate speech detection, and question classification tasks demonstrate that NPoE effectively defends against a variety of triggers both separately and in trigger mixtures. Due to the versatility of the MoE structure in NPoE, this framework can be further expanded to defend against other attack settings.
摘要：数据中毒后门攻击可能会在大型语言模型 (LLM) 中导致不良行为，因此防御它们变得越来越重要。现有的防御机制通常假设攻击者只采用一种类型的触发，而防御多种同时且独立的触发类型则需要通用的防御框架，并且相对未经探索。在本文中，我们提出了专家嵌套产品（NPoE）防御框架，其中涉及专家混合（MoE）作为 PoE 防御框架内的仅触发集成，以同时防御多种触发类型。在 NPoE 训练期间，主模型在与学习后门触发器特征的较小专家模型的混合体中进行训练。在推理时，仅使用主模型。情感分析、仇恨言论检测和问题分类任务的实验结果表明，NPoE 可以有效地防御各种单独的触发因素和混合触发因素。由于NPoE中MoE结构的多功能性，该框架可以进一步扩展以防御其他攻击设置
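
Note: a product of experts combines class distributions by multiplying them elementwise and renormalizing. The sketch below shows that generic combination with a mixture of small experts standing in for the trigger-only side, as the abstract above describes at a high level. The fixed mixture weights, the function names, and the toy probabilities are assumptions for illustration; how NPoE actually gates and trains its experts is not specified in the abstract.

```python
import math

def normalize(values):
    """Rescale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def mixture(expert_probs, weights):
    """Weighted average of the experts' class distributions (mixture of experts)."""
    n_classes = len(expert_probs[0])
    return [sum(w * p[i] for w, p in zip(weights, expert_probs))
            for i in range(n_classes)]

def product_of_experts(main_probs, trigger_probs):
    """Elementwise product of two class distributions, renormalized."""
    return normalize([m * t for m, t in zip(main_probs, trigger_probs)])

# Toy two-class example (hypothetical numbers).
main = [0.5, 0.5]                      # main model is undecided
experts = [[0.9, 0.1], [0.7, 0.3]]     # small experts latch onto a trigger feature
trigger_only = mixture(experts, [0.5, 0.5])
ensemble = product_of_experts(main, trigger_only)

# Training loss would be computed on the ensemble; inference uses `main` alone.
label = 0
loss = -math.log(ensemble[label])
```

The intuition behind this family of defenses is that when the small experts already explain a poisoned example, the ensemble loss is low and the main model receives little gradient pushing it to learn the trigger.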

Title: On Linearizing Structured Data in Encoder-Decoder Language Models: Insights from Text-to-SQL

Authors: Yutong Shao, Ndapa Nakashole
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02389
Pdf URL: https://arxiv.org/pdf/2404.02389
Copy Paste: [[2404.02389]] On Linearizing Structured Data in Encoder-Decoder Language Models: Insights from Text-to-SQL(https://arxiv.org/abs/2404.02389)
Keywords: language model, llm
Abstract: Structured data, prevalent in tables, databases, and knowledge graphs, poses a significant challenge in its representation. With the advent of large language models (LLMs), there has been a shift towards linearization-based methods, which process structured data as sequential token streams, diverging from approaches that explicitly model structure, often as a graph. Crucially, there remains a gap in our understanding of how these linearization-based methods handle structured data, which is inherently non-linear. This work investigates the linear handling of structured data in encoder-decoder language models, specifically T5. Our findings reveal the model's ability to mimic human-designed processes such as schema linking and syntax prediction, indicating a deep, meaningful learning of structure beyond simple token sequencing. We also uncover insights into the model's internal mechanisms, including the ego-centric nature of structure node encodings and the potential for model compression due to modality fusion redundancy. Overall, this work sheds light on the inner workings of linearization-based methods and could potentially provide guidance for future research.
摘要：表格、数据库和知识图中普遍存在的结构化数据对其表示提出了重大挑战。随着大型语言模型 (LLM) 的出现，人们开始转向基于线性化的方法，这种方法将结构化数据作为顺序标记流进行处理，这与通常以图形形式显式建模结构的方法不同。至关重要的是，我们对这些基于线性化的方法如何处理本质上非线性的结构化数据的理解仍然存在差距。这项工作研究了编码器-解码器语言模型（特别是 T5）中结构化数据的线性处理。我们的研究结果揭示了该模型能够模仿人类设计的过程，例如模式链接和语法预测，这表明对结构的深入、有意义的学习超出了简单的标记排序。我们还揭示了对模型内部机制的见解，包括结构节点编码的以自我为中心的性质以及由于模态融合冗余而导致的模型压缩的潜力。总的来说，这项工作揭示了基于线性化的方法的内部工作原理，并可能为未来的研究提供指导。

Title: Low-resource neural machine translation with morphological modeling

Authors: Antoine Nzeyimana
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02392
Pdf URL: https://arxiv.org/pdf/2404.02392
Copy Paste: [[2404.02392]] Low-resource neural machine translation with morphological modeling(https://arxiv.org/abs/2404.02392)
Keywords: language model
Abstract: Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation for morphologically-rich languages. However, existing methods such as sub-word tokenization and character-based models are limited to the surface forms of the words. In this work, we propose a framework-solution for modeling complex morphology in low-resource settings. A two-tier transformer architecture is chosen to encode morphological information at the inputs. At the target-side output, a multi-task multi-label training scheme coupled with a beam search-based decoder are found to improve machine translation performance. An attention augmentation scheme to the transformer model is proposed in a generic form to allow integration of pre-trained language models and also facilitate modeling of word order relationships between the source and target languages. Several data augmentation techniques are evaluated and shown to increase translation performance in low-resource settings. We evaluate our proposed solution on Kinyarwanda - English translation using public-domain parallel text. Our final models achieve competitive performance in relation to large multi-lingual models. We hope that our results will motivate more use of explicit morphological information and the proposed model and data augmentations in low-resource NMT.
摘要：神经机器翻译 (NMT) 中的形态建模是实现形态丰富的语言的开放词汇机器翻译的一种有前景的方法。然而，现有的方法（例如子词标记化和基于字符的模型）仅限于单词的表面形式。在这项工作中，我们提出了一个框架解决方案，用于在资源匮乏的环境中对复杂形态进行建模。选择两层变压器架构来对输入处的形态信息进行编码。在目标端输出，多任务多标签训练方案与基于波束搜索的解码器相结合，可以提高机器翻译性能。以通用形式提出了变压器模型的注意力增强方案，以允许集成预先训练的语言模型，并且还有助于对源语言和目标语言之间的词序关系进行建模。评估了几种数据增强技术，并证明它们可以提高资源匮乏环境中的翻译性能。我们评估了我们提出的基尼亚卢旺达语解决方案 - 使用公共领域并行文本的英语翻译。我们的最终模型实现了与大型多语言模型相比的竞争性能。我们希望我们的结果将激励更多地使用显式形态信息以及在低资源 NMT 中所提出的模型和数据增强。

Title: Token Trails: Navigating Contextual Depths in Conversational AI with ChatLLM

Authors: Md. Kowsher, Ritesh Panditi, Nusrat Jahan Prottasha, Prakash Bhat, Anupam Kumar Bairagi, Mohammad Shamsul Arefin
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02402
Pdf URL: https://arxiv.org/pdf/2404.02402
Copy Paste: [[2404.02402]] Token Trails: Navigating Contextual Depths in Conversational AI with ChatLLM(https://arxiv.org/abs/2404.02402)
Keywords: language model, llm, chat
Abstract: Conversational modeling using Large Language Models (LLMs) requires a nuanced understanding of context to generate coherent and contextually relevant responses. In this paper, we present Token Trails, a novel approach that leverages token-type embeddings to navigate the intricate contextual nuances within conversations. Our framework utilizes token-type embeddings to distinguish between user utterances and bot responses, facilitating the generation of context-aware replies. Through comprehensive experimentation and evaluation, we demonstrate the effectiveness of Token Trails in improving conversational understanding and response generation, achieving state-of-the-art performance. Our results highlight the significance of contextual modeling in conversational AI and underscore the promising potential of Token Trails to advance the field, paving the way for more sophisticated and contextually aware chatbot interactions.
摘要：使用大型语言模型 (LLM) 的对话建模需要对上下文有细致入微的理解，才能生成连贯且与上下文相关的响应。在本文中，我们提出了 Token Trails，这是一种利用令牌类型嵌入来导航对话中复杂的上下文细微差别的新颖方法。我们的框架利用令牌类型嵌入来区分用户话语和机器人响应，从而促进上下文感知回复的生成。通过全面的实验和评估，我们证明了 Token Trails 在提高对话理解和响应生成方面的有效性，实现了最先进的性能。我们的结果强调了上下文建模在对话式人工智能中的重要性，并强调了 Token Trails 推动该领域发展的巨大潜力，为更复杂和上下文感知的聊天机器人交互铺平了道路。
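
Note: token-type embeddings, as mentioned in the abstract above, tag each token with the kind of speaker that produced it and add a learned vector for that tag to the token's embedding. The sketch below shows only that bookkeeping, in plain Python. Whitespace tokenization, the USER/BOT ids, the function names, and the toy vectors are assumptions for illustration and do not reflect the paper's actual architecture.

```python
USER, BOT = 0, 1  # hypothetical token-type ids

def token_type_ids(turns):
    """Flatten (speaker, text) turns into tokens plus one type id per token."""
    tokens, type_ids = [], []
    for speaker, text in turns:
        words = text.split()  # whitespace split stands in for a real tokenizer
        tokens.extend(words)
        type_ids.extend([USER if speaker == "user" else BOT] * len(words))
    return tokens, type_ids

def add_type_embeddings(token_vectors, type_ids, type_table):
    """Add the speaker-type vector to each token vector, elementwise."""
    return [[x + y for x, y in zip(vec, type_table[t])]
            for vec, t in zip(token_vectors, type_ids)]

turns = [("user", "hi there"), ("bot", "hello")]
tokens, type_ids = token_type_ids(turns)

# Toy 2-dimensional embeddings (made-up values).
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
type_table = {USER: [0.5, 0.0], BOT: [0.0, 0.5]}
inputs = add_type_embeddings(token_vectors, type_ids, type_table)
```

In a trained model the type vectors would be learned parameters, so the model can tell user utterances from bot responses even when the surface text is identical.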

Title: Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT

Authors: Amirhossein Abaskohi, Sara Baruni, Mostafa Masoudi, Nesa Abbasi, Mohammad Hadi Babalou, Ali Edalat, Sepehr Kamahi, Samin Mahdizadeh Sani, Nikoo Naghavian, Danial Namazifard, Pouya Sadeghi, Yadollah Yaghoobzadeh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02403
Pdf URL: https://arxiv.org/pdf/2404.02403
Copy Paste: [[2404.02403]] Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT(https://arxiv.org/abs/2404.02403)
Keywords: language model, gpt, llm, chat
Abstract: This paper explores the efficacy of large language models (LLMs) for Persian. While ChatGPT and subsequent LLMs have shown remarkable performance in English, their efficiency for lower-resource languages remains an open question. We present the first comprehensive benchmarking study of LLMs across diverse Persian language tasks. Our primary focus is on GPT-3.5-turbo, but we also include GPT-4 and OpenChat-3.5 to provide a more holistic evaluation. Our assessment encompasses a diverse set of tasks categorized into classic, reasoning, and knowledge-based domains. To enable a thorough comparison, we evaluate LLMs against existing task-specific fine-tuned models. Given the limited availability of Persian datasets for reasoning tasks, we introduce two new benchmarks: one based on elementary school math questions and another derived from the entrance exams for 7th and 10th grades. Our findings reveal that while LLMs, especially GPT-4, excel in tasks requiring reasoning abilities and a broad understanding of general knowledge, they often lag behind smaller pre-trained models fine-tuned specifically for particular tasks. Additionally, we observe improved performance when test sets are translated to English before inputting them into GPT-3.5. These results highlight the significant potential for enhancing LLM performance in the Persian language. This is particularly noteworthy due to the unique attributes of Persian, including its distinct alphabet and writing styles.
摘要：本文探讨了波斯语大型语言模型 (LLM) 的功效。虽然 ChatGPT 和随后的法学硕士在英语方面表现出色，但它们在资源匮乏的语言上的效率仍然是一个悬而未决的问题。我们提出了第一个针对不同波斯语任务的法学硕士的全面基准研究。我们的主要关注点是 GPT-3.5-turbo，但我们也包括 GPT-4 和 OpenChat-3.5，以提供更全面的评估。我们的评估涵盖了一系列不同的任务，分为经典、推理和基于知识的领域。为了进行彻底的比较，我们根据现有的特定于任务的微调模型来评估法学硕士。鉴于用于推理任务的波斯语数据集的可用性有限，我们引入了两个新的基准：一个基于小学数学问题，另一个来自七年级和十年级的入学考试。我们的研究结果表明，虽然法学硕士（尤其是 GPT-4）在需要推理能力和对一般知识有广泛理解的任务中表现出色，但它们往往落后于专门针对特定任务进行微调的较小的预训练模型。此外，我们观察到，在将测试集输入 GPT-3.5 之前将其翻译成英语时，性能有所提高。这些结果凸显了提高波斯语法学硕士成绩的巨大潜力。由于波斯语的独特属性，包括其独特的字母和书写风格，这一点尤其值得注意。

Title: Auxiliary task demands mask the capabilities of smaller language models

Authors: Jennifer Hu, Michael C. Frank
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02418
Pdf URL: https://arxiv.org/pdf/2404.02418
Copy Paste: [[2404.02418]] Auxiliary task demands mask the capabilities of smaller language models(https://arxiv.org/abs/2404.02418)
Keywords: language model
Abstract: Developmental psychologists have argued about when cognitive capacities such as language understanding or theory of mind emerge. These debates often hinge on the concept of "task demands" -- the auxiliary challenges associated with performing a particular evaluation -- that may mask the child's underlying ability. The same issues arise when measuring the capacities of language models (LMs): performance on a task is a function of the model's underlying competence, combined with the model's ability to interpret and perform the task given its available resources. Here, we show that for analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments, evaluation methods with greater task demands yield lower performance than evaluations with reduced demands. This "demand gap" is most pronounced for models with fewer parameters and less training data. Our results illustrate that LM performance should not be interpreted as a direct indication of intelligence (or lack thereof), but as a reflection of capacities seen through the lens of researchers' design choices.
摘要：发展心理学家一直争论语言理解或心理理论等认知能力何时出现。这些争论通常取决于“任务要求”的概念——与执行特定评估相关的辅助挑战——这可能掩盖了孩子的潜在能力。在衡量语言模型 (LM) 的能力时，也会出现同样的问题：任务的性能是模型基本能力的函数，再加上模型在给定可用资源的情况下解释和执行任务的能力。在这里，我们表明，对于类比推理、反思推理、单词预测和语法判断，任务要求较高的评估方法比要求减少的评估方法产生的性能更低。对于参数较少和训练数据较少的模型，这种“需求差距”最为明显。我们的结果表明，LM 的表现不应被解释为智力（或缺乏智力）的直接指示，而应被解释为通过研究人员的设计选择的视角所看到的能力的反映。

Title: Revisiting subword tokenization: A case study on affixal negation in large language models

Authors: Thinh Hung Truong, Yulia Otmakhova, Karin Verspoor, Trevor Cohn, Timothy Baldwin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02421
Pdf URL: https://arxiv.org/pdf/2404.02421
Copy Paste: [[2404.02421]] Revisiting subword tokenization: A case study on affixal negation in large language models(https://arxiv.org/abs/2404.02421)
Keywords: language model, llm
Abstract: In this work, we measure the impact of affixal negation on modern English large language models (LLMs). In affixal negation, the negated meaning is expressed through a negative morpheme, which is potentially challenging for LLMs as their tokenizers are often not morphologically plausible. We conduct extensive experiments using LLMs with different subword tokenization methods, which lead to several insights on the interaction between tokenization performance and negation sensitivity. Despite some interesting mismatches between tokenization accuracy and negation detection performance, we show that models can, on the whole, reliably recognize the meaning of affixal negation.

Title: Enhancing Low-Resource LLMs Classification with PEFT and Synthetic Data

Authors: Parth Patwa, Simone Filice, Zhiyu Chen, Giuseppe Castellucci, Oleg Rokhlenko, Shervin Malmasi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02422
Pdf URL: https://arxiv.org/pdf/2404.02422
Copy Paste: [[2404.02422]] Enhancing Low-Resource LLMs Classification with PEFT and Synthetic Data(https://arxiv.org/abs/2404.02422)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) operating in 0-shot or few-shot settings achieve competitive results in Text Classification tasks. In-Context Learning (ICL) typically achieves better accuracy than the 0-shot setting, but it pays in terms of efficiency, due to the longer input prompt. In this paper, we propose a strategy to make LLMs as efficient as 0-shot text classifiers, while getting comparable or better accuracy than ICL. Our solution targets the low resource setting, i.e., when only 4 examples per class are available. Using a single LLM and few-shot real data we perform a sequence of generation, filtering and Parameter-Efficient Fine-Tuning steps to create a robust and efficient classifier. Experimental results show that our approach leads to competitive results on multiple text classification datasets.

Title: On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons

Authors: Takeshi Kojima, Itsuki Okimura, Yusuke Iwasawa, Hitomi Yanaka, Yutaka Matsuo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02431
Pdf URL: https://arxiv.org/pdf/2404.02431
Copy Paste: [[2404.02431]] On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons(https://arxiv.org/abs/2404.02431)
Keywords: language model
Abstract: Current decoder-based pre-trained language models (PLMs) successfully demonstrate multilingual capabilities. However, it is unclear how these models handle multilingualism. We analyze the neuron-level internal behavior of multilingual decoder-based PLMs, specifically examining the existence of neurons that fire "uniquely for each language" within decoder-only multilingual PLMs. We analyze six languages: English, German, French, Spanish, Chinese, and Japanese, and show that language-specific neurons are unique, with a slight overlap (< 5%) between languages. These neurons are mainly distributed in the models' first and last few layers. This trend remains consistent across languages and models. Additionally, we tamper with less than 1% of the total neurons in each model during inference and demonstrate that tampering with a few language-specific neurons drastically changes the probability of target language occurrence in text generation.

Title: From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives

Authors: Shuxian Fan, Adam Visokay, Kentaro Hoffman, Stephen Salerno, Li Liu, Jeffrey T. Leek, Tyler H. McCormick
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2404.02438
Pdf URL: https://arxiv.org/pdf/2404.02438
Copy Paste: [[2404.02438]] From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives(https://arxiv.org/abs/2404.02438)
Keywords: language model, gpt
Abstract: In settings where most deaths occur outside the healthcare system, verbal autopsies (VAs) are a common tool to monitor trends in causes of death (COD). VAs are interviews with a surviving caregiver or relative that are used to predict the decedent's COD. Turning VAs into actionable insights for researchers and policymakers requires two steps: (i) predicting likely COD using the VA interview and (ii) performing inference with predicted CODs (e.g., modeling the breakdown of causes by demographic factors using a sample of deaths). In this paper, we develop a method for valid inference using outcomes (in our case COD) predicted from free-form text using state-of-the-art NLP techniques. This method, which we call multiPPI++, extends recent work in "prediction-powered inference" to multinomial classification. We leverage a suite of NLP techniques for COD prediction and, through empirical analysis of VA data, demonstrate the effectiveness of our approach in handling transportability issues. multiPPI++ recovers ground truth estimates, regardless of which NLP model produced predictions and regardless of whether they were produced by a more accurate predictor like GPT-4-32k or a less accurate predictor like KNN. Our findings demonstrate the practical importance of inference correction for public health decision-making and suggest that if inference tasks are the end goal, having a small amount of contextually relevant, high quality labeled data is essential regardless of the NLP algorithm.
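
Illustration (not from the paper): the abstract above builds on "prediction-powered inference". A minimal sketch of the basic prediction-powered estimate of class proportions is given below, assuming a large set with only model predictions and a small set with both predictions and true labels. The function names are hypothetical, and this is the plain point estimate only; it is not multiPPI++ itself and it produces no confidence intervals.

```python
from collections import Counter


def proportions(labels, classes):
    """Share of each class in a list of labels."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[c] / n for c in classes]


def ppi_proportions(pred_unlabeled, pred_labeled, true_labeled, classes):
    """Basic prediction-powered point estimate of class proportions.

    estimate = predicted proportions on the large unlabeled set
               + (true proportions - predicted proportions) on the labeled set

    The second term (the "rectifier") corrects for systematic prediction bias.
    Individual entries can fall slightly outside [0, 1] on small samples.
    """
    if len(pred_labeled) != len(true_labeled):
        raise ValueError("labeled predictions and true labels must align")
    p_unl = proportions(pred_unlabeled, classes)
    p_lab_pred = proportions(pred_labeled, classes)
    p_lab_true = proportions(true_labeled, classes)
    return [u + (t - p) for u, t, p in zip(p_unl, p_lab_true, p_lab_pred)]
```

If the predictor is unbiased on the labeled set, the rectifier is zero and the estimate reduces to the predicted proportions; the larger the predictor's bias, the more the small labeled sample drives the result.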

Title: The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education

Authors: Paiheng Xu, Jing Liu, Nathan Jones, Julie Cohen, Wei Ai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02444
Pdf URL: https://arxiv.org/pdf/2404.02444
Copy Paste: [[2404.02444]] The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education(https://arxiv.org/abs/2404.02444)
Keywords: language model
Abstract: Assessing instruction quality is a fundamental component of any improvement efforts in the education system. However, traditional manual assessments are expensive, subjective, and heavily dependent on observers' expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Different from prior research that mostly focuses on low-inference instructional practices on a singular basis, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. This is also the first study that applies NLP to measure a teaching practice that is widely acknowledged to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis, including noisy and long input data and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) demonstrate performances comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes with more complex teaching practices. Interestingly, using only teachers' utterances as input yields strong results for student-centered variables, alleviating common concerns over the difficulty of collecting and transcribing high-quality student speech data in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration.

Title: Adaptive Cross-lingual Text Classification through In-Context One-Shot Demonstrations

Authors: Emilio Villa-Cueva, A. Pastor López-Monroy, Fernando Sánchez-Vega, Thamar Solorio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02452
Pdf URL: https://arxiv.org/pdf/2404.02452
Copy Paste: [[2404.02452]] Adaptive Cross-lingual Text Classification through In-Context One-Shot Demonstrations(https://arxiv.org/abs/2404.02452)
Keywords: prompt
Abstract: Zero-Shot Cross-lingual Transfer (ZS-XLT) utilizes a model trained in a source language to make predictions in another language, often with a performance loss. To alleviate this, additional improvements can be achieved through subsequent adaptation using examples in the target language. In this paper, we exploit In-Context Tuning (ICT) for One-Shot Cross-lingual transfer in the classification task by introducing In-Context Cross-lingual Transfer (IC-XLT). The novel concept involves training a model to learn from context examples and subsequently adapting it during inference to a target language by prepending a One-Shot context demonstration in that language. Our results show that IC-XLT successfully leverages target-language examples to improve the cross-lingual capabilities of the evaluated mT5 model, outperforming prompt-based models in the Zero and Few-shot scenarios adapted through fine-tuning. Moreover, we show that when source-language data is limited, the fine-tuning framework employed for IC-XLT performs comparably to prompt-based fine-tuning with significantly more training data in the source language.

Title: PhonologyBench: Evaluating Phonological Skills of Large Language Models

Authors: Ashima Suvarna, Harshita Khandelwal, Nanyun Peng
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2404.02456
Pdf URL: https://arxiv.org/pdf/2404.02456
Copy Paste: [[2404.02456]] PhonologyBench: Evaluating Phonological Skills of Large Language Models(https://arxiv.org/abs/2404.02456)
Keywords: language model, llm
Abstract: Phonology, the study of speech's structure and pronunciation rules, is a critical yet often overlooked component in Large Language Model (LLM) research. LLMs are widely used in various downstream applications that leverage phonology such as educational tools and poetry generation. Moreover, LLMs can potentially learn imperfect associations between orthographic and phonological forms from the training data. Thus, it is imperative to benchmark the phonological skills of LLMs. To this end, we present PhonologyBench, a novel benchmark consisting of three diagnostic tasks designed to explicitly test the phonological skills of LLMs in English: grapheme-to-phoneme conversion, syllable counting, and rhyme word generation. Despite having no access to speech data, LLMs showcased notable performance on the PhonologyBench tasks. However, we observe a significant gap of 17% and 45% on Rhyme Word Generation and Syllable counting, respectively, when compared to humans. Our findings underscore the importance of studying LLM performance on phonological tasks that inadvertently impact real-world applications. Furthermore, we encourage researchers to choose LLMs that perform well on the phonological task that is closely related to the downstream application since we find that no single model consistently outperforms the others on all the tasks.

Title: Prompting for Numerical Sequences: A Case Study on Market Comment Generation

Authors: Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2404.02466
Pdf URL: https://arxiv.org/pdf/2404.02466
Copy Paste: [[2404.02466]] Prompting for Numerical Sequences: A Case Study on Market Comment Generation(https://arxiv.org/abs/2404.02466)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have been applied to a wide range of data-to-text generation tasks, including tables, graphs, and time-series numerical data-to-text settings. While research on generating prompts for structured data such as tables and graphs is gaining momentum, in-depth investigations into prompting for time-series numerical data are lacking. Therefore, this study explores various input representations, including sequences of tokens and structured formats such as HTML, LaTeX, and Python-style codes. In our experiments, we focus on the task of Market Comment Generation, which involves taking a numerical sequence of stock prices as input and generating a corresponding market comment. Contrary to our expectations, the results show that prompts resembling programming languages yield better outcomes, whereas those similar to natural languages and longer formats, such as HTML and LaTeX, are less effective. Our findings offer insights into creating effective prompts for tasks that generate text from numerical sequences.
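
Illustration (not from the paper): the abstract above compares input representations for a numerical price sequence, including plain token sequences, HTML, and Python-style code. The sketch below shows what three such representations might look like. The exact formats, the variable name `prices`, and the instruction text are all assumptions made for illustration; the paper's actual prompts may differ.

```python
def as_token_sequence(prices):
    """Plain space-separated numbers."""
    return " ".join(str(p) for p in prices)


def as_python_list(prices, name="prices"):
    """Python-style assignment, e.g. 'prices = [100, 101.5]'."""
    return f"{name} = {list(prices)!r}"


def as_html_table(prices):
    """Single-row HTML table, a longer structured format."""
    cells = "".join(f"<td>{p}</td>" for p in prices)
    return f"<table><tr>{cells}</tr></table>"


def build_prompt(representation):
    """Attach a (hypothetical) instruction to the chosen representation."""
    return f"{representation}\nWrite a one-sentence market comment:"
```

The HTML form carries the same numbers as the Python-style form but is considerably longer, which matches the abstract's grouping of HTML and LaTeX as "longer formats".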

Title: uTeBC-NLP at SemEval-2024 Task 9: Can LLMs be Lateral Thinkers?

Authors: Pouya Sadeghi, Amirhossein Abaskohi, Yadollah Yaghoobzadeh
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02474
Pdf URL: https://arxiv.org/pdf/2404.02474
Copy Paste: [[2404.02474]] uTeBC-NLP at SemEval-2024 Task 9: Can LLMs be Lateral Thinkers?(https://arxiv.org/abs/2404.02474)
Keywords: gpt, llm, prompt, retrieval augmented generation
Abstract: Inspired by human cognition, Jiang et al. (2023c) create a benchmark for assessing LLMs' lateral thinking, i.e., thinking outside the box. Building upon this benchmark, we investigate how different prompting methods enhance LLMs' performance on this task to reveal their inherent power for outside-the-box thinking ability. Through participating in the Sentence Puzzle sub-task of SemEval-2024 Task 9, we explore prompt engineering methods: chain of thoughts (CoT) and direct prompting, enhancing with informative descriptions, and employing contextualizing prompts using a retrieval augmented generation (RAG) pipeline. Our experiments involve three LLMs including GPT-3.5, GPT-4, and Zephyr-7B-beta. We generate a dataset of thinking paths between riddles and options using GPT-4, validated by humans for quality. Findings indicate that compressed informative prompts enhance performance. Dynamic in-context learning enhances model performance significantly. Furthermore, fine-tuning Zephyr on our dataset enhances performance across other commonsense datasets, underscoring the value of innovative thinking.

Title: Measuring Social Norms of Large Language Models

Authors: Ye Yuan, Kexin Tang, Jianhao Shen, Ming Zhang, Chenguang Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02491
Pdf URL: https://arxiv.org/pdf/2404.02491
Copy Paste: [[2404.02491]] Measuring Social Norms of Large Language Models(https://arxiv.org/abs/2404.02491)
Keywords: language model, gpt, chat, agent
Abstract: We present a new challenge to examine whether large language models understand social norms. In contrast to existing datasets, our dataset requires a fundamental understanding of social norms to solve. Our dataset features the largest set of social norm skills, consisting of 402 skills and 12,383 questions covering a wide set of social norms ranging from opinions and arguments to culture and laws. We design our dataset according to the K-12 curriculum. This enables the direct comparison of the social understanding of large language models to humans, more specifically, elementary students. While prior work generates nearly random accuracy on our benchmark, recent large language models such as GPT3.5-Turbo and LLaMA2-Chat are able to improve the performance significantly, only slightly below human performance. We then propose a multi-agent framework based on large language models to improve the models' ability to understand social norms. This method further improves large language models to be on par with humans. Given the increasing adoption of large language models in real-world applications, our finding is particularly important and presents a unique direction for future improvements.

Title: Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Languages

Authors: Vandan Mujadia, Pruthwik Mishra, Arafat Ahsan, Dipti Misra Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02512
Pdf URL: https://arxiv.org/pdf/2404.02512
Copy Paste: [[2404.02512]] Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Languages(https://arxiv.org/abs/2404.02512)
Keywords: language model, llm
Abstract: With the primary focus on evaluating the effectiveness of large language models for automatic reference-less translation assessment, this work presents our experiments on mimicking human direct assessment to evaluate the quality of translations in English and Indian languages. We constructed a translation evaluation task where we performed zero-shot learning, in-context example-driven learning, and fine-tuning of large language models to provide a score out of 100, where 100 represents a perfect translation and 1 represents a poor translation. We compared the performance of our trained systems with existing methods such as COMET, BERT-Scorer, and LABSE, and found that the LLM-based evaluator (LLaMA-2-13B) achieves a comparable or higher overall correlation with human judgments for the considered Indian language pairs.
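
Illustration (not from the paper): the abstract above reports correlation between automatic scores and human judgments. A minimal sketch of one common choice, the Pearson correlation between system scores and human direct-assessment scores, is given below. The abstract does not say which correlation coefficient is used, so Pearson is an assumption here, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the evaluator ranks and spaces translations much as human raters do; a value near 0 means its scores carry little information about human judgments.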

Title: ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model

Authors: Osvaldo Luamba Quinjica, David Ifeoluwa Adelani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02534
Pdf URL: https://arxiv.org/pdf/2404.02534
Copy Paste: [[2404.02534]] ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model(https://arxiv.org/abs/2404.02534)
Keywords: language model
Abstract: In recent years, the development of pre-trained language models (PLMs) has gained momentum, showcasing their capacity to transcend linguistic barriers and facilitate knowledge transfer across diverse languages. However, this progress has predominantly bypassed the inclusion of very-low resource languages, creating a notable void in the multilingual landscape. This paper addresses this gap by introducing four tailored PLMs specifically finetuned for Angolan languages, employing a Multilingual Adaptive Fine-tuning (MAFT) approach. In this paper, we survey the role of informed embedding initialization and synthetic data in enhancing the performance of MAFT models in downstream tasks. We improve baseline over SOTA AfroXLMR-base (developed through MAFT) and OFA (an effective embedding initialization) by 12.3 and 3.8 points respectively.

Title: CSEPrompts: A Benchmark of Introductory Computer Science Prompts

Authors: Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Christian Newman, Tharindu Ranasinghe, Marcos Zampieri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02540
Pdf URL: https://arxiv.org/pdf/2404.02540
Copy Paste: [[2404.02540]] CSEPrompts: A Benchmark of Introductory Computer Science Prompts(https://arxiv.org/abs/2404.02540)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Recent advances in AI, machine learning, and NLP have led to the development of a new generation of Large Language Models (LLMs) that are trained on massive amounts of data and often have trillions of parameters. Commercial applications (e.g., ChatGPT) have made this technology available to the general public, thus making it possible to use LLMs to produce high-quality texts for academic and professional purposes. Schools and universities are aware of the increasing use of AI-generated content by students and they have been researching the impact of this new technology and its potential misuse. Educational programs in Computer Science (CS) and related fields are particularly affected because LLMs are also capable of generating programming code in various programming languages. To help understand the potential impact of publicly available LLMs in CS education, we introduce CSEPrompts, a framework with hundreds of programming exercise prompts and multiple-choice questions retrieved from introductory CS and programming courses. We also provide experimental results on CSEPrompts to evaluate the performance of several LLMs with respect to generating Python code and answering basic computer science and programming questions.

Title: Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models

Authors: Hyungjoo Chae, Yeonghyeon Kim, Seungone Kim, Kai Tzu-iunn Ong, Beong-woo Kwak, Moohyeon Kim, Seonghwan Kim, Taeyoon Kwon, Jiwan Chung, Youngjae Yu, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02575
Pdf URL: https://arxiv.org/pdf/2404.02575
Copy Paste: [[2404.02575]] Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models(https://arxiv.org/abs/2404.02575)
Keywords: language model, llm
Abstract: Algorithmic reasoning refers to the ability to understand the complex patterns behind the problem and decompose them into a sequence of reasoning steps towards the solution. Such nature of algorithmic reasoning makes it a challenge for large language models (LLMs), even though they have demonstrated promising performance in other reasoning tasks. Within this context, some recent studies use programming languages (e.g., Python) to express the necessary logic for solving a given instance/question (e.g., Program-of-Thought) as inspired by their strict and precise syntaxes. However, it is non-trivial to write an executable code that expresses the correct logic on the fly within a single inference call. Also, the code generated specifically for an instance cannot be reused for others, even if they are from the same task and might require identical logic to solve. This paper presents Think-and-Execute, a novel framework that decomposes the reasoning process of language models into two steps. (1) In Think, we discover a task-level logic that is shared across all instances for solving a given task and then express the logic with pseudocode; (2) In Execute, we further tailor the generated pseudocode to each instance and simulate the execution of the code. With extensive experiments on seven algorithmic reasoning tasks, we demonstrate the effectiveness of Think-and-Execute. Our approach better improves LMs' reasoning compared to several strong baselines performing instance-specific reasoning (e.g., CoT and PoT), suggesting the helpfulness of discovering task-level logic. Also, we show that compared to natural language, pseudocode can better guide the reasoning of LMs, even though they are trained to follow natural language instructions.

Title: Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

Authors: Jakub Hoscilowicz, Pawel Pawlowski, Marcin Skorupa, Marcin Sowański, Artur Janicki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02588
Pdf URL: https://arxiv.org/pdf/2404.02588
Copy Paste: [[2404.02588]] Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages(https://arxiv.org/abs/2404.02588)
Keywords: language model, llm
Abstract: Spoken Language Understanding (SLU) models are a core component of voice assistants (VA), such as Alexa, Bixby, and Google Assistant. In this paper, we introduce a pipeline designed to extend SLU systems to new languages, utilizing Large Language Models (LLMs) that we fine-tune for machine translation of slot-annotated SLU training data. Our approach improved on the MultiATIS++ benchmark, a primary multi-language SLU dataset, in the cloud scenario using an mBERT model. Specifically, we saw an improvement in the Overall Accuracy metric: from 53% to 62.18%, compared to the existing state-of-the-art method, Fine and Coarse-grained Multi-Task Learning Framework (FC-MTLF). In the on-device scenario (tiny and not pretrained SLU), our method improved the Overall Accuracy from 5.31% to 22.06% over the baseline Global-Local Contrastive Learning Framework (GL-CLeF) method. Contrary to both FC-MTLF and GL-CLeF, our LLM-based machine translation does not require changes in the production architecture of SLU. Additionally, our pipeline is slot-type independent: it does not require any slot definitions or examples.

Title: Affective-NLI: Towards Accurate and Interpretable Personality Recognition in Conversation

Authors: Zhiyuan Wen, Jiannong Cao, Yu Yang, Ruosong Yang, Shuaiqi Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02589
Pdf URL: https://arxiv.org/pdf/2404.02589
Copy Paste: [[2404.02589]] Affective-NLI: Towards Accurate and Interpretable Personality Recognition in Conversation(https://arxiv.org/abs/2404.02589)
Keywords: language model
Abstract: Personality Recognition in Conversation (PRC) aims to identify the personality traits of speakers through textual dialogue content. It is essential for providing personalized services in various applications of Human-Computer Interaction (HCI), such as AI-based mental therapy and companion robots for the elderly. Most recent studies analyze the dialog content for personality classification yet overlook two major concerns that hinder their performance. First, crucial implicit factors contained in conversation, such as emotions that reflect the speakers' personalities, are ignored. Second, only focusing on the input dialog content disregards the semantic understanding of personality itself, which reduces the interpretability of the results. In this paper, we propose Affective Natural Language Inference (Affective-NLI) for accurate and interpretable PRC. To utilize affectivity within dialog content for accurate personality recognition, we fine-tuned a pre-trained language model specifically for emotion recognition in conversations, facilitating real-time affective annotations for utterances. For interpretability of recognition results, we formulate personality recognition as an NLI problem by determining whether the textual description of personality labels is entailed by the dialog content. Extensive experiments on two daily conversation datasets suggest that Affective-NLI significantly outperforms (by 6%-7%) state-of-the-art approaches. Additionally, our Flow experiment demonstrates that Affective-NLI can accurately recognize the speaker's personality in the early stages of conversations, surpassing state-of-the-art methods by 22%-34%.
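
Illustration (not from the paper): the abstract above casts personality recognition as NLI, asking whether the dialog content entails a textual description of each personality label. The sketch below shows that framing only. The label descriptions, the threshold, and the `entail_prob` scorer are hypothetical placeholders; in practice the scorer would be a trained NLI model, and the paper's affective annotations are not modeled here.

```python
def build_nli_pairs(dialog_utterances, label_descriptions):
    """Pair the dialog (premise) with each label description (hypothesis)."""
    premise = " ".join(dialog_utterances)
    return [
        (label, premise, hypothesis)
        for label, hypothesis in label_descriptions.items()
    ]


def predict_labels(pairs, entail_prob, threshold=0.5):
    """Keep labels whose description is judged entailed by the dialog.

    entail_prob(premise, hypothesis) -> float in [0, 1] is supplied by the
    caller (an assumed interface, e.g. a wrapper around an NLI model).
    """
    return [
        label
        for label, premise, hypothesis in pairs
        if entail_prob(premise, hypothesis) >= threshold
    ]
```

Because each prediction is tied to a readable hypothesis sentence, the output can be inspected as "the dialog entails this description", which is the interpretability argument the abstract makes.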

Title: Estimating the Causal Effects of Natural Logic Features in Transformer-Based NLI Models

Authors: Julia Rozanova, Marco Valentino, André Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02622
Pdf URL: https://arxiv.org/pdf/2404.02622
Copy Paste: [[2404.02622]] Estimating the Causal Effects of Natural Logic Features in Transformer-Based NLI Models(https://arxiv.org/abs/2404.02622)
Keywords: language model
Abstract: Rigorous evaluation of the causal effects of semantic features on language model predictions can be hard to achieve for natural language reasoning problems. However, this is such a desirable form of analysis from both an interpretability and model evaluation perspective, that it is valuable to investigate specific patterns of reasoning with enough structure and regularity to identify and quantify systematic reasoning failures in widely-used models. In this vein, we pick a portion of the NLI task for which an explicit causal diagram can be systematically constructed: the case where across two sentences (the premise and hypothesis), two related words/terms occur in a shared context. In this work, we apply causal effect estimation strategies to measure the effect of context interventions (whose effect on the entailment label is mediated by the semantic monotonicity characteristic) and interventions on the inserted word-pair (whose effect on the entailment label is mediated by the relation between these words). Extending related work on causal analysis of NLP models in different settings, we perform an extensive interventional study on the NLI task to investigate robustness to irrelevant changes and sensitivity to impactful changes of Transformers. The results strongly bolster the fact that similar benchmark accuracy scores may be observed for models that exhibit very different behaviour. Moreover, our methodology reinforces previously suspected biases from a causal perspective, including biases in favour of upward-monotone contexts and ignoring the effects of negation markers.
摘要：对于自然语言推理问题来说，严格评估语义特征对语言模型预测的因果影响可能很难实现。然而，从可解释性和模型评估的角度来看，这是一种理想的分析形式，因此研究具有足够结构和规律性的特定推理模式以识别和量化广泛使用的模型中的系统推理失败是有价值的。本着这种精神，我们选择了 NLI 任务的一部分，可以系统地构建明确的因果图：在两个句子（前提和假设）中，两个相关的单词/术语出现在共享上下文中的情况。在这项工作中，我们应用因果效应估计策略来衡量上下文干预（其对蕴涵标签的影响由语义单调性特征介导）和对插入词对的干预（其对蕴涵标签的影响由这些词之间的关系介导）的效果。扩展不同环境下 NLP 模型因果分析的相关工作，我们对 NLI 任务进行了广泛的干预研究，以调查 Transformers 对不相关变化的稳健性和对有影响的变化的敏感性。结果有力地证明了这样一个事实：对于表现出非常不同行为的模型，可以观察到类似的基准准确度分数。此外，我们的方法从因果角度强化了先前怀疑的偏见，包括支持向上单调背景的偏见和忽略否定标记的影响。

Title: Calibrating the Confidence of Large Language Models by Eliciting Fidelity

Authors: Mozhi Zhang, Mianqiu Huang, Rundong Shi, Linsen Guo, Chong Peng, Peng Yan, Yaqian Zhou, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02655
Pdf URL: https://arxiv.org/pdf/2404.02655
Copy Paste: [[2404.02655]] Calibrating the Confidence of Large Language Models by Eliciting Fidelity(https://arxiv.org/abs/2404.02655)
Keywords: language model
Abstract: Large language models optimized with techniques like RLHF have achieved good alignment in being helpful and harmless. However, post-alignment, these language models often exhibit overconfidence, where the expressed confidence is not well calibrated to their correctness rate. In this paper, we decompose the language model confidence into the Uncertainty about the question and the Fidelity to the answer generated by language models. Then, we propose a plug-and-play method to estimate the confidence of language models. Our method has shown good calibration performance in experiments with 6 RLHF-LMs on four MCQA datasets. Moreover, we propose two novel metrics, IPR and CE, to evaluate the calibration of the model, and we have conducted a detailed discussion on "Truly Well-Calibrated Confidence". Our method could serve as a strong baseline, and we hope that this work will provide some insights into the model confidence calibration.
摘要：使用 RLHF 等技术优化的大型语言模型在有益和无害方面实现了良好的一致性。然而，在对齐后，这些语言模型经常表现出过度自信，其中所表达的置信度不能准确地与其正确率进行校准。在本文中，我们将语言模型置信度分解为关于问题的 Uncertainty 和对语言模型生成的答案的 Fidelity。然后，我们提出了一种即插即用的方法来估计语言模型的置信度。通过在四个 MCQA 数据集上使用 6 个 RLHF-LM 进行实验，我们的方法显示出良好的校准性能。此外，我们提出了两个新颖的指标：IPR 和 CE，来评估模型的校准，并且我们对 "Truly Well-Calibrated Confidence" 进行了详细的讨论。我们的方法可以作为强大的基线，我们希望这项工作将为模型置信度校准提供一些见解。
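Note: the overconfidence described in the abstract above (stated confidence that does not match the correctness rate) can be made concrete with a small worked example. The sketch below computes the standard expected calibration error (ECE), which is a common calibration measure and is not the IPR or CE metric proposed in the paper. The confidence values and correctness flags are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical multiple-choice results: stated confidence and whether the answer was right.
confidences = [0.95, 0.90, 0.92, 0.85, 0.60, 0.55]
correct = [True, False, True, False, True, False]
print(f"ECE = {expected_calibration_error(confidences, correct):.3f}")
```

A value near 0 means stated confidence tracks accuracy; a model that is always confident but often wrong scores high.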

Title: Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models

Authors: Taiqiang Wu, Chaofan Tao, Jiahao Wang, Zhe Zhao, Ngai Wong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02657
Pdf URL: https://arxiv.org/pdf/2404.02657
Copy Paste: [[2404.02657]] Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models(https://arxiv.org/abs/2404.02657)
Keywords: language model, gpt, llm
Abstract: Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable over the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective and both converge after a sufficient number of epochs. However, due to practical constraints, LLMs are seldom trained for such an extensive number of epochs. Meanwhile, we further find that RKL focuses on the tail part of the distributions, while FKL focuses on the head part at the beginning epochs. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
摘要：Kullback-Leibler 散度已广泛用于知识蒸馏 (KD) 中以压缩大型语言模型 (LLM)。先前的断言反向 Kullback-Leibler (RKL) 散度是模式搜索，因此比均值搜索正向 Kullback-Leibler (FKL) 散度更可取，与此相反，本研究从经验和理论上证明，模式搜索和均值搜索属性都不是体现在 LLM 的 KD 中。相反，RKL 和 FKL 具有相同的优化目标，并且在足够数量的 epoch 后都收敛。然而，由于实际限制，LLM 很少训练如此多的 epoch。同时，我们进一步发现 RKL 侧重于分布的尾部部分，而 FKL 侧重于开始时期的头部部分。因此，我们提出了一种简单而有效的自适应 Kullback-Leibler (AKL) 散度方法，该方法自适应地分配权重以组合 FKL 和 RKL。基于指标和基于 GPT-4 的评估表明，所提出的 AKL 在各种任务中都优于基线，并提高了生成响应的多样性和质量。
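Note: for readers unfamiliar with the two divergences compared in the abstract above, the sketch below computes forward KL, KL(teacher || student), and reverse KL, KL(student || teacher), for a pair of discrete distributions. It only illustrates the textbook definitions; it is not the paper's AKL weighting scheme, and the teacher and student distributions are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i == 0 contribute nothing
    )


# Hypothetical next-token distributions over a 4-token vocabulary.
teacher = [0.70, 0.20, 0.05, 0.05]
student = [0.40, 0.40, 0.10, 0.10]

forward_kl = kl_divergence(teacher, student)  # FKL: expectation under the teacher
reverse_kl = kl_divergence(student, teacher)  # RKL: expectation under the student

print(f"FKL = {forward_kl:.4f}, RKL = {reverse_kl:.4f}")
```

The two values differ because KL divergence is asymmetric: each one weights the log-ratio by a different distribution.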

Title: PejorativITy: Disambiguating Pejorative Epithets to Improve Misogyny Detection in Italian Tweets

Authors: Arianna Muti, Federico Ruggeri, Cagri Toraman, Lorenzo Musetti, Samuel Algherini, Silvia Ronchi, Gianmarco Saretto, Caterina Zapparoli, Alberto Barrón-Cedeño
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02681
Pdf URL: https://arxiv.org/pdf/2404.02681
Copy Paste: [[2404.02681]] PejorativITy: Disambiguating Pejorative Epithets to Improve Misogyny Detection in Italian Tweets(https://arxiv.org/abs/2404.02681)
Keywords: llm, prompt
Abstract: Misogyny is often expressed through figurative language. Some neutral words can assume a negative connotation when functioning as pejorative epithets. Disambiguating the meaning of such terms might help the detection of misogyny. To address this task, we present PejorativITy, a novel corpus of 1,200 manually annotated Italian tweets for pejorative language at the word level and misogyny at the sentence level. We evaluate the impact of injecting information about disambiguated words into a model targeting misogyny detection. In particular, we explore two different approaches for injection: concatenation of pejorative information and substitution of ambiguous words with univocal terms. Our experimental results, both on our corpus and on two popular benchmarks on Italian tweets, show that both approaches lead to a major classification improvement, indicating that word sense disambiguation is a promising preliminary step for misogyny detection. Furthermore, we investigate LLMs' understanding of pejorative epithets by means of contextual word embeddings analysis and prompting.
摘要：厌女症通常通过比喻语言来表达。一些中性词在用作贬义词时可能会带有负面含义。消除这些术语的歧义可能有助于发现厌女症。为了解决这一任务，我们提出了 PejorativITy，这是一个包含 1,200 条手动注释的意大利推文的新颖语料库，用于单词级别的贬义语言和句子级别的厌女症。我们评估将消歧词信息注入针对厌女症检测的模型的影响。特别是，我们探索了两种不同的注入方法：贬义信息的串联和用单义术语替换不明确的单词。我们在语料库和意大利推文的两个流行基准上的实验结果表明，这两种方法都带来了重大的分类改进，表明词义消歧是厌女症检测的一个有希望的初步步骤。此外，我们通过上下文词嵌入分析和提示来调查 LLM 对贬义词的理解。

Title: Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers

Authors: Sehyun Choi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02684
Pdf URL: https://arxiv.org/pdf/2404.02684
Copy Paste: [[2404.02684]] Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers(https://arxiv.org/abs/2404.02684)
Keywords: language model
Abstract: Recently, multiple architectures have been proposed to improve the efficiency of the Transformer Language Models through changing the design of the self-attention block to have a linear-cost inference (LCI). A notable approach in this realm is the State-Space Machines (SSMs) architecture, which showed on-par performance on language modeling tasks with the self-attention transformers. However, such an architectural change requires a full pretraining of the weights from scratch, which incurs a huge cost to researchers and practitioners who want to use the new architectures. In the more traditional linear attention works, it has been proposed to approximate full attention with linear attention via a swap-and-finetune framework. Motivated by this approach, we propose Cross-Architecture Transfer Learning (XATL), in which the weights of the shared components between LCI and self-attention-based transformers, such as layernorms, MLPs, input/output embeddings, are directly transferred to the new architecture from already pre-trained model parameters. We tested the efficacy of the method on varying sizes and alternative attention architectures and show that XATL significantly reduces the training time by up to 2.5x and converges to a better minimum with up to 2.6% stronger model on the LM benchmarks within the same compute budget.
摘要：最近，人们提出了多种架构，通过改变自注意力模块的设计来实现线性成本推理（LCI），从而提高 Transformer 语言模型的效率。该领域的一个值得注意的方法是状态空间机（SSM）架构，它在语言建模任务上表现出了与自注意力 Transformer 同等的性能。然而，这样的架构变化需要从头开始对权重进行全面的预训练，这给想要使用新架构的研究人员和从业者带来了巨大的成本。在更传统的线性注意力工作中，有人提出通过交换和微调框架用线性注意力来近似完全注意力。受这种方法的启发，我们提出了跨架构迁移学习（XATL），其中 LCI 和基于自注意力的 Transformer 之间共享组件的权重，例如层范数、MLP、输入/输出嵌入，从已预先训练的模型参数直接转移到新架构。我们在不同大小和替代注意力架构上实验了该方法的有效性，结果表明，XATL 显着减少了训练时间（最多 2.5 倍），并收敛到更好的最小值，在相同计算预算内，在 LM 基准上模型性能最高提升 2.6%。

Title: Scalable Model Editing via Customized Expert Networks

Authors: Zihan Yao, Yu He, Tianyu Qi, Ming Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02699
Pdf URL: https://arxiv.org/pdf/2404.02699
Copy Paste: [[2404.02699]] Scalable Model Editing via Customized Expert Networks(https://arxiv.org/abs/2404.02699)
Keywords: language model, hallucination
Abstract: Addressing the issue of hallucinations and outdated knowledge in large language models is critical for their reliable application. Model Editing presents a promising avenue for mitigating these challenges in a cost-effective manner. However, existing methods often suffer from unsatisfactory generalization and unintended effects on unrelated samples. To overcome these limitations, we introduce a novel approach: Scalable Model Editing via Customized Expert Networks (SCEN), which is a two-stage continuous training paradigm. Specifically, in the first stage, we train lightweight expert networks individually for each piece of knowledge that needs to be updated. Subsequently, we train a corresponding neuron for each expert to control the activation state of that expert. Our experiments on two different sizes of open-source large language models, the Llama2 7B and 13B, achieve state-of-the-art results compared to existing mainstream Model Editing methods. Our code is available at https://github.com/TAL-auroraX/SCEN
摘要：解决大型语言模型中的幻觉和过时知识问题对于其可靠应用至关重要。模型编辑提供了一种以经济高效的方式缓解这些挑战的有前途的途径。然而，现有方法常常存在泛化效果不理想以及对不相关样本产生意想不到的影响的问题。为了克服这些限制，我们引入了一种新颖的方法：通过定制专家网络（SCEN）进行可扩展模型编辑，这是一种两阶段连续训练范例。具体来说，在第一阶段，我们针对每条需要更新的知识单独训练轻量级专家网络。随后，我们为每个专家训练相应的神经元来控制该专家的激活状态。与现有主流模型编辑方法相比，我们在两种不同规模的开源大语言模型 Llama2 7B 和 13B 上进行的实验取得了最先进的结果。我们的代码可在 https://github.com/TAL-auroraX/SCEN 获取

Title: Automatic Prompt Selection for Large Language Models

Authors: Viet-Tung Do, Van-Khanh Hoang, Duy-Hung Nguyen, Shahab Sabahi, Jeff Yang, Hajime Hotta, Minh-Tien Nguyen, Hung Le
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02717
Pdf URL: https://arxiv.org/pdf/2404.02717
Copy Paste: [[2404.02717]] Automatic Prompt Selection for Large Language Models(https://arxiv.org/abs/2404.02717)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) can perform various natural language processing tasks with suitable instruction prompts. However, designing effective prompts manually is challenging and time-consuming. Existing methods for automatic prompt optimization either lack flexibility or efficiency. In this paper, we propose an effective approach to automatically select the optimal prompt for a given input from a finite set of synthetic candidate prompts. Our approach consists of three steps: (1) clustering the training data and generating candidate prompts for each cluster using an LLM-based prompt generator; (2) synthesizing a dataset of input-prompt-output tuples for training a prompt evaluator to rank the prompts based on their relevance to the input; (3) using the prompt evaluator to select the best prompt for a new input at test time. Our approach balances prompt generality-specificity and eliminates the need for resource-intensive training and inference. It demonstrates competitive performance on zero-shot question-answering datasets: GSM8K, MultiArith, and AQuA.
摘要：大型语言模型（LLM）可以通过合适的指令提示执行各种自然语言处理任务。然而，手动设计有效的提示既具有挑战性又耗时。现有的自动提示优化方法要么缺乏灵活性，要么缺乏效率。在本文中，我们提出了一种有效的方法，可以从有限的合成候选提示集中自动选择给定输入的最佳提示。我们的方法包括三个步骤：（1）对训练数据进行聚类，并使用基于 LLM 的提示生成器为每个聚类生成候选提示； (2) 合成输入-提示-输出元组的数据集，用于训练提示评估器根据提示与输入的相关性对提示进行排名； (3) 使用提示评估器在测试时为新输入选择最佳提示。我们的方法平衡了普遍性和特异性，并消除了资源密集型培训和推理的需要。它展示了在零样本问答数据集上的竞争性能：GSM8K、MultiArith 和 AQuA。

Title: AQuA -- Combining Experts' and Non-Experts' Views To Assess Deliberation Quality in Online Discussions Using LLMs

Authors: Maike Behrendt, Stefan Sylvius Wagner, Marc Ziegele, Lena Wilms, Anke Stoll, Dominique Heinbach, Stefan Harmeling
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02761
Pdf URL: https://arxiv.org/pdf/2404.02761
Copy Paste: [[2404.02761]] AQuA -- Combining Experts' and Non-Experts' Views To Assess Deliberation Quality in Online Discussions Using LLMs(https://arxiv.org/abs/2404.02761)
Keywords: llm
Abstract: Measuring the quality of contributions in political online discussions is crucial in deliberation research and computer science. Research has identified various indicators to assess online discussion quality, and with deep learning advancements, automating these measures has become feasible. While some studies focus on analyzing specific quality indicators, a comprehensive quality score incorporating various deliberative aspects is often preferred. In this work, we introduce AQuA, an additive score that calculates a unified deliberative quality score from multiple indices for each discussion post. Unlike other singular scores, AQuA preserves information on the deliberative aspects present in comments, enhancing model transparency. We develop adapter models for 20 deliberative indices, and calculate correlation coefficients between experts' annotations and the perceived deliberativeness by non-experts to weigh the individual indices into a single deliberative score. We demonstrate that the AQuA score can be computed easily from pre-trained adapters and aligns well with annotations on other datasets that have not been seen during training. The analysis of experts' vs. non-experts' annotations confirms theoretical findings in the social science literature.
摘要：衡量在线政治讨论中贡献的质量对于审议研究和计算机科学至关重要。研究已经确定了评估在线讨论质量的各种指标，并且随着深度学习的进步，自动化这些措施已变得可行。虽然一些研究侧重于分析特定的质量指标，但通常首选包含各种审议方面的综合质量评分。在这项工作中，我们引入了 AQuA，这是一种附加分数，可以根据每个讨论帖子的多个指数计算统一的审议质量分数。与其他单一分数不同，AQuA 保留了评论中存在的审议方面的信息，从而增强了模型的透明度。我们为 20 个审议指数开发了适配器模型，并计算专家注释与非专家感知的审议性之间的相关系数，以将各个指数权衡为单个审议分数。我们证明，AQuA 分数可以通过预先训练的适配器轻松计算，并且与训练期间未见过的其他数据集上的注释很好地对齐。对专家与非专家注释的分析证实了社会科学文献中的理论发现。

Title: FPT: Feature Prompt Tuning for Few-shot Readability Assessment

Authors: Ziyang Wang, Sanwoo Lee, Hsiu-Yuan Huang, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02772
Pdf URL: https://arxiv.org/pdf/2404.02772
Copy Paste: [[2404.02772]] FPT: Feature Prompt Tuning for Few-shot Readability Assessment(https://arxiv.org/abs/2404.02772)
Keywords: language model, gpt, prompt
Abstract: Prompt-based methods have achieved promising results in most few-shot text classification tasks. However, for readability assessment tasks, traditional prompt methods lack crucial linguistic knowledge, which has already been proven to be essential. Moreover, previous studies on utilizing linguistic features have shown non-robust performance in few-shot settings and may even impair model performance. To address these issues, we propose a novel prompt-based tuning framework that incorporates rich linguistic knowledge, called Feature Prompt Tuning (FPT). Specifically, we extract linguistic features from the text and embed them into trainable soft prompts. Further, we devise a new loss function to calibrate the similarity ranking order between categories. Experimental results demonstrate that our proposed method FPT not only exhibits a significant performance improvement over the prior best prompt-based tuning approaches, but also surpasses the previous leading methods that incorporate linguistic features. Also, our proposed model significantly outperforms the large language model gpt-3.5-turbo-16k in most cases. Our proposed method establishes a new architecture for prompt tuning that sheds light on how linguistic features can be easily adapted to linguistic-related tasks.
摘要：基于提示的方法在大多数小样本文本分类任务中取得了有希望的结果。然而，对于可读性评估任务，传统的提示方法缺乏关键的语言知识，而这已经被证明是必不可少的。此外，之前关于利用语言特征的研究表明，在少数样本设置中表现不稳健，甚至可能会损害模型性能。为了解决这些问题，我们提出了一种新颖的基于提示的调优框架，该框架融合了丰富的语言知识，称为特征提示调优（FPT）。具体来说，我们从文本中提取语言特征并将其嵌入到可训练的软提示中。此外，我们设计了一种新的损失函数来校准类别之间的相似度排名顺序。实验结果表明，我们提出的方法 FPT 不仅比之前最好的基于提示的调优方法表现出显着的性能改进，而且还超越了之前结合语言特征的领先方法。此外，我们提出的模型在大多数情况下显着优于大型语言模型 gpt-3.5-turbo-16k。我们提出的方法建立了一种用于提示调优的新架构，揭示了语言特征如何轻松适应与语言相关的任务。

Title: On Few-Shot Prompting for Controllable Question-Answer Generation in Narrative Comprehension

Authors: Bernardo Leite, Henrique Lopes Cardoso
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02800
Pdf URL: https://arxiv.org/pdf/2404.02800
Copy Paste: [[2404.02800]] On Few-Shot Prompting for Controllable Question-Answer Generation in Narrative Comprehension(https://arxiv.org/abs/2404.02800)
Keywords: prompt
Abstract: Question Generation aims to automatically generate questions based on a given input provided as context. A controllable question generation scheme focuses on generating questions with specific attributes, allowing better control. In this study, we propose a few-shot prompting strategy for controlling the generation of question-answer pairs from children's narrative texts. We aim to control two attributes: the question's explicitness and underlying narrative elements. With empirical evaluation, we show the effectiveness of controlling the generation process by employing few-shot prompting side by side with a reference model. Our experiments highlight instances where the few-shot strategy surpasses the reference model, particularly in scenarios such as semantic closeness evaluation and the diversity and coherency of question-answer pairs. However, these improvements are not always statistically significant. The code is publicly available at github.com/bernardoleite/few-shot-prompting-qg-control.
摘要：问题生成旨在根据作为上下文提供的给定输入自动生成问题。可控问题生成方案侧重于生成具有特定属性的问题，从而实现更好的控制。在这项研究中，我们提出了一种几次提示策略来控制儿童叙述文本中问答对的生成。我们的目标是控制两个属性：问题的明确性和潜在的叙述元素。通过实证评估，我们展示了通过与参考模型并列使用少样本提示来控制生成过程的有效性。我们的实验强调了少样本策略超越参考模型的情况，特别是在语义紧密度评估以及问题答案对的多样性和一致性等场景中。然而，这些改进并不总是具有统计显着性。该代码可在 github.com/bernardoleite/few-shot-prompting-qg-control 上公开获取。

Title: Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models

Authors: Haoran Sun, Lixin Liu, Junjie Li, Fengyu Wang, Baohua Dong, Ran Lin, Ruohui Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02823
Pdf URL: https://arxiv.org/pdf/2404.02823
Copy Paste: [[2404.02823]] Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models(https://arxiv.org/abs/2404.02823)
Keywords: language model, gpt, llm
Abstract: The ability of large language models (LLMs) to follow instructions is crucial to real-world applications. Despite recent advances, several studies have highlighted that LLMs struggle when faced with challenging instructions, especially those that include complex constraints, hindering their effectiveness in various tasks. To address this challenge, we introduce Conifer, a novel instruction tuning dataset, designed to enhance LLMs' ability to follow multi-level instructions with complex constraints. Utilizing GPT-4, we curate the dataset by a series of LLM-driven refinement processes to ensure high quality. We also propose a progressive learning scheme that emphasizes an easy-to-hard progression, and learning from process feedback. Models trained with Conifer exhibit remarkable improvements in instruction-following abilities, especially for instructions with complex constraints. On several instruction-following benchmarks, our 7B model outperforms the state-of-the-art open-source 7B models, and even exceeds the performance of models 10 times larger on certain metrics. All the code and Conifer dataset are available at https://www.github.com/ConiferLM/Conifer.
摘要：大型语言模型 (LLM) 遵循指令的能力对于实际应用至关重要。尽管最近取得了进展，但一些研究强调，LLM 在面临具有挑战性的指令时会陷入困境，尤其是那些包含复杂约束的指令，从而阻碍了它们在各种任务中的有效性。为了应对这一挑战，我们引入了 Conifer，这是一种新颖的指令调整数据集，旨在增强 LLM 遵循具有复杂约束的多级指令的能力。利用 GPT-4，我们通过一系列 LLM 驱动的细化流程来管理数据集，以确保高质量。我们还提出了一种渐进式学习方案，强调从易到难的进展，并从过程反馈中学习。使用 Conifer 训练的模型在指令跟踪能力方面表现出显着的提高，特别是对于具有复杂约束的指令。在多个指令跟踪基准测试中，我们的 7B 模型优于最先进的开源 7B 模型，甚至在某些指标上超过了规模大 10 倍的模型的性能。所有代码和 Conifer 数据集均可在 https://www.github.com/ConiferLM/Conifer 获取。

Title: Retrieving Examples from Memory for Retrieval Augmented Neural Machine Translation: A Systematic Comparison

Authors: Maxime Bouthors, Josep Crego, Francois Yvon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02835
Pdf URL: https://arxiv.org/pdf/2404.02835
Copy Paste: [[2404.02835]] Retrieving Examples from Memory for Retrieval Augmented Neural Machine Translation: A Systematic Comparison(https://arxiv.org/abs/2404.02835)
Keywords: language model
Abstract: Retrieval-Augmented Neural Machine Translation (RAMT) architectures retrieve examples from memory to guide the generation process. While most works in this trend explore new ways to exploit the retrieved examples, the upstream retrieval step is mostly unexplored. In this paper, we study the effect of varying retrieval methods for several translation architectures, to better understand the interplay between these two processes. We conduct experiments in two language pairs in a multi-domain setting and consider several downstream architectures based on a standard autoregressive model, an edit-based model, and a large language model with in-context learning. Our experiments show that the choice of the retrieval technique impacts the translation scores, with variance across architectures. We also discuss the effects of increasing the number and diversity of examples, which are mostly positive across the board.
摘要：检索增强神经机器翻译 (RAMT) 架构从内存中检索示例以指导生成过程。虽然这一趋势中的大多数作品都探索利用检索到的示例的新方法，但上游检索步骤大多尚未探索。在本文中，我们研究了不同检索方法对几种翻译架构的影响，以更好地理解这两个过程之间的相互作用。我们在多领域设置中对两种语言对进行了实验，并考虑了几种基于标准自回归模型、基于编辑的模型和具有上下文学习的大型语言模型的下游架构。我们的实验表明，检索技术的选择会影响翻译分数，并在不同架构之间存在差异。我们还讨论了增加示例数量和多样性的影响，这些影响大多是积极的。

Title: Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models

Authors: Wanyun Cui, Qianle Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02837
Pdf URL: https://arxiv.org/pdf/2404.02837
Copy Paste: [[2404.02837]] Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models(https://arxiv.org/abs/2404.02837)
Keywords: language model, llm
Abstract: This paper reveals the phenomenon of parameter heterogeneity in large language models (LLMs). We find that a small subset of "cherry" parameters exhibit a disproportionately large influence on model performance, while the vast majority of parameters have minimal impact. This heterogeneity is found to be prevalent across different model families, scales, and types. Motivated by this observation, we propose CherryQ, a novel quantization method that unifies the optimization of mixed-precision parameters. CherryQ identifies and preserves the critical cherry parameters in high precision while aggressively quantizing the remaining parameters to low precision. Extensive experiments demonstrate the effectiveness of CherryQ. CherryQ outperforms existing quantization approaches in terms of perplexity and downstream task performance. Notably, our 3-bit quantized Vicuna-1.5 exhibits competitive performance compared to its 16-bit counterpart. These findings highlight the potential of CherryQ for enabling efficient deployment of LLMs by taking advantage of parameter heterogeneity.
摘要：本文揭示了大型语言模型（LLM）中参数异质性的现象。我们发现一小部分“樱桃”参数对模型性能表现出不成比例的巨大影响，而绝大多数参数的影响却很小。人们发现这种异质性在不同的模型家族、规模和类型中普遍存在。受这一观察的启发，我们提出了 CherryQ，一种统一混合精度参数优化的新颖量化方法。 CherryQ 以高精度识别并保留关键的樱桃参数，同时积极地将其余参数量化为低精度。大量实验证明了 CherryQ 的有效性。 CherryQ 在困惑度和下游任务性能方面优于现有的量化方法。值得注意的是，与 16 位同类产品相比，我们的 3 位量化 Vicuna-1.5 表现出具有竞争力的性能。这些发现凸显了 CherryQ 通过利用参数异质性实现 LLM 高效部署的潜力。
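Note: the mixed-precision idea in the abstract above (keep a small set of critical parameters in high precision, quantize the rest aggressively) can be sketched in a few lines. This is a toy illustration under stated assumptions and not CherryQ itself: the per-weight `importance` scores are supplied by hand here, whereas the paper identifies its cherry parameters by their influence on model performance, and the function names, `keep_fraction`, and the uniform quantizer are all hypothetical choices.

```python
def quantize_uniform(values, bits):
    """Round each value to one of 2**bits evenly spaced levels spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def mixed_precision_quantize(weights, importance, keep_fraction=0.1, bits=3):
    """Keep the most important weights unchanged; quantize the rest to `bits` bits."""
    n_keep = max(1, int(len(weights) * keep_fraction))
    ranked = sorted(range(len(weights)), key=lambda i: importance[i], reverse=True)
    keep = set(ranked[:n_keep])  # indices left in full precision
    rest = [i for i in range(len(weights)) if i not in keep]
    quantized = quantize_uniform([weights[i] for i in rest], bits) if rest else []
    out = list(weights)
    for i, q in zip(rest, quantized):
        out[i] = q
    return out, keep


# Hypothetical weights and importance scores (index 2 plays the "cherry" role).
weights = [0.12, -0.53, 0.91, 0.07, -0.33, 0.48, -0.02, 0.76, -0.88, 0.25]
importance = [0.1, 0.2, 9.0, 0.05, 0.3, 0.1, 0.02, 0.4, 0.6, 0.1]

quantized_weights, kept = mixed_precision_quantize(weights, importance)
print("kept in full precision:", sorted(kept))
print("quantized weights:", [round(w, 3) for w in quantized_weights])
```

Excluding the high-importance weight from the low-bit group also keeps it from stretching the quantization range used for the remaining weights.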

Title: ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline

Authors: Yifan Xu, Xiao Liu, Xinghan Liu, Zhenyu Hou, Yueyan Li, Xiaohan Zhang, Zihan Wang, Aohan Zeng, Zhengxiao Du, Wenyi Zhao, Jie Tang, Yuxiao Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02893
Pdf URL: https://arxiv.org/pdf/2404.02893
Copy Paste: [[2404.02893]] ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline(https://arxiv.org/abs/2404.02893)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) have shown excellent mastery of human language, but still struggle in real-world applications that require mathematical problem-solving. While many strategies and datasets to enhance LLMs' mathematics have been developed, it remains a challenge to simultaneously maintain and improve both language and mathematical capabilities in deployed LLM systems. In this work, we tailor the Self-Critique pipeline, which addresses the challenge in the feedback learning stage of LLM alignment. We first train a general Math-Critique model from the LLM itself to provide feedback signals. Then, we sequentially employ rejective fine-tuning and direct preference optimization over the LLM's own generations for data collection. Based on ChatGLM3-32B, we conduct a series of experiments on both academic datasets and our newly created challenging dataset, MathUserEval. Results show that our pipeline significantly enhances the LLM's mathematical problem-solving while still improving its language ability, outperforming LLMs that could be two times larger. Related techniques have been deployed to ChatGLM (https://chatglm.cn), an online serving LLM. Related evaluation dataset and scripts are released at https://github.com/THUDM/ChatGLM-Math.
摘要：大型语言模型（LLM）已经表现出对人类语言的出色掌握，但在需要解决数学问题的现实应用中仍然举步维艰。虽然已经开发了许多增强 LLM 数学能力的策略和数据集，但在部署的 LLM 系统中同时维护和提高语言和数学能力仍然是一个挑战。在这项工作中，我们定制了自我批评（Self-Critique）管道，它解决了 LLM 对齐的反馈学习阶段中的这一挑战。我们首先从 LLM 本身训练一个通用的 Math-Critique 模型来提供反馈信号。然后，我们依次对 LLM 自己生成的内容进行拒绝微调和直接偏好优化来收集数据。基于 ChatGLM3-32B，我们对学术数据集和新创建的挑战性数据集 MathUserEval 进行了一系列实验。结果表明，我们的管道显着增强了 LLM 解决数学问题的能力，同时仍然提高了其语言能力，其表现优于规模可能大两倍的 LLM。相关技术已部署到在线服务 LLM ChatGLM（https://chatglm.cn）。相关评估数据集和脚本发布于 https://github.com/THUDM/ChatGLM-Math。