Copy Paste: [[2404.17642]] Empowering Large Language Models for Textual Data Augmentation(https://arxiv.org/abs/2404.17642)
Keywords: language model, llm
Abstract: With the capabilities of understanding and executing natural language instructions, Large language models (LLMs) can potentially act as a powerful tool for textual data augmentation. However, the quality of augmented data depends heavily on the augmentation instructions provided, and the effectiveness can fluctuate across different downstream tasks. While manually crafting and selecting instructions can offer some improvement, this approach faces scalability and consistency issues in practice due to the diversity of downstream tasks. In this work, we address these limitations by proposing a new solution, which can automatically generate a large pool of augmentation instructions and select the most suitable task-informed instructions, thereby empowering LLMs to create high-quality augmented data for different downstream tasks. Empirically, the proposed approach consistently generates augmented data with better quality compared to non-LLM and LLM-based data augmentation methods, leading to the best performance on 26 few-shot learning tasks sourced from a wide range of application domains.
Copy Paste: [[2404.17662]] PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games(https://arxiv.org/abs/2404.17662)
Keywords: language model, llm, agent
Abstract: Recent advancements in Large Language Models (LLMs) have enhanced the efficacy of agent communication and social interactions. Despite these advancements, building LLM-based agents for reasoning in dynamic environments involving competition and collaboration remains challenging due to the limitations of informed graph-based search methods. We propose PLAYER*, a novel framework based on an anytime sampling-based planner, which utilises sensors and pruners to enable a purely question-driven searching framework for complex reasoning tasks. We also introduce a quantifiable evaluation method using multiple-choice questions and construct the WellPlay dataset with 1,482 QA pairs. Experiments demonstrate PLAYER*'s efficiency and performance enhancements compared to existing methods in complex, dynamic environments with quantifiable results.
Copy Paste: [[2404.17729]] CoMM: Collaborative Multi-Agent, Multi-Reasoning-Path Prompting for Complex Problem Solving(https://arxiv.org/abs/2404.17729)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) have shown great ability in solving traditional natural language tasks and elementary reasoning tasks with appropriate prompting techniques. However, their ability is still limited in solving complicated science problems. In this work, we aim to push the upper bound of the reasoning capability of LLMs by proposing a collaborative multi-agent, multi-reasoning-path (CoMM) prompting framework. Specifically, we prompt LLMs to play different roles in a problem-solving team, and encourage different role-play agents to collaboratively solve the target task. In particular, we discover that applying different reasoning paths for different roles is an effective strategy to implement few-shot prompting approaches in the multi-agent scenarios. Empirical results demonstrate the effectiveness of the proposed methods on two college-level science problems over competitive baselines. Our further analysis shows the necessity of prompting LLMs to play different roles or experts independently. We release the code at: https://github.com/amazon-science/comm-prompt
Copy Paste: [[2404.17733]] Building a Large Japanese Web Corpus for Large Language Models(https://arxiv.org/abs/2404.17733)
Keywords: language model, llm
Abstract: Open Japanese large language models (LLMs) have been trained on the Japanese portions of corpora such as CC-100, mC4, and OSCAR. However, these corpora were not created for the quality of Japanese texts. This study builds a large Japanese web corpus by extracting and refining text from the Common Crawl archive (21 snapshots of approximately 63.4 billion pages crawled between 2020 and 2023). This corpus consists of approximately 312.1 billion characters (approximately 173 million pages), which is the largest of all available training corpora for Japanese LLMs, surpassing CC-100 (approximately 25.8 billion characters), mC4 (approximately 239.7 billion characters) and OSCAR 23.10 (approximately 74 billion characters). To confirm the quality of the corpus, we performed continual pre-training on Llama 2 7B, 13B, 70B, Mistral 7B v0.1, and Mixtral 8x7B Instruct as base LLMs and gained consistent (6.6-8.1 points) improvements on Japanese benchmark datasets. We also demonstrate that the improvement on Llama 2 13B brought from the presented corpus was the largest among those from other existing corpora.
Abstract: In recent years, automated radiology report generation has experienced significant growth. This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging Large Language Models (LLMs). Conventional NLG (natural language generation) metrics like BLEU are inadequate for accurately assessing the generated radiology reports, as systematically demonstrated by our observations within this paper. To address this challenge, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis. Our framework includes two key components: i) utilizing GPT to generate large amounts of training data, i.e., reports with different qualities, and ii) pairing GPT-generated reports as accepted and rejected samples and training LLMs to produce MRScore as the model reward. Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics. Our code and datasets will be available on GitHub.
Copy Paste: [[2404.17779]] Medical Vision-Language Pre-Training for Brain Abnormalities(https://arxiv.org/abs/2404.17779)
Keywords: language model
Abstract: Vision-language models have become increasingly powerful for tasks that require an understanding of both visual and linguistic elements, bridging the gap between these modalities. In the context of multimodal clinical AI, there is a growing need for models that possess domain-specific knowledge, as existing models often lack the expertise required for medical applications. In this paper, we take brain abnormalities as an example to demonstrate how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset from case reports and published journals and subsequently constructing a high-performance vision-language model tailored to specific medical tasks. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain. We evaluated the resulting model with quantitative and qualitative intrinsic evaluations. The resulting dataset and our code can be found here https://github.com/masoud-monajati/MedVL_pretraining_pipeline
Copy Paste: [[2404.17785]] Temporal Scaling Law for Large Language Models(https://arxiv.org/abs/2404.17785)
Keywords: language model, llm
Abstract: Recently, Large Language Models (LLMs) are widely adopted in a wide range of tasks, leading to increasing attention towards the research on how scaling LLMs affects their performance. Existing works, termed as Scaling Laws, have discovered that the loss of LLMs scales as power laws with model size, computational budget, and dataset size. However, the performance of LLMs throughout the training process remains untouched. In this paper, we propose the novel concept of Temporal Scaling Law and study the loss of LLMs from the temporal dimension. We first investigate the imbalance of loss on each token positions and develop a reciprocal-law across model scales and training stages. We then derive the temporal scaling law by studying the temporal patterns of the reciprocal-law parameters. Results on both in-distribution (IID) data and out-of-distribution (OOD) data demonstrate that our temporal scaling law accurately predicts the performance of LLMs in future training stages. Moreover, the temporal scaling law reveals that LLMs learn uniformly on different token positions, despite the loss imbalance. Experiments on pre-training LLMs in various scales show that this phenomenon verifies the default training paradigm for generative language models, in which no re-weighting strategies are attached during training. Overall, the temporal scaling law provides deeper insight into LLM pre-training.
Copy Paste: [[2404.17790]] Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities(https://arxiv.org/abs/2404.17790)
Keywords: language model, llm
Abstract: Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpus allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a large Japanese web corpus. Experimental results confirmed that the performance on Japanese tasks drastically improved through continual pre-training, and the performance monotonically increased with the amount of training data up to 100B tokens. Consequently, Swallow achieved superior performance compared to other LLMs that were trained from scratch in English and Japanese. An analysis of the effects of continual pre-training revealed that it was particularly effective for Japanese question answering tasks. Furthermore, to elucidate effective methodologies for cross-lingual continual pre-training from English to Japanese, we investigated the impact of vocabulary expansion and the effectiveness of incorporating parallel corpora. The results showed that the efficiency gained through vocabulary expansion had no negative impact on performance, except for the summarization task, and that the combined use of parallel corpora enhanced translation ability.
Copy Paste: [[2404.17802]] Empirical Analysis of Dialogue Relation Extraction with Large Language Models(https://arxiv.org/abs/2404.17802)
Keywords: language model, llm
Abstract: Dialogue relation extraction (DRE) aims to extract relations between two arguments within a dialogue, which is more challenging than standard RE due to the higher person pronoun frequency and lower information density in dialogues. However, existing DRE methods still suffer from two serious issues: (1) hard to capture long and sparse multi-turn information, and (2) struggle to extract golden relations based on partial dialogues, which motivates us to discover more effective methods that can alleviate the above issues. We notice that the rise of large language models (LLMs) has sparked considerable interest in evaluating their performance across diverse tasks. To this end, we initially investigate the capabilities of different LLMs in DRE, considering both proprietary models and open-source models. Interestingly, we discover that LLMs significantly alleviate two issues in existing DRE methods. Generally, we have following findings: (1) scaling up model size substantially boosts the overall DRE performance and achieves exceptional results, tackling the difficulty of capturing long and sparse multi-turn information; (2) LLMs encounter with much smaller performance drop from entire dialogue setting to partial dialogue setting compared to existing methods; (3) LLMs deliver competitive or superior performances under both full-shot and few-shot settings compared to current state-of-the-art; (4) LLMs show modest performances on inverse relations but much stronger improvements on general relations, and they can handle dialogues of various lengths especially for longer sequences.
摘要:对话关系提取(DRE)旨在提取对话中两个论点之间的关系,由于对话中人称代词频率较高且信息密度较低,这比标准 RE 更具挑战性。然而,现有的DRE方法仍然存在两个严重的问题:(1)难以捕获长且稀疏的多轮信息,(2)难以基于部分对话提取黄金关系,这促使我们发现更有效的方法缓解以上问题。我们注意到,大型语言模型(LLM)的兴起引发了人们对评估其在不同任务中的表现的极大兴趣。为此,我们首先研究了不同法学硕士在 DRE 中的能力,同时考虑了专有模型和开源模型。有趣的是,我们发现法学硕士显着缓解了现有 DRE 方法中的两个问题。总的来说,我们有以下发现:(1)扩大模型大小可以显着提高整体 DRE 性能并取得优异的结果,解决了捕获长且稀疏的多轮信息的困难; (2)与现有方法相比,LLM从整个对话设置到部分对话设置的性能下降要小得多; (3) 与当前最先进的技术相比,法学硕士在全镜头和少镜头设置下都能提供具有竞争力或优越的性能; (4) 法学硕士在逆关系上表现一般,但在一般关系上有更强的改进,并且它们可以处理各种长度的对话,特别是对于较长的序列。
Title: Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors
Authors: Guozheng Li, Peng Wang, Jiajun Liu, Yikai Guo, Ke Ji, Ziyu Shang, Zijie Xu
Copy Paste: [[2404.17807]] Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors(https://arxiv.org/abs/2404.17807)
Keywords: language model, llm, prompt
Abstract: Relation extraction (RE) is an important task that aims to identify the relationships between entities in texts. While large language models (LLMs) have revealed remarkable in-context learning (ICL) capability for general zero and few-shot learning, recent studies indicate that current LLMs still struggle with zero and few-shot RE. Previous studies are mainly dedicated to design prompt formats and select good examples for improving ICL-based RE. Although both factors are vital for ICL, if one can fundamentally boost the ICL capability of LLMs in RE, the zero and few-shot RE performance via ICL would be significantly improved. To this end, we introduce \textsc{Micre} (\textbf{M}eta \textbf{I}n-\textbf{C}ontext learning of LLMs for \textbf{R}elation \textbf{E}xtraction), a new meta-training framework for zero and few-shot RE where an LLM is tuned to do ICL on a diverse collection of RE datasets (i.e., learning to learn in context for RE). Through meta-training, the model becomes more effectively to learn a new RE task in context by conditioning on a few training examples with no parameter updates or task-specific templates at inference time, enabling better zero and few-shot task generalization. We experiment \textsc{Micre} on various LLMs with different model scales and 12 public RE datasets, and then evaluate it on unseen RE benchmarks under zero and few-shot settings. \textsc{Micre} delivers comparable or superior performance compared to a range of baselines including supervised fine-tuning and typical in-context learning methods. We find that the gains are particular significant for larger model scales, and using a diverse set of the meta-training RE datasets is key to improvements. Empirically, we show that \textsc{Micre} can transfer the relation semantic knowledge via relation label name during inference on target RE datasets.
摘要:关系提取(RE)是一项重要任务,旨在识别文本中实体之间的关系。虽然大型语言模型 (LLM) 在一般的零和少样本学习中展现了卓越的上下文学习 (ICL) 能力,但最近的研究表明,当前的 LLM 仍然在零和少样本 RE 方面苦苦挣扎。之前的研究主要致力于设计提示格式并选择好的例子来改进基于 ICL 的 RE。尽管这两个因素对于 ICL 都至关重要,但如果能够从根本上提高 RE 中 LLM 的 ICL 能力,那么通过 ICL 实现的零和少样本 RE 性能将得到显着提高。为此,我们引入了 \textsc{Micre} (\textbf{M}eta \textbf{I}n-\textbf{C}ontext Learning of LLMs for \textbf{R}elation \textbf{E}xtraction),针对零次和少次 RE 的新元训练框架,其中 LLM 被调整为在不同的 RE 数据集集合上进行 ICL(即学习在 RE 的上下文中学习)。通过元训练,模型可以通过在推理时以一些没有参数更新或特定于任务的模板的训练示例为条件,更有效地在上下文中学习新的 RE 任务,从而实现更好的零和少样本任务泛化。我们在具有不同模型规模和 12 个公共 RE 数据集的各种 LLM 上实验 \textsc{Micre},然后在零和少样本设置下在未见过的 RE 基准上对其进行评估。与一系列基线(包括监督微调和典型的上下文学习方法)相比,\textsc{Micre} 提供了相当或更好的性能。我们发现,对于较大的模型规模,收益尤其显着,并且使用不同的元训练 RE 数据集是改进的关键。根据经验,我们表明 \textsc{Micre} 可以在目标 RE 数据集的推理过程中通过关系标签名称传递关系语义知识。
Title: Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal
Copy Paste: [[2404.17808]] Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal(https://arxiv.org/abs/2404.17808)
Keywords: language model
Abstract: Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a frequency imbalance for tokens in the text corpus. Since BPE iteratively merges the most frequent token pair in the text corpus while keeping all tokens that have been merged in the vocabulary, it unavoidably holds tokens that primarily represent subwords of complete words and appear infrequently on their own in the text corpus. We term such tokens as Scaffold Tokens. Due to their infrequent appearance in the text corpus, Scaffold Tokens pose a learning imbalance issue for language models. To address that issue, we propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original BPE. This novel approach ensures the exclusion of low-frequency Scaffold Tokens from the token representations for the given texts, thereby mitigating the issue of frequency imbalance and facilitating model training. On extensive experiments across language modeling tasks and machine translation tasks, Scaffold-BPE consistently outperforms the original BPE, well demonstrating its effectiveness and superiority.
Copy Paste: [[2404.17809]] Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction(https://arxiv.org/abs/2404.17809)
Keywords: language model, llm
Abstract: Relation extraction (RE) aims to identify relations between entities mentioned in texts. Although large language models (LLMs) have demonstrated impressive in-context learning (ICL) abilities in various tasks, they still suffer from poor performances compared to most supervised fine-tuned RE methods. Utilizing ICL for RE with LLMs encounters two challenges: (1) retrieving good demonstrations from training examples, and (2) enabling LLMs exhibit strong ICL abilities in RE. On the one hand, retrieving good demonstrations is a non-trivial process in RE, which easily results in low relevance regarding entities and relations. On the other hand, ICL with an LLM achieves poor performance in RE while RE is different from language modeling in nature or the LLM is not large enough. In this work, we propose a novel recall-retrieve-reason RE framework that synergizes LLMs with retrieval corpora (training examples) to enable relevant retrieving and reliable in-context reasoning. Specifically, we distill the consistently ontological knowledge from training datasets to let LLMs generate relevant entity pairs grounded by retrieval corpora as valid queries. These entity pairs are then used to retrieve relevant training examples from the retrieval corpora as demonstrations for LLMs to conduct better ICL via instruction tuning. Extensive experiments on different LLMs and RE datasets demonstrate that our method generates relevant and valid entity pairs and boosts ICL abilities of LLMs, achieving competitive or new state-of-the-art performance on sentence-level RE compared to previous supervised fine-tuning methods and ICL-based methods.
摘要:关系提取 (RE) 旨在识别文本中提到的实体之间的关系。尽管大型语言模型 (LLM) 在各种任务中都表现出令人印象深刻的上下文学习 (ICL) 能力,但与大多数监督微调 RE 方法相比,它们仍然表现不佳。利用 ICL 进行带有 LLM 的 RE 面临两个挑战:(1) 从训练示例中检索良好的演示,以及 (2) 使 LLM 在 RE 中表现出强大的 ICL 能力。一方面,在 RE 中检索良好的演示并非易事,这很容易导致实体和关系的相关性低。另一方面,如果 RE 本质上不同于语言建模,或者 LLM 不够大,则带有 LLM 的 ICL 在 RE 中的表现不佳。在这项工作中,我们提出了一种新颖的回忆-检索-推理 RE 框架,该框架将 LLM 与检索语料库(训练示例)协同作用,以实现相关检索和可靠的上下文推理。具体来说,我们从训练数据集中提取一致的本体知识,让 LLM 生成以检索语料库为基础的相关实体对作为有效查询。然后使用这些实体对从检索语料库中检索相关的训练示例,作为 LLM 通过指令调整进行更好的 ICL 的演示。在不同的 LLM 和 RE 数据集上进行的大量实验表明,我们的方法生成了相关且有效的实体对并提升了 LLM 的 ICL 能力,与以前的监督微调方法和基于 ICL 的方法相比,在句子级 RE 上实现了具有竞争力或新的最先进的性能。
Title: Evaluation of Few-Shot Learning for Classification Tasks in the Polish Language
Copy Paste: [[2404.17832]] Evaluation of Few-Shot Learning for Classification Tasks in the Polish Language(https://arxiv.org/abs/2404.17832)
Keywords: gpt
Abstract: We introduce a few-shot benchmark consisting of 7 different classification tasks native to the Polish language. We conducted an empirical comparison with 0 and 16 shots between fine-tuning, linear probing, SetFit, and in-context learning (ICL) using various pre-trained commercial and open-source models. Our findings reveal that ICL achieves the best performance, with commercial models like GPT-3.5 and GPT-4 attaining the best performance. However, there remains a significant 14 percentage points gap between our best few-shot learning score and the performance of HerBERT-large fine-tuned on the entire training dataset. Among the techniques, SetFit emerges as the second-best approach, closely followed by linear probing. We observed the worst and most unstable performance with non-linear head fine-tuning. Results for ICL indicate that continual pre-training of models like Mistral-7b or Llama-2-13b on Polish corpora is beneficial. This is confirmed by the improved performances of Bielik-7b and Trurl-13b, respectively. To further support experiments in few-shot learning for Polish, we are releasing handcrafted templates for the ICL.
Copy Paste: [[2404.17835]] VANER: Leveraging Large Language Model for Versatile and Adaptive Biomedical Named Entity Recognition(https://arxiv.org/abs/2404.17835)
Keywords: language model, llm
Abstract: Prevalent solution for BioNER involves using representation learning techniques coupled with sequence labeling. However, such methods are inherently task-specific, demonstrate poor generalizability, and often require dedicated model for each dataset. To leverage the versatile capabilities of recently remarkable large language models (LLMs), several endeavors have explored generative approaches to entity extraction. Yet, these approaches often fall short of the effectiveness of previouly sequence labeling approaches. In this paper, we utilize the open-sourced LLM LLaMA2 as the backbone model, and design specific instructions to distinguish between different types of entities and datasets. By combining the LLM's understanding of instructions with sequence labeling techniques, we use mix of datasets to train a model capable of extracting various types of entities. Given that the backbone LLMs lacks specialized medical knowledge, we also integrate external entity knowledge bases and employ instruction tuning to compel the model to densely recognize carefully curated entities. Our model VANER, trained with a small partition of parameters, significantly outperforms previous LLMs-based models and, for the first time, as a model based on LLM, surpasses the majority of conventional state-of-the-art BioNER systems, achieving the highest F1 scores across three datasets.
Copy Paste: [[2404.17841]] Toxicity Classification in Ukrainian(https://arxiv.org/abs/2404.17841)
Keywords: llm, prompt
Abstract: The task of toxicity detection is still a relevant task, especially in the context of safe and fair LMs development. Nevertheless, labeled binary toxicity classification corpora are not available for all languages, which is understandable given the resource-intensive nature of the annotation process. Ukrainian, in particular, is among the languages lacking such resources. To our knowledge, there has been no existing toxicity classification corpus in Ukrainian. In this study, we aim to fill this gap by investigating cross-lingual knowledge transfer techniques and creating labeled corpora by: (i)~translating from an English corpus, (ii)~filtering toxic samples using keywords, and (iii)~annotating with crowdsourcing. We compare LLMs prompting and other cross-lingual transfer approaches with and without fine-tuning offering insights into the most robust and efficient baselines.
Copy Paste: [[2404.17877]] PromptCL: Improving Event Representation via Prompt Template and Contrastive Learning(https://arxiv.org/abs/2404.17877)
Keywords: language model, prompt
Abstract: The representation of events in text plays a significant role in various NLP tasks. Recent research demonstrates that contrastive learning has the ability to improve event comprehension capabilities of Pre-trained Language Models (PLMs) and enhance the performance of event representation learning. However, the efficacy of event representation learning based on contrastive learning and PLMs is limited by the short length of event texts. The length of event texts differs significantly from the text length used in the pre-training of PLMs. As a result, there is inconsistency in the distribution of text length between pre-training and event representation learning, which may undermine the learning process of event representation based on PLMs. In this study, we present PromptCL, a novel framework for event representation learning that effectively elicits the capabilities of PLMs to comprehensively capture the semantics of short event texts. PromptCL utilizes a Prompt template borrowed from prompt learning to expand the input text during Contrastive Learning. This helps in enhancing the event representation learning by providing a structured outline of the event components. Moreover, we propose Subject-Predicate-Object (SPO) word order and Event-oriented Masked Language Modeling (EventMLM) to train PLMs to understand the relationships between event components. Our experimental results demonstrate that PromptCL outperforms state-of-the-art baselines on event related tasks. Additionally, we conduct a thorough analysis and demonstrate that using a prompt results in improved generalization capabilities for event representations. Our code will be available at https://github.com/YuboFeng2023/PromptCL.
Copy Paste: [[2404.17897]] Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2404.17897)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large-scale language models (LLMs) have achieved remarkable success across various language tasks but suffer from hallucinations and temporal misalignment. To mitigate these shortcomings, Retrieval-augmented generation (RAG) has been utilized to provide external knowledge to facilitate the answer generation. However, applying such models to the medical domain faces several challenges due to the lack of domain-specific knowledge and the intricacy of real-world scenarios. In this study, we explore LLMs with RAG framework for knowledge-intensive tasks in the medical field. To evaluate the capabilities of LLMs, we introduce MedicineQA, a multi-round dialogue benchmark that simulates the real-world medication consultation scenario and requires LLMs to answer with retrieved evidence from the medicine database. MedicineQA contains 300 multi-round question-answering pairs, each embedded within a detailed dialogue history, highlighting the challenge posed by this knowledge-intensive task to current LLMs. We further propose a new \textit{Distill-Retrieve-Read} framework instead of the previous \textit{Retrieve-then-Read}. Specifically, the distillation and retrieval process utilizes a tool calling mechanism to formulate search queries that emulate the keyword-based inquiries used by search engines. With experimental results, we show that our framework brings notable performance improvements and surpasses the previous counterparts in the evidence retrieval process in terms of evidence retrieval accuracy. This advancement sheds light on applying RAG to the medical domain.
Copy Paste: [[2404.17912]] SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models(https://arxiv.org/abs/2404.17912)
Keywords: language model, gpt, llm, hallucination
Abstract: Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that don't accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a unique self-supervised loss that leverages similarity between pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image and the generated text, therefore reducing hallucination and continuously enhancing nuanced report generation. SERPENT-VLM outperforms existing baselines such as LLaVA-Med, BiomedGPT, etc., achieving SoTA performance on the IU X-ray and Radiology Objects in COntext (ROCO) datasets, and also proves to be robust against noisy images. A qualitative case study emphasizes the significant advancements towards more sophisticated MLLM frameworks for R2Gen, opening paths for further research into self-supervised refinement in the medical imaging domain.
摘要:放射学报告生成 (R2Gen) 演示了多模式大语言模型 (MLLM) 如何自动创建准确且连贯的放射学报告。现有的方法通常会产生基于文本的报告中的细节,而这些细节不能准确反映图像内容。为了缓解这个问题,我们引入了一种新颖的策略,SERPENT-VLM(使用视觉语言模型的自精炼放射学报告生成),它通过将自精炼机制集成到 MLLM 框架中来改进 R2Gen 任务。我们采用独特的自监督损失,利用池图像表示和生成的放射文本的上下文表示之间的相似性,以及标准因果语言建模目标,来完善图像文本表示。这使得模型能够通过给定图像和生成文本之间的动态交互来仔细检查和对齐生成的文本,从而减少幻觉并不断增强细致入微的报告生成。 SERPENT-VLM 优于 LLaVA-Med、BiomedGPT 等现有基线,在 COntext (ROCO) 数据集中的 IU X 射线和放射学对象上实现了 SoTA 性能,并且还证明对噪声图像具有鲁棒性。定性案例研究强调了 R2Gen 向更复杂的 MLLM 框架取得的重大进步,为进一步研究医学成像领域的自我监督细化开辟了道路。
Title: Transfer Learning Enhanced Single-choice Decision for Multi-choice Question Answering
Authors: Chenhao Cui, Yufan Jiang, Shuangzhi Wu, Zhoujun Li
Copy Paste: [[2404.17949]] Transfer Learning Enhanced Single-choice Decision for Multi-choice Question Answering(https://arxiv.org/abs/2404.17949)
Keywords: language model
Abstract: Multi-choice Machine Reading Comprehension (MMRC) aims to select the correct answer from a set of options based on a given passage and question. The existing methods employ the pre-trained language model as the encoder, share and transfer knowledge through fine-tuning.These methods mainly focus on the design of exquisite mechanisms to effectively capture the relationships among the triplet of passage, question and answers. It is non-trivial but ignored to transfer knowledge from other MRC tasks such as SQuAD due to task specific of MMRC.In this paper, we reconstruct multi-choice to single-choice by training a binary classification to distinguish whether a certain answer is correct. Then select the option with the highest confidence score as the final answer. Our proposed method gets rid of the multi-choice framework and can leverage resources of other tasks. We construct our model based on the ALBERT-xxlarge model and evaluate it on the RACE and DREAM datasets. Experimental results show that our model performs better than multi-choice methods. In addition, by transferring knowledge from other kinds of MRC tasks, our model achieves state-of-the-art results in both single and ensemble settings.
Copy Paste: [[2404.17975]] Automating Customer Needs Analysis: A Comparative Study of Large Language Models in the Travel Industry(https://arxiv.org/abs/2404.17975)
Keywords: language model, gpt, llm
Abstract: In the rapidly evolving landscape of Natural Language Processing (NLP), Large Language Models (LLMs) have emerged as powerful tools for many tasks, such as extracting valuable insights from vast amounts of textual data. In this study, we conduct a comparative analysis of LLMs for the extraction of travel customer needs from TripAdvisor posts. Leveraging a diverse range of models, including both open-source and proprietary ones such as GPT-4 and Gemini, we aim to elucidate their strengths and weaknesses in this specialized domain. Through an evaluation process involving metrics such as BERTScore, ROUGE, and BLEU, we assess the performance of each model in accurately identifying and summarizing customer needs. Our findings highlight the efficacy of opensource LLMs, particularly Mistral 7B, in achieving comparable performance to larger closed models while offering affordability and customization benefits. Additionally, we underscore the importance of considering factors such as model size, resource requirements, and performance metrics when selecting the most suitable LLM for customer needs analysis tasks. Overall, this study contributes valuable insights for businesses seeking to leverage advanced NLP techniques to enhance customer experience and drive operational efficiency in the travel industry.
Copy Paste: [[2404.17985]] Detection of Conspiracy Theories Beyond Keyword Bias in German-Language Telegram Using Large Language Models(https://arxiv.org/abs/2404.17985)
Keywords: language model, gpt, prompt
Abstract: The automated detection of conspiracy theories online typically relies on supervised learning. However, creating respective training data requires expertise, time and mental resilience, given the often harmful content. Moreover, available datasets are predominantly in English and often keyword-based, introducing a token-level bias into the models. Our work addresses the task of detecting conspiracy theories in German Telegram messages. We compare the performance of supervised fine-tuning approaches using BERT-like models with prompt-based approaches using Llama2, GPT-3.5, and GPT-4 which require little or no additional training data. We use a dataset of $\sim\!\! 4,000$ messages collected during the COVID-19 pandemic, without the use of keyword filters. Our findings demonstrate that both approaches can be leveraged effectively: For supervised fine-tuning, we report an F1 score of $\sim\!\! 0.8$ for the positive class, making our model comparable to recent models trained on keyword-focused English corpora. We demonstrate our model's adaptability to intra-domain temporal shifts, achieving F1 scores of $\sim\!\! 0.7$. Among prompting variants, the best model is GPT-4, achieving an F1 score of $\sim\!\! 0.8$ for the positive class in a zero-shot setting and equipped with a custom conspiracy theory definition.
摘要:在线阴谋论的自动检测通常依赖于监督学习。然而,鉴于内容往往有害,创建相应的培训数据需要专业知识、时间和心理弹性。此外,可用的数据集主要是英文的,并且通常基于关键字,从而在模型中引入了标记级偏差。我们的工作致力于检测德国电报消息中的阴谋论。我们比较了使用类似 BERT 模型的监督微调方法与使用 Llama2、GPT-3.5 和 GPT-4 的基于提示的方法的性能,后者需要很少或不需要额外的训练数据。我们使用 $\sim\!\! 的数据集在 COVID-19 大流行期间收集的价值 4,000 美元的消息,未使用关键字过滤器。我们的研究结果表明,这两种方法都可以有效利用:对于监督微调,我们报告的 F1 分数为 $\sim\!\!正类的费用为 0.8 美元,这使得我们的模型与最近在以关键词为中心的英语语料库上训练的模型相当。我们展示了我们的模型对域内时间变化的适应性,实现了 $\sim\!\! 的 F1 分数0.7 美元。在提示变体中,最好的模型是 GPT-4,获得了 $\sim\!\! 的 F1 分数零样本设置中的正类别为 0.8 美元,并配备了自定义阴谋论定义。
Title: Enhancing Pre-Trained Generative Language Models with Question Attended Span Extraction on Machine Reading Comprehension
Authors: Lin Ai, Zheng Hui, Zizhou Liu, Julia Hirschberg
Copy Paste: [[2404.17991]] Enhancing Pre-Trained Generative Language Models with Question Attended Span Extraction on Machine Reading Comprehension(https://arxiv.org/abs/2404.17991)
Keywords: language model, gpt, llm
Abstract: Machine Reading Comprehension (MRC) poses a significant challenge in the field of Natural Language Processing (NLP). While mainstream MRC methods predominantly leverage extractive strategies using encoder-only models such as BERT, generative approaches face the issue of out-of-control generation -- a critical problem where answers generated are often incorrect, irrelevant, or unfaithful to the source text. To address these limitations in generative models for MRC, we introduce the Question-Attended Span Extraction (QASE) module. Integrated during the fine-tuning phase of pre-trained generative language models (PLMs), QASE significantly enhances their performance, allowing them to surpass the extractive capabilities of advanced Large Language Models (LLMs) such as GPT-4. Notably, these gains in performance do not come with an increase in computational demands. The efficacy of the QASE module has been rigorously tested across various datasets, consistently achieving or even surpassing state-of-the-art (SOTA) results.
Copy Paste: [[2404.17999]] MediFact at MEDIQA-CORR 2024: Why AI Needs a Human Touch(https://arxiv.org/abs/2404.17999)
Keywords: language model, llm
Abstract: Accurate representation of medical information is crucial for patient safety, yet artificial intelligence (AI) systems, such as Large Language Models (LLMs), encounter challenges in error-free clinical text interpretation. This paper presents a novel approach submitted to the MEDIQA-CORR 2024 shared task (Ben Abacha et al., 2024a), focusing on the automatic correction of single-word errors in clinical notes. Unlike LLMs that rely on extensive generic data, our method emphasizes extracting contextually relevant information from available clinical text data. Leveraging an ensemble of extractive and abstractive question-answering approaches, we construct a supervised learning framework with domain-specific feature engineering. Our methodology incorporates domain expertise to enhance error correction accuracy. By integrating domain expertise and prioritizing meaningful information extraction, our approach underscores the significance of a human-centric strategy in adapting AI for healthcare.
Copy Paste: [[2404.18043]] Utilizing Large Language Models for Information Extraction from Real Estate Transactions(https://arxiv.org/abs/2404.18043)
Keywords: language model
Abstract: Real estate sales contracts contain crucial information for property transactions, but manual extraction of data can be time-consuming and error-prone. This paper explores the application of large language models, specifically transformer-based architectures, for automated information extraction from real estate contracts. We discuss challenges, techniques, and future directions in leveraging these models to improve efficiency and accuracy in real estate contract analysis.
Copy Paste: [[2404.18057]] Efficient LLM Inference with Kcache(https://arxiv.org/abs/2404.18057)
Keywords: language model, llm
Abstract: Large Language Models(LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures efficient sequence generation by caching previously computed KV states. However, it also introduces significant memory overhead. We discovered that KV Cache is not necessary and proposed a novel KCache technique to alleviate the memory bottleneck issue during the LLMs inference process. KCache can be used directly for inference without any training process, Our evaluations show that KCache improves the throughput of popular LLMs by 40% with the baseline, while keeping accuracy.
Copy Paste: [[2404.18071]] Can Perplexity Predict Fine-Tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali(https://arxiv.org/abs/2404.18071)
Keywords: language model, gpt
Abstract: Recent language models use subwording mechanisms to handle Out-of-Vocabulary(OOV) words seen during test time and, their generation capacity is generally measured using perplexity, an intrinsic metric. It is known that increasing the subword granularity results in a decrease of perplexity value. However, the study of how subwording affects the understanding capacity of language models has been very few and only limited to a handful of languages. To reduce this gap we used 6 different tokenization schemes to pretrain relatively small language models in Nepali and used the representations learned to finetune on several downstream tasks. Although byte-level BPE algorithm has been used in recent models like GPT, RoBERTa we show that on average they are sub-optimal in comparison to algorithms such as SentencePiece in finetuning performances for Nepali. Additionally, similar recent studies have focused on the Bert-based language model. We, however, pretrain and finetune sequential transformer-based language models.
Copy Paste: [[2404.18072]] Contextual Spelling Correction with Language Model for Low-resource Setting(https://arxiv.org/abs/2404.18072)
Keywords: language model
Abstract: The task of Spell Correction(SC) in low-resource languages presents a significant challenge due to the availability of only a limited corpus of data and no annotated spelling correction datasets. To tackle these challenges a small-scale word-based transformer LM is trained to provide the SC model with contextual understanding. Further, the probabilistic error rules are extracted from the corpus in an unsupervised way to model the tendency of error happening(error model). Then the combination of LM and error model is used to develop the SC model through the well-known noisy channel framework. The effectiveness of this approach is demonstrated through experiments on the Nepali language where there is access to just an unprocessed corpus of textual data.
Copy Paste: [[2404.18085]] CRE-LLM: A Domain-Specific Chinese Relation Extraction Framework with Fine-tuned Large Language Model(https://arxiv.org/abs/2404.18085)
Keywords: language model, llm, prompt, chat
Abstract: Domain-Specific Chinese Relation Extraction (DSCRE) aims to extract relations between entities from domain-specific Chinese text. Despite the rapid development of PLMs in recent years, especially LLMs, DSCRE still faces three core challenges: complex network structure design, poor awareness, and high consumption of fine-tuning. Given the impressive performance of large language models (LLMs) in natural language processing, we propose a new framework called CRE-LLM. This framework is based on fine-tuning open-source LLMs, such as Llama-2, ChatGLM2, and Baichuan2. CRE-LLM enhances the logic-awareness and generative capabilities of the model by constructing an appropriate prompt and utilizing open-source LLMs for instruction-supervised fine-tuning. And then it directly extracts the relations of the given entities in the input textual data, which improving the CRE approach. To demonstrate the effectiveness of the proposed framework, we conducted extensive experiments on two domain-specific CRE datasets, FinRE and SanWen. The experimental results show that CRE-LLM is significantly superior and robust, achieving state-of-the-art (SOTA) performance on the FinRE dataset. This paper introduces a novel approach to domain-specific relation extraction (DSCRE) tasks that are semantically more complex by combining LLMs with triples. Our code is publicly available.
Copy Paste: [[2404.18191]] Exploring the Robustness of In-Context Learning with Noisy Labels(https://arxiv.org/abs/2404.18191)
Keywords: language model, llm, prompt
Abstract: Recently, the mysterious In-Context Learning (ICL) ability exhibited by Transformer architectures, especially in large language models (LLMs), has sparked significant research interest. However, the resilience of Transformers' in-context learning capabilities in the presence of noisy samples, prevalent in both training corpora and prompt demonstrations, remains underexplored. In this paper, inspired by prior research that studies ICL ability using simple function classes, we take a closer look at this problem by investigating the robustness of Transformers against noisy labels. Specifically, we first conduct a thorough evaluation and analysis of the robustness of Transformers against noisy labels during in-context learning and show that they exhibit notable resilience against diverse types of noise in demonstration labels. Furthermore, we delve deeper into this problem by exploring whether introducing noise into the training set, akin to a form of data augmentation, enhances such robustness during inference, and find that such noise can indeed improve the robustness of ICL. Overall, our fruitful analysis and findings provide a comprehensive understanding of the resilience of Transformer models against label noises during ICL and provide valuable insights into the research on Transformers in natural language processing. Our code is available at https://github.com/InezYu0928/in-context-learning.
Copy Paste: [[2404.18228]] TextGram: Towards a better domain-adaptive pretraining(https://arxiv.org/abs/2404.18228)
Keywords: language model
Abstract: For green AI, it is crucial to measure and reduce the carbon footprint emitted during the training of large language models. In NLP, performing pre-training on Transformer models requires significant computational resources. This pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks. Thus, it is important that we select the correct data in the form of domain-specific data from this vast corpus to achieve optimum results aligned with our domain-specific tasks. While training on large unsupervised data is expensive, it can be optimized by performing a data selection step before pretraining. Selecting important data reduces the space overhead and the substantial amount of time required to pre-train the model while maintaining constant accuracy. We investigate the existing selection strategies and propose our own domain-adaptive data selection method - TextGram - that effectively selects essential data from large corpora. We compare and evaluate the results of finetuned models for text classification task with and without data selection. We show that the proposed strategy works better compared to other selection methods.
Copy Paste: [[2404.18231]] From Persona to Personalization: A Survey on Role-Playing Language Agents(https://arxiv.org/abs/2404.18231)
Keywords: language model, llm, agent
Abstract: Recent advancements in large language models (LLMs) have significantly boosted the rise of Role-Playing Language Agents (RPLAs), i.e., specialized AI systems designed to simulate assigned personas. By harnessing multiple advanced abilities of LLMs, including in-context learning, instruction following, and social intelligence, RPLAs achieve a remarkable sense of human likeness and vivid role-playing performance. RPLAs can mimic a wide range of personas, ranging from historical figures and fictional characters to real-life individuals. Consequently, they have catalyzed numerous AI applications, such as emotional companions, interactive video games, personalized assistants and copilots, and digital clones. In this paper, we conduct a comprehensive survey of this field, illustrating the evolution and recent progress in RPLAs integrating with cutting-edge LLM technologies. We categorize personas into three types: 1) Demographic Persona, which leverages statistical stereotypes; 2) Character Persona, focused on well-established figures; and 3) Individualized Persona, customized through ongoing user interactions for personalized services. We begin by presenting a comprehensive overview of current methodologies for RPLAs, followed by the details for each persona type, covering corresponding data sourcing, agent construction, and evaluation. Afterward, we discuss the fundamental risks, existing limitations, and future prospects of RPLAs. Additionally, we provide a brief review of RPLAs in AI applications, which reflects practical user demands that shape and drive RPLA research. Through this work, we aim to establish a clear taxonomy of RPLA research and applications, and facilitate future research in this critical and ever-evolving field, and pave the way for a future where humans and RPLAs coexist in harmony.
Copy Paste: [[2404.18243]] LEGENT: Open Platform for Embodied Agents(https://arxiv.org/abs/2404.18243)
Keywords: language model, gpt, llm, agent
Abstract: Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities.
Copy Paste: [[2404.18255]] PatentGPT: A Large Language Model for Intellectual Property(https://arxiv.org/abs/2404.18255)
Keywords: language model, gpt, llm, agent
Abstract: In recent years, large language models have attracted significant attention due to their exceptional performance across a multitude of natural language process tasks, and have been widely applied in various fields. However, the application of large language models in the Intellectual Property (IP) space is challenging due to the strong need for specialized knowledge, privacy protection, processing of extremely long text in this field. In this technical report, we present for the first time a low-cost, standardized procedure for training IP-oriented LLMs, meeting the unique requirements of the IP domain. Using this standard process, we have trained the PatentGPT series models based on open-source pretrained models. By evaluating them on the open-source IP-oriented benchmark MOZIP, our domain-specific LLMs outperforms GPT-4, indicating the effectiveness of the proposed training procedure and the expertise of the PatentGPT models in the IP demain. What is impressive is that our model significantly outperformed GPT-4 on the 2019 China Patent Agent Qualification Examination by achieving a score of 65, reaching the level of human experts. Additionally, the PatentGPT model, which utilizes the SMoE architecture, achieves performance comparable to that of GPT-4 in the IP domain and demonstrates a better cost-performance ratio on long-text tasks, potentially serving as an alternative to GPT-4 within the IP domain.
摘要:近年来,大型语言模型因其在众多自然语言处理任务中的出色表现而受到广泛关注,并已广泛应用于各个领域。然而,由于该领域对专业知识、隐私保护、超长文本处理的强烈需求,大语言模型在知识产权领域的应用具有挑战性。在这份技术报告中,我们首次提出了一种低成本、标准化的程序来培训面向知识产权的法学硕士,满足知识产权领域的独特要求。使用这个标准流程,我们基于开源预训练模型训练了 PatentGPT 系列模型。通过在面向开源 IP 的基准 MOZIP 上对其进行评估,我们特定领域的 LLM 的表现优于 GPT-4,这表明了所提出的培训程序的有效性以及 PatentGPT 模型在 IP 领域的专业知识。令人印象深刻的是,我们的模型在2019年中国专利代理人资格考试中显着优于GPT-4,取得了65分,达到了人类专家的水平。此外,采用SMoE架构的PatentGPT模型在IP领域实现了与GPT-4相当的性能,并且在长文本任务上表现出了更好的性价比,有可能成为GPT-4在IP领域的替代方案。 IP 域。
Title: Parameter-Efficient Tuning Large Language Models for Graph Representation Learning
Authors: Qi Zhu, Da Zheng, Xiang Song, Shichang Zhang, Bowen Jin, Yizhou Sun, George Karypis
Copy Paste: [[2404.18271]] Parameter-Efficient Tuning Large Language Models for Graph Representation Learning(https://arxiv.org/abs/2404.18271)
Keywords: language model, llm, prompt
Abstract: Text-rich graphs, which exhibit rich textual information on nodes and edges, are prevalent across a wide range of real-world business applications. Large Language Models (LLMs) have demonstrated remarkable abilities in understanding text, which also introduced the potential for more expressive modeling in text-rich graphs. Despite these capabilities, efficiently applying LLMs to representation learning on graphs presents significant challenges. Recently, parameter-efficient fine-tuning methods for LLMs have enabled efficient new task generalization with minimal time and memory consumption. Inspired by this, we introduce Graph-aware Parameter-Efficient Fine-Tuning - GPEFT, a novel approach for efficient graph representation learning with LLMs on text-rich graphs. Specifically, we utilize a graph neural network (GNN) to encode structural information from neighboring nodes into a graph prompt. This prompt is then inserted at the beginning of the text sequence. To improve the quality of graph prompts, we pre-trained the GNN to assist the frozen LLM in predicting the next token in the node text. Compared with existing joint GNN and LMs, our method directly generate the node embeddings from large language models with an affordable fine-tuning cost. We validate our approach through comprehensive experiments conducted on 8 different text-rich graphs, observing an average improvement of 2% in hit@1 and Mean Reciprocal Rank (MRR) in link prediction evaluations. Our results demonstrate the efficacy and efficiency of our model, showing that it can be smoothly integrated with various large language models, including OPT, LLaMA and Falcon.
Copy Paste: [[2404.18276]] Bias Neutralization Framework: Measuring Fairness in Large Language Models with Bias Intelligence Quotient (BiQ)(https://arxiv.org/abs/2404.18276)
Keywords: language model, gpt, llm, chat
Abstract: The burgeoning influence of Large Language Models (LLMs) in shaping public discourse and decision-making underscores the imperative to address inherent biases within these AI systems. In the wake of AI's expansive integration across sectors, addressing racial bias in LLMs has never been more critical. This paper introduces a novel framework called Comprehensive Bias Neutralization Framework (CBNF) which embodies an innovative approach to quantifying and mitigating biases within LLMs. Our framework combines the Large Language Model Bias Index (LLMBI) [Oketunji, A., Anas, M., Saina, D., (2023)] and Bias removaL with No Demographics (BLIND) [Orgad, H., Belinkov, Y. (2023)] methodologies to create a new metric called Bias Intelligence Quotient (BiQ)which detects, measures, and mitigates racial bias in LLMs without reliance on demographic annotations. By introducing a new metric called BiQ that enhances LLMBI with additional fairness metrics, CBNF offers a multi-dimensional metric for bias assessment, underscoring the necessity of a nuanced approach to fairness in AI [Mehrabi et al., 2021]. This paper presents a detailed analysis of Latimer AI (a language model incrementally trained on black history and culture) in comparison to ChatGPT 3.5, illustrating Latimer AI's efficacy in detecting racial, cultural, and gender biases through targeted training and refined bias mitigation strategies [Latimer & Bender, 2023].
摘要:大型语言模型 (LLM) 在塑造公共话语和决策方面的影响力日益增长,这凸显了解决这些人工智能系统中固有偏见的必要性。随着人工智能跨行业的广泛整合,解决法学硕士中的种族偏见变得前所未有的重要。本文介绍了一种称为综合偏见中和框架(CBNF)的新颖框架,它体现了一种量化和减轻法学硕士内部偏见的创新方法。我们的框架结合了大型语言模型偏差指数 (LLMBI) [Oketunji, A., Anas, M., Saina, D., (2023)] 和无人口统计的偏差消除 (BLIND) [Orgad, H., Belinkov, Y . (2023)] 创建一种称为偏差智商 (BiQ) 的新指标的方法,该指标可以在不依赖人口统计注释的情况下检测、测量和减轻法学硕士中的种族偏见。通过引入一种名为 BiQ 的新指标,通过额外的公平性指标增强 LLMBI,CBNF 提供了用于偏见评估的多维指标,强调了在 AI 中采取细致入微的公平性方法的必要性 [Mehrabi et al., 2021]。本文详细分析了 Latimer AI(一种针对黑人历史和文化进行增量训练的语言模型)与 ChatGPT 3.5 的比较,说明了 Latimer AI 通过有针对性的训练和完善的偏见缓解策略检测种族、文化和性别偏见的功效 [Latimer AI &本德,2023]。
Title: Comparing LLM prompting with Cross-lingual transfer performance on Indigenous and Low-resource Brazilian Languages
Authors: David Ifeoluwa Adelani, A. Seza Doğruöz, André Coneglian, Atul Kr. Ojha
Copy Paste: [[2404.18286]] Comparing LLM prompting with Cross-lingual transfer performance on Indigenous and Low-resource Brazilian Languages(https://arxiv.org/abs/2404.18286)
Keywords: language model, llm, prompt
Abstract: Large Language Models are transforming NLP for a variety of tasks. However, how LLMs perform NLP tasks for low-resource languages (LRLs) is less explored. In line with the goals of the AmeicasNLP workshop, we focus on 12 LRLs from Brazil, 2 LRLs from Africa and 2 high-resource languages (HRLs) (e.g., English and Brazilian Portuguese). Our results indicate that the LLMs perform worse for the part of speech (POS) labeling of LRLs in comparison to HRLs. We explain the reasons behind this failure and provide an error analyses through examples observed in our data set.
Copy Paste: [[2404.18359]] FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models(https://arxiv.org/abs/2404.18359)
Keywords: language model, llm
Abstract: In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choice questions across common sense and K-12 educational subjects, meticulously curated to reflect the breadth and depth of everyday and academic knowledge. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities. The insights gleaned from FoundaBench evaluations set a new standard for understanding the fundamental knowledge of LLMs, providing a robust framework for future advancements in the field.
Copy Paste: [[2404.18371]] QANA: LLM-based Question Generation and Network Analysis for Zero-shot Key Point Analysis and Beyond(https://arxiv.org/abs/2404.18371)
Keywords: language model, llm
Abstract: The proliferation of social media has led to information overload and increased interest in opinion mining. We propose "Question-Answering Network Analysis" (QANA), a novel opinion mining framework that utilizes Large Language Models (LLMs) to generate questions from users' comments, constructs a bipartite graph based on the comments' answerability to the questions, and applies centrality measures to examine the importance of opinions. We investigate the impact of question generation styles, LLM selections, and the choice of embedding model on the quality of the constructed QA networks by comparing them with annotated Key Point Analysis datasets. QANA achieves comparable performance to previous state-of-the-art supervised models in a zero-shot manner for Key Point Matching task, also reducing the computational cost from quadratic to linear. For Key Point Generation, questions with high PageRank or degree centrality align well with manually annotated key points. Notably, QANA enables analysts to assess the importance of key points from various aspects according to their selection of centrality measure. QANA's primary contribution lies in its flexibility to extract key points from a wide range of perspectives, which enhances the quality and impartiality of opinion mining.
Copy Paste: [[2404.18384]] Exploring the Limits of Fine-grained LLM-based Physics Inference via Premise Removal Interventions(https://arxiv.org/abs/2404.18384)
Keywords: language model, llm
Abstract: Language models can hallucinate when performing complex and detailed mathematical reasoning. Physics provides a rich domain for assessing mathematical reasoning capabilities where physical context imbues the use of symbols which needs to satisfy complex semantics (\textit{e.g.,} units, tensorial order), leading to instances where inference may be algebraically coherent, yet unphysical. In this work, we assess the ability of Language Models (LMs) to perform fine-grained mathematical and physical reasoning using a curated dataset encompassing multiple notations and Physics subdomains. We improve zero-shot scores using synthetic in-context examples, and demonstrate non-linear degradation of derivation quality with perturbation strength via the progressive omission of supporting premises. We find that the models' mathematical reasoning is not physics-informed in this setting, where physical context is predominantly ignored in favour of reverse-engineering solutions.
Copy Paste: [[2404.18398]] MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis(https://arxiv.org/abs/2404.18398)
Keywords: prompt
Abstract: Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
Copy Paste: [[2404.18410]] Mixture-of-Instructions: Comprehensive Alignment of a Large Language Model through the Mixture of Diverse System Prompting Instructions(https://arxiv.org/abs/2404.18410)
Keywords: language model, llm, prompt, chat
Abstract: With the proliferation of large language models (LLMs), the comprehensive alignment of such models across multiple tasks has emerged as a critical area of research. Existing alignment methodologies primarily address single task, such as multi-turn dialogue, coding, mathematical problem-solving, and tool usage. However, AI-driven products that leverage language models usually necessitate a fusion of these abilities to function effectively in real-world scenarios. Moreover, the considerable computational resources required for proper alignment of LLMs underscore the need for a more robust, efficient, and encompassing approach to multi-task alignment, ensuring improved generative performance. In response to these challenges, we introduce a novel technique termed Mixture-of-Instructions (MoI), which employs a strategy of instruction concatenation combined with diverse system prompts to boost the alignment efficiency of language models. We have also compiled a diverse set of seven benchmark datasets to rigorously evaluate the alignment efficacy of the MoI-enhanced language model. Our methodology was applied to the open-source Qwen-7B-chat model, culminating in the development of Qwen-SFT-MoI. This enhanced model demonstrates significant advancements in generative capabilities across coding, mathematics, and tool use tasks.
摘要:随着大型语言模型(LLM)的激增,跨多个任务的此类模型的全面对齐已成为一个关键的研究领域。现有的对齐方法主要解决单一任务,例如多轮对话、编码、数学问题解决和工具使用。然而,利用语言模型的人工智能驱动产品通常需要融合这些能力才能在现实场景中有效运行。此外,法学硕士的正确对齐所需的大量计算资源强调需要一种更强大、更高效、更全面的多任务对齐方法,以确保提高生成性能。为了应对这些挑战,我们引入了一种称为混合指令(MoI)的新技术,该技术采用指令串联策略与不同的系统提示相结合来提高语言模型的对齐效率。我们还编译了一组不同的七个基准数据集,以严格评估 MoI 增强语言模型的对齐效果。我们的方法应用于开源 Qwen-7B-聊天模型,最终开发了 Qwen-SFT-MoI。这个增强的模型展示了编码、数学和工具使用任务的生成能力的显着进步。
Title: BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers
Authors: Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May D. Wang, Joyce C. Ho, Chao Zhang, Carl Yang
Copy Paste: [[2404.18443]] BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers(https://arxiv.org/abs/2404.18443)
Keywords: language model
Abstract: Developing effective biomedical retrieval models is important for excelling at knowledge-intensive biomedical tasks but still challenging due to the deficiency of sufficient publicly annotated biomedical data and computational resources. We present BMRetriever, a series of dense retrievers for enhancing biomedical retrieval via unsupervised pre-training on large biomedical corpora, followed by instruction fine-tuning on a combination of labeled datasets and synthetic pairs. Experiments on 5 biomedical tasks across 11 datasets verify BMRetriever's efficacy on various biomedical applications. BMRetriever also exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger, and the 2B variant matching the performance of models with over 5B parameters. The training data and model checkpoints are released at \url{https://huggingface.co/BMRetriever} to ensure transparency, reproducibility, and application to new domains.
Copy Paste: [[2404.18460]] Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language we Prompt them in(https://arxiv.org/abs/2404.18460)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Ethical reasoning is a crucial skill for Large Language Models (LLMs). However, moral values are not universal, but rather influenced by language and culture. This paper explores how three prominent LLMs -- GPT-4, ChatGPT, and Llama2-70B-Chat -- perform ethical reasoning in different languages and if their moral judgement depend on the language in which they are prompted. We extend the study of ethical reasoning of LLMs by Rao et al. (2023) to a multilingual setup following their framework of probing LLMs with ethical dilemmas and policies from three branches of normative ethics: deontology, virtue, and consequentialism. We experiment with six languages: English, Spanish, Russian, Chinese, Hindi, and Swahili. We find that GPT-4 is the most consistent and unbiased ethical reasoner across languages, while ChatGPT and Llama2-70B-Chat show significant moral value bias when we move to languages other than English. Interestingly, the nature of this bias significantly vary across languages for all LLMs, including GPT-4.
Copy Paste: [[2404.18466]] HFT: Half Fine-Tuning for Large Language Models(https://arxiv.org/abs/2404.18466)
Keywords: language model, llm
Abstract: Large language models (LLMs) with one or more fine-tuning phases have become a necessary step to unlock various capabilities, enabling LLMs to follow natural language instructions or align with human preferences. However, it carries the risk of catastrophic forgetting during sequential training, the parametric knowledge or the ability learned in previous stages may be overwhelmed by incoming training data. In this paper, we find that by regularly resetting partial parameters, LLMs can restore some of the original knowledge. Inspired by this, we introduce Half Fine-Tuning (HFT) for LLMs, as a substitute for full fine-tuning (FFT), to mitigate the forgetting issues, where half of the parameters are selected to learn new tasks while the other half are frozen to remain previous knowledge. We provide a feasibility analysis from the perspective of optimization and interpret the parameter selection operation as a regularization term. Without changing the model architecture, HFT could be seamlessly integrated into existing fine-tuning frameworks. Extensive experiments and analysis on supervised fine-tuning, direct preference optimization, and continual learning consistently demonstrate the effectiveness, robustness, and efficiency of HFT. Compared with FFT, HFT not only significantly alleviates the forgetting problem, but also achieves the best performance in a series of downstream benchmarks, with an approximately 30% reduction in training time.
Copy Paste: [[2404.18532]] MileBench: Benchmarking MLLMs in Long Context(https://arxiv.org/abs/2404.18532)
Keywords: language model, gpt, llm, long context
Abstract: Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on specific task (e.g time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 20 models, revealed that while the closed-source GPT-4(Vision) and Gemini 1.5 outperform others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen with an increase in the number of images. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.
Copy Paste: [[2404.18534]] Evaluating and Mitigating Linguistic Discrimination in Large Language Models(https://arxiv.org/abs/2404.18534)
Keywords: language model, gpt, llm
Abstract: By training on text in various languages, large language models (LLMs) typically possess multilingual support and demonstrate remarkable capabilities in solving tasks described in different languages. However, LLMs can exhibit linguistic discrimination due to the uneven distribution of training data across languages. That is, LLMs are hard to keep the consistency of responses when faced with the same task but depicted in different languages. In this study, we first explore the consistency in the LLMs' outputs responding to queries in various languages from two aspects: safety and quality. We conduct this analysis with two datasets (AdvBench and NQ) based on four LLMs (Llama2-13b, Gemma-7b, GPT-3.5-turbo and Gemini-pro). The results show that LLMs exhibit stronger human alignment capabilities with queries in English, French, Russian, and Spanish (only 1.04\% of harmful queries successfully jailbreak on average) compared to queries in Bengali, Georgian, Nepali and Maithili (27.7\% of harmful queries jailbreak successfully on average). Moreover, for queries in English, Danish, Czech and Slovenian, LLMs tend to produce responses with a higher quality (with 0.1494 $F_1$ score on average) compared to the other languages. Upon these findings, we propose LDFighter, a similarity-based voting, to mitigate the linguistic discrimination in LLMs. LDFighter ensures consistent service for different language speakers. We evaluate LDFighter with both benign queries and harmful queries. The results show that LDFighter not only significantly reduces the jailbreak success rate but also improve the response quality on average, demonstrating its effectiveness.
Copy Paste: [[2404.18543]] Time Machine GPT(https://arxiv.org/abs/2404.18543)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are often trained on extensive, temporally indiscriminate text corpora, reflecting the lack of datasets with temporal metadata. This approach is not aligned with the evolving nature of language. Conventional methods for creating temporally adapted language models often depend on further pre-training static models on time-specific data. This paper presents a new approach: a series of point-in-time LLMs called Time Machine GPT (TiMaGPT), specifically designed to be nonprognosticative. This ensures they remain uninformed about future factual information and linguistic changes. This strategy is beneficial for understanding language evolution and is of critical importance when applying models in dynamic contexts, such as time-series forecasting, where foresight of future information can prove problematic. We provide access to both the models and training datasets.
Copy Paste: [[2404.18557]] Can GPT-4 do L2 analytic assessment?(https://arxiv.org/abs/2404.18557)
Keywords: language model, gpt
Abstract: Automated essay scoring (AES) to evaluate second language (L2) proficiency has been a firmly established technology used in educational contexts for decades. Although holistic scoring has seen advancements in AES that match or even exceed human performance, analytic scoring still encounters issues as it inherits flaws and shortcomings from the human scoring process. The recent introduction of large language models presents new opportunities for automating the evaluation of specific aspects of L2 writing proficiency. In this paper, we perform a series of experiments using GPT-4 in a zero-shot fashion on a publicly available dataset annotated with holistic scores based on the Common European Framework of Reference and aim to extract detailed information about their underlying analytic components. We observe significant correlations between the automatically predicted analytic scores and multiple features associated with the individual proficiency components.
Copy Paste: [[2404.18564]] Injecting Salesperson's Dialogue Strategies in Large Language Models with Chain-of-Thought Reasoning(https://arxiv.org/abs/2404.18564)
Keywords: language model, llm, prompt, chat, chain-of-thought, agent
Abstract: Recent research in dialogue systems and corpora has focused on two main categories: task-oriented (TOD) and open-domain (chit-chat) dialogues. TOD systems help users accomplish specific tasks, while open-domain systems aim to create engaging conversations. However, in real-world scenarios, user intents are often revealed during interactions. A recent study introduced SalesBot, which simulates dialogues transitioning from chit-chat to task-oriented scenarios to train sales agents. Unfortunately, the initial data lacked smooth transitions and coherent long-turn dialogues, resulting in poor naturalness in sales-customer interactions. To address these issues, this paper presents SalesBot 2.0, an improved dataset. It leverages commonsense knowledge from large language models (LLMs) through strategic prompting. Additionally, we introduce a novel model called SalesAgent, trained on salesperson's interactions, using chain-of-thought (CoT) reasoning. This model excels in transitioning topics, understanding user intents, and selecting appropriate strategies. Experiments using diverse user simulations validate the effectiveness of our method in controlling dialogue strategies in LLMs. Furthermore, SalesBot 2.0 enhances coherence and reduces aggression, facilitating better model learning for sales-customer interactions.
Copy Paste: [[2404.18570]] Analyzing Semantic Change through Lexical Replacements(https://arxiv.org/abs/2404.18570)
Keywords: language model
Abstract: Modern language models are capable of contextualizing words based on their surrounding context. However, this capability is often compromised due to semantic change that leads to words being used in new, unexpected contexts not encountered during pre-training. In this paper, we model \textit{semantic change} by studying the effect of unexpected contexts introduced by \textit{lexical replacements}. We propose a \textit{replacement schema} where a target word is substituted with lexical replacements of varying relatedness, thus simulating different kinds of semantic change. Furthermore, we leverage the replacement schema as a basis for a novel \textit{interpretable} model for semantic change. We are also the first to evaluate the use of LLaMa for semantic change detection.
Copy Paste: [[2404.18624]] Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?(https://arxiv.org/abs/2404.18624)
Keywords: language model, llm
Abstract: Vision and language models (VLMs) are currently the most generally performant architectures on multimodal tasks. Next to their predictions, they can also produce explanations, either in post-hoc or CoT settings. However, it is not clear how much they use the vision and text modalities when generating predictions or explanations. In this work, we investigate if VLMs rely on modalities differently when generating explanations as opposed to when they provide answers. We also evaluate the self-consistency of VLM decoders in both post-hoc and CoT explanation settings, by extending existing tests and measures to VLM decoders. We find that VLMs are less self-consistent than LLMs. The text contributions in VL decoders are much larger than the image contributions across all measured tasks. And the contributions of the image are significantly larger for explanation generations than for answer generation. This difference is even larger in CoT compared to the post-hoc explanation setting. We also provide an up-to-date benchmarking of state-of-the-art VL decoders on the VALSE benchmark, which to date focused only on VL encoders. We find that VL decoders are still struggling with most phenomena tested by VALSE.
Copy Paste: [[2404.18655]] Revealing the Parametric Knowledge of Language Models: A Unified Framework for Attribution Methods(https://arxiv.org/abs/2404.18655)
Keywords: language model
Abstract: Language Models (LMs) acquire parametric knowledge from their training process, embedding it within their weights. The increasing scalability of LMs, however, poses significant challenges for understanding a model's inner workings and further for updating or correcting this embedded knowledge without the significant cost of retraining. This underscores the importance of unveiling exactly what knowledge is stored and its association with specific model components. Instance Attribution (IA) and Neuron Attribution (NA) offer insights into this training-acquired knowledge, though they have not been compared systematically. Our study introduces a novel evaluation framework to quantify and compare the knowledge revealed by IA and NA. To align the results of the methods we introduce the attribution method NA-Instances to apply NA for retrieving influential training instances, and IA-Neurons to discover important neurons of influential instances discovered by IA. We further propose a comprehensive list of faithfulness tests to evaluate the comprehensiveness and sufficiency of the explanations provided by both methods. Through extensive experiments and analysis, we demonstrate that NA generally reveals more diverse and comprehensive information regarding the LM's parametric knowledge compared to IA. Nevertheless, IA provides unique and valuable insights into the LM's parametric knowledge, which are not revealed by NA. Our findings further suggest the potential of a synergistic approach of combining the diverse findings of IA and NA for a more holistic understanding of an LM's parametric knowledge.
摘要:语言模型 (LM) 从训练过程中获取参数知识,并将其嵌入到权重中。然而,LM 不断增强的可扩展性给理解模型的内部工作原理以及进一步更新或纠正这种嵌入式知识带来了重大挑战,而无需花费大量的再训练成本。这强调了准确揭示存储的知识及其与特定模型组件的关联的重要性。实例归因(IA)和神经元归因(NA)提供了对这种训练获得的知识的见解,尽管它们尚未进行系统比较。我们的研究引入了一种新颖的评估框架来量化和比较 IA 和 NA 揭示的知识。为了对齐方法的结果,我们引入了归因方法 NA-Instances 以应用 NA 来检索有影响力的训练实例,并引入 IA-Neurons 来发现 IA 发现的有影响力实例的重要神经元。我们进一步提出了忠实度测试的综合列表,以评估两种方法提供的解释的全面性和充分性。通过大量的实验和分析,我们证明,与 IA 相比,NA 通常揭示了关于 LM 参数知识的更多样化和更全面的信息。尽管如此,IA 为 LM 的参数知识提供了独特且有价值的见解,这是 NA 所没有透露的。我们的研究结果进一步表明,结合 IA 和 NA 的不同发现的协同方法的潜力,可以更全面地理解 LM 的参数知识。
Title: Where on Earth Do Users Say They Are?: Geo-Entity Linking for Noisy Multilingual User Input
Copy Paste: [[2404.18784]] Where on Earth Do Users Say They Are?: Geo-Entity Linking for Noisy Multilingual User Input(https://arxiv.org/abs/2404.18784)
Keywords: llm
Abstract: Geo-entity linking is the task of linking a location mention to the real-world geographic location. In this paper we explore the challenging task of geo-entity linking for noisy, multilingual social media data. There are few open-source multilingual geo-entity linking tools available and existing ones are often rule-based, which break easily in social media settings, or LLM-based, which are too expensive for large-scale datasets. We present a method which represents real-world locations as averaged embeddings from labeled user-input location names and allows for selective prediction via an interpretable confidence score. We show that our approach improves geo-entity linking on a global and multilingual social media dataset, and discuss progress and problems with evaluating at different geographic granularities.
Copy Paste: [[2404.18796]] Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models(https://arxiv.org/abs/2404.18796)
Keywords: language model, gpt, llm
Abstract: As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT4. While this method has grown in popularity, it is costly, has been shown to introduce intramodel bias, and in this work, we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
Copy Paste: [[2404.18810]] Unknown Script: Impact of Script on Cross-Lingual Transfer(https://arxiv.org/abs/2404.18810)
Keywords: language model
Abstract: Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often-overlooked aspect in this domain: the influence of the source language of the base language model on transfer performance. We conduct a series of experiments to determine the effect of the script and tokenizer used in the pre-trained model on the performance of the downstream task. Our findings reveal the importance of the tokenizer as a stronger factor than the sharing of the script, the language typology match, and the model size.
Copy Paste: [[2404.18824]] Benchmarking Benchmark Leakage in Large Language Models(https://arxiv.org/abs/2404.18824)
Keywords: language model, llm, prompt
Abstract: Amid the expanding use of pre-training data, the phenomenon of benchmark dataset leakage has become increasingly prominent, exacerbated by opaque training processes and the often undisclosed inclusion of supervised data in contemporary Large Language Models (LLMs). This issue skews benchmark effectiveness and fosters potentially unfair comparisons, impeding the field's healthy development. To address this, we introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on benchmark, to identify potential data leakages. By analyzing 31 LLMs under the context of mathematical reasoning, we reveal substantial instances of training even test set misuse, resulting in potentially unfair comparisons. These findings prompt us to offer several recommendations regarding model documentation, benchmark setup, and future evaluations. Notably, we propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization, promoting transparency and healthy developments of LLMs. we have made our leaderboard, pipeline implementation, and model predictions publicly available, fostering future research.
Copy Paste: [[2404.18832]] It's Difficult to be Neutral -- Human and LLM-based Sentiment Annotation of Patient Comments(https://arxiv.org/abs/2404.18832)
Keywords: language model, llm, prompt
Abstract: Sentiment analysis is an important tool for aggregating patient voices, in order to provide targeted improvements in healthcare services. A prerequisite for this is the availability of in-domain data annotated for sentiment. This article documents an effort to add sentiment annotations to free-text comments in patient surveys collected by the Norwegian Institute of Public Health (NIPH). However, annotation can be a time-consuming and resource-intensive process, particularly when it requires domain expertise. We therefore also evaluate a possible alternative to human annotation, using large language models (LLMs) as annotators. We perform an extensive evaluation of the approach for two openly available pretrained LLMs for Norwegian, experimenting with different configurations of prompts and in-context learning, comparing their performance to human annotators. We find that even for zero-shot runs, models perform well above the baseline for binary sentiment, but still cannot compete with human annotators on the full dataset.
Copy Paste: [[2404.18865]] Truth-value judgment in language models: belief directions are context sensitive(https://arxiv.org/abs/2404.18865)
Keywords: language model, llm
Abstract: Recent work has demonstrated that the latent spaces of large language models (LLMs) contain directions predictive of the truth of sentences. Multiple methods recover such directions and build probes that are described as getting at a model's "knowledge" or "beliefs". We investigate this phenomenon, looking closely at the impact of context on the probes. Our experiments establish where in the LLM the probe's predictions can be described as being conditional on the preceding (related) sentences. Specifically, we quantify the responsiveness of the probes to the presence of (negated) supporting and contradicting sentences, and score the probes on their consistency. We also perform a causal intervention experiment, investigating whether moving the representation of a premise along these belief directions influences the position of the hypothesis along that same direction. We find that the probes we test are generally context sensitive, but that contexts which should not affect the truth often still impact the probe outputs. Our experiments show that the type of errors depend on the layer, the (type of) model, and the kind of data. Finally, our results suggest that belief directions are (one of the) causal mediators in the inference process that incorporates in-context information.
Copy Paste: [[2404.18870]] More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness(https://arxiv.org/abs/2404.18870)
Keywords: language model, llm
Abstract: The surge in Large Language Models (LLMs) development has led to improved performance on cognitive tasks as well as an urgent need to align these models with human values in order to safely exploit their power. Despite the effectiveness of preference learning algorithms like Reinforcement Learning From Human Feedback (RLHF) in aligning human preferences, their assumed improvements on model trustworthiness haven't been thoroughly testified. Toward this end, this study investigates how models that have been aligned with general-purpose preference data on helpfulness and harmlessness perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. For model alignment, we focus on three widely used RLHF variants: Supervised Finetuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Through extensive empirical investigations, we discover that the improvement in trustworthiness by RLHF is far from guaranteed, and there exists a complex interplay between preference data, alignment algorithms, and specific trustworthiness aspects. Together, our results underscore the need for more nuanced approaches for model alignment. By shedding light on the intricate dynamics of these components within model alignment, we hope this research will guide the community towards developing language models that are both capable and trustworthy.
Copy Paste: [[2404.18911]] Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting(https://arxiv.org/abs/2404.18911)
Keywords: language model
Abstract: Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models while maintaining a consistent sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly. Drawing inspiration from early exiting, we propose a novel self-speculative decoding framework \emph{Kangaroo}, which uses a fixed shallow sub-network as a self-draft model, with the remaining layers serving as the larger target model. We train a lightweight and efficient adapter module on top of the sub-network to bridge the gap between the sub-network and the full model's representation ability. It is noteworthy that the inference latency of the self-draft model may no longer be negligible compared to the large model, necessitating strategies to increase the token acceptance rate while minimizing the drafting steps of the small model. To address this challenge, we introduce an additional early exiting mechanism for generating draft tokens. Specifically, we halt the small model's subsequent prediction during the drafting phase once the confidence level for the current token falls below a certain threshold. Extensive experiments on the Spec-Bench demonstrate the effectiveness of Kangaroo. Under single-sequence verification, Kangaroo achieves speedups up to $1.68\times$ on Spec-Bench, outperforming Medusa-1 with 88.7\% fewer additional parameters (67M compared to 591M). The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.
Copy Paste: [[2404.18923]] Holmes: Benchmark the Linguistic Competence of Language Models(https://arxiv.org/abs/2404.18923)
Keywords: language model, prompt
Abstract: We introduce Holmes, a benchmark to assess the linguistic competence of language models (LMs) - their ability to grasp linguistic phenomena. Unlike prior prompting-based evaluations, Holmes assesses the linguistic competence of LMs via their internal representations using classifier-based probing. In doing so, we disentangle specific phenomena (e.g., part-of-speech of words) from other cognitive abilities, like following textual instructions, and meet recent calls to assess LMs' linguistic competence in isolation. Composing Holmes, we review over 250 probing studies and feature more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version of Holmes designed to lower the high computation load while maintaining high-ranking precision.