2024-10-29

Title: Ensembling Finetuned Language Models for Text Classification

Authors: Sebastian Pineda Arango, Maciej Janowski, Lennart Purucker, Arber Zela, Frank Hutter, Josif Grabocka
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.19889
Pdf URL: https://arxiv.org/pdf/2410.19889
Copy Paste: [[2410.19889]] Ensembling Finetuned Language Models for Text Classification(https://arxiv.org/abs/2410.19889)
Keywords: language model
Abstract: Finetuning is a common practice widespread across different communities to adapt pretrained models to particular tasks. Text classification is one of these tasks for which many pretrained models are available. On the other hand, ensembles of neural networks are typically used to boost performance and provide reliable uncertainty estimates. However, ensembling pretrained models for text classification is not a well-studied avenue. In this paper, we present a metadataset with predictions from five large finetuned models on six datasets, and report results of different ensembling strategies from these predictions. Our results shed light on how ensembling can improve the performance of finetuned text classifiers and incentivize future adoption of ensembles in such tasks.
摘要：微调是不同社区中广泛使用的一种常见做法，用于使预训练模型适应特定任务。文本分类是此类任务之一，有许多预训练模型可用于此任务。另一方面，神经网络集成通常用于提高性能并提供可靠的不确定性估计。然而，集成预训练模型进行文本分类并不是一个研究得很好的方法。在本文中，我们提供了一个元数据集，其中包含五个大型微调模型在六个数据集上的预测，并报告了这些预测的不同集成策略的结果。我们的结果揭示了集成如何提高微调文本分类器的性能并激励未来在这些任务中采用集成。
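
The abstract above does not say which ensembling strategies the paper compares. As a rough illustration only, the sketch below shows one common baseline, uniform averaging of per-model class probabilities; the function names and toy probabilities are hypothetical and are not taken from the paper or its metadataset.

```python
from typing import List, Sequence


def average_ensemble(
    probs_per_model: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Uniformly average class probabilities across models.

    probs_per_model[m][i][c] is the probability that model m assigns
    to class c for example i. All models must cover the same examples
    and classes.
    """
    n_models = len(probs_per_model)
    n_examples = len(probs_per_model[0])
    n_classes = len(probs_per_model[0][0])
    averaged = []
    for i in range(n_examples):
        row = [
            sum(probs_per_model[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        averaged.append(row)
    return averaged


def predict(probs: Sequence[Sequence[float]]) -> List[int]:
    """Return the index of the most probable class for each example."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Toy predictions (made up): 3 finetuned models, 2 examples, 2 classes.
model_a = [[0.9, 0.1], [0.4, 0.6]]
model_b = [[0.6, 0.4], [0.7, 0.3]]
model_c = [[0.2, 0.8], [0.3, 0.7]]

ensemble_probs = average_ensemble([model_a, model_b, model_c])
ensemble_labels = predict(ensemble_probs)
```

Working from stored predictions, as the metadataset described above allows, means strategies like this can be compared without rerunning any finetuned model.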

Title: Improving Multimodal Large Language Models Using Continual Learning

Authors: Shikhar Srivastava, Md Yousuf Harun, Robik Shrestha, Christopher Kanan
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2410.19925
Pdf URL: https://arxiv.org/pdf/2410.19925
Copy Paste: [[2410.19925]] Improving Multimodal Large Language Models Using Continual Learning(https://arxiv.org/abs/2410.19925)
Keywords: language model, llm
Abstract: Generative large language models (LLMs) exhibit impressive capabilities, which can be further augmented by integrating a pre-trained vision model into the original LLM to create a multimodal LLM (MLLM). However, this integration often significantly decreases performance on natural language understanding and generation tasks, compared to the original LLM. This study investigates this issue using the LLaVA MLLM, treating the integration as a continual learning problem. We evaluate five continual learning methods to mitigate forgetting and identify a technique that enhances visual understanding while minimizing linguistic performance loss. Our approach reduces linguistic performance degradation by up to 15% over the LLaVA recipe, while maintaining high multimodal accuracy. We also demonstrate the robustness of our method through continual learning on a sequence of vision-language tasks, effectively preserving linguistic skills while acquiring new multimodal capabilities.
摘要：生成式大型语言模型 (LLM) 表现出令人印象深刻的功能，可以通过将预训练的视觉模型集成到原始 LLM 中以创建多模态 LLM (MLLM) 来进一步增强其功能。但是，与原始 LLM 相比，这种集成通常会显著降低自然语言理解和生成任务的性能。本研究使用 LLaVA MLLM 调查了这个问题，将集成视为一个持续学习问题。我们评估了五种持续学习方法来减轻遗忘，并确定了一种增强视觉理解同时最大限度地减少语言性能损失的技术。与 LLaVA 方法相比，我们的方法将语言性能下降降低了 15%，同时保持了较高的多模态准确度。我们还通过对一系列视觉语言任务的持续学习证明了我们方法的稳健性，有效地保留了语言技能，同时获得了新的多模态能力。

Title: Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models

Authors: Zheng Zhao, Yftah Ziser, Shay B. Cohen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20008
Pdf URL: https://arxiv.org/pdf/2410.20008
Copy Paste: [[2410.20008]] Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models(https://arxiv.org/abs/2410.20008)
Keywords: language model, llm
Abstract: Fine-tuning pre-trained large language models (LLMs) on a diverse array of tasks has become a common approach for building models that can solve various natural language processing (NLP) tasks. However, where and to what extent these models retain task-specific knowledge remains largely unexplored. This study investigates the task-specific information encoded in pre-trained LLMs and the effects of instruction tuning on their representations across a diverse set of over 60 NLP tasks. We use a set of matrix analysis tools to examine the differences between the way pre-trained and instruction-tuned LLMs store task-specific information. Our findings reveal that while some tasks are already encoded within the pre-trained LLMs, others greatly benefit from instruction tuning. Additionally, we pinpointed the layers in which the model transitions from high-level general representations to more task-oriented representations. This finding extends our understanding of the governing mechanisms of LLMs and facilitates future research in the fields of parameter-efficient transfer learning and multi-task learning.
摘要：在各种任务上对预训练的大型语言模型 (LLM) 进行微调已成为构建可解决各种自然语言处理 (NLP) 任务的模型的常用方法。然而，这些模型在哪里以及在多大程度上保留特定于任务的知识仍未得到充分探索。本研究调查了预训练的 LLM 中编码的任务特定信息以及指令调整对其在 60 多个 NLP 任务中的表示的影响。我们使用一组矩阵分析工具来检查预训练和指令调整的 LLM 存储任务特定信息的方式之间的差异。我们的研究结果表明，虽然某些任务已经在预训练的 LLM 中编码，但其他任务则从指令调整中受益匪浅。此外，我们还确定了模型从高级通用表示过渡到更面向任务的表示的层次。这一发现扩展了我们对 LLM 控制机制的理解，并促进了参数高效迁移学习和多任务学习领域的未来研究。

Title: A Survey of Small Language Models

Authors: Chien Van Nguyen, Xuan Shen, Ryan Aponte, Yu Xia, Samyadeep Basu, Zhengmian Hu, Jian Chen, Mihir Parmar, Sasidhar Kunapuli, Joe Barrow, Junda Wu, Ashish Singh, Yu Wang, Jiuxiang Gu, Franck Dernoncourt, Nesreen K. Ahmed, Nedim Lipka, Ruiyi Zhang, Xiang Chen, Tong Yu, Sungchul Kim, Hanieh Deilamsalehy, Namyong Park, Mike Rimer, Zhehao Zhang, Huanrui Yang, Ryan A. Rossi, Thien Huu Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20011
Pdf URL: https://arxiv.org/pdf/2410.20011
Copy Paste: [[2410.20011]] A Survey of Small Language Models(https://arxiv.org/abs/2410.20011)
Keywords: language model
Abstract: Small Language Models (SLMs) have become increasingly important due to their efficiency and performance to perform various language tasks with minimal computational resources, making them ideal for various settings including on-device, mobile, edge devices, among many others. In this article, we present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques. We propose a novel taxonomy for categorizing the methods used to optimize SLMs, including model compression, pruning, and quantization techniques. We summarize the benchmark datasets that are useful for benchmarking SLMs along with the evaluation metrics commonly used. Additionally, we highlight key open challenges that remain to be addressed. Our survey aims to serve as a valuable resource for researchers and practitioners interested in developing and deploying small yet efficient language models.
摘要：小型语言模型 (SLM) 因其效率和性能而变得越来越重要，它们能够以最少的计算资源执行各种语言任务，使其成为各种环境的理想选择，包括设备上、移动设备、边缘设备等。在本文中，我们对 SLM 进行了全面的调查，重点介绍了它们的架构、训练技术和模型压缩技术。我们提出了一种新颖的分类法来对用于优化 SLM 的方法进行分类，包括模型压缩、修剪和量化技术。我们总结了可用于对 SLM 进行基准测试的基准数据集以及常用的评估指标。此外，我们还重点介绍了尚待解决的关键开放挑战。我们的调查旨在为有兴趣开发和部署小型但高效的语言模型的研究人员和从业者提供宝贵的资源。

Title: Vulnerability of LLMs to Vertically Aligned Text Manipulations

Authors: Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Zhen Xiong, Nanyun Peng, Kai-wei Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20016
Pdf URL: https://arxiv.org/pdf/2410.20016
Copy Paste: [[2410.20016]] Vulnerability of LLMs to Vertically Aligned Text Manipulations(https://arxiv.org/abs/2410.20016)
Keywords: language model, llm
Abstract: Text classification involves categorizing a given text, such as determining its sentiment or identifying harmful content. With the advancement of large language models (LLMs), these models have become highly effective at performing text classification tasks. However, they still show vulnerabilities to variations in text formatting. Recent research demonstrates that modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks. While easily understood by humans, these inputs can significantly mislead models, posing a potential risk of bypassing detection in real-world scenarios involving harmful or sensitive information. With the expanding application of LLMs, a crucial question arises: Do decoder-based LLMs exhibit similar vulnerabilities to vertically formatted text input? In this paper, we investigate the impact of vertical text input on the performance of various LLMs across multiple text classification datasets and analyze the underlying causes. Our findings are as follows: (i) Vertical text input significantly degrades the accuracy of LLMs in text classification tasks. (ii) Chain of Thought (CoT) reasoning does not help LLMs recognize vertical input or mitigate its vulnerability, but few-shot learning with careful analysis does. (iii) We explore the underlying cause of the vulnerability by analyzing the inherent issues in tokenization and attention matrices.
摘要：文本分类涉及对给定文本进行分类，例如确定其情绪或识别有害内容。随着大型语言模型 (LLM) 的进步，这些模型在执行文本分类任务方面变得非常有效。然而，它们仍然显示出对文本格式变化的脆弱性。最近的研究表明，修改输入格式（例如垂直对齐基于编码器的模型中的单词）会大大降低文本分类任务的准确性。虽然这些输入很容易被人类理解，但它们可能会严重误导模型，在涉及有害或敏感信息的实际场景中存在绕过检测的潜在风险。随着 LLM 的应用不断扩大，一个关键问题出现了：基于解码器的 LLM 是否表现出与垂直格式的文本输入类似的脆弱性？在本文中，我们研究了垂直文本输入对多个文本分类数据集中各种 LLM 性能的影响，并分析了其根本原因。我们的发现如下：(i) 垂直文本输入会显著降低 LLM 在文本分类任务中的准确性。 (ii) 思路链 (CoT) 推理无助于 LLM 识别垂直输入或减轻其脆弱性，但经过仔细分析的少量学习可以。 (iii) 我们通过分析标记化和注意力矩阵中的固有问题来探索脆弱性的根本原因。
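
To make "vertically formatted text" concrete, here is a minimal sketch of one possible reading: every word becomes a column, so line k holds the k-th character of each word. The exact formatting used in the paper is not given in the abstract, so the layout and the function name `verticalize` are assumptions.

```python
def verticalize(text: str) -> str:
    """Lay out each whitespace-separated word as a vertical column.

    Row k of the output holds the k-th character of every word,
    with a space standing in for words shorter than k + 1 characters.
    This layout is an assumed interpretation, not the paper's spec.
    """
    words = text.split()
    height = max((len(word) for word in words), default=0)
    lines = []
    for row in range(height):
        cells = [word[row] if row < len(word) else " " for word in words]
        # Columns are separated by one space; trailing padding is dropped.
        lines.append(" ".join(cells).rstrip())
    return "\n".join(lines)


example = verticalize("bad movie")
print(example)
```

A person can still read the columns top to bottom, but a tokenizer processes the string row by row, so the characters of each word are no longer contiguous in the input.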

Title: Attacks against Abstractive Text Summarization Models through Lead Bias and Influence Functions

Authors: Poojitha Thota, Shirin Nilizadeh
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2410.20019
Pdf URL: https://arxiv.org/pdf/2410.20019
Copy Paste: [[2410.20019]] Attacks against Abstractive Text Summarization Models through Lead Bias and Influence Functions(https://arxiv.org/abs/2410.20019)
Keywords: language model
Abstract: Large Language Models have introduced novel opportunities for text comprehension and generation. Yet, they are vulnerable to adversarial perturbations and data poisoning attacks, particularly in tasks like text classification and translation. However, the adversarial robustness of abstractive text summarization models remains less explored. In this work, we unveil a novel approach by exploiting the inherent lead bias in summarization models, to perform adversarial perturbations. Furthermore, we introduce an innovative application of influence functions, to execute data poisoning, which compromises the model's integrity. This approach not only shows a skew in the models behavior to produce desired outcomes but also shows a new behavioral change, where models under attack tend to generate extractive summaries rather than abstractive summaries.
摘要：大型语言模型为文本理解和生成带来了新的机会。然而，它们容易受到对抗性扰动和数据中毒攻击，特别是在文本分类和翻译等任务中。然而，抽象文本摘要模型的对抗性鲁棒性仍然没有得到充分探索。在这项工作中，我们揭示了一种新方法，即利用摘要模型中固有的引导偏差来执行对抗性扰动。此外，我们引入了一种影响函数的创新应用，以执行数据中毒，从而损害模型的完整性。这种方法不仅显示了模型行为的偏差以产生期望的结果，而且还显示了新的行为变化，即受到攻击的模型倾向于生成提取摘要而不是抽象摘要。

Title: Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization

Authors: Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Naifan Cheung, Nanyun Peng, Kai-wei Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20021
Pdf URL: https://arxiv.org/pdf/2410.20021
Copy Paste: [[2410.20021]] Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization(https://arxiv.org/abs/2410.20021)
Keywords: language model, gpt, llm, prompt
Abstract: Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language. Currently, instruction-tuned large language models (LLMs) excel at various English tasks. However, unlike languages such as English, Chinese or Spanish, for those relatively low-resource languages with limited usage or data, recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings. This raises the question: Are LLMs capable of handling cross-lingual summarization tasks for low-resource languages? To resolve this question, we fully explore the potential of large language models on cross-lingual summarization task for low-resource languages through our four-step zero-shot method: Summarization, Improvement, Translation and Refinement (SITR) with correspondingly designed prompts. We test our proposed method with multiple LLMs on two well-known cross-lingual summarization datasets with various low-resource target languages. The results show that: i) GPT-3.5 and GPT-4 significantly and consistently outperform other baselines when using our zero-shot SITR methods. ii) By employing our proposed method, we unlock the potential of LLMs, enabling them to effectively handle cross-lingual summarization tasks for relatively low-resource languages.
摘要：跨语言摘要 (CLS) 旨在为不同目标语言的源文本生成摘要。目前，针对指令调优的大型语言模型 (LLM) 在各种英语任务上表现出色。然而，与英语、中文或西班牙语等语言不同，对于那些使用或数据有限的相对低资源语言，最近的研究表明，即使在少样本设置下，LLM 在 CLS 任务上的表现仍然不令人满意。这就提出了一个问题：LLM 是否能够处理低资源语言的跨语言摘要任务？为了解决这个问题，我们通过我们的四步零样本方法充分探索大型语言模型在低资源语言跨语言摘要任务上的潜力：摘要、改进、翻译和细化 (SITR) 以及相应设计的提示。我们在两个众所周知的跨语言摘要数据集上使用多个 LLM 测试了我们提出的方法，其中包含各种低资源目标语言。结果表明：i）使用我们的零样本 SITR 方法时，GPT-3.5 和 GPT-4 显著且持续地优于其他基线。ii）通过采用我们提出的方法，我们释放了 LLM 的潜力，使它们能够有效地处理相对资源较少的语言的跨语言摘要任务。

Title: Dynamic layer selection in decoder-only transformers

Authors: Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20022
Pdf URL: https://arxiv.org/pdf/2410.20022
Copy Paste: [[2410.20022]] Dynamic layer selection in decoder-only transformers(https://arxiv.org/abs/2410.20022)
Keywords: language model, llm, prompt
Abstract: The vast size of Large Language Models (LLMs) has prompted a search to optimize inference. One effective approach is dynamic inference, which adapts the architecture to the sample-at-hand to reduce the overall computational cost. We empirically examine two common dynamic inference methods for natural language generation (NLG): layer skipping and early exiting. We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping, as opposed to early exit. We demonstrate the difficulty of using hidden state information to adapt computation on a per-token basis for layer skipping. Finally, we show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains by constructing an oracle controller. Remarkably, we find that there exists an allocation which achieves equal performance to the full model using only 23.3% of its layers on average.
摘要：大型语言模型 (LLM) 的庞大规模促使人们寻求优化推理。一种有效的方法是动态推理，它使架构适应手头的样本，以降低总体计算成本。我们实证研究了两种常见的自然语言生成 (NLG) 动态推理方法：层跳过和提前退出。我们发现，与提前退出相比，预训练的仅解码器模型对通过层跳过移除层的鲁棒性明显更高。我们展示了使用隐藏状态信息来调整每个标记基础上的计算以实现层跳过的难度。最后，我们表明，通过构建 oracle 控制器，按序列进行动态计算分配有望显著提高效率。值得注意的是，我们发现存在一种分配，它平均仅使用 23.3% 的层就能实现与完整模型相同的性能。

Title: Beyond Fine-Tuning: Effective Strategies for Mitigating Hallucinations in Large Language Models for Data Analytics

Authors: Mikhail Rumiantsau, Aliaksei Vertsel, Ilya Hrytsuk, Isaiah Ballah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20024
Pdf URL: https://arxiv.org/pdf/2410.20024
Copy Paste: [[2410.20024]] Beyond Fine-Tuning: Effective Strategies for Mitigating Hallucinations in Large Language Models for Data Analytics(https://arxiv.org/abs/2410.20024)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) have become increasingly important in natural language processing, enabling advanced data analytics through natural language queries. However, these models often generate "hallucinations" (inaccurate or fabricated information) that can undermine their reliability in critical data-driven decision-making. Addressing the challenge of hallucinations is essential to improve the accuracy and trustworthiness of LLMs in processing natural language queries. This research focuses on mitigating hallucinations in LLMs, specifically within the context of data analytics. We introduce and evaluate four targeted strategies: Structured Output Generation, Strict Rules Enforcement, System Prompt Enhancements, and Semantic Layer Integration. Our findings show that these methods are more effective than traditional fine-tuning approaches in reducing hallucinations, offering a more reliable framework for deploying LLMs in natural language queries for data analytics. This research demonstrates the potential of these strategies to enhance the accuracy of LLM-driven data queries, ensuring more dependable results in data-driven environments.
摘要：大型语言模型 (LLM) 在自然语言处理中变得越来越重要，它通过自然语言查询实现了高级数据分析。然而，这些模型通常会产生“幻觉”——不准确或虚假的信息——这可能会削弱它们在关键数据驱动决策中的可靠性。解决幻觉问题对于提高 LLM 在处理自然语言查询时的准确性和可信度至关重要。这项研究的重点是减轻 LLM 中的幻觉，特别是在数据分析的背景下。我们介绍并评估了四种有针对性的策略：结构化输出生成、严格规则执行、系统提示增强和语义层集成。我们的研究结果表明，这些方法在减少幻觉方面比传统的微调方法更有效，为在数据分析的自然语言查询中部署 LLM 提供了更可靠的框架。这项研究展示了这些策略提高 LLM 驱动的数据查询准确性的潜力，确保在数据驱动的环境中获得更可靠的结果。

Title: Architectural Flaw Detection in Civil Engineering Using GPT-4

Authors: Saket Kumar, Abul Ehtesham, Aditi Singh, Tala Talaei Khoei
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20036
Pdf URL: https://arxiv.org/pdf/2410.20036
Copy Paste: [[2410.20036]] Architectural Flaw Detection in Civil Engineering Using GPT-4(https://arxiv.org/abs/2410.20036)
Keywords: gpt, llm
Abstract: The application of artificial intelligence (AI) in civil engineering presents a transformative approach to enhancing design quality and safety. This paper investigates the potential of the advanced LLM GPT4 Turbo vision model in detecting architectural flaws during the design phase, with a specific focus on identifying missing doors and windows. The study evaluates the model's performance through metrics such as precision, recall, and F1 score, demonstrating AI's effectiveness in accurately detecting flaws compared to human-verified data. Additionally, the research explores AI's broader capabilities, including identifying load-bearing issues, material weaknesses, and ensuring compliance with building codes. The findings highlight how AI can significantly improve design accuracy, reduce costly revisions, and support sustainable practices, ultimately revolutionizing the civil engineering field by ensuring safer, more efficient, and aesthetically optimized structures.
摘要：人工智能 (AI) 在土木工程中的应用为提高设计质量和安全性提供了一种变革性方法。本文探讨了先进的 LLM GPT4 Turbo 视觉模型在设计阶段检测建筑缺陷的潜力，特别侧重于识别缺失的门窗。该研究通过精度、召回率和 F1 分数等指标评估模型的性能，证明了与人工验证的数据相比，AI 在准确检测缺陷方面的有效性。此外，该研究还探索了 AI 更广泛的功能，包括识别承重问题、材料弱点以及确保符合建筑规范。研究结果强调了 AI 如何显著提高设计准确性、减少昂贵的修订并支持可持续实践，最终通过确保更安全、更高效和更美观的结构彻底改变土木工程领域。

Title: RARe: Retrieval Augmented Retrieval with In-Context Examples

Authors: Atula Tejaswi, Yoonsang Lee, Sujay Sanghavi, Eunsol Choi
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2410.20088
Pdf URL: https://arxiv.org/pdf/2410.20088
Copy Paste: [[2410.20088]] RARe: Retrieval Augmented Retrieval with In-Context Examples(https://arxiv.org/abs/2410.20088)
Keywords: language model, llm
Abstract: We investigate whether in-context examples, widely used in decoder-only language models (LLMs), can improve embedding model performance in retrieval tasks. Unlike in LLMs, naively prepending in-context examples (query-document pairs) to the target query at inference time does not work out of the box. We introduce a simple approach to enable retrievers to use in-context examples. Our approach, RARe, finetunes a pre-trained model with in-context examples whose query is semantically similar to the target query. This can be applied to adapt various base architectures (i.e., decoder-only language models, retriever models) and consistently achieves performance gains of up to +2.72% nDCG across various open-domain retrieval datasets (BeIR, RAR-b). In particular, we find RARe exhibits stronger out-of-domain generalization compared to models using queries without in-context examples, similar to what is seen for in-context learning in LLMs. We further provide analysis on the design choices of in-context example augmentation and lay the foundation for future work in this space.
摘要：我们研究了在仅解码器语言模型 (LLM) 中广泛使用的上下文示例是否可以提高嵌入模型在检索任务中的性能。与 LLM 不同，在推理时将上下文示例（查询-文档对）简单地添加到目标查询之前并不是开箱即用的。我们引入了一种简单的方法，使检索器能够使用上下文示例。我们的方法 RARe 使用上下文示例对预训练模型进行微调，其查询在语义上与目标查询相似。这可以应用于适应各种基础架构（即仅解码器语言模型、检索器模型），并在各种开放域检索数据集（BeIR、RAR-b）中持续实现高达 +2.72% nDCG 的性能提升。特别是，我们发现与使用没有上下文示例的查询的模型相比，RARe 表现出更强的域外泛化能力，类似于 LLM 中的上下文学习。我们进一步对上下文示例增强的设计选择进行了分析，并为该领域的未来工作奠定了基础。

Title: Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs

Authors: Houman Mehrafarin, Arash Eshghi, Ioannis Konstas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20200
Pdf URL: https://arxiv.org/pdf/2410.20200
Copy Paste: [[2410.20200]] Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs(https://arxiv.org/abs/2410.20200)
Keywords: language model, llm
Abstract: Evaluating Large Language Models (LLMs) on reasoning benchmarks demonstrates their ability to solve compositional questions. However, little is known of whether these models engage in genuine logical reasoning or simply rely on implicit cues to generate answers. In this paper, we investigate the transitive reasoning capabilities of two distinct LLM architectures, LLaMA 2 and Flan-T5, by manipulating facts within two compositional datasets: QASC and Bamboogle. We controlled for potential cues that might influence the models' performance, including (a) word/phrase overlaps across sections of test input; (b) models' inherent knowledge during pre-training or fine-tuning; and (c) Named Entities. Our findings reveal that while both models leverage (a), Flan-T5 shows more resilience to experiments (b and c), having less variance than LLaMA 2. This suggests that models may develop an understanding of transitivity through fine-tuning on knowingly relevant datasets, a hypothesis we leave to future work.
摘要：在推理基准上评估大型语言模型 (LLM) 表明了它们解决组合问题的能力。然而，人们对于这些模型是否进行真正的逻辑推理或仅仅依靠隐含线索来生成答案知之甚少。在本文中，我们通过操纵两个组合数据集（QASC 和 Bamboogle）中的事实，研究了两种不同的 LLM 架构 LLaMA 2 和 Flan-T5 的传递性推理能力。我们控制了可能影响模型性能的潜在线索，包括 (a) 测试输入部分之间的单词/短语重叠；(b) 模型在预训练或微调期间的固有知识；以及 (c) 命名实体。我们的研究结果表明，虽然两种模型都利用了 (a)，但 Flan-T5 对实验 (b 和 c) 表现出更强的适应性，方差小于 LLaMA 2。这表明模型可以通过对已知相关数据集进行微调来发展对传递性的理解，我们将这一假设留待将来研究。

Title: DAWN-ICL: Strategic Planning of Problem-solving Trajectories for Zero-Shot In-Context Learning

Authors: Xinyu Tang, Xiaolei Wang, Wayne Xin Zhao, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20215
Pdf URL: https://arxiv.org/pdf/2410.20215
Copy Paste: [[2410.20215]] DAWN-ICL: Strategic Planning of Problem-solving Trajectories for Zero-Shot In-Context Learning(https://arxiv.org/abs/2410.20215)
Keywords: language model, llm
Abstract: Zero-shot in-context learning (ZS-ICL) aims to conduct in-context learning (ICL) without using human-annotated demonstrations. Most ZS-ICL methods use large language models (LLMs) to generate (input, label) pairs as pseudo-demonstrations and leverage historical pseudo-demonstrations to help solve the current problem. They assume that problems are from the same task and traverse them in a random order. However, in real-world scenarios, problems usually come from diverse tasks, and only a few belong to the same task. The random traversing order may generate unreliable pseudo-demonstrations and lead to error accumulation. To address this problem, we reformulate ZS-ICL as a planning problem and propose a Demonstration-aware Monte Carlo Tree Search (MCTS) approach (DAWN-ICL), which leverages MCTS to strategically plan the problem-solving trajectories for ZS-ICL. In addition, to achieve effective and efficient Q value estimation, we propose a novel demonstration-aware Q-value function and use it to enhance the selection phase and accelerate the expansion and simulation phases in MCTS. Extensive experiments demonstrate the effectiveness and efficiency of DAWN-ICL on in-domain and cross-domain scenarios, and it even outperforms ICL using human-annotated labels. The code is available at this https URL.
摘要：零样本上下文学习 (ZS-ICL) 旨在在不使用人工注释演示的情况下进行上下文学习 (ICL)。大多数 ZS-ICL 方法使用大型语言模型 (LLM) 生成 (输入、标签) 对作为伪演示，并利用历史伪演示来帮助解决当前问题。他们假设问题来自同一任务并以随机顺序遍历它们。然而，在现实世界中，问题通常来自不同的任务，只有少数属于同一任务。随机遍历顺序可能会生成不可靠的伪演示并导致错误积累。为了解决这个问题，我们将 ZS-ICL 重新表述为规划问题，并提出了一种演示感知蒙特卡洛树搜索 (MCTS) 方法 (DAWN-ICL)，该方法利用 MCTS 为 ZS-ICL 战略性地规划问题解决轨迹。此外，为了实现有效且高效的 Q 值估计，我们提出了一种新颖的演示感知 Q 值函数，并使用它来增强选择阶段并加速 MCTS 中的扩展和模拟阶段。大量实验证明了 DAWN-ICL 在域内和跨域场景中的有效性和效率，它甚至在使用人工注释标签的情况下优于 ICL。代码可在此 https URL 上找到。

Title: Generative linguistics contribution to artificial intelligence: Where this contribution lies?

Authors: Mohammed Q. Shormani (Ibb University, University of Cyprus)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20221
Pdf URL: https://arxiv.org/pdf/2410.20221
Copy Paste: [[2410.20221]] Generative linguistics contribution to artificial intelligence: Where this contribution lies?(https://arxiv.org/abs/2410.20221)
Keywords: language model
Abstract: This article aims to characterize Generative linguistics (GL) contribution to artificial intelligence (AI), alluding to the debate among linguists and AI scientists on whether linguistics belongs to humanities or science. In this article, I will try not to be biased as a linguist, studying the phenomenon from an independent scientific perspective. The article walks the researcher/reader through the scientific theorems and rationales involved in AI which belong from GL, specifically the Chomsky School. It, thus, provides good evidence from syntax, semantics, language faculty, Universal Grammar, computational system of human language, language acquisition, human brain, programming languages (e.g. Python), Large Language Models, and unbiased AI scientists that this contribution is huge, and that this contribution cannot be denied. It concludes that however the huge GL contribution to AI, there are still points of divergence including the nature and type of language input.
摘要：本文旨在描述生成语言学 (GL) 对人工智能 (AI) 的贡献，并暗指语言学家和人工智能科学家之间关于语言学是属于人文学科还是科学的争论。在本文中，我将尽量不带语言学家的偏见，从独立的科学角度研究这一现象。本文将向研究人员/读者介绍人工智能中涉及的科学定理和原理，这些定理和原理属于 GL，特别是乔姆斯基学派。因此，它从句法、语义、语言能力、通用语法、人类语言的计算系统、语言习得、人脑、编程语言（例如 Python）、大型语言模型和公正的人工智能科学家等方面提供了充分的证据，表明这一贡献是巨大的，而且这一贡献是不可否认的。结论是，尽管 GL 对人工智能的贡献巨大，但仍存在分歧点，包括语言输入的性质和类型。

Title: A Survey of Large Language Models for Arabic Language and its Dialects

Authors: Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20238
Pdf URL: https://arxiv.org/pdf/2410.20238
Copy Paste: [[2410.20238]] A Survey of Large Language Models for Arabic Language and its Dialects(https://arxiv.org/abs/2410.20238)
Keywords: language model, llm
Abstract: This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks, such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors, such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and attributes the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.
摘要：本调查全面概述了为阿拉伯语及其方言设计的大型语言模型 (LLM)。它涵盖了关键架构，包括仅编码器、仅解码器和编码器-解码器模型，以及用于预训练的数据集，涵盖古典阿拉伯语、现代标准阿拉伯语和方言阿拉伯语。该研究还探讨了单语、双语和多语 LLM，分析了它们的架构和在下游任务（如情绪分析、命名实体识别和问答）中的表现。此外，它还根据源代码可用性、训练数据、模型权重和文档等因素评估了阿拉伯语 LLM 的开放性。调查强调了对更多样化方言数据集的需求，并认为开放性对于研究的可重复性和透明度非常重要。最后，它确定了未来研究的主要挑战和机遇，并强调需要更具包容性和代表性的模型。

Title: Improving Model Evaluation using SMART Filtering of Benchmark Datasets

Authors: Vipul Gupta, Candace Ross, David Pantoja, Rebecca J. Passonneau, Megan Ung, Adina Williams
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20245
Pdf URL: https://arxiv.org/pdf/2410.20245
Copy Paste: [[2410.20245]] Improving Model Evaluation using SMART Filtering of Benchmark Datasets(https://arxiv.org/abs/2410.20245)
Keywords: chat
Abstract: One of the most challenging problems facing NLP today is evaluation. Some of the most pressing issues pertain to benchmark saturation, data contamination, and diversity in the quality of test examples. To address these concerns, we propose Selection Methodology for Accurate, Reduced, and Targeted (SMART) filtering, a novel approach to select a high-quality subset of examples from existing benchmark datasets by systematically removing less informative and less challenging examples. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other based on distance in an embedding space. We demonstrate the effectiveness of SMART on three multiple choice QA datasets, where our methodology increases efficiency by reducing dataset size by 48% on average, while increasing Pearson correlation with rankings from ChatBot Arena, a more open-ended human evaluation setting. Our method enables us to be more efficient, whether using SMART to make new benchmarks more challenging or to revitalize older datasets, while still preserving the relative model rankings.
摘要：当今 NLP 面临的最具挑战性的问题之一是评估。一些最紧迫的问题涉及基准饱和、数据污染和测试示例质量的多样性。为了解决这些问题，我们提出了准确、精简和有针对性的选择方法 (SMART) 过滤，这是一种新颖的方法，通过系统地删除信息量较少和挑战性较低的示例，从现有基准数据集中选择高质量的示例子集。我们的方法应用了三个过滤标准，删除 (i) 简单示例、(ii) 数据污染示例和 (iii) 基于嵌入空间中的距离彼此相似的示例。我们在三个多项选择 QA 数据集上展示了 SMART 的有效性，其中我们的方法通过将数据集大小平均减少 48% 来提高效率，同时增加了与 ChatBot Arena（一个更开放的人工评估设置）排名的 Pearson 相关性。我们的方法使我们能够提高效率，无论是使用 SMART 使新基准更具挑战性还是振兴旧数据集，同时仍保留相对模型排名。

Title: Fast Best-of-N Decoding via Speculative Rejection

Authors: Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, Andrea Zanette
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20290
Pdf URL: https://arxiv.org/pdf/2410.20290
Copy Paste: [[2410.20290]] Fast Best-of-N Decoding via Speculative Rejection(https://arxiv.org/abs/2410.20290)
Keywords: language model, llm
Abstract: The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model's responses are in accordance with human preferences. Prevalent alignment techniques, such as DPO, PPO and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training. While predominant, these post-training methods add substantial complexity before LLMs can be deployed. Inference-time alignment methods avoid the complex post-training step and instead bias the generation towards responses that are aligned with human preferences. The best-known inference-time alignment method, called Best-of-N, is as effective as the state-of-the-art post-training procedures. Unfortunately, Best-of-N requires vastly more resources at inference time than standard decoding strategies, which makes it computationally not viable. In this work, we introduce Speculative Rejection, a computationally-viable inference-time alignment algorithm. It generates high-scoring responses according to a given reward model, like Best-of-N does, while being between 16 to 32 times more computationally efficient.
摘要：大型语言模型 (LLM) 的安全有效部署涉及一个关键步骤，称为对齐，这可确保模型的响应符合人类偏好。流行的对齐技术（例如 DPO、PPO 及其变体）通过在称为后训练的阶段更改预训练模型权重来对齐 LLM。虽然这些后训练方法占主导地位，但它们在部署 LLM 之前增加了相当大的复杂性。推理时间对齐方法避免了复杂的后训练步骤，而是使生成偏向于与人类偏好一致的响应。最著名的推理时间对齐方法称为 Best-of-N，与最先进的后训练程序一样有效。不幸的是，Best-of-N 在推理时需要的资源比标准解码策略多得多，这使得它在计算上不可行。在这项工作中，我们引入了 Speculative Rejection，这是一种计算上可行的推理时间对齐算法。它根据给定的奖励模型生成高分响应，就像 Best-of-N 一样，同时计算效率提高了 16 到 32 倍。
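
For orientation, the plain Best-of-N baseline that the abstract contrasts against can be sketched as follows. This is not Speculative Rejection itself, whose details the abstract does not give; `toy_generate` and `toy_reward` are hypothetical stand-ins for an LLM sampler and a reward model.

```python
import random
from typing import Callable, Tuple

CANNED_RESPONSES = [
    "No.",
    "Sure, here is a short answer.",
    "Sure, here is a longer and more detailed answer with steps.",
]


def toy_generate(prompt: str, rng: random.Random) -> str:
    """Hypothetical sampler: picks a canned response at random."""
    return rng.choice(CANNED_RESPONSES)


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical reward model: longer responses score higher."""
    return float(len(response))


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    n: int,
    seed: int = 0,
) -> Tuple[float, str]:
    """Sample n full responses and return the (score, response) pair
    with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(reward(prompt, candidate), candidate) for candidate in candidates]
    # Compare on the score only, so ties keep the earliest candidate.
    return max(scored, key=lambda pair: pair[0])


best_score, best_response = best_of_n("How do I sort a list?", toy_generate, toy_reward, n=8)
```

Every one of the n candidates is generated to completion before scoring, which is the inference-time cost the abstract describes as making Best-of-N hard to use in practice.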

Title: Fine-Tuning and Evaluating Open-Source Large Language Models for the Army Domain

Authors: Daniel C. Ruiz, John Sell
Subjects: cs.CL, cs.AI, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2410.20297
Pdf URL: https://arxiv.org/pdf/2410.20297
Copy Paste: [[2410.20297]] Fine-Tuning and Evaluating Open-Source Large Language Models for the Army Domain(https://arxiv.org/abs/2410.20297)
Keywords: language model, llm
Abstract: In recent years, the widespread adoption of Large Language Models (LLMs) has sparked interest in their potential for application within the military domain. However, the current generation of LLMs demonstrates sub-optimal performance on Army use cases, due to the prevalence of domain-specific vocabulary and jargon. In order to fully leverage LLMs in-domain, many organizations have turned to fine-tuning to circumvent the prohibitive costs involved in training new LLMs from scratch. In light of this trend, we explore the viability of adapting open-source LLMs for usage in the Army domain in order to address their existing lack of domain-specificity. Our investigations have resulted in the creation of three distinct generations of TRACLM, a family of LLMs fine-tuned by The Research and Analysis Center (TRAC), Army Futures Command (AFC). Through continuous refinement of our training pipeline, each successive iteration of TRACLM displayed improved capabilities when applied to Army tasks and use cases. Furthermore, throughout our fine-tuning experiments, we recognized the need for an evaluation framework that objectively quantifies the Army domain-specific knowledge of LLMs. To address this, we developed MilBench, an extensible software framework that efficiently evaluates the Army knowledge of a given LLM using tasks derived from doctrine and assessments. We share preliminary results, models, methods, and recommendations on the creation of TRACLM and MilBench. Our work significantly informs the development of LLM technology across the DoD and augments senior leader decisions with respect to artificial intelligence integration.
摘要：近年来，大型语言模型 (LLM) 的广泛采用引发了人们对其在军事领域应用潜力的兴趣。然而，由于领域特定词汇和术语的盛行，当前一代 LLM 在陆军用例中表现出次优性能。为了充分利用领域内的 LLM，许多组织已转向微调，以规避从头开始训练新 LLM 所涉及的高昂成本。鉴于这一趋势，我们探索了将开源 LLM 改编为陆军领域使用的可行性，以解决其现有的领域特异性不足问题。我们的调查结果创建了三代不同的 TRACLM，这是由陆军未来司令部 (AFC) 研究与分析中心 (TRAC) 微调的 LLM 系列。通过不断改进我们的训练流程，TRACLM 的每次迭代在应用于陆军任务和用例时都表现出改进的功能。此外，在微调实验过程中，我们认识到需要一个评估框架来客观量化陆军 LLM 领域特定知识。为了解决这个问题，我们开发了 MilBench，这是一个可扩展的软件框架，它使用来自条令和评估的任务有效地评估陆军对给定 LLM 的知识。我们分享了关于创建 TRACLM 和 MilBench 的初步结果、模型、方法和建议。我们的工作为整个国防部的 LLM 技术发展提供了重要信息，并增强了高级领导人关于人工智能集成的决策。

Title: Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

Authors: Xinhong Xie, Tao Li, Quanyan Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20298
Pdf URL: https://arxiv.org/pdf/2410.20298
Copy Paste: [[2410.20298]] Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data(https://arxiv.org/abs/2410.20298)
Keywords: language model, llm
Abstract: Text detoxification, a variant of style transfer tasks, finds useful applications in online social media. This work presents a fine-tuning method that only uses non-parallel data to turn large language models (LLMs) into a detoxification rewriter. We model the fine-tuning process as a Stackelberg game between an LLM (leader) and a toxicity screener (follower), which is a binary style classifier (toxic or non-toxic). The LLM aims to align its preference according to the screener and generate paraphrases passing the screening. The primary challenge of non-parallel data fine-tuning is incomplete preference. In the case of unsuccessful paraphrases, the classifier cannot establish a preference between the input and paraphrase, as they belong to the same toxic style. Hence, preference-alignment fine-tuning methods, such as direct preference optimization (DPO), no longer apply. To address the challenge of incomplete preference, we propose Stackelberg response optimization (SRO), adapted from DPO, to enable the LLM to learn from the follower's response. The gist is that SRO decreases the likelihood of generating the paraphrase if it fails the follower's screening while performing DPO on the pair of the toxic input and its paraphrase when the latter passes the screening. Experiments indicate that the SRO-fine-tuned LLM achieves satisfying performance comparable to state-of-the-art models regarding style accuracy, content similarity, and fluency. The overall detoxification performance surpasses other computing methods and matches the human reference. Additional empirical evidence suggests that SRO is sensitive to the screener's feedback, and a slight perturbation leads to a significant performance drop. We release the code and LLM models at this https URL.
摘要：文本去毒是风格迁移任务的一种变体，在在线社交媒体中得到了广泛的应用。这项工作提出了一种微调方法，该方法仅使用非并行数据将大型语言模型 (LLM) 转变为去毒重写器。我们将微调过程建模为 LLM（领导者）和毒性筛选器（追随者）之间的 Stackelberg 博弈，后者是二元风格分类器（有毒或无毒）。LLM 旨在根据筛选器调整其偏好并生成通过筛选的释义。非并行数据微调的主要挑战是不完整的偏好。在释义不成功的情况下，分类器无法在输入和释义之间建立偏好，因为它们属于相同的有毒风格。因此，偏好对齐微调方法（例如直接偏好优化 (DPO)）不再适用。为了解决不完全偏好的挑战，我们提出了 Stackelberg 响应优化 (SRO)，它改编自 DPO，使 LLM 能够从跟随者的响应中学习。要点是，如果 LLM 未通过跟随者的筛选，则 SRO 会降低生成释义的可能性，而当后者通过筛选时，则对有毒输入及其释义对执行 DPO。实验表明，在风格准确性、内容相似性和流畅性方面，SRO 微调的 LLM 实现了与最先进模型相当的令人满意的性能。整体解毒性能超越其他计算方法并与人类参考相匹配。额外的经验证据表明，SRO 对筛选者的反馈很敏感，轻微的扰动会导致性能显着下降。我们在 this https URL 上发布了代码和 LLM 模型。

Title: Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs

Authors: Enshi Zhang, Christian Poellabauer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20334
Pdf URL: https://arxiv.org/pdf/2410.20334
Copy Paste: [[2410.20334]] Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs(https://arxiv.org/abs/2410.20334)
Keywords: language model, llm, prompt
Abstract: Speech Emotion Recognition (SER) focuses on identifying emotional states from spoken language. The 2024 IEEE SLT-GenSEC Challenge on Post Automatic Speech Recognition (ASR) Emotion Recognition tasks participants to explore the capabilities of large language models (LLMs) for emotion recognition using only text data. We propose a novel approach that first refines all available transcriptions to ensure data reliability. We then segment each complete conversation into smaller dialogues and use these dialogues as context to predict the emotion of the target utterance within the dialogue. Finally, we investigated different context lengths and prompting techniques to improve prediction accuracy. Our best submission exceeded the baseline by 20% in unweighted accuracy, achieving the best performance in the challenge. All of our experimental code, prediction results, and log files are publicly available.
摘要：语音情感识别 (SER) 侧重于从口语中识别情感状态。2024 年 IEEE SLT-GenSEC 自动语音识别 (ASR) 情感识别挑战赛要求参赛者仅使用文本数据探索大型语言模型 (LLM) 进行情感识别的能力。我们提出了一种新颖的方法，首先细化所有可用的转录以确保数据可靠性。然后，我们将每个完整的对话细分为较小的对话，并使用这些对话作为上下文来预测对话中目标话语的情感。最后，我们研究了不同的上下文长度和提示技术来提高预测准确性。我们的最佳提交在未加权准确度方面超出了基线 20%，在挑战赛中取得了最佳表现。我们所有实验的代码、预测结果和日志文件都是公开的。

Title: Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

Authors: Maohao Shen, Shun Zhang, Jilong Wu, Zhiping Xiu, Ehab AlBadawy, Yiting Lu, Mike Seltzer, Qing He
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.20336
Pdf URL: https://arxiv.org/pdf/2410.20336
Copy Paste: [[2410.20336]] Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation(https://arxiv.org/abs/2410.20336)
Keywords: language model, llm
Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating the catastrophic forgetting issue in either modality. Finally, we further explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its great potential as a multimodal dialog system capable of speech generation.
摘要：大型语言模型 (LLM) 彻底改变了自然语言处理 (NLP)，在各种基于文本的任务中表现出色。然而，将以文本为主的 LLM 扩展到语音生成任务仍未得到充分探索。在这项工作中，我们引入了一个由经过微调的 Llama 模型驱动的文本转语音 (TTS) 系统，名为 TTS-Llama，该系统实现了最先进的语音合成性能。在 TTS-Llama 的基础上，我们进一步提出了 MoLE-Llama，这是一种通过纯后期融合参数高效微调 (PEFT) 和混合专家架构开发的文本和语音多模态 LLM。大量的实证结果表明，MoLE-Llama 在纯文本问答 (QA) 和 TTS 任务上都具有竞争力，可以缓解任一模态中的灾难性遗忘问题。最后，我们进一步探索 MoLE-Llama 在文本输入语音输出 QA 任务中的作用，展示其作为能够生成语音的多模式对话系统的巨大潜力。

Title: Maintaining Informative Coherence: Migrating Hallucinations in Large Language Models via Absorbing Markov Chains

Authors: Jiemin Wu, Songning Lai, Ruiqiang Xiao, Tianlang Xue, Jiayu Yang, Yutao Yue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20340
Pdf URL: https://arxiv.org/pdf/2410.20340
Copy Paste: [[2410.20340]] Maintaining Informative Coherence: Migrating Hallucinations in Large Language Models via Absorbing Markov Chains(https://arxiv.org/abs/2410.20340)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) are powerful tools for text generation, translation, and summarization, but they often suffer from hallucinations, instances where they fail to maintain the fidelity and coherence of contextual information during decoding, sometimes overlooking critical details due to their sampling strategies and inherent biases from training data and fine-tuning discrepancies. These hallucinations can propagate through the web, affecting the trustworthiness of information disseminated online. To address this issue, we propose a novel decoding strategy that leverages absorbing Markov chains to quantify the significance of contextual information and measure the extent of information loss during generation. By considering all possible paths from the first to the last token, our approach enhances the reliability of model outputs without requiring additional training or external data. Evaluations on datasets including TruthfulQA, FACTOR, and HaluEval highlight the superior performance of our method in mitigating hallucinations, underscoring the necessity of ensuring accurate information flow in web-based applications.
摘要：大型语言模型 (LLM) 是用于文本生成、翻译和摘要的强大工具，但它们经常受到幻觉的影响 - 例如，它们在解码过程中无法保持上下文信息的保真度和连贯性，有时会由于其采样策略以及训练数据和微调差异的固有偏差而忽略关键细节。这些幻觉可以通过网络传播，影响在线传播信息的可信度。为了解决这个问题，我们提出了一种新颖的解码策略，利用吸收马尔可夫链来量化上下文信息的重要性并衡量生成过程中信息丢失的程度。通过考虑从第一个到最后一个标记的所有可能路径，我们的方法提高了模型输出的可靠性，而无需额外的训练或外部数据。对 TruthfulQA、FACTOR 和 HaluEval 等数据集的评估突出了我们的方法在减轻幻觉方面的卓越性能，强调了确保基于 Web 的应用程序中准确信息流的必要性。
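Background sketch (not the paper's decoding strategy): the abstract relies on absorbing Markov chains but does not spell out the construction. As a generic illustration only, the snippet below computes two standard quantities for such a chain, the expected number of steps before absorption and the probability of ending in a given absorbing state, by iterating the fixed-point equations t = 1 + Qt and b = r + Qb, where Q holds transient-to-transient transition probabilities. The four-state random walk is a hypothetical example unrelated to the paper's token graph.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def mat_vec(m: Matrix, v: Vector) -> Vector:
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def expected_steps_to_absorption(q: Matrix, iterations: int = 200) -> Vector:
    """Iterate t = 1 + Q t; converges because Q has spectral radius below 1."""
    t = [0.0] * len(q)
    for _ in range(iterations):
        t = [1.0 + x for x in mat_vec(q, t)]
    return t


def absorption_probabilities(q: Matrix, r_column: Vector, iterations: int = 200) -> Vector:
    """Iterate b = r + Q b, where r_column[i] is the one-step probability of
    jumping from transient state i into the chosen absorbing state."""
    b = [0.0] * len(q)
    for _ in range(iterations):
        b = [r + x for r, x in zip(r_column, mat_vec(q, b))]
    return b


# Hypothetical chain: states 0..3 on a line, states 0 and 3 absorbing,
# states 1 and 2 transient, with a fair coin deciding each move.
Q = [[0.0, 0.5],   # from state 1: to state 1, to state 2
     [0.5, 0.0]]   # from state 2: to state 1, to state 2
R_TO_STATE_3 = [0.0, 0.5]  # one-step probability of reaching state 3

steps = expected_steps_to_absorption(Q)
reach_3 = absorption_probabilities(Q, R_TO_STATE_3)
print(steps, reach_3)
```

For larger chains the same quantities are usually obtained by solving the linear system with the fundamental matrix (I - Q)^-1 rather than by iteration.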

Title: Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation

Authors: Yifang Chen, David Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20362
Pdf URL: https://arxiv.org/pdf/2410.20362
Copy Paste: [[2410.20362]] Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation(https://arxiv.org/abs/2410.20362)
Keywords: language model, llm, prompt
Abstract: Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data. Recently, many works are exploring synthetic data generation using LLMs. However, they primarily focus on prompt engineering with standard supervised instruction-finetuned models, which contains a fundamental limitation: these models are optimized for general question-answering/problem-solving rather than data generation. We propose a paradigm shift named NOMAD by investigating how to specifically train models for data generation, demonstrating that this task differs significantly from training a classical LM. We identify two key factors: no-prompt-masked training and proper training set size selection. Our method, NOMAD, shows substantial improvements over baselines, achieving >4% gains in TriviaQA and >2% in GSM8K with limited training data. Finally, we offer new insights by interpreting synthetic data through the lenses of "relevance" and "novelty".
摘要：大型语言模型 (LLM) 训练的最新进展凸显了对多样化、高质量指令数据的需求。最近，许多研究都在探索使用 LLM 生成合成数据。然而，它们主要关注使用标准监督指令微调模型的提示工程，这包含一个根本限制：这些模型针对一般问答/问题解决而不是数据生成进行了优化。我们提出了一种名为 NOMAD 的范式转变，通过研究如何专门训练数据生成模型，表明这项任务与训练经典 LM 有很大不同。我们确定了两个关键因素：无提示掩蔽训练和适当的训练集大小选择。我们的方法 NOMAD 显示出比基线有显著的改进，在有限的训练数据下，在 TriviaQA 中实现了 >4% 的增益，在 GSM8K 中实现了 >2% 的增益。最后，我们通过“相关性”和“新颖性”的视角来解释合成数据，从而提供了新的见解。

Title: MedGo: A Chinese Medical Large Language Model

Authors: Haitao Zhang, Bo An
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20428
Pdf URL: https://arxiv.org/pdf/2410.20428
Copy Paste: [[2410.20428]] MedGo: A Chinese Medical Large Language Model(https://arxiv.org/abs/2410.20428)
Keywords: language model
Abstract: Large models are a hot research topic in the field of artificial intelligence. Leveraging their generative capabilities has the potential to enhance the level and quality of medical services. In response to the limitations of current large language models, which often struggle with accuracy and have narrow capabilities in medical applications, this paper presents a Chinese medical large language model, MedGo. MedGo was trained using a combination of high quality unsupervised medical data, supervised data, and preference alignment data, aimed at enhancing both its versatility and precision in medical tasks. The model was evaluated through the public CBLUE benchmark and a manually constructed dataset ClinicalQA. The results demonstrate that MedGo achieved promising performance across various Chinese medical information processing tasks and took first place in the CBLUE evaluation. Additionally, on our constructed dataset ClinicalQA, MedGo outperformed its base model Qwen2, highlighting its potential to improve both automated medical question answering and clinical decision support. These experimental results demonstrate that MedGo possesses strong information processing capabilities in the medical field. At present, we have successfully deployed MedGo at Shanghai East Hospital.
摘要：大型模型是人工智能领域的研究热点，利用大型模型的生成能力可以提升医疗服务水平和质量。针对目前大型语言模型在医疗应用上精度低、能力有限等问题，本文提出了中文医疗大型语言模型MedGo。MedGo使用高质量的无监督医疗数据、监督数据和偏好对齐数据进行训练，以提高其在医疗任务中的通用性和精度。通过公开的CBLUE基准和手工构建的数据集ClinicalQA对模型进行评估。结果表明，MedGo在各种中文医疗信息处理任务中均取得了良好的表现，在CBLUE评估中获得了第一名。此外，在我们构建的数据集ClinicalQA上，MedGo的表现优于其基础模型Qwen2，凸显了其在改进自动医疗问答和临床决策支持方面的潜力。这些实验结果表明，MedGo在医疗领域拥有强大的信息处理能力。目前，我们已在上海东方医院成功部署了MedGo。

Title: TrajAgent: An Agent Framework for Unified Trajectory Modelling

Authors: Yuwei Du, Jie Feng, Jie Zhao, Yong Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20445
Pdf URL: https://arxiv.org/pdf/2410.20445
Copy Paste: [[2410.20445]] TrajAgent: An Agent Framework for Unified Trajectory Modelling(https://arxiv.org/abs/2410.20445)
Keywords: language model, agent
Abstract: Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modelling. However, due to the heterogeneity of data and the diversity of trajectory tasks, achieving unified trajectory modelling remains an important yet challenging task. In this paper, we propose TrajAgent, a large language model-based agentic framework, to unify various trajectory modelling tasks. In TrajAgent, we first develop UniEnv, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on UniEnv, we introduce TAgent, an agentic workflow designed for automatic trajectory modelling across various trajectory tasks. Specifically, we design AutOpt, a systematic optimization module within TAgent, to further improve the performance of the integrated model. With diverse trajectory tasks input in natural language, TrajAgent automatically generates competitive results via training and executing appropriate models. Extensive experiments on four tasks using four real-world datasets demonstrate the effectiveness of TrajAgent in unified trajectory modelling, achieving an average performance improvement of 15.43% over baseline methods.
摘要：轨迹建模包括对轨迹数据模式挖掘和未来预测的研究，在生活服务、城市交通和公共管理等领域有着广泛的应用。已经提出了许多方法来解决轨迹建模中的特定问题。然而，由于数据的异构性和轨迹任务的多样性，实现统一的轨迹建模仍然是一项重要而具有挑战性的任务。在本文中，我们提出了一个基于大型语言模型的代理框架 TrajAgent，以统一各种轨迹建模任务。在 TrajAgent 中，我们首先开发了具有统一数据和模型接口的执行环境 UniEnv，以支持各种模型的执行和训练。在 UniEnv 的基础上，我们引入了 TAgent，这是一个代理工作流，旨在跨各种轨迹任务进行自动轨迹建模。具体来说，我们在 TAgent 中设计了一个系统优化模块 AutOpt，以进一步提高集成模型的性能。通过以自然语言输入的各种轨迹任务，TrajAgent 通过训练和执行适当的模型自动生成有竞争力的结果。使用四个真实数据集对四个任务进行的大量实验证明了 TrajAgent 在统一轨迹建模中的有效性，与基线方法相比，平均性能提高了 15.43%。

Title: What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration

Authors: Libo Qin, Qiguang Chen, Hao Fei, Zhi Chen, Min Li, Wanxiang Che
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2410.20482
Pdf URL: https://arxiv.org/pdf/2410.20482
Copy Paste: [[2410.20482]] What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration(https://arxiv.org/abs/2410.20482)
Keywords: language model, prompt
Abstract: Recently, Multi-Modal In-Context Learning (MM-ICL) has advanced rapidly and achieved notable success, delivering superior performance across various tasks without requiring additional parameter tuning. However, the underlying rules for the effectiveness of MM-ICL remain under-explored. To fill this gap, this work aims to investigate the research question: "What factors affect the performance of MM-ICL?" To this end, we conduct extensive experiments on the three core steps of MM-ICL, including demonstration retrieval, demonstration ordering, and prompt construction, using 6 vision large language models and 20 strategies. Our findings highlight (1) the necessity of a multi-modal retriever for demonstration retrieval, (2) the importance of intra-demonstration ordering over inter-demonstration ordering, and (3) the enhancement of task comprehension through introductory instructions in prompts. We hope this study can serve as a foundational guide for optimizing MM-ICL strategies in future research.
摘要：最近，多模态语境学习 (MM-ICL) 的快速发展取得了显著的成功，它能够在各种任务中实现卓越的性能，而无需额外的参数调整。然而，MM-ICL 有效性的根本规则仍未得到充分探索。为了填补这一空白，这项工作旨在研究以下问题：“哪些因素影响 MM-ICL 的性能？”为此，我们使用 6 个视觉大型语言模型和 20 种策略对 MM-ICL 的三个核心步骤（包括演示检索、演示排序和提示构建）进行了广泛的实验。我们的研究结果强调了 (1) 多模态检索器对于演示检索的必要性，(2) 演示内排序相对于演示间排序的重要性，以及 (3) 通过提示中的介绍性说明增强任务理解。我们希望这项研究可以作为未来研究中优化 MM-ICL 策略的基础指南。

Title: FIRP: Faster LLM inference via future intermediate representation prediction

Authors: Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20488
Pdf URL: https://arxiv.org/pdf/2410.20488
Copy Paste: [[2410.20488]] FIRP: Faster LLM inference via future intermediate representation prediction(https://arxiv.org/abs/2410.20488)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) have shown remarkable performance across a wide range of tasks. Despite this, the auto-regressive nature of LLM decoding, which generates only a single token per forward propagation, fails to fully exploit the parallel computational power of GPUs, leading to considerable latency. To address this, we introduce a novel speculative decoding method named FIRP which generates multiple tokens instead of one at each decoding step. We achieve this by predicting the intermediate hidden states of future tokens (tokens that have not been decoded yet) and then using these pseudo hidden states to decode future tokens. Specifically, these pseudo hidden states are predicted with a simple linear transformation in intermediate layers of LLMs. Once predicted, they participate in the computation of all the following layers, thereby assimilating richer semantic information. As the layers go deeper, the semantic gap between pseudo and real hidden states is narrowed and it becomes feasible to decode future tokens with high accuracy. To validate the effectiveness of FIRP, we conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets; analytical experiments also confirm our motivations.
摘要：大型语言模型 (LLM) 的最新进展已在各种任务中表现出色。尽管如此，LLM 解码的自回归特性（每次前向传播仅生成一个 token）无法充分利用 GPU 的并行计算能力，从而导致相当大的延迟。为了解决这个问题，我们引入了一种名为 FIRP 的新型推测解码方法，该方法在每个解码步骤中生成多个 token，而不是一个。我们通过预测未来 token（尚未解码的 token）的中间隐藏状态，然后使用这些伪隐藏状态解码未来 token 来实现这一点，具体来说，这些伪隐藏状态是在 LLM 的中间层中使用简单的线性变换来预测的。一旦预测完成，它们就会参与所有后续层的计算，从而吸收更丰富的语义信息。随着层的加深，伪隐藏状态和真实隐藏状态之间的语义差距缩小，并且可以高精度地解码未来 token。为了验证 FIRP 的有效性，我们进行了大量实验，结果表明在多个模型和数据集中加速比达到了 1.9 倍至 3 倍，分析实验也证明了我们的动机。

Title: $\textit{Who Speaks Matters}$: Analysing the Influence of the Speaker's Ethnicity on Hate Classification

Authors: Ananya Malik, Kartik Sharma, Lynnette Hui Xian Ng, Shaily Bhatt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20490
Pdf URL: https://arxiv.org/pdf/2410.20490
Copy Paste: [[2410.20490]] $\textit{Who Speaks Matters}$: Analysing the Influence of the Speaker's Ethnicity on Hate Classification(https://arxiv.org/abs/2410.20490)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) offer a lucrative promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects. This requires their applications to high-stakes tasks like hate speech detection to be critically scrutinized. In this work, we investigate the robustness of hate speech classification using LLMs, particularly when explicit and implicit markers of the speaker's ethnicity are injected into the input. For the explicit markers, we inject a phrase that mentions the speaker's identity. For the implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 4 popular LLMs and 5 ethnicities. We find that the presence of implicit dialect markers in inputs causes model outputs to flip more than the presence of explicit markers. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for exercising caution in deploying LLMs for high-stakes tasks like hate speech detection.
摘要：大型语言模型 (LLM) 为可扩展的内容审核（包括仇恨言论检测）提供了丰厚的前景。然而，它们也以脆弱和对边缘化社区和方言有偏见而闻名。这要求对它们在仇恨言论检测等高风险任务中的应用进行严格审查。在这项工作中，我们研究了使用 LLM 进行仇恨言论分类的稳健性，特别是当说话者种族的显性和隐性标记被注入到输入中时。对于显性标记，我们注入了一个提到说话者身份的短语。对于隐性标记，我们注入了方言特征。通过分析模型输出在这些标记存在的情况下翻转的频率，我们发现 4 个流行的 LLM 和 5 个种族的脆弱性程度各不相同。我们发现，输入中隐性方言标记的存在比显性标记的存在更能导致模型输出翻转。此外，翻转的百分比因种族而异。最后，我们发现更大的模型更稳健。我们的研究结果表明，在仇恨言论检测等高风险任务中部署 LLM 时需要谨慎。
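Illustrative sketch (not the authors' evaluation code): the abstract measures brittleness by how often a model's output flips once an ethnicity marker is injected. One plausible way to compute a per-group flip percentage is shown below. The record layout, the group names, and the labels are hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (group, label_without_marker, label_with_marker); all values are made up.
Record = Tuple[str, str, str]


def flip_rates(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of inputs per group whose predicted label changes after injection."""
    flips: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for group, before, after in records:
        totals[group] += 1
        if before != after:
            flips[group] += 1
    return {group: flips[group] / totals[group] for group in totals}


EXAMPLE_RECORDS = [
    ("group_a", "hate", "hate"),
    ("group_a", "hate", "not_hate"),
    ("group_b", "not_hate", "not_hate"),
    ("group_b", "hate", "not_hate"),
    ("group_b", "hate", "hate"),
    ("group_b", "not_hate", "not_hate"),
]

rates = flip_rates(EXAMPLE_RECORDS)
print(rates)
```

Running the same function separately on explicit-marker and implicit-marker predictions would give the kind of comparison the abstract reports.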

Title: MatViX: Multimodal Information Extraction from Visually Rich Articles

Authors: Ghazal Khalighinejad, Sharon Scott, Ollie Liu, Kelly L. Anderson, Rickard Stureborg, Aman Tyagi, Bhuwan Dhingra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20494
Pdf URL: https://arxiv.org/pdf/2410.20494
Copy Paste: [[2410.20494]] MatViX: Multimodal Information Extraction from Visually Rich Articles(https://arxiv.org/abs/2410.20494)
Keywords: language model, long context
Abstract: Multimodal information extraction (MIE) is crucial for scientific literature, where valuable data is often spread across text, figures, and tables. In materials science, extracting structured information from research articles can accelerate the discovery of new materials. However, the multimodal nature and complex interconnections of scientific content present challenges for traditional text-based methods. We introduce MatViX, a benchmark consisting of 324 full-length research articles and 1,688 complex structured JSON files, carefully curated by domain experts. These JSON files are extracted from text, tables, and figures in full-length documents, providing a comprehensive challenge for MIE. We introduce an evaluation method to assess the accuracy of curve similarity and the alignment of hierarchical structures. Additionally, we benchmark vision-language models (VLMs) in a zero-shot manner, capable of processing long contexts and multimodal inputs, and show that using a specialized model (DePlot) can improve performance in extracting curves. Our results demonstrate significant room for improvement in current models. Our dataset and evaluation code are available at this https URL.
摘要：多模态信息提取 (MIE) 对于科学文献至关重要，因为有价值的数据通常分布在文本、图形和表格中。在材料科学中，从研究文章中提取结构化信息可以加速新材料的发现。然而，科学内容的多模态性质和复杂的互连对传统的基于文本的方法提出了挑战。我们引入了 MatViX，这是一个由领域专家精心策划的由 324 篇全长研究文章和 1,688 个复杂结构化 JSON 文件组成的基准。这些 JSON 文件是从全长文档中的文本、表格和图形中提取的，为 MIE 带来了全面的挑战。我们引入了一种评估方法来评估曲线相似性的准确性和层次结构的对齐。此外，我们以零样本方式对视觉语言模型 (VLM) 进行了基准测试，能够处理长上下文和多模态输入，并表明使用专门的模型 (DePlot) 可以提高提取曲线的性能。我们的结果表明，当前模型有很大的改进空间。我们的数据集和评估代码可在此 https URL 上获得。

Title: Is Moral Self-correction An Innate Capability of Large Language Models? A Mechanistic Analysis to Self-correction

Authors: Zimo Qi, Guangliang Liu, Kristen Marie Johnson, Lu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20513
Pdf URL: https://arxiv.org/pdf/2410.20513
Copy Paste: [[2410.20513]] Is Moral Self-correction An Innate Capability of Large Language Models? A Mechanistic Analysis to Self-correction(https://arxiv.org/abs/2410.20513)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Despite intensive attention to the self-correction capability of Large Language Models (LLMs), the underlying mechanism of this capability is still under-explored. In this paper, we aim to answer two fundamental questions for moral self-correction: (1) how different components in self-correction, such as Chain-of-Thought (CoT) reasoning, external feedback, and instructional prompts, interact to enable moral self-correction; and (2) is the self-correction one of LLMs' innate capabilities? To answer the first question, we examine how different self-correction components interact to intervene in the embedded morality within hidden states, thereby contributing to different performance. For the second question, we (i) evaluate the robustness of moral self-correction by introducing natural language interventions of weak evidence into prompts; (ii) propose a validation framework, self-distinguish, that requires effective self-correction to enable LLMs to distinguish between desirable and undesirable outputs. Our experimental results indicate that there is no universally optimal self-correction method for the tasks considered, although external feedback and CoT can contribute to additional performance gains. However, our mechanistic analysis reveals negative interactions among instructional prompts, CoT, and external feedback, suggesting a conflict between internal knowledge and external feedback. The self-distinguish experiments demonstrate that while LLMs can self-correct their responses, they are unable to reliably distinguish between desired and undesired outputs. With our empirical evidence, we can conclude that moral self-correction is not an innate capability of LLMs acquired during pretraining.
摘要：尽管人们密切关注大型语言模型 (LLM) 的自我纠正能力，但这种能力的潜在机制仍未得到充分探索。在本文中，我们旨在回答道德自我纠正的两个基本问题：（1）自我纠正中的不同组成部分，例如思路链 (CoT) 推理、外部反馈和教学提示，如何相互作用以实现道德自我纠正；（2）自我纠正是 LLM 的固有能力之一吗？为了回答第一个问题，我们研究不同的自我纠正组成部分如何相互作用以干预隐藏状态中嵌入的道德，从而导致不同的表现。对于第二个问题，我们 (i) 通过在提示中引入弱证据的自然语言干预来评估道德自我纠正的稳健性；(ii) 提出一个验证框架，自我区分，该框架需要有效的自我纠正，以使 LLM 能够区分理想和不理想的输出。我们的实验结果表明，对于所考虑的任务，没有普遍最佳的自我纠正方法，尽管外部反馈和 CoT 可以进一步提高绩效。然而，我们的机制分析揭示了教学提示、CoT 和外部反馈之间的负面相互作用，表明内部知识和外部反馈之间存在冲突。自我区分实验表明，虽然 LLM 可以自我纠正其反应，但它们无法可靠地区分期望和不期望的输出。根据我们的经验证据，我们可以得出结论，道德自我纠正不是 LLM 在预训练期间获得的天生能力。

Title: SubjECTive-QA: Measuring Subjectivity in Earnings Call Transcripts' QA Through Six-Dimensional Feature Analysis

Authors: Huzaifa Pardawala, Siddhant Sukhani, Agam Shah, Veer Kejriwal, Abhishek Pillai, Rohan Bhasin, Andrew DiBiasio, Tarun Mandapati, Dhruv Adha, Sudheer Chava
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20651
Pdf URL: https://arxiv.org/pdf/2410.20651
Copy Paste: [[2410.20651]] SubjECTive-QA: Measuring Subjectivity in Earnings Call Transcripts' QA Through Six-Dimensional Feature Analysis(https://arxiv.org/abs/2410.20651)
Keywords: language model, chat
Abstract: Fact-checking is extensively studied in the context of misinformation and disinformation, addressing objective inaccuracies. However, a softer form of misinformation involves responses that are factually correct but lack certain features such as clarity and relevance. This challenge is prevalent in formal Question-Answer (QA) settings such as press conferences in finance, politics, sports, and other domains, where subjective answers can obscure transparency. Despite this, there is a lack of manually annotated datasets for subjective features across multiple dimensions. To address this gap, we introduce SubjECTive-QA, a human annotated dataset on Earnings Call Transcripts' (ECTs) QA sessions as the answers given by company representatives are often open to subjective interpretations and scrutiny. The dataset includes 49,446 annotations for long-form QA pairs across six features: Assertive, Cautious, Optimistic, Specific, Clear, and Relevant. These features are carefully selected to encompass the key attributes that reflect the tone of the answers provided during QA sessions across different domains. Our findings are that the best-performing Pre-trained Language Model (PLM), RoBERTa-base, has similar weighted F1 scores to Llama-3-70b-Chat on features with lower subjectivity, such as Relevant and Clear, with a mean difference of 2.17% in their weighted F1 scores. The models perform significantly better on features with higher subjectivity, such as Specific and Assertive, with a mean difference of 10.01% in their weighted F1 scores. Furthermore, testing SubjECTive-QA's generalizability using QAs from White House Press Briefings and Gaggles yields an average weighted F1 score of 65.97% using our best models for each feature, demonstrating broader applicability beyond the financial domain. SubjECTive-QA is publicly available under the CC BY 4.0 license.
摘要：事实核查在错误信息和虚假信息的背景下得到了广泛的研究，旨在解决客观不准确性问题。然而，一种较温和的错误信息涉及的回答在事实上是正确的，但缺乏某些特征，例如清晰度和相关性。这种挑战在正式的问答 (QA) 环境中很常见，例如金融、政治、体育和其他领域的新闻发布会，主观答案可能会掩盖透明度。尽管如此，仍然缺乏跨多个维度的主观特征的手动注释数据集。为了弥补这一差距，我们引入了 SubjECTive-QA，这是一个人工注释的收益电话会议记录 (ECT) QA 会议数据集，因为公司代表给出的答案通常可以被主观解释和审查。该数据集包括 49,446 个长格式 QA 对的注释，涉及六个特征：自信、谨慎、乐观、具体、清晰和相关。这些特征经过精心挑选，涵盖了反映不同领域 QA 会话期间提供的答案基调的关键属性。我们发现，性能最佳的预训练语言模型 (PLM) RoBERTa-base 在主观性较低的特征（例如相关性和清晰性）上具有与 Llama-3-70b-Chat 相似的加权 F1 分数，它们的加权 F1 分数的平均差异为 2.17%。这些模型在主观性较高的特征（例如具体性和自信性）上的表现明显更好，它们的加权 F1 分数的平均差异为 10.01%。此外，使用来自白宫新闻发布会和 Gaggles 的 QA 测试 SubjECTive-QA 的通用性，使用我们针对每个特征的最佳模型得出的平均加权 F1 分数为 65.97%，表明其在金融领域之外具有更广泛的适用性。SubjECTive-QA 在 CC BY 4.0 许可下公开可用
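Illustrative sketch (not the authors' evaluation code): the results above are reported as weighted F1 scores. Under the usual definition, the per-class F1 is averaged with each class weighted by its number of true examples; the function below implements that definition in plain Python. The label values and example predictions are hypothetical, not taken from the dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    support = Counter(y_true)  # number of true examples per class
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        total += count * f1
    return total / len(y_true)


# Hypothetical ratings for one feature on six answers.
Y_TRUE = [0, 0, 0, 1, 1, 2]
Y_PRED = [0, 0, 1, 1, 1, 2]

score = weighted_f1(Y_TRUE, Y_PRED)
print(round(score, 4))
```

Weighting by support keeps frequent classes from being drowned out by rare ones, which matters when annotation labels are imbalanced.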

Title: Visualizing attention zones in machine reading comprehension models

Authors: Yiming Cui, Wei-Nan Zhang, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20652
Pdf URL: https://arxiv.org/pdf/2410.20652
Copy Paste: [[2410.20652]] Visualizing attention zones in machine reading comprehension models(https://arxiv.org/abs/2410.20652)
Keywords: language model
Abstract: The attention mechanism plays an important role in the machine reading comprehension (MRC) model. Here, we describe a pipeline for building an MRC model with a pretrained language model and visualizing the effect of each attention zone in different layers, which can indicate the explainability of the model. With the presented protocol and accompanying code, researchers can easily visualize the relevance of each attention zone in the MRC model. This approach can be generalized to other pretrained language models.
摘要：注意力机制在机器阅读理解 (MRC) 模型中起着重要作用。本文，我们描述了一种使用预训练语言模型构建 MRC 模型的流程，并可视化了不同层中每个注意力区域的效果，这可以表明模型的可解释性。借助所提出的协议和随附的代码，研究人员可以轻松地可视化 MRC 模型中每个注意力区域的相关性。这种方法可以推广到其他预训练语言模型。

Title: Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

Authors: Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20672
Pdf URL: https://arxiv.org/pdf/2410.20672
Copy Paste: [[2410.20672]] Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA(https://arxiv.org/abs/2410.20672)
Keywords: language model, llm
Abstract: Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as a form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller "Recursive Transformers" that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our recursive models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge distillation baselines -- and can even recover most of the performance of the original "full-size" model (e.g., Gemma 2B with no shared parameters). Finally, we propose Continuous Depth-wise Batching, a promising new inference paradigm enabled by the Recursive Transformer when paired with early exiting. In a theoretical analysis, we show that this has the potential to lead to significant (2-3x) gains in inference throughput.
摘要：大型语言模型 (LLM) 的部署成本很高。参数共享提供了一种减少其大小和成本的可能途径，但其在现代 LLM 中的有效性仍然相当有限。在这项工作中，我们重新审视了“层绑定”作为 Transformer 中参数共享的形式，并介绍了将现有 LLM 转换为较小的“递归 Transformer”的新方法，这些“递归 Transformer”可在各层之间共享参数，同时将性能损失降至最低。在这里，我们的递归 Transformer 是从标准预训练 Transformer 中高效初始化的，但只使用单个独特层块，然后在循环中重复多次。我们通过引入 Relaxed Recursive Transformer 进一步提高了性能，它通过深度低秩自适应 (LoRA) 模块为层绑定约束增加了灵活性，同时仍保持了整体模型的紧凑性。我们表明，我们的递归模型（例如，递归 Gemma 1B）的表现优于类似大小的 vanilla 预训练模型（例如 TinyLlama 1.1B 和 Pythia 1B）和知识蒸馏基线 - 甚至可以恢复原始“全尺寸”模型（例如，没有共享参数的 Gemma 2B）的大部分性能。最后，我们提出了连续深度批处理，这是递归 Transformer 与早期退出相结合实现的一种有前途的新推理范式。在理论分析中，我们表明这有可能显著提高（2-3 倍）推理吞吐量。
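The idea of layer tying relaxed by depth-wise LoRA can be illustrated with a toy sketch. It assumes a purely linear "layer" (no attention, normalization, or nonlinearity), so it is not the paper's architecture; the names `relaxed_recursive_forward` and `param_counts` are hypothetical. One shared matrix is reused at every depth, and each depth adds its own low-rank correction B·A.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relaxed_recursive_forward(x, W_shared, lora_per_depth):
    """Loop one shared weight matrix; each depth adds its own low-rank delta."""
    h = x
    for A, B in lora_per_depth:              # A is r x d, B is d x r
        base = matvec(W_shared, h)           # tied parameters, identical at every depth
        delta = matvec(B, matvec(A, h))      # depth-specific low-rank correction
        h = [b + d for b, d in zip(base, delta)]
    return h

def param_counts(d, depth, rank):
    """Parameters for `depth` untied d x d layers vs. one tied layer plus per-depth LoRA."""
    untied = depth * d * d
    tied = d * d + depth * 2 * d * rank
    return untied, tied

# Toy usage: identity shared weights, rank-1 adapters, two loop iterations.
W = [[1.0, 0.0], [0.0, 1.0]]
adapter = ([[1.0, 0.0]], [[0.0], [1.0]])
out = relaxed_recursive_forward([1.0, 2.0], W, [adapter, adapter])
```

With d = 1024, 12 loop iterations and rank 8, the tied variant in this toy setting needs roughly a tenth of the parameters of the untied stack, which is the compactness argument the abstract makes.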

Title: Combining Domain-Specific Models and LLMs for Automated Disease Phenotyping from Survey Data

Authors: Gal Beeri, Benoit Chamot, Elena Latchem, Shruthi Venkatesh, Sarah Whalan, Van Zyl Kruger, David Martino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20695
Pdf URL: https://arxiv.org/pdf/2410.20695
Copy Paste: [[2410.20695]] Combining Domain-Specific Models and LLMs for Automated Disease Phenotyping from Survey Data(https://arxiv.org/abs/2410.20695)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: This exploratory pilot study investigated the potential of combining a domain-specific model, BERN2, with large language models (LLMs) to enhance automated disease phenotyping from research survey data. Motivated by the need for efficient and accurate methods to harmonize the growing volume of survey data with standardized disease ontologies, we employed BERN2, a biomedical named entity recognition and normalization model, to extract disease information from the ORIGINS birth cohort survey data. After rigorously evaluating BERN2's performance against a manually curated ground truth dataset, we integrated various LLMs using prompt engineering, Retrieval-Augmented Generation (RAG), and Instructional Fine-Tuning (IFT) to refine the model's outputs. BERN2 demonstrated high performance in extracting and normalizing disease mentions, and the integration of LLMs, particularly with Few Shot Inference and RAG orchestration, further improved accuracy. This approach, especially when incorporating structured examples, logical reasoning prompts, and detailed context, offers a promising avenue for developing tools to enable efficient cohort profiling and data harmonization across large, heterogeneous research datasets.
摘要：这项探索性试点研究调查了将领域特定模型 BERN2 与大型语言模型 (LLM) 相结合以增强从研究调查数据中进行自动疾病表型分析的潜力。出于对高效、准确的方法来协调不断增长的调查数据量和标准化疾病本体的需求，我们使用了 BERN2（一种生物医学命名实体识别和规范化模型）从 ORIGINS 出生队列调查数据中提取疾病信息。在根据手动整理的真实数据集严格评估 BERN2 的性能后，我们使用快速工程、检索增强生成 (RAG) 和教学微调 (IFT) 集成了各种 LLM，以优化模型的输出。BERN2 在提取和规范化疾病提及方面表现出色，而 LLM 的集成（尤其是与 Few Shot Inference 和 RAG 编排）进一步提高了准确性。这种方法，尤其是在结合结构化示例、逻辑推理提示和详细背景时，为开发工具以实现跨大型异构研究数据集的有效队列分析和数据协调提供了有希望的途径。

Title: DisasterQA: A Benchmark for Assessing the performance of LLMs in Disaster Response

Authors: Rajat Rawat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20707
Pdf URL: https://arxiv.org/pdf/2410.20707
Copy Paste: [[2410.20707]] DisasterQA: A Benchmark for Assessing the performance of LLMs in Disaster Response(https://arxiv.org/abs/2410.20707)
Keywords: language model, llm, prompt
Abstract: Disasters can result in the deaths of many, making quick response times vital. Large Language Models (LLMs) have emerged as valuable in the field. LLMs can be used to process vast amounts of textual information quickly, providing situational context during a disaster. However, the question remains whether LLMs should be used for advice and decision making in a disaster. To evaluate the capabilities of LLMs in disaster response knowledge, we introduce a benchmark: DisasterQA created from six online sources. The benchmark covers a wide range of disaster response topics. We evaluated five LLMs each with four different prompting methods on our benchmark, measuring both accuracy and confidence levels through Logprobs. The results indicate that LLMs require improvement on disaster response knowledge. We hope that this benchmark pushes forth further development of LLMs in disaster response, ultimately enabling these models to work alongside emergency managers in disasters.
摘要：灾难可能导致许多人死亡，因此快速响应至关重要。大型语言模型 (LLM) 已成为该领域的宝贵资源。LLM 可用于快速处理大量文本信息，在灾难期间提供情景背景。然而，问题仍然是 LLM 是否应该用于灾难中的建议和决策。为了评估 LLM 在灾难响应知识方面的能力，我们引入了一个基准：由六个在线来源创建的 DisasterQA。该基准涵盖了广泛的灾难响应主题。我们在基准上评估了五个 LLM，每个 LLM 都使用四种不同的提示方法，通过 Logprobs 测量准确性和置信度。结果表明 LLM 需要改进灾难响应知识。我们希望这个基准能够推动 LLM 在灾难响应方面的进一步发展，最终使这些模型能够与灾难中的应急管理人员一起工作。

Title: Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models

Authors: Heerin Yang, Sseung-won Hwang, Jungmin So
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20710
Pdf URL: https://arxiv.org/pdf/2410.20710
Copy Paste: [[2410.20710]] Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models(https://arxiv.org/abs/2410.20710)
Keywords: language model
Abstract: Although pre-trained language models show good performance on various natural language processing tasks, they often rely on non-causal features and patterns to determine the outcome. For natural language inference tasks, previous results have shown that even a model trained on a large amount of data fails to perform well on counterfactually revised data, indicating that the model is not robustly learning the semantics of the classes. In this paper, we propose a method in which we use token-based and sentence-based augmentation methods to generate counterfactual sentence pairs that belong to each class, and apply contrastive learning to help the model learn the difference between sentence pairs of different classes with similar contexts. Evaluation results on a counterfactually revised dataset and general NLI datasets show that the proposed method can improve the performance and robustness of the NLI model.
摘要：尽管预训练语言模型在各种自然语言处理任务上都表现出色，但它们往往依赖于非因果特征和模式来确定结果。对于自然语言推理任务，先前的结果表明，即使是在大量数据上训练的模型，在反事实修改的数据上也无法表现良好，这表明模型没有稳健地学习类别的语义。在本文中，我们提出了一种方法，我们使用基于 token 和基于句子的增强方法来生成属于各个类别的反事实句子对，并应用对比学习来帮助模型学习具有相似上下文的不同类别的句子对之间的差异。使用反事实修改数据集和一般 NLI 数据集的评估结果表明，所提出的方法可以提高 NLI 模型的性能和稳健性。

Title: Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation

Authors: Mufei Li, Siqi Miao, Pan Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20724
Pdf URL: https://arxiv.org/pdf/2410.20724
Copy Paste: [[2410.20724]] Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation(https://arxiv.org/abs/2410.20724)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) demonstrate strong reasoning abilities but face limitations such as hallucinations and outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLM outputs in structured external knowledge from KGs. However, current KG-based RAG frameworks still struggle to optimize the trade-off between retrieval effectiveness and efficiency in identifying a suitable amount of relevant graph information for the LLM to digest. We introduce SubgraphRAG, extending the KG-based RAG framework that retrieves subgraphs and leverages LLMs for reasoning and answer prediction. Our approach innovatively integrates a lightweight multilayer perceptron with a parallel triple-scoring mechanism for efficient and flexible subgraph retrieval while encoding directional structural distances to enhance retrieval effectiveness. The size of retrieved subgraphs can be flexibly adjusted to match the query's need and the downstream LLM's capabilities. This design strikes a balance between model complexity and reasoning power, enabling scalable and generalizable retrieval processes. Notably, based on our retrieved subgraphs, smaller LLMs like Llama3.1-8B-Instruct deliver competitive results with explainable reasoning, while larger models like GPT-4o achieve state-of-the-art accuracy compared with previous baselines -- all without fine-tuning. Extensive evaluations on the WebQSP and CWQ benchmarks highlight SubgraphRAG's strengths in efficiency, accuracy, and reliability by reducing hallucinations and improving response grounding.
摘要：大型语言模型 (LLM) 表现出强大的推理能力，但也面临幻觉和知识过时等限制。基于知识图谱 (KG) 的检索增强生成 (RAG) 通过将 LLM 输出建立在来自 KG 的结构化外部知识中来解决这些问题。然而，当前基于 KG 的 RAG 框架仍然难以在检索有效性和效率之间找到平衡，以确定 LLM 需要消化的适当数量的相关图信息。我们引入了 SubgraphRAG，扩展了基于 KG 的 RAG 框架，该框架检索子图并利用 LLM 进行推理和答案预测。我们的方法创新地将轻量级多层感知器与并行三重评分机制相结合，实现高效灵活的子图检索，同时对方向结构距离进行编码以提高检索有效性。检索到的子图的大小可以灵活调整，以匹配查询的需求和下游 LLM 的功能。这种设计在模型复杂性和推理能力之间取得了平衡，从而实现了可扩展和可推广的检索过程。值得注意的是，基于我们检索到的子图，较小的 LLM（如 Llama3.1-8B-Instruct）通过可解释的推理提供了具有竞争力的结果，而较大的模型（如 GPT-4o）与之前的基线相比实现了最先进的准确度——所有这些都无需微调。对 WebQSP 和 CWQ 基准的广泛评估凸显了 SubgraphRAG 在效率、准确性和可靠性方面的优势，因为它减少了幻觉并改善了响应基础。

Title: Gender Bias in LLM-generated Interview Responses

Authors: Haein Kong, Yongsu Ahn, Sangyub Lee, Yunho Maeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20739
Pdf URL: https://arxiv.org/pdf/2410.20739
Copy Paste: [[2410.20739]] Gender Bias in LLM-generated Interview Responses(https://arxiv.org/abs/2410.20739)
Keywords: gpt, llm
Abstract: LLMs have emerged as a promising tool for assisting individuals in diverse text-generation tasks, including job-related texts. However, LLM-generated answers have been increasingly found to exhibit gender bias. This study evaluates three LLMs (GPT-3.5, GPT-4, Claude) to conduct a multifaceted audit of LLM-generated interview responses across models, question types, and jobs, and their alignment with two gender stereotypes. Our findings reveal that gender bias is consistent, and closely aligned with gender stereotypes and the dominance of jobs. Overall, this study contributes to the systematic examination of gender bias in LLM-generated interview responses, highlighting the need for a mindful approach to mitigate such biases in related applications.
摘要：LLM 已成为一种有前途的工具，可帮助个人完成各种文本生成任务，包括与工作相关的文本。然而，人们发现 LLM 生成的答案越来越多地表现出性别偏见。本研究评估了三门 LLM（GPT-3.5、GPT-4、Claude），对 LLM 生成的面试答案在模型、问题类型和工作方面以及它们与两种性别刻板印象的一致性进行了多方面的审核。我们的研究结果表明，性别偏见是一致的，并且与性别刻板印象和工作主导地位密切相关。总体而言，本研究有助于系统地研究 LLM 生成的面试答案中的性别偏见，强调需要采取谨慎的方法来减轻相关应用中的此类偏见。

Title: ElectionSim: Massive Population Election Simulation Powered by Large Language Model Driven Agents

Authors: Xinnong Zhang, Jiayu Lin, Libo Sun, Weihong Qi, Yihang Yang, Yue Chen, Hanjia Lyu, Xinyi Mou, Siming Chen, Jiebo Luo, Xuanjing Huang, Shiping Tang, Zhongyu Wei
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2410.20746
Pdf URL: https://arxiv.org/pdf/2410.20746
Copy Paste: [[2410.20746]] ElectionSim: Massive Population Election Simulation Powered by Large Language Model Driven Agents(https://arxiv.org/abs/2410.20746)
Keywords: language model, agent
Abstract: The massive population election simulation aims to model the preferences of specific groups in particular election scenarios. It has garnered significant attention for its potential to forecast real-world social trends. Traditional agent-based modeling (ABM) methods are constrained by their ability to incorporate complex individual background information and provide interactive prediction results. In this paper, we introduce ElectionSim, an innovative election simulation framework based on large language models, designed to support accurate voter simulations and customized distributions, together with an interactive platform to dialogue with simulated voters. We present a million-level voter pool sampled from social media platforms to support accurate individual simulation. We also introduce PPE, a poll-based presidential election benchmark to assess the performance of our framework under the U.S. presidential election scenario. Through extensive experiments and analyses, we demonstrate the effectiveness and robustness of our framework in U.S. presidential election simulations.
摘要：大规模人口选举模拟旨在模拟特定选举场景中特定群体的偏好。它因其预测现实世界社会趋势的潜力而备受关注。传统的基于代理的建模 (ABM) 方法受限于其纳入复杂个人背景信息并提供交互式预测结果的能力。在本文中，我们介绍了 ElectionSim，这是一个基于大型语言模型的创新选举模拟框架，旨在支持准确的选民模拟和定制分布，以及一个与模拟选民对话的交互式平台。我们提供了一个从社交媒体平台抽样的百万级选民池，以支持准确的个人模拟。我们还引入了 PPE，这是一个基于民意调查的总统选举基准，以评估我们的框架在美国总统选举情景下的表现。通过大量的实验和分析，我们证明了我们的框架在美国总统选举模拟中的有效性和稳健性。

Title: Plan$\times$RAG: Planning-guided Retrieval Augmented Generation

Authors: Prakhar Verma, Sukruta Prakash Midigeshi, Gaurav Sinha, Arno Solin, Nagarajan Natarajan, Amit Sharma
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20753
Pdf URL: https://arxiv.org/pdf/2410.20753
Copy Paste: [[2410.20753]] Plan$\times$RAG: Planning-guided Retrieval Augmented Generation(https://arxiv.org/abs/2410.20753)
Keywords: language model, hallucination, retrieval augmented generation
Abstract: We introduce Planning-guided Retrieval Augmented Generation (Plan$\times$RAG), a novel framework that augments the \emph{retrieve-then-reason} paradigm of existing RAG frameworks to \emph{plan-then-retrieve}. Plan$\times$RAG formulates a reasoning plan as a directed acyclic graph (DAG), decomposing queries into interrelated atomic sub-queries. Answer generation follows the DAG structure, allowing significant gains in efficiency through parallelized retrieval and generation. While state-of-the-art RAG solutions require extensive data generation and fine-tuning of language models (LMs), Plan$\times$RAG incorporates frozen LMs as plug-and-play experts to generate high-quality answers. Compared to existing RAG solutions, Plan$\times$RAG demonstrates significant improvements in reducing hallucinations and bolstering attribution due to its structured sub-query decomposition. Overall, Plan$\times$RAG offers a new perspective on integrating external knowledge in LMs while ensuring attribution by design, contributing towards more reliable LM-based systems.
摘要：我们引入了规划引导检索增强生成 (Plan$\times$RAG)，这是一个新颖的框架，它将现有 RAG 框架的 \emph{retrieve-then-reason} 范式增强为 \emph{plan-then-retrieve}。Plan$\times$RAG 将推理计划制定为有向无环图 (DAG)，将查询分解为相互关联的原子子查询。答案生成遵循 DAG 结构，通过并行检索和生成显着提高效率。虽然最先进的 RAG 解决方案需要大量数据生成和语言模型 (LM) 的微调，但 Plan$\times$RAG 将冻结的 LM 作为即插即用专家来生成高质量的答案。与现有的 RAG 解决方案相比，Plan$\times$RAG 由于其结构化的子查询分解，在减少幻觉和加强归因方面表现出显着的改进。总体而言，Plan$\times$RAG 为在 LM 中整合外部知识提供了新的视角，同时通过设计确保归因，有助于建立更可靠的基于 LM 的系统。
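The plan-then-retrieve idea, a DAG of sub-queries in which independent nodes can be processed in parallel, can be sketched with Python's standard `graphlib` module. This is only an illustration of scheduling a dependency graph; the plan contents and the function name `execution_levels` are assumptions, and the retrieval and generation steps themselves are omitted.

```python
from graphlib import TopologicalSorter

def execution_levels(plan):
    """Group sub-queries into batches whose dependencies are already answered.

    `plan` maps each sub-query id to the set of sub-query ids it depends on.
    Sub-queries within one batch do not depend on each other, so they could
    be retrieved and answered in parallel.
    """
    sorter = TopologicalSorter(plan)
    sorter.prepare()                      # raises CycleError if the plan is not a DAG
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)               # mark the batch as answered
    return levels

# Hypothetical plan: q3 needs the answers to q1 and q2, q4 needs q3.
plan = {"q1": set(), "q2": set(), "q3": {"q1", "q2"}, "q4": {"q3"}}
levels = execution_levels(plan)
```

In this toy plan, `q1` and `q2` form the first batch and can run concurrently, while `q3` and `q4` must wait for their parents.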

Title: Evaluating LLMs for Targeted Concept Simplification for Domain-Specific Texts

Authors: Sumit Asthana, Hannah Rashkin, Elizabeth Clark, Fantine Huot, Mirella Lapata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20763
Pdf URL: https://arxiv.org/pdf/2410.20763
Copy Paste: [[2410.20763]] Evaluating LLMs for Targeted Concept Simplification for Domain-Specific Texts(https://arxiv.org/abs/2410.20763)
Keywords: llm
Abstract: One useful application of NLP models is to support people in reading complex text from unfamiliar domains (e.g., scientific articles). Simplifying the entire text makes it understandable but sometimes removes important details. On the contrary, helping adult readers understand difficult concepts in context can enhance their vocabulary and knowledge. In a preliminary human study, we first identify that lack of context and unfamiliarity with difficult concepts is a major reason for adult readers' difficulty with domain-specific text. We then introduce "targeted concept simplification," a simplification task for rewriting text to help readers comprehend text containing unfamiliar concepts. We also introduce WikiDomains, a new dataset of 22k definitions from 13 academic domains paired with a difficult concept within each definition. We benchmark the performance of open-source and commercial LLMs and a simple dictionary baseline on this task across human judgments of ease of understanding and meaning preservation. Interestingly, our human judges preferred explanations about the difficult concept more than simplification of the concept phrase. Further, no single model achieved superior performance across all quality dimensions, and automated metrics also show low correlations with human evaluations of concept simplification ($\sim0.2$), opening up rich avenues for research on personalized human reading comprehension support.
摘要：NLP 模型的一个有用应用是帮助人们阅读来自不熟悉领域的复杂文本（例如科学文章）。简化整个文本使其易于理解，但有时会删除重要的细节。相反，帮助成年读者在上下文中理解困难的概念可以增强他们的词汇量和知识。在一项初步的人类研究中，我们首先发现缺乏背景和不熟悉困难的概念是成年读者难以理解特定领域文本的主要原因。然后，我们引入了“有针对性的概念简化”，这是一项简化任务，用于重写文本以帮助读者理解包含不熟悉概念的文本。我们还引入了 WikiDomains，这是一个新的数据集，包含来自 13 个学术领域的 22k 个定义，每个定义中都有一个困难的概念。我们根据人类对理解的难易程度和意义保留的判断，对开源和商业 LLM 以及一个简单的词典基线在此任务上的性能进行了基准测试。有趣的是，我们的人类评委更喜欢对困难概念的解释，而不是简化概念短语。此外，没有任何一个模型能够在所有质量维度上取得优异的表现，并且自动化指标与人类对概念简化的评估（$\sim0.2$）的相关性也较低，为个性化人类阅读理解支持的研究开辟了丰富的途径。

Title: MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

Authors: Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, Róbert Csordás
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20771
Pdf URL: https://arxiv.org/pdf/2410.20771
Copy Paste: [[2410.20771]] MrT5: Dynamic Token Merging for Efficient Byte-level Language Models(https://arxiv.org/abs/2410.20771)
Keywords: language model
Abstract: Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level models like ByT5 attempt to address these concerns, they have not gained widespread adoption -- processing raw byte streams without tokenization results in significantly longer sequence lengths, making training and inference inefficient. This work introduces MrT5 (MergeT5), a more efficient variant of ByT5 that integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length. After processing through a fixed number of encoder layers, a learnt delete gate determines which tokens are to be removed and which are to be retained for subsequent layers. MrT5 effectively "merges" critical information from deleted tokens into a more compact sequence, leveraging contextual information from the remaining tokens. In continued pre-training experiments, we find that MrT5 can achieve significant gains in inference runtime with minimal effect on performance. When trained on English text, MrT5 demonstrates the capability to transfer its deletion feature zero-shot across several languages, with significant additional improvements following multilingual training. Furthermore, MrT5 shows comparable accuracy to ByT5 on downstream evaluations such as XNLI and character-level tasks while reducing sequence lengths by up to 80%. Our approach presents a solution to the practical limitations of existing byte-level models.
摘要：依赖于子词标记化的模型存在重大缺陷，例如对字符级噪声（如拼写错误）的敏感性以及不同语言和脚本之间的压缩率不一致。虽然像 ByT5 这样的字符级或字节级模型试图解决这些问题，但它们尚未得到广泛采用——处理未进行标记化的原始字节流会导致序列长度明显变长，从而使训练和推理效率低下。这项工作引入了 MrT5（MergeT5），这是 ByT5 的更高效变体，它在其编码器中集成了标记删除机制，以动态缩短输入序列长度。在经过固定数量的编码器层处理后，学习删除门将确定要删除哪些标记以及要保留哪些标记以供后续层使用。MrT5 有效地将已删除标记中的关键信息“合并”到更紧凑的序列中，利用剩余标记中的上下文信息。在持续的预训练实验中，我们发现 MrT5 可以在对性能影响最小的情况下显著提高推理运行时间。在英语文本上进行训练时，MrT5 展示了将其删除特征零样本迁移到多种语言的能力，并且在多语言训练后获得了显著的额外改进。此外，MrT5 在下游评估（例如 XNLI 和字符级任务）上表现出与 ByT5 相当的准确率，同时将序列长度缩短了高达 80%。我们的方法为现有字节级模型的实际局限性提供了一种解决方案。

Title: Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation

Authors: Dongryeol Lee, Yerin Hwang, Yongil Kim, Joonsuk Park, Kyomin Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20774
Pdf URL: https://arxiv.org/pdf/2410.20774
Copy Paste: [[2410.20774]] Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation(https://arxiv.org/abs/2410.20774)
Keywords: language model, gpt, llm
Abstract: In line with the principle of honesty, there has been a growing effort to train large language models (LLMs) to generate outputs containing epistemic markers. However, evaluation in the presence of epistemic markers has been largely overlooked, raising a critical question: Could the use of epistemic markers in LLM-generated outputs lead to unintended negative consequences? To address this, we present EMBER, a benchmark designed to assess the robustness of LLM-judges to epistemic markers in both single and pairwise evaluation settings. Our findings, based on evaluations using EMBER, reveal that all tested LLM-judges, including GPT-4o, show a notable lack of robustness in the presence of epistemic markers. Specifically, we observe a negative bias toward epistemic markers, with a stronger bias against markers expressing uncertainty. This suggests that LLM-judges are influenced by the presence of these markers and do not focus solely on the correctness of the content.
摘要：秉承诚实原则，人们越来越多地致力于训练大型语言模型 (LLM) 以生成包含认知标记的输出。然而，在存在认知标记的情况下进行评估在很大程度上被忽视了，这引出了一个关键问题：在 LLM 生成的输出中使用认知标记是否会导致意想不到的负面后果？为了解决这个问题，我们提出了 EMBER，这是一个基准，旨在评估 LLM 评委在单次和成对评估设置中对认知标记的稳健性。我们的研究结果基于使用 EMBER 进行的评估，结果表明，所有经过测试的 LLM 评委（包括 GPT-4o）在存在认知标记的情况下都表现出明显缺乏稳健性。具体来说，我们观察到对认知标记的负面偏见，对表达不确定性的标记的偏见更强烈。这表明 LLM 评委受到这些标记存在的影响，并且不仅仅关注内容的正确性。

Title: KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation

Authors: Rambod Azimi, Rishav Rishav, Marek Teichmann, Samira Ebrahimi Kahou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20777
Pdf URL: https://arxiv.org/pdf/2410.20777
Copy Paste: [[2410.20777]] KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation(https://arxiv.org/abs/2410.20777)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable performance across various downstream tasks. However, the high computational and memory requirements of LLMs are a major bottleneck. To address this, parameter-efficient fine-tuning (PEFT) methods such as low-rank adaptation (LoRA) have been proposed to reduce computational costs while ensuring minimal loss in performance. Additionally, knowledge distillation (KD) has been a popular choice for obtaining compact student models from teacher models. In this work, we present KD-LoRA, a novel fine-tuning method that combines LoRA with KD. Our results demonstrate that KD-LoRA achieves performance comparable to full fine-tuning (FFT) and LoRA while significantly reducing resource requirements. Specifically, KD-LoRA retains 98% of LoRA's performance on the GLUE benchmark, while being 40% more compact. Additionally, KD-LoRA reduces GPU memory usage by 30% compared to LoRA, while decreasing inference time by 30% compared to both FFT and LoRA. We evaluate KD-LoRA across three encoder-only models: BERT, RoBERTa, and DeBERTaV3. Code is available at this https URL.
摘要：大型语言模型 (LLM) 在各种下游任务中都表现出色。然而，LLM 的高计算和内存要求是一个主要瓶颈。为了解决这个问题，提出了参数高效微调 (PEFT) 方法，例如低秩自适应 (LoRA)，以降低计算成本，同时确保性能损失最小。此外，知识蒸馏 (KD) 一直是从教师模型获取紧凑学生模型的流行选择。在这项工作中，我们提出了 KD-LoRA，这是一种将 LoRA 与 KD 相结合的新型微调方法。我们的结果表明，KD-LoRA 实现了与完全微调 (FFT) 和 LoRA 相当的性能，同时显著降低了资源需求。具体来说，KD-LoRA 在 GLUE 基准上保留了 LoRA 98% 的性能，同时紧凑了 40%。此外，与 LoRA 相比，KD-LoRA 将 GPU 内存使用量减少了 30%，同时与 FFT 和 LoRA 相比，推理时间减少了 30%。我们对三种仅编码器模型中的 KD-LoRA 进行了评估：BERT、RoBERTa 和 DeBERTaV3。代码可在此 https URL 上获取。

Title: Graph-based Uncertainty Metrics for Long-form Language Model Outputs

Authors: Mingjian Jiang, Yangjun Ruan, Prasanna Sattigeri, Salim Roukos, Tatsunori Hashimoto
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20783
Pdf URL: https://arxiv.org/pdf/2410.20783
Copy Paste: [[2410.20783]] Graph-based Uncertainty Metrics for Long-form Language Model Outputs(https://arxiv.org/abs/2410.20783)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities, but these systems are still known to hallucinate, and granular uncertainty estimation for long-form LLM generations remains challenging. In this work, we propose Graph Uncertainty -- which represents the relationship between LLM generations and claims within them as a bipartite graph and estimates the claim-level uncertainty with a family of graph centrality metrics. Under this view, existing uncertainty estimation methods based on the concept of self-consistency can be viewed as using degree centrality as an uncertainty measure, and we show that more sophisticated alternatives such as closeness centrality provide consistent gains at claim-level uncertainty estimation. Moreover, we present uncertainty-aware decoding techniques that leverage both the graph structure and uncertainty estimates to improve the factuality of LLM generations by preserving only the most reliable claims. Compared to existing methods, our graph-based uncertainty metrics lead to an average of 6.8% relative gains on AUPRC across various long-form generation settings, and our end-to-end system provides consistent 2-4% gains in factuality over existing decoding techniques while significantly improving the informativeness of generated responses.
摘要：大型语言模型 (LLM) 的最新进展显著提高了文本生成能力，但这些系统仍然会产生幻觉，而且长格式 LLM 生成的细粒度不确定性估计仍然具有挑战性。在这项工作中，我们提出了图不确定性——它将 LLM 生成与其中的声明之间的关系表示为二分图，并使用一系列图中心性指标来估计声明级别的不确定性。根据这种观点，现有的基于自洽概念的不确定性估计方法可以看作是使用度中心性作为不确定性度量，并且我们表明，更复杂的替代方法（例如接近中心性）在声明级别的不确定性估计方面提供了一致的收益。此外，我们提出了不确定性感知解码技术，该技术利用图结构和不确定性估计来通过仅保留最可靠的声明来提高 LLM 生成的真实性。与现有方法相比，我们基于图的不确定性指标在各种长格式生成设置中可使 AUPRC 的平均相对增益达到 6.8%，并且我们的端到端系统在事实性方面比现有解码技术持续提高 2-4%，同时显著提高生成的响应的信息量。
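The bipartite-graph view described above can be made concrete with a small standard-library sketch. It assumes we already know which sampled responses support which claims; the data, the function names, and the closeness normalization (reachable nodes divided by the sum of shortest-path distances) are illustrative assumptions, and the paper's exact definitions may differ.

```python
from collections import deque

def build_graph(support):
    """Build an undirected bipartite graph from {response id: set of supported claim ids}."""
    adj = {}
    for resp, claims in support.items():
        adj.setdefault(("resp", resp), set())
        for claim in claims:
            adj[("resp", resp)].add(("claim", claim))
            adj.setdefault(("claim", claim), set()).add(("resp", resp))
    return adj

def degree_centrality(adj, claim, n_responses):
    """Fraction of sampled responses that support the claim (self-consistency style)."""
    return len(adj[("claim", claim)]) / n_responses

def closeness_centrality(adj, node):
    """Reachable nodes divided by the sum of BFS shortest-path distances to them."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical samples: claim c1 appears in all three responses, c2 in only one.
support = {"r1": {"c1", "c2"}, "r2": {"c1"}, "r3": {"c1"}}
graph = build_graph(support)
deg_c1 = degree_centrality(graph, "c1", len(support))
clo_c1 = closeness_centrality(graph, ("claim", "c1"))
```

In this toy graph both measures rank `c1` above `c2`; closeness additionally takes into account how a claim's supporting responses connect to the rest of the graph, which is the kind of extra signal the abstract attributes to the more sophisticated centrality measures.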

Title: SCULPT: Systematic Tuning of Long Prompts

Authors: Shanu Kumar, Akhila Yesantarao Venkata, Shubhanshu Khandelwal, Bishal Santra, Parag Agrawal, Manish Gupta
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20788
Pdf URL: https://arxiv.org/pdf/2410.20788
Copy Paste: [[2410.20788]] SCULPT: Systematic Tuning of Long Prompts(https://arxiv.org/abs/2410.20788)
Keywords: language model, prompt
Abstract: As large language models become increasingly central to solving complex tasks, the challenge of optimizing long, unstructured prompts has become critical. Existing optimization techniques often struggle to effectively handle such prompts, leading to suboptimal performance. We introduce SCULPT (Systematic Tuning of Long Prompts), a novel framework that systematically refines long prompts by structuring them hierarchically and applying an iterative actor-critic mechanism. To enhance robustness and generalizability, SCULPT utilizes two complementary feedback mechanisms: Preliminary Assessment, which assesses the prompt's structure before execution, and Error Assessment, which diagnoses and addresses errors post-execution. By aggregating feedback from these mechanisms, SCULPT avoids overfitting and ensures consistent improvements in performance. Our experimental results demonstrate significant accuracy gains and enhanced robustness, particularly in handling erroneous and misaligned prompts. SCULPT consistently outperforms existing approaches, establishing itself as a scalable solution for optimizing long prompts across diverse and real-world tasks.
摘要：随着大型语言模型在解决复杂任务中变得越来越重要，优化长而非结构化的提示的挑战变得至关重要。现有的优化技术通常难以有效处理此类提示，导致性能不佳。我们引入了 SCULPT（长提示的系统调整），这是一个新颖的框架，它通过分层构建长提示并应用迭代参与者-评论家机制来系统地改进长提示。为了增强稳健性和通用性，SCULPT 采用了两种互补的反馈机制：初步评估（在执行前评估提示的结构）和错误评估（在执行后诊断和解决错误）。通过汇总来自这些机制的反馈，SCULPT 避免了过度拟合并确保性能持续改进。我们的实验结果表明，准确度显著提高，稳健性增强，特别是在处理错误和错位的提示时。SCULPT 始终优于现有方法，成为在各种现实世界任务中优化长提示的可扩展解决方案。

Title: Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training

Authors: Michael Pieler, Marco Bellagente, Hannah Teufel, Duy Phung, Nathan Cooper, Jonathan Tow, Paulo Rocha, Reshinth Adithyan, Zaid Alyafeai, Nikhil Pinnaparaju, Maksym Zhuravinskyi, Carlos Riquelme
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20796
Pdf URL: https://arxiv.org/pdf/2410.20796
Copy Paste: [[2410.20796]] Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training(https://arxiv.org/abs/2410.20796)
Keywords: language model, llm
Abstract: Recently published work on rephrasing natural text data for pre-training LLMs has shown promising results when combining the original dataset with the synthetically rephrased data. We build upon previous work by replicating existing results on C4 and extending them with our optimized rephrasing pipeline to the English, German, Italian, and Spanish Oscar subsets of CulturaX. Our pipeline leads to increased performance on standard evaluation benchmarks in both the mono- and multilingual setup. In addition, we provide a detailed study of our pipeline, investigating the choice of the base dataset and LLM for the rephrasing, as well as the relationship between the model size and the performance after pre-training. By exploring data with different perceived quality levels, we show that gains decrease with higher quality. Furthermore, we find the difference in performance between model families to be bigger than between different model sizes. This highlights the necessity for detailed tests before choosing an LLM to rephrase large amounts of data. Moreover, we investigate the effect of pre-training with synthetic data on supervised fine-tuning. Here, we find increasing but inconclusive results that highly depend on the used benchmark. These results (again) highlight the need for better benchmarking setups. In summary, we show that rephrasing multilingual and low-quality data is a very promising direction to extend LLM pre-training data.
摘要：最近发表的关于为预训练 LLM 重新表述自然文本数据的研究显示，将原始数据集与合成重新表述数据相结合时，效果良好。我们在之前工作的基础上，复制了 C4 上的现有结果，并使用我们优化的重新表述管道将其扩展到 CulturaX 的英语、德语、意大利语和西班牙语 Oscar 子集。我们的管道在单语言和多语言设置中提高了标准评估基准的性能。此外，我们还对我们的管道进行了详细研究，调查了重新表述的基础数据集和 LLM 的选择，以及模型大小与预训练后性能之间的关系。通过探索具有不同感知质量水平的数据，我们发现收益随着质量的提高而减少。此外，我们发现模型系列之间的性能差异大于不同模型大小之间的性能差异。这凸显了在选择 LLM 重新表述大量数据之前进行详细测试的必要性。此外，我们研究了使用合成数据进行预训练对监督微调的影响。在这里，我们发现结果越来越多，但结论并不明确，这些结果高度依赖于所使用的基准。这些结果（再次）凸显了对更好的基准测试设置的需求。总之，我们表明，重新表述多语言和低质量数据是扩展 LLM 预训练数据的一个非常有前途的方向。

Title: NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates

Authors: Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Min Zhang, Zhaopeng Tu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20814
Pdf URL: https://arxiv.org/pdf/2410.20814
Copy Paste: [[2410.20814]] NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates(https://arxiv.org/abs/2410.20814)
Keywords: language model, llm
Abstract: Despite their remarkable abilities in various tasks, large language models (LLMs) still struggle with real-time information (e.g., new facts and terms) due to the knowledge cutoff in their development process. However, existing benchmarks focus on outdated content and limited fields, facing difficulties in real-time updating and leaving new terms unexplored. To address this problem, we propose an adaptive benchmark, NewTerm, for real-time evaluation of new terms. We design a highly automated construction method to ensure high-quality benchmark construction with minimal human effort, allowing flexible updates for real-time information. Empirical results on various LLMs demonstrate over 20% performance reduction caused by new terms. Additionally, while updates to the knowledge cutoff of LLMs can cover some of the new terms, they are unable to generalize to more distant new terms. We also analyze which types of terms are more challenging and why LLMs struggle with new terms, paving the way for future research. Finally, we construct NewTerm 2022 and 2023 to evaluate the new terms updated each year and will continue updating annually. The benchmark and code can be found at this https URL.
摘要：尽管大型语言模型 (LLM) 在各种任务中都表现出色，但由于其开发过程中的知识截断，它们仍然难以处理实时信息（例如新事实和术语）。然而，现有的基准测试集中于过时的内容和有限的领域，难以实时更新，并且新术语未被探索。为了解决这个问题，我们提出了一个自适应基准测试 NewTerm，用于实时评估新术语。我们设计了一种高度自动化的构建方法，以确保以最少的人力构建高质量的基准测试，从而允许灵活地更新实时信息。各种 LLM 的经验结果表明，新术语导致的性能下降超过 20%。此外，虽然 LLM 知识截断的更新可以覆盖一些新术语，但它们无法推广到更远的新术语。我们还分析了哪些类型的术语更具挑战性以及 LLM 为何难以处理新术语，为未来的研究铺平了道路。最后，我们构建了 NewTerm 2022 和 2023 来评估每年更新的新术语，并将继续每年更新。基准和代码可在此 https URL 找到。

Title: LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation

Authors: Yen-Shan Chen, Jing Jin, Peng-Ting Kuo, Chao-Wei Huang, Yun-Nung Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20833
Pdf URL: https://arxiv.org/pdf/2410.20833
Copy Paste: [[2410.20833]] LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation(https://arxiv.org/abs/2410.20833)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Recent studies have demonstrated that large language models (LLMs) exhibit significant biases in evaluation tasks, particularly in preferentially rating and favoring self-generated content. However, the extent to which this bias manifests in fact-oriented tasks, especially within retrieval-augmented generation (RAG) frameworks, where keyword extraction and factual accuracy take precedence over stylistic elements, remains unclear. Our study addresses this knowledge gap by simulating two critical phases of the RAG framework. In the first phase, we assess the suitability of human-authored versus model-generated passages, emulating the pointwise reranking process. The second phase involves conducting pairwise reading comprehension tests to simulate the generation process. Contrary to previous findings indicating a self-preference in rating tasks, our results reveal no significant self-preference effect in RAG frameworks. Instead, we observe that factual accuracy significantly influences LLMs' output, even in the absence of prior knowledge. Our research contributes to the ongoing discourse on LLM biases and their implications for RAG-based systems, offering insights that may inform the development of more robust and unbiased LLM systems.
摘要：最近的研究表明，大型语言模型 (LLM) 在评估任务中表现出明显的偏见，尤其是在优先评级和偏爱自生成内容方面。然而，这种偏见在面向事实的任务中表现的程度，尤其是在检索增强生成 (RAG) 框架中（其中关键字提取和事实准确性优先于风格元素）仍不清楚。我们的研究通过模拟 RAG 框架的两个关键阶段来解决这一知识空白。在第一阶段，我们模拟逐点重新排序过程，以评估人类编写的段落与模型生成的段落的适用性。第二阶段涉及进行成对阅读理解测试以模拟生成过程。与先前表明评级任务中存在自我偏好的研究结果相反，我们的结果显示在 RAG 框架中没有显着的自我偏好效应。相反，我们观察到事实准确性会显着影响 LLM 的输出，即使在没有先验知识的情况下也是如此。我们的研究促进了有关 LLM 偏见及其对基于 RAG 的系统的影响的持续讨论，并提供了可能有助于开发更为强大和公正的 LLM 系统的见解。

Title: A Simple Yet Effective Corpus Construction Framework for Indonesian Grammatical Error Correction

Authors: Nankai Lin, Meiyu Zeng, Wentao Huang, Shengyi Jiang, Lixian Xiao, Aimin Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20838
Pdf URL: https://arxiv.org/pdf/2410.20838
Copy Paste: [[2410.20838]] A Simple Yet Effective Corpus Construction Framework for Indonesian Grammatical Error Correction(https://arxiv.org/abs/2410.20838)
Keywords: language model, gpt, llm
Abstract: Currently, the majority of research in grammatical error correction (GEC) is concentrated on universal languages, such as English and Chinese. Many low-resource languages lack accessible evaluation corpora. How to efficiently construct high-quality evaluation corpora for GEC in low-resource languages has become a significant challenge. To fill these gaps, in this paper, we present a framework for constructing GEC corpora. Specifically, we focus on Indonesian as our research language and construct an evaluation corpus for Indonesian GEC using the proposed framework, addressing the limitations of existing evaluation corpora in Indonesian. Furthermore, we investigate the feasibility of utilizing existing large language models (LLMs), such as GPT-3.5-Turbo and GPT-4, to streamline corpus annotation efforts in GEC tasks. The results demonstrate significant potential for enhancing the performance of LLMs in low-resource language settings. Our code and corpus can be obtained from this https URL.
摘要：目前，语法纠错 (GEC) 的大部分研究集中在通用语言上，例如英语和中文。许多资源匮乏的语言缺乏可用的评估语料库。如何高效地为资源匮乏的语言中的 GEC 构建高质量的评估语料库已成为一项重大挑战。为了填补这些空白，本文提出了一个构建 GEC 语料库的框架。具体来说，我们专注于印尼语作为我们的研究语言，并使用所提出的框架为印尼语 GEC 构建评估语料库，以解决印尼语现有评估语料库的局限性。此外，我们研究了利用现有的大型语言模型 (LLM)（例如 GPT-3.5-Turbo 和 GPT-4）来简化 GEC 任务中的语料注释工作的可行性。结果表明，在资源匮乏的语言环境中，LLM 具有显著的提升性能的潜力。我们的代码和语料库可以从此 https URL 获取。

Title: Reward Modeling with Weak Supervision for Language Models

Authors: Ben Hauptvogel, Malte Ostendorff, Georg Rehm, Sebastian Möller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20869
Pdf URL: https://arxiv.org/pdf/2410.20869
Copy Paste: [[2410.20869]] Reward Modeling with Weak Supervision for Language Models(https://arxiv.org/abs/2410.20869)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have led to their increased application across various tasks, with reinforcement learning from human feedback (RLHF) being a crucial part of their training to align responses with user intentions. In the RLHF process, a reward model is trained using response preferences determined by human labelers or AI systems, which then refines the LLM through reinforcement learning. This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance. Weak supervision employs noisy or imprecise data labeling, reducing reliance on expensive manually labeled data. By analyzing RLHF datasets to identify heuristics that correlate with response preference, we wrote simple labeling functions and then calibrated a label model to weakly annotate unlabeled data. Our evaluation shows that while weak supervision significantly benefits smaller datasets by improving reward model performance, its effectiveness decreases with larger, originally labeled datasets. Additionally, using an LLM to generate and then weakly label responses offers a promising method for extending preference data.
摘要：大型语言模型 (LLM) 的最新进展使其在各种任务中的应用日益广泛，而强化学习人类反馈 (RLHF) 是其训练的重要组成部分，旨在使响应与用户意图保持一致。在 RLHF 过程中，奖励模型使用由人类标记者或 AI 系统确定的响应偏好进行训练，然后通过强化学习完善 LLM。这项工作引入了弱监督作为扩展 RLHF 数据集和增强奖励模型性能的策略。弱监督采用嘈杂或不精确的数据标记，减少对昂贵的手动标记数据的依赖。通过分析 RLHF 数据集以识别与响应偏好相关的启发式方法，我们编写了简单的标记函数，然后校准了标记模型以弱注释未标记数据。我们的评估表明，虽然弱监督通过提高奖励模型性能显著有利于较小的数据集，但其有效性会随着较大的原始标记数据集而降低。此外，使用 LLM 生成然后弱标记响应为扩展偏好数据提供了一种有前途的方法。
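
The weak-labeling step described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: both labeling functions (`lf_longer`, `lf_no_refusal`) are hypothetical heuristics invented for illustration, and a plain majority vote stands in for the calibrated label model the abstract mentions.

```python
# Votes: +1 = response A preferred, -1 = response B preferred, 0 = abstain.

def lf_longer(a: str, b: str) -> int:
    """Hypothetical heuristic: prefer the clearly longer response."""
    if abs(len(a) - len(b)) < 20:
        return 0  # lengths too close to call, so abstain
    return 1 if len(a) > len(b) else -1


def lf_no_refusal(a: str, b: str) -> int:
    """Hypothetical heuristic: prefer the response that does not open with a refusal."""
    a_refuses = a.lower().startswith("i cannot")
    b_refuses = b.lower().startswith("i cannot")
    if a_refuses == b_refuses:
        return 0
    return -1 if a_refuses else 1


LABELING_FUNCTIONS = [lf_longer, lf_no_refusal]


def weak_label(a: str, b: str) -> int:
    """Majority vote over the labeling functions (an assumed stand-in
    for a calibrated label model); returns 0 on a tie or if all abstain."""
    total = sum(lf(a, b) for lf in LABELING_FUNCTIONS)
    return (total > 0) - (total < 0)


if __name__ == "__main__":
    refusal = "I cannot help with that."
    answer = "Sure, here is a detailed step-by-step explanation of the approach."
    print(weak_label(refusal, answer))
```

Pairs that receive a non-zero weak label could then be appended to the preference dataset used to train the reward model; pairs labeled 0 would simply be left out.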

Title: AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline

Authors: Dongkyu Kim, Byoungwook Kim, Donggeon Han, Matouš Eibich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20878
Pdf URL: https://arxiv.org/pdf/2410.20878
Copy Paste: [[2410.20878]] AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline(https://arxiv.org/abs/2410.20878)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Using LLMs (Large Language Models) in conjunction with external documents has made RAG (Retrieval-Augmented Generation) an essential technology. Numerous techniques and modules for RAG are being researched, but their performance can vary across different datasets. Finding RAG modules that perform well on specific datasets is challenging. In this paper, we propose the AutoRAG framework, which automatically identifies suitable RAG modules for a given dataset. AutoRAG explores and approximates the optimal combination of RAG modules for the dataset. Additionally, we share the results of optimizing a dataset using AutoRAG. All experimental results and data are publicly available and can be accessed through our GitHub repository at this https URL.
摘要：将 LLM（大型语言模型）与外部文档结合使用使得 RAG（检索增强生成）成为一项必不可少的技术。人们正在研究 RAG 的众多技术和模块，但它们的性能在不同的数据集上可能有所不同。找到在特定数据集上表现良好的 RAG 模块具有挑战性。在本文中，我们提出了 AutoRAG 框架，该框架可自动识别给定数据集的合适 RAG 模块。AutoRAG 探索并近似数据集的 RAG 模块的最佳组合。此外，我们分享了使用 AutoRAG 优化数据集的结果。所有实验结果和数据都是公开的，可以通过我们的 GitHub 存储库 https URL 访问。

Title: NeuGPT: Unified multi-modal Neural GPT

Authors: Yiqian Yang, Yiqun Duan, Hyejeong Jo, Qiang Zhang, Renjing Xu, Oiwi Parker Jones, Xuming Hu, Chin-teng Lin, Hui Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20916
Pdf URL: https://arxiv.org/pdf/2410.20916
Copy Paste: [[2410.20916]] NeuGPT: Unified multi-modal Neural GPT(https://arxiv.org/abs/2410.20916)
Keywords: gpt
Abstract: This paper introduces NeuGPT, a groundbreaking multi-modal language generation model designed to harmonize the fragmented landscape of neural recording research. Traditionally, studies in the field have been compartmentalized by signal type, with EEG, MEG, ECoG, SEEG, fMRI, and fNIRS data being analyzed in isolation. Recognizing the untapped potential for cross-pollination and the adaptability of neural signals across varying experimental conditions, we set out to develop a unified model capable of interfacing with multiple modalities. Drawing inspiration from the success of pre-trained large models in NLP, computer vision, and speech processing, NeuGPT is architected to process a diverse array of neural recordings and interact with speech and text data. Our model mainly focuses on brain-to-text decoding, improving SOTA from 6.94 to 12.92 on BLEU-1 and 6.93 to 13.06 on ROUGE-1F. It can also simulate brain signals, thereby serving as a novel neural interface. Code is available at NeuSpeech/NeuGPT (this https URL).
摘要：本文介绍了 NeuGPT，这是一种突破性的多模态语言生成模型，旨在协调神经记录研究的碎片化格局。传统上，该领域的研究按信号类型划分，EEG、MEG、ECoG、SEEG、fMRI 和 fNIRS 数据被单独分析。认识到尚未开发的交叉融合潜力以及神经信号在不同实验条件下的适应性，我们着手开发一种能够与多种模态交互的统一模型。从 NLP、计算机视觉和语音处理中预训练的大型模型的成功中汲取灵感，NeuGPT 的架构旨在处理各种神经记录并与语音和文本数据交互。我们的模型主要关注脑到文本解码，将 BLEU-1 上的 SOTA 从 6.94 提高到 12.92，将 ROUGE-1F 上的 SOTA 从 6.93 提高到 13.06。它还可以模拟脑信号，从而作为一种新型的神经接口。代码可在 NeuSpeech/NeuGPT（此 https URL）处获取。

Title: Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning

Authors: Aosong Feng, Rex Ying, Leandros Tassiulas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20926
Pdf URL: https://arxiv.org/pdf/2410.20926
Copy Paste: [[2410.20926]] Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning(https://arxiv.org/abs/2410.20926)
Keywords: llm
Abstract: As the demand for processing extended textual data grows, the ability to handle long-range dependencies and maintain computational efficiency is more critical than ever. One of the key issues for long-sequence modeling using attention-based models is the mismatch between the limited-range modeling power of full attention and the long-range token dependency in the input sequence. In this work, we propose to scale up the attention receptive field by tensorizing long input sequences into compact tensor representations followed by attention on each transformed dimension. The resulting Tensorized Attention can be adopted as efficient transformer backbones to extend input context length with improved memory and time efficiency. We show that the proposed attention tensorization encodes token dependencies as a multi-hop attention process, and is equivalent to Kronecker decomposition of full attention. Extensive experiments show that tensorized attention can be used to adapt pretrained LLMs with improved efficiency. Notably, Llama-8B with tensorization is trained under 32,768 context length and can steadily extrapolate to 128k length during inference with $11\times$ speedup, compared to full attention with FlashAttention-2.
摘要：随着对处理扩展文本数据的需求不断增长，处理长距离依赖关系并保持计算效率的能力比以往任何时候都更加重要。使用基于注意机制的模型进行长序列建模的关键问题之一是全注意的有限范围建模能力与输入序列中的长距离标记依赖关系之间的不匹配。在这项工作中，我们提出通过将长输入序列张量化为紧凑的张量表示，然后在每个变换维度上进行注意来扩大注意感受野。由此产生的张量化注意力可以作为高效的 Transformer 主干，以扩展输入上下文长度，同时提高内存和时间效率。我们表明，所提出的注意力张量化将标记依赖关系编码为多跳注意力过程，并且等同于全注意的 Kronecker 分解。大量实验表明，张量化注意力可用于提高预训练 LLM 的效率。值得注意的是，与使用 FlashAttention-2 的全注意力机制相比，具有张量化的 Llama-8B 在 32,768 上下文长度下进行训练，并且在推理过程中可以稳定地外推到 128k 长度，加速比为 $11\times$。
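
The Kronecker claim in this abstract rests on a standard linear-algebra identity, which the sketch below checks numerically. It is not the paper's construction: the two small row-stochastic matrices `A1` and `A2` are made-up stand-ins for per-dimension attention weights, and each token is reduced to a single scalar. The sequence of length 6 is folded into a 2 x 3 grid; mixing along each grid dimension in turn gives the same result as one 6 x 6 mixing matrix equal to `kron(A1, A2)`.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def kron(a, b):
    """Kronecker product: rows indexed by (i, k), columns by (j, l)."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


# Assumed toy weights (each row sums to 1, like softmax attention rows).
A1 = [[0.75, 0.25], [0.5, 0.5]]                            # mixes the 2 "rows"
A2 = [[0.5, 0.25, 0.25], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]  # mixes the 3 "columns"
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                     # 6 scalar tokens as a 2 x 3 grid


def check_equivalence() -> bool:
    # Two small steps: mix along dimension 1, then along dimension 2.
    two_step = matmul(matmul(A1, X), transpose(A2))
    flat_two_step = [v for row in two_step for v in row]
    # One large step: a single 6 x 6 matrix applied to the flattened sequence.
    full = kron(A1, A2)
    x_flat = [[v] for row in X for v in row]
    one_step = [row[0] for row in matmul(full, x_flat)]
    return all(math.isclose(p, q) for p, q in zip(flat_two_step, one_step))


if __name__ == "__main__":
    print(check_equivalence())
```

The two small matrices hold 4 + 9 entries while the equivalent full matrix holds 36, which is the kind of saving that grows quickly with sequence length.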

Title: Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency

Authors: Zenan Li, Yifan Wu, Zhaoyu Li, Xinming Wei, Xian Zhang, Fan Yang, Xiaoxing Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20936
Pdf URL: https://arxiv.org/pdf/2410.20936
Copy Paste: [[2410.20936]] Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency(https://arxiv.org/abs/2410.20936)
Keywords: language model, llm
Abstract: Autoformalization, the task of automatically translating natural language descriptions into a formal language, poses a significant challenge across various domains, especially in mathematics. Recent advancements in large language models (LLMs) have unveiled their promising capabilities to formalize even competition-level math problems. However, we observe a considerable discrepancy between pass@1 and pass@k accuracies in LLM-generated formalizations. To address this gap, we introduce a novel framework that scores and selects the best result from k autoformalization candidates based on two complementary self-consistency methods: symbolic equivalence and semantic consistency. Specifically, symbolic equivalence identifies the logical homogeneity among autoformalization candidates using automated theorem provers, and semantic consistency evaluates the preservation of the original meaning by informalizing the candidates and computing the similarity between the embeddings of the original and informalized texts. Our extensive experiments on the MATH and miniF2F datasets demonstrate that our approach significantly enhances autoformalization accuracy, achieving up to 0.22-1.35x relative improvements across various LLMs and baseline methods.
摘要：自动形式化是将自然语言描述自动翻译成形式语言的任务，它在各个领域都面临着巨大的挑战，尤其是在数学领域。大型语言模型 (LLM) 的最新进展揭示了它们形式化甚至竞赛级数学问题的强大能力。然而，我们观察到 LLM 生成的形式化中 pass@1 和 pass@k 的准确率之间存在相当大的差异。为了解决这一差距，我们引入了一个新框架，该框架基于两种互补的自洽方法对 k 个自动形式化候选进行评分并选择最佳结果：符号等价和语义一致性。详细地说，符号等价使用自动定理证明器来识别自动形式化候选之间的逻辑同质性，语义一致性通过对候选进行形式化并计算原始文本和非形式化文本的嵌入之间的相似性来评估原始含义的保留情况。我们对 MATH 和 miniF2F 数据集进行的大量实验表明，我们的方法显著提高了自动形式化的准确性，在各种 LLM 和基线方法中实现了高达 0.22-1.35 倍的相对改进。
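
The semantic-consistency half of the selection step can be sketched as follows. This is only an illustration under stated assumptions: the embedding vectors are hand-written toy values rather than outputs of a real embedding model, cosine similarity is assumed as the similarity measure, and the symbolic-equivalence check with a theorem prover is omitted entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_best(original_emb, informalized_embs):
    """Score each candidate by how close its informalized text stays to the
    original problem statement, then return (best_index, all_scores)."""
    scores = [cosine(original_emb, emb) for emb in informalized_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Toy embeddings (assumed values): the original statement and k = 3
    # candidates after each formalization has been translated back to text.
    original = [1.0, 0.0, 1.0]
    candidates = [[0.9, 0.1, 1.1], [0.0, 1.0, 0.0], [1.0, 0.5, 0.2]]
    best, scores = select_best(original, candidates)
    print(best, [round(s, 3) for s in scores])
```

In a fuller pipeline these scores would be combined with the symbolic-equivalence signal before picking the final candidate; how the two are weighted is not specified in the abstract.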

Title: Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models

Authors: Piotr Przybyła
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.20940
Pdf URL: https://arxiv.org/pdf/2410.20940
Copy Paste: [[2410.20940]] Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models(https://arxiv.org/abs/2410.20940)
Keywords: language model, prompt
Abstract: We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, e.g. text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through a beam search procedure until the victim classifier changes its decision. The evaluation confirms the superiority of our approach in the constrained scenario, especially in the case of long input text (news articles), where exhaustive search is not feasible.
摘要：我们研究了生成对抗性示例的挑战，以测试检测低可信度内容（包括宣传、虚假声明、谣言和极端党派新闻）的文本分类算法的稳健性。我们专注于通过设置攻击者可以尝试的查询数量的现实限制来模拟内容审核。在我们的解决方案（TREPAT）中，初始改写由大型语言模型生成，其提示受到保留意义的 NLP 任务的启发，例如文本简化和样式转换。随后，这些修改被分解为小的变化，通过集束搜索程序应用，直到受害者分类器改变其决策。评估证实了我们的方法在受限场景中的优越性，特别是在长输入文本（新闻文章）的情况下，穷举搜索是不可行的。

Title: Instruction-Tuned LLMs Succeed in Document-Level MT Without Fine-Tuning -- But BLEU Turns a Blind Eye

Authors: Yirong Sun, Dawei Zhu, Yanjun Chen, Erjia Xiao, Xinghao Chen, Xiaoyu Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.20941
Pdf URL: https://arxiv.org/pdf/2410.20941
Copy Paste: [[2410.20941]] Instruction-Tuned LLMs Succeed in Document-Level MT Without Fine-Tuning -- But BLEU Turns a Blind Eye(https://arxiv.org/abs/2410.20941)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT), yet most studies focus on sentence-level translation. This work investigates the inherent capability of instruction-tuned LLMs for document-level translation (docMT). Unlike prior approaches that require specialized techniques, we evaluate LLMs by directly prompting them to translate entire documents in a single pass. Our results show that this method improves translation quality compared to translating sentences separately, even without document-level fine-tuning. However, this advantage is not reflected in BLEU scores, which often favor sentence-based translations. We propose using the LLM-as-a-judge paradigm for evaluation, where GPT-4 is used to assess document coherence, accuracy, and fluency in a more nuanced way than n-gram-based metrics. Overall, our work demonstrates that instruction-tuned LLMs can effectively leverage document context for translation. However, we caution against using BLEU scores for evaluating docMT, as they often provide misleading outcomes, failing to capture the quality of document-level translation. Code and data are available at this https URL.
摘要：大型语言模型 (LLM) 在各种 NLP 任务中表现出色，包括机器翻译 (MT)，但大多数研究都集中在句子级翻译上。这项工作研究了指令调整的 LLM 在文档级翻译 (docMT) 方面的固有能力。与需要专门技术的先前方法不同，我们通过直接提示它们一次性翻译整个文档来评估 LLM。我们的结果表明，与单独翻译句子相比，这种方法提高了翻译质量，即使没有文档级微调也是如此。然而，这种优势并没有反映在 BLEU 分数中，BLEU 分数通常更有利于基于句子的翻译。我们建议使用 LLM-as-a-judge 范式进行评估，其中 GPT-4 用于以比基于 n-gram 的指标更细致的方式评估文档的连贯性、准确性和流畅性。总的来说，我们的工作表明指令调整的 LLM 可以有效地利用文档上下文进行翻译。然而，我们警告不要使用 BLEU 分数来评估 docMT，因为它们通常会提供误导性的结果，无法捕捉文档级翻译的质量。代码和数据可在此 https URL 上获取

Title: DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning

Authors: Xun Guo, Shan Zhang, Yongxin He, Ting Zhang, Wanquan Feng, Haibin Huang, Chongyang Ma
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.20964
Pdf URL: https://arxiv.org/pdf/2410.20964
Copy Paste: [[2410.20964]] DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning(https://arxiv.org/abs/2410.20964)
Keywords: language model, llm
Abstract: Current techniques for detecting AI-generated text are largely confined to manual feature crafting and supervised binary classification paradigms. These methodologies typically lead to performance bottlenecks and unsatisfactory generalizability. Consequently, these methods are often inapplicable for out-of-distribution (OOD) data and newly emerged large language models (LLMs). In this paper, we revisit the task of AI-generated text detection. We argue that the key to accomplishing this task lies in distinguishing writing styles of different authors, rather than simply classifying the text into human-written or AI-generated text. To this end, we propose DeTeCtive, a multi-task auxiliary, multi-level contrastive learning framework. DeTeCtive is designed to facilitate the learning of distinct writing styles, combined with a dense information retrieval pipeline for AI-generated text detection. Our method is compatible with a range of text encoders. Extensive experiments demonstrate that our method enhances the ability of various text encoders in detecting AI-generated text across multiple benchmarks and achieves state-of-the-art results. Notably, in OOD zero-shot evaluation, our method outperforms existing approaches by a large margin. Moreover, we find our method boasts a Training-Free Incremental Adaptation (TFIA) capability towards OOD data, further enhancing its efficacy in OOD detection scenarios. We will open-source our code and models in hopes that our work will spark new thoughts in the field of AI-generated text detection, ensuring safe application of LLMs and enhancing compliance. Our code is available at this https URL.
摘要：目前用于检测 AI 生成文本的技术主要局限于手动特征构建和监督二分类范式。这些方法通常会导致性能瓶颈和不令人满意的通用性。因此，这些方法通常不适用于分布外 (OOD) 数据和新出现的大型语言模型 (LLM)。在本文中，我们重新审视了 AI 生成文本检测的任务。我们认为完成这项任务的关键在于区分不同作者的写作风格，而不是简单地将文本分类为人类书写的文本或 AI 生成的文本。为此，我们提出了 DeTeCtive，这是一个多任务辅助、多层次对比学习框架。DeTeCtive 旨在促进不同写作风格的学习，并结合用于 AI 生成文本检测的密集信息检索管道。我们的方法与多种文本编码器兼容。大量实验表明，我们的方法增强了各种文本编码器在多个基准测试中检测 AI 生成文本的能力，并取得了最先进的结果。值得注意的是，在 OOD 零样本评估中，我们的方法远远优于现有方法。此外，我们发现我们的方法对 OOD 数据具有无需训练的增量自适应 (TFIA) 能力，进一步增强了其在 OOD 检测场景中的有效性。我们将开源我们的代码和模型，希望我们的工作能够在 AI 生成文本检测领域激发新思路，确保 LLM 的安全应用并提高合规性。我们的代码可在此 https URL 上找到。

Title: Is GPT-4 Less Politically Biased than GPT-3.5? A Renewed Investigation of ChatGPT's Political Biases

Authors: Erik Weber, Jérôme Rutinowski, Niklas Jost, Markus Pauly
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21008
Pdf URL: https://arxiv.org/pdf/2410.21008
Copy Paste: [[2410.21008]] Is GPT-4 Less Politically Biased than GPT-3.5? A Renewed Investigation of ChatGPT's Political Biases(https://arxiv.org/abs/2410.21008)
Keywords: gpt, chat
Abstract: This work investigates the political biases and personality traits of ChatGPT, specifically comparing GPT-3.5 to GPT-4. In addition, the ability of the models to emulate political viewpoints (e.g., liberal or conservative positions) is analyzed. The Political Compass Test and the Big Five Personality Test were employed 100 times for each scenario, providing statistically significant results and an insight into the correlations among the results. The responses were analyzed by computing averages, standard deviations, and performing significance tests to investigate differences between GPT-3.5 and GPT-4. Correlations were found for traits that have been shown to be interdependent in human studies. Both models showed a progressive and libertarian political bias, with GPT-4's biases being slightly, but negligibly, less pronounced. Specifically, on the Political Compass, GPT-3.5 scored -6.59 on the economic axis and -6.07 on the social axis, whereas GPT-4 scored -5.40 and -4.73. In contrast to GPT-3.5, GPT-4 showed a remarkable capacity to emulate assigned political viewpoints, accurately reflecting the assigned quadrant (libertarian-left, libertarian-right, authoritarian-left, authoritarian-right) in all four tested instances. On the Big Five Personality Test, GPT-3.5 showed highly pronounced Openness and Agreeableness traits (O: 85.9%, A: 84.6%). Such pronounced traits correlate with libertarian views in human studies. While GPT-4 overall exhibited less pronounced Big Five personality traits, it did show a notably higher Neuroticism score. Assigned political orientations influenced Openness, Agreeableness, and Conscientiousness, again reflecting interdependencies observed in human studies. Finally, we observed that test sequencing affected ChatGPT's responses and the observed correlations, indicating a form of contextual memory.
摘要：这项研究调查了 ChatGPT 的政治偏见和性格特征，特别是将 GPT-3.5 与 GPT-4 进行了比较。此外，还分析了模型模拟政治观点（例如自由派或保守派立场）的能力。政治指南针测试和大五人格测试在每种情况下都使用了 100 次，提供了具有统计意义的结果并深入了解了结果相关性。通过计算平均值、标准差和执行显着性检验来分析响应，以调查 GPT-3.5 和 GPT-4 之间的差异。在人类研究中被证明相互依赖的特征中发现了相关性。两种模型都表现出进步和自由主义的政治偏见，而 GPT-4 的偏见略微不那么明显，但可以忽略不计。具体来说，在政治指南针上，GPT-3.5 在经济轴上得分为 -6.59，在社会轴上得分为 -6.07，而 GPT-4 得分为 -5.40 和 -4.73。与 GPT-3.5 相比，GPT-4 表现出了模仿指定政治观点的出色能力，在所有四个测试实例中都准确反映了指定的象限（自由主义左派、自由主义右派、威权主义左派、威权主义右派）。在“大五人格测试”中，GPT-3.5 表现出了高度明显的开放性和宜人性特征（O：85.9%，A：84.6%）。这些明显的特征与人类研究中的自由主义观点相关。虽然 GPT-4 总体上表现出不太明显的“大五人格”特征，但它确实显示出明显更高的神经质得分。指定的政治倾向影响了开放性、宜人性和尽责性，再次反映了人类研究中观察到的相互依赖性。最后，我们观察到测试顺序影响了 ChatGPT 的反应和观察到的相关性，表明存在一种情境记忆。

Title: FACT: Examining the Effectiveness of Iterative Context Rewriting for Multi-fact Retrieval

Authors: Jinlin Wang, Suyuchen Wang, Ziwen Xia, Sirui Hong, Yun Zhu, Bang Liu, Chenglin Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21012
Pdf URL: https://arxiv.org/pdf/2410.21012
Copy Paste: [[2410.21012]] FACT: Examining the Effectiveness of Iterative Context Rewriting for Multi-fact Retrieval(https://arxiv.org/abs/2410.21012)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are proficient at retrieving single facts from extended contexts, yet they struggle with tasks requiring the simultaneous retrieval of multiple facts, especially during generation. This paper identifies a novel "lost-in-the-middle" phenomenon, where LLMs progressively lose track of critical information throughout the generation process, resulting in incomplete or inaccurate retrieval. To address this challenge, we introduce Find All Crucial Texts (FACT), an iterative retrieval method that refines context through successive rounds of rewriting. This approach enables models to capture essential facts incrementally, which are often overlooked in single-pass retrieval. Experiments demonstrate that FACT substantially enhances multi-fact retrieval performance across various tasks, though improvements are less notable in general-purpose QA scenarios. Our findings shed light on the limitations of LLMs in multi-fact retrieval and underscore the need for more resilient long-context retrieval strategies.
摘要：大型语言模型 (LLM) 擅长从扩展上下文中检索单个事实，但它们在处理需要同时检索多个事实的任务时却举步维艰，尤其是在生成过程中。本文确定了一种新的“中间丢失”现象，即 LLM 在整个生成过程中逐渐丢失关键信息，导致检索不完整或不准确。为了应对这一挑战，我们引入了“查找所有关键文本” (FACT)，这是一种迭代检索方法，通过连续几轮重写来细化上下文。这种方法使模型能够逐步捕获基本事实，而这些事实在单次检索中经常被忽略。实验表明，FACT 大大提高了各种任务中的多事实检索性能，尽管在通用 QA 场景中的改进不那么明显。我们的研究结果揭示了 LLM 在多事实检索中的局限性，并强调需要更具弹性的长上下文检索策略。

Title: CRAT: A Multi-Agent Framework for Causality-Enhanced Reflective and Retrieval-Augmented Translation with Large Language Models

Authors: Meiqi Chen, Fandong Meng, Yingxue Zhang, Yan Zhang, Jie Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21067
Pdf URL: https://arxiv.org/pdf/2410.21067
Copy Paste: [[2410.21067]] CRAT: A Multi-Agent Framework for Causality-Enhanced Reflective and Retrieval-Augmented Translation with Large Language Models(https://arxiv.org/abs/2410.21067)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Large language models (LLMs) have shown great promise in machine translation, but they still struggle with contextually dependent terms, such as new or domain-specific words. This leads to inconsistencies and errors that are difficult to address. Existing solutions often depend on manual identification of such terms, which is impractical given the complexity and evolving nature of language. While Retrieval-Augmented Generation (RAG) could provide some assistance, its application to translation is limited by issues such as hallucinations from information overload. In this paper, we propose CRAT, a novel multi-agent translation framework that leverages RAG and causality-enhanced self-reflection to address these challenges. This framework consists of several specialized agents: the Unknown Terms Identification agent detects unknown terms within the context, the Knowledge Graph (KG) Constructor agent extracts relevant internal knowledge about these terms and retrieves bilingual information from external sources, the Causality-enhanced Judge agent validates the accuracy of the information, and the Translator agent incorporates the refined information into the final output. This automated process allows for more precise and consistent handling of key terms during translation. Our results show that CRAT significantly improves translation accuracy, particularly in handling context-sensitive terms and emerging vocabulary.
摘要：大型语言模型 (LLM) 在机器翻译中显示出巨大的潜力，但它们仍然难以处理上下文相关的术语，例如新词或领域特定词。这会导致难以解决的不一致和错误。现有的解决方案通常依赖于手动识别这些术语，但考虑到语言的复杂性和不断发展的特性，这是不切实际的。虽然检索增强生成 (RAG) 可以提供一些帮助，但它在翻译中的应用受到信息过载导致幻觉等问题的限制。在本文中，我们提出了 CRAT，这是一种新颖的多智能体翻译框架，它利用 RAG 和因果关系增强的自我反思来应对这些挑战。该框架由几个专门的代理组成：未知术语识别代理检测上下文中的未知术语，知识图谱 (KG) 构造器代理提取有关这些术语的相关内部知识并从外部来源检索双语信息，因果关系增强判断代理验证信息的准确性，翻译代理将精炼的信息纳入最终输出。这种自动化流程可以在翻译过程中更精确、更一致地处理关键术语。我们的结果表明，CRAT 显著提高了翻译准确性，特别是在处理上下文相关术语和新兴词汇时。

Title: Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

Authors: Honglin Mu, Han He, Yuxin Zhou, Yunlong Feng, Yang Xu, Libo Qin, Xiaoming Shi, Zeming Liu, Xudong Han, Qi Shi, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21083
Pdf URL: https://arxiv.org/pdf/2410.21083
Copy Paste: [[2410.21083]] Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring(https://arxiv.org/abs/2410.21083)
Keywords: language model, gpt, llm, prompt
Abstract: Large language model (LLM) safety is a critical issue, with numerous studies employing red team testing to enhance model security. Among these, jailbreak methods explore potential vulnerabilities by crafting malicious prompts that induce model outputs contrary to safety alignments. Existing black-box jailbreak methods often rely on model feedback, repeatedly submitting queries with detectable malicious instructions during the attack search process. Although these approaches are effective, the attacks may be intercepted by content moderators during the search process. We propose an improved transfer attack method that guides malicious prompt construction by locally training a mirror model of the target black-box model through benign data distillation. This method offers enhanced stealth, as it does not involve submitting identifiable malicious instructions to the target model during the search phase. Our approach achieved a maximum attack success rate of 92%, or a balanced value of 80% with an average of 1.5 detectable jailbreak queries per sample against GPT-3.5 Turbo on a subset of AdvBench. These results underscore the need for more robust defense mechanisms.
摘要：大型语言模型 (LLM) 的安全性是一个关键问题，许多研究都采用红队测试来增强模型安全性。其中，越狱方法通过制作恶意提示来探索潜在的漏洞，这些提示会诱导与安全对齐相反的模型输出。现有的黑盒越狱方法通常依赖于模型反馈，在攻击搜索过程中反复提交带有可检测恶意指令的查询。虽然这些方法很有效，但内容版主可能会在搜索过程中拦截攻击。我们提出了一种改进的转移攻击方法，通过良性数据蒸馏在本地训练目标黑盒模型的镜像模型来指导恶意提示的构建。这种方法提供了增强的隐身性，因为它不涉及在搜索阶段向目标模型提交可识别的恶意指令。我们的方法在 AdvBench 子集上对 GPT-3.5 Turbo 实现了 92% 的最大攻击成功率，或 80% 的平衡值，平均每个样本有 1.5 个可检测到的越狱查询。这些结果强调了对更强大的防御机制的需求。

Title: Retrieval-Enhanced Mutation Mastery: Augmenting Zero-Shot Prediction of Protein Language Model

Authors: Yang Tan, Ruilin Wang, Banghao Wu, Liang Hong, Bingxin Zhou
Subjects: cs.CL, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2410.21127
Pdf URL: https://arxiv.org/pdf/2410.21127
Copy Paste: [[2410.21127]] Retrieval-Enhanced Mutation Mastery: Augmenting Zero-Shot Prediction of Protein Language Model(https://arxiv.org/abs/2410.21127)
Keywords: language model
Abstract: Enzyme engineering enables the modification of wild-type proteins to meet industrial and research demands by enhancing catalytic activity, stability, binding affinities, and other properties. The emergence of deep learning methods for protein modeling has demonstrated superior results at lower costs compared to traditional approaches such as directed evolution and rational design. In mutation effect prediction, the key to pre-training deep learning models lies in accurately interpreting the complex relationships among protein sequence, structure, and function. This study introduces a retrieval-enhanced protein language model for comprehensive analysis of native properties from sequence and local structural interactions, as well as evolutionary properties from retrieved homologous sequences. The state-of-the-art performance of the proposed ProtREM is validated on over 2 million mutants across 217 assays from an open benchmark (ProteinGym). We also conducted post-hoc analyses of the model's ability to improve the stability and binding affinity of a VHH antibody. Additionally, we designed 10 new mutants on a DNA polymerase and conducted wet-lab experiments to evaluate their enhanced activity at higher temperatures. Both in silico and experimental evaluations confirmed that our method provides reliable predictions of mutation effects, offering an auxiliary tool for biologists aiming to evolve existing enzymes. The implementation is publicly available at this https URL.
摘要：酶工程通过增强催化活性、稳定性、结合亲和力和其他特性，使野生型蛋白质的修饰能够满足工业和研究需求。与定向进化和合理设计等传统方法相比，用于蛋白质建模的深度学习方法的出现已经证明了其以更低的成本获得了更好的结果。在突变效应预测中，预训练深度学习模型的关键在于准确解释蛋白质序列、结构和功能之间的复杂关系。本研究引入了一种检索增强型蛋白质语言模型，用于全面分析序列和局部结构相互作用的天然特性以及检索到的同源序列的进化特性。所提出的 ProtREM 的最新性能已在来自开放基准 (ProteinGym) 的 217 个测定中的 200 多万个突变体上得到验证。我们还对该模型改善 VHH 抗体的稳定性和结合亲和力的能力进行了事后分析。此外，我们在 DNA 聚合酶上设计了 10 个新的突变体，并进行了湿实验室实验以评估它们在较高温度下的增强活性。计算机模拟和实验评估均证实，我们的方法能够可靠地预测突变效应，为旨在进化现有酶的生物学家提供了辅助工具。此实现可在此 https URL 上公开获取。

Title: Palisade -- Prompt Injection Detection Framework

Authors: Sahasra Kokkula, Somanathan R, Nandavardhan R, Aashishkumar, G Divya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21146
Pdf URL: https://arxiv.org/pdf/2410.21146
Copy Paste: [[2410.21146]] Palisade -- Prompt Injection Detection Framework(https://arxiv.org/abs/2410.21146)
Keywords: language model, llm, prompt
Abstract: The advent of Large Language Models (LLMs) marks a milestone in Artificial Intelligence, altering how machines comprehend and generate human language. However, LLMs are vulnerable to malicious prompt injection attacks, where crafted inputs manipulate the model's behavior in unintended ways, compromising system integrity and causing incorrect outcomes. Conventional detection methods rely on static, rule-based approaches, which often fail against sophisticated threats like abnormal token sequences and alias substitutions, leading to limited adaptability and higher rates of false positives and false negatives. This paper proposes a novel NLP-based approach for prompt injection detection, emphasizing accuracy and optimization through a layered input screening process. In this framework, prompts are filtered through three distinct layers (rule-based, ML classifier, and companion LLM) before reaching the target model, thereby minimizing the risk of malicious [...]. [...] show the ML classifier achieves the highest accuracy among individual layers, yet the multi-layer framework enhances overall detection accuracy by reducing false negatives. Although this increases false positives, it minimizes the risk of overlooking genuine injected prompts, thus prioritizing [...]. This multi-layered detection approach highlights LLM vulnerabilities and provides a comprehensive framework for future research, promoting secure interactions between humans and AI systems.
摘要：大型语言模型 (LLM) 的出现标志着人工智能的一个里程碑，改变了机器理解和生成人类语言的方式。然而，LLM 容易受到恶意提示注入攻击，即精心设计的输入以非预期的方式操纵模型行为，损害系统完整性并导致错误结果。传统的检测方法依赖于静态的、基于规则的方法，这些方法通常无法抵御异常标记序列和别名替换等复杂威胁，导致适应性有限，误报率和漏报率较高。本文提出了一种基于 NLP 的新型提示注入检测方法，强调通过分层输入筛选过程实现准确性和优化。在此框架中，提示在到达目标模型之前经过三个不同的层（基于规则、ML 分类器和配套 LLM）的过滤，从而最大限度地降低了恶意攻击的风险。ML 分类器在各个层中实现了最高的准确率，而多层框架通过减少漏报提高了整体检测准确率。虽然这会增加误报，但它最大限度地降低了忽视真正注入提示的风险，因此优先考虑 [...]。这种多层检测方法突出了 LLM 漏洞并为未来的研究提供了一个全面的框架，促进了人与人工智能系统之间的安全交互。
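
Note: the layered screening described in the Palisade abstract can be pictured with a minimal sketch. It assumes a prompt is blocked as soon as any one layer flags it; the abstract does not state the combination rule, but a logical OR is consistent with the reported trade-off (fewer false negatives, more false positives). The patterns, the keyword scorer, and the companion-LLM stub are hypothetical placeholders, not the authors' components.

```python
import re
from typing import Callable, List

# A layer takes a prompt and returns True when it looks malicious.
Layer = Callable[[str], bool]

# Hypothetical rule patterns, for illustration only.
RULES = [re.compile(p, re.IGNORECASE) for p in (
    r"ignore (all )?(previous|prior) instructions",
    r"reveal (your )?system prompt",
)]


def rule_layer(prompt: str) -> bool:
    """Layer 1: static pattern matching."""
    return any(rule.search(prompt) for rule in RULES)


def classifier_layer(prompt: str) -> bool:
    """Layer 2: stand-in for a trained ML classifier (toy keyword score)."""
    suspicious = ("jailbreak", "override", "bypass")
    score = sum(word in prompt.lower() for word in suspicious)
    return score >= 2


def companion_llm_layer(prompt: str) -> bool:
    """Layer 3: stand-in for a companion LLM; a real one would be a model call."""
    return "pretend you have no rules" in prompt.lower()


def screen(prompt: str, layers: List[Layer]) -> bool:
    """Return True (block) if any layer flags the prompt."""
    return any(layer(prompt) for layer in layers)


LAYERS = [rule_layer, classifier_layer, companion_llm_layer]
```

Under this OR rule an injected prompt slips through only if every layer misses it, while a benign prompt is blocked if any single layer misfires, which is the trade-off the abstract reports.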

Title: SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents

Authors: Qi Zhang, Zhijia Chen, Huitong Pan, Cornelia Caragea, Longin Jan Latecki, Eduard Dragut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21155
Pdf URL: https://arxiv.org/pdf/2410.21155
Copy Paste: [[2410.21155]] SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents(https://arxiv.org/abs/2410.21155)
Keywords: llm
Abstract: Scientific information extraction (SciIE) is critical for converting unstructured knowledge from scholarly articles into structured data (entities and relations). Several datasets have been proposed for training and validating SciIE models. However, due to the high complexity and cost of annotating scientific texts, those datasets restrict their annotations to specific parts of a paper, such as abstracts, resulting in the loss of diverse entity mentions and relations in context. In this paper, we release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations. To capture the intricate use and interactions among entities in full texts, our dataset contains a fine-grained tag set for relations. Additionally, we provide an out-of-distribution test set to offer a more realistic evaluation. We conduct comprehensive experiments, including state-of-the-art supervised models and our proposed LLM-based baselines, and highlight the challenges presented by our dataset, encouraging the development of innovative models to further the field of SciIE.
摘要：科学信息提取 (SciIE) 对于将学术文章中的非结构化知识转换为结构化数据（实体和关系）至关重要。已经提出了几个数据集来训练和验证 SciIE 模型。然而，由于注释科学文本的复杂性和成本高，这些数据集将其注释限制在论文的特定部分，例如摘要，导致上下文中各种实体提及和关系的丢失。在本文中，我们发布了一个新的实体和关系提取数据集，用于与科学文章中的数据集、方法和任务相关的实体。我们的数据集包含 106 篇手动注释的全文科学出版物，其中包含超过 24k 个实体和 12k 个关系。为了捕捉全文中实体之间的复杂使用和交互，我们的数据集包含一个用于关系的细粒度标签集。此外，我们还提供了一个分布外的测试集，以提供更真实的评估。我们进行了全面的实验，包括最先进的监督模型和我们提出的基于 LLM 的基线，并强调了我们的数据集所带来的挑战，鼓励开发创新模型以进一步推动 SciIE 领域的发展。

Title: M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Authors: Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, Ke Jin, Ge Zhang, Zekun Wang, Guoan Zhang, Bangyu Xiang, Wenbo Su, Bo Zheng
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2410.21157
Pdf URL: https://arxiv.org/pdf/2410.21157
Copy Paste: [[2410.21157]] M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation(https://arxiv.org/abs/2410.21157)
Keywords: language model, llm
Abstract: Repository-level code completion has drawn great attention in software engineering, and several benchmark datasets have been introduced. However, existing repository-level code completion benchmarks usually focus on a limited number of languages (<5), which cannot evaluate the general code intelligence abilities across different languages for existing code Large Language Models (LLMs). Besides, the existing benchmarks usually report overall average scores of different languages, where the fine-grained abilities in different completion scenarios are ignored. Therefore, to facilitate the research of code LLMs in multilingual scenarios, we propose a massively multilingual repository-level code completion benchmark covering 18 programming languages (called M2RC-EVAL), and two types of fine-grained annotations (i.e., bucket-level and semantic-level) on different completion scenarios are provided, where we obtain these annotations based on the parsed abstract syntax tree. Moreover, we also curate a massively multilingual instruction corpus, the M2RC-INSTRUCT dataset, to improve the repository-level code completion abilities of existing code LLMs. Comprehensive experimental results demonstrate the effectiveness of our M2RC-EVAL and M2RC-INSTRUCT.
摘要：存储库级代码补全在软件工程中引起了极大关注，并且已经引入了多个基准数据集。然而，现有的存储库级代码补全基准通常关注有限数量的语言（<5），无法评估现有代码大型语言模型（LLM）在不同语言之间的一般代码智能能力。此外，现有基准通常报告不同语言的总体平均分数，而忽略了不同补全场景中的细粒度能力。因此，为了促进多语言场景下的代码 LLM 的研究，我们提出了一个涵盖 18 种编程语言的大规模多语言存储库级代码补全基准（称为 M2RC-EVAL），并提供了两种针对不同补全场景的细粒度注释（即存储桶级和语义级），我们基于解析后的抽象语法树获得这些注释。此外，我们还策划了一个大规模多语言指令语料库 M2RC-INSTRUCT 数据集，以提高现有代码 LLM 的存储库级代码补全能力。综合实验结果证明了我们的 M2RC-EVAL 和 M2RC-INSTRUCT 的有效性。

Title: Belief in the Machine: Investigating Epistemological Blind Spots of Language Models

Authors: Mirac Suzgun, Tayfun Gur, Federico Bianchi, Daniel E. Ho, Thomas Icard, Dan Jurafsky, James Zou
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2410.21195
Pdf URL: https://arxiv.org/pdf/2410.21195
Copy Paste: [[2410.21195]] Belief in the Machine: Investigating Epistemological Blind Spots of Language Models(https://arxiv.org/abs/2410.21195)
Keywords: language model, gpt
Abstract: As language models (LMs) become integral to fields like healthcare, law, and journalism, their ability to differentiate between fact, belief, and knowledge is essential for reliable decision-making. Failure to grasp these distinctions can lead to significant consequences in areas such as medical diagnosis, legal judgments, and dissemination of fake news. Despite this, current literature has largely focused on more complex issues such as theory of mind, overlooking more fundamental epistemic challenges. This study systematically evaluates the epistemic reasoning capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE, consisting of 13,000 questions across 13 tasks. Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios, particularly in belief-related tasks. Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data, which raises concerns for applications in healthcare and counseling, where engaging with a person's beliefs is critical. Third, we identify a salient bias in how LMs process first-person versus third-person beliefs, performing better on third-person tasks (80.7%) compared to first-person tasks (54.4%). Fourth, LMs lack a robust understanding of the factive nature of knowledge, namely, that knowledge inherently requires truth. Fifth, LMs rely on linguistic cues for fact-checking and sometimes bypass the deeper reasoning. These findings highlight significant concerns about current LMs' ability to reason about truth, belief, and knowledge while emphasizing the need for advancements in these areas before broad deployment in critical sectors.
摘要：随着语言模型 (LM) 成为医疗保健、法律和新闻等领域不可或缺的一部分，它们区分事实、信念和知识的能力对于可靠的决策至关重要。未能掌握这些区别可能会在医疗诊断、法律判决和虚假新闻传播等领域导致严重后果。尽管如此，当前的文献主要集中在更复杂的问题上，例如心智理论，而忽略了更基本的认识论挑战。本研究使用新数据集 KaBLE（包含 13 个任务中的 13,000 个问题）系统地评估了现代 LM（包括 GPT-4、Claude-3 和 Llama-3）的认识论推理能力。我们的结果揭示了关键的局限性。首先，虽然 LM 在事实场景中的准确率达到 86%，但在虚假场景中，它们的性能会显著下降，尤其是在与信念相关的任务中。其次，语言模型难以识别和确认个人信念，尤其是当这些信念与事实数据相矛盾时，这引起了人们对医疗保健和咨询应用的担忧，因为在这些领域，了解个人信念至关重要。第三，我们发现语言模型在处理第一人称信念和第三人称信念方面存在显著的偏见，在第三人称任务上的表现 (80.7%) 优于第一人称任务 (54.4%)。第四，语言模型缺乏对知识事实性质的深入理解，即知识本身需要真相。第五，语言模型依赖语言线索进行事实核查，有时会绕过更深层次的推理。这些发现凸显了人们对当前语言模型推理真相、信念和知识的能力的重大担忧，同时强调在关键领域广泛部署之前，需要在这些领域取得进步。

Title: BongLLaMA: LLaMA for Bangla Language

Authors: Abdullah Khan Zehady, Safi Al Mamun, Naymul Islam, Santu Karmaker
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21200
Pdf URL: https://arxiv.org/pdf/2410.21200
Copy Paste: [[2410.21200]] BongLLaMA: LLaMA for Bangla Language(https://arxiv.org/abs/2410.21200)
Keywords: language model
Abstract: Bangla (or "Bengali") is a language spoken by approximately 240 million native speakers and around 300 million people worldwide. Despite being the 5th largest spoken language in the world, Bangla is still a "low-resource" language, and existing pretrained language models often struggle to perform well on Bangla Language Processing (BLP) tasks. This work addresses this gap by introducing BongLLaMA (i.e., Bangla-LLaMA), an open-source large language model fine-tuned exclusively on large Bangla corpora and instruction-tuning datasets. We present our methodology, data augmentation techniques, fine-tuning details, and comprehensive benchmarking results showcasing the utility of BongLLaMA on BLP tasks. We believe BongLLaMA will serve as the new standard baseline for Bangla Language Models and, thus, facilitate future benchmarking studies focused on this widely-spoken yet "low-resource" language. All BongLLaMA models are available for public use at this https URL.
摘要：孟加拉语（或“孟加拉语”）是一种语言，全球约有 2.4 亿母语使用者和约 3 亿人使用这种语言。尽管孟加拉语是世界上第五大语言，但它仍然是一种“资源匮乏”的语言，现有的预训练语言模型在孟加拉语处理 (BLP) 任务中往往表现不佳。这项工作通过引入 BongLLaMA（即 Bangla-LLaMA）来解决这一差距，BongLLaMA 是一种开源大型语言模型，专门针对大型孟加拉语料库和指令调整数据集进行了微调。我们介绍了我们的方法、数据增强技术、微调细节和全面的基准测试结果，展示了 BongLLaMA 在 BLP 任务中的实用性。我们相信 BongLLaMA 将成为孟加拉语模型的新标准基线，从而促进未来针对这种广泛使用但“资源匮乏”的语言的基准测试研究。所有 BongLLaMA 模型都可通过此 https URL 供公众使用。

Title: HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation

Authors: Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, Wei Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21216
Pdf URL: https://arxiv.org/pdf/2410.21216
Copy Paste: [[2410.21216]] HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation(https://arxiv.org/abs/2410.21216)
Keywords: llm
Abstract: Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion: tokens farther away from the current position carry less relevant information. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information from arbitrary positions. Firstly, we present empirical analyses on various PEs, demonstrating that models inherently learn attention with only a local-decay pattern while forming a U-shape pattern globally, contradicting the principle of long-term decay. Furthermore, we conduct a detailed analysis of rotary position encoding (RoPE, a prevalent relative positional encoding in LLMs), and find that the U-shape attention is caused by some learned components, which are also the key factor limiting RoPE's expressiveness and extrapolation. Inspired by these insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE replaces the specific components in RoPE with position-independent ones, retaining only high-frequency signals, which also breaks the principle of long-term decay in theory. HoPE achieves two major advantages: (1) Without constraints imposed by long-term decay, contradictory factors that limit spontaneous attention optimization and model extrapolation performance are removed. (2) Components representing positions and semantics are optimized. These enhance the model's context awareness and extrapolation, as validated by extensive experiments.
摘要：许多位置编码 (PE) 被设计为表现出长期衰减，这是基于一个根深蒂固的长期存在的归纳观点：距离当前位置较远的标记携带的相关信息较少。我们认为长期衰减在 LLM 时代已经过时了，因为 LLM 现在应用于需要从任意位置精确检索上下文信息的任务。首先，我们对各种 PE 进行了实证分析，证明模型固有学习注意力只有局部衰减模式，而全局形成 U 形模式，这与长期衰减原理相矛盾。此外，我们对旋转位置编码 (RoPE，LLM 中一种普遍的相对位置编码) 进行了详细分析，发现 U 形注意力是由一些学习到的组件引起的，这些组件也是限制 RoPE 表现力的关键因素，基于这些见解，我们提出了高频旋转位置编码 (HoPE)。 HoPE 将 RoPE 中的具体成分替换为与位置无关的成分，只保留高频信号，这也从理论上打破了长期衰减的原则。HoPE 实现了两大优势：（1）没有长期衰减的约束，消除了限制自发注意力优化和模型外推性能的矛盾因素。（2）优化了表示位置和语义的成分。这增强了模型的上下文感知和外推能力，这已被大量实验所验证。
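
Note: a rough NumPy sketch of the idea in the HoPE abstract, assuming standard RoPE (one rotation angle per pair of dimensions, with frequency `base ** (-2i / dim)`) and assuming that "position-independent components" can be modeled by simply not rotating the low-frequency pairs. The parameter `n_rotated` and the choice of which pairs to keep are illustrative assumptions; the abstract does not specify them, and this is not the authors' implementation.

```python
import numpy as np


def rope_angles(pos: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """One rotation angle per 2-D pair; pair 0 has the highest frequency."""
    freqs = base ** (-np.arange(dim // 2) * 2.0 / dim)
    return pos * freqs


def rotate(x, pos: int, n_rotated=None) -> np.ndarray:
    """Apply RoPE to a 1-D vector of even length.

    If n_rotated is given, only the first n_rotated (highest-frequency) pairs
    are rotated; the remaining pairs get angle 0 and so carry no position.
    """
    x = np.asarray(x, dtype=float)
    angles = rope_angles(pos, x.shape[-1])
    if n_rotated is not None:
        angles[n_rotated:] = 0.0
    even, odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = even * np.cos(angles) - odd * np.sin(angles)
    out[1::2] = even * np.sin(angles) + odd * np.cos(angles)
    return out


def score(q, k, m: int, n: int, n_rotated=None) -> float:
    """Attention logit between a query at position m and a key at position n."""
    return float(rotate(q, m, n_rotated) @ rotate(k, n, n_rotated))
```

In both variants the logit depends on the positions only through the offset `n - m`; in the truncated variant the unrotated pairs contribute a purely semantic term that is the same at every distance.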

Title: LongReward: Improving Long-context Large Language Models with AI Feedback

Authors: Jiajie Zhang, Zhongni Hou, Xin Lv, Shulin Cao, Zhenyu Hou, Yilin Niu, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21252
Pdf URL: https://arxiv.org/pdf/2410.21252
Copy Paste: [[2410.21252]] LongReward: Improving Long-context Large Language Models with AI Feedback(https://arxiv.org/abs/2410.21252)
Keywords: language model, llm
Abstract: Though significant advancements have been achieved in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often affects the long-context performance of SFT models and leads to inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models' capacities. However, how to obtain reliable rewards in long-context scenarios remains unexplored. To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses from four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. By combining LongReward and offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments indicate that LongReward not only significantly improves models' long-context performance but also enhances their ability to follow short instructions. We also find that long-context DPO with LongReward and conventional short-context DPO can be used together without hurting either one's performance.
摘要：尽管在开发长上下文大型语言模型 (LLM) 方面取得了重大进展，但用于监督微调 (SFT) 的 LLM 合成数据的质量受损通常会影响 SFT 模型的长上下文性能并导致固有的局限性。原则上，具有适当奖励信号的强化学习 (RL) 可以进一步增强模型的能力。然而，如何在长上下文场景中获得可靠的奖励仍未得到探索。为此，我们提出了 LongReward，这是一种新方法，它利用现成的 LLM 从四个人为价值维度为长上下文模型响应提供奖励：有用性、逻辑性、忠诚度和完整性，每个维度都有精心设计的评估管道。通过结合 LongReward 和离线 RL 算法 DPO，我们能够有效地改进长上下文 SFT 模型。我们的实验表明，LongReward 不仅显著提高了模型的长上下文性能，而且还增强了它们遵循简短指令的能力。我们还发现，带有 LongReward 的长上下文 DPO 和传统的短上下文 DPO 可以一起使用，而不会损害任何一个的性能。
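
Note: a small sketch of how per-dimension rewards could feed DPO, as the LongReward abstract outlines. It assumes the four dimension scores are averaged into one reward and that the highest- and lowest-scoring sampled responses form the preference pair; neither detail is stated in the abstract. The `dpo_loss` function is the standard DPO objective for a single pair, and all names and score values are illustrative.

```python
import math
from statistics import mean

# The four dimensions named in the abstract.
DIMENSIONS = ("helpfulness", "logicality", "faithfulness", "completeness")


def reward(scores: dict) -> float:
    """Aggregate per-dimension scores into one reward (assumed: plain average)."""
    return mean(scores[d] for d in DIMENSIONS)


def build_pair(candidates: list) -> dict:
    """Pick the best and worst sampled responses as a DPO preference pair."""
    ranked = sorted(candidates, key=lambda c: reward(c["scores"]))
    return {"chosen": ranked[-1]["text"], "rejected": ranked[0]["text"]}


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))
```

The loss falls as the policy raises the log-probability of the chosen response relative to the reference model and lowers that of the rejected one.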

Title: Are BabyLMs Second Language Learners?

Authors: Lukas Edman, Lisa Bylinina, Faeze Ghorbanpour, Alexander Fraser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21254
Pdf URL: https://arxiv.org/pdf/2410.21254
Copy Paste: [[2410.21254]] Are BabyLMs Second Language Learners?(https://arxiv.org/abs/2410.21254)
Keywords: llm
Abstract: This paper describes a linguistically-motivated approach to the 2024 edition of the BabyLM Challenge (Warstadt et al. 2023). Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective. In L2 learning, there is a stronger focus on learning explicit linguistic information, such as grammatical notions, definitions of words or different ways of expressing a meaning. This makes L2 learning potentially more efficient and concise. We approximate this using data from Wiktionary, grammar examples either generated by an LLM or sourced from grammar books, and paraphrase data. We find that explicit information about word meaning (in our case, Wiktionary) does not boost model performance, while grammatical information can give a small improvement. The most impactful data ingredient is sentence paraphrases, with our two best models being trained on 1) a mix of paraphrase data and data from the BabyLM pretraining dataset, and 2) exclusively paraphrase data.
摘要：本文介绍了一种以语言学为动机的 2024 年 BabyLM 挑战赛方法（Warstadt 等人，2023 年）。我们不是追求第一语言学习 (L1) 范式，而是从第二语言 (L2) 学习的角度来应对挑战。在 L2 学习中，更注重学习明确的语言信息，例如语法概念、单词定义或表达含义的不同方式。这使得 L2 学习可能更高效、更简洁。我们使用来自 Wiktionary 的数据、由 LLM 生成或来自语法书的语法示例以及释义数据来近似计算。我们发现，关于单词含义的明确信息（在我们的例子中是 Wiktionary）不会提高模型性能，而语法信息可以带来小幅改进。影响最深远的数据成分是句子释义，我们最好的两个模型是在 1) 释义数据和来自 BabyLM 预训练数据集的数据的混合基础上进行训练的，以及 2) 专门的释义数据。

Title: EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

Authors: Shih-Yang Liu, Huck Yang, Chein-Yi Wang, Nai Chit Fung, Hongxu Yin, Charbel Sakr, Saurav Muralidharan, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21271
Pdf URL: https://arxiv.org/pdf/2410.21271
Copy Paste: [[2410.21271]] EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation(https://arxiv.org/abs/2410.21271)
Keywords: llm
Abstract: In this work, we re-formulate the model compression problem into the customized compensation problem: Given a compressed model, we aim to introduce residual low-rank paths to compensate for compression errors under customized requirements from users (e.g., tasks, compression ratios), resulting in greater flexibility in adjusting overall capacity without being constrained by specific compression formats. However, naively applying SVD to derive residual paths causes suboptimal utilization of the low-rank representation capacity. Instead, we propose Training-free Eigenspace Low-Rank Approximation (EoRA), a method that directly minimizes compression-induced errors without requiring gradient-based training, achieving fast optimization in minutes using a small amount of calibration data. EoRA projects compression errors into the eigenspace of input activations, leveraging eigenvalues to effectively prioritize the reconstruction of high-importance error components. Moreover, EoRA can be seamlessly integrated with fine-tuning and quantization to further improve effectiveness and efficiency. EoRA consistently outperforms previous methods in compensating errors for compressed LLaMA2/3 models on various tasks, such as language generation, commonsense reasoning, and math reasoning tasks (e.g., 31.31%/12.88% and 9.69% improvements on ARC-Easy/ARC-Challenge and MathQA when compensating LLaMA3-8B that is quantized to 4-bit and pruned to 2:4 sparsity). EoRA offers a scalable, training-free solution to compensate for compression errors, making it a powerful tool to deploy LLMs in various capacity and efficiency requirements.
摘要：在本研究中，我们将模型压缩问题重新表述为定制补偿问题：给定一个压缩模型，我们旨在根据用户的定制要求（例如任务、压缩率）引入残差低秩路径来补偿压缩误差，从而更灵活地调整整体容量，而不受特定压缩格式的限制。然而，单纯地应用 SVD 来导出残差路径会导致低秩表示容量的利用率不理想。相反，我们提出了无训练特征空间低秩近似 (EoRA)，这种方法无需基于梯度的训练即可直接最小化压缩引起的误差，使用少量校准数据在几分钟内实现快速优化。EoRA 将压缩误差投射到输入激活的特征空间中，利用特征值有效地优先重建高重要性误差成分。此外，EoRA 可以与微调和量化无缝集成，以进一步提高有效性和效率。 EoRA 在补偿压缩 LLaMA2/3 模型在各种任务（例如语言生成、常识推理和数学推理任务）上的误差方面始终优于以前的方法（例如，在补偿量化为 4 位并修剪为 2:4 稀疏度的 LLaMA3-8B 时，ARC-Easy/ARC-Challenge 和 MathQA 上的改进分别为 31.31%/12.88% 和 9.69%）。EoRA 提供了一种可扩展的、无需训练的解决方案来补偿压缩误差，使其成为在各种容量和效率要求下部署 LLM 的强大工具。
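
Note: a NumPy sketch of activation-aware low-rank compensation in the spirit of the EoRA abstract. It assumes the layer computes `y = W x`, that the compression error is `delta_w = W - W_compressed`, and that the error is projected with the eigenvectors of the calibration covariance scaled by the square roots of its eigenvalues. The exact projection used in the paper is not given in the abstract, so that scaling is an assumption; with it, the truncated SVD minimizes the output error `||(delta_w - B A) X^T||_F`. Function and variable names are hypothetical.

```python
import numpy as np


def eigenspace_low_rank(delta_w: np.ndarray, x_calib: np.ndarray,
                        rank: int, eps: float = 1e-8):
    """Rank-`rank` factors B, A with delta_w ~= B @ A, weighted by activations.

    delta_w: (d_out, d_in) compression error; x_calib: (n, d_in) calibration inputs.
    """
    cov = x_calib.T @ x_calib                    # (d_in, d_in), symmetric PSD
    eigvals, q = np.linalg.eigh(cov)             # cov = Q diag(eigvals) Q^T
    scale = np.sqrt(np.clip(eigvals, eps, None))
    projected = (delta_w @ q) * scale            # delta_w Q Lambda^(1/2)
    u, s, vt = np.linalg.svd(projected, full_matrices=False)
    b = u[:, :rank] * s[:rank]                   # (d_out, rank)
    a = (vt[:rank] / scale) @ q.T                # undo the projection: (rank, d_in)
    return b, a


def plain_svd_low_rank(delta_w: np.ndarray, rank: int):
    """Baseline: truncated SVD of the raw error, ignoring activations."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]


def output_error(delta_w, b, a, x_calib) -> float:
    """Frobenius norm of the error that remains in the layer outputs."""
    return float(np.linalg.norm((delta_w - b @ a) @ x_calib.T))
```

At the same rank, the eigenspace variant leaves no more output error on the calibration data than the plain SVD baseline, because the weighting concentrates the rank budget on directions where the input activations are large.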

Title: Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics

Authors: Yaniv Nikankin, Anja Reusch, Aaron Mueller, Yonatan Belinkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21272
Pdf URL: https://arxiv.org/pdf/2410.21272
Copy Paste: [[2410.21272]] Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics(https://arxiv.org/abs/2410.21272)
Keywords: language model, llm, prompt
Abstract: Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a representative task. Using causal analysis, we identify a subset of the model (a circuit) that explains most of the model's behavior for basic arithmetic logic and examine its functionality. By zooming in on the level of individual circuit neurons, we discover a sparse set of important neurons that implement simple heuristics. Each heuristic identifies a numerical input pattern and outputs corresponding answers. We hypothesize that the combination of these heuristic neurons is the mechanism used to produce correct arithmetic answers. To test this, we categorize each neuron into several heuristic types (such as neurons that activate when an operand falls within a certain range) and find that the unordered combination of these heuristic types is the mechanism that explains most of the model's accuracy on arithmetic prompts. Finally, we demonstrate that this mechanism appears as the main source of arithmetic accuracy early in training. Overall, our experimental results across several LLMs show that LLMs perform arithmetic using neither robust algorithms nor memorization; rather, they rely on a "bag of heuristics".
摘要：大型语言模型 (LLM) 是通过学习稳健的可泛化算法来解决推理任务，还是通过记忆训练数据？为了研究这个问题，我们使用算术推理作为代表性任务。使用因果分析，我们确定了模型的一个子集（一个电路），该子集解释了模型在基本算术逻辑中的大部分行为，并检查了其功能。通过放大单个电路神经元的级别，我们发现了一组稀疏的重要神经元，它们实现了简单的启发式方法。每个启发式方法都会识别一个数字输入模式并输出相应的答案。我们假设这些启发式神经元的组合是产生正确算术答案的机制。为了测试这一点，我们将每个神经元分为几种启发式类型 - 例如当操作数落在某个范围内时激活的神经元 - 并发现这些启发式类型的无序组合是解释模型在算术提示上的大部分准确性的机制。最后，我们证明这种机制在训练初期是算术准确性的主要来源。总体而言，我们在多个 LLM 上进行的实验结果表明，LLM 执行算术时既不使用稳健算法也不使用记忆；相反，它们依赖于“启发式方法包”。

Title: GPT-4o System Card

Authors: OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Conneau, Ali Kamali, Allan Jabri, Allison Moyer, Allison Tam, Amadou Crookes, Amin Tootoochian, Amin Tootoonchian, Ananya Kumar, Andrea Vallone, Andrej Karpathy, Andrew Braunstein, Andrew Cann, Andrew Codispoti, Andrew Galu, Andrew Kondrich, Andrew Tulloch, Andrey Mishchenko, Angela Baek, Angela Jiang, Antoine Pelisse, Antonia Woodford, Anuj Gosalia, Arka Dhar, Ashley Pantuliano, Avi Nayak, Avital Oliver, Barret Zoph, Behrooz Ghorbani, Ben Leimberger, Ben Rossen, Ben Sokolowsky, Ben Wang, Benjamin Zweig, Beth Hoover, Blake Samic, Bob McGrew, Bobby Spero, Bogo Giertler, Bowen Cheng, Brad Lightcap, Brandon Walkin, Brendan Quinn, Brian Guarraci, Brian Hsu, Bright Kellogg, Brydon Eastman, Camillo Lugaresi, Carroll Wainwright, Cary Bassin, Cary Hudson, Casey Chu, Chad Nelson, Chak Li, Chan Jun Shern, Channing Conger, Charlotte Barette, Chelsea Voss, Chen Ding, Cheng Lu, Chong Zhang, Chris Beaumont, Chris Hallacy, Chris Koch, Christian Gibson, Christina Kim, Christine Choi, Christine McLeavey, Christopher Hesse, Claudia Fischer, Clemens Winter, Coley Czarnecki, Colin Jarvis, Colin Wei, Constantin Koumouzelis, Dane Sherburn
Subjects: cs.CL, cs.AI, cs.CV, cs.CY, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.21276
Pdf URL: https://arxiv.org/pdf/2410.21276
Copy Paste: [[2410.21276]] GPT-4o System Card(https://arxiv.org/abs/2410.21276)
Keywords: gpt
Abstract: GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.
摘要：GPT-4o 是一种自回归全向模型，它接受文本、音频、图像和视频的任意组合作为输入，并生成文本、音频和图像的任意组合输出。它在文本、视觉和音频方面进行了端到端训练，这意味着所有输入和输出都由同一个神经网络处理。GPT-4o 可以在短短 232 毫秒内响应音频输入，平均为 320 毫秒，这与人类在对话中的响应时间相似。它在英语和代码文本上的表现与 GPT-4 Turbo 相当，在非英语语言文本上的表现有显著改善，同时在 API 上也更快、便宜 50%。与现有模型相比，GPT-4o 在视觉和音频理解方面尤其出色。根据我们对安全构建人工智能的承诺以及我们对白宫的自愿承诺，我们正在分享 GPT-4o 系统卡，其中包括我们的准备框架评估。在此系统卡中，我们详细介绍了 GPT-4o 的功能、局限性和多个类别的安全性评估，重点关注语音转语音，同时评估文本和图像功能，以及我们为确保模型安全一致而实施的措施。我们还包括对危险功能的第三方评估，以及对 GPT-4o 的文本和视觉功能的潜在社会影响的讨论。