2024-08-22

Title: StructuredRAG: JSON Response Formatting with Large Language Models

Authors: Connor Shorten, Charles Pierse, Thomas Benjamin Smith, Erika Cardenas, Akanksha Sharma, John Trengrove, Bob van Luijt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11061
Pdf URL: https://arxiv.org/pdf/2408.11061
Copy Paste: [[2408.11061]] StructuredRAG: JSON Response Formatting with Large Language Models(https://arxiv.org/abs/2408.11061)
Keywords: language model, llm, prompt
Abstract: The ability of Large Language Models (LLMs) to generate structured outputs, such as JSON, is crucial for their use in Compound AI Systems. However, evaluating and improving this capability remains challenging. In this work, we introduce StructuredRAG, a benchmark of six tasks designed to assess LLMs' proficiency in following response format instructions. We evaluate two state-of-the-art LLMs, Gemini 1.5 Pro and Llama 3 8B-instruct with 4-bit quantization using two distinct prompting strategies. We introduce these prompting strategies as f-String and Follow the Format (FF) prompting. Across 24 experiments, we find an average success rate of 82.55%. We further find a high variance in performance across tasks, models, and prompting strategies with success rates ranging from 0 to 100%. We find that Llama 3 8B-instruct often performs competitively with Gemini 1.5 Pro. We observe that task complexity significantly influences performance, with tasks involving lists or composite object outputs proving more challenging. Our findings highlight the need for further research into improving the reliability and consistency of structured output generation in LLMs. We have open-sourced our experimental code and results at this http URL.
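A response-format check of the kind the abstract describes can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code: the function names `follows_format` and `success_rate`, the `expected_types` schema, and the sample responses are all assumptions made for the example.

```python
import json

def follows_format(response: str, expected_types: dict) -> bool:
    """Return True if `response` is a JSON object with exactly the expected keys and value types."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False  # extra prose around the JSON, or malformed JSON
    if not isinstance(parsed, dict) or set(parsed) != set(expected_types):
        return False
    return all(isinstance(parsed[key], typ) for key, typ in expected_types.items())

def success_rate(responses, expected_types) -> float:
    """Percentage of responses that follow the requested format."""
    ok = sum(follows_format(r, expected_types) for r in responses)
    return 100.0 * ok / len(responses)

# Hypothetical model outputs for a task that asks for {"answer": <string>}.
responses = [
    '{"answer": "Paris"}',
    'Sure! Here is the JSON: {"answer": "Paris"}',  # fails: not pure JSON
    '{"answer": 42}',                               # fails: wrong value type
    '{"answer": "Rome"}',
]
print(success_rate(responses, {"answer": str}))
```

Tasks with list or composite outputs would need a richer schema check than this flat key-to-type mapping.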

Title: Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models

Authors: Guanming Xiong, Junwei Bao, Hongfei Jiang, Yang Song, Wen Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11062
Pdf URL: https://arxiv.org/pdf/2408.11062
Copy Paste: [[2408.11062]] Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models(https://arxiv.org/abs/2408.11062)
Keywords: language model, llm
Abstract: This study explores text-to-SQL parsing by leveraging the powerful reasoning capabilities of large language models (LLMs). Despite recent advancements, existing LLM-based methods have not adequately addressed scalability, leading to inefficiencies when processing wide tables. Furthermore, current interaction-based approaches either lack a step-by-step, interpretable SQL generation process or fail to provide an efficient and universally applicable interaction design. To address these challenges, we introduce Interactive-T2S, a framework that generates SQL queries through direct interactions with databases. This framework includes four general tools that facilitate proactive and efficient information retrieval by the LLM. Additionally, we have developed detailed exemplars to demonstrate the step-wise reasoning processes within our framework. Our experiments on the BIRD-Dev dataset, employing a setting without oracle knowledge, reveal that our method achieves state-of-the-art results with only two exemplars, underscoring the effectiveness and robustness of our framework.
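The idea of letting an LLM query the database through a few general tools, instead of reading a full wide-table schema, can be sketched with Python's built-in `sqlite3` module. The abstract does not name the four tools, so the three functions below (`search_columns`, `search_value`, `execute_sql`) and the demo table are hypothetical stand-ins, not the paper's interface.

```python
import sqlite3

def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical data for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE schools (id INTEGER PRIMARY KEY, name TEXT, county TEXT);
        INSERT INTO schools VALUES (1, 'Lincoln High', 'Alameda'), (2, 'Mission Prep', 'Fresno');
    """)
    return conn

def search_columns(conn, keyword):
    """Hypothetical tool: list (table, column) pairs whose column name contains `keyword`."""
    hits = []
    tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            if keyword.lower() in row[1].lower():  # row[1] is the column name
                hits.append((table, row[1]))
    return hits

def search_value(conn, table, column, value):
    """Hypothetical tool: distinct cell values in table.column that contain `value`."""
    sql = f'SELECT DISTINCT "{column}" FROM "{table}" WHERE "{column}" LIKE ?'
    return [r[0] for r in conn.execute(sql, (f"%{value}%",))]

def execute_sql(conn, sql):
    """Hypothetical tool: run a candidate query and return its rows."""
    return conn.execute(sql).fetchall()

conn = make_demo_db()
print(search_columns(conn, "count"))
print(search_value(conn, "schools", "county", "alameda"))
print(execute_sql(conn, "SELECT name FROM schools WHERE county = 'Alameda'"))
```

In a multi-turn loop, the model would call such tools step by step, so only the columns and values it asks about enter the prompt.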

Title: Tabular Transfer Learning via Prompting LLMs

Authors: Jaehyun Nam, Woomin Song, Seong Hyeon Park, Jihoon Tack, Sukmin Yun, Jaehyung Kim, Kyu Hwan Oh, Jinwoo Shin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11063
Pdf URL: https://arxiv.org/pdf/2408.11063
Copy Paste: [[2408.11063]] Tabular Transfer Learning via Prompting LLMs(https://arxiv.org/abs/2408.11063)
Keywords: language model, llm, prompt
Abstract: Learning with a limited amount of labeled data is a central problem in real-world applications of machine learning, as it is often expensive to obtain annotations. To deal with the scarcity of labeled data, transfer learning is a conventional approach; it learns transferable knowledge by training a neural network on multiple other sources. In this paper, we investigate transfer learning of tabular tasks, which has been less studied and less successful in the literature than in other domains, e.g., vision and language. This is because tables are inherently heterogeneous, i.e., they contain different columns and feature spaces, making transfer learning difficult. On the other hand, recent advances in natural language processing suggest that the label scarcity issue can be mitigated by utilizing the in-context learning capability of large language models (LLMs). Inspired by this and the fact that LLMs can also process tables within a unified language space, we ask whether LLMs can be effective for tabular transfer learning, in particular, in scenarios where the source and target datasets have different formats. As a positive answer, we propose a novel tabular transfer learning framework, coined Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with LLMs. Specifically, P2T identifies a column feature in a source dataset that is strongly correlated with a target task feature to create examples relevant to the target task, thus creating pseudo-demonstrations for prompts. Experimental results demonstrate that P2T outperforms previous methods on various tabular learning benchmarks, showing good promise for the important yet underexplored tabular transfer learning problem. Code is available at this https URL.
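The pseudo-demonstration step can be illustrated with a small prompt builder. This sketch assumes the correlated source column has already been chosen (how P2T selects it is not specified in the abstract), and the serialization template, function names, and sample rows are hypothetical.

```python
def serialize(row, exclude=()):
    """Turn a table row into text, e.g. 'age is 63, bmi is 31.2'."""
    return ", ".join(f"{k} is {v}" for k, v in row.items() if k not in exclude)

def pseudo_demonstrations(source_rows, pseudo_label):
    """Treat one source column as the answer, giving labeled-looking examples from unlabeled rows."""
    demos = []
    for row in source_rows:
        features = serialize(row, exclude=(pseudo_label,))
        demos.append(f"{features}. What is {pseudo_label}? {row[pseudo_label]}")
    return demos

def build_prompt(source_rows, pseudo_label, target_row, target_label):
    """Pseudo-demonstrations from the source table, followed by the target query."""
    lines = pseudo_demonstrations(source_rows, pseudo_label)
    lines.append(f"{serialize(target_row)}. What is {target_label}?")
    return "\n".join(lines)

# Hypothetical unlabeled source rows; 'glucose' is assumed to correlate with the target label.
source_rows = [
    {"age": 63, "bmi": 31.2, "glucose": 148},
    {"age": 29, "bmi": 22.0, "glucose": 85},
]
target_row = {"age": 51, "bmi": 28.4, "glucose": 130}
prompt = build_prompt(source_rows, "glucose", target_row, "diabetes")
print(prompt)
```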

Title: Reading with Intent

Authors: Benjamin Reichman, Kartik Talamadupula, Toshish Jawale, Larry Heck
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11189
Pdf URL: https://arxiv.org/pdf/2408.11189
Copy Paste: [[2408.11189]] Reading with Intent(https://arxiv.org/abs/2408.11189)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: Retrieval augmented generation (RAG) systems augment how knowledgeable language models are by integrating external information sources such as Wikipedia, internal documents, scientific papers, or the open internet. RAG systems that rely on the open internet as their knowledge source have to contend with the complexities of human-generated content. Human communication extends much deeper than just the words rendered as text. Intent, tonality, and connotation can all change the meaning of what is being conveyed. Recent real-world deployments of RAG systems have shown some difficulty in understanding these nuances of human communication. One significant challenge for these systems lies in processing sarcasm. Though the Large Language Models (LLMs) that make up the backbone of these RAG systems are able to detect sarcasm, they currently do not always use these detections for the subsequent processing of text. To address these issues, in this paper, we synthetically generate sarcastic passages from the Natural Questions Wikipedia retrieval corpus. We then test the impact of these passages on the performance of both the retriever and reader portions of the RAG pipeline. We introduce a prompting system designed to enhance the model's ability to interpret and generate responses in the presence of sarcasm, thus improving overall system performance. Finally, we conduct ablation studies to validate the effectiveness of our approach, demonstrating improvements in handling sarcastic content within RAG systems.

Title: CoDi: Conversational Distillation for Grounded Question Answering

Authors: Patrick Huber, Arash Einolghozati, Rylan Conway, Kanika Narang, Matt Smith, Waqar Nayyar, Adithya Sagar, Ahmed Aly, Akshat Shrivastava
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11219
Pdf URL: https://arxiv.org/pdf/2408.11219
Copy Paste: [[2408.11219]] CoDi: Conversational Distillation for Grounded Question Answering(https://arxiv.org/abs/2408.11219)
Keywords: language model
Abstract: Distilling conversational skills into Small Language Models (SLMs) with approximately 1 billion parameters presents significant challenges. Firstly, SLMs have limited capacity in their model parameters to learn extensive knowledge compared to larger models. Secondly, high-quality conversational datasets are often scarce, small, and domain-specific. Addressing these challenges, we introduce a novel data distillation framework named CoDi (short for Conversational Distillation, pronounced "Cody"), allowing us to synthesize large-scale, assistant-style datasets in a steerable and diverse manner. Specifically, while our framework is task agnostic at its core, we explore and evaluate the potential of CoDi on the task of conversational grounded reasoning for question answering. This is a typical on-device scenario for specialist SLMs, allowing for open-domain model responses, without requiring the model to "memorize" world knowledge in its limited weights. Our evaluations show that SLMs trained with CoDi-synthesized data achieve performance comparable to models trained on human-annotated data in standard metrics. Additionally, when using our framework to generate larger datasets from web data, our models surpass larger, instruction-tuned models in zero-shot conversational grounded reasoning tasks.

Title: Unboxing Occupational Bias: Grounded Debiasing LLMs with U.S. Labor Data

Authors: Atmika Gorti, Manas Gaur, Aman Chadha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11247
Pdf URL: https://arxiv.org/pdf/2408.11247
Copy Paste: [[2408.11247]] Unboxing Occupational Bias: Grounded Debiasing LLMs with U.S. Labor Data(https://arxiv.org/abs/2408.11247)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are prone to inheriting and amplifying societal biases embedded within their training data, potentially reinforcing harmful stereotypes related to gender, occupation, and other sensitive categories. This issue becomes particularly problematic as biased LLMs can have far-reaching consequences, leading to unfair practices and exacerbating social inequalities across various domains, such as recruitment, online content moderation, or even the criminal justice system. Although prior research has focused on detecting bias in LLMs using specialized datasets designed to highlight intrinsic biases, there has been a notable lack of investigation into how these findings correlate with authoritative datasets, such as those from the U.S. National Bureau of Labor Statistics (NBLS). To address this gap, we conduct empirical research that evaluates LLMs in a "bias-out-of-the-box" setting, analyzing how the generated outputs compare with the distributions found in NBLS data. Furthermore, we propose a straightforward yet effective debiasing mechanism that directly incorporates NBLS instances to mitigate bias within LLMs. Our study spans seven different LLMs, including instructable, base, and mixture-of-expert models, and reveals significant levels of bias that are often overlooked by existing bias detection techniques. Importantly, our debiasing method, which does not rely on external datasets, demonstrates a substantial reduction in bias scores, highlighting the efficacy of our approach in creating fairer and more reliable LLMs.

Title: Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models

Authors: Sepehr Kamahi, Yadollah Yaghoobzadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11252
Pdf URL: https://arxiv.org/pdf/2408.11252
Copy Paste: [[2408.11252]] Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models(https://arxiv.org/abs/2408.11252)
Keywords: language model
Abstract: Despite the widespread adoption of autoregressive language models, explainability evaluation research has predominantly focused on span infilling and masked language models (MLMs). Evaluating the faithfulness of an explanation method (how accurately the method explains the inner workings and decision-making of the model) is very challenging because it is very hard to separate the model from its explanation. Most faithfulness evaluation techniques corrupt or remove some input tokens considered important according to a particular attribution (feature importance) method and observe the change in the model's output. This approach creates out-of-distribution inputs for causal language models (CLMs) due to their training objective of next token prediction. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive language modeling scenarios. Our technique creates fluent and in-distribution counterfactuals that make the evaluation protocol more reliable. Code is available at this https URL.

Title: RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining

Authors: Anh-Dung Vo, Minseong Jung, Wonbeen Lee, Daewoo Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11294
Pdf URL: https://arxiv.org/pdf/2408.11294
Copy Paste: [[2408.11294]] RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining(https://arxiv.org/abs/2408.11294)
Keywords: language model, llm
Abstract: The field of Natural Language Processing (NLP) has seen significant advancements with the development of Large Language Models (LLMs). However, much of this research remains focused on English, often overlooking low-resource languages like Korean. This oversight presents challenges due to the unique non-alphabetic token structure of Korean and the substantial memory and computational demands required for LLM training, which frequently lead to memory constraints and out-of-memory errors. To address these issues, we present RedWhale, a model specifically tailored for Korean language processing. RedWhale is developed using an efficient continual pretraining approach that includes a comprehensive Korean corpus preprocessing pipeline, a specialized tokenizer, an optimized model initialization technique, and a multistage pretraining strategy. These innovations collectively reduce training time and computational costs while maintaining high levels of accuracy and comprehension. By leveraging cross-lingual transfer learning, which exploits shared linguistic similarities across languages, RedWhale builds on English models to enhance Korean language processing. Experimental results demonstrate that RedWhale outperforms other leading models on Korean NLP benchmarks, including the Korean Balanced Evaluation of Significant Tasks (KoBEST), showing superior understanding and generation of Korean text. Furthermore, RedWhale showed no signs of convergence even after pretraining on 9.7 billion tokens, indicating the potential for further improvements with additional training. This work represents a significant advancement in bridging the linguistic divide, particularly in enhancing NLP capabilities for the Korean language.

Title: Towards Evaluating Large Language Models on Sarcasm Understanding

Authors: Yazhou Zhang, Chunwang Zou, Zheng Lian, Prayag Tiwari, Jing Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11319
Pdf URL: https://arxiv.org/pdf/2408.11319
Copy Paste: [[2408.11319]] Towards Evaluating Large Language Models on Sarcasm Understanding(https://arxiv.org/abs/2408.11319)
Keywords: language model, gpt, llm, prompt, chat
Abstract: In the era of large language models (LLMs), the tasks of "System I" (the fast, unconscious, and intuitive tasks, e.g., sentiment analysis and text classification) have been argued to be successfully solved. However, sarcasm, as a subtle linguistic phenomenon, often employs rhetorical devices like hyperbole and figuration to convey true sentiments and intentions, involving a higher level of abstraction than sentiment analysis. There is growing concern that the argument about LLMs' success may not be fully tenable when considering sarcasm understanding. To address this question, we select eleven SOTA LLMs and eight SOTA pre-trained language models (PLMs) and present comprehensive evaluations on six widely used benchmark datasets through different prompting approaches, i.e., zero-shot input/output (IO) prompting, few-shot IO prompting, and chain of thought (CoT) prompting. Our results highlight three key findings: (1) Current LLMs underperform supervised PLM-based sarcasm detection baselines across six sarcasm benchmarks. This suggests that significant efforts are still required to improve LLMs' understanding of human sarcasm. (2) GPT-4 consistently and significantly outperforms other LLMs across various prompting methods, with an average improvement of 14.0%. Claude 3 and ChatGPT demonstrate the next best performance after GPT-4. (3) The few-shot IO prompting method outperforms the other two methods: zero-shot IO and few-shot CoT. The reason is that sarcasm detection, being a holistic, intuitive, and non-rational cognitive process, is argued not to adhere to step-by-step logical reasoning, making CoT less effective in understanding sarcasm compared to its effectiveness in mathematical reasoning tasks.
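The three prompting approaches named in the abstract differ only in what precedes the query text. The builders below are a rough sketch under assumed wording; the instruction strings, function names, and example texts are hypothetical and not taken from the paper.

```python
INSTRUCTION = "Decide whether the text is sarcastic. Answer 'yes' or 'no'."

def zero_shot_io(text):
    """Zero-shot input/output prompt: instruction plus the query, no examples."""
    return f"{INSTRUCTION}\nText: {text}\nAnswer:"

def few_shot_io(text, examples):
    """Few-shot IO prompt: labeled (text, label) examples precede the query."""
    shots = "\n".join(f"Text: {t}\nAnswer: {label}" for t, label in examples)
    return f"{INSTRUCTION}\n{shots}\nText: {text}\nAnswer:"

def few_shot_cot(text, examples):
    """Few-shot CoT prompt: each (text, reasoning, label) example spells out reasoning first."""
    shots = "\n".join(
        f"Text: {t}\nReasoning: {why}\nAnswer: {label}" for t, why, label in examples
    )
    return f"{INSTRUCTION} Think step by step.\n{shots}\nText: {text}\nReasoning:"

# Hypothetical demonstrations.
io_examples = [("Oh great, my flight is delayed again.", "yes"), ("The train leaves at 9.", "no")]
cot_examples = [("Oh great, my flight is delayed again.",
                 "A delay is bad news, so calling it great is insincere.", "yes")]

print(few_shot_io("I just love waiting in line for hours.", io_examples))
```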

Title: BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports

Authors: Yuxuan Chen, Haoyan Yang, Hengkai Pan, Fardeen Siddiqui, Antonio Verdone, Qingyang Zhang, Sumit Chopra, Chen Zhao, Yiqiu Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11334
Pdf URL: https://arxiv.org/pdf/2408.11334
Copy Paste: [[2408.11334]] BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports(https://arxiv.org/abs/2408.11334)
Keywords: gpt, llm
Abstract: Breast ultrasound is essential for detecting and diagnosing abnormalities, with radiology reports summarizing key findings like lesion characteristics and malignancy assessments. Extracting this critical information is challenging due to the unstructured nature of these reports, with varied linguistic styles and inconsistent formatting. While proprietary LLMs like GPT-4 are effective, they are costly and raise privacy concerns when handling protected health information. This study presents a pipeline for developing an in-house LLM to extract clinical information from radiology reports. We first use GPT-4 to create a small labeled dataset, then fine-tune a Llama3-8B model on it. Evaluated on clinician-annotated reports, our model achieves an average F1 score of 84.6%, which is on par with GPT-4. Our findings demonstrate the feasibility of developing an in-house LLM that not only matches GPT-4's performance but also offers cost reductions and enhanced data privacy.

Title: Clinical Context-aware Radiology Report Generation from Medical Images using Transformers

Authors: Sonit Singh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11344
Pdf URL: https://arxiv.org/pdf/2408.11344
Copy Paste: [[2408.11344]] Clinical Context-aware Radiology Report Generation from Medical Images using Transformers(https://arxiv.org/abs/2408.11344)
Keywords: language model
Abstract: Recent developments in the field of Natural Language Processing, especially language models such as the transformer, have brought state-of-the-art results in language understanding and language generation. In this work, we investigate the use of the transformer model for radiology report generation from chest X-rays. We also highlight limitations in evaluating radiology report generation using only the standard language generation metrics. We then apply a transformer-based radiology report generation architecture and compare the performance of a transformer-based decoder with that of a recurrence-based decoder. Experiments were performed using the IU-CXR dataset, with the transformer-based decoder showing superior results to its LSTM counterpart while being significantly faster. Finally, we identify the need to evaluate radiology report generation systems using both language generation metrics and classification metrics, which helps to provide a robust measure of generated reports in terms of their coherence and diagnostic value.

Title: GeoReasoner: Reasoning On Geospatially Grounded Context For Natural Language Understanding

Authors: Yibo Yan, Joey Lee
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11366
Pdf URL: https://arxiv.org/pdf/2408.11366
Copy Paste: [[2408.11366]] GeoReasoner: Reasoning On Geospatially Grounded Context For Natural Language Understanding(https://arxiv.org/abs/2408.11366)
Keywords: language model, llm
Abstract: In human reading and communication, individuals tend to engage in geospatial reasoning, which involves recognizing geographic entities and making informed inferences about their interrelationships. To mimic such a cognitive process, current methods either utilize conventional natural language understanding toolkits or directly apply models pretrained on geo-related natural language corpora. However, these methods face two significant challenges: i) they do not generalize well to unseen geospatial scenarios, and ii) they overlook the importance of integrating geospatial context from geographical databases with linguistic information from the Internet. To handle these challenges, we propose GeoReasoner, a language model capable of reasoning on geospatially grounded natural language. Specifically, it first leverages Large Language Models (LLMs) to generate a comprehensive location description based on linguistic and geospatial information. It also encodes direction and distance information into spatial embeddings by treating them as pseudo-sentences. Consequently, the model is trained on both anchor-level and neighbor-level inputs to learn geo-entity representation. Extensive experimental results demonstrate GeoReasoner's superiority in three tasks: toponym recognition, toponym linking, and geo-entity typing, compared to the state-of-the-art baselines.

Title: RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation

Authors: Xuanwang Zhang, Yunze Song, Yidong Wang, Shuyun Tang, Xinfeng Li, Zhengran Zeng, Zhen Wu, Wei Ye, Wenyuan Xu, Yue Zhang, Xinyu Dai, Shikun Zhang, Qingsong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11381
Pdf URL: https://arxiv.org/pdf/2408.11381
Copy Paste: [[2408.11381]] RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation(https://arxiv.org/abs/2408.11381)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention. However, even the most advanced LLMs face challenges such as hallucinations and real-time updating of their knowledge. Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval Augmented Generation (RAG). However, two key issues constrained the development of RAG. First, there is a growing lack of comprehensive and fair comparisons between novel RAG algorithms. Second, open-source tools such as LlamaIndex and LangChain employ high-level abstractions, which results in a lack of transparency and limits the ability to develop novel algorithms and evaluation metrics. To close this gap, we introduce RAGLAB, a modular and research-oriented open-source library. RAGLAB reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms. Leveraging RAGLAB, we conduct a fair comparison of 6 RAG algorithms across 10 benchmarks. With RAGLAB, researchers can efficiently compare the performance of various algorithms and develop novel algorithms.

Title: On the Interchangeability of Positional Embeddings in Multilingual Neural Machine Translation Models

Authors: Varun Gumma, Pranjal A. Chitale, Kalika Bali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11382
Pdf URL: https://arxiv.org/pdf/2408.11382
Copy Paste: [[2408.11382]] On the Interchangeability of Positional Embeddings in Multilingual Neural Machine Translation Models(https://arxiv.org/abs/2408.11382)
Keywords: language model, llm
Abstract: Standard Neural Machine Translation (NMT) models have traditionally been trained with Sinusoidal Positional Embeddings (PEs), which are inadequate for capturing long-range dependencies and are inefficient for long-context or document-level translation. In contrast, state-of-the-art large language models (LLMs) employ relative PEs, demonstrating superior length generalization. This work explores the potential for efficiently switching the Positional Embeddings of pre-trained NMT models from absolute sinusoidal PEs to relative approaches such as RoPE and ALiBi. Our findings reveal that sinusoidal PEs can be effectively replaced with RoPE and ALiBi with negligible or no performance loss, achieved by fine-tuning on a small fraction of high-quality data. Additionally, models trained without Positional Embeddings (NoPE) are not a viable solution for Encoder-Decoder architectures, as they consistently under-perform compared to models utilizing any form of Positional Embedding. Furthermore, even a model trained from scratch with these relative PEs slightly under-performs a fine-tuned model, underscoring the efficiency and validity of our hypothesis.
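The difference between absolute sinusoidal embeddings and a relative scheme such as RoPE can be seen in a few lines. The sketch below uses the standard textbook formulations in pure Python and is not code from the paper; the helper names and sample vectors are assumptions. It shows the key RoPE property: the attention score between a rotated query and key depends only on their relative offset.

```python
import math

def sinusoidal_pe(pos, dim, base=10000.0):
    """Absolute sinusoidal embedding for one position (added to the token embedding)."""
    pe = []
    for i in range(0, dim, 2):
        angle = pos / base ** (i / dim)
        pe += [math.sin(angle), math.cos(angle)]
    return pe

def rope(vec, pos, base=10000.0):
    """RoPE: rotate each consecutive pair of `vec` by a position-dependent angle."""
    dim = len(vec)  # assumed even
    out = []
    for i in range(0, dim, 2):
        angle = pos / base ** (i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (4) at two very different absolute positions.
near = dot(rope(q, 3), rope(k, 7))
far = dot(rope(q, 103), rope(k, 107))
print(abs(near - far) < 1e-9)
```

With sinusoidal embeddings the position signal is added to the inputs instead, so the scores depend on absolute positions as well.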

Title: First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models

Authors: Chi Ma, Mincong Huang, Ying Zhang, Chao Wang, Yujie Wang, Lei Yu, Chuan Liu, Wei Lin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11393
Pdf URL: https://arxiv.org/pdf/2408.11393
Copy Paste: [[2408.11393]] First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models(https://arxiv.org/abs/2408.11393)
Keywords: language model, llm
Abstract: Dynamic activation (DA) techniques, such as DejaVu and MoEfication, have demonstrated their potential to significantly enhance the inference efficiency of large language models (LLMs). However, these techniques often rely on ReLU activation functions or require additional parameters and training to maintain performance. This paper introduces a training-free Threshold-based Dynamic Activation (TDA) method that leverages sequence information to exploit the inherent sparsity of models across various architectures. This method is designed to accelerate generation speed by 18-25% without significantly compromising task performance, thereby addressing the limitations of existing DA techniques. Moreover, we delve into the root causes of LLM sparsity and theoretically analyze two of its critical features: history-related activation uncertainty and semantic-irrelevant activation inertia. Our comprehensive analyses not only provide a robust theoretical foundation for DA methods but also offer valuable insights to guide future research in optimizing LLMs for greater efficiency and effectiveness.
摘要：动态激活 (DA) 技术（例如 DejaVu 和 MoEfication）已证明其能够显著提高大型语言模型 (LLM) 的推理效率。然而，这些技术通常依赖于 ReLU 激活函数或需要额外的参数和训练才能保持性能。本文介绍了一种无需训练的基于阈值的动态激活 (TDA) 方法，该方法利用序列信息来利用各种架构中模型固有的稀疏性。该方法旨在将生成速度提高 18-25\%，而不会显著影响任务性能，从而解决现有 DA 技术的局限性。此外，我们深入研究了 LLM 稀疏性的根本原因，并从理论上分析了它的两个关键特征：与历史相关的激活不确定性和与语义无关的激活惯性。我们的全面分析不仅为 DA 方法提供了坚实的理论基础，而且还提供了宝贵的见解，以指导未来优化 LLM 以提高效率和有效性的研究。

Title: MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Authors: Hao Zhou, Zhijun Wang, Shujian Huang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Weihua Luo, Jiajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11396
Pdf URL: https://arxiv.org/pdf/2408.11396
Copy Paste: [[2408.11396]] MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing(https://arxiv.org/abs/2408.11396)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are often English-centric due to the disproportionate distribution of languages in their pre-training data. Enhancing non-English language capabilities through post-pretraining often results in catastrophic forgetting of abilities in the original languages. Previous methods either achieve good expansion with severe forgetting or slight forgetting with poor expansion, indicating the challenge of balancing language expansion while preventing forgetting. In this paper, we propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing) to alleviate this problem. MoE-LPR employs a two-stage training approach to enhance the multilingual capability. First, the model is post-pretrained into a Mixture-of-Experts (MoE) architecture by upcycling, where all the original parameters are frozen and new experts are added. In this stage, we focus on improving ability in the expanded languages, without using any original language data. Then, the model reviews the knowledge of the original languages with replay data amounting to less than 1% of post-pretraining, where we incorporate language priors routing to better recover the abilities of the original languages. Evaluations on multiple benchmarks show that MoE-LPR outperforms other post-pretraining methods. Freezing original parameters preserves original language knowledge while adding new experts preserves the learning ability. Reviewing with LPR enables effective utilization of multilingual knowledge within the parameters. Additionally, the MoE architecture maintains the same inference overhead while increasing total model parameters. Extensive experiments demonstrate MoE-LPR's effectiveness in improving expanded languages and preserving original language proficiency with superior scalability. Code and scripts are freely available at this https URL.
摘要：大型语言模型 (LLM) 通常以英语为中心，因为其预训练数据中语言分布不均衡。通过后预训练增强非英语语言能力往往会导致对原始语言能力的灾难性遗忘。以前的方法要么扩展良好但遗忘严重，要么遗忘轻微但扩展不佳，这表明在防止遗忘的同时平衡语言扩展是一项挑战。在本文中，我们提出了一种称为 MoE-LPR（带语言先验路由的专家混合）的方法来缓解这一问题。MoE-LPR 采用两阶段训练方法来增强多语言能力。首先，通过升级将模型后预训练为专家混合 (MoE) 架构，其中所有原始参数都被冻结并添加新专家。在这个阶段，我们专注于提高扩展语言的能力，而不使用任何原始语言数据。然后，该模型使用回放数据（不到后预训练的 1%）来复习原始语言的知识，其中我们结合语言先验路由来更好地恢复原始语言的能力。多个基准测试的评估表明，MoE-LPR 优于其他后预训练方法。冻结原始参数可以保留原始语言知识，而添加新专家可以保留学习能力。使用 LPR 进行复习可以有效利用参数内的多语言知识。此外，MoE 架构在增加总模型参数的同时保持相同的推理开销。大量实验证明了 MoE-LPR 在改进扩展语言和保留原始语言能力方面具有出色的可扩展性。代码和脚本可在此 https URL 上免费获取。

Title: Towards "Differential AI Psychology" and in-context Value-driven Statement Alignment with Moral Foundations Theory

Authors: Simon Münker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11415
Pdf URL: https://arxiv.org/pdf/2408.11415
Copy Paste: [[2408.11415]] Towards "Differential AI Psychology" and in-context Value-driven Statement Alignment with Moral Foundations Theory(https://arxiv.org/abs/2408.11415)
Keywords: language model, agent
Abstract: Contemporary research in the social sciences is increasingly utilizing state-of-the-art statistical language models to annotate or generate content. While these models achieve benchmark-leading performance on common language tasks and show exemplary task-independent emergent abilities, transferring them to novel out-of-domain tasks remains insufficiently explored. The implications of the statistical black-box approach (stochastic parrots) are prominently criticized in the language model research community; however, the significance for novel generative tasks is not. This work investigates the alignment between personalized language models and survey participants on a Moral Foundations Theory questionnaire. We adapt text-to-text models to different political personas and survey the questionnaire repetitively to generate a synthetic population of persona and model combinations. Analyzing the intra-group variance and cross-alignment shows significant differences across models and personas. Our findings indicate that adapted models struggle to represent the survey-captured assessment of political ideologies. Thus, using language models to mimic social interactions requires measurable improvements in in-context optimization or parameter manipulation to align with psychological and sociological stereotypes. Without quantifiable alignment, generating politically nuanced content remains unfeasible. To enhance these representations, we propose a testable framework to generate agents based on moral value statements for future research.
摘要：当代社会科学研究越来越多地利用最先进的统计语言模型来注释或生成内容。虽然这些模型在常见语言任务上的表现领先于基准，并表现出卓越的任务独立性新兴能力，但将它们转移到新的领域外任务上却没有得到充分探索。统计黑箱方法（随机鹦鹉）的含义在语言模型研究界受到广泛批评；然而，对于新生成任务的意义却没有。这项工作调查了个性化语言模型和道德基础理论问卷上的调查参与者之间的一致性。我们将文本到文本模型调整为不同的政治人物，并反复调查问卷以生成人物和模型组合的合成群体。分析组内方差和交叉一致性显示模型和人物之间存在显着差异。我们的研究结果表明，经过调整的模型难以代表调查捕获的政治意识形态评估。因此，使用语言模型来模拟社交互动需要在上下文优化或参数操纵方面取得可衡量的改进，以符合心理和社会刻板印象。如果没有可量化的一致性，生成政治细微差别的内容仍然不可行。为了增强这些表现，我们提出了一个可测试的框架，以根据道德价值观陈述生成代理，以供未来研究。

Title: Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning

Authors: Kai Xiong, Xiao Ding, Li Du, Jiahao Ying, Ting Liu, Bing Qin, Yixin Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11431
Pdf URL: https://arxiv.org/pdf/2408.11431
Copy Paste: [[2408.11431]] Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning(https://arxiv.org/abs/2408.11431)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are versatile and demonstrate impressive generalization ability by mining and learning information from extensive unlabeled text. However, they still exhibit reasoning mistakes, often stemming from knowledge deficiencies, which can affect their trustworthiness and reliability. Although users can provide diverse and comprehensive queries, obtaining sufficient and effective feedback is demanding. Furthermore, evaluating LLMs comprehensively with limited labeled samples is difficult. This makes it a challenge to diagnose and remedy the deficiencies of LLMs through rich label-free user queries. To tackle this challenge, we propose a label-free curricular meaningful learning framework (LaMer). LaMer first employs relative entropy to automatically diagnose and quantify the knowledge deficiencies of LLMs in a label-free setting. Next, to remedy the diagnosed knowledge deficiencies, we apply curricular meaningful learning: first, we adopt meaningful learning to adaptively synthesize augmentation data according to the severity of the deficiencies, and then design a curricular deficiency remedy strategy to remedy the knowledge deficiencies of LLMs progressively. Experiments show that LaMer efficiently and effectively diagnoses and remedies knowledge deficiencies in LLMs, improving various LLMs across seven out-of-distribution (OOD) reasoning and language understanding benchmarks, achieving comparable results to baselines with just 40% of the training data. LaMer even surpasses methods that rely on labeled datasets for deficiency diagnosis. In application, our label-free method can offer an effective knowledge deficiency diagnostic tool for efficient LLM development.
摘要：大型语言模型 (LLM) 用途广泛，通过从大量未标记文本中挖掘和学习信息表现出令人印象深刻的泛化能力。然而，它们仍然表现出推理错误，这些错误通常源于知识缺陷，这会影响它们的可信度和可靠性。尽管用户可以提供多样化和全面的查询，但获得充分有效的反馈却很困难。此外，使用有限的标记样本全面评估 LLM 很困难。这使得通过丰富的无标签用户查询诊断和补救 LLM 的缺陷成为一项挑战。为了应对这一挑战，我们提出了一个无标签课程有意义学习框架 (LaMer)。LaMer 首先使用相对熵在无标签环境中自动诊断和量化 LLM 的知识缺陷。接下来，为了补救诊断出的知识缺陷，我们应用课程有意义学习：首先，我们采用有意义的学习根据缺陷的严重程度自适应地合成增强数据，然后设计课程缺陷补救策略以逐步补救 LLM 的知识缺陷。实验表明，LaMer 能够高效、有效地诊断和弥补 LLM 中的知识缺陷，在七个非分布 (OOD) 推理和语言理解基准中改进了各种 LLM，仅使用 40% 的训练数据就取得了与基线相当的结果。LaMer 甚至超越了依赖标记数据集进行缺陷诊断的方法。在应用中，我们的无标记方法可以为高效的 LLM 开发提供有效的知识缺陷诊断工具。

Title: The Self-Contained Negation Test Set

Authors: David Kletz (Lattice, LLF - UMR7110, UPCité), Pascal Amsili (Lattice), Marie Candito (LLF UMR7110, UPCité)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11469
Pdf URL: https://arxiv.org/pdf/2408.11469
Copy Paste: [[2408.11469]] The Self-Contained Negation Test Set(https://arxiv.org/abs/2408.11469)
Keywords: language model
Abstract: Several methodologies have recently been proposed to evaluate the ability of Pretrained Language Models (PLMs) to interpret negation. In this article, we build on Gubelmann and Handschuh (2022), which studies the modification of PLMs' predictions as a function of the polarity of inputs, in English. Crucially, this test uses "self-contained" inputs ending with a masked position: depending on the polarity of a verb in the input, a particular token is either semantically ruled out or allowed at the masked position. By replicating the experiments of Gubelmann and Handschuh (2022), we have uncovered flaws that weaken the conclusions that can be drawn from this test. We thus propose an improved version, the Self-Contained Neg Test, which is more controlled, more systematic, and entirely based on examples forming minimal pairs varying only in the presence or absence of verbal negation in English. When applying our test to the roberta and bert base and large models, we show that only roberta-large shows trends that match the expectations, while bert-base is mostly insensitive to negation. For all the tested models though, in a significant number of test instances the top-1 prediction remains the token that is semantically forbidden by the context, which shows how much room for improvement remains for a proper treatment of the negation phenomenon.
摘要：最近提出了几种方法来评估预训练语言模型 (PLM) 解释否定的能力。在本文中，我们以 Gubelmann 和 Handschuh (2022) 为基础，他们研究了英语中 PLM 预测随输入极性变化的情况。至关重要的是，此测试使用以掩码位置结尾的“自包含”输入：根据输入中动词的极性，特定标记在语义上被排除或允许出现在掩码位置。通过复制 Gubelmann 和 Handschuh (2022) 的实验，我们发现了一些缺陷，这些缺陷削弱了可以从此测试中得出的结论。因此，我们提出了一个改进版本，即自包含否定测试，它更受控制、更系统，并且完全基于形成最小对的示例，这些示例仅在英语中存在或不存在动词否定时才有所不同。当将我们的测试应用于 roberta 和 bert 基础模型和大型模型时，我们发现只有 roberta-large 显示出符合预期的趋势，而 bert-base 对否定基本不敏感。不过，对于所有测试模型，在相当数量的测试实例中，top-1 预测仍然是上下文在语义上禁止的标记，这表明在正确处理否定现象方面还有多少改进空间。

Title: DocTabQA: Answering Questions from Long Documents Using Tables

Authors: Haochen Wang, Kai Hu, Haoyu Dong, Liangcai Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11490
Pdf URL: https://arxiv.org/pdf/2408.11490
Copy Paste: [[2408.11490]] DocTabQA: Answering Questions from Long Documents Using Tables(https://arxiv.org/abs/2408.11490)
Keywords: language model, gpt, llm
Abstract: We study a new problem setting of question answering (QA), referred to as DocTabQA. Within this setting, given a long document, the goal is to respond to questions by organizing the answers into structured tables derived directly from the document's content. Unlike traditional QA approaches which predominantly rely on unstructured text to formulate responses, DocTabQA aims to leverage structured tables as answers to convey information clearly and systematically, thereby enhancing user comprehension and highlighting relationships between data points. To the best of our knowledge, this problem has not been previously explored. In this paper, we introduce the QTabA dataset, encompassing 300 financial documents, accompanied by manually annotated 1.5k question-table pairs. Initially, we leverage Large Language Models (LLMs) such as GPT-4 to establish a baseline. However, it is widely acknowledged that LLMs encounter difficulties when tasked with generating intricate, structured outputs from long input sequences. To overcome these challenges, we present a two-stage framework, called DocTabTalk, which initially retrieves relevant sentences from extensive documents and subsequently generates hierarchical tables based on these identified sentences. DocTabTalk incorporates two key technological innovations: AlignLLaMA and TabTalk, which are specifically tailored to assist GPT-4 in tackling DocTabQA, enabling it to generate well-structured, hierarchical tables with improved organization and clarity. Comprehensive experimental evaluations conducted on both QTabA and RotoWire datasets demonstrate that our DocTabTalk significantly enhances the performances of the GPT-4 in our proposed DocTabQA task and the table generation task. The code and dataset are available at this https URL for further research.
摘要：我们研究了一种新的问答 (QA) 问题设置，称为 DocTabQA。在此设置中，给定一个长文档，目标是通过将答案组织成直接来自文档内容的结构化表格来回答问题。与主要依赖非结构化文本来制定响应的传统 QA 方法不同，DocTabQA 旨在利用结构化表格作为答案来清晰、系统地传达信息，从而增强用户理解并突出数据点之间的关系。据我们所知，这个问题以前从未被探索过。在本文中，我们介绍了 QTabA 数据集，其中包含 300 份财务文件，并附有手动注释的 1.5k 个问题表对。最初，我们利用 GPT-4 等大型语言模型 (LLM) 来建立基线。然而，人们普遍认为，当 LLM 被要求从长输入序列生成复杂的结构化输出时，它会遇到困难。为了克服这些挑战，我们提出了一个两阶段框架，称为 DocTabTalk，该框架首先从大量文档中检索相关句子，然后根据这些已识别的句子生成分层表。DocTabTalk 结合了两项关键技术创新：AlignLLaMA 和 TabTalk，它们专门用于帮助 GPT-4 处理 DocTabQA，使其能够生成结构良好、层次分明、组织性和清晰度更高的表格。在 QTabA 和 RotoWire 数据集上进行的全面实验评估表明，我们的 DocTabTalk 显著提高了 GPT-4 在我们提出的 DocTabQA 任务和表格生成任务中的表现。代码和数据集可在此 https URL 上获取，以供进一步研究。

Title: IKUN for WMT24 General MT Task: LLMs Are here for Multilingual Machine Translation

Authors: Baohao Liao, Christian Herold, Shahram Khadivi, Christof Monz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11512
Pdf URL: https://arxiv.org/pdf/2408.11512
Copy Paste: [[2408.11512]] IKUN for WMT24 General MT Task: LLMs Are here for Multilingual Machine Translation(https://arxiv.org/abs/2408.11512)
Keywords: language model, llm
Abstract: This paper introduces two multilingual systems, IKUN and IKUN-C, developed for the general machine translation task in WMT24. IKUN and IKUN-C represent an open system and a constrained system, respectively, built on Llama-3-8b and Mistral-7B-v0.3. Both systems are designed to handle all 11 language directions using a single model. According to automatic evaluation metrics, IKUN-C achieved 6 first-place and 3 second-place finishes among all constrained systems, while IKUN secured 1 first-place and 2 second-place finishes across both open and constrained systems. These encouraging results suggest that large language models (LLMs) are nearing the level of proficiency required for effective multilingual machine translation. The systems are based on a two-stage approach: first, continuous pre-training on monolingual data in 10 languages, followed by fine-tuning on high-quality parallel data for 11 language directions. The primary difference between IKUN and IKUN-C lies in their monolingual pre-training strategy. IKUN-C is pre-trained using constrained monolingual data, whereas IKUN leverages monolingual data from the OSCAR dataset. In the second phase, both systems are fine-tuned on parallel data sourced from NTREX, Flores, and WMT16-23 for all 11 language pairs.
摘要：本文介绍了两个多语言系统 IKUN 和 IKUN-C，它们都是为 WMT24 中的通用机器翻译任务而开发的。IKUN 和 IKUN-C 分别代表基于 Llama-3-8b 和 Mistral-7B-v0.3 构建的开放系统和受限系统。这两个系统都旨在使用单个模型处理所有 11 个语言方向。根据自动评估指标，IKUN-C 在所有受限系统中获得了 6 个第一名和 3 个第二名，而 IKUN 在开放和受限系统中均获得了 1 个第一名和 2 个第二名。这些令人鼓舞的结果表明，大型语言模型 (LLM) 已接近有效多语言机器翻译所需的熟练程度。这两个系统基于两阶段方法：首先，对 10 种语言的单语数据进行持续预训练，然后对 11 个语言方向的高质量并行数据进行微调。 IKUN 和 IKUN-C 的主要区别在于它们的单语预训练策略。IKUN-C 使用受限单语数据进行预训练，而 IKUN 利用 OSCAR 数据集中的单语数据。在第二阶段，两个系统都针对所有 11 个语言对，使用来自 NTREX、Flores 和 WMT16-23 的并行数据进行微调。

Title: Imagining from Images with an AI Storytelling Tool

Authors: Edirlei Soares de Lima, Marco A. Casanova, Antonio L. Furtado
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11517
Pdf URL: https://arxiv.org/pdf/2408.11517
Copy Paste: [[2408.11517]] Imagining from Images with an AI Storytelling Tool(https://arxiv.org/abs/2408.11517)
Keywords: gpt
Abstract: A method for generating narratives by analyzing single images or image sequences is presented, inspired by the time-immemorial tradition of Narrative Art. The proposed method explores the multimodal capabilities of GPT-4o to interpret visual content and create engaging stories, which are illustrated by a Stable Diffusion XL model. The method is supported by a fully implemented tool, called ImageTeller, which accepts images from diverse sources as input. Users can guide the narrative's development according to the conventions of fundamental genres (such as Comedy, Romance, Tragedy, Satire or Mystery), opt to generate data-driven stories, or leave the prototype free to decide how to handle the narrative structure. User interaction is provided along the generation process, allowing the user to request alternative chapters or illustrations, and even reject and restart the story generation based on the same input. Additionally, users can attach captions to the input images, influencing the system's interpretation of the visual content. Examples of generated stories are provided, along with details on how to access the prototype.
摘要：受叙事艺术的古老传统的启发，本文介绍了一种通过分析单个图像或图像序列来生成叙事的方法。所提出的方法探索了 GPT-4o 解释视觉内容和创作引人入胜的故事的多模态能力，这些能力由稳定扩散 XL 模型说明。该方法由一个完全实现的工具 ImageTeller 支持，该工具接受来自不同来源的图像作为输入。用户可以根据喜剧、浪漫、悲剧、讽刺或神秘等基本类型的惯例来指导叙事的发展，选择生成数据驱动的故事，或者让原型自由决定如何处理叙事结构。在生成过程中提供用户交互，允许用户请求替代章节或插图，甚至根据相同的输入拒绝并重新开始故事生成。此外，用户可以将标题附加到输入图像，从而影响系统对视觉内容的解释。提供了生成的故事示例以及如何访问原型的详细信息。

Title: Memorization In In-Context Learning

Authors: Shahriar Golchin, Mihai Surdeanu, Steven Bethard, Eduardo Blanco, Ellen Riloff
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11546
Pdf URL: https://arxiv.org/pdf/2408.11546
Copy Paste: [[2408.11546]] Memorization In In-Context Learning(https://arxiv.org/abs/2408.11546)
Keywords: language model, llm
Abstract: In-context learning (ICL) has proven to be an effective strategy for improving the performance of large language models (LLMs) with no additional training. However, the exact mechanism behind these performance improvements remains unclear. This study is the first to show how ICL surfaces memorized training data and to explore the correlation between this memorization and performance across various ICL regimes: zero-shot, few-shot, and many-shot. Our most notable findings include: (1) ICL significantly surfaces memorization compared to zero-shot learning in most cases; (2) demonstrations, without their labels, are the most effective element in surfacing memorization; (3) ICL improves performance when the surfaced memorization in few-shot regimes reaches a high level (about 40%); and (4) there is a very strong correlation between performance and memorization in ICL when it outperforms zero-shot learning. Overall, our study uncovers a hidden phenomenon -- memorization -- at the core of ICL, raising an important question: to what extent do LLMs truly generalize from demonstrations in ICL, and how much of their success is due to memorization?
摘要：事实证明，情境学习 (ICL) 是一种无需额外训练即可提高大型语言模型 (LLM) 性能的有效策略。然而，这些性能改进背后的确切机制仍不清楚。这项研究首次展示了 ICL 如何浮现记忆的训练数据，并探索了这种记忆与各种 ICL 方案（零样本、小样本和多样本）中性能之间的相关性。我们最值得注意的发现包括：（1）在大多数情况下，与零样本学习相比，ICL 可以显著浮现记忆；（2）没有标签的演示是浮现记忆的最有效元素；（3）当小样本方案中的浮现记忆达到高水平（约 40% ）时，ICL 会提高性能；（4）当 ICL 的表现优于零样本学习时，性能和记忆之间存在非常强的相关性。总的来说，我们的研究揭示了 ICL 核心中一个隐藏的现象——记忆，这引出了一个重要的问题：LLM 在多大程度上真正从 ICL 中的演示中概括出来，以及它们的成功有多少归功于记忆？

Title: Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks

Authors: Ziqiang Li, Yueqi Zeng, Pengfei Xia, Lei Liu, Zhangjie Fu, Bin Li
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2408.11587
Pdf URL: https://arxiv.org/pdf/2408.11587
Copy Paste: [[2408.11587]] Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks(https://arxiv.org/abs/2408.11587)
Keywords: language model, llm
Abstract: With the burgeoning advancements in the field of natural language processing (NLP), the demand for training data has increased significantly. To save costs, it has become common for users and businesses to outsource the labor-intensive task of data collection to third-party entities. Unfortunately, recent research has unveiled the inherent risk associated with this practice, particularly in exposing NLP systems to potential backdoor attacks. Specifically, these attacks enable malicious control over the behavior of a trained model by poisoning a small portion of the training data. Unlike backdoor attacks in computer vision, textual backdoor attacks impose stringent requirements for attack stealthiness. However, existing attack methods face a significant trade-off between effectiveness and stealthiness, largely due to the high information entropy inherent in textual data. In this paper, we introduce the Efficient and Stealthy Textual backdoor attack method, EST-Bad, leveraging Large Language Models (LLMs). Our EST-Bad encompasses three core strategies: optimizing the inherent flaw of models as the trigger, stealthily injecting triggers with LLMs, and meticulously selecting the most impactful samples for backdoor injection. Through the integration of these techniques, EST-Bad efficiently achieves competitive attack performance while maintaining superior stealthiness compared to prior methods across various text classifier datasets.
摘要：随着自然语言处理 (NLP) 领域的蓬勃发展，对训练数据的需求显著增加。为了节省成本，用户和企业通常会将劳动密集型的数据收集任务外包给第三方实体。不幸的是，最近的研究揭示了这种做法所固有的风险，特别是在将 NLP 系统暴露于潜在的后门攻击方面。具体来说，这些攻击通过毒害一小部分训练数据来恶意控制训练模型的行为。与计算机视觉中的后门攻击不同，文本后门攻击对攻击的隐蔽性提出了严格的要求。然而，现有的攻击方法在有效性和隐蔽性之间需要权衡利弊，这主要是由于文本数据固有的高信息熵。在本文中，我们介绍了一种利用大型语言模型 (LLM) 的高效、隐蔽的文本后门攻击方法 EST-Bad。我们的 EST-Bad 包含三个核心策略：优化模型的固有缺陷作为触发器、使用 LLM 隐秘地注入触发器以及精心选择最具影响力的样本进行后门注入。通过整合这些技术，EST-Bad 展示了一种高效实现具有竞争力的攻击性能的方法，同时与各种文本分类器数据集中的先前方法相比，保持了卓越的隐秘性。

Title: Cause-Aware Empathetic Response Generation via Chain-of-Thought Fine-Tuning

Authors: Xinhao Chen, Chong Yang, Man Lan, Li Cai, Yang Chen, Tu Hu, Xinlin Zhuang, Aimin Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11599
Pdf URL: https://arxiv.org/pdf/2408.11599
Copy Paste: [[2408.11599]] Cause-Aware Empathetic Response Generation via Chain-of-Thought Fine-Tuning(https://arxiv.org/abs/2408.11599)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Empathetic response generation endows agents with the capability to comprehend dialogue contexts and react to expressed emotions. Previous works predominantly focus on leveraging the speaker's emotional labels, but ignore the importance of emotion cause reasoning in empathetic response generation, which hinders the model's capacity for further affective understanding and cognitive inference. In this paper, we propose a cause-aware empathetic generation approach by integrating emotions and causes through a well-designed Chain-of-Thought (CoT) prompt on Large Language Models (LLMs). Our approach can greatly promote LLMs' performance of empathy by instruction tuning and enhancing the role awareness of an empathetic listener in the prompt. Additionally, we propose to incorporate cause-oriented external knowledge from COMET into the prompt, which improves the diversity of generation and alleviates conflicts between internal and external knowledge at the same time. Experimental results on the benchmark dataset demonstrate that our approach on LLaMA-7b achieves state-of-the-art performance in both automatic and human evaluations.
摘要：共情反应生成赋予代理理解对话背景并对表达的情绪做出反应的能力。以前的研究主要关注利用说话者的情绪标签，但忽略了情绪原因推理在共情反应生成中的重要性，这阻碍了模型进一步进行情感理解和认知推理的能力。在本文中，我们提出了一种原因感知的共情生成方法，通过在大型语言模型 (LLM) 上精心设计的思路链 (CoT) 提示整合情绪和原因。我们的方法可以通过指令调整和增强提示中共情听众的角色意识来极大地提升 LLM 的共情表现。此外，我们建议将 COMET 中以原因为导向的外部知识纳入提示中，这提高了生成的多样性，同时缓解了内部和外部知识之间的冲突。基准数据集上的实验结果表明，我们在 LLaMA-7b 上的方法在自动和人工评估中都实现了最先进的性能。

Title: Xinyu: An Efficient LLM-based System for Commentary Generation

Authors: Yiquan Wu, Bo Tang, Chenyang Xi, Yu Yu, Pengyu Wang, Yifei Liu, Kun Kuang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Jie Hu, Peng Cheng, Zhonghao Wang, Yi Wang, Yi Luo, Mingchuan Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11609
Pdf URL: https://arxiv.org/pdf/2408.11609
Copy Paste: [[2408.11609]] Xinyu: An Efficient LLM-based System for Commentary Generation(https://arxiv.org/abs/2408.11609)
Keywords: language model, llm, retrieval augmented generation
Abstract: Commentary provides readers with a deep understanding of events by presenting diverse arguments and evidence. However, creating commentary is a time-consuming task, even for skilled commentators. Large language models (LLMs) have simplified the process of natural language generation, but their direct application in commentary creation still faces challenges due to unique task requirements. These requirements can be categorized into two levels: 1) fundamental requirements, which include creating well-structured and logically consistent narratives, and 2) advanced requirements, which involve generating quality arguments and providing convincing evidence. In this paper, we introduce Xinyu, an efficient LLM-based system designed to assist commentators in generating Chinese commentaries. To meet the fundamental requirements, we deconstruct the generation process into sequential steps, proposing targeted strategies and supervised fine-tuning (SFT) for each step. To address the advanced requirements, we present an argument ranking model for arguments and establish a comprehensive evidence database that includes up-to-date events and classic books, thereby strengthening the substantiation of the evidence with retrieval augmented generation (RAG) technology. To evaluate the generated commentaries more fairly, corresponding to the two-level requirements, we introduce a comprehensive evaluation metric that considers five distinct perspectives in commentary generation. Our experiments confirm the effectiveness of our proposed system. We also observe a significant increase in the efficiency of commentators in real-world scenarios, with the average time spent on creating a commentary dropping from 4 hours to 20 minutes. Importantly, such an increase in efficiency does not compromise the quality of the commentaries.
摘要：评论通过呈现各种论据和证据，帮助读者深入了解事件。然而，即使对于熟练的评论员来说，撰写评论也是一项耗时的任务。大型语言模型 (LLM) 简化了自然语言生成的过程，但由于独特的任务要求，将其直接应用于评论创作仍然面临挑战。这些要求可以分为两个层次：1）基本要求，包括创建结构良好且逻辑一致的叙述；2）高级要求，包括生成高质量的论据并提供令人信服的证据。在本文中，我们介绍了 Xinyu，这是一个高效的基于 LLM 的系统，旨在帮助评论员生成中文评论。为了满足基本要求，我们将生成过程解构为连续步骤，为每个步骤提出有针对性的策略和监督微调 (SFT)。为了满足高级要求，我们提出了一个论据排序模型，并建立了一个包含最新事件和经典书籍的综合证据数据库，从而通过检索增强生成 (RAG) 技术加强了证据的证实。为了更公平地评估生成的评论，与两级要求相对应，我们引入了一个综合评估指标，该指标考虑了评论生成的五个不同角度。我们的实验证实了我们提出的系统的有效性。我们还观察到评论员在实际场景中的效率显著提高，创建评论的平均时间从 4 小时缩短到 20 分钟。重要的是，这种效率的提高不会损害评论的质量。

Title: FocusLLM: Scaling LLM's Context by Parallel Decoding

Authors: Zhenyu Li, Yike Zhang, Tengyu Pan, Yutao Sun, Zhichao Duan, Junjie Fang, Rong Han, Zixuan Wang, Jianyong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11745
Pdf URL: https://arxiv.org/pdf/2408.11745
Copy Paste: [[2408.11745]] FocusLLM: Scaling LLM's Context by Parallel Decoding(https://arxiv.org/abs/2408.11745)
Keywords: language model, llm, long context, prompt
Abstract: Empowering LLMs with the ability to utilize useful information from a long context is crucial for many downstream applications. However, achieving long context lengths with the conventional transformer architecture requires substantial training and inference resources. In this paper, we present FocusLLM, a framework designed to extend the context length of any decoder-only LLM, enabling the model to focus on relevant information from very long sequences. FocusLLM processes long text inputs by dividing them into chunks based on the model's original context length to alleviate the issue of attention distraction. Then, it appends the local context to each chunk as a prompt to extract essential information from each chunk based on a novel parallel decoding mechanism, and ultimately integrates the extracted information into the local context. FocusLLM stands out for great training efficiency and versatility: trained with an 8K input length with much less training cost than previous methods, FocusLLM exhibits superior performance across downstream long-context tasks and maintains strong language modeling ability when handling extensive long texts, even up to 400K tokens. Our code is available at this https URL.
摘要：对于许多下游应用来说，让 LLM 能够利用长上下文中的有用信息至关重要。然而，使用传统的 Transformer 架构实现长上下文长度需要大量的训练和推理资源。在本文中，我们介绍了 FocusLLM，这是一个旨在扩展任何仅解码器的 LLM 的上下文长度的框架，使模型能够从非常长的序列中关注相关信息。FocusLLM 通过根据模型的原始上下文长度将长文本输入分成块来处理长文本输入，以缓解注意力分散的问题。然后，它将本地上下文附加到每个块作为提示，基于一种新颖的并行解码机制从每个块中提取重要信息，并最终将提取的信息集成到本地上下文中。FocusLLM 以出色的训练效率和多功能性脱颖而出：使用 8K 输入长度进行训练，训练成本比以前的方法低得多，FocusLLM 在下游长上下文任务中表现出色，并在处理大量长文本（甚至高达 400K 个 token）时保持强大的语言建模能力。我们的代码可在此 https URL 上找到。
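
Note: the chunking step described in the abstract above can be pictured with a small sketch. It only shows how a long token sequence might be cut into chunks with the same local context appended to each; it does not implement the parallel decoding or the integration step. The function name, the choice of the final `local_len` tokens as the local context, and the chunk size of `context_len - local_len` are all assumptions made for illustration, not details from the paper.

```python
def split_with_local_context(tokens, context_len, local_len):
    """Cut a long sequence into chunks and append the local context to each.

    tokens      : list of token ids (any list works for this sketch)
    context_len : the model's original context length (assumed budget per chunk)
    local_len   : number of trailing tokens treated as the local context
    """
    if not 0 < local_len < context_len:
        raise ValueError("local_len must be positive and smaller than context_len")
    if len(tokens) <= local_len:
        return [list(tokens)]  # nothing to chunk; the input is only local context

    local = tokens[-local_len:]           # most recent tokens, shared by all chunks
    memory = tokens[:-local_len]          # the long prefix to be chunked
    chunk_len = context_len - local_len   # leave room so chunk + local fits the budget

    chunks = [memory[i:i + chunk_len] for i in range(0, len(memory), chunk_len)]
    # Each returned item could be processed independently, hence in parallel.
    return [chunk + local for chunk in chunks]


if __name__ == "__main__":
    for piece in split_with_local_context(list(range(10)), context_len=5, local_len=2):
        print(piece)
```

Because every piece stays within `context_len`, each can be fed to the unmodified model; how the per-chunk outputs are then merged is the part the abstract attributes to the framework's parallel decoding mechanism.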

Title: Against All Odds: Overcoming Typology, Script, and Language Confusion in Multilingual Embedding Inversion Attacks

Authors: Yiyi Chen, Russa Biswas, Heather Lent, Johannes Bjerva
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2408.11749
Pdf URL: https://arxiv.org/pdf/2408.11749
Copy Paste: [[2408.11749]] Against All Odds: Overcoming Typology, Script, and Language Confusion in Multilingual Embedding Inversion Attacks(https://arxiv.org/abs/2408.11749)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are susceptible to malicious influence by cyber attackers through intrusions such as adversarial, backdoor, and embedding inversion attacks. In response, the burgeoning field of LLM Security aims to study and defend against such threats. Thus far, the majority of works in this area have focused on monolingual English models; however, emerging research suggests that multilingual LLMs may be more vulnerable to various attacks than their monolingual counterparts. While previous work has investigated embedding inversion over a small subset of European languages, it is challenging to extrapolate these findings to languages from different linguistic families and with differing scripts. To this end, we explore the security of multilingual LLMs in the context of embedding inversion attacks and investigate cross-lingual and cross-script inversion across 20 languages, spanning over 8 language families and 12 scripts. Our findings indicate that languages written in Arabic script and Cyrillic script are particularly vulnerable to embedding inversion, as are languages within the Indo-Aryan language family. We further observe that inversion models tend to suffer from language confusion, sometimes greatly reducing the efficacy of an attack. Accordingly, we systematically explore this bottleneck for inversion models, uncovering predictable patterns which could be leveraged by attackers. Ultimately, this study aims to further the field's understanding of the outstanding security vulnerabilities facing multilingual LLMs and raise awareness for the languages most at risk of negative impact from these attacks.
摘要：大型语言模型 (LLM) 容易受到网络攻击者通过诸如对抗性、后门和嵌入反转攻击等入侵方式的恶意影响。为此，新兴的 LLM 安全领域旨在研究和防御此类威胁。到目前为止，该领域的大多数研究都集中在单语英语模型上，然而，新兴研究表明，多语言 LLM 可能比单语模型更容易受到各种攻击。虽然之前的研究已经研究了一小部分欧洲语言的嵌入反转，但很难将这些发现推广到来自不同语系和不同脚本的语言。为此，我们在嵌入反转攻击的背景下探索了多语言 LLM 的安全性，并研究了 20 种语言的跨语言和跨脚本反转，涵盖 8 个语系和 12 个脚本。我们的研究结果表明，用阿拉伯文和西里尔文书写的语言特别容易受到嵌入反转的影响，印度-雅利安语系中的语言也是如此。我们进一步观察到，反转模型往往会受到语言混淆的影响，有时会大大降低攻击的有效性。因此，我们系统地探索了反转模型的这一瓶颈，发现了攻击者可以利用的可预测模式。最终，这项研究旨在进一步加深该领域对多语言 LLM 面临的突出安全漏洞的了解，并提高人们对最有可能受到这些攻击负面影响的语言的认识。

Title: Leveraging Fine-Tuned Retrieval-Augmented Generation with Long-Context Support: For 3GPP Standards

Authors: Omar Erak, Nouf Alabbasi, Omar Alhussein, Ismail Lotfi, Amr Hussein, Sami Muhaidat, Merouane Debbah
Subjects: cs.CL, cs.NI
Abstract URL: https://arxiv.org/abs/2408.11775
Pdf URL: https://arxiv.org/pdf/2408.11775
Copy Paste: [[2408.11775]] Leveraging Fine-Tuned Retrieval-Augmented Generation with Long-Context Support: For 3GPP Standards(https://arxiv.org/abs/2408.11775)
Keywords: language model, gpt, llm, retrieval-augmented generation, agent
Abstract: Recent studies show that large language models (LLMs) struggle with technical standards in telecommunications. We propose a fine-tuned retrieval-augmented generation (RAG) system based on the Phi-2 small language model (SLM) to serve as an oracle for communication networks. Our developed system leverages forward-looking semantic chunking to adaptively determine parsing breakpoints based on embedding similarity, enabling effective processing of diverse document formats. To handle the challenge of multiple similar contexts in technical standards, we employ a re-ranking algorithm to prioritize the most relevant retrieved chunks. Recognizing the limitations of Phi-2's small context window, we implement a recent technique, namely SelfExtend, to expand the context window during inference, which not only boosts the performance but also can accommodate a wider range of user queries and design requirements from customers to specialized technicians. For fine-tuning, we utilize the low-rank adaptation (LoRA) technique to enhance computational efficiency during training and enable effective fine-tuning on small datasets. Our comprehensive experiments demonstrate substantial improvements over existing question-answering approaches in the telecom domain, achieving performance that exceeds larger language models such as GPT-4 (which is about 880 times larger in size). This work presents a novel approach to leveraging SLMs for communication networks, offering a balance of efficiency and performance. This work can serve as a foundation towards agentic language models for networks.

Title: Personality Alignment of Large Language Models

Authors: Minjun Zhu, Linyi Yang, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11779
Pdf URL: https://arxiv.org/pdf/2408.11779
Copy Paste: [[2408.11779]] Personality Alignment of Large Language Models(https://arxiv.org/abs/2408.11779)
Keywords: language model, llm
Abstract: Current methods for aligning large language models (LLMs) typically aim to reflect general human values and behaviors, but they often fail to capture the unique characteristics and preferences of individual users. To address this gap, we introduce the concept of Personality Alignment. This approach tailors LLMs' responses and decisions to match the specific preferences of individual users or closely related groups. Inspired by psychometrics, we created the Personality Alignment with Personality Inventories (PAPI) dataset, which includes data from 300,000 real subjects, each providing behavioral preferences based on the Big Five Personality Factors. This dataset allows us to quantitatively evaluate the extent to which LLMs can align with each subject's behavioral patterns. Recognizing the challenges of personality alignment, such as limited personal data, diverse preferences, and scalability requirements, we developed an activation intervention optimization method. This method enhances LLMs' ability to efficiently align with individual behavioral preferences using minimal data and computational resources. Remarkably, our method, PAS, achieves superior performance while requiring only 1/5 of the optimization time compared to DPO, offering practical value for personality alignment. Our work paves the way for future AI systems to make decisions and reason in truly personalized ways, enhancing the relevance and meaning of AI interactions for each user and advancing human-centered artificial intelligence. The code has been released at this https URL.

Title: LLM Pruning and Distillation in Practice: The Minitron Approach

Authors: Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.11796
Pdf URL: https://arxiv.org/pdf/2408.11796
Copy Paste: [[2408.11796]] LLM Pruning and Distillation in Practice: The Minitron Approach(https://arxiv.org/abs/2408.11796)
Keywords: llm
Abstract: We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.

Title: PermitQA: A Benchmark for Retrieval Augmented Generation in Wind Siting and Permitting domain

Authors: Rounak Meyur, Hung Phan, Sridevi Wagle, Jan Strube, Mahantesh Halappanavar, Sameera Horawalavithana, Anurag Acharya, Sai Munikoti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11800
Pdf URL: https://arxiv.org/pdf/2408.11800
Copy Paste: [[2408.11800]] PermitQA: A Benchmark for Retrieval Augmented Generation in Wind Siting and Permitting domain(https://arxiv.org/abs/2408.11800)
Keywords: language model, llm, retrieval augmented generation
Abstract: In the rapidly evolving landscape of Natural Language Processing (NLP) and text generation, the emergence of Retrieval Augmented Generation (RAG) presents a promising avenue for improving the quality and reliability of generated text by leveraging information retrieved from a user-specified database. Benchmarking is essential to evaluate and compare the performance of the different RAG configurations in terms of retriever and generator, providing insights into their effectiveness, scalability, and suitability for the specific domain and applications. In this paper, we present a comprehensive framework to generate a domain-relevant RAG benchmark. Our framework is based on automatic question-answer generation with Human (domain experts)-AI Large Language Model (LLM) teaming. As a case study, we demonstrate the framework by introducing PermitQA, a first-of-its-kind benchmark on the wind siting and permitting domain, which comprises multiple scientific documents/reports related to the environmental impact of wind energy projects. Our framework systematically evaluates RAG performance using diverse metrics and multiple question types with varying complexity levels. We also demonstrate the performance of different models on our benchmark.

Title: Great Memory, Shallow Reasoning: Limits of $k$NN-LMs

Authors: Shangyi Geng, Wenting Zhao, Alexander M Rush
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11815
Pdf URL: https://arxiv.org/pdf/2408.11815
Copy Paste: [[2408.11815]] Great Memory, Shallow Reasoning: Limits of $k$NN-LMs(https://arxiv.org/abs/2408.11815)
Keywords: language model
Abstract: $K$-nearest neighbor language models ($k$NN-LMs), which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling as well as downstream NLP benchmarks. These results have led researchers to argue that models trained on poor quality or outdated data could perform well by employing a $k$NN extension that has access to a higher-quality datastore. In this work, we ask whether this improved ability to recall information really translates into downstream abilities. We extensively evaluate $k$NN-LMs on a diverse set of tasks, ranging from sentiment classification and commonsense reasoning to multi-hop reasoning. Results show that $k$NN-LMs excel at memory-intensive tasks, where utilizing the patterns in the input is sufficient for determining the output, but struggle with reasoning tasks that require integrating multiple pieces of information to derive new knowledge. We further demonstrate through oracle experiments and qualitative analysis that even with perfect retrieval, $k$NN-LMs still fail to determine the correct answers, placing an upper bound on their reasoning performance. Code and datastores are released at this https URL.