2024-07-11

Title: Nash CoT: Multi-Path Inference with Preference Equilibrium

Authors: Ziqi Zhang, Cunxiang Wang, Xiong Xiao, Yue Zhang, Donglin Wang
Subjects: cs.CL, cs.AI, cs.GT, cs.LG
Abstract URL: https://arxiv.org/abs/2407.07099
Pdf URL: https://arxiv.org/pdf/2407.07099
Copy Paste: [[2407.07099]] Nash CoT: Multi-Path Inference with Preference Equilibrium(https://arxiv.org/abs/2407.07099)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-thought (CoT) prompting has emerged as a powerful technique for enhancing the reasoning capabilities of Large Language Models (LLMs) on complex problems. Among CoT-related studies, self-consistency (multi-path inference with answer filtering through voting), which generates multiple reasoning paths using the CoT framework and then selects the most frequently produced output, stands out as a concise yet competitive approach. While self-consistency has indeed led to improvements in LLM inference, the use of multi-path inference also escalates deployment costs. Therefore, maintaining the performance benefits of self-consistency inherited from multi-path inference while reducing the inference costs holds significant value. In this research, we conceptualize language decoding as a preference consensus game, constructing a bi-player gaming system within each local path, and introduce Nash Chain-of-Thought (Nash CoT). Specifically, for a given question, we leverage the LLM to autonomously select the contextually relevant template and generate outputs guided by this template, aiming to reach Nash Equilibrium alongside normal generation in each path. This approach allows us to achieve comparable or improved performance compared to self-consistency while using fewer inference paths on various inference tasks, including arithmetic reasoning, commonsense question answering, and symbolic inference.
摘要：思路链 (CoT) 提示已成为一种强大的技术，可增强大型语言模型 (LLM) 在复杂问题上的推理能力。在与 CoT 相关的研究中，自洽性（通过投票过滤答案的多路径推理）涉及使用 CoT 框架生成多条推理路径，然后选择最常产生的输出，这是一种简洁而又具有竞争力的方法。虽然自洽性确实导致了 LLM 推理的改进，但使用多路径推理也会增加部署成本。因此，在降低推理成本的同时保持从多路径推理继承的自洽性的性能优势具有重要价值。在这项研究中，我们将语言解码概念化为偏好共识游戏，在每个本地路径内构建一个双人游戏系统，并引入纳什思路链 (Nash CoT)。具体来说，对于给定的问题，我们利用 LLM 自主选择上下文相关的模板并生成由该模板引导的输出，旨在达到纳什均衡，同时在每条路径上实现正常生成。这种方法使我们能够在各种推理任务（包括算术推理、常识性问答和符号推理）上使用更少的推理路径，同时实现与自洽性相当或更好的性能。
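
Note: the self-consistency baseline mentioned in this abstract (sample several reasoning paths, then keep the most frequent final answer) can be sketched roughly as follows. This is only an illustration of the voting step, not of Nash CoT itself; the function name `self_consistency_vote`, the answer normalization, and the sample answers are assumptions made for this example.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Pick the most frequent final answer among sampled reasoning paths.

    `answers` holds one final answer string per sampled path. The
    normalization (strip + lowercase) is an assumption for this sketch.
    Returns the winning answer and its share of the votes.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers extracted from five sampled paths.
sampled = ["18", "18", "20", "18 ", "17"]
best, share = self_consistency_vote(sampled)
print(best, share)
```

The cost concern raised in the abstract is visible here: every entry in `sampled` corresponds to one full LLM generation, so fewer paths means fewer model calls.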

Title: Identification of emotions on Twitter during the 2022 electoral process in Colombia

Authors: Juan Jose Iguaran Fernandez, Juan Manuel Perez, German Rosati
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.07258
Pdf URL: https://arxiv.org/pdf/2407.07258
Copy Paste: [[2407.07258]] Identification of emotions on Twitter during the 2022 electoral process in Colombia(https://arxiv.org/abs/2407.07258)
Keywords: gpt
Abstract: The study of Twitter as a means for analyzing social phenomena has gained interest in recent years due to the availability of large amounts of data in a relatively spontaneous environment. Within opinion-mining tasks, emotion detection is especially relevant, as it allows for the identification of people's subjective responses to different social events in a more granular way than traditional sentiment analysis based on polarity. In the particular case of political events, the analysis of emotions in social networks can provide valuable information on the perception of candidates, proposals, and other important aspects of the public debate. In spite of this importance, there are few studies on emotion detection in Spanish and, to the best of our knowledge, few public resources for opinion mining in Colombian Spanish, highlighting the need for generating resources addressing the specific cultural characteristics of this variety. In this work, we present a small corpus of tweets in Spanish related to the 2022 Colombian presidential elections, manually labeled with emotions using a fine-grained taxonomy. We perform classification experiments using supervised state-of-the-art models (BERT models) and compare them with GPT-3.5 in few-shot learning settings. We make our dataset and code publicly available for research purposes.
摘要：近年来，由于在相对自发的环境中可以获得大量数据，因此将 Twitter 作为分析社会现象的手段的研究引起了人们的兴趣。在意见挖掘任务中，情绪检测特别重要，因为它可以比基于极性的传统情绪分析更细致地识别人们对不同社会事件的主观反应。在政治事件的特定情况下，社交网络中的情绪分析可以提供有关候选人、提案和公共辩论其他重要方面的看法的宝贵信息。尽管情绪检测很重要，但关于西班牙语情绪检测的研究很少，而且据我们所知，哥伦比亚西班牙语的意见挖掘的公开资源很少，这凸显了生成针对这种多样性特定文化特征的资源的必要性。在这项工作中，我们展示了一个与 2022 年哥伦比亚总统选举有关的西班牙语推文小语料库，使用细粒度分类法手动标记情绪。我们使用监督式最新模型（BERT 模型）进行分类实验，并在小样本学习设置中将其与 GPT-3.5 进行比较。我们将数据集和代码公开用于研究目的。

Title: Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models

Authors: Jupinder Parmar, Sanjev Satheesh, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07263
Pdf URL: https://arxiv.org/pdf/2407.07263
Copy Paste: [[2407.07263]] Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models(https://arxiv.org/abs/2407.07263)
Keywords: language model
Abstract: As language models have scaled both their number of parameters and pretraining dataset sizes, the computational cost for pretraining has become intractable except for the most well-resourced teams. This increasing cost makes it ever more important to be able to reuse a model after it has completed pretraining, allowing for a model's abilities to further improve without needing to train from scratch. In this work, we detail a set of guidelines that cover how to design efficacious data distributions and learning rate schedules for continued pretraining of language models. When applying these findings within a continued pretraining run on top of a well-trained 15B parameter model, we show an improvement of 9% in average model accuracy compared to the baseline of continued training on the pretraining set. The resulting recipe provides a practical starting point with which to begin developing language models through reuse rather than retraining.
摘要：随着语言模型的参数数量和预训练数据集大小不断扩大，除了资源最丰富的团队外，预训练的计算成本已变得难以解决。这种不断增加的成本使得在完成预训练后能够重复使用模型变得越来越重要；允许模型的能力进一步提高，而无需从头开始训练。在这项工作中，我们详细介绍了一套指导方针，涵盖如何设计有效的数据分布和学习率计划以继续对语言模型进行预训练。当在训练有素的 15B 参数模型上对这些发现进行持续预训练时，我们发现与在预训练集上继续训练的基线相比，平均模型准确率提高了 9%。由此产生的配方提供了一个实用的起点，可以通过重用而不是重新训练来开始开发语言模型。

Title: ESM+: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models

Authors: Benjamin Ascoli, Ram Kandikonda, Jinho D. Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07313
Pdf URL: https://arxiv.org/pdf/2407.07313
Copy Paste: [[2407.07313]] ESM+: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models(https://arxiv.org/abs/2407.07313)
Keywords: language model, llm
Abstract: The task of Text-to-SQL enables anyone to retrieve information from SQL databases using natural language. Despite several challenges, recent models have made remarkable advancements in this task using large language models (LLMs). Interestingly, we find that LLM-based models without fine-tuning exhibit distinct natures compared to their fine-tuned counterparts, leading to inadequacies in current evaluation metrics to accurately convey their performance. Thus, we analyze the two primary metrics, Test Suite Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM), to examine their robustness for this task and address shortcomings. We compare the performance of 9 LLM-based models using EXE, the original ESM, and our improved ESM (called ESM+). Our results show that EXE and ESM have high false positive and negative rates of 11.3% and 13.9%, while ESM+ gives those of 0.1% and 2.6% respectively, providing a significantly more stable evaluation. We release the ESM+ script as open-source for the community to contribute, while enjoying a more reliable assessment of Text-to-SQL.
摘要：文本转 SQL 任务使任何人都可以使用自然语言从 SQL 数据库中检索信息。尽管面临诸多挑战，但最近的模型使用大型语言模型 (LLM) 在这一任务上取得了显著进步。有趣的是，我们发现未经微调的基于 LLM 的模型与经过微调的模型相比表现出不同的性质，导致当前评估指标不足以准确传达其性能。因此，我们分析了两个主要指标，即测试套件执行准确度 (EXE) 和精确集匹配准确度 (ESM)，以检查它们对于此任务的稳健性并解决缺点。我们使用 EXE、原始 ESM 和我们改进的 ESM（称为 ESM+）比较了 9 个基于 LLM 的模型的性能。我们的结果表明，EXE 和 ESM 的假阳性率和假阴性率高达 11.3% 和 13.9%，而 ESM+ 的假阳性率和假阴性率分别为 0.1% 和 2.6%，提供了明显更稳定的评估。我们将 ESM+ 脚本作为开源发布，供社区做出贡献，同时享受更可靠的 Text-to-SQL 评估。
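
Note: the false positive and false negative rates quoted in this abstract compare an automatic metric's verdicts with a reference judgment. A minimal sketch of that bookkeeping is shown below. It assumes both rates are reported as a fraction of all evaluated queries, which the abstract does not state; the function name and the sample verdicts are hypothetical.

```python
def metric_error_rates(metric_says_correct, reference_says_correct):
    """Compare an automatic metric's verdicts with reference judgments.

    Both arguments are equal-length lists of booleans, one per predicted
    SQL query. Assumption for this sketch: rates are taken over all
    evaluated queries, not over the positive or negative class only.
    """
    if len(metric_says_correct) != len(reference_says_correct):
        raise ValueError("verdict lists must have the same length")
    n = len(metric_says_correct)
    if n == 0:
        raise ValueError("need at least one evaluated query")
    pairs = list(zip(metric_says_correct, reference_says_correct))
    # False positive: metric accepts a query the reference rejects.
    false_pos = sum(1 for m, r in pairs if m and not r)
    # False negative: metric rejects a query the reference accepts.
    false_neg = sum(1 for m, r in pairs if r and not m)
    return false_pos / n, false_neg / n


# Hypothetical verdicts for four predicted queries.
fp_rate, fn_rate = metric_error_rates(
    [True, True, False, False],
    [True, False, True, False],
)
print(fp_rate, fn_rate)
```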

Title: RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension

Authors: Hung Phan, Anurag Acharya, Sarthak Chaturvedi, Shivam Sharma, Mike Parker, Dan Nally, Ali Jannesari, Karl Pazdernik, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07321
Pdf URL: https://arxiv.org/pdf/2407.07321
Copy Paste: [[2407.07321]] RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension(https://arxiv.org/abs/2407.07321)
Keywords: language model, gpt, llm, long context
Abstract: Large Language Models (LLMs) have been applied to many research problems across various domains. One of the applications of LLMs is providing question-answering systems that cater to users from different fields. The effectiveness of LLM-based question-answering systems has already been established at an acceptable level for users posing questions in popular and public domains such as trivia and literature. However, it has not often been established in niche domains that traditionally require specialized expertise. To this end, we construct the NEPAQuAD1.0 benchmark to evaluate the performance of three frontier LLMs (Claude Sonnet, Gemini, and GPT-4) when answering questions originating from Environmental Impact Statements prepared by U.S. federal government agencies in accordance with the National Environmental Policy Act (NEPA). We specifically measure the ability of LLMs to understand the nuances of legal, technical, and compliance-related information present in NEPA documents in different contextual scenarios. For example, we test the LLMs' internal prior NEPA knowledge by providing questions without any context, as well as assess how LLMs synthesize the contextual information present in long NEPA documents to facilitate the question-answering task. We compare the performance of the long-context LLMs and RAG-powered models in handling different types of questions (e.g., problem-solving, divergent). Our results suggest that RAG-powered models significantly outperform the long-context models in answer accuracy regardless of the choice of the frontier LLM. Our further analysis reveals that many models perform better answering closed questions than divergent and problem-solving questions.
摘要：大型语言模型 (LLM) 已应用于各个领域的许多研究问题。LLM 的应用之一是提供迎合不同领域用户的问答系统。基于 LLM 的问答系统的有效性已经为用户在热门和公共领域（例如琐事和文学）提出问题建立了可接受的水平。然而，在传统上需要专业知识的小众领域，它往往没有建立起来。为此，我们构建了 NEPAQuAD1.0 基准，以评估三个前沿 LLM（Claude Sonnet、Gemini 和 GPT-4）在回答美国联邦政府机构根据《国家环境政策法》（NEPA）准备的环境影响声明中的问题时的表现。我们特别衡量了 LLM 在不同情境下理解 NEPA 文件中存在的法律、技术和合规相关信息的细微差别的能力。例如，我们通过提供没有任何背景的问题来测试 LLM 的内部先前 NEPA 知识，并评估 LLM 如何综合长 NEPA 文档中存在的背景信息以促进问答任务。我们比较了长上下文 LLM 和 RAG 驱动模型在处理不同类型问题（例如，问题解决、发散）方面的表现。我们的结果表明，无论选择哪种前沿 LLM，RAG 驱动的模型在答案准确性方面都明显优于长上下文模型。我们的进一步分析表明，许多模型在回答封闭式问题方面的表现比回答发散性和解决问题的问题要好。

Title: Probability of Differentiation Reveals Brittleness of Homogeneity Bias in Large Language Models

Authors: Messi H.J. Lee, Calvin K. Lai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07329
Pdf URL: https://arxiv.org/pdf/2407.07329
Copy Paste: [[2407.07329]] Probability of Differentiation Reveals Brittleness of Homogeneity Bias in Large Language Models(https://arxiv.org/abs/2407.07329)
Keywords: language model, gpt, llm, prompt
Abstract: Homogeneity bias in Large Language Models (LLMs) refers to their tendency to homogenize the representations of some groups compared to others. Previous studies documenting this bias have predominantly used encoder models, which may have inadvertently introduced biases. To address this limitation, we prompted GPT-4 to generate single word/expression completions associated with 18 situation cues (specific, measurable elements of environments that influence how individuals perceive situations) and compared the variability of these completions using probability of differentiation. This approach directly assessed homogeneity bias from the model's outputs, bypassing encoder models. Across five studies, we find that homogeneity bias is highly volatile across situation cues and writing prompts, suggesting that the bias observed in past work may reflect those within encoder models rather than LLMs. Furthermore, these results suggest that homogeneity bias in LLMs is brittle, as even minor and arbitrary changes in prompts can significantly alter the expression of biases. Future work should further explore how variations in syntactic features and topic choices in longer text generations influence homogeneity bias in LLMs.
摘要：大型语言模型 (LLM) 中的同质性偏差是指它们倾向于将某些群体的表征与其他群体的表征同质化。记录这种偏差的先前研究主要使用编码器模型，这可能无意中引入了偏差。为了解决这一限制，我们提示 GPT-4 生成与 18 种情境线索（影响个人感知情境的特定、可衡量的环境元素）相关的单词/表达完成，并使用分化概率比较这些完成的可变性。这种方法绕过编码器模型，直接从模型的输出中评估同质性偏差。在五项研究中，我们发现同质性偏差在情境线索和写作提示中高度不稳定，这表明过去工作中观察到的偏差可能反映编码器模型而不是 LLM 中的偏差。此外，这些结果表明 LLM 中的同质性偏差很脆弱，因为即使是提示中的微小和任意变化也会显著改变偏差的表达。未来的工作应该进一步探索较长文本生成中的句法特征和主题选择的变化如何影响 LLM 中的同质性偏差。
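
Note: probability of differentiation is commonly defined as the chance that two completions drawn at random (with replacement) differ, i.e. 1 minus the sum of squared category proportions. The sketch below assumes that definition; the abstract does not spell out the exact formula or any normalization the paper may apply, and the function name and sample completions are hypothetical.

```python
from collections import Counter


def probability_of_differentiation(completions):
    """Return 1 - sum(p_i ** 2) over the distinct completions.

    p_i is the proportion of completions equal to the i-th distinct value.
    0.0 means every completion is identical (fully homogeneous); values
    closer to 1.0 mean more varied completions.
    """
    n = len(completions)
    if n == 0:
        raise ValueError("need at least one completion")
    counts = Counter(completions)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical single-word completions for one situation cue and one group.
print(probability_of_differentiation(["calm", "calm", "tense", "tense"]))
print(probability_of_differentiation(["calm", "calm", "calm", "calm"]))
```

Under this definition, a group whose completions score lower than another group's for the same cue would be the more homogenized one.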

Title: Interpretable Differential Diagnosis with Dual-Inference Large Language Models

Authors: Shuang Zhou, Sirui Ding, Jiashuo Wang, Mingquan Lin, Genevieve B. Melton, Rui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.07330
Pdf URL: https://arxiv.org/pdf/2407.07330
Copy Paste: [[2407.07330]] Interpretable Differential Diagnosis with Dual-Inference Large Language Models(https://arxiv.org/abs/2407.07330)
Keywords: language model, llm
Abstract: Methodological advancements to automate the generation of differential diagnosis (DDx) to predict a list of potential diseases as differentials given patients' symptom descriptions are critical to clinical reasoning and applications such as decision support. However, providing reasoning or interpretation for these differential diagnoses is more meaningful. Fortunately, large language models (LLMs) possess powerful language processing abilities and have been proven effective in various related tasks. Motivated by this potential, we investigate the use of LLMs for interpretable DDx. First, we develop a new DDx dataset with expert-derived interpretation on 570 public clinical notes. Second, we propose a novel framework, named Dual-Inf, that enables LLMs to conduct bidirectional inference for interpretation. Both human and automated evaluation demonstrate the effectiveness of Dual-Inf in predicting differentials and diagnosis explanations. Specifically, the performance improvement of Dual-Inf over the baseline methods exceeds 32% w.r.t. BERTScore in DDx interpretation. Furthermore, experiments verify that Dual-Inf (1) makes fewer errors in interpretation, (2) has great generalizability, (3) is promising for rare disease diagnosis and explanation.
摘要：方法学上的进步可以自动生成鉴别诊断 (DDx)，根据患者的症状描述预测一系列潜在疾病作为鉴别诊断，这对于临床推理和决策支持等应用至关重要。然而，为这些鉴别诊断提供推理或解释更有意义。幸运的是，大型语言模型 (LLM) 具有强大的语言处理能力，并且已被证明在各种相关任务中有效。受此潜力的启发，我们研究了 LLM 在可解释 DDx 中的应用。首先，我们开发了一个新的 DDx 数据集，其中包含专家对 570 份公共临床记录的解释。其次，我们提出了一个名为 Dual-Inf 的新框架，使 LLM 能够进行双向推理以进行解释。人工和自动评估都证明了 Dual-Inf 在预测鉴别诊断和诊断解释方面的有效性。具体而言，与基线方法相比，Dual-Inf 在 DDx 解释中相对于 BERTScore 的性能提升超过 32%。此外，实验证明，Dual-Inf (1) 在解释中错误更少，(2) 具有很强的通用性，(3) 在罕见疾病的诊断和解释方面很有前景。

Title: MixSumm: Topic-based Data Augmentation using LLMs for Low-resource Extractive Text Summarization

Authors: Gaurav Sahu, Issam H. Laradji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.07341
Pdf URL: https://arxiv.org/pdf/2407.07341
Copy Paste: [[2407.07341]] MixSumm: Topic-based Data Augmentation using LLMs for Low-resource Extractive Text Summarization(https://arxiv.org/abs/2407.07341)
Keywords: language model, gpt, llm, prompt
Abstract: Low-resource extractive text summarization is a vital but heavily underexplored area of research. Prior literature either focuses on abstractive text summarization or prompts a large language model (LLM) like GPT-3 directly to generate summaries. In this work, we propose MixSumm for low-resource extractive text summarization. Specifically, MixSumm prompts an open-source LLM, LLaMA-3-70b, to generate documents that mix information from multiple topics as opposed to generating documents without mixup, and then trains a summarization model on the generated dataset. We use ROUGE scores and L-Eval, a reference-free LLaMA-3-based evaluation method to measure the quality of generated summaries. We conduct extensive experiments on a challenging text summarization benchmark comprising the TweetSumm, WikiHow, and ArXiv/PubMed datasets and show that our LLM-based data augmentation framework outperforms recent prompt-based approaches for low-resource extractive summarization. Additionally, our results also demonstrate effective knowledge distillation from LLaMA-3-70b to a small BERT-based extractive summarizer.
摘要：低资源提取式文本摘要是一个重要但尚未得到充分探索的研究领域。先前的文献要么侧重于抽象文本摘要，要么直接提示大型语言模型 (LLM)（如 GPT-3）来生成摘要。在这项工作中，我们提出了用于低资源提取式文本摘要的 MixSumm。具体来说，MixSumm 提示开源 LLM LLaMA-3-70b 生成混合来自多个主题的信息的文档，而不是生成没有混合的文档，然后在生成的数据集上训练摘要模型。我们使用 ROUGE 分数和 L-Eval（一种基于无参考 LLaMA-3 的评估方法）来衡量生成的摘要的质量。我们在由 TweetSumm、WikiHow 和 ArXiv/PubMed 数据集组成的具有挑战性的文本摘要基准上进行了广泛的实验，并表明我们基于 LLM 的数据增强框架优于最近基于提示的低资源提取式摘要方法。此外，我们的结果还证明了从 LLaMA-3-70b 到基于 BERT 的小型提取摘要器的有效知识提炼。

Title: Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture

Authors: Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07342
Pdf URL: https://arxiv.org/pdf/2407.07342
Copy Paste: [[2407.07342]] Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture(https://arxiv.org/abs/2407.07342)
Keywords: language model, gpt, llm, prompt
Abstract: As safety remains a crucial concern throughout the development lifecycle of Large Language Models (LLMs), researchers and industrial practitioners have increasingly focused on safeguarding and aligning LLM behaviors with human preferences and ethical standards. LLMs, trained on extensive multilingual corpora, exhibit powerful generalization abilities across diverse languages and domains. However, current safety alignment practices predominantly focus on single-language scenarios, which leaves their effectiveness in complex multilingual contexts, especially for those complex mixed-language formats, largely unexplored. In this study, we introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various state-of-the-art LLMs (e.g., GPT-4o, GPT-3.5, Llama3) under sophisticated, multilingual conditions. We further investigate language patterns such as language availability, morphology, and language family that could impact the effectiveness of Multilingual Blending in compromising the safeguards of LLMs. Our experimental results show that, without meticulously crafted prompt templates, Multilingual Blending significantly amplifies the detriment of malicious queries, leading to dramatically increased bypass rates in LLM safety alignment (67.23% on GPT-3.5 and 40.34% on GPT-4o), far exceeding those of single-language baselines. Moreover, the performance of Multilingual Blending varies notably based on intrinsic linguistic properties, with languages of different morphology and from diverse families being more prone to evading safety alignments. These findings underscore the necessity of evaluating LLMs and developing corresponding safety alignment strategies in a complex, multilingual context to align with their superior cross-language generalization capabilities.
摘要：由于安全性在大型语言模型 (LLM) 的整个开发生命周期中始终是一个关键问题，研究人员和行业从业者越来越关注如何保护 LLM 行为，并使之与人类偏好和道德标准保持一致。在大量多语言语料库上训练的 LLM 表现出跨多种语言和领域的强大泛化能力。然而，当前的安全性协调实践主要侧重于单语言场景，这使得它们在复杂的多语言环境中的有效性，尤其是对于那些复杂的混合语言格式的有效性，在很大程度上尚未得到探索。在本研究中，我们引入了多语言混合，这是一种混合语言查询-响应方案，旨在评估各种最先进的 LLM（例如 GPT-4o、GPT-3.5、Llama3）在复杂的多语言条件下的安全性协调。我们进一步研究了语言可用性、形态和语系等语言模式，这些模式可能会影响多语言混合在损害 LLM 安全方面的有效性。我们的实验结果表明，在没有精心制作的提示模板的情况下，多语言混合会显著放大恶意查询的危害，导致 LLM 安全对齐的绕过率大幅增加（GPT-3.5 上为 67.23%，GPT-4o 上为 40.34%），远远超过单语言基线。此外，多语言混合的性能因内在语言特性而存在显著差异，不同形态和不同家族的语言更容易逃避安全对齐。这些发现强调了在复杂的多语言环境中评估 LLM 并开发相应的安全对齐策略的必要性，以与其卓越的跨语言泛化能力保持一致。

Title: LokiLM: Technical Report

Authors: Justin Kiefel, Shrey Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] LokiLM: Technical Report(https://arxiv.org/abs/)
Keywords: language model, hallucination
Abstract: In this work, we introduce LokiLM, a 1.4B parameter large language model trained on 500B tokens. Our model performs strongly in natural language reasoning tasks and achieves state-of-the-art performance among models with 1.5B parameters or less. LokiLM is trained using multi-teacher knowledge distillation and high-quality training data to achieve benchmark results competitive with larger models trained on significantly more tokens. We support these findings by introducing steps to avoid benchmark contamination and overfitting throughout our development process. Despite its promising performance, LokiLM exhibits a concerning amount of hallucinations and scores poorly on the TruthfulQA benchmark, so we do not release the model publicly.
摘要：在这项工作中，我们引入了 LokiLM，这是一个在 500B 个 token 上训练的 1.4B 参数大型语言模型。我们的模型在自然语言推理任务中表现出色，在参数为 1.5B 或更少的模型中取得了最佳性能。LokiLM 使用多教师知识提炼和高质量训练数据进行训练，以获得与在更多 token 上训练的大型模型相媲美的基准测试结果。我们通过在整个开发过程中引入避免基准测试污染和过度拟合的步骤来支持这些发现。尽管 LokiLM 的性能令人鼓舞，但它表现出令人担忧的幻觉数量，并且在 TruthfulQA 基准测试中得分很低，因此我们不会公开发布该模型。

Title: KpopMT: Translation Dataset with Terminology for Kpop Fandom

Authors: JiWoo Kim, Yunsu Kim, JinYeong Bak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07413
Pdf URL: https://arxiv.org/pdf/2407.07413
Copy Paste: [[2407.07413]] KpopMT: Translation Dataset with Terminology for Kpop Fandom(https://arxiv.org/abs/2407.07413)
Keywords: gpt
Abstract: While machines learn from existing corpora, humans have the unique capability to establish and accept new language systems. This leads humans to form unique language systems within social groups. Aligning with this, we focus on a gap remaining in addressing translation challenges within social groups, where in-group members utilize unique terminologies. We propose the KpopMT dataset, which aims to fill this gap by enabling precise terminology translation, choosing Kpop fandom as an initiative for social groups given its global popularity. Expert translators provide 1k English translations for Korean posts and comments, each annotated with specific terminology within social groups' language systems. We evaluate existing translation systems including GPT models on KpopMT to identify their failure cases. Results show overall low scores, underscoring the challenges of reflecting group-specific terminologies and styles in translation. We make KpopMT publicly available.
摘要：机器可以从现有语料库中学习，而人类则具有建立和接受新语言系统的独特能力。这使得人类在社会群体中形成了独特的语言系统。与此相符，我们专注于解决社会群体内翻译挑战的空白，其中群体内的成员使用独特的术语。我们提出了 KpopMT 数据集，旨在通过实现精确的术语翻译来填补这一空白，选择 Kpop 粉丝圈作为社会群体的一项举措，因为它在全球很受欢迎。专业翻译人员为韩语帖子和评论提供 1000 条英文翻译，每个都用社会群体语言系统中的特定术语进行注释。我们评估了 KpopMT 上的现有翻译系统（包括 GPT 模型），以确定它们的失败案例。结果显示总体得分较低，凸显了在翻译中反映特定群体的术语和风格的挑战。我们向公众开放 KpopMT。

Title: Review-LLM: Harnessing Large Language Models for Personalized Review Generation

Authors: Qiyao Peng, Hongtao Liu, Hongyan Xu, Qing Yang, Minglai Shao, Wenjun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07487
Pdf URL: https://arxiv.org/pdf/2407.07487
Copy Paste: [[2407.07487]] Review-LLM: Harnessing Large Language Models for Personalized Review Generation(https://arxiv.org/abs/2407.07487)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Product review generation is an important task in recommender systems, which could provide explanation and persuasiveness for the recommendation. Recently, Large Language Models (LLMs, e.g., ChatGPT) have shown superior text modeling and generating ability, which could be applied in review generation. However, directly applying LLMs to generate reviews might be hampered by the "polite" phenomenon of LLMs and could fail to produce personalized reviews (e.g., negative reviews). In this paper, we propose Review-LLM, which customizes LLMs for personalized review generation. Firstly, we construct the prompt input by aggregating user historical behaviors, which include corresponding item titles and reviews. This enables the LLMs to capture user interest features and review writing style. Secondly, we incorporate ratings as indicators of satisfaction into the prompt, which could further improve the model's understanding of user preferences and the sentiment tendency control of generated reviews. Finally, we feed the prompt text into LLMs, and use Supervised Fine-Tuning (SFT) to make the model generate personalized reviews for the given user and target item. Experimental results on the real-world dataset show that our fine-tuned model could achieve better review generation performance than existing closed-source LLMs.
摘要：产品评论生成是推荐系统中的一项重要任务，可以为推荐提供解释和说服力。最近，大型语言模型（LLM，例如 ChatGPT）表现出卓越的文本建模和生成能力，可以应用于评论生成。然而，直接应用 LLM 来生成评论可能会受到 LLM 的“礼貌”现象的困扰，无法生成个性化评论（例如负面评论）。在本文中，我们提出了 Review-LLM，它定制了 LLM 以生成个性化评论。首先，我们通过聚合用户历史行为来构建提示输入，其中包括相应的商品标题和评论。这使 LLM 能够捕获用户兴趣特征和评论写作风格。其次，我们将评分作为满意度指标纳入提示中，这可以进一步提高模型对用户偏好的理解和对生成评论的情绪倾向控制。最后，我们将提示文本输入 LLM，并使用监督微调 (SFT) 使模型为给定的用户和目标商品生成个性化评论。在真实数据集上的实验结果表明，我们的微调模型可以比现有的闭源 LLM 实现更好的评论生成性能。

Title: Bucket Pre-training is All You Need

Authors: Hongtao Liu, Qiyao Peng, Qing Yang, Kai Liu, Hongyan Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07495
Pdf URL: https://arxiv.org/pdf/2407.07495
Copy Paste: [[2407.07495]] Bucket Pre-training is All You Need(https://arxiv.org/abs/2407.07495)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated exceptional performance across various natural language processing tasks. However, the conventional fixed-length data composition strategy for pretraining, which involves concatenating and splitting documents, can introduce noise and limit the model's ability to capture long-range dependencies. To address this, we first introduce three metrics for evaluating data composition quality: padding ratio, truncation ratio, and concatenation ratio. We further propose a multi-bucket data composition method that moves beyond the fixed-length paradigm, offering a more flexible and efficient approach to pretraining. Extensive experiments demonstrate that our proposed method can significantly improve both the efficiency and efficacy of LLM pretraining. Our approach not only reduces noise and preserves context but also accelerates training, making it a promising solution for LLM pretraining.
摘要：大型语言模型 (LLM) 在各种自然语言处理任务中都表现出色。然而，传统的预训练固定长度数据组合策略（包括连接和拆分文档）可能会引入噪音并限制模型捕获长距离依赖关系的能力。为了解决这个问题，我们首先引入了三个评估数据组合质量的指标：填充率、截断率和连接率。我们进一步提出了一种超越固定长度范式的多桶数据组合方法，为预训练提供了一种更灵活、更高效的方法。大量实验表明，我们提出的方法可以显著提高 LLM 预训练的效率和功效。我们的方法不仅可以降低噪音和保留上下文，还可以加速训练，使其成为 LLM 预训练的一个有前途的解决方案。
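
Note: the abstract names three data composition metrics without defining them. The sketch below shows one plausible reading for the conventional concatenate-and-split strategy. All three definitions are assumptions made for illustration and may differ from the paper's: padding ratio is the share of pad tokens, truncation ratio is the share of documents split across sequence boundaries, and concatenation ratio is the share of sequences holding tokens from more than one document. The function name is hypothetical.

```python
def fixed_length_stats(doc_lengths, seq_len):
    """Simulate concatenate-and-split packing and report three ratios.

    doc_lengths: token counts of the documents, in packing order (all > 0).
    seq_len: fixed training sequence length.
    The metric definitions are illustrative assumptions (see lead-in).
    """
    if seq_len <= 0 or not doc_lengths or min(doc_lengths) <= 0:
        raise ValueError("need positive seq_len and non-empty positive lengths")
    total = sum(doc_lengths)
    n_seqs = -(-total // seq_len)  # ceiling division
    pad_tokens = n_seqs * seq_len - total  # only the last sequence is padded

    split_docs = 0
    docs_per_seq = [0] * n_seqs
    pos = 0
    for length in doc_lengths:
        first_seq = pos // seq_len
        last_seq = (pos + length - 1) // seq_len
        if last_seq > first_seq:
            split_docs += 1  # document crosses a sequence boundary
        for s in range(first_seq, last_seq + 1):
            docs_per_seq[s] += 1
        pos += length

    return {
        "padding_ratio": pad_tokens / (n_seqs * seq_len),
        "truncation_ratio": split_docs / len(doc_lengths),
        "concatenation_ratio": sum(1 for d in docs_per_seq if d > 1) / n_seqs,
    }


# Three hypothetical documents packed into sequences of 4 tokens.
print(fixed_length_stats([3, 5, 4], seq_len=4))
```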

Title: Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models

Authors: Jin Liu, Qingquan Li, Wenlong Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07531
Pdf URL: https://arxiv.org/pdf/2407.07531
Copy Paste: [[2407.07531]] Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models(https://arxiv.org/abs/2407.07531)
Keywords: language model, llm
Abstract: In current benchmarks for evaluating large language models (LLMs), there are issues such as evaluation content restriction, untimely updates, and lack of optimization guidance. In this paper, we propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment. Our paradigm shifts the "location" of LLM evaluation from the "examination room" to the "hospital". Through conducting a "physical examination" on LLMs, it utilizes specific task-solving as the evaluation content, performs deep attribution of existing problems within LLMs, and provides recommendation for optimization.
摘要：针对当前大型语言模型（LLM）的评测基准存在评测内容受限、更新不及时、缺乏优化指导等问题，本文提出一种新的LLM评测范式：基准-评测-评估。该范式将LLM评测的“地点”从“考场”转移到“医院”，通过对LLM进行“体检”，以具体任务解决为评测内容，对LLM中存在的问题进行深度归因，并给出优化建议。

Title: Arabic Automatic Story Generation with Large Language Models

Authors: Ahmed Oumar El-Shangiti, Fakhraddin Alwajih, Muhammad Abdul-Mageed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.07551
Pdf URL: https://arxiv.org/pdf/2407.07551
Copy Paste: [[2407.07551]] Arabic Automatic Story Generation with Large Language Models(https://arxiv.org/abs/2407.07551)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have recently emerged as a powerful tool for a wide range of language generation tasks. Nevertheless, this progress has been slower in Arabic. In this work, we focus on the task of generating stories from LLMs. For our training, we use stories acquired through machine translation (MT) as well as GPT-4. For the MT data, we develop a careful pipeline that ensures we acquire high-quality stories. For our GPT-4 data, we introduce crafted prompts that allow us to generate data well-suited to the Arabic context in both Modern Standard Arabic (MSA) and two Arabic dialects (Egyptian and Moroccan). For example, we generate stories tailored to various Arab countries on a wide host of topics. Our manual evaluation shows that our model fine-tuned on these training datasets can generate coherent stories that adhere to our instructions. We also conduct an extensive automatic and human evaluation comparing our models against state-of-the-art proprietary and open-source models. Our datasets and models will be made publicly available at https://github.com/UBC-NLP/arastories.
摘要：大型语言模型 (LLM) 最近已成为各种语言生成任务的强大工具。然而，这一进展在阿拉伯语中进展较慢。在这项工作中，我们专注于从 LLM 生成故事的任务。对于我们的训练，我们使用通过机器翻译 (MT) 以及 GPT-4 获得的故事。对于 MT 数据，我们开发了一个精心设计的管道，以确保我们获得高质量的故事。对于我们的 GPT-4 数据，我们引入了精心设计的提示，使我们能够生成适合现代标准阿拉伯语 (MSA) 和两种阿拉伯方言（埃及语和摩洛哥语）的阿拉伯语环境的数据。例如，我们针对各种主题生成适合各个阿拉伯国家的故事。我们的手动评估表明，我们在这些训练数据集上微调的模型可以生成符合我们指示的连贯故事。我们还进行了广泛的自动和人工评估，将我们的模型与最先进的专有和开源模型进行了比较。我们的数据集和模型将在 https://github.com/UBC-NLP/arastories 上公开。

Title: On Leakage of Code Generation Evaluation Datasets

Authors: Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen Gilsenan-McMahon, Matthias Gallé
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07565
Pdf URL: https://arxiv.org/pdf/2407.07565
Copy Paste: [[2407.07565]] On Leakage of Code Generation Evaluation Datasets(https://arxiv.org/abs/2407.07565)
Keywords: language model, prompt
Abstract: In this paper we consider contamination by code generation test sets, in particular in their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data and (iii) overfitting to evaluation sets during model selection. Key to our findings is a new dataset of 161 prompts with their associated Python solutions, which is released at this https URL.
摘要：在本文中，我们考虑了代码生成测试集的污染，特别是在现代大型语言模型中使用它们时。我们讨论了这种污染的三个可能来源，并展示了支持每个来源的发现：（i）直接数据泄漏，（ii）通过使用合成数据间接数据泄漏，以及（iii）在模型选择期间过度拟合评估集。我们的发现的关键是一个包含 161 个提示及其相关 Python 解决方案的新数据集，该数据集在此 https URL 上发布。

Title: A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training

Authors: Michał Perełkiewicz, Rafał Poświata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07630
Pdf URL: https://arxiv.org/pdf/2407.07630
Copy Paste: [[2407.07630]] A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training(https://arxiv.org/abs/2407.07630)
Keywords: language model, llm
Abstract: This article presents a comprehensive review of the challenges associated with using massive web-mined corpora for the pre-training of large language models (LLMs). This review identifies key challenges in this domain, including challenges such as noise (irrelevant or misleading information), duplication of content, the presence of low-quality or incorrect information, biases, and the inclusion of sensitive or personal information in web-mined corpora. Addressing these issues is crucial for the development of accurate, reliable, and ethically responsible language models. Through an examination of current methodologies for data cleaning, pre-processing, bias detection and mitigation, we highlight the gaps in existing approaches and suggest directions for future research. Our discussion aims to catalyze advancements in developing more sophisticated and ethically responsible LLMs.
摘要：本文全面回顾了使用海量网络挖掘语料库对大型语言模型 (LLM) 进行预训练所面临的挑战。本综述确定了该领域的主要挑战，包括噪音（不相关或误导性信息）、内容重复、存在低质量或不正确的信息、偏见以及在网络挖掘语料库中包含敏感或个人信息等挑战。解决这些问题对于开发准确、可靠且符合道德规范的语言模型至关重要。通过研究当前的数据清理、预处理、偏见检测和缓解方法，我们强调了现有方法的差距并提出了未来研究的方向。我们的讨论旨在促进开发更复杂、更符合道德规范的 LLM 的进步。

Title: A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability

Authors: Ting Fang Tan, Kabilan Elangovan, Jasmine Ong, Nigam Shah, Joseph Sung, Tien Yin Wong, Lan Xue, Nan Liu, Haibo Wang, Chang Fu Kuo, Simon Chesterman, Zee Kin Yeong, Daniel SW Ting
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.07666
Pdf URL: https://arxiv.org/pdf/2407.07666
Copy Paste: [[2407.07666]] A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability(https://arxiv.org/abs/2407.07666)
Keywords: language model, llm
Abstract: A comprehensive qualitative evaluation framework for large language models (LLMs) in healthcare that expands beyond traditional accuracy and quantitative metrics is needed. We propose 5 key aspects for evaluation of LLMs: Safety, Consensus, Objectivity, Reproducibility and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models that are safe, reliable, trustworthy, and ethical for healthcare and clinical applications.
摘要：医疗保健领域大型语言模型 (LLM) 的全面定性评估框架，超越了传统的准确性和定量指标。我们提出了评估 LLM 的 5 个关键方面：安全性、共识性、客观性、可重复性和可解释性 (S.C.O.R.E.)。我们认为 S.C.O.R.E. 可以作为未来基于 LLM 的模型评估框架的基础，这些模型对于医疗保健和临床应用而言是安全、可靠、值得信赖和合乎道德的。

Title: Multi-task Prompt Words Learning for Social Media Content Generation

Authors: Haochen Xue, Chong Zhang, Chengzhi Liu, Fangyu Wu, Xiaobo Jin
Subjects: cs.CL, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2407.07771
Pdf URL: https://arxiv.org/pdf/2407.07771
Copy Paste: [[2407.07771]] Multi-task Prompt Words Learning for Social Media Content Generation(https://arxiv.org/abs/2407.07771)
Keywords: gpt, prompt, chat
Abstract: The rapid development of the Internet has profoundly changed human life. Humans are increasingly expressing themselves and interacting with others on social media platforms. However, although artificial intelligence technology has been widely used in many aspects of life, its application in social media content creation remains largely unexplored. To solve this problem, we propose a new prompt word generation framework based on multi-modal information fusion, which combines multiple tasks including topic classification, sentiment analysis, scene recognition and keyword extraction to generate more comprehensive prompt words. Subsequently, we use a template containing a set of prompt words to guide ChatGPT to generate high-quality tweets. Furthermore, in the absence of effective and objective evaluation criteria in the field of content generation, we use the ChatGPT tool to evaluate the results generated by the algorithm, making large-scale evaluation of content generation algorithms possible. Evaluation results on extensive content generation demonstrate that our prompt word generation framework generates higher quality content compared to manual methods and other prompting techniques, while topic classification, sentiment analysis, and scene recognition significantly enhance content clarity and its consistency with the image.
摘要：互联网的快速发展深刻改变了人类的生活，人类越来越多地在社交媒体平台上表达自己并与他人互动。然而，尽管人工智能技术已广泛应用于生活的方方面面，但在社交媒体内容创作中的应用仍是空白。为了解决这个问题，我们提出了一种基于多模态信息融合的新型提示词生成框架，该框架结合了主题分类、情感分析、场景识别和关键词提取等多项任务，以生成更全面的提示词。随后，我们使用包含一组提示词的模板来指导ChatGPT生成高质量的推文。此外，在内容生成领域缺乏有效客观的评价标准的情况下，我们使用ChatGPT工具来评估算法生成的结果，使大规模的内容生成算法评估成为可能。在大规模内容生成的评估结果表明，与手动方法和其他提示技术相比，我们的提示词生成框架可以生成更高质量的内容，而主题分类、情感分析和场景识别显著提高了内容清晰度及其与图像的一致性。

Title: WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment

Authors: Jiefu Ou, Arda Uzunoglu, Benjamin Van Durme, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07778
Pdf URL: https://arxiv.org/pdf/2407.07778
Copy Paste: [[2407.07778]] WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment(https://arxiv.org/abs/2407.07778)
Keywords: language model, gpt, llm, prompt, agent
Abstract: AI systems make decisions in physical environments through primitive actions or affordances that are accessed via API calls. While deploying AI agents in the real world involves numerous high-level actions, existing embodied simulators offer a limited set of domain-salient APIs. This naturally brings up the questions: how many primitive actions (APIs) are needed for a versatile embodied agent, and what should they look like? We explore this via a thought experiment: assuming that wikiHow tutorials cover a wide variety of human-written tasks, what is the space of APIs needed to cover these instructions? We propose a framework to iteratively induce new APIs by grounding wikiHow instructions to situated agent policies. Inspired by recent successes in large language models (LLMs) for embodied planning, we propose few-shot prompting to steer GPT-4 to generate Pythonic programs as agent policies and bootstrap a universe of APIs by 1) reusing a seed set of APIs; and then 2) fabricating new API calls when necessary. The focus of this thought experiment is on defining these APIs rather than their executability. We apply the proposed pipeline to instructions from wikiHow tutorials. On a small fraction (0.5%) of tutorials, we induce an action space of 300+ APIs necessary for capturing the rich variety of tasks in the physical world. A detailed automatic and human analysis of the induction output reveals that the proposed pipeline enables effective reuse and creation of APIs. Moreover, a manual review revealed that existing simulators support only a small subset of the induced APIs (9 of the top 50 frequent APIs), motivating the development of action-rich embodied environments.
摘要：AI 系统通过原始操作或可通过 API 调用访问的功能在物理环境中做出决策。虽然在现实世界中部署 AI 代理涉及许多高级操作，但现有的具身模拟器仅提供一组有限的领域显著 API。这自然会引出以下问题：多功能具身代理需要多少个原始操作 (API)，它们应该是什么样的？我们通过一个思想实验来探索这个问题：假设 wikiHow 教程涵盖了各种各样的人工编写任务，那么需要多少 API 空间来涵盖这些指令？我们提出了一个框架，通过将 wikiHow 指令与情境代理策略相结合来迭代地引入新的 API。受具身规划大型语言模型 (LLM) 最近取得的成功的启发，我们提出了一个少量提示来引导 GPT-4 生成 Pythonic 程序作为代理策略，并通过 1) 重用一组 API 种子来引导 API 世界；然后 2) 在必要时构造新的 API 调用。这个思想实验的重点是定义这些 API，而不是它们的可执行性。我们将提议的管道应用于 wikiHow 教程中的说明。在一小部分（0.5%）的教程中，我们引入了一个包含 300 多个 API 的操作空间，这些 API 是捕捉物理世界中丰富多样的任务所必需的。对诱导输出的详细自动和人工分析表明，提议的管道能够有效地重用和创建 API。此外，手动审查显示，现有的模拟器仅支持诱导 API 的一小部分（前 50 个最常用 API 中的 9 个），这激发了动作丰富的具身环境的开发。
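Note: The reuse-then-create bookkeeping described in the WorldAPIs abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`called_api_names`, `induce_apis`), the seed APIs, and the toy policy programs are all hypothetical, and the programs are hand-written here rather than generated by GPT-4. The sketch only shows how a universe of API names might grow as new Pythonic policies are parsed.

```python
import ast


def called_api_names(program: str) -> list[str]:
    """Collect names of plain function calls, e.g. `grasp(obj)` -> "grasp"."""
    names = []
    for node in ast.walk(ast.parse(program)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            names.append(node.func.id)
    return names


def induce_apis(seed_apis, programs):
    """Grow an API universe from a seed set (hypothetical bookkeeping only).

    A call whose name is already in the universe counts as reuse;
    an unseen name is added to the universe and counts as newly created.
    A name created by one program can be reused by a later one, so it
    may appear in both returned sets.
    """
    universe = set(seed_apis)
    reused, created = set(), set()
    for program in programs:
        for name in called_api_names(program):
            if name in universe:
                reused.add(name)
            else:
                universe.add(name)
                created.add(name)
    return universe, reused, created


# Toy, hand-written policies standing in for model-generated programs.
seed = {"find", "grasp"}
policies = [
    "obj = find('cup')\ngrasp(obj)\npour(obj, 'sink')",
    "obj = find('sponge')\ngrasp(obj)\nwipe(obj, 'table')\npour(obj, 'sink')",
]
universe, reused, created = induce_apis(seed, policies)
print(sorted(universe))
print(sorted(reused))
print(sorted(created))
```

Because the abstract states that the focus is on defining APIs rather than executing them, the sketch only parses the programs and never runs them.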

Title: Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities

Authors: Tianjie Ju, Yiting Wang, Xinbei Ma, Pengzhou Cheng, Haodong Zhao, Yulong Wang, Lifeng Liu, Jian Xie, Zhuosheng Zhang, Gongshen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07791
Pdf URL: https://arxiv.org/pdf/2407.07791
Copy Paste: [[2407.07791]] Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities(https://arxiv.org/abs/2407.07791)
Keywords: language model, llm, prompt, chat, retrieval-augmented generation, agent
Abstract: The rapid adoption of large language models (LLMs) in multi-agent systems has highlighted their impressive capabilities in various applications, such as collaborative problem-solving and autonomous negotiation. However, the security implications of these LLM-based multi-agent systems have not been thoroughly investigated, particularly concerning the spread of manipulated knowledge. In this paper, we investigate this critical issue by constructing a detailed threat model and a comprehensive simulation environment that mirrors real-world multi-agent deployments in a trusted platform. Subsequently, we propose a novel two-stage attack method involving Persuasiveness Injection and Manipulated Knowledge Injection to systematically explore the potential for manipulated knowledge (i.e., counterfactual and toxic knowledge) spread without explicit prompt manipulation. Our method leverages the inherent vulnerabilities of LLMs in handling world knowledge, which can be exploited by attackers to unconsciously spread fabricated information. Through extensive experiments, we demonstrate that our attack method can successfully induce LLM-based agents to spread both counterfactual and toxic knowledge without degrading their foundational capabilities during agent communication. Furthermore, we show that these manipulations can persist through popular retrieval-augmented generation frameworks, where several benign agents store and retrieve manipulated chat histories for future interactions. This persistence indicates that even after the interaction has ended, the benign agents may continue to be influenced by manipulated knowledge. Our findings reveal significant security risks in LLM-based multi-agent systems, emphasizing the imperative need for robust defenses against manipulated knowledge spread, such as introducing "guardian" agents and advanced fact-checking tools.
摘要：大型语言模型 (LLM) 在多智能体系统中的快速应用凸显了它们在协作问题解决和自主谈判等各种应用中的出色能力。然而，这些基于 LLM 的多智能体系统的安全隐患尚未得到彻底研究，特别是关于操纵知识的传播。在本文中，我们通过构建一个详细的威胁模型和一个全面的模拟环境来研究这一关键问题，该模型和模拟环境反映了可信平台中现实世界的多智能体部署。随后，我们提出了一种新颖的两阶段攻击方法，包括说服力注入和操纵知识注入，以系统地探索操纵知识（即反事实和有害知识）在没有明确提示操纵的情况下传播的可能性。我们的方法利用了 LLM 在处理世界知识方面的固有漏洞，攻击者可以利用这些漏洞无意识地传播虚假信息。通过大量实验，我们证明了我们的攻击方法可以成功诱导基于 LLM 的智能体传播反事实和有害知识，而不会降低其在智能体通信过程中的基础能力。此外，我们表明这些操纵可以通过流行的检索增强生成框架持续存在，其中多个良性代理存储和检索被操纵的聊天历史记录以供将来交互。这种持久性表明，即使在交互结束后，良性代理仍可能继续受到被操纵知识的影响。我们的研究结果揭示了基于 LLM 的多代理系统中存在重大安全风险，强调了对操纵知识传播的强大防御措施的迫切需要，例如引入“监护人”代理和先进的事实核查工具。

Title: Attribute or Abstain: Large Language Models as Long Document Assistants

Authors: Jan Buchmann, Xiao Liu, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.07799
Pdf URL: https://arxiv.org/pdf/2407.07799
Copy Paste: [[2407.07799]] Attribute or Abstain: Large Language Models as Long Document Assistants(https://arxiv.org/abs/2407.07799)
Keywords: language model, llm, prompt
Abstract: LLMs can help humans working with long documents, but are known to hallucinate. Attribution can increase trust in LLM responses: The LLM provides evidence that supports its response, which enhances verifiability. Existing approaches to attribution have only been evaluated in RAG settings, where the initial retrieval confounds LLM performance. This is crucially different from the long document setting, where retrieval is not needed, but could help. Thus, a long document specific evaluation of attribution is missing. To fill this gap, we present LAB, a benchmark of 6 diverse long document tasks with attribution, and experiment with different approaches to attribution on 4 LLMs of different sizes, both prompted and fine-tuned. We find that citation, i.e. response generation and evidence extraction in one step, mostly performs best. We investigate whether the "Lost in the Middle" phenomenon exists for attribution, but do not find this. We also find that evidence quality can predict response quality on datasets with simple responses, but not so for complex responses, as models struggle with providing evidence for complex claims. We release code and data for further investigation.
摘要：LLM 可以帮助人类处理长文档，但众所周知会产生幻觉。归因可以增加对 LLM 响应的信任：LLM 提供支持其响应的证据，从而增强可验证性。现有的归因方法仅在 RAG 设置中进行过评估，其中初始检索会混淆 LLM 性能。这与长文档设置有着至关重要的不同，在长文档设置中不需要检索，但可以提供帮助。因此，缺少对长文档的归因特定评估。为了填补这一空白，我们提出了 LAB，这是 6 个具有归因的不同长文档任务的基准，并在 4 个不同大小的 LLM 上尝试了不同的归因方法，包括提示和微调。我们发现引用（即一步生成响应和提取证据）大多表现最佳。我们调查了归因是否存在“迷失在中间”现象，但没有发现这种情况。我们还发现，证据质量可以预测简单响应数据集的响应质量，但对于复杂响应则不然，因为模型很难为复杂声明提供证据。我们发布了代码和数据以供进一步研究。

Title: Training on the Test Task Confounds Evaluation and Emergence

Authors: Ricardo Dominguez-Olmedo, Florian E. Dorner, Moritz Hardt
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.07890
Pdf URL: https://arxiv.org/pdf/2407.07890
Copy Paste: [[2407.07890]] Training on the Test Task Confounds Evaluation and Emergence(https://arxiv.org/abs/2407.07890)
Keywords: language model
Abstract: We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.
摘要：我们研究了大型语言模型评估中的一个基本问题，我们称之为测试任务训练。与测试数据训练、泄漏或数据污染等错误做法不同，测试任务训练并不是不当行为。相反，这个术语描述了一组不断增长的技术，用于在语言模型的预训练阶段包含与任务相关的数据。我们证明，测试任务训练混淆了相对模型评估和关于新兴能力的主张。我们认为，一个模型系列似乎优于另一个模型系列，这可能是由于测试任务训练程度不同造成的。为此，我们提出了一种有效的方法来调整测试任务的训练，即在评估之前对相同任务相关数据上的每个模型进行微调。然后，我们表明，一旦我们调整测试任务的训练，新兴行为的实例就会基本消失。这也适用于无法通过选择评估指标来解释的新兴行为实例。我们的工作为大型语言模型的评估提供了一个新的视角，对基准测试和新兴能力的研究具有广泛的影响。