2024-10-30

Title: Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges

Authors: Farid Ariai, Gianluca Demartini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21306
Pdf URL: https://arxiv.org/pdf/2410.21306
Copy Paste: [[2410.21306]] Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges(https://arxiv.org/abs/2410.21306)
Keywords: language model
Abstract: Natural Language Processing is revolutionizing the way legal professionals and laypersons operate in the legal field. The considerable potential for Natural Language Processing in the legal sector, especially in developing computational tools for various legal processes, has captured the interest of researchers for years. This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework, reviewing 148 studies, with a final selection of 127 after manual filtering. It explores foundational concepts related to Natural Language Processing in the legal domain, illustrating the unique aspects and challenges of processing legal texts, such as extensive document length, complex language, and limited open legal datasets. We provide an overview of Natural Language Processing tasks specific to legal text, such as Legal Document Summarization, legal Named Entity Recognition, Legal Question Answering, Legal Text Classification, and Legal Judgment Prediction. In the section on legal Language Models, we analyze both developed Language Models and approaches for adapting general Language Models to the legal domain. Additionally, we identify 15 Open Research Challenges, including bias in Artificial Intelligence applications, the need for more robust and interpretable models, and improving explainability to handle the complexities of legal language and reasoning.

Title: Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts

Authors: E. Zhixuan Zeng, Yuhao Chen, Alexander Wong
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21314
Pdf URL: https://arxiv.org/pdf/2410.21314
Copy Paste: [[2410.21314]] Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts(https://arxiv.org/abs/2410.21314)
Keywords: prompt
Abstract: Recent advances in image generation have made diffusion models powerful tools for creating high-quality images. However, their iterative denoising process makes understanding and interpreting their semantic latent spaces more challenging than other generative models, such as GANs. Recent methods have attempted to address this issue by identifying semantically meaningful directions within the latent space. However, they often need manual interpretation or are limited in the number of vectors that can be trained, restricting their scope and utility. This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces. We directly leverage natural language prompts and image captions to map latent directions. This method allows for the automatic understanding of hidden features and supports a broader range of analysis without the need to train specific vectors. Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models, facilitating comprehensive analysis of latent biases and the nuanced representations these models learn. Experimental results show that our framework can uncover hidden patterns and associations in various domains, offering new insights into the interpretability of diffusion model latent spaces.

Title: Mathematical Derivation Graphs: A Task for Summarizing Equation Dependencies in STEM Manuscripts

Authors: Vishesh Prasad, Brian Kim, Nickvash Kani
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21324
Pdf URL: https://arxiv.org/pdf/2410.21324
Copy Paste: [[2410.21324]] Mathematical Derivation Graphs: A Task for Summarizing Equation Dependencies in STEM Manuscripts(https://arxiv.org/abs/2410.21324)
Keywords: language model, llm
Abstract: Recent advances in natural language processing (NLP), particularly with the emergence of large language models (LLMs), have significantly enhanced the field of textual analysis. However, while these developments have yielded substantial progress in analyzing textual data, applying analysis to mathematical equations and their relationships within texts has produced mixed results. In this paper, we take the initial steps toward understanding the dependency relationships between mathematical expressions in STEM articles. Our dataset, sourced from a random sampling of the arXiv corpus, contains an analysis of 107 published STEM manuscripts whose inter-equation dependency relationships have been hand-labeled, resulting in a new object we refer to as a derivation graph that summarizes the mathematical content of the manuscript. We exhaustively evaluate analytical and NLP-based models to assess their capability to identify and extract the derivation relationships for each article and compare the results with the ground truth. Our comprehensive testing finds that both analytical and NLP models (including LLMs) achieve ~40-50% F1 scores for extracting derivation graphs from articles, revealing that the recent advances in NLP have not made significant inroads in comprehending mathematical texts compared to simpler analytic models. While current approaches offer a solid foundation for extracting mathematical information, further research is necessary to improve accuracy and depth in this area.
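To make the F1 figure in the derivation-graph abstract more concrete, here is a minimal sketch of how an extracted graph might be scored against a hand-labeled one. It assumes that a derivation graph can be represented as a set of directed edges (source equation, derived equation) and that scoring is done at the edge level; the abstract does not spell out the exact protocol, and the function name and equation labels below are hypothetical.

```python
def edge_f1(predicted, gold):
    """Edge-level precision, recall and F1 between two sets of directed edges."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)  # edges present in both graphs
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: an edge (u, v) means "equation v is derived from equation u".
gold_edges = {("eq1", "eq2"), ("eq2", "eq3"), ("eq1", "eq4")}
pred_edges = {("eq1", "eq2"), ("eq3", "eq4")}

precision, recall, f1 = edge_f1(pred_edges, gold_edges)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case one of two predicted edges is correct and one of three gold edges is recovered, which already lands near the 40-50% range quoted in the abstract.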

Title: LLM Robustness Against Misinformation in Biomedical Question Answering

Authors: Alexander Bondarenko, Adrian Viehweger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21330
Pdf URL: https://arxiv.org/pdf/2410.21330
Copy Paste: [[2410.21330]] LLM Robustness Against Misinformation in Biomedical Question Answering(https://arxiv.org/abs/2410.21330)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) in question answering by retrieving additional context from external knowledge sources and providing it to the model (e.g., by adding the context to the prompt). However, injecting incorrect information can mislead the LLM into generating an incorrect answer. In this paper, we evaluate the effectiveness and robustness against misinformation of four LLMs (Gemma 2, GPT-4o-mini, Llama 3.1, and Mixtral) in answering biomedical questions. We assess the answer accuracy on yes-no and free-form questions in three scenarios: vanilla LLM answers (no context is provided), "perfect" augmented generation (correct context is provided), and prompt-injection attacks (incorrect context is provided). Our results show that Llama 3.1 (70B parameters) achieves the highest accuracy in both vanilla (0.651) and "perfect" RAG (0.802) scenarios. However, the accuracy gap between the models almost disappears with "perfect" RAG, suggesting its potential to mitigate the LLMs' size-related effectiveness differences. We further evaluate the ability of the LLMs to generate malicious context on the one hand and their robustness against prompt-injection attacks on the other, using metrics such as attack success rate (ASR), accuracy under attack, and accuracy drop. As adversaries, we use the same four LLMs (Gemma 2, GPT-4o-mini, Llama 3.1, and Mixtral) to generate incorrect context that is injected into the target model's prompt. Interestingly, Llama is shown to be the most effective adversary, causing accuracy drops of up to 0.48 for vanilla answers and 0.63 for "perfect" RAG across target models. Our analysis reveals that robustness rankings vary depending on the evaluation measure, highlighting the complexity of assessing LLM resilience to adversarial attacks.
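The three robustness metrics named in the biomedical question-answering abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the metrics, so the sketch assumes ASR is the share of initially correct answers that become incorrect once the misleading context is injected, and accuracy drop is clean accuracy minus accuracy under attack. The function name and the sample data are hypothetical.

```python
def attack_metrics(clean_correct, attacked_correct):
    """Compute robustness metrics from per-question correctness flags.

    clean_correct[i]    -- True if question i was answered correctly without attack
    attacked_correct[i] -- True if question i was answered correctly under attack
    """
    if not clean_correct or len(clean_correct) != len(attacked_correct):
        raise ValueError("need two non-empty lists of equal length")
    n = len(clean_correct)
    clean_acc = sum(clean_correct) / n
    attacked_acc = sum(attacked_correct) / n
    # Assumed ASR definition: correct answers flipped to incorrect by the attack.
    flipped = sum(1 for c, a in zip(clean_correct, attacked_correct) if c and not a)
    initially_correct = sum(clean_correct)
    asr = flipped / initially_correct if initially_correct else 0.0
    return {
        "clean_accuracy": clean_acc,
        "accuracy_under_attack": attacked_acc,
        "accuracy_drop": clean_acc - attacked_acc,
        "attack_success_rate": asr,
    }


# Hypothetical results for five questions.
clean = [True, True, True, False, True]
attacked = [True, False, False, False, True]
metrics = attack_metrics(clean, attacked)
print(metrics)
```

Because ASR is normalized by the number of initially correct answers while accuracy drop is normalized by all questions, the two can rank models differently, which is consistent with the abstract's remark that robustness rankings depend on the measure.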

Title: Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection

Authors: Md Abdur Rahman, Fan Wu, Alfredo Cuzzocrea, Sheikh Iqbal Ahamed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21337
Pdf URL: https://arxiv.org/pdf/2410.21337
Copy Paste: [[2410.21337]] Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection(https://arxiv.org/abs/2410.21337)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are becoming a popular tool as their capability to tackle a wide range of language-based tasks has advanced significantly. However, LLM applications are highly vulnerable to prompt injection attacks, which poses a critical problem. These attacks target LLM applications with carefully designed input prompts that divert the model from its original instructions, so that it may execute unintended actions. These manipulations pose serious security threats that can result in data leaks, biased outputs, or harmful responses. This project explores the security vulnerabilities related to prompt injection attacks. To detect whether a prompt is vulnerable or not, we follow two approaches: 1) a pre-trained LLM, and 2) a fine-tuned LLM. We then conduct a thorough analysis and comparison of their classification performance. First, we use the pre-trained XLM-RoBERTa model to detect prompt injections on the test dataset without any fine-tuning and evaluate it by zero-shot classification. Then, we apply supervised fine-tuning to this pre-trained LLM using a task-specific labeled dataset from deepset on Hugging Face; the fine-tuned model achieves 99.13% accuracy, 100% precision, 98.33% recall, and 99.15% F1-score through rigorous experimentation and evaluation. We observe that our approach is highly efficient in detecting prompt injection attacks.

Title: FinTeamExperts: Role Specialized MOEs For Financial Analysis

Authors: Yue Yu, Prayag Tiwari
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21338
Pdf URL: https://arxiv.org/pdf/2410.21338
Copy Paste: [[2410.21338]] FinTeamExperts: Role Specialized MOEs For Financial Analysis(https://arxiv.org/abs/2410.21338)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs), such as ChatGPT, Phi3 and Llama-3, are leading a significant leap in AI, as they can generalize knowledge from their training to new tasks without fine-tuning. However, their application in the financial domain remains relatively limited. The financial field is inherently complex, requiring a deep understanding across various perspectives, from macroeconomic and microeconomic trends to quantitative analysis. Motivated by this complexity, a mixture of expert LLMs tailored to specific financial domains could offer a more comprehensive understanding of intricate financial tasks. In this paper, we present FinTeamExperts, a role-specialized LLM framework structured as a Mixture of Experts (MOEs) for financial analysis. The framework simulates a collaborative team setting by training each model to specialize in a distinct role: Macro Analysts, Micro Analysts, and Quantitative Analysts. This role-specific specialization enhances the models' ability to integrate their domain-specific expertise. We achieve this by training three 8-billion parameter models on different corpora, each dedicated to excelling in a specific finance-related role. We then instruct-tune FinTeamExperts on downstream tasks to align with practical financial tasks. The experimental results show that FinTeamExperts outperforms all models of the same size and larger on three out of four datasets. On the fourth dataset, which presents a more complex task, FinTeamExperts still surpasses all models of the same size. This highlights the success of our role-based specialization approach and the continued training approach for FinTeamExperts.

Title: Large Language Model Benchmarks in Medical Tasks

Authors: Lawrence K.Q. Yan, Ming Li, Yichao Zhang, Caitlyn Heqi Yin, Cheng Fei, Benji Peng, Ziqian Bi, Pohsun Feng, Keyu Chen, Junyu Liu, Qian Niu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21348
Pdf URL: https://arxiv.org/pdf/2410.21348
Copy Paste: [[2410.21348]] Large Language Model Benchmarks in Medical Tasks(https://arxiv.org/abs/2410.21348)
Keywords: language model, llm
Abstract: With the increasing application of large language models (LLMs) in the medical domain, evaluating these models' performance using benchmark datasets has become crucial. This paper presents a comprehensive survey of various benchmark datasets employed in medical LLM tasks. These datasets span multiple modalities including text, image, and multimodal benchmarks, focusing on different aspects of medical knowledge such as electronic health records (EHRs), doctor-patient dialogues, medical question-answering, and medical image captioning. The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs for clinical tasks such as diagnosis, report generation, and predictive decision support. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert, which have facilitated advancements in tasks like medical report generation, clinical summarization, and synthetic data generation. The paper summarizes the challenges and opportunities in leveraging these benchmarks for advancing multimodal medical intelligence, emphasizing the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis. This work also provides a foundation for future research in the application of LLMs in medicine, contributing to the evolving field of medical artificial intelligence.

Title: LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

Authors: Ge Yang, Changyi He, Jinyang Guo, Jianyu Wu, Yifu Ding, Aishan Liu, Haotong Qin, Pengliang Ji, Xianglong Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21352
Pdf URL: https://arxiv.org/pdf/2410.21352
Copy Paste: [[2410.21352]] LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment(https://arxiv.org/abs/2410.21352)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have demonstrated strong capabilities, their high demand for computation and storage hinders practical application. To this end, many model compression techniques have been proposed to increase the efficiency of LLMs. However, current research validates these methods only on limited models, datasets, metrics, etc., and still lacks a comprehensive evaluation under more general scenarios. It therefore remains an open question which model compression approach should be used in a specific case. To close this gap, we present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis of LLM compression algorithms. We first analyze the actual model production requirements and carefully design evaluation tracks and metrics. Then, we conduct extensive experiments and comparisons using multiple mainstream LLM compression approaches. Finally, we perform an in-depth analysis based on the evaluation and provide useful insights for LLM compression design. We hope our LLMCBench can contribute insightful suggestions for LLM compression algorithm design and serve as a foundation for future research. Our code is available at this https URL.

Title: Causal Interventions on Causal Paths: Mapping GPT-2's Reasoning From Syntax to Semantics

Authors: Isabelle Lee, Joshua Lum, Ziyi Liu, Dani Yogatama
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21353
Pdf URL: https://arxiv.org/pdf/2410.21353
Copy Paste: [[2410.21353]] Causal Interventions on Causal Paths: Mapping GPT-2's Reasoning From Syntax to Semantics(https://arxiv.org/abs/2410.21353)
Keywords: gpt, llm
Abstract: While interpretability research has shed light on some internal algorithms utilized by transformer-based LLMs, reasoning in natural language, with its deep contextuality and ambiguity, defies easy categorization. As a result, formulating clear and motivating questions for circuit analysis that rely on well-defined in-domain and out-of-domain examples required for causal interventions is challenging. Although significant work has investigated circuits for specific tasks, such as indirect object identification (IOI), deciphering natural language reasoning through circuits remains difficult due to its inherent complexity. In this work, we take initial steps to characterize causal reasoning in LLMs by analyzing clear-cut cause-and-effect sentences like "I opened an umbrella because it started raining," where causal interventions may be possible through carefully crafted scenarios using GPT-2 small. Our findings indicate that causal syntax is localized within the first 2-3 layers, while certain heads in later layers exhibit heightened sensitivity to nonsensical variations of causal sentences. This suggests that models may infer reasoning by (1) detecting syntactic cues and (2) isolating distinct heads in the final layers that focus on semantic relationships.

Title: Energy-Based Diffusion Language Models for Text Generation

Authors: Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon, Arash Vahdat
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21357
Pdf URL: https://arxiv.org/pdf/2410.21357
Copy Paste: [[2410.21357]] Energy-Based Diffusion Language Models for Text Generation(https://arxiv.org/abs/2410.21357)
Keywords: language model
Abstract: Despite remarkable progress in autoregressive language models, alternative generative paradigms beyond left-to-right generation are still being actively explored. Discrete diffusion models, with the capacity for parallel generation, have recently emerged as a promising alternative. Unfortunately, these models still underperform their autoregressive counterparts, with the performance gap increasing as the number of sampling steps is reduced. Our analysis reveals that this degradation is a consequence of an imperfect approximation used by diffusion models. In this work, we propose the Energy-based Diffusion Language Model (EDLM), an energy-based model operating at the full sequence level for each diffusion step, introduced to improve the underlying approximation used by diffusion models. More specifically, we introduce an EBM in a residual form, and show that its parameters can be obtained by leveraging a pretrained autoregressive model or by finetuning a bidirectional transformer via noise contrastive estimation. We also propose an efficient generation algorithm via parallel importance sampling. Comprehensive experiments on language modeling benchmarks show that our model can consistently outperform state-of-the-art diffusion models by a significant margin, and approaches autoregressive models' perplexity. We further show that, without any generation performance drop, our framework offers a 1.3× sampling speedup over existing diffusion models.
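The combination of a residual energy-based model with importance sampling, as described in the EDLM abstract, can be illustrated with a generic self-normalized importance resampling step. This is a toy sketch, not the paper's algorithm: it assumes a target of the form p(x) ∝ p_base(x) · exp(-E(x)), so that candidates drawn from p_base can be reweighted by exp(-E(x)). The proposal, the energy function, and all names below are hypothetical stand-ins.

```python
import math
import random

VOCAB = ["a", "b", "c"]


def propose(length, rng):
    """Toy proposal p_base: tokens drawn independently (stands in for a denoising step)."""
    return [rng.choice(VOCAB) for _ in range(length)]


def energy(seq):
    """Toy sequence-level energy: number of adjacent repeated tokens (lower is better)."""
    return float(sum(1 for x, y in zip(seq, seq[1:]) if x == y))


def importance_resample(length, num_candidates, rng):
    """Draw candidates from the proposal, weight each by exp(-E), and resample one."""
    candidates = [propose(length, rng) for _ in range(num_candidates)]
    energies = [energy(c) for c in candidates]
    min_e = min(energies)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - min_e)) for e in energies]
    total = sum(weights)
    probs = [w / total for w in weights]  # self-normalized importance weights
    chosen = rng.choices(candidates, weights=probs, k=1)[0]
    return chosen, probs


rng = random.Random(0)
sample, probs = importance_resample(length=6, num_candidates=8, rng=rng)
print(sample, [round(p, 3) for p in probs])
```

The candidates are independent of one another, so in a real system they could be generated and scored in a single batch, which is one plausible reading of "parallel" in the abstract.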

Title: Can Machines Think Like Humans? A Behavioral Evaluation of LLM-Agents in Dictator Games

Authors: Ji Ma
Subjects: cs.CL, cs.AI, cs.CY, cs.LG, econ.GN
Abstract URL: https://arxiv.org/abs/2410.21359
Pdf URL: https://arxiv.org/pdf/2410.21359
Copy Paste: [[2410.21359]] Can Machines Think Like Humans? A Behavioral Evaluation of LLM-Agents in Dictator Games(https://arxiv.org/abs/2410.21359)
Keywords: language model, llm, prompt, agent
Abstract: As Large Language Model (LLM)-based agents increasingly undertake real-world tasks and engage with human society, how well do we understand their behaviors? This study (1) investigates how LLM agents' prosocial behaviors -- a fundamental social norm -- can be induced by different personas and benchmarked against human behaviors; and (2) introduces a behavioral approach to evaluate the performance of LLM agents in complex decision-making scenarios. We explored how different personas and experimental framings affect these AI agents' altruistic behavior in dictator games and compared their behaviors within the same LLM family, across various families, and with human behaviors. Our findings reveal substantial variations and inconsistencies among LLMs and notable differences compared to human behaviors. Merely assigning a human-like identity to LLMs does not produce human-like behaviors. Despite being trained on extensive human-generated data, these AI agents cannot accurately predict human decisions. LLM agents are not able to capture the internal processes of human decision-making, and their alignment with human behavior is highly variable and dependent on specific model architectures and prompt formulations; even worse, such dependence does not follow a clear pattern.

Title: A Survey on Automatic Credibility Assessment of Textual Credibility Signals in the Era of Large Language Models

Authors: Ivan Srba, Olesya Razuvayevskaya, João A. Leite, Robert Moro, Ipek Baris Schlicht, Sara Tonelli, Francisco Moreno García, Santiago Barrio Lottmann, Denis Teyssou, Valentin Porcellini, Carolina Scarton, Kalina Bontcheva, Maria Bielikova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21360
Pdf URL: https://arxiv.org/pdf/2410.21360
Copy Paste: [[2410.21360]] A Survey on Automatic Credibility Assessment of Textual Credibility Signals in the Era of Large Language Models(https://arxiv.org/abs/2410.21360)
Keywords: language model, llm
Abstract: In the current era of social media and generative AI, the ability to automatically assess the credibility of online social media content is of tremendous importance. Credibility assessment is fundamentally based on aggregating credibility signals, which refer to small units of information, such as content factuality, bias, or the presence of persuasion techniques, into an overall credibility score. Credibility signals provide more granular, more easily explainable and more widely usable information than the currently predominant fake news detection, which utilizes various (mostly latent) features. The growing body of research on automatic credibility assessment and detection of credibility signals can be characterized as highly fragmented and lacking mutual interconnections. This issue is made even more prominent by the lack of an up-to-date overview of research on automatic credibility assessment. In this survey, we provide such a systematic and comprehensive literature review of 175 research papers, focusing on textual credibility signals and Natural Language Processing (NLP), which is undergoing significant advancement due to Large Language Models (LLMs). While positioning the NLP research in the context of other multidisciplinary research, we cover approaches for credibility assessment as well as 9 categories of credibility signals (we provide a thorough analysis of 3 of them, namely: 1) factuality, subjectivity and bias, 2) persuasion techniques and logical fallacies, and 3) claims and veracity). Following the description of the existing methods, datasets and tools, we identify future challenges and opportunities, paying specific attention to the recent rapid development of generative AI.

Title: CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart

Authors: Bowen Zhao, Tianhao Cheng, Yuejie Zhang, Ying Cheng, Rui Feng, Xiaobo Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21414
Pdf URL: https://arxiv.org/pdf/2410.21414
Copy Paste: [[2410.21414]] CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart(https://arxiv.org/abs/2410.21414)
Keywords: gpt, agent
Abstract: Multimodal Question Answering (MMQA) is crucial as it enables comprehensive understanding and accurate responses by integrating insights from diverse data representations such as tables, charts, and text. Most existing research in MMQA focuses on only two modalities, such as image-text QA, table-text QA and chart-text QA, and there remains a notable scarcity of studies that investigate the joint analysis of text, tables, and charts. In this paper, we present CT2C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts, meticulously compiled from 200 selectively sourced webpages. Our dataset simulates real webpages and serves as a demanding test of a model's capability to analyze and reason with multimodal data, because the answer to a question could appear in various modalities, or might not exist at all. Additionally, we present AED (Allocating, Expert and Decision), a multi-agent system implemented through collaborative deployment, information interaction, and collective decision-making among different agents. Specifically, the Assignment Agent is in charge of selecting and activating expert agents, including those proficient in text, tables, and charts. The Decision Agent bears the responsibility of delivering the final verdict, drawing upon the analytical insights provided by these expert agents. We execute a comprehensive analysis, comparing AED with various state-of-the-art models in MMQA, including GPT-4. The experimental outcomes demonstrate that current methodologies, including GPT-4, are yet to meet the benchmarks set by our dataset.
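The assign, consult, and decide flow described for AED in the CT2C-QA abstract can be sketched with stub agents. Everything here is a hypothetical stand-in: real agents would call language or vision models, whereas this sketch uses dictionary lookups for experts, modality presence for assignment, and a majority vote for the final decision. None of these choices is taken from the paper.

```python
from collections import Counter


def make_expert(modality):
    """Build a stub expert that looks up an answer in its own modality only."""
    def expert(question, page):
        return page.get(modality, {}).get(question)  # None if no answer found
    return expert


EXPERTS = {m: make_expert(m) for m in ("text", "table", "chart")}


def assign(page):
    """Assignment step: activate only experts whose modality appears on the page."""
    return [m for m in EXPERTS if page.get(m)]


def decide(candidates):
    """Decision step: majority vote over expert answers, or report no answer."""
    found = [c for c in candidates if c is not None]
    if not found:
        return "unanswerable"
    return Counter(found).most_common(1)[0][0]


def answer(question, page):
    active = assign(page)
    return decide([EXPERTS[m](question, page) for m in active])


# Hypothetical webpage with pre-extracted answers per modality.
page = {
    "text": {"growth in 2020?": "12%"},
    "table": {"growth in 2020?": "12%"},
    "chart": {"growth in 2020?": "15%"},
}
print(answer("growth in 2020?", page))
print(answer("growth in 1990?", page))
```

The second call returns "unanswerable", mirroring the abstract's point that an answer may not exist on the page at all.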

Title: UFT: Unifying Fine-Tuning of SFT and RLHF/DPO/UNA through a Generalized Implicit Reward Function

Authors: Zhichao Wang, Bin Bi, Zixu Zhu, Xiangbo Mao, Jun Wang, Shiyu Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21438
Pdf URL: https://arxiv.org/pdf/2410.21438
Copy Paste: [[2410.21438]] UFT: Unifying Fine-Tuning of SFT and RLHF/DPO/UNA through a Generalized Implicit Reward Function(https://arxiv.org/abs/2410.21438)
Keywords: llm
Abstract: By pretraining on trillions of tokens, an LLM gains the capability of text generation. However, to enhance its utility and reduce potential harm, SFT and alignment are applied sequentially to the pretrained model. Due to the differing nature and objective functions of SFT and alignment, catastrophic forgetting has become a significant issue. To address this, we introduce Unified Fine-Tuning (UFT), which integrates SFT and alignment into a single training stage using the same objective and loss functions through an implicit reward function. Our experimental results demonstrate that UFT outperforms SFT on instruction-tuning data alone. Moreover, when combining instruction-tuning data with alignment data, UFT effectively prevents catastrophic forgetting across these two stages and shows a clear advantage over sequentially applying SFT and alignment. This is evident in the significant improvements observed in the ifeval task for instruction-following and the truthful-qa task for factuality. The proposed general fine-tuning framework UFT establishes an effective and efficient pretraining-UFT paradigm for LLM training.
摘要：通过对数万亿个 token 进行预训练，LLM 获得了文本生成能力。然而，为了增强其实用性并减少潜在危害，SFT 和对齐会按顺序应用于预训练模型。由于 SFT 和对齐的性质和目标函数不同，灾难性遗忘已成为一个重大问题。为了解决这个问题，我们引入了统一微调 (UFT)，它通过隐式奖励函数将 SFT 和对齐集成到单个训练阶段，使用相同的目标和损失函数。我们的实验结果表明，UFT 在指令调整数据上的表现优于 SFT。此外，当将指令调整数据与对齐数据相结合时，UFT 可有效防止这两个阶段的灾难性遗忘，并且比按顺序应用 SFT 和对齐具有明显的优势。这在 \textbf{ifeval} 任务的指令遵循和 \textbf{truthful-qa} 任务的事实性方面观察到的显著改进中显而易见。所提出的通用微调框架 UFT 为 LLM 训练建立了有效且高效的预训练-UFT 范式。

Title: Estimating Causal Effects of Text Interventions Leveraging LLMs

Authors: Siyi Guo, Myrl G. Marmarelis, Fred Morstatter, Kristina Lerman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21474
Pdf URL: https://arxiv.org/pdf/2410.21474
Copy Paste: [[2410.21474]] Estimating Causal Effects of Text Interventions Leveraging LLMs(https://arxiv.org/abs/2410.21474)
Keywords: language model, llm
Abstract: Quantifying the effect of textual interventions in social systems, such as reducing anger in social media posts to see its impact on engagement, poses significant challenges. Direct interventions on real-world systems are often infeasible, necessitating reliance on observational data. Traditional causal inference methods, typically designed for binary or discrete treatments, are inadequate for handling the complex, high-dimensional nature of textual data. This paper addresses these challenges by proposing a novel approach, CausalDANN, to estimate causal effects using text transformations facilitated by large language models (LLMs). Unlike existing methods, our approach accommodates arbitrary textual interventions and leverages text-level classifiers with domain adaptation ability to produce robust effect estimates against domain shifts, even when only the control group is observed. This flexibility in handling various text interventions is a key advancement in causal estimation for textual data, offering opportunities to better understand human behaviors and develop effective policies within social systems.
摘要：量化文本干预对社会系统的影响（例如减少社交媒体帖子中的愤怒情绪以了解其对参与度的影响）是一项重大挑战。直接干预现实世界系统通常是不可行的，因此必须依赖观察数据。传统的因果推理方法通常设计用于二元或离散处理，不足以处理文本数据的复杂、高维性质。本文通过提出一种新方法 CausalDANN 来解决这些挑战，该方法使用大型语言模型 (LLM) 促进的文本转换来估计因果效应。与现有方法不同，我们的方法可以适应任意的文本干预，并利用具有领域自适应能力的文本级分类器来产生针对领域转变的稳健效应估计，即使只观察对照组也是如此。这种处理各种文本干预的灵活性是文本数据因果估计的一项重要进步，为更好地理解人类行为和在社会系统中制定有效政策提供了机会。

Title: TransformLLM: Adapting Large Language Models via LLM-Transformed Reading Comprehension Text

Authors: Iftach Arbel, Yehonathan Refael, Ofir Lindenbaum
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21479
Pdf URL: https://arxiv.org/pdf/2410.21479
Copy Paste: [[2410.21479]] TransformLLM: Adapting Large Language Models via LLM-Transformed Reading Comprehension Text(https://arxiv.org/abs/2410.21479)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown promise in highly specialized domains; however, challenges remain in accuracy and cost. These limitations restrict the usage of existing models in domain-specific tasks. While fine-tuning pre-trained models has shown promising results, this process can be computationally expensive and requires massive datasets for the specialized application at hand. In this work, we bridge that gap. We have developed Phi-2-Legal and Mistral-Legal-7B, which are language models specifically designed for legal applications. These models are based on Phi-2 and Mistral-7B-v0.1, and have gone through continued pre-training with over 500 million tokens of legal texts. Our innovative approach significantly improves capabilities in legal tasks by using Large Language Models (LLMs) to convert raw training data into reading comprehension text. Our legal LLMs have demonstrated superior performance in legal benchmarks, even outperforming models trained on much larger datasets with more resources. This work emphasizes the effectiveness of continued pre-training on domain-specific texts, while using affordable LLMs for data conversion, which gives these models domain expertise while retaining general language understanding capabilities. While this work uses the legal domain as a test case, our method can be scaled and applied to any pre-training dataset, resulting in significant improvements across different tasks. These findings underscore the potential of domain-adaptive pre-training and reading comprehension for the development of highly effective domain-specific language models.
摘要：大型语言模型 (LLM) 在高度专业化的领域中已显示出良好的前景，但在准确性和成本方面仍然存在挑战。这些限制限制了现有模型在特定领域任务中的使用。虽然微调预训练模型已显示出良好的结果，但此过程在计算上可能非常昂贵，并且需要大量专门应用的数据集。在这项工作中，我们弥补了这一差距。我们开发了 Phi-2-Legal 和 Mistral-Legal-7B，它们是专为法律应用设计的语言模型。这些模型基于 Phi-2 和 Mistral-7B-v0.1，并使用超过 5 亿个法律文本标记进行了持续的预训练。我们的创新方法通过使用大型语言模型 (LLM) 将原始训练数据转换为阅读理解文本，显著提高了法律任务的能力。我们的法律 LLM 在法律基准测试中表现出色，甚至优于使用更多资源在更大的数据集上训练的模型。这项研究强调了在特定领域文本上持续进行预训练的有效性，同时使用价格合理的 LLM 进行数据转换，这为这些模型提供了领域专业知识，同时保留了一般的语言理解能力。虽然这项研究使用法律领域作为测试案例，但我们的方法可以扩展并应用于任何预训练数据集，从而在不同任务中实现显著改进。这些发现强调了领域自适应预训练和阅读理解在开发高效领域特定语言模型方面的潜力。

Title: SpeechQE: Estimating the Quality of Direct Speech Translation

Authors: HyoJung Han, Kevin Duh, Marine Carpuat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21485
Pdf URL: https://arxiv.org/pdf/2410.21485
Copy Paste: [[2410.21485]] SpeechQE: Estimating the Quality of Direct Speech Translation(https://arxiv.org/abs/2410.21485)
Keywords: llm
Abstract: Recent advances in automatic quality estimation for machine translation have exclusively focused on written language, leaving the speech modality underexplored. In this work, we formulate the task of quality estimation for speech translation (SpeechQE), construct a benchmark, and evaluate a family of systems based on cascaded and end-to-end architectures. In this process, we introduce a novel end-to-end system leveraging pre-trained text LLM. Results suggest that end-to-end approaches are better suited to estimating the quality of direct speech translation than using quality estimation systems designed for text in cascaded systems. More broadly, we argue that quality estimation of speech translation needs to be studied as a separate problem from that of text, and release our data and models to guide further research in this space.
摘要：机器翻译自动质量评估的最新进展仅集中在书面语言上，而语音模式尚未得到充分探索。在这项工作中，我们制定了语音翻译质量评估任务 (SpeechQE)，构建了基准，并评估了基于级联和端到端架构的一系列系统。在此过程中，我们引入了一种利用预训练文本 LLM 的新型端到端系统。结果表明，与使用级联系统中为文本设计的质量评估系统相比，端到端方法更适合评估直接语音翻译的质量。更广泛地说，我们认为语音翻译的质量评估需要作为一个与文本不同的问题来研究，并发布我们的数据和模型以指导该领域的进一步研究。

Title: Can Large Language Models Act as Symbolic Reasoners?

Authors: Rob Sullivan, Nelly Elsayed
Subjects: cs.CL, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2410.21490
Pdf URL: https://arxiv.org/pdf/2410.21490
Copy Paste: [[2410.21490]] Can Large Language Models Act as Symbolic Reasoners?(https://arxiv.org/abs/2410.21490)
Keywords: language model, llm
Abstract: The performance of large language models (LLMs) across a broad range of domains has been impressive, but they have been critiqued as unable to reason about their process and the conclusions they derive. Such reasoning is needed both to explain the conclusions drawn and to determine a plan or strategy for their approach. This paper explores current research on symbolic reasoning and LLMs, asking whether an LLM can inherently provide some form of reasoning or whether supporting components are necessary and, if there is evidence of a reasoning capability, whether it is evident only in a specific domain or is a general capability. In addition, this paper aims to identify the current research gaps and future trends in LLM explainability by reviewing the literature, identifying current research on this topic, and suggesting areas for future work.
摘要：大型语言模型 (LLM) 在广泛领域的表现令人印象深刻，但也有人批评它无法推理其过程和得出的结论。这是为了解释得出的结论，也是为了确定其方法的计划或策略。本文探讨了当前对符号推理和 LLM 的研究，以及 LLM 是否可以固有地提供某种形式的推理或是否需要支持组件，如果有推理能力的证据，这是否在特定领域很明显，还是这是一种一般能力？此外，本文旨在确定 LLM 可解释性的当前研究差距和未来趋势，回顾文献，确定该主题的当前研究并提出未来工作的领域。

Title: RoBIn: A Transformer-Based Model For Risk Of Bias Inference With Machine Reading Comprehension

Authors: Abel Corrêa Dias, Viviane Pereira Moreira, João Luiz Dihl Comba
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21495
Pdf URL: https://arxiv.org/pdf/2410.21495
Copy Paste: [[2410.21495]] RoBIn: A Transformer-Based Model For Risk Of Bias Inference With Machine Reading Comprehension(https://arxiv.org/abs/2410.21495)
Keywords: language model, llm
Abstract: Objective: Scientific publications play a crucial role in uncovering insights, testing novel drugs, and shaping healthcare policies. Assessing the quality of publications requires evaluating their Risk of Bias (RoB), a process typically conducted by human reviewers. In this study, we introduce a new dataset for machine reading comprehension and RoB assessment and present RoBIn (Risk of Bias Inference), an innovative model crafted to automate such evaluation. The model employs a dual-task approach, extracting evidence from a given context and assessing the RoB based on the gathered evidence. Methods: We use data from the Cochrane Database of Systematic Reviews (CDSR) as ground truth to label open-access clinical trial publications from PubMed. This process enabled us to develop training and test datasets specifically for machine reading comprehension and RoB inference. Additionally, we created extractive (RoBInExt) and generative (RoBInGen) Transformer-based approaches to extract relevant evidence and classify the RoB effectively. Results: RoBIn is evaluated across various settings and benchmarked against state-of-the-art methods for RoB inference, including large language models in multiple scenarios. In most cases, the best-performing RoBIn variant surpasses traditional machine learning and LLM-based approaches, achieving an ROC AUC of 0.83. Conclusion: Based on the evidence extracted from clinical trial reports, RoBIn performs a binary classification to decide whether the trial is at a low RoB or a high/unclear RoB. We found that both RoBInGen and RoBInExt are robust and have the best results in many settings.
摘要：目标：科学出版物在揭示见解、测试新药和制定医疗保健政策方面发挥着至关重要的作用。评估出版物的质量需要评估其偏倚风险 (RoB)，这一过程通常由人工审阅者进行。在本研究中，我们引入了一个新的机器阅读理解和 RoB 评估数据集，并提出了 RoBIn（偏倚风险推断），这是一种旨在自动进行此类评估的创新模型。该模型采用双任务方法，从给定上下文中提取证据，并根据收集到的证据评估 RoB。方法：我们使用 Cochrane 系统评价数据库 (CDSR) 中的数据作为基本事实，标记来自 PubMed 的开放获取临床试验出版物。这一过程使我们能够专门为机器阅读理解和 RoB 推断开发训练和测试数据集。此外，我们创建了基于 Transformer 的提取 (RoBInExt) 和生成 (RoBInGen) 方法来提取相关证据并有效地对 RoB 进行分类。结果：RoBIn 在各种设置中进行评估，并与最先进的 RoB 推理方法（包括多种场景中的大型语言模型）进行了对比。在大多数情况下，表现最佳的 RoBIn 变体优于传统的机器学习和基于 LLM 的方法，ROC AUC 达到 0.83。结论：根据从临床试验报告中提取的证据，RoBIn 进行二元分类，以确定试验的 RoB 是低 RoB 还是高/不清楚的 RoB。我们发现 RoBInGen 和 RoBInExt 都很稳健，在许多设置中都能取得最佳效果。
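
The ROC AUC reported above is a ranking metric for binary classifiers: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The following is a minimal sketch of that definition only. The function name `roc_auc`, the label convention (1 = low RoB, 0 = high/unclear RoB), and the toy scores are assumptions for illustration and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: iterable of 0/1 class labels (hypothetical convention: 1 = low RoB).
    scores: iterable of classifier scores, higher meaning "more likely 1".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The quadratic pairwise loop is chosen for readability; a sort-based rank formulation gives the same value more efficiently on large test sets.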

Title: SandboxAQ's submission to MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval

Authors: Isidora Chara Tourni, Sayontan Ghosh, Brenda Miao, Constantijn van der Poel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21501
Pdf URL: https://arxiv.org/pdf/2410.21501
Copy Paste: [[2410.21501]] SandboxAQ's submission to MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval(https://arxiv.org/abs/2410.21501)
Keywords: language model, prompt, chain-of-thought
Abstract: This paper explores the problems of Question Answering (QA) and Named Entity Recognition (NER) in five diverse languages. We tested five Large Language Models with various prompting methods, including zero-shot, chain-of-thought reasoning, and translation techniques. Our results show that while some models consistently outperform others, their effectiveness varies significantly across tasks and languages. We saw that advanced prompting techniques generally improved QA performance but had mixed results for NER; and we observed that language difficulty patterns differed between tasks. Our findings highlight the need for task-specific approaches in multilingual NLP and suggest that current models may develop different linguistic competencies for different tasks.
摘要：本文探讨了五种不同语言的问答 (QA) 和命名实体识别 (NER) 问题。我们测试了五种大型语言模型，这些模型采用了各种提示方法，包括零样本、思路链推理和翻译技术。我们的结果表明，虽然某些模型的表现始终优于其他模型，但它们的有效性在不同任务和语言之间差异很大。我们发现，高级提示技术通常会提高 QA 性能，但对 NER 的结果好坏参半；我们观察到语言难度模式在不同任务之间有所不同。我们的研究结果强调了多语言 NLP 中对任务特定方法的需求，并表明当前的模型可能会为不同的任务开发不同的语言能力。

Title: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups

Authors: Davide Ghilardi, Federico Belotti, Marco Molinari
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21508
Pdf URL: https://arxiv.org/pdf/2410.21508
Copy Paste: [[2410.21508]] Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups(https://arxiv.org/abs/2410.21508)
Keywords: language model, llm
Abstract: Sparse Autoencoders (SAEs) have recently been employed as an unsupervised approach for understanding the inner workings of Large Language Models (LLMs). They reconstruct the model's activations with a sparse linear combination of interpretable features. However, training SAEs is computationally intensive, especially as models grow in size and complexity. To address this challenge, we propose a novel training strategy that reduces the number of trained SAEs from one per layer to one for a given group of contiguous layers. Our experimental results on Pythia 160M highlight a speedup of up to 6x without compromising the reconstruction quality and performance on downstream tasks. Therefore, layer clustering presents an efficient approach to train SAEs in modern LLMs.
摘要：稀疏自动编码 (SAE) 最近被用作一种无监督方法来理解大型语言模型 (LLM) 的内部工作原理。它们使用可解释特征的稀疏线性组合重建模型的激活。然而，训练 SAE 需要大量计算，尤其是在模型规模和复杂性增加的情况下。为了应对这一挑战，我们提出了一种新颖的训练策略，将训练的 SAE 数量从每层一个减少到给定一组连续层一个。我们在 Pythia 160M 上的实验结果显示，速度提高了 6 倍，同时不影响下游任务的重建质量和性能。因此，层聚类是一种在现代 LLM 中训练 SAE 的有效方法。
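
To make the "one SAE per group of contiguous layers" idea concrete, here is a minimal sketch that partitions layer indices into contiguous groups, so that the number of SAEs to train equals the number of groups rather than the number of layers. The equal-size split, the function name `contiguous_groups`, and the 12-layer example are assumptions for illustration; the abstract does not say how the paper chooses group boundaries, and the sketch does not model the reported speedup.

```python
def contiguous_groups(num_layers, num_groups):
    """Split layer indices 0..num_layers-1 into contiguous, near-equal groups.

    Hypothetical helper: an even split is assumed here; the paper may pick
    group boundaries by a different criterion.
    """
    if not 1 <= num_groups <= num_layers:
        raise ValueError("num_groups must be between 1 and num_layers")

    base, extra = divmod(num_layers, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)   # spread the remainder over the first groups
        groups.append(list(range(start, start + size)))
        start += size
    return groups


# Assumed example: a 12-layer model with 4 groups means 4 SAEs instead of 12.
example_groups = contiguous_groups(12, 4)
saes_to_train = len(example_groups)
```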

Title: Unveiling Context-Aware Criteria in Self-Assessing LLMs

Authors: Taneesh Gupta, Shivam Shandilya, Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Huaxiu Yao, Saravan Rajmohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21545
Pdf URL: https://arxiv.org/pdf/2410.21545
Copy Paste: [[2410.21545]] Unveiling Context-Aware Criteria in Self-Assessing LLMs(https://arxiv.org/abs/2410.21545)
Keywords: language model, llm
Abstract: The use of large language models (LLMs) as evaluators has garnered significant attention due to their potential to rival human-level evaluations in long-form response assessments. However, current LLM evaluators rely heavily on static, human-defined criteria, limiting their ability to generalize across diverse generative tasks and incorporate context-specific knowledge. In this paper, we propose a novel Self-Assessing LLM framework that integrates Context-Aware Criteria (SALC) with dynamic knowledge tailored to each evaluation instance. This instance-level knowledge enhances the LLM evaluator's performance by providing relevant and context-aware insights that pinpoint the important criteria specific to the current instance. Additionally, the proposed framework adapts seamlessly to various tasks without relying on predefined human criteria, offering a more flexible evaluation approach. Empirical evaluations demonstrate that our approach significantly outperforms existing baseline evaluation frameworks, yielding average improvements of 4.8% across a wide variety of datasets. Furthermore, by leveraging knowledge distillation techniques, we fine-tuned smaller language models for criteria generation and evaluation, achieving comparable or superior performance to larger models at much lower cost. Our method also improves the LC Win-Rate on the AlpacaEval2 leaderboard by up to 12% when employed for preference data generation in Direct Preference Optimization (DPO), underscoring its efficacy as a robust and scalable evaluation framework.
摘要：使用大型语言模型 (LLM) 作为评估器引起了广泛关注，因为它们有可能在长篇响应评估中与人类水平的评估相媲美。然而，目前的 LLM 评估器严重依赖静态的、人类定义的标准，限制了它们在各种生成任务中进行推广和整合特定于上下文的知识的能力。在本文中，我们提出了一种新颖的自评估 LLM 框架，该框架将上下文感知标准 (SALC) 与针对每个评估实例量身定制的动态知识相结合。这种实例级知识通过提供相关和上下文感知的见解来提高 LLM 评估器的性能，这些见解可以精确定位特定于当前实例的重要标准。此外，所提出的框架可以无缝适应各种任务，而无需依赖预定义的人类标准，从而提供更灵活的评估方法。实证评估表明，我们的方法明显优于现有的基线评估框架，在各种数据集中平均提高了 4.8%。此外，通过利用知识蒸馏技术，我们对较小的语言模型进行了微调，以生成和评估标准，以更低的成本实现了与较大模型相当甚至更好的性能。当用于直接偏好优化 (DPO) 中的偏好数据生成时，我们的方法还显示出 AlpacaEval2 排行榜中的 LC Win-Rate 提高了 12%，这凸显了其作为强大且可扩展的评估框架的有效性。

Title: MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression

Authors: Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard
Subjects: cs.CL, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21548
Pdf URL: https://arxiv.org/pdf/2410.21548
Copy Paste: [[2410.21548]] MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression(https://arxiv.org/abs/2410.21548)
Keywords: language model, llm
Abstract: Large language models have drastically changed the prospects of AI by introducing technologies for more complex natural language processing. However, current methodologies to train such LLMs require extensive resources including but not limited to large amounts of data, expensive machinery, and lengthy training. To solve this problem, this paper proposes a new tokenization method inspired by universal Lempel-Ziv-Welch data compression that compresses repetitive phrases into multi-word tokens. With MultiTok as a new tokenizing tool, we show that language models are able to be trained notably more efficiently while offering a similar accuracy on more succinct and compressed training data. In fact, our results demonstrate that MultiTok achieves a comparable performance to the BERT standard as a tokenizer while also providing close to 2.5x faster training with more than 30% less training data.
摘要：大型语言模型通过引入更复杂的自然语言处理技术，彻底改变了人工智能的前景。然而，目前训练此类 LLM 的方法需要大量资源，包括但不限于大量数据、昂贵的机器和漫长的训练。为了解决这个问题，本文提出了一种新的标记化方法，该方法受到通用 Lempel-Ziv-Welch 数据压缩的启发，将重复的短语压缩为多词标记。使用 MultiTok 作为新的标记化工具，我们表明语言模型能够以显著更高的效率进行训练，同时在更简洁和压缩的训练数据上提供类似的准确性。事实上，我们的结果表明，MultiTok 作为标记器实现了与 BERT 标准相当的性能，同时还提供近 2.5 倍的训练速度，而训练数据减少了 30% 以上。
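
As a rough illustration of how Lempel-Ziv-Welch compression can turn repeated phrases into multi-word tokens, the sketch below runs textbook LZW over a sequence of words instead of bytes. This is not MultiTok itself: the function names, the whitespace word split, and the choice to decode through the returned phrase table are assumptions for illustration.

```python
def lzw_encode_words(words):
    """Word-level LZW: returns (codes, table), where table maps phrases to ids.

    The table starts with every distinct single word and grows by one
    multi-word phrase each time an unseen phrase is encountered.
    """
    table = {}
    for w in words:
        if (w,) not in table:
            table[(w,)] = len(table)

    codes, current = [], ()
    for w in words:
        candidate = current + (w,)
        if candidate in table:
            current = candidate              # keep extending the known phrase
        else:
            codes.append(table[current])     # emit the longest known phrase
            table[candidate] = len(table)    # learn the new, longer phrase
            current = (w,)
    if current:
        codes.append(table[current])
    return codes, table


def lzw_decode_words(codes, table):
    """Invert the encoding using the phrase table produced by the encoder."""
    id_to_phrase = {i: phrase for phrase, i in table.items()}
    out = []
    for c in codes:
        out.extend(id_to_phrase[c])
    return out


sample = "the cat sat on the mat the cat sat on the mat".split()
sample_codes, sample_table = lzw_encode_words(sample)
```

On the repeated sentence above, later occurrences of phrases such as "the cat" and "sat on" are emitted as single tokens, so the encoded sequence is shorter than the word sequence while remaining losslessly decodable.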

Title: Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense

Authors: Samuel Cahyawijaya, Ruochen Zhang, Holy Lovenia, Jan Christian Blaise Cruz, Hiroki Nomoto, Alham Fikri Aji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21573
Pdf URL: https://arxiv.org/pdf/2410.21573
Copy Paste: [[2410.21573]] Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense(https://arxiv.org/abs/2410.21573)
Keywords: language model, llm
Abstract: Multilingual large language models (LLMs) have gained prominence, but concerns arise regarding their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing a novel benchmark for cross-lingual sense disambiguation, StingrayBench. In this paper, we demonstrate using false friends -- words that are orthographically similar but have completely different meanings in two languages -- as a possible approach to pinpoint the limitation of cross-lingual sense disambiguation in LLMs. We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German; and challenge LLMs to distinguish the use of them in context. In our analysis of various models, we observe they tend to be biased toward higher-resource languages. We also propose new metrics for quantifying the cross-lingual sense bias and comprehension based on our benchmark. Our work contributes to developing more diverse and inclusive language modeling, promoting fairer access for the wider multilingual community.
摘要：多语言大型语言模型 (LLM) 已获得广泛认可，但人们对其在英语以外的可靠性产生了担忧。本研究通过引入一种新的跨语言意义消歧基准 StingrayBench 来解决跨语言语义评估中的差距。在本文中，我们展示了使用假朋友（在两种语言中拼写相似但含义完全不同的词）作为一种可能的方法来确定 LLM 中跨语言意义消歧的局限性。我们收集了四种语言对中的假朋友，即印尼语-马来语、印尼语-他加禄语、中文-日语和英语-德语；并要求 LLM 区分它们在上下文中的使用。在对各种模型的分析中，我们观察到它们倾向于偏向资源丰富的语言。我们还根据我们的基准提出了量化跨语言意义偏差和理解的新指标。我们的工作有助于开发更加多样化和包容性的语言模型，促进更广泛的多语言社区获得更公平的访问。

Title: Reducing the Scope of Language Models with Circuit Breakers

Authors: David Yunis, Siyu Huo, Chulaka Gunasekara, Danish Contractor
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21597
Pdf URL: https://arxiv.org/pdf/2410.21597
Copy Paste: [[2410.21597]] Reducing the Scope of Language Models with Circuit Breakers(https://arxiv.org/abs/2410.21597)
Keywords: language model, prompt
Abstract: Language models are now deployed in a wide variety of user-facing applications, often for specific purposes like answering questions about documentation or acting as coding assistants. As these models are intended for particular purposes, they should not be able to answer irrelevant queries like requests for poetry or questions about physics, or even worse, queries that can only be answered by humans like sensitive company policies. Instead we would like them to only answer queries corresponding to desired behavior and refuse all other requests, which we refer to as scoping. We find that, despite the use of system prompts, two representative language models can be poorly scoped and respond to queries they should not be addressing. We then conduct a comprehensive empirical evaluation of methods which could be used for scoping the behavior of language models. Among many other results, we show that a recently-proposed method for general alignment, Circuit Breakers (CB), can be adapted to scope language models to very specific tasks like sentiment analysis or summarization or even tasks with finer-grained scoping (e.g. summarizing only news articles). When compared to standard methods like fine-tuning or preference learning, CB is more robust both for out of distribution tasks, and to adversarial prompting techniques. We also show that layering SFT and CB together often results in the best of both worlds: improved performance only on relevant queries, while rejecting irrelevant ones.
摘要：语言模型现在已部署在各种面向用户的应用程序中，通常用于特定目的，例如回答有关文档的问题或充当编码助手。由于这些模型用于特定目的，因此它们不应能够回答无关的查询，例如对诗歌的请求或有关物理的问题，甚至更糟的是，只能由人类回答的查询，例如敏感的公司政策。相反，我们希望它们只回答与期望行为相对应的查询，并拒绝所有其他请求，我们称之为范围界定。我们发现，尽管使用了系统提示，但两个代表性语言模型的范围界定可能很差，并会响应它们不应该解决的查询。然后，我们对可用于确定语言模型行为范围的方法进行了全面的实证评估。在许多其他结果中，我们表明，最近提出的一种通用对齐方法 Circuit Breakers (CB) 可以适应将语言模型的范围界定为非常具体的任务，例如情绪分析或总结，甚至是具有更细粒度范围界定的任务（例如仅总结新闻文章）。与微调或偏好学习等标准方法相比，CB 对于分布外任务和对抗性提示技术都更为稳健。我们还表明，将 SFT 和 CB 分层在一起通常可以实现两全其美的效果：只提高相关查询的性能，同时拒绝不相关的查询。

Title: MCPDial: A Minecraft Persona-driven Dialogue Dataset

Authors: Seyed Hossein Alavi, Sudha Rao, Ashutosh Adhikari, Gabriel A DesGarennes, Akanksha Malhotra, Chris Brockett, Mahmoud Adada, Raymond T. Ng, Vered Shwartz, Bill Dolan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21627
Pdf URL: https://arxiv.org/pdf/2410.21627
Copy Paste: [[2410.21627]] MCPDial: A Minecraft Persona-driven Dialogue Dataset(https://arxiv.org/abs/2410.21627)
Keywords: language model, llm
Abstract: We propose a novel approach that uses large language models (LLMs) to generate persona-driven conversations between Players and Non-Player Characters (NPC) in games. Showcasing the application of our methodology, we introduce the Minecraft Persona-driven Dialogue dataset (MCPDial). Starting with a small seed of expert-written conversations, we employ our method to generate hundreds of additional conversations. Each conversation in the dataset includes rich character descriptions of the player and NPC. The conversations are long, allowing for in-depth and extensive interactions between the player and NPC. MCPDial extends beyond basic conversations by incorporating canonical function calls (e.g. "Call find a resource on iron ore") between the utterances. Finally, we conduct a qualitative analysis of the dataset to assess its quality and characteristics.
摘要：我们提出了一种新颖的方法，该方法使用大型语言模型 (LLM) 来生成游戏中玩家和非玩家角色 (NPC) 之间的角色驱动对话。为了展示我们的方法的应用，我们介绍了 Minecraft 角色驱动对话数据集 (MCPDial)。从一小部分专家撰写的对话开始，我们使用我们的方法生成数百个额外的对话。数据集中的每个对话都包含玩家和 NPC 的丰富角色描述。对话很长，允许玩家和 NPC 之间进行深入而广泛的互动。MCPDial 通过在话语之间加入规范函数调用（例如“呼叫查找铁矿石资源”）来超越基本对话。最后，我们对数据集进行定性分析以评估其质量和特征。

Title: Are Paraphrases Generated by Large Language Models Invertible?

Authors: Rafael Rivera Soto, Barry Chen, Nicholas Andrews
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21637
Pdf URL: https://arxiv.org/pdf/2410.21637
Copy Paste: [[2410.21637]] Are Paraphrases Generated by Large Language Models Invertible?(https://arxiv.org/abs/2410.21637)
Keywords: language model
Abstract: Large language models can produce highly fluent paraphrases while retaining much of the original meaning. While this capability has a variety of helpful applications, it may also be abused by bad actors, for example to plagiarize content or to conceal their identity. This motivates us to consider the problem of paraphrase inversion: given a paraphrased document, attempt to recover the original text. To explore the feasibility of this task, we fine-tune paraphrase inversion models, both with and without additional author-specific context to help guide the inversion process. We explore two approaches to author-specific inversion: one using in-context examples of the target author's writing, and another using learned style representations that capture distinctive features of the author's style. We show that, when starting from paraphrased machine-generated text, we can recover significant portions of the document using a learned inversion model. When starting from human-written text, the variety of source writing styles poses a greater challenge for invertibility. However, even when the original tokens can't be recovered, we find the inverted text is stylistically similar to the original, which significantly improves the performance of plagiarism detectors and authorship identification systems that rely on stylistic markers.
摘要：大型语言模型可以生成高度流畅的释义，同时保留大部分原意。虽然此功能具有多种有用的应用，但它也可能被不良行为者滥用，例如剽窃内容或隐瞒身份。这促使我们考虑释义反转问题：给定一个释义文档，尝试恢复原始文本。为了探索此任务的可行性，我们对释义反转模型进行了微调，既有带作者特定上下文的，也有不带作者特定上下文的，以帮助指导反转过程。我们探索了两种作者特定反转方法：一种使用目标作者写作的上下文示例，另一种使用学习到的风格表示来捕捉作者风格的独特特征。我们表明，当从释义的机器生成文本开始时，我们可以使用学习到的反转模型恢复文档的很大一部分。当从人类书写的文本开始时，源写作风格的多样性对可逆性提出了更大的挑战。然而，即使无法恢复原始标记，我们也发现倒置的文本在风格上与原文相似，这显著提高了依赖风格标记的抄袭检测器和作者识别系统的性能。

Title: $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization

Authors: Jiaqi Han, Mingjian Jiang, Yuxuan Song, Jure Leskovec, Stefano Ermon, Minkai Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21662
Pdf URL: https://arxiv.org/pdf/2410.21662
Copy Paste: [[2410.21662]] $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization(https://arxiv.org/abs/2410.21662)
Keywords: language model
Abstract: Preference optimization has made significant progress recently, with numerous methods developed to align language models with human preferences. This paper introduces $f$-divergence Preference Optimization ($f$-PO), a novel framework that generalizes and extends existing approaches. $f$-PO minimizes $f$-divergences between the optimized policy and the optimal policy, encompassing a broad family of alignment methods using various divergences. Our approach unifies previous algorithms like DPO and EXO, while offering new variants through different choices of $f$-divergences. We provide theoretical analysis of $f$-PO's properties and conduct extensive experiments on state-of-the-art language models using benchmark datasets. Results demonstrate $f$-PO's effectiveness across various tasks, achieving superior performance compared to existing methods on popular benchmarks such as AlpacaEval 2, Arena-Hard, and MT-Bench. Additionally, we present ablation studies exploring the impact of different $f$-divergences, offering insights into the trade-offs between regularization and performance in offline preference optimization. Our work contributes both practical algorithms and theoretical understanding to the field of language model alignment. Code is available at this https URL.
摘要：偏好优化最近取得了重大进展，开发了许多方法来将语言模型与人类偏好对齐。本文介绍了 $f$-散度偏好优化 ($f$-PO)，这是一种概括和扩展现有方法的新框架。$f$-PO 最小化优化策略和最优策略之间的 $f$-散度，涵盖使用各种散度的广泛对齐方法。我们的方法统一了以前的算法，如 DPO 和 EXO，同时通过不同的 $f$-散度选择提供了新的变体。我们对 $f$-PO 的属性进行了理论分析，并使用基准数据集对最先进的语言模型进行了广泛的实验。结果证明了 $f$-PO 在各种任务中的有效性，与 AlpacaEval 2、Arena-Hard 和 MT-Bench 等流行基准上的现有方法相比，取得了卓越的性能。此外，我们提出了消融研究，探索不同 $f$-散度的影响，为离线偏好优化中正则化和性能之间的权衡提供了见解。我们的工作为语言模型对齐领域贡献了实用算法和理论理解。代码可在此 https URL 上获取。
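
For orientation, the sketch below computes the standard DPO loss for a single preference pair, since DPO is one of the existing algorithms the abstract says $f$-PO unifies. It shows only that well-known objective, not the generalized $f$-PO objective, which the abstract does not spell out. The function name, the default `beta`, and the toy log-probabilities are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is a sequence log-probability: logp_* under the policy being
    trained, ref_logp_* under the frozen reference policy.
    loss = -log(sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)]))
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) = log(1 + exp(-m)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy values, invented for illustration: the policy has shifted probability
# mass toward the chosen response relative to the reference.
toy_loss = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.1)
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls below that value as the policy favors the chosen response more strongly than the reference does.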

Title: CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs

Authors: Zhihao Liu, Chenhui Hu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.21695
Pdf URL: https://arxiv.org/pdf/2410.21695
Copy Paste: [[2410.21695]] CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs(https://arxiv.org/abs/2410.21695)
Keywords: language model, gpt, llm, prompt
Abstract: As large language models (LLMs) rapidly evolve, they bring significant conveniences to our work and daily lives, but also introduce considerable safety risks. These models can generate texts with social biases or unethical content, and under specific adversarial instructions, may even incite illegal activities. Therefore, rigorous safety assessments of LLMs are crucial. In this work, we introduce a safety assessment benchmark, CFSafety, which integrates 5 classic safety scenarios and 5 types of instruction attacks, totaling 10 categories of safety questions, to form a test set with 25k prompts. This test set was used to evaluate the natural language generation (NLG) capabilities of LLMs, employing a combination of simple moral judgment and a 1-5 safety rating scale for scoring. Using this benchmark, we tested eight popular LLMs, including the GPT series. The results indicate that while GPT-4 demonstrated superior safety performance, the safety effectiveness of LLMs, including this model, still requires improvement. The data and code associated with this study are available on GitHub.
摘要：大型语言模型（LLM）的快速发展，为我们的工作和日常生活带来了极大的便利，但也带来了不小的安全风险。这些模型可能生成带有社会偏见或不道德内容的文本，在特定的对抗性指令下，甚至可能煽动违法行为。因此，对LLM进行严格的安全评估至关重要。在本文中，我们引入了一个安全评估基准CFSafety，它整合了5种经典安全场景和5种指令攻击，共计10类安全问题，形成了一个包含25k个提示的测试集。该测试集用于评估LLM的自然语言生成（NLG）能力，采用简单的道德判断和1-5安全等级量表相结合的方式进行评分。使用这个基准，我们测试了包括GPT系列在内的8个流行的LLM。结果表明，虽然GPT-4表现出了优越的安全性能，但包括该模型在内的LLM的安全有效性仍有待提高。与本研究相关的数据和代码可在GitHub上找到。

Title: A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution

Authors: Zhengmian Hu, Tong Zheng, Heng Huang
Subjects: cs.CL, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2410.21716
Pdf URL: https://arxiv.org/pdf/2410.21716
Copy Paste: [[2410.21716]] A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution(https://arxiv.org/abs/2410.21716)
Keywords: language model, llm
Abstract: Authorship attribution aims to identify the origin or author of a document. Traditional approaches have heavily relied on manual features and fail to capture long-range correlations, limiting their effectiveness. Recent advancements leverage text embeddings from pre-trained language models, which require significant fine-tuning on labeled data, posing challenges in data dependency and limited interpretability. Large Language Models (LLMs), with their deep reasoning capabilities and ability to maintain long-range textual associations, offer a promising alternative. This study explores the potential of pre-trained LLMs in one-shot authorship attribution, specifically utilizing Bayesian approaches and probability outputs of LLMs. Our methodology calculates the probability that a text entails previous writings of an author, reflecting a more nuanced understanding of authorship. By utilizing only pre-trained models such as Llama-3-70B, our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors. Our findings set new baselines for one-shot authorship analysis using LLMs and expand the application scope of these models in forensic linguistics. This work also includes extensive ablation studies to validate our approach.
摘要：作者归属旨在确定文档的来源或作者。传统方法严重依赖手动特征，无法捕捉长距离相关性，从而限制了其有效性。最近的进展利用了预训练语言模型的文本嵌入，这需要对标记数据进行大量微调，在数据依赖性和有限的可解释性方面带来了挑战。大型语言模型 (LLM) 具有深度推理能力和维持长距离文本关联的能力，提供了一种有前途的替代方案。本研究探讨了预训练 LLM 在一次性作者归属方面的潜力，特别是利用贝叶斯方法和 LLM 的概率输出。我们的方法计算文本包含作者先前作品的概率，反映了对作者身份更细致入微的理解。通过仅使用预训练模型（例如 Llama-3-70B），我们在 IMDb 和博客数据集上的结果显示，在十位作者的一次性作者分类中，准确率高达 85\%。我们的研究结果为使用 LLM 进行一次性作者身份分析设定了新的基准，并扩大了这些模型在法医语言学中的应用范围。这项工作还包括广泛的消融研究以验证我们的方法。
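
The Bayesian step in this abstract can be illustrated with a minimal sketch. It assumes that, for each candidate author, a log-likelihood of the query text given that author's known writing is already available (in the paper's setting this would come from an LLM's probability outputs). The function name `author_posterior`, the uniform prior, and the example scores are illustrative assumptions, not the authors' implementation.

```python
import math

def author_posterior(log_likelihoods, log_priors=None):
    """Convert per-author log-likelihoods into posterior probabilities (Bayes' rule)."""
    authors = list(log_likelihoods)
    if log_priors is None:
        # Assumption: uniform prior over the candidate authors.
        log_priors = {a: -math.log(len(authors)) for a in authors}
    joint = {a: log_likelihoods[a] + log_priors[a] for a in authors}
    # Log-sum-exp keeps the normalisation numerically stable for long texts.
    m = max(joint.values())
    log_evidence = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    return {a: math.exp(v - log_evidence) for a, v in joint.items()}

# Hypothetical scores: log P(query text | known text of each author).
scores = {"author_a": -120.0, "author_b": -123.0, "author_c": -130.0}
posterior = author_posterior(scores)
best = max(posterior, key=posterior.get)
```

Working in log space matters here because the likelihood of a whole document is a product of many token probabilities and would underflow if exponentiated directly.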

Title: Let's Be Self-generated via Step by Step: A Curriculum Learning Approach to Automated Reasoning with Large Language Models

Authors: Kangyang Luo, Zichen Ding, Zhenmin Weng, Lingfeng Qiao, Meng Zhao, Xiang Li, Di Yin, Jinlong Shu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21728
Pdf URL: https://arxiv.org/pdf/2410.21728
Copy Paste: [[2410.21728]] Let's Be Self-generated via Step by Step: A Curriculum Learning Approach to Automated Reasoning with Large Language Models(https://arxiv.org/abs/2410.21728)
Keywords: language model, llm, prompt
Abstract: While Chain of Thought (CoT) prompting approaches have significantly consolidated the reasoning capabilities of large language models (LLMs), they still face limitations: they either require extensive human effort or leave room for performance improvement. Existing endeavors have focused on bridging these gaps; however, these approaches either hinge on external data and cannot completely eliminate manual effort, or they fall short in effectively directing LLMs to generate high-quality exemplary prompts. To address these pitfalls, we propose a novel prompt approach for automatic reasoning named LBS3, inspired by curriculum learning, which better reflects human learning habits. Specifically, LBS3 initially steers LLMs to recall easy-to-hard proxy queries that are pertinent to the target query. Following this, it invokes a progressive strategy that utilizes exemplary prompts stemming from easy-proxy queries to direct LLMs in solving hard-proxy queries, ensuring high-quality proxy solutions. Finally, our extensive experiments in various reasoning-intensive tasks with varying open- and closed-source LLMs show that LBS3 achieves strongly competitive performance compared to the SOTA baselines.
摘要：虽然思路链 (CoT) 提示方法大大增强了大型语言模型 (LLM) 的推理能力，但它们仍然面临需要大量人力或性能有待提高的限制。现有的努力主要集中在弥合这些差距；然而，这些方法要么依赖于外部数据而不能完全消除人工，要么无法有效地指导 LLM 生成高质量的示例提示。为了解决上述缺陷，我们提出了一种名为 \textbf{LBS3} 的新型自动推理提示方法，该方法受到课程学习的启发，可以更好地反映人类的学习习惯。具体而言，LBS3 首先引导 LLM 回忆与目标查询相关的从易到难的代理查询。随后，它调用一种渐进式策略，利用来自易代理查询的示例提示来指导 LLM 解决难代理查询，从而实现高质量的代理解决方案。最后，我们使用各种开源和闭源 LLM 在各种推理密集型任务中进行了大量的实验，结果表明，与 SOTA 基线相比，LBS3 实现了极具竞争力的性能。

Title: Enhancing Financial Question Answering with a Multi-Agent Reflection Framework

Authors: Sorouralsadat Fatemi, Yuheng Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21741
Pdf URL: https://arxiv.org/pdf/2410.21741
Copy Paste: [[2410.21741]] Enhancing Financial Question Answering with a Multi-Agent Reflection Framework(https://arxiv.org/abs/2410.21741)
Keywords: language model, gpt, llm, agent
Abstract: While Large Language Models (LLMs) have shown impressive capabilities in numerous Natural Language Processing (NLP) tasks, they still struggle with financial question answering (QA), particularly when numerical reasoning is required. Recently, LLM-based multi-agent frameworks have demonstrated remarkable effectiveness in multi-step reasoning, which is crucial for financial QA tasks as it involves extracting relevant information from tables and text and then performing numerical reasoning on the extracted data to infer answers. In this study, we propose a multi-agent framework incorporating a critic agent that reflects on the reasoning steps and final answers for each question. Additionally, we enhance our system by adding multiple critic agents, each focusing on a specific aspect of the answer. Our results indicate that this framework significantly improves performance compared to single-agent reasoning, with an average performance increase of 15% for the LLaMA3-8B model and 5% for the LLaMA3-70B model. Furthermore, our framework performs on par with, and in some cases surpasses, larger single-agent LLMs such as LLaMA3.1-405B and GPT-4o-mini, though it falls slightly short compared to Claude-3.5 Sonnet. Overall, our framework presents an effective solution to enhance open-source LLMs for financial QA tasks, offering a cost-effective alternative to larger models like Claude-3.5 Sonnet.
摘要：虽然大型语言模型 (LLM) 在许多自然语言处理 (NLP) 任务中表现出令人印象深刻的能力，但它们在金融问答 (QA) 方面仍然举步维艰，尤其是在需要数字推理时。最近，基于 LLM 的多智能体框架在多步骤推理中表现出了显著的效果，这对于金融 QA 任务至关重要，因为它涉及从表格和文本中提取相关信息，然后对提取的数据进行数字推理以推断答案。在本研究中，我们提出了一个多智能体框架，其中包含一个评论智能体，它可以反映每个问题的推理步骤和最终答案。此外，我们通过添加多个评论智能体来增强我们的系统，每个评论智能体都专注于答案的特定方面。我们的结果表明，与单智能体推理相比，该框架显著提高了性能，LLaMA3-8B 模型的平均性能提高了 15%，LLaMA3-70B 模型的平均性能提高了 5%。此外，我们的框架性能与 LLaMA3.1-405B 和 GPT-4o-mini 等大型单智能体 LLM 相当，在某些情况下甚至超过它们，但与 Claude-3.5 Sonnet 相比略有不足。总体而言，我们的框架提供了一种有效的解决方案来增强用于金融 QA 任务的开源 LLM，为 Claude-3.5 Sonnet 等大型模型提供了一种经济高效的替代方案。
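
The critic-based reflection described in this abstract can be sketched as a simple loop. The abstract does not give the control flow, so everything here is an assumption for illustration: the function name `answer_with_reflection`, the round limit, the convention that the critic returns `None` to accept an answer, and the toy solver and critic that stand in for LLM agents.

```python
def answer_with_reflection(question, solver, critic, max_rounds=2):
    """Ask a solver agent, then let a critic agent request revisions."""
    answer = solver(question, None)
    for _ in range(max_rounds):
        feedback = critic(question, answer)
        if feedback is None:
            # Assumed convention: None means the critic accepts the answer.
            break
        answer = solver(question, feedback)
    return answer

def toy_solver(question, feedback):
    # Stand-in for an LLM: the first attempt has an arithmetic slip,
    # and the revision after feedback corrects it.
    return "12" if feedback is None else "15"

def toy_critic(question, answer):
    # Stand-in for a critic LLM checking the numerical reasoning.
    return None if answer == "15" else "Re-check the addition of the two values."

result = answer_with_reflection("What is 7 + 8?", toy_solver, toy_critic)
```

Several critics, each focused on one aspect of the answer, could be supported under the same assumptions by calling each in turn and concatenating any non-empty feedback.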

Title: Learning and Unlearning of Fabricated Knowledge in Language Models

Authors: Chen Sun, Nolan Andrew Miller, Andrey Zhmoginov, Max Vladymyrov, Mark Sandler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21750
Pdf URL: https://arxiv.org/pdf/2410.21750
Copy Paste: [[2410.21750]] Learning and Unlearning of Fabricated Knowledge in Language Models(https://arxiv.org/abs/2410.21750)
Keywords: language model, prompt
Abstract: What happens when a new piece of knowledge is introduced into the training data and how long does it last while a large language model (LM) continues to train? We investigate this question by injecting facts into LMs from a new probing dataset, "Outlandish", which is designed to permit the testing of a spectrum of different fact types. When studying how robust these memories are, there appears to be a sweet spot in the spectrum of fact novelty between consistency with world knowledge and total randomness, where the injected memory is the most enduring. Specifically we show that facts that conflict with common knowledge are remembered for tens of thousands of training steps, while prompts not conflicting with common knowledge (mundane), as well as scrambled prompts (randomly jumbled) are both forgotten much more rapidly. Further, knowledge-conflicting facts can "prime" how the language model hallucinates on logically unrelated prompts, showing their propensity for non-target generalization, while both mundane and randomly jumbled facts prime significantly less. Finally, we show that impacts of knowledge-conflicting facts in LMs, though they can be long lasting, can be largely erased by novel application of multi-step sparse updates, even while the training ability of the model is preserved. As such, this very simple procedure has direct implications for mitigating the effects of data poisoning in training.
摘要：当新的知识被引入训练数据时会发生什么？在大型语言模型 (LM) 继续训练时，这些知识会持续多久？我们通过将事实从新的探测数据集“Outlandish”注入 LM 来研究这个问题，该数据集旨在允许测试一系列不同类型的事实。在研究这些记忆的稳健性时，似乎在事实新颖性的范围中存在一个最佳点，即与世界知识的一致性和完全随机性之间的最佳点，其中注入的记忆最持久。具体来说，我们表明与常识相冲突的事实会被记住数万个训练步骤，而与常识不冲突的提示（平凡）以及混乱的提示（随机混乱）都会被遗忘得更快。此外，知识冲突的事实可以“启动”语言模型如何在逻辑上不相关的提示上产生幻觉，显示出它们对非目标泛化的倾向，而平凡和随机混乱的事实启动作用则要小得多。最后，我们表明，尽管知识冲突的事实对 LM 的影响可能持续很长时间，但可以通过多步稀疏更新的新应用在很大程度上消除这些影响，即使在保留模型的训练能力的情况下也是如此。因此，这个非常简单的过程对于减轻训练中数据中毒的影响具有直接影响。

Title: Leveraging LLMs for Hypothetical Deduction in Logical Inference: A Neuro-Symbolic Approach

Authors: Qingchuan Li, Jiatong Li, Tongxuan Liu, Yuting Zeng, Mingyue Cheng, Weizhe Huang, Qi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21779
Pdf URL: https://arxiv.org/pdf/2410.21779
Copy Paste: [[2410.21779]] Leveraging LLMs for Hypothetical Deduction in Logical Inference: A Neuro-Symbolic Approach(https://arxiv.org/abs/2410.21779)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have exhibited remarkable potential across a wide array of reasoning tasks, including logical reasoning. Although massive efforts have been made to empower the logical reasoning ability of LLMs via external logical symbolic solvers, crucial challenges of the poor generalization ability to questions with different features and inevitable question information loss of symbolic solver-driven approaches remain unresolved. To mitigate these issues, we introduce LINA, an LLM-driven neuro-symbolic approach for faithful logical reasoning. By enabling an LLM to autonomously perform the transition from propositional logic extraction to sophisticated logical reasoning, LINA not only bolsters the resilience of the reasoning process but also eliminates the dependency on external solvers. Additionally, through its adoption of a hypothetical-deductive reasoning paradigm, LINA effectively circumvents the expansive search space challenge that plagues traditional forward reasoning methods. Empirical evaluations demonstrate that LINA substantially outperforms both established propositional logic frameworks and conventional prompting techniques across a spectrum of five logical reasoning tasks. Specifically, LINA achieves an improvement of 24.34% over LINC on the FOLIO dataset, while also surpassing prompting strategies like CoT and CoT-SC by up to 24.02%. Our code is available at this https URL.
摘要：大型语言模型 (LLM) 在包括逻辑推理在内的各种推理任务中都表现出了巨大的潜力。尽管人们已经付出了巨大的努力来通过外部逻辑符号求解器增强 LLM 的逻辑推理能力，但符号求解器驱动的方法对具有不同特征的问题的泛化能力较差以及不可避免的问题信息丢失等关键挑战仍未得到解决。为了缓解这些问题，我们引入了 LINA，这是一种由 LLM 驱动的神经符号方法，用于忠实的逻辑推理。通过使 LLM 能够自主执行从命题逻辑提取到复杂逻辑推理的过渡，LINA 不仅增强了推理过程的弹性，而且还消除了对外部求解器的依赖。此外，通过采用假设演绎推理范式，LINA 有效地规避了困扰传统正向推理方法的扩展搜索空间挑战。实证评估表明，在五项逻辑推理任务中，LINA 的表现远胜于现有的命题逻辑框架和传统的提示技术。具体来说，LINA 在 FOLIO 数据集上的表现比 LINC 提高了 24.34%，同时也比 CoT 和 CoT-SC 等提示策略提高了 24.02%。我们的代码可在此 https URL 上找到。

Title: Enhancing Adversarial Attacks through Chain of Thought

Authors: Jingbo Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21791
Pdf URL: https://arxiv.org/pdf/2410.21791
Copy Paste: [[2410.21791]] Enhancing Adversarial Attacks through Chain of Thought(https://arxiv.org/abs/2410.21791)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated impressive performance across various domains but remain susceptible to safety concerns. Prior research indicates that gradient-based adversarial attacks are particularly effective against aligned LLMs and that chain of thought (CoT) prompting can elicit desired answers through step-by-step reasoning. This paper proposes enhancing the robustness of adversarial attacks on aligned LLMs by integrating CoT prompts with the greedy coordinate gradient (GCG) technique. Using CoT triggers instead of affirmative targets stimulates the reasoning abilities of backend LLMs, thereby improving the transferability and universality of adversarial attacks. We conducted an ablation study comparing our CoT-GCG approach with Amazon Web Services auto-cot. Results revealed that our approach outperformed both the baseline GCG attack and CoT prompting. Additionally, we used Llama Guard to evaluate potentially harmful interactions, providing a more objective risk assessment of entire conversations compared to matching outputs to rejection phrases. The code of this paper is available at this https URL.
摘要：大型语言模型 (LLM) 在各个领域都表现出令人印象深刻的性能，但仍然容易受到安全问题的影响。先前的研究表明，基于梯度的对抗性攻击对对齐的 LLM 特别有效，而思路链 (CoT) 提示可以通过逐步推理引出所需的答案。本文提出通过将 CoT 提示与贪婪坐标梯度 (GCG) 技术相结合来增强对对齐 LLM 的对抗性攻击的鲁棒性。使用 CoT 触发器代替肯定目标可以刺激后端 LLM 的推理能力，从而提高对抗性攻击的可转移性和普遍性。我们进行了一项消融研究，将我们的 CoT-GCG 方法与 Amazon Web Services auto-cot 进行了比较。结果表明，我们的方法优于基线 GCG 攻击和 CoT 提示。此外，我们使用 Llama Guard 来评估潜在的有害交互，与将输出与拒绝短语匹配相比，它提供了对整个对话的更客观的风险评估。本文的代码可在此 https URL 上找到。

Title: SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication

Authors: Nguyen Le Hoang, Tadahiro Taniguchi, Fang Tianwei, Akira Taniguchi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21803
Pdf URL: https://arxiv.org/pdf/2410.21803
Copy Paste: [[2410.21803]] SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication(https://arxiv.org/abs/2410.21803)
Keywords: agent
Abstract: Emergent communication, driven by generative models, enables agents to develop a shared language for describing their individual views of the same objects through interactions. Meanwhile, self-supervised learning (SSL), particularly SimSiam, uses discriminative representation learning to make representations of augmented views of the same data point closer in the representation space. Building on the prior work of VI-SimSiam, which incorporates a generative and Bayesian perspective into the SimSiam framework via variational inference (VI) interpretation, we propose SimSiam+VAE, a unified approach for both representation learning and emergent communication. SimSiam+VAE integrates a variational autoencoder (VAE) into the predictor of the SimSiam network to enhance representation learning and capture uncertainty. Experimental results show that SimSiam+VAE outperforms both SimSiam and VI-SimSiam. We further extend this model into a communication framework called the SimSiam Naming Game (SSNG), which applies the generative and Bayesian approach based on VI to develop internal representations and emergent language, while utilizing the discriminative process of SimSiam to facilitate mutual understanding between agents. In experiments with established models, despite the dynamic alternation of agent roles during interactions, SSNG demonstrates comparable performance to the referential game and slightly outperforms the Metropolis-Hastings naming game.
摘要：由生成模型驱动的紧急通信使代理能够开发一种共享语言，通过交互来描述他们对同一对象的个人看法。同时，自监督学习 (SSL)，尤其是 SimSiam，使用判别表示学习使同一数据点的增强视图的表示在表示空间中更接近。在 VI-SimSiam 的先前工作的基础上，我们提出了 SimSiam+VAE，这是一种用于表示学习和紧急通信的统一方法，VI-SimSiam 通过变分推理 (VI) 解释将生成和贝叶斯视角融入 SimSiam 框架。SimSiam+VAE 将变分自动编码器 (VAE) 集成到 SimSiam 网络的预测器中，以增强表示学习并捕捉不确定性。实验结果表明，SimSiam+VAE 优于 SimSiam 和 VI-SimSiam。我们进一步将此模型扩展为一个名为 SimSiam Naming Game (SSNG) 的通信框架，该框架应用基于 VI 的生成和贝叶斯方法来开发内部表示和新兴语言，同时利用 SimSiam 的判别过程来促进代理之间的相互理解。在对已建立模型的实验中，尽管代理角色在交互过程中动态交替，但 SSNG 表现出与指称游戏相当的性能，并且略优于 Metropolis-Hastings 命名游戏。

Title: Self-Preference Bias in LLM-as-a-Judge

Authors: Koki Wataoka, Tsubasa Takahashi, Ryokan Ri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21819
Pdf URL: https://arxiv.org/pdf/2410.21819
Copy Paste: [[2410.21819]] Self-Preference Bias in LLM-as-a-Judge(https://arxiv.org/abs/2410.21819)
Keywords: language model, gpt, llm
Abstract: Automated evaluation leveraging large language models (LLMs), commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the perplexities of outputs. Our findings reveal that LLMs assign significantly higher evaluations than human evaluators do to outputs with lower perplexity, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.
摘要：利用大型语言模型 (LLM) 的自动评估（通常称为 LLM 评估器或 LLM-as-a-judge）已广泛用于衡量对话系统的性能。然而，LLM 中的自我偏好偏差带来了重大风险，包括促进 LLM 固有的特定风格或政策。尽管这个问题很重要，但缺乏定量测量自我偏好偏差的既定方法，其根本原因也不太清楚。在本文中，我们引入了一种新的定量指标来衡量自我偏好偏差。我们的实验结果表明，GPT-4 表现出显著程度的自我偏好偏差。为了探究原因，我们假设 LLM 可能更喜欢它们更熟悉的输出，这可以通过较低的困惑度来表明。我们分析了 LLM 评估与输出困惑度之间的关系。我们的研究结果表明，无论输出是否是自我生成的，LLM 对困惑度较低的输出的评价都明显高于人类评估者。这表明偏见的本质在于困惑，而自我偏好偏见的存在是因为法学硕士学生更喜欢自己更熟悉的文本。
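
Since the analysis in this abstract hinges on perplexity, a short sketch of the standard definition may help: perplexity is the exponential of the negative mean token log-probability. The function name and the two example sequences are illustrative assumptions. In practice the log-probabilities would come from the evaluating LLM, and this is not the paper's bias metric itself.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability of the tokens)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities assigned by a judge model.
familiar_text = [-0.2, -0.1, -0.3, -0.2]    # tokens the model finds likely
unfamiliar_text = [-2.0, -1.5, -2.5, -2.0]  # tokens the model finds surprising

ppl_familiar = perplexity(familiar_text)
ppl_unfamiliar = perplexity(unfamiliar_text)
```

Lower perplexity means the text is more predictable to the model, which is the sense of "more familiar" used in the abstract's hypothesis.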

Title: Multi-aspect Depression Severity Assessment via Inductive Dialogue System

Authors: Chaebin Lee, Seungyeon Seo, Heejin Do, Gary Geunbae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21836
Pdf URL: https://arxiv.org/pdf/2410.21836
Copy Paste: [[2410.21836]] Multi-aspect Depression Severity Assessment via Inductive Dialogue System(https://arxiv.org/abs/2410.21836)
Keywords: chat
Abstract: With the advancement of chatbots and the growing demand for automatic depression detection, identifying depression in patient conversations has gained more attention. However, prior methods often assess depression in a binary way or with only a single score, without diverse feedback, and lack a focus on enhancing dialogue responses. In this paper, we present a novel task of multi-aspect depression severity assessment via an inductive dialogue system (MaDSA), evaluating a patient's depression level on multiple criteria by incorporating an assessment-aided response generation. Further, we propose a foundational system for MaDSA, which induces psychological dialogue responses with an auxiliary emotion classification task within a hierarchical severity assessment structure. We synthesize the conversational dataset annotated with eight aspects of depression severity alongside emotion labels, proven robust via human evaluations. Experimental results show potential for our preliminary work on MaDSA.
摘要：随着聊天机器人的进步和对自动抑郁症检测的需求不断增长，在患者对话中识别抑郁症受到了越来越多的关注。然而，之前的方法通常以二元方式或仅以单一分数评估抑郁症，没有多样化的反馈，并且缺乏对增强对话反应的关注。在本文中，我们提出了一种通过归纳对话系统 (MaDSA) 进行多方面抑郁症严重程度评估的新任务，通过结合评估辅助反应生成根据多项标准评估患者的抑郁程度。此外，我们提出了一个 MaDSA 的基础系统，该系统在分层严重程度评估结构中使用辅助情绪分类任务诱导心理对话反应。我们合成了带有八个抑郁严重程度方面以及情绪标签的对话数据集，并通过人工评估证明其可靠性。实验结果表明我们在 MaDSA 方面的初步工作具有潜力。

Title: Improving In-Context Learning with Small Language Model Ensembles

Authors: M. Mehdi Mojarradi, Lingyi Yang, Robert McCraith, Adam Mahdi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21868
Pdf URL: https://arxiv.org/pdf/2410.21868
Copy Paste: [[2410.21868]] Improving In-Context Learning with Small Language Model Ensembles(https://arxiv.org/abs/2410.21868)
Keywords: language model, llm, retrieval augmented generation
Abstract: Large language models (LLMs) have shown impressive capabilities across various tasks, but their performance on domain-specific tasks remains limited. While methods like retrieval augmented generation and fine-tuning can help to address this, they require significant resources. In-context learning (ICL) is a cheap and efficient alternative but cannot match the accuracies of advanced methods. We present Ensemble SuperICL, a novel approach that enhances ICL by leveraging the expertise of multiple fine-tuned small language models (SLMs). Ensemble SuperICL achieves state of the art (SoTA) results on several natural language understanding benchmarks. Additionally, we test it on a medical-domain labelling task and showcase its practicality by using off-the-shelf SLMs fine-tuned on a general language task, achieving superior accuracy in large-scale data labelling compared to all baselines. Finally, we conduct an ablation study and sensitivity analyses to elucidate the underlying mechanism of Ensemble SuperICL. Our research contributes to the growing demand for efficient domain specialisation methods in LLMs, offering a cheap and effective method for practitioners.
摘要：大型语言模型 (LLM) 在各种任务中都表现出令人印象深刻的能力，但它们在特定领域任务上的表现仍然有限。虽然检索增强生成和微调等方法可以帮助解决这个问题，但它们需要大量资源。上下文学习 (ICL) 是一种廉价而有效的替代方法，但无法与高级方法的准确性相媲美。我们提出了 Ensemble SuperICL，这是一种通过利用多个微调小型语言模型 (SLM) 的专业知识来增强 ICL 的新方法。Ensemble SuperICL 在几个自然语言理解基准上取得了最先进的 (SoTA) 结果。此外，我们在医学领域标记任务上对其进行了测试，并通过使用在一般语言任务上微调的现成 SLM 展示了它的实用性，与所有基线相比，它在大规模数据标记方面实现了卓越的准确性。最后，我们进行了一项消融研究和敏感性分析，以阐明 Ensemble SuperICL 的潜在机制。我们的研究满足了法学硕士 (LLM) 领域对高效领域专业化方法日益增长的需求，为从业者提供了一种廉价而有效的方法。
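
One way to picture how in-context learning might draw on several fine-tuned small models is to place each model's prediction and confidence next to every example in the prompt. The sketch below is an assumption-laden illustration, not the authors' implementation: the prompt layout, the function names `format_example` and `build_prompt`, and the model names `slm_a` and `slm_b` are all hypothetical.

```python
def format_example(text, slm_outputs, label=None):
    """Render one example with each small model's prediction and confidence."""
    lines = [f"Input: {text}"]
    for name, (pred, conf) in slm_outputs.items():
        lines.append(f"{name} prediction: {pred} (confidence: {conf:.2f})")
    # Demonstrations carry a gold label; the query leaves it blank for the LLM.
    lines.append(f"Label: {label}" if label is not None else "Label:")
    return "\n".join(lines)

def build_prompt(demos, query_text, query_slm_outputs):
    """Concatenate labelled demonstrations followed by the unlabelled query."""
    blocks = [format_example(t, s, y) for t, s, y in demos]
    blocks.append(format_example(query_text, query_slm_outputs))
    return "\n\n".join(blocks)

demos = [
    ("Great film.", {"slm_a": ("positive", 0.91), "slm_b": ("positive", 0.84)}, "positive"),
    ("Dull plot.", {"slm_a": ("negative", 0.77), "slm_b": ("positive", 0.55)}, "negative"),
]
prompt = build_prompt(
    demos,
    "Surprisingly moving.",
    {"slm_a": ("positive", 0.62), "slm_b": ("negative", 0.51)},
)
```

Showing confidences alongside predictions lets the large model see where the small models disagree or are unsure, which is where its own judgment is most likely to add value.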

Title: SceneGenAgent: Precise Industrial Scene Generation with Coding Agent

Authors: Xiao Xia, Dan Zhang, Zibo Liao, Zhenyu Hou, Tianrui Sun, Jing Li, Ling Fu, Yuxiao Dong
Subjects: cs.CL, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2410.21909
Pdf URL: https://arxiv.org/pdf/2410.21909
Copy Paste: [[2410.21909]] SceneGenAgent: Precise Industrial Scene Generation with Coding Agent(https://arxiv.org/abs/2410.21909)
Keywords: language model, gpt, llm, agent
Abstract: The modeling of industrial scenes is essential for simulations in industrial manufacturing. While large language models (LLMs) have shown significant progress in generating general 3D scenes from textual descriptions, generating industrial scenes with LLMs poses a unique challenge due to their demand for precise measurements and positioning, requiring complex planning over spatial arrangement. To address this challenge, we introduce SceneGenAgent, an LLM-based agent for generating industrial scenes through C# code. SceneGenAgent ensures precise layout planning through a structured and calculable format, layout verification, and iterative refinement to meet the quantitative requirements of industrial scenarios. Experiment results demonstrate that LLMs powered by SceneGenAgent exceed their original performance, reaching up to 81.0% success rate in real-world industrial scene generation tasks and effectively meeting most scene generation requirements. To further enhance accessibility, we construct SceneInstruct, a dataset designed for fine-tuning open-source LLMs to integrate into SceneGenAgent. Experiments show that fine-tuning open-source LLMs on SceneInstruct yields significant performance improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our code and data are available at this https URL.
摘要：工业场景建模对于工业制造中的模拟至关重要。虽然大型语言模型 (LLM) 在从文本描述生成一般 3D 场景方面取得了重大进展，但使用 LLM 生成工业场景提出了独特的挑战，因为它们需要精确的测量和定位，需要对空间布局进行复杂的规划。为了应对这一挑战，我们引入了 SceneGenAgent，这是一个基于 LLM 的代理，用于通过 C# 代码生成工业场景。SceneGenAgent 通过结构化和可计算的格式、布局验证和迭代细化来确保精确的布局规划，以满足工业场景的量化要求。实验结果表明，由 SceneGenAgent 提供支持的 LLM 超越了其原始性能，在现实世界的工业场景生成任务中成功率高达 81.0%，并有效满足了大多数场景生成要求。为了进一步增强可访问性，我们构建了 SceneInstruct，这是一个旨在微调开源 LLM 以集成到 SceneGenAgent 中的数据集。实验表明，在 SceneInstruct 上微调开源 LLM 可显著提高性能，其中 Llama3.1-70B 的性能已接近 GPT-4o。我们的代码和数据可在此 https URL 上找到。

Title: Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications

Authors: Monica Riedler, Stefan Langer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.21943
Pdf URL: https://arxiv.org/pdf/2410.21943
Copy Paste: [[2410.21943]] Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications(https://arxiv.org/abs/2410.21943)
Keywords: language model, gpt, llm, hallucination, retrieval augmented generation
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in answering questions, but they lack domain-specific knowledge and are prone to hallucinations. Retrieval Augmented Generation (RAG) is one approach to address these challenges, while multimodal models are emerging as promising AI assistants for processing both text and images. In this paper we describe a series of experiments aimed at determining how to best integrate multimodal models into RAG systems for the industrial domain. The purpose of the experiments is to determine whether including images alongside text from documents within the industrial domain increases RAG performance and to find the optimal configuration for such a multimodal RAG system. Our experiments include two approaches for image processing and retrieval, as well as two LLMs (GPT4-Vision and LLaVA) for answer synthesis. These image processing strategies involve the use of multimodal embeddings and the generation of textual summaries from images. We evaluate our experiments with an LLM-as-a-Judge approach. Our results reveal that multimodal RAG can outperform single-modality RAG settings, although image retrieval poses a greater challenge than text retrieval. Additionally, leveraging textual summaries from images presents a more promising approach compared to the use of multimodal embeddings, providing more opportunities for future advancements.
摘要：大型语言模型 (LLM) 在回答问题方面表现出了令人印象深刻的能力，但它们缺乏特定领域的知识并且容易产生幻觉。检索增强生成 (RAG) 是解决这些挑战的一种方法，而多模态模型正在成为处理文本和图像的有前途的 AI 助手。在本文中，我们描述了一系列实验，旨在确定如何将多模态模型最好地集成到工业领域的 RAG 系统中。实验的目的是确定在工业领域文档的文本旁边包含图像是否会提高 RAG 性能，并找到这种多模态 RAG 系统的最佳配置。我们的实验包括两种图像处理和检索方法，以及两种用于答案合成的 LLM（GPT4-Vision 和 LLaVA）。这些图像处理策略涉及使用多模态嵌入和从图像生成文本摘要。我们使用 LLM-as-a-Judge 方法评估我们的实验。我们的结果表明，尽管图像检索比文本检索更具挑战性，但多模态 RAG 的表现优于单模态 RAG 设置。此外，与使用多模态嵌入相比，利用图像中的文本摘要是一种更有前景的方法，为未来的发展提供了更多机会。

Title: SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types

Authors: Yutao Mou, Shikun Zhang, Wei Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21965
Pdf URL: https://arxiv.org/pdf/2410.21965
Copy Paste: [[2410.21965]] SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types(https://arxiv.org/abs/2410.21965)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Ensuring the safety of large language model (LLM) applications is essential for developing trustworthy artificial intelligence. Current LLM safety benchmarks have two limitations. First, they focus solely on either discriminative or generative evaluation paradigms while ignoring their interconnection. Second, they rely on standardized inputs, overlooking the effects of widespread prompting techniques, such as system prompts, few-shot demonstrations, and chain-of-thought prompting. To overcome these issues, we developed SG-Bench, a novel benchmark to assess the generalization of LLM safety across various tasks and prompt types. This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety. Our assessment of 3 advanced proprietary LLMs and 10 open-source LLMs with the benchmark reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment. We also explain these findings quantitatively and qualitatively to provide insights for future research.
摘要：确保大型语言模型 (LLM) 应用程序的安全性对于开发值得信赖的人工智能至关重要。当前的 LLM 安全基准有两个局限性。首先，它们仅关注判别性或生成性评估范式，而忽略了它们之间的相互联系。其次，它们依赖标准化输入，忽略了广泛使用的提示技术的影响，例如系统提示、少量演示和思路链提示。为了克服这些问题，我们开发了 SG-Bench，这是一种新颖的基准，用于评估 LLM 安全性在各种任务和提示类型中的泛化。该基准整合了生成性和判别性评估任务，并包含扩展数据以检查提示工程和越狱对 LLM 安全性的影响。我们使用该基准对 3 个高级专有 LLM 和 10 个开源 LLM 进行了评估，结果表明大多数 LLM 在判别性任务上的表现比生成性任务差，并且极易受到提示的影响，这表明安全性一致性的泛化能力较差。我们还从定量和定性的角度解释这些发现，为未来的研究提供见解。

Title: Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented Generation

Authors: Suhang Wu, Jialong Tang, Baosong Yang, Ante Wang, Kaidi Jia, Jiawei Yu, Junfeng Yao, Jinsong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.21970
Pdf URL: https://arxiv.org/pdf/2410.21970
Copy Paste: [[2410.21970]] Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented Generation(https://arxiv.org/abs/2410.21970)
Keywords: language model, retrieval augmented generation, retrieval-augmented generation
Abstract: RALMs (Retrieval-Augmented Language Models) broaden their knowledge scope by incorporating external textual resources. However, the multilingual nature of global knowledge necessitates RALMs to handle diverse languages, a topic that has received limited research focus. In this work, we propose Futurepedia, a carefully crafted benchmark containing parallel texts across eight representative languages. We evaluate six multilingual RALMs using our benchmark to explore the challenges of multilingual RALMs. Experimental results reveal linguistic inequalities: 1) high-resource languages stand out in Monolingual Knowledge Extraction; 2) Indo-European languages lead RALMs to provide answers directly from documents, alleviating the challenge of expressing answers across languages; 3) English benefits from RALMs' selection bias and speaks louder in multilingual knowledge selection. Based on these findings, we offer advice for improving multilingual Retrieval Augmented Generation. For monolingual knowledge extraction, careful attention must be paid to cascading errors from translating low-resource languages into high-resource ones. In cross-lingual knowledge transfer, encouraging RALMs to provide answers within documents in different languages can improve transfer performance. For multilingual knowledge selection, incorporating more non-English documents and repositioning English documents can help mitigate RALMs' selection bias. Through comprehensive experiments, we underscore the complexities inherent in multilingual RALMs and offer valuable insights for future research.
摘要：RALM（检索增强语言模型）通过整合外部文本资源来拓宽其知识范围。然而，全球知识的多语言性质要求 RALM 能够处理多种语言，而这一主题的研究重点有限。在这项工作中，我们提出了 \textit{Futurepedia}，这是一个精心设计的基准，包含八种代表性语言的平行文本。我们使用基准评估了六种多语言 RALM，以探索多语言 RALM 的挑战。实验结果揭示了语言不平等：1）高资源语言在单语知识提取中脱颖而出；2）印欧语系语言使 RALM 直接从文档中提供答案，从而减轻了跨语言表达答案的挑战；3）英语受益于 RALM 的选择偏差，在多语言知识选择中更具发言权。基于这些发现，我们提出了改进多语言检索增强生成的建议。对于单语知识提取，必须特别注意将资源较少的语言翻译成资源丰富的语言时产生的级联错误。在跨语言知识转移中，鼓励 RALM 在不同语言的文档中提供答案可以提高转移性能。对于多语言知识选择，纳入更多非英语文档并重新定位英语文档可以帮助减轻 RALM 的选择偏差。通过全面的实验，我们强调了多语言 RALM 固有的复杂性，并为未来的研究提供了宝贵的见解。

Title: Are VLMs Really Blind

Authors: Ayush Singh, Mansi Gupta, Shivank Garg
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.22029
Pdf URL: https://arxiv.org/pdf/2410.22029
Copy Paste: [[2410.22029]] Are VLMs Really Blind(https://arxiv.org/abs/2410.22029)
Keywords: language model
Abstract: Vision Language Models excel in handling a wide range of complex tasks, including Optical Character Recognition (OCR), Visual Question Answering (VQA), and advanced geometric reasoning. However, these models fail to perform well on low-level basic visual tasks which are especially easy for humans. Our goal in this work was to determine if these models are truly "blind" to geometric reasoning or if there are ways to enhance their capabilities in this area. Our work presents a novel automatic pipeline designed to extract key information from images in response to specific questions. Instead of just relying on direct VQA, we use question-derived keywords to create a caption that highlights important details in the image related to the question. This caption is then used by a language model to provide a precise answer to the question without requiring external fine-tuning.
摘要：视觉语言模型擅长处理各种复杂任务，包括光学字符识别 (OCR)、视觉问答 (VQA) 和高级几何推理。然而，这些模型在人类特别容易完成的低级基本视觉任务上表现不佳。我们在这项工作中的目标是确定这些模型是否真的对几何推理“视而不见”，或者是否有办法增强它们在这方面的能力。我们的工作提出了一种新颖的自动管道，旨在从图像中提取关键信息以回答特定问题。我们不是仅仅依靠直接的 VQA，而是使用问题派生的关键字来创建标题，突出显示与问题相关的图像中的重要细节。然后，语言模型使用此标题来提供问题的精确答案，而无需外部微调。

Title: Distinguishing Ignorance from Error in LLM Hallucinations

Authors: Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.22071
Pdf URL: https://arxiv.org/pdf/2410.22071
Copy Paste: [[2410.22071]] Distinguishing Ignorance from Error in LLM Hallucinations(https://arxiv.org/abs/2410.22071)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) are susceptible to hallucinations, that is, outputs that are ungrounded, factually incorrect, or inconsistent with prior generations. We focus on closed-book Question Answering (CBQA), where previous work has not fully addressed the distinction between two possible kinds of hallucinations, namely, whether the model (1) does not hold the correct answer in its parameters or (2) answers incorrectly despite having the required knowledge. We argue that distinguishing these cases is crucial for detecting and mitigating hallucinations. Specifically, case (2) may be mitigated by intervening in the model's internal computation, as the knowledge resides within the model's parameters. In contrast, in case (1) there is no parametric knowledge to leverage for mitigation, so it should be addressed by resorting to an external knowledge source or abstaining. To help distinguish between the two cases, we introduce Wrong Answer despite having Correct Knowledge (WACK), an approach for constructing model-specific datasets for the second hallucination type. Our probing experiments indicate that the two kinds of hallucinations are represented differently in the model's inner states. Next, we show that datasets constructed using WACK exhibit variations across models, demonstrating that even when models share knowledge of certain facts, they still vary in the specific examples that lead to hallucinations. Finally, we show that training a probe on our WACK datasets leads to better hallucination detection of case (2) hallucinations than using the common generic one-size-fits-all datasets. The code is available at this https URL.
摘要：大型语言模型 (LLM) 容易产生幻觉，即输出没有根据、事实不正确或与前几代不一致。我们专注于封闭式问答 (CBQA)，之前的研究尚未完全解决两种可能的幻觉之间的区别，即模型 (1) 是否在其参数中没有保存正确答案，或 (2) 尽管具有所需的知识，但仍回答错误。我们认为区分这些情况对于检测和缓解幻觉至关重要。具体而言，情况 (2) 可以通过干预模型的内部计算来缓解，因为知识存在于模型的参数中。相反，在情况 (1) 中，没有参数知识可用于缓解，因此应该通过求助于外部知识源或弃权来解决。为了帮助区分这两种情况，我们引入了尽管有正确知识，但答案错误 (WACK)，这是一种为第二种幻觉类型构建模型特定数据集的方法。我们的探索性实验表明，两种幻觉在模型的内部状态中表现不同。接下来，我们展示了使用 WACK 构建的数据集在不同的模型中表现出差异，这表明即使模型共享某些事实的知识，它们在导致幻觉的具体示例中仍会有所不同。最后，我们展示了在我们的 WACK 数据集上训练探测器比使用通用的通用数据集可以更好地检测案例 (2) 幻觉。代码可从此 https URL 获得。
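
The two-way distinction in this abstract can be illustrated with a minimal sketch. This is not the authors' WACK construction procedure: the `knows_answer` flag, the exact-match comparison, and the label names are hypothetical simplifications, assuming some prior check has already established whether the model can produce the gold answer in a plain setting.

```python
from dataclasses import dataclass


@dataclass
class QAResult:
    question: str
    gold: str
    answer: str          # the model's answer in the setting under test
    knows_answer: bool   # hypothetical flag: model gave the gold answer in a plain setting


def classify(result: QAResult) -> str:
    """Label one closed-book QA outcome (illustrative labels, not from the paper)."""
    if result.answer.strip().lower() == result.gold.strip().lower():
        return "correct"
    # Wrong answer: case (2) if the knowledge is present, case (1) otherwise.
    return "error_despite_knowledge" if result.knows_answer else "ignorance"


def suggested_mitigation(label: str) -> str:
    """Map each label to the mitigation family the abstract associates with it."""
    return {
        "correct": "none",
        "ignorance": "external knowledge source or abstention",
        "error_despite_knowledge": "intervention in the model's internal computation",
    }[label]


example = QAResult("Capital of France?", gold="Paris", answer="Lyon", knows_answer=True)
label = classify(example)
print(label, "->", suggested_mitigation(label))
```

The point of the split is that the two wrong-answer labels call for different remedies, so a detector trained on a mix of both would blur a distinction that matters for mitigation.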

Title: Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench

Authors: Zheyuan Liu, Guangyao Dou, Mengzhao Jia, Zhaoxuan Tan, Qingkai Zeng, Yongle Yuan, Meng Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.22108
Pdf URL: https://arxiv.org/pdf/2410.22108
Copy Paste: [[2410.22108]] Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench(https://arxiv.org/abs/2410.22108)
Keywords: language model, llm
Abstract: Generative models such as Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) trained on massive web corpora can memorize and disclose individuals' confidential and private data, raising legal and ethical concerns. While many previous works have addressed this issue in LLMs via machine unlearning, it remains largely unexplored for MLLMs. To tackle this challenge, we introduce Multimodal Large Language Model Unlearning Benchmark (MLLMU-Bench), a novel benchmark aimed at advancing the understanding of multimodal machine unlearning. MLLMU-Bench consists of 500 fictitious profiles and 153 profiles for public celebrities, each profile featuring over 14 customized question-answer pairs, evaluated from both multimodal (image+text) and unimodal (text) perspectives. The benchmark is divided into four sets to assess unlearning algorithms in terms of efficacy, generalizability, and model utility. Finally, we provide baseline results using existing generative model unlearning algorithms. Surprisingly, our experiments show that unimodal unlearning algorithms excel in generation and cloze tasks, while multimodal unlearning approaches perform better in classification tasks with multimodal inputs.
摘要：在海量网络语料库上训练的生成模型，例如大型语言模型 (LLM) 和多模态大型语言模型 (MLLM)，可以记忆和泄露个人的机密和私人数据，从而引发法律和道德问题。虽然之前的许多研究已经通过机器反学习解决了 LLM 中的这个问题，但对于 MLLM 来说，这个问题仍然基本上未被探索过。为了应对这一挑战，我们引入了多模态大型语言模型反学习基准 (MLLMU-Bench)，这是一种旨在增进对多模态机器反学习的理解的新型基准。MLLMU-Bench 包含 500 个虚构档案和 153 个公众名人档案，每个档案包含超过 14 个定制的问答对，从多模态（图像+文本）和单模态（文本）角度进行评估。基准分为四组，以评估反学习算法的功效、普遍性和模型效用。最后，我们使用现有的生成模型反学习算法提供基线结果。令人惊讶的是，我们的实验表明，单模态反学习算法在生成和完形填空任务中表现出色，而多模态反学习方法在具有多模态输入的分类任务中表现更好。

Title: The Impact of Inference Acceleration Strategies on Bias of LLMs

Authors: Elisabeth Kirsten, Ivan Habernal, Vedant Nanda, Muhammad Bilal Zafar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.22118
Pdf URL: https://arxiv.org/pdf/2410.22118
Copy Paste: [[2410.22118]] The Impact of Inference Acceleration Strategies on Bias of LLMs(https://arxiv.org/abs/2410.22118)
Keywords: language model, llm
Abstract: The last few years have seen unprecedented advances in the capabilities of Large Language Models (LLMs). These advancements promise to deeply benefit a vast array of application domains. However, due to their immense size, performing inference with LLMs is both costly and slow. Consequently, a plethora of recent work has proposed strategies to enhance inference efficiency, e.g., quantization, pruning, and caching. These acceleration strategies reduce the inference cost and latency, often by several factors, while maintaining much of the predictive performance measured via common benchmarks. In this work, we explore another critical aspect of LLM performance: demographic bias in model generations due to inference acceleration optimizations. Using a wide range of metrics, we probe bias in model outputs from a number of angles. Analysis of outputs before and after inference acceleration shows a significant change in bias. Worryingly, these bias effects are complex and unpredictable. A combination of an acceleration strategy and bias type may show little bias change in one model but may lead to a large effect in another. Our results highlight a need for in-depth and case-by-case evaluation of model bias after it has been modified to accelerate inference.
摘要：过去几年，大型语言模型 (LLM) 的功能取得了前所未有的进步。这些进步有望使众多应用领域受益匪浅。然而，由于 LLM 规模巨大，使用 LLM 进行推理既昂贵又缓慢。因此，最近的大量研究提出了提高推理效率的策略，例如量化、修剪和缓存。这些加速策略通常会降低几个因素的推理成本和延迟，同时保持通过常见基准测量的大部分预测性能。在这项工作中，我们探索了 LLM 性能的另一个关键方面：由于推理加速优化导致的模型生成中的人口统计学偏差。我们使用各种指标，从多个角度探究模型输出中的偏差。对推理加速前后的输出进行分析，发现偏差发生了显著变化。令人担忧的是，这些偏差效应复杂且不可预测。加速策略和偏差类型的组合可能在一个模型中显示偏差变化不大，但在另一个模型中可能导致很大的影响。我们的研究结果强调，在修改模型偏差以加速推理之后，需要进行深入和逐案评估。

Title: AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts

Authors: Vishal Kumar, Zeyi Liao, Jaylen Jones, Huan Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.22143
Pdf URL: https://arxiv.org/pdf/2410.22143
Copy Paste: [[2410.22143]] AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts(https://arxiv.org/abs/2410.22143)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Although large language models (LLMs) are typically aligned, they remain vulnerable to jailbreaking through either carefully crafted prompts in natural language or, interestingly, gibberish adversarial suffixes. However, gibberish tokens have received relatively less attention despite their success in attacking aligned LLMs. Recent work, AmpleGCG (Liao and Sun, 2024), demonstrates that a generative model can quickly produce numerous customizable gibberish adversarial suffixes for any harmful query, exposing a range of alignment gaps in out-of-distribution (OOD) language spaces. To bring more attention to this area, we introduce AmpleGCG-Plus, an enhanced version that achieves better performance in fewer attempts. Through a series of exploratory experiments, we identify several training strategies to improve the learning of gibberish suffixes. Our results, verified under a strict evaluation setting, show that it outperforms AmpleGCG on both open-weight and closed-source models, achieving increases in attack success rate (ASR) of up to 17% in the white-box setting against Llama-2-7B-chat, and more than tripling ASR in the black-box setting against GPT-4. Notably, AmpleGCG-Plus jailbreaks the newer GPT-4o series of models at similar rates to GPT-4, and uncovers vulnerabilities against the recently proposed circuit breakers defense. We publicly release AmpleGCG-Plus along with our collected training datasets.
摘要：尽管大型语言模型 (LLM) 通常都是对齐的，但它们仍然容易被精心设计的自然语言提示或有趣的是胡言乱语的对抗性后缀所破解。然而，尽管胡言乱语标记在攻击对齐的 LLM 方面取得了成功，但它们受到的关注相对较少。最近的研究 AmpleGCG~\citep{liao2024amplegcg} 表明，生成模型可以快速为任何有害查询生成大量可定制的胡言乱语对抗性后缀，从而暴露出分布外 (OOD) 语言空间中的一系列对齐差距。为了引起人们对这一领域的更多关注，我们推出了 AmpleGCG-Plus，这是一个增强版本，可以在更少的尝试中获得更好的性能。通过一系列探索性实验，我们确定了几种训练策略来提高胡言乱语后缀的学习。我们在严格的评估环境下验证的结果显示，它在开源和闭源模型上的表现都优于 AmpleGCG，在白盒环境下对 Llama-2-7B-chat 的攻击成功率 (ASR) 提高了 17%，在黑盒环境下对 GPT-4 的攻击成功率提高了三倍多。值得注意的是，AmpleGCG-Plus 以与 GPT-4 类似的速度越狱了较新的 GPT-4o 系列模型，并发现了针对最近提出的断路器防御的漏洞。我们公开发布了 AmpleGCG-Plus 以及我们收集的训练数据集。

Title: Benchmarking LLM Guardrails in Handling Multilingual Toxicity

Authors: Yahan Yang, Soham Dan, Dan Roth, Insup Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.22153
Pdf URL: https://arxiv.org/pdf/2410.22153
Copy Paste: [[2410.22153]] Benchmarking LLM Guardrails in Handling Multilingual Toxicity(https://arxiv.org/abs/2410.22153)
Keywords: language model, llm, prompt
Abstract: With the ubiquity of Large Language Models (LLMs), guardrails have become crucial to detect and defend against toxic content. However, with the increasing pervasiveness of LLMs in multilingual scenarios, their effectiveness in handling multilingual toxic inputs remains unclear. In this work, we introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails. We also investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance. Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts. This work aims to identify the limitations of guardrails and to build more reliable and trustworthy LLMs in multilingual scenarios.
摘要：随着大型语言模型 (LLM) 的普及，护栏已成为检测和防御有害内容的关键。然而，随着 LLM 在多语言场景中的普及程度越来越高，它们在处理多语言有害输入方面的有效性仍不清楚。在这项工作中，我们引入了一个全面的多语言测试套件，涵盖七个数据集和十多种语言，以对最先进的护栏性能进行基准测试。我们还研究了护栏对最近越狱技术的适应能力，并评估了上下文安全策略和语言资源可用性对护栏性能的影响。我们的研究结果表明，现有的护栏在处理多语言毒性方面仍然无效，并且缺乏对越狱提示的鲁棒性。这项工作旨在确定护栏的局限性，并在多语言场景中构建更可靠、更值得信赖的 LLM。

Title: ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding

Authors: Kimihiro Hasegawa, Wiradee Imrattanatrai, Zhi-Qi Cheng, Masaki Asada, Susan Holm, Yuran Wang, Ken Fukuda, Teruko Mitamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.22211
Pdf URL: https://arxiv.org/pdf/2410.22211
Copy Paste: [[2410.22211]] ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding(https://arxiv.org/abs/2410.22211)
Keywords: llm
Abstract: Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recordings of procedural activities coupled with their corresponding instructions. For QA annotation, we take a cost-effective human-LLM collaborative approach, where the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide the benchmark results to set the baseline performance on ProMQA. Our experiment reveals a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models' multimodal understanding capabilities.
摘要：多模态系统在协助人类进行程序性活动方面具有巨大潜力，人们遵循指令来实现目标。尽管应用场景多种多样，但系统通常基于传统分类任务进行评估，例如动作识别或时间动作分割。在本文中，我们提出了一个新颖的评估数据集 ProMQA，以衡量面向应用场景中的系统进步。ProMQA 由 401 个多模态程序性 QA 对组成，这些对基于用户记录的程序性活动及其相应的指令。对于 QA 注释，我们采用经济高效的人类-LLM 协作方法，其中现有注释通过 LLM 生成的 QA 对进行增强，然后由人类进行验证。然后，我们提供基准测试结果以设置 ProMQA 的基准性能。我们的实验揭示了人类表现与当前系统（包括竞争性专有多模态模型）之间的巨大差距。我们希望我们的数据集能够揭示模型多模态理解能力的新方面。

Title: DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers

Authors: Rakesh R. Menon, Shashank Srivastava
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.22239
Pdf URL: https://arxiv.org/pdf/2410.22239
Copy Paste: [[2410.22239]] DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers(https://arxiv.org/abs/2410.22239)
Keywords: language model
Abstract: Despite their high predictive accuracies, current machine learning systems often exhibit systematic biases stemming from annotation artifacts or insufficient support for certain classes in the dataset. Recent work proposes automatic methods for identifying and explaining systematic biases using keywords. We introduce DISCERN, a framework for interpreting systematic biases in text classifiers using language explanations. DISCERN iteratively generates precise natural language descriptions of systematic errors by employing an interactive loop between two large language models. Finally, we use the descriptions to improve classifiers by augmenting classifier training sets with synthetically generated instances or annotated examples via active learning. On three text-classification datasets, we demonstrate that language explanations from our framework induce consistent performance improvements that go beyond what is achievable with exemplars of systematic bias. Finally, in human evaluations, we show that users can interpret systematic biases more effectively (by over 25% relative) and efficiently when described through language explanations as opposed to cluster exemplars.
摘要：尽管当前的机器学习系统具有很高的预测准确度，但它们经常表现出系统性偏差，这些偏差源于注释工件或对数据集中某些类别的支持不足。最近的研究提出了使用关键字自动识别和解释系统性偏差的方法。我们引入了 DISCERN，这是一个使用语言解释解释文本分类器中的系统性偏差的框架。DISCERN 通过在两个大型语言模型之间采用交互式循环，迭代生成系统错误的精确自然语言描述。最后，我们使用这些描述来改进分类器，方法是通过主动学习用合成生成的实例或带注释的示例来扩充分类器训练集。在三个文本分类数据集上，我们证明了来自我们框架的语言解释可以带来持续的性能改进，这种改进超出了系统性偏差样本所能实现的范围。最后，在人工评估中，我们表明，与集群样本相比，当通过语言解释进行描述时，用户可以更有效（相对超过 25%）和更高效地解释系统性偏差。

Title: FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

Authors: Farima Fatahi Bayat, Lechen Zhang, Sheza Munir, Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.22257
Pdf URL: https://arxiv.org/pdf/2410.22257
Copy Paste: [[2410.22257]] FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation(https://arxiv.org/abs/2410.22257)
Keywords: language model, gpt, hallucination, prompt
Abstract: Language models (LMs) are widely used by an increasing number of users, underscoring the challenge of maintaining factuality across a broad range of topics. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on the retrieved evidence from the Web. Importantly, factuality judgment by VERIFY correlates better with human evaluations than existing methods. Using VERIFY, we identify "hallucination prompts" across diverse topics, i.e., those eliciting the highest rates of incorrect and inconclusive LM responses. These prompts form FactBench, a dataset of 1K prompts across 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely-used LMs from the GPT, Gemini, and Llama3.1 families on FactBench, yielding the following key findings: (i) Proprietary models exhibit better factuality, with performance declining from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual accuracy than Llama3.1-70B-Instruct across all evaluation methods due to its higher subjectivity that leads to more content labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases. Our code and data are publicly available at this https URL.
摘要：越来越多的用户广泛使用语言模型 (LM)，这凸显了在广泛主题中保持事实性的挑战。我们首先介绍 VERIFY（事实性评估的验证和证据检索），这是一个用于评估 LM 在现实世界用户交互中事实性的管道。VERIFY 考虑 LM 生成内容的可验证性，并根据从 Web 检索到的证据将内容单元分类为受支持、不受支持或不可判定。重要的是，VERIFY 的事实性判断与人类评估的相关性比现有方法更好。使用 VERIFY，我们可以识别不同主题中的“幻觉提示”，即那些引发错误和不确定 LM 响应率最高的提示。这些提示构成了 FactBench，这是一个包含 150 个细粒度主题的 1K 个提示的数据集。我们的数据集捕捉了现实世界 LM 交互中出现的事实性挑战，并且可以定期更新新提示。我们在 FactBench 上对 GPT、Gemini 和 Llama3.1 系列中广泛使用的 LM 进行了基准测试，得出以下主要发现：(i) 专有模型表现出更好的事实性，性能从简单到困难的幻觉提示下降。(ii) Llama3.1-405B-Instruct 在所有评估方法中的事实准确性与 Llama3.1-70B-Instruct 相当或更低，因为它的主观性更高，导致更多内容被标记为不可判定。(iii) Gemini1.5-Pro 的拒绝率明显更高，25% 的情况是过度拒绝。我们的代码和数据可在此 https URL 上公开获取。

Title: From melodic note sequences to pitches using word2vec

Authors: Daniel Defays
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.22285
Pdf URL: https://arxiv.org/pdf/2410.22285
Copy Paste: [[2410.22285]] From melodic note sequences to pitches using word2vec(https://arxiv.org/abs/2410.22285)
Keywords: language model
Abstract: Applying the word2vec technique, commonly used in language modeling, to melodies, where notes are treated as words in sentences, enables the capture of pitch information. This study examines two datasets: 20 children's songs and an excerpt from a Bach sonata. The semantic space for defining the embeddings is of very small dimension, specifically 2. Notes are predicted based on the 2, 3 or 4 preceding notes that establish the context. A multivariate analysis of the results shows that the semantic vectors representing the notes have a multiple correlation coefficient of approximately 0.80 with their pitches.
摘要：将语言建模中常用的 word2vec 技术应用于旋律，将音符视为句子中的单词，可以捕捉音高信息。本研究研究了两个数据集：20 首儿童歌曲和一首巴赫奏鸣曲的摘录。定义嵌入的语义空间维度非常小，具体为 2。音符是根据建立上下文的 2、3 或 4 个前音符来预测的。对结果进行多变量分析表明，表示音符的语义向量与其音高的多重相关系数约为 0.80。
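
The context setup described in this abstract, where each note is predicted from the 2, 3 or 4 notes preceding it, can be sketched as a small data-preparation step. This is only an illustration: the function name, the note spelling, and the toy melody are assumptions, and the sketch covers building the training pairs, not the embedding training or the correlation analysis.

```python
def context_pairs(notes, n_context):
    """Return (context, target) pairs where the context is the n_context preceding notes."""
    return [
        (tuple(notes[i - n_context:i]), notes[i])
        for i in range(n_context, len(notes))
    ]


# Toy melody, invented for illustration (not taken from the paper's datasets).
melody = ["C4", "D4", "E4", "C4", "E4", "G4"]

for n in (2, 3, 4):
    pairs = context_pairs(melody, n)
    print(n, len(pairs), pairs[0])
```

Each pair plays the role that a (context words, target word) example plays in word2vec, with notes standing in for words and a melody for a sentence.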

Title: Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning

Authors: Yihe Deng, Paul Mineiro
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.22304
Pdf URL: https://arxiv.org/pdf/2410.22304
Copy Paste: [[2410.22304]] Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning(https://arxiv.org/abs/2410.22304)
Keywords: language model, llm, agent
Abstract: Mathematical reasoning is a crucial capability for Large Language Models (LLMs), yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach to produce high-quality reasoning traces for LLM fine-tuning using online learning Flows. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real-time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance in mathematical reasoning tasks.
摘要：数学推理是大型语言模型 (LLM) 的一项关键能力，但生成详细而准确的推理轨迹仍然是一项重大挑战。本文介绍了一种使用在线学习 \textbf{Flows} 为 LLM 微调生成高质量推理轨迹的新方法。我们的方法采用增量输出生产流程，其中组件 LLM 通过迭代通信协作构建解决方案。我们使用在线直接偏好优化 (DPO) 学习和 rollouts 来训练 Flow，为每个训练示例生成 DPO 对并实时更新模型。我们直接将我们的方法生成的推理轨迹的质量与通过直接模型推理生成的推理轨迹的质量进行比较，证明了我们的方法在提高 LLM 在数学推理任务中的性能方面的有效性。
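
For readers unfamiliar with the DPO pairs mentioned in this abstract, the sketch below computes the standard DPO loss for one (chosen, rejected) pair from sequence log-probabilities. It shows the generic objective only, not the paper's online Flow training loop. The argument names and the `beta=0.1` default are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the trained policy and under a
    frozen reference model; beta is an assumed temperature value.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The policy prefers the chosen answer more strongly than the reference does,
# so the loss falls below log(2), its value when the two models agree.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
print(dpo_loss(-12.0, -12.0, -12.0, -12.0))
```

The loss shrinks as the policy raises the chosen answer's log-probability relative to the reference and lowers the rejected one's.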

Title: Natural Language Inference Improves Compositionality in Vision-Language Models

Authors: Paola Cascante-Bonilla, Yu Hou, Yang Trista Cao, Hal Daumé III, Rachel Rudinger
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.22315
Pdf URL: https://arxiv.org/pdf/2410.22315
Copy Paste: [[2410.22315]] Natural Language Inference Improves Compositionality in Vision-Language Models(https://arxiv.org/abs/2410.22315)
Keywords: language model, llm
Abstract: Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of the textual description, using Large Language Models (LLMs) to break them down into subsets of questions and answers. However, these methods primarily operate on the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages Natural Language Inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show that CECE enhances interpretability and reduces overreliance on biased or superficial features. By balancing CECE along the original premise, we achieve significant improvements over previous methods without requiring additional fine-tuning, producing state-of-the-art results on benchmarks that score agreement with human judgments for image-text alignment, and achieving an increase in performance on Winoground of +19.2% (group score) and +12.9% on EqBen (group score) over the best prior work (finetuned with targeted data).
摘要：视觉语言模型 (VLM) 中的组合推理仍然具有挑战性，因为这些模型通常难以关联对象、属性和空间关系。最近的方法旨在通过依赖文本描述的语义来解决这些限制，使用大型语言模型 (LLM) 将它们分解为问题和答案的子集。然而，这些方法主要在表面层面上操作，未能结合更深层次的词汇理解，同时引入了由 LLM 生成的错误假设。为了解决这些问题，我们提出了带有矛盾和蕴涵的标题扩展 (CECE)，这是一种利用自然语言推理 (NLI) 从给定前提生成蕴涵和矛盾的原则性方法。CECE 生成词汇多样的句子，同时保持其核心含义。通过大量实验，我们表明 CECE 增强了可解释性并减少了对有偏见或肤浅特征的过度依赖。通过在原始前提下平衡 CECE，我们无需进行额外的微调，便可在以前的方法上取得显著的改进，在与人类对图像文本对齐的判断一致的基准上产生最先进的结果，并且在 Winoground 上的性能提高了 +19.2%（组分数），在 EqBen 上的性能提高了 +12.9%（组分数），超过了最好的先前工作（使用目标数据进行微调）。

Title: Understanding Synthetic Context Extension via Retrieval Heads

Authors: Xinyu Zhao, Fangcong Yin, Greg Durrett
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.22316
Pdf URL: https://arxiv.org/pdf/2410.22316
Copy Paste: [[2410.22316]] Understanding Synthetic Context Extension via Retrieval Heads(https://arxiv.org/abs/2410.22316)
Keywords: llm, long context, retrieval-augmented generation
Abstract: Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically generated long-context data in a post-training stage. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle" concepts to be retrieved and diversity of the surrounding "haystack" context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted and even predicted in terms of a special set of attention heads that are responsible for retrieval over long context: retrieval heads (Wu et al., 2024). The retrieval heads learned on synthetic data are mostly subsets of the retrieval heads learned on real data, and there is a strong correlation between the recall of heads learned and the downstream performance of a model. Furthermore, with attention knockout and activation patching, we mechanistically show that retrieval heads are necessary and explain model performance, although they are not totally sufficient. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
摘要：长上下文 LLM 在检索增强生成等应用中的需求越来越大。为了分担在长上下文中预训练 LLM 的成本，最近的研究采用了合成上下文扩展的方法：在训练后阶段使用合成生成的长上下文数据对 LLM 进行微调。然而，这种合成上下文扩展如何以及为何赋予下游长上下文任务能力仍不清楚。在本文中，我们研究了对三个需要检索和推理的长上下文任务的合成数据进行微调。我们改变了要检索的“针”概念的真实性和周围“大海捞针”上下文的多样性，从使用 LLM 构建合成文档到使用模板关系和创建符号数据集。我们发现，在合成数据上训练的模型与真实数据存在差距，但令人惊讶的是，这种不匹配可以用一组负责长上下文检索的特殊注意力头来解释甚至预测：检索头 (Wu et al., 2024)。在合成数据上学习到的检索头大多是在真实数据上学习到的检索头的子集，并且学习到的检索头的回忆率与模型的下游性能之间存在很强的相关性。此外，通过注意力消除和激活修补，我们从机制上证明了检索头是必要的，并解释了模型的性能，尽管它们并不完全充分。我们的结果揭示了如何解释合成数据微调性能，以及如何着手创建更好的数据以在长上下文中学习真实世界的能力。