2024-08-09

Title: Impacts of Anthropomorphizing Large Language Models in Learning Environments

Authors: Kristina Schaaff, Marc-André Heidelmann
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2408.03945
Pdf URL: https://arxiv.org/pdf/2408.03945
Copy Paste: [[2408.03945]] Impacts of Anthropomorphizing Large Language Models in Learning Environments(https://arxiv.org/abs/2408.03945)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) are increasingly being used in learning environments to support teaching, be it as learning companions or as tutors. With our contribution, we aim to discuss the implications of the anthropomorphization of LLMs in learning environments on educational theory to build a foundation for more effective learning outcomes and understand their emotional impact on learners. According to the media equation, people tend to respond to media in the same way as they would respond to another person. A study conducted by the Georgia Institute of Technology showed that chatbots can be successfully implemented in learning environments. In this study, learners in selected online courses were unable to distinguish the chatbot from a "real" teacher. As LLM-based chatbots such as OpenAI's GPT series are increasingly used in educational tools, it is important to understand how the attribution processes to LLM-based chatbots in terms of anthropomorphization affect learners' emotions.

Title: Image-to-LaTeX Converter for Mathematical Formulas and Text

Authors: Daniil Gurgurov, Aleksey Morshnev
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2408.04015
Pdf URL: https://arxiv.org/pdf/2408.04015
Copy Paste: [[2408.04015]] Image-to-LaTeX Converter for Mathematical Formulas and Text(https://arxiv.org/abs/2408.04015)
Keywords: gpt
Abstract: In this project, we train a vision encoder-decoder model to generate LaTeX code from images of mathematical formulas and text. Utilizing a diverse collection of image-to-LaTeX data, we build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas. We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen. Through this project, we contribute open-source models for converting images to LaTeX and provide from-scratch code for building these models with distributed training and GPU optimizations.

Title: Improving Large Language Model (LLM) fidelity through context-aware grounding: A systematic approach to reliability and veracity

Authors: Wrick Talukdar, Anjanava Biswas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04023
Pdf URL: https://arxiv.org/pdf/2408.04023
Copy Paste: [[2408.04023]] Improving Large Language Model (LLM) fidelity through context-aware grounding: A systematic approach to reliability and veracity(https://arxiv.org/abs/2408.04023)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) become increasingly sophisticated and ubiquitous in natural language processing (NLP) applications, ensuring their robustness, trustworthiness, and alignment with human values has become a critical challenge. This paper presents a novel framework for contextual grounding in textual models, with a particular emphasis on the Context Representation stage. Our approach aims to enhance the reliability and ethical alignment of these models through a comprehensive, context-aware methodology. By explicitly capturing and representing relevant situational, cultural, and ethical contexts in a machine-readable format, we lay the foundation for anchoring a model's behavior within these contexts. Our approach leverages techniques from knowledge representation and reasoning, such as ontologies, semantic web technologies, and logic-based formalisms. We evaluate our framework on real-world textual datasets, demonstrating its effectiveness in improving model performance, fairness, and alignment with human expectations, while maintaining high accuracy. Furthermore, we discuss the other key components of the framework, including context-aware encoding, context-aware learning, interpretability and explainability, and continuous monitoring and adaptation. This research contributes to the growing body of work on responsible AI, offering a practical approach to developing more reliable, trustworthy, and ethically-aligned language models. Our findings have significant implications for the deployment of LLMs in sensitive domains such as healthcare, legal systems, and social services, where contextual understanding is paramount.

Title: Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It?

Authors: Anupama Chingacham, Miaoran Zhang, Vera Demberg, Dietrich Klakow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04029
Pdf URL: https://arxiv.org/pdf/2408.04029
Copy Paste: [[2408.04029]] Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It?(https://arxiv.org/abs/2408.04029)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) can generate text by transferring style attributes like formality, resulting in formal or informal text. However, instructing LLMs to generate text that, when spoken, is more intelligible in an acoustically difficult environment is an under-explored topic. We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise. Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline. Our approach resulted in a 40% relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at a signal-to-noise ratio (SNR) of -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise.
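Illustrative sketch (not the authors' code): one plausible reading of the prompt-and-select idea above is to first collect candidate paraphrases that preserve meaning (the textual attribute) and then pick the one a separate scorer rates as most intelligible (the non-textual attribute). The function names, the `is_equivalent` check, and the toy scorer below are all hypothetical stand-ins; a real system would use an LLM for candidate generation and an acoustic intelligibility measure for selection.

```python
from typing import Callable, Sequence


def prompt_and_select(
    candidates: Sequence[str],
    is_equivalent: Callable[[str], bool],
    intelligibility: Callable[[str], float],
    fallback: str,
) -> str:
    """Return the highest-scoring candidate among those that keep the meaning.

    `candidates` would come from prompting an LLM for paraphrases; here they
    are passed in directly so the sketch stays self-contained.
    """
    kept = [c for c in candidates if is_equivalent(c)]
    if not kept:
        # No acceptable paraphrase: keep the original utterance.
        return fallback
    return max(kept, key=intelligibility)


def toy_intelligibility(text: str) -> float:
    """Placeholder scorer (assumption): fewer long words means a higher score.

    This is NOT an acoustic model; it only makes the example runnable.
    """
    words = text.split()
    long_words = sum(len(w) > 7 for w in words)
    return -long_words / max(len(words), 1)


original = "The conference commences tomorrow"
paraphrases = [
    "The meeting starts tomorrow",
    "The conference commences tomorrow",
    "Dogs bark loudly",  # fluent but not a paraphrase
]
best = prompt_and_select(
    paraphrases,
    is_equivalent=lambda c: c != "Dogs bark loudly",  # toy equivalence check
    intelligibility=toy_intelligibility,
    fallback=original,
)
print(best)
```

Keeping the equivalence filter and the intelligibility scorer as separate callables mirrors the decoupling described in the abstract: either one can be swapped out without touching the other.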

Title: Can Rule-Based Insights Enhance LLMs for Radiology Report Classification? Introducing the RadPrompt Methodology

Authors: Panagiotis Fytas, Anna Breger, Ian Selby, Simon Baker, Shahab Shahipasand, Anna Korhonen
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2408.04121
Pdf URL: https://arxiv.org/pdf/2408.04121
Copy Paste: [[2408.04121]] Can Rule-Based Insights Enhance LLMs for Radiology Report Classification? Introducing the RadPrompt Methodology(https://arxiv.org/abs/2408.04121)
Keywords: language model, gpt, llm, prompt
Abstract: Developing imaging models capable of detecting pathologies from chest X-rays can be cost- and time-prohibitive for large datasets as it requires supervision to attain state-of-the-art performance. Instead, labels extracted from radiology reports may serve as distant supervision since these are routinely generated as part of clinical practice. Despite their widespread use, current rule-based methods for label extraction rely on extensive rule sets that are limited in their robustness to syntactic variability. To alleviate these limitations, we introduce RadPert, a rule-based system that integrates an uncertainty-aware information schema with a streamlined set of rules, enhancing performance. Additionally, we have developed RadPrompt, a multi-turn prompting strategy that leverages RadPert to bolster the zero-shot predictive capabilities of large language models, achieving a statistically significant improvement in weighted average F1 score over GPT-4 Turbo. Most notably, RadPrompt surpasses both its underlying models, showcasing the synergistic potential of LLMs with rule-based models. We have evaluated our methods on two English corpora: the MIMIC-CXR gold-standard test set and a gold-standard dataset collected from the Cambridge University Hospitals.

Title: Enhancing Healthcare through Large Language Models: A Study on Medical Question Answering

Authors: Haoran Yu, Chang Yu, Zihan Wang, Dongxian Zou, Hao Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04138
Pdf URL: https://arxiv.org/pdf/2408.04138
Copy Paste: [[2408.04138]] Enhancing Healthcare through Large Language Models: A Study on Medical Question Answering(https://arxiv.org/abs/2408.04138)
Keywords: language model, llm, prompt
Abstract: In recent years, the application of Large Language Models (LLMs) in healthcare has shown significant promise in improving the accessibility and dissemination of medical knowledge. This paper presents a detailed study of various LLMs trained on the MedQuAD medical question-answering dataset, with a focus on identifying the most effective model for providing accurate medical information. Among the models tested, the Sentence-t5 combined with Mistral 7B demonstrated superior performance, achieving a precision score of 0.762. This model's enhanced capabilities are attributed to its advanced pretraining techniques, robust architecture, and effective prompt construction methodologies. By leveraging these strengths, the Sentence-t5 + Mistral 7B model excels in understanding and generating precise medical answers. Our findings highlight the potential of integrating sophisticated LLMs in medical contexts to facilitate efficient and accurate medical knowledge retrieval, thus significantly enhancing patient education and support.

Title: UNLEARN Efficient Removal of Knowledge in Large Language Models

Authors: Tyler Lizzo, Larry Heck
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04140
Pdf URL: https://arxiv.org/pdf/2408.04140
Copy Paste: [[2408.04140]] UNLEARN Efficient Removal of Knowledge in Large Language Models(https://arxiv.org/abs/2408.04140)
Keywords: language model, llm
Abstract: Given the prevalence of large language models (LLMs) and the prohibitive cost of training these models from scratch, dynamically forgetting specific knowledge (e.g., private or proprietary) without retraining the model has become an important capability. This paper proposes a novel method to achieve this objective called UNLEARN. The approach builds upon subspace methods to identify and specifically target the removal of knowledge without adversely affecting other knowledge in the LLM. Results demonstrate 96% of targeted knowledge can be forgotten while maintaining performance on other knowledge within 2.5% of the original model, significantly outperforming the discriminatory abilities of the previous state-of-the-art. A dual method called LEARN is also proposed for targeted knowledge addition. Results show LEARN can match the fine-tuning accuracy of Low-Rank Adaptation (LoRA) without adversely affecting similar tasks.

Title: Semantics or spelling? Probing contextual word embeddings with orthographic noise

Authors: Jacob A. Matthews, John R. Starr, Marten van Schijndel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04162
Pdf URL: https://arxiv.org/pdf/2408.04162
Copy Paste: [[2408.04162]] Semantics or spelling? Probing contextual word embeddings with orthographic noise(https://arxiv.org/abs/2408.04162)
Keywords: language model
Abstract: Pretrained language model (PLM) hidden states are frequently employed as contextual word embeddings (CWE): high-dimensional representations that encode semantic information given linguistic context. Across many areas of computational linguistics research, similarity between CWEs is interpreted as semantic similarity. However, it remains unclear exactly what information is encoded in PLM hidden states. We investigate this practice by probing PLM representations using minimal orthographic noise. We expect that if CWEs primarily encode semantic information, a single character swap in the input word will not drastically affect the resulting representation, given sufficient linguistic context. Surprisingly, we find that CWEs generated by popular PLMs are highly sensitive to noise in input data, and that this sensitivity is related to subword tokenization: the fewer tokens used to represent a word at input, the more sensitive its corresponding CWE. This suggests that CWEs capture information unrelated to word-level meaning and can be manipulated through trivial modifications of input data. We conclude that these PLM-derived CWEs may not be reliable semantic proxies, and that caution is warranted when interpreting representational similarity.

Title: wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech

Authors: Khai Le-Duc, Quy-Anh Dang, Tan-Hanh Pham, Truong-Son Hy
Subjects: cs.CL, cs.AI, cs.IR, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2408.04174
Pdf URL: https://arxiv.org/pdf/2408.04174
Copy Paste: [[2408.04174]] wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech(https://arxiv.org/abs/2408.04174)
Keywords: language model, llm
Abstract: Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness. However, KGs only focus on text data, thereby neglecting other modalities such as speech. In this work, we introduce wav2graph, the first framework for supervised learning knowledge graph from speech data. Our pipeline is straightforward: (1) constructing a KG based on transcribed spoken utterances and a named entity database, (2) converting KG into embedding vectors, and (3) training graph neural networks (GNNs) for node classification and link prediction tasks. Through extensive experiments conducted in inductive and transductive learning contexts using state-of-the-art GNN models, we provide baseline results and error analysis for node classification and link prediction tasks on human transcripts and automatic speech recognition (ASR) transcripts, including evaluations using both encoder-based and decoder-based node embeddings, as well as monolingual and multilingual acoustic pre-trained models. All related code, data, and models are published online.
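Illustrative sketch (not the wav2graph implementation): step (1) of the pipeline above, building a KG from transcripts and a named entity database, might look roughly like the following. It assumes, purely for illustration, that utterances and entities are both nodes and that an edge links an utterance to every entity whose name appears in its transcript. The case-insensitive substring match, the function name, and the sample data are all hypothetical; steps (2) and (3), embedding the nodes and training GNNs, are not covered.

```python
from typing import Dict, List, Tuple


def build_kg(
    utterances: Dict[str, str],
    entity_db: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Link each utterance id to the entities mentioned in its transcript.

    utterances: utterance id -> transcribed text (human or ASR output)
    entity_db:  entity name  -> entity type (would be the node label)
    Returns a sorted list of (utterance_id, entity_name) edges.
    """
    edges = set()
    for utt_id, text in utterances.items():
        lowered = text.lower()
        for name in entity_db:
            # Assumption: a plain substring match stands in for entity linking.
            if name.lower() in lowered:
                edges.add((utt_id, name))
    return sorted(edges)


# Invented sample data, only to make the sketch runnable.
transcripts = {
    "utt1": "the patient was given aspirin for the headache",
    "utt2": "headache improved after two days",
    "utt3": "no medication was prescribed",
}
entities = {"aspirin": "DRUG", "headache": "SYMPTOM"}

for edge in build_kg(transcripts, entities):
    print(edge)
```

With ASR transcripts instead of human ones, recognition errors in entity names would drop edges under this matching rule, which is one way transcript quality could affect the downstream graph tasks.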

Title: MMREC: LLM Based Multi-Modal Recommender System

Authors: Jiahao Tian, Jinman Zhao, Zhenkai Wang, Zhicheng Ding
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2408.04211
Pdf URL: https://arxiv.org/pdf/2408.04211
Copy Paste: [[2408.04211]] MMREC: LLM Based Multi-Modal Recommender System(https://arxiv.org/abs/2408.04211)
Keywords: language model, llm
Abstract: The importance of recommender systems is growing rapidly due to the exponential increase in the volume of content generated daily. This surge in content presents unique challenges for designing effective recommender systems. Key among these challenges is the need to effectively leverage the vast amounts of natural language data and images that represent user preferences. This paper presents a novel approach to enhancing recommender systems by leveraging Large Language Models (LLMs) and deep learning techniques. The proposed framework aims to improve the accuracy and relevance of recommendations by incorporating multi-modal information processing and by the use of unified latent space representation. The study explores the potential of LLMs to better understand and utilize natural language data in recommendation contexts, addressing the limitations of previous methods. The framework efficiently extracts and integrates text and image information through LLMs, unifying diverse modalities in a latent space to simplify the learning process for the ranking model. Experimental results demonstrate the enhanced discriminative power of the model when utilizing multi-modal information. This research contributes to the evolving field of recommender systems by showcasing the potential of LLMs and multi-modal data integration to create more personalized and contextually relevant recommendations.

Title: Simplifying Translations for Children: Iterative Simplification Considering Age of Acquisition with LLMs

Authors: Masashi Oshika, Makoto Morishita, Tsutomu Hirao, Ryohei Sasano, Koichi Takeda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04217
Pdf URL: https://arxiv.org/pdf/2408.04217
Copy Paste: [[2408.04217]] Simplifying Translations for Children: Iterative Simplification Considering Age of Acquisition with LLMs(https://arxiv.org/abs/2408.04217)
Keywords: language model, llm
Abstract: In recent years, neural machine translation (NMT) has been widely used in everyday life. However, the current NMT lacks a mechanism to adjust the difficulty level of translations to match the user's language level. Additionally, due to the bias in the training data for NMT, translations of simple source sentences are often produced with complex words. In particular, this could pose a problem for children, who may not be able to understand the meaning of the translations correctly. In this study, we propose a method that replaces words with high Age of Acquisition (AoA) in translations with simpler words to match the translations to the user's level. We achieve this by using large language models (LLMs), providing a triple of a source sentence, a translation, and a target word to be replaced. We create a benchmark dataset using back-translation on Simple English Wikipedia. The experimental results obtained from the dataset show that our method effectively replaces high-AoA words with lower-AoA words and, moreover, can iteratively replace most of the high-AoA words while still maintaining high BLEU and COMET scores.
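Illustrative sketch (not the authors' code): the iterative replacement loop described above can be pictured as repeatedly finding the word with the highest AoA in the translation and asking a model to replace it, given the (source sentence, translation, target word) triple. In the sketch below the `replace_word` callable stands in for the LLM call, and the AoA values, threshold, and substitution table are invented for the example.

```python
from typing import Callable, Dict, List


def simplify(
    source: str,
    translation: str,
    aoa: Dict[str, float],
    threshold: float,
    replace_word: Callable[[str, List[str], str], List[str]],
    max_rounds: int = 10,
) -> str:
    """Iteratively replace words whose AoA exceeds `threshold`."""
    words = translation.split()
    for _ in range(max_rounds):
        hard = [w for w in words if aoa.get(w.lower(), 0.0) > threshold]
        if not hard:
            break  # every remaining word is at or below the target level
        # Handle the hardest word first (an assumption, not from the paper).
        target = max(hard, key=lambda w: aoa.get(w.lower(), 0.0))
        new_words = replace_word(source, words, target)
        if new_words == words:
            break  # the model could not simplify further
        words = new_words
    return " ".join(words)


# Toy stand-in for the LLM: a fixed substitution table (hypothetical).
SUBSTITUTES = {"commence": "start", "examination": "test"}


def toy_replace(source: str, words: List[str], target: str) -> List[str]:
    return [SUBSTITUTES.get(w, w) if w == target else w for w in words]


# Invented AoA values, in years, for illustration only.
toy_aoa = {"commence": 11.0, "examination": 9.5, "start": 4.0, "test": 5.0}

result = simplify(
    source="(source sentence in the original language)",
    translation="we will commence the examination",
    aoa=toy_aoa,
    threshold=8.0,
    replace_word=toy_replace,
)
print(result)
```

The `max_rounds` cap and the early exit on an unchanged sentence keep the loop from running forever when the replacement step cannot lower a word's AoA.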

Title: Diffusion Guided Language Modeling

Authors: Justin Lovelace, Varsha Kishore, Yiwei Chen, Kilian Q. Weinberger
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04220
Pdf URL: https://arxiv.org/pdf/2408.04220
Copy Paste: [[2408.04220]] Diffusion Guided Language Modeling(https://arxiv.org/abs/2408.04220)
Keywords: language model
Abstract: Current language models demonstrate remarkable proficiency in text generation. However, for many applications it is desirable to control attributes, such as sentiment, or toxicity, of the generated language -- ideally tailored towards each specific use case and target audience. For auto-regressive language models, existing guidance methods are prone to decoding errors that cascade during generation and degrade performance. In contrast, text diffusion models can easily be guided with, for example, a simple linear sentiment classifier -- however they do suffer from significantly higher perplexity than auto-regressive alternatives. In this paper we use a guided diffusion model to produce a latent proposal that steers an auto-regressive language model to generate text with desired properties. Our model inherits the unmatched fluency of the auto-regressive approach and the plug-and-play flexibility of diffusion. We show that it outperforms previous plug-and-play guidance methods across a wide range of benchmark data sets. Further, controlling a new attribute in our framework is reduced to training a single logistic regression classifier.

Title: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula

Authors: Li Lucy, Tal August, Rose E. Wang, Luca Soldaini, Courtney Allison, Kyle Lo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04226
Pdf URL: https://arxiv.org/pdf/2408.04226
Copy Paste: [[2408.04226]] Evaluating Language Model Math Reasoning via Grounding in Educational Curricula(https://arxiv.org/abs/2408.04226)
Keywords: language model, prompt
Abstract: Our work presents a novel angle for evaluating language models' (LMs) mathematical abilities, by investigating whether they can discern skills and concepts enabled by math content. We contribute two datasets: one consisting of 385 fine-grained descriptions of K-12 math skills and concepts, or standards, from Achieve the Core (ATC), and another of 9.9K problems labeled with these standards (MathFish). Working with experienced teachers, we find that LMs struggle to tag and verify standards linked to problems, and instead predict labels that are close to ground truth, but differ in subtle ways. We also show that LMs often generate problems that do not fully align with standards described in prompts. Finally, we categorize problems in GSM8k using math standards, allowing us to better understand why some problems are more difficult to solve for models than others.

Title: Learning to Rewrite: Generalized LLM-Generated Text Detection

Authors: Wei Hao, Ran Li, Weiliang Zhao, Junfeng Yang, Chengzhi Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04237
Pdf URL: https://arxiv.org/pdf/2408.04237
Copy Paste: [[2408.04237]] Learning to Rewrite: Generalized LLM-Generated Text Detection(https://arxiv.org/abs/2408.04237)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) can be abused at scale to create non-factual content and spread disinformation. Detecting LLM-generated content is essential to mitigate these risks, but current classifiers often fail to generalize in open-world contexts. Prior work shows that LLMs tend to rewrite LLM-generated content less frequently, which can be used for detection and naturally generalizes to unforeseen data. However, we find that the rewriting edit distance between human and LLM content can be indistinguishable across domains, leading to detection failures. We propose training an LLM to rewrite input text, producing minimal edits for LLM-generated content and more edits for human-written text, deriving a distinguishable and generalizable edit distance difference across different domains. Experiments on text from 21 independent domains and three popular LLMs (e.g., GPT-4o, Gemini, and Llama-3) show that our classifier outperforms the state-of-the-art zero-shot classifier by up to 20.6% on AUROC score and the rewriting classifier by 9.2% on F1 score. Our work suggests that LLMs can effectively detect machine-generated text if they are trained properly.
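Illustrative sketch (not the authors' classifier): the rewriting signal described above can be pictured as the edit distance between a text and a model's rewrite of it, with a small distance suggesting machine-generated input. The sketch below uses character-level Levenshtein distance normalized by length. The `rewrite` callable stands in for the (fine-tuned) rewriting LLM, and the 0.1 threshold is an arbitrary placeholder, not a value from the paper.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def rewrite_ratio(text: str, rewrite: Callable[[str], str]) -> float:
    """Edit distance between text and its rewrite, scaled to [0, 1]."""
    rewritten = rewrite(text)
    longest = max(len(text), len(rewritten), 1)
    return levenshtein(text, rewritten) / longest


def looks_machine_generated(
    text: str,
    rewrite: Callable[[str], str],
    threshold: float = 0.1,  # hypothetical cut-off, would need tuning
) -> bool:
    """Flag text that the rewriter leaves almost unchanged."""
    return rewrite_ratio(text, rewrite) < threshold


# Toy rewriters: one changes nothing, one changes every letter.
print(looks_machine_generated("hello world", lambda t: t))
print(looks_machine_generated("hello world", lambda t: t.upper()))
```

A fixed threshold like this is exactly what the abstract says fails across domains; the training objective it proposes aims to widen the gap between the two distance distributions so that a single cut-off generalizes better.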

Title: Explicating the Implicit: Argument Detection Beyond Sentence Boundaries

Authors: Paul Roit, Aviv Slobodkin, Eran Hirsch, Arie Cattan, Ayal Klein, Valentina Pyatkin, Ido Dagan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04246
Pdf URL: https://arxiv.org/pdf/2408.04246
Copy Paste: [[2408.04246]] Explicating the Implicit: Argument Detection Beyond Sentence Boundaries(https://arxiv.org/abs/2408.04246)
Keywords: language model
Abstract: Detecting semantic arguments of a predicate word has been conventionally modeled as a sentence-level task. The typical reader, however, perfectly interprets predicate-argument relations in a much wider context than just the sentence where the predicate was evoked. In this work, we reformulate the problem of argument detection through textual entailment to capture semantic relations across sentence boundaries. We propose a method that tests whether some semantic relation can be inferred from a full passage by first encoding it into a simple and standalone proposition and then testing for entailment against the passage. Our method does not require direct supervision, which is generally absent due to dataset scarcity, but instead builds on existing NLI and sentence-level SRL resources. Such a method can potentially explicate pragmatically understood relations into a set of explicit sentences. We demonstrate it on a recent document-level benchmark, outperforming some supervised methods and contemporary language models.

Title: EfficientRAG: Efficient Retriever for Multi-Hop Question Answering

Authors: Ziyuan Zhuang, Zhiyang Zhang, Sitao Cheng, Fangkai Yang, Jia Liu, Shujian Huang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04259
Pdf URL: https://arxiv.org/pdf/2408.04259
Copy Paste: [[2408.04259]] EfficientRAG: Efficient Retriever for Multi-Hop Question Answering(https://arxiv.org/abs/2408.04259)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) methods encounter difficulties when addressing complex questions like multi-hop queries. While iterative retrieval methods improve performance by gathering additional information, current approaches often rely on multiple calls of large language models (LLMs). In this paper, we introduce EfficientRAG, an efficient retriever for multi-hop question answering. EfficientRAG iteratively generates new queries without the need for LLM calls at each iteration and filters out irrelevant information. Experimental results demonstrate that EfficientRAG surpasses existing RAG methods on three open-domain multi-hop question-answering datasets.

Title: Analysis of Argument Structure Constructions in the Large Language Model BERT

Authors: Pegah Ramezani, Achim Schilling, Patrick Krauss
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04270
Pdf URL: https://arxiv.org/pdf/2408.04270
Copy Paste: [[2408.04270]] Analysis of Argument Structure Constructions in the Large Language Model BERT(https://arxiv.org/abs/2408.04270)
Keywords: language model
Abstract: This study investigates how BERT processes and represents Argument Structure Constructions (ASCs), extending previous LSTM analyses. Using a dataset of 2000 sentences across four ASC types (transitive, ditransitive, caused-motion, resultative), we analyzed BERT's token embeddings across 12 layers. Visualizations with MDS and t-SNE and clustering quantified by Generalized Discrimination Value (GDV) were used. Feedforward classifiers (probes) predicted construction categories from embeddings. CLS token embeddings clustered best in layers 2-4, decreased in intermediate layers, and slightly increased in final layers. DET and SUBJ embeddings showed consistent clustering in intermediate layers, VERB embeddings increased in clustering from layer 1 to 12, and OBJ embeddings peaked in layer 10. Probe accuracies indicated low construction information in layer 1, with over 90 percent accuracy from layer 2 onward, revealing latent construction information beyond GDV clustering. Fisher Discriminant Ratio (FDR) analysis of attention weights showed OBJ tokens were crucial for differentiating ASCs, followed by VERB and DET tokens. SUBJ, CLS, and SEP tokens had insignificant FDR scores. This study highlights BERT's layered processing of linguistic constructions and its differences from LSTMs. Future research will compare these findings with neuroimaging data to understand the neural correlates of ASC processing. This research underscores neural language models' potential to mirror linguistic processing in the human brain, offering insights into the computational and neural mechanisms underlying language understanding.
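
Note on the Fisher Discriminant Ratio mentioned in the abstract above: the abstract does not give the exact formula applied to the attention weights, so the following is only a minimal sketch of one common multi-class form for a single scalar feature (variance of the class means divided by the mean within-class variance). The function name, the class groupings, and all numeric values are illustrative assumptions, not data or code from the paper.

```python
from statistics import mean, pvariance


def fisher_discriminant_ratio(groups):
    """One common multi-class FDR for a scalar feature (assumed definition).

    groups maps a class label to a list of feature values.
    Returns the variance of the class means divided by the
    average within-class variance.
    """
    class_means = [mean(values) for values in groups.values()]
    between = pvariance(class_means)
    within = mean([pvariance(values) for values in groups.values()])
    if within == 0:
        raise ValueError("within-class variance is zero; ratio undefined")
    return between / within


# Hypothetical attention weights on one token type, grouped by construction.
separated = {
    "transitive": [0.10, 0.12, 0.11],
    "ditransitive": [0.30, 0.28, 0.32],
    "caused-motion": [0.50, 0.52, 0.48],
    "resultative": [0.70, 0.69, 0.71],
}
fdr_separated = fisher_discriminant_ratio(separated)

# The same values regrouped so that the class means nearly coincide.
mixed = {
    "a": [0.10, 0.30, 0.50, 0.70],
    "b": [0.12, 0.28, 0.52, 0.69],
    "c": [0.11, 0.32, 0.48, 0.71],
}
fdr_mixed = fisher_discriminant_ratio(mixed)
```

Under this definition a large ratio means the feature separates the classes well, which is the sense in which a token type can be called important for telling constructions apart.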

Title: LaDiMo: Layer-wise Distillation Inspired MoEfier

Authors: Sungyoon Kim, Youngjun Kim, Kihyo Moon, Minsung Jang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04278
Pdf URL: https://arxiv.org/pdf/2408.04278
Copy Paste: [[2408.04278]] LaDiMo: Layer-wise Distillation Inspired MoEfier(https://arxiv.org/abs/2408.04278)
Keywords: language model
Abstract: The advent of large language models has revolutionized natural language processing, but their increasing complexity has led to substantial training costs, resource demands, and environmental impacts. In response, sparse Mixture-of-Experts (MoE) models have emerged as a promising alternative to dense models. Since training MoE models from scratch can be prohibitively expensive, recent studies have explored leveraging knowledge from pre-trained non-MoE models. However, existing approaches have limitations, such as requiring significant hardware resources and data. We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost. LaDiMo consists of two stages: layer-wise expert construction and routing policy decision. By harnessing the concept of Knowledge Distillation, we compress the model and rapidly recover its performance. Furthermore, we develop an adaptive router that optimizes inference efficiency by profiling the distribution of routing weights and determining a layer-wise policy that balances accuracy and latency. We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens, reducing activated parameters by over 20% while keeping accuracy. Our approach offers a flexible and efficient solution for building and deploying MoE models.

Title: LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection

Authors: Mervat Abassy, Kareem Elozeiri, Alexander Aziz, Minh Ngoc Ta, Raj Vardhan Tomar, Bimarsha Adhikari, Saad El Dine Ahmed, Yuxia Wang, Osama Mohammed Afzal, Zhuohan Xie, Jonibek Mansurov, Ekaterina Artemova, Vladislav Mikhailov, Rui Xing, Jiahui Geng, Hasan Iqbal, Zain Muhammad Mujahid, Tarek Mahmoud, Akim Tsvigun, Alham Fikri Aji, Artem Shelmanov, Nizar Habash, Iryna Gurevych, Preslav Nakov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04284
Pdf URL: https://arxiv.org/pdf/2408.04284
Copy Paste: [[2408.04284]] LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection(https://arxiv.org/abs/2408.04284)
Keywords: language model, llm, prompt
Abstract: The widespread accessibility of large language models (LLMs) to the general public has significantly amplified the dissemination of machine-generated texts (MGTs). Advancements in prompt manipulation have exacerbated the difficulty in discerning the origin of a text (human-authored vs. machine-generated). This raises concerns regarding the potential misuse of MGTs, particularly within educational and academic domains. In this paper, we present LLM-DetectAIve, a system designed for fine-grained MGT detection. It is able to classify texts into four categories: human-written, machine-generated, machine-written machine-humanized, and human-written machine-polished. Contrary to previous MGT detectors that perform binary classification, introducing two additional categories in LLM-DetectAIve offers insights into the varying degrees of LLM intervention during text creation. This might be useful in some domains like education, where any LLM intervention is usually prohibited. Experiments show that LLM-DetectAIve can effectively identify the authorship of textual content, proving its usefulness in enhancing integrity in education, academia, and other domains. LLM-DetectAIve is publicly accessible at this https URL. The video describing our system is available at this https URL.

Title: EMTeC: A Corpus of Eye Movements on Machine-Generated Texts

Authors: Lena Sophia Bolliger, Patrick Haller, Isabelle Caroline Rose Cretton, David Robert Reich, Tannon Kew, Lena Ann Jäger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04289
Pdf URL: https://arxiv.org/pdf/2408.04289
Copy Paste: [[2408.04289]] EMTeC: A Corpus of Eye Movements on Machine-Generated Texts(https://arxiv.org/abs/2408.04289)
Keywords: language model
Abstract: The Eye Movements on Machine-Generated Texts Corpus (EMTeC) is a naturalistic eye-movements-while-reading corpus of 107 native English speakers reading machine-generated texts. The texts are generated by three large language models using five different decoding strategies, and they fall into six different text type categories. EMTeC entails the eye movement data at all stages of pre-processing, i.e., the raw coordinate data sampled at 2000 Hz, the fixation sequences, and the reading measures. It further provides both the original and a corrected version of the fixation sequences, accounting for vertical calibration drift. Moreover, the corpus includes the language models' internals that underlie the generation of the stimulus texts: the transition scores, the attention scores, and the hidden states. The stimuli are annotated for a range of linguistic features both at text and at word level. We anticipate EMTeC to be utilized for a variety of use cases such as, but not restricted to, the investigation of reading behavior on machine-generated text and the impact of different decoding strategies; reading behavior on different text types; the development of new pre-processing, data filtering, and drift correction algorithms; the cognitive interpretability and enhancement of language models; and the assessment of the predictive power of surprisal and entropy for human reading times. The data at all stages of pre-processing, the model internals, and the code to reproduce the stimulus generation, data pre-processing and analyses can be accessed via this https URL.

Title: Are Social Sentiments Inherent in LLMs? An Empirical Study on Extraction of Inter-demographic Sentiments

Authors: Kunitomo Tanaka, Ryohei Sasano, Koichi Takeda
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2408.04293
Pdf URL: https://arxiv.org/pdf/2408.04293
Copy Paste: [[2408.04293]] Are Social Sentiments Inherent in LLMs? An Empirical Study on Extraction of Inter-demographic Sentiments(https://arxiv.org/abs/2408.04293)
Keywords: language model, llm
Abstract: Large language models (LLMs) are supposed to acquire unconscious human knowledge and feelings, such as social common sense and biases, by training models from large amounts of text. However, it is not clear how much the sentiments of specific social groups can be captured in various LLMs. In this study, we focus on social groups defined in terms of nationality, religion, and race/ethnicity, and validate the extent to which sentiments between social groups can be captured in and extracted from LLMs. Specifically, we input questions regarding sentiments from one group to another into LLMs, apply sentiment analysis to the responses, and compare the results with social surveys. The validation results using five representative LLMs showed higher correlations with relatively small p-values for nationalities and religions, whose number of data points were relatively large. This result indicates that the LLM responses including the inter-group sentiments align well with actual social survey results.
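
Note on the comparison step described in the abstract above: the abstract reports correlations between LLM-derived sentiment and social survey results but does not say which coefficient was used. The sketch below assumes a plain Pearson correlation; the function name and all scores are hypothetical placeholders, not values from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical sentiment scores for five group pairs:
# one list from sentiment analysis of LLM responses, one from a survey.
llm_scores = [0.2, -0.1, 0.5, 0.7, -0.4]
survey_scores = [0.3, 0.0, 0.4, 0.8, -0.5]
r = pearson_r(llm_scores, survey_scores)
```

A significance test (the p-values mentioned in the abstract) would additionally depend on the number of data points, which is why categories with more group pairs can yield smaller p-values at a similar correlation.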

Title: Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

Authors: François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04303
Pdf URL: https://arxiv.org/pdf/2408.04303
Copy Paste: [[2408.04303]] Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP(https://arxiv.org/abs/2408.04303)
Keywords: language model, llm
Abstract: The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar, in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, where high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy allows for the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.
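
Note on the initialization step described in the abstract above: a target-language token embedding is set to a weighted average of semantically similar source-language token embeddings. The sketch below illustrates only that averaging step. How the weights are derived from the translation resource is not stated in the abstract, so the alignment weights here are assumed to be simple counts, and the function name, tokens, and vectors are hypothetical.

```python
def init_target_embedding(alignment, source_embeddings):
    """Weighted average of source embeddings for one target token.

    alignment maps a source token to a non-negative weight
    (assumed here to be an alignment count).
    source_embeddings maps a source token to its embedding vector.
    """
    total = sum(alignment.values())
    if total <= 0:
        raise ValueError("alignment weights must sum to a positive value")
    dim = len(next(iter(source_embeddings.values())))
    target = [0.0] * dim
    for source_token, weight in alignment.items():
        source_vec = source_embeddings[source_token]
        for i, value in enumerate(source_vec):
            # Normalize weights so they sum to 1 across aligned tokens.
            target[i] += (weight / total) * value
    return target


# Toy two-dimensional source embeddings (hypothetical values).
source_embeddings = {"cat": [1.0, 0.0], "kitten": [0.0, 1.0]}

# A hypothetical target token aligned three times with "cat", once with "kitten".
alignment = {"cat": 3.0, "kitten": 1.0}
target_vec = init_target_embedding(alignment, source_embeddings)
```

With these toy values the target embedding lands three quarters of the way toward "cat", so the new token starts near the source tokens it most often translates to.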

Title: Enhancing Journalism with AI: A Study of Contextualized Image Captioning for News Articles using LLMs and LMMs

Authors: Aliki Anagnostopoulou, Thiago Gouvea, Daniel Sonntag
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2408.04331
Pdf URL: https://arxiv.org/pdf/2408.04331
Copy Paste: [[2408.04331]] Enhancing Journalism with AI: A Study of Contextualized Image Captioning for News Articles using LLMs and LMMs(https://arxiv.org/abs/2408.04331)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) and large multimodal models (LMMs) have significantly impacted the AI community, industry, and various economic sectors. In journalism, integrating AI poses unique challenges and opportunities, particularly in enhancing the quality and efficiency of news reporting. This study explores how LLMs and LMMs can assist journalistic practice by generating contextualised captions for images accompanying news articles. We conducted experiments using the GoodNews dataset to evaluate the ability of LMMs (BLIP-2, GPT-4v, or LLaVA) to incorporate one of two types of context: entire news articles, or extracted named entities. In addition, we compared their performance to a two-stage pipeline composed of a captioning model (BLIP-2, OFA, or ViT-GPT2) with post-hoc contextualisation with LLMs (GPT-4 or LLaMA). We assess a diversity of models, and we find that while the choice of contextualisation model is a significant factor for the two-stage pipelines, this is not the case in the LMMs, where smaller, open-source models perform well compared to proprietary, GPT-powered ones. Additionally, we found that controlling the amount of provided context enhances performance. These results highlight the limitations of a fully automated approach and underscore the necessity for an interactive, human-in-the-loop strategy.

Title: Open-domain Implicit Format Control for Large Language Model Generation

Authors: Yiqun Yao, Wenjia Ma, Xuezhi Fang, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Jing Li, Aixin Sun, Yequan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04392
Pdf URL: https://arxiv.org/pdf/2408.04392
Copy Paste: [[2408.04392]] Open-domain Implicit Format Control for Large Language Model Generation(https://arxiv.org/abs/2408.04392)
Keywords: language model, llm
Abstract: Controlling the format of outputs generated by large language models (LLMs) is a critical functionality in various applications. Current methods typically employ constrained decoding with rule-based automata or fine-tuning with manually crafted format instructions, both of which struggle with open-domain format requirements. To address this limitation, we introduce a novel framework for controlled generation in LLMs, leveraging user-provided, one-shot QA pairs. This study investigates LLMs' capabilities to follow open-domain, one-shot constraints and replicate the format of the example answers. We observe that this is a non-trivial problem for current LLMs. We also develop a dataset collection methodology for supervised fine-tuning that enhances the open-domain format control of LLMs without degrading output quality, as well as a benchmark on which we evaluate both the helpfulness and format correctness of LLM outputs. The resulting datasets, named OIFC-SFT, along with the related code, will be made publicly available at this https URL.

Title: Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation

Authors: Nicy Scaria, Suma Dharani Chenna, Deepak Subramani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04394
Pdf URL: https://arxiv.org/pdf/2408.04394
Copy Paste: [[2408.04394]] Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation(https://arxiv.org/abs/2408.04394)
Keywords: language model, llm, prompt
Abstract: Developing questions that are pedagogically sound, relevant, and promote learning is a challenging and time-consuming task for educators. Modern-day large language models (LLMs) generate high-quality content across multiple domains, potentially helping educators to develop high-quality questions. Automated educational question generation (AEQG) is important in scaling online education catering to a diverse student population. Past attempts at AEQG have shown limited abilities to generate questions at higher cognitive levels. In this study, we examine the ability of five state-of-the-art LLMs of different sizes to generate diverse and high-quality questions of different cognitive levels, as defined by Bloom's taxonomy. We use advanced prompting techniques with varying complexity for AEQG. We conducted expert and LLM-based evaluations to assess the linguistic and pedagogical relevance and quality of the questions. Our findings suggest that LLMs can generate relevant and high-quality educational questions of different cognitive levels when prompted with adequate information, although there is a significant variance in the performance of the five LLMs considered. We also show that automated evaluation is not on par with human evaluation.

Title: Exploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the NeuBAROCO Dataset

Authors: Kentaro Ozeki, Risako Ando, Takanobu Morishita, Hirohiko Abe, Koji Mineshima, Mitsuhiro Okada
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04403
Pdf URL: https://arxiv.org/pdf/2408.04403
Copy Paste: [[2408.04403]] Exploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the NeuBAROCO Dataset(https://arxiv.org/abs/2408.04403)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: This paper explores the question of how accurately current large language models can perform logical reasoning in natural language, with an emphasis on whether these models exhibit reasoning biases similar to humans. Specifically, our study focuses on syllogistic reasoning, a form of deductive reasoning extensively studied in cognitive science as a natural form of human reasoning. We present a syllogism dataset called NeuBAROCO, which consists of syllogistic reasoning problems in English and Japanese. This dataset was originally designed for psychological experiments to assess human reasoning capabilities using various forms of syllogisms. Our experiments with leading large language models indicate that these models exhibit reasoning biases similar to humans, along with other error tendencies. Notably, there is significant room for improvement in reasoning problems where the relationship between premises and hypotheses is neither entailment nor contradiction. We also present experimental results and in-depth analysis using a new Chain-of-Thought prompting method, which asks LLMs to translate syllogisms into abstract logical expressions and then explain their reasoning process. Our analysis using this method suggests that the primary limitations of LLMs lie in the reasoning process itself rather than the interpretation of syllogisms.

Title: Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning

Authors: Seong-Il Park, Seung-Woo Choi, Na-Hyun Kim, Jay-Yoon Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04414
Pdf URL: https://arxiv.org/pdf/2408.04414
Copy Paste: [[2408.04414]] Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning(https://arxiv.org/abs/2408.04414)
Keywords: language model
Abstract: Retrieval-Augmented Language Models (RALMs) have significantly improved performance in open-domain question answering (QA) by leveraging external knowledge. However, RALMs still struggle with unanswerable queries, where the retrieved contexts do not contain the correct answer, and with conflicting information, where different sources provide contradictory answers due to imperfect retrieval. This study introduces an in-context learning-based approach to enhance the reasoning capabilities of RALMs, making them more robust in imperfect retrieval scenarios. Our method incorporates Machine Reading Comprehension (MRC) demonstrations, referred to as cases, to boost the model's capabilities to identify unanswerabilities and conflicts among the retrieved contexts. Experiments on two open-domain QA datasets show that our approach increases accuracy in identifying unanswerable and conflicting scenarios without requiring additional fine-tuning. This work demonstrates that in-context learning can effectively enhance the robustness of RALMs in open-domain QA tasks.

Title: Recognizing Emotion Regulation Strategies from Human Behavior with Large Language Models

Authors: Philipp Müller, Alexander Heimerl, Sayed Muddashir Hossain, Lea Siegel, Jan Alexandersson, Patrick Gebhard, Elisabeth André, Tanja Schneeberger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04420
Pdf URL: https://arxiv.org/pdf/2408.04420
Copy Paste: [[2408.04420]] Recognizing Emotion Regulation Strategies from Human Behavior with Large Language Models(https://arxiv.org/abs/2408.04420)
Keywords: language model, llm, prompt
Abstract: Human emotions are often not expressed directly, but regulated according to internal processes and social display rules. For affective computing systems, an understanding of how users regulate their emotions can be highly useful, for example to provide feedback in job interview training, or in psychotherapeutic scenarios. However, at present no method to automatically classify different emotion regulation strategies in a cross-user scenario exists. At the same time, recent studies showed that instruction-tuned Large Language Models (LLMs) can reach impressive performance across a variety of affect recognition tasks such as categorical emotion recognition or sentiment analysis. While these results are promising, it remains unclear to what extent the representational power of LLMs can be utilized in the more subtle task of classifying users' internal emotion regulation strategy. To close this gap, we make use of the recently introduced Deep corpus for modeling the social display of the emotion shame, where each point in time is annotated with one of seven different emotion regulation classes. We fine-tune Llama2-7B as well as the recently introduced Gemma model using Low-rank Optimization on prompts generated from different sources of information on the Deep corpus. These include verbal and nonverbal behavior, person factors, as well as the results of an in-depth interview after the interaction. Our results show that a fine-tuned Llama2-7B LLM is able to classify the utilized emotion regulation strategy with high accuracy (0.84) without needing access to data from post-interaction interviews. This represents a significant improvement over previous approaches based on Bayesian Networks and highlights the importance of modeling verbal behavior in emotion regulation.

Title: Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate

Authors: Yiqun Zhang, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04472
Pdf URL: https://arxiv.org/pdf/2408.04472
Copy Paste: [[2408.04472]] Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate(https://arxiv.org/abs/2408.04472)
Keywords: language model, llm, hallucination, agent
Abstract: Competitive debate is a comprehensive and complex computational argumentation task. Large Language Models (LLMs) encounter hallucinations and lack competitiveness in this task. To address these challenges, we introduce Agent for Debate (Agent4Debate), a dynamic, multi-agent framework based on LLMs designed to enhance their capabilities in competitive debate. Drawing inspiration from human behavior in debate preparation and execution, Agent4Debate employs a collaborative architecture where four specialized agents (Searcher, Analyzer, Writer, and Reviewer) dynamically interact and cooperate. These agents work throughout the debate process, covering multiple stages from initial research and argument formulation to rebuttal and summary. To comprehensively evaluate framework performance, we construct the Chinese Debate Arena, comprising 66 carefully selected Chinese debate motions. We recruit ten experienced human debaters and collect records of 200 debates involving Agent4Debate, baseline models, and humans. The evaluation employs the Debatrix automatic scoring system and professional human reviewers based on the established Debatrix-Elo and Human-Elo ranking. Experimental results indicate that the state-of-the-art Agent4Debate exhibits capabilities comparable to those of humans. Furthermore, ablation studies demonstrate the effectiveness of each component in the agent structure.
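
Note on the Elo-based rankings mentioned in the abstract above: the abstract does not describe how Debatrix-Elo or Human-Elo are computed, so the following is only a sketch of the standard Elo update for a single head-to-head result. The K-factor of 32, the starting ratings, and the function names are assumptions made for illustration.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one debate.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    k is the update step size (assumed value, not from the paper).
    """
    expected_a = expected_score(rating_a, rating_b)
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b


# Hypothetical debate: an agent and a human start level, and the agent wins.
agent_rating, human_rating = update_elo(1000.0, 1000.0, score_a=1.0)
```

Repeating this update over many recorded debates yields a ranking in which ratings move more when a result is surprising relative to the current gap between the two debaters.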

Title: Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models

Authors: Fabio Pernisi, Dirk Hovy, Paul Röttger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04522
Pdf URL: https://arxiv.org/pdf/2408.04522
Copy Paste: [[2408.04522]] Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models(https://arxiv.org/abs/2408.04522)
Keywords: language model, llm, prompt
Abstract: As diverse linguistic communities and users adopt large language models (LLMs), assessing their safety across languages becomes critical. Despite ongoing efforts to make LLMs safe, they can still be made to behave unsafely with jailbreaking, a technique in which models are prompted to act outside their operational guidelines. Research on LLM safety and jailbreaking, however, has so far mostly focused on English, limiting our understanding of LLM safety in other languages. We contribute towards closing this gap by investigating the effectiveness of many-shot jailbreaking, where models are prompted with unsafe demonstrations to induce unsafe behaviour, in Italian. To enable our analysis, we create a new dataset of unsafe Italian question-answer pairs. With this dataset, we identify clear safety vulnerabilities in four families of open-weight LLMs. We find that the models exhibit unsafe behaviors even when prompted with few unsafe demonstrations, and -- more alarmingly -- that this tendency rapidly escalates with more demonstrations.

Title: Bias-Aware Low-Rank Adaptation: Mitigating Catastrophic Inheritance of Large Language Models

Authors: Yupeng Chang, Yi Chang, Yuan Wu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04556
Pdf URL: https://arxiv.org/pdf/2408.04556
Copy Paste: [[2408.04556]] Bias-Aware Low-Rank Adaptation: Mitigating Catastrophic Inheritance of Large Language Models(https://arxiv.org/abs/2408.04556)
Keywords: language model, llm
Abstract: Large language models (LLMs) have exhibited remarkable proficiency across a diverse array of natural language processing (NLP) tasks. However, adapting LLMs to downstream applications typically necessitates computationally intensive and memory-demanding fine-tuning procedures. To mitigate these burdens, parameter-efficient fine-tuning (PEFT) techniques have emerged as a promising approach to tailor LLMs with minimal computational overhead. While PEFT methods offer substantial advantages, they do not fully address the pervasive issue of bias propagation from pre-training data. In this work, we introduce Bias-Aware Low-Rank Adaptation (BA-LoRA), a novel PEFT method designed to counteract bias inheritance. BA-LoRA incorporates three distinct regularization terms: (1) consistency regularizer, (2) diversity regularizer, and (3) singular vector decomposition regularizer. These regularizers collectively aim to improve the generative models' consistency, diversity, and generalization capabilities during the fine-tuning process. Through extensive experiments on a variety of natural language understanding (NLU) and natural language generation (NLG) tasks, employing prominent LLMs such as LLaMA, Mistral, and Gemma, we demonstrate that BA-LoRA surpasses the performance of LoRA and its state-of-the-art variants. Moreover, our method effectively mitigates the deleterious effects of pre-training bias, leading to more reliable and robust model outputs. The code is available at this https URL.

Title: Conversational Prompt Engineering

Authors: Liat Ein-Dor, Orith Toledo-Ronen, Artem Spector, Shai Gretz, Lena Dankin, Alon Halfon, Yoav Katz, Noam Slonim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04560
Pdf URL: https://arxiv.org/pdf/2408.04560
Copy Paste: [[2408.04560]] Conversational Prompt Engineering(https://arxiv.org/abs/2408.04560)
Keywords: llm, prompt, chat
Abstract: Prompts are how humans communicate with LLMs. Informative prompts are essential for guiding LLMs to produce the desired output. However, prompt engineering is often tedious and time-consuming and requires significant expertise, which limits its widespread use. We propose Conversational Prompt Engineering (CPE), a user-friendly tool that helps users create personalized prompts for their specific tasks. CPE uses a chat model to briefly interact with users, helping them articulate their output preferences and integrating these into the prompt. The process includes two main stages: first, the model uses user-provided unlabeled data to generate data-driven questions and utilizes user responses to shape the initial instruction. Then, the model shares the outputs generated by the instruction and uses user feedback to further refine the instruction and the outputs. The final result is a few-shot prompt, where the outputs approved by the user serve as few-shot examples. A user study on summarization tasks demonstrates the value of CPE in creating personalized, high-performing prompts. The results suggest that the zero-shot prompt obtained is comparable to its much longer few-shot counterpart, indicating significant savings in scenarios involving repetitive tasks with large text volumes.

Title: Learning Fine-Grained Grounded Citations for Attributed Large Language Models

Authors: Lei Huang, Xiaocheng Feng, Weitao Ma, Yuxuan Gu, Weihong Zhong, Xiachong Feng, Weijiang Yu, Weihua Peng, Duyu Tang, Dandan Tu, Bing Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.04568
Pdf URL: https://arxiv.org/pdf/2408.04568
Copy Paste: [[2408.04568]] Learning Fine-Grained Grounded Citations for Attributed Large Language Models(https://arxiv.org/abs/2408.04568)
Keywords: language model, gpt, llm, hallucination, chat
Abstract: Despite their impressive performance on information-seeking tasks, large language models (LLMs) still struggle with hallucinations. Attributed LLMs, which augment generated text with in-line citations, have shown potential in mitigating hallucinations and improving verifiability. However, current approaches suffer from suboptimal citation quality due to their reliance on in-context learning. Furthermore, the practice of citing only coarse document identifiers makes it challenging for users to perform fine-grained verification. In this work, we introduce FRONT, a training framework designed to teach LLMs to generate Fine-Grained Grounded Citations. Grounding model outputs in fine-grained supporting quotes guides the generation of grounded and consistent responses, which not only improves citation quality but also facilitates fine-grained verification. Experiments on the ALCE benchmark demonstrate the efficacy of FRONT in generating superior grounded responses and highly supportive citations. With LLaMA-2-7B, the framework significantly outperforms all the baselines, achieving an average of 14.21% improvement in citation quality across all datasets, even surpassing ChatGPT.

Title: Towards Resilient and Efficient LLMs: A Comparative Study of Efficiency, Performance, and Adversarial Robustness

Authors: Xiaojing Fan, Chunliang Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04585
Pdf URL: https://arxiv.org/pdf/2408.04585
Copy Paste: [[2408.04585]] Towards Resilient and Efficient LLMs: A Comparative Study of Efficiency, Performance, and Adversarial Robustness(https://arxiv.org/abs/2408.04585)
Keywords: language model, llm
Abstract: With the increasing demand for practical applications of Large Language Models (LLMs), many attention-efficient models have been developed to balance performance and computational cost. However, the adversarial robustness of these models remains under-explored. In this work, we design a framework to investigate the trade-off between efficiency, performance, and adversarial robustness of LLMs by comparing three prominent models with varying levels of complexity and efficiency -- Transformer++, Gated Linear Attention (GLA) Transformer, and MatMul-Free LM -- utilizing the GLUE and AdvGLUE datasets. The AdvGLUE dataset extends the GLUE dataset with adversarial samples designed to challenge model robustness. Our results show that while the GLA Transformer and MatMul-Free LM achieve slightly lower accuracy on GLUE tasks, they demonstrate higher efficiency and either superior or comparable robustness on AdvGLUE tasks compared to Transformer++ across different attack levels. These findings highlight the potential of simplified architectures to achieve a compelling balance between efficiency, performance, and adversarial robustness, offering valuable insights for applications where resource constraints and resilience to adversarial attacks are critical.

Title: Code-switching in text and speech reveals information-theoretic audience design

Authors: Debasmita Bhattacharya, Marten van Schijndel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.04596
Pdf URL: https://arxiv.org/pdf/2408.04596
Copy Paste: [[2408.04596]] Code-switching in text and speech reveals information-theoretic audience design(https://arxiv.org/abs/2408.04596)
Keywords: language model
Abstract: In this work, we use language modeling to investigate the factors that influence code-switching. Code-switching occurs when a speaker alternates between one language variety (the primary language) and another (the secondary language), and is widely observed in multilingual contexts. Recent work has shown that code-switching is often correlated with areas of high information load in the primary language, but it is unclear whether high primary language load only makes the secondary language relatively easier to produce at code-switching points (speaker-driven code-switching), or whether code-switching is additionally used by speakers to signal the need for greater attention on the part of listeners (audience-driven code-switching). In this paper, we use bilingual Chinese-English online forum posts and transcripts of spontaneous Chinese-English speech to replicate prior findings that high primary language (Chinese) information load is correlated with switches to the secondary language (English). We then demonstrate that the information load of the English productions is even higher than that of meaning-equivalent Chinese alternatives, so the English productions are not easier to produce. This provides evidence of audience-driven influences in code-switching at the level of the communication channel, not just at the sociolinguistic level, in both writing and speech.

Title: Better Alignment with Instruction Back-and-Forth Translation

Authors: Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason Weston, Luke Zettlemoyer, Xian Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.04614
Pdf URL: https://arxiv.org/pdf/2408.04614
Copy Paste: [[2408.04614]] Better Alignment with Instruction Back-and-Forth Translation(https://arxiv.org/abs/2408.04614)
Keywords: language model, gpt, llm
Abstract: We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge for aligning large language models (LLMs). Given documents from a web corpus, we generate and curate synthetic instructions using the backtranslation approach proposed by Li et al. (2023a), and rewrite the responses to improve their quality further based on the initial documents. Fine-tuning with the resulting (backtranslated instruction, rewritten response) pairs yields higher win rates on AlpacaEval than using other common instruction datasets such as Humpback, ShareGPT, Open Orca, Alpaca-GPT4 and Self-instruct. We also demonstrate that rewriting the responses with an LLM outperforms direct distillation, and that the two generated text distributions are significantly distinct in embedding space. Further analysis shows that our backtranslated instructions are of higher quality than other sources of synthetic instructions, while our responses are more diverse and complex than those obtained from distillation. Overall, we find that instruction back-and-forth translation combines the best of both worlds -- making use of the information diversity and quantity found on the web, while ensuring the quality of the responses, which is necessary for effective alignment.

Title: Arctic-TILT. Business Document Understanding at Sub-Billion Scale

Authors: Łukasz Borchmann, Michał Pietruszka, Wojciech Jaśkowski, Dawid Jurkiewicz, Piotr Halama, Paweł Józiak, Łukasz Garncarek, Paweł Liskowski, Karolina Szyndler, Andrzej Gretkowski, Julita Ołtusek, Gabriela Nowakowska, Artur Zawłocki, Łukasz Duhr, Paweł Dyda, Michał Turski
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2408.04632
Pdf URL: https://arxiv.org/pdf/2408.04632
Copy Paste: [[2408.04632]] Arctic-TILT. Business Document Understanding at Sub-Billion Scale(https://arxiv.org/abs/2408.04632)
Keywords: llm
Abstract: A vast portion of workloads employing LLMs involves answering questions grounded in PDF or scan content. We introduce Arctic-TILT, which achieves accuracy on par with models 1000$\times$ its size on these use cases. It can be fine-tuned and deployed on a single 24GB GPU, lowering operational costs while processing Visually Rich Documents with up to 400k tokens. The model establishes state-of-the-art results on seven diverse Document Understanding benchmarks and provides reliable confidence scores and quick inference, which are essential for processing files in large-scale or time-sensitive enterprise environments.