2024-05-02

Title: HistNERo: Historical Named Entity Recognition for the Romanian Language

Authors: Andrei-Marius Avram, Andreea Iuga, George-Vlad Manolache, Vlad-Cristian Matei, Răzvan-Gabriel Micliuş, Vlad-Andrei Muntean, Manuel-Petru Sorlescu, Dragoş-Andrei Şerban, Adrian-Dinu Urse, Vasile Păiş, Dumitru-Clementin Cercel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00155
Pdf URL: https://arxiv.org/pdf/2405.00155
Copy Paste: [[2405.00155]] HistNERo: Historical Named Entity Recognition for the Romanian Language(https://arxiv.org/abs/2405.00155)
Keywords: language model
Abstract: This work introduces HistNERo, the first Romanian corpus for Named Entity Recognition (NER) in historical newspapers. The dataset contains 323k tokens of text, covering the period from the early 19th century (1817) to the late 20th century (1990). Eight native Romanian speakers annotated the dataset with five named entity types. The samples belong to one of the following four historical regions of Romania, namely Bessarabia, Moldavia, Transylvania, and Wallachia. We employed this proposed dataset to perform several experiments for NER using Romanian pre-trained language models. Our results show that the best model achieved a strict F1-score of 55.69%. Also, by reducing the discrepancies between regions through a novel domain adaptation technique, we improved the performance on this corpus to a strict F1-score of 66.80%, representing an absolute gain of more than 10 percentage points.
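Note on the metric: the abstract reports a "strict" F1-score. A minimal sketch of what that usually means is given below, assuming the common entity-level definition in which a prediction counts only when its boundaries and its label both match a gold entity exactly. The span tuples and labels are hypothetical, and the paper's own scorer may differ in detail.

```python
def strict_f1(gold, predicted):
    """Entity-level strict F1: a prediction is correct only on an exact
    (start, end, label) match with a gold entity."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: (start_token, end_token, label).
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
pred = [(0, 2, "PER"), (5, 7, "LOC"), (9, 11, "ORG")]  # LOC boundary is off by one

score = strict_f1(gold, pred)  # the near-miss LOC span earns no credit

# Absolute gain between the two scores quoted in the abstract.
absolute_gain = round(66.80 - 55.69, 2)
```

Under this definition a span that is off by a single token counts as both a false positive and a false negative, which is one reason strict scores on noisy historical text tend to be low.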

Title: Towards a Search Engine for Machines: Unified Ranking for Multiple Retrieval-Augmented Large Language Models

Authors: Alireza Salemi, Hamed Zamani
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2405.00175
Pdf URL: https://arxiv.org/pdf/2405.00175
Copy Paste: [[2405.00175]] Towards a Search Engine for Machines: Unified Ranking for Multiple Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2405.00175)
Keywords: language model, retrieval-augmented generation
Abstract: This paper introduces uRAG--a framework with a unified retrieval engine that serves multiple downstream retrieval-augmented generation (RAG) systems. Each RAG system consumes the retrieval results for a unique purpose, such as open-domain question answering, fact verification, entity linking, and relation extraction. We introduce a generic training guideline that standardizes the communication between the search engine and the downstream RAG systems that engage in optimizing the retrieval model. This lays the groundwork for us to build a large-scale experimentation ecosystem consisting of 18 RAG systems that engage in training and 18 unknown RAG systems that use the uRAG as the new users of the search engine. Using this experimentation ecosystem, we answer a number of fundamental research questions that improve our understanding of promises and challenges in developing search engines for machines.

Title: SPAFIT: Stratified Progressive Adaptation Fine-tuning for Pre-trained Large Language Models

Authors: Samir Arora, Liangliang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.00201
Pdf URL: https://arxiv.org/pdf/2405.00201
Copy Paste: [[2405.00201]] SPAFIT: Stratified Progressive Adaptation Fine-tuning for Pre-trained Large Language Models(https://arxiv.org/abs/2405.00201)
Keywords: language model
Abstract: Full fine-tuning is a popular approach to adapt Transformer-based pre-trained large language models to a specific downstream task. However, the substantial requirements for computational power and storage have discouraged its widespread use. Moreover, increasing evidence of catastrophic forgetting and overparameterization in the Transformer architecture has motivated researchers to seek parameter-efficient fine-tuning (PEFT) methods. Commonly known PEFT methods like LoRA and BitFit are typically applied across all layers of the model. We propose a PEFT method, called Stratified Progressive Adaptation Fine-tuning (SPAFIT), based on the localization of different types of linguistic knowledge to specific layers of the model. Our experiments, conducted on nine tasks from the GLUE benchmark, show that our proposed SPAFIT method outperforms other PEFT methods while fine-tuning only a fraction of the parameters adjusted by other methods.
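Note on parameter counts: the abstract contrasts full fine-tuning with LoRA and BitFit. The sketch below only illustrates why those two methods train far fewer parameters per linear layer. It assumes the standard formulations (LoRA adds two low-rank matrices of shapes d_out x r and r x d_in, BitFit trains only the bias vector). The layer size and rank are hypothetical, and nothing here reproduces SPAFIT's layer grouping.

```python
def full_params(d_in, d_out):
    """Trainable parameters when a linear layer is fully fine-tuned (weight + bias)."""
    return d_in * d_out + d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters of a LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


def bitfit_params(d_out):
    """Trainable parameters under BitFit: only the bias vector."""
    return d_out


# Hypothetical layer: a 768 x 768 projection with LoRA rank 8.
d_in, d_out, rank = 768, 768, 8
counts = {
    "full": full_params(d_in, d_out),
    "lora": lora_params(d_in, d_out, rank),
    "bitfit": bitfit_params(d_out),
}
lora_fraction = counts["lora"] / counts["full"]
```

Applying such adapters to only some layers, as the abstract describes, reduces the total further in proportion to the number of layers left untouched.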

Title: General Purpose Verification for Chain of Thought Prompting

Authors: Robert Vacareanu, Anurag Pratik, Evangelia Spiliopoulou, Zheng Qi, Giovanni Paolini, Neha Anna John, Jie Ma, Yassine Benajiba, Miguel Ballesteros
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.00204
Pdf URL: https://arxiv.org/pdf/2405.00204
Copy Paste: [[2405.00204]] General Purpose Verification for Chain of Thought Prompting(https://arxiv.org/abs/2405.00204)
Keywords: language model, llm, prompt
Abstract: Many of the recent capabilities demonstrated by Large Language Models (LLMs) arise primarily from their ability to exploit contextual information. In this paper, we explore ways to improve reasoning capabilities of LLMs through (1) exploration of different chains of thought and (2) validation of the individual steps of the reasoning process. We propose three general principles that a model should adhere to while reasoning: (i) Relevance, (ii) Mathematical Accuracy, and (iii) Logical Consistency. We apply these constraints to the reasoning steps generated by the LLM to improve the accuracy of the final generation. The constraints are applied in the form of verifiers: the model itself is asked to verify if the generated steps satisfy each constraint. To further steer the generations towards high-quality solutions, we use the perplexity of the reasoning steps as an additional verifier. We evaluate our method on 4 distinct types of reasoning tasks, spanning a total of 9 different datasets. Experiments show that our method is always better than vanilla generation, and, in 6 out of the 9 datasets, it is better than best-of-N sampling, which samples N reasoning chains and picks the lowest-perplexity generation.
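Note on the baseline: the abstract describes best-of-N sampling as drawing N reasoning chains and keeping the one with the lowest perplexity. A minimal sketch of that selection step follows, assuming each chain comes with per-token natural-log probabilities. The chain names and probabilities are hypothetical, and this is not the authors' code.

```python
import math


def perplexity(token_logprobs):
    """Perplexity is the exponential of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_lowest_perplexity(chains):
    """Best-of-N selection: return the name of the chain with the lowest perplexity."""
    return min(chains, key=lambda name: perplexity(chains[name]))


# Hypothetical token log-probabilities for N = 3 sampled reasoning chains.
chains = {
    "chain_a": [math.log(0.5)] * 4,
    "chain_b": [math.log(0.25)] * 4,
    "chain_c": [math.log(0.8), math.log(0.1)],
}
best = pick_lowest_perplexity(chains)
```

Because perplexity averages over tokens, a chain with one very unlikely token (chain_c here) can score worse than a chain that is uniformly mediocre.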

Title: A Primer on the Inner Workings of Transformer-based Language Models

Authors: Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00208
Pdf URL: https://arxiv.org/pdf/2405.00208
Copy Paste: [[2405.00208]] A Primer on the Inner Workings of Transformer-based Language Models(https://arxiv.org/abs/2405.00208)
Keywords: language model
Abstract: The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture. We conclude by presenting a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area.

Title: Graphical Reasoning: LLM-based Semi-Open Relation Extraction

Authors: Yicheng Tao, Yiqun Wang, Longju Bai
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.00216
Pdf URL: https://arxiv.org/pdf/2405.00216
Copy Paste: [[2405.00216]] Graphical Reasoning: LLM-based Semi-Open Relation Extraction(https://arxiv.org/abs/2405.00216)
Keywords: language model, gpt, llm
Abstract: This paper presents a comprehensive exploration of relation extraction utilizing advanced language models, specifically Chain of Thought (CoT) and Graphical Reasoning (GRE) techniques. We demonstrate how leveraging in-context learning with GPT-3.5 can significantly enhance the extraction process, particularly through detailed example-based reasoning. Additionally, we introduce a novel graphical reasoning approach that dissects relation extraction into sequential sub-tasks, improving precision and adaptability in processing complex relational data. Our experiments, conducted on multiple datasets, including manually annotated data, show considerable improvements in performance metrics, underscoring the effectiveness of our methodologies.

Title: CodeHalu: Code Hallucinations in LLMs Driven by Execution-based Verification

Authors: Yuchen Tian, Weixiang Yan, Qian Yang, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2405.00253
Pdf URL: https://arxiv.org/pdf/2405.00253
Copy Paste: [[2405.00253]] CodeHalu: Code Hallucinations in LLMs Driven by Execution-based Verification(https://arxiv.org/abs/2405.00253)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have made significant advancements in the field of code generation, offering unprecedented support for automated programming and assisting developers. However, LLMs sometimes generate code that appears plausible but fails to meet the expected requirements or executes incorrectly. This phenomenon of hallucinations in the coding field has not been explored. To advance the community's understanding and research on code hallucinations in LLMs, we propose a definition method for these hallucinations based on execution verification and introduce the concept of code hallucinations for the first time. We categorize code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, each further divided into different subcategories to better understand and address the unique challenges faced by LLMs during code generation. To systematically evaluate code hallucinations, we propose a dynamic detection algorithm for code hallucinations and construct the CodeHalu benchmark, which includes 8,883 samples from 699 tasks, to actively detect hallucination phenomena in LLMs during programming. We tested 16 popular LLMs on this benchmark to evaluate the frequency and nature of their hallucinations during code generation. The findings reveal significant variations in the accuracy and reliability of LLMs in generating code, highlighting the urgent need to improve models and training methods to ensure the functional correctness and safety of automatically generated code. This study not only classifies and quantifies code hallucinations but also provides insights for future improvements in LLM-based code generation research. The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.

Title: Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge

Authors: Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.00263
Pdf URL: https://arxiv.org/pdf/2405.00263
Copy Paste: [[2405.00263]] Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge(https://arxiv.org/abs/2405.00263)
Keywords: language model, llm
Abstract: Large language models (LLMs) suffer from low efficiency due to the mismatch between the requirement of auto-regressive decoding and the design of most contemporary GPUs. Specifically, billions to trillions of parameters must be loaded to the GPU cache through its limited memory bandwidth for computation, but only a small batch of tokens is actually computed. Consequently, the GPU spends most of its time on memory transfer instead of computation. Recently, parallel decoding, a type of speculative decoding algorithm, has become more popular and has demonstrated impressive efficiency improvements in generation. It introduces extra decoding heads to large models, enabling them to predict multiple subsequent tokens simultaneously and verify these candidate continuations in a single decoding step. However, this approach deviates from the training objective of next token prediction used during pre-training, resulting in a low hit rate for candidate tokens. In this paper, we propose a new speculative decoding algorithm, Clover, which integrates sequential knowledge into the parallel decoding process. This enhancement improves the hit rate of speculators and thus boosts the overall efficiency. Clover transmits the sequential knowledge from pre-speculated tokens via the Regressive Connection, then employs an Attention Decoder to integrate these speculated tokens. Additionally, Clover incorporates an Augmenting Block that modifies the hidden states to better align with the purpose of speculative generation rather than next token prediction. The experimental results demonstrate that Clover outperforms the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, and exceeds the performance of the previously top-performing method, Medusa, by up to 37% on Baichuan-Small and 57% on Baichuan-Large.
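Note on the "hit rate": the abstract says candidate continuations are verified in a single decoding step, and that a low hit rate wastes that step. The sketch below shows a generic greedy verification rule, not Clover's architecture. It assumes `target_tokens[i]` is the token the base model itself would pick at position i given the accepted prefix plus `draft_tokens[:i]`; both lists and the function name are hypothetical.

```python
def verify_draft(draft_tokens, target_tokens):
    """Emit the base model's tokens for as long as the draft agrees with them.

    The base model's own token is always safe to emit, so the first
    disagreeing position still contributes one (corrected) token before
    verification stops.
    """
    output = []
    for draft, target in zip(draft_tokens, target_tokens):
        output.append(target)
        if draft != target:
            break
    return output


# Hypothetical token ids: the draft heads guessed four tokens, two of them correctly.
draft_tokens = [5, 9, 3, 7]
target_tokens = [5, 9, 4, 1]

emitted = verify_draft(draft_tokens, target_tokens)
hits = sum(1 for d, t in zip(draft_tokens, emitted) if d == t)
```

Each extra hit is one more token produced per pass over the model weights, which is why raising the speculator's hit rate translates into higher throughput.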

Title: Social Life Simulation for Non-Cognitive Skills Learning

Authors: Zihan Yan, Yaohong Xiang, Yun Huang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2405.00273
Pdf URL: https://arxiv.org/pdf/2405.00273
Copy Paste: [[2405.00273]] Social Life Simulation for Non-Cognitive Skills Learning(https://arxiv.org/abs/2405.00273)
Keywords: language model, llm, chat, agent
Abstract: Non-cognitive skills are crucial for personal and social life well-being, and such skill development can be supported by narrative-based (e.g., storytelling) technologies. While generative AI enables interactive and role-playing storytelling, little is known about how users engage with and perceive the use of AI in social life simulation for non-cognitive skills learning. To this end, we introduced SimuLife++, an interactive platform enabled by a large language model (LLM). The system allows users to act as protagonists, creating stories with one or multiple AI-based characters in diverse social scenarios. In particular, we expanded the Human-AI interaction to a Human-AI-AI collaboration by including a sage agent, who acts as a bystander to provide users with more insightful perspectives on their choices and conversations. Through a within-subject user study, we found that the inclusion of the sage agent significantly enhanced narrative immersion, according to the narrative transportation scale, leading to more messages, particularly in group chats. Participants' interactions with the sage agent were also associated with significantly higher scores in their perceived motivation, self-perceptions, and resilience and coping, indicating positive impacts on non-cognitive skills reflection. Participants' interview results further explained the sage agent's aid in decision-making, solving ethical dilemmas, and problem-solving; on the other hand, they suggested improvements in user control and balanced responses from multiple characters. We provide design implications on the application of generative AI in narrative solutions for non-cognitive skill development in broader social contexts.

Title: Adversarial Attacks and Defense for Conversation Entailment Task

Authors: Zhenning Yang, Ryan Krawec, Liang-Yuan Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.00289
Pdf URL: https://arxiv.org/pdf/2405.00289
Copy Paste: [[2405.00289]] Adversarial Attacks and Defense for Conversation Entailment Task(https://arxiv.org/abs/2405.00289)
Keywords: language model, llm
Abstract: Large language models (LLMs) have proved to be very powerful on different NLP tasks. However, there are still many ways to attack the model at very low cost. How to defend the model becomes an important problem. In our work, we treat adversarial attack results as a new (unseen) domain of the model, and we frame the defense problem as how to improve the robustness of the model on the new domain. We focus on the task of conversation entailment, where multi-turn natural language dialogues are the premise, and the transformer model is fine-tuned to predict whether a given hypothesis about the given dialogue is true or false. The adversary would attack the hypothesis to fool the model into making wrong predictions. We apply synonym-swapping as the attack method. To show the robustness of the model, we implement some fine-tuning strategies and propose the embedding perturbation loss as a method to improve the robustness of the model. Finally, we show the importance of our work by discussing the adversarial attacks in NLP in the real world.
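Note on the perturbation: the abstract names synonym-swapping of the hypothesis as the attack. A toy sketch of such a perturbation follows, useful mainly for generating robustness test cases. The synonym table and the example sentence are hypothetical, and the authors' procedure for choosing which words to swap is not described in the abstract.

```python
def synonym_swap(hypothesis, synonyms):
    """Replace every word that has an entry in the synonym table.

    Lookup is case-insensitive; words without an entry are kept unchanged.
    """
    return " ".join(synonyms.get(word.lower(), word) for word in hypothesis.split())


# Hypothetical synonym table and hypothesis.
synonyms = {"happy": "glad", "buy": "purchase"}
original = "The speaker is happy to buy the car"
perturbed = synonym_swap(original, synonyms)
```

The perturbed hypothesis keeps its meaning, so a robust entailment model should give it the same label as the original.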

Title: How Can I Improve? Using GPT to Highlight the Desired and Undesired Parts of Open-ended Responses

Authors: Jionghao Lin, Eason Chen, Zeifei Han, Ashish Gurung, Danielle R. Thomas, Wei Tan, Ngoc Dang Nguyen, Kenneth R. Koedinger
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2405.00291
Pdf URL: https://arxiv.org/pdf/2405.00291
Copy Paste: [[2405.00291]] How Can I Improve? Using GPT to Highlight the Desired and Undesired Parts of Open-ended Responses(https://arxiv.org/abs/2405.00291)
Keywords: language model, gpt, prompt
Abstract: Automated explanatory feedback systems play a crucial role in facilitating learning for a large cohort of learners by offering feedback that incorporates explanations, significantly enhancing the learning process. However, delivering such explanatory feedback in real-time poses challenges, particularly when high classification accuracy for domain-specific, nuanced responses is essential. Our study leverages the capabilities of large language models, specifically Generative Pre-Trained Transformers (GPT), to explore a sequence labeling approach focused on identifying components of desired and less desired praise for providing explanatory feedback within a tutor training dataset. Our aim is to equip tutors with actionable, explanatory feedback during online training lessons. To investigate the potential of GPT models for providing the explanatory feedback, we employed two commonly-used approaches: prompting and fine-tuning. To quantify the quality of highlighted praise components identified by GPT models, we introduced a Modified Intersection over Union (M-IoU) score. Our findings demonstrate that: (1) the M-IoU score effectively correlates with human judgment in evaluating sequence quality; (2) using two-shot prompting on GPT-3.5 resulted in decent performance in recognizing effort-based (M-IoU of 0.46) and outcome-based praise (M-IoU of 0.68); and (3) our optimally fine-tuned GPT-3.5 model achieved M-IoU scores of 0.64 for effort-based praise and 0.84 for outcome-based praise, aligning with the satisfaction levels evaluated by human coders. Our results show promise for using GPT models to provide feedback that focuses on specific elements in their open-ended responses that are desirable or could use improvement.
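Note on the score: the abstract introduces a Modified Intersection over Union (M-IoU) but does not say what the modification is. The sketch below therefore shows only plain IoU over highlighted token positions, as a reference point for reading values such as 0.46 or 0.84. The token indices are hypothetical, and scoring two empty highlights as 1.0 is a convention chosen here, not taken from the paper.

```python
def span_iou(predicted_tokens, reference_tokens):
    """Plain intersection over union of two sets of highlighted token indices."""
    predicted, reference = set(predicted_tokens), set(reference_tokens)
    union = predicted | reference
    if not union:
        return 1.0  # convention assumed here: two empty highlights agree perfectly
    return len(predicted & reference) / len(union)


# Hypothetical highlights: the model marks tokens 2-4, the annotator marks tokens 3-5.
iou = span_iou([2, 3, 4], [3, 4, 5])
```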

Title: LITO: Learnable Intervention for Truthfulness Optimization

Authors: Farima Fatahi Bayat, Xin Liu, H. V. Jagadish, Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00301
Pdf URL: https://arxiv.org/pdf/2405.00301
Copy Paste: [[2405.00301]] LITO: Learnable Intervention for Truthfulness Optimization(https://arxiv.org/abs/2405.00301)
Keywords: language model, llm
Abstract: Large language models (LLMs) can generate long-form and coherent text, but they still frequently hallucinate facts, thus limiting their reliability. To address this issue, inference-time methods that elicit truthful responses have been proposed by shifting LLM representations towards learned "truthful directions". However, applying the truthful directions with the same intensity fails to generalize across different question contexts. We propose LITO, a Learnable Intervention method for Truthfulness Optimization that automatically identifies the optimal intervention intensity tailored to a specific context. LITO explores a sequence of model generations based on increasing levels of intervention intensities. It selects the most accurate response or refuses to answer when the predictions are highly uncertain. Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy. The adaptive nature of LITO counters issues with one-size-fits-all intervention-based solutions, maximizing model truthfulness by reflecting internal knowledge only when the model is confident.
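Note on the intervention: the abstract describes shifting representations towards a "truthful direction" at several intensities, then selecting a response or refusing. A rough sketch under loud assumptions: the shift is taken to be additive (hidden + intensity * direction), and the selection step is replaced by a simple confidence threshold standing in for LITO's learned selector. All vectors, responses, and confidence values are hypothetical.

```python
def shift_representation(hidden, direction, intensity):
    """Move a hidden vector along a direction, scaled by the intervention intensity."""
    return [h + intensity * d for h, d in zip(hidden, direction)]


def select_or_refuse(responses, min_confidence):
    """Return the most confident response, or None (refuse) if none is confident enough.

    `responses` is a list of (text, confidence) pairs; the threshold rule is a
    stand-in for a learned selector.
    """
    text, confidence = max(responses, key=lambda pair: pair[1])
    return text if confidence >= min_confidence else None


# Hypothetical hidden state and direction, shifted at increasing intensities.
hidden = [1.0, 2.0]
direction = [0.5, -0.5]
shifted = [shift_representation(hidden, direction, a) for a in (0.0, 1.0, 2.0)]

# Hypothetical responses, one per intensity, with made-up confidence scores.
responses = [("Paris", 0.9), ("Lyon", 0.4), ("Nice", 0.2)]
answer = select_or_refuse(responses, min_confidence=0.5)
refusal = select_or_refuse(responses, min_confidence=0.95)
```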

Title: Generating Feedback-Ladders for Logical Errors in Programming using Large Language Models

Authors: Hasnain Heickal, Andrew Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00302
Pdf URL: https://arxiv.org/pdf/2405.00302
Copy Paste: [[2405.00302]] Generating Feedback-Ladders for Logical Errors in Programming using Large Language Models(https://arxiv.org/abs/2405.00302)
Keywords: language model, llm, prompt
Abstract: In feedback generation for logical errors in programming assignments, large language model (LLM)-based methods have shown great promise. These methods ask the LLM to generate feedback given the problem statement and a student's (buggy) submission. There are several issues with these types of methods. First, the generated feedback messages are often too direct in revealing the error in the submission and thus diminish valuable opportunities for the student to learn. Second, they do not consider the student's learning context, i.e., their previous submissions, current knowledge, etc. Third, they are not layered since existing methods use a single, shared prompt for all student submissions. In this paper, we explore using LLMs to generate a "feedback-ladder", i.e., multiple levels of feedback for the same problem-submission pair. We evaluate the quality of the generated feedback-ladder via a user study with students, educators, and researchers. We have observed diminishing effectiveness for higher-level feedback and higher-scoring submissions overall in the study. In practice, our method enables teachers to select an appropriate level of feedback to show to a student based on their personal learning context, or to move progressively to more detailed levels if higher-level feedback fails to correct the student's error.

Title: DFKI-NLP at SemEval-2024 Task 2: Towards Robust LLMs Using Data Perturbations and MinMax Training

Authors: Bhuvanesh Verma, Lisa Raithel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00321
Pdf URL: https://arxiv.org/pdf/2405.00321
Copy Paste: [[2405.00321]] DFKI-NLP at SemEval-2024 Task 2: Towards Robust LLMs Using Data Perturbations and MinMax Training(https://arxiv.org/abs/2405.00321)
Keywords: language model, llm
Abstract: The NLI4CT task at SemEval-2024 emphasizes the development of robust models for Natural Language Inference on Clinical Trial Reports (CTRs) using large language models (LLMs). This edition introduces interventions specifically targeting the numerical, vocabulary, and semantic aspects of CTRs. Our proposed system harnesses the capabilities of the state-of-the-art Mistral model, complemented by an auxiliary model, to focus on the intricate input space of the NLI4CT dataset. Through the incorporation of numerical and acronym-based perturbations to the data, we train a robust system capable of handling both semantic-altering and numerical contradiction interventions. Our analysis on the dataset sheds light on the challenging sections of the CTRs for reasoning.

Title: A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Authors: Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele (Mike) Lunati, Summer Yue
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.00332
Pdf URL: https://arxiv.org/pdf/2405.00332
Copy Paste: [[2405.00332]] A Careful Examination of Large Language Model Performance on Grade School Arithmetic(https://arxiv.org/abs/2405.00332)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, answer magnitude, and more. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with several families of models (e.g., Phi and Mistral) showing evidence of systematic overfitting across almost all model sizes. At the same time, many models, especially those on the frontier (e.g., Gemini/GPT/Claude), show minimal signs of overfitting. Further analysis suggests a positive relationship (Spearman's r^2=0.32) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that many models may have partially memorized GSM8k.
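Note on the statistic: the abstract reports a Spearman correlation between a model's likelihood of generating GSM8k examples and its GSM8k-to-GSM1k accuracy gap. A self-contained sketch of how a Spearman coefficient is computed follows: rank both variables (ties share their average rank), then take the Pearson correlation of the ranks. The numbers are hypothetical and are not the paper's measurements; constant inputs, which have zero variance, are not handled.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-model values: log-likelihood of GSM8k text vs. accuracy gap.
log_likelihoods = [-3.1, -2.4, -2.9, -1.8, -2.0]
accuracy_gaps = [0.01, 0.06, 0.02, 0.09, 0.12]
rho = spearman(log_likelihoods, accuracy_gaps)
rho_squared = rho ** 2
```

Because only ranks matter, the coefficient is insensitive to the scale of either variable, which suits a comparison across models of very different sizes.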

Title: AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts

Authors: Zefang Liu, Jiahua Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00361
Pdf URL: https://arxiv.org/pdf/2405.00361
Copy Paste: [[2405.00361]] AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts(https://arxiv.org/abs/2405.00361)
Keywords: language model, llm
Abstract: We introduce AdaMoLE, a novel method for fine-tuning large language models (LLMs) through an Adaptive Mixture of Low-Rank Adaptation (LoRA) Experts. Moving beyond conventional methods that employ a static top-k strategy for activating experts, AdaMoLE dynamically adjusts the activation threshold using a dedicated threshold network, adaptively responding to the varying complexities of different tasks. By replacing a single LoRA in a layer with multiple LoRA experts and integrating a gating function with the threshold mechanism, AdaMoLE effectively selects and activates the most appropriate experts based on the input context. Our extensive evaluations across a variety of commonsense reasoning and natural language processing tasks show that AdaMoLE exceeds baseline performance. This enhancement highlights the advantages of AdaMoLE's adaptive selection of LoRA experts, improving model effectiveness without a corresponding increase in the expert count. The experimental validation not only confirms AdaMoLE as a robust approach for enhancing LLMs but also suggests valuable directions for future research in adaptive expert selection mechanisms, potentially broadening the scope for optimizing model performance across diverse language processing tasks.
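The contrast between a static top-k rule and a threshold-based rule can be illustrated with a small sketch. The abstract does not give AdaMoLE's exact formulas, so everything below is an assumption made for illustration: the function names are hypothetical, the threshold is passed in as a plain number rather than produced by a threshold network, and the fallback to the single best expert is a choice made here, not something the abstract describes.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def threshold_gate(gate_logits, threshold):
    """Weight experts by how far their gate probability exceeds a threshold.

    Hypothetical sketch: experts at or below the threshold get weight 0, the
    rest are renormalised. In an adaptive scheme the threshold would come
    from a small network applied to the input; here it is just an argument.
    """
    probs = softmax(gate_logits)
    kept = [max(p - threshold, 0.0) for p in probs]
    total = sum(kept)
    if total == 0.0:
        # Assumption: if no expert clears the threshold, use the best one.
        best = max(range(len(probs)), key=probs.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(probs))]
    return [k / total for k in kept]


# The number of active experts depends on the threshold, not on a fixed k.
low = threshold_gate([2.0, 1.0, -1.0], threshold=0.02)
high = threshold_gate([2.0, 1.0, -1.0], threshold=0.2)
print(sum(w > 0 for w in low), sum(w > 0 for w in high))
```

With a fixed top-k rule the count of active experts would be the same for every input; here it changes with the threshold, which is the property the abstract emphasises.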

Title: CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models

Authors: Hongzhan Lin, Zixin Chen, Ziyang Luo, Mingfei Cheng, Jing Ma, Guang Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00390
Pdf URL: https://arxiv.org/pdf/2405.00390
Copy Paste: [[2405.00390]] CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models(https://arxiv.org/abs/2405.00390)
Keywords: language model
Abstract: Social media abounds with multimodal sarcasm, and identifying sarcasm targets is particularly challenging due to the implicit incongruity not directly evident in the text and image modalities. Current methods for Multimodal Sarcasm Target Identification (MSTI) predominantly focus on superficial indicators in an end-to-end manner, overlooking the nuanced understanding of multimodal sarcasm conveyed through both the text and image. This paper proposes a versatile MSTI framework with a coarse-to-fine paradigm, by augmenting sarcasm explainability with reasoning and pre-training knowledge. Inspired by the powerful capacity of Large Multimodal Models (LMMs) on multimodal reasoning, we first engage LMMs to generate competing rationales for coarser-grained pre-training of a small language model on multimodal sarcasm detection. We then propose fine-tuning the model for finer-grained sarcasm target identification. Our framework is thus empowered to adeptly unveil the intricate targets within multimodal sarcasm and mitigate the negative impact posed by potential noise inherently in LMMs. Experimental results demonstrate that our model far outperforms state-of-the-art MSTI methods, and markedly exhibits explainability in deciphering sarcasm as well.

Title: Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models

Authors: Leonardo Ranaldi, Andrè Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00402
Pdf URL: https://arxiv.org/pdf/2405.00402
Copy Paste: [[2405.00402]] Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models(https://arxiv.org/abs/2405.00402)
Keywords: language model, llm
Abstract: The alignments of reasoning abilities between smaller and larger Language Models are largely conducted via Supervised Fine-Tuning (SFT) using demonstrations generated from robust Large Language Models (LLMs). Although these approaches deliver more performant models, they do not show sufficiently strong generalization ability as the training only relies on the provided demonstrations. In this paper, we propose the Self-refine Instruction-tuning method that elicits Smaller Language Models to self-refine their abilities. Our approach is based on a two-stage process, where reasoning abilities are first transferred between LLMs and Small Language Models (SLMs) via Instruction-tuning on demonstrations provided by LLMs, and then the instructed models Self-refine their abilities through preference optimization strategies. In particular, the second phase operates refinement heuristics based on the Direct Preference Optimization algorithm, where the SLMs are elicited to deliver a series of reasoning paths by automatically sampling the generated responses and providing rewards using ground truths from the LLMs. Results obtained on commonsense and math reasoning tasks show that this approach significantly outperforms Instruction-tuning in both in-domain and out-domain scenarios, aligning the reasoning abilities of Smaller and Larger Language Models.
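For readers unfamiliar with Direct Preference Optimization, the per-pair loss it minimises can be sketched as below. This is the standard DPO objective, not the authors' refinement heuristics; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. beta=0.1 is an illustrative default.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss drops below log(2) once the policy prefers the chosen response
# more strongly than the reference model does.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0) < math.log(2))
```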

Title: BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine

Authors: Mingchen Li, Halil Kilicoglu, Hua Xu, Rui Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00465
Pdf URL: https://arxiv.org/pdf/2405.00465
Copy Paste: [[2405.00465]] BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine(https://arxiv.org/abs/2405.00465)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have swiftly emerged as vital resources for different applications in the biomedical and healthcare domains; however, these models encounter issues such as generating inaccurate information or hallucinations. Retrieval-augmented generation provided a solution for these models to update knowledge and enhance their performance. In contrast to previous retrieval-augmented LMs, which utilize specialized cross-attention mechanisms to help LLM encode retrieved text, BiomedRAG adopts a simpler approach by directly inputting the retrieved chunk-based documents into the LLM. This straightforward design is easily applicable to existing retrieval and language models, effectively bypassing noise information in retrieved documents, particularly in noise-intensive tasks. Moreover, we demonstrate the potential for utilizing the LLM to supervise the retrieval model in the biomedical domain, enabling it to retrieve the document that assists the LM in improving its predictions. Our experiments reveal that with the tuned scorer, BiomedRAG attains superior performance across 5 biomedical NLP tasks, encompassing information extraction (triple extraction, relation extraction), text classification, link prediction, and question-answering, leveraging over 9 datasets. For instance, in the triple extraction task, BiomedRAG outperforms other triple extraction systems with micro-F1 scores of 81.42 and 88.83 on GIT and ChemProt corpora, respectively.

Title: Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing

Authors: KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00467
Pdf URL: https://arxiv.org/pdf/2405.00467
Copy Paste: [[2405.00467]] Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing(https://arxiv.org/abs/2405.00467)
Keywords: llm
Abstract: With the rapid development of LLMs, it is natural to ask how to harness their capabilities efficiently. In this paper, we explore whether it is feasible to direct each input query to a single most suitable LLM. To this end, we propose LLM routing for challenging reasoning tasks. Our extensive experiments suggest that such routing shows promise but is not feasible in all scenarios, so more robust approaches should be investigated to fill this gap.

Title: Is Temperature the Creativity Parameter of Large Language Models?

Authors: Max Peeperkorn, Tom Kouwenhoven, Dan Brown, Anna Jordanous
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.00492
Pdf URL: https://arxiv.org/pdf/2405.00492
Copy Paste: [[2405.00492]] Is Temperature the Creativity Parameter of Large Language Models?(https://arxiv.org/abs/2405.00492)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are applied to all sorts of creative tasks, and their outputs vary from beautiful, to peculiar, to pastiche, into plain plagiarism. The temperature parameter of an LLM regulates the amount of randomness, leading to more diverse outputs; therefore, it is often claimed to be the creativity parameter. Here, we investigate this claim using a narrative generation task with a predetermined fixed context, model and prompt. Specifically, we present an empirical analysis of the LLM output for different temperature values using four necessary conditions for creativity in narrative generation: novelty, typicality, cohesion, and coherence. We find that temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality. However, the influence of temperature on creativity is far more nuanced and weak than suggested by the "creativity parameter" claim; overall results suggest that the LLM generates slightly more novel outputs as temperatures get higher. Finally, we discuss ideas to allow more controlled LLM creativity, rather than relying on chance via changing the temperature parameter.
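As background for the claim that temperature "regulates the amount of randomness", the sketch below shows the usual definition: logits are divided by the temperature before the softmax. The function names and the example logits are illustrative assumptions and are not taken from the paper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into next-token probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [3.0, 1.0, 0.2]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, round(entropy(softmax_with_temperature(logits, t)), 3))
```

Higher temperatures flatten the distribution, so sampling becomes more varied. Whether that variety amounts to creativity is the question the paper examines.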

Title: A Legal Framework for Natural Language Processing Model Training in Portugal

Authors: Rúben Almeida, Evelin Amorim
Subjects: cs.CL, cs.ET
Abstract URL: https://arxiv.org/abs/2405.00536
Pdf URL: https://arxiv.org/pdf/2405.00536
Copy Paste: [[2405.00536]] A Legal Framework for Natural Language Processing Model Training in Portugal(https://arxiv.org/abs/2405.00536)
Keywords: gpt, chat
Abstract: Recent advances in deep learning have promoted the advent of many computational systems capable of performing intelligent actions that, until then, were restricted to the human intellect. In the particular case of human languages, these advances allowed the introduction of applications like ChatGPT that are capable of generating coherent text without being explicitly programmed to do so. Instead, these models use large volumes of textual data to learn meaningful representations of human languages. Associated with these advances, concerns about copyright and data privacy infringements caused by these applications have emerged. Despite these concerns, the pace at which new natural language processing applications continued to be developed largely outperformed the introduction of new regulations. Today, communication barriers between legal experts and computer scientists motivate many unintentional legal infringements during the development of such applications. In this paper, a multidisciplinary team intends to bridge this communication gap and promote more compliant Portuguese NLP research by presenting a series of everyday NLP use cases, while highlighting the Portuguese legislation that may arise during its development.

Title: Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment

Authors: Zhili Liu, Yunhao Gou, Kai Chen, Lanqing Hong, Jiahui Gao, Fei Mi, Yu Zhang, Zhenguo Li, Xin Jiang, Qun Liu, James T. Kwok
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.00557
Pdf URL: https://arxiv.org/pdf/2405.00557
Copy Paste: [[2405.00557]] Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment(https://arxiv.org/abs/2405.00557)
Keywords: language model, llm
Abstract: As the capabilities of large language models (LLMs) have expanded dramatically, aligning these models with human values presents a significant challenge, posing potential risks during deployment. Traditional alignment strategies rely heavily on human intervention, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), or on the self-alignment capacities of LLMs, which usually require a strong LLM's emergent ability to improve its original bad answer. To address these challenges, we propose a novel self-alignment method that utilizes a Chain of Thought (CoT) approach, termed AlignCoT. This method encompasses stages of Question Analysis, Answer Guidance, and Safe Answer production. It is designed to enable LLMs to generate high-quality, safe responses throughout various stages of their development. Furthermore, we introduce the Mixture of insighTful Experts (MoTE) architecture, which applies the mixture of experts to enhance each component of the AlignCoT process, markedly increasing alignment efficiency. The MoTE approach not only outperforms existing methods in aligning LLMs with human values but also highlights the benefits of using self-generated data, revealing the dual benefits of improved alignment and training efficiency.

Title: The Real, the Better: Aligning Large Language Models with Online Human Behaviors

Authors: Guanying Jiang, Lingyong Yan, Haibo Shi, Dawei Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.00578
Pdf URL: https://arxiv.org/pdf/2405.00578
Copy Paste: [[2405.00578]] The Real, the Better: Aligning Large Language Models with Online Human Behaviors(https://arxiv.org/abs/2405.00578)
Keywords: language model, llm
Abstract: Large language model alignment is widely used and studied to prevent LLMs from producing unhelpful and harmful responses. However, the lengthy training process and predefined preference bias hinder adaptation to diverse online human preferences. To this end, this paper proposes an alignment framework, called Reinforcement Learning with Human Behavior (RLHB), to align LLMs by directly leveraging real online human behaviors. Within a generative adversarial framework, the generator is trained to respond following expected human behavior, while the discriminator tries to verify whether the triplets of query, response, and human behavior come from real online environments. Behavior modeling in natural-language form and the multi-model joint training mechanism enable an active and sustainable online alignment. Experimental results confirm the effectiveness of our proposed methods by both human and automatic evaluations.

Title: Are Models Biased on Text without Gender-related Language?

Authors: Catarina G Belém, Preethi Seshadri, Yasaman Razeghi, Sameer Singh
Subjects: cs.CL, cs.AI, cs.CV, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2405.00588
Pdf URL: https://arxiv.org/pdf/2405.00588
Copy Paste: [[2405.00588]] Are Models Biased on Text without Gender-related Language?(https://arxiv.org/abs/2405.00588)
Keywords: language model
Abstract: Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations that are present in the training data. In this paper, we focus on bias where the effect from training data is unclear, and instead address the question: Do language models still exhibit gender bias in non-stereotypical settings? To do so, we introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score based on pretraining data statistics to determine if the sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (Winobias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not solely stem from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation. We release the full dataset and code at https://ucinlp.github.io/unstereo-eval.

Title: Investigating Automatic Scoring and Feedback using Large Language Models

Authors: Gloria Ashiya Katuka, Alexander Gain, Yen-Yun Yu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2405.00602
Pdf URL: https://arxiv.org/pdf/2405.00602
Copy Paste: [[2405.00602]] Investigating Automatic Scoring and Feedback using Large Language Models(https://arxiv.org/abs/2405.00602)
Keywords: language model, llm
Abstract: Automatic grading and feedback have been long studied using traditional machine learning and deep learning techniques using language models. With the recent accessibility to high performing large language models (LLMs) like LLaMA-2, there is an opportunity to investigate the use of these LLMs for automatic grading and feedback generation. Despite the increase in performance, LLMs require significant computational resources for fine-tuning and additional specific adjustments to enhance their performance for such tasks. To address these issues, Parameter Efficient Fine-tuning (PEFT) methods, such as LoRA and QLoRA, have been adopted to decrease memory and computational requirements in model fine-tuning. This paper explores the efficacy of PEFT-based quantized models, employing a classification or regression head, to fine-tune LLMs for automatically assigning continuous numerical grades to short answers and essays, as well as generating corresponding feedback. We conducted experiments on both proprietary and open-source datasets for our tasks. The results show that prediction of grade scores via fine-tuned LLMs is highly accurate, achieving less than 3% error in grade percentage on average. For providing graded feedback, fine-tuned 4-bit quantized LLaMA-2 13B models outperform competitive base models and achieve high similarity with subject matter expert feedback in terms of high BLEU and ROUGE scores and qualitatively in terms of feedback. The findings from this study provide important insights into the impacts of the emerging capabilities of using quantization approaches to fine-tune LLMs for various downstream tasks, such as automatic short answer scoring and feedback generation at comparatively lower costs and latency.

Title: Addressing Topic Granularity and Hallucination in Large Language Models for Topic Modelling

Authors: Yida Mu, Peizhen Bai, Kalina Bontcheva, Xingyi Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.00611
Pdf URL: https://arxiv.org/pdf/2405.00611
Copy Paste: [[2405.00611]] Addressing Topic Granularity and Hallucination in Large Language Models for Topic Modelling(https://arxiv.org/abs/2405.00611)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) with their strong zero-shot topic extraction capabilities offer an alternative to probabilistic topic modelling and closed-set topic classification approaches. As zero-shot topic extractors, LLMs are expected to understand human instructions to generate relevant and non-hallucinated topics based on the given documents. However, LLM-based topic modelling approaches often face difficulties in generating topics with adherence to granularity as specified in human instructions, often resulting in many near-duplicate topics. Furthermore, methods for addressing hallucinated topics generated by LLMs have not yet been investigated. In this paper, we focus on addressing the issues of topic granularity and hallucinations for better LLM-based topic modelling. To this end, we introduce a novel approach that leverages Direct Preference Optimisation (DPO) to fine-tune open-source LLMs, such as Mistral-7B. Our approach does not rely on traditional human annotation to rank preferred answers but employs a reconstruction pipeline to modify raw topics generated by LLMs, thus enabling a fast and efficient training and inference framework. Comparative experiments show that our fine-tuning approach not only significantly improves the LLM's capability to produce more coherent, relevant, and precise topics, but also reduces the number of hallucinated topics.

Title: Causal Evaluation of Language Models

Authors: Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Yu Qiao, Chaochao Lu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.00622
Pdf URL: https://arxiv.org/pdf/2405.00622
Copy Paste: [[2405.00622]] Causal Evaluation of Language Models(https://arxiv.org/abs/2405.00622)
Keywords: language model
Abstract: Causal reasoning is viewed as crucial for achieving human-level machine intelligence. Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning. In this work, we introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. First, we propose the CaLM framework, which establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results). This taxonomy defines a broad evaluation design space while systematically selecting criteria and priorities. Second, we compose the CaLM dataset, comprising 126,334 data samples, to provide curated sets of causal targets, adaptations, metrics, and errors, offering extensive coverage for diverse research pursuits. Third, we conduct an extensive evaluation of 28 leading language models on a core set of 92 causal targets, 9 adaptations, 7 metrics, and 12 error types. Fourth, we perform detailed analyses of the evaluation results across various dimensions (e.g., adaptation, scale). Fifth, we present 50 high-level empirical findings across 9 dimensions (e.g., model), providing valuable guidance for future language model development. Finally, we develop a multifaceted platform, including a website, leaderboards, datasets, and toolkits, to support scalable and adaptable assessments. We envision CaLM as an ever-evolving benchmark for the community, systematically updated with new causal targets, adaptations, models, metrics, and error types to reflect ongoing research advancements. Project website is at https://opencausalab.github.io/CaLM.

Title: When Quantization Affects Confidence of Large Language Models?

Authors: Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.00632
Pdf URL: https://arxiv.org/pdf/2405.00632
Copy Paste: [[2405.00632]] When Quantization Affects Confidence of Large Language Models?(https://arxiv.org/abs/2405.00632)
Keywords: language model, gpt, llm
Abstract: Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering factors such as language model type and scale as contributors to quantization loss. Firstly, we reveal that quantization with GPTQ to 4-bit results in a decrease in confidence regarding true labels, with varying impacts observed among different language models. Secondly, we observe fluctuations in the impact on confidence across different scales. Finally, we propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place.
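The abstract speaks of confidence and calibration without naming a metric. One widely used calibration measure is expected calibration error (ECE), sketched below with equal-width bins. Treat this as background only: the choice of ECE, the bin count, and the function name are assumptions, and the authors may measure calibration differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between accuracy and mean confidence per bin.

    confidences: predicted probability of the chosen label, each in [0, 1].
    correct: booleans saying whether each prediction was right.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# A model that is always fully confident but right only half the time.
print(expected_calibration_error([1.0] * 4, [True, False, True, False]))
```

Comparing such a score for a full-precision model and its quantized counterpart on the same samples is one way to quantify a shift in calibration.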

Title: Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3

Authors: Junsang Yoon, Akshat Gupta, Gopala Anumanchipalli
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.00664
Pdf URL: https://arxiv.org/pdf/2405.00664
Copy Paste: [[2405.00664]] Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3(https://arxiv.org/abs/2405.00664)
Keywords: language model
Abstract: This study presents a targeted model editing analysis focused on the latest large language model, Llama-3. We explore the efficacy of popular model editing techniques (ROME, MEMIT, and EMMET), which are designed for precise layer interventions. We identify the most effective layers for targeted edits through an evaluation that encompasses up to 4096 edits across three distinct strategies: sequential editing, batch editing, and a hybrid approach we call sequential-batch editing. Our findings indicate that increasing edit batch sizes may degrade model performance more significantly than using smaller edit batches sequentially for an equal number of edits. With this, we argue that sequential model editing is an important component for scaling model editing methods and future research should focus on methods that combine both batched and sequential editing. This observation suggests a potential limitation in current model editing methods which push towards bigger edit batch sizes, and we hope it paves the way for future investigations into optimizing batch sizes and model editing performance.