2024-06-06

Title: Cross-Modal Safety Alignment: Is textual unlearning all you need?

Authors: Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit K. Roy-Chowdhury, Chengyu Song
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2406.02575
Pdf URL: https://arxiv.org/pdf/2406.02575
Copy Paste: [[2406.02575]] Cross-Modal Safety Alignment: Is textual unlearning all you need?(https://arxiv.org/abs/2406.02575)
Keywords: language model, llm
Abstract: Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where, regardless of the combination of input modalities, all inputs are ultimately fused into the language space, we aim to explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates the transferability -- textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8% and in some cases, even as low as nearly 2% for both text-based and vision-text-based attacks, alongside preserving the utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but incurs significantly increased computational demands, possibly up to 6 times higher.
摘要：最近的研究表明，将新模态集成到大型语言模型 (LLM)（例如视觉语言模型 (VLM)）中会创建一个新的攻击面，从而绕过现有的安全训练技术，例如监督微调 (SFT) 和带人工反馈的强化学习 (RLHF)。虽然可以在多模态设置中进行进一步的基于 SFT 和 RLHF 的安全训练，但收集多模态训练数据集是一项重大挑战。受最近多模态模型结构设计的启发，无论输入模态的组合如何，所有输入最终都会融合到语言空间中，我们旨在探索仅在文本域中进行反学习是否可以有效地实现跨模态安全对齐。我们对六个数据集的评估通过实证证明了可转移性——VLM 中的文本反学习显着降低了攻击成功率 (ASR) 至 8% 以下，在某些情况下，甚至低至近 2%，无论是基于文本的攻击还是基于视觉文本的攻击，同时保留了实用性。此外，我们的实验表明，使用多模态数据集进行反学习不会带来任何潜在的好处，但会显著增加计算需求，可能高达 6 倍。

Title: Are PPO-ed Language Models Hackable?

Authors: Suraj Anand, David Getzen
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Are PPO-ed Language Models Hackable?(https://arxiv.org/abs/)
Keywords: language model, gpt
Abstract: Numerous algorithms have been proposed to "align" language models to remove undesirable behaviors. However, the challenges associated with a very large state space and creating a proper reward function often result in various jailbreaks. Our paper aims to examine this effect of reward in the controlled setting of positive sentiment language generation. Instead of online training of a reward model based on human feedback, we employ a statically learned sentiment classifier. We also consider a setting where our model's weights and activations are exposed to an end-user after training. We examine a pretrained GPT-2 through the lens of mechanistic interpretability before and after proximal policy optimization (PPO) has been applied to promote positive sentiment responses. Using these insights, we (1) attempt to "hack" the PPO-ed model to generate negative sentiment responses and (2) add a term to the reward function to try and alter "negative" weights.
摘要：已经提出了许多算法来对语言模型进行 $\textit{align}$ 处理，以消除不良行为。然而，与非常大的状态空间和创建适当的奖励函数相关的挑战往往会导致各种越狱。我们的论文旨在研究在受控的积极情绪语言生成环境中奖励的这种影响。我们没有使用基于人类反馈的奖励模型进行在线训练，而是使用静态学习的情绪分类器。我们还考虑了一种设置，即在训练后将模型的权重和激活暴露给最终用户。我们在应用近端策略优化 (PPO) 来促进积极情绪反应之前和之后，通过机械可解释性的视角来检查预训练的 GPT-2。利用这些见解，我们 (1) 尝试“破解”PPO 模型以生成负面情绪反应，以及 (2) 在奖励函数中添加一个项来尝试改变“负面”权重。

Title: Block Transformer: Global-to-Local Language Modeling for Fast Inference

Authors: Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.02657
Pdf URL: https://arxiv.org/pdf/2406.02657
Copy Paste: [[2406.02657]] Block Transformer: Global-to-Local Language Modeling for Fast Inference(https://arxiv.org/abs/2406.02657)
Keywords: language model
Abstract: This paper presents the Block Transformer architecture which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks of self-attention. To apply self-attention, the key-value (KV) cache of all previous sequences must be retrieved from memory at every decoding step. Thereby, this KV cache IO becomes a significant bottleneck in batch inference. We notice that these costs stem from applying self-attention on the global context, therefore we isolate the expensive bottlenecks of global modeling to lower layers and apply fast local modeling in upper layers. To mitigate the remaining costs in the lower layers, we aggregate input tokens into fixed size blocks and then apply self-attention at this coarse level. Context information is aggregated into a single embedding to enable upper layers to decode the next block of tokens, without global attention. Free of global attention bottlenecks, the upper layers can fully utilize the compute hardware to maximize inference throughput. By leveraging global and local modules, the Block Transformer architecture demonstrates 10-20x gains in inference throughput compared to vanilla transformers with equivalent perplexity. Our work introduces a new approach to optimize language model inference through novel application of global-to-local modeling. Code is available at this https URL.
摘要：本文介绍了 Block Transformer 架构，该架构采用分层的全局到局部建模来对自回归变压器进行建模，以缓解自注意力的推理瓶颈。要应用自注意力，必须在每个解码步骤中从内存中检索所有先前序列的键值 (KV) 缓存。因此，此 KV 缓存 IO 成为批量推理中的一个重要瓶颈。我们注意到这些成本源于将自注意力应用于全局上下文，因此我们将全局建模的昂贵瓶颈隔离到较低层，并在较高层应用快速局部建模。为了减轻较低层的剩余成本，我们将输入标记聚合成固定大小的块，然后在这个粗略级别应用自注意力。上下文信息被聚合到单个嵌入中，以使上层能够解码下一个标记块，而无需全局注意力。没有全局注意力瓶颈，上层可以充分利用计算硬件来最大化推理吞吐量。通过利用全局和局部模块，Block Transformer 架构与具有同等困惑度的 vanilla Transformer 相比，推理吞吐量提高了 10-20 倍。我们的工作引入了一种新方法，通过新颖的全局到局部建模应用来优化语言模型推理。代码可在此 https URL 上找到。

Title: Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Authors: Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Difan Zou, Yisong Yue, Ziniu Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.02721
Pdf URL: https://arxiv.org/pdf/2406.02721
Copy Paste: [[2406.02721]] Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller(https://arxiv.org/abs/2406.02721)
Keywords: language model, llm
Abstract: We propose Self-Control, a novel method utilizing suffix gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a guideline expressed in suffix string and the model's self-assessment of adherence, Self-Control computes the gradient of this self-judgment concerning the model's hidden states, directly influencing the auto-regressive generation process towards desired behaviors. To enhance efficiency, we introduce Self-Control_{prefix}, a compact module that encapsulates the learned representations from suffix gradients into a Prefix Controller, facilitating inference-time control for various LLM behaviors. Our experiments demonstrate Self-Control's efficacy across multiple domains, including emotional modulation, ensuring harmlessness, and enhancing complex reasoning. Especially, Self-Control_{prefix} enables a plug-and-play control and jointly controls multiple attributes, improving model outputs without altering model parameters or increasing inference-time costs.
摘要：我们提出了 Self-Control，这是一种利用后缀梯度来控制大型语言模型 (LLM) 行为的新方法，无需明确的人工注释。给定以后缀字符串表示的指南和模型的自我评估，Self-Control 计算有关模型隐藏状态的自我判断的梯度，直接影响自回归生成过程以实现所需行为。为了提高效率，我们引入了 Self-Control_{prefix}，这是一个紧凑的模块，它将从后缀梯度中学习到的表示封装到前缀控制器中，从而促进各种 LLM 行为的推理时间控制。我们的实验证明了 Self-Control 在多个领域的有效性，包括情绪调节、确保无害和增强复杂推理。特别是，Self-Control_{prefix} 实现了即插即用控制并联合控制多个属性，从而无需改变模型参数或增加推理时间成本即可改善模型输出。

Title: RATT: A Thought Structure for Coherent and Correct LLM Reasoning

Authors: Jinghan Zhang, Xiting Wang, Weijieying Ren, Lu Jiang, Dongjie Wang, Kunpeng Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] RATT: A Thought Structure for Coherent and Correct LLM Reasoning(https://arxiv.org/abs/)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) gain substantial reasoning and decision-making capabilities from thought structures. However, existing methods such as Tree of Thought and Retrieval Augmented Thoughts often fall short in complex tasks due to the limitations of insufficient local retrieval of factual knowledge and inadequate global selection of strategies. These limitations make it challenging for these methods to balance factual accuracy and comprehensive logical optimization effectively. To address these limitations, we introduce the Retrieval Augmented Thought Tree (RATT), a novel thought structure that considers both overall logical soundness and factual correctness at each step of the thinking process. Specifically, at every point of a thought branch, RATT performs planning and lookahead to explore and evaluate multiple potential reasoning steps, and integrates the fact-checking ability of Retrieval-Augmented Generation (RAG) with the LLM's ability to assess overall strategy. Through this combination of factual knowledge and strategic feasibility, RATT adjusts and integrates the thought tree structure to search for the most promising branches within the search space. This thought structure significantly enhances the model's coherence in logical inference and efficiency in decision-making, and thus raises the limit of the capacity of LLMs to generate reliable inferences and decisions based on thought structures. A broad range of experiments on different types of tasks showcases that the RATT structure significantly outperforms existing methods in factual correctness and logical coherence.
摘要：大型语言模型（LLM）从思维结构中获得了强大的推理和决策能力。然而，现有的思维树和检索增强思维等方法往往因局部检索事实知识不足和全局策略选择不足而无法胜任复杂任务。这些限制使得这些方法难以有效地平衡事实准确性和综合逻辑优化。为了解决这些限制，我们引入了检索增强思维树（RATT），这是一种新颖的思维结构，它在思维过程的每一步都考虑整体逻辑合理性和事实正确性。具体来说，在思维分支的每个点，RATT 都会进行规划和前瞻，以探索和评估多个潜在的推理步骤，并将检索增强生成（RAG）的事实核查能力与 LLM 评估整体策略的能力相结合。通过这种事实知识和战略可行性的结合，RATT 调整和整合思维树结构，以在搜索空间内搜索最有希望的分支。这种思维结构显著提升了模型的逻辑推理连贯性和决策效率，从而提高了LLM基于思维结构生成可靠推理和决策的能力极限。对不同类型任务的大量实验表明，RATT结构在事实正确性和逻辑连贯性方面显著优于现有方法。

Title: Aligning Large Language Models via Fine-grained Supervision

Authors: Dehong Xu, Liang Qiu, Minseok Kim, Faisal Ladhak, Jaeyoung Do
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.02756
Pdf URL: https://arxiv.org/pdf/2406.02756
Copy Paste: [[2406.02756]] Aligning Large Language Models via Fine-grained Supervision(https://arxiv.org/abs/2406.02756)
Keywords: language model, llm
Abstract: Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output affecting user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used for training our fine-grained Proximal Policy Optimization (PPO) model. Our experiment results demonstrate that this approach can achieve an absolute improvement of up to 5.1% in LLM performance, in terms of win rate against the reference model, compared with the traditional PPO model.
摘要：预训练的大规模语言模型 (LLM) 擅长生成连贯的文章，但它们的输出可能不真实、有害或不符合用户期望。当前的方法侧重于使用强化学习和人类反馈 (RLHF) 来改进模型对齐，其工作原理是将 LLM 输出的粗略人类偏好转化为指导模型学习过程的反馈信号。但是，由于这种方法基于序列级反馈，因此它缺乏识别影响用户偏好的输出的确切部分的精度。为了解决这一差距，我们提出了一种通过细粒度 token 级监督来增强 LLM 对齐的方法。具体来说，我们要求注释者在标准奖励建模数据集中对不太受欢迎的响应进行最低限度的编辑，以使其更受欢迎，确保仅在必要时进行更改，同时保留大部分原始内容。精炼后的数据集用于训练 token 级奖励模型，然后将其用于训练我们的细粒度近端策略优化 (PPO) 模型。我们的实验结果表明，与传统的 PPO 模型相比，该方法在 LLM 性能方面（相对于参考模型的胜率）可实现高达 $5.1\%$ 的绝对提升。

Title: Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities

Authors: Wenyue Hua, Kaijie Zhu, Lingyao Li, Lizhou Fan, Shuhang Lin, Mingyu Jin, Haochen Xue, Zelong Li, JinDong Wang, Yongfeng Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.02787
Pdf URL: https://arxiv.org/pdf/2406.02787
Copy Paste: [[2406.02787]] Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities(https://arxiv.org/abs/2406.02787)
Keywords: language model, llm
Abstract: This study intends to systematically disentangle pure logic reasoning and text understanding by investigating the contrast across abstract and contextualized logical problems from a comprehensive set of domains. We explore whether LLMs demonstrate genuine reasoning capabilities across various domains when the underlying logical structure remains constant. We focus on two main questions: (1) Can abstract logical problems alone accurately benchmark an LLM's reasoning ability in real-world scenarios, disentangled from contextual support in practical settings? (2) Does fine-tuning LLMs on abstract logic problems generalize to contextualized logic problems and vice versa? To investigate these questions, we focus on standard propositional logic, specifically propositional deductive and abductive logic reasoning. In particular, we construct instantiated datasets for deductive and abductive reasoning with 4 levels of difficulty, encompassing 12 distinct categories or domains based on the categorization of Wikipedia. Our experiments aim to provide insights into disentangling context in logical reasoning and the true reasoning capabilities of LLMs and their generalization potential. The code and dataset are available at: this https URL.
摘要：本研究旨在通过研究一系列领域中抽象和情境化逻辑问题的对比，系统地解开纯逻辑推理和文本理解。我们探索当底层逻辑结构保持不变时，LLM 是否在各个领域表现出真正的推理能力。我们关注两个主要问题 (1) 抽象逻辑问题是否可以准确地在现实世界场景中对 LLM 的推理能力进行基准测试，脱离实际环境中的上下文支持？(2) 对抽象逻辑问题进行微调的 LLM 是否可以推广到情境化逻辑问题，反之亦然？为了研究这些问题，我们专注于标准命题逻辑，特别是命题演绎和溯因逻辑推理。具体来说，我们构建了具有 4 个难度级别的演绎和溯因推理的实例数据集，涵盖了基于维基百科分类的 12 个不同类别或领域。我们的实验旨在深入了解逻辑推理中的上下文解开以及 LLM 的真正推理能力及其泛化潜力。代码和数据集可在以下位置获得：此 https URL。

Title: Chain of Agents: Large Language Models Collaborating on Long-Context Tasks

Authors: Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, Sercan Ö. Arik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.02818
Pdf URL: https://arxiv.org/pdf/2406.02818
Copy Paste: [[2406.02818]] Chain of Agents: Large Language Models Collaborating on Long-Context Tasks(https://arxiv.org/abs/2406.02818)
Keywords: language model, llm, long context, retrieval-augmented generation, agent
Abstract: Addressing the challenge of effectively processing long contexts has become a critical issue for Large Language Models (LLMs). Two common strategies have emerged: 1) reducing the input length, such as retrieving relevant chunks by Retrieval-Augmented Generation (RAG), and 2) expanding the context window limit of LLMs. However, both strategies have drawbacks: input reduction has no guarantee of covering the part with needed information, while window extension struggles with focusing on the pertinent information for solving the task. To mitigate these limitations, we propose Chain-of-Agents (CoA), a novel framework that harnesses multi-agent collaboration through natural language to enable information aggregation and context reasoning across various LLMs over long-context tasks. CoA consists of multiple worker agents who sequentially communicate to handle different segmented portions of the text, followed by a manager agent who synthesizes these contributions into a coherent final output. CoA processes the entire input by interleaving reading and reasoning, and it mitigates long context focus issues by assigning each agent a short context. We perform comprehensive evaluation of CoA on a wide range of long-context tasks in question answering, summarization, and code completion, demonstrating significant improvements by up to 10% over strong baselines of RAG, Full-Context, and multi-agent LLMs.
摘要：解决有效处理长上下文的挑战已成为大型语言模型 (LLM) 的关键问题。出现了两种常见策略：1) 减少输入长度，例如通过检索增强生成 (RAG) 检索相关块，以及 2) 扩展 LLM 的上下文窗口限制。然而，这两种策略都有缺点：输入减少不能保证覆盖所需信息的部分，而窗口扩展则难以专注于解决任务的相关信息。为了缓解这些限制，我们提出了 Chain-of-Agents (CoA)，这是一种新颖的框架，它通过自然语言利用多代理协作，实现跨各种 LLM 的长上下文任务的信息聚合和上下文推理。CoA 由多个工作代理组成，它们依次通信以处理文本的不同分段部分，然后是一个管理代理，将这些贡献合成一个连贯的最终输出。CoA 通过交错阅读和推理来处理整个输入，并通过为每个代理分配一个短上下文来缓解长上下文焦点问题。我们对问答、总结和代码完成等各种长上下文任务中的 CoA 进行了全面的评估，结果显示，与 RAG、全上下文和多智能体 LLM 的强基线相比，CoA 的性能显著提高了 10%。
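
The worker-then-manager flow described in the Chain-of-Agents abstract above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names, the word-based chunking, and the toy keyword-matching agents are all assumptions made here, and in a real system each agent would be an LLM call.

```python
from typing import Callable, List

# Hypothetical agent signatures (assumptions for this sketch):
#   worker(chunk, previous_message, query) -> new message passed down the chain
#   manager(final_message, query) -> final answer
Worker = Callable[[str, str, str], str]
Manager = Callable[[str, str], List[str]]


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


def chain_of_agents(text: str, query: str, worker: Worker,
                    manager: Manager, chunk_words: int = 200) -> List[str]:
    """Workers read one chunk each, in order, and pass a message along;
    the manager turns the last message into the final output."""
    message = ""
    for chunk in split_into_chunks(text, chunk_words):
        message = worker(chunk, message, query)
    return manager(message, query)


# Toy stand-ins for LLM agents: collect every word starting with the query.
def toy_worker(chunk: str, previous: str, query: str) -> str:
    hits = [w for w in chunk.split() if w.startswith(query)]
    return (previous + " " + " ".join(hits)).strip()


def toy_manager(message: str, query: str) -> List[str]:
    return message.split()


if __name__ == "__main__":
    doc = "alpha beta key1 gamma key2 delta key3"
    print(chain_of_agents(doc, "key", toy_worker, toy_manager, chunk_words=3))
```

Each worker sees only a short chunk plus the accumulated message, which mirrors the abstract's point that assigning each agent a short context avoids long-context focus issues.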

Title: Exploring Robustness in Doctor-Patient Conversation Summarization: An Analysis of Out-of-Domain SOAP Notes

Authors: Yu-Wen Chen, Julia Hirschberg
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.02826
Pdf URL: https://arxiv.org/pdf/2406.02826
Copy Paste: [[2406.02826]] Exploring Robustness in Doctor-Patient Conversation Summarization: An Analysis of Out-of-Domain SOAP Notes(https://arxiv.org/abs/2406.02826)
Keywords: language model, gpt, hallucination
Abstract: Summarizing medical conversations poses unique challenges due to the specialized domain and the difficulty of collecting in-domain training data. In this study, we investigate the performance of state-of-the-art doctor-patient conversation generative summarization models on the out-of-domain data. We divide the summarization model of doctor-patient conversation into two configurations: (1) a general model, without specifying subjective (S), objective (O), and assessment (A) and plan (P) notes; (2) a SOAP-oriented model that generates a summary with SOAP sections. We analyzed the limitations and strengths of the fine-tuning language model-based methods and GPTs on both configurations. We also conducted a Linguistic Inquiry and Word Count analysis to compare the SOAP notes from different datasets. The results exhibit a strong correlation for reference notes across different datasets, indicating that format mismatch (i.e., discrepancies in word distribution) is not the main cause of performance decline on out-of-domain data. Lastly, a detailed analysis of SOAP notes is included to provide insights into missing information and hallucinations introduced by the models.
摘要：由于医疗对话属于专业领域，且收集领域内训练数据的难度较大，因此对医疗对话进行总结具有独特的挑战。在本研究中，我们研究了最先进的医患对话生成总结模型在领域外数据上的表现。我们将医患对话总结模型分为两种配置：（1）通用模型，不指定主观（S）、客观（O）、评估（A）和计划（P）注释；（2）面向 SOAP 的模型，生成带有 SOAP 部分的总结。我们分析了基于微调语言模型的方法和 GPT 在这两种配置上的局限性和优势。我们还进行了语言查询和字数统计分析，以比较来自不同数据集的 SOAP 注释。结果显示，不同数据集之间的参考注释具有很强的相关性，表明格式不匹配（即单词分布的差异）不是导致领域外数据性能下降的主要原因。最后，还包括对 SOAP 注释的详细分析，以提供对模型引入的缺失信息和幻觉的见解。

Title: Too Big to Fail: Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies

Authors: Changye Li, Zhecheng Sheng, Trevor Cohen, Serguei Pakhomov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.02830
Pdf URL: https://arxiv.org/pdf/2406.02830
Copy Paste: [[2406.02830]] Too Big to Fail: Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies(https://arxiv.org/abs/2406.02830)
Keywords: language model, gpt
Abstract: As artificial neural networks grow in complexity, understanding their inner workings becomes increasingly challenging, which is particularly important in healthcare applications. An intrinsic evaluation metric of autoregressive neural language models (NLMs), perplexity (PPL), reflects how "surprised" an NLM is by novel input. PPL has been widely used to understand the behavior of NLMs. Previous findings show that changes in PPL when masking attention layers in pre-trained transformer-based NLMs reflect linguistic anomalies associated with Alzheimer's disease dementia. Building upon this, we explore a novel bidirectional attention head ablation method that exhibits properties attributed to the concepts of cognitive and brain reserve in human brain studies, which postulate that people with more neurons in the brain and more efficient processing are more resilient to neurodegeneration. Our results show that larger GPT-2 models require a disproportionately larger share of attention heads to be masked/ablated to display degradation of similar magnitude to masking in smaller models. These results suggest that the attention mechanism in transformer models may present an analogue to the notions of cognitive and brain reserve and could potentially be used to model certain aspects of the progression of neurodegenerative disorders and aging.
摘要：随着人工神经网络变得越来越复杂，了解其内部工作原理变得越来越具有挑战性，这在医疗保健应用中尤为重要。自回归神经语言模型 (NLM) 的内在评估指标困惑度 (PPL) 可以反映 NLM 模型对新输入的“惊讶”程度。PPL 已被广泛用于了解 NLM 的行为。先前的研究结果表明，在预训练的基于 Transformer 的 NLM 中屏蔽注意力层时 PPL 的变化反映了与阿尔茨海默病痴呆相关的语言异常。在此基础上，我们探索了一种新颖的双向注意力头消融方法，该方法表现出归因于人类大脑研究中认知和大脑储备概念的特性，这些概念假设大脑中神经元较多且处理效率更高的人对神经退化更有抵抗力。我们的结果表明，较大的 GPT-2 模型需要屏蔽/消融更大比例的注意力头，才能显示出与较小模型中的屏蔽类似程度的退化。这些结果表明，Transformer 模型中的注意力机制可能与认知和大脑储备的概念类似，并可能用于模拟神经退行性疾病和衰老进展的某些方面。
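
For readers unfamiliar with the perplexity (PPL) metric used in the abstract above, here is a minimal Python sketch of the standard definition: the exponential of the average negative log-likelihood per token. The function name and the input format (a list of natural-log token probabilities) are assumptions for this example and are not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)); higher means the model is
    more "surprised" by the text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has PPL of about 4.
    print(perplexity([math.log(0.25)] * 4))
```

An ablation study of the kind described would compare this quantity on the same text before and after masking attention heads.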

Title: Xmodel-LM Technical Report

Authors: Yichuan Wang, Yang Liu, Yu Yan, Xucheng Huang, Ling Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Xmodel-LM Technical Report(https://arxiv.org/abs/)
Keywords: language model
Abstract: We introduce Xmodel-LM, a compact and efficient 1.1B language model pre-trained on over 2 trillion tokens. Trained on our self-built dataset (Xdata), which balances Chinese and English corpora based on downstream task optimization, Xmodel-LM exhibits remarkable performance despite its smaller size. It notably surpasses existing open-source language models of similar scale. Our model checkpoints and code are publicly accessible on GitHub at this https URL.
摘要：我们推出了 Xmodel-LM，这是一个紧凑而高效的 1.1B 语言模型，已在超过 2 万亿个 token 上进行了预训练。Xmodel-LM 在我们自建的数据集 (Xdata) 上进行训练，该数据集基于下游任务优化平衡了中文和英文语料库，尽管规模较小，但性能却非常出色。它明显超越了现有类似规模的开源语言模型。我们的模型检查点和代码可在 GitHub 上通过此 https URL 公开访问。

Title: LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation

Authors: Yi-Pei Chen, KuanChao Chu, Hideki Nakayama
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.02863
Pdf URL: https://arxiv.org/pdf/2406.02863
Copy Paste: [[2406.02863]] LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation(https://arxiv.org/abs/2406.02863)
Keywords: language model, llm, prompt
Abstract: This research investigates the effect of prompt design on dialogue evaluation using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for dialogue evaluation remains challenging due to model sensitivity and subjectivity in dialogue assessments. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a "reason-first" approach yielding more comprehensive evaluations. This insight is crucial for enhancing the accuracy and consistency of LLM-based evaluations.
摘要：本研究使用大型语言模型 (LLM) 调查提示设计对对话评估的影响。虽然 LLM 越来越多地用于对各种输入进行评分，但由于对话评估中的模型敏感性和主观性，创建有效的对话评估提示仍然具有挑战性。我们的研究尝试了不同的提示结构，改变了输出指令的顺序并加入了解释性原因。我们发现，呈现原因和分数的顺序会显著影响 LLM 的评分，而“原因优先”的方法会产生更全面的评估。这一见解对于提高基于 LLM 的评估的准确性和一致性至关重要。

Title: NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models

Authors: Ancheng Xu, Minghuan Tan, Lei Wang, Min Yang, Ruifeng Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.02864
Pdf URL: https://arxiv.org/pdf/2406.02864
Copy Paste: [[2406.02864]] NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models(https://arxiv.org/abs/2406.02864)
Keywords: language model, llm, chain-of-thought
Abstract: Numeral systems and units of measurement are two conjoined topics in activities of human beings and have mutual effects with the languages expressing them. Currently, the evaluation of Large Language Models (LLMs) often involves mathematical reasoning, yet little attention is given to how minor changes in numbers or units can drastically alter the complexity of problems and the performance of LLMs. In this paper, we scrutinize existing LLMs on processing of numerals and units of measurement by constructing datasets with perturbations. We first anatomize the reasoning of math word problems to different sub-procedures like numeral conversions from language to numbers and measurement conversions based on units. Then we further annotate math word problems from ancient Chinese arithmetic works which are challenging in numerals and units of measurement. Experiments on perturbed datasets demonstrate that LLMs still encounter difficulties in handling numeral and measurement conversions.
摘要：数系与计量单位是人类活动中紧密相连的两个主题，与表达它们的语言相互影响。目前，对大型语言模型（LLM）的评估往往涉及数学推理，但很少关注数字或单位的微小变化如何显著改变问题的复杂性和LLM的性能。在本文中，我们通过构建带有扰动的数据集来审视现有的LLM对数系和计量单位的处理。我们首先将数学应用题的推理分解为从语言到数字的数字转换和基于单位的计量转换等不同的子过程。然后我们进一步对中国古代算术著作中在数系和计量单位方面具有挑战性的数学应用题进行注释。在扰动数据集上的实验表明LLM在处理数系和计量单位转换方面仍然存在困难。

Title: PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs

Authors: Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Haorui Wang, Zhen Qin, Feng Han, Jialu Liu, Simon Baumgartner, Michael Bendersky, Chao Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.02886
Pdf URL: https://arxiv.org/pdf/2406.02886
Copy Paste: [[2406.02886]] PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs(https://arxiv.org/abs/2406.02886)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings. Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models. However, traditional KD techniques face specific challenges when applied to LLMs, including restricted access to LLM outputs, significant teacher-student capacity gaps, and the inherited mis-calibration issue. In this work, we present PLaD, a novel preference-based LLM distillation framework. PLaD exploits the teacher-student capacity discrepancy to generate pseudo-preference pairs where teacher outputs are preferred over student outputs. Then, PLaD leverages a ranking loss to re-calibrate student's estimation of sequence likelihood, which steers the student's focus towards understanding the relative quality of outputs instead of simply imitating the teacher. PLaD bypasses the need for access to teacher LLM's internal states, tackles the student's expressivity limitations, and mitigates the student mis-calibration issue. Through extensive experiments on two sequence generation tasks and with various LLMs, we demonstrate the effectiveness of our proposed PLaD framework.
摘要：大型语言模型 (LLM) 在各种任务中表现出令人印象深刻的能力，但它们巨大的参数大小限制了它们在资源受限环境中的适用性。知识蒸馏 (KD) 通过将专业知识从大型教师模型转移到紧凑的学生模型，提供了一种可行的解决方案。然而，传统的 KD 技术在应用于 LLM 时面临着特定的挑战，包括对 LLM 输出的访问受限、师生能力差距大以及继承的错误校准问题。在这项工作中，我们提出了一种基于偏好的新型 LLM 蒸馏框架 PLaD。PLaD 利用师生能力差异来生成伪偏好对，其中教师输出优于学生输出。然后，PLaD 利用排名损失来重新校准学生对序列似然的估计，从而引导学生的注意力转向理解输出的相对质量，而不是简单地模仿老师。PLaD 绕过了对教师 LLM 内部状态的访问需求，解决了学生的表达能力限制，并缓解了学生的错误校准问题。通过对两个序列生成任务和各种 LLM 进行大量实验，我们证明了我们提出的 PLaD 框架的有效性。
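
The idea of a ranking loss over pseudo-preference pairs, as mentioned in the PLaD abstract above, can be illustrated with a generic pairwise hinge loss. The abstract does not give the exact loss, so this is a stand-in and not the paper's formulation: the function name, the margin value, and the use of raw sequence log-likelihoods as scores are all assumptions.

```python
def pairwise_ranking_loss(score_preferred: float,
                          score_rejected: float,
                          margin: float = 1.0) -> float:
    """Generic hinge ranking loss (an assumed stand-in, not PLaD's exact loss).

    The scores are the student's sequence log-likelihoods for the
    preferred (teacher) output and the rejected (student) output.
    The loss is zero once the preferred output outscores the rejected
    one by at least `margin`.
    """
    return max(0.0, margin - (score_preferred - score_rejected))


if __name__ == "__main__":
    # Student already ranks the teacher output higher: no penalty.
    print(pairwise_ranking_loss(-2.0, -5.0))
    # Student ranks its own output higher: positive penalty.
    print(pairwise_ranking_loss(-5.0, -2.0))
```

A loss of this shape depends only on the relative ordering of the two scores, which matches the abstract's emphasis on relative output quality rather than imitation of the teacher.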

Title: HYDRA: Model Factorization Framework for Black-Box LLM Personalization

Authors: Yuchen Zhuang, Haotian Sun, Yue Yu, Qifan Wang, Chao Zhang, Bo Dai
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.02888
Pdf URL: https://arxiv.org/pdf/2406.02888
Copy Paste: [[2406.02888]] HYDRA: Model Factorization Framework for Black-Box LLM Personalization(https://arxiv.org/abs/2406.02888)
Keywords: language model, llm, prompt
Abstract: Personalization has emerged as a critical research area in modern intelligent systems, focusing on mining users' behavioral history and adapting to their preferences for delivering tailored experiences. Despite the remarkable few-shot capabilities exhibited by black-box large language models (LLMs), the inherent opacity of their model parameters presents significant challenges in aligning the generated output with individual expectations. Existing solutions have primarily focused on prompt design to incorporate user-specific profiles and behaviors; however, such approaches often struggle to generalize effectively due to their inability to capture shared knowledge among all users. To address these challenges, we propose HYDRA, a model factorization framework that captures both user-specific behavior patterns from historical data and shared general knowledge among all users to deliver personalized generation. In order to capture user-specific behavior patterns, we first train a reranker to prioritize the most useful information from top-retrieved relevant historical records. By combining the prioritized history with the corresponding query, we train an adapter to align the output with individual user-specific preferences, eliminating the reliance on access to inherent model parameters of black-box LLMs. Both the reranker and the adapter can be decomposed into a base model with multiple user-specific heads, resembling a hydra. The base model maintains shared knowledge across users, while the multiple personal heads capture user-specific preferences. Experimental results demonstrate that HYDRA outperforms existing state-of-the-art prompt-based methods by an average relative improvement of 9.01% across five diverse personalization tasks in the LaMP benchmark. Our implementation is available at this https URL.
摘要：个性化已成为现代智能系统的一个重要研究领域，其重点是挖掘用户的行为历史并根据他们的偏好提供量身定制的体验。尽管黑盒大型语言模型 (LLM) 表现出了卓越的少样本能力，但其模型参数固有的不透明性在将生成的输出与个人期望保持一致方面带来了重大挑战。现有的解决方案主要侧重于提示设计，以结合用户特定的个人资料和行为；然而，由于无法捕获所有用户之间的共享知识，此类方法通常难以有效地概括。为了应对这些挑战，我们提出了 HYDRA，这是一个模型分解框架，它既可以从历史数据中捕获用户特定的行为模式，也可以捕获所有用户之间的共享一般知识，以实现个性化生成。为了捕获用户特定的行为模式，我们首先训练一个重新排序器，以从最热门的相关历史记录中优先排序最有用的信息。通过将优先历史记录与相应的查询相结合，我们训练了一个适配器，以使输出与单个用户特定的偏好保持一致，从而消除了对黑盒 LLM 固有模型参数的访问的依赖。重新排序器和适配器都可以分解为具有多个用户特定头的基本模型，类似于九头蛇。基本模型维护用户之间的共享知识，而多个个人头则捕获用户特定的偏好。实验结果表明，HYDRA 在 LaMP 基准中的五个不同个性化任务中，平均相对改进了 9.01%，优于现有的最先进的基于提示的方法。我们的实现可在此 https URL 上找到。

Title: Language Model Can Do Knowledge Tracing: Simple but Effective Method to Integrate Language Model and Knowledge Tracing Task

Authors: Unggi Lee, Jiyeong Bae, Dohee Kim, Sookbun Lee, Jaekwon Park, Taekyung Ahn, Gunho Lee, Damji Stratton, Hyeoncheol Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.02893
Pdf URL: https://arxiv.org/pdf/2406.02893
Copy Paste: [[2406.02893]] Language Model Can Do Knowledge Tracing: Simple but Effective Method to Integrate Language Model and Knowledge Tracing Task(https://arxiv.org/abs/2406.02893)
Keywords: language model
Abstract: Knowledge Tracing (KT) is a critical task in online learning for modeling student knowledge over time. Despite the success of deep learning-based KT models, which rely on sequences of numbers as data, most existing approaches fail to leverage the rich semantic information in the text of questions and concepts. This paper proposes Language model-based Knowledge Tracing (LKT), a novel framework that integrates pre-trained language models (PLMs) with KT methods. By leveraging the power of language models to capture semantic representations, LKT effectively incorporates textual information and significantly outperforms previous KT models on large benchmark datasets. Moreover, we demonstrate that LKT can effectively address the cold-start problem in KT by leveraging the semantic knowledge captured by PLMs. Interpretability of LKT is enhanced compared to traditional KT models due to its use of text-rich data. We applied the local interpretable model-agnostic explanation technique and analyzed attention scores to interpret the model performance further. Our work highlights the potential of integrating PLMs with KT and paves the way for future research in the KT domain.

Title: Open Grounded Planning: Challenges and Benchmark Construction

Authors: Shiguang Guo, Ziliang Deng, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.02903
Pdf URL: https://arxiv.org/pdf/2406.02903
Copy Paste: [[2406.02903]] Open Grounded Planning: Challenges and Benchmark Construction(https://arxiv.org/abs/2406.02903)
Keywords: language model, llm
Abstract: The emergence of large language models (LLMs) has increasingly drawn attention to the use of LLMs for human-like planning. Existing work on LLM-based planning either focuses on leveraging the inherent language generation capabilities of LLMs to produce free-style plans, or employs reinforcement learning approaches to learn decision-making for a limited set of actions within restricted environments. However, both approaches exhibit significant discrepancies from the open and executable requirements in real-world planning. In this paper, we propose a new planning task--open grounded planning. The primary objective of open grounded planning is to ask the model to generate an executable plan based on a variable action set, thereby ensuring the executability of the produced plan. To this end, we establish a benchmark for open grounded planning spanning a wide range of domains. Then we test current state-of-the-art LLMs along with five planning approaches, revealing that existing LLMs and methods still struggle to address the challenges posed by grounded planning in open domains. The outcomes of this paper define and establish a foundational dataset for open grounded planning, and shed light on the potential challenges and future directions of LLM-based planning.

Title: Improving In-Context Learning with Prediction Feedback for Sentiment Analysis

Authors: Hongling Xu, Qianlong Wang, Yice Zhang, Min Yang, Xi Zeng, Bing Qin, Ruifeng Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.02911
Pdf URL: https://arxiv.org/pdf/2406.02911
Copy Paste: [[2406.02911]] Improving In-Context Learning with Prediction Feedback for Sentiment Analysis(https://arxiv.org/abs/2406.02911)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved promising results in sentiment analysis through the in-context learning (ICL) paradigm. However, their ability to distinguish subtle sentiments still remains a challenge. Inspired by the human ability to adjust understanding via feedback, this paper enhances ICL by incorporating prior predictions and feedback, aiming to rectify sentiment misinterpretation of LLMs. Specifically, the proposed framework consists of three steps: (1) acquiring prior predictions of LLMs, (2) devising predictive feedback based on correctness, and (3) leveraging a feedback-driven prompt to refine sentiment understanding. Experimental results across nine sentiment analysis datasets demonstrate the superiority of our framework over conventional ICL methods, with an average F1 improvement of 5.95%.
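
The three steps named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Demo` class, the `feedback` and `build_prompt` helpers, and the prompt wording are all hypothetical, and step (1) is assumed to have already happened, so each demonstration simply carries the label an LLM predicted for it earlier.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    text: str   # demonstration input
    gold: str   # gold sentiment label
    prior: str  # label an LLM predicted earlier (step 1, assumed already collected)


def feedback(demo: Demo) -> str:
    """Step 2: turn the correctness of the prior prediction into a feedback line."""
    if demo.prior == demo.gold:
        return "Feedback: the prediction was correct."
    return f"Feedback: the prediction was wrong; the correct label is {demo.gold}."


def build_prompt(demos: list[Demo], query: str) -> str:
    """Step 3: assemble a prompt in which every demonstration shows its
    text, its prior prediction, and the feedback, followed by the query."""
    parts = [
        f"Review: {d.text}\nPrediction: {d.prior}\n{feedback(d)}"
        for d in demos
    ]
    parts.append(f"Review: {query}\nPrediction:")
    return "\n\n".join(parts)


demos = [
    Demo("Great battery life.", gold="positive", prior="positive"),
    Demo("Not bad, but not worth the price.", gold="negative", prior="positive"),
]
prompt = build_prompt(demos, "The screen is fine, I guess.")
```

The resulting `prompt` string would then be sent to whichever LLM is under evaluation; that call is left out because the abstract does not specify a model or API.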

Title: MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge

Authors: Yuxuan Zhou, Xien Liu, Chen Ning, Ji Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.02919
Pdf URL: https://arxiv.org/pdf/2406.02919
Copy Paste: [[2406.02919]] MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge(https://arxiv.org/abs/2406.02919)
Keywords: language model, llm
Abstract: Large language models (LLMs) have excelled across domains, also delivering notable performance on the medical evaluation benchmarks, such as MedQA. However, there still exists a significant gap between the reported performance and the practical effectiveness in real-world medical scenarios. In this paper, we aim to explore the causes of this gap by employing a multifaceted examination schema to systematically probe the actual mastery of medical knowledge by current LLMs. Specifically, we develop a novel evaluation framework MultifacetEval to examine the degree and coverage of LLMs in encoding and mastering medical knowledge at multiple facets (comparison, rectification, discrimination, and verification) concurrently. Based on the MultifacetEval framework, we construct two multifaceted evaluation datasets: MultiDiseK (by producing questions from a clinical disease knowledge base) and MultiMedQA (by rephrasing each question from a medical benchmark MedQA into multifaceted questions). The experimental results on these multifaceted datasets demonstrate that the extent of current LLMs in mastering medical knowledge is far below their performance on existing medical benchmarks, suggesting that they lack depth, precision, and comprehensiveness in mastering medical knowledge. Consequently, current LLMs are not yet ready for application in real-world medical tasks. The codes and datasets are available at this https URL.

Title: Adversarial Moment-Matching Distillation of Large Language Models

Authors: Chen Jia
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.02959
Pdf URL: https://arxiv.org/pdf/2406.02959
Copy Paste: [[2406.02959]] Adversarial Moment-Matching Distillation of Large Language Models(https://arxiv.org/abs/2406.02959)
Keywords: language model, llm
Abstract: Knowledge distillation (KD) has been shown to be highly effective in guiding a student model with a larger teacher model and achieving practical benefits in improving the computational and memory efficiency for large language models (LLMs). State-of-the-art KD methods for LLMs mostly rely on minimizing explicit distribution distance between teacher and student probability predictions. Instead of optimizing these mandatory behaviour cloning objectives, we explore an imitation learning strategy for KD of LLMs. In particular, we minimize the imitation gap by matching the action-value moments of the teacher's behavior from both on- and off-policy perspectives. To achieve this action-value moment-matching goal, we propose an adversarial training algorithm to jointly estimate the moment-matching distance and optimize the student policy to minimize it. Results from both task-agnostic instruction-following experiments and task-specific experiments demonstrate the effectiveness of our method and achieve new state-of-the-art performance.

Title: Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models

Authors: Qiang Sun, Yuanyi Luo, Wenxiao Zhang, Sirui Li, Jichunyang Li, Kai Niu, Xiangrui Kong, Wei Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2406.02962
Pdf URL: https://arxiv.org/pdf/2406.02962
Copy Paste: [[2406.02962]] Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models(https://arxiv.org/abs/2406.02962)
Keywords: language model
Abstract: Even by a conservative estimate, 80% of enterprise data reside in unstructured files, stored in data lakes that accommodate heterogeneous formats. Classical search engines can no longer meet information seeking needs, especially when the task is to browse and explore for insight formulation. In other words, there are no obvious search keywords to use. Knowledge graphs, due to their natural visual appeal, which reduces human cognitive load, become the winning candidate for heterogeneous data integration and knowledge representation. In this paper, we introduce Docs2KG, a novel framework designed to extract multimodal information from diverse and heterogeneous unstructured documents, including emails, web pages, PDF files, and Excel files. By dynamically generating a unified knowledge graph that represents the extracted key information, Docs2KG enables efficient querying and exploration of document data lakes. Unlike existing approaches that focus on domain-specific data sources or pre-designed schemas, Docs2KG offers a flexible and extensible solution that can adapt to various document structures and content types. The proposed framework unifies data processing supporting a multitude of downstream tasks with improved domain interpretability. Docs2KG is publicly accessible at this https URL, and a demonstration video is available at this https URL.

Title: Evaluation of data inconsistency for multi-modal sentiment analysis

Authors: Yufei Wang, Mengyue Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03004
Pdf URL: https://arxiv.org/pdf/2406.03004
Copy Paste: [[2406.03004]] Evaluation of data inconsistency for multi-modal sentiment analysis(https://arxiv.org/abs/2406.03004)
Keywords: language model, llm, agent
Abstract: Emotion semantic inconsistency is a ubiquitous challenge in multi-modal sentiment analysis (MSA). MSA involves analyzing sentiment expressed across various modalities like text, audio, and videos. Each modality may convey distinct aspects of sentiment, due to subtle and nuanced expression of human beings, leading to inconsistency, which may hinder the prediction of artificial agents. In this work, we introduce a modality conflicting test set and assess the performance of both traditional multi-modal sentiment analysis models and multi-modal large language models (MLLMs). Our findings reveal significant performance degradation across traditional models when confronted with semantically conflicting data and point out the drawbacks of MLLMs when handling multi-modal emotion analysis. Our research presents a new challenge and offers valuable insights for the future development of sentiment analysis systems.

Title: BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

Authors: Yifei Wang, Dizhan Xue, Shengjie Zhang, Shengsheng Qian
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2406.03007
Pdf URL: https://arxiv.org/pdf/2406.03007
Copy Paste: [[2406.03007]] BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents(https://arxiv.org/abs/2406.03007)
Keywords: language model, llm, agent
Abstract: With the prosperity of large language models (LLMs), powerful LLM-based intelligent agents have been developed to provide customized services with a set of user-defined tools. State-of-the-art methods for constructing LLM agents adopt trained LLMs and further fine-tune them on data for the agent task. However, we show that such methods are vulnerable to our proposed backdoor attacks named BadAgent on various agent tasks, where a backdoor can be embedded by fine-tuning on the backdoor data. At test time, the attacker can manipulate the deployed LLM agents to execute harmful operations by showing the trigger in the agent input or environment. To our surprise, our proposed attack methods are extremely robust even after fine-tuning on trustworthy data. Though backdoor attacks have been studied extensively in natural language processing, to the best of our knowledge, we could be the first to study them on LLM agents that are more dangerous due to the permission to use external tools. Our work demonstrates the clear risk of constructing LLM agents based on untrusted LLMs or data. Our code is public at this https URL

Title: Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models

Authors: Sheng-Lun Wei, Cheng-Kuang Wu, Hen-Hsen Huang, Hsin-Hsi Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03009
Pdf URL: https://arxiv.org/pdf/2406.03009
Copy Paste: [[2406.03009]] Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models(https://arxiv.org/abs/2406.03009)
Keywords: language model, llm
Abstract: In this paper, we investigate the phenomena of "selection biases" in Large Language Models (LLMs), focusing on problems where models are tasked with choosing the optimal option from an ordered sequence. We delve into biases related to option order and token usage, which significantly impact LLMs' decision-making processes. We also quantify the impact of these biases through an extensive empirical analysis across multiple models and tasks. Furthermore, we propose mitigation strategies to enhance model performance. Our key contributions are threefold: 1) Precisely quantifying the influence of option order and token on LLMs, 2) Developing strategies to mitigate the impact of token and order sensitivity to enhance robustness, and 3) Offering a detailed analysis of sensitivity across models and tasks, which informs the creation of more stable and reliable LLM applications for selection problems.
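
One simple way to quantify the order sensitivity described in this abstract is to present the same options in every possible order and count how often the chosen option stays the same. The sketch below is an illustration under assumptions, not the paper's metric: the `Chooser` interface, `order_consistency`, and the toy `first_option_chooser` are hypothetical names, and a real experiment would wrap an actual LLM call behind the `Chooser` signature. Options are assumed to be distinct strings.

```python
from itertools import permutations
from typing import Callable, Sequence

# Hypothetical interface: given a question and ordered options,
# return the index of the option the model selects.
Chooser = Callable[[str, Sequence[str]], int]


def order_consistency(choose: Chooser, question: str, options: Sequence[str]) -> float:
    """Fraction of option orderings in which the chosen option (by content)
    matches the option chosen for the original ordering."""
    baseline = options[choose(question, options)]
    orders = list(permutations(options))
    agree = sum(1 for order in orders if order[choose(question, order)] == baseline)
    return agree / len(orders)


def first_option_chooser(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a model with maximal position bias: always picks slot 0."""
    return 0


def content_chooser(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for an order-insensitive model: always picks option 'b'."""
    return list(options).index("b")


biased_score = order_consistency(first_option_chooser, "Which one?", ["a", "b", "c"])
stable_score = order_consistency(content_chooser, "Which one?", ["a", "b", "c"])
```

A score of 1.0 means the choice never changes with option order; lower values indicate stronger order sensitivity. Enumerating all permutations grows factorially, so sampling a subset of orderings would be needed for more than a handful of options.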

Title: From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation

Authors: Ali Malik, Stephen Mayhew, Chris Piech, Klinton Bicknell
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.03030
Pdf URL: https://arxiv.org/pdf/2406.03030
Copy Paste: [[2406.03030]] From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation(https://arxiv.org/abs/2406.03030)
Keywords: language model, gpt, llm, prompt
Abstract: We study the problem of controlling the difficulty level of text generated by Large Language Models (LLMs) for contexts where end-users are not fully proficient, such as language learners. Using a novel framework, we evaluate the effectiveness of several key approaches for this task, including few-shot prompting, supervised finetuning, and reinforcement learning (RL), utilising both GPT-4 and open source alternatives like LLama2-7B and Mistral-7B. Our findings reveal a large performance gap between GPT-4 and the open source models when using prompt-based strategies. However, we show how to bridge this gap with a careful combination of finetuning and RL alignment. Our best model, CALM (CEFR-Aligned Language Model), surpasses the performance of GPT-4 and other strategies, at only a fraction of the cost. We further validate the quality of our results through a small-scale human study.

Title: RadBARTsum: Domain Specific Adaption of Denoising Sequence-to-Sequence Models for Abstractive Radiology Report Summarization

Authors: Jinge Wu, Abul Hasan, Honghan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03062
Pdf URL: https://arxiv.org/pdf/2406.03062
Copy Paste: [[2406.03062]] RadBARTsum: Domain Specific Adaption of Denoising Sequence-to-Sequence Models for Abstractive Radiology Report Summarization(https://arxiv.org/abs/2406.03062)
Keywords: language model
Abstract: Radiology report summarization is a crucial task that can help doctors quickly identify clinically significant findings without the need to review detailed sections of reports. This study proposes RadBARTsum, a domain-specific and ontology-facilitated adaptation of the BART model for abstractive radiology report summarization. The approach involves two main steps: 1) re-training the BART model on a large corpus of radiology reports using a novel entity masking strategy to improve biomedical domain knowledge learning, and 2) fine-tuning the model for the summarization task using the Findings and Background sections to predict the Impression section. Experiments are conducted using different masking strategies. Results show that the re-training process with domain-knowledge-facilitated masking improves performance consistently across various settings. This work contributes a domain-specific generative language model for radiology report summarization and a method for utilising medical knowledge to realise an entity-masking language model. The proposed approach demonstrates a promising direction for enhancing the efficiency of language models by deepening their understanding of clinical knowledge in radiology reports.

Title: Towards Detecting LLMs Hallucination via Markov Chain-based Multi-agent Debate Framework

Authors: Xiaoxi Sun, Jinpeng Li, Yan Zhong, Dongyan Zhao, Rui Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03075
Pdf URL: https://arxiv.org/pdf/2406.03075
Copy Paste: [[2406.03075]] Towards Detecting LLMs Hallucination via Markov Chain-based Multi-agent Debate Framework(https://arxiv.org/abs/2406.03075)
Keywords: language model, llm, hallucination, agent
Abstract: The advent of large language models (LLMs) has facilitated the development of natural language text generation. It also poses unprecedented challenges, with content hallucination emerging as a significant concern. Existing solutions often involve expensive and complex interventions during the training process. Moreover, some approaches emphasize problem disassembly while neglecting the crucial validation process, leading to performance degradation or limited applications. To overcome these limitations, we propose a Markov Chain-based multi-agent debate verification framework to enhance hallucination detection accuracy in concise claims. Our method integrates the fact-checking process, including claim detection, evidence retrieval, and multi-agent verification. In the verification stage, we deploy multiple agents through flexible Markov Chain-based debates to validate individual claims, ensuring meticulous verification outcomes. Experimental results across three generative tasks demonstrate that our approach achieves significant improvements over baselines.

Title: Cryptocurrency Frauds for Dummies: How ChatGPT introduces us to fraud?

Authors: Wail Zellagui, Abdessamad Imine, Yamina Tadjeddine
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03079
Pdf URL: https://arxiv.org/pdf/2406.03079
Copy Paste: [[2406.03079]] Cryptocurrency Frauds for Dummies: How ChatGPT introduces us to fraud?(https://arxiv.org/abs/2406.03079)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Recent advances in the field of large language models (LLMs), particularly the ChatGPT family, have given rise to a powerful and versatile machine interlocutor, packed with knowledge and challenging our understanding of learning. This interlocutor is a double-edged sword: it can be harnessed for a wide variety of beneficial tasks, but it can also be used to cause harm. This study explores the complicated interaction between ChatGPT and the growing problem of cryptocurrency fraud. Although ChatGPT is known for its adaptability and ethical considerations when used for harmful purposes, we highlight the deep connection that may exist between ChatGPT and fraudulent actions in the volatile cryptocurrency ecosystem. Based on our categorization of cryptocurrency frauds, we show how to influence outputs, bypass ethical terms, and achieve specific fraud goals by manipulating ChatGPT prompts. Furthermore, our findings emphasize the importance of realizing that ChatGPT could be a valuable instructor even for novice fraudsters, as well as understanding and safely deploying complex language models, particularly in the context of cryptocurrency frauds. Finally, our study underlines the importance of using LLMs responsibly and ethically in the digital currency sector, identifying potential risks and resolving ethical issues. It should be noted that our work is not intended to encourage and promote fraud, but rather to raise awareness of the risks of fraud associated with the use of ChatGPT.

Title: FragRel: Exploiting Fragment-level Relations in the External Memory of Large Language Models

Authors: Xihang Yue, Linchao Zhu, Yi Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03092
Pdf URL: https://arxiv.org/pdf/2406.03092
Copy Paste: [[2406.03092]] FragRel: Exploiting Fragment-level Relations in the External Memory of Large Language Models(https://arxiv.org/abs/2406.03092)
Keywords: language model, llm, chat
Abstract: To process contexts of unlimited length with Large Language Models (LLMs), recent studies explore hierarchical management of long text. Only a few text fragments are taken from the external memory and passed into the temporary working memory, i.e., the LLM's context window. However, existing approaches handle the text fragments in isolation without considering their structural connections, and thus have limited capability on texts with intensive inter-relations, e.g., coherent stories and code repositories. This work attempts to resolve this by exploiting the fragment-level relations in external memory. First, we formulate the fragment-level relations and present several instantiations for different text types. Next, we introduce a relation-aware fragment assessment criterion built upon previous independent fragment assessment. Finally, we present the fragment-connected Hierarchical Memory based LLM. We validate the benefits of involving these relations on long story understanding, repository-level code generation, and long-term chatting.

Title: Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation

Authors: Hao Li, Yuping Wu, Viktor Schlegel, Riza Batista-Navarro, Tharindu Madusanka, Iqra Zahid, Jiayan Zeng, Xiaochi Wang, Xinran He, Yizhi Li, Goran Nenadic
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.03151
Pdf URL: https://arxiv.org/pdf/2406.03151
Copy Paste: [[2406.03151]] Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation(https://arxiv.org/abs/2406.03151)
Keywords: language model, llm
Abstract: With the recent advances of large language models (LLMs), it is no longer infeasible to build an automated debate system that helps people to synthesise persuasive arguments. Previous work attempted this task by integrating multiple components. In our work, we introduce an argument mining dataset that captures the end-to-end process of preparing an argumentative essay for a debate, which covers the tasks of claim and evidence identification (Task 1 ED), evidence convincingness ranking (Task 2 ECR), argumentative essay summarisation and human preference ranking (Task 3 ASR) and metric learning for automated evaluation of resulting essays, based on human feedback along argument quality dimensions (Task 4 SQE). Our dataset contains 14k examples of claims that are fully annotated with the various properties supporting the aforementioned tasks. We evaluate multiple generative baselines for each of these tasks, including representative LLMs. We find that, while they show promising results on individual tasks in our benchmark, their end-to-end performance on all four tasks in succession deteriorates significantly, both in automated measures and in human-centred evaluation. This challenge presented by our proposed dataset motivates future research on end-to-end argument mining and summarisation. The repository of this project is available at this https URL

Title: CSS: Contrastive Semantic Similarity for Uncertainty Quantification of LLMs

Authors: Shuang Ao, Stefan Rueger, Advaith Siddharthan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03158
Pdf URL: https://arxiv.org/pdf/2406.03158
Copy Paste: [[2406.03158]] CSS: Contrastive Semantic Similarity for Uncertainty Quantification of LLMs(https://arxiv.org/abs/2406.03158)
Keywords: language model, llm
Abstract: Despite the impressive capability of large language models (LLMs), knowing when to trust their generations remains an open challenge. The recent literature on uncertainty quantification of natural language generation (NLG) utilises a conventional natural language inference (NLI) classifier to measure the semantic dispersion of LLM responses. These studies employ the logits of the NLI classifier for semantic clustering to estimate uncertainty. However, logits represent the probability of the predicted class and barely contain feature information for potential clustering. Alternatively, CLIP (Contrastive Language-Image Pre-training) performs impressively in extracting image-text pair features and measuring their similarity. To extend its usability, we propose Contrastive Semantic Similarity, a CLIP-based feature extraction module to obtain similarity features for measuring uncertainty for text pairs. We apply this method to selective NLG, which detects and rejects unreliable generations for better trustworthiness of LLMs. We conduct extensive experiments with three LLMs on several benchmark question-answering datasets with comprehensive evaluation metrics. Results show that our proposed method performs better in estimating reliable responses of LLMs than comparable baselines. The code is available at this https URL.
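
The idea of selective generation, rejecting an answer when sampled responses disagree semantically, can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the CSS method: it assumes the caller already has one embedding vector per sampled response (the paper obtains features from CLIP; here they are plain lists of floats), and it replaces the paper's uncertainty estimator with mean pairwise cosine similarity. The function names and the 0.8 threshold are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_pairwise_similarity(embeddings: Sequence[Vector]) -> float:
    """Average cosine similarity over all pairs of response embeddings
    (requires at least two embeddings)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def accept(embeddings: Sequence[Vector], threshold: float = 0.8) -> bool:
    """Keep the generation only if sampled responses are similar enough;
    the threshold is an illustrative assumption, not a value from the paper."""
    return mean_pairwise_similarity(embeddings) >= threshold


# Toy embeddings: three identical responses vs. three conflicting ones.
agreeing = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
conflicting = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
keep_agreeing = accept(agreeing)
keep_conflicting = accept(conflicting)
```

Low similarity among sampled responses is treated here as a signal of high uncertainty, so the conflicting set is rejected while the agreeing set is kept.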

Title: StatBot.Swiss: Bilingual Open Data Exploration in Natural Language

Authors: Farhad Nooralahzadeh, Yi Zhang, Ellery Smith, Sabine Maennel, Cyril Matthey-Doret, Raphaël de Fondville, Kurt Stockinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03170
Pdf URL: https://arxiv.org/pdf/2406.03170
Copy Paste: [[2406.03170]] StatBot.Swiss: Bilingual Open Data Exploration in Natural Language(https://arxiv.org/abs/2406.03170)
Keywords: language model, gpt, llm
Abstract: The potential for improvements brought by Large Language Models (LLMs) in Text-to-SQL systems is mostly assessed on monolingual English datasets. However, LLMs' performance for other languages remains largely unexplored. In this work, we release the StatBot.Swiss dataset, the first bilingual benchmark for evaluating Text-to-SQL systems based on real-world applications. The StatBot.Swiss dataset contains 455 natural language/SQL-pairs over 35 big databases with varying levels of complexity for both English and German. We evaluate the performance of state-of-the-art LLMs such as GPT-3.5-Turbo and mixtral-8x7b-instruct for the Text-to-SQL translation task using an in-context learning approach. Our experimental analysis illustrates that current LLMs struggle to generalize well in generating SQL queries on our novel bilingual dataset.

Title: Missci: Reconstructing Fallacies in Misrepresented Science

Authors: Max Glockner, Yufang Hou, Preslav Nakov, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03181
Pdf URL: https://arxiv.org/pdf/2406.03181
Copy Paste: [[2406.03181]] Missci: Reconstructing Fallacies in Misrepresented Science(https://arxiv.org/abs/2406.03181)
Keywords: language model, gpt, llm, prompt
Abstract: Health-related misinformation on social networks can lead to poor decision-making and real-world dangers. Such misinformation often misrepresents scientific publications and cites them as "proof" to gain perceived credibility. To effectively counter such claims automatically, a system must explain how the claim was falsely derived from the cited publication. Current methods for automated fact-checking or fallacy detection neglect to assess the (mis)used evidence in relation to misinformation claims, which is required to detect the mismatch between them. To address this gap, we introduce Missci, a novel argumentation theoretical model for fallacious reasoning together with a new dataset for real-world misinformation detection that misrepresents biomedical publications. Unlike previous fallacy detection datasets, Missci (i) focuses on implicit fallacies between the relevant content of the cited publication and the inaccurate claim, and (ii) requires models to verbalize the fallacious reasoning in addition to classifying it. We present Missci as a dataset to test the critical reasoning abilities of large language models (LLMs), that are required to reconstruct real-world fallacious arguments, in a zero-shot setting. We evaluate two representative LLMs and the impact of different levels of detail about the fallacy classes provided to the LLM via prompts. Our experiments and human evaluation show promising results for GPT 4, while also demonstrating the difficulty of this task.
摘要：社交网络上与健康相关的错误信息可能导致决策失误和现实世界的危险。此类错误信息通常会歪曲科学出版物，并将其引用为“证据”以获得可信度。为了有效地自动反驳此类说法，系统必须解释该说法是如何从引用的出版物中错误地得出的。当前的自动事实核查或谬误检测方法忽略了评估与错误信息主张相关的（误用）证据，而这是检测它们之间不匹配所必需的。为了解决这一差距，我们引入了 Missci，这是一种用于谬误推理的新型论证理论模型，以及一个用于现实世界中歪曲生物医学出版物的错误信息检测的新数据集。与以前的谬误检测数据集不同，Missci (i) 关注引用出版物的相关内容与不准确主张之间的隐性谬误，并且 (ii) 除了对谬误推理进行分类之外，还要求模型将其语言化。我们提供 Missci 作为数据集，以测试大型语言模型 (LLM) 的批判性推理能力，这些能力需要在零样本设置中重建现实世界中的谬误论证。我们评估了两个代表性的 LLM，以及通过提示提供给 LLM 的谬误类别的不同细节级别的影响。我们的实验和人工评估表明 GPT 4 取得了令人鼓舞的结果，同时也证明了这项任务的难度。

Title: The Impossibility of Fair LLMs

Authors: Jacy Anthis, Kristian Lum, Michael Ekstrand, Avi Feller, Alexander D'Amour, Chenhao Tan
Subjects: cs.CL, cs.HC, cs.LG, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2406.03198
Pdf URL: https://arxiv.org/pdf/2406.03198
Copy Paste: [[2406.03198]] The Impossibility of Fair LLMs(https://arxiv.org/abs/2406.03198)
Keywords: language model, gpt, llm, chat
Abstract: The need for fair AI is increasingly clear in the era of general-purpose systems such as ChatGPT, Gemini, and other large language models (LLMs). However, the increasing complexity of human-AI interaction and its social impacts have raised questions of how fairness standards could be applied. Here, we review the technical frameworks that machine learning researchers have used to evaluate fairness, such as group fairness and fair representations, and find that their application to LLMs faces inherent limitations. We show that each framework either does not logically extend to LLMs or presents a notion of fairness that is intractable for LLMs, primarily due to the multitudes of populations affected, sensitive attributes, and use cases. To address these challenges, we develop guidelines for the more realistic goal of achieving fairness in particular use cases: the criticality of context, the responsibility of LLM developers, and the need for stakeholder participation in an iterative process of design and evaluation. Moreover, it may eventually be possible and even necessary to use the general-purpose capabilities of AI systems to address fairness challenges as a form of scalable AI-assisted alignment.
摘要：在 ChatGPT、Gemini 和其他大型语言模型 (LLM) 等通用系统时代，对公平 AI 的需求越来越明显。然而，人机交互日益复杂及其社会影响引发了如何应用公平标准的疑问。在这里，我们回顾了机器学习研究人员用来评估公平性的技术框架，例如群体公平性和公平表示，并发现它们在 LLM 中的应用面临着固有的局限性。我们表明，每个框架要么在逻辑上无法扩展到 LLM，要么提出的公平概念对于 LLM 来说是难以解决的，这主要是由于受影响的人群众多、敏感属性和用例众多。为了应对这些挑战，我们为实现特定用例中的公平性的更现实目标制定了指导方针：上下文的关键性、LLM 开发人员的责任以及利益相关者参与设计和评估迭代过程的必要性。此外，最终可能甚至有必要使用 AI 系统的通用功能来解决公平挑战，作为一种可扩展的 AI 辅助对齐形式。

Title: Bayesian WeakS-to-Strong from Text Classification to Generation

Authors: Ziyun Cui, Ziyang Zhang, Wen Wu, Guangzhi Sun, Chao Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.03199
Pdf URL: https://arxiv.org/pdf/2406.03199
Copy Paste: [[2406.03199]] Bayesian WeakS-to-Strong from Text Classification to Generation(https://arxiv.org/abs/2406.03199)
Keywords: language model
Abstract: Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification tasks to text generation tasks where more advanced strategies are investigated for supervision. Moreover, direct preference optimization is applied to advance the student model's preference learning, beyond the basic learning framework of teacher forcing. Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.
摘要：大型语言模型的进步提出了一个问题：随着模型变得越来越复杂，人类只能对模型进行弱监督，对齐技术将如何适应。Weak-to-Strong 模拟了这样一种场景，即弱模型监督试图充分利用更强大的模型的全部功能。这项工作通过探索一组模拟人类观点变化的弱模型，将 Weak-to-Strong 扩展为 WeakS-to-Strong。使用贝叶斯方法估计置信度分数以指导 WeakS-to-Strong 的泛化。此外，我们将 WeakS-to-Strong 的应用从文本分类任务扩展到文本生成任务，其中研究了更高级的监督策略。此外，直接偏好优化被用于推进学生模型的偏好学习，超越了教师强制的基本学习框架。结果证明了所提出的方法对强大学生模型的可靠性的有效性，显示出超对齐的潜力。

Title: ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction

Authors: Jeiyoon Park, Chanjun Park, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03202
Pdf URL: https://arxiv.org/pdf/2406.03202
Copy Paste: [[2406.03202]] ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction(https://arxiv.org/abs/2406.03202)
Keywords: gpt, llm, prompt, chat
Abstract: We explore and improve the capabilities of LLMs to generate data for grammatical error correction (GEC). When merely producing parallel sentences, their patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator. Additionally, we introduce a new dataset for GEC tasks, named ChatLang-8, which encompasses eight types of subject nouns and 23 types of grammar. It consists of 1 million pairs featuring human-like grammatical errors. Our experiments reveal that ChatLang-8 exhibits a more uniform pattern composition compared to existing GEC datasets. Furthermore, we observe improved model performance when using ChatLang-8 instead of existing GEC datasets. The experimental results suggest that our framework and ChatLang-8 are valuable resources for enhancing ChatGPT's data generation capabilities.
摘要：我们探索并改进了 LLM 生成语法错误纠正 (GEC) 数据的能力。当仅生成平行句子时，它们的模式过于简单，无法作为语料库。为了解决这个问题，我们提出了一个自动化框架，其中包括主题选择器、语法选择器、提示管理器和评估器。此外，我们为 GEC 任务引入了一个名为 ChatLang-8 的新数据集，它包含八种类型的主题名词和 23 种类型的语法。它由 100 万对具有类似人类语法错误的句子组成。我们的实验表明，与现有的 GEC 数据集相比，ChatLang-8 表现出更统一的模式组成。此外，我们观察到使用 ChatLang-8 代替现有的 GEC 数据集时模型性能有所提高。实验结果表明，我们的框架和 ChatLang-8 是增强 ChatGPT 数据生成能力的宝贵资源。

Title: Error-preserving Automatic Speech Recognition of Young English Learners' Language

Authors: Janick Michot, Manuela Hürlimann, Jan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03235
Pdf URL: https://arxiv.org/pdf/2406.03235
Copy Paste: [[2406.03235]] Error-preserving Automatic Speech Recognition of Young English Learners' Language(https://arxiv.org/abs/2406.03235)
Keywords: language model
Abstract: One of the central skills that language learners need to practice is speaking the language. Currently, students in school do not get enough speaking opportunities and lack conversational practice. Recent advances in speech technology and natural language processing allow for the creation of novel tools to practice their speaking skills. In this work, we tackle the first component of such a pipeline, namely, the automated speech recognition module (ASR), which faces a number of challenges: first, state-of-the-art ASR models are often trained on adult read-aloud data by native speakers and do not transfer well to young language learners' speech. Second, most ASR systems contain a powerful language model, which smooths out errors made by the speakers. To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. For this, we collected a corpus containing around 85 hours of English audio spoken by learners in Switzerland from grades 4 to 6 on different language learning tasks, which we used to train an ASR model. Our experiments show that our model benefits from direct fine-tuning on children's voices and has a much higher error preservation rate than other models.
摘要：语言学习者需要练习的核心技能之一是口语。目前，学校的学生没有足够的口语机会，缺乏对话练习。语音技术和自然语言处理的最新进展使得人们能够创造出新的工具来练习他们的口语技能。在这项工作中，我们解决了这种流程的第一个组成部分，即自动语音识别模块 (ASR)，它面临着许多挑战：首先，最先进的 ASR 模型通常是用母语人士的成人朗读数据进行训练的，不能很好地转移到年轻语言学习者的语音上。其次，大多数 ASR 系统都包含一个强大的语言模型，可以消除说话者的错误。为了提供纠正反馈（这是语言学习的关键部分），我们环境中的 ASR 系统需要保留语言学习者所犯的错误。在这项工作中，我们构建了一个满足这些要求的 ASR 系统：它适用于年轻语言学习者的自发语音并保留他们的错误。为此，我们收集了一个语料库，其中包含瑞士 4 至 6 年级学习者在不同语言学习任务中说出的约 85 小时的英语音频，并用它来训练 ASR 模型。我们的实验表明，我们的模型受益于对儿童声音的直接微调，并且错误保留率远高于其他模型。

Title: The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches

Authors: Bhashithe Abeysinghe, Ruhan Circi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03339
Pdf URL: https://arxiv.org/pdf/2406.03339
Copy Paste: [[2406.03339]] The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches(https://arxiv.org/abs/2406.03339)
Keywords: llm, chat
Abstract: Chatbots have been an interesting application of natural language generation since its inception. With novel transformer-based generative AI methods, building chatbots has become trivial. Chatbots targeted at specific domains such as medicine, psychology, and general information retrieval are implemented rapidly. This, however, should not distract from the need to evaluate chatbot responses, especially because the natural language generation community does not entirely agree on how to evaluate such applications effectively. In this work, we discuss the issue further with the increasingly popular LLM-based evaluations and how they correlate with human evaluations. Additionally, we introduce a comprehensive factored evaluation mechanism that can be utilized in conjunction with both human and LLM-based evaluations. We present the results of an experimental evaluation conducted using this scheme in one of our chatbot implementations, and subsequently compare automated, traditional human evaluation, factored human evaluation, and factored LLM evaluation. Results show that factor-based evaluation produces better insights into which aspects need to be improved in LLM applications and further strengthens the argument for using human evaluation in critical spaces where the main functionality is not direct retrieval.
摘要：自诞生以来，聊天机器人一直是自然语言生成的一个有趣应用。借助基于变换器的新型生成式 AI 方法，构建聊天机器人变得微不足道。针对医学、心理学和一般信息检索等特定领域的聊天机器人正在迅速实现。然而，这不应分散对评估聊天机器人响应的需求。特别是因为自然语言生成社区并不完全同意如何有效地评估此类应用程序。通过这项工作，我们进一步讨论了越来越流行的基于 LLM 的评估以及它们与人工评估的相关性问题。此外，我们引入了一种全面的因子评估机制，可与人工和基于 LLM 的评估结合使用。我们展示了在我们的一个聊天机器人实现中使用此方案进行的实验评估的结果，随后比较了自动化、传统人工评估、因子化人工评估和因子化 LLM 评估。结果表明，基于因子的评估可以更好地洞察 LLM 应用程序中需要改进哪些方面，并进一步加强了在主要功能不是直接检索的关键空间中使用人工评估的论点。

Title: LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback

Authors: Timon Ziegenbein, Gabriella Skitalinskaya, Alireza Bayat Makou, Henning Wachsmuth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03363
Pdf URL: https://arxiv.org/pdf/2406.03363
Copy Paste: [[2406.03363]] LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback(https://arxiv.org/abs/2406.03363)
Keywords: language model, llm, prompt
Abstract: Ensuring that online discussions are civil and productive is a major challenge for social media platforms. Such platforms usually rely both on users and on automated detection tools to flag inappropriate arguments of other users, which moderators then review. However, this kind of post-hoc moderation is expensive and time-consuming, and moderators are often overwhelmed by the amount and severity of flagged content. Instead, a promising alternative is to prevent negative behavior during content creation. This paper studies how inappropriate language in arguments can be computationally mitigated. We propose a reinforcement learning-based rewriting approach that balances content preservation and appropriateness based on existing classifiers, prompting an instruction-finetuned large language model (LLM) as our initial policy. Unlike related style transfer tasks, rewriting inappropriate arguments allows deleting and adding content permanently. It is therefore tackled on document level rather than sentence level. We evaluate different weighting schemes for the reward function in both absolute and relative human assessment studies. Systematic experiments on non-parallel data provide evidence that our approach can mitigate the inappropriateness of arguments while largely preserving their content. It significantly outperforms competitive baselines, including few-shot learning, prompting, and humans.
摘要：确保在线讨论文明且富有成效是社交媒体平台面临的一大挑战。此类平台通常依靠用户和自动检测工具来标记其他用户的不当论点，然后由版主进行审查。然而，这种事后审核既昂贵又耗时，版主经常被标记内容的数量和严重性所淹没。相反，一个有希望的替代方案是在内容创建过程中防止负面行为。本文研究了如何通过计算来缓解论据中的不当语言。我们提出了一种基于强化学习的重写方法，该方法基于现有分类器平衡内容保存和适当性，并以经过提示的指令微调大型语言模型 (LLM) 作为初始策略。与相关的风格转换任务不同，重写不适当的论点允许永久删除和添加内容。因此，它是在文档级别而不是句子级别处理的。我们在绝对和相对人工评估研究中评估了奖励函数的不同加权方案。对非平行数据的系统实验提供了证据，表明我们的方法可以减轻论点的不当性，同时在很大程度上保留其内容。它的表现明显优于竞争基线，包括小样本学习、提示和人类。
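
Illustrative note on 2406.03363: the abstract says different weighting schemes for the reward function are evaluated, but it does not give the reward's form. The sketch below assumes a simple convex combination of two classifier-style scores in [0, 1]; the function names, the `alpha` parameter, and the example scores are all hypothetical and are not taken from the paper.

```python
def combined_reward(appropriateness: float, similarity: float, alpha: float = 0.5) -> float:
    """Hypothetical reward: weighted mix of appropriateness and content preservation.

    alpha = 1.0 rewards only appropriateness; alpha = 0.0 rewards only
    similarity to the original argument. Both scores are assumed to lie in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * appropriateness + (1.0 - alpha) * similarity


def best_rewrite(candidates: dict, alpha: float) -> str:
    """Return the name of the candidate with the highest combined reward.

    candidates maps a label to an (appropriateness, similarity) pair.
    """
    return max(candidates, key=lambda name: combined_reward(*candidates[name], alpha=alpha))


# Made-up scores for two candidate rewrites of the same argument.
candidates = {
    "light edit": (0.4, 0.95),    # keeps the content, still fairly inappropriate
    "heavy rewrite": (0.9, 0.5),  # much more appropriate, loses content
}

for alpha in (0.2, 0.8):
    print(alpha, best_rewrite(candidates, alpha))
```

With these made-up scores, a low `alpha` favours the light edit and a high `alpha` favours the heavy rewrite, which is the kind of trade-off a weighting-scheme comparison would probe.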

Title: IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

Authors: David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba O. Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Chukwuneke, Happy Buzaaba, Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge, Pontus Stenetorp
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03368
Pdf URL: https://arxiv.org/pdf/2406.03368
Copy Paste: [[2406.03368]] IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models(https://arxiv.org/abs/2406.03368)
Keywords: language model, gpt, llm
Abstract: Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We observe a significant performance gap between open and proprietary models: the highest-performing open model, Aya-101, reaches only 58% of the performance of the best-performing proprietary model, GPT-4o. Machine-translating the test set to English before evaluation helped to close the gap for larger models that are English-centric, like LLaMa 3 70B. These findings suggest that more efforts are needed to develop and adapt LLMs for African languages.
摘要：尽管大型语言模型 (LLM) 被广泛采用，但其卓越功能仍然仅限于少数资源丰富的语言。此外，由于缺乏资源丰富的语言之外的适当或全面的基准，许多资源匮乏的语言（例如非洲语言）通常仅在基本文本分类任务上进行评估。在本文中，我们介绍了 IrokoBench——一个针对 16 种类型多样的资源匮乏的非洲语言的人工翻译基准数据集，涵盖三项任务：自然语言推理 (AfriXNLI)、数学推理 (AfriMGSM) 和多选知识型 QA (AfriMMLU)。我们使用 IrokoBench 在 10 个开放和 4 个专有 LLM 中评估零样本、少量样本和翻译测试设置（其中测试集被翻译成英语）。我们的评估显示，资源丰富的语言（如英语和法语）和资源匮乏的非洲语言之间存在显著的性能差距。我们观察到开放模型和专有模型之间存在显著的性能差距，性能最高的开放模型 Aya-101 的性能仅为性能最佳的专有模型 GPT-4o 性能的 58%。在评估之前将测试集机器翻译成英语有助于缩小以英语为中心的较大模型（如 LLaMa 3 70B）的差距。这些发现表明，需要付出更多努力来开发和调整针对非洲语言的 LLM。

Title: Automating Turkish Educational Quiz Generation Using Large Language Models

Authors: Kamyar Zeinalipour, Yusuf Gökberk Keptiğ, Marco Maggini, Marco Gori
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03397
Pdf URL: https://arxiv.org/pdf/2406.03397
Copy Paste: [[2406.03397]] Automating Turkish Educational Quiz Generation Using Large Language Models(https://arxiv.org/abs/2406.03397)
Keywords: language model, gpt, llm, chat
Abstract: Crafting quizzes from educational content is a pivotal activity that benefits both teachers and students by reinforcing learning and evaluating understanding. In this study, we introduce a novel approach to generate quizzes from Turkish educational texts, marking a pioneering endeavor in educational technology specifically tailored to the Turkish educational context. We present a specialized dataset, named Turkish-Quiz-Instruct, comprising an extensive collection of Turkish educational texts accompanied by multiple-choice and short-answer quizzes. This research leverages the capabilities of Large Language Models (LLMs), including GPT-4-Turbo, GPT-3.5-Turbo, Llama-2-7b-chat-hf, and Llama-2-13b-chat-hf, to automatically generate quiz questions and answers from the Turkish educational content. Our work delineates the methodology for employing these LLMs in the context of Turkish educational material, thereby opening new avenues for automated Turkish quiz generation. The study not only demonstrates the efficacy of using such models for generating coherent and relevant quiz content but also sets a precedent for future research in the domain of automated educational content creation for languages other than English. The Turkish-Quiz-Instruct dataset is introduced as a valuable resource for researchers and practitioners aiming to explore the boundaries of educational technology and language-specific applications of LLMs in Turkish. By addressing the challenges of quiz generation in a non-English context, specifically Turkish, this study contributes significantly to the field of Turkish educational technology, providing insights into the potential of leveraging LLMs for educational purposes across diverse linguistic landscapes.
摘要：根据教育内容制作测验是一项关键活动，它通过强化学习和评估理解使教师和学生都受益。在本研究中，我们介绍了一种从土耳其教育文本生成测验的新方法，标志着一项专门针对土耳其教育背景的教育技术的开创性努力。我们提供了一个专门的数据集，名为 Turkish-Quiz-Instruct，其中包含大量土耳其教育文本以及多项选择题和简答题测验。这项研究利用大型语言模型 (LLM) 的功能，包括 GPT-4-Turbo、GPT-3.5-Turbo、Llama-2-7b-chat-hf 和 Llama-2-13b-chat-hf，从土耳其教育内容中自动生成测验问题和答案。我们的工作描述了在土耳其教育材料背景下使用这些 LLM 的方法，从而为自动生成土耳其语测验开辟了新途径。这项研究不仅证明了使用此类模型生成连贯且相关的测验内容的有效性，而且还为未来在英语以外的语言自动创建教育内容领域的研究树立了先例。Turkish-Quiz-Instruct 数据集是研究人员和从业人员探索土耳其语教育技术边界和 LLM 语言特定应用的宝贵资源。通过解决非英语环境（特别是土耳其语）中测验生成的挑战，这项研究为土耳其语教育技术领域做出了重大贡献，深入了解了在不同语言环境中利用 LLM 实现教育目的的潜力。

Title: Cycles of Thought: Measuring LLM Confidence through Stable Explanations

Authors: Evan Becker, Stefano Soatto
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.03441
Pdf URL: https://arxiv.org/pdf/2406.03441
Copy Paste: [[2406.03441]] Cycles of Thought: Measuring LLM Confidence through Stable Explanations(https://arxiv.org/abs/2406.03441)
Keywords: language model, llm
Abstract: In many high-risk machine learning applications it is essential for a model to indicate when it is uncertain about a prediction. While large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, their overconfidence in incorrect responses is still a well-documented failure mode. Traditional methods for ML uncertainty quantification can be difficult to directly adapt to LLMs due to the computational cost of implementation and closed-source nature of many models. A variety of black-box methods have recently been proposed, but these often rely on heuristics such as self-verbalized confidence. We instead propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer. While utilizing explanations is not a new idea in and of itself, by interpreting each possible model+explanation pair as a test-time classifier we can calculate a posterior answer distribution over the most likely of these classifiers. We demonstrate how a specific instance of this framework using explanation entailment as our classifier likelihood improves confidence score metrics (in particular AURC and AUROC) over baselines across five different datasets. We believe these results indicate that our framework is both a well-principled and effective way of quantifying uncertainty in LLMs.
摘要：在许多高风险的机器学习应用中，模型必须能够表明何时对预测不确定。虽然大型语言模型 (LLM) 可以在各种基准上达到甚至超过人类水平的准确度，但它们对错误响应的过度自信仍然是一种有据可查的失败模式。由于实施的计算成本和许多模型的闭源性质，传统的 ML 不确定性量化方法可能难以直接适应 LLM。最近提出了各种黑盒方法，但这些方法通常依赖于启发式方法，例如自我表达的信心。我们提出了一个框架来衡量 LLM 相对于答案生成的解释分布的不确定性。虽然利用解释本身并不是一个新想法，但通过将每个可能的模型 + 解释对解释为测试时间分类器，我们可以计算出这些分类器中最可能的后验答案分布。我们展示了该框架的一个具体实例，该实例使用解释蕴涵作为我们的分类器似然，在五个不同数据集的基线上改进了置信度得分指标（特别是 AURC 和 AUROC）。我们相信这些结果表明，我们的框架是一种既有原则又有效的量化 LLM 不确定性的方法。
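
Illustrative note on 2406.03441: the abstract describes treating each model+explanation pair as a test-time classifier and computing a posterior answer distribution over the most likely of these classifiers. A minimal sketch of that marginalization follows, assuming each sampled explanation comes with an unnormalized likelihood weight (for instance an entailment probability) and an answer distribution. The data layout, function names, and numbers are hypothetical; the paper's actual procedure may differ.

```python
from collections import defaultdict


def posterior_over_answers(explanations):
    """Mix per-explanation answer distributions, weighted by explanation likelihood.

    explanations: list of (weight, {answer: probability}) pairs, where weight is
    an unnormalized, non-negative likelihood for that explanation.
    """
    total = sum(weight for weight, _ in explanations)
    if total <= 0:
        raise ValueError("at least one explanation needs a positive weight")
    posterior = defaultdict(float)
    for weight, answer_dist in explanations:
        for answer, prob in answer_dist.items():
            posterior[answer] += (weight / total) * prob
    return dict(posterior)


def confidence(posterior):
    """Return the most probable answer and its posterior mass as a confidence score."""
    answer = max(posterior, key=posterior.get)
    return answer, posterior[answer]


# Three made-up explanations for a two-option question.
sampled = [
    (0.9, {"A": 0.8, "B": 0.2}),
    (0.3, {"A": 0.1, "B": 0.9}),
    (0.6, {"A": 0.7, "B": 0.3}),
]
print(confidence(posterior_over_answers(sampled)))
```

The confidence score here is simply the posterior mass on the top answer; it drops when plausible explanations disagree about the answer, which matches the "stable explanations" intuition in the title.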

Title: Are language models rational? The case of coherence norms and belief revision

Authors: Thomas Hofweber, Peter Hase, Elias Stengel-Eskin, Mohit Bansal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03442
Pdf URL: https://arxiv.org/pdf/2406.03442
Copy Paste: [[2406.03442]] Are language models rational? The case of coherence norms and belief revision(https://arxiv.org/abs/2406.03442)
Keywords: language model
Abstract: Do norms of rationality apply to machine learning models, in particular language models? In this paper we investigate this question by focusing on a special subset of rational norms: coherence norms. We consider both logical coherence norms as well as coherence norms tied to the strength of belief. To make sense of the latter, we introduce the Minimal Assent Connection (MAC) and propose a new account of credence, which captures the strength of belief in language models. This proposal uniformly assigns strength of belief simply on the basis of model internal next token probabilities. We argue that rational norms tied to coherence do apply to some language models, but not to others. This issue is significant since rationality is closely tied to predicting and explaining behavior, and thus it is connected to considerations about AI safety and alignment, as well as understanding model behavior more generally.
摘要：理性规范是否适用于机器学习模型，特别是语言模型？在本文中，我们通过关注理性规范的一个特殊子集：连贯性规范来研究这个问题。我们考虑逻辑连贯性规范以及与信念强度相关的连贯性规范。为了理解后者，我们引入了最小同意连接 (MAC) 并提出了一种新的信任度解释，它捕捉了语言模型中的信念强度。该提议仅根据模型内部下一个标记概率统一分配信念强度。我们认为与连贯性相关的理性规范确实适用于某些语言模型，但不适用于其他语言模型。这个问题很重要，因为理性与预测和解释行为密切相关，因此它与对人工智能安全性和一致性的考虑以及更广泛地理解模型行为有关。

Title: What is the Best Way for ChatGPT to Translate Poetry?

Authors: Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03450
Pdf URL: https://arxiv.org/pdf/2406.03450
Copy Paste: [[2406.03450]] What is the Best Way for ChatGPT to Translate Poetry?(https://arxiv.org/abs/2406.03450)
Keywords: language model, gpt, prompt, chat
Abstract: Machine translation (MT) has historically faced significant challenges when applied to literary works, particularly in the domain of poetry translation. The advent of Large Language Models such as ChatGPT holds potential for innovation in this field. This study examines ChatGPT's capabilities in English-Chinese poetry translation tasks, utilizing targeted prompts and small sample scenarios to ascertain optimal performance. Despite promising outcomes, our analysis reveals persistent issues in the translations generated by ChatGPT that warrant attention. To address these shortcomings, we propose an Explanation-Assisted Poetry Machine Translation (EAPMT) method, which leverages monolingual poetry explanation as guiding information for the translation process. Furthermore, we refine existing evaluation criteria to better suit the nuances of modern poetry translation. We engaged a panel of professional poets for assessments, complemented by evaluations using GPT-4. The results from both human and machine evaluations demonstrate that our EAPMT method outperforms traditional translation methods of ChatGPT and the existing online systems. This paper validates the efficacy of our method and contributes a novel perspective to machine-assisted literary translation.
摘要：机器翻译 (MT) 在应用于文学作品时历来面临重大挑战，特别是在诗歌翻译领域。大型语言模型（如 ChatGPT）的出现为该领域的创新提供了潜力。本研究考察了 ChatGPT 在英汉诗歌翻译任务中的能力，利用有针对性的提示和小样本场景来确定最佳性能。尽管结果令人鼓舞，但我们的分析表明 ChatGPT 生成的翻译中存在一些值得关注的持续问题。为了解决这些缺点，我们提出了一种解释辅助诗歌机器翻译 (EAPMT) 方法，该方法利用单语诗歌解释作为翻译过程的指导信息。此外，我们改进了现有的评估标准，以更好地适应现代诗歌翻译的细微差别。我们聘请了一组专业诗人进行评估，并使用 GPT-4 进行补充评估。人工和机器评估的结果表明，我们的 EAPMT 方法优于 ChatGPT 和现有在线系统的传统翻译方法。本文验证了我们方法的有效性，并为机器辅助文学翻译提供了一个新的视角。

Title: BIPED: Pedagogically Informed Tutoring System for ESL Education

Authors: Soonwoo Kwon, Sojung Kim, Minju Park, Seunghyun Lee, Kyuseok Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.03486
Pdf URL: https://arxiv.org/pdf/2406.03486
Copy Paste: [[2406.03486]] BIPED: Pedagogically Informed Tutoring System for ESL Education(https://arxiv.org/abs/2406.03486)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have a great potential to serve as readily available and cost-efficient Conversational Intelligent Tutoring Systems (CITS) for teaching L2 learners of English. Existing CITS, however, are designed to teach only simple concepts or lack the pedagogical depth necessary to address diverse learning strategies. To develop a more pedagogically informed CITS capable of teaching complex concepts, we construct a BIlingual PEDagogically-informed Tutoring Dataset (BIPED) of one-on-one, human-to-human English tutoring interactions. Through post-hoc analysis of the tutoring interactions, we come up with a lexicon of dialogue acts (34 tutor acts and 9 student acts), which we use to further annotate the collected dataset. Based on a two-step framework of first predicting the appropriate tutor act then generating the corresponding response, we implemented two CITS models using GPT-4 and SOLAR-KO, respectively. We experimentally demonstrate that the implemented models not only replicate the style of human teachers but also employ diverse and contextually appropriate pedagogical strategies.
摘要：大型语言模型 (LLM) 具有巨大潜力，可作为随时可用且经济高效的对话式智能辅导系统 (CITS)，用于教授 L2 英语学习者。然而，现有的 CITS 仅用于教授简单概念，或缺乏解决各种学习策略所需的教学深度。为了开发一种能够教授复杂概念的更具教学意义的 CITS，我们构建了一个双语教学意义辅导数据集 (BIPED)，其中包含一对一、人与人之间的英语辅导互动。通过对辅导互动的事后分析，我们得出了一个对话行为词典（34 个导师行为和 9 个学生行为），我们用它来进一步注释收集到的数据集。基于首先预测适当的导师行为然后生成相应响应的两步框架，我们分别使用 GPT-4 和 SOLAR-KO 实现了两个 CITS 模型。我们通过实验证明，所实施的模型不仅复制了人类教师的风格，而且还采用了多样化且适合情境的教学策略。

Title: Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends

Authors: Sanjana Ramprasad, Elisa Ferracane, Zachary C. Lipton
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.03487
Pdf URL: https://arxiv.org/pdf/2406.03487
Copy Paste: [[2406.03487]] Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends(https://arxiv.org/abs/2406.03487)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Recent advancements in large language models (LLMs) have considerably advanced the capabilities of summarization systems. However, they continue to face concerns about hallucinations. While prior work has evaluated LLMs extensively in news domains, most evaluation of dialogue summarization has focused on BART-based models, leaving a gap in our understanding of their faithfulness. Our work benchmarks the faithfulness of LLMs for dialogue summarization, using human annotations and focusing on identifying and categorizing span-level inconsistencies. Specifically, we focus on two prominent LLMs: GPT-4 and Alpaca-13B. Our evaluation reveals subtleties as to what constitutes a hallucination: LLMs often generate plausible inferences, supported by circumstantial evidence in the conversation, that lack direct evidence, a pattern that is less prevalent in older models. We propose a refined taxonomy of errors, coining the category of "Circumstantial Inference" to bucket these LLM behaviors and release the dataset. Using our taxonomy, we compare the behavioral differences between LLMs and older fine-tuned models. Additionally, we systematically assess the efficacy of automatic error detection methods on LLM summaries and find that they struggle to detect these nuanced errors. To address this, we introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics, particularly for identifying "Circumstantial Inference."
摘要：大型语言模型 (LLM) 的最新进展大大提高了摘要系统的功能。然而，它们仍然面临幻觉问题。虽然之前的研究已经对新闻领域的 LLM 进行了广泛的评估，但大多数对话摘要评估都集中在基于 BART 的模型上，这让我们对其忠实度的理解存在差距。我们的工作对 LLM 在对话摘要方面的忠实度进行了基准测试，使用人工注释，并专注于识别和分类跨度级不一致。具体来说，我们关注两个著名的 LLM：GPT-4 和 Alpaca-13B。我们的评估揭示了构成幻觉的微妙之处：LLM 通常会产生合理的推论，这些推论由对话中的间接证据支持，但缺乏直接证据，这种模式在旧模型中不太常见。我们提出了一种改进的错误分类法，创造了“间接推论”类别来分类这些 LLM 行为并发布数据集。使用我们的分类法，我们比较了 LLM 与旧版微调模型之间的行为差异。此外，我们系统地评估了自动错误检测方法对 LLM 摘要的有效性，发现它们很难检测到这些细微的错误。为了解决这个问题，我们引入了两种基于提示的细粒度错误检测方法，它们的表现优于现有指标，尤其是在识别“间接推论”方面。

Title: Wings: Learning Multimodal LLMs without Text-only Forgetting

Authors: Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.03496
Pdf URL: https://arxiv.org/pdf/2406.03496
Copy Paste: [[2406.03496]] Wings: Learning Multimodal LLMs without Text-only Forgetting(https://arxiv.org/abs/2406.03496)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the text-only instructions, which do not include images and can be addressed within the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to the attention shifts from pre-image to post-image text. From that, we construct extra modules that act as the boosted learner to compensate for the attention shift. The complementary visual and textual learners, like "wings" on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design the Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.
摘要：多模态大型语言模型 (MLLM) 以经过训练的 LLM 为起点，首先将图像与文本对齐，然后对多模态混合输入进行微调。然而，MLLM 会彻底忘记纯文本指令，这些指令不包含图像，可以在初始 LLM 中解决。在本文中，我们介绍了 Wings，这是一种新型 MLLM，在纯文本对话和多模态理解方面都表现出色。分析多模态指令中的 MLLM 注意力表明，纯文本遗忘与注意力从前图像文本转移到后图像文本有关。从那里，我们构建了额外的模块，作为增强学习器来补偿注意力转移。互补的视觉和文本学习器就像两边的“翅膀”，在每一层的注意力块中并行连接。最初，图像和文本输入与与主要注意力一起运行的视觉学习器对齐，平衡对视觉元素的关注。文本学习器随后与基于注意力的路由协作集成，以融合视觉和文本学习器的输出。我们设计了低秩残差注意力 (LoRRA)，以确保学习者的高效性。我们的实验结果表明，无论是在纯文本问答任务中，还是在视觉问答任务中，Wings 的表现都优于同等规模的 MLLM。在新构建的交错图像文本 (IIT) 基准上，Wings 在从纯文本丰富的问答任务到多模态丰富的问答任务中都表现出色。