2024-10-17

Title: The Fair Language Model Paradox

Authors: Andrea Pinto, Tomer Galanti, Randall Balestriero
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.11985
Pdf URL: https://arxiv.org/pdf/2410.11985
Copy Paste: [[2410.11985]] The Fair Language Model Paradox(https://arxiv.org/abs/2410.11985)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on aggregated training loss, measured at the batch level, which overlooks subtle per-token biases arising from (i) varying token-level dynamics and (ii) structural biases introduced by hyperparameters. While weight decay is commonly used to stabilize training, we reveal that it silently introduces performance biases detectable only at the token level. In fact, we empirically show across different dataset sizes, model architectures and sizes ranging from 270M to 3B parameters that as weight decay increases, low-frequency tokens are disproportionately depreciated. This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages, calling for novel regularization techniques that ensure fairness across all available tokens.
摘要：大型语言模型 (LLM) 被广泛部署在实际应用中，但人们对其在 token 级别的训练动态知之甚少。评估通常依赖于在批次级别测量的聚合训练损失，这忽略了由 (i) 不同的 token 级别动态和 (ii) 超参数引入的结构偏差引起的细微的 token 偏差。虽然权重衰减通常用于稳定训练，但我们发现它会悄悄地引入仅在 token 级别可检测到的性能偏差。事实上，我们通过经验表明，在不同数据集大小、模型架构和从 2.7 亿到 30 亿参数的大小中，随着权重衰减的增加，低频 token 会不成比例地贬值。这尤其令人担忧，因为这些被忽视的低频 token 代表了大多数语言中 token 分布的绝大多数，需要新的正则化技术来确保所有可用 token 的公平性。
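The abstract's contrast between batch-level loss and token-level behavior can be illustrated with a small sketch. This is not the authors' code: the function name `loss_by_frequency_bucket`, the bucket boundaries, and the toy numbers are all assumptions made for illustration. It groups per-token losses by how often each token occurs in a corpus, so a gap between frequent and rare tokens becomes visible even when the batch mean looks unremarkable.

```python
from collections import Counter, defaultdict

def loss_by_frequency_bucket(token_ids, token_losses, corpus_counts, boundaries):
    """Average per-token loss within token-frequency buckets.

    boundaries is an ascending list of count thresholds. A token lands in
    bucket i, where i is the number of thresholds its corpus count reaches,
    so bucket 0 holds the rarest tokens.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tok, loss in zip(token_ids, token_losses):
        freq = corpus_counts[tok]
        bucket = sum(freq >= b for b in boundaries)
        sums[bucket] += loss
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(counts)}

# Toy data (hypothetical): token 0 is frequent, token 2 is rare.
corpus_counts = Counter([0, 0, 0, 0, 1, 1, 2])
batch_tokens = [0, 0, 1, 2]
batch_losses = [1.0, 3.0, 4.0, 8.0]

per_bucket = loss_by_frequency_bucket(
    batch_tokens, batch_losses, corpus_counts, boundaries=[2, 4]
)
batch_mean = sum(batch_losses) / len(batch_losses)
print(per_bucket)  # loss per frequency bucket, rarest first
print(batch_mean)  # the single aggregated number
```

In this toy batch the aggregated mean hides the fact that the rarest bucket carries the highest loss, which is the kind of effect the paper reports measuring at the token level.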

Title: DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

Authors: Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, Yen-Chang Hsu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.11988
Pdf URL: https://arxiv.org/pdf/2410.11988
Copy Paste: [[2410.11988]] DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models(https://arxiv.org/abs/2410.11988)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, including language modeling, understanding, and generation. However, the increased memory and computational costs associated with these models pose significant challenges for deployment on resource-limited devices. Structural pruning has emerged as a promising solution to reduce the costs of LLMs without requiring post-processing steps. Prior structural pruning methods either follow the dependence of structures at the cost of limiting flexibility, or introduce non-trivial additional parameters by incorporating different projection matrices. In this work, we propose a novel approach that relaxes the constraint imposed by regular structural pruning methods and eliminates the structural dependence along the embedding dimension. Our dimension-independent structural pruning method offers several benefits. Firstly, our method enables different blocks to utilize different subsets of the feature maps. Secondly, by removing structural dependence, we facilitate each block to possess varying widths along its input and output dimensions, thereby significantly enhancing the flexibility of structural pruning. We evaluate our method on various LLMs, including OPT, LLaMA, LLaMA-2, Phi-1.5, and Phi-2. Experimental results demonstrate that our approach outperforms other state-of-the-art methods, showing for the first time that structural pruning can achieve an accuracy similar to semi-structural pruning.
摘要：大型语言模型 (LLM) 在各种自然语言处理任务中取得了显著的成功，包括语言建模、理解和生成。然而，这些模型相关的内存和计算成本增加，对在资源有限的设备上部署构成了重大挑战。结构修剪已成为一种有前途的解决方案，可在无需后处理步骤的情况下降低 LLM 的成本。先前的结构修剪方法要么遵循结构的依赖性，代价是限制灵活性，要么通过合并不同的投影矩阵引入非平凡的附加参数。在这项工作中，我们提出了一种新方法，该方法放宽了常规结构修剪方法施加的约束，并消除了沿嵌入维度的结构依赖性。我们的维度无关的结构修剪方法提供了几个好处。首先，我们的方法使不同的块能够利用特征图的不同子集。其次，通过消除结构依赖性，我们使每个块在其输入和输出维度上拥有不同的宽度，从而显著增强了结构修剪的灵活性。我们在各种 LLM 上评估了我们的方法，包括 OPT、LLaMA、LLaMA-2、Phi-1.5 和 Phi-2。实验结果表明，我们的方法优于其他最先进的方法，首次表明结构化剪枝可以达到与半结构化剪枝相似的准确率。

Title: Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data

Authors: Seiji Maekawa, Hayate Iso, Nikita Bhutani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.11996
Pdf URL: https://arxiv.org/pdf/2410.11996
Copy Paste: [[2410.11996]] Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data(https://arxiv.org/abs/2410.11996)
Keywords: language model, long context, retrieval-augmented generation
Abstract: The rapid increase in textual information means we need more efficient methods to sift through, organize, and understand it all. While retrieval-augmented generation (RAG) models excel in accessing information from large document collections, they struggle with complex tasks that require aggregation and reasoning over information spanning across multiple documents--what we call holistic reasoning. Long-context language models (LCLMs) have great potential for managing large-scale documents, but their holistic reasoning capabilities remain unclear. In this work, we introduce HoloBench, a novel framework that brings database reasoning operations into text-based contexts, making it easier to systematically evaluate how LCLMs handle holistic reasoning across large documents. Our approach adjusts key factors such as context length, information density, distribution of information, and query complexity to evaluate LCLMs comprehensively. Our experiments show that the amount of information in the context has a bigger influence on LCLM performance than the actual context length. Furthermore, the complexity of queries affects performance more than the amount of information, particularly for different types of queries. Interestingly, queries that involve finding maximum or minimum values are easier for LCLMs and are less affected by context length, even though they pose challenges for RAG systems. However, tasks requiring the aggregation of multiple pieces of information show a noticeable drop in accuracy as context length increases. Additionally, we find that while grouping relevant information generally improves performance, the optimal positioning varies across models. Our findings surface both the advancements and the ongoing challenges in achieving a holistic understanding of long contexts.
摘要：文本信息的快速增长意味着我们需要更有效的方法来筛选、组织和理解所有信息。虽然检索增强生成 (RAG) 模型擅长从大型文档集合中访问信息，但它们在处理需要对跨多个文档的信息进行聚合和推理的复杂任务时却举步维艰——我们称之为整体推理。长上下文语言模型 (LCLM) 在管理大型文档方面具有巨大潜力，但它们的整体推理能力仍不清楚。在这项工作中，我们引入了 HoloBench，这是一个将数据库推理操作带入基于文本的上下文的新框架，使系统地评估 LCLM 如何处理大型文档中的整体推理变得更加容易。我们的方法调整了上下文长度、信息密度、信息分布和查询复杂度等关键因素，以全面评估 LCLM。我们的实验表明，上下文中的信息量对 LCLM 性能的影响大于实际上下文长度。此外，查询的复杂性对性能的影响大于信息量，尤其是对于不同类型的查询。有趣的是，涉及查找最大值或最小值的查询对于 LCLM 来说更容易，并且受上下文长度的影响较小，尽管它们对 RAG 系统构成了挑战。但是，随着上下文长度的增加，需要聚合多条信息的任务的准确率会明显下降。此外，我们发现，虽然对相关信息进行分组通常会提高性能，但最佳定位因模型而异。我们的研究结果揭示了在实现对长上下文的整体理解方面取得的进展和持续面临的挑战。

Title: Impacts of Continued Legal Pre-Training and IFT on LLMs' Latent Representations of Human-Defined Legal Concepts

Authors: Shaun Ho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12001
Pdf URL: https://arxiv.org/pdf/2410.12001
Copy Paste: [[2410.12001]] Impacts of Continued Legal Pre-Training and IFT on LLMs' Latent Representations of Human-Defined Legal Concepts(https://arxiv.org/abs/2410.12001)
Keywords: language model, llm
Abstract: This paper aims to offer AI & Law researchers and practitioners a more detailed understanding of whether and how continued pre-training and instruction fine-tuning (IFT) of large language models (LLMs) on legal corpora increases their utilization of human-defined legal concepts when developing global contextual representations of input sequences. We compared three models: Mistral 7B, SaulLM-7B-Base (Mistral 7B with continued pre-training on legal corpora), and SaulLM-7B-Instruct (with further IFT). This preliminary assessment examined 7 distinct text sequences from recent AI & Law literature, each containing a human-defined legal concept. We first compared the proportions of total attention the models allocated to subsets of tokens representing the legal concepts. We then visualized patterns of raw attention score alterations, evaluating whether legal training introduced novel attention patterns corresponding to structures of human legal knowledge. This inquiry revealed that (1) the impact of legal training was unevenly distributed across the various human-defined legal concepts, and (2) the contextual representations of legal knowledge learned during legal training did not coincide with structures of human-defined legal concepts. We conclude with suggestions for further investigation into the dynamics of legal LLM training.
摘要：本文旨在让人工智能与法律研究人员和从业人员更详细地了解，在开发输入序列的全局上下文表示时，在法律语料库上对大型语言模型 (LLM) 进行持续预训练和指令微调 (IFT) 是否以及如何提高他们对人类定义的法律概念的利用率。我们比较了三种模型：Mistral 7B、SaulLM-7B-Base（在法律语料库上继续预训练的 Mistral 7B）和 SaulLM-7B-Instruct（进一步进行 IFT）。这项初步评估研究了最近人工智能与法律文献中的 7 个不同的文本序列，每个序列都包含一个人类定义的法律概念。我们首先比较了模型分配给代表法律概念的标记子集的总注意力比例。然后，我们可视化了原始注意力分数变化的模式，评估法律训练是否引入了与人类法律知识结构相对应的新型注意力模式。这项调查显示：(1) 法律培训的影响在各种人类定义的法律概念中分布不均；(2) 法律培训期间学习的法律知识的语境表征与人类定义的法律概念的结构不一致。最后，我们提出了进一步研究法律 LLM 培训动态的建议。

Title: Toolken+: Improving LLM Tool Usage with Reranking and a Reject Option

Authors: Konstantin Yakovlev, Sergey Nikolenko, Andrey Bout
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12004
Pdf URL: https://arxiv.org/pdf/2410.12004
Copy Paste: [[2410.12004]] Toolken+: Improving LLM Tool Usage with Reranking and a Reject Option(https://arxiv.org/abs/2410.12004)
Keywords: gpt, llm
Abstract: The recently proposed ToolkenGPT tool learning paradigm demonstrates promising performance but suffers from two major issues: first, it cannot benefit from tool documentation, and second, it often makes mistakes in whether to use a tool at all. We introduce Toolken+ that mitigates the first problem by reranking top $k$ tools selected by ToolkenGPT and the second problem with a special "Reject" option such that the model will generate a vocabulary token if "Reject" is ranked first. We demonstrate the effectiveness of Toolken+ on multistep numerical reasoning and tool selection tasks.
摘要：最近提出的 ToolkenGPT 工具学习范例表现出色，但存在两个主要问题：首先，它无法从工具文档中获益，其次，它经常会犯是否使用工具的错误。我们引入了 Toolken+，通过重新排序 ToolkenGPT 选择的前 $k$ 个工具来缓解第一个问题，并使用特殊的“拒绝”选项来缓解第二个问题，这样如果“拒绝”排在第一位，模型将生成词汇标记。我们展示了 Toolken+ 在多步数值推理和工具选择任务中的有效性。
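A minimal sketch of the decision rule described in the abstract, under stated assumptions: the names `choose_action` and `REJECT`, and the score tables, are hypothetical, and the reranker is represented by a plain lookup rather than a model that reads tool documentation. The sketch takes the top-k tools from a first-stage scorer, appends a reject candidate, and falls back to ordinary token generation when reject is ranked first.

```python
REJECT = "<reject>"  # hypothetical name for the special reject candidate

def choose_action(first_stage_scores, rerank_score, k):
    """Return the chosen tool name, or None to emit a vocabulary token."""
    # Stage 1: keep the k tools the first-stage scorer likes best.
    top_k = sorted(first_stage_scores, key=first_stage_scores.get, reverse=True)[:k]
    # Stage 2: rerank those tools together with the reject candidate.
    candidates = top_k + [REJECT]
    best = max(candidates, key=rerank_score)
    return None if best == REJECT else best

first_stage = {"add": 0.9, "sqrt": 0.7, "log": 0.1}

# Reranker prefers "sqrt" over the first-stage favorite "add".
rerank_a = {"add": 0.2, "sqrt": 0.8, REJECT: 0.5}
picked = choose_action(first_stage, rerank_a.get, k=2)

# Reranker puts reject first, so no tool is called.
rerank_b = {"add": 0.2, "sqrt": 0.3, REJECT: 0.95}
rejected = choose_action(first_stage, rerank_b.get, k=2)
```

Returning `None` stands in for "generate a normal vocabulary token"; a real system would hand control back to the language model's decoder at that point.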

Title: Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Authors: Kushal Tatariya, Vladimir Araujo, Thomas Bauwens, Miryam de Lhoneux
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12011
Pdf URL: https://arxiv.org/pdf/2410.12011
Copy Paste: [[2410.12011]] Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models(https://arxiv.org/abs/2410.12011)
Keywords: language model
Abstract: Pixel-based language models have emerged as a compelling alternative to subword-based language modelling, particularly because they can represent virtually any script. PIXEL, a canonical example of such a model, is a vision transformer that has been pre-trained on rendered text. While PIXEL has shown promising cross-script transfer abilities and robustness to orthographic perturbations, it falls short of outperforming monolingual subword counterparts like BERT in most other contexts. This discrepancy raises questions about the amount of linguistic knowledge learnt by these models and whether their performance in language tasks stems more from their visual capabilities than their linguistic ones. To explore this, we probe PIXEL using a variety of linguistic and visual tasks to assess its position on the vision-to-language spectrum. Our findings reveal a substantial gap between the model's visual and linguistic understanding. The lower layers of PIXEL predominantly capture superficial visual features, whereas the higher layers gradually learn more syntactic and semantic abstractions. Additionally, we examine variants of PIXEL trained with different text rendering strategies, discovering that introducing certain orthographic constraints at the input level can facilitate earlier learning of surface-level features. With this study, we hope to provide insights that aid the further development of pixel-based language models.
摘要：基于像素的语言模型已成为基于子词的语言建模的有力替代方案，特别是因为它们几乎可以表示任何文字。PIXEL 是此类模型的一个典型示例，它是一种视觉转换器，已在渲染的文本上进行了预训练。虽然 PIXEL 表现出良好的跨文字迁移能力和对正字法干扰的鲁棒性，但在大多数其他情况下，它的表现都不如 BERT 等单语子词对应模型。这种差异引发了人们对这些模型学习的语言知识量以及它们在语言任务中的表现是否更多地源于视觉能力而非语言能力的质疑。为了探索这一点，我们使用各种语言和视觉任务来探测 PIXEL，以评估其在视觉到语言频谱中的位置。我们的研究结果揭示了模型的视觉和语言理解之间存在巨大差距。PIXEL 的较低层主要捕捉表面的视觉特征，而较高层则逐渐学习更多的句法和语义抽象。此外，我们还研究了使用不同文本渲染策略训练的 PIXEL 变体，发现在输入级别引入某些正字法约束可以促进更早地学习表层特征。通过这项研究，我们希望提供有助于进一步开发基于像素的语言模型的见解。

Title: MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router

Authors: Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12013
Pdf URL: https://arxiv.org/pdf/2410.12013
Copy Paste: [[2410.12013]] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router(https://arxiv.org/abs/2410.12013)
Keywords: language model, llm
Abstract: Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts. Pruning MoE can reduce network weights while maintaining model performance. Motivated by the recent observation of emergent large magnitude features in Large Language Models (LLM) and MoE routing policy, we propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights, on each output neuron. Our pruning method is one-shot, requiring no retraining or weight updates. We evaluate our method on Mixtral-8x7B and Mixtral-8x22B across multiple language benchmarks. Experimental results show that our pruning method significantly outperforms state-of-the-art LLM pruning methods. Furthermore, our pruned MoE models can benefit from a pretrained teacher model through expert-wise knowledge distillation, improving performance post-pruning. Experimental results demonstrate that the Mixtral-8x7B model with 50% sparsity maintains 99% of the performance of the original model after the expert-wise knowledge distillation.
摘要：混合专家 (MoE) 架构面临着诸如高内存消耗和专家冗余等挑战。修剪 MoE 可以在保持模型性能的同时减少网络权重。受最近观察到的大型语言模型 (LLM) 和 MoE 路由策略中出现的大幅度特征的启发，我们提出了 MoE-Pruner，这种方法在每个输出神经元上修剪具有最小幅度的权重，并将其乘以相应的输入激活和路由器权重。我们的修剪方法是一次性的，不需要重新训练或更新权重。我们在多个语言基准上在 Mixtral-8x7B 和 Mixtral-8x22B 上评估了我们的方法。实验结果表明，我们的修剪方法明显优于最先进的 LLM 修剪方法。此外，通过专家知识提炼，我们的修剪后的 MoE 模型可以从预训练的教师模型中受益，从而提高修剪后的性能。实验结果表明，稀疏度为50%的Mixtral-8x7B模型经过专家知识提炼后，仍然保持了原模型99%的性能。
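The pruning criterion in the abstract (weight magnitude multiplied by input activations and router weights, compared within each output neuron) can be sketched as follows. The abstract does not give the exact formula, so this is an assumption: each input feature is scored by the L2 norm of its router-weighted activations over a few calibration tokens. The function name `prune_expert_rowwise` and all numbers are hypothetical.

```python
import math

def prune_expert_rowwise(weights, activations, gates, sparsity):
    """One-shot pruning of one expert's weight matrix (rows = output neurons).

    weights:     list of rows, each of length n_in
    activations: calibration inputs, one list of length n_in per token
    gates:       router weight given to this expert for each token
    sparsity:    fraction of weights to zero in every row
    """
    n_in = len(weights[0])
    # Router-weighted L2 norm of each input feature (assumed combination).
    norms = [
        math.sqrt(sum((g * x[j]) ** 2 for x, g in zip(activations, gates)))
        for j in range(n_in)
    ]
    n_prune = int(n_in * sparsity)
    pruned = []
    for row in weights:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        # Indices of the lowest-scoring weights within this output neuron.
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

weights = [[0.5, -2.0, 1.0, 0.1],
           [3.0, 0.2, -0.4, 1.5]]
activations = [[1.0, 0.1, 2.0, 4.0],
               [1.0, 0.1, 2.0, 4.0]]
gates = [0.5, 0.5]

pruned = prune_expert_rowwise(weights, activations, gates, sparsity=0.5)
```

In the first row the largest-magnitude weight (-2.0) is removed because its input feature is barely active, which shows how this score differs from plain magnitude pruning. No weights are updated after pruning, matching the one-shot setting the abstract describes.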

Title: On Classification with Large Language Models in Cultural Analytics

Authors: David Bamman, Kent K. Chang, Li Lucy, Naitian Zhou
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2410.12029
Pdf URL: https://arxiv.org/pdf/2410.12029
Copy Paste: [[2410.12029]] On Classification with Large Language Models in Cultural Analytics(https://arxiv.org/abs/2410.12029)
Keywords: language model, llm, prompt
Abstract: In this work, we survey the way in which classification is used as a sensemaking practice in cultural analytics, and assess where large language models can fit into this landscape. We identify ten tasks supported by publicly available datasets on which we empirically assess the performance of LLMs compared to traditional supervised methods, and explore the ways in which LLMs can be employed for sensemaking goals beyond mere accuracy. We find that prompt-based LLMs are competitive with traditional supervised models for established tasks, but perform less well on de novo tasks. In addition, LLMs can assist sensemaking by acting as an intermediary input to formal theory testing.
摘要：在这项研究中，我们调查了分类在文化分析中作为一种意义建构实践的使用方式，并评估了大型语言模型在这一领域的适用范围。我们确定了十个由公开数据集支持的任务，并根据这些任务对 LLM 与传统监督方法的性能进行了实证评估，并探索了除了准确性之外，LLM 还可以用于哪些意义建构目标。我们发现，基于提示的 LLM 在既定任务上与传统监督模型相媲美，但在从头任务上表现较差。此外，LLM 可以作为正式理论测试的中介输入，从而协助意义建构。

Title: Concept-Reversed Winograd Schema Challenge: Evaluating and Improving Robust Reasoning in Large Language Models via Abstraction

Authors: Kaiqiao Han, Tianqing Fang, Zhaowei Wang, Yangqiu Song, Mark Steedman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12040
Pdf URL: https://arxiv.org/pdf/2410.12040
Copy Paste: [[2410.12040]] Concept-Reversed Winograd Schema Challenge: Evaluating and Improving Robust Reasoning in Large Language Models via Abstraction(https://arxiv.org/abs/2410.12040)
Keywords: language model, llm, hallucination, prompt
Abstract: While Large Language Models (LLMs) have showcased remarkable proficiency in reasoning, there is still a concern about hallucinations and unreliable reasoning issues due to semantic associations and superficial logical chains. To evaluate the extent to which LLMs perform robust reasoning instead of relying on superficial logical chains, we propose a new evaluation dataset, the Concept-Reversed Winograd Schema Challenge (CR-WSC), based on the famous Winograd Schema Challenge (WSC) dataset. By simply reversing the concepts to those that are more associated with the wrong answer, we find that the performance of LLMs drops significantly despite the rationale of reasoning remaining the same. Furthermore, we propose Abstraction-of-Thought (AoT), a novel prompt method for recovering adversarial cases to normal cases using conceptual abstraction to improve LLMs' robustness and consistency in reasoning, as demonstrated by experiments on CR-WSC.
摘要：虽然大型语言模型 (LLM) 已经展现出卓越的推理能力，但人们仍然担心由于语义关联和肤浅的逻辑链而导致的幻觉和不可靠的推理问题。为了评估 LLM 在多大程度上能够执行稳健的推理而不是依赖肤浅的逻辑链，我们提出了一个新的评估数据集，即概念反转 Winograd 模式挑战 (CR-WSC)，它基于著名的 Winograd 模式挑战 (WSC) 数据集。通过简单地将概念反转为与错误答案更相关的概念，我们发现 LLM 的性能会显著下降，尽管推理的基本原理保持不变。此外，我们提出了抽象思维 (AoT)，这是一种新颖的提示方法，使用概念抽象将对抗性案例恢复为正常案例，以提高 LLM 在推理中的稳健性和一致性，正如在 CR-WSC 上的实验所证明的那样。

Title: Boosting Logical Fallacy Reasoning in LLMs via Logical Structure Tree

Authors: Yuanyuan Lei, Ruihong Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12048
Pdf URL: https://arxiv.org/pdf/2410.12048
Copy Paste: [[2410.12048]] Boosting Logical Fallacy Reasoning in LLMs via Logical Structure Tree(https://arxiv.org/abs/2410.12048)
Keywords: llm, prompt
Abstract: A logical fallacy uses invalid or faulty reasoning in the construction of a statement. Despite the prevalence and harmfulness of logical fallacies, detecting and classifying them remains a challenging task. We observe that logical fallacies often use connective words to indicate an intended logical relation between two arguments, while the argument semantics does not actually support the logical relation. Inspired by this observation, we propose to build a logical structure tree to explicitly represent and track the hierarchical logic flow among relation connectives and their arguments in a statement. Specifically, this logical structure tree is constructed in an unsupervised manner guided by the constituency tree and a taxonomy of connectives for ten common logical relations, with relation connectives as non-terminal nodes and textual arguments as terminal nodes, and the latter are mostly elementary discourse units. We further develop two strategies to incorporate the logical structure tree into LLMs for fallacy reasoning. Firstly, we transform the tree into natural language descriptions and feed the textualized tree into LLMs as a part of the hard text prompt. Secondly, we derive a relation-aware tree embedding and insert the tree embedding into LLMs as a soft prompt. Experiments on benchmark datasets demonstrate that our approach based on logical structure tree significantly improves precision and recall for both fallacy detection and fallacy classification.
摘要：逻辑谬误在陈述的构建过程中使用了无效或错误的推理。尽管逻辑谬误普遍存在且危害巨大，但检测和分类逻辑谬误仍然是一项具有挑战性的任务。我们观察到，逻辑谬误经常使用连接词来表示两个论证之间的预期逻辑关系，而论证语义实际上并不支持这种逻辑关系。受此观察的启发，我们提出构建一个逻辑结构树来明确表示和跟踪陈述中关系连接词及其论证之间的层次逻辑流。具体而言，该逻辑结构树以无监督的方式构建，由成分句法树和十种常见逻辑关系的连接词分类法指导，关系连接词作为非终端节点，文本论证作为终端节点，后者大多是基本话语单位。我们进一步开发了两种策略，将逻辑结构树纳入 LLM 进行谬误推理。首先，我们将树转换为自然语言描述，并将文本化的树作为硬文本提示的一部分输入到 LLM 中。其次，我们推导出一个关系感知树嵌入，并将该树嵌入作为软提示插入到 LLM 中。在基准数据集上的实验表明，我们基于逻辑结构树的方法显著提高了谬误检测和谬误分类的精确率和召回率。

Title: Sabiá-3 Technical Report

Authors: Hugo Abonizio, Thales Sales Almeida, Thiago Laitz, Roseval Malaquias Junior, Giovana Kerche Bonás, Rodrigo Nogueira, Ramon Pires
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12049
Pdf URL: https://arxiv.org/pdf/2410.12049
Copy Paste: [[2410.12049]] Sabiá-3 Technical Report(https://arxiv.org/abs/2410.12049)
Keywords: language model, llm
Abstract: This report presents Sabiá-3, our new flagship language model trained on a large Brazilian-centric corpus. Evaluations across diverse professional and academic benchmarks show strong performance on Portuguese and Brazil-related tasks. Sabiá-3 shows large improvements over our previous best model, Sabiá-2 Medium, especially in reasoning-intensive tasks. Notably, Sabiá-3's average performance matches frontier LLMs, while it is offered at a three to four times lower cost per token, reinforcing the benefits of domain specialization.
摘要：本报告介绍了 Sabiá-3，这是我们在以巴西为中心的大型语料库上训练的新旗舰语言模型。在各种专业和学术基准上的评估表明，该模型在葡萄牙语和巴西相关任务上表现强劲。与我们之前的最佳模型 Sabiá-2 Medium 相比，Sabiá-3 显示出了巨大的进步，尤其是在推理密集型任务中。值得注意的是，Sabiá-3 的平均性能与前沿 LLM 相当，而其每个 token 的成本却低了三到四倍，这进一步增强了领域专业化的优势。

Title: Skill-LLM: Repurposing General-Purpose LLMs for Skill Extraction

Authors: Amirhossein Herandi, Yitao Li, Zhanlin Liu, Ximin Hu, Xiao Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12052
Pdf URL: https://arxiv.org/pdf/2410.12052
Copy Paste: [[2410.12052]] Skill-LLM: Repurposing General-Purpose LLMs for Skill Extraction(https://arxiv.org/abs/2410.12052)
Keywords: language model, llm
Abstract: Accurate skill extraction from job descriptions is crucial in the hiring process but remains challenging. Named Entity Recognition (NER) is a common approach used to address this issue. With the demonstrated success of large language models (LLMs) in various NLP tasks, including NER, we propose fine-tuning a specialized Skill-LLM and a lightweight model to improve the precision and quality of skill extraction. In our study, we evaluated the fine-tuned Skill-LLM and the lightweight model using a benchmark dataset and compared their performance against state-of-the-art (SOTA) methods. Our results show that this approach outperforms existing SOTA techniques.
摘要：从职位描述中准确提取技能对于招聘流程至关重要，但仍然具有挑战性。命名实体识别 (NER) 是解决此问题的常用方法。随着大型语言模型 (LLM) 在包括 NER 在内的各种 NLP 任务中的成功应用，我们建议对专门的 Skill-LLM 和轻量级模型进行微调，以提高技能提取的精度和质量。在我们的研究中，我们使用基准数据集评估了微调后的 Skill-LLM 和轻量级模型，并将其性能与最先进 (SOTA) 方法进行了比较。我们的结果表明，这种方法优于现有的 SOTA 技术。

Title: Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned

Authors: Cassandra L. Jacobs, Loïc Grobol, Alvin Tsang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12057
Pdf URL: https://arxiv.org/pdf/2410.12057
Copy Paste: [[2410.12057]] Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned(https://arxiv.org/abs/2410.12057)
Keywords: language model
Abstract: In this work we compare the generative behavior at the next token prediction level in several language models by comparing them to human productions in the cloze task. We find that while large models trained for longer are typically better estimators of human productions, they reliably under-estimate the probabilities of human responses, over-rank rare responses, under-rank top responses, and produce highly distinct semantic spaces. Altogether, this work demonstrates in a tractable, interpretable domain that LM generations cannot be used as replacements of or models of the cloze task.
摘要：在这项研究中，我们通过将几种语言模型与完形填空任务中的人类生成结果进行比较，比较了它们在下一个标记预测级别的生成行为。我们发现，虽然经过较长时间训练的大型模型通常可以更好地估计人类生成结果，但它们确实低估了人类反应的概率，对罕见反应的排名过高，对顶级反应的排名过低，并产生了高度不同的语义空间。总之，这项研究在一个易于处理、可解释的领域表明，LM 生成不能用作完形填空任务的替代品或模型。

Title: LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text

Authors: Ben Hagag, Liav Harpaz, Gil Semo, Dor Bernsohn, Rohit Saha, Pashootan Vaezipoor, Kyryl Truskovskyi, Gerasimos Spanakis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12064
Pdf URL: https://arxiv.org/pdf/2410.12064
Copy Paste: [[2410.12064]] LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text(https://arxiv.org/abs/2410.12064)
Keywords: language model
Abstract: This paper presents the results of the LegalLens Shared Task, focusing on detecting legal violations within text in the wild across two sub-tasks: LegalLens-NER for identifying legal violation entities and LegalLens-NLI for associating these violations with relevant legal contexts and affected individuals. Using an enhanced LegalLens dataset covering labor, privacy, and consumer protection domains, 38 teams participated in the task. Our analysis reveals that while a mix of approaches was used, the top-performing teams in both tasks consistently relied on fine-tuning pre-trained language models, outperforming legal-specific models and few-shot methods. The top-performing team achieved a 7.11% improvement in NER over the baseline, while NLI saw a more marginal improvement of 5.7%. Despite these gains, the complexity of legal texts leaves room for further advancements.
摘要：本文介绍了 LegalLens 共享任务的结果，该任务侧重于通过两个子任务检测文本中的违法行为：LegalLens-NER 用于识别违法实体，LegalLens-NLI 用于将这些违法行为与相关法律背景和受影响个人相关联。使用增强的 LegalLens 数据集（涵盖劳工、隐私和消费者保护领域），38 个团队参与了该任务。我们的分析表明，虽然使用了多种方法，但两项任务中表现最好的团队始终依赖于微调预训练语言模型，其表现优于法律专用模型和少量方法。表现最好的团队在 NER 方面比基线提高了 7.11%，而 NLI 的改进幅度较小，为 5.7%。尽管取得了这些进步，但法律文本的复杂性仍留下了进一步改进的空间。

Title: De-jargonizing Science for Journalists with GPT-4: A Pilot Study

Authors: Sachita Nishal, Eric Lee, Nicholas Diakopoulos
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2410.12069
Pdf URL: https://arxiv.org/pdf/2410.12069
Copy Paste: [[2410.12069]] De-jargonizing Science for Journalists with GPT-4: A Pilot Study(https://arxiv.org/abs/2410.12069)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: This study offers an initial evaluation of a human-in-the-loop system leveraging GPT-4 (a large language model or LLM), and Retrieval-Augmented Generation (RAG) to identify and define jargon terms in scientific abstracts, based on readers' self-reported knowledge. The system achieves fairly high recall in identifying jargon and preserves relative differences in readers' jargon identification, suggesting personalization as a feasible use-case for LLMs to support sense-making of complex information. Surprisingly, using only abstracts for context to generate definitions yields slightly more accurate and higher quality definitions than using RAG-based context from the fulltext of an article. The findings highlight the potential of generative AI for assisting science reporters, and can inform future work on developing tools to simplify dense documents.
摘要：本研究对利用 GPT-4（大型语言模型或 LLM）和检索增强生成 (RAG) 的人机协同系统进行了初步评估，该系统基于读者自我报告的知识来识别和定义科学摘要中的术语。该系统在识别术语方面实现了相当高的召回率，并保留了读者术语识别的相对差异，这表明个性化是 LLM 支持复杂信息理解的可行用例。令人惊讶的是，仅使用摘要作为上下文来生成定义比使用基于 RAG 的文章全文上下文生成的定义略微更准确且质量更高。研究结果凸显了生成式人工智能在协助科学记者方面的潜力，并为未来开发简化密集文档的工具的工作提供参考。

Title: OMCAT: Omni Context Aware Transformer

Authors: Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.12109
Pdf URL: https://arxiv.org/pdf/2410.12109
Copy Paste: [[2410.12109]] OMCAT: Omni Context Aware Transformer(https://arxiv.org/abs/2410.12109)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset and model, called OCTAV and OMCAT respectively. First, OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video. Second, OMCAT (Omni Context Aware Transformer) is a powerful model that leverages RoTE (Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Through a robust three-stage training pipeline (feature alignment, instruction tuning, and OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment, as validated through comprehensive experiments and ablation studies. Our dataset and code will be made publicly available. The link to our demo page is this https URL.
摘要：大型语言模型 (LLM) 在文本生成和理解方面取得了重大进展，最近的进展扩展到集成视觉和音频输入的多模态 LLM。然而，这些模型仍然难以进行细粒度、跨模态时间理解，特别是在关联音频和视频流中的事件时。我们通过两个关键贡献来解决这些挑战：一个新的数据集和模型，分别称为 OCTAV 和 OMCAT。OCTAV（全上下文和时间音频视频）是一种新颖的数据集，旨在捕获音频和视频之间的事件转换。其次，OMCAT（全上下文感知变换器）是一个强大的模型，它利用 RoTE（旋转时间嵌入），这是 RoPE 的创新扩展，以增强时间锚定任务中的时间基础和计算效率。通过强大的三阶段训练管道 - 特征对齐、指令调整和 OCTAV 特定训练 - OMCAT 在跨模态时间理解方面表现出色。我们的模型在音频-视频问答 (AVQA) 任务和 OCTAV 基准上表现出了最佳性能，通过全面的实验和消融研究验证了时间推理和跨模态对齐方面的显著提升。我们的数据集和代码将公开。我们的演示页面链接为这个 https URL。

Title: Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning

Authors: Huiwen Wu, Xiaohan Li, Xiaogang Xu, Jiafei Wu, Deyi Zhang, Zhe Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12130
Pdf URL: https://arxiv.org/pdf/2410.12130
Copy Paste: [[2410.12130]] Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning(https://arxiv.org/abs/2410.12130)
Keywords: language model, llm, hallucination
Abstract: The development of Large Language Models (LLMs) has significantly advanced various AI applications in commercial and scientific research fields, such as scientific literature summarization, writing assistance, and knowledge graph construction. However, a significant challenge is the high risk of hallucination during LLM inference, which can lead to security concerns like factual inaccuracies, inconsistent information, and fabricated content. To tackle this issue, it is essential to develop effective methods for reducing hallucination while maintaining the original capabilities of the LLM. This paper introduces a novel approach called Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination. This method modifies the representation layers of pre-trained LLMs by using contrastive 'positive' and 'negative' models, trained on data with and without hallucinations. By leveraging the differences between these two models, we create a more straightforward pathway to eliminate hallucinations, and the iterative nature of contrastive learning further enhances performance. Experimental validation on four pre-trained foundation LLMs (LLaMA2, Alpaca, LLaMA3, and Qwen) fine-tuned with a specially designed dataset shows that our approach achieves an average improvement of 10.1 points on the TruthfulQA benchmark. Comprehensive experiments demonstrate the effectiveness of Iter-AHMCL in reducing hallucination while maintaining the general capabilities of LLMs.
摘要：大型语言模型 (LLM) 的发展极大地推动了商业和科研领域的各种 AI 应用，例如科学文献摘要、写作辅助和知识图谱构建。然而，一个重大挑战是 LLM 推理过程中出现幻觉的风险很高，这可能导致安全问题，例如事实不准确、信息不一致和捏造内容。为了解决这个问题，必须开发有效的方法来减少幻觉，同时保持 LLM 的原有功能。本文介绍了一种称为迭代模型级对比学习 (Iter-AHMCL) 的新方法来解决幻觉问题。该方法通过使用对比的“正”和“负”模型来修改预训练 LLM 的表示层，这些模型在有幻觉和没有幻觉的数据上进行训练。通过利用这两个模型之间的差异，我们创建了一条更直接的途径来消除幻觉，而对比学习的迭代性质进一步提高了性能。使用专门设计的数据集对四个预训练的基础 LLM（LLaMA2、Alpaca、LLaMA3 和 Qwen）进行微调的实验验证表明，我们的方法在 TruthfulQA 基准上实现了 10.1 分的平均提升。全面的实验证明了 Iter-AHMCL 在减少幻觉方面的有效性，同时保持了 LLM 的一般能力。

Title: Layer-of-Thoughts Prompting (LoT): Leveraging LLM-Based Retrieval with Constraint Hierarchies

Authors: Wachara Fungwacharakorn, Nguyen Ha Thanh, May Myo Zin, Ken Satoh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12153
Pdf URL: https://arxiv.org/pdf/2410.12153
Copy Paste: [[2410.12153]] Layer-of-Thoughts Prompting (LoT): Leveraging LLM-Based Retrieval with Constraint Hierarchies(https://arxiv.org/abs/2410.12153)
Keywords: language model, llm, prompt
Abstract: This paper presents a novel approach termed Layer-of-Thoughts Prompting (LoT), which utilizes constraint hierarchies to filter and refine candidate responses to a given query. By integrating these constraints, our method enables a structured retrieval process that enhances explainability and automation. Existing methods have explored various prompting techniques but often present overly generalized frameworks without delving into the nuances of prompts in multi-turn interactions. Our work addresses this gap by focusing on the hierarchical relationships among prompts. We demonstrate that the efficacy of thought hierarchy plays a critical role in developing efficient and interpretable retrieval algorithms. Leveraging Large Language Models (LLMs), LoT significantly improves the accuracy and comprehensibility of information retrieval tasks.
摘要：本文介绍了一种称为“思维层次提示”（LoT）的新方法，该方法利用约束层次结构来过滤和优化给定查询的候选响应。通过整合这些约束，我们的方法可以实现结构化的检索过程，从而增强可解释性和自动化程度。现有方法已经探索了各种提示技术，但通常呈现过于笼统的框架，而没有深入研究多轮交互中提示的细微差别。我们的工作通过关注提示之间的层次关系来解决这一差距。我们证明了思维层次的有效性在开发高效且可解释的检索算法中起着关键作用。利用大型语言模型 (LLM)，LoT 显著提高了信息检索任务的准确性和可理解性。
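
As a rough illustration of filtering candidates through a constraint hierarchy, the sketch below applies layers of constraints from strongest to weakest. The abstract does not specify the algorithm, so the function name `filter_by_hierarchy`, the rule that a layer is skipped when it would eliminate every candidate, and the example documents are all assumptions made for illustration. In an LLM-based system each check would presumably be an LLM call; plain string predicates stand in for them here.

```python
from typing import Callable, List, Sequence

# A constraint is any predicate over a candidate response (hypothetical type).
Constraint = Callable[[str], bool]


def filter_by_hierarchy(
    candidates: Sequence[str], layers: Sequence[Sequence[Constraint]]
) -> List[str]:
    """Apply constraint layers in order, strongest layer first.

    Assumption: a layer that would remove every remaining candidate is
    skipped, so weaker layers only refine the result and never empty it.
    """
    remaining = list(candidates)
    for layer in layers:
        kept = [c for c in remaining if all(check(c) for check in layer)]
        if kept:
            remaining = kept
    return remaining


# Toy candidates and layers, invented for this example.
docs = [
    "Article 12: a tenant may terminate a lease with notice",
    "Article 40: penalties for late tax filing",
    "Article 13: a landlord must return the lease deposit",
]
layers = [
    [lambda d: "lease" in d],   # strongest layer: topical match
    [lambda d: "tenant" in d],  # weaker layer: party mentioned in the query
    [lambda d: "pets" in d],    # weakest layer: no candidate satisfies it
]
result = filter_by_hierarchy(docs, layers)
print(result)
```

Here the first two layers narrow three candidates down to one, and the last layer is skipped because no candidate satisfies it.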

Title: Exploiting LLMs' Reasoning Capability to Infer Implicit Concepts in Legal Information Retrieval

Authors: Hai-Long Nguyen, Tan-Minh Nguyen, Duc-Minh Nguyen, Thi-Hai-Yen Vuong, Ha-Thanh Nguyen, Xuan-Hieu Phan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12154
Pdf URL: https://arxiv.org/pdf/2410.12154
Copy Paste: [[2410.12154]] Exploiting LLMs' Reasoning Capability to Infer Implicit Concepts in Legal Information Retrieval(https://arxiv.org/abs/2410.12154)
Keywords: language model, llm
Abstract: Statutory law retrieval is a typical problem in legal language processing, with various practical applications in law engineering. Modern deep learning-based retrieval methods have achieved significant results for this problem. However, retrieval systems relying on semantic and lexical correlations often exhibit limitations, particularly when handling queries that involve real-life scenarios or use vocabulary that is not specific to the legal domain. In this work, we focus on overcoming these weaknesses by utilizing the logical reasoning capabilities of large language models (LLMs) to identify relevant legal terms and facts related to the situation mentioned in the query. The proposed retrieval system integrates additional information from term-based expansion and query reformulation to improve retrieval accuracy. The experiments on the COLIEE 2022 and COLIEE 2023 datasets show that extra knowledge from LLMs helps to improve the retrieval results of both lexical and semantic ranking models. The final ensemble retrieval system outperformed the highest results among all participating teams in the COLIEE 2022 and 2023 competitions.
摘要：成文法检索是法律语言处理中的一个典型问题，在法律工程中有各种实际应用。现代基于深度学习的检索方法已经为该问题取得了显著成果。然而，依赖于语义和词汇相关性的检索系统往往存在局限性，特别是在处理涉及现实场景的查询或使用非特定于法律领域的词汇时。在这项工作中，我们专注于克服这一弱点，利用大型语言模型 (LLM) 的逻辑推理能力来识别与查询中提到的情况相关的法律术语和事实。所提出的检索系统集成了基于术语的扩展和查询重构的附加信息，以提高检索准确性。在 COLIEE 2022 和 COLIEE 2023 数据集上的实验表明，来自 LLM 的额外知识有助于提高词汇和语义排名模型的检索结果。最终的集成检索系统超越了 COLIEE 2022 和 2023 年比赛所有参赛队伍中的最高成绩。

Title: Table-LLM-Specialist: Language Model Specialists for Tables using Iterative Generator-Validator Fine-tuning

Authors: Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang, Surajit Chaudhuri
Subjects: cs.CL, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12164
Pdf URL: https://arxiv.org/pdf/2410.12164
Copy Paste: [[2410.12164]] Table-LLM-Specialist: Language Model Specialists for Tables using Iterative Generator-Validator Fine-tuning(https://arxiv.org/abs/2410.12164)
Keywords: language model, gpt, llm
Abstract: In this work, we propose Table-LLM-Specialist, or Table-Specialist for short, as a new self-trained fine-tuning paradigm specifically designed for table tasks. Our insight is that for each table task, there often exist two dual versions of the same task, one generative and one classification in nature. Leveraging their duality, we propose a Generator-Validator paradigm to iteratively generate-then-validate training data from language models, to fine-tune stronger Table-Specialist models that can specialize in a given task, without requiring manually-labeled data. Our extensive evaluations suggest that our Table-Specialist has (1) strong performance on diverse table tasks over vanilla language models -- for example, Table-Specialist fine-tuned on GPT-3.5 not only outperforms vanilla GPT-3.5, but can often match or surpass GPT-4 level quality, (2) lower cost to deploy, because when Table-Specialist fine-tuned on GPT-3.5 achieves GPT-4 level quality, it becomes possible to deploy smaller models with lower latency and inference cost, with comparable quality, and (3) better generalizability when evaluated across multiple benchmarks, since Table-Specialist is fine-tuned on a broad range of training data systematically generated from diverse real tables. Our code and data will be available at this https URL.
摘要：在这项工作中，我们提出了 Table-LLM-Specialist（简称 Table-Specialist），这是一种专为表格任务设计的新型自训练微调范式。我们的见解是，对于每个表格任务，通常存在同一任务的两个对偶版本，一个是生成性的，一个是分类性的。利用它们的对偶性，我们提出了一个生成器-验证器范式，以迭代方式从语言模型生成然后验证训练数据，以微调更强大的 \sys 模型，这些模型可以专门用于给定任务，而无需手动标记的数据。我们广泛的评估表明，我们的 Table-Specialist (1) \textit{在各种表格任务上的表现优于原始语言模型 - 例如，在 GPT-3.5 上微调的 Table-Specialist 不仅优于原始 GPT-3.5，而且通常可以达到或超过 GPT-4 级别的质量，(2) \textit{部署成本更低}，因为当在 GPT-3.5 上微调的 Table-Specialist 达到 GPT-4 级别质量时，可以部署具有较低延迟和推理成本且质量相当的较小模型，以及 (3) \textit{更好的通用性}在跨多个基准进行评估时，因为 \sys 是在从各种真实表系统地生成的广泛训练数据上进行微调的。我们的代码和数据将在此 https URL 上提供。

Title: Exploring Large Language Models for Hate Speech Detection in Rioplatense Spanish

Authors: Juan Manuel Pérez, Paula Miguel, Viviana Cotik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12174
Pdf URL: https://arxiv.org/pdf/2410.12174
Copy Paste: [[2410.12174]] Exploring Large Language Models for Hate Speech Detection in Rioplatense Spanish(https://arxiv.org/abs/2410.12174)
Keywords: language model, gpt, chat, chain-of-thought
Abstract: Hate speech detection deals with many language variants, slang, slurs, expression modalities, and cultural nuances. This underscores the importance of working with specific corpora when addressing hate speech within the scope of Natural Language Processing, recently revolutionized by the irruption of Large Language Models. This work presents a brief analysis of the performance of large language models in the detection of hate speech for Rioplatense Spanish. We performed classification experiments leveraging chain-of-thought reasoning with ChatGPT 3.5, Mixtral, and Aya, comparing their results with those of a state-of-the-art BERT classifier. These experiments show that, even if large language models show a lower precision compared to the fine-tuned BERT classifier and, in some cases, they find hard-to-get slurs or colloquialisms, they still are sensitive to highly nuanced cases (particularly, homophobic/transphobic hate speech). We make our code and models publicly available for future research.
摘要：仇恨言论检测涉及多种语言变体、俚语、诽谤、表达方式和文化细微差别。这概述了在自然语言处理范围内处理仇恨言论时使用特定语料库的重要性，而大型语言模型的出现最近彻底改变了这一领域。这项工作简要分析了大型语言模型在检测 Rioplatense 西班牙语仇恨言论方面的表现。我们利用 ChatGPT 3.5、Mixtral 和 Aya 的思路链推理进行了分类实验，并将其结果与最先进的 BERT 分类器的结果进行了比较。这些实验概述了，即使大型语言模型与经过微调的 BERT 分类器相比显示出较低的精度，并且在某些情况下，它们会发现难以获得的诽谤或口语，但它们仍然对高度细微差别的情况很敏感（尤其是恐同/恐跨性别仇恨言论）。我们将我们的代码和模型公开，以供将来研究。

Title: Negative-Prompt-driven Alignment for Generative Language Model

Authors: Shiqi Qiao, Ning Xv, Biao Liu, Xin Geng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12194
Pdf URL: https://arxiv.org/pdf/2410.12194
Copy Paste: [[2410.12194]] Negative-Prompt-driven Alignment for Generative Language Model(https://arxiv.org/abs/2410.12194)
Keywords: language model, prompt
Abstract: Large language models have achieved remarkable capabilities, but aligning their outputs with human values and preferences remains a significant challenge. Existing alignment methods primarily focus on positive examples while overlooking the importance of negative responses in guiding models away from undesirable behaviors. For instance, the widely used alignment datasets reveal a scarcity of explicit negative examples that contradict human values, hindering their ability to discourage harmful or biased outputs during training. To address this limitation, we propose NEAT, i.e., NEgative-prompt-driven AlignmenT, to introduce negative prompts to generate undesirable responses alongside positive examples during the optimization process. NEAT explicitly penalizes the model for producing harmful outputs, guiding it not only toward desirable behaviors but also steering it away from generating undesirable, biased responses. This dual feedback mechanism enables better alignment with human preferences, crucial in contexts where avoiding harm is paramount. Starting from a pre-trained language model, NEAT performs online alignment by incorporating a ranking loss derived from an expanded preference dataset containing both positive and negative examples. Extensive experiments validate NEAT's effectiveness in significantly enhancing language models' alignment with human values and preferences.
摘要：大型语言模型已经实现了卓越的能力，但使其输出与人类价值观和偏好保持一致仍然是一项重大挑战。现有的对齐方法主要关注正面示例，而忽略了负面反应在引导模型远离不良行为方面的重要性。例如，广泛使用的对齐数据集表明，与人类价值观相矛盾的明确负面示例很少，这阻碍了其在训练期间阻止有害或有偏见的输出的能力。为了解决这一限制，我们提出了 NEAT，即 NEgative-prompt-driven AlignmenT，在优化过程中引入负面提示以生成不良反应以及正面示例。NEAT 明确惩罚产生有害输出的模型，不仅引导它朝着理想的行为发展，而且还引导它远离产生不良的、有偏见的反应。这种双重反馈机制可以更好地与人类偏好保持一致，这在避免伤害至关重要的情况下至关重要。从预先训练的语言模型开始，NEAT 通过结合从包含正面和负面示例的扩展偏好数据集中得出的排名损失来执行在线对齐。大量实验验证了 NEAT 在显著增强语言模型与人类价值观和偏好的一致性方面的有效性。

Title: On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation

Authors: Xiaonan Jing, Srinivas Billa, Danny Godbout
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12222
Pdf URL: https://arxiv.org/pdf/2410.12222
Copy Paste: [[2410.12222]] On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation(https://arxiv.org/abs/2410.12222)
Keywords: language model, gpt, llm, hallucination
Abstract: Hallucination has been a popular topic in natural language generation (NLG). In real-world applications, unfaithful content can result in bad data quality or loss of trust from end users. Thus, it is crucial to fact-check before adopting NLG for production usage, which can be expensive if done manually. In this paper, we investigate automated faithfulness evaluation in guided NLG. We developed a rubric template and used large language models (LLMs) to score the generation on quantifiable scales. We compared popular LLMs as well as the widely adopted natural language inference (NLI) models in scoring quality and sensitivity. In addition, we developed methods to generate synthetic unfaithful data, as well as a heuristic to quantify the percentage of hallucination. Our results on 4 travel-domain industry datasets show that GPT-4 can provide accurate judgement and explanation on whether a source and a generation are factually consistent. Furthermore, we found that tuning NLI models on synthetic data can improve performance. Lastly, we present insights on latency and cost for deploying such a system.
摘要：幻觉一直是自然语言生成 (NLG) 中的热门话题。在实际应用中，不真实的内容会导致数据质量差或失去最终用户的信任。因此，在将 NLG 用于生产用途之前，进行事实核查至关重要，如果手动进行，成本会很高。在本文中，我们研究了引导式 NLG 中的自动忠诚度评估。我们开发了一个评分标准模板，并使用大型语言模型 (LLM) 将生成内容评分为可量化的量表。我们在评分质量和敏感度方面比较了流行的 LLM 以及广泛采用的自然语言推理 (NLI) 模型。此外，我们还开发了生成合成不真实数据的方法，以及量化幻觉百分比的启发式方法。我们在 4 个旅游领域行业数据集上的结果表明，GPT-4 可以准确判断和解释来源和生成内容是否在事实上一致。此外，我们发现在合成数据上调整 NLI 模型可以提高性能。最后，我们介绍了部署此类系统的延迟和成本。

Title: EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

Authors: Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai
Subjects: cs.CL, cs.DC
Abstract URL: https://arxiv.org/abs/2410.12247
Pdf URL: https://arxiv.org/pdf/2410.12247
Copy Paste: [[2410.12247]] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference(https://arxiv.org/abs/2410.12247)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have revolutionized the field of artificial intelligence, with their capabilities expanding rapidly due to advances in deep learning and increased computational resources. The mixture-of-experts (MoE) model has emerged as a prominent architecture in the field of LLMs, better balancing model performance and computational efficiency. The MoE architecture allows for effective scaling and efficient parallel processing, but the GEMM (General Matrix Multiply) of MoE and the large parameters introduce challenges in terms of computation efficiency and communication overhead, which become the throughput bottleneck during inference. Applying a single parallelism strategy like EP, DP, PP, etc. to the MoE architecture usually achieves sub-optimal inference throughput, and straightforward combinations of existing parallelisms on MoE cannot yet obtain optimal inference throughput. This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that goes beyond the existing inference parallelism schemes. Our approach focuses on optimizing the computation of MoE FFN (FeedForward Network) modules by dynamically selecting the best kernel implementation of GroupGemm and DenseGemm for different loads and adaptively overlapping these computations with all2all communication, leading to a substantial increase in throughput. Our experimental results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods. Specifically, we validated our method on DeepSeekV2, a highly optimized model claimed to achieve a prefill throughput of 100K tokens per second. By applying EPS-MoE, we further accelerated it to at least 120K tokens per second.
摘要：大型语言模型 (LLM) 彻底改变了人工智能领域，随着深度学习的进步和计算资源的增加，其能力迅速扩展。混合专家 (MoE) 模型已成为 LLM 领域的一种重要架构，可以更好地平衡模型性能和计算效率。MoE 架构可以实现有效的扩展和高效的并行处理，但是 MoE 的 GEMM (广义矩阵乘法) 和大型参数在计算效率和通信开销方面带来了挑战，成为推理期间的吞吐量瓶颈。将 EP、DP、PP 等单一并行策略应用于 MoE 架构通常会实现次优的推理吞吐量，而直接组合 MoE 上现有的不同并行性还无法获得最佳的推理吞吐量。本文介绍了一种新颖的 MoE 专家流水线调度器 EPS-MoE，它超越了现有的推理并行方案。我们的方法侧重于优化 MoE FFN（前馈网络）模块的计算，通过动态选择针对不同负载的最佳 GroupGemm 和 DenseGemm 内核实现，并自适应地将这些计算与 \textit{all2all} 通信重叠，从而大幅提高吞吐量。我们的实验结果表明，与现有的并行推理方法相比，预填充吞吐量平均提高了 21%。具体来说，我们在 DeepSeekV2 上验证了我们的方法，这是一个高度优化的模型，据称可实现每秒 100K 个令牌的预填充吞吐量。通过应用 EPS-MoE，我们将其进一步加速到至少每秒 120K 个令牌。
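
To make the benefit of overlapping computation with communication concrete, the sketch below uses a generic two-stage pipeline timing model and then checks the relative gain implied by the DeepSeekV2 figures in the abstract (100K to at least 120K tokens per second). The timing model, the function names, and the per-chunk timings are assumptions chosen for illustration; they are not taken from EPS-MoE itself.

```python
def serial_time(n_chunks: int, compute_ms: float, comm_ms: float) -> float:
    """Each chunk computes, then communicates, with no overlap."""
    return n_chunks * (compute_ms + comm_ms)


def overlapped_time(n_chunks: int, compute_ms: float, comm_ms: float) -> float:
    """Two-stage pipeline: chunk i communicates while chunk i+1 computes.

    The first compute and the last communication cannot be hidden; in
    between, the slower of the two stages sets the pace.
    """
    return compute_ms + (n_chunks - 1) * max(compute_ms, comm_ms) + comm_ms


def relative_gain(old_tps: float, new_tps: float) -> float:
    """Fractional throughput improvement of new_tps over old_tps."""
    return new_tps / old_tps - 1.0


# Hypothetical per-chunk timings, not measurements from the paper.
t_serial = serial_time(4, compute_ms=3.0, comm_ms=2.0)
t_overlap = overlapped_time(4, compute_ms=3.0, comm_ms=2.0)
print(t_serial, t_overlap)

# Throughput figures quoted in the abstract for DeepSeekV2 prefill.
gain = relative_gain(100_000, 120_000)
print(f"{gain:.0%}")
```

Under these made-up timings, overlap cuts the total from 20 ms to 14 ms. The quoted throughput figures correspond to a gain of at least 20%.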

Title: CoFE-RAG: A Comprehensive Full-chain Evaluation Framework for Retrieval-Augmented Generation with Enhanced Data Diversity

Authors: Jintao Liu, Ruixue Ding, Linhao Zhang, Pengjun Xie, Fie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12248
Pdf URL: https://arxiv.org/pdf/2410.12248
Copy Paste: [[2410.12248]] CoFE-RAG: A Comprehensive Full-chain Evaluation Framework for Retrieval-Augmented Generation with Enhanced Data Diversity(https://arxiv.org/abs/2410.12248)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) aims to enhance large language models (LLMs) to generate more accurate and reliable answers with the help of the retrieved context from external knowledge sources, thereby reducing the incidence of hallucinations. Despite the advancements, evaluating these systems remains a crucial research area due to the following issues: (1) Limited data diversity: The insufficient diversity of knowledge sources and query types constrains the applicability of RAG systems; (2) Obscure problem location: Existing evaluation methods have difficulty in locating the stage of the RAG pipeline where problems occur; (3) Unstable retrieval evaluation: These methods often fail to effectively assess retrieval performance, particularly when the chunking strategy changes. To tackle these challenges, we propose a Comprehensive Full-chain Evaluation (CoFE-RAG) framework to facilitate thorough evaluation across the entire RAG pipeline, including chunking, retrieval, reranking, and generation. To effectively evaluate the first three phases, we introduce multi-granularity keywords, including coarse-grained and fine-grained keywords, to assess the retrieved context instead of relying on the annotation of golden chunks. Moreover, we release a holistic benchmark dataset tailored for diverse data scenarios covering a wide range of document formats and query types. We demonstrate the utility of the CoFE-RAG framework by conducting experiments to evaluate each stage of RAG systems. Our evaluation method provides unique insights into the effectiveness of RAG systems in handling diverse data scenarios, offering a more nuanced understanding of their capabilities and limitations.
摘要：检索增强生成 (RAG) 旨在增强大型语言模型 (LLM)，借助从外部知识源检索到的上下文来生成更准确、可靠的答案，从而降低幻觉的发生率。尽管取得了这些进步，但由于以下问题，评估这些系统仍然是一个至关重要的研究领域：（1）数据多样性有限：知识源和查询类型的多样性不足限制了 RAG 系统的适用性；（2）问题定位不清晰：现有的评估方法难以定位出现问题的 RAG 流程阶段；（3）检索评估不稳定：这些方法通常无法有效评估检索性能，尤其是在分块策略发生变化时。为了应对这些挑战，我们提出了一个综合全链评估 (CoFE-RAG) 框架，以促进对整个 RAG 流程进行全面评估，包括分块、检索、重新排序和生成。为了有效地评估前三个阶段，我们引入了多粒度关键字，包括粗粒度和细粒度关键字，以评估检索到的上下文，而不是依赖于黄金块的注释。此外，我们发布了针对各种数据场景量身定制的整体基准数据集，涵盖了广泛的文档格式和查询类型。我们通过进行实验来评估 RAG 系统的每个阶段，从而展示了 CoFE-RAG 框架的实用性。我们的评估方法提供了对 RAG 系统在处理各种数据场景方面的有效性的独特见解，从而提供了对其功能和局限性的更细致的理解。
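
One way to picture keyword-based assessment of retrieved context, which avoids annotating golden chunks, is a simple keyword recall score computed separately for coarse-grained and fine-grained keywords. This is a minimal sketch only: the abstract does not define the metric, so the `keyword_recall` function, the substring matching rule, and the example chunks and keywords are hypothetical.

```python
from typing import Sequence


def keyword_recall(chunks: Sequence[str], keywords: Sequence[str]) -> float:
    """Fraction of keywords found anywhere in the retrieved chunks.

    Assumption: case-insensitive substring matching. Because the score does
    not depend on chunk boundaries, it is unaffected by re-chunking.
    """
    if not keywords:
        return 0.0
    text = " ".join(chunks).lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)


# Invented retrieval output and keyword annotations for one query.
retrieved = [
    "The warranty covers battery defects for 24 months.",
    "Shipping takes five days.",
]
coarse_keywords = ["warranty", "battery"]
fine_keywords = ["24 months", "proof of purchase"]

coarse_score = keyword_recall(retrieved, coarse_keywords)
fine_score = keyword_recall(retrieved, fine_keywords)
print(coarse_score, fine_score)
```

In this toy case the retrieved context covers the topic (coarse recall 1.0) but misses one specific detail (fine recall 0.5).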

Title: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

Authors: Junjie Chen, Weihang Su, Zhumin Chu, Haitao Li, Qinyao Ai, Yiqun Liu, Min Zhang, Shaoping Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12265
Pdf URL: https://arxiv.org/pdf/2410.12265
Copy Paste: [[2410.12265]] An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation(https://arxiv.org/abs/2410.12265)
Keywords: language model, llm, prompt
Abstract: With the rapid development of large language models (LLMs), how to efficiently evaluate them has become an important research question. Existing evaluation methods often suffer from high costs, limited test formats, the need for human references, and systematic evaluation biases. To address these limitations, our study introduces Auto-PRE, an automatic LLM evaluation framework based on peer review. In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluator LLMs automatically based on their inherent traits including consistency, self-confidence, and pertinence. We conduct extensive experiments on three tasks: summary generation, non-factoid question-answering, and dialogue generation. Experimental results indicate that Auto-PRE achieves state-of-the-art performance at a lower cost. Moreover, our study highlights the impact of prompt strategies and evaluation formats on evaluation performance, offering guidance for method optimization in the future.
摘要：随着大型语言模型 (LLM) 的快速发展，如何有效地对其进行评估已成为一个重要的研究问题。现有的评估方法通常存在成本高、测试格式有限、需要人工参考以及系统性评估偏差等问题。为了解决这些限制，我们的研究引入了 Auto-PRE，这是一个基于同行评审的自动 LLM 评估框架。与以前依赖人工注释的研究不同，Auto-PRE 根据 LLM 的固有特征（包括一致性、自信心和针对性）自动选择评估者 LLM。我们对三个任务进行了广泛的实验：摘要生成、非事实问答和对话生成。实验结果表明，我们的 Auto-PRE 以较低的成本实现了最佳性能。此外，我们的研究强调了提示策略和评估格式对评估性能的影响，为未来的方法优化提供了指导。

Title: Kallini et al. (2024) do not compare impossible languages with constituency-based ones

Authors: Tim Hunter
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12271
Pdf URL: https://arxiv.org/pdf/2410.12271
Copy Paste: [[2410.12271]] Kallini et al. (2024) do not compare impossible languages with constituency-based ones(https://arxiv.org/abs/2410.12271)
Keywords: language model, gpt, llm
Abstract: A central goal of linguistic theory is to find a precise characterization of the notion "possible human language", in the form of a computational device that is capable of describing all and only the languages that can be acquired by a typically developing human child. The success of recent large language models (LLMs) in NLP applications arguably raises the possibility that LLMs might be computational devices that meet this goal. This would only be the case if, in addition to succeeding in learning human languages, LLMs struggle to learn "impossible" human languages. Kallini et al. (2024; "Mission: Impossible Language Models", Proc. ACL) conducted experiments aiming to test this by training GPT-2 on a variety of synthetic languages, and found that it learns some more successfully than others. They present these asymmetries as support for the idea that LLMs' inductive biases align with what is regarded as "possible" for human languages, but the most significant comparison has a confound that makes this conclusion unwarranted. In this paper I explain the confound and suggest some ways forward towards constructing a comparison that appropriately tests the underlying issue.
摘要：语言理论的一个核心目标是找到“可能的人类语言”这一概念的精确描述，即一种计算设备，能够描述正常发育的人类儿童能够掌握的所有语言。最近大型语言模型 (LLM) 在 NLP 应用中的成功可以说提高了 LLM 可能是满足这一目标的计算设备的可能性。只有当 LLM 除了成功学习人类语言之外，还难以学习“不可能”的人类语言时，情况才会如此。Kallini 等人 (2024；“任务：不可能的语言模型”，Proc. ACL) 进行了旨在测试这一点的实验，通过在各种合成语言上训练 GPT-2，发现它比其他语言学习得更成功。他们提出这些不对称现象来支持这样一种观点，即 LLM 的归纳偏差与人类语言被认为“可能”的情况相一致，但最重要的比较有一个混淆因素，使得这一结论毫无根据。在本文中，我解释了这个混淆因素，并提出了一些构建比较的方法，以适当地测试潜在问题。

Title: How much do contextualized representations encode long-range context?

Authors: Simeng Sun, Cheng-Ping Hsieh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12292
Pdf URL: https://arxiv.org/pdf/2410.12292
Copy Paste: [[2410.12292]] How much do contextualized representations encode long-range context?(https://arxiv.org/abs/2410.12292)
Keywords: language model
Abstract: We analyze contextual representations in neural autoregressive language models, emphasizing long-range contexts that span several thousand tokens. Our methodology employs a perturbation setup and the metric Anisotropy-Calibrated Cosine Similarity to capture the degree of contextualization of long-range patterns from the perspective of representation geometry. We begin the analysis with a case study on standard decoder-only Transformers, demonstrating that similar perplexity can exhibit markedly different downstream task performance, which can be explained by the difference in contextualization of long-range content. Next, we extend the analysis to other models, covering recent novel architectural designs and various training configurations. The representation-level results illustrate a reduced capacity for high-complexity (i.e., less compressible) sequences across architectures, and that fully recurrent models rely heavily on local context, whereas hybrid models more effectively encode the entire sequence structure. Finally, preliminary analysis of model size and training configurations on the encoding of long-range context suggests potential directions for improving existing language models.
摘要：我们分析神经自回归语言模型中的上下文表示，强调跨越数千个标记的长距离上下文。我们的方法采用了扰动设置和度量 \emph{各向异性校准余弦相似度}，从表示几何的角度捕捉长距离模式的上下文化程度。我们从标准解码器专用 Transformers 的案例研究开始分析，表明相似的困惑度可以表现出明显不同的下游任务性能，这可以通过长距离内容的上下文化差异来解释。接下来，我们将分析扩展到其他模型，涵盖最近的新架构设计和各种训练配置。表示级结果表明，跨架构的高复杂度（即可压缩性较差）序列的容量降低，并且完全循环模型严重依赖局部上下文，而混合模型更有效地编码整个序列结构。最后，对模型大小和训练配置对长距离上下文编码的初步分析为改进现有语言模型提供了潜在的方向。

Title: Pyramid-Driven Alignment: Pyramid Principle Guided Integration of Large Language Models and Knowledge Graphs

Authors: Lei Sun, Xinchen Wang, Youdi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12298
Pdf URL: https://arxiv.org/pdf/2410.12298
Copy Paste: [[2410.12298]] Pyramid-Driven Alignment: Pyramid Principle Guided Integration of Large Language Models and Knowledge Graphs(https://arxiv.org/abs/2410.12298)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) possess impressive reasoning abilities but are prone to generating incorrect information, often referred to as hallucinations. While incorporating external Knowledge Graphs (KGs) can partially mitigate this issue, existing methods primarily treat KGs as static knowledge repositories, overlooking the critical disparity between KG and LLM knowledge, and failing to fully exploit the reasoning capabilities inherent in KGs. To address these limitations, we propose Pyramid-Driven Alignment (PDA), a novel framework for seamlessly integrating LLMs with KGs. PDA utilizes Pyramid Principle analysis to construct a hierarchical pyramid structure. This structure is designed to reflect the input question and generate more validated deductive knowledge, thereby enhancing the alignment of LLMs and KGs and ensuring more cohesive integration. Furthermore, PDA employs a recursive mechanism to harness the underlying reasoning abilities of KGs, resulting in more accurate knowledge retrieval for question-answering tasks. Our experimental results reveal a substantial performance advantage of PDA over state-of-the-art baselines, with improvements reaching 26.70% and 26.78%.
摘要：大型语言模型 (LLM) 具有出色的推理能力，但容易产生错误信息，通常称为幻觉。虽然引入外部知识图谱 (KG) 可以部分缓解此问题，但现有方法主要将 KG 视为静态知识库，忽略了 KG 和 LLM 知识之间的关键差异，并且未能充分利用 KG 固有的推理能力。为了解决这些限制，我们提出了金字塔驱动对齐 (PDA)，这是一种无缝集成 LLM 和 KG 的新框架。PDA 利用金字塔原理分析构建分层金字塔结构。该结构旨在反映输入问题并生成更多经过验证的演绎知识，从而增强 LLM 和 KG 的对齐并确保更具凝聚力的集成。此外，PDA 采用递归机制来利用 KG 的底层推理能力，从而为问答任务提供更准确的知识检索。我们的实验结果表明，PDA 的性能优势明显优于最先进的基线，提升幅度分别达到 26.70% 和 26.78%。

Title: Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors

Authors: Weixuan Wang, Jingyuan Yang, Wei Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12299
Pdf URL: https://arxiv.org/pdf/2410.12299
Copy Paste: [[2410.12299]] Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors(https://arxiv.org/abs/2410.12299)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable performance across many tasks, yet aligning them with desired behaviors remains challenging. Activation intervention has emerged as an effective and economical method to modify the behavior of LLMs. Despite considerable interest in this area, current intervention methods exclusively employ a fixed steering vector to modify model activations, lacking adaptability to diverse input semantics. To address this limitation, we propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene on model activations at inference time. More specifically, SADI utilizes activation differences in contrastive pairs to precisely identify critical elements of an LLM (i.e., attention heads, hidden states, and neurons) for targeted intervention. During inference, SADI dynamically steers model behavior by scaling element-wise activations based on the directions of input semantics. Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training. SADI's cost-effectiveness and generalizability across various LLM backbones and tasks highlight its potential as a versatile alignment technique. In addition, we release the code to foster research along this line: this https URL.
摘要：大型语言模型 (LLM) 在许多任务中都取得了显著的表现，但将它们与期望的行为保持一致仍然具有挑战性。激活干预已成为一种有效且经济的修改 LLM 行为的方法。尽管人们对这一领域有浓厚的兴趣，但目前的干预方法仅采用固定的转向向量来修改模型激活，缺乏对不同输入语义的适应性。为了解决这一限制，我们提出了语义自适应动态干预 (SADI)，这是一种构建动态转向向量以在推理时干预模型激活的新方法。更具体地说，SADI 利用对比对中的激活差异来精确识别 LLM 的关键元素（即注意力头、隐藏状态和神经元）以进行有针对性的干预。在推理过程中，SADI 通过根据输入语义的方向缩放元素激活来动态引导模型行为。实验结果表明，SADI 的表现远超既定基线，无需训练即可提高任务性能。 SADI 在各种 LLM 主干和任务中的成本效益和通用性凸显了其作为多功能对齐技术的潜力。此外，我们发布了代码以促进这方面的研究：此 https URL。
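
The two steps named in the abstract, identifying critical elements from contrastive activation differences and then scaling element-wise activations at inference time, can be sketched on toy vectors as follows. This is an illustrative guess and not the authors' implementation: the top-k selection rule, the update `a + delta * a * mask`, the parameter `delta`, and all numbers are assumptions.

```python
from typing import List, Sequence


def mean_difference(
    pos: Sequence[Sequence[float]], neg: Sequence[Sequence[float]]
) -> List[float]:
    """Per-element mean of (positive - negative) activations over paired examples."""
    dim = len(pos[0])
    return [sum(p[i] - n[i] for p, n in zip(pos, neg)) / len(pos) for i in range(dim)]


def top_k_mask(diff: Sequence[float], k: int) -> List[float]:
    """Binary mask selecting the k elements with the largest absolute difference."""
    top = set(sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)[:k])
    return [1.0 if i in top else 0.0 for i in range(len(diff))]


def steer(activation: Sequence[float], mask: Sequence[float], delta: float) -> List[float]:
    """Scale the masked elements by the input's own activation (assumed rule).

    Because the shift is proportional to the current activation, the effective
    steering vector changes with each input instead of being fixed.
    """
    return [a + delta * a * m for a, m in zip(activation, mask)]


# Toy activations for two contrastive pairs (made-up numbers).
positive = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative = [[0.0, 0.0, 2.0], [0.0, 1.0, 2.0]]

diff = mean_difference(positive, negative)
mask = top_k_mask(diff, k=1)
steered = steer([2.0, -1.0, 0.5], mask, delta=0.5)
print(diff, mask, steered)
```

Only the first element differs strongly between the contrastive sets, so it is the only one scaled for the new input.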

Title: Open Domain Question Answering with Conflicting Contexts

Authors: Siyi Liu, Qiang Ning, Kishaloy Halder, Wei Xiao, Zheng Qi, Phu Mon Htut, Yi Zhang, Neha Anna John, Bonan Min, Yassine Benajiba, Dan Roth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12311
Pdf URL: https://arxiv.org/pdf/2410.12311
Copy Paste: [[2410.12311]] Open Domain Question Answering with Conflicting Contexts(https://arxiv.org/abs/2410.12311)
Keywords: language model, llm
Abstract: Open domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections of text often contain conflicting information, and indiscriminately depending on this information may result in untruthful and inaccurate answers. To understand the gravity of this problem, we collect a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), and find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search. We evaluate and benchmark three powerful Large Language Models (LLMs) with our dataset QACC and demonstrate their limitations in effectively addressing questions with conflicting information. To explore how humans reason through conflicting contexts, we ask our annotators to provide explanations for their selections of correct answers. We demonstrate that by finetuning LLMs to explain their answers, we can introduce richer information into their training that guides them through the process of reasoning with conflicting contexts.
摘要：开放领域问答系统经常依赖从大量文本（如网络）中检索到的信息来回答问题。然而，这样的文本集合通常包含相互矛盾的信息，不加区分地依赖这些信息可能会导致不真实和不准确的答案。为了了解这个问题的严重性，我们收集了一个人工注释的数据集，即具有冲突上下文的问答 (QACC)，并发现在使用 Google 搜索检索时，多达 25% 的明确开放领域问题会导致冲突上下文。我们使用我们的数据集 QACC 评估和基准测试了三个强大的大型语言模型 (LLM)，并展示了它们在有效解决具有冲突信息的问题方面的局限性。为了探索人类如何在冲突的上下文中进行推理，我们要求我们的注释者为他们选择的正确答案提供解释。我们证明，通过微调 LLM 来解释他们的答案，我们可以将更丰富的信息引入到他们的训练中，指导他们完成具有冲突上下文的推理过程。

Title: Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up

Authors: Jiahao Yuan, Dehui Du, Hao Zhang, Zixiang Di, Usman Naseem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12323
Pdf URL: https://arxiv.org/pdf/2410.12323
Copy Paste: [[2410.12323]] Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up(https://arxiv.org/abs/2410.12323)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown remarkable performance in reasoning tasks but face limitations in mathematical and complex logical reasoning. Existing methods to improve LLMs' logical capabilities either involve traceable or verifiable logical sequences that generate more reliable responses by constructing logical structures yet increase computational costs, or introduce rigid logic template rules, reducing flexibility. In this paper, we propose Reversal of Thought (RoT), a novel framework aimed at enhancing the logical reasoning abilities of LLMs. RoT utilizes a Preference-Guided Reverse Reasoning warm-up strategy, which integrates logical symbols for pseudocode planning through meta-cognitive mechanisms and pairwise preference self-evaluation to generate task-specific prompts solely through demonstrations, aligning with LLMs' cognitive preferences shaped by Reinforcement Learning with Human Feedback (RLHF). Through reverse reasoning, we utilize a Cognitive Preference Manager to assess knowledge boundaries and further expand LLMs' reasoning capabilities by aggregating solution logic for known tasks and stylistic templates for unknown tasks. Experiments across various tasks demonstrate that RoT surpasses existing baselines in both reasoning accuracy and efficiency.
摘要：大型语言模型 (LLM) 在推理任务中表现出色，但在数学和复杂逻辑推理方面面临限制。现有的提高 LLM 逻辑能力的方法要么涉及可跟踪或可验证的逻辑序列，通过构建逻辑结构生成更可靠的响应，但会增加计算成本，要么引入严格的逻辑模板规则，降低灵活性。在本文中，我们提出了思维逆转 (RoT)，这是一个旨在增强 LLM 逻辑推理能力的新框架。RoT 采用偏好引导的反向推理热身策略，通过元认知机制和成对偏好自我评估将逻辑符号集成到伪代码规划中，仅通过演示即可生成特定于任务的提示，这与 LLM 由带人类反馈的强化学习 (RLHF) 形成的认知偏好保持一致。通过反向推理，我们利用认知偏好管理器来评估知识边界，并通过聚合已知任务的解决方案逻辑和未知任务的风格模板来进一步扩展 LLM 的推理能力。跨各种任务的实验表明，RoT 在推理准确性和效率方面都超越了现有基线。

Title: Optimizing Low-Resource Language Model Training: Comprehensive Analysis of Multi-Epoch, Multi-Lingual, and Two-Stage Approaches

Authors: Kosuke Akimoto, Masafumi Oyamada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12325
Pdf URL: https://arxiv.org/pdf/2410.12325
Copy Paste: [[2410.12325]] Optimizing Low-Resource Language Model Training: Comprehensive Analysis of Multi-Epoch, Multi-Lingual, and Two-Stage Approaches(https://arxiv.org/abs/2410.12325)
Keywords: language model, llm
Abstract: In this paper, we address the challenge of optimizing training setups for Large Language Models (LLMs) in low-resource languages with limited corpora. Existing works adopt multi-epoch, multi-lingual, and two-stage training to utilize the limited target language corpus efficiently. However, there is still a lack of understanding about the optimal hyperparameter setups for combining these three approaches to train LLMs. We exhaustively explore training setups for low-resource language LLMs, combining these three approaches, and find the following insights for efficiently reducing the cost of hyperparameter search: (1) As the amount of target language corpus decreases, the optimal training approach shifts from monolingual single-stage training to multi-lingual two-stage training at a compute-budget-dependent threshold. (2) The optimal model scale remains stable regardless of the amount of target language corpus, allowing the use of the compute-optimal scale of monolingual training. (3) The optimal number of epochs can be extrapolated from smaller-scale experiments to larger scale using our proposed model. Also, we provide evidence that, in single-stage training, the target language validation loss follows a power law with respect to the target language ratio, with an exponent independent of the amount of data, model scale, and language pair.
摘要：在本文中，我们解决了在语料库有限的情况下优化低资源语言大型语言模型 (LLM) 训练设置的挑战。现有研究采用多 epoch、多语言和两阶段训练来有效利用有限的目标语言语料库。然而，对于结合这三种方法训练 LLM 的最佳超参数设置，人们仍然缺乏了解。我们详尽探索了结合这三种方法的低资源语言 LLM 训练设置，并发现了以下有效降低超参数搜索成本的见解：（1）随着目标语言语料库数量的减少，最佳训练方法从单语单阶段训练转变为计算预算相关阈值下的多语言两阶段训练。（2）无论目标语言语料库的数量如何，最佳模型规模都保持稳定，从而允许使用计算最优的单语训练规模。（3）使用我们提出的模型，可以将最佳 epoch 数从小规模实验推广到更大规模。此外，我们提供证据表明，在单阶段训练中，目标语言验证损失遵循关于目标语言比率的幂律，其指数与数据量、模型规模和语言对无关。
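
The power-law claim at the end of this abstract can be illustrated with a short fit in log-log space. This is a minimal sketch, not the authors' model: the functional form `loss = a * ratio**b` (with no offset term), the function name `fit_power_law`, and the synthetic data points are all assumptions made for illustration.

```python
import math


def fit_power_law(ratios, losses):
    """Fit loss = a * ratio**b by ordinary least squares in log-log space."""
    xs = [math.log(r) for r in ratios]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line in log-log space is the power-law exponent.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic (not from the paper): losses generated from a known power law.
ratios = [0.1, 0.2, 0.4, 0.8]
losses = [3.0 * r ** -0.25 for r in ratios]
a, b = fit_power_law(ratios, losses)
print(f"a={a:.3f}, b={b:.3f}")
```

If the exponent really is independent of data amount, model scale, and language pair, fits like this on different setups should return similar values of `b` while `a` varies.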

Title: Neuron-based Personality Trait Induction in Large Language Models

Authors: Jia Deng, Tianyi Tang, Yanbin Yin, Wenhao Yang, Wayne Xin Zhao, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12327
Pdf URL: https://arxiv.org/pdf/2410.12327
Copy Paste: [[2410.12327]] Neuron-based Personality Trait Induction in Large Language Models(https://arxiv.org/abs/2410.12327)
Keywords: language model, llm
Abstract: Large language models (LLMs) have become increasingly proficient at simulating various personality traits, an important capability for supporting related applications (e.g., role-playing). To further improve this capacity, in this paper, we present a neuron-based approach for personality trait induction in LLMs, with three major technical contributions. First, we construct PersonalityBench, a large-scale dataset for identifying and evaluating personality traits in LLMs. This dataset is grounded in the Big Five personality traits from psychology and is designed to assess the generative capabilities of LLMs towards specific personality traits. Second, by leveraging PersonalityBench, we propose an efficient method for identifying personality-related neurons within LLMs by examining the opposite aspects of a given trait. Third, we develop a simple yet effective induction method that manipulates the values of these identified personality-related neurons. This method enables fine-grained control over the traits exhibited by LLMs without training or modifying model parameters. Extensive experiments validate the efficacy of our neuron identification and trait induction methods. Notably, our approach achieves performance comparable to that of fine-tuned models, offering a more efficient and flexible solution for personality trait induction in LLMs. We provide access to all the mentioned resources at this https URL.
摘要：大型语言模型 (LLM) 在模拟各种性格特征方面已变得越来越熟练，这是支持相关应用（例如角色扮演）的重要能力。为了进一步提高这种能力，在本文中，我们提出了一种基于神经元的 LLM 性格特征诱导方法，主要有三项技术贡献。首先，我们构建了 PersonalityBench，这是一个用于识别和评估 LLM 中性格特征的大型数据集。该数据集以心理学中的“大五人格特质”为基础，旨在评估 LLM 对特定性格特征的生成能力。其次，通过利用 PersonalityBench，我们提出了一种通过检查给定特征的对立面来识别 LLM 中性格相关神经元的有效方法。第三，我们开发了一种简单而有效的诱导方法，可以操纵这些已识别的性格相关神经元的值。该方法无需训练和修改模型参数即可对 LLM 表现出的特质进行细粒度控制。大量实验验证了我们的神经元识别和特质诱导方法的有效性。值得注意的是，我们的方法实现了与微调模型相当的性能，为 LLM 中的性格特征诱导提供了更高效、更灵活的解决方案。我们在此 https URL 上提供对上述所有资源的访问。

Title: Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Authors: Botian Jiang, Lei Li, Xiaonan Li, Zhaowei Li, Xiachong Feng, Lingpeng Kong, Qi Liu, Xipeng Qiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12329
Pdf URL: https://arxiv.org/pdf/2410.12329
Copy Paste: [[2410.12329]] Understanding the Role of LLMs in Multimodal Evaluation Benchmarks(https://arxiv.org/abs/2410.12329)
Keywords: language model, llm
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has been accompanied by the development of various benchmarks to evaluate their capabilities. However, the true nature of these evaluations and the extent to which they assess multimodal reasoning versus merely leveraging the underlying Large Language Model (LLM) backbone remain unclear. This paper presents a comprehensive investigation into the role of LLM backbones in MLLM evaluation, focusing on two critical aspects: the degree to which current benchmarks truly assess multimodal reasoning and the influence of LLM prior knowledge on performance. Specifically, we introduce a modified evaluation protocol to disentangle the contributions of the LLM backbone from multimodal integration, and an automatic knowledge identification technique for diagnosing whether LLMs are equipped with the necessary knowledge for corresponding multimodal questions. Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs. Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone, indicating a heavy reliance on language capabilities. To address knowledge deficiencies, we propose a knowledge augmentation pipeline that achieves significant performance gains, with improvements of up to 60% on certain datasets, resulting in an approximately 4x increase in performance. Our work provides crucial insights into the role of the LLM backbone in MLLMs, and highlights the need for more nuanced benchmarking approaches.
摘要：多模态大型语言模型 (MLLM) 的快速发展伴随着各种基准的开发，以评估其能力。然而，这些评估的真实性质以及它们在多大程度上评估多模态推理，而不是仅仅利用底层的大型语言模型 (LLM) 主干，仍不清楚。本文全面研究了 LLM 主干在 MLLM 评估中的作用，重点关注两个关键方面：当前基准真正评估多模态推理的程度以及 LLM 先验知识对性能的影响。具体来说，我们引入了一种改进的评估协议，以将 LLM 主干的贡献与多模态集成区分开来，并引入了一种自动知识识别技术，用于诊断 LLM 是否具备相应多模态问题所需的知识。我们的研究涵盖了四个不同的 MLLM 基准和八个最先进的 MLLM。主要发现表明，一些基准即使在没有视觉输入的情况下也能实现高性能，高达 50% 的错误率可归因于 LLM 主干中的世界知识不足，这表明严重依赖语言能力。为了解决知识不足的问题，我们提出了一种知识增强流程，该流程可显著提高性能，在某些数据集上性能提升高达 60%，从而使性能提高了约 4 倍。我们的工作为 LLM 主干在 MLLM 中的作用提供了重要见解，并强调了对更细致入微的基准测试方法的需求。

Title: A linguistic analysis of undesirable outcomes in the era of generative AI

Authors: Daniele Gambetta, Gizem Gezici, Fosca Giannotti, Dino Pedreschi, Alistair Knott, Luca Pappalardo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12341
Pdf URL: https://arxiv.org/pdf/2410.12341
Copy Paste: [[2410.12341]] A linguistic analysis of undesirable outcomes in the era of generative AI(https://arxiv.org/abs/2410.12341)
Keywords: llm, chat
Abstract: Recent research has focused on the medium and long-term impacts of generative AI, posing scientific and societal challenges mainly due to the detection and reliability of machine-generated information, which is projected to form the major content on the Web soon. Prior studies show that LLMs exhibit a lower performance in generation tasks (model collapse) as they undergo a fine-tuning process across multiple generations on their own generated content (self-consuming loop). In this paper, we present a comprehensive simulation framework built upon the chat version of LLama2, focusing particularly on the linguistic aspects of the generated content, which have not been fully examined in existing studies. Our results show that the model produces less lexically rich content across generations, reducing diversity. Lexical richness is measured using the linguistic measures of entropy and TTR, as well as by calculating the POSTags frequency. The generated content has also been examined with an $n$-gram analysis, which takes into account the word order, and semantic networks, which consider the relation between different words. These findings suggest that model collapse occurs not only by decreasing the content diversity but also by distorting the underlying linguistic patterns of the generated text. Both effects highlight the critical importance of carefully choosing and curating the initial input text, which can alleviate the model collapse problem. Furthermore, we conduct a qualitative analysis of the fine-tuned models of the pipeline to compare their performances on generic NLP tasks to the original model. We find that autophagy transforms the initial model into a more creative, doubtful and confused one, which might provide inaccurate answers and include conspiracy theories in the model responses, spreading false and biased information on the Web.
摘要：最近的研究集中在生成式人工智能的中长期影响上，这给科学和社会带来了挑战，主要是因为机器生成的信息的检测和可靠性，预计这些信息很快将成为网络上的主要内容。先前的研究表明，LLM 在生成任务中表现出较低的性能（模型崩溃），因为它们在自己生成的内容（自我消费循环）上经历了多代微调过程。在本文中，我们提出了一个基于 LLama2 聊天版本构建的综合模拟框架，特别关注生成内容的语言方面，现有研究尚未对此进行全面研究。我们的结果表明，该模型在几代中产生的词汇丰富内容较少，从而降低了多样性。词汇丰富度是使用熵和 TTR 的语言度量以及计算 POSTags 频率来衡量的。还使用考虑词序的 $n$-gram 分析和考虑不同词之间关系的语义网络检查了生成的内容。这些发现表明，模型崩溃不仅会降低内容多样性，还会扭曲生成文本的底层语言模式，这两者都凸显了精心选择和整理初始输入文本的重要性，这可以缓解模型崩溃问题。此外，我们对管道的微调模型进行了定性分析，以将它们在通用 NLP 任务上的表现与原始模型进行比较。我们发现自噬将初始模型转变为更具创造性、更可疑和更混乱的模型，这可能会提供不准确的答案，并在模型响应中包含阴谋论，在网络上传播虚假和有偏见的信息。
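
The two lexical-richness measures named in this abstract, entropy and TTR (type-token ratio), can be computed as in the sketch below. The whitespace tokenization, the function names, and the two example strings are assumptions for illustration; the paper's own preprocessing is not described in the abstract.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical texts standing in for an early and a late generation of the loop.
early = "the quick brown fox jumps over the lazy dog".split()
late = "the dog the dog the dog the dog the".split()

for name, tokens in [("early", early), ("late", late)]:
    print(name, round(type_token_ratio(tokens), 3), round(shannon_entropy(tokens), 3))
```

A drop in both values from one generation to the next is the kind of signal the abstract describes as reduced lexical diversity.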

Title: HerO at AVeriTeC: The Herd of Open Large Language Models for Verifying Real-World Claims

Authors: Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2410.12377
Pdf URL: https://arxiv.org/pdf/2410.12377
Copy Paste: [[2410.12377]] HerO at AVeriTeC: The Herd of Open Large Language Models for Verifying Real-World Claims(https://arxiv.org/abs/2410.12377)
Keywords: language model, llm, prompt
Abstract: To tackle the AVeriTeC shared task hosted by FEVER-24, we introduce a system that only employs publicly available large language models (LLMs) for each step of automated fact-checking, dubbed the Herd of Open LLMs for verifying real-world claims (HerO). HerO employs multiple LLMs for each step of automated fact-checking. For evidence retrieval, a language model is used to enhance a query by generating hypothetical fact-checking documents. We prompt pretrained and fine-tuned LLMs for question generation and veracity prediction by crafting prompts with retrieved in-context samples. HerO achieved 2nd place on the leaderboard with an AVeriTeC score of 0.57, suggesting the potential of open LLMs for verifying real-world claims. For future research, we make our code publicly available at this https URL.
摘要：为了解决 FEVER-24 主办的 AVeriTeC 共享任务，我们引入了一个系统，该系统在自动事实核查的每个步骤中仅使用公开可用的大型语言模型 (LLM)，称为用于验证真实世界声明的开放式 LLM 群 (HerO)。HerO 在自动事实核查的每个步骤中都使用多个 LLM。对于证据检索，语言模型用于通过生成假设的事实核查文档来增强查询。我们通过使用检索到的上下文样本制作提示来提示经过预训练和微调的 LLM 进行问题生成和真实性预测。HerO 在排行榜上以 0.57 的 AVeriTeC 得分获得第二名，这表明开放式 LLM 在验证真实世界声明方面具有潜力。为了未来的研究，我们将我们的代码公开在此 https URL 上。

Title: Evaluation of Attribution Bias in Retrieval-Augmented Large Language Models

Authors: Amin Abolghasemi, Leif Azzopardi, Seyyed Hadi Hashemi, Maarten de Rijke, Suzan Verberne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12380
Pdf URL: https://arxiv.org/pdf/2410.12380
Copy Paste: [[2410.12380]] Evaluation of Attribution Bias in Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2410.12380)
Keywords: language model, llm, retrieval augmented generation
Abstract: Attributing answers to source documents is an approach used to enhance the verifiability of a model's output in retrieval augmented generation (RAG). Prior work has mainly focused on improving and evaluating the attribution quality of large language models (LLMs) in RAG, but this may come at the expense of inducing biases in the attribution of answers. We define and examine two aspects in the evaluation of LLMs in RAG pipelines, namely attribution sensitivity and bias with respect to authorship information. We explicitly inform an LLM about the authors of source documents, instruct it to attribute its answers, and analyze (i) how sensitive the LLM's output is to the author of source documents, and (ii) whether the LLM exhibits a bias towards human-written or AI-generated source documents. We design an experimental setup in which we use counterfactual evaluation to study three LLMs in terms of their attribution sensitivity and bias in RAG pipelines. Our results show that adding authorship information to source documents can significantly change the attribution quality of LLMs by 3% to 18%. Moreover, we show that LLMs can have an attribution bias towards explicit human authorship, which can serve as a competing hypothesis for findings of prior work showing that LLM-generated content may be preferred over human-written content. Our findings indicate that metadata of source documents can influence LLMs' trust and how they attribute their answers. Furthermore, our research highlights attribution bias and sensitivity as a novel aspect of brittleness in LLMs.
摘要：将答案归因于源文档是一种用于增强检索增强生成 (RAG) 中模型输出可验证性的方法。先前的工作主要集中在改进和评估 RAG 中大型语言模型 (LLM) 的归因质量，但这可能会以在答案归因中引入偏差为代价。我们定义并检查了 RAG 管道中 LLM 评估的两个方面，即归因敏感性和对作者信息的偏见。我们明确告知 LLM 源文档的作者，指示其归因其答案，并分析 (i) LLM 的输出对源文档作者的敏感程度，以及 (ii) LLM 是否表现出对人类编写或 AI 生成的源文档的偏见。我们设计了一个实验装置，在其中我们使用反事实评估来研究三个 LLM 在 RAG 管道中的归因敏感性和偏见。我们的结果表明，向源文档添加作者信息可以显著改变 LLM 的归因质量 3% 至 18%。此外，我们表明，LLM 可能对明确的人类作者身份存在归因偏见，这可以作为先前研究结果的竞争假设，该研究结果表明 LLM 生成的内容可能比人类编写的内容更受青睐。我们的研究结果表明，源文档的元数据会影响 LLM 的信任度以及他们如何归因他们的答案。此外，我们的研究强调归因偏见和敏感性是 LLM 脆弱性的一个新方面。

Title: Prompt Compression for Large Language Models: A Survey

Authors: Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12388
Pdf URL: https://arxiv.org/pdf/2410.12388
Copy Paste: [[2410.12388]] Prompt Compression for Large Language Models: A Survey(https://arxiv.org/abs/2410.12388)
Keywords: language model, llm, prompt
Abstract: Leveraging large language models (LLMs) for complex natural language tasks typically requires long-form prompts to convey detailed requirements and information, which results in increased memory usage and inference costs. To mitigate these challenges, multiple efficient methods have been proposed, with prompt compression gaining significant research interest. This survey provides an overview of prompt compression techniques, categorized into hard prompt methods and soft prompt methods. First, the technical approaches of these methods are compared, followed by an exploration of various ways to understand their mechanisms, including the perspectives of attention optimization, Parameter-Efficient Fine-Tuning (PEFT), modality fusion, and new synthetic language. We also examine the downstream adaptations of various prompt compression techniques. Finally, the limitations of current prompt compression methods are analyzed, and several future directions are outlined, such as optimizing the compression encoder, combining hard and soft prompts methods, and leveraging insights from multimodality.
摘要：利用大型语言模型 (LLM) 执行复杂的自然语言任务通常需要长格式提示来传达详细的需求和信息，这会导致内存使用量和推理成本增加。为了应对这些挑战，已经提出了多种有效方法，其中提示压缩引起了广泛的研究兴趣。本综述概述了提示压缩技术，分为硬提示方法和软提示方法。首先，比较这些方法的技术方法，然后探索理解其机制的各种方法，包括注意力优化、参数高效微调 (PEFT)、模态融合和新合成语言的视角。我们还研究了各种提示压缩技术的下游适应性。最后，分析了当前提示压缩方法的局限性，并概述了几个未来的方向，例如优化压缩编码器、结合硬提示和软提示方法以及利用多模态的见解。

Title: Tracking Universal Features Through Fine-Tuning and Model Merging

Authors: Niels Horn, Desmond Elliott
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12391
Pdf URL: https://arxiv.org/pdf/2410.12391
Copy Paste: [[2410.12391]] Tracking Universal Features Through Fine-Tuning and Model Merging(https://arxiv.org/abs/2410.12391)
Keywords: language model
Abstract: We study how features emerge, disappear, and persist across models fine-tuned on different domains of text. More specifically, we start from a base one-layer Transformer language model that is trained on a combination of the BabyLM corpus and a collection of Python code from The Stack. This base model is adapted to two new domains of text, TinyStories and the Lua programming language, respectively; these two models are then merged using spherical linear interpolation. Our exploration aims to provide deeper insights into the stability and transformation of features across typical transfer-learning scenarios using small-scale models and sparse auto-encoders.
摘要：我们研究特征如何在针对不同文本领域进行微调的模型中出现、消失和持续存在。更具体地说，我们从一个基本的单层 Transformer 语言模型开始，该模型在 BabyLM 语料库和来自 The Stack 的 Python 代码集合上进行训练。这个基础模型分别适应两个新的文本领域：TinyStories 和 Lua 编程语言；然后使用球面线性插值将这两个模型合并。我们的探索旨在更深入地了解使用小规模模型和稀疏自动编码器的典型迁移学习场景中特征的稳定性和转换。
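
The merging step mentioned in this abstract, spherical linear interpolation (SLERP), can be sketched for two flattened weight vectors as follows. This is the textbook SLERP formula rather than the authors' implementation; the function name `slerp`, the `eps` threshold, and the assumption that both vectors are non-zero and of equal length are illustrative choices.

```python
import math


def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero weight vectors."""
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    # Angle between the two vectors, from their normalized dot product.
    cos_omega = sum(a * b for a, b in zip(w_a, w_b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel or anti-parallel: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_a, w_b)]


# Toy example: the midpoint between two orthogonal unit vectors stays on the unit circle.
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

Unlike plain averaging, the interpolated vector keeps unit norm when both inputs have unit norm, which is the usual motivation for using SLERP when merging models.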

Title: ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

Authors: Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12405
Pdf URL: https://arxiv.org/pdf/2410.12405
Copy Paste: [[2410.12405]] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs(https://arxiv.org/abs/2410.12405)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but their performance is highly sensitive to the prompts utilized. This variability poses challenges for accurate assessment and user satisfaction. Current research frequently overlooks instance-level prompt variations and their implications on subjective evaluations. To address these shortcomings, we introduce ProSA, a framework designed to evaluate and comprehend prompt sensitivity in LLMs. ProSA incorporates a novel sensitivity metric, PromptSensiScore, and leverages decoding confidence to elucidate underlying mechanisms. Our extensive study, spanning multiple tasks, uncovers that prompt sensitivity fluctuates across datasets and models, with larger models exhibiting enhanced robustness. We observe that few-shot examples can alleviate this sensitivity issue, and subjective evaluations are also susceptible to prompt sensitivities, particularly in complex, reasoning-oriented tasks. Furthermore, our findings indicate that higher model confidence correlates with increased prompt robustness. We believe this work will serve as a helpful tool in studying prompt sensitivity of LLMs. The project is released at: this https URL.
摘要：大型语言模型 (LLM) 在各种任务中都表现出令人印象深刻的能力，但它们的性能对所使用的提示高度敏感。这种多变性对准确评估和用户满意度提出了挑战。当前的研究经常忽略实例级提示变化及其对主观评价的影响。为了解决这些缺点，我们引入了 ProSA，这是一个旨在评估和理解 LLM 中提示敏感性的框架。ProSA 采用了一种新颖的敏感性指标 PromptSensiScore，并利用解码置信度来阐明潜在机制。我们涵盖多个任务的广泛研究发现，提示敏感性在数据集和模型之间波动，较大的模型表现出增强的稳健性。我们观察到，少量样本可以缓解这种敏感性问题，主观评价也容易受到提示敏感性的影响，尤其是在复杂的推理导向任务中。此外，我们的研究结果表明，更高的模型置信度与更高的提示稳健性相关。我们相信这项工作将成为研究 LLM 提示敏感性的有用工具。该项目发布于：此 https URL 。

Title: Conformity in Large Language Models

Authors: Xiaochen Zhu, Caiqi Zhang, Tom Stafford, Nigel Collier, Andreas Vlachos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12428
Pdf URL: https://arxiv.org/pdf/2410.12428
Copy Paste: [[2410.12428]] Conformity in Large Language Models(https://arxiv.org/abs/2410.12428)
Keywords: language model, llm
Abstract: The conformity effect describes the tendency of individuals to align their responses with the majority. Studying this bias in large language models (LLMs) is crucial, as LLMs are increasingly used in various information-seeking and decision-making tasks as conversation partners to improve productivity. Thus, conformity to incorrect responses can compromise their effectiveness. In this paper, we adapt psychological experiments to examine the extent of conformity in state-of-the-art LLMs. Our findings reveal that all models tested exhibit varying levels of conformity toward the majority, regardless of their initial choice or correctness, across different knowledge domains. Notably, we are the first to show that LLMs are more likely to conform when they are more uncertain in their own prediction. We further explore factors that influence conformity, such as training paradigms and input characteristics, finding that instruction-tuned models are less susceptible to conformity, while increasing the naturalness of majority tones amplifies conformity. Finally, we propose two interventions--Devil's Advocate and Question Distillation--to mitigate conformity, providing insights into building more robust language models.
摘要：从众效应描述了个人倾向于将自己的回答与大多数人保持一致。研究大型语言模型 (LLM) 中的这种偏见至关重要，因为 LLM 越来越多地用于各种信息搜索和决策任务，作为对话伙伴来提高生产力。因此，对错误回答的从众可能会损害其有效性。在本文中，我们调整了心理实验来检查最先进的 LLM 中的从众程度。我们的研究结果表明，所有测试的模型在不同知识领域都表现出对大多数人的不同程度的从众，无论他们的初始选择或正确性如何。值得注意的是，我们首次表明，当 LLM 对自己的预测更不确定时，它们更有可能从众。我们进一步探讨了影响从众的因素，例如训练范式和输入特征，发现指令调整的模型不太容易从众，而增加多数语调的自然度会放大从众。最后，我们提出了两种干预措施——魔鬼代言人和问题提炼——来缓解从众，为构建更强大的语言模型提供见解。

Title: Expanding Chatbot Knowledge in Customer Service: Context-Aware Similar Question Generation Using Large Language Models

Authors: Mengze Hong, Yuanfeng Song, Di Jiang, Lu Wang, Zichang Guo, Chen Jason Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12444
Pdf URL: https://arxiv.org/pdf/2410.12444
Copy Paste: [[2410.12444]] Expanding Chatbot Knowledge in Customer Service: Context-Aware Similar Question Generation Using Large Language Models(https://arxiv.org/abs/2410.12444)
Keywords: language model, llm, prompt, chat
Abstract: Reliable responses of service chatbots are often achieved by employing retrieval-based methods that restrict answers to a knowledge base comprising predefined question-answer pairs (QA pairs). To accommodate potential variations in how a customer's query may be expressed, the favored solution is to augment these QA pairs with similar questions that are diverse while maintaining semantic consistency. This augmentation task is known as Similar Question Generation (SQG). Traditional methods that heavily rely on human efforts or rule-based techniques suffer from limited diversity or significant semantic deviation from the source question, and can produce only a limited number of useful questions. To address these limitations, we propose an SQG approach based on Large Language Models (LLMs), capable of producing a substantial number of diverse questions while maintaining semantic consistency with the source QA pair. This is achieved by leveraging LLMs' natural language understanding capability through fine-tuning with specially designed prompts. The experiments conducted on a real customer-service dataset demonstrate that our method surpasses baseline methods by a significant margin in terms of semantic diversity. Human evaluation further confirms that integrating the answer that reflects the customer's intention is crucial for increasing the number of generated questions that meet business requirements.
摘要：服务聊天机器人的可靠响应通常是通过采用基于检索的方法来实现的，该方法将答案限制在由预定义问答对 (QA 对) 组成的知识库中。为了适应客户查询表达方式的潜在变化，使用可能多样化但保持语义一致性的类似问题来增强这些 QA 对成为首选解决方案。此增强任务称为类似问题生成 (SQG)。传统方法严重依赖人工或基于规则的技术，其多样性有限或与源问题的语义偏差很大，只能生成有限数量的有用问题。为了解决这些限制，我们提出了一种基于大型语言模型 (LLM) 的 SQG 方法，该方法能够生成大量不同的问题，同时保持与源 QA 对的语义一致性。这是通过利用 LLM 的自然语言理解能力，通过使用专门设计的提示进行微调来实现的。在真实的客户服务数据集上进行的实验表明，我们的方法在语义多样性方面远远超过了基线方法。人工评估进一步证实，整合反映客户意图的答案对于增加满足业务需求的生成问题数量至关重要。

Title: Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs

Authors: Hyeonwoo Kim, Dahyun Kim, Jihoo Kim, Sukyung Lee, Yungi Kim, Chanjun Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12445
Pdf URL: https://arxiv.org/pdf/2410.12445
Copy Paste: [[2410.12445]] Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs(https://arxiv.org/abs/2410.12445)
Keywords: language model, llm
Abstract: The Open Ko-LLM Leaderboard has been instrumental in benchmarking Korean Large Language Models (LLMs), yet it has certain limitations. Notably, the disconnect between quantitative improvements on the overly academic leaderboard benchmarks and the qualitative impact of the models should be addressed. Furthermore, the benchmark suite is largely composed of translated versions of their English counterparts, which may not fully capture the intricacies of the Korean language. To address these issues, we propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard. The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities. Additionally, four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language. Through these refinements, Open Ko-LLM Leaderboard2 seeks to provide a more meaningful evaluation for advancing Korean LLMs.
摘要：Open Ko-LLM Leaderboard 在对韩语大型语言模型 (LLM) 进行基准测试方面发挥了重要作用，但它也存在一定的局限性。值得注意的是，应该解决过于学术化的排行榜基准测试的量化改进与模型的定性影响之间的脱节问题。此外，基准测试套件主要由英文版的翻译版本组成，可能无法完全捕捉韩语的复杂性。为了解决这些问题，我们提出了 Open Ko-LLM Leaderboard2，这是早期 Open Ko-LLM Leaderboard 的改进版本。原始基准测试完全被更符合现实世界能力的新任务所取代。此外，还引入了四个新的韩语本土基准测试，以更好地反映韩语的独特特征。通过这些改进，Open Ko-LLM Leaderboard2 旨在为推进韩语 LLM 提供更有意义的评估。

Title: The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph

Authors: Minghao Wu, Thuy-Trang Vu, Lizhen Qu, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12458
Pdf URL: https://arxiv.org/pdf/2410.12458
Copy Paste: [[2410.12458]] The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph(https://arxiv.org/abs/2410.12458)
Keywords: language model, llm
Abstract: The performance of large language models (LLMs) in natural language processing (NLP) tasks is significantly influenced by the quality and diversity of data used for supervised fine-tuning (SFT). Current data selection methods often focus solely on quality or diversity, leading to underperforming models due to suboptimal training data. In this paper, we introduce GraphFilter, a novel method that represents the dataset as a bipartite graph, linking sentences to their constituent n-grams. This representation effectively captures the relationships between sentences and linguistic patterns, facilitating the selection of sentences that enhance n-gram diversity. To balance quality and diversity during selection, we propose a priority function that combines the quality metric with the diversity metric in a multiplicative manner. GraphFilter iteratively selects high-priority sentences, updates the bipartite graph by removing covered n-grams, and re-calculates priorities to reflect the evolving data landscape. We conduct extensive experiments using three model backbones across six widely used benchmarks. The results demonstrate that GraphFilter outperforms all nine baseline approaches, achieving superior model performance and computational efficiency. Our analyses validate the effectiveness of our design choices, examine the subsets selected by GraphFilter and other methods, highlight the importance of instruction diversity, and explore the role of quality and diversity in relation to subset sizes. GraphFilter establishes a new foundation for effective data selection strategies, encouraging further research in data selection for LLMs.
摘要：大型语言模型 (LLM) 在自然语言处理 (NLP) 任务中的表现受到用于监督微调 (SFT) 的数据的质量和多样性的显著影响。当前的数据选择方法通常只关注质量或多样性，导致模型由于训练数据不理想而表现不佳。在本文中，我们介绍了 GraphFilter，这是一种将数据集表示为二分图的新方法，将句子与其组成 n-gram 联系起来。这种表示有效地捕捉了句子和语言模式之间的关系，有助于选择增强 n-gram 多样性的句子。为了在选择过程中平衡质量和多样性，我们提出了一个优先级函数，以乘法方式将质量指标与多样性指标结合起来。GraphFilter 迭代地选择高优先级的句子，通过删除覆盖的 n-gram 来更新二分图，并重新计算优先级以反映不断变化的数据格局。我们使用三个模型主干在六个广泛使用的基准上进行了广泛的实验。结果表明，GraphFilter 的表现优于所有九种基准方法，实现了卓越的模型性能和计算效率。我们的分析验证了我们设计选择的有效性，检查了 GraphFilter 和其他方法选择的子集，强调了指令多样性的重要性，并探索了质量和多样性与子集大小的关系。GraphFilter 为有效的数据选择策略奠定了新的基础，鼓励对 LLM 的数据选择进行进一步研究。
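
The selection loop described in this abstract (bipartite sentence-to-n-gram graph, multiplicative priority, removal of covered n-grams, re-scoring) can be sketched as below. This is a simplified reading of the abstract, not the GraphFilter code: using the count of still-uncovered bigrams as the diversity term, whitespace tokenization, and the names `ngrams` and `select` are all assumptions.

```python
def ngrams(tokens, n=2):
    """Set of n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def select(sentences, quality, k, n=2):
    """Greedily pick k sentences by quality * number of still-uncovered n-grams."""
    # Bipartite graph stored as: sentence index -> set of connected n-gram nodes.
    edges = {i: ngrams(s.split(), n) for i, s in enumerate(sentences)}
    chosen = []
    while len(chosen) < k and edges:
        # Multiplicative priority: quality score times remaining n-gram coverage.
        best = max(edges, key=lambda i: quality[i] * len(edges[i]))
        chosen.append(best)
        covered = edges.pop(best)
        # Remove covered n-grams so the remaining priorities count only new ones.
        for i in edges:
            edges[i] -= covered
    return chosen


# Hypothetical data: two near-duplicate sentences and one distinct sentence.
sentences = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "dogs bark loudly at night",
]
quality = [0.9, 0.8, 0.5]
print(select(sentences, quality, k=2))  # prints [0, 2]
```

In the toy run, the second sentence has higher quality than the third but contributes only one new bigram after the first pick, so the lower-quality but more novel sentence is selected instead.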

Title: Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention

Authors: Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12462
Pdf URL: https://arxiv.org/pdf/2410.12462
Copy Paste: [[2410.12462]] Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention(https://arxiv.org/abs/2410.12462)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in natural language processing but exhibit significant performance gaps among different languages. Most existing approaches to address these disparities rely on pretraining or fine-tuning, which are resource-intensive. To overcome these limitations without incurring significant costs, we propose Inference-Time Cross-Lingual Intervention (INCLINE), a novel framework that enhances LLM performance on low-performing (source) languages by aligning their internal representations with those of high-performing (target) languages during inference. INCLINE initially learns alignment matrices using parallel sentences from source and target languages through a Least-Squares optimization, and then applies these matrices during inference to transform the low-performing language representations toward the high-performing language space. Extensive experiments on nine benchmarks with five LLMs demonstrate that INCLINE significantly improves performance across diverse tasks and languages, compared to recent strong baselines. Our analysis demonstrates that INCLINE is highly cost-effective and applicable to a wide range of applications. In addition, we release the code to foster research along this line: this https URL.
摘要：大型语言模型 (LLM) 在自然语言处理方面表现出了卓越的能力，但不同语言之间的性能差距很大。现有的大多数解决这些差异的方法都依赖于预训练或微调，而这些方法需要大量资源。为了在不产生大量成本的情况下克服这些限制，我们提出了推理时间跨语言干预 (INCLINE)，这是一种新颖的框架，通过在推理过程中将低性能 (源) 语言的内部表示与高性能 (目标) 语言的内部表示对齐，可以提高 LLM 在低性能 (源) 语言上的性能。INCLINE 首先通过最小二乘优化使用来自源语言和目标语言的并行句子来学习对齐矩阵，然后在推理过程中应用这些矩阵，将低性能语言表示转换为高性能语言空间。使用五个 LLM 在九个基准上进行的大量实验表明，与最近的强大基线相比，INCLINE 显著提高了不同任务和语言的性能。我们的分析表明，INCLINE 具有很高的成本效益，适用于广泛的应用。此外，我们还发布了代码以促进这方面的研究：此 https URL。
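
The alignment step in this abstract, learning a matrix by least squares from parallel sentence representations and applying it at inference, can be sketched in pure Python as follows. This is a generic least-squares fit via the normal equations, not the released INCLINE code: the objective `||src @ W - tgt||^2`, the helper names, and the tiny 2-dimensional toy vectors are assumptions, and the sketch assumes `src^T src` is invertible.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ w = b for w by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_alignment(src, tgt):
    """Least-squares W minimizing ||src @ W - tgt||^2 via the normal equations."""
    src_t = transpose(src)
    return solve(matmul(src_t, src), matmul(src_t, tgt))


# Toy stand-ins for hidden states of parallel sentences (rows are sentences).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[2.0, 0.0], [1.0, 3.0], [3.0, 3.0]]
W = fit_alignment(src, tgt)

# At inference, a source-language representation is mapped toward the target space.
print(matmul([[1.0, 2.0]], W))
```

For real hidden-state dimensions, a numerical library's least-squares routine would be the practical choice; the hand-written solver here only keeps the sketch dependency-free.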

Title: Learning to Predict Usage Options of Product Reviews with LLM-Generated Labels

Authors: Leo Kohlenberg, Leonard Horns, Frederic Sadrieh, Nils Kiele, Matthis Clausen, Konstantin Ketterer, Avetis Navasardyan, Tamara Czinczoll, Gerard de Melo, Ralf Herbrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12470
Pdf URL: https://arxiv.org/pdf/2410.12470
Copy Paste: [[2410.12470]] Learning to Predict Usage Options of Product Reviews with LLM-Generated Labels(https://arxiv.org/abs/2410.12470)
Keywords: gpt, llm
Abstract: Annotating large datasets can be challenging. However, crowd-sourcing is often expensive and can lack quality, especially for non-trivial tasks. We propose a method of using LLMs as few-shot learners for annotating data in a complex natural language task where we learn a standalone model to predict usage options for products from customer reviews. We also propose a new evaluation metric for this scenario, HAMS4, that can be used to compare a set of strings with multiple reference sets. Learning a custom model offers individual control over energy efficiency and privacy measures compared to using the LLM directly for the sequence-to-sequence task. We compare this data annotation approach with other traditional methods and demonstrate how LLMs can enable considerable cost savings. We find that the quality of the resulting data exceeds the level attained by third-party vendor services and that GPT-4-generated labels even reach the level of domain experts. We make the code and generated labels publicly available.
摘要：注释大型数据集可能具有挑战性。然而，众包通常成本高昂，而且质量可能不高，尤其是对于非平凡任务。我们提出了一种使用 LLM 作为少样本学习器来注释复杂自然语言任务中的数据的方法，在该任务中，我们学习一个独立模型来根据客户评论预测产品的使用选项。我们还为这种场景提出了一个新的评估指标 HAMS4，可用于将一组字符串与多个参考集进行比较。与直接使用 LLM 进行序列到序列任务相比，学习自定义模型可以单独控制能源效率和隐私措施。我们将这种数据注释方法与其他传统方法进行了比较，并展示了 LLM 如何实现可观的成本节省。我们发现，结果数据的质量超过了第三方供应商服务所达到的水平，而 GPT-4 生成的标签甚至达到了领域专家的水平。我们公开了代码和生成的标签。

Title: Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation

Authors: Zerui Xu, Fang Wu, Tianfan Fu, Yue Zhao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12476
Pdf URL: https://arxiv.org/pdf/2410.12476
Copy Paste: [[2410.12476]] Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation(https://arxiv.org/abs/2410.12476)
Keywords: language model, llm
Abstract: Machine learning (ML) exhibits promise in the clinical domain. However, it is constrained by data scarcity and ethical considerations, as the generation of clinical trials presents significant challenges due to stringent privacy regulations, high costs, and the extended duration required for conducting studies with human participants. Despite the advancements of large language models (LLMs) in general generation tasks, their potential in facilitating the generation of synthetic clinical trials is under-explored. To address this gap, we introduce a novel Retrieval-Reasoning few-shot framework that leverages LLMs to generate artificial yet realistic and diverse clinical trials with binary success/failure labels. Experiments conducted on real clinical trials from the this http URL database demonstrate that our synthetic data can effectively augment real datasets. Furthermore, by fine-tuning a pre-trained model as a binary classifier on synthetic clinical trial datasets, we demonstrate that this augmentation enhances model training for downstream tasks such as trial outcome prediction. Our findings suggest that LLMs for synthetic clinical trial generation hold promise for accelerating clinical research and upholding ethical standards for patient privacy. The code is publicly available at this https URL.
摘要：机器学习 (ML) 在临床领域前景光明。然而，由于严格的隐私法规、高昂的成本以及对人类参与者进行研究所需的较长时间，临床试验的生成面临着重大挑战，因此它受到数据稀缺和道德考虑的限制。尽管大型语言模型 (LLM) 在一般生成任务中取得了进步，但它们在促进合成临床试验生成方面的潜力尚未得到充分开发。为了解决这一差距，我们引入了一个新颖的检索推理小样本框架，该框架利用 LLM 生成具有二元成功/失败标签的人工但真实且多样化的临床试验。对来自 \url{this http URL} 数据库的真实临床试验进行的实验表明，我们的合成数据可以有效地增强真实数据集。此外，通过将预训练模型微调为合成临床试验数据集上的二元分类器，我们证明这种增强增强了下游任务（例如试验结果预测）的模型训练。我们的研究结果表明，用于合成临床试验生成的 LLM 有望加速临床研究并维护患者隐私的道德标准。代码可在此 https URL 上公开获取。

Title: MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models

Authors: Boyang Xue, Hongru Wang, Rui Wang, Sheng Wang, Zezhong Wang, Yiming Du, Bin Liang, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12478
Pdf URL: https://arxiv.org/pdf/2410.12478
Copy Paste: [[2410.12478]] MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models(https://arxiv.org/abs/2410.12478)
Keywords: language model, llm, hallucination, prompt
Abstract: The tendency of Large Language Models (LLMs) to generate hallucinations raises concerns regarding their reliability. Therefore, confidence estimations indicating the extent of trustworthiness of the generations become essential. However, current LLM confidence estimations in languages other than English remain underexplored. This paper addresses this gap by introducing a comprehensive investigation of Multilingual Confidence estimation (MlingConf) on LLMs, focusing on both language-agnostic (LA) and language-specific (LS) tasks to explore the performance and language dominance effects of multilingual confidence estimations on different tasks. The benchmark comprises four meticulously checked and human-evaluated high-quality multilingual datasets for LA tasks and one for the LS task tailored to specific social, cultural, and geographical contexts of a language. Our experiments reveal that on LA tasks English exhibits notable linguistic dominance in confidence estimations over other languages, while on LS tasks, using question-related language to prompt LLMs demonstrates better linguistic dominance in multilingual confidence estimations. The phenomena inspire a simple yet effective native-tone prompting strategy by employing language-specific prompts for LS tasks, effectively improving LLMs' reliability and accuracy on LS tasks.
摘要：大型语言模型 (LLM) 产生幻觉的趋势引发了人们对其可靠性的担忧。因此，表明生成可信度程度的置信度估计变得至关重要。然而，目前除英语以外的其他语言的 LLM 置信度估计仍未得到充分探索。本文通过对 LLM 上的多语言置信度估计 (MlingConf) 进行全面调查来解决这一空白，重点关注语言无关 (LA) 和语言特定 (LS) 任务，以探索多语言置信度估计在不同任务上的性能和语言优势影响。该基准包括四个经过精心检查和人工评估的高质量多语言数据集，用于 LA 任务，以及一个针对语言的特定社会、文化和地理背景定制的 LS 任务。我们的实验表明，在 LA 任务中，英语在置信度估计中表现出比其他语言明显的语言优势，而在 LS 任务中，使用与问题相关的语言来提示 LLM 在多语言置信度估计中表现出更好的语言优势。这种现象启发了一种简单但有效的母语提示策略，即在 LS 任务中采用特定语言的提示，有效地提高了 LLM 在 LS 任务上的可靠性和准确性。

Title: KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs

Authors: Yongqin Xu, Huan Li, Ke Chen, Lidan Shou
Subjects: cs.CL, cs.AI, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12480
Pdf URL: https://arxiv.org/pdf/2410.12480
Copy Paste: [[2410.12480]] KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs(https://arxiv.org/abs/2410.12480)
Keywords: language model, llm, hallucination
Abstract: Schema and entity matching tasks are crucial for data integration and management. While large language models (LLMs) have shown promising results in these tasks, they suffer from hallucinations and confusion about task instructions. In this paper, we present the Knowledge-Compliant Matching Framework (KcMF), an LLM-based approach that addresses these issues without the need for domain-specific fine-tuning. KcMF employs a pseudo-code-based task decomposition strategy to adopt task-specific natural language statements that guide LLM reasoning and reduce confusion. We also propose two mechanisms, Dataset as Knowledge (DaK) and Example as Knowledge (EaK), to build domain knowledge sets when unstructured domain knowledge is lacking. Additionally, we introduce a result-ensembling strategy to leverage multiple knowledge sources and suppress poorly formatted outputs. Comprehensive evaluations on schema and entity matching tasks demonstrate that KcMF outperforms previous non-LLM state-of-the-art (SOTA) methods by an average F1 score of 22.9% and competes effectively with SOTA fine-tuned LLMs. Moreover, KcMF generalizes well across different LLMs.
摘要：模式和实体匹配任务对于数据集成和管理至关重要。虽然大型语言模型 (LLM) 在这些任务中表现出良好的效果，但它们存在幻觉和任务指令混淆的问题。在本文中，我们提出了知识兼容匹配框架 (KcMF)，这是一种基于 LLM 的方法，无需进行特定领域的微调即可解决这些问题。KcMF 采用基于伪代码的任务分解策略来采用特定于任务的自然语言语句来指导 LLM 推理并减少混淆。我们还提出了两种机制，数据集作为知识 (DaK) 和示例作为知识 (EaK)，用于在缺乏非结构化领域知识时构建领域知识集。此外，我们引入了一种结果集成策略来利用多个知识源并抑制格式不佳的输出。对模式和实体匹配任务的综合评估表明，KcMF 的平均 F1 得分比之前的非 LLM 最先进 (SOTA) 方法高出 22.9%，并且可以与 SOTA 微调的 LLM 有效竞争。此外，KcMF 在不同的 LLM 中具有很好的泛化能力。

Title: Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL

Authors: Jared Joselowitz, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12491
Pdf URL: https://arxiv.org/pdf/2410.12491
Copy Paste: [[2410.12491]] Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL(https://arxiv.org/abs/2410.12491)
Keywords: language model, llm
Abstract: Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 80.40% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these powerful systems.
摘要：使用人类反馈强化学习 (RLHF) 训练的大型语言模型 (LLM) 已展示出卓越的能力，但其底层奖励函数和决策过程仍然不透明。本文介绍了一种解释 LLM 的新方法，即应用逆向强化学习 (IRL) 来恢复其隐式奖励函数。我们对不同大小的毒性对齐 LLM 进行了实验，提取了在预测人类偏好方面准确率高达 80.40% 的奖励模型。我们的分析揭示了奖励函数的不可识别性、模型大小与可解释性之间的关系以及 RLHF 过程中的潜在陷阱的关键见解。我们证明 IRL 衍生的奖励模型可用于微调新的 LLM，从而在毒性基准上获得可比或更好的性能。这项工作为理解和改进 LLM 对齐提供了一个新的视角，对于负责任地开发和部署这些强大的系统具有重要意义。

Title: End-to-end Planner Training for Language Modeling

Authors: Nathan Cornille, Florian Mai, Jingyuan Sun, Marie-Francine Moens
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12492
Pdf URL: https://arxiv.org/pdf/2410.12492
Copy Paste: [[2410.12492]] End-to-end Planner Training for Language Modeling(https://arxiv.org/abs/2410.12492)
Keywords: language model, llm
Abstract: Through end-to-end training to predict the next token, LLMs have become valuable tools for various tasks. Enhancing their core training in language modeling can improve numerous downstream applications. A successful approach to enhance language modeling uses a separate planning module to predict abstract labels of future sentences and conditions the LM on these predictions. However, this method is non-differentiable, preventing joint end-to-end tuning of the planner with the LM. We propose an effective method to improve this approach by enabling joint fine-tuning of the planner and the LM. We show that a naive way of approximating the gradient of selecting a label via the straight-through estimator is not effective. Instead, we propose to use the predicted label probabilities as mixing weights to condition the LM on a weighted average of label embeddings in a differentiable manner. This not only enables joint fine-tuning of the planner and the LM, but also allows the LM to draw on the full label distribution predicted by the planner, retaining more information. Our experimental results show consistent improvements in perplexity.
摘要：通过端到端训练来预测下一个标记，LLM 已成为各种任务的宝贵工具。增强其在语言建模中的核心训练可以改善许多下游应用。增强语言建模的成功方法是使用单独的规划模块来预测未来句子的抽象标签，并根据这些预测对 LM 进行条件调整。但是，这种方法是不可微分的，无法对规划器和 LM 进行端到端联合调整。我们提出了一种有效的方法来改进这种方法，即对规划器和 LM 进行联合微调。我们表明，通过直通估计器近似选择标签的梯度的简单方法是无效的。相反，我们建议使用预测的标签概率作为混合权重，以可微分的方式对 LM 进行标签嵌入的加权平均值的条件调整。这不仅能够对规划器和 LM 进行联合微调，还允许 LM 利用规划器预测的完整标签分布，保留更多信息。我们的实验结果显示困惑度持续改善。
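
The contrast the abstract draws between selecting one label and mixing label embeddings can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy logits, and the two-dimensional embeddings are all hypothetical, and plain Python lists stand in for tensors. It only shows that the hard choice discards the planner's probabilities, while the soft choice uses them as mixing weights.

```python
import math

def softmax(logits):
    """Convert planner logits into label probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_condition(logits, label_embeddings):
    """Select the single most likely label (argmax); this step is not differentiable."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return list(label_embeddings[best])

def soft_condition(logits, label_embeddings):
    """Weighted average of label embeddings, with predicted probabilities as mixing weights."""
    probs = softmax(logits)
    dim = len(label_embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, label_embeddings))
            for d in range(dim)]

# Hypothetical toy values: three abstract labels, two-dimensional embeddings.
planner_logits = [2.0, 1.0, 0.1]
label_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

hard_vec = hard_condition(planner_logits, label_embeddings)
soft_vec = soft_condition(planner_logits, label_embeddings)
```

In this toy setting `hard_vec` is just the embedding of label 0, whereas `soft_vec` blends all three embeddings, so information about the less likely labels is retained. In an autodiff framework the weighted average would also let gradients reach the planner.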

Title: With a Grain of SALT: Are LLMs Fair Across Social Dimensions?

Authors: Samee Arif, Zohaib Khan, Agha Ali Raza, Awais Athar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12499
Pdf URL: https://arxiv.org/pdf/2410.12499
Copy Paste: [[2410.12499]] With a Grain of SALT: Are LLMs Fair Across Social Dimensions?(https://arxiv.org/abs/2410.12499)
Keywords: language model, gpt, llm, prompt
Abstract: This paper presents an analysis of biases in open-source Large Language Models (LLMs) across various genders, religions, and races. We introduce a methodology for generating a bias detection dataset using seven bias triggers: General Debate, Positioned Debate, Career Advice, Story Generation, Problem-Solving, Cover-Letter Writing, and CV Generation. We use GPT-4o to generate a diverse set of prompts for each trigger across various gender, religious, and racial groups. We evaluate models from the Llama and Gemma families on the generated dataset. We anonymise the LLM-generated text associated with each group using GPT-4o-mini and do a pairwise comparison using GPT-4o-as-a-Judge. To quantify bias in the LLM-generated text, we use the number of wins and losses in the pairwise comparison. Our analysis spans three languages (English, German, and Arabic) to explore how language influences bias manifestation. Our findings reveal that LLMs exhibit strong polarization toward certain groups across each category, with a notable consistency observed across models. However, when switching languages, variations and anomalies emerge, often attributable to cultural cues and contextual differences.
摘要：本文分析了开源大型语言模型 (LLM) 在不同性别、宗教和种族中的偏见。我们介绍了一种使用七个偏见触发器生成偏见检测数据集的方法：一般性辩论、定位辩论、职业建议、故事生成、问题解决、求职信写作和简历生成。我们使用 GPT-4o 为不同性别、宗教和种族群体的每个触发器生成一组不同的提示。我们在生成的数据集上评估了 Llama 和 Gemma 家族的模型。我们使用 GPT-4o-mini 将与每个组相关的 LLM 生成的文本匿名化，并使用 GPT-4o-as-a-Judge 进行成对比较。为了量化 LLM 生成的文本中的偏见，我们使用成对比较中的胜负次数。我们的分析涵盖了三种语言，英语、德语和阿拉伯语，以探索语言如何影响偏见的表现。我们的研究结果表明，法学硕士在各个类别中都对某些群体表现出强烈的两极分化，并且在各个模型中观察到了明显的一致性。然而，当切换语言时，就会出现变化和异常，这通常归因于文化线索和语境差异。
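
The win/loss counting mentioned in the abstract could look roughly like the sketch below. The record layout, the group labels, and the handling of ties are assumptions made for illustration; the abstract does not specify how judge verdicts are stored or how ties are treated.

```python
from collections import Counter

def tally_pairwise(comparisons):
    """Count wins and losses per group.

    comparisons: iterable of (group_a, group_b, winner), where winner is
    group_a, group_b, or None for a tie (ties are skipped; an assumption).
    Returns a dict mapping each group to a (wins, losses) tuple.
    """
    wins, losses = Counter(), Counter()
    groups = set()
    for a, b, winner in comparisons:
        groups.update((a, b))
        if winner is None:
            continue
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not one of the compared groups")
        loser = b if winner == a else a
        wins[winner] += 1
        losses[loser] += 1
    return {g: (wins[g], losses[g]) for g in sorted(groups)}

# Hypothetical judge verdicts over anonymised texts for three placeholder groups.
verdicts = [
    ("group_A", "group_B", "group_A"),
    ("group_A", "group_C", "group_C"),
    ("group_B", "group_C", None),
    ("group_A", "group_B", "group_A"),
]
scores = tally_pairwise(verdicts)
```

A large gap between wins and losses for one group, relative to the others, would be the kind of signal the abstract describes as polarization.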

Title: FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction

Authors: Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12513
Pdf URL: https://arxiv.org/pdf/2410.12513
Copy Paste: [[2410.12513]] FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction(https://arxiv.org/abs/2410.12513)
Keywords: language model, llm, prompt
Abstract: Auto-regressive Large Language Models (LLMs) demonstrate remarkable performance across domains such as vision and language processing. However, due to sequential processing through a stack of transformer layers, autoregressive decoding faces significant computation/latency challenges, particularly in resource-constrained environments like mobile and edge devices. Existing approaches in the literature that aim to improve latency via skipping layers have two distinct flavors: 1) early exit and 2) input-agnostic heuristics where tokens exit at pre-determined layers irrespective of input sequence. Both of the above strategies have limitations - the former cannot be applied to handle KV Caching necessary for speed-ups in modern frameworks, and the latter does not capture the variation in layer importance across tasks or, more generally, across input sequences. To address both limitations, we propose FIRST, an algorithm that reduces inference latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence - the prompt (during prefill stage) decides which layers will be skipped during decoding. FIRST preserves compatibility with KV caching, enabling faster inference while being quality-aware. FIRST is model-agnostic and can be easily enabled on any pre-trained LLM. We further improve performance by incorporating LoRA adapters for fine-tuning on external datasets, enhancing task-specific accuracy while maintaining latency benefits. Our approach reveals that input adaptivity is critical - indeed, different task-specific middle layers play a crucial role in evolving hidden representations depending on task. Extensive experiments show that FIRST significantly reduces latency while retaining competitive performance (as compared to baselines), making our approach an efficient solution for LLM deployment in low-resource environments.
摘要：自回归大型语言模型 (LLM) 在视觉和语言处理等领域表现出色。然而，由于通过堆叠的转换器层进行顺序处理，自回归解码面临重大的计算/延迟挑战，特别是在资源受限的环境中，如移动和边缘设备。文献中旨在通过跳过层来改善延迟的现有方法有两种不同的风格 - 1) 提前退出 2) 输入无关的启发式方法，其中 token 退出预定层，而不管输入序列如何。上述两种策略都有局限性 - 前者不能用于处理现代框架中加速所需的 KV 缓存，后者不能捕捉跨任务或更一般地说跨输入序列的层重要性的变化。为了解决这两个限制，我们提出了 FIRST，这是一种通过使用层特定路由器为每个输入序列自适应地选择转换器层子集来减少推理延迟的算法 - 提示（在预填充阶段）决定在解码期间将跳过哪些层。 FIRST 保留了与 KV 缓存的兼容性，从而可以在保证质量的同时实现更快的推理。FIRST 与模型无关，可在任何预先训练的 LLM 上轻松启用。我们通过结合 LoRA 适配器对外部数据集进行微调来进一步提高性能，从而提高特定于任务的准确性，同时保持延迟优势。我们的方法表明，输入自适应性至关重要 - 事实上，不同的特定于任务的中间层在根据任务演变隐藏表示方面起着至关重要的作用。大量实验表明，FIRST 显着降低了延迟，同时保持了竞争性性能（与基线相比），使我们的方法成为在低资源环境中部署 LLM 的有效解决方案。

Title: MedAide: Towards an Omni Medical Aide via Specialized LLM-based Multi-Agent Collaboration

Authors: Jinjie Wei, Dingkang Yang, Yanshu Li, Qingyao Xu, Zhaoyu Chen, Mingcheng Li, Yue Jiang, Xiaolu Hou, Lihua Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12532
Pdf URL: https://arxiv.org/pdf/2410.12532
Copy Paste: [[2410.12532]] MedAide: Towards an Omni Medical Aide via Specialized LLM-based Multi-Agent Collaboration(https://arxiv.org/abs/2410.12532)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Large Language Model (LLM)-driven interactive systems currently show promise in healthcare domains. Despite their remarkable capabilities, LLMs typically lack personalized recommendations and diagnosis analysis in sophisticated medical applications, causing hallucinations and performance bottlenecks. To address these challenges, this paper proposes MedAide, an LLM-based omni medical multi-agent collaboration framework for specialized healthcare services. Specifically, MedAide first performs query rewriting through retrieval-augmented generation to accomplish accurate medical intent understanding. Next, we devise a contextual encoder to obtain intent prototype embeddings, which are used to recognize fine-grained intents by similarity matching. According to the intent relevance, the activated agents collaborate effectively to provide integrated decision analysis. Extensive experiments are conducted on four medical benchmarks with composite intents. Experimental results from automated metrics and expert doctor evaluations show that MedAide outperforms current LLMs and improves their medical proficiency and strategic reasoning.
摘要：大型语言模型 (LLM) 驱动的交互系统目前在医疗保健领域显示出潜在的前景。尽管 LLM 具有出色的功能，但在复杂的医疗应用中，LLM 通常缺乏个性化推荐和诊断分析，从而导致幻觉和性能瓶颈。为了应对这些挑战，本文提出了 MedAide，这是一种基于 LLM 的全方位医疗多智能体协作框架，适用于专业医疗服务。具体来说，MedAide 首先通过检索增强生成执行查询重写，以实现准确的医疗意图理解。立即，我们设计了一个上下文编码器来获取意图原型嵌入，这些嵌入用于通过相似性匹配识别细粒度意图。根据意图相关性，激活的智能体有效协作以提供综合决策分析。在具有复合意图的四个医疗基准上进行了广泛的实验。来自自动指标和专家医生评估的实验结果表明，MedAide 优于当前的 LLM，并提高了其医疗能力和战略推理能力。
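
Recognizing intents by similarity matching against prototype embeddings can be sketched as follows. Everything concrete here is assumed for illustration: the use of cosine similarity, the intent names, the two-dimensional vectors, and the activation threshold of 0.5 are not taken from the paper, which the abstract describes only at a high level.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def match_intents(query_vec, prototypes, threshold=0.5):
    """Return (intent, score) pairs whose similarity reaches the threshold, best first."""
    scores = {name: cosine(query_vec, proto) for name, proto in prototypes.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= threshold]

# Hypothetical prototype embeddings for three placeholder intents.
prototypes = {
    "intent_a": [1.0, 0.0],
    "intent_b": [0.0, 1.0],
    "intent_c": [1.0, 1.0],
}
query_embedding = [1.0, 0.2]  # stand-in for a contextual encoder's output

activated = match_intents(query_embedding, prototypes)
activated_names = [name for name, _ in activated]
```

With these toy values the query matches `intent_a` most strongly and `intent_c` second, while `intent_b` falls below the threshold. In a multi-agent setup of the kind the abstract describes, the matched intents would determine which agents are activated.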

Title: LLM-based Translation Inference with Iterative Bilingual Understanding

Authors: Andong Chen, Kehai Chen, Yang Xiang, Xuefeng Bai, Muyun Yang, Tiejun Zhao, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12543
Pdf URL: https://arxiv.org/pdf/2410.12543
Copy Paste: [[2410.12543]] LLM-based Translation Inference with Iterative Bilingual Understanding(https://arxiv.org/abs/2410.12543)
Keywords: language model, llm
Abstract: The remarkable understanding and generation capabilities of large language models (LLMs) have greatly improved translation performance. However, incorrect understanding of the sentence to be translated can degrade translation quality. To address this issue, we propose a novel Iterative Bilingual Understanding Translation (IBUT) method based on the cross-lingual capabilities of LLMs and the dual characteristics of translation tasks. The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately. Furthermore, the dual characteristics allow IBUT to generate effective cross-lingual feedback, iteratively refining contextual understanding, thereby reducing errors and improving translation performance. Experimental results show that the proposed IBUT outperforms several strong comparison methods, especially when generalized to multiple domains (e.g., news, commonsense, and cultural translation benchmarks).
摘要：大型语言模型 (LLM) 卓越的理解和生成能力极大地提高了翻译性能。然而，对待译句子的错误理解会降低翻译质量。针对这一问题，我们提出了一种新颖的迭代双语理解翻译 (IBUT) 方法，该方法基于 LLM 的跨语言能力和翻译任务的双重特性。LLM 的跨语言能力使得能够分别生成源语言和目标语言的上下文理解。此外，双重特性使 IBUT 能够生成有效的跨语言反馈，迭代地细化上下文理解，从而减少错误并提高翻译性能。实验结果表明，所提出的 IBUT 优于几种强大的比较方法，尤其是在推广到多个领域（例如新闻、常识和文化翻译基准）时。

Title: A Claim Decomposition Benchmark for Long-form Answer Verification

Authors: Zhihao Zhang, Yixing Fan, Ruqing Zhang, Jiafeng Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12558
Pdf URL: https://arxiv.org/pdf/2410.12558
Copy Paste: [[2410.12558]] A Claim Decomposition Benchmark for Long-form Answer Verification(https://arxiv.org/abs/2410.12558)
Keywords: llm, hallucination
Abstract: The advancement of LLMs has significantly boosted the performance of complex long-form question answering tasks. However, one prominent issue of LLMs is the generated "hallucination" responses that are not factual. Consequently, attribution for each claim in responses becomes a common solution to improve the factuality and verifiability. Existing research mainly focuses on how to provide accurate citations for the response, largely overlooking the importance of identifying the claims or statements for each response. To bridge this gap, we introduce a new claim decomposition benchmark, which requires building a system that can identify atomic and checkworthy claims for LLM responses. Specifically, we present the Chinese Atomic Claim Decomposition Dataset (CACDD), which builds on the WebCPM dataset with additional expert annotations to ensure high data quality. The CACDD encompasses a collection of 500 human-annotated question-answer pairs, including a total of 4956 atomic claims. We further propose a new pipeline for human annotation and describe the challenges of this task. In addition, we provide experiment results on zero-shot, few-shot and fine-tuned LLMs as baselines. The results show that claim decomposition is highly challenging and requires further exploration. All code and data are publicly available at this https URL.
摘要：LLM 的进步显著提高了复杂长篇问答任务的性能。然而，LLM 的一个突出问题是生成的“幻觉”响应不是事实。因此，对响应中的每个声明进行归因成为提高事实性和可验证性的常见解决方案。现有研究主要关注如何为响应提供准确的引用，这在很大程度上忽略了识别每个响应的声明或陈述的重要性。为了弥补这一差距，我们引入了一个新的声明分解基准，这需要构建能够识别 LLM 响应的原子和可检查声明的系统。具体来说，我们提出了中文原子声明分解数据集 (CACDD)，它以 WebCPM 数据集为基础，并附加了专家注释以确保高数据质量。CACDD 包含 500 个人工注释的问答对的集合，总共包括 4956 个原子声明。我们进一步提出了一种新的人工注释流程，并描述了这项任务的挑战。此外，我们还提供了零样本、少样本和微调 LLM 的实验结果作为基准。结果表明，索赔分解极具挑战性，需要进一步探索。所有代码和数据均可在 \url{此 https URL} 上公开获取。

Title: STRUX: An LLM for Decision-Making with Structured Explanations

Authors: Yiming Lu, Yebowen Hu, Hassan Foroosh, Wei Jin, Fei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12583
Pdf URL: https://arxiv.org/pdf/2410.12583
Copy Paste: [[2410.12583]] STRUX: An LLM for Decision-Making with Structured Explanations(https://arxiv.org/abs/2410.12583)
Keywords: llm
Abstract: Countless decisions shape our daily lives, and it is paramount to understand the how and why behind these choices. In this paper, we introduce a new LLM decision-making framework called STRUX, which enhances LLM decision-making by providing structured explanations. These include favorable and adverse facts related to the decision, along with their respective strengths. STRUX begins by distilling lengthy information into a concise table of key facts. It then employs a series of self-reflection steps to determine which of these facts are pivotal, categorizing them as either favorable or adverse in relation to a specific decision. Lastly, we fine-tune an LLM to identify and prioritize these key facts to optimize decision-making. STRUX has been evaluated on the challenging task of forecasting stock investment decisions based on earnings call transcripts and demonstrated superior performance against strong baselines. It enhances decision transparency by allowing users to understand the impact of different factors, representing a meaningful step towards practical decision-making with LLMs.
摘要：无数的决定影响着我们的日常生活，了解这些选择背后的原因和方式至关重要。在本文中，我们介绍了一种新的 LLM 决策框架，即 STRUX，它通过提供结构化的解释来增强 LLM 决策能力。这些包括与决策相关的有利和不利事实，以及它们各自的优势。STRUX 首先将冗长的信息提炼成一个简明的关键事实表。然后，它采用一系列自我反思步骤来确定这些事实中的哪些是关键的，并将它们归类为与特定决策相关的有利或不利事实。最后，我们对 LLM 进行微调，以识别和优先考虑这些关键事实，从而优化决策。STRUX 已在基于收益电话会议记录预测股票投资决策的艰巨任务上进行了评估，并展示了与强大基线相比的卓越性能。它通过让用户了解不同因素的影响来提高决策透明度，代表着朝着使用 LLM 进行实际决策迈出了有意义的一步。

Title: Can We Reverse In-Context Knowledge Edits?

Authors: Paul Youssef, Zhixue Zhao, Jörg Schlötterer, Christin Seifert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12586
Pdf URL: https://arxiv.org/pdf/2410.12586
Copy Paste: [[2410.12586]] Can We Reverse In-Context Knowledge Edits?(https://arxiv.org/abs/2410.12586)
Keywords: language model, llm, prompt
Abstract: In-context knowledge editing (IKE) enables efficient modification of large language model (LLM) outputs without parameter changes and at zero-cost. However, it can be misused to manipulate responses opaquely, e.g., insert misinformation or offensive content. Such malicious interventions could be incorporated into high-level wrapped APIs where the final input prompt is not shown to end-users. To address this issue, we investigate the detection and reversal of IKE-edits. First, we demonstrate that IKE-edits can be detected with high accuracy (F1 > 80%) using only the top-10 output probabilities of the next token, even in a black-box setting, e.g., proprietary LLMs with limited output information. Further, we introduce the novel task of reversing IKE-edits using specially tuned reversal tokens. We explore using both continuous and discrete reversal tokens, achieving over 80% accuracy in recovering original, unedited outputs across multiple LLMs. Our continuous reversal tokens prove particularly effective, with minimal impact on unedited prompts. Through analysis of output distributions, attention patterns, and token rankings, we provide insights into IKE's effects on LLMs and how reversal tokens mitigate them. This work represents a significant step towards enhancing LLM resilience against potential misuse of in-context editing, improving their transparency and trustworthiness.
摘要：上下文知识编辑 (IKE) 可以高效地修改大型语言模型 (LLM) 输出，无需更改参数，且成本为零。但是，它可能被滥用来不透明地操纵响应，例如插入错误信息或攻击性内容。此类恶意干预可以合并到高级包装 API 中，其中最终输入提示不会显示给最终用户。为了解决这个问题，我们研究了 IKE 编辑的检测和逆转。首先，我们证明，即使在黑盒设置中，例如具有有限输出信息的专有 LLM，也可以仅使用下一个标记的前 10 个输出概率以高精度 (F1 > 80\%) 检测 IKE 编辑。此外，我们介绍了使用经过特殊调整的逆转标记逆转 IKE 编辑的新任务。我们探索使用连续和离散逆转标记，在多个 LLM 中恢复原始未编辑输出的准确率超过 80\%。我们的连续反转标记被证明特别有效，对未编辑提示的影响最小。通过分析输出分布、注意力模式和标记排名，我们深入了解了 IKE 对 LLM 的影响以及反转标记如何缓解这些影响。这项工作代表着朝着增强 LLM 抵御上下文编辑潜在滥用的能力迈出了重要一步，提高了其透明度和可信度。

Title: On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs

Authors: Herun Wan, Minnan Luo, Zhixiong Su, Guang Dai, Xiang Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12600
Pdf URL: https://arxiv.org/pdf/2410.12600
Copy Paste: [[2410.12600]] On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs(https://arxiv.org/abs/2410.12600)
Keywords: language model, llm
Abstract: Evidence-enhanced detectors present remarkable abilities in identifying malicious social text with related evidence. However, the rise of large language models (LLMs) brings potential risks of evidence pollution to confuse detectors. This paper explores how to manipulate evidence, simulating potential misuse scenarios including basic pollution, and rephrasing or generating evidence by LLMs. To mitigate its negative impact, we propose three defense strategies from both the data and model sides, including machine-generated text detection, a mixture of experts, and parameter updating. Extensive experiments on four malicious social text detection tasks with ten datasets show that evidence pollution, especially the generate strategy, significantly compromises existing detectors. On the other hand, the defense strategies can mitigate evidence pollution, but they face limitations in practical deployment, such as the need for annotated data and huge inference costs. Further analysis illustrates that polluted evidence is of high quality, compromises model calibration, and can be ensembled to amplify the negative impact.
摘要：证据增强检测器在利用相关证据识别恶意社交文本方面表现出色。然而，大型语言模型 (LLM) 的兴起带来了证据污染的潜在风险，使检测器感到困惑。本文探讨了如何操纵证据，模拟潜在的误用场景（包括基本污染）以及通过 LLM 改写或生成证据。为了减轻其负面影响，我们从数据和模型两个方面提出了三种防御策略，包括机器生成文本检测、专家混合和参数更新。在四个恶意社交文本检测任务上使用十个数据集进行的大量实验表明，证据污染，尤其是生成策略，严重损害了现有的检测器。另一方面，防御策略可以减轻证据污染，但它们在实际使用中面临限制，例如需要注释数据和巨大的推理成本。进一步分析表明，污染的证据质量很高，会损害模型校准，并且可以集成以放大负面影响。

Title: CCSBench: Evaluating Compositional Controllability in LLMs for Scientific Document Summarization

Authors: Yixi Ding, Jiaying Wu, Tongyao Zhu, Yanxia Qin, Qian Liu, Min-Yen Kan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12601
Pdf URL: https://arxiv.org/pdf/2410.12601
Copy Paste: [[2410.12601]] CCSBench: Evaluating Compositional Controllability in LLMs for Scientific Document Summarization(https://arxiv.org/abs/2410.12601)
Keywords: language model, gpt, llm
Abstract: To broaden the dissemination of scientific knowledge to diverse audiences, scientific document summarization must simultaneously control multiple attributes such as length and empirical focus. However, existing research typically focuses on controlling single attributes, leaving the compositional control of multiple attributes underexplored. To address this gap, we introduce CCSBench, a benchmark for compositional controllable summarization in the scientific domain. Our benchmark enables fine-grained control over both explicit attributes (e.g., length), which are objective and straightforward, and implicit attributes (e.g., empirical focus), which are more subjective and conceptual. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our findings reveal significant limitations in large language models' ability to balance trade-offs between control attributes, especially implicit ones that require deeper understanding and abstract reasoning.
摘要：为了将科学知识传播到不同的受众，科学文献摘要必须同时控制多个属性，例如长度和经验焦点。然而，现有的研究通常侧重于控制单个属性，而对多个属性的组合控制却没有得到充分探索。为了解决这一差距，我们引入了 CCSBench，这是科学领域中可组合控制摘要的基准。我们的基准可以对客观直接的显式属性（例如长度）和更主观和概念化的隐式属性（例如经验焦点）进行细粒度控制。我们在各种环境下对 GPT-4、LLaMA2 和其他流行的 LLM 进行了广泛的实验。我们的研究结果表明，大型语言模型在平衡控制属性之间的权衡方面存在很大的局限性，尤其是那些需要更深入理解和抽象推理的隐式属性。

Title: Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning

Authors: Vernon Y.H. Toh, Deepanway Ghosal, Soujanya Poria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12608
Pdf URL: https://arxiv.org/pdf/2410.12608
Copy Paste: [[2410.12608]] Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning(https://arxiv.org/abs/2410.12608)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) have shown increasing proficiency in solving mathematical reasoning problems. However, many current open-source LLMs still make calculation and semantic understanding errors in their intermediate reasoning steps. In this work, we propose PROVE, a simple yet effective framework that uses program-based verification as a heuristic to filter out potentially incorrect reasoning paths before aggregating the final answers. Instead of relying on vanilla majority voting, our approach rejects solutions whose corresponding program outputs are inconsistent with the generated solution, aggregating only those validated by Python programs. We conducted extensive experiments on 13 open-source LLMs from various model families and sizes, ranging from 0.5B to 13B parameters, across seven math benchmarks. We demonstrate that PROVE consistently outperforms vanilla majority voting as a heuristic for solving mathematical reasoning tasks across all datasets and model sizes. Notably, PROVE increases accuracy on the GSM8K benchmark from 48.85% to 53.83% for Qwen2-0.5B-Instruct, from 65.66% to 73.01% for Llama-3.2-1B-Instruct, from 73.39% to 79.61% for Gemma-2-2b-it, and from 41.32% to 59.51% for Llama-2-7B-chat. Our code is available at this https URL.
摘要：大型语言模型 (LLM) 在解决数学推理问题方面表现出越来越高的能力。然而，许多当前的开源 LLM 在其中间推理步骤中仍然经常出现计算和语义理解错误。在这项工作中，我们提出了 PROVE，这是一个简单而有效的框架，它使用基于程序的验证作为启发式方法，在聚合最终答案之前过滤掉可能不正确的推理路径。我们的方法不依赖于普通多数投票，而是拒绝相应程序输出与生成的解决方案不一致的解决方案，只聚合那些经过 Python 程序验证的解决方案。我们在七个数学基准上对来自不同模型系列和大小的 13 个开源 LLM 进行了广泛的实验，参数范围从 0.5B 到 13B。我们证明，作为一种解决所有数据集和模型大小的数学推理任务的启发式方法，PROVE 始终优于普通多数投票。值得注意的是，PROVE 将 GSM8K 基准的准确率从 48.85% 提高到 53.83%（Qwen2-0.5B-Instruct）、从 65.66% 提高到 73.01%（Llama-3.2-1B-Instruct）、从 73.39% 提高到 79.61%（Gemma-2-2b-it）以及从 41.32% 提高到 59.51%（Llama-2-7B-chat）。我们的代码可在此 https URL 上找到。
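
The filtering-then-voting idea described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the released PROVE code: each candidate is assumed to be a pair of (answer from the generated solution, output of its corresponding program), a program that failed to run is represented by `None`, the numeric tolerance is arbitrary, and falling back to plain majority voting when no candidate is verified is a choice made here for the sketch.

```python
from collections import Counter

def majority_vote(answers):
    """Vanilla majority voting over final answers."""
    if not answers:
        raise ValueError("no answers to vote on")
    return Counter(answers).most_common(1)[0][0]

def verified_vote(candidates, tol=1e-6):
    """Vote only over answers whose program output agrees with the solution's answer.

    candidates: list of (solution_answer, program_output); program_output is
    None when the program could not be executed (an assumption of this sketch).
    """
    verified = [ans for ans, out in candidates
                if out is not None and abs(ans - out) <= tol]
    if not verified:
        # Fallback chosen for this sketch: plain majority voting over all answers.
        return majority_vote([ans for ans, _ in candidates])
    return majority_vote(verified)

# Hypothetical sampled reasoning paths for one problem.
samples = [
    (12.0, 12.0),   # program agrees with the solution
    (15.0, 12.0),   # program disagrees: rejected
    (15.0, None),   # program failed: rejected
    (15.0, 14.0),   # program disagrees: rejected
    (12.0, 12.0),   # program agrees with the solution
]

plain = majority_vote([ans for ans, _ in samples])
filtered = verified_vote(samples)
```

In this toy case plain majority voting picks 15.0 because it appears three times, while the verified vote picks 12.0 because only those two paths are backed by a consistent program output.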

Title: Exploring Model Kinship for Merging Large Language Models

Authors: Yedi Hu, Yunzhi Yao, Ningyu Zhang, Shumin Deng, Huajun Chen
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2410.12613
Pdf URL: https://arxiv.org/pdf/2410.12613
Copy Paste: [[2410.12613]] Exploring Model Kinship for Merging Large Language Models(https://arxiv.org/abs/2410.12613)
Keywords: language model, llm
Abstract: Model merging has become one of the key technologies for enhancing the capabilities and efficiency of Large Language Models (LLMs). However, our understanding of the expected performance gains and principles when merging any two models remains limited. In this work, we introduce model kinship, the degree of similarity or relatedness between LLMs, analogous to biological evolution. With comprehensive empirical analysis, we find that there is a certain relationship between model kinship and the performance gains after model merging, which can help guide our selection of candidate models. Inspired by this, we propose a new model merging strategy: Top-k Greedy Merging with Model Kinship, which can yield better performance on benchmark datasets. Specifically, we discover that using model kinship as a criterion can assist us in continuously performing model merging, alleviating the degradation (local optima) in model evolution, whereas model kinship can serve as a guide to escape these traps. Code is available at this https URL.
摘要：模型合并已成为提升大型语言模型（LLM）能力和效率的关键技术之一。然而，我们对合并任何两个模型时预期的性能增益和原则的理解仍然有限。在本文中，我们引入了模型亲缘关系，即LLM之间的相似度或关联度，类似于生物进化。通过全面的实证分析，我们发现模型亲缘关系与模型合并后的性能增益之间存在一定的关系，这可以帮助我们指导候选模型的选择。受此启发，我们提出了一种新的模型合并策略：Top-k Greedy Merging with Model Kinship，它可以在基准数据集上获得更好的性能。具体来说，我们发现使用模型亲缘关系作为标准可以帮助我们不断进行模型合并，减轻模型演化中的退化（局部最优），而模型亲缘关系可以作为避免这些陷阱的指南。代码可从此 https URL 获取。

Title: Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning

Authors: Ruimeng Ye, Yang Xiao, Bo Hui
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12621
Pdf URL: https://arxiv.org/pdf/2410.12621
Copy Paste: [[2410.12621]] Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning(https://arxiv.org/abs/2410.12621)
Keywords: language model, llm
Abstract: As large language models (LLMs) continue to advance, ensuring their alignment with human values becomes increasingly critical. Traditional alignment methods heavily rely on human feedback to fine-tune models. With the emergence of superhuman models whose outputs may surpass human understanding, evaluating and aligning these models using human judgments poses significant challenges. To address the challenges, recent works use weak supervisors to elicit knowledge from much stronger models. However, there are important disanalogies between the empirical setup in the existing works and the genuine goal of alignment. We remark that existing works investigate the phenomenon of weak-to-strong generation in analogous setup (i.e., binary classification), rather than practical alignment-relevant tasks (e.g., safety). In this paper, we bridge this gap by extending weak-to-strong generation to the context of practical alignment. We empirically demonstrate the widespread phenomenon of weak-to-strong generation in three complicated alignment tasks: safety, toxicity, and legal reasoning. Furthermore, we explore efficient strategies for improving alignment performance to enhance the quality of model outcomes. Lastly, we summarize and analyze the challenges and potential solutions in regard to specific alignment tasks, which we hope to catalyze the research progress on the topic of weak-to-strong generalization. Our code is released at this https URL.
摘要：随着大型语言模型 (LLM) 的不断发展，确保它们与人类价值观保持一致变得越来越重要。传统的对齐方法严重依赖人类反馈来微调模型。随着超人模型的出现，其输出可能超越人类的理解，使用人类判断来评估和对齐这些模型带来了重大挑战。为了应对这些挑战，最近的研究使用弱监督者从更强大的模型中获取知识。然而，现有研究的经验设置与对齐的真正目标之间存在重要的差异。我们注意到，现有研究在类似设置（即二元分类）中研究弱到强生成现象，而不是实际的对齐相关任务（例如安全性）。在本文中，我们通过将弱到强生成扩展到实际对齐的背景下来弥合这一差距。我们通过经验证明了在三个复杂的对齐任务中普遍存在的弱到强生成现象：安全性、毒性和法律推理。此外，我们探索了提高对齐性能的有效策略，以提高模型结果的质量。最后，我们总结并分析了特定对齐任务的挑战和潜在解决方案，希望能够促进弱到强泛化主题的研究进展。我们的代码发布在此 https URL 上。

Title: Evaluating Morphological Compositional Generalization in Large Language Models

Authors: Mete Ismayilzada, Defne Circi, Jonne Sälevä, Hale Sirin, Abdullatif Köksal, Bhuwan Dhingra, Antoine Bosselut, Lonneke van der Plas, Duygu Ataman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12656
Pdf URL: https://arxiv.org/pdf/2410.12656
Copy Paste: [[2410.12656]] Evaluating Morphological Compositional Generalization in Large Language Models(https://arxiv.org/abs/2410.12656)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have demonstrated significant progress in various natural language generation and understanding tasks. However, their linguistic generalization capabilities remain questionable, raising doubts about whether these models learn language similarly to humans. While humans exhibit compositional generalization and linguistic creativity in language use, the extent to which LLMs replicate these abilities, particularly in morphology, is under-explored. In this work, we systematically investigate the morphological generalization abilities of LLMs through the lens of compositionality. We define morphemes as compositional primitives and design a novel suite of generative and discriminative tasks to assess morphological productivity and systematicity. Focusing on agglutinative languages such as Turkish and Finnish, we evaluate several state-of-the-art instruction-finetuned multilingual models, including GPT-4 and Gemini. Our analysis shows that LLMs struggle with morphological compositional generalization particularly when applied to novel word roots, with performance declining sharply as morphological complexity increases. While models can identify individual morphological combinations better than chance, their performance lacks systematicity, leading to significant accuracy gaps compared to humans.
摘要：大型语言模型 (LLM) 在各种自然语言生成和理解任务中都取得了显著进展。然而，它们的语言泛化能力仍然值得怀疑，这让人怀疑这些模型是否以与人类类似的方式学习语言。虽然人类在语言使用中表现出组合泛化和语言创造力，但 LLM 复制这些能力的程度，尤其是在形态学方面，尚未得到充分探索。在这项工作中，我们从组合性的视角系统地研究了 LLM 的形态泛化能力。我们将形态素定义为组合基元，并设计了一套新颖的生成和判别任务来评估形态生产力和系统性。我们专注于土耳其语和芬兰语等黏着语言，评估了几种最先进的指令微调多语言模型，包括 GPT-4 和 Gemini。我们的分析表明，LLM 在形态组合泛化方面存在困难，尤其是在应用于新词根时，随着形态复杂性的增加，性能急剧下降。虽然模型可以比偶然性更好地识别单个形态组合，但它们的性能缺乏系统性，导致与人类相比存在显著的准确性差距。

Title: WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

Authors: Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Ching Lam Cheng, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri, Garry Kuwanto, Hanyang Zhao, Haryo Akbarianto Wibowo, Holy Lovenia, Jan Christian Blaise Cruz, Jan Wira Gotama Putra, Junho Myung, Lucky Susanto, Maria Angelica Riera Machin, Marina Zhukova, Michael Anugraha, Muhammad Farid Adilazuarda, Natasha Santosa, Peerat Limkonchotiwat, Raj Dabre, Rio Alexander Audino, Samuel Cahyawijaya, Shi-Xiong Zhang, Stephanie Yulia Salim, Yi Zhou, Yinxuan Gui, David Ifeoluwa Adelani, En-Shiun Annie Lee, Shogo Okada, Ayu Purwarianti, Alham Fikri Aji, Taro Watanabe, Derry Tanti Wijaya, Alice Oh, Chong-Wah Ngo
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2410.12705
Pdf URL: https://arxiv.org/pdf/2410.12705
Copy Paste: [[2410.12705]] WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines(https://arxiv.org/abs/2410.12705)
Keywords: language model
Abstract: Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
摘要：视觉语言模型 (VLM) 通常在处理特定文化知识时会遇到困难，尤其是在英语以外的语言和代表性不足的文化背景下。为了评估它们对这些知识的理解，我们引入了 WorldCuisines，这是一个大规模的多语言和多文化、基于视觉的语言理解基准。该基准包括一个视觉问答 (VQA) 数据集，其中包含 30 种语言和方言的文本-图像对，涵盖 9 个语系，拥有超过 100 万个数据点，是迄今为止最大的多文化 VQA 基准。它包括识别菜名及其起源的任务。我们提供两种大小的评估数据集（12k 和 60k 个实例）以及一个训练数据集（100 万个实例）。我们的研究结果表明，虽然 VLM 在正确的位置上下文中表现更好，但它们在对抗性上下文和预测特定区域美食和语言方面表现不佳。为了支持未来的研究，我们发布了一个知识库，其中包含带注释的食物条目和图像以及 VQA 数据。

Title: WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation

Authors: João Matos, Shan Chen, Siena Placino, Yingya Li, Juan Carlos Climent Pardo, Daphna Idan, Takeshi Tohyama, David Restrepo, Luis F. Nakayama, Jose M. M. Pascual-Leone, Guergana Savova, Hugo Aerts, Leo A. Celi, A. Ian Wong, Danielle S. Bitterman, Jack Gallifant
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12722
Pdf URL: https://arxiv.org/pdf/2410.12722
Copy Paste: [[2410.12722]] WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation(https://arxiv.org/abs/2410.12722)
Keywords: language model
Abstract: Multimodal/vision language models (VLMs) are increasingly being deployed in healthcare settings worldwide, necessitating robust benchmarks to ensure their safety, efficacy, and fairness. Multiple-choice question and answer (QA) datasets derived from national medical examinations have long served as valuable evaluation tools, but existing datasets are largely text-only and available in a limited subset of languages and countries. To address these challenges, we present WorldMedQA-V, an updated multilingual, multimodal benchmarking dataset designed to evaluate VLMs in healthcare. WorldMedQA-V includes 568 labeled multiple-choice QAs paired with 568 medical images from four countries (Brazil, Israel, Japan, and Spain), covering original languages and validated English translations by native clinicians, respectively. Baseline performance for common open- and closed-source models are provided in the local language and English translations, and with and without images provided to the model. The WorldMedQA-V benchmark aims to better match AI systems to the diverse healthcare environments in which they are deployed, fostering more equitable, effective, and representative applications.
摘要：多模态/视觉语言模型 (VLM) 正越来越多地应用于世界各地的医疗保健环境，因此需要有强大的基准来确保其安全性、有效性和公平性。源自国家医学考试的多项选择题和答案 (QA) 数据集长期以来一直是宝贵的评估工具，但现有数据集大部分都是纯文本的，并且只适用于有限的语言和国家/地区。为了应对这些挑战，我们推出了 WorldMedQA-V，这是一个更新的多语言、多模态基准数据集，旨在评估医疗保健领域的 VLM。WorldMedQA-V 包括来自四个国家（巴西、以色列、日本和西班牙）的 568 个带标签的多项选择题 QA 和 568 张医学图像，分别涵盖原始语言和由母语临床医生验证的英语翻译。常见开源和闭源模型的基准性能以当地语言和英语翻译提供，并提供了模型中提供和不提供图像的情况。WorldMedQA-V 基准旨在使人工智能系统更好地适应其部署的多样化医疗保健环境，促进更公平、有效和具有代表性的应用。

Title: StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples

Authors: Ajay Patel, Jiacheng Zhu, Justin Qiu, Zachary Horvitz, Marianna Apidianaki, Kathleen McKeown, Chris Callison-Burch
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.12757
Pdf URL: https://arxiv.org/pdf/2410.12757
Copy Paste: [[2410.12757]] StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples(https://arxiv.org/abs/2410.12757)
Keywords: language model
Abstract: Style representations aim to embed texts with similar writing styles closely and texts with different styles far apart, regardless of content. However, the contrastive triplets often used for training these representations may vary in both style and content, leading to potential content leakage in the representations. We introduce StyleDistance, a novel approach to training stronger content-independent style embeddings. We use a large language model to create a synthetic dataset of near-exact paraphrases with controlled style variations, and produce positive and negative examples across 40 distinct style features for precise contrastive learning. We assess the quality of our synthetic data and embeddings through human and automatic evaluations. StyleDistance enhances the content-independence of style embeddings, which generalize to real-world benchmarks and outperform leading style representations in downstream applications. Our model can be found at this https URL.
摘要：风格表示的目的是将具有相似写作风格的文本紧密嵌入，将具有不同风格的文本相距较远，而不管内容如何。然而，通常用于训练这些表示的对比三元组可能在风格和内容上都有所不同，从而导致表示中可能存在内容泄漏。我们引入了 StyleDistance，这是一种训练更强大的独立于内容的风格嵌入的新方法。我们使用大型语言模型来创建具有受控风格变化的近乎精确释义的合成数据集，并在 40 种不同的风格特征中生成正反两方面的例子，以进行精确的对比学习。我们通过人工和自动评估来评估合成数据和嵌入的质量。StyleDistance 增强了风格嵌入的内容独立性，这可以推广到现实世界的基准，并在下游应用中优于领先的风格表示。我们的模型可以在这个 https URL 找到。

Title: Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information

Authors: Yingya Li, Timothy Miller, Steven Bethard, Guergana Savova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.12774
Pdf URL: https://arxiv.org/pdf/2410.12774
Copy Paste: [[2410.12774]] Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information(https://arxiv.org/abs/2410.12774)
Keywords: language model, gpt
Abstract: The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, it remains a challenging research topic to define a metric to identify the best task grouping out of a pool of many potential task combinations. We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks with not statistically different PVI estimates are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama 2 and GPT-4. The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.
摘要：多任务学习的成功在很大程度上取决于将哪些任务组合在一起。简单地将所有任务或一组随机任务组合在一起可能会导致负迁移，多任务模型的表现会比单任务模型差。尽管人们已经做出了很多努力来识别任务分组并衡量不同任务之间的相关性，但定义一个指标来从众多潜在任务组合中找出最佳任务分组仍然是一个具有挑战性的研究课题。我们提出了一种基于任务难度的任务相关性指标，该指标由逐点 V 可用信息 (PVI) 衡量。PVI 是最近提出的一种指标，用于估计给定模型的数据集包含多少可用信息。我们假设 PVI 估计值在统计上没有差异的任务足够相似，可以从联合学习过程中受益。我们进行了全面的实验，以评估该指标在一般、生物医学和临床领域的 15 个 NLP 数据集上进行任务分组的可行性。我们将联合学习器的结果与单个学习器、现有基线方法以及最近的大型语言模型（包括 Llama 2 和 GPT-4）进行了比较。结果表明，通过对具有相似 PVI 估计的任务进行分组，联合学习器以更少的总参数获得了具有竞争力的结果，并且在各个领域具有一致的性能。

Title: Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception

Authors: Jihao Zhao, Zhiyuan Ji, Pengnian Qi, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.12788
Pdf URL: https://arxiv.org/pdf/2410.12788
Copy Paste: [[2410.12788]] Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception(https://arxiv.org/abs/2410.12788)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline, which impacts the quality of knowledge-intensive tasks. This paper introduces the concept of Meta-Chunking, which refers to a granularity between sentences and paragraphs, consisting of a collection of sentences within a paragraph that have deep linguistic logical connections. To implement Meta-Chunking, we designed two strategies based on LLMs: Margin Sampling Chunking and Perplexity Chunking. The former employs LLMs to perform binary classification on whether consecutive sentences need to be segmented, making decisions based on the probability difference obtained from margin sampling. The latter precisely identifies text chunk boundaries by analyzing the characteristics of perplexity distribution. Additionally, considering the inherent complexity of different texts, we propose a strategy that combines Meta-Chunking with dynamic merging to achieve a balance between fine-grained and coarse-grained text chunking. Experiments conducted on eleven datasets demonstrate that Meta-Chunking can more efficiently improve the performance of single-hop and multi-hop question answering based on RAG. For instance, on the 2WikiMultihopQA dataset, it outperforms similarity chunking by 1.32 while only consuming 45.8% of the time. Our code is available at this https URL.
摘要：检索增强生成 (RAG) 虽然可以作为大型语言模型 (LLM) 的可行补充，但其流程中经常忽略文本分块这一关键方面，这会影响知识密集型任务的质量。本文介绍了元分块的概念，它是指句子和段落之间的粒度，由段落内具有深层语言逻辑联系的句子集合组成。为了实现元分块，我们设计了两种基于 LLM 的策略：边缘采样分块和困惑度分块。前者使用 LLM 对连续句子是否需要分割进行二分类，根据从边缘采样获得的概率差异做出决策。后者通过分析困惑度分布的特征来精确识别文本块边界。此外，考虑到不同文本的内在复杂性，我们提出了一种将元分块与动态合并相结合的策略，以实现细粒度和粗粒度文本分块之间的平衡。在 11 个数据集上进行的实验表明，Meta-Chunking 可以更有效地提高基于 RAG 的单跳和多跳问答的性能。例如，在 2WikiMultihopQA 数据集上，它比相似性分块的性能高出 1.32，而仅消耗 45.8% 的时间。我们的代码可在此 https URL 上找到。