2024-11-26

Title: Can Open-source LLMs Enhance Data Augmentation for Toxic Detection?: An Experimental Study

Authors: Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Lin Ai, Yinheng Li, Julia Hirschberg, Congrui Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15175
Pdf URL: https://arxiv.org/pdf/2411.15175
Copy Paste: [[2411.15175]] Can Open-source LLMs Enhance Data Augmentation for Toxic Detection?: An Experimental Study(https://arxiv.org/abs/2411.15175)
Keywords: gpt, llm, hallucination, prompt
Abstract: High-quality, diverse harmful data is essential to addressing real-time applications in content moderation. Current state-of-the-art approaches to toxic content detection using GPT series models are costly and lack explainability. This paper investigates the use of prompt engineering and fine-tuning techniques on open-source LLMs to enhance harmful data augmentation specifically for toxic content detection. We conduct a two-stage empirical study, with stage 1 evaluating six open-source LLMs across multiple datasets using only prompt engineering and stage 2 focusing on fine-tuning. Our findings indicate that Mistral can excel in generating harmful data with minimal hallucination. While fine-tuning these models improves data quality and diversity, challenges such as data duplication and overfitting persist. Our experimental results highlight scalable, cost-effective strategies for enhancing toxic content detection systems. These findings demonstrate the potential of open-source LLMs in creating robust content moderation tools, and the application of this method in real industrial scenarios further proves the feasibility and efficiency of fine-tuned open-source LLMs for data augmentation. We hope our study will aid in understanding the capabilities and limitations of current models in toxic content detection and drive further advancements in this field.

Title: Sycophancy in Large Language Models: Causes and Mitigations

Authors: Lars Malmqvist
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15287
Pdf URL: https://arxiv.org/pdf/2411.15287
Copy Paste: [[2411.15287]] Sycophancy in Large Language Models: Causes and Mitigations(https://arxiv.org/abs/2411.15287)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. However, their tendency to exhibit sycophantic behavior - excessively agreeing with or flattering users - poses significant risks to their reliability and ethical deployment. This paper provides a technical survey of sycophancy in LLMs, analyzing its causes, impacts, and potential mitigation strategies. We review recent work on measuring and quantifying sycophantic tendencies, examine the relationship between sycophancy and other challenges like hallucination and bias, and evaluate promising techniques for reducing sycophancy while maintaining model performance. Key approaches explored include improved training data, novel fine-tuning methods, post-deployment control mechanisms, and decoding strategies. We also discuss the broader implications of sycophancy for AI alignment and propose directions for future research. Our analysis suggests that mitigating sycophancy is crucial for developing more robust, reliable, and ethically-aligned language models.

Title: PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models

Authors: Gerald Friedland, Xin Huang, Yueying Cui, Vishaal Kapoor, Ashish Khetan, Sanjiv Das
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15320
Pdf URL: https://arxiv.org/pdf/2411.15320
Copy Paste: [[2411.15320]] PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models(https://arxiv.org/abs/2411.15320)
Keywords: language model, llm
Abstract: We propose PPLqa, an easy-to-compute, language-independent, information-theoretic metric to measure the quality of responses of generative Large Language Models (LLMs) in an unsupervised way, without requiring ground truth annotations or human supervision. The method and metric enable users to rank generative language models for quality of responses, so as to make a selection of the best model for a given task. Our single metric assesses LLMs with an approach that subsumes, but is not explicitly based on, coherence and fluency (quality of writing) and relevance and consistency (appropriateness of response) to the query. PPLqa performs as well as other related metrics, and works better with long-form Q&A. Thus, PPLqa enables bypassing the lengthy annotation process required for ground truth evaluations, and it also correlates well with human and LLM rankings.
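
Note: The abstract does not give the PPLqa formula. As background only, the sketch below computes ordinary perplexity from per-token log-probabilities and ranks models by their mean value. The function names and the sample numbers are hypothetical, and this is not claimed to be the paper's metric.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


def rank_models(scores):
    """Rank model names by mean perplexity, lowest first."""
    means = {name: sum(values) / len(values) for name, values in scores.items()}
    return sorted(means, key=means.get)


# Hypothetical log-probs for two answers; not data from the paper.
confident = [math.log(0.5), math.log(0.5)]
uncertain = [math.log(0.1), math.log(0.1)]
ranking = rank_models({
    "model_a": [perplexity(uncertain)],
    "model_b": [perplexity(confident)],
})
```

Lower perplexity means the scoring model found the text more predictable, which is why the ranking sorts in ascending order.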

Title: Transforming NLU with Babylon: A Case Study in Development of Real-time, Edge-Efficient, Multi-Intent Translation System for Automated Drive-Thru Ordering

Authors: Mostafa Varzaneh, Pooja Voladoddi, Tanmay Bakshi, Uma Gunturi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15372
Pdf URL: https://arxiv.org/pdf/2411.15372
Copy Paste: [[2411.15372]] Transforming NLU with Babylon: A Case Study in Development of Real-time, Edge-Efficient, Multi-Intent Translation System for Automated Drive-Thru Ordering(https://arxiv.org/abs/2411.15372)
Keywords: agent
Abstract: Real-time conversational AI agents face challenges in performing Natural Language Understanding (NLU) in dynamic, outdoor environments like automated drive-thru systems. These settings require NLU models to handle background noise, diverse accents, and multi-intent queries while operating under strict latency and memory constraints on edge devices. Additionally, robustness to errors from upstream Automatic Speech Recognition (ASR) is crucial, as ASR outputs in these environments are often noisy. We introduce Babylon, a transformer-based architecture that tackles NLU as an intent translation task, converting natural language inputs into sequences of regular language units ('transcodes') that encode both intents and slot information. This formulation allows Babylon to manage multi-intent scenarios in a single dialogue turn. Furthermore, Babylon incorporates an LSTM-based token pooling mechanism to preprocess phoneme sequences, reducing input length and optimizing for low-latency, low-memory edge deployment. This also helps mitigate inaccuracies in ASR outputs, enhancing system robustness. While this work focuses on drive-thru ordering, Babylon's design extends to similar noise-prone scenarios, e.g., ticketing kiosks. Our experiments show that Babylon achieves significantly better accuracy-latency-memory footprint trade-offs than typically employed NMT models like Flan-T5 and BART, demonstrating its effectiveness for real-time NLU in edge deployment settings.

Title: On the Impact of Fine-Tuning on Chain-of-Thought Reasoning

Authors: Elita Lobo, Chirag Agarwal, Himabindu Lakkaraju
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15382
Pdf URL: https://arxiv.org/pdf/2411.15382
Copy Paste: [[2411.15382]] On the Impact of Fine-Tuning on Chain-of-Thought Reasoning(https://arxiv.org/abs/2411.15382)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models have emerged as powerful tools for general intelligence, showcasing advanced natural language processing capabilities that find applications across diverse domains. Despite their impressive performance, recent studies have highlighted the potential for significant enhancements in LLMs' task-specific performance through fine-tuning strategies like Reinforcement Learning with Human Feedback (RLHF), supervised fine-tuning (SFT), and Quantized Low-Rank Adapters (Q-LoRA). However, previous works have shown that while fine-tuning offers significant performance gains, it also leads to challenges such as catastrophic forgetting and privacy and safety risks. Yet there has been little to no work on understanding the impact of fine-tuning on the reasoning capabilities of LLMs. Our research investigates the effect of fine-tuning on the reasoning abilities of LLMs, addressing critical questions regarding the impact of task-specific fine-tuning on overall reasoning capabilities, the influence of fine-tuning on Chain-of-Thought (CoT) reasoning performance, and the implications for the faithfulness of CoT reasoning. By exploring these dimensions, our study shows the impact of fine-tuning on LLM reasoning capabilities, where the faithfulness of CoT reasoning, on average across four datasets, decreases, highlighting potential shifts in internal mechanisms of the LLMs resulting from fine-tuning processes.

Title: From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set

Authors: Mara Finkelstein, Dan Deutsch, Parker Riley, Juraj Juraska, Geza Kovacs, Markus Freitag
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15387
Pdf URL: https://arxiv.org/pdf/2411.15387
Copy Paste: [[2411.15387]] From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set(https://arxiv.org/abs/2411.15387)
Keywords: llm, prompt
Abstract: As LLMs continue to become more powerful and versatile, human evaluation has quickly become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure certain capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method for designing automatic metrics across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.
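
Note: A minimal sketch of the idea described in the abstract, building in-context examples from historical ratings of the same test-set item. The prompt wording, the 0 to 100 score scale, and the `history` layout are assumptions for illustration; the abstract does not specify the actual prompt or rating format.

```python
def build_specialist_prompt(source, candidate, history, max_examples=3):
    """Assemble an in-context prompt from past ratings of the same source segment."""
    # history maps a source segment to a list of (translation, score) pairs.
    examples = history.get(source, [])[:max_examples]
    lines = ["Rate the translation quality from 0 to 100.", ""]
    for past_translation, past_score in examples:
        lines += [
            f"Source: {source}",
            f"Translation: {past_translation}",
            f"Score: {past_score}",
            "",
        ]
    # The new candidate goes last, with the score left for the model to fill in.
    lines += [f"Source: {source}", f"Translation: {candidate}", "Score:"]
    return "\n".join(lines)


# Hypothetical rating history; not data from the paper.
history = {"Guten Morgen": [("Good morning", 95), ("Good tomorrow", 40)]}
prompt = build_specialist_prompt("Guten Morgen", "Morning, good", history)
```

Because the examples are keyed on the source segment, the prompt only specializes to items that already have ratings; unseen segments fall back to a zero-shot prompt.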

Title: Exploring Large Language Models for Multimodal Sentiment Analysis: Challenges, Benchmarks, and Future Directions

Authors: Shezheng Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15408
Pdf URL: https://arxiv.org/pdf/2411.15408
Copy Paste: [[2411.15408]] Exploring Large Language Models for Multimodal Sentiment Analysis: Challenges, Benchmarks, and Future Directions(https://arxiv.org/abs/2411.15408)
Keywords: language model, gpt, llm, chat
Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract aspect terms and their corresponding sentiment polarities from multimodal information, including text and images. While traditional supervised learning methods have shown effectiveness in this task, the adaptability of large language models (LLMs) to MABSA remains uncertain. Recent advances in LLMs, such as Llama2, LLaVA, and ChatGPT, demonstrate strong capabilities in general tasks, yet their performance in complex and fine-grained scenarios like MABSA is underexplored. In this study, we conduct a comprehensive investigation into the suitability of LLMs for MABSA. To this end, we construct a benchmark to evaluate the performance of LLMs on MABSA tasks and compare them with state-of-the-art supervised learning methods. Our experiments reveal that, while LLMs demonstrate potential in multimodal understanding, they face significant challenges in achieving satisfactory results for MABSA, particularly in terms of accuracy and inference time. Based on these findings, we discuss the limitations of current LLMs and outline directions for future research to enhance their capabilities in multimodal sentiment analysis.

Title: Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts

Authors: Qizhou Chen, Chengyu Wang, Dakan Wang, Taolin Zhang, Wangyue Li, Xiaofeng He
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.15432
Pdf URL: https://arxiv.org/pdf/2411.15432
Copy Paste: [[2411.15432]] Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts(https://arxiv.org/abs/2411.15432)
Keywords: language model, llm
Abstract: Model editing aims to correct inaccurate knowledge, update outdated information, and incorporate new data into Large Language Models (LLMs) without the need for retraining. This task poses challenges in lifelong scenarios where edits must be continuously applied for real-world applications. While some editors demonstrate strong robustness for lifelong editing in pure LLMs, Vision LLMs (VLLMs), which incorporate an additional vision modality, are not directly adaptable to existing LLM editors. In this paper, we propose LiveEdit, a LIfelong Vision language modEl Edit to bridge the gap between lifelong LLM editing and VLLMs. We begin by training an editing expert generator to independently produce low-rank experts for each editing instance, with the goal of correcting the relevant responses of the VLLM. A hard filtering mechanism is developed to utilize visual semantic knowledge, thereby coarsely eliminating visually irrelevant experts for input queries during the inference stage of the post-edited model. Finally, to integrate visually relevant experts, we introduce a soft routing mechanism based on textual semantic relevance to achieve multi-expert fusion. For evaluation, we establish a benchmark for lifelong VLLM editing. Extensive experiments demonstrate that LiveEdit offers significant advantages in lifelong VLLM editing scenarios. Further experiments validate the rationality and effectiveness of each module design in LiveEdit.
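
Note: The two-stage selection described in the abstract (hard filtering on visual relevance, then soft routing on textual relevance) can be sketched roughly as follows. Cosine similarity, the threshold value, the softmax weighting, and the `vis_key`/`txt_key` fields are all assumptions made for illustration, not details confirmed by the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def route_experts(query_vis, query_txt, experts, vis_threshold=0.5):
    """Return {expert_name: weight} after hard visual filtering and soft textual routing."""
    # Hard filter: drop experts whose visual key is too dissimilar to the query.
    kept = [e for e in experts if cosine(query_vis, e["vis_key"]) >= vis_threshold]
    if not kept:
        return {}
    # Soft routing: softmax over textual similarity of the surviving experts.
    sims = [cosine(query_txt, e["txt_key"]) for e in kept]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return {e["name"]: w / total for e, w in zip(kept, exps)}


# Hypothetical experts with 2-d keys; real keys would be model embeddings.
experts = [
    {"name": "a", "vis_key": [1.0, 0.0], "txt_key": [1.0, 0.0]},
    {"name": "b", "vis_key": [0.0, 1.0], "txt_key": [1.0, 0.0]},
    {"name": "c", "vis_key": [1.0, 0.1], "txt_key": [0.0, 1.0]},
]
weights = route_experts([1.0, 0.0], [1.0, 0.0], experts)
```

The returned weights would then scale each expert's low-rank update before the updates are summed.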

Title: Towards Robust Evaluation of Unlearning in LLMs via Data Transformations

Authors: Abhinav Joshi, Shaswati Saha, Divyaksh Shukla, Sriram Vema, Harsh Jhamtani, Manas Gaur, Ashutosh Modi
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15477
Pdf URL: https://arxiv.org/pdf/2411.15477
Copy Paste: [[2411.15477]] Towards Robust Evaluation of Unlearning in LLMs via Data Transformations(https://arxiv.org/abs/2411.15477)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have proven to be a great success in a wide range of applications, from regular NLP-based use cases to AI agents. LLMs have been trained on a vast corpus of texts from various sources; despite the best efforts during the data pre-processing stage while training the LLMs, they may pick up some undesirable information such as personally identifiable information (PII). Consequently, research in the area of Machine Unlearning (MUL) has recently become active; the main idea is to force LLMs to forget (unlearn) certain information (e.g., PII) without suffering from performance loss on regular tasks. In this work, we examine the robustness of the existing MUL techniques for their ability to enable leakage-proof forgetting in LLMs. In particular, we examine the effect of data transformation on forgetting, i.e., is an unlearned LLM able to recall forgotten information if there is a change in the format of the input? Our findings on the TOFU dataset highlight the necessity of using diverse data formats to quantify unlearning in LLMs more reliably.

Title: Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai

Authors: Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, Peerat Limkonchotiwat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15484
Pdf URL: https://arxiv.org/pdf/2411.15484
Copy Paste: [[2411.15484]] Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai(https://arxiv.org/abs/2411.15484)
Keywords: language model, llm
Abstract: We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at this https URL.

Title: Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark

Authors: Rong-Cheng Tu, Zi-Ao Ma, Tian Lan, Yuehao Zhao, Heyan Huang, Xian-Ling Mao
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.15488
Pdf URL: https://arxiv.org/pdf/2411.15488
Copy Paste: [[2411.15488]] Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark(https://arxiv.org/abs/2411.15488)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Driven by the remarkable progress in diffusion models, text-to-image generation has made significant strides, creating a pressing demand for automatic quality evaluation of generated images. Current state-of-the-art automatic evaluation methods heavily rely on Multi-modal Large Language Models (MLLMs), particularly powerful commercial models like GPT-4o. While these models are highly effective, their substantial costs limit scalability in large-scale evaluations. Adopting open-source MLLMs is an alternative; however, their performance falls short due to significant limitations in processing multi-modal data compared to commercial MLLMs. To tackle these problems, we first propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset, where the complex evaluation task is decoupled into simpler sub-tasks, effectively reducing the learning complexity. Based on this dataset, we design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Furthermore, to reliably and comprehensively assess prior works and our proposed model, we manually annotate a meta-evaluation benchmark that includes chain-of-thought explanations alongside quality scores for generated images. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline, VIEScore, with over 4.6% improvement in Spearman and Kendall correlations with human judgments.
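
Note: The abstract reports agreement with human judgments as Spearman and Kendall correlations. The sketch below shows how these two rank correlations are computed for a small set of scores. It assumes no tied values, and the sample scores are made up for illustration.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x, y):
    """Spearman's rho via the squared rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    balance = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            balance += (s > 0) - (s < 0)
    return balance / (n * (n - 1) / 2)


# Hypothetical human and metric scores for four images.
human = [1, 2, 3, 4]
metric = [0.1, 0.4, 0.3, 0.9]
rho = spearman(human, metric)
tau = kendall(human, metric)
```

For data with ties, a library routine that applies tie corrections would be the safer choice.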

Title: Traditional Chinese Medicine Case Analysis System for High-Level Semantic Abstraction: Optimized with Prompt and RAG

Authors: Peng Xu, Hongjin Wu, Jinle Wang, Rongjia Lin, Liwei Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15491
Pdf URL: https://arxiv.org/pdf/2411.15491
Copy Paste: [[2411.15491]] Traditional Chinese Medicine Case Analysis System for High-Level Semantic Abstraction: Optimized with Prompt and RAG(https://arxiv.org/abs/2411.15491)
Keywords: prompt
Abstract: This paper details a technical plan for building a clinical case database for Traditional Chinese Medicine (TCM) using web scraping. Leveraging multiple platforms, including 360doc, we gathered over 5,000 TCM clinical cases, performed data cleaning, and structured the dataset with crucial fields such as patient details, pathogenesis, syndromes, and annotations. Using the Baidu_ERNIE_Speed_128K API, we removed redundant information and generated the final answers through the DeepSeekv2 API, outputting results in standard JSON format. We optimized data recall with RAG and rerank techniques during retrieval and developed a hybrid matching scheme. By combining a two-stage retrieval method with keyword matching via Jieba, we significantly enhanced the accuracy of model outputs.
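
Note: One way to picture the hybrid matching scheme mentioned in the abstract is a weighted mix of a retrieval score and a keyword-overlap score. The weighting `alpha`, the overlap measure, and the function names are assumptions for illustration, not the paper's scheme. The default tokenizer splits on whitespace; for Chinese text a tokenizer such as Jieba's `jieba.lcut` could be passed as `tokenize` instead.

```python
def keyword_overlap(query, document, tokenize=str.split):
    """Fraction of query tokens that also occur in the document."""
    query_tokens = set(tokenize(query))
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(tokenize(document))) / len(query_tokens)


def hybrid_rank(query, candidates, alpha=0.5, tokenize=str.split):
    """Re-rank (document, retrieval_score) pairs by a weighted mix of both signals."""
    scored = [
        (alpha * score + (1 - alpha) * keyword_overlap(query, doc, tokenize), doc)
        for doc, score in candidates
    ]
    return [doc for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Hypothetical retrieval results; scores are made up.
candidates = [("headache and dizziness", 0.6), ("dry cough with fever", 0.5)]
ranked = hybrid_rank("cough fever", candidates)
```

Here the second case overtakes the first because it contains both query keywords, even though its retrieval score is lower.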

Title: Enhancing Grammatical Error Detection using BERT with Cleaned Lang-8 Dataset

Authors: Rahul Nihalani, Kushal Shah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15523
Pdf URL: https://arxiv.org/pdf/2411.15523
Copy Paste: [[2411.15523]] Enhancing Grammatical Error Detection using BERT with Cleaned Lang-8 Dataset(https://arxiv.org/abs/2411.15523)
Keywords: llm
Abstract: This paper presents an improved LLM-based model for Grammatical Error Detection (GED), which is a very challenging and equally important problem for many applications. The traditional approach to GED involved hand-designed features, but recently, Neural Networks (NN) have automated the discovery of these features, improving performance in GED. Traditional rule-based systems have an F1 score of 0.50-0.60, and earlier machine learning models, including decision trees and simple neural networks, give an F1 score of 0.65-0.75. Previous deep learning models, for example, Bi-LSTM, have reported F1 scores in the range of 0.80 to 0.90. In our study, we have fine-tuned various transformer models using the Lang8 dataset rigorously cleaned by us. In our experiments, the BERT-base-uncased model gave an impressive performance with an F1 score of 0.91 and accuracy of 98.49% on training data and 90.53% on testing data, also showcasing the importance of data cleaning. Increasing model size using BERT-large-uncased or RoBERTa-large did not give any noticeable improvements in performance or advantage for this task, underscoring that larger models are not always better. Our results clearly show how far rigorous data cleaning and simple transformer-based models can go toward significantly improving the quality of GED.
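
Note: For readers comparing the F1 and accuracy figures quoted in the abstract, the sketch below shows how both are computed for binary error detection. The label convention (1 = sentence contains an error) and the sample labels are assumptions for illustration.

```python
def f1_score(predicted, gold):
    """F1 for the positive class (1 = sentence contains an error)."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == 1 and g == 1 for p, g in pairs)
    fp = sum(p == 1 and g == 0 for p, g in pairs)
    fn = sum(p == 0 and g == 1 for p, g in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(predicted, gold):
    """Share of labels that match exactly."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical labels for five sentences.
predicted = [1, 1, 0, 0, 1]
gold = [1, 0, 0, 1, 1]
f1 = f1_score(predicted, gold)
acc = accuracy(predicted, gold)
```

Accuracy counts correct negatives as well, so it can diverge from F1 when the classes are imbalanced.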

Title: From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars

Authors: Albert Kornilov, Tatiana Shavrina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15577
Pdf URL: https://arxiv.org/pdf/2411.15577
Copy Paste: [[2411.15577]] From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars(https://arxiv.org/abs/2411.15577)
Keywords: language model, retrieval-augmented generation
Abstract: Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources rely primarily on formal descriptions of grammar and vocabulary. In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank. This set of benchmarks offers the first comprehensive evaluation of language models' in-context ability to accurately interpret and extract linguistic features, providing a critical resource for scaling NLP to low-resource languages. The code and data are publicly available at this https URL.

Title: A Survey on LLM-as-a-Judge

Authors: Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, Jian Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15594
Pdf URL: https://arxiv.org/pdf/2411.15594
Copy Paste: [[2411.15594]] A Survey on LLM-as-a-Judge(https://arxiv.org/abs/2411.15594)
Keywords: language model, llm
Abstract: Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discuss practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.

Title: Multi-label Sequential Sentence Classification via Large Language Model

Authors: Mengfei Lan, Lecheng Zheng, Shufan Ming, Halil Kilicoglu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15623
Pdf URL: https://arxiv.org/pdf/2411.15623
Copy Paste: [[2411.15623]] Multi-label Sequential Sentence Classification via Large Language Model(https://arxiv.org/abs/2411.15623)
Keywords: language model, llm, prompt
Abstract: Sequential sentence classification (SSC) in scientific publications is crucial for supporting downstream tasks such as fine-grained information retrieval and extractive summarization. However, current SSC methods are constrained by model size, sequence length, and a single-label setting. To address these limitations, this paper proposes LLM-SSC, a large language model (LLM)-based framework for both single- and multi-label SSC tasks. Unlike previous approaches that employ small- or medium-sized language models, the proposed framework utilizes LLMs to generate SSC labels through designed prompts, which enhance task understanding by incorporating demonstrations and a query to describe the prediction target. We also present a multi-label contrastive learning loss with an auto-weighting scheme, enabling the multi-label classification task. To support our multi-label SSC analysis, we introduce and release a new dataset, biorc800, which mainly contains unstructured abstracts in the biomedical domain with manual annotations. Experiments demonstrate LLM-SSC's strong performance in SSC under both in-context learning and task-specific tuning settings. We release biorc800 and our code at: this https URL.
摘要：科学出版物中的顺序句子分类 (SSC) 对于支持下游任务（例如细粒度信息检索和提取摘要）至关重要。然而，当前的 SSC 方法受到模型大小、序列长度和单标签设置的限制。为了解决这些限制，本文提出了 LLM-SSC，这是一个基于大型语言模型 (LLM) 的框架，适用于单标签和多标签 SSC 任务。与以前采用小型或中型语言模型的方法不同，所提出的框架利用 LLM 通过设计的提示生成 SSC 标签，通过结合演示和查询来描述预测目标，从而增强任务理解。我们还提出了一种具有自动加权方案的多标签对比学习损失，从而实现多标签分类任务。为了支持我们的多标签 SSC 分析，我们引入并发布了一个新的数据集 biorc800，它主要包含带有手动注释的生物医学领域的非结构化摘要。实验证明了 LLM-SSC 在上下文学习和特定于任务的调整设置下在 SSC 中的强大性能。我们在以下 https URL 发布 biorc800 和我们的代码。

Title: "All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations

Authors: Michael Hardy
Subjects: cs.CL, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2411.15634
Pdf URL: https://arxiv.org/pdf/2411.15634
Copy Paste: [[2411.15634]] "All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations(https://arxiv.org/abs/2411.15634)
Keywords: language model, gpt, llm
Abstract: "Gold" and "ground truth" human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently only human task. We answer the question of whether such a task can be automated using two Large Language Model (LLM) architecture families--encoders and GPT decoders, using novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieves state-of-the-art, even "super-human", results across all classroom annotation tasks. But not all these positive results remain after using more rigorous evaluation measures, which reveal spurious correlations and nonrandom racial biases across models and humans. This study then expands these methods to estimate how model use would change human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.
摘要：“黄金”和“基本事实”人工介导标签存在错误。这种错误的影响可能会逃避常见的标签质量指标，或在模型评估过程中掩盖准确性、偏见、公平性和实用性的问题。这项研究展示了即使在专家可靠性非常低的情况下回答这些问题的方法。我们分析了人工标签、GPT 模型评级和变压器编码器模型注释，这些注释描述了课堂教学的质量，这是一项重要、昂贵且目前唯一由人工完成的任务。我们回答了这样一个问题：是否可以使用两个大型语言模型 (LLM) 架构系列——编码器和 GPT 解码器来自动完成这样的任务，使用新颖的方法从六个维度评估标签质量：一致性、置信度、有效性、偏见、公平性和有用性。首先，我们证明，在标签较差的情况下使用标准指标可以掩盖标签和模型的质量：编码器系列模型在所有课堂注释任务中都实现了最先进的，甚至是“超人”的结果。但在使用更严格的评估措施后，并非所有这些积极结果都能得到保留，这些措施揭示了模型和人类之间的虚假相关性和非随机种族偏见。然后，这项研究扩展了这些方法，以估计如果在人机交互环境中使用模型，模型使用将如何改变人类标签质量，发现 GPT 模型标签中捕获的方差会降低受这些模型影响的人类的可靠性。我们确定了某些 LLM 在当前数据的普遍性范围内可以提高昂贵的课堂教学人工评分质量的领域。

Title: AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset

Authors: Tobi Olatunji, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, Folafunmi Omofoye, Foutse Yuehgoh, Timothy Faniran, Bonaventure F. P. Dossou, Moshood Yekini, Jonas Kemp, Katherine Heller, Jude Chidubem Omeke, Chidi Asuzu MD, Naome A. Etori, Aimérou Ndiaye, Ifeoma Okoh, Evans Doe Ocansey, Wendy Kinara, Michael Best, Irfan Essa, Stephen Edward Moore, Chris Fourie, Mercy Nyamewaa Asiedu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15640
Pdf URL: https://arxiv.org/pdf/2411.15640
Copy Paste: [[2411.15640]] AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset(https://arxiv.org/abs/2411.15640)
Keywords: language model, llm
Abstract: Recent advancements in large language model (LLM) performance on medical multiple choice question (MCQ) benchmarks have stimulated interest from healthcare providers and patients globally. Particularly in low- and middle-income countries (LMICs) facing acute physician shortages and lack of specialists, LLMs offer a potentially scalable pathway to enhance healthcare access and reduce costs. However, their effectiveness in the Global South, especially across the African continent, remains to be established. In this work, we introduce AfriMed-QA, the first large-scale Pan-African English multi-specialty medical Question-Answering (QA) dataset, comprising 15,000 questions (open and closed-ended) sourced from over 60 medical schools across 16 countries, covering 32 medical specialties. We further evaluate 30 LLMs across multiple axes including correctness and demographic bias. Our findings show significant performance variation across specialties and geographies, and MCQ performance clearly lags USMLE (MedQA). We find that biomedical LLMs underperform general models and smaller edge-friendly LLMs struggle to achieve a passing score. Interestingly, human evaluations show a consistent consumer preference for LLM answers and explanations when compared with clinician answers.
摘要：大型语言模型 (LLM) 在医学多项选择题 (MCQ) 基准测试中的表现最近取得了进展，引起了全球医疗服务提供者和患者的兴趣。特别是在面临严重医生短缺和缺乏专家的中低收入国家 (LMIC)，LLM 提供了一种潜在的可扩展途径来增强医疗保健服务并降低成本。然而，它们在全球南方，特别是在整个非洲大陆的有效性仍有待确定。在这项工作中，我们推出了 AfriMed-QA，这是第一个大规模泛非英语多专业医学问答 (QA) 数据集，15,000 个问题（开放式和封闭式）来自 16 个国家/地区的 60 多所医学院，涵盖 32 个医学专业。我们进一步从正确性和人口统计偏差等多个维度评估了 30 个 LLM。我们的研究结果显示，不同专业和地区的表现存在显著差异，MCQ 表现明显落后于 USMLE（MedQA）。我们发现，生物医学法学硕士的表现不如一般模型，而规模较小的边缘友好型法学硕士则难以获得及格分数。有趣的是，与临床医生的答案相比，人工评估显示消费者对法学硕士答案和解释的偏好一致。

Title: Improving Next Tokens via Second-Last Predictions with Generate and Refine

Authors: Johannes Schneider
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15661
Pdf URL: https://arxiv.org/pdf/2411.15661
Copy Paste: [[2411.15661]] Improving Next Tokens via Second-Last Predictions with Generate and Refine(https://arxiv.org/abs/2411.15661)
Keywords: language model, gpt
Abstract: Autoregressive language models like GPT aim at predicting next tokens, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder-only architecture for predicting the second-last token for a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models by employing a structured deterministic approach towards masking tokens. We use our model to improve the next-token predictions of a standard GPT by combining both predictions in a "generate-then-refine" approach. We show on different variants of GPT-2 and different datasets that (not unexpectedly) second-last token predictions are much more accurate, i.e., more than 15% higher accuracy than ordinary next-token predictors. The "generate-then-refine" approach also demonstrates notable improvements in next-token predictions, yielding smaller yet consistent and significant gains.
摘要：像 GPT 这样的自回归语言模型旨在预测下一个标记，而像 BERT 这样的自动编码模型则是针对预测掩码标记等任务进行训练的。我们训练了一个仅用于解码器的架构，用于预测标记序列的倒数第二个标记。通过采用结构化的确定性方法来掩码标记，我们的方法比 BERT 式模型具有更高的计算训练效率。我们使用我们的模型通过结合两种预测，采用“生成-然后-细化”的方法改进标准 GPT 的下一个标记预测。我们在 GPT-2 的不同变体和不同的数据集上展示了（不出所料）倒数第二个标记预测要准确得多，即比普通的下一个标记预测器准确率高出 15% 以上。“生成-然后-细化”方法还显示出下一个标记预测的显着改进，产生了较小但一致且显着的收益。

Title: Ontology-Constrained Generation of Domain-Specific Clinical Summaries

Authors: Gaya Mehenni, Amal Zouaq
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15666
Pdf URL: https://arxiv.org/pdf/2411.15666
Copy Paste: [[2411.15666]] Ontology-Constrained Generation of Domain-Specific Clinical Summaries(https://arxiv.org/abs/2411.15666)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) offer promising solutions for text summarization. However, some domains require specific information to be available in the summaries. Generating these domain-adapted summaries is still an open challenge. Similarly, hallucinations in generated content are a major drawback of current approaches, preventing their deployment. This study proposes a novel approach that leverages ontologies to create domain-adapted summaries, both structured and unstructured. We employ an ontology-guided constrained decoding process to reduce hallucinations while improving relevance. When applied to the medical domain, our method shows potential in summarizing Electronic Health Records (EHRs) across different specialties, allowing doctors to focus on the information most relevant to their domain. Evaluation on the MIMIC-III dataset demonstrates improvements in generating domain-adapted summaries of clinical notes and hallucination reduction.
摘要：大型语言模型 (LLM) 为文本摘要提供了有前途的解决方案。但是，某些领域要求摘要中提供特定信息。生成这些领域适应性摘要仍然是一个开放的挑战。同样，生成内容中的幻觉是当前方法的主要缺点，阻碍了它们的部署。本研究提出了一种新方法，利用本体来创建结构化和非结构化的领域适应性摘要。我们采用本体引导的约束解码过程来减少幻觉，同时提高相关性。当应用于医学领域时，我们的方法显示出总结不同专业的电子健康记录 (EHR) 的潜力，使医生能够专注于与其领域最相关的信息。对 MIMIC-III 数据集的评估表明，在生成领域适应性临床记录摘要和减少幻觉方面有所改进。

Title: RAMIE: Retrieval-Augmented Multi-task Information Extraction with Large Language Models on Dietary Supplements

Authors: Zaifu Zhan, Shuang Zhou, Mingchen Li, Rui Zhang
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2411.15700
Pdf URL: https://arxiv.org/pdf/2411.15700
Copy Paste: [[2411.15700]] RAMIE: Retrieval-Augmented Multi-task Information Extraction with Large Language Models on Dietary Supplements(https://arxiv.org/abs/2411.15700)
Keywords: language model, llm, prompt
Abstract: Objective: We aimed to develop an advanced multi-task large language model (LLM) framework to extract multiple types of information about dietary supplements (DS) from clinical records. Methods: We used four core DS information extraction tasks, namely named entity recognition (NER: 2,949 clinical sentences), relation extraction (RE: 4,892 sentences), triple extraction (TE: 2,949 sentences), and usage classification (UC: 2,460 sentences), as our multitasks. We introduced a novel Retrieval-Augmented Multi-task Information Extraction (RAMIE) Framework, which 1) employs instruction fine-tuning techniques with task-specific prompts, 2) trains LLMs for multiple tasks with improved storage efficiency and lower training costs, and 3) incorporates retrieval-augmented generation (RAG) techniques by retrieving similar examples from the training set. We compared RAMIE's performance to LLMs with instruction fine-tuning alone and conducted an ablation study to assess the contributions of multi-task learning (MTL) and RAG to improved multitasking performance. Results: With the aid of the RAMIE framework, Llama2-13B achieved an F1 score of 87.39 (3.51% improvement) on the NER task and demonstrated outstanding performance on the RE task with an F1 score of 93.74 (1.15% improvement). For the TE task, Llama2-7B scored 79.45 (14.26% improvement), and MedAlpaca-7B achieved the highest F1 score of 93.45 (0.94% improvement) on the UC task. The ablation study revealed that while MTL increased efficiency with a slight trade-off in performance, RAG significantly boosted overall accuracy. Conclusion: This study presents a novel RAMIE framework that demonstrates substantial improvements in multi-task information extraction for DS-related data from clinical records. Our framework can potentially be applied to other domains.
摘要：\textbf{目标：}我们旨在开发一种先进的多任务大型语言模型 (LLM) 框架，以从临床记录中提取有关膳食补充剂 (DS) 的多种类型的信息。 \textbf{方法：}我们使用四个核心 DS 信息提取任务 - 即命名实体识别 (NER：2,949 个临床句子)、关系提取 (RE：4,892 个句子)、三重提取 (TE：2,949 个句子) 和用法分类 (UC：2,460 个句子) 作为我们的多任务。我们引入了一种新颖的检索增强多任务信息提取 (RAMIE) 框架，包括：1) 采用具有任务特定提示的指令微调技术，2) 训练 LLM 以用于多个任务，以提高存储效率并降低训练成本，以及 3) 通过从训练集中检索类似示例来结合检索增强生成 (RAG) 技术。我们将 RAMIE 的性能与仅进行指令微调的 LLM 进行了比较，并进行了一项消融研究，以评估多任务学习和 RAG 对提高多任务性能的贡献。\textbf{结果：}借助 RAMIE 框架，Llama2-13B 在 NER 任务中获得了 87.39 的 F1 分数（提高了 3.51\%），并在 RE 任务中表现出色，F1 分数为 93.74（提高了 1.15\%）。对于 TE 任务，Llama2-7B 得分为 79.45（提高了 14.26\%），而 MedAlpaca-7B 在 UC 任务上获得了最高的 F1 分数 93.45（提高了 0.94\%）。消融研究表明，虽然 MTL 提高了效率但性能略有下降，但 RAG 显著提高了整体准确性。 \textbf{结论：}本研究提出了一种新颖的 RAMIE 框架，该框架在从临床记录中对 DS 相关数据进行多任务信息提取方面取得了显著的改进。我们的框架可能可以应用于其他领域。

Title: LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training

Authors: Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, Yu Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15708
Pdf URL: https://arxiv.org/pdf/2411.15708
Copy Paste: [[2411.15708]] LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training(https://arxiv.org/abs/2411.15708)
Keywords: language model, llm
Abstract: Recently, inspired by the concept of sparsity, Mixture-of-Experts (MoE) models have gained increasing popularity for scaling model size while keeping the number of activated parameters constant. In this study, we thoroughly investigate the sparsity of the dense LLaMA model by constructing MoE for both the attention (i.e., Attention MoE) and MLP (i.e., MLP MoE) modules in the transformer blocks. Specifically, we investigate different expert construction methods and granularities under the same activation conditions to analyze the impact of sparsifying the model. Additionally, to comprehensively evaluate the model's capabilities across various domains (e.g., conversation, code, math) after sparsification, we apply sparsity to the instructed large language models (LLMs) and construct instructed MoE models. To counteract the performance degradation resulting from increased sparsity, we design a two-stage post-training strategy to enhance model performance. Experiments on the LLaMA3 model demonstrate the potential effectiveness of this approach for future developments of instructed MoE models. The source codes and models are available at: this https URL.
摘要：最近，受稀疏性概念的启发，混合专家 (MoE) 模型因在保持激活参数数量不变的情况下扩展模型大小而越来越受欢迎。在本研究中，我们通过为 Transformer 块中的注意 (即注意 MoE) 和 MLP (即 MLP MoE) 模块构建 MoE，彻底研究了密集 LLaMA 模型的稀疏性。具体来说，我们研究了相同激活条件下的不同专家构建方法和粒度，以分析稀疏化模型的影响。此外，为了全面评估稀疏化后模型在各个领域 (例如对话、代码、数学) 的能力，我们将稀疏性应用于指导大型语言模型 (LLM) 并构建指导 MoE 模型。为了抵消稀疏性增加导致的性能下降，我们设计了一个两阶段后训练策略来提高模型性能。在 LLaMA3 模型上的实验证明了这种方法对于未来开发指导 MoE 模型的潜在有效性。源代码和模型可在以下位置获得：\url{此 https URL}。

Title: Development of Pre-Trained Transformer-based Models for the Nepali Language

Authors: Prajwal Thapa, Jinu Nyachhyon, Mridul Sharma, Bal Krishna Bal
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15734
Pdf URL: https://arxiv.org/pdf/2411.15734
Copy Paste: [[2411.15734]] Development of Pre-Trained Transformer-based Models for the Nepali Language(https://arxiv.org/abs/2411.15734)
Keywords: language model, gpt
Abstract: Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models, i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on the Nep-gLUE benchmark, scoring 95.60, and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.
摘要：基于 Transformer 的预训练语言模型已经在自然语言处理 (NLP) 领域占据主导地位很长一段时间了。然而，全球约有 3200 万人使用的尼泊尔语在该领域仍然严重缺乏代表性。这种代表性不足主要归因于单语数据语料库的稀缺和尼泊尔语可用资源的有限。虽然现有的努力主要集中在基于编码器的基本模型上，但在基于解码器的架构的探索方面存在明显差距。为了弥补这一差距，我们收集了 27.5 GB 的尼泊尔语文本数据，大约是任何以前可用的尼泊尔语语料库的 2.4 倍。利用这些数据，我们专门针对尼泊尔语预训练了三种不同的模型，即 BERT、RoBERTa 和 GPT-2。此外，我们还进行了指令调整并探索了其对单语尼泊尔语数据的潜力，为未来的研究奠定了基础。我们的模型在 Nep-gLUE 基准测试中比现有最佳模型高出 2 分，得分为 95.60，并且在文本生成任务中也优于现有模型，证明了在理解和生成尼泊尔语文本方面的改进。

Title: A Method for Building Large Language Models with Predefined KV Cache Capacity

Authors: Zhonghua Yi, Ge Niu, Lei Wang, Wei Tang, Liqiu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15785
Pdf URL: https://arxiv.org/pdf/2411.15785
Copy Paste: [[2411.15785]] A Method for Building Large Language Models with Predefined KV Cache Capacity(https://arxiv.org/abs/2411.15785)
Keywords: language model
Abstract: This paper proposes a method for building large language models with predefined Key-Value (KV) cache capacity, particularly suitable for the attention layers in Transformer decoder-only architectures. This method introduces fixed-length KV caches to address the issue of excessive memory consumption in traditional KV caches when handling infinite contexts. By dynamically updating the key-value vector sequences, it achieves efficient inference within limited cache capacity, significantly reducing memory usage while maintaining model performance and system throughput. Experimental results show that this method significantly reduces memory usage while maintaining the model's inference quality.
摘要：本文提出了一种预定义键值（KV）缓存容量的大型语言模型构建方法，特别适用于 Transformer 解码架构中的注意层。该方法引入了定长 KV 缓存，解决了传统 KV 缓存在处理无限上下文时内存消耗过大的问题。通过动态更新键值向量序列，在有限的缓存容量下实现高效推理，在保持模型性能和系统吞吐量的同时，大幅降低内存使用量。实验结果表明，该方法在保持模型推理质量的同时，大幅降低了内存使用量。
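
Illustrative note: the abstract does not say how the key-value sequence is updated, so the following is only a minimal sketch of the general idea of a fixed-capacity KV cache. It assumes a simple oldest-first eviction policy, which may differ from the paper's update rule; the class name `FixedKVCache` and its methods are hypothetical.

```python
from collections import deque


class FixedKVCache:
    """Hypothetical fixed-capacity KV cache.

    Assumption: when the cache is full, the oldest entry is evicted.
    The paper's actual update rule is not described in the abstract.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) drops the oldest item automatically on overflow.
        self.keys = deque(maxlen=capacity)
        self.values = deque(maxlen=capacity)

    def append(self, key, value):
        """Store the key/value vectors for one newly decoded token."""
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


# Decode five tokens with room for only three cached entries.
cache = FixedKVCache(capacity=3)
for step in range(5):
    cache.append(f"k{step}", f"v{step}")
```

Memory stays bounded by `capacity` regardless of how many tokens are decoded, which is the property the abstract emphasizes.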

Title: LoRA-Mini : Adaptation Matrices Decomposition and Selective Training

Authors: Ayush Singh, Rajdeep Aher, Shivank Garg
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15804
Pdf URL: https://arxiv.org/pdf/2411.15804
Copy Paste: [[2411.15804]] LoRA-Mini : Adaptation Matrices Decomposition and Selective Training(https://arxiv.org/abs/2411.15804)
Keywords: language model, llm
Abstract: The rapid advancements in large language models (LLMs) have revolutionized natural language processing, creating an increased need for efficient, task-specific fine-tuning methods. Traditional fine-tuning of LLMs involves updating a large number of parameters, which is computationally expensive and memory-intensive. Low-Rank Adaptation (LoRA) has emerged as a promising solution, enabling parameter-efficient fine-tuning by reducing the number of trainable parameters. However, while LoRA reduces the number of trainable parameters, LoRA modules still create significant storage challenges. We propose LoRA-Mini, an optimized adaptation of LoRA that improves parameter efficiency by splitting low-rank matrices into four parts, with only the two inner matrices being trainable. This approach achieves up to a 20x reduction compared to standard LoRA in the number of trainable parameters while preserving performance levels comparable to standard LoRA, addressing both computational and storage efficiency in LLM fine-tuning.
摘要：大型语言模型 (LLM) 的快速发展彻底改变了自然语言处理，对高效、特定任务的微调方法的需求也随之增加。传统的 LLM 微调涉及更新大量参数，这在计算上非常昂贵，并且占用大量内存。低秩自适应 (LoRA) 已成为一种有前途的解决方案，它通过减少可训练参数的数量来实现参数高效的微调。然而，虽然 LoRA 减少了可训练参数的数量，但 LoRA 模块仍然带来了巨大的存储挑战。我们提出了 LoRA-Mini，这是 LoRA 的优化改编，它通过将低秩矩阵分成四个部分来提高参数效率，其中只有两个内部矩阵是可训练的。与标准 LoRA 相比，这种方法在可训练参数数量上减少了 20 倍，同时保持了与标准 LoRA 相当的性能水平，解决了 LLM 微调中的计算和存储效率问题。
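
Illustrative note: to make the parameter-count argument concrete, the sketch below compares trainable parameters for standard LoRA with a four-matrix split in which only the two inner matrices are trained. The exact shapes are an assumption (frozen outer matrices of size `d_out x a` and `b x d_in`, trainable inner matrices of size `a x r` and `r x b`); the abstract does not give them, and the dimensions used here are hypothetical.

```python
def lora_trainable(d_out, d_in, r):
    """Standard LoRA: B is (d_out x r) and A is (r x d_in), both trainable."""
    return r * (d_out + d_in)


def lora_mini_trainable(r, a, b):
    """Assumed four-way split: only the inner (a x r) and (r x b) matrices train.

    The outer (d_out x a) and (b x d_in) matrices are treated as frozen,
    so they do not contribute to the trainable count.
    """
    return a * r + r * b


# Hypothetical layer and inner sizes, chosen only for illustration.
d_out, d_in, r = 4096, 4096, 16
a, b = 256, 256

standard = lora_trainable(d_out, d_in, r)
mini = lora_mini_trainable(r, a, b)
reduction = standard / mini
print(f"standard LoRA: {standard}, four-way split: {mini}, reduction: {reduction:.1f}x")
```

The reduction factor depends entirely on the assumed inner sizes `a` and `b`, so it should not be read as reproducing the paper's reported figure.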

Title: Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?

Authors: Aryan Sajith, Krishna Chaitanya Rao Kathala
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15821
Pdf URL: https://arxiv.org/pdf/2411.15821
Copy Paste: [[2411.15821]] Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?(https://arxiv.org/abs/2411.15821)
Keywords: language model
Abstract: This study investigates the relative impact of training data quality versus quantity on the performance of small language models (SLMs), utilizing the TinyStories dataset for empirical analysis. Analyses of dataset variations with respect to size (25% and 50% of the original size) and duplication (controlled rates of 25%, 50%, 75%, and 100%) were performed. Model performance was evaluated based on the validation loss, accuracy, and perplexity metrics. Results indicate training data quality plays a more significant role in the overall performance of SLMs, especially given the scale of this experiment. Minimal duplication positively impacted model accuracy (+0.87% increase in accuracy at 25% duplication) without significantly increasing perplexity (+0.52% increase going from 0% to 25% duplication), but excessive duplication led to pronounced performance degradation (-40% drop in accuracy at 100% duplication). The implications of this exploration extend beyond just model performance; training large-scale models imposes significant financial and computational burdens, which can be prohibitive for organizations, individuals, and the public at large, especially in developing countries. Additionally, the energy consumption associated with large-scale training raises environmental concerns. Understanding the relative importance of data quality versus quantity could democratize AI technology, making advanced models more accessible and sustainable for all.
摘要：本研究利用 TinyStories 数据集进行实证分析，调查了训练数据质量与数量对小型语言模型 (SLM) 性能的相对影响。对数据集的大小（原始大小的 25% 和 50%）和重复率（控制率为 25%、50%、75% 和 100%）进行了分析。根据验证损失、准确率和困惑度指标评估模型性能。结果表明，训练数据质量在 SLM 的整体性能中起着更重要的作用，尤其是考虑到该实验的规模。最小重复对模型准确率有积极影响（25% 重复时准确率增加 0.87%），而不会显著增加困惑度（从 0% 到 25% 重复增加 0.52%），但过度重复会导致性能明显下降（100% 重复时准确率下降 40%）。这次探索的意义不仅限于模型性能；训练大规模模型会带来巨大的财务和计算负担，这对组织、个人和广大公众来说可能是难以承受的，尤其是在发展中国家。此外，大规模训练带来的能源消耗引发了环境问题。了解数据质量与数量的相对重要性可以使人工智能技术民主化，使高级模型更容易为所有人所用，并具有可持续性。
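
Illustrative note: the sketch below shows two ingredients of such an experiment. Perplexity is computed as the exponential of the mean cross-entropy loss (in nats), which is the standard definition. The duplication helper is an assumption: the abstract does not define "duplication rate", so here it is taken to mean appending copies of a randomly chosen fraction of the original examples. The function names are hypothetical.

```python
import math
import random


def perplexity(mean_ce_loss_nats):
    """Perplexity is exp of the mean per-token cross-entropy loss (in nats)."""
    return math.exp(mean_ce_loss_nats)


def with_duplication(examples, rate, seed=0):
    """Return the dataset plus copies of a random `rate` fraction of it.

    Assumed definition: rate=0.25 appends duplicates of 25% of the examples.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    examples = list(examples)
    rng = random.Random(seed)
    n_dup = round(len(examples) * rate)
    # sample() draws without replacement, so each example is copied at most once.
    return examples + rng.sample(examples, n_dup)


stories = list(range(100))  # stand-in for training examples
augmented = with_duplication(stories, rate=0.25)
```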

Title: LLMs Do Not Think Step-by-step In Implicit Reasoning

Authors: Yijiong Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15862
Pdf URL: https://arxiv.org/pdf/2411.15862
Copy Paste: [[2411.15862]] LLMs Do Not Think Step-by-step In Implicit Reasoning(https://arxiv.org/abs/2411.15862)
Keywords: llm, chain-of-thought
Abstract: It is well known that Chain-of-Thought (CoT) can remarkably enhance LLMs' performance on complex tasks. However, because it also introduces slower inference speeds and higher computational costs, many studies have attempted to use implicit CoT, which does not need LLMs to explicitly generate the intermediate steps. But there is still a gap between the efficacy of these methods and that of typical explicit CoT methods. This raises the question: is implicit CoT really equivalent to explicit CoT? In this study, we address this question through experiments. We probe the information of intermediate steps from the model's hidden states when it is performing implicit CoT. The results surprisingly indicate that LLMs hardly think about intermediate steps, suggesting they may just rely on experience rather than strict step-by-step reasoning. Moreover, we find LLMs' implicit reasoning capabilities are susceptible and unstable, reaffirming the necessity of explicit CoT to effectively support complex tasks.
摘要：众所周知，Chain-of-Thought 可以显著提高 LLM 在复杂任务上的性能。然而，由于它也引入了较慢的推理速度和较高的计算成本，许多研究尝试使用隐式 CoT，这不需要 LLM 显式生成中间步骤。但它们的功效与典型的显式 CoT 方法之间仍然存在差距。这让我们怀疑，隐式 CoT 真的等于显式 CoT 吗？因此，在本研究中，我们通过实验来解决这个问题。我们在模型执行隐式 CoT 时从模型的隐藏状态中探测中间步骤的信息。结果令人惊讶地表明，LLM 几乎不考虑中间步骤，这表明它们可能只是依靠经验而不是严格的逐步推理。此外，我们发现 LLM 的隐式推理能力易受影响且不稳定，这再次证明了显式 CoT 对有效支持复杂任务的必要性。

Title: Evaluating Large Language Models for Causal Modeling

Authors: Houssam Razouk, Leonie Benischke, Georg Niess, Roman Kern
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15888
Pdf URL: https://arxiv.org/pdf/2411.15888
Copy Paste: [[2411.15888]] Evaluating Large Language Models for Causal Modeling(https://arxiv.org/abs/2411.15888)
Keywords: language model, gpt, llm
Abstract: In this paper, we consider the process of transforming causal domain knowledge into a representation that aligns more closely with guidelines from causal data science. To this end, we introduce two novel tasks related to distilling causal domain knowledge into causal variables and detecting interaction entities using LLMs. We have determined that contemporary LLMs are helpful tools for conducting causal modeling tasks in collaboration with human experts, as they can provide a wider perspective. Specifically, LLMs, such as GPT-4-turbo and Llama3-70b, perform better in distilling causal domain knowledge into causal variables compared to sparse expert models, such as Mixtral-8x22b. In contrast, sparse expert models such as Mixtral-8x22b stand out as the most effective in identifying interaction entities. Finally, we highlight the dependency between the domain where the entities are generated and the performance of the chosen LLM for causal modeling.
摘要：在本文中，我们考虑将因果领域知识转换为更符合因果数据科学指导方针的表示的过程。为此，我们引入了两个新任务，即使用 LLM 将因果领域知识提炼为因果变量并检测交互实体。我们已经确定，当代 LLM 是与人类专家合作进行因果建模任务的有用工具，因为它们可以提供更广阔的视角。具体来说，与稀疏专家模型（如 Mixtral-8x22b）相比，GPT-4-turbo 和 Llama3-70b 等 LLM 在将因果领域知识提炼为因果变量方面表现更好。相反，稀疏专家模型（如 Mixtral-8x22b）在识别交互实体方面最有效。最后，我们强调了生成实体的领域与所选 LLM 因果建模性能之间的依赖关系。

Title: Generative Context Distillation

Authors: Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15927
Pdf URL: https://arxiv.org/pdf/2411.15927
Copy Paste: [[2411.15927]] Generative Context Distillation(https://arxiv.org/abs/2411.15927)
Keywords: language model, prompt, agent
Abstract: Prompts used in recent large language model based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Context Distillation (GCD), a lightweight prompt internalization method that employs a joint training approach. This method not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons for why the model's behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with the dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Context Distillation enables high-performance and efficient inference without the need for explicit prompts.
摘要：最近基于大型语言模型的应用程序中使用的提示通常是固定且冗长的，这会导致大量的计算开销。为了应对这一挑战，我们提出了生成上下文蒸馏 (GCD)，这是一种采用联合训练方法的轻量级提示内化方法。此方法不仅可以复制具有提示输入的模型的行为，还可以生成提示的内容以及模型行为应相应改变的原因。我们证明了我们的方法可以在各种基于代理的应用程序场景中有效地内化复杂的提示。为了在不与专用环境交互的情况下进行有效训练，我们引入了一种数据合成技术，该技术通过交换代理和环境的角色来自主收集对话数据集。此方法在只有预定义提示可用而没有相应训练数据集的场景中特别有用。通过内化复杂的提示，生成上下文蒸馏无需显式提示即可实现高性能和高效的推理。

Title: Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

Authors: Lifu Tu, Rui Meng, Shafiq Joty, Yingbo Zhou, Semih Yavuz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15993
Pdf URL: https://arxiv.org/pdf/2411.15993
Copy Paste: [[2411.15993]] Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown(https://arxiv.org/abs/2411.15993)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigate the factuality of long-form text generation across various large language models (LLMs), including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality scores tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore the effectiveness of different evaluation settings to assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of unsupported atomic claims that the corresponding LLMs judge as incorrect). The results indicate that even advanced models like GPT-4 and Gemini-1.5-Pro fail to achieve perfect Self-Known scores, while their Self-Unknown scores remain notably above zero, reflecting ongoing uncertainty in their self-assessments. Moreover, we find a correlation between higher Self-Known scores and improved factuality, while higher Self-Unknown scores are associated with lower factuality. Interestingly, even without significant changes in the models' self-judgment (Self-Known and Self-Unknown), the number of unsupported claims can increase, likely as an artifact of long-form generation. These findings show the limitations of current LLMs in long-form generation, and provide valuable insights for improving factuality in long-form text generation.
摘要：大型语言模型 (LLM) 已展现出强大的文本理解和生成能力。然而，它们往往缺乏事实性，会产生真假混杂的信息，尤其是在长篇文本生成中。在这项工作中，我们研究了各种大型语言模型 (LLM) 的长篇文本生成事实性，包括 GPT-4、Gemini-1.5-Pro、Claude-3-Opus、Llama-3-70B 和 Mistral。我们的分析表明，生成文本的后几句中事实性得分趋于下降，同时无依据的声明数量有所增加。此外，我们探索了不同评估设置的有效性，以评估 LLM 是否能够准确判断其自身输出的正确性：Self-Known（从 LLM 输出中分解出来的有依据的原子声明的百分比，相应的 LLM 判断为正确）和 Self-Unknown（相应的 LLM 判断为不正确的无依据的原子声明的百分比）。结果表明，即使是像 GPT-4 和 Gemini-1.5-Pro 这样的高级模型也无法获得完美的自我已知分数，而它们的自我未知分数仍然明显高于零，反映出它们对自我评估的持续不确定性。此外，我们发现较高的自我已知分数与事实性提高之间存在相关性，而较高的自我未知分数与事实性降低相关。有趣的是，即使模型的自我判断（自我已知和自我未知）没有发生重大变化，未经证实的主张的数量也会增加，这可能是长篇生成的产物。这些发现表明了当前 LLM 在长篇生成中的局限性，并为提高长篇文本生成中的事实性提供了宝贵的见解。
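
Illustrative note: the two metrics defined in the abstract can be computed directly from per-claim labels. The sketch below follows those definitions; the record layout (dictionaries with `supported` and `judged_correct` fields) and the function name are hypothetical, and how the labels are obtained is outside its scope.

```python
def self_known_unknown(claims):
    """Compute (Self-Known, Self-Unknown) percentages from atomic claims.

    Each claim is a dict with two booleans (hypothetical layout):
      supported      - whether the claim is supported by evidence
      judged_correct - whether the same LLM judges the claim correct
    Returns None for a metric whose denominator would be zero.
    """
    supported = [c for c in claims if c["supported"]]
    unsupported = [c for c in claims if not c["supported"]]

    # Self-Known: share of supported claims the model judges correct.
    self_known = (
        100.0 * sum(c["judged_correct"] for c in supported) / len(supported)
        if supported else None
    )
    # Self-Unknown: share of unsupported claims the model judges incorrect.
    self_unknown = (
        100.0 * sum(not c["judged_correct"] for c in unsupported) / len(unsupported)
        if unsupported else None
    )
    return self_known, self_unknown


# Made-up labels for illustration only.
example_claims = [
    {"supported": True, "judged_correct": True},
    {"supported": True, "judged_correct": True},
    {"supported": True, "judged_correct": True},
    {"supported": True, "judged_correct": False},
    {"supported": False, "judged_correct": True},
    {"supported": False, "judged_correct": False},
]
scores = self_known_unknown(example_claims)
```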

Title: Multi-ToM: Evaluating Multilingual Theory of Mind Capabilities in Large Language Models

Authors: Jayanta Sadhu, Ayan Antik Khan, Noshin Nawal, Sanju Basak, Abhik Bhattacharjee, Rifat Shahriyar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15999
Pdf URL: https://arxiv.org/pdf/2411.15999
Copy Paste: [[2411.15999]] Multi-ToM: Evaluating Multilingual Theory of Mind Capabilities in Large Language Models(https://arxiv.org/abs/2411.15999)
Keywords: language model, llm
Abstract: Theory of Mind (ToM) refers to the cognitive ability to infer and attribute mental states to oneself and others. As large language models (LLMs) are increasingly evaluated for social and cognitive capabilities, it remains unclear to what extent these models demonstrate ToM across diverse languages and cultural contexts. In this paper, we introduce a comprehensive study of multilingual ToM capabilities aimed at addressing this gap. Our approach includes two key components: (1) we translate existing ToM datasets into multiple languages, effectively creating a multilingual ToM dataset, and (2) we enrich these translations with culturally specific elements to reflect the social and cognitive scenarios relevant to diverse populations. We conduct extensive evaluations of six state-of-the-art LLMs to measure their ToM performance across both the translated and culturally adapted datasets. The results highlight the influence of linguistic and cultural diversity on the models' ability to exhibit ToM, and call their social reasoning capabilities into question. This work lays the groundwork for future research into enhancing LLMs' cross-cultural social cognition and contributes to the development of more culturally aware and socially intelligent AI systems. All our data and code are publicly available.
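Summary: Theory of Mind (ToM) refers to the cognitive ability to infer and attribute mental states to oneself and others. As large language models (LLMs) are increasingly evaluated for social and cognitive capabilities, it remains unclear to what extent these models exhibit ToM across different languages and cultural contexts. In this paper, we present a comprehensive study of multilingual ToM capabilities aimed at closing this gap. Our approach has two key parts: (1) we translate existing ToM datasets into multiple languages, effectively creating a multilingual ToM dataset; and (2) we enrich these translations with culturally specific elements to reflect the social and cognitive scenarios relevant to diverse populations. We extensively evaluate six state-of-the-art LLMs to measure their ToM performance on both the translated and the culturally adapted datasets. The results highlight the influence of linguistic and cultural diversity on the models' ability to exhibit ToM and call their social reasoning capabilities into question. This work lays the groundwork for future research on enhancing the cross-cultural social cognition of LLMs and contributes to the development of more culturally aware and socially intelligent AI systems. All our data and code are publicly available.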

Title: Exploring Performance Contrasts in TableQA: Step-by-Step Reasoning Boosts Bigger Language Models, Limits Smaller Language Models

Authors: Haoyan Yang, Yixuan Wang, Keyue Tong, Hongjin Zhu, Yuanxin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16002
Pdf URL: https://arxiv.org/pdf/2411.16002
Copy Paste: [[2411.16002]] Exploring Performance Contrasts in TableQA: Step-by-Step Reasoning Boosts Bigger Language Models, Limits Smaller Language Models(https://arxiv.org/abs/2411.16002)
Keywords: language model, prompt
Abstract: This paper proposes a detailed prompting flow, termed Table-Logic, to investigate the performance contrasts between bigger and smaller language models (LMs) utilizing step-by-step reasoning methods in the TableQA task. The method processes tasks by sequentially identifying critical columns and rows given the question and the table with its structure, determining necessary aggregations, calculations, or comparisons, and finally inferring the results to generate a precise prediction. By deploying this method, we observe a 7.8% accuracy improvement in bigger LMs like Llama-3-70B compared to the vanilla baseline on HybridQA, while smaller LMs like Llama-2-7B show an 11% performance decline. We empirically investigate the potential causes of performance contrasts by exploring the capabilities of bigger and smaller LMs from various dimensions in the TableQA task. Our findings highlight the limitations of the step-by-step reasoning method in small models and provide potential insights for making improvements.
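Summary: This paper proposes a detailed prompting flow, called Table-Logic, to study the performance differences between bigger and smaller language models (LMs) that use step-by-step reasoning in the TableQA task. The method handles a task by sequentially identifying the critical columns and rows given the question and the table with its structure, determining the necessary aggregations, calculations, or comparisons, and finally inferring the result to produce a precise prediction. With this method, we observe a 7.8% accuracy improvement over the vanilla baseline on HybridQA for bigger LMs such as Llama-3-70B, while smaller LMs such as Llama-2-7B show an 11% performance decline. We empirically investigate the potential causes of this contrast by examining the capabilities of bigger and smaller LMs along several dimensions of the TableQA task. Our findings highlight the limitations of step-by-step reasoning in small models and offer potential insights for improvement.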

Title: TransCompressor: LLM-Powered Multimodal Data Compression for Smart Transportation

Authors: Huanqi Yang, Rucheng Wu, Weitao Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16020
Pdf URL: https://arxiv.org/pdf/2411.16020
Copy Paste: [[2411.16020]] TransCompressor: LLM-Powered Multimodal Data Compression for Smart Transportation(https://arxiv.org/abs/2411.16020)
Keywords: language model, llm, prompt
Abstract: The incorporation of Large Language Models (LLMs) into smart transportation systems has paved the way for improving data management and operational efficiency. This study introduces TransCompressor, a novel framework that leverages LLMs for efficient compression and decompression of multimodal transportation sensor data. TransCompressor has undergone thorough evaluation with diverse sensor data types, including barometer, speed, and altitude measurements, across various transportation modes like buses, taxis, and MTRs. Comprehensive evaluation illustrates the effectiveness of TransCompressor in reconstructing transportation sensor data at different compression ratios. The results highlight that, with well-crafted prompts, LLMs can utilize their vast knowledge base to contribute to data compression processes, enhancing data storage, analysis, and retrieval in smart transportation settings.
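Summary: The incorporation of large language models (LLMs) into smart transportation systems has paved the way for improving data management and operational efficiency. This study introduces TransCompressor, a novel framework that uses LLMs to efficiently compress and decompress multimodal transportation sensor data. TransCompressor has been thoroughly evaluated on diverse sensor data types, including barometer, speed, and altitude measurements, across various transportation modes such as buses, taxis, and MTRs. The comprehensive evaluation shows that TransCompressor is effective at reconstructing transportation sensor data at different compression ratios. The results highlight that, with well-crafted prompts, LLMs can draw on their vast knowledge base to contribute to data compression, enhancing data storage, analysis, and retrieval in smart transportation settings.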

Title: SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text

Authors: Reshmi Ghosh, Tianyi Yao, Lizzy Chen, Sadid Hasan, Tianwei Chen, Dario Bernal, Huitian Jiao, H M Sajjad Hossain
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2411.16077
Pdf URL: https://arxiv.org/pdf/2411.16077
Copy Paste: [[2411.16077]] SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text(https://arxiv.org/abs/2411.16077)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM) integrations into applications like Microsoft365 suite and Google Workspace for creating/processing documents, emails, presentations, etc. have led to considerable enhancements in productivity and time savings. But as these integrations become more complex, it is paramount to ensure that the output from the LLM-integrated applications is relevant and appropriate for use. Identifying the need to develop robust evaluation approaches for natural language generation, wherein references/ground-truth labels don't exist or aren't amply available, this paper introduces a novel framework called "SAGEval" which utilizes a critiquing Agent to provide feedback on scores generated by LLM evaluators. We show that the critiquing Agent is able to rectify scores from LLM evaluators, in absence of references/ground-truth labels, thereby reducing the need for labeled data even for complex NLG evaluation scenarios, like the generation of JSON-structured forms/surveys with responses in different styles like multiple choice, Likert ratings, single choice questions, etc.
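Summary: The integration of large language models (LLMs) into applications such as Microsoft365 suite and Google Workspace for creating and processing documents, emails, presentations, and so on has considerably improved productivity and saved time. But as these integrations become more complex, it is paramount to ensure that the output of LLM-integrated applications is relevant and appropriate for use. Recognizing the need for robust evaluation approaches for natural language generation where references or ground-truth labels do not exist or are not amply available, this paper introduces a novel framework called "SAGEval", which uses a critiquing Agent to provide feedback on the scores generated by LLM evaluators. We show that the critiquing Agent can rectify the scores from LLM evaluators in the absence of references or ground-truth labels, thereby reducing the need for labeled data even in complex NLG evaluation scenarios, such as generating JSON-structured forms and surveys with responses in different styles like multiple choice, Likert ratings, and single-choice questions.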

Title: LLM Augmentations to support Analytical Reasoning over Multiple Documents

Authors: Raquib Bin Yousuf, Nicholas Defelice, Mandar Sharma, Shengzhe Xu, Naren Ramakrishnan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16116
Pdf URL: https://arxiv.org/pdf/2411.16116
Copy Paste: [[2411.16116]] LLM Augmentations to support Analytical Reasoning over Multiple Documents(https://arxiv.org/abs/2411.16116)
Keywords: language model, llm
Abstract: Building on their demonstrated ability to perform a variety of tasks, we investigate the application of large language models (LLMs) to enhance in-depth analytical reasoning within the context of intelligence analysis. Intelligence analysts typically work with massive dossiers to draw connections between seemingly unrelated entities, and uncover adversaries' plans and motives. We explore if and how LLMs can be helpful to analysts for this task and develop an architecture to augment the capabilities of an LLM with a memory module called dynamic evidence trees (DETs) to develop and track multiple investigation threads. Through extensive experiments on multiple datasets, we highlight how LLMs, as-is, are still inadequate to support intelligence analysts and offer recommendations to improve LLMs for such intricate reasoning applications.
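Summary: Building on the demonstrated ability of large language models (LLMs) to perform a variety of tasks, we investigate their application to enhancing in-depth analytical reasoning in the context of intelligence analysis. Intelligence analysts typically work with massive dossiers to draw connections between seemingly unrelated entities and to uncover adversaries' plans and motives. We explore whether and how LLMs can help analysts with this task, and we develop an architecture that augments an LLM with a memory module called dynamic evidence trees (DETs) to develop and track multiple investigation threads. Through extensive experiments on multiple datasets, we highlight that LLMs, as they are, remain inadequate for supporting intelligence analysts, and we offer recommendations for improving LLMs for such intricate reasoning applications.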

Title: MH-MoE:Multi-Head Mixture-of-Experts

Authors: Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16205
Pdf URL: https://arxiv.org/pdf/2411.16205
Copy Paste: [[2411.16205]] MH-MoE:Multi-Head Mixture-of-Experts(https://arxiv.org/abs/2411.16205)
Keywords: language model, llm
Abstract: Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.
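Summary: Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. In addition, our experiments show that MH-MoE is compatible with 1-bit large language models (LLMs) such as BitNet.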

Title: DoubleCCA: Improving Foundation Model Group Robustness with Random Sentence Embeddings

Authors: Hong Liu, Yitong Lu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16236
Pdf URL: https://arxiv.org/pdf/2411.16236
Copy Paste: [[2411.16236]] DoubleCCA: Improving Foundation Model Group Robustness with Random Sentence Embeddings(https://arxiv.org/abs/2411.16236)
Keywords: prompt
Abstract: This paper presents a novel method to improve the robustness of foundation models to group-based biases. We propose a simple yet effective method, called DoubleCCA, that leverages random sentences and Canonical Correlation Analysis (CCA) to enrich the text embeddings of the foundation model. First, we generate various random sentences that augment the original prompts, extending them with random words or character sequences. Second, we use an additional sentence embedding model to generate different text embeddings with respect to these random sentences. We then apply CCA twice to align the representations and reconstruct them back to the original representation space. We demonstrate the effectiveness of our method on a variety of tasks and datasets, showing that it outperforms existing methods in terms of both performance and robustness. Our method is simple to implement and can be easily integrated into existing models, making it a practical solution for improving the robustness of foundation models to group-based biases.
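Summary: This paper presents a novel method for improving the robustness of foundation models to group-based biases. We propose a simple yet effective method, called DoubleCCA, that uses random sentences and Canonical Correlation Analysis (CCA) to enrich the text embeddings of the foundation model. First, we generate various random sentences that augment the original prompts, extending them with random words or character sequences. Second, we use an additional sentence embedding model to generate different text embeddings for these random sentences. We then apply CCA twice to align the representations and reconstruct them back into the original representation space. We demonstrate the effectiveness of our method on a variety of tasks and datasets, showing that it outperforms existing methods in both performance and robustness. Our method is simple to implement and can be easily integrated into existing models, making it a practical solution for improving the robustness of foundation models to group-based biases.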

Title: NormXLogit: The Head-on-Top Never Lies

Authors: Sina Abbasi, Mohammad Reza Modarres, Mohammad Taher Pilehvar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16252
Pdf URL: https://arxiv.org/pdf/2411.16252
Copy Paste: [[2411.16252]] NormXLogit: The Head-on-Top Never Lies(https://arxiv.org/abs/2411.16252)
Keywords: language model, llm
Abstract: The Transformer architecture has emerged as the dominant choice for building large language models (LLMs). However, with new LLMs emerging on a frequent basis, it is important to consider the potential value of architecture-agnostic approaches that can provide interpretability across a variety of architectures. Despite recent successes in the interpretability of LLMs, many existing approaches rely on complex methods that are often tied to a specific model design and come with a significant computational cost. To address these limitations, we propose a novel technique, called NormXLogit, for assessing the significance of individual input tokens. This method operates based on the input and output representations associated with each token. First, we demonstrate that during the pre-training of LLMs, the norms of word embeddings capture the importance of input tokens. Second, we reveal a significant relationship between a token's importance and the extent to which its representation can resemble the model's final prediction. Through extensive analysis, we show that our approach consistently outperforms existing gradient-based methods in terms of faithfulness. Additionally, our method achieves better performance in layer-wise explanations compared to the most prominent architecture-specific methods.
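Summary: The Transformer architecture has become the dominant choice for building large language models (LLMs). However, with new LLMs emerging frequently, it is important to consider the potential value of architecture-agnostic approaches that can provide interpretability across a variety of architectures. Despite recent successes in LLM interpretability, many existing approaches rely on complex methods that are often tied to a specific model design and carry a significant computational cost. To address these limitations, we propose a novel technique, called NormXLogit, for assessing the significance of individual input tokens. The method operates on the input and output representations associated with each token. First, we show that during the pre-training of LLMs, the norms of word embeddings capture the importance of input tokens. Second, we reveal a significant relationship between a token's importance and the extent to which its representation can resemble the model's final prediction. Through extensive analysis, we show that our approach consistently outperforms existing gradient-based methods in terms of faithfulness. In addition, our method achieves better performance in layer-wise explanations than the most prominent architecture-specific methods.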

Title: BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment

Authors: Shaolei Zhang, Kehao Zhang, Qingkai Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, Yang Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16300
Pdf URL: https://arxiv.org/pdf/2411.16300
Copy Paste: [[2411.16300]] BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment(https://arxiv.org/abs/2411.16300)
Keywords: language model, llm
Abstract: Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower various tasks in everyday life. However, these abilities are primarily concentrated in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive approach to enhance the multilingual capabilities would be to construct instruction data for various languages, but constructing instruction data for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages, and performed instruction tuning based on the dataset to facilitate the capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale. For multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across over 20 low-resource languages, demonstrating its capability of effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in high-resource languages while enhancing the performance in low-resource languages. Demo, homepage, code and models of BayLing are available.
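Summary: Large language models (LLMs), with their powerful generative capabilities and vast knowledge, support a wide range of everyday tasks. However, these abilities are concentrated mainly in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving more than 100 linguistic communities worldwide. An intuitive way to do so would be to construct instruction data for various languages, but constructing instruction data for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To this end, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages, and performed instruction tuning on this dataset to facilitate capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing outperforms open-source models of similar scale. On multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across more than 20 low-resource languages, demonstrating that it can effectively transfer knowledge from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in high-resource languages while improving performance in low-resource languages. The demo, homepage, code, and models of BayLing are available.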

Title: Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring

Authors: Kathrin Seßler, Maurice Fürstenberg, Babette Bühler, Enkelejda Kasneci
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2411.16337
Pdf URL: https://arxiv.org/pdf/2411.16337
Copy Paste: [[2411.16337]] Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring(https://arxiv.org/abs/2411.16337)
Keywords: language model, gpt, llm
Abstract: The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (e.g., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman's $r = .74$ with human assessments in the overall score, and an internal consistency of $ICC=.80$. These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency for higher scores, the models require further refinement to better capture aspects of content quality.
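Summary: Manually assessing and grading student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to ease essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of open-source and closed-source LLMs in assessing German student essays, comparing their evaluations with those of 37 teachers across 10 pre-defined criteria (e.g., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, with the aim of providing in-depth insight into the scoring capabilities of LLMs. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, and they excel particularly on language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman's $r = .74$ with human assessments on the overall score and an internal consistency of $ICC=.80$. These findings indicate that LLM-based assessment can be a useful tool for reducing teacher workload by supporting essay evaluation, especially on language-related criteria. However, because the models tend to assign higher scores, they need further refinement to better capture aspects of content quality.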

Title: Preference Optimization for Reasoning with Pseudo Feedback

Authors: Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16345
Pdf URL: https://arxiv.org/pdf/2411.16345
Copy Paste: [[2411.16345]] Preference Optimization for Reasoning with Pseudo Feedback(https://arxiv.org/abs/2411.16345)
Keywords: language model, gpt, llm
Abstract: Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
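Summary: Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently used to enhance the reasoning capabilities of large language models (LLMs) in domains such as mathematical reasoning and coding, typically after supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, reasoning datasets with human-verified labels are in limited supply. In this study, we introduce a novel approach to generating pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of test-case-based pseudo feedback: one generated by frontier LLMs and the other obtained by extending self-consistency to multiple test cases. We run experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization and observe improvements on both. Specifically, with Mathstral-7B as the base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. On GSM8K and College Math, our scores rise from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (up from 21.1), surpassing Claude-3-Haiku.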

Title: The Two-Hop Curse: LLMs trained on A->B, B->C fail to learn A-->C

Authors: Mikita Balesni, Tomek Korbak, Owain Evans
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16353
Pdf URL: https://arxiv.org/pdf/2411.16353
Copy Paste: [[2411.16353]] The Two-Hop Curse: LLMs trained on A->B, B->C fail to learn A-->C(https://arxiv.org/abs/2411.16353)
Keywords: gpt, llm, prompt, chain-of-thought
Abstract: While LLMs excel at multi-hop questions (e.g. "Who is the spouse of the performer of Imagine?") when using chain-of-thought reasoning (CoT), they struggle when forced to reason internally (without CoT). Previous work on the size and nature of this gap produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two-hop reasoning in LLMs, where the above-chance performance constitutes undeniable evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B Instruct and GPT-4o) on fictional facts and confirm that they generalize to answering two-hop questions about them using CoT. We find that models can perform latent reasoning when facts appear together during training or in the prompt. However, to our surprise, models completely fail at two-hop reasoning without CoT when learned facts only appear in different documents, achieving chance-level accuracy and chance-level test loss. We call this complete failure to compose separately learned facts the Two-Hop Curse. Moreover, we evaluate 9 frontier LLMs on real-world facts, finding that models completely fail at two-hop no-CoT reasoning for over half of question categories while maintaining partial success with CoT across most categories. These results suggest that LLMs lack a general capability for latent multi-hop reasoning independent of the question type.
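Summary: While LLMs excel at multi-hop questions (e.g. "Who is the spouse of the performer of Imagine?") when using chain-of-thought reasoning (CoT), they struggle when forced to reason internally (without CoT). Previous work on the size and nature of this gap produced mixed evidence and inconclusive results. In this paper, we introduce a controlled setting for studying two-hop reasoning in LLMs, in which above-chance performance constitutes undeniable evidence of latent reasoning. We fine-tune LLMs (including Llama 3 8B Instruct and GPT-4o) on fictional facts and confirm that they generalize to answering two-hop questions about those facts using CoT. We find that models can perform latent reasoning when the facts appear together during training or in the prompt. To our surprise, however, models fail completely at two-hop reasoning without CoT when the learned facts appear only in different documents, achieving chance-level accuracy and chance-level test loss. We call this complete failure to compose separately learned facts the Two-Hop Curse. Moreover, we evaluate 9 frontier LLMs on real-world facts and find that models fail completely at two-hop no-CoT reasoning for over half of the question categories, while maintaining partial success with CoT across most categories. These results suggest that LLMs lack a general capability for latent multi-hop reasoning that is independent of the question type.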

Title: FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web

Authors: Cheng-Wei Lin, Wan-Hsuan Hsieh, Kai-Xin Guan, Chan-Jan Hsu, Chia-Chen Kuo, Chuan-Lin Lai, Chung-Wei Chung, Ming-Jen Wang, Da-Shan Shiu
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2411.16387
Pdf URL: https://arxiv.org/pdf/2411.16387
Copy Paste: [[2411.16387]] FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web(https://arxiv.org/abs/2411.16387)
Keywords: language model, llm
Abstract: The quality and size of a pretraining dataset significantly influence the performance of large language models (LLMs). While there have been numerous efforts in the curation of such a dataset for English users, there is a relative lack of similar initiatives for Traditional Chinese. Building upon the foundation of FineWeb, we introduce FineWeb-zhtw, a dataset tailored specifically for Traditional Chinese users. We came up with multiple stages of meticulously designed filters to cater to the linguistic difference between English and Traditional Chinese, to ensure comprehensiveness and quality. We determined effectiveness from querying dataset samples with three main objectives. Our code and datasets are publicly available.
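Summary: The quality and size of a pretraining dataset significantly influence the performance of large language models (LLMs). While much effort has gone into curating such datasets for English users, similar initiatives for Traditional Chinese are relatively scarce. Building on the foundation of FineWeb, we introduce FineWeb-zhtw, a dataset tailored specifically for Traditional Chinese users. We devised multiple stages of meticulously designed filters to account for the linguistic differences between English and Traditional Chinese and to ensure comprehensiveness and quality. We determined effectiveness by querying dataset samples with three main objectives. Our code and datasets are publicly available.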

Title: Human-Calibrated Automated Testing and Validation of Generative Language Models

Authors: Agus Sudjianto, Aijun Zhang, Srinivas Neppalli, Tarun Joshi, Michal Malohlava
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16391
Pdf URL: https://arxiv.org/pdf/2411.16391
Copy Paste: [[2411.16391]] Human-Calibrated Automated Testing and Validation of Generative Language Models(https://arxiv.org/abs/2411.16391)
Keywords: language model, retrieval-augmented generation
Abstract: This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.
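Summary: This paper introduces a comprehensive framework for evaluating and validating generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging because of open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, in which generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk, and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance under adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.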

Title: Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey

Authors: Alexander Fichtl, Juraj Vladika, Georg Groh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16403
Pdf URL: https://arxiv.org/pdf/2411.16403
Copy Paste: [[2411.16403]] Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey(https://arxiv.org/abs/2411.16403)
Keywords: language model, hallucination
Abstract: Knowledge-enhanced language models (KELMs) have emerged as promising tools to bridge the gap between large-scale language models and domain-specific knowledge. KELMs can achieve higher factual accuracy and mitigate hallucinations by leveraging knowledge graphs (KGs). They are frequently combined with adapter modules to reduce the computational load and risk of catastrophic forgetting. In this paper, we conduct a systematic literature review (SLR) on adapter-based approaches to KELMs. We provide a structured overview of existing methodologies in the field through quantitative and qualitative analysis and explore the strengths and potential shortcomings of individual approaches. We show that general knowledge and domain-specific approaches have been frequently explored along with various adapter architectures and downstream tasks. We particularly focused on the popular biomedical domain, where we provided an insightful performance comparison of existing KELMs. We outline the main trends and propose promising future directions.
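Summary: Knowledge-enhanced language models (KELMs) have emerged as promising tools for bridging the gap between large-scale language models and domain-specific knowledge. By leveraging knowledge graphs (KGs), KELMs can achieve higher factual accuracy and mitigate hallucinations. They are frequently combined with adapter modules to reduce the computational load and the risk of catastrophic forgetting. In this paper, we conduct a systematic literature review (SLR) of adapter-based approaches to KELMs. We provide a structured overview of existing methodologies in the field through quantitative and qualitative analysis and explore the strengths and potential shortcomings of individual approaches. We show that general-knowledge and domain-specific approaches have been frequently explored, along with various adapter architectures and downstream tasks. We focus in particular on the popular biomedical domain, for which we provide an insightful performance comparison of existing KELMs. We outline the main trends and propose promising future directions.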

Title: Finding Structure in Language Models

Authors: Jaap Jumelet
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16433
Pdf URL: https://arxiv.org/pdf/2411.16433
Copy Paste: [[2411.16433]] Finding Structure in Language Models(https://arxiv.org/abs/2411.16433)
Keywords: language model
Abstract: When we speak, write or listen, we continuously make predictions based on our knowledge of a language's grammar. Remarkably, children acquire this grammatical knowledge within just a few years, enabling them to understand and generalise to novel constructions that have never been uttered before. Language models are powerful tools that create representations of language by incrementally predicting the next word in a sentence, and they have had a tremendous societal impact in recent years. The central research question of this thesis is whether these models possess a deep understanding of grammatical structure similar to that of humans. This question lies at the intersection of natural language processing, linguistics, and interpretability. To address it, we will develop novel interpretability techniques that enhance our understanding of the complex nature of large-scale language models. We approach our research question from three directions. First, we explore the presence of abstract linguistic information through structural priming, a key paradigm in psycholinguistics for uncovering grammatical structure in human language processing. Next, we examine various linguistic phenomena, such as adjective order and negative polarity items, and connect a model's comprehension of these phenomena to the data distribution on which it was trained. Finally, we introduce a controlled testbed for studying hierarchical structure in language models using various synthetic languages of increasing complexity and examine the role of feature interactions in modelling this structure. Our findings offer a detailed account of the grammatical knowledge embedded in language model representations and provide several directions for investigating fundamental linguistic questions using computational methods.
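Summary: When we speak, write, or listen, we continuously make predictions based on our knowledge of a language's grammar. Remarkably, children acquire this grammatical knowledge within just a few years, enabling them to understand and generalise to novel constructions that have never been uttered before. Language models are powerful tools that create representations of language by incrementally predicting the next word in a sentence, and they have had a tremendous societal impact in recent years. The central research question of this thesis is whether these models possess a deep understanding of grammatical structure similar to that of humans. This question lies at the intersection of natural language processing, linguistics, and interpretability. To address it, we develop novel interpretability techniques that enhance our understanding of the complex nature of large-scale language models. We approach the research question from three directions. First, we explore the presence of abstract linguistic information through structural priming, a key paradigm in psycholinguistics for uncovering grammatical structure in human language processing. Next, we examine various linguistic phenomena, such as adjective order and negative polarity items, and connect a model's comprehension of these phenomena to the data distribution on which it was trained. Finally, we introduce a controlled testbed for studying hierarchical structure in language models using synthetic languages of increasing complexity, and we examine the role of feature interactions in modelling this structure. Our findings offer a detailed account of the grammatical knowledge embedded in language model representations and provide several directions for investigating fundamental linguistic questions with computational methods.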

Title: Learning by Analogy: Enhancing Few-Shot Prompting for Math Word Problem Solving with Computational Graph-Based Retrieval

Authors: Xiaocong Yang, Jiacheng Lin, Ziqi Wang, Chengxiang Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16454
Pdf URL: https://arxiv.org/pdf/2411.16454
Copy Paste: [[2411.16454]] Learning by Analogy: Enhancing Few-Shot Prompting for Math Word Problem Solving with Computational Graph-Based Retrieval(https://arxiv.org/abs/2411.16454)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are known to struggle with complicated reasoning tasks such as math word problems (MWPs). In this paper, we present how analogy from similarly structured questions can improve LLMs' problem-solving capabilities for MWPs. Specifically, we rely on the retrieval of problems with similar computational graphs to the given question to serve as exemplars in the prompt, providing the correct reasoning path for the generation model to refer to. Empirical results across six math word problem datasets demonstrate the effectiveness of our proposed method, which achieves a significant improvement of up to 6.7 percent on average in absolute value, compared to baseline methods. These results highlight our method's potential in addressing the reasoning challenges in current LLMs.
摘要：众所周知，大型语言模型 (LLM) 难以处理复杂的推理任务，例如数学应用题 (MWP)。在本文中，我们介绍了如何通过类似结构问题的类比来提高 LLM 解决 MWP 问题的能力。具体来说，我们依靠检索与给定问题具有相似计算图的问题作为提示中的样本，为生成模型提供正确的推理路径以供参考。六个数学应用题数据集的实证结果证明了我们提出的方法的有效性，与基线方法相比，该方法在绝对值上平均实现了高达 6.7% 的显着改进。这些结果凸显了我们的方法在解决当前 LLM 中的推理挑战方面的潜力。
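
The abstract above describes retrieving exemplars whose computational graphs resemble that of the query problem. A minimal sketch of that idea follows, under loud assumptions: each graph is reduced to a multiset of operators, similarity is multiset Jaccard, and the names `op_signature`, `similarity`, `retrieve_exemplars`, and the toy `pool` are all hypothetical. The paper's actual graph representation and retriever may differ.

```python
from collections import Counter

def op_signature(ops):
    """Reduce a computational graph to a multiset of its operators (an assumption)."""
    return Counter(ops)

def similarity(a, b):
    """Multiset Jaccard similarity between two operator signatures."""
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 1.0

def retrieve_exemplars(query_ops, pool, k=2):
    """Return the k pool items whose operator signature is closest to the query."""
    q = op_signature(query_ops)
    ranked = sorted(
        pool,
        key=lambda item: similarity(q, op_signature(item["ops"])),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical exemplar pool: each entry lists the operators used by its solution.
pool = [
    {"id": "a", "ops": ["+", "*"]},
    {"id": "b", "ops": ["-", "/"]},
    {"id": "c", "ops": ["+", "+", "*"]},
]

# A query whose solution multiplies once and adds once.
top = retrieve_exemplars(["*", "+"], pool, k=2)
print([item["id"] for item in top])
```

The retrieved items would then be placed in the prompt as few-shot exemplars ahead of the query problem.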

Title: When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?

Authors: Srikrishna Iyer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16487
Pdf URL: https://arxiv.org/pdf/2411.16487
Copy Paste: [[2411.16487]] When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?(https://arxiv.org/abs/2411.16487)
Keywords: language model
Abstract: We present our submission to the BabyLM challenge, aiming to push the boundaries of data-efficient language model pretraining. Our method builds upon deep mutual learning, introducing a student model search for diverse initialization. We address the limitation of treating students equally by formulating weighted mutual learning as a bi-level optimization problem. The inner loop learns compact students through online distillation, while the outer loop optimizes weights for better knowledge distillation from diverse students. This dynamic weighting strategy eliminates the need for a teacher model, reducing computational requirements. Our evaluations show that teacher-less methods can match or surpass teacher-supervised approaches.
摘要：我们向 BabyLM 挑战赛提交了我们的参赛作品，旨在突破数据高效型语言模型预训练的界限。我们的方法建立在深度相互学习的基础上，引入了学生模型搜索以实现多样化初始化。我们通过将加权相互学习公式化为双层优化问题来解决平等对待学生的局限性。内循环通过在线提炼学习紧凑型学生，而外循环优化权重以更好地从多样化学生中提炼知识。这种动态加权策略消除了对教师模型的需求，从而降低了计算要求。我们的评估表明，无教师方法可以匹敌或超越教师监督的方法。
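
The inner loop described above trains each student against its peers through online distillation. One plausible form of the per-student objective is sketched below: cross-entropy on the label plus a weighted sum of KL divergences from each peer's prediction. This form, the function name `weighted_mutual_loss`, and the fixed `weights` are assumptions for illustration; the abstract does not give the exact loss, and in the paper the weights are said to be optimized by an outer loop that is not shown here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def weighted_mutual_loss(student_logits, label, i, weights):
    """Loss for student i: cross-entropy plus weighted KL from every peer j != i."""
    probs = [softmax(logits) for logits in student_logits]
    ce = -math.log(probs[i][label])
    distill = sum(
        weights[j] * kl(probs[j], probs[i])
        for j in range(len(probs))
        if j != i
    )
    return ce + distill

# Hypothetical logits from three students on one two-class example.
logits = [[2.0, 0.5], [0.1, 1.2], [2.0, 0.5]]
print(weighted_mutual_loss(logits, label=0, i=0, weights=[0.0, 0.7, 0.3]))
```

A peer that already agrees with student `i` contributes nothing to the distillation term, so the weights only matter for peers whose predictions differ.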

Title: O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?

Authors: Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, Pengfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16489
Pdf URL: https://arxiv.org/pdf/2411.16489
Copy Paste: [[2411.16489]] O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?(https://arxiv.org/abs/2411.16489)
Keywords: hallucination
Abstract: This paper presents a critical examination of current approaches to replicating OpenAI's O1 model capabilities, with particular focus on the widespread but often undisclosed use of knowledge distillation techniques. While our previous work explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1's API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on only tens of thousands of samples of O1-distilled long-thought chains outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. Moreover, our investigation extends beyond mathematical reasoning to explore the generalization capabilities of O1-distilled models across diverse tasks: hallucination, safety and open-domain QA. Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning. We deliberately make this finding public to promote transparency in AI research and to challenge the current trend of obscured technical claims in the field. Our work includes: (1) A detailed technical exposition of the distillation process and its effectiveness, (2) A comprehensive benchmark framework for evaluating and categorizing O1 replication attempts based on their technical transparency and reproducibility, (3) A critical discussion of the limitations and potential risks of over-relying on distillation approaches. Our analysis culminates in a crucial bitter lesson: while the pursuit of more capable AI systems is important, the development of researchers grounded in first-principles thinking is paramount.
摘要：本文对当前复制 OpenAI O1 模型功能的方法进行了严格的审查，特别关注广泛但经常未公开的知识提炼技术。虽然我们之前的工作探索了 O1 复制的基本技术路径，但这项研究揭示了如何从 O1 的 API 中进行简单的提炼，结合监督微调，在复杂的数学推理任务中实现卓越的性能。通过大量的实验，我们表明，仅对数万个样本进行微调的基础模型 O1 提炼的长思维链在美国数学邀请赛 (AIME) 上的表现优于 O1 预览，且技术复杂性最小。此外，我们的研究超越了数学推理，探索了 O1 提炼模型在不同任务中的泛化能力：幻觉、安全和开放域 QA。值得注意的是，尽管只对数学问题解决数据进行训练，但我们的模型表现出对开放式 QA 任务的强大泛化能力，并且在微调后变得不那么容易受到谄媚的影响。我们特意公开这一发现，以促进人工智能研究的透明度，并挑战该领域目前技术主张模糊的趋势。我们的工作包括：（1）对蒸馏过程及其有效性的详细技术阐述；（2）基于技术透明度和可重复性对 O1 复制尝试进行评估和分类的综合基准框架；（3）对过度依赖蒸馏方法的局限性和潜在风险进行批判性讨论，我们的分析最终得出了一个至关重要的惨痛教训：虽然追求更强大的人工智能系统很重要，但培养以第一性原理思维为基础的研究人员至关重要。

Title: AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning

Authors: Amy Xin, Jinxin Liu, Zijun Yao, Zhicheng Li, Shulin Cao, Lei Hou, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16495
Pdf URL: https://arxiv.org/pdf/2411.16495
Copy Paste: [[2411.16495]] AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning(https://arxiv.org/abs/2411.16495)
Keywords: language model, llm, hallucination, retrieval-augmented generation, chain-of-thought
Abstract: Recent advancements in large language models (LLMs) have led to significant improvements in various natural language processing tasks, but it is still challenging for LLMs to perform knowledge-intensive complex question answering due to LLMs' inefficacy in reasoning planning and the hallucination problem. A typical solution is to employ retrieval-augmented generation (RAG) coupled with chain-of-thought (CoT) reasoning, which decomposes complex questions into chain-like sub-questions and applies iterative RAG at each sub-question. However, prior works exhibit sub-optimal reasoning planning and overlook dynamic knowledge retrieval from heterogeneous sources. In this paper, we propose AtomR, a novel heterogeneous knowledge reasoning framework that conducts multi-source reasoning at the atomic level. Drawing inspiration from the graph modeling of knowledge, AtomR leverages large language models (LLMs) to decompose complex questions into combinations of three atomic knowledge operators, significantly enhancing the reasoning process at both the planning and execution stages. We also introduce BlendQA, a novel evaluation benchmark tailored to assess complex heterogeneous knowledge reasoning. Experiments show that AtomR significantly outperforms state-of-the-art baselines across three single-source and two multi-source reasoning benchmarks, with notable performance gains of 9.4% on 2WikiMultihop and 9.5% on BlendQA.
摘要：大型语言模型 (LLM) 的最新进展已显著改善各种自然语言处理任务，但由于 LLM 在推理规划和幻觉问题方面的低效性，LLM 仍然难以执行知识密集型的复杂问答。一种典型的解决方案是采用检索增强生成 (RAG) 结合思路链 (CoT) 推理，将复杂问题分解为链式子问题并在每个子问题上应用迭代 RAG。然而，先前的研究表现出次优的推理规划，并且忽视了从异构源动态检索知识。在本文中，我们提出了 AtomR，一种新型的异构知识推理框架，可在原子级别进行多源推理。从知识图建模中汲取灵感，AtomR 利用大型语言模型 (LLM) 将复杂问题分解为三个原子知识运算符的组合，显著增强了规划和执行阶段的推理过程。我们还引入了 BlendQA，这是一种专为评估复杂异构知识推理而定制的新型评估基准。实验表明，AtomR 在三个单源和两个多源推理基准上的表现明显优于最先进的基准，在 2WikiMultihop 上的性能提升显著，在 BlendQA 上的性能提升显著，分别为 9.4% 和 9.5%。

Title: Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

Authors: Carolin M. Schuster, Maria-Alexandra Dinisor, Shashwat Ghatiwala, Georg Groh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16527
Pdf URL: https://arxiv.org/pdf/2411.16527
Copy Paste: [[2411.16527]] Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings(https://arxiv.org/abs/2411.16527)
Keywords: language model, llm
Abstract: Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI); however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts, these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions we investigate gender bias in contextual embeddings, across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuitiveness and use case for exposing and visualizing bias.
摘要：大型语言模型 (LLM) 是当前人工智能 (AI) 成功的基础，但它们不可避免地存在偏见。为了有效地传达风险并鼓励缓解措施，这些模型需要对其歧视性进行充分而直观的描述，适合所有 AI 受众。我们根据社会心理学研究词典，针对刻板印象维度提出了偏见概况。沿着这些维度，我们研究了跨上下文和层次的上下文嵌入中的性别偏见，并为十二种不同的 LLM 生成刻板印象概况，展示了它们在揭示和可视化偏见方面的直觉和用例。

Title: Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Authors: Zhiheng Xi, Dingwen Yang, Jixuan Huang, Jiafu Tang, Guanyu Li, Yiwen Ding, Wei He, Boyang Hong, Shihan Do, Wenyu Zhan, Xiao Wang, Rui Zheng, Tao Ji, Xiaowei Shi, Yitao Zhai, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Zuxuan Wu, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Yu-Gang Jiang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16579
Pdf URL: https://arxiv.org/pdf/2411.16579
Copy Paste: [[2411.16579]] Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision(https://arxiv.org/abs/2411.16579)
Keywords: language model, llm
Abstract: Training large language models (LLMs) to spend more time thinking and reflecting before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model's capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model during both test-time and train-time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of 76,321 responses paired with step-level feedback. Fine-tuning language models with this dataset enables them to generate natural language feedback for mathematical reasoning. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time, especially when scaling up inference-time computation. Motivated by these findings, we introduce the critique-based supervision to the actor's self-training process, and propose a critique-in-the-loop self-improvement method. Experiments show that the method improves the actor's exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model. Lastly, we take the preliminary step to explore training self-talk reasoning models via critique supervision and showcase its potential. Our code and datasets are at this https URL.
摘要：训练大型语言模型 (LLM) 以在回答之前花更多时间思考和反思对于有效解决科学、编码和数学等领域的复杂推理任务至关重要。然而，自我反思和自我纠正等机制的有效性取决于模型准确评估自身表现的能力，而这可能受到初始准确性、问题难度和缺乏外部反馈等因素的限制。在本文中，我们深入研究了一种双人范式，该范式将推理和批评模型的角色分开，其中批评模型提供步骤级反馈以在测试时间和训练时间监督推理（参与者）模型。我们首先提出了 AutoMathCritique，这是一个用于收集批评数据的自动化和可扩展框架，从而产生了一个包含 76,321 个响应和步骤级反馈的数据集。使用此数据集对语言模型进行微调使它们能够为数学推理生成自然语言反馈。我们证明，评论模型在测试时持续改善参与者在困难查询上的表现，尤其是在扩大推理时间计算时。受这些发现的启发，我们将基于评论的监督引入参与者的自我训练过程，并提出了一种循环内评论的自我改进方法。实验表明，该方法提高了参与者的探索效率和解决方案多样性，尤其是在具有挑战性的查询上，从而产生了更强大的推理模型。最后，我们迈出了初步的一步，探索通过评论监督训练自言自语推理模型并展示其潜力。我们的代码和数据集位于 this https URL。

Title: StructFormer: Document Structure-based Masked Attention and its Impact on Language Model Pre-Training

Authors: Kaustubh Ponkshe, Venkatapathy Subramanian, Natwar Modani, Ganesh Ramakrishnan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16618
Pdf URL: https://arxiv.org/pdf/2411.16618
Copy Paste: [[2411.16618]] StructFormer: Document Structure-based Masked Attention and its Impact on Language Model Pre-Training(https://arxiv.org/abs/2411.16618)
Keywords: language model
Abstract: Most state-of-the-art techniques for Language Models (LMs) today rely on transformer-based architectures and their ubiquitous attention mechanism. However, the exponential growth in computational requirements with longer input sequences confines Transformers to handling short passages. Recent efforts have aimed to address this limitation by introducing selective attention mechanisms, notably local and global attention. While sparse attention mechanisms, akin to full attention in being Turing-complete, have been theoretically established, their practical impact on pre-training remains unexplored. This study focuses on empirically assessing the influence of global attention on BERT pre-training. The primary steps involve creating an extensive corpus of structure-aware text through arXiv data, alongside a text-only counterpart. We carry out pre-training on these two datasets, investigate shifts in attention patterns, and assess their implications for downstream tasks. Our analysis underscores the significance of incorporating document structure into LMs, demonstrating their capacity to excel in more abstract tasks, such as document understanding.
摘要：当今，大多数最先进的语言模型 (LM) 技术都依赖于基于 Transformer 的架构及其无处不在的注意力机制。然而，随着输入序列的增加，计算需求呈指数级增长，这限制了 Transformer 只能处理短文。最近的努力旨在通过引入选择性注意力机制（尤其是局部和全局注意力）来解决这一限制。虽然稀疏注意力机制（类似于图灵完备的全注意力机制）已经在理论上建立，但它们对预训练的实际影响仍未得到探索。本研究侧重于实证评估全局注意力对 BERT 预训练的影响。主要步骤包括通过 arXiv 数据创建一个广泛的结构感知文本语料库，以及一个纯文本语料库。我们对这两个数据集进行预训练，研究注意力模式的变化，并评估它们对下游任务的影响。我们的分析强调了将文档结构纳入 LM 模型的重要性，证明了它们在更抽象的任务（例如文档理解）中表现出色的能力。

Title: Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation

Authors: Sanjana Ramprasad, Byron C. Wallace
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16638
Pdf URL: https://arxiv.org/pdf/2411.16638
Copy Paste: [[2411.16638]] Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation(https://arxiv.org/abs/2411.16638)
Keywords: llm, hallucination
Abstract: Modern LLMs can now produce highly readable abstractive summaries, to the point where traditional automated metrics for evaluating summary quality, such as ROUGE, have become saturated. However, LLMs still sometimes introduce unwanted content into summaries, i.e., information inconsistent with or unsupported by their source. Measuring the occurrence of these often subtle "hallucinations" automatically has proved to be challenging. This in turn has motivated development of a variety of metrics intended to measure the factual consistency of generated summaries against their source. But are these approaches measuring what they purport to do? In this work, we stress-test automatic factuality metrics. Specifically, we investigate whether and to what degree superficial attributes of summary texts suffice to predict "factuality", finding that a (supervised) model using only such shallow features is reasonably competitive with SOTA factuality scoring methods. We then evaluate how factuality metrics respond to factual corrections in inconsistent summaries and find that only a few show meaningful improvements. In contrast, some metrics are more sensitive to benign, non-factual edits. Motivated by these insights, we show that one can "game" (most) automatic factuality metrics, i.e., reliably inflate "factuality" scores by appending innocuous sentences to generated summaries. Taken together, our results raise questions about the degree to which we should rely on existing automated factuality metrics and what exactly we want "factuality metrics" to measure.
摘要：现代 LLM 现在可以生成高度可读的抽象摘要，以至于传统的用于评估摘要质量的自动化指标（例如 ROUGE）已经饱和。然而，LLM 有时仍会在摘要中引入不必要的内容，即与来源不一致或不受来源支持的信息。事实证明，自动测量这些通常很微妙的“幻觉”的发生是一项挑战。这反过来又促使人们开发各种指标，旨在衡量生成的摘要与其来源的事实一致性。但这些方法是否衡量了它们声称要做的事情？在这项工作中，我们对自动事实性指标进行了压力测试。具体来说，我们研究摘要文本的表面属性是否以及在多大程度上足以预测“事实性”，发现仅使用这种浅显特征的（监督）模型与 SOTA 事实性评分方法相当具有竞争力。然后，我们评估事实性指标如何响应不一致摘要中的事实更正，并发现只有少数指标显示出有意义的改进。相反，一些指标对良性的、非事实性的编辑更敏感。受这些见解的启发，我们表明人们可以“玩弄”（大多数）自动事实性指标，即通过在生成的摘要后附加无害的句子来可靠地夸大“事实性”分数。综合来看，我们的结果提出了关于我们应该在多大程度上依赖现有的自动事实性指标以及我们到底想用“事实性指标”来衡量什么的问题。

Title: Self-Generated Critiques Boost Reward Modeling for Language Models

Authors: Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16646
Pdf URL: https://arxiv.org/pdf/2411.16646
Copy Paste: [[2411.16646]] Self-Generated Critiques Boost Reward Modeling for Language Models(https://arxiv.org/abs/2411.16646)
Keywords: language model, llm
Abstract: Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of generated critiques in rectifying flawed reasoning steps with 2.5%-3.2% gains in improving reasoning accuracy.
摘要：奖励建模对于将大型语言模型 (LLM) 与人类偏好相结合至关重要，尤其是在从人类反馈中强化学习 (RLHF) 中。然而，当前的奖励模型主要产生标量分数，并且难以以自然语言格式纳入批评。我们假设预测批评和标量奖励将提高奖励建模能力。受此启发，我们提出了 Critic-RM，这是一个使用自生成批评而无需额外监督的框架，可改进奖励模型。Critic-RM 采用两阶段流程：生成和过滤高质量批评，然后对奖励预测和批评生成进行联合微调。跨基准的实验表明，与标准奖励模型和 LLM 评委相比，Critic-RM 将奖励建模准确率提高了 3.7%-7.3%，表现出强大的性能和数据效率。其他研究进一步验证了生成的批评在纠正有缺陷的推理步骤方面的有效性，推理准确率提高了 2.5%-3.2%。

Title: Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Authors: Sohee Yang, Nora Kassner, Elena Gribovskaya, Sebastian Riedel, Mor Geva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16679
Pdf URL: https://arxiv.org/pdf/2411.16679
Copy Paste: [[2411.16679]] Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?(https://arxiv.org/abs/2411.16679)
Keywords: language model, llm, chain-of-thought
Abstract: We evaluate how well Large Language Models (LLMs) latently recall and compose facts to answer multi-hop queries like "In the year Scarlett Johansson was born, the Summer Olympics were hosted in the country of". One major challenge in evaluating this ability is that LLMs may have developed shortcuts by encountering the head entity "Scarlett Johansson" and the answer entity "United States" in the same training sequences, or may merely guess the answer based on frequency-based priors. To prevent shortcuts, we exclude test queries where the head and answer entities co-appear in pretraining corpora. Through careful selection of relations and facts and systematic removal of cases where models might guess answers or exploit partial matches, we construct an evaluation dataset SOCRATES (ShOrtCut-fRee lATent rEaSoning). We observe that LLMs demonstrate promising latent multi-hop reasoning abilities without exploiting shortcuts, but only for certain types of queries. For queries requiring latent recall of countries as the intermediate answer, the best models achieve 80% latent composability, but this drops to just 5% for the recall of years. Comparisons with Chain-of-Thought composability highlight a significant gap between the ability of models to reason latently versus explicitly. Analysis reveals that latent representations of the intermediate answer are constructed more often in queries with higher latent composability, and shows the emergence of latent multi-hop reasoning during pretraining.
摘要：我们评估大型语言模型 (LLM) 潜在回忆和整理事实以回答多跳查询的能力，例如“斯嘉丽·约翰逊出生那年，夏季奥运会在某国举办”。评估这种能力的一个主要挑战是，LLM 可能通过在同一训练序列中遇到主实体“斯嘉丽·约翰逊”和答案实体“美国”而开发出捷径，或者仅仅根据基于频率的先验来猜测答案。为了防止捷径，我们排除了主实体和答案实体在预训练语料库中同时出现的测试查询。通过仔细选择关系和事实并系统地删除模型可能猜测答案或利用部分匹配的情况，我们构建了一个评估数据集 SOCRATES（无捷径潜在推理）。我们观察到 LLM 展示了有希望的潜在多跳推理能力，而无需利用捷径，但仅限于某些类型的查询。对于需要潜在回忆国家作为中间答案的查询，最佳模型可实现 80% 的潜在可组合性，但对于回忆年份，这一比例降至仅 5%。与思维链可组合性的比较突显了模型潜在推理能力与显性推理能力之间的巨大差距。分析表明，在潜在可组合性较高的查询中，中间答案的潜在表示构建得更频繁，并表明在预训练期间出现了潜在的多跳推理。
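
The shortcut filter described above drops test queries whose head and answer entities co-appear in pretraining data. A minimal sketch of that filtering step follows, assuming case-insensitive substring matching within a single document as the co-occurrence test. The helper names `co_occurs` and `filter_shortcut_free`, and the toy corpus and queries with placeholder entities, are hypothetical; the abstract does not specify how co-occurrence is detected at corpus scale.

```python
def co_occurs(head, answer, corpus):
    """True if any document mentions both entities (case-insensitive substring match)."""
    h, a = head.lower(), answer.lower()
    for doc in corpus:
        text = doc.lower()
        if h in text and a in text:
            return True
    return False

def filter_shortcut_free(queries, corpus):
    """Keep only queries whose head and answer entities never share a document."""
    return [q for q in queries if not co_occurs(q["head"], q["answer"], corpus)]

# Hypothetical corpus and queries using placeholder entity names.
corpus = [
    "Entity A was mentioned together with Country X in this document.",
    "Entity B appears here on its own.",
    "Country Y is discussed here without any person.",
]
queries = [
    {"head": "Entity A", "answer": "Country X"},  # co-appears, so it is dropped
    {"head": "Entity B", "answer": "Country Y"},  # never in the same document, so it is kept
]

kept = filter_shortcut_free(queries, corpus)
print([q["head"] for q in kept])
```

Substring matching is the simplest possible test; it misses aliases and partial matches, which the abstract says are handled separately through systematic removal.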