2024-09-05

Title: MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs

Authors: Saeid Asgari Taghanaki, Aliasgahr Khani, Amir Khasahmadi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.02257
Pdf URL: https://arxiv.org/pdf/2409.02257
Copy Paste: [[2409.02257]] MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs(https://arxiv.org/abs/2409.02257)
Keywords: language model, llm
Abstract: Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between top-performing models, underscoring the need for more challenging evaluation frameworks. We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order reasoning in LLMs. By incorporating questions with multiple correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex reasoning and resist simplistic problem-solving strategies. Our results show that MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of model discrimination, particularly in multi-correct answer scenarios. We introduce novel metrics like shortcut selection ratio and correct pair identification ratio, offering deeper insights into model behavior and anchoring bias. Evaluations of five state-of-the-art LLMs reveal significant performance gaps, highlighting variations in reasoning abilities and bias susceptibility. We release the dataset and evaluation code at this https URL.

Title: Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining

Authors: Yuxiang Wei, Hojae Han, Rajhans Samdani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.02326
Pdf URL: https://arxiv.org/pdf/2409.02326
Copy Paste: [[2409.02326]] Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining(https://arxiv.org/abs/2409.02326)
Keywords: language model
Abstract: Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.
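The three-phase token budget in the abstract above can be checked with a few lines of arithmetic. This is a minimal sketch: the token counts come from the abstract, while the phase labels and variable names are shorthand introduced here for illustration.

```python
# Token counts (in billions) as stated in the abstract.
# The phase labels are illustrative shorthand, not the authors' terminology.
phases = [
    ("general pretraining", 500),
    ("continued pretraining (high-quality)", 50),
    ("enhanced pretraining (synthetic)", 5),
]

total = sum(tokens for _, tokens in phases)

for name, tokens in phases:
    share = 100 * tokens / total
    print(f"{name}: {tokens}B tokens ({share:.1f}% of {total}B)")
```

The total matches the 555B figure quoted in the abstract, and the two refined phases together account for roughly a tenth of the budget.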

Title: Diversify-verify-adapt: Efficient and Robust Retrieval-Augmented Ambiguous Question Answering

Authors: Yeonjun In, Sungchul Kim, Ryan A. Rossi, Md Mehrab Tanjim, Tong Yu, Ritwik Sinha, Chanyoung Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02361
Pdf URL: https://arxiv.org/pdf/2409.02361
Copy Paste: [[2409.02361]] Diversify-verify-adapt: Efficient and Robust Retrieval-Augmented Ambiguous Question Answering(https://arxiv.org/abs/2409.02361)
Keywords: retrieval augmented generation
Abstract: The retrieval augmented generation (RAG) framework addresses ambiguity in user queries in QA systems by retrieving passages that cover all plausible interpretations and generating comprehensive responses based on the passages. However, our preliminary studies reveal that a single retrieval process often suffers from low-quality results, as the retrieved passages frequently fail to capture all plausible interpretations. Although the iterative RAG approach has been proposed to address this problem, it comes at the cost of significantly reduced efficiency. To address these issues, we propose the diversify-verify-adapt (DIVA) framework. DIVA first diversifies the retrieved passages to encompass diverse interpretations. Subsequently, DIVA verifies the quality of the passages and adapts the most suitable approach tailored to their quality. This approach improves QA systems' accuracy and robustness by handling the low-quality retrieval issue in ambiguous questions, while enhancing efficiency.

Title: Do Large Language Models Possess Sensitive to Sentiment?

Authors: Yang Liu, Xichou Zhu, Zhou Shen, Yi Liu, Min Li, Yujun Chen, Benzi John, Zhenzhen Ma, Tao Hu, Zhiyang Xu, Wei Luo, Junhui Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.02370
Pdf URL: https://arxiv.org/pdf/2409.02370
Copy Paste: [[2409.02370]] Do Large Language Models Possess Sensitive to Sentiment?(https://arxiv.org/abs/2409.02370)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently displayed their extraordinary capabilities in language understanding. However, how to comprehensively assess the sentiment capabilities of LLMs continues to be a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in the text modality. As the integration of LLMs into diverse applications is on the rise, it becomes highly critical to comprehend their sensitivity to emotional tone, as it can influence the user experience and the efficacy of sentiment-driven tasks. We conduct a series of experiments to evaluate the performance of several prominent LLMs in identifying and responding appropriately to sentiments like positive, negative, and neutral emotions. The models' outputs are analyzed across various sentiment benchmarks, and their responses are compared with human evaluations. Our discoveries indicate that although LLMs show a basic sensitivity to sentiment, there are substantial variations in their accuracy and consistency, emphasizing the requirement for further enhancements in their training processes to better capture subtle emotional cues. For example, in some cases the models might wrongly classify a strongly positive sentiment as neutral, or fail to recognize sarcasm or irony in the text. Such misclassifications highlight the complexity of sentiment analysis and the areas where the models need to be refined. Another aspect is that different LLMs might perform differently on the same set of data, depending on their architecture and training datasets. This variance calls for a more in-depth study of the factors that contribute to the performance differences and how they can be optimized.

Title: How Privacy-Savvy Are Large Language Models? A Case Study on Compliance and Privacy Technical Review

Authors: Xichou Zhu, Yang Liu, Zhou Shen, Yi Liu, Min Li, Yujun Chen, Benzi John, Zhenzhen Ma, Tao Hu, Bolong Yang, Manman Wang, Zongxing Xie, Peng Liu, Dan Cai, Junhui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02375
Pdf URL: https://arxiv.org/pdf/2409.02375
Copy Paste: [[2409.02375]] How Privacy-Savvy Are Large Language Models? A Case Study on Compliance and Privacy Technical Review(https://arxiv.org/abs/2409.02375)
Keywords: language model, gpt, llm
Abstract: The recent advances in large language models (LLMs) have significantly expanded their applications across various fields such as language generation, summarization, and complex question answering. However, their application to privacy compliance and technical privacy reviews remains under-explored, raising critical concerns about their ability to adhere to global privacy standards and protect sensitive user data. This paper seeks to address this gap by providing a comprehensive case study evaluating LLMs' performance in privacy-related tasks such as privacy information extraction (PIE), legal and regulatory key point detection (KPD), and question answering (QA) with respect to privacy policies and data protection regulations. We introduce a Privacy Technical Review (PTR) framework, highlighting its role in mitigating privacy risks during the software development life-cycle. Through an empirical assessment, we investigate the capacity of several prominent LLMs, including BERT, GPT-3.5, GPT-4, and custom models, in executing privacy compliance checks and technical privacy reviews. Our experiments benchmark the models across multiple dimensions, focusing on their precision, recall, and F1-scores in extracting privacy-sensitive information and detecting key regulatory compliance points. While LLMs show promise in automating privacy reviews and identifying regulatory discrepancies, significant gaps persist in their ability to fully comply with evolving legal standards. We provide actionable recommendations for enhancing LLMs' capabilities in privacy compliance, emphasizing the need for robust model improvements and better integration with legal and regulatory requirements. This study underscores the growing importance of developing privacy-aware LLMs that can both support businesses in compliance efforts and safeguard user privacy rights.
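The precision, recall, and F1-scores mentioned in the abstract above are standard set-based metrics. The sketch below shows how they could be computed for an extraction task; the function name and the example items are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: items a model extracted vs. annotator labels.
gold = {"email address", "phone number", "device ID", "location"}
predicted = {"email address", "phone number", "IP address"}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Here two of the three predicted items are correct and two of the four gold items are found, so precision is 2/3 and recall is 1/2.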

Title: STAB: Speech Tokenizer Assessment Benchmark

Authors: Shikhar Vashishth, Harman Singh, Shikhar Bharadwaj, Sriram Ganapathy, Chulayuth Asawaroengchai, Kartik Audhkhasi, Andrew Rosenberg, Ankur Bapna, Bhuvana Ramabhadran
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.02384
Pdf URL: https://arxiv.org/pdf/2409.02384
Copy Paste: [[2409.02384]] STAB: Speech Tokenizer Assessment Benchmark(https://arxiv.org/abs/2409.02384)
Keywords: language model, llm
Abstract: Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text, thus enabling the use of speech as an input to the widely successful large language models (LLMs). Currently, while several speech tokenizers have been proposed, there is ambiguity regarding the properties that are desired from a tokenizer for specific downstream tasks and its overall generalizability. Evaluating the performance of tokenizers across different downstream tasks is a computationally intensive effort that poses challenges for scalability. To circumvent this requirement, we present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively and shed light on their inherent characteristics. This framework provides a deeper understanding of the underlying mechanisms of speech tokenization, thereby offering a valuable resource for expediting the advancement of future tokenizer models and enabling comparative analysis using a standardized benchmark. We evaluate the STAB metrics and correlate this with downstream task performance across a range of speech tasks and tokenizer choices.

Title: Abstractive Text Summarization: State of the Art, Challenges, and Improvements

Authors: Hassan Shakil, Ahmad Farooq, Jugal Kalita
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.02413
Pdf URL: https://arxiv.org/pdf/2409.02413
Copy Paste: [[2409.02413]] Abstractive Text Summarization: State of the Art, Challenges, and Improvements(https://arxiv.org/abs/2409.02413)
Keywords: language model
Abstract: Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, limitations and charts out future improvements - providing researchers an extensive overview to advance abstractive summarization research. We provide vital comparison tables across techniques categorized - offering insights into model complexity, scalability and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.

Title: DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels

Authors: Zhe Xu, Jiasheng Ye, Xiangyang Liu, Tianxiang Sun, Xiaoran Liu, Qipeng Guo, Linlin Li, Qun Liu, Xuanjing Huang, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02465
Pdf URL: https://arxiv.org/pdf/2409.02465
Copy Paste: [[2409.02465]] DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels(https://arxiv.org/abs/2409.02465)
Keywords: language model, llm
Abstract: With the rapid advancement of Large Language Models (LLMs), long-context information understanding and processing have become a hot topic in academia and industry. However, benchmarks for evaluating the ability of LLMs to handle long-context information do not seem to have kept pace with the development of LLMs. Despite the emergence of various long-context evaluation benchmarks, the types of capability assessed are still limited, without new capability dimensions. In this paper, we introduce DetectiveQA, a narrative reasoning benchmark featuring an average context length of over 100K tokens. DetectiveQA focuses on evaluating the long-context reasoning ability of LLMs, which not only requires a full understanding of context but also requires extracting important evidence from the context and reasoning over the extracted evidence to answer the given questions. This is a new dimension of capability evaluation, which is more in line with the current intelligence level of LLMs. We use detective novels as data sources, which naturally have various reasoning elements. Finally, we manually annotated 600 questions in Chinese and then also provided an English edition of the context information and questions. We evaluate many long-context LLMs on DetectiveQA, including commercial and open-sourced models, and the results indicate that existing long-context LLMs still require significant advancements to effectively process true long-context dependency questions.

Title: Language is Scary when Over-Analyzed: Unpacking Implied Misogynistic Reasoning with Argumentation Theory-Driven Prompts

Authors: Arianna Muti, Federico Ruggeri, Khalid Al-Khatib, Alberto Barrón-Cedeño, Tommaso Caselli
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2409.02519
Pdf URL: https://arxiv.org/pdf/2409.02519
Copy Paste: [[2409.02519]] Language is Scary when Over-Analyzed: Unpacking Implied Misogynistic Reasoning with Argumentation Theory-Driven Prompts(https://arxiv.org/abs/2409.02519)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: We propose misogyny detection as an Argumentative Reasoning task and we investigate the capacity of large language models (LLMs) to understand the implicit reasoning used to convey misogyny in both Italian and English. The central aim is to generate the missing reasoning link between a message and the implied meanings encoding the misogyny. Our study uses argumentation theory as a foundation to form a collection of prompts in both zero-shot and few-shot settings. These prompts integrate different techniques, including chain-of-thought reasoning and augmented knowledge. Our findings show that LLMs fall short on reasoning capabilities about misogynistic comments and that they mostly rely on their implicit knowledge derived from internalized common stereotypes about women to generate implied assumptions, rather than on inductive reasoning.

Title: More is More: Addition Bias in Large Language Models

Authors: Luca Santagata, Cristiano De Nobili
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2409.02569
Pdf URL: https://arxiv.org/pdf/2409.02569
Copy Paste: [[2409.02569]] More is More: Addition Bias in Large Language Models(https://arxiv.org/abs/2409.02569)
Keywords: language model, gpt, llm
Abstract: In this paper, we investigate the presence of additive bias in Large Language Models (LLMs), drawing a parallel to the cognitive bias observed in humans where individuals tend to favor additive over subtractive changes. Using a series of controlled experiments, we tested various LLMs, including GPT-3.5 Turbo, Claude 3.5 Sonnet, Mistral, Math$\Sigma$tral, and Llama 3.1, on tasks designed to measure their propensity for additive versus subtractive modifications. Our findings demonstrate a significant preference for additive changes across all tested models. For example, in a palindrome creation task, Llama 3.1 favored adding letters 97.85% of the time over removing them. Similarly, in a Lego tower balancing task, GPT-3.5 Turbo chose to add a brick 76.38% of the time rather than remove one. In a text summarization task, Mistral 7B produced longer summaries in 59.40% to 75.10% of cases when asked to improve its own or others' writing. These results indicate that, similar to humans, LLMs exhibit a marked additive bias, which might have implications when LLMs are used on a large scale. Additive bias might increase resource use and environmental impact, leading to higher economic costs due to overconsumption and waste. This bias should be considered in the development and application of LLMs to ensure balanced and efficient problem-solving approaches.

Title: PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation

Authors: Aneta Pawelec, Victoria Sara Wesołowska, Zuzanna Bączek, Piotr Sankowski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02617
Pdf URL: https://arxiv.org/pdf/2409.02617
Copy Paste: [[2409.02617]] PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation(https://arxiv.org/abs/2409.02617)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The ability of large language models (LLMs) to interpret visual representations of data is crucial for advancing their application in data analysis and decision-making processes. This paper presents a novel synthetic dataset designed to evaluate the proficiency of LLMs in interpreting various forms of data visualizations, including plots like time series, histograms, violins, boxplots, and clusters. Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios. We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models like ChatGPT or Gemini, assessing their understanding and interpretative accuracy. To ensure data integrity, our benchmark dataset is generated automatically, making it entirely new and free from prior exposure to the models being tested. This strategy allows us to evaluate the models' ability to truly interpret and understand the data, eliminating the possibility of pre-learned responses, and allowing for an unbiased evaluation of the models' capabilities. We also introduce quantitative metrics to assess the performance of the models, providing a robust and comprehensive evaluation tool. Benchmarking several state-of-the-art LLMs with this dataset reveals varying degrees of success, highlighting specific strengths and weaknesses in interpreting diverse types of visual data. The results provide valuable insights into the current capabilities of LLMs and identify key areas for improvement. This work establishes a foundational benchmark for future research and development aimed at enhancing the visual interpretative abilities of language models. In the future, improved LLMs with robust visual interpretation skills can significantly aid in automated data analysis, scientific research, educational tools, and business intelligence applications.

Title: Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus

Authors: Gokhan Dogru
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02667
Pdf URL: https://arxiv.org/pdf/2409.02667
Copy Paste: [[2409.02667]] Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus(https://arxiv.org/abs/2409.02667)
Keywords: language model
Abstract: This article investigates how translation memories (TM) can be created by translators or other language professionals in order to compile domain-specific parallel corpora, which can then be used in different scenarios, such as machine translation training and fine-tuning, TM leveraging, and/or large language model fine-tuning. The article introduces a semi-automatic TM preparation methodology leveraging primarily translation tools used by translators in favor of data quality and control by the translators. This semi-automatic methodology is then used to build a cardiology-based Turkish -> English corpus from bilingual abstracts of Turkish cardiology journals. The resulting corpus called TRENCARD Corpus has approximately 800,000 source words and 50,000 sentences. Using this methodology, translators can build their custom TMs in a reasonable time and use them in tasks that require bilingual data.

Title: Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs

Authors: Ruoyu Wang, Xiaoxuan Li, Lina Yao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.02686
Pdf URL: https://arxiv.org/pdf/2409.02686
Copy Paste: [[2409.02686]] Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs(https://arxiv.org/abs/2409.02686)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions, but recent studies reveal that these models often fail to achieve satisfactory results on questions involving reasoning, such as mathematics or physics questions. This phenomenon is usually attributed to the uncertainty regarding whether these models could genuinely comprehend the knowledge embedded in the text or merely learn to replicate the token distribution without a true understanding of the content. In this paper, we delve into this problem and aim to enhance the reasoning capabilities of LLMs. First, we investigate if the model has genuine reasoning capabilities by visualizing the text generation process at the attention and representation level. Then, we formulate the reasoning process of LLMs into a causal framework, which provides a formal explanation of the problems we observe in the visualization. Finally, building upon this causal framework, we propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities by encouraging the model to extract the general problem-solving skills and apply these skills to different questions. Experiments show that our method outperforms the baseline consistently across multiple benchmarks, and with only 1.2M tunable parameters, we achieve better or comparable results to other fine-tuning methods. This demonstrates the effectiveness and efficiency of our method in improving the overall accuracy and reliability of LLMs.

Title: Pre-training data selection for biomedical domain adaptation using journal impact metrics

Authors: Mathieu Laï-king, Patrick Paroubek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02725
Pdf URL: https://arxiv.org/pdf/2409.02725
Copy Paste: [[2409.02725]] Pre-training data selection for biomedical domain adaptation using journal impact metrics(https://arxiv.org/abs/2409.02725)
Keywords: language model
Abstract: Domain adaptation is a widely used method in natural language processing (NLP) to improve the performance of a language model within a specific domain. This method is particularly common in the biomedical domain, which sees regular publication of numerous scientific articles. PubMed, a significant corpus of text, is frequently used in the biomedical domain. The primary objective of this study is to explore whether refining a pre-training dataset using specific quality metrics for scientific papers can enhance the performance of the resulting model. To accomplish this, we employ two straightforward journal impact metrics and conduct experiments by continually pre-training BERT on various subsets of the complete PubMed training set. We then evaluate the resulting models on biomedical language understanding tasks from the BLURB benchmark. Our results show that pruning using journal impact metrics is not efficient. However, we also show that pre-training using fewer abstracts (but with the same number of training steps) does not necessarily decrease the resulting model's performance.

Title: Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?

Authors: Yixuan Tang, Yi Yang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2409.02727
Pdf URL: https://arxiv.org/pdf/2409.02727
Copy Paste: [[2409.02727]] Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?(https://arxiv.org/abs/2409.02727)
Keywords: language model, llm
Abstract: The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.
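As a rough illustration of the simpler pooling designs mentioned in this abstract, the sketch below contrasts EOS-last token pooling with plain mean pooling over a batch of hidden states. It is written in pure Python on nested lists and assumes right-padded sequences whose last real token is the EOS token; the function names are illustrative, and the trainable pooling layer and Multi-Layers Trainable Pooling from the paper are not shown.

```python
def last_token_pool(hidden, mask):
    """Take the hidden state of the last non-padding token of each sequence.

    hidden: list of sequences, each a list of per-token vectors.
    mask:   list of 0/1 lists, 1 for real tokens (assumes right padding).
    """
    pooled = []
    for states, m in zip(hidden, mask):
        last = sum(m) - 1  # index of the last real token
        pooled.append(list(states[last]))
    return pooled


def mean_pool(hidden, mask):
    """Average the hidden states of the real tokens of each sequence."""
    pooled = []
    for states, m in zip(hidden, mask):
        n = sum(m)
        dim = len(states[0])
        pooled.append(
            [sum(s[d] for s, keep in zip(states, m) if keep) / n for d in range(dim)]
        )
    return pooled


# Two sequences of length 3 with 2-dimensional states; the second has one pad token.
hidden = [
    [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
    [[6.0, 7.0], [8.0, 9.0], [10.0, 11.0]],
]
mask = [[1, 1, 1], [1, 1, 0]]

print(last_token_pool(hidden, mask))  # [[4.0, 5.0], [8.0, 9.0]]
print(mean_pool(hidden, mask))        # [[2.0, 3.0], [7.0, 8.0]]
```

Under causal attention only the last token has attended to the whole input, which is why last-token pooling is the natural default there; mean pooling is included purely as a familiar point of comparison.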

Title: Towards a Unified View of Preference Learning for Large Language Models: A Survey

Authors: Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Zhe Yang, Liang Chen, Helan Hu, Runxin Xu, Qingxiu Dong, Ce Zheng, Wen Xiao, Ge Zhang, Daoguang Zan, Keming Lu, Bowen Yu, Dayiheng Liu, Zeyu Cui, Jian Yang, Lei Sha, Houfeng Wang, Zhifang Sui, Peiyi Wang, Tianyu Liu, Baobao Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02795
Pdf URL: https://arxiv.org/pdf/2409.02795
Copy Paste: [[2409.02795]] Towards a Unified View of Preference Learning for Large Language Models: A Survey(https://arxiv.org/abs/2409.02795)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM's performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to understand. The relationships between different methods have been under-explored, limiting the development of preference alignment. In light of this, we break down the existing popular alignment strategies into different components and provide a unified framework to study the current alignment strategies, thereby establishing connections among them. In this survey, we decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm. This unified view offers an in-depth understanding of existing alignment algorithms and also opens up possibilities to synergize the strengths of different strategies. Furthermore, we present detailed working examples of prevalent existing algorithms to facilitate a comprehensive understanding for the readers. Finally, based on our unified perspective, we explore the challenges and future research directions for aligning large language models with human preferences.

Title: MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2409.02813
Pdf URL: https://arxiv.org/pdf/2409.02813
Copy Paste: [[2409.02813]] MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark(https://arxiv.org/abs/2409.02813)
Keywords: prompt
Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, with drops ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.

Title: CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

Authors: Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, Liang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02834
Pdf URL: https://arxiv.org/pdf/2409.02834
Copy Paste: [[2409.02834]] CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models(https://arxiv.org/abs/2409.02834)
Keywords: language model, llm
Abstract: Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or options, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. Extensive experiments comparing our model with the SOTA LMMs over three multimodal mathematical datasets indicate that it effectively improves math reasoning performance.

Title: Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models

Authors: Moein Shahiki Tash, Zahra Ahani, Mohim Tash, Olga Kolesnikova, Grigori Sidorov
Subjects: cs.CL, cs.AI, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2409.02836
Pdf URL: https://arxiv.org/pdf/2409.02836
Copy Paste: [[2409.02836]] Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models(https://arxiv.org/abs/2409.02836)
Keywords: language model, gpt
Abstract: This study performs analysis of Predictive statements, Hope speech, and Regret Detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named "Prediction statements," categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering nuanced interplay between these emotions and predictive behaviors. Despite encountering limitations related to data volume and resource availability, our study reports valuable discoveries concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.

Title: Historical German Text Normalization Using Type- and Token-Based Language Modeling

Authors: Anton Ehrmanntraut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02841
Pdf URL: https://arxiv.org/pdf/2409.02841
Copy Paste: [[2409.02841]] Historical German Text Normalization Using Type- and Token-Based Language Modeling(https://arxiv.org/abs/2409.02841)
Keywords: language model
Abstract: Historic variations of spelling pose a challenge for full-text search or natural language processing on historical digitized texts. To minimize the gap between the historic orthography and contemporary spelling, usually an automatic orthographic normalization of the historical source material is pursued. This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus. The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system that fine-tunes a pre-trained Transformer large language model. However, the normalization of historical text remains a challenge due to difficulties for models to generalize, and the lack of extensive high-quality parallel data.

Title: Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling

Authors: Leanne Nortje
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2409.02865
Pdf URL: https://arxiv.org/pdf/2409.02865
Copy Paste: [[2409.02865]] Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling(https://arxiv.org/abs/2409.02865)
Keywords: prompt
Abstract: This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on applications for low-resource languages and understanding human language acquisition. We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we found that multilingualism does not affect the bias in this VGS model in the way that is observed in children.

Title: LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Authors: Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
Subjects: cs.CL, cs.AI, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2409.02889
Pdf URL: https://arxiv.org/pdf/2409.02889
Copy Paste: [[2409.02889]] LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture(https://arxiv.org/abs/2409.02889)
Keywords: language model, llm, agent
Abstract: Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model LongLLaVA (Long-Context Large Language and Vision Assistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. In particular, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.

Title: LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

Authors: Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.02897
Pdf URL: https://arxiv.org/pdf/2409.02897
Copy Paste: [[2409.02897]] LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA(https://arxiv.org/abs/2409.02897)
Keywords: language model, gpt, llm, hallucination
Abstract: Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the LongCite-45k dataset, successfully enabling their generation of accurate responses and fine-grained sentence-level citations in a single output. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.