2024-12-20

Title: Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data

Authors: Shaina Raza, Drai Paulen-Patterson, Chen Ding
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14276
Pdf URL: https://arxiv.org/pdf/2412.14276
Copy Paste: [[2412.14276]] Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data(https://arxiv.org/abs/2412.14276)
Keywords: language model, gpt, llm
Abstract: Fake news poses a significant threat to public opinion and social stability in modern society. This study presents a comparative evaluation of BERT-like encoder-only models and autoregressive decoder-only large language models (LLMs) for fake news detection. We introduce a dataset of news articles labeled with GPT-4 assistance (an AI-labeling method) and verified by human experts to ensure reliability. Both BERT-like encoder-only models and LLMs were fine-tuned on this dataset. Additionally, we developed an instruction-tuned LLM approach with majority voting during inference for label generation. Our analysis reveals that BERT-like models generally outperform LLMs in classification tasks, while LLMs demonstrate superior robustness against text perturbations. Compared with weakly labeled (distant supervision) data, AI labels with human supervision achieve better classification results. This study highlights the effectiveness of combining AI-based annotation with human oversight and demonstrates the performance of different families of machine learning models for fake news detection.
摘要：假新闻对现代社会的舆论和社会稳定构成了重大威胁。本研究对 BERT 类编码器模型和自回归解码器大型语言模型 (LLM) 进行了假新闻检测的比较评估。我们引入了一个新闻文章数据集，该数据集由 GPT-4 辅助标记（一种 AI 标记方法），并由人类专家验证以确保可靠性。BERT 类编码器模型和 LLM 都在此数据集上进行了微调。此外，我们开发了一种指令调整的 LLM 方法，在标签生成推理过程中采用多数投票。我们的分析表明，BERT 类模型在分类任务中的表现通常优于 LLM，而 LLM 表现出对文本扰动的卓越鲁棒性。与弱标签（远程监督）数据相比，结果表明，具有人工监督的 AI 标签可实现更好的分类结果。本研究强调了将基于 AI 的注释与人工监督相结合的有效性，并展示了不同系列机器学习模型在假新闻检测中的性能
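
Note (illustrative sketch, not taken from the paper): the abstract above mentions majority voting over labels generated at inference time. A minimal way to picture that step, assuming the LLM is sampled several times and each sample has already been parsed into a label string, is shown below. The function name and the tie-breaking rule (first label seen wins) are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    top = max(counts.values())
    # Walk the labels in order so that ties are resolved deterministically.
    for label in labels:
        if counts[label] == top:
            return label


# Hypothetical labels parsed from five sampled generations for one article.
samples = ["fake", "real", "fake", "fake", "real"]
print(majority_vote(samples))
```

An odd number of samples avoids ties for a binary label set, which is why the example uses five.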

Title: Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs

Authors: David Restrepo, Chenwei Wu, Zhengxu Tang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding, Cong-Tinh Dao, Jack Gallifant, Robyn Gayle Dychiao, Jose Carlo Artiaga, André Hiroshi Bando, Carolina Pelegrini Barbosa Gracitelli, Vincenz Ferrer, Leo Anthony Celi, Danielle Bitterman, Michael G Morley, Luis Filipe Nakayama
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14304
Pdf URL: https://arxiv.org/pdf/2412.14304
Copy Paste: [[2412.14304]] Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs(https://arxiv.org/abs/2412.14304)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation, chain-of-thought, agent
Abstract: Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across different languages, highlighting risks for clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or retrieval-augmented generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, we propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference-time debiasing method leveraging retrieval-augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.
摘要：当前眼科临床工作流程受到过度转诊、长时间等待以及复杂且异构的医疗记录的困扰。大型语言模型 (LLM) 为自动执行各种程序（例如分类、视力评估等初步测试以及报告摘要）提供了一种有前途的解决方案。然而，LLM 在自然语言问答任务中在不同语言中表现出了显著不同的表现，这可能会加剧中低收入国家 (LMIC) 的医疗保健差距。这项研究引入了第一个多语言眼科问答基准，其中手动策划的问题在不同语言中并行，允许直接进行跨语言比较。我们对 7 种不同语言的 6 种流行 LLM 的评估显示，不同语言之间存在相当大的偏见，凸显了 LLM 在 LMIC 临床部署的风险。现有的去偏方法（例如翻译思路链或检索增强生成 (RAG)）本身无法缩小这种性能差距，通常无法提高所有语言的性能，并且缺乏针对医学领域的特异性。为了解决这个问题，我们提出了 CLARA（跨语言反射代理系统），这是一种利用检索增强生成和自我验证的新型推理时间去偏方法。我们的方法不仅可以提高所有语言的性能，还可以显著缩小多语言偏见差距，促进全球公平的 LLM 应用。

Title: A Survey on LLM Inference-Time Self-Improvement

Authors: Xiangjue Dong, Maria Teleki, James Caverlee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14352
Pdf URL: https://arxiv.org/pdf/2412.14352
Copy Paste: [[2412.14352]] A Survey on LLM Inference-Time Self-Improvement(https://arxiv.org/abs/2412.14352)
Keywords: llm
Abstract: Techniques that enhance inference through increased computation at test-time have recently gained attention. In this survey, we investigate the current state of LLM Inference-Time Self-Improvement from three different perspectives: Independent Self-improvement, focusing on enhancements via decoding or sampling methods; Context-Aware Self-Improvement, leveraging additional context or datastore; and Model-Aided Self-Improvement, achieving improvement through model collaboration. We provide a comprehensive review of recent relevant studies, contribute an in-depth taxonomy, and discuss challenges and limitations, offering insights for future research.
摘要：通过增加测试时的计算来增强推理的技术最近引起了人们的关注。在本次调查中，我们从三个不同的角度调查了 LLM 推理时间自我改进的现状：独立自我改进，重点是通过解码或采样方法进行增强；上下文感知自我改进，利用额外的上下文或数据存储；模型辅助自我改进，通过模型协作实现改进。我们对最近的相关研究进行了全面回顾，提供了深入的分类法，并讨论了挑战和局限性，为未来的研究提供了见解。

Title: Memorization Over Reasoning? Exposing and Mitigating Verbatim Memorization in Large Language Models' Character Understanding Evaluation

Authors: Yuxuan Jiang, Francis Ferraro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14368
Pdf URL: https://arxiv.org/pdf/2412.14368
Copy Paste: [[2412.14368]] Memorization Over Reasoning? Exposing and Mitigating Verbatim Memorization in Large Language Models' Character Understanding Evaluation(https://arxiv.org/abs/2412.14368)
Keywords: language model, llm
Abstract: Recently, Large Language Models (LLMs) have shown impressive performance in character understanding tasks, such as analyzing the roles, personalities, and relationships of fictional characters. However, the extensive pre-training corpora used by LLMs raise concerns that they may rely on memorizing popular fictional works rather than genuinely understanding and reasoning about them. In this work, we argue that 'gist memory' (capturing essential meaning) should be the primary mechanism for character understanding tasks, as opposed to 'verbatim memory' (exact match of a string). We introduce a simple yet effective method to mitigate mechanized memorization in character understanding evaluations while preserving the essential implicit cues needed for comprehension and reasoning. Our approach reduces memorization-driven performance on popular fictional works from 96% accuracy to 72% and results in up to an 18% drop in accuracy across various character understanding tasks. These findings underscore the issue of data contamination in existing benchmarks, which often measure memorization rather than true character understanding.
摘要：最近，大型语言模型 (LLM) 在人物理解任务中表现出色，例如分析虚构人物的角色、性格和关系。然而，LLM 使用的大量预训练语料库引发了人们的担忧，即它们可能依赖于记忆流行的虚构作品，而不是真正理解和推理它们。在这项工作中，我们认为“要点记忆”——捕捉基本含义——应该是人物理解任务的主要机制，而不是“逐字记忆”——字符串的精确匹配。我们引入了一种简单而有效的方法来减轻人物理解评估中的机械化记忆，同时保留理解和推理所需的基本隐含线索。我们的方法将流行小说作品中记忆驱动的表现从 96% 的准确率降低到 72%，并导致各种人物理解任务的准确率下降高达 18%。这些发现强调了现有基准中的数据污染问题，这些基准通常衡量的是记忆，而不是真正的人物理解。

Title: ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

Authors: William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Subjects: cs.CL, eess.SP
Abstract URL: https://arxiv.org/abs/2412.14373
Pdf URL: https://arxiv.org/pdf/2412.14373
Copy Paste: [[2412.14373]] ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling(https://arxiv.org/abs/2412.14373)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have shown remarkable adaptability across domains beyond text, specifically electrocardiograms (ECGs). More specifically, there is a growing body of work exploring the task of generating text from a multi-channel ECG and corresponding textual prompt. Current approaches typically involve pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective and using the features output by the pretrained encoder to finetune an LLM for natural language generation (NLG). However, these methods are limited by 1) inefficiency from two-stage training and 2) interpretability challenges with encoder-generated features. To address these limitations, we introduce ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. This approach compresses and encodes ECG signals into tokens, enabling end-to-end LLM training by combining ECG and text tokens directly, while being much more interpretable since the ECG tokens can be directly mapped back to the original signal. Using ECG-Byte, we achieve competitive performance in NLG tasks in only half the time and ~48% of the data required by two-stage approaches.
摘要：大型语言模型 (LLM) 已在文本以外的领域表现出显著的适应性，特别是心电图 (ECG)。更具体地说，越来越多的研究正在探索从多通道 ECG 和相应的文本提示生成文本的任务。当前的方法通常涉及使用自监督学习 (SSL) 目标对 ECG 特定编码器进行预训练，并使用预训练编码器输出的特征来微调 LLM 以进行自然语言生成 (NLG)。然而，这些方法受到 1) 两阶段训练效率低下和 2) 编码器生成特征的可解释性挑战的限制。为了解决这些限制，我们引入了 ECG-Byte，这是一种用于 ECG 自回归语言建模的改编字节对编码 (BPE) 标记器管道。这种方法将 ECG 信号压缩并编码为标记，通过直接组合 ECG 和文本标记来实现端到端 LLM 训练，同时由于 ECG 标记可以直接映射回原始信号，因此更具可解释性。使用 ECG-Byte，我们仅用两阶段方法所需一半的时间和约 48% 的数据就在 NLG 任务中取得了具有竞争力的表现。
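
Note (illustrative sketch, not the authors' pipeline): the abstract above describes compressing an ECG signal into tokens with an adapted byte pair encoding and mapping tokens back to the signal. The toy version below assumes a single channel, a uniform quantizer with 8 levels, and a plain greedy BPE merge loop; all function names and parameters are hypothetical and chosen only to show the general idea.

```python
from collections import Counter


def quantize(signal, levels=8):
    """Map each sample to an integer symbol in [0, levels - 1]."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat signal
    return [min(levels - 1, int((x - lo) / span * levels)) for x in signal]


def bpe_merges(symbols, num_merges):
    """Greedily merge the most frequent adjacent token pair, BPE style."""
    seq = [(s,) for s in symbols]  # each token is a tuple of base symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so merging no longer compresses
            break
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges


def decode(tokens):
    """Flatten merged tokens back into the quantized symbol sequence."""
    return [s for tok in tokens for s in tok]


# Hypothetical periodic toy signal standing in for one ECG channel.
signal = [0.0, 0.5, 1.0, 0.5] * 4
symbols = quantize(signal)
tokens, merges = bpe_merges(symbols, num_merges=10)
print(len(symbols), len(tokens))
```

Because every token is a tuple of the base symbols it covers, decoding is a simple flatten, which mirrors the interpretability point made in the abstract (recovery is exact up to the quantization step).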

Title: All-in-One Tuning and Structural Pruning for Domain-Specific LLMs

Authors: Lei Lu, Zhepeng Wang, Runxue Bao, Mengbing Wang, Fangyi Li, Yawen Wu, Weiwen Jiang, Jie Xu, Yanzhi Wang, Shangqian Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14426
Pdf URL: https://arxiv.org/pdf/2412.14426
Copy Paste: [[2412.14426]] All-in-One Tuning and Structural Pruning for Domain-Specific LLMs(https://arxiv.org/abs/2412.14426)
Keywords: language model, llm
Abstract: Existing pruning techniques for large language models (LLMs) targeting domain-specific applications typically follow a two-stage process: pruning the pretrained general-purpose LLMs and then fine-tuning the pruned LLMs on specific domains. However, the pruning decisions, derived from the pretrained weights, remain unchanged during fine-tuning, even if the weights have been updated. Therefore, such a combination of the pruning decisions and the finetuned weights may be suboptimal, leading to non-negligible performance degradation. To address these limitations, we propose ATP: All-in-One Tuning and Structural Pruning, a unified one-stage structural pruning and fine-tuning approach that dynamically identifies the current optimal substructure throughout the fine-tuning phase via a trainable pruning decision generator. Moreover, given the limited available data for domain-specific applications, Low-Rank Adaptation (LoRA) becomes a common technique to fine-tune the LLMs. In ATP, we introduce LoRA-aware forward and sparsity regularization to ensure that the substructures corresponding to the learned pruning decisions can be directly removed after the ATP process. ATP outperforms the state-of-the-art two-stage pruning methods on tasks in the legal and healthcare domains. More specifically, ATP recovers up to 88% and 91% performance of the dense model when pruning 40% parameters of LLaMA2-7B and LLaMA3-8B models, respectively.
摘要：现有的针对特定领域应用的大型语言模型 (LLM) 的修剪技术通常遵循两阶段过程：修剪预训练的通用 LLM，然后在特定领域微调修剪后的 LLM。但是，即使权重已更新，从预训练权重得出的修剪决策在微调期间仍保持不变。因此，这种修剪决策和微调权重的组合可能不是最优的，从而导致不可忽略的性能下降。为了解决这些限制，我们提出了 ATP：一体化调整和结构修剪，这是一种统一的单阶段结构修剪和微调方法，可通过可训练的修剪决策生成器在整个微调阶段动态识别当前最佳子结构。此外，鉴于特定领域应用的可用数据有限，低秩自适应 (LoRA) 成为微调 LLM 的常用技术。在 ATP 中，我们引入了 LoRA 感知的前向和稀疏正则化，以确保在 ATP 过程之后可以直接删除与学习到的修剪决策相对应的子结构。ATP 在法律和医疗保健领域的任务上的表现优于最先进的两阶段修剪方法。更具体地说，当修剪 LLaMA2-7B 和 LLaMA3-8B 模型的 40% 参数时，ATP 分别恢复了密集模型高达 88% 和 91% 的性能。

Title: ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study

Authors: Eric Modesitt, Ke Yang, Spencer Hulsey, Chengxiang Zhai, Volodymyr Kindratenko
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14436
Pdf URL: https://arxiv.org/pdf/2412.14436
Copy Paste: [[2412.14436]] ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study(https://arxiv.org/abs/2412.14436)
Keywords: language model, gpt
Abstract: Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models. Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76% and achieved top results on AstroBench, an astronomy-specific benchmark. Moreover, our model (Orbit-LLaMA) outperformed LLaMA-3-8B-base, with GPT-4o evaluations preferring it in 73% of cases across 1000 astronomy-specific questions. Additionally, we validated ORBIT's generalizability by applying it to law and medicine, achieving a significant improvement of data quality compared to an unfiltered baseline. We open-source the ORBIT methodology, including the curated datasets, the codebase, and the resulting model at this https URL.
摘要：语言建模的最新进展表明，需要高质量的特定领域训练数据，尤其是对于需要专业知识的任务。通用模型虽然用途广泛，但由于特定领域信息有限，通常缺乏专家级任务所需的深度。领域适应训练可以增强这些模型，但它需要大量高质量的数据。为了解决这个问题，我们提出了 ORBIT，这是一种经济高效的方法，用于从嘈杂的网络源中整理大量高质量的特定领域数据集，专门用于训练专业的大型语言模型。以天文学为主要案例研究，我们将 1.3T 标记 FineWeb-Edu 数据集细化为专注于天文学的高质量 10B 标记子集。在 1B 标记天文学子集上微调 LLaMA-3-8B 可将 MMLU 天文学基准上的性能从 69% 提高到 76%，并在天文学特定基准 AstroBench 上取得最高成绩。此外，我们的模型 (Orbit-LLaMA) 的表现优于 LLaMA-3-8B-base，在 1000 个天文学特定问题中，GPT-4o 评估在 73% 的情况下更倾向于它。此外，我们通过将 ORBIT 应用于法律和医学来验证其通用性，与未过滤的基线相比，数据质量显著提高。我们在此 https URL 上开源了 ORBIT 方法，包括精选数据集、代码库和生成的模型。

Title: From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research

Authors: Xiang Cheng, Raveesh Mayya, João Sedoc
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14461
Pdf URL: https://arxiv.org/pdf/2412.14461
Copy Paste: [[2412.14461]] From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research(https://arxiv.org/abs/2412.14461)
Keywords: language model, llm, prompt
Abstract: Unstructured text data annotation and analysis are fundamental to management research, often relying on human annotators through crowdsourcing platforms. While Large Language Models (LLMs) promise to provide a cost-effective and efficient alternative to human annotation, there is no systematic workflow that evaluates when LLMs are suitable or how to proceed with LLM-based text annotation in a reproducible manner. This paper addresses this methodological gap by introducing the "SILICON" (Systematic Inference with LLMs for Information Classification and Notation) workflow. The workflow integrates established principles of human annotation with systematic prompt optimization and model selection, addressing challenges such as developing robust annotation guidelines, establishing high-quality human baselines, optimizing prompts, and ensuring reproducibility across LLMs. We validate the SILICON workflow through seven case studies covering common management research tasks, including business proposal evaluation, dialog intent and breakdown analysis, and review attribute detection. Our findings highlight the importance of validating annotation guideline agreement, the superiority of expert-developed human baselines over crowdsourced ones, the iterative nature of prompt optimization, and the necessity of testing multiple LLMs. Notably, we propose a regression-based methodology to empirically compare LLM outputs across prompts and models. Our workflow advances management research by establishing reproducible processes for LLM-based annotation that maintain scientific rigor. We provide practical guidance for researchers to effectively navigate the evolving landscape of generative AI tools while maintaining transparency and reproducibility.
摘要：非结构化文本数据注释和分析是管理研究的基础，通常依赖于众包平台的人工注释。虽然大型语言模型 (LLM) 有望提供一种经济高效且可替代人工注释的方法，但目前缺乏系统的工作流程来评估 LLM 何时适用或如何以可重复的方式进行基于 LLM 的文本注释。本文通过引入“SILICON”（Systematic Inference with LLMs for Information Classification and Notation）工作流程来解决这一方法论上的差距。该工作流程将已建立的人工注释原则与系统提示优化和模型选择相结合，解决了诸如制定强大的注释指南、建立高质量的人工基线、优化提示以及确保跨 LLM 的可重复性等挑战。我们通过七个案例研究验证了 SILICON 工作流程，这些案例研究涵盖了常见的管理研究任务，包括商业提案评估、对话意图和细分分析、评论属性检测。我们的研究结果强调了验证注释指南一致性的重要性、专家开发的人工基线相对于众包基线的优越性、提示优化的迭代性质以及测试多个 LLM 的必要性。值得注意的是，我们提出了一种基于回归的方法来实证比较 LLM 跨提示和模型的输出。我们的工作流程通过建立可重复的 LLM 注释流程来推进管理研究，同时保持科学严谨性。我们为研究人员提供实用指导，以有效驾驭生成式 AI 工具不断发展的格局，同时保持透明度和可重复性。

Title: Agent-SafetyBench: Evaluating the Safety of LLM Agents

Authors: Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14470
Pdf URL: https://arxiv.org/pdf/2412.14470
Copy Paste: [[2412.14470]] Agent-SafetyBench: Evaluating the Safety of LLM Agents(https://arxiv.org/abs/2412.14470)
Keywords: language model, llm, prompt, agent
Abstract: As large language models (LLMs) are increasingly deployed as agents, their integration into interactive environments and tool use introduce new safety challenges beyond those associated with the models themselves. However, the absence of comprehensive benchmarks for evaluating agent safety presents a significant barrier to effective assessment and further improvement. In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%. This highlights significant safety challenges in LLM agents and underscores the considerable need for improvement. Through quantitative analysis, we identify critical failure modes and summarize two fundamental safety defects in current LLM agents: lack of robustness and lack of risk awareness. Furthermore, our findings suggest that reliance on defense prompts alone is insufficient to address these safety issues, emphasizing the need for more advanced and robust strategies. We release Agent-SafetyBench at this https URL to facilitate further research and innovation in agent safety evaluation and improvement.
摘要：随着大型语言模型 (LLM) 越来越多地被部署为代理，它们与交互环境和工具使用的集成带来了新的安全挑战，而不仅仅是与模型本身相关的挑战。然而，缺乏用于评估代理安全性的综合基准对有效评估和进一步改进构成了重大障碍。在本文中，我们介绍了 Agent-SafetyBench，这是一个旨在评估 LLM 代理安全性的综合基准。Agent-SafetyBench 涵盖 349 个交互环境和 2,000 个测试用例，评估了 8 类安全风险，涵盖了不安全交互中经常遇到的 10 种常见故障模式。我们对 16 个流行的 LLM 代理的评估揭示了一个令人担忧的结果：没有一个代理的安全分数超过 60%。这凸显了 LLM 代理面临的重大安全挑战，并强调了改进的巨大需求。通过定量分析，我们确定了关键故障模式，并总结了当前 LLM 代理中的两个基本安全缺陷：缺乏稳健性和缺乏风险意识。此外，我们的研究结果表明，仅依靠防御提示不足以解决这些安全问题，这强调了对更先进、更强大的策略的需求。我们在此 https URL 发布了 Agent-SafetyBench，以促进代理安全评估和改进方面的进一步研究和创新。

Title: Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs

Authors: Koshiro Saito, Sakae Mizuki, Masanari Ohi, Taishi Nakamura, Taihei Shiotani, Koki Maeda, Youmi Ma, Kakeru Hattori, Kazuki Fujii, Takumi Okamoto, Shigeki Ishida, Hiroya Takamura, Rio Yokota, Naoaki Okazaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14471
Pdf URL: https://arxiv.org/pdf/2412.14471
Copy Paste: [[2412.14471]] Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs(https://arxiv.org/abs/2412.14471)
Keywords: language model, llm
Abstract: Why do we build local large language models (LLMs)? What should a local LLM learn from the target language? Which abilities can be transferred from other languages? Do language-specific scaling laws exist? To explore these research questions, we evaluated 35 Japanese, English, and multilingual LLMs on 19 evaluation benchmarks for Japanese and English, taking Japanese as a local language. Adopting an observational approach, we analyzed correlations of benchmark scores, and conducted principal component analysis (PCA) on the scores to derive ability factors of local LLMs. We found that training on English text can improve the scores of academic subjects in Japanese (JMMLU). In addition, it is unnecessary to specifically train on Japanese text to enhance abilities for solving Japanese code generation, arithmetic reasoning, commonsense, and reading comprehension tasks. In contrast, training on Japanese text could improve question-answering tasks about Japanese knowledge and English-Japanese translation, which indicates that abilities for solving these two tasks can be regarded as Japanese abilities for LLMs. Furthermore, we confirmed that the Japanese abilities scale with the computational budget for Japanese text.
摘要：为什么要建立本地大型语言模型（LLM）？本地 LLM 应该从目标语言中学习什么？哪些能力可以从其他语言中迁移？是否存在特定于语言的缩放定律？为了探索这些研究问题，我们将日语作为本地语言，在 19 个日语和英语评估基准上评估了 35 个日语、英语和多语言 LLM。采用观察方法，我们分析了基准分数的相关性，并对分数进行了主成分分析（PCA），以得出本地 LLM 的能力因素。我们发现，对英语文本进行训练可以提高日语学术科目（JMMLU）的分数。此外，无需专门对日语文本进行训练即可提高解决日语代码生成、算术推理、常识和阅读理解任务的能力。相反，对日语文本进行训练可以提高有关日语知识和英日翻译的问答任务，这表明解决这两个任务的能力可以被视为 LLM 的日语能力。此外，我们确认日语能力与日语文本的计算预算成正比。
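
Note (illustrative sketch, not the authors' analysis code): the abstract above applies principal component analysis to a models-by-benchmarks score table to derive ability factors. The toy example below assumes a tiny hypothetical score matrix and extracts only the first principal component by power iteration on the sample covariance matrix; the function name, the data, and the choice of power iteration are all assumptions for illustration.

```python
import math


def first_principal_component(rows, iters=200):
    """Return the unit-length first principal axis of the rows (one row per model)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between benchmarks (d x d).
    cov = [[sum(x[i] * x[j] for x in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # arbitrary non-zero starting direction
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v


# Hypothetical scores: 3 models x 3 benchmarks. The second benchmark moves
# together with the first, and the third is constant across models.
scores = [[1.0, 2.0, 5.0],
          [2.0, 4.0, 5.0],
          [3.0, 6.0, 5.0]]
print(first_principal_component(scores))
```

Benchmarks with large loadings of the same sign on one component vary together across models, which is the sense in which a component can be read as a shared ability factor.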

Title: Do Large Language Models Defend Inferentialist Semantics?: On the Logical Expressivism and Anti-Representationalism of LLMs

Authors: Yuzuki Arai, Sho Tsugawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14501
Pdf URL: https://arxiv.org/pdf/2412.14501
Copy Paste: [[2412.14501]] Do Large Language Models Defend Inferentialist Semantics?: On the Logical Expressivism and Anti-Representationalism of LLMs(https://arxiv.org/abs/2412.14501)
Keywords: language model, gpt, llm, chat
Abstract: The philosophy of language, which has historically been developed through an anthropocentric lens, is now being forced to move towards post-anthropocentrism due to the advent of large language models (LLMs) such as ChatGPT (OpenAI) and Claude (Anthropic), which are considered to possess linguistic abilities comparable to those of humans. Traditionally, LLMs have been explained through distributional semantics as their foundational semantics. However, recent research is exploring alternative foundational semantics beyond distributional semantics. This paper proposes Robert Brandom's inferentialist semantics as a suitable foundational semantics for LLMs, specifically focusing on the issue of linguistic representationalism within this post-anthropocentric trend. Here, we show that the anti-representationalism and logical expressivism of inferential semantics, as well as quasi-compositionality, are useful in interpreting the characteristics and behaviors of LLMs. Further, we propose a consensus theory of truths for LLMs. This paper argues that the characteristics of LLMs challenge mainstream assumptions in philosophy of language, such as semantic externalism and compositionality. We believe the argument in this paper leads to a re-evaluation of anti-representationalist views of language, potentially leading to new developments in the philosophy of language.
摘要：语言哲学历来都是通过人类中心主义的视角发展起来的，但现在，由于大型语言模型 (LLM) 的出现，如 ChatGPT (OpenAI)、Claude (Anthropic)，人们认为它们拥有与人类相当的语言能力，语言哲学正被迫走向后人类中心主义。传统上，LLM 是通过分布语义作为其基础语义来解释的。然而，最近的研究正在探索除分布语义之外的替代基础语义。本文提出了罗伯特·布兰登 (Robert Brandom) 的推理语义作为 LLM 的合适基础语义，特别关注后人类中心主义趋势中的语言表征主义问题。在这里，我们表明推理语义的反表征主义和逻辑表达主义以及准组合性对于解释 LLM 的特征和行为很有用。此外，我们为 LLM 提出了一个真理共识理论。本文认为，LLM 的特点挑战了语言哲学的主流假设，例如语义外在主义和组合性。我们认为本文的论点将导致重新评估反表征主义的语言观点，并可能导致语言哲学的新发展。

Title: PA-RAG: RAG Alignment via Multi-Perspective Preference Optimization

Authors: Jiayi Wu, Hengyi Cai, Lingyong Yan, Hao Sun, Xiang Li, Shuaiqiang Wang, Dawei Yin, Ming Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14510
Pdf URL: https://arxiv.org/pdf/2412.14510
Copy Paste: [[2412.14510]] PA-RAG: RAG Alignment via Multi-Perspective Preference Optimization(https://arxiv.org/abs/2412.14510)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: The emergence of retrieval-augmented generation (RAG) has alleviated the issues of outdated and hallucinatory content in the generation of large language models (LLMs), yet it still reveals numerous limitations. When a general-purpose LLM serves as the RAG generator, it often suffers from inadequate response informativeness, response robustness, and citation quality. Past approaches to tackle these limitations, either by incorporating additional steps beyond generating responses or by optimizing the generator through supervised fine-tuning (SFT), still failed to align with the RAG requirements thoroughly. Consequently, optimizing the RAG generator from multiple preference perspectives while maintaining its end-to-end LLM form remains a challenge. To bridge this gap, we propose Multiple Perspective Preference Alignment for Retrieval-Augmented Generation (PA-RAG), a method for optimizing the generator of RAG systems to align with RAG requirements comprehensively. Specifically, we construct high-quality instruction fine-tuning data and multi-perspective preference data by sampling responses of varied quality from the generator across scenarios that differ in the quality of the prompt documents. Subsequently, we optimize the generator using SFT and Direct Preference Optimization (DPO). Extensive experiments conducted on four question-answer datasets across three LLMs demonstrate that PA-RAG can significantly enhance the performance of RAG generators. Our code and datasets are available at this https URL.
摘要：检索增强生成 (RAG) 的出现缓解了大型语言模型 (LLM) 生成中内容过时和幻觉的问题，但它仍然存在许多局限性。当通用 LLM 用作 RAG 生成器时，它通常会受到响应信息量、响应稳健性和引用质量不足的影响。过去解决这些限制的方法，无论是通过在生成响应之外加入额外步骤，还是通过监督微调 (SFT) 优化生成器，仍然无法完全满足 RAG 要求。因此，从多个偏好角度优化 RAG 生成器同时保持其端到端 LLM 形式仍然是一个挑战。为了弥补这一差距，我们提出了检索增强生成的多视角偏好对齐 (PA-RAG)，这是一种优化 RAG 系统生成器以全面满足 RAG 要求的方法。具体来说，我们通过从不同提示文档质量场景的生成器中采样不同质量的响应来构建高质量的指令微调数据和多视角偏好数据。随后，我们使用 SFT 和直接偏好优化 (DPO) 优化生成器。在三个 LLM 上对四个问答数据集进行的大量实验表明，PA-RAG 可以显著提高 RAG 生成器的性能。我们的代码和数据集可在此 https URL 上找到。
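
Note (illustrative sketch, not the PA-RAG training code): the abstract above optimizes the generator with Direct Preference Optimization (DPO). For one preference pair, the standard DPO objective is the negative log-sigmoid of a scaled difference between the policy's and a frozen reference model's log-probability gaps. The helper below computes that per-pair loss from four summed log-probabilities; the function name, argument names, and the default beta of 0.1 are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy prefers the chosen response
# more strongly than the reference does, so the loss falls below log(2).
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; it decreases as the policy widens the gap in favor of the preferred response.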

Title: Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Authors: Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14528
Pdf URL: https://arxiv.org/pdf/2412.14528
Copy Paste: [[2412.14528]] Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models(https://arxiv.org/abs/2412.14528)
Keywords: language model, llm
Abstract: Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance for divergence measures. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes.
摘要：知识蒸馏 (KD) 已成为压缩大型语言模型 (LLM) 的流行技术。现有的 KD 方法受限于教师和学生模型之间需要相同的标记器（即词汇表），这限制了它们在处理不同架构系列的 LLM 时的多功能性。在本文中，我们介绍了多级最优传输 (MultiLevelOT)，这是一种新方法，可推进通用跨标记器知识蒸馏的最优传输。我们的方法使用不同的成本矩阵在标记和序列级别对齐教师和学生的 logit 分布，从而无需维度或逐个标记的对应关系。在标记级别，MultiLevelOT 通过联合优化序列中的所有标记来集成全局和局部信息以增强鲁棒性。在序列级别，我们通过 Sinkhorn 距离有效地捕获 logit 的复杂分布结构，该距离近似于散度度量的 Wasserstein 距离。在提取式问答、生成式问答和摘要等任务上进行的大量实验表明，MultiLevelOT 在各种设置下都优于最先进的跨标记器 KD 方法。我们的方法对于不同模型系列、架构和参数大小的学生和教师模型都具有很强的鲁棒性。
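
Note (illustrative sketch, not the MultiLevelOT implementation): the abstract above uses the Sinkhorn distance as an approximation of the Wasserstein distance between teacher and student distributions. The generic entropic-regularized Sinkhorn iteration below works for two discrete distributions of possibly different sizes, given a cost matrix. The function name, the regularization strength eps, the iteration count, and the toy cost matrix are assumptions; the paper's actual cost matrices are not reproduced here.

```python
import math


def sinkhorn(a, b, cost, eps=0.1, iters=500):
    """Entropic optimal transport between histograms a (size n) and b (size m).

    Returns the transport cost of the regularized plan and the plan itself.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan


# Hypothetical teacher and student distributions over vocabularies of
# different sizes (3 vs 2), with a made-up cost matrix.
teacher = [0.5, 0.3, 0.2]
student = [0.6, 0.4]
cost = [[0.0, 1.0],
        [0.5, 0.5],
        [1.0, 0.0]]
distance, plan = sinkhorn(teacher, student, cost)
print(round(distance, 4))
```

The two inputs need not have the same length, which is the property that makes optimal transport attractive when teacher and student tokenizers differ.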

Title: CitaLaw: Enhancing LLM with Citations in Legal Domain

Authors: Kepu Zhang, Weijie Yu, Sunhao Dai, Jun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14556
Pdf URL: https://arxiv.org/pdf/2412.14556
Copy Paste: [[2412.14556]] CitaLaw: Enhancing LLM with Citations in Legal Domain(https://arxiv.org/abs/2412.14556)
Keywords: llm
Abstract: In this paper, we propose CitaLaw, the first benchmark designed to evaluate LLMs' ability to produce legally sound responses with appropriate citations. CitaLaw features a diverse set of legal questions for both laypersons and practitioners, paired with a comprehensive corpus of law articles and precedent cases as a reference pool. This framework enables LLM-based systems to retrieve supporting citations from the reference corpus and align these citations with the corresponding sentences in their responses. Moreover, we introduce syllogism-inspired evaluation methods to assess the legal alignment between retrieved references and LLM-generated responses, as well as their consistency with user questions. Extensive experiments on 2 open-domain and 7 legal-specific LLMs demonstrate that integrating legal references substantially enhances response quality. Furthermore, our proposed syllogism-based evaluation method exhibits strong agreement with human judgments.
摘要：在本文中，我们提出了 CitaLaw，这是第一个旨在评估 LLM 生成具有适当引文的合法合理响应的能力的基准。CitaLaw 为普通人和从业者提供了一系列不同的法律问题，并配有全面的法律文章和先例案例作为参考库。该框架使基于 LLM 的系统能够从参考语料库中检索支持引文，并将这些引文与其响应中的相应句子对齐。此外，我们引入了三段论启发的评估方法来评估检索到的参考资料和 LLM 生成的响应之间的法律一致性，以及它们与用户问题的一致性。在 2 个开放领域和 7 个法律专用 LLM 上进行的大量实验表明，整合法律参考资料可显著提高响应质量。此外，我们提出的基于三段论的评估方法与人类判断表现出很强的一致性。

Title: CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation

Authors: Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14581
Pdf URL: https://arxiv.org/pdf/2412.14581
Copy Paste: [[2412.14581]] CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation(https://arxiv.org/abs/2412.14581)
Keywords: language model, llm, retrieval-augmented generation
Abstract: With the adoption of retrieval-augmented generation (RAG), large language models (LLMs) are expected to ground their generation to the retrieved contexts. Yet, this is hindered by position bias of LLMs, failing to evenly attend to all contexts. Previous work has addressed this by synthesizing contexts with perturbed positions of gold segment, creating a position-diversified train set. We extend this intuition to propose consistency regularization with augmentation and distillation. First, we augment each training instance with its position perturbation to encourage consistent predictions, regardless of ordering. We also distill behaviors of this pair, although it can be counterproductive in certain RAG scenarios where the given order from the retriever is crucial for generation quality. We thus propose CORD, balancing COnsistency and Rank Distillation. CORD adaptively samples noise-controlled perturbations from an interpolation space, ensuring both consistency and respect for the rank prior. Empirical results show this balance enables CORD to outperform consistently in diverse RAG benchmarks.
摘要：随着检索增强生成 (RAG) 的采用，大型语言模型 (LLM) 有望将其生成基于检索到的上下文。然而，这受到 LLM 位置偏差的阻碍，无法均匀地关注所有上下文。先前的研究通过合成具有黄金片段扰动位置的上下文来解决这个问题，从而创建位置多样化的训练集。我们扩展了这一直觉，提出了具有增强和提炼的一致性正则化。首先，我们用其位置扰动来增强每个训练实例，以鼓励一致的预测，而不管顺序如何。我们还提炼了这一对的行为，尽管在某些 RAG 场景中，这可能会适得其反，因为检索器给出的顺序对于生成质量至关重要。因此，我们提出了 CORD，平衡一致性和等级提炼。CORD 自适应地从插值空间中采样噪声控制的扰动，确保一致性和对等级先验的尊重。实证结果表明，这种平衡使 CORD 能够在各种 RAG 基准测试中始终表现出色。
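
The two ingredients named in the CORD abstract, noise-controlled position perturbation and a consistency term between predictions, can be illustrated with a minimal sketch. This is not the paper's sampling scheme or loss: the rank-plus-noise perturbation, the symmetric KL term, and the names `perturb_order` and `symmetric_kl` are assumptions chosen for illustration.

```python
import math
import random


def perturb_order(contexts, noise, rng):
    """Re-rank contexts by (original rank + scaled random noise).

    noise=0.0 keeps the retriever's order (full respect for the rank prior);
    larger values move toward an unconstrained shuffle. Illustrative only.
    """
    n = len(contexts)
    keys = [i + noise * n * rng.random() for i in range(n)]
    order = sorted(range(n), key=lambda i: keys[i])
    return [contexts[i] for i in order]


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two discrete distributions.

    Stands in for a consistency penalty between the model's predictions
    on the original and the perturbed context order.
    """
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


if __name__ == "__main__":
    rng = random.Random(0)
    docs = ["doc_a", "doc_b", "doc_c", "doc_d"]
    print(perturb_order(docs, noise=0.0, rng=rng))  # retriever order preserved
    print(perturb_order(docs, noise=2.0, rng=rng))  # some reordering likely
    print(symmetric_kl([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))
```

In a training loop, the perturbed list would be fed to the model alongside the original, and the consistency term added to the usual generation loss; the noise level is the knob that trades consistency against the rank prior.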

Title: Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues

Authors: Tao He, Lizi Liao, Yixin Cao, Yuanxing Liu, Yiheng Sun, Zerui Chen, Ming Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14584
Pdf URL: https://arxiv.org/pdf/2412.14584
Copy Paste: [[2412.14584]] Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues(https://arxiv.org/abs/2412.14584)
Keywords: language model, gpt, llm, chat
Abstract: Recent advancements in proactive dialogues have garnered significant attention, particularly for more complex objectives (e.g. emotion support and persuasion). Unlike traditional task-oriented dialogues, proactive dialogues demand advanced policy planning and adaptability, requiring rich scenarios and comprehensive policy repositories to develop such systems. However, existing approaches tend to rely on Large Language Models (LLMs) for user simulation and online learning, leading to biases that diverge from realistic scenarios and result in suboptimal efficiency. Moreover, these methods depend on manually defined, context-independent, coarse-grained policies, which not only incur high expert costs but also raise concerns regarding their completeness. In our work, we highlight the potential for automatically discovering policies directly from raw, real-world dialogue records. To this end, we introduce a novel dialogue policy planning framework, LDPP. It fully automates the process from mining policies in dialogue records to learning policy planning. Specifically, we employ a variant of the Variational Autoencoder to discover fine-grained policies represented as latent vectors. After automatically annotating the data with these latent policy labels, we propose an Offline Hierarchical Reinforcement Learning (RL) algorithm in the latent space to develop effective policy planning capabilities. Our experiments demonstrate that LDPP outperforms existing methods on two proactive scenarios, even surpassing ChatGPT with only a 1.8-billion-parameter LLM.
摘要：主动对话的最新进展引起了广泛关注，尤其是对于更复杂的目标（例如情感支持和说服）。与传统的面向任务的对话不同，主动对话需要高级策略规划和适应性，需要丰富的场景和全面的策略存储库来开发此类系统。然而，现有方法往往依赖大型语言模型 (LLM) 进行用户模拟和在线学习，导致偏差偏离现实场景并导致效率不理想。此外，这些方法依赖于手动定义的、与上下文无关的粗粒度策略，这不仅会产生高昂的专家成本，而且还引发对其完整性的担忧。在我们的工作中，我们强调了直接从原始的真实世界对话记录中自动发现策略的潜力。为此，我们引入了一个新颖的对话策略规划框架 LDPP。它完全自动化了从挖掘对话记录中的策略到学习策略规划的过程。具体来说，我们采用变分自动编码器的变体来发现表示为潜在向量的细粒度策略。在使用这些潜在策略标签自动注释数据后，我们提出了一种潜在空间中的离线分层强化学习 (RL) 算法，以开发有效的策略规划能力。我们的实验表明，LDPP 在两种主动场景中的表现优于现有方法，甚至仅使用 18 亿个参数的 LLM 就超越了 ChatGPT。

Title: Beyond Guilt: Legal Judgment Prediction with Trichotomous Reasoning

Authors: Kepu Zhang, Haoyue Yang, Xu Tang, Weijie Yu, Jun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14588
Pdf URL: https://arxiv.org/pdf/2412.14588
Copy Paste: [[2412.14588]] Beyond Guilt: Legal Judgment Prediction with Trichotomous Reasoning(https://arxiv.org/abs/2412.14588)
Keywords: language model, llm, prompt
Abstract: In legal practice, judges apply the trichotomous dogmatics of criminal law, sequentially assessing the elements of the offense, unlawfulness, and culpability to determine whether an individual's conduct constitutes a crime. Although current legal large language models (LLMs) show promising accuracy in judgment prediction, they lack trichotomous reasoning capabilities due to the absence of an appropriate benchmark dataset, preventing them from predicting innocent outcomes. As a result, every input is automatically assigned a charge, limiting their practical utility in legal contexts. To bridge this gap, we introduce LJPIV, the first benchmark dataset for Legal Judgment Prediction with Innocent Verdicts. Adhering to the trichotomous dogmatics, we extend three widely-used legal datasets through LLM-based augmentation and manual verification. Our experiments with state-of-the-art legal LLMs and novel strategies that integrate trichotomous reasoning into zero-shot prompting and fine-tuning reveal: (1) current legal LLMs have significant room for improvement, with even the best models achieving an F1 score of less than 0.3 on LJPIV; and (2) our strategies notably enhance both in-domain and cross-domain judgment prediction accuracy, especially for cases resulting in an innocent verdict.
摘要：在司法实践中，法官运用刑法的三分法教义，依次评估犯罪、违法和罪责的要素，以确定个人的行为是否构成犯罪。尽管当前的法律大型语言模型 (LLM) 在判断预测方面表现出良好的准确性，但由于缺乏适当的基准数据集，它们缺乏三分法推理能力，导致无法预测无辜结果。因此，每个输入都会自动分配一项指控，限制了它们在法律环境中的实际效用。为了弥补这一差距，我们推出了 LJPIV，这是第一个具有无辜判决的法律判决预测基准数据集。遵循三分法教义，我们通过基于 LLM 的增强和人工验证扩展了三个广泛使用的法律数据集。我们对最先进的法律 LLM 和将三分法推理融入零样本提示和微调的新策略进行的实验表明：（1）当前的法律 LLM 具有很大的改进空间，即使是最好的模型在 LJPIV 上的 F1 分数也低于 0.3；（2）我们的策略显着提高了域内和跨域判断预测的准确性，特别是对于无罪判决的案件。

Title: HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model

Authors: Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.14613
Pdf URL: https://arxiv.org/pdf/2412.14613
Copy Paste: [[2412.14613]] HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model(https://arxiv.org/abs/2412.14613)
Keywords: language model
Abstract: Vision-language models (VLMs) have shown impressive abilities in text and image understanding. However, existing metrics for evaluating the text generated by VLMs focus exclusively on overall quality, leading to two limitations: 1) it is challenging to identify which aspects of the text need improvement from the overall score; 2) metrics may overlook specific evaluation criteria when predicting an overall score. To address these limitations, we propose HarmonicEval, a reference-free evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert human judgments across four vision-language tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.
摘要：视觉语言模型 (VLM) 在文本和图像理解方面表现出了令人印象深刻的能力。然而，现有的用于评估 VLM 生成的文本的指标仅侧重于整体质量，这导致两个限制：1) 很难从总体得分中确定文本的哪些方面需要改进；2) 指标在预测总体得分时可能会忽略特定的评估标准。为了解决这些限制，我们提出了 HarmonicEval，这是一种无参考评估指标，它汇总了标准得分以自下而上的方式产生总体得分。此外，我们构建了多任务多标准人工评估 (MMHE) 数据集，该数据集包含四个视觉语言任务中的 18,000 个专家人工判断。我们的实验表明，HarmonicEval 与传统指标相比，在为每个标准提供数值分数的同时，实现了与人工判断更高的相关性。

Title: How good is GPT at writing political speeches for the White House?

Authors: Jacques Savoy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14617
Pdf URL: https://arxiv.org/pdf/2412.14617
Copy Paste: [[2412.14617]] How good is GPT at writing political speeches for the White House?(https://arxiv.org/abs/2412.14617)
Keywords: language model, gpt, llm
Abstract: Using large language models (LLMs), computers are able to generate a written text in response to a user request. As this pervasive technology can be applied in numerous contexts, this study analyses the written style of one LLM called GPT by comparing its generated speeches with those of the recent US presidents. To achieve this objective, the State of the Union (SOTU) addresses written by presidents from Reagan to Biden are contrasted with those produced by both GPT-3.5 and GPT-4.o versions. Compared to US presidents, GPT tends to overuse the lemma "we" and produce shorter messages with, on average, longer sentences. Moreover, GPT adopts an optimistic tone, opting more often for political (e.g., president, Congress), symbolic (e.g., freedom), and abstract terms (e.g., freedom). Even when imposing an author's style on GPT, the resulting speech remains distinct from addresses written by the target author. Finally, the two GPT versions present distinct characteristics, but both appear overall dissimilar to true presidential messages.
摘要：使用大型语言模型 (LLM)，计算机能够根据用户请求生成书面文本。由于这项普遍存在的技术可以应用于多种环境，本研究通过将其生成的演讲与最近几位美国总统的演讲进行比较，分析了一个名为 GPT 的 LLM 的书写风格。为了实现这一目标，从里根到拜登历任总统撰写的国情咨文 (SOTU) 与 GPT-3.5 和 GPT-4.o 版本生成的演讲进行了对比。与美国总统相比，GPT 倾向于过度使用词元“we”（我们），生成的文本较短，但平均句子较长。此外，GPT 选择乐观的语气，更多地选择政治术语（例如总统、国会）、象征术语（例如自由）和抽象术语（例如自由）。即使将作者的风格强加给 GPT，生成的演讲仍然与目标作者写的演讲不同。最后，两个 GPT 版本呈现出不同的特征，但总体上看起来都与真正的总统信息不同。

Title: Learning to Generate Research Idea with Dynamic Control

Authors: Ruochen Li, Liqiang Jing, Chi Han, Jiawei Zhou, Xinya Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14626
Pdf URL: https://arxiv.org/pdf/2412.14626
Copy Paste: [[2412.14626]] Learning to Generate Research Idea with Dynamic Control(https://arxiv.org/abs/2412.14626)
Keywords: language model, llm, prompt
Abstract: The rapid advancements in large language models (LLMs) have demonstrated their potential to accelerate scientific discovery, particularly in automating the process of research ideation. LLM-based systems have shown promise in generating hypotheses and research ideas. However, current approaches predominantly rely on prompting-based pre-trained models, limiting their ability to optimize generated content effectively. Moreover, they also lack the capability to deal with the complex interdependence and inherent restrictions among novelty, feasibility, and effectiveness, which remains challenging due to the inherent trade-offs among these dimensions, such as the innovation-feasibility conflict. To address these limitations, we for the first time propose fine-tuning LLMs to be better idea proposers and introduce a novel framework that employs a two-stage approach combining Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL). In the SFT stage, the model learns foundational patterns from pairs of research papers and follow-up ideas. In the RL stage, multi-dimensional reward modeling, guided by fine-grained feedback, evaluates and optimizes the generated ideas across key metrics. Dimensional controllers enable dynamic adjustment of generation, while a sentence-level decoder ensures context-aware emphasis during inference. Our framework provides a balanced approach to research ideation, achieving high-quality outcomes by dynamically navigating the trade-offs among novelty, feasibility, and effectiveness.
摘要：大型语言模型 (LLM) 的快速发展已显示出其加速科学发现的潜力，特别是在自动化研究构思过程方面。基于 LLM 的系统在生成假设和研究想法方面表现出了良好的前景。然而，当前的方法主要依赖于基于提示的预训练模型，这限制了它们有效优化生成内容的能力。此外，它们还缺乏处理新颖性、可行性和有效性之间复杂的相互依赖性和固有限制的能力，这仍然具有挑战性，因为这些维度之间存在固有的权衡，例如创新性-可行性冲突。为了解决这些限制，我们首次提出了微调 LLM 以成为更好的想法提出者，并引入了一个采用结合监督微调 (SFT) 和可控强化学习 (RL) 的两阶段方法的新框架。在 SFT 阶段，该模型从研究论文和后续想法的配对中学习基础模式。在强化学习阶段，由细粒度反馈引导的多维奖励建模会根据关键指标评估和优化生成的想法。维度控制器可以动态调整生成，而句子级解码器可确保在推理过程中强调上下文感知。我们的框架提供了一种平衡的研究构思方法，通过动态权衡新颖性、可行性和有效性来实现高质量的结果。

Title: TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation

Authors: Jiatong Li, Junxian Li, Yunqing Liu, Dongzhan Zhou, Qing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14642
Pdf URL: https://arxiv.org/pdf/2412.14642
Copy Paste: [[2412.14642]] TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation(https://arxiv.org/abs/2412.14642)
Keywords: gpt, llm
Abstract: In this paper, we propose Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation capability of LLMs. TOMG-Bench encompasses a dataset of three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each task further contains three subtasks, with each subtask comprising 5,000 test samples. Given the inherent complexity of open molecule generation, we have also developed an automated evaluation system that helps measure both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the current limitations and potential areas for improvement in text-guided molecule discovery. Furthermore, with the assistance of OpenMolIns, a specialized instruction tuning dataset proposed for solving challenges raised by TOMG-Bench, Llama3.1-8B could outperform all the open-source general LLMs, even surpassing GPT-3.5-turbo by 46.5% on TOMG-Bench. Our codes and datasets are available through this https URL.
摘要：在本文中，我们提出了基于文本的开放分子生成基准 (TOMG-Bench)，这是第一个评估 LLM 开放域分子生成能力的基准。TOMG-Bench 包含三个主要任务的数据集：分子编辑 (MolEdit)、分子优化 (MolOpt) 和定制分子生成 (MolCustom)。每个任务进一步包含三个子任务，每个子任务包含 5,000 个测试样本。鉴于开放分子生成的固有复杂性，我们还开发了一个自动评估系统，帮助衡量生成分子的质量和准确性。我们对 25 个 LLM 进行了全面的基准测试，揭示了文本引导分子发现的当前局限性和潜在的改进领域。此外，借助 OpenMolIns（一个为解决 TOMG-Bench 提出的挑战而提出的专门指令调优数据集），Llama3.1-8B 的表现可以超越所有开源通用 LLM，甚至在 TOMG-Bench 上超越 GPT-3.5-turbo 46.5%。我们的代码和数据集可通过此 https URL 获得。

Title: Length Controlled Generation for Black-box LLMs

Authors: Yuxuan Gu, Wenjie Wang, Xiaocheng Feng, Weihong Zhong, Kun Zhu, Lei Huang, Tat-Seng Chua, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14656
Pdf URL: https://arxiv.org/pdf/2412.14656
Copy Paste: [[2412.14656]] Length Controlled Generation for Black-box LLMs(https://arxiv.org/abs/2412.14656)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated impressive instruction-following capabilities, while still struggling to accurately manage the length of the generated text, which is a fundamental requirement in many real-world applications. Existing length control methods involve fine-tuning the parameters of LLMs, which is inefficient and suboptimal for practical use. In this paper, we propose a novel iterative sampling framework for text length control, integrating the Metropolis-Hastings algorithm with an importance sampling acceleration strategy. This framework efficiently and reliably regulates LLMs to generate length-constrained text without modifying the underlying parameters, thereby preserving the original capabilities of LLMs. Experimental results demonstrate that our framework achieves almost 100% success rates of length control on Llama3.1 for tasks such as length-controlled abstractive summarization and length-constrained instruction following, with minimal additional computational overhead. This also highlights the significant potential of our method for precise length control across a broader range of applications, without compromising the versatility of LLMs.
摘要：大型语言模型 (LLM) 已展示出令人印象深刻的指令跟踪能力，但仍难以准确管理生成文本的长度，这是许多实际应用中的基本要求。现有的长度控制方法涉及微调 LLM 的参数，这对于实际使用来说效率低下且不是最优的。在本文中，我们提出了一种用于文本长度控制的新型迭代采样框架，将 Metropolis-Hastings 算法与重要性采样加速策略相结合。该框架高效可靠地调节 LLM 以生成长度受限的文本，而无需修改底层参数，从而保留了 LLM 的原始功能。实验结果表明，我们的框架在 Llama3.1 上实现了几乎 100% 的长度控制成功率，例如长度控制的抽象摘要和长度受限的指令跟踪，同时将额外的计算开销降至最低。这也凸显了我们的方法在更广泛的应用中实现精确长度控制的巨大潜力，同时不会损害 LLM 的多功能性。
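
To make the Metropolis-Hastings idea in the abstract above concrete, here is a minimal sketch of an independence sampler over whole candidate texts. It is not the paper's framework: it omits the importance sampling acceleration, the exponential length score is an assumed target, and `toy_propose` is a hypothetical stand-in for a black-box LLM call.

```python
import math
import random


def length_penalty(text, target_len):
    """Distance between the word count of `text` and the target length."""
    return abs(len(text.split()) - target_len)


def mh_length_control(propose, target_len, steps, rng, temperature=1.0):
    """Independence Metropolis-Hastings over whole candidate texts.

    Assumed target: model likelihood times exp(-|len - target| / temperature).
    Because proposals come from the model itself, the likelihood cancels and
    the acceptance ratio depends only on the two length penalties.
    """
    current = propose(rng)
    for _ in range(steps):
        candidate = propose(rng)
        log_ratio = (
            length_penalty(current, target_len)
            - length_penalty(candidate, target_len)
        ) / temperature
        # Always accept improvements; accept worse candidates with prob exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = candidate
    return current


def toy_propose(rng):
    """Hypothetical stand-in for a black-box LLM: random-length filler text."""
    return " ".join(["word"] * rng.randint(1, 40))


if __name__ == "__main__":
    result = mh_length_control(toy_propose, target_len=20, steps=200,
                               rng=random.Random(0))
    print(len(result.split()))
```

The model parameters are never touched: the only requirement is the ability to draw fresh samples, which is what makes this style of control applicable to black-box LLMs.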

Title: Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT

Authors: Hassane Kissane, Achim Schilling, Patrick Krauss
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14670
Pdf URL: https://arxiv.org/pdf/2412.14670
Copy Paste: [[2412.14670]] Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT(https://arxiv.org/abs/2412.14670)
Keywords: language model, llm, prompt
Abstract: This study investigates the internal representations of verb-particle combinations within transformer-based large language models (LLMs), specifically examining how these models capture lexical and syntactic nuances at different neural network layers. Employing the BERT architecture, we analyse the representational efficacy of its layers for various verb-particle constructions such as 'agree on', 'come back', and 'give up'. Our methodology includes a detailed dataset preparation from the British National Corpus, followed by extensive model training and output analysis through techniques like multi-dimensional scaling (MDS) and generalized discrimination value (GDV) calculations. Results show that BERT's middle layers most effectively capture syntactic structures, with significant variability in representational accuracy across different verb categories. These findings challenge the conventional uniformity assumed in neural network processing of linguistic elements and suggest a complex interplay between network architecture and linguistic representation. Our research contributes to a better understanding of how deep learning models comprehend and process language, offering insights into the potential and limitations of current neural approaches to linguistic analysis. This study not only advances our knowledge in computational linguistics but also prompts further research into optimizing neural architectures for enhanced linguistic precision.
摘要：本研究调查了基于 Transformer 的大型语言模型 (LLM) 中动词-助词组合的内部表示，特别是研究了这些模型如何在不同的神经网络层上捕捉词汇和句法细微差别。利用 BERT 架构，我们分析了其各层对各种动词-助词结构（例如“同意”、“回来”和“放弃”）的表征效力。我们的方法包括从英国国家语料库中详细准备数据集，然后通过多维缩放 (MDS) 和广义判别值 (GDV) 计算等技术进行广泛的模型训练和输出分析。结果表明，BERT 的中间层最有效地捕捉了句法结构，不同动词类别的表征准确性存在显著差异。这些发现挑战了神经网络处理语言元素的传统统一性假设，并表明网络架构和语言表征之间存在复杂的相互作用。我们的研究有助于更好地理解深度学习模型如何理解和处理语言，深入了解当前神经语言分析方法的潜力和局限性。这项研究不仅提高了我们在计算语言学方面的知识，还推动了进一步研究优化神经架构以提高语言精度。

Title: LLMs as mediators: Can they diagnose conflicts accurately?

Authors: Özgecan Koçak (Emory University), Phanish Puranam (INSEAD), Afşar Yegin (Kadir Has University)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14675
Pdf URL: https://arxiv.org/pdf/2412.14675
Copy Paste: [[2412.14675]] LLMs as mediators: Can they diagnose conflicts accurately?(https://arxiv.org/abs/2412.14675)
Keywords: language model, gpt, llm
Abstract: Prior research indicates that to be able to mediate conflict, observers of disagreements between parties must be able to reliably distinguish the sources of their disagreement as stemming from differences in beliefs about what is true (causality) vs. differences in what they value (morality). In this paper, we test if OpenAI's Large Language Models GPT 3.5 and GPT 4 can perform this task and whether one or the other type of disagreement proves particularly challenging for LLMs to diagnose. We replicate study 1 in Koçak et al. (2003), which employs a vignette design, with OpenAI's GPT 3.5 and GPT 4. We find that both LLMs have similar semantic understanding of the distinction between causal and moral codes as humans and can reliably distinguish between them. When asked to diagnose the source of disagreement in a conversation, both LLMs, compared to humans, exhibit a tendency to overestimate the extent of causal disagreement and underestimate the extent of moral disagreement in the moral misalignment condition. This tendency is especially pronounced for GPT 4 when using a proximate scale that relies on concrete language specific to an issue. GPT 3.5 does not perform as well as GPT 4 or humans when using either the proximate or the distal scale. The study provides a first test of the potential for using LLMs to mediate conflict by diagnosing the root of disagreements in causal and evaluative codes.
摘要：先前的研究表明，为了能够调解冲突，观察双方分歧的人必须能够可靠地区分分歧的根源，是源于对真实事物的信念差异（因果关系）还是他们所重视的事物的差异（道德）。在本文中，我们测试了 OpenAI 的大型语言模型 GPT 3.5 和 GPT 4 是否可以执行此任务，以及一种或另一种类型的分歧是否对 LLM 来说特别难以诊断。我们使用 OpenAI 的 GPT 3.5 和 GPT 4 复制了 Koçak 等人（2003 年）的研究 1，该研究采用了情景短文 (vignette) 设计。我们发现这两个 LLM 对因果和道德准则之间区别的语义理解与人类相似，并且可以可靠地区分它们。当被要求诊断对话中分歧的根源时，与人类相比，这两个 LLM 都表现出高估因果分歧程度并低估道德错位条件下道德分歧程度的倾向。当使用依赖于特定问题的具体语言的近距离量表时，这种趋势在 GPT 4 中尤为明显。无论是使用近距离量表还是远距离量表，GPT 3.5 的表现都不如 GPT 4 或人类。该研究通过诊断因果和评价代码中分歧的根源，首次测试了使用 LLM 调解冲突的潜力。

Title: How to Synthesize Text Data without Model Collapse?

Authors: Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14689
Pdf URL: https://arxiv.org/pdf/2412.14689
Copy Paste: [[2412.14689]] How to Synthesize Text Data without Model Collapse?(https://arxiv.org/abs/2412.14689)
Keywords: language model, gpt
Abstract: Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-$\{n\}$ models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how to synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover distributional shift phenomenon and over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.
摘要：合成数据中的模型崩溃表明对自生成数据的迭代训练会导致性能逐渐下降。随着人工智能模型的激增，合成数据将从根本上重塑网络数据生态系统。未来的 GPT-$\{n\}$ 模型将不可避免地在合成数据和人工生成的数据的混合上进行训练。在本文中，我们关注两个问题：合成数据对语言模型训练有何影响，以及如何在不导致模型崩溃的情况下合成数据？我们首先在不同比例的合成数据上对语言模型进行预训练，发现合成数据的比例与模型性能之间存在负相关性。我们进一步对合成数据进行统计分析，发现分布偏移现象和 n-gram 特征的过度集中。受上述发现的启发，我们提出对人工生成的数据进行 token 编辑以获得半合成数据。作为概念证明，我们从理论上证明了 token 级编辑可以防止模型崩溃，因为测试误差受到有限上限的限制。我们对从头开始的预训练、持续预训练和监督微调进行了广泛的实验。结果验证了我们的理论证明，即 token 级编辑可以提高数据质量并增强模型性能。

Title: On Verbalized Confidence Scores for LLMs

Authors: Daniel Yang, Yao-Hung Hubert Tsai, Makoto Yamada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14737
Pdf URL: https://arxiv.org/pdf/2412.14737
Copy Paste: [[2412.14737]] On Verbalized Confidence Scores for LLMs(https://arxiv.org/abs/2412.14737)
Keywords: language model, llm, prompt, agent
Abstract: The rise of large language models (LLMs) and their tight integration into our daily life make it essential to dedicate efforts towards their trustworthiness. Uncertainty quantification for LLMs can establish more human trust in their responses, and also allows LLM agents to make more informed decisions based on each other's uncertainty. To estimate the uncertainty in a response, internal token logits, task-specific proxy models, or sampling of multiple responses are commonly used. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens, which is a promising way for prompt- and model-agnostic uncertainty quantification with low overhead. Using an extensive benchmark, we assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods. Our results reveal that the reliability of these scores strongly depends on how the model is asked, but also that it is possible to extract well-calibrated confidence scores with certain prompt methods. We argue that verbalized confidence scores can become a simple but effective and versatile uncertainty quantification method in the future. Our code is available at this https URL.
摘要：大型语言模型 (LLM) 的兴起及其与日常生活的紧密结合使得我们必须努力提高其可信度。LLM 的不确定性量化不仅可以建立更人性化的响应信任，还可以让 LLM 代理根据彼此的不确定性做出更明智的决策。为了估计响应中的不确定性，通常使用内部标记逻辑、特定于任务的代理模型或对多个响应进行采样。这项工作的重点是要求 LLM 本身将其不确定性用置信度分数作为其输出标记的一部分来表达，这是一种有前途的低开销提示和模型无关的不确定性量化方法。使用广泛的基准，我们评估了不同数据集、模型和提示方法的口头置信度分数的可靠性。我们的结果表明，这些分数的可靠性在很大程度上取决于如何询问模型，但也有可能使用某些提示方法提取经过良好校准的置信度分数。我们认为，口头置信度分数在未来可以成为一种简单但有效且用途广泛的不确定性量化方法。我们的代码可以在这个 https URL 上找到。
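
Whether verbalized confidence scores are "well-calibrated" is commonly checked with the expected calibration error (ECE). The abstract does not say which metrics the benchmark uses, so the sketch below is only one standard way to quantify calibration, assuming confidences have already been parsed from the model output into the range [0, 1].

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], e.g. parsed from a verbalized score.
    correct: booleans, whether each corresponding answer was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # A model that always says "100% sure" but is right only half the time.
    print(expected_calibration_error([1.0, 1.0, 1.0, 1.0],
                                     [True, False, True, False]))
```

A score of 0 means stated confidence matches observed accuracy in every bin; an overconfident model, as in the usage example, is penalized by the gap between the two.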

Title: Query pipeline optimization for cancer patient question answering systems

Authors: Maolin He, Rena Gao, Mike Conway, Brian E. Chapman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14751
Pdf URL: https://arxiv.org/pdf/2412.14751
Copy Paste: [[2412.14751]] Query pipeline optimization for cancer patient question answering systems(https://arxiv.org/abs/2412.14751)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation, chain-of-thought
Abstract: Retrieval-augmented generation (RAG) mitigates hallucination in Large Language Models (LLMs) by using query pipelines to retrieve relevant external information and grounding responses in retrieved knowledge. However, query pipeline optimization for cancer patient question-answering (CPQA) systems requires separately optimizing multiple components with domain-specific considerations. We propose a novel three-aspect optimization approach for the RAG query pipeline in CPQA systems, utilizing public biomedical databases like PubMed and PubMed Central. Our optimization includes: (1) document retrieval, utilizing a comparative analysis of NCBI resources and introducing Hybrid Semantic Real-time Document Retrieval (HSRDR); (2) passage retrieval, identifying optimal pairings of dense retrievers and rerankers; and (3) semantic representation, introducing Semantic Enhanced Overlap Segmentation (SEOS) for improved contextual understanding. On a custom-developed dataset tailored for cancer-related inquiries, our optimized RAG approach improved the answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and about 3% over a naive RAG setup. This study highlights the importance of domain-specific query optimization in realizing the full potential of RAG and provides a robust framework for building more accurate and reliable CPQA systems, advancing the development of RAG-based biomedical systems.
摘要：检索增强生成 (RAG) 通过使用查询管道检索相关的外部信息并将响应建立在检索到的知识的基础上，从而减轻了大型语言模型 (LLM) 中的幻觉。然而，癌症患者问答 (CPQA) 系统的查询管道优化需要分别优化多个组件并考虑特定领域。我们利用 PubMed 和 PubMed Central 等公共生物医学数据库，为 CPQA 系统中的 RAG 查询管道提出了一种新颖的三方面优化方法。我们的优化包括：(1) 文档检索，利用 NCBI 资源的比较分析并引入混合语义实时文档检索 (HSRDR)；(2) 段落检索，确定密集检索器和重新排序器的最佳配对；(3) 语义表示，引入语义增强重叠分割 (SEOS) 以改进上下文理解。在为癌症相关查询量身定制的数据集上，我们优化的 RAG 方法将 Claude-3-haiku 的答案准确率相比思维链提示提高了 5.24%，相比简单的 RAG 设置提高了约 3%。这项研究强调了领域特定查询优化在充分发挥 RAG 潜力方面的重要性，并为构建更准确、更可靠的 CPQA 系统提供了一个强大的框架，从而推动了基于 RAG 的生物医学系统的发展。

Title: PsyDraw: A Multi-Agent Multimodal System for Mental Health Screening in Left-Behind Children

Authors: Yiqun Zhang, Xiaocui Yang, Xiaobai Li, Siyuan Yu, Yi Luan, Shi Feng, Daling Wang, Yifei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14769
Pdf URL: https://arxiv.org/pdf/2412.14769
Copy Paste: [[2412.14769]] PsyDraw: A Multi-Agent Multimodal System for Mental Health Screening in Left-Behind Children(https://arxiv.org/abs/2412.14769)
Keywords: language model, agent
Abstract: Left-behind children (LBCs), numbering over 66 million in China, face severe mental health challenges due to parental migration for work. Early screening and identification of at-risk LBCs is crucial, yet challenging due to the severe shortage of mental health professionals, especially in rural areas. While the House-Tree-Person (HTP) test shows higher child participation rates, its requirement for expert interpretation limits its application in resource-scarce regions. To address this challenge, we propose PsyDraw, a multi-agent system based on Multimodal Large Language Models that assists mental health professionals in analyzing HTP drawings. The system employs specialized agents for feature extraction and psychological interpretation, operating in two stages: comprehensive feature analysis and professional report generation. Evaluation of HTP drawings from 290 primary school students reveals that 71.03% of the analyses achieved High Consistency with professional evaluations, 26.21% Moderate Consistency and only 2.41% Low Consistency. The system identified 31.03% of cases requiring professional attention, demonstrating its effectiveness as a preliminary screening tool. Currently deployed in pilot schools, PsyDraw shows promise in supporting mental health professionals, particularly in resource-limited areas, while maintaining high professional standards in psychological assessment.
摘要：中国有超过 6600 万留守儿童，由于父母外出打工，他们面临着严重的心理健康挑战。早期筛查和识别高危留守儿童至关重要，但由于心理健康专业人员严重短缺，尤其是在农村地区，这项工作也面临挑战。虽然房-树-人 (HTP) 测试显示儿童参与率较高，但其对专家解释的要求限制了其在资源匮乏地区的应用。为了应对这一挑战，我们提出了 PsyDraw，这是一个基于多模态大型语言模型的多智能体系统，可帮助心理健康专业人员分析 HTP 图画。该系统采用专门的智能体进行特征提取和心理解释，分为两个阶段：全面特征分析和专业报告生成。对 290 名小学生的 HTP 图画的评估显示，71.03% 的分析与专业评估达到高度一致，26.21% 达到中等一致，只有 2.41% 达到低一致性。该系统识别出 31.03% 需要专业关注的病例，证明了其作为初步筛查工具的有效性。目前，该系统已在试点学校部署，该方法有望为心理健康专业人员提供支持，特别是在资源有限的地区，同时保持心理评估的高专业标准。

Title: ALKAFI-LLAMA3: Fine-Tuning LLMs for Precise Legal Understanding in Palestine

Authors: Rabee Qasem, Mohannad Hendi, Banan Tantour
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14771
Pdf URL: https://arxiv.org/pdf/2412.14771
Copy Paste: [[2412.14771]] ALKAFI-LLAMA3: Fine-Tuning LLMs for Precise Legal Understanding in Palestine(https://arxiv.org/abs/2412.14771)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable potential in diverse domains, yet their application in the legal sector, particularly in low-resource contexts, remains limited. This study addresses the challenges of adapting LLMs to the Palestinian legal domain, where political instability, fragmented legal frameworks, and limited AI resources hinder effective machine-learning applications. We present a fine-tuned model based on a quantized version of Llama-3.2-1B-Instruct, trained on a synthetic data set derived from Palestinian legal texts. Using smaller-scale models and strategically generated question-answer pairs, we achieve a cost-effective, locally sustainable solution that provides accurate and contextually relevant legal guidance. Our experiments demonstrate promising performance on various query types, ranging from yes/no questions and narrative explanations to complex legal differentiations, while highlighting areas for improvement, such as handling calculation-based inquiries and structured list formatting. This work provides a pathway for the deployment of AI-driven legal assistance tools tailored to the needs of resource-constrained environments.
摘要：大型语言模型 (LLM) 已在不同领域展现出巨大潜力，但它们在法律领域的应用，特别是在资源匮乏的环境中，仍然有限。这项研究解决了将 LLM 适应巴勒斯坦法律领域的挑战，在那里，政治不稳定、法律框架分散和人工智能资源有限阻碍了有效的机器学习应用。我们提出了一个基于量化版本的 Llama-3.2-1B-Instruct 的微调模型，该模型在来自巴勒斯坦法律文本的合成数据集上进行训练。使用规模较小的模型和策略性生成的问答对，我们实现了具有成本效益、本地可持续的解决方案，可提供准确且与上下文相关的法律指导。我们的实验在各种查询类型上都表现出色，从是/否问题和叙述性解释到复杂的法律区分，同时突出了需要改进的领域，例如处理基于计算的查询和结构化列表格式。这项工作为部署针对资源受限环境需求的人工智能驱动的法律援助工具提供了途径。

Title: Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning

Authors: Ziang Ye, Zhenru Zhang, Yang Zhang, Jianxin Ma, Junyang Lin, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14780
Pdf URL: https://arxiv.org/pdf/2412.14780
Copy Paste: [[2412.14780]] Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning(https://arxiv.org/abs/2412.14780)
Keywords: language model, llm, agent
Abstract: When using agent-task datasets to enhance agent capabilities for Large Language Models (LLMs), current methodologies often treat all tokens within a sample equally. However, we argue that tokens serving different roles - specifically, reasoning tokens versus boilerplate tokens (e.g., those governing output format) - differ significantly in importance and learning complexity, necessitating their disentanglement and distinct treatment. To address this, we propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token discrimination. SHAD classifies tokens by exploiting predictability differences observed after shuffling input-output combinations across samples: boilerplate tokens, due to their repetitive nature among samples, maintain predictability, whereas reasoning tokens do not. Using SHAD, we propose the Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes reasoning tokens during fine-tuning, yielding notable performance gains over common Supervised Fine-Tuning (SFT).
摘要：在使用代理任务数据集来增强大型语言模型 (LLM) 的代理功能时，当前方法通常会平等对待样本中的所有标记。但是，我们认为，服务于不同角色的标记（具体来说，推理标记与样板标记（例如，控制输出格式的标记））在重要性和学习复杂性方面存在显著差异，因此需要将它们分开并进行不同处理。为了解决这个问题，我们提出了一种用于自适应标记鉴别的新型混洗感知鉴别器 (SHAD)。SHAD 通过利用在样本之间混洗输入输出组合后观察到的可预测性差异来对标记进行分类：样板标记由于其在样本之间的重复性而保持可预测性，而推理标记则不能。使用 SHAD，我们提出了推理突出显示的微调 (RFT) 方法，该方法在微调过程中自适应地强调推理标记，与常见的监督微调 (SFT) 相比，性能显著提高。
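
A minimal sketch of what emphasizing reasoning tokens in a fine-tuning loss could look like. The abstract does not state the weighting rule, so the fixed `reasoning_weight` and the function name below are assumptions for illustration, and the per-token role labels are assumed to come from a discriminator such as SHAD.

```python
import math

def weighted_token_loss(token_probs, is_reasoning, reasoning_weight=2.0):
    """Weighted average negative log-likelihood over target tokens.

    token_probs: probability the model assigns to each target token.
    is_reasoning: True for reasoning tokens, False for boilerplate tokens.
    reasoning_weight: hypothetical emphasis factor (not taken from the paper).
    """
    weights = [reasoning_weight if r else 1.0 for r in is_reasoning]
    losses = [-math.log(p) for p in token_probs]
    # Normalize by the total weight so the loss stays on the same scale.
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

With `reasoning_weight=1.0` this reduces to the ordinary mean token loss used in plain SFT; larger values shift the average toward the reasoning tokens.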

Title: ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis

Authors: Zeao Tu, Xiangdi Meng, Yu He, Zihan Yao, Tianyu Qi, Jun Liu, Ming Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14809
Pdf URL: https://arxiv.org/pdf/2412.14809
Copy Paste: [[2412.14809]] ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis(https://arxiv.org/abs/2412.14809)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have shown remarkable effectiveness across various domains, with data augmentation methods utilizing GPT for synthetic data generation becoming prevalent. However, the quality and utility of augmented data remain questionable, and current methods lack clear metrics for evaluating data characteristics. To address these challenges, we propose ResoFilter, a novel method that integrates models, data, and tasks to refine datasets. ResoFilter leverages the fine-tuning process to obtain Data-Parameter features for data selection, offering improved interpretability by representing data characteristics through model weights. Our experiments demonstrate that ResoFilter achieves comparable results to full-scale fine-tuning using only half the data in mathematical tasks and exhibits strong generalization across different models and domains. This method provides valuable insights for constructing synthetic datasets and evaluating high-quality data, offering a promising solution for enhancing data augmentation techniques and improving training dataset quality for LLMs. For reproducibility, we will release our code and data upon acceptance.
摘要：大型语言模型 (LLM) 已在各个领域表现出显著的效果，利用 GPT 生成合成数据的数据增强方法正变得日益普遍。然而，增强数据的质量和实用性仍然存在疑问，当前的方法缺乏评估数据特征的明确指标。为了应对这些挑战，我们提出了 ResoFilter，这是一种集成模型、数据和任务来细化数据集的新方法。ResoFilter 利用微调过程来获取用于数据选择的数据参数特征，通过模型权重表示数据特征，从而提供更好的可解释性。我们的实验表明，ResoFilter 在数学任务中仅使用一半的数据就实现了与全尺寸微调相当的结果，并且在不同的模型和领域中表现出很强的泛化能力。该方法为构建合成数据集和评估高质量数据提供了宝贵的见解，为增强数据增强技术和提高 LLM 训练数据集质量提供了一种有希望的解决方案。为了可重复性，我们将在接受后发布我们的代码和数据。

Title: Progressive Multimodal Reasoning via Active Retrieval

Authors: Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Subjects: cs.CL, cs.AI, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2412.14835
Pdf URL: https://arxiv.org/pdf/2412.14835
Copy Paste: [[2412.14835]] Progressive Multimodal Reasoning via Active Retrieval(https://arxiv.org/abs/2412.14835)
Keywords: language model, llm
Abstract: Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs), and finding effective ways to enhance their performance in such scenarios remains an unresolved issue. In this paper, we propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo Tree Search (MCTS). Our approach begins with the development of a unified retrieval module that retrieves key supporting insights for solving complex reasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in automated multimodal reasoning verification, we employ the MCTS algorithm combined with an active retrieval mechanism, which enables the automatic generation of step-wise annotations. This strategy dynamically retrieves key insights for each reasoning step, moving beyond traditional beam search sampling to improve the diversity and reliability of the reasoning space. Additionally, we introduce a process reward model that aligns progressively to support the automatic verification of multimodal reasoning tasks. Experimental results across three complex multimodal reasoning benchmarks confirm the effectiveness of the AR-MCTS framework in enhancing the performance of various multimodal models. Further analysis demonstrates that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
摘要：多步骤多模态推理任务对多模态大型语言模型 (MLLM) 提出了重大挑战，寻找有效的方法来提高它们在这种场景下的性能仍然是一个未解决的问题。在本文中，我们提出了 AR-MCTS，这是一个通用框架，旨在通过主动检索 (AR) 和蒙特卡洛树搜索 (MCTS) 逐步提高 MLLM 的推理能力。我们的方法始于开发一个统一的检索模块，该模块从混合模态检索语料库中检索解决复杂推理问题的关键支持见解。为了弥补自动多模态推理验证方面的差距，我们采用了 MCTS 算法与主动检索机制相结合，从而实现了分步注释的自动生成。该策略动态检索每个推理步骤的关键见解，超越了传统的束搜索采样，提高了推理空间的多样性和可靠性。此外，我们引入了一个逐步对齐的过程奖励模型，以支持多模态推理任务的自动验证。在三个复杂的多模态推理基准上的实验结果证实了 AR-MCTS 框架在提升各种多模态模型性能方面的有效性。进一步的分析表明，AR-MCTS 可以优化采样多样性和准确性，从而实现可靠的多模态推理。

Title: DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs

Authors: Xiabin Zhou, Wenbin Wang, Minyan Zeng, Jiaxian Guo, Xuebo Liu, Li Shen, Min Zhang, Liang Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14838
Pdf URL: https://arxiv.org/pdf/2412.14838
Copy Paste: [[2412.14838]] DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs(https://arxiv.org/abs/2412.14838)
Keywords: llm, long context
Abstract: Efficient KV cache management in LLMs is crucial for long-context tasks like RAG and summarization. Existing KV cache compression methods enforce a fixed pattern, neglecting task-specific characteristics and reducing the retention of essential information. However, we observe distinct activation patterns across layers in various tasks, highlighting the need for adaptive strategies tailored to each task's unique demands. Based on this insight, we propose DynamicKV, a method that dynamically optimizes token retention by adjusting the number of tokens retained at each layer to adapt to the specific task. DynamicKV establishes global and per-layer maximum KV cache budgets, temporarily retaining the maximum budget for the current layer, and periodically updating the KV cache sizes of all preceding layers during inference. Our method retains only 1.7% of the KV cache size while achieving ~85% of the Full KV cache performance on LongBench. Notably, even under extreme compression (0.9%), DynamicKV surpasses state-of-the-art (SOTA) methods by 11% in the Needle-in-a-Haystack test using Mistral-7B-Instruct-v0.2. The code will be released.
摘要：LLM 中高效的 KV 缓存管理对于 RAG 和摘要等长上下文任务至关重要。现有的 KV 缓存压缩方法强制执行固定模式，忽略了任务特定的特性并减少了重要信息的保留。然而，我们在各种任务中观察到跨层的不同激活模式，凸显了需要根据每个任务的独特需求量身定制自适应策略。基于这一见解，我们提出了 DynamicKV，这种方法通过调整每层保留的 token 数量来动态优化 token 保留以适应特定任务。DynamicKV 建立全局和每层最大 KV 缓存预算，暂时保留当前层的最大预算，并在推理期间定期更新所有前一层的 KV 缓存大小。我们的方法仅保留 1.7% 的 KV 缓存大小，同时在 LongBench 上实现约 85% 的完整 KV 缓存性能。值得注意的是，即使在极端压缩（0.9%）的情况下，DynamicKV 在使用 Mistral-7B-Instruct-v0.2 的 Needle-in-a-Haystack 测试中也比最先进的 (SOTA) 方法高出 11%。代码即将发布。
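
A rough sketch of budgeted token retention across layers, under stated assumptions. The abstract mentions a global budget and a per-layer maximum but not the exact update rule, so the procedure below (cap each layer, then keep the globally highest-scoring tokens) and the name `allocate_kv_budget` are illustrative assumptions, not the paper's algorithm.

```python
def allocate_kv_budget(scores, global_budget, per_layer_max):
    """Pick which cached tokens to retain in each layer.

    scores: one list per layer of per-token importance scores
            (for example, aggregated attention weights; an assumption here).
    Returns one sorted list of retained token indices per layer.
    """
    candidates = []
    for layer, layer_scores in enumerate(scores):
        # Per-layer cap: only the highest-scoring tokens of a layer compete.
        order = sorted(range(len(layer_scores)),
                       key=lambda i: layer_scores[i], reverse=True)
        for i in order[:per_layer_max]:
            candidates.append((layer_scores[i], layer, i))
    # Global budget: keep the best candidates across all layers.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = [[] for _ in scores]
    for _, layer, i in candidates[:global_budget]:
        kept[layer].append(i)
    return [sorted(k) for k in kept]
```

The point of the sketch is that layers end up with different retention counts depending on where the important tokens sit, rather than a fixed pattern per layer.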

Title: Mapping and Influencing the Political Ideology of Large Language Models using Synthetic Personas

Authors: Pietro Bernardelle, Leon Fröhling, Stefano Civelli, Riccardo Lunardi, Kevin Roiter, Gianluca Demartini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14843
Pdf URL: https://arxiv.org/pdf/2412.14843
Copy Paste: [[2412.14843]] Mapping and Influencing the Political Ideology of Large Language Models using Synthetic Personas(https://arxiv.org/abs/2412.14843)
Keywords: language model, llm, prompt
Abstract: The analysis of political biases in large language models (LLMs) has primarily examined these systems as single entities with fixed viewpoints. While various methods exist for measuring such biases, the impact of persona-based prompting on LLMs' political orientation remains unexplored. In this work we leverage PersonaHub, a collection of synthetic persona descriptions, to map the political distribution of persona-based prompted LLMs using the Political Compass Test (PCT). We then examine whether these initial compass distributions can be manipulated through explicit ideological prompting towards diametrically opposed political orientations: right-authoritarian and left-libertarian. Our experiments reveal that synthetic personas predominantly cluster in the left-libertarian quadrant, with models demonstrating varying degrees of responsiveness when prompted with explicit ideological descriptors. While all models demonstrate significant shifts towards right-authoritarian positions, they exhibit more limited shifts towards left-libertarian positions, suggesting an asymmetric response to ideological manipulation that may reflect inherent biases in model training.
摘要：对大型语言模型 (LLM) 中的政治偏见的分析主要将这些系统视为具有固定观点的单一实体。虽然存在各种方法来衡量这种偏见，但基于角色的提示对 LLM 政治倾向的影响仍未得到探索。在这项工作中，我们利用 PersonaHub（一个合成角色描述的集合）来使用政治指南针测试 (PCT) 绘制基于角色提示的 LLM 的政治分布。然后，我们检查这些初始指南针分布是否可以通过明确的意识形态提示来操纵，使其朝着截然相反的政治取向发展：右翼权威主义和左翼自由主义。我们的实验表明，合成角色主要聚集在左翼自由主义象限，当使用明确的意识形态描述符提示时，模型表现出不同程度的响应。虽然所有模型都表现出向右翼权威主义立场的显著转变，但它们向左翼自由主义立场的转变更为有限，这表明对意识形态操纵的不对称反应可能反映了模型训练中固有的偏见。

Title: DS$^2$-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis

Authors: Hongling Xu, Yice Zhang, Qianlong Wang, Ruifeng Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14849
Pdf URL: https://arxiv.org/pdf/2412.14849
Copy Paste: [[2412.14849]] DS$^2$-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2412.14849)
Keywords: language model, llm, prompt
Abstract: Recently developed large language models (LLMs) have presented promising new avenues to address data scarcity in low-resource scenarios. In few-shot aspect-based sentiment analysis (ABSA), previous efforts have explored data augmentation techniques, which prompt LLMs to generate new samples by modifying existing ones. However, these methods fail to produce adequately diverse data, impairing their effectiveness. Besides, some studies apply in-context learning for ABSA by using specific instructions and a few selected examples as prompts. Though promising, LLMs often yield labels that deviate from task requirements. To overcome these limitations, we propose DS$^2$-ABSA, a dual-stream data synthesis framework targeted for few-shot ABSA. It leverages LLMs to synthesize data from two complementary perspectives, key-point-driven and instance-driven, which effectively generate diverse and high-quality ABSA samples in low-resource settings. Furthermore, a label refinement module is integrated to improve the synthetic labels. Extensive experiments demonstrate that DS$^2$-ABSA significantly outperforms previous few-shot ABSA solutions and other LLM-oriented data generation methods.
摘要：最近开发的大型语言模型 (LLM) 为解决资源匮乏情况下的数据稀缺问题提供了有希望的新途径。在基于小样本方面的情绪分析 (ABSA) 中，先前的努力已经探索了数据增强技术，该技术促使 LLM 通过修改现有样本来生成新样本。然而，这些方法无法产生足够多样化的数据，从而损害了它们的有效性。此外，一些研究通过使用特定指令和一些选定的示例作为提示，将上下文学习应用于 ABSA。虽然很有希望，但 LLM 通常会产生偏离任务要求的标签。为了克服这些限制，我们提出了 DS$^2$-ABSA，这是一个针对小样本 ABSA 的双流数据合成框架。它利用 LLM 从两个互补的角度合成数据：\textit{关键点驱动} 和 \textit{实例驱动}，可在资源匮乏的环境中有效地生成多样化和高质量的 ABSA 样本。此外，还集成了 \textit{标签细化} 模块以改进合成标签。大量实验表明，DS$^2$-ABSA 明显优于之前的少样本 ABSA 解决方案和其他面向 LLM 的数据生成方法。

Title: Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling

Authors: Junyi Li, Hwee Tou Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14860
Pdf URL: https://arxiv.org/pdf/2412.14860
Copy Paste: [[2412.14860]] Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling(https://arxiv.org/abs/2412.14860)
Keywords: language model, llm, hallucination, prompt
Abstract: Despite their outstanding capabilities, large language models (LLMs) are prone to hallucination and producing factually incorrect information. This challenge has spurred efforts in attributed text generation, which prompts LLMs to generate content with supporting evidence. In this paper, we propose a novel framework, called Think&Cite, and formulate attributed text generation as a multi-step reasoning problem integrated with search. Specifically, we propose Self-Guided Monte Carlo Tree Search (SG-MCTS), which capitalizes on the self-reflection capability of LLMs to reflect on the intermediate states of MCTS for guiding the tree expansion process. To provide reliable and comprehensive feedback, we introduce Progress Reward Models to measure the progress of tree search from the root to the current state from two aspects, i.e., generation and attribution progress. We conduct extensive experiments on three datasets and the results show that our approach significantly outperforms baseline approaches.
摘要：尽管大型语言模型 (LLM) 具有出色的能力，但它们容易产生幻觉并产生事实上不正确的信息。这一挑战激发了人们对归因文本生成的努力，这促使 LLM 生成具有支持证据的内容。在本文中，我们提出了一个称为 Think&Cite 的新框架，并将归因文本生成表述为与搜索集成的多步骤推理问题。具体而言，我们提出了自引导蒙特卡洛树搜索 (SG-MCTS)，它利用 LLM 的自我反思能力来反思 MCTS 的中间状态以指导树扩展过程。为了提供可靠和全面的反馈，我们引入了进度奖励模型来从生成和归因进度两个方面衡量从根到当前状态的树搜索进度。我们对三个数据集进行了广泛的实验，结果表明我们的方法明显优于基线方法。

Title: Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering

Authors: Imed Keraghel, Mohamed Nadif
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14867
Pdf URL: https://arxiv.org/pdf/2412.14867
Copy Paste: [[2412.14867]] Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering(https://arxiv.org/abs/2412.14867)
Keywords: language model, gpt, llm
Abstract: Recent advances in machine learning, particularly Large Language Models (LLMs) such as BERT and GPT, provide rich contextual embeddings that improve text representation. However, current document clustering approaches often ignore the deeper relationships between named entities (NEs) and the potential of LLM embeddings. This paper proposes a novel approach that integrates Named Entity Recognition (NER) and LLM embeddings within a graph-based framework for document clustering. The method builds a graph with nodes representing documents and edges weighted by named entity similarity, optimized using a graph-convolutional network (GCN). This ensures a more effective grouping of semantically related documents. Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering, notably for documents rich in named entities.
摘要：机器学习的最新进展，尤其是 BERT 和 GPT 等大型语言模型 (LLM)，提供了丰富的上下文嵌入，从而改善了文本表示。然而，当前的文档聚类方法往往忽略了命名实体 (NE) 之间的更深层次关系以及 LLM 嵌入的潜力。本文提出了一种新颖的方法，将命名实体识别 (NER) 和 LLM 嵌入集成到基于图的文档聚类框架中。该方法构建一个图，其中节点表示文档，边按命名实体相似性加权，并使用图卷积网络 (GCN) 进行优化。这确保了对语义相关文档进行更有效的分组。实验结果表明，我们的方法在聚类方面优于传统的基于共现的方法，尤其是对于富含命名实体的文档。
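
A small sketch of the graph construction step described above: documents as nodes, edges weighted by named-entity similarity. The abstract does not say which similarity measure is used, so Jaccard overlap of entity sets is an assumption here, as are the function name and the `min_weight` cutoff; the GCN stage is not shown.

```python
def entity_graph(doc_entities, min_weight=0.0):
    """Build a weighted document graph from named-entity overlap.

    doc_entities: list of sets, the named entities found in each document.
    Returns {(i, j): weight} for document pairs i < j with weight > min_weight.
    """
    edges = {}
    for i in range(len(doc_entities)):
        for j in range(i + 1, len(doc_entities)):
            union = doc_entities[i] | doc_entities[j]
            if not union:
                continue  # neither document has entities; no edge
            # Jaccard similarity: shared entities over all entities.
            weight = len(doc_entities[i] & doc_entities[j]) / len(union)
            if weight > min_weight:
                edges[(i, j)] = weight
    return edges
```

In a full pipeline the resulting weights would populate the adjacency matrix fed to the GCN, with LLM embeddings as node features.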

Title: Why language models collapse when trained on recursively generated text

Authors: Lecheng Wang, Xianjie Shi, Ge Li, Jia Li, Yihong Dong, Xuanming Zhang, Wenpin Jiao, Hong Mei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14872
Pdf URL: https://arxiv.org/pdf/2412.14872
Copy Paste: [[2412.14872]] Why language models collapse when trained on recursively generated text(https://arxiv.org/abs/2412.14872)
Keywords: language model
Abstract: Language models (LMs) have been widely used to generate text on the Internet. The generated text is often collected into the training corpus of the next generations of LMs. Previous work has experimentally found that LMs collapse when trained on recursively generated text. This paper contributes to existing knowledge from two aspects. We present a theoretical proof of LM collapse. Our proof reveals the cause of LM collapse and proves that all auto-regressive LMs will definitely collapse. We present a new finding: the performance of LMs gradually declines when trained on recursively generated text until they perform no better than a randomly initialized LM. The trained LMs produce large amounts of repetitive text and perform poorly across a wide range of natural language tasks. The above proof and new findings deepen our understanding of LM collapse and offer valuable insights that may inspire new training techniques to mitigate this threat.
摘要：语言模型 (LM) 已广泛用于互联网上的文本生成。生成的文本通常会被收集到下一代 LM 的训练语料库中。先前的工作通过实验发现，当对递归生成的文本进行训练时，LM 会崩溃。本文从两个方面对现有知识做出了贡献。我们给出了 LM 崩溃的理论证明。我们的证明揭示了 LM 崩溃的原因，并证明所有自回归 LM 都必定会崩溃。我们提出了一个新发现：当对递归生成的文本进行训练时，LM 的性能会逐渐下降，直到它们的表现不比随机初始化的 LM 好。训练后的 LM 会产生大量重复的文本，并且在各种自然语言任务中表现不佳。上述证明和新发现加深了我们对 LM 崩溃的理解，并提供了有价值的见解，可能启发新的训练技术来减轻这种威胁。

Title: Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation

Authors: Zexiong Ma, Shengnan An, Zeqi Lin, Yanzhen Zou, Jian-Guang Lou, Bing Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14905
Pdf URL: https://arxiv.org/pdf/2412.14905
Copy Paste: [[2412.14905]] Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation(https://arxiv.org/abs/2412.14905)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) are susceptible to generating hallucinated information, despite the integration of retrieval-augmented generation (RAG). Parallel context extension (PCE) is a line of research that attempts to effectively integrate parallel (unordered) contexts, but it still suffers from hallucinations when adapted to RAG scenarios. In this paper, we propose DePaC (Dehallucinating Parallel Context Extension), which alleviates the hallucination problem with context-aware negative training and information-calibrated aggregation. DePaC is designed to alleviate two types of in-context hallucination: fact fabrication (i.e., LLMs present claims that are not supported by the contexts) and fact omission (i.e., LLMs fail to present claims that can be supported by the contexts). Specifically, (1) for fact fabrication, we apply context-aware negative training, which fine-tunes the LLMs with negative supervision, thus explicitly guiding the LLMs to refuse to answer when contexts are not related to questions; (2) for fact omission, we propose information-calibrated aggregation, which prioritizes context windows with higher information increment from their contexts. The experimental results on nine RAG tasks demonstrate that DePaC significantly alleviates the two types of hallucination and consistently achieves better performance on these tasks.
摘要：尽管集成了检索增强生成 (RAG)，大型语言模型 (LLM) 仍然容易产生幻觉信息。并行上下文扩展 (PCE) 是一种尝试有效集成并行（无序）上下文的研究方法，但在适应 RAG 场景时仍然会产生幻觉。在本文中，我们提出了 DePaC（去幻觉并行上下文扩展），它通过上下文感知负向训练和信息校准聚合来缓解幻觉问题。DePaC 旨在缓解两种类型的上下文幻觉：事实捏造（即 LLM 提出上下文不支持的主张）和事实遗漏（即 LLM 未能提出上下文可以支持的主张）。具体来说，（1）对于事实捏造，我们应用上下文感知负向训练，通过负向监督对 LLM 进行微调，从而明确引导 LLM 在上下文与问题无关时拒绝回答；（2）对于事实遗漏，我们提出了信息校准聚合，优先考虑从上下文中获取信息增量更高的上下文窗口。在九个 RAG 任务上的实验结果表明，DePaC 显著缓解了两种类型的幻觉，并在这些任务上始终取得更好的表现。

Title: RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

Authors: Junyu Luo, Xiao Luo, Kaize Ding, Jingyang Yuan, Zhiping Xiao, Ming Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14922
Pdf URL: https://arxiv.org/pdf/2412.14922
Copy Paste: [[2412.14922]] RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response(https://arxiv.org/abs/2412.14922)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) plays a crucial role in adapting large language models (LLMs) to specific domains or tasks. However, as demonstrated by empirical experiments, the collected data inevitably contains noise in practical applications, which poses significant challenges to model performance on downstream tasks. Therefore, there is an urgent need for a noise-robust SFT framework to enhance model capabilities in downstream tasks. To address this challenge, we introduce a robust SFT framework (RobustFT) that performs noise detection and relabeling on downstream task data. For noise identification, our approach employs a multi-expert collaborative system with inference-enhanced models to achieve superior noise detection. In the denoising phase, we utilize a context-enhanced strategy, which incorporates the most relevant and confident knowledge followed by careful assessment to generate reliable annotations. Additionally, we introduce an effective data selection mechanism based on response entropy, ensuring only high-quality samples are retained for fine-tuning. Extensive experiments conducted on multiple LLMs across five datasets demonstrate RobustFT's exceptional performance in noisy scenarios.
摘要：监督微调 (SFT) 在将大型语言模型 (LLM) 适配到特定领域或任务方面起着至关重要的作用。然而，正如经验实验所证明的那样，在实际应用中，收集的数据不可避免地包含噪声，这对模型在下游任务上的性能提出了重大挑战。因此，迫切需要一个抗噪声的 SFT 框架来增强模型在下游任务中的能力。为了应对这一挑战，我们引入了一个强大的 SFT 框架 (RobustFT)，它可以对下游任务数据执行噪声检测和重新标记。对于噪声识别，我们的方法采用多专家协作系统和推理增强模型来实现卓越的噪声检测。在去噪阶段，我们采用上下文增强策略，该策略结合最相关和最可信的知识，然后进行仔细评估以生成可靠的注释。此外，我们引入了一种基于响应熵的有效数据选择机制，确保只保留高质量样本进行微调。在五个数据集上对多个 LLM 进行的大量实验证明了 RobustFT 在嘈杂场景中的卓越性能。
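
A minimal sketch of entropy-based sample selection, assuming access to per-token output distributions for each response. The abstract only states that selection is based on response entropy; averaging token entropies, keeping samples at or below a cutoff, and the `threshold` parameter are assumptions made for illustration.

```python
import math

def token_entropy(dist):
    """Shannon entropy (in nats) of one token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def select_low_entropy(samples, threshold):
    """Keep samples whose mean per-token entropy is at most threshold.

    samples: list of (sample_id, list of per-token distributions).
    threshold: hypothetical cutoff; lower means stricter filtering.
    """
    kept = []
    for sample_id, dists in samples:
        mean_entropy = sum(token_entropy(d) for d in dists) / len(dists)
        if mean_entropy <= threshold:
            kept.append(sample_id)
    return kept
```

The intuition is that a response the model produces with low entropy is one it is confident about, so it is less likely to carry label noise into fine-tuning.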

Title: Understanding the Dark Side of LLMs' Intrinsic Self-Correction

Authors: Qingjie Zhang, Han Qiu, Di Wang, Haoting Qian, Yiming Li, Tianwei Zhang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14959
Pdf URL: https://arxiv.org/pdf/2412.14959
Copy Paste: [[2412.14959]] Understanding the Dark Side of LLMs' Intrinsic Self-Correction(https://arxiv.org/abs/2412.14959)
Keywords: gpt, llm, prompt, chat
Abstract: Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts solely based on their inherent capability. However, recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts. In this paper, we aim to interpret LLMs' intrinsic self-correction for different tasks, especially for those failure cases. Using one simple task and three complex tasks with state-of-the-art (SOTA) LLMs such as the ChatGPT family (o1, 4o, 3.5-turbo) and the Llama family (2-7B, 3-8B, and 3.1-8B), we design three interpretation methods to reveal the dark side of LLMs' intrinsic self-correction. We find that intrinsic self-correction can (1) cause LLMs to waver on both intermediate and final answers and lead to prompt bias on simple factual questions, and (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two simple yet effective strategies for alleviation: question repeating and supervised fine-tuning with a few samples. We open-source our work at this https URL.
摘要：内在自我修正被提出来仅基于 LLM 固有能力通过反馈提示来改善其响应。然而，最近的研究表明，如果没有 oracle 标签作为反馈提示，LLM 的内在自我修正就会失败。在本文中，我们旨在解释 LLM 在不同任务中的内在自我修正，特别是对于那些失败的情况。通过包括一项简单任务和三项复杂任务以及最先进的 (SOTA) LLM，如 ChatGPT 系列 (o1、4o、3.5-turbo) 和 Llama 系列 (2-7B、3-8B 和 3.1-8B)，我们设计了三种解释方法来揭示 LLM 内在自我修正的阴暗面。我们发现内在自我修正会 (1) 导致 LLM 在中间和最终答案上动摇，并导致对简单事实问题的提示偏见；(2) 在复杂任务中引入类似人类的认知偏见。根据我们的研究结果，我们还提供了两种简单但有效的缓解策略：重复提问和使用少量样本进行监督微调。我们在此 https URL 上开源了我们的工作。

Title: Knowledge Injection via Prompt Distillation

Authors: Kalle Kujanpää, Harri Valpola, Alexander Ilin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14964
Pdf URL: https://arxiv.org/pdf/2412.14964
Copy Paste: [[2412.14964]] Knowledge Injection via Prompt Distillation(https://arxiv.org/abs/2412.14964)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: In many practical applications, large language models (LLMs) need to incorporate new knowledge not present in their pre-training data. The primary methods for this are fine-tuning and retrieval-augmented generation (RAG). Although RAG has emerged as the industry standard for knowledge injection, fine-tuning has not yet achieved comparable success. In this paper, we propose a new fine-tuning technique for learning new knowledge and show that it can reach the performance of RAG. The proposed method is based on the self-distillation approach, which we call prompt distillation. First, we generate question-answer pairs about the new knowledge. Then, we fine-tune a student model on the question-answer pairs to imitate the output distributions of a teacher model, which additionally receives the new knowledge in its prompt. The student model is identical to the teacher, except it is equipped with a LoRA adapter. This training procedure facilitates distilling the new knowledge from the teacher's prompt into the student's weights.
摘要：在许多实际应用中，大型语言模型 (LLM) 需要整合其预训练数据中不存在的新知识。实现此目的的主要方法是微调和检索增强生成 (RAG)。尽管 RAG 已成为知识注入的行业标准，但微调尚未取得同等的成功。在本文中，我们提出了一种用于学习新知识的新型微调技术，并表明它可以达到 RAG 的性能。所提出的方法基于自我提炼方法，我们称之为提示提炼。首先，我们生成有关新知识的问答对。然后，我们在问答对上微调学生模型以模仿教师模型的输出分布，该模型还在其提示中接收新知识。学生模型与教师相同，只是配备了 LoRA 适配器。此训练过程有助于将教师提示中的新知识提炼到学生的权重中。
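
A sketch of the distillation objective implied above: the student, which sees only the question, is trained to match the output distributions of the teacher, which also sees the new knowledge in its prompt. The abstract does not name the divergence, so KL(teacher || student) averaged over token positions is an assumption, and the logits here are plain Python lists standing in for model outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over token positions.

    teacher_logits: rows of logits from the teacher (knowledge in prompt).
    student_logits: rows of logits from the student (question only).
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p, q = softmax(t_row), softmax(s_row)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)
```

In the setup described by the abstract, only the LoRA adapter of the student would receive gradients from this loss; the shared base weights stay fixed.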

Title: Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts

Authors: Ioana Buhnila, Georgeta Cislaru, Amalia Todirascu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14986
Pdf URL: https://arxiv.org/pdf/2412.14986
Copy Paste: [[2412.14986]] Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts(https://arxiv.org/abs/2412.14986)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been used to generate texts in response to different writing tasks: reports, essays, storytelling. However, language models do not have a meta-representation of the text writing process, nor inherent communication learning needs, comparable to those of young human students. This paper introduces a fine-grained linguistic and textual analysis of multilingual Small Language Models' (SLMs) writing. With our method, Chain-of-MetaWriting, SLMs can imitate some steps of the human writing process, such as planning and evaluation. We mainly focused on short story and essay writing tasks in French for schoolchildren and undergraduate students respectively. Our results show that SLMs encounter difficulties in assisting young students on sensitive topics such as violence in the schoolyard, and they sometimes use words too complex for the target audience. In particular, the output is quite different from human-produced texts in terms of text cohesion and coherence with respect to temporal connectors, topic progression, and reference.
摘要：大型语言模型 (LLM) 已用于生成文本以应对不同的写作任务：报告、论文、讲故事。然而，语言模型没有文本写作过程的元表示，也没有与年轻人类学生相媲美的固有交流学习需求。本文介绍了一种对多语言小型语言模型 (SLM) 写作的细粒度语言和文本分析。通过我们的方法 Chain-of-MetaWriting，SLM 可以模仿人类写作过程的某些步骤，例如规划和评估。我们主要关注针对小学生和本科生的法语短篇小说和论文写作任务。我们的结果表明，SLM 在帮助年轻学生处理校园暴力等敏感话题时遇到了困难，他们有时使用的词语对于目标受众来说过于复杂。特别是，在时间连接符、主题进展、引用方面的文本凝聚力和连贯性方面，输出与人类生成的文本有很大不同。

Title: LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps

Authors: Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.15035
Pdf URL: https://arxiv.org/pdf/2412.15035
Copy Paste: [[2412.15035]] LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps(https://arxiv.org/abs/2412.15035)
Keywords: language model, llm, prompt
Abstract: Building safe Large Language Models (LLMs) across multiple languages is essential in ensuring both safe access and linguistic diversity. To this end, we introduce M-ALERT, a multilingual benchmark that evaluates the safety of LLMs in five languages: English, French, German, Italian, and Spanish. M-ALERT includes 15k high-quality prompts per language, totaling 75k, following the detailed ALERT taxonomy. Our extensive experiments on 10 state-of-the-art LLMs highlight the importance of language-specific safety analysis, revealing that models often exhibit significant inconsistencies in safety across languages and categories. For instance, Llama3.2 shows high unsafety in the category crime_tax for Italian but remains safe in other languages. Similar differences can be observed across all models. In contrast, certain categories, such as substance_cannabis and crime_propaganda, consistently trigger unsafe responses across models and languages. These findings underscore the need for robust multilingual safety practices in LLMs to ensure safe and responsible usage across diverse user communities.
摘要：构建安全的跨多种语言大型语言模型 (LLM) 对于确保安全访问和语言多样性至关重要。为此，我们引入了 M-ALERT，这是一个多语言基准，用于评估五种语言的 LLM 安全性：英语、法语、德语、意大利语和西班牙语。M-ALERT 包含每种语言 15k 个高质量提示，总计 75k，遵循详细的 ALERT 分类法。我们对 10 种最先进的 LLM 进行的大量实验凸显了语言特定安全性分析的重要性，揭示了模型在跨语言和类别的安全性方面通常表现出显著的不一致。例如，Llama3.2 在意大利语的 crime_tax 类别中表现出高度不安全性，但在其他语言中仍然安全。在所有模型中都可以观察到类似的差异。相比之下，某些类别（例如 substance_cannabis 和 crime_propaganda）在跨模型和语言时始终会触发不安全的反应。这些发现强调了 LLM 中需要强大的多语言安全实践，以确保在不同用户社区中安全和负责任地使用。

Title: ConfliBERT: A Language Model for Political Conflict

Authors: Patrick T. Brandt, Sultan Alsarra, Vito J. D'Orazio, Dagmar Heintze, Latifur Khan, Shreyas Meher, Javier Osorio, Marcus Sianan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.15060
Pdf URL: https://arxiv.org/pdf/2412.15060
Copy Paste: [[2412.15060]] ConfliBERT: A Language Model for Political Conflict(https://arxiv.org/abs/2412.15060)
Keywords: language model, llm
Abstract: Conflict scholars have used rule-based approaches to extract information about political violence from news reports and texts. Recent Natural Language Processing developments move beyond rigid rule-based approaches. We review our recent ConfliBERT language model (Hu et al. 2022) for processing texts related to politics and violence. The model can be used to extract actor and action classifications from texts about political conflict. When fine-tuned, ConfliBERT outperforms other large language models (LLMs) such as Google's Gemma 2 (9B), Meta's Llama 3.1 (7B), and Alibaba's Qwen 2.5 (14B) in accuracy, precision, and recall within its relevant domains. It is also hundreds of times faster than these more generalist LLMs. These results are illustrated using texts from the BBC, re3d, and the Global Terrorism Dataset (GTD).
摘要：冲突学者使用基于规则的方法从新闻报道和文本中提取有关政治暴力的信息。最近的自然语言处理发展超越了严格的基于规则的方法。我们回顾了我们最近的 ConfliBERT 语言模型（Hu et al. 2022）来处理与政治和暴力相关的文本。该模型可用于从有关政治冲突的文本中提取参与者和动作分类。经过微调后，结果表明，ConfliBERT 在准确度、精确度和召回率方面优于其他大型语言模型（LLM），例如 Google 的 Gemma 2（9B）、Meta 的 Llama 3.1（7B）和阿里巴巴的 Qwen 2.5（14B）在其相关领域内。它也比这些更通用的 LLM 快数百倍。这些结果使用来自 BBC、re3d 和全球恐怖主义数据集 (GTD) 的文本进行了说明。

Title: AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

Authors: Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15084
Pdf URL: https://arxiv.org/pdf/2412.15084
Copy Paste: [[2412.15084]] AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling(https://arxiv.org/abs/2412.15084)
Keywords: gpt, prompt
Abstract: In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to build our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: this https URL
摘要：在本文中，我们介绍了 AceMath，这是一套擅长解决复杂数学问题的前沿数学模型，以及能够评估生成的解决方案并可靠地识别正确解决方案的高效奖励模型。为了开发指令调整的数学模型，我们提出了一个监督微调 (SFT) 过程，该过程首先在一般领域实现具有竞争力的性能，然后使用精心策划的一组提示和合成生成的响应对数学领域进行有针对性的微调。由此产生的模型 AceMath-72B-Instruct 大大优于 Qwen2.5-Math-72B-Instruct、GPT-4o 和 Claude-3.5 Sonnet。为了开发数学专用奖励模型，我们首先构建 AceMath-RewardBench，这是一个全面而强大的基准，用于评估不同问题和难度级别的数学奖励模型。之后，我们提出了一种系统的方法来构建我们的数学奖励模型。由此产生的模型 AceMath-72B-RM 始终优于最先进的奖励模型。此外，当将 AceMath-72B-Instruct 与 AceMath-72B-RM 结合使用时，我们在数学推理基准测试中获得了最高的平均 rm@8 分数。我们将在此 https URL 发布模型权重、训练数据和评估基准测试
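
A short sketch of how an rm@8 style score can be computed, under the common reading of the metric: for each problem, sample k candidate solutions, let the reward model pick the highest-scoring one, and count the problem as solved if that pick is correct. The abstract does not define the metric, so this reading, the function name, and the input layout are assumptions.

```python
def rm_at_k(problems, k=8):
    """Fraction of problems where the top-reward candidate is correct.

    problems: one list per problem of (reward_score, is_correct) pairs,
              one pair per sampled solution.
    k: number of sampled solutions considered per problem.
    """
    hits = 0
    for candidates in problems:
        # The reward model's choice is the candidate with the highest score.
        best = max(candidates[:k], key=lambda c: c[0])
        if best[1]:
            hits += 1
    return hits / len(problems)
```

Under this reading the score depends on both models: the instruct model must produce at least one correct solution among the k samples, and the reward model must rank it first.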

Title: Review-Then-Refine: A Dynamic Framework for Multi-Hop Question Answering with Temporal Adaptability

Authors: Xiangsen Chen, Xuming Hu, Nan Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.15101
Pdf URL: https://arxiv.org/pdf/2412.15101
Copy Paste: [[2412.15101]] Review-Then-Refine: A Dynamic Framework for Multi-Hop Question Answering with Temporal Adaptability(https://arxiv.org/abs/2412.15101)
Keywords: language model, llm, hallucination
Abstract: Retrieval-augmented generation (RAG) frameworks have emerged as a promising solution to multi-hop question answering (QA) tasks since they enable large language models (LLMs) to incorporate external knowledge and mitigate their inherent knowledge deficiencies. Despite this progress, existing RAG frameworks, which usually follow the retrieve-then-read paradigm, often struggle with multi-hop QA involving temporal information since they have difficulty retrieving and synthesizing accurate time-related information. To address this challenge, this paper proposes a novel framework called review-then-refine, which aims to enhance LLM performance in multi-hop QA scenarios with temporal information. Our approach begins with a review phase, where decomposed sub-queries are dynamically rewritten with temporal information, allowing for subsequent adaptive retrieval and reasoning processes. In addition, we implement an adaptive retrieval mechanism to minimize unnecessary retrievals, thus reducing the potential for hallucinations. In the subsequent refine phase, the LLM synthesizes the retrieved information from each sub-query along with its internal knowledge to formulate a coherent answer. Extensive experimental results across multiple datasets demonstrate the effectiveness of our proposed framework, highlighting its potential to significantly improve multi-hop QA capabilities in LLMs.
摘要：检索增强生成 (RAG) 框架已成为多跳问答 (QA) 任务的一种有前途的解决方案，因为它使大型语言模型 (LLM) 能够整合外部知识并缓解其固有知识的缺陷。尽管取得了这一进展，但现有的 RAG 框架通常遵循“检索后阅读”范式，由于难以检索和合成准确的时间相关信息，因此在处理具有时间信息的多跳 QA 时通常会遇到困难。为了应对这一挑战，本文提出了一种称为“先审后细化”的新框架，旨在提高 LLM 在具有时间信息的多跳 QA 场景中的性能。我们的方法从审查阶段开始，其中分解的子查询会使用时间信息动态重写，从而允许后续的自适应检索和推理过程。此外，我们实施了自适应检索机制以最大限度地减少不必要的检索，从而降低出现幻觉的可能性。在随后的细化阶段，LLM 将从每个子查询中检索到的信息与其内部知识相结合，以形成连贯的答案。跨多个数据集的大量实验结果证明了我们提出的框架的有效性，凸显了其显著提高 LLM 中多跳 QA 能力的潜力。

Title: Qwen2.5 Technical Report

Authors: Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu (additional authors not shown)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.15115
Pdf URL: https://arxiv.org/pdf/2412.15115
Copy Paste: [[2412.15115]] Qwen2.5 Technical Report(https://arxiv.org/abs/2412.15115)
Keywords: language model, gpt, llm
Abstract: In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised fine-tuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques improve alignment with human preferences and notably improve long text generation, structured data analysis, and instruction following. To handle diverse use cases effectively, we present the Qwen2.5 LLM series in a rich range of sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates performance competitive with the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o, respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.
摘要：在本报告中，我们介绍了 Qwen2.5，这是一系列全面的大型语言模型 (LLM)，旨在满足多样化的需求。与之前的迭代相比，Qwen 2.5 在预训练和后训练阶段都有了显著的改进。在预训练方面，我们将高质量的预训练数据集从之前的 7 万亿个 token 扩展到 18 万亿个 token，为常识、专业知识和推理能力提供了坚实的基础。在后训练方面，我们使用超过 100 万个样本实现了复杂的监督微调，以及多阶段强化学习。后训练技术增强了人类偏好，并显著改善了长文本生成、结构化数据分析和指令遵循。为了有效处理多样化和多样化的用例，我们提供了丰富的 Qwen2.5 LLM 系列。开放权重产品包括基础和指令调整模型，并提供量化版本。此外，对于托管解决方案，专有模型目前包括两个混合专家 (MoE) 变体：Qwen2.5-Turbo 和 Qwen2.5-Plus，均可从阿里云模型工作室获得。Qwen2.5 在评估语言理解、推理、数学、编码、人类偏好对齐等的广泛基准测试中表现出顶级性能。具体而言，开放权重旗舰 Qwen2.5-72B-Instruct 的表现优于许多开放和专有模型，并且与最先进的开放权重模型 Llama-3-405B-Instruct 表现出竞争性能，后者大约大 5 倍。Qwen2.5-Turbo 和 Qwen2.5-Plus 具有出色的成本效益，同时与 GPT-4o-mini 和 GPT-4o 具有竞争力。此外，作为基础，Qwen2.5 模型在训练 Qwen2.5-Math、Qwen2.5-Coder、QwQ 和多模态模型等专门模型方面发挥了重要作用。

Title: Outcome-Refining Process Supervision for Code Generation

Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2412.15118
Pdf URL: https://arxiv.org/pdf/2412.15118
Copy Paste: [[2412.15118]] Outcome-Refining Process Supervision for Code Generation(https://arxiv.org/abs/2412.15118)
Keywords: language model
Abstract: Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our framework leverages concrete execution signals to ground the supervision of reasoning steps, while using tree-structured exploration to maintain multiple solution trajectories simultaneously. Experiments demonstrate that our approach enables even smaller models to achieve high success rates and performance on competitive programming tasks, and that it provides more reliable verification than traditional reward models without requiring the training of PRMs. Our approach achieves significant improvements across 5 models and 3 datasets: an average increase of 26.9% in correctness and 42.2% in efficiency. The results suggest that providing a structured reasoning space with concrete verification signals is crucial for solving complex programming tasks. We open-source all our code and data at: this https URL
摘要：大型语言模型在代码生成方面表现出了卓越的能力，但它们在处理需要深度算法推理的复杂编程任务时往往举步维艰。虽然通过学习奖励模型进行过程监督在指导推理步骤方面很有前景，但它需要昂贵的训练数据，并且评估不可靠。我们提出了结果细化过程监督，这是一种将结果细化本身视为需要监督的过程的新范式。我们的框架利用具体的执行信号来为推理步骤的监督打下基础，同时使用树形结构探索来同时维护多个解决方案轨迹。实验表明，我们的方法使更小的模型也能在竞争性编程任务上实现高成功率和性能指标，并且比传统的奖励模型创建了更可靠的验证，而无需训练 PRM。我们的方法在 5 个模型和 3 个数据集中实现了显着的改进：平均正确率提高了 26.9%，效率提高了 42.2%。结果表明，提供具有具体验证信号的结构化推理空间对于解决复杂的编程任务至关重要。我们将所有代码和数据开源于：此 https URL
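
The idea of grounding supervision in concrete execution signals while keeping several solution trajectories alive can be illustrated with a minimal sketch. It assumes candidates are plain Python callables and that the execution signal is just the fraction of test cases passed. `pass_rate`, `keep_top_k`, and the toy candidates are hypothetical names, not the authors' implementation, and the paper's actual tree search is not reproduced here.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected output)


def pass_rate(fn: Callable, tests: List[TestCase]) -> float:
    """Execution signal: fraction of test cases the candidate passes."""
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests)


def keep_top_k(candidates: List[Callable], tests: List[TestCase], k: int) -> List[Callable]:
    """Keep the k candidates with the best execution signal (a simple beam)."""
    ranked = sorted(candidates, key=lambda fn: pass_rate(fn, tests), reverse=True)
    return ranked[:k]


# Three toy candidate solutions for "double the input".
def cand_a(x):
    return x * 2


def cand_b(x):
    return x + x if x >= 0 else 0  # wrong for negative inputs


def cand_c(x):
    return 1 // 0  # always crashes


tests = [((1,), 2), ((3,), 6), ((-2,), -4)]
beam = keep_top_k([cand_c, cand_b, cand_a], tests, k=2)
print([fn.__name__ for fn in beam])
```

In a fuller system each surviving candidate would be refined again and re-scored, so the beam plays the role of the multiple trajectories kept in parallel.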

Title: Adaptive Pruning for Large Language Models with Structural Importance Awareness

Authors: Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15127
Pdf URL: https://arxiv.org/pdf/2412.15127
Copy Paste: [[2412.15127]] Adaptive Pruning for Large Language Models with Structural Importance Awareness(https://arxiv.org/abs/2412.15127)
Keywords: language model, llm
Abstract: The recent advancements in large language models (LLMs) have significantly improved language understanding and generation capabilities. However, it is difficult to deploy LLMs on resource-constrained edge devices due to their high computational and storage resource demands. To address this issue, we propose a novel LLM model pruning method, namely structurally-aware adaptive pruning (SAAP), to significantly reduce the computational and memory costs while maintaining model performance. We first define an adaptive importance fusion metric to evaluate the importance of all coupled structures in LLMs by considering their homoscedastic uncertainty. Then, we rank the importance of all modules to determine the specific layers that should be pruned to meet particular performance requirements. Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs. Finally, we evaluate the proposed SAAP method on multiple LLMs across two common tasks, i.e., zero-shot classification and text generation. Experimental results show that our SAAP method outperforms several state-of-the-art baseline methods, achieving 2.17%, 2.37%, and 2.39% accuracy gains on LLaMA-7B, Vicuna-7B, and LLaMA-13B. Additionally, SAAP improves the token generation speed by 5%, showcasing its practical advantages in resource-constrained scenarios.
摘要：大型语言模型 (LLM) 的最新进展显著提高了语言理解和生成能力。然而，由于资源受限的边缘设备对计算和存储资源的需求很高，因此很难在它们上面部署 LLM。为了解决这个问题，我们提出了一种新颖的 LLM 模型修剪方法，即结构感知自适应修剪 (SAAP)，以在保持模型性能的同时显著降低计算和内存成本。我们首先定义一个自适应重要性融合度量，通过考虑 LLM 中所有耦合结构同方差不确定性来评估它们的重要性。然后，我们对所有模块的重要性进行排序，以确定应修剪的特定层以满足特定的性能要求。此外，我们开发了一种新的组微调策略来提高 LLM 的推理效率。最后，我们在两个常见任务（即零样本分类和文本生成）中对多个 LLM 上的所提出的 SAAP 方法进行了评估。实验结果表明，我们的 SAAP 方法优于几种最先进的基线方法，在 LLaMA-7B、Vicuna-7B 和 LLaMA-13B 上实现了 2.17%、2.37% 和 2.39% 的准确率提升。此外，SAAP 将 token 生成速度提高了 5%，在资源受限的场景中展现了其实用优势。
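
The ranking step described in the abstract (score every module, then prune the least important ones) can be sketched as follows. This is a simplified illustration that assumes one scalar importance score per layer is already available. The function `layers_to_prune`, the layer names, and the scores are hypothetical, and the adaptive importance fusion metric itself is not reproduced.

```python
from typing import Dict, List


def layers_to_prune(importance: Dict[str, float], ratio: float) -> List[str]:
    """Return the lowest-importance layers, covering `ratio` of all layers."""
    n_prune = int(len(importance) * ratio)
    ranked = sorted(importance, key=importance.get)  # ascending importance
    return ranked[:n_prune]


# Made-up importance scores for four layers.
scores = {"layer0": 0.9, "layer1": 0.2, "layer2": 0.5, "layer3": 0.1}
pruned = layers_to_prune(scores, ratio=0.5)
print(pruned)
```

Raising or lowering `ratio` trades model size against accuracy, which is the knob a performance requirement would set.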

Title: Language Models as Continuous Self-Evolving Data Engineers

Authors: Peidong Wang, Ming Wang, Zhiming Ma, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.15151
Pdf URL: https://arxiv.org/pdf/2412.15151
Copy Paste: [[2412.15151]] Language Models as Continuous Self-Evolving Data Engineers(https://arxiv.org/abs/2412.15151)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks, but their further evolution is limited by the lack of high-quality training data. In addition, traditional training approaches rely too much on expert-labeled data, setting an upper limit on the performance of LLMs. To address this issue, we propose a novel paradigm, named LANCE, that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data with preference information. Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of the post-training data construction process. Through iterative fine-tuning on different variants of Qwen2, we validate the effectiveness of LANCE across various tasks, showing that it can continuously improve model performance and maintain high-quality data generation. Across eight benchmark dimensions, LANCE resulted in an average score enhancement of 3.36 for Qwen2-7B and 2.70 for Qwen2-7B-Instruct. This training paradigm with autonomous data construction not only reduces the reliance on human experts or external models but also ensures that the data aligns with human values and preferences, paving the way for the development of future superintelligent systems that can exceed human capabilities.
摘要：大型语言模型 (LLM) 在各种任务上都表现出了卓越的能力，但进一步的发展受限于缺乏高质量的训练数据。此外，传统的训练方法过于依赖专家标记的数据，这限制了 LLM 的性能。为了解决这个问题，我们提出了一种新的范式，使 LLM 能够通过自主生成、清理、审查和使用偏好信息注释数据来进行自我训练，称为 LANCE。我们的方法表明，LLM 可以充当持续自我进化的数据工程师，大大减少训练后数据构建过程的时间和成本。通过对 Qwen2 的不同变体进行迭代微调，我们验证了 LANCE 在各种任务中的有效性，表明它可以持续提高模型性能并保持高质量的数据生成。在八个基准维度上，LANCE 使 Qwen2-7B 的平均得分提高了 3.36，使 Qwen2-7B-Instruct 的平均得分提高了 2.70。这种具有自主数据构建的训练范式不仅减少了对人类专家或外部模型的依赖，而且还确保数据符合人类的价值观和偏好，为未来超越人类能力的超级智能系统的发展铺平了道路。

Title: LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation

Authors: Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15188
Pdf URL: https://arxiv.org/pdf/2412.15188
Copy Paste: [[2412.15188]] LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation(https://arxiv.org/abs/2412.15188)
Keywords: language model, llm
Abstract: We present LlamaFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LlamaFusion leverages existing Llama-3 weights for processing text autoregressively while introducing additional, parallel transformer modules for processing images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and only training the image-specific modules, LlamaFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs while maintaining Llama-3's language capabilities. We also demonstrate that this framework can adapt existing vision-language models with multimodal generation ability. Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development.
摘要：我们提出了 LlamaFusion，这是一个为预训练的纯文本大型语言模型 (LLM) 提供多模态生成能力的框架，使它们能够理解和生成任意序列的文本和图像。LlamaFusion 利用现有的 Llama-3 权重来自回归处理文本，同时引入额外的并行转换器模块来处理具有扩散的图像。在训练期间，来自每种模态的数据都会被路由到其专用模块：特定于模态的前馈层、查询键值投影和规范化层独立处理每种模态，而共享的自注意力层允许跨文本和图像特征进行交互。通过冻结特定于文本的模块并仅训练特定于图像的模块，LlamaFusion 保留了纯文本 LLM 的语言能力，同时开发了强大的视觉理解和生成能力。与从头开始预训练多模态生成模型的方法相比，我们的实验表明，LlamaFusion 仅使用 50% 的 FLOP 将图像理解能力提高了 20%，将图像生成能力提高了 3.6%，同时保持了 Llama-3 的语言能力。我们还证明，该框架可以调整具有多模态生成能力的现有视觉语言模型。总体而言，该框架不仅利用了纯文本 LLM 中现有的计算投资，而且还实现了语言和视觉能力的并行开发，为高效的多模态模型开发提供了一个有希望的方向。

Title: Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings

Authors: Daniel Russo, Stefano Menini, Jacopo Staiano, Marco Guerini
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2412.15189
Pdf URL: https://arxiv.org/pdf/2412.15189
Copy Paste: [[2412.15189]] Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings(https://arxiv.org/abs/2412.15189)
Keywords: llm, retrieval-augmented generation
Abstract: Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under more realistic scenarios, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.
摘要：自然语言处理和生成系统最近显示出补充和简化专业事实核查人员昂贵且耗时的工作的潜力。在这项工作中，我们基于检索增强生成 (RAG) 范式，解除了当前最先进的自动事实核查流程的几个限制。我们的目标是在更现实的场景下对基于 RAG 的判决生成方法（即讨论声明真实性的短文）进行基准测试，并在风格复杂的声明和异构但可靠的知识库上对其进行评估。我们的研究结果显示了一个复杂的情况，例如，基于 LLM 的检索器优于其他检索技术，但它们仍然难以处理异构知识库；较大的模型在判决忠实度方面表现出色，而较小的模型则提供更好的上下文一致性，人类评估更倾向于零样本和一次性方法以获得信息量，而微调模型则更倾向于情感一致性。

Title: MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark

Authors: Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.15194
Pdf URL: https://arxiv.org/pdf/2412.15194
Copy Paste: [[2412.15194]] MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark(https://arxiv.org/abs/2412.15194)
Keywords: language model, gpt, llm
Abstract: Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging MCQ benchmark called MMLU-CF. This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage. To avoid unintentional data leakage, we source data from a broader domain and design three decontamination rules. To prevent malicious data leakage, we divide the benchmark into validation and test sets with similar difficulty and subject distributions. The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent verification. Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves a 5-shot score of merely 73.4% and a 0-shot score of 71.9% on the test set, which indicates the effectiveness of our approach in creating a more rigorous and contamination-free evaluation standard. The GitHub repository is available at this https URL and the dataset is available at this https URL.
摘要：多项选择题 (MCQ) 数据集（如大规模多任务语言理解 (MMLU)）被广泛用于评估大型语言模型 (LLM) 的常识、理解和解决问题的能力。然而，这些基准的开源性质以及 LLM 训练数据的广泛来源不可避免地导致了基准污染，从而导致不可靠的评估结果。为了缓解这个问题，我们提出了一个无污染且更具挑战性的 MCQ 基准，称为 MMLU-CF。该基准通过避免无意和恶意的数据泄露来重新评估 LLM 对世界知识的理解。为了避免无意的数据泄露，我们从更广泛的领域获取数据并设计了三条净化规则。为了防止恶意数据泄露，我们将基准分为具有相似难度和主题分布的验证集和测试集。测试集保持闭源以确保可靠的结果，而验证集则公开可用以促进透明度并促进独立验证。我们对主流 LLM 的评估表明，强大的 GPT-4o 在测试集上仅取得了 73.4% 的 5 次得分和 71.9% 的 0 次得分，这表明我们的方法在创建更严格且无污染的评估标准方面是有效的。GitHub 存储库可在此 https URL 上找到，数据集引用此 https URL。

Title: LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Authors: Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.15204
Pdf URL: https://arxiv.org/pdf/2412.15204
Copy Paste: [[2412.15204]] LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks(https://arxiv.org/abs/2412.15204)
Keywords: llm
Abstract: This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answering the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at this https URL.
摘要：本文介绍了 LongBench v2，这是一个基准测试，旨在评估 LLM 处理需要在现实世界的多任务中进行深度学习理解和推理的长上下文问题的能力。LongBench v2 包含 503 个具有挑战性的多项选择题，上下文从 8k 到 2M 词不等，涵盖六大任务类别：单文档问答、多文档问答、长上下文学习、长对话历史理解、代码库理解和长结构化数据理解。为了确保广度和实用性，我们收集了近 100 名受过高等教育、具有不同专业背景的人士的数据。我们采用自动和手动审查流程来保持高质量和高难度，导致人类专家在 15 分钟的时间限制下只能达到 53.7% 的准确率。我们的评估表明，表现最好的模型在直接回答问题时也只能达到 50.1% 的准确率。相比之下，包含更长推理的 o1-preview 模型达到了 57.7%，比人类基线高出 4%。这些结果凸显了增强推理能力和扩展推理时间计算对于解决 LongBench v2 中的长上下文挑战的重要性。该项目可在此 https URL 上找到。
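
The "4%" margin over the human baseline is easiest to read as a difference in percentage points. The short calculation below uses only the two accuracies quoted in the abstract. The variable names are illustrative, and the relative figure is a derived value, not one reported in the abstract.

```python
human_acc = 53.7       # human expert accuracy (%), from the abstract
o1_preview_acc = 57.7  # o1-preview accuracy (%), from the abstract

# Absolute gap, in percentage points.
abs_gap = round(o1_preview_acc - human_acc, 1)

# Relative gap, as a percentage of the human baseline.
rel_gap = round((o1_preview_acc - human_acc) / human_acc * 100, 1)

print(abs_gap, rel_gap)
```

The absolute gap is 4.0 points, matching the abstract's figure, while the relative improvement over the human baseline is about 7.4%.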