2024-12-02

Title: On the Effectiveness of Incremental Training of Large Language Models

Authors: Miles Q. Li, Benjamin C. M. Fung, Shih-Chia Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18700
Pdf URL: https://arxiv.org/pdf/2411.18700
Copy Paste: [[2411.18700]] On the Effectiveness of Incremental Training of Large Language Models(https://arxiv.org/abs/2411.18700)
Keywords: language model, llm
Abstract: Training large language models is a computationally intensive process that often requires substantial resources to achieve state-of-the-art results. Incremental layer-wise training has been proposed as a potential strategy to optimize the training process by progressively introducing layers, with the expectation that this approach would lead to faster convergence and more efficient use of computational resources. In this paper, we investigate the effectiveness of incremental training for LLMs, dividing the training process into multiple stages where layers are added progressively. Our experimental results indicate that while the incremental approach initially demonstrates some computational efficiency, it ultimately requires greater overall computational costs to reach comparable performance to traditional full-scale training. Although the incremental training process can eventually close the performance gap with the baseline, it does so only after significantly extended continual training. These findings suggest that incremental layer-wise training may not be a viable alternative for training large language models, highlighting its limitations and providing valuable insights into the inefficiencies of this approach.
摘要：训练大型语言模型是一个计算密集型的过程，通常需要大量资源才能获得最先进的结果。增量分层训练已被提出作为一种潜在的策略，通过逐步引入层来优化训练过程，期望这种方法能够加快收敛速度并更有效地利用计算资源。在本文中，我们研究了增量训练对 LLM 的有效性，将训练过程分为多个阶段，逐步添加层。我们的实验结果表明，虽然增量方法最初表现出一定的计算效率，但最终需要更大的总体计算成本才能达到与传统全尺寸训练相当的性能。虽然增量训练过程最终可以缩小与基线的性能差距，但只有在经过长时间的持续训练后才能做到这一点。这些发现表明，增量分层训练可能不是训练大型语言模型的可行替代方案，突显了它的局限性，并为这种方法的低效率提供了宝贵的见解。

Title: NewsEdits 2.0: Learning the Intentions Behind Updating News

Authors: Alexander Spangher, Kung-Hsiang Huang, Hyundong Cho, Jonathan May
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2411.18811
Pdf URL: https://arxiv.org/pdf/2411.18811
Copy Paste: [[2411.18811]] NewsEdits 2.0: Learning the Intentions Behind Updating News(https://arxiv.org/abs/2411.18811)
Keywords: language model, llm
Abstract: As events progress, news articles often update with new information: if we are not cautious, we risk propagating outdated facts. In this work, we hypothesize that linguistic features indicate factual fluidity, and that we can predict which facts in a news article will update using solely the text of a news article (i.e. not external resources like search engines). We test this hypothesis, first, by isolating fact-updates in large news revisions corpora. News articles may update for many reasons (e.g. factual, stylistic, narrative). We introduce the NewsEdits 2.0 taxonomy, an edit-intentions schema that separates fact updates from stylistic and narrative updates in news writing. We annotate over 9,200 pairs of sentence revisions and train high-scoring ensemble models to apply this schema. Then, taking a large dataset of silver-labeled pairs, we show that we can predict when facts will update in older article drafts with high precision. Finally, to demonstrate the usefulness of these findings, we construct a language model question asking (LLM-QA) abstention task. We wish the LLM to abstain from answering questions when information is likely to become outdated. Using our predictions, we show that LLM abstention reaches near-oracle levels of accuracy.
摘要：随着事件的发展，新闻文章经常会更新新信息：如果我们不谨慎，就有可能传播过时的事实。在这项研究中，我们假设语言特征表明事实的流动性，并且我们可以仅使用新闻文章的文本（即不使用搜索引擎等外部资源）来预测新闻文章中的哪些事实会更新。我们首先通过在大型新闻修订语料库中分离事实更新来测试这一假设。新闻文章可能出于多种原因而更新（例如事实、风格、叙述）。我们引入了 NewsEdits 2.0 分类法，这是一种编辑意图模式，将新闻写作中的事实更新与风格和叙述更新区分开来。我们注释了超过 9,200 对句子修订，并训练高分集成模型以应用此模式。然后，通过获取大量银标记对的数据集，我们表明我们可以高精度地预测旧文章草稿中的事实何时会更新。最后，为了证明这些发现的实用性，我们构建了一个语言模型问答 (LLM-QA) 弃权任务。我们希望 LLM 在信息可能过时时弃权回答问题。使用我们的预测，我们表明 LLM 的弃权达到了接近 oracle 的准确度水平。

Title: Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark

Authors: Jianyou Wang, Weili Cao, Longtian Bao, Youze Zheng, Gil Pasternak, Kaicheng Wang, Xiaoyue Wang, Ramamohan Paturi, Leon Bergen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18831
Pdf URL: https://arxiv.org/pdf/2411.18831
Copy Paste: [[2411.18831]] Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark(https://arxiv.org/abs/2411.18831)
Keywords: language model
Abstract: Systems that answer questions by reviewing the scientific literature are becoming increasingly feasible. To draw reliable conclusions, these systems should take into account the quality of available evidence, placing more weight on studies that use a valid methodology. We present a benchmark for measuring the methodological strength of biomedical papers, drawing on the risk-of-bias framework used for systematic reviews. The four benchmark tasks, drawn from more than 500 papers, cover the analysis of research study methodology, followed by evaluation of risk of bias in these studies. The benchmark contains 2000 expert-generated bias annotations, and a human-validated pipeline for fine-grained alignment with research paper content. We evaluate a range of large language models on the benchmark, and find that these models fall significantly short of expert-level performance. By providing a standardized tool for measuring judgments of study quality, the benchmark can help to guide systems that perform large-scale aggregation of scientific data. The dataset is available at this https URL.
摘要：通过查阅科学文献来回答问题的系统正变得越来越可行。为了得出可靠的结论，这些系统应该考虑现有证据的质量，更加重视使用有效方法的研究。我们提出了一个衡量生物医学论文方法论强度的基准，借鉴了系统评价中使用的偏倚风险框架。这四项基准任务来自 500 多篇论文，涵盖研究方法的分析，然后评估这些研究中的偏倚风险。该基准包含 2000 个专家生成的偏倚注释，以及一个人工验证的管道，用于与研究论文内容进行细粒度的对齐。我们根据基准评估了一系列大型语言模型，发现这些模型远远达不到专家级的性能。通过提供衡量研究质量判断的标准化工具，该基准可以帮助指导执行大规模科学数据聚合的系统。数据集可在此 https URL 上获得。

Title: Sneaking Syntax into Transformer Language Models with Tree Regularization

Authors: Ananjan Nandi, Christopher D. Manning, Shikhar Murty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18885
Pdf URL: https://arxiv.org/pdf/2411.18885
Copy Paste: [[2411.18885]] Sneaking Syntax into Transformer Language Models with Tree Regularization(https://arxiv.org/abs/2411.18885)
Keywords: language model, llm
Abstract: While compositional accounts of human language understanding are based on a hierarchical tree-like process, neural models like transformers lack a direct inductive bias for such tree structures. Introducing syntactic inductive biases could unlock more robust and data-efficient learning in transformer language models (LMs), but existing methods for incorporating such structure greatly restrict models, either limiting their expressivity or increasing inference complexity. This work instead aims to softly inject syntactic inductive biases into given transformer circuits, through a structured regularizer. We introduce TREEREG, an auxiliary loss function that converts bracketing decisions from silver parses into a set of differentiable orthogonality constraints on vector hidden states. TREEREG integrates seamlessly with the standard LM objective, requiring no architectural changes. LMs pre-trained with TreeReg on natural language corpora such as WikiText-103 achieve up to 10% lower perplexities on out-of-distribution data and up to 9.5 point improvements in syntactic generalization, requiring less than half the training data to outperform standard LMs. TreeReg still provides gains for pre-trained LLMs: Continued pre-training of Sheared Llama with TreeReg results in improved syntactic generalization, and fine-tuning on MultiNLI with TreeReg mitigates degradation of performance on adversarial NLI benchmarks by 41.2 points.
摘要：虽然人类语言理解的组合描述基于分层树状过程，但像 Transformer 这样的神经模型缺乏对这种树结构的直接归纳偏差。引入句法归纳偏差可以解锁 Transformer 语言模型 (LM) 中更稳健、数据效率更高的学习，但现有的整合这种结构的方法极大地限制了模型，要么限制了它们的表达能力，要么增加了推理复杂性。相反，这项工作旨在通过结构化正则化器将句法归纳偏差软注入给定的 Transformer 电路中。我们引入了 TREEREG，这是一种辅助损失函数，它将来自银解析的括号决策转换为一组对向量隐藏状态的可微正交约束。TREEREG 与标准 LM 目标无缝集成，无需进行任何架构更改。使用 TreeReg 在自然语言语料库（例如 WikiText-103）上进行预训练的 LM 在分布外数据上的困惑度降低了 10%，句法泛化能力提高了 9.5 个百分点，仅需不到一半的训练数据就能超越标准 LM。TreeReg 仍能为预训练的 LLM 带来好处：继续使用 TreeReg 对 Sheared Llama 进行预训练可提高句法泛化能力，使用 TreeReg 对 MultiNLI 进行微调可将对抗性 NLI 基准测试上的性能下降降低 41.2 个百分点。

Title: Devising a Set of Compact and Explainable Spoken Language Feature for Screening Alzheimer's Disease

Authors: Junan Li, Yunxiang Li, Yuren Wang, Xixin Wu, Helen Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18922
Pdf URL: https://arxiv.org/pdf/2411.18922
Copy Paste: [[2411.18922]] Devising a Set of Compact and Explainable Spoken Language Feature for Screening Alzheimer's Disease(https://arxiv.org/abs/2411.18922)
Keywords: language model, llm
Abstract: Alzheimer's disease (AD) has become one of the most significant health challenges in an aging society. The use of spoken language-based AD detection methods has gained prevalence due to their scalability. Based on the Cookie Theft picture description task, we devised an explainable and effective feature set that leverages the visual capabilities of a large language model (LLM) and the Term Frequency-Inverse Document Frequency (TF-IDF) model. Our experimental results show that the newly proposed features consistently outperform traditional linguistic features across two different classifiers with high dimension efficiency. Our new features can be well explained and interpreted step by step, which enhances the interpretability of automatic AD screening.
摘要：阿尔茨海默病 (AD) 已成为老龄化社会中最重大的健康挑战之一。基于口语的 AD 检测方法因其可扩展性而越来越受欢迎。基于 Cookie Theft 图片描述任务，我们设计了一个可解释且有效的特征集，该特征集利用大型语言模型 (LLM) 和词频-逆文档频率 (TF-IDF) 模型的视觉功能。我们的实验结果表明，新提出的特征在两个不同的分类器中始终优于传统语言特征，并且具有较高的维度效率。我们的新功能可以得到很好的解释和逐步解释，从而增强了自动 AD 筛选的可解释性。
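
Note: the abstract above names TF-IDF but does not say which weighting variant or tokenization the authors use. The following is only a minimal sketch of one common textbook form (term frequency times the log of inverse document frequency, no smoothing); the function name `tf_idf` and the toy picture descriptions are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    Other variants (smoothing, sublinear tf) are equally common.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized picture descriptions, for illustration only.
descriptions = [
    ["boy", "steals", "cookie", "from", "jar"],
    ["water", "overflows", "from", "sink"],
]
scores = tf_idf(descriptions)
```

In this variant a term that occurs in every description (here "from") gets weight zero, while terms specific to one description keep a positive weight, which is one reason TF-IDF features are easy to inspect.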

Title: EzSQL: An SQL intermediate representation for improving SQL-to-text Generation

Authors: Meher Bhardwaj, Hrishikesh Ethari, Dennis Singh Moirangthem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18923
Pdf URL: https://arxiv.org/pdf/2411.18923
Copy Paste: [[2411.18923]] EzSQL: An SQL intermediate representation for improving SQL-to-text Generation(https://arxiv.org/abs/2411.18923)
Keywords: language model
Abstract: The SQL-to-text generation task traditionally uses template-based, Seq2Seq, tree-to-sequence, and graph-to-sequence models. Recent models take advantage of pre-trained generative language models for this task in the Seq2Seq framework. However, treating SQL as a sequence of inputs to the pre-trained models is not optimal. In this work, we put forward a new SQL intermediate representation called EzSQL to align SQL with the natural language text sequence. EzSQL simplifies the SQL queries and brings them closer to natural language text by modifying operators and keywords, which can usually be described in natural language. EzSQL also removes the need for set operators. Our proposed SQL-to-text generation model uses EzSQL as the input to a pre-trained generative language model for generating the text descriptions. We demonstrate that our model is an effective state-of-the-art method to generate text narrations from SQL queries on the WikiSQL and Spider datasets. We also show that by generating pretraining data using our SQL-to-text generation model, we can enhance the performance of Text-to-SQL parsers.
摘要：SQL 到文本生成任务传统上使用模板库、Seq2Seq、树到序列和图到序列模型。最近的模型利用 Seq2Seq 框架中预训练的生成语言模型来完成此任务。但是，将 SQL 视为预训练模型的输入序列并不是最佳选择。在这项工作中，我们提出了一种称为 EzSQL 的新 SQL 中间表示，以使 SQL 与自然语言文本序列对齐。EzSQL 通过修改运算符和关键字来简化 SQL 查询并使其更接近自然语言文本，这些运算符和关键字通常可以用自然语言描述。EzSQL 还消除了对集合运算符的需求。我们提出的 SQL 到文本生成模型使用 EzSQL 作为预训练生成语言模型的输入，以生成文本描述。我们证明我们的模型是一种有效的最先进方法，可以从 WikiSQL 和 Spider 数据集上的 SQL 查询生成文本叙述。我们还表明，通过使用我们的 SQL 到文本生成模型生成预训练数据，我们可以增强文本到 SQL 解析器的性能。

Title: The Impact of Example Selection in Few-Shot Prompting on Automated Essay Scoring Using GPT Models

Authors: Lui Yoshida
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18924
Pdf URL: https://arxiv.org/pdf/2411.18924
Copy Paste: [[2411.18924]] The Impact of Example Selection in Few-Shot Prompting on Automated Essay Scoring Using GPT Models(https://arxiv.org/abs/2411.18924)
Keywords: gpt, prompt
Abstract: This study investigates the impact of example selection on the performance of automated essay scoring (AES) using few-shot prompting with GPT models. We evaluate the effects of the choice and order of examples in few-shot prompting on several versions of GPT-3.5 and GPT-4 models. Our experiments involve 119 prompts with different examples, and we calculate the quadratic weighted kappa (QWK) to measure the agreement between GPT and human rater scores. Regression analysis is used to quantitatively assess biases introduced by example selection. The results show that the impact of example selection on QWK varies across models, with GPT-3.5 being more influenced by examples than GPT-4. We also find evidence of majority label bias, which is a tendency to favor the majority label among the examples, and recency bias, which is a tendency to favor the label of the most recent example, in GPT-generated essay scores and QWK, with these biases being more pronounced in GPT-3.5. Notably, careful example selection enables GPT-3.5 models to outperform some GPT-4 models. However, among the GPT models, the June 2023 version of GPT-4, which is not the latest model, exhibits the highest stability and performance. Our findings provide insights into the importance of example selection in few-shot prompting for AES, especially in GPT-3.5 models, and highlight the need for individual performance evaluations of each model, even for minor versions.
摘要：本研究使用 GPT 模型研究了示例选择对使用少样本提示的自动作文评分 (AES) 性能的影响。我们评估了少样本提示中示例的选择和顺序对 GPT-3.5 和 GPT-4 模型的几个版本的影响。我们的实验涉及 119 个具有不同示例的提示，我们计算了二次加权 kappa (QWK) 来衡量 GPT 和人类评分者分数之间的一致性。回归分析用于定量评估示例选择引入的偏差。结果表明，示例选择对 QWK 的影响因模型而异，GPT-3.5 受示例的影响大于 GPT-4。我们还发现 GPT 生成的作文分数和 QWK 中存在多数标签偏差的证据，即倾向于偏向示例中的多数标签，以及近因偏差，即倾向于偏向最近示例的标签，这些偏差在 GPT-3.5 中更为明显。值得注意的是，精心选择示例使 GPT-3.5 模型能够胜过某些 GPT-4 模型。然而，在 GPT 模型中，2023 年 6 月版的 GPT-4（不是最新模型）表现出最高的稳定性和性能。我们的研究结果深入了解了示例选择在 AES 的小样本提示中的重要性，尤其是在 GPT-3.5 模型中，并强调了对每个模型进行单独性能评估的必要性，即使是小版本也是如此。
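
Note: the quadratic weighted kappa (QWK) mentioned in the abstract above is a standard agreement statistic. The sketch below follows the usual definition, 1 minus the ratio of weighted observed disagreement to weighted chance disagreement, with weights (i - j)^2 / (n - 1)^2. The function name and the sample scores are hypothetical; the paper's own scoring scale and tooling are not stated in the abstract.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two equal-length lists of integer ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")

    n = max_rating - min_rating + 1
    if n == 1:
        return 1.0  # only one possible rating, so agreement is trivial
    total = len(rater_a)

    # Observed confusion matrix: rows follow rater_a, columns rater_b.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)
            expected = hist_a[i] * hist_b[j] / total  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # If no disagreement is possible by chance, treat as perfect agreement.
    return 1.0 if denominator == 0 else 1.0 - numerator / denominator


# Hypothetical essay scores on a 1..3 scale.
perfect = quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3)
opposite = quadratic_weighted_kappa([1, 3], [3, 1], 1, 3)
```

A value of 1 means the two raters agree exactly, 0 means agreement at chance level, and negative values mean systematic disagreement.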

Title: ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges

Authors: Rao Fu, Ziyang Luo, Hongzhan Lin, Zhen Ye, Jing Ma
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.18932
Pdf URL: https://arxiv.org/pdf/2411.18932
Copy Paste: [[2411.18932]] ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges(https://arxiv.org/abs/2411.18932)
Keywords: gpt
Abstract: Recent advancements in large multimodal models (LMMs) have showcased impressive code generation capabilities, primarily evaluated through image-to-code benchmarks. However, these benchmarks are limited to specific visual programming scenarios where the logic reasoning and the multimodal understanding capacities are split apart. To fill this gap, we propose ScratchEval, a novel benchmark designed to evaluate the visual programming reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based visual programming language widely used in children's programming education. By integrating visual elements and embedded programming logic, ScratchEval requires the model to process both visual information and code structure, thereby comprehensively evaluating its programming intent understanding ability. Our evaluation approach goes beyond the traditional image-to-code mapping and focuses on unified logical thinking and problem-solving abilities, providing a more comprehensive and challenging framework for evaluating the visual programming ability of LMMs. ScratchEval not only fills the gap in existing evaluation methods, but also provides new insights for the future development of LMMs in the field of visual programming. Our benchmark can be accessed at this https URL.
摘要：大型多模态模型 (LMM) 的最新进展展示了令人印象深刻的代码生成能力，主要通过图像到代码基准进行评估。然而，这些基准仅限于特定的可视化编程场景，其中逻辑推理和多模态理解能力是分开的。为了填补这一空白，我们提出了 ScratchEval，这是一种旨在评估 LMM 可视化编程推理能力的新型基准。ScratchEval 基于 Scratch，这是一种广泛用于儿童编程教育的基于块的可视化编程语言。通过整合视觉元素和嵌入式编程逻辑，ScratchEval 要求模型同时处理视觉信息和代码结构，从而全面评估其编程意图理解能力。我们的评估方法超越了传统的图像到代码的映射，侧重于统一的逻辑思维和解决问题的能力，为评估 LMM 的可视化编程能力提供了一个更全面、更具挑战性的框架。ScratchEval 不仅填补了现有评估方法的空白，而且为 LMM 在可视化编程领域的未来发展提供了新的见解。我们的基准可以通过此 https URL 访问。

Title: Rephrasing Electronic Health Records for Pretraining Clinical Language Models

Authors: Jinghui Liu, Anthony Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18940
Pdf URL: https://arxiv.org/pdf/2411.18940
Copy Paste: [[2411.18940]] Rephrasing Electronic Health Records for Pretraining Clinical Language Models(https://arxiv.org/abs/2411.18940)
Keywords: language model, llm
Abstract: Clinical language models are important for many applications in healthcare, but their development depends on access to extensive clinical text for pretraining. However, obtaining clinical notes from electronic health records (EHRs) at scale is challenging due to patient privacy concerns. In this study, we rephrase existing clinical notes using LLMs to generate synthetic pretraining corpora, drawing inspiration from previous work on rephrasing web data. We examine four popular small-sized LLMs (<10B) to create synthetic clinical text to pretrain both decoder-based and encoder-based language models. The method yields better results in language modeling and downstream tasks than previous synthesis approaches without referencing real clinical text. We find that augmenting original clinical notes with synthetic corpora from different LLMs improves performances even at a small token budget, showing the potential of this method to support pretraining at the institutional level or be scaled to synthesize large-scale clinical corpora.
摘要：临床语言模型对于医疗保健领域的许多应用都很重要，但它们的开发依赖于对大量临床文本的访问以进行预训练。然而，出于对患者隐私问题的考虑，从电子健康记录 (EHR) 中大规模获取临床记录具有挑战性。在本研究中，我们使用 LLM 重新表述现有的临床记录以生成合成的预训练语料库，并从以前对重新表述网络数据的研究中获得灵感。我们研究了四种流行的小型 LLM (<10B) 来创建合成临床文本，以预训练基于解码器和基于编码器的语言模型。与以前不参考真实临床文本的合成方法相比，该方法在语言建模和下游任务中获得了更好的结果。我们发现，即使在令牌预算很少的情况下，使用来自不同 LLM 的合成语料库来扩充原始临床记录也可以提高性能，这表明该方法有潜力支持机构层面的预训练或扩展以合成大规模临床语料库。

Title: Zero-shot Slot Filling in the Age of LLMs for Dialogue Systems

Authors: Mansi Rana, Kadri Hacioglu, Sindhuja Gopalan, Maragathamani Boothalingam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18980
Pdf URL: https://arxiv.org/pdf/2411.18980
Copy Paste: [[2411.18980]] Zero-shot Slot Filling in the Age of LLMs for Dialogue Systems(https://arxiv.org/abs/2411.18980)
Keywords: language model, llm
Abstract: Zero-shot slot filling is a well-established subtask of Natural Language Understanding (NLU). However, most existing methods primarily focus on single-turn text data, overlooking the unique complexities of conversational dialogue. Conversational data is highly dynamic, often involving abrupt topic shifts, interruptions, and implicit references that make it difficult to directly apply zero-shot slot filling techniques, even with the remarkable capabilities of large language models (LLMs). This paper addresses these challenges by proposing strategies for automatic data annotation with slot induction and black-box knowledge distillation (KD) from a teacher LLM to a smaller model, outperforming vanilla LLMs on internal datasets by 26% absolute increase in F1 score. Additionally, we introduce an efficient system architecture for call center product settings that surpasses off-the-shelf extractive models by 34% relative F1 score, enabling near real-time inference on dialogue streams with higher accuracy, while preserving low latency.
摘要：零样本槽位填充是自然语言理解 (NLU) 中一个成熟的子任务。然而，大多数现有方法主要关注单轮文本数据，忽略了对话的独特复杂性。对话数据高度动态，通常涉及突然的主题转换、中断和隐式引用，这使得直接应用零样本槽位填充技术变得困难，即使大型语言模型 (LLM) 具有非凡的功能。本文通过提出从教师 LLM 到较小模型的槽位归纳和黑盒知识提炼 (KD) 自动数据注释策略来解决这些挑战，在内部数据集上的表现优于普通 LLM，F1 分数绝对增加 26%。此外，我们为呼叫中心产品设置引入了一种高效的系统架构，其相对 F1 分数比现成的提取模型高出 34%，能够以更高的准确度对对话流进行近乎实时的推理，同时保持低延迟。
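
Note: the abstract above reports one gain as an absolute increase in F1 and another as a relative one, and the two are easy to confuse. The sketch below only illustrates the arithmetic; the baseline and improved scores are hypothetical numbers chosen for the example, not figures from the paper.

```python
def absolute_gain(baseline, improved):
    """Difference in score, e.g. 0.50 -> 0.76 is a gain of 0.26 (26 points)."""
    return improved - baseline


def relative_gain(baseline, improved):
    """Gain as a fraction of the baseline, e.g. 0.50 -> 0.67 is 0.34 (34%)."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (improved - baseline) / baseline


# Hypothetical F1 scores, for illustration only.
abs_example = absolute_gain(0.50, 0.76)   # 26 points absolute
rel_example = relative_gain(0.50, 0.67)   # 34% relative, only 17 points absolute
```

With the same baseline, a 34% relative gain corresponds to a smaller change in raw F1 than a 26-point absolute gain, so the two headline numbers are not directly comparable.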

Title: Talking to oneself in CMC: a study of self replies in Wikipedia talk pages

Authors: Ludovic Tanguy (CLLE), Céline Poudat, Lydia-Mai Ho-Dac (CLLE)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19007
Pdf URL: https://arxiv.org/pdf/2411.19007
Copy Paste: [[2411.19007]] Talking to oneself in CMC: a study of self replies in Wikipedia talk pages(https://arxiv.org/abs/2411.19007)
Keywords: llm
Abstract: This study proposes a qualitative analysis of self replies in Wikipedia talk pages, more precisely when the first two messages of a discussion are written by the same user. This specific pattern occurs in more than 10% of threads with two messages or more and can be explained by a number of reasons. After a first examination of the lexical specificities of second messages, we propose a seven-category typology and use it to annotate two reference samples (English and French) of 100 threads each. Finally, we analyse and compare the performance of human annotators (who reach a reasonable global efficiency) and instruction-tuned LLMs (which encounter important difficulties with several categories).
摘要：本研究提出对维基百科讨论页中的自我回复进行定性分析，更准确地说，当讨论的前两条消息由同一用户撰写时。这种特定模式出现在超过 10% 的包含两条或更多条消息的线程中，可以用多种原因来解释。在对第二条消息的词汇特异性进行初步检查后，我们提出了一个七类分类法，并用它来注释两个参考样本（英语和法语），每个样本有 100 条线程。最后，我们分析和比较了人工注释者（达到合理的全局效率）和指令调整的 LLM（在几个类别中遇到重大困难）的性能。

Title: DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs

Authors: Ben Ganon, Alon Zolfi, Omer Hofman, Inderjeet Singh, Hisashi Kojima, Yuval Elovici, Asaf Shabtai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19038
Pdf URL: https://arxiv.org/pdf/2411.19038
Copy Paste: [[2411.19038]] DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs(https://arxiv.org/abs/2411.19038)
Keywords: language model, llm
Abstract: In recent years, conversational large language models (LLMs) have shown tremendous success in tasks such as casual conversation, question answering, and personalized dialogue, making significant advancements in domains like virtual assistance, social interaction, and online customer engagement. However, they often generate responses that are not aligned with human values (e.g., ethical standards, safety, or social norms), leading to potentially unsafe or inappropriate outputs. While several techniques have been proposed to address this problem, they come with a cost, requiring computationally expensive training or dramatically increasing the inference time. In this paper, we present DIESEL, a lightweight inference guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesired concepts from the response. DIESEL can function either as a standalone safeguard or as an additional layer of defense, enhancing response safety by reranking the LLM's proposed tokens based on their similarity to predefined negative concepts in the latent space. This approach provides an efficient and effective solution for maintaining alignment with human values. Our evaluation demonstrates DIESEL's effectiveness on state-of-the-art conversational models (e.g., Llama 3), even in challenging jailbreaking scenarios that test the limits of response safety. We further show that DIESEL can be generalized to use cases other than safety, providing a versatile solution for general-purpose response filtering with minimal computational overhead.
摘要：近年来，对话式大型语言模型 (LLM) 在诸如随意交谈、问答和个性化对话等任务中取得了巨大成功，在虚拟协助、社交互动和在线客户参与等领域取得了重大进展。然而，它们通常会产生与人类价值观（例如道德标准、安全或社会规范）不一致的响应，从而导致潜在的不安全或不适当的输出。虽然已经提出了几种技术来解决这个问题，但它们需要付出代价，需要计算成本高昂的训练或大大增加推理时间。在本文中，我们介绍了 DIESEL，这是一种轻量级推理指导技术，可以无缝集成到任何自回归 LLM 中，以语义上过滤响应中的不良概念。DIESEL 既可以作为独立的保护措施，也可以作为额外的防御层，通过根据 LLM 提出的标记与潜在空间中预定义的负面概念的相似性对其进行重新排序来增强响应安全性。这种方法提供了一种高效且有效的解决方案，可以保持与人类价值观的一致性。我们的评估表明，即使在测试响应安全性极限的具有挑战性的越狱场景中，DIESEL 在最先进的对话模型（例如 Llama 3）上也表现出色。我们进一步表明，DIESEL 可以推广到安全以外的用例，为通用响应过滤提供通用解决方案，同时将计算开销降至最低。
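
Note: the abstract above describes reranking an LLM's proposed tokens by their similarity to predefined negative concepts in a latent space, but it does not give the scoring rule. The sketch below is one plausible reading under stated assumptions, not the authors' method: each candidate carries a probability and an embedding, and its score is the probability minus `alpha` times its highest cosine similarity to any negative concept. The names `rerank` and `alpha`, the two-dimensional embeddings, and the sample tokens are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank(candidates, negative_concepts, alpha=0.5):
    """Reorder (token, probability, embedding) candidates, best first.

    Assumed scoring rule: probability - alpha * max similarity to a
    negative concept, with negative similarities clipped to zero.
    """
    scored = []
    for token, prob, embedding in candidates:
        penalty = max(
            (cosine(embedding, concept) for concept in negative_concepts),
            default=0.0,
        )
        scored.append((prob - alpha * max(penalty, 0.0), token))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [token for _, token in scored]


# Hypothetical candidates and a single negative concept direction.
candidates = [("harm", 0.6, [1.0, 0.0]), ("help", 0.4, [0.0, 1.0])]
ranking = rerank(candidates, negative_concepts=[[1.0, 0.0]])
```

Here the more probable token is demoted because its embedding aligns with the negative concept, which is the general effect the abstract describes; how the real system obtains embeddings and balances the two terms is not specified there.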

Title: Way to Specialist: Closing Loop Between Specialized LLM and Evolving Domain Knowledge Graph

Authors: Yutong Zhang, Lixing Chen, Shenghong Li, Nan Cao, Yang Shi, Jiaxin Ding, Zhe Qu, Pan Zhou, Yang Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19064
Pdf URL: https://arxiv.org/pdf/2411.19064
Copy Paste: [[2411.19064]] Way to Specialist: Closing Loop Between Specialized LLM and Evolving Domain Knowledge Graph(https://arxiv.org/abs/2411.19064)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Large language models (LLMs) have demonstrated exceptional performance across a wide variety of domains. Nonetheless, generalist LLMs continue to fall short in reasoning tasks necessitating specialized knowledge. Prior investigations into specialized LLMs focused on domain-specific training, which entails substantial efforts in domain data acquisition and model parameter fine-tuning. To address these challenges, this paper proposes the Way-to-Specialist (WTS) framework, which synergizes retrieval-augmented generation with knowledge graphs (KGs) to enhance the specialized capability of LLMs in the absence of specialized training. In distinction to existing paradigms that merely utilize external knowledge from general KGs or static domain KGs to prompt LLM for enhanced domain-specific reasoning, WTS proposes an innovative "LLM$\circlearrowright$KG" paradigm, which achieves bidirectional enhancement between specialized LLM and domain knowledge graph (DKG). The proposed paradigm encompasses two closely coupled components: the DKG-Augmented LLM and the LLM-Assisted DKG Evolution. The former retrieves question-relevant domain knowledge from DKG and uses it to prompt LLM to enhance the reasoning capability for domain-specific tasks; the latter leverages LLM to generate new domain knowledge from processed tasks and use it to evolve DKG. WTS closes the loop between DKG-Augmented LLM and LLM-Assisted DKG Evolution, enabling continuous improvement in the domain specialization as it progressively answers and learns from domain-specific questions. We validate the performance of WTS on 6 datasets spanning 5 domains. The experimental results show that WTS surpasses the previous SOTA in 4 specialized domains and achieves a maximum performance improvement of 11.3%.
摘要：大型语言模型 (LLM) 在各个领域都表现出色。然而，通用型 LLM 在需要专业知识的推理任务中仍然表现不佳。先前对专用 LLM 的研究主要集中在特定领域的训练上，这需要在领域数据获取和模型参数微调方面付出大量努力。为了应对这些挑战，本文提出了 Way-to-Specialist (WTS) 框架，该框架将检索增强生成与知识图谱 (KG) 相结合，以在没有专门训练的情况下增强 LLM 的专用能力。与仅利用来自通用 KG 或静态领域 KG 的外部知识来提示 LLM 增强特定领域的推理的现有范式不同，WTS 提出了一种创新的“LLM$\circlearrowright$KG”范式，实现了专用 LLM 和领域知识图谱 (DKG) 之间的双向增强。所提出的范式包含两个紧密耦合的组件：DKG-Augmented LLM 和 LLM-Assisted DKG Evolution。前者从 DKG 中检索与问题相关的领域知识，并利用它来提示 LLM 增强领域特定任务的推理能力；后者利用 LLM 从处理后的任务中生成新的领域知识，并用它来进化 DKG。WTS 闭合了 DKG-Augmented LLM 和 LLM-Assisted DKG Evolution 之间的循环，使其在逐步回答和学习领域特定问题的过程中不断改进领域专业化。我们在涵盖 5 个领域的 6 个数据集上验证了 WTS 的性能。实验结果表明，WTS 在 4 个专业领域超越了之前的 SOTA，最大性能提升了 11.3%。

Title: Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs

Authors: Anirudh Phukan, Divyansh, Harshit Kumar Morj, Vaishnavi, Apoorv Saxena, Koustava Goswami
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19187
Pdf URL: https://arxiv.org/pdf/2411.19187
Copy Paste: [[2411.19187]] Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs(https://arxiv.org/abs/2411.19187)
Keywords: language model, llm, hallucination
Abstract: The rapid development of Large Multimodal Models (LMMs) has significantly advanced multimodal understanding by harnessing the language abilities of Large Language Models (LLMs) and integrating modality-specific encoders. However, LMMs are plagued by hallucinations that limit their reliability and adoption. While traditional methods to detect and mitigate these hallucinations often involve costly training or rely heavily on external models, recent approaches utilizing internal model features present a promising alternative. In this paper, we critically assess the limitations of the state-of-the-art training-free technique, the logit lens, in handling generalized visual hallucinations. We introduce a refined method that leverages contextual token embeddings from middle layers of LMMs. This approach significantly improves hallucination detection and grounding across diverse categories, including actions and OCR, while also excelling in tasks requiring contextual understanding, such as spatial relations and attribute comparison. Our novel grounding technique yields highly precise bounding boxes, facilitating a transition from Zero-Shot Object Segmentation to Grounded Visual Question Answering. Our contributions pave the way for more reliable and interpretable multimodal models.
摘要：大型多模态模型 (LMM) 的快速发展通过利用大型语言模型 (LLM) 的语言能力并集成特定模态的编码器，显著提高了多模态理解。然而，LMM 受到幻觉的困扰，限制了其可靠性和采用率。虽然检测和缓解这些幻觉的传统方法通常需要昂贵的训练或严重依赖外部模型，但最近利用内部模型特征的方法提供了一种有希望的替代方案。在本文中，我们批判性地评估了最先进的无训练技术 logit lens 在处理广义视觉幻觉方面的局限性。我们介绍了一种利用 LMM 中间层的上下文标记嵌入的改进方法。这种方法显著提高了跨不同类别（包括动作和 OCR）的幻觉检测和基础，同时在需要上下文理解的任务（例如空间关系和属性比较）中也表现出色。我们新颖的基础技术产生了高度精确的边界框，促进了从零样本对象分割到基础视觉问答的过渡。我们的贡献为更可靠、更易解释的多模式模型铺平了道路。

Title: An Extensive Evaluation of Factual Consistency in Large Language Models for Data-to-Text Generation

Authors: Joy Mahapatra, Utpal Garain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19203
Pdf URL: https://arxiv.org/pdf/2411.19203
Copy Paste: [[2411.19203]] An Extensive Evaluation of Factual Consistency in Large Language Models for Data-to-Text Generation(https://arxiv.org/abs/2411.19203)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown exceptional performance across various Data-to-Text Generation (DTG) tasks. However, generating factually consistent text in DTG remains challenging for LLMs. Despite this, in-depth evaluations of LLM factual consistency for DTG remain missing in the current literature. This paper addresses this gap by providing an extensive evaluation of factual consistency in LLMs for DTG. Our evaluation covers five widely used DTG datasets (E2E, ViGGo, WikiTableText, DART, and WebNLG) and five prominent LLM families (T5, BART, OPT, BLOOM, and Llama 2). To ensure a thorough evaluation of factual consistency, we use four state-of-the-art automatic metrics and include essential human assessments. Our extensive evaluations reveal three key findings regarding factual consistency in LLMs for DTG. First, Llama 2 often excels in generating factually consistent text, although smaller models like T5 and BART can achieve strong factual consistency on larger, lexically less-diverse datasets. Second, the average rate of change (AROC) indicates that increasing model size (number of model trainable parameters) generally enhances factual consistency of LLMs in DTG. Third, we observe that source-reference divergence (i.e., when the reference text diverges semantically from the source) typically reduces the factual consistency of LLMs in DTG.
摘要：大型语言模型 (LLM) 在各种数据到文本生成 (DTG) 任务中表现出色。然而，在 DTG 中生成事实一致的文本对于 LLM 来说仍然具有挑战性。尽管如此，当前文献中仍然缺乏对 DTG 的 LLM 事实一致性的深入评估。本文通过对 DTG 的 LLM 事实一致性进行广泛的评估来解决这一空白。我们的评估涵盖了五个广泛使用的 DTG 数据集 (E2E、ViGGo、WikiTableText、DART 和 WebNLG) 和五个著名的 LLM 系列 (T5、BART、OPT、BLOOM 和 Llama 2)。为了确保对事实一致性进行全面评估，我们使用了四种最先进的自动指标并包括必要的人工评估。我们广泛的评估揭示了关于 DTG 的 LLM 事实一致性的三个关键发现。首先，Llama 2 通常在生成事实一致的文本方面表现出色，尽管像 T5 和 BART 这样的小型模型可以在更大、词汇多样性较低的数据集上实现强大的事实一致性。其次，平均变化率 (AROC) 表明，增加模型大小（模型可训练参数的数量）通常会增强 DTG 中 LLM 的事实一致性。第三，我们观察到源-参考分歧（即当参考文本在语义上与源文本不同时）通常会降低 DTG 中 LLM 的事实一致性。

Title: How far can bias go? -- Tracing bias from pretraining data to alignment

Authors: Marion Thaler, Abdullatif Köksal, Alina Leidinger, Anna Korhonen, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19240
Pdf URL: https://arxiv.org/pdf/2411.19240
Copy Paste: [[2411.19240]] How far can bias go? -- Tracing bias from pretraining data to alignment(https://arxiv.org/abs/2411.19240)
Keywords: llm, prompt
Abstract: As LLMs are increasingly integrated into user-facing applications, addressing biases that perpetuate societal inequalities is crucial. While much work has gone into measuring or mitigating biases in these models, fewer studies have investigated their origins. Therefore, this study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot prompting and token co-occurrence analyses, we explore how biases in training data influence model outputs. Our findings reveal that biases present in pre-training data are amplified in model outputs. The study also examines the effects of prompt types, hyperparameters, and instruction-tuning on bias expression, finding that instruction-tuning partially alleviates representational bias while still maintaining overall stereotypical gender associations, whereas hyperparameters and prompting variation have a lesser effect on bias expression. Our research traces bias throughout the LLM development pipeline and underscores the importance of mitigating bias at the pretraining stage.
摘要：随着 LLM 越来越多地融入面向用户的应用程序中，解决导致社会不平等的偏见至关重要。虽然已经投入了大量精力来衡量或减轻这些模型中的偏见，但很少有研究调查其起源。因此，本研究考察了预训练数据中的性别职业偏见与其在 LLM 中的表现之间的相关性，重点关注 Dolma 数据集和 OLMo 模型。使用零样本提示和 token 共现分析，我们探索训练数据中的偏见如何影响模型输出。我们的研究结果表明，预训练数据中存在的偏见在模型输出中被放大。该研究还研究了提示类型、超参数和指令调整对偏见表达的影响，发现指令调整部分缓解了表征偏见，同时仍保持了整体刻板的性别关联，而超参数和提示变化对偏见表达的影响较小。我们的研究追踪了整个 LLM 开发流程中的偏见，并强调了在预训练阶段减轻偏见的重要性。

Title: Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows

Authors: Clémence Sebe, Sarah Cohen-Boulakia, Olivier Ferret, Aurélie Névéol
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19295
Pdf URL: https://arxiv.org/pdf/2411.19295
Copy Paste: [[2411.19295]] Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows(https://arxiv.org/abs/2411.19295)
Keywords: language model
Abstract: Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles with source code in public repositories. Extracting detailed workflow information from articles can improve accessibility and reusability but is hindered by limited annotated corpora. To address this, we framed the problem as a low-resource extraction task and tested four strategies: 1) creating a tailored annotated corpus, 2) few-shot named-entity recognition (NER) with an autoregressive language model, 3) NER using masked language models with existing and new corpora, and 4) integrating workflow knowledge into NER models. Using BioToFlow, a new corpus of 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a 70.4 F-measure, comparable to inter-annotator agreement. While knowledge integration improved performance for specific entities, it was less effective across the entire information schema. Our results demonstrate that high-performance information extraction for bioinformatics workflows is achievable.
摘要：生物信息学工作流程对于复杂的生物数据分析至关重要，通常在公共存储库中带有源代码的科学文章中描述。从文章中提取详细的工作流程信息可以提高可访问性和可重用性，但有限的注释语料库阻碍了这一过程。为了解决这个问题，我们将问题定义为低资源提取任务，并测试了四种策略：1) 创建定制的注释语料库，2) 使用自回归语言模型进行少量命名实体识别 (NER)，3) 使用带有现有和新语料库的掩码语言模型进行 NER，以及 4) 将工作流知识集成到 NER 模型中。使用 BioToFlow（一个包含 52 篇文章的新语料库，其中注释了 16 个实体），基于 SciBERT 的 NER 模型实现了 70.4 F 度量，与注释者之间的一致性相当。虽然知识集成提高了特定实体的性能，但在整个信息模式中效果较差。我们的结果表明，可以实现生物信息学工作流程的高性能信息提取。
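
The F-measure reported above is an entity-level score. As a minimal sketch of how such a score is computed, the snippet below assumes exact matching on span and label; the abstract does not say which matching criterion the authors used, and the spans and label names here are made up for illustration.

```python
def entity_f1(gold, pred):
    """Entity-level precision, recall, and F1 under exact span+label matching."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # entities predicted with the right span and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical annotations: (start_token, end_token, label)
gold = [(0, 1, "Tool"), (5, 6, "Data"), (9, 9, "Environment")]
pred = [(0, 1, "Tool"), (5, 6, "Method"), (9, 9, "Environment")]
precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

A prediction with the right span but the wrong label counts as both a false positive and a false negative, which is why a single labeling error lowers precision and recall together.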

Title: DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities

Authors: Hui Dai, Dan Pechi, Xinyi Yang, Garvit Banga, Raghav Mantri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19360
Pdf URL: https://arxiv.org/pdf/2411.19360
Copy Paste: [[2411.19360]] DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities(https://arxiv.org/abs/2411.19360)
Keywords: language model, gpt, llm
Abstract: The Needle-in-a-haystack (NIAH) test is a general task used to assess language models' (LMs') abilities to recall particular information from long input context. This framework, however, does not provide a means of analyzing what factors, beyond context length, contribute to LMs' abilities or inabilities to separate and recall needles from their haystacks. To provide a systematic means of assessing what features contribute to LMs' NIAH capabilities, we developed a synthetic benchmark called DENIAHL (Data-oriented Evaluation of NIAH for LLM's). Our work expands on previous NIAH studies by ablating NIAH features beyond typical context length, including data type, size, and patterns. We find stark differences between the performance of GPT-3.5 and LLaMA 2-7B on DENIAHL, and drops in recall performance when features like item size are increased and, to some degree, when data type is changed from numbers to letters. This has implications for increasingly large context models, demonstrating that factors beyond the number of items impact NIAH capabilities.
摘要：大海捞针 (NIAH) 测试是一项通用任务，用于评估语言模型 (LM) 从长输入上下文中回忆特定信息的能力。然而，这个框架并没有提供一种方法来分析除了上下文长度之外，哪些因素会影响 LM 从大海捞针中分离和回忆的能力。为了提供一种系统的方法来评估哪些特征有助于 LM 的 NIAH 能力，我们开发了一个综合基准，称为 DENIAHL（面向 LLM 的 NIAH 数据导向评估）。我们的工作扩展了之前的 NIAH 研究，消除了典型上下文长度之外的 NIAH 特征，包括数据类型、大小和模式。我们发现 GPT-3.5 和 LLaMA 2-7B 在 DENIAHL 上的表现存在明显差异，当项目大小等特征增加时，以及当数据类型从数字更改为字母时，回忆性能会有所下降。这对越来越大的上下文模型有影响，表明项目编号以外的因素也会影响 NIAH 能力。
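
To make the ablated features concrete, here is a minimal sketch of a synthetic haystack generator in which item count, item size, and data type can be varied independently. The key-value layout, the question wording, and all function names are assumptions for illustration; the abstract names the features but not the prompt format.

```python
import random
import string


def make_item(rng, data_type, size):
    """Random string of `size` characters drawn from digits or lowercase letters."""
    alphabet = string.digits if data_type == "numbers" else string.ascii_lowercase
    return "".join(rng.choice(alphabet) for _ in range(size))


def build_haystack(num_items, item_size, data_type, seed=0):
    """Return (prompt, expected_answer) for one synthetic key-value recall task."""
    alphabet_size = 10 if data_type == "numbers" else 26
    if alphabet_size ** item_size < num_items:
        raise ValueError("item_size too small to create that many distinct keys")
    rng = random.Random(seed)
    pairs = {}
    while len(pairs) < num_items:  # distinct keys, random values
        pairs[make_item(rng, data_type, item_size)] = make_item(rng, data_type, item_size)
    needle_key = rng.choice(list(pairs))
    context = "\n".join(f"{key}: {value}" for key, value in pairs.items())
    prompt = f"{context}\n\nWhat value is paired with the key {needle_key}?"
    return prompt, pairs[needle_key]


prompt, answer = build_haystack(num_items=20, item_size=8, data_type="letters")
```

Holding `num_items` fixed while sweeping `item_size` or switching `data_type` isolates the factors the abstract reports as affecting recall.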

Title: Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models

Authors: Tian Yu, Shaolei Zhang, Yang Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19443
Pdf URL: https://arxiv.org/pdf/2411.19443
Copy Paste: [[2411.19443]] Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models(https://arxiv.org/abs/2411.19443)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Iterative retrieval refers to the process in which the model continuously queries the retriever during generation to enhance the relevance of the retrieved knowledge, thereby improving the performance of Retrieval-Augmented Generation (RAG). Existing work typically employs few-shot prompting or manually constructed rules to implement iterative retrieval. This introduces additional inference overhead and overlooks the remarkable reasoning capabilities of Large Language Models (LLMs). In this paper, we introduce Auto-RAG, an autonomous iterative retrieval model centered on the LLM's powerful decision-making capabilities. Auto-RAG engages in multi-turn dialogues with the retriever, systematically planning retrievals and refining queries to acquire valuable knowledge. This process continues until sufficient external information is gathered, at which point the results are presented to the user. To this end, we develop a method for autonomously synthesizing reasoning-based decision-making instructions in iterative retrieval and fine-tune the latest open-source LLMs. The experimental results indicate that Auto-RAG is capable of autonomous iterative interaction with the retriever, effectively leveraging the remarkable reasoning and decision-making abilities of LLMs, which leads to outstanding performance across six benchmarks. Further analysis reveals that Auto-RAG can autonomously adjust the number of iterations based on the difficulty of the questions and the utility of the retrieved knowledge, without requiring any human intervention. Moreover, Auto-RAG expresses the iterative retrieval process in natural language, enhancing interpretability while providing users with a more intuitive experience. Code is available at this https URL.
摘要：迭代检索是指模型在生成过程中不断向检索器进行查询，以增强检索到的知识的相关性，从而提升检索增强生成（RAG）的性能。现有工作通常采用少样本提示或人工构造规则来实现迭代检索，这引入了额外的推理开销，并且忽略了大型语言模型（LLM）卓越的推理能力。本文介绍了一种以LLM强大的决策能力为核心的自主迭代检索模型Auto-RAG。Auto-RAG与检索器进行多轮对话，系统地规划检索并细化查询以获取有价值的知识。这个过程持续到收集到足够的外部信息，此时将结果呈现给用户。为此，我们开发了一种在迭代检索中自主合成基于推理的决策指令的方法，并对最新的开源LLM进行了微调。实验结果表明，Auto-RAG 能够与检索器进行自主迭代交互，有效利用 LLM 卓越的推理和决策能力，在六个基准测试中取得优异表现。进一步分析发现，Auto-RAG 可以根据问题的难度和检索到的知识的实用性自主调整迭代次数，而无需任何人工干预。此外，Auto-RAG 以自然语言表达迭代检索过程，增强了可解释性，同时为用户提供了更直观的体验。代码位于此 https URL。
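
The retrieve-decide-refine cycle described above can be sketched as a small control loop. This is only an illustration of the loop's shape, not the authors' implementation: in Auto-RAG the decision step is made by a fine-tuned LLM, whereas `toy_decide` below is a hard-coded stand-in, and the corpus, function names, and `max_iters` cap are all hypothetical.

```python
def iterative_retrieval(question, retrieve, decide, max_iters=5):
    """Query the retriever until the decision step reports enough evidence."""
    evidence, query = [], question
    for step in range(1, max_iters + 1):
        evidence.extend(retrieve(query))
        action, payload = decide(question, evidence)
        if action == "answer":
            return payload, step
        query = payload  # refined query for the next round
    return None, max_iters  # gave up without an answer


# Toy stand-ins for the retriever and the decision-making model.
CORPUS = {
    "who directed Inception": ["Inception was directed by Christopher Nolan."],
    "Christopher Nolan birthplace": ["Christopher Nolan was born in London."],
}


def toy_retrieve(query):
    return CORPUS.get(query, [])


def toy_decide(question, evidence):
    text = " ".join(evidence)
    if "born in London" in text:
        return "answer", "London"
    if "Christopher Nolan" in text:
        return "query", "Christopher Nolan birthplace"
    return "query", "who directed Inception"


answer, steps = iterative_retrieval(
    "Where was the director of Inception born?", toy_retrieve, toy_decide
)
```

The number of rounds is not fixed in advance: the loop ends as soon as the decision step returns an answer, which mirrors the abstract's point that the iteration count adapts to question difficulty.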

Title: Beyond Surface Structure: A Causal Assessment of LLMs' Comprehension Ability

Authors: Yujin Han, Lei Xu, Sirui Chen, Difan Zou, Chaochao Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19456
Pdf URL: https://arxiv.org/pdf/2411.19456
Copy Paste: [[2411.19456]] Beyond Surface Structure: A Causal Assessment of LLMs' Comprehension Ability(https://arxiv.org/abs/2411.19456)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable capability in natural language tasks, yet debate persists on whether they truly comprehend deep structure (i.e., core semantics) or merely rely on surface structure (e.g., presentation format). Prior studies observe that LLMs' performance declines when intervening on surface structure, arguing their success relies on surface structure recognition. However, surface structure sensitivity does not prevent deep structure comprehension. Rigorously evaluating LLMs' capability requires analyzing both, yet deep structure is often overlooked. To this end, we assess LLMs' comprehension ability using causal mediation analysis, aiming to fully discover the capability of using both deep and surface structures. Specifically, we formulate the comprehension of deep structure as direct causal effect (DCE) and that of surface structure as indirect causal effect (ICE), respectively. To address the non-estimability of original DCE and ICE -- stemming from the infeasibility of isolating mutual influences of deep and surface structures, we develop the corresponding quantifiable surrogates, including approximated DCE (ADCE) and approximated ICE (AICE). We further apply the ADCE to evaluate a series of mainstream LLMs, showing that most of them exhibit deep structure comprehension ability, which grows along with the prediction accuracy. Comparing ADCE and AICE demonstrates closed-source LLMs rely more on deep structure, while open-source LLMs are more surface-sensitive, which decreases with model scale. Theoretically, ADCE is a bidirectional evaluation, which measures both the sufficiency and necessity of deep structure changes in causing output variations, thus offering a more comprehensive assessment than accuracy, a common evaluation in LLMs. Our work provides new insights into LLMs' deep structure comprehension and offers novel methods for LLMs evaluation.
摘要：大型语言模型 (LLM) 在自然语言任务中表现出了卓越的能力，但关于它们是否真正理解深层结构（即核心语义）还是仅仅依赖于表层结构（例如呈现格式）的争论仍然存在。先前的研究观察到，LLM 在干预表层结构时性能会下降，认为它们的成功依赖于表层结构识别。然而，表层结构敏感性并不会妨碍深层结构理解。严格评估 LLM 的能力需要同时分析两者，但深层结构往往被忽视。为此，我们使用因果中介分析来评估 LLM 的理解能力，旨在充分发现使用深层和表层结构的能力。具体而言，我们分别将深层结构的理解表述为直接因果效应 (DCE)，将表层结构的理解表述为间接因果效应 (ICE)。针对原始 DCE 和 ICE 的不可估计性——源于无法隔离深层和表层结构的相互影响，我们开发了相应的可量化替代品，包括近似 DCE（ADCE）和近似 ICE（AICE）。我们进一步应用 ADCE 评估了一系列主流 LLM，结果表明大多数 LLM 都表现出深层结构理解能力，并且随着预测精度的提高而增长。比较 ADCE 和 AICE 表明，闭源 LLM 更多地依赖于深层结构，而开源 LLM 更倾向于表面，并且随着模型规模的增加而降低。从理论上讲，ADCE 是一种双向评估，它既衡量深层结构变化在导致输出变化方面的充分性，也衡量其必要性，因此比准确性（LLM 中的常见评估）提供了更全面的评估。我们的工作为 LLM 的深层结构理解提供了新的见解，并为 LLM 评估提供了新方法。

Title: A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models

Authors: Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19477
Pdf URL: https://arxiv.org/pdf/2411.19477
Copy Paste: [[2411.19477]] A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models(https://arxiv.org/abs/2411.19477)
Keywords: language model, llm
Abstract: We propose a general two-stage algorithm that enjoys a provable scaling law for the test-time compute of large language models (LLMs). Given an input problem, the proposed algorithm first generates $N$ candidate solutions, and then chooses the best one via a multiple-round knockout tournament where each pair of candidates is compared $K$ times and only the winners move on to the next round. In a minimalistic implementation, both stages can be executed with a black-box LLM alone and nothing else (e.g., no external verifier or reward model), and a total of $N \times (K + 1)$ highly parallelizable LLM calls are needed for solving an input problem. Assuming that a generated candidate solution is correct with probability $p_{\text{gen}} > 0$ and a comparison between a pair of correct and incorrect solutions identifies the right winner with probability $p_{\text{comp}} > 0.5$ (i.e., better than a random guess), we prove theoretically that the failure probability of the proposed algorithm decays to zero exponentially with respect to $N$ and $K$: $$\mathbb{P}(\text{final output is incorrect}) \le (1 - p_{\text{gen}})^N + \lceil \log_2 N \rceil e^{-2 K (p_{\text{comp}} - 0.5)^2}.$$ Our empirical results with the challenging MMLU-Pro benchmark validate the technical assumptions, as well as the efficacy of the proposed algorithm and the gains from scaling up its test-time compute.
摘要：我们提出了一种通用的两阶段算法，该算法在大型语言模型 (LLM) 的测试时间计算中具有可证明的缩放律。给定一个输入问题，所提出的算法首先生成 $N$ 个候选解决方案，然后通过多轮淘汰赛选择最佳解决方案，其中每对候选方案被比较 $K$ 次，只有获胜者进入下一轮。在极简实现中，两个阶段都可以仅使用黑盒 LLM 执行，而无需其他任何东西（例如，没有外部验证器或奖励模型），并且总共需要 $N \times (K + 1)$ 个高度可并行的 LLM 调用来解决输入问题。假设生成的候选解决方案正确的概率为 $p_{\text{gen}} > 0$，并且一对正确和错误解决方案之间的比较确定正确赢家的概率为 $p_{\text{comp}} > 0.5$（即优于随机猜测），我们从理论上证明所提算法的失败概率关于 $N$ 和 $K$ 呈指数衰减为零：$$\mathbb{P}(\text{最终输出不正确}) \le (1 - p_{\text{gen}})^N + \lceil \log_2 N \rceil e^{-2 K (p_{\text{comp}} - 0.5)^2}$$。我们在具有挑战性的 MMLU-Pro 基准上的经验结果验证了技术假设，以及所提算法的有效性和扩大测试时间计算所带来的收益。
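
The two-stage procedure and its bound can be checked numerically with a small simulation. The sketch below replaces the LLM with coin flips: each candidate is just a boolean (correct or not), a comparison between one correct and one incorrect candidate picks the right winner with probability `p_comp`, and comparisons between two candidates of the same kind are treated as fair coin flips. That last modeling choice, the bye for an odd candidate, and all function names are assumptions, not details from the abstract.

```python
import math
import random


def knockout(candidates, compare, k, rng):
    """Pair up candidates; each pair is compared k times and the majority winner advances."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            wins_a = sum(compare(a, b, rng) for _ in range(k))
            next_round.append(a if 2 * wins_a >= k else b)
        if len(pool) % 2:
            next_round.append(pool[-1])  # odd candidate out gets a bye
        pool = next_round
    return pool[0]


def failure_bound(n, k, p_gen, p_comp):
    """Right-hand side of the bound stated in the abstract."""
    return (1 - p_gen) ** n + math.ceil(math.log2(n)) * math.exp(-2 * k * (p_comp - 0.5) ** 2)


def simulate(n, k, p_gen, p_comp, trials=500, seed=0):
    """Empirical failure rate when candidates are booleans (True = correct)."""
    rng = random.Random(seed)

    def compare(a, b, rng):
        if a == b:  # both correct or both incorrect: coin flip
            return rng.random() < 0.5
        picked_correct = rng.random() < p_comp
        return picked_correct if a else not picked_correct  # True means `a` wins

    failures = 0
    for _ in range(trials):
        candidates = [rng.random() < p_gen for _ in range(n)]
        failures += not knockout(candidates, compare, k, rng)
    return failures / trials


bound = failure_bound(n=16, k=51, p_gen=0.3, p_comp=0.7)
rate = simulate(n=16, k=51, p_gen=0.3, p_comp=0.7)
print(f"empirical failure rate {rate:.4f} vs. bound {bound:.4f}")
```

An odd `k` is used so that no pair can tie. With small `k` the second term of the bound exceeds 1 and says nothing, which shows why the comparison budget has to grow along with `N`.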

Title: COLD: Causal reasOning in cLosed Daily activities

Authors: Abhinav Joshi, Areeb Ahmad, Ashutosh Modi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19500
Pdf URL: https://arxiv.org/pdf/2411.19500
Copy Paste: [[2411.19500]] COLD: Causal reasOning in cLosed Daily activities(https://arxiv.org/abs/2411.19500)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown state-of-the-art performance in a variety of tasks, including arithmetic and reasoning; however, to gauge the intellectual capabilities of LLMs, causal reasoning has become a reliable proxy for validating a general understanding of the mechanics and intricacies of the world similar to humans. Previous works in natural language processing (NLP) have either focused on open-ended causal reasoning via causal commonsense reasoning (CCR) or framed a symbolic representation-based question answering for theoretically backed-up analysis via a causal inference engine. The former adds an advantage of real-world grounding but lacks theoretically backed-up analysis/validation, whereas the latter is far from real-world grounding. In this work, we bridge this gap by proposing the COLD (Causal reasOning in cLosed Daily activities) framework, which is built upon human understanding of daily real-world activities to reason about the causal nature of events. We show that the proposed framework facilitates the creation of an enormous number of causal queries (~9 million) and comes close to the mini-Turing test, simulating causal reasoning to evaluate the understanding of a daily real-world task. We evaluate multiple LLMs on the created causal queries and find that causal reasoning is challenging even for activities trivial to humans. We further explore the causal reasoning abilities of LLMs using the backdoor criterion to determine the causal strength between events.
摘要：大型语言模型 (LLM) 在各种任务（包括算术和推理）中都表现出了最佳性能；然而，为了衡量 LLM 的智力能力，因果推理已成为一种可靠的代理，用于验证对与人类相似的世界机制和复杂性的一般理解。自然语言处理 (NLP) 领域的先前研究要么侧重于通过因果常识推理 (CCR) 进行开放式因果推理，要么通过因果推理引擎构建基于符号表示的问答系统，以进行理论支持的分析。前者增加了现实世界基础的优势，但缺乏理论支持的分析/验证，而后者远离现实世界基础。在这项工作中，我们通过提出 COLD（封闭日常活动中的因果推理）框架来弥补这一差距，该框架建立在人类对日常现实世界活动的理解之上，以推理事件的因果性质。我们表明，所提出的框架有助于创建大量因果查询（约 900 万个），并且接近微型图灵测试，模拟因果推理来评估对日常现实世界任务的理解。我们对创建的因果查询上的多个 LLM 进行了评估，发现即使对于人类来说微不足道的活动，因果推理也是具有挑战性的。我们进一步探索（LLM 的因果推理能力），使用后门标准来确定事件之间的因果强度。

Title: Training Agents with Weakly Supervised Feedback from Large Language Models

Authors: Dihong Gong, Pu Lu, Zelong Wang, Meng Zhou, Xiuqiang He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19547
Pdf URL: https://arxiv.org/pdf/2411.19547
Copy Paste: [[2411.19547]] Training Agents with Weakly Supervised Feedback from Large Language Models(https://arxiv.org/abs/2411.19547)
Keywords: language model, gpt, llm, agent
Abstract: Large Language Models (LLMs) offer a promising basis for creating agents that can tackle complex tasks through iterative environmental interaction. Existing methods either require these agents to mimic expert-provided trajectories or rely on definitive environmental feedback for reinforcement learning, which limits their application to specific scenarios like gaming or code generation. This paper introduces a novel training method for LLM-based agents using weakly supervised signals from a critic LLM, bypassing the need for expert trajectories or definitive feedback. Our agents are trained in an iterative manner, where they initially generate trajectories through environmental interaction. Subsequently, a critic LLM selects a subset of good trajectories, which are then used to update the agents, enabling them to generate improved trajectories in the next iteration. Extensive tests on the API-bank dataset show consistent improvement in our agents' capabilities and comparable performance to GPT-4, despite using open-source models with far fewer parameters.
摘要：大型语言模型 (LLM) 为创建能够通过迭代环境交互处理复杂任务的代理提供了有希望的基础。现有方法要么要求这些代理模仿专家提供的轨迹，要么依靠明确的环境反馈进行强化学习，这限制了它们的应用范围，例如游戏或代码生成。本文介绍了一种基于 LLM 的代理的新训练方法，该方法使用来自批评者 LLM 的弱监督信号，从而无需专家轨迹或明确反馈。我们的代理以迭代方式进行训练，其中它们最初通过环境交互生成轨迹。随后，批评者 LLM 选择一组好的轨迹，然后使用这些轨迹更新代理，使它们能够在下一次迭代中生成改进的轨迹。尽管使用参数少得多的开源模型，但在 API-bank 数据集上进行的大量测试表明，我们的代理的能力持续提高，性能与 GPT-4 相当。
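
The generate, select, update cycle can be written as a short generic loop. The sketch below is a toy under heavy assumptions: the "policy" is a single number, a "trajectory" is a sampled quality value, the critic only sees quality through added noise (standing in for weak supervision), and "updating" means averaging the kept trajectories. None of these stand-ins come from the paper.

```python
import random


def train_with_critic(policy, rollout, critic_select, update, iterations):
    """Generic loop: roll out, let the critic keep a subset, update on that subset."""
    for _ in range(iterations):
        trajectories = rollout(policy)
        kept = critic_select(trajectories)
        policy = update(policy, kept)
    return policy


rng = random.Random(0)


def rollout(policy, n=32):
    # Each trajectory's true quality is scattered around the current policy level.
    return [rng.gauss(policy, 1.0) for _ in range(n)]


def critic_select(trajectories, keep=8, noise=0.5):
    # The critic ranks by a noisy view of quality and keeps the top `keep`.
    ranked = sorted(trajectories, key=lambda t: t + rng.gauss(0.0, noise), reverse=True)
    return ranked[:keep]


def update(policy, kept):
    # Stand-in for fine-tuning on the kept trajectories.
    return sum(kept) / len(kept)


initial_policy = 0.0
final_policy = train_with_critic(initial_policy, rollout, critic_select, update, iterations=5)
print(f"policy quality: {initial_policy:.2f} -> {final_policy:.2f}")
```

Even with an imperfect critic, the kept subset is better than average, so each round moves the policy upward; this is the intuition behind training without expert trajectories or definitive feedback.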

Title: Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning

Authors: Kaustubh Ponkshe, Raghav Singhal, Eduard Gorbunov, Alexey Tumanov, Samuel Horvath, Praneeth Vepakomma
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19557
Pdf URL: https://arxiv.org/pdf/2411.19557
Copy Paste: [[2411.19557]] Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning(https://arxiv.org/abs/2411.19557)
Keywords: language model, llm
Abstract: Low-rank adapters have become a standard approach for efficiently fine-tuning large language models (LLMs), but they often fall short of achieving the performance of full fine-tuning. We propose a method, LoRA Silver Bullet or LoRA-SB, that approximates full fine-tuning within low-rank subspaces using a carefully designed initialization strategy. We theoretically demonstrate that the architecture of LoRA-XS, which inserts a trainable (r x r) matrix between B and A while keeping other matrices fixed, provides the precise conditions needed for this approximation. We leverage its constrained update space to achieve optimal scaling for high-rank gradient updates while removing the need for hyperparameter tuning. We prove that our initialization offers an optimal low-rank approximation of the initial gradient and preserves update directions throughout training. Extensive experiments across mathematical reasoning, commonsense reasoning, and language understanding tasks demonstrate that our approach exceeds the performance of standard LoRA while using 27-90x fewer parameters, and comprehensively outperforms LoRA-XS. Our findings establish that it is possible to simulate full fine-tuning in low-rank subspaces, and achieve significant efficiency gains without sacrificing performance. Our code is publicly available at this https URL.
摘要：低秩适配器已成为高效微调大型语言模型 (LLM) 的标准方法，但它们通常无法达到完全微调的性能。我们提出了一种方法，LoRA Silver Bullet 或 LoRA-SB，使用精心设计的初始化策略在低秩子空间内近似完全微调。我们从理论上证明了 LoRA-XS 的架构（在 B 和 A 之间插入可训练的 (r x r) 矩阵，同时保持其他矩阵固定）提供了这种近似所需的精确条件。我们利用其受限的更新空间来实现高秩梯度更新的最佳缩放，同时消除了超参数调整的需要。我们证明我们的初始化提供了初始梯度的最佳低秩近似，并在整个训练过程中保留了更新方向。在数学推理、常识推理和语言理解任务中进行的大量实验表明，我们的方法在使用 27-90 倍更少的参数的情况下超过了标准 LoRA 的性能，并且全面优于 LoRA-XS。我们的研究结果表明，可以在低秩子空间中模拟完全微调，并在不牺牲性能的情况下实现显著的效率提升。我们的代码可在此 https URL 上公开获取。
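
To illustrate why a trainable $r \times r$ matrix between frozen B and A is so parameter-efficient, the sketch below compares trainable-parameter counts for a single weight matrix and shows the forward pass $h = Wx + BRAx$. It uses plain Python lists, omits any scaling factor, and does not implement the LoRA-SB initialization itself; the dimensions and function names are illustrative assumptions.

```python
def lora_trainable_params(d_in, d_out, r):
    """Standard LoRA trains B (d_out x r) and A (r x d_in)."""
    return r * (d_in + d_out)


def lora_xs_trainable_params(r):
    """With B and A frozen, only the r x r matrix R is trained."""
    return r * r


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_xs_forward(x, W, B, R, A):
    """h = W x + B (R (A x)); W, B, and A are frozen, R is the trainable part."""
    base = matvec(W, x)
    delta = matvec(B, matvec(R, matvec(A, x)))
    return [b + d for b, d in zip(base, delta)]


# Parameter counts for one hypothetical 4096 x 4096 projection at rank 8.
print(lora_trainable_params(4096, 4096, 8))  # 65536
print(lora_xs_trainable_params(8))           # 64

# Tiny forward pass with rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0]]    # r x d_in
R = [[2.0]]         # r x r
B = [[1.0], [0.0]]  # d_out x r
h = lora_xs_forward([3.0, 4.0], W, B, R, A)
```

Because the trainable count depends only on the rank and not on the layer width, the savings grow with model size; the 27-90x figure in the abstract comes from the authors' own experimental settings, not from this example.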

Title: Ensemble Watermarks for Large Language Models

Authors: Georg Niess, Roman Kern
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19563
Pdf URL: https://arxiv.org/pdf/2411.19563
Copy Paste: [[2411.19563]] Ensemble Watermarks for Large Language Models(https://arxiv.org/abs/2411.19563)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) has made it increasingly difficult to distinguish between text written by humans and machines. While watermarks already exist for LLMs, they often lack flexibility, and struggle with attacks such as paraphrasing. To address these issues, we propose a multi-feature method for generating watermarks that combines multiple distinct watermark features into an ensemble watermark. Concretely, we combine acrostica and sensorimotor norms with the established red-green watermark to achieve a 98% detection rate. After a paraphrasing attack the performance remains high with 95% detection rate. The red-green feature alone as baseline achieves a detection rate of 49%. The evaluation of all feature combinations reveals that the ensemble of all three consistently has the highest detection rate across several LLMs and watermark strength settings. Due to the flexibility of combining features in the ensemble, various requirements and trade-offs can be addressed. Additionally, for all ensemble configurations the same detection function can be used without adaptations. This method is particularly of interest to facilitate accountability and prevent societal harm.
摘要：大型语言模型 (LLM) 的快速发展使得区分人类和机器编写的文本变得越来越困难。虽然 LLM 已经存在水印，但它们通常缺乏灵活性，并且难以应对诸如释义之类的攻击。为了解决这些问题，我们提出了一种多特征生成水印的方法，该方法将多个不同的水印特征组合成一个集成水印。具体来说，我们将首字母缩略词和感觉运动规范与已建立的红绿水印相结合，以实现 98% 的检测率。在释义攻击之后，性能仍然很高，检测率为 95%。仅以红绿特征作为基线即可实现 49% 的检测率。对所有特征组合的评估表明，这三种特征的集成在多个 LLM 和水印强度设置中始终具有最高的检测率。由于在集成中组合特征的灵活性，可以解决各种要求和权衡。此外，对于所有集成配置，可以使用相同的检测功能而无需进行调整。这种方法对于促进问责和防止社会危害特别有用。
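
For readers unfamiliar with the red-green baseline feature, detection reduces to a one-sided z-test on how many tokens fall in a pseudo-random "green" list. The sketch below is a simplified illustration of that test only: it hashes word pairs with SHA-256 instead of seeding from a real tokenizer's IDs, and it does not cover the acrostic or sensorimotor features of the ensemble. The vocabulary and function names are hypothetical.

```python
import hashlib
import math


def is_green(prev_token, token, gamma=0.5):
    """Pseudo-randomly place `token` on the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def green_z_score(tokens, gamma=0.5):
    """z = (green_count - gamma * T) / sqrt(T * gamma * (1 - gamma)) over T scored tokens."""
    scored = len(tokens) - 1  # the first token has no predecessor
    green = sum(is_green(prev, cur, gamma) for prev, cur in zip(tokens, tokens[1:]))
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))


# Toy "watermarked" text: a generator that always picks a green word.
vocab = [f"w{i}" for i in range(50)]
watermarked = ["start"]
for _ in range(50):
    watermarked.append(next(w for w in vocab if is_green(watermarked[-1], w)))

z = green_z_score(watermarked)
print(f"z-score of fully green text: {z:.2f}")
```

Unwatermarked text should land near z = 0, while text generated with a green-list bias produces a large positive z. Paraphrasing replaces tokens and pulls the score back toward zero, which is the weakness the ensemble is meant to offset.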

Title: KV Shifting Attention Enhances Language Modeling

Authors: Mingyu Xu, Wei Cheng, Bingning Wang, Weipeng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19574
Pdf URL: https://arxiv.org/pdf/2411.19574
Copy Paste: [[2411.19574]] KV Shifting Attention Enhances Language Modeling(https://arxiv.org/abs/2411.19574)
Keywords: language model
Abstract: Current large language models are mainly based on decoder-only transformers, which have strong in-context learning (ICL) capabilities. It is generally believed that an important foundation of this ICL capability is the induction heads mechanism, which requires at least two attention layers. To implement the model's induction ability more efficiently, we revisit the induction heads mechanism and propose KV shifting attention. We theoretically prove that KV shifting attention reduces the model's requirements for the depth and width of the induction heads mechanism. Our experimental results demonstrate that KV shifting attention is beneficial to learning induction heads and language modeling, leading to better performance or faster convergence from toy models to pre-training models with more than 10B parameters.
摘要：当前的大型语言模型主要基于仅解码结构的Transformer，其具有很强的上下文学习（ICL）能力。通常认为其ICL能力的重要基础是诱导头机制，而诱导头机制至少需要两层注意力机制。为了更高效地实现模型的诱导能力，我们重新审视诱导头机制并提出了一种KV移位注意力机制。我们从理论上证明了KV移位注意力机制降低了模型对诱导头机制的深度和宽度的要求。我们的实验结果表明KV移位注意力机制有利于诱导头的学习和语言建模，从而使从玩具模型到具有10B以上参数的预训练模型获得更好的性能或更快的收敛。

Title: In-Context Learning with Noisy Labels

Authors: Junyong Kang, Donghyun Son, Hwanjun Song, Buru Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19581
Pdf URL: https://arxiv.org/pdf/2411.19581
Copy Paste: [[2411.19581]] In-Context Learning with Noisy Labels(https://arxiv.org/abs/2411.19581)
Keywords: language model, llm
Abstract: In-context learning refers to the emerging ability of large language models (LLMs) to perform a target task without additional training, utilizing demonstrations of the task. Recent studies aim to enhance in-context learning performance by selecting more useful demonstrations. However, they overlook the inevitable noisy labels in task demonstrations that arise during real-world labeling. In this paper, we propose a new task, in-context learning with noisy labels, which aims to solve real-world problems for in-context learning where labels in task demonstrations may be corrupted. Moreover, we propose a new method and baseline methods for the new task, inspired by studies in learning with noisy labels. Through experiments, we demonstrate that our proposed method can serve as a safeguard against performance degradation in in-context learning caused by noisy labels.
摘要：情境学习是指大型语言模型 (LLM) 利用任务演示无需额外训练即可执行目标任务的新兴能力。近期研究旨在通过选择更有用的演示来提高情境学习性能。然而，他们忽视了在现实世界中的标记过程中任务演示中不可避免的噪声标签。在本文中，我们提出了一项新任务——带噪声标签的情境学习，旨在解决情境学习的实际问题，即任务演示中的标签可能会被破坏。此外，我们还受到噪声标签学习研究的启发，为这项新任务提出了一种新方法和基线方法。通过实验，我们证明了我们提出的方法可以防止噪声标签导致的情境学习性能下降。

Title: Can Large Language Models Reason about the Region Connection Calculus?

Authors: Anthony G Cohn, Robert E Blackwell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19589
Pdf URL: https://arxiv.org/pdf/2411.19589
Copy Paste: [[2411.19589]] Can Large Language Models Reason about the Region Connection Calculus?(https://arxiv.org/abs/2411.19589)
Keywords: language model, llm
Abstract: Qualitative Spatial Reasoning is a well explored area of Knowledge Representation and Reasoning and has multiple applications ranging from Geographical Information Systems to Robotics and Computer Vision. Recently, many claims have been made for the reasoning capabilities of Large Language Models (LLMs). Here, we investigate the extent to which a set of representative LLMs can perform classical qualitative spatial reasoning tasks on the mereotopological Region Connection Calculus, RCC-8. We conduct three pairs of experiments (reconstruction of composition tables, alignment to human composition preferences, conceptual neighbourhood reconstruction) using state-of-the-art LLMs; in each pair one experiment uses eponymous relations and one, anonymous relations (to test the extent to which the LLM relies on knowledge about the relation names obtained during training). All instances are repeated 30 times to measure the stochasticity of the LLMs.
摘要：定性空间推理是知识表示和推理领域中一个研究得很好的领域，具有从地理信息系统到机器人技术和计算机视觉等多种应用。最近，许多人声称大型语言模型 (LLM) 具有推理能力。在这里，我们研究了一组代表性 LLM 在部分拓扑区域连接演算 RCC-8 上执行经典定性空间推理任务的程度。我们使用最先进的 LLM 进行了三对实验（组合表重建、与人类组合偏好对齐、概念邻域重建）；在每对实验中，一个实验使用同名关系和一个匿名关系（以测试 LLM 对训练期间获得的关系名称知识的依赖程度）。所有实例重复 30 次以测量 LLM 的随机性。
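
For reference, the eight RCC-8 base relations are DC, EC, PO, EQ, TPP, NTPP, TPPi, and NTPPi. A minimal sketch of what they mean geometrically is to classify two closed discs in the plane from their centres and radii. Restricting regions to discs and using a floating-point tolerance `eps` are simplifying assumptions made here; the calculus itself applies to general regions.

```python
import math


def rcc8_discs(c1, r1, c2, r2, eps=1e-9):
    """RCC-8 relation of disc 1 to disc 2 (closed discs given by centre and radius)."""
    d = math.dist(c1, c2)
    if d < eps and abs(r1 - r2) < eps:
        return "EQ"      # identical regions
    if d > r1 + r2 + eps:
        return "DC"      # disconnected
    if abs(d - (r1 + r2)) <= eps:
        return "EC"      # externally connected: boundaries touch from outside
    if abs(d - abs(r1 - r2)) <= eps:
        return "TPP" if r1 < r2 else "TPPi"    # tangential proper part (or inverse)
    if d < abs(r1 - r2):
        return "NTPP" if r1 < r2 else "NTPPi"  # non-tangential proper part (or inverse)
    return "PO"          # partial overlap


examples = {
    "DC": rcc8_discs((0, 0), 1, (3, 0), 1),
    "EC": rcc8_discs((0, 0), 1, (2, 0), 1),
    "PO": rcc8_discs((0, 0), 1, (1, 0), 1),
    "TPP": rcc8_discs((0, 0), 1, (1, 0), 2),
    "NTPP": rcc8_discs((0, 0), 1, (0, 0), 3),
    "NTPPi": rcc8_discs((0, 0), 3, (0, 0), 1),
    "EQ": rcc8_discs((0, 0), 1, (0, 0), 1),
}
```

A composition-table question of the kind posed to the LLMs asks which of these relations can hold between regions x and z, given the relation between x and y and the relation between y and z.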

Title: LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Authors: Taja Kuzman, Nikola Ljubešić
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19638
Pdf URL: https://arxiv.org/pdf/2411.19638
Copy Paste: [[2411.19638]] LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification(https://arxiv.org/abs/2411.19638)
Keywords: language model, gpt, llm
Abstract: With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers' access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop an IPTC Media Topic training dataset through automatic annotation of news articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits a high zero-shot performance on all four languages. Its agreement with human annotators is comparable to that between the human annotators themselves. To mitigate the computational limitations associated with the requirement of processing millions of texts daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve high performance comparable to the teacher model. Furthermore, we explore the impact of the training data size on the performance of the student models and investigate their monolingual, multilingual and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances, and demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.
摘要：随着网上新闻报道数量的不断增加，按主题对新闻报道进行分类（无论使用何种语言）已成为增强读者获取相关内容的关键。为了应对这一挑战，我们提出了一个基于大型语言模型 (LLM) 的师生框架，用于开发合理规模的多语言新闻分类模型，而无需手动数据注释。该框架采用生成式预训练 Transformer (GPT) 模型作为教师模型，通过自动注释斯洛文尼亚语、克罗地亚语、希腊语和加泰罗尼亚语的新闻文章来开发 IPTC 媒体主题训练数据集。教师模型在这四种语言上都表现出很高的零样本性能。它与人类注释者的一致性与人类注释者之间的一致性相当。为了减轻每天处理数百万文本所需的计算限制，在 GPT 注释的数据集上对较小的类似 BERT 的学生模型进行了微调。这些学生模型实现了与教师模型相当的高性能。此外，我们探索了训练数据大小对学生模型性能的影响，并研究了它们的单语、多语和零样本跨语言能力。研究结果表明，学生模型可以用相对较少的训练实例实现高性能，并表现出强大的零样本跨语言能力。最后，我们发布了表现最佳的新闻主题分类器，使用 IPTC 媒体主题模式的顶级类别实现多语言分类。

Title: Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-OASIS

Authors: Alessandro Scirè, Andrei Stefan Bejgu, Simone Tedeschi, Karim Ghonim, Federico Martelli, Roberto Navigli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19655
Pdf URL: https://arxiv.org/pdf/2411.19655
Copy Paste: [[2411.19655]] Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-OASIS(https://arxiv.org/abs/2411.19655)
Keywords: language model, gpt, llm, hallucination
Abstract: After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, thereby preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource for training end-to-end factuality evaluators. LLM-Oasis is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for benchmarking factuality evaluation systems. Our experiments demonstrate that LLM-Oasis presents a significant challenge for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our proposed end-to-end factuality evaluation task, highlighting its potential to drive future research in the field.
摘要：引入大型语言模型 (LLM) 后，自然语言生成 (NLG) 任务（包括文本摘要和机器翻译）的性能得到了显着提升。然而，LLM 仍然会产生包含幻觉的输出，即不以事实信息为基础的内容。因此，开发评估 LLM 事实性的方法已变得迫在眉睫。事实上，最近出现了用于事实性评估的资源。尽管具有挑战性，但这些资源面临以下一个或多个限制：(i) 它们是针对特定任务或领域量身定制的；(ii) 它们的大小有限，从而阻碍了新事实性评估者的培训；(iii) 它们是为更简单的验证任务（例如声明验证）而设计的。为了解决这些问题，我们引入了 LLM-Oasis，据我们所知，这是用于培训端到端事实性评估者的最大资源。LLM-Oasis 是通过从维基百科中提取声明、伪造这些声明的子集并生成事实和非事实文本对来构建的。然后，我们依靠人工注释者来验证数据集的质量，并创建用于对事实性评估系统进行基准测试的黄金标准测试集。我们的实验表明，LLM-Oasis 对最先进的 LLM 提出了重大挑战，GPT-4o 在我们提出的端到端事实性评估任务中实现了高达 60% 的准确率，凸显了其推动该领域未来研究的潜力。

Title: ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

Authors: Wanyue Zhang, Ziyong Li, Wen Yang, Chunlin Leng, Yinan Bai, Qianlong Du, Chengqing Zong, Jiajun Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19668
Pdf URL: https://arxiv.org/pdf/2411.19668
Copy Paste: [[2411.19668]] ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information(https://arxiv.org/abs/2411.19668)
Keywords: language model, llm
Abstract: During the development of large language models (LLMs), pre-training data play a critical role in shaping LLMs' capabilities. In recent years several large-scale and high-quality pre-training datasets have been released to accelerate the research of LLMs, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, focus has increasingly shifted to domain-specific capabilities and safety concerns, making those previous coarse-grained texts insufficient for meeting training requirements. Furthermore, fine-grained information, such as quality, domain and toxicity, is becoming increasingly important in building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper we propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information. First, we employ manually crafted rules to discard explicit noisy texts from raw contents. Second, a quality evaluation model, a domain classifier, and a toxicity evaluation model are designed to assess the remaining cleaned data. Finally, we integrate these three types of fine-grained information for each text. With this approach, we release ChineseWebText2.0, the largest high-quality, fine-grained Chinese text dataset, which comprises 3.8 TB of data; each text is associated with a quality score, domain labels, a toxicity label and a toxicity score, helping LLM researchers select data based on various types of fine-grained information. The data, code, and tool-chain are available at this https URL.
摘要：在大型语言模型 (LLM) 的开发过程中，预训练数据在塑造 LLM 功能方面起着至关重要的作用。近年来，一些大规模高质量预训练数据集相继发布，以加速 LLM 的研究，包括 ChineseWebText1.0、C4、Pile、WanJuan、MAPCC 等。然而，随着 LLM 的不断发展，焦点越来越多地转向特定领域的功能和安全性问题，使得以前的粗粒度文本不足以满足训练要求。此外，质量、领域和毒性等细粒度信息在为各种场景构建强大而可靠的 LLM 时变得越来越重要。为了应对这些挑战，我们在本文中提出了一种名为 MDFG-tool 的新工具链，用于构建具有多维和细粒度信息的大规模高质量中文数据集。首先，我们使用手动制定的规则从原始内容中丢弃显式噪声文本。其次，我们精心设计了质量评估模型、领域分类器和毒性评估模型，分别对剩余的清洗数据进行评估。最后，我们将这三类细粒度信息整合到每篇文本中。通过这种方法，我们发布了最大的、高质量的、细粒度的中文文本 ChineseWebText2.0，它由 3.8TB 组成，每篇文本都与质量分数、领域标签、毒性标签和毒性分数相关联，方便 LLM 研究人员根据各种类型的细粒度信息选择数据。数据、代码和工具链可在此网站上找到，网址为 https
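
The abstract says each text carries a quality score, domain labels, a toxicity label and a toxicity score so that researchers can select data on those dimensions. A minimal Python sketch of such a selection step follows. The field names, score ranges and thresholds are hypothetical assumptions for illustration and are not taken from the released dataset or from MDFG-tool.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TextRecord:
    """One annotated text. All field names are hypothetical."""
    text: str
    quality_score: float   # assumed range 0..1, higher is better
    domains: List[str]     # assumed: zero or more domain labels
    is_toxic: bool         # assumed binary toxicity label
    toxicity_score: float  # assumed range 0..1, higher is more toxic


def select_records(
    records: Iterable[TextRecord],
    domain: str,
    min_quality: float = 0.8,
    max_toxicity: float = 0.1,
) -> List[TextRecord]:
    """Keep records in `domain` that pass the quality and toxicity thresholds."""
    return [
        r
        for r in records
        if domain in r.domains
        and r.quality_score >= min_quality
        and not r.is_toxic
        and r.toxicity_score <= max_toxicity
    ]
```

Under these assumptions, a call such as `select_records(records, "medicine")` would return only clean, high-scoring texts labeled with that domain; the thresholds would need tuning against the real score distributions.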

Title: MIMDE: Exploring the Use of Synthetic vs Human Data for Evaluating Multi-Insight Multi-Document Extraction Tasks

Authors: John Francis, Saba Esnaashari, Anton Poletaev, Sukankana Chakraborty, Youmna Hashem, Jonathan Bright
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19689
Pdf URL: https://arxiv.org/pdf/2411.19689
Copy Paste: [[2411.19689]] MIMDE: Exploring the Use of Synthetic vs Human Data for Evaluating Multi-Insight Multi-Document Extraction Tasks(https://arxiv.org/abs/2411.19689)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in text analysis tasks, yet their evaluation on complex, real-world applications remains challenging. We define a set of tasks, Multi-Insight Multi-Document Extraction (MIMDE) tasks, which involve extracting an optimal set of insights from a document corpus and mapping these insights back to their source documents. These tasks are fundamental to many practical applications, from analyzing survey responses to processing medical records, where identifying and tracing key insights across documents is crucial. We develop an evaluation framework for MIMDE and introduce a novel set of complementary human and synthetic datasets to examine the potential of synthetic data for LLM evaluation. After establishing optimal metrics for comparing extracted insights, we benchmark 20 state-of-the-art LLMs on both datasets. Our analysis reveals a strong correlation (0.71) between the ability of LLMs to extract insights on our two datasets, but synthetic data fails to capture the complexity of document-level analysis. These findings offer crucial guidance for the use of synthetic data in evaluating text analysis systems, highlighting both its potential and limitations.
摘要：大型语言模型 (LLM) 在文本分析任务中表现出了卓越的能力，但它们在复杂的实际应用中的评估仍然具有挑战性。我们定义了一组任务，即多洞察多文档提取 (MIMDE) 任务，该任务涉及从文档语料库中提取一组最佳洞察并将这些洞察映射回其源文档。这项任务对于许多实际应用至关重要，从分析调查响应到处理医疗记录，在这些应用中，跨文档识别和追踪关键洞察至关重要。我们为 MIMDE 开发了一个评估框架，并引入了一组新的互补人工和合成数据集来检查合成数据对 LLM 评估的潜力。在建立了比较提取洞察的最佳指标后，我们对两个数据集上的 20 个最先进的 LLM 进行了基准测试。我们的分析表明，LLM 在我们的两个数据集上提取洞察的能力之间存在很强的相关性 (0.71)，但合成数据无法捕捉文档级分析的复杂性。这些发现为使用合成数据评估文本分析系统提供了重要指导，突出了其潜力和局限性。

Title: INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Authors: Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar, Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, Antoine Bosselut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19799
Pdf URL: https://arxiv.org/pdf/2411.19799
Copy Paste: [[2411.19799]] INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge(https://arxiv.org/abs/2411.19799)
Keywords: language model, llm
Abstract: The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.
摘要：大型语言模型 (LLM) 在不同语言之间的性能差异阻碍了它们在许多地区的有效部署，从而抑制了生成式 AI 工具在许多社区中的潜在经济和社会价值。然而，由于缺乏除英语以外的其他语言的高质量评估资源，许多语言的功能性 LLM（即多语言 LLM）的开发受到瓶颈制约。此外，当前多语言基准构建实践通常会翻译英语资源，而忽略了使用多语言系统的环境的区域和文化知识。在这项工作中，我们从本地考试来源构建了一个包含 197,243 个 QA 对的评估套件，以衡量多语言 LLM 在各种区域环境中的能力。我们的新资源 INCLUDE 是一个涵盖 44 种书面语言的全面知识和推理中心基准，用于评估多语言 LLM 在实际部署语言环境中的性能。

Title: Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation

Authors: Dimosthenis Antypas, Indira Sen, Carla Perez-Almendros, Jose Camacho-Collados, Francesco Barbieri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19832
Pdf URL: https://arxiv.org/pdf/2411.19832
Copy Paste: [[2411.19832]] Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation(https://arxiv.org/abs/2411.19832)
Keywords: language model, llm
Abstract: The detection of sensitive content in large datasets is crucial for ensuring that shared and analysed data is free from harmful material. However, current moderation tools, such as external APIs, suffer from limitations in customisation, accuracy across diverse sensitive categories, and privacy concerns. Additionally, existing datasets and open-source models focus predominantly on toxic language, leaving gaps in detecting other sensitive categories such as substance abuse or self-harm. In this paper, we put forward a unified dataset tailored for social media content moderation across six sensitive categories: conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam. By collecting and annotating data with consistent retrieval strategies and guidelines, we address the shortcomings of previous focalised research. Our analysis demonstrates that fine-tuning large language models (LLMs) on this novel dataset yields significant improvements in detection performance compared to open off-the-shelf models such as LLaMA, and even proprietary OpenAI models, which underperform by 10-15% overall. This limitation is even more pronounced on popular moderation APIs, which cannot be easily tailored to specific sensitive content categories, among others.
摘要：在大型数据集中检测敏感内容对于确保共享和分析的数据不含有害材料至关重要。然而，当前的审核工具（例如外部 API）在定制、跨各种敏感类别的准确性以及隐私问题方面受到限制。此外，现有数据集和开源模型主要关注有毒语言，在检测其他敏感类别（例如药物滥用或自残）方面存在差距。在本文中，我们提出了一个统一的数据集，专门针对六个敏感类别的社交媒体内容审核：冲突语言、亵渎、色情材料、与毒品有关的内容、自残和垃圾邮件。通过使用一致的检索策略和指南收集和注释数据，我们解决了以前重点研究的缺点。我们的分析表明，与 LLaMA 等开放的现成模型甚至专有的 OpenAI 模型相比，在这个新数据集上微调大型语言模型 (LLM) 可显着提高检测性能，后者的整体表现低于 10-15%。这种限制在流行的审核 API 上更加明显，因为它们不能轻易地针对特定的敏感内容类别进行定制。

Title: Artificial intelligence contribution to translation industry: looking back and forward

Authors: Mohammed Q. Shormani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19855
Pdf URL: https://arxiv.org/pdf/2411.19855
Copy Paste: [[2411.19855]] Artificial intelligence contribution to translation industry: looking back and forward(https://arxiv.org/abs/2411.19855)
Keywords: gpt, chat
Abstract: This study provides a comprehensive analysis of research on artificial intelligence (AI) contribution to the translation industry (ACTI), synthesizing it over forty-one years from 1980-2024. 13220 articles were retrieved from three sources, namely WoS, Scopus, and Lens. We provide two types of analysis, viz., scientometric and thematic. The former focuses on clusters, subject categories, keywords, burstness, centrality and research centers. For the latter, we thematically review 18 articles, selected purposefully from the retrieved articles, centering on purpose, approach, findings, and contribution to ACTI future directions. The findings reveal that in the past the AI contribution to the translation industry was not rigorous, resulting in rule-based machine translation and statistical machine translation whose output was not satisfactory. However, the more AI develops, the more machine translation develops, incorporating Neural Networking Algorithms and (Deep) Language Learning Models like ChatGPT whose translation output has developed considerably. Nevertheless, much rigorous research is still needed to overcome several problems facing the translation industry, specifically concerning low-source languages, multi-dialectical and free word order languages, and cultural and religious registers.
摘要：本研究对人工智能 (AI) 对翻译行业 (ACTI) 的贡献进行了全面分析，综合了 1980 年至 2024 年四十一年的研究成果。从 WoS、Scopus 和 Lens 三个来源检索了 13220 篇文章。我们提供了两种类型的分析，即科学计量和主题分析，前者侧重于集群、主题类别、关键词、突发性、中心性和研究中心。对于后者，我们按主题审查了 18 篇文章，这些文章是从所涉及的文章中特意挑选出来的，围绕目的、方法、发现和对 ACTI 未来方向的贡献。研究结果表明，过去人工智能对翻译行业的贡献并不严谨，导致基于规则的机器翻译和统计机器翻译的输出不令人满意。然而，随着人工智能的发展，机器翻译也得到了发展，结合了神经网络算法和（深度）语言学习模型，如 ChatGPT，其翻译输出得到了显着发展。然而，仍然需要进行大量严谨的研究来克服翻译行业遇到的几个问题，特别是有关低源语言、多方言和自由词序语言以及文化和宗教领域的问题。

Title: What fifty-one years of Linguistics and Artificial Intelligence research tell us about their correlation: A scientometric review

Authors: Mohammed Q. Shormani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19858
Pdf URL: https://arxiv.org/pdf/2411.19858
Copy Paste: [[2411.19858]] What fifty-one years of Linguistics and Artificial Intelligence research tell us about their correlation: A scientometric review(https://arxiv.org/abs/2411.19858)
Keywords: language model, gpt, chat
Abstract: There is a strong correlation between linguistics and artificial intelligence (AI), best manifested by deep learning language models. This study provides a thorough scientometric analysis of this correlation, synthesizing the intellectual production over 51 years, from 1974 to 2024. It involves 5750 Web of Science-indexed articles published in 2124 journals, which are written by 20835 authors belonging to 13773 research centers in 794 countries. Two powerful software tools, viz., CiteSpace and VOSviewer, were used to generate mapping visualizations of the intellectual landscape, trending issues and (re)emerging hotspots. The results indicate that in the 1980s and 1990s, linguistics and AI research was not robust, characterized by unstable publication over time. It has, however, witnessed a remarkable increase in publication since then, reaching 1478 articles in 2023, and 546 articles in the January-March timespan of 2024, involving emerging issues and hotspots, addressing new horizons and new topics, and launching new applications and powerful deep learning language models including ChatGPT.
摘要：语言学与人工智能 (AI) 之间存在很强的相关性，深度学习语言模型就是最好的例证。本研究对这种相关性进行了彻底的科学计量分析，综合了 1974 年至 2024 年 51 年间的知识成果。研究涉及 2124 种期刊上发表的 5750 篇 Web of Science 索引文章，这些文章由 794 个国家/地区的 13773 个研究中心的 20835 名作者撰写。使用 CiteSpace 和 VOSviewer 这两个强大的软件生成知识图谱、趋势问题和（重新）出现的热点的可视化地图。结果表明，在 1980 年代和 1990 年代，语言学和人工智能研究并不稳健，其特点是出版物随时间不稳定。然而，自那时起，其出版数量显着增加，2023 年达到 1478 篇文章，2024 年 1 月至 3 月期间达到 546 篇文章，涉及新兴问题和热点，解决新视野、新主题，并推出新的应用程序和强大的深度学习语言模型，包括 ChatGPT。

Title: Reverse Thinking Makes LLMs Stronger Reasoners

Authors: Justin Chih-Yao Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, Tomas Pfister
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19865
Pdf URL: https://arxiv.org/pdf/2411.19865
Copy Paste: [[2411.19865]] Reverse Thinking Makes LLMs Stronger Reasoners(https://arxiv.org/abs/2411.19865)
Keywords: language model, llm
Abstract: Reverse thinking plays a crucial role in human reasoning. Humans can reason not only from a problem to a solution but also in reverse, i.e., start from the solution and reason towards the problem. This often enhances overall reasoning performance as it enables consistency checks between their forward and backward thinking. To enable Large Language Models (LLMs) to perform reverse thinking, we introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data augmentation and learning objectives. In RevThink, we augment the dataset by collecting structured forward-backward reasoning from a teacher model, consisting of: (1) the original question, (2) forward reasoning, (3) backward question, and (4) backward reasoning. We then employ three objectives to train a smaller student model in a multi-task learning fashion: (a) generate forward reasoning from a question, (b) generate a backward question from a question, and (c) generate backward reasoning from the backward question. Experiments across 12 datasets covering commonsense, math, and logical reasoning show an average 13.53% improvement over the student model's zero-shot performance and a 6.84% improvement over the strongest knowledge distillation baselines. Moreover, our method demonstrates sample efficiency -- using only 10% of the correct forward reasoning from the training data, it outperforms a standard fine-tuning method trained on 10x more forward reasoning. RevThink also exhibits strong generalization to out-of-distribution held-out datasets.
摘要：逆向思维在人类推理中起着至关重要的作用。人类不仅可以从问题推理到解决方案，还可以反向推理，即从解决方案开始，然后推理到问题。这通常会提高整体推理性能，因为它可以检查正向和逆向思维之间的一致性。为了使大型语言模型 (LLM) 能够进行逆向思维，我们引入了逆向增强思维 (RevThink)，这是一个由数据增强和学习目标组成的框架。在 RevThink 中，我们通过从教师模型收集结构化的正向-反向推理来增强数据集，包括：(1) 原始问题、(2) 正向推理、(3) 反向问题和 (4) 反向推理。然后，我们采用三个目标以多任务学习方式训练较小的学生模型：(a) 从问题生成正向推理、(b) 从问题生成反向问题和 (c) 从反向问题生成反向推理。在 12 个数据集上进行的实验涵盖常识、数学和逻辑推理，结果表明，与学生模型的零样本性能相比，平均提高了 13.53%，与最强的知识提炼基线相比，提高了 6.84%。此外，我们的方法还展示了样本效率——仅使用训练数据中 10% 的正确正向推理，其表现优于使用 10 倍正向推理进行训练的标准微调方法。RevThink 还表现出对分布外保留数据集的强大泛化能力。
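
The four augmented fields and the three training objectives listed in the abstract can be pictured as a small data transformation. The Python sketch below turns one teacher-annotated sample into three (prompt, target) pairs, one per objective. The class name, function name and prompt wording are hypothetical; the abstract does not specify the actual templates or how the losses are combined.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RevThinkSample:
    """One augmented example with the four fields named in the abstract."""
    question: str
    forward_reasoning: str
    backward_question: str
    backward_reasoning: str


def to_training_pairs(sample: RevThinkSample) -> List[Tuple[str, str]]:
    """Build one (prompt, target) pair per objective (a), (b) and (c).

    The prompt strings are illustrative placeholders only.
    """
    return [
        # (a) question -> forward reasoning
        (f"Question: {sample.question}\nReason step by step:",
         sample.forward_reasoning),
        # (b) question -> backward question
        (f"Question: {sample.question}\nWrite the backward question:",
         sample.backward_question),
        # (c) backward question -> backward reasoning
        (f"Question: {sample.backward_question}\nReason step by step:",
         sample.backward_reasoning),
    ]
```

In a multi-task setup, the three pairs from every sample would presumably be mixed into one training set for the student model; that mixing strategy is an assumption here.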

Title: AIDetx: a compression-based method for identification of machine-learning generated text

Authors: Leonardo Almeida, Pedro Rodrigues, Diogo Magalhães, Armando J. Pinho, Diogo Pratas
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19869
Pdf URL: https://arxiv.org/pdf/2411.19869
Copy Paste: [[2411.19869]] AIDetx: a compression-based method for identification of machine-learning generated text(https://arxiv.org/abs/2411.19869)
Keywords: language model, llm
Abstract: This paper introduces AIDetx, a novel method for detecting machine-generated text using data compression techniques. Traditional approaches, such as deep learning classifiers, often suffer from high computational costs and limited interpretability. To address these limitations, we propose a compression-based classification framework that leverages finite-context models (FCMs). AIDetx constructs distinct compression models for human-written and AI-generated text, classifying new inputs based on which model achieves a higher compression ratio. We evaluated AIDetx on two benchmark datasets, achieving F1 scores exceeding 97% and 99%, respectively, highlighting its high accuracy. Compared to current methods, such as large language models (LLMs), AIDetx offers a more interpretable and computationally efficient solution, significantly reducing both training time and hardware requirements (e.g., no GPUs needed). The full implementation is publicly available at this https URL.
摘要：本文介绍了一种使用数据压缩技术检测机器生成文本的新方法 AIDetx。传统方法（例如深度学习分类器）通常存在计算成本高和可解释性有限的问题。为了解决这些限制，我们提出了一种基于压缩的分类框架，该框架利用有限上下文模型 (FCM)。AIDetx 为人类书写的文本和 AI 生成的文本构建了不同的压缩模型，根据哪个模型实现更高的压缩率对新输入进行分类。我们在两个基准数据集上对 AIDetx 进行了评估，F1 得分分别超过 97% 和 99%，突显了其高准确率。与大型语言模型 (LLM) 等当前方法相比，AIDetx 提供了一种更易于解释且计算效率更高的解决方案，大大减少了训练时间和硬件要求（例如，不需要 GPU）。完整实现可在此 https URL 上公开获取。
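
The decision rule in the abstract (assign a text to whichever class model compresses it best) can be illustrated with a general-purpose compressor. The Python sketch below swaps in `zlib` (DEFLATE) for the paper's finite-context models, so it is not AIDetx itself. It measures how many extra compressed bytes a text costs when appended to each reference corpus and picks the cheaper class. The function names and the tie-breaking rule are assumptions.

```python
import zlib


def extra_bytes(reference: str, text: str) -> int:
    """Compressed bytes added by appending `text` to `reference`.

    A small value means the reference already contains patterns that
    help encode the text, i.e. the text compresses well under it.
    """
    ref = reference.encode("utf-8")
    both = ref + text.encode("utf-8")
    return len(zlib.compress(both, 9)) - len(zlib.compress(ref, 9))


def classify(text: str, human_ref: str, ai_ref: str) -> str:
    """Return "human" or "ai", whichever reference encodes `text` more cheaply.

    Ties go to "human"; this is an arbitrary choice for the sketch.
    """
    human_cost = extra_bytes(human_ref, text)
    ai_cost = extra_bytes(ai_ref, text)
    return "human" if human_cost <= ai_cost else "ai"
```

Note that DEFLATE only looks back 32 KB, so this stand-in would ignore most of a large reference corpus; that is one reason a dedicated model such as an FCM is a better fit for the real task.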

Title: On Domain-Specific Post-Training for Multimodal Large Language Models

Authors: Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, Zhenliang Zhang
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19930
Pdf URL: https://arxiv.org/pdf/2411.19930
Copy Paste: [[2411.19930]] On Domain-Specific Post-Training for Multimodal Large Language Models(https://arxiv.org/abs/2411.19930)
Keywords: language model, gpt, llm
Abstract: Recent years have witnessed the rapid development of general multimodal large language models (MLLMs). However, adapting general MLLMs to specific domains, such as scientific fields and industrial applications, remains less explored. This paper systematically investigates domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation. (1) Data Synthesis: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs. (2) Training Pipeline: While the two-stage training--initially on image-caption pairs followed by visual instruction tasks--is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training. (3) Task Evaluation: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks. To support further research in MLLM domain adaptation, we will open-source our implementations.
摘要：近年来，通用多模态大型语言模型 (MLLM) 发展迅速。然而，将通用 MLLM 应用于特定领域（如科学领域和工业应用）的研究仍然较少。本文系统地研究了通过后训练实现 MLLM 的领域适应性，重点关注数据合成、训练流程和任务评估。（1）数据合成：利用开源模型，我们开发了一个视觉指令合成器，可以有效地从特定领域的图像-字幕对中生成不同的视觉指令任务。我们的合成任务在增强 MLLM 领域特定性能方面超越了手动规则、GPT-4 和 GPT-4V 生成的任务。（2）训练流程：虽然开发通用 MLLM 通常采用两阶段训练（首先对图像-字幕对进行训练，然后进行视觉指令任务），但我们应用单阶段训练流程来增强领域特定后训练的任务多样性。 (3) 任务评估：我们在生物医药和食品两个领域开展实验，通过对不同来源和规模的 MLLM（例如 Qwen2-VL-2B、LLaVA-v1.6-8B、Llama-3.2-11B）进行后训练，然后评估 MLLM 在各种特定领域任务上的表现。为了支持对 MLLM 领域适应性的进一步研究，我们将开源我们的实现。

Title: Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability

Authors: Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, Zhaopeng Tu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19943
Pdf URL: https://arxiv.org/pdf/2411.19943
Copy Paste: [[2411.19943]] Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability(https://arxiv.org/abs/2411.19943)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They utilize autoregressive token generation to construct reasoning trajectories, enabling the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the final outcomes of reasoning tasks. We identify the existence of "critical tokens" that lead to incorrect reasoning trajectories in LLMs. Specifically, we find that LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. Motivated by this observation, we propose a novel approach, cDPO, designed to automatically recognize critical tokens and apply token-level rewards to them during the alignment process. Specifically, we develop a contrastive estimation approach to automatically identify critical tokens. It works by comparing the generation likelihood of positive and negative models. To achieve this, we separately fine-tune the positive and negative models on various reasoning trajectories; consequently, they are capable of identifying critical tokens within incorrect trajectories that contribute to erroneous outcomes. Moreover, to further align the model with the critical token information during the alignment process, we extend the conventional DPO algorithms to token-level DPO and utilize the differential likelihood from the aforementioned positive and negative models as an important weight for token-level DPO. Results on the GSM8K and MATH500 benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math (7B), demonstrate the effectiveness of the proposed approach cDPO.
摘要：大型语言模型 (LLM) 在推理任务上表现出色。它们利用自回归标记生成来构建推理轨迹，从而形成连贯的思维链。在这项工作中，我们探索了单个标记对推理任务最终结果的影响。我们发现 LLM 中存在导致错误推理轨迹的“关键标记”。具体来说，我们发现当被迫解码其他标记而不是关键标记时，LLM 往往会产生积极的结果。受此观察的启发，我们提出了一种新颖的方法 - cDPO - 旨在在对齐过程中自动识别关键标记并对其进行标记级奖励。具体来说，我们开发了一种对比估计方法来自动识别关键标记。它是通过比较正模型和负模型的生成可能性来实现的。为了实现这一点，我们分别对各种推理轨迹上的正模型和负模型进行微调，因此，它们能够识别导致错误结果的错误轨迹中的关键标记。此外，为了在对齐过程中进一步使模型与关键的 token 信息对齐，我们将传统的 DPO 算法扩展为 token 级 DPO，并利用前述正负模型的差分似然作为 token 级 DPO 的重要权重。使用两个广泛使用的模型 Llama-3（8B 和 70B）和 deepseek-math（7B）在 GSM8K 和 MATH500 基准上的结果证明了所提出方法 cDPO 的有效性。
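
The contrastive estimation step compares how likely each token is under the positive and the negative model. The Python sketch below shows one plausible reading of that idea: score each token by the difference of its log-probabilities under the two models, and flag the token the negative model favors most. The exact scoring formula, the function names and the use of precomputed per-token log-probabilities are assumptions; the abstract does not give the paper's definition.

```python
from typing import List, Sequence, Tuple


def contrastive_scores(
    logp_pos: Sequence[float], logp_neg: Sequence[float]
) -> List[float]:
    """Per-token score: log p_pos(token) - log p_neg(token).

    Strongly negative values mean the negative model finds the token far
    more likely than the positive model does (an assumed criterion).
    """
    if len(logp_pos) != len(logp_neg):
        raise ValueError("log-probability sequences must have equal length")
    return [p - n for p, n in zip(logp_pos, logp_neg)]


def most_critical_token(
    tokens: Sequence[str],
    logp_pos: Sequence[float],
    logp_neg: Sequence[float],
) -> Tuple[int, str, float]:
    """Return (index, token, score) for the lowest-scoring token."""
    if len(tokens) != len(logp_pos):
        raise ValueError("tokens and log-probabilities must align")
    scores = contrastive_scores(logp_pos, logp_neg)
    idx = min(range(len(scores)), key=scores.__getitem__)
    return idx, tokens[idx], scores[idx]
```

In practice the log-probabilities would come from scoring the same incorrect trajectory with both fine-tuned models, and the resulting scores could then serve as token-level weights; how the paper maps them to DPO weights is not stated in the abstract.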