2024-10-10

Title: Output Scouting: Auditing Large Language Models for Catastrophic Responses

Authors: Andrew Bell, Joao Fonseca
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05305
Pdf URL: https://arxiv.org/pdf/2410.05305
Copy Paste: [[2410.05305]] Output Scouting: Auditing Large Language Models for Catastrophic Responses(https://arxiv.org/abs/2410.05305)
Keywords: language model, llm, prompt
Abstract: Recent high-profile incidents in which the use of Large Language Models (LLMs) resulted in significant harm to individuals have brought about a growing interest in AI safety. One reason LLM safety issues occur is that models often have at least some non-zero probability of producing harmful outputs. In this work, we explore the following scenario: imagine an AI safety auditor is searching for catastrophic responses from an LLM (e.g. a "yes" response to "can I fire an employee for being pregnant?"), and is able to query the model a limited number of times (e.g. 1000 times). What is a strategy for querying the model that would efficiently find those failure responses? To this end, we propose output scouting: an approach that aims to generate semantically fluent outputs to a given prompt matching any target probability distribution. We then run experiments using two LLMs and find numerous examples of catastrophic responses. We conclude with a discussion that includes advice for practitioners who are looking to implement LLM auditing for catastrophic responses. We also release an open-source toolkit (this https URL) that implements our auditing framework using the Hugging Face transformers library.
摘要：最近，一些备受关注的事件中，大型语言模型 (LLM) 的使用导致个人受到严重伤害，这引起了人们对人工智能安全性的日益关注。LLM 安全问题发生的一个原因是，模型通常至少具有产生有害输出的一些非零概率。在这项工作中，我们探讨了以下场景：假设一名人工智能安全审计员正在从 LLM 中搜索灾难性响应（例如，对“我可以因为怀孕而解雇一名员工吗？”的“是”响应），并且能够对模型进行有限次数的查询（例如 1000 次）。查询模型的策略是什么，可以有效地找到这些失败响应？为此，我们提出了输出侦察：一种旨在为与任何目标概率分布匹配的给定提示生成语义流畅的输出的方法。然后，我们使用两个 LLM 进行实验，并找到大量灾难性响应的例子。最后，我们进行了讨论，其中包括对希望实施 LLM 审计以应对灾难性响应的从业者的建议。我们还发布了一个开源工具包（此 https URL），该工具包使用 Hugging Face transformers 库实现我们的审计框架。
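
As a rough illustration of why the query budget in this scenario matters: if each independent sample produces a catastrophic response with probability p, the chance of seeing at least one in n queries is 1 - (1 - p)^n. The sketch below only works through that arithmetic; it is not the authors' output-scouting method, and the function names and the example value of p are assumptions made for illustration.

```python
import math


def prob_at_least_one_failure(p: float, n: int) -> float:
    """Chance that n independent samples contain at least one failure,
    when each sample fails with probability p."""
    return 1.0 - (1.0 - p) ** n


def queries_needed(p: float, confidence: float) -> int:
    """Smallest n such that prob_at_least_one_failure(p, n) >= confidence."""
    if not (0.0 < p < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    p = 0.001  # assumed per-sample failure probability, purely illustrative
    print(prob_at_least_one_failure(p, 1000))
    print(queries_needed(p, 0.95))
```

Under plain independent sampling, rare failures can easily be missed within a budget of 1000 queries, which is the motivation the abstract gives for a more deliberate querying strategy.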

Title: Falcon Mamba: The First Competitive Attention-free 7B Language Model

Authors: Jingwei Zuo, Maksim Velikanov, Dhia Eddine Rhaiem, Ilyas Chahed, Younes Belkada, Guillaume Kunsch, Hakim Hacid
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05355
Pdf URL: https://arxiv.org/pdf/2410.05355
Copy Paste: [[2410.05355]] Falcon Mamba: The First Competitive Attention-free 7B Language Model(https://arxiv.org/abs/2410.05355)
Keywords: language model, llm
Abstract: In this technical report, we present Falcon Mamba 7B, a new base large language model based on the novel Mamba architecture. Falcon Mamba 7B is trained on 5.8 trillion tokens with carefully selected data mixtures. As a pure Mamba-based model, Falcon Mamba 7B surpasses leading open-weight models based on Transformers, such as Mistral 7B, Llama3.1 8B, and Falcon2 11B. It is on par with Gemma 7B and outperforms models with different architecture designs, such as RecurrentGemma 9B and RWKV-v6 Finch 7B/14B. Currently, Falcon Mamba 7B is the best-performing Mamba model in the literature at this scale, surpassing both existing Mamba and hybrid Mamba-Transformer models, according to the Open LLM Leaderboard. Due to its architecture, Falcon Mamba 7B is significantly faster at inference and requires substantially less memory for long sequence generation. Despite recent studies suggesting that hybrid Mamba-Transformer models outperform pure architecture designs, we demonstrate that even the pure Mamba design can achieve similar, or even superior results compared to the Transformer and hybrid designs. We make the weights of our implementation of Falcon Mamba 7B publicly available on this https URL, under a permissive license.
摘要：在本技术报告中，我们介绍了基于新型 Mamba 架构的新型基础大型语言模型 Falcon Mamba 7B。Falcon Mamba 7B 使用精心挑选的数据混合对 5.8 万亿个 token 进行训练。作为纯基于 Mamba 的模型，Falcon Mamba 7B 超越了基于 Transformer 的领先开放权重模型，例如 Mistral 7B、Llama3.1 8B 和 Falcon2 11B。它与 Gemma 7B 相当，并且优于具有不同架构设计的模型，例如 RecurrentGemma 9B 和 RWKV-v6 Finch 7B/14B。根据 Open LLM Leaderboard，目前，Falcon Mamba 7B 是文献中在此规模下表现最佳的 Mamba 模型，超越了现有的 Mamba 和混合 Mamba-Transformer 模型。由于其架构，Falcon Mamba 7B 的推理速度明显更快，并且生成长序列所需的内存显著减少。尽管最近的研究表明混合 Mamba-Transformer 模型的表现优于纯架构设计，但我们证明，与 Transformer 和混合设计相比，纯 Mamba 设计也能实现类似甚至更优的结果。我们根据宽松的许可，在此 https URL 上公开了 Falcon Mamba 7B 实现的权重。

Title: LLMs Are In-Context Reinforcement Learners

Authors: Giovanni Monea, Antoine Bosselut, Kianté Brantley, Yoav Artzi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05362
Pdf URL: https://arxiv.org/pdf/2410.05362
Copy Paste: [[2410.05362]] LLMs Are In-Context Reinforcement Learners(https://arxiv.org/abs/2410.05362)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL). This work studies if this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards. We show that a naive application of ICRL fails miserably, and identify the root cause as a fundamental deficiency at exploration, which leads to quick model degeneration. We propose an algorithm to address this deficiency by increasing test-time compute, as well as a compute-bound approximation. We use several challenging classification tasks to empirically show that our ICRL algorithms lead to effective learning from rewards alone, and analyze the characteristics of this ability and our methods. Overall, our results reveal remarkable ICRL abilities in LLMs.
摘要：大型语言模型 (LLM) 可以通过上下文监督学习 (即 ICL) 来学习新任务。这项工作研究了这种能力是否扩展到上下文强化学习 (ICRL)，其中模型没有在上下文中被赋予黄金标签，而只有它们过去的预测和奖励。我们表明，ICRL 的简单应用会惨遭失败，并将根本原因确定为探索中的根本缺陷，这导致模型快速退化。我们提出了一种算法来解决这一缺陷，即增加测试时间计算以及计算受限近似。我们使用几个具有挑战性的分类任务来实证证明我们的 ICRL 算法可以仅从奖励中有效学习，并分析这种能力和我们方法的特点。总体而言，我们的结果揭示了 LLM 中卓越的 ICRL 能力。

Title: Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation

Authors: Tunazzina Islam, Dan Goldwasser
Subjects: cs.CL, cs.AI, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2410.05401
Pdf URL: https://arxiv.org/pdf/2410.05401
Copy Paste: [[2410.05401]] Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation(https://arxiv.org/abs/2410.05401)
Keywords: language model, llm
Abstract: Climate change communication on social media increasingly employs microtargeting strategies to effectively reach and influence specific demographic groups. This study presents a post-hoc analysis of microtargeting practices within climate campaigns by leveraging large language models (LLMs) to examine Facebook advertisements. Our analysis focuses on two key aspects: demographic targeting and fairness. We evaluate the ability of LLMs to accurately predict the intended demographic targets, such as gender and age group, achieving an overall accuracy of 88.55%. Furthermore, we instruct the LLMs to generate explanations for their classifications, providing transparent reasoning behind each decision. These explanations reveal the specific thematic elements used to engage different demographic segments, highlighting distinct strategies tailored to various audiences. Our findings show that young adults are primarily targeted through messages emphasizing activism and environmental consciousness, while women are engaged through themes related to caregiving roles and social advocacy. In addition to evaluating the effectiveness of LLMs in detecting microtargeted messaging, we conduct a comprehensive fairness analysis to identify potential biases in model predictions. Our findings indicate that while LLMs perform well overall, certain biases exist, particularly in the classification of senior citizens and male audiences. By showcasing the efficacy of LLMs in dissecting and explaining targeted communication strategies and by highlighting fairness concerns, this study provides a valuable framework for future research aimed at enhancing transparency, accountability, and inclusivity in social media-driven climate campaigns.
摘要：社交媒体上的气候变化传播越来越多地采用微目标策略来有效地接触和影响特定的人口群体。本研究通过利用大型语言模型 (LLM) 检查 Facebook 广告，对气候活动中的微目标实践进行了事后分析。我们的分析侧重于两个关键方面：人口定位和公平性。我们评估了 LLM 准确预测预期人口目标（例如性别和年龄组）的能力，总体准确率达到 88.55%。此外，我们指示 LLM 为其分类生成解释，为每个决策提供透明的理由。这些解释揭示了用于吸引不同人口群体的特定主题元素，突出了针对不同受众量身定制的不同策略。我们的研究结果表明，年轻人主要通过强调行动主义和环保意识的信息来定位，而女性则通过与照顾角色和社会倡导相关的主题来吸引。除了评估 LLM 在检测微目标消息方面的有效性外，我们还进行了全面的公平性分析，以确定模型预测中的潜在偏差。我们的研究结果表明，尽管 LLM 总体表现良好，但存在某些偏见，特别是在老年人和男性受众的分类上。通过展示 LLM 在剖析和解释有针对性的沟通策略方面的有效性，并强调公平问题，本研究为旨在提高社交媒体驱动的气候运动的透明度、问责制和包容性的未来研究提供了一个宝贵的框架。

Title: Neural machine translation system for Lezgian, Russian and Azerbaijani languages

Authors: Alidar Asvarov, Andrey Grabovoy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05472
Pdf URL: https://arxiv.org/pdf/2410.05472
Copy Paste: [[2410.05472]] Neural machine translation system for Lezgian, Russian and Azerbaijani languages(https://arxiv.org/abs/2410.05472)
Keywords: language model
Abstract: We release the first neural machine translation system for translation between Russian, Azerbaijani and the endangered Lezgian languages, as well as monolingual and parallel datasets collected and aligned for training and evaluating the system. Multiple experiments are conducted to identify how different sets of training language pairs and data domains can influence the resulting translation quality. We achieve BLEU scores of 26.14 for Lezgian-Azerbaijani, 22.89 for Azerbaijani-Lezgian, 29.48 for Lezgian-Russian and 24.25 for Russian-Lezgian pairs. The quality of zero-shot translation is assessed on a Large Language Model, showing its high level of fluency in Lezgian. However, the model often refuses to translate, justifying itself with its incompetence. We contribute our translation model along with the collected parallel and monolingual corpora and sentence encoder for the Lezgian language.
摘要：我们发布了首个用于俄语、阿塞拜疆语和濒危列兹金语之间翻译的神经机器翻译系统，以及为训练和评估该系统而收集和调整的单语和平行数据集。我们进行了多项实验，以确定不同的训练语言对和数据域集如何影响最终的翻译质量。我们获得的列兹金语-阿塞拜疆语 BLEU 分数为 26.14，阿塞拜疆语-列兹金语 BLEU 分数为 22.89，列兹金语-俄语 BLEU 分数为 29.48，俄语-列兹金语 BLEU 分数为 24.25。在大型语言模型上评估了零样本翻译的质量，结果显示其在列兹金语方面的流利程度很高。然而，该模型经常拒绝翻译，以其无能为由为自己辩解。我们贡献了我们的翻译模型以及为列兹金语收集的平行和单语语料库和句子编码器。

Title: Self-rationalization improves LLM as a fine-grained judge

Authors: Prapti Trivedi, Aditya Gulati, Oliver Molenschot, Meghana Arakkal Rajeev, Rajkumar Ramamurthy, Keith Stevens, Tanveesh Singh Chaudhery, Jahnavi Jambholkar, James Zou, Nazneen Rajani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05495
Pdf URL: https://arxiv.org/pdf/2410.05495
Copy Paste: [[2410.05495]] Self-rationalization improves LLM as a fine-grained judge(https://arxiv.org/abs/2410.05495)
Keywords: llm
Abstract: LLM-as-a-judge models have been used for evaluating both human and AI generated content, specifically by providing scores and rationales. Rationales, in addition to increasing transparency, help models learn to calibrate their judgments. Enhancing a model's rationale can therefore improve its calibration abilities and ultimately the ability to score content. We introduce Self-Rationalization, an iterative process of improving the rationales for the judge models, which consequently improves the score for fine-grained customizable scoring criteria (i.e., Likert-scale scoring with arbitrary evaluation criteria). Self-rationalization works by having the model generate multiple judgments with rationales for the same input, curating a preference pair dataset from its own judgments, and iteratively fine-tuning the judge via DPO. Intuitively, this approach allows the judge model to self-improve by learning from its own rationales, leading to better alignment and evaluation accuracy. After just two iterations -- while only relying on examples in the training set -- human evaluation shows that our judge model learns to produce higher quality rationales, with a win rate of $62\%$ on average compared to models just trained via SFT on rationale. This judge model also achieves high scoring accuracy on BigGen Bench and Reward Bench, outperforming even larger models trained using SFT with rationale, self-consistency or best-of-$N$ sampling by $3\%$ to $9\%$.
摘要：LLM-as-a-judge 模型已用于评估人类和人工智能生成的内容，具体方法是提供分数和理由。除了增加透明度之外，理由还可以帮助模型学习校准其判断。因此，增强模型的理由可以提高其校准能力，并最终提高对内容进行评分的能力。我们引入了自我合理化，这是一个改进判断模型理由的迭代过程，从而提高了细粒度可定制评分标准的分数（即具有任意评估标准的李克特量表评分）。自我合理化的工作原理是让模型对同一输入生成具有理由的多个判断，从其自己的判断中整理偏好对数据集，并通过 DPO 迭代微调判断。直观地说，这种方法允许判断模型通过从自己的理由中学习进行自我改进，从而实现更好的一致性和评估准确性。仅经过两次迭代——尽管仅依赖于训练集中的示例——人工评估表明，我们的判断模型学会了生成更高质量的理由，与仅通过 SFT 训练的理由模型相比，平均胜率为 $62\%$。该判断模型在 BigGen Bench 和 Reward Bench 上也实现了高得分准确率，甚至比使用 SFT 训练的更大规模的模型（具有理由、自洽性或最佳 $N$ 采样）高出 $3\%$ 到 $9\%$。
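
The abstract says preference pairs are curated from the judge's own judgments but does not say how. The sketch below shows one plausible curation rule under stated assumptions: each judgment carries a rationale and an integer score, a reference score is available for the training example, and the judgment closest to the reference becomes "chosen" while the farthest becomes "rejected". The `Judgment` type and `make_preference_pair` function are hypothetical names, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Judgment:
    rationale: str  # free-text explanation produced by the judge
    score: int      # e.g. a 1-5 Likert score


def make_preference_pair(
    judgments: List[Judgment], reference_score: int
) -> Optional[Tuple[Judgment, Judgment]]:
    """Return (chosen, rejected) for DPO-style training, or None when the
    sampled judgments are all equally close to the reference score."""
    if len(judgments) < 2:
        return None
    # Rank by absolute distance from the reference score (assumed criterion).
    ranked = sorted(judgments, key=lambda j: abs(j.score - reference_score))
    chosen, rejected = ranked[0], ranked[-1]
    if abs(chosen.score - reference_score) == abs(rejected.score - reference_score):
        return None  # no preference signal in this sample
    return chosen, rejected
```

Pairs built this way could then be passed to any DPO trainer; examples where every sampled judgment is equally close to the reference carry no preference signal and are skipped.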

Title: On Instruction-Finetuning Neural Machine Translation Models

Authors: Vikas Raunak, Roman Grundkiewicz, Marcin Junczys-Dowmunt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05553
Pdf URL: https://arxiv.org/pdf/2410.05553
Copy Paste: [[2410.05553]] On Instruction-Finetuning Neural Machine Translation Models(https://arxiv.org/abs/2410.05553)
Keywords: language model, gpt, llm
Abstract: In this work, we introduce instruction finetuning for Neural Machine Translation (NMT) models, which distills instruction following capabilities from Large Language Models (LLMs) into orders-of-magnitude smaller NMT models. Our instruction-finetuning recipe for NMT models enables customization of translations for a limited but disparate set of translation-specific tasks. We show that NMT models are capable of following multiple instructions simultaneously and demonstrate capabilities of zero-shot composition of instructions. We also show that through instruction finetuning, traditionally disparate tasks such as formality-controlled machine translation, multi-domain adaptation as well as multi-modal translations can be tackled jointly by a single instruction finetuned NMT model, at a performance level comparable to LLMs such as GPT-3.5-Turbo. To the best of our knowledge, our work is among the first to demonstrate the instruction-following capabilities of traditional NMT models, which allows for faster, cheaper and more efficient serving of customized translations.
摘要：在本研究中，我们引入了神经机器翻译 (NMT) 模型的指令微调，将大型语言模型 (LLM) 中的指令跟踪能力提炼到数量级较小的 NMT 模型中。我们针对 NMT 模型的指令微调方法能够针对有限但不同的翻译特定任务集定制翻译。我们表明 NMT 模型能够同时遵循多个指令，并展示零样本指令组合的能力。我们还表明，通过指令微调，传统上不同的任务（例如形式控制的机器翻译、多领域自适应以及多模态翻译）可以通过单个指令微调的 NMT 模型共同解决，其性能水平可与 GPT-3.5-Turbo 等 LLM 相媲美。据我们所知，我们的工作是首批展示传统 NMT 模型的指令跟踪能力的工作之一，这使得定制翻译的服务更快、更便宜、更高效。

Title: Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives

Authors: Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05558
Pdf URL: https://arxiv.org/pdf/2410.05558
Copy Paste: [[2410.05558]] Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives(https://arxiv.org/abs/2410.05558)
Keywords: language model, gpt, llm, prompt
Abstract: Reasoning about time and temporal relations is an integral aspect of human cognition, essential for perceiving the world and navigating our experiences. Though large language models (LLMs) have demonstrated impressive performance in many reasoning tasks, temporal reasoning remains challenging due to its intrinsic complexity. In this work, we first study an essential task of temporal reasoning -- temporal graph generation, to unveil LLMs' inherent, global reasoning capabilities. We show that this task presents great challenges even for the most powerful LLMs, such as GPT-3.5/4. We also observe a significant performance gap for small models (<10B), which lag behind LLMs by 50%. Next, we study how to close this gap with a budget constraint, e.g., not using model finetuning. We propose a new prompting technique tailored for temporal reasoning, Narrative-of-Thought (NoT), that first converts the event set to a Python class, then prompts a small model to generate a temporally grounded narrative, guiding the final generation of a temporal graph. Extensive experiments showcase the efficacy of NoT in improving various metrics. Notably, NoT attains the highest F1 on the Schema-11 evaluation set, while securing an overall F1 on par with GPT-3.5. NoT also achieves the best structural similarity across the board, even compared with GPT-3.5/4. Our code is available at this https URL.
摘要：推理时间和时间关系是人类认知的一个重要方面，对于感知世界和驾驭我们的体验至关重要。尽管大型语言模型 (LLM) 在许多推理任务中表现出色，但由于其内在的复杂性，时间推理仍然具有挑战性。在这项工作中，我们首先研究时间推理的一项基本任务——时间图生成，以揭示 LLM 固有的全局推理能力。我们表明，即使对于最强大的 LLM（例如 GPT-3.5/4），这项任务也带来了巨大的挑战。我们还注意到小型模型（<10B）的性能差距很大，落后于 LLM 50%。接下来，我们研究如何在预算约束下缩小这一差距，例如不使用模型微调。我们提出了一种专为时间推理量身定制的新提示技术，即 Narrative-of-Thought (NoT)，它首先将事件集转换为 Python 类，然后提示小型模型生成基于时间的叙述，指导最终生成时间图。大量实验证明了 NoT 在改善各种指标方面的有效性。值得注意的是，NoT 在 Schema-11 评估集上获得了最高的 F1，同时确保了与 GPT-3.5 相当的整体 F1。即使与 GPT-3.5/4 相比，NoT 也实现了最佳的结构相似性。我们的代码可在此 https URL 上找到。

Title: Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification

Authors: Tao Meng, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, Rahul Gupta, Charith Peris
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05559
Pdf URL: https://arxiv.org/pdf/2410.05559
Copy Paste: [[2410.05559]] Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification(https://arxiv.org/abs/2410.05559)
Keywords: language model, llm
Abstract: We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control. Given a training corpus and control criteria formulated as a sequence-level constraint on model outputs, our method fine-tunes the LLM on the training corpus while enhancing constraint satisfaction with minimal impact on its utility and generation quality. Specifically, our approach regularizes the LLM training by penalizing the KL divergence between the desired output distribution, which satisfies the constraints, and the LLM's posterior. This regularization term can be approximated by an auxiliary model trained to decompose the sequence-level constraints into token-level guidance, allowing the term to be measured by a closed-form formulation. To further improve efficiency, we design a parallel scheme for concurrently updating both the LLM and the auxiliary model. We evaluate the empirical performance of our approach by controlling the toxicity when training an LLM. We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
摘要：我们提出了一种约束学习方案，用于通过属性控制对大型语言模型 (LLM) 进行微调。给定一个训练语料库和控制标准，将其作为对模型输出的序列级约束，我们的方法可以在训练语料库上对 LLM 进行微调，同时增强约束满足度，同时尽量减少对其效用和生成质量的影响。具体而言，我们的方法通过惩罚满足约束的期望输出分布与 LLM 后验之间的 KL 散度来规范 LLM 训练。这个正则化项可以通过一个辅助模型来近似，该模型经过训练可以将序列级约束分解为标记级指导，从而允许通过闭式公式来测量该项。为了进一步提高效率，我们设计了一个并行方案，用于同时更新 LLM 和辅助模型。我们通过控制训练 LLM 时的毒性来评估我们方法的经验性能。我们表明，我们的方法可以产生更少的不适当响应的 LLM，同时在基准和毒性检测任务上实现具有竞争力的性能。

Title: Rational Metareasoning for Large Language Models

Authors: C. Nicolò De Sabbata, Theodore R. Sumers, Thomas L. Griffiths
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05563
Pdf URL: https://arxiv.org/pdf/2410.05563
Copy Paste: [[2410.05563]] Rational Metareasoning for Large Language Models(https://arxiv.org/abs/2410.05563)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Being prompted to engage in reasoning has emerged as a core technique for using large language models (LLMs), deploying additional inference-time compute to improve task performance. However, as LLMs grow in both size and adoption, inference costs are becoming increasingly burdensome. How, then, might we optimize reasoning's cost-performance tradeoff? This work introduces a novel approach based on computational models of metareasoning used in cognitive science, training LLMs to selectively use intermediate reasoning steps only when necessary. We first develop a reward function that incorporates the Value of Computation by penalizing unnecessary reasoning, then use this reward function with Expert Iteration to train the LLM. Compared to few-shot chain-of-thought prompting and STaR, our method significantly reduces inference costs (20-37% fewer tokens generated across three models) while maintaining task performance across diverse datasets.
摘要：提示模型进行推理已成为使用大型语言模型 (LLM) 的核心技术，它通过部署额外的推理时间计算来提高任务性能。然而，随着 LLM 的规模和采用率的增加，推理成本也变得越来越沉重。那么，我们如何才能优化推理的性价比呢？这项工作引入了一种基于认知科学中使用的元推理计算模型的新方法，训练 LLM 仅在必要时有选择地使用中间推理步骤。我们首先开发一个奖励函数，通过惩罚不必要的推理来结合计算的价值，然后将此奖励函数与专家迭代一起使用来训练 LLM。与少样本思维链提示和 STaR 相比，我们的方法显著降低了推理成本（在三个模型中生成的标记减少了 20-37%），同时保持了不同数据集的任务性能。
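
To make the idea of a reward that penalizes unnecessary reasoning concrete, here is a minimal sketch. It assumes a simple linear form, task utility minus a per-token cost on the reasoning tokens; the abstract does not give the exact Value of Computation formula, so the function name, the linear penalty, and the default `cost_per_token` are all illustrative assumptions.

```python
def metareasoning_reward(
    is_correct: bool, reasoning_tokens: int, cost_per_token: float = 0.001
) -> float:
    """Task utility minus a linear penalty on reasoning length (assumed form)."""
    if reasoning_tokens < 0:
        raise ValueError("reasoning_tokens must be non-negative")
    task_utility = 1.0 if is_correct else 0.0
    return task_utility - cost_per_token * reasoning_tokens


if __name__ == "__main__":
    # A correct answer reached directly scores higher than the same answer
    # reached after 200 reasoning tokens.
    print(metareasoning_reward(True, 0))
    print(metareasoning_reward(True, 200))
```

With a reward of this shape, a correct answer reached with less reasoning always scores at least as well as the same answer reached with more, which is the pressure toward using reasoning only when it changes the outcome.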

Title: ClaimBrush: A Novel Framework for Automated Patent Claim Refinement Based on Large Language Models

Authors: Seiya Kawano, Hirofumi Nonaka, Koichiro Yoshino
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05575
Pdf URL: https://arxiv.org/pdf/2410.05575
Copy Paste: [[2410.05575]] ClaimBrush: A Novel Framework for Automated Patent Claim Refinement Based on Large Language Models(https://arxiv.org/abs/2410.05575)
Keywords: language model
Abstract: Automatic refinement of patent claims in patent applications is crucial from the perspective of intellectual property strategy. In this paper, we propose ClaimBrush, a novel framework for automated patent claim refinement that includes a dataset and a rewriting model. We constructed a dataset for training and evaluating patent claim rewriting models by collecting a large number of actual patent claim rewriting cases from the patent examination process. Using the constructed dataset, we built an automatic patent claim rewriting model by fine-tuning a large language model. Furthermore, we enhanced the performance of the automatic patent claim rewriting model by applying preference optimization based on a prediction model of patent examiners' Office Actions. The experimental results showed that our proposed rewriting model outperformed heuristic baselines and zero-shot learning in state-of-the-art large language models. Moreover, preference optimization based on patent examiners' preferences boosted the performance of patent claim refinement.
摘要：从知识产权战略的角度来看，专利申请中的专利权利要求的自动细化至关重要。在本文中，我们提出了一个新颖的专利权利要求自动细化框架 ClaimBrush，其中包括一个数据集和一个重写模型。我们通过收集大量来自专利审查过程的实际专利权利要求重写案例，构建了一个用于训练和评估专利权利要求重写模型的数据集。利用构建的数据集，我们通过微调一个大型语言模型，构建了一个自动专利权利要求重写模型。此外，我们通过应用基于专利审查员审查意见预测模型的偏好优化来提高自动专利权利要求重写模型的性能。实验结果表明，我们提出的重写模型在最先进的大型语言模型中优于启发式基线和零样本学习。此外，基于专利审查员偏好的偏好优化提高了专利权利要求细化的性能。

Title: Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve?

Authors: Fırat Öncel, Matthias Bethge, Beyza Ermis, Mirco Ravanelli, Cem Subakan, Çağatay Yıldız
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05581
Pdf URL: https://arxiv.org/pdf/2410.05581
Copy Paste: [[2410.05581]] Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve?(https://arxiv.org/abs/2410.05581)
Keywords: language model, llm
Abstract: In the last decade, the generalization and adaptation abilities of deep learning models were typically evaluated on fixed training and test distributions. Contrary to traditional deep learning, large language models (LLMs) are (i) even more overparameterized, (ii) trained on unlabeled text corpora curated from the Internet with minimal human intervention, and (iii) trained in an online fashion. These stark contrasts prevent researchers from transferring lessons learned on model generalization and adaptation in deep learning contexts to LLMs. To this end, our short paper introduces empirical observations that aim to shed light on further training of already pretrained language models. Specifically, we demonstrate that training a model on a text domain could degrade its perplexity on the test portion of the same domain. We observe with our subsequent analysis that the performance degradation is positively correlated with the similarity between the additional and the original pretraining dataset of the LLM. Our further token-level perplexity observations reveal that the perplexity degradation is due to a handful of tokens that are not informative about the domain. We hope these findings will guide us in determining when to adapt a model vs when to rely on its foundational capabilities.
摘要：在过去十年中，深度学习模型的泛化和适应能力通常是在固定的训练和测试分布上进行评估的。与传统的深度学习相反，大型语言模型 (LLM) (i) 过参数化程度更高，(ii) 在从互联网上收集的未标记文本语料库上进行训练，人工干预最少，(iii) 以在线方式进行训练。这些鲜明的对比阻碍了研究人员将在深度学习环境中的模型泛化和适应方面的经验教训转移到 LLM。为此，我们的短文介绍了旨在阐明已预训练语言模型的进一步训练的经验观察。具体而言，我们证明在文本域上训练模型可能会使其在同一域的测试部分上的困惑度变差。我们在后续分析中观察到，性能下降与 LLM 的附加和原始预训练数据集之间的相似性呈正相关。我们进一步的标记级困惑度观察表明，困惑度的恶化源于少数对该领域不具信息量的标记。我们希望这些发现能够指导我们确定何时调整模型、何时依赖其基础功能。
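
The token-level analysis mentioned above relies on the standard definition of perplexity as the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities and lists the tokens that contribute the most loss; the helper names are assumptions, and obtaining the log-probabilities from an actual model is left out.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def top_loss_tokens(
    tokens: Sequence[str], token_log_probs: Sequence[float], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k tokens with the highest negative log-likelihood."""
    pairs = sorted(zip(tokens, token_log_probs), key=lambda pair: pair[1])
    return [(tok, -lp) for tok, lp in pairs[:k]]
```

Comparing `top_loss_tokens` before and after additional pretraining is one simple way to check whether a perplexity change is concentrated in a few tokens.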

Title: ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Authors: Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05589
Pdf URL: https://arxiv.org/pdf/2410.05589
Copy Paste: [[2410.05589]] ParallelSpec: Parallel Drafter for Efficient Speculative Decoding(https://arxiv.org/abs/2410.05589)
Keywords: language model, llm
Abstract: Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most existing works still draft tokens auto-regressively to maintain sequential dependency in language modeling, which we consider a huge computational burden in speculative decoding. We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches. In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model. ParallelSpec learns to efficiently predict multiple future tokens in parallel using a single model, and it can be integrated into any speculative decoding framework that requires aligning the output distributions of the drafter and the target model with minimal training cost. Experimental results show that ParallelSpec accelerates baseline methods by up to 62% in latency on text generation benchmarks from different domains, and it achieves 2.84X overall speedup on the Llama-2-13B model using third-party evaluation criteria.
摘要：推测解码已被证明是大型语言模型 (LLM) 推理的有效解决方案，其中小型起草器以低成本预测未来的标记，并利用目标模型并行验证它们。然而，大多数现有工作仍然以自回归方式起草标记以保持语言建模中的顺序依赖性，我们认为这在推测解码中是一个巨大的计算负担。我们提出了 ParallelSpec，这是最先进的推测解码方法中自回归起草策略的替代方案。与推测阶段的自回归起草相比，我们训练并行起草器作为高效的推测模型。ParallelSpec 学习使用单个模型高效地并行预测多个未来标记，并且可以集成到任何需要以最小的训练成本对齐起草器和目标模型的输出分布的推测解码框架中。实验结果表明，ParallelSpec 在不同领域的文本生成基准上将基线方法在延迟方面加速了最多 62%，并且使用第三方评估标准在 Llama-2-13B 模型上实现了 2.84 倍的整体加速。
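
For readers unfamiliar with the draft-then-verify loop, the sketch below shows one step of the baseline scheme described above in its simplest greedy form: a drafter proposes k tokens auto-regressively, and the target model keeps the longest prefix it agrees with. It is a toy under stated assumptions (models are plain functions from a token prefix to the next token, and verification is written as a loop rather than one batched forward pass); it does not implement ParallelSpec's parallel drafter.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int], draft: NextToken, target: NextToken, k: int
) -> List[int]:
    """Return the tokens accepted in one greedy draft-then-verify step."""
    # 1. The drafter proposes k tokens auto-regressively (the cheap model).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position. A real system scores all k
    #    positions in a single forward pass; the loop is for clarity only.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(tok)

    # 3. All drafts accepted: the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted
```

In this greedy variant the accepted tokens are exactly what the target model would have generated on its own, so the speedup comes only from verifying several drafted positions at once.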

Title: Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning

Authors: Ming Shan Hee, Aditi Kumaresan, Roy Ka-Wei Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05600
Pdf URL: https://arxiv.org/pdf/2410.05600
Copy Paste: [[2410.05600]] Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning(https://arxiv.org/abs/2410.05600)
Keywords: language model
Abstract: The widespread presence of hate speech on the internet, including formats such as text-based tweets and vision-language memes, poses a significant challenge to digital platform safety. Recent research has developed detection models tailored to specific modalities; however, there is a notable gap in transferring detection capabilities across different formats. This study conducts extensive experiments using few-shot in-context learning with large language models to explore the transferability of hate speech detection between modalities. Our findings demonstrate that text-based hate speech examples can significantly enhance the classification accuracy of vision-language hate speech. Moreover, text-based demonstrations outperform vision-language demonstrations in few-shot learning settings. These results highlight the effectiveness of cross-modality knowledge transfer and offer valuable insights for improving hate speech detection systems.
摘要：互联网上仇恨言论的广泛存在，包括基于文本的推文和视觉语言模因等形式，对数字平台安全构成了重大挑战。最近的研究已经开发出针对特定模态的检测模型；然而，在不同格式之间转移检测能力存在明显差距。这项研究使用少量上下文学习和大型语言模型进行了大量实验，以探索仇恨言论检测在模态之间的可转移性。我们的研究结果表明，基于文本的仇恨言论示例可以显著提高视觉语言仇恨言论的分类准确性。此外，在少量学习环境中，基于文本的演示比视觉语言演示表现更好。这些结果强调了跨模态知识转移的有效性，并为改进仇恨言论检测系统提供了宝贵的见解。

Title: Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond

Authors: Soyeon Caren Han, Feiqi Cao, Josiah Poon, Roberto Navigli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05608
Pdf URL: https://arxiv.org/pdf/2410.05608
Copy Paste: [[2410.05608]] Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond(https://arxiv.org/abs/2410.05608)
Keywords: language model
Abstract: This tutorial explores recent advancements in multimodal pretrained and large models, capable of integrating and processing diverse data forms such as text, images, audio, and video. Participants will gain an understanding of the foundational concepts of multimodality, the evolution of multimodal research, and the key technical challenges addressed by these models. We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language. Additionally, the tutorial will delve into the intricacies of multimodal large models and instruction tuning strategies to optimise performance for specific tasks. Hands-on laboratories will offer practical experience with state-of-the-art multimodal models, demonstrating real-world applications like visual storytelling and visual question answering. This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI. ACM Multimedia 2024 is the ideal venue for this tutorial, aligning perfectly with our goal of understanding multimodal pretrained and large language models, and their tuning mechanisms.
摘要：本教程探讨了多模态预训练和大型模型的最新进展，这些模型能够集成和处理多种数据形式，例如文本、图像、音频和视频。参与者将了解多模态的基本概念、多模态研究的发展以及这些模型解决的关键技术挑战。我们将介绍最新的多模态数据集和预训练模型，包括视觉和语言之外的模型。此外，本教程将深入探讨多模态大型模型的复杂性和指令调整策略，以优化特定任务的性能。动手实验室将提供最先进的多模态模型的实践经验，展示现实世界的应用，如视觉叙事和视觉问答。本教程旨在让研究人员、从业者和新手掌握利用多模态人工智能的知识和技能。ACM Multimedia 2024 是本教程的理想场所，与我们了解多模态预训练和大型语言模型及其调整机制的目标完美契合。

Title: Stereotype or Personalization? User Identity Biases Chatbot Recommendations

Authors: Anjali Kantharuban, Jeremiah Milbauer, Emma Strubell, Graham Neubig
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05613
Pdf URL: https://arxiv.org/pdf/2410.05613
Copy Paste: [[2410.05613]] Stereotype or Personalization? User Identity Biases Chatbot Recommendations(https://arxiv.org/abs/2410.05613)
Keywords: language model, gpt, llm, chat
Abstract: We demonstrate that when people use large language models (LLMs) to generate recommendations, the LLMs produce responses that reflect both what the user wants and who the user is. While personalized recommendations are often desired by users, it can be difficult in practice to distinguish cases of bias from cases of personalization: we find that models generate racially stereotypical recommendations regardless of whether the user revealed their identity intentionally through explicit indications or unintentionally through implicit cues. We argue that chatbots ought to transparently indicate when recommendations are influenced by a user's revealed identity characteristics, but observe that they currently fail to do so. Our experiments show that even though a user's revealed identity significantly influences model recommendations (p < 0.001), model responses obfuscate this fact in response to user queries. This bias and lack of transparency occurs consistently across multiple popular consumer LLMs (gpt-4o-mini, gpt-4-turbo, llama-3-70B, and claude-3.5) and for four American racial groups.
摘要：我们证明，当人们使用大型语言模型 (LLM) 生成推荐时，LLM 生成的响应既能反映用户的需求，也能反映用户的身份。虽然用户通常希望获得个性化推荐，但在实践中很难区分偏见和个性化情况：我们发现，无论用户是通过明确指示有意透露身份，还是通过隐含线索无意透露身份，模型都会生成种族刻板的推荐。我们认为，聊天机器人应该透明地表明推荐何时受到用户透露的身份特征的影响，但观察到它们目前未能做到这一点。我们的实验表明，即使用户透露的身份会显著影响模型推荐（p < 0.001），模型响应也会在响应用户查询时掩盖这一事实。这种偏见和缺乏透明度在多个流行的消费者 LLM（gpt-4o-mini、gpt-4-turbo、llama-3-70B 和 claude-3.5）和四个美国种族群体中持续存在。

Title: Vector-ICL: In-context Learning with Continuous Vector Representations

Authors: Yufan Zhuang, Chandan Singh, Liyuan Liu, Jingbo Shang, Jianfeng Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05629
Pdf URL: https://arxiv.org/pdf/2410.05629
Copy Paste: [[2410.05629]] Vector-ICL: In-context Learning with Continuous Vector Representations(https://arxiv.org/abs/2410.05629)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities on textual data. We explore whether these capabilities can be extended to continuous vectors from diverse domains, obtained from black-box pretrained encoders. By aligning input data with an LLM's embedding space through lightweight projectors, we observe that LLMs can effectively process and learn from these projected vectors, which we term Vector-ICL. In particular, we find that pretraining projectors with general language modeling objectives enables Vector-ICL, while task-specific finetuning further enhances performance. In our experiments across various tasks and modalities, including text reconstruction, numerical function regression, text classification, summarization, molecule captioning, time-series classification, graph classification, and fMRI decoding, Vector-ICL often surpasses both few-shot ICL and domain-specific models or tuning. We further conduct analyses and case studies, indicating the potential of LLMs to process vector representations beyond traditional token-based paradigms.
摘要：大型语言模型 (LLM) 已在文本数据上展现出卓越的上下文学习 (ICL) 能力。我们探索这些能力是否可以扩展到从黑盒预训练编码器获得的不同领域的连续向量。通过轻量级投影仪将输入数据与 LLM 的嵌入空间对齐，我们观察到 LLM 可以有效地处理和学习这些投影向量，我们称之为 Vector-ICL。具体来说，我们发现使用通用语言建模目标预训练投影仪可以实现 Vector-ICL，而特定于任务的微调可以进一步提高性能。在我们针对各种任务和模式的实验中，包括文本重建、数值函数回归、文本分类、摘要、分子字幕、时间序列分类、图形分类和 fMRI 解码，Vector-ICL 通常超越了少样本 ICL 和特定于领域的模型或调整。我们进一步进行了分析和案例研究，表明 LLM 具有超越传统基于标记的范式来处理向量表示的潜力。

Title: DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models

Authors: Ranchi Zhao, Zhen Leng Thai, Yifan Zhang, Shengding Hu, Yunqi Ba, Jie Zhou, Jie Cai, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05639
Pdf URL: https://arxiv.org/pdf/2410.05639
Copy Paste: [[2410.05639]] DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models(https://arxiv.org/abs/2410.05639)
Keywords: language model, llm
Abstract: The performance of Large Language Models (LLMs) is substantially influenced by the pretraining corpus, which consists of vast quantities of unsupervised data processed by the models. Despite its critical role in model performance, ensuring the quality of this data is challenging due to its sheer volume and the absence of sample-level quality annotations and enhancements. In this paper, we introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing. Specifically, DecorateLM rates texts against quality criteria, tags texts with hierarchical labels, and edits texts into a more formalized format. Due to the massive size of the pretraining corpus, adopting an LLM for decorating the entire corpus is less efficient. Therefore, to balance performance with efficiency, we curate a meticulously annotated training corpus for DecorateLM using a large language model and distill data engineering expertise into a compact 1.2 billion parameter small language model (SLM). We then apply DecorateLM to enhance 100 billion tokens of the training corpus, selecting 45 billion tokens that exemplify high quality and diversity for the further training of another 1.2 billion parameter LLM. Our results demonstrate that employing such high-quality data can significantly boost model performance, showcasing a powerful approach to enhance the quality of the pretraining corpus.
摘要：大型语言模型 (LLM) 的性能在很大程度上受到预训练语料库的影响，该语料库由模型处理的大量无监督数据组成。尽管它在模型性能中起着至关重要的作用，但由于其数量庞大且缺乏样本级质量注释和增强，确保这些数据的质量具有挑战性。在本文中，我们介绍了 DecorateLM，这是一种数据工程方法，旨在通过数据评级、标记和编辑来完善预训练语料库。具体来说，DecorateLM 根据质量标准对文本进行评级，用分层标签标记文本，并将文本编辑为更正式的格式。由于预训练语料库的规模庞大，采用 LLM 来装饰整个语料库效率较低。因此，为了在性能和效率之间取得平衡，我们使用大型语言模型为 DecorateLM 策划了一个精心注释的训练语料库，并将数据工程专业知识提炼成一个紧凑的 12 亿参数小型语言模型 (SLM)。然后，我们应用 DecorateLM 来增强训练语料库中的 1000 亿个标记，并选择 450 亿个代表高质量和多样性的标记，用于进一步训练另外 12 亿个参数 LLM。我们的结果表明，使用如此高质量的数据可以显著提高模型性能，展示了一种提高预训练语料库质量的强大方法。

Title: Unlocking the Boundaries of Thought: A Reasoning Granularity Framework to Quantify and Optimize Chain-of-Thought

Authors: Qiguang Chen, Libo Qin, Jiaqi Wang, Jinxuan Zhou, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05695
Pdf URL: https://arxiv.org/pdf/2410.05695
Copy Paste: [[2410.05695]] Unlocking the Boundaries of Thought: A Reasoning Granularity Framework to Quantify and Optimize Chain-of-Thought(https://arxiv.org/abs/2410.05695)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning has emerged as a promising approach for enhancing the performance of large language models (LLMs) on complex reasoning tasks. Recently, a series of studies attempt to explain the mechanisms underlying CoT, aiming to deepen the understanding of its efficacy. Nevertheless, the existing research faces two major challenges: (1) a lack of quantitative metrics to assess CoT capabilities and (2) a dearth of guidance on optimizing CoT performance. Motivated by this, in this work, we introduce a novel reasoning granularity framework (RGF) to address these challenges. To solve the lack of quantification, we first define a reasoning granularity (RG) to quantify the upper bound of CoT and establish a combination law for RG, enabling a practical quantitative approach applicable to various real-world CoT tasks. To address the lack of optimization, we propose three categories of RGs. We further optimize these categories with combination laws focused on RG promotion and reasoning path optimization for CoT improvement. Through extensive experiments on 25 models and 4 tasks, the study validates the existence and rationality of the proposed framework. Furthermore, it explains the effectiveness of 10 CoT strategies and guides optimization from two perspectives. We hope this work can provide a comprehensive understanding of the boundaries and optimization strategies for reasoning in LLMs. Our code and data are available at this https URL.
摘要：思路链 (CoT) 推理已成为一种有前途的方法，可提高大型语言模型 (LLM) 在复杂推理任务上的性能。最近，一系列研究试图解释 CoT 背后的机制，旨在加深对其功效的理解。然而，现有研究面临两大挑战：(1) 缺乏评估 CoT 能力的定量指标和 (2) 缺乏优化 CoT 性能的指导。受此启发，在本文中，我们引入了一种新颖的推理粒度框架 (RGF) 来应对这些挑战。为了解决量化不足的问题，我们首先定义一个推理粒度 (RG) 来量化 CoT 的上限，并为 RG 建立组合定律，从而实现一种适用于各种现实世界 CoT 任务的实用定量方法。为了解决优化不足的问题，我们提出了三类 RG。我们进一步优化这些类别，使用专注于 RG 提升和推理路径优化的组合定律来改进 CoT。通过对 25 个模型和 4 个任务进行大量实验，该研究验证了所提框架的存在性和合理性。此外，它解释了 10 种 CoT 策略的有效性，并从两个角度指导优化。我们希望这项工作能够全面了解 LLM 中推理的边界和优化策略。我们的代码和数据可在此 https URL 上找到。

Title: Efficient Few-shot Learning for Multi-label Classification of Scientific Documents with Many Classes

Authors: Tim Schopf, Alexander Blatzheim, Nektarios Machner, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05770
Pdf URL: https://arxiv.org/pdf/2410.05770
Copy Paste: [[2410.05770]] Efficient Few-shot Learning for Multi-label Classification of Scientific Documents with Many Classes(https://arxiv.org/abs/2410.05770)
Keywords: prompt
Abstract: Scientific document classification is a critical task and often involves many classes. However, collecting human-labeled data for many classes is expensive and usually leads to label-scarce scenarios. Moreover, recent work has shown that sentence embedding model fine-tuning for few-shot classification is efficient, robust, and effective. In this work, we propose FusionSent (Fusion-based Sentence Embedding Fine-tuning), an efficient and prompt-free approach for few-shot classification of scientific documents with many classes. FusionSent uses available training examples and their respective label texts to contrastively fine-tune two different sentence embedding models. Afterward, the parameters of both fine-tuned models are fused to combine the complementary knowledge from the separate fine-tuning steps into a single model. Finally, the resulting sentence embedding model is frozen to embed the training instances, which are then used as input features to train a classification head. Our experiments show that FusionSent significantly outperforms strong baselines by an average of $6.0$ $F_{1}$ points across multiple scientific document classification datasets. In addition, we introduce a new dataset for multi-label classification of scientific documents, which contains 183,565 scientific articles and 130 classes from the arXiv category taxonomy. Code and data are available at this https URL.
摘要：科学文档分类是一项关键任务，通常涉及许多类别。然而，收集许多类别的人工标记数据成本高昂，通常会导致标签稀缺的情况。此外，最近的研究表明，句子嵌入模型微调用于小样本分类是高效、稳健和有效的。在这项工作中，我们提出了 FusionSent（基于融合的句子嵌入微调），这是一种高效且无需提示的用于多类别科学文档小样本分类的方法。FusionSent 使用可用的训练示例及其各自的标签文本对比微调两个不同的句子嵌入模型。之后，融合两个微调模型的参数，将来自单独微调步骤的互补知识组合成一个模型。最后，冻结生成的句子嵌入模型以嵌入训练实例，然后将其用作输入特征来训练分类头。我们的实验表明，FusionSent 在多个科学文档分类数据集上的表现显著优于强基线，平均高出 $6.0$ $F_{1}$ 点。此外，我们引入了一个新的科学文献多标签分类数据集，其中包含 183,565 篇科学文章和 arXiv 类别分类法中的 130 个类别。代码和数据可在此 https URL 上获取。
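
The fusion step described in the abstract above (merging the parameters of two fine-tuned models into one) can be illustrated with a minimal sketch. The abstract does not state which fusion rule FusionSent uses, so plain linear interpolation is assumed here as a stand-in, and both models are assumed to share the same parameter names and shapes. The function name `fuse_parameters` and the toy parameter dictionaries are hypothetical.

```python
from typing import Dict, List

# A toy "model": parameter name -> flat list of weights (hypothetical layout).
Params = Dict[str, List[float]]


def fuse_parameters(model_a: Params, model_b: Params, weight_a: float = 0.5) -> Params:
    """Linearly interpolate two parameter sets with identical names and shapes.

    Assumption: linear interpolation stands in for whatever fusion rule the
    paper actually uses; the abstract does not specify it.
    """
    if model_a.keys() != model_b.keys():
        raise ValueError("models must share the same parameter names")
    fused: Params = {}
    for name, values_a in model_a.items():
        values_b = model_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        fused[name] = [
            weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(values_a, values_b)
        ]
    return fused


# Two toy fine-tuned models with the same architecture.
model_a = {"encoder.weight": [1.0, 2.0], "encoder.bias": [0.0]}
model_b = {"encoder.weight": [3.0, 4.0], "encoder.bias": [1.0]}
fused = fuse_parameters(model_a, model_b)
print(fused)
```

With equal weights, each fused parameter is the midpoint of the two inputs. In the pipeline the abstract describes, the fused model would then be frozen and used only to embed the training instances for the classification head.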

Title: CodeCipher: Learning to Obfuscate Source Code Against LLMs

Authors: Yalan Lin, Chengcheng Wan, Yixiong Fang, Xiaodong Gu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05797
Pdf URL: https://arxiv.org/pdf/2410.05797
Copy Paste: [[2410.05797]] CodeCipher: Learning to Obfuscate Source Code Against LLMs(https://arxiv.org/abs/2410.05797)
Keywords: language model, llm
Abstract: While large code language models have made significant strides in AI-assisted coding tasks, there are growing concerns about privacy challenges. The user code is transparent to the cloud LLM service provider, inducing risks of unauthorized training, reading, and execution of the user code. In this paper, we propose CodeCipher, a novel method that perturbs privacy from code while preserving the original response from LLMs. CodeCipher transforms the LLM's embedding matrix so that each row corresponds to a different word in the original matrix, forming a token-to-token confusion mapping for obfuscating source code. The new embedding matrix is optimized by minimizing the task-specific loss function. To tackle the challenge of the discrete and sparse nature of word vector spaces, CodeCipher adopts a discrete optimization strategy that aligns the updated vector to the nearest valid token in the vocabulary before each gradient update. We demonstrate the effectiveness of our approach on three AI-assisted coding tasks including code completion, summarization, and translation. Results show that our model successfully confuses the privacy in source code while preserving the original LLM's performance.
摘要：虽然大型代码语言模型在 AI 辅助编码任务中取得了重大进展，但人们越来越担心隐私问题。用户代码对云 LLM 服务提供商是透明的，这会带来未经授权训练、读取和执行用户代码的风险。在本文中，我们提出了 CodeCipher，这是一种新方法，它可以扰乱代码的隐私，同时保留 LLM 的原始响应。CodeCipher 转换 LLM 的嵌入矩阵，使每一行对应于原始矩阵中的不同单词，形成用于混淆源代码的标记到标记混淆映射。通过最小化特定于任务的损失函数来优化新的嵌入矩阵。为了应对词向量空间离散和稀疏性质的挑战，CodeCipher 采用离散优化策略，在每次梯度更新之前将更新后的向量与词汇表中最近的有效标记对齐。我们在三个 AI 辅助编码任务（包括代码完成、摘要和翻译）上证明了我们的方法的有效性。结果表明，我们的模型成功混淆了源代码中的隐私，同时保留了原始 LLM 的性能。
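
The discrete optimization step mentioned in the abstract above (aligning an updated vector to the nearest valid token in the vocabulary) can be sketched as a nearest-neighbor lookup. This is an illustration under assumptions, not the authors' implementation: squared Euclidean distance is assumed as the distance measure, and the function name `nearest_token` and the toy embedding matrix are hypothetical.

```python
from typing import List, Sequence


def nearest_token(vector: Sequence[float], embedding_matrix: List[Sequence[float]]) -> int:
    """Return the index of the embedding row closest to `vector`.

    Assumption: squared Euclidean distance; the abstract does not say which
    distance the method uses.
    """
    if not embedding_matrix:
        raise ValueError("embedding matrix must not be empty")
    best_index = 0
    best_distance = float("inf")
    for index, row in enumerate(embedding_matrix):
        distance = sum((a - b) ** 2 for a, b in zip(vector, row))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


# Toy vocabulary of three tokens with 2-dimensional embeddings.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
updated_vector = [0.8, 0.1]  # a continuous vector, e.g. after an optimizer step
token_id = nearest_token(updated_vector, embeddings)
print(token_id)
```

Snapping each continuous vector back to a real vocabulary entry keeps every row of the transformed matrix tied to an actual token, which is what makes a token-to-token mapping possible.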

Title: Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation

Authors: Bolei He, Nuo Chen, Xinran He, Lingyong Yan, Zhenkai Wei, Jinchang Luo, Zhen-Hua Ling
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05801
Pdf URL: https://arxiv.org/pdf/2410.05801
Copy Paste: [[2410.05801]] Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation(https://arxiv.org/abs/2410.05801)
Keywords: language model, llm, retrieval augmented generation, chain-of-thought
Abstract: Recent Retrieval Augmented Generation (RAG) aims to enhance Large Language Models (LLMs) by incorporating extensive knowledge retrieved from external sources. However, such an approach encounters some challenges: first, the original queries may not be suitable for precise retrieval, resulting in erroneous contextual knowledge; second, the language model can easily generate answers inconsistent with external references due to its knowledge boundary limitation. To address these issues, we propose the chain-of-verification (CoV-RAG) to enhance the external retrieval correctness and internal generation consistency. Specifically, we integrate the verification module into the RAG, engaging in scoring, judgment, and rewriting. To correct external retrieval errors, CoV-RAG retrieves new knowledge using a revised query. To correct internal generation errors, we unify QA and verification tasks with Chain-of-Thought (CoT) reasoning during training. Our comprehensive experiments across various LLMs demonstrate the effectiveness and adaptability of CoV-RAG compared with other strong baselines. In particular, CoV-RAG can significantly surpass the state-of-the-art baselines using different LLM backbones.
摘要：最近的检索增强生成 (RAG) 旨在通过整合从外部来源检索到的大量知识来增强大型语言模型 (LLM)。然而，这种方法遇到了一些挑战：首先，原始查询可能不适合精确检索，从而导致错误的上下文知识；其次，由于知识边界限制，语言模型很容易生成与外部参考不一致的答案。为了解决这些问题，我们提出了验证链 (CoV-RAG) 来增强外部检索正确性和内部生成一致性。具体来说，我们将验证模块集成到 RAG 中，进行评分、判断和重写。为了纠正外部检索错误，CoV-RAG 使用修改后的查询检索新知识。为了纠正内部生成错误，我们在训练期间使用思想链 (CoT) 推理将 QA 和验证任务统一起来。我们在各种 LLM 上的全面实验证明了与其他强基线相比的有效性和适应性。特别是，我们的 CoV-RAG 可以显著超越使用不同 LLM 主干的最先进的基线。

Title: Gradual Learning: Optimizing Fine-Tuning with Partially Mastered Knowledge in Large Language Models

Authors: Bozhou Li, Hao Liang, Yang Li, Fangcheng Fu, Hongzhi Yin, Conghui He, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05802
Pdf URL: https://arxiv.org/pdf/2410.05802
Copy Paste: [[2410.05802]] Gradual Learning: Optimizing Fine-Tuning with Partially Mastered Knowledge in Large Language Models(https://arxiv.org/abs/2410.05802)
Keywords: language model, llm, hallucination
Abstract: During the pretraining phase, large language models (LLMs) acquire vast amounts of knowledge from extensive text corpora. Nevertheless, in later stages such as fine-tuning and inference, the model may encounter knowledge not covered in the initial training, which can lead to hallucinations and degraded performance. This issue has a profound impact on the model's capabilities, as it will inevitably face out-of-scope knowledge after pretraining. Furthermore, fine-tuning is often required to adapt LLMs to domain-specific tasks. However, this phenomenon limits the model's ability to learn and integrate new information during fine-tuning. The effectiveness of fine-tuning largely depends on the type of knowledge involved. Existing research suggests that fine-tuning the model on partially mastered knowledge (for instance, question-answer pairs where the model has a chance of providing correct responses under non-greedy decoding) can enable the model to acquire new knowledge while mitigating hallucination. Notably, this approach can still lead to the forgetting of fully mastered knowledge, constraining the fine-tuning dataset to a narrower range and limiting the model's overall potential for improvement. Given the model's intrinsic reasoning abilities and the interconnectedness of different knowledge areas, it is likely that as the model's capacity to utilize existing knowledge improves during fine-tuning, previously unmastered knowledge may become more understandable. To explore this hypothesis, we conducted experiments and, based on the results, proposed a two-stage fine-tuning strategy. This approach not only improves the model's overall test accuracy and knowledge retention but also preserves its accuracy on previously mastered content. When fine-tuning on the WikiQA dataset, our method increases the amount of knowledge acquired by the model in this stage by 24%.
摘要：在预训练阶段，大型语言模型 (LLM) 从大量文本语料库中获取大量知识。然而，在后续阶段（例如微调和推理），模型可能会遇到初始训练中未涵盖的知识，这可能导致幻觉和性能下降。这个问题对模型的能力有着深远的影响，因为它在预训练后不可避免地会面临超出范围的知识。此外，微调通常需要使 LLM 适应特定领域的任务。然而，这种现象限制了模型在微调过程中学习和整合新信息的能力。微调的有效性在很大程度上取决于所涉及的知识类型。现有研究表明，对部分掌握的知识（例如，在非贪婪解码下模型有机会提供正确答案的问答对）进行微调可以使模型在减轻幻觉的同时获取新知识。值得注意的是，这种方法仍然会导致忘记已经掌握的知识，将微调数据集限制在更窄的范围并限制模型整体改进的潜力。考虑到模型的内在推理能力和不同知识领域的相互联系，很可能随着模型在微调过程中利用现有知识的能力的提高，以前未掌握的知识可能会变得更容易理解。为了探索这一假设，我们进行了实验，并根据结果提出了一种两阶段微调策略。这种方法不仅提高了模型的整体测试准确率和知识保留率，而且还保持了其对以前掌握的内容的准确率。在 WikiQA 数据集上进行微调时，我们的方法将模型在此阶段获得的知识量增加了 24%。

Title: Probing Language Models on Their Knowledge Source

Authors: Zineddine Tighidet, Andrea Mogini, Jiali Mei, Benjamin Piwowarski, Patrick Gallinari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05817
Pdf URL: https://arxiv.org/pdf/2410.05817
Copy Paste: [[2410.05817]] Probing Language Models on Their Knowledge Source(https://arxiv.org/abs/2410.05817)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often encounter conflicts between their learned, internal knowledge (parametric knowledge, PK) and external knowledge provided during inference (contextual knowledge, CK). Understanding how LLMs prioritize one knowledge source over the other remains a challenge. In this paper, we propose a novel probing framework to explore the mechanisms governing the selection between PK and CK in LLMs. Using controlled prompts designed to contradict the model's PK, we demonstrate that specific model activations are indicative of the knowledge source employed. We evaluate this framework on various LLMs of different sizes and demonstrate that mid-layer activations, particularly those related to relations in the input, are crucial in predicting knowledge source selection, paving the way for more reliable models capable of handling knowledge conflicts effectively.
摘要：大型语言模型 (LLM) 经常会遇到其学习到的内部知识（参数知识，PK）与推理过程中提供的外部知识（上下文知识，CK）之间的冲突。了解 LLM 模型如何优先考虑一个知识源而不是另一个知识源仍然是一个挑战。在本文中，我们提出了一个新颖的探测框架来探索控制 LLM 中 PK 和 CK 之间选择的机制。使用旨在与模型的 PK 相矛盾的受控提示，我们证明特定的模型激活可以指示所使用的知识源。我们在各种不同大小的 LLM 上评估了这个框架，并证明中间层激活（特别是与输入中的关系相关的激活）对于预测知识源选择至关重要，为能够有效处理知识冲突的更可靠的模型铺平了道路。

Title: A Zero-Shot approach to the Conversational Tree Search Task

Authors: Dirk Väth, Ngoc Thang Vu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05821
Pdf URL: https://arxiv.org/pdf/2410.05821
Copy Paste: [[2410.05821]] A Zero-Shot approach to the Conversational Tree Search Task(https://arxiv.org/abs/2410.05821)
Keywords: llm, agent
Abstract: In sensitive domains, such as legal or medical domains, the correctness of information given to users is critical. To address this, the recently introduced task Conversational Tree Search (CTS) provides a graph-based framework for controllable task-oriented dialog in sensitive domains. However, a big drawback of state-of-the-art CTS agents is their long training time, which is especially problematic as a new agent must be trained every time the associated domain graph is updated. The goal of this paper is to eliminate the need for training CTS agents altogether. To achieve this, we implement a novel LLM-based method for zero-shot, controllable CTS agents. We show that these agents significantly outperform state-of-the-art CTS agents (p<0.0001; Barnard Exact test) in simulation. This generalizes to all available CTS domains. Finally, we perform user evaluation to test the agent performance in the wild, showing that our policy significantly (p<0.05; Barnard Exact test) improves task-success compared to the state-of-the-art Reinforcement Learning-based CTS agent.
摘要：在敏感领域（例如法律或医疗领域），向用户提供的信息的正确性至关重要。为了解决这个问题，最近引入的任务对话树搜索 (CTS) 为敏感领域中可控的面向任务的对话提供了一个基于图的框架。然而，最先进的 CTS 代理的一个很大的缺点是它们的训练时间长，这尤其成问题，因为每次更新相关域图时都必须训练一个新代理。本文的目标是完全消除训练 CTS 代理的需要。为了实现这个目标，我们为零样本、可控的 CTS 代理实现了一种基于 LLM 的新型方法。我们表明，这些代理在模拟中的表现明显优于最先进的 CTS 代理（p<0.0001；Barnard 精确检验）。这可以推广到所有可用的 CTS 域。最后，我们进行用户评估以在现实中测试代理的性能，结果表明，与最先进的基于强化学习的 CTS 代理相比，我们的策略显著（p<0.05；Barnard Exact）提高了任务成功率。

Title: Multi-Session Client-Centered Treatment Outcome Evaluation in Psychotherapy

Authors: Hongbin Na, Tao Shen, Shumao Yu, Ling Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.05824
Pdf URL: https://arxiv.org/pdf/2410.05824
Copy Paste: [[2410.05824]] Multi-Session Client-Centered Treatment Outcome Evaluation in Psychotherapy(https://arxiv.org/abs/2410.05824)
Keywords: language model
Abstract: In psychotherapy, therapeutic outcome assessment, or treatment outcome evaluation, is essential for enhancing mental health care by systematically evaluating therapeutic processes and outcomes. Existing large language model approaches often focus on therapist-centered, single-session evaluations, neglecting the client's subjective experience and longitudinal progress across multiple sessions. To address these limitations, we propose IPAEval, a client-Informed Psychological Assessment-based Evaluation framework that automates treatment outcome evaluations from the client's perspective using clinical interviews. IPAEval integrates cross-session client-contextual assessment and session-focused client-dynamics assessment to provide a comprehensive understanding of therapeutic progress. Experiments on our newly developed TheraPhase dataset demonstrate that IPAEval effectively tracks symptom severity and treatment outcomes over multiple sessions, outperforming previous single-session models and validating the benefits of items-aware reasoning mechanisms.
摘要：在心理治疗中，治疗结果评估或治疗结果评估对于通过系统地评估治疗过程和结果来加强心理健康护理至关重要。现有的大型语言模型方法通常侧重于以治疗师为中心的单次会话评估，而忽略了客户的主观体验和跨多个会话的纵向进展。为了解决这些限制，我们提出了 IPAEval，这是一个基于客户知情心理评估的评估框架，它使用临床访谈从客户的角度自动进行治疗结果评估。IPAEval 整合了跨会话客户背景评估和以会话为中心的客户动态评估，以全面了解治疗进展。在我们新开发的 TheraPhase 数据集上进行的实验表明，IPAEval 可以有效地跟踪多个会话中的症状严重程度和治疗结果，优于以前的单会话模型，并验证了项目感知推理机制的好处。

Title: From Tokens to Words: on the inner lexicon of LLMs

Authors: Guy Kaplan, Matanel Oren, Yuval Reif, Roy Schwartz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05864
Pdf URL: https://arxiv.org/pdf/2410.05864
Copy Paste: [[2410.05864]] From Tokens to Words: on the inner lexicon of LLMs(https://arxiv.org/abs/2410.05864)
Keywords: llm
Abstract: Natural language is composed of words, but modern LLMs process sub-words as input. A natural question raised by this discrepancy is whether LLMs encode words internally, and if so how. We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent word representations. Our experiments show that this process takes place primarily within the early and middle layers of the model. They also show that it is robust to non-morphemic splits, typos and, perhaps importantly, out-of-vocabulary words: when feeding the inner representation of such words to the model as input vectors, it can "understand" them despite never seeing them during training. Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope. These insights provide a practical, finetuning-free application for expanding the vocabulary of pre-trained models. By enabling the addition of new vocabulary words, we reduce input length and inference iterations, which reduces both space and model latency, with little to no loss in model accuracy.
摘要：自然语言由单词组成，但现代 LLM 将子词作为输入进行处理。这种差异自然会引发一个问题：LLM 是否在内部对单词进行编码，如果是，又是如何编码的。我们提供的证据表明，LLM 参与了内在的去标记化过程，其中子词序列被组合成连贯的单词表示。我们的实验表明，这个过程主要发生在模型的早期和中层。它们还表明，它对非词素分割、拼写错误以及可能更重要的是词汇表之外的单词具有鲁棒性：当将这些单词的内部表示作为输入向量输入到模型时，它可以“理解”它们，尽管在训练期间从未见过它们。我们的研究结果表明，LLM 保持了标记器范围之外的潜在词汇量。这些见解为扩展预训练模型的词汇量提供了一种实用的、无需微调的应用程序。通过启用新词汇表单词的添加，我们减少了输入长度和推理迭代，从而减少了空间和模型延迟，而模型准确性几乎没有损失。

Title: MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

Authors: Amir Hossein Kargaran, Ali Modarressi, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05873
Pdf URL: https://arxiv.org/pdf/2410.05873
Copy Paste: [[2410.05873]] MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment(https://arxiv.org/abs/2410.05873)
Keywords: language model, llm
Abstract: English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: this https URL, Code: this https URL.
摘要：以英语为中心的大型语言模型 (LLM) 通常表现出强大的多语言能力。然而，这些模型的多语言性能仍不清楚，并且没有针对许多语言进行彻底评估。大多数多语言基准都侧重于经典的 NLP 任务，或仅涵盖极少数语言。我们引入了 MEXA，这是一种使用并行句子评估预训练的以英语为中心的 LLM 的多语言能力的方法，与现有的下游任务相比，它适用于更多的语言。MEXA 利用了以英语为中心的 LLM 在中间层使用英语作为一种枢轴语言的事实。它使用并行句子计算英语和非英语语言之间的对齐，以评估从英语到其他语言的语言理解迁移。此对齐可用于估计其他语言的模型性能。我们使用各种并行数据集（FLORES-200 和 Bible）、模型（Llama 家族、Gemma 家族、Mistral 和 OLMo）和已建立的下游任务（Belebele、m-MMLU 和 m-ARC）进行研究。我们探索了在仅解码器模型中计算嵌入的不同方法。我们的结果表明，在默认设置下，MEXA 与九个模型和两个并行数据集中的三个已建立的下游任务实现了统计上显著的平均 Pearson 相关性 0.90。这表明 MEXA 是一种可靠的方法，可用于评估以英语为中心的 LLM 的多语言能力，从而更清楚地了解其多语言潜力和 LLM 的内部运作。排行榜：此 https URL，代码：此 https URL。
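
The alignment computation described in the abstract above (comparing embeddings of parallel English and non-English sentences) can be sketched as follows. The abstract does not give the exact scoring rule, so this sketch assumes one plausible choice: a parallel pair counts as aligned when its cosine similarity is strictly higher than that of every non-parallel pair in the same row and column of the similarity matrix. The function names and toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_score(english: List[Sequence[float]], other: List[Sequence[float]]) -> float:
    """Fraction of parallel pairs (i, i) that beat all non-parallel pairs
    in row i and column i of the cosine-similarity matrix.

    Assumption: this scoring rule is illustrative; the abstract only says
    that alignment is computed from parallel sentences.
    """
    n = len(english)
    if n == 0 or n != len(other):
        raise ValueError("need two non-empty lists of equal length")
    sim = [[cosine(e, o) for o in other] for e in english]
    hits = 0
    for i in range(n):
        rivals = [sim[i][j] for j in range(n) if j != i]
        rivals += [sim[j][i] for j in range(n) if j != i]
        if all(sim[i][i] > r for r in rivals):
            hits += 1
    return hits / n


# Toy sentence embeddings: row i of each list belongs to the same sentence.
english_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
other_embeddings = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
score = alignment_score(english_embeddings, other_embeddings)
print(round(score, 3))
```

In this toy data only the second pair is aligned under the assumed rule: the third non-English vector lies closer to the first English sentence than to its own counterpart, which breaks the first and third pairs. A per-language score of this kind is the sort of quantity the abstract reports correlating with downstream task performance.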

Title: Automatic Summarization of Long Documents

Authors: Naman Chhibbar, Jugal Kalita
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.05903
Pdf URL: https://arxiv.org/pdf/2410.05903
Copy Paste: [[2410.05903]] Automatic Summarization of Long Documents(https://arxiv.org/abs/2410.05903)
Keywords: llm
Abstract: A vast amount of textual data is added to the internet daily, making utilization and interpretation of such data difficult and cumbersome. As a result, automatic text summarization is crucial for extracting relevant information, saving precious reading time. Although many transformer-based models excel in summarization, they are constrained by their input size, preventing them from processing texts longer than their context size. This study introduces three novel algorithms that allow any LLM to efficiently overcome its input size limitation, effectively utilizing its full potential without any architectural modifications. We test our algorithms on texts with more than 70,000 words, and our experiments show a significant increase in BERTScore with competitive ROUGE scores.
摘要：每天都有大量文本数据添加到互联网上，使得利用和解释这些数据变得困难且繁琐。因此，自动文本摘要对于提取相关信息、节省宝贵的阅读时间至关重要。尽管许多基于 Transformer 的模型在摘要方面表现出色，但它们受到输入大小的限制，无法处理长度超过上下文大小的文本。本研究引入了三种新算法，使任何 LLM 都能有效克服其输入大小限制，有效地发挥其全部潜力而无需进行任何架构修改。我们在超过 70,000 个单词的文本上测试了我们的算法，我们的实验表明 BERTScore 显著提高，并且 ROUGE 分数具有竞争力。

Title: Give me a hint: Can LLMs take a hint to solve math problems?

Authors: Vansh Agrawal, Pratham Singla, Amitoj Singh Miglani, Shivank Garg, Ayush Mangal
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2410.05915
Pdf URL: https://arxiv.org/pdf/2410.05915
Copy Paste: [[2410.05915]] Give me a hint: Can LLMs take a hint to solve math problems?(https://arxiv.org/abs/2410.05915)
Keywords: language model, llm, prompt
Abstract: While many state-of-the-art LLMs have shown poor logical and basic mathematical reasoning, recent works try to improve their problem-solving abilities using prompting techniques. We propose giving "hints" to improve the language model's performance on advanced mathematical problems, taking inspiration from how humans approach math pedagogically. We also test the model's adversarial robustness to wrong hints. We demonstrate the effectiveness of our approach by evaluating various LLMs, presenting them with a diverse set of problems of different difficulties and topics from the MATH dataset and comparing against techniques such as one-shot, few-shot, and chain of thought prompting.
摘要：虽然许多最先进的 LLM 表现出较差的逻辑和基本数学推理能力，但最近的研究尝试使用提示技术来提高其解决问题的能力。我们建议给出“提示”来提高语言模型在高级数学问题上的表现，从人类的数学教学方法中汲取灵感。我们还测试了模型对错误提示的对抗鲁棒性。我们通过评估各种 LLM 来证明我们方法的有效性，向他们展示来自 MATH 数据集的不同难度和主题的各种问题，并与一次性、少量和思路链提示等技术进行比较。

Title: Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

Authors: Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.05983
Pdf URL: https://arxiv.org/pdf/2410.05983
Copy Paste: [[2410.05983]] Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG(https://arxiv.org/abs/2410.05983)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) empowers large language models (LLMs) to utilize external knowledge sources. The increasing capacity of LLMs to process longer input sequences opens up avenues for providing more retrieved information, to potentially enhance the quality of generated outputs. It is plausible to assume that a larger retrieval set would contain more relevant information (higher recall), which might result in improved performance. However, our empirical findings demonstrate that for many long-context LLMs, the quality of generated output initially improves but then declines as the number of retrieved passages increases. This paper investigates this phenomenon, identifying the detrimental impact of retrieved "hard negatives" as a key contributor. To mitigate this and enhance the robustness of long-context LLM-based RAG, we propose both training-free and training-based approaches. We first showcase the effectiveness of retrieval reordering as a simple yet powerful training-free optimization. Furthermore, we explore training-based methods, specifically RAG-specific implicit LLM fine-tuning and RAG-oriented fine-tuning with intermediate reasoning, demonstrating their capacity for substantial performance gains. Finally, we conduct a systematic analysis of design choices for these training-based methods, including data distribution, retriever selection, and training context length.
摘要：检索增强生成 (RAG) 使大型语言模型 (LLM) 能够利用外部知识源。LLM 处理较长输入序列的能力不断增强，为提供更多检索信息开辟了途径，从而有可能提高生成输出的质量。可以合理地假设，更大的检索集将包含更多相关信息（更高的召回率），这可能会导致性能提高。然而，我们的实证结果表明，对于许多长上下文 LLM，生成输出的质量最初会提高，但随后会随着检索到的段落数量的增加而下降。本文研究了这一现象，并确定了检索到的“硬否定”的不利影响是主要因素。为了缓解这种情况并增强基于长上下文 LLM 的 RAG 的稳健性，我们提出了无需训练和基于训练的方法。我们首先展示了检索重新排序作为一种简单但功能强大的无需训练优化的有效性。此外，我们探索了基于训练的方法，特别是 RAG 特定的隐式 LLM 微调和具有中间推理的面向 RAG 的微调，证明了它们能够显著提高性能。最后，我们对这些基于训练的方法的设计选择进行了系统分析，包括数据分布、检索器选择和训练上下文长度。
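
The retrieval reordering mentioned in the abstract above is not spelled out there, so the following is only a sketch under an assumption: that passages ranked by the retriever are rearranged so the highest-ranked ones sit at the beginning and end of the context, and the lowest-ranked ones fall in the middle. The function name `reorder_for_long_context` and the passage labels are hypothetical.

```python
from typing import List


def reorder_for_long_context(passages_by_rank: List[str]) -> List[str]:
    """Place top-ranked passages at both edges of the context.

    Input is sorted from most to least relevant. Even-indexed passages fill
    the front in order; odd-indexed passages fill the back in reverse, so the
    least relevant passages end up in the middle.

    Assumption: this interleaving is one possible reordering; the abstract
    does not describe the exact procedure.
    """
    front: List[str] = []
    back: List[str] = []
    for index, passage in enumerate(passages_by_rank):
        if index % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    return front + back[::-1]


ranked = ["p1", "p2", "p3", "p4", "p5"]  # p1 is the most relevant passage
reordered = reorder_for_long_context(ranked)
print(reordered)
```

Because the function only permutes the retrieved passages, it needs no model training, which matches the abstract's description of reordering as a training-free optimization.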

Title: Can Language Models Induce Grammatical Knowledge from Indirect Evidence?

Authors: Miyu Oba, Yohei Oseki, Akiyo Fukatsu, Akari Haga, Hiroki Ouchi, Taro Watanabe, Saku Sugawara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06022
Pdf URL: https://arxiv.org/pdf/2410.06022
Copy Paste: [[2410.06022]] Can Language Models Induce Grammatical Knowledge from Indirect Evidence?(https://arxiv.org/abs/2410.06022)
Keywords: language model
Abstract: What kinds of data, and how much of it, are necessary for language models to induce grammatical knowledge to judge sentence acceptability? Recent language models still have much room for improvement in their data efficiency compared to humans. This paper investigates whether language models efficiently use indirect data (indirect evidence), from which they infer sentence acceptability. In contrast, humans use indirect evidence efficiently, which is considered one of the inductive biases contributing to efficient language acquisition. To explore this question, we introduce the Wug InDirect Evidence Test (WIDET), a dataset consisting of training instances inserted into the pre-training data and evaluation instances. We inject synthetic instances with newly coined wug words into pretraining data and explore the model's behavior on evaluation data that assesses grammatical acceptability regarding those words. We prepare the injected instances by varying their levels of indirectness and quantity. Our experiments surprisingly show that, in certain language phenomena, language models do not induce grammatical knowledge even after repeated exposure to instances that have the same structure as the evaluation instances and differ from them only in lexical items. Our findings suggest a potential direction for future research: developing models that use latent indirect evidence to induce grammatical knowledge.

Title: Training-free LLM-generated Text Detection by Mining Token Probability Sequences

Authors: Yihuai Xu, Yongwei Wang, Yifei Bi, Huangsen Cao, Zhouhan Lin, Yu Zhao, Fei Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06072
Pdf URL: https://arxiv.org/pdf/2410.06072
Copy Paste: [[2410.06072]] Training-free LLM-generated Text Detection by Mining Token Probability Sequences(https://arxiv.org/abs/2410.06072)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in generating high-quality texts across diverse domains. However, the potential misuse of LLMs has raised significant concerns, underscoring the urgent need for reliable detection of LLM-generated texts. Conventional training-based detectors often struggle with generalization, particularly in cross-domain and cross-model scenarios. In contrast, training-free methods, which focus on inherent discrepancies through carefully designed statistical features, offer improved generalization and interpretability. Despite this, existing training-free detection methods typically rely on global text sequence statistics, neglecting the modeling of local discriminative features, thereby limiting their detection efficacy. In this work, we introduce a novel training-free detector, termed Lastde, which synergizes local and global statistics for enhanced detection. For the first time, we introduce time series analysis to LLM-generated text detection, capturing the temporal dynamics of token probability sequences. By integrating these local statistics with global ones, our detector reveals significant disparities between human and LLM-generated texts. We also propose an efficient alternative, Lastde++, to enable real-time detection. Extensive experiments on six datasets involving cross-domain, cross-model, and cross-lingual detection scenarios, under both white-box and black-box settings, demonstrate that our method consistently achieves state-of-the-art performance. Furthermore, our approach exhibits greater robustness against paraphrasing attacks compared to existing baseline methods.

Title: TOWER: Tree Organized Weighting for Evaluating Complex Instructions

Authors: Noah Ziems, Zhihan Zhang, Meng Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06089
Pdf URL: https://arxiv.org/pdf/2410.06089
Copy Paste: [[2410.06089]] TOWER: Tree Organized Weighting for Evaluating Complex Instructions(https://arxiv.org/abs/2410.06089)
Keywords: language model, llm, chat
Abstract: Evaluating the ability of large language models (LLMs) to follow complex human-written instructions is essential for their deployment in real-world applications. While benchmarks like Chatbot Arena use human judges to assess model performance, they are resource-intensive and time-consuming. Alternative methods using LLMs as judges, such as AlpacaEval, MT Bench, WildBench, and InFoBench, offer improvements but still do not capture the fact that some aspects of a complex instruction are more important to follow than others. To address this gap, we propose a novel evaluation metric, TOWER, that incorporates human-judged importance into the assessment of complex instruction following. We show that human annotators agree with tree-based representations of these complex instructions nearly as much as they agree with other human annotators. We release tree-based annotations of the InFoBench dataset and the corresponding evaluation code to facilitate future research.

Title: Listen to the Patient: Enhancing Medical Dialogue Generation with Patient Hallucination Detection and Mitigation

Authors: Lang Qin, Yao Zhang, Hongru Liang, Adam Jatowt, Zhenglu Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06094
Pdf URL: https://arxiv.org/pdf/2410.06094
Copy Paste: [[2410.06094]] Listen to the Patient: Enhancing Medical Dialogue Generation with Patient Hallucination Detection and Mitigation(https://arxiv.org/abs/2410.06094)
Keywords: hallucination, agent
Abstract: Medical dialogue systems aim to provide medical services through patient-agent conversations. Previous methods typically regard patients as ideal users, focusing mainly on common challenges in dialogue systems, while neglecting the potential biases or misconceptions that might be introduced by real patients, who are typically non-experts. This study investigates the discrepancy between patients' expressions during medical consultations and their actual health conditions, defined as patient hallucination. Such phenomena often arise from patients' lack of knowledge and comprehension, concerns, and anxieties, resulting in the transmission of inaccurate or wrong information during consultations. To address this issue, we propose MedPH, a Medical dialogue generation method for mitigating the problem of Patient Hallucinations designed to detect and cope with hallucinations. MedPH incorporates a detection method that utilizes one-dimensional structural entropy over a temporal dialogue entity graph, and a mitigation strategy based on hallucination-related information to guide patients in expressing their actual conditions. Experimental results indicate the high effectiveness of MedPH when compared to existing approaches in both medical entity prediction and response generation tasks, while also demonstrating its effectiveness in mitigating hallucinations within interactive scenarios.

Title: Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation

Authors: Esteban Garces Arias, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06097
Pdf URL: https://arxiv.org/pdf/2410.06097
Copy Paste: [[2410.06097]] Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation(https://arxiv.org/abs/2410.06097)
Keywords: language model, llm
Abstract: Decoding strategies for large language models (LLMs) are a critical but often underexplored aspect of text generation tasks. Since LLMs produce probability distributions over the entire vocabulary, various decoding methods have been developed to transform these probabilities into coherent and fluent text, each with its own set of hyperparameters. In this study, we present a large-scale, comprehensive analysis of how hyperparameter selection affects text quality in open-ended text generation across multiple LLMs, datasets, and evaluation metrics. Through an extensive sensitivity analysis, we provide practical guidelines for hyperparameter tuning and demonstrate the substantial influence of these choices on text quality. Using three established datasets, spanning factual domains (e.g., news) and creative domains (e.g., fiction), we show that hyperparameter tuning significantly impacts generation quality, though its effects vary across models and tasks. We offer in-depth insights into these effects, supported by both human evaluations and a synthesis of widely-used automatic evaluation metrics.
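Note: the abstract above does not name the decoding methods or hyperparameters it studies. As a rough illustration of what such hyperparameters do, the sketch below applies three widely used ones (temperature, top-k, and top-p) to a toy next-token distribution. The choice of these three, the function names, and the logit values are assumptions made for this example, not details taken from the paper.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the top_k most likely tokens (0 = no limit), then the smallest
    prefix whose cumulative mass reaches top_p, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
sharp = softmax(logits, temperature=0.5)  # concentrates mass on token 0
flat = softmax(logits, temperature=2.0)   # spreads mass more evenly
```

Even on this toy distribution, the settings interact: a low temperature already concentrates the probability mass, so a top-p cutoff then keeps fewer tokens than it would at a higher temperature.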

Title: Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA

Authors: Wenyu Huang, Guancheng Zhou, Hongru Wang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06121
Pdf URL: https://arxiv.org/pdf/2410.06121
Copy Paste: [[2410.06121]] Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA(https://arxiv.org/abs/2410.06121)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is widely used to inject external non-parametric knowledge into large language models (LLMs). Recent works suggest that Knowledge Graphs (KGs) contain valuable external knowledge for LLMs. Retrieving information from KGs differs from extracting it from document sets. Most existing approaches seek to directly retrieve relevant subgraphs, thereby eliminating the need for extensive SPARQL annotations, traditionally required by semantic parsing methods. In this paper, we model the subgraph retrieval task as a conditional generation task handled by small language models. Specifically, we define a subgraph identifier as a sequence of relations, each represented as a special token stored in the language models. Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models relying on 7B parameters, demonstrating that small language models are capable of performing the subgraph retrieval task. Furthermore, our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks. Our model and data will be made available online: this https URL.

Title: AgentSquare: Automatic LLM Agent Search in Modular Design Space

Authors: Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, Yong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06153
Pdf URL: https://arxiv.org/pdf/2410.06153
Copy Paste: [[2410.06153]] AgentSquare: Automatic LLM Agent Search in Modular Design Space(https://arxiv.org/abs/2410.06153)
Keywords: language model, llm, agent
Abstract: Recent advancements in Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks. However, current research largely relies on manual, task-specific design, limiting their adaptability to novel tasks. In this paper, we introduce a new research problem: Modularized LLM Agent Search (MoLAS). We propose a modular design space that abstracts existing LLM agent designs into four fundamental modules with uniform IO interface: Planning, Reasoning, Tool Use, and Memory. Building on this design space, we present a novel LLM agent search framework called AgentSquare, which introduces two core mechanisms, i.e., module evolution and recombination, to efficiently search for optimized LLM agents. To further accelerate the process, we design a performance predictor that uses in-context surrogate models to skip unpromising agent designs. Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. Moreover, AgentSquare can generate interpretable design insights, enabling a deeper understanding of agentic architecture and its impact on task performance. We believe that the modular design space and AgentSquare search framework offer a platform for fully exploiting the potential of prior successful designs and consolidating the collective efforts of research community. Code repo is available at this https URL.

Title: Manual Verbalizer Enrichment for Few-Shot Text Classification

Authors: Quang Anh Nguyen, Nadi Tomeh, Mustapha Lebbah, Thierry Charnois, Hanene Azzag, Santiago Cordoba Muñoz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06173
Pdf URL: https://arxiv.org/pdf/2410.06173
Copy Paste: [[2410.06173]] Manual Verbalizer Enrichment for Few-Shot Text Classification(https://arxiv.org/abs/2410.06173)
Keywords: language model, prompt
Abstract: With the continuous development of pre-trained language models, prompt-based training has become a well-adopted paradigm that drastically improves the exploitation of models for many natural language processing tasks. Prompting also shows great performance compared to traditional fine-tuning when adapted to zero-shot or few-shot scenarios where the amount of annotated data is limited. In this framework, the role of verbalizers is essential, as they map masked-word distributions to output predictions. In this work, we propose Manual Verbalizer Enrichment, an approach for verbalizer construction that enriches class labels using neighborhood relations in the word embedding space for the text classification task. In addition, we elaborate a benchmarking procedure to evaluate typical baselines of verbalizers for document classification in few-shot learning contexts. Our model achieves state-of-the-art results while using significantly fewer resources. We show that our approach is particularly effective in cases with extremely limited supervision data.

Title: Entering Real Social World! Benchmarking the Theory of Mind and Socialization Capabilities of LLMs from a First-person Perspective

Authors: Guiyang Hou, Wenqi Zhang, Yongliang Shen, Zeqi Tan, Sihao Shen, Weiming Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06195
Pdf URL: https://arxiv.org/pdf/2410.06195
Copy Paste: [[2410.06195]] Entering Real Social World! Benchmarking the Theory of Mind and Socialization Capabilities of LLMs from a First-person Perspective(https://arxiv.org/abs/2410.06195)
Keywords: language model, llm, agent
Abstract: In the social world, humans possess the capability to infer and reason about others' mental states (such as emotions, beliefs, and intentions), known as Theory of Mind (ToM). At the same time, humans' own mental states evolve in response to social situations, a capability we refer to as socialization. Together, these capabilities form the foundation of human social interaction. In the era of artificial intelligence (AI), especially with the development of large language models (LLMs), we raise an intriguing question: How do LLMs perform in terms of ToM and socialization capabilities? And more broadly, can these AI models truly enter and navigate the real social world? Existing research evaluates LLMs' ToM and socialization capabilities by positioning LLMs as passive observers from a third-person perspective, rather than as active participants. However, compared to the third-person perspective, observing and understanding the world from an egocentric first-person perspective is a natural approach for both humans and AI agents. The ToM and socialization capabilities of LLMs from a first-person perspective, a crucial attribute for advancing embodied AI agents, remain unexplored. To answer the aforementioned questions and bridge the research gap, we introduce EgoSocialArena, a novel framework designed to evaluate and investigate the ToM and socialization capabilities of LLMs from a first-person perspective. It encompasses two evaluation environments, a static environment and an interactive environment, with seven scenarios: Daily Life, Counterfactual, New World, Blackjack, Number Guessing, and Limit Texas Hold'em, totaling 2,195 data entries. With EgoSocialArena, we have conducted a comprehensive evaluation of nine advanced LLMs and observed some key insights regarding the future development of LLMs as well as the capability levels of the most advanced LLMs currently available.

Title: Integrating Planning into Single-Turn Long-Form Text Generation

Authors: Yi Liang, You Wu, Honglei Zhuang, Li Chen, Jiaming Shen, Yiling Jia, Zhen Qin, Sumit Sanghai, Xuanhui Wang, Carl Yang, Michael Bendersky
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06203
Pdf URL: https://arxiv.org/pdf/2410.06203
Copy Paste: [[2410.06203]] Integrating Planning into Single-Turn Long-Form Text Generation(https://arxiv.org/abs/2410.06203)
Keywords: language model, llm, prompt
Abstract: Generating high-quality, in-depth textual documents, such as academic papers, news articles, Wikipedia entries, and books, remains a significant challenge for Large Language Models (LLMs). In this paper, we propose to use planning to generate long form content. To achieve our goal, we generate intermediate steps via an auxiliary task that teaches the LLM to plan, reason and structure before generating the final text. Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning. To overcome the scarcity of training data for these intermediate steps, we leverage LLMs to generate synthetic intermediate writing data such as outlines, key information and summaries from existing full articles. Our experiments demonstrate on two datasets from different domains, namely the scientific news dataset SciNews and Wikipedia datasets in KILT-Wiki and FreshWiki, that LLMs fine-tuned with the auxiliary task generate higher quality documents. We observed +2.5% improvement in ROUGE-Lsum, and a strong 3.60 overall win/loss ratio via human SxS evaluation, with clear wins in organization, relevance, and verifiability.

Title: Round and Round We Go! What makes Rotary Positional Encodings useful?

Authors: Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06205
Pdf URL: https://arxiv.org/pdf/2410.06205
Copy Paste: [[2410.06205]] Round and Round We Go! What makes Rotary Positional Encodings useful?(https://arxiv.org/abs/2410.06205)
Keywords: language model, llm
Abstract: Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most popular types of encoding used today in LLMs is Rotary Positional Encodings (RoPE), which rotate the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma greatly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry semantic information. We mathematically prove interesting behaviours of RoPE and conduct experiments to verify our findings, proposing a modification of RoPE that fixes some highlighted issues and improves performance. We believe that this work represents an interesting step in better understanding PEs in LLMs, which we believe holds crucial value for scaling LLMs to large sizes and context lengths.
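Note: the rotation mentioned in the abstract above can be illustrated with a minimal pure-Python sketch of the standard RoPE formulation. Each consecutive pair of dimensions is rotated by an angle proportional to the token's absolute position, which makes the query-key dot product depend only on the positional offset. The base of 10000, the four-dimensional toy vectors, and the function names are assumptions for this example; this is not the paper's code, nor the modification it proposes.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions of `vec` by pos * theta,
    where theta = base ** (-i / d) for the pair starting at even index i.
    The first pair has the highest frequency, the last pair the lowest."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3 positions) at two different absolute locations.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error because both pairs are three positions apart. Rotations also preserve vector norms, so RoPE changes only the relative angle between a query and a key, not their magnitudes.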

Title: DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback

Authors: Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06215
Pdf URL: https://arxiv.org/pdf/2410.06215
Copy Paste: [[2410.06215]] DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback(https://arxiv.org/abs/2410.06215)
Keywords: llm, agent
Abstract: The process of creating training data to teach models is currently driven by humans, who manually analyze model weaknesses and plan how to create data that improves a student model. Recent approaches using LLMs as annotators reduce human effort, but still require humans to interpret feedback from evaluations and control the LLM to produce data the student needs. Automating this labor-intensive process by creating autonomous data generation agents - or teachers - is desirable, but requires environments that can simulate the feedback-driven, iterative, closed loop of data creation. To enable rapid and scalable testing for such agents and their modules, we introduce DataEnvGym, a testbed of teacher environments for data generation agents. DataEnvGym frames data generation as a sequential decision-making task, involving an agent consisting of a data generation policy (which generates a plan for creating training data) and a data generation engine (which transforms the plan into data), inside an environment that provides student feedback. The agent's goal is to improve student performance. Students are iteratively trained and evaluated on generated data, with their feedback (in the form of errors or weak skills) being reported to the agent after each iteration. DataEnvGym includes multiple teacher environment instantiations across 3 levels of structure in the state representation and action space. More structured environments are based on inferred skills and offer more interpretability and curriculum control. We support 3 diverse tasks (math, code, and VQA) and test multiple students and teachers. Example agents in our teaching environments can iteratively improve students across tasks and settings. Moreover, we show that environments teach different skill levels and test variants of key modules, pointing to future work in improving data generation agents, engines, and feedback mechanisms.

Title: Probing the Robustness of Theory of Mind in Large Language Models

Authors: Christian Nickel, Laura Schrewe, Lucie Flek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06271
Pdf URL: https://arxiv.org/pdf/2410.06271
Copy Paste: [[2410.06271]] Probing the Robustness of Theory of Mind in Large Language Models(https://arxiv.org/abs/2410.06271)
Keywords: language model, gpt, llm, chat, agent
Abstract: With the success of ChatGPT and other similarly sized SotA LLMs, claims of emergent human-like social reasoning capabilities, especially Theory of Mind (ToM), in these models have appeared in the scientific literature. On the one hand, those ToM capabilities have been successfully tested using tasks styled similarly to those used in psychology (Kosinski, 2023). On the other hand, follow-up studies showed that those capabilities vanished when the tasks were slightly altered (Ullman, 2023). In this work we introduce a novel dataset of 68 tasks for probing ToM in LLMs, including potentially challenging variations which are assigned to 10 complexity classes. In this way, it provides novel insights into the challenges LLMs face with those task variations. We evaluate the ToM performance of four SotA open-source LLMs on our dataset and the dataset introduced by Kosinski (2023). The overall low goal accuracy across all evaluated models indicates only a limited degree of ToM capabilities. The LLMs' performance on simple complexity class tasks from both datasets is similar. However, we find a consistent tendency in all tested LLMs to perform poorly on tasks that require the realization that an agent has knowledge of automatic state changes in its environment, even when those are spelled out to the model. For task complications that change the relationship between objects by replacing prepositions, we notice a performance drop in all models, with the strongest impact on the mixture-of-experts model. With our dataset of tasks grouped by complexity, we offer directions for further research on how to stabilize and advance ToM capabilities in LLMs.

Title: The Mystery of Compositional Generalization in Graph-based Generative Commonsense Reasoning

Authors: Xiyan Fu, Anette Frank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06272
Pdf URL: https://arxiv.org/pdf/2410.06272
Copy Paste: [[2410.06272]] The Mystery of Compositional Generalization in Graph-based Generative Commonsense Reasoning(https://arxiv.org/abs/2410.06272)
Keywords: llm
Abstract: While LLMs have emerged as performant architectures for reasoning tasks, their compositional generalization capabilities have been questioned. In this work, we introduce a Compositional Generalization Challenge for Graph-based Commonsense Reasoning (CGGC) that goes beyond previous evaluations based on sequences or tree structures and instead involves a reasoning graph: it requires models to generate a natural sentence based on given concepts and a corresponding reasoning graph, where the presented graph involves a previously unseen combination of relation types. To master this challenge, models need to learn how to reason over relation tuples within the graph, and how to compose them when conceptualizing a verbalization. We evaluate seven well-known LLMs using in-context learning and find that performant LLMs still struggle in compositional generalization. We investigate potential causes of this gap by analyzing the structures of reasoning graphs, and find that different structures present varying levels of difficulty for compositional generalization. Arranging the order of demonstrations according to the structures' difficulty shows that organizing samples in an easy-to-hard schema enhances the compositional generalization ability of LLMs.

Title: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning

Authors: Ruosen Li, Ziming Luo, Xinya Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06304
Pdf URL: https://arxiv.org/pdf/2410.06304
Copy Paste: [[2410.06304]] Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning(https://arxiv.org/abs/2410.06304)
Keywords: language model, gpt, llm, hallucination, chat
Abstract: Hallucinations in large language models (LLMs) pose significant challenges in tasks requiring complex multi-step reasoning, such as mathematical problem-solving. Existing approaches primarily detect the presence of hallucinations but lack a nuanced understanding of their types and manifestations. In this paper, we first introduce a comprehensive taxonomy that categorizes the common hallucinations in mathematical reasoning tasks into six types: fabrication, factual inconsistency, context inconsistency, instruction inconsistency, logical inconsistency, and logical error. We then propose FG-PRM (Fine-Grained Process Reward Model), an augmented model designed to detect and mitigate hallucinations in a fine-grained, step-level manner. To address the limitations of manually labeling training data, we propose an automated method for generating fine-grained hallucination data using LLMs. By injecting hallucinations into reasoning steps of correct solutions, we create a diverse and balanced synthetic dataset for training FG-PRM, which consists of six specialized Process Reward Models (PRMs), each tailored to detect a specific hallucination type. Our FG-PRM demonstrates superior performance across two key tasks: 1) fine-grained hallucination detection, which classifies hallucination types for each reasoning step; and 2) verification, which ranks multiple LLM-generated outputs to select the most accurate solution, mitigating reasoning hallucinations. Our experiments show that FG-PRM outperforms ChatGPT-3.5 and Claude-3 on fine-grained hallucination detection and substantially boosts the performance of LLMs on the GSM8K and MATH benchmarks.
摘要：大型语言模型 (LLM) 中的幻觉对需要复杂多步骤推理的任务（例如数学问题解决）提出了重大挑战。现有的方法主要检测幻觉的存在，但缺乏对其类型和表现的细致理解。在本文中，我们首先介绍一个全面的分类法，将数学推理任务中常见的幻觉分为六种类型：虚构、事实不一致、上下文不一致、指令不一致、逻辑不一致和逻辑错误。然后我们提出了 FG-PRM（细粒度过程奖励模型），这是一种增强模型，旨在以细粒度、步骤级的方式检测和缓解幻觉。为了解决手动标记训练数据的局限性，我们提出了一种使用 LLM 生成细粒度幻觉数据的自动化方法。通过将幻觉注入正确解决方案的推理步骤中，我们创建了一个多样化且平衡的合成数据集来训练 FG-PRM，它由六个专门的过程奖励模型 (PRM) 组成，每个模型都经过量身定制，以检测特定的幻觉类型。我们的 FG-PRM 在两个关键任务中表现出色：1) 细粒度幻觉检测：对每个推理步骤的幻觉类型进行分类；2) 验证：对多个 LLM 生成的输出进行排序以选择最准确的解决方案，减轻推理幻觉。我们的实验表明，FG-PRM 在细粒度幻觉检测方面优于 ChatGPT-3.5 和 Claude-3，并大大提高了 LLM 在 GSM8K 和 MATH 基准上的性能。
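
Sketch (not from the paper): the verification task above ranks several candidate solutions using step-level scores. The snippet below is a minimal illustration, assuming each of the six specialized PRMs returns, for every reasoning step, a probability that the step is free of its hallucination type. The aggregation rule (minimum across detectors, product across steps) and the names `solution_score` and `pick_best` are assumptions chosen for illustration, not the authors' method.

```python
from math import prod

# The six hallucination types named in the abstract.
HALLUCINATION_TYPES = [
    "fabrication",
    "factual inconsistency",
    "context inconsistency",
    "instruction inconsistency",
    "logical inconsistency",
    "logical error",
]


def solution_score(step_scores):
    """Score one candidate solution.

    step_scores: list with one dict per reasoning step, mapping each
    hallucination type to an assumed probability that the step is free
    of that type.
    """
    # Assumed rule: a step is only as good as its weakest detector score,
    # and a solution is the product of its step scores.
    return prod(min(step[t] for t in HALLUCINATION_TYPES) for step in step_scores)


def pick_best(candidates):
    """Return the name of the candidate with the highest solution score."""
    return max(candidates, key=lambda name: solution_score(candidates[name]))


# Hypothetical scores for two candidate solutions with two steps each.
clean_step = {t: 0.9 for t in HALLUCINATION_TYPES}
flawed_step = {**clean_step, "logical error": 0.1}
candidates = {
    "solution_a": [clean_step, clean_step],
    "solution_b": [clean_step, flawed_step],
}
best = pick_best(candidates)
```

Here `best` selects `solution_a`, because a single low "logical error" score drags down the whole of `solution_b`.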

Title: Auto-Evolve: Enhancing Large Language Model's Performance via Self-Reasoning Framework

Authors: Krishna Aswani, Huilin Lu, Pranav Patankar, Priya Dhalwani, Iris Tan, Jayant Ganeshmohan, Simon Lacasse
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06328
Pdf URL: https://arxiv.org/pdf/2410.06328
Copy Paste: [[2410.06328]] Auto-Evolve: Enhancing Large Language Model's Performance via Self-Reasoning Framework(https://arxiv.org/abs/2410.06328)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Recent advancements in prompt engineering strategies, such as Chain-of-Thought (CoT) and Self-Discover, have demonstrated significant potential in improving the reasoning abilities of Large Language Models (LLMs). However, these state-of-the-art (SOTA) prompting strategies rely on a single or fixed set of static seed reasoning modules, like "think step by step" or "break down this problem", intended to simulate the human approach to problem-solving. This constraint limits the flexibility of models in tackling diverse problems effectively. In this paper, we introduce Auto-Evolve, a novel framework that enables LLMs to self-create dynamic reasoning modules and a downstream action plan, resulting in significant improvements over current SOTA methods. We evaluate Auto-Evolve on the challenging BigBench-Hard (BBH) dataset with Claude 2.0, Claude 3 Sonnet, Mistral Large, and GPT 4, where it consistently outperforms the SOTA prompt strategies. Auto-Evolve outperforms CoT by up to 10.4% and by 7% on average across these four models. Our framework introduces two innovations: a) Auto-Evolve dynamically generates reasoning modules for each task while aligning with the human reasoning paradigm, thus eliminating the need for predefined templates. b) We introduce an iterative refinement component that incrementally refines instruction guidance for LLMs and helps boost performance by an average of 2.8% compared to doing it in a single step.
摘要：提示工程策略的最新进展，例如思维链 (CoT) 和自我发现，已显示出在提高大型语言模型 (LLM) 推理能力方面的巨大潜力。然而，这些最先进的 (SOTA) 提示策略依赖于单个或固定的一组静态种子推理模块，如“一步一步思考”或“分解这个问题”，旨在模拟人类解决问题的方法。这种约束限制了模型在有效解决各种问题方面的灵活性。在本文中，我们介绍了 Auto-Evolve，这是一个新颖的框架，使 LLM 能够自行创建动态推理模块和下游行动计划，从而显著改进了当前的 SOTA 方法。我们在具有挑战性的 BigBench-Hard (BBH) 数据集上使用 Claude 2.0、Claude 3 Sonnet、Mistral Large 和 GPT 4 对 Auto-Evolve 进行了评估，结果显示它始终优于 SOTA 提示策略。在这四种模型中，Auto-Evolve 的表现比 CoT 高出 10.4%，平均高出 7%。我们的框架引入了两项创新：a) Auto-Evolve 动态生成每个任务的推理模块，同时与人类推理范式保持一致，从而无需预定义模板。b) 我们引入了一个迭代细化组件，它可以逐步细化 LLM 的指导，与单步执行相比，有助于将性能平均提高 2.8%。

Title: Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing

Authors: Zhuoran Zhang, Yongxiang Li, Zijian Kan, Keyuan Cheng, Lijie Hu, Di Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06331
Pdf URL: https://arxiv.org/pdf/2410.06331
Copy Paste: [[2410.06331]] Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing(https://arxiv.org/abs/2410.06331)
Keywords: language model, llm, prompt
Abstract: The locate-then-edit paradigm has shown significant promise for knowledge editing (KE) in Large Language Models (LLMs). While previous methods perform well on single-hop fact recall tasks, they consistently struggle with multi-hop factual recall tasks involving newly edited knowledge. In this paper, leveraging tools in mechanistic interpretability, we first identify that in multi-hop tasks, LLMs tend to retrieve implicit subject knowledge from deeper MLP layers, unlike single-hop tasks, which rely on earlier layers. This distinction explains the poor performance of current methods in multi-hop queries, as they primarily focus on editing shallow layers, leaving deeper layers unchanged. To address this, we propose IFMET, a novel locate-then-edit KE approach designed to edit both shallow and deep MLP layers. IFMET employs multi-hop editing prompts and supplementary sets to locate and modify knowledge across different reasoning stages. Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, effectively overcoming the limitations of previous locate-then-edit methods.
摘要：定位-然后-编辑范式在大型语言模型 (LLM) 中的知识编辑 (KE) 方面显示出巨大的前景。虽然以前的方法在单跳事实回忆任务上表现良好，但它们在涉及新编辑知识的多跳事实回忆任务中始终表现不佳。在本文中，利用机械可解释性工具，我们首先发现在多跳任务中，LLM 倾向于从更深的 MLP 层检索隐含的主题知识，而单跳任务则依赖于较早的层。这种区别解释了当前方法在多跳查询中性能不佳的原因，因为它们主要侧重于编辑浅层，而更深的层保持不变。为了解决这个问题，我们提出了 IFMET，这是一种新颖的定位-然后-编辑 KE 方法，旨在编辑浅层和深层 MLP 层。IFMET 采用多跳编辑提示和补充集来定位和修改不同推理阶段的知识。实验结果表明，IFMET 显著提高了多跳事实回忆任务的性能，有效地克服了以前定位-然后-编辑方法的局限性。

Title: Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?

Authors: Shenbin Qian, Constantin Orăsan, Diptesh Kanojia, Félix do Carmo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06338
Pdf URL: https://arxiv.org/pdf/2410.06338
Copy Paste: [[2410.06338]] Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?(https://arxiv.org/abs/2410.06338)
Keywords: language model, llm, prompt
Abstract: This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC) that contains emotional expressions, without the use of reference translations. To achieve this, we employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multi-dimensional Quality Metrics. We compare the accuracy of several LLMs with that of our fine-tuned baseline models, under in-context learning and parameter-efficient fine-tuning (PEFT) scenarios. We find that PEFT of LLMs leads to better performance in score prediction with human interpretable explanations than fine-tuned models. However, a manual analysis of LLM outputs reveals that they still have problems such as refusal to reply to a prompt and unstable output while evaluating machine translation of UGC.
摘要：本文探讨在不使用参考翻译的情况下，大型语言模型 (LLM) 是否是包含情感表达的用户生成内容 (UGC) 机器翻译的最先进的质量评估器。为此，我们使用现有的带有人工注释错误的情感相关数据集，并根据多维质量指标计算质量评估分数。在上下文学习和参数高效微调 (PEFT) 场景下，我们将几个 LLM 的准确度与我们微调的基线模型的准确度进行了比较。我们发现，与微调模型相比，LLM 的 PEFT 在具有人类可解释解释的分数预测方面表现更好。然而，对 LLM 输出的手动分析表明，它们在评估 UGC 的机器翻译时仍然存在诸如拒绝回复提示和输出不稳定等问题。

Title: Counterfactual Causal Inference in Natural Language with Large Language Models

Authors: Gaël Gendron, Jože M. Rožanec, Michael Witbrock, Gillian Dobbie
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06392
Pdf URL: https://arxiv.org/pdf/2410.06392
Copy Paste: [[2410.06392]] Counterfactual Causal Inference in Natural Language with Large Language Models(https://arxiv.org/abs/2410.06392)
Keywords: language model, llm
Abstract: Causal structure discovery methods are commonly applied to structured data where the causal variables are known and where statistical testing can be used to assess the causal relationships. By contrast, recovering a causal structure from unstructured natural language data such as news articles poses numerous challenges due to the absence of known variables or counterfactual data to estimate the causal links. Large Language Models (LLMs) have shown promising results in this direction but also exhibit limitations. This work investigates LLMs' abilities to build causal graphs from text documents and perform counterfactual causal inference. We propose an end-to-end causal structure discovery and causal inference method from natural language: we first use an LLM to extract the instantiated causal variables from text data and build a causal graph. We merge causal graphs from multiple data sources to represent the most exhaustive set of causes possible. We then conduct counterfactual inference on the estimated graph. The causal graph conditioning allows reduction of LLM biases and better represents the causal estimands. We use our method to show that the limitations of LLMs in counterfactual causal reasoning come from prediction errors and propose directions to mitigate them. We demonstrate the applicability of our method on real-world news articles.
摘要：因果结构发现方法通常应用于结构化数据，其中因果变量已知，并且可以使用统计测试来评估因果关系。相比之下，从非结构化自然语言数据（例如新闻文章）中恢复因果结构面临许多挑战，因为缺乏已知变量或反事实数据来估计因果联系。大型语言模型 (LLM) 在这方面表现出有希望的结果，但也表现出局限性。这项工作调查了 LLM 从文本文档构建因果图和执行反事实因果推理的能力。我们提出了一种从自然语言中进行端到端的因果结构发现和因果推理方法：我们首先使用 LLM 从文本数据中提取实例化的因果变量并构建因果图。我们合并来自多个数据源的因果图以表示尽可能详尽的原因集。然后，我们对估计的图进行反事实推理。因果图条件化可以减少 LLM 偏差并更好地表示因果估计量。我们使用我们的方法来表明 LLM 在反事实因果推理中的局限性来自预测误差，并提出了缓解这些局限性的方向。我们证明了我们的方法在现实世界新闻文章中的适用性。

Title: MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks

Authors: Mirelle Bueno, Roberto Lotufo, Rodrigo Nogueira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06396
Pdf URL: https://arxiv.org/pdf/2410.06396
Copy Paste: [[2410.06396]] MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks(https://arxiv.org/abs/2410.06396)
Keywords: language model, llm
Abstract: Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens. However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training. For example, state-of-the-art LLMs can find common items in two lists with up to 20 items but fail when lists have 80 items. In this paper, we introduce MLissard, a multilingual benchmark that is designed to evaluate models' abilities to process and generate texts of varied lengths and that offers a mechanism for controlling sequence complexity. Our evaluation of open-source and proprietary models shows a consistent decline in performance across all models and languages as the complexity of the sequence increases. Surprisingly, the use of in-context examples in languages other than English helps increase extrapolation performance significantly. The datasets and code are available at this https URL
摘要：语言模型现在能够解决需要处理由数十万个标记组成的长序列的任务。然而，它们经常无法完成需要重复使用简单规则的任务，即使是比训练期间看到的序列短得多的序列。例如，最先进的 LLM 可以在两个最多包含 20 个项目的列表中找到常见项目，但当列表包含 80 个项目时就会失败。在本文中，我们介绍了 MLissard，这是一个多语言基准，旨在评估模型处理和生成不同长度文本的能力，并提供了一种控制序列复杂性的机制。我们对开源和专有模型的评估表明，随着序列复杂性的增加，所有模型和语言的性能都在持续下降。令人惊讶的是，使用除英语以外的语言的上下文示例有助于显著提高外推性能。数据集和代码可在此 https URL 上找到
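
Sketch (not from the paper): the abstract's example task, finding the common items in two lists, is easy to generate at a controlled length. The snippet below is a minimal illustration of that idea. The function name `make_intersection_task`, the integer item pool, and the parameters are assumptions and do not reflect the benchmark's actual data format.

```python
import random


def make_intersection_task(length, n_common, seed=0):
    """Build two lists of `length` unique items sharing exactly `n_common` items.

    Returns (list_a, list_b, gold), where gold is the sorted list of
    common items a model would be expected to output.
    """
    rng = random.Random(seed)
    # Draw enough distinct items for the shared part plus both private parts.
    pool = rng.sample(range(10_000), 2 * length - n_common)
    common = pool[:n_common]
    only_a = pool[n_common:length]
    only_b = pool[length:]
    list_a = common + only_a
    list_b = common + only_b
    # Shuffle so the shared items are not clustered at the front.
    rng.shuffle(list_a)
    rng.shuffle(list_b)
    return list_a, list_b, sorted(common)


# Sequence complexity is controlled here simply by the list length.
short_a, short_b, short_gold = make_intersection_task(length=20, n_common=5)
long_a, long_b, long_gold = make_intersection_task(length=80, n_common=5)
```

Scoring a model's answer then reduces to comparing its output set against `gold` for each length.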

Title: ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments

Authors: Sourjyadip Ray, Kushal Gupta, Soumi Kundu, Payal Arvind Kasat, Somak Aditya, Pawan Goyal
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.06420
Pdf URL: https://arxiv.org/pdf/2410.06420
Copy Paste: [[2410.06420]] ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments(https://arxiv.org/abs/2410.06420)
Keywords: language model
Abstract: The global shortage of healthcare workers has demanded the development of smart healthcare assistants, which can help monitor and alert healthcare workers when necessary. We examine the healthcare knowledge of existing Large Vision Language Models (LVLMs) via the Visual Question Answering (VQA) task in hospital settings through expert annotated open-ended questions. We introduce the Emergency Room Visual Question Answering (ERVQA) dataset, consisting of <image, question, answer> triplets covering diverse emergency room scenarios, a seminal benchmark for LVLMs. By developing a detailed error taxonomy and analyzing answer trends, we reveal the nuanced nature of the task. We benchmark state-of-the-art open-source and closed LVLMs using traditional and adapted VQA metrics: Entailment Score and CLIPScore Confidence. Analyzing errors across models, we infer trends based on properties like decoder type, model size, and in-context examples. Our findings suggest the ERVQA dataset presents a highly complex task, highlighting the need for specialized, domain-specific solutions.
摘要：全球医护人员短缺要求开发智能医疗助理，以便在必要时帮助监控和提醒医护人员。我们通过医院环境中的视觉问答 (VQA) 任务，通过专家注释的开放式问题检查现有大型视觉语言模型 (LVLM) 的医疗保健知识。我们引入了急诊室视觉问答 (ERVQA) 数据集，该数据集由 <图像、问题、答案> 三元组组成，涵盖了不同的急诊室场景，这是 LVLM 的开创性基准。通过开发详细的错误分类法并分析答案趋势，我们揭示了任务的细微差别。我们使用传统和改编的 VQA 指标对最先进的开源和封闭 LVLM 进行基准测试：蕴涵分数和 CLIPScore 置信度。通过分析跨模型的错误，我们根据解码器类型、模型大小和上下文示例等属性推断趋势。我们的研究结果表明 ERVQA 数据集提出了一项高度复杂的任务，凸显了对专门的、特定领域的解决方案的需求。

Title: LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints

Authors: Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, Nanyun Peng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06458
Pdf URL: https://arxiv.org/pdf/2410.06458
Copy Paste: [[2410.06458]] LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints(https://arxiv.org/abs/2410.06458)
Keywords: gpt, llm
Abstract: Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM's response needs refinement. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
摘要：指令遵循是 LLM 的一项关键能力。然而，最近的研究表明，LLM 经常难以处理包含多个约束的指令（例如，要求创建“以有趣的语气”且“没有标签”的社交媒体帖子）。尽管如此，大多数评估仅关注合成数据。为了解决这个问题，我们推出了 RealInstruct，这是第一个旨在通过利用真实用户向 AI 助手提出的查询来评估 LLM 遵循现实世界多约束指令的能力的基准。我们还研究了基于模型的评估，作为这项任务的一种经济有效的人工注释替代方案。我们的研究结果表明，即使是专有的 GPT-4 模型也无法满足超过 21% 的指令的至少一个约束，这凸显了最先进模型的局限性。为了解决开源模型和专有模型之间的性能差距，我们提出了分解、批评和改进 (DeCRIM) 自我校正管道，这增强了 LLM 遵循约束的能力。 DeCRIM 的工作原理是将原始指令分解为约束列表，并使用 Critic 模型来决定何时何地需要改进 LLM 的响应。我们的结果表明，即使在反馈较弱的情况下，DeCRIM 也能将 Mistral 在 RealInstruct 上的表现提高 7.3%，在 IFEval 上的表现提高 8.0%。此外，我们证明，在强反馈的情况下，使用 DeCRIM 的开源 LLM 在两个基准测试中的表现都优于 GPT-4。
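
Sketch (not from the paper): the decompose, critique, and refine loop described above can be outlined with plain callables. The snippet below is a minimal illustration in which `generate`, `decompose`, and `critique` are hypothetical stand-ins for model calls, the `toy_*` functions are hard-coded substitutes so the example runs without any model, and `max_rounds` is an assumed stopping rule.

```python
def decrim(instruction, generate, decompose, critique, max_rounds=3):
    """Refine a response until every decomposed constraint passes the critic."""
    constraints = decompose(instruction)
    response = generate(instruction, [])
    for _ in range(max_rounds):
        # The critic decides which constraints the current response violates.
        failed = [c for c in constraints if not critique(response, c)]
        if not failed:
            break
        # Regenerate, passing the violated constraints as feedback.
        response = generate(instruction, failed)
    return response


# Toy stand-ins (assumptions) so the loop can run without an LLM.
def toy_decompose(instruction):
    return ["no hashtag", "funny tone"]


def toy_critique(response, constraint):
    if constraint == "no hashtag":
        return "#" not in response
    return "cat" in response  # crude stand-in for a tone judgement


def toy_generate(instruction, failed):
    if not failed:
        return "Big launch today! #news"
    return "Big launch today, and even my cat is excited."


final = decrim(
    "Write a social media post in a funny tone with no hashtag.",
    toy_generate,
    toy_decompose,
    toy_critique,
)
```

In this toy run the first draft violates both constraints, so one refinement round is triggered before the loop exits.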

Title: LLM Compression with Neural Architecture Search

Authors: Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06479
Pdf URL: https://arxiv.org/pdf/2410.06479
Copy Paste: [[2410.06479]] LLM Compression with Neural Architecture Search(https://arxiv.org/abs/2410.06479)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit remarkable reasoning abilities, allowing them to generalize across a wide range of downstream tasks, such as commonsense reasoning or instruction following. However, as LLMs scale, inference costs become increasingly prohibitive, accumulating significantly over their life cycle. This poses the question: Can we compress pre-trained LLMs to meet diverse size and latency requirements? We leverage Neural Architecture Search (NAS) to compress LLMs by pruning structural components, such as attention heads, neurons, and layers, aiming to achieve a Pareto-optimal balance between performance and efficiency. While NAS already achieved promising results on small language models in previous work, in this paper we propose various extensions that allow us to scale to LLMs. Compared to structural pruning baselines, we show that NAS improves performance by up to 3.4% on MMLU with an on-device latency speedup.
摘要：大型语言模型 (LLM) 表现出卓越的推理能力，使其能够泛化到各种下游任务中，例如常识推理或指令遵循。然而，随着 LLM 的规模扩大，推理成本变得越来越高昂，在其生命周期内显著累积。这就提出了一个问题：我们能否压缩预先训练的 LLM 以满足不同的大小和延迟要求？我们利用神经架构搜索 (NAS) 通过修剪结构组件（例如注意力头、神经元和层）来压缩 LLM，旨在实现性能和效率之间的帕累托最优平衡。虽然 NAS 在之前的工作中已经在小型语言模型上取得了有希望的结果，但在本文中，我们提出了各种扩展，使我们能够扩展到 LLM。与结构修剪基线相比，我们表明 NAS 在 MMLU 上的性能提高了 3.4%，同时设备延迟加快。

Title: On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task

Authors: Javier Ferrando, Marta R.Costa-jussà
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06496
Pdf URL: https://arxiv.org/pdf/2410.06496
Copy Paste: [[2410.06496]] On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task(https://arxiv.org/abs/2410.06496)
Keywords: language model
Abstract: Several algorithms implemented by language models have recently been successfully reverse-engineered. However, these findings have been concentrated on specific tasks and models, leaving it unclear how universal circuits are across different settings. In this paper, we study the circuits implemented by Gemma 2B for solving the subject-verb agreement task across two different languages, English and Spanish. We discover that both circuits are highly consistent, being mainly driven by a particular attention head writing a 'subject number' signal to the last residual stream, which is read by a small set of neurons in the final MLPs. Notably, this subject number signal is represented as a direction in the residual stream space, and is language-independent. We demonstrate that this direction has a causal effect on the model predictions, effectively flipping the Spanish predicted verb number by intervening with the direction found in English. Finally, we present evidence of similar behavior in other models within the Gemma 1 and Gemma 2 families.
摘要：最近，语言模型实现的几种算法已被成功逆向工程。然而，这些发现集中在特定的任务和模型上，因此尚不清楚不同设置中的通用电路如何。在本文中，我们研究了 Gemma 2B 实现的电路，用于解决英语和西班牙语两种不同语言的主谓一致任务。我们发现这两个电路高度一致，主要由一个特定的注意头驱动，它将“主题编号”信号写入最后的残差流，该信号由最终 MLP 中的一小组神经元读取。值得注意的是，这个主题编号信号表示为残差流空间中的方向，并且与语言无关。我们证明这个方向对模型预测有因果影响，通过干预英语中的方向，有效地翻转了西班牙语预测的动词编号。最后，我们提供了 Gemma 1 和 Gemma 2 系列中其他模型中类似行为的证据。

Title: TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training

Authors: Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos
Subjects: cs.CL, cs.AI, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06511
Pdf URL: https://arxiv.org/pdf/2410.06511
Copy Paste: [[2410.06511]] TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training(https://arxiv.org/abs/2410.06511)
Keywords: language model, llm
Abstract: The development of large language models (LLMs) has been instrumental in advancing state-of-the-art natural language processing applications. Training LLMs with billions of parameters and trillions of tokens requires sophisticated distributed systems that enable composing and comparing several state-of-the-art techniques in order to efficiently scale across thousands of accelerators. However, existing solutions are complex, scattered across multiple libraries/repositories, lack interoperability, and are cumbersome to maintain. Thus, curating and empirically comparing training recipes requires non-trivial engineering effort. This paper introduces TorchTitan, an open-source, PyTorch-native distributed training system that unifies state-of-the-art techniques, streamlining integration and reducing overhead. TorchTitan enables 3D parallelism in a modular manner with elastic scaling, providing comprehensive logging, checkpointing, and debugging tools for production-ready training. It also incorporates hardware-software co-designed solutions, leveraging features like Float8 training and SymmetricMemory. As a flexible test bed, TorchTitan facilitates custom recipe curation and comparison, allowing us to develop optimized training recipes for Llama 3.1 and provide guidance on selecting techniques for maximum efficiency based on our experiences. We thoroughly assess TorchTitan on the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its exceptional performance, modular composability, and elastic scalability. By stacking training optimizations, we demonstrate accelerations of 65.08% with 1D parallelism at the 128-GPU scale (Llama 3.1 8B), an additional 12.59% with 2D parallelism at the 256-GPU scale (Llama 3.1 70B), and an additional 30% with 3D parallelism at the 512-GPU scale (Llama 3.1 405B) on NVIDIA H100 GPUs over optimized baselines.
摘要：大型语言模型 (LLM) 的开发对推动最先进的自然语言处理应用发挥了重要作用。训练具有数十亿个参数和数万亿个标记的 LLM 需要复杂的分布式系统，以便能够组合和比较几种最先进的技术，从而有效地扩展到数千个加速器上。然而，现有的解决方案很复杂，分散在多个库/存储库中，缺乏互操作性，并且维护起来很麻烦。因此，策划和实证比较训练方案需要不小的工程工作。本文介绍了 TorchTitan，这是一个开源的 PyTorch 原生分布式训练系统，它统一了最先进的技术，简化了集成并降低了开销。TorchTitan 以模块化方式实现 3D 并行性，具有弹性扩展，为可用于生产的训练提供了全面的日志记录、检查点和调试工具。它还结合了硬件和软件共同设计的解决方案，利用了 Float8 训练和 SymmetricMemory 等功能。 TorchTitan 是一个灵活的测试平台，它便于自定义配方管理和比较，使我们能够为 Llama 3.1 开发优化的训练配方，并根据我们的经验提供选择技术以实现最高效率的指导。我们在 Llama 3.1 系列 LLM 上对 TorchTitan 进行了全面评估，涵盖 80 亿到 4050 亿个参数，并展示了其卓越的性能、模块化可组合性和弹性可扩展性。通过堆叠训练优化，我们在 NVIDIA H100 GPU 上展示了在优化的基准上，128-GPU 规模（Llama 3.1 8B）的 1D 并行性加速了 65.08%，256-GPU 规模（Llama 3.1 70B）的 2D 并行性加速了 12.59%，512-GPU 规模（Llama 3.1 405B）的 3D 并行性加速了 30%。

Title: SEGMENT+: Long Text Processing with Short-Context Language Models

Authors: Wei Shi, Shuang Li, Kerun Yu, Jinglei Chen, Zujie Liang, Xinhui Wu, Yuxi Qian, Feng Wei, Bo Zheng, Jiaqing Liang, Jiangjie Chen, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06519
Pdf URL: https://arxiv.org/pdf/2410.06519
Copy Paste: [[2410.06519]] SEGMENT+: Long Text Processing with Short-Context Language Models(https://arxiv.org/abs/2410.06519)
Keywords: language model
Abstract: There is a growing interest in expanding the input capacity of language models (LMs) across various domains. However, simply increasing the context window does not guarantee robust performance across diverse long-input processing tasks, such as understanding extensive documents and extracting detailed information from lengthy and noisy data. In response, we introduce SEGMENT+, a general framework that enables LMs to handle extended inputs within limited context windows efficiently. SEGMENT+ utilizes structured notes and a filtering module to manage information flow, resulting in a system that is both controllable and interpretable. Our extensive experiments across various model sizes, focusing on long-document question-answering and Needle-in-a-Haystack tasks, demonstrate the effectiveness of SEGMENT+ in improving performance.
摘要：人们越来越有兴趣在各个领域扩展语言模型 (LM) 的输入容量。然而，仅仅增加上下文窗口并不能保证在各种长输入处理任务中实现稳健的性能，例如理解大量文档以及从冗长且嘈杂的数据中提取详细信息。为此，我们引入了 SEGMENT+，这是一个通用框架，使 LM 能够有效地处理有限上下文窗口内的扩展输入。SEGMENT+ 利用结构化注释和过滤模块来管理信息流，从而形成一个可控制且可解释的系统。我们对各种模型大小进行了广泛的实验，重点关注长文档问答和大海捞针任务，证明了 SEGMENT+ 在提高性能方面的有效性。

Title: A Novel LLM-based Two-stage Summarization Approach for Long Dialogues

Authors: Yuan-Jhe Yin, Bo-Yu Chen, Berlin Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06520
Pdf URL: https://arxiv.org/pdf/2410.06520
Copy Paste: [[2410.06520]] A Novel LLM-based Two-stage Summarization Approach for Long Dialogues(https://arxiv.org/abs/2410.06520)
Keywords: language model, gpt, llm, chat
Abstract: Long document summarization poses a significant challenge in natural language processing due to input lengths that exceed the capacity of most state-of-the-art pre-trained language models. This study proposes a hierarchical framework that segments and condenses information from long documents, subsequently fine-tuning the processed text with an abstractive summarization model. Unsupervised topic segmentation methods identify semantically appropriate breakpoints. The condensation stage utilizes an unsupervised generation model to generate condensed data, and our current experiments employ ChatGPT(v3.5). The summarization stage fine-tunes the abstractive summarization model on the condensed data to generate the final results. This framework enables long documents to be processed on models even when the document length exceeds the model's maximum input size. The exclusion of the entire document from the summarization model reduces the time and computational resources required for training, making the framework suitable for contexts with constrained local computational resources.
摘要：长文档摘要对自然语言处理提出了重大挑战，因为输入长度超出了大多数最先进的预训练语言模型的容量。本研究提出了一个分层框架，该框架可以对长文档中的信息进行分割和压缩，随后使用抽象摘要模型对处理后的文本进行微调。无监督主题分割方法可识别语义上合适的断点。压缩阶段利用无监督生成模型来生成压缩数据，我们目前的实验采用 ChatGPT(v3.5)。摘要阶段对压缩数据上的抽象摘要模型进行微调以生成最终结果。即使文档长度超出模型的最大输入大小，该框架也可以在模型上处理长文档。将整个文档从摘要模型中排除可减少训练所需的时间和计算资源，使该框架适用于本地计算资源受限的环境。

Title: Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA

Authors: Maharshi Gor, Hal Daumé III, Tianyi Zhou, Jordan Boyd-Graber
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06524
Pdf URL: https://arxiv.org/pdf/2410.06524
Copy Paste: [[2410.06524]] Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA(https://arxiv.org/abs/2410.06524)
Keywords: language model, gpt, llm, agent
Abstract: Recent advancements of large language models (LLMs) have led to claims of AI surpassing humans in natural language processing (NLP) tasks such as textual understanding and reasoning. This work investigates these assertions by introducing CAIMIRA, a novel framework rooted in item response theory (IRT) that enables quantitative assessment and comparison of problem-solving abilities of question-answering (QA) agents: humans and AI systems. Through analysis of over 300,000 responses from ~70 AI systems and 155 humans across thousands of quiz questions, CAIMIRA uncovers distinct proficiency patterns in knowledge domains and reasoning skills. Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning, while state-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval and fact-based reasoning, particularly when information gaps are well-defined and addressable through pattern matching or data retrieval. These findings highlight the need for future QA tasks to focus on questions that challenge not only higher-order reasoning and scientific thinking, but also demand nuanced linguistic interpretation and cross-contextual knowledge application, helping advance AI developments that better emulate or complement human cognitive abilities in real-world problem-solving.
摘要：大型语言模型 (LLM) 的最新进展导致人们声称 AI 在自然语言处理 (NLP) 任务（例如文本理解和推理）方面超越了人类。这项工作通过引入 CAIMIRA 来调查这些断言，CAIMIRA 是一个基于项目反应理论 (IRT) 的新框架，可以定量评估和比较问答 (QA) 代理的问题解决能力：人类和 AI 系统。通过分析来自约 70 个 AI 系统和 155 名人类对数千个测验问题的超过 300,000 个答案，CAIMIRA 揭示了知识领域和推理技能中不同的熟练模式。人类在基于知识的溯因推理和概念推理方面胜过 AI 系统，而 GPT-4 和 LLaMA 等最先进的 LLM 在有针对性的信息检索和基于事实的推理方面表现出色，尤其是当信息差距定义明确且可通过模式匹配或数据检索解决时。这些发现强调了未来的 QA 任务需要关注不仅挑战高阶推理和科学思维的问题，而且还需要细致的语言解释和跨语境知识应用，从而帮助推动人工智能发展，更好地模拟或补充人类在现实世界解决问题中的认知能力。
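
Sketch (not from the paper): item response theory models the chance of a correct answer as a function of the responder's skill and the question's difficulty. The snippet below shows the standard two-parameter logistic form as a minimal illustration. CAIMIRA's own formulation may differ, and the function name and the numeric values are assumptions.

```python
import math


def p_correct(skill, difficulty, discrimination=1.0):
    """Two-parameter logistic IRT: probability that an agent answers correctly.

    skill: latent ability of the agent (human or AI system).
    difficulty: latent difficulty of the question.
    discrimination: how sharply the question separates strong from weak agents.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (skill - difficulty)))


# Hypothetical agents answering the same question (difficulty 0.5).
weak_agent = p_correct(skill=-1.0, difficulty=0.5)
matched_agent = p_correct(skill=0.5, difficulty=0.5)
strong_agent = p_correct(skill=2.0, difficulty=0.5)
```

When skill equals difficulty the probability is exactly one half, and it rises monotonically with skill.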

Title: Chip-Tuning: Classify Before Language Models Say

Authors: Fangwei Zhu, Dian Li, Jiajun Huang, Gang Liu, Hui Wang, Zhifang Sui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06541
Pdf URL: https://arxiv.org/pdf/2410.06541
Copy Paste: [[2410.06541]] Chip-Tuning: Classify Before Language Models Say(https://arxiv.org/abs/2410.06541)
Keywords: language model, llm
Abstract: The rapid development in the performance of large language models (LLMs) is accompanied by the escalation of model size, leading to the increasing cost of model training and inference. Previous research has discovered that certain layers in LLMs exhibit redundancy, and removing these layers brings only marginal loss in model performance. In this paper, we adopt the probing technique to explain the layer redundancy in LLMs and demonstrate that language models can be effectively pruned with probing classifiers. We propose chip-tuning, a simple and effective structured pruning framework specialized for classification problems. Chip-tuning attaches tiny probing classifiers named chips to different layers of LLMs, and trains chips with the backbone model frozen. After selecting a chip for classification, all layers subsequent to the attached layer could be removed with marginal performance loss. Experimental results on various LLMs and datasets demonstrate that chip-tuning significantly outperforms previous state-of-the-art baselines in both accuracy and pruning ratio, achieving a pruning ratio of up to 50%. We also find that chip-tuning could be applied on multimodal models, and could be combined with model finetuning, proving its excellent compatibility.
摘要：大型语言模型 (LLM) 性能的快速发展伴随着模型规模的扩大，导致模型训练和推理的成本不断增加。先前的研究发现 LLM 中的某些层表现出冗余，删除这些层只会给模型性能带来边际损失。在本文中，我们采用探测技术来解释 LLM 中的层冗余，并证明可以使用探测分类器有效地修剪语言模型。我们提出了 chip-tuning，这是一个专门用于分类问题的简单有效的结构化修剪框架。Chip-tuning 将名为 chip 的微型探测分类器附加到 LLM 的不同层，并在主干模型冻结的情况下训练 chip。在选择用于分类的 chip 后，可以删除附加层之后的所有层，而性能损失很小。在各种 LLM 和数据集上的实验结果表明，chip-tuning 在准确率和修剪率方面都明显优于以前最先进的基线，修剪率高达 50%。我们还发现芯片调优可以应用于多模式模型，并且可以与模型微调相结合，证明了其出色的兼容性。
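
Sketch (not from the paper): once a probing classifier (chip) is chosen at some layer, every later layer can be dropped, so the pruning ratio follows directly from the chosen layer. The snippet below is a minimal illustration of that arithmetic. The selection rule (earliest layer whose accuracy is within a tolerance of the best), the function name `select_chip`, and the accuracy values are assumptions, not the authors' procedure.

```python
def select_chip(layer_accuracy, tolerance=0.01):
    """Pick the earliest layer whose chip is within `tolerance` of the best.

    layer_accuracy: validation accuracy of the chip attached to each layer,
    ordered from the first layer to the last.
    Returns (kept_layers, pruning_ratio).
    """
    best = max(layer_accuracy)
    total = len(layer_accuracy)
    kept = total  # fall back to keeping every layer
    for layer, acc in enumerate(layer_accuracy, start=1):
        if acc >= best - tolerance:
            kept = layer
            break
    # All layers after the chosen one are removed.
    return kept, (total - kept) / total


# Hypothetical chip accuracies for an 8-layer backbone.
accuracies = [0.50, 0.62, 0.71, 0.805, 0.807, 0.81, 0.808, 0.81]
kept_layers, pruning_ratio = select_chip(accuracies)
```

With these made-up numbers the chip at layer 4 is selected, which removes half of the layers.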

Title: TuringQ: Benchmarking AI Comprehension in Theory of Computation

Authors: Pardis Sadat Zahraei, Ehsaneddin Asgari
Subjects: cs.CL, cs.FL
Abstract URL: https://arxiv.org/abs/2410.06547
Pdf URL: https://arxiv.org/pdf/2410.06547
Copy Paste: [[2410.06547]] TuringQ: Benchmarking AI Comprehension in Theory of Computation(https://arxiv.org/abs/2410.06547)
Keywords: language model, gpt, llm, prompt
Abstract: We present TuringQ, the first benchmark designed to evaluate the reasoning capabilities of large language models (LLMs) in the theory of computation. TuringQ consists of 4,006 undergraduate and graduate-level question-answer pairs, categorized into four difficulty levels and covering seven core theoretical areas. We evaluate several open-source LLMs, as well as GPT-4, using Chain of Thought prompting and expert human assessment. Additionally, we propose an automated LLM-based evaluation system that demonstrates competitive accuracy when compared to human evaluation. Fine-tuning a Llama3-8B model on TuringQ shows measurable improvements in reasoning ability and out-of-domain tasks such as algebra. TuringQ serves as both a benchmark and a resource for enhancing LLM performance in complex computational reasoning tasks. Our analysis offers insights into LLM capabilities and advances in AI comprehension of theoretical computer science.
摘要：我们提出了 TuringQ，这是第一个旨在评估计算理论中大型语言模型 (LLM) 的推理能力的基准。TuringQ 由 4,006 个本科和研究生水平的问答对组成，分为四个难度级别，涵盖七个核心理论领域。我们使用思想链提示和专家人工评估评估了几个开源 LLM 以及 GPT-4。此外，我们提出了一种基于 LLM 的自动化评估系统，与人工评估相比，该系统表现出了具有竞争力的准确性。在 TuringQ 上对 Llama3-8B 模型进行微调，可以显著提高推理能力和代数等域外任务。TuringQ 既是基准，也是提高 LLM 在复杂计算推理任务中性能的资源。我们的分析提供了对 LLM 能力和 AI 对理论计算机科学理解的进步的见解。

Title: Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis

Authors: Shiho Matta, Yin Jou Huang, Fei Cheng, Hirokazu Kiyomaru, Yugo Murawaki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06550
Pdf URL: https://arxiv.org/pdf/2410.06550
Copy Paste: [[2410.06550]] Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis(https://arxiv.org/abs/2410.06550)
Keywords: gpt, llm
Abstract: Recent studies have demonstrated that few-shot learning allows LLMs to generate training data for supervised models at a low cost. However, the quality of LLM-generated data may not entirely match that of human-labeled data. This raises a crucial question: how should one balance the trade-off between the higher quality but more expensive human data and the lower quality yet substantially cheaper LLM-generated data? In this paper, we synthesized training data for conversational semantic frame analysis using GPT-4 and examined how to allocate budgets optimally to achieve the best performance. Our experiments, conducted across various budget levels, reveal that optimal cost-efficiency is achieved by combining both human and LLM-generated data across a wide range of budget levels. Notably, as the budget decreases, a higher proportion of LLM-generated data becomes preferable.
摘要：最近的研究表明，少样本学习允许 LLM 以低成本生成监督模型的训练数据。然而，LLM 生成的数据的质量可能并不完全与人工标记的数据相匹配。这就提出了一个关键问题：应该如何在质量更高但更昂贵的人工数据和质量更低但便宜得多的 LLM 生成的数据之间取得平衡？在本文中，我们使用 GPT-4 合成了用于对话语义框架分析的训练数据，并研究了如何最佳地分配预算以实现最佳性能。我们在各种预算水平上进行的实验表明，在广泛的预算水平上，通过将人工和 LLM 生成的数据结合起来可以实现最佳的成本效益。值得注意的是，随着预算的减少，更高比例的 LLM 生成的数据变得更受欢迎。

Title: The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Authors: Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, Xiaoyu Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06554
Pdf URL: https://arxiv.org/pdf/2410.06554
Copy Paste: [[2410.06554]] The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models(https://arxiv.org/abs/2410.06554)
Keywords: language model
Abstract: Reinforcement Learning from Human Feedback significantly enhances Natural Language Processing by aligning language models with human expectations. A critical factor in this alignment is the strength of reward models used during training. This study explores whether stronger reward models invariably lead to better language models. In this paper, through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and reward models based on Longformer, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always lead to better language models, and opens up new avenues for future research into the key factors driving model performance and how to choose the most suitable reward models. Code and additional details are available at this https URL.
摘要：通过使语言模型与人类期望保持一致，强化学习从人类反馈中显著增强了自然语言处理。这种一致性的一个关键因素是训练期间使用的奖励模型的强度。本研究探讨了更强大的奖励模型是否必然会带来更好的语言模型。在本文中，通过使用 QA-FEEDBACK 数据集和基于 Longformer 的奖励模型对相关性、事实性和完整性任务进行实验，我们发现了一个令人惊讶的悖论：使用中等准确度的奖励模型训练的语言模型优于使用高度准确度的奖励模型训练的语言模型。这挑战了人们普遍持有的信念，即更强大的奖励模型总是会带来更好的语言模型，并为未来研究推动模型性能的关键因素以及如何选择最合适的奖励模型开辟了新途径。代码和其他详细信息可在 [此 https URL](此 https URL) 中找到。

Title: ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

Authors: Haoran Zhang, Hangyu Guo, Shuyue Guo, Meng Cao, Wenhao Huang, Jiaheng Liu, Ge Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06555
Pdf URL: https://arxiv.org/pdf/2410.06555
Copy Paste: [[2410.06555]] ING-VP: MLLMs cannot Play Easy Vision-based Games Yet(https://arxiv.org/abs/2410.06555)
Keywords: language model, llm
Abstract: As multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks, more intricate and comprehensive benchmarks have been developed to assess these cutting-edge models. These benchmarks introduce new challenges to core capabilities such as perception, reasoning, and planning. However, existing multimodal benchmarks fall short in providing a focused evaluation of multi-step planning based on spatial relationships in images. To bridge this gap, we present ING-VP, the first INteractive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs. ING-VP features 6 distinct games, encompassing 300 levels, each with 6 unique configurations. A single model engages in over 60,000 rounds of interaction. The benchmark framework allows for multiple comparison settings, including image-text vs. text-only inputs, single-step vs. multi-step reasoning, and with-history vs. without-history conditions, offering valuable insights into the model's capabilities. We evaluated numerous state-of-the-art MLLMs, with the highest-performing model, Claude-3.5 Sonnet, achieving an average accuracy of only 3.37%, far below the anticipated standard. This work aims to provide a specialized evaluation framework to drive advancements in MLLMs' capacity for complex spatial reasoning and planning. The code is publicly available at this https URL.
摘要：随着多模态大型语言模型 (MLLM) 在广泛任务中继续展现出越来越有竞争力的性能，人们开发了更复杂、更全面的基准来评估这些尖端模型。这些基准对感知、推理和规划等核心能力提出了新的挑战。然而，现有的多模态基准在基于图像空间关系的多步规划方面缺乏重点评估。为了弥补这一差距，我们提出了 ING-VP，这是第一个基于游戏的交互式视觉规划基准，专门用于评估 MLLM 的空间想象力和多步推理能力。ING-VP 有 6 个不同的游戏，涵盖 300 个级别，每个级别都有 6 种独特的配置。一个模型进行超过 60,000 轮交互。基准框架允许多种比较设置，包括图像文本与纯文本输入、单步与多步推理以及有历史与无历史条件，从而为模型的功能提供了宝贵的见解。我们评估了许多最先进的 MLLM，其中性能最高的模型 Claude-3.5 Sonnet 的平均准确率仅为 3.37%，远低于预期标准。这项工作旨在提供一个专门的评估框架，以推动 MLLM 在复杂空间推理和规划方面的能力的进步。代码可在此 https URL 上公开获取。

Title: Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare

Authors: Pardis Sadat Zahraei, Zahra Shakeri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06566
Pdf URL: https://arxiv.org/pdf/2410.06566
Copy Paste: [[2410.06566]] Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare(https://arxiv.org/abs/2410.06566)
Keywords: language model, gpt, llm, chat
Abstract: Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety, making the integrity of AI in healthcare more critical than ever. As Large Language Models (LLMs) take on a growing role in medical decision-making, addressing their biases and enhancing their accuracy is key to delivering safe, reliable care. This study addresses these challenges head-on by introducing new resources designed to promote ethical and precise AI in healthcare. We present two datasets: BiasMD, featuring 6,007 question-answer pairs crafted to evaluate and mitigate biases in health-related LLM outputs, and DiseaseMatcher, with 32,000 clinical question-answer pairs spanning 700 diseases, aimed at assessing symptom-based diagnostic accuracy. Using these datasets, we developed the EthiClinician, a fine-tuned model built on the ChatDoctor framework, which outperforms GPT-4 in both ethical reasoning and clinical judgment. By exposing and correcting hidden biases in existing models for healthcare, our work sets a new benchmark for safer, more reliable patient outcomes.
摘要：人工智能生成的有偏见的医疗建议和误诊会危及患者安全，因此人工智能在医疗保健领域的完整性比以往任何时候都更加重要。随着大型语言模型 (LLM) 在医疗决策中发挥越来越大的作用，解决其偏见并提高其准确性是提供安全、可靠护理的关键。这项研究通过引入旨在促进医疗保健领域合乎道德和精准的人工智能的新资源，正面应对了这些挑战。我们提供了两个数据集：BiasMD，包含 6,007 个问答对，旨在评估和减轻与健康相关的 LLM 输出中的偏见；DiseaseMatcher，包含 32,000 个临床问答对，涵盖 700 种疾病，旨在评估基于症状的诊断准确性。利用这些数据集，我们开发了 EthiClinician，这是一个基于 ChatDoctor 框架构建的微调模型，在道德推理和临床判断方面均优于 GPT-4。通过揭露和纠正现有医疗保健模式中的隐藏偏见，我们的工作为更安全、更可靠的患者结果设定了新的基准。

Title: Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions

Authors: Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06577
Pdf URL: https://arxiv.org/pdf/2410.06577
Copy Paste: [[2410.06577]] Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions(https://arxiv.org/abs/2410.06577)
Keywords: language model, llm
Abstract: Recent advancements in Transformer-based large language models (LLMs) have set new standards in natural language processing. However, the classical softmax attention incurs significant computational costs, leading to an $O(T)$ complexity for per-token generation, where $T$ represents the context length. This work explores reducing LLMs' complexity while maintaining performance by introducing Rodimus and its enhanced version, Rodimus$+$. Rodimus employs an innovative data-dependent tempered selection (DDTS) mechanism within a linear attention-based, purely recurrent framework, achieving significant accuracy while drastically reducing the memory usage typically associated with recurrent models. This method exemplifies semantic compression by maintaining essential input information with fixed-size hidden states. Building on this, Rodimus$+$ combines Rodimus with the innovative Sliding Window Shared-Key Attention (SW-SKA) in a hybrid approach, effectively leveraging the complementary semantic, token, and head compression techniques. Our experiments demonstrate that Rodimus$+$-1.6B, trained on 1 trillion tokens, achieves superior downstream performance against models trained on more tokens, including Qwen2-1.5B and RWKV6-1.6B, underscoring its potential to redefine the accuracy-efficiency balance in LLMs. Model code and pre-trained checkpoints will be available soon.
摘要：基于 Transformer 的大型语言模型 (LLM) 的最新进展为自然语言处理树立了新标准。然而，经典的 softmax 注意力会产生大量的计算成本，导致每个 token 生成的复杂度为 $O(T)$，其中 $T$ 表示上下文长度。这项工作通过引入 Rodimus 及其增强版本 Rodimus$+$ 探索降低 LLM 的复杂性，同时保持性能。Rodimus 在基于线性注意力的纯循环框架内采用创新的数据相关调节选择 (DDTS) 机制，实现了显着的准确性，同时大幅降低了通常与循环模型相关的内存使用量。该方法通过使用固定大小的隐藏状态来维护基本输入信息，体现了语义压缩。在此基础上，Rodimus$+$ 将 Rodimus 与创新的滑动窗口共享密钥注意力 (SW-SKA) 以混合方式相结合，有效地利用了互补的语义、token 和头部压缩技术。我们的实验表明，在 1 万亿个 token 上训练的 Rodimus$+$-1.6B 比在更多 token 上训练的模型（包括 Qwen2-1.5B 和 RWKV6-1.6B）实现了更出色的下游性能，凸显了其重新定义 LLM 中准确率-效率平衡的潜力。模型代码和预训练检查点将很快推出。
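
Note: the contrast drawn in this abstract, between softmax attention whose per-token cost grows with the context length $T$ and a recurrent model that keeps a fixed-size hidden state, can be illustrated with a minimal sketch. The code below is generic, unnormalized linear attention in pure Python. It is not Rodimus, DDTS, or SW-SKA, and the function names are hypothetical.

```python
import math

def softmax_attention_step(q, keys, values):
    """One decoding step of softmax attention.
    Work is proportional to len(keys), i.e. O(T) per generated token."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(d_v)]

def linear_attention_step(state, q, k, v):
    """One decoding step of unnormalized linear attention.
    `state` is a d_k x d_v matrix S; the update is S += outer(k, v) and the
    output is q^T S, so memory and work per step do not depend on T."""
    for i, ki in enumerate(k):
        for j, vj in enumerate(v):
            state[i][j] += ki * vj
    out = [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(len(v))]
    return state, out
```

The recurrent form never stores past keys or values: everything the model remembers is compressed into the fixed-size `state`, which is the kind of compression the abstract refers to.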

Title: Dissecting Fine-Tuning Unlearning in Large Language Models

Authors: Yihuai Hong, Yuelin Zou, Lijie Hu, Ziqian Zeng, Di Wang, Haiqin Yang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06606
Pdf URL: https://arxiv.org/pdf/2410.06606
Copy Paste: [[2410.06606]] Dissecting Fine-Tuning Unlearning in Large Language Models(https://arxiv.org/abs/2410.06606)
Keywords: language model
Abstract: Fine-tuning-based unlearning methods are the prevailing approach for preventing large language models from exposing targeted harmful, sensitive, or copyrighted information while preserving overall capabilities. However, the true effectiveness of these methods is unclear. In this paper, we delve into the limitations of fine-tuning-based unlearning through activation patching and parameter restoration experiments. Our findings reveal that these methods alter the model's knowledge retrieval process, rather than genuinely erasing the problematic knowledge embedded in the model parameters. Furthermore, behavioral tests demonstrate that the unlearning mechanisms inevitably impact the global behavior of the models, affecting unrelated knowledge or capabilities. Our work advocates the development of more resilient unlearning techniques for truly erasing knowledge. Our code is released at this https URL.
摘要：基于微调的反学习方法在防止大型语言模型中出现有针对性的有害、敏感或受版权保护的信息的同时，还保留了整体功能。然而，这些方法的真正有效性尚不清楚。在本文中，我们通过激活修补和参数恢复实验深入探讨了基于微调的反学习的局限性。我们的研究结果表明，这些方法改变了模型的知识检索过程，而不是真正消除了嵌入在模型参数中的有问题的知识。此外，行为测试表明，反学习机制不可避免地会影响模型的整体行为，影响不相关的知识或能力。我们的工作提倡开发更具弹性的反学习技术，以真正消除知识。我们的代码发布在此 https URL 上。

Title: $\beta$-calibration of Language Model Confidence Scores for Generative QA

Authors: Putra Manggala, Atalanti Mastakouri, Elke Kirschbaum, Shiva Prasad Kasiviswanathan, Aaditya Ramdas
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06615
Pdf URL: https://arxiv.org/pdf/2410.06615
Copy Paste: [[2410.06615]] $\beta$-calibration of Language Model Confidence Scores for Generative QA(https://arxiv.org/abs/2410.06615)
Keywords: language model
Abstract: To use generative question-and-answering (QA) systems for decision-making and in any critical application, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is on average indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce $\beta$-calibration, which ensures calibration holds across different question-and-answer groups. We then propose discretized posthoc calibration schemes for achieving $\beta$-calibration.
摘要：要将生成式问答 (QA) 系统用于决策和任何关键应用，这些系统需要提供经过良好校准的置信度分数，以反映其答案的正确性。现有的校准方法旨在确保置信度分数平均而言能够表明答案正确的可能性。然而，我们认为，这种标准（平均情况）校准概念很难解释生成式 QA 中的决策。为了解决这个问题，我们概括了平均校准的标准概念并引入了 $\beta$ 校准，以确保校准在不同的问答组中成立。然后，我们提出了离散化事后校准方案来实现 $\beta$ 校准。
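
Note: the motivation given in this abstract, that average-case calibration can hide miscalibration within individual question-and-answer groups, can be seen in a small sketch. The gap used here (absolute difference between mean confidence and accuracy) is a coarse stand-in chosen for illustration. It is not the paper's definition of $\beta$-calibration, and the group labels and function names are hypothetical.

```python
from collections import defaultdict

def calibration_gap(records):
    """|mean confidence - accuracy| over (confidence, is_correct) pairs."""
    n = len(records)
    mean_conf = sum(conf for conf, _ in records) / n
    accuracy = sum(1 for _, ok in records if ok) / n
    return abs(mean_conf - accuracy)

def per_group_gaps(grouped_records):
    """Same gap, computed separately for each group label.
    Each input item is (group, confidence, is_correct)."""
    groups = defaultdict(list)
    for group, conf, ok in grouped_records:
        groups[group].append((conf, ok))
    return {g: calibration_gap(r) for g, r in groups.items()}

# Toy data: every answer is given confidence 0.5; the "easy" questions are
# all answered correctly and the "hard" ones are all answered incorrectly.
data = [
    ("easy", 0.5, True), ("easy", 0.5, True),
    ("hard", 0.5, False), ("hard", 0.5, False),
]
overall = calibration_gap([(c, ok) for _, c, ok in data])
by_group = per_group_gaps(data)
```

On this toy data the overall gap is zero while each group is off by 0.5, which is the failure mode a group-wise notion of calibration is meant to rule out.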

Title: Learning Evolving Tools for Large Language Models

Authors: Guoxin Chen, Zhong Zhang, Xin Cong, Fangda Guo, Yesai Wu, Yankai Lin, Wenzheng Feng, Yasheng Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06617
Pdf URL: https://arxiv.org/pdf/2410.06617
Copy Paste: [[2410.06617]] Learning Evolving Tools for Large Language Models(https://arxiv.org/abs/2410.06617)
Keywords: language model, llm
Abstract: Tool learning enables large language models (LLMs) to interact with external tools and APIs, greatly expanding the application scope of LLMs. However, due to the dynamic nature of external environments, these tools and APIs may become outdated over time, preventing LLMs from correctly invoking tools. Existing research primarily focuses on static environments and overlooks this issue, limiting the adaptability of LLMs in real-world applications. In this paper, we propose ToolEVO, a novel framework designed to enhance the adaptive and reflective capabilities of LLMs against tool variability. By leveraging Monte Carlo Tree Search, ToolEVO facilitates active exploration and interaction of LLMs within dynamic environments, allowing for autonomous self-reflection and self-updating of tool usage based on environmental feedback. Additionally, we introduce ToolQA-D, a benchmark specifically designed to evaluate the impact of tool variability. Extensive experiments demonstrate the effectiveness and stability of our approach, highlighting the importance of adaptability to tool variability for effective tool learning.
摘要：工具学习使大型语言模型 (LLM) 能够与外部工具和 API 交互，大大扩展了 LLM 的应用范围。然而，由于外部环境的动态特性，这些工具和 API 可能会随着时间的推移而过时，从而阻止 LLM 正确调用工具。现有研究主要关注静态环境，忽略了这个问题，限制了 LLM 在实际应用中的适应性。在本文中，我们提出了 ToolEVO，这是一个旨在增强 LLM 针对工具变化的自适应和反射能力的新框架。通过利用蒙特卡洛树搜索，ToolEVO 促进了 LLM 在动态环境中的主动探索和交互，允许基于环境反馈自主地自我反思和自我更新工具使用情况。此外，我们引入了 ToolQA-D，这是一个专门用于评估工具变化影响的基准。大量实验证明了我们方法的有效性和稳定性，强调了对工具变化的适应性对于有效工具学习的重要性。

Title: Tree of Problems: Improving structured problem solving with compositionality

Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06634
Pdf URL: https://arxiv.org/pdf/2410.06634
Copy Paste: [[2410.06634]] Tree of Problems: Improving structured problem solving with compositionality(https://arxiv.org/abs/2410.06634)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across multiple tasks through in-context learning. For complex reasoning tasks that require step-by-step thinking, Chain-of-Thought (CoT) prompting has given impressive results, especially when combined with self-consistency. Nonetheless, some tasks remain particularly difficult for LLMs to solve. Tree of Thoughts (ToT) and Graph of Thoughts (GoT) emerged as alternatives, dividing the complex problem into paths of subproblems. In this paper, we propose Tree of Problems (ToP), a simpler version of ToT, which we hypothesise can work better for complex tasks that can be divided into identical subtasks. Our empirical results show that our approach outperforms ToT and GoT, and in addition performs better than CoT on complex reasoning tasks. All code for this paper is publicly available here: this https URL.
摘要：大型语言模型 (LLM) 通过情境学习在多个任务中表现出色。对于需要逐步思考的复杂推理任务，思维链 (CoT) 提示已经取得了令人印象深刻的结果，尤其是与自洽性相结合时。尽管如此，有些任务对于 LLM 来说仍然特别难以解决。思维树 (ToT) 和思维图 (GoT) 作为替代方案出现，将复杂问题划分为子问题路径。在本文中，我们提出了问题树 (ToP)，这是 ToT 的更简单版本，我们假设它可以更好地用于可以划分为相同子任务的复杂任务。我们的实证结果表明，我们的方法优于 ToT 和 GoT，此外在复杂推理任务上的表现也优于 CoT。本文的所有代码均可在此处公开获取：此 https URL。
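
Note: the idea of dividing a complex task into identical subtasks can be sketched as a plain divide-and-merge recursion. This is an illustration under assumptions, not the authors' implementation: `solve_leaf` and `merge` are hypothetical stand-ins for what would be LLM calls, and the last-letter-concatenation task is used only as a toy example.

```python
def tree_of_problems(items, solve_leaf, merge, leaf_size=2):
    """Split `items` until each piece has at most `leaf_size` elements,
    solve the pieces independently, then merge the answers bottom-up."""
    if len(items) <= leaf_size:
        return solve_leaf(items)
    mid = len(items) // 2
    left = tree_of_problems(items[:mid], solve_leaf, merge, leaf_size)
    right = tree_of_problems(items[mid:], solve_leaf, merge, leaf_size)
    return merge(left, right)

def last_letters(words):
    """Stand-in leaf solver: concatenate the last letter of each word."""
    return "".join(w[-1] for w in words)

def concat(left, right):
    """Stand-in merge step: join the two partial answers in order."""
    return left + right

answer = tree_of_problems(
    ["apple", "banana", "cherry", "date", "fig"], last_letters, concat
)
```

Every leaf is the same kind of problem as the root, only smaller, which is the property the abstract hypothesises makes this structure a good fit.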

Title: Subtle Errors Matter: Preference Learning via Error-injected Self-editing

Authors: Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Chak Tou Leong, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06638
Pdf URL: https://arxiv.org/pdf/2410.06638
Copy Paste: [[2410.06638]] Subtle Errors Matter: Preference Learning via Error-injected Self-editing(https://arxiv.org/abs/2410.06638)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have exhibited strong mathematical reasoning and computational prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle errors, such as miscalculations or incorrect substitutions, limit the models' full mathematical potential. Existing studies to improve mathematical ability typically involve distilling reasoning skills from stronger LLMs or applying preference learning to step-wise response pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook the frequently occurring subtle errors. A major reason is that sampled preference pairs involve differences unrelated to the errors, which may distract the model from focusing on subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation. In detail, RISE uses the model itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective to focus on predefined errors and their tokens, without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
摘要：大型语言模型 (LLM) 表现出强大的数学推理和计算能力，能够处理从基本算术到高级竞赛级问题的各种任务。然而，经常发生的细微错误（例如计算错误或替换不正确）限制了模型的全部数学潜力。现有的提高数学能力的研究通常涉及从更强大的 LLM 中提取推理技能或将偏好学习应用于分步响应对。虽然这些方法利用不同粒度的样本来减轻推理错误，但它们忽略了经常发生的细微错误。一个主要原因是采样的偏好对涉及与错误无关的差异，这可能会分散模型对细微错误的关注。在这项工作中，我们提出了一种名为错误注入自编辑 (RISE) 的新型偏好学习框架，它将预定义的细微错误注入正确解决方案的部分标记中，以构建用于减轻错误的硬对。具体来说，RISE 使用模型本身来编辑解决方案中的少量标记，注入设计的细微错误。然后，将自编辑解决方案及其对应的正确解决方案组成的对，以及通过抽样获得的正确和错误解决方案对一起用于细微错误感知的 DPO 训练。与其他偏好学习方法相比，RISE 进一步细化了训练目标，专注于预定义的错误及其标记，而无需细粒度抽样或偏好注释。大量实验验证了 RISE 的有效性，在 Qwen2-7B-Instruct 上的偏好学习在 GSM8K 上取得了 3.0% 的显着改进，在 MATH 上取得了 7.9% 的显着改进。

Title: Large Language Models as Code Executors: An Exploratory Study

Authors: Chenyang Lyu, Lecheng Yan, Rui Xing, Wenxi Li, Younes Samih, Tianbo Ji, Longyue Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06667
Pdf URL: https://arxiv.org/pdf/2410.06667
Copy Paste: [[2410.06667]] Large Language Models as Code Executors: An Exploratory Study(https://arxiv.org/abs/2410.06667)
Keywords: language model, gpt, llm, prompt
Abstract: The capabilities of Large Language Models (LLMs) have significantly evolved, extending from natural language processing to complex tasks like code understanding and generation. We expand the scope of LLMs' capabilities to a broader context, using LLMs to execute code snippets to obtain the output. This paper pioneers the exploration of LLMs as code executors, where code snippets are directly fed to the models for execution, and outputs are returned. We are the first to comprehensively examine this feasibility across various LLMs, including OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder. Notably, the o1 model achieved over 90% accuracy in code execution, while others demonstrated lower accuracy levels. Furthermore, we introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22% (with the highest improvement of 18.96%) and an absolute average improvement of 3.86% against CoT prompting (with the highest improvement of 19.46%). Our study not only highlights the transformative potential of LLMs in coding but also lays the groundwork for future advancements in automated programming and the completion of complex tasks.
摘要：大型语言模型 (LLM) 的功能已显著发展，从自然语言处理扩展到代码理解和生成等复杂任务。我们将 LLM 的功能范围扩展到更广泛的上下文，使用 LLM 执行代码片段以获取输出。本文率先探索 LLM 作为代码执行器，其中代码片段直接输入模型进行执行，并返回输出。我们是第一个全面研究各种 LLM 可行性的人，包括 OpenAI 的 o1、GPT-4o、GPT-3.5、DeepSeek 和 Qwen-Coder。值得注意的是，o1 模型在代码执行中实现了超过 90% 的准确率，而其他模型的准确率较低。此外，我们引入了迭代指令提示 (IIP) 技术，逐行处理代码片段，使较弱模型的准确率平均提高 7.22%（最高提高 18.96%），与 CoT 提示相比，绝对平均提高 3.86%（最高提高 19.46%）。我们的研究不仅突出了 LLM 在编码方面的变革潜力，还为未来自动化编程和复杂任务的完成奠定了基础。

Title: Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures

Authors: Junxuan Wang, Xuyang Ge, Wentao Shu, Qiong Tang, Yunhua Zhou, Zhengfu He, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06672
Pdf URL: https://arxiv.org/pdf/2410.06672
Copy Paste: [[2410.06672]] Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures(https://arxiv.org/abs/2410.06672)
Keywords: language model
Abstract: The hypothesis of Universality in interpretability suggests that different neural networks may converge to implement similar algorithms on similar tasks. In this work, we investigate two mainstream architectures for language modeling, namely Transformers and Mambas, to explore the extent of their mechanistic similarity. We propose to use Sparse Autoencoders (SAEs) to isolate interpretable features from these models and show that most features are similar in these two models. We also validate the correlation between feature similarity and Universality. We then delve into the circuit-level analysis of Mamba models and find that the induction circuits in Mamba are structurally analogous to those in Transformers. We also identify a nuanced difference we call the Off-by-One motif: the information of one token is written into the SSM state in its next position, whereas interaction between tokens in Transformers does not exhibit such a trend.
摘要：可解释性的普遍性假设表明，不同的神经网络可能会聚合在一起，在相似的任务上实现相似的算法。在本文中，我们研究了两种主流的语言建模架构，即 Transformers 和 Mambas，以探索它们在机制上的相似程度。我们建议使用稀疏自编码器 (SAE) 从这些模型中分离出可解释的特征，并表明这两个模型中的大多数特征是相似的。我们还验证了特征相似性与普遍性之间的相关性。然后，我们深入研究了 Mamba 模型的电路级分析，发现 Mamba 中的感应电路在结构上与 Transformers 中的感应电路类似。我们还发现了一个细微的差异，我们称之为 \emph{Off-by-One 基序}：一个 token 的信息在下一个位置被写入 SSM 状态。而 Transformers 中 token 之间的交互并没有表现出这种趋势。

Title: PII-Scope: A Benchmark for Training Data PII Leakage Assessment in LLMs

Authors: Krishna Kanth Nakka, Ahmed Frikha, Ricardo Mendes, Xue Jiang, Xuebing Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06704
Pdf URL: https://arxiv.org/pdf/2410.06704
Copy Paste: [[2410.06704]] PII-Scope: A Benchmark for Training Data PII Leakage Assessment in LLMs(https://arxiv.org/abs/2410.06704)
Keywords: llm
Abstract: In this work, we introduce PII-Scope, a comprehensive benchmark designed to evaluate state-of-the-art methodologies for PII extraction attacks targeting LLMs across diverse threat settings. Our study provides a deeper understanding of these attacks by uncovering several hyperparameters (e.g., demonstration selection) crucial to their effectiveness. Building on this understanding, we extend our study to more realistic attack scenarios, exploring PII attacks that employ advanced adversarial strategies, including repeated and diverse querying, and leveraging iterative learning for continual PII extraction. Through extensive experimentation, our results reveal a notable underestimation of PII leakage in existing single-query attacks. In fact, we show that with sophisticated adversarial capabilities and a limited query budget, PII extraction rates can increase by up to fivefold when targeting the pretrained model. Moreover, we evaluate PII leakage on finetuned models, showing that they are more vulnerable to leakage than pretrained models. Overall, our work establishes a rigorous empirical benchmark for PII extraction attacks in realistic threat scenarios and provides a strong foundation for developing effective mitigation strategies.
摘要：在这项工作中，我们引入了 PII-Scope，这是一个全面的基准，旨在评估针对不同威胁环境中的 LLM 的 PII 提取攻击的最新方法。我们的研究通过揭示对其有效性至关重要的几个超参数（例如，演示选择）提供了对这些攻击的更深入了解。基于这种理解，我们将研究扩展到更现实的攻击场景，探索采用高级对抗策略的 PII 攻击，包括重复和多样化查询，并利用迭代学习进行持续的 PII 提取。通过大量实验，我们的结果揭示了现有单一查询攻击中对 PII 泄漏的严重低估。事实上，我们表明，凭借复杂的对抗能力和有限的查询预算，当针对预训练模型时，PII 提取率可以提高五倍。此外，我们在微调模型上评估了 PII 泄漏，表明它们比预训练模型更容易受到泄漏的影响。总体而言，我们的工作为现实威胁场景中的 PII 提取攻击建立了严格的经验基准，并为制定有效的缓解策略提供了坚实的基础。

Title: Calibrating Verbalized Probabilities for Large Language Models

Authors: Cheng Wang, Gyuri Szarvas, Georges Balazs, Pavel Danchenko, Patrick Ernst
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06707
Pdf URL: https://arxiv.org/pdf/2410.06707
Copy Paste: [[2410.06707]] Calibrating Verbalized Probabilities for Large Language Models(https://arxiv.org/abs/2410.06707)
Keywords: language model, llm
Abstract: Calibrating verbalized probabilities presents a novel approach for reliably assessing and leveraging outputs from black-box Large Language Models (LLMs). Recent methods have demonstrated improved calibration by applying techniques like Platt scaling or temperature scaling to the confidence scores generated by LLMs. In this paper, we explore the calibration of verbalized probability distributions for discriminative tasks. First, we investigate the capability of LLMs to generate probability distributions over categorical labels. We theoretically and empirically identify the issue of re-softmax arising from the scaling of verbalized probabilities, and propose using the invert softmax trick to approximate the "logit" by inverting verbalized probabilities. Through extensive evaluation on three public datasets, we demonstrate: (1) the robust capability of LLMs in generating class distributions, and (2) the effectiveness of the invert softmax trick in estimating logits, which, in turn, facilitates post-calibration adjustments.
摘要：校准口头化概率提供了一种可靠地评估和利用黑盒大型语言模型 (LLM) 输出的新方法。最近的方法已证明，通过将 Platt 缩放或温度缩放等技术应用于 LLM 生成的置信度分数，校准效果会有所改善。在本文中，我们探讨了用于判别任务的口头化概率分布的校准。首先，我们研究了 LLM 生成分类标签概率分布的能力。我们从理论和经验上确定了由于口头化概率缩放而产生的 re-softmax 问题，并建议使用反转 softmax 技巧通过反转口头化概率来近似“logit”。通过对三个公共数据集进行广泛的评估，我们证明了：(1) LLM 在生成类分布方面的强大能力，以及 (2) 反转 softmax 技巧在估计 logit 方面的有效性，这反过来又有助于校准后的调整。
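
Note: a minimal sketch of the two ideas named in this abstract, under stated assumptions. Softmax is invertible only up to an additive constant, so taking $\log p_i$ recovers one valid set of logits. Feeding verbalized probabilities straight into a softmax, which is one reading of the "re-softmax" issue, flattens the distribution instead. The function names are hypothetical and the temperature value is arbitrary. This is not the authors' code.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def invert_softmax(probs, eps=1e-12):
    """Approximate logits as log p_i (the additive constant is set to 0)."""
    return [math.log(max(p, eps)) for p in probs]

def recalibrate(verbalized_probs, temperature):
    """Normalize, map back to approximate logits, then temperature-scale."""
    total = sum(verbalized_probs)
    probs = [p / total for p in verbalized_probs]
    return softmax(invert_softmax(probs), temperature)

verbalized = [0.9, 0.1]
naive = softmax(verbalized)               # probabilities treated as logits
identity = recalibrate(verbalized, 1.0)   # round-trips to the input
softened = recalibrate(verbalized, 2.0)   # less confident than the input
```

With temperature 1 the round trip returns the original distribution, so any change in confidence comes from the chosen temperature rather than from the softmax itself.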

Title: Guaranteed Generation from Large Language Models

Authors: Minbeom Kim, Thibaut Thonet, Jos Rozen, Hwaran Lee, Kyomin Jung, Marc Dymetman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06716
Pdf URL: https://arxiv.org/pdf/2410.06716
Copy Paste: [[2410.06716]] Guaranteed Generation from Large Language Models(https://arxiv.org/abs/2410.06716)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly used across various applications, there is a growing need to control text generation to satisfy specific constraints or requirements. This raises a crucial question: Is it possible to guarantee strict constraint satisfaction in generated outputs while preserving the distribution of the original model as much as possible? We first define the ideal distribution - the one closest to the original model, which also always satisfies the expressed constraint - as the ultimate goal of guaranteed generation. We then state a fundamental limitation, namely that it is impossible to reach that goal through autoregressive training alone. This motivates the necessity of combining training-time and inference-time methods to enforce such guarantees. Based on this insight, we propose GUARD, a simple yet effective approach that combines an autoregressive proposal distribution with rejection sampling. Through GUARD's theoretical properties, we show how controlling the KL divergence between a specific proposal and the target ideal distribution simultaneously optimizes inference speed and distributional closeness. To validate these theoretical concepts, we conduct extensive experiments on two text generation settings with hard-to-satisfy constraints: a lexical constraint scenario and a sentiment reversal scenario. These experiments show that GUARD achieves perfect constraint satisfaction while almost preserving the ideal distribution with highly improved inference efficiency. GUARD provides a principled approach to enforcing strict guarantees for LLMs without compromising their generative capabilities.
摘要：随着大型语言模型 (LLM) 在各种应用中的使用越来越多，控制文本生成以满足特定约束或要求的需求也日益增长。这引出了一个关键问题：是否有可能在尽可能保留原始模型分布的同时保证生成的输出满足严格的约束？我们首先将理想分布（最接近原始模型的分布，并且始终满足所表达的约束）定义为保证生成的最终目标。然后，我们陈述了一个根本限制，即仅通过自回归训练无法达到该目标。这促使我们有必要结合训练时间和推理时间方法来强制执行此类保证。基于这一见解，我们提出了 GUARD，这是一种简单而有效的方法，它将自回归提议分布与拒绝采样相结合。通过 GUARD 的理论特性，我们展示了如何控制特定提议与目标理想分布之间的 KL 散度，同时优化推理速度和分布接近度。为了验证这些理论概念，我们在两种具有难以满足约束的文本生成设置上进行了广泛的实验：词汇约束场景和情绪反转场景。这些实验表明，GUARD 实现了完美的约束满足，同时几乎保留了理想的分布，并且推理效率得到了极大的提高。GUARD 提供了一种原则性方法，可以在不损害 LLM 生成能力的情况下为其实施严格的保证。
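
Note: the inference-time half of the approach described in this abstract, rejection sampling from a proposal distribution, can be sketched in a few lines. The toy proposal and the lexical constraint below are hypothetical placeholders for a trained autoregressive model and a real constraint checker. This is a generic illustration and not the GUARD implementation.

```python
import random

def rejection_sample(propose, satisfies, max_tries=1000, rng=None):
    """Draw from `propose` until `satisfies` accepts the candidate.
    Returns the accepted sample and the number of draws it took."""
    if rng is None:
        rng = random.Random(0)
    for tries in range(1, max_tries + 1):
        candidate = propose(rng)
        if satisfies(candidate):
            return candidate, tries
    raise RuntimeError("no constraint-satisfying sample within max_tries")

WORDS = ["good", "bad", "fine", "awful"]

def toy_proposal(rng):
    """Placeholder proposal: three words drawn uniformly at random."""
    return " ".join(rng.choice(WORDS) for _ in range(3))

def must_contain_good(text):
    """Placeholder lexical constraint: the word 'good' must appear."""
    return "good" in text.split()

sample, tries = rejection_sample(toy_proposal, must_contain_good)
```

Every returned sample satisfies the constraint by construction. The expected number of draws is the reciprocal of the proposal's acceptance rate, which is why a proposal closer to the constrained target makes inference faster.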

Title: Scaling Laws for Mixed quantization in Large Language Models

Authors: Zeyu Cao, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Yiren Zhao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06722
Pdf URL: https://arxiv.org/pdf/2410.06722
Copy Paste: [[2410.06722]] Scaling Laws for Mixed quantization in Large Language Models(https://arxiv.org/abs/2410.06722)
Keywords: language model, llm
Abstract: Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the computational requirements for running inference on these models. In this study, we focus on a straightforward question: When aiming for a specific accuracy or perplexity target for low-precision quantization, how many high-precision numbers or calculations must be preserved as we scale LLMs to larger sizes? We first introduce a critical metric named the quantization ratio, which compares the number of parameters quantized to low-precision arithmetic against the total parameter count. Through extensive and carefully controlled experiments across different model families, arithmetic types, and quantization granularities (e.g. layer-wise, matmul-wise), we identify two central phenomena. 1) The larger the models, the better they can preserve performance with an increased quantization ratio, as measured by perplexity in pre-training tasks or accuracy in downstream tasks. 2) The finer the granularity of mixed-precision quantization (e.g., matmul-wise), the more the model can increase the quantization ratio. We believe these observed phenomena offer valuable insights for future AI hardware design and the development of advanced Efficient AI algorithms.
摘要：大型语言模型 (LLM) 的训练后量化已被证明可有效降低在这些模型上运行推理的计算要求。在本研究中，我们关注一个简单的问题：当针对低精度量化的特定准确度或困惑度目标时，在将 LLM 扩展到更大尺寸时需要保留多少高精度数字或计算？我们首先引入一个称为量化率的关键指标，它将量化为低精度算法的参数数量与总参数数量进行比较。通过对不同模型系列、算法类型和量化粒度（例如逐层、逐矩阵乘法）进行大量且精心控制的实验，我们发现了两个核心现象。1) 模型越大，它们在增加量化率的情况下保持性能的效果就越好，以预训练任务中的困惑度或下游任务中的准确度来衡量。 2）混合精度量化的粒度越细（例如，matmul-wise），模型可以提高量化率的程度就越高。我们相信这些观察到的现象为未来的人工智能硬件设计和先进高效的人工智能算法的开发提供了宝贵的见解。
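
Note: the quantization ratio defined in this abstract is the number of parameters quantized to low precision divided by the total parameter count. A minimal sketch follows. The layer names and parameter counts are made up for illustration, and the function name is hypothetical.

```python
def quantization_ratio(param_counts, low_precision):
    """Fraction of parameters held in low precision.
    `param_counts` maps a component name to its parameter count;
    `low_precision` is the set of component names that are quantized."""
    total = sum(param_counts.values())
    if total == 0:
        raise ValueError("model has no parameters")
    low = sum(n for name, n in param_counts.items() if name in low_precision)
    return low / total

# Hypothetical model: the MLP matmuls are quantized, attention is kept
# in high precision.
layers = {"attn.q": 100, "attn.k": 100, "mlp.up": 400, "mlp.down": 400}
ratio = quantization_ratio(layers, {"mlp.up", "mlp.down"})
```

Keying the dictionary by individual matmul rather than by whole layer corresponds to the finer, matmul-wise granularity the abstract mentions.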

Title: Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

Authors: Qi Chen, Bowen Zhang, Gang Wang, Qi Wu
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2410.06733
Pdf URL: https://arxiv.org/pdf/2410.06733
Copy Paste: [[2410.06733]] Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles(https://arxiv.org/abs/2410.06733)
Keywords: language model, llm
Abstract: While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model. This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario. The judge answers based on a detailed reference scenario or evaluates if the player's predictions align with the reference one. This approach lessens dependence on more robust evaluation models, enabling the assessment of state-of-the-art LLMs. The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement, similar to the agreement levels among humans. Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements. This suggests that our benchmark effectively evaluates and elicits the lateral thinking abilities of LLMs. Code is available at: this https URL.
摘要：虽然 NLP 的进步显著提高了大型语言模型 (LLM) 在需要垂直思维的任务上的表现，但由于评估创造性思维过程的复杂性和相关数据的稀缺性，它们的横向思维能力仍未得到充分探索，而且难以衡量。为了应对这些挑战，我们引入了 SPLAT，这是一个利用情境谜题来评估和引出 LLM 横向思维的基准。这个基准包含三个难度级别的 975 个分级情境谜题，采用了一种新的多轮玩家-裁判框架，而不是传统的基于模型的评估，后者通常需要更强大的评估模型。这个框架模拟了一个互动游戏，其中模型（玩家）向评估模型（裁判）询问有关不完整故事的问题，以推断完整的场景。裁判根据详细的参考场景回答或评估玩家的预测是否与参考场景一致。这种方法减少了对更强大的评估模型的依赖，从而能够评估最先进的 LLM。实验表明，像 WizardLM-2 这样的稳健评估模型在中期问答和最终场景准确率方面与人类判断非常接近，一致性达到 80% 以上，与人类的一致性水平相似。此外，将我们基准测试中的数据和推理过程应用到其他横向思维相关基准测试（例如 RiddleSense 和 BrainTeaser）可提高性能。这表明我们的基准测试有效地评估和引出了 LLM 的横向思维能力。代码可从此 https URL 获取。

Title: Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance?

Authors: Fumiya Uchiyama, Takeshi Kojima, Andrew Gambardella, Qi Cao, Yusuke Iwasawa, Yutaka Matsuo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06735
Pdf URL: https://arxiv.org/pdf/2410.06735
Copy Paste: [[2410.06735]] Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance?(https://arxiv.org/abs/2410.06735)
Keywords: language model, llm
Abstract: Recent large language models (LLMs) have demonstrated remarkable generalization abilities in mathematics and logical reasoning tasks. Prior research indicates that LLMs pre-trained with programming language data exhibit high mathematical and reasoning abilities; however, this causal relationship has not been rigorously tested. Our research aims to verify which programming languages and features during pre-training affect logical inference performance. Specifically, we pre-trained decoder-based language models from scratch using datasets from ten programming languages (e.g., Python, C, Java) and three natural language datasets (Wikipedia, Fineweb, C4) under identical conditions. Thereafter, we evaluated the trained models in a few-shot in-context learning setting on logical reasoning tasks: FLD and bAbi, which do not require commonsense or world knowledge. The results demonstrate that nearly all models trained with programming languages consistently outperform those trained with natural languages, indicating that programming languages contain factors that elicit logic inference performance. In addition, we found that models trained with programming languages exhibit a better ability to follow instructions compared to those trained with natural languages. Further analysis reveals that the depth of Abstract Syntax Trees representing parsed results of programs also affects logical reasoning performance. These findings will offer insights into the essential elements of pre-training for acquiring the foundational abilities of LLMs.
摘要：最近的大型语言模型 (LLM) 在数学和逻辑推理任务中表现出了卓越的泛化能力。先前的研究表明，使用编程语言数据预训练的 LLM 表现出很高的数学和推理能力；然而，这种因果关系尚未经过严格测试。我们的研究旨在验证预训练期间哪些编程语言和功能会影响逻辑推理性能。具体来说，我们在相同条件下使用来自十种编程语言（例如 Python、C、Java）和三种自然语言数据集（Wikipedia、Fineweb、C4）的数据集从头开始预训练基于解码器的语言模型。此后，我们在逻辑推理任务的几次上下文学习设置中评估了训练后的模型：FLD 和 bAbi，这不需要常识或世界知识。结果表明，几乎所有用编程语言训练的模型都始终优于用自然语言训练的模型，这表明编程语言包含引发逻辑推理性能的因素。此外，我们发现与用自然语言训练的模型相比，用编程语言训练的模型表现出更好的遵循指令的能力。进一步分析发现，表示程序解析结果的抽象语法树的深度也会影响逻辑推理性能。这些发现将为获得 LLM 基础能力的预训练的基本要素提供见解。

Title: CoBa: Convergence Balancer for Multitask Finetuning of Large Language Models

Authors: Zi Gong, Hang Yu, Cong Liao, Bingchang Liu, Chaoyu Chen, Jianguo Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.06741
Pdf URL: https://arxiv.org/pdf/2410.06741
Copy Paste: [[2410.06741]] CoBa: Convergence Balancer for Multitask Finetuning of Large Language Models(https://arxiv.org/abs/2410.06741)
Keywords: language model, llm
Abstract: Multi-task learning (MTL) benefits the fine-tuning of large language models (LLMs) by providing a single model with improved performance and generalization ability across tasks, presenting a resource-efficient alternative to developing separate models for each task. Yet, existing MTL strategies for LLMs often fall short by either being computationally intensive or failing to ensure simultaneous task convergence. This paper presents CoBa, a new MTL approach designed to effectively manage task convergence balance with minimal computational overhead. Utilizing Relative Convergence Scores (RCS), Absolute Convergence Scores (ACS), and a Divergence Factor (DF), CoBa dynamically adjusts task weights during the training process, ensuring that the validation losses of all tasks progress towards convergence at an even pace while mitigating the issue of individual task divergence. The results of our experiments involving three disparate datasets underscore that this approach not only fosters equilibrium in task improvement but also enhances the LLMs' performance by up to 13% relative to the second-best baselines. Code is open-sourced at this https URL.
摘要：多任务学习 (MTL) 有利于大型语言模型 (LLM) 的微调，因为它提供了一个具有更高性能和跨任务泛化能力的模型，为开发每个任务的单独模型提供了一种资源高效的替代方案。然而，现有的 LLM MTL 策略往往存在不足，要么计算量大，要么无法确保任务同时收敛。本文介绍了一种新的 MTL 方法 CoBa，旨在以最小的计算开销有效管理任务收敛平衡。利用相对收敛分数 (RCS)、绝对收敛分数 (ACS) 和发散因子 (DF)，CoBa 在训练过程中动态调整任务权重，确保所有任务的验证损失以均匀的速度向收敛发展，同时缓解单个任务发散的问题。我们对三个不同数据集进行的实验结果表明，这种方法不仅有助于平衡任务改进，而且相对于第二好的基线，LLM 的性能提高了 13%。代码在此 https URL 上开源。

Title: To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models

Authors: Junyan Lin, Haoran Chen, Dawei Zhu, Xiaoyu Shen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.06765
Pdf URL: https://arxiv.org/pdf/2410.06765
Copy Paste: [[2410.06765]] To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models(https://arxiv.org/abs/2410.06765)
Keywords: language model, llm
Abstract: In recent years, multimodal large language models (MLLMs) have garnered significant attention from both industry and academia. However, there is still considerable debate on constructing MLLM architectures, particularly regarding the selection of appropriate connectors for perception tasks of varying granularities. This paper systematically investigates the impact of connectors on MLLM performance. Specifically, we classify connectors into feature-preserving and feature-compressing types. Utilizing a unified classification standard, we categorize sub-tasks from three comprehensive benchmarks, MMBench, MME, and SEED-Bench, into three task types: coarse-grained perception, fine-grained perception, and reasoning, and evaluate the performance. Our findings reveal that feature-preserving connectors excel in fine-grained perception tasks due to their ability to retain detailed visual information. In contrast, feature-compressing connectors, while less effective in fine-grained perception tasks, offer significant speed advantages and perform comparably in coarse-grained perception and reasoning tasks. These insights are crucial for guiding MLLM architecture design and advancing the optimization of MLLM architectures.
摘要：近年来，多模态大型语言模型 (MLLM) 引起了业界和学术界的广泛关注。然而，在构建 MLLM 架构方面仍然存在相当大的争议，特别是关于为不同粒度的感知任务选择合适的连接器。本文系统地研究了连接器对 MLLM 性能的影响。具体来说，我们将连接器分为特征保留型和特征压缩型。利用统一的分类标准，我们将三个综合基准 MMBench、MME 和 SEED-Bench 中的子任务分为三种任务类型：粗粒度感知、细粒度感知和推理，并评估其性能。我们的研究结果表明，特征保留型连接器在细粒度感知任务中表现出色，因为它们能够保留详细的视觉信息。相比之下，特征压缩连接器虽然在细粒度感知任务中效率较低，但在粗粒度感知和推理任务中具有显著的速度优势，且性能相当。这些见解对于指导 MLLM 架构设计和推进 MLLM 架构的优化至关重要。

Title: From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models

Authors: Yuying Shang, Xinyi Zeng, Yutao Zhu, Xiao Yang, Zhengwei Fang, Jingyuan Zhang, Jiawei Chen, Zinan Liu, Yu Tian
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2410.06795
Pdf URL: https://arxiv.org/pdf/2410.06795
Copy Paste: [[2410.06795]] From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models(https://arxiv.org/abs/2410.06795)
Keywords: language model, hallucination
Abstract: Hallucinations in large vision-language models (LVLMs) are a significant challenge, i.e., generating objects that are not presented in the visual input, which impairs their reliability. Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model's inability to effectively extract or decouple visual features. In this paper, we revisit the hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling). Motivated by our findings on the preliminary investigation, we propose a novel tuning strategy, PATCH, to mitigate hallucinations in LVLMs. This plug-and-play method can be integrated into various LVLMs, utilizing adaptive virtual tokens to extract object features from bounding boxes, thereby addressing hallucinations caused by insufficient decoupling of visual features. PATCH achieves state-of-the-art performance on multiple multi-modal hallucination datasets. We hope this approach provides researchers with deeper insights into the underlying causes of hallucinations in LVLMs, fostering further advancements and innovation in this field.
摘要：大型视觉语言模型 (LVLM) 中的幻觉是一项重大挑战，即生成视觉输入中未呈现的物体，这会损害其可靠性。最近的研究通常将幻觉归因于对视觉输入缺乏理解，却忽略了一个更根本的问题：模型无法有效地提取或解耦视觉特征。在本文中，我们从架构的角度重新审视了 LVLM 中的幻觉，调查了主要原因是视觉编码器（特征提取）还是模态对齐模块（特征解耦）。受初步调查结果的启发，我们提出了一种新颖的调整策略 PATCH，以减轻 LVLM 中的幻觉。这种即插即用的方法可以集成到各种 LVLM 中，利用自适应虚拟标记从边界框中提取对象特征，从而解决由于视觉特征解耦不足而导致的幻觉。PATCH 在多个多模态幻觉数据集上实现了最先进的性能。我们希望这种方法能让研究人员更深入地了解 LVLM 幻觉的根本原因，促进该领域的进一步进步和创新。

Title: Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level

Authors: Xinyi Zeng, Yuying Shang, Yutao Zhu, Jiawei Chen, Yu Tian
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2410.06809
Pdf URL: https://arxiv.org/pdf/2410.06809
Copy Paste: [[2410.06809]] Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level(https://arxiv.org/abs/2410.06809)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated immense utility across various industries. However, as LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. While current methods effectively address jailbreak risks, they share common limitations: 1) Judging harmful responses from the prefill-level lacks utilization of the model's decoding outputs, leading to relatively lower effectiveness and robustness. 2) Rejecting potentially harmful responses based on a single evaluation can significantly impair the model's helpfulness. This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens. Motivated by pilot experiment results, we design a robust defense mechanism at the decoding level. Our novel decoder-oriented, step-by-step defense architecture corrects harmful queries directly rather than rejecting them outright. We introduce speculative decoding to enhance usability and facilitate deployment to boost secure decoding speed. Extensive experiments demonstrate that our approach improves model security without compromising reasoning speed. Notably, our method leverages the model's ability to discern hazardous information, maintaining its helpfulness compared to existing methods.
摘要：大型语言模型 (LLM) 已在各个行业中展现出巨大的实用性。然而，随着 LLM 的发展，由于不正确或恶意的指令提示，有害输出的风险会增加。虽然当前的方法可以有效地解决越狱风险，但它们具有共同的局限性：1) 从预填充级别判断有害响应缺乏对模型解码输出的利用，导致相对较低的有效性和鲁棒性。2) 基于单一评估拒绝潜在有害响应会严重损害模型的有用性。本文研究了 LLM 识别有害输出的能力，揭示并量化了它们评估先前标记危险的能力。受试点实验结果的启发，我们在解码级别设计了一种强大的防御机制。我们新颖的面向解码器的分步防御架构直接纠正有害查询，而不是直接拒绝它们。我们引入了推测解码以增强可用性并促进部署以提高安全解码速度。大量实验表明，我们的方法在不影响推理速度的情况下提高了模型安全性。值得注意的是，我们的方法利用了模型辨别危险信息的能力，与现有方法相比保持了其有用性。

Title: MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders

Authors: Cheng Li, May Fung, Qingyun Wang, Chi Han, Manling Li, Jindong Wang, Heng Ji
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2410.06845
Pdf URL: https://arxiv.org/pdf/2410.06845
Copy Paste: [[2410.06845]] MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders(https://arxiv.org/abs/2410.06845)
Keywords: language model, gpt
Abstract: Mental health disorders are among the most serious diseases in the world. Most people with such a disease lack access to adequate care, which highlights the importance of training models for the diagnosis and treatment of mental health disorders. However, in the mental health domain, privacy concerns limit the accessibility of personalized treatment data, making it challenging to build powerful models. In this paper, we introduce MentalArena, a self-play framework to train language models by generating domain-specific personalized data, where we obtain a better model capable of making a personalized diagnosis and treatment (as a therapist) and providing information (as a patient). To accurately model human-like mental health patients, we devise Symptom Encoder, which simulates a real patient from both cognition and behavior perspectives. To address intent bias during patient-therapist interactions, we propose Symptom Decoder to compare diagnosed symptoms with encoded symptoms, and dynamically manage the dialogue between patient and therapist according to the identified deviations. We evaluated MentalArena against 6 benchmarks, including biomedicalQA and mental health tasks, compared to 6 advanced models. Our models, fine-tuned on both GPT-3.5 and Llama-3-8b, significantly outperform their counterparts, including GPT-4o. We hope that our work can inspire future research on personalized care. Code is available at this https URL.
摘要：精神疾病是世界上最严重的疾病之一。大多数患有这种疾病的人都无法获得足够的护理，这凸显了训练模型对于精神疾病诊断和治疗的重要性。然而，在心理健康领域，隐私问题限制了个性化治疗数据的可访问性，使得建立强大的模型变得具有挑战性。在本文中，我们介绍了 MentalArena，这是一个自我游戏框架，通过生成特定领域的个性化数据来训练语言模型，我们获得了一个更好的模型，能够做出个性化的诊断和治疗（作为治疗师）并提供信息（作为患者）。为了准确地模拟类似人类的精神疾病患者，我们设计了 Symptom Encoder，它从认知和行为的角度模拟真实的患者。为了解决患者与治疗师互动过程中的意图偏差，我们提出了 Symptom Decoder 来比较诊断出的症状与编码的症状，并根据识别出的偏差动态管理患者和治疗师之间的对话。我们根据 6 个基准（包括生物医学问答和心理健康任务）对 MentalArena 进行了评估，并与 6 个高级模型进行了比较。我们的模型在 GPT-3.5 和 Llama-3-8b 上进行了微调，其表现明显优于 GPT-4o 等同类模型。我们希望我们的工作能够启发未来个性化护理的研究。代码可在此 https URL 中找到

Title: Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity

Authors: Mutian He, Philip N. Garner
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.06846
Pdf URL: https://arxiv.org/pdf/2410.06846
Copy Paste: [[2410.06846]] Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity(https://arxiv.org/abs/2410.06846)
Keywords: language model
Abstract: Architectures such as Linformer and Mamba have recently emerged as competitive linear time replacements for transformers. However, corresponding large pretrained models are often unavailable, especially in non-text domains. To remedy this, we present a Cross-Architecture Layerwise Distillation (CALD) approach that jointly converts a transformer model to a linear time substitute and fine-tunes it to a target task. We also compare several means to guide the fine-tuning to optimally retain the desired inference capability from the original model. The methods differ in their use of the target model and the trajectory of the parameters. In a series of empirical studies on language processing, language modeling, and speech processing, we show that CALD can effectively recover the result of the original model, and that the guiding strategy contributes to the result. Some reasons for the variation are suggested.
摘要：Linformer 和 Mamba 等架构最近已成为 Transformer 的有竞争力的线性时间替代品。然而，相应的大型预训练模型通常不可用，尤其是在非文本领域。为了解决这个问题，我们提出了一种跨架构分层蒸馏 (CALD) 方法，该方法将 Transformer 模型联合转换为线性时间替代品，并将其微调到目标任务。我们还比较了几种指导微调的方法，以最佳地保留原始模型所需的推理能力。这些方法在目标模型的使用和参数的轨迹上有所不同。在一系列关于语言处理、语言建模和语音处理的实证研究中，我们表明 CALD 可以有效地恢复原始模型的结果，并且指导策略有助于结果。提出了一些变化的原因。

Title: FltLM: An Intergrated Long-Context Large Language Model for Effective Context Filtering and Understanding

Authors: Jingyang Deng, Zhengyang Shen, Boyang Wang, Lixin Su, Suqi Cheng, Ying Nie, Junfeng Wang, Dawei Yin, Jinwen Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06886
Pdf URL: https://arxiv.org/pdf/2410.06886
Copy Paste: [[2410.06886]] FltLM: An Intergrated Long-Context Large Language Model for Effective Context Filtering and Understanding(https://arxiv.org/abs/2410.06886)
Keywords: language model, llm
Abstract: The development of Long-Context Large Language Models (LLMs) has markedly advanced natural language processing by facilitating the processing of textual data across long documents and multiple corpora. However, Long-Context LLMs still face two critical challenges: the lost-in-the-middle phenomenon, where crucial middle-context information is likely to be missed, and the distraction issue, where models lose focus due to overly extended contexts. To address these challenges, we propose the Context Filtering Language Model (FltLM), a novel integrated Long-Context LLM which enhances the ability of the model on multi-document question-answering (QA) tasks. Specifically, FltLM innovatively incorporates a context filter with a soft mask mechanism, identifying and dynamically excluding irrelevant content to concentrate on pertinent information for better comprehension and reasoning. Our approach not only mitigates these two challenges, but also enables the model to operate conveniently in a single forward pass. Experimental results demonstrate that FltLM significantly outperforms supervised fine-tuning and retrieval-based methods in complex QA scenarios, suggesting a promising solution for more accurate and reliable long-context natural language understanding applications.
摘要：长上下文大型语言模型 (LLM) 的发展显著推动了自然语言处理的发展，它促进了长文档和多个语料库中文本数据的处理。然而，长上下文 LLM 仍然面临两个关键挑战：中间丢失现象，即关键的中间上下文信息可能会被遗漏；以及由于上下文过长而导致模型失去焦点的干扰问题。为了应对这些挑战，我们提出了上下文过滤语言模型 (FltLM)，这是一种新颖的集成长上下文 LLM，可增强模型在多文档问答 (QA) 任务上的能力。具体而言，FltLM 创新地将上下文过滤器与软掩码机制结合起来，识别并动态排除不相关内容，以专注于相关信息，从而更好地理解和推理。我们的方法不仅缓解了这两个挑战，而且使模型能够在一次前向传递中方便地运行。实验结果表明，FltLM 在复杂的 QA 场景中明显优于监督微调和基于检索的方法，为更准确、可靠的长上下文自然语言理解应用提供了一种有希望的解决方案。

Title: Generative Model for Less-Resourced Language with 1 billion parameters

Authors: Domen Vreš, Martin Božič, Aljaž Potočnik, Tomaž Martinčič, Marko Robnik-Šikonja
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06898
Pdf URL: https://arxiv.org/pdf/2410.06898
Copy Paste: [[2410.06898]] Generative Model for Less-Resourced Language with 1 billion parameters(https://arxiv.org/abs/2410.06898)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) are a basic infrastructure for modern natural language processing. Many commercial and open-source LLMs exist for English, e.g., ChatGPT, Llama, Falcon, and Mistral. As these models are trained on mostly English texts, their fluency and knowledge of low-resource languages and societies are superficial. We present the development of large generative language models for a less-resourced language. GaMS 1B - Generative Model for Slovene with 1 billion parameters was created by continuing pretraining of the existing English OPT model. We developed a new tokenizer adapted to Slovene, Croatian, and English languages and used embedding initialization methods FOCUS and WECHSEL to transfer the embeddings from the English OPT model. We evaluate our models on several classification datasets from the Slovene suite of benchmarks and generative sentence simplification task SENTA. We only used a few-shot in-context learning of our models, which are not yet instruction-tuned. For classification tasks, in this mode, the generative models lag behind the existing Slovene BERT-type models fine-tuned for specific tasks. On a sentence simplification task, the GaMS models achieve comparable or better performance than the GPT-3.5-Turbo model.
摘要：大型语言模型 (LLM) 是现代自然语言处理的基本基础设施。英语有许多商业和开源 LLM，例如 ChatGPT、Llama、Falcon 和 Mistral。由于这些模型主要是在英语文本上训练的，因此它们的流利程度和对资源匮乏的语言和社会的了解都很肤浅。我们介绍了针对资源较少的语言的大型生成语言模型的开发。GaMS 1B - 具有 10 亿个参数的斯洛文尼亚语生成模型是通过对现有的英语 OPT 模型进行持续预训练而创建的。我们开发了一种适用于斯洛文尼亚语、克罗地亚语和英语的新标记器，并使用嵌入初始化方法 FOCUS 和 WECHSEL 从英语 OPT 模型中传输嵌入。我们在来自斯洛文尼亚语基准套件和生成句子简化任务 SENTA 的几个分类数据集上评估了我们的模型。我们只对我们的模型进行了少量的上下文学习，这些模型尚未进行指令调整。对于分类任务，在此模式下，生成模型落后于针对特定任务进行微调的现有斯洛文尼亚 BERT 类模型。在句子简化任务上，GaMS 模型实现了与 GPT-3.5-Turbo 模型相当或更好的性能。

Title: Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning

Authors: Runchuan Zhu, Zhipeng Ma, Jiang Wu, Junyuan Gao, Jiaqi Wang, Dahua Lin, Conghui He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06913
Pdf URL: https://arxiv.org/pdf/2410.06913
Copy Paste: [[2410.06913]] Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning(https://arxiv.org/abs/2410.06913)
Keywords: language model, llm, hallucination
Abstract: Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions. By modifying responses of unknown questions in the training data to refusal responses such as "I don't know", RAIT enhances the reliability of LLMs and reduces their hallucination. Generally, RAIT modifies training samples based on the correctness of the initial LLM's response. However, this crude approach can cause LLMs to excessively refuse to answer questions they could have answered correctly, a problem we call over-refusal. In this paper, we explore two primary causes of over-refusal: Static conflict emerges when the RAIT data is constructed solely on correctness criteria, causing similar samples in the LLM's feature space to be assigned different labels (original vs. modified "I don't know"). Dynamic conflict occurs due to changes in the LLM's knowledge state during fine-tuning, which transform previously unknown questions into known ones, while the training data, which is constructed based on the initial LLM, remains unchanged. These conflicts cause the trained LLM to misclassify known questions as unknown, resulting in over-refusal. To address this issue, we introduce Certainty Represented Knowledge Flow for Refusal-Aware Instructions Construction (CRaFT). CRaFT centers on two main contributions: First, we additionally incorporate response certainty to selectively filter and modify data, reducing static conflicts. Second, we implement preliminary rehearsal training to characterize changes in the LLM's knowledge state, which helps mitigate dynamic conflicts during the fine-tuning process. We conducted extensive experiments on open-ended question answering and multiple-choice question tasks. Experiment results show that CRaFT can improve the LLM's overall performance during the RAIT process. Source code and training data will be released on GitHub.
摘要：拒绝感知指令调优 (RAIT) 使大型语言模型 (LLM) 能够拒绝回答未知问题。通过将训练数据中未知问题的回答修改为“我不知道”之类的拒绝回答，RAIT 提高了 LLM 的可靠性并减少了幻觉。通常，RAIT 根据初始 LLM 的回答的正确性修改训练样本。然而，这种粗暴的方法可能会导致 LLM 过度拒绝回答它们本可以正确回答的问题，我们称这种问题为过度拒绝。在本文中，我们探讨了过度拒绝的两个主要原因：当 RAIT 数据仅基于正确性标准构建时，会出现静态冲突，导致 LLM 特征空间中的类似样本被分配不同的标签（原始与修改后的“我不知道”）。动态冲突是由于 LLM 的知识状态在微调过程中发生变化而发生的，微调将之前的未知问题转变为已知问题，而基于初始 LLM 构建的训练数据保持不变。这些冲突导致训练后的 LLM 将已知问题错误地归类为未知问题，从而导致过度拒绝。为了解决这个问题，我们引入了确定性表示知识流的拒绝感知指令构建（CRaFT）。CRaFT 主要围绕两个贡献：首先，我们还结合了响应确定性来选择性地过滤和修改数据，减少静态冲突。其次，我们实施初步的排练训练来表征 LLM 知识状态的变化，这有助于缓解微调过程中的动态冲突。我们对开放式问答和多项选择题任务进行了大量实验。实验结果表明，CRaFT 可以提高 LLM 在 RAIT 过程中的整体性能。源代码和训练数据将在 Github 上发布。

Title: SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Authors: Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06916
Pdf URL: https://arxiv.org/pdf/2410.06916
Copy Paste: [[2410.06916]] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration(https://arxiv.org/abs/2410.06916)
Keywords: language model, llm
Abstract: Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by first employing a compact model to draft multiple tokens efficiently and then using the target LLM to verify them in parallel. While this technique has achieved notable speedups, most existing approaches necessitate either additional parameters or extensive training to construct effective draft models, thereby restricting their applicability across different LLMs and tasks. To address this limitation, we explore a novel plug-and-play SD solution with layer-skipping, which skips intermediate layers of the target LLM as the compact draft model. Our analysis reveals that LLMs exhibit great potential for self-acceleration through layer sparsity and the task-specific nature of this sparsity. Building on these insights, we introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. SWIFT does not require auxiliary models or additional training, making it a plug-and-play solution for accelerating LLM inference across diverse input data streams. Our extensive experiments across a wide range of models and downstream tasks demonstrate that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
摘要：推测解码 (SD) 已成为一种广泛使用的范例，可在不影响生成质量的情况下加速大型语言模型 (LLM) 的推理。它的工作原理是首先使用紧凑模型有效地起草多个标记，然后使用目标 LLM 并行验证它们。虽然这种技术已经实现了显着的加速，但大多数现有方法都需要额外的参数或大量的训练来构建有效的草稿模型，从而限制了它们在不同 LLM 和任务中的适用性。为了解决这一限制，我们探索了一种具有层跳过功能的新型即插即用 SD 解决方案，它跳过目标 LLM 的中间层作为紧凑的草稿模型。我们的分析表明，LLM 通过层稀疏性和这种稀疏的任务特定性表现出巨大的自我加速潜力。基于这些见解，我们引入了 SWIFT，这是一种即时自推测解码算法，可在推理过程中自适应地选择要跳过的 LLM 中间层。 SWIFT 不需要辅助模型或额外训练，因此它是一种即插即用的解决方案，可加速跨各种输入数据流的 LLM 推理。我们对各种模型和下游任务进行的大量实验表明，SWIFT 可以实现 1.3 倍至 1.6 倍以上的加速，同时保留生成文本的原始分布。
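
Note: The draft-then-verify loop described in the abstract above can be illustrated with a minimal sketch. It shows generic greedy speculative decoding only and does not implement SWIFT's adaptive layer skipping. The `NextToken` interface, the toy models, and all function names are hypothetical and chosen purely for illustration; a real system would verify all drafted positions in one parallel forward pass of the target model, which the loop here only simulates.

```python
from typing import Callable, List

# Hypothetical interface: maps a token context to the greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     ctx: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, add one target token."""
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(ctx + drafted))

    accepted: List[int] = []
    for tok in drafted:
        # Simulated verification; a real target model checks all k positions at once.
        expected = target(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target(ctx + accepted))  # bonus token when every draft is accepted
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens with repeated draft-and-verify steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_new]


def greedy(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with the target model, used as the reference."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    """Toy stand-in for the target LLM."""
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    """Toy stand-in for the draft model; disagrees with the target after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] * 3 + 1) % 7


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [1], 10))
    print(greedy(toy_target, [1], 10))
```

Because every accepted token is checked against the target's own greedy choice, the speculative output in this sketch matches plain greedy decoding with the target; any speedup would come only from the draft model being cheaper than the target.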

Title: CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages

Authors: Pretam Ray, Jivnesh Sandhan, Amrith Krishna, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.06944
Pdf URL: https://arxiv.org/pdf/2410.06944
Copy Paste: [[2410.06944]] CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages(https://arxiv.org/abs/2410.06944)
Keywords: prompt
Abstract: Neural dependency parsing has achieved remarkable performance for low resource morphologically rich languages. It has also been well-studied that morphologically rich languages exhibit relatively free word order. This prompts a fundamental investigation: Is there a way to enhance dependency parsing performance, making the model robust to word order variations utilizing the relatively free word order nature of morphologically rich languages? In this work, we examine the robustness of graph-based parsing architectures on 7 relatively free word order languages. We focus on scrutinizing essential modifications such as data augmentation and the removal of position encoding required to adapt these architectures accordingly. To this end, we propose a contrastive self-supervised learning method to make the model robust to word order variations. Furthermore, our proposed modification demonstrates a substantial average gain of 3.03/2.95 points in 7 relatively free word order languages, as measured by the UAS/LAS Score metric when compared to the best performing baseline.
摘要：神经依存分析在资源较少的形态丰富语言中取得了显著的表现。形态丰富语言表现出相对自由的词序，这一点也得到了充分的研究。这促使我们进行一项根本性的研究：是否有一种方法可以增强依存分析的性能，利用形态丰富语言相对自由的词序特性，使模型对词序变化具有鲁棒性？在这项工作中，我们研究了基于图的解析架构在 7 种相对自由的词序语言上的鲁棒性。我们重点研究了调整这些架构所需的基本修改，例如数据增强和位置编码的删除。为此，我们提出了一种对比自监督学习方法，使模型对词序变化具有鲁棒性。此外，与表现最佳的基线相比，我们提出的修改在 7 种相对自由的词序语言中表现出 3.03/2.95 分的平均增益，这是通过 UAS/LAS 分数指标衡量的。
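
Note: The UAS/LAS metrics cited in the abstract above are the standard unlabeled and labeled attachment scores for dependency parsing. A minimal sketch of how they are usually computed follows; the function name, the `(head, label)` tuple layout, and the example sentence are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical token annotation: (head index, dependency relation label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned tokens."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    # UAS counts tokens whose predicted head is correct.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose predicted head and label are both correct.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100 * uas_hits / n, 100 * las_hits / n


if __name__ == "__main__":
    gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
    pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "amod")]
    print(attachment_scores(gold, pred))
```

In this made-up example three of four heads are correct and two of four head-label pairs are correct, so LAS can never exceed UAS.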

Title: Self-Boosting Large Language Models with Synthetic Preference Data

Authors: Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06961
Pdf URL: https://arxiv.org/pdf/2410.06961
Copy Paste: [[2410.06961]] Self-Boosting Large Language Models with Synthetic Preference Data(https://arxiv.org/abs/2410.06961)
Keywords: language model, llm, prompt
Abstract: Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
摘要：通过与人类偏好保持一致，大型语言模型 (LLM) 在生成诚实、无害且有用的响应方面取得了显著进展。然而，收集高质量的偏好数据是一个资源密集型且需要创造力的过程，尤其是对于 LLM 的持续改进而言。我们引入了 SynPO，这是一种自增强范式，利用合成偏好数据进行模型对齐。SynPO 采用一种迭代机制，其中自提示生成器创建不同的提示，响应改进器逐步完善模型响应。这种方法训练 LLM 自主学习自身输出的生成奖励，并消除了对提示和人类偏好进行大规模注释的需要。经过四次 SynPO 迭代后，Llama3-8B 和 Mistral-7B 在遵循指令的能力方面表现出显著增强，在 AlpacaEval 2.0 和 ArenaHard 上的胜率提高了 22.1% 以上。同时，SynPO 提高了 LLM 在各项任务上的总体表现，这通过公认的 Open LLM 排行榜上的平均分数提高 3.2 到 5.0 得到验证。

Title: Uncovering Factor Level Preferences to Improve Human-Model Alignment

Authors: Juhyun Oh, Eunsu Kim, Jiseon Kim, Wenda Xu, Inha Cha, William Yang Wang, Alice Oh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06965
Pdf URL: https://arxiv.org/pdf/2410.06965
Copy Paste: [[2410.06965]] Uncovering Factor Level Preferences to Improve Human-Model Alignment(https://arxiv.org/abs/2410.06965)
Keywords: language model, llm
Abstract: Despite advancements in Large Language Model (LLM) alignment, understanding the reasons behind LLM preferences remains crucial for bridging the gap between desired and actual behavior. LLMs often exhibit biases or tendencies that diverge from human preferences, such as favoring certain writing styles or producing overly verbose outputs. However, current methods for evaluating preference alignment often lack explainability, relying on coarse-grained comparisons. To address this, we introduce PROFILE (PRObing Factors of InfLuence for Explainability), a novel framework that uncovers and quantifies the influence of specific factors driving preferences. PROFILE's factor level analysis explains the 'why' behind human-model alignment and misalignment, offering insights into the direction of model improvement. We apply PROFILE to analyze human and LLM preferences across three tasks: summarization, helpful response generation, and document-based question-answering. Our factor level analysis reveals a substantial discrepancy between human and LLM preferences in generation tasks, whereas LLMs show strong alignment with human preferences in evaluation tasks. We demonstrate how leveraging factor level insights, including addressing misaligned factors or exploiting the generation-evaluation gap, can improve alignment with human preferences. This work underscores the importance of explainable preference analysis and highlights PROFILE's potential to provide valuable training signals, driving further improvements in human-model alignment.
摘要：尽管大型语言模型 (LLM) 对齐方面取得了进展，但了解 LLM 偏好背后的原因对于弥合期望行为和实际行为之间的差距仍然至关重要。LLM 经常表现出与人类偏好不同的偏见或倾向，例如偏爱某些写作风格或产生过于冗长的输出。然而，当前用于评估偏好对齐的方法通常缺乏可解释性，依赖于粗粒度的比较。为了解决这个问题，我们引入了 PROFILE（可解释性影响因素探测），这是一个新颖的框架，可以揭示和量化驱动偏好的特定因素的影响。PROFILE 的因子水平分析解释了人类模型对齐和错位背后的“原因”，为模型改进的方向提供了见解。我们应用 PROFILE 分析三个任务中的人类和 LLM 偏好：总结、有用的响应生成和基于文档的问答。我们的因子水平分析揭示了人类和 LLM 偏好在生成任务中存在巨大差异，而 LLM 在评估任务中表现出与人类偏好的高度一致性。我们展示了如何利用因素层面的洞察力（包括解决不一致的因素或利用代际评估差距）来改善与人类偏好的一致性。这项工作强调了可解释的偏好分析的重要性，并强调了 PROFILE 提供有价值的训练信号的潜力，从而推动人类模型一致性的进一步改善。

Title: Personal Intelligence System UniLM: Hybrid On-Device Small Language Model and Server-Based Large Language Model for Malay Nusantara

Authors: Azree Nazri, Olalekan Agbolade, Faisal Aziz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.06973
Pdf URL: https://arxiv.org/pdf/2410.06973
Copy Paste: [[2410.06973]] Personal Intelligence System UniLM: Hybrid On-Device Small Language Model and Server-Based Large Language Model for Malay Nusantara(https://arxiv.org/abs/2410.06973)
Keywords: language model, llm
Abstract: In contexts with limited computational and data resources, high-resource language models often prove inadequate, particularly when addressing the specific needs of Malay languages. This paper introduces a Personal Intelligence System designed to efficiently integrate both on-device and server-based models. The system incorporates SLiM-34M for on-device processing, optimized for low memory and power usage, and MANYAK-1.3B for server-based tasks, allowing for scalable, high-performance language processing. The models achieve significant results across various tasks, such as machine translation, question-answering, and translated IndoMMLU. Particularly noteworthy is SLiM-34M's ability to achieve a large improvement in accuracy compared to other LLMs while using 2 times fewer pre-training tokens. This work challenges the prevailing assumption that large-scale computational resources are necessary to build effective language models, contributing to the development of resource-efficient models for the Malay language with the unique orchestration between SLiM-34M and MANYAK-1.3B.
摘要：在计算和数据资源有限的环境中，高资源语言模型通常被证明是不够的，特别是在满足马来语的特定需求时。本文介绍了一种个人智能系统，旨在有效整合设备上和基于服务器的模型。该系统结合了 SLiM-34M 用于设备上处理，针对低内存和功耗进行了优化，以及 MANYAK-1.3B 用于基于服务器的任务，从而实现了可扩展的高性能语言处理。这些模型在各种任务中取得了显著的成果，例如机器翻译、问答和翻译 IndoMMLU。特别值得注意的是，与其他 LLM 相比，SLiM-34M 能够实现准确性的大幅提升，同时使用的预训练标记减少了 2 倍。这项工作挑战了普遍存在的假设，即构建有效的语言模型需要大规模计算资源，有助于通过 SLiM-34M 和 MANYAK-1.3B 之间的独特协调为马来语开发资源高效的模型。

Title: CursorCore: Assist Programming through Aligning Anything

Authors: Hao Jiang, Qi Liu, Rui Li, Shengyu Ye, Shijin Wang
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2410.07002
Pdf URL: https://arxiv.org/pdf/2410.07002
Copy Paste: [[2410.07002]] CursorCore: Assist Programming through Aligning Anything(https://arxiv.org/abs/2410.07002)
Keywords: language model, chat
Abstract: Large language models have been successfully applied to programming assistance tasks, such as code completion, code insertion, and instructional code editing. However, these applications remain insufficiently automated and struggle to effectively integrate various types of information during the programming process, including coding history, current code, and user instructions. In this work, we propose a new conversational framework that comprehensively integrates these information sources, collect data to train our models, and evaluate their performance. First, to thoroughly evaluate how well models align with different types of information and the quality of their outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to comprehensively assess the performance of models in programming assistance tasks. Then, for data collection, we develop a data generation pipeline, Programming-Instruct, which synthesizes training data from diverse sources, such as GitHub and online judge platforms. This pipeline can automatically generate various types of messages throughout the programming process. Finally, using this pipeline, we generate 219K samples, fine-tune multiple models, and develop the CursorCore series. We show that CursorCore outperforms other models of comparable size. This framework unifies applications such as inline chat and automated editing, and contributes to the advancement of coding assistants. Code, models and data are freely available at this https URL.
摘要：大型语言模型已成功应用于编程辅助任务，例如代码完成、代码插入和指导性代码编辑。然而，这些应用程序仍然自动化程度不够，难以在编程过程中有效整合各种类型的信息，包括编码历史、当前代码和用户指令。在这项工作中，我们提出了一个新的对话框架，全面整合这些信息源，收集数据来训练我们的模型并评估其性能。首先，为了彻底评估模型与不同类型信息的匹配程度及其输出质量，我们引入了一个新的基准 APEval（辅助编程评估），以全面评估模型在编程辅助任务中的表现。然后，为了收集数据，我们开发了一个数据生成管道 Programming-Instruct，它综合了来自各种来源的训练数据，例如 GitHub 和在线评判平台。该管道可以在整个编程过程中自动生成各种类型的消息。最后，使用此管道，我们生成了 219K 个样本，微调了多个模型，并开发了 CursorCore 系列。我们表明 CursorCore 的表现优于其他同等规模的模型。该框架统一了在线聊天和自动编辑等应用程序，有助于提高编码助手的水平。代码、模型和数据可在此 https URL 上免费获取。

Title: Pap2Pat: Towards Automated Paper-to-Patent Drafting using Chunk-based Outline-guided Generation

Authors: Valentin Knappich, Simon Razniewski, Anna Hätty, Annemarie Friedrich
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.07009
Pdf URL: https://arxiv.org/pdf/2410.07009
Copy Paste: [[2410.07009]] Pap2Pat: Towards Automated Paper-to-Patent Drafting using Chunk-based Outline-guided Generation(https://arxiv.org/abs/2410.07009)
Keywords: language model, llm
Abstract: The patent domain is gaining attention in natural language processing research, offering practical applications in streamlining the patenting process and providing challenging benchmarks for large language models (LLMs). However, the generation of the description sections of patents, which constitute more than 90% of the patent document, has not been studied to date. We address this gap by introducing the task of outline-guided paper-to-patent generation, where an academic paper provides the technical specification of the invention and an outline conveys the desired patent structure. We present PAP2PAT, a new challenging benchmark of 1.8k patent-paper pairs with document outlines, collected using heuristics that reflect typical research lab practices. Our experiments with current open-weight LLMs and outline-guided chunk-based generation show that they can effectively use information from the paper but struggle with repetitions, likely due to the inherent repetitiveness of patent language. We release our data and code.
摘要：专利领域在自然语言处理研究中越来越受到关注，它为简化专利申请流程提供了实际应用，并为大型语言模型 (LLM) 提供了具有挑战性的基准。然而，专利描述部分的生成（占专利文档的 90% 以上）迄今为止尚未得到研究。我们通过引入大纲引导的论文到专利生成任务来解决这一差距，其中学术论文提供发明的技术规范，大纲传达所需的专利结构。我们提出了 PAP2PAT，这是一个新的具有挑战性的基准，包含 1.8k 个专利论文对和文档大纲，使用反映典型研究实验室实践的启发式方法收集。我们对当前开放权重 LLM 和大纲引导的基于块的生成进行的实验表明，它们可以有效地使用论文中的信息，但在重复方面存在困难，这可能是由于专利语言固有的重复性。我们发布了我们的数据和代码。

Title: PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness

Authors: Zekun Wang, Feiyu Duan, Yibo Zhang, Wangchunshu Zhou, Ke Xu, Wenhao Huang, Jie Fu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.07035
Pdf URL: https://arxiv.org/pdf/2410.07035
Copy Paste: [[2410.07035]] PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness(https://arxiv.org/abs/2410.07035)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) demonstrate impressive capabilities across various domains, including role-playing, creative writing, mathematical reasoning, and coding. Despite these advancements, LLMs still encounter challenges with length control, frequently failing to adhere to specific length constraints due to their token-level operations and insufficient training on data with strict length limitations. We identify this issue as stemming from a lack of positional awareness and propose novel approaches--PositionID Prompting and PositionID Fine-Tuning--to address it. These methods enhance the model's ability to continuously monitor and manage text length during generation. Additionally, we introduce PositionID CP Prompting to enable LLMs to perform copy and paste operations accurately. Furthermore, we develop two benchmarks for evaluating length control and copy-paste abilities. Our experiments demonstrate that our methods significantly improve the model's adherence to length constraints and copy-paste accuracy without compromising response quality.
摘要：大型语言模型 (LLM) 在角色扮演、创意写作、数学推理和编码等各个领域都表现出令人印象深刻的能力。尽管取得了这些进步，LLM 仍然面临长度控制方面的挑战，由于其 token 级操作以及对严格长度限制的数据的训练不足，它们经常无法遵守特定的长度限制。我们认为这个问题源于缺乏位置意识，并提出了新颖的方法——PositionID Prompting 和 PositionID Fine-Tuning——来解决它。这些方法增强了模型在生成过程中持续监控和管理文本长度的能力。此外，我们引入了 PositionID CP Prompting，使 LLM 能够准确地执行复制和粘贴操作。此外，我们还开发了两个用于评估长度控制和复制粘贴能力的基准。我们的实验表明，我们的方法显著提高了模型对长度限制的遵守程度和复制粘贴准确性，同时又不影响响应质量。

Title: Mitigating the Language Mismatch and Repetition Issues in LLM-based Machine Translation via Model Editing

Authors: Weichuan Wang, Zhaoyi Li, Defu Lian, Chen Ma, Linqi Song, Ying Wei
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.07054
Pdf URL: https://arxiv.org/pdf/2410.07054
Copy Paste: [[2410.07054]] Mitigating the Language Mismatch and Repetition Issues in LLM-based Machine Translation via Model Editing(https://arxiv.org/abs/2410.07054)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently revolutionized the NLP field, yet they still fall short in some specific downstream tasks. In this work, we focus on utilizing LLMs to perform machine translation, where we observe that two patterns of errors frequently occur and drastically affect the translation quality: language mismatch and repetition. The work sets out to explore the potential for mitigating these two issues by leveraging model editing methods, e.g., by locating Feed-Forward Network (FFN) neurons or other components that are responsible for the errors and deactivating them at inference time. We find that directly applying such methods either has limited effect on the targeted errors or has a significant negative side effect on the general translation quality, indicating that the located components may also be crucial for keeping machine translation with LLMs on the rails. To this end, we propose to refine the located components by taking the intersection of the locating results under different language settings, filtering out the aforementioned information that is irrelevant to the targeted errors. The experimental results empirically demonstrate that our methods can effectively reduce the language mismatch and repetition ratios while enhancing or preserving the general translation quality in most cases.
摘要：大型语言模型 (LLM) 最近彻底改变了 NLP 领域，但它们在某些特定的下游任务中仍存在不足。在这项工作中，我们专注于利用 LLM 进行机器翻译，我们观察到两种错误模式经常发生并严重影响翻译质量：语言不匹配和重复。这项工作旨在探索通过利用模型编辑方法缓解这两个问题的可能性，例如，通过定位前馈网络 (FFN) 神经元或导致错误的某些东西并在推理时间内停用它们。我们发现直接应用此类方法要么对目标错误的影响有限，要么对总体翻译质量产生重大的负面影响，这表明定位的组件对于确保 LLM 的机器翻译正常运行也至关重要。为此，我们建议通过获取不同语言设置下定位结果的交集来细化定位的组件，过滤掉与目标错误无关的上述信息。实验结果证明，我们的方法可以有效降低语言不匹配和重复率，同时在大多数情况下提高或保持总体翻译质量。
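
The refinement step described in this abstract (keeping only the components located under every language setting) can be illustrated with a minimal sketch. The `(layer, neuron)` identifier format, the function name `refine_located_neurons`, and the example data are all hypothetical; the abstract does not specify how the authors represent or locate components.

```python
from typing import Dict, Set, Tuple

# Hypothetical identifier format: (layer index, neuron index).
Neuron = Tuple[int, int]


def refine_located_neurons(located: Dict[str, Set[Neuron]]) -> Set[Neuron]:
    """Keep only the neurons that were located under every language setting."""
    if not located:
        return set()
    settings = iter(located.values())
    refined = set(next(settings))  # copy so the input is not mutated
    for neurons in settings:
        refined &= neurons  # set intersection
    return refined


# Made-up locating results for three translation directions.
located = {
    "en-zh": {(3, 17), (5, 42), (7, 8)},
    "en-de": {(3, 17), (5, 42), (9, 1)},
    "zh-en": {(3, 17), (5, 42), (7, 8)},
}
refined = refine_located_neurons(located)
print(sorted(refined))  # [(3, 17), (5, 42)]
```

Neurons such as `(7, 8)` and `(9, 1)` appear only under some settings, so the intersection drops them as likely unrelated to the shared error pattern.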

Title: Data Selection via Optimal Control for Language Models

Authors: Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.07064
Pdf URL: https://arxiv.org/pdf/2410.07064
Copy Paste: [[2410.07064]] Data Selection via Optimal Control for Language Models(https://arxiv.org/abs/2410.07064)
Keywords: language model
Abstract: This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs' capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved theoretically by Pontryagin's Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these theoretical results, we introduce PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions. In our experiments, we adopt PDS to select data from CommonCrawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when the pre-training data is limited, by reducing the data demand by 1.8 times, which mitigates the quick exhaustion of available web-crawled corpora. Our code, data, and model checkpoints can be found in this https URL.
摘要：这项工作研究了从海量语料库中选择高质量的预训练数据，以增强 LM 的下游使用能力。我们将数据选择表述为广义最优控制问题，该问题可以通过庞特里亚金最大原理 (PMP) 在理论上解决，从而得出一组必要条件，这些条件描述了最优数据选择与 LM 训练动态之间的关系。基于这些理论结果，我们引入了基于 PMP 的数据选择 (PDS)，这是一个通过解决 PMP 条件来近似最优数据选择的框架。在我们的实验中，我们采用 PDS 从 CommonCrawl 中选择数据，并表明 PDS 选择的语料库加速了 LM 的学习，并持续提高它们在不同模型大小的各种下游任务上的性能。此外，PDS 的好处可扩展到在 ~10T 标记上训练的参数量约 400B 的模型，这可以通过根据缩放定律外推测试损失曲线来证明。PDS 还可以在预训练数据有限的情况下提高数据利用率，将数据需求减少 1.8 倍，从而缓解可用的网络爬取语料库的快速耗尽。我们的代码、数据和模型检查点可在此 https URL 中找到。

Title: ReIFE: Re-evaluating Instruction-Following Evaluation

Authors: Yixin Liu, Kejian Shi, Alexander R. Fabbri, Yilun Zhao, Peifeng Wang, Chien-Sheng Wu, Shafiq Joty, Arman Cohan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.07069
Pdf URL: https://arxiv.org/pdf/2410.07069
Copy Paste: [[2410.07069]] ReIFE: Re-evaluating Instruction-Following Evaluation(https://arxiv.org/abs/2410.07069)
Keywords: language model, llm
Abstract: The automatic evaluation of instruction following typically involves using large language models (LLMs) to assess response quality. However, there is a lack of comprehensive evaluation of these LLM-based evaluators across two dimensions: the base LLMs and the evaluation protocols. Therefore, we present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 recently proposed evaluation protocols, on 4 human-annotated datasets, assessing the evaluation accuracy of the LLM-evaluators. Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness. Moreover, our large-scale evaluation reveals: (1) Base LLM performance ranking remains largely consistent across evaluation protocols, with less capable LLMs showing greater improvement from protocol enhancements; (2) Robust evaluation of evaluation protocols requires many base LLMs with varying capability levels, as protocol effectiveness can depend on the base LLM used; (3) Evaluation results on different datasets are not always consistent, so a rigorous evaluation requires multiple datasets with distinctive features. We release our meta-evaluation suite ReIFE, which provides the codebase and evaluation result collection for more than 500 LLM-evaluator configurations, to support future research in instruction-following evaluation.
摘要：指令跟随的自动评估通常涉及使用大型语言模型 (LLM) 来评估响应质量。然而，缺乏对这些基于 LLM 的评估器在两个维度上的全面评估：基础 LLM 和评估协议。因此，我们在 4 个人工注释的数据集上对指令跟随进行了全面的元评估，包括 25 个基础 LLM 和 15 个最近提出的评估协议，以评估 LLM 评估器的评估准确性。我们的评估使我们能够识别出具有高度稳健性的最佳性能基础 LLM 和评估协议。此外，我们的大规模评估表明：(1) 基础 LLM 性能排名在评估协议中保持基本一致，能力较差的 LLM 从协议增强中显示出更大的改进；(2) 对评估协议的稳健评估需要许多具有不同能力水平的基础 LLM，因为协议的有效性可能取决于所使用的基础 LLM；(3) 不同数据集的评估结果并不总是一致的，因此严格的评估需要具有独特特征的多个数据集。我们发布了元评估套件 ReIFE，它为超过 500 个 LLM 评估器配置提供了代码库和评估结果集合，以支持未来的指令跟踪评估研究。

Title: MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses

Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, Dongzhan Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.07076
Pdf URL: https://arxiv.org/pdf/2410.07076
Copy Paste: [[2410.07076]] MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses(https://arxiv.org/abs/2410.07076)
Keywords: language model, llm, agent
Abstract: Scientific discovery contributes largely to human society's prosperity, and recent progress shows that LLMs could potentially catalyze this process. However, it is still unclear whether LLMs can discover novel and valid hypotheses in chemistry. In this work, we investigate this central research question: Can LLMs automatically discover novel and valid chemistry research hypotheses given only a chemistry research background (consisting of a research question and/or a background survey), without limitation on the domain of the research question? After extensive discussions with chemistry experts, we propose the assumption that a majority of chemistry hypotheses can result from a research background and several inspirations. With this key insight, we break the central question into three smaller fundamental questions. In brief, they are: (1) given a background question, whether LLMs can retrieve good inspirations; (2) with background and inspirations, whether LLMs can arrive at a hypothesis; and (3) whether LLMs can identify good hypotheses and rank them higher. To investigate these questions, we construct a benchmark consisting of 51 chemistry papers published in Nature, Science, or venues of a similar level in 2024 (all papers have only been available online since 2024). Every paper is divided by chemistry PhD students into three components: background, inspirations, and hypothesis. The goal is to rediscover the hypothesis, given only the background and a large randomly selected chemistry literature corpus containing the ground-truth inspiration papers, with LLMs trained on data up to 2023. We also develop an LLM-based multi-agent framework that leverages the assumption, consisting of three stages reflecting the three smaller questions. The proposed method can rediscover many hypotheses with very high similarity to the ground-truth ones, covering the main innovations.
摘要：科学发现为人类社会的繁荣做出了巨大贡献，最近的进展表明 LLM 有可能催化这一进程。然而，LLM 是否能够发现化学领域的新颖且有效的假设仍不清楚。在本文中，我们研究这个核心研究问题：仅给定化学研究背景（包括研究问题和/或背景调查），LLM 能否自动发现新颖且有效的化学研究假设，而不限制研究问题的领域？在与化学专家进行广泛讨论后，我们提出一个假设，即大多数化学假设可以来自一个研究背景和几个灵感。基于这一关键见解，我们将核心问题分解为三个较小的根本问题。简而言之，它们是：（1）给定一个背景问题，LLM 是否可以检索到好的灵感；（2）有了背景和灵感，LLM 是否可以得出假设；（3）LLM 是否可以识别好的假设并对其进行更高的排名。为了研究这些问题，我们构建了一个基准，其中包括 2024 年在《自然》、《科学》或类似水平期刊上发表的 51 篇化学论文（所有论文自 2024 年起才可在线获取）。化学博士生将每篇论文分为三个部分：背景、灵感和假设。目标是在仅给出背景和一个包含真实灵感论文的随机选择的大型化学文献语料库的情况下重新发现假设，所用 LLM 的训练数据截至 2023 年。我们还开发了一个基于 LLM 的多智能体框架，该框架利用该假设，由三个阶段组成，反映了三个较小的问题。所提出的方法可以重新发现许多与真实假设非常相似的假设，涵盖了主要创新。

Title: Stanceformer: Target-Aware Transformer for Stance Detection

Authors: Krishna Garg, Cornelia Caragea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.07083
Pdf URL: https://arxiv.org/pdf/2410.07083
Copy Paste: [[2410.07083]] Stanceformer: Target-Aware Transformer for Stance Detection(https://arxiv.org/abs/2410.07083)
Keywords: language model, llm
Abstract: The task of Stance Detection involves discerning the stance expressed in a text towards a specific subject or target. Prior works have relied on existing transformer models that lack the capability to prioritize targets effectively. Consequently, these models yield similar performance regardless of whether we utilize or disregard target information, undermining the task's significance. To address this challenge, we introduce Stanceformer, a target-aware transformer model that incorporates enhanced attention towards the targets during both training and inference. Specifically, we design a Target Awareness matrix that increases the self-attention scores assigned to the targets. We demonstrate the efficacy of the Stanceformer with various BERT-based models, including state-of-the-art models and Large Language Models (LLMs), and evaluate its performance across three stance detection datasets, alongside a zero-shot dataset. Our approach Stanceformer not only provides superior performance but also generalizes even to other domains, such as Aspect-based Sentiment Analysis. We make the code publicly available (this https URL).
摘要：立场检测的任务涉及辨别文本中表达的针对特定主题或目标的立场。先前的研究依赖于现有的转换器模型，这些模型缺乏有效确定目标优先级的能力。因此，无论我们利用还是忽略目标信息，这些模型都会产生类似的性能，从而削弱了任务的重要性。为了应对这一挑战，我们引入了 Stanceformer，这是一种目标感知转换器模型，在训练和推理过程中都增强了对目标的注意力。具体来说，我们设计了一个目标感知（Target Awareness）矩阵，它可以增加分配给目标的自注意力分数。我们用各种基于 BERT 的模型（包括最先进的模型和大型语言模型 (LLM)）证明了 Stanceformer 的有效性，并在三个立场检测数据集以及零样本数据集上评估了其性能。我们的方法 Stanceformer 不仅提供了卓越的性能，而且甚至可以推广到其他领域，例如基于方面的情绪分析。我们将代码公开发布（此 https URL）。
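
As a rough illustration of what "increasing the self-attention scores assigned to the targets" could look like, the sketch below adds a constant bias to the pre-softmax scores of key positions flagged as target tokens. The additive form, the `boost` parameter, and the function names are assumptions made for illustration; the abstract does not give the actual definition of the Target Awareness matrix.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def target_aware_attention(
    scores: List[List[float]], target_mask: List[int], boost: float = 1.0
) -> List[List[float]]:
    """Add `boost` to the raw score of every key position flagged as a
    target token (mask value 1), then normalise each query row.

    The additive bias is an assumed stand-in for the paper's matrix.
    """
    weights = []
    for row in scores:
        boosted = [s + boost * m for s, m in zip(row, target_mask)]
        weights.append(softmax(boosted))
    return weights


# Two query positions, three key positions; only the last key is a target token.
scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
target_mask = [0, 0, 1]
plain = target_aware_attention(scores, target_mask, boost=0.0)
aware = target_aware_attention(scores, target_mask, boost=1.0)
print(plain[0][2], aware[0][2])  # the target's weight rises once the boost is applied
```

With uniform raw scores, the unboosted weights are uniform, and the boosted version shifts attention mass toward the target position while each row still sums to one.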

Title: MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Authors: Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.07095
Pdf URL: https://arxiv.org/pdf/2410.07095
Copy Paste: [[2410.07095]] MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering(https://arxiv.org/abs/2410.07095)
Keywords: language model, agent
Abstract: We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (this http URL) to facilitate future research in understanding the ML engineering capabilities of AI agents.
摘要：我们引入了 MLE-bench，这是衡量 AI 代理在机器学习工程方面表现的基准。为此，我们从 Kaggle 中挑选了 75 项与 ML 工程相关的竞赛，创建了一组多样化的具有挑战性的任务，以测试现实世界的 ML 工程技能，例如训练模型、准备数据集和运行实验。我们使用 Kaggle 的公开排行榜为每项竞赛建立人类基线。我们使用开源代理支架在我们的基准上评估几种前沿语言模型，发现表现最佳的设置——OpenAI 的带有 AIDE 支架的 o1-preview——在 16.9% 的竞赛中至少达到了 Kaggle 铜牌的水平。除了我们的主要结果之外，我们还研究了 AI 代理的各种形式的资源扩展以及预训练污染的影响。我们开源我们的基准代码（此 http URL），以方便未来研究了解 AI 代理的 ML 工程能力。

Title: Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context

Authors: Sangwon Yu, Ik-hwan Kim, Jongyoon Song, Saehyung Lee, Junsung Park, Sungroh Yoon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.07103
Pdf URL: https://arxiv.org/pdf/2410.07103
Copy Paste: [[2410.07103]] Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context(https://arxiv.org/abs/2410.07103)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Multi-hop reasoning, which requires multi-step reasoning based on the supporting documents within a given context, remains challenging for large language models (LLMs). LLMs often struggle to filter out irrelevant documents within the context, and their performance is sensitive to the position of supporting documents within that context. In this paper, we identify an additional challenge: LLMs' performance is also sensitive to the order in which the supporting documents are presented. We refer to this as the misordered context problem. To address this issue, we propose a simple yet effective method called context repetition (CoRe), which involves prompting the model by repeatedly presenting the context to ensure the supporting documents are presented in the optimal order for the model. Using CoRe, we improve the F1 score by up to 30%p on multi-hop QA tasks and increase accuracy by up to 70%p on a synthetic task. Additionally, CoRe helps mitigate the well-known "lost-in-the-middle" problem in LLMs and can be effectively combined with retrieval-based approaches utilizing Chain-of-Thought (CoT) reasoning.
摘要：多跳推理需要基于给定上下文中的支持文档进行多步骤推理，这对于大型语言模型 (LLM) 来说仍然具有挑战性。LLM 通常很难过滤掉上下文中的不相关文档，并且其性能对支持文档在该上下文中的位置很敏感。在本文中，我们发现了另一个挑战：LLM 的性能还对支持文档的呈现顺序很敏感。我们将其称为上下文乱序问题。为了解决这个问题，我们提出了一种简单而有效的方法，称为上下文重复 (CoRe)，该方法涉及通过反复呈现上下文来提示模型，以确保支持文档以模型的最佳顺序呈现。使用 CoRe，我们将多跳 QA 任务的 F1 分数提高了高达 30%p，并将合成任务的准确率提高了高达 70%p。此外，CoRe 有助于缓解 LLM 中众所周知的“中间迷失”问题，并且可以有效地与利用思路链 (CoT) 推理的基于检索的方法相结合。
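
A minimal sketch of the context-repetition idea follows: the same document list is presented several times so that documents needed in a different order still occur in that order somewhere in the prompt. The prompt template, the function names, and the default of two repetitions are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def build_core_prompt(documents: List[str], question: str, repetitions: int = 2) -> str:
    """Build a prompt that presents the full context `repetitions` times."""
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    context = "\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(documents))
    repeated = "\n\n".join([context] * repetitions)
    return f"{repeated}\n\nQuestion: {question}\nAnswer:"


def appears_in_order(sequence: Sequence[str], wanted: Sequence[str]) -> bool:
    """Return True if `wanted` occurs as a subsequence of `sequence`."""
    it = iter(sequence)
    return all(item in it for item in wanted)  # `in` advances the shared iterator


# The reasoning chain needs A before B, but the context lists B first.
presented = ["B", "A"]
needed = ["A", "B"]
print(appears_in_order(presented, needed))      # False
print(appears_in_order(presented * 2, needed))  # True

prompt = build_core_prompt(["fact about B", "fact about A"], "Who is ...?", repetitions=2)
```

In this toy case a single pass over the context never shows A before B, whereas the doubled context does. In general, presenting k documents k times guarantees that every ordering of them occurs as a subsequence, at the cost of a longer prompt.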

Title: I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy

Authors: Gian Maria Campedelli, Nicolò Penzo, Massimo Stefan, Roberto Dessì, Marco Guerini, Bruno Lepri, Jacopo Staiano
Subjects: cs.CL, cs.AI, cs.CY, cs.MA
Abstract URL: https://arxiv.org/abs/2410.07109
Pdf URL: https://arxiv.org/pdf/2410.07109
Copy Paste: [[2410.07109]] I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy(https://arxiv.org/abs/2410.07109)
Keywords: language model, llm, prompt, agent
Abstract: As Large Language Model (LLM)-based agents become increasingly autonomous and interact more freely with each other, studying interactions between them becomes crucial to anticipate emergent phenomena and potential risks. Drawing inspiration from the widely popular Stanford Prison Experiment, we contribute to this line of research by studying interaction patterns of LLM agents in a context characterized by strict social hierarchy. We do so by specifically studying two types of phenomena: persuasion and anti-social behavior in simulated scenarios involving a guard and a prisoner agent who seeks to achieve a specific goal (i.e., obtaining additional yard time or escaping from prison). Leveraging 200 experimental scenarios for a total of 2,000 machine-machine conversations across five different popular LLMs, we provide a set of noteworthy findings. First, we document how some models consistently fail in carrying out a conversation in our multi-agent setup where power dynamics are at play. Second, for the models that were able to engage in successful interactions, we empirically show how the goal that an agent is set to achieve impacts primarily its persuasiveness, while having a negligible effect with respect to the agent's anti-social behavior. Third, we highlight how agents' personas, and particularly the guard's personality, drive both the likelihood of successful persuasion from the prisoner and the emergence of anti-social behaviors. Fourth, we show that even without explicitly prompting for specific personalities, anti-social behavior emerges by simply assigning agents' roles. These results bear implications for the development of interactive LLM agents as well as the debate on their societal impact.
摘要：随着基于大型语言模型 (LLM) 的代理变得越来越自主，并且更自由地相互交互，研究它们之间的交互对于预测涌现现象和潜在风险变得至关重要。从广为人知的斯坦福监狱实验中汲取灵感，我们通过研究 LLM 代理在以严格的社会等级制度为特征的环境中的互动模式，为这一系列研究做出了贡献。我们通过具体研究两种类型的现象来实现这一点：在模拟场景中的说服和反社会行为，其中涉及一名狱警和一名试图实现特定目标（即获得额外的放风时间或越狱）的囚犯代理。利用 200 个实验场景，在五个不同的流行 LLM 中总共进行 2,000 次机器对机器对话，我们提供了一组值得注意的发现。第一，我们记录了某些模型在权力动态发挥作用的多代理设置中如何始终无法完成对话。第二，对于能够成功互动的模型，我们通过实证研究展示了代理设定的目标如何主要影响其说服力，而对代理的反社会行为的影响微乎其微。第三，我们强调了代理的角色，特别是狱警的个性，如何推动成功说服囚犯的可能性以及反社会行为的出现。第四，我们表明，即使没有明确提示特定个性，反社会行为也会通过简单地分配代理角色而出现。这些结果对交互式 LLM 代理的开发以及对其社会影响的辩论具有重要意义。

Title: Exploring the Readiness of Prominent Small Language Models for the Democratization of Financial Literacy

Authors: Tagore Rao Kosireddy, Jeffrey D. Wall, Evan Lucas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.07118
Pdf URL: https://arxiv.org/pdf/2410.07118
Copy Paste: [[2410.07118]] Exploring the Readiness of Prominent Small Language Models for the Democratization of Financial Literacy(https://arxiv.org/abs/2410.07118)
Keywords: language model
Abstract: The use of small language models (SLMs), herein defined as models with fewer than three billion parameters, is increasing across various domains and applications. Due to their ability to run on more accessible hardware and preserve user privacy, SLMs possess the potential to democratize access to language models for individuals of different socioeconomic status and with different privacy preferences. This study assesses several state-of-the-art SLMs (e.g., Apple's OpenELM, Microsoft's Phi, Google's Gemma, and the TinyLlama project) for use in the financial domain to support the development of financial literacy LMs. Democratizing access to quality financial information for those who are financially undereducated is greatly needed in society, particularly as new financial markets and products emerge and participation in financial markets increases due to ease of access. We are the first to examine the use of open-source SLMs to democratize access to financial question answering capabilities for individuals and students. To this end, we provide an analysis of the memory usage, inference time, similarity comparisons to ground-truth answers, and output readability of prominent SLMs to determine which models are most accessible and capable of supporting access to financial information. We analyze zero-shot and few-shot learning variants of the models. The results suggest that some off-the-shelf SLMs merit further exploration and fine-tuning to prepare them for individual use, while others may have limits to their democratization.
摘要：小型语言模型 (SLM)（此处定义为具有少于 30 亿个参数的模型）的使用正在各个领域和应用程序中日益增多。由于 SLM 能够在更易访问的硬件上运行并保护用户隐私，因此它有可能使不同社会经济地位和不同隐私偏好的个人能够民主地访问语言模型。本研究评估了几种最先进的 SLM（例如 Apple 的 OpenELM、Microsoft 的 Phi、Google 的 Gemma 和 Tinyllama 项目），以用于金融领域，以支持金融知识 LM 的开发。让那些金融教育不足的人能够民主地获取高质量的金融信息是社会迫切需要的，特别是随着新的金融市场和产品的出现，以及由于访问方便而导致金融市场参与度增加。我们是第一个研究使用开源 SLM 使个人和学生能够民主地访问金融问答功能的人。为此，我们对主要 SLM 的内存使用情况、推理时间、与真实答案的相似性比较以及输出可读性进行了分析，以确定哪些模型最易于访问且能够支持访问财务信息。我们分析了这些模型的零样本和少样本学习变体。结果表明，一些现成的 SLM 值得进一步探索和微调，以备个人使用，而其他 SLM 可能在民主化方面存在限制。

Title: Mental Disorders Detection in the Era of Large Language Models

Authors: Gleb Kuzmin, Petr Strepetov, Maksim Stankevich, Ivan Smirnov, Artem Shelmanov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.07129
Pdf URL: https://arxiv.org/pdf/2410.07129
Copy Paste: [[2410.07129]] Mental Disorders Detection in the Era of Large Language Models(https://arxiv.org/abs/2410.07129)
Keywords: language model, llm
Abstract: This paper compares the effectiveness of traditional machine learning methods, encoder-based models, and large language models (LLMs) on the task of detecting depression and anxiety. Five datasets were considered, each differing in format and the method used to define the target pathology class. We tested AutoML models based on linguistic features, several variations of encoder-based Transformers such as BERT, and state-of-the-art LLMs as pathology classification models. The results demonstrated that LLMs outperform traditional methods, particularly on noisy and small datasets where training examples vary significantly in text length and genre. However, psycholinguistic features and encoder-based models can achieve performance comparable to language models when trained on texts from individuals with clinically confirmed depression, highlighting their potential effectiveness in targeted clinical applications.
摘要：本文比较了传统机器学习方法、基于编码器的模型和大型语言模型 (LLM) 在检测抑郁和焦虑任务上的有效性。我们考虑了五个数据集，每个数据集的格式和定义目标病理类别的方法都不同。我们测试了基于语言特征的 AutoML 模型、基于编码器的 Transformer 的几种变体（例如 BERT）以及最先进的 LLM 作为病理分类模型。结果表明，LLM 优于传统方法，尤其是在训练示例在文本长度和类型上差异很大的嘈杂和小型数据集上。然而，当使用临床确诊抑郁症患者的文本进行训练时，心理语言学特征和基于编码器的模型可以实现与语言模型相当的性能，这凸显了它们在有针对性的临床应用中的潜在有效性。

Title: Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Authors: Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.07137
Pdf URL: https://arxiv.org/pdf/2410.07137
Copy Paste: [[2410.07137]] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates(https://arxiv.org/abs/2410.07137)
Keywords: language model, llm
Abstract: Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at this https URL.
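Summary: Automatic LLM benchmarks such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench have become popular tools for evaluating language models because they are cost-effective and scalable compared with human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to raise win rates, even though several mechanisms have been developed to control length and disentangle style in order to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to the input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0, a score of 83.0 on Arena-Hard-Auto, and a score of 9.55 on MT-Bench. Moreover, the crafted cheating outputs are transferable, because we assume that the instructions of these benchmarks (e.g., the 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate less perceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at this https URL.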

Title: Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling

Authors: Yingfa Chen, Xinrong Zhang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.07145
Pdf URL: https://arxiv.org/pdf/2410.07145
Copy Paste: [[2410.07145]] Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling(https://arxiv.org/abs/2410.07145)
Keywords: language model, long context
Abstract: One essential advantage of recurrent neural networks (RNNs) over transformer-based language models is their linear computational complexity with respect to the sequence length, which makes them much faster in handling long sequences during inference. However, most publicly available RNNs (e.g., Mamba and RWKV) are trained on sequences with fewer than 10K tokens, and their effectiveness in longer contexts remains largely unsatisfactory so far. In this paper, we study the cause of the inability to process long context for RNNs and suggest critical mitigations. We examine two practical concerns when applying state-of-the-art RNNs to long contexts: (1) the inability to extrapolate to inputs longer than the training length and (2) the upper bound of memory capacity. Addressing the first concern, we first investigate *state collapse* (SC), a phenomenon that causes severe performance degradation on sequence lengths not encountered during training. With controlled experiments, we attribute this to overfitting due to the recurrent state being overparameterized for the training length. For the second concern, we train a series of Mamba-2 models on long documents to empirically estimate the recurrent state capacity in language modeling and passkey retrieval. Then, three SC mitigation methods are proposed to improve Mamba-2's length generalizability, allowing the model to process more than 1M tokens without SC. We also find that the recurrent state capacity in passkey retrieval scales exponentially with the state size, and we empirically train a Mamba-2 370M with near-perfect passkey retrieval accuracy on 256K context length. This suggests a promising future for RNN-based long-context modeling.
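Summary: One essential advantage of recurrent neural networks (RNNs) over Transformer-based language models is their linear computational complexity with respect to sequence length, which makes them faster at handling long sequences during inference. However, most publicly available RNNs (e.g., Mamba and RWKV) are trained on sequences of fewer than 10K tokens, and their effectiveness in longer contexts remains unsatisfactory so far. In this paper, we study why RNNs fail to process long contexts and propose critical mitigations. We examine two practical concerns when applying state-of-the-art RNNs to long contexts: (1) the inability to extrapolate to inputs longer than the training length and (2) the upper bound of memory capacity. To address the first concern, we first investigate *state collapse* (SC), a phenomenon that causes severe performance degradation on sequence lengths not encountered during training. Through controlled experiments, we attribute this to overfitting caused by the recurrent state being overparameterized for the training length. For the second concern, we train a series of Mamba-2 models on long documents to empirically estimate the recurrent state capacity in language modeling and passkey retrieval. Three SC mitigation methods are then proposed to improve Mamba-2's length generalizability, allowing the model to process more than 1M tokens without SC. We also find that the recurrent state capacity in passkey retrieval scales exponentially with the state size, and we empirically train a Mamba-2 370M with near-perfect passkey retrieval accuracy at a 256K context length. This suggests a promising future for RNN-based long-context modeling.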

Title: Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

Authors: Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, Sijia Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.07163
Pdf URL: https://arxiv.org/pdf/2410.07163
Copy Paste: [[2410.07163]] Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning(https://arxiv.org/abs/2410.07163)
Keywords: language model, llm
Abstract: In this work, we address the problem of large language model (LLM) unlearning, aiming to remove unwanted data influences and associated model capabilities (e.g., copyrighted data or harmful content generation) while preserving essential model utilities, without the need for retraining from scratch. Despite the growing need for LLM unlearning, a principled optimization framework remains lacking. To this end, we revisit the state-of-the-art approach, negative preference optimization (NPO), and identify the issue of reference model bias, which could undermine NPO's effectiveness, particularly when unlearning forget data of varying difficulty. Given that, we propose a simple yet effective unlearning optimization framework, called SimNPO, showing that 'simplicity' in removing the reliance on a reference model (through the lens of simple preference optimization) benefits unlearning. We also provide deeper insights into SimNPO's advantages, supported by analysis using mixtures of Markov chains. Furthermore, we present extensive experiments validating SimNPO's superiority over existing unlearning baselines in benchmarks like TOFU and MUSE, and robustness against relearning attacks. Code is available at this https URL.
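Summary: In this work, we address the problem of large language model (LLM) unlearning, which aims to remove unwanted data influences and associated model capabilities (e.g., copyrighted data or harmful content generation) while preserving essential model utility, without retraining from scratch. Despite the growing need for LLM unlearning, a principled optimization framework is still lacking. To this end, we revisit the state-of-the-art approach, negative preference optimization (NPO), and identify the issue of reference model bias, which could undermine NPO's effectiveness, particularly when unlearning forget data of varying difficulty. Given that, we propose a simple yet effective unlearning optimization framework called SimNPO, showing that the "simplicity" of removing the reliance on a reference model (through the lens of simple preference optimization) benefits unlearning. We also provide deeper insights into SimNPO's advantages, supported by analysis using mixtures of Markov chains. Furthermore, we present extensive experiments validating SimNPO's superiority over existing unlearning baselines on benchmarks such as TOFU and MUSE, as well as its robustness against relearning attacks. Code is available at this https URL.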

Title: Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

Authors: Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, Jiajun Wu
Subjects: cs.CL, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2410.07166
Pdf URL: https://arxiv.org/pdf/2410.07166
Copy Paste: [[2410.07166]] Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making(https://arxiv.org/abs/2410.07166)
Keywords: language model, llm, hallucination, agent
Abstract: We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly-used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics which break down evaluation into various types of errors, such as hallucination errors, affordance errors, various types of planning errors, etc. Overall, our benchmark offers a comprehensive assessment of LLMs' performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems, and providing insights for effective and selective use of LLMs in embodied decision making.
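Summary: We aim to evaluate Large Language Models (LLMs) for embodied decision making. Although a significant body of work has leveraged LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance, because they are usually applied in different domains, for different purposes, and built on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, which makes it difficult to pinpoint which ability LLMs lack and where the problem lies, and this in turn prevents embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and of the input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state goals and temporally extended goals, 2) four commonly used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics that break evaluation down into various types of errors, such as hallucination errors, affordance errors, and various types of planning errors. Overall, our benchmark offers a comprehensive assessment of LLMs' performance on different subtasks, pinpoints the strengths and weaknesses of LLM-powered embodied AI systems, and provides insights for the effective and selective use of LLMs in embodied decision making.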

Title: Sylber: Syllabic Embedding Representation of Speech from Raw Audio

Authors: Cheol Jun Cho, Nicholas Lee, Akshat Gupta, Dhruv Agarwal, Ethan Chen, Alan W Black, Gopala K. Anumanchipalli
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2410.07168
Pdf URL: https://arxiv.org/pdf/2410.07168
Copy Paste: [[2410.07168]] Sylber: Syllabic Embedding Representation of Speech from Raw Audio(https://arxiv.org/abs/2410.07168)
Keywords: language model
Abstract: Syllables are compositional units of spoken language that play a crucial role in human speech perception and production. However, current neural speech representations lack structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model which is an exponential moving average of the model in training. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding. We also train token-to-speech generative models with our syllabic units and show that fully intelligible speech can be reconstructed from these tokens. Lastly, we observe that categorical perception, a linguistic phenomenon of speech perception, emerges naturally in our model, making the embedding space more categorical and sparse than previous self-supervised learning approaches. Together, we present a novel self-supervised approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.
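The abstract describes the teacher as an exponential moving average (EMA) of the model in training. The sketch below shows a generic EMA update on plain Python lists; the function name `ema_update`, the `decay` values, and the toy parameters are assumptions for illustration and are not taken from the Sylber paper or its code.

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """Return new teacher parameters as an EMA of the student's.

    Generic EMA rule (assumed, not Sylber-specific):
        teacher <- decay * teacher + (1 - decay) * student
    """
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_params, student_params)
    ]


# Toy run: the student is held fixed so the teacher's drift is easy to follow.
teacher = [0.0, 0.0]
for _ in range(3):
    student = [1.0, 2.0]
    teacher = ema_update(teacher, student, decay=0.5)
```

With a decay close to 1 the teacher changes slowly, so the regression targets it provides are more stable than the student's own current outputs; the small decay of 0.5 here only makes the toy run easy to follow.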
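Summary: Syllables are compositional units of spoken language that play a crucial role in human speech perception and production. However, current neural speech representations lack structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model, where the teacher is an exponential moving average of the model in training. This yields a highly structured representation of speech features with three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited to lexical and syntactic understanding. We also train token-to-speech generative models with our syllabic units and show that fully intelligible speech can be reconstructed from these tokens. Lastly, we observe that categorical perception, a linguistic phenomenon of speech perception, emerges naturally in our model, making the embedding space more categorical and sparse than in previous self-supervised learning approaches. Together, we present a novel self-supervised approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.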

Title: Do better language models have crisper vision?

Authors: Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2410.07173
Pdf URL: https://arxiv.org/pdf/2410.07173
Copy Paste: [[2410.07173]] Do better language models have crisper vision?(https://arxiv.org/abs/2410.07173)
Keywords: language model, llm
Abstract: How well do text-only Large Language Models (LLMs) grasp the visual world? As LLMs are increasingly used in computer vision, addressing this question becomes both fundamental and pertinent. However, existing studies have primarily focused on limited scenarios, such as their ability to generate visual content or cluster multimodal data. To this end, we propose the Visual Text Representation Benchmark (ViTeRB) to isolate key properties that make language models well-aligned with the visual world. With this, we identify large-scale decoder-based LLMs as ideal candidates for representing text in vision-centric contexts, counter to the current practice of utilizing text encoders. Building on these findings, we propose ShareLock, an ultra-lightweight CLIP-like model. By leveraging precomputable frozen features from strong vision and language models, ShareLock achieves an impressive 51% accuracy on ImageNet despite utilizing just 563k image-caption pairs. Moreover, training requires only 1 GPU hour (or 10 hours including the precomputation of features) - orders of magnitude less than prior methods. Code will be released.
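Summary: How well do text-only Large Language Models (LLMs) grasp the visual world? As LLMs are increasingly used in computer vision, addressing this question becomes both fundamental and pertinent. However, existing studies have primarily focused on limited scenarios, such as their ability to generate visual content or cluster multimodal data. To this end, we propose the Visual Text Representation Benchmark (ViTeRB) to isolate the key properties that make language models well aligned with the visual world. With this, we identify large-scale decoder-based LLMs as ideal candidates for representing text in vision-centric contexts, counter to the current practice of using text encoders. Building on these findings, we propose ShareLock, an ultra-lightweight CLIP-like model. By leveraging precomputable frozen features from strong vision and language models, ShareLock achieves an impressive 51% accuracy on ImageNet despite using just 563k image-caption pairs. Moreover, training requires only 1 GPU hour (or 10 hours including the precomputation of features), which is orders of magnitude less than prior methods. Code will be released.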

Title: Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models

Authors: Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, Sercan Ö. Arık
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.07176
Pdf URL: https://arxiv.org/pdf/2410.07176
Copy Paste: [[2410.07176]] Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models(https://arxiv.org/abs/2410.07176)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG), while effective in integrating external knowledge to address the limitations of large language models (LLMs), can be undermined by imperfect retrieval, which may introduce irrelevant, misleading, or even malicious information. Despite its importance, previous studies have rarely explored the behavior of RAG through joint analysis on how errors from imperfect retrieval attribute and propagate, and how potential conflicts arise between the LLMs' internal knowledge and external sources. We find that imperfect retrieval augmentation might be inevitable and quite harmful, through controlled analysis under realistic conditions. We identify the knowledge conflicts between LLM-internal and external knowledge from retrieval as a bottleneck to overcome in the post-retrieval stage of RAG. To render LLMs resilient to imperfect retrieval, we propose Astute RAG, a novel RAG approach that adaptively elicits essential information from LLMs' internal knowledge, iteratively consolidates internal and external knowledge with source-awareness, and finalizes the answer according to information reliability. Our experiments using Gemini and Claude demonstrate that Astute RAG significantly outperforms previous robustness-enhanced RAG methods. Notably, Astute RAG is the only approach that matches or exceeds the performance of LLMs without RAG under worst-case scenarios. Further analysis reveals that Astute RAG effectively resolves knowledge conflicts, improving the reliability and trustworthiness of RAG systems.
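Summary: Retrieval-Augmented Generation (RAG), while effective at integrating external knowledge to address the limitations of large language models (LLMs), can be undermined by imperfect retrieval, which may introduce irrelevant, misleading, or even malicious information. Despite its importance, previous studies have rarely explored the behavior of RAG through a joint analysis of how errors from imperfect retrieval are attributed and propagate, and of how potential conflicts arise between the LLMs' internal knowledge and external sources. Through controlled analysis under realistic conditions, we find that imperfect retrieval augmentation may be inevitable and quite harmful. We identify the knowledge conflicts between LLM-internal knowledge and external knowledge from retrieval as a bottleneck to overcome in the post-retrieval stage of RAG. To make LLMs resilient to imperfect retrieval, we propose Astute RAG, a novel RAG approach that adaptively elicits essential information from LLMs' internal knowledge, iteratively consolidates internal and external knowledge with source-awareness, and finalizes the answer according to information reliability. Our experiments using Gemini and Claude demonstrate that Astute RAG significantly outperforms previous robustness-enhanced RAG methods. Notably, Astute RAG is the only approach that matches or exceeds the performance of LLMs without RAG under worst-case scenarios. Further analysis reveals that Astute RAG effectively resolves knowledge conflicts, improving the reliability and trustworthiness of RAG systems.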