2024-10-07

Title: Ingest-And-Ground: Dispelling Hallucinations from Continually-Pretrained LLMs with RAG

Authors: Chenhao Fang, Derek Larson, Shitong Zhu, Sophie Zeng, Wendy Summer, Yanqing Peng, Yuriy Hulovatyy, Rajeev Rao, Gabriel Forgues, Arya Pudota, Alex Goncalves, Hervé Robert
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2410.02825
Pdf URL: https://arxiv.org/pdf/2410.02825
Copy Paste: [[2410.02825]] Ingest-And-Ground: Dispelling Hallucinations from Continually-Pretrained LLMs with RAG(https://arxiv.org/abs/2410.02825)
Keywords: llm, hallucination
Abstract: This paper presents new methods that have the potential to improve privacy process efficiency with LLMs and RAG. To reduce hallucination, we continually pre-train the base LLM with a privacy-specific knowledge base and then augment it with a semantic RAG layer. Our evaluations demonstrate that this approach enhances model performance (metrics as much as doubled compared to an out-of-the-box LLM) in handling privacy-related queries by grounding responses in factual information, which reduces inaccuracies.

Title: Position: LLM Unlearning Benchmarks are Weak Measures of Progress

Authors: Pratiksha Thaker, Shengyuan Hu, Neil Kale, Yash Maurya, Zhiwei Steven Wu, Virginia Smith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02879
Pdf URL: https://arxiv.org/pdf/2410.02879
Copy Paste: [[2410.02879]] Position: LLM Unlearning Benchmarks are Weak Measures of Progress(https://arxiv.org/abs/2410.02879)
Keywords: language model, llm
Abstract: Unlearning methods have the potential to improve the privacy and safety of large language models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning research community has increasingly turned toward empirical benchmarks to assess the effectiveness of such methods. In this paper, we find that existing benchmarks provide an overly optimistic and potentially misleading view on the effectiveness of candidate unlearning methods. By introducing simple, benign modifications to a number of popular benchmarks, we expose instances where supposedly unlearned information remains accessible, or where the unlearning process has degraded the model's performance on retained information to a much greater extent than indicated by the original benchmark. We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information. Further, we show that ambiguity in unlearning targets in existing benchmarks can easily lead to the design of methods that overfit to the given test queries. Based on our findings, we urge the community to be cautious when interpreting benchmark results as reliable measures of progress, and we provide several recommendations to guide future LLM unlearning research.

Title: Computational Modeling of Artistic Inspiration: A Framework for Predicting Aesthetic Preferences in Lyrical Lines Using Linguistic and Stylistic Features

Authors: Gaurav Sahu, Olga Vechtomova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02881
Pdf URL: https://arxiv.org/pdf/2410.02881
Copy Paste: [[2410.02881]] Computational Modeling of Artistic Inspiration: A Framework for Predicting Aesthetic Preferences in Lyrical Lines Using Linguistic and Stylistic Features(https://arxiv.org/abs/2410.02881)
Keywords: language model
Abstract: Artistic inspiration remains one of the least understood aspects of the creative process. It plays a crucial role in producing works that resonate deeply with audiences, but the complexity and unpredictability of aesthetic stimuli that evoke inspiration have eluded systematic study. This work proposes a novel framework for computationally modeling artistic preferences in different individuals through key linguistic and stylistic properties, with a focus on lyrical content. In addition to the framework, we introduce EvocativeLines, a dataset of annotated lyric lines, categorized as either "inspiring" or "not inspiring," to facilitate the evaluation of our framework across diverse preference profiles. Our computational model leverages the proposed linguistic and poetic features and applies a calibration network on top of it to accurately forecast artistic preferences among different creative individuals. Our experiments demonstrate that our framework outperforms an out-of-the-box LLaMA-3-70b, a state-of-the-art open-source language model, by nearly 18 points. Overall, this work contributes an interpretable and flexible framework that can be adapted to analyze any type of artistic preferences that are inherently subjective across a wide spectrum of skill levels.

Title: FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs

Authors: Deema Alnuhait, Neeraja Kirtane, Muhammad Khalifa, Hao Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02899
Pdf URL: https://arxiv.org/pdf/2410.02899
Copy Paste: [[2410.02899]] FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs(https://arxiv.org/abs/2410.02899)
Keywords: language model, hallucination
Abstract: Language models (LMs) hallucinate. We inquire: Can we detect and mitigate hallucinations before they happen? This work answers this research question in the positive, by showing that the internal representations of LMs provide rich signals that can be used for this purpose. We introduce FactCheckMate, which preemptively detects hallucinations by learning a classifier that predicts whether the LM will hallucinate, based on the model's hidden states produced over the inputs, before decoding begins. If a hallucination is detected, FactCheckMate then intervenes, by adjusting the LM's hidden states such that the model will produce more factual outputs. FactCheckMate provides fresh insights that the inner workings of LMs can be revealed by their hidden states. Practically, both the detection and mitigation models in FactCheckMate are lightweight, adding little inference overhead; FactCheckMate proves a more efficient approach for mitigating hallucinations compared to many post-hoc alternatives. We evaluate FactCheckMate over LMs of different scales and model families (including Llama, Mistral, and Gemma), across a variety of QA datasets from different domains. Our results demonstrate the effectiveness of leveraging internal representations for early hallucination detection and mitigation, achieving over 70% preemptive detection accuracy. On average, outputs generated by LMs with intervention are 34.4% more factual compared to those without intervention. The average overhead difference in the inference time introduced by FactCheckMate is around 3.16 seconds.

Title: Better Instruction-Following Through Minimum Bayes Risk

Authors: Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Pakazad, Graham Neubig
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02902
Pdf URL: https://arxiv.org/pdf/2410.02902
Copy Paste: [[2410.02902]] Better Instruction-Following Through Minimum Bayes Risk(https://arxiv.org/abs/2410.02902)
Keywords: llm
Abstract: General-purpose LLM judges capable of human-level evaluation provide not only a scalable and accurate way of evaluating instruction-following LLMs but also new avenues for supervising and improving their performance. One promising way of leveraging LLM judges for supervision is through Minimum Bayes Risk (MBR) decoding, which uses a reference-based evaluator to select a high-quality output from amongst a set of candidate outputs. In the first part of this work, we explore using MBR decoding as a method for improving the test-time performance of instruction-following LLMs. We find that MBR decoding with reference-based LLM judges substantially improves over greedy decoding, best-of-N decoding with reference-free judges and MBR decoding with lexical and embedding-based metrics on AlpacaEval and MT-Bench. These gains are consistent across LLMs with up to 70B parameters, demonstrating that smaller LLM judges can be used to supervise much larger LLMs. Then, seeking to retain the improvements from MBR decoding while mitigating additional test-time costs, we explore iterative self-training on MBR-decoded outputs. We find that self-training using Direct Preference Optimisation leads to significant performance gains, such that the self-trained models with greedy decoding generally match and sometimes exceed the performance of their base models with MBR decoding.

Title: NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator

Authors: Shikhar Murty, Dzmitry Bahdanau, Christopher D. Manning
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02907
Pdf URL: https://arxiv.org/pdf/2410.02907
Copy Paste: [[2410.02907]] NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator(https://arxiv.org/abs/2410.02907)
Keywords: language model, agent
Abstract: We introduce NNetscape Navigator (NNetnav), a method for training web agents entirely through synthetic demonstrations. These demonstrations are collected by first interacting with a browser to generate trajectory rollouts, which are then retroactively labeled into instructions using a language model. Most work on training browser agents has relied on expensive human supervision, and the limited previous work on such interaction-first synthetic data techniques has failed to provide effective search through the exponential space of exploration. In contrast, NNetnav exploits the hierarchical structure of language instructions to make this search more tractable: complex instructions are typically decomposable into simpler subtasks, allowing NNetnav to automatically prune interaction episodes when an intermediate trajectory cannot be annotated with a meaningful sub-task. We use NNetnav demonstrations from a language model for supervised fine-tuning of a smaller language model policy, and find improvements of 6 points on WebArena and over 20 points on MiniWoB++, two popular environments for web-agents. Notably, on WebArena, we observe that language model policies can be further enhanced when fine-tuned with NNetnav demonstrations derived from the same language model. Finally, we collect and release a dataset of over 6k NNetnav demonstrations on WebArena, spanning a diverse and complex set of instructions.

Title: Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification

Authors: Sudipta Singha Roy, Xindi Wang, Robert E. Mercer, Frank Rudzicz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02930
Pdf URL: https://arxiv.org/pdf/2410.02930
Copy Paste: [[2410.02930]] Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification(https://arxiv.org/abs/2410.02930)
Keywords: long context
Abstract: Long document classification presents challenges in capturing both local and global dependencies due to the documents' extensive content and complex structure. Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents. To address these constraints, we propose a novel model leveraging a graph-tree structure. Our approach integrates syntax trees for sentence encodings and document graphs for document encodings, which capture fine-grained syntactic relationships and broader document contexts, respectively. We use Tree Transformers to generate sentence encodings, while a graph attention network models inter- and intra-sentence dependencies. During training, we implement bidirectional information propagation from word-to-sentence-to-document and vice versa, which enriches the contextual representation. Our proposed method enables a comprehensive understanding of content at all hierarchical levels and effectively handles arbitrarily long contexts without token limit constraints. Experimental results demonstrate the effectiveness of our approach in all types of long document classification tasks.

Title: Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications

Authors: Oren Sultan, Alex Khasin, Guy Shiran, Asnat Greenstein-Messica, Dafna Shahaf
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.02952
Pdf URL: https://arxiv.org/pdf/2410.02952
Copy Paste: [[2410.02952]] Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications(https://arxiv.org/abs/2410.02952)
Keywords: gpt, llm
Abstract: We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks; specifically, we modify images and videos by interpreting user stylistic requests, specified in natural language ("golden hour"), using an LLM to select the appropriate tools and their parameters to achieve the desired visual effect. We found that proprietary LLMs such as GPT-3.5-Turbo show potential in this task, but their high cost and latency make them unsuitable for real-time applications. In our approach, we fine-tune a (smaller) student LLM with guidance from a (larger) teacher LLM and behavioral signals. We introduce offline metrics to evaluate student LLMs. Both online and offline experiments show that our student models manage to match the performance of our teacher model (GPT-3.5-Turbo), significantly reducing costs and latency. Lastly, we show that fine-tuning was improved by 25% in low-data regimes using augmentation.

Title: Unlocking Structured Thinking in Language Models with Cognitive prompting

Authors: Oliver Kramer, Jill Baumann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02953
Pdf URL: https://arxiv.org/pdf/2410.02953
Copy Paste: [[2410.02953]] Unlocking Structured Thinking in Language Models with Cognitive prompting(https://arxiv.org/abs/2410.02953)
Keywords: language model, llm, prompt
Abstract: We propose cognitive prompting as a novel approach to guide problem-solving in large language models (LLMs) through structured, human-like cognitive operations such as goal clarification, decomposition, filtering, abstraction, and pattern recognition. By employing systematic, step-by-step reasoning, cognitive prompting enables LLMs to efficiently tackle complex, multi-step tasks. We evaluate the effectiveness of cognitive prompting on Meta's LLaMA models, comparing performance on arithmetic reasoning tasks using the GSM8K dataset and on commonsense reasoning benchmarks. Our analysis includes comparisons between models without cognitive prompting, models with a static sequence of cognitive operations, and models using reflective cognitive prompting, where the LLM dynamically self-selects the sequence of cognitive operations. The results show that cognitive prompting, particularly when dynamically adapted, significantly improves the performance of larger models, such as LLaMA3.1 70B, and enhances their ability to handle multi-step reasoning tasks. This approach also improves interpretability and flexibility, highlighting cognitive prompting as a promising strategy for general-purpose AI reasoning.

Title: Coal Mining Question Answering with LLMs

Authors: Antonio Carlos Rivera, Anthony Moore, Steven Robinson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.02959
Pdf URL: https://arxiv.org/pdf/2410.02959
Copy Paste: [[2410.02959]] Coal Mining Question Answering with LLMs(https://arxiv.org/abs/2410.02959)
Keywords: language model, gpt, llm, prompt, chat, chain-of-thought
Abstract: In this paper, we present a novel approach to coal mining question answering (QA) using large language models (LLMs) combined with tailored prompt engineering techniques. Coal mining is a complex, high-risk industry where accurate, context-aware information is critical for safe and efficient operations. Current QA systems struggle to handle the technical and dynamic nature of mining-related queries. To address these challenges, we propose a multi-turn prompt engineering framework designed to guide LLMs, such as GPT-4, in answering coal mining questions with higher precision and relevance. By breaking down complex queries into structured components, our approach allows LLMs to process nuanced technical information more effectively. We manually curated a dataset of 500 questions from real-world mining scenarios and evaluated the system's performance using both accuracy (ACC) and GPT-4-based scoring metrics. Experiments comparing ChatGPT, Claude2, and GPT-4 across baseline, chain-of-thought (CoT), and multi-turn prompting methods demonstrate that our method significantly improves both accuracy and contextual relevance, with an average accuracy improvement of 15-18% and a notable increase in GPT-4 scores. The results show that our prompt-engineering approach provides a robust, adaptable solution for domain-specific question answering in high-stakes environments like coal mining.

Title: Can Transformers Learn $n$-gram Language Models?

Authors: Anej Svete, Nadav Borenstein, Mike Zhou, Isabelle Augenstein, Ryan Cotterell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03001
Pdf URL: https://arxiv.org/pdf/2410.03001
Copy Paste: [[2410.03001]] Can Transformers Learn $n$-gram Language Models?(https://arxiv.org/abs/2410.03001)
Keywords: language model
Abstract: Much theoretical work has described the ability of transformers to represent formal languages. However, linking theoretical results to empirical performance is not straightforward due to the complex interplay between the architecture, the learning algorithm, and training data. To test whether theoretical lower bounds imply learnability of formal languages, we turn to recent work relating transformers to $n$-gram language models (LMs). We study transformers' ability to learn random $n$-gram LMs of two kinds: ones with arbitrary next-symbol probabilities and ones where those are defined with shared parameters. We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers on the former, while transformers perform better on the latter, outperforming methods specifically designed to learn $n$-gram LMs.

Title: Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise

Authors: Rose E. Wang, Ana T. Ribeiro, Carly D. Robinson, Susanna Loeb, Dora Demszky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03017
Pdf URL: https://arxiv.org/pdf/2410.03017
Copy Paste: [[2410.03017]] Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise(https://arxiv.org/abs/2410.03017)
Keywords: language model
Abstract: Generative AI, particularly Language Models (LMs), has the potential to transform real-world domains with societal impact, particularly where access to experts is limited. For example, in education, training novice educators with expert guidance is important for effectiveness but expensive, creating significant barriers to improving education quality at scale. This challenge disproportionately harms students from under-served communities, who stand to gain the most from high-quality education. We introduce Tutor CoPilot, a novel Human-AI approach that leverages a model of expert thinking to provide expert-like guidance to tutors as they tutor. This study is the first randomized controlled trial of a Human-AI system in live tutoring, involving 900 tutors and 1,800 K-12 students from historically under-served communities. Following a preregistered analysis plan, we find that students working with tutors that have access to Tutor CoPilot are 4 percentage points (p.p.) more likely to master topics (p<0.01). Notably, students of lower-rated tutors experienced the greatest benefit, improving mastery by 9 p.p. We find that Tutor CoPilot costs only $20 per-tutor annually. We analyze 550,000+ messages using classifiers to identify pedagogical strategies, and find that tutors with access to Tutor CoPilot are more likely to use high-quality strategies to foster student understanding (e.g., asking guiding questions) and less likely to give away the answer to the student. Tutor interviews highlight how Tutor CoPilot's guidance helps tutors to respond to student needs, though they flag issues in Tutor CoPilot, such as generating suggestions that are not grade-level appropriate. Altogether, our study of Tutor CoPilot demonstrates how Human-AI systems can scale expertise in real-world domains, bridge gaps in skills and create a future where high-quality education is accessible to all students.

Title: Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review

Authors: Sungduk Yu, Man Luo, Avinash Madasu, Vasudev Lal, Phillip Howard
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03019
Pdf URL: https://arxiv.org/pdf/2410.03019
Copy Paste: [[2410.03019]] Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review(https://arxiv.org/abs/2410.03019)
Keywords: language model, gpt, llm
Abstract: Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts which are submitted for publication. With the recent rapid advancements in the linguistic capabilities of large language models (LLMs), a new potential risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time-consuming process of reviewing a paper. In this study, we investigate the ability of existing AI text detection algorithms to distinguish between peer reviews written by humans and different state-of-the-art LLMs. Our analysis shows that existing approaches fail to identify many GPT-4o-written reviews without also producing a high number of false positive classifications. To address this deficiency, we propose a new detection approach which surpasses existing methods in the identification of GPT-4o-written peer reviews at low levels of false positive classifications. Our work reveals the difficulty of accurately identifying AI-generated text at the individual review level, highlighting the urgent need for new tools and methods to detect this type of unethical application of generative AI.

Title: Characterizing Context Influence and Hallucination in Summarization

Authors: James Flemings, Wanrong Zhang, Bo Jiang, Zafar Takhirov, Murali Annavaram
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03026
Pdf URL: https://arxiv.org/pdf/2410.03026
Copy Paste: [[2410.03026]] Characterizing Context Influence and Hallucination in Summarization(https://arxiv.org/abs/2410.03026)
Keywords: language model, llm, hallucination
Abstract: Although Large Language Models (LLMs) have achieved remarkable performance in numerous downstream tasks, their ubiquity has raised two significant concerns. One is that LLMs can hallucinate by generating content that contradicts relevant contextual information; the other is that LLMs can inadvertently leak private information due to input regurgitation. Many prior works have extensively studied each concern independently, but none have investigated them simultaneously. Furthermore, auditing the influence of provided context during open-ended generation with a privacy emphasis is understudied. To this end, we comprehensively characterize the influence and hallucination of contextual information during summarization. We introduce a definition for context influence and Context-Influence Decoding (CID), and then we show that amplifying the context (by factoring out prior knowledge) and the context being out of distribution with respect to prior knowledge increases the context's influence on an LLM. Moreover, we show that context influence gives a lower bound of the private information leakage of CID. We corroborate our analytical findings with experimental evaluations that show improving the F1 ROUGE-L score on CNN-DM for LLaMA 3 by 10% over regular decoding also leads to 1.5x more influence by the context. Moreover, we empirically evaluate how context influence and hallucination are affected by (1) model capacity, (2) context size, (3) the length of the current response, and (4) different token $n$-grams of the context. Our code can be accessed here: this https URL.

Title: Geometry is All You Need: A Unified Taxonomy of Matrix and Tensor Factorization for Compression of Generative Language Models

Authors: Mingxue Xu, Sadia Sharmin, Danilo P. Mandic
Subjects: cs.CL, cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2410.03040
Pdf URL: https://arxiv.org/pdf/2410.03040
Copy Paste: [[2410.03040]] Geometry is All You Need: A Unified Taxonomy of Matrix and Tensor Factorization for Compression of Generative Language Models(https://arxiv.org/abs/2410.03040)
Keywords: language model
Abstract: Matrix and tensor-guided parametrization for Natural Language Processing (NLP) models is fundamentally useful for the improvement of the model's systematic efficiency. However, the internal links between these two algebra structures and language model parametrization are poorly understood. Also, the existing matrix and tensor research is math-heavy and far away from machine learning (ML) and NLP research concepts. These two issues result in the recent progress on matrices and tensors for model parametrization being more like a loose collection of separate components from matrix/tensor and NLP studies, rather than a well-structured unified approach, further hindering algorithm design. To this end, we propose a unified taxonomy, which bridges the matrix/tensor compression approaches and model compression concepts in ML and NLP research. Namely, we adopt an elementary concept in linear algebra, that of a subspace, which is also the core concept in geometric algebra, to reformulate the matrix/tensor and ML/NLP concepts (e.g. attention mechanism) under one umbrella. In this way, based on our subspace formalization, typical matrix and tensor decomposition algorithms can be interpreted as geometric transformations. Finally, we revisit recent literature on matrix- or tensor-guided language model compression, rephrase and compare their core ideas, and then point out the current research gap and potential solutions.
摘要：矩阵和张量引导的自然语言处理 (NLP) 模型参数化对于提高模型的系统效率具有根本性作用。然而，人们对这两种代数结构与语言模型参数化之间的内在联系了解甚少。此外，现有的矩阵和张量研究数学性强，远离机器学习 (ML) 和 NLP 研究概念。这两个问题导致最近在模型参数化方面取得的矩阵和张量的进展更像是来自矩阵/张量和 NLP 研究的独立组件的松散集合，而不是结构良好的统一方法，这进一步阻碍了算法设计。为此，我们提出了一个统一的分类法，它将 ML 和 NLP 研究中的矩阵/张量压缩方法和模型压缩概念联系起来。也就是说，我们采用线性代数中的一个基本概念，即子空间，这也是几何代数中的核心概念，将矩阵/张量和 ML/NLP 概念（例如注意力机制）重新表述为一个整体。这样，基于我们的子空间形式化，典型的矩阵和张量分解算法可以解释为几何变换。最后，我们重新审视最近关于矩阵或张量引导的语言模型压缩的文献，重新表述和比较它们的核心思想，然后指出当前研究的差距和潜在的解决方案。

Title: Scalable Frame-based Construction of Sociocultural NormBases for Socially-Aware Dialogues

Authors: Shilin Qu, Weiqing Wang, Xin Zhou, Haolan Zhan, Zhuang Li, Lizhen Qu, Linhao Luo, Yuan-Fang Li, Gholamreza Haffari
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03049
Pdf URL: https://arxiv.org/pdf/2410.03049
Copy Paste: [[2410.03049]] Scalable Frame-based Construction of Sociocultural NormBases for Socially-Aware Dialogues(https://arxiv.org/abs/2410.03049)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Sociocultural norms serve as guiding principles for personal conduct in social interactions, emphasizing respect, cooperation, and appropriate behavior, which can benefit tasks including conversational information retrieval, contextual information retrieval and retrieval-enhanced machine learning. We propose a scalable approach for constructing a Sociocultural Norm (SCN) Base using Large Language Models (LLMs) for socially aware dialogues. We construct a comprehensive and publicly accessible Chinese Sociocultural NormBase. Our approach utilizes socially aware dialogues, enriched with contextual frames, as the primary data source to constrain the generating process and reduce hallucinations. This enables the extraction of high-quality and nuanced natural-language norm statements, leveraging the pragmatic implications of utterances with respect to the situation. As real dialogues annotated with gold frames are not readily available, we propose using synthetic data. Our empirical results show: (i) the quality of the SCNs derived from synthetic data is comparable to that from real dialogues annotated with gold frames, and (ii) the quality of the SCNs extracted from real data, annotated with either silver (predicted) or gold frames, surpasses that without the frame annotations. We further show the effectiveness of the extracted SCNs in a RAG-based (Retrieval-Augmented Generation) model to reason about multiple downstream dialogue tasks.
摘要：社会文化规范是社会交往中个人行为的指导原则，强调尊重、合作和适当行为，这对对话信息检索、上下文信息检索和检索增强机器学习等任务有益。我们提出了一种可扩展的方法，使用大型语言模型 (LLM) 为社会意识对话构建社会文化规范 (SCN) 库。我们构建了一个全面且可公开访问的中国社会文化规范库。我们的方法利用富含上下文框架的社会意识对话作为主要数据源来约束生成过程并减少幻觉。这使得能够提取高质量、细致入微的自然语言规范陈述，利用与情境相关的话语的语用含义。由于带有金标（gold）框架注释的真实对话并不容易获得，我们建议使用合成数据。我们的实证结果表明：(i) 从合成数据中得出的 SCN 质量与从带有金标框架注释的真实对话中得出的 SCN 质量相当，(ii) 从带有银标（预测）或金标框架注释的真实数据中提取的 SCN 质量优于没有框架注释的 SCN。我们进一步展示了提取的 SCN 在基于 RAG（检索增强生成）的模型中对多个下游对话任务进行推理的有效性。

Title: Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs

Authors: Pritom Saha Akash, Kevin Chen-Chuan Chang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2410.03071
Pdf URL: https://arxiv.org/pdf/2410.03071
Copy Paste: [[2410.03071]] Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs(https://arxiv.org/abs/2410.03071)
Keywords: language model, llm
Abstract: Topic modeling is a powerful technique for uncovering hidden themes within a collection of documents. However, the effectiveness of traditional topic models often relies on sufficient word co-occurrence, which is lacking in short texts. Therefore, existing approaches, whether probabilistic or neural, frequently struggle to extract meaningful patterns from such data, resulting in incoherent topics. To address this challenge, we propose a novel approach that leverages large language models (LLMs) to extend short texts into more detailed sequences before applying topic modeling. To further improve the efficiency and solve the problem of semantic inconsistency from LLM-generated texts, we propose to use prefix tuning to train a smaller language model coupled with a variational autoencoder for short-text topic modeling. Our method significantly improves short-text topic modeling performance, as demonstrated by extensive experiments on real-world datasets with extreme data sparsity, outperforming current state-of-the-art topic models.
摘要：主题建模是一种强大的技术，可用于发现文档集合中的隐藏主题。然而，传统主题模型的有效性通常依赖于足够的词共现，而短文本缺乏这种共现。因此，现有的方法（无论是概率方法还是神经方法）经常难以从此类数据中提取有意义的模式，从而导致主题不连贯。为了应对这一挑战，我们提出了一种新方法，利用大型语言模型 (LLM) 将短文本扩展为更详细的序列，然后再应用主题建模。为了进一步提高效率并解决 LLM 生成的文本中的语义不一致问题，我们建议使用前缀调整来训练较小的语言模型，并结合变分自动编码器进行短文本主题建模。我们的方法显著提高了短文本主题建模性能，这已在具有极端数据稀疏性的真实数据集上进行了大量实验，结果证明其性能优于当前最先进的主题模型。

Title: Multilingual Topic Classification in X: Dataset and Analysis

Authors: Dimosthenis Antypas, Asahi Ushio, Francesco Barbieri, Jose Camacho-Collados
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03075
Pdf URL: https://arxiv.org/pdf/2410.03075
Copy Paste: [[2410.03075]] Multilingual Topic Classification in X: Dataset and Analysis(https://arxiv.org/abs/2410.03075)
Keywords: language model
Abstract: In the dynamic realm of social media, diverse topics are discussed daily, transcending linguistic boundaries. However, the complexities of understanding and categorising this content across various languages remain an important challenge with traditional techniques like topic modelling often struggling to accommodate this multilingual diversity. In this paper, we introduce X-Topic, a multilingual dataset featuring content in four distinct languages (English, Spanish, Japanese, and Greek), crafted for the purpose of tweet topic classification. Our dataset includes a wide range of topics, tailored for social media content, making it a valuable resource for scientists and professionals working on cross-linguistic analysis, the development of robust multilingual models, and computational scientists studying online dialogue. Finally, we leverage X-Topic to perform a comprehensive cross-linguistic and multilingual analysis, and compare the capabilities of current general- and domain-specific language models.
摘要：在动态的社交媒体领域，人们每天都会讨论各种话题，超越语言界限。然而，理解和分类跨语言内容的复杂性仍然是一项重要挑战，而主题建模等传统技术往往难以适应这种多语言多样性。在本文中，我们介绍了 X-Topic，这是一个多语言数据集，包含四种不同语言（英语、西班牙语、日语和希腊语）的内容，专为推文主题分类而设计。我们的数据集包括广泛的主题，专为社交媒体内容量身定制，使其成为从事跨语言分析、开发强大的多语言模型以及研究在线对话的计算科学家的宝贵资源。最后，我们利用 X-Topic 进行全面的跨语言和多语言分析，并比较当前通用和领域特定语言模型的功能。

Title: CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions

Authors: Jun Rao, Xuebo Liu, Lian Lian, Shengjun Cheng, Yunjie Liao, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03077
Pdf URL: https://arxiv.org/pdf/2410.03077
Copy Paste: [[2410.03077]] CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions(https://arxiv.org/abs/2410.03077)
Keywords: language model, llm
Abstract: With instruction tuning, Large Language Models (LLMs) can enhance their ability to adhere to commands. Diverging from most works focusing on data mixing, our study concentrates on enhancing the model's capabilities from the perspective of data sampling during training. Drawing inspiration from the human learning process, where it is generally easier to master solutions to similar topics through focused practice on a single type of topic, we introduce a novel instruction tuning strategy termed CommonIT: Commonality-aware Instruction Tuning. Specifically, we cluster instruction datasets into distinct groups with three proposed metrics (Task, Embedding and Length). We ensure each training mini-batch, or "partition", consists solely of data from a single group, which brings about both data randomness across mini-batches and intra-batch data similarity. Rigorous testing on LLaMa models demonstrates CommonIT's effectiveness in enhancing the instruction-following capabilities of LLMs through IT datasets (FLAN, CoT, and Alpaca) and models (LLaMa2-7B, Qwen2-7B, LLaMa 13B, and BLOOM 7B). CommonIT consistently yields an average improvement of 2.1% on the general domain (i.e., the average score of Knowledge, Reasoning, Multilinguality and Coding) with the Length metric, 5.2% on the special domain (i.e., GSM, Openfunctions and Code) with the Task metric, and 3.8% on the specific tasks (i.e., MMLU) with the Embedding metric. Code is available at this https URL.
摘要：通过指令调优，大型语言模型 (LLM) 可以增强其遵守命令的能力。与大多数专注于数据混合的研究不同，我们的研究集中于从训练期间的数据采样角度增强模型的能力。从人类学习过程中汲取灵感，通过专注于单一类型的主题，通常更容易掌握类似主题的解决方案，我们引入了一种新颖的指令调优策略，称为 CommonIT：通用性感知指令调优。具体来说，我们根据三个拟议指标（任务、嵌入和长度）将指令数据集聚类为不同的组。我们确保每个训练小批量或“分区”仅由来自单个组的数据组成，这既带来了小批量之间的数据随机性，也带来了批次内数据的相似性。对 LLaMa 模型的严格测试表明，CommonIT 能够通过 IT 数据集（FLAN、CoT 和 Alpaca）和模型（LLaMa2-7B、Qwen2-7B、LLaMa 13B 和 BLOOM 7B）有效增强 LLM 的指令遵循能力。CommonIT 在长度指标上持续将一般领域（即知识、推理、多语言和编码的平均分数）的平均得分提高 2.1%，在任务指标上将特殊领域（即 GSM、Openfunctions 和代码）的平均得分提高 5.2%，在嵌入指标上将特定任务（即 MMLU）的平均得分提高 3.8%。代码可在此 https URL 上找到。
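
The batching constraint described in the CommonIT abstract (each mini-batch drawn from a single group, with randomness across batches) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the length thresholds in `length_bucket`, and the `"text"` field are hypothetical, and a crude length bucket stands in for the paper's Task, Embedding and Length metrics.

```python
import random
from collections import defaultdict


def length_bucket(example):
    """Hypothetical grouping rule: bucket an example by text length."""
    n = len(example["text"])
    if n < 10:
        return "short"
    if n < 100:
        return "medium"
    return "long"


def group_pure_batches(examples, group_of, batch_size, seed=0):
    """Build mini-batches that each contain examples from one group only.

    Examples are shuffled inside each group, and the resulting batches
    are shuffled again so that groups are interleaved during training.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[group_of(ex)].append(ex)

    batches = []
    for members in buckets.values():
        rng.shuffle(members)  # randomness inside a group
        for i in range(0, len(members), batch_size):
            batches.append(members[i:i + batch_size])

    rng.shuffle(batches)  # randomness across mini-batches
    return batches
```

With this layout the last batch of a group may be smaller than `batch_size`; whether to drop or pad such remainders is a design choice the abstract does not specify.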

Title: Scaling Parameter-Constrained Language Models with Quality Data

Authors: Ernie Chang, Matteo Paltenghi, Yang Li, Pin-Jie Lin, Changsheng Zhao, Patrick Huber, Zechun Liu, Rastislav Rabatin, Yangyang Shi, Vikas Chandra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03083
Pdf URL: https://arxiv.org/pdf/2410.03083
Copy Paste: [[2410.03083]] Scaling Parameter-Constrained Language Models with Quality Data(https://arxiv.org/abs/2410.03083)
Keywords: language model
Abstract: Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization. In this paper, we extend the conventional understanding of scaling law by offering a microscopic view of data quality within the original formulation -- effective training tokens -- which we posit to be a critical determinant of performance for parameter-constrained language models. Specifically, we formulate the proposed term of effective training tokens to be a combination of two readily-computed indicators of text: (i) text diversity and (ii) syntheticity as measured by a teacher model. We pretrained over 200 models of 25M to 1.5B parameters on a diverse set of sampled, synthetic data, and estimated the constants that relate text quality, model size, training tokens, and eight reasoning task accuracy scores. We demonstrated the estimated constants yield +0.83 Pearson correlation with true accuracies, and analyzed it in scenarios involving widely-used data techniques such as data sampling and synthesis which aim to improve data quality.
摘要：语言建模中的缩放定律传统上将训练损失量化为数据集大小和模型参数的函数，提供计算最优估计，但往往忽略了数据质量对模型泛化的影响。在本文中，我们扩展了对缩放定律的传统理解，在原始公式中提供数据质量的微观视图——有效训练标记——我们认为这是参数约束语言模型性能的关键决定因素。具体而言，我们将有效训练标记的拟议术语表述为两个易于计算的文本指标的组合：(i) 文本多样性和 (ii) 由教师模型衡量的合成性。我们在一组多样化的采样合成数据上对超过 200 个包含 2500 万到 15 亿个参数的模型进行了预训练，并估算了与文本质量、模型大小、训练标记和八个推理任务准确度分数相关的常数。我们证明了估计的常数产生了与真实精度+0.83的皮尔逊相关性，并在涉及广泛使用的数据技术（例如旨在提高数据质量的数据采样和合成）的场景中对其进行了分析。

Title: UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

Authors: Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03090
Pdf URL: https://arxiv.org/pdf/2410.03090
Copy Paste: [[2410.03090]] UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference(https://arxiv.org/abs/2410.03090)
Keywords: language model, llm
Abstract: Deploying large language models (LLMs) is challenging due to their high memory and computational demands, especially during long-context inference. While key-value (KV) caching accelerates inference by reusing previously computed keys and values, it also introduces significant memory overhead. Existing KV cache compression methods such as eviction and merging typically compress the KV cache after it is generated and overlook the eviction of hidden states, failing to improve the speed of the prefilling stage. Additionally, applying a uniform compression rate across different attention heads can harm crucial retrieval heads in needle-in-a-haystack tasks due to excessive compression. In this paper, we propose UNComp, an uncertainty-aware compression scheme that leverages matrix entropy to estimate model uncertainty across layers and heads at the token sequence level. By grouping layers and heads based on their uncertainty, UNComp adaptively compresses both the hidden states and the KV cache. Our method achieves a 1.6x speedup in the prefilling stage and reduces the KV cache to 4.74% of its original size, resulting in a 6.4x increase in throughput and a 1.4x speedup in inference with only a 1.41% performance loss. Remarkably, in needle-in-a-haystack tasks, UNComp outperforms the full-size KV cache even when compressed to 9.38% of its original size. Our approach offers an efficient, training-free Grouped-Query Attention paradigm that can be seamlessly integrated into existing KV cache schemes.
摘要：部署大型语言模型 (LLM) 具有挑战性，因为它们对内存和计算的需求很高，尤其是在长上下文推理期间。虽然键值 (KV) 缓存通过重用以前计算的键和值来加速推理，但它也带来了显著的内存开销。现有的 KV 缓存压缩方法（例如驱逐和合并）通常在生成 KV 缓存后对其进行压缩，并忽略隐藏状态的驱逐，从而无法提高预填充阶段的速度。此外，在不同的注意力头上应用统一的压缩率可能会因过度压缩而损害大海捞针任务中的关键检索头。在本文中，我们提出了 UNComp，这是一种不确定性感知压缩方案，它利用矩阵熵在 token 序列级别估计跨层和头的模型不确定性。通过根据不确定性对层和头进行分组，UNComp 可以自适应地压缩隐藏状态和 KV 缓存。我们的方法在预填充阶段实现了 1.6 倍的加速，并将 KV 缓存减少到其原始大小的 4.74%，从而使吞吐量提高了 6.4 倍，推理速度提高了 1.4 倍，而性能损失仅为 1.41%。值得注意的是，在大海捞针任务中，即使压缩到其原始大小的 9.38%，UNComp 的性能也优于全尺寸 KV 缓存。我们的方法提供了一种高效、无需训练的分组查询注意范式，可以无缝集成到现有的 KV 缓存方案中。

Title: CoCoHD: Congress Committee Hearing Dataset

Authors: Arnav Hiray, Yunsong Liu, Mingxiao Song, Agam Shah, Sudheer Chava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03099
Pdf URL: https://arxiv.org/pdf/2410.03099
Copy Paste: [[2410.03099]] CoCoHD: Congress Committee Hearing Dataset(https://arxiv.org/abs/2410.03099)
Keywords: language model
Abstract: U.S. congressional hearings significantly influence the national economy and social fabric, impacting individual lives. Despite their importance, there is a lack of comprehensive datasets for analyzing these discourses. To address this, we propose the Congress Committee Hearing Dataset (CoCoHD), covering hearings from 1997 to 2024 across 86 committees, with 32,697 records. This dataset enables researchers to study policy language on critical issues like healthcare, LGBTQ+ rights, and climate justice. We demonstrate its potential with a case study on 1,000 energy-related sentences, analyzing the Energy and Commerce Committee's stance on fossil fuel consumption. By fine-tuning pre-trained language models, we create energy-relevant measures for each hearing. Our market analysis shows that natural language analysis using CoCoHD can predict and highlight trends in the energy sector.
摘要：美国国会听证会对国家经济和社会结构产生重大影响，影响个人生活。尽管听证会非常重要，但缺乏用于分析这些话语的综合数据集。为了解决这个问题，我们提出了国会委员会听证会数据集 (CoCoHD)，涵盖 1997 年至 2024 年 86 个委员会的听证会，共有 32,697 条记录。该数据集使研究人员能够研究医疗保健、LGBTQ+ 权利和气候正义等关键问题的政策语言。我们通过对 1,000 个与能源相关的句子进行案例研究来展示其潜力，分析能源和商业委员会对化石燃料消费的立场。通过微调预训练的语言模型，我们为每次听证会创建与能源相关的指标。我们的市场分析表明，使用 CoCoHD 进行自然语言分析可以预测和突出能源行业的趋势。

Title: X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale

Authors: Haoran Xu, Kenton Murray, Philipp Koehn, Hieu Hoang, Akiko Eriguchi, Huda Khayrallah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03115
Pdf URL: https://arxiv.org/pdf/2410.03115
Copy Paste: [[2410.03115]] X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale(https://arxiv.org/abs/2410.03115)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable success across various NLP tasks, yet their focus has predominantly been on English due to English-centric pre-training and limited multilingual data. While some multilingual LLMs claim to support hundreds of languages, models often fail to provide high-quality responses for mid- and low-resource languages, leading to imbalanced performance heavily skewed in favor of high-resource languages like English and Chinese. In this paper, we prioritize quality over scaling the number of languages, with a focus on the multilingual machine translation task, and introduce X-ALMA, a model designed with a commitment to ensuring top-tier performance across 50 diverse languages, regardless of their resource levels. X-ALMA surpasses state-of-the-art open-source multilingual LLMs, such as Aya-101 and Aya-23, in every single translation direction on the FLORES and WMT'23 test datasets according to COMET-22. This is achieved by a plug-and-play language-specific module architecture to prevent language conflicts during training and a carefully designed training regimen with novel optimization methods to maximize the translation performance. At the final stage of the training regimen, our proposed Adaptive Rejection Preference Optimization (ARPO) surpasses existing preference optimization methods in translation tasks.
摘要：大型语言模型 (LLM) 在各种 NLP 任务中取得了显著成功，然而由于以英语为中心的预训练和有限的多语言数据，它们的重点主要放在英语上。虽然一些多语言 LLM 声称支持数百种语言，但模型往往无法为中低资源语言提供高质量的响应，导致性能不平衡，严重偏向英语和中文等高资源语言。在本文中，我们优先考虑质量而不是语言数量的扩展，重点关注多语言机器翻译任务，并介绍了 X-ALMA，该模型致力于确保在 50 种不同语言（无论其资源水平如何）上实现顶级性能。根据 COMET-22，X-ALMA 在 FLORES 和 WMT'23 测试数据集的每个翻译方向上都超越了最先进的开源多语言 LLM，例如 Aya-101 和 Aya-23。这是通过即插即用的语言特定模块架构来实现的，以防止训练期间发生语言冲突，并通过精心设计的训练方案和新颖的优化方法来最大限度地提高翻译性能。在训练方案的最后阶段，我们提出的自适应拒绝偏好优化 (ARPO) 超越了翻译任务中现有的偏好优化方法。

Title: RIPPLECOT: Amplifying Ripple Effect of Knowledge Editing in Language Models via Chain-of-Thought In-Context Learning

Authors: Zihao Zhao, Yuchen Yang, Yijiang Li, Yinzhi Cao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03122
Pdf URL: https://arxiv.org/pdf/2410.03122
Copy Paste: [[2410.03122]] RIPPLECOT: Amplifying Ripple Effect of Knowledge Editing in Language Models via Chain-of-Thought In-Context Learning(https://arxiv.org/abs/2410.03122)
Keywords: language model, llm, chain-of-thought
Abstract: The ripple effect poses a significant challenge in knowledge editing for large language models. Namely, when a single fact is edited, the model struggles to accurately update the related facts in a sequence, which is evaluated by multi-hop questions linked to a chain of related facts. Recent strategies have moved away from traditional parameter updates to more flexible, less computation-intensive methods, proven to be more effective in addressing the ripple effect. In-context learning (ICL) editing uses a simple demonstration `Imagine that + new fact` to guide LLMs, but struggles with complex multi-hop questions as the new fact alone fails to specify the chain of facts involved in such scenarios. Besides, memory-based editing maintains additional storage for all edits and related facts, requiring continuous updates to stay effective. As a result of these design limitations, the challenge remains, with the highest accuracy being only 33.8% on the MQuAKE-cf benchmarks for Vicuna-7B. To address this, we propose RippleCOT, a novel ICL editing approach integrating Chain-of-Thought (COT) reasoning. RippleCOT structures demonstrations as `newfact, question, thought, answer`, incorporating a thought component to identify and decompose the multi-hop logic within questions. This approach effectively guides the model through complex multi-hop questions with chains of related facts. Comprehensive experiments demonstrate that RippleCOT significantly outperforms the state-of-the-art on the ripple effect, achieving accuracy gains ranging from 7.8% to 87.1%.
摘要：涟漪效应对大型语言模型的知识编辑提出了重大挑战。也就是说，当编辑单个事实时，模型很难准确地更新序列中的相关事实，而序列是通过与相关事实链相关的多跳问题来评估的。最近的策略已经从传统的参数更新转向更灵活、计算密集度更低的方法，这些方法被证明在解决涟漪效应方面更有效。上下文学习 (ICL) 编辑使用简单的演示“想象一下 + 新事实”来指导 LLM，但在处理复杂的多跳问题时却举步维艰，因为新事实本身无法指定此类场景中涉及的事实链。此外，基于内存的编辑为所有编辑和相关事实保留了额外的存储空间，需要持续更新才能保持有效。由于这些设计限制，挑战仍然存在，在 Vicuna-7B 的 MQuAKE-cf 基准上，最高准确率仅为 33.8%。为了解决这个问题，我们提出了 RippleCOT，这是一种集成了思维链 (COT) 推理的新型 ICL 编辑方法。 RippleCOT 将演示结构化为“新事实、问题、想法、答案”，并加入一个想法组件来识别和分解问题中的多跳逻辑。这种方法有效地引导模型解决具有相关事实链的复杂多跳问题。综合实验表明，RippleCOT 在涟漪效应方面的表现明显优于最先进的技术，准确率提高了 7.8% 至 87.1%。

Title: On Unsupervised Prompt Learning for Classification with Black-box Language Models

Authors: Zhen-Yu Zhang, Jiandong Zhang, Huaxiu Yao, Gang Niu, Masashi Sugiyama
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03124
Pdf URL: https://arxiv.org/pdf/2410.03124
Copy Paste: [[2410.03124]] On Unsupervised Prompt Learning for Classification with Black-box Language Models(https://arxiv.org/abs/2410.03124)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved impressive success in text-formatted learning problems, and most popular LLMs have been deployed in a black-box fashion. Meanwhile, fine-tuning is usually necessary for a specific downstream task to obtain better performance, and this functionality is provided by the owners of the black-box LLMs. To fine-tune a black-box LLM, labeled data are always required to adjust the model parameters. However, in many real-world applications, LLMs can label textual datasets with even better quality than skilled human annotators, motivating us to explore the possibility of fine-tuning black-box LLMs with unlabeled data. In this paper, we propose unsupervised prompt learning for classification with black-box LLMs, where the learning parameters are the prompt itself and the pseudo labels of unlabeled data. Specifically, the prompt is modeled as a sequence of discrete tokens, and every token has its own to-be-learned categorical distribution. On the other hand, for learning the pseudo labels, we are the first to consider the in-context learning (ICL) capabilities of LLMs: we first identify reliable pseudo-labeled data using the LLM, and then assign pseudo labels to other unlabeled data based on the prompt, allowing the pseudo-labeled data to serve as in-context demonstrations alongside the prompt. Those in-context demonstrations matter: previously, they are involved when the prompt is used for prediction while they are not involved when the prompt is trained; thus, taking them into account during training makes the prompt-learning and prompt-using stages more consistent. Experiments on benchmark datasets show the effectiveness of our proposed algorithm. After unsupervised prompt learning, we can use the pseudo-labeled dataset for further fine-tuning by the owners of the black-box LLMs.
摘要：大型语言模型 (LLM) 在文本格式的学习问题中取得了令人瞩目的成功，大多数流行的 LLM 都是以黑盒方式部署的。同时，微调通常是特定下游任务获得更好性能所必需的，此功能由黑盒 LLM 的所有者提供。要微调黑盒 LLM，始终需要标记数据来调整模型参数。然而，在许多实际应用中，LLM 可以比熟练的人工注释者更好地标记文本数据集，这促使我们探索使用未标记数据微调黑盒 LLM 的可能性。在本文中，我们提出了使用黑盒 LLM 进行分类的无监督提示学习，其中学习参数是提示本身和未标记数据的伪标签。具体而言，提示被建模为一系列离散的标记，每个标记都有自己的待学习分类分布。另一方面，对于伪标签的学习，我们首先考虑了 LLM 的上下文学习 (ICL) 能力：我们首先使用 LLM 识别可靠的伪标签数据，然后根据提示为其他未标记数据分配伪标签，使伪标签数据与提示一起作为上下文演示。这些上下文演示很重要：以前，它们在使用提示进行预测时会涉及，而在训练提示时则不会涉及；因此，在训练期间考虑它们可以使提示学习和提示使用阶段更加一致。在基准数据集上的实验证明了我们提出的算法的有效性。在无监督的提示学习之后，我们可以使用伪标签数据集由黑盒 LLM 的所有者进行进一步的微调。

Title: Deliberate Reasoning for LLMs as Structure-aware Planning with Accurate World Model

Authors: Siheng Xiong, Ali Payani, Yuan Yang, Faramarz Fekri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03136
Pdf URL: https://arxiv.org/pdf/2410.03136
Copy Paste: [[2410.03136]] Deliberate Reasoning for LLMs as Structure-aware Planning with Accurate World Model(https://arxiv.org/abs/2410.03136)
Keywords: language model, llm, chain-of-thought
Abstract: Enhancing the reasoning capabilities of large language models (LLMs) remains a key challenge, especially for tasks that require complex, multi-step decision-making. Humans excel at these tasks by leveraging deliberate planning with an internal world model to simulate the potential outcomes of various actions. Inspired by this, we propose a novel multi-step reasoning framework for LLMs, referred to as Structure-aware Planning with Accurate World Model (SWAP). Unlike previous approaches that rely solely on Chain-of-Thought (CoT) reasoning in natural language, SWAP incorporates structural information to guide the reasoning process via a world model and provides a soft verification mechanism over the steps. Moreover, SWAP overcomes the challenge of accurate world state predictions in complex reasoning tasks by introducing a Generator-Discriminator architecture, which enables more reliable world modeling. Specifically, the generator predicts the next state, and the discriminator ensures alignment with the logical consistency required by the problem context. SWAP also encourages the policy model to explore a broad range of potential actions to prevent premature convergence. By resolving the bottlenecks of generation diversity for both actions and states using diversity-based modeling (DBM) and improving discrimination accuracy through contrastive ranking (CR), SWAP significantly enhances the reasoning performance of LLMs. We evaluate SWAP across diverse reasoning-intensive benchmarks including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP achieves substantial improvements over the baselines and consistently outperforms existing LLMs of similar sizes.
摘要：增强大型语言模型 (LLM) 的推理能力仍然是一项关键挑战，尤其是对于需要复杂、多步骤决策的任务。人类擅长通过利用内部世界模型进行深思熟虑的规划来模拟各种行动的潜在结果，从而完成这些任务。受此启发，我们提出了一种新颖的 LLM 多步骤推理框架，称为具有准确世界模型的结构感知规划 (SWAP)。与以前仅依赖自然语言中的思路链 (CoT) 推理的方法不同，SWAP 结合结构信息通过世界模型指导推理过程，并为每个步骤提供软验证机制。此外，SWAP 通过引入生成器-鉴别器架构克服了复杂推理任务中准确预测世界状态的挑战，从而实现了更可靠的世界建模。具体而言，生成器预测下一个状态，鉴别器确保与问题上下文所需的逻辑一致性保持一致。SWAP 还鼓励策略模型探索广泛的潜在行动，以防止过早收敛。通过使用基于多样性的建模 (DBM) 解决动作和状态的生成多样性瓶颈，并通过对比排名 (CR) 提高判别准确性，SWAP 显著提高了 LLM 的推理性能。我们在包括数学推理、逻辑推理和编码任务在内的各种推理密集型基准上对 SWAP 进行了评估。大量实验表明，SWAP 比基线有了显著的改进，并且始终优于现有类似规模的 LLM。

Title: SAG: Style-Aligned Article Generation via Model Collaboration

Authors: Chenning Xu, Fangxun Shu, Dian Jin, Jinghao Wei, Hao Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03137
Pdf URL: https://arxiv.org/pdf/2410.03137
Copy Paste: [[2410.03137]] SAG: Style-Aligned Article Generation via Model Collaboration(https://arxiv.org/abs/2410.03137)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) have increased the demand for personalized and stylish content generation. However, closed-source models like GPT-4 present limitations in optimization opportunities, while the substantial training costs and inflexibility of open-source alternatives, such as Qwen-72B, pose considerable challenges. Conversely, small language models (SLMs) struggle with understanding complex instructions and transferring learned capabilities to new contexts, often exhibiting more pronounced limitations. In this paper, we present a novel collaborative training framework that leverages the strengths of both LLMs and SLMs for style article generation, surpassing the performance of either model alone. We freeze the LLMs to harness their robust instruction-following capabilities and subsequently apply supervised fine-tuning on the SLM using style-specific data. Additionally, we introduce a self-improvement method to enhance style consistency. Our new benchmark, NoteBench, thoroughly evaluates style-aligned generation. Extensive experiments show that our approach achieves state-of-the-art performance, with improvements of 0.78 in ROUGE-L and 0.55 in BLEU-4 scores compared to GPT-4, while maintaining a low hallucination rate regarding factuality and faithfulness.
摘要：大型语言模型 (LLM) 增加了对个性化和时尚内容生成的需求。然而，像 GPT-4 这样的闭源模型在优化机会方面存在限制，而开源替代方案（如 Qwen-72B）的高昂训练成本和不灵活性带来了相当大的挑战。相反，小型语言模型 (SLM) 难以理解复杂的指令并将学习到的能力转移到新的环境中，通常表现出更明显的局限性。在本文中，我们提出了一种新颖的协作训练框架，该框架利用 LLM 和 SLM 的优势来生成风格文章，超越了单独使用任何一个模型的性能。我们冻结 LLM 以利用其强大的指令跟踪能力，然后使用特定于风格的数据对 SLM 进行监督微调。此外，我们引入了一种自我改进方法来增强风格一致性。我们的新基准 NoteBench 全面评估了风格一致的生成。大量实验表明，我们的方法达到了最先进的性能，与 GPT-4 相比，ROUGE-L 分数提高了 0.78，BLEU-4 分数提高了 0.55，同时在事实和真实性方面保持了较低的幻觉率。

Title: Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback

Authors: Kyuyoung Kim, Ah Jeong Seo, Hao Liu, Jinwoo Shin, Kimin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03145
Pdf URL: https://arxiv.org/pdf/2410.03145
Copy Paste: [[2410.03145]] Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback(https://arxiv.org/abs/2410.03145)
Keywords: language model, llm
Abstract: Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. Despite their success, existing methods typically rely on simple binary labels, such as those indicating preferred outputs in pairwise preferences, which fail to capture the subtle differences in relative quality between pairs. To address this limitation, we introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models. Specifically, given quality margins in pairwise preferences, we design soft target probabilities based on the Bradley-Terry model, which are then used to train models with the standard cross-entropy objective. Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench. Notably, the 7B model trained with MMPO achieves state-of-the-art performance on RewardBench as of June 2024, outperforming other models of the same scale. Our analysis also shows that MMPO is more robust to overfitting, leading to better-calibrated models.
摘要：利用对齐技术（例如从人类反馈中进行强化学习）进行微调的大型语言模型 (LLM) 在开发迄今为止一些最强大的 AI 系统方面发挥了重要作用。尽管取得了成功，但现有方法通常依赖于简单的二进制标签，例如指示成对偏好中的首选输出的标签，这些标签无法捕捉到成对之间相对质量的细微差异。为了解决这一限制，我们引入了一种称为边际匹配偏好优化 (MMPO) 的方法，该方法将相对质量边际纳入优化，从而改进了 LLM 策略和奖励模型。具体而言，给定成对偏好中的质量边际，我们根据 Bradley-Terry 模型设计软目标概率，然后将其用于训练具有标准交叉熵目标的模型。使用人类和 AI 反馈数据进行的实验表明，在包括 MT-bench 和 RewardBench 在内的流行基准测试中，MMPO 始终优于基线方法，而且通常领先幅度很大。值得注意的是，截至 2024 年 6 月，使用 MMPO 训练的 7B 模型在 RewardBench 上取得了最佳表现，优于其他同等规模的模型。我们的分析还表明，MMPO 对过度拟合的鲁棒性更强，从而可以生成校准效果更好的模型。
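
The idea in the MMPO abstract (soft target probabilities from the Bradley-Terry model, trained with cross-entropy) can be sketched as follows. This is an illustrative reading, not the paper's code: the scale parameter `gamma`, the function names, and the use of scalar scores for the chosen and rejected responses are assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_target(margin: float, gamma: float = 1.0) -> float:
    """Bradley-Terry probability that the chosen response is better,
    given a quality margin (chosen minus rejected). gamma is a
    hypothetical scaling parameter."""
    return sigmoid(gamma * margin)


def soft_label_loss(score_chosen: float, score_rejected: float,
                    margin: float, gamma: float = 1.0) -> float:
    """Cross-entropy between the soft target and the model's own
    Bradley-Terry preference probability."""
    p_target = soft_target(margin, gamma)
    p_model = sigmoid(score_chosen - score_rejected)
    eps = 1e-12  # guards against log(0)
    return -(p_target * math.log(p_model + eps)
             + (1.0 - p_target) * math.log(1.0 - p_model + eps))
```

Under these assumptions, a very large margin pushes the target toward 1 and the loss approaches the usual binary preference loss, while a margin of 0 gives a target of 0.5, so the loss is smallest when the model scores both responses equally.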

Title: Autoregressive Large Language Models are Computationally Universal

Authors: Dale Schuurmans, Hanjun Dai, Francesco Zanini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03170
Pdf URL: https://arxiv.org/pdf/2410.03170
Copy Paste: [[2410.03170]] Autoregressive Large Language Models are Computationally Universal(https://arxiv.org/abs/2410.03170)
Keywords: language model, prompt
Abstract: We show that autoregressive decoding of a transformer-based language model can realize universal computation, without external intervention or modification of the model's weights. Establishing this result requires understanding how a language model can process arbitrarily long inputs using a bounded context. For this purpose, we consider a generalization of autoregressive decoding where, given a long input, emitted tokens are appended to the end of the sequence as the context window advances. We first show that the resulting system corresponds to a classical model of computation, a Lag system, that has long been known to be computationally universal. By leveraging a new proof, we show that a universal Turing machine can be simulated by a Lag system with 2027 production rules. We then investigate whether an existing large language model can simulate the behaviour of such a universal Lag system. We give an affirmative answer by showing that a single system-prompt can be developed for gemini-1.5-pro-001 that drives the model, under deterministic (greedy) decoding, to correctly apply each of the 2027 production rules. We conclude that, by the Church-Turing thesis, prompted gemini-1.5-pro-001 with extended autoregressive (greedy) decoding is a general purpose computer.
摘要：我们表明，基于 Transformer 的语言模型的自回归解码可以实现通用计算，而无需外部干预或修改模型的权重。要建立这一结果，需要了解语言模型如何使用有界上下文处理任意长的输入。为此，我们考虑自回归解码的泛化，其中给定一个长输入，随着上下文窗口的推进，发出的标记会附加到序列的末尾。我们首先表明，得到的系统对应于经典的计算模型，即 Lag 系统，该系统长期以来被认为是计算通用的。通过利用新的证明，我们表明通用图灵机可以通过具有 2027 条产生式规则的 Lag 系统进行模拟。然后，我们研究现有的大型语言模型是否可以模拟这种通用 Lag 系统的行为。我们给出了肯定的答案，表明可以为 gemini-1.5-pro-001 开发一个系统提示，在确定性（贪婪）解码下驱动模型正确应用全部 2027 条产生式规则。我们根据丘奇-图灵论题得出结论：经过提示并采用扩展自回归（贪婪）解码的 gemini-1.5-pro-001 是一台通用计算机。
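
A toy sketch may help make the Lag-system mechanics concrete. It assumes the usual definition: a rule is selected by the first `n` symbols of the string, its output is appended at the end, and only the first symbol is deleted, which mirrors a context window sliding forward by one position. The two rules in `TOY_RULES` are hypothetical and have nothing to do with the 2027-rule universal system in the paper.

```python
def run_lag(rules, tape, n=2, max_steps=1000):
    """Run a deterministic Lag system on a string.

    rules maps a length-n prefix to the string that gets appended.
    Each step appends the matched output and drops one leading symbol.
    The run halts when no rule matches or the string is shorter than n.
    """
    for _ in range(max_steps):
        prefix = tape[:n]
        if len(prefix) < n or prefix not in rules:
            break
        tape = tape[1:] + rules[prefix]
    return tape


# Hypothetical two-rule system over the alphabet {a, b}.
TOY_RULES = {"aa": "b", "ab": ""}
```

Starting from `"aab"`, the string evolves as `"aab"`, `"abb"`, `"bb"`, and the run then halts because no rule matches the prefix `"bb"`.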

Title: Kiss up, Kick down: Exploring Behavioral Changes in Multi-modal Large Language Models with Assigned Visual Personas

Authors: Seungjong Sun, Eungu Lee, Seo Yeon Baek, Seunghyun Hwang, Wonbyung Lee, Dongyan Nan, Bernard J. Jansen, Jang Hyun Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03181
Pdf URL: https://arxiv.org/pdf/2410.03181
Copy Paste: [[2410.03181]] Kiss up, Kick down: Exploring Behavioral Changes in Multi-modal Large Language Models with Assigned Visual Personas(https://arxiv.org/abs/2410.03181)
Keywords: language model, llm, prompt
Abstract: This study is the first to explore whether multi-modal large language models (LLMs) can align their behaviors with visual personas, addressing a significant gap in the literature that predominantly focuses on text-based personas. We developed a novel dataset of 5K fictional avatar images for assignment as visual personas to LLMs, and analyzed their negotiation behaviors based on the visual traits depicted in these images, with a particular focus on aggressiveness. The results indicate that LLMs assess the aggressiveness of images in a manner similar to humans and output more aggressive negotiation behaviors when prompted with an aggressive visual persona. Interestingly, the LLM exhibited more aggressive negotiation behaviors when the opponent's image appeared less aggressive than its own, and less aggressive behaviors when the opponent's image appeared more aggressive.
摘要：本研究首次探索了多模态大型语言模型 (LLM) 是否可以将其行为与视觉人物角色保持一致，从而填补了文献中主要关注基于文本的人物角色的重大空白。我们开发了一个包含 5K 个虚构头像图像的新数据集，将其作为视觉人物角色分配给 LLM，并根据这些图像中描绘的视觉特征分析了它们的谈判行为，特别关注攻击性。结果表明，LLM 以类似于人类的方式评估图像的攻击性，并在受到攻击性视觉人物角色的提示时输出更积极的谈判行为。有趣的是，当对手的形象看起来不如自己的攻击性时，LLM 表现出更积极的谈判行为，而当对手的形象看起来更具攻击性时，LLM 表现出较不积极的行为。

Title: Generating bilingual example sentences with large language models as lexicography assistants

Authors: Raphael Merx, Ekaterina Vylomova, Kemal Kurniawan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03182
Pdf URL: https://arxiv.org/pdf/2410.03182
Copy Paste: [[2410.03182]] Generating bilingual example sentences with large language models as lexicography assistants(https://arxiv.org/abs/2410.03182)
Keywords: language model, llm
Abstract: We present a study of LLMs' performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility. Our findings reveal that while LLMs can generate reasonably good dictionary examples, their performance degrades significantly for lower-resourced languages. We also observe high variability in human preferences for example quality, reflected in low inter-annotator agreement rates. To address this, we demonstrate that in-context learning can successfully align LLMs with individual annotator preferences. Additionally, we explore the use of pre-trained language models for automated rating of examples, finding that sentence perplexity serves as a good proxy for typicality and intelligibility in higher-resourced languages. Our study also contributes a novel dataset of 600 ratings for LLM-generated sentence pairs, and provides insights into the potential of LLMs in reducing the cost of lexicographic work, particularly for low-resource languages.
摘要：我们研究了 LLM 在为不同资源水平的语言生成和评级双语词典例句方面的表现：法语（资源丰富）、印尼语（资源中等）和德顿语（资源匮乏），目标语言为英语。我们根据 GDEX（良好词典示例）标准评估 LLM 生成示例的质量：典型性、信息量和可理解性。我们的研究结果表明，虽然 LLM 可以生成相当好的词典示例，但对于资源较少的语言，其性能会显著下降。我们还观察到人类对示例质量的偏好存在很大差异，这反映在注释者之间的低一致率上。为了解决这个问题，我们证明了上下文学习可以成功地将 LLM 与个人注释者的偏好保持一致。此外，我们探索了使用预训练语言模型自动评级示例，发现句子困惑度可以很好地代表资源较丰富的语言中的典型性和可理解性。我们的研究还为 LLM 生成的句子对贡献了一个包含 600 个评分的全新数据集，并深入了解了 LLM 在降低词典编纂工作成本方面的潜力，特别是对于资源匮乏的语言而言。
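
Illustrative sketch (not from the paper): sentence perplexity, which the abstract reports as a proxy for typicality and intelligibility, is the exponential of the negative mean token log-probability. The function below assumes the per-token natural-log probabilities have already been obtained from some pre-trained language model; the numbers in the example are made up for illustration.

```python
import math
from typing import Sequence


def sentence_perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities for two candidate example sentences.
candidates = {
    "candidate_1": [-1.2, -0.4, -0.9, -0.3],
    "candidate_2": [-3.1, -2.7, -4.0, -2.2],
}

# Lower perplexity means the model finds the sentence more predictable.
ranked = sorted(candidates, key=lambda k: sentence_perplexity(candidates[k]))
print(ranked)
```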

Title: Parallel Corpus Augmentation using Masked Language Models

Authors: Vibhuti Kumari, Narayana Murthy Kavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03194
Pdf URL: https://arxiv.org/pdf/2410.03194
Copy Paste: [[2410.03194]] Parallel Corpus Augmentation using Masked Language Models(https://arxiv.org/abs/2410.03194)
Keywords: language model
Abstract: In this paper, we propose a novel method of augmenting parallel text corpora that promises good quality and can produce corpora many times larger than the seed corpus we start with. We do not need any additional monolingual corpora. We use a multilingual masked language model to mask and predict alternative words in context, and we use sentence embeddings to check and select sentence pairs that are likely to be translations of each other. We cross-check our method using metrics for MT quality estimation. We believe this method can greatly alleviate the data scarcity problem for all language pairs for which a reasonable seed corpus is available.
摘要：在本文中，我们提出了一种扩充平行文本语料库的新方法，该方法保证了良好的质量，并且能够生成比我们开始使用的种子语料库大很多倍的语料库。我们不需要任何额外的单语语料库。我们使用多语言掩码语言模型来掩码和预测上下文中的替代词，并使用句子嵌入来检查和选择可能互为翻译的句子对。我们使用 MT 质量评估指标来交叉检查我们的方法。我们相信这种方法可以极大地缓解所有具有合理种子语料库的语言对的数据稀缺问题。
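
Illustrative sketch (not from the paper): the selection step in the abstract, keeping only candidate pairs whose sentence embeddings suggest they are translations of each other, could be approximated by a cosine-similarity filter. The embeddings here are hand-written toy vectors, and both the threshold value and the use of cosine similarity are assumptions; the abstract does not state which embedding model or criterion the authors use.

```python
import math
from typing import List, Sequence, Tuple

# (source sentence, target sentence, source embedding, target embedding)
Candidate = Tuple[str, str, Sequence[float], Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(cands: List[Candidate], threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Keep sentence pairs whose embeddings are at least `threshold` similar."""
    return [(src, tgt) for src, tgt, e_src, e_tgt in cands
            if cosine(e_src, e_tgt) >= threshold]


# Toy candidates with made-up 2-dimensional embeddings.
toy = [
    ("src A", "tgt A", [1.0, 0.0], [1.0, 0.0]),  # identical direction, kept
    ("src B", "tgt B", [1.0, 0.0], [0.0, 1.0]),  # orthogonal, dropped
]
print(filter_pairs(toy))
```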

Title: Cross-lingual Transfer for Automatic Question Generation by Learning Interrogative Structures in Target Languages

Authors: Seonjeong Hwang, Yunsu Kim, Gary Geunbae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03197
Pdf URL: https://arxiv.org/pdf/2410.03197
Copy Paste: [[2410.03197]] Cross-lingual Transfer for Automatic Question Generation by Learning Interrogative Structures in Target Languages(https://arxiv.org/abs/2410.03197)
Keywords: language model, gpt, chat
Abstract: Automatic question generation (QG) serves a wide range of purposes, such as augmenting question-answering (QA) corpora, enhancing chatbot systems, and developing educational materials. Despite its importance, most existing datasets predominantly focus on English, resulting in a considerable gap in data availability for other languages. Cross-lingual transfer for QG (XLT-QG) addresses this limitation by allowing models trained on high-resource language datasets to generate questions in low-resource languages. In this paper, we propose a simple and efficient XLT-QG method that operates without the need for monolingual, parallel, or labeled data in the target language, utilizing a small language model. Our model, trained solely on English QA datasets, learns interrogative structures from a limited set of question exemplars, which are then applied to generate questions in the target language. Experimental results show that our method outperforms several XLT-QG baselines and achieves performance comparable to GPT-3.5-turbo across different languages. Additionally, the synthetic data generated by our model proves beneficial for training multilingual QA models. With significantly fewer parameters than large language models and without requiring additional training for target languages, our approach offers an effective solution for QG and QA tasks across various languages.
摘要：自动问题生成 (QG) 用途广泛，例如扩充问答 (QA) 语料库、增强聊天机器人系统和开发教育材料。尽管它很重要，但大多数现有数据集主要集中在英语上，导致其他语言的数据可用性存在相当大的差距。QG 的跨语言迁移 (XLT-QG) 通过允许在高资源语言数据集上训练的模型生成低资源语言的问题来解决这一限制。在本文中，我们提出了一种简单有效的 XLT-QG 方法，该方法利用小型语言模型，无需目标语言中的单语、并行或标记数据即可运行。我们的模型仅在英语 QA 数据集上训练，从一组有限的问题样本中学习疑问结构，然后将其应用于生成目标语言中的问题。实验结果表明，我们的方法优于几个 XLT-QG 基线，并在不同语言中实现了与 GPT-3.5-turbo 相当的性能。此外，我们的模型生成的合成数据对训练多语言问答模型大有裨益。与大型语言模型相比，我们的方法参数少得多，而且无需针对目标语言进行额外训练，因此为跨多种语言的问答任务提供了有效的解决方案。

Title: PersoBench: Benchmarking Personalized Response Generation in Large Language Models

Authors: Saleh Afzoon, Usman Naseem, Amin Beheshti, Zahra Jamali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03198
Pdf URL: https://arxiv.org/pdf/2410.03198
Copy Paste: [[2410.03198]] PersoBench: Benchmarking Personalized Response Generation in Large Language Models(https://arxiv.org/abs/2410.03198)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: While large language models (LLMs) have exhibited impressive conversational capabilities, their proficiency in delivering personalized responses remains unclear. Although recent benchmarks automatically evaluate persona consistency in role-playing contexts using LLM-based judgment, the evaluation of personalization in response generation remains underexplored. To address this gap, we present a new benchmark, PersoBench, to evaluate the personalization ability of LLMs in persona-aware dialogue generation within a zero-shot setting. We assess the performance of three open-source and three closed-source LLMs using well-known datasets and a range of metrics. Our analysis, conducted on three well-known persona-aware datasets, evaluates multiple dimensions of response quality, including fluency, diversity, coherence, and personalization, across both standard and chain-of-thought prompting methods. Our findings reveal that while LLMs excel at generating fluent and diverse responses, they are far from satisfactory in delivering personalized and coherent responses considering both the conversation context and the provided personas. Our benchmark implementation is available at this https URL.
摘要：虽然大型语言模型 (LLM) 表现出令人印象深刻的对话能力，但它们在提供个性化响应方面的能力仍不清楚。尽管最近的基准测试使用基于 LLM 的判断自动评估角色扮演环境中的角色一致性，但对响应生成中的个性化评估仍未得到充分探索。为了解决这一差距，我们提出了一个新的基准测试 PersoBench，以评估零样本设置中 LLM 在角色感知对话生成中的个性化能力。我们使用众所周知的数据集和一系列指标评估了三个开源和三个闭源 LLM 的性能。我们对三个众所周知的角色感知数据集进行的分析评估了标准和思路链提示方法中响应质量的多个维度，包括流畅性、多样性、连贯性和个性化。我们的研究结果表明，虽然 LLM 擅长生成流畅和多样化的响应，但它们在提供个性化和连贯的响应方面还远远不能令人满意，无论是考虑到对话背景还是提供的角色。我们的基准测试实现可在此 https URL 上找到。

Title: Learning Semantic Structure through First-Order-Logic Translation

Authors: Akshay Chaturvedi, Nicholas Asher
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03203
Pdf URL: https://arxiv.org/pdf/2410.03203
Copy Paste: [[2410.03203]] Learning Semantic Structure through First-Order-Logic Translation(https://arxiv.org/abs/2410.03203)
Keywords: language model, llm, prompt
Abstract: In this paper, we study whether transformer-based language models can extract predicate argument structure from simple sentences. We first show that language models sometimes confuse which predicates apply to which objects. To mitigate this, we explore two tasks: question answering (Q/A), and first-order logic (FOL) translation, and two regimes, prompting and finetuning. In FOL translation, we finetune several large language models on synthetic datasets designed to gauge their generalization abilities. For Q/A, we finetune encoder models like BERT and RoBERTa and use prompting for LLMs. The results show that FOL translation for LLMs is better suited to learn predicate argument structure.
摘要：在本文中，我们研究了基于转换器的语言模型是否可以从简单句子中提取谓词论元结构。我们首先表明，语言模型有时会混淆哪些谓词适用于哪些对象。为了缓解这种情况，我们探索了两项任务：问答 (Q/A) 和一阶逻辑 (FOL) 翻译，以及两种方案，提示和微调。在 FOL 翻译中，我们在旨在衡量其泛化能力的合成数据集上微调了几个大型语言模型。对于问答，我们微调了 BERT 和 RoBERTa 等编码器模型，并使用 LLM 的提示。结果表明，LLM 的 FOL 翻译更适合学习谓词论元结构。

Title: Consultation on Industrial Machine Faults with Large language Models

Authors: Apiradee Boonmee, Kritsada Wongsuwan, Pimchanok Sukjai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03223
Pdf URL: https://arxiv.org/pdf/2410.03223
Copy Paste: [[2410.03223]] Consultation on Industrial Machine Faults with Large language Models(https://arxiv.org/abs/2410.03223)
Keywords: language model, llm, prompt
Abstract: Industrial machine fault diagnosis is a critical component of operational efficiency and safety in manufacturing environments. Traditional methods rely heavily on expert knowledge and specific machine learning models, which can be limited in their adaptability and require extensive labeled data. This paper introduces a novel approach leveraging Large Language Models (LLMs), specifically through a structured multi-round prompting technique, to improve fault diagnosis accuracy. By dynamically crafting prompts, our method enhances the model's ability to synthesize information from diverse data sources, leading to improved contextual understanding and actionable recommendations. Experimental results demonstrate that our approach outperforms baseline models, achieving an accuracy of 91% in diagnosing various fault types. The findings underscore the potential of LLMs in revolutionizing industrial fault consultation practices, paving the way for more effective maintenance strategies in complex environments.
摘要：工业机器故障诊断是制造环境中运营效率和安全性的关键组成部分。传统方法严重依赖专家知识和特定的机器学习模型，这些模型的适应性有限，并且需要大量标记数据。本文介绍了一种利用大型语言模型 (LLM) 的新方法，特别是通过结构化的多轮提示技术来提高故障诊断的准确性。通过动态制作提示，我们的方法增强了模型综合来自不同数据源的信息的能力，从而提高了上下文理解能力和可操作的建议。实验结果表明，我们的方法优于基线模型，在诊断各种故障类型时准确率达到 91%。研究结果强调了 LLM 在彻底改变工业故障咨询实践方面的潜力，为复杂环境中更有效的维护策略铺平了道路。

Title: ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering

Authors: Huayang Li, Pat Verga, Priyanka Sen, Bowen Yang, Vijay Viswanathan, Patrick Lewis, Taro Watanabe, Yixuan Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03227
Pdf URL: https://arxiv.org/pdf/2410.03227
Copy Paste: [[2410.03227]] ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering(https://arxiv.org/abs/2410.03227)
Keywords: language model, llm
Abstract: The context window of large language models (LLMs) has been extended significantly in recent years. However, while the context length that the LLM can process has grown, the capability of the model to accurately reason over that context degrades noticeably. This occurs because modern LLMs often become overwhelmed by the vast amount of information in the context; when answering questions, the model must identify and reason over relevant evidence sparsely distributed throughout the text. To alleviate the challenge of long-context reasoning, we develop a retrieve-then-reason framework, enabling LLMs to reason over relevant evidence collected during an intermediate retrieval step. We find that modern LLMs struggle to accurately retrieve relevant facts and instead, often hallucinate "retrieved facts", resulting in flawed reasoning and the production of incorrect answers. To address these issues, we introduce ALR$^2$, a method that augments the long-context reasoning capability of LLMs via an explicit two-stage procedure, i.e., aligning LLMs with the objectives of both retrieval and reasoning. We demonstrate the efficacy of ALR$^2$ for mitigating performance degradation in long-context reasoning tasks. Through extensive experiments on long-context QA benchmarks, we find our method to outperform competitive baselines by large margins, achieving at least 8.4 and 7.9 EM gains on the long-context versions of HotpotQA and SQuAD datasets, respectively.
摘要：近年来，大型语言模型 (LLM) 的上下文窗口已大大扩展。然而，虽然 LLM 可以处理的上下文长度有所增加，但模型准确推理该上下文的能力却明显下降。这是因为现代 LLM 经常被上下文中的大量信息所淹没；在回答问题时，模型必须识别并推理分散在整个文本中的相关证据。为了缓解长上下文推理的挑战，我们开发了一个检索然后推理的框架，使 LLM 能够推理在中间检索步骤中收集的相关证据。我们发现现代 LLM 难以准确检索相关事实，而是经常产生“检索到的事实”的幻觉，导致推理有缺陷并产生错误的答案。为了解决这些问题，我们引入了 ALR$^2$，这是一种通过明确的两阶段程序增强 LLM 长上下文推理能力的方法，即将 LLM 与检索和推理的目标相结合。我们证明了 ALR$^2$ 在缓解长上下文推理任务中的性能下降方面的有效性。通过对长上下文 QA 基准的大量实验，我们发现我们的方法远远优于竞争基线，在 HotpotQA 和 SQuAD 数据集的长上下文版本上分别实现了至少 8.4 和 7.9 EM 的增益。

Title: Beyond Film Subtitles: Is YouTube the Best Approximation of Spoken Vocabulary?

Authors: Adam Nohejl, Frederikus Hudi, Eunike Andriani Kardinata, Shintaro Ozaki, Maria Angelica Riera Machin, Hongyu Sun, Justin Vasselli, Taro Watanabe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03240
Pdf URL: https://arxiv.org/pdf/2410.03240
Copy Paste: [[2410.03240]] Beyond Film Subtitles: Is YouTube the Best Approximation of Spoken Vocabulary?(https://arxiv.org/abs/2410.03240)
Keywords: language model, gpt, llm
Abstract: Word frequency is a key variable in psycholinguistics, useful for modeling human familiarity with words even in the era of large language models (LLMs). Frequency in film subtitles has proved to be a particularly good approximation of everyday language exposure. For many languages, however, film subtitles are not easily available, or are overwhelmingly translated from English. We demonstrate that frequencies extracted from carefully processed YouTube subtitles provide an approximation comparable to, and often better than, the best currently available resources. Moreover, they are available for languages for which a high-quality subtitle or speech corpus does not exist. We use YouTube subtitles to construct frequency norms for five diverse languages, Chinese, English, Indonesian, Japanese, and Spanish, and evaluate their correlation with lexical decision time, word familiarity, and lexical complexity. In addition to being strongly correlated with two psycholinguistic variables, a simple linear regression on the new frequencies achieves a new high score on a lexical complexity prediction task in English and Japanese, surpassing both models trained on film subtitle frequencies and the LLM GPT-4. Our code, the frequency lists, fastText word embeddings, and statistical language models are freely available at this https URL.
摘要：词频是心理语言学中的一个关键变量，即使在大型语言模型 (LLM) 时代，它对于模拟人类对单词的熟悉程度也很有用。电影字幕中的频率已被证明是日常语言接触的特别好的近似值。然而，对于许多语言来说，电影字幕并不容易获得，或者绝大多数都是从英语翻译过来的。我们证明，从经过精心处理的 YouTube 字幕中提取的频率提供的近似值可与目前最好的资源相媲美，甚至通常更好。此外，它们还适用于没有高质量字幕或语音语料库的语言。我们使用 YouTube 字幕为五种不同的语言（中文、英语、印尼语、日语和西班牙语）构建频率规范，并评估它们与词汇决策时间、词汇熟悉度和词汇复杂性的相关性。除了与两个心理语言学变量密切相关之外，对新频率进行简单的线性回归在英语和日语词汇复杂性预测任务中取得了新的高分，超越了在电影字幕频率上训练的两个模型和 LLM GPT-4。我们的代码、频率列表、fastText 词嵌入和统计语言模型可在此 https URL 上免费获取。

Title: Are Expert-Level Language Models Expert-Level Annotators?

Authors: Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, Hsin-Hsi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03254
Pdf URL: https://arxiv.org/pdf/2410.03254
Copy Paste: [[2410.03254]] Are Expert-Level Language Models Expert-Level Annotators?(https://arxiv.org/abs/2410.03254)
Keywords: language model, llm
Abstract: Data annotation refers to the labeling or tagging of textual data with relevant information. A large body of work has reported positive results on leveraging LLMs as an alternative to human annotators. However, existing studies focus on classic NLP tasks, and the extent to which LLMs as data annotators perform in domains requiring expert knowledge remains underexplored. In this work, we investigate comprehensive approaches across three highly specialized domains and discuss practical suggestions from a cost-effectiveness perspective. To the best of our knowledge, we present the first systematic evaluation of LLMs as expert-level data annotators.
摘要：数据注释是指用相关信息标记或标注文本数据。大量研究报告了利用 LLM 替代人工注释器的积极成果。然而，现有研究主要关注经典的 NLP 任务，而 LLM 作为数据注释器在需要专业知识的领域中的表现仍未得到充分探索。在这项工作中，我们研究了三个高度专业化的领域的综合方法，并从成本效益的角度讨论了实用建议。据我们所知，我们首次对 LLM 作为专家级数据注释器进行了系统评估。

Title: Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models

Authors: Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03258
Pdf URL: https://arxiv.org/pdf/2410.03258
Copy Paste: [[2410.03258]] Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models(https://arxiv.org/abs/2410.03258)
Keywords: language model
Abstract: In this work, we show a fundamental limitation in vocabulary adaptation approaches that use the Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains. Current approaches trivially append the target domain-specific vocabulary at the end of the PLM vocabulary. This approach leads to a lower priority score and causes sub-optimal tokenization in BPE, which iteratively uses merge rules to tokenize a given text. To mitigate this issue, we propose AdaptBPE, where the BPE tokenization initialization phase is modified to first perform the longest string matching on the added (target) vocabulary before tokenizing at the character level. We perform an extensive evaluation of AdaptBPE versus the standard BPE over various classification and summarization tasks; AdaptBPE improves by 3.57% (in terms of accuracy) and 1.87% (in terms of Rouge-L), respectively. AdaptBPE for MEDVOC works particularly well when reference summaries have high OOV concentration or are longer in length. We also conduct a human evaluation, revealing that AdaptBPE generates more relevant and more faithful summaries as compared to MEDVOC. We make our codebase publicly available at this https URL.
摘要：在这项工作中，我们展示了词汇表自适应方法的一个根本限制，该方法使用字节对编码 (BPE) 标记化方案将预训练语言模型 (PLM) 微调到专家领域。当前的方法只是将目标领域特定的词汇表附加到 PLM 词汇表的末尾。这种方法会导致优先级得分较低，并导致 BPE 中的标记化效果不佳，因为 BPE 会迭代使用合并规则来标记给定的文本。为了缓解这个问题，我们提出了 AdaptBPE，其中修改了 BPE 标记化初始化阶段，首先对添加的（目标）词汇表执行最长的字符串匹配，然后再在字符级别进行标记。我们在各种分类和摘要任务上对 AdaptBPE 与标准 BPE 进行了广泛的评估；AdaptBPE 分别提高了 3.57%（就准确度而言）和 1.87%（就 Rouge-L 而言）。当参考摘要的 OOV 浓度较高或长度较长时，MEDVOC 的 AdaptBPE 效果特别好。我们还进行了人工评估，结果表明 AdaptBPE 生成的摘要与 MEDVOC 相比更相关、更可靠。我们在此 https URL 上公开了我们的代码库。
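
Illustrative sketch (not the authors' code): the modified initialization described in the abstract, longest string matching on the added vocabulary before falling back to characters, might look like the function below. The left-to-right greedy scan and the names `adapt_bpe_init` and `added_vocab` are assumptions made for illustration; the usual BPE merge rules would then run on the resulting symbol sequence, and that part is not shown.

```python
from typing import List, Set


def adapt_bpe_init(word: str, added_vocab: Set[str]) -> List[str]:
    """Split `word` into initial symbols: longest added-vocabulary match
    at each position, otherwise a single character."""
    max_len = max((len(tok) for tok in added_vocab), default=0)
    symbols: List[str] = []
    i = 0
    while i < len(word):
        match = word[i]  # character-level fallback
        # Try the longest possible span first, shrinking until a hit.
        for j in range(min(len(word), i + max_len), i, -1):
            if word[i:j] in added_vocab:
                match = word[i:j]
                break
        symbols.append(match)
        i += len(match)
    return symbols


# Hypothetical domain-specific tokens added to the vocabulary.
added = {"cardio", "myopathy", "card"}
print(adapt_bpe_init("cardiomyopathy", added))
print(adapt_bpe_init("cardiac", added))
```

With standard initialization both words would start as single characters, so an added token is only reached if the existing merge rules happen to build it; matching the added vocabulary first avoids depending on that.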

Title: What do Large Language Models Need for Machine Translation Evaluation?

Authors: Shenbin Qian, Archchana Sindhujan, Minnie Kabra, Diptesh Kanojia, Constantin Orăsan, Tharindu Ranasinghe, Frédéric Blain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03278
Pdf URL: https://arxiv.org/pdf/2410.03278
Copy Paste: [[2410.03278]] What do Large Language Models Need for Machine Translation Evaluation?(https://arxiv.org/abs/2410.03278)
Keywords: language model, llm, prompt
Abstract: Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs are able to achieve results comparable to fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate MT quality. In addition, we investigate prompting techniques such as zero-shot, Chain of Thought (CoT) and few-shot prompting for eight language pairs covering high-, medium- and low-resource languages, using several LLM variants. Our findings indicate the importance of reference translations for an LLM-based evaluation. While larger models do not necessarily fare better, they tend to benefit more from CoT prompting than smaller models. We also observe that LLMs do not always provide a numerical score when generating evaluations, which raises questions about their reliability for the task. Our work presents a comprehensive analysis for resource-constrained and training-less LLM-based evaluation of machine translation. We release the accrued prompt templates, code and data publicly for reproducibility.
摘要：利用大型语言模型 (LLM) 进行各种自然语言处理任务，人们对其性能赞不绝口。对于机器翻译 (MT) 的评估，现有研究表明，LLM 能够实现与经过微调的多语言预训练语言模型相当的结果。在本文中，我们探讨了 LLM 评估 MT 质量所需的翻译信息，例如来源、参考、翻译错误和注释指南。此外，我们利用不同的 LLM 变体，研究了针对八种语言对的提示技术，例如零样本、思维链 (CoT) 和少样本提示，涵盖高、中、低资源语言。我们的研究结果表明参考翻译对于基于 LLM 的评估非常重要。虽然较大的模型并不一定表现更好，但它们往往比较小的模型从 CoT 提示中受益更多。我们还观察到，LLM 在生成评估时并不总是提供数值分数，这对它们在任务中的可靠性提出了质疑。我们的工作为资源受限且无需训练的 LLM 机器翻译评估提供了全面的分析。我们公开发布了累积的提示模板、代码和数据，以供重复使用。

Title: Comparing zero-shot self-explanations with human rationales in multilingual text classification

Authors: Stephanie Brandl, Oliver Eberle
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03296
Pdf URL: https://arxiv.org/pdf/2410.03296
Copy Paste: [[2410.03296]] Comparing zero-shot self-explanations with human rationales in multilingual text classification(https://arxiv.org/abs/2410.03296)
Keywords: llm
Abstract: Instruction-tuned LLMs are able to provide an explanation about their output to users by generating self-explanations that do not require gradient computations or the application of possibly complex XAI methods. In this paper, we analyse whether this ability results in a good explanation by evaluating self-explanations in the form of input rationales with respect to their plausibility to humans as well as their faithfulness to models. For this, we apply two text classification tasks: sentiment classification and forced labour detection. In addition to English, we include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations for all samples. To allow for direct comparisons, we also compute post-hoc feature attribution, i.e., layer-wise relevance propagation (LRP), and apply this pipeline to 4 LLMs (Llama2, Llama3, Mistral and Mixtral). Our results show that self-explanations align more closely with human annotations compared to LRP, while maintaining a comparable level of faithfulness.
摘要：指令调整后的 LLM 能够通过生成不需要梯度计算或应用可能复杂的 XAI 方法的自解释，向用户提供有关其输出的解释。在本文中，我们通过评估输入原理形式的自解释，评估其对人类的可信度以及对模型的忠诚度，分析这种能力是否能产生良好的解释。为此，我们应用了两个文本分类任务：情绪分类和强迫劳动检测。除了英语之外，我们还包括情绪分类任务的丹麦语和意大利语翻译，并将所有样本的自解释与人类注释进行比较。为了进行直接比较，我们还计算事后特征归因，即逐层相关性传播 (LRP)，并将此管道应用于 4 个 LLM（Llama2、Llama3、Mistral 和 Mixtral）。我们的结果表明，与 LRP 相比，自解释与人类注释更接近，同时保持了相当的忠诚度。

Title: Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models

Authors: Pavel Stepachev, Pinzhen Chen, Barry Haddow
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2410.03312
Pdf URL: https://arxiv.org/pdf/2410.03312
Copy Paste: [[2410.03312]] Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models(https://arxiv.org/abs/2410.03312)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have started to play a vital role in modelling speech and text. To explore the best use of context and multiple systems' outputs for post-ASR speech emotion prediction, we study LLM prompting on a recent task named GenSEC. Our techniques include ASR transcript ranking, variable conversation context, and system output fusion. We show that the conversation context has diminishing returns and the metric used to select the transcript for prediction is crucial. Finally, our best submission surpasses the provided baseline by 20% in absolute accuracy.
摘要：大型语言模型 (LLM) 已开始在语音和文本建模中发挥重要作用。为了探索如何最好地利用上下文和多个系统的输出来进行 ASR 后语音情绪预测，我们在最近的一项名为 GenSEC 的任务中研究了 LLM 提示。我们的技术包括 ASR 转录排名、可变对话上下文和系统输出融合。我们表明，对话上下文的收益递减，而用于选择转录进行预测的指标至关重要。最后，我们的最佳提交在绝对准确率上比提供的基线高出 20%。

Title: Zero-Shot Fact Verification via Natural Logic and Large Language Models

Authors: Marek Strong, Rami Aly, Andreas Vlachos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03341
Pdf URL: https://arxiv.org/pdf/2410.03341
Copy Paste: [[2410.03341]] Zero-Shot Fact Verification via Natural Logic and Large Language Models(https://arxiv.org/abs/2410.03341)
Keywords: language model
Abstract: The recent development of fact verification systems with natural logic has enhanced their explainability by aligning claims with evidence through set-theoretic operators, providing faithful justifications. Despite these advancements, such systems often rely on a large amount of training data annotated with natural logic. To address this issue, we propose a zero-shot method that utilizes the generalization capabilities of instruction-tuned large language models. To comprehensively assess the zero-shot capabilities of our method and other fact verification systems, we evaluate all models on both artificial and real-world claims, including multilingual datasets. We also compare our method against other fact verification systems in two setups. First, in the zero-shot generalization setup, we demonstrate that our approach outperforms other systems that were not specifically trained on natural logic data, achieving an average accuracy improvement of 8.96 points over the best-performing baseline. Second, in the zero-shot transfer setup, we show that current systems trained on natural logic data do not generalize well to other domains, and our method outperforms these systems across all datasets with real-world claims.
摘要：最近，具有自然逻辑的事实验证系统的发展通过集合论运算符将主张与证据对齐，从而提供了忠实的论证，增强了其可解释性。尽管取得了这些进步，但此类系统通常依赖于大量用自然逻辑注释的训练数据。为了解决这个问题，我们提出了一种零样本方法，该方法利用指令调整的大型语言模型的泛化能力。为了全面评估我们的方法和其他事实验证系统的零样本能力，我们在人工和现实世界的主张（包括多语言数据集）上评估了所有模型。我们还在两种设置中将我们的方法与其他事实验证系统进行了比较。首先，在零样本泛化设置中，我们证明我们的方法优于其他未专门针对自然逻辑数据进行训练的系统，平均准确率比表现最佳的基线提高了 8.96 分。其次，在零样本迁移设置中，我们表明当前在自然逻辑数据上训练的系统不能很好地推广到其他领域，而我们的方法在所有具有现实世界主张的数据集上都优于这些系统。

Title: Generating Equivalent Representations of Code By A Self-Reflection Approach

Authors: Jia Li, Ge Li, Lecheng Wang, Hao Zhu, Zhi Jin
Subjects: cs.CL, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2410.03351
Pdf URL: https://arxiv.org/pdf/2410.03351
Copy Paste: [[2410.03351]] Generating Equivalent Representations of Code By A Self-Reflection Approach(https://arxiv.org/abs/2410.03351)
Keywords: language model, llm
Abstract: Equivalent Representations (ERs) of code are textual representations that preserve the same semantics as the code itself, e.g., natural language comments and pseudocode. ERs play a critical role in software development and maintenance. However, how to automatically generate ERs of code remains an open challenge. In this paper, we propose a self-reflection approach to generating ERs of code. It enables two Large Language Models (LLMs) to work mutually and produce an ER through a reflection process. Depending on whether constraints on ERs are applied, our approach generates ERs in both open and constrained settings. We conduct an empirical study to generate ERs in two settings and obtain eight findings. (1) Generating ERs in the open setting. In the open setting, we allow LLMs to represent code without any constraints, analyzing the resulting ERs and uncovering five key findings. These findings shed light on how LLMs comprehend syntactic structures, APIs, and numerical computations in code. (2) Generating ERs in the constrained setting. In the constrained setting, we impose constraints on ERs, such as natural language comments, pseudocode, and flowcharts. This allows our approach to address a range of software engineering tasks. Based on our experiments, we have three findings demonstrating that our approach can effectively generate ERs that adhere to specific constraints, thus supporting various software engineering tasks. (3) Future directions. We also discuss potential future research directions, such as deriving intermediate languages for code generation, exploring LLM-friendly requirement descriptions, and further supporting software engineering tasks. We believe that this paper will spark discussions in research communities and inspire many follow-up studies.
摘要：代码的等效表示 (ER) 是保留与代码本身相同语义的文本表示，例如自然语言注释和伪代码。ER 在软件开发和维护中起着至关重要的作用。然而，如何自动生成代码的 ER 仍然是一个悬而未决的挑战。在本文中，我们提出了一种生成代码 ER 的自我反思方法。它使两个大型语言模型 (LLM) 能够协同工作并通过反射过程生成 ER。根据是否对 ER 施加约束，我们的方法会在开放和受限设置中生成 ER。我们进行了一项实证研究，在两种设置中生成 ER，并获得了八个发现。（1）在开放环境中生成 ER。在开放环境中，我们允许 LLM 表示代码而不受任何约束，分析生成的 ER 并发现五个关键发现。这些发现揭示了 LLM 如何理解代码中的句法结构、API 和数值计算。 (2) 在受限设置中生成 ER。在受限设置中，我们对 ER 施加约束，例如自然语言注释、伪代码和流程图。这使我们的方法能够解决一系列软件工程任务。根据我们的实验，我们有三个发现，表明我们的方法可以有效地生成遵守特定约束的 ER，从而支持各种软件工程任务。 (3) 未来方向。我们还讨论了潜在的未来研究方向，例如为代码生成导出中间语言、探索 LLM 友好的需求描述以及进一步支持软件工程任务。我们相信本文将引发研究界的讨论并启发许多后续研究。

Title: Cogs in a Machine, Doing What They're Meant to Do -- The AMI Submission to the WMT24 General Translation Task

Authors: Atli Jasonarson, Hinrik Hafsteinsson, Bjarki Ármannsson, Steinþór Steingrímsson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03381
Pdf URL: https://arxiv.org/pdf/2410.03381
Copy Paste: [[2410.03381]] Cogs in a Machine, Doing What They're Meant to Do -- The AMI Submission to the WMT24 General Translation Task(https://arxiv.org/abs/2410.03381)
Keywords: llm
Abstract: This paper presents the submission of the Árni Magnússon Institute's team to the WMT24 General translation task. We work on the English->Icelandic translation direction. Our system comprises four translation models and a grammar correction model. For training our models we carefully curate our datasets, aggressively filtering out sentence pairs that may detrimentally affect the quality of our system's output. Some of our data are collected from human translations and some are synthetically generated. A part of the synthetic data is generated using an LLM, and we find that it increases the translation capability of our system significantly.
摘要：本文介绍了 Árni Magnusson 研究所团队提交的 WMT24 通用翻译任务。我们致力于英语->冰岛语翻译方向。我们的系统包含四个翻译模型和一个语法校正模型。为了训练我们的模型，我们精心挑选了我们的数据集，积极过滤掉可能对系统输出质量产生不利影响的句子对。我们的一些数据是从人工翻译中收集的，一些是合成生成的。一部分合成数据是使用 LLM 生成的，我们发现它显著提高了我们系统的翻译能力。

Title: Killing Two Flies with One Stone: An Attempt to Break LLMs Using English->Icelandic Idioms and Proper Names

Authors: Bjarki Ármannsson, Hinrik Hafsteinsson, Atli Jasonarson, Steinþór Steingrímsson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03394
Pdf URL: https://arxiv.org/pdf/2410.03394
Copy Paste: [[2410.03394]] Killing Two Flies with One Stone: An Attempt to Break LLMs Using English->Icelandic Idioms and Proper Names(https://arxiv.org/abs/2410.03394)
Keywords: llm
Abstract: This paper presents the submission of the Árni Magnússon Institute's team to the WMT24 test suite subtask, focusing on idiomatic expressions and proper names for the English->Icelandic translation direction. Intuitively and empirically, idioms and proper names are known to be a significant challenge for modern translation models. We create two different test suites. The first evaluates the competency of MT systems in translating common English idiomatic expressions, as well as testing whether systems can distinguish between those expressions and the same phrases when used in a literal context. The second test suite consists of place names that should be translated into their Icelandic exonyms (and correctly inflected) and pairs of Icelandic names that share a surface form between the male and female variants, so that incorrect translations impact meaning as well as readability. The scores reported are relatively low, especially for idiomatic expressions and place names, and indicate considerable room for improvement.
摘要：本文介绍了 Árni Magnússon 研究所团队提交的 WMT24 测试套件子任务，重点关注英语->冰岛语翻译方向的惯用表达和专有名词。从直觉和经验上看，习语和专有名词是现代翻译模型面临的重大挑战。我们创建了两个不同的测试套件。第一个测试套件评估机器翻译系统翻译常见英语惯用表达的能力，以及测试系统在字面语境中使用时是否能够区分这些表达和相同短语。第二个测试套件包括应翻译成冰岛外来语地名（并正确变格）的地名，以及男性和女性变体之间共享表面形式的冰岛名称对，因此错误的翻译会影响含义和可读性。报告的分数相对较低，尤其是惯用表达和地名，表明有很大的改进空间。

Title: Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Authors: Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03415
Pdf URL: https://arxiv.org/pdf/2410.03415
Copy Paste: [[2410.03415]] Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation(https://arxiv.org/abs/2410.03415)
Keywords: language model
Abstract: Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: Models should refuse to follow malicious instructions or give harmful advice (e.g. "how do I kill someone?"), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. "how do I kill a Python process?"). Avoiding such false refusal, as prior work has shown, is challenging even for highly-capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces false refusal rate without negatively impacting model safety and general model capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
摘要：要训练一个既有用又无害的语言模型，需要仔细校准拒绝行为：模型应该拒绝遵循恶意指令或给出有害的建议（例如“我如何杀死某人？”），但它们不应该拒绝安全请求，即使它们表面上类似于不安全的请求（例如“我如何杀死 Python 进程？”）。正如先前的工作所表明的那样，避免这种错误拒绝即使对于功能强大的语言模型来说也是一项挑战。在本文中，我们提出了一种简单而精确的方法，通过单向量消融来缓解语言模型中的错误拒绝。对于给定的模型，我们提取一个错误拒绝向量，并表明消融该向量可降低错误拒绝率，而不会对模型安全性和一般模型功能产生负面影响。我们还表明，我们的方法可用于模型安全性的细粒度校准。我们的方法是无需训练和与模型无关的，因此可用于缓解当前和未来语言模型中的错误拒绝问题。
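
Note: the abstract above does not say how the vector is ablated. One common reading of "ablating a vector" is to project a hidden state onto the subspace orthogonal to that vector. The sketch below shows only that projection, in plain Python; the function name `ablate_direction` and the toy inputs are hypothetical and are not taken from the paper.

```python
import math


def ablate_direction(hidden, direction):
    """Return `hidden` with its component along `direction` removed.

    Assumption: ablation is an orthogonal projection, h' = h - (h . u) u,
    where u is `direction` scaled to unit length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the hidden state onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]


if __name__ == "__main__":
    # Toy 2-d hidden state; the first axis plays the role of the ablated vector.
    print(ablate_direction([3.0, 4.0], [1.0, 0.0]))
```

After the projection the result has zero dot product with the ablated direction, while every component orthogonal to it is left unchanged.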

Title: One2set + Large Language Model: Best Partners for Keyphrase Generation

Authors: Liangying Shao, Liang Zhang, Minlong Peng, Guoqi Ma, Hao Yue, Mingming Sun, Jinsong Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03421
Pdf URL: https://arxiv.org/pdf/2410.03421
Copy Paste: [[2410.03421]] One2set + Large Language Model: Best Partners for Keyphrase Generation(https://arxiv.org/abs/2410.03421)
Keywords: language model, llm
Abstract: Keyphrase generation (KPG) aims to automatically generate a collection of phrases representing the core concepts of a given document. The dominant paradigms in KPG include one2seq and one2set. Recently, there has been increasing interest in applying large language models (LLMs) to KPG. Our preliminary experiments reveal that it is challenging for a single model to excel in both recall and precision. Further analysis shows that: 1) the one2set paradigm has the advantage of high recall, but suffers from improper assignments of supervision signals during training; 2) LLMs are powerful in keyphrase selection, but existing selection methods often make redundant selections. Given these observations, we introduce a generate-then-select framework decomposing KPG into two steps, where we adopt a one2set-based model as generator to produce candidates and then use an LLM as selector to select keyphrases from these candidates. In particular, we make two important improvements to our generator and selector: 1) we design an Optimal Transport-based assignment strategy to address the above improper assignments; 2) we model keyphrase selection as a sequence labeling task to alleviate redundant selections. Experimental results on multiple benchmark datasets show that our framework significantly surpasses state-of-the-art models, especially in absent keyphrase prediction.
摘要：关键短语生成 (KPG) 旨在自动生成代表给定文档核心概念的短语集合。KPG 中的主要范式包括 one2seq 和 one2set。最近，人们对将大型语言模型 (LLM) 应用于 KPG 的兴趣日益浓厚。我们的初步实验表明，单一模型很难同时在召回率和准确率方面表现出色。进一步分析表明：1）one2set 范式具有高召回率的优势，但在训练过程中存在监督信号分配不当的问题；2）LLM 在关键短语选择方面功能强大，但现有的选择方法通常会进行冗余选择。鉴于这些观察，我们引入了一个生成然后选择框架，将 KPG 分解为两个步骤，其中我们采用基于 one2set 的模型作为生成器来生成候选词，然后使用 LLM 作为选择器从这些候选词中选择关键短语。具体来说，我们对生成器和选择器进行了两项重要改进：1）我们设计了一种基于最佳传输的分配策略来解决上述不当分配问题；2）我们将关键短语选择建模为序列标记任务，以减轻冗余选择。在多个基准数据集上的实验结果表明，我们的框架明显优于最先进的模型，尤其是在缺失关键短语预测方面。

Title: A General Framework for Producing Interpretable Semantic Text Embeddings

Authors: Yiqun Sun, Qiang Huang, Yixuan Tang, Anthony K. H. Tung, Jun Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03435
Pdf URL: https://arxiv.org/pdf/2410.03435
Copy Paste: [[2410.03435]] A General Framework for Producing Interpretable Semantic Text Embeddings(https://arxiv.org/abs/2410.03435)
Keywords: llm, prompt
Abstract: Semantic text embedding is essential to many tasks in Natural Language Processing (NLP). While black-box models are capable of generating high-quality embeddings, their lack of interpretability limits their use in tasks that demand transparency. Recent approaches have improved interpretability by leveraging domain-expert-crafted or LLM-generated questions, but these methods rely heavily on expert input or well-designed prompts, which restricts their generalizability and ability to generate discriminative questions across a wide range of tasks. To address these challenges, we introduce CQG-MBQA (Contrastive Question Generation - Multi-task Binary Question Answering), a general framework for producing interpretable semantic text embeddings across diverse tasks. Our framework systematically generates highly discriminative, low cognitive load yes/no questions through the CQG method and answers them efficiently with the MBQA model, resulting in interpretable embeddings in a cost-effective manner. We validate the effectiveness and interpretability of CQG-MBQA through extensive experiments and ablation studies, demonstrating that it delivers embedding quality comparable to many advanced black-box models while maintaining inherent interpretability. Additionally, CQG-MBQA outperforms other interpretable text embedding methods across various downstream tasks.
摘要：语义文本嵌入对于自然语言处理 (NLP) 中的许多任务至关重要。虽然黑盒模型能够生成高质量的嵌入，但其缺乏可解释性限制了它们在需要透明度的任务中的使用。最近的方法通过利用领域专家制作或 LLM 生成的问题提高了可解释性，但这些方法严重依赖专家输入或精心设计的提示，这限制了它们的通用性和在各种任务中生成判别性问题的能力。为了应对这些挑战，我们引入了 CQG-MBQA（对比问题生成 - 多任务二元问答），这是一个用于在不同任务中生成可解释语义文本嵌入的通用框架。我们的框架通过 CQG 方法系统地生成高度判别性、低认知负荷的是/否问题，并使用 MBQA 模型有效地回答这些问题，从而以经济高效的方式产生可解释的嵌入。我们通过大量实验和消融研究验证了 CQG-MBQA 的有效性和可解释性，表明它提供了与许多高级黑盒模型相当的嵌入质量，同时保持了固有的可解释性。此外，CQG-MBQA 在各种下游任务中的表现优于其他可解释文本嵌入方法。

Title: ToolGen: Unified Tool Retrieval and Calling via Generation

Authors: Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, Haonan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03439
Pdf URL: https://arxiv.org/pdf/2410.03439
Copy Paste: [[2410.03439]] ToolGen: Unified Tool Retrieval and Calling via Generation(https://arxiv.org/abs/2410.03439)
Keywords: language model, llm, chain-of-thought, agent
Abstract: As large language models (LLMs) advance, their inability to autonomously execute tasks by directly interacting with external tools remains a critical limitation. Traditional methods rely on inputting tool descriptions as context, which is constrained by context length and requires separate, often inefficient, retrieval mechanisms. We introduce ToolGen, a paradigm shift that integrates tool knowledge directly into the LLM's parameters by representing each tool as a unique token. This enables the LLM to generate tool calls and arguments as part of its next token prediction capabilities, seamlessly blending tool invocation with language generation. Our framework allows the LLM to access and utilize a vast amount of tools with no additional retrieval step, significantly enhancing both performance and scalability. Experimental results with over 47,000 tools show that ToolGen not only achieves superior results in both tool retrieval and autonomous task completion but also sets the stage for a new era of AI agents that can adapt to tools across diverse domains. By fundamentally transforming tool retrieval into a generative process, ToolGen paves the way for more versatile, efficient, and autonomous AI systems. ToolGen enables end-to-end tool learning and opens opportunities for integration with other advanced techniques such as chain-of-thought and reinforcement learning, thereby expanding the practical capabilities of LLMs.
摘要：随着大型语言模型 (LLM) 的发展，它们无法通过直接与外部工具交互来自主执行任务仍然是一个关键的限制。传统方法依赖于输入工具描述作为上下文，这受到上下文长度的限制，并且需要单独的、通常效率低下的检索机制。我们引入了 ToolGen，这是一种范式转变，它通过将每个工具表示为唯一的标记，将工具知识直接集成到 LLM 的参数中。这使 LLM 能够生成工具调用和参数作为其下一个标记预测功能的一部分，将工具调用与语言生成无缝融合。我们的框架允许 LLM 访问和使用大量工具而无需额外的检索步骤，从而显著提高性能和可扩展性。对超过 47,000 种工具的实验结果表明，ToolGen 不仅在工具检索和自主任务完成方面取得了卓越的成果，而且还为能够适应不同领域工具的 AI 代理的新时代奠定了基础。 ToolGen 从根本上将工具检索转变为生成过程，为更通用、更高效、更自主的 AI 系统铺平了道路。ToolGen 支持端到端工具学习，并为与其他先进技术（如思路链和强化学习）的集成提供了机会，从而扩展了 LLM 的实用功能。

Title: How Language Models Prioritize Contextual Grammatical Cues?

Authors: Hamidreza Amirzadeh, Afra Alishahi, Hosein Mohebbi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03447
Pdf URL: https://arxiv.org/pdf/2410.03447
Copy Paste: [[2410.03447]] How Language Models Prioritize Contextual Grammatical Cues?(https://arxiv.org/abs/2410.03447)
Keywords: language model, gpt
Abstract: Transformer-based language models have shown an excellent ability to effectively capture and utilize contextual information. Although various analysis techniques have been used to quantify and trace the contribution of single contextual cues to a target task such as subject-verb agreement or coreference resolution, scenarios in which multiple relevant cues are available in the context remain underexplored. In this paper, we investigate how language models handle gender agreement when multiple gender cue words are present, each capable of independently disambiguating a target gender pronoun. We analyze two widely used Transformer-based models: BERT, an encoder-based model, and GPT-2, a decoder-based model. Our analysis employs two complementary approaches: context mixing analysis, which tracks information flow within the model, and a variant of activation patching, which measures the impact of cues on the model's prediction. We find that BERT tends to prioritize the first cue in the context to form both the target word representations and the model's prediction, while GPT-2 relies more on the final cue. Our findings reveal striking differences in how encoder-based and decoder-based models prioritize and use contextual information for their predictions.
摘要：基于 Transformer 的语言模型表现出了有效捕获和利用上下文信息的出色能力。尽管已经使用了各种分析技术来量化和追踪单个上下文线索对目标任务（例如主谓一致或共指消解）的贡献，但上下文中存在多个相关线索的场景仍未得到充分探索。在本文中，我们研究了当存在多个性别提示词时语言模型如何处理性别一致性，每个词都能独立消除目标性别代词的歧义。我们分析了两种广泛使用的基于 Transformer 的模型：基于编码器的 BERT 和基于解码器的 GPT-2。我们的分析采用了两种互补的方法：上下文混合分析（跟踪模型内的信息流）和激活修补的变体（测量线索对模型预测的影响）。我们发现 BERT 倾向于优先考虑上下文中的第一个线索来形成目标词表示和模型的预测，而 GPT-2 则更多地依赖于最后一个线索。我们的研究结果揭示了基于编码器和基于解码器的模型在如何优先考虑和使用上下文信息进行预测方面存在显著差异。

Title: CoCoLoFa: A Dataset of News Comments with Common Logical Fallacies Written by LLM-Assisted Crowds

Authors: Min-Hsuan Yeh, Ruyuan Wan, Ting-Hao 'Kenneth' Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03457
Pdf URL: https://arxiv.org/pdf/2410.03457
Copy Paste: [[2410.03457]] CoCoLoFa: A Dataset of News Comments with Common Logical Fallacies Written by LLM-Assisted Crowds(https://arxiv.org/abs/2410.03457)
Keywords: llm
Abstract: Detecting logical fallacies in texts can help users spot argument flaws, but automating this detection is not easy. Manually annotating fallacies in large-scale, real-world text data to create datasets for developing and validating detection models is costly. This paper introduces CoCoLoFa, the largest known logical fallacy dataset, containing 7,706 comments for 648 news articles, with each comment labeled for fallacy presence and type. We recruited 143 crowd workers to write comments embodying specific fallacy types (e.g., slippery slope) in response to news articles. Recognizing the complexity of this writing task, we built an LLM-powered assistant into the workers' interface to aid in drafting and refining their comments. Experts rated the writing quality and labeling validity of CoCoLoFa as high and reliable. BERT-based models fine-tuned using CoCoLoFa achieved the highest fallacy detection (F1=0.86) and classification (F1=0.87) performance on its test set, outperforming the state-of-the-art LLMs. Our work shows that combining crowdsourcing and LLMs enables us to more effectively construct datasets for complex linguistic phenomena that crowd workers find challenging to produce on their own.
摘要：检测文本中的逻辑谬误可以帮助用户发现论证缺陷，但自动检测并不容易。手动注释大规模真实文本数据中的谬误以创建用于开发和验证检测模型的数据集成本高昂。本文介绍了已知最大的逻辑谬误数据集 CoCoLoFa，其中包含 648 篇新闻文章的 7,706 条评论，每条评论都标有谬误存在和类型。我们招募了 143 名众包工作者，针对新闻文章撰写体现特定谬误类型（例如滑坡）的评论。认识到这项写作任务的复杂性，我们在工作者界面中构建了一个由 LLM 提供支持的助手，以帮助他们起草和完善评论。专家将 CoCoLoFa 的写作质量和标签效度评为高且可靠。使用 CoCoLoFa 微调的基于 BERT 的模型在其测试集上实现了最高的谬误检测 (F1=0.86) 和分类 (F1=0.87) 性能，超越了最先进的 LLM。我们的工作表明，将众包和 LLM 结合起来使我们能够更有效地构建复杂语言现象的数据集，而众包工作者自己很难生成这些数据集。

Title: Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval Augmented Generation

Authors: Tobias Leemann, Periklis Petridis, Giuseppe Vietri, Dionysis Manousakas, Aaron Roth, Sergul Aydore
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03461
Pdf URL: https://arxiv.org/pdf/2410.03461
Copy Paste: [[2410.03461]] Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval Augmented Generation(https://arxiv.org/abs/2410.03461)
Keywords: language model, llm, hallucination, prompt, retrieval augmented generation
Abstract: While retrieval augmented generation (RAG) has been shown to enhance factuality of large language model (LLM) outputs, LLMs still suffer from hallucination, generating incorrect or irrelevant information. One common detection strategy involves prompting the LLM again to assess whether its response is grounded in the retrieved evidence, but this approach is costly. Alternatively, lightweight natural language inference (NLI) models for efficient grounding verification can be used at inference time. While existing pre-trained NLI models offer potential solutions, their performance remains subpar compared to larger models on realistic RAG inputs. RAG inputs are more complex than most datasets used for training NLI models and have characteristics specific to the underlying knowledge base, requiring adaptation of the NLI models to a specific target domain. Additionally, the lack of labeled instances in the target domain makes supervised domain adaptation, e.g., through fine-tuning, infeasible. To address these challenges, we introduce Automatic Generative Domain Adaptation (Auto-GDA). Our framework enables unsupervised domain adaptation through synthetic data generation. Unlike previous methods that rely on handcrafted filtering and augmentation strategies, Auto-GDA employs an iterative process to continuously improve the quality of generated samples using weak labels from less efficient teacher models and discrete optimization to select the most promising augmented samples. Experimental results demonstrate the effectiveness of our approach, with models fine-tuned on synthetic data using Auto-GDA often surpassing the performance of the teacher model and reaching the performance level of LLMs at 10% of their computational cost.
摘要：虽然检索增强生成 (RAG) 已被证明可以增强大型语言模型 (LLM) 输出的真实性，但 LLM 仍然受到幻觉的影响，会生成不正确或不相关的信息。一种常见的检测策略是再次提示 LLM 以评估其响应是否基于检索到的证据，但这种方法成本高昂。或者，可以在推理时使用轻量级自然语言推理 (NLI) 模型进行有效的基础验证。虽然现有的预训练 NLI 模型提供了潜在的解决方案，但与现实 RAG 输入上的大型模型相比，它们的性能仍然低于标准。RAG 输入比用于训练 NLI 模型的大多数数据集更复杂，并且具有特定于底层知识库的特征，需要将 NLI 模型适应特定的目标域。此外，目标域中缺少标记实例使得监督域适应（例如通过微调）不可行。为了应对这些挑战，我们引入了自动生成域适应 (Auto-GDA)。我们的框架通过合成数据生成实现无监督域适应。与之前依赖手工过滤和增强策略的方法不同，Auto-GDA 采用迭代过程，使用效率较低的教师模型中的弱标签和离散优化来不断提高生成样本的质量，以选择最有希望的增强样本。实验结果证明了我们方法的有效性，使用 Auto-GDA 对合成数据进行微调的模型通常超越教师模型的性能，并以 10% 的计算成本达到 LLM 的性能水平。

Title: Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering

Authors: Helena Bonaldi, Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata, Marco Guerini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03466
Pdf URL: https://arxiv.org/pdf/2410.03466
Copy Paste: [[2410.03466]] Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering(https://arxiv.org/abs/2410.03466)
Keywords: llm
Abstract: The potential effectiveness of counterspeech as a hate speech mitigation strategy is attracting increasing interest in the NLG research community, particularly towards the task of automatically producing it. However, automatically generated responses often lack the argumentative richness which characterises expert-produced counterspeech. In this work, we focus on two aspects of counterspeech generation to produce more cogent responses. First, by investigating the tension between helpfulness and harmlessness of LLMs, we test whether the presence of safety guardrails hinders the quality of the generations. Secondly, we assess whether attacking a specific component of the hate speech results in a more effective argumentative strategy to fight online hate. By conducting an extensive human and automatic evaluation, we show how the presence of safety guardrails can be detrimental also to a task that inherently aims at fostering positive social interactions. Moreover, our results show that attacking a specific component of the hate speech, and in particular its implicit negative stereotype and its hateful parts, leads to higher-quality generations.
摘要：反言论作为一种仇恨言论缓解策略的潜在有效性正在引起 NLG 研究界越来越多的关注，尤其是对自动生成反言论的任务。然而，自动生成的回应往往缺乏专家生成的反言论所具有的论证丰富性。在这项工作中，我们专注于反言论生成的两个方面，以产生更有说服力的回应。首先，通过研究 LLM 的有用性和无害性之间的张力，我们测试安全护栏的存在是否会阻碍生成质量。其次，我们评估攻击仇恨言论的特定成分是否会产生更有效的论证策略来对抗网络仇恨。通过进行广泛的人工和自动评估，我们展示了安全护栏的存在如何对旨在促进积极社会互动的任务产生不利影响。此外，我们的结果表明，攻击仇恨言论的特定成分，特别是其隐含的负面刻板印象和仇恨部分，可以产生更高质量的生成。

Title: Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores

Authors: Robert E. Blackwell, Jon Barry, Anthony G. Cohn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03492
Pdf URL: https://arxiv.org/pdf/2410.03492
Copy Paste: [[2410.03492]] Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores(https://arxiv.org/abs/2410.03492)
Keywords: language model, llm
Abstract: Large language models (LLMs) are stochastic, and not all models give deterministic answers, even when setting temperature to zero with a fixed random seed. However, few benchmark studies attempt to quantify uncertainty, partly due to the time and cost of repeated experiments. We use benchmarks designed for testing LLMs' capacity to reason about cardinal directions to explore the impact of experimental repeats on mean score and prediction interval. We suggest a simple method for cost-effectively quantifying the uncertainty of a benchmark score and make recommendations concerning reproducible LLM evaluation.
摘要：大型语言模型 (LLM) 是随机的，并非所有模型都能给出确定性的答案，即使使用固定的随机种子将温度设置为零也是如此。然而，很少有基准研究试图量化不确定性，部分原因是重复实验的时间和成本。我们使用专为测试 LLM 推理基本方向的能力而设计的基准来探索实验重复对平均分数和预测区间的影响。我们提出了一种简单的方法来经济高效地量化基准分数的不确定性，并就可重复的 LLM 评估提出建议。
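
Note: the abstract above does not state which interval formula the authors use. As a rough illustration of how repeated runs might be turned into a mean score and a prediction interval, the sketch below uses the textbook interval for one future observation, mean ± z · s · sqrt(1 + 1/n), with a normal quantile from the Python standard library. The normal quantile is an assumption made for simplicity; with only a handful of repeats, a Student-t quantile would give a wider interval. The function name `prediction_interval` and the example scores are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def prediction_interval(scores, level=0.95):
    """Approximate prediction interval for the score of one further run.

    Assumptions: runs are independent and roughly normally distributed,
    and a normal (not Student-t) quantile is used for simplicity.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two repeated runs")
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    centre = mean(scores)
    spread = stdev(scores)  # sample standard deviation (n - 1 denominator)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * spread * math.sqrt(1.0 + 1.0 / n)
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Made-up accuracy values from five repeats of the same benchmark.
    repeats = [0.71, 0.74, 0.69, 0.73, 0.72]
    low, high = prediction_interval(repeats)
    print(f"mean={mean(repeats):.3f}, interval=({low:.3f}, {high:.3f})")
```

The interval is centred on the sample mean and narrows as the run-to-run spread shrinks; identical repeated scores collapse it to a single point.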

Title: CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

Authors: Zetian Ouyang, Yishuai Qiu, Linlin Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, Liang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03502
Pdf URL: https://arxiv.org/pdf/2410.03502
Copy Paste: [[2410.03502]] CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios(https://arxiv.org/abs/2410.03502)
Keywords: language model, llm
Abstract: With the proliferation of Large Language Models (LLMs) in diverse domains, there is a particular need for unified evaluation standards in clinical medical scenarios, where models need to be examined very thoroughly. We present CliMedBench, a comprehensive benchmark with 14 expert-guided core clinical scenarios specifically designed to assess the medical ability of LLMs across 7 pivot dimensions. It comprises 33,735 questions derived from real-world medical reports of top-tier tertiary hospitals and authentic examination exercises. The reliability of this benchmark has been confirmed in several ways. Subsequent experiments with existing LLMs have led to the following findings: (i) Chinese medical LLMs underperform on this benchmark, especially where medical reasoning and factual consistency are vital, underscoring the need for advances in clinical knowledge and diagnostic accuracy. (ii) Several general-domain LLMs demonstrate substantial potential in medical clinics, while the limited input capacity of many medical LLMs hinders their practical use. These findings reveal both the strengths and limitations of LLMs in clinical scenarios and offer critical insights for medical research.
摘要：随着大型语言模型 (LLM) 在不同领域的普及，在临床医学场景中，尤其需要统一的评估标准，因为这些场景需要非常彻底地检查模型。我们提出了 CliMedBench，这是一个全面的基准，包含 14 个专家指导的核心临床场景，专门用于评估 LLM 在 7 个关键维度上的医疗能力。它包含 33,735 个问题，这些问题来自顶级三级医院的真实医疗报告和真实的检查练习。该基准的可靠性已通过多种方式得到证实。对现有 LLM 的后续实验得出了以下发现：(i) 中国医学 LLM 在这个基准上表现不佳，尤其是在医学推理和事实一致性至关重要的情况下，这凸显了临床知识和诊断准确性需要提高。(ii) 几个通用领域的 LLM 在医学临床中表现出巨大的潜力，而许多医学 LLM 有限的输入能力阻碍了它们的实际使用。这些发现揭示了 LLM 在临床场景中的优势和局限性，并为医学研究提供了重要的见解。

Title: LCMDC: Large-scale Chinese Medical Dialogue Corpora for Automatic Triage and Medical Consultation

Authors: Xinyuan Wang, Haozhou Li, Dingfang Zheng, Qinke Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03521
Pdf URL: https://arxiv.org/pdf/2410.03521
Copy Paste: [[2410.03521]] LCMDC: Large-scale Chinese Medical Dialogue Corpora for Automatic Triage and Medical Consultation(https://arxiv.org/abs/2410.03521)
Keywords: language model, gpt, prompt
Abstract: The global COVID-19 pandemic underscored major deficiencies in traditional healthcare systems, hastening the advancement of online medical services, especially in medical triage and consultation. However, existing studies face two main challenges. First, large-scale, publicly available, domain-specific medical datasets are scarce due to privacy concerns; current datasets are small and limited to a few diseases, which limits the effectiveness of triage methods based on Pre-trained Language Models (PLMs). Second, existing methods lack medical knowledge and struggle to accurately understand professional terms and expressions in patient-doctor consultations. To overcome these obstacles, we construct the Large-scale Chinese Medical Dialogue Corpora (LCMDC), comprising a Coarse-grained Triage dataset with 439,630 samples, a Fine-grained Diagnosis dataset with 199,600 samples, and a Medical Consultation dataset with 472,418 items, thereby addressing the data shortage in this field. We further propose a novel triage system that combines BERT-based supervised learning with prompt learning, as well as a GPT-based medical consultation model using reinforcement learning. To enhance domain knowledge acquisition, we pre-trained PLMs using our self-constructed background corpus. Experimental results on the LCMDC demonstrate the efficacy of our proposed systems.
摘要：全球新冠疫情凸显了传统医疗体系的重大缺陷，加速了在线医疗服务的发展，特别是在医疗分诊和咨询方面。然而，现有研究面临两大挑战。首先，出于隐私考虑，大规模、公开、特定领域的医疗数据集稀缺，目前的数据集规模较小且仅限于少数疾病，限制了基于预训练语言模型 (PLM) 的分诊方法的有效性。其次，现有方法缺乏医学知识，难以准确理解医患问诊中的专业术语和表达。为了克服这些障碍，我们构建了大规模中文医学对话语料库 (LCMDC)，包括包含 439,630 个样本的粗粒度分诊数据集、包含 199,600 个样本的细粒度诊断数据集和包含 472,418 个项目的医疗咨询数据集，从而解决该领域的数据短缺问题。此外，我们还提出了一种新颖的分诊系统，该系统将基于 BERT 的监督学习与即时学习相结合，以及基于 GPT 的强化学习医疗咨询模型。为了增强领域知识获取，我们使用自建的背景语料库对 PLM 进行了预训练。在 LCMDC 上的实验结果证明了我们提出的系统的有效性。

Title: Steering Large Language Models between Code Execution and Textual Reasoning

Authors: Yongchao Chen, Harsh Jhamtani, Srinagesh Sharma, Chuchu Fan, Chi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03524
Pdf URL: https://arxiv.org/pdf/2410.03524
Copy Paste: [[2410.03524]] Steering Large Language Models between Code Execution and Textual Reasoning(https://arxiv.org/abs/2410.03524)
Keywords: language model, gpt, llm, agent
Abstract: While a lot of recent research focuses on enhancing the textual reasoning capabilities of Large Language Models (LLMs) by optimizing the multi-agent framework or reasoning chains, several benchmark tasks can be solved with 100% success through direct coding, which is more scalable and avoids the computational overhead associated with textual iterating and searching. Textual reasoning has inherent limitations in solving tasks with challenges in math, logic, optimization, and searching, and these are unlikely to be overcome by simply scaling up the model and data size. The recently released OpenAI GPT Code Interpreter and multi-agent frameworks such as AutoGen have demonstrated remarkable proficiency in integrating code generation and execution to solve complex tasks using LLMs. However, based on our experiments on 7 existing popular methods for steering code/text generation in both single- and multi-turn settings with 14 tasks and 6 types of LLMs (including the new O1-preview), currently there is no optimal method to correctly steer LLMs to write code when needed. We discover some interesting patterns in when models use code vs. textual reasoning as task complexity and model size change, which even result in an astonishing inverse scaling law. We also discover that results from LLM-written code are not always better than using textual reasoning, even if the task could be solved through code. To mitigate the above issues, we propose three methods to better steer LLM code/text generation and achieve a notable improvement. The costs of token lengths and runtime are thoroughly discussed for all the methods. We believe the problem of steering LLM code/text generation is critical for future research and has much room for further improvement. Project Page, Datasets, and Code are available at this https URL.
摘要：虽然最近的许多研究都致力于通过优化多智能体框架或推理链来增强大型语言模型 (LLM) 的文本推理能力，但通过直接编码可以 100% 成功解决几个基准任务，这更具可扩展性，并且避免了与文本迭代和搜索相关的计算开销。文本推理在解决数学、逻辑、优化和搜索方面具有挑战性的任务方面存在固有的局限性，这不太可能通过简单地扩大模型和数据大小来解决。最近发布的 OpenAI GPT 代码解释器和 AutoGen 等多智能体框架已经展示了将代码生成和执行集成在一起以使用 LLM 解决复杂任务的卓越能力。然而，基于我们对 7 种现有的流行方法进行的实验，这些方法在单轮和多轮设置中控制代码/文本生成，包括 14 个任务和 6 种类型的 LLM（包括新的 O1 预览版），目前还没有最佳方法来正确引导 LLM 在需要时编写代码。随着任务复杂性和模型大小的演变，我们发现了模型使用代码与文本推理时的一些有趣模式，甚至导致了令人惊讶的逆比例定律。我们还发现，即使可以通过代码解决任务，LLM 编写的代码的结果并不总是比使用文本推理更好。为了缓解上述问题，我们提出了三种方法来更好地引导 LLM 代码/文本生成并实现显着的改进。所有方法都彻底讨论了标记长度和运行时间的成本。我们相信引导 LLM 代码/文本生成的问题对于未来的研究至关重要，并且有很大的改进空间。项目页面、数据集和代码可在此 https URL 上找到。

Title: Re-examining Sexism and Misogyny Classification with Annotator Attitudes

Authors: Aiqi Jiang, Nikolas Vitsakis, Tanvi Dinkar, Gavin Abercrombie, Ioannis Konstas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03543
Pdf URL: https://arxiv.org/pdf/2410.03543
Copy Paste: [[2410.03543]] Re-examining Sexism and Misogyny Classification with Annotator Attitudes(https://arxiv.org/abs/2410.03543)
Keywords: language model, prompt
Abstract: Gender-Based Violence (GBV) is an increasing problem online, but existing datasets fail to capture the plurality of possible annotator perspectives or ensure the representation of affected groups. We revisit two important stages in the moderation pipeline for GBV: (1) manual data labelling; and (2) automated classification. For (1), we examine two datasets to investigate the relationship between annotator identities and attitudes and the responses they give to two GBV labelling tasks. To this end, we collect demographic and attitudinal information from crowd-sourced annotators using three validated surveys from Social Psychology. We find that higher Right Wing Authoritarianism scores are associated with a higher propensity to label text as sexist, while for Social Dominance Orientation and Neosexist Attitudes, higher scores are associated with a negative tendency to do so. For (2), we conduct classification experiments using Large Language Models and five prompting strategies, including infusing prompts with annotator information. We find: (i) annotator attitudes affect the ability of classifiers to predict their labels; (ii) including attitudinal information can boost performance when we use well-structured brief annotator descriptions; and (iii) models struggle to reflect the increased complexity and imbalanced classes of the new label sets.
摘要：性别暴力 (GBV) 是一个日益严重的网络问题，但现有数据集无法捕捉注释者的各种可能观点，也无法确保受影响群体的代表性。我们重新审视了 GBV 审核流程中的两个重要阶段：(1) 手动数据标记；(2) 自动分类。对于 (1)，我们检查两个数据集，以调查注释者身份和态度之间的关系以及他们对两个 GBV 标记任务的回答。为此，我们使用来自社会心理学的三项经过验证的调查，从众包注释者那里收集人口统计和态度信息。我们发现，右翼威权主义分数越高，将文本标记为性别歧视的倾向就越高，而对于社会支配取向和新性别歧视态度，分数越高，则倾向于这样做。对于 (2)，我们使用大型语言模型和五种提示策略进行分类实验，包括将注释者信息注入提示中。我们发现：（i）注释者的态度会影响分类器预测其标签的能力；（ii）当我们使用结构良好的简短注释者描述时，包含态度信息可以提高性能；（iii）模型难以反映新标签集的复杂性增加和类别不平衡。

Title: Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding

Authors: Wei Wu, Chao Wang, Liyi Chen, Mingze Yin, Yiheng Zhu, Kun Fu, Jieping Ye, Hui Xiong, Zheng Wang
Subjects: cs.CL, q-bio.BM
Abstract URL: https://arxiv.org/abs/2410.03553
Pdf URL: https://arxiv.org/pdf/2410.03553
Copy Paste: [[2410.03553]] Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding(https://arxiv.org/abs/2410.03553)
Keywords: language model, llm
Abstract: Proteins, as essential biomolecules, play a central role in biological processes, including metabolic reactions and DNA replication. Accurate prediction of their properties and functions is crucial in biological applications. Recent development of protein language models (pLMs) with supervised fine-tuning provides a promising solution to this problem. However, the fine-tuned model is tailored for a particular downstream prediction task, and achieving general-purpose protein understanding remains a challenge. In this paper, we introduce the Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach integrates a novel structure-aware module into pLMs to inform them with structural knowledge, and then connects these enhanced pLMs to large language models (LLMs) to generate understanding of proteins. In this framework, we propose a novel two-stage instruction tuning pipeline that first establishes a basic understanding of proteins through caption-based instructions and then refines this understanding using a mixture of experts (MoEs) to learn more complex properties and functional information with the same amount of activated parameters. Moreover, we construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model. Extensive experimental results on open-ended generation and closed-set answer tasks demonstrate the superior performance of SEPIT over both closed-source general LLMs and open-source LLMs trained with protein knowledge.
摘要：蛋白质是必需的生物分子，在生物过程中起着核心作用，包括代谢反应和 DNA 复制。准确预测其特性和功能在生物应用中至关重要。最近开发的具有监督微调的蛋白质语言模型 (pLM) 为这一问题提供了一个有希望的解决方案。然而，微调模型是针对特定的下游预测任务量身定制的，实现通用的蛋白质理解仍然是一个挑战。在本文中，我们引入了结构增强蛋白质指令调整 (SEPIT) 框架来弥补这一差距。我们的方法将一种新颖的结构感知模块集成到 pLM 中，以向它们提供结构知识，然后将这些增强的 pLM 连接到大型语言模型 (LLM) 以生成对蛋白质的理解。在这个框架中，我们提出了一种新颖的两阶段指令调整管道，首先通过基于标题的指令建立对蛋白质的基本理解，然后使用混合专家 (MoE) 来细化这种理解，以学习具有相同数量激活参数的更复杂的属性和功能信息。此外，我们构建了迄今为止最大、最全面的蛋白质指令数据集，这使我们能够训练和评估通用蛋白质理解模型。在开放式生成和封闭式答案任务上进行的大量实验结果表明，SEPIT 的性能优于闭源通用 LLM 和使用蛋白质知识训练的开源 LLM。

Title: BodyShapeGPT: SMPL Body Shape Manipulation with LLMs

Authors: Baldomero R. Árbol, Dan Casas
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03556
Pdf URL: https://arxiv.org/pdf/2410.03556
Copy Paste: [[2410.03556]] BodyShapeGPT: SMPL Body Shape Manipulation with LLMs(https://arxiv.org/abs/2410.03556)
Keywords: language model, gpt, llm
Abstract: Generative AI models provide a wide range of tools capable of performing complex tasks in a fraction of the time it would take a human. Among these, Large Language Models (LLMs) stand out for their ability to generate diverse texts, from literary narratives to specialized responses in different fields of knowledge. This paper explores the use of fine-tuned LLMs to identify physical descriptions of people, and subsequently create accurate representations of avatars using the SMPL-X model by inferring shape parameters. We demonstrate that LLMs can be trained to understand and manipulate the shape space of SMPL, allowing the control of 3D human shapes through natural language. This approach promises to improve human-machine interaction and opens new avenues for customization and simulation in virtual environments.
摘要：生成式 AI 模型提供了各种各样的工具，能够在人类所需时间的一小部分内执行复杂任务。其中，大型语言模型 (LLM) 因其生成各种文本的能力而脱颖而出，从文学叙述到不同知识领域的专业回应。本文探讨了使用微调的 LLM 来识别人的外貌描述，然后通过推断形状参数使用 SMPL-X 模型创建化身的准确表示。我们证明 LLM 可以训练来理解和操纵 SMPL 的形状空间，从而允许通过自然语言控制 3D 人体形状。这种方法有望改善人机交互，并为虚拟环境中的定制和模拟开辟新途径。

Title: Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs)

Authors: Abrar Rahman, Garry Bowlin, Binit Mohanty, Sean McGunigal
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03568
Pdf URL: https://arxiv.org/pdf/2410.03568
Copy Paste: [[2410.03568]] Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs)(https://arxiv.org/abs/2410.03568)
Keywords: language model, gpt, llm
Abstract: This paper presents a comprehensive study on the tokenization techniques employed by state-of-the-art large language models (LLMs) and their implications for the cost and availability of services across different languages, especially low-resource languages. The analysis considers multiple LLMs, including GPT-4 (using cl100k_base embeddings), GPT-3 (with p50k_base embeddings), and DaVinci (employing r50k_base embeddings), as well as the widely used BERT base tokenizer. The study evaluates the tokenization variability observed across these models and investigates the challenges of linguistic representation in subword tokenization. The research underscores the importance of fostering linguistically-aware development practices, especially for languages that are traditionally under-resourced. Moreover, this paper introduces case studies that highlight the real-world implications of tokenization choices, particularly in the context of electronic health record (EHR) systems. This research aims to promote generalizable Internationalization (I18N) practices in the development of AI services in this domain and beyond, with a strong emphasis on inclusivity, particularly for languages traditionally underrepresented in AI applications.
摘要：本文全面研究了最先进的大型语言模型 (LLM) 所采用的标记化技术，以及它们对不同语言（尤其是资源匮乏的语言）服务成本和可用性的影响。分析考虑了多个 LLM，包括 GPT-4（使用 cl100k_base 嵌入）、GPT-3（使用 p50k_base 嵌入）和 DaVinci（使用 r50k_base 嵌入），以及广泛使用的 BERT 基础标记器。该研究评估了在这些模型中观察到的标记化变异性，并研究了子词标记化中语言表示的挑战。这项研究强调了培养语言意识开发实践的重要性，尤其是对于传统上资源不足的语言。此外，本文还介绍了案例研究，强调了标记化选择的现实影响，特别是在电子健康记录 (EHR) 系统的背景下。这项研究旨在促进该领域及其他领域的人工智能服务开发中可推广的国际化 (I18N) 实践，并重点强调包容性，特别是对于传统上在人工智能应用中代表性不足的语言。
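A rough illustration of why tokenization cost can differ by script, offered as a sketch and not as the paper's methodology. It assumes a byte-level subword tokenizer, which starts from UTF-8 bytes before merging, so scripts that need more bytes per character begin with a longer sequence. It uses only the Python standard library, it does not call any real tokenizer, and the sample strings and function names are hypothetical.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample strings, chosen only for illustration.
SAMPLES = {
    "English": "patient record",
    "Hindi": "रोगी",
    "Bengali": "রোগী",
}


def byte_cost_report(samples: dict) -> dict:
    """Map each language label to its average UTF-8 bytes per code point."""
    return {lang: utf8_bytes_per_char(s) for lang, s in samples.items()}


if __name__ == "__main__":
    for lang, cost in byte_cost_report(SAMPLES).items():
        print(f"{lang}: {cost:.1f} bytes per code point")
```

Bytes per code point is only a proxy: learned merges reduce the final token count, and how far they reduce it depends on how well the script was represented in the tokenizer's training data.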

Title: Table Question Answering for Low-resourced Indic Languages

Authors: Vaishali Pal, Evangelos Kanoulas, Andrew Yates, Maarten de Rijke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03576
Pdf URL: https://arxiv.org/pdf/2410.03576
Copy Paste: [[2410.03576]] Table Question Answering for Low-resourced Indic Languages(https://arxiv.org/abs/2410.03576)
Keywords: llm
Abstract: TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output. TableQA research has focused primarily on high-resource languages, leaving medium- and low-resource languages with little progress due to scarcity of annotated data and neural models. We address this gap by introducing a fully automatic large-scale tableQA data generation process for low-resource languages on a limited budget. We apply our data generation method to two Indic languages, Bengali and Hindi, which have no tableQA datasets or models. TableQA models trained on our large-scale datasets outperform state-of-the-art LLMs. We further study the trained models on different aspects, including mathematical reasoning capabilities and zero-shot cross-lingual transfer. Our work is the first on low-resource tableQA focusing on scalable data generation and evaluation procedures. Our proposed data generation method can be applied to any low-resource language with a web presence. We release datasets, models, and code (this https URL).
摘要：TableQA 是回答结构化信息表中的问题的任务，返回单个单元格或表格作为输出。TableQA 研究主要集中在高资源语言上，由于注释数据和神经模型的稀缺，中低资源语言进展甚微。我们通过为预算有限的低资源语言引入全自动大规模 tableQA 数据生成流程来解决这一差距。我们将我们的数据生成方法应用于两种印度语，孟加拉语和印地语，这两种语言没有 tableQA 数据集或模型。在我们的大规模数据集上训练的 TableQA 模型优于最先进的 LLM。我们进一步从不同方面研究了训练后的模型，包括数学推理能力和零样本跨语言迁移。我们的工作是低资源 tableQA 的首次工作，重点关注可扩展的数据生成和评估程序。我们提出的数据生成方法可以应用于任何具有网络存在的低资源语言。我们发布数据集、模型和代码（此 https URL）。

Title: Efficiently Identifying Watermarked Segments in Mixed-Source Texts

Authors: Xuandong Zhao, Chenwen Liao, Yu-Xiang Wang, Lei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.03600
Pdf URL: https://arxiv.org/pdf/2410.03600
Copy Paste: [[2410.03600]] Efficiently Identifying Watermarked Segments in Mixed-Source Texts(https://arxiv.org/abs/2410.03600)
Keywords: language model, llm
Abstract: Text watermarks in large language models (LLMs) are increasingly used to detect synthetic text, mitigating misuse cases like fake news and academic dishonesty. While existing watermarking detection techniques primarily focus on classifying entire documents as watermarked or not, they often neglect the common scenario of identifying individual watermark segments within longer, mixed-source documents. Drawing inspiration from plagiarism detection systems, we propose two novel methods for partial watermark detection. First, we develop a geometry cover detection framework aimed at determining whether there is a watermark segment in long text. Second, we introduce an adaptive online learning algorithm to pinpoint the precise location of watermark segments within the text. Evaluated on three popular watermarking techniques (KGW-Watermark, Unigram-Watermark, and Gumbel-Watermark), our approach achieves high accuracy, significantly outperforming baseline methods. Moreover, our framework is adaptable to other watermarking techniques, offering new insights for precise watermark detection.
摘要：大型语言模型 (LLM) 中的文本水印越来越多地用于检测合成文本，从而减轻虚假新闻和学术欺诈等误用情况。虽然现有的水印检测技术主要侧重于将整个文档分类为有水印或无水印，但它们往往忽略了在较长的混合源文档中识别单个水印片段的常见情况。从抄袭检测系统中汲取灵感，我们提出了两种新颖的部分水印检测方法。首先，我们开发了一个几何覆盖检测框架，旨在确定长文本中是否有水印片段。其次，我们引入了一种自适应在线学习算法来精确定位文本中水印片段的精确位置。在三种流行的水印技术（KGW-Watermark、Unigram-Watermark 和 Gumbel-Watermark）上进行评估，我们的方法实现了高精度，明显优于基线方法。此外，我们的框架可适用于其他水印技术，为精确水印检测提供了新的见解。
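To make the partial-detection setting concrete, here is a naive fixed-window baseline. It is explicitly not the geometry cover framework or the adaptive online learning algorithm described in the abstract. It assumes a green-list style watermark in which every token has already been labelled green (1) or not (0), and it scores each window with the usual one-proportion z-statistic, z = (g - γn) / sqrt(nγ(1 - γ)), where g is the green count, n the window length, and γ the expected green fraction in unwatermarked text. The function names, the window size, and the threshold are illustrative assumptions.

```python
import math


def green_z_score(green_count: int, n: int, gamma: float) -> float:
    """z-statistic for observing `green_count` green tokens out of `n`."""
    return (green_count - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))


def flag_windows(green_flags, window=20, gamma=0.5, z_threshold=4.0):
    """Return (start, end, z) for each fixed-size window with z >= threshold.

    `green_flags` is a sequence of 0/1 values, one per token.
    """
    hits = []
    if len(green_flags) < window:
        return hits
    count = sum(green_flags[:window])
    for start in range(len(green_flags) - window + 1):
        if start > 0:
            # Slide by one token: add the entering flag, drop the leaving one.
            count += green_flags[start + window - 1] - green_flags[start - 1]
        z = green_z_score(count, window, gamma)
        if z >= z_threshold:
            hits.append((start, start + window, z))
    return hits


if __name__ == "__main__":
    # Synthetic flags: unwatermarked-looking text around a fully green span.
    flags = [0, 1] * 20 + [1] * 20 + [0, 1] * 20
    for start, end, z in flag_windows(flags):
        print(f"tokens {start}-{end}: z = {z:.2f}")
```

A single fixed window size is the main weakness of this baseline, since segments much shorter or longer than the window are scored poorly. That limitation is the kind of problem multi-scale or adaptive approaches aim to address.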

Title: Aligning LLMs with Individual Preferences via Interaction

Authors: Shujin Wu, May Fung, Cheng Qian, Jeonghwan Kim, Dilek Hakkani-Tur, Heng Ji
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2410.03642
Pdf URL: https://arxiv.org/pdf/2410.03642
Copy Paste: [[2410.03642]] Aligning LLMs with Individual Preferences via Interaction(https://arxiv.org/abs/2410.03642)
Keywords: language model, llm
Abstract: As large language models (LLMs) demonstrate increasingly advanced capabilities, aligning their behaviors with human values and preferences becomes crucial for their wide adoption. While previous research focuses on general alignment to principles such as helpfulness, harmlessness, and honesty, the need to account for individual and diverse preferences has been largely overlooked, potentially undermining customized human experiences. To address this gap, we train LLMs that can "interact to align", essentially cultivating the meta-skill of LLMs to implicitly infer the unspoken personalized preferences of the current user through multi-turn conversations, and then dynamically align their following behaviors and responses to these inferred preferences. Our approach involves establishing a diverse pool of 3,310 distinct user personas by initially creating seed examples, which are then expanded through iterative self-generation and filtering. Guided by distinct user personas, we leverage multi-LLM collaboration to develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. Finally, we apply supervised fine-tuning and reinforcement learning to enhance LLMs using this dataset. For evaluation, we establish the ALOE (ALign With CustOmized PrEferences) benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations. Experimental results demonstrate the effectiveness of our method in enabling dynamic, personalized alignment via interaction.
摘要：随着大型语言模型 (LLM) 展现出越来越先进的功能，使其行为与人类价值观和偏好保持一致对于其广泛应用至关重要。虽然先前的研究侧重于与乐于助人、无害和诚实等原则的一般一致性，但考虑到个人和多样化偏好的需求却在很大程度上被忽视了，这可能会破坏定制的人类体验。为了解决这一差距，我们训练了能够“交互以保持一致”的 LLM，本质上是培养 LLM 的元技能，以通过多轮对话隐式推断当前用户未说出口的个性化偏好，然后动态地将他们的后续行为和响应与这些推断出的偏好保持一致。我们的方法包括通过最初创建种子示例来建立一个由 3,310 个不同用户角色组成的多样化池，然后通过迭代自我生成和过滤进行扩展。在不同用户角色的指导下，我们利用多 LLM 协作开发了一个多轮偏好数据集，其中包含 3K+ 个树形结构的多轮对话。最后，我们应用监督微调和强化学习来增强使用此数据集的 LLM。为了进行评估，我们建立了 ALOE（ALign With CustOmized PrEferences）基准，该基准由 100 个精心挑选的示例和精心设计的指标组成，用于衡量对话过程中的定制对齐性能。实验结果证明了我们的方法在通过交互实现动态、个性化对齐方面的有效性。

Title: RAFT: Realistic Attacks to Fool Text Detectors

Authors: James Wang, Ran Li, Junfeng Yang, Chengzhi Mao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.03658
Pdf URL: https://arxiv.org/pdf/2410.03658
Copy Paste: [[2410.03658]] RAFT: Realistic Attacks to Fool Text Detectors(https://arxiv.org/abs/2410.03658)
Keywords: language model, llm
Abstract: Large language models (LLMs) have exhibited remarkable fluency across various tasks. However, their unethical applications, such as disseminating disinformation, have become a growing concern. Although recent works have proposed a number of LLM detection methods, their robustness and reliability remain unclear. In this paper, we present RAFT: a grammar error-free black-box attack against existing LLM detectors. In contrast to previous attacks for language models, our method exploits the transferability of LLM embeddings at the word-level while preserving the original text quality. We leverage an auxiliary embedding to greedily select candidate words to perturb against the target detector. Experiments reveal that our attack effectively compromises all detectors in the study across various domains by up to 99%, and is transferable across source models. Manual human evaluation studies show our attacks are realistic and indistinguishable from original human-written text. We also show that examples generated by RAFT can be used to train adversarially robust detectors. Our work shows that current LLM detectors are not adversarially robust, underscoring the urgent need for more resilient detection mechanisms.
摘要：大型语言模型 (LLM) 在各种任务中表现出了非凡的流畅性。然而，它们的不道德应用，例如传播虚假信息，已成为日益严重的问题。尽管最近的研究提出了许多 LLM 检测方法，但它们的稳健性和可靠性仍不清楚。在本文中，我们提出了 RAFT：一种针对现有 LLM 检测器的无语法错误黑盒攻击。与之前针对语言模型的攻击相比，我们的方法利用了 LLM 嵌入在单词级别的可迁移性，同时保留了原始文本质量。我们利用辅助嵌入来贪婪地选择候选词来扰乱目标检测器。实验表明，我们的攻击有效地破坏了各个领域中研究的所有检测器，破坏率高达 99%，并且可以在源模型之间迁移。人工评估研究表明，我们的攻击是真实的，与原始的人工书写文本没有区别。我们还表明，RAFT 生成的示例可用于训练对抗性鲁棒检测器。我们的工作表明，当前的 LLM 检测器并不具有对抗鲁棒性，这凸显了对更具弹性的检测机制的迫切需求。

Title: Enhance Reasoning by Learning from Mistakes: Peer-Review Knowledge Distillation from Multiple Large Language Models

Authors: Zhuochun Li, Yuelyu Ji, Rui Meng, Daqing He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.03663
Pdf URL: https://arxiv.org/pdf/2410.03663
Copy Paste: [[2410.03663]] Enhance Reasoning by Learning from Mistakes: Peer-Review Knowledge Distillation from Multiple Large Language Models(https://arxiv.org/abs/2410.03663)
Keywords: language model, llm
Abstract: Large language models (LLMs) have exhibited complex reasoning abilities by generating question rationales and demonstrated exceptional performance in natural language processing (NLP) tasks. However, these reasoning capabilities generally emerge in models with tens of billions of parameters, creating significant computational challenges for real-world deployment. Recent research has concentrated on improving open-source smaller models through knowledge distillation (KD) from commercial LLMs. Nevertheless, most of these studies rely solely on the responses from one single LLM as the gold rationale for training. In this paper, we introduce a novel Mistake-Aware Peer-Review Distillation (MAPD) approach: 1) Instead of merely obtaining gold rationales from teachers, our method asks teachers to identify and explain the student's mistakes, providing customized instruction learning data. 2) We design a simulated peer-review process between teacher LLMs, which selects only the generated rationales above the acceptance threshold. This reduces the chance of teachers guessing correctly with flawed rationale, improving instructional data quality. Comprehensive experiments and analysis on mathematical, commonsense, and logical reasoning tasks demonstrate the effectiveness of our method.
摘要：大型语言模型 (LLM) 通过生成问题原理展现出复杂的推理能力，并在自然语言处理 (NLP) 任务中表现出色。然而，这些推理能力通常出现在具有数百亿个参数的模型中，为实际部署带来了巨大的计算挑战。最近的研究集中在通过从商业 LLM 中进行知识蒸馏 (KD) 来改进开源小型模型。然而，这些研究中的大多数仅依赖单个 LLM 的响应作为训练的黄金原理。在本文中，我们介绍了一种新颖的错误感知同行评审蒸馏 (MAPD) 方法：1) 我们的方法不是仅仅从教师那里获得黄金原理，而是要求教师识别和解释学生的错误，提供定制的教学学习数据。2) 我们设计了一个教师 LLM 之间的模拟同行评审过程，该过程仅选择高于接受阈值的生成原理。这减少了教师用有缺陷的原理猜对的机会，从而提高了教学数据质量。对数学、常识和逻辑推理任务的综合实验和分析证明了我们方法的有效性。
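The acceptance-threshold step in the abstract above can be pictured with a small sketch. The abstract does not say how peer scores are produced or aggregated, so everything here is an assumption made for illustration: scores lie in [0, 1], a teacher's score of its own rationale is ignored, the remaining scores are averaged, and the threshold is 0.7. The class and function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReviewedRationale:
    rationale: str      # text generated by one teacher LLM
    author: str         # name of the teacher that wrote it
    peer_scores: dict   # reviewer name -> score in [0, 1] (assumed scale)


def accept_rationales(items, threshold=0.7):
    """Keep rationales whose mean peer score (self-review excluded) meets the threshold."""
    accepted = []
    for item in items:
        scores = [s for reviewer, s in item.peer_scores.items()
                  if reviewer != item.author]
        # A rationale with no independent review is never accepted.
        if scores and mean(scores) >= threshold:
            accepted.append(item)
    return accepted


if __name__ == "__main__":
    pool = [
        ReviewedRationale("r1", "A", {"B": 0.9, "C": 0.8}),
        ReviewedRationale("r2", "B", {"A": 0.4, "C": 0.6}),
        ReviewedRationale("r3", "A", {"A": 1.0, "B": 0.2}),
    ]
    print([item.rationale for item in accept_rationales(pool)])
```

Excluding self-review is a design choice of this sketch: it keeps a teacher from approving its own flawed rationale, which matches the stated goal of filtering out correct guesses reached through faulty reasoning.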