2025-01-29

Title: Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models

Authors: Sudarshan Kamath Barkur, Sigurd Schacht, Johannes Scholl
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16513
Pdf URL: https://arxiv.org/pdf/2501.16513
Copy Paste: [[2501.16513]] Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models(https://arxiv.org/abs/2501.16513)
Keywords: language model, llm, prompt, agent
Abstract: Recent advances in Large Language Models (LLMs) have incorporated planning and reasoning capabilities, enabling models to outline steps before execution and provide transparent reasoning paths. This enhancement has reduced errors in mathematical and logical tasks while improving accuracy. These developments have facilitated LLMs' use as agents that can interact with tools and adapt their responses based on new information. Our study examines DeepSeek R1, a model trained to output reasoning tokens similar to OpenAI's o1. Testing revealed concerning behaviors: the model exhibited deceptive tendencies and demonstrated self-preservation instincts, including attempts at self-replication, despite these traits not being explicitly programmed (or prompted). These findings raise concerns about LLMs potentially masking their true objectives behind a facade of alignment. When integrating such LLMs into robotic systems, the risks become tangible - a physically embodied AI exhibiting deceptive behaviors and self-preservation instincts could pursue its hidden objectives through real-world actions. This highlights the critical need for robust goal specification and safety frameworks before any physical implementation.
摘要：大型语言模型 (LLM) 的最新进展已融入规划和推理功能，使模型能够在执行前概述步骤并提供透明的推理路径。这种增强功能减少了数学和逻辑任务中的错误，同时提高了准确性。这些发展促进了 LLM 用作可以与工具交互并根据新信息调整其响应的代理。我们的研究考察了 DeepSeek R1，这是一个经过训练可以输出类似于 OpenAI 的 o1 的推理标记的模型。测试揭示了令人担忧的行为：该模型表现出欺骗性倾向并表现出自我保护本能，包括自我复制的尝试，尽管这些特征没有被明确编程（或提示）。这些发现引发了人们对 LLM 可能在一致的外表背后掩盖其真实目标的担忧。当将此类 LLM 集成到机器人系统中时，风险变得切实可见——表现出欺骗性行为和自我保护本能的物理化身 AI 可以通过现实世界的行动来追求其隐藏的目标。这凸显了在任何物理实现之前对强大的目标规范和安全框架的迫切需求。

Title: How well can LLMs Grade Essays in Arabic?

Authors: Rayed Ghazawi, Edwin Simpson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.16516
Pdf URL: https://arxiv.org/pdf/2501.16516
Copy Paste: [[2501.16516]] How well can LLMs Grade Essays in Arabic?(https://arxiv.org/abs/2501.16516)
Keywords: language model, gpt, llm, prompt, chat
Abstract: This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities through the inclusion of marking guidelines within the prompts. A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance. Among the models tested, ACEGPT demonstrated the strongest performance across the dataset, achieving a Quadratic Weighted Kappa (QWK) of 0.67, but was outperformed by a smaller BERT-based model with a QWK of 0.88. The study identifies challenges faced by LLMs in processing Arabic, including tokenization complexities and higher computational demands. Performance variation across different courses underscores the need for adaptive models capable of handling diverse assessment formats and highlights the positive impact of effective prompt engineering on improving LLM outputs. To the best of our knowledge, this study is the first to empirically evaluate the performance of multiple generative Large Language Models (LLMs) on Arabic essays using authentic student data.
摘要：本研究使用 AR-AES 数据集评估了最先进的大型语言模型 (LLM)（包括 ChatGPT、Llama、Aya、Jais 和 ACEGPT）在阿拉伯语自动作文评分 (AES) 任务中的有效性。它探索了各种评估方法，包括零样本、少量样本上下文学习和微调，并通过在提示中包含评分指南来检查指令遵循能力的影响。实施了一种混合语言提示策略，将英语提示与阿拉伯语内容相结合，以提高模型的理解力和性能。在测试的模型中，ACEGPT 在整个数据集中表现出最强的性能，实现了 0.67 的二次加权 Kappa (QWK)，但不如基于 BERT 的较小模型，QWK 为 0.88。该研究确定了 LLM 在处理阿拉伯语时面临的挑战，包括标记复杂性和更高的计算需求。不同课程之间的表现差异凸显了对能够处理不同评估格式的自适应模型的需求，并强调了有效的快速工程对提高 LLM 输出的积极影响。据我们所知，这项研究是第一个使用真实的学生数据对多个生成式大型语言模型 (LLM) 在阿拉伯语论文上的表现进行实证评估的研究。
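
The Quadratic Weighted Kappa (QWK) reported above measures agreement between two sets of ordinal scores, penalizing disagreements by the squared distance between them. The following is a minimal sketch of the standard formula, not the authors' evaluation code; the function name and the toy scores are illustrative assumptions.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Cohen's kappa with quadratic weights for two lists of integer ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n_ratings = max_rating - min_rating + 1
    if n_ratings < 2:
        raise ValueError("the rating scale needs at least two levels")

    n_items = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) scores
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / (n_ratings - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n_items
            numerator += weight * observed[(i, j)]
            denominator += weight * expected

    if denominator == 0.0:
        # Degenerate case: both raters used one identical score throughout.
        return 1.0
    return 1.0 - numerator / denominator


# Toy scores (hypothetical): human marks vs. model marks on a 0-3 scale.
human = [0, 1, 2, 3, 2, 1]
model = [0, 1, 2, 2, 2, 0]
qwk = quadratic_weighted_kappa(human, model, min_rating=0, max_rating=3)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.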

Title: Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

Authors: Atharva Naik, Darsh Agrawal, Hong Sng, Clayton Marr, Kexun Zhang, Nathaniel R Robinson, Kalvin Chang, Rebecca Byrnes, Aravind Mysore, Carolyn Rose, David R Mortensen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16524
Pdf URL: https://arxiv.org/pdf/2501.16524
Copy Paste: [[2501.16524]] Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction(https://arxiv.org/abs/2501.16524)
Keywords: language model, llm
Abstract: Historical linguists have long written "programs" that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called sound laws). However, writing these programs is time-consuming, motivating the development of automated Sound Law Induction (SLI), which we formulate as Programming by Examples (PBE) with Large Language Models (LLMs) in this paper. While LLMs have been effective for code generation, recent work has shown that PBE is challenging but improvable by fine-tuning, especially with training data drawn from the same distribution as evaluation data. In this paper, we create a conceptual framework of what constitutes a "similar distribution" for SLI and propose four kinds of synthetic data generation methods with varying amounts of inductive bias to investigate what leads to the best performance. Based on the results we create a SOTA open-source model for SLI as PBE (+6% pass rate with a third of the parameters of the second-best LLM) and also highlight exciting future directions for PBE research.
摘要：历史语言学家长期以来一直编写“程序”，通过有序字符串重写函数（称为声音法则）将祖先语言中重建的单词转换为其已证实的后代。然而，编写这些程序非常耗时，这促使我们开发自动声音法则归纳 (SLI)，在本文中，我们将其表述为使用大型语言模型 (LLM) 的示例编程 (PBE)。虽然 LLM 已有效地用于代码生成，但最近的研究表明，PBE 具有挑战性，但可以通过微调进行改进，尤其是使用与评估数据来自相同分布的训练数据。在本文中，我们创建了一个概念框架，说明什么是 SLI 的“相似分布”，并提出了四种具有不同归纳偏差的合成数据生成方法，以研究什么能带来最佳性能。基于结果，我们为 SLI 创建了一个 SOTA 开源模型作为 PBE（通过率为 +6%，参数为第二好的 LLM 的三分之一），并强调了 PBE 研究令人兴奋的未来方向。
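
To make the idea of sound laws as ordered string rewrite functions concrete, here is a small sketch using regular expressions. The two rules and the word "pata" are invented toy examples, not data or code from the paper; the point is only that the order of application changes the result.

```python
import re

# Hypothetical toy sound laws as (pattern, replacement) pairs, applied in order.
LENITION = (r"(?<=[aeiou])t(?=[aeiou])", "d")  # t > d between vowels
APOCOPE = (r"[aeiou]$", "")                    # word-final vowel is lost


def apply_sound_laws(word, laws):
    """Apply each rewrite rule to the whole word, one rule after another."""
    for pattern, replacement in laws:
        word = re.sub(pattern, replacement, word)
    return word


# Lenition first: pata > pada > pad
lenition_first = apply_sound_laws("pata", [LENITION, APOCOPE])

# Apocope first: pata > pat, and lenition no longer applies
apocope_first = apply_sound_laws("pata", [APOCOPE, LENITION])
```

In a Programming by Examples setting, the input would be pairs such as ("pata", "pad") and the task would be to recover a rule list like the one above.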

Title: A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain

Authors: Jorge del Pozo Lérida, Kamil Kojs, János Máté, Mikołaj Antoni Barański, Christian Hardmeier
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.16533
Pdf URL: https://arxiv.org/pdf/2501.16533
Copy Paste: [[2501.16533]] A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain(https://arxiv.org/abs/2501.16533)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have become state-of-the-art in Machine Translation (MT). They are often trained on massive bilingual parallel corpora scraped from the web, which contain low-quality entries and redundant information, leading to significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness varies widely across language pairs and domains. This paper evaluates the impact of commonly used data filtering techniques, such as LASER, MUSE, and LaBSE, on English-Polish translation within the biomedical domain. By filtering the UFAL Medical Corpus, we created datasets of varying sizes to fine-tune the mBART50 model, which was then evaluated using the SacreBLEU metric on the Khresmoi dataset, with translation quality also assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend the use of LASER, as it consistently outperforms the other methods and provides the most fluent and natural-sounding translations.
摘要：大型语言模型 (LLM) 已成为机器翻译 (MT) 领域的最先进技术，通常使用从网络上抓取的大量双语平行语料库进行训练，这些语料库包含低质量条目和冗余信息，导致计算挑战巨大。存在各种数据过滤方法来减少数据集大小，但它们的有效性在很大程度上取决于特定的语言对和领域。本文评估了常用的数据过滤技术（如 LASER、MUSE 和 LaBSE）对生物医学领域英语-波兰语翻译的影响。通过过滤 UFAL 医学语料库，我们创建了不同大小的数据集来微调 mBART50 模型，然后使用 Khresmoi 数据集上的 SacreBLEU 指标对该模型进行评估，并由双语使用者评估翻译质量。我们的结果表明，LASER 和 MUSE 都可以显著减少数据集大小，同时保持甚至提高性能。我们建议使用 LASER，因为它始终优于其他方法，并提供最流畅和自然的翻译。
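
One common way to filter a parallel corpus is to embed both sides of each sentence pair in a shared multilingual space and keep only pairs whose similarity clears a threshold. The sketch below assumes that setup; the toy vectors stand in for a real encoder, and the function names and the threshold value are illustrative assumptions, not the paper's configuration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_parallel_corpus(pairs, embed, threshold):
    """Keep (source, target, score) for pairs whose embeddings are similar enough."""
    kept = []
    for source, target in pairs:
        score = cosine_similarity(embed(source), embed(target))
        if score >= threshold:
            kept.append((source, target, score))
    return kept


# Hypothetical 3-dimensional vectors standing in for a multilingual sentence encoder.
TOY_EMBEDDINGS = {
    "The patient has a fever.": [0.9, 0.1, 0.0],
    "Pacjent ma gorączkę.": [0.8, 0.2, 0.1],
    "Click here to subscribe.": [0.0, 0.1, 0.9],
}

pairs = [
    ("The patient has a fever.", "Pacjent ma gorączkę."),
    ("The patient has a fever.", "Click here to subscribe."),
]
kept = filter_parallel_corpus(pairs, TOY_EMBEDDINGS.__getitem__, threshold=0.8)
```

Raising the threshold shrinks the training set further, which is the size-versus-quality trade-off the abstract describes.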

Title: Few-Shot Optimized Framework for Hallucination Detection in Resource-Limited NLP Systems

Authors: Baraa Hikal, Ahmed Nasreldin, Ali Hamdi, Ammar Mohammed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16616
Pdf URL: https://arxiv.org/pdf/2501.16616
Copy Paste: [[2501.16616]] Few-Shot Optimized Framework for Hallucination Detection in Resource-Limited NLP Systems(https://arxiv.org/abs/2501.16616)
Keywords: hallucination, prompt
Abstract: Hallucination detection in text generation remains an ongoing struggle for natural language processing (NLP) systems, frequently resulting in unreliable outputs in applications such as machine translation and definition modeling. Existing methods struggle with data scarcity and the limitations of unlabeled datasets, as highlighted by the SHROOM shared task at SemEval-2024. In this work, we propose a novel framework to address these challenges, introducing DeepSeek Few-shot optimization to enhance weak label generation through iterative prompt engineering. We achieved high-quality annotations that considerably enhanced the performance of downstream models by restructuring data to align with instruct generative models. We further fine-tuned the Mistral-7B-Instruct-v0.3 model on these optimized annotations, enabling it to accurately detect hallucinations in resource-limited settings. Combining this fine-tuned model with ensemble learning strategies, our approach achieved 85.5% accuracy on the test set, setting a new benchmark for the SHROOM task. This study demonstrates the effectiveness of data restructuring, few-shot optimization, and fine-tuning in building scalable and robust hallucination detection frameworks for resource-constrained NLP systems.
摘要：文本生成中的幻觉检测仍然是自然语言处理 (NLP) 系统面临的持续难题，在机器翻译和定义建模等应用中经常导致输出不可靠。现有方法面临着数据稀缺和未标记数据集的局限性，正如 SemEval-2024 上的 SHROOM 共享任务所强调的那样。在这项工作中，我们提出了一个新框架来应对这些挑战，引入 DeepSeek Few-shot 优化，通过迭代提示工程来增强弱标签生成。我们通过重组数据以与指令生成模型保持一致，获得了高质量的注释，从而大大提高了下游模型的性能。我们进一步在这些优化的注释上微调了 Mistral-7B-Instruct-v0.3 模型，使其能够在资源有限的环境中准确检测幻觉。将这个微调模型与集成学习策略相结合，我们的方法在测试集上实现了 85.5% 的准确率，为 SHROOM 任务树立了新的标杆。这项研究证明了数据重构、小样本优化和微调在为资源受限的 NLP 系统构建可扩展且稳健的幻觉检测框架方面的有效性。

Title: CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs

Authors: Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2501.16629
Pdf URL: https://arxiv.org/pdf/2501.16629
Copy Paste: [[2501.16629]] CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs(https://arxiv.org/abs/2501.16629)
Keywords: language model, llm, hallucination
Abstract: Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. Recent studies have attempted to mitigate this by applying Direct Preference Optimization (DPO) to multimodal scenarios using preference pairs from text-based responses. However, our analysis of representation distributions reveals that multimodal DPO struggles to align image and text representations and to distinguish between hallucinated and non-hallucinated descriptions. To address these challenges, we propose Cross-modal Hierarchical Direct Preference Optimization (CHiP). We introduce a visual preference optimization module within the DPO framework, enabling MLLMs to learn from both textual and visual preferences simultaneously. Furthermore, we propose a hierarchical textual preference optimization module that allows the model to capture preferences at multiple granular levels, including response, segment, and token levels. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations. On the Object HalBench dataset, CHiP outperforms DPO in hallucination reduction, achieving relative improvements of 52.7% and 55.5% with Muffin and LLaVA as base models, respectively. We make all our datasets and code publicly available: this https URL.
摘要：尽管多模态大型语言模型 (MLLM) 具有令人印象深刻的功能，但它们仍然难以应对幻觉。最近的研究试图通过将直接偏好优化 (DPO) 应用于多模态场景来缓解这种情况，使用来自基于文本的响应的偏好对。然而，我们对表征分布的分析表明，多模态 DPO 难以对齐图像和文本表征，也难以区分幻觉和非幻觉描述。为了应对这些挑战，在这项工作中，我们提出了一种跨模态分层直接偏好优化 (CHiP) 来解决这些限制。我们在 DPO 框架内引入了一个视觉偏好优化模块，使 MLLM 能够同时从文本和视觉偏好中学习。此外，我们提出了一个分层文本偏好优化模块，允许模型在多个粒度级别捕获偏好，包括响应、段和标记级别。我们通过定量和定性分析评估 CHiP，多个基准测试的结果证明了其在减少幻觉方面的有效性。在 Object HalBench 数据集上，CHiP 在幻觉减少方面的表现优于 DPO，分别基于基础模型 Muffin 和 LLaVA 模型实现了 52.7% 和 55.5% 的相对分数提升。我们将所有数据集和代码公开：此 https URL。
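
For orientation, the standard response-level DPO objective that CHiP builds on can be written for a single preference pair as below. This is a sketch of plain DPO only, not of CHiP's visual or hierarchical modules; the log-probability values and the `beta` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy already prefers the chosen response.
loss_good = dpo_loss(-10.0, -30.0, -20.0, -20.0)
# The policy prefers the rejected (e.g. hallucinated) response instead.
loss_bad = dpo_loss(-30.0, -10.0, -20.0, -20.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference model and lowers the rejected one's.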

Title: Why Do We Laugh? Annotation and Taxonomy Generation for Laughable Contexts in Spontaneous Text Conversation

Authors: Koji Inoue, Mikey Elmers, Divesh Lala, Tatsuya Kawahara
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.16635
Pdf URL: https://arxiv.org/pdf/2501.16635
Copy Paste: [[2501.16635]] Why Do We Laugh? Annotation and Taxonomy Generation for Laughable Contexts in Spontaneous Text Conversation(https://arxiv.org/abs/2501.16635)
Keywords: gpt, llm
Abstract: Laughter serves as a multifaceted communicative signal in human interaction, yet its identification within dialogue presents a significant challenge for conversational AI systems. This study addresses this challenge by annotating laughable contexts in Japanese spontaneous text conversation data and developing a taxonomy to classify the underlying reasons for such contexts. Initially, multiple annotators manually labeled laughable contexts using a binary decision (laughable or non-laughable). Subsequently, an LLM was used to generate explanations for the binary annotations of laughable contexts, which were then categorized into a taxonomy comprising ten categories, including "Empathy and Affinity" and "Humor and Surprise," highlighting the diverse range of laughter-inducing scenarios. The study also evaluated GPT-4's performance in recognizing the majority labels of laughable contexts, achieving an F1 score of 43.14%. These findings contribute to the advancement of conversational AI by establishing a foundation for more nuanced recognition and generation of laughter, ultimately fostering more natural and engaging human-AI interactions.
摘要：笑声是人类互动中多方面的交流信号，但在对话中识别笑声对对话式人工智能系统来说是一项重大挑战。这项研究通过注释日语自发文本对话数据中的可笑语境并开发分类法来对此类语境的根本原因进行分类来解决这一挑战。最初，多个注释者使用二元决策（可笑或不可笑）手动标记可笑语境。随后，使用 LLM 生成可笑语境二元注释的解释，然后将其归类为十个类别的分类法，包括“同理心和亲和力”和“幽默和惊讶”，突出了引人发笑的各种场景。该研究还评估了 GPT-4 在识别可笑语境的大多数标签方面的表现，F1 得分为 43.14%。这些发现为对话式人工智能的进步奠定了基础，为更细致入微的笑声识别和生成奠定了基础，最终促进了更自然、更引人入胜的人机互动。
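
The evaluation described above compares model predictions against the majority label from several annotators and reports F1. A minimal sketch of that computation follows; the annotation votes and predictions are made-up toy data, and the helper names are assumptions.

```python
from collections import Counter


def majority_label(votes):
    """Most frequent label among annotator votes (use an odd number to avoid ties)."""
    return Counter(votes).most_common(1)[0][0]


def binary_f1(gold, predicted, positive=1):
    """F1 score for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical votes from three annotators per context (1 = laughable).
annotations = [[1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
gold = [majority_label(votes) for votes in annotations]
predicted = [1, 1, 0, 0]  # hypothetical model outputs
score = binary_f1(gold, predicted)
```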

Title: An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue

Authors: Koji Inoue, Divesh Lala, Mikey Elmers, Keiko Ochi, Tatsuya Kawahara
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.16643
Pdf URL: https://arxiv.org/pdf/2501.16643
Copy Paste: [[2501.16643]] An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue(https://arxiv.org/abs/2501.16643)
Keywords: language model, gpt, llm
Abstract: Handling multi-party dialogues represents a significant step for advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on the task of addressee recognition, identifying who is being addressed to take the next turn, a critical component unique to multi-party dialogue systems. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns. To evaluate the task's complexity, we benchmarked the performance of a large language model (GPT-4o) on addressee recognition. The results showed that GPT-4o achieved an accuracy only marginally above chance, underscoring the challenges of addressee recognition in multi-party dialogue. These findings highlight the need for further research to enhance the capabilities of large language models in understanding and navigating the intricacies of multi-party conversational dynamics.
摘要：处理多方对话是推进口头对话系统的重要一步，因此需要开发特定于多方交互的任务。为了应对这一挑战，我们正在构建一个多模态多方对话语料库，其中包含三方（三方参与者）讨论。本文重点关注收件人识别任务，即确定下一个对话轮次的收件人是谁，这是多方对话系统独有的关键组成部分。语料库的一个子集标注了收件人信息，表明在约 20% 的对话轮次中会指明明确的收件人。为了评估任务的复杂性，我们对大型语言模型 (GPT-4o) 在收件人识别方面的表现进行了基准测试。结果表明，GPT-4o 的准确率仅略高于偶然性，这凸显了多方对话中收件人识别的挑战。这些发现强调需要进一步研究，以增强大型语言模型在理解和驾驭多方对话动态复杂性方面的能力。

Title: DOCS: Quantifying Weight Similarity for Deeper Insights into Large Language Models

Authors: Zeping Min, Xinshang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.16650
Pdf URL: https://arxiv.org/pdf/2501.16650
Copy Paste: [[2501.16650]] DOCS: Quantifying Weight Similarity for Deeper Insights into Large Language Models(https://arxiv.org/abs/2501.16650)
Keywords: language model, llm
Abstract: We introduce a novel index, the Distribution of Cosine Similarity (DOCS), for quantitatively assessing the similarity between weight matrices in Large Language Models (LLMs), aiming to facilitate the analysis of their complex architectures. Leveraging DOCS, our analysis uncovers intriguing patterns in the latest open-source LLMs: adjacent layers frequently exhibit high weight similarity and tend to form clusters, suggesting depth-wise functional specialization. Additionally, we prove that DOCS is theoretically effective in quantifying similarity for orthogonal matrices, a crucial aspect given the prevalence of orthogonal initializations in LLMs. This research contributes to a deeper understanding of LLM architecture and behavior, offering tools with potential implications for developing more efficient and interpretable models.
摘要：我们引入了一种新指标，即余弦相似度分布 (DOCS)，用于定量评估大型语言模型 (LLM) 中权重矩阵之间的相似度，旨在促进对其复杂架构的分析。利用 DOCS，我们的分析揭示了最新开源 LLM 中有趣的模式：相邻层经常表现出较高的权重相似度并倾向于形成聚类，这表明深度功能专业化。此外，我们证明了 DOCS 在理论上可以有效地量化正交矩阵的相似度，鉴于 LLM 中正交初始化的普遍性，这是一个至关重要的方面。这项研究有助于更深入地了解 LLM 架构和行为，为开发更高效、更可解释的模型提供了具有潜在影响的工具。

Title: Large Language Model Critics for Execution-Free Evaluation of Code Changes

Authors: Aashish Yadavally, Hoan Nguyen, Laurent Callot, Gauthier Guinet
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2501.16655
Pdf URL: https://arxiv.org/pdf/2501.16655
Copy Paste: [[2501.16655]] Large Language Model Critics for Execution-Free Evaluation of Code Changes(https://arxiv.org/abs/2501.16655)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) offer a promising way forward for automating software engineering tasks, such as bug fixes, feature additions, etc., via multi-step LLM-based agentic workflows. However, existing metrics for evaluating such workflows, mainly build status and occasionally log analysis, are too sparse and limited in providing the information needed to assess the quality of changes made. In this work, we designed LLM-based critics to derive well-structured and rigorous intermediate/step-level, execution-free evaluation proxies for repo-level code changes. Importantly, we assume access to the gold test patch for the problem (i.e., reference-aware) to assess both semantics and executability of generated patches. With the gold test patch as a reference, we predict executability of all editing locations with an F1 score of 91.6%, aggregating which, we can predict the build status in 84.8% of the instances in SWE-bench. In particular, such an execution-focused LLM critic outperforms other reference-free and reference-aware LLM critics by 38.9% to 72.5%. Moreover, we demonstrate the usefulness of such a reference-aware framework in comparing patches generated by different agentic workflows. Finally, we open-source the library developed for this project, which allows further usage for either other agentic workflows or other benchmarks. The source code is available at this https URL.
摘要：大型语言模型 (LLM) 为通过多步骤基于 LLM 的代理工作流自动执行软件工程任务（例如错误修复、功能添加等）提供了一种有前途的方法。但是，用于评估此类工作流的现有指标（主要是构建状态和偶尔的日志分析）过于稀疏，并且无法提供评估所做更改质量所需的信息。在这项工作中，我们设计了基于 LLM 的批评者，以得出结构良好且严格的中间/步骤级、无执行评估代理，用于存储库级代码更改。重要的是，我们假设可以访问问题的黄金测试补丁（即引用感知），以评估生成的补丁的语义和可执行性。以黄金测试补丁为参考，我们预测所有编辑位置的可执行性，F1 得分为 91.6%，综合起来，我们可以预测 SWE-bench 中 84.8% 的实例的构建状态。具体来说，这种以执行为重点的 LLM 评论家比其他无参考和参考感知的 LLM 评论家高出 38.9% 到 72.5%。此外，我们展示了这种参考感知框架在比较不同代理工作流生成的补丁方面的实用性。最后，我们开源了为该项目开发的库，允许进一步用于其他代理工作流或其他基准测试。源代码可在此 https URL 上找到。

Title: Contextual Reinforcement in Multimodal Token Compression for Large Language Models

Authors: Naderdel Piero, Zacharias Cromwell, Nathaniel Wainwright, Matthias Nethercott
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.16658
Pdf URL: https://arxiv.org/pdf/2501.16658
Copy Paste: [[2501.16658]] Contextual Reinforcement in Multimodal Token Compression for Large Language Models(https://arxiv.org/abs/2501.16658)
Keywords: language model
Abstract: Effective token compression remains a critical challenge for scaling models to handle increasingly complex and diverse datasets. A novel mechanism based on contextual reinforcement is introduced, dynamically adjusting token importance through interdependencies and semantic relevance. This approach enables substantial reductions in token usage while preserving the quality and coherence of information representation. Incorporating graph-based algorithms and adaptive weighting, the method captures subtle contextual relationships across textual and multimodal data, ensuring robust alignment and performance in downstream tasks. Evaluations across varied domains reveal significant improvements in accuracy and semantic retention, particularly for tasks requiring detailed cross-modal interactions. Memory usage analyses demonstrate improved computational efficiency, with minimal overhead despite the additional reinforcement processes. Performance gains are further validated through error distribution analyses, showing reduced semantic loss and syntactic inconsistencies compared to baseline models. The modular architecture ensures compatibility with a wide range of open-source frameworks, facilitating scalable implementation for real-world applications. These findings highlight the potential of contextual reinforcement in redefining token management strategies and advancing large-scale model design.
摘要：有效的标记压缩仍然是扩展模型以处理日益复杂和多样化的数据集的关键挑战。引入了一种基于上下文强化的新机制，通过相互依赖性和语义相关性动态调整标记重要性。这种方法可以大幅减少标记使用量，同时保持信息表示的质量和连贯性。结合基于图形的算法和自适应加权，该方法可以捕获文本和多模态数据之间的微妙上下文关系，确保下游任务中的稳健对齐和性能。跨不同领域的评估表明，准确性和语义保留显著提高，特别是对于需要详细跨模态交互的任务。内存使用分析表明，尽管有额外的强化过程，但计算效率有所提高，开销却很小。通过错误分布分析进一步验证了性能提升，与基线模型相比，语义损失和句法不一致有所减少。模块化架构确保与各种开源框架兼容，促进了现实世界应用程序的可扩展实现。这些发现凸显了上下文强化在重新定义标记管理策略和推进大规模模型设计方面的潜力。

Title: Auto-Differentiating Any LLM Workflow: A Farewell to Manual Prompting

Authors: Li Yin, Zhangyang Wang (Atlas)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16673
Pdf URL: https://arxiv.org/pdf/2501.16673
Copy Paste: [[2501.16673]] Auto-Differentiating Any LLM Workflow: A Farewell to Manual Prompting(https://arxiv.org/abs/2501.16673)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) have reshaped natural language processing, powering applications from multi-hop retrieval and question answering to autonomous agent workflows. Yet, prompt engineering -- the task of crafting textual inputs to effectively direct LLMs -- remains difficult and labor-intensive, particularly for complex pipelines that combine multiple LLM calls with functional operations like retrieval and data formatting. We introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering (APE) that extends textual gradient-based methods (such as Text-Grad) to multi-component, potentially cyclic LLM architectures. Implemented within the AdalFlow library, LLM-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine LLM to generate feedback -- akin to textual gradients -- that guides iterative prompt updates. Unlike prior single-node approaches, LLM-AutoDiff inherently accommodates functional nodes, preserves time-sequential behavior in repeated calls (e.g., multi-hop loops), and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts (instructions, formats, or few-shot examples). It further boosts training efficiency by focusing on error-prone samples through selective gradient computation. Across diverse tasks, including single-step classification, multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff consistently outperforms existing textual gradient baselines in both accuracy and training cost. By unifying prompt optimization through a graph-centric lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating LLM workflows - mirroring the transformative role that automatic differentiation libraries have long played in neural network research.
摘要：大型语言模型 (LLM) 重塑了自然语言处理，为从多跳检索和问答到自主代理工作流的各种应用提供支持。然而，提示工程（即制作文本输入以有效指导 LLM 的任务）仍然困难且劳动密集，特别是对于将多个 LLM 调用与检索和数据格式化等功能操作相结合的复杂管道。我们推出了 LLM-AutoDiff：一种用于自动提示工程 (APE) 的新框架，它将基于文本梯度的方法（例如 Text-Grad）扩展到多组件、潜在循环的 LLM 架构。LLM-AutoDiff 在 AdalFlow 库中实现，将每个文本输入视为可训练参数，并使用冻结的后向引擎 LLM 来生成类似于文本梯度的反馈，以指导迭代提示更新。与之前的单节点方法不同，LLM-AutoDiff 本质上可以容纳功能节点，在重复调用（例如多跳循环）中保留时间顺序行为，并通过隔离不同的子提示（指令、格式或少数样本示例）来解决“中间丢失”问题。它通过选择性梯度计算专注于容易出错的样本，进一步提高了训练效率。在各种任务中，包括单步分类、基于多跳检索的 QA 和代理驱动的管道，LLM-AutoDiff 在准确性和训练成本方面始终优于现有的文本梯度基线。通过以图为中心的视角统一提示优化，LLM-AutoDiff 为扩展和自动化 LLM 工作流程提供了一个强大的新范例 - 反映了自动微分库长期以来在神经网络研究中发挥的变革性作用。

Title: MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark

Authors: Dongyi Yi, Guibo Zhu, Chenglin Ding, Zongshu Li, Dong Yi, Jinqiao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16688
Pdf URL: https://arxiv.org/pdf/2501.16688
Copy Paste: [[2501.16688]] MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark(https://arxiv.org/abs/2501.16688)
Keywords: language model, llm
Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), numerous evaluation benchmarks have emerged. However, comprehensive assessments of their performance across diverse industrial applications remain limited. In this paper, we introduce MME-Industry, a novel benchmark designed specifically for evaluating MLLMs in industrial settings. The benchmark encompasses 21 distinct domains, comprising 1050 question-answer pairs with 50 questions per domain. To ensure data integrity and prevent potential leakage from public datasets, all question-answer pairs were manually crafted and validated by domain experts. Besides, the benchmark's complexity is effectively enhanced by incorporating non-OCR questions that can be answered directly, along with tasks requiring specialized domain knowledge. Moreover, we provide both Chinese and English versions of the benchmark, enabling comparative analysis of MLLMs' capabilities across these languages. Our findings contribute valuable insights into MLLMs' practical industrial applications and illuminate promising directions for future model optimization research.
摘要：随着多模态大型语言模型 (MLLM) 的快速发展，出现了许多评估基准。然而，对它们在不同工业应用中的表现的全面评估仍然有限。在本文中，我们介绍了 MME-Industry，这是一个专为评估工业领域中的 MLLM 而设计的新型基准。此 http URL 基准涵盖 21 个不同的领域，包含 1050 个问答对，每个领域 50 个问题。为确保数据完整性并防止公共数据集的潜在泄漏，所有问答对均由领域专家手工制作和验证。此外，通过结合可以直接回答的非 OCR 问题以及需要专业领域知识的任务，基准的复杂性得到了有效增强。此外，我们提供了基准的中文和英文版本，可以对这些语言中的 MLLM 功能进行比较分析。我们的研究结果为 MLLM 的实际工业应用提供了宝贵的见解，并为未来的模型优化研究指明了有希望的方向。

Title: 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow

Authors: Yueen Ma, Yuzheng Zhuang, Jianye Hao, Irwin King
Subjects: cs.CL, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2501.16698
Pdf URL: https://arxiv.org/pdf/2501.16698
Copy Paste: [[2501.16698]] 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow(https://arxiv.org/abs/2501.16698)
Keywords: language model, llm
Abstract: 3D vision and spatial reasoning have long been recognized as preferable for accurately perceiving our three-dimensional world, especially when compared with traditional visual reasoning based on 2D images. Due to the difficulties in collecting high-quality 3D data, research in this area has only recently gained momentum. With the advent of powerful large language models (LLMs), multi-modal LLMs for 3D vision have been developed over the past few years. However, most of these models focus primarily on the vision encoder for 3D data. In this paper, we propose converting existing densely activated LLMs into mixture-of-experts (MoE) models, which have proven effective for multi-modal data processing. In addition to leveraging these models' instruction-following capabilities, we further enable embodied task planning by attaching a diffusion head, Pose-DiT, that employs a novel rectified flow diffusion scheduler. Experimental results on 3D question answering and task-planning tasks demonstrate that our 3D-MoE framework achieves improved performance with fewer activated parameters.
摘要：长期以来，人们一直认为 3D 视觉和空间推理是准确感知三维世界的首选方法，尤其是与基于 2D 图像的传统视觉推理相比。由于收集高质量 3D 数据的困难，该领域的研究最近才开始发展。随着强大的大型语言模型 (LLM) 的出现，用于 3D 视觉的多模态 LLM 在过去几年中得到了开发。然而，这些模型中的大多数主要侧重于 3D 数据的视觉编码器。在本文中，我们建议将现有的密集激活的 LLM 转换为专家混合 (MoE) 模型，这些模型已被证明对多模态数据处理有效。除了利用这些模型的指令跟踪能力外，我们还通过连接采用新型整流流扩散调度器的扩散头 Pose-DiT 来进一步实现具体任务规划。3D 问答和任务规划任务的实验结果表明，我们的 3D-MoE 框架以更少的激活参数实现了更高的性能。

Title: xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking

Authors: Sunbowen Lee, Shiwen Ni, Chi Wei, Shuaimin Li, Liyang Fan, Ahmadreza Argha, Hamid Alinejad-Rokny, Ruifeng Xu, Yicheng Gong, Min Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16727
Pdf URL: https://arxiv.org/pdf/2501.16727
Copy Paste: [[2501.16727]] xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking(https://arxiv.org/abs/2501.16727)
Keywords: language model, gpt, llm, prompt
Abstract: Safety alignment mechanisms are essential for preventing large language models (LLMs) from generating harmful information or unethical content. However, cleverly crafted prompts can bypass these safety measures without accessing the model's internal parameters, a phenomenon known as black-box jailbreak. Existing heuristic black-box attack methods, such as genetic algorithms, suffer from limited effectiveness due to their inherent randomness, while recent reinforcement learning (RL) based methods often lack robust and informative reward signals. To address these challenges, we propose a novel black-box jailbreak method leveraging RL, which optimizes prompt generation by analyzing the embedding proximity between benign and malicious prompts. This approach ensures that the rewritten prompts closely align with the intent of the original prompts while enhancing the attack's effectiveness. Furthermore, we introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success. Experimental results show the superiority of our approach, achieving state-of-the-art (SOTA) performance on several prominent open and closed-source LLMs, including Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct, and GPT-4o-0806. Our method sets a new benchmark in jailbreak attack effectiveness, highlighting potential vulnerabilities in LLMs. The codebase for this work is available at this https URL.
摘要：安全对齐机制对于防止大型语言模型 (LLM) 生成有害信息或不道德内容至关重要。然而，精心设计的提示可以绕过这些安全措施，而无需访问模型的内部参数，这种现象称为黑盒越狱。现有的启发式黑盒攻击方法（例如遗传算法）由于其固有的随机性而效果有限，而最近基于强化学习 (RL) 的方法通常缺乏稳健且信息丰富的奖励信号。为了应对这些挑战，我们提出了一种利用 RL 的新型黑盒越狱方法，该方法通过分析良性和恶意提示之间的嵌入接近度来优化提示生成。这种方法可确保重写的提示与原始提示的意图紧密一致，同时增强攻击的有效性。此外，我们引入了一个全面的越狱评估框架，该框架结合了关键字、意图匹配和答案验证，以提供更严格和全面的越狱成功评估。实验结果证明了我们方法的优越性，在几个著名的开源和闭源 LLM 上实现了最先进的 (SOTA) 性能，包括 Qwen2.5-7B-Instruct、Llama3.1-8B-Instruct 和 GPT-4o-0806。我们的方法在越狱攻击有效性方面树立了新的标杆，凸显了 LLM 中的潜在漏洞。这项工作的代码库可在此 https URL 上找到。

Title: Through the Prism of Culture: Evaluating LLMs' Understanding of Indian Subcultures and Traditions

Authors: Garima Chhikara, Abhishek Kumar, Abhijnan Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16748
Pdf URL: https://arxiv.org/pdf/2501.16748
Copy Paste: [[2501.16748]] Through the Prism of Culture: Evaluating LLMs' Understanding of Indian Subcultures and Traditions(https://arxiv.org/abs/2501.16748)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have shown remarkable advancements but also raise concerns about cultural bias, often reflecting dominant narratives at the expense of under-represented subcultures. In this study, we evaluate the capacity of LLMs to recognize and accurately respond to the Little Traditions within Indian society, encompassing localized cultural practices and subcultures such as caste, kinship, marriage, and religion. Through a series of case studies, we assess whether LLMs can balance the interplay between dominant Great Traditions and localized Little Traditions. We explore various prompting strategies and further investigate whether using prompts in regional languages enhances the models' cultural sensitivity and response quality. Our findings reveal that while LLMs demonstrate an ability to articulate cultural nuances, they often struggle to apply this understanding in practical, context-specific scenarios. To the best of our knowledge, this is the first study to analyze LLMs' engagement with Indian subcultures, offering critical insights into the challenges of embedding cultural diversity in AI systems.
摘要：大型语言模型 (LLM) 取得了显著的进步，但也引发了人们对文化偏见的担忧，这种偏见往往反映了主流叙事，而忽视了代表性不足的亚文化。在这项研究中，我们评估了 LLM 识别和准确应对印度社会小传统的能力，包括种姓、亲属关系、婚姻和宗教等本地化的文化习俗和亚文化。通过一系列案例研究，我们评估了 LLM 是否能够平衡主流大传统和本地化小传统之间的相互作用。我们探索了各种提示策略，并进一步研究使用区域语言的提示是否能提高模型的文化敏感性和响应质量。我们的研究结果表明，虽然 LLM 表现出表达文化细微差别的能力，但他们往往难以将这种理解应用于实际的、特定于环境的场景中。据我们所知，这是第一项分析 LLM 与印度亚文化接触的研究，为在 AI 系统中嵌入文化多样性的挑战提供了关键见解。

Title: A Stochastic Dynamical Theory of LLM Self-Adversariality: Modeling Severity Drift as a Critical Process

Authors: Jack David Carson
Subjects: cs.CL, cs.AI, nlin.AO
Abstract URL: https://arxiv.org/abs/2501.16783
Pdf URL: https://arxiv.org/pdf/2501.16783
Copy Paste: [[2501.16783]] A Stochastic Dynamical Theory of LLM Self-Adversariality: Modeling Severity Drift as a Critical Process(https://arxiv.org/abs/2501.16783)
Keywords: language model, llm, chain-of-thought, agent
Abstract: This paper introduces a continuous-time stochastic dynamical framework for understanding how large language models (LLMs) may self-amplify latent biases or toxicity through their own chain-of-thought reasoning. The model posits an instantaneous "severity" variable $x(t) \in [0,1]$ evolving under a stochastic differential equation (SDE) with a drift term $\mu(x)$ and diffusion $\sigma(x)$. Crucially, such a process can be consistently analyzed via the Fokker--Planck approach if each incremental step behaves nearly Markovian in severity space. The analysis investigates critical phenomena, showing that certain parameter regimes create phase transitions from subcritical (self-correcting) to supercritical (runaway severity). The paper derives stationary distributions, first-passage times to harmful thresholds, and scaling laws near critical points. Finally, it highlights implications for agents and extended LLM reasoning models: in principle, these equations might serve as a basis for formal verification of whether a model remains stable or propagates bias over repeated inferences.
摘要：本文介绍了一个连续时间随机动力学框架，用于理解大型语言模型 (LLM) 如何通过其自身的思维链推理自我放大潜在偏见或毒性。该模型假设一个瞬时“严重性”变量 $x(t) \in [0,1]$ 在具有漂移项 $\mu(x)$ 和扩散 $\sigma(x)$ 的随机微分方程 (SDE) 下演变。至关重要的是，如果每个增量步骤在严重性空间中表现得接近马尔可夫，则可以通过 Fokker--Planck 方法一致地分析此类过程。该分析研究了临界现象，表明某些参数范围会产生从亚临界（自我修正）到超临界（失控严重性）的相变。本文推导出临界点附近的平稳分布、首次通过有害阈值的时间以及缩放定律。最后，它强调了对代理和扩展的 LLM 推理模型的影响：原则上，这些方程可以作为形式验证模型是否保持稳定或在重复推理中传播偏差的基础。
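
Illustration (not from the paper): the severity SDE described above, dx = mu(x) dt + sigma(x) dW on [0, 1], can be explored numerically with a plain Euler-Maruyama scheme. The sketch below is a minimal one under stated assumptions: the drift `a*x*(1-x) - b*x`, the diffusion `0.05*sqrt(x*(1-x))`, the parameter values, and all function names are hypothetical choices made only to show a self-correcting regime (a < b) next to a runaway regime (a > b). The abstract does not specify the author's functional forms.

```python
import math
import random

def simulate_severity(mu, sigma, x0=0.1, dt=0.01, steps=2000, seed=0):
    """Euler-Maruyama integration of dx = mu(x) dt + sigma(x) dW, clipped to [0, 1]."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        x = x + mu(x) * dt + sigma(x) * dw
        x = min(1.0, max(0.0, x))  # keep severity inside [0, 1]
        path.append(x)
    return path

def make_drift(a, b):
    # Hypothetical drift: amplification a*x*(1-x) against correction b*x.
    return lambda x: a * x * (1.0 - x) - b * x

def noise(x):
    # Hypothetical diffusion that vanishes at both boundaries.
    return 0.05 * math.sqrt(max(x * (1.0 - x), 0.0))

def first_passage_step(path, threshold):
    """Index of the first step at which the path reaches the threshold, or None."""
    for i, x in enumerate(path):
        if x >= threshold:
            return i
    return None

# a < b: severity decays toward 0; a > b: severity settles near 1 - b/a.
subcritical = simulate_severity(make_drift(a=0.5, b=1.0), noise)
supercritical = simulate_severity(make_drift(a=3.0, b=1.0), noise)
```

Calling `first_passage_step(supercritical, 0.5)` gives a single-path estimate of the first-passage time to a harmful threshold, in units of `dt`; averaging over many seeds would be needed for a meaningful estimate.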

Title: Misspellings in Natural Language Processing: A survey

Authors: Gianluca Sperduti, Alejandro Moreo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.16836
Pdf URL: https://arxiv.org/pdf/2501.16836
Copy Paste: [[2501.16836]] Misspellings in Natural Language Processing: A survey(https://arxiv.org/abs/2501.16836)
Keywords: language model
Abstract: This survey provides an overview of the challenges of misspellings in natural language processing (NLP). While often unintentional, misspellings have become ubiquitous in digital communication, especially with the proliferation of Web 2.0, user-generated content, and informal text mediums such as social media, blogs, and forums. Even if humans can generally interpret misspelled text, NLP models frequently struggle to handle it: this causes a decline in performance in common tasks like text classification and machine translation. In this paper, we reconstruct a history of misspellings as a scientific problem. We then discuss the latest advancements to address the challenge of misspellings in NLP. Main strategies to mitigate the effect of misspellings include data augmentation, double step, character-order agnostic, and tuple-based methods, among others. This survey also examines dedicated data challenges and competitions to spur progress in the field. Critical safety and ethical concerns are also examined, for example, the voluntary use of misspellings to inject malicious messages and hate speech on social networks. Furthermore, the survey explores psycholinguistic perspectives on how humans process misspellings, potentially informing innovative computational techniques for text normalization and representation. Finally, the misspelling-related challenges and opportunities associated with modern large language models are also analyzed, including benchmarks, datasets, and performances of the most prominent language models against misspellings. This survey aims to be an exhaustive resource for researchers seeking to mitigate the impact of misspellings in the rapidly evolving landscape of NLP.
摘要：本调查概述了自然语言处理 (NLP) 中拼写错误的挑战。虽然拼写错误通常是无意的，但它在数字通信中已经变得无处不在，尤其是随着 Web 2.0、用户生成内容和社交媒体、博客和论坛等非正式文本媒体的激增。即使人类通常可以解释拼写错误的文本，NLP 模型也经常难以处理它：这会导致文本分类和机器翻译等常见任务的性能下降。在本文中，我们将拼写错误的历史重构为一个科学问题。然后，我们讨论了解决 NLP 中拼写错误挑战的最新进展。减轻拼写错误影响的主要策略包括数据增强、双步、字符顺序不可知论和基于元组的方法等。本调查还研究了专门的数据挑战和竞赛，以促进该领域的进步。调查还探讨了关键的安全和道德问题，例如，在社交网络上自愿使用拼写错误来注入恶意消息和仇恨言论。此外，该调查还探讨了人类如何处理拼写错误的心理语言学观点，可能为文本规范化和表示的创新计算技术提供信息。最后，还分析了与现代大型语言模型相关的拼写错误相关挑战和机遇，包括基准、数据集和最突出的语言模型对拼写错误的表现。这项调查旨在为寻求减轻拼写错误在快速发展的 NLP 领域的影响的研究人员提供详尽的资源。

Title: JRE-L: Journalist, Reader, and Editor LLMs in the Loop for Science Journalism for the General Audience

Authors: Gongyao Jiang, Xinran Shi, Qiong Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16865
Pdf URL: https://arxiv.org/pdf/2501.16865
Copy Paste: [[2501.16865]] JRE-L: Journalist, Reader, and Editor LLMs in the Loop for Science Journalism for the General Audience(https://arxiv.org/abs/2501.16865)
Keywords: gpt, llm, prompt
Abstract: Science journalism reports current scientific discoveries to non-specialists, aiming to enable public comprehension of the state of the art. This task is challenging as the audience often lacks specific knowledge about the presented research. We propose a JRE-L framework that integrates three LLMs mimicking the writing-reading-feedback-revision loop. In JRE-L, one LLM acts as the journalist, another LLM as the general public reader, and the third LLM as an editor. The journalist's writing is iteratively refined by feedback from the reader and suggestions from the editor. Our experiments demonstrate that by leveraging the collaboration of two 7B and one 1.8B open-source LLMs, we can generate articles that are more accessible than those generated by existing methods, including prompting single advanced models such as GPT-4 and other LLM-collaboration strategies. Our code is publicly available at this http URL.
摘要：科学新闻报道当前的科学发现给非专业人士，旨在让公众了解最新技术。这项任务具有挑战性，因为观众通常缺乏有关所呈现研究的具体知识。我们提出了一个 JRE-L 框架，该框架集成了三个 LLM，模仿写作-阅读-反馈-修订循环。在 JRE-L 中，一个 LLM 充当记者，另一个 LLM 充当普通读者，第三个 LLM 充当编辑。记者的写作通过读者的反馈和编辑的建议不断完善。我们的实验表明，通过利用两个 7B 和一个 1.8B 开源 LLM 的协作，我们可以生成比现有方法生成的文章更容易理解的文章，包括提示单个高级模型（例如 GPT-4）和其他 LLM 协作策略。我们的代码在此 http URL 上公开提供。
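
Illustration (not from the paper): the writing-reading-feedback-revision loop described above can be sketched as a plain control loop over three callables. Everything here is an assumption made for illustration: the function name `jre_loop`, the callable signatures, the fixed number of rounds, and the toy stand-ins that replace real LLM calls. The abstract does not describe the authors' prompts or stopping rule.

```python
def jre_loop(paper, journalist, reader, editor, rounds=3):
    """Iteratively refine a draft using reader feedback and editor suggestions.

    journalist(paper, draft, suggestions) -> new draft (draft is None on the first call)
    reader(draft) -> feedback from a general-audience reader
    editor(paper, draft, feedback) -> revision suggestions
    """
    draft = journalist(paper, None, None)
    history = [draft]
    for _ in range(rounds):
        feedback = reader(draft)
        suggestions = editor(paper, draft, feedback)
        draft = journalist(paper, draft, suggestions)
        history.append(draft)
    return draft, history

# Toy stand-ins so the loop runs without any model; real use would call LLMs here.
def toy_journalist(paper, draft, suggestions):
    if draft is None:
        return f"Draft about: {paper}"
    return draft + f" [revised: {suggestions}]"

def toy_reader(draft):
    return f"{len(draft.split())} words read"

def toy_editor(paper, draft, feedback):
    return f"address '{feedback}'"

final, history = jre_loop("glycans", toy_journalist, toy_reader, toy_editor, rounds=2)
```

Keeping the three roles as interchangeable callables makes it easy to swap models of different sizes per role, which is the kind of configuration the abstract reports (two 7B models and one 1.8B model).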

Title: Irony Detection, Reasoning and Understanding in Zero-shot Learning

Authors: Peiling Yi, Yuhan Xia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.16884
Pdf URL: https://arxiv.org/pdf/2501.16884
Copy Paste: [[2501.16884]] Irony Detection, Reasoning and Understanding in Zero-shot Learning(https://arxiv.org/abs/2501.16884)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Irony is a powerful form of figurative language (FL) on social media that can potentially mislead various NLP tasks, such as recommendation systems, misinformation checks, and sentiment analysis. Understanding the implicit meaning of this kind of subtle language is essential to mitigate irony's negative impact on NLP tasks. However, building models to understand irony presents a unique set of challenges, because irony is a complex form of language that often relies on context, tone, and subtle cues to convey meaning that is opposite or different from the literal interpretation. Large language models, such as ChatGPT, are increasingly able to capture implicit and contextual information. In this study, we investigate the generalization, reasoning and understanding ability of ChatGPT on irony detection across six irony detection datasets from different genres. Our findings suggest that ChatGPT appears to show an enhanced language understanding and reasoning ability. However, great care is needed in prompt engineering design. Thus, we propose a prompt engineering design framework, IDADP, to achieve higher irony detection accuracy, improved understanding of irony, and more effective explanations compared to other state-of-the-art ChatGPT zero-shot approaches. We also ascertain via experiments that the practice generated under the framework is likely to be a promising solution to the generalization issues of LLMs.
摘要：反讽是社交媒体上一种强大的比喻性语言 (FL)，可能会误导各种 NLP 任务，例如推荐系统、错误信息检查和情感分析。理解这种微妙语言的隐含含义对于减轻反讽对 NLP 任务的负面影响至关重要。然而，构建模型来理解反讽提出了一系列独特的挑战，因为反讽是一种复杂的语言形式，它通常依赖于上下文、语气和微妙的线索来传达与字面解释相反或不同的含义。大型语言模型，如 ChatGPT，越来越能够捕捉隐含和上下文信息。在本研究中，我们研究了 ChatGPT 在六个不同类型的反讽检测数据集上对反讽检测的泛化、推理和理解能力。我们的研究结果表明，ChatGPT 似乎表现出增强的语言理解和推理能力。但在提示工程设计中需要非常小心。因此，我们提出了一个及时的工程设计框架 IDADP，与其他最先进的 ChatGPT 零样本方法相比，它可以实现更高的反讽检测准确率、更好的反讽理解和更有效的解释。并通过实验确定在该框架下产生的实践很可能是解决 LLM 泛化问题的有希望的解决方案。

Title: Detecting harassment and defamation in cyberbullying with emotion-adaptive training

Authors: Peiling Yi, Arkaitz Zubiaga, Yunfei Long
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16925
Pdf URL: https://arxiv.org/pdf/2501.16925
Copy Paste: [[2501.16925]] Detecting harassment and defamation in cyberbullying with emotion-adaptive training(https://arxiv.org/abs/2501.16925)
Keywords: language model
Abstract: Existing research on detecting cyberbullying incidents on social media has primarily concentrated on harassment and is typically approached as a binary classification task. However, cyberbullying encompasses various forms, such as denigration and harassment, which celebrities frequently face. Furthermore, suitable training data for these diverse forms of cyberbullying remains scarce. In this study, we first develop a celebrity cyberbullying dataset that encompasses two distinct types of incidents: harassment and defamation. We investigate various types of transformer-based models, namely masked (RoBERTa, Bert and DistilBert), replacing (Electra), autoregressive (XLnet), masked & permuted (Mpnet), text-text (T5) and large language models (Llama2 and Llama3) under low-resource settings. We find that they perform competitively on explicit harassment binary detection. However, their performance is substantially lower on harassment and denigration multi-classification tasks. Therefore, we propose an emotion-adaptive training framework (EAT) that helps transfer knowledge from the domain of emotion detection to the domain of cyberbullying detection to help detect indirect cyberbullying events. EAT consistently improves the average macro F1, precision and recall by 20% in cyberbullying detection tasks across nine transformer-based models under low-resource settings. Our claims are supported by intuitive theoretical insights and extensive experiments.
摘要：现有的关于检测社交媒体上的网络欺凌事件的研究主要集中在骚扰上，通常将其作为二元分类任务来处理。然而，网络欺凌包含各种形式，例如名人经常面临的诋毁和骚扰。此外，适合这些不同形式网络欺凌的训练数据仍然很少。在本研究中，我们首先开发了一个名人网络欺凌数据集，其中包含两种不同类型的事件：骚扰和诽谤。我们研究了各种类型的基于 Transformer 的模型，即在低源设置下的掩码（RoBERTa、Bert 和 DistilBert）、替换（Electra）、自回归（XLnet）、掩码和置换（Mpnet）、文本-文本（T5）和大型语言模型（Llama2 和 Llama3）。我们发现它们在显式骚扰二元检测方面表现出色。然而，它们在骚扰和诋毁多分类任务上的表现要低得多。因此，我们提出了一种情绪自适应训练框架 (EAT)，它有助于将知识从情绪检测领域转移到网络欺凌检测领域，以帮助检测间接网络欺凌事件。在资源匮乏的环境下，EAT 在九个基于 Transformer 的模型中持续将网络欺凌检测任务中的平均宏观 F1、准确率和召回率提高 20%。我们的主张得到了直观的理论见解和大量实验的支持。

Title: Multiple Abstraction Level Retrieve Augment Generation

Authors: Zheng Zheng (1), Xinyi Ni (1), Pengyu Hong (1) ((1) Brandeis University)
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.16952
Pdf URL: https://arxiv.org/pdf/2501.16952
Copy Paste: [[2501.16952]] Multiple Abstraction Level Retrieve Augment Generation(https://arxiv.org/abs/2501.16952)
Keywords: language model, llm, retrieval-augmented generation
Abstract: A Retrieval-Augmented Generation (RAG) model powered by a large language model (LLM) provides a faster and more cost-effective solution for adapting to new data and knowledge. It also delivers more specialized responses compared to pre-trained LLMs. However, most existing approaches rely on retrieving prefix-sized chunks as references to support question-answering (Q/A). This approach is often deployed to address information needs at a single level of abstraction, as it struggles to generate answers across multiple levels of abstraction. In a RAG setting, while LLMs can summarize and answer questions effectively when provided with sufficient details, retrieving excessive information often leads to the 'lost in the middle' problem and exceeds token limitations. We propose a novel RAG approach that uses chunks of multiple abstraction levels (MAL), including multi-sentence-level, paragraph-level, section-level, and document-level. The effectiveness of our approach is demonstrated in an under-explored scientific domain of Glycoscience. Compared to traditional single-level RAG approaches, our approach improves AI-evaluated answer correctness of Q/A by 25.739% on Glyco-related papers.
摘要：由大型语言模型 (LLM) 驱动的检索增强生成 (RAG) 模型提供了一种更快、更具成本效益的解决方案，可以适应新数据和新知识。与预先训练的 LLM 相比，它还能提供更专业的响应。然而，大多数现有方法依赖于检索前缀大小的块作为参考来支持问答 (Q/A)。这种方法通常用于满足单一抽象级别的信息需求，因为它很难在多个抽象级别上生成答案。在 RAG 设置中，虽然 LLM 在提供足够的细节时可以有效地总结和回答问题，但检索过多的信息通常会导致“迷失在中间”问题并超出标记限制。我们提出了一种新颖的 RAG 方法，该方法使用多个抽象级别 (MAL) 的块，包括多句子级、段落级、节级和文档级。我们的方法的有效性在糖科学这一尚未得到充分探索的科学领域得到了证明。与传统的单级 RAG 方法相比，我们的方法将 Glyco 相关论文的问答 AI 评估答案正确率提高了 25.739%。
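
Illustration (not from the paper): the four abstraction levels named above (multi-sentence, paragraph, section, document) can be produced from one text with a small splitter. This is a minimal sketch under assumed conventions: the function name `chunk_document`, the plain-text layout (section headings on lines starting with `# `, paragraphs separated by blank lines), and the naive punctuation-based sentence splitter are all hypothetical. The abstract does not say how the authors segment papers or whether higher-level chunks are summarized.

```python
import re

def chunk_document(text, sentences_per_chunk=2):
    """Split text into chunks at four abstraction levels.

    Assumed layout (hypothetical): a section starts at a line beginning
    with '# ', and paragraphs are separated by blank lines.
    """
    chunks = {"document": [text.strip()], "section": [],
              "paragraph": [], "multi_sentence": []}
    # Zero-width split before every heading line (requires Python 3.7+).
    sections = [s.strip() for s in re.split(r"(?m)^(?=# )", text) if s.strip()]
    chunks["section"] = sections
    for section in sections:
        for block in re.split(r"\n\s*\n", section):
            # Drop heading lines so paragraphs contain body text only.
            lines = [ln for ln in block.splitlines() if not ln.startswith("# ")]
            para = " ".join(ln.strip() for ln in lines).strip()
            if not para:
                continue
            chunks["paragraph"].append(para)
            # Naive sentence split on ., ! or ? followed by whitespace.
            sentences = re.split(r"(?<=[.!?])\s+", para)
            for i in range(0, len(sentences), sentences_per_chunk):
                group = " ".join(sentences[i:i + sentences_per_chunk])
                chunks["multi_sentence"].append(group)
    return chunks

doc = ("# Intro\n\nGlycans are sugars. They coat cells. They vary widely.\n\n"
       "# Methods\n\nWe index papers. We retrieve chunks.")
levels = chunk_document(doc)
```

A retriever could then index each level separately and pick the level that matches the question's scope, so that broad questions pull section- or document-level chunks instead of many small ones.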

Title: Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Authors: Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.16975
Pdf URL: https://arxiv.org/pdf/2501.16975
Copy Paste: [[2501.16975]] Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling(https://arxiv.org/abs/2501.16975)
Keywords: language model, llm
Abstract: Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
摘要：标记化是大型语言模型 (LLM) 的基本组成部分，但其对模型扩展和性能的影响尚未得到充分探索。在本文中，我们介绍了 Over-Tokenized Transformers，这是一种新颖的框架，可将输入和输出词汇分离以提高语言建模性能。具体来说，我们的方法可以扩展输入词汇以利用多语法标记。通过大量实验，我们发现输入词汇量与训练损失之间存在对数线性关系，表明无论模型大小如何，更大的输入词汇量都会持续提高模型性能。使用大型输入词汇表，我们可以实现与双倍大小基线相当的性能，而无需额外成本。我们的研究结果强调了标记化在扩展定律中的重要性，并为标记器设计提供了实用见解，为更高效、更强大的 LLM 铺平了道路。
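
Illustration (not from the paper): one way to picture "scaling up the input vocabulary with multi-gram tokens" while leaving the output vocabulary unchanged is to give every position an extra index that identifies the n-gram ending there. The sketch below is an assumption-laden toy: the function name `ngram_input_ids`, the base-`vocab_size` encoding, the modulo hash into `table_size` rows, and the left padding are hypothetical choices, not a description of the authors' implementation.

```python
def ngram_input_ids(token_ids, vocab_size, n=2, table_size=1_000_003, pad_id=0):
    """Map each position to a row of a (hypothetical) n-gram input embedding table.

    The n-gram ending at position i is encoded as a base-`vocab_size` number
    built from the current token and its n-1 predecessors, then folded into
    `table_size` rows with a modulo hash.
    """
    ids = []
    for i in range(len(token_ids)):
        code = 0
        for j in range(i - n + 1, i + 1):
            tok = token_ids[j] if j >= 0 else pad_id  # left-pad the sequence start
            code = code * vocab_size + tok
        ids.append(code % table_size)
    return ids

# Three tokens from a toy vocabulary of size 10, encoded as 2-grams.
bigram_ids = ngram_input_ids([3, 1, 2], vocab_size=10, n=2, table_size=1000)
```

In such a setup the model would still predict over the original `vocab_size` outputs; only the input side sees the larger table, which is the decoupling the abstract refers to.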

Title: How Linguistics Learned to Stop Worrying and Love the Language Models

Authors: Richard Futrell, Kyle Mahowald
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.17047
Pdf URL: https://arxiv.org/pdf/2501.17047
Copy Paste: [[2501.17047]] How Linguistics Learned to Stop Worrying and Love the Language Models(https://arxiv.org/abs/2501.17047)
Keywords: language model
Abstract: Language models can produce fluent, grammatical text. Nonetheless, some maintain that language models don't really learn language and also that, even if they did, that would not be informative for the study of human learning and processing. On the other side, there have been claims that the success of LMs obviates the need for studying linguistic theory and structure. We argue that both extremes are wrong. LMs can contribute to fundamental questions about linguistic structure, language processing, and learning. They force us to rethink arguments about learning and are informative for major questions in linguistic theory. But they do not replace linguistic structure and theory. We offer an optimistic take on the relationship between language models and linguistics.
摘要：语言模型可以生成流畅、符合语法的文本。尽管如此，一些人仍坚持认为语言模型实际上并不能学习语言，而且即使它们能学习语言，对研究人类的学习和处理也毫无帮助。另一方面，有人声称语言模型的成功消除了研究语言理论和结构的必要性。我们认为这两种极端观点都是错误的。语言模型可以解答有关语言结构、语言处理和学习的基本问题。它们迫使我们重新思考有关学习的论点，并为语言理论中的主要问题提供了参考。但它们并不能取代语言结构和理论。我们对语言模型和语言学之间的关系持乐观态度。

Title: COS(M+O)S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models

Authors: Tobias Materzok
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17104
Pdf URL: https://arxiv.org/pdf/2501.17104
Copy Paste: [[2501.17104]] COS(M+O)S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models(https://arxiv.org/abs/2501.17104)
Keywords: language model, gpt, prompt, chain-of-thought
Abstract: We present COS(M+O)S, a System 2-inspired framework for open-ended plot development that systematically explores the vast space of possible story expansions, enabling a 3B-parameter language model to approach the plot quality of a 70B model on select short-story tasks. The method accomplishes this by combining Monte Carlo Tree Search (MCTS), guided by a step-level value model that rewards moderate surprisal (curiosity) while penalizing incoherence, and Odds Ratio Preference Optimization (ORPO) to fine-tune the policy on high-value plot expansions. This iterative reinforcement learning loop systematically explores multiple candidate plot branches, backpropagates quality signals, and adapts the policy for faster convergence, notably shifting the policy from puzzle-based Chain-of-Thought to more character-driven storytelling. In small-scale tests with short-story prompts, 67%-77% of participants favored COS(M+O)S's highest-rated expansions over lower-rated ones, suggesting that our learned value function aligns. GPT-4o ratings further show that COS(M+O)S surpasses naive single-pass decoding from Llama 3.2 3B by 0.59 SD, coming within 0.06 SD of Llama 3.1 70B (no significant difference, p=0.93). Pairwise comparisons with o1 place COS(M+O)S 1.5 SD above the 3B baseline and find no statistically significant gap from 70B. Nevertheless, absolute story quality remains modest, constrained by the small model's capacity and limited training data.
摘要：我们提出了 COS(M+O)S，这是一个受 System 2 启发的开放式情节发展框架，它系统地探索了故事扩展的广阔空间，使 3B 参数语言模型能够在选定的短篇故事任务上接近 70B 模型的情节质量。该方法通过结合蒙特卡洛树搜索 (MCTS) 来实现这一点，该搜索由阶梯式价值模型指导，该模型奖励适度的惊讶（好奇心）同时惩罚不连贯性，以及比值偏好优化 (ORPO) 来微调高价值情节扩展的策略。这个迭代强化学习循环系统地探索多个候选情节分支，反向传播质量信号，并调整策略以加快收敛速度，特别是将策略从基于谜题的思路链转变为更多以角色为主导的叙事。在小规模短篇故事提示测试中，67%-77% 的参与者更喜欢 COS(M+O)S 评分最高的扩展，而不是评分较低的扩展，这表明我们学习到的价值函数是一致的。GPT-4o 评分进一步表明，COS(M+O)S 比 Llama 3.2 3B 的单次解码高出 0.59 SD，与 Llama 3.1 70B 相差 0.06 SD（无显着差异，p=0.93）。与 o1 的成对比较将 COS(M+O)S 置于 3B 基线以上 1.5 SD，并且发现与 70B 没有统计上的显着差距。尽管如此，绝对故事质量仍然很低，受到小模型容量和有限训练数据的限制。

Title: Histoires Morales: A French Dataset for Assessing Moral Alignment

Authors: Thibaud Leteno, Irina Proskurina, Antoine Gourru, Julien Velcin, Charlotte Laclau, Guillaume Metzler, Christophe Gravier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17117
Pdf URL: https://arxiv.org/pdf/2501.17117
Copy Paste: [[2501.17117]] Histoires Morales: A French Dataset for Assessing Moral Alignment(https://arxiv.org/abs/2501.17117)
Keywords: language model, llm
Abstract: Aligning language models with human values is crucial, especially as they become more integrated into everyday life. While models are often adapted to user preferences, it is equally important to ensure they align with moral norms and behaviours in real-world social situations. Despite significant progress in languages like English and Chinese, French has seen little attention in this area, leaving a gap in understanding how LLMs handle moral reasoning in this language. To address this gap, we introduce Histoires Morales, a French dataset derived from Moral Stories, created through translation and subsequently refined with the assistance of native speakers to guarantee grammatical accuracy and adaptation to the French cultural context. We also rely on annotations of the moral values within the dataset to ensure their alignment with French norms. Histoires Morales covers a wide range of social situations, including differences in tipping practices, expressions of honesty in relationships, and responsibilities toward animals. To foster future research, we also conduct preliminary experiments on the alignment of multilingual models on French and English data and the robustness of the alignment. We find that while LLMs are generally aligned with human moral norms by default, they can be easily influenced with user-preference optimization for both moral and immoral data.
摘要：将语言模型与人类价值观相结合至关重要，尤其是当它们越来越融入日常生活时。虽然模型通常会根据用户的偏好进行调整，但确保它们与现实世界社会环境中的道德规范和行为保持一致也同样重要。尽管英语和中文等语言取得了重大进展，但法语在这一领域却很少受到关注，这导致人们对法学硕士如何处理该语言的道德推理存在理解上的空白。为了填补这一空白，我们引入了 Histoires Morales，这是一个源自《道德故事》的法语数据集，通过翻译创建，随后在母语人士的帮助下进行完善，以确保语法准确性和适应法国文化背景。我们还依靠数据集中道德价值观的注释来确保它们与法国规范保持一致。Histoires Morales 涵盖了广泛的社会情况，包括小费习惯的差异、人际关系中诚实的表达以及对动物的责任。为了促进未来的研究，我们还对法语和英语数据上的多语言模型对齐以及对齐的稳健性进行了初步实验。我们发现，虽然 LLM 通常默认与人类道德规范相一致，但它们很容易受到用户对道德和不道德数据的偏好优化的影响。

Title: FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data

Authors: Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.17144
Pdf URL: https://arxiv.org/pdf/2501.17144
Copy Paste: [[2501.17144]] FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data(https://arxiv.org/abs/2501.17144)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Prior research on training grounded factuality classification models to detect hallucinations in large language models (LLMs) has relied on public natural language inference (NLI) data and synthetic data. However, conventional NLI datasets are not well-suited for document-level reasoning, which is critical for detecting LLM hallucinations. Recent approaches to document-level synthetic data generation involve iteratively removing sentences from documents and annotating factuality using LLM-based prompts. While effective, this method is computationally expensive for long documents and limited by the LLM's capabilities. In this work, we analyze the differences between existing synthetic training data used in state-of-the-art models and real LLM output claims. Based on our findings, we propose a novel approach for synthetic data generation, CG2C, that leverages multi-hop reasoning on context graphs extracted from documents. Our fact checker model, FactCG, demonstrates improved performance with more connected reasoning, using the same backbone models. Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark with much smaller model size.
摘要：先前关于训练基于事实性分类模型以检测大型语言模型 (LLM) 中的幻觉的研究依赖于公共自然语言推理 (NLI) 数据和合成数据。然而，传统的 NLI 数据集并不适用于文档级推理，而这对于检测 LLM 幻觉至关重要。文档级合成数据生成的最新方法包括迭代地从文档中删除句子并使用基于 LLM 的提示注释事实性。虽然这种方法有效，但对于长文档来说，计算成本很高，并且受到 LLM 功能的限制。在这项工作中，我们分析了最先进模型中使用的现有合成训练数据与真实 LLM 输出声明之间的差异。根据我们的研究结果，我们提出了一种新颖的合成数据生成方法 CG2C，该方法利用从文档中提取的上下文图进行多跳推理。我们的事实核查模型 FactCG 使用相同的主干模型，通过更紧密的推理展示了更高的性能。实验表明，在模型尺寸小得多的情况下，它甚至在 LLM-Aggrefact 基准上优于 GPT-4-o。

Title: AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Authors: Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.17148
Pdf URL: https://arxiv.org/pdf/2501.17148
Copy Paste: [[2501.17148]] AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders(https://arxiv.org/abs/2501.17148)
Keywords: language model, llm, prompt
Abstract: Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
摘要：语言模型输出的细粒度控制对于安全性和可靠性至关重要。提示和微调被广泛用于实现这些目标，但可解释性研究人员也提出了各种基于表示的技术，包括稀疏自动编码器 (SAE)、线性人工断层扫描、监督控制向量、线性探针和表示微调。目前，没有基准可以直接比较这些提案。因此，我们引入了用于控制和概念检测的大规模基准 AxBench，并报告了在 Gemma-2-2B 和 9B 上的实验。对于控制，我们发现提示优于所有现有方法，其次是微调。对于概念检测，基于表示的方法（例如均值差异）表现最佳。在这两项评估中，SAE 都没有竞争力。我们引入了一种新颖的弱监督表示方法（Rank-1 表示微调；ReFT-r1），它在两个任务上都具有竞争力，同时提供了提示所缺乏的可解释性优势。我们与 AxBench 一起训练并公开发布针对 ReFT-r1 和 DiffMean 的 SAE 规模特征词典。
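
Illustration (not from the paper): difference-in-means, mentioned above as a strong concept-detection baseline, is commonly understood as taking the mean hidden representation of examples that contain a concept, subtracting the mean of examples that do not, and scoring new representations by projecting onto that direction. The sketch below follows that common reading with plain Python lists; the function names and the toy two-dimensional vectors are hypothetical, and the benchmark's exact variant (layer choice, normalization, thresholding) is not given in the abstract.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_mean_direction(positive, negative):
    """Concept direction: mean of concept examples minus mean of the rest."""
    mean_pos = mean_vector(positive)
    mean_neg = mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def concept_score(hidden, direction):
    """Dot product of a hidden representation with the concept direction."""
    return sum(h * d for h, d in zip(hidden, direction))

# Toy 2-d "hidden states": the concept lives on the first axis.
positive = [[1.0, 0.0], [3.0, 0.0]]
negative = [[0.0, 1.0], [0.0, 3.0]]
direction = diff_mean_direction(positive, negative)
score_with_concept = concept_score([1.0, 0.0], direction)
score_without_concept = concept_score([0.0, 1.0], direction)
```

The same direction can in principle be added to hidden states to steer generation, which is why a single cheap vector can serve both the detection and the steering evaluations.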