2024-12-10

Title: Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor

Authors: Ashwin Baluja
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2412.05315
Pdf URL: https://arxiv.org/pdf/2412.05315
Keywords: language model, llm, prompt
Abstract: While Large Language Models (LLMs) have demonstrated impressive natural language understanding capabilities across various text-based tasks, understanding humor has remained a persistent challenge. Humor is frequently multimodal, relying on phonetic ambiguity, rhythm and timing to convey meaning. In this study, we explore a simple multimodal prompting approach to humor understanding and explanation. We present an LLM with both the text and the spoken form of a joke, generated using an off-the-shelf text-to-speech (TTS) system. Using multimodal cues improves the explanations of humor compared to textual prompts across all tested datasets.

Title: Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation

Authors: Xiaoyu Wang, Ningyuan Xi, Teng Chen, Qingqing Gu, Yue Zhao, Xiaokai Chen, Zhonglin Jiang, Yong Chen, Luo Ji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05342
Pdf URL: https://arxiv.org/pdf/2412.05342
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are usually fine-tuned to participate in dyadic, or two-party, dialogues and therefore adapt poorly to multi-party dialogues (MPD), which hinders their application in scenarios such as multi-person meetings, discussions and daily communication. Previous LLM-based research mainly focuses on multi-agent frameworks, while the base LLMs are still fine-tuned on pairwise dialogues. In this work, we design a multi-party fine-tuning framework (MuPaS) for LLMs on multi-party dialogue datasets, and show that such a straightforward framework lets the LLM align with the multi-party conversation style efficiently and effectively. We also design two training strategies which can convert MuPaS into an MPD simulator. Extensive experiments show that MuPaS achieves state-of-the-art multi-party responses, higher accuracy in next-speaker prediction, and higher utterance quality under both human and automatic evaluation, and can even generate reasonable responses for out-of-distribution scene, topic and role descriptions. The MuPaS framework bridges LLM training with more complicated multi-party applications, such as conversation generation, virtual rehearsal or the metaverse.

Title: Incremental Sentence Processing Mechanisms in Autoregressive Transformer Language Models

Authors: Michael Hanna, Aaron Mueller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.05353
Pdf URL: https://arxiv.org/pdf/2412.05353
Keywords: language model
Abstract: Autoregressive transformer language models (LMs) possess strong syntactic abilities, often successfully handling phenomena from agreement to NPI licensing. However, the features they use to incrementally process language inputs are not well understood. In this paper, we fill this gap by studying the mechanisms underlying garden path sentence processing in LMs. We ask: (1) Do LMs use syntactic features or shallow heuristics to perform incremental sentence processing? (2) Do LMs represent only one potential interpretation, or multiple? and (3) Do LMs reanalyze or repair their initial incorrect representations? To address these questions, we use sparse autoencoders to identify interpretable features that determine which continuation - and thus which reading - of a garden path sentence the LM prefers. We find that while many important features relate to syntactic structure, some reflect syntactically irrelevant heuristics. Moreover, while most active features correspond to one reading of the sentence, some features correspond to the other, suggesting that LMs assign weight to both possibilities simultaneously. Finally, LMs do not re-use features from garden path sentence processing to answer follow-up questions.

Title: CALICO: Conversational Agent Localization via Synthetic Data Generation

Authors: Andy Rosenbaum, Pegah Kharazmi, Ershad Banijamali, Lu Zeng, Christopher DiPersio, Pan Wei, Gokmen Oz, Clement Chung, Karolina Owczarzak, Fabian Triefenbach, Wael Hamza
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05388
Pdf URL: https://arxiv.org/pdf/2412.05388
Keywords: language model, llm, agent
Abstract: We present CALICO, a method to fine-tune Large Language Models (LLMs) to localize conversational agent training data from one language to another. For slots (named entities), CALICO supports three operations: verbatim copy, literal translation, and localization, i.e. generating slot values more appropriate in the target language, such as city and airport names located in countries where the language is spoken. Furthermore, we design an iterative filtering mechanism to discard noisy generated samples, which we show boosts the performance of the downstream conversational agent. To prove the effectiveness of CALICO, we build and release a new human-localized (HL) version of the MultiATIS++ travel information test set in 8 languages. Compared to the original human-translated (HT) version of the test set, we show that our new HL version is more challenging. We also show that CALICO out-performs state-of-the-art LINGUIST (which relies on literal slot translation out of context) both on the HT case, where CALICO generates more accurate slot translations, and on the HL case, where CALICO generates localized slots which are closer to the HL test set.

Title: Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications

Authors: Raphael Shu, Nilaksh Das, Michelle Yuan, Monica Sunkara, Yi Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05449
Pdf URL: https://arxiv.org/pdf/2412.05449
Keywords: language model, llm, agent
Abstract: AI agents powered by large language models (LLMs) have shown strong capabilities in problem solving. Through combining many intelligent agents, multi-agent collaboration has emerged as a promising approach to tackle complex, multi-faceted problems that exceed the capabilities of single AI agents. However, designing the collaboration protocols and evaluating the effectiveness of these systems remains a significant challenge, especially for enterprise applications. This report addresses these challenges by presenting a comprehensive evaluation of coordination and routing capabilities in a novel multi-agent collaboration framework. We evaluate two key operational modes: (1) a coordination mode enabling complex task completion through parallel communication and payload referencing, and (2) a routing mode for efficient message forwarding between agents. We benchmark on a set of handcrafted scenarios from three enterprise domains, which are publicly released with the report. For coordination capabilities, we demonstrate the effectiveness of inter-agent communication and payload referencing mechanisms, achieving end-to-end goal success rates of 90%. Our analysis yields several key findings: multi-agent collaboration enhances goal success rates by up to 70% compared to single-agent approaches in our benchmarks; payload referencing improves performance on code-intensive tasks by 23%; latency can be substantially reduced with a routing mechanism that selectively bypasses agent orchestration. These findings offer valuable guidance for enterprise deployments of multi-agent systems and advance the development of scalable, efficient multi-agent collaboration frameworks.

Title: Knowledge Graphs are all you need: Leveraging KGs in Physics Question Answering

Authors: Krishnasai Addala, Kabir Dev Paul Baghel, Dhruv Jain, Chhavi Kirtani, Avinash Anand, Rajiv Ratn Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.05453
Pdf URL: https://arxiv.org/pdf/2412.05453
Keywords: language model, llm
Abstract: This study explores the effectiveness of using knowledge graphs generated by large language models to decompose high school-level physics questions into sub-questions. We introduce a pipeline aimed at enhancing model response quality for Question Answering tasks. LLMs first construct knowledge graphs that capture the internal logic of the questions, and these graphs then guide the generation of sub-questions. We hypothesize that this method yields sub-questions that are more logically consistent with the original questions than those produced by traditional decomposition techniques. Our results show that sub-questions derived from knowledge graphs exhibit significantly improved fidelity to the original question's logic. This approach not only enhances the learning experience by providing clearer and more contextually appropriate sub-questions but also highlights the potential of LLMs to transform educational methodologies. The findings indicate a promising direction for applying AI to improve the quality and effectiveness of educational content.

Title: A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions

Authors: Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z. Ren, Anirudha Majumdar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05563
Pdf URL: https://arxiv.org/pdf/2412.05563
Keywords: language model, llm, hallucination, prompt, chat
Abstract: The remarkable performance of large language models (LLMs) in content generation, coding, and common-sense reasoning has spurred widespread integration into many facets of society. However, integration of LLMs raises valid questions on their reliability and trustworthiness, given their propensity to generate hallucinations: plausible, factually-incorrect responses, which are expressed with striking confidence. Previous work has shown that hallucinations and other non-factual responses generated by LLMs can be detected by examining the uncertainty of the LLM in its response to the pertinent prompt, driving significant research efforts devoted to quantifying the uncertainty of LLMs. This survey seeks to provide an extensive review of existing uncertainty quantification methods for LLMs, identifying their salient features, along with their strengths and weaknesses. We present existing methods within a relevant taxonomy, unifying ostensibly disparate methods to aid understanding of the state of the art. Furthermore, we highlight applications of uncertainty quantification methods for LLMs, spanning chatbot and textual applications to embodied artificial intelligence applications in robotics. We conclude with open research challenges in uncertainty quantification of LLMs, seeking to motivate future research.

Title: A polar coordinate system represents syntax in large language models

Authors: Pablo Diego-Simón, Stéphane D'Ascoli, Emmanuel Chemla, Yair Lakretz, Jean-Rémi King
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.05571
Pdf URL: https://arxiv.org/pdf/2412.05571
Keywords: language model, llm
Abstract: Originally formalized with symbolic representations, syntactic trees may also be effectively represented in the activations of large language models (LLMs). Indeed, a 'Structural Probe' can find a subspace of neural activations, where syntactically related words are relatively close to one another. However, this syntactic code remains incomplete: the distance between the Structural Probe word embeddings can represent the existence but not the type and direction of syntactic relations. Here, we hypothesize that syntactic relations are, in fact, coded by the relative direction between nearby embeddings. To test this hypothesis, we introduce a 'Polar Probe' trained to read syntactic relations from both the distance and the direction between word embeddings. Our approach reveals three main findings. First, our Polar Probe successfully recovers the type and direction of syntactic relations, and outperforms the Structural Probe nearly twofold. Second, we confirm that this polar coordinate system exists in a low-dimensional subspace of the intermediate layers of many LLMs and becomes increasingly precise in the latest frontier models. Third, we demonstrate with a new benchmark that similar syntactic relations are coded similarly across the nested levels of syntactic trees. Overall, this work shows that LLMs spontaneously learn a geometry of neural activations that explicitly represents the main symbolic structures of linguistic theory.
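Note: the limitation described in the abstract above can be seen in a toy sketch. A Structural Probe measures the distance between two word activations after a linear map B, and that distance is symmetric, so it cannot say which word is the head. The offset vector in the same subspace flips sign when the two words are swapped, which is the extra information a direction-aware probe could read. The matrix B and the activations below are made-up values for illustration, and the helper names are hypothetical; this is not the authors' implementation.

```python
import math

def project(B, h):
    """Apply a linear probe B (list of rows) to an activation vector h."""
    return [sum(b * x for b, x in zip(row, h)) for row in B]

def probe_offset(B, h_i, h_j):
    """Offset between two word activations inside the probe subspace."""
    return [a - b for a, b in zip(project(B, h_i), project(B, h_j))]

def distance(offset):
    """Euclidean length of the offset (what a distance-only probe sees)."""
    return math.sqrt(sum(x * x for x in offset))

def direction(offset):
    """Unit vector of the offset (the additional, sign-sensitive signal)."""
    norm = distance(offset)
    return [x / norm for x in offset]

# Hypothetical 2x3 probe and two toy 3-dimensional activations.
B = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
h_head = [2.0, 1.0, 5.0]
h_dep = [-1.0, 5.0, 7.0]

fwd = probe_offset(B, h_head, h_dep)  # head minus dependent
bwd = probe_offset(B, h_dep, h_head)  # dependent minus head

print(distance(fwd), distance(bwd))    # identical: distance is symmetric
print(direction(fwd), direction(bwd))  # opposite: direction is antisymmetric
```

Swapping the two words leaves the distance unchanged but negates the direction, so only the latter can distinguish head from dependent.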

Title: LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Authors: Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2412.05579
Pdf URL: https://arxiv.org/pdf/2412.05579
Keywords: language model, llm
Abstract: The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to its excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-judges and introducing their functionality (Why use LLM judges?). Then we address the methodology for constructing an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list; the link is given on the abstract page.

Title: CharacterBox: Evaluating the Role-Playing Capabilities of LLMs in Text-Based Virtual Worlds

Authors: Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haoxuan Li, Xu Chen, Xing Xie, Ji-Rong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05631
Pdf URL: https://arxiv.org/pdf/2412.05631
Keywords: language model, gpt, llm, agent
Abstract: Role-playing is a crucial capability of Large Language Models (LLMs), enabling a wide range of practical applications, including intelligent non-player characters, digital twins, and emotional companions. Evaluating this capability in LLMs is challenging due to the complex dynamics involved in role-playing, such as maintaining character fidelity throughout a storyline and navigating open-ended narratives without a definitive ground truth. Current evaluation methods, which primarily focus on question-answering or conversational snapshots, fall short of adequately capturing the nuanced character traits and behaviors essential for authentic role-playing. In this paper, we propose CharacterBox, which is a simulation sandbox designed to generate situational fine-grained character behavior trajectories. These behavior trajectories enable a more comprehensive and in-depth evaluation of role-playing capabilities. CharacterBox consists of two main components: the character agent and the narrator agent. The character agent, grounded in psychological and behavioral science, exhibits human-like behaviors, while the narrator agent coordinates interactions between character agents and environmental changes. Additionally, we introduce two trajectory-based methods that leverage CharacterBox to enhance LLM performance. To reduce costs and facilitate the adoption of CharacterBox by public communities, we fine-tune two smaller models, CharacterNR and CharacterRM, as substitutes for GPT API calls, and demonstrate their competitive performance compared to advanced GPT APIs.

Title: Shifting NER into High Gear: The Auto-AdvER Approach

Authors: Filippos Ventirozos, Ioanna Nteka, Tania Nandy, Jozef Baca, Peter Appleby, Matthew Shardlow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.05655
Pdf URL: https://arxiv.org/pdf/2412.05655
Keywords: language model, gpt, llm
Abstract: This paper presents a case study on the development of Auto-AdvER, a specialised named entity recognition schema and dataset for text in the car advertisement genre. Developed with industry needs in mind, Auto-AdvER is designed to enhance text mining analytics in this domain and contributes a linguistically unique NER dataset. We present a schema consisting of three labels: "Condition", "Historic" and "Sales Options". We outline the guiding principles for annotation, describe the methodology for schema development, and show the results of an annotation study demonstrating inter-annotator agreement of 92% F1-Score. Furthermore, we compare the performance of encoder-only models (BERT, DeBERTaV3) with that of decoder-only open- and closed-source Large Language Models (LLMs): Llama, Qwen, GPT-4 and Gemini. Our results show that the class of LLMs outperforms the smaller encoder-only models. However, the LLMs are costly and far from perfect for this task. We present this work as a stepping stone toward more fine-grained analysis and discuss Auto-AdvER's potential impact on advertisement analytics and customer insights, including applications such as the analysis of market dynamics and data-driven predictive maintenance. Our schema, as well as our associated findings, are suitable for both private and public entities considering named entity recognition in the automotive domain, or other specialist domains.

Title: Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression

Authors: Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.05693
Pdf URL: https://arxiv.org/pdf/2412.05693
Keywords: llm, prompt
Abstract: Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used resulting in significantly higher throughput while still maintaining the original model's accuracy.
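Note: a back-of-the-envelope sketch of why a smaller KV cache permits larger batches. The per-token cache size is the standard product of keys plus values, layers, KV heads, head dimension and bytes per value. Everything else here is an assumption for illustration: the model dimensions, the 8 GiB budget reserved for the cache, the 1024-token eviction budget, and the function names. The sketch ignores weights and activations, and it assumes the cache never grows beyond the kept-token budget, which is what compressing during input processing is meant to achieve according to the abstract.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def max_batch_size(budget_bytes, tokens_kept_per_seq, per_token_bytes):
    """Largest number of sequences whose caches fit in the memory budget."""
    return budget_bytes // (tokens_kept_per_seq * per_token_bytes)

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, fp16 values.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

budget = 8 * 1024**3              # assumed 8 GiB available for the KV cache
prompt_len, gen_len = 4096, 256   # input context longer than generation length

# Uncompressed: every prompt token and generated token stays in the cache.
full = max_batch_size(budget, prompt_len + gen_len, per_token)

# Compressed during input processing: keep at most 1024 prompt entries.
compressed = max_batch_size(budget, 1024 + gen_len, per_token)

print(per_token, full, compressed)  # 131072 15 51
```

Under these made-up numbers the same memory holds roughly three times as many sequences. Whether that turns into higher throughput at unchanged accuracy is the empirical claim of the paper, not something this arithmetic shows.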

Title: On the effective transfer of knowledge from English to Hindi Wikipedia

Authors: Paramita Das, Amartya Roy, Ritabrata Chakraborty, Animesh Mukherjee
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05708
Pdf URL: https://arxiv.org/pdf/2412.05708
Keywords: language model
Abstract: Although Wikipedia is the largest multilingual encyclopedia, it remains inherently incomplete. There is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps, we propose a lightweight framework to enhance knowledge equity between English and Hindi. If the English Wikipedia page is not up-to-date, our framework extracts relevant information from readily available external resources (such as English books) and adapts it to align with Wikipedia's distinctive style, including its neutral point of view (NPOV) policy, using the in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. On the other hand, if the English version is comprehensive and up-to-date, the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles by 65% and 62% according to automatic and human judgment-based evaluations, respectively.

Title: PromptRefine: Enhancing Few-Shot Performance on Low-Resource Indic Languages with Example Selection from Related Example Banks

Authors: Soumya Suvra Ghosal, Soumyabrata Pal, Koyel Mukherjee, Dinesh Manocha
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.05710
Pdf URL: https://arxiv.org/pdf/2412.05710
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have recently demonstrated impressive few-shot learning capabilities through in-context learning (ICL). However, ICL performance is highly dependent on the choice of few-shot demonstrations, making the selection of optimal examples a persistent research challenge. This issue is further amplified in low-resource Indic languages, where the scarcity of ground-truth data complicates the selection process. In this work, we propose PromptRefine, a novel Alternating Minimization approach for example selection that improves ICL performance on low-resource Indic languages. PromptRefine leverages auxiliary example banks from related high-resource Indic languages and employs multi-task learning techniques to align language-specific retrievers, enabling effective cross-language retrieval. Additionally, we incorporate diversity in the selected examples to enhance generalization and reduce bias. Through comprehensive evaluations on four text generation tasks (Cross-Lingual Question Answering, Multilingual Question Answering, Machine Translation, and Cross-Lingual Summarization) using state-of-the-art LLMs such as LLAMA-3.1-8B, LLAMA-2-7B, Qwen-2-7B, and Qwen-2.5-7B, we demonstrate that PromptRefine significantly outperforms existing frameworks for retrieving examples.

Title: A Comparative Study on Code Generation with Transformers

Authors: Namrata Das, Rakshya Panta, Neelam Karki, Ruchi Manandhar, Dinesh Baniya Kshatri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05749
Pdf URL: https://arxiv.org/pdf/2412.05749
Keywords: language model
Abstract: In an era of widespread influence of Natural Language Processing (NLP), there have been multiple research efforts to supplant traditional manual coding techniques with automated systems capable of generating solutions autonomously. With code generation research advancing rapidly and focusing almost solely on large language models, there is a need to compare and evaluate the performance of transformer architectures at several levels of model complexity. This paper introduces "A Comparative Study on Code Generation with Transformers," a model based on the Transformer architecture and NLP methodologies that automatically generates C++ source code for different varieties of problems. Here, a comparative study is performed to evaluate the robustness of transformer-based models on the basis of their architectural complexity and their capability to handle diverse problem sets, from basic arithmetic to complex computations.

Title: Uncovering Uncertainty in Transformer Inference

Authors: Greyson Brothers, Willa Mannering, Amber Tien, John Winder
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05768
Pdf URL: https://arxiv.org/pdf/2412.05768
Keywords: language model
Abstract: We explore the Iterative Inference Hypothesis (IIH) within the context of transformer-based language models, aiming to understand how a model's latent representations are progressively refined and whether observable differences are present between correct and incorrect generations. Our findings provide empirical support for the IIH, showing that the nth token embedding in the residual stream follows a trajectory of decreasing loss. Additionally, we observe that the rate at which residual embeddings converge to a stable output representation reflects uncertainty in the token generation process. Finally, we introduce a method utilizing cross-entropy to detect this uncertainty and demonstrate its potential to distinguish between correct and incorrect token generations on a dataset of idioms.
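Note: the idea of tracking a token's loss through the layers can be sketched with toy numbers. The abstract does not spell out the procedure, so the following is only one plausible reading and an assumption throughout: each layer's residual state is taken to have already been projected to vocabulary logits, and the cross-entropy of each layer's distribution is measured against the token the final layer predicts. The logits are invented and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_to_token(logits, token_id):
    """Cross-entropy of the predicted distribution against one target token."""
    return -math.log(softmax(logits)[token_id])

def layer_losses(per_layer_logits):
    """Loss of every layer's prediction against the final layer's top token."""
    final = per_layer_logits[-1]
    target = max(range(len(final)), key=final.__getitem__)
    return [cross_entropy_to_token(l, target) for l in per_layer_logits]

# Invented per-layer logits over a 3-token vocabulary (early to final layer).
confident = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
uncertain = [[0.0, 0.0, 0.0], [0.2, 0.0, 0.1], [1.0, 0.0, 0.9]]

loss_confident = layer_losses(confident)
loss_uncertain = layer_losses(uncertain)
print([round(x, 3) for x in loss_confident])
print([round(x, 3) for x in loss_uncertain])
```

Both toy trajectories decrease layer by layer, but the second converges more slowly and ends at a higher loss, which is the kind of signal the abstract associates with uncertainty in token generation.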
摘要：我们在基于转换器的语言模型的背景下探索迭代推理假设 (IIH)，旨在了解模型的潜在表示如何逐步完善，以及正确和不正确的生成之间是否存在可观察到的差异。我们的研究结果为 IIH 提供了实证支持，表明残差流中的第 n 个标记嵌入遵循损失减少的轨迹。此外，我们观察到残差嵌入收敛到稳定输出表示的速率反映了标记生成过程中的不确定性。最后，我们介绍了一种利用交叉熵来检测这种不确定性的方法，并展示了它在成语数据集上区分正确和不正确的标记生成的潜力。
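
As a rough illustration of the cross-entropy idea in the abstract above, the sketch below compares each layer's intermediate next-token distribution with the final output distribution. The per-layer logits are made-up numbers and the helper names (`softmax`, `cross_entropy`, `layer_trajectory`) are hypothetical; the paper's actual procedure may differ.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def layer_trajectory(per_layer_logits):
    """Cross-entropy of every layer's distribution against the final layer's."""
    final = softmax(per_layer_logits[-1])
    return [cross_entropy(final, softmax(logits)) for logits in per_layer_logits]

# Toy logits over a 3-token vocabulary, one list per layer (illustrative only).
toy_layers = [
    [0.1, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [4.0, 0.0, 0.0],
    [6.0, 0.0, 0.0],
]
trajectory = layer_trajectory(toy_layers)
print([round(v, 3) for v in trajectory])
```

In this toy case the values shrink layer by layer as the intermediate distributions approach the final one. Under the abstract's framing, a trajectory that converges slowly would be a candidate signal of uncertainty.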

Title: An Entailment Tree Generation Approach for Multimodal Multi-Hop Question Answering with Mixture-of-Experts and Iterative Feedback Mechanism

Authors: Qing Zhang, Haocheng Lv, Jie Liu, Zhiyun Chen, Jianyong Duan, Hao Wang, Li He, Mingying Xv
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05821
Pdf URL: https://arxiv.org/pdf/2412.05821
Copy Paste: [[2412.05821]] An Entailment Tree Generation Approach for Multimodal Multi-Hop Question Answering with Mixture-of-Experts and Iterative Feedback Mechanism(https://arxiv.org/abs/2412.05821)
Keywords: language model, llm
Abstract: With the rise of large language models (LLMs), it has become popular and effective to convert multimodal information into text descriptions for multimodal multi-hop question answering. However, we argue that current methods for multimodal multi-hop question answering still face two main challenges: 1) The retrieved evidence contains a large amount of redundant information, which inevitably leads to a significant drop in performance because irrelevant information misleads the prediction. 2) The reasoning process lacks interpretable reasoning steps, making it difficult for the model to discover logical errors when handling complex questions. To solve these problems, we propose a unified LLM-based approach that avoids relying heavily on LLMs because of their potential errors, and innovatively treat multimodal multi-hop question answering as a joint entailment tree generation and question answering problem. Specifically, we design a multi-task learning framework with a focus on facilitating common knowledge sharing across interpretability and prediction tasks while preventing task-specific errors from interfering with each other via a mixture of experts. Afterward, we design an iterative feedback mechanism to further enhance both tasks by feeding back the results of the joint training to the LLM for regenerating entailment trees, aiming to iteratively refine the potential answer. Notably, our method has held first place on the official WebQA leaderboard (since April 10, 2024) and achieves competitive results on MultimodalQA.
摘要：随着大规模语言模型（LLM）的兴起，将多模态信息转换为文本描述以进行多模态多跳问答是目前流行且有效的方法。然而，我们认为目前的多模态多跳问答方法仍然主要面临两个挑战：1）检索到的证据包含大量冗余信息，由于无关信息误导预测，不可避免地会导致性能大幅下降。2）没有可解释的推理步骤的推理过程使模型难以发现处理复杂问题的逻辑错误。为了解决这些问题，我们提出了一种统一的基于LLM的方法，但不因LLM的潜在错误而过度依赖它们，并创新地将多模态多跳问答视为联合蕴涵树生成和问答问题。具体而言，我们设计了一个多任务学习框架，重点是促进可解释性和预测任务之间的共同知识共享，同时防止特定于任务的错误通过专家混合而相互干扰。之后，我们设计了一种迭代反馈机制，通过将联合训练的结果反馈给 LLM 以重新生成蕴涵树来进一步增强这两项任务，旨在迭代地完善潜在答案。值得注意的是，我们的方法在 WebQA 官方排行榜上获得了第一名（自 2024 年 4 月 10 日起），并在 MultimodalQA 上取得了有竞争力的结果。

Title: A Self-Learning Multimodal Approach for Fake News Detection

Authors: Hao Chen, Hui Guo, Baochen Hu, Shu Hu, Jinrong Hu, Siwei Lyu, Xi Wu, Xin Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05843
Pdf URL: https://arxiv.org/pdf/2412.05843
Copy Paste: [[2412.05843]] A Self-Learning Multimodal Approach for Fake News Detection(https://arxiv.org/abs/2412.05843)
Keywords: language model, llm
Abstract: The rapid growth of social media has resulted in an explosion of online news content, leading to a significant increase in the spread of misleading or false information. While machine learning techniques have been widely applied to detect fake news, the scarcity of labeled datasets remains a critical challenge. Misinformation frequently appears as paired text and images, where a news article or headline is accompanied by related visuals. In this paper, we introduce a self-learning multimodal model for fake news classification. The model leverages contrastive learning, a robust method for feature extraction that operates without requiring labeled data, and integrates the strengths of Large Language Models (LLMs) to jointly analyze both text and image features. LLMs excel at this task due to their ability to process diverse linguistic data drawn from extensive training corpora. Our experimental results on a public dataset demonstrate that the proposed model outperforms several state-of-the-art classification approaches, achieving over 85% accuracy, precision, recall, and F1-score. These findings highlight the model's effectiveness in tackling the challenges of multimodal fake news detection.
摘要：社交媒体的快速发展导致在线新闻内容激增，从而导致误导性或虚假信息的传播大幅增加。虽然机器学习技术已广泛应用于检测虚假新闻，但标记数据集的稀缺仍然是一个关键挑战。错误信息经常以成对的文本和图像的形式出现，其中新闻文章或标题伴随着相关的视觉效果。在本文中，我们介绍了一种用于虚假新闻分类的自学习多模态模型。该模型利用对比学习，这是一种不需要标记数据即可运行的强大特征提取方法，并整合了大型语言模型 (LLM) 的优势来联合分析文本和图像特征。LLM 擅长于这项任务，因为它们能够处理来自大量训练语料库的各种语言数据。我们在公共数据集上的实验结果表明，所提出的模型优于几种最先进的分类方法，实现了超过 85% 的准确率、精确率、召回率和 F1 分数。这些发现凸显了该模型在应对多模式假新闻检测挑战方面的有效性。

Title: Are Clinical T5 Models Better for Clinical Text?

Authors: Yahan Li, Keith Harrigian, Ayah Zirikly, Mark Dredze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.05845
Pdf URL: https://arxiv.org/pdf/2412.05845
Copy Paste: [[2412.05845]] Are Clinical T5 Models Better for Clinical Text?(https://arxiv.org/abs/2412.05845)
Keywords: language model, llm
Abstract: Large language models with a transformer-based encoder/decoder architecture, such as T5, have become standard platforms for supervised tasks. To bring these technologies to the clinical domain, recent work has trained new or adapted existing models to clinical data. However, the evaluation of these clinical T5 models and comparison to other models has been limited. Are the clinical T5 models better choices than FLAN-tuned generic T5 models? Do they generalize better to new clinical domains that differ from the training sets? We comprehensively evaluate these models across several clinical tasks and domains. We find that clinical T5 models provide marginal improvements over existing models, and perform worse when evaluated on different domains. Our results inform future choices in developing clinical LLMs.
摘要：具有基于转换器的编码器/解码器架构的大型语言模型（例如 T5）已成为监督任务的标准平台。为了将这些技术引入临床领域，最近的工作已经训练了新的或改编的现有模型以适应临床数据。但是，对这些临床 T5 模型的评估和与其他模型的比较是有限的。临床 T5 模型是否比 FLAN 调整的通用 T5 模型是更好的选择？它们是否可以更好地推广到与训练集不同的新临床领域？我们在多个临床任务和领域全面评估了这些模型。我们发现临床 T5 模型比现有模型提供了微小的改进，并且在不同领域进行评估时表现更差。我们的结果为开发临床 LLM 的未来选择提供了参考。

Title: Cooperative SQL Generation for Segmented Databases By Using Multi-functional LLM Agents

Authors: Zhiguang Wu, Fengbin Zhu, Xuequn Shang, Yupei Zhang, Pan Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.05850
Pdf URL: https://arxiv.org/pdf/2412.05850
Copy Paste: [[2412.05850]] Cooperative SQL Generation for Segmented Databases By Using Multi-functional LLM Agents(https://arxiv.org/abs/2412.05850)
Keywords: language model, llm, agent
Abstract: The Text-to-SQL task aims to automatically generate SQL queries from users' text questions. To address this problem, we propose a Cooperative SQL Generation framework based on Multi-functional Agents (CSMA) through information interaction among large language model (LLM) based agents that each separately own part of the database schema. Inspired by the collaboration in human teamwork, CSMA consists of three stages: 1) Question-related schema collection, 2) Question-corresponding SQL query generation, and 3) SQL query correctness check. In the first stage, agents analyze their respective schema and communicate with each other to collect the schema information relevant to the question. In the second stage, agents try to generate the corresponding SQL query for the question using the collected information. In the third stage, agents check if the SQL query is created correctly according to their known information. This interaction-based method allows the question-relevant part of each agent's database schema to be used for SQL generation and checking. Experiments on the Spider and Bird benchmarks demonstrate that CSMA achieves a high performance level comparable to the state of the art while keeping the private data within the individual agents.
摘要：文本转SQL任务旨在根据用户文本问题自动生成SQL查询。为了解决这个问题，我们提出了一种基于多功能代理的协作SQL生成框架（CSMA），通过基于大型语言模型（LLM）的代理之间的信息交互，这些代理分别拥有部分数据库模式。受人类团队协作的启发，CSMA包含三个阶段：1）与问题相关的模式收集，2）与问题对应的SQL查询生成，3）SQL查询正确性检查。在第一阶段，代理分析各自的模式并相互通信以收集与问题相关的模式信息。在第二阶段，代理尝试使用收集到的信息为问题生成相应的SQL查询。在第三阶段，代理根据已知信息检查SQL查询是否创建正确。这种基于交互的方法使每个代理的数据库模式中与问题相关的部分用于SQL生成和检查。在 Spider 和 Bird 基准上的实验表明，CSMA 达到了与最先进技术相当的高性能水平，同时保留了这些单个代理中的私有数据。

Title: Domain-Specific Translation with Open-Source Large Language Models: Resource-Oriented Analysis

Authors: Aman Kassahun Wassie, Mahdi Molaei, Yasmin Moslem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.05862
Pdf URL: https://arxiv.org/pdf/2412.05862
Copy Paste: [[2412.05862]] Domain-Specific Translation with Open-Source Large Language Models: Resource-Oriented Analysis(https://arxiv.org/abs/2412.05862)
Keywords: language model, llm
Abstract: In this work, we compare the domain-specific translation performance of open-source autoregressive decoder-only large language models (LLMs) with task-oriented machine translation (MT) models. Our experiments focus on the medical domain and cover four language pairs with varied resource availability: English-to-French, English-to-Portuguese, English-to-Swahili, and Swahili-to-English. Despite recent advancements, LLMs exhibit a clear gap in specialized translation quality compared to multilingual encoder-decoder MT models such as NLLB-200. In three out of four language directions in our study, NLLB-200 3.3B outperforms all LLMs in the size range of 8B parameters in medical translation. While fine-tuning LLMs such as Mistral and Llama improves their performance at medical translation, these models still fall short compared to fine-tuned NLLB-200 3.3B models. Our findings highlight the ongoing need for specialized MT models to achieve higher-quality domain-specific translation, especially in medium-resource and low-resource settings. As larger LLMs outperform their 8B variants, this also encourages pre-training domain-specific medium-sized LMs to improve quality and efficiency in specialized translation tasks.
摘要：在本研究中，我们将开源自回归解码器专用大型语言模型 (LLM) 与面向任务的机器翻译 (MT) 模型的领域特定翻译性能进行了比较。我们的实验重点关注医学领域，涵盖四种资源可用性各异的语言对：英语到法语、英语到葡萄牙语、英语到斯瓦希里语和斯瓦希里语到英语。尽管最近取得了进展，但与 NLLB-200 等多语言编码器-解码器 MT 模型相比，LLM 在专业翻译质量方面仍存在明显差距。在我们研究的四个语言方向中的三个中，NLLB-200 3.3B 在医学翻译中的表现优于 8B 参数大小范围内的所有 LLM。虽然对 Mistral 和 Llama 等 LLM 进行微调可以提高其在医学翻译方面的表现，但与经过微调的 NLLB-200 3.3B 模型相比，这些模型仍然存在不足。我们的研究结果强调，我们始终需要专门的机器翻译模型来实现更高质量的领域特定翻译，尤其是在资源中等和资源较少的环境中。由于大型 LLM 的表现优于其 8B 变体，这也鼓励预先训练领域特定中型 LM，以提高专业翻译任务的质量和效率。

Title: Paraphrase-Aligned Machine Translation

Authors: Ke-Ching Chang, Chung-Chi Chen, An-Zi Yen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.05916
Pdf URL: https://arxiv.org/pdf/2412.05916
Copy Paste: [[2412.05916]] Paraphrase-Aligned Machine Translation(https://arxiv.org/abs/2412.05916)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated significant capabilities in machine translation. However, their translation quality is sometimes questioned, as the generated outputs may deviate from expressions typically used by native speakers. These deviations often arise from differences in sentence structure between language systems. To address this issue, we propose ParaAlign Translator, a method that fine-tunes LLMs to paraphrase sentences, aligning their structures with those of the target language systems. This approach improves the performance of subsequent translations. Experimental results demonstrate that the proposed method enhances the LLaMA-3-8B model's performance in both resource-rich and low-resource scenarios and matches or surpasses the much larger LLaMA-3-70B model.
摘要：大型语言模型 (LLM) 在机器翻译中展现出了强大的能力。然而，它们的翻译质量有时会受到质疑，因为生成的输出可能与母语人士通常使用的表达方式有所不同。这些偏差通常源于语言系统之间句子结构的差异。为了解决这个问题，我们提出了 ParaAlign Translator，这是一种微调 LLM 来解释句子的方法，使其结构与目标语言系统的结构保持一致。这种方法提高了后续翻译的性能。实验结果表明，所提出的方法提高了 LLaMA-3-8B 模型在资源丰富和资源匮乏场景下的性能，并且与规模大得多的 LLaMA-3-70B 模型相当甚至超过了它。

Title: A Cross-Validation Study of Turkish Sentiment Analysis Datasets and Tools

Authors: Şevval Çakıcı, Dilara Karaduman, Mehmet Akif Çırlan, Ali Hürriyetoğlu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.05964
Pdf URL: https://arxiv.org/pdf/2412.05964
Copy Paste: [[2412.05964]] A Cross-Validation Study of Turkish Sentiment Analysis Datasets and Tools(https://arxiv.org/abs/2412.05964)
Keywords: prompt
Abstract: In recent years, sentiment analysis has gained increasing significance, prompting researchers to explore datasets in various languages, including Turkish. However, the limited availability of Turkish datasets has led to their multifaceted usage in different studies, yielding diverse outcomes. To overcome this challenge, we conducted a rigorous review of research articles published between 2012 and 2022. We listed 31 studies and collected the 23 Turkish datasets used in them, obtained from publicly available sources and email requests. We labeled these 31 studies using a taxonomy. We provide a map of Turkish sentiment analysis datasets over 10 years according to this taxonomy. Moreover, we ran state-of-the-art sentiment analysis tools on these datasets and analyzed performance across popular Turkish sentiment datasets. We observed that the performance of the sentiment analysis tools significantly depends on the characteristics of the target text. Our study fosters a more nuanced understanding of sentiment analysis in the Turkish language.
摘要：近年来，情绪分析的重要性日益增加，促使研究人员探索包括土耳其语在内的各种语言的数据集。然而，土耳其语数据集的有限可用性导致它们在不同研究中被多方面使用，产生了不同的结果。为了克服这一挑战，对 2012 年至 2022 年期间发表的研究文章进行了严格的审查。列出了 31 项研究，并收集了从公开来源和这些研究中使用的电子邮件请求中获得的 23 个土耳其语数据集。我们使用分类法对这 31 项研究进行了标记。我们根据该分类法提供了 10 年来土耳其语情绪分析数据集的地图。此外，我们在这些数据集上运行了最先进的情绪分析工具，并分析了流行的土耳其语情绪数据集的性能。我们观察到情绪分析工具的性能在很大程度上取决于目标文本的特征。我们的研究促进了对土耳其语情绪分析的更细致的理解。

Title: Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt

Authors: Damien de Mijolla, Wen Yang, Philippa Duckett, Christopher Frye, Mark Worrall
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05967
Pdf URL: https://arxiv.org/pdf/2412.05967
Copy Paste: [[2412.05967]] Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt(https://arxiv.org/abs/2412.05967)
Keywords: language model, llm, prompt
Abstract: Prompting and fine-tuning have emerged as two competing paradigms for augmenting language models with new capabilities, such as the use of tools. Prompting approaches are quick to set up but rely on providing explicit demonstrations of each tool's usage in the model's prompt, thus coupling tool use to the task at hand and limiting generalisation. Fine-tuning removes the need for task-specific demonstrations of tool usage at runtime; however, this ties new capabilities to a single model, thus making already-heavier setup costs a recurring expense. In this paper, we introduce language hooks, a novel framework for augmenting language models with new capabilities that is decoupled both from the model's task-specific prompt and from the model itself. The language hook algorithm interleaves text generation by the base model with the execution of modular programs that trigger conditionally based on the existing text and the available capabilities. Upon triggering, programs may call external tools, auxiliary language models (e.g. using tool specific prompts), and modify the existing context. We benchmark our method against state-of-the-art baselines, find that it outperforms task-aware approaches, and demonstrate its ability to generalise to novel tasks.
摘要：提示和微调已成为两种相互竞争的范例，用于通过新功能（例如使用工具）增强语言模型。提示方法设置起来很快，但依赖于在模型提示中提供每个工具用法的明确演示，从而将工具使用与手头的任务联系起来，限制了泛化。微调消除了在运行时对特定任务的工具使用演示的需要；然而，这会将新功能与单个模型绑定在一起，从而使已经很重的设置成本成为经常性费用。在本文中，我们介绍了语言钩子，这是一种通过新功能增强语言模型的新框架，它与模型的任务特定提示和模型本身都解耦了。语言钩子算法将基础模型的文本生成与模块化程序的执行交织在一起，模块化程序根据现有文本和可用功能有条件地触发。触发后，程序可以调用外部工具、辅助语言模型（例如使用工具特定的提示）并修改现有上下文。我们根据最先进的基线对我们的方法进行了基准测试，发现它优于任务感知方法，并展示了其推广到新任务的能力。
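
The interleaving described in the abstract above (base-model generation alternating with programs that trigger conditionally on the existing text) can be sketched in a few lines. This is a toy under stated assumptions, not the authors' implementation: the "base model" is a scripted list of text chunks, and `calculator_hook` and `generate_with_hooks` are hypothetical names introduced only for illustration.

```python
import re
from typing import Callable, List, Optional

# A hook inspects the current text and returns updated text, or None if it does not trigger.
Hook = Callable[[str], Optional[str]]

def calculator_hook(text: str) -> Optional[str]:
    """Trigger when the text ends with 'a+b=' or 'a*b=' and append the result."""
    match = re.search(r"(\d+)\s*([+*])\s*(\d+)\s*=$", text)
    if match is None:
        return None  # trigger condition not met
    a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
    result = a + b if op == "+" else a * b
    return text + str(result)

def generate_with_hooks(chunks: List[str], hooks: List[Hook], prompt: str = "") -> str:
    """Interleave scripted 'model output' with conditional hook execution."""
    context = prompt
    for chunk in chunks:  # stand-in for successive calls to a base model
        context += chunk
        for hook in hooks:
            updated = hook(context)
            if updated is not None:
                context = updated  # a triggered hook may modify the context
    return context

scripted_output = ["The total is 12*7=", " units."]
print(generate_with_hooks(scripted_output, [calculator_hook]))
```

The hook is defined independently of both the prompt and the scripted model, which mirrors the decoupling the abstract emphasizes; a real system would replace the scripted chunks with calls to a language model.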

Title: Does RLHF Scale? Exploring the Impacts From Data, Model, and Method

Authors: Zhenyu Hou, Pengfan Du, Yilin Niu, Zhengxiao Du, Aohan Zeng, Xiao Liu, Minlie Huang, Hongning Wang, Jie Tang, Yuxiao Dong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06000
Pdf URL: https://arxiv.org/pdf/2412.06000
Copy Paste: [[2412.06000]] Does RLHF Scale? Exploring the Impacts From Data, Model, and Method(https://arxiv.org/abs/2412.06000)
Keywords: language model, llm, prompt
Abstract: This study explores the scaling properties of Reinforcement Learning from Human Feedback (RLHF) in Large Language Models (LLMs). Although RLHF is considered an important step in post-training of LLMs, its scaling potential is still largely unknown. We systematically analyze key components in the RLHF framework--model size, data composition, and inference budget--and their impacts on performance. Our findings show that increasing data diversity and volume improves reward model performance, helping process-supervision models scale better. For policy training, more response samples per prompt boost performance initially but quickly plateau. And larger reward models offer modest gains in policy training. In addition, larger policy models benefit less from RLHF with a fixed reward model. Overall, RLHF scales less efficiently than pretraining, with diminishing returns from additional computational resources. Based on these observations, we propose strategies to optimize RLHF performance within computational limits.
摘要：本研究探讨了大型语言模型 (LLM) 中强化学习人类反馈 (RLHF) 的扩展特性。尽管 RLHF 被认为是 LLM 后训练的重要步骤，但其扩展潜力仍然很大程度上未知。我们系统地分析了 RLHF 框架中的关键组件——模型大小、数据组成和推理预算——及其对性能的影响。我们的研究结果表明，增加数据多样性和数量可以提高奖励模型的性能，帮助流程监督模型更好地扩展。对于策略训练，每个提示的更多响应样本最初会提高性能，但很快就会停滞不前。而更大的奖励模型在策略训练中提供了适度的收益。此外，具有固定奖励模型的大型策略模型从 RLHF 中受益较少。总体而言，RLHF 的扩展效率低于预训练，额外计算资源的回报递减。基于这些观察，我们提出了在计算限制内优化 RLHF 性能的策略。

Title: 1-800-SHARED-TASKS at RegNLP: Lexical Reranking of Semantic Retrieval (LeSeR) for Regulatory Question Answering

Authors: Jebish Purbey, Drishti Sharma, Siddhant Gupta, Khawaja Murad, Siddartha Pullakhandam, Ram Mohan Rao Kadiyala
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06009
Pdf URL: https://arxiv.org/pdf/2412.06009
Copy Paste: [[2412.06009]] 1-800-SHARED-TASKS at RegNLP: Lexical Reranking of Semantic Retrieval (LeSeR) for Regulatory Question Answering(https://arxiv.org/abs/2412.06009)
Keywords: retrieval augmented generation
Abstract: This paper presents the system description of our entry for the COLING 2025 RegNLP RIRAG (Regulatory Information Retrieval and Answer Generation) challenge, focusing on leveraging advanced information retrieval and answer generation techniques in regulatory domains. We experimented with a combination of embedding models, including Stella, BGE, CDE, and Mpnet, and leveraged fine-tuning and reranking for retrieving relevant documents in top ranks. We utilized a novel approach, LeSeR, which achieved competitive results with a recall@10 of 0.8201 and map@10 of 0.6655 for retrievals. This work highlights the transformative potential of natural language processing techniques in regulatory applications, offering insights into their capabilities for implementing a retrieval augmented generation system while identifying areas for future improvement in robustness and domain adaptation.
摘要：本文介绍了我们参加 COLING 2025 RegNLP RIRAG（监管信息检索和答案生成）挑战赛的系统描述，重点是在监管领域利用先进的信息检索和答案生成技术。我们尝试了多种嵌入模型，包括 Stella、BGE、CDE 和 Mpnet，并利用微调和重新排名来检索排名靠前的相关文档。我们采用了一种新颖的方法 LeSeR，该方法在检索中取得了具有竞争力的结果，recall@10 为 0.8201，map@10 为 0.6655。这项工作突出了自然语言处理技术在监管应用中的变革潜力，深入了解了它们实现检索增强生成系统的能力，同时确定了未来在稳健性和领域适应性方面需要改进的领域。
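
For readers unfamiliar with the retrieval metrics quoted above, here is a minimal sketch of recall@k and (mean) average precision@k. The abstract does not spell out the exact definitions used by the challenge, so the normalization below (dividing average precision by `min(len(relevant), k)`) is an assumption, and the document ids are invented.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant documents found in the top-k results.

    Assumes `relevant` is a non-empty set of document ids.
    """
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)

def average_precision_at_k(ranked, relevant, k=10):
    """Average of precision values at each rank where a relevant document appears."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k)

def mean_average_precision_at_k(runs, k=10):
    """Mean of AP@k over (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Invented example: one query, four retrieved passages, three relevant passages.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(recall_at_k(ranked, relevant, k=4))
print(average_precision_at_k(ranked, relevant, k=4))
```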

Title: Steering Large Language Models to Evaluate and Amplify Creativity

Authors: Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Shao-yen Tseng, Vasudev Lal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06060
Pdf URL: https://arxiv.org/pdf/2412.06060
Copy Paste: [[2412.06060]] Steering Large Language Models to Evaluate and Amplify Creativity(https://arxiv.org/abs/2412.06060)
Keywords: language model, llm, prompt
Abstract: Although capable of generating creative text, Large Language Models (LLMs) are poor judges of what constitutes "creativity". In this work, we show that we can leverage this knowledge of how to write creatively in order to better judge what is creative. We take a mechanistic approach that extracts differences in the internal states of an LLM when prompted to respond "boringly" or "creatively" to provide a robust measure of creativity that corresponds strongly with human judgment. We also show these internal state differences can be applied to enhance the creativity of generated text at inference time.
摘要：尽管大型语言模型 (LLM) 能够生成创造性文本，但它们无法很好地判断什么是“创造力”。在这项研究中，我们表明，我们可以利用这种关于如何创造性写作的知识来更好地判断什么是创造力。我们采用一种机械方法，在提示 LLM 回答“无聊”或“创造性”时提取其内部状态的差异，以提供与人类判断高度一致的创造力的可靠衡量标准。我们还表明，这些内部状态差异可用于在推理时增强生成文本的创造力。
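
One common way to turn "differences in internal states" into a measurement is a difference-of-means direction. The sketch below assumes that approach, which the abstract does not confirm in detail; the two-dimensional "hidden states" are invented and all function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_direction(creative_states, boring_states):
    """Difference between the mean 'creative' and mean 'boring' states."""
    mean_creative = mean_vector(creative_states)
    mean_boring = mean_vector(boring_states)
    return [c - b for c, b in zip(mean_creative, mean_boring)]

def creativity_score(state, direction):
    """Project a state onto the direction; larger means closer to 'creative'."""
    return sum(s * d for s, d in zip(state, direction))

def steer(state, direction, strength=1.0):
    """Shift a state along the direction, as one might do at inference time."""
    return [s + strength * d for s, d in zip(state, direction)]

# Invented 2-d states standing in for real LLM activations.
creative = [[1.0, 0.0], [3.0, 0.0]]
boring = [[0.0, 1.0], [0.0, 3.0]]
direction = steering_direction(creative, boring)
print(direction)
print(creativity_score([1.0, 0.0], direction), creativity_score([0.0, 1.0], direction))
```

The same direction serves both purposes named in the abstract: projecting onto it gives a score, and adding a multiple of it nudges generation.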

Title: KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models

Authors: Fan Wang, Juyong Jiang, Chansung Park, Sunghun Kim, Jing Tang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06071
Pdf URL: https://arxiv.org/pdf/2412.06071
Copy Paste: [[2412.06071]] KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models(https://arxiv.org/abs/2412.06071)
Keywords: language model, llm
Abstract: The increasing sizes of large language models (LLMs) result in significant computational overhead and memory usage when adapting these models to specific tasks or domains. Various parameter-efficient fine-tuning (PEFT) methods have been devised to mitigate these challenges by training a small set of parameters for the task-specific updates of the model weights. Among PEFT methods, LoRA stands out for its simplicity and efficiency, inspiring the development of a series of variants. However, LoRA and its successors disregard the knowledge that is noisy or irrelevant to the targeted task, detrimentally impacting model performance and leading to suboptimality. To address this limitation, we introduce Knowledge-aware Singular-value Adaptation (KaSA), a PEFT method that leverages singular value decomposition (SVD) with knowledge-aware singular values to dynamically activate knowledge based on its relevance to the task at hand. We conduct extensive experiments across a range of LLMs on tasks spanning natural language understanding (NLU), generation (NLG), instruction following, and commonsense reasoning. The experimental results demonstrate that KaSA consistently outperforms full fine-tuning (FFT) and 14 popular PEFT baselines across 16 benchmarks and 4 synthetic datasets, underscoring our method's efficacy and adaptability. The source code of our method is publicly available.
摘要：大型语言模型 (LLM) 的大小不断增加，在将这些模型调整到特定任务或领域时，会导致大量的计算开销和内存使用。已经设计了各种参数高效微调 (PEFT) 方法来缓解这些挑战，方法是训练一小组参数以更新特定于任务的模型权重。在 PEFT 方法中，LoRA 因其简单性和效率而脱颖而出，激发了一系列变体的开发。然而，LoRA 及其后继者忽略了与目标任务嘈杂或无关的知识，对模型性能产生不利影响并导致次优。为了解决这一限制，我们引入了知识感知奇异值自适应 (KaSA)，这是一种 PEFT 方法，它利用奇异值分解 (SVD) 和知识感知奇异值来根据其与当前任务的相关性动态激活知识。我们在一系列 LLM 上对涵盖自然语言理解 (NLU)、生成 (NLG)、指令遵循和常识推理的任务进行了广泛的实验。实验结果表明，KaSA 在 16 个基准和 4 个合成数据集上的表现始终优于 FFT 和 14 个流行的 PEFT 基线，凸显了我们方法的有效性和适应性。我们方法的源代码可在此 https URL 上找到。

Title: Enhanced Computationally Efficient Long LoRA Inspired Perceiver Architectures for Auto-Regressive Language Modeling

Authors: Kaleel Mahmood, Shaoyi Huang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06106
Pdf URL: https://arxiv.org/pdf/2412.06106
Copy Paste: [[2412.06106]] Enhanced Computationally Efficient Long LoRA Inspired Perceiver Architectures for Auto-Regressive Language Modeling(https://arxiv.org/abs/2412.06106)
Keywords: language model, llm
Abstract: The Transformer architecture has revolutionized the Natural Language Processing field and is the backbone of Large Language Models (LLMs). The Transformer uses the attention mechanism that computes the pair-wise similarity between its input tokens to produce latent vectors that capture the semantic meaning of the input text. One of the challenges in the Transformer architecture is the quadratic complexity of the attention mechanism that prohibits the efficient processing of long sequence lengths. While many recent research works have attempted to reduce the $O(n^2)$ time complexity of attention to semi-linear complexity, maintaining high performance when such complexity is reduced remains an unsolved problem. One of the important works in this respect is the Perceiver class of architectures that have demonstrated excellent performance while reducing the computation complexity. In this paper, we use the PerceiverAR that was proposed for Auto-Regressive modeling as a baseline, and provide three different architectural enhancements to it with varying computation overhead tradeoffs. Inspired by the recently proposed efficient attention computation approach of Long-LoRA, we then present an equally efficient Perceiver-based architecture (termed Long LoRA Perceiver, LLP) that can be used as the base architecture in LLMs instead of just a fine-tuning add-on. Our results on different benchmarks indicate impressive improvements compared to recent Transformer based models.
摘要：Transformer 架构彻底改变了自然语言处理领域，是大型语言模型 (LLM) 的支柱。Transformer 使用注意力机制来计算其输入标记之间的成对相似性，以生成能够理解输入文本语义的潜在向量。Transformer 架构面临的挑战之一是注意力机制的二次复杂度，这阻碍了对长序列长度的有效处理。虽然许多最近的研究工作都试图将注意力的时间复杂度从 O(n^2) 降低到半线性复杂度，但在降低这种复杂性的同时保持高性能仍然是一个未解决的问题。这方面的重要工作之一是 Perceiver 类架构，它在降低计算复杂度的同时表现出了出色的性能。在本文中，我们使用为自回归建模提出的 PerceiverAR 作为基线，并为其提供三种不同的架构增强功能，具有不同的计算开销权衡。受最近提出的 Long-LoRA 高效注意力计算方法的启发，我们提出了一种同样高效的基于感知器的架构（称为 Long LoRA Pereceiver - LLP），可用作 LLM 中的基础架构，而不仅仅是微调插件。我们在不同基准测试中的结果表明，与最近基于 Transformer 的模型相比，该方法取得了显著的改进。
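
To make the quadratic-cost argument above concrete, the sketch below counts query-key score computations for full self-attention versus a Perceiver-style layout. It assumes one cross-attention layer from the inputs to a fixed set of latents followed by self-attention over the latents only, and it ignores heads, masking, and feed-forward cost; the function names and sizes are illustrative, not taken from the paper.

```python
def full_attention_scores(n_tokens: int, n_layers: int = 1) -> int:
    """Every token attends to every token in each layer: n^2 scores per layer."""
    return n_layers * n_tokens * n_tokens

def latent_attention_scores(n_tokens: int, n_latents: int, n_layers: int = 1) -> int:
    """One cross-attention (latents x tokens), then latent-only self-attention."""
    cross = n_tokens * n_latents
    latent_self = (n_layers - 1) * n_latents * n_latents
    return cross + latent_self

# Illustrative sizes: 12 layers, 1024 latents, growing context length.
for n in (1024, 8192, 65536):
    full = full_attention_scores(n, n_layers=12)
    latent = latent_attention_scores(n, n_latents=1024, n_layers=12)
    print(n, full, latent, round(full / latent, 1))
```

With the number of latents held fixed, the latent layout grows linearly in the context length while full attention grows quadratically, which is the gap the abstract refers to.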

Title: Infusing Prompts with Syntax and Semantics

Authors: Anton Bulle Labate, Fabio Gagliardi Cozman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06107
Pdf URL: https://arxiv.org/pdf/2412.06107
Copy Paste: [[2412.06107]] Infusing Prompts with Syntax and Semantics(https://arxiv.org/abs/2412.06107)
Keywords: language model, prompt
Abstract: Despite impressive success, language models often generate outputs with flawed linguistic structure. We analyze the effect of directly infusing various kinds of syntactic and semantic information into large language models. To demonstrate the value of our proposals, we focus on the translation of natural language queries to SQL, in particular dealing with languages with fewer resources than English, to better investigate how much help we can get from low-cost syntactic and semantic information. We show that linguistic analysis can significantly boost language models, to the point that we have surpassed previous best systems.
摘要：尽管取得了令人瞩目的成功，但语言模型通常会生成具有缺陷语言结构的输出。我们分析了将各种句法和语义信息直接注入大型语言模型的效果。为了证明我们提案的价值，我们专注于将自然语言查询转换为 SQL，特别是处理资源比英语少的语言，以更好地研究我们可以从低成本句法和语义信息中获得多少帮助。我们表明，语言分析可以显著提升语言模型，以至于我们已经超越了以前最好的系统。

Title: Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings

Authors: Zhao Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06134
Pdf URL: https://arxiv.org/pdf/2412.06134
Copy Paste: [[2412.06134]] Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings(https://arxiv.org/abs/2412.06134)
Keywords: language model, llm, chain-of-thought
Abstract: Current social bias benchmarks for Large Language Models (LLMs) primarily rely on pre-defined question formats like multiple-choice, limiting their ability to reflect the complexity and open-ended nature of real-world interactions. To address this gap, we extend the existing BBQ dataset by incorporating fill-in-the-blank and short-answer question types, designed to evaluate biases in an open-ended setting. Our findings reveal that LLMs tend to produce responses that are more biased against certain protected attributes, like age and socio-economic status. On the other hand, these biased outputs produced by LLMs can serve as valuable contexts and chains of thought for debiasing. Our debiasing approach, which combines zero-shot, few-shot, and chain-of-thought prompting, can significantly reduce the level of bias to almost 0. We open-source our evaluation and debiasing code, hoping to encourage further measurement and mitigation of bias and stereotypes in LLMs.
摘要：目前，大型语言模型 (LLM) 的社会偏见基准主要依赖于多项选择等预定义的问题格式，这限制了它们反映现实世界互动的复杂性和开放性的能力。为了解决这一差距，我们扩展了现有的 BBQ 数据集，引入了填空和简答题类型，旨在评估开放式环境中的偏见。我们的研究结果表明，LLM 往往会产生对某些受保护属性（如年龄和社会经济地位）有更大偏见的回答。另一方面，LLM 产生的这些有偏见的输出可以作为消除偏见的宝贵背景和思路链。我们的消除偏见方法结合了零样本、少量样本和思路链，可以将偏见水平显著降低到几乎为 0。我们开源了我们的评估和消除偏见代码，希望鼓励进一步测量和减轻 LLM 中的偏见和刻板印象。

Title: AIDE: Task-Specific Fine Tuning with Attribute Guided Multi-Hop Data Expansion

Authors: Jiayu Li, Xuan Zhu, Fang Liu, Yanjun Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06136
Pdf URL: https://arxiv.org/pdf/2412.06136
Copy Paste: [[2412.06136]] AIDE: Task-Specific Fine Tuning with Attribute Guided Multi-Hop Data Expansion(https://arxiv.org/abs/2412.06136)
Keywords: language model, llm, prompt
Abstract: Fine-tuning large language models (LLMs) for specific tasks requires high-quality, diverse training data relevant to the task. Recent research has leveraged LLMs to synthesize training data, but existing approaches either depend on large seed datasets or struggle to ensure both task relevance and data diversity in the generated outputs. To address these challenges, we propose AIDE, a novel data synthesis framework that uses a multi-hop process to expand 10 seed data points while ensuring diversity and task relevance. AIDE extracts the main topic and key knowledge attributes from the seed data to guide the synthesis process. In each subsequent hop, it extracts the topic and attributes from the newly generated data and continues guided synthesis. This process repeats for a total of K hops. To prevent irrelevant data generation as the hop depth increases, AIDE incorporates a residual connection mechanism and uses self-reflection to improve data quality. Our empirical results demonstrate that fine-tuning Mistral-7B, Llama-3.1-8B and Llama-3.2-3B with AIDE achieves more than 10% accuracy improvements over the base models across 13 tasks from 5 different benchmarks, while outperforming the models fine-tuned with state-of-the-art data synthesis methods like Evol-Instruct, DataTune and Prompt2Model.
摘要：针对特定任务对大型语言模型 (LLM) 进行微调需要与任务相关的高质量、多样化的训练数据。最近的研究利用 LLM 来合成训练数据，但现有的方法要么依赖于大型种子数据集，要么难以确保生成的输出中的任务相关性和数据多样性。为了应对这些挑战，我们提出了 AIDE，这是一种新颖的数据合成框架，它使用多跳过程来扩展 10 个种子数据点，同时确保多样性和任务相关性。AIDE 从种子数据中提取主要主题和关键知识属性来指导合成过程。在后续的每个跳跃中，它从新生成的数据中提取主题和属性并继续引导合成。此过程重复总共 K 跳。为了防止随着跳跃深度的增加而生成不相关的数据，AIDE 采用了残差连接机制并使用自我反射来提高数据质量。我们的实证结果表明，使用 AIDE 对 Mistral-7B、Llama-3.1-8B 和 Llama-3.2-3B 进行微调，在 5 个不同基准的 13 个任务中，准确率比基础模型提高了 10% 以上，同时优于使用 Evol-Instruct、DataTune 和 Prompt2Model 等最先进的数据合成方法进行微调的模型。

Title: Hate Speech According to the Law: An Analysis for Effective Detection

Authors: Katerina Korre, John Pavlopoulos, Paolo Gajo, Alberto Barrón-Cedeño
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06144
Pdf URL: https://arxiv.org/pdf/2412.06144
Copy Paste: [[2412.06144]] Hate Speech According to the Law: An Analysis for Effective Detection(https://arxiv.org/abs/2412.06144)
Keywords: language model, prompt
Abstract: The issue of hate speech extends beyond the confines of the online realm. It is a problem with real-life repercussions, prompting most nations to formulate legal frameworks that classify hate speech as a punishable offence. These legal frameworks differ from one country to another, contributing to the confusion that online platforms face when addressing reported instances of hate speech. With the definitions of hate speech falling short of providing a robust framework, we turn our gaze to hate speech laws. We consult the opinion of legal experts on a hate speech dataset and we experiment with various approaches, such as models pretrained on both hate speech and legal data, as well as two large language models (Qwen2-7B-Instruct and Meta-Llama-3-70B). Due to the time-consuming nature of data acquisition for prosecutable hate speech, we use pseudo-labeling to improve our pretrained models. This study highlights the importance of amplifying research on prosecutable hate speech and provides insights into effective strategies for combating hate speech within the parameters of legal frameworks. Our findings show that legal knowledge in the form of annotations can be useful when classifying prosecutable hate speech, yet more attention should be paid to the differences between the laws.
摘要：仇恨言论问题超出了网络领域的范围。这是一个具有现实影响的问题，促使大多数国家制定法律框架，将仇恨言论归类为可惩罚的罪行。这些法律框架因国家而异，导致网络平台在处理报告的仇恨言论案例时不得不面对巨大的混乱。由于仇恨言论的定义不足以引入一个强大的框架，我们将目光转向仇恨言论法。我们咨询了法律专家对仇恨言论数据集的意见，并通过采用各种方法进行实验，例如对仇恨言论和法律数据进行预训练的模型，以及利用两个大型语言模型（Qwen2-7B-Instruct 和 Meta-Llama-3-70B）。由于可起诉仇恨言论的数据获取非常耗时，我们使用伪标签来改进我们的预训练模型。这项研究强调了扩大可起诉仇恨言论研究的重要性，并提供了在法律框架范围内打击仇恨言论的有效策略的见解。我们的研究结果表明，以注释形式呈现的法律知识在对可起诉的仇恨言论进行分类时很有用，但应更加关注法律之间的差异。

Title: SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs

Authors: James Vo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06198
Pdf URL: https://arxiv.org/pdf/2412.06198
Copy Paste: [[2412.06198]] SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs(https://arxiv.org/abs/2412.06198)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) scale to longer context windows, the computational cost of attention mechanisms, which traditionally grows quadratically with input length, presents a critical challenge for real-time and memory-constrained deployments. Existing sparse attention techniques have sought to reduce this complexity, but they often incur significant overhead or compromise accuracy, making them less practical for large contexts on mid-range hardware. In this paper, we introduce SparseAccelerate, a dynamic sparse attention method that adapts its sparsity patterns based on input characteristics, effectively flattening the attention complexity curve. Our approach is effective for input lengths starting at 16K tokens and scales efficiently up to 128K tokens on dual NVIDIA A5000 GPUs (24GB each). Experimental results show that SparseAccelerate achieves up to a 1.04x reduction in Time-To-First-Token (TTFT) latency at 32K tokens, while also providing substantial memory savings. These improvements yield practical gains for memory-intensive applications and long-context tasks that were previously infeasible with standard attention. Beyond latency reductions, SparseAccelerate fundamentally shifts the scaling trend, demonstrating the smallest TTFT growth gradient relative to context length among competing methods. Ongoing evaluations on diverse benchmarks confirm its scalability, positioning SparseAccelerate as a critical advancement toward efficient, real-time, and large-context LLM inference on accessible hardware.
摘要：随着大型语言模型 (LLM) 扩展到更长的上下文窗口，注意力机制的计算成本（传统上随输入长度二次增长）对实时和内存受限的部署提出了严峻挑战。现有的稀疏注意力技术试图降低这种复杂性，但它们通常会产生大量开销或损害准确性，因此对于中档硬件上的大型上下文来说不太实用。在本文中，我们介绍了 SparseAccelerate，这是一种动态稀疏注意力方法，它根据输入特征调整其稀疏模式，有效地拉平了注意力复杂度曲线。我们的方法对从 16K 令牌开始的输入长度有效，并且可以在双 NVIDIA A5000 GPU（每个 24GB）上有效扩展到 128K 令牌。实验结果表明，SparseAccelerate 在 32K 令牌时将第一个令牌时间 (TTFT) 延迟减少了 1.04 倍，同时还节省了大量内存。这些改进为内存密集型应用程序和长上下文任务带来了实际收益，而这些任务以前在标准注意下是无法实现的。除了减少延迟之外，SparseAccelerate 还从根本上改变了扩展趋势，在竞争方法中展示了相对于上下文长度最小的 TTFT 增长梯度。对各种基准的持续评估证实了它的可扩展性，将 SparseAccelerate 定位为在可访问硬件上实现高效、实时和大上下文 LLM 推理的关键进步。

Title: SiReRAG: Indexing Similar and Related Information for Multihop Reasoning

Authors: Nan Zhang, Prafulla Kumar Choubey, Alexander Fabbri, Gabriel Bernadett-Shapiro, Rui Zhang, Prasenjit Mitra, Caiming Xiong, Chien-Sheng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06206
Pdf URL: https://arxiv.org/pdf/2412.06206
Copy Paste: [[2412.06206]] SiReRAG: Indexing Similar and Related Information for Multihop Reasoning(https://arxiv.org/abs/2412.06206)
Keywords: retrieval-augmented generation
Abstract: Indexing is an important step towards strong performance in retrieval-augmented generation (RAG) systems. However, existing methods organize data based on either semantic similarity (similarity) or related information (relatedness), but do not cover both perspectives comprehensively. Our analysis reveals that modeling only one perspective results in insufficient knowledge synthesis, leading to suboptimal performance on complex tasks requiring multihop reasoning. In this paper, we propose SiReRAG, a novel RAG indexing approach that explicitly considers both similar and related information. On the similarity side, we follow existing work and explore some variants to construct a similarity tree based on recursive summarization. On the relatedness side, SiReRAG extracts propositions and entities from texts, groups propositions via shared entities, and generates recursive summaries to construct a relatedness tree. We index and flatten both similarity and relatedness trees into a unified retrieval pool. Our experiments demonstrate that SiReRAG consistently outperforms state-of-the-art indexing methods on three multihop datasets (MuSiQue, 2WikiMultiHopQA, and HotpotQA), with an average 1.9% improvement in F1 scores. As a reasonably efficient solution, SiReRAG enhances existing reranking methods significantly, with up to 7.8% improvement in average F1 scores.
摘要：索引是实现增强检索生成 (RAG) 系统高性能的重要一步。然而，现有方法基于语义相似性 (相似性) 或相关信息 (关联性) 来组织数据，但并未全面涵盖这两个角度。我们的分析表明，仅对一个角度进行建模会导致知识综合不足，从而导致在需要多跳推理的复杂任务上性能不佳。在本文中，我们提出了 SiReRAG，这是一种新颖的 RAG 索引方法，它明确考虑了相似和相关信息。在相似性方面，我们遵循现有工作并探索一些差异以基于递归摘要构建相似性树。在关联性方面，SiReRAG 从文本中提取命题和实体，通过共享实体对命题进行分组，并生成递归摘要以构建关联树。我们将相似性和关联性树都索引并展平到统一的检索池中。我们的实验表明，SiReRAG 在三个多跳数据集（MuSiQue、2WikiMultiHopQA 和 HotpotQA）上的表现始终优于最先进的索引方法，F1 分数平均提高了 1.9%。作为一种相当有效的解决方案，SiReRAG 显著增强了现有的重新排序方法，平均 F1 分数提高了 7.8%。
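
The relatedness step described in the SiReRAG abstract (grouping propositions via shared entities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `group_by_shared_entities`, the input format, and the rule that an entity must appear in at least two propositions to form a group are all assumptions made for illustration. Proposition and entity extraction, which the paper performs on raw text, is replaced here by hand-written examples.

```python
from collections import defaultdict


def group_by_shared_entities(propositions):
    """Map each entity to the propositions that mention it.

    `propositions` is a list of (text, entities) pairs. Hypothetical input
    format: the paper extracts these from raw text, which is not shown here.
    """
    groups = defaultdict(list)
    for text, entities in propositions:
        for entity in entities:
            groups[entity].append(text)
    # Assumption: only entities shared by two or more propositions form a group.
    return {entity: texts for entity, texts in groups.items() if len(texts) >= 2}


# Hand-written toy propositions standing in for extracted ones.
propositions = [
    ("Marie Curie won the Nobel Prize in Physics in 1903.",
     {"Marie Curie", "Nobel Prize in Physics"}),
    ("Marie Curie was born in Warsaw.", {"Marie Curie", "Warsaw"}),
    ("Warsaw is the capital of Poland.", {"Warsaw", "Poland"}),
]

groups = group_by_shared_entities(propositions)
for entity, texts in sorted(groups.items()):
    print(entity, "->", texts)
```

In this toy input, the second proposition lands in two groups, which is the kind of link a multihop question ("In which country was Marie Curie born?") would need to follow.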

Title: A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension

Authors: Saahith Janapati, Yangfeng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06245
Pdf URL: https://arxiv.org/pdf/2412.06245
Copy Paste: [[2412.06245]] A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension(https://arxiv.org/abs/2412.06245)
Keywords: language model, llm, prompt
Abstract: The performance of Large Language Models (LLMs) on natural language tasks can be improved through both supervised fine-tuning (SFT) and in-context learning (ICL), which operate via distinct mechanisms. Supervised fine-tuning updates the model's weights by minimizing loss on training data, whereas in-context learning leverages task demonstrations embedded in the prompt, without changing the model's parameters. This study investigates the effects of these learning paradigms on the hidden representations of LLMs using Intrinsic Dimension (ID). We use ID to estimate the number of degrees of freedom between representations extracted from LLMs as they perform specific natural language tasks. We first explore how the ID of LLM representations evolves during SFT and how it varies due to the number of demonstrations in ICL. We then compare the IDs induced by SFT and ICL and find that ICL consistently induces a higher ID compared to SFT, suggesting that representations generated during ICL reside in higher dimensional manifolds in the embedding space.
摘要：大型语言模型 (LLM) 在自然语言任务上的性能可以通过监督微调 (SFT) 和上下文学习 (ICL) 来提高，它们通过不同的机制运行。监督微调通过最小化训练数据的损失来更新模型的权重，而上下文学习则利用提示中嵌入的任务演示，而不改变模型的参数。本研究使用内在维度 (ID) 研究了这些学习范式对 LLM 隐藏表示的影响。我们使用 ID 来估计从 LLM 中提取的表示在执行特定自然语言任务时之间的自由度数。我们首先探索 LLM 表示的 ID 在 SFT 期间如何演变，以及它如何因 ICL 中的演示数量而变化。然后，我们比较了 SFT 和 ICL 引起的 ID，发现 ICL 始终比 SFT 诱导更高的 ID，这表明在 ICL 期间生成的表示位于嵌入空间中更高维的流形中。
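
To make the notion of intrinsic dimension concrete, here is a minimal sketch of one common estimator, the maximum-likelihood form of TwoNN, which uses the ratio of each point's second to first nearest-neighbor distance. The abstract does not say which estimator the study uses, so this choice is an assumption; the function name and the toy data (a 2D plane embedded in 3D, standing in for hidden representations) are also illustrative only.

```python
import math
import random


def two_nn_intrinsic_dimension(points):
    """Estimate intrinsic dimension as N / sum(log(r2 / r1)).

    r1 and r2 are each point's distances to its first and second nearest
    neighbors. Points with a duplicate (r1 == 0) are skipped.
    """
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


# Toy data: 400 points on a 2D plane embedded in 3D space.
random.seed(0)
plane = []
for _ in range(400):
    x, y = random.random(), random.random()
    plane.append((x, y, x + y))

estimate = two_nn_intrinsic_dimension(plane)
print(f"estimated intrinsic dimension: {estimate:.2f}")
```

Although every point has three coordinates, the estimate should come out near 2, the number of degrees of freedom actually used to generate the data.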

Title: Optimizing Multi-Task Learning for Enhanced Performance in Large Language Models

Authors: Zhen Qi, Jiajing Chen, Shuo Wang, Bingying Liu, Hongye Zheng, Chihang Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06249
Pdf URL: https://arxiv.org/pdf/2412.06249
Copy Paste: [[2412.06249]] Optimizing Multi-Task Learning for Enhanced Performance in Large Language Models(https://arxiv.org/abs/2412.06249)
Keywords: language model, gpt
Abstract: This study explores methods for improving the performance of GPT-4-based large language models under a multi-task learning framework and conducts experiments on two tasks: text classification and automatic summary generation. Through the combined design of shared feature extractors and task-specific modules, we achieve knowledge sharing and optimization of multiple tasks in the same model. The experiment uses multiple subtasks of the GLUE dataset to compare the performance of the multi-task model with the single-task GPT-4, the multi-task version of GPT-3, the BERT basic model, and the classic Bi-LSTM with Attention model. The results show that the proposed multi-task learning model outperforms the other comparison models in terms of text classification accuracy and ROUGE value of summary generation, demonstrating the advantages of multi-task learning in improving model generalization ability and collaborative learning between tasks. The model maintains a stable loss convergence rate during training, showing good learning efficiency and adaptability to the test set. This study verifies the applicability of the multi-task learning framework in large language models, especially in improving the model's ability to balance different tasks. In the future, with the combination of large language models and multimodal data and the application of dynamic task adjustment technology, the framework based on multi-task learning is expected to play a greater role in practical applications across fields and provide new ideas for the development of general artificial intelligence.
摘要：本研究旨在探索多任务学习框架下基于GPT-4的大型语言模型的性能提升方法，针对文本分类和自动摘要生成两个任务开展实验，通过共享特征提取器和任务专用模块的联合设计，实现同一模型中多个任务的知识共享和优化。实验使用GLUE数据集的多个子任务，对比了多任务模型与单任务GPT-4、多任务版本GPT-3、BERT基础模型、经典的Bi-LSTM with Attention模型的性能。结果表明，所提出的多任务学习模型在文本分类准确率和摘要生成ROUGE值方面优于其他对比模型，展现了多任务学习在提升模型泛化能力和任务间协同学习方面的优势。模型在训练过程中保持稳定的loss收敛速度，表现出良好的学习效率和对测试集的适应性。本研究验证了多任务学习框架在大型语言模型中的适用性，特别是在提高模型平衡不同任务的能力方面。未来随着大型语言模型与多模态数据的结合以及动态任务调整技术的应用，基于多任务学习的框架有望在跨领域的实际应用中发挥更大作用，为通用人工智能的发展提供新思路。

Title: Methods for Legal Citation Prediction in the Age of LLMs: An Australian Law Case Study

Authors: Ehsan Shareghi, Jiuzhou Han, Paul Burgess
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.06272
Pdf URL: https://arxiv.org/pdf/2412.06272
Copy Paste: [[2412.06272]] Methods for Legal Citation Prediction in the Age of LLMs: An Australian Law Case Study(https://arxiv.org/abs/2412.06272)
Keywords: language model, llm, hallucination, prompt
Abstract: In recent years, Large Language Models (LLMs) have shown great potential across a wide range of legal tasks. Despite these advances, mitigating hallucination remains a significant challenge, with state-of-the-art LLMs still frequently generating incorrect legal references. In this paper, we focus on the problem of legal citation prediction within the Australian law context, where correctly identifying and citing relevant legislation or precedents is critical. We compare several approaches: prompting general purpose and law-specialised LLMs, retrieval-only pipelines with both generic and domain-specific embeddings, task-specific instruction-tuning of LLMs, and hybrid strategies that combine LLMs with retrieval augmentation, query expansion, or voting ensembles. Our findings indicate that pre-training alone, even law-specialised pre-training, is insufficient for achieving satisfactory citation accuracy. In contrast, instruction tuning on our task-specific dataset dramatically boosts performance, reaching the best results across all settings. We also highlight that database granularity, along with the type of embeddings, plays a critical role in the performance of retrieval systems. Among retrieval-based approaches, hybrid methods consistently outperform retrieval-only setups, and among these, ensemble voting delivers the best result by combining the predictive quality of instruction-tuned LLMs with the retrieval system.
摘要：近年来，大型语言模型 (LLM) 在广泛的法律任务中显示出巨大的潜力。尽管取得了这些进展，但减轻幻觉仍然是一项重大挑战，最先进的 LLM 仍然经常产生不正确的法律参考。在本文中，我们重点关注澳大利亚法律背景下的法律引文预测问题，其中正确识别和引用相关立法或先例至关重要。我们比较了几种方法：提示通用和法律专业 LLM、具有通用和领域特定嵌入的仅检索管道、任务特定的 LLM 指令调整，以及将 LLM 与检索增强、查询扩展或投票集成相结合的混合策略。我们的研究结果表明，即使经过法律专业预训练，仅进行领域特定预训练也不足以实现令人满意的引文准确性。相比之下，我们任务特定的数据集上的指令调整显着提高了性能，在所有设置中都达到了最佳结果。我们还强调，数据库粒度以及嵌入类型对检索系统的性能起着至关重要的作用。在基于检索的方法中，混合方法始终优于仅检索设置，其中，集成投票通过将指令调整的 LLM 的预测质量与检索系统相结合，可提供最佳结果。

Title: PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

Authors: Qian Zhang, Panfeng Chen, Jiali Li, Linkun Feng, Shuyu Liu, Mei Chen, Hui Li, Yanhao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06287
Pdf URL: https://arxiv.org/pdf/2412.06287
Copy Paste: [[2412.06287]] PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models(https://arxiv.org/abs/2412.06287)
Keywords: language model, llm
Abstract: The emergence of Large Language Models (LLMs) in the medical domain has stressed a compelling need for standard datasets to evaluate their question-answering (QA) performance. Although there have been several benchmark datasets for medical QA, they either cover common knowledge across different departments or are specific to another department rather than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generation capacity of LLMs. Therefore, they cannot comprehensively assess the QA ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,565 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs. Through an in-depth analysis of experimental results, we offer insights into the ability of LLMs to answer pediatric questions in the Chinese context, highlighting their limitations for further improvements. Our code and data are published at this https URL.
摘要：大型语言模型 (LLM) 在医学领域的出现，凸显了对标准数据集的迫切需求，以评估其问答 (QA) 性能。尽管医学问答已经有了几个基准数据集，但它们要么涵盖不同部门的共同知识，要么特定于儿科以外的其他部门。此外，其中一些仅限于客观问题，无法衡量 LLM 的生成能力。因此，它们无法全面评估儿科 LLM 的问答能力。为了填补这一空白，我们构建了第一个用于 LLM 评估的中国儿科数据集 PediaBench。具体来说，它包含 4,565 个客观问题和 1,632 个主观问题，涵盖 12 个儿科疾病组。它采用基于不同难度级别的综合评分标准，全面评估 LLM 在指令遵循、知识理解、临床病例分析等方面的熟练程度。最后，我们通过对 20 个开源和商业 LLM 的大量实验验证了 PediaBench 的有效性。通过深入分析实验结果，我们深入了解了 LLM 在中国背景下回答儿科问题的能力，并指出了其有待进一步改进的局限性。我们的代码和数据发布在此 https URL 上。

Title: LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation

Authors: Haihang Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06419
Pdf URL: https://arxiv.org/pdf/2412.06419
Copy Paste: [[2412.06419]] LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation(https://arxiv.org/abs/2412.06419)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable performance across various language tasks, but their widespread deployment is impeded by their large size and high computational costs. Structural pruning is a prevailing technique used to introduce sparsity into pre-trained models and facilitate direct hardware acceleration during inference by removing redundant connections (structurally-grouped parameters), such as channels and attention heads. Existing structural pruning approaches often employ either global or layer-wise pruning criteria; however, they are hindered by ineffectiveness stemming from inaccurate evaluation of connection importance. Global pruning methods typically assess component importance using near-zero and unreliable gradients, while layer-wise pruning approaches encounter significant pruning error accumulation issues. To this end, we propose a more accurate pruning metric based on the block-wise importance score propagation, termed LLM-BIP. Specifically, LLM-BIP precisely evaluates connection importance by gauging its influence on the respective transformer block output, which can be efficiently approximated in a single forward pass through an upper bound derived from the assumption of Lipschitz continuity. We evaluate the proposed method using LLaMA-7B, Vicuna-7B, and LLaMA-13B across common zero-shot tasks. The results demonstrate that our approach achieves an average of 3.26% increase in accuracy for common reasoning tasks compared to previous best baselines. It also reduces perplexity by 14.09 and 68.76 on average for the WikiText2 dataset and PTB dataset, respectively.
摘要：大型语言模型 (LLM) 在各种语言任务中都表现出色，但由于其规模大、计算成本高，因此无法广泛部署。结构剪枝是一种流行的技术，用于将稀疏性引入预训练模型并通过删除冗余连接（结构分组参数）（例如通道和注意力头）来促进推理过程中的直接硬件加速。现有的结构剪枝方法通常采用全局或逐层剪枝标准；然而，由于对连接重要性的评估不准确，导致效率低下，从而阻碍了它们的发展。全局剪枝方法通常使用接近零且不可靠的梯度来评估组件重要性，而逐层剪枝方法会遇到严重的剪枝误差累积问题。为此，我们提出了一种基于分块重要性分数传播的更准确的剪枝指标，称为 LLM-BIP。具体而言，LLM-BIP 通过衡量连接对各个转换器块输出的影响来精确评估连接的重要性，该输出可以通过从 Lipschitz 连续性假设得出的上限在一次前向传递中有效地近似。我们使用 LLaMA-7B、Vicuna-7B 和 LLaMA-13B 在常见的零样本任务中评估所提出的方法。结果表明，与之前的最佳基线相比，我们的方法在常见推理任务中的准确率平均提高了 3.26%。它还分别将 WikiText2 数据集和 PTB 数据集的困惑度平均降低了 14.09 和 68.76。

Title: Gated Delta Networks: Improving Mamba2 with Delta Rule

Authors: Songlin Yang, Jan Kautz, Ali Hatamizadeh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06464
Pdf URL: https://arxiv.org/pdf/2412.06464
Copy Paste: [[2412.06464]] Gated Delta Networks: Improving Mamba2 with Delta Rule(https://arxiv.org/abs/2412.06464)
Keywords: language model
Abstract: Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
摘要：线性 Transformer 作为标准 Transformer 的有效替代品而备受关注，但它们在检索和长上下文任务中的表现有限。为了解决这些限制，最近的研究探索了两种不同的机制：用于自适应内存控制的门控和用于精确内存修改的增量更新规则。我们观察到这些机制是互补的：门控可以快速擦除内存，而增量规则则有助于有针对性的更新。基于这一见解，我们引入了门控增量规则并开发了一种针对现代硬件优化的并行训练算法。我们提出的架构 Gated DeltaNet 在多个基准测试中始终超越现有模型，如 Mamba2 和 DeltaNet，包括语言建模、常识推理、上下文检索、长度外推和长上下文理解。我们通过开发将 Gated DeltaNet 层与滑动窗口注意或 Mamba2 层相结合的混合架构进一步提高了性能，从而实现了更高的训练效率和卓越的任务性能。
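
The complementarity the Gated Delta Networks abstract describes (gating for erasure, the delta rule for targeted updates) can be shown with a toy recurrent memory. The update below, S ← αS + β(v − αSk)kᵀ, is a sketch of a gated delta rule as commonly written, with a decay gate `alpha` and a write strength `beta`; the exact parameterization in the paper may differ, and the function names are hypothetical. The paper's parallel training algorithm is not represented here; this is only the sequential recurrence in plain Python.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent update of a key-value memory matrix S (d_v x d_k).

    alpha in [0, 1] decays the whole memory (the gate); beta in [0, 1]
    controls how strongly the value stored under key k is corrected
    toward v (the delta rule).
    """
    d_v, d_k = len(S), len(k)
    # What the decayed memory currently returns for key k.
    pred = [alpha * sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Decay the memory, then write the prediction error along k.
    return [
        [alpha * S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d_k)]
        for i in range(d_v)
    ]


def read(S, q):
    """Query the memory: returns S @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


S0 = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]  # unit-norm key

# Targeted write, then a targeted overwrite of the same key (no decay).
S1 = gated_delta_step(S0, key, [1.0, 2.0], alpha=1.0, beta=1.0)
S2 = gated_delta_step(S1, key, [3.0, 4.0], alpha=1.0, beta=1.0)

# Gate fully closed and no write: the memory is erased in one step.
S3 = gated_delta_step(S2, key, [0.0, 0.0], alpha=0.0, beta=0.0)

print(read(S1, key), read(S2, key), S3)
```

With a unit-norm key, `alpha=1` and `beta=1`, the value read back for that key is exactly the most recently written value, which is the targeted-update behavior; setting `alpha=0` clears everything at once, which is the rapid-erasure behavior.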

Title: SafeWorld: Geo-Diverse Safety Alignment

Authors: Da Yin, Haoyi Qiu, Kung-Hsiang Huang, Kai-Wei Chang, Nanyun Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06483
Pdf URL: https://arxiv.org/pdf/2412.06483
Copy Paste: [[2412.06483]] SafeWorld: Geo-Diverse Safety Alignment(https://arxiv.org/abs/2412.06483)
Keywords: language model, gpt, llm
Abstract: In the rapidly evolving field of Large Language Models (LLMs), ensuring safety is a crucial and widely discussed topic. However, existing works often overlook the geo-diversity of cultural and legal standards across the world. To demonstrate the challenges posed by geo-diverse safety standards, we introduce SafeWorld, a novel benchmark specifically designed to evaluate LLMs' ability to generate responses that are not only helpful but also culturally sensitive and legally compliant across diverse global contexts. SafeWorld encompasses 2,342 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races. On top of it, we propose a multi-dimensional automatic safety evaluation framework that assesses the contextual appropriateness, accuracy, and comprehensiveness of responses. Our evaluations reveal that current LLMs struggle to meet these criteria. To enhance LLMs' alignment with geo-diverse safety standards, we synthesize helpful preference pairs for Direct Preference Optimization (DPO) alignment training. The preference pair construction aims to encourage LLMs to behave appropriately and provide precise references to relevant cultural norms and policies when necessary. Our trained SafeWorldLM outperforms all competing models, including GPT-4o on all three evaluation dimensions by a large margin. Global human evaluators also note a nearly 20% higher winning rate in helpfulness and harmfulness evaluation. Our code and data can be found here: this https URL.
摘要：在快速发展的大型语言模型 (LLM) 领域，确保安全是一个至关重要且被广泛讨论的话题。然而，现有的研究往往忽视了世界各地文化和法律标准的地理多样性。为了展示地理多样化安全标准带来的挑战，我们引入了 SafeWorld，这是一个新颖的基准，专门用于评估 LLM 生成不仅有用而且在不同的全球背景下具有文化敏感性和法律合规性的响应的能力。SafeWorld 包含 2,342 个测试用户查询，每个查询都基于来自 50 个国家/地区和 493 个地区/种族的高质量、人工验证的文化规范和法律政策。在此基础上，我们提出了一个多维自动安全评估框架，用于评估响应的上下文适当性、准确性和全面性。我们的评估表明，当前的 LLM 难以满足这些标准。为了增强 LLM 与地理多样化安全标准的一致性，我们合成了有用的偏好对以进行直接偏好优化 (DPO) 对齐训练。偏好对构建旨在鼓励 LLM 行为得体，并在必要时提供相关文化规范和政策的精确参考。我们训练过的 SafeWorldLM 在所有三个评估维度上都远远胜过所有竞争模型，包括 GPT-4o。全球人类评估者还注意到，在有用性和有害性评估中，获胜率高出近 20%。我们的代码和数据可以在这里找到：这个 https URL。

Title: Small Languages, Big Models: A Study of Continual Training on Languages of Norway

Authors: David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06484
Pdf URL: https://arxiv.org/pdf/2412.06484
Copy Paste: [[2412.06484]] Small Languages, Big Models: A Study of Continual Training on Languages of Norway(https://arxiv.org/abs/2412.06484)
Keywords: language model
Abstract: Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Sámi. To address this issue, we present a novel three-stage continual training approach. We also experiment with combining causal and masked language modeling to get more flexible models. Based on our findings, we train, evaluate, and openly release a new large generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
摘要：训练大型语言模型需要大量数据，这对挪威语等使用范围较窄的语言来说是一个挑战，对萨米语等资源匮乏的语言来说更是如此。为了解决这个问题，我们提出了一种新颖的三阶段持续训练方法。我们还尝试结合因果语言模型和掩码语言模型来获得更灵活的模型。根据我们的发现，我们训练、评估并公开发布了一个包含 114 亿个参数的挪威语、尼诺斯克语和北萨米语的新型大型生成语言模型：NorMistral-11B。

Title: Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy

Authors: Min Zeng, Caiquan Liu, Shiqi Zhang, Li Xie, Chen Sang, Xiaoxin Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06575
Pdf URL: https://arxiv.org/pdf/2412.06575
Copy Paste: [[2412.06575]] Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy(https://arxiv.org/abs/2412.06575)
Keywords: language model, llm
Abstract: In recent years, the use of large language models (LLMs) for text classification has attracted widespread attention. Despite this, the classification accuracy of LLMs has not yet universally surpassed that of smaller models. LLMs can enhance their performance in text classification through fine-tuning. However, existing data quality research based on LLMs is challenging to apply directly to solve text classification problems. To further improve the performance of LLMs in classification tasks, this paper proposes a data quality enhancement (DQE) method for text classification based on LLMs. This method starts by using a greedy algorithm to select data, dividing the dataset into sampled and unsampled subsets, and then performing fine-tuning of the LLMs using the sampled data. Subsequently, this model is used to predict the outcomes for the unsampled data, categorizing incorrectly predicted data into uncovered, difficult, and noisy data. Experimental results demonstrate that our method effectively enhances the performance of LLMs in text classification tasks and significantly improves training efficiency, saving nearly half of the training time. Our method has achieved state-of-the-art performance in several open-source classification tasks.
摘要：近年来，大型语言模型（LLM）在文本分类中的应用受到广泛关注，但其分类准确率尚未普遍超越小型模型。LLM可以通过微调提升其在文本分类中的表现，但现有的基于LLM的数据质量研究难以直接应用于解决文本分类问题。为了进一步提升LLM在分类任务中的表现，本文提出了一种基于LLM的文本分类数据质量增强（DQE）方法。该方法首先使用贪婪算法选择数据，将数据集分为采样子集和非采样子集，然后使用采样数据对LLM进行微调。然后使用该模型对非采样数据进行预测，将预测错误的数据分为未覆盖数据、困难数据和噪声数据。实验结果表明，我们的方法有效提升了LLM在文本分类任务中的表现，并显著提高了训练效率，节省了近一半的训练时间。我们的方法在多个开源分类任务中取得了最先进的性能。

Title: Anchoring Bias in Large Language Models: An Experimental Study

Authors: Jiaxu Lou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06593
Pdf URL: https://arxiv.org/pdf/2412.06593
Copy Paste: [[2412.06593]] Anchoring Bias in Large Language Models: An Experimental Study(https://arxiv.org/abs/2412.06593)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Large Language Models (LLMs) like GPT-4 and Gemini have significantly advanced artificial intelligence by enabling machines to generate and comprehend human-like text. Despite their impressive capabilities, LLMs are not immune to limitations, including various biases. While much research has explored demographic biases, the cognitive biases in LLMs have not been equally scrutinized. This study delves into anchoring bias, a cognitive bias where initial information disproportionately influences judgment. Utilizing an experimental dataset, we examine how anchoring bias manifests in LLMs and verify the effectiveness of various mitigation strategies. Our findings highlight the sensitivity of LLM responses to biased hints. At the same time, our experiments show that, to mitigate anchoring bias, one needs to collect hints from comprehensive angles to prevent the LLMs from being anchored to individual pieces of information, while simple algorithms such as Chain-of-Thought, Thoughts of Principles, Ignoring Anchor Hints, and Reflection are not sufficient.
摘要：GPT-4 和 Gemini 等大型语言模型 (LLM) 使机器能够生成和理解类似人类的文本，从而显著提高了人工智能水平。尽管 LLM 具有令人印象深刻的功能，但它们并非没有局限性，包括各种偏见。虽然许多研究都探讨了人口统计学偏见，但 LLM 中的认知偏见尚未得到同等的审查。本研究深入研究了锚定偏差，这是一种认知偏差，其中初始信息不成比例地影响判断。利用实验数据集，我们研究了锚定偏差在 LLM 中的表现方式，并验证了各种缓解策略的有效性。我们的研究结果强调了 LLM 对有偏见的提示的响应的敏感性。同时，我们的实验表明，为了减轻锚定偏差，需要从全面的角度收集提示，以防止 LLM 锚定在单个信息上，而简单的算法（如思维链、原则思维、忽略锚定提示和反思）是不够的。

Title: Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

Authors: Tianxin Xie, Yan Rong, Pengfei Zhang, Li Liu
Subjects: cs.CL, cs.AI, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.06602
Pdf URL: https://arxiv.org/pdf/2412.06602
Copy Paste: [[2412.06602]] Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey(https://arxiv.org/abs/2412.06602)
Keywords: language model, prompt
Abstract: Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that aims to generate natural-sounding human speech from text. Recently, with the increasing industrial demand, TTS technologies have evolved beyond synthesizing human-like speech to enabling controllable speech generation. This includes fine-grained control over various attributes of synthesized speech such as emotion, prosody, timbre, and duration. Besides, advancements in deep learning, such as diffusion and large language models, have significantly enhanced controllable TTS over the past several years. In this paper, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts, aiming to provide a clear understanding of the current state of research. We examine the general controllable TTS pipeline, challenges, model architectures, and control strategies, offering a comprehensive and clear taxonomy of existing methods. Additionally, we provide a detailed summary of datasets and evaluation metrics and shed some light on the applications and future directions of controllable TTS. To the best of our knowledge, this survey paper provides the first comprehensive review of emerging controllable TTS methods, which can serve as a beneficial resource for both academic researchers and industry practitioners.
摘要：文本转语音 (TTS)，也称为语音合成，是一个著名的研究领域，旨在从文本生成听起来自然的人类语音。最近，随着工业需求的不断增长，TTS 技术已经从合成类似人类的语音发展到能够生成可控语音。这包括对合成语音的各种属性（如情感、韵律、音色和持续时间）的细粒度控制。此外，深度学习的进步（如扩散和大型语言模型）在过去几年中显著增强了可控 TTS。在本文中，我们对可控 TTS 进行了全面的调查，涵盖了从基本控制技术到利用自然语言提示的方法等方法，旨在清楚地了解当前的研究状态。我们研究了通用的可控 TTS 流程、挑战、模型架构和控制策略，对现有方法进行了全面而清晰的分类。此外，我们还提供了数据集和评估指标的详细摘要，并阐明了可控 TTS 的应用和未来方向。据我们所知，这篇调查论文首次对新兴的可控 TTS 方法进行了全面的回顾，可以作为学术研究人员和行业从业者的有益资源。

Title: GEAR: A Simple GENERATE, EMBED, AVERAGE AND RANK Approach for Unsupervised Reverse Dictionary

Authors: Fatemah Almeman, Luis Espinosa-Anke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06654
Pdf URL: https://arxiv.org/pdf/2412.06654
Copy Paste: [[2412.06654]] GEAR: A Simple GENERATE, EMBED, AVERAGE AND RANK Approach for Unsupervised Reverse Dictionary(https://arxiv.org/abs/2412.06654)
Keywords: llm
Abstract: Reverse Dictionary (RD) is the task of obtaining the most relevant word or set of words given a textual description or dictionary definition. Effective RD methods have applications in accessibility, translation or writing support systems. Moreover, in NLP research RD is used to benchmark text encoders at various granularities, as it often requires word, definition and sentence embeddings. In this paper, we propose a simple approach to RD that leverages LLMs in combination with embedding models. Despite its simplicity, this approach outperforms supervised baselines on well-studied RD datasets, while also showing less over-fitting. We also conduct a number of experiments on different dictionaries and analyze how different styles, registers and target audiences impact the quality of RD systems. We conclude that, on average, untuned embeddings alone fare well below an LLM-only baseline (although they are competitive in highly technical dictionaries), but are crucial for boosting performance in combined methods.
摘要：逆向词典 (RD) 的任务是根据文本描述或词典定义获取最相关的单词或单词集。有效的 RD 方法可应用于可访问性、翻译或写作支持系统。此外，在 NLP 研究中，我们发现 RD 可用于对不同粒度的文本编码器进行基准测试，因为它通常需要单词、定义和句子嵌入。在本文中，我们提出了一种简单的 RD 方法，该方法利用 LLM 与嵌入模型相结合。尽管方法很简单，但该方法在经过充分研究的 RD 数据集中的表现优于监督基线，同时还显示出较少的过度拟合。我们还对不同的词典进行了大量实验，并分析了不同的风格、语域和目标受众如何影响 RD 系统的质量。我们得出的结论是，平均而言，未经调整的嵌入本身的表现远低于仅使用 LLM 的基线（尽管它们在技术含量高的词典中具有竞争力），但对于提高组合方法的性能至关重要。
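
The four steps named in the GEAR title (generate, embed, average, rank) can be sketched as follows. This is a toy illustration, not the authors' pipeline: the LLM generation step and the embedding model are both replaced by hand-written two-dimensional vectors, and the function names and the use of cosine similarity for ranking are assumptions.

```python
import math


def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank_vocabulary(candidate_vectors, vocab_vectors):
    """Rank vocabulary words by similarity to the averaged candidate vector."""
    centroid = average(candidate_vectors)
    return sorted(
        vocab_vectors,
        key=lambda word: cosine(centroid, vocab_vectors[word]),
        reverse=True,
    )


# GENERATE (stubbed): pretend an LLM proposed these words for the
# definition "a domesticated animal that barks".
# EMBED (stubbed): toy 2D vectors stand in for a real embedding model.
candidates = {"puppy": [1.0, 0.2], "hound": [0.9, 0.0]}
vocabulary = {"dog": [1.0, 0.1], "car": [0.0, 1.0], "cat": [0.6, 0.6]}

# AVERAGE and RANK.
ranking = rank_vocabulary(list(candidates.values()), vocabulary)
print(ranking)
```

Averaging the candidate embeddings means a single off-target suggestion from the generator shifts the query vector only partially, which is one plausible reason the combined method could be more robust than either component alone.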

Title: OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions

Authors: Yi-Kai Zhang, Xu-Xiang Zhong, Shiyin Lu, Qing-Guo Chen, De-Chuan Zhan, Han-Jia Ye
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2412.06693
Pdf URL: https://arxiv.org/pdf/2412.06693
Copy Paste: [[2412.06693]] OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions(https://arxiv.org/abs/2412.06693)
Keywords: language model, llm
Abstract: The rapid advancements in Large Language Models (LLMs) have significantly expanded their applications, ranging from multilingual support to domain-specific tasks and multimodal integration. In this paper, we present OmniEvalKit, a novel benchmarking toolbox designed to evaluate LLMs and their omni-extensions across multilingual, multidomain, and multimodal capabilities. Unlike existing benchmarks that often focus on a single aspect, OmniEvalKit provides a modular, lightweight, and automated evaluation system. It is structured with a modular architecture comprising a Static Builder and Dynamic Data Flow, promoting the seamless integration of new models and datasets. OmniEvalKit supports over 100 LLMs and 50 evaluation datasets, covering comprehensive evaluations across thousands of model-dataset combinations. OmniEvalKit is dedicated to creating an ultra-lightweight and fast-deployable evaluation framework, making downstream applications more convenient and versatile for the AI community.
摘要：大型语言模型 (LLM) 的快速发展极大地扩展了它们的应用范围，从多语言支持到特定领域的任务和多模式集成。在本文中，我们介绍了 OmniEvalKit，这是一种新颖的基准测试工具箱，旨在评估 LLM 及其在多语言、多领域和多模式能力方面的全方位扩展。与通常侧重于单一方面的现有基准测试不同，OmniEvalKit 提供了一个模块化、轻量级和自动化的评估系统。它采用由静态构建器和动态数据流组成的模块化架构，促进了新模型和数据集的无缝集成。OmniEvalKit 支持 100 多个 LLM 和 50 个评估数据集，涵盖数千种模型数据集组合的全面评估。OmniEvalKit 致力于创建一个超轻量级且可快速部署的评估框架，使下游应用程序对 AI 社区来说更加方便和通用。

Title: JAPAGEN: Efficient Few/Zero-shot Learning via Japanese Training Dataset Generation with LLM

Authors: Takuro Fujii, Satoru Katsumata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06738
Pdf URL: https://arxiv.org/pdf/2412.06738
Copy Paste: [[2412.06738]] JAPAGEN: Efficient Few/Zero-shot Learning via Japanese Training Dataset Generation with LLM(https://arxiv.org/abs/2412.06738)
Keywords: language model, llm, prompt
Abstract: Recently some studies have highlighted the potential of Large Language Models (LLMs) as effective generators of supervised training data, offering advantages such as enhanced inference efficiency and reduced costs associated with data collection. However, these studies have predominantly focused on English language tasks. In this paper, we address the fundamental research question: Can LLMs serve as proficient training data generators for other language tasks? Specifically, we leverage LLMs to synthesize supervised training data under few-shot and zero-shot learning scenarios across six diverse Japanese downstream tasks. Subsequently, we utilize this synthesized data to train compact models (e.g., BERT). This novel methodology is termed JAPAGEN. Our experimental findings underscore that JAPAGEN achieves robust performance in classification tasks that necessitate formal text inputs, demonstrating competitive results compared to conventional LLM prompting strategies.
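Summary: Recently, some studies have highlighted the potential of large language models (LLMs) as effective generators of supervised training data, with advantages such as improved inference efficiency and lower data collection costs. However, these studies have focused mainly on English-language tasks. In this paper, we address a fundamental research question: can LLMs serve as proficient training data generators for tasks in other languages? Specifically, we use LLMs to synthesize supervised training data under few-shot and zero-shot learning scenarios across six diverse Japanese downstream tasks. We then use this synthesized data to train compact models (e.g., BERT). This novel methodology is termed JAPAGEN. Our experimental results show that JAPAGEN achieves robust performance in classification tasks that require formal text inputs, with results competitive with conventional LLM prompting strategies.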
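
Note: the pipeline described in the abstract (an LLM writes labeled examples, then a compact model is trained on them) can be sketched as follows. This is a toy illustration under loud assumptions, not the JAPAGEN implementation: `synthesize` is a hypothetical stand-in that returns canned English phrases instead of prompting an LLM for Japanese text, and a small naive Bayes classifier is swapped in for BERT so the example runs without dependencies.

```python
import math
from collections import Counter, defaultdict


def synthesize(label, n):
    """Hypothetical stand-in for prompting an LLM to write n examples of a class."""
    templates = {
        "positive": ["great product", "really good service", "good and fast"],
        "negative": ["terrible product", "really bad service", "bad and slow"],
    }
    pool = templates[label]
    return [(pool[i % len(pool)], label) for i in range(n)]


class NaiveBayes:
    """Compact model trained only on synthetic data (swapped in for BERT)."""

    def fit(self, examples):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in examples:
            self.label_counts[label] += 1
            # Whitespace tokenization is an assumption that suits this English toy
            # data; Japanese text would need a proper tokenizer.
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_examples = sum(self.label_counts.values())

        def score(label):
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            prior = math.log(self.label_counts[label] / n_examples)
            return prior + sum(math.log((counts[w] + 1) / total) for w in text.split())

        return max(self.label_counts, key=score)


# Zero-shot style: the training set is built from generated examples only.
train = synthesize("positive", 6) + synthesize("negative", 6)
model = NaiveBayes().fit(train)
print(model.predict("good service"), model.predict("bad product"))
```

At inference time only the small classifier runs, which is the efficiency argument the abstract makes for generating data instead of prompting the LLM directly.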

Title: Training Large Language Models to Reason in a Continuous Latent Space

Authors: Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.06769
Pdf URL: https://arxiv.org/pdf/2412.06769
Copy Paste: [[2412.06769]] Training Large Language Models to Reason in a Continuous Latent Space(https://arxiv.org/abs/2412.06769)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be optimal for reasoning. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using natural language, we introduce a new paradigm Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought"). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: the continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.
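Summary: Large language models (LLMs) are restricted to reasoning in the "language space", where they typically express the reasoning process as a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be the best space for reasoning. For example, most word tokens serve mainly textual coherence and are not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of natural language, we introduce a new paradigm, Coconut (Chain of Continuous Thought). We use the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought"). Rather than decoding it into a word token, we feed it back to the LLM as the subsequent input embedding, directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: a continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem rather than committing prematurely to a single deterministic path as CoT does. On certain logical reasoning tasks that require substantial backtracking during planning, Coconut outperforms CoT while using fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.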
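
Note: the feedback loop described in the abstract (take the last hidden state, skip decoding, append it as the next input embedding) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the paper's code: `ToyModel` is a hypothetical, randomly initialized recurrent network standing in for an LLM, and the hidden size is assumed equal to the embedding size so a hidden state can be reused as an input.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyModel:
    """Stand-in for an LLM (assumption): a tiny untrained recurrent net."""

    def __init__(self, vocab_size, dim=4, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.embed = make_matrix(vocab_size, dim, rng)    # token id -> embedding
        self.w_in = make_matrix(dim, dim, rng)
        self.w_rec = make_matrix(dim, dim, rng)
        self.unembed = make_matrix(vocab_size, dim, rng)  # hidden state -> logits

    def last_hidden(self, inputs):
        """Run over a sequence of input embeddings and return the final hidden state."""
        h = [0.0] * self.dim
        for x in inputs:
            a, b = matvec(self.w_in, x), matvec(self.w_rec, h)
            h = [math.tanh(p + q) for p, q in zip(a, b)]
        return h

    def decode(self, hidden):
        """Greedy decoding: index of the largest logit."""
        logits = matvec(self.unembed, hidden)
        return max(range(len(logits)), key=logits.__getitem__)


def latent_reasoning(model, prompt_ids, n_thoughts):
    """Append n_thoughts continuous thoughts, then decode a single answer token."""
    inputs = [model.embed[t] for t in prompt_ids]
    for _ in range(n_thoughts):
        # The hidden state is fed back as the next input embedding
        # without ever being mapped to a word token.
        inputs.append(model.last_hidden(inputs))
    return model.decode(model.last_hidden(inputs)), inputs


answer, inputs = latent_reasoning(ToyModel(vocab_size=10), [1, 2, 3], n_thoughts=2)
print(answer, len(inputs))
```

In ordinary CoT each intermediate step would go through `decode` and back through `embed`, collapsing the state to one token; the loop above skips that round trip, which is what lets a continuous thought carry several candidate next steps at once.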