2024-12-17

Title: Reinforcement Learning Enhanced LLMs: A Survey

Authors: Shuhe Wang, Shengyu Zhang, Jie Zhang, Runyi Hu, Xiaoya Li, Tianwei Zhang, Jiwei Li, Fei Wu, Guoyin Wang, Eduard Hovy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10400
Pdf URL: https://arxiv.org/pdf/2412.10400
Copy Paste: [[2412.10400]] Reinforcement Learning Enhanced LLMs: A Survey(https://arxiv.org/abs/2412.10400)
Keywords: language model, llm
Abstract: This paper surveys research in the rapidly growing field of enhancing large language models (LLMs) with reinforcement learning (RL), a technique that enables LLMs to improve their performance by receiving feedback in the form of rewards based on the quality of their outputs, allowing them to generate more accurate, coherent, and contextually appropriate responses. In this work, we present a systematic review of the most up-to-date state of knowledge on RL-enhanced LLMs, attempting to consolidate and analyze the rapidly growing research in this field, helping researchers understand the current challenges and advancements. Specifically, we (1) detail the basics of RL; (2) introduce popular RL-enhanced LLMs; (3) review research on two widely used reward model-based RL techniques: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF); and (4) explore Direct Preference Optimization (DPO), a set of methods that bypass the reward model to directly use human preference data for aligning LLM outputs with human expectations. We also point out current challenges and deficiencies of existing methods and suggest some avenues for further improvements.
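The abstract describes DPO as bypassing the reward model and learning directly from preference pairs. As a rough illustration only, the sketch below computes the commonly cited DPO loss for a single (chosen, rejected) pair from summed log-probabilities. The abstract itself gives no formula; the function name, argument names, and the default beta are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response
    under either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# With identical policy and reference, the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# When the policy favors the chosen response more than the reference, the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

No separate reward model appears anywhere in the computation: the preference signal enters only through which response is labeled as chosen.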

Title: Evaluating Robustness of LLMs on Crisis-Related Microblogs across Events, Information Types, and Linguistic Features

Authors: Muhammad Imran, Abdul Wahab Ziaullah, Kai Chen, Ferda Ofli
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2412.10413
Pdf URL: https://arxiv.org/pdf/2412.10413
Copy Paste: [[2412.10413]] Evaluating Robustness of LLMs on Crisis-Related Microblogs across Events, Information Types, and Linguistic Features(https://arxiv.org/abs/2412.10413)
Keywords: language model, gpt, llm
Abstract: The widespread use of microblogging platforms like X (formerly Twitter) during disasters provides real-time information to governments and response authorities. However, the data from these platforms is often noisy, requiring automated methods to filter relevant information. Traditionally, supervised machine learning models have been used, but they lack generalizability. In contrast, Large Language Models (LLMs) show better capabilities in understanding and processing natural language out of the box. This paper provides a detailed analysis of the performance of six well-known LLMs in processing disaster-related social media data from a large set of real-world events. Our findings indicate that while LLMs, particularly GPT-4o and GPT-4, offer better generalizability across different disasters and information types, most LLMs face challenges in processing flood-related data, show minimal improvement despite the provision of examples (i.e., shots), and struggle to identify critical information categories like urgent requests and needs. Additionally, we examine how various linguistic features affect model performance and highlight LLMs' vulnerabilities against certain features like typos. Lastly, we provide benchmarking results for all events across both zero- and few-shot settings and observe that proprietary models outperform open-source ones in all tasks.

Title: Exploring Complex Mental Health Symptoms via Classifying Social Media Data with Explainable LLMs

Authors: Kexin Chen, Noelle Lim, Claire Lee, Michael Guerzhoy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10414
Pdf URL: https://arxiv.org/pdf/2412.10414
Copy Paste: [[2412.10414]] Exploring Complex Mental Health Symptoms via Classifying Social Media Data with Explainable LLMs(https://arxiv.org/abs/2412.10414)
Keywords: llm
Abstract: We propose a pipeline for gaining insights into complex diseases by training LLMs on challenging social media text data classification tasks, obtaining explanations for the classification outputs, and performing qualitative and quantitative analysis on the explanations. We report initial results on predicting, explaining, and systematizing the explanations of predicted reports on mental health concerns in people reporting Lyme disease concerns. We report initial results on predicting future ADHD concerns for people reporting anxiety disorder concerns, and demonstrate preliminary results on visualizing the explanations for predicting that a person with anxiety concerns will in the future have ADHD concerns.

Title: Generative Adversarial Reviews: When LLMs Become the Critic

Authors: Nicolas Bougie, Narimasa Watanabe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10415
Pdf URL: https://arxiv.org/pdf/2412.10415
Copy Paste: [[2412.10415]] Generative Adversarial Reviews: When LLMs Become the Critic(https://arxiv.org/abs/2412.10415)
Keywords: language model, llm, agent
Abstract: The peer review process is fundamental to scientific progress, determining which papers meet the quality standards for publication. Yet, the rapid growth of scholarly production and increasing specialization in knowledge areas strain traditional scientific feedback mechanisms. In light of this, we introduce Generative Agent Reviewers (GAR), leveraging LLM-empowered agents to simulate faithful peer reviewers. To enable generative reviewers, we design an architecture that extends a large language model with memory capabilities and equips agents with reviewer personas derived from historical data. Central to this approach is a graph-based representation of manuscripts, condensing content and logically organizing information - linking ideas with evidence and technical details. GAR's review process leverages external knowledge to evaluate paper novelty, followed by detailed assessment using the graph representation and multi-round assessment. Finally, a meta-reviewer aggregates individual reviews to predict the acceptance decision. Our experiments demonstrate that GAR performs comparably to human reviewers in providing detailed feedback and predicting paper outcomes. Beyond mere performance comparison, we conduct insightful experiments, such as evaluating the impact of reviewer expertise and examining fairness in reviews. By offering early expert-level feedback, typically restricted to a limited group of researchers, GAR democratizes access to transparent and in-depth evaluation.

Title: SUPERMERGE: An Approach For Gradient-Based Model Merging

Authors: Haoyu Yang, Zheng Zhang, Saket Sathe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10416
Pdf URL: https://arxiv.org/pdf/2412.10416
Copy Paste: [[2412.10416]] SUPERMERGE: An Approach For Gradient-Based Model Merging(https://arxiv.org/abs/2412.10416)
Keywords: language model, gpt, chat
Abstract: Large language models, such as ChatGPT, Claude, or LLaMA, are gigantic, monolithic, and possess the superpower to simultaneously support thousands of tasks. However, high-throughput applications often prefer smaller task-specific models because of their lower latency and cost. One challenge of using task-specific models is the incremental need for solving newer tasks after the model is already deployed for existing tasks. A straightforward solution requires fine-tuning the model again for both existing and new tasks, which is computationally expensive and time-consuming. To address this issue, we propose a model-merging-based approach called SUPERMERGE. SUPERMERGE is a gradient-based method to systematically merge several fine-tuned models trained on existing and new tasks. SUPERMERGE is designed to be lightweight and fast, and the merged model achieves similar performance to fully fine-tuned models on all tasks. Furthermore, we propose a hierarchical model merging strategy to reduce the peak space requirement without sacrificing the performance of the merged model. We experimentally demonstrate that SUPERMERGE outperforms existing model merging methods on common natural language processing and computer vision tasks.

Title: Leveraging Audio and Text Modalities in Mental Health: A Study of LLMs Performance

Authors: Abdelrahman A. Ali, Aya E. Fouda, Radwa J. Hanafy, Mohammed E. Fouda
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.10417
Pdf URL: https://arxiv.org/pdf/2412.10417
Copy Paste: [[2412.10417]] Leveraging Audio and Text Modalities in Mental Health: A Study of LLMs Performance(https://arxiv.org/abs/2412.10417)
Keywords: language model, gpt, llm, prompt
Abstract: Mental health disorders are increasingly prevalent worldwide, creating an urgent need for innovative tools to support early diagnosis and intervention. This study explores the potential of Large Language Models (LLMs) in multimodal mental health diagnostics, specifically for detecting depression and Post-Traumatic Stress Disorder through text and audio modalities. Using the E-DAIC dataset, we compare text and audio modalities to investigate whether LLMs can perform equally well or better with audio inputs. We further examine the integration of both modalities to determine if this can enhance diagnostic accuracy, which generally results in improved performance metrics. Our analysis specifically utilizes two custom-formulated metrics, Modal Superiority Score and Disagreement Resolvement Score, to evaluate how combined modalities influence model performance. The Gemini 1.5 Pro model achieves the highest scores in binary depression classification when using the combined modality, with an F1 score of 0.67 and a Balanced Accuracy (BA) of 77.4%, assessed across the full dataset. These results represent an increase of 3.1% over its performance with the text modality and 2.7% over the audio modality, highlighting the effectiveness of integrating modalities to enhance diagnostic accuracy. Notably, all results are obtained with zero-shot inference, highlighting the robustness of the models without requiring task-specific fine-tuning. To explore the impact of different configurations on model performance, we conduct binary, severity, and multiclass tasks using both zero-shot and few-shot prompts, examining the effects of prompt variations on performance. The results reveal that models such as Gemini 1.5 Pro in text and audio modalities, and GPT-4o mini in the text modality, often surpass other models in balanced accuracy and F1 scores across multiple tasks.
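The two headline metrics in this abstract, F1 and Balanced Accuracy, can both be derived from a binary confusion matrix. The sketch below uses their standard definitions; the function name and the toy labels are made up for illustration and are not taken from the E-DAIC dataset or the paper's results.

```python
def binary_metrics(y_true, y_pred):
    """Return F1 and balanced accuracy for 0/1 labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against empty classes so the sketch never divides by zero.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Balanced accuracy averages the per-class recalls.
    balanced_accuracy = (recall + specificity) / 2
    return {"f1": f1, "balanced_accuracy": balanced_accuracy}


# Toy, imbalanced example: 3 positive cases and 5 negative cases.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
```

Balanced accuracy weights both classes equally, which is one reason it is often reported alongside F1 when positive cases are the minority.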

Title: Constrained Decoding with Speculative Lookaheads

Authors: Nishanth Nakshatri, Shamik Roy, Rajarshi Das, Suthee Chaidaroon, Leonid Boytsov, Rashmi Gangadharaiah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10418
Pdf URL: https://arxiv.org/pdf/2412.10418
Copy Paste: [[2412.10418]] Constrained Decoding with Speculative Lookaheads(https://arxiv.org/abs/2412.10418)
Keywords: llm
Abstract: Constrained decoding with lookahead heuristics (CDLH) is a highly effective method for aligning LLM generations to human preferences. However, the extensive lookahead roll-out operations for each generated token make CDLH prohibitively expensive, resulting in low adoption in practice. In contrast, common decoding strategies such as greedy decoding are extremely efficient, but achieve very low constraint satisfaction. We propose constrained decoding with speculative lookaheads (CDSL), a technique that significantly improves upon the inference efficiency of CDLH without experiencing the drastic performance reduction seen with greedy decoding. CDSL is motivated by the recently proposed idea of speculative decoding that uses a much smaller draft LLM for generation and a larger target LLM for verification. In CDSL, the draft model is used to generate lookaheads, which are verified by a combination of the target LLM and task-specific reward functions. This process accelerates decoding by reducing the computational burden while maintaining strong performance. We evaluate CDSL in two constrained decoding tasks with three LLM families and achieve 2.2x to 12.15x speedup over CDLH without significant performance reduction.

Title: AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework

Authors: Meihao Fan, Ju Fan, Nan Tang, Lei Cao, Xiaoyong Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10422
Pdf URL: https://arxiv.org/pdf/2412.10422
Copy Paste: [[2412.10422]] AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework(https://arxiv.org/abs/2412.10422)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Answering natural language (NL) questions about tables, which is referred to as Tabular Question Answering (TQA), is important because it enables users to extract meaningful insights quickly and efficiently from structured data, bridging the gap between human language and machine-readable formats. Many of these tables originate from web sources or real-world scenarios, necessitating careful data preparation (or data prep for short) to ensure accurate answers. However, unlike traditional data prep, question-aware data prep introduces new requirements, which include tasks such as column augmentation and filtering for given questions, and question-aware value normalization or conversion. Because each of the above tasks is unique, a single model (or agent) may not perform effectively across all scenarios. In this paper, we propose AUTOPREP, a large language model (LLM)-based multi-agent framework that leverages the strengths of multiple agents, each specialized in a certain type of data prep, ensuring more accurate and contextually relevant responses. Given an NL question over a table, AUTOPREP performs data prep through three key components. Planner: Determines a logical plan, outlining a sequence of high-level operations. Programmer: Translates this logical plan into a physical plan by generating the corresponding low-level code. Executor: Iteratively executes and debugs the generated code to ensure correct outcomes. To support this multi-agent framework, we design a novel chain-of-thought reasoning mechanism for high-level operation suggestion, and a tool-augmented method for low-level code generation. Extensive experiments on real-world TQA datasets demonstrate that AUTOPREP can significantly improve the SOTA TQA solutions through question-aware data prep.

Title: Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM

Authors: Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Rongxiang Weng, Muyun Yang, Tiejun Zhao, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10423
Pdf URL: https://arxiv.org/pdf/2412.10423
Copy Paste: [[2412.10423]] Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM(https://arxiv.org/abs/2412.10423)
Keywords: language model, llm
Abstract: Despite being empowered with alignment mechanisms, large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks that can compromise their alignment mechanisms. This vulnerability poses significant risks to real-world applications. Existing work (e.g., Reinforcement Learning from Human Feedback and Red-Teaming) faces challenges in both training efficiency and generalization capabilities. Developing effective strategies to enable LLMs to resist continuously evolving jailbreak attempts represents a significant challenge. To address this challenge, we propose a novel defensive paradigm called GuidelineLLM, which assists LLMs in recognizing queries that may have harmful content. Before LLMs respond to a query, GuidelineLLM first identifies potential risks associated with the query, summarizes these risks into guideline suggestions, and then feeds these guidelines to the responding LLMs. Importantly, our approach eliminates the necessity for additional safety fine-tuning of the LLMs themselves; only the GuidelineLLM requires fine-tuning. This characteristic enhances the general applicability of GuidelineLLM across various LLMs. Experimental results demonstrate that GuidelineLLM can significantly reduce the attack success rate (ASR) against the LLMs (an average reduction of 34.17% ASR) while maintaining the helpfulness of the LLMs in handling benign queries. Code is available at this https URL.

Title: LLM-AS-AN-INTERVIEWER: Beyond Static Testing Through Dynamic LLM Evaluation

Authors: Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, Alice Oh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10424
Pdf URL: https://arxiv.org/pdf/2412.10424
Copy Paste: [[2412.10424]] LLM-AS-AN-INTERVIEWER: Beyond Static Testing Through Dynamic LLM Evaluation(https://arxiv.org/abs/2412.10424)
Keywords: language model, llm
Abstract: We introduce a novel evaluation paradigm for large language models (LLMs), LLM-as-an-Interviewer. This approach consists of a two-stage process designed to assess the true capabilities of LLMs: first, modifying benchmark datasets to generate initial queries, and second, interacting with the LLM through feedback and follow-up questions. Compared to existing evaluation methods such as LLM-as-a-Judge, our framework addresses several limitations, including data contamination, verbosity bias, and self-enhancement bias. Additionally, we show that our multi-turn evaluation process provides valuable insights into the LLM's performance in real-world scenarios, including its adaptability to feedback and its ability to handle follow-up questions, such as clarifications or requests for additional knowledge. Finally, we propose the Interview Report, which offers a comprehensive reflection of an LLM's strengths and weaknesses, illustrated with specific examples from the interview process. This report delivers a snapshot of the LLM's capabilities, providing a detailed picture of its practical performance.

Title: Active Inference for Self-Organizing Multi-LLM Systems: A Bayesian Thermodynamic Approach to Adaptation

Authors: Rithvik Prakki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10425
Pdf URL: https://arxiv.org/pdf/2412.10425
Copy Paste: [[2412.10425]] Active Inference for Self-Organizing Multi-LLM Systems: A Bayesian Thermodynamic Approach to Adaptation(https://arxiv.org/abs/2412.10425)
Keywords: language model, llm, prompt, agent
Abstract: This paper introduces a novel approach to creating adaptive language agents by integrating active inference with large language models (LLMs). While LLMs demonstrate remarkable capabilities, their reliance on static prompts limits adaptation to new information and changing environments. We address this by implementing an active inference framework that acts as a cognitive layer above an LLM-based agent, dynamically adjusting prompts and search strategies through principled information-seeking behavior. Our framework models the environment using three state factors (prompt, search, and information states) with seven observation modalities capturing quality metrics. By framing the agent's learning through the free energy principle, we enable systematic exploration of prompt combinations and search strategies. Experimental results demonstrate the effectiveness of this approach, with the agent developing accurate models of environment dynamics evidenced by emergent structure in observation matrices. Action selection patterns reveal sophisticated exploration-exploitation behavior, transitioning from initial information-gathering to targeted prompt testing. The integration of thermodynamic principles with language model capabilities provides a principled framework for creating robust, adaptable agents, extending active inference beyond traditional low-dimensional control problems to high-dimensional, language-driven environments.

Title: Identifying and Manipulating Personality Traits in LLMs Through Activation Engineering

Authors: Rumi A. Allbert, James K. Wiles
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10427
Pdf URL: https://arxiv.org/pdf/2412.10427
Copy Paste: [[2412.10427]] Identifying and Manipulating Personality Traits in LLMs Through Activation Engineering(https://arxiv.org/abs/2412.10427)
Keywords: language model, llm
Abstract: The field of large language models (LLMs) has grown rapidly in recent years, driven by the desire for better efficiency, interpretability, and safe use. Building on the novel approach of "activation engineering," this study explores personality modification in LLMs, drawing inspiration from research like Refusal in LLMs Is Mediated by a Single Direction (arXiv:2406.11717) and Steering Llama 2 via Contrastive Activation Addition (arXiv:2312.06681). We leverage activation engineering to develop a method for identifying and adjusting activation directions related to personality traits, which may allow for dynamic LLM personality fine-tuning. This work aims to further our understanding of LLM interpretability while examining the ethical implications of such developments.
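The abstract does not spell out how activation directions are identified or adjusted. As one plausible reading, in the spirit of the cited contrastive activation addition work, the sketch below takes the difference between mean activations of trait-positive and trait-negative prompts and adds a scaled copy to a hidden state. All function names, the toy vectors, and the scaling factor are assumptions for illustration, not the authors' method.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_direction(positive_acts, negative_acts):
    """Difference of mean activations: trait-positive minus trait-negative."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden_state, direction, alpha):
    """Shift a hidden state along the direction; alpha sets strength and sign."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]


# Toy 2-dimensional "activations" standing in for one layer's hidden states.
positive = [[1.0, 0.0], [3.0, 2.0]]   # prompts that express the trait
negative = [[0.0, 0.0], [2.0, 0.0]]   # prompts that do not
direction = steering_direction(positive, negative)
steered = steer([0.5, 0.5], direction, alpha=2.0)
```

In a real model the vectors would be captured from a chosen layer during a forward pass, and a negative alpha would push the output away from the trait rather than toward it.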

Title: Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection

Authors: Jiaqi Chen, Xiaoye Zhu, Tianyang Liu, Ying Chen, Xinhui Chen, Yiwen Yuan, Chak Tou Leong, Zuchao Li, Tang Long, Lei Zhang, Chenyu Yan, Guanghao Mei, Jie Zhang, Lefei Zhang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2412.10432
Pdf URL: https://arxiv.org/pdf/2412.10432
Copy Paste: [[2412.10432]] Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection(https://arxiv.org/abs/2412.10432)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have revolutionized text generation, making detecting machine-generated text increasingly challenging. Although past methods have achieved good performance on detecting pure machine-generated text, those detectors have poor performance on distinguishing machine-revised text (rewriting, expansion, and polishing), which can have only minor changes from its original human prompt. As the content of text may originate from human prompts, detecting machine-revised text often involves identifying distinctive machine styles, e.g., wording favored by LLMs. However, existing methods struggle to detect machine-style phrasing hidden within the content contributed by humans. We propose the "Imitate Before Detect" (ImBD) approach, which first imitates the machine-style token distribution, and then compares the distribution of the text to be tested with the machine-style distribution to determine whether the text has been machine-revised. To this end, we introduce style preference optimization (SPO), which aligns a scoring LLM model to the preference of text styles generated by machines. The aligned scoring model is then used to calculate the style-conditional probability curvature (Style-CPC), quantifying the log probability difference between the original and conditionally sampled texts for effective detection. We conduct extensive comparisons across various scenarios, encompassing text revisions by six LLMs, four distinct text domains, and three machine revision types. Compared to existing state-of-the-art methods, our method yields a 13% increase in AUC for detecting text revised by open-source LLMs, and improves performance by 5% and 19% for detecting GPT-3.5 and GPT-4o revised text, respectively. Notably, our method surpasses the commercially trained GPT-Zero with just 1,000 samples and five minutes of SPO, demonstrating its efficiency and effectiveness.

Title: NAT-NL2GQL: A Novel Multi-Agent Framework for Translating Natural Language to Graph Query Language

Authors: Yuanyuan Liang, Tingyu Xie, Gan Peng, Zihao Huang, Yunshi Lan, Weining Qian
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2412.10434
Pdf URL: https://arxiv.org/pdf/2412.10434
Copy Paste: [[2412.10434]] NAT-NL2GQL: A Novel Multi-Agent Framework for Translating Natural Language to Graph Query Language(https://arxiv.org/abs/2412.10434)
Keywords: language model, llm, agent
Abstract: The emergence of Large Language Models (LLMs) has revolutionized many fields, not only traditional natural language processing (NLP) tasks. Recently, research on applying LLMs to the database field has been booming, and since graph databases are a typical kind of non-relational database, the use of LLMs in graph database research has naturally gained significant attention. Recent efforts have increasingly focused on leveraging LLMs to translate natural language into graph query language (NL2GQL). Although some progress has been made, these methods have clear limitations, such as their reliance on streamlined processes that often overlook the potential of LLMs to autonomously plan and collaborate with other LLMs in tackling complex NL2GQL challenges. To address this gap, we propose NAT-NL2GQL, a novel multi-agent framework for translating natural language to graph query language. Specifically, our framework consists of three synergistic agents: the Preprocessor agent, the Generator agent, and the Refiner agent. The Preprocessor agent manages data processing as context, including tasks such as named entity recognition, query rewriting, path linking, and the extraction of query-related schemas. The Generator agent is a fine-tuned LLM trained on NL-GQL data, responsible for generating corresponding GQL statements based on queries and their related schemas. The Refiner agent is tasked with refining the GQL or context using error information obtained from the GQL execution results. Given the scarcity of high-quality open-source NL2GQL datasets based on nGQL syntax, we developed StockGQL, a dataset constructed from a financial market graph database. It is available at: this https URL. Experimental results on the StockGQL and SpCQL datasets reveal that our method significantly outperforms baseline approaches, highlighting its potential for advancing NL2GQL research.

Title: On Adversarial Robustness and Out-of-Distribution Robustness of Large Language Models

Authors: April Yang, Jordan Tab, Parth Shah, Paul Kotchavong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10535
Pdf URL: https://arxiv.org/pdf/2412.10535
Copy Paste: [[2412.10535]] On Adversarial Robustness and Out-of-Distribution Robustness of Large Language Models(https://arxiv.org/abs/2412.10535)
Keywords: language model, llm
Abstract: The increasing reliance on large language models (LLMs) for diverse applications necessitates a thorough understanding of their robustness to adversarial perturbations and out-of-distribution (OOD) inputs. In this study, we investigate the correlation between adversarial robustness and OOD robustness in LLMs, addressing a critical gap in robustness evaluation. By applying methods originally designed to improve one robustness type across both contexts, we analyze their performance on adversarial and out-of-distribution benchmark datasets. The input of the model consists of text samples, with the output prediction evaluated in terms of accuracy, precision, recall, and F1 scores in various natural language inference tasks. Our findings highlight nuanced interactions between adversarial robustness and OOD robustness, with results indicating limited transferability between the two robustness types. Through targeted ablations, we evaluate how these correlations evolve with different model sizes and architectures, uncovering model-specific trends: smaller models like LLaMA2-7b exhibit neutral correlations, larger models like LLaMA2-13b show negative correlations, and Mixtral demonstrates positive correlations, potentially due to domain-specific alignment. These results underscore the importance of hybrid robustness frameworks that integrate adversarial and OOD strategies tailored to specific models and domains. Further research is needed to evaluate these interactions across larger models and varied architectures, offering a pathway to more reliable and generalizable LLMs.

Title: Too Big to Fool: Resisting Deception in Language Models

Authors: Mohammad Reza Samsami, Mats Leon Richter, Juan Rodriguez, Megh Thakkar, Sarath Chandar, Maxime Gasse
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10558
Pdf URL: https://arxiv.org/pdf/2412.10558
Copy Paste: [[2412.10558]] Too Big to Fool: Resisting Deception in Language Models(https://arxiv.org/abs/2412.10558)
Keywords: language model, prompt
Abstract: Large language models must balance their weight-encoded knowledge with in-context information from prompts to generate accurate responses. This paper investigates this interplay by analyzing how models of varying capacities within the same family handle intentionally misleading in-context information. Our experiments demonstrate that larger models exhibit higher resilience to deceptive prompts, showcasing an advanced ability to interpret and integrate prompt information with their internal knowledge. Furthermore, we find that larger models outperform smaller ones in following legitimate instructions, indicating that their resilience is not due to disregarding in-context information. We also show that this phenomenon is likely not a result of memorization but stems from the models' ability to better leverage implicit task-relevant information from the prompt alongside their internally stored knowledge.

Title: Evidence Contextualization and Counterfactual Attribution for Conversational QA over Heterogeneous Data with RAG Systems

Authors: Rishiraj Saha Roy, Joel Schlotthauer, Chris Hinze, Andreas Foltyn, Luzian Hahn, Fabian Kuech
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2412.10571
Pdf URL: https://arxiv.org/pdf/2412.10571
Copy Paste: [[2412.10571]] Evidence Contextualization and Counterfactual Attribution for Conversational QA over Heterogeneous Data with RAG Systems(https://arxiv.org/abs/2412.10571)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) works as a backbone for interacting with an enterprise's own data via Conversational Question Answering (ConvQA). In a RAG system, a retriever fetches passages from a collection in response to a question, which are then included in the prompt of a large language model (LLM) for generating a natural language (NL) answer. However, several RAG systems today suffer from two shortcomings: (i) retrieved passages usually contain their raw text and lack appropriate document context, negatively impacting both retrieval and answering quality; and (ii) attribution strategies that explain answer generation usually rely only on similarity between the answer and the retrieved passages, thereby only generating plausible but not causal explanations. In this work, we demonstrate RAGONITE, a RAG system that remedies the above concerns by: (i) contextualizing evidence with source metadata and surrounding text; and (ii) computing counterfactual attribution, a causal explanation approach where the contribution of an evidence to an answer is determined by the similarity of the original response to the answer obtained by removing that evidence. To evaluate our proposals, we release a new benchmark ConfQuestions, with 300 hand-created conversational questions, each in English and German, coupled with ground truth URLs, completed questions, and answers from 215 public Confluence pages, that are typical of enterprise wiki spaces with heterogeneous elements. Experiments with RAGONITE on ConfQuestions show the viability of our ideas: contextualization improves RAG performance, and counterfactual attribution is effective at explaining RAG answers.
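The counterfactual attribution idea described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not RAGONITE's implementation: `answer_fn` is a hypothetical stand-in for the LLM call, `cosine_similarity` is a simple bag-of-words substitute for whatever similarity measure the system actually uses, and scoring a passage as one minus that similarity is our own choice.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (assumed stand-in for a real text similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def counterfactual_attribution(
    question: str,
    evidences: List[str],
    answer_fn: Callable[[str, List[str]], str],
) -> List[float]:
    """Score each passage by how much the answer changes when it is removed."""
    original = answer_fn(question, evidences)
    scores = []
    for i in range(len(evidences)):
        # Re-answer the question with passage i left out.
        reduced = evidences[:i] + evidences[i + 1:]
        counterfactual = answer_fn(question, reduced)
        # Low similarity to the original answer means passage i mattered.
        scores.append(1.0 - cosine_similarity(original, counterfactual))
    return scores


def toy_answer_fn(question: str, evidences: List[str]) -> str:
    """Hypothetical LLM stand-in: echoes passages that mention 'deadline'."""
    hits = [e for e in evidences if "deadline" in e]
    return " ".join(hits) if hits else "unknown"


if __name__ == "__main__":
    passages = ["the deadline is friday", "lunch is at noon"]
    print(counterfactual_attribution("When is the deadline?", passages, toy_answer_fn))
```

With the toy answer function, removing the first passage changes the answer completely while removing the second changes nothing, so only the first passage receives a non-zero score. A real system would need one extra generation call per passage, which is the main cost of this style of attribution.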

Title: WHAT-IF: Exploring Branching Narratives by Meta-Prompting Large Language Models

Authors: Runsheng "Anson" Huang, Lara J. Martin, Chris Callison-Burch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10582
Pdf URL: https://arxiv.org/pdf/2412.10582
Copy Paste: [[2412.10582]] WHAT-IF: Exploring Branching Narratives by Meta-Prompting Large Language Models(https://arxiv.org/abs/2412.10582)
Keywords: language model, gpt, llm, prompt
Abstract: WHAT-IF -- Writing a Hero's Alternate Timeline through Interactive Fiction -- is a system that uses zero-shot meta-prompting to create branching narratives from a prewritten story. Played as an interactive fiction (IF) game, WHAT-IF lets the player choose between decisions that the large language model (LLM) GPT-4 generates as possible branches in the story. Starting with an existing linear plot as input, a branch is created at each key decision taken by the main character. By meta-prompting the LLM to consider the major plot points from the story, the system produces coherent and well-structured alternate storylines. WHAT-IF stores the branching plot tree in a graph which helps it to both keep track of the story for prompting and maintain the structure for the final IF system. A video demo of our system can be found here: this https URL.

Title: Thinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data

Authors: Xue Wu, Kostas Tsioutsiouliklis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10654
Pdf URL: https://arxiv.org/pdf/2412.10654
Copy Paste: [[2412.10654]] Thinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data(https://arxiv.org/abs/2412.10654)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, they often struggle with complex reasoning tasks and are prone to hallucination. Recent research has shown promising results in leveraging knowledge graphs (KGs) to enhance LLM performance. KGs provide a structured representation of entities and their relationships, offering a rich source of information that can enhance the reasoning capabilities of LLMs. For this work, we have developed different techniques that tightly integrate KG structures and semantics into LLM representations. Our results show that we are able to significantly improve the performance of LLMs in complex reasoning scenarios, and ground the reasoning process with KGs. We are the first to represent KGs with a programming language and fine-tune pretrained LLMs with KGs. This integration facilitates more accurate and interpretable reasoning processes, paving the way for more advanced reasoning capabilities of LLMs.

Title: Chasing Progress, Not Perfection: Revisiting Strategies for End-to-End LLM Plan Generation

Authors: Sukai Huang, Trevor Cohn, Nir Lipovetzky
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10675
Pdf URL: https://arxiv.org/pdf/2412.10675
Copy Paste: [[2412.10675]] Chasing Progress, Not Perfection: Revisiting Strategies for End-to-End LLM Plan Generation(https://arxiv.org/abs/2412.10675)
Keywords: language model, llm, chain-of-thought
Abstract: The capability of Large Language Models (LLMs) to plan remains a topic of debate. Some critics argue that strategies to boost LLMs' reasoning skills are ineffective in planning tasks, while others report strong outcomes merely from training models on a planning corpus. This study reassesses recent strategies by developing an end-to-end LLM planner and employing diverse metrics for a thorough evaluation. We find that merely fine-tuning LLMs on a corpus of planning instances does not lead to robust planning skills, as indicated by poor performance on out-of-distribution test sets. At the same time, we find that various strategies, including Chain-of-Thought, do enhance the probability of a plan being executable. This indicates progress towards better plan quality, despite not directly enhancing the final validity rate. Among the strategies we evaluated, reinforcement learning with our novel 'Longest Contiguous Common Subsequence' reward emerged as the most effective, contributing to both plan validity and executability. Overall, our research addresses key misconceptions in the LLM-planning literature; we validate incremental progress in plan executability, although plan validity remains a challenge. Hence, future strategies should focus on both these aspects, drawing insights from our findings.
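To make the reward named in this abstract more concrete, here is one plausible reading sketched in Python: the length of the longest run of consecutive actions shared by a generated plan and a reference plan. The abstract does not define the reward, so the function names, the comparison against a single reference plan, and the normalization by reference length are all assumptions of this sketch rather than the authors' formulation.

```python
from typing import Sequence


def longest_contiguous_common_run(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest block of consecutive items that appears in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous pair of positions.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lccs_reward(generated: Sequence[str], reference: Sequence[str]) -> float:
    """Assumed normalization: shared run length divided by reference plan length."""
    if not reference:
        return 0.0
    return longest_contiguous_common_run(generated, reference) / len(reference)


if __name__ == "__main__":
    reference = ["pick a", "stack a b", "pick d", "stack d a"]
    generated = ["pick a", "stack a b", "pick c", "stack c a"]
    print(lccs_reward(generated, reference))
```

A reward of this shape gives partial credit to plans that start correctly and then diverge, which matches the abstract's emphasis on incremental progress rather than all-or-nothing validity.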

Title: Inference Scaling for Bridging Retrieval and Augmented Generation

Authors: Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10684
Pdf URL: https://arxiv.org/pdf/2412.10684
Copy Paste: [[2412.10684]] Inference Scaling for Bridging Retrieval and Augmented Generation(https://arxiv.org/abs/2412.10684)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has emerged as a popular approach to steering the output of a large language model (LLM) by incorporating retrieved contexts as inputs. However, existing work has observed a generator bias, such that improving the retrieval results may negatively affect the outcome. In this work, we show that such bias can be mitigated through inference scaling, by aggregating inference calls over permuted orders of the retrieved contexts. The proposed Mixture-of-Intervention (MOI) explicitly models the debiased utility of each passage with multiple forward passes to construct a new ranking. We also show that MOI can leverage the retriever's prior knowledge to reduce the computational cost by minimizing the number of permutations considered and lowering the cost per LLM call. We showcase the effectiveness of MOI on diverse RAG tasks, improving ROUGE-L on MS MARCO and EM on HotpotQA benchmarks by ~7 points.

Title: Learning to Verify Summary Facts with Fine-Grained LLM Feedback

Authors: Jihwan Oh, Jeonghwan Choi, Nicole Hee-Yeon Kim, Taewon Yun, Hwanjun Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10689
Pdf URL: https://arxiv.org/pdf/2412.10689
Copy Paste: [[2412.10689]] Learning to Verify Summary Facts with Fine-Grained LLM Feedback(https://arxiv.org/abs/2412.10689)
Keywords: language model, llm
Abstract: Training automatic summary fact verifiers often faces the challenge of a lack of human-labeled data. In this paper, we explore an alternative way of leveraging feedback generated by Large Language Models (LLMs) to address the inherent limitation of using human-labeled data. We introduce FineSumFact, a large-scale dataset containing fine-grained factual feedback on summaries. We employ 10 distinct LLMs for diverse summary generation and Llama-3-70B-Instruct for feedback. We utilize this dataset to fine-tune the lightweight open-source model Llama-3-8B-Instruct, optimizing resource efficiency while maintaining high performance. Our experimental results reveal that the model trained on extensive LLM-generated datasets surpasses that trained on smaller human-annotated datasets when evaluated using human-generated test sets. Fine-tuning fact verification models with LLM feedback can be more effective and cost-efficient than using human feedback. The dataset is available at this https URL.

Title: VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

Authors: Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10704
Pdf URL: https://arxiv.org/pdf/2412.10704
Copy Paste: [[2412.10704]] VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation(https://arxiv.org/abs/2412.10704)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation, chain-of-thought
Abstract: Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. Extensive results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12-20%.

Title: HITgram: A Platform for Experimenting with n-gram Language Models

Authors: Shibaranjani Dasgupta, Chandan Maity, Somdip Mukherjee, Rohan Singh, Diptendu Dutta, Debasish Jana
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10717
Pdf URL: https://arxiv.org/pdf/2412.10717
Copy Paste: [[2412.10717]] HITgram: A Platform for Experimenting with n-gram Language Models(https://arxiv.org/abs/2412.10717)
Keywords: language model, llm
Abstract: Large language models (LLMs) are powerful but resource-intensive, limiting accessibility. HITgram addresses this gap by offering a lightweight platform for n-gram model experimentation, ideal for resource-constrained environments. It supports unigrams to 4-grams and incorporates features like context-sensitive weighting, Laplace smoothing, and dynamic corpus management to enhance prediction accuracy, even for unseen word sequences. Experiments demonstrate HITgram's efficiency, achieving 50,000 tokens/second and generating 2-grams from a 320MB corpus in 62 seconds. HITgram scales efficiently, constructing 4-grams from a 1GB file in under 298 seconds on an 8 GB RAM system. Planned enhancements include multilingual support, advanced smoothing, parallel processing, and model saving, further broadening its utility.
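For readers unfamiliar with the smoothing mentioned in this abstract, the following Python sketch shows a bigram model with Laplace (add-one) smoothing, which is what lets an n-gram model assign non-zero probability to unseen word sequences. It is a generic textbook illustration, not HITgram's code: the `BigramModel` class and its methods are hypothetical names, and the platform's context-sensitive weighting and corpus management are not modelled.

```python
from collections import Counter
from typing import List


class BigramModel:
    """Bigram language model with Laplace (add-one) smoothing."""

    def __init__(self, tokens: List[str]) -> None:
        self.vocab = set(tokens)
        # How often each word occurs as the first element of a bigram.
        self.context_counts = Counter(tokens[:-1])
        self.bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(self, prev: str, word: str) -> float:
        """P(word | prev) = (count(prev, word) + 1) / (count(prev) + |V|)."""
        vocab_size = len(self.vocab)
        return (self.bigram_counts[(prev, word)] + 1) / (
            self.context_counts[prev] + vocab_size
        )

    def predict(self, prev: str) -> str:
        """Most probable next word; ties are broken alphabetically."""
        return max(sorted(self.vocab), key=lambda w: self.prob(prev, w))


if __name__ == "__main__":
    model = BigramModel("the cat sat on the mat".split())
    print(model.prob("the", "cat"))  # seen bigram
    print(model.prob("the", "sat"))  # unseen bigram, still non-zero
```

Adding one to every count shifts some probability mass from observed bigrams to unobserved ones, while the probabilities for a given context still sum to one over the vocabulary.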

Title: WEPO: Web Element Preference Optimization for LLM-based Web Navigation

Authors: Jiarun Liu, Jia Hao, Chunhong Zhang, Zheng Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10742
Pdf URL: https://arxiv.org/pdf/2412.10742
Copy Paste: [[2412.10742]] WEPO: Web Element Preference Optimization for LLM-based Web Navigation(https://arxiv.org/abs/2412.10742)
Keywords: language model, llm, agent
Abstract: The rapid advancement of autonomous web navigation has significantly benefited from grounding pretrained Large Language Models (LLMs) as agents. However, current research has yet to fully leverage the redundancy of HTML elements for contrastive training. This paper introduces a novel approach to LLM-based web navigation tasks, called Web Element Preference Optimization (WEPO). WEPO utilizes unsupervised preference learning by sampling distance-based non-salient web elements as negative samples, optimizing a maximum-likelihood objective within Direct Preference Optimization (DPO). We evaluate WEPO on the Mind2Web benchmark and empirically demonstrate that WEPO aligns user high-level intent with output actions more effectively. The results show that our method achieves state-of-the-art performance, with an improvement of 13.8% over WebAgent and 5.3% over the visual language model CogAgent baseline. Our findings underscore the potential of preference optimization to enhance web navigation and other web-page-based tasks, suggesting a promising direction for future research.
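The DPO objective that this abstract builds on can be written down compactly. The Python sketch below computes the standard per-example DPO loss from sequence log-probabilities under the policy and a frozen reference model. It assumes the "chosen" response is the correct web element and the "rejected" one is a sampled non-salient element; how WEPO actually samples negatives by distance is not shown, and the function and parameter names are our own.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen gain - rejected gain)))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
    # Policy raises the chosen element and lowers the rejected one: loss drops.
    print(dpo_loss(-3.0, -9.0, -5.0, -6.0))
```

The loss falls below log(2) only when the policy favours the chosen element over the rejected one more strongly than the reference model does, which is why no separate reward model is needed.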

Title: Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages

Authors: Poulami Ghosh, Raj Dabre, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10805
Pdf URL: https://arxiv.org/pdf/2412.10805
Copy Paste: [[2412.10805]] Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages(https://arxiv.org/abs/2412.10805)
Keywords: language model
Abstract: Pre-trained language models (PLMs) are known to be susceptible to perturbations to the input text, but existing works do not explicitly focus on linguistically grounded attacks, which are subtle and more prevalent in nature. In this paper, we study whether PLMs are agnostic to linguistically grounded attacks or not. To this end, we offer the first study addressing this, investigating different Indic languages and various downstream tasks. Our findings reveal that although PLMs are susceptible to linguistic perturbations, when compared to non-linguistic attacks, PLMs exhibit a slightly lower susceptibility to linguistic attacks. This highlights that even constrained attacks are effective. Moreover, we investigate the implications of these outcomes across a range of languages, encompassing diverse language families and different scripts.

Title: FinGPT: Enhancing Sentiment-Based Stock Movement Prediction with Dissemination-Aware and Context-Enriched LLMs

Authors: Yixuan Liang, Yuncong Liu, Boyu Zhang, Christina Dan Wang, Hongyang Yang
Subjects: cs.CL, cs.LG, q-fin.CP, q-fin.TR
Abstract URL: https://arxiv.org/abs/2412.10823
Pdf URL: https://arxiv.org/pdf/2412.10823
Copy Paste: [[2412.10823]] FinGPT: Enhancing Sentiment-Based Stock Movement Prediction with Dissemination-Aware and Context-Enriched LLMs(https://arxiv.org/abs/2412.10823)
Keywords: language model, gpt, llm, prompt
Abstract: Financial sentiment analysis is crucial for understanding the influence of news on stock prices. Recently, large language models (LLMs) have been widely adopted for this purpose due to their advanced text analysis capabilities. However, these models often only consider the news content itself, ignoring its dissemination, which hampers accurate prediction of short-term stock movements. Additionally, current methods often lack sufficient contextual data and explicit instructions in their prompts, limiting LLMs' ability to interpret news. In this paper, we propose a data-driven approach that enhances LLM-powered sentiment-based stock movement predictions by incorporating news dissemination breadth, contextual data, and explicit instructions. We cluster recent company-related news to assess its reach and influence, enriching prompts with more specific data and precise instructions. This data is used to construct an instruction tuning dataset to fine-tune an LLM for predicting short-term stock price movements. Our experimental results show that our approach improves prediction accuracy by 8% compared to existing methods.

Title: Rethinking Chain-of-Thought from the Perspective of Self-Training

Authors: Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10827
Pdf URL: https://arxiv.org/pdf/2412.10827
Copy Paste: [[2412.10827]] Rethinking Chain-of-Thought from the Perspective of Self-Training(https://arxiv.org/abs/2412.10827)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in large language models (LLMs). We observe that CoT shares significant similarities with self-training in terms of their learning processes. Motivated by these parallels, this paper explores the underlying relationship between CoT and self-training, demonstrating how insights from self-training can enhance CoT performance. Specifically, our study first reveals that CoT, like self-training, follows the principle of semantic entropy minimization. Leveraging this insight, we propose a novel CoT framework that incorporates two key components: (i) a task-specific prompt module designed to guide LLMs in generating high-quality initial reasoning processes, and (ii) an adaptive reasoning iteration module for progressively refining the reasoning process.
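The notion of semantic entropy referenced in this abstract can be illustrated with a small Python sketch: sample several answers, group the ones that mean the same thing, and compute the Shannon entropy over the groups. This is a generic illustration and not the paper's procedure. In particular, `meaning_of` is a hypothetical stand-in for a real semantic-equivalence check, and `toy_meaning` only handles a couple of number words.

```python
import math
from collections import Counter
from typing import Callable, Iterable


def semantic_entropy(answers: Iterable[str], meaning_of: Callable[[str], str]) -> float:
    """Shannon entropy (in nats) over clusters of answers that share a meaning."""
    clusters = Counter(meaning_of(a) for a in answers)
    total = sum(clusters.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in clusters.values())


def toy_meaning(answer: str) -> str:
    """Hypothetical equivalence key: normalizes case, spacing, and two number words."""
    words_to_digits = {"four": "4", "five": "5"}
    key = answer.strip().lower()
    return words_to_digits.get(key, key)


if __name__ == "__main__":
    # Differently worded answers with one meaning: entropy is zero.
    print(semantic_entropy(["4", "four", " 4 "], toy_meaning))
    # Two competing meanings, equally frequent: entropy is log(2).
    print(semantic_entropy(["4", "5"], toy_meaning))
```

Under this reading, a reasoning process that converges makes the sampled answers collapse into fewer meaning clusters, so the entropy decreases from one iteration to the next.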

Title: Large Language Models for Medical Forecasting -- Foresight 2

Authors: Zeljko Kraljevic, Joshua Au Yeung, Daniel Bean, James Teo, Richard J. Dobson
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10848
Pdf URL: https://arxiv.org/pdf/2412.10848
Copy Paste: [[2412.10848]] Large Language Models for Medical Forecasting -- Foresight 2(https://arxiv.org/abs/2412.10848)
Keywords: language model, gpt, llm
Abstract: Foresight 2 (FS2) is a large language model fine-tuned on hospital data for modelling patient timelines (GitHub 'removed for anon'). It can understand patients' clinical notes and predict SNOMED codes for a wide range of biomedical use cases, including diagnosis suggestions, risk forecasting, and procedure and medication recommendations. FS2 is trained on the free text portion of the MIMIC-III dataset, firstly through extracting biomedical concepts and then creating contextualised patient timelines, upon which the model is then fine-tuned. The results show significant improvement over the previous state-of-the-art for the next new biomedical concept prediction (P/R - 0.73/0.66 vs 0.52/0.32) and a similar improvement specifically for the next new disorder prediction (P/R - 0.69/0.62 vs 0.46/0.25). Finally, on the task of risk forecast, we compare our model to GPT-4-turbo (and a range of open-source biomedical LLMs) and show that FS2 performs significantly better on such tasks (P@5 - 0.90 vs 0.65). This highlights the need to incorporate hospital data into LLMs and shows that small models outperform much larger ones when fine-tuned on high-quality, specialised data.
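The P/R and P@5 figures quoted in this abstract are precision, recall, and precision-at-5. The Python sketch below shows the usual definitions for a single patient timeline, assuming predictions and ground truth are given as concept identifiers. The paper's exact evaluation protocol is not described in the abstract, so treat the function names and the placeholder concept labels as illustrative assumptions.

```python
from typing import Sequence, Set, Tuple


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Precision = correct / predicted; recall = correct / actual."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def precision_at_k(ranked: Sequence[str], actual: Set[str], k: int = 5) -> float:
    """Fraction of the top-k ranked predictions that are in the ground truth."""
    if k <= 0:
        return 0.0
    hits = sum(1 for concept in ranked[:k] if concept in actual)
    return hits / k


if __name__ == "__main__":
    # Placeholder labels, not real SNOMED codes.
    actual = {"concept_a", "concept_c", "concept_x"}
    ranked = ["concept_a", "concept_b", "concept_c", "concept_d", "concept_e"]
    print(precision_recall(set(ranked[:2]), actual))
    print(precision_at_k(ranked, actual, k=5))
```

Precision-at-k suits risk forecasting because a clinician typically reviews only a short ranked list, so what matters is how many of the top suggestions are correct.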

Title: BgGPT 1.0: Extending English-centric LLMs to other languages

Authors: Anton Alexandrov, Veselin Raychev, Dimitar I. Dimitrov, Ce Zhang, Martin Vechev, Kristina Toutanova
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10893
Pdf URL: https://arxiv.org/pdf/2412.10893
Copy Paste: [[2412.10893]] BgGPT 1.0: Extending English-centric LLMs to other languages(https://arxiv.org/abs/2412.10893)
Keywords: gpt, llm, chat
Abstract: We present BgGPT-Gemma-2-27B-Instruct and BgGPT-Gemma-2-9B-Instruct: continually pretrained and fine-tuned versions of Google's Gemma-2 models, specifically optimized for Bulgarian language understanding and generation. Leveraging Gemma-2's multilingual capabilities and over 100 billion tokens of Bulgarian and English text data, our models demonstrate strong performance in Bulgarian language tasks, setting a new standard for language-specific AI models. Our approach maintains the robust capabilities of the original Gemma-2 models, ensuring that the English language performance remains intact. To preserve the base model capabilities, we incorporate continual learning strategies based on recent Branch-and-Merge techniques as well as thorough curation and selection of training data. We provide detailed insights into our methodology, including the release of model weights with a commercial-friendly license, enabling broader adoption by researchers, companies, and hobbyists. Further, we establish a comprehensive set of benchmarks based on non-public educational data sources to evaluate models on Bulgarian language tasks as well as safety and chat capabilities. Our findings demonstrate the effectiveness of fine-tuning state-of-the-art models like Gemma 2 to enhance language-specific AI applications while maintaining cross-lingual capabilities.

Title: SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation

Authors: Qilong Wu, Xiaoneng Xiang, Hejia Huang, Xuan Wang, Yeo Wei Jie, Ranjan Satapathy, Ricardo Shirota Filho, Bharadwaj Veeravalli
Subjects: cs.CL, cs.CE, cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2412.10906
Pdf URL: https://arxiv.org/pdf/2412.10906
Copy Paste: [[2412.10906]] SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation(https://arxiv.org/abs/2412.10906)
Keywords: gpt, llm, retrieval-augmented generation
Abstract: The rapid growth of the financial sector and the rising focus on Environmental, Social, and Governance (ESG) considerations highlight the need for advanced NLP tools. However, open-source LLMs proficient in both finance and ESG domains remain scarce. To address this gap, we introduce SusGen-30K, a category-balanced dataset comprising seven financial NLP tasks and ESG report generation, and propose TCFD-Bench, a benchmark for evaluating sustainability report generation. Leveraging this dataset, we developed SusGen-GPT, a suite of models achieving state-of-the-art performance across six adapted and two off-the-shelf tasks, trailing GPT-4 by only 2% despite using 7-8B parameters compared to GPT-4's 1,700B. Based on this, we propose the SusGen system, integrated with Retrieval-Augmented Generation (RAG), to assist in sustainability report generation. This work demonstrates the efficiency of our approach, advancing research in finance and ESG.
摘要：金融行业的快速增长以及对环境、社会和治理 (ESG) 考虑因素的日益关注凸显了对高级 NLP 工具的需求。然而，精通金融和 ESG 领域的开源 LLM 仍然很少。为了弥补这一差距，我们引入了 SusGen-30K，这是一个类别平衡的数据集，包含七个金融 NLP 任务和 ESG 报告生成，并提出了 TCFD-Bench，这是评估可持续发展报告生成的基准。利用这个数据集，我们开发了 SusGen-GPT，这是一套模型，在六个改编任务和两个现成任务中实现了最先进的性能，尽管使用了 7-8B 个参数，而 GPT-4 使用了 1,700B 个参数，但仅落后于 GPT-4 2%。基于此，我们提出了 SusGen 系统，该系统与检索增强生成 (RAG) 集成，以协助生成可持续发展报告。这项工作证明了我们方法的效率，推动了金融和 ESG 的研究。

Title: LLMs-in-the-Loop Part 2: Expert Small AI Models for Anonymization and De-identification of PHI Across Multiple Languages

Authors: Murat Gunay, Bunyamin Keles, Raife Hizlan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10918
Pdf URL: https://arxiv.org/pdf/2412.10918
Copy Paste: [[2412.10918]] LLMs-in-the-Loop Part 2: Expert Small AI Models for Anonymization and De-identification of PHI Across Multiple Languages(https://arxiv.org/abs/2412.10918)
Keywords: language model, gpt, llm
Abstract: The rise of chronic diseases and pandemics like COVID-19 has emphasized the need for effective patient data processing while ensuring privacy through anonymization and de-identification of protected health information (PHI). Anonymized data facilitates research without compromising patient confidentiality. This paper introduces expert small AI models developed using the LLM-in-the-loop methodology to meet the demand for domain-specific de-identification NER models. These models overcome the privacy risks associated with large language models (LLMs) used via APIs by eliminating the need to transmit or store sensitive data. More importantly, they consistently outperform LLMs in de-identification tasks, offering superior performance and reliability. Our de-identification NER models, developed in eight languages (English, German, Italian, French, Romanian, Turkish, Spanish, and Arabic) achieved f1-micro score averages of 0.966, 0.975, 0.976, 0.970, 0.964, 0.974, 0.978, and 0.953 respectively. These results establish them as the most accurate healthcare anonymization solutions, surpassing existing small models and even general-purpose LLMs such as GPT-4o. While Part-1 of this series introduced the LLM-in-the-loop methodology for bio-medical document translation, this second paper showcases its success in developing cost-effective expert small NER models in de-identification tasks. Our findings lay the groundwork for future healthcare AI innovations, including biomedical entity and relation extraction, demonstrating the value of specialized models for domain-specific challenges.
摘要：慢性病和 COVID-19 等流行病的兴起凸显了有效处理患者数据的需求，同时通过对受保护的健康信息 (PHI) 进行匿名化和去识别化来确保隐私。匿名数据有助于研究，同时又不会损害患者的隐私。本文介绍了使用 LLM-in-the-loop 方法开发的专家小型 AI 模型，以满足对特定领域去识别化 NER 模型的需求。这些模型通过消除传输或存储敏感数据的需要，克服了通过 API 使用的大型语言模型 (LLM) 相关的隐私风险。更重要的是，它们在去识别化任务中始终优于 LLM，提供卓越的性能和可靠性。我们用八种语言（英语、德语、意大利语、法语、罗马尼亚语、土耳其语、西班牙语和阿拉伯语）开发的去识别 NER 模型，其 f1-micro 得分平均值分别为 0.966、0.975、0.976、0.970、0.964、0.974、0.978 和 0.953。这些结果证明它们是最准确的医疗匿名化解决方案，超越了现有的小型模型，甚至超越了 GPT-4o 等通用 LLM。虽然本系列的第一部分介绍了用于生物医学文档翻译的 LLM-in-the-loop 方法，但第二篇论文展示了它在去识别任务中开发具有成本效益的专家小型 NER 模型方面的成功。我们的研究结果为未来的医疗 AI 创新奠定了基础，包括生物医学实体和关系提取，展示了专门模型对特定领域挑战的价值。
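A quick arithmetic sketch of the per-language micro-F1 figures quoted in the abstract above. The eight scores and their language order are taken from the abstract; the variable names and the choice to average across languages are illustrative assumptions, not something the paper reports.

```python
# Per-language micro-F1 scores as listed in the abstract (order as given there).
f1_by_language = {
    "English": 0.966,
    "German": 0.975,
    "Italian": 0.976,
    "French": 0.970,
    "Romanian": 0.964,
    "Turkish": 0.974,
    "Spanish": 0.978,
    "Arabic": 0.953,
}

# Unweighted mean across the eight languages (an assumption: the abstract
# does not say how, or whether, the authors aggregate across languages).
mean_f1 = sum(f1_by_language.values()) / len(f1_by_language)

# Highest- and lowest-scoring languages among the reported figures.
best_language = max(f1_by_language, key=f1_by_language.get)
worst_language = min(f1_by_language, key=f1_by_language.get)

print(f"mean={mean_f1:.4f} best={best_language} worst={worst_language}")
```

Under these assumptions the spread between the best and worst language is 0.025, so the reported scores are tightly clustered.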

Title: Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Authors: Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10924
Pdf URL: https://arxiv.org/pdf/2412.10924
Copy Paste: [[2412.10924]] Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning(https://arxiv.org/abs/2412.10924)
Keywords: language model, llm
Abstract: Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance, and that the emergence of human-meaningful linguistic units among tokens motivates linguistically-informed interventions in existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) semantic primitives and as (2) vehicles for conveying salient distributional patterns from human language to the model. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating sub-optimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokenization pretraining can be a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being meaningfully insulated from the main system intelligence.
摘要：标记化是当前许多语言模型架构中的必要组成部分，包括基于转换器的生成式 AI 大型语言模型 (LLM)，但其对模型认知的影响往往被忽视。我们认为，LLM 表明分布假设 (DM) 足以实现相当类似人类的语言表现，并且标记中出现具有人类意义的语言单位促使人们在现有的、语言无关的标记化技术中进行语言知情干预，特别是关于它们作为 (1) 语义原语和 (2) 将显着的分布模式从人类语言传达到模型的载体的作用。我们探索了来自 BPE 标记器的标记化；从 Hugging Face 和 tiktoken 获得的现有模型词汇；以及示例标记向量在 RoBERTa（大型）模型各层中移动时的信息。除了创建次优的语义构建块并模糊模型对必要分布模式的访问之外，我们还描述了标记化预训练如何成为偏见和其他不想要的内容的后门，而当前的对齐实践可能无法补救。此外，我们传递证据表明，标记化算法的目标函数会影响 LLM 的认知，尽管它与主系统智能完全隔离。

Title: Enhancing Discoverability in Enterprise Conversational Systems with Proactive Question Suggestions

Authors: Xiaobin Shen, Daniel Lee, Sumit Ranjan, Sai Sree Harsha, Pawan Sevak, Yunyao Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10933
Pdf URL: https://arxiv.org/pdf/2412.10933
Copy Paste: [[2412.10933]] Enhancing Discoverability in Enterprise Conversational Systems with Proactive Question Suggestions(https://arxiv.org/abs/2412.10933)
Keywords: chat
Abstract: Enterprise conversational AI systems are becoming increasingly popular to assist users in completing daily tasks such as those in marketing and customer management. However, new users often struggle to ask effective questions, especially in emerging systems with unfamiliar or evolving capabilities. This paper proposes a framework to enhance question suggestions in conversational enterprise AI systems by generating proactive, context-aware questions that try to address immediate user needs while improving feature discoverability. Our approach combines periodic user intent analysis at the population level with chat session-based question generation. We evaluate the framework using real-world data from the AI Assistant for Adobe Experience Platform (AEP), demonstrating the improved usefulness and system discoverability of the AI Assistant.
摘要：企业对话式 AI 系统正变得越来越流行，它可帮助用户完成日常任务，例如营销和客户管理。然而，新用户往往很难提出有效的问题，尤其是在具有不熟悉或不断发展的功能的新兴系统中。本文提出了一个框架，通过生成主动的、上下文感知的问题来增强对话式企业 AI 系统中的问题建议，这些问题试图满足用户的即时需求，同时提高功能的可发现性。我们的方法将人口级别的定期用户意图分析与基于聊天会话的问题生成相结合。我们使用来自 Adobe Experience Platform (AEP) 的 AI 助手的真实数据评估该框架，展示了 AI 助手改进的实用性和系统可发现性。

Title: Can LLMs Help Create Grammar?: Automating Grammar Creation for Endangered Languages with In-Context Learning

Authors: Piyapath T Spencer, Nanthipat Kongborrirak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10960
Pdf URL: https://arxiv.org/pdf/2412.10960
Copy Paste: [[2412.10960]] Can LLMs Help Create Grammar?: Automating Grammar Creation for Endangered Languages with In-Context Learning(https://arxiv.org/abs/2412.10960)
Keywords: language model, llm, prompt
Abstract: Yes! In present-day efforts to document and preserve endangered languages, the application of Large Language Models (LLMs) presents a promising approach. This paper explores how LLMs, particularly through in-context learning, can assist in generating grammatical information for low-resource languages with limited amounts of data. We take Moklen as a case study to evaluate the efficacy of LLMs in producing coherent grammatical rules and lexical entries using only bilingual dictionaries and parallel sentences of the unknown language, without building the model from scratch. Our methodology involves organising the existing linguistic data and prompting the model so that it can efficiently generate formal XLE grammar. Our results demonstrate that LLMs can successfully capture key grammatical structures and lexical information, although challenges such as the potential for English grammatical biases remain. This study highlights the potential of LLMs to enhance language documentation efforts, providing a cost-effective solution for generating linguistic data and contributing to the preservation of endangered languages.
摘要：是的！在当今记录和保护濒危语言的过程中，大型语言模型 (LLM) 的应用是一种很有前途的方法。本文探讨了 LLM（尤其是通过上下文学习）如何帮助为资源有限的语言生成语法信息，而这些数据量有限。我们以 Moklen 为例，评估 LLM 在仅使用未知语言的双语词典和平行句子生成连贯的语法规则和词汇条目方面的有效性，而无需从头开始构建模型。我们的方法包括组织现有的语言数据并提示以有效地生成正式的 XLE 语法。我们的结果表明，LLM 可以成功捕获关键的语法结构和词汇信息，尽管仍然存在诸如英语语法偏见的可能性等挑战。这项研究强调了 LLM 在加强语言记录工作方面的潜力，为生成语言数据提供了一种经济有效的解决方案，并为保护濒危语言做出了贡献。

Title: A Contextualized BERT model for Knowledge Graph Completion

Authors: Haji Gul, Abdul Ghani Naim, Ajaz A. Bhat
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11016
Pdf URL: https://arxiv.org/pdf/2412.11016
Copy Paste: [[2412.11016]] A Contextualized BERT model for Knowledge Graph Completion(https://arxiv.org/abs/2412.11016)
Keywords: llm
Abstract: Knowledge graphs (KGs) are valuable for representing structured, interconnected information across domains, enabling tasks like semantic search, recommendation systems and inference. A pertinent challenge with KGs, however, is that many entities (i.e., heads, tails) or relationships are unknown. Knowledge Graph Completion (KGC) addresses this by predicting these missing nodes or links, enhancing the graph's informational depth and utility. Traditional methods like TransE and ComplEx predict tail entities but struggle with unseen entities. Textual-based models leverage additional semantics but come with high computational costs, semantic inconsistencies, and data imbalance issues. Recent LLM-based models show improvement but overlook contextual information and rely heavily on entity descriptions. In this study, we introduce a contextualized BERT model for KGC that overcomes these limitations by utilizing the contextual information from neighbouring entities and relationships to predict tail entities. Our model eliminates the need for entity descriptions and negative triplet sampling, reducing computational demands while improving performance. Our model outperforms state-of-the-art methods on standard datasets, improving Hit@1 by 5.3% and 4.88% on FB15k-237 and WN18RR respectively, setting a new benchmark in KGC.
摘要：知识图谱 (KG) 对于表示跨领域的结构化、互连信息非常有用，可实现语义搜索、推荐系统和推理等任务。然而，知识图谱面临的一个重大挑战是许多实体（即头部、尾部）或关系是未知的。知识图谱补全 (KGC) 通过预测这些缺失的节点或链接来解决这个问题，从而增强了图谱的信息深度和实用性。TransE 和 ComplEx 等传统方法可以预测尾部实体，但难以处理看不见的实体。基于文本的模型利用了额外的语义，但计算成本高、语义不一致和数据不平衡问题。最近基于 LLM 的模型有所改进，但忽略了上下文信息，严重依赖实体描述。在本研究中，我们为 KGC 引入了一个上下文化的 BERT 模型，该模型通过利用来自邻近实体和关系的上下文信息来预测尾部实体，从而克服了这些限制。我们的模型消除了对实体描述和负三重采样的需求，减少了计算需求，同时提高了性能。我们的模型在标准数据集上的表现优于最先进的方法，在 FB15k-237 和 WN18RR 上分别将 Hit@1 提高了 5.3% 和 4.88%，在 KGC 中树立了新的标杆。
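The Hit@1 metric cited in the abstract above can be illustrated with a minimal sketch. It assumes the usual definition (the fraction of test triples for which the correct tail entity is ranked first); the function name `hits_at_k` and the example ranks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def hits_at_k(ranks: Sequence[int], k: int = 1) -> float:
    """Return the fraction of test triples whose true tail is ranked in the top k.

    `ranks` holds the 1-based rank the model assigned to the correct tail
    entity for each test triple (1 means the model's top prediction was right).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical ranks for five test triples, for illustration only.
example_ranks = [1, 3, 1, 2, 1]
hits1 = hits_at_k(example_ranks, k=1)  # three of the five triples are ranked first
hits3 = hits_at_k(example_ranks, k=3)  # every triple falls within the top three
```

An improvement such as the one reported in the abstract corresponds to a larger share of test triples with rank 1 on the same benchmark split.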

Title: Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models

Authors: Di Wu, Xin Lu, Yanyan Zhao, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11041
Pdf URL: https://arxiv.org/pdf/2412.11041
Copy Paste: [[2412.11041]] Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models(https://arxiv.org/abs/2412.11041)
Keywords: language model, llm
Abstract: Although large language models (LLMs) achieve effective safety alignment at the time of release, they still face various safety challenges. A key issue is that fine-tuning often compromises the safety alignment of LLMs. To address this issue, we propose a method named IRR (Identify, Remove, and Recalibrate for Safety Realignment) that performs safety realignment for LLMs. The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained ones. We evaluate the effectiveness of IRR across various datasets, including both full fine-tuning and LoRA methods. Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks, while maintaining their performance on downstream tasks. The source code is available at: this https URL.
摘要：尽管大型语言模型 (LLM) 在发布时实现了有效的安全校准，但它们仍然面临各种安全挑战。一个关键问题是微调通常会损害 LLM 的安全校准。为了解决这个问题，我们提出了一种名为 IRR（Identify、Remove 和 Recalibrate，用于安全校准）的方法，用于对 LLM 进行安全校准。IRR 的核心是从微调模型中识别和删除不安全的增量参数，同时重新校准保留的参数。我们在各种数据集上评估了 IRR 的有效性，包括完全微调和 LoRA 方法。我们的结果表明，IRR 显着提高了微调模型在安全基准（例如有害查询和越狱攻击）上的安全性能，同时保持了它们在下游任务上的性能。源代码位于：此 https URL。

Title: NITRO: LLM Inference on Intel Laptop NPUs

Authors: Anthony Fei, Mohamed S. Abdelfattah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11053
Pdf URL: https://arxiv.org/pdf/2412.11053
Copy Paste: [[2412.11053]] NITRO: LLM Inference on Intel Laptop NPUs(https://arxiv.org/abs/2412.11053)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) have become essential tools in natural language processing, are widely used in chatbots such as ChatGPT and Gemini, and are a central area of research. A particular area of interest is the design of hardware specialized for these AI applications, one example being the neural processing unit (NPU). In 2023, Intel released the Intel Core Ultra processor with codename Meteor Lake, featuring a CPU, GPU, and NPU system-on-chip. However, official software support for the NPU through Intel's OpenVINO framework is limited to static model inference. The dynamic nature of autoregressive token generation in LLMs is therefore not supported out of the box. To address this shortcoming, we present NITRO (NPU Inference for Transformers Optimization), a Python-based framework built on top of OpenVINO to support text and chat generation on NPUs. In this paper, we discuss in detail the key modifications made to the transformer architecture to enable inference, some performance benchmarks, and future steps towards improving the package. The code repository for NITRO can be found here: this https URL.
摘要：大型语言模型 (LLM) 已成为自然语言处理中必不可少的工具，在 ChatGPT 和 Gemini 等聊天机器人中得到广泛使用，并且是研究的中心领域。一个特别感兴趣的领域包括设计专门用于这些 AI 应用的硬件，其中一个例子就是神经处理单元 (NPU)。2023 年，英特尔发布了代号为 Meteor Lake 的英特尔酷睿超处理器，具有 CPU、GPU 和 NPU 片上系统。但是，通过英特尔的 OpenVINO 框架对 NPU 的官方软件支持仅限于静态模型推理。因此，LLM 中自回归令牌生成的动态特性并不开箱即用。为了解决这个缺点，我们提出了 NITRO（用于 Transformers 优化的 NPU 推理），这是一个基于 Python 的框架，建立在 OpenVINO 之上，以支持在 NPU 上生成文本和聊天。在本文中，我们详细讨论了对 Transformer 架构进行的关键修改以实现推理、一些性能基准以及未来改进该软件包的步骤。 NITRO 的代码库可以在这里找到：这个 https URL。

Title: AD-LLM: Benchmarking Large Language Models for Anomaly Detection

Authors: Tiankai Yang, Yi Nian, Shawn Li, Ruiyao Xu, Yuangang Li, Jiaqi Li, Zhuo Xiao, Xiyang Hu, Ryan Rossi, Kaize Ding, Xia Hu, Yue Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11142
Pdf URL: https://arxiv.org/pdf/2412.11142
Copy Paste: [[2412.11142]] AD-LLM: Benchmarking Large Language Models for Anomaly Detection(https://arxiv.org/abs/2412.11142)
Keywords: language model, llm
Abstract: Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs' pre-trained knowledge to perform AD without task-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.
摘要：异常检测 (AD) 是一项重要的机器学习任务，具有许多实际用途，包括欺诈检测、医疗诊断和工业监控。在自然语言处理 (NLP) 中，AD 有助于检测垃圾邮件、错误信息和异常用户活动等问题。尽管大型语言模型 (LLM) 对文本生成和摘要等任务产生了重大影响，但它们在 AD 中的潜力尚未得到充分研究。本文介绍了 AD-LLM，这是第一个评估 LLM 如何帮助 NLP 异常检测的基准。我们研究了三个关键任务：(i) 零样本检测，使用 LLM 的预训练知识执行 AD，而无需特定于任务的训练；(ii) 数据增强，生成合成数据和类别描述以改进 AD 模型；(iii) 模型选择，使用 LLM 建议无监督 AD 模型。通过对不同数据集的实验，我们发现 LLM 可以在零样本 AD 中很好地工作，精心设计的增强方法很有用，并且解释特定数据集的模型选择仍然具有挑战性。根据这些结果，我们概述了面向 AD 的 LLM 未来的六个研究方向。

Title: The Superalignment of Superhuman Intelligence with Large Language Models

Authors: Minlie Huang, Yingkang Wang, Shiyao Cui, Pei Ke, Jie Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11145
Pdf URL: https://arxiv.org/pdf/2412.11145
Copy Paste: [[2412.11145]] The Superalignment of Superhuman Intelligence with Large Language Models(https://arxiv.org/abs/2412.11145)
Keywords: language model
Abstract: We have witnessed superhuman intelligence thanks to the fast development of large language models and multimodal language models. As the application of such superhuman models becomes more and more common, a critical question arises: how can we ensure superhuman models are still safe, reliable and well aligned with human values? In this position paper, we discuss the concept of superalignment from the learning perspective to answer this question by outlining the learning paradigm shift from large-scale pretraining, supervised fine-tuning, to alignment training. We define superalignment as designing effective and efficient alignment algorithms to learn from noisy-labeled data (point-wise samples or pair-wise preference data) in a scalable way when the task becomes too complex for human experts to annotate and the model is stronger than human experts. We highlight some key research problems in superalignment, namely, weak-to-strong generalization, scalable oversight, and evaluation. We then present a conceptual framework for superalignment, which consists of three modules: an attacker, which generates adversarial queries trying to expose the weaknesses of a learner model; a learner, which refines itself by learning from scalable feedback generated by a critic model with minimal involvement of human experts; and a critic, which generates critiques or explanations for a given query-response pair, with the goal of improving the learner through criticism. We discuss some important research problems in each component of this framework and highlight some interesting research ideas that are closely related to our proposed framework, for instance, self-alignment, self-play, self-refinement, and more. Last, we highlight some future research directions for superalignment, including identification of new emergent risks and multi-dimensional alignment.
摘要：由于大型语言模型和多模态语言模型的快速发展，我们见证了超人的智能。随着此类超人模型的应用越来越普遍，一个关键问题随之而来：我们如何确保超人模型仍然安全、可靠并与人类价值观保持一致？在本文中，我们从学习的角度讨论了超对齐的概念，通过概述从大规模预训练、监督微调到对齐训练的学习范式转变来回答这个问题。我们将超对齐定义为设计有效且高效的对齐算法，当任务变得非常复杂以至于人类专家无法注释并且模型比人类专家更强大时，以可扩展的方式从噪声标记数据（逐点样本或成对偏好数据）中学习。我们重点介绍了超对齐中的一些关键研究问题，即由弱到强的泛化、可扩展的监督和评估。然后，我们提出了超对齐的概念框架，它由三个模块组成：攻击者生成对手查询，试图暴露学习者模型的弱点；学习者将通过从由批评模型和最少的人类专家生成的可扩展反馈中学习来完善自己；批评者将针对给定的查询-响应对生成批评或解释，目标是通过批评来改进学习者。我们讨论了该框架每个组成部分中的一些重要研究问题，并强调了一些与我们提出的框架密切相关的有趣研究想法，例如自我调整、自我博弈、自我改进等。最后，我们强调了超调整的一些未来研究方向，包括识别新出现的风险和多维调整。

Title: Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette

Authors: Jiahao Yuan, Zixiang Di, Shangzixin Zhao, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11167
Pdf URL: https://arxiv.org/pdf/2412.11167
Copy Paste: [[2412.11167]] Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette(https://arxiv.org/abs/2412.11167)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) face challenges in aligning with diverse cultural values despite their remarkable performance in generation, which stems from inherent monocultural biases and difficulties in capturing nuanced cultural semantics. Existing methods lack adaptability to unknown cultures after fine-tuning. Inspired by cultural geography across five continents, we propose Cultural Palette, a multi-agent framework for cultural alignment. We first introduce the Pentachromatic Cultural Palette Dataset, synthesized using LLMs to capture diverse cultural values from social dialogues across five continents. Building on this, Cultural Palette integrates five continent-level alignment agents with a meta-agent using our Cultural MoErges alignment technique, which dynamically activates relevant cultural expertise based on user prompts to adapt to new cultures and outperforms other joint and merging alignment strategies in overall cultural value alignment. Each continent agent generates a cultural draft, which is then refined and self-regulated by the meta-agent to produce the final culturally aligned response. Experiments across various countries demonstrate that Cultural Palette surpasses existing baselines in cultural alignment.
摘要：尽管大型语言模型 (LLM) 在生成方面表现出色，但它们在与多元文化价值观保持一致方面仍面临挑战，这源于其固有的单一文化偏见和难以捕捉细微的文化语义。现有方法在微调后缺乏对未知文化的适应性。受五大洲文化地理的启发，我们提出了文化调色板，这是一个用于文化对齐的多智能体框架。我们首先介绍使用 LLM 合成的五色文化调色板数据集，以捕捉五大洲社交对话中的多元文化价值观。在此基础上，文化调色板使用我们卓越的文化 MoErges 对齐技术将五个大陆级对齐代理与元代理集成，通过根据用户提示动态激活相关文化专业知识来适应新文化，这在整体文化价值观对齐方面优于其他联合和合并对齐策略。每个大陆代理都会生成一个文化草案，然后由元代理对其进行改进和自我调节，以产生最终的文化对齐响应。在各个国家进行的实验表明，文化调色板在文化对齐方面超越了现有的基线。

Title: Drawing the Line: Enhancing Trustworthiness of MLLMs Through the Power of Refusal

Authors: Yuhao Wang, Zhiyuan Zhu, Heyang Liu, Yusheng Liao, Hongcheng Liu, Yanfeng Wang, Yu Wang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2412.11196
Pdf URL: https://arxiv.org/pdf/2412.11196
Copy Paste: [[2412.11196]] Drawing the Line: Enhancing Trustworthiness of MLLMs Through the Power of Refusal(https://arxiv.org/abs/2412.11196)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) excel at multimodal perception and understanding, yet their tendency to generate hallucinated or inaccurate responses undermines their trustworthiness. Existing methods have largely overlooked the importance of refusal responses as a means of enhancing MLLM reliability. To bridge this gap, we present the Information Boundary-aware Learning Framework (InBoL), a novel approach that empowers MLLMs to refuse to answer user queries when encountering insufficient information. To the best of our knowledge, InBoL is the first framework that systematically defines the conditions under which refusal is appropriate for MLLMs using the concept of information boundaries proposed in our paper. This framework introduces a comprehensive data generation pipeline and tailored training strategies to improve the model's ability to deliver appropriate refusal responses. To evaluate the trustworthiness of MLLMs, we further propose a user-centric alignment goal along with corresponding metrics. Experimental results demonstrate a significant improvement in refusal accuracy without noticeably compromising the model's helpfulness, establishing InBoL as a pivotal advancement in building more trustworthy MLLMs.
摘要：多模态大型语言模型 (MLLM) 擅长多模态感知和理解，但它们容易产生幻觉或不准确的反应，从而削弱了它们的可信度。现有方法在很大程度上忽视了拒绝反应作为增强 MLLM 可靠性的一种手段的重要性。为了弥补这一差距，我们提出了信息边界感知学习框架 (InBoL)，这是一种新方法，它使 MLLM 在遇到信息不足时能够拒绝回答用户查询。据我们所知，InBoL 是第一个使用我们论文中提出的信息边界概念系统地定义 MLLM 拒绝适当条件的框架。该框架引入了全面的数据生成管道和量身定制的训练策略，以提高模型提供适当拒绝反应的能力。为了评估 MLLM 的可信度，我们进一步提出了以用户为中心的对齐目标以及相应的指标。实验结果表明，拒绝准确率有了显著提高，而且模型的帮助并没有明显减弱，从而确立了 InBoL 是构建更值得信赖的 MLLM 的关键进步。

Title: Task-Oriented Dialog Systems for the Senegalese Wolof Language

Authors: Derguene Mbaye, Moussa Diallo
Subjects: cs.CL, cs.AI, cs.HC, cs.IR
Abstract URL: https://arxiv.org/abs/2412.11203
Pdf URL: https://arxiv.org/pdf/2412.11203
Copy Paste: [[2412.11203]] Task-Oriented Dialog Systems for the Senegalese Wolof Language(https://arxiv.org/abs/2412.11203)
Keywords: language model, llm, hallucination, chat, agent
Abstract: In recent years, we have seen considerable interest in conversational agents with the rise of large language models (LLMs). Although they offer considerable advantages, LLMs also present significant risks, such as hallucination, which hinder their widespread deployment in industry. Moreover, low-resource languages such as African ones are still underrepresented in these systems, limiting their performance in these languages. In this paper, we illustrate a more classical approach based on modular architectures of Task-oriented Dialog Systems (ToDS) offering better control over outputs. We propose a chatbot generation engine based on the Rasa framework and a robust methodology for projecting annotations onto the Wolof language using an in-house machine translation system. After evaluating a generated chatbot trained on the Amazon Massive dataset, we find that our Wolof Intent Classifier performs similarly to the one obtained for French, which is a resource-rich language. We also show that this approach is extensible to other low-resource languages, thanks to the intent classifier's language-agnostic pipeline, simplifying the design of chatbots in these languages.
摘要：近年来，随着大型语言模型 (LLM) 的兴起，我们看到了人们对对话代理的浓厚兴趣。尽管 LLM 具有相当大的优势，但也存在重大风险，例如幻觉，这阻碍了它们在工业中的广泛部署。此外，非洲等资源匮乏的语言在这些系统中仍然代表性不足，限制了它们在这些语言中的表现。在本文中，我们说明了一种更经典的方法，该方法基于面向任务的对话系统 (ToDS) 的模块化架构，可以更好地控制输出。我们提出了一种基于 Rasa 框架的聊天机器人生成引擎，以及一种使用内部机器翻译系统将注释投射到沃洛夫语上的强大方法。在评估在 Amazon Massive 数据集上训练的生成的聊天机器人后，我们的沃洛夫意图分类器的表现与资源丰富的法语的表现相似。我们还表明，由于意图分类器的语言无关管道，这种方法可以扩展到其他资源匮乏的语言，从而简化了这些语言的聊天机器人的设计。

Title: Smaller Language Models Are Better Instruction Evolvers

Authors: Tingfeng Hui, Lulu Zhao, Guanting Dong, Yaqi Zhang, Hua Zhou, Sen Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11231
Pdf URL: https://arxiv.org/pdf/2412.11231
Copy Paste: [[2412.11231]] Smaller Language Models Are Better Instruction Evolvers(https://arxiv.org/abs/2412.11231)
Keywords: language model, gpt, llm
Abstract: Instruction tuning has been widely used to unleash the complete potential of large language models. Notably, complex and diverse instructions are of significant importance as they can effectively align models with various downstream tasks. However, current approaches to constructing large-scale instructions predominantly favour powerful models such as GPT-4 or those with over 70 billion parameters, under the empirical presumption that such larger language models (LLMs) inherently possess enhanced capabilities. In this study, we question this prevalent assumption and conduct an in-depth exploration into the potential of smaller language models (SLMs) in the context of instruction evolution. Extensive experiments across three scenarios of instruction evolution reveal that SLMs can synthesize more effective instructions than LLMs. Further analysis demonstrates that SLMs possess a broader output space during instruction evolution, resulting in more complex and diverse variants. We also observe that the existing metrics fail to focus on the impact of the instructions. Thus, we propose Instruction Complex-Aware IFD (IC-IFD), which introduces instruction complexity in the original IFD score to evaluate the effectiveness of instruction data more accurately. Our source code is available at: this https URL
摘要：指令调优已被广泛用于释放大型语言模型的全部潜力。值得注意的是，复杂多样的指令非常重要，因为它们可以有效地将模型与各种下游任务对齐。然而，当前构建大规模指令的方法主要倾向于强大的模型，例如 GPT-4 或具有超过 700 亿个参数的模型，经验假设此类大型语言模型 (LLM) 本身具有增强的功能。在本研究中，我们质疑这一普遍的假设，并深入探索小型语言模型 (SLM) 在指令演化的背景下的潜力。在三种指令演化场景中进行的大量实验表明，小型语言模型 (SLM) 可以合成比 LLM 更有效的指令。进一步的分析表明，SLM 在指令演化过程中拥有更广阔的输出空间，从而产生更复杂、更多样化的变体。我们还观察到现有指标未能关注指令的影响。因此，我们提出了指令复杂度感知 IFD（IC-IFD），在原始 IFD 分数中引入指令复杂度，以更准确地评估指令数据的有效性。我们的源代码位于：此 https URL

Title: Beyond Discrete Personas: Personality Modeling Through Journal Intensive Conversations

Authors: Sayantan Pal, Souvik Das, Rohini K. Srihari
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11250
Pdf URL: https://arxiv.org/pdf/2412.11250
Copy Paste: [[2412.11250]] Beyond Discrete Personas: Personality Modeling Through Journal Intensive Conversations(https://arxiv.org/abs/2412.11250)
Keywords: language model, llm, chat
Abstract: Large Language Models (LLMs) have significantly improved personalized conversational capabilities. However, existing datasets like Persona Chat, Synthetic Persona Chat, and Blended Skill Talk rely on static, predefined personas. This approach often results in dialogues that fail to capture the fluid and evolving nature of human personalities. To overcome these limitations, we introduce a novel dataset with around 400,000 dialogues and a framework for generating personalized conversations using long-form journal entries from Reddit. Our approach clusters journal entries for each author and filters them by selecting the most representative cluster, ensuring that the retained entries best reflect the author's personality. We further refine the data by capturing the Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism), ensuring that dialogues authentically reflect an individual's personality. Using Llama 3 70B, we generate high-quality, personality-rich dialogues grounded in these journal entries. Fine-tuning models on this dataset leads to an 11% improvement in capturing personality traits on average, outperforming existing approaches in generating more coherent and personality-driven dialogues.
摘要：大型语言模型 (LLM) 显著提高了个性化对话能力。然而，现有的数据集（如 Persona Chat、Synthetic Persona Chat 和 Blended Skill Talk）依赖于静态的、预定义的角色。这种方法通常会导致对话无法捕捉人类个性的流动和不断发展的本质。为了克服这些限制，我们引入了一个包含约 400,000 个对话的新数据集和一个使用 Reddit 的长篇日记条目生成个性化对话的框架。我们的方法对每个作者的日记条目进行聚类，并通过选择最具代表性的聚类对其进行过滤，确保保留的条目最能反映作者的个性。我们通过捕捉五大性格特征（开放性、尽责性、外向性、亲和性和神经质）进一步细化数据，确保对话真实地反映个人的个性。使用 Llama 3 70B，我们根据这些日记条目生成高质量、个性丰富的对话。对该数据集进行微调模型可以使捕捉性格特征的准确率平均提高 11%，在生成更连贯、更个性化的对话方面优于现有方法。

Title: CATER: Leveraging LLM to Pioneer a Multidimensional, Reference-Independent Paradigm in Translation Quality Evaluation

Authors: Kurando IIDA, Kenjiro MIMURA
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11261
Pdf URL: https://arxiv.org/pdf/2412.11261
Copy Paste: [[2412.11261]] CATER: Leveraging LLM to Pioneer a Multidimensional, Reference-Independent Paradigm in Translation Quality Evaluation(https://arxiv.org/abs/2412.11261)
Keywords: language model, llm, hallucination, prompt
Abstract: This paper introduces the Comprehensive AI-assisted Translation Edit Ratio (CATER), a novel and fully prompt-driven framework for evaluating machine translation (MT) quality. Leveraging large language models (LLMs) via a carefully designed prompt-based protocol, CATER expands beyond traditional reference-bound metrics, offering a multidimensional, reference-independent evaluation that addresses linguistic accuracy, semantic fidelity, contextual coherence, stylistic appropriateness, and information completeness. CATER's unique advantage lies in its immediate implementability: by providing the source and target texts along with a standardized prompt, an LLM can rapidly identify errors, quantify edit effort, and produce category-level and overall scores. This approach eliminates the need for pre-computed references or domain-specific resources, enabling instant adaptation to diverse languages, genres, and user priorities through adjustable weights and prompt modifications. CATER's LLM-enabled strategy supports more nuanced assessments, capturing phenomena such as subtle omissions, hallucinations, and discourse-level shifts that increasingly challenge contemporary MT systems. By uniting the conceptual rigor of frameworks like MQM and DQF with the scalability and flexibility of LLM-based evaluation, CATER emerges as a valuable tool for researchers, developers, and professional translators worldwide. The framework and example prompts are openly available, encouraging community-driven refinement and further empirical validation.
摘要：本文介绍了综合人工智能辅助翻译编辑比率 (CATER)，这是一种用于评估机器翻译 (MT) 质量的全新且完全由提示驱动的框架。CATER 通过精心设计的基于提示的协议利用大型语言模型 (LLM)，超越了传统的参考约束指标，提供多维、参考独立的评估，解决语言准确性、语义保真度、上下文连贯性、风格适当性和信息完整性问题。CATER 的独特优势在于其即时可实施性：通过提供源文本和目标文本以及标准化提示，LLM 可以快速识别错误、量化编辑工作量并生成类别级和总体分数。这种方法消除了对预先计算的参考或特定领域资源的需求，通过可调整的权重和提示修改，可以即时适应不同的语言、流派和用户优先级。CATER 的 LLM 启用策略支持更细致入微的评估，捕捉微妙的遗漏、幻觉和话语级转变等现象，这些现象对当代 MT 系统构成了越来越大的挑战。 CATER 将 MQM 和 DQF 等框架的概念严谨性与 LLM 评估的可扩展性和灵活性相结合，成为全球研究人员、开发人员和专业翻译人员的宝贵工具。该框架和示例提示均公开提供，鼓励社区推动改进和进一步的实证验证。
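
The CATER abstract above mentions category-level scores, an overall score, and adjustable weights, but does not state how they are combined. As a rough illustration only, the sketch below assumes a plain weighted average over the five dimensions named in the abstract. The function name `overall_score`, the 0-100 scale, and all numbers are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: a weighted average is an ASSUMPTION,
# not the formula defined by the CATER paper.

def overall_score(category_scores, weights):
    """Combine per-category scores into one score using normalized weights."""
    total_weight = sum(weights[name] for name in category_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(category_scores[name] * weights[name] for name in category_scores)
    return weighted_sum / total_weight

# Hypothetical category-level scores on an assumed 0-100 scale.
scores = {
    "linguistic_accuracy": 90.0,
    "semantic_fidelity": 80.0,
    "contextual_coherence": 85.0,
    "stylistic_appropriateness": 70.0,
    "information_completeness": 75.0,
}

# Equal weights by default; a user who prioritizes fidelity could raise that weight.
equal_weights = {name: 1.0 for name in scores}
fidelity_heavy = dict(equal_weights, semantic_fidelity=3.0)

equal_result = overall_score(scores, equal_weights)
fidelity_result = overall_score(scores, fidelity_heavy)
```

Changing the weights changes only the aggregation step, which is one plausible reading of how a prompt-driven evaluator could adapt to different user priorities without new references.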

Title: Sequence-Level Analysis of Leakage Risk of Training Data in Large Language Models

Authors: Trishita Tiwari, G. Edward Suh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11302
Pdf URL: https://arxiv.org/pdf/2412.11302
Copy Paste: [[2412.11302]] Sequence-Level Analysis of Leakage Risk of Training Data in Large Language Models(https://arxiv.org/abs/2412.11302)
Keywords: language model, llm
Abstract: This work advocates for the use of sequence-level probabilities for quantifying the risk of extracting training data from Large Language Models (LLMs), as they provide much finer-grained information than has been previously obtained. We re-analyze the effects of decoding schemes, model size, prefix length, partial sequence leakages, and token positions to uncover new insights that were not possible in prior work due to its choice of metrics. We perform this study on two pre-trained models, LLaMa and OPT, trained on the Common Crawl and Pile respectively. We discover that 1) Extraction rate, the predominant metric used in prior quantification work, underestimates the threat of leakage of training data in randomized LLMs by as much as 2.14x. 2) Though, on average, larger models and longer prefixes can extract more data, this is not true for a substantial portion of individual sequences. 30.4-41.5% of our sequences are easier to extract with either shorter prefixes or smaller models. 3) Contrary to prior belief, partial leakage in commonly used decoding schemes like top-k and top-p is not easier than leaking verbatim training data. 4) Extracting later tokens in a sequence is as much as 912% easier than extracting earlier tokens. The insights gained from our analysis show that it is important to look at leakage of training data on a per-sequence basis.
摘要：这项研究主张使用序列级概率来量化从大型语言模型 (LLM) 中提取训练数据的风险，因为它们提供的信息比以前获得的更细粒度。我们重新分析了解码方案、模型大小、前缀长度、部分序列泄漏和标记位置的影响，以发现由于指标选择而无法在以前的工作中获得的新见解。我们对两个预先训练的模型 LLaMa 和 OPT 进行了这项研究，这两个模型分别在 Common Crawl 和 Pile 上进行了训练。我们发现 1) 提取率是之前量化工作中使用的主要指标，它低估了随机 LLM 中训练数据泄漏的威胁高达 2.14 倍。2) 虽然平均而言，更大的模型和更长的前缀可以提取更多数据，但对于很大一部分单个序列来说情况并非如此。30.4-41.5% 的序列更容易使用较短的前缀或较小的模型提取。 3) 与先前的看法相反，top-k 和 top-p 等常用解码方案中的部分泄漏并不比泄漏逐字训练数据更容易。4) 提取序列中较后的标记比提取较早的标记容易 912%。从我们的分析中获得的见解表明，按序列查看训练数据的泄漏非常重要。
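
To make the idea of a sequence-level probability concrete, here is a minimal sketch under stated assumptions. It uses the chain rule (the probability of a sequence is the product of its per-token probabilities under the decoding scheme) and assumes independent samples. The function names and the token probabilities are hypothetical, and the paper's exact estimator is not given in the abstract.

```python
import math

def sequence_log_prob(token_probs):
    """Chain rule: log P(sequence) is the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

def prob_extracted_at_least_once(seq_prob, num_samples):
    """Chance that at least one of n independent samples reproduces the sequence."""
    return 1.0 - (1.0 - seq_prob) ** num_samples

# Hypothetical per-token probabilities of a training suffix given its prefix.
token_probs = [0.9, 0.8, 0.95, 0.7]

seq_prob = math.exp(sequence_log_prob(token_probs))      # about 0.4788
two_sample_risk = prob_extracted_at_least_once(seq_prob, 2)  # about 0.7284
```

A per-sequence probability like `seq_prob` is a continuous quantity, whereas an extraction rate only records whether a single generation matched, which is one way to see why the two metrics can disagree for randomized decoding.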

Title: Reliable, Reproducible, and Really Fast Leaderboards with Evalica

Authors: Dmitry Ustalov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11314
Pdf URL: https://arxiv.org/pdf/2412.11314
Copy Paste: [[2412.11314]] Reliable, Reproducible, and Really Fast Leaderboards with Evalica(https://arxiv.org/abs/2412.11314)
Keywords: language model, llm
Abstract: The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), urges the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.
摘要：自然语言处理 (NLP) 技术（例如指令调优的大型语言模型 (LLM)）的快速发展，促使开发具有人机反馈的现代评估协议。我们介绍了 Evalica，这是一个开源工具包，有助于创建可靠且可重现的模型排行榜。本文介绍了它的设计，评估了它的性能，并通过其 Web 界面、命令行界面和 Python API 展示了它的可用性。

Title: RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation

Authors: Andrei-Marius Avram, Mircea Timpuriu, Andreea Iuga, Vlad-Cristian Matei, Iulian-Marius Tăiatu, Tudor Găină, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11317
Pdf URL: https://arxiv.org/pdf/2412.11317
Copy Paste: [[2412.11317]] RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation(https://arxiv.org/abs/2412.11317)
Keywords: language model
Abstract: Using supervised automatic summarisation methods requires sufficient corpora that include pairs of documents and their summaries. Similarly to many tasks in natural language processing, most of the datasets available for summarization are in English, posing challenges for developing summarization models in other languages. Thus, in this work, we introduce RoLargeSum, a novel large-scale summarization dataset for the Romanian language crawled from various publicly available news websites from Romania and the Republic of Moldova that were thoroughly cleaned to ensure a high-quality standard. RoLargeSum contains more than 615K news articles, together with their summaries, as well as their headlines, keywords, dialect, and other metadata that we found on the targeted websites. We further evaluated the performance of several BART variants and open-source large language models on RoLargeSum for benchmarking purposes. We manually evaluated the results of the best-performing system to gain insight into the potential pitfalls of this data set and future development.
摘要：使用监督式自动摘要方法需要足够的语料库，其中包括文档及其摘要的对。与自然语言处理中的许多任务类似，可用于摘要的大多数数据集都是英文的，这对开发其他语言的摘要模型提出了挑战。因此，在这项工作中，我们引入了 RoLargeSum，这是一个新的罗马尼亚语大规模摘要数据集，它从罗马尼亚和摩尔多瓦共和国的各种公开新闻网站中爬取而来，并经过彻底清理以确保高质量标准。RoLargeSum 包含超过 615K 篇新闻文章及其摘要，以及我们在目标网站上找到的标题、关键词、方言和其他元数据。我们进一步评估了几种 BART 变体和开源大型语言模型在 RoLargeSum 上的性能，以进行基准测试。我们手动评估了表现最佳的系统的结果，以深入了解该数据集的潜在缺陷和未来的发展。

Title: Generics are puzzling. Can language models find the missing piece?

Authors: Gustavo Cilleruelo Calderón, Emily Allaway, Barry Haddow, Alexandra Birch
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11318
Pdf URL: https://arxiv.org/pdf/2412.11318
Copy Paste: [[2412.11318]] Generics are puzzling. Can language models find the missing piece?(https://arxiv.org/abs/2412.11318)
Keywords: language model
Abstract: Generic sentences express generalisations about the world without explicit quantification. Although generics are central to everyday communication, building a precise semantic framework has proven difficult, in part because speakers use generics to generalise properties with widely different statistical prevalence. In this work, we study the implicit quantification and context-sensitivity of generics by leveraging language models as models of language. We create ConGen, a dataset of 2873 naturally occurring generic and quantified sentences in context, and define p-acceptability, a metric based on surprisal that is sensitive to quantification. Our experiments show generics are more context-sensitive than determiner quantifiers and about 20% of naturally occurring generics we analyze express weak generalisations. We also explore how human biases in stereotypes can be observed in language models.
摘要：通用句子表达了对世界的概括，但没有明确的量化。尽管通用句子是日常交流的核心，但建立一个精确的语义框架已被证明是困难的，部分原因是说话者使用通用句子来概括具有广泛统计普遍性的属性。在这项工作中，我们利用语言模型作为语言模型，研究通用句子的隐式量化和上下文敏感性。我们创建了 ConGen，这是一个包含 2873 个自然出现的通用句子和量化句子的数据集，并定义了 p 可接受性，这是一种基于对量化敏感的意外度的指标。我们的实验表明，通用句子比限定量词更具有上下文敏感性，我们分析的大约 20% 的自然出现的通用句子表达了较弱的概括。我们还探索了如何在语言模型中观察到人类刻板印象中的偏见。

Title: Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models

Authors: Xiaochen Zhu, Georgi Karadzhov, Chenxi Whitehouse, Andreas Vlachos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11333
Pdf URL: https://arxiv.org/pdf/2412.11333
Copy Paste: [[2412.11333]] Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models(https://arxiv.org/abs/2412.11333)
Keywords: language model
Abstract: Diffusion models have shown promise in text generation but often struggle with generating long, coherent, and contextually accurate text. Token-level diffusion overlooks word-order dependencies and enforces short output windows, while passage-level diffusion struggles with learning robust representations for long-form text. To address these challenges, we propose Segment-Level Diffusion (SLD), a framework that enhances diffusion-based text generation through text segmentation, robust representation training with adversarial and contrastive learning, and improved latent-space guidance. By segmenting long-form outputs into separate latent representations and decoding them with an autoregressive decoder, SLD simplifies diffusion predictions and improves scalability. Experiments on XSum, ROCStories, DialogSum, and DeliData demonstrate that SLD achieves competitive or superior performance in fluency, coherence, and contextual compatibility across automatic and human evaluation metrics compared with other diffusion and autoregressive baselines. Ablation studies further validate the effectiveness of our segmentation and representation learning strategies.
摘要：扩散模型在文本生成方面表现出了良好的前景，但通常难以生成长篇、连贯且上下文准确的文本。标记级扩散忽略了词序依赖性并强制使用短输出窗口，而段落级扩散则难以学习长篇文本的稳健表示。为了应对这些挑战，我们提出了段级扩散 (SLD)，这是一个通过文本分割、使用对抗和对比学习进行稳健表示训练以及改进的潜在空间指导来增强基于扩散的文本生成的框架。通过将长篇输出分割成单独的潜在表示并使用自回归解码器对其进行解码，SLD 简化了扩散预测并提高了可扩展性。在 XSum、ROCStories、DialogSum 和 DeliData 上进行的实验表明，与其他扩散和自回归基线相比，SLD 在自动和人工评估指标的流畅性、连贯性和上下文兼容性方面实现了具有竞争力或更优异的性能。消融研究进一步验证了我们的分割和表示学习策略的有效性。

Title: Can AI Extract Antecedent Factors of Human Trust in AI? An Application of Information Extraction for Scientific Literature in Behavioural and Computer Sciences

Authors: Melanie McGrath, Harrison Bailey, Necva Bölücü, Xiang Dai, Sarvnaz Karimi, Cecile Paris
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11344
Pdf URL: https://arxiv.org/pdf/2412.11344
Copy Paste: [[2412.11344]] Can AI Extract Antecedent Factors of Human Trust in AI? An Application of Information Extraction for Scientific Literature in Behavioural and Computer Sciences(https://arxiv.org/abs/2412.11344)
Keywords: language model, llm, prompt
Abstract: Information extraction from the scientific literature is one of the main techniques to transform unstructured knowledge hidden in the text into structured data which can then be used for decision-making in downstream tasks. One such area is Trust in AI, where factors contributing to human trust in artificial intelligence applications are studied. The relationships of these factors with human trust in such applications are complex. We hence explore this space through the lens of information extraction where, with the input of domain experts, we carefully design annotation guidelines, create the first annotated English dataset in this domain, investigate an LLM-guided annotation, and benchmark it with state-of-the-art methods using large language models in named entity and relation extraction. Our results indicate that this problem requires supervised learning which may not be currently feasible with prompt-based LLMs.
摘要：从科学文献中提取信息是将隐藏在文本中的非结构化知识转换为结构化数据的主要技术之一，然后可以使用结构化数据进行下游任务的决策。人工智能中的信任就是这样一个领域，该领域研究了影响人类对人工智能应用的信任的因素。这些因素与人类对此类应用的信任之间的关系非常复杂。因此，我们从信息提取的角度探索这一领域，在领域专家的帮助下，我们精心设计了注释指南，创建了该领域第一个带注释的英文数据集，研究了 LLM 指导的注释，并使用大型语言模型在命名实体和关系提取中使用最先进的方法对其进行了基准测试。我们的结果表明，这个问题需要监督学习，而目前基于提示的 LLM 可能无法实现。

Title: ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data

Authors: Chengsen Wang, Qi Qi, Jingyu Wang, Haifeng Sun, Zirui Zhuang, Jinming Wu, Lei Zhang, Jianxin Liao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11376
Pdf URL: https://arxiv.org/pdf/2412.11376
Copy Paste: [[2412.11376]] ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data(https://arxiv.org/abs/2412.11376)
Keywords: language model, chat
Abstract: Human experts typically integrate numerical and textual multimodal information to analyze time series. However, most traditional deep learning predictors rely solely on unimodal numerical data, using a fixed-length window for training and prediction on a single dataset, and cannot adapt to different scenarios. Powerful pre-trained large language models have introduced new opportunities for time series analysis. Yet, existing methods are either inefficient in training, incapable of handling textual information, or lack zero-shot forecasting capability. In this paper, we innovatively model time series as a foreign language and construct ChatTime, a unified framework for time series and text processing. As an out-of-the-box multimodal time series foundation model, ChatTime provides zero-shot forecasting capability and supports bimodal input/output for both time series and text. We design a series of experiments to verify the superior performance of ChatTime across multiple tasks and scenarios, and create four multimodal datasets to address data gaps. The experimental results demonstrate the potential and utility of ChatTime.
摘要：人类专家通常将数值和文本多模态信息整合起来进行时间序列分析。然而，大多数传统的深度学习预测器仅依赖于单模态数值数据，使用固定长度的窗口在单个数据集上进行训练和预测，无法适应不同的场景。强大的预训练大型语言模型为时间序列分析带来了新的机会。然而，现有的方法要么训练效率低下，要么无法处理文本信息，要么缺乏零样本预测能力。在本文中，我们创新地将时间序列建模为外语，并构建了 ChatTime，一个统一的时间序列和文本处理框架。作为一个开箱即用的多模态时间序列基础模型，ChatTime 提供零样本预测能力，并支持时间序列和文本的双模态输入/输出。我们设计了一系列实验来验证 ChatTime 在多个任务和场景中的卓越性能，并创建了四个多模态数据集来弥补数据缺口。实验结果证明了 ChatTime 的潜力和实用性。

Title: Why Does ChatGPT "Delve" So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models

Authors: Tom S. Juzek, Zina B. Ward
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11385
Pdf URL: https://arxiv.org/pdf/2412.11385
Copy Paste: [[2412.11385]] Why Does ChatGPT "Delve" So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models(https://arxiv.org/abs/2412.11385)
Keywords: language model, gpt, llm, chat
Abstract: Scientific English is currently undergoing rapid change, with words like "delve," "intricate," and "underscore" appearing far more frequently than just a few years ago. It is widely assumed that scientists' use of large language models (LLMs) is responsible for such trends. We develop a formal, transferable method to characterize these linguistic changes. Application of our method yields 21 focal words whose increased occurrence in scientific abstracts is likely the result of LLM usage. We then pose "the puzzle of lexical overrepresentation": WHY are such words overused by LLMs? We fail to find evidence that lexical overrepresentation is caused by model architecture, algorithm choices, or training data. To assess whether reinforcement learning from human feedback (RLHF) contributes to the overuse of focal words, we undertake comparative model testing and conduct an exploratory online study. While the model testing is consistent with RLHF playing a role, our experimental results suggest that participants may be reacting differently to "delve" than to other focal words. With LLMs quickly becoming a driver of global language change, investigating these potential sources of lexical overrepresentation is important. We note that while insights into the workings of LLMs are within reach, a lack of transparency surrounding model development remains an obstacle to such research.
摘要：科学英语目前正在经历快速变化，像“delve”、“intricate”和“underscore”这样的词出现的频率远高于几年前。人们普遍认为，科学家使用大型语言模型 (LLM) 是造成这种趋势的原因。我们开发了一种正式的、可转移的方法来描述这些语言变化。应用我们的方法产生了 21 个焦点词，这些词在科学摘要中的出现率增加可能是 LLM 使用的结果。然后我们提出了“词汇过度表达之谜”：为什么这些词被 LLM 过度使用？我们未能找到证据表明词汇过度表达是由模型架构、算法选择或训练数据引起的。为了评估强化学习从人类反馈 (RLHF) 是否导致焦点词的过度使用，我们进行了比较模型测试并进行了探索性在线研究。虽然模型测试与 RLHF 发挥作用一致，但我们的实验结果表明参与者对“delve”的反应可能与其他焦点词不同。随着 LLM 迅速成为全球语言变革的推动力，调查词汇过度表达的潜在来源非常重要。我们注意到，虽然了解 LLM 的运作方式是触手可及的，但模型开发缺乏透明度仍然是此类研究的障碍。

Title: INTERACT: Enabling Interactive, Question-Driven Learning in Large Language Models

Authors: Aum Kendapadi, Kerem Zaman, Rakesh R. Menon, Shashank Srivastava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11388
Pdf URL: https://arxiv.org/pdf/2412.11388
Copy Paste: [[2412.11388]] INTERACT: Enabling Interactive, Question-Driven Learning in Large Language Models(https://arxiv.org/abs/2412.11388)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at answering questions but remain passive learners, absorbing static data without the ability to question and refine knowledge. This paper explores how LLMs can transition to interactive, question-driven learning through student-teacher dialogues. We introduce INTERACT (INTERactive Learning for Adaptive Concept Transfer), a framework in which a "student" LLM engages a "teacher" LLM through iterative inquiries to acquire knowledge across 1,347 contexts, including song lyrics, news articles, movie plots, academic papers, and images. Our experiments show that across a wide range of scenarios and LLM architectures, interactive learning consistently enhances performance, achieving up to a 25% improvement, with "cold-start" student models matching static learning baselines in as few as five dialogue turns. Interactive setups can also mitigate the disadvantages of weaker teachers, showcasing the robustness of question-driven learning.
摘要：大型语言模型 (LLM) 擅长回答问题，但仍然是被动学习者——吸收静态数据，没有质疑和提炼知识的能力。本文探讨了 LLM 如何通过师生对话过渡到交互式、问题驱动的学习。我们引入了 INTERACT（自适应概念转移的交互式学习），这是一个框架，其中“学生”LLM 通过迭代查询与“老师”LLM 互动，以在 1,347 个上下文中获取知识，包括歌词、新闻文章、电影情节、学术论文和图像。我们的实验表明，在各种场景和 LLM 架构中，交互式学习可以持续提高性能，实现高达 25% 的改进，而“冷启动”学生模型在短短五轮对话中就能匹配静态学习基线。交互式设置还可以减轻较弱教师的劣势，展示问题驱动学习的稳健性。

Title: Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws

Authors: Akshita Jha, Sanchit Kabra, Chandan K. Reddy
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11414
Pdf URL: https://arxiv.org/pdf/2412.11414
Copy Paste: [[2412.11414]] Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws(https://arxiv.org/abs/2412.11414)
Keywords: language model
Abstract: Recent studies have shown that generative language models often reflect and amplify societal biases in their outputs. However, these studies frequently conflate observed biases with other task-specific shortcomings, such as comprehension failure. For example, when a model misinterprets a text and produces a response that reinforces a stereotype, it becomes difficult to determine whether the issue arises from inherent bias or from a misunderstanding of the given content. In this paper, we conduct a multi-faceted evaluation that distinctly disentangles bias from flaws within the reading comprehension task. We propose a targeted stereotype mitigation framework that implicitly mitigates observed stereotypes in generative models through instruction-tuning on general-purpose datasets. We reduce stereotypical outputs by over 60% across multiple dimensions (including nationality, age, gender, disability, and physical appearance) by addressing comprehension-based failures, and without relying on explicit debiasing techniques. We evaluate several state-of-the-art generative models to demonstrate the effectiveness of our approach while maintaining the overall utility. Our findings highlight the need to critically disentangle the concept of 'bias' from other types of errors to build more targeted and effective mitigation strategies. CONTENT WARNING: Some examples contain offensive stereotypes.
摘要：最近的研究表明，生成式语言模型的输出结果通常会反映和放大社会偏见。然而，这些研究经常将观察到的偏见与其他特定于任务的缺陷（例如理解失败）混为一谈。例如，当模型误解文本并产生强化刻板印象的响应时，很难确定问题是源于固有偏见还是源于对给定内容的误解。在本文中，我们进行了多方面的评估，将偏见与阅读理解任务中的缺陷明显区分开来。我们提出了一个有针对性的刻板印象缓解框架，该框架通过对通用数据集进行指令调整来隐性缓解生成模型中观察到的刻板印象。我们通过解决基于理解的失败问题，而无需依赖明确的去偏见技术，将多个维度（包括国籍、年龄、性别、残疾和外貌）的刻板印象输出减少了 60% 以上。我们评估了几种最先进的生成模型，以证明我们的方法在保持整体效用的同时具有有效性。我们的研究结果强调，需要严格区分“偏见”概念与其他类型的错误，以建立更有针对性和更有效的缓解策略。内容警告：一些示例包含令人反感的刻板印象。

Title: ConceptEdit: Conceptualization-Augmented Knowledge Editing in Large Language Models for Commonsense Reasoning

Authors: Liyu Zhang, Weiqi Wang, Tianqing Fang, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11418
Pdf URL: https://arxiv.org/pdf/2412.11418
Copy Paste: [[2412.11418]] ConceptEdit: Conceptualization-Augmented Knowledge Editing in Large Language Models for Commonsense Reasoning(https://arxiv.org/abs/2412.11418)
Keywords: language model, llm
Abstract: Knowledge Editing (KE) aims to adjust a Large Language Model's (LLM) internal representations and parameters to correct inaccuracies and improve output consistency without incurring the computational expense of re-training the entire model. However, editing commonsense knowledge still faces difficulties, including limited knowledge coverage in existing resources, the infeasibility of annotating labels for an overabundance of commonsense knowledge, and the strict knowledge formats of current editing methods. In this paper, we address these challenges by presenting ConceptEdit, a framework that integrates conceptualization and instantiation into the KE pipeline for LLMs to enhance their commonsense reasoning capabilities. ConceptEdit dynamically diagnoses implausible commonsense knowledge within an LLM using another verifier LLM and augments the source knowledge to be edited with conceptualization for stronger generalizability. Experimental results demonstrate that LLMs enhanced with ConceptEdit successfully generate commonsense knowledge with improved plausibility compared to other baselines and achieve stronger performance across multiple question answering benchmarks.
摘要：知识编辑 (KE) 旨在调整大型语言模型 (LLM) 的内部表示和参数，以纠正不准确性并提高输出一致性，而无需承担重新训练整个模型的计算成本。然而，编辑常识性知识仍然面临困难，包括现有资源中知识覆盖范围有限、无法为过多的常识性知识标注标签以及当前编辑方法的知识格式严格。在本文中，我们通过提出 ConceptEdit 来解决这些挑战，该框架将概念化和实例化集成到 LLM 的 KE 管道中，以增强其常识性推理能力。ConceptEdit 使用另一个验证器 LLM 动态诊断 LLM 中不合理的常识性知识，并通过概念化增强要编辑的源知识，以提高通用性。实验结果表明，与其他基线相比，使用 ConceptEdit 增强的 LLM 成功生成了具有更高可信度的常识性知识，并在多个问答基准中实现了更强大的性能。

Title: Optimized Quran Passage Retrieval Using an Expanded QA Dataset and Fine-Tuned Language Models

Authors: Mohamed Basem, Islam Oshallah, Baraa Hikal, Ali Hamdi, Ammar Mohamed
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2412.11431
Pdf URL: https://arxiv.org/pdf/2412.11431
Copy Paste: [[2412.11431]] Optimized Quran Passage Retrieval Using an Expanded QA Dataset and Fine-Tuned Language Models(https://arxiv.org/abs/2412.11431)
Keywords: language model
Abstract: Understanding the deep meanings of the Qur'an and bridging the language gap between modern standard Arabic and classical Arabic is essential to improve the question-and-answer system for the Holy Qur'an. The Qur'an QA 2023 shared task dataset had a limited number of questions with weak model retrieval. To address this challenge, this work updated the original dataset and improved the model accuracy. The original dataset, which contains 251 questions, was reviewed and expanded to 629 questions with question diversification and reformulation, leading to a comprehensive set of 1895 categorized into single-answer, multi-answer, and zero-answer types. Extensive experiments were conducted with fine-tuned transformer models, including AraBERT, RoBERTa, CAMeLBERT, AraELECTRA, and BERT. The best model, AraBERT-base, achieved a MAP@10 of 0.36 and MRR of 0.59, representing improvements of 63% and 59%, respectively, compared to the baseline scores (MAP@10: 0.22, MRR: 0.37). Additionally, the dataset expansion led to improvements in handling "no answer" cases, with the proposed approach achieving a 75% success rate for such instances, compared to the baseline's 25%. These results demonstrate the effect of dataset improvement and model architecture optimization in increasing the performance of QA systems for the Holy Qur'an, with higher accuracy, recall, and precision.
摘要：理解《古兰经》的深层含义并弥合现代标准阿拉伯语和古典阿拉伯语之间的语言鸿沟对于改进《古兰经》问答系统至关重要。《古兰经》问答 2023 共享任务数据集的问题数量有限，模型检索能力较弱。为了应对这一挑战，这项工作更新了原始数据集并提高了模型准确性。原始数据集包含 251 个问题，经过审查后扩展到 629 个问题，问题多样化和重新表述，最终形成了一套全面的 1895 个问题，分为单答案、多答案和零答案类型。大量实验对 Transformer 模型进行了微调，包括 AraBERT、RoBERTa、CAMeLBERT、AraELECTRA 和 BERT。最佳模型 AraBERT-base 的 MAP@10 为 0.36，MRR 为 0.59，与基线分数（MAP@10：0.22，MRR：0.37）相比，分别提高了 63% 和 59%。此外，数据集扩展还提高了处理“无答案”情况的能力，所提出的方法对此类情况的成功率为 75%，而基线的成功率为 25%。这些结果证明了数据集改进和模型架构优化在提高《古兰经》问答系统性能方面的作用，提高了准确率、召回率和精确率。
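
The reported gains can be checked with simple arithmetic. The sketch below computes the relative improvement from the baseline and best scores quoted in the abstract, and shows one common definition of mean reciprocal rank (MRR). The function names and the toy relevance lists are hypothetical, and the shared task may define its metrics with details not stated here.

```python
def relative_improvement(baseline, new):
    """Relative gain over the baseline, in percent."""
    return (new - baseline) / baseline * 100.0

def mean_reciprocal_rank(ranked_relevance):
    """MRR over queries; each query is a rank-ordered list of relevance flags.

    A query with no relevant passage in its list contributes 0.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

# Scores quoted in the abstract.
map_gain = relative_improvement(0.22, 0.36)  # about 63.6 percent
mrr_gain = relative_improvement(0.37, 0.59)  # about 59.5 percent

# Toy example (hypothetical data): first hit at rank 2, rank 1, and no hit.
toy_mrr = mean_reciprocal_rank([[False, True], [True], [False, False, False]])  # 0.5
```

The computed gains of roughly 63.6% and 59.5% agree with the 63% and 59% stated in the abstract when rounded down.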

Title: ACE-$M^3$: Automatic Capability Evaluator for Multimodal Medical Models

Authors: Xiechi Zhang, Shunfan Zheng, Linlin Wang, Gerard de Melo, Zhu Cao, Xiaoling Wang, Liang He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11453
Pdf URL: https://arxiv.org/pdf/2412.11453
Copy Paste: [[2412.11453]] ACE-$M^3$: Automatic Capability Evaluator for Multimodal Medical Models(https://arxiv.org/abs/2412.11453)
Keywords: language model, llm
Abstract: As multimodal large language models (MLLMs) gain prominence in the medical field, the need for precise evaluation methods to assess their effectiveness has become critical. While benchmarks provide a reliable means to evaluate the capabilities of MLLMs, traditional metrics like ROUGE and BLEU employed for open domain evaluation only focus on token overlap and may not align with human judgment. Although human evaluation is more reliable, it is labor-intensive, costly, and not scalable. LLM-based evaluation methods have proven promising, but to date, there is still an urgent need for open-source multimodal LLM-based evaluators in the medical field. To address this issue, we introduce ACE-$M^3$, an open-source Automatic Capability Evaluator for Multimodal Medical Models specifically designed to assess the question answering abilities of medical MLLMs. It first utilizes a branch-merge architecture to provide both detailed analysis and a concise final score based on standard medical evaluation criteria. Subsequently, a reward token-based direct preference optimization (RTDPO) strategy is incorporated to save training time without compromising performance of our model. Extensive experiments have demonstrated the effectiveness of our ACE-$M^3$ model in evaluating the capabilities of medical MLLMs.
摘要：随着多模态大型语言模型 (MLLM) 在医学领域日益受到重视，对精确评估方法来评估其有效性的需求变得至关重要。虽然基准测试提供了一种可靠的方法来评估 MLLM 的能力，但用于开放域评估的传统指标（如 ROUGE 和 BLEU）仅关注标记重叠，可能与人类判断不一致。虽然人工评估更可靠，但它是劳动密集型的、成本高昂的且不可扩展。基于 LLM 的评估方法已被证明很有前景，但到目前为止，医学领域仍然迫切需要开源多模态 LLM 评估器。为了解决这个问题，我们引入了 ACE-$M^3$，这是一个开源的多模态医学模型自动能力评估器 (Automatic Capability Evaluator for Multimodal Medical Models)，专门用于评估医学 MLLM 的问答能力。它首先利用分支合并架构提供详细分析和基于标准医疗评估标准的简洁最终评分。随后，结合基于奖励令牌的直接偏好优化 (RTDPO) 策略来节省训练时间，同时不影响我们模型的性能。大量实验证明了我们的 ACE-$M^3$ 模型在评估医疗 MLLM 能力方面的有效性。

Title: Towards Better Multi-task Learning: A Framework for Optimizing Dataset Combinations in Large Language Models

Authors: Zaifu Zhan, Rui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11455
Pdf URL: https://arxiv.org/pdf/2412.11455
Copy Paste: [[2412.11455]] Towards Better Multi-task Learning: A Framework for Optimizing Dataset Combinations in Large Language Models(https://arxiv.org/abs/2412.11455)
Keywords: language model
Abstract: To efficiently select optimal dataset combinations for enhancing multi-task learning (MTL) performance in large language models, we propose a novel framework that leverages a neural network to predict the best dataset combinations. The framework iteratively refines the selection, greatly improving efficiency, while being model-, dataset-, and domain-independent. Through experiments on 12 biomedical datasets across four tasks (named entity recognition, relation extraction, event extraction, and text classification), we demonstrate that our approach effectively identifies better combinations, even for tasks that may seem unpromising from a human perspective. This verifies that our framework provides a promising solution for maximizing MTL potential.
摘要：为了高效地选择最佳数据集组合，以增强大型语言模型中的多任务学习 (MTL) 性能，我们提出了一个新颖的框架，利用神经网络来预测最佳数据集组合。该框架迭代地优化选择，大大提高了效率，同时独立于模型、数据集和领域。通过对 12 个生物医学数据集进行四项任务（命名实体识别、关系提取、事件提取和文本分类）的实验，我们证明了我们的方法可以有效地识别更好的组合，即使对于从人类角度来看似乎没有希望的任务也是如此。这证明了我们的框架为最大化 MTL 潜力提供了一种有希望的解决方案。

Title: Understanding Knowledge Hijack Mechanism in In-context Learning through Associative Memory

Authors: Shuo Wang, Issei Sato
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11459
Pdf URL: https://arxiv.org/pdf/2412.11459
Copy Paste: [[2412.11459]] Understanding Knowledge Hijack Mechanism in In-context Learning through Associative Memory(https://arxiv.org/abs/2412.11459)
Keywords: language model, llm, prompt
Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without fine-tuning by leveraging contextual information provided within a prompt. However, ICL relies not only on contextual clues but also on the global knowledge acquired during pretraining for the next token prediction. Analyzing this process has been challenging due to the complex computational circuitry of LLMs. This paper investigates the balance between in-context information and pretrained bigram knowledge in token prediction, focusing on the induction head mechanism, a key component in ICL. Leveraging the fact that a two-layer transformer can implement the induction head mechanism with associative memories, we theoretically analyze the logits when a two-layer transformer is given prompts generated by a bigram model. In the experiments, we design specific prompts to evaluate whether the outputs of a two-layer transformer align with the theoretical results.
摘要：上下文学习 (ICL) 利用提示中提供的上下文信息，使大型语言模型 (LLM) 无需微调即可适应新任务。然而，ICL 不仅依赖于上下文线索，还依赖于预训练期间获得的全局知识来进行下一个 token 预测。由于 LLM 的计算电路复杂，分析这个过程一直很困难。本文研究了 token 预测中上下文信息和预训练二元语法知识之间的平衡，重点研究了 ICL 中的关键组件感应头机制。利用双层 Transformer 可以用联想记忆实现感应头机制的事实，我们从理论上分析了当双层 Transformer 被给予由二元语法模型生成的提示时的逻辑。在实验中，我们设计了特定的提示来评估双层 Transformer 的输出是否与理论结果一致。

Title: FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing

Authors: Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11494
Pdf URL: https://arxiv.org/pdf/2412.11494
Copy Paste: [[2412.11494]] FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing(https://arxiv.org/abs/2412.11494)
Keywords: language model, gpt, llm
Abstract: Recently, large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws, which significantly increase model size. However, the huge computation overhead during inference hinders deployment in industrial applications. Many works leverage traditional compression approaches to boost model inference, but these always introduce additional training costs to restore performance, and the pruning results typically show noticeable performance drops compared to the original model when aiming for a specific level of acceleration. To address these issues, we propose a fine-grained token-wise pruning approach for LLMs, which presents a learnable router to adaptively identify the less important tokens and skip them across model blocks to reduce computational cost during inference. To construct the router efficiently, we present a search-based sparsity scheduler for pruning sparsity allocation, and a trainable router that takes our four proposed low-dimensional factors as input and is trained with three proposed losses. We conduct extensive experiments across different benchmarks on different LLMs to demonstrate the superiority of our method. Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods. For instance, our method outperforms BlockPruner and ShortGPT by approximately 10 points on both LLaMA2-7B and Qwen1.5-7B in accuracy retention at comparable token sparsity levels.
摘要：最近，大型语言模型 (LLM) 通过遵循缩放定律（这显著增加了模型大小）在各种任务中表现出色。然而，推理过程中巨大的计算开销阻碍了工业应用中的部署。许多工作利用传统的压缩方法来增强模型推理，但这些方法总是会引入额外的训练成本来恢复性能，并且当以特定的加速水平为目标时，修剪结果通常会显示出与原始模型相比明显的性能下降。为了解决这些问题，我们为 LLM 提出了一种细粒度的 token 修剪方法，该方法提出了一个可学习的路由器，可以自适应地识别不太重要的 token 并跳过它们跨模型块以降低推理过程中的计算成本。为了有效地构建路由器，我们提出了一个基于搜索的稀疏度调度程序来修剪稀疏度分配，一个可训练的路由器结合了我们提出的四个低维因子作为输入和三个提出的损失。我们在不同的 LLM 上对不同的基准进行了广泛的实验，以证明我们方法的优越性。我们的方法实现了最先进的 (SOTA) 剪枝结果，超越了其他现有剪枝方法。例如，在相当的 token 稀疏度水平下，我们的方法在 LLaMA2-7B 和 Qwen1.5-7B 上的准确率保持率比 BlockPruner 和 ShortGPT 高出约 10 分。

Title: Glimpse: Enabling White-Box Methods to Use Proprietary Models for Zero-Shot LLM-Generated Text Detection

Authors: Guangsheng Bao, Yanbin Zhao, Juncai He, Yue Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11506
Pdf URL: https://arxiv.org/pdf/2412.11506
Copy Paste: [[2412.11506]] Glimpse: Enabling White-Box Methods to Use Proprietary Models for Zero-Shot LLM-Generated Text Detection(https://arxiv.org/abs/2412.11506)
Keywords: language model, gpt, llm
Abstract: Advanced large language models (LLMs) can generate text almost indistinguishable from human-written text, highlighting the importance of LLM-generated text detection. However, current zero-shot techniques face challenges as white-box methods are restricted to use weaker open-source LLMs, and black-box methods are limited by partial observation from stronger proprietary LLMs. It seems impossible to enable white-box methods to use proprietary models because API-level access to the models neither provides full predictive distributions nor inner embeddings. To traverse the divide, we propose Glimpse, a probability distribution estimation approach, predicting the full distributions from partial observations. Despite the simplicity of Glimpse, we successfully extend white-box methods like Entropy, Rank, Log-Rank, and Fast-DetectGPT to latest proprietary models. Experiments show that Glimpse with Fast-DetectGPT and GPT-3.5 achieves an average AUROC of about 0.95 in five latest source models, improving the score by 51% relative to the remaining space of the open source baseline (Table 1). It demonstrates that the latest LLMs can effectively detect their own outputs, suggesting that advanced LLMs may be the best shield against themselves.
摘要：高级大型语言模型 (LLM) 可以生成几乎与人类书写的文本难以区分的文本，这凸显了 LLM 生成的文本检测的重要性。然而，当前的零样本技术面临挑战，因为白盒方法仅限于使用较弱的开源 LLM，而黑盒方法则受到较强的专有 LLM 的部分观察的限制。似乎不可能让白盒方法使用专有模型，因为对模型的 API 级别访问既不提供完整的预测分布也不提供内部嵌入。为了跨越这一鸿沟，我们提出了 Glimpse，这是一种概率分布估计方法，可以从部分观察中预测完整分布。尽管 Glimpse 很简单，但我们成功地将 Entropy、Rank、Log-Rank 和 Fast-DetectGPT 等白盒方法扩展到最新的专有模型。实验表明，使用 Fast-DetectGPT 和 GPT-3.5 的 Glimpse 在五个最新源模型中实现了约 0.95 的平均 AUROC，相对于开源基线的剩余空间将分数提高了 51%（表 1）。这表明最新的 LLM 可以有效地检测自己的输出，这表明高级 LLM 可能是抵御自身攻击的最佳盾牌。

Title: DART: An AIGT Detector using AMR of Rephrased Text

Authors: Hyeonchu Park, Byungjun Kim, Bugeun Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11517
Pdf URL: https://arxiv.org/pdf/2412.11517
Copy Paste: [[2412.11517]] DART: An AIGT Detector using AMR of Rephrased Text(https://arxiv.org/abs/2412.11517)
Keywords: language model, llm
Abstract: As large language models (LLMs) generate more human-like texts, concerns about the side effects of AI-generated texts (AIGT) have grown. So, researchers have developed methods for detecting AIGT. However, two challenges remain. First, the performance on detecting black-box LLMs is low, because existing models have focused on syntactic features. Second, most AIGT detectors have been tested on a single-candidate setting, which assumes that we know the origin of an AIGT and may deviate from the real-world scenario. To resolve these challenges, we propose DART, which consists of four steps: rephrasing, semantic parsing, scoring, and multiclass classification. We conducted several experiments to test the performance of DART by following previous work. The experimental result shows that DART can discriminate multiple black-box LLMs without using syntactic features and knowing the origin of AIGT.
摘要：随着大型语言模型 (LLM) 生成的文本越来越像人类，人们对人工智能生成文本 (AIGT) 副作用的担忧也与日俱增。因此，研究人员开发了检测 AIGT 的方法。然而，仍然存在两个挑战。首先，检测黑盒 LLM 的性能较低，因为现有模型侧重于句法特征。其次，大多数 AIGT 检测器都是在单一候选设置上进行测试的，这假设我们知道 AIGT 的来源，并且可能偏离真实世界场景。为了解决这些挑战，我们提出了 DART，它包括四个步骤：改写、语义解析、评分和多类分类。我们按照以前的工作进行了几次实验来测试 DART 的性能。实验结果表明，DART 可以在不使用句法特征和不知道 AIGT 来源的情况下区分多个黑盒 LLM。

Title: Let your LLM generate a few tokens and you will reduce the need for retrieval

Authors: Hervé Déjean
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11536
Pdf URL: https://arxiv.org/pdf/2412.11536
Copy Paste: [[2412.11536]] Let your LLM generate a few tokens and you will reduce the need for retrieval(https://arxiv.org/abs/2412.11536)
Keywords: language model, llm
Abstract: In this paper, we investigate how efficiently large language models (LLM) can be trained to check whether an answer is already stored in their parametric memory. We distill an LLM-as-a-judge to compute the IK (I Know) score. We found that this method is particularly beneficial in the context of retrieval-assisted augmented generation (RAG), with a respectable accuracy of 80%. It enables a significant reduction (more than 50%) in the number of search and reranking steps required for certain data sets. We have also introduced the IK score, which serves as a useful tool for characterising datasets by facilitating the classification task. Interestingly, through the inclusion of response tokens as input, our results suggest that only about 20,000 training samples are required to achieve good performance. The central element of this work is the use of a teacher model - the LLM as a judge - to generate training data. We also assess the robustness of the IK classifier by evaluating it with various types of teachers, including both string-based methods and LLMs, with the latter providing better results.
摘要：在本文中，我们研究了如何高效地训练大型语言模型 (LLM) 来检查答案是否已存储在其参数内存中。我们提炼了一个 LLM-as-a-judge 来计算 IK（我知道）分数。我们发现这种方法在检索辅助增强生成 (RAG) 的背景下特别有用，准确率高达 80%。它可以显著减少（超过 50%）某些数据集所需的搜索和重新排序步骤的数量。我们还引入了 IK 分数，它是一种通过促进分类任务来表征数据集的有用工具。有趣的是，通过将响应标记作为输入，我们的结果表明，仅需大约 20,000 个训练样本即可获得良好的性能。这项工作的核心要素是使用教师模型（LLM 作为判断者）来生成训练数据。我们还通过使用各种类型的教师（包括基于字符串的方法和 LLM）对 IK 分类器进行评估来评估其稳健性，其中后者提供更好的结果。
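
Sketch (not from the paper): the retrieval-gating idea in the abstract above can be illustrated as follows. The function names, the 0.5 threshold, and the toy scorer are all hypothetical stand-ins; the paper's IK classifier is a distilled model whose inputs and decision rule are not specified here.

```python
from typing import Callable, List


def answer_with_optional_retrieval(
    question: str,
    ik_score: Callable[[str], float],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> str:
    """Skip search and reranking when the IK score says the model already knows."""
    if ik_score(question) >= threshold:
        return generate(question, [])  # answer from parametric memory only
    return generate(question, retrieve(question))  # fall back to retrieval


# Toy stand-ins so the sketch runs end to end.
retrieval_calls: List[str] = []


def toy_ik_score(question: str) -> float:
    return 0.9 if "capital of France" in question else 0.1


def toy_retrieve(question: str) -> List[str]:
    retrieval_calls.append(question)
    return ["(retrieved passage)"]


def toy_generate(question: str, passages: List[str]) -> str:
    return f"answer using {len(passages)} passage(s)"


known = answer_with_optional_retrieval(
    "What is the capital of France?", toy_ik_score, toy_retrieve, toy_generate
)
unknown = answer_with_optional_retrieval(
    "Who chaired the committee last year?", toy_ik_score, toy_retrieve, toy_generate
)
```

Only the second question triggers a retrieval call, which is the kind of saving the abstract reports for some datasets.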

Title: Towards a Speech Foundation Model for Singapore and Beyond

Authors: Muhammad Huzaifah, Tianchi Liu, Hardik B. Sailor, Kye Min Tan, Tarun K. Vangani, Qiongqiong Wang, Jeremy H. M. Wong, Nancy F. Chen, Ai Ti Aw
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2412.11538
Pdf URL: https://arxiv.org/pdf/2412.11538
Copy Paste: [[2412.11538]] Towards a Speech Foundation Model for Singapore and Beyond(https://arxiv.org/abs/2412.11538)
Keywords: language model
Abstract: This technical report describes the MERaLiON Speech Encoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON Speech Encoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore. We are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON Speech Encoder was pre-trained from scratch on 200K hours of unlabelled speech data using a self-supervised learning approach based on masked language modelling. We describe our training procedure and hyperparameter tuning experiments in detail below. Our evaluation demonstrates improvements to spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive to other state-of-the-art speech encoders across ten other speech tasks. We commit to releasing our model, supporting broader research endeavours, both in Singapore and beyond.
摘要：本技术报告介绍了 MERaLiON 语音编码器，这是一种基础模型，旨在支持广泛的下游语音应用。MERaLiON 语音编码器是新加坡国家多模态大型语言模型计划的一部分，专门用于满足新加坡和周边东南亚地区的语音处理需求。该模型目前主要支持英语，包括新加坡使用的多种语言。我们正在积极扩展我们的数据集，以在后续版本中逐步覆盖其他语言。MERaLiON 语音编码器从头开始在 200K 小时未标记语音数据上进行预训练，使用基于掩码语言建模的自监督学习方法。我们将在下面详细描述我们的训练过程和超参数调整实验。我们的评估表明，自发语音和新加坡语音基准在语音识别方面有所改进，同时在其他十个语音任务中仍与其他最先进的语音编码器保持竞争力。我们致力于发布我们的模型，支持新加坡及其他地区更广泛的研究工作。

Title: Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs

Authors: Yuchen Fu, Zifeng Cheng, Zhiwei Jiang, Zhonghui Wang, Yafeng Yin, Zhengliang Li, Qing Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11556
Pdf URL: https://arxiv.org/pdf/2412.11556
Copy Paste: [[2412.11556]] Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs(https://arxiv.org/abs/2412.11556)
Keywords: language model, llm, prompt
Abstract: Extracting sentence embeddings from large language models (LLMs) is a promising direction, as LLMs have demonstrated stronger semantic understanding capabilities. Previous studies typically focus on prompt engineering to elicit sentence embeddings from LLMs by prompting the model to encode sentence information into the embedding of the last token. However, LLMs are mostly decoder-only models with causal attention and the earlier tokens in the sentence cannot attend to the latter tokens, resulting in biased encoding of sentence information and cascading effects on the final decoded token. To this end, we propose a novel Token Prepending (TP) technique that prepends each layer's decoded sentence embedding to the beginning of the sentence in the next layer's input, allowing earlier tokens to attend to the complete sentence information under the causal attention mechanism. The proposed TP technique is a plug-and-play and training-free technique, which means it can be seamlessly integrated with various prompt-based sentence embedding methods and autoregressive LLMs. Extensive experiments on various Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our proposed TP technique can significantly improve the performance of existing prompt-based sentence embedding methods across different LLMs, while incurring negligible additional inference cost.
摘要：从大型语言模型 (LLM) 中提取句子嵌入是一个很有前途的方向，因为 LLM 已经表现出更强的语义理解能力。以前的研究通常侧重于提示工程，通过提示模型将句子信息编码到最后一个标记的嵌入中来从 LLM 中引出句子嵌入。然而，LLM 大多是具有因果注意的仅解码器模型，句子中较早的标记无法关注后面的标记，导致句子信息编码有偏差，并对最终解码的标记产生级联效应。为此，我们提出了一种新颖的标记前置 (TP) 技术，将每一层的解码句子嵌入前置到下一层输入中的句子开头，让较早的标记在因果注意机制下关注完整的句子信息。所提出的 TP 技术是一种即插即用且无需训练的技术，这意味着它可以与各种基于提示的句子嵌入方法和自回归 LLM 无缝集成。在各种语义文本相似性 (STS) 任务和下游分类任务上进行的大量实验表明，我们提出的 TP 技术可以显着提高现有基于提示的句子嵌入方法在不同 LLM 中的性能，同时几乎不会产生额外的推理成本。

Title: The Role of Natural Language Processing Tasks in Automatic Literary Character Network Construction

Authors: Arthur Amalvy (LIA), Vincent Labatut (LIA), Richard Dufour (LS2N - équipe TALN)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11560
Pdf URL: https://arxiv.org/pdf/2412.11560
Copy Paste: [[2412.11560]] The Role of Natural Language Processing Tasks in Automatic Literary Character Network Construction(https://arxiv.org/abs/2412.11560)
Keywords: language model, llm
Abstract: The automatic extraction of character networks from literary texts is generally carried out using natural language processing (NLP) cascading pipelines. While this approach is widespread, no study exists on the impact of low-level NLP tasks on their performance. In this article, we conduct such a study on a literary dataset, focusing on the role of named entity recognition (NER) and coreference resolution when extracting co-occurrence networks. To highlight the impact of these tasks' performance, we start with gold-standard annotations, progressively add uniformly distributed errors, and observe their impact in terms of character network quality. We demonstrate that NER performance depends on the tested novel and strongly affects character detection. We also show that NER-detected mentions alone miss a lot of character co-occurrences, and that coreference resolution is needed to prevent this. Finally, we present comparison points with 2 methods based on large language models (LLMs), including a fully end-to-end one, and show that these models are outperformed by traditional NLP pipelines in terms of recall.
摘要：从文学文本中自动提取字符网络通常使用自然语言处理 (NLP) 级联管道进行。虽然这种方法很普遍，但还没有关于低级 NLP 任务对其性能影响的研究。在本文中，我们对文学数据集进行了这样的研究，重点关注命名实体识别 (NER) 和共指解析在提取共现网络时的作用。为了突出这些任务性能的影响，我们从黄金标准注释开始，逐步添加均匀分布的错误，并观察它们对字符网络质量的影响。我们证明 NER 性能取决于测试的小说，并强烈影响字符检测。我们还表明，仅 NER 检测到的提及会错过很多字符共现，而共指解析是防止这种情况发生的必要条件。最后，我们展示了基于大型语言模型 (LLM) 的 2 种方法的比较点，包括一个完全端到端的方法，并表明这些模型在召回率方面优于传统的 NLP 管道。
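
Sketch (not from the paper): a minimal sentence-level co-occurrence network, to illustrate why mentions found by NER alone can miss edges that coreference resolution recovers. The sentence-level window, the character names, and the mention lists are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence_network(
    mentions_per_sentence: Iterable[List[str]],
) -> Counter:
    """Count, for each pair of characters, the sentences that mention both."""
    edges: Counter = Counter()
    for mentions in mentions_per_sentence:
        # De-duplicate and sort so each pair has one canonical key.
        for pair in combinations(sorted(set(mentions)), 2):
            edges[pair] += 1
    return edges


# Second sentence: "She met Darcy." NER alone finds only "Darcy";
# coreference resolution links "She" back to "Elizabeth".
ner_only = [["Elizabeth", "Jane"], ["Darcy"], ["Darcy", "Bingley"]]
with_coref = [["Elizabeth", "Jane"], ["Elizabeth", "Darcy"], ["Darcy", "Bingley"]]

net_ner = cooccurrence_network(ner_only)
net_coref = cooccurrence_network(with_coref)
missing_edge: Tuple[str, str] = ("Darcy", "Elizabeth")
```

Here `missing_edge` is absent from `net_ner` but present in `net_coref`, a toy version of the recall loss the abstract describes.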

Title: SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

Authors: Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11605
Pdf URL: https://arxiv.org/pdf/2412.11605
Copy Paste: [[2412.11605]] SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models(https://arxiv.org/abs/2412.11605)
Keywords: language model, gpt, llm
Abstract: Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and often optimized by preference learning. However, existing methods often directly sample multiple independent responses from the model when creating preference pairs. Such practice can introduce content variations irrelevant to whether the instruction is precisely followed (e.g., different expressions about the same semantic), interfering with the goal of teaching models to recognize the key differences that lead to improved instruction following. In light of this, we introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions. By playing against itself, an LLM employs a tree-search strategy to refine its previous responses with respect to the instruction while minimizing unnecessary variations. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities. Furthermore, SPaR demonstrates promising scalability and transferability, greatly enhancing models like GLM-4-9B and LLaMA3-70B. We also identify how inference scaling in tree search would impact model performance. Our code and data are publicly available at this https URL.
摘要：指令遵循是语言模型的一项基本能力，要求模型能够识别指令中最细微的要求，并在输出中准确反映这些要求。这种能力非常适合偏好学习，并且通常可以通过偏好学习进行优化。然而，现有方法在创建偏好对时，通常会直接从模型中采样多个独立响应。这种做法可能会引入与是否准确遵循指令无关的内容变化（例如，关于相同语义的不同表达），干扰教导模型识别导致改进指令遵循的关键差异的目标。鉴于此，我们引入了 SPaR，这是一个自对弈框架，它集成了树搜索自我改进，以产生不受干扰的有效且可比较的偏好对。通过与自己对弈，LLM 采用树搜索策略来改进其对指令的先前响应，同时最大限度地减少不必要的变化。我们的实验表明，在 SPaR 的指导下经过三次迭代训练的 LLaMA3-8B 模型在 IFEval 基准上超越了 GPT-4-Turbo，同时没有失去一般能力。此外，SPaR 表现出良好的可扩展性和可迁移性，大大增强了 GLM-4-9B 和 LLaMA3-70B 等模型的性能。我们还确定了树搜索中的推理扩展将如何影响模型性能。我们的代码和数据在此 https URL 上公开提供。

Title: MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation

Authors: Javier García Gilabert, Carlos Escolano, Audrey Mash, Xixian Liao, Maite Melero
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11615
Pdf URL: https://arxiv.org/pdf/2412.11615
Copy Paste: [[2412.11615]] MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation(https://arxiv.org/abs/2412.11615)
Keywords: language model, llm
Abstract: We introduce MT-LENS, a framework designed to evaluate Machine Translation (MT) systems across a variety of tasks, including translation quality, gender bias detection, added toxicity, and robustness to misspellings. While several toolkits have become very popular for benchmarking the capabilities of Large Language Models (LLMs), existing evaluation tools often lack the ability to thoroughly assess the diverse aspects of MT performance. MT-LENS addresses these limitations by extending the capabilities of LM-eval-harness for MT, supporting state-of-the-art datasets and a wide range of evaluation metrics. It also offers a user-friendly platform to compare systems and analyze translations with interactive visualizations. MT-LENS aims to broaden access to evaluation strategies that go beyond traditional translation quality evaluation, enabling researchers and engineers to better understand the performance of an NMT model and also easily measure a system's biases.
摘要：我们推出了 MT-LENS，这是一个旨在评估机器翻译 (MT) 系统在各种任务中的性能的框架，包括翻译质量、性别偏见检测、附加毒性和拼写错误鲁棒性。虽然有几种工具包已经非常流行，用于对大型语言模型 (LLM) 的功能进行基准测试，但现有的评估工具往往缺乏全面评估 MT 性能各个方面的能力。MT-LENS 通过扩展 LM-eval-harness 的 MT 功能来解决这些限制，支持最先进的数据集和广泛的评估指标。它还提供了一个用户友好的平台，用于比较系统并通过交互式可视化分析翻译。MT-LENS 旨在扩大对超越传统翻译质量评估的评估策略的访问，使研究人员和工程师能够更好地了解 NMT 模型的性能，并轻松衡量系统的偏见。

Title: Fool Me, Fool Me: User Attitudes Toward LLM Falsehoods

Authors: Diana Bar-Or Nirman, Ariel Weizman, Amos Azaria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11625
Pdf URL: https://arxiv.org/pdf/2412.11625
Copy Paste: [[2412.11625]] Fool Me, Fool Me: User Attitudes Toward LLM Falsehoods(https://arxiv.org/abs/2412.11625)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) have become central tools in various fields, they often provide inaccurate or false information. This study examines user preferences regarding falsehood responses from LLMs. Specifically, we evaluate preferences for LLM responses where false statements are explicitly marked versus unmarked responses and preferences for confident falsehoods compared to LLM disclaimers acknowledging a lack of knowledge. Additionally, we investigate how requiring users to assess the truthfulness of statements influences these preferences. Surprisingly, 61% of users prefer unmarked falsehood responses over marked ones, and 69% prefer confident falsehoods over LLMs admitting lack of knowledge. In all our experiments, a total of 300 users participated, contributing valuable data to our analysis and conclusions. When users are required to evaluate the truthfulness of statements, preferences for unmarked and falsehood responses decrease slightly but remain high. These findings suggest that user preferences, which influence LLM training via feedback mechanisms, may inadvertently encourage the generation of falsehoods. Future research should address the ethical and practical implications of aligning LLM behavior with such preferences.
摘要：虽然大型语言模型 (LLM) 已成为各个领域的核心工具，但它们经常提供不准确或虚假的信息。本研究考察了用户对 LLM 虚假回答的偏好。具体来说，我们评估了 LLM 回答中明确标记虚假陈述与未标记虚假陈述的偏好，以及对自信虚假陈述与承认缺乏知识的 LLM 免责声明的偏好。此外，我们研究了要求用户评估陈述的真实性如何影响这些偏好。令人惊讶的是，61% 的用户更喜欢未标记的虚假回答而不是标记的虚假回答，69% 的用户更喜欢自信虚假陈述而不是承认缺乏知识的 LLM。在我们所有的实验中，共有 300 名用户参与，为我们的分析和结论提供了宝贵的数据。当要求用户评估陈述的真实性时，对未标记和虚假回答的偏好略有下降，但仍然很高。这些发现表明，通过反馈机制影响 LLM 训练的用户偏好可能会无意中鼓励虚假信息的产生。未来的研究应该解决将 LLM 行为与此类偏好相结合的道德和实际影响。

Title: SE-GCL: An Event-Based Simple and Effective Graph Contrastive Learning for Text Representation

Authors: Tao Meng, Wei Ai, Jianbin Li, Ze Wang, Yuntao Shou, Keqin Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11652
Pdf URL: https://arxiv.org/pdf/2412.11652
Copy Paste: [[2412.11652]] SE-GCL: An Event-Based Simple and Effective Graph Contrastive Learning for Text Representation(https://arxiv.org/abs/2412.11652)
Keywords: prompt
Abstract: Text representation learning is significant as the cornerstone of natural language processing. In recent years, graph contrastive learning (GCL) has been widely used in text representation learning due to its ability to represent and capture complex text information in a self-supervised setting. However, current mainstream graph contrastive learning methods often require the incorporation of domain knowledge or cumbersome computations to guide the data augmentation process, which significantly limits the application efficiency and scope of GCL. Additionally, many methods learn text representations only by constructing word-document relationships, which overlooks the rich contextual semantic information in the text. To address these issues and exploit representative textual semantics, we present an event-based, simple, and effective graph contrastive learning (SE-GCL) for text representation. Precisely, we extract event blocks from text and construct internal relation graphs to represent inter-semantic interconnections, which can ensure that the most critical semantic information is preserved. Then, we devise a streamlined, unsupervised graph contrastive learning framework to leverage the complementary nature of the event semantic and structural information for intricate feature data capture. In particular, we introduce the concept of an event skeleton for core representation semantics and simplify the typically complex data augmentation techniques found in existing graph contrastive learning to boost algorithmic efficiency. We employ multiple loss functions to prompt diverse embeddings to converge or diverge within a confined distance in the vector space, ultimately achieving a harmonious equilibrium. We conducted experiments on the proposed SE-GCL on four standard data sets (AG News, 20NG, SougouNews, and THUCNews) to verify its effectiveness in text representation learning.
摘要：文本表征学习是自然语言处理的基石，具有重要意义。近年来，图对比学习（GCL）因其在自监督环境下表征和捕获复杂文本信息的能力而被广泛应用于文本表征学习。然而，目前主流的图对比学习方法往往需要结合领域知识或繁琐的计算来指导数据增强过程，这大大限制了GCL的应用效率和范围。此外，许多方法仅通过构建单词-文档关系来学习文本表征，忽略了文本中丰富的上下文语义信息。为了解决这些问题并利用代表性的文本语义，我们提出了一种基于事件的、简单有效的图对比学习（SE-GCL）用于文本表征。具体来说，我们从文本中提取事件块并构建内部关系图来表示语义间的互连，这可以确保最关键的语义信息得到保留。然后，我们设计了一个精简的、无监督的图对比学习框架，利用事件语义和结构信息的互补性来捕获复杂的特征数据。具体来说，我们引入了事件骨架的概念，用于核心表示语义，并简化了现有图对比学习中通常复杂的数据增强技术，以提高算法效率。我们使用多个损失函数来促使不同的嵌入在向量空间中的有限距离内收敛或发散，最终达到和谐的平衡。我们在四个标准数据集（AG News、20NG、SougouNews 和 THUCNews）上对提出的 SE-GCL 进行了实验，以验证其在文本表示学习中的有效性。

Title: Self-Adaptive Paraphrasing and Preference Learning for Improved Claim Verifiability

Authors: Amelie Wührl, Roman Klinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11653
Pdf URL: https://arxiv.org/pdf/2412.11653
Copy Paste: [[2412.11653]] Self-Adaptive Paraphrasing and Preference Learning for Improved Claim Verifiability(https://arxiv.org/abs/2412.11653)
Keywords: language model
Abstract: In fact-checking, structure and phrasing of claims critically influence a model's ability to predict verdicts accurately. Social media content in particular rarely serves as optimal input for verification systems, which necessitates pre-processing to extract the claim from noisy context before fact checking. Prior work suggests extracting a claim representation that humans find to be checkworthy and verifiable. This has two limitations: (1) the format may not be optimal for a fact-checking model, and (2), it requires annotated data to learn the extraction task from. We address both issues and propose a method to extract claims that is not reliant on labeled training data. Instead, our self-adaptive approach only requires a black-box fact checking model and a generative language model (LM). Given a tweet, we iteratively optimize the LM to generate a claim paraphrase that increases the performance of a fact checking model. By learning from preference pairs, we align the LM to the fact checker using direct preference optimization. We show that this novel setup extracts a claim paraphrase that is more verifiable than their original social media formulations, and is on par with competitive baselines. For refuted claims, our method consistently outperforms all baselines.
摘要：在事实核查中，声明的结构和措辞对模型准确预测判决的能力有重大影响。社交媒体内容很少作为验证系统的最佳输入，因此需要在事实核查之前进行预处理以从嘈杂的上下文中提取声明。先前的工作建议提取人类认为值得检查和可验证的声明表示。这有两个限制：（1）格式可能不是事实核查模型的最佳格式，（2）它需要注释数据来学习提取任务。我们解决了这两个问题，并提出了一种不依赖于标记训练数据的声明提取方法。相反，我们的自适应方法只需要一个黑盒事实核查模型和一个生成语言模型 (LM)。给定一条推文，我们迭代优化 LM 以生成声明释义，从而提高事实核查模型的性能。通过从偏好对中学习，我们使用直接偏好优化将 LM 与事实核查器对齐。我们表明，这种新颖的设置提取的声明释义比其原始社交媒体表述更可验证，并且与竞争基线不相上下。对于被驳斥的声明，我们的方法始终优于所有基线。
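
Sketch (not from the paper): how preference pairs scored by a black-box fact checker could feed the standard direct preference optimization (DPO) loss. The pairing rule (best versus worst candidate), the scores, and the log-probabilities are illustrative assumptions; only the loss formula is the standard DPO objective for a single pair.

```python
import math
from typing import List, Tuple


def build_preference_pair(scored: List[Tuple[str, float]]) -> Tuple[str, str]:
    """Pick (chosen, rejected) paraphrases by a hypothetical fact-checker score."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return ranked[0][0], ranked[-1][0]


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Hypothetical paraphrases of one tweet with made-up checker scores.
chosen, rejected = build_preference_pair(
    [("claim paraphrase A", 0.2), ("claim paraphrase B", 0.8), ("claim paraphrase C", 0.5)]
)

loss_at_start = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy equals reference
loss_after_shift = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy now favors chosen
```

When the policy equals the reference model the loss is log 2; it drops as the policy shifts probability toward the chosen paraphrase.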

Title: C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness

Authors: Yu Kang, Xianghui Sun, Liangyu Chen, Wei Zou
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11664
Pdf URL: https://arxiv.org/pdf/2412.11664
Copy Paste: [[2412.11664]] C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness(https://arxiv.org/abs/2412.11664)
Keywords: language model, llm, chain-of-thought
Abstract: Generating Chain-of-Thought (CoT) before deriving the answer can effectively improve the reasoning capabilities of large language models (LLMs) and significantly improve the accuracy of the generated answer. However, in most cases, the length of the generated CoT is much longer than the desired final answer, which results in additional decoding costs. Furthermore, existing research has discovered that shortening the reasoning steps in CoT, even while preserving the key information, diminishes LLMs' abilities. These phenomena make it difficult to use LLMs and CoT in many real-world applications that only require the final answer and are sensitive to latency, such as search and recommendation. To reduce the costs of model decoding and shorten the length of the generated CoT, this paper presents Conditioned Compressed Chain-of-Thought (C3oT), a CoT compression framework that involves a compressor to compress an original longer CoT into a shorter CoT while maintaining key information and interpretability, a conditioned training method to train LLMs with both longer CoT and shorter CoT simultaneously to learn the corresponding relationships between them, and a conditioned inference method to gain the reasoning ability learned from longer CoT by generating shorter CoT. We conduct experiments over four datasets from arithmetic and commonsense scenarios, showing that the proposed method is capable of compressing the length of generated CoT by up to more than 50% without compromising its effectiveness.
摘要：在得出答案之前生成思路链（CoT）可以有效提升大型语言模型（LLM）的推理能力，并显著提高生成答案的准确率。然而，在大多数情况下，生成的 CoT 的长度远长于所需的最终答案，这会导致额外的解码成本。此外，现有研究发现，即使在保留关键信息的情况下缩短 CoT 中的推理步骤也会削弱 LLM 的能力。这些现象使得 LLM 和 CoT 难以在许多只需要最终答案且对延迟敏感的实际应用中使用，例如搜索和推荐。为了降低模型解码成本，缩短生成 CoT 的长度，本文提出了 $\textbf{C}$条件 $\textbf{C}$压缩 $\textbf{C}$连线-of-$\textbf{T}$思想 (C3oT)，这是一种 CoT 压缩框架，包括压缩器，用于在保留关键信息和可解释性的同时将原始较长的 CoT 压缩为较短的 CoT；条件训练方法，用于同时训练具有较长 CoT 和较短 CoT 的 LLM 以学习它们之间的对应关系；条件推理方法，用于通过生成较短 CoT 来获得从较长 CoT 中学到的推理能力。我们在算术和常识场景的四个数据集上进行了实验，结果表明，所提出的方法能够在不影响其有效性的情况下将生成的 CoT 的长度压缩高达 50% 以上。

Title: BioBridge: Unified Bio-Embedding with Bridging Modality in Code-Switched EMR

Authors: Jangyeong Jeon, Sangyeon Cho, Dongjoon Lee, Changhee Lee, Junyeong Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11671
Pdf URL: https://arxiv.org/pdf/2412.11671
Copy Paste: [[2412.11671]] BioBridge: Unified Bio-Embedding with Bridging Modality in Code-Switched EMR(https://arxiv.org/abs/2412.11671)
Keywords: prompt
Abstract: Pediatric Emergency Department (PED) overcrowding presents a significant global challenge, prompting the need for efficient solutions. This paper introduces the BioBridge framework, a novel approach that applies Natural Language Processing (NLP) to Electronic Medical Records (EMRs) in written free-text form to enhance decision-making in PED. In non-English speaking countries, such as South Korea, EMR data is often written in a Code-Switching (CS) format that mixes the native language with English, with most code-switched English words having clinical significance. The BioBridge framework consists of two core modules: "bridging modality in context" and "unified bio-embedding." The "bridging modality in context" module improves the contextual understanding of bilingual and code-switched EMRs. In the "unified bio-embedding" module, the knowledge of the model trained in the medical domain is injected into the encoder-based model to bridge the gap between the medical and general domains. Experimental results demonstrate that the proposed BioBridge significantly outperforms traditional machine learning and pre-trained encoder-based models on several metrics, including F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and Brier score. Specifically, BioBridge-XLM achieved enhancements of 0.85% in F1 score, 0.75% in AUROC, and 0.76% in AUPRC, along with a notable 3.04% decrease in the Brier score, demonstrating marked improvements in accuracy, reliability, and prediction calibration over the baseline XLM model. The source code will be made publicly available.
摘要：儿科急诊科 (PED) 人满为患是全球面临的重大挑战，因此需要有效的解决方案。本文介绍了 BioBridge 框架，这是一种新颖的方法，将自然语言处理 (NLP) 应用于书面自由文本形式的电子病历 (EMR)，以增强 PED 的决策能力。在韩国等非英语国家，EMR 数据通常以混合母语和英语的代码转换 (CS) 格式编写，大多数代码转换的英语单词都具有临床意义。BioBridge 框架由两个核心模块组成：“上下文中的桥接模态”和“统一的生物嵌入”。 “上下文中的桥接模态”模块提高了对双语和代码转换 EMR 的上下文理解。在“统一的生物嵌入”模块中，在医学领域训练的模型的知识被注入到基于编码器的模型中，以弥合医学和一般领域之间的差距。实验结果表明，所提出的 BioBridge 在多个指标上显著优于传统机器学习和预训练的基于编码器的模型，包括 F1 得分、受试者工作特征曲线下面积 (AUROC)、精确召回曲线下面积 (AUPRC) 和 Brier 得分。具体来说，BioBridge-XLM 的 F1 得分提高了 0.85%，AUROC 提高了 0.75%，AUPRC 提高了 0.76%，同时 Brier 得分显著下降了 3.04%，表明与基线 XLM 模型相比，其准确性、可靠性和预测校准方面有显著改善。源代码将公开发布。

Title: Bias Vector: Mitigating Biases in Language Models with Task Arithmetic Approach

Authors: Daiki Shirafuji, Makoto Takenaka, Shinya Taguchi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11679
Pdf URL: https://arxiv.org/pdf/2412.11679
Copy Paste: [[2412.11679]] Bias Vector: Mitigating Biases in Language Models with Task Arithmetic Approach(https://arxiv.org/abs/2412.11679)
Keywords: language model
Abstract: The use of language models (LMs) has increased considerably in recent years, and the biases and stereotypes in training data that are reflected in the LM outputs are causing social problems. In this paper, inspired by task arithmetic, we propose the "Bias Vector" method for the mitigation of these LM biases. The Bias Vector method does not require manually created debiasing data. The three main steps of our approach involve: (1) continually training the pre-trained LMs on biased data using masked language modeling; (2) constructing the Bias Vector as the difference between the weights of the biased LMs and those of pre-trained LMs; and (3) subtracting the Bias Vector from the weights of the pre-trained LMs for debiasing. We evaluated the Bias Vector method on the SEAT across three LMs and confirmed an average improvement of 0.177 points. We demonstrated that the Bias Vector method does not degrade the LM performance on downstream tasks in the GLUE benchmark. In addition, we examined the impact of scaling factors, which control the magnitudes of Bias Vectors, with effect sizes on the SEAT and conducted a comprehensive evaluation of our debiased LMs across both the SEAT and GLUE benchmarks.
摘要：近年来，语言模型 (LM) 的使用大幅增加，而训练数据中的偏见和刻板印象会反映在 LM 输出中，从而引发社会问题。在本文中，受任务算法的启发，我们提出了“偏差向量”方法来缓解这些 LM 偏见。偏差向量方法不需要手动创建的去偏差数据。我们的方法主要包括三个步骤：(1) 使用掩蔽语言模型在有偏差的数据上不断训练预训练的 LM；(2) 将偏差向量构建为有偏差 LM 的权重与预训练 LM 的权重之差；(3) 从预训练 LM 的权重中减去偏差向量以进行去偏差。我们在三个 LM 的 SEAT 上评估了偏差向量方法，并确认平均提高了 0.177 分。我们证明了偏差向量方法不会降低 GLUE 基准中下游任务的 LM 性能。此外，我们研究了控制偏差向量大小的缩放因子对 SEAT 的影响，并对 SEAT 和 GLUE 基准中的去偏差 LM 进行了全面评估。
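
Sketch (not from the paper): steps (2) and (3) of the abstract above, written as plain weight arithmetic. Weights are represented as a dictionary of flat float lists purely for illustration; the parameter name, the values, and the `scale` default are hypothetical, and step (1), the continued training on biased data, is assumed to have already produced the `biased` weights.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def bias_vector(biased: Weights, pretrained: Weights) -> Weights:
    """Step (2): Bias Vector = biased weights - pre-trained weights."""
    return {
        name: [b - p for b, p in zip(biased[name], pretrained[name])]
        for name in pretrained
    }


def debias(pretrained: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Step (3): subtract the (scaled) Bias Vector from the pre-trained weights."""
    return {
        name: [p - scale * v for p, v in zip(pretrained[name], vector[name])]
        for name in pretrained
    }


# Toy weights for a single hypothetical parameter tensor.
pretrained = {"layer.weight": [0.5, -0.25]}
biased = {"layer.weight": [0.75, -0.5]}  # after continued training on biased data

vector = bias_vector(biased, pretrained)
debiased_full = debias(pretrained, vector)             # scale = 1.0
debiased_half = debias(pretrained, vector, scale=0.5)  # smaller scaling factor
```

The `scale` argument plays the role of the scaling factor mentioned in the abstract, which controls the magnitude of the subtracted vector.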

Title: Multilingual and Explainable Text Detoxification with Parallel Corpora

Authors: Daryna Dementieva, Nikolay Babakov, Amit Ronen, Abinew Ali Ayele, Naquee Rizwan, Florian Schneider, Xintong Wang, Seid Muhie Yimam, Daniil Moskovskiy, Elisei Stakovskii, Eran Kaufman, Ashraf Elnagar, Animesh Mukherjee, Alexander Panchenko
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11691
Pdf URL: https://arxiv.org/pdf/2412.11691
Copy Paste: [[2412.11691]] Multilingual and Explainable Text Detoxification with Parallel Corpora(https://arxiv.org/abs/2412.11691)
Keywords: prompt, chain-of-thought
Abstract: Even with various regulations in place across countries and social media platforms (Government of India, 2021; European Parliament and Council of the European Union, 2022), digital abusive speech remains a significant issue. One potential approach to address this challenge is automatic text detoxification, a text style transfer (TST) approach that transforms toxic language into a more neutral or non-toxic form. To date, the availability of parallel corpora for the text detoxification task (Logacheva et al., 2022; Atwell et al., 2022; Dementieva et al., 2024a) has proven to be crucial for state-of-the-art approaches. With this work, we extend the parallel text detoxification corpus to new languages -- German, Chinese, Arabic, Hindi, and Amharic -- testing TST baselines in this extensive multilingual setup. Next, we conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences, diving deeply into the nuances, similarities, and differences of toxicity and detoxification across 9 languages. Finally, based on the obtained insights, we experiment with a novel text detoxification method inspired by the Chain-of-Thoughts reasoning approach, enhancing the prompting process through clustering on relevant descriptive attributes.

Title: CoinMath: Harnessing the Power of Coding Instruction for Math LLMs

Authors: Chengwei Wei, Bin Wang, Jung-jae Kim, Guimei Liu, Nancy F. Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11699
Pdf URL: https://arxiv.org/pdf/2412.11699
Copy Paste: [[2412.11699]] CoinMath: Harnessing the Power of Coding Instruction for Math LLMs(https://arxiv.org/abs/2412.11699)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown strong performance in solving mathematical problems, with code-based solutions proving particularly effective. However, the best practice to leverage coding instruction data to enhance mathematical reasoning remains underexplored. This study investigates three key questions: (1) How do different coding styles of mathematical code-based rationales impact LLMs' learning performance? (2) Can general-domain coding instructions improve performance? (3) How does integrating textual rationales with code-based ones during training enhance mathematical reasoning abilities? Our findings reveal that code-based rationales with concise comments, descriptive naming, and hardcoded solutions are beneficial, while improvements from general-domain coding instructions and textual rationales are relatively minor. Based on these insights, we propose CoinMath, a learning strategy designed to enhance mathematical reasoning by diversifying the coding styles of code-based rationales. CoinMath generates a variety of code-based rationales incorporating concise comments, descriptive naming conventions, and hardcoded solutions. Experimental results demonstrate that CoinMath significantly outperforms its baseline model, MAmmoTH, one of the SOTA math LLMs.

Title: Vocabulary Expansion of Chat Models with Unlabeled Target Language Data

Authors: Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11704
Pdf URL: https://arxiv.org/pdf/2412.11704
Copy Paste: [[2412.11704]] Vocabulary Expansion of Chat Models with Unlabeled Target Language Data(https://arxiv.org/abs/2412.11704)
Keywords: language model, chat
Abstract: Chat models (i.e. language models trained to follow instructions through conversation with humans) outperform base models (i.e. trained solely on unlabeled data) in both conversation and general task-solving abilities. These models are generally English-centric and require further adaptation for languages that are underrepresented in or absent from their training data. A common technique for adapting base models is to extend the model's vocabulary with target language tokens, i.e. vocabulary expansion (VE), and then continually pre-train it on language-specific data. Using chat data is ideal for chat model adaptation, but often, either this does not exist or is costly to construct. Alternatively, adapting chat models with unlabeled data is a possible solution, but it could result in catastrophic forgetting. In this paper, we investigate the impact of using unlabeled target language data for VE on chat models for the first time. We first show that off-the-shelf VE generally performs well across target language tasks and models in 71% of cases, though it underperforms in scenarios where source chat models are already strong. To further improve adapted models, we propose post-hoc techniques that inject information from the source model without requiring any further training. Experiments reveal the effectiveness of our methods, helping the adapted models to achieve performance improvements in 87% of cases.

Title: MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for Table Reasoning

Authors: Zheng Li, Yang Du, Mao Zheng, Mingyang Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11711
Pdf URL: https://arxiv.org/pdf/2412.11711
Copy Paste: [[2412.11711]] MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for Table Reasoning(https://arxiv.org/abs/2412.11711)
Keywords: language model, llm
Abstract: Extensive research has been conducted to explore the capability of Large Language Models (LLMs) for table reasoning and has significantly improved the performance on existing benchmarks. However, tables and user questions in real-world applications are more complex and diverse, leaving a gap that cannot be ignored relative to the existing benchmarks. To fill the gap, we propose a Multi-scale spreadsheet benchmark with Meta operations for Table reasoning, named MiMoTable. Specifically, MiMoTable incorporates two key features. First, the tables in MiMoTable are all spreadsheets used in real-world scenarios, which cover seven domains and contain different types. Second, we define a new criterion with six categories of meta operations for measuring the difficulty of each question in MiMoTable, which also serves as a new perspective for measuring the difficulty of the existing benchmarks. Experimental results show that Claude-3.5-Sonnet achieves the best performance with 77.4% accuracy, indicating that there is still significant room for LLMs to improve on MiMoTable. Furthermore, we grade the difficulty of existing benchmarks according to our new criterion. Experiments show that the performance of LLMs decreases as the difficulty of benchmarks increases, thereby demonstrating the effectiveness of our proposed criterion.

Title: Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework

Authors: Xuanming Zhang, Yuxuan Chen, Yiming Zheng, Zhexin Zhang, Yuan Yuan, Minlie Huang
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2412.11713
Pdf URL: https://arxiv.org/pdf/2412.11713
Copy Paste: [[2412.11713]] Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework(https://arxiv.org/abs/2412.11713)
Keywords: language model, llm, agent
Abstract: In real world software development, improper or missing exception handling can severely impact the robustness and reliability of code. Exception handling mechanisms require developers to detect, capture, and manage exceptions according to high standards, but many developers struggle with these tasks, leading to fragile code. This problem is particularly evident in open-source projects and impacts the overall quality of the software ecosystem. To address this challenge, we explore the use of large language models (LLMs) to improve exception handling in code. Through extensive analysis, we identify three key issues: Insensitive Detection of Fragile Code, Inaccurate Capture of Exception Block, and Distorted Handling Solution. These problems are widespread across real world repositories, suggesting that robust exception handling practices are often overlooked or mishandled. In response, we propose Seeker, a multi-agent framework inspired by expert developer strategies for exception handling. Seeker uses agents: Scanner, Detector, Predator, Ranker, and Handler to assist LLMs in detecting, capturing, and resolving exceptions more effectively. Our work is the first systematic study on leveraging LLMs to enhance exception handling practices in real development scenarios, providing valuable insights for future improvements in code reliability.

Title: LLMs Can Simulate Standardized Patients via Agent Coevolution

Authors: Zhuoyun Du, Lujie Zheng, Renjun Hu, Yuyang Xu, Xiawei Li, Ying Sun, Wei Chen, Jian Wu, Haolei Cai, Haohao Ying
Subjects: cs.CL, cs.AI, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2412.11716
Pdf URL: https://arxiv.org/pdf/2412.11716
Copy Paste: [[2412.11716]] LLMs Can Simulate Standardized Patients via Agent Coevolution(https://arxiv.org/abs/2412.11716)
Keywords: language model, llm, prompt, agent
Abstract: Training medical personnel using standardized patients (SPs) remains a complex challenge, requiring extensive domain expertise and role-specific practice. Most research on Large Language Model (LLM)-based simulated patients focuses on improving data retrieval accuracy or adjusting prompts through human feedback. However, this focus has overlooked the critical need for patient agents to learn a standardized presentation pattern that transforms data into human-like patient responses through unsupervised simulations. To address this gap, we propose EvoPatient, a novel simulated patient framework in which a patient agent and doctor agents simulate the diagnostic process through multi-turn dialogues, simultaneously gathering experience to improve the quality of both questions and answers, ultimately enabling human doctor training. Extensive experiments on various cases demonstrate that, by providing only overall SP requirements, our framework improves over existing reasoning methods by more than 10% in requirement alignment and better human preference, while achieving an optimal balance of resource consumption after evolving over 200 cases for 10 hours, with excellent generalizability. The code will be available at this https URL.

Title: Personalized LLM for Generating Customized Responses to the Same Query from Different Users

Authors: Hang Zeng, Chaoyue Niu, Fan Wu, Chengfei Lv, Guihai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11736
Pdf URL: https://arxiv.org/pdf/2412.11736
Copy Paste: [[2412.11736]] Personalized LLM for Generating Customized Responses to the Same Query from Different Users(https://arxiv.org/abs/2412.11736)
Keywords: language model, llm, chat
Abstract: Existing work on large language model (LLM) personalization assigned different responding roles to LLM, but overlooked the diversity of questioners. In this work, we propose a new form of questioner-aware LLM personalization, generating different responses even for the same query from different questioners. We design a dual-tower model architecture with a cross-questioner general encoder and a questioner-specific encoder. We further apply contrastive learning with multi-view augmentation, pulling close the dialogue representations of the same questioner, while pulling apart those of different questioners. To mitigate the impact of question diversity on questioner-contrastive learning, we cluster the dialogues based on question similarity and restrict the scope of contrastive learning within each cluster. We also build a multi-questioner dataset from English and Chinese scripts and WeChat records, called MQDialog, containing 173 questioners and 12 responders. Extensive evaluation with different metrics shows a significant improvement in the quality of personalized response generation.

Title: CSR:Achieving 1 Bit Key-Value Cache via Sparse Representation

Authors: Hongxuan Zhang, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11741
Pdf URL: https://arxiv.org/pdf/2412.11741
Copy Paste: [[2412.11741]] CSR:Achieving 1 Bit Key-Value Cache via Sparse Representation(https://arxiv.org/abs/2412.11741)
Keywords: language model, llm
Abstract: The emergence of long-context text applications utilizing large language models (LLMs) has presented significant scalability challenges, particularly in memory footprint. The linear growth of the Key-Value (KV) cache responsible for storing attention keys and values to minimize redundant computations can lead to substantial increases in memory consumption, potentially causing models to fail to serve with limited memory resources. To address this issue, we propose a novel approach called Cache Sparse Representation (CSR), which converts the KV cache by transforming the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference. Furthermore, we introduce NeuralDict, a novel neural network-based method for automatically generating the dictionary used in our sparse representation. Our extensive experiments demonstrate that CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms while maintaining robust functionality in memory-constrained environments.
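To make the idea of replacing a dense cache vector with "sparse indexes and weights" concrete, here is a minimal sketch using classic greedy matching pursuit over a fixed dictionary. This is a swapped-in textbook technique, not the CSR algorithm itself: the abstract does not specify the encoding procedure, and the dictionary below is a hand-written identity basis rather than one produced by NeuralDict. The function names and the toy `key` vector are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def sparse_encode(vector, dictionary, num_atoms):
    """Greedy matching pursuit: pick the atom most correlated with the
    residual, record its index and weight, subtract it, and repeat."""
    residual = list(vector)
    indexes, weights = [], []
    for _ in range(num_atoms):
        scores = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        indexes.append(best)
        weights.append(scores[best])
        residual = [r - scores[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return indexes, weights


def sparse_decode(indexes, weights, dictionary, dim):
    """Rebuild a dense vector as the weighted sum of the selected atoms."""
    out = [0.0] * dim
    for i, w in zip(indexes, weights):
        out = [o + w * a for o, a in zip(out, dictionary[i])]
    return out


# Unit-norm atoms (assumed for illustration; atoms must be unit-norm
# for the score to be usable directly as the weight).
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
key = [3.0, 0.0, -2.0]  # stands in for one cached key vector

idx, w = sparse_encode(key, dictionary, num_atoms=2)
restored = sparse_decode(idx, w, dictionary, dim=3)
print(idx, w, restored)
```

Only `idx` and `w` would need to be stored per vector; the dictionary is shared. The memory saving therefore depends on how few atoms are needed and how compactly indexes and weights are stored, which this sketch does not model.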

Title: Common Ground, Diverse Roots: The Difficulty of Classifying Common Examples in Spanish Varieties

Authors: Javier A. Lopetegui, Arij Riabi, Djamé Seddah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11750
Pdf URL: https://arxiv.org/pdf/2412.11750
Copy Paste: [[2412.11750]] Common Ground, Diverse Roots: The Difficulty of Classifying Common Examples in Spanish Varieties(https://arxiv.org/abs/2412.11750)
Keywords: agent
Abstract: Variations in languages across geographic regions or cultures are crucial to address to avoid biases in NLP systems designed for culturally sensitive tasks, such as hate speech detection or dialog with conversational agents. In languages such as Spanish, where varieties can significantly overlap, many examples can be valid across them, which we refer to as common examples. Ignoring these examples may cause misclassifications, reducing model accuracy and fairness. Therefore, accounting for these common examples is essential to improve the robustness and representativeness of NLP systems trained on such data. In this work, we address this problem in the context of Spanish varieties. We use training dynamics to automatically detect common examples or errors in existing Spanish datasets. We demonstrate the efficacy of using predicted label confidence for our Datamaps (Swayamdipta et al., 2020) implementation for the identification of hard-to-classify examples, especially common examples, enhancing model performance in variety identification tasks. Additionally, we introduce a Cuban Spanish Variety Identification dataset with common examples annotations developed to facilitate more accurate detection of Cuban and Caribbean Spanish varieties. To our knowledge, this is the first dataset focused on identifying the Cuban, or any other Caribbean, Spanish variety.

Title: QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs

Authors: Mohammad Aflah Khan, Neemesh Yadav, Sarah Masud, Md. Shad Akhtar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11763
Pdf URL: https://arxiv.org/pdf/2412.11763
Copy Paste: [[2412.11763]] QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs(https://arxiv.org/abs/2412.11763)
Keywords: language model, llm, prompt
Abstract: The rise of large language models (LLMs) has created a need for advanced benchmarking systems beyond traditional setups. To this end, we introduce QUENCH, a novel text-based English Quizzing Benchmark manually curated and transcribed from YouTube quiz videos. QUENCH possesses masked entities and rationales for the LLMs to predict via generation. At the intersection of geographical context and common sense reasoning, QUENCH helps assess world knowledge and deduction capabilities of LLMs via a zero-shot, open-domain quizzing setup. We perform an extensive evaluation on 7 LLMs and 4 metrics, investigating the influence of model size, prompting style, geographical context, and gold-labeled rationale generation. The benchmarking concludes with an error analysis to which the LLMs are prone.

Title: UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models

Authors: Boyang Xue, Fei Mi, Qi Zhu, Hongru Wang, Rui Wang, Sheng Wang, Erxin Yu, Xuming Hu, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11803
Pdf URL: https://arxiv.org/pdf/2412.11803
Copy Paste: [[2412.11803]] UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models(https://arxiv.org/abs/2412.11803)
Keywords: language model, llm, prompt
Abstract: Despite demonstrating impressive capabilities, Large Language Models (LLMs) still often struggle to accurately express the factual knowledge they possess, especially in cases where the LLMs' knowledge boundaries are ambiguous. To improve LLMs' factual expressions, we propose the UAlign framework, which leverages Uncertainty estimations to represent knowledge boundaries, and then explicitly incorporates these representations as input features into prompts for LLMs to Align with factual knowledge. First, we prepare the dataset on knowledge question-answering (QA) samples by calculating two uncertainty estimations, including confidence score and semantic entropy, to represent the knowledge boundaries for LLMs. Subsequently, using the prepared dataset, we train a reward model that incorporates uncertainty estimations and then employ the Proximal Policy Optimization (PPO) algorithm for factuality alignment on LLMs. Experimental results indicate that, by integrating uncertainty representations in LLM alignment, the proposed UAlign can significantly enhance the LLMs' capacities to confidently answer known questions and refuse unknown questions on both in-domain and out-of-domain tasks, showing reliability improvements and good generalizability over various prompt- and training-based baselines.
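As a rough illustration of the two uncertainty estimations named in the abstract above, the sketch below computes a confidence score and a semantic entropy from several sampled answers to one question. The definitions are assumptions, since the abstract does not give formulas: confidence is taken as the fraction of samples that match a reference answer, and semantic entropy as the Shannon entropy over groups of equivalent answers. Grouping by normalized string match is a stand-in for a real semantic-equivalence check, and `normalize` and `confidence_and_entropy` are hypothetical names.

```python
import math
from collections import Counter


def normalize(answer):
    """Crude equivalence key: lowercase, trim, drop a trailing period."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def confidence_and_entropy(samples, reference):
    """Return (confidence, semantic_entropy) for one question.

    confidence: share of sampled answers equivalent to the reference.
    semantic_entropy: entropy (in nats) over clusters of equivalent answers.
    """
    clusters = Counter(normalize(s) for s in samples)
    total = len(samples)
    confidence = clusters[normalize(reference)] / total
    entropy = -sum((c / total) * math.log(c / total)
                   for c in clusters.values())
    return confidence, entropy


# Hypothetical sampled answers for a single QA item.
samples = ["Paris", "paris.", "Lyon", "Paris"]
conf, ent = confidence_and_entropy(samples, reference="Paris")
print(conf, ent)
```

High confidence with low entropy suggests a question the model reliably knows; low confidence with high entropy suggests one near or beyond its knowledge boundary. How such values are then fed into prompts or the reward model is specific to the paper and not reproduced here.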

Title: EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents

Authors: Mengna Zhu, Kaisheng Zeng, Mao Wang, Kaiming Xiao, Lei Hou, Hongbin Huang, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11814
Pdf URL: https://arxiv.org/pdf/2412.11814
Copy Paste: [[2412.11814]] EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents(https://arxiv.org/abs/2412.11814)
Keywords: language model, llm
Abstract: In real life, many dynamic events, such as major disasters and large-scale sports events, evolve continuously over time. Obtaining an overview of these events can help people quickly understand the situation and respond more effectively. This is challenging because the key information of the event is often scattered across multiple documents, involving complex event knowledge understanding and reasoning, which is under-explored in previous work. Therefore, we proposed the Event-Centric Multi-Document Summarization (ECS) task, which aims to generate concise and comprehensive summaries of a given event based on multiple related news documents. Based on this, we constructed the EventSum dataset, which was constructed using Baidu Baike entries and underwent extensive human annotation, to facilitate relevant research. It is the first large scale Chinese multi-document summarization dataset, containing 5,100 events and a total of 57,984 news documents, with an average of 11.4 input news documents and 13,471 characters per event. To ensure data quality and mitigate potential data leakage, we adopted a multi-stage annotation approach for manually labeling the test set. Given the complexity of event-related information, existing metrics struggle to comprehensively assess the quality of generated summaries. We designed specific metrics including Event Recall, Argument Recall, Causal Recall, and Temporal Recall along with corresponding calculation methods for evaluation. We conducted comprehensive experiments on EventSum to evaluate the performance of advanced long-context Large Language Models (LLMs) on this task. Our experimental results indicate that: 1) The event-centric multi-document summarization task remains challenging for existing long-context LLMs; 2) The recall metrics we designed are crucial for evaluating the comprehensiveness of the summary information.

Title: Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation

Authors: Leonidas Zotos, Hedderik van Rijn, Malvina Nissim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11831
Pdf URL: https://arxiv.org/pdf/2412.11831
Copy Paste: [[2412.11831]] Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation(https://arxiv.org/abs/2412.11831)
Keywords: language model
Abstract: In an educational setting, an estimate of the difficulty of multiple-choice questions (MCQs), a commonly used strategy to assess learning progress, constitutes very useful information for both teachers and students. Since human assessment is costly from multiple points of view, automatic approaches to MCQ item difficulty estimation have been investigated, though with mixed success so far. Our approach to this problem takes a different angle from previous work: asking various Large Language Models to tackle the questions included in two different MCQ datasets, we leverage model uncertainty to estimate item difficulty. By using both model uncertainty features and textual features in a Random Forest regressor, we show that uncertainty features contribute substantially to difficulty prediction, where difficulty is inversely proportional to the number of students who can correctly answer a question. In addition to showing the value of our approach, we also observe that our model achieves state-of-the-art results on the BEA publicly available dataset.

Title: Improved Models for Media Bias Detection and Subcategorization

Authors: Tim Menzner, Jochen L. Leidner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11835
Pdf URL: https://arxiv.org/pdf/2412.11835
Copy Paste: [[2412.11835]] Improved Models for Media Bias Detection and Subcategorization(https://arxiv.org/abs/2412.11835)
Keywords: language model
Abstract: We present improved models for the granular detection and sub-classification of news media bias in English news articles. We compare the performance of zero-shot versus fine-tuned large pre-trained neural transformer language models, explore how the level of detail of the classes affects performance on a novel taxonomy of 27 news bias-types, and demonstrate how synthetically generated example data can be used to improve quality.

Title: A Benchmark and Robustness Study of In-Context-Learning with Large Language Models in Music Entity Detection

Authors: Simon Hachmeier, Robert Jäschke
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2412.11851
Pdf URL: https://arxiv.org/pdf/2412.11851
Copy Paste: [[2412.11851]] A Benchmark and Robustness Study of In-Context-Learning with Large Language Models in Music Entity Detection(https://arxiv.org/abs/2412.11851)
Keywords: language model, llm, hallucination
Abstract: Detecting music entities such as song titles or artist names is a useful application to help use cases like processing music search queries or analyzing music consumption on the web. Recent approaches incorporate smaller language models (SLMs) like BERT and achieve high results. However, further research indicates a high influence of entity exposure during pre-training on the performance of the models. With the advent of large language models (LLMs), these outperform SLMs in a variety of downstream tasks. However, researchers are still divided if this is applicable to tasks like entity detection in texts due to issues like hallucination. In this paper, we provide a novel dataset of user-generated metadata and conduct a benchmark and a robustness study using recent LLMs with in-context-learning (ICL). Our results indicate that LLMs in the ICL setting yield higher performance than SLMs. We further uncover the large impact of entity exposure on the best performing LLM in our study.

Title: Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives

Authors: Sam Relins, Daniel Birks, Charlie Lloyd
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11878
Pdf URL: https://arxiv.org/pdf/2412.11878
Copy Paste: [[2412.11878]] Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives(https://arxiv.org/abs/2412.11878)
Keywords: language model, llm, prompt
Abstract: Objectives: Compare qualitative coding of instruction tuned large language models (IT-LLMs) against human coders in classifying the presence or absence of vulnerability in routinely collected unstructured text that describes police-public interactions. Evaluate potential bias in IT-LLM codings. Methods: Analyzing publicly available text narratives of police-public interactions recorded by Boston Police Department, we provide humans and IT-LLMs with qualitative labelling codebooks and compare labels generated by both, seeking to identify situations associated with (i) mental ill health; (ii) substance misuse; (iii) alcohol dependence; and (iv) homelessness. We explore multiple prompting strategies and model sizes, and the variability of labels generated by repeated prompts. Additionally, to explore model bias, we utilize counterfactual methods to assess the impact of two protected characteristics - race and gender - on IT-LLM classification. Results: Results demonstrate that IT-LLMs can effectively support human qualitative coding of police incident narratives. While there is some disagreement between LLM and human generated labels, IT-LLMs are highly effective at screening narratives where no vulnerabilities are present, potentially vastly reducing the requirement for human coding. Counterfactual analyses demonstrate that manipulations to both gender and race of individuals described in narratives have very limited effects on IT-LLM classifications beyond those expected by chance. Conclusions: IT-LLMs offer effective means to augment human qualitative coding in a way that requires much lower levels of resource to analyze large unstructured datasets. Moreover, they encourage specificity in qualitative coding, promote transparency, and provide the opportunity for more standardized, replicable approaches to analyzing large free-text police data sources.

Title: Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments

Authors: Andrii Nikolaiev, Yiannos Stathopoulos, Simone Teufel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11908
Pdf URL: https://arxiv.org/pdf/2412.11908
Copy Paste: [[2412.11908]] Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments(https://arxiv.org/abs/2412.11908)
Keywords: language model, gpt, llm
Abstract: In this paper we look at the ability of recent large language models (LLMs) to solve mathematical problems in combinatorics. We compare the models LLaMA-2, LLaMA-3.1, GPT-4, and Mixtral against each other and against human pupils and undergraduates with prior experience in mathematical olympiads. To facilitate these comparisons we introduce the Combi-Puzzles dataset, which contains 125 problem variants based on 25 combinatorial reasoning problems. Each problem is presented in one of five distinct forms, created by systematically manipulating the problem statements through adversarial additions, numeric parameter changes, and linguistic obfuscation. Our variations preserve the mathematical core and are designed to measure the generalisability of LLM problem-solving abilities, while also increasing confidence that problems are submitted to LLMs in forms that have not been seen as training instances. We found that a model based on GPT-4 outperformed all other models in producing correct responses, and performed significantly better in the mathematical variation of the problems than humans. We also found that modifications to problem statements significantly impact the LLM's performance, while human performance remains unaffected.
摘要：在本文中，我们研究了最近的大型语言模型 (LLM) 在解决组合数学问题方面的能力。我们将 LLaMA-2、LLaMA-3.1、GPT-4 和 Mixtral 模型相互比较，并与具有数学奥林匹克经验的人类小学生和大学生进行比较。为了便于进行比较，我们引入了 Combi-Puzzles 数据集，其中包含基于 25 个组合推理问题的 125 个问题变体。每个问题都以五种不同形式之一呈现，这些形式是通过对抗性添加、数字参数更改和语言混淆系统地操纵问题陈述而创建的。我们的变体保留了数学核心，旨在衡量 LLM 解决问题能力的通用性，同时也增加了对问题以未被视为训练实例的形式提交给 LLM 的信心。我们发现基于 GPT-4 的模型在产生正确答案方面优于所有其他模型，并且在问题的数学变体方面表现明显优于人类。我们还发现，问题陈述的修改会显著影响 LLM 的性能，而人类的性能则不受影响。

Title: CharacterBench: Benchmarking Character Customization of Large Language Models

Authors: Jinfeng Zhou, Yongkang Huang, Bosi Wen, Guanqun Bi, Yuxuan Chen, Pei Ke, Zhuang Chen, Xiyao Xiao, Libiao Peng, Kuntian Tang, Rongsheng Zhang, Le Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11912
Pdf URL: https://arxiv.org/pdf/2412.11912
Copy Paste: [[2412.11912]] CharacterBench: Benchmarking Character Customization of Large Language Models(https://arxiv.org/abs/2412.11912)
Keywords: language model, gpt, llm
Abstract: Character-based dialogue (aka role-playing) enables users to freely customize characters for interaction, which often relies on LLMs, raising the need to evaluate LLMs' character customization capability. However, existing benchmarks fail to ensure a robust evaluation as they often only involve a single character category or evaluate limited dimensions. Moreover, the sparsity of character features in responses makes feature-focused generative evaluation both ineffective and inefficient. To address these issues, we propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters from 25 detailed character categories. We define 11 dimensions of 6 aspects, classified as sparse and dense dimensions based on whether character features evaluated by specific dimensions manifest in each response. We enable effective and efficient evaluation by crafting tailored queries for each dimension to induce characters' responses related to specific dimensions. Further, we develop the CharacterJudge model for cost-effective and stable evaluations. Experiments show its superiority over SOTA automatic judges (e.g., GPT-4) and our benchmark's potential to optimize LLMs' character customization. Our repository is at this https URL.
摘要：基于角色的对话（又称角色扮演）使用户可以自由定制角色进行交互，这通常依赖于 LLM，因此需要评估 LLM 的角色定制能力。然而，现有的基准测试无法确保稳健的评估，因为它们通常只涉及单个角色类别或评估有限的维度。此外，响应中角色特征的稀疏性使得以特征为中心的生成评估既无效又低效。为了解决这些问题，我们提出了最大的双语生成基准测试 CharacterBench，它有 22,859 个人工注释样本，涵盖 25 个详细角色类别中的 3,956 个角色。我们定义了 6 个方面的 11 个维度，根据特定维度评估的角色特征是否体现在每个响应中，将其分为稀疏维度和密集维度。我们通过为每个维度设计定制查询来诱导与特定维度相关的角色响应，从而实现有效和高效的评估。此外，我们开发了 CharacterJudge 模型，以实现经济高效且稳定的评估。实验表明，它比 SOTA 自动评判器（例如 GPT-4）更优越，并且我们的基准测试有潜力优化 LLM 的角色定制。我们的存储库位于此 https URL。

Title: RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation

Authors: Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.11919
Pdf URL: https://arxiv.org/pdf/2412.11919
Copy Paste: [[2412.11919]] RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation(https://arxiv.org/abs/2412.11919)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) exhibit remarkable generative capabilities but often suffer from hallucinations. Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but existing methods still face several limitations: additional deployment costs of separate retrievers, redundant input tokens from retrieved text chunks, and the lack of joint optimization of retrieval and generation. To address these issues, we propose RetroLLM, a unified framework that integrates retrieval and generation into a single, cohesive process, enabling LLMs to directly generate fine-grained evidence from the corpus with constrained decoding. Moreover, to mitigate false pruning in the process of constrained evidence generation, we introduce (1) hierarchical FM-Index constraints, which generate corpus-constrained clues to identify a subset of relevant documents before evidence generation, reducing irrelevant decoding space; and (2) a forward-looking constrained decoding strategy, which considers the relevance of future sequences to improve evidence accuracy. Extensive experiments on five open-domain QA datasets demonstrate RetroLLM's superior performance across both in-domain and out-of-domain tasks. The code is available at this https URL.
摘要：大型语言模型 (LLM) 表现出卓越的生成能力，但经常会出现幻觉。检索增强生成 (RAG) 通过整合外部知识提供了一种有效的解决方案，但现有方法仍然面临一些限制：单独检索器的额外部署成本、检索文本块的冗余输入标记以及检索和生成缺乏联合优化。为了解决这些问题，我们提出了 \textbf{RetroLLM}，这是一个统一的框架，将检索和生成集成到一个单一的、有凝聚力的过程中，使 LLM 能够直接从语料库中生成具有约束解码的细粒度证据。此外，为了减轻约束证据生成过程中的错误剪枝，我们引入了 (1) 分层 FM-Index 约束，它在证据生成之前生成语料库约束的线索以识别相关文档的子集，从而减少不相关的解码空间；以及 (2) 前瞻性约束解码策略，它考虑未来序列的相关性以提高证据准确性。在五个开放域 QA 数据集上进行的大量实验证明了 RetroLLM 在域内和域外任务中的卓越性能。代码可在 \url{此 https URL} 处获取。

Title: PICLe: Pseudo-Annotations for In-Context Learning in Low-Resource Named Entity Detection

Authors: Sepideh Mamooler, Syrielle Montariol, Alexander Mathis, Antoine Bosselut
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11923
Pdf URL: https://arxiv.org/pdf/2412.11923
Copy Paste: [[2412.11923]] PICLe: Pseudo-Annotations for In-Context Learning in Low-Resource Named Entity Detection(https://arxiv.org/abs/2412.11923)
Keywords: language model, llm
Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to perform tasks using few demonstrations, facilitating task adaptation when labeled examples are hard to obtain. However, ICL is sensitive to the choice of demonstrations, and it remains unclear which demonstration attributes enable in-context generalization. In this work, we conduct a perturbation study of in-context demonstrations for low-resource Named Entity Detection (NED). Our surprising finding is that in-context demonstrations with partially correct annotated entity mentions can be as effective for task transfer as fully correct demonstrations. Based on our findings, we propose Pseudo-annotated In-Context Learning (PICLe), a framework for in-context learning with noisy, pseudo-annotated demonstrations. PICLe leverages LLMs to annotate many demonstrations in a zero-shot first pass. We then cluster these synthetic demonstrations, sample specific sets of in-context demonstrations from each cluster, and predict entity mentions using each set independently. Finally, we use self-verification to select the final set of entity mentions. We evaluate PICLe on five biomedical NED datasets and show that, with zero human annotation, PICLe outperforms ICL in low-resource settings where limited gold examples can be used as in-context demonstrations.
摘要：上下文学习 (ICL) 使大型语言模型 (LLM) 能够使用少量演示执行任务，从而在难以获得标记示例时促进任务适应。然而，ICL 对演示的选择很敏感，目前尚不清楚哪些演示属性能够实现上下文泛化。在这项工作中，我们对低资源命名实体检测 (NED) 的上下文演示进行了扰动研究。我们令人惊讶的发现是，带有部分正确注释实体提及的上下文演示对于任务转移的效果与完全正确的演示一样有效。根据我们的发现，我们提出了伪注释上下文学习 (PICLe)，这是一个使用嘈杂的伪注释演示进行上下文学习的框架。PICLe 利用 LLM 在零样本第一遍中注释许多演示。然后，我们对这些合成演示进行聚类，从每个聚类中抽取特定的上下文演示集，并使用每个集独立预测实体提及。最后，我们使用自我验证来选择最终的实体提及集。我们在五个生物医学 NED 数据集上评估了 PICLe，并表明，在零人工注释的情况下，PICLe 在资源匮乏的环境中表现优于 ICL，在这种环境中，有限的黄金示例可用作上下文演示。

Title: A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges

Authors: Yibo Yan, Jiamin Su, Jianxiang He, Fangteng Fu, Xu Zheng, Yuanhuiyi Lyu, Kun Wang, Shen Wang, Qingsong Wen, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11936
Pdf URL: https://arxiv.org/pdf/2412.11936
Copy Paste: [[2412.11936]] A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges(https://arxiv.org/abs/2412.11936)
Keywords: language model, llm
Abstract: Mathematical reasoning, a core aspect of human cognition, is vital across many domains, from educational problem-solving to scientific advancements. As artificial general intelligence (AGI) progresses, integrating large language models (LLMs) with mathematical reasoning tasks is becoming increasingly significant. This survey provides the first comprehensive analysis of mathematical reasoning in the era of multimodal large language models (MLLMs). We review over 200 studies published since 2021, and examine the state-of-the-art developments in Math-LLMs, with a focus on multimodal settings. We categorize the field into three dimensions: benchmarks, methodologies, and challenges. In particular, we explore the multimodal mathematical reasoning pipeline, as well as the role of (M)LLMs and the associated methodologies. Finally, we identify five major challenges hindering the realization of AGI in this domain, offering insights into the future direction for enhancing multimodal reasoning capabilities. This survey serves as a critical resource for the research community in advancing the capabilities of LLMs to tackle complex multimodal reasoning tasks.
摘要：数学推理是人类认知的一个核心方面，对从教育问题解决到科学进步等许多领域都至关重要。随着通用人工智能 (AGI) 的发展，将大型语言模型 (LLM) 与数学推理任务相结合变得越来越重要。本调查首次全面分析了多模态大型语言模型 (MLLM) 时代的数学推理。我们回顾了自 2021 年以来发表的 200 多项研究，并研究了数学 LLM 的最新发展，重点关注多模态设置。我们将该领域分为三个维度：基准、方法和挑战。特别是，我们探索了多模态数学推理流程，以及 (M)LLM 和相关方法的作用。最后，我们确定了阻碍该领域实现 AGI 的五大挑战，为增强多模态推理能力的未来方向提供了见解。这项调查是研究界提高 LLM 解决复杂多模态推理任务能力的重要资源。

Title: Precise Length Control in Large Language Models

Authors: Bradley Butcher, Michael O'Keefe, James Titchener
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11937
Pdf URL: https://arxiv.org/pdf/2412.11937
Copy Paste: [[2412.11937]] Precise Length Control in Large Language Models(https://arxiv.org/abs/2412.11937)
Keywords: language model, llm, chat
Abstract: Large Language Models (LLMs) are increasingly used in production systems, powering applications such as chatbots, summarization, and question answering. Despite their success, controlling the length of their response remains a significant challenge, particularly for tasks requiring structured outputs or specific levels of detail. In this work, we propose a method to adapt pre-trained decoder-only LLMs for precise control of response length. Our approach incorporates a secondary length-difference positional encoding (LDPE) into the input embeddings, which counts down to a user-set response termination length. Fine-tuning with LDPE allows the model to learn to terminate responses coherently at the desired length, achieving mean token errors of less than 3 tokens. We also introduce Max New Tokens++, an extension that enables flexible upper-bound length control, rather than an exact target. Experimental results on tasks such as question answering and document summarization demonstrate that our method enables precise length control without compromising response quality.
摘要：大型语言模型 (LLM) 在生产系统中的应用越来越广泛，为聊天机器人、摘要和问答等应用程序提供支持。尽管它们取得了成功，但控制其响应长度仍然是一项重大挑战，特别是对于需要结构化输出或特定细节级别的任务。在这项工作中，我们提出了一种方法来调整预训练的仅解码器 LLM，以精确控制响应长度。我们的方法将二次长度差异位置编码 (LDPE) 合并到输入嵌入中，该编码倒计时到用户设置的响应终止长度。使用 LDPE 进行微调允许模型学习以所需长度连贯地终止响应，从而实现小于 3 个标记的平均标记错误。我们还引入了 Max New Tokens++，这是一个扩展，可以实现灵活的上限长度控制，而不是精确的目标。在问答和文档摘要等任务上的实验结果表明，我们的方法能够精确控制长度而不会影响响应质量。
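
Note: the countdown idea behind LDPE can be illustrated with a minimal sketch. The abstract does not specify the functional form of the encoding, so the sinusoidal form below, the dimension, and the names countdown_positions and sinusoidal_encoding are assumptions made for illustration, not the authors' implementation.

```python
import math

def countdown_positions(target_len, num_generated):
    """Remaining-token count at each response position generated so far."""
    return [max(target_len - i, 0) for i in range(num_generated)]

def sinusoidal_encoding(value, dim=8):
    """Sinusoidal encoding of a countdown value (assumed form, hypothetical)."""
    enc = []
    for k in range(0, dim, 2):
        freq = 1.0 / (10000 ** (k / dim))
        enc.append(math.sin(value * freq))
        enc.append(math.cos(value * freq))
    return enc

# A user-set target of 5 tokens: the countdown runs 5, 4, 3, 2, 1.
remaining = countdown_positions(target_len=5, num_generated=5)
# In a real model these vectors would be added to the input embeddings.
encodings = [sinusoidal_encoding(r) for r in remaining]
```

Under this sketch, each response token carries a signal of how many tokens remain, which is the information a fine-tuned model could use to end its response at the requested length.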

Title: The Impact of Token Granularity on the Predictive Power of Language Model Surprisal

Authors: Byung-Doh Oh, William Schuler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11940
Pdf URL: https://arxiv.org/pdf/2412.11940
Copy Paste: [[2412.11940]] The Impact of Token Granularity on the Predictive Power of Language Model Surprisal(https://arxiv.org/abs/2412.11940)
Keywords: language model
Abstract: Word-by-word language model surprisal is often used to model the incremental processing of human readers, which raises questions about how various choices in language modeling influence its predictive power. One factor that has been overlooked in cognitive modeling is the granularity of subword tokens, which explicitly encodes information about word length and frequency, and ultimately influences the quality of vector representations that are learned. This paper presents experiments that manipulate the token granularity and evaluate its impact on the ability of surprisal to account for processing difficulty of naturalistic text and garden-path constructions. Experiments with naturalistic reading times reveal a substantial influence of token granularity on surprisal, with tokens defined by a vocabulary size of 8,000 resulting in surprisal that is most predictive. In contrast, on garden-path constructions, language models trained on coarser-grained tokens generally assigned higher surprisal to critical regions, suggesting their increased sensitivity to syntax. Taken together, these results suggest a large role of token granularity on the quality of language model surprisal for cognitive modeling.
摘要：逐字语言模型意外度通常用于模拟人类读者的增量处理，这引发了关于语言建模中的各种选择如何影响其预测能力的问题。认知建模中被忽视的一个因素是子词标记的粒度，它明确编码了有关单词长度和频率的信息，并最终影响学习到的向量表示的质量。本文介绍了操纵标记粒度的实验，并评估了其对意外度解释自然文本和花园路径构造处理难度的能力的影响。自然阅读时间的实验表明，标记粒度对意外度有显著影响，词汇量为 8,000 的标记产生的意外度最具预测性。相比之下，在花园路径构造中，用粗粒度标记训练的语言模型通常会将更高的意外度分配给关键区域，这表明它们对语法的敏感性增加。综合起来，这些结果表明标记粒度对于认知建模的语言模型惊喜质量起着很大的作用。
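
Note: for readers unfamiliar with the quantity, surprisal is the negative log probability of an item given its context, and by the chain rule the surprisal of a word is the sum of the surprisals of its subword tokens. The sketch below illustrates only that definition; the probabilities are made-up values and the function names are hypothetical, not taken from the paper.

```python
import math

def token_surprisal(prob):
    """Surprisal in bits of one token with conditional probability `prob`."""
    return -math.log2(prob)

def word_surprisal(token_probs):
    """Word-level surprisal: sum of the surprisals of the word's subword tokens."""
    return sum(token_surprisal(p) for p in token_probs)

# Hypothetical conditional probabilities for the same word under two tokenizers.
fine_grained = [0.5, 0.25]   # word split into two subword tokens
coarse_grained = [0.125]     # word kept as a single token

fine_bits = word_surprisal(fine_grained)
coarse_bits = word_surprisal(coarse_grained)
```

The two values coincide here only because the made-up probabilities multiply to the same word probability. In practice, models trained with different vocabulary sizes assign different probabilities, which is the effect the experiments measure.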

Title: Inferring Functionality of Attention Heads from their Parameters

Authors: Amit Elhelo, Mor Geva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11965
Pdf URL: https://arxiv.org/pdf/2412.11965
Copy Paste: [[2412.11965]] Inferring Functionality of Attention Heads from their Parameters(https://arxiv.org/abs/2412.11965)
Keywords: language model, llm
Abstract: Attention heads are one of the building blocks of large language models (LLMs). Prior work on investigating their operation mostly focused on analyzing their behavior during inference for specific circuits or tasks. In this work, we seek a comprehensive mapping of the operations they implement in a model. We propose MAPS (Mapping Attention head ParameterS), an efficient framework that infers the functionality of attention heads from their parameters, without any model training or inference. We showcase the utility of MAPS for answering two types of questions: (a) given a predefined operation, mapping how strongly heads across the model implement it, and (b) given an attention head, inferring its salient functionality. Evaluating MAPS on 20 operations across 6 popular LLMs shows its estimations correlate with the head's outputs during inference and are causally linked to the model's predictions. Moreover, its mappings reveal attention heads of certain operations that were overlooked in previous studies, and valuable insights on function universality and architecture biases in LLMs. Next, we present an automatic pipeline and analysis that leverage MAPS to characterize the salient operations of a given head. Our pipeline produces plausible operation descriptions for most heads, as assessed by human judgment, while revealing diverse operations.
摘要：注意力头是大型语言模型 (LLM) 的构建块之一。之前对其操作的研究主要集中在分析它们在特定电路或任务的推理过程中的行为。在这项工作中，我们寻求对它们在模型中实现的操作进行全面映射。我们提出了 MAPS（映射注意力头参数），这是一个高效的框架，可以从注意力头的参数推断出注意力头的功能，而无需任何模型训练或推理。我们展示了 MAPS 在回答两类问题方面的实用性：(a) 给定一个预定义的操作，映射模型中注意力头实现该操作的强度，以及 (b) 给定一个注意力头，推断其显著功能。在 6 个流行的 LLM 中对 20 个操作的 MAPS 评估表明，其估计值与注意力头在推理过程中的输出相关，并与模型的预测有因果关系。此外，它的映射揭示了以前研究中忽略的某些操作的注意力头，以及关于 LLM 中的功能通用性和架构偏差的宝贵见解。接下来，我们介绍一种自动流程和分析方法，利用 MAPS 来描述给定头部的突出操作。我们的流程为大多数头部生成了合理的操作描述（经人类判断评估），同时揭示了不同的操作。

Title: DARWIN 1.5: Large Language Models as Materials Science Adapted Learners

Authors: Tong Xie, Yuwei Wan, Yixuan Liu, Yuchen Zeng, Wenjie Zhang, Chunyu Kit, Dongzhan Zhou, Bram Hoex
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11970
Pdf URL: https://arxiv.org/pdf/2412.11970
Copy Paste: [[2412.11970]] DARWIN 1.5: Large Language Models as Materials Science Adapted Learners(https://arxiv.org/abs/2412.11970)
Keywords: language model, llm
Abstract: Materials discovery and design aim to find components and structures with desirable properties over highly complex and diverse search spaces. Traditional solutions, such as high-throughput simulations and machine learning (ML), often rely on complex descriptors, which hinder generalizability and transferability across tasks. Moreover, these descriptors may deviate from experimental data due to inevitable defects and purity issues in the real world, which may reduce their effectiveness in practical applications. To address these challenges, we propose Darwin 1.5, an open-source large language model (LLM) tailored for materials science. By leveraging natural language as input, Darwin eliminates the need for task-specific descriptors and enables a flexible, unified approach to material property prediction and discovery. We employ a two-stage training strategy combining question-answering (QA) fine-tuning with multi-task learning (MTL) to inject domain-specific knowledge in various modalities and facilitate cross-task knowledge transfer. Through our strategic approach, we achieved a significant enhancement in the prediction accuracy of LLMs, with a maximum improvement of 60% compared to LLaMA-7B base models. It further outperforms traditional machine learning models on various tasks in material science, showcasing the potential of LLMs to provide a more versatile and scalable foundation model for materials discovery and design.
摘要：材料发现和设计旨在在高度复杂和多样化的搜索空间中找到具有理想特性的组件和结构。传统的解决方案，例如高通量模拟和机器学习 (ML)，通常依赖于复杂的描述符，这阻碍了跨任务的通用性和可转移性。此外，由于现实世界中不可避免的缺陷和纯度问题，这些描述符可能会偏离实验数据，这可能会降低它们在实际应用中的有效性。为了应对这些挑战，我们提出了 Darwin 1.5，这是一个为材料科学量身定制的开源大型语言模型 (LLM)。通过利用自然语言作为输入，Darwin 消除了对特定于任务的描述符的需求，并实现了灵活、统一的材料属性预测和发现方法。我们采用结合问答 (QA) 微调和多任务学习 (MTL) 的两阶段训练策略，以各种模式注入领域特定知识并促进跨任务知识转移。通过我们的战略方法，我们显著提高了 LLM 的预测准确性，与 LLaMA-7B 基础模型相比，最大提高了 60\%。它在材料科学的各种任务中的表现进一步超越了传统的机器学习模型，展示了 LLM 为材料发现和设计提供更通用、更可扩展的基础模型的潜力。

Title: SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation

Authors: Debarshi Kundu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11988
Pdf URL: https://arxiv.org/pdf/2412.11988
Copy Paste: [[2412.11988]] SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation(https://arxiv.org/abs/2412.11988)
Keywords: language model, gpt, llm
Abstract: Consider the problem: "If one man and one woman can produce one child in one year, how many children will be produced by one woman and three men in 0.5 years?" Current large language models (LLMs) such as GPT-4o, GPT-o1-preview, and Gemini Flash frequently answer "0.5," which does not make sense. While these models sometimes acknowledge the unrealistic nature of the question, in many cases (8 out of 10 trials), they provide the nonsensical answer of "0.5 child." Additionally, temporal variation has been observed: if an LLM answers correctly once (by recognizing the faulty nature of the question), subsequent responses are more likely to also reflect this understanding. However, this is inconsistent. These types of questions have motivated us to develop a dataset of science questions, SciFaultyQA, where the questions themselves are intentionally faulty. We observed that LLMs often proceed to answer these flawed questions without recognizing their inherent issues, producing results that are logically or scientifically invalid. By analyzing such patterns, we developed a novel method for generating synthetic datasets to evaluate and benchmark the performance of various LLMs in identifying these flawed questions. We have also developed novel approaches to reduce the errors.
摘要：考虑一下这个问题：“如果一个男人和一个女人一年可以生一个孩子，那么一个女人和三个男人在 0.5 年内会生多少个孩子？” 当前的大型语言模型 (LLM)，例如 GPT-4o、GPT-o1-preview 和 Gemini Flash，经常会回答“0.5”，这毫无意义。虽然这些模型有时会承认问题的不切实际，但在许多情况下（10 次试验中有 8 次），它们会给出“0.5 个孩子”这个荒谬的答案。此外，还观察到了时间变化：如果 LLM 回答正确一次（通过认识到问题的错误性质），后续的回答也更有可能反映出这种理解。然而，这是不一致的。这些类型的问题促使我们开发了一个科学问题数据集 SciFaultyQA，其中的问题本身是故意有缺陷的。我们观察到，LLM 经常继续回答这些有缺陷的问题，而没有认识到它们固有的问题，从而产生在逻辑上或科学上无效的结果。通过分析这些模式，我们开发了一种生成合成数据集的新方法，以评估和基准化各种 LLM 在识别这些有缺陷的问题方面的表现。我们还开发了减少错误的新方法。

Title: ExecRepoBench: Multi-level Executable Code Completion Evaluation

Authors: Jian Yang, Jiajun Zhang, Jiaxi Yang, Ke Jin, Lei Zhang, Qiyao Peng, Ken Deng, Yibo Miao, Tianyu Liu, Zeyu Cui, Binyuan Hui, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11990
Pdf URL: https://arxiv.org/pdf/2412.11990
Copy Paste: [[2412.11990]] ExecRepoBench: Multi-level Executable Code Completion Evaluation(https://arxiv.org/abs/2412.11990)
Keywords: language model, llm
Abstract: Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant challenges, including limited context length, reliance on superficial evaluation metrics, and potential overfitting to training datasets. In this work, we introduce a novel framework for enhancing code completion in software development through the creation of a repository-level benchmark, ExecRepoBench, and the instruction corpus Repo-Instruct, aimed at improving the functionality of open-source large language models (LLMs) in real-world coding scenarios that involve complex interdependencies across multiple files. ExecRepoBench includes 1.2K samples from active Python repositories. In addition, we present a multi-level grammar-based completion methodology conditioned on the abstract syntax tree to mask code fragments at various logical units (e.g. statements, expressions, and functions). We then fine-tune an open-source LLM with 7B parameters on Repo-Instruct to produce a strong code completion baseline model, Qwen2.5-Coder-Instruct-C. Qwen2.5-Coder-Instruct-C is rigorously evaluated on existing benchmarks, including MultiPL-E and ExecRepoBench, and consistently outperforms prior baselines across all programming languages. The deployed model can be used as a high-performance, local service for programming development (this https URL).
摘要：代码补全已成为日常软件开发必不可少的工具。现有的评估基准通常采用静态方法，无法完全捕捉真实世界编码环境的动态特性，并且面临重大挑战，包括上下文长度有限、依赖表面评估指标以及对训练数据集的潜在过度拟合。在这项工作中，我们引入了一个通过创建存储库级基准 ExecRepoBench 和指令语料库 Repo-Instruct 来增强软件开发中的代码补全的新框架，旨在提高开源大型语言模型 (LLM) 在涉及多个文件之间复杂相互依赖的真实世界编码场景中的功能。ExecRepoBench 包含来自活跃 Python 存储库的 1.2K 个样本。此外，我们提出了一种基于抽象语法树的多级语法补全方法，以屏蔽各种逻辑单元（例如语句、表达式和函数）的代码片段。然后，我们在 Repo-Instruct 上对具有 7B 参数的开源 LLM 进行微调，以基于开源模型生成强大的代码完成基线模型 Qwen2.5-Coder-Instruct-C。Qwen2.5-Coder-Instruct-C 经过现有基准测试的严格评估，包括 MultiPL-E 和 ExecRepoBench，其在所有编程语言中的表现始终优于之前的基线。\ourmethod{} 的部署可用作编程开发的高性能本地服务\footnote{\url{此 https URL}}。

Title: LLM-RG4: Flexible and Factual Radiology Report Generation across Diverse Input Contexts

Authors: Zhuhao Wang, Yihua Sun, Zihan Li, Xuan Yang, Fang Chen, Hongen Liao
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2412.12001
Pdf URL: https://arxiv.org/pdf/2412.12001
Copy Paste: [[2412.12001]] LLM-RG4: Flexible and Factual Radiology Report Generation across Diverse Input Contexts(https://arxiv.org/abs/2412.12001)
Keywords: language model, llm, hallucination
Abstract: Drafting radiology reports is a complex task requiring flexibility, where radiologists tailor content to available information and particular clinical demands. However, most current radiology report generation (RRG) models are constrained to a fixed task paradigm, such as predicting the full "finding" section from a single image, inherently involving a mismatch between inputs and outputs. The trained models lack the flexibility for diverse inputs and could generate harmful, input-agnostic hallucinations. To bridge the gap between current RRG models and the clinical demands in practice, we first develop a data generation pipeline to create a new MIMIC-RG4 dataset, which considers four common radiology report drafting scenarios and has perfectly corresponding inputs and outputs. Secondly, we propose a novel large language model (LLM) based RRG framework, namely LLM-RG4, which utilizes LLM's flexible instruction-following capabilities and extensive general knowledge. We further develop an adaptive token fusion module that offers flexibility to handle diverse scenarios with different input combinations, while minimizing the additional computational burden associated with increased input volumes. Besides, we propose a token-level loss weighting strategy to direct the model's attention towards positive and uncertain descriptions. Experimental results demonstrate that LLM-RG4 achieves state-of-the-art performance in both clinical efficiency and natural language generation on the MIMIC-RG4 and MIMIC-CXR datasets. We quantitatively demonstrate that our model has minimal input-agnostic hallucinations, whereas current open-source models commonly suffer from this problem.
摘要：起草放射学报告是一项复杂的任务，需要灵活性，放射科医生需要根据可用信息和特定临床需求来调整内容。然而，大多数当前的放射学报告生成 (RRG) 模型都局限于固定的任务范式，例如从单个图像预测完整的“发现”部分，这本身就涉及输入和输出之间的不匹配。训练后的模型缺乏对各种输入的灵活性，可能会产生有害的、与输入无关的幻觉。为了弥合当前 RRG 模型与实践中的临床需求之间的差距，我们首先开发了一个数据生成管道来创建一个新的 MIMIC-RG4 数据集，该数据集考虑了四种常见的放射学报告起草场景，并且输入和输出完全对应。其次，我们提出了一种基于大型语言模型 (LLM) 的新型 RRG 框架，即 LLM-RG4，它利用了 LLM 灵活的指令遵循能力和广泛的一般知识。我们进一步开发了一个自适应令牌融合模块，该模块可以灵活地处理具有不同输入组合的各种场景，同时最大限度地减少与增加输入量相关的额外计算负担。此外，我们提出了一种 token 级损失加权策略，以将模型的注意力引向积极和不确定的描述。实验结果表明，LLM-RG4 在 MIMIC-RG4 和 MIMIC-CXR 数据集上的临床效率和自然语言生成方面均达到了最佳性能。我们定量地证明了我们的模型具有最小的输入不可知幻觉，而当前的开源模型通常存在此问题。

Title: The Open Source Advantage in Large Language Models (LLMs)

Authors: Jiya Manchanda, Laura Boettcher, Matheus Westphalen, Jasser Jasser
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12004
Pdf URL: https://arxiv.org/pdf/2412.12004
Copy Paste: [[2412.12004]] The Open Source Advantage in Large Language Models (LLMs)(https://arxiv.org/abs/2412.12004)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) mark a key shift in natural language processing (NLP), having advanced text generation, translation, and domain-specific reasoning. Closed-source models like GPT-4, powered by proprietary datasets and extensive computational resources, lead with state-of-the-art performance today. However, they face criticism for their "black box" nature and for limiting accessibility in a manner that hinders reproducibility and equitable AI development. By contrast, open-source initiatives like LLaMA and BLOOM prioritize democratization through community-driven development and computational efficiency. These models have significantly reduced performance gaps, particularly in linguistic diversity and domain-specific applications, while providing accessible tools for global researchers and developers. Notably, both paradigms rely on foundational architectural innovations, such as the Transformer framework by Vaswani et al. (2017). Closed-source models excel by scaling effectively, while open-source models adapt to real-world applications in underrepresented languages and domains. Techniques like Low-Rank Adaptation (LoRA) and instruction-tuning datasets enable open-source models to achieve competitive results despite limited resources. To be sure, the tension between closed-source and open-source approaches underscores a broader debate on transparency versus proprietary control in AI. Ethical considerations further highlight this divide. Closed-source systems restrict external scrutiny, while open-source models promote reproducibility and collaboration but lack standardized auditing documentation frameworks to mitigate biases. Hybrid approaches that leverage the strengths of both paradigms are likely to shape the future of LLM innovation, ensuring accessibility, competitive technical performance, and ethical deployment.
摘要：大型语言模型 (LLM) 标志着自然语言处理 (NLP) 的一个关键转变，具有先进的文本生成、翻译和领域特定推理功能。GPT-4 等闭源模型由专有数据集和大量计算资源提供支持，目前性能领先。然而，它们因其“黑箱”性质以及以阻碍可重复性和公平的人工智能开发的方式限制可访问性而受到批评。相比之下，LLaMA 和 BLOOM 等开源计划通过社区驱动的开发和计算效率优先考虑民主化。这些模型显著缩小了性能差距，特别是在语言多样性和领域特定应用方面，同时为全球研究人员和开发人员提供了可访问的工具。值得注意的是，这两种范式都依赖于基础架构创新，例如 Vaswani 等人 (2017) 的 Transformer 框架。闭源模型通过有效扩展而出类拔萃，而开源模型则适应代表性不足的语言和领域的实际应用。低秩自适应 (LoRA) 和指令调整数据集等技术使开源模型能够在资源有限的情况下取得具有竞争力的结果。可以肯定的是，闭源和开源方法之间的紧张关系凸显了关于人工智能透明度与专有控制的更广泛争论。道德考虑进一步凸显了这种分歧。闭源系统限制了外部审查，而开源模型促进了可重复性和协作，但缺乏标准化的审计文档框架来减轻偏见。利用两种范式优势的混合方法可能会塑造 LLM 创新的未来，确保可访问性、具有竞争力的技术性能和道德部署。

Title: How Private are Language Models in Abstractive Summarization?

Authors: Anthony Hughes, Nikolaos Aletras, Ning Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12040
Pdf URL: https://arxiv.org/pdf/2412.12040
Copy Paste: [[2412.12040]] How Private are Language Models in Abstractive Summarization?(https://arxiv.org/abs/2412.12040)
Keywords: language model, prompt
Abstract: Language models (LMs) have shown outstanding performance in text summarization, including in sensitive domains such as medicine and law. In these settings, it is important that personally identifying information (PII) included in the source document does not leak into the summary. Prior efforts have mostly focused on studying how LMs may inadvertently elicit PII from training data. However, to what extent LMs can provide privacy-preserving summaries given a non-private source document remains under-explored. In this paper, we perform a comprehensive study across two closed- and three open-weight LMs of different sizes and families. We experiment with prompting and fine-tuning strategies for privacy-preservation across a range of summarization datasets across three domains. Our extensive quantitative and qualitative analysis including human evaluation shows that LMs often cannot prevent PII leakage in their summaries and that current widely-used metrics cannot capture context-dependent privacy risks.
摘要：语言模型 (LM) 在文本摘要方面表现优异，包括医学和法律等敏感领域。在这些设置中，重要的是源文档中包含的个人身份信息 (PII) 不应在摘要中泄露。之前的努力主要集中在研究 LM 如何无意中从训练数据中获取 PII。但是，对于非隐私源文档，LM 能在多大程度上提供保护隐私的摘要仍未得到充分探索。在本文中，我们对两个不同大小和系列的封闭式和三个开放式 LM 进行了全面研究。我们尝试了在三个领域的一系列摘要数据集中提示和微调隐私保护策略。我们包括人工评估在内的广泛定量和定性分析表明，LM 通常无法防止其摘要中的 PII 泄露，并且当前广泛使用的指标无法捕捉依赖于上下文的隐私风险。

Title: Making FETCH! Happen: Finding Emergent Dog Whistles Through Common Habitats

Authors: Kuleen Sasse, Carlos Aguirre, Isabel Cachola, Sharon Levy, Mark Dredze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12072
Pdf URL: https://arxiv.org/pdf/2412.12072
Copy Paste: [[2412.12072]] Making FETCH! Happen: Finding Emergent Dog Whistles Through Common Habitats(https://arxiv.org/abs/2412.12072)
Keywords: language model, llm
Abstract: WARNING: This paper contains content that may be upsetting or offensive to some readers. Dog whistles are coded expressions with dual meanings: one intended for the general public (outgroup) and another that conveys a specific message to an intended audience (ingroup). Often, these expressions are used to convey controversial political opinions while maintaining plausible deniability and slipping by content moderation filters. Identification of dog whistles relies on curated lexicons, which have trouble keeping up to date. We introduce FETCH!, a task for finding novel dog whistles in massive social media corpora. We find that state-of-the-art systems fail to achieve meaningful results across three distinct social media case studies. We present EarShot, a novel system that combines the strengths of vector databases and Large Language Models (LLMs) to efficiently and effectively identify new dog whistles.

Title: SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

Authors: Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12094
Pdf URL: https://arxiv.org/pdf/2412.12094
Copy Paste: [[2412.12094]] SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator(https://arxiv.org/abs/2412.12094)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capabilities.