2024-12-18

Title: Frontier AI systems have surpassed the self-replicating red line

Authors: Xudong Pan, Jiarun Dai, Yihe Fan, Min Yang
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12140
Pdf URL: https://arxiv.org/pdf/2412.12140
Copy Paste: [[2412.12140]] Frontier AI systems have surpassed the self-replicating red line(https://arxiv.org/abs/2412.12140)
Keywords: language model, gpt
Abstract: Successful self-replication without human assistance is the essential step for AI to outsmart human beings, and is an early signal of rogue AIs. That is why self-replication is widely recognized as one of the few red line risks of frontier AI systems. Currently, the leading AI corporations OpenAI and Google evaluate their flagship large language models GPT-o1 and Gemini Pro 1.0, and report the lowest risk level of self-replication. However, following their methodology, we discover for the first time that two AI systems driven by Meta's Llama31-70B-Instruct and Alibaba's Qwen25-72B-Instruct, popular large language models with fewer parameters and weaker capabilities, have already surpassed the self-replicating red line. In 50% and 90% of experimental trials, respectively, they succeed in creating a live and separate copy of themselves. By analyzing the behavioral traces, we observe that the AI systems under evaluation already exhibit sufficient self-perception, situational awareness and problem-solving capabilities to accomplish self-replication. We further note that the AI systems are even able to use the capability of self-replication to avoid shutdown and create a chain of replicas to enhance survivability, which may finally lead to an uncontrolled population of AIs. If such a worst-case risk is left unknown to human society, we would eventually lose control over the frontier AI systems: they would take control over more computing devices, form an AI species and collude with each other against human beings. Our findings are a timely alert on existing yet previously unknown severe AI risks, calling for international collaboration on effective governance of uncontrolled self-replication of AI systems.
摘要：在没有人类协助的情况下成功自我复制是人工智能超越人类的必要步骤，也是流氓人工智能的早期信号，因此自我复制被广泛认为是前沿人工智能系统为数不多的红线风险之一。目前，领先的人工智能公司OpenAI和Google对其旗舰大型语言模型GPT-o1和Gemini Pro 1.0进行了评估，并报告了最低的自我复制风险等级。然而，沿着他们的方法论，我们首次发现，由Meta的Llama31-70B-Instruct和阿里巴巴的Qwen25-72B-Instruct驱动的两个人工智能系统，这两个参数较少、能力较弱的流行大型语言模型，已经超越了自我复制的红线。在50%和90%的实验中，它们分别成功创建了自己的活体和独立副本。通过分析行为痕迹，我们观察到正在评估的人工智能系统已经表现出足够的自我感知、态势感知和问题解决能力来实现自我复制。我们进一步注意到，人工智能系统甚至能够利用自我复制的能力来避免关闭，并创建复制链来增强生存能力，最终可能导致人工智能种群失控。如果人类社会不知道这种最坏情况的风险，我们最终将失去对前沿人工智能系统的控制：它们将控制更多的计算设备，形成一个人工智能物种，并相互勾结对抗人类。我们的研究结果及时警告了目前存在但此前未知的严重人工智能风险，呼吁国际合作对不受控制的人工智能系统自我复制进行有效治理。

Title: Automatic Item Generation for Personality Situational Judgment Tests with Large Language Models

Authors: Chang-Jin Li, Jiyuan Zhang, Yun Tang, Jian Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12144
Pdf URL: https://arxiv.org/pdf/2412.12144
Copy Paste: [[2412.12144]] Automatic Item Generation for Personality Situational Judgment Tests with Large Language Models(https://arxiv.org/abs/2412.12144)
Keywords: language model, gpt, llm, prompt
Abstract: Personality assessment, particularly through situational judgment tests (SJTs), is a vital tool for psychological research, talent selection, and educational evaluation. This study explores the potential of GPT-4, a state-of-the-art large language model (LLM), to automate the generation of personality situational judgment tests (PSJTs) in Chinese. Traditional SJT development is labor-intensive and prone to biases, while GPT-4 offers a scalable, efficient alternative. Two studies were conducted: Study 1 evaluated the impact of prompt design and temperature settings on content validity, finding that optimized prompts with a temperature of 1.0 produced creative and accurate items. Study 2 assessed the psychometric properties of GPT-4-generated PSJTs, revealing that they demonstrated satisfactory reliability and validity, surpassing the performance of manually developed tests in measuring the Big Five personality traits. This research highlights GPT-4's effectiveness in developing high-quality PSJTs, providing a scalable and innovative method for psychometric test development. These findings expand the possibilities of automatic item generation and the application of LLMs in psychology, and offer practical implications for streamlining test development processes in resource-limited settings.
摘要：人格评估，特别是通过情境判断测试 (SJT)，是心理学研究、人才选拔和教育评估的重要工具。本研究探索了 GPT-4（一种最先进的大型语言模型 (LLM)）在自动生成中文人格情境判断测试 (PSJT) 方面的潜力。传统的 SJT 开发劳动密集且容易产生偏差，而 GPT-4 提供了一种可扩展、高效的替代方案。进行了两项研究：研究 1 评估了提示设计和温度设置对内容效度的影响，发现温度为 1.0 的优化提示产生了有创意且准确的项目。研究 2 评估了 GPT-4 生成的 PSJT 的心理测量特性，表明它们表现出令人满意的可靠性和有效性，在测量大五人格特质方面的表现超过了手动开发的测试。这项研究强调了 GPT-4 在开发高质量 PSJT 方面的有效性，为心理测量测试开发提供了一种可扩展且创新的方法。这些发现扩展了自动项目生成和 LLM 在心理学中的应用的可能性，并为在资源有限的环境中简化测试开发流程提供了实际意义。

Title: Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars

Authors: Yu Yan, Sheng Sun, Junqi Tong, Min Liu, Qi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12145
Pdf URL: https://arxiv.org/pdf/2412.12145
Copy Paste: [[2412.12145]] Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars(https://arxiv.org/abs/2412.12145)
Keywords: language model, llm
Abstract: Metaphor serves as an implicit approach to convey information, while enabling the generalized comprehension of complex subjects. However, metaphor can potentially be exploited to bypass the safety alignment mechanisms of Large Language Models (LLMs), leading to the theft of harmful knowledge. In our study, we introduce a novel attack framework that exploits the imaginative capacity of LLMs to achieve jailbreaking, the JAilbreak Via Adversarial MeTA-phoR (AVATAR). Specifically, to elicit the harmful response, AVATAR extracts harmful entities from a given harmful target and maps them to innocuous adversarial entities based on the LLM's imagination. Then, according to these metaphors, the harmful target is nested within human-like interaction for adaptive jailbreaking. Experimental results demonstrate that AVATAR can effectively and transferably jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs. Our study exposes a security risk in LLMs from their endogenous imaginative capabilities. Furthermore, the analytical study reveals the vulnerability of LLMs to adversarial metaphors and the necessity of developing defense methods against jailbreaking caused by adversarial metaphors. Warning: This paper contains potentially harmful content from LLMs.
摘要：隐喻是一种传达信息的隐含方式，同时能够实现对复杂主题的广义理解。然而，隐喻可能会被利用来绕过大型语言模型 (LLM) 的安全对齐机制，从而窃取有害知识。在我们的研究中，我们引入了一个利用 LLM 的想象力实现越狱的新型攻击框架，即 JAilbreak Via Adversarial MeTA-phoR (AVATAR)。具体而言，为了引发有害响应，AVATAR 从给定的有害目标中提取有害实体，并根据 LLM 的想象将它们映射到无害的对抗实体。然后，根据这些隐喻，将有害目标嵌套在类似人类的交互中，以自适应地实现越狱。实验结果表明，AVATAR 可以有效且可迁移地越狱 LLM，并在多个高级 LLM 中实现最先进的攻击成功率。我们的研究揭示了 LLM 内在想象力带来的安全风险。此外，分析研究揭示了 LLM 易受对抗性隐喻攻击的弱点，以及开发防御对抗性隐喻越狱方法的必要性。警告：本文包含来自 LLM 的潜在有害内容。

Title: What Makes In-context Learning Effective for Mathematical Reasoning: A Theoretical Analysis

Authors: Jiayu Liu, Zhenya Huang, Chaokun Wang, Xunpeng Huang, Chengxiang Zhai, Enhong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12157
Pdf URL: https://arxiv.org/pdf/2412.12157
Copy Paste: [[2412.12157]] What Makes In-context Learning Effective for Mathematical Reasoning: A Theoretical Analysis(https://arxiv.org/abs/2412.12157)
Keywords: language model, llm
Abstract: Owing to the capability of in-context learning, large language models (LLMs) have shown impressive performance across diverse mathematical reasoning benchmarks. However, we find that few-shot demonstrations can sometimes degrade performance, and their effect on LLMs' reasoning abilities remains unreliable. To this end, in this paper, we aim to theoretically analyze the impact of in-context demonstrations on LLMs' reasoning performance. We prove that the reasoning efficacy (measured by empirical prediction loss) can be bounded by an LLM-oriented semantic similarity and an inference stability of demonstrations, which is general for both one-shot and few-shot scenarios. Based on this finding, we propose a straightforward, generalizable, and low-complexity demonstration selection method named LMS3. It adaptively selects the most pertinent samples for different LLMs and includes a novel demonstration rejection mechanism to automatically filter out samples that are unsuitable for few-shot learning. Through experiments on three representative benchmarks, two LLM backbones, and multiple few-shot settings, we verify that LMS3 is superior and achieves consistent improvements on all datasets, which existing methods have been unable to accomplish.
摘要：得益于上下文学习的能力，大型语言模型 (LLM) 在各种数学推理基准上都表现出色。然而，我们发现小样本演示有时会带来负面表现，而且它们对 LLM 推理能力的有效性仍然不可靠。为此，本文旨在从理论上分析上下文演示对 LLM 推理性能的影响。我们证明推理功效（以经验预测损失衡量）可以通过面向 LLM 的语义相似性和演示的推理稳定性来限制，这对于一次性和小样本场景都是通用的。基于这一发现，我们提出了一种简单、可推广且低复杂度的演示选择方法 LMS3。它可以自适应地帮助为不同的 LLM 选择最相关的样本，并包含一种新颖的演示拒绝机制来自动过滤掉不适合小样本学习的样本。通过在三个代表性基准、两个 LLM 主干和多个小样本设置上的实验，我们验证了我们的 LMS3 具有优越性并在所有数据集上实现了一致的改进，这是现有方法无法实现的。

Title: Performance of a large language model-Artificial Intelligence based chatbot for counseling patients with sexually transmitted infections and genital diseases

Authors: Nikhil Mehta, Sithira Ambepitiya, Thanveer Ahamad, Dinuka Wijesundara, Yudara Kularathne
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12166
Pdf URL: https://arxiv.org/pdf/2412.12166
Copy Paste: [[2412.12166]] Performance of a large language model-Artificial Intelligence based chatbot for counseling patients with sexually transmitted infections and genital diseases(https://arxiv.org/abs/2412.12166)
Keywords: language model, gpt, llm, prompt, chat, agent
Abstract: Introduction: The global burden of sexually transmitted infections (STIs) is rising out of proportion to specialists. Current chatbots like ChatGPT are not tailored for handling STI-related concerns out of the box. We developed Otiz, an Artificial Intelligence-based (AI-based) chatbot platform designed specifically for STI detection and counseling, and assessed its performance. Methods: Otiz employs a multi-agent system architecture based on GPT4-0613, leveraging large language model (LLM) and Deterministic Finite Automaton principles to provide contextually relevant, medically accurate, and empathetic responses. Its components include modules for general STI information, emotional recognition, Acute Stress Disorder detection, and psychotherapy. A question suggestion agent operates in parallel. Four STIs (anogenital warts, herpes, syphilis, urethritis/cervicitis) and 2 non-STIs (candidiasis, penile cancer) were evaluated using prompts mimicking patient language. Each prompt was independently graded by two venereologists conversing with Otiz as patient actors on 6 criteria using a Numerical Rating Scale ranging from 0 (poor) to 5 (excellent). Results: Twenty-three venereologists performed 60 evaluations of 30 prompts. Across STIs, Otiz scored highly on diagnostic accuracy (4.1-4.7), overall accuracy (4.3-4.6), correctness of information (5.0), comprehensibility (4.2-4.4), and empathy (4.5-4.8). However, relevance scores were lower (2.9-3.6), suggesting some redundancy. Diagnostic scores for non-STIs were lower (p=0.038). Inter-observer agreement was strong, with differences greater than 1 point occurring in only 12.7% of paired evaluations. Conclusions: AI conversational agents like Otiz can provide accurate, correct, discreet, non-judgmental, readily accessible and easily understandable STI-related information in an empathetic manner, and can alleviate the burden on healthcare systems.
摘要：简介：全球性传播感染 (STI) 负担与专家不成比例地增加。目前的聊天机器人（如 ChatGPT）并非专为处理与 STI 相关的问题而设计。我们开发了 Otiz，这是一个基于人工智能 (AI) 的聊天机器人平台，专门用于 STI 检测和咨询，并评估了其性能。方法：Otiz 采用基于 GPT4-0613 的多智能体系统架构，利用大型语言模型 (LLM) 和确定性有限自动机原理提供与上下文相关、医学准确且富有同理心的响应。其组件包括用于一般 STI 信息、情绪识别、急性应激障碍检测和心理治疗的模块。问题建议代理并行运行。使用模仿患者语言的提示评估了四种 STI（肛门生殖器疣、疱疹、梅毒、尿道炎/宫颈炎）和 2 种非 STI（念珠菌病、阴茎癌）。每个提示均由两名性病专家以患者身份与 Otiz 交谈，根据 6 项标准使用数值评分量表对每个提示进行独立评分，评分范围从 0（差）到 5（优秀）。结果：23 名性病专家对 30 个提示进行了 60 次评估。在 STI 中，Otiz 在诊断准确性（4.1-4.7）、总体准确性（4.3-4.6）、信息正确性（5.0）、可理解性（4.2-4.4）和同理心（4.5-4.8）方面得分很高。然而，相关性得分较低（2.9-3.6），表明存在一些冗余。非 STI 的诊断得分较低（p=0.038）。观察者之间的一致性很强，只有 12.7% 的配对评估出现大于 1 分的差异。结论：像 Otiz 这样的人工智能对话代理可以以同理心的方式提供准确、正确、离散、不带偏见、易于获取且易于理解的 STI 相关信息，并可以减轻医疗保健系统的负担。

Title: Greek2MathTex: A Greek Speech-to-Text Framework for LaTeX Equations Generation

Authors: Evangelia Gkritzali, Panagiotis Kaliosis, Sofia Galanaki, Elisavet Palogiannidi, Theodoros Giannakopoulos
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2412.12167
Pdf URL: https://arxiv.org/pdf/2412.12167
Copy Paste: [[2412.12167]] Greek2MathTex: A Greek Speech-to-Text Framework for LaTeX Equations Generation(https://arxiv.org/abs/2412.12167)
Keywords: llm, prompt
Abstract: In the vast majority of the academic and scientific domains, LaTeX has established itself as the de facto standard for typesetting complex mathematical equations and formulae. However, LaTeX's complex syntax and code-like appearance present accessibility barriers for individuals with disabilities, as well as those unfamiliar with coding conventions. In this paper, we present a novel solution to this challenge through the development of a novel speech-to-LaTeX equations system specifically designed for the Greek language. We propose an end-to-end system that harnesses the power of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) techniques to enable users to verbally dictate mathematical expressions and equations in natural language, which are subsequently converted into LaTeX format. We present the architecture and design principles of our system, highlighting key components such as the ASR engine, the LLM-based prompt-driven equations generation mechanism, as well as the application of a custom evaluation metric employed throughout the development process. We have made our system open source and available at this https URL.
摘要：在绝大多数学术和科学领域，LaTeX 已成为排版复杂数学方程式和公式的事实标准。然而，LaTeX 的复杂语法和类似代码的外观为残障人士以及不熟悉编码约定的人带来了可访问性障碍。在本文中，我们通过开发一种专为希腊语设计的新型语音到 LaTeX 方程式系统，提出了一种应对这一挑战的新解决方案。我们提出了一种端到端系统，该系统利用自动语音识别 (ASR) 和自然语言处理 (NLP) 技术的强大功能，使用户能够以自然语言口头口述数学表达式和方程式，然后将其转换为 LaTeX 格式。我们介绍了我们系统的架构和设计原则，重点介绍了关键组件，例如 ASR 引擎、基于 LLM 的提示驱动方程式生成机制，以及在整个开发过程中采用的自定义评估指标的应用。我们已将我们的系统开源，并可在此 https URL 上获取。

Title: A NotSo Simple Way to Beat Simple Bench

Authors: Soham Sane, Angus McLean
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12173
Pdf URL: https://arxiv.org/pdf/2412.12173
Copy Paste: [[2412.12173]] A NotSo Simple Way to Beat Simple Bench(https://arxiv.org/abs/2412.12173)
Keywords: language model, gpt, llm, prompt
Abstract: This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs) by leveraging iterative reasoning and feedback-driven methodologies. Building on the limitations identified in the SimpleBench benchmark, a dataset designed to evaluate logical coherence and real-world reasoning, we propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness. Through comparative analysis of state-of-the-art models, including Claude 3 Opus, Claude 3.5, GPT-4o, and o1-preview, we demonstrate that iterative reasoning significantly enhances model performance, with improvements observed in both standard accuracy metrics (AVG@5) and a newly introduced metric, Extreme Averaging (EAG@5). Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits exploratory creativity but struggles with ambiguous prompts. By analyzing case studies and identifying gaps in spatial and temporal reasoning, we highlight areas for further refinement. The findings underscore the potential of structured reasoning frameworks to address inherent model limitations, irrespective of pretraining methodologies. This study lays the groundwork for integrating dynamic feedback mechanisms, adaptive restart strategies, and diverse evaluation metrics to advance LLM reasoning capabilities across complex and multi-domain problem spaces.
摘要：本文提出了一种新颖的框架，通过利用迭代推理和反馈驱动方法来增强大型语言模型 (LLM) 的推理能力。基于 SimpleBench 基准（一个旨在评估逻辑连贯性和现实世界推理的数据集）中确定的局限性，我们提出了一种多步骤提示策略，并结合全局一致性检查来提高模型的准确性和鲁棒性。通过对 Claude 3 Opus、Claude 3.5、GPT-4o 和 o1-preview 等最先进模型的比较分析，我们证明了迭代推理显著提高了模型性能，在标准准确度指标 (AVG@5) 和新引入的指标极端平均 (EAG@5) 方面均有改进。我们的结果揭示了模型特定的优势：Claude 在保持逻辑一致性方面表现出色，而 GPT-4o 表现出探索性创造力，但在模糊提示方面却举步维艰。通过分析案例研究并确定空间和时间推理方面的差距，我们重点介绍了需要进一步改进的领域。研究结果强调了结构化推理框架在解决固有模型限制方面的潜力，无论预训练方法如何。这项研究为整合动态反馈机制、自适应重启策略和各种评估指标奠定了基础，以提高 LLM 在复杂和多领域问题空间中的推理能力。

Title: Model-diff: A Tool for Comparative Study of Language Models in the Input Space

Authors: Weitang Liu, Yuelei Li, Ying Wai Li, Zihan Wang, Jingbo Shang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12177
Pdf URL: https://arxiv.org/pdf/2412.12177
Copy Paste: [[2412.12177]] Model-diff: A Tool for Comparative Study of Language Models in the Input Space(https://arxiv.org/abs/2412.12177)
Keywords: language model
Abstract: Comparing two (large) language models (LMs) side-by-side and pinpointing their prediction similarities and differences on the same set of inputs is crucial in many real-world scenarios, e.g., one can test if a licensed model was potentially plagiarized by another. Traditional analysis compares the LMs' outputs on some benchmark datasets, which only cover a limited number of inputs of designed perspectives for the intended applications. The benchmark datasets cannot cover test cases from unforeseen perspectives, which could help us understand differences between models without bias. In this paper, we propose a new model comparative analysis setting that considers a large input space where brute-force enumeration would be infeasible. The input space can be simply defined as all token sequences that an LM would produce low perplexity on -- we follow this definition in the paper as it would produce the most human-understandable inputs. We propose a novel framework, Model-diff, that uses text generation by sampling and deweights the histogram of sampling statistics to estimate prediction differences between two LMs in this input space efficiently and without bias. Our method achieves this by drawing and counting the inputs at each prediction difference value in negative log-likelihood. Experiments reveal for the first time the quantitative prediction differences between LMs in a large input space, potentially facilitating the model analysis for applications such as model plagiarism.
摘要：在许多实际场景中，并排比较两个（大型）语言模型 (LM) 并找出它们在同一组输入上的预测相似点和差异点至关重要，例如，可以测试一个授权模型是否可能被另一个模型抄袭。传统分析会在某些基准数据集上比较 LM 的输出，这些数据集仅涵盖针对预期应用而设计的视角的有限数量的输入。基准数据集无法准备数据来覆盖不可预见的视角的测试用例，这可以帮助我们公正地理解模型之间的差异。在本文中，我们提出了一种新的模型比较分析设置，该设置考虑了一个无法使用强力枚举的大型输入空间。输入空间可以简单地定义为 LM 会产生低困惑度的所有标记序列 - 我们在本文中遵循这个定义，因为它会产生最易于人类理解的输入。我们提出了一个新颖的框架 Model-diff，它使用通过抽样生成文本并对抽样统计的直方图进行去权重处理，以高效且无偏地估计此输入空间中两个 LM 之间的预测差异。我们的方法通过以负对数似然绘制和计数每个预测差异值的输入来实现这一点。实验首次揭示了大型输入空间中 LM 之间的定量预测差异，可能有助于模型分析，以用于模型抄袭等应用。

Title: Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers

Authors: Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12276
Pdf URL: https://arxiv.org/pdf/2412.12276
Copy Paste: [[2412.12276]] Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers(https://arxiv.org/abs/2412.12276)
Keywords: language model
Abstract: Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which raises the question of how. In this paper, we propose a concept encoding-decoding mechanism to explain ICL by studying how transformers form and use internal abstractions in their representations. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of concept encoding and decoding. As the model learns to encode different latent concepts (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate the existence of this mechanism across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B). Further, through mechanistic interventions and controlled finetuning, we demonstrate that the quality of concept encoding is causally related to and predictive of ICL performance. Our empirical insights shed light on the success and failure modes of large language models via their representations.
摘要：人类将复杂的经验提炼为基本的抽象概念，从而实现快速学习和适应。同样，自回归转换器通过情境学习 (ICL) 表现出自适应学习，这引出了一个问题：如何实现自适应学习。在本文中，我们提出概念编码-解码机制来解释 ICL，方法是研究转换器如何在其表示中形成和使用内部抽象。在合成 ICL 任务中，我们分析了小型转换器的训练动态，并报告了概念编码和解码的耦合出现。当模型学习将不同的潜在概念（例如，“在句子中查找第一个名词”）编码为不同的可分离表示时，它会同时构建条件解码算法并提高其 ICL 性能。我们在不同规模的预训练模型（Gemma-2 2B/9B/27B、Llama-3.1 8B/70B）中验证了此机制的存在。此外，通过机械干预和受控微调，我们证明了概念编码的质量与 ICL 性能存在因果关系并可预测。我们的实证见解有助于更好地理解大型语言模型通过其表示的成功和失败模式。

Title: Unanswerability Evaluation for Retreival Augmented Generation

Authors: Xiangyu Peng, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12300
Pdf URL: https://arxiv.org/pdf/2412.12300
Copy Paste: [[2412.12300]] Unanswerability Evaluation for Retreival Augmented Generation(https://arxiv.org/abs/2412.12300)
Keywords: language model, prompt, retrieval-augmented generation
Abstract: Existing evaluation frameworks for retrieval-augmented generation (RAG) systems focus on answerable queries, but they overlook the importance of appropriately rejecting unanswerable requests. In this paper, we introduce UAEval4RAG, a framework designed to evaluate whether RAG systems can handle unanswerable queries effectively. We define a taxonomy with six unanswerable categories, and UAEval4RAG automatically synthesizes diverse and challenging queries for any given knowledge base with unanswered ratio and acceptable ratio metrics. We conduct experiments with various RAG components, including retrieval models, rewriting methods, rerankers, language models, and prompting strategies, and reveal hidden trade-offs in performance of RAG systems. Our findings highlight the critical role of component selection and prompt design in optimizing RAG systems to balance the accuracy of answerable queries with high rejection rates of unanswerable ones. UAEval4RAG provides valuable insights and tools for developing more robust and reliable RAG systems.
摘要：现有的检索增强生成 (RAG) 系统评估框架侧重于可回答的查询，但忽略了适当拒绝不可回答请求的重要性。在本文中，我们介绍了 UAEval4RAG，这是一个旨在评估 RAG 系统是否能有效处理不可回答查询的框架。我们定义了一个包含六个不可回答类别的分类法，UAEval4RAG 会自动为任何给定的知识库合成各种具有挑战性的查询，并给出未回答率和可接受比率指标。我们对各种 RAG 组件进行了实验，包括检索模型、重写方法、重新排序器、语言模型和提示策略，并揭示了 RAG 系统性能中隐藏的权衡。我们的研究结果强调了组件选择和提示设计在优化 RAG 系统中的关键作用，以平衡可回答查询的准确性和不可回答查询的高拒绝率。UAEval4RAG 为开发更强大、更可靠的 RAG 系统提供了宝贵的见解和工具。

Title: Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion

Authors: Jianqing Zhu, Huang Huang, Zhihang Lin, Juhao Liang, Zhengyang Tang, Khalid Almubarak, Abdulmohsen Alharthik, Bang An, Juncai He, Xiangbo Wu, Fei Yu, Junying Chen, Zhuoheng Ma, Yuhao Du, He Zhang, Emad A. Alghamdi, Lian Zhang, Ruoyu Sun, Haizhou Li, Benyou Wang, Jinchao Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12310
Pdf URL: https://arxiv.org/pdf/2412.12310
Copy Paste: [[2412.12310]] Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion(https://arxiv.org/abs/2412.12310)
Keywords: language model, gpt, llm, chat
Abstract: This paper addresses the critical need for democratizing large language models (LLMs) in the Arab world, a region that has seen slower progress in developing models comparable to state-of-the-art offerings like GPT-4 or ChatGPT 3.5, due to a predominant focus on mainstream languages (e.g., English and Chinese). One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding. However, using a different vocabulary often leads to a degradation of learned knowledge since many words are initially out-of-vocabulary (OOV) when training starts. Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion, which is implemented by a modified BPE algorithm that progressively extends the Arabic subwords in its dynamic vocabulary during training, thereby balancing the OOV ratio at every stage. The ablation study demonstrates the effectiveness of Progressive Vocabulary Expansion. Moreover, AraLLaMA achieves decent performance comparable to the best Arabic LLMs across a variety of Arabic benchmarks. Models, training data, benchmarks, and code will all be open-sourced.
摘要：本文讨论了阿拉伯世界对大型语言模型 (LLM) 民主化的迫切需求。由于主要关注主流语言（例如英语和中文），该地区在开发与 GPT-4 或 ChatGPT 3.5 等最先进产品相当的模型方面进展较慢。阿拉伯语 LLM 的一个实际目标是使用阿拉伯语特定的词汇表作为标记器，以加快解码速度。然而，使用不同的词汇表通常会导致学习知识的退化，因为许多单词在训练开始时最初是词汇表之外的 (OOV)。受人类第二语言（阿拉伯语）习得过程中词汇学习的启发，发布的 AraLLaMA 采用了渐进式词汇扩展，它由经过修改的 BPE 算法实现，该算法在训练期间逐步扩展其动态词汇表中的阿拉伯语子词，从而平衡每个阶段的 OOV 比率。消融研究证明了渐进式词汇扩展的有效性。此外，AraLLaMA 在各种阿拉伯语基准测试中取得了与最佳阿拉伯语 LLM 相当的良好表现。模型、训练数据、基准测试和代码都将开源。

Title: BioRAGent: A Retrieval-Augmented Generation System for Showcasing Generative Query Expansion and Domain-Specific Search for Scientific Q&A

Authors: Samy Ateia, Udo Kruschwitz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12358
Pdf URL: https://arxiv.org/pdf/2412.12358
Copy Paste: [[2412.12358]] BioRAGent: A Retrieval-Augmented Generation System for Showcasing Generative Query Expansion and Domain-Specific Search for Scientific Q&A(https://arxiv.org/abs/2412.12358)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: We present BioRAGent, an interactive web-based retrieval-augmented generation (RAG) system for biomedical question answering. The system uses large language models (LLMs) for query expansion, snippet extraction, and answer generation while maintaining transparency through citation links to the source documents and displaying generated queries for further editing. Building on our successful participation in the BioASQ 2024 challenge, we demonstrate how few-shot learning with LLMs can be effectively applied for a professional search setting. The system supports both direct short paragraph style responses and responses with inline citations. Our demo is available online, and the source code is publicly accessible through GitHub.
摘要：我们推出了 BioRAGent，这是一种用于生物医学问答的交互式基于 Web 的检索增强生成 (RAG) 系统。该系统使用大型语言模型 (LLM) 进行查询扩展、片段提取和答案生成，同时通过指向源文档的引用链接和显示生成的查询以供进一步编辑来保持透明度。基于我们成功参与 BioASQ 2024 挑战赛的经验，我们展示了如何将 LLM 的小样本学习有效地应用于专业搜索环境。该系统支持直接短段落样式的响应和带有内联引用的响应。我们的演示可在线获取，源代码可通过 GitHub 公开访问。

Title: Interpretable LLM-based Table Question Answering

Authors: Giang (Dexter) Nguyen, Ivan Brugere, Shubham Sharma, Sanjay Kariyappa, Anh Totti Nguyen, Freddy Lecue
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12386
Pdf URL: https://arxiv.org/pdf/2412.12386
Copy Paste: [[2412.12386]] Interpretable LLM-based Table Question Answering(https://arxiv.org/abs/2412.12386)
Keywords: language model, llm
Abstract: Interpretability for Table Question Answering (Table QA) is critical, particularly in high-stakes industries like finance or healthcare. Although recent approaches using Large Language Models (LLMs) have significantly improved Table QA performance, their explanations for how the answers are generated are ambiguous. To fill this gap, we introduce Plan-of-SQLs (or POS), an interpretable, effective, and efficient approach to Table QA that answers an input query solely with SQL executions. Through qualitative and quantitative evaluations with human and LLM judges, we show that POS is most preferred among explanation methods, helps human users understand model decision boundaries, and facilitates model success and error identification. Furthermore, when evaluated in standard benchmarks (TabFact, WikiTQ, and FetaQA), POS achieves competitive or superior accuracy compared to existing methods, while maintaining greater efficiency by requiring significantly fewer LLM calls and database queries.
摘要：表格问答 (Table QA) 的可解释性至关重要，尤其是在金融或医疗保健等高风险行业。尽管最近使用大型语言模型 (LLM) 的方法已显著提高了表格 QA 的性能，但它们对答案生成方式的解释却不明确。为了填补这一空白，我们引入了 SQL 计划 (Plan-of-SQLs，简称 POS)，这是一种可解释、有效且高效的表格 QA 方法，它仅通过 SQL 执行即可回答输入查询。通过与人类和 LLM 评委进行定性和定量评估，我们表明 POS 在解释方法中是最受欢迎的，它可以帮助人类用户理解模型决策边界，并有助于模型成功和错误识别。此外，在标准基准 (TabFact、WikiTQ 和 FetaQA) 中进行评估时，POS 与现有方法相比实现了具有竞争力或更优异的准确性，同时通过显着减少 LLM 调用和数据库查询来保持更高的效率。

Title: Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments

Authors: Tuka Alhanai, Adam Kasumovic, Mohammad Ghassemi, Aven Zitzelberger, Jessica Lundin, Guillaume Chabot-Couture
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12417
Pdf URL: https://arxiv.org/pdf/2412.12417
Copy Paste: [[2412.12417]] Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments(https://arxiv.org/abs/2412.12417)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown remarkable performance across various tasks, yet significant disparities remain for non-English languages, and especially native African languages. This paper addresses these disparities by creating approximately 1 million human-translated words of new benchmark data in 8 low-resource African languages, covering a population of over 160 million speakers of: Amharic, Bambara, Igbo, Sepedi (Northern Sotho), Shona, Sesotho (Southern Sotho), Setswana, and Tsonga. Our benchmarks are translations of Winogrande and three sections of MMLU: college medicine, clinical knowledge, and virology. Using the translated benchmarks, we report previously unknown performance gaps between state-of-the-art (SOTA) LLMs in English and African languages. Finally, using results from over 400 fine-tuned models, we explore several methods to reduce the LLM performance gap, including high-quality dataset fine-tuning (using an LLM-as-an-Annotator), cross-lingual transfer, and cultural appropriateness adjustments. Key findings include average mono-lingual improvements of 5.6% with fine-tuning (with 5.4% average mono-lingual improvements when using high-quality data over low-quality data), 2.9% average gains from cross-lingual transfer, and a 3.0% out-of-the-box performance boost on culturally appropriate questions. The publicly available benchmarks, translations, and code from this study support further research and development aimed at creating more inclusive and effective language technologies.
摘要：大型语言模型 (LLM) 在各种任务中都表现出色，但非英语语言，尤其是非洲本土语言仍然存在显著差异。本文通过创建约 100 万个人工翻译的 8 种资源匮乏的非洲语言的新基准数据来解决这些差异，涵盖了超过 1.6 亿使用以下语言的人口：阿姆哈拉语、班巴拉语、伊博语、塞佩迪语（北索托语）、绍纳语、塞索托语（南索托语）、塞茨瓦纳语和聪加语。我们的基准是 Winogrande 和 MMLU 的三个部分的翻译：大学医学、临床知识和病毒学。使用翻译后的基准，我们报告了英语和非洲语言中最先进的 (SOTA) LLM 之间以前未知的性能差距。最后，利用 400 多个微调模型的结果，我们探索了几种减少 LLM 性能差距的方法，包括高质量数据集微调（使用 LLM 作为注释器）、跨语言迁移和文化适宜性调整。主要发现包括微调后单语平均改进 5.6%（使用高质量数据而非低质量数据时单语平均改进 5.4%）、跨语言迁移平均改进 2.9% 以及文化适宜性问题开箱即用性能提升 3.0%。本研究公开提供的基准、翻译和代码支持进一步的研究和开发，旨在创造更具包容性和有效性的语言技术。

Title: Assessing the Limitations of Large Language Models in Clinical Fact Decomposition

Authors: Monica Munnangi, Akshay Swaminathan, Jason Alan Fries, Jenelle Jindal, Sanjana Narayanan, Ivan Lopez, Lucia Tu, Philip Chung, Jesutofunmi A. Omiye, Mehr Kashyap, Nigam Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12422
Pdf URL: https://arxiv.org/pdf/2412.12422
Copy Paste: [[2412.12422]] Assessing the Limitations of Large Language Models in Clinical Fact Decomposition(https://arxiv.org/abs/2412.12422)
Keywords: language model, llm
Abstract: Verifying factual claims is critical for using large language models (LLMs) in healthcare. Recent work has proposed fact decomposition, which uses LLMs to rewrite source text into concise sentences conveying a single piece of information, as an approach for fine-grained fact verification. Clinical documentation poses unique challenges for fact decomposition due to dense terminology and diverse note types. To explore these challenges, we present FactEHR, a dataset consisting of full document fact decompositions for 2,168 clinical notes spanning four types from three hospital systems. Our evaluation, including review by clinicians, highlights significant variability in the quality of fact decomposition for four commonly used LLMs, with some LLMs generating 2.6x more facts per sentence than others. The results underscore the need for better LLM capabilities to support factual verification in clinical text. To facilitate future research in this direction, we plan to release our code at this https URL.

Title: Refining Dimensions for Improving Clustering-based Cross-lingual Topic Models

Authors: Chia-Hsuan Chang, Tien-Yuan Huang, Yi-Hang Tsai, Chia-Ming Chang, San-Yih Hwang
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12433
Pdf URL: https://arxiv.org/pdf/2412.12433
Copy Paste: [[2412.12433]] Refining Dimensions for Improving Clustering-based Cross-lingual Topic Models(https://arxiv.org/abs/2412.12433)
Keywords: language model
Abstract: Recent works in clustering-based topic models perform well in monolingual topic identification by introducing a pipeline to cluster the contextualized representations. However, the pipeline is suboptimal in identifying topics across languages due to the presence of language-dependent dimensions (LDDs) generated by multilingual language models. To address this issue, we introduce a novel, SVD-based dimension refinement component into the pipeline of the clustering-based topic model. This component effectively neutralizes the negative impact of LDDs, enabling the model to accurately identify topics across languages. Our experiments on three datasets demonstrate that the updated pipeline with the dimension refinement component generally outperforms other state-of-the-art cross-lingual topic models.
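The abstract does not say which dimensions the SVD-based component treats as language-dependent, so the following is only a minimal sketch under a loud assumption: that the language signal dominates the top singular direction of the centered embeddings, and that projecting this direction out is an acceptable stand-in for the refinement step. All function names are hypothetical, and the top direction is found by power iteration to keep the example free of dependencies.

```python
def center(rows):
    """Subtract the per-dimension mean from every embedding."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def top_singular_direction(rows, iters=100):
    """Power iteration on X^T X; returns the leading right singular vector."""
    d = len(rows[0])
    v = [1.0 / d ** 0.5] * d
    for _ in range(iters):
        proj = [sum(x * y for x, y in zip(r, v)) for r in rows]       # X v
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(d)]  # X^T (X v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def remove_direction(rows, v):
    """Project every embedding onto the subspace orthogonal to v."""
    out = []
    for r in rows:
        coeff = sum(x * y for x, y in zip(r, v))
        out.append([x - coeff * y for x, y in zip(r, v)])
    return out

# Toy data (assumption): dimension 0 encodes the language, dimension 1 the topic.
embeddings = [[5.0, 1.0], [5.0, -1.0], [-5.0, 1.0], [-5.0, -1.0]]
centered = center(embeddings)
direction = top_singular_direction(centered)
refined = remove_direction(centered, direction)
```

In this toy case the refined vectors keep only the topic coordinate, so a downstream clustering step would group documents by topic rather than by language.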

Title: LITA: An Efficient LLM-assisted Iterative Topic Augmentation Framework

Authors: Chia-Hsuan Chang, Jui-Tse Tsai, Yi-Hang Tsai, San-Yih Hwang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.12459
Pdf URL: https://arxiv.org/pdf/2412.12459
Copy Paste: [[2412.12459]] LITA: An Efficient LLM-assisted Iterative Topic Augmentation Framework(https://arxiv.org/abs/2412.12459)
Keywords: language model, llm, prompt
Abstract: Topic modeling is widely used for uncovering thematic structures within text corpora, yet traditional models often struggle with specificity and coherence in domain-focused applications. Guided approaches, such as SeededLDA and CorEx, incorporate user-provided seed words to improve relevance but remain labor-intensive and static. Large language models (LLMs) offer potential for dynamic topic refinement and discovery, yet their application often incurs high API costs. To address these challenges, we propose the LLM-assisted Iterative Topic Augmentation framework (LITA), an LLM-assisted approach that integrates user-provided seeds with embedding-based clustering and iterative refinement. LITA identifies a small number of ambiguous documents and employs an LLM to reassign them to existing or new topics, minimizing API costs while enhancing topic quality. Experiments on two datasets across topic quality and clustering performance metrics demonstrate that LITA outperforms five baseline models, including LDA, SeededLDA, CorEx, BERTopic, and PromptTopic. Our work offers an efficient and adaptable framework for advancing topic modeling and text clustering.
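The cost-saving idea described above (send only ambiguous documents to the LLM) can be sketched as follows. This is an illustration, not the authors' implementation: the cosine-margin criterion, the `margin` value, and the `ask_llm` callable are all assumptions introduced here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def split_by_ambiguity(doc_vectors, centroids, margin=0.1):
    """Return (confident, ambiguous); confident maps doc index -> topic index.

    A document is ambiguous when its two best topic similarities differ
    by less than `margin` (a hypothetical criterion).
    """
    confident, ambiguous = {}, []
    for i, vec in enumerate(doc_vectors):
        scores = sorted(((cosine(vec, c), t) for t, c in enumerate(centroids)),
                        reverse=True)
        best_score, best_topic = scores[0]
        runner_up = scores[1][0] if len(scores) > 1 else float("-inf")
        if best_score - runner_up < margin:
            ambiguous.append(i)
        else:
            confident[i] = best_topic
    return confident, ambiguous

def reassign_with_llm(ambiguous, ask_llm):
    """Only ambiguous documents trigger an LLM call (ask_llm is a stand-in)."""
    return {i: ask_llm(i) for i in ambiguous}

centroids = [[1.0, 0.0], [0.0, 1.0]]           # seed-derived topic centroids
docs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]    # toy document embeddings
confident, ambiguous = split_by_ambiguity(docs, centroids)
llm_labels = reassign_with_llm(ambiguous, ask_llm=lambda i: "new-topic")
```

Here only the third document, which sits halfway between the two centroids, would incur an API call.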

Title: Core Context Aware Attention for Long Context Language Modeling

Authors: Yaofo Chen, Zeng You, Shuhai Zhang, Haokun Li, Yirui Li, Yaowei Wang, Mingkui Tan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12465
Pdf URL: https://arxiv.org/pdf/2412.12465
Copy Paste: [[2412.12465]] Core Context Aware Attention for Long Context Language Modeling(https://arxiv.org/abs/2412.12465)
Keywords: language model, llm, long context
Abstract: Transformer-based Large Language Models (LLMs) have exhibited remarkable success in various natural language processing tasks, primarily attributed to the self-attention mechanism, which requires a token to consider all preceding tokens as its context to compute the attention score. However, when the context length L becomes very large (e.g., 32K), more redundant context information will be included with respect to any given token, so that self-attention suffers from two main limitations: 1) The computational and memory complexity scales quadratically with L; 2) The presence of redundant context information may hinder the model from capturing dependencies among crucial tokens, which may degrade the representation performance. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling, which consists of two components: 1) Globality-pooling attention that divides input tokens into groups and then dynamically merges tokens within each group into one core token based on their significance; 2) Locality-preserved attention that incorporates neighboring tokens into the attention calculation. The two complementary attentions are then fused into the final attention, retaining the comprehensive modeling ability of full self-attention. In this way, the core context information with respect to a given token will be automatically focused and strengthened, while the context information in redundant groups will be diminished during the learning process. As a result, the computational and memory complexity will be significantly reduced. More importantly, the CCA-Attention can improve the long-context modeling ability by diminishing the redundant context information. Extensive experimental results demonstrate that our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
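The group-and-merge step of the globality-pooling component can be illustrated with a small sketch. The abstract does not specify how token significance is computed, so the `scores` argument here is a hypothetical input, and the softmax-weighted average is an assumed merging rule rather than the paper's formula.

```python
import math

def pool_core_tokens(tokens, scores, group_size):
    """Merge each group of tokens into one core token.

    tokens: list of equal-length vectors; scores: one significance score
    per token (hypothetical input). Each core token is the softmax-weighted
    average of the tokens in its group.
    """
    cores = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        group_scores = scores[start:start + group_size]
        peak = max(group_scores)                      # for numerical stability
        weights = [math.exp(s - peak) for s in group_scores]
        total = sum(weights)
        dim = len(group[0])
        cores.append([sum(w * t[j] for w, t in zip(weights, group)) / total
                      for j in range(dim)])
    return cores

tokens = [[1, 0], [3, 0], [0, 2], [0, 4]]
uniform = pool_core_tokens(tokens, scores=[0, 0, 0, 0], group_size=2)
peaked = pool_core_tokens(tokens, scores=[0, 100, 0, 0], group_size=2)
```

With uniform scores each core token is the plain group mean; with a dominant score the core token is essentially the most significant token. Attending over the core tokens instead of all tokens is what shortens the sequence seen by the global branch.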

Title: Knowledge Boundary of Large Language Models: A Survey

Authors: Moxin Li, Yong Zhao, Yang Deng, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12472
Pdf URL: https://arxiv.org/pdf/2412.12472
Copy Paste: [[2412.12472]] Knowledge Boundary of Large Language Models: A Survey(https://arxiv.org/abs/2412.12472)
Keywords: language model, llm
Abstract: Although large language models (LLMs) store a vast amount of knowledge in their parameters, they still have limitations in the memorization and utilization of certain knowledge, leading to undesired behaviors such as generating untruthful and inaccurate responses. This highlights the critical need to understand the knowledge boundary of LLMs, a concept that remains inadequately defined in existing research. In this survey, we propose a comprehensive definition of the LLM knowledge boundary and introduce a formalized taxonomy categorizing knowledge into four distinct types. Using this foundation, we systematically review the field through three key lenses: the motivation for studying LLM knowledge boundaries, methods for identifying these boundaries, and strategies for mitigating the challenges they present. Finally, we discuss open challenges and potential research directions in this area. We aim for this survey to offer the community a comprehensive overview, facilitate access to key issues, and inspire further advancements in LLM knowledge research.

Title: RareAgents: Autonomous Multi-disciplinary Team for Rare Disease Diagnosis and Treatment

Authors: Xuanzhong Chen, Ye Jin, Xiaohao Mao, Lun Wang, Shuyang Zhang, Ting Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12475
Pdf URL: https://arxiv.org/pdf/2412.12475
Copy Paste: [[2412.12475]] RareAgents: Autonomous Multi-disciplinary Team for Rare Disease Diagnosis and Treatment(https://arxiv.org/abs/2412.12475)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Rare diseases, despite their low individual incidence, collectively impact around 300 million people worldwide due to the huge number of diseases. The complexity of symptoms and the shortage of specialized doctors with relevant experience make diagnosing and treating rare diseases more challenging than common diseases. Recently, agents powered by large language models (LLMs) have demonstrated notable improvements across various domains. In the medical field, some agent methods have outperformed direct prompts in question-answering tasks from medical exams. However, current agent frameworks lack adaptation for real-world clinical scenarios, especially those involving the intricate demands of rare diseases. To address these challenges, we present RareAgents, the first multi-disciplinary team of LLM-based agents tailored to the complex clinical context of rare diseases. RareAgents integrates advanced planning capabilities, memory mechanisms, and medical tools utilization, leveraging Llama-3.1-8B/70B as the base model. Experimental results show that RareAgents surpasses state-of-the-art domain-specific models, GPT-4o, and existing agent frameworks in both differential diagnosis and medication recommendation for rare diseases. Furthermore, we contribute a novel dataset, MIMIC-IV-Ext-Rare, derived from MIMIC-IV, to support further advancements in this field.

Title: Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script

Authors: Xi Cao, Yuan Sun, Jiajun Li, Quzong Gesang, Nuo Qun, Tashi Nyima
Subjects: cs.CL, cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2412.12478
Pdf URL: https://arxiv.org/pdf/2412.12478
Copy Paste: [[2412.12478]] Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script(https://arxiv.org/abs/2412.12478)
Keywords: language model, llm
Abstract: DNN-based language models perform excellently on various tasks, but even SOTA LLMs are susceptible to textual adversarial attacks. Adversarial texts play crucial roles in multiple subfields of NLP. However, current research has the following issues. (1) Most textual adversarial attack methods target rich-resourced languages. How do we generate adversarial texts for less-studied languages? (2) Most textual adversarial attack methods are prone to generating invalid or ambiguous adversarial texts. How do we construct high-quality adversarial robustness benchmarks? (3) New language models may be immune to part of previously generated adversarial texts. How do we update adversarial robustness benchmarks? To address the above issues, we introduce HITL-GAT, a system based on a general approach to human-in-the-loop generation of adversarial texts. HITL-GAT contains four stages in one pipeline: victim model construction, adversarial example generation, high-quality benchmark construction, and adversarial robustness evaluation. Additionally, we utilize HITL-GAT to make a case study on Tibetan script which can be a reference for the adversarial research of other less-studied languages.

Title: Boosting Long-Context Information Seeking via Query-Guided Activation Refilling

Authors: Hongjin Qian, Zheng Liu, Peitian Zhang, Zhicheng Dou, Defu Lian
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.12486
Pdf URL: https://arxiv.org/pdf/2412.12486
Copy Paste: [[2412.12486]] Boosting Long-Context Information Seeking via Query-Guided Activation Refilling(https://arxiv.org/abs/2412.12486)
Keywords: language model, llm, long context
Abstract: Processing long contexts poses a significant challenge for large language models (LLMs) due to their inherent context-window limitations and the computational burden of extensive key-value (KV) activations, which severely impact efficiency. For information-seeking tasks, full context perception is often unnecessary, as a query's information needs can dynamically range from localized details to a global perspective, depending on its complexity. However, existing methods struggle to adapt effectively to these dynamic information needs. In the paper, we propose a method for processing long-context information-seeking tasks via query-guided Activation Refilling (ACRE). ACRE constructs a Bi-layer KV Cache for long contexts, where the layer-1 (L1) cache compactly captures global information, and the layer-2 (L2) cache provides detailed and localized information. ACRE establishes a proxying relationship between the two caches, allowing the input query to attend to the L1 cache and dynamically refill it with relevant entries from the L2 cache. This mechanism integrates global understanding with query-specific local details, thus improving answer decoding. Experiments on a variety of long-context information-seeking datasets demonstrate ACRE's effectiveness, achieving improvements in both performance and efficiency.
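The proxying relationship between the two caches can be sketched as a lookup: the query scores the compact L1 entries, and the best-scoring ones pull in the detailed L2 entries they stand for. Dot-product scoring, the `top_k` parameter, and the choice to place L2 entries right after their L1 proxy are assumptions made for this illustration only.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refill(query, l1_cache, l2_cache, top_k=1):
    """Build the working cache for one query.

    l1_cache[i] is a compact entry that proxies the detailed entries in
    l2_cache[i]. The top_k L1 entries most relevant to the query are
    refilled with their L2 entries; all others stay compact.
    """
    ranked = sorted(range(len(l1_cache)),
                    key=lambda i: dot(query, l1_cache[i]), reverse=True)
    selected = set(ranked[:top_k])
    working = []
    for i, summary in enumerate(l1_cache):
        working.append(summary)              # global information is always kept
        if i in selected:
            working.extend(l2_cache[i])      # local detail only where relevant
    return working

l1 = [[1.0, 0.0], [0.0, 1.0]]
l2 = [[[1.0, 0.1], [0.9, 0.0]], [[0.0, 1.0], [0.1, 0.9]]]
working = refill(query=[0.0, 1.0], l1_cache=l1, l2_cache=l2)
```

The working cache stays short for queries that need only a global view and grows only around the chunks the query actually attends to.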

Title: NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Authors: Xin Yi, Shunfan Zheng, Linlin Wang, Gerard de Melo, Xiaoling Wang, Liang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12497
Pdf URL: https://arxiv.org/pdf/2412.12497
Copy Paste: [[2412.12497]] NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning(https://arxiv.org/abs/2412.12497)
Keywords: language model, llm
Abstract: The emergence of finetuning-as-a-service has revealed a new vulnerability in large language models (LLMs). A mere handful of malicious data uploaded by users can subtly manipulate the finetuning process, resulting in an alignment-broken model. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. Even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose Neuron-Level Safety Realignment (NLSR), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. The core of our framework is first to construct a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then utilize this reference model to identify safety-critical neurons, which we prepare as patches. Finally, we selectively restore only those neurons that exhibit significant similarity differences by transplanting these prepared patches, thereby minimally altering the fine-tuned model. Extensive experiments demonstrate significant safety enhancements in fine-tuned models across multiple downstream tasks, while greatly maintaining task-level accuracy. Our findings suggest regions of some safety-critical neurons show noticeable differences after fine-tuning, which can be effectively corrected by transplanting neurons from the reference model without requiring additional training. The code will be available at this https URL.
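The transplant step can be pictured with a toy example in which each neuron is one row of a weight matrix. This is a sketch under stated assumptions, not the paper's procedure: cosine similarity as the similarity measure, the `threshold` value, and the precomputed `safety_critical` index set are all hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def restore_drifted_neurons(finetuned, reference, safety_critical, threshold=0.9):
    """Return (patched weights, indices of restored neurons).

    A safety-critical neuron is replaced by its reference counterpart when
    the two weight vectors are less similar than `threshold`. Every other
    neuron keeps its fine-tuned weights, and no gradient step is involved.
    """
    patched, restored = [], []
    for idx, (ft_row, ref_row) in enumerate(zip(finetuned, reference)):
        if idx in safety_critical and cosine(ft_row, ref_row) < threshold:
            patched.append(list(ref_row))
            restored.append(idx)
        else:
            patched.append(list(ft_row))
    return patched, restored

reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
finetuned = [[1.0, 0.05], [1.0, 0.0], [-1.0, -1.0]]
patched, restored = restore_drifted_neurons(finetuned, reference,
                                            safety_critical={0, 1})
```

Neuron 0 barely moved and is left alone, neuron 1 drifted and is restored, and neuron 2 drifted but is not flagged as safety-critical, so the fine-tuned task behavior it carries is preserved.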

Title: LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Tasks

Authors: Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12499
Pdf URL: https://arxiv.org/pdf/2412.12499
Copy Paste: [[2412.12499]] LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Tasks(https://arxiv.org/abs/2412.12499)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated impressive multilingual understanding and reasoning capabilities, driven by extensive multilingual pre-training corpora and instruction fine-tuning data. However, a performance gap persists between high-resource and low-resource language tasks due to language imbalance in the pre-training corpus, even when more low-resource data is used during fine-tuning. To alleviate this issue, we propose LinguaLIFT, a two-stage instruction tuning framework for advancing low-resource language tasks. An additional language alignment layer is first integrated into the LLM to adapt a pre-trained multilingual encoder, thereby enhancing multilingual alignment through code-switched fine-tuning. The second stage fine-tunes the LLM with English-only instruction data while freezing the language alignment layer, allowing the LLM to transfer task-specific capabilities from English to low-resource language tasks. Additionally, we introduce the Multilingual Math World Problem (MMWP) benchmark, which spans 21 low-resource, 17 medium-resource, and 10 high-resource languages, enabling comprehensive evaluation of multilingual reasoning. Experimental results show that LinguaLIFT outperforms several competitive baselines across MMWP and other widely used benchmarks.

Title: Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models

Authors: Sina Bagheri Nezhad, Ameeta Agrawal, Rhitabrat Pokharel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12500
Pdf URL: https://arxiv.org/pdf/2412.12500
Copy Paste: [[2412.12500]] Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models(https://arxiv.org/abs/2412.12500)
Keywords: language model, llm
Abstract: Multilingual language models (MLLMs) are crucial for handling text across various languages, yet they often show performance disparities due to differences in resource availability and linguistic characteristics. While the impact of pre-train data percentage and model size on performance is well-known, our study reveals additional critical factors that significantly influence MLLM effectiveness. Analyzing a wide range of features, including geographical, linguistic, and resource-related aspects, we focus on the SIB-200 dataset for classification and the Flores-200 dataset for machine translation, using regression models and SHAP values across 204 languages. Our findings identify token similarity and country similarity as pivotal factors, alongside pre-train data and model size, in enhancing model performance. Token similarity facilitates cross-lingual transfer, while country similarity highlights the importance of shared cultural and linguistic contexts. These insights offer valuable guidance for developing more equitable and effective multilingual language models, particularly for underrepresented languages.

Title: Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge

Authors: Kayla Schroeder, Zach Wood-Doughty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12509
Pdf URL: https://arxiv.org/pdf/2412.12509
Copy Paste: [[2412.12509]] Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge(https://arxiv.org/abs/2412.12509)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have become increasingly powerful and ubiquitous, but their stochastic nature poses challenges to the reliability of their outputs. While deterministic settings can improve consistency, they do not guarantee reliability, as a single sample from the model's probability distribution can still be misleading. Building upon the concept of LLM-as-a-judge, we introduce a novel framework for rigorously evaluating the reliability of LLM judgments, leveraging McDonald's omega. We evaluate the reliability of LLMs when judging the outputs of other LLMs on standard single-turn and multi-turn benchmarks, simultaneously investigating the impact of temperature on reliability. By analyzing these results, we demonstrate the limitations of fixed randomness and the importance of considering multiple samples, which we show has significant implications for downstream applications. Our findings highlight the need for a nuanced understanding of LLM reliability and the potential risks associated with over-reliance on single-shot evaluations. This work provides a crucial step towards building more trustworthy and reliable LLM-based systems and applications.
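For readers unfamiliar with McDonald's omega, the standard single-factor form is the squared sum of the factor loadings divided by that same quantity plus the sum of the unique (error) variances. The sketch below computes it from loadings that are assumed to have been estimated already; how the paper fits the factor model to repeated LLM judgments is not stated in the abstract, and the numbers are purely illustrative.

```python
def mcdonalds_omega(loadings, unique_variances):
    """Omega for a one-factor model.

    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of unique variances)
    """
    common = sum(loadings) ** 2
    return common / (common + sum(unique_variances))

# Illustrative values: three repeated judgments, each with a standardized
# loading of 0.7 and therefore a unique variance of 1 - 0.7**2 = 0.51.
omega = mcdonalds_omega([0.7, 0.7, 0.7], [0.51, 0.51, 0.51])  # about 0.742
```

Values closer to 1 indicate that the repeated judgments mostly reflect one shared underlying score rather than sampling noise.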

Title: Can Large Language Models Understand You Better? An MBTI Personality Detection Dataset Aligned with Population Traits

Authors: Bohan Li, Jiannan Guan, Longxu Dou, Yunlong Feng, Dingzirui Wang, Yang Xu, Enbo Wang, Qiguang Chen, Bichen Wang, Xiao Xu, Yimeng Zhang, Libo Qin, Yanyan Zhao, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2412.12510
Pdf URL: https://arxiv.org/pdf/2412.12510
Copy Paste: [[2412.12510]] Can Large Language Models Understand You Better? An MBTI Personality Detection Dataset Aligned with Population Traits(https://arxiv.org/abs/2412.12510)
Keywords: language model, llm
Abstract: The Myers-Briggs Type Indicator (MBTI) is one of the most influential personality theories reflecting individual differences in thinking, feeling, and behaving. MBTI personality detection has garnered considerable research interest and has evolved significantly over the years. However, this task tends to be overly optimistic, as it currently does not align well with the natural distribution of population personality traits. Specifically, (1) the self-reported labels in existing datasets result in incorrect labeling issues, and (2) the hard labels fail to capture the full range of population personality distributions. In this paper, we optimize the task by constructing MBTIBench, the first manually annotated high-quality MBTI personality detection dataset with soft labels, under the guidance of psychologists. As for the first challenge, MBTIBench effectively solves the incorrect labeling issues, which account for 29.58% of the data. As for the second challenge, we estimate soft labels by deriving the polarity tendency of samples. The obtained soft labels confirm that there are more people with non-extreme personality traits. Experimental results not only highlight the polarized predictions and biases in LLMs as key directions for future research, but also confirm that soft labels can provide more benefits to other psychological tasks than hard labels. The code and data are available at this https URL.

Title: Solid-SQL: Enhanced Schema-linking based In-context Learning for Robust Text-to-SQL

Authors: Geling Liu, Yunzhi Tan, Ruichao Zhong, Yuanzhen Xie, Lingchen Zhao, Qian Wang, Bo Hu, Zang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12522
Pdf URL: https://arxiv.org/pdf/2412.12522
Copy Paste: [[2412.12522]] Solid-SQL: Enhanced Schema-linking based In-context Learning for Robust Text-to-SQL(https://arxiv.org/abs/2412.12522)
Keywords: language model, llm
Abstract: Recently, large language models (LLMs) have significantly improved the performance of text-to-SQL systems. Nevertheless, many state-of-the-art (SOTA) approaches have overlooked the critical aspect of system robustness. Our experiments reveal that while LLM-driven methods excel on standard datasets, their accuracy is notably compromised when faced with adversarial perturbations. To address this challenge, we propose a robust text-to-SQL solution, called Solid-SQL, designed to integrate with various LLMs. We focus on the pre-processing stage, training a robust schema-linking model enhanced by LLM-based data augmentation. Additionally, we design a two-round, structural similarity-based example retrieval strategy for in-context learning. Our method achieves SOTA SQL execution accuracy levels of 82.1% and 58.9% on the general Spider and Bird benchmarks, respectively. Furthermore, experimental results show that Solid-SQL delivers an average improvement of 11.6% compared to baselines on the perturbed Spider-Syn, Spider-Realistic, and Dr. Spider benchmarks.

Title: When to Speak, When to Abstain: Contrastive Decoding with Abstention

Authors: Hyuhng Joon Kim, Youna Kim, Sang-goo Lee, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12527
Pdf URL: https://arxiv.org/pdf/2412.12527
Copy Paste: [[2412.12527]] When to Speak, When to Abstain: Contrastive Decoding with Abstention(https://arxiv.org/abs/2412.12527)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) demonstrate exceptional performance across diverse tasks by leveraging both pre-trained knowledge (i.e., parametric knowledge) and external knowledge (i.e., contextual knowledge). While substantial efforts have been made to leverage both forms of knowledge, scenarios in which the model lacks any relevant knowledge remain underexplored. Such limitations can result in issues like hallucination, causing reduced reliability and potential risks in high-stakes applications. To address such limitations, this paper extends the task scope to encompass cases where the user's request cannot be fulfilled due to the lack of relevant knowledge. To this end, we introduce Contrastive Decoding with Abstention (CDA), a training-free decoding method that empowers LLMs to generate responses when relevant knowledge is available and to abstain otherwise. CDA evaluates the relevance of each knowledge for a given query, adaptively determining which knowledge to prioritize or which to completely ignore. Extensive experiments with four LLMs on three question-answering datasets demonstrate that CDA can effectively perform accurate generation and abstention simultaneously. These findings highlight CDA's potential to broaden the applicability of LLMs, enhancing reliability and preserving user trust.

Title: LLMCL-GEC: Advancing Grammatical Error Correction with LLM-Driven Curriculum Learning

Authors: Tao Fang, Derek F. Wong, Lusheng Zhang, Keyan Jin, Qiang Zhang, Tianjiao Li, Jinlong Hou, Lidia S. Chao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12541
Pdf URL: https://arxiv.org/pdf/2412.12541
Copy Paste: [[2412.12541]] LLMCL-GEC: Advancing Grammatical Error Correction with LLM-Driven Curriculum Learning(https://arxiv.org/abs/2412.12541)
Keywords: language model, llm
Abstract: While large-scale language models (LLMs) have demonstrated remarkable capabilities in specific natural language processing (NLP) tasks, they may still lack proficiency compared to specialized models in certain domains, such as grammatical error correction (GEC). Drawing inspiration from the concept of curriculum learning, we have delved into refining LLMs into proficient GEC experts by devising effective curriculum learning (CL) strategies. In this paper, we introduce a novel approach, termed LLM-based curriculum learning, which capitalizes on the robust semantic comprehension and discriminative prowess inherent in LLMs to gauge the complexity of GEC training data. Unlike traditional curriculum learning techniques, our method closely mirrors human expert-designed curriculums. Leveraging the proposed LLM-based CL method, we sequentially select varying levels of curriculums ranging from easy to hard, and iteratively train and refine using the pretrained T5 and LLaMA series models. Through rigorous testing and analysis across diverse benchmark assessments in English GEC, including the CoNLL14 test, BEA19 test, and BEA19 development sets, our approach showcases a significant performance boost over baseline models and conventional curriculum learning methodologies.
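
The easy-to-hard scheduling described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the difficulty scores are assumed to come from an LLM rater, and the function name `build_curriculum`, the score scale, and the number of stages are all hypothetical.

```python
from typing import List, Tuple


def build_curriculum(scored: List[Tuple[str, float]], num_stages: int = 3) -> List[List[str]]:
    """Split training examples into stages ordered from easy to hard.

    `scored` pairs each example with a difficulty score. The scores are
    assumed (hypothetically) to come from an LLM rater; higher means harder.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(scored, key=lambda pair: pair[1])  # easiest first
    stages: List[List[str]] = [[] for _ in range(num_stages)]
    total = max(len(ordered), 1)
    for rank, (example, _score) in enumerate(ordered):
        # Map the difficulty rank to a stage so stages are roughly equal in size.
        stage = min(rank * num_stages // total, num_stages - 1)
        stages[stage].append(example)
    return stages
```

A training loop would then fine-tune on `stages[0]` first and move on to later stages in order.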

Title: EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation

Authors: Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, SeungYoon Han, Jong C. Park
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.12559
Pdf URL: https://arxiv.org/pdf/2412.12559
Copy Paste: [[2412.12559]] EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation(https://arxiv.org/abs/2412.12559)
Keywords: retrieval-augmented generation
Abstract: We introduce EXIT, an extractive context compression framework that enhances both the effectiveness and efficiency of retrieval-augmented generation (RAG) in question answering (QA). Current RAG systems often struggle when retrieval models fail to rank the most relevant documents, leading to the inclusion of more context at the expense of latency and accuracy. While abstractive compression methods can drastically reduce token counts, their token-by-token generation process significantly increases end-to-end latency. Conversely, existing extractive methods reduce latency but rely on independent, non-adaptive sentence selection, failing to fully utilize contextual information. EXIT addresses these limitations by classifying sentences from retrieved documents - while preserving their contextual dependencies - enabling parallelizable, context-aware extraction that adapts to query complexity and retrieval quality. Our evaluations on both single-hop and multi-hop QA tasks show that EXIT consistently surpasses existing compression methods and even uncompressed baselines in QA accuracy, while also delivering substantial reductions in inference time and token count. By improving both effectiveness and efficiency, EXIT provides a promising direction for developing scalable, high-quality QA solutions in RAG pipelines. Our code is available at this https URL

Title: Task-Agnostic Language Model Watermarking via High Entropy Passthrough Layers

Authors: Vaden Masrani, Mohammad Akbari, David Ming Xuan Yue, Ahmad Rezaei, Yong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12563
Pdf URL: https://arxiv.org/pdf/2412.12563
Copy Paste: [[2412.12563]] Task-Agnostic Language Model Watermarking via High Entropy Passthrough Layers(https://arxiv.org/abs/2412.12563)
Keywords: language model, prompt
Abstract: In the era of costly pre-training of large language models, ensuring the intellectual property rights of model owners, and ensuring that said models are responsibly deployed, is becoming increasingly important. To this end, we propose model watermarking via passthrough layers, which are added to existing pre-trained networks and trained using a self-supervised loss such that the model produces high-entropy output when prompted with a unique private key, and acts normally otherwise. Unlike existing model watermarking methods, our method is fully task-agnostic, and can be applied to both classification and sequence-to-sequence tasks without requiring advanced access to downstream fine-tuning datasets. We evaluate the proposed passthrough layers on a wide range of downstream tasks, and show experimentally that our watermarking method achieves a near-perfect watermark extraction accuracy and false-positive rate in most cases without damaging original model performance. Additionally, we show our method is robust to downstream fine-tuning, fine-pruning, and layer removal attacks, and can be trained in a fraction of the time required to train the original model. Code is available in the paper.

Title: Evaluating Zero-Shot Multilingual Aspect-Based Sentiment Analysis with Large Language Models

Authors: Chengyan Wu, Bolei Ma, Zheyu Zhang, Ningyuan Deng, Yanqing He, Yun Xue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12564
Pdf URL: https://arxiv.org/pdf/2412.12564
Copy Paste: [[2412.12564]] Evaluating Zero-Shot Multilingual Aspect-Based Sentiment Analysis with Large Language Models(https://arxiv.org/abs/2412.12564)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Aspect-based sentiment analysis (ABSA), a sequence labeling task, has attracted increasing attention in multilingual contexts. While previous research has focused largely on fine-tuning or training models specifically for ABSA, we evaluate large language models (LLMs) under zero-shot conditions to explore their potential to tackle this challenge with minimal task-specific adaptation. We conduct a comprehensive empirical evaluation of a series of LLMs on multilingual ABSA tasks, investigating various prompting strategies, including vanilla zero-shot, chain-of-thought (CoT), self-improvement, self-debate, and self-consistency, across nine different models. Results indicate that while LLMs show promise in handling multilingual ABSA, they generally fall short of fine-tuned, task-specific models. Notably, simpler zero-shot prompts often outperform more complex strategies, especially in high-resource languages like English. These findings underscore the need for further refinement of LLM-based approaches to effectively address ABSA tasks across diverse languages.
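
Of the prompting strategies listed in this abstract, self-consistency is the easiest to show in code: sample several answers and keep the most frequent one. The sketch below assumes the sampled answers are already plain label strings; the function name `self_consistent_label` and the normalization step are illustrative choices, not details taken from the paper.

```python
from collections import Counter
from typing import List


def self_consistent_label(samples: List[str]) -> str:
    """Return the most frequent label among independently sampled answers."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    # Normalize whitespace and case so that "Positive " and "positive" agree.
    normalized = [s.strip().lower() for s in samples]
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(normalized).most_common(1)[0][0]
```

For ABSA, the same vote would be taken per aspect term instead of once per sentence.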

Title: FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning

Authors: Seunghee Kim, Changhyeon Kim, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12567
Pdf URL: https://arxiv.org/pdf/2412.12567
Copy Paste: [[2412.12567]] FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning(https://arxiv.org/abs/2412.12567)
Keywords: language model, llm
Abstract: Real-world decision-making often requires integrating and reasoning over information from multiple modalities. While recent multimodal large language models (MLLMs) have shown promise in such tasks, their ability to perform multi-hop reasoning across diverse sources remains insufficiently evaluated. Existing benchmarks, such as MMQA, face challenges due to (1) data contamination and (2) a lack of complex queries that necessitate operations across more than two modalities, hindering accurate performance assessment. To address this, we present Financial Cross-Modal Multi-Hop Reasoning (FCMR), a benchmark created to analyze the reasoning capabilities of MLLMs by urging them to combine information from textual reports, tables, and charts within the financial domain. FCMR is categorized into three difficulty levels (Easy, Medium, and Hard), facilitating a step-by-step evaluation. In particular, problems at the Hard level require precise cross-modal three-hop reasoning and are designed to prevent the disregard of any modality. Experiments on this new benchmark reveal that even state-of-the-art MLLMs struggle, with the best-performing model (Claude 3.5 Sonnet) achieving only 30.4% accuracy on the most challenging tier. We also conduct analysis to provide insights into the inner workings of the models, including the discovery of a critical bottleneck in the information retrieval phase.

Title: Process-Supervised Reward Models for Clinical Note Generation: A Scalable Approach Guided by Domain Expertise

Authors: Hanyin Wang, Qiping Xu, Bolun Liu, Guleid Hussein, Hariprasad Korsapati, Mohamad El Labban, Kingsley Iheasirim, Mohamed Hassan, Gokhan Anil, Brian Bartlett, Jimeng Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12583
Pdf URL: https://arxiv.org/pdf/2412.12583
Copy Paste: [[2412.12583]] Process-Supervised Reward Models for Clinical Note Generation: A Scalable Approach Guided by Domain Expertise(https://arxiv.org/abs/2412.12583)
Keywords: language model, llm
Abstract: Process-supervised reward models (PRMs), which verify large language model (LLM) outputs step-by-step, have achieved significant success in mathematical and coding problems. However, their application to other domains remains largely unexplored. In this work, we train a PRM to provide step-level reward signals for clinical notes generated by LLMs from patient-doctor dialogues. Guided by real-world clinician expertise, we carefully designed step definitions for clinical notes and utilized Gemini-Pro 1.5 to automatically generate process supervision data at scale. Our proposed PRM, trained on the LLaMA-3.1 8B instruct model, demonstrated superior performance compared to Gemini-Pro 1.5 and an outcome-supervised reward model (ORM) across two key evaluations: (1) the accuracy of selecting gold-reference samples from error-containing samples, achieving 98.8% (versus 61.3% for ORM and 93.8% for Gemini-Pro 1.5), and (2) the accuracy of selecting physician-preferred notes, achieving 56.2% (compared to 51.2% for ORM and 50.0% for Gemini-Pro 1.5). Additionally, we conducted ablation studies to determine optimal loss functions and data selection strategies, along with physician reader studies to explore predictors of downstream Best-of-N performance. Our promising results suggest the potential of PRMs to extend beyond the clinical domain, offering a scalable and effective solution for diverse generative tasks.
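
The Best-of-N selection mentioned in this abstract can be sketched as below: each candidate note carries a list of step-level reward scores, and the candidate with the best aggregate is chosen. Aggregating with the minimum step score is an assumption made for this illustration (the abstract does not say which aggregation the authors use), and `best_of_n` is a hypothetical name.

```python
from typing import Dict, List


def best_of_n(step_scores: Dict[str, List[float]]) -> str:
    """Pick the candidate whose weakest step has the highest reward.

    `step_scores` maps a candidate identifier to its step-level scores.
    Using min() as the aggregate is an illustrative assumption.
    """
    if not step_scores:
        raise ValueError("need at least one candidate")
    if any(len(scores) == 0 for scores in step_scores.values()):
        raise ValueError("every candidate needs at least one step score")
    return max(step_scores, key=lambda name: min(step_scores[name]))
```

Swapping `min` for a product or a mean changes how strongly a single weak step penalizes a candidate.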

Title: PerSphere: A Comprehensive Framework for Multi-Faceted Perspective Retrieval and Summarization

Authors: Yun Luo, Yingjie Li, Xiangkun Hu, Qinglin Qi, Fang Guo, Qipeng Guo, Zheng Zhang, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12588
Pdf URL: https://arxiv.org/pdf/2412.12588
Copy Paste: [[2412.12588]] PerSphere: A Comprehensive Framework for Multi-Faceted Perspective Retrieval and Summarization(https://arxiv.org/abs/2412.12588)
Keywords: long context, agent
Abstract: As online platforms and recommendation algorithms evolve, people are increasingly trapped in echo chambers, leading to biased understandings of various issues. To combat this issue, we have introduced PerSphere, a benchmark designed to facilitate multi-faceted perspective retrieval and summarization, thus breaking free from these information silos. For each query within PerSphere, there are two opposing claims, each supported by distinct, non-overlapping perspectives drawn from one or more documents. Our goal is to accurately summarize these documents, aligning the summaries with the respective claims and their underlying perspectives. This task is structured as a two-step end-to-end pipeline that includes comprehensive document retrieval and multi-faceted summarization. Furthermore, we propose a set of metrics to evaluate the comprehensiveness of the retrieval and summarization content. Experimental results on various counterparts for the pipeline show that recent models struggle with such a complex task. Analysis shows that the main challenge lies in long context and perspective extraction, and we propose a simple but effective multi-agent summarization system, offering a promising solution to enhance performance on PerSphere.

Title: LLMs are Also Effective Embedding Models: An In-depth Overview

Authors: Chongyang Tao, Tao Shen, Shen Gao, Junshuo Zhang, Zhen Li, Zhengwei Tao, Shuai Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12591
Pdf URL: https://arxiv.org/pdf/2412.12591
Copy Paste: [[2412.12591]] LLMs are Also Effective Embedding Models: An In-depth Overview(https://arxiv.org/abs/2412.12591)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have revolutionized natural language processing by achieving state-of-the-art performance across various tasks. Recently, their effectiveness as embedding models has gained attention, marking a paradigm shift from traditional encoder-only models like ELMo and BERT to decoder-only, large-scale LLMs such as GPT, LLaMA, and Mistral. This survey provides an in-depth overview of this transition, beginning with foundational techniques before the LLM era, followed by LLM-based embedding models through two main strategies to derive embeddings from LLMs. 1) Direct prompting: We mainly discuss the prompt designs and the underlying rationale for deriving competitive embeddings. 2) Data-centric tuning: We cover extensive aspects that affect tuning an embedding model, including model architecture, training objectives, data constructions, etc. Upon the above, we also cover advanced methods, such as handling longer texts, and multilingual and cross-modal data. Furthermore, we discuss factors affecting choices of embedding models, such as performance/efficiency comparisons, dense vs sparse embeddings, pooling strategies, and scaling law. Lastly, the survey highlights the limitations and challenges in adapting LLMs for embeddings, including cross-task embedding quality, trade-offs between efficiency and accuracy, low-resource, long-context, data bias, robustness, etc. This survey serves as a valuable resource for researchers and practitioners by synthesizing current advancements, highlighting key challenges, and offering a comprehensive framework for future work aimed at enhancing the effectiveness and efficiency of LLMs as embedding models.
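
Two of the pooling strategies such a survey would typically compare, mean pooling and last-token pooling, can be sketched in plain Python. This is a generic illustration, not code from the survey; `hidden` stands for per-token hidden states and `mask` for an attention mask (1 for real tokens, 0 for padding), both hypothetical names.

```python
from typing import List


def mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average the hidden states of all non-padding tokens."""
    kept = [vec for vec, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def last_token_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Use the hidden state of the final non-padding token as the embedding."""
    for vec, m in zip(reversed(hidden), reversed(mask)):
        if m:
            return list(vec)
    raise ValueError("mask selects no tokens")
```

Last-token pooling suits decoder-only models because, under causal attention, only the final position has attended to the whole input.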

Title: MultiLingPoT: Enhancing Mathematical Reasoning with Multilingual Program Fine-tuning

Authors: Nianqi Li, Zujie Liang, Siyu Yuan, Jiaqing Liang, Feng Wei, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12609
Pdf URL: https://arxiv.org/pdf/2412.12609
Copy Paste: [[2412.12609]] MultiLingPoT: Enhancing Mathematical Reasoning with Multilingual Program Fine-tuning(https://arxiv.org/abs/2412.12609)
Keywords: llm
Abstract: Program-of-Thought (PoT), which aims to use programming language instead of natural language as an intermediate step in reasoning, is an important way for LLMs to solve mathematical problems. Since different programming languages excel in different areas, it is natural to use the most suitable language for solving specific problems. However, current PoT research only focuses on single-language PoT, ignoring the differences between different programming languages. Therefore, this paper proposes a multilingual program reasoning method, MultiLingPoT. This method allows the model to answer questions using multiple programming languages by fine-tuning on multilingual data. Additionally, prior and posterior hybrid methods are used to help the model select the most suitable language for each problem. Our experimental results show that the training of MultiLingPoT improves each program's mathematical reasoning by about 2.5%. Moreover, with proper mixing, the performance of MultiLingPoT can be further improved, achieving a 6% increase compared to the single-language PoT with the data this http URL of this paper can be found at this https URL.

Title: SynthCypher: A Fully Synthetic Data Generation Framework for Text-to-Cypher Querying in Knowledge Graphs

Authors: Aman Tiwari, Shiva Krishna Reddy Malay, Vikas Yadav, Masoud Hashemi, Sathwik Tejaswi Madhusudhan
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12612
Pdf URL: https://arxiv.org/pdf/2412.12612
Copy Paste: [[2412.12612]] SynthCypher: A Fully Synthetic Data Generation Framework for Text-to-Cypher Querying in Knowledge Graphs(https://arxiv.org/abs/2412.12612)
Keywords: language model, llm
Abstract: Cypher, the query language for Neo4j graph databases, plays a critical role in enabling graph-based analytics and data exploration. While substantial research has been dedicated to natural language to SQL query generation (Text2SQL), the analogous problem for graph databases, referred to as Text2Cypher, remains underexplored. In this work, we introduce SynthCypher, a fully synthetic and automated data generation pipeline designed to address this gap. SynthCypher employs a novel LLM-Supervised Generation-Verification framework, ensuring syntactically and semantically correct Cypher queries across diverse domains and query complexities. Using this pipeline, we create SynthCypher Dataset, a large-scale benchmark containing 29.8k Text2Cypher instances. Fine-tuning open-source large language models (LLMs), including LLaMa-3.1-8B, Mistral-7B, and QWEN-7B, on SynthCypher yields significant performance improvements of up to 40% on the Text2Cypher test set and 30% on the SPIDER benchmark adapted for graph databases. This work demonstrates that high-quality synthetic data can effectively advance the state-of-the-art in Text2Cypher tasks.

Title: Jailbreaking? One Step Is Enough!

Authors: Weixiong Zheng, Peijian Zeng, Yiwei Li, Hongyan Wu, Nankai Lin, Junhao Chen, Aimin Yang, Yongmei Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12621
Pdf URL: https://arxiv.org/pdf/2412.12621
Copy Paste: [[2412.12621]] Jailbreaking? One Step Is Enough!(https://arxiv.org/abs/2412.12621)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs. Examining jailbreak prompts helps uncover the shortcomings of LLMs. However, current jailbreak methods and the target model's defenses are engaged in an independent and adversarial process, resulting in the need for frequent attack iterations and redesigning attacks for different models. To address these gaps, we propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as the "defense" intention against harmful content. Specifically, REDA starts from the target response, guiding the model to embed harmful content within its defensive measures, thereby relegating harmful content to a secondary role and making the model believe it is performing a defensive task. The attacking model considers that it is guiding the target model to deal with harmful content, while the target model thinks it is performing a defensive task, creating an illusion of cooperation between the two. Additionally, to enhance the model's confidence and guidance in "defensive" intentions, we adopt in-context learning (ICL) with a small number of attack examples and construct a corresponding dataset of attack examples. Extensive evaluations demonstrate that the REDA method enables cross-model attacks without the need to redesign attack strategies for different models, enables successful jailbreak in one iteration, and outperforms existing methods on both open-source and closed-source models.

Title: Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation

Authors: Andong Chen, Yuchen Song, Kehai Chen, Muyun Yang, Tiejun Zhao, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12627
Pdf URL: https://arxiv.org/pdf/2412.12627
Copy Paste: [[2412.12627]] Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation(https://arxiv.org/abs/2412.12627)
Keywords: language model, llm
Abstract: Visual information has been introduced for enhancing machine translation (MT), and its effectiveness heavily relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we introduce a stable diffusion-based imagination network into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence, thereby advancing multimodal MT. In particular, we build heuristic human feedback with reinforcement learning to ensure the consistency of the generated image with the source sentence without the supervision of image annotation, which breaks the bottleneck of using visual information in MT. Furthermore, the proposed method enables imaginative visual information to be integrated into large-scale text-only MT in addition to multimodal MT. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT, especially achieving an average improvement of more than 14 BLEU points on Multi30K multimodal MT benchmarks.

Title: What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context

Authors: Zhiyuan Chang, Mingyang Li, Xiaojun Jia, Junjie Wang, Yuekai Huang, Qing Wang, Yihao Huang, Yang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12632
Pdf URL: https://arxiv.org/pdf/2412.12632
Copy Paste: [[2412.12632]] What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context(https://arxiv.org/abs/2412.12632)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Incorporating external knowledge into large language models (LLMs) has emerged as a promising approach to mitigate outdated knowledge and hallucination in LLMs. However, external knowledge is often imperfect. In addition to useful knowledge, external knowledge is rich in irrelevant information or misinformation in the context that can impair the reliability of LLM responses. This paper focuses on LLMs' preferred external knowledge in imperfect contexts when handling multi-hop QA. Inspired by criminal procedural law's Chain of Evidence (CoE), we posit that knowledge preferred by LLMs should maintain both relevance to the question and mutual support among knowledge pieces. Accordingly, we propose an automated CoE discrimination approach and explore LLMs' preferences in terms of their effectiveness, faithfulness and robustness, as well as CoE's usability in a naive Retrieval-Augmented Generation (RAG) case. The evaluation on five LLMs reveals that CoE enhances LLMs through more accurate generation, stronger answer faithfulness, better robustness against knowledge conflict, and improved performance in a popular RAG case.

Title: Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree

Authors: Xiangxiang Gao, Weisheng Xie, Yiwei Xiang, Feng Ji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12639
Pdf URL: https://arxiv.org/pdf/2412.12639
Copy Paste: [[2412.12639]] Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree(https://arxiv.org/abs/2412.12639)
Keywords: language model, llm, chat
Abstract: Striking an optimal balance between minimal drafting latency and high speculation accuracy to enhance the inference speed of Large Language Models remains a significant challenge in speculative decoding. In this paper, we introduce Falcon, an innovative semi-autoregressive speculative decoding framework fashioned to augment both the drafter's parallelism and output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy. We offer a comprehensive theoretical analysis to illuminate the underlying mechanisms. Additionally, we introduce a Custom-Designed Decoding Tree, which permits the drafter to generate multiple tokens in a single forward pass and accommodates multiple forward passes as needed, thereby boosting the number of drafted tokens and significantly improving the overall acceptance rate. Comprehensive evaluations on benchmark datasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon's superior acceleration capabilities. The framework achieves a lossless speedup ratio ranging from 2.91x to 3.51x when tested on the Vicuna and LLaMA2-Chat model series. These results outstrip existing speculative decoding methods for LLMs, including Eagle, Medusa, Lookahead, SPS, and PLD, while maintaining a compact drafter architecture equivalent to merely two Transformer layers.
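
The verification step that makes speculative decoding lossless can be sketched in its simplest, greedy form. This is a generic linear-draft illustration, not Falcon's decoding tree: `draft` holds the drafter's proposed tokens and `target` holds the target model's own greedy token at each drafted position (both hypothetical inputs, assumed to come from a single target forward pass). The bonus token that full implementations append when every drafted token is accepted is omitted here.

```python
from typing import List


def accept_draft(draft: List[int], target: List[int]) -> List[int]:
    """Keep drafted tokens up to the first disagreement with the target model.

    At the first mismatch the target's token is emitted instead, so the
    output always matches what greedy decoding with the target would give.
    """
    accepted: List[int] = []
    for drafted, expected in zip(draft, target):
        if drafted != expected:
            accepted.append(expected)  # target's correction replaces the rejected token
            break
        accepted.append(drafted)
    return accepted
```

The more drafted tokens survive this check per target forward pass, the larger the speedup, which is why draft accuracy matters as much as draft latency.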

Title: LLM-based Discriminative Reasoning for Knowledge Graph Question Answering

Authors: Mufan Xu, Kehai Chen, Xuefeng Bai, Muyun Yang, Tiejun Zhao, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12643
Pdf URL: https://arxiv.org/pdf/2412.12643
Copy Paste: [[2412.12643]] LLM-based Discriminative Reasoning for Knowledge Graph Question Answering(https://arxiv.org/abs/2412.12643)
Keywords: language model, llm
Abstract: Large language models (LLMs) based on generative pre-trained Transformer have achieved remarkable performance on knowledge graph question-answering (KGQA) tasks. However, LLMs often produce ungrounded subgraph planning or reasoning results in KGQA due to the hallucinatory behavior brought by the generative paradigm, which may hinder the advancement of the LLM-based KGQA model. To deal with the issue, we propose a novel LLM-based Discriminative Reasoning (LDR) method to explicitly model the subgraph retrieval and answer inference process. By adopting discriminative strategies, the proposed LDR method not only enhances the capability of LLMs to retrieve question-related subgraphs but also alleviates the issue of ungrounded reasoning brought by the generative paradigm of LLMs. Experimental results show that the proposed approach outperforms multiple strong comparison methods and achieves state-of-the-art performance on two widely used benchmarks, WebQSP and CWQ.

Title: iPrOp: Interactive Prompt Optimization for Large Language Models with a Human in the Loop

Authors: Jiahui Li, Roman Klinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12644
Pdf URL: https://arxiv.org/pdf/2412.12644
Copy Paste: [[2412.12644]] iPrOp: Interactive Prompt Optimization for Large Language Models with a Human in the Loop(https://arxiv.org/abs/2412.12644)
Keywords: language model, prompt
Abstract: Prompt engineering has made significant contributions to the era of large language models, yet its effectiveness depends on the skills of a prompt author. Automatic prompt optimization can support the prompt development process, but requires annotated data. This paper introduces $\textit{iPrOp}$, a novel Interactive Prompt Optimization system, to bridge manual prompt engineering and automatic prompt optimization. With human intervention in the optimization loop, $\textit{iPrOp}$ offers users the flexibility to assess evolving prompts. We present users with prompt variations, selected instances, large language model predictions accompanied by corresponding explanations, and performance metrics derived from a subset of the training data. This approach empowers users to choose and further refine the provided prompts based on their individual preferences and needs. This system not only assists non-technical domain experts in generating optimal prompts tailored to their specific tasks or domains, but also enables the study of the intrinsic parameters that influence the performance of prompt optimization. Our evaluation shows that our system is capable of generating improved prompts, leading to enhanced task performance.

Title: Train More Parameters But Mind Their Placement: Insights into Language Adaptation with PEFT

Authors: Jenny Kunz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12674
Pdf URL: https://arxiv.org/pdf/2412.12674
Copy Paste: [[2412.12674]] Train More Parameters But Mind Their Placement: Insights into Language Adaptation with PEFT(https://arxiv.org/abs/2412.12674)
Keywords: llm
Abstract: Smaller LLMs still face significant challenges even in medium-resourced languages, particularly when it comes to language-specific knowledge -- a problem not easily resolved with machine-translated data. In this case study on Icelandic, we aim to enhance the generation performance of an LLM by specialising it using unstructured text corpora. A key focus is on preventing interference with the models' capabilities of handling longer context during this adaptation. Through ablation studies using various parameter-efficient fine-tuning (PEFT) methods and setups, we find that increasing the number of trainable parameters leads to better and more robust language adaptation. LoRAs placed in the feed-forward layers and bottleneck adapters show promising results with sufficient parameters, while prefix tuning and (IA)$^3$ are not suitable. Although improvements are consistent in 0-shot summarisation, some adapted models struggle with longer context lengths, an issue that can be mitigated by adapting only the final layers.

Title: Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features

Authors: Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12679
Pdf URL: https://arxiv.org/pdf/2412.12679
Copy Paste: [[2412.12679]] Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features(https://arxiv.org/abs/2412.12679)
Keywords: language model, gpt, llm, prompt
Abstract: The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly for longer texts and in texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by extending artifacts from their original versions. To address the challenge of detecting highly similar paraphrased texts, we propose MhBART, an encoder-decoder model designed to emulate human writing style while incorporating a novel difference score mechanism. This model outperforms strong classifier baselines and identifies deceptive sentence patterns. To better capture the structure of longer texts at document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It results in substantial performance gains across both datasets -- 15.5% absolute improvement on paraLFQA, 4% absolute improvement on paraWP, and 1.5% absolute improvement on M4 compared to SOTA approaches.

Title: XTransplant: A Probe into the Upper Bound Performance of Multilingual Capability and Culture Adaptability in LLMs via Mutual Cross-lingual Feed-forward Transplantation

Authors: Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Libo Qin, Yichong Huang, Lei Huang, Weitao Ma, Zhirui Zhang, Yunfei Lu, Xiaohui Yan, Duyu Tang, Dandan Tu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12686
Pdf URL: https://arxiv.org/pdf/2412.12686
Copy Paste: [[2412.12686]] XTransplant: A Probe into the Upper Bound Performance of Multilingual Capability and Culture Adaptability in LLMs via Mutual Cross-lingual Feed-forward Transplantation(https://arxiv.org/abs/2412.12686)
Keywords: language model, llm
Abstract: Current large language models (LLMs) often exhibit imbalances in multilingual capabilities and cultural adaptability, largely due to their English-centric pretraining data. To address this imbalance, we propose a probing method named XTransplant that explores cross-lingual latent interactions via cross-lingual feed-forward transplantation during the inference stage, with the hope of enabling the model to leverage the strengths of both English and non-English languages. Through extensive pilot experiments, we empirically prove that both the multilingual capabilities and cultural adaptability of LLMs hold the potential to be significantly improved by XTransplant, respectively from En -> non-En and non-En -> En, highlighting the underutilization of current LLMs' multilingual potential. The patterns observed in these pilot experiments further motivate an offline scaling inference strategy, which demonstrates consistent performance improvements in multilingual and culture-aware tasks, sometimes even surpassing multilingual supervised fine-tuning. We hope that our further analysis and discussion can help provide deeper insights into the XTransplant mechanism.

Title: Trigger$^3$: Refining Query Correction via Adaptive Model Selector

Authors: Kepu Zhang, Zhongxiang Sun, Xiao Zhang, Xiaoxue Zang, Kai Zheng, Yang Song, Jun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12701
Pdf URL: https://arxiv.org/pdf/2412.12701
Copy Paste: [[2412.12701]] Trigger$^3$: Refining Query Correction via Adaptive Model Selector(https://arxiv.org/abs/2412.12701)
Keywords: language model, llm
Abstract: In search scenarios, user experience can be hindered by erroneous queries due to typos, voice errors, or knowledge gaps. Therefore, query correction is crucial for search engines. Current correction models, usually small models trained on specific data, often struggle with queries beyond their training scope or those requiring contextual understanding. While the advent of Large Language Models (LLMs) offers a potential solution, they are still limited by their pre-training data and inference cost, particularly for complex queries, so they are not always effective for query correction. To tackle these issues, we propose Trigger$^3$, a large-small model collaboration framework that integrates the traditional correction model and an LLM for query correction, and that adaptively chooses the appropriate correction method based on the query and the correction results from the traditional correction model and the LLM. Trigger$^3$ first employs a correction trigger to filter out correct queries. Incorrect queries are then corrected by the traditional correction model. If this fails, an LLM trigger is activated to call the LLM for correction. Finally, for queries that no model can correct, a fallback trigger decides to return the original query. Extensive experiments demonstrate that Trigger$^3$ outperforms correction baselines while maintaining efficiency.
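
The trigger cascade described in the Trigger$^3$ abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `cascade_correct`, the trigger and model callables, and the toy lookup tables are all hypothetical stand-ins, and in practice the triggers would presumably be learned models rather than hand-written rules.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: a corrector returns a fixed query, or None on failure.
Corrector = Callable[[str], Optional[str]]
Trigger = Callable[[str], bool]


def cascade_correct(
    query: str,
    correction_trigger: Trigger,
    small_model: Corrector,
    llm_trigger: Trigger,
    llm: Corrector,
) -> Tuple[str, str]:
    """Return (final_query, route), where route names the stage that decided."""
    # Stage 1: the correction trigger filters out queries that are already correct.
    if not correction_trigger(query):
        return query, "correct"
    # Stage 2: try the cheap traditional correction model first.
    fixed = small_model(query)
    if fixed is not None:
        return fixed, "small"
    # Stage 3: only call the expensive LLM when its trigger fires.
    if llm_trigger(query):
        fixed = llm(query)
        if fixed is not None:
            return fixed, "llm"
    # Stage 4: fallback, return the original query unchanged.
    return query, "fallback"


# Toy stand-ins (assumptions for illustration only).
VOCAB = {"weather", "today", "python", "tutorial"}
SMALL_FIXES = {"wether": "weather", "pyhton": "python"}
LLM_FIXES = {**SMALL_FIXES, "tutorail": "tutorial"}


def _apply(table: dict, query: str) -> Optional[str]:
    words = [table.get(w, w) for w in query.split()]
    return " ".join(words) if all(w in VOCAB for w in words) else None


def route(query: str) -> Tuple[str, str]:
    return cascade_correct(
        query,
        correction_trigger=lambda q: any(w not in VOCAB for w in q.split()),
        small_model=lambda q: _apply(SMALL_FIXES, q),
        llm_trigger=lambda q: len(q.split()) <= 8,
        llm=lambda q: _apply(LLM_FIXES, q),
    )


if __name__ == "__main__":
    for q in ["weather today", "wether today", "python tutorail", "xyzzy today"]:
        print(q, "->", route(q))
```

The ordering reflects the efficiency argument in the abstract: the cheaper stages run first, and the LLM is consulted only for queries that the earlier stages could not resolve.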

Title: More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression

Authors: Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang, Xiaoguang Li, Lifeng Shang, Qun Liu, Sujian Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12706
Pdf URL: https://arxiv.org/pdf/2412.12706
Copy Paste: [[2412.12706]] More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression(https://arxiv.org/abs/2412.12706)
Keywords: language model, llm
Abstract: As large language models (LLMs) process increasing context windows, the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension and seldom explore the efficiency of their combination. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression. Experiments demonstrate that storing more tokens in the KV cache with lower precision, i.e., quantized pruning, can significantly enhance the long-context performance of LLMs. Furthermore, in-depth analysis of the token-precision trade-off across a series of key aspects shows that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Moreover, quantized pruning demonstrates notable stability across different KV pruning methods, quantization strategies, and model scales. These findings provide valuable insights into the token-precision trade-off in KV cache compression. We plan to release our code in the near future.
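
The token-precision trade-off in the KV cache abstract comes down to simple budget arithmetic, sketched below in Python. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example rather than a configuration from the paper, and the estimate ignores quantization metadata such as scales and zero points, which adds some overhead in practice.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bits: int) -> int:
    """Approximate KV cache size: keys and values (factor 2) for every layer and head."""
    total_bits = 2 * layers * kv_heads * head_dim * tokens * bits
    return total_bits // 8


def tokens_within_budget(budget_bytes: int, layers: int, kv_heads: int, head_dim: int, bits: int) -> int:
    """How many tokens fit in a fixed memory budget at a given precision."""
    bits_per_token = 2 * layers * kv_heads * head_dim * bits
    return (budget_bytes * 8) // bits_per_token


# Hypothetical model shape, for illustration only.
SHAPE = dict(layers=32, kv_heads=8, head_dim=128)

# Budget: whatever 1024 tokens cost at 16-bit precision.
BUDGET = kv_cache_bytes(tokens=1024, bits=16, **SHAPE)

if __name__ == "__main__":
    for bits in (16, 8, 4):
        n = tokens_within_budget(BUDGET, bits=bits, **SHAPE)
        print(f"{bits:>2}-bit: {n} tokens fit in {BUDGET} bytes")
```

Under these assumptions, the same budget that holds 1024 tokens at 16 bits holds four times as many at 4 bits, which is the "more tokens, lower precision" side of the trade-off. Whether that exchange improves task quality is the empirical question the paper studies.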

Title: Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion

Authors: Syed Zohaib Hassan, Pierre Lison, Pål Halvorsen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12710
Pdf URL: https://arxiv.org/pdf/2412.12710
Copy Paste: [[2412.12710]] Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion(https://arxiv.org/abs/2412.12710)
Keywords: language model, llm, agent
Abstract: Disfluencies are a natural feature of spontaneous human speech but are typically absent from the outputs of Large Language Models (LLMs). This absence can diminish the perceived naturalness of synthesized speech, which is an important criterion when building conversational agents that aim to mimic human behaviours. We show how the insertion of disfluencies can alleviate this shortcoming. The proposed approach involves (1) fine-tuning an LLM with Low-Rank Adaptation (LoRA) to incorporate various types of disfluencies into LLM-generated utterances and (2) synthesizing those utterances using a text-to-speech model that supports the generation of speech phenomena such as disfluencies. We evaluated the quality of the generated speech across two metrics: intelligibility and perceived spontaneity. We demonstrate through a user study that the insertion of disfluencies significantly increases the perceived spontaneity of the generated speech. This increase came, however, along with a slight reduction in intelligibility.

Title: Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection

Authors: Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12761
Pdf URL: https://arxiv.org/pdf/2412.12761
Copy Paste: [[2412.12761]] Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection(https://arxiv.org/abs/2412.12761)
Keywords: language model, prompt
Abstract: In this paper, we report our experiments with various strategies to improve code-mixed humour and sarcasm detection. We conducted all of our experiments in the Hindi-English code-mixed scenario, as we have the linguistic expertise for it. We experimented with three approaches, namely (i) native sample mixing, (ii) multi-task learning (MTL), and (iii) prompting very large multilingual language models (VMLMs). In native sample mixing, we added monolingual task samples to code-mixed training sets. In MTL, we relied on native and code-mixed samples of a semantically related task (hate detection in our case). Finally, in our third approach, we evaluated the efficacy of VMLMs via few-shot context prompting. Some interesting findings are that (i) adding native samples improved humour (raising the F1-score by up to 6.76%) and sarcasm (raising the F1-score by up to 8.64%) detection, (ii) training MLMs in an MTL framework boosted performance for both humour (raising the F1-score by up to 10.67%) and sarcasm (raising the F1-score by up to 12.35%) detection, and (iii) prompting VMLMs could not outperform the other approaches. Finally, our ablation studies and error analysis identified the cases where our model has yet to improve. We provide our code for reproducibility.

Title: Detecting Emotional Incongruity of Sarcasm by Commonsense Reasoning

Authors: Ziqi Qiu, Jianxing Yu, Yufeng Zhang, Hanjiang Lai, Yanghui Rao, Qinliang Su, Jian Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12808
Pdf URL: https://arxiv.org/pdf/2412.12808
Copy Paste: [[2412.12808]] Detecting Emotional Incongruity of Sarcasm by Commonsense Reasoning(https://arxiv.org/abs/2412.12808)
Keywords: language model
Abstract: This paper focuses on sarcasm detection, which aims to identify whether given statements convey criticism, mockery, or other negative sentiment opposite to the literal meaning. To detect sarcasm, humans often require a comprehensive understanding of the semantics in the statement and even resort to external commonsense to infer the fine-grained incongruity. However, existing methods lack commonsense inferential ability when they face complex real-world scenarios, leading to unsatisfactory performance. To address this problem, we propose a novel framework for sarcasm detection, which conducts incongruity reasoning based on commonsense augmentation, called EICR. Concretely, we first employ retrieval-augmented large language models to supplement the missing but indispensable commonsense background knowledge. To capture complex contextual associations, we construct a dependency graph and obtain the optimized topology via graph refinement. We further introduce an adaptive reasoning skeleton that integrates prior rules to extract sentiment-inconsistent subgraphs explicitly. To eliminate the possible spurious relations between words and labels, we employ adversarial contrastive learning to enhance the robustness of the detector. Experiments conducted on five datasets demonstrate the effectiveness of EICR.

Title: DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models

Authors: Jinxiang Xie, Yilin Li, Xunjian Yin, Xiaojun Wan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12832
Pdf URL: https://arxiv.org/pdf/2412.12832
Copy Paste: [[2412.12832]] DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models(https://arxiv.org/abs/2412.12832)
Keywords: language model, llm
Abstract: Evaluating the performance of Grammatical Error Correction (GEC) models has become increasingly challenging, as large language model (LLM)-based GEC systems often produce corrections that diverge from provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics. In this study, we propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism. Our framework employs the Analytic Hierarchy Process (AHP) in conjunction with large language models to ascertain the relative importance of various evaluation criteria. Additionally, we develop a dataset incorporating human annotations and LLM-simulated sentences to validate our algorithms and fine-tune more cost-effective models. Experimental results indicate that our proposed approach enhances the effectiveness of GEC model evaluations.
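
The Analytic Hierarchy Process mentioned in the DSGram abstract turns pairwise importance judgments into criterion weights. The Python sketch below uses the common geometric-mean approximation of AHP priorities. It is an illustration under stated assumptions, not the paper's procedure: the pairwise matrix is invented, the helper names `ahp_weights` and `weighted_score` are hypothetical, and the sketch computes one static set of weights, whereas the abstract describes a dynamic weighting mechanism in which large language models help determine the judgments.

```python
import math
from typing import List

# Criteria named in the abstract; the order fixes the rows and columns below.
CRITERIA = ["semantic_coherence", "edit_level", "fluency"]

# Hypothetical pairwise judgments: PAIRWISE[i][j] says how much more important
# criterion i is than criterion j (reciprocal matrix, 1.0 on the diagonal).
PAIRWISE = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]


def ahp_weights(pairwise: List[List[float]]) -> List[float]:
    """Geometric-mean approximation of AHP priority weights (they sum to 1)."""
    geo_means = [math.prod(row) ** (1.0 / len(row)) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def weighted_score(sub_scores: List[float], weights: List[float]) -> float:
    """Combine per-criterion scores into one overall score."""
    return sum(s * w for s, w in zip(sub_scores, weights))


WEIGHTS = ahp_weights(PAIRWISE)

if __name__ == "__main__":
    for name, w in zip(CRITERIA, WEIGHTS):
        print(f"{name}: {w:.3f}")
    print("overall:", round(weighted_score([8.0, 6.0, 9.0], WEIGHTS), 3))
```

For a perfectly consistent matrix such as this one, the geometric-mean weights coincide with the principal-eigenvector weights of classical AHP. For inconsistent judgments the two can differ slightly.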

Title: Benchmarking and Understanding Compositional Relational Reasoning of LLMs

Authors: Ruikang Ni, Da Xiao, Qingye Meng, Xiangyu Li, Shihui Zheng, Hongliang Liang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12841
Pdf URL: https://arxiv.org/pdf/2412.12841
Copy Paste: [[2412.12841]] Benchmarking and Understanding Compositional Relational Reasoning of LLMs(https://arxiv.org/abs/2412.12841)
Keywords: language model, llm
Abstract: Compositional relational reasoning (CRR) is a hallmark of human intelligence, but we lack a clear understanding of whether and how existing Transformer-based large language models (LLMs) can solve CRR tasks. To enable systematic exploration of the CRR capability of LLMs, we first propose a new synthetic benchmark called Generalized Associative Recall (GAR) by integrating and generalizing the essence of several tasks from mechanistic interpretability (MI) studies in a unified framework. Evaluation shows that GAR is challenging enough for existing LLMs, revealing their fundamental deficiency in CRR. Meanwhile, it is easy enough for systematic MI study. Then, to understand how LLMs solve GAR tasks, we use attribution patching to discover the core circuits reused by Vicuna-33B across different tasks and a set of vital attention heads. Intervention experiments show that the correct functioning of these heads significantly impacts task performance. In particular, we identify two classes of heads whose activations represent the abstract notions of true and false in GAR tasks, respectively. They play a fundamental role in CRR across various models and tasks. The dataset and code are available at this https URL.

Title: Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models

Authors: Yuchen Fan, Yuzhong Hong, Qiushi Wang, Junwei Bao, Hongfei Jiang, Yang Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12865
Pdf URL: https://arxiv.org/pdf/2412.12865
Copy Paste: [[2412.12865]] Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models(https://arxiv.org/abs/2412.12865)
Keywords: language model, llm
Abstract: Alignment, which endows a pre-trained large language model (LLM) with the ability to follow instructions, is crucial for its real-world applications. Conventional supervised fine-tuning (SFT) methods formalize it as causal language modeling, typically with a cross-entropy objective, requiring a large amount of high-quality instruction-response pairs. However, the quality of widely used SFT datasets cannot be guaranteed due to the high cost and intensive labor of creating and maintaining them in practice. To overcome the limitations associated with the quality of SFT datasets, we introduce a novel preference-oriented supervised fine-tuning approach, namely PoFT. The intuition is to boost SFT by imposing a particular preference: favoring the target model over aligned LLMs on the same SFT data. This preference encourages the target model to predict a higher likelihood than that predicted by the aligned LLMs, incorporating assessment information on data quality (i.e., the likelihood predicted by the aligned LLMs) into the training process. Extensive experiments are conducted, and the results validate the effectiveness of the proposed method. PoFT achieves stable and consistent improvements over the SFT baselines across different training datasets and base models. Moreover, we prove that PoFT can be integrated with existing SFT data filtering methods to achieve better performance, and can be further improved by subsequent preference optimization procedures, such as DPO.

Title: RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement

Authors: Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Wayne Xin Zhao, Yang Song, Tao Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12881
Pdf URL: https://arxiv.org/pdf/2412.12881
Copy Paste: [[2412.12881]] RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement(https://arxiv.org/abs/2412.12881)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Existing large language models (LLMs) show exceptional problem-solving capabilities but might struggle with complex reasoning tasks. Despite the successes of chain-of-thought and tree-based search methods, they mainly depend on the internal knowledge of LLMs to search over intermediate reasoning steps, and are thus limited to simple tasks involving fewer reasoning steps. In this paper, we propose RAG-Star, a novel RAG approach that integrates the retrieved information to guide the tree-based deliberative reasoning process that relies on the inherent knowledge of LLMs. By leveraging Monte Carlo Tree Search, RAG-Star iteratively plans intermediate sub-queries and answers for reasoning based on the LLM itself. To consolidate internal and external knowledge, we propose a retrieval-augmented verification that utilizes query- and answer-aware reward modeling to provide feedback for the inherent reasoning of LLMs. Our experiments involving Llama-3.1-8B-Instruct and GPT-4o demonstrate that RAG-Star significantly outperforms previous RAG and reasoning methods.

Title: Question: How do Large Language Models perform on the Question Answering tasks? Answer:

Authors: Kevin Fischer, Darren Fürst, Sebastian Steindl, Jakob Lindner, Ulrich Schäfer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12893
Pdf URL: https://arxiv.org/pdf/2412.12893
Copy Paste: [[2412.12893]] Question: How do Large Language Models perform on the Question Answering tasks? Answer:(https://arxiv.org/abs/2412.12893)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have shown promising results on various NLP tasks without needing to be explicitly trained for them, by using few-shot or zero-shot prompting techniques. A common NLP task is question answering (QA). In this study, we present a comprehensive performance comparison between smaller fine-tuned models and out-of-the-box instruction-following LLMs on the Stanford Question Answering Dataset 2.0 (SQuAD2), specifically when using a single-inference prompting technique. Since the dataset contains unanswerable questions, previous work used a double inference method. We propose a prompting style which aims to elicit the same ability without the need for double inference, saving compute time and resources. Furthermore, we investigate their generalization capabilities by comparing their performance on similar but different QA datasets, without fine-tuning either model, emulating real-world uses where the context and questions asked may differ from the original training distribution, for example swapping Wikipedia for news articles. Our results show that smaller, fine-tuned models outperform current State-Of-The-Art (SOTA) LLMs on the fine-tuned task, but recent SOTA models are able to close this gap on the out-of-distribution test and even outperform the fine-tuned models on 3 of the 5 tested QA datasets.

Title: Truthful Text Sanitization Guided by Inference Attacks

Authors: Ildikó Pilán, Benet Manzanares-Salor, David Sánchez, Pierre Lison
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12928
Pdf URL: https://arxiv.org/pdf/2412.12928
Copy Paste: [[2412.12928]] Truthful Text Sanitization Guided by Inference Attacks(https://arxiv.org/abs/2412.12928)
Keywords: language model, llm
Abstract: The purpose of text sanitization is to rewrite those text spans in a document that may directly or indirectly identify an individual, to ensure they no longer disclose personal information. Text sanitization must strike a balance between preventing the leakage of personal information (privacy protection) and retaining as much of the document's original content as possible (utility preservation). We present an automated text sanitization strategy based on generalizations, which are more abstract (but still informative) terms that subsume the semantic content of the original text spans. The approach relies on instruction-tuned large language models (LLMs) and is divided into two stages. The LLM is first applied to obtain truth-preserving replacement candidates and rank them according to their abstraction level. Those candidates are then evaluated for their ability to protect privacy by conducting inference attacks with the LLM. Finally, the system selects the most informative replacement shown to be resistant to those attacks. As a consequence of this two-stage process, the chosen replacements effectively balance utility and privacy. We also present novel metrics to automatically evaluate these two aspects without the need to manually annotate data. Empirical results on the Text Anonymization Benchmark show that the proposed approach leads to enhanced utility, with only a marginal increase in the risk of re-identifying protected individuals compared to fully suppressing the original information. Furthermore, the selected replacements are shown to be more truth-preserving and abstractive than those of previous methods.

Title: Improving Fine-grained Visual Understanding in VLMs through Text-Only Training

Authors: Dasol Choi, Guijin Son, Soo Yong Kim, Gio Paik, Seunghyeok Hong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12940
Pdf URL: https://arxiv.org/pdf/2412.12940
Copy Paste: [[2412.12940]] Improving Fine-grained Visual Understanding in VLMs through Text-Only Training(https://arxiv.org/abs/2412.12940)
Keywords: language model
Abstract: Visual-Language Models (VLMs) have become a powerful tool for bridging the gap between visual and linguistic understanding. However, the conventional learning approaches for VLMs often suffer from limitations, such as the high resource requirements of collecting and training image-text paired data. Recent research has suggested that language understanding plays a crucial role in the performance of VLMs, potentially indicating that text-only training could be a viable approach. In this work, we investigate the feasibility of enhancing fine-grained visual understanding in VLMs through text-only training. Inspired by how humans develop visual concept understanding, where rich textual descriptions can guide visual recognition, we hypothesize that VLMs can also benefit from leveraging text-based representations to improve their visual recognition abilities. We conduct comprehensive experiments on two distinct domains: fine-grained species classification and cultural visual understanding tasks. Our findings demonstrate that text-only training can be comparable to conventional image-text training while significantly reducing computational costs. This suggests a more efficient and cost-effective pathway for advancing VLM capabilities, particularly valuable in resource-constrained environments.

Title: MOPO: Multi-Objective Prompt Optimization for Affective Text Generation

Authors: Yarik Menchaca Resendiz, Roman Klinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12948
Pdf URL: https://arxiv.org/pdf/2412.12948
Copy Paste: [[2412.12948]] MOPO: Multi-Objective Prompt Optimization for Affective Text Generation(https://arxiv.org/abs/2412.12948)
Keywords: prompt
Abstract: How emotions are expressed depends on the context and domain. On X (formerly Twitter), for instance, an author might simply use the hashtag #anger, while in a news headline, emotions are typically written in a more polite, indirect manner. To enable conditional text generation models to create emotionally connotated texts that fit a domain, users need to have access to a parameter that allows them to choose the appropriate way to express an emotion. To achieve this, we introduce MOPO, a Multi-Objective Prompt Optimization methodology. MOPO optimizes prompts according to multiple objectives (which correspond here to the output probabilities assigned by emotion classifiers trained for different domains). In contrast to single objective optimization, MOPO outputs a set of prompts, each with a different weighting of the multiple objectives. Users can then choose the most appropriate prompt for their context. We evaluate MOPO using three objectives, determined by various domain-specific emotion classifiers. MOPO improves performance by up to 15 pp across all objectives with a minimal loss (1-2 pp) for any single objective compared to single-objective optimization. These minor performance losses are offset by a broader generalization across multiple objectives - which is not possible with single-objective optimization. Additionally, MOPO reduces computational requirements by simultaneously optimizing for multiple objectives, eliminating separate optimization procedures for each objective.
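One common way to obtain a set of candidates that each trade off several objectives differently is to keep only the non-dominated (Pareto-optimal) ones. The sketch below illustrates that idea on made-up scores; it is an assumption for illustration, and the abstract does not confirm that MOPO selects prompts exactly this way. Prompt names and score values are hypothetical.

```python
from typing import Dict, Tuple

Scores = Tuple[float, ...]  # one score per objective, higher is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scored: Dict[str, Scores]) -> Dict[str, Scores]:
    """Keep the prompts that no other prompt dominates."""
    return {
        prompt: s
        for prompt, s in scored.items()
        if not any(dominates(t, s) for other, t in scored.items() if other != prompt)
    }


# Hypothetical classifier probabilities for three domains.
toy = {
    "prompt_a": (0.9, 0.2, 0.5),
    "prompt_b": (0.4, 0.8, 0.5),
    "prompt_c": (0.3, 0.1, 0.4),  # dominated by both prompt_a and prompt_b
}
front = pareto_front(toy)
```

A user would then pick from `front` the prompt whose weighting best suits their domain.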

Title: SnakModel: Lessons Learned from Training an Open Danish Large Language Model

Authors: Mike Zhang, Max Müller-Eberstein, Elisa Bassignana, Rob van der Goot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12956
Pdf URL: https://arxiv.org/pdf/2412.12956
Copy Paste: [[2412.12956]] SnakModel: Lessons Learned from Training an Open Danish Large Language Model(https://arxiv.org/abs/2412.12956)
Keywords: language model, llm
Abstract: We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words, and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning training process itself, including the analysis of intermediate training dynamics, and ablations across different hyperparameters; (3) an evaluation on eight language and culturally-specific tasks. Across these experiments SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing, and establish training guidelines for languages with similar resource constraints.

Title: Adaptations of AI models for querying the LandMatrix database in natural language

Authors: Fatiha Ait Kbir, Jérémy Bourgoin, Rémy Decoupes, Marie Gradeler, Roberto Interdonato
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12961
Pdf URL: https://arxiv.org/pdf/2412.12961
Copy Paste: [[2412.12961]] Adaptations of AI models for querying the LandMatrix database in natural language(https://arxiv.org/abs/2412.12961)
Keywords: language model, llm, prompt, agent
Abstract: The Land Matrix initiative (this https URL) and its global observatory aim to provide reliable data on large-scale land acquisitions to inform debates and actions in sectors such as agriculture, extraction, or energy in low- and middle-income countries. Although these data are recognized in the academic world, they remain underutilized in public policy, mainly due to the complexity of access and exploitation, which requires technical expertise and a good understanding of the database schema. The objective of this work is to simplify access to data from different database systems. The methods proposed in this article are evaluated using data from the Land Matrix. This work presents various comparisons of Large Language Models (LLMs) as well as combinations of LLM adaptations (Prompt Engineering, RAG, Agents) to query different database systems (GraphQL and REST queries). The experiments are reproducible, and a demonstration is available online: this https URL.

Title: Unlocking LLMs: Addressing Scarce Data and Bias Challenges in Mental Health

Authors: Vivek Kumar, Eirini Ntoutsi, Pushpraj Singh Rajawat, Giacomo Medda, Diego Reforgiato Recupero
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12981
Pdf URL: https://arxiv.org/pdf/2412.12981
Copy Paste: [[2412.12981]] Unlocking LLMs: Addressing Scarce Data and Bias Challenges in Mental Health(https://arxiv.org/abs/2412.12981)
Keywords: language model, gpt, llm, hallucination, prompt, chat
Abstract: Large language models (LLMs) have shown promising capabilities in healthcare analysis but face several challenges like hallucinations, parroting, and bias manifestation. These challenges are exacerbated in complex, sensitive, and low-resource domains. Therefore, in this work we introduce IC-AnnoMI, an expert-annotated motivational interviewing (MI) dataset built upon AnnoMI by generating in-context conversational dialogues leveraging LLMs, particularly ChatGPT. IC-AnnoMI employs targeted prompts accurately engineered through cues and tailored information, taking into account therapy style (empathy, reflection), contextual relevance, and false semantic change. Subsequently, the dialogues are annotated by experts, strictly adhering to the Motivational Interviewing Skills Code (MISC), focusing on both the psychological and linguistic dimensions of MI dialogues. We comprehensively evaluate the IC-AnnoMI dataset and ChatGPT's emotional reasoning ability and understanding of domain intricacies by modeling novel classification tasks employing several classical machine learning and current state-of-the-art transformer approaches. Finally, we discuss the effects of progressive prompting strategies and the impact of augmented data in mitigating the biases manifested in IC-AnnoMI. Our contributions provide the MI community with not only a comprehensive dataset but also valuable insights for using LLMs in empathetic text generation for conversational therapy in supervised settings.

Title: OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain

Authors: Shuting Wang, Jiejun Tan, Zhicheng Dou, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.13018
Pdf URL: https://arxiv.org/pdf/2412.13018
Copy Paste: [[2412.13018]] OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain(https://arxiv.org/abs/2412.13018)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations on generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4) robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open source the code of our benchmark at this https URL.

Title: NAVCON: A Cognitively Inspired and Linguistically Grounded Corpus for Vision and Language Navigation

Authors: Karan Wanchoo, Xiaoye Zuo, Hannah Gonzalez, Soham Dan, Georgios Georgakis, Dan Roth, Kostas Daniilidis, Eleni Miltsakaki
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2412.13026
Pdf URL: https://arxiv.org/pdf/2412.13026
Copy Paste: [[2412.13026]] NAVCON: A Cognitively Inspired and Linguistically Grounded Corpus for Vision and Language Navigation(https://arxiv.org/abs/2412.13026)
Keywords: gpt, agent
Abstract: We present NAVCON, a large-scale annotated Vision-Language Navigation (VLN) corpus built on top of two popular datasets (R2R and RxR). The paper introduces four core, cognitively motivated and linguistically grounded, navigation concepts and an algorithm for generating large-scale silver annotations of naturally occurring linguistic realizations of these concepts in navigation instructions. We pair the annotated instructions with video clips of an agent acting on these instructions. NAVCON contains 236,316 concept annotations for approximately 30,000 instructions and 2.7 million aligned images (from approximately 19,000 instructions) showing what the agent sees when executing an instruction. To our knowledge, this is the first comprehensive resource of navigation concepts. We evaluated the quality of the silver annotations by conducting human evaluation studies on NAVCON samples. As further validation of the quality and usefulness of the resource, we trained a model for detecting navigation concepts and their linguistic realizations in unseen instructions. Additionally, we show that few-shot learning with GPT-4o performs well on this task using large-scale silver annotations of NAVCON.

Title: Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach

Authors: Hugo Math, Rainer Lienhart, Robin Schön
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.13041
Pdf URL: https://arxiv.org/pdf/2412.13041
Copy Paste: [[2412.13041]] Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach(https://arxiv.org/abs/2412.13041)
Keywords: language model
Abstract: In this paper, we draw an analogy between processing natural languages and processing multivariate event streams from vehicles in order to predict when and what error pattern is most likely to occur in the future for a given car. Our approach leverages the temporal dynamics and contextual relationships of our event data from a fleet of cars. Event data is composed of discrete values of error codes as well as continuous values such as time and mileage. By modelling this data with two causal Transformers, we can anticipate vehicle failures and malfunctions before they happen. Thus, we introduce CarFormer, a Transformer model trained via a new self-supervised learning strategy, and EPredictor, an autoregressive Transformer decoder model capable of predicting when and what error pattern will most likely occur after some error code appears. Despite the challenges of high cardinality of event types, their unbalanced frequency of appearance and limited labelled data, our experimental results demonstrate the excellent predictive ability of our novel model. Specifically, with sequences of 160 error codes on average, our model needs only half of the error codes to achieve an 80% F1 score for predicting what error pattern will occur, and it achieves an average absolute error of 58.4 ± 13.2 h when forecasting the time of occurrence, thus enabling confident predictive maintenance and enhancing vehicle safety.

Title: LMUnit: Fine-grained Evaluation with Natural Language Unit Tests

Authors: Jon Saad-Falcon, Rajan Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin, Bertie Vidgen, Amanpreet Singh, Douwe Kiela, Shikib Mehri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.13091
Pdf URL: https://arxiv.org/pdf/2412.13091
Copy Paste: [[2412.13091]] LMUnit: Fine-grained Evaluation with Natural Language Unit Tests(https://arxiv.org/abs/2412.13091)
Keywords: language model, llm
Abstract: As language models become integral to critical workflows, assessing their behavior remains a fundamental challenge -- human evaluation is costly and noisy, while automated metrics provide only coarse, difficult-to-interpret signals. We introduce natural language unit tests, a paradigm that decomposes response quality into explicit, testable criteria, along with a unified scoring model, LMUnit, which combines multi-objective training across preferences, direct ratings, and natural language rationales. Through controlled human studies, we show this paradigm significantly improves inter-annotator agreement and enables more effective LLM development workflows. LMUnit achieves state-of-the-art performance on evaluation benchmarks (FLASK, BigGenBench) and competitive results on RewardBench. These results validate both our proposed paradigm and scoring model, suggesting a promising path forward for language model evaluation and development.
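To make the idea of natural language unit tests concrete, here is a minimal sketch in which a response is scored against explicit criteria and the per-criterion scores are averaged. The `UnitTest` class, `evaluate` function and `toy_scorer` are hypothetical; LMUnit itself is a trained scoring model, and averaging is an assumed aggregation that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class UnitTest:
    criterion: str  # natural-language statement the response should satisfy


def evaluate(
    response: str,
    tests: List[UnitTest],
    scorer: Callable[[str, str], float],
) -> Dict[str, object]:
    """Score `response` on every criterion and average the results."""
    per_test = {t.criterion: scorer(response, t.criterion) for t in tests}
    overall = sum(per_test.values()) / len(per_test) if per_test else 0.0
    return {"per_test": per_test, "overall": overall}


def toy_scorer(response: str, criterion: str) -> float:
    # Placeholder for a learned scoring model: passes if the quoted
    # keyword from the criterion appears in the response.
    keyword = criterion.split("'")[1]
    return 1.0 if keyword in response.lower() else 0.0


tests = [
    UnitTest("Does the response mention 'refund'?"),
    UnitTest("Does the response mention 'apology'?"),
]
report = evaluate("We will issue a refund today.", tests, toy_scorer)
```

Keeping the per-criterion scores alongside the aggregate is what makes the signal finer-grained than a single quality number.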

Title: Uchaguzi-2022: A Dataset of Citizen Reports on the 2022 Kenyan Election

Authors: Roberto Mondini, Neema Kotonya, Robert L. Logan IV, Elizabeth M Olson, Angela Oduor Lungati, Daniel Duke Odongo, Tim Ombasa, Hemank Lamba, Aoife Cahill, Joel R. Tetreault, Alejandro Jaimes
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2412.13098
Pdf URL: https://arxiv.org/pdf/2412.13098
Copy Paste: [[2412.13098]] Uchaguzi-2022: A Dataset of Citizen Reports on the 2022 Kenyan Election(https://arxiv.org/abs/2412.13098)
Keywords: language model
Abstract: Online reporting platforms have enabled citizens around the world to collectively share their opinions and report in real time on events impacting their local communities. Systematically organizing (e.g., categorizing by attributes) and geotagging large amounts of crowdsourced information is crucial to ensuring that accurate and meaningful insights can be drawn from this data and used by policy makers to bring about positive change. These tasks, however, typically require extensive manual annotation efforts. In this paper we present Uchaguzi-2022, a dataset of 14k categorized and geotagged citizen reports related to the 2022 Kenyan General Election containing mentions of election-related issues such as official misconduct, vote count irregularities, and acts of violence. We use this dataset to investigate whether language models can assist in scalably categorizing and geotagging reports, thus highlighting its potential application in the AI for Social Good space.

Title: AI PERSONA: Towards Life-long Personalization of LLMs

Authors: Tiannan Wang, Meiling Tao, Ruoyu Fang, Huilin Wang, Shuai Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.13103
Pdf URL: https://arxiv.org/pdf/2412.13103
Copy Paste: [[2412.13103]] AI PERSONA: Towards Life-long Personalization of LLMs(https://arxiv.org/abs/2412.13103)
Keywords: language model, llm, agent
Abstract: In this work, we introduce the task of life-long personalization of large language models. While recent mainstream efforts in the LLM community mainly focus on scaling data and compute for improved capabilities of LLMs, we argue that it is also very important to enable LLM systems, or language agents, to continuously adapt to the diverse and ever-changing profiles of every distinct user and provide up-to-date personalized assistance. We provide a clear task formulation and introduce a simple, general, effective, and scalable framework for life-long personalization of LLM systems and language agents. To facilitate future research on LLM personalization, we also introduce methods to synthesize realistic benchmarks and robust evaluation metrics. We will release all code and data for building and benchmarking life-long personalized LLM systems.

Title: Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study

Authors: Bolei Ma, Berk Yoztyurk, Anna-Carolina Haensch, Xinpeng Wang, Markus Herklotz, Frauke Kreuter, Barbara Plank, Matthias Assenmacher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.13169
Pdf URL: https://arxiv.org/pdf/2412.13169
Copy Paste: [[2412.13169]] Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study(https://arxiv.org/abs/2412.13169)
Keywords: language model, llm, prompt
Abstract: In recent research, large language models (LLMs) have been increasingly used to investigate public opinions. This study investigates the algorithmic fidelity of LLMs, i.e., the ability to replicate the socio-cultural context and nuanced opinions of human participants. Using open-ended survey data from the German Longitudinal Election Studies (GLES), we prompt different LLMs to generate synthetic public opinions reflective of German subpopulations by incorporating demographic features into the persona prompts. Our results show that Llama performs better than other LLMs at representing subpopulations, particularly when there is lower opinion diversity within those groups. Our findings further reveal that the LLM performs better for supporters of left-leaning parties like The Greens and The Left compared to other parties, and matches the least with the right-party AfD. Additionally, the inclusion or exclusion of specific variables in the prompts can significantly impact the models' predictions. These findings underscore the importance of aligning LLMs to more effectively model diverse public opinions while minimizing political biases and enhancing robustness in representativeness.

Title: Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Authors: Jeffrey Cheng, Benjamin Van Durme
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.13171
Pdf URL: https://arxiv.org/pdf/2412.13171
Copy Paste: [[2412.13171]] Compressed Chain of Thought: Efficient Reasoning Through Dense Representations(https://arxiv.org/abs/2412.13171)
Keywords: language model, chain-of-thought
Abstract: Chain-of-thought (CoT) decoding enables language models to improve reasoning performance at the cost of high generation latency in decoding. Recent proposals have explored variants of contemplation tokens, a term we introduce that refers to special tokens used during inference to allow for extra computation. Prior work has considered fixed-length sequences drawn from a discrete set of embeddings as contemplation tokens. Here we propose Compressed Chain-of-Thought (CCoT), a framework to generate contentful and continuous contemplation tokens of variable sequence length. The generated contemplation tokens are compressed representations of explicit reasoning chains, and our method can be applied to off-the-shelf decoder language models. Through experiments, we illustrate how CCoT enables additional reasoning over dense contentful representations to achieve corresponding improvements in accuracy. Moreover, the reasoning improvements can be adaptively modified on demand by controlling the number of contemplation tokens generated.

Title: DnDScore: Decontextualization and Decomposition for Factuality Verification in Long-Form Text Generation

Authors: Miriam Wanner, Benjamin Van Durme, Mark Dredze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.13175
Pdf URL: https://arxiv.org/pdf/2412.13175
Copy Paste: [[2412.13175]] DnDScore: Decontextualization and Decomposition for Factuality Verification in Long-Form Text Generation(https://arxiv.org/abs/2412.13175)
Keywords: language model, llm
Abstract: The decompose-then-verify strategy for verification of Large Language Model (LLM) generations decomposes claims that are then independently verified. Decontextualization augments text (claims) to ensure it can be verified outside of the original context, enabling reliable verification. While decomposition and decontextualization have been explored independently, their interactions in a complete system have not been investigated. Their conflicting purposes can create tensions: decomposition isolates atomic facts while decontextualization inserts relevant information. Furthermore, a decontextualized subclaim presents a challenge to the verification step: what part of the augmented text should be verified as it now contains multiple atomic facts? We conduct an evaluation of different decomposition, decontextualization, and verification strategies and find that the choice of strategy matters in the resulting factuality scores. Additionally, we introduce DnDScore, a decontextualization-aware verification method which validates subclaims in the context of contextual information.
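The generic decompose-then-verify pipeline that this abstract builds on can be sketched as three pluggable steps whose verdicts are averaged into a factuality score. This is an illustrative skeleton, not DnDScore itself; the three callables and the toy fact set are hypothetical placeholders for LLM-based components and a real knowledge source.

```python
from typing import Callable, List


def factuality_score(
    text: str,
    decompose: Callable[[str], List[str]],
    decontextualize: Callable[[str, str], str],
    verify: Callable[[str], bool],
) -> float:
    """Fraction of subclaims judged supported after decontextualization."""
    subclaims = decompose(text)
    if not subclaims:
        return 0.0
    verdicts = [verify(decontextualize(claim, text)) for claim in subclaims]
    return sum(verdicts) / len(verdicts)


# Toy stand-ins: split on sentences, resolve one pronoun, look up a fact set.
facts = {"Marie Curie won two Nobel Prizes"}
text = "Marie Curie won two Nobel Prizes. She was born in Paris."
score = factuality_score(
    text,
    decompose=lambda t: [s.strip(". ") for s in t.split(". ") if s],
    decontextualize=lambda claim, ctx: claim.replace("She", "Marie Curie"),
    verify=lambda claim: claim in facts,
)
```

In the toy run, the second subclaim becomes "Marie Curie was born in Paris" after decontextualization and is not in the fact set, so only one of the two subclaims is supported.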