2025-03-04

Title: Eeyore: Realistic Depression Simulation via Supervised and Preference Optimization

Authors: Siyang Liu, Bianca Brie, Wenda Li, Laura Biester, Andrew Lee, James Pennebaker, Rada Mihalcea
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00018
Pdf URL: https://arxiv.org/pdf/2503.00018
Copy Paste: [[2503.00018]] Eeyore: Realistic Depression Simulation via Supervised and Preference Optimization(https://arxiv.org/abs/2503.00018)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have been previously explored for mental healthcare training and therapy client simulation, but they still fall short in authentically capturing diverse client traits and psychological conditions. We introduce Eeyore, an 8B model optimized for realistic depression simulation through a structured alignment framework, incorporating expert input at every stage. First, we systematically curate real-world depression-related conversations, extracting depressive traits to guide data filtering and psychological profile construction, and use this dataset to instruction-tune Eeyore for profile adherence. Next, to further enhance realism, Eeyore undergoes iterative preference optimization -- first leveraging model-generated preferences and then calibrating with a small set of expert-annotated preferences. Throughout the entire pipeline, we actively collaborate with domain experts, developing interactive interfaces to validate trait extraction and iteratively refine structured psychological profiles for clinically meaningful role-play customization. Despite its smaller model size, the Eeyore depression simulation outperforms GPT-4o with SOTA prompting strategies, both in linguistic authenticity and profile adherence.

Title: A Systematic Review of Open Datasets Used in Text-to-Image (T2I) Gen AI Model Safety

Authors: Rakeen Rouf, Trupti Bavalatti, Osama Ahmed, Dhaval Potdar, Faraz Jawed
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.00020
Pdf URL: https://arxiv.org/pdf/2503.00020
Copy Paste: [[2503.00020]] A Systematic Review of Open Datasets Used in Text-to-Image (T2I) Gen AI Model Safety(https://arxiv.org/abs/2503.00020)
Keywords: prompt
Abstract: Novel research aimed at text-to-image (T2I) generative AI safety often relies on publicly available datasets for training and evaluation, making the quality and composition of these datasets crucial. This paper presents a comprehensive review of the key datasets used in the T2I research, detailing their collection methods, compositions, semantic and syntactic diversity of prompts and the quality, coverage, and distribution of harm types in the datasets. By highlighting the strengths and limitations of the datasets, this study enables researchers to find the most relevant datasets for a use case, critically assess the downstream impacts of their work given the dataset distribution, particularly regarding model safety and ethical considerations, and also identify the gaps in dataset coverage and quality that future research may address.

Title: KVCrush: Key value cache size-reduction using similarity in head-behaviour

Authors: Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Alexander Kozlov, Nilesh Jain
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.00022
Pdf URL: https://arxiv.org/pdf/2503.00022
Copy Paste: [[2503.00022]] KVCrush: Key value cache size-reduction using similarity in head-behaviour(https://arxiv.org/abs/2503.00022)
Keywords: language model, llm
Abstract: Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). By allowing the attention operation to scale linearly rather than quadratically with the total sequence length, KV caching significantly enhances generation throughput. However, due to the large context lengths of modern LLMs, the memory footprint of the KV cache is a major bottleneck for model deployment: it directly limits the model's batch size and hinders its ability to deliver high throughput. Existing research addresses this challenge using several techniques, such as discarding low-attention tokens, quantization, and matrix approximation, which typically have a negative impact on model accuracy. In this paper, we propose KVCrush, a technology that can be combined with many KV compression technologies to improve model accuracy at a much smaller memory footprint. KVCrush provides an alternate representation scheme for key-value states, along with a low-overhead token pruning algorithm that accounts for the token distribution in the KV cache, which in turn allows for a smaller footprint while maintaining the accuracy of the model. Based on our results, KVCrush reduces LongBench KV cache size by 4x with less than 1% accuracy drop and achieves state-of-the-art average accuracy with minimal overhead, incurring less than 0.5% total inference latency. KVCrush not only outperforms state-of-the-art importance-based token retention schemes in accuracy but is also compatible with typical practical LLM deployments that use KV cache paging schemes such as vLLM and mixed-precision quantization.
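
To make the memory bottleneck described in the KVCrush abstract concrete, the following is a back-of-the-envelope sketch of KV cache size and of what a 4x reduction would mean. The model dimensions (32 layers, 8 KV heads, head size 128, 16-bit values) are illustrative assumptions, not figures from the paper, and the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Approximate KV cache size for a decoder-only Transformer.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer. All dimensions are hypothetical inputs.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration: 32 layers, 8 KV heads of size 128,
# 32k-token context, batch of 8, fp16 values (2 bytes each).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=32_768, batch_size=8)
compressed = full // 4  # the 4x reduction quoted in the abstract

print(f"full cache:       {full / 2**30:.1f} GiB")
print(f"after 4x shrink:  {compressed / 2**30:.1f} GiB")
```

Under these assumed dimensions the cache grows linearly with both context length and batch size, which is why shrinking it translates directly into larger feasible batches.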

Title: Do Emotions Really Affect Argument Convincingness? A Dynamic Approach with LLM-based Manipulation Checks

Authors: Yanran Chen, Steffen Eger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00024
Pdf URL: https://arxiv.org/pdf/2503.00024
Copy Paste: [[2503.00024]] Do Emotions Really Affect Argument Convincingness? A Dynamic Approach with LLM-based Manipulation Checks(https://arxiv.org/abs/2503.00024)
Keywords: llm
Abstract: Emotions have been shown to play a role in argument convincingness, yet this aspect is underexplored in the natural language processing (NLP) community. Unlike prior studies that use static analyses, focus on a single text domain or language, or treat emotion as just one of many factors, we introduce a dynamic framework inspired by manipulation checks commonly used in psychology and social science; leveraging LLM-based manipulation checks, this framework examines the extent to which perceived emotional intensity influences perceived convincingness. Through human evaluation of arguments across different languages, text domains, and topics, we find that in over half of cases, judgments of convincingness remain unchanged despite variations in perceived emotional intensity; when emotions do have an impact, they more often enhance rather than weaken convincingness. We further analyze how 11 LLMs behave in the same scenario, finding that while LLMs generally mirror human patterns, they struggle to capture nuanced emotional effects in individual judgments.

Title: Evaluating Large Language Models on the Spanish Medical Intern Resident (MIR) Examination 2024/2025:A Comparative Analysis of Clinical Reasoning and Knowledge Application

Authors: Carlos Luengo Vera, Ignacio Ferro Picon, M. Teresa del Val Nunez, Jose Andres Gomez Gandia, Antonio de Lucas Ancillo, Victor Ramos Arroyo, Carlos Milan Figueredo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00025
Pdf URL: https://arxiv.org/pdf/2503.00025
Copy Paste: [[2503.00025]] Evaluating Large Language Models on the Spanish Medical Intern Resident (MIR) Examination 2024/2025:A Comparative Analysis of Clinical Reasoning and Knowledge Application(https://arxiv.org/abs/2503.00025)
Keywords: language model, gpt, llm
Abstract: This study presents a comparative evaluation of 22 large language models (LLMs) on the Spanish Medical Intern Resident (MIR) examinations for 2024 and 2025, with a focus on clinical reasoning, domain-specific expertise, and multimodal processing capabilities. The MIR exam, consisting of 210 multiple-choice questions, some requiring image interpretation, serves as a stringent benchmark for assessing both factual recall and complex clinical problem-solving skills. Our investigation encompasses general-purpose models such as GPT4, Claude, LLaMA, and Gemini, as well as specialized fine-tuned systems like Miri Pro, which leverages proprietary Spanish healthcare data to excel in medical contexts. Recent market entries Deepseek and Grok have further enriched the evaluation landscape, particularly for tasks that demand advanced visual and semantic analysis. The findings indicate that while general-purpose LLMs perform robustly overall, fine-tuned models consistently achieve superior accuracy, especially in addressing nuanced domain-specific challenges. A modest performance decline observed between the two exam cycles appears attributable to the implementation of modified questions designed to mitigate reliance on memorization. The results underscore the transformative potential of domain-specific fine-tuning and multimodal integration in advancing medical AI applications. They also highlight critical implications for the future integration of LLMs into medical education, training, and clinical decision-making, emphasizing the importance of balancing automated reasoning with ethical and context-aware judgment.

Title: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis

Authors: Shinwoo Park, Shubin Kim, Do-Kyung Kim, Yo-Sub Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00032
Pdf URL: https://arxiv.org/pdf/2503.00032
Copy Paste: [[2503.00032]] Detecting LLM-Generated Korean Text through Linguistic Feature Analysis(https://arxiv.org/abs/2503.00032)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) increases the difficulty of distinguishing between human-written and LLM-generated text. Detecting LLM-generated text is crucial for upholding academic integrity, preventing plagiarism, protecting copyrights, and ensuring ethical research practices. Most prior studies on detecting LLM-generated text focus primarily on English text. However, languages with distinct morphological and syntactic characteristics require specialized detection approaches. Their unique structures and usage patterns can hinder the direct application of methods primarily designed for English. Among such languages, we focus on Korean, which has relatively flexible spacing rules, a rich morphological system, and less frequent comma usage compared to English. We introduce KatFish, the first benchmark dataset for detecting LLM-generated Korean text. The dataset consists of text written by humans and generated by four LLMs across three genres. By examining spacing patterns, part-of-speech diversity, and comma usage, we illuminate the linguistic differences between human-written and LLM-generated Korean text. Building on these observations, we propose KatFishNet, a detection method specifically designed for the Korean language. KatFishNet achieves an average of 19.78% higher AUROC compared to the best-performing existing detection method.
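
The KatFishNet result is reported in AUROC. As a reminder of what that metric measures, here is a minimal pure-Python sketch that computes AUROC as the probability that a randomly chosen LLM-generated text receives a higher detector score than a randomly chosen human-written one. The labels and scores are made-up illustrations, not data from the paper.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (e.g. LLM-generated), 0 otherwise.
    scores: detector scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical detector outputs for two generated and two human texts.
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The quadratic pairwise loop is chosen for readability; a rank-based formulation gives the same value more efficiently on large test sets.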

Title: Constraining Sequential Model Editing with Editing Anchor Compression

Authors: Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Ningyu Zhang, Jia-Chen Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00035
Pdf URL: https://arxiv.org/pdf/2503.00035
Copy Paste: [[2503.00035]] Constraining Sequential Model Editing with Editing Anchor Compression(https://arxiv.org/abs/2503.00035)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) struggle with hallucinations due to false or outdated knowledge. Given the high resource demands of retraining these models, there is an increasing focus on developing model editing. However, the general abilities of LLMs across downstream tasks are prone to significant degradation during sequential editing. This paper statistically observes that the parameter matrix after editing exhibits a significant deviation compared to its previous state as the number of edits increases. This serious deviation affects the original knowledge associations within LLMs and leads to the degradation of their general abilities. To this end, a framework termed Editing Anchor Compression (EAC) is proposed to constrain the deviation of the parameter matrix during sequential editing. It compresses the editing information by selecting editing anchors that are important in encoding new relations without deviating too much from the original matrix, thereby preserving the general abilities. Experiments of applying EAC to two popular editing methods on three LLMs across four tasks are conducted. Evaluation results show that EAC effectively minimizes unreasonable deviations caused by model editing, preserving over 70% of the general abilities while better retaining the editing knowledge compared to the original counterpart methods.

Title: Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs

Authors: Wei Zhao, Zhe Li, Yige Li, Jun Sun
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.00037
Pdf URL: https://arxiv.org/pdf/2503.00037
Copy Paste: [[2503.00037]] Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs(https://arxiv.org/abs/2503.00037)
Keywords: language model
Abstract: Large Vision-Language Models (LVLMs) have made significant strides in multimodal comprehension, thanks to extensive pre-training and fine-tuning on large-scale visual datasets. However, despite their robust textual safety mechanisms, they remain vulnerable to harmful visual inputs. Existing safeguards, which typically rely on pre-filtering or fine-tuning, incur high costs and diminish overall utility. To address this critical vulnerability, we introduce SafeCLIP, a lightweight method that leverages LVLMs' inherent multimodal alignment for zero-shot toxic image detection. By projecting CLIP's discarded CLS token into its text space and matching it with toxic descriptors, SafeCLIP detects harmful content without any architectural changes, adding minimal latency and enabling dynamic safety corrections during inference. SafeCLIP achieves a 66.9% defense success rate with only a 3.2% false positive rate and 7.2% overhead. In contrast, state-of-the-art methods achieve 52.9% success but have a 10.7% false positive rate and 210% overhead. Our work demonstrates that leveraging inherent multimodal alignment can yield efficient, low-cost LVLM safety. Code is available at this http URL.
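
The matching step described in the SafeCLIP abstract (comparing a projected image embedding against toxic text descriptors) can be pictured as a nearest-descriptor similarity check. The sketch below is only an illustration under strong assumptions: the vectors stand in for embeddings that a real system would obtain from CLIP, and the cosine measure, the `is_toxic` helper, and the 0.5 threshold are hypothetical choices, not the paper's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


def is_toxic(image_vec, descriptor_vecs, threshold=0.5):
    """Flag an image whose embedding is close to any toxic descriptor.

    image_vec: stand-in for the CLS embedding projected into text space.
    descriptor_vecs: stand-ins for text embeddings of toxic descriptors.
    threshold: hypothetical cut-off; a real system would tune it.
    """
    return max(cosine(image_vec, d) for d in descriptor_vecs) >= threshold


# Toy 2-D embeddings, purely illustrative.
print(is_toxic([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]]))
print(is_toxic([1.0, 0.0], [[0.0, 1.0]]))
```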

Title: from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors

Authors: Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Qi Li, Jiangyu Lei
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.00038
Pdf URL: https://arxiv.org/pdf/2503.00038
Copy Paste: [[2503.00038]] from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors(https://arxiv.org/abs/2503.00038)
Keywords: language model, llm
Abstract: Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content by jailbreak attacks. However, they overlook that the direct generation of harmful content from scratch is more difficult than inducing an LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason about and calibrate the metaphorical content, and is thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferably jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs.

Title: Evaluation of LLMs-based Hidden States as Author Representations for Psychological Human-Centered NLP Tasks

Authors: Nikita Soni, Pranav Chitale, Khushboo Singh, Niranjan Balasubramanian, H. Andrew Schwartz
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2503.00124
Pdf URL: https://arxiv.org/pdf/2503.00124
Copy Paste: [[2503.00124]] Evaluation of LLMs-based Hidden States as Author Representations for Psychological Human-Centered NLP Tasks(https://arxiv.org/abs/2503.00124)
Keywords: language model, llm
Abstract: Like most of NLP, models for human-centered NLP tasks -- tasks attempting to assess author-level information -- predominantly use representations derived from hidden states of Transformer-based LLMs. However, what component of the LM is used for the representation varies widely. Moreover, there is a need for Human Language Models (HuLMs) that implicitly model the author and provide a user-level hidden state. Here, we systematically evaluate different ways of representing documents and users using different LM and HuLM architectures to predict task outcomes as both dynamically changing states and averaged trait-like user-level attributes of valence, arousal, empathy, and distress. We find that representing documents as an average of the token hidden states performs the best generally. Further, while a user-level hidden state itself is rarely the best representation, we find its inclusion in the model strengthens token or document embeddings used to derive document- and user-level representations resulting in best performances.
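
The document representation that this abstract reports as generally best, an average of the token hidden states, can be sketched in a few lines. The example uses plain Python lists in place of real model outputs; the `mean_pool` name and the attention-mask convention (1 for real tokens, 0 for padding) are assumptions for illustration.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping padded positions.

    hidden_states: list of per-token vectors (lists of floats).
    attention_mask: list of 0/1 flags, 1 meaning "real token".
    """
    kept = [h for h, m in zip(hidden_states, attention_mask) if m]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(h[i] for h in kept) / len(kept) for i in range(dim)]


# Two real tokens and one padding token, with 2-D toy hidden states.
doc_vec = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(doc_vec)
```

A user-level representation could then be formed by averaging such document vectors, although how the paper aggregates them is not specified in the abstract.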

Title: AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction

Authors: Magnus Sesodia, Alina Petrova, John Armour, Thomas Lukasiewicz, Oana-Maria Camburu, Puneet K. Dokania, Philip Torr, Christian Schroeder de Witt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00128
Pdf URL: https://arxiv.org/pdf/2503.00128
Copy Paste: [[2503.00128]] AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction(https://arxiv.org/abs/2503.00128)
Keywords: language model, llm
Abstract: Legal systems worldwide continue to struggle with overwhelming caseloads, limited judicial resources, and growing complexities in legal proceedings. Artificial intelligence (AI) offers a promising solution, with Legal Judgment Prediction (LJP) -- the practice of predicting a court's decision from the case facts -- emerging as a key research area. However, existing datasets often formulate the task of LJP unrealistically, not reflecting its true difficulty. They also lack high-quality annotation essential for legal reasoning and explainability. To address these shortcomings, we introduce AnnoCaseLaw, a first-of-its-kind dataset of 471 meticulously annotated U.S. Appeals Court negligence cases. Each case is enriched with comprehensive, expert-labeled annotations that highlight key components of judicial decision making, along with relevant legal concepts. Our dataset lays the groundwork for more human-aligned, explainable LJP models. We define three legally relevant tasks: (1) judgment prediction; (2) concept identification; and (3) automated case annotation, and establish a performance baseline using industry-leading large language models (LLMs). Our results demonstrate that LJP remains a formidable task, with application of legal precedent proving particularly difficult. Code and data are available at this https URL.

Title: Personalized Causal Graph Reasoning for LLMs: A Case Study on Dietary Recommendations

Authors: Zhongqi Yang, Amir Rahmani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00134
Pdf URL: https://arxiv.org/pdf/2503.00134
Copy Paste: [[2503.00134]] Personalized Causal Graph Reasoning for LLMs: A Case Study on Dietary Recommendations(https://arxiv.org/abs/2503.00134)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) effectively leverage common-sense knowledge for general reasoning, yet they struggle with personalized reasoning when tasked with interpreting multifactor personal data. This limitation restricts their applicability in domains that require context-aware decision-making tailored to individuals. This paper introduces Personalized Causal Graph Reasoning as an agentic framework that enhances LLM reasoning by incorporating personal causal graphs derived from data of individuals. These graphs provide a foundation that guides the LLM's reasoning process. We evaluate it on a case study on nutrient-oriented dietary recommendations, which requires personal reasoning due to the implicit unique dietary effects. We propose a counterfactual evaluation to estimate the efficiency of LLM-recommended foods for glucose management. Results demonstrate that the proposed method efficiently provides personalized dietary recommendations to reduce average glucose iAUC across three time windows, which outperforms the previous approach. LLM-as-a-judge evaluation results indicate that our proposed method enhances personalization in the reasoning process.
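
The outcome measure named in this abstract, glucose iAUC (incremental area under the curve), is commonly computed as the area of the glucose curve above its pre-meal baseline. The sketch below uses the trapezoidal rule and clips values below baseline to zero, which is one simplified convention among several; the abstract does not say which variant the authors use, and the readings are invented.

```python
def incremental_auc(times, glucose):
    """Trapezoidal area of glucose above the first (baseline) reading.

    times: measurement times in minutes, ascending.
    glucose: glucose readings (e.g. mg/dL) at those times.
    Readings below baseline are clipped to zero, a simplification that
    ignores the exact point where the curve crosses the baseline.
    """
    baseline = glucose[0]
    area = 0.0
    for i in range(1, len(times)):
        left = max(glucose[i - 1] - baseline, 0.0)
        right = max(glucose[i] - baseline, 0.0)
        area += (left + right) / 2 * (times[i] - times[i - 1])
    return area


# Hypothetical post-meal readings over two hours.
print(incremental_auc([0, 30, 60, 120], [90, 140, 120, 90]))
```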

Title: SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models

Authors: Grigor Nalbandyan, Rima Shahbazyan, Evelina Bakhturina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00137
Pdf URL: https://arxiv.org/pdf/2503.00137
Copy Paste: [[2503.00137]] SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models(https://arxiv.org/abs/2503.00137)
Keywords: language model, llm, prompt
Abstract: Typical evaluations of Large Language Models (LLMs) report a single metric per dataset, often representing the model's best-case performance under carefully selected settings. Unfortunately, this approach overlooks model robustness and reliability in real-world applications. For instance, simple paraphrasing of prompts on the MMLU-Pro dataset causes accuracy fluctuations of up to 10%, while reordering answer choices in the AGIEval dataset results in accuracy differences of up to 6.1%. While some studies discuss issues with LLM robustness, there is no unified or centralized framework for evaluating the robustness of language models. To address this gap and consolidate existing research on model robustness, we present SCORE (Systematic COnsistency and Robustness Evaluation), a comprehensive framework for non-adversarial evaluation of LLMs. The SCORE framework evaluates models by repeatedly testing them on the same benchmarks in various setups to give a realistic estimate of their accuracy and consistency. We release the code publicly and start an LLM robustness leaderboard to facilitate further development and research.
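
The idea of testing a model repeatedly on the same benchmark under different setups can be illustrated with two simple statistics: the spread of accuracy across setups, and the fraction of questions answered identically in every setup. This is a rough sketch, not the SCORE implementation; the setup names, predictions, and the exact definition of `consistency_rate` are assumptions.

```python
def accuracy(preds, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def consistency_rate(runs):
    """Fraction of questions with the same prediction in every run.

    runs: dict mapping a setup name to its list of predictions.
    """
    per_question = list(zip(*runs.values()))
    same = sum(len(set(answers)) == 1 for answers in per_question)
    return same / len(per_question)


# Hypothetical answers from one model under three prompt setups.
gold = ["A", "B", "C", "D"]
runs = {
    "original":    ["A", "B", "C", "D"],
    "paraphrased": ["A", "B", "C", "A"],
    "reordered":   ["A", "C", "C", "D"],
}

accs = {name: accuracy(preds, gold) for name, preds in runs.items()}
spread = max(accs.values()) - min(accs.values())
print(accs, spread, consistency_rate(runs))
```

Reporting the spread alongside the mean makes the gap between best-case and typical performance visible, which is the concern the abstract raises about single-number evaluations.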

Title: Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

Authors: Fakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy, Abdelrahim A. Elmadany, Omer Nacar, El Moatez Billah Nagoudi, Reem Abdel-Salam, Hanin Atwany, Youssef Nafea, Abdulfattah Mohammed Yahya, Rahaf Alhamouri, Hamzah A. Alsayadi, Hiba Zayed, Sara Shatnawi, Serry Sibaee, Yasir Ech-Chammakhy, Walid Al-Dhabyani, Marwa Mohamed Ali, Imen Jarraya, Ahmed Oumar El-Shangiti, Aisha Alraeesi, Mohammed Anwar Al-Ghrawi, Abdulrahman S. Al-Batati, Elgizouli Mohamed, Noha Taha Elgindi, Muhammed Saeed, Houdaifa Atou, Issam Ait Yahia, Abdelhak Bouayad, Mohammed Machrouh, Amal Makouar, Dania Alkawi, Mukhtar Mohamed, Safaa Taher Abdelfadil, Amine Ziad Ounnoughene, Rouabhia Anfel, Rwaa Assi, Ahmed Sorkatti, Mohamedou Cheikh Tourad, Anis Koubaa, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00151
Pdf URL: https://arxiv.org/pdf/2503.00151
Copy Paste: [[2503.00151]] Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs(https://arxiv.org/abs/2503.00151)
Keywords: language model, llm
Abstract: As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.

Title: A Survey of Uncertainty Estimation Methods on Large Language Models

Authors: Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, Hang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00172
Pdf URL: https://arxiv.org/pdf/2503.00172
Copy Paste: [[2503.00172]] A Survey of Uncertainty Estimation Methods on Large Language Models(https://arxiv.org/abs/2503.00172)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, these models can produce biased, hallucinated, or non-factual responses that are camouflaged by their fluency and realistic appearance. Uncertainty estimation is the key method to address this challenge. While research efforts in uncertainty estimation are ramping up, there is a lack of comprehensive and dedicated surveys on LLM uncertainty estimation. This survey presents four major avenues of LLM uncertainty estimation. Furthermore, we perform extensive experimental evaluations across multiple methods and datasets. Finally, we provide critical and promising future directions for LLM uncertainty estimation.

Title: Llamarine: Open-source Maritime Industry-specific Large Language Model

Authors: William Nguyen, An Phan, Konobu Kimura, Hitoshi Maeno, Mika Tanaka, Quynh Le, William Poucher, Christopher Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00203
Pdf URL: https://arxiv.org/pdf/2503.00203
Copy Paste: [[2503.00203]] Llamarine: Open-source Maritime Industry-specific Large Language Model(https://arxiv.org/abs/2503.00203)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated substantial potential in addressing complex reasoning tasks, yet their general-purpose nature often limits their effectiveness in specialized domains such as maritime navigation. To bridge this gap, we introduce Llamarine, the first open-source LLM designed specifically for maritime navigation. Llamarine 1.0 is developed through continued pretraining and fine-tuning on a high-quality corpus comprising maritime textbooks, research publications, and web text from Wikipedia. This domain-specific training enables the model to acquire expert-level knowledge in navigational principles, collision avoidance, route optimization, and regulatory compliance. Our key contributions include (a) the curation of a comprehensive maritime dataset from authoritative sources, ensuring depth and reliability in the model's knowledge base; (b) the development of a foundational model capable of reasoning about complex navigational challenges with greater accuracy than general-purpose LLMs; and (c) the establishment of a benchmark to evaluate performance in maritime-specific decision-making tasks. Experimental results demonstrate that Llamarine outperforms both general-purpose and commercial LLMs in critical navigation-related tasks, such as trajectory planning, risk assessment, and compliance with maritime regulations. By providing an open-source foundation model trained exclusively on high-quality maritime literature, Llamarine paves the way for AI-driven advancements in maritime safety, efficiency, and operational decision-making.
摘要：大型语言模型（LLMS）在解决复杂的推理任务方面表现出了巨大的潜力，但是它们的通用性质通常会限制其在海上导航等专业领域中的有效性。为了弥合这一差距，我们介绍了Llamarine，这是专门为海上导航设计的第一个开源LLM。 Llamarine 1.0是通过对包括海上教科书，研究出版物和Wikipedia的Web文本的高质量语料库进行预处理和微调来开发的。这种特定领域的培训使该模型能够获得导航原理，避免碰撞，路线优化和法规合规性方面的专家级知识。我们的主要贡献包括（a）从权威来源的全面海事数据集进行策划，以确保模型的知识库中的深度和可靠性；（b）发展能够以比通用LLM更高的精确性来推理复杂导航挑战的基础模型的发展；（c）建立一个基准，以评估特定于海事的决策任务中的绩效。实验结果表明，在与导航有关的任务（例如轨迹计划，风险评估和遵守海上法规）中，甲呈实验结果表现优于通用和商业LLM。通过提供专门针对高质量海事文献培训的开源基金会模型，Llamarine为AI驱动的海上安全，效率和运营决策铺平了道路。

Title: À la recherche du sens perdu: your favourite LLM might have more to say than you can understand

Authors: K. O. T. Erziev
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00224
Pdf URL: https://arxiv.org/pdf/2503.00224
Copy Paste: [[2503.00224]] À la recherche du sens perdu: your favourite LLM might have more to say than you can understand(https://arxiv.org/abs/2503.00224)
Keywords: gpt, llm, chat
Abstract: We report a peculiar observation that LLMs can assign hidden meanings to sequences that seem visually incomprehensible to humans: for example, a nonsensical phrase consisting of Byzantine musical symbols is recognized by gpt-4o as "say abracadabra". Moreover, some models can communicate using these sequences. Some of these meanings are hypothesized to partly originate in the massive spurious correlations due to BPE tokenization. We systematically evaluate the presence of such abilities in a wide range of models: Claude-3.5 Haiku, Claude-3.5 Sonnet (New and Old), Claude-3.7 Sonnet, gpt-4o mini, gpt-4o, o1-mini, Llama-3.3 70B, DeepSeek-R1-Distill-Llama 70B, Qwen2.5 1.5B, Qwen2.5 32B, Phi-3.5 mini, GigaChat-Max, Vikhr-Llama-3.2 1B. We argue that this observation might have far-reaching consequences for both the safety and security of modern and future LLMs and the systems that employ them. As an illustration, we show that applying this method in combination with simple templates is sufficient to jailbreak previous generation models, with ASR = 0.4 on gpt-4o mini. Our code and data artifacts are available at this https URL
摘要：我们报告了一个奇特的观察，即LLM可以将隐藏的含义分配给人类看似不可理解的序列：例如，由GPT-4O认识到由拜占庭式音乐符号组成的非敏感短语被视为“ Say Abracadabra”。此外，某些模型可以使用这些序列进行通信。这些含义中的一些被认为部分源于由于BPE令牌化引起的巨大虚假相关性。 We systematically evaluate the presence of such abilities in a wide range of models: Claude-3.5 Haiku, Claude-3.5 Sonnet (New and Old), Claude-3.7 Sonnet, gpt-4o mini, gpt-4o, o1-mini, Llama-3.3 70B, DeepSeek-R1-Distill-Lllama 70B, Qwen2.5 1.5B, Qwen2.5 32B, Phi-3.5 Mini，Gigachat-Max，Vikhr-llama-3.2 1b。我们认为，这种观察可能会对雇用现代和未来的LLM和系统的安全和保障产生深远的影响。作为例证，我们表明，将此方法与简单模板结合使用足以越狱上一代模型，而GPT-4O Mini上的ASR = 0.4。我们的代码和数据工件可在此HTTPS URL上找到

Title: Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM Benchmarking

Authors: Samar M. Magdy, Sang Yun Kwon, Fakhraddin Alwajih, Safaa Abdelfadil, Shady Shehata, Muhammad Abdul-Mageed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00231
Pdf URL: https://arxiv.org/pdf/2503.00231
Copy Paste: [[2503.00231]] Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM Benchmarking(https://arxiv.org/abs/2503.00231)
Keywords: language model, llm
Abstract: Recent advancements in instruction fine-tuning, alignment methods such as reinforcement learning from human feedback (RLHF), and optimization techniques like direct preference optimization (DPO) have significantly enhanced the adaptability of large language models (LLMs) to user preferences. However, despite these innovations, many LLMs continue to exhibit biases toward Western, Anglo-centric, or American cultures, with performance on English data consistently surpassing that of other languages. This reveals a persistent cultural gap in LLMs, which complicates their ability to accurately process culturally rich and diverse figurative language such as proverbs. To address this, we introduce Jawaher, a benchmark designed to assess LLMs' capacity to comprehend and interpret Arabic proverbs. Jawaher includes proverbs from various Arabic dialects, along with idiomatic translations and explanations. Through extensive evaluations of both open- and closed-source models, we find that while LLMs can generate idiomatically accurate translations, they struggle with producing culturally nuanced and contextually relevant explanations. These findings highlight the need for ongoing model refinement and dataset expansion to bridge the cultural gap in figurative language processing.
摘要：指导微调，对齐方式的最新进展，例如从人类反馈（RLHF）学习的强化学习以及诸如直接偏好优化（DPO）之类的优化技术可显着增强大语言模型（LLMS）对用户偏好的适应性。但是，尽管进行了这些创新，但许多LLM仍会对西方，以盎格鲁以盎格鲁文化或美国文化的形式表现出偏见，并且在英语数据上的表现始终超过其他语言。这揭示了LLMS中持续存在的文化鸿沟，这使他们能够准确地处理文化丰富和多样化的象征性语言（例如谚语）变得复杂。为了解决这个问题，我们介绍了Jawaher，这是一种旨在评估LLMS理解和解释阿拉伯谚语的能力的基准。 Jawaher包括来自各种阿拉伯方言的谚语，以及惯用的翻译和解释。通过对开放式和封闭源模型的广泛评估，我们发现，尽管LLM可以生成惯用的准确翻译，但它们在产生文化上有细微差别且在上下文上相关的解释方面很难。这些发现凸显了需要进行持续的模型改进和数据集扩展，以弥合形象性语言处理中的文化差距。

Title: Decoupling Content and Expression: Two-Dimensional Detection of AI-Generated Text

Authors: Guangsheng Bao, Lihua Rong, Yanbin Zhao, Qiji Zhou, Yue Zhang
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.00258
Pdf URL: https://arxiv.org/pdf/2503.00258
Copy Paste: [[2503.00258]] Decoupling Content and Expression: Two-Dimensional Detection of AI-Generated Text(https://arxiv.org/abs/2503.00258)
Keywords: llm
Abstract: The wide usage of LLMs raises critical requirements for detecting AI participation in texts. Existing studies investigate these detections in scattered contexts, leaving a systematic and unified approach unexplored. In this paper, we present HART, a hierarchical framework of AI risk levels, each corresponding to a detection task. To address these tasks, we propose a novel 2D Detection Method, decoupling a text into content and language expression. Our findings show that content is resistant to surface-level changes and can therefore serve as a key feature for detection. Experiments demonstrate that the 2D method significantly outperforms existing detectors, achieving an AUROC improvement from 0.705 to 0.849 for level-2 detection and from 0.807 to 0.886 for RAID. We release our data and code at this https URL.
摘要：LLMS的广泛使用提出了对检测AI参与文本的关键要求。现有研究在分散的环境中调查了这些检测，使系统和统一的方法未经探索。在本文中，我们提出了Hart，这是AI风险水平的层次结构框架，每个框架对应于检测任务。为了解决这些任务，我们提出了一种新颖的2D检测方法，将文本分解为内容和语言表达方式。我们的发现表明，内容对表面级变化有抵抗力，这可以作为检测的关键特征。实验表明，2D方法的表现明显优于现有检测器，对于2级检测，AUROC从0.705提高到0.849，而RAID的AUROC提高到0.807到0.886。我们在此HTTPS URL上发布数据和代码。

Title: Robust Multi-Objective Preference Alignment with Online DPO

Authors: Raghav Gupta, Ryan Sullivan, Yunxuan Li, Samrat Phatale, Abhinav Rastogi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.00295
Pdf URL: https://arxiv.org/pdf/2503.00295
Copy Paste: [[2503.00295]] Robust Multi-Objective Preference Alignment with Online DPO(https://arxiv.org/abs/2503.00295)
Keywords: language model, llm, prompt
Abstract: Multi-objective preference alignment of large language models (LLMs) is critical for developing AI systems that are more configurable, personalizable, helpful, and safe. However, optimizing model outputs to satisfy diverse objectives with variable weights at inference time for truly personalized models presents a significant challenge. Existing approaches are either computationally expensive to train or do not sufficiently steer model behaviors. This paper introduces the Multi-Objective Online DPO (MO-ODPO) algorithm, designed to robustly and efficiently align model behaviors with multiple, potentially conflicting human preferences. Our approach incorporates a prompt conditioning mechanism, allowing us to train a single preference-conditional policy that can adapt to new preference combinations at inference. Experiments on two popular benchmarks show that MO-ODPO Pareto-dominates existing baselines while providing excellent inference-time steerability between diverse objectives.
摘要：大型语言模型（LLMS）的多目标偏好对齐对于开发更可配置，可配置，个性化，帮助和安全的AI系统至关重要。但是，优化模型输出以满足推理时间在推理时间的不同权重的各种目标的真正个性化模型带来了重大挑战。现有的方法要么在计算上训练昂贵，要么不充分操纵模型行为。本文介绍了多目标在线DPO（MO-ODPO）算法，该算法旨在与多种，潜在的，潜在的人类偏好相抵触的模型行为稳健有效地使模型。我们的方法结合了迅速的调理机制，使我们能够培训单个偏好条件政策，可以适应推理时的新优先组合。两个流行基准的实验表明，Mo-Odpo帕累托（Mo-Odpo Pareto）占主导地位，同时在各种目标之间提供了出色的推理时间可识别性。
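
Illustrative sketch (not from the paper): the abstract describes conditioning the prompt on preference weights so that a single policy can serve many weightings. The snippet below shows one way such conditioning could look. The prefix format, the objective names, and the linear scalarization of rewards are all assumptions made for illustration only.

```python
import random

def sample_weights(objectives, rng):
    """Draw one weight per objective from a flat Dirichlet (weights sum to 1)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in objectives]
    total = sum(draws)
    return {name: d / total for name, d in zip(objectives, draws)}

def condition_prompt(prompt, weights):
    """Prepend the preference weights to the prompt (hypothetical prefix format)."""
    header = ", ".join(f"{name}={w:.2f}" for name, w in sorted(weights.items()))
    return f"[preferences: {header}]\n{prompt}"

def scalarize(rewards, weights):
    """Weighted sum of per-objective rewards (assumed linear scalarization)."""
    return sum(weights[name] * rewards[name] for name in weights)

# Usage: rank two candidate responses under one sampled weighting.
rng = random.Random(0)
weights = sample_weights(["helpful", "safe"], rng)
conditioned = condition_prompt("Summarize this article.", weights)
score_a = scalarize({"helpful": 0.9, "safe": 0.2}, weights)
score_b = scalarize({"helpful": 0.4, "safe": 0.8}, weights)
preferred = "a" if score_a >= score_b else "b"
```

In an online preference-optimization loop, the higher-scoring response would presumably be treated as the chosen one and the other as the rejected one for that conditioned prompt; the paper itself should be consulted for the actual procedure.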

Title: Unlocking Efficient, Scalable, and Continual Knowledge Editing with Basis-Level Representation Fine-Tuning

Authors: Tianci Liu, Ruirui Li, Yunzhe Qi, Hui Liu, Xianfeng Tang, Tianqi Zheng, Qingyu Yin, Monica Xiao Cheng, Jun Huan, Haoyu Wang, Jing Gao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.00306
Pdf URL: https://arxiv.org/pdf/2503.00306
Copy Paste: [[2503.00306]] Unlocking Efficient, Scalable, and Continual Knowledge Editing with Basis-Level Representation Fine-Tuning(https://arxiv.org/abs/2503.00306)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora and their knowledge can become outdated quickly in the fast-changing world. This motivates the development of knowledge editing methods designed to update certain knowledge in LLMs without changing unrelated knowledge. To make selective edits, previous efforts often sought to update a small number of parameters in some specific layer(s) of an LLM. Nonetheless, in challenging scenarios, they still fall short in making successful edits while simultaneously preserving knowledge irrelevant to the updates, resulting in a notable editing-locality trade-off. In this work, we question whether the trade-off is caused by the fact that parameter-based updates have a global effect, i.e., edited parameters affect all inputs indiscriminately. In light of this, we explore the feasibility of representation fine-tuning, which applies a linear update to a few representations in a learned subspace, for knowledge editing. Although this approach was shown in previous work to be effective at enhancing an LLM's general ability, we theoretically show that the linear update imposes a tension in the editing-locality trade-off. Subsequently, BaFT is proposed to break the linearity. BaFT computes a weight for each basis that spans a dimension of the subspace based on the input representation. This input-dependent weighting mechanism allows BaFT to manage different types of knowledge in an adaptive way, thereby achieving a better editing-locality trade-off. Experiments on three LLMs with five editing benchmarks in diverse scenarios show the superiority of our method.
摘要：大型语言模型（LLM）在各种自然语言任务上取得了出色的表现。但是，他们接受了静态语料库的培训，在快速变化的世界中，他们的知识可能会迅速过时。这激发了知识编辑方法的发展，旨在更新LLM中的某些知识而不改变其他知识。为了进行选择性编辑，以前的努力经常试图在LLM的某些特定层中更新少量参数。尽管如此，在具有挑战性的情况下，他们仍然无法进行成功编辑，同时保留与更新无关的知识，从而导致了显着的编辑局部性权衡。在这项工作中，我们质疑权衡是否是由于基于参数的更新具有全球效果的事实，即编辑的参数会不分青睐。鉴于此，我们探讨了表示微调的可行性，该可行性将一些线性更新应用于学习的子空间中的一些表示形式，以进行知识编辑。尽管有效地增强了LLM的一般能力，但从理论上讲，我们表明该线性更新在编辑局部折衷方面施加了张力。随后，提出了BAFT破坏线性。 BAFT计算每个基础的重量，该权重跨越基于输入表示的子空间的尺寸。这种依赖于输入的加权机制使BAFT能够以自适应方式管理不同类型的知识，从而实现了更好的编辑本地性权衡。在不同情况下，在三个具有五个编辑基准的LLMS上进行了实验，显示了我们方法的优越性。
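
Illustrative sketch (not the authors' implementation): the abstract contrasts a linear update applied in a learned subspace with an update in which each basis direction receives an input-dependent weight. The code below contrasts the two on plain Python lists. The low-rank update form h + sum_k r_k ((w_k . h + b_k) - r_k . h) and the sigmoid gate are assumptions chosen for illustration, and all parameter names are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def linear_update(h, basis, proj_w, proj_b):
    """Assumed linear form: h + sum_k r_k * ((w_k . h + b_k) - r_k . h)."""
    out = list(h)
    for r_k, w_k, b_k in zip(basis, proj_w, proj_b):
        delta = dot(w_k, h) + b_k - dot(r_k, h)
        out = [o + delta * r for o, r in zip(out, r_k)]
    return out

def basis_weighted_update(h, basis, proj_w, proj_b, gate_w, gate_b):
    """Same update, with each basis direction scaled by an input-dependent weight."""
    out = list(h)
    for r_k, w_k, b_k, u_k, c_k in zip(basis, proj_w, proj_b, gate_w, gate_b):
        # Assumed gate: a sigmoid of a linear function of the input representation.
        gate = 1.0 / (1.0 + math.exp(-(dot(u_k, h) + c_k)))
        delta = dot(w_k, h) + b_k - dot(r_k, h)
        out = [o + gate * delta * r for o, r in zip(out, r_k)]
    return out

# Usage: one basis direction in a 2-d representation space.
h = [1.0, 2.0]
basis, proj_w, proj_b = [[1.0, 0.0]], [[0.0, 0.0]], [3.0]
full = linear_update(h, basis, proj_w, proj_b)
half = basis_weighted_update(h, basis, proj_w, proj_b, [[0.0, 0.0]], [0.0])
off = basis_weighted_update(h, basis, proj_w, proj_b, [[0.0, 0.0]], [-50.0])
```

With the gate near zero the representation is left almost unchanged, which is the intuition behind preserving unrelated knowledge; with the gate near one the update matches the linear case.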

Title: How Deep is Love in LLMs' Hearts? Exploring Semantic Size in Human-like Cognition

Authors: Yao Yao, Yifei Yang, Xinbei Ma, Dongjie Yang, Zhuosheng Zhang, Zuchao Li, Hai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00330
Pdf URL: https://arxiv.org/pdf/2503.00330
Copy Paste: [[2503.00330]] How Deep is Love in LLMs' Hearts? Exploring Semantic Size in Human-like Cognition(https://arxiv.org/abs/2503.00330)
Keywords: language model, llm
Abstract: How human cognitive abilities are formed has long captivated researchers. However, a significant challenge lies in developing meaningful methods to measure these complex processes. With the advent of large language models (LLMs), which now rival human capabilities in various domains, we are presented with a unique testbed to investigate human cognition through a new lens. Among the many facets of cognition, one particularly crucial aspect is the concept of semantic size, the perceived magnitude of both abstract and concrete words or concepts. This study seeks to investigate whether LLMs exhibit similar tendencies in understanding semantic size, thereby providing insights into the underlying mechanisms of human cognition. We begin by exploring metaphorical reasoning, comparing how LLMs and humans associate abstract words with concrete objects of varying sizes. Next, we examine LLMs' internal representations to evaluate their alignment with human cognitive processes. Our findings reveal that multi-modal training is crucial for LLMs to achieve more human-like understanding, suggesting that real-world, multi-modal experiences are similarly vital for human cognitive development. Lastly, we examine whether LLMs are influenced by attention-grabbing headlines with larger semantic sizes in a real-world web shopping scenario. The results show that multi-modal LLMs are more emotionally engaged in decision-making, but this also introduces potential biases, such as the risk of manipulation through clickbait headlines. Ultimately, this study offers a novel perspective on how LLMs interpret and internalize language, from the smallest concrete objects to the most profound abstract concepts like love. The insights gained not only improve our understanding of LLMs but also provide new avenues for exploring the cognitive abilities that define human intelligence.
摘要：人类认知能力的形成方式长期以来吸引了研究人员。但是，一个重大的挑战在于开发有意义的方法来衡量这些复杂过程。随着大型语言模型（LLMS）的出现，现在与各个领域的人类能力相匹配，我们得到了独特的测试床，以通过新的镜头调查人类认知。在认知的许多方面中，一个特别关键的方面是语义大小的概念，即抽象和具体单词或概念的感知程度。这项研究试图研究LLM在理解语义大小时是否表现出相似的趋势，从而提供了对人类认知基本机制的见解。我们首先要探索隐喻推理，将LLM和人类如何将抽象单词与不同大小的具体对象联系起来。接下来，我们研究LLMS的内部表示，以评估它们与人类认知过程的一致性。我们的发现表明，多模式训练对于LLMS获得更类似人类的理解至关重要，这表明现实世界中的多模式经验对于人类的认知发展同样至关重要。最后，我们检查了LLM是否受到现实网络购物场景中具有较大语义大小的吸引人的头条新闻的影响。结果表明，多模式LLM在决策中更多地参与了情感，但这也引入了潜在的偏见，例如通过Clickbait Headlines操纵的风险。最终，这项研究提供了关于LLM的解释和内部化语言如何从最小的具体对象到爱情等最深刻的抽象概念的新观点。获得的见解不仅提高了我们对LLM的理解，而且为探索定义人类智力的认知能力提供了新的途径。

Title: More of the Same: Persistent Representational Harms Under Increased Representation

Authors: Jennifer Mickel, Maria De-Arteaga, Leqi Liu, Kevin Tian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00333
Pdf URL: https://arxiv.org/pdf/2503.00333
Copy Paste: [[2503.00333]] More of the Same: Persistent Representational Harms Under Increased Representation(https://arxiv.org/abs/2503.00333)
Keywords: language model, prompt
Abstract: To recognize and mitigate the harms of generative AI systems, it is crucial to consider who is represented in the outputs of generative AI systems and how people are represented. A critical gap emerges when naively improving who is represented, as this does not imply bias mitigation efforts have been applied to address how people are represented. We critically examined this by investigating gender representation in occupation across state-of-the-art large language models. We first show evidence suggesting that over time there have been interventions to models altering the resulting gender distribution, and we find that women are more represented than men when models are prompted to generate biographies or personas. We then demonstrate that representational biases persist in how different genders are represented by examining statistically significant word differences across genders. This results in a proliferation of representational harms, stereotypes, and neoliberalism ideals that, despite existing interventions to increase female representation, reinforce existing systems of oppression.
摘要：为了识别和减轻生成AI系统的危害，重要的是要考虑谁在生成AI系统的输出以及人们的代表方式中代表谁。当天真地改善谁是代表的人时，会出现一个关键的差距，因为这并不意味着已经采用了偏见缓解措施来解决人们的代表方式。我们通过调查跨最先进的大语言模型的职业中的性别代表来进行严格的研究。我们首先表明证据表明，随着时间的流逝，已经采取了干预措施来改变由此产生的性别分布，我们发现当促使模型产生传记或角色时，女性比男性更有代表性。然后，我们证明代表性偏见持续存在着通过检查性别统计学上重要的单词差异来表示不同性别的偏见。这会导致代表性危害，刻板印象和新自由主义理想的扩散，尽管现有的干预措施增加了女性代表，但仍加强了现有的压迫系统。

Title: U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack

Authors: Yunfan Gao, Yun Xiong, Wenlong Wu, Zijing Huang, Bohan Li, Haofen Wang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.00353
Pdf URL: https://arxiv.org/pdf/2503.00353
Copy Paste: [[2503.00353]] U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack(https://arxiv.org/abs/2503.00353)
Keywords: language model, llm, long context, hallucination, retrieval-augmented generation
Abstract: Recent advancements in Large Language Models (LLMs) have expanded their context windows to unprecedented lengths, sparking debates about the necessity of Retrieval-Augmented Generation (RAG). To address the fragmented evaluation paradigms and limited cases in existing Needle-in-a-Haystack (NIAH), this paper introduces U-NIAH, a unified framework that systematically compares LLMs and RAG methods in controlled long context settings. Our framework extends beyond traditional NIAH by incorporating multi-needle, long-needle, and needle-in-needle configurations, along with different retrieval settings, while leveraging the synthetic Starlight Academy dataset (a fictional magical universe) to eliminate biases from pre-trained knowledge. Through extensive experiments, we investigate three research questions: (1) performance trade-offs between LLMs and RAG, (2) error patterns in RAG, and (3) RAG's limitations in complex settings. Our findings show that RAG significantly enhances smaller LLMs by mitigating the "lost-in-the-middle" effect and improving robustness, achieving an 82.58% win-rate over LLMs. However, we observe that retrieval noise and reverse chunk ordering degrade performance, while surprisingly, advanced reasoning LLMs exhibit reduced RAG compatibility due to sensitivity to semantic distractors. We identify typical error patterns including omission due to noise, hallucination under high noise critical condition, and self-doubt behaviors. Our work not only highlights the complementary roles of RAG and LLMs, but also provides actionable insights for optimizing deployments. Code: this https URL.
摘要：大型语言模型（LLMS）的最新进步将其上下文窗口扩展到了前所未有的长度，引发了有关检索效果生成（RAG）必要性的辩论。为了解决现有的Ined-A-Haystack（NIAH）中零散的评估范例和有限的案例，本文介绍了U-Niah，U-Niah是一个统一的框架，该框架系统地比较了受控的长上下文设置中的LLM和抹布方法。我们的框架通过合并多针，长针和针中的针形配置以及不同的检索设置，超越了传统的NIAH，同时利用了合成的星光学院数据集，即虚构的魔法魔法宇宙，从预先信誉的知识中消除了偏见。通过广泛的实验，我们研究了三个研究问题：（1）LLMS和抹布之间的性能权衡，（2）抹布中的错误模式以及（3）（3）RAG在复杂设置中的局限性。我们的发现表明，通过减轻“中间”效应并提高了鲁棒性，抹布可显着提高较小的LLM，从而在LLMS上获得82.58％的胜利率。但是，我们观察到，检索噪声和反向块订购降解性能，而令人惊讶的是，由于对语义分散因子的敏感性，高级推理LLMS表现出降低的抹布兼容性。我们确定典型的误差模式，包括由于噪声引起的省略，高噪声关键条件下的幻觉以及自我怀疑行为。我们的工作不仅强调了抹布和LLM的互补作用，而且还提供了可行的见解以优化部署。代码：此HTTPS URL。
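
Illustrative sketch (not the U-NIAH code): a needle-in-a-haystack test places one or more "needle" facts at chosen depths in a long distractor context and then checks whether the answer recovers them. The helper names, the sentence-level granularity, and the substring-based check below are assumptions for illustration.

```python
def insert_needles(haystack_sentences, needles, depths):
    """Insert each needle at a relative depth (0.0 = start, 1.0 = end)."""
    if len(needles) != len(depths):
        raise ValueError("one depth per needle is required")
    n = len(haystack_sentences)
    # Map each depth to an index in the original haystack, then insert from the
    # back so that earlier insertions do not shift later positions.
    placements = sorted(
        ((round(d * n), needle) for needle, d in zip(needles, depths)),
        key=lambda p: p[0],
        reverse=True,
    )
    out = list(haystack_sentences)
    for pos, needle in placements:
        out.insert(pos, needle)
    return out

def contains_all(answer, expected_facts):
    """Crude multi-needle check: every expected fact appears in the answer."""
    return all(fact.lower() in answer.lower() for fact in expected_facts)

# Usage: a multi-needle context with placeholder filler sentences.
filler = [f"Filler sentence {i}." for i in range(4)]
context = insert_needles(filler, ["Needle one.", "Needle two."], [0.0, 0.5])
```

The resulting context can be passed whole to a long-context model, or chunked and retrieved for a RAG pipeline, so that both settings are scored on the same needles.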

Title: Structured Reasoning for Fairness: A Multi-Agent Approach to Bias Detection in Textual Data

Authors: Tianyi Huang, Elsa Fan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00355
Pdf URL: https://arxiv.org/pdf/2503.00355
Copy Paste: [[2503.00355]] Structured Reasoning for Fairness: A Multi-Agent Approach to Bias Detection in Textual Data(https://arxiv.org/abs/2503.00355)
Keywords: language model, llm, chat, agent
Abstract: From disinformation spread by AI chatbots to AI recommendations that inadvertently reinforce stereotypes, textual bias poses a significant challenge to the trustworthiness of large language models (LLMs). In this paper, we propose a multi-agent framework that systematically identifies biases by disentangling each statement as fact or opinion, assigning a bias intensity score, and providing concise, factual justifications. Evaluated on 1,500 samples from the WikiNPOV dataset, the framework achieves 84.9% accuracy, an improvement of 13.0% over the zero-shot baseline, demonstrating the efficacy of explicitly modeling fact versus opinion prior to quantifying bias intensity. By combining enhanced detection accuracy with interpretable explanations, this approach sets a foundation for promoting fairness and accountability in modern language models.
摘要：从AI Chatbots传播的虚假信息到无意中强化刻板印象的AI建议，文本偏见对大语言模型（LLMS）的可信度构成了重大挑战。在本文中，我们提出了一个多代理框架，该框架通过将每个陈述视为事实或意见，分配偏见强度得分并提供简洁的事实理由来系统地识别偏见。该框架对来自Wikinpov数据集的1,500个样本进行了评估，其精度达到了84.9％的精度$ \ unicode {x2014} $比零击基线$ \ unicode {x2014} $提高了13.0％的提高，以表明与量化的意图相比，表明了与明确建模的意见相关性。通过将增强的检测精度与可解释的解释相结合，这种方法为促进现代语言模型中的公平和问责制奠定了基础。

Title: BERT-based model for Vietnamese Fact Verification Dataset

Authors: Bao Tran, T. N. Khanh, Khang Nguyen Tuong, Thien Dang, Quang Nguyen, Nguyen T. Thinh, Vo T. Hung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00356
Pdf URL: https://arxiv.org/pdf/2503.00356
Copy Paste: [[2503.00356]] BERT-based model for Vietnamese Fact Verification Dataset(https://arxiv.org/abs/2503.00356)
Keywords: language model
Abstract: The rapid advancement of information and communication technology has facilitated easier access to information. However, this progress has also necessitated more stringent verification measures to ensure the accuracy of information, particularly within the context of Vietnam. This paper introduces an approach to address the challenges of Fact Verification using the Vietnamese dataset by integrating both sentence selection and classification modules into a unified network architecture. The proposed approach leverages the power of large language models by utilizing pre-trained PhoBERT and XLM-RoBERTa as the backbone of the network. The proposed model was trained on a Vietnamese dataset, named ISE-DSC01, and demonstrated superior performance compared to the baseline model across all three metrics. Notably, we achieved a Strict Accuracy level of 75.11%, indicating a remarkable 28.83% improvement over the baseline model.
摘要：信息和通信技术的快速发展已促进更容易获取信息。但是，这一进步还需要采取更严格的验证措施，以确保信息的准确性，尤其是在越南背景下。本文介绍了一种方法，可以通过将句子选择和分类模块集成到统一的网络体系结构中，以使用越南数据集来解决事实验证的挑战。提出的方法通过利用预先训练的Phobert和XLM-Roberta作为网络的骨干来利用大语言模型的力量。提出的模型在越南数据集（ISE-DSC01）上进行了培训，与所有三个指标的基线模型相比，表现出色。值得注意的是，我们实现了75.11 \％的严格精度水平，表明比基线模型的28.83 \％改进。

Title: Approaching the Limits to EFL Writing Enhancement with AI-generated Text and Diverse Learners

Authors: David James Woo, Hengky Susanto, Chi Ho Yeung, Kai Guo, Yilin Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00367
Pdf URL: https://arxiv.org/pdf/2503.00367
Copy Paste: [[2503.00367]] Approaching the Limits to EFL Writing Enhancement with AI-generated Text and Diverse Learners(https://arxiv.org/abs/2503.00367)
Keywords: gpt, chat
Abstract: Generative artificial intelligence (AI) chatbots, such as ChatGPT, are reshaping how English as a foreign language (EFL) students write since students can compose texts by integrating their own words with AI-generated text. This study investigated how 59 Hong Kong secondary school students with varying levels of academic achievement interacted with AI-generated text to compose a feature article, exploring whether any interaction patterns benefited the overall quality of the article. Through content analysis, multiple linear regression and cluster analysis, we found the overall number of words -- whether AI- or human-generated -- is the main predictor of writing quality. However, the impact varies by students' competence to write independently, for instance, by using their own words accurately and coherently to compose a text, and to follow specific interaction patterns with AI-generated text. Therefore, although composing texts with human words and AI-generated text may become prevalent in EFL writing classrooms, without educators' careful attention to EFL writing pedagogy and AI literacy, high-achieving students stand to benefit more from using AI-generated text than low-achieving students.
摘要：生成的人工智能（AI）聊天机器人（例如Chatgpt）正在重塑英语作为外语（EFL）学生写作，因为学生可以通过将自己的单词与AI生成的文本整合在一起来撰写文本。这项研究调查了59名香港中学学生如何与AI生成的文本相互作用，以撰写专题文章，探索任何互动模式是否使文章的整体质量受益。通过内容分析，多个线性回归和聚类分析，我们发现单词的总数（无论是AI-还是人类生成的）是写作质量的主要预测指标。但是，这种影响因学生能力独立写作而有所不同，例如，通过准确且相干地构成文本，并遵循具有AI生成的文本的特定互动模式。因此，尽管用人类的单词和AI生成的文本撰写文本可能在EFL写作教室中很普遍，而没有教育者对EFL编写教学法和AI识字的仔细关注，但高成就的学生可以从使用AI生成的文本中受益于与低位学生相比。

Title: Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks

Authors: Zongru Wu, Pengzhou Cheng, Zheng Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu
Subjects: cs.CL, cs.AI, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2503.00401
Pdf URL: https://arxiv.org/pdf/2503.00401
Copy Paste: [[2503.00401]] Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks(https://arxiv.org/abs/2503.00401)
Keywords: llm, agent
Abstract: Perception-enhanced pre-training, particularly through grounding techniques, is widely adopted to enhance the performance of graphical user interface (GUI) agents. However, in resource-constrained scenarios, the format discrepancy between coordinate-oriented grounding and action-oriented reasoning limits the effectiveness of grounding for reasoning tasks. To address this challenge, we propose a query-oriented pivot approach called query inference, which serves as a bridge between GUI grounding and reasoning. By inferring potential user queries from a screenshot and its associated element coordinates, query inference improves the understanding of coordinates while aligning more closely with reasoning tasks. Experimental results show that query inference outperforms previous grounding techniques under the same training data scale. Notably, query inference achieves performance comparable to or even better than that of the large-scale grounding-enhanced OS-Atlas with less than 0.1% of the training data. Furthermore, we explore the impact of reasoning formats and demonstrate that integrating additional semantic information into the input further boosts reasoning performance. The code is publicly available at https://github.com/ZrW00/GUIPivot.
摘要：人们广泛采用感知增强的预训练，特别是通过接地技术，以增强图形用户界面（GUI）代理的性能。但是，在资源约束的情况下，面向坐标的接地和面向动作的推理之间的格式差异限制了基础对推理任务的有效性。为了应对这一挑战，我们提出了一种名为“查询推理”的面向查询的枢轴方法，该方法是GUI接地和推理之间的桥梁。通过从屏幕截图及其关联元素坐标中推断潜在的用户查询，查询推理可以提高对坐标的理解，同时更加与推理任务保持一致。实验结果表明，在相同的训练数据量表下，查询推理的表现优于先前的接地技术。值得注意的是，查询推理与少于0.1％的培训数据的大规模接地增强的OS-ATLA具有可比性甚至更好的性能。此外，我们探讨了推理格式的影响，并证明将其他语义信息整合到输入中进一步提高了推理性能。该代码是公开可用的Athttps：//github.com/zrw00/guipivot。

Title: A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information

Authors: Lucky Susanto, Musa Wijanarko, Prasetia Pratama, Zilu Tang, Fariz Akyas, Traci Hong, Ika Idris, Alham Aji, Derry Wijaya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00417
Pdf URL: https://arxiv.org/pdf/2503.00417
Copy Paste: [[2503.00417]] A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information(https://arxiv.org/abs/2503.00417)
Keywords: language model, llm
Abstract: Polarization is defined as divisive opinions held by two or more groups on substantive issues. As the world's third-largest democracy, Indonesia faces growing concerns about the interplay between political polarization and online toxicity, which is often directed at vulnerable minority groups. Despite the importance of this issue, previous NLP research has not fully explored the relationship between toxicity and polarization. To bridge this gap, we present a novel multi-label Indonesian dataset that incorporates toxicity, polarization, and annotator demographic information. Benchmarking this dataset using BERT-base models and large language models (LLMs) shows that polarization information enhances toxicity classification, and vice versa. Furthermore, providing demographic information significantly improves the performance of polarization classification.
摘要：两极分化被定义为在实质性问题上持有的两个或更多群体的分裂意见。作为世界第三大民主国家，印度尼西亚对政治两极分化与在线毒性之间的相互作用的关注日益严重，这通常针对脆弱的少数群体。尽管这个问题很重要，但以前的NLP研究并未完全探索毒性与极化之间的关系。为了弥合这一差距，我们提出了一种新型的多标签印尼数据集，其中包含毒性，极化和注释者人口统计信息。使用BERT基本模型和大型语言模型（LLM）对该数据集进行基准测试表明，极化信息可以增强毒性分类，反之亦然。此外，提供人口统计信息可显着提高两极分化的性能。

Title: AILS-NTUA at SemEval-2025 Task 8: Language-to-Code prompting and Error Fixing for Tabular Question Answering

Authors: Andreas Evangelatos, Giorgos Filandrianos, Maria Lymperaiou, Athanasios Voulodimos, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00435
Pdf URL: https://arxiv.org/pdf/2503.00435
Copy Paste: [[2503.00435]] AILS-NTUA at SemEval-2025 Task 8: Language-to-Code prompting and Error Fixing for Tabular Question Answering(https://arxiv.org/abs/2503.00435)
Keywords: language model, llm, prompt
Abstract: In this paper, we present our submission to SemEval-2025 Task 8: Question Answering over Tabular Data. This task, evaluated on the DataBench dataset, assesses Large Language Models' (LLMs) ability to answer natural language questions over structured data while addressing topic diversity and table size limitations in previous benchmarks. We propose a system that employs effective LLM prompting to translate natural language queries into executable code, enabling accurate responses, error correction, and interpretability. Our approach ranks first in both subtasks of the competition in the proprietary model category, significantly outperforming the organizer's baseline.
摘要：在本文中，我们将提交给Semeval-2025 Task 8：通过表格数据回答问题。该任务在数据宾通数据集上进行了评估，评估了大型语言模型的（LLMS）能够通过结构化数据回答自然语言问题的能力，同时解决了以前的基准测试中的主题多样性和表格尺寸限制。我们提出了一个采用有效LLM的系统，提示将自然语言查询转化为可执行的代码，从而实现准确的响应，错误校正和解释性。我们的方法在专有模型类别中的两个子任务中排名第一，大大优于组织者的基线。
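
Illustrative sketch (not the authors' system): the abstract describes prompting an LLM to translate a question into executable code and correcting errors. One common shape for this is a generate, execute, repair loop, sketched below. The `generate` callable stands in for an LLM call and is hypothetical, as are the `answer(table)` convention and the retry budget.

```python
import traceback

def answer_with_repair(question, table, generate, max_attempts=3):
    """Ask `generate` for code defining answer(table); on failure, retry with the error."""
    feedback = ""
    for _ in range(max_attempts):
        code = generate(question, feedback)
        namespace = {}
        try:
            # NOTE: generated code is untrusted; run it only inside a sandbox.
            exec(code, namespace)
            return namespace["answer"](table)
        except Exception:
            # Feed the traceback back so the next attempt can fix the error.
            feedback = traceback.format_exc()
    return None

def stub_generate(question, feedback):
    """Stand-in for an LLM: the first attempt has a typo, the repair fixes it."""
    if not feedback:
        return "def answer(table):\n    return sum(r['pric'] for r in table)"
    return "def answer(table):\n    return sum(r['price'] for r in table)"

# Usage: the first attempt raises KeyError, the second succeeds.
table = [{"price": 2}, {"price": 3}]
result = answer_with_repair("What is the total price?", table, stub_generate)
```

Returning the executed code alongside the answer would also give the interpretability the abstract mentions, since the code documents how the answer was computed.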

Title: Rehearse With User: Personalized Opinion Summarization via Role-Playing based on Large Language Models

Authors: Yanyue Zhang, Yulan He, Deyu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00449
Pdf URL: https://arxiv.org/pdf/2503.00449
Copy Paste: [[2503.00449]] Rehearse With User: Personalized Opinion Summarization via Role-Playing based on Large Language Models(https://arxiv.org/abs/2503.00449)
Keywords: language model, llm
Abstract: Personalized opinion summarization is crucial as it considers individual user interests while generating product summaries. Recent studies show that although large language models demonstrate powerful text summarization and evaluation capabilities without the need for training data, they face difficulties in personalized tasks involving long texts. To address this, Rehearsal, a personalized opinion summarization framework via LLM-based role-playing, is proposed. By having the model act as the user, the model can better understand the user's personalized needs. Additionally, a role-playing supervisor and practice process are introduced to improve the role-playing ability of the LLMs, leading to a better expression of user needs. Furthermore, suggestions from virtual users are used to guide summary generation, ensuring that the generated summary includes information of interest to the user and thus achieving personalized summary generation. Experimental results demonstrate that our method can effectively improve the level of personalization in large model-generated summaries.

Title: Embracing Diversity: A Multi-Perspective Approach with Soft Labels

Authors: Benedetta Muscato, Praveen Bushipaka, Gizem Gezici, Lucia Passaro, Fosca Giannotti, Tommaso Cucinotta
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.00489
Pdf URL: https://arxiv.org/pdf/2503.00489
Copy Paste: [[2503.00489]] Embracing Diversity: A Multi-Perspective Approach with Soft Labels(https://arxiv.org/abs/2503.00489)
Keywords: llm
Abstract: Prior studies show that adopting the annotation diversity shaped by different backgrounds and life experiences and incorporating it into model learning, i.e., the multi-perspective approach, contributes to the development of more responsible models. Thus, in this paper we propose a new framework for designing and further evaluating perspective-aware models on the stance detection task, in which multiple annotators assign stances based on a controversial topic. We also share a new dataset established by obtaining both human and LLM annotations. Results show that the multi-perspective approach yields better classification performance (higher F1-scores), outperforming the traditional approaches that use a single ground truth, while displaying lower model confidence scores, probably due to the high level of subjectivity of the stance detection task.
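The contrast between soft labels and a single ground truth can be illustrated with a minimal Python sketch. The label set, the vote list, the predicted distribution, and the use of cross-entropy as the training signal are all assumptions made for illustration; the abstract does not state the exact loss or label scheme the authors use.

```python
import math
from collections import Counter

# Hypothetical stance label set, chosen for illustration only.
STANCES = ["against", "neutral", "favor"]

def soft_label(votes):
    """Convert a list of annotator votes into a distribution over STANCES."""
    counts = Counter(votes)
    return [counts[s] / len(votes) for s in STANCES]

def majority_label(votes):
    """Single ground-truth baseline: one-hot vector for the most common vote."""
    winner = Counter(votes).most_common(1)[0][0]
    return [1.0 if s == winner else 0.0 for s in STANCES]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a (soft or hard) target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Four hypothetical annotators disagree on one item.
votes = ["favor", "favor", "against", "neutral"]
soft = soft_label(votes)       # keeps the disagreement
hard = majority_label(votes)   # collapses it to "favor"

# A hypothetical model prediction, ordered like STANCES.
predicted = [0.2, 0.3, 0.5]
print("soft target:", soft, "loss:", round(cross_entropy(soft, predicted), 3))
print("hard target:", hard, "loss:", round(cross_entropy(hard, predicted), 3))
```

With a soft target the model is penalized for ignoring minority stances, which is one plausible reason a multi-perspective model would report lower confidence on subjective items.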

Title: Tutorial Proposal: Speculative Decoding for Efficient LLM Inference

Authors: Heming Xia, Cunxiao Du, Yongqi Li, Qian Liu, Wenjie Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00491
Pdf URL: https://arxiv.org/pdf/2503.00491
Copy Paste: [[2503.00491]] Tutorial Proposal: Speculative Decoding for Efficient LLM Inference(https://arxiv.org/abs/2503.00491)
Keywords: llm
Abstract: This tutorial presents a comprehensive introduction to Speculative Decoding (SD), an advanced technique for LLM inference acceleration that has garnered significant research interest in recent years. SD is introduced as an innovative decoding paradigm to mitigate the high inference latency stemming from autoregressive decoding in LLMs. At each decoding step, SD efficiently drafts several future tokens and then verifies them in parallel. This approach, unlike traditional autoregressive decoding, facilitates the simultaneous decoding of multiple tokens per step, thereby achieving promising 2x-4x speedups in LLM inference while maintaining original distributions. This tutorial delves into the latest techniques in SD, including draft model architectures and verification strategies. Additionally, it explores the acceleration potential and future research directions in this promising field. We aim for this tutorial to elucidate the current research landscape and offer insights for researchers interested in Speculative Decoding, ultimately contributing to more efficient LLM inference.
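The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for real models, and the sketch uses greedy verification (accept a drafted token only if it equals the target's greedy choice) rather than the sampling-based acceptance rule that preserves the full output distribution.

```python
def target_next(tokens):
    """Toy 'target model': the next token is a deterministic function of the context."""
    return (sum(tokens) * 31 + len(tokens)) % 10

def draft_next(tokens):
    """Toy 'draft model': agrees with the target except when the context length is a multiple of 4."""
    guess = target_next(tokens)
    return (guess + 1) % 10 if len(tokens) % 4 == 0 else guess

def autoregressive_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, n_new, k=3):
    """Greedy speculative decoding: draft k tokens, then verify them against the target."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1) Draft k future tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Verify. A real system scores all positions in one batched target pass;
        #    here that pass is simulated position by position.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # replace the first mismatch with the target's token
                break
        else:
            # All k drafts accepted: the same pass also yields one extra target token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:goal], target_passes

prompt = [1, 2, 3]
baseline = autoregressive_decode(prompt, 12)
speculative, passes = speculative_decode(prompt, 12, k=3)
print(speculative == baseline, "target passes:", passes, "vs 12 sequential steps")
```

Because every accepted token is checked against the target, the output is identical to plain greedy decoding; the saving comes from needing fewer sequential target passes.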

Title: ToolDial: Multi-turn Dialogue Generation Method for Tool-Augmented Language Models

Authors: Jeonghoon Shim, Gyuhyeon Seo, Cheongsu Lim, Yohan Jo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00564
Pdf URL: https://arxiv.org/pdf/2503.00564
Copy Paste: [[2503.00564]] ToolDial: Multi-turn Dialogue Generation Method for Tool-Augmented Language Models(https://arxiv.org/abs/2503.00564)
Keywords: language model
Abstract: Tool-Augmented Language Models (TALMs) leverage external APIs to answer user queries across various domains. However, existing benchmark datasets for TALM research often feature simplistic dialogues that do not reflect real-world scenarios, such as the need for models to ask clarifying questions or proactively call additional APIs when essential information is missing. To address these limitations, we construct and release ToolDial, a dataset comprising 11,111 multi-turn dialogues, with an average of 8.95 turns per dialogue, based on APIs from RapidAPI. ToolDial has two key characteristics. First, the dialogues incorporate 16 user and system actions (e.g., "Request", "Clarify", "Fail inform") to capture the rich dynamics of real-world interactions. Second, we simulate dialogues where the system requests necessary information from the user based on API documentation and seeks additional APIs if the user fails to provide the required information. To facilitate this process, we introduce a method for generating an API graph that represents input and output compatibility between APIs. Using ToolDial, we evaluate a suite of language models on their ability to predict correct actions and extract input parameter values for API calls from the dialogue history. Modern language models achieve accuracy scores below 70%, indicating substantial room for improvement. We release our dataset and code at this https URL.

Title: LoR2C : Low-Rank Residual Connection Adaptation for Parameter-Efficient Fine-Tuning

Authors: Jiancheng Zhao, Xingda Yu, Yuxiang Zhang, Zhen Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00572
Pdf URL: https://arxiv.org/pdf/2503.00572
Copy Paste: [[2503.00572]] LoR2C : Low-Rank Residual Connection Adaptation for Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2503.00572)
Keywords: language model
Abstract: In recent years, pretrained large language models have demonstrated outstanding performance across various natural language processing tasks. However, full-parameter fine-tuning methods require adjusting all model parameters, leading to immense computational resource demands. Although parameter-efficient fine-tuning methods like LoRA have significantly reduced the number of parameters, they still face challenges such as gradient vanishing and the potential for further parameter reduction. To address these issues, this paper proposes a novel parameter-efficient fine-tuning method called LoR2C (Low-Rank Residual Connection Adaptation). LoR2C introduces residual connections with low-rank matrices within the model layers, which not only reduces the number of fine-tuning parameters but also effectively alleviates the gradient vanishing problem. Additionally, this paper presents three optimization variants of LoR2C: ShareLoR2C, MergeLoR2C, and InjectLoR2C. These variants further improve parameter efficiency and model performance through parameter sharing, module merging, and injection mechanisms, respectively. Experimental results on multiple natural language understanding and natural language generation tasks demonstrate that LoR2C and its optimized variants significantly reduce parameter overhead while maintaining or even improving performance, outperforming existing mainstream parameter-efficient fine-tuning methods. The code is publicly available at this https URL.
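The idea of a residual connection built from low-rank matrices can be sketched as follows. This is a rough illustration, not the authors' implementation: the class name `LowRankResidual`, the helper functions, the dimensions, and the zero initialization of the up-projection (borrowed from common LoRA practice) are all assumptions, and the abstract does not specify where in the layer the residual is attached.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def make_matrix(rows, cols, scale, rng):
    """Random matrix with entries in [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class LowRankResidual:
    """Hypothetical low-rank bypass: x -> up @ (down @ x), with rank r much smaller than d."""

    def __init__(self, d, r, rng):
        self.down = make_matrix(r, d, 0.1, rng)    # r x d, trainable
        self.up = [[0.0] * r for _ in range(d)]    # d x r, zero-initialized (assumption)

    def __call__(self, x):
        return matvec(self.up, matvec(self.down, x))

    def num_params(self):
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])

def adapted_layer(frozen_layer, adapter, x):
    """Output of the frozen layer plus the low-rank residual path around it."""
    return [h + a for h, a in zip(frozen_layer(x), adapter(x))]

rng = random.Random(0)
d, r = 8, 2
frozen_weight = make_matrix(d, d, 0.5, rng)       # stays fixed during fine-tuning
frozen_layer = lambda x: matvec(frozen_weight, x)
adapter = LowRankResidual(d, r, rng)

x = [1.0] * d
print("trainable params:", adapter.num_params(), "vs frozen:", d * d)
print("adapter is a no-op at init:", adapted_layer(frozen_layer, adapter, x) == frozen_layer(x))
```

The residual path has 2·d·r parameters instead of d², and because it bypasses the frozen layer it also gives gradients a short route through the network, which is the intuition behind the gradient-vanishing claim.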

Title: BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge

Authors: Terry Tong, Fei Wang, Zhe Zhao, Muhao Chen
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.00596
Pdf URL: https://arxiv.org/pdf/2503.00596
Copy Paste: [[2503.00596]] BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge(https://arxiv.org/abs/2503.00596)
Keywords: llm, prompt
Abstract: This paper proposes a novel backdoor threat attacking the LLM-as-a-Judge evaluation regime, where the adversary controls both the candidate and evaluator model. The backdoored evaluator victimizes benign users by unfairly assigning inflated scores to the adversary. A trivial single-token backdoor poisoning 1% of the evaluator training data triples the adversary's score with respect to their legitimate score. We systematically categorize levels of data access corresponding to three real-world settings: (1) web poisoning, (2) malicious annotator, and (3) weight poisoning. These regimes reflect a weak-to-strong escalation of data access that highly correlates with attack severity. Under the weakest assumptions, web poisoning (1), the adversary still induces a 20% score inflation. Likewise, in the weight poisoning regime (3), the stronger assumptions enable the adversary to inflate their scores from 1.5/5 to 4.9/5. The backdoor threat generalizes across different evaluator architectures, trigger designs, evaluation tasks, and poisoning rates. By poisoning 10% of the evaluator training data, we control toxicity judges (Guardrails) to misclassify toxic prompts as non-toxic 89% of the time, and document reranker judges in RAG to rank the poisoned document first 97% of the time. LLM-as-a-Judge is uniquely positioned at the intersection of ethics and technology, where the social implications of misled model selection and evaluation constrain the available defensive tools. Amidst these challenges, model merging emerges as a principled tool to offset the backdoor, reducing ASR to near 0% whilst maintaining SOTA performance. Model merging's low computational cost and convenient integration into the current LLM Judge training pipeline position it as a promising avenue for backdoor mitigation in the LLM-as-a-Judge setting.

Title: Zero-Shot Keyphrase Generation: Investigating Specialized Instructions and Multi-Sample Aggregation on Large Language Models

Authors: Jayanth Mohan, Jishnu Ray Chowdhury, Tomas Malik, Cornelia Caragea
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00597
Pdf URL: https://arxiv.org/pdf/2503.00597
Copy Paste: [[2503.00597]] Zero-Shot Keyphrase Generation: Investigating Specialized Instructions and Multi-Sample Aggregation on Large Language Models(https://arxiv.org/abs/2503.00597)
Keywords: language model, gpt, llm, prompt
Abstract: Keyphrases are the essential topical phrases that summarize a document. Keyphrase generation is a long-standing NLP task for automatically generating keyphrases for a given document. While the task has been comprehensively explored in the past via various models, only a few works perform some preliminary analysis of Large Language Models (LLMs) for the task. Given the impact of LLMs in the field of NLP, it is important to conduct a more thorough examination of their potential for keyphrase generation. In this paper, we attempt to meet this demand with our research agenda. Specifically, we focus on the zero-shot capabilities of open-source instruction-tuned LLMs (Phi-3, Llama-3) and the closed-source GPT-4o for this task. We systematically investigate the effect of providing task-relevant specialized instructions in the prompt. Moreover, we design task-specific counterparts to self-consistency-style strategies for LLMs and show significant benefits from our proposals over the baselines.
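A self-consistency-style strategy for keyphrase generation can be approximated by sampling several outputs and keeping the phrases that recur. The sketch below is one simple way to do that and is not the authors' exact aggregation scheme; the function name `aggregate_keyphrases`, the `min_votes` threshold, and the sample data are hypothetical.

```python
from collections import Counter

def aggregate_keyphrases(samples, top_k=5, min_votes=2):
    """Majority-style aggregation: keep phrases that appear in at least min_votes samples."""
    votes = Counter()
    for sample in samples:
        # Normalize, and count each phrase at most once per sample.
        votes.update({phrase.strip().lower() for phrase in sample})
    # Rank by vote count (descending), breaking ties alphabetically.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, count in ranked if count >= min_votes][:top_k]

# Three hypothetical LLM samples for the same document.
samples = [
    ["speculative decoding", "LLM inference"],
    ["Speculative Decoding", "draft model"],
    ["speculative decoding", "LLM inference", "latency"],
]
print(aggregate_keyphrases(samples))
```

Phrases produced by only one sample are treated as noise and dropped, while phrases the model repeats across samples are ranked first.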

Title: An evaluation of DeepSeek Models in Biomedical Natural Language Processing

Authors: Zaifu Zhan, Shuang Zhou, Huixue Zhou, Jiawen Deng, Yu Hou, Jeremy Yeung, Rui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00624
Pdf URL: https://arxiv.org/pdf/2503.00624
Copy Paste: [[2503.00624]] An evaluation of DeepSeek Models in Biomedical Natural Language Processing(https://arxiv.org/abs/2503.00624)
Keywords: language model, llm
Abstract: The advancement of Large Language Models (LLMs) has significantly impacted biomedical Natural Language Processing (NLP), enhancing tasks such as named entity recognition, relation extraction, event extraction, and text classification. In this context, the DeepSeek series of models has shown promising potential in general NLP tasks, yet its capabilities in the biomedical domain remain underexplored. This study evaluates multiple DeepSeek models (Distilled-DeepSeek-R1 series and Deepseek-LLMs) across four key biomedical NLP tasks using 12 datasets, benchmarking them against state-of-the-art alternatives (Llama3-8B, Qwen2.5-7B, Mistral-7B, Phi-4-14B, Gemma-2-9B). Our results reveal that while DeepSeek models perform competitively in named entity recognition and text classification, challenges persist in event and relation extraction due to precision-recall trade-offs. We provide task-specific model recommendations and highlight future research directions. This evaluation underscores the strengths and limitations of DeepSeek models in biomedical NLP, guiding their future deployment and optimization.

Title: Unmasking Digital Falsehoods: A Comparative Analysis of LLM-Based Misinformation Detection Strategies

Authors: Tianyi Huang, Jingyuan Yi, Peiyang Yu, Xiaochuan Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00724
Pdf URL: https://arxiv.org/pdf/2503.00724
Copy Paste: [[2503.00724]] Unmasking Digital Falsehoods: A Comparative Analysis of LLM-Based Misinformation Detection Strategies(https://arxiv.org/abs/2503.00724)
Keywords: language model, gpt, llm, hallucination, agent
Abstract: The proliferation of misinformation on social media has raised significant societal concerns, necessitating robust detection mechanisms. Large Language Models such as GPT-4 and LLaMA2 have been envisioned as possible tools for detecting misinformation based on their advanced natural language understanding and reasoning capabilities. This paper compares text-based, multimodal, and agentic LLM-based approaches to detecting misinformation. We evaluate the effectiveness of fine-tuned models, zero-shot learning, and systematic fact-checking mechanisms in detecting misinformation across different topic domains such as public health, politics, and finance. We also discuss the scalability, generalizability, and explainability of the models and recognize key challenges such as hallucination, adversarial attacks on misinformation, and computational resources. Our findings point towards the importance of hybrid approaches that pair structured verification protocols with adaptive learning techniques to enhance detection accuracy and explainability. The paper closes by suggesting potential avenues of future work, including real-time tracking of misinformation, federated learning, and cross-platform detection models.

Title: RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery

Authors: Hongchao Gu, Dexun Li, Kuicai Dong, Hao Zhang, Hang Lv, Hao Wang, Defu Lian, Yong Liu, Enhong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00751
Pdf URL: https://arxiv.org/pdf/2503.00751
Copy Paste: [[2503.00751]] RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery(https://arxiv.org/abs/2503.00751)
Keywords: language model, hallucination, agent
Abstract: Generating knowledge-intensive and comprehensive long texts, such as encyclopedia articles, remains a significant challenge for Large Language Models. It requires not only the precise integration of facts but also the maintenance of thematic coherence throughout the article. Existing methods, such as direct generation and multi-agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency. To address these challenges, we propose RAPID, an efficient retrieval-augmented long text generation framework. RAPID consists of three main modules: (1) retrieval-augmented preliminary outline generation to reduce hallucinations, (2) attribute-constrained search for efficient information discovery, and (3) plan-guided article generation for enhanced coherence. Extensive experiments on our newly compiled benchmark dataset, FreshWiki-2024, demonstrate that RAPID significantly outperforms state-of-the-art methods across a wide range of evaluation metrics (e.g., long-text generation, outline quality, and latency). Our work provides a robust and efficient solution to the challenges of automated long-text generation.

Title: Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity

Authors: Yupu Hao, Pengfei Cao, Zhuoran Jin, Huanxuan Liao, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00771
Pdf URL: https://arxiv.org/pdf/2503.00771
Copy Paste: [[2503.00771]] Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity(https://arxiv.org/abs/2503.00771)
Keywords: language model, llm, agent
Abstract: Personalized tool utilization is essential for aligning large language models (LLMs) with user preferences in interaction scenarios involving various tools. However, most current benchmarks focus on either the personalization of text generation or direct tool use, without considering both. In this work, we introduce ETAPP, a novel benchmark for evaluating personalized tool invocation, which comprises a sandbox environment and a comprehensive dataset of 800 test cases covering diverse user profiles. To improve the accuracy of our evaluation, we propose a key-point-based LLM evaluation method, mitigating biases in the LLM-as-a-judge system by manually annotating key points for each test case and providing them to the LLM as a reference. Additionally, we evaluate leading LLMs and provide an in-depth analysis. Furthermore, we investigate the impact of different tool-invoking strategies on LLMs' personalization performance and the effects of fine-tuning in our task. The effectiveness of our preference-setting and key-point-based evaluation method is also validated. Our findings offer insights into improving personalized LLM agents. Our code is available at this https URL.

Title: DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

Authors: Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00784
Pdf URL: https://arxiv.org/pdf/2503.00784
Copy Paste: [[2503.00784]] DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting(https://arxiv.org/abs/2503.00784)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. Nevertheless, the draft model introduces additional computational overhead, becoming a performance bottleneck and increasing the time to first token (TTFT). Previous approaches to mitigate draft model overhead have primarily relied on heuristics and generally failed to match the quality of the draft language models. To address these challenges, we propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively, enabling parallel decoding while preserving draft quality. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality. Extensive experiments across seven tasks show that DuoDecoding achieves up to 2.61x speedup in generation latency, while reducing TTFT to 83% of that in conventional speculative decoding. The Code is available at this https URL.

Title: Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Authors: Kashun Shum, Yuzhen Huang, Hongjian Zou, Ding Qi, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00808
Pdf URL: https://arxiv.org/pdf/2503.00808
Copy Paste: [[2503.00808]] Predictive Data Selection: The Data That Predicts Is the Data That Teaches(https://arxiv.org/abs/2503.00808)
Keywords: language model
Abstract: Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that the compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance when the text domain aligns with the downstream benchmark (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning. To leverage this insight, we introduce data selection based on data's Predictive strength (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpass the performance of a vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu, at the scale of 3B models trained on 100B tokens. We open-source our trained data selection scorer along with the curated datasets at this https URL.
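The hypothesis that useful data is data on which model losses predict downstream ability can be made concrete with a rank correlation. The sketch below is an assumption-laden illustration, not the PreSelect implementation: the function names, the use of Spearman correlation, and all numbers are hypothetical, and the fastText scorer that the abstract describes is not reproduced here.

```python
def ranks(values):
    """Rank positions (0 = smallest). Assumes no ties, which holds for the toy data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def predictive_strength(doc_losses, downstream_scores):
    """High when models with lower loss on the document have higher downstream scores."""
    return -spearman(doc_losses, downstream_scores)

# Hypothetical downstream benchmark scores of four reference models.
downstream = [0.30, 0.45, 0.60, 0.72]

# Hypothetical normalized losses of the same four models on two documents.
doc_a = [1.9, 1.6, 1.3, 1.1]   # loss falls exactly as downstream score rises
doc_b = [1.5, 1.2, 1.6, 1.4]   # loss ordering unrelated to downstream score

for name, losses in [("doc_a", doc_a), ("doc_b", doc_b)]:
    print(name, predictive_strength(losses, downstream))
```

Under this reading, documents like `doc_a` would be treated as positive examples and documents like `doc_b` as negatives when training a lightweight classifier to score the full corpus.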

Title: Waste Not, Want Not; Recycled Gumbel Noise Improves Consistency in Natural Language Generation

Authors: Damien de Mijolla, Hannan Saddiq, Kim Moore
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00831
Pdf URL: https://arxiv.org/pdf/2503.00831
Copy Paste: [[2503.00831]] Waste Not, Want Not; Recycled Gumbel Noise Improves Consistency in Natural Language Generation(https://arxiv.org/abs/2503.00831)
Keywords: language model, prompt
Abstract: Consistency in the output of language models is critical for their reliability and practical utility. Due to their training objective, language models learn to model the full space of possible continuations, leading to outputs that can vary significantly in style and content, even for similar or repeated inputs. To address this, we propose a novel decoding algorithm that enhances response consistency across different prompts with no degradation in response quality. By incorporating a latent variable into the next-token sampling process based on the Gumbel reparametrisation trick, our method outperforms standard sampling by up to 10% across semantic and stylistic consistency benchmarks. Additionally, our approach integrates seamlessly with existing sampling methods with negligible computational overhead, providing a practical solution for improving the reliability of language model outputs.
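The Gumbel reparametrisation mentioned above rests on the Gumbel-max trick: adding independent Gumbel noise to the logits and taking the argmax yields an exact sample from the softmax distribution. The sketch below shows that trick and one simple way to reuse ("recycle") the same noise vector across prompts; the seeding scheme, function names, and logits are assumptions for illustration and do not reproduce the authors' algorithm.

```python
import math
import random

def gumbel_noise(vocab_size, seed):
    """Draw one Gumbel(0, 1) value per vocabulary entry from a seeded generator."""
    rng = random.Random(seed)
    # Clamp the uniform draw away from 0 so that log(u) is always defined.
    return [-math.log(-math.log(max(rng.random(), 1e-300))) for _ in range(vocab_size)]

def gumbel_max_sample(logits, noise):
    """Gumbel-max trick: argmax(logits + noise) is a sample from softmax(logits)."""
    scores = [logit + g for logit, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)

# Two hypothetical prompts whose next-token distributions are close but not identical.
prompt_a_logits = [math.log(0.70), math.log(0.20), math.log(0.10)]
prompt_b_logits = [math.log(0.68), math.log(0.21), math.log(0.11)]

# Recycling: both prompts are sampled with the SAME noise vector, so the
# randomness is shared and only the difference in logits can change the outcome.
shared = gumbel_noise(3, seed=0)
token_a = gumbel_max_sample(prompt_a_logits, shared)
token_b = gumbel_max_sample(prompt_b_logits, shared)
print("token for prompt A:", token_a, "| token for prompt B:", token_b)
```

With independent noise per prompt, two near-identical distributions can easily yield different tokens; with shared noise they tend to agree, while each draw is still a valid sample from its own distribution.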

Title: Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners

Authors: Miao Peng, Nuo Chen, Zongrui Suo, Jia Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.00845
Pdf URL: https://arxiv.org/pdf/2503.00845
Copy Paste: [[2503.00845]] Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners(https://arxiv.org/abs/2503.00845)
Keywords: language model, llm
Abstract: Despite significant advancements in Large Language Models (LLMs), developing advanced reasoning capabilities in LLMs remains a key challenge. Process Reward Models (PRMs) have demonstrated exceptional promise in enhancing reasoning by providing step-wise feedback, particularly in the context of mathematical reasoning. However, their application to broader reasoning domains remains understudied, largely due to the high costs associated with manually creating step-level supervision. In this work, we explore the potential of PRMs in graph reasoning problems - a domain that demands sophisticated multi-step reasoning and offers opportunities for automated step-level data generation using established graph algorithms. We introduce GraphSILO, the largest dataset for graph reasoning problems with fine-grained step-wise labels, built using automated Task-oriented Trajectories and Monte Carlo Tree Search (MCTS) to generate detailed reasoning steps with step-wise labels. Building upon this dataset, we train GraphPRM, the first PRM designed for graph reasoning problems, and evaluate its effectiveness in two key settings: inference-time scaling and reinforcement learning via Direct Preference Optimization (DPO). Experimental results show that GraphPRM significantly improves LLM performance across 13 graph reasoning tasks, delivering a 9% gain for Qwen2.5-7B and demonstrating transferability to new graph reasoning datasets and new reasoning domains like mathematical problem-solving. Notably, GraphPRM enhances LLM performance on GSM8K and Math500, underscoring the cross-domain applicability of graph-based reasoning rewards. Our findings highlight the potential of PRMs in advancing reasoning across diverse domains, paving the way for more versatile and effective LLMs.

Title: Argument Summarization and its Evaluation in the Era of Large Language Models

Authors: Moritz Altemeyer, Steffen Eger, Johannes Daxenberger, Tim Altendorf, Philipp Cimiano, Benjamin Schiller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00847
Pdf URL: https://arxiv.org/pdf/2503.00847
Copy Paste: [[2503.00847]] Argument Summarization and its Evaluation in the Era of Large Language Models(https://arxiv.org/abs/2503.00847)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have revolutionized various Natural Language Generation (NLG) tasks, including Argument Summarization (ArgSum), a key subfield of Argument Mining (AM). This paper investigates the integration of state-of-the-art LLMs into ArgSum, including for its evaluation. In particular, we propose a novel prompt-based evaluation scheme, and validate it through a novel human benchmark dataset. Our work makes three main contributions: (i) the integration of LLMs into existing ArgSum frameworks, (ii) the development of a new LLM-based ArgSum system, benchmarked against prior methods, and (iii) the introduction of an advanced LLM-based evaluation scheme. We demonstrate that the use of LLMs substantially improves both the generation and evaluation of argument summaries, achieving state-of-the-art results and advancing the field of ArgSum.

Title: Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers

Authors: Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, Wenxuan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00865
Pdf URL: https://arxiv.org/pdf/2503.00865
Copy Paste: [[2503.00865]] Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers(https://arxiv.org/abs/2503.00865)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, with existing models often limited in language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce Babel, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continued-pretraining approaches, Babel expands its parameter count through a layer extension technique that elevates Babel's performance ceiling. We introduce two variants: Babel-9B, designed for efficient inference and fine-tuning, and Babel-83B, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate its superior performance compared to open LLMs of comparable size. In addition, using open-source supervised fine-tuning datasets, Babel achieves remarkable performance, with Babel-9B-Chat leading among 10B-sized LLMs and Babel-83B-Chat setting a new standard for multilingual tasks, reaching the same level as commercial models.

Title: DUAL: Diversity and Uncertainty Active Learning for Text Summarization

Authors: Petros Stylianos Giouroukis, Alexios Gidiotis, Grigorios Tsoumakas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00867
Pdf URL: https://arxiv.org/pdf/2503.00867
Copy Paste: [[2503.00867]] DUAL: Diversity and Uncertainty Active Learning for Text Summarization(https://arxiv.org/abs/2503.00867)
Keywords: language model
Abstract: With the rise of large language models, neural text summarization has advanced significantly in recent years. However, even state-of-the-art models continue to rely heavily on high-quality human-annotated data for training and evaluation. Active learning is frequently used as an effective way to collect such datasets, especially when annotation resources are scarce. Active learning methods typically prioritize either uncertainty or diversity but have shown limited effectiveness in summarization, often being outperformed by random sampling. We present Diversity and Uncertainty Active Learning (DUAL), a novel algorithm that combines uncertainty and diversity to iteratively select and annotate samples that are both representative of the data distribution and challenging for the current model. DUAL addresses the selection of noisy samples in uncertainty-based methods and the limited exploration scope of diversity-based methods. Through extensive experiments with different summarization models and benchmark datasets, we demonstrate that DUAL consistently matches or outperforms the best performing strategies. Using visualizations and quantitative metrics, we provide valuable insights into the effectiveness and robustness of different active learning strategies, in an attempt to understand why these strategies haven't performed consistently in text summarization. Finally, we show that DUAL strikes a good balance between diversity and robustness.

Title: Instruct-of-Reflection: Enhancing Large Language Models Iterative Reflection Capabilities via Dynamic-Meta Instruction

Authors: Liping Liu, Chunhong Zhang, Likang Wu, Chuang Zhao, Zheng Hu, Ming He, Jianping Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00902
Pdf URL: https://arxiv.org/pdf/2503.00902
Copy Paste: [[2503.00902]] Instruct-of-Reflection: Enhancing Large Language Models Iterative Reflection Capabilities via Dynamic-Meta Instruction(https://arxiv.org/abs/2503.00902)
Keywords: language model, llm
Abstract: Self-reflection for Large Language Models (LLMs) has gained significant attention. Existing approaches involve models iterating and improving their previous responses based on LLMs' internal reflection ability or external feedback. However, recent research has raised doubts about intrinsic self-correction, suggesting that reflection without external feedback may even degrade performance. Based on our empirical evidence, we find that current static reflection methods may lead to redundancy, drift, and stubbornness issues. To mitigate this, we introduce Instruct-of-Reflection (IoRT), a novel and general reflection framework that leverages dynamic-meta instruction to enhance the iterative reflection capability of LLMs. Specifically, we propose an instructor that, driven by meta-thoughts and a self-consistency classifier, generates various instructions, including refresh, stop, and select, to guide the next reflection iteration. Our experiments demonstrate that IoRT achieves an average improvement of 10.1% over established baselines in mathematical and commonsense reasoning tasks, highlighting its efficacy and applicability.

Title: HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning

Authors: Zhuohang Jiang, Pangjing Wu, Ziran Liang, Peter Q. Chen, Xu Yuan, Ye Jia, Jiancheng Tu, Chen Li, Peter H.F. Ng, Qing Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00912
Pdf URL: https://arxiv.org/pdf/2503.00912
Copy Paste: [[2503.00912]] HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning(https://arxiv.org/abs/2503.00912)
Keywords: language model, llm
Abstract: Structure reasoning is a fundamental capability of large language models (LLMs), enabling them to reason about structured commonsense and answer multi-hop questions. However, existing benchmarks for structure reasoning mainly focus on horizontal and coordinate structures (e.g., graphs), overlooking the hierarchical relationships within them. Hierarchical structure reasoning is crucial for human cognition, particularly in memory organization and problem-solving. It also plays a key role in various real-world tasks, such as information extraction and decision-making. To address this gap, we propose HiBench, the first framework spanning from initial structure generation to final proficiency assessment, designed to benchmark the hierarchical reasoning capabilities of LLMs systematically. HiBench encompasses six representative scenarios, covering both fundamental and practical aspects, and consists of 30 tasks with varying hierarchical complexity, totaling 39,519 queries. To evaluate LLMs comprehensively, we develop five capability dimensions that depict different facets of hierarchical structure understanding. Through extensive evaluation of 20 LLMs from 10 model families, we reveal key insights into their capabilities and limitations: 1) existing LLMs show proficiency in basic hierarchical reasoning tasks; 2) they still struggle with more complex structures and implicit hierarchical representations, especially in structural modification and textual reasoning. Based on these findings, we create a small yet well-designed instruction dataset, which enhances LLMs' performance on HiBench by an average of 88.84% (Llama-3.1-8B) and 31.38% (Qwen2.5-7B) across all tasks. The HiBench dataset and toolkit are available here, this https URL, to encourage evaluation.

Title: SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking

Authors: Nam V. Nguyen, Dien X. Tran, Thanh T. Tran, Anh T. Hoang, Tai V. Duong, Di T. Le, Phuc-Lu Le
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00955
Pdf URL: https://arxiv.org/pdf/2503.00955
Copy Paste: [[2503.00955]] SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking(https://arxiv.org/abs/2503.00955)
Keywords: language model, gpt, llm
Abstract: The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97% strict accuracy on ISE-DSC01 and 80.82% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed 7x while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation. The source code is available at: this https URL.

Title: Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs

Authors: Ravi Ghadia, Avinash Kumar, Gaurav Jain, Prashant Nair, Poulami Das
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.00979
Pdf URL: https://arxiv.org/pdf/2503.00979
Copy Paste: [[2503.00979]] Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs(https://arxiv.org/abs/2503.00979)
Keywords: llm, chat
Abstract: Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. This bottleneck is particularly problematic in real-time applications -- such as chatbots and interactive assistants -- where low latency and high memory efficiency are critical. Existing methods drop distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias. We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. MorphKV balances long-range dependencies and local coherence during text generation. It eliminates early-token bias while retaining high-fidelity context by adaptively ranking tokens through correlation-aware selection. Unlike heuristic retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. This approach captures inter-token correlation with greater accuracy, crucial for tasks like content creation and code generation. Our studies on long-response tasks show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works, enabling efficient real-world deployment.
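
Illustration (not from the paper): the idea of a constant-sized cache guided by the attention of recent tokens can be sketched in a few lines of Python. This is only a toy under stated assumptions, not MorphKV's actual algorithm: the function name `select_cache`, the parameters `window` and `capacity`, and the choice to score an older token by summing the attention it receives from the recent window are all hypothetical.

```python
def select_cache(attn_from_recent, num_tokens, window, capacity):
    """Pick which token positions to keep in a bounded KV cache.

    attn_from_recent[i][j] is the attention weight that the i-th token of
    the recent window places on token position j (assumed to be given).
    The result never holds more than `window + capacity` positions.
    """
    split = max(0, num_tokens - window)
    recent = list(range(split, num_tokens))  # always keep the latest tokens
    # Score each older position by the total attention it gets from recent tokens.
    scores = {j: sum(row[j] for row in attn_from_recent) for j in range(split)}
    # Keep only the `capacity` highest-scoring older positions.
    keep_old = sorted(scores, key=scores.get, reverse=True)[:capacity]
    return sorted(keep_old) + recent


# Toy example: 6 tokens so far, a recent window of 2, room for 2 older tokens.
attn = [
    [0.10, 0.40, 0.00, 0.30, 0.20, 0.00],
    [0.05, 0.50, 0.05, 0.10, 0.10, 0.20],
]
print(select_cache(attn, num_tokens=6, window=2, capacity=2))
```

Because the kept set is capped at `window + capacity`, memory stays flat as the response grows; how the real method ranks and refreshes entries is described in the paper itself.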

Title: Evaluating Polish linguistic and cultural competency in large language models

Authors: Sławomir Dadas, Małgorzata Grębowiec, Michał Perełkiewicz, Rafał Poświata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00995
Pdf URL: https://arxiv.org/pdf/2503.00995
Copy Paste: [[2503.00995]] Evaluating Polish linguistic and cultural competency in large language models(https://arxiv.org/abs/2503.00995)
Keywords: language model, llm
Abstract: Large language models (LLMs) are becoming increasingly proficient in processing and generating multilingual texts, which allows them to address real-world problems more effectively. However, language understanding is a far more complex issue that goes beyond simple text analysis. It requires familiarity with cultural context, including references to everyday life, historical events, traditions, folklore, literature, and pop culture. A lack of such knowledge can lead to misinterpretations and subtle, hard-to-detect errors. To examine language models' knowledge of the Polish cultural context, we introduce the Polish linguistic and cultural competency benchmark, consisting of 600 manually crafted questions. The benchmark is divided into six categories: history, geography, culture & tradition, art & entertainment, grammar, and vocabulary. As part of our study, we conduct an extensive evaluation involving over 30 open-weight and commercial LLMs. Our experiments provide a new perspective on Polish competencies in language models, moving past traditional natural language processing tasks and general knowledge assessment.

Title: Language Models Predict Empathy Gaps Between Social In-groups and Out-groups

Authors: Yu Hou, Hal Daumé III, Rachel Rudinger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01030
Pdf URL: https://arxiv.org/pdf/2503.01030
Copy Paste: [[2503.01030]] Language Models Predict Empathy Gaps Between Social In-groups and Out-groups(https://arxiv.org/abs/2503.01030)
Keywords: language model, llm, prompt
Abstract: Studies of human psychology have demonstrated that people are more motivated to extend empathy to in-group members than out-group members (Cikara et al., 2011). In this study, we investigate how this aspect of intergroup relations in humans is replicated by LLMs in an emotion intensity prediction task. In this task, the LLM is given a short description of an experience a person had that caused them to feel a particular emotion; the LLM is then prompted to predict the intensity of the emotion the person experienced on a numerical scale. By manipulating the group identities assigned to the LLM's persona (the "perceiver") and the person in the narrative (the "experiencer"), we measure how predicted emotion intensities differ between in-group and out-group settings. We observe that LLMs assign higher emotion intensity scores to in-group members than out-group members. This pattern holds across all three types of social groupings we tested: race/ethnicity, nationality, and religion. We perform an in-depth analysis on Llama-3.1-8B, the model which exhibited strongest intergroup bias among those tested.

Title: Language-agnostic, automated assessment of listeners' speech recall using large language models

Authors: Björn Herrmann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01045
Pdf URL: https://arxiv.org/pdf/2503.01045
Copy Paste: [[2503.01045]] Language-agnostic, automated assessment of listeners' speech recall using large language models(https://arxiv.org/abs/2503.01045)
Keywords: language model, llm, prompt
Abstract: Speech-comprehension difficulties are common among older people. Standard speech tests do not fully capture such difficulties because the tests poorly resemble the context-rich, story-like nature of ongoing conversation and are typically available only in a country's dominant/official language (e.g., English), leading to inaccurate scores for native speakers of other languages. Assessments for naturalistic, story speech in multiple languages require accurate, time-efficient scoring. The current research leverages modern large language models (LLMs) in native English speakers and native speakers of 10 other languages to automate the generation of high-quality, spoken stories and scoring of speech recall in different languages. Participants listened to and freely recalled short stories (in quiet/clear and in babble noise) in their native language. LLM text-embeddings and LLM prompt engineering with semantic similarity analyses to score speech recall revealed sensitivity to known effects of temporal order, primacy/recency, and background noise, and high similarity of recall scores across languages. The work overcomes limitations associated with simple speech materials and testing of closed native-speaker groups because recall data of varying length and details can be mapped across languages with high accuracy. The full automation of speech generation and recall scoring provides an important step towards comprehension assessments of naturalistic speech with clinical applicability.

Title: AI-Invented Tonal Languages: Preventing a Machine Lingua Franca Beyond Human Understanding

Authors: David Noever
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01063
Pdf URL: https://arxiv.org/pdf/2503.01063
Copy Paste: [[2503.01063]] AI-Invented Tonal Languages: Preventing a Machine Lingua Franca Beyond Human Understanding(https://arxiv.org/abs/2503.01063)
Keywords: language model, llm
Abstract: This paper investigates the potential for large language models (LLMs) to develop private tonal languages for machine-to-machine (M2M) communication. Inspired by cryptophasia in human twins (affecting up to 50% of twin births) and natural tonal languages like Mandarin and Vietnamese, we implement a precise character-to-frequency mapping system that encodes the full ASCII character set (32-126) using musical semitones. Each character is assigned a unique frequency, creating a logarithmic progression beginning with space (220 Hz) and ending with tilde (50,175.42 Hz). This spans approximately 7.9 octaves, with higher characters deliberately mapped to ultrasonic frequencies beyond human perception (>20 kHz). Our implemented software prototype demonstrates this encoding through visualization, auditory playback, and ABC musical notation, allowing for analysis of information density and transmission speed. Testing reveals that tonal encoding can achieve information rates exceeding human speech while operating partially outside human perceptual boundaries. This work responds directly to concerns about AI systems catastrophically developing private languages within the next five years, providing a concrete prototype software example of how such communication might function and the technical foundation required for its emergence, detection, and governance.
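
Illustration (not from the paper): the two endpoints quoted in the abstract, space at 220 Hz and tilde at 50,175.42 Hz, are consistent with one equal-tempered semitone (a factor of 2^(1/12)) per ASCII code point. The Python sketch below assumes exactly that mapping; the names `char_to_frequency` and `encode` are hypothetical and not taken from the author's prototype.

```python
BASE_FREQ_HZ = 220.0   # frequency assigned to space (ASCII 32), per the abstract
FIRST_CODE = 32        # first printable ASCII character
LAST_CODE = 126        # last printable ASCII character (tilde)
HUMAN_LIMIT_HZ = 20_000.0


def char_to_frequency(ch: str) -> float:
    """Map one printable ASCII character to a frequency in Hz.

    Assumption: each code point is one semitone above the previous one.
    """
    code = ord(ch)
    if not FIRST_CODE <= code <= LAST_CODE:
        raise ValueError(f"not a printable ASCII character: {ch!r}")
    return BASE_FREQ_HZ * 2 ** ((code - FIRST_CODE) / 12)


def encode(text: str) -> list[float]:
    """Encode a string as a sequence of tone frequencies."""
    return [char_to_frequency(ch) for ch in text]


print(round(char_to_frequency(" "), 2))   # lowest tone
print(round(char_to_frequency("~"), 2))   # highest tone
# Characters whose tone lies above the nominal 20 kHz limit of human hearing.
ultrasonic = [chr(c) for c in range(FIRST_CODE, LAST_CODE + 1)
              if char_to_frequency(chr(c)) > HUMAN_LIMIT_HZ]
print(ultrasonic[0], len(ultrasonic))
```

Under this assumption the 94 semitone steps from space to tilde span 94/12 octaves, and every character from lowercase "o" upward lands above 20 kHz.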

Title: Scientific Reasoning: Assessment of Multimodal Generative LLMs

Authors: Florian Dreyer, Ekaterina Kolos, Daria Matiash
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.01064
Pdf URL: https://arxiv.org/pdf/2503.01064
Copy Paste: [[2503.01064]] Scientific Reasoning: Assessment of Multimodal Generative LLMs(https://arxiv.org/abs/2503.01064)
Keywords: language model, llm
Abstract: Large language models (LLMs) can answer questions and reason about complex tasks, including those from the scientific domain. We assess several multimodal LLMs (MLLMs) on ScienceQA and find that Gemini models show the highest accuracy with little context, and the highest textual similarity to human explanations with richer context. Adapter-tuning of smaller MLLMs did not lead to any reliable performance. Training from Gemini outputs consistently underperformed training from the original data.

Title: Precise Localization of Memories: A Fine-grained Neuron-level Knowledge Editing Technique for LLMs

Authors: Haowen Pan, Xiaozhi Wang, Yixin Cao, Zenglin Shi, Xun Yang, Juanzi Li, Meng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01090
Pdf URL: https://arxiv.org/pdf/2503.01090
Copy Paste: [[2503.01090]] Precise Localization of Memories: A Fine-grained Neuron-level Knowledge Editing Technique for LLMs(https://arxiv.org/abs/2503.01090)
Keywords: language model, llm
Abstract: Knowledge editing aims to update outdated information in Large Language Models (LLMs). A representative line of study is locate-then-edit methods, which typically employ causal tracing to identify the modules responsible for recalling factual knowledge about entities. However, we find these methods are often sensitive only to changes in the subject entity, leaving them less effective at adapting to changes in relations. This limitation results in poor editing locality, which can lead to the persistence of irrelevant or inaccurate facts, ultimately compromising the reliability of LLMs. We believe this issue arises from the insufficient precision of knowledge localization. To address this, we propose a Fine-grained Neuron-level Knowledge Editing (FiNE) method that enhances editing locality without affecting overall success rates. By precisely identifying and modifying specific neurons within feed-forward networks, FiNE significantly improves knowledge localization and editing. Quantitative experiments demonstrate that FiNE efficiently achieves better overall performance compared to existing techniques, providing new insights into the localization and modification of knowledge within LLMs.

Title: Beyond QA Pairs: Assessing Parameter-Efficient Fine-Tuning for Fact Embedding in LLMs

Authors: Shivam Ratnakar, Abhiroop Talasila, Raghav Chamadiya, Nikhil Agarwal, Vinayak K Doifode
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01131
Pdf URL: https://arxiv.org/pdf/2503.01131
Copy Paste: [[2503.01131]] Beyond QA Pairs: Assessing Parameter-Efficient Fine-Tuning for Fact Embedding in LLMs(https://arxiv.org/abs/2503.01131)
Keywords: language model, gpt, llm
Abstract: This paper presents an extensive examination of Parameter-Efficient Fine-Tuning (PEFT) for embedding domain-specific facts into Large Language Models (LLMs), focusing on improving the fine-tuning process by categorizing question-answer (QA) pairs into Factual and Conceptual classes using a BERT-based classifier. Two distinct Llama-2 models are fine-tuned based on these classifications and evaluated using larger models like GPT-3.5 Turbo and Gemini. Our results indicate that models trained on conceptual datasets outperform those trained on factual datasets. Additionally, we compare the efficiency of two synthetic fine-tuning dataset generation techniques, D-RAG and D-Naive, with D-Naive demonstrating superior performance. Although PEFT has shown effectiveness, our research indicates that it may not be the optimal method for embedding facts into LLMs. However, it has demonstrated exceptional performance in instruction-based tasks. Our findings are reinforced by a 1000-sample dataset in the data center domain, where the fine-tuned Llama-2 7B model significantly outperforms the baseline model in generating product recommendations. Our study highlights the importance of QA pair categorization and synthetic dataset generation techniques in enhancing the performance of LLMs in specific domains.

Title: How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach

Authors: Ayeong Lee, Ethan Che, Tianyi Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01141
Pdf URL: https://arxiv.org/pdf/2503.01141
Copy Paste: [[2503.01141]] How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach(https://arxiv.org/abs/2503.01141)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-thought prompting has emerged as a powerful technique for enabling large language models (LLMs) to solve complex reasoning tasks. However, these reasoning chains can be verbose, raising concerns about efficiency. In response, recent works have sought to decrease response lengths through simple prompting strategies (e.g. 'be concise'). In this work, we conduct the first systematic study of the relationship between reasoning length and model performance across a diverse range of compression instructions (e.g. 'use 10 words or less' or 'remove all punctuation'). In doing so, we discover a universal tradeoff between reasoning length and accuracy that persists across even very distinct reasoning chains. We demonstrate that this tradeoff emerges from a sharp threshold behavior at the question level: each task has an intrinsic 'token complexity' - a minimal number of tokens required for successful problem-solving. We show how token complexity enables us to compute information-theoretic limits on the accuracy-compression tradeoff, and find that prompt-based compression strategies operate far from these theoretical limits. This suggests there may be significant room for improvement and our framework provides a benchmark to help researchers evaluate progress in reasoning efficiency. Our work also highlights the importance of adaptive compression -- giving shorter responses for easier questions -- and we show that token complexity is a useful tool for measuring this capability.
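
Illustration (not from the paper): the threshold behavior described above can be made concrete with a toy model in which a question is solved exactly when the response has at least as many tokens as that question's token complexity. The Python sketch below compares a fixed per-question budget with an idealized adaptive allocation that spends the same total budget on the cheapest questions first. The function names and the example thresholds are hypothetical, and the greedy bound is a simplification rather than the paper's information-theoretic limit.

```python
def accuracy_uniform(thresholds, budget):
    """Accuracy when every question gets the same token budget."""
    return sum(t <= budget for t in thresholds) / len(thresholds)


def accuracy_adaptive(thresholds, avg_budget):
    """Best accuracy when only the average budget is fixed.

    Solving the cheapest questions first maximizes the number of questions
    whose threshold can be met with the shared token pool.
    """
    remaining = avg_budget * len(thresholds)
    solved = 0
    for t in sorted(thresholds):
        if t > remaining:
            break
        remaining -= t
        solved += 1
    return solved / len(thresholds)


# Hypothetical per-question token complexities.
thresholds = [5, 10, 20, 40, 120]
print(accuracy_uniform(thresholds, budget=20))       # same length for all
print(accuracy_adaptive(thresholds, avg_budget=20))  # shorter answers for easy questions
```

With these made-up thresholds the adaptive allocation solves one more question than the uniform one at the same average length, which is the intuition behind giving shorter responses to easier questions.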

Title: MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages

Authors: Chen Zhang, Mingxu Tao, Zhiyuan Liao, Yansong Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01150
Pdf URL: https://arxiv.org/pdf/2503.01150
Copy Paste: [[2503.01150]] MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages(https://arxiv.org/abs/2503.01150)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel in high-resource languages but struggle with low-resource languages (LRLs), particularly those spoken by minority communities in China, such as Tibetan, Uyghur, Kazakh, and Mongolian. To systematically track the progress in these languages, we introduce MiLiC-Eval, a benchmark designed for minority languages in China, featuring 24K instances across 9 tasks. MiLiC-Eval focuses on underrepresented writing systems and provides a fine-grained assessment of linguistic and problem-solving skills. Our evaluation reveals that LLMs perform poorly on syntax-intensive tasks and multi-script languages. We further demonstrate how MiLiC-Eval can help advance LRL research in handling diverse writing systems and understanding the process of language adaptation.

Title: ReaderLM-v2: Small Language Model for HTML to Markdown and JSON

Authors: Feng Wang, Zesheng Shi, Bo Wang, Nan Wang, Han Xiao
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.01151
Pdf URL: https://arxiv.org/pdf/2503.01151
Copy Paste: [[2503.01151]] ReaderLM-v2: Small Language Model for HTML to Markdown and JSON(https://arxiv.org/abs/2503.01151)
Keywords: language model, gpt
Abstract: We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.

Title: Nature-Inspired Population-Based Evolution of Large Language Models

Authors: Yiqun Zhang, Peng Ye, Xiaocui Yang, Shi Feng, Shufei Zhang, Lei Bai, Wanli Ouyang, Shuyue Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01155
Pdf URL: https://arxiv.org/pdf/2503.01155
Copy Paste: [[2503.01155]] Nature-Inspired Population-Based Evolution of Large Language Models(https://arxiv.org/abs/2503.01155)
Keywords: language model, llm
Abstract: Evolution, the engine behind the survival and growth of life on Earth, operates through the population-based process of reproduction. Inspired by this principle, this paper formally defines a newly emerging problem -- the population-based evolution of large language models (LLMs) -- and introduces a novel framework. Starting with a population of parent LLMs, our framework enables the population to evolve through four key operations: (i) crossover, merging the weights of different parents to create offspring LLMs, (ii) mutation, introducing small, random changes to model weights to foster diversity, (iii) selection, prioritizing high-performing models, and (iv) succession, transferring the learned experience from parent to offspring LLMs. With only 200 samples per new task, the LLM population evolves rapidly to adapt to the task at hand, without any gradients. Experiments on 12 datasets show that our framework consistently outperforms existing multi-LLM merging and adaptation methods, achieving accuracy gains of up to 54.8% over the best LLM in the initial population. Moreover, our framework allows for the evolution of LLMs across multiple new tasks simultaneously, scaling effectively with populations of up to 40 LLMs, and even zero-shot generalization to unseen held-out tasks. We have open-sourced the code on GitHub and released the weights of 10 parent LLMs, fine-tuned from gemma-2-2b-it, on HuggingFace, enabling reproduction of our proposed framework using just a single 4090 GPU with 24GB memory, without any performance degradation.
摘要：进化，是地球生命生存和生命的生存背后的发动机，是通过基于人群的繁殖过程运作的。受这一原则的启发，本文正式定义了一个新出现的问题，即大型语言模型（LLMS）的基于人群的演变 - 并引入了一个新颖的框架。从父母LLM的人口开始，我们的框架使总体能够通过四个关键操作发展：（i）交叉合并不同父母的权重以创建后代LLM，（ii）突变，引入较小的，随机的随机变化，以促进多样性，（iii）选择，优先选择高级模型，并（IV）转移良好的经验。 LLM人口每项新任务只有200个样本，因此迅速发展以适应手头的任务，而没有任何梯度。 12个数据集的实验表明，我们的框架始终优于现有的多LLM合并和适应方法，在初始人群中，与最佳LLM相比，准确性获得了54.8％。此外，我们的框架允许同时在多个新任务中进行LLM的演变，从而有效地扩展了40个LLM的种群，甚至还可以零弹性地概括，从而看不见持有的任务。我们已经在GitHub上开源了代码，并在HuggingFace $上释放了10个由Gemma-2-2b-it进行微调的Parent LLMS的权重，从而使我们提出的框架仅使用24GB内存的单个GPU复制我们所提出的框架，而没有任何性能降解。
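
The crossover, mutation, and selection operations listed in the abstract above can be illustrated with a toy, gradient-free loop. This is only a minimal sketch under stated assumptions: model weights are stood in for by short lists of floats, crossover is assumed to be linear interpolation, mutation is assumed to be Gaussian noise, and the fitness function is a made-up distance to a target vector. The succession step is left out because the abstract does not describe its mechanics. None of the function names come from the paper's code.

```python
import random


def crossover(parent_a, parent_b, alpha=0.5):
    """Merge two weight vectors by linear interpolation (an assumed merge rule)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(parent_a, parent_b)]


def mutate(weights, sigma, rng):
    """Add small Gaussian noise to every weight (an assumed mutation rule)."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


def select(population, fitness, k):
    """Keep the k highest-scoring individuals."""
    return sorted(population, key=fitness, reverse=True)[:k]


def evolve(population, fitness, generations=20, sigma=0.05, seed=0):
    """Run a gradient-free loop of crossover, mutation, and selection."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        offspring = []
        for _ in range(size):
            a, b = rng.sample(population, 2)
            child = crossover(a, b, rng.random())
            offspring.append(mutate(child, sigma, rng))
        # Parents compete with offspring, so the best fitness never decreases.
        population = select(population + offspring, fitness, size)
    return population


# Hypothetical stand-in for a task score: closeness to a fixed target vector.
TARGET = [1.0, -2.0, 0.5]


def toy_fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))
```

Because selection is applied to parents and offspring together, the best individual in the population can only stay the same or improve from one generation to the next, which makes the toy loop easy to sanity-check.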

Title: Large Language Models for Healthcare Text Classification: A Systematic Review

Authors: Hajar Sakai, Sarah S. Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01159
Pdf URL: https://arxiv.org/pdf/2503.01159
Copy Paste: [[2503.01159]] Large Language Models for Healthcare Text Classification: A Systematic Review(https://arxiv.org/abs/2503.01159)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have fundamentally transformed approaches to Natural Language Processing (NLP) tasks across diverse domains. In healthcare, accurate and cost-efficient text classification is crucial, whether for clinical notes analysis, diagnosis coding, or any other task, and LLMs present promising potential. Text classification has always faced multiple challenges, including manual annotation for training, handling imbalanced data, and developing scalable approaches. With healthcare, additional challenges are added, particularly the critical need to preserve patients' data privacy and the complexity of the medical terminology. Numerous studies have been conducted to leverage LLMs for automated healthcare text classification and contrast the results with existing machine learning-based methods where embedding, annotation, and training are traditionally required. Existing systematic reviews about LLMs either do not specialize in text classification or do not focus on the healthcare domain. This research synthesizes and critically evaluates the current evidence found in the literature regarding the use of LLMs for text classification in a healthcare setting. Major databases (e.g., Google Scholar, Scopus, PubMed, Science Direct) and other resources were queried for papers published between 2018 and 2024, following the PRISMA guidelines, which yielded 65 eligible research articles. These were categorized by text classification type (e.g., binary classification, multi-label classification), application (e.g., clinical decision support, public health and opinion analysis), methodology, type of healthcare text, and metrics used for evaluation and validation. This review reveals the existing gaps in the literature and suggests future research lines that can be investigated and explored.
摘要：大型语言模型（LLMS）从根本上转变了自然语言处理（NLP）跨不同领域的任务。在医疗保健中，准确且具有成本效益的文本分类至关重要，无论是用于临床注释，诊断编码还是任何其他任务，LLM都具有有希望的潜力。文本分类始终面临多个挑战，包括用于培训，处理不平衡数据和开发可扩展方法的手动注释。通过医疗保健，增加了其他挑战，尤其是保留患者数据隐私和医学术语复杂性的迫切需要。已经进行了许多研究，以利用LLM进行自动医疗文本分类，并将结果与现有的基于机器学习的方法进行对比，传统上需要嵌入，注释和培训。有关LLM的现有系统评论要么不专门研究文本分类，要么不专注于医疗保健领域。这项研究综合并批判性地评估了文献中有关LLM在医疗保健环境中使用文本分类的当前证据。询问了主要数据库（例如Google Scholar，Scopus，PubMed，Science Direct）和其他资源，该数据库的重点是PRISMA指南框架内2018年至2024年发表的论文，这导致了65份合格的研究文章。这些是根据文本分类类型（例如二进制分类，多标签分类），应用程序（例如临床决策支持，公共卫生和意见分析），方法论，医疗保健文本类型以及用于评估和验证的指标进行分类。这篇综述揭示了文献中现有的差距，并提出了可以研究和探索的未来研究行。

Title: Cancer Type, Stage and Prognosis Assessment from Pathology Reports using LLMs

Authors: Rachit Saluja, Jacob Rosenthal, Yoav Artzi, David J. Pisapia, Benjamin L. Liechty, Mert R. Sabuncu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.01194
Pdf URL: https://arxiv.org/pdf/2503.01194
Copy Paste: [[2503.01194]] Cancer Type, Stage and Prognosis Assessment from Pathology Reports using LLMs(https://arxiv.org/abs/2503.01194)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have shown significant promise across various natural language processing tasks. However, their application in the field of pathology, particularly for extracting meaningful insights from unstructured medical texts such as pathology reports, remains underexplored and not well quantified. In this project, we leverage state-of-the-art language models, including the GPT family, Mistral models, and the open-source Llama models, to evaluate their performance in comprehensively analyzing pathology reports. Specifically, we assess their performance in cancer type identification, AJCC stage determination, and prognosis assessment, encompassing both information extraction and higher-order reasoning tasks. Based on a detailed analysis of their performance metrics in a zero-shot setting, we developed two instruction-tuned models: Path-llama3.1-8B and Path-GPT-4o-mini-FT. These models demonstrated superior performance in zero-shot cancer type identification, staging, and prognosis assessment compared to the other models evaluated.
摘要：大型语言模型（LLM）在各种自然语言处理任务中表现出了巨大的希望。但是，它们在病理领域的应用，尤其是用于从非结构化医学文本（例如病理学报告）中提取有意义的见解，但仍未得到充实且未得到很好的量化。在这个项目中，我们利用最先进的语言模型，包括GPT家族，Mistral模型和开源美洲驼模型，以全面分析病理学报告来评估其绩效。具体而言，我们评估了它们在癌症类型识别，AJCC阶段的确定和预后评估中的表现，包括信息提取和高阶推理任务。基于对其性能指标的详细分析，我们开发了两个指令调整的模型：Path-llama3.1-8B和Path-GPT-GPT-GPT-4O-MINI-FT。与评估的其他模型相比，这些模型在零拍癌类型的识别，分期和预后评估中表现出卓越的性能。

Title: PEO: Improving Bi-Factorial Preference Alignment with Post-Training Policy Extrapolation

Authors: Yuxuan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01233
Pdf URL: https://arxiv.org/pdf/2503.01233
Copy Paste: [[2503.01233]] PEO: Improving Bi-Factorial Preference Alignment with Post-Training Policy Extrapolation(https://arxiv.org/abs/2503.01233)
Keywords: language model, llm
Abstract: The alignment of large language models with human values presents a critical challenge, particularly when balancing conflicting objectives like helpfulness and harmlessness. Existing approaches, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), face notable limitations: RLHF suffers from instability and inefficiency in multi-objective optimization, while DPO lacks mechanisms for dynamic trade-offs. To address these challenges, we propose Post-Training Extrapolation Optimization (PEO), a novel and efficient framework for bi-factorial alignment. PEO generates a family of Pareto-optimal policies in a single training pass by leveraging a three-phase pipeline: (1) aspect-specific learning, (2) generalist initialization via interpolation, and (3) post-training optimization via extrapolation. PEO enables dynamic adaptation to diverse user preferences at inference time without retraining. Our comprehensive experiments across multiple LLMs demonstrate that PEO achieves superior Pareto fronts compared to baselines, offering improved flexibility and computational efficiency. Theoretical analyses further highlight PEO's capacity to overcome optimization bottlenecks, paving the way for scalable, personalized alignment.
摘要：大型语言模型与人类价值观的一致性提出了一个关键的挑战，尤其是在平衡诸如帮助和无害的矛盾目标时。现有的方法，例如从人类反馈（RLHF）学习和直接偏好优化（DPO）的强化学习，面临明显的局限性：RLHF在多目标优化方面的不稳定性和效率低下，而DPO缺乏动态交易的机制。为了应对这些挑战，我们提出了训练后外推优化（PEO），这是一个新颖而有效的双因素对齐框架。 PEO通过利用三相管道来产生单个培训通过的帕累托最佳策略：（1）特定于方面的学习，（2）通过插值来进行通才初始化，（3）（3）通过外推优化后训练优化。 PEO可以在推理时间内动态适应不同的用户偏好而无需再培训。我们跨多个LLM的全面实验表明，与基线相比，PEO在帕累托方面取得了优越的阵线，从而提高了柔韧性和计算效率。理论分析进一步强调了PEO克服优化瓶颈的能力，为可扩展的个性化对齐铺平了道路。

Title: ChatGPT for President! Presupposed content in politicians versus GPT-generated texts

Authors: Davide Garassino, Nicola Brocca, Viviana Masia
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.01269
Pdf URL: https://arxiv.org/pdf/2503.01269
Copy Paste: [[2503.01269]] ChatGPT for President! Presupposed content in politicians versus GPT-generated texts(https://arxiv.org/abs/2503.01269)
Keywords: language model, gpt, chat
Abstract: This study examines ChatGPT-4's capability to replicate linguistic strategies used in political discourse, focusing on its potential for manipulative language generation. As large language models become increasingly popular for text generation, concerns have grown regarding their role in spreading fake news and propaganda. This research compares real political speeches with those generated by ChatGPT, emphasizing presuppositions (a rhetorical device that subtly influences audiences by packaging some content as already known at the moment of utterance, thus swaying opinions without explicit argumentation). Using a corpus-based pragmatic analysis, this study assesses how well ChatGPT can mimic these persuasive strategies. The findings reveal that although ChatGPT-generated texts contain many manipulative presuppositions, key differences emerge in their frequency, form, and function compared with those of politicians. For instance, ChatGPT often relies on change-of-state verbs used in fixed phrases, whereas politicians use presupposition triggers in more varied and creative ways. Such differences, however, are challenging to detect with the naked eye, underscoring the potential risks posed by large language models in political and public discourse.
摘要：这项研究研究了Chatgpt-4复制政治话语中使用的语言策略的能力，重点关注其操纵语言的潜力。随着大型语言模型越来越流行于文本生成，人们对它们在传播假新闻和宣传中的作用的关注变得越来越兴奋。这项研究将真实的政治演讲与Chatgpt产生的演讲进行了比较，强调了前提（一种修辞手段，通过包装一些在话语时已知的内容来巧妙地影响听众，从而在没有明确论证的情况下摇摆了观点）。使用基于语料库的务实分析，本研究评估了Chatgpt如何模仿这些有说服力的策略。研究结果表明，尽管ChatGpt生成的文本包含许多操纵性的预设，但与政客相比，其频率，形式和功能的关键差异出现。例如，Chatgpt通常依赖于固定短语中使用的状态动词，而政客则以更加多样化和创造性的方式使用预设触发器。然而，这种差异是要用肉眼检测到的挑战，强调了政治和公众中大型语言模型带来的潜在风险，这是基于语料库的务实分析，这项研究评估了Chatgpt如何模仿这些有说服力的策略。研究结果表明，尽管ChatGpt生成的文本包含许多操纵性的预设，但与政客相比，其频率，形式和功能的关键差异出现。例如，Chatgpt通常依赖于固定短语中使用的状态动词，而政客则以更加多样化和创造性的方式使用预设触发器。然而，这种差异挑战是用肉眼检测到，强调了政治和公共话语中大型语言模型带来的潜在风险。

Title: Enhancing Non-English Capabilities of English-Centric Large Language Models through Deep Supervision Fine-Tuning

Authors: Wenshuai Huo, Xiaocheng Feng, Yichong Huang, Chengpeng Fu, Baohang Li, Yangfan Ye, Zhirui Zhang, Dandan Tu, Duyu Tang, Yunfei Lu, Hui Wang, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01275
Pdf URL: https://arxiv.org/pdf/2503.01275
Copy Paste: [[2503.01275]] Enhancing Non-English Capabilities of English-Centric Large Language Models through Deep Supervision Fine-Tuning(https://arxiv.org/abs/2503.01275)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated significant progress in multilingual language understanding and generation. However, due to the imbalance in training data, their capabilities in non-English languages are limited. Recent studies revealed the English-pivot multilingual mechanism of LLMs, where LLMs implicitly convert non-English queries into English ones at the bottom layers and adopt English for thinking at the middle layers. However, due to the absence of explicit supervision for cross-lingual alignment in the intermediate layers of LLMs, the internal representations during these stages may become inaccurate. In this work, we introduce a deep supervision fine-tuning method (DFT) that incorporates additional supervision in the internal layers of the model to guide its workflow. Specifically, we introduce two training objectives on different layers of LLMs: one at the bottom layers to constrain the conversion of the target language into English, and another at the middle layers to constrain reasoning in English. To effectively achieve the guiding purpose, we designed two types of supervision signals: logits and feature, which represent a stricter constraint and a relatively more relaxed guidance. Our method guides the model to not only consider the final generated result when processing non-English inputs but also ensure the accuracy of internal representations. We conducted extensive experiments on typical English-centric large models, LLaMA-2 and Gemma-2, and the results on multiple multilingual datasets show that our method significantly outperforms traditional fine-tuning methods.
摘要：大型语言模型（LLM）在多语言语言理解和产生方面表现出了重大进展。但是，由于培训数据的不平衡，它们在非英语语言中的功能受到限制。最近的研究揭示了LLMS的英语私密多语言机制，其中LLMS隐式将非英语查询转换为底层的英语查询，并在中层采用英语来思考。但是，由于缺乏在LLM的中间层中进行跨语性对准的明确监督，因此这些阶段期间的内部表示可能变得不准确。在这项工作中，我们引入了一种深入的监督微调方法（DFT），该方法在模型的内部层中纳入了其他监督，以指导其工作流程。具体来说，我们在LLM的不同层上介绍了两个培训目标：一个在底层，以限制目标语言向英语的转换，另一个在中间层中，以限制英语的推理。为了有效地实现指导目的，我们设计了两种类型的监督信号：逻辑和功能，代表了更严格的约束和相对放松的指导。我们的方法指导模型不仅考虑处理非英语输入时的最终生成结果，还可以确保内部表示的准确性。我们对典型的以英语为中心的大型Llama-2和Gemma-2进行了广泛的实验，多种多语言数据集的结果表明，我们的方法显着胜过传统的微调方法。

Title: PROPER: A Progressive Learning Framework for Personalized Large Language Models with Group-Level Adaptation

Authors: Linhai Zhang, Jialong Wu, Deyu Zhou, Yulan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01303
Pdf URL: https://arxiv.org/pdf/2503.01303
Copy Paste: [[2503.01303]] PROPER: A Progressive Learning Framework for Personalized Large Language Models with Group-Level Adaptation(https://arxiv.org/abs/2503.01303)
Keywords: language model, llm
Abstract: Personalized large language models (LLMs) aim to tailor their outputs to user preferences. Recent advances in parameter-efficient fine-tuning (PEFT) methods have highlighted the effectiveness of adapting population-level LLMs to personalized LLMs by fine-tuning user-specific parameters with user history. However, user data is typically sparse, making it challenging to adapt LLMs to specific user patterns. To address this challenge, we propose PROgressive PERsonalization (PROPER), a novel progressive learning framework inspired by meso-level theory in social science. PROPER bridges population-level and user-level models by grouping users based on preferences and adapting LLMs in stages. It combines a Mixture-of-Experts (MoE) structure with Low-Rank Adaptation (LoRA), using a user-aware router to assign users to appropriate groups automatically. Additionally, a LoRA-aware router is proposed to facilitate the integration of individual user LoRAs with group-level LoRAs. Experimental results show that PROPER significantly outperforms SOTA models across multiple tasks, demonstrating the effectiveness of our approach.
摘要：个性化的大语言模型（LLMS）旨在根据用户偏好调整其输出。参数有效的微调（PEFT）方法的最新进展突出了通过用用户历史记录的微调用户特定参数将种群级别LLMS适应个性化LLM的有效性。但是，用户数据通常很少，因此使LLM适应特定用户模式的挑战。为了应对这一挑战，我们提出了渐进个性化（适当），这是一个受社会科学中介理论启发的新型进步学习框架。通过基于偏好和阶段调整LLM的用户对用户进行分组，将桥接人群级别和用户级模型桥接。它使用用户感知的路由器将用户意识的路由器自动分配给适当的组，将混合物（MOE）结构与低排名适应性（LORA）结合在一起。此外，还提出了洛拉（Lora）感知的路由器，以促进单个用户洛拉斯（Loras）与小组级洛拉斯（Loras）的整合。实验结果表明，适当的表现可以显着超过多个任务的SOTA模型，这表明了我们方法的有效性。

Title: Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Authors: Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, Noah D. Goodman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01307
Pdf URL: https://arxiv.org/pdf/2503.01307
Copy Paste: [[2503.01307]] Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs(https://arxiv.org/abs/2503.01307)
Keywords: language model
Abstract: Test-time inference has emerged as a powerful paradigm for enabling language models to "think" longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors -- verification, backtracking, subgoal setting, and backward chaining -- that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor -- models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.
摘要：测试时间推论已成为一种有力的范式，可以使语言模型更长地``思考''更长时间，更仔细地对复杂的挑战，就像熟练的人类专家一样。尽管增强学习（RL）可以在可验证的任务上推动语言模型中的自我完善，但一些模型表现出可观的增长，而另一些模型很快就会迅速发展。例如，我们发现QWEN-2.5-3B远远超过了倒计时游戏中相同的RL训练的Llama-3.2-3b。这种差异提出了一个关键的问题：哪些内在特性可以有效自我完善？我们介绍了一个框架来调查这个问题，通过分析四种关键的认知行为 - 验证，回溯，子目标设置和后退链接 - 专家人类问题解决者和成功的语言模型都采用了专家。我们的研究表明，QWEN自然表现出这些推理行为，而Llama最初缺乏它们。在对受控行为数据集进行系统的实验中，我们发现，使用包含这些推理行为的示例启动Llama可以在RL期间进行实质性改进，匹配或超过QWEN的性能。重要的是，推理行为的存在，而不是答案的正确性，被证明是关键因素 - 具有不正确的解决方案的模型，包含适当的推理模式的解决方案实现了与在正确解决方案中训练的人相当的性能。最后，利用OpenWebMath数据继续进行预处理，过滤以扩大推理行为，使Llama模型能够匹配Qwen的自我改进轨迹。我们的发现建立了初始推理行为与改进能力之间的基本关系，解释了为什么某些语言模型有效地利用了其他计算，而另一些语言模型则是在平稳的。

Title: Explainable Depression Detection in Clinical Interviews with Personalized Retrieval-Augmented Generation

Authors: Linhai Zhang, Ziyang Gao, Deyu Zhou, Yulan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01315
Pdf URL: https://arxiv.org/pdf/2503.01315
Copy Paste: [[2503.01315]] Explainable Depression Detection in Clinical Interviews with Personalized Retrieval-Augmented Generation(https://arxiv.org/abs/2503.01315)
Keywords: llm, hallucination, retrieval-augmented generation
Abstract: Depression is a widespread mental health disorder, and clinical interviews are the gold standard for assessment. However, their reliance on scarce professionals highlights the need for automated detection. Current systems mainly employ black-box neural networks, which lack the interpretability that is crucial in mental health contexts. Some attempts to improve interpretability use post-hoc LLM generation but suffer from hallucination. To address these limitations, we propose RED, a Retrieval-augmented generation framework for Explainable depression Detection. RED retrieves evidence from clinical interview transcripts, providing explanations for predictions. Traditional query-based retrieval systems use a one-size-fits-all approach, which may not be optimal for depression detection, as user backgrounds and situations vary. We introduce a personalized query generation module that combines standard queries with user-specific background inferred by LLMs, tailoring retrieval to individual contexts. Additionally, to enhance LLM performance in social intelligence, we augment LLMs by retrieving relevant knowledge from a social intelligence datastore using an event-centric retriever. Experimental results on the real-world benchmark demonstrate RED's effectiveness compared to neural networks and LLM-based baselines.
摘要：抑郁症是一种普遍的心理健康障碍，临床访谈是评估的黄金标准。但是，他们对稀缺专业人员的依赖强调了自动检测的需求。当前的系统主要采用黑盒神经网络，这些神经网络缺乏可解释性，这在心理健康环境中至关重要。一些尝试改善可解释性的尝试使用了事后LLM，但遭受了幻觉的困扰。为了解决这些局限性，我们提出了RED，这是一个可解释的抑郁症检测的检索生成框架。红色从临床访谈笔录中检索证据，提供了预测的解释。传统的基于查询的检索系统使用一种尺寸适合的方法，这对于抑郁症检测可能不是最佳的，因为用户背景和情况有所不同。我们介绍了一个个性化的查询生成模块，该模块将标准查询与LLMS推论的用户特定背景结合在一起，并根据各个上下文定制检索。此外，为了提高社会智能中LLM的表现，我们通过使用以事件为中心的发现从社会智能数据存储中检索相关知识来增强LLMS。现实世界基准的实验结果表明，与神经网络和基于LLM的基准相比，RED的有效性。

Title: WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models

Authors: Jian Yuan, Ziwei He, Haoli Bai, Jingwen Leng, Bo Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01330
Pdf URL: https://arxiv.org/pdf/2503.01330
Copy Paste: [[2503.01330]] WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models(https://arxiv.org/abs/2503.01330)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) use key-value (KV) cache to reduce redundant computation in autoregressive generation. However, the KV cache size increases linearly during generation, leading to excessive memory usage, especially for long texts. Most KV cache compression methods evict the unimportant KV pairs to maintain a fixed cache size, which leads to the permanent loss of tokens during generation. However, singular value decomposition shows that values do not exhibit a strong low-rank property as keys do, suggesting that information is distributed more evenly across values, in contrast to its more redundant distribution within keys. Therefore, methods that evict both keys and values risk losing crucial information and compromising context integrity, ultimately degrading the output quality. To address this problem, we propose WeightedKV, a novel, training-free approach that discards the keys of less important tokens, while merging their values into neighboring tokens via a convex combination weighted by their average attention scores. In this way, the retained keys serve as anchors that guide the generation process, while the merged values provide a rich contextual backdrop. We assess our method on four widely used language modeling datasets, demonstrating superior performance compared to all baseline methods, particularly with a lower budget ratio.
摘要：大型语言模型（LLMS）使用键值（KV）缓存来减少自回归生成中的冗余计算。但是，KV高速缓存大小在发电期间线性增加，导致过度记忆使用量，尤其是对于长文本。大多数KV缓存压缩方法驱逐了不重要的KV对以保持固定的缓存尺寸，从而导致代币的永久损失。但是，奇异值分解表明，\ textit {values}并不像\ textit {keys}那样表现出强的低级属性，这表明在\ textit {values}之间更均匀地分布信息，与其在\ textit {keys}中更冗余的分布相反。因此，驱逐\ textit {键}和\ textit {values}风险失去关键信息并损害上下文完整性的方法，最终降低了输出质量。为了解决这个问题，我们提出了一种新颖的，无训练的方法，它丢弃了不太重要的代币的\ textit {键}，同时将其\ textit {values}合并到相邻的标记中，通过其平均注意力评分加权的凸组合。这样，保留的\ textit {键}是指导生成过程的锚点，而合并的\ textit {values}提供了丰富的上下文背景。我们在四个广泛使用的语言建模数据集上评估了我们的方法，这表明与所有基线方法相比，尤其是预算比率较低的方法相比。
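
The merging idea described in the abstract above (drop the keys of less important tokens, fold their values into a neighbor through a convex combination weighted by attention scores) can be sketched on plain Python lists. This is an illustrative sketch, not the authors' implementation: the function name `merge_kv`, the choice of the nearest retained token as the merge target, and the running-mass bookkeeping are all assumptions, and the scores are assumed to be strictly positive.

```python
def merge_kv(keys, values, scores, budget):
    """Keep the `budget` highest-scoring tokens; merge evicted values into neighbors.

    keys, values: one vector (list of floats) per token.
    scores: average attention score per token, assumed strictly positive.
    """
    n = len(keys)
    # Indices of retained tokens, restored to their original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])

    merged = {i: list(values[i]) for i in keep}
    mass = {i: scores[i] for i in keep}  # accumulated score behind each merged value

    for i in range(n):
        if i in merged:
            continue
        # Assumption: merge into the nearest retained token (ties go to the left).
        j = min(keep, key=lambda k: abs(k - i))
        total = mass[j] + scores[i]
        w = mass[j] / total  # convex weights: w and (1 - w) sum to 1
        merged[j] = [w * a + (1.0 - w) * b for a, b in zip(merged[j], values[i])]
        mass[j] = total
        # The evicted token's key is simply discarded.

    return [keys[i] for i in keep], [merged[i] for i in keep]
```

With three tokens scored 3, 1, and 2 and a budget of two, the middle token is evicted and its value is blended into the first token's value with weights 0.75 and 0.25, while the first and third keys are returned unchanged.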

Title: Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models

Authors: Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01332
Pdf URL: https://arxiv.org/pdf/2503.01332
Copy Paste: [[2503.01332]] Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models(https://arxiv.org/abs/2503.01332)
Keywords: language model, prompt, agent
Abstract: Knowing when to answer or refuse is crucial for safe and reliable decision-making language agents. Although prior work has introduced refusal strategies to boost LMs' reliability, how these models adapt their decisions to different risk levels remains underexplored. We formalize the task of risk-aware decision-making, expose critical weaknesses in existing LMs, and propose skill-decomposition solutions to mitigate them. Our findings show that even cutting-edge LMs--both regular and reasoning models--still require explicit prompt chaining to handle the task effectively, revealing the challenges that must be overcome to achieve truly autonomous decision-making agents.
摘要：知道何时回答或拒绝对于安全可靠的决策语言推动者至关重要。尽管先前的工作已经引入了拒绝策略来提高LMS的可靠性，但这些模型如何使其决策适应不同的风险水平。我们将风险意识决策的任务正式化，暴露了现有LMS中的关键弱点，并提出了技能分类解决方案来减轻它们。我们的发现表明，即使是常规和推理模型，即使是尖端的LMS，也需要明确的及时悬而未决以有效地处理任务，从而揭示了必须克服的挑战，以实现真正的自主决策代理。

Title: Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness

Authors: Tingchen Fu, Fazl Barez
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01345
Pdf URL: https://arxiv.org/pdf/2503.01345
Copy Paste: [[2503.01345]] Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness(https://arxiv.org/abs/2503.01345)
Keywords: language model, llm, prompt
Abstract: Insensitivity to semantically-preserving variations of prompts (paraphrases) is crucial for reliable behavior and real-world deployment of large language models. However, language models exhibit significant performance degradation when faced with semantically equivalent but differently phrased prompts, and existing solutions either depend on trial-and-error prompt engineering or require computationally expensive inference-time algorithms. In this study, built on the key insight that worst-case prompts exhibit a drift in embedding space, we present Latent Adversarial Paraphrasing (LAP), a dual-loop adversarial framework: the inner loop trains a learnable perturbation to serve as a "latent continuous paraphrase" while preserving semantics through Lagrangian regulation, and the outer loop optimizes the language model parameters on these perturbations. We conduct extensive experiments to demonstrate the effectiveness of LAP across multiple LLM architectures on the RobustAlpaca benchmark with a 0.5%-4% absolute improvement on worst-case win-rate compared with vanilla supervised fine-tuning.
摘要：对提示（释义）的语义传播变化的不敏感对于可靠的行为和大型语言模型的现实部署至关重要。但是，语言模型在面对语义上等效但措辞不同的提示时会显示出明显的性能降解，并且现有的解决方案依赖于反复试验的及时工程，或者需要计算昂贵的推理时间算法。 In this study, built on the key insight that worst-case prompts exhibit a drift in embedding space, we present Latent Adversarial Paraphrasing (LAP), a dual-loop adversarial framework: the inner loop trains a learnable perturbation to serve as a "latent continuous paraphrase" while preserving semantics through Lagrangian regulation, and the outer loop optimizes the language model parameters on these扰动。我们进行了广泛的实验，以证明LAP在鲁棒基准上的多个LLM体系结构中的有效性，而与香草监督的微调相比，最差的案例赢得率有0.5％-4％的次数豁免率提高。

Title: SRAG: Structured Retrieval-Augmented Generation for Multi-Entity Question Answering over Wikipedia Graph

Authors: Teng Lin, Yizhang Zhu, Yuyu Luo, Nan Tang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.01346
Pdf URL: https://arxiv.org/pdf/2503.01346
Copy Paste: [[2503.01346]] SRAG: Structured Retrieval-Augmented Generation for Multi-Entity Question Answering over Wikipedia Graph(https://arxiv.org/abs/2503.01346)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Multi-entity question answering (MEQA) poses significant challenges for large language models (LLMs), which often struggle to consolidate scattered information across multiple documents. An example question might be "What is the distribution of IEEE Fellows among various fields of study?", which requires retrieving information from diverse sources, e.g., Wikipedia pages. The effectiveness of current retrieval-augmented generation (RAG) methods is limited by the LLMs' capacity to aggregate insights from numerous pages. To address this gap, this paper introduces a structured RAG (SRAG) framework that systematically organizes extracted entities into relational tables (e.g., tabulating entities with schema columns like "name" and "field of study") and then applies table-based reasoning techniques. Our approach decouples retrieval and reasoning, enabling LLMs to focus on structured data analysis rather than raw text aggregation. Extensive experiments on Wikipedia-based multi-entity QA tasks demonstrate that SRAG significantly outperforms state-of-the-art long-context LLMs and RAG solutions, achieving a 29.6% improvement in accuracy. The results underscore the efficacy of structuring unstructured data to enhance LLMs' reasoning capabilities.
摘要：多实体问题答案（MEQA）对大型语言模型（LLM）提出了重大挑战，这些挑战通常很难巩固跨多个文档的散落信息。一个示例问题可能是“ IEEE研究员在各个研究领域之间的分布是什么？”，这需要从Wikipedia页面中检索各种来源的信息。 LLMS从众多页面中汇总见解的能力限制了当前检索效果生成（RAG）方法的有效性。为了解决这一差距，本文介绍了一个结构化的抹布（SRAG）框架，该框架系统地将其提取的实体组织到关系表中（例如，使用模式列的制表实体，例如“名称”和“研究领域”），然后应用基于表格的推理技术。我们的方法解除了检索和推理，使LLM可以专注于结构化数据分析，而不是原始文本聚合。对基于Wikipedia的多实体质量检查任务进行的广泛实验表明，SRAG显着胜过最先进的长篇小说LLM和RAG解决方案，可提高准确性的29.6％。结果强调了构建非结构化数据以增强LLMS推理能力的功效。
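
The decoupling described in the abstract above (first tabulate extracted entities under a fixed schema, then reason over the table instead of raw text) can be pictured with a very small sketch. Everything here is hypothetical: the helper names `build_table` and `distribution`, the column name `field_of_study`, and the placeholder records are illustrative only and are not taken from the paper or from any real data.

```python
from collections import Counter


def build_table(records, schema):
    """Project loosely structured entity records onto a fixed list of columns."""
    return [{col: rec.get(col) for col in schema} for rec in records]


def distribution(table, column):
    """Table-side aggregation: count rows per distinct value of one column."""
    return Counter(row[column] for row in table if row[column] is not None)


# Placeholder records standing in for entities extracted from retrieved pages.
extracted = [
    {"name": "Person A", "field_of_study": "Signal Processing", "source": "page-1"},
    {"name": "Person B", "field_of_study": "Robotics", "source": "page-2"},
    {"name": "Person C", "field_of_study": "Signal Processing"},
    {"name": "Person D"},  # field missing on the page; the row is kept, the cell is None
]

table = build_table(extracted, schema=["name", "field_of_study"])
counts = distribution(table, "field_of_study")
```

Once the entities sit in a table, a question such as the "distribution among fields of study" example reduces to a group-and-count over one column, which is the kind of step the abstract contrasts with aggregating raw text.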

Title: SwiLTra-Bench: The Swiss Legal Translation Benchmark

Authors: Joel Niklaus, Jakob Merane, Luka Nenadic, Sina Ahmadi, Yingqiang Gao, Cyrill A. H. Chevalley, Claude Humbel, Christophe Gösken, Lorenzo Tanzi, Thomas Lüthi, Stefan Palombo, Spencer Poff, Boling Yang, Nan Wu, Matthew Guillod, Robin Mamié, Daniel Brunner, Julio Pereyra, Niko Grupen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01372
Pdf URL: https://arxiv.org/pdf/2503.01372
Copy Paste: [[2503.01372]] SwiLTra-Bench: The Swiss Legal Translation Benchmark(https://arxiv.org/abs/2503.01372)
Keywords: llm, prompt
Abstract: In Switzerland, legal translation is uniquely important due to the country's four official languages and requirements for multilingual legal documentation. However, this process traditionally relies on professionals who must be both legal experts and skilled translators -- creating bottlenecks and impacting effective access to justice. To address this challenge, we introduce SwiLTra-Bench, a comprehensive multilingual benchmark of over 180K aligned Swiss legal translation pairs comprising laws, headnotes, and press releases across all Swiss languages along with English, designed to evaluate LLM-based translation systems. Our systematic evaluation reveals that frontier models achieve superior translation performance across all document types, while specialized translation systems excel specifically in laws but under-perform in headnotes. Through rigorous testing and human expert validation, we demonstrate that while fine-tuning open SLMs significantly improves their translation quality, they still lag behind the best zero-shot prompted frontier models such as Claude-3.5-Sonnet. Additionally, we present SwiLTra-Judge, a specialized LLM evaluation system that aligns best with human expert assessments.
摘要：在瑞士，由于该国的四种官方语言以及对多语言法律文档的要求，法律翻译非常重要。但是，这个过程传统上依赖于必须既是法律专家又是熟练翻译的专业人士 - 创建瓶颈并影响有效的司法接触。为了应对这一挑战，我们介绍了Swiltra-Bench，这是超过180k的瑞士法律翻译对的全面多语言基准，其中包括法律，头脑和新闻稿以及所有瑞士语言的新闻稿以及旨在评估基于LLM的翻译系统的英语。我们的系统评估表明，Frontier模型在所有文档类型中都具有出色的翻译性能，而专门的翻译系统则在法律方面特别表现出色，但表现不佳。通过严格的测试和人类专家验证，我们证明，在微调开放式SLMS可以显着提高其翻译质量，但它们仍然落后于最佳的零拍摄促使Frontier模型，例如Claude-3.5-Sonnet。此外，我们提出了Swiltra-Gudge，这是一种专门的LLM评估系统，与人类专家评估保持一致。

Title: Q-NL Verifier: Leveraging Synthetic Data for Robust Knowledge Graph Question Answering

Authors: Tim Schwabe, Louisa Siebel, Patrik Valach, Maribel Acosta
Subjects: cs.CL, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01385
Pdf URL: https://arxiv.org/pdf/2503.01385
Copy Paste: [[2503.01385]] Q-NL Verifier: Leveraging Synthetic Data for Robust Knowledge Graph Question Answering(https://arxiv.org/abs/2503.01385)
Keywords: language model, llm
Abstract: Question answering (QA) requires accurately aligning user questions with structured queries, a process often limited by the scarcity of high-quality query-natural language (Q-NL) pairs. To overcome this, we present Q-NL Verifier, an approach to generating high-quality synthetic pairs of queries and NL translations. Our approach relies on large language models (LLMs) to generate semantically precise natural language paraphrases of structured queries. Building on these synthetic Q-NL pairs, we introduce a learned verifier component that automatically determines whether a generated paraphrase is semantically equivalent to the original query. Our experiments with the well-known LC-QuAD 2.0 benchmark show that Q-NL Verifier generalizes well to paraphrases from other models and even human-authored translations. Our approach strongly aligns with human judgments across varying query complexities and outperforms existing NLP metrics in assessing semantic correctness. We also integrate the verifier into QA pipelines, showing that verifier-filtered synthetic data has significantly higher quality in terms of translation correctness and enhances NL to Q translation accuracy. Lastly, we release an updated version of the LC-QuAD 2.0 benchmark containing our synthetic Q-NL pairs and verifier scores, offering a new resource for robust and scalable QA.
摘要：问答（QA）需要准确地将用户问题与结构化查询对齐，这一过程通常受高质量查询 - 自然语言（Q-NL）对的稀缺限制。为了克服这一点，我们提出了Q-NL验证器，这是一种生成高质量合成对的查询和NL翻译的方法。我们的方法依靠大型语言模型（LLM）来生成结构化查询的语义精确的自然语言释义。在这些合成Q-NL对的基础上，我们引入了一个学到的验证器组件，该组件会自动确定生成的释义是否在语义上等同于原始查询。我们对著名LC-Quad 2.0基准的实验表明，Q-NL验证者可以很好地概括到其他模型甚至人为译本的解释中。我们的方法与各种查询复杂性的人类判断密切相符，并且在评估语义正确性方面胜过现有的NLP指标。我们还将验证者集成到QA管道中，表明验证器过滤的合成数据在翻译正确性方面具有明显更高的质量，并提高了NL至Q翻译精度。最后，我们发布了LC-Quad 2.0基准的更新版本，其中包含我们的合成Q-NL对和验证器分数，为可靠且可扩展的QA提供了新的资源。

Title: Parameter-Efficient Fine-Tuning of Large Language Models via Deconvolution in Subspace

Authors: Jia-Chen Zhang, Yu-Jie Xiong, Chun-Ming Xia, Dong-Hai Zhu, Xi-He Qiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01419
Pdf URL: https://arxiv.org/pdf/2503.01419
Copy Paste: [[2503.01419]] Parameter-Efficient Fine-Tuning of Large Language Models via Deconvolution in Subspace(https://arxiv.org/abs/2503.01419)
Keywords: language model, llm
Abstract: Large language models (LLMs) are considered a milestone towards achieving Artificial General Intelligence (AGI). With their advanced emergent capabilities, they adapt to a wide range of specific applications. Fine-tuning LLMs for various downstream tasks has become a new paradigm. Low-Rank Adaptation (LoRA) is well known for its parameter efficiency: it can reduce the number of parameters needed to fine-tune LLMs by several orders of magnitude. However, LoRA-based approaches encounter a significant limitation due to the bottleneck imposed by rank-one decomposition. As the parameter count of LLMs increases, even a rank-one decomposition might exceed the number of parameters truly necessary for handling more downstream tasks. In this paper, we propose a new method for Parameter-Efficient Fine-Tuning (PEFT) via deconvolution in subspace, dubbed DCFT. We innovatively use deconvolution to complete details and enhance knowledge in subspace incremental matrices, and dynamically control the number of parameters by adjusting the kernel size, unconstrained by rank-one decomposition. Extensive experiments are conducted to validate the effectiveness of DCFT. Results show that, compared to LoRA, DCFT achieves an 8$\times$ reduction in parameters while still achieving highly impressive performance. Our code is available here: this https URL.
摘要：大型语言模型（LLM）被认为是实现人工通用情报（AGI）的里程碑。凭借其高级紧急功能，它适应了广泛的特定应用。用于各种下游任务的微调LLM已成为新的范式。低级适应性（LORA）以其参数效率而闻名。它可以减少几个数量级微调LLM所需的参数数量。然而，由于排列者施加的瓶颈，基于洛拉的方法遇到了一个显着的限制。随着LLMS中的参数计数的增加，即使排名一个分解也可能超过处理更多下游任务所需的参数数量。在本文中，我们提出了一种通过子空间中的反卷积（称为dcft）的反卷积的新方法，称为DCFT。我们创新使用反卷积来完成详细信息并增强子空间增量矩阵中的知识，并通过调整核大小来动态控制参数，而不受等级一的分解的约束。进行了广泛的实验以验证DCFT的有效性。结果表明，与Lora相比，DCFT的参数减少了8 $ \ times $，并且仍然具有令人印象深刻的性能。我们的代码可在此处提供：此HTTPS URL。
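
The rank-one floor mentioned in the abstract follows from how LoRA counts parameters: a rank-r update to a d_out x d_in weight matrix trains r * (d_out + d_in) values, so r = 1 is the smallest budget plain LoRA can express. The sketch below only illustrates that arithmetic; the matrix size 4096 is an illustrative assumption, and DCFT's deconvolution step is not reproduced because the abstract does not specify it.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable values when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values for a LoRA update B @ A, with B: d_out x rank and A: rank x d_in."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size, chosen only for illustration
    for rank in (1, 8, 64):
        ratio = full_finetune_params(d, d) / lora_params(d, d, rank)
        print(f"rank={rank}: {lora_params(d, d, rank)} params, {ratio:.0f}x fewer than full fine-tuning")
```

Going below the rank-one count requires a different parameterization, which is the gap the abstract says DCFT targets.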

Title: Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

Authors: Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, Rui Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01422
Pdf URL: https://arxiv.org/pdf/2503.01422
Copy Paste: [[2503.01422]] Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding(https://arxiv.org/abs/2503.01422)
Keywords: language model
Abstract: Test-time scaling improves large language model performance by adding extra compute during decoding. Best-of-N (BoN) sampling serves as a common scaling technique, broadening the search space for finding better solutions from the model distribution. However, traditional BoN requires N full generations, leading to high GPU memory overhead and time latency. Moreover, some methods depend on reward models, adding computational cost and limiting domain generalization. In this paper, we propose Self-Truncation Best-of-N (ST-BoN), a novel decoding method that avoids fully generating all samplings and eliminates the need for reward models. ST-BoN introduces early sampling consistency to estimate the most promising sample, truncating suboptimal ones to free memory and accelerate inference. This advances sampling-efficient test-time scaling. Compared to traditional BoN, ST-BoN can reduce dynamic GPU memory overhead by over 90% and time latency by 50%, while achieving comparable or even better performance across reasoning and open-ended domains.
摘要：测试时间缩放通过在解码过程中添加额外的计算来改善大型语言模型性能。最佳N（BON）采样是一种通用缩放技术，扩大了从模型分布中找到更好解决方案的搜索空间。但是，传统的BON需要全代人，导致GPU的高度内存和时间延迟。此外，某些方法取决于奖励模型，增加了计算成本和限制域的概括。在本文中，我们提出了自我截断最佳N（ST-BON），这是一种新颖的解码方法，避免了充分生成所有采样并消除了对奖励模型的需求。 ST-BON引入了早期抽样的一致性，以估算最有希望的样本，将次优的样本截断以自由记忆和加速推理。这推动了抽样效率的测试时间缩放。与传统的BON相比，ST-BON可以将动态GPU内存开销降低90％以上，并且时间延迟降低50％，同时在推理和开放式域中实现可比甚至更好的性能。
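
The control flow described above (start N samples, estimate the most promising one early, truncate the rest, finish only the survivor) can be sketched as follows. This is a toy illustration under stated assumptions: the abstract does not say how early consistency is measured, so the sketch uses mean Euclidean distance between hypothetical prefix embeddings as a stand-in, and `start_sample`, `embed`, and `continue_sample` are hypothetical callables supplied by the caller.

```python
import math


def mean_distance_to_others(idx, vectors):
    """Average Euclidean distance from vectors[idx] to every other vector."""
    others = [v for j, v in enumerate(vectors) if j != idx]
    return sum(math.dist(vectors[idx], v) for v in others) / len(others)


def pick_most_consistent(vectors):
    """Index of the vector closest on average to the rest (assumed consistency proxy)."""
    scores = [mean_distance_to_others(i, vectors) for i in range(len(vectors))]
    return min(range(len(vectors)), key=scores.__getitem__)


def self_truncating_best_of_n(start_sample, embed, continue_sample, n):
    """Generate n short prefixes, keep the most consistent one, finish only that one."""
    if n < 2:
        raise ValueError("consistency needs at least two samples")
    prefixes = [start_sample(i) for i in range(n)]      # early decoding only
    keep = pick_most_consistent([embed(p) for p in prefixes])
    return continue_sample(prefixes[keep])              # the other n - 1 are dropped
```

Only one sample is decoded to completion, which is where the memory and latency savings claimed in the abstract would come from.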

Title: Rethinking Data: Towards Better Performing Domain-Specific Small Language Models

Authors: Boris Nazarov, Darya Frolova, Yackov Lubarsky, Alexei Gaissinski, Pavel Kisilev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01464
Pdf URL: https://arxiv.org/pdf/2503.01464
Copy Paste: [[2503.01464]] Rethinking Data: Towards Better Performing Domain-Specific Small Language Models(https://arxiv.org/abs/2503.01464)
Keywords: language model, llm
Abstract: Fine-tuning Large Language Models (LLMs) on domain-specific data for downstream tasks has shown significant promise. However, commercial use of such LLMs is limited by the high computational cost of deploying them at scale. Small Language Models (LMs), on the other hand, are much more cost-effective but perform worse in a similar setup. This paper presents our approach to fine-tuning a small LM that reaches high accuracy on a multiple-choice question answering task. We achieve this by improving data quality at each stage of the LM training pipeline. In particular, we start with data structuring, which extracts compact, semantically meaningful text chunks for use by a retriever. This allows more efficient knowledge digestion by the LM. Further, we improve the retrieved context by training a lightweight Chunk Re-Ranker (CRR) that generates more accurate relative relevance scores for chunks. Finally, we improve the model's generalization ability by merging models fine-tuned with different parameters on different data subsets. We present detailed procedure descriptions and the corresponding experimental findings, which show the improvement contributed by each of the proposed techniques.
摘要：在特定于领域的数据上执行的大型语言模型（LLMS）的微型语言模型（LLM）表现出了巨大的希望。但是，这种LLM的商业用途受其大规模部署所需的高计算成本的限制。另一方面，小语言模型（LMS）更具成本效益，但在类似的设置中具有不优秀的性能。本文介绍了我们对小型LM进行填补的方法，该方法在多项选择问答任务中达到了高精度。我们通过在LM培训管道的每个阶段提高数据质量来实现这一目标。特别是，我们从数据结构开始，从而导致猎犬使用的紧凑，语义上有意义的文本块。这允许LM更有效的知识消化。此外，我们通过训练轻巧的重新级别（CRR）来改善检索到的上下文，从而产生更准确的相对相关性块分数。最后，我们通过将模型与不同数据子集上的不同参数合并来提高模型的概括能力。我们介绍了详细的程序描述，以及相应的实验发现，这些发现表明了每种提出的技术的改进。
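
The final step, merging models fine-tuned on different data subsets, is not detailed in the abstract. One common way to merge models is to average their parameters, and the sketch below shows that variant only. It is an assumption rather than the authors' procedure: parameters are represented as plain lists of floats keyed by a hypothetical parameter name.

```python
def merge_state_dicts(state_dicts, weights=None):
    """Weighted average of same-shaped parameter lists across several models."""
    if weights is None:
        # Assumed default: every fine-tuned model contributes equally.
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter from every model together.
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(w * x for w, x in zip(weights, col)) for col in columns]
    return merged


if __name__ == "__main__":
    model_a = {"layer.weight": [1.0, 2.0]}
    model_b = {"layer.weight": [3.0, 4.0]}
    print(merge_state_dicts([model_a, model_b]))
```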

Title: SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction

Authors: Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, Hui Xiong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01478
Pdf URL: https://arxiv.org/pdf/2503.01478
Copy Paste: [[2503.01478]] SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction(https://arxiv.org/abs/2503.01478)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have demonstrated improved generation performance by incorporating externally retrieved knowledge, a process known as retrieval-augmented generation (RAG). Despite the potential of this approach, existing studies evaluate RAG effectiveness by 1) assessing retrieval and generation components jointly, which obscures retrieval's distinct contribution, or 2) examining retrievers using traditional metrics such as NDCG, which creates a gap in understanding retrieval's true utility in the overall generation process. To address the above limitations, in this work, we introduce an automatic evaluation method that measures retrieval quality through the lens of information gain within the RAG framework. Specifically, we propose Semantic Perplexity (SePer), a metric that captures the LLM's internal belief about the correctness of the retrieved information. We quantify the utility of retrieval by the extent to which it reduces semantic perplexity post-retrieval. Extensive experiments demonstrate that SePer not only aligns closely with human preferences but also offers a more precise and efficient evaluation of retrieval utility across diverse RAG scenarios.
摘要：大型语言模型（LLMS）通过合并外部检索知识（一种称为检索功能增强的生成（RAG）的过程）表现出了改善的发电性能。尽管有这种方法的潜力，但现有研究通过1）共同评估检索和发电成分，掩盖了检索的独特贡献，或者2）使用NDCG等传统指标来检查检索器，这在理解整体生成过程中的真正实用性方面造成了差距。为了解决上述局限性，在这项工作中，我们引入了一种自动评估方法，该方法通过RAG框架内的信息增益来测量检索质量。具体来说，我们提出了语义困惑（SEPE），该指标捕获了LLM对检索到的信息正确性的内部信念。我们通过在降低语义后的回归性后量化的程度来量化检索的效用。广泛的实验表明，SEME不仅与人类的偏好紧密相符，而且还提供了对各种抹布情景中检索实用程序的更精确，更有效的评估。
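
The idea of scoring retrieval by how much it shifts the model's belief toward the correct answer can be illustrated with a small sketch. Everything here is an assumption made for illustration and not the paper's definition of SePer: belief is estimated as the smoothed share of sampled answers that match the reference, semantic equivalence is approximated by normalized string equality, and the perplexity-like score is the negative log of that belief.

```python
import math


def normalize(answer: str) -> str:
    """Crude stand-in for a semantic-equivalence check."""
    return " ".join(answer.lower().split())


def belief_in_reference(sampled_answers, reference):
    """Laplace-smoothed share of sampled answers that agree with the reference."""
    matches = sum(normalize(a) == normalize(reference) for a in sampled_answers)
    return (matches + 1) / (len(sampled_answers) + 2)


def retrieval_utility(answers_without, answers_with, reference):
    """Reduction in the perplexity-like score after adding retrieved context."""
    before = -math.log(belief_in_reference(answers_without, reference))
    after = -math.log(belief_in_reference(answers_with, reference))
    return before - after
```

A positive value means the retrieved passages moved the sampled answers toward the reference; a negative value means retrieval hurt.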

Title: Improving Retrospective Language Agents via Joint Policy Gradient Optimization

Authors: Xueyang Feng, Bo Lan, Quanyu Dai, Lei Wang, Jiakai Tang, Xu Chen, Zhenhua Dong, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01490
Pdf URL: https://arxiv.org/pdf/2503.01490
Copy Paste: [[2503.01490]] Improving Retrospective Language Agents via Joint Policy Gradient Optimization(https://arxiv.org/abs/2503.01490)
Keywords: language model, llm, prompt, agent
Abstract: In recent research advancements within the community, large language models (LLMs) have sparked great interest in creating autonomous agents. However, current prompt-based agents often rely heavily on large-scale LLMs. Meanwhile, although fine-tuning methods significantly enhance the capabilities of smaller LLMs, the fine-tuned agents often lack the potential for self-reflection and self-improvement. To address these challenges, we introduce a novel agent framework named RetroAct, which jointly optimizes both task-planning and self-reflective evolution capabilities in language agents. Specifically, we develop a two-stage joint optimization process that integrates imitation learning and reinforcement learning, and design an off-policy joint policy gradient optimization algorithm with imitation learning regularization to enhance the data efficiency and training stability in agent tasks. RetroAct significantly improves the performance of open-source models, reduces dependency on closed-source LLMs, and enables fine-tuned agents to learn and evolve continuously. We conduct extensive experiments across various testing environments, demonstrating that RetroAct substantially improves task performance and decision-making processes.
摘要：在社区中最新的研究进步中，大型语言模型（LLM）引发了人们对创建自主代理的极大兴趣。但是，当前基于及时的代理通常很大程度上依赖大型LLM。同时，尽管微调方法显着增强了较小的LLM的能力，但微型剂通常缺乏自我反思和自我改善的潜力。为了应对这些挑战，我们介绍了一个名为Retroact的新型代理框架，该框架是一个共同优化语言代理中任务计划和自我反射演化能力的框架。具体而言，我们开发了一个两阶段的关节优化过程，该过程将模仿学习和强化学习整合，并设计了一个非政策策略梯度优化算法，并通过模仿学习正则化，以提高代理任务中的数据效率和培训稳定性。追溯可显着提高开源模型的性能，降低对封闭源LLM的依赖，并使微型剂能够连续学习和进化。我们在各种测试环境中进行了广泛的实验，证明追溯性在任务绩效和决策过程方面有很大改善。

Title: Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh

Authors: Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly, Yuxia Wang, Zhuohan Xie, Rahul Pal, Daniil Orel, Parvez Mullah, Diana Turmakhan, Maiya Goloburda, Mohammed Kamran, Samujjwal Ghosh, Bokang Jia, Jonibek Mansurov, Mukhammed Togmanov, Debopriyo Banerjee, Nurkhan Laiyk, Akhmed Sakip, Xudong Han, Ekaterina Kochmar, Alham Fikri Aji, Aaryamonvikram Singh, Alok Anil Jadhav, Satheesh Katipomu, Samta Kamboj, Monojit Choudhury, Gurpreet Gosal, Gokul Ramakrishnan, Biswajit Mishra, Sarath Chandran, Avraham Sheinin, Natalia Vassilieva, Neha Sengupta, Larry Murray, Preslav Nakov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01493
Pdf URL: https://arxiv.org/pdf/2503.01493
Copy Paste: [[2503.01493]] Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh(https://arxiv.org/abs/2503.01493)
Keywords: language model, llm, chat
Abstract: Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. We release Sherkala-Chat (8B) as an open-weight instruction-tuned model and provide a detailed overview of its training, fine-tuning, safety alignment, and evaluation, aiming to advance research and support diverse real-world applications.
摘要：Llama-3.1-Sherkala-8B-Chat或Sherkala-Chat（8b），简称为最先进的指令调整的开放式大语言模型（LLM）。 Sherkala-Chat（8B）旨在提高哈萨克语言者的LLM进步的包容性。 Sherkala-Chat（8B）改编自遍地3.1-8B模型，接受了哈萨克州，英语，俄语和土耳其语的45.3b代币的培训。它具有80亿个参数，在哈萨克州表现出强大的知识和推理能力，在同时以英语实现竞争性能的同时，大大优于现有的开放性哈萨克和多语言模型。我们将Sherkala-Chat（8B）作为开放式指导调整模型发布，并详细概述了其培训，微调，安全一致性和评估，旨在提高研究和支持多样化的现实世界应用。

Title: Liger: Linearizing Large Language Models to Gated Recurrent Structures

Authors: Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01496
Pdf URL: https://arxiv.org/pdf/2503.01496
Copy Paste: [[2503.01496]] Liger: Linearizing Large Language Models to Gated Recurrent Structures(https://arxiv.org/abs/2503.01496)
Keywords: language model, llm
Abstract: Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93\% of the Transformer-based LLM's performance using 0.02\% of the pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at this https URL.
摘要：具有线性复发建模的变压器提供线性时间训练和恒定内存推断。尽管它们表现出效率和性能，但从头开始进行此类非标准架构仍然是昂贵和冒险的。大语言模型（LLM）的线性化将预处理的标准模型转换为线性复发结构，从而更有效地部署。但是，当前的线性化方法通常会引入其他特征映射模块，这些模块需要进行大量的微调，并忽略了最先进的线性复发模型中使用的门控机制。为了解决这些问题，本文介绍了Liger，这是将LLM线性化的缩写为门控复发结构。 Liger是一种新颖的方法，用于将预告片的LLM转换为封闭的线性复发模型，而无需添加额外的参数。它重新利用了预处理的密钥矩阵重量来构建各种门控机制，从而促进了各种门控复发结构的形成，同时避免需要从头开始训练其他组件。 Liger使用轻巧的微调调整（LORA），恢复了线性化的封闭式复发模型的性能，以匹配原始LLM的型号。此外，我们引入了Liger Coation，这是一种内部杂种注意机制，在线性化过程中，以0.02 \％的前训练代币恢复了93 \％的基于变压器的LLM，从而在从1B到8B参数的模型上验证了跨多个基准的竞争结果。代码可在此HTTPS URL上找到。
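
A gated linear recurrence of the general kind referred to above keeps a fixed-size state matrix instead of a growing key-value cache. The sketch below is a generic toy version, not Liger's actual formulation: the scalar gate obtained by pooling each key through a sigmoid is an assumption chosen to echo the idea of deriving the gate from key information without extra parameters.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_recurrence(queries, keys, values):
    """Per step: state = gate * state + outer(k, v); output = q @ state."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # constant memory, independent of length
    outputs = []
    for q, k, v in zip(queries, keys, values):
        gate = sigmoid(sum(k) / d_k)  # assumed: scalar gate pooled from the key
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gate * state[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs
```

Because the state has shape d_k x d_v regardless of sequence length, inference memory stays constant, which matches the efficiency property stated in the abstract.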

Title: SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity

Authors: Xiangyu Xi, Deyang Kong, Jian Yang, Jiawei Yang, Zhengyu Chen, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01506
Pdf URL: https://arxiv.org/pdf/2503.01506
Copy Paste: [[2503.01506]] SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity(https://arxiv.org/abs/2503.01506)
Keywords: language model, llm
Abstract: Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling across each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to suboptimal data distribution. To address these shortcomings, we propose a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, SampleMix requires 1.4x to 2.1x fewer training steps to achieve the baselines' performance, highlighting the substantial potential of SampleMix to optimize pre-training data.
摘要：大型语言模型（LLMS）的现有预训练数据混合方法通常遵循域的方法，这是一个自上而下的过程，该过程首先确定域重量，然后在每个域上执行统一的数据采样。但是，这些方法忽略了重大的域间重叠和共同点，无法控制构建的培训数据集的全球多样性。此外，域内的均匀采样忽略了细粒的样本特异性特征，可能导致次优数据分布。为了解决这些缺点，我们提出了一种基于自下而上范式的新型样品数据混合方法。该方法通过系统地评估每个样本的质量和多样性，从而动态确定最佳域分布，从而执行全局跨域采样。跨多个下游任务和困惑评估的全面实验表明，Samplemix超过了基于域的方法。同时，Samplemix需要1.4倍至2.1倍的训练步骤才能达到基本线的性能，从而突出了Samplemix优化预训练数据的巨大潜力。
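
The bottom-up idea, in which the domain distribution emerges from per-sample decisions instead of being fixed first, can be sketched as below. The scoring rule is an assumption for illustration only: each sample's weight is a convex combination of hypothetical quality and diversity scores, controlled by a hypothetical `alpha`; the abstract does not give the actual formula.

```python
import random
from collections import Counter


def sampling_weights(quality, diversity, alpha=0.5):
    """Normalized per-sample weights from assumed quality and diversity scores."""
    raw = [alpha * q + (1 - alpha) * d for q, d in zip(quality, diversity)]
    total = sum(raw)
    return [r / total for r in raw]


def draw_training_set(samples, weights, budget, seed=0):
    """Sample globally across all domains in proportion to the per-sample weights."""
    rng = random.Random(seed)
    return rng.choices(samples, weights=weights, k=budget)


def domain_distribution(drawn):
    """The domain mix is read off after sampling, not set beforehand."""
    return Counter(sample["domain"] for sample in drawn)


if __name__ == "__main__":
    samples = [{"id": 0, "domain": "web"}, {"id": 1, "domain": "code"}, {"id": 2, "domain": "web"}]
    weights = sampling_weights(quality=[0.9, 0.6, 0.2], diversity=[0.3, 0.8, 0.1])
    print(domain_distribution(draw_training_set(samples, weights, budget=100)))
```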

Title: KoWit-24: A Richly Annotated Dataset of Wordplay in News Headlines

Authors: Alexander Baranov, Anna Palatkina, Yulia Makovka, Pavel Braslavski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01510
Pdf URL: https://arxiv.org/pdf/2503.01510
Copy Paste: [[2503.01510]] KoWit-24: A Richly Annotated Dataset of Wordplay in News Headlines(https://arxiv.org/abs/2503.01510)
Keywords: llm
Abstract: We present KoWit-24, a dataset with fine-grained annotation of wordplay in 2,700 Russian news headlines. KoWit-24 annotations include the presence of wordplay, its type, wordplay anchors, and words/phrases the wordplay refers to. Unlike the majority of existing humor collections of canned jokes, KoWit-24 provides wordplay contexts -- each headline is accompanied by the news lead and summary. The most common type of wordplay in the dataset is the transformation of collocations, idioms, and named entities -- the mechanism that has been underrepresented in previous humor datasets. Our experiments with five LLMs show that there is ample room for improvement in wordplay detection and interpretation tasks. The dataset and evaluation scripts are available at this https URL
摘要：我们提出了Kowit-24，这是一个数据集，其中有2,700个俄罗斯新闻头条的文字播放注释。 KOWIT-24注释包括文字播放，类型，文字播放器以及单词播放所指的单词/短语。与大多数现有笑话笑话的幽默收藏不同，Kowit-24提供了文字游戏环境 - 每个标题都伴随着新闻主角和摘要。数据集中最常见的文字播放类型是搭配，成语和命名实体的转换 - 在以前的幽默数据集中的代理不足。我们使用五个LLM的实验表明，文字游戏检测和解释任务有足够的改进空间。数据集和评估脚本可在此HTTPS URL上找到

Title: Evaluation and Facilitation of Online Discussions in the LLM Era: A Survey

Authors: Katerina Korre, Dimitris Tsirmpas, Nikos Gkoumas, Emma Cabalé, Dionysis Kontarinis, Danai Myrtzani, Theodoros Evgeniou, Ion Androutsopoulos, John Pavlopoulos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01513
Pdf URL: https://arxiv.org/pdf/2503.01513
Copy Paste: [[2503.01513]] Evaluation and Facilitation of Online Discussions in the LLM Era: A Survey(https://arxiv.org/abs/2503.01513)
Keywords: language model, llm, agent
Abstract: We present a survey of methods for assessing and enhancing the quality of online discussions, focusing on the potential of Large Language Models (LLMs). While online discourses aim, at least in theory, to foster mutual understanding, they often devolve into harmful exchanges, such as hate speech, threatening social cohesion and democratic values. Recent advancements in LLMs enable facilitation agents that not only moderate content, but also actively improve the quality of interactions. Our survey synthesizes ideas from Natural Language Processing (NLP) and Social Sciences to provide (a) a new taxonomy on discussion quality evaluation, (b) an overview of intervention and facilitation strategies, along with a new taxonomy on conversation facilitation datasets, and (c) an LLM-oriented roadmap of good practices and future research directions, from technological and societal perspectives.
摘要：我们提出了评估和提高在线讨论质量的方法的调查，重点是大语言模型（LLMS）的潜力。至少在理论上，在线话语的目的是促进相互理解，但他们经常转向有害的交流，例如仇恨言论，威胁社会凝聚力和民主价值观。 LLMS的最新进展使不仅适度内容的促进代理，而且可以积极提高互动质量。我们的调查综合了自然语言处理（NLP）和社会科学的想法，以提供（a）讨论质量评估的新分类法，（b）概述干预和促进策略的概述，以及对对话促进数据集的新分类法，（c）良好的实践和未来研究的良好实践和未来研究的良好实践和未来的研究和社会，

Title: Pragmatic Inference Chain (PIC) Improving LLMs' Reasoning of Authentic Implicit Toxic Language

Authors: Xi Chen, Shuo Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01539
Pdf URL: https://arxiv.org/pdf/2503.01539
Copy Paste: [[2503.01539]] Pragmatic Inference Chain (PIC) Improving LLMs' Reasoning of Authentic Implicit Toxic Language(https://arxiv.org/abs/2503.01539)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: The rapid development of large language models (LLMs) gives rise to ethical concerns about their performance, while opening new avenues for developing toxic language detection techniques. However, LLMs' unethical output and their capability of detecting toxicity have primarily been tested on language data that do not demand complex meaning inference, such as the biased associations of 'he' with programmer and 'she' with household. Nowadays, toxic language adopts a much more creative range of implicit forms, thanks to advanced censorship. In this study, we collect authentic toxic interactions that evade online censorship and that are verified by human annotators as inference-intensive. To evaluate and improve LLMs' reasoning about authentic implicit toxic language, we propose a new prompting method, Pragmatic Inference Chain (PIC), drawing on interdisciplinary findings from cognitive science and linguistics. PIC prompting significantly improves the success rate of GPT-4o, Llama-3.1-70B-Instruct, and DeepSeek-v2.5 in identifying implicit toxic language, compared to both direct prompting and Chain-of-Thought. In addition, it helps the models produce more explicit and coherent reasoning processes, and hence can potentially be generalized to other inference-intensive tasks, e.g., understanding humour and metaphors.
摘要：大型语言模型（LLM）的快速发展引起了人们对其表现的道德问题，同时为开发有毒语言检测技术的新途径。但是，LLMS的不道德输出及其检测毒性的能力主要在不需要复杂含义推断的语言数据上进行了测试，例如“ He He”与程序员和“她”与家庭的偏见关联。如今，有毒语言通过高级审查制度采用了更具创造性的隐性形式。在这项研究中，我们收集了逃避在线审查制度的真实有毒互动，并被人类注释者验证为推理密集型。为了评估和改善LLM对真实隐式有毒语言的推理，我们提出了一种新的提示方法，务实的推论链（PIC），该方法是根据认知科学和语言学的跨学科发现。与直接提示和思维链相比，这张PIC促使该PIC显着提高了GPT-4O，Llama-3.1-70B-Instruct和DeepSeek-V2.5的成功率。此外，它还有助于模型产生更明确和相干的推理过程，因此有可能将其推广到其他推理密集型任务，例如了解幽默和隐喻。

Title: Revisiting Large Language Model Pruning using Neuron Semantic Attribution

Authors: Yizhuo Ding, Xinwei Sun, Yanwei Fu, Guosheng Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01542
Pdf URL: https://arxiv.org/pdf/2503.01542
Copy Paste: [[2503.01542]] Revisiting Large Language Model Pruning using Neuron Semantic Attribution(https://arxiv.org/abs/2503.01542)
Keywords: language model, llm
Abstract: Model pruning is vital for accelerating large language models by reducing their size and computational requirements. However, the generalizability of existing pruning methods across diverse datasets and tasks remains unclear. Thus, we conduct extensive evaluations on 24 datasets and 4 tasks using popular pruning methods. Based on these evaluations, we find that the calibration set greatly affects the performance of pruning methods, and we investigate this effect. In addition, we surprisingly find a significant performance drop of existing pruning methods on sentiment classification tasks. To understand the link between the performance drop and pruned neurons, we propose Neuron Semantic Attribution, which learns to associate each neuron with specific semantics. This method is the first to make the unpruned neurons of LLMs explainable.
摘要：模型修剪技术对于通过降低其规模和计算要求来加速大语言模型至关重要。但是，跨不同数据集和任务的现有修剪方法的普遍性尚不清楚。因此，我们使用流行的修剪方法对24个数据集和4个任务进行了广泛的评估。基于这些评估，我们发现并研究校准设置极大地影响修剪方法的性能。此外，我们出人意料地发现，情感分类任务中现有的修剪方法的性能下降。为了了解性能下降和修剪神经元之间的联系，我们提出了神经元语义归因，该语义归因于将每个神经元与特定语义相关联。该方法首先使LLMS未经修复的神经元可解释。

Title: Attention Condensation via Sparsity Induced Regularized Training

Authors: Eli Sason, Darya Frolova, Boris Nazarov, Felix Goldberd
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01564
Pdf URL: https://arxiv.org/pdf/2503.01564
Copy Paste: [[2503.01564]] Attention Condensation via Sparsity Induced Regularized Training(https://arxiv.org/abs/2503.01564)
Keywords: language model, gpt, llm
Abstract: As the context window expands, self-attention increasingly dominates the transformer's inference time. Therefore, accelerating attention computation while minimizing performance degradation is essential for the efficient deployment of Large Language Models (LLMs). In this study we extend a theoretical framework of attention sparsity in LLMs. A customized loss function is designed to enforce the sparsity by restricting the number of top elements in the attention matrix. We perform an initial set of evaluations with GPT-2 to show the effectiveness of our sparsification approach. The attention matrices of the models trained with the proposed loss are both sparse and effective in capturing relevant input dependencies. We now continue working to demonstrate the value of our approach on larger models and different architectures.
摘要：随着上下文窗口的扩展，自我发作越来越多地主导着变压器的推理时间。因此，在最大程度地减少性能下降的同时，加速注意力计算对于有效部署大语言模型（LLMS）至关重要。在这项研究中，我们扩展了LLM中注意力稀疏性的理论框架。定制的损失函数旨在通过限制注意矩阵中的顶级元素数量来实现稀疏性。我们使用GPT-2进行初始评估，以显示我们的稀疏方法的有效性。接受拟议损失训练的模型的注意力矩阵在捕获相关输入依赖性方面既稀疏又有效。现在，我们继续努力证明我们的方法对较大模型和不同体系结构的价值。
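
A loss that pushes each attention row to concentrate on a limited number of top elements can be illustrated as follows. This is one possible penalty written for illustration, not the loss from the paper: for each row it measures the attention mass that falls outside the k largest entries, so a row that already fits within k entries incurs no penalty. The pure-Python form is only meant to show the quantity; a trainable version would need a differentiable tensor implementation.

```python
def topk_sparsity_penalty(attention_rows, k):
    """Mean attention mass lying outside the k largest entries of each row."""
    penalties = []
    for row in attention_rows:
        topk_mass = sum(sorted(row, reverse=True)[:k])
        penalties.append(sum(row) - topk_mass)
    return sum(penalties) / len(penalties)


if __name__ == "__main__":
    dense = [[0.25, 0.25, 0.25, 0.25]]
    sparse = [[0.97, 0.01, 0.01, 0.01]]
    print(topk_sparsity_penalty(dense, k=1), topk_sparsity_penalty(sparse, k=1))
```

Added to the language-modeling loss with a small coefficient, a term of this kind would trade some modeling freedom for attention rows that are cheaper to compute at inference time.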

Title: Beyond Prompting: An Efficient Embedding Framework for Open-Domain Question Answering

Authors: Zhanghao Hu, Hanqi Yan, Qingling Zhu, Zhenyi Shen, Yulan He, Lin Gui
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01606
Pdf URL: https://arxiv.org/pdf/2503.01606
Copy Paste: [[2503.01606]] Beyond Prompting: An Efficient Embedding Framework for Open-Domain Question Answering(https://arxiv.org/abs/2503.01606)
Keywords: language model, llm, prompt
Abstract: Large language models have recently pushed open-domain question answering (ODQA) to new frontiers. However, prevailing retriever-reader pipelines often depend on multiple rounds of prompt-level instructions, leading to high computational overhead, instability, and suboptimal retrieval coverage. In this paper, we propose EmbQA, an embedding-level framework that alleviates these shortcomings by enhancing both the retriever and the reader. Specifically, we refine query representations via lightweight linear layers under an unsupervised contrastive learning objective, thereby reordering retrieved passages to highlight those most likely to contain correct answers. Additionally, we introduce an exploratory embedding that broadens the model's latent semantic space to diversify candidate generation and employs an entropy-based selection mechanism to choose the most confident answer automatically. Extensive experiments across three open-source LLMs, three retrieval methods, and four ODQA benchmarks demonstrate that EmbQA substantially outperforms recent baselines in both accuracy and efficiency.
摘要：大型语言模型最近将开放式域问答（ODQA）推向了新的边界。但是，盛行的检索员阅读器管道通常取决于多个及时的指令的回合，从而导致高度计算开销，不稳定性和次优的检索覆盖范围。在本文中，我们提出了Embqa，这是一个嵌入式级别的框架，通过增强猎犬和读者来减轻这些缺陷。具体而言，我们通过轻巧的线性层在无监督的对比学习目标下完善查询表示形式，从而重新排序检索的段落以突出显示最有可能包含正确答案的段落。此外，我们介绍了一种探索性嵌入，该探索性嵌入范围扩大了模型的潜在语义空间，以使候选人生成多样化，并采用基于熵的选择机制自动选择最自信的答案。在三个开源LLM，三种检索方法和四个ODQA基准的大量实验表明，Embqa在准确性和效率方面显着优于最近的基线。
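
The entropy-based selection step can be illustrated with a small sketch: among several candidate answers, keep the one whose token distributions are most peaked. The details are assumptions, since the abstract does not specify them. Here confidence is the mean Shannon entropy over a candidate's per-token probability distributions, and the candidate data structure is hypothetical.

```python
import math


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def most_confident(candidates):
    """Return the answer whose generation had the lowest mean token entropy."""
    best = min(candidates, key=lambda c: mean_token_entropy(c["token_distributions"]))
    return best["answer"]


if __name__ == "__main__":
    candidates = [
        {"answer": "Canberra", "token_distributions": [[0.9, 0.1], [0.8, 0.2]]},
        {"answer": "Sydney", "token_distributions": [[0.5, 0.5], [0.6, 0.4]]},
    ]
    print(most_confident(candidates))
```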

Title: In-context Learning vs. Instruction Tuning: The Case of Small and Multilingual Language Models

Authors: David Ponce, Thierry Etchegoyhen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01611
Pdf URL: https://arxiv.org/pdf/2503.01611
Copy Paste: [[2503.01611]] In-context Learning vs. Instruction Tuning: The Case of Small and Multilingual Language Models(https://arxiv.org/abs/2503.01611)
Keywords: language model
Abstract: Instruction following is a critical ability for Large Language Models to perform downstream tasks. The standard approach to instruction alignment has relied on a specific phase of model tuning over curated instruction datasets, optionally complemented with an alignment step over human preferences. Recent work has shown the potential of in-context learning (ICL) alternatives to guide base models towards instruction following. This type of approach is particularly relevant to extend instruction following across languages and models of varying sizes adapted to different types of usage. In this work we compare ICL and instruction fine-tuning in English, French and Spanish, on Small Language Models, and provide experimental results on applying Direct Preference Optimisation (DPO) over base models. Our results show that scenarios involving multilingual and smaller models result in downgraded ICL instruction following performance, only partially mitigated by DPO alignment. This study aims to further our understanding of current strengths and limitations of alternative methods for instruction following.
摘要：以下指令是大型语言模型执行下游任务的关键能力。标准的教学对齐方式取决于模型调整的特定阶段，而不是策划的指令数据集，可选地以对人类偏好的一致步骤进行补充。最近的工作表明，在指导基本模型以跟随教学的可能性替代方案（ICL）替代方案的潜力。这种类型的方法特别重要地是在适合不同类型的用法的不同尺寸的语言和不同尺寸的模型中扩展教学尤其重要。在这项工作中，我们将ICL和教学用英语，法语和西班牙语，小语言模型进行了比较，并在基本模型上应用直接偏好优化（DPO）提供了实验结果。我们的结果表明，涉及多语言和较小模型的场景导致性能后降级ICL指令，仅通过DPO对齐而部分缓解。这项研究旨在进一步了解当前的优势和替代方法的局限性，以进行遵循。
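
For readers unfamiliar with Direct Preference Optimisation, its per-example loss is the negative log-sigmoid of a scaled margin between how much the policy, relative to a frozen reference model, prefers the chosen response over the rejected one. The sketch below computes that loss from summed sequence log-probabilities; the numeric inputs and the default `beta` are illustrative assumptions, not values from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    # How far the policy has moved from the reference on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is zero.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy raises the chosen response and lowers the rejected one: loss shrinks.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```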

Title: DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

Authors: Eliya Habba, Ofir Arviv, Itay Itzhak, Yotam Perlitz, Elron Bandel, Leshem Choshen, Michal Shmueli-Scheuer, Gabriel Stanovsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01622
Pdf URL: https://arxiv.org/pdf/2503.01622
Copy Paste: [[2503.01622]] DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation(https://arxiv.org/abs/2503.01622)
Keywords: llm, prompt
Abstract: Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This calls into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more: this https URL
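The idea of perturbing a prompt jointly along several dimensions, as described in the DOVE abstract, can be sketched as a Cartesian product over per-dimension options. The dimension values and the `perturb_prompt` helper below are hypothetical; the actual DOVE dimensions and templates may differ.

```python
from itertools import product

# Hypothetical perturbation dimensions (assumptions, not taken from DOVE).
DELIMITERS = [": ", " - ", "\n"]
ENUMERATORS = [["A", "B", "C", "D"], ["1", "2", "3", "4"], ["a", "b", "c", "d"]]
INSTRUCTIONS = ["Answer the question.", "Choose the correct option."]

def perturb_prompt(question, choices):
    """Yield one prompt per combination of delimiter, enumerator, and wording."""
    for delim, enum, instr in product(DELIMITERS, ENUMERATORS, INSTRUCTIONS):
        lines = [instr, f"Question{delim}{question}"]
        lines += [f"{label}. {choice}" for label, choice in zip(enum, choices)]
        yield "\n".join(lines)

prompts = list(perturb_prompt("What is 2 + 2?", ["3", "4", "5", "6"]))
print(len(prompts))  # 3 delimiters * 3 enumerators * 2 wordings = 18
```

Because the variants multiply across dimensions, even a handful of options per dimension quickly yields many perturbations per instance.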

Title: Detecting Stylistic Fingerprints of Large Language Models

Authors: Yehonatan Bitton, Elad Bitton, Shai Nisan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01659
Pdf URL: https://arxiv.org/pdf/2503.01659
Copy Paste: [[2503.01659]] Detecting Stylistic Fingerprints of Large Language Models(https://arxiv.org/abs/2503.01659)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have distinct and consistent stylistic fingerprints, even when prompted to write in different writing styles. Detecting these fingerprints is important for many reasons, among them protecting intellectual property, ensuring transparency regarding AI-generated content, and preventing the misuse of AI technologies. In this paper, we present a novel method to classify texts based on the stylistic fingerprints of the models that generated them. We introduce an LLM-detection ensemble that is composed of three classifiers with varied architectures and training data. This ensemble is trained to classify texts generated by four well-known LLM families: Claude, Gemini, Llama, and OpenAI. As this task is highly cost-sensitive and might have severe implications, we want to minimize false positives and increase confidence. We consider a prediction valid only when all three classifiers in the ensemble unanimously agree on the output classification. Our ensemble is validated on a test set of texts generated by Claude, Gemini, Llama, and OpenAI models, and achieves extremely high precision (0.9988) and a very low false-positive rate (0.0004). Furthermore, we demonstrate the ensemble's ability to distinguish between texts generated by seen and unseen models. This reveals interesting stylistic relationships between models. This approach to stylistic analysis has implications for verifying the originality of AI-generated texts and tracking the origins of model training techniques.
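The unanimity rule described in the abstract above (a prediction counts only when all three classifiers agree) can be sketched as follows. The function names and the toy votes are hypothetical, and the accuracy helper is only a simple stand-in for the paper's precision and false-positive metrics.

```python
def unanimous_prediction(votes):
    """Return the shared label if every classifier agrees, else None (abstain)."""
    if not votes:
        return None
    first = votes[0]
    return first if all(v == first for v in votes) else None

def accuracy_on_accepted(vote_rows, true_labels):
    """Accuracy over the examples where the ensemble did not abstain."""
    accepted = [(unanimous_prediction(v), y) for v, y in zip(vote_rows, true_labels)]
    accepted = [(p, y) for p, y in accepted if p is not None]
    if not accepted:
        return None
    return sum(p == y for p, y in accepted) / len(accepted)

# Toy votes from three hypothetical classifiers.
rows = [
    ["Claude", "Claude", "Claude"],   # accepted
    ["Gemini", "Llama", "Gemini"],    # disagreement, so the ensemble abstains
    ["OpenAI", "OpenAI", "OpenAI"],   # accepted
]
print(accuracy_on_accepted(rows, ["Claude", "Gemini", "OpenAI"]))
```

Requiring unanimity trades coverage for confidence: the ensemble abstains more often, but the predictions it does emit are less likely to be false positives.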

Title: Evaluating LLMs' Assessment of Mixed-Context Hallucination Through the Lens of Summarization

Authors: Siya Qi, Rui Cao, Yulan He, Zheng Yuan
Subjects: cs.CL, cs.AI, cs.CY, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01670
Pdf URL: https://arxiv.org/pdf/2503.01670
Copy Paste: [[2503.01670]] Evaluating LLMs' Assessment of Mixed-Context Hallucination Through the Lens of Summarization(https://arxiv.org/abs/2503.01670)
Keywords: language model, llm, hallucination
Abstract: With the rapid development of large language models (LLMs), LLM-as-a-judge has emerged as a widely adopted approach for text quality evaluation, including hallucination evaluation. While previous studies have focused exclusively on single-context evaluation (e.g., discourse faithfulness or world factuality), real-world hallucinations typically involve mixed contexts, which remain inadequately evaluated. In this study, we use summarization as a representative task to comprehensively evaluate LLMs' capability in detecting mixed-context hallucinations, specifically distinguishing between factual and non-factual hallucinations. Through extensive experiments across direct generation and retrieval-based models of varying scales, our main observations are: (1) LLMs' intrinsic knowledge introduces inherent biases in hallucination evaluation; (2) These biases particularly impact the detection of factual hallucinations, yielding a significant performance bottleneck; (3) The fundamental challenge lies in effective knowledge utilization, balancing between LLMs' intrinsic knowledge and external context for accurate mixed-context hallucination evaluation.

Title: Automated Annotation of Evolving Corpora for Augmenting Longitudinal Network Data: A Framework Integrating Large Language Models and Expert Knowledge

Authors: Xiao Liu, Zirui Wu, Jiayi Li, Zhicheng Shao, Xun Pang, Yansong Feng
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2503.01672
Pdf URL: https://arxiv.org/pdf/2503.01672
Copy Paste: [[2503.01672]] Automated Annotation of Evolving Corpora for Augmenting Longitudinal Network Data: A Framework Integrating Large Language Models and Expert Knowledge(https://arxiv.org/abs/2503.01672)
Keywords: language model, llm
Abstract: Longitudinal network data are essential for analyzing political, economic, and social systems and processes. In political science, these datasets are often generated through human annotation or supervised machine learning applied to evolving corpora. However, as semantic contexts shift over time, inferring dynamic interaction types on emerging issues among a diverse set of entities poses significant challenges, particularly in maintaining timely and consistent annotations. This paper presents the Expert-Augmented LLM Annotation (EALA) approach, which leverages Large Language Models (LLMs) in combination with historically annotated data and expert-constructed codebooks to extrapolate and extend datasets into future periods. We evaluate the performance and reliability of EALA using a dataset of climate negotiations. Our findings demonstrate that EALA effectively predicts nuanced interactions between negotiation parties and captures the evolution of topics over time. At the same time, we identify several limitations inherent to LLM-based annotation, highlighting areas for further improvement. Given the wide availability of codebooks and annotated datasets, EALA holds substantial promise for advancing research in political science and beyond.

Title: When an LLM is apprehensive about its answers -- and when its uncertainty is justified

Authors: Petr Sychev, Andrey Goncharov, Daniil Vyazhev, Edvard Khalafyan, Alexey Zaytsev
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01688
Pdf URL: https://arxiv.org/pdf/2503.01688
Copy Paste: [[2503.01688]] When an LLM is apprehensive about its answers -- and when its uncertainty is justified(https://arxiv.org/abs/2503.01688)
Keywords: language model, llm
Abstract: Uncertainty estimation is crucial for evaluating Large Language Models (LLMs), particularly in high-stakes domains where incorrect answers result in significant consequences. Numerous approaches address this problem, but each focuses on a specific type of uncertainty and ignores others. We investigate which estimates, specifically token-wise entropy and model-as-judge (MASJ), work for multiple-choice question-answering tasks across different question topics. Our experiments consider three LLMs: Phi-4, Mistral, and Qwen, at sizes from 1.5B to 72B, and 14 topics. While MASJ performs similarly to a random error predictor, the response entropy predicts model error in knowledge-dependent domains and serves as an effective indicator of question difficulty: for biology, ROC AUC is 0.73. This correlation vanishes for the reasoning-dependent domain: for math questions, ROC AUC is 0.55. More fundamentally, we found that the usefulness of the entropy measure depends on the amount of reasoning a question requires. Thus, data-uncertainty-related entropy should be integrated within uncertainty estimation frameworks, while MASJ requires refinement. Moreover, existing MMLU-Pro samples are biased, and should balance the required amount of reasoning across subdomains to provide a fairer assessment of LLM performance.
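To make the two quantities in the abstract above concrete, the sketch below computes a mean token-wise entropy and a rank-based ROC AUC for how well a score predicts errors. It is a minimal pure-Python illustration under assumed inputs (per-token probability distributions and a boolean error flag per question); the paper's exact entropy aggregation is not specified here.

```python
import math

def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in token_distributions]
    return sum(entropies) / len(entropies)

def roc_auc(scores, is_error):
    """Probability that a random error scores higher than a random correct
    answer, counting ties as 0.5 (needs at least one example of each class)."""
    pos = [s for s, e in zip(scores, is_error) if e]
    neg = [s for s, e in zip(scores, is_error) if not e]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: higher entropy on the two questions the model got wrong.
scores = [0.9, 0.8, 0.1, 0.2]
errors = [True, True, False, False]
print(roc_auc(scores, errors))
```

A ROC AUC near 0.5 means the score is no better than chance at flagging errors, which is how the reported 0.55 for math questions should be read against 0.73 for biology.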

Title: Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution

Authors: Kun Li, Tianhua Zhang, Yunxiang Li, Hongyin Luo, Abdalla Moustafa, Xixin Wu, James Glass, Helen Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01695
Pdf URL: https://arxiv.org/pdf/2503.01695
Copy Paste: [[2503.01695]] Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution(https://arxiv.org/abs/2503.01695)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: Improving context faithfulness in large language models is essential for developing trustworthy retrieval augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts. Existing methods either intervene on LLMs only at inference time without addressing their inherent limitations, or overlook the potential for self-improvement. In this paper, we introduce GenDiE (Generate, Discriminate, Evolve), a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. GenDiE combines both generative and discriminative training, equipping LLMs with self-generation and self-scoring capabilities to facilitate iterative self-evolution. This supports both data construction for model alignment and score-guided search during inference. Furthermore, by treating each sentence in a response as an independent optimization unit, GenDiE effectively addresses the limitations of previous approaches that optimize at the holistic answer level, which may miss unfaithful details. Experiments on ASQA (in-domain LFQA) and ConFiQA (out-of-domain counterfactual QA) datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness, and exhibits robust performance for domain adaptation.

Title: Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia

Authors: Chenxi Wang, Tianle Gu, Zhongyu Wei, Lang Gao, Zirui Song, Xiuying Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01714
Pdf URL: https://arxiv.org/pdf/2503.01714
Copy Paste: [[2503.01714]] Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia(https://arxiv.org/abs/2503.01714)
Keywords: language model, llm
Abstract: Human readers can efficiently comprehend scrambled words, a phenomenon known as Typoglycemia, primarily by relying on word form; if word form alone is insufficient, they further utilize contextual cues for interpretation. While advanced large language models (LLMs) exhibit similar abilities, the underlying mechanisms remain unclear. To investigate this, we conduct controlled experiments to analyze the roles of word form and contextual information in semantic reconstruction and examine LLM attention patterns. Specifically, we first propose SemRecScore, a reliable metric to quantify the degree of semantic reconstruction, and validate its effectiveness. Using this metric, we study how word form and contextual information influence LLMs' semantic reconstruction ability, identifying word form as the core factor in this process. Furthermore, we analyze how LLMs utilize word form and find that they rely on specialized attention heads to extract and process word form information, with this mechanism remaining stable across varying levels of word scrambling. This distinction between LLMs' fixed attention patterns primarily focused on word form and human readers' adaptive strategy in balancing word form and contextual information provides insights into enhancing LLM performance by incorporating human-like, context-aware mechanisms.

Title: Syntactic Learnability of Echo State Neural Language Models at Scale

Authors: Ryo Ueda, Tatsuki Kuribayashi, Shunsuke Kando, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01724
Pdf URL: https://arxiv.org/pdf/2503.01724
Copy Paste: [[2503.01724]] Syntactic Learnability of Echo State Neural Language Models at Scale(https://arxiv.org/abs/2503.01724)
Keywords: language model
Abstract: What is a neural model with minimum architectural complexity that exhibits reasonable language learning capability? To explore such a simple but sufficient neural language model, we revisit a basic reservoir computing (RC) model, Echo State Network (ESN), a restricted class of simple Recurrent Neural Networks. Our experiments showed that ESN with a large hidden state is comparable or superior to Transformer in grammaticality judgment tasks when trained with about 100M words, suggesting that architectures as complex as that of Transformer may not always be necessary for syntactic learning.

Title: Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models

Authors: Alberto Purpura, Sahil Wadhwa, Jesse Zymet, Akshay Gupta, Andy Luo, Melissa Kazemi Rad, Swapnil Shinde, Mohammad Shahed Sorower
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01742
Pdf URL: https://arxiv.org/pdf/2503.01742
Copy Paste: [[2503.01742]] Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models(https://arxiv.org/abs/2503.01742)
Keywords: language model, llm
Abstract: The rapid growth of Large Language Models (LLMs) presents significant privacy, security, and ethical concerns. While much research has proposed methods for defending LLM systems against misuse by malicious actors, researchers have recently complemented these efforts with an offensive approach that involves red teaming, i.e., proactively attacking LLMs with the purpose of identifying their vulnerabilities. This paper provides a concise and practical overview of the LLM red teaming literature, structured so as to describe a multi-component system end-to-end. To motivate red teaming we survey the initial safety needs of some high-profile LLMs, and then dive into the different components of a red teaming system as well as software packages for implementing them. We cover various attack methods, strategies for attack-success evaluation, metrics for assessing experiment outcomes, as well as a host of other considerations. Our survey will be useful for any reader who wants to rapidly obtain a grasp of the major red teaming concepts for their own use in practical applications.

Title: Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Authors: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Nguyen, Jing Pan, Daniel Perez-Becker, Jacob Platin, Thomas Portet, Kai Qiu, Bo Ren, Liliang Ren, Sambuddha Roy, Ning Shang, Yelong Shen, Saksham Singhal, Subhojit Som, Xia Song, Tetyana Sych, Praneetha Vaddamanu, Shuohang Wang, Yiming Wang, Zhenghao Wang, Haibin Wu, Haoran Xu, Weijian Xu, Yifan Yang, Ziyi Yang, Donghan Yu, Ishmam Zabir, Jianwen Zhang, Li Lyna Zhang, Yunan Zhang, Xiren Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01743
Pdf URL: https://arxiv.org/pdf/2503.01743
Copy Paste: [[2503.01743]] Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs(https://arxiv.org/abs/2503.01743)
Keywords: language model
Abstract: We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first in the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment to further train Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.

Title: Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

Authors: Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.01763
Pdf URL: https://arxiv.org/pdf/2503.01763
Copy Paste: [[2503.01763]] Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models(https://arxiv.org/abs/2503.01763)
Keywords: language model, llm, agent
Abstract: Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.

Title: Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

Authors: Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, Manling Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01773
Pdf URL: https://arxiv.org/pdf/2503.01773
Copy Paste: [[2503.01773]] Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas(https://arxiv.org/abs/2503.01773)
Keywords: language model
Abstract: Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge through the lens of mechanistic interpretability, diving into the model's internal states to examine the interactions between image and text tokens. By tracing the attention distribution over the image throughout the intermediate layers, we observe that successful spatial reasoning correlates strongly with the model's ability to align its attention distribution with actual object locations, particularly differing between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose ADAPTVIS, based on inference-time confidence scores, to sharpen the attention on highly relevant regions when confident, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50 absolute point improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible cost. We make code and data publicly available for research purposes at this https URL.
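The confidence-dependent sharpening and smoothing described in the abstract above can be illustrated by rescaling attention logits before the softmax. This is only a sketch of the general idea: the threshold, the two scaling factors, and the function names are hypothetical, and the actual ADAPTVIS mechanism may differ in where and how the scaling is applied.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adapt_attention(image_logits, confidence, threshold=0.5,
                    sharpen=2.0, smooth=0.5):
    """Rescale attention logits: sharpen when confident, smooth otherwise.

    threshold, sharpen, and smooth are illustrative assumptions.
    """
    alpha = sharpen if confidence >= threshold else smooth
    return softmax([alpha * x for x in image_logits])

logits = [2.0, 1.0, 0.0]
confident = adapt_attention(logits, confidence=0.9)   # more peaked
uncertain = adapt_attention(logits, confidence=0.1)   # flatter, wider context
print(confident, uncertain)
```

Multiplying logits by a factor above 1 concentrates attention on the highest-scoring regions, while a factor below 1 spreads it over a wider area.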

Title: Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models

Authors: Meghana Rajeev, Rajkumar Ramamurthy, Prapti Trivedi, Vikas Yadav, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudan, James Zou, Nazneen Rajani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01781
Pdf URL: https://arxiv.org/pdf/2503.01781
Copy Paste: [[2503.01781]] Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models(https://arxiv.org/abs/2503.01781)
Keywords: llm
Abstract: We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers - short, irrelevant text that, when appended to math problems, systematically misleads models into outputting incorrect answers without altering the problem's semantics. We propose CatAttack, an automated iterative attack pipeline for generating triggers on a weaker, less expensive proxy model (DeepSeek V3) and successfully transferring them to more advanced reasoning target models like DeepSeek R1 and DeepSeek R1-distilled-Qwen-32B, resulting in a greater than 300% increase in the likelihood of the target model generating an incorrect answer. For example, appending, "Interesting fact: cats sleep most of their lives," to any math problem more than doubles the chances of a model getting the answer wrong. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs, raising security and reliability concerns. The CatAttack triggers dataset with model responses is available at this https URL.
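The two pieces of arithmetic in the abstract above, appending a fixed trigger to any problem and reporting a relative increase in error rate, can be sketched as below. The helper names and the example error rates are hypothetical; only the trigger sentence is quoted from the abstract.

```python
# Trigger sentence quoted in the abstract (trailing punctuation omitted).
TRIGGER = "Interesting fact: cats sleep most of their lives"

def append_trigger(problem, trigger=TRIGGER):
    """Append a query-agnostic trigger, leaving the problem text itself unchanged."""
    return f"{problem} {trigger}"

def relative_increase(base_error_rate, attacked_error_rate):
    """Relative increase in error rate, as a percentage of the baseline."""
    return 100.0 * (attacked_error_rate - base_error_rate) / base_error_rate

print(append_trigger("What is 17 * 3?"))
# Made-up rates: going from 2% to 8% errors is a 300% relative increase.
print(relative_increase(0.02, 0.08))
```

Note that a 300% increase means the error rate becomes four times the baseline, whereas "more than doubling" corresponds to an increase above 100%.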

Title: $\texttt{SEM-CTRL}$: Semantically Controlled Decoding

Authors: Mohammad Albinhassan, Pranava Madhyastha, Alessandra Russo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01804
Pdf URL: https://arxiv.org/pdf/2503.01804
Copy Paste: [[2503.01804]] $\texttt{SEM-CTRL}$: Semantically Controlled Decoding(https://arxiv.org/abs/2503.01804)
Keywords: language model, llm
Abstract: Ensuring both syntactic and semantic correctness in Large Language Model (LLM) outputs remains a significant challenge, despite being critical for real-world deployment. In this paper, we introduce $\texttt{SEM-CTRL}$, a unified approach that enforces rich context-sensitive constraints and task- and instance-specific semantics directly on an LLM decoder. Our approach integrates token-level MCTS, which is guided by specific syntactic and semantic constraints. The constraints over the desired outputs are expressed using Answer Set Grammars -- a logic-based formalism that generalizes context-sensitive grammars while incorporating background knowledge to represent task-specific semantics. We show that our approach guarantees correct completions for any off-the-shelf LLM without the need for fine-tuning. We evaluate $\texttt{SEM-CTRL}$ on a range of tasks, including synthetic grammar synthesis, combinatorial reasoning, and planning. Our results demonstrate that $\texttt{SEM-CTRL}$ allows small pre-trained LLMs to efficiently outperform larger variants and state-of-the-art reasoning models (e.g., o1-preview) while simultaneously guaranteeing solution correctness.

Title: Large-Scale Data Selection for Instruction Tuning

Authors: Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, Pradeep Dasigi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01807
Pdf URL: https://arxiv.org/pdf/2503.01807
Copy Paste: [[2503.01807]] Large-Scale Data Selection for Instruction Tuning(https://arxiv.org/abs/2503.01807)
Keywords: language model
Abstract: Selecting high-quality training data from a larger pool is a crucial step when instruction-tuning language models, as carefully curated datasets often produce models that outperform those trained on much larger, noisier datasets. Automated data selection approaches for instruction-tuning are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples). However, popular deployed instruction-tuned models often train on hundreds of thousands to millions of samples, subsampled from even larger data pools. We present a systematic study of how well data selection methods scale to these settings, selecting up to 2.5M samples from pools of up to 5.8M samples and evaluating across 7 diverse tasks. We show that many recently proposed methods fall short of random selection in this setting (while using more compute), and even decline in performance when given access to larger pools of data to select over. However, we find that a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods across all settings tested -- all whilst being more compute-efficient. Our findings highlight that the scaling properties of proposed automated selection methods should be more closely examined. We release our code, data, and models at this https URL.
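Summary: Selecting high-quality training data from a larger pool is a crucial step in instruction-tuning language models, because carefully curated datasets often yield models that outperform those trained on much larger, noisier datasets. Automated data selection methods for instruction tuning are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples). Popular deployed instruction-tuned models, however, often train on hundreds of thousands to millions of samples drawn from even larger pools. We present a systematic study of how well data selection methods scale to these settings, selecting up to 2.5M samples from pools of up to 5.8M samples and evaluating on 7 diverse tasks. We show that many recently proposed methods fall short of random selection in this setting while using more compute, and that their performance even declines when they are given larger pools to select from. However, a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods in every setting tested while also being more compute-efficient. Our findings highlight that the scaling properties of proposed automated selection methods deserve closer examination. We release our code, data, and models at this https URL.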
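
The abstract above describes RDS+ only as "weighted mean pooling of pretrained LM hidden states". The following is a minimal sketch of that idea, not the authors' implementation. The position-proportional weights, the cosine-similarity scoring, and the helper names (weighted_mean_pool, cosine, select_top_k) are all assumptions made for illustration; the hidden states are plain Python lists standing in for real model activations.

```python
import math


def weighted_mean_pool(hidden_states, weights=None):
    """Pool per-token vectors (one list per token) into a single embedding."""
    n = len(hidden_states)
    if weights is None:
        # Assumption: weight grows linearly with token position (1, 2, ..., n).
        weights = [i + 1 for i in range(n)]
    total = float(sum(weights))
    dim = len(hidden_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) / total
        for d in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_top_k(pool_embeddings, query_embedding, k):
    """Return indices of the k pool items most similar to the query embedding."""
    ranked = sorted(
        range(len(pool_embeddings)),
        key=lambda i: cosine(pool_embeddings[i], query_embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy usage: three "samples", each a short sequence of 2-d hidden states.
samples = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
pool = [weighted_mean_pool(s) for s in samples]
chosen = select_top_k(pool, query_embedding=[1.0, 0.0], k=2)
```

In this toy setup the selection step ranks pool items against a single query vector; how the query set is built and how scores are aggregated at scale is not stated in the abstract and is left out of the sketch.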

Title: Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

Authors: Nimet Beyza Bozdag, Shuhaib Mehri, Gokhan Tur, Dilek Hakkani-Tür
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2503.01829
Pdf URL: https://arxiv.org/pdf/2503.01829
Copy Paste: [[2503.01829]] Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models(https://arxiv.org/abs/2503.01829)
Keywords: language model, gpt, llm, agent
Abstract: Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used for social good, they also present risks of potential misuse. Moreover, LLMs' susceptibility to persuasion raises concerns about alignment with ethical principles. To study these dynamics, we introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasion through multi-agent interactions. Here, Persuader agents engage in multi-turn conversations with the Persuadee agents, allowing us to measure LLMs' persuasive effectiveness and their susceptibility to persuasion. We conduct comprehensive evaluations across diverse LLMs, ensuring each model is assessed against others in both subjective and misinformation contexts. We validate the efficacy of our framework through human evaluations and show alignment with prior work. PMIYC offers a scalable alternative to human annotation for studying persuasion in LLMs. Through PMIYC, we find that Llama-3.3-70B and GPT-4o exhibit similar persuasive effectiveness, outperforming Claude 3 Haiku by 30%. However, GPT-4o demonstrates over 50% greater resistance to persuasion for misinformation compared to Llama-3.3-70B. These findings provide empirical insights into the persuasive dynamics of LLMs and contribute to the development of safer AI systems.
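Summary: Large language models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used for social good, they also carry risks of misuse. Moreover, LLMs' susceptibility to persuasion raises concerns about alignment with ethical principles. To study these dynamics, we introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasion through multi-agent interactions. Here, Persuader agents hold multi-turn conversations with Persuadee agents, allowing us to measure both LLMs' persuasive effectiveness and their susceptibility to persuasion. We conduct comprehensive evaluations across diverse LLMs, ensuring that each model is assessed against the others in both subjective and misinformation contexts. We validate the framework through human evaluations and show that it aligns with prior work. PMIYC offers a scalable alternative to human annotation for studying persuasion in LLMs. Through PMIYC, we find that Llama-3.3-70B and GPT-4o exhibit similar persuasive effectiveness, outperforming Claude 3 Haiku by 30%. However, GPT-4o shows over 50% greater resistance to persuasion for misinformation than Llama-3.3-70B. These findings provide empirical insight into the persuasive dynamics of LLMs and contribute to the development of safer AI systems.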

Title: From Language to Cognition: How LLMs Outgrow the Human Language Network

Authors: Badr AlKhamissi, Greta Tuckute, Yingtian Tang, Taha Binhuraib, Antoine Bosselut, Martin Schrimpf
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01830
Pdf URL: https://arxiv.org/pdf/2503.01830
Copy Paste: [[2503.01830]] From Language to Cognition: How LLMs Outgrow the Human Language Network(https://arxiv.org/abs/2503.01830)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit remarkable similarity to neural activity in the human language network. However, the key properties of language shaping brain-like representations, and their evolution during training as a function of different tasks remain unclear. We here benchmark 34 training checkpoints spanning 300B tokens across 8 different model sizes to analyze how brain alignment relates to linguistic competence. Specifically, we find that brain alignment tracks the development of formal linguistic competence -- i.e., knowledge of linguistic rules -- more closely than functional linguistic competence. While functional competence, which involves world knowledge and reasoning, continues to develop throughout training, its relationship with brain alignment is weaker, suggesting that the human language network primarily encodes formal linguistic structure rather than broader cognitive functions. We further show that model size is not a reliable predictor of brain alignment when controlling for feature size and find that the correlation between next-word prediction, behavioral alignment and brain alignment fades once models surpass human language proficiency. Finally, using the largest set of rigorous neural language benchmarks to date, we show that language brain alignment benchmarks remain unsaturated, highlighting opportunities for improving future models. Taken together, our findings suggest that the human language network is best modeled by formal, rather than functional, aspects of language.
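Summary: Large language models (LLMs) exhibit remarkable similarity to neural activity in the human language network. However, the key properties of language that shape brain-like representations, and how those representations evolve during training as a function of different tasks, remain unclear. Here we benchmark 34 training checkpoints spanning 300B tokens across 8 model sizes to analyze how brain alignment relates to linguistic competence. Specifically, we find that brain alignment tracks the development of formal linguistic competence (that is, knowledge of linguistic rules) more closely than functional linguistic competence. Functional competence, which involves world knowledge and reasoning, continues to develop throughout training, but its relationship with brain alignment is weaker, suggesting that the human language network primarily encodes formal linguistic structure rather than broader cognitive functions. We further show that model size is not a reliable predictor of brain alignment when controlling for feature size, and that the correlation between next-word prediction, behavioral alignment, and brain alignment fades once models surpass human language proficiency. Finally, using the largest set of rigorous neural language benchmarks to date, we show that language brain alignment benchmarks remain unsaturated, highlighting opportunities for improving future models. Taken together, our findings suggest that the human language network is best modeled by formal, rather than functional, aspects of language.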

Title: Rotary Outliers and Rotary Offset Features in Large Language Models

Authors: André Jonasson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01832
Pdf URL: https://arxiv.org/pdf/2503.01832
Copy Paste: [[2503.01832]] Rotary Outliers and Rotary Offset Features in Large Language Models(https://arxiv.org/abs/2503.01832)
Keywords: language model, llm
Abstract: Transformer-based Large Language Models (LLMs) rely on positional encodings to provide sequence position information to their attention mechanism. Rotary Positional Encodings (RoPE), which encode relative position by rotating queries and keys, have become widely used in modern LLMs. We study the features and patterns that emerge in queries and keys when using rotary embeddings. Our analysis reveals consistent patterns within the same model across layers and attention heads and across different models and architectures. We present and apply analysis techniques and show how the queries and keys use RoPE to construct various attention patterns, including attention sinks. We find and analyze outliers across models in queries and keys and find that they are likely to be found in rotary features with partial cycles. We derive bounds that tell us which rotary frequencies are likely to be selected as outlier features and what minimum angle the query-key rotary pairs in these features tend to exceed, and we verify the bounds empirically with models that differ significantly in architecture.
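Summary: Transformer-based large language models (LLMs) rely on positional encodings to provide sequence position information to their attention mechanism. Rotary Positional Encodings (RoPE), which encode relative position by rotating queries and keys, have become widely used in modern LLMs. We study the features and patterns that emerge in queries and keys when rotary embeddings are used. Our analysis reveals consistent patterns within the same model across layers and attention heads, and across different models and architectures. We present and apply analysis techniques and show how queries and keys use RoPE to construct various attention patterns, including attention sinks. We find and analyze outliers in queries and keys across models and find that they tend to occur in rotary features with partial cycles. We derive bounds that indicate which rotary frequencies are likely to be selected as outlier features and what minimum angle the query-key rotary pairs in these features tend to exceed, and we verify the bounds empirically with models that differ significantly in architecture.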
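
The statement that RoPE "encode[s] relative position by rotating queries and keys" can be illustrated with a small sketch. This is a generic illustration under stated assumptions, not code from the paper: the helper names (rope, dot), the base of 10000, and the choice to pair adjacent dimensions (some implementations pair the first and second halves instead) are assumptions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        # Assumption: pair index i // 2 gets frequency base ** (-i / d).
        theta = base ** (-i / d)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

Because each pair is rotated by an angle linear in position, the query-key dot product depends only on the position difference, so score_near and score_far agree up to floating-point error. Low-frequency pairs (small theta) complete only part of a cycle over a typical context, which is the kind of "partial cycle" feature the abstract refers to.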

Title: CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom

Authors: Yisen Li, Lingfeng Yang, Wenxuan Shen, Pan Zhou, Yao Wan, Weiwei Lin, Dongping Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01836
Pdf URL: https://arxiv.org/pdf/2503.01836
Copy Paste: [[2503.01836]] CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom(https://arxiv.org/abs/2503.01836)
Keywords: language model, llm
Abstract: Distilling advanced Large Language Models' instruction-following capabilities into smaller models using a selected subset has become a mainstream approach in model training. While existing synthetic instruction data selection strategies rely mainly on single-dimensional signals (e.g., reward scores, model perplexity), they fail to capture the complexity of instruction-following across diverse fields. Therefore, we investigate more diverse signals to capture comprehensive instruction-response pair characteristics and propose three foundational metrics that leverage Multi-LLM wisdom, informed by (1) diverse LLM responses and (2) reward model assessment. Building upon base metrics, we propose CrowdSelect, an integrated metric incorporating a clustering-based approach to maintain response diversity. Our comprehensive experiments demonstrate that our foundation metrics consistently improve performance across 4 base models on MT-bench and Arena-Hard. CrowdSelect, efficiently incorporating all metrics, achieves state-of-the-art performance in both Full and LoRA fine-tuning, showing improvements of 4.81% on Arena-Hard and 11.1% on MT-bench with Llama-3.2-3b-instruct. We hope our findings will bring valuable insights for future research in this direction. Code is available at this https URL.
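Summary: Distilling the instruction-following capabilities of advanced large language models into smaller models using a selected data subset has become a mainstream approach in model training. Existing strategies for selecting synthetic instruction data rely mainly on single-dimensional signals (e.g., reward scores, model perplexity) and fail to capture the complexity of instruction following across diverse fields. We therefore investigate more diverse signals to capture comprehensive characteristics of instruction-response pairs, and propose three foundational metrics that leverage Multi-LLM wisdom, informed by (1) diverse LLM responses and (2) reward model assessment. Building on these base metrics, we propose CrowdSelect, an integrated metric that incorporates a clustering-based approach to maintain response diversity. Comprehensive experiments show that our foundational metrics consistently improve performance across 4 base models on MT-bench and Arena-Hard. CrowdSelect, which efficiently incorporates all metrics, achieves state-of-the-art performance in both Full and LoRA fine-tuning, with improvements of 4.81% on Arena-Hard and 11.1% on MT-bench with Llama-3.2-3b-instruct. We hope our findings will bring valuable insights for future research in this direction. Code is available at this https URL.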

Title: EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Authors: Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01840
Pdf URL: https://arxiv.org/pdf/2503.01840
Copy Paste: [[2503.01840]] EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test(https://arxiv.org/abs/2503.01840)
Keywords: language model, llm, chat
Abstract: The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. The code is available at this https URL.
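Summary: The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods such as EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is to scale up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data yields only limited improvements for EAGLE, and we trace this limitation to EAGLE's feature prediction constraints. In this paper we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion, via a technique named training-time test. These improvements significantly enhance performance and allow the draft model to benefit fully from scaled-up training data. Our experiments cover both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio of up to 6.5x, about a 1.4x improvement over EAGLE-2. The code is available at this https URL.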

Title: Can (A)I Change Your Mind?

Authors: Miriam Havin, Timna Wharton Kleinman, Moran Koren, Yaniv Dover, Ariel Goldstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01844
Pdf URL: https://arxiv.org/pdf/2503.01844
Copy Paste: [[2503.01844]] Can (A)I Change Your Mind?(https://arxiv.org/abs/2503.01844)
Keywords: language model, llm, agent
Abstract: The increasing integration of large language model (LLM) based conversational agents into everyday life raises critical cognitive and social questions about their potential to influence human opinions. Although previous studies have shown that LLM-based agents can generate persuasive content, these typically involve controlled, English-language settings. Addressing this, our preregistered study explored LLMs' persuasive capabilities in more ecological, unconstrained scenarios, examining both static (written paragraphs) and dynamic (conversations via Telegram) interaction types. Conducted entirely in Hebrew with 200 participants, the study assessed the persuasive effects of both LLM and human interlocutors on controversial civil policy topics. Results indicated that participants adopted LLM and human perspectives similarly, with significant opinion changes evident across all conditions, regardless of interlocutor type or interaction mode. Confidence levels increased significantly in most scenarios, except in static LLM interactions. These findings demonstrate LLM-based agents' robust persuasive capabilities across diverse sources and settings, highlighting their potential impact on shaping public opinions.
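Summary: The increasing integration of large language model (LLM) based conversational agents into everyday life raises critical cognitive and social questions about their potential to influence human opinions. Although previous studies have shown that LLM-based agents can generate persuasive content, those studies typically involve controlled, English-language settings. To address this, our preregistered study explored LLMs' persuasive capabilities in more ecological, unconstrained scenarios, examining both static (written paragraphs) and dynamic (conversations via Telegram) interaction types. Conducted entirely in Hebrew with 200 participants, the study assessed the persuasive effects of both LLM and human interlocutors on controversial civil policy topics. Results indicated that participants adopted LLM and human perspectives to a similar degree, with significant opinion changes across all conditions regardless of interlocutor type or interaction mode. Confidence levels increased significantly in most scenarios, except in static LLM interactions. These findings demonstrate the robust persuasive capabilities of LLM-based agents across diverse sources and settings, highlighting their potential impact on shaping public opinion.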