2025-01-28

Title: Unmasking Conversational Bias in AI Multiagent Systems

Authors: Erica Coppolillo, Giuseppe Manco, Luca Maria Aiello
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2501.14844
Pdf URL: https://arxiv.org/pdf/2501.14844
Copy Paste: [[2501.14844]] Unmasking Conversational Bias in AI Multiagent Systems(https://arxiv.org/abs/2501.14844)
Keywords: language model, llm, agent
Abstract: Detecting biases in the outputs produced by generative models is essential to reduce the potential risks associated with their application in critical settings. However, the majority of existing methodologies for identifying biases in generated text consider the models in isolation and neglect their contextual applications. Specifically, the biases that may arise in multi-agent systems involving generative models remain under-researched. To address this gap, we present a framework designed to quantify biases within multi-agent systems of conversational Large Language Models (LLMs). Our approach involves simulating small echo chambers, where pairs of LLMs, initialized with aligned perspectives on a polarizing topic, engage in discussions. Contrary to expectations, we observe significant shifts in the stance expressed in the generated messages, particularly within echo chambers where all agents initially express conservative viewpoints, in line with the well-documented political bias of many LLMs toward liberal positions. Crucially, the bias observed in the echo-chamber experiment remains undetected by current state-of-the-art bias detection methods that rely on questionnaires. This highlights a critical need for the development of a more sophisticated toolkit for bias detection and mitigation for AI multi-agent systems. The code to perform the experiments is publicly available at this https URL.
摘要：检测生成模型输出中的偏差对于降低其在关键环境中的应用可能带来的风险至关重要。然而，现有的大多数用于识别生成文本中偏差的方法都孤立地考虑模型，而忽略了它们的上下文应用。具体而言，涉及生成模型的多智能体系统中可能出现的偏差仍未得到充分研究。为了解决这一差距，我们提出了一个框架，旨在量化对话式大型语言模型 (LLM) 多智能体系统中的偏差。我们的方法涉及模拟小型回音室，其中初始化了对两极分化主题的一致观点的 LLM 对进行讨论。与预期相反，我们观察到生成的消息中表达的立场发生了重大变化，特别是在所有智能体最初都表达保守观点的回音室中，这与许多 LLM 对自由主义立场的政治偏见相一致。至关重要的是，回音室实验中观察到的偏见仍然未被当前最先进的依赖问卷的偏见检测方法检测到。这凸显了开发更复杂的工具包以检测和缓解 AI 多智能体系统的偏见的迫切需求。执行实验的代码可在此 https URL 上公开获取。

Title: JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models

Authors: Michael K. Chen, Xikun Zhang, Dacheng Tao
Subjects: cs.CL, cs.AI, cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2501.14851
Pdf URL: https://arxiv.org/pdf/2501.14851
Copy Paste: [[2501.14851]] JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models(https://arxiv.org/abs/2501.14851)
Keywords: language model, llm
Abstract: Logical reasoning is a critical component of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning capabilities. However, existing deductive reasoning benchmarks, which are crucial for evaluating and advancing LLMs, are inadequate due to their lack of task complexity, presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average, demonstrating substantial room for model improvement. All code and data are available at this https URL
摘要：逻辑推理是大型语言模型 (LLM) 的重要组成部分，近年来，大量的研究工作旨在增强其演绎推理能力。然而，现有的演绎推理基准对于评估和推进 LLM 至关重要，但由于缺乏任务复杂性、先验知识作为混杂因素的存在以及肤浅的错误分析，这些基准存在不足。为了解决这些不足，我们引入了 JustLogic，这是一个合成生成的演绎推理基准，旨在对 LLM 进行严格评估。JustLogic (i) 非常复杂，能够生成各种语言模式、词汇和论证结构；(ii) 独立于先验知识，消除了拥有先验知识的模型的优势，并确保仅使用演绎推理来回答问题；(iii) 能够对推理深度和论证形式对模型准确性的异质影响进行深入的错误分析。我们在 JustLogic 上的实验结果表明，大多数最先进的 (SOTA) LLM 的表现明显低于人类平均水平，表明模型有很大的改进空间。所有代码和数据均可通过此 https URL 获得

Title: Dynamic Adaptation of LoRA Fine-Tuning for Efficient and Task-Specific Optimization of Large Language Models

Authors: Xiaoxuan Liao, Chihang Wang, Shicheng Zhou, Jiacheng Hu, Hongye Zheng, Jia Gao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.14859
Pdf URL: https://arxiv.org/pdf/2501.14859
Copy Paste: [[2501.14859]] Dynamic Adaptation of LoRA Fine-Tuning for Efficient and Task-Specific Optimization of Large Language Models(https://arxiv.org/abs/2501.14859)
Keywords: language model, llm
Abstract: This paper presents a novel fine-tuning methodology for large language models: dynamic LoRA. Building on the standard Low-Rank Adaptation framework, this methodology adds dynamic adaptation mechanisms to improve efficiency and performance. The key contribution of dynamic LoRA lies in its adaptive weight allocation mechanism coupled with an input feature-based adaptive strategy. These enhancements allow for a more precise fine-tuning process that is better tailored to specific tasks. Traditional LoRA methods use static adapter settings and do not consider the differing importance of model layers. In contrast, dynamic LoRA introduces a mechanism that dynamically evaluates each layer's importance during fine-tuning. This evaluation enables the reallocation of adapter parameters to fit the unique demands of each individual task, which leads to better optimization results. Another gain in flexibility arises from the consideration of the input feature distribution, which helps the model generalize better when faced with complicated and diverse datasets. The joint approach boosts not only the performance on each single task but also the generalization ability of the model. The efficiency of dynamic LoRA was validated in experiments on benchmark datasets, such as GLUE, with surprising results. More specifically, this method achieved 88.1% accuracy with an F1-score of 87.3%. Notably, these improvements came with only a slight increase in computational cost: 0.1% more resources than standard LoRA. This balance between performance and efficiency positions dynamic LoRA as a practical, scalable solution for fine-tuning LLMs, especially in resource-constrained scenarios. Furthermore, its adaptability makes it a promising foundation for more advanced applications, including multimodal tasks.
摘要：本文介绍了一种新型的大型语言模型微调方法——动态 LoRA。该方法以标准低秩自适应框架为基础，进一步增加了动态自适应机制，以提高效率和性能。动态 LoRA 的关键贡献在于其自适应权重分配机制以及基于输入特征的自适应策略。这些增强功能允许更精确的微调过程，更适合特定任务。传统的 LoRA 方法使用静态适配器设置，不考虑模型层的不同重要性。相比之下，动态 LoRA 引入了一种在微调过程中动态评估层重要性的机制。这种评估可以重新分配适配器参数以适应每个单独任务的独特需求，从而获得更好的优化结果。灵活性的另一个提升来自于对输入特征分布的考虑，这有助于模型在面对复杂多样的数据集时更好地泛化。联合方法不仅提高了每个单一任务的性能，还提高了模型的泛化能力。动态 LoRA 的效率在基准数据集（如 GLUE）上的实验中得到了验证，结果令人惊讶。更具体地说，该方法实现了 88.1% 的准确率，F1 得分为 87.3%。值得注意的是，这些改进是在计算成本略有增加的情况下实现的：仅比标准 LoRA 多 0.1% 的资源。性能和效率之间的这种平衡使动态 LoRA 成为微调 LLM 的实用、可扩展解决方案，尤其是在资源受限的情况下。更进一步说，它的适应性使其成为包括多模式任务在内的更高级应用的有希望的基础。
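
Sketch (not from the paper): the abstract does not say how adapter parameters are reallocated across layers. As one possible reading, the snippet below splits a fixed LoRA rank budget across layers in proportion to per-layer importance scores. The function name `allocate_ranks`, the proportional rule, and the `min_rank` floor are all assumptions made for illustration.

```python
def allocate_ranks(importance, total_rank, min_rank=1):
    """Split a total LoRA rank budget across layers by importance.

    Hypothetical rule (not taken from the paper): every layer gets
    `min_rank`, and the remaining budget is shared in proportion to the
    non-negative importance scores, using largest-remainder rounding so
    the ranks always sum to `total_rank`.
    """
    n = len(importance)
    if n == 0 or total_rank < n * min_rank:
        raise ValueError("budget too small for the number of layers")
    if any(score < 0 for score in importance):
        raise ValueError("importance scores must be non-negative")

    spare = total_rank - n * min_rank
    total = sum(importance)
    if total == 0:
        shares = [spare / n] * n  # no signal: split evenly
    else:
        shares = [spare * score / total for score in importance]

    ranks = [min_rank + int(share) for share in shares]
    leftover = total_rank - sum(ranks)
    # Hand out the leftover units to the largest fractional parts first.
    by_fraction = sorted(
        range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in by_fraction[:leftover]:
        ranks[i] += 1
    return ranks


if __name__ == "__main__":
    # Three layers, the last one judged twice as important as the others.
    print(allocate_ranks([1.0, 1.0, 2.0], total_rank=16))
```

In a real training loop the importance scores would have to be re-estimated periodically (for example from gradient norms), which is outside the scope of this sketch.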

Title: DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images

Authors: Sami Baral, Li Lucy, Ryan Knight, Alice Ng, Luca Soldaini, Neil T. Heffernan, Kyle Lo
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2501.14877
Pdf URL: https://arxiv.org/pdf/2501.14877
Copy Paste: [[2501.14877]] DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images(https://arxiv.org/abs/2501.14877)
Keywords: language model
Abstract: In real-world settings, vision language models (VLMs) should robustly handle naturalistic, noisy visual content as well as domain-specific language and concepts. For example, K-12 educators using digital learning platforms may need to examine and provide feedback across many images of students' math work. To assess the potential of VLMs to support educators in settings like this one, we introduce DrawEduMath, an English-language dataset of 2,030 images of students' handwritten responses to K-12 math problems. Teachers provided detailed annotations, including free-form descriptions of each image and 11,661 question-answer (QA) pairs. These annotations capture a wealth of pedagogical insights, ranging from students' problem-solving strategies to the composition of their drawings, diagrams, and writing. We evaluate VLMs on teachers' QA pairs, as well as 44,362 synthetic QA pairs derived from teachers' descriptions using language models (LMs). We show that even state-of-the-art VLMs leave much room for improvement on DrawEduMath questions. We also find that synthetic QAs, though imperfect, can yield similar model rankings as teacher-written QAs. We release DrawEduMath to support the evaluation of VLMs' abilities to reason mathematically over images gathered with educational contexts in mind.
摘要：在现实环境中，视觉语言模型 (VLM) 应能够稳健地处理自然的、嘈杂的视觉内容以及特定领域的语言和概念。例如，使用数字学习平台的 K-12 教育工作者可能需要检查学生数学作业的许多图像并提供反馈。为了评估 VLM 在此类环境中支持教育工作者的潜力，我们引入了 DrawEduMath，这是一个英语数据集，包含 2,030 张学生对 K-12 数学问题手写答案的图像。教师提供了详细的注释，包括每幅图像的自由形式描述和 11,661 个问答 (QA) 对。这些注释捕捉了大量的教学见解，从学生的解决问题策略到他们的绘画、图表和写作的构图。我们根据教师的 QA 对以及使用语言模型 (LM) 从教师描述中得出的 44,362 个合成 QA 对来评估 VLM。我们表明，即使是最先进的 VLM 在 DrawEduMath 问题上也有很大改进空间。我们还发现，合成 QA 虽然不完美，但可以产生与教师编写的 QA 类似的模型排名。我们发布 DrawEduMath 是为了支持评估 VLM 在考虑教育背景的情况下对收集的图像进行数学推理的能力。

Title: Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics

Authors: Ameya Godbole, Robin Jia
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.14883
Pdf URL: https://arxiv.org/pdf/2501.14883
Copy Paste: [[2501.14883]] Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics(https://arxiv.org/abs/2501.14883)
Keywords: language model, retrieval-augmented generation
Abstract: Improvements in large language models have led to increasing optimism that they can serve as reliable evaluators of natural language generation outputs. In this paper, we challenge this optimism by thoroughly re-evaluating five state-of-the-art factuality metrics on a collection of 11 datasets for summarization, retrieval-augmented generation, and question answering. We find that these evaluators are inconsistent with each other and often misestimate system-level performance, both of which can lead to a variety of pitfalls. We further show that these metrics exhibit biases against highly paraphrased outputs and outputs that draw upon faraway parts of the source documents. We urge users of these factuality metrics to proceed with caution and manually validate the reliability of these metrics in their domain of interest before proceeding.
摘要：大型语言模型的改进使人们越来越乐观地认为它们可以作为自然语言生成输出的可靠评估器。在本文中，我们通过彻底重新评估 11 个数据集上的五个最先进的事实性指标来挑战这种乐观情绪，这些数据集用于摘要、检索增强生成和问答。我们发现这些评估器彼此不一致，并且经常错误估计系统级性能，这两者都可能导致各种陷阱。我们进一步表明，这些指标对高度释义的输出和利用源文档遥远部分的输出表现出偏见。我们敦促这些事实性指标的用户谨慎行事，并在继续之前手动验证这些指标在其感兴趣领域中的可靠性。

Title: Self-reflecting Large Language Models: A Hegelian Dialectical Approach

Authors: Sara Abdali, Can Goksen, Saeed Amizadeh, Kazuhito Koishida
Subjects: cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2501.14917
Pdf URL: https://arxiv.org/pdf/2501.14917
Copy Paste: [[2501.14917]] Self-reflecting Large Language Models: A Hegelian Dialectical Approach(https://arxiv.org/abs/2501.14917)
Keywords: language model, llm, agent
Abstract: Investigating NLP through a philosophical lens has recently caught researchers' attention, as it connects computational methods with classical schools of philosophy. This paper introduces a philosophical approach inspired by the Hegelian Dialectic for LLMs' self-reflection, utilizing a self-dialectical approach to emulate internal critiques and then synthesize new ideas by resolving the contradicting points. Moreover, this paper investigates the effect of the LLM's generation temperature by establishing both a dynamic annealing approach, which promotes creativity in the early stages and gradually refines it by focusing on nuances, and a fixed-temperature strategy. Our proposed approach is examined to determine its ability to generate novel ideas from an initial proposition. Additionally, a Multi Agent Majority Voting (MAMV) strategy is leveraged to assess the validity and novelty of the generated ideas, which proves beneficial in the absence of domain experts. Our experiments show promise in generating new ideas and provide a stepping-stone for future research.
摘要：最近，通过哲学视角研究 NLP 引起了研究人员的关注，因为它将计算方法与古典哲学流派联系起来。本文介绍了一种受黑格尔辩证法启发的 LLM 自我反思的哲学方法，利用自我辩证法模拟内部批评，然后通过解决矛盾点来综合新的想法。此外，本文通过建立动态退火方法研究了 LLM 生成温度的影响，该方法在早期阶段促进创造力，并通过关注细微差别逐渐完善它，以及固定的生成温度策略。我们提出的方法经过检验，以确定其从初始命题产生新想法的能力。此外，利用多智能体多数投票 (MAMV) 策略来评估所生成想法的有效性和新颖性，这在没有领域专家的情况下被证明是有益的。我们的实验显示出产生新想法的希望，并为未来的研究提供了垫脚石。
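
Sketch (not from the paper): the abstract describes temperature annealing only qualitatively (high temperature early, lower later). The snippet below shows one simple way such a schedule could look, using linear decay. The function name, the linear shape, and the default start and end temperatures are assumptions for illustration.

```python
def annealed_temperature(step, total_steps, t_start=1.0, t_end=0.2):
    """Linearly decay the sampling temperature over dialectical rounds.

    Hypothetical schedule: round 0 uses `t_start` (more exploratory
    sampling) and the last round uses `t_end` (more conservative
    sampling). Steps outside [0, total_steps - 1] are clamped.
    """
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")
    if total_steps == 1:
        return t_start
    clamped = min(max(step, 0), total_steps - 1)
    fraction = clamped / (total_steps - 1)
    return t_start + (t_end - t_start) * fraction


if __name__ == "__main__":
    rounds = 5
    for step in range(rounds):
        print(step, round(annealed_temperature(step, rounds), 3))
```

A fixed-temperature baseline corresponds to setting `t_start` equal to `t_end`.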

Title: Context-Aware Neural Gradient Mapping for Fine-Grained Instruction Processing

Authors: David Boldo, Lily Pemberton, Gabriel Thistledown, Jacob Fairchild, Felix Kowalski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.14936
Pdf URL: https://arxiv.org/pdf/2501.14936
Copy Paste: [[2501.14936]] Context-Aware Neural Gradient Mapping for Fine-Grained Instruction Processing(https://arxiv.org/abs/2501.14936)
Keywords: language model
Abstract: The integration of contextual embeddings into the optimization processes of large language models is an advancement in natural language processing. The Context-Aware Neural Gradient Mapping framework introduces a dynamic gradient adjustment mechanism, incorporating contextual embeddings directly into the optimization process. This approach facilitates real-time parameter adjustments, enhancing task-specific generalization even in the presence of sparse or noisy data inputs. The mathematical foundation of this framework relies on gradient descent modifications, where contextual embeddings are derived from a supplementary neural network trained to map input features to optimal adaptation gradients. By employing differential geometry principles, high-dimensional input dependencies are encoded into low-dimensional gradient manifolds, enabling efficient adaptation without necessitating the retraining of the entire model. Empirical evaluations demonstrate that the proposed framework consistently outperforms baseline models across various metrics, including accuracy, robustness to noise, and computational efficiency. The integration of context-specific embeddings allows for a more complex understanding of language, thereby improving the model's ability to handle diverse linguistic phenomena. Furthermore, the computational efficiency achieved through this method demonstrates its scalability for large-scale language models operating under diverse constraints.
摘要：将上下文嵌入集成到大型语言模型的优化过程中是自然语言处理的一大进步。上下文感知神经梯度映射框架引入了动态梯度调整机制，将上下文嵌入直接纳入优化过程。这种方法有助于实时调整参数，即使在稀疏或嘈杂的数据输入下也能增强特定于任务的泛化。该框架的数学基础依赖于梯度下降修改，其中上下文嵌入来自经过训练以将输入特征映射到最佳适应梯度的补充神经网络。通过采用微分几何原理，高维输入依赖关系被编码到低维梯度流形中，从而无需重新训练整个模型即可实现高效适应。实证评估表明，所提出的框架在各种指标（包括准确性、抗噪性和计算效率）方面始终优于基线模型。上下文特定嵌入的集成允许对语言有更复杂的理解，从而提高模型处理各种语言现象的能力。此外，通过该方法实现的计算效率证明了其对于在不同约束下运行的大规模语言模型的可扩展性。

Title: CASE-Bench: Context-Aware Safety Evaluation Benchmark for Large Language Models

Authors: Guangzhi Sun, Xiao Zhan, Shutong Feng, Philip C. Woodland, Jose Such
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.14940
Pdf URL: https://arxiv.org/pdf/2501.14940
Copy Paste: [[2501.14940]] CASE-Bench: Context-Aware Safety Evaluation Benchmark for Large Language Models(https://arxiv.org/abs/2501.14940)
Keywords: language model, llm
Abstract: Aligning large language models (LLMs) with human values is essential for their safe deployment and widespread adoption. Current LLM safety benchmarks often focus solely on the refusal of individual problematic queries, which overlooks the importance of the context where the query occurs and may cause undesired refusal of queries under safe contexts that diminish user experience. Addressing this gap, we introduce CASE-Bench, a Context-Aware Safety Evaluation Benchmark that integrates context into safety assessments of LLMs. CASE-Bench assigns distinct, formally described contexts to categorized queries based on Contextual Integrity theory. Additionally, in contrast to previous studies which mainly rely on majority voting from just a few annotators, we recruited a sufficient number of annotators necessary to ensure the detection of statistically significant differences among the experimental conditions based on power analysis. Our extensive analysis using CASE-Bench on various open-source and commercial LLMs reveals a substantial and significant influence of context on human judgments (p<0.0001 from a z-test), underscoring the necessity of context in safety evaluations. We also identify notable mismatches between human judgments and LLM responses, particularly in commercial models within safe contexts.
摘要：将大型语言模型 (LLM) 与人类价值观相结合对于其安全部署和广泛采用至关重要。当前的 LLM 安全基准通常仅关注拒绝个别有问题的查询，这忽略了查询发生的上下文的重要性，并且可能会导致在安全上下文中不必要地拒绝查询，从而降低用户体验。为了解决这一差距，我们推出了 CASE-Bench，这是一个上下文感知安全评估基准，它将上下文集成到 LLM 的安全性评估中。CASE-Bench 根据上下文完整性理论为分类查询分配不同的、正式描述的上下文。此外，与以前主要依赖少数注释者的多数投票的研究相比，我们招募了足够数量的注释者，以确保基于功效分析检测到实验条件之间的统计显着差异。我们使用 CASE-Bench 对各种开源和商业 LLM 进行了广泛的分析，结果显示环境对人类判断有显著影响（z 检验 p<0.0001），强调了环境在安全评估中的必要性。我们还发现人类判断和 LLM 响应之间存在显著的不匹配，尤其是在安全环境中的商业模型中。
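
Sketch (not from the paper): the abstract reports a z-test but does not state which variant was used. For orientation, the snippet below implements a standard two-sided two-proportion z-test, which is one common choice when comparing how often annotators judge a response acceptable under two conditions. The counts in the usage example are made up for illustration and are not results from CASE-Bench.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value). Uses the pooled standard error, which is the
    textbook form under the null hypothesis of equal proportions.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("pooled proportion is 0 or 1; test is undefined")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Illustrative counts only: "respond" votes in a safe vs. unsafe context.
    z, p = two_proportion_z_test(80, 100, 50, 100)
    print(f"z = {z:.3f}, p = {p:.2e}")
```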

Title: ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation

Authors: Alireza Salemi, Julian Killingback, Hamed Zamani
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2501.14956
Pdf URL: https://arxiv.org/pdf/2501.14956
Copy Paste: [[2501.14956]] ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation(https://arxiv.org/abs/2501.14956)
Keywords: language model, llm, prompt
Abstract: Evaluating personalized text generated by large language models (LLMs) is challenging, as only the LLM user, i.e., prompt author, can reliably assess the output, but re-engaging the same individuals across studies is infeasible. This paper addresses the challenge of evaluating personalized text generation by introducing ExPerT, an explainable reference-based evaluation framework. ExPerT leverages an LLM to extract atomic aspects and their evidence from the generated and reference texts, match the aspects, and evaluate their alignment based on content and writing style -- two key attributes in personalized text generation. Additionally, ExPerT generates detailed, fine-grained explanations for every step of the evaluation process, enhancing transparency and interpretability. Our experiments demonstrate that ExPerT achieves a 7.2% relative improvement in alignment with human judgments compared to the state-of-the-art text generation evaluation methods. Furthermore, human evaluators rated the usability of ExPerT's explanations at 4.7 out of 5, highlighting its effectiveness in making evaluation decisions more interpretable.
摘要：评估大型语言模型 (LLM) 生成的个性化文本具有挑战性，因为只有 LLM 用户（即及时作者）才能可靠地评估输出，但在不同的研究中重新让同一个人参与是不可行的。本文通过引入可解释的基于参考的评估框架 ExPerT 来解决评估个性化文本生成的挑战。ExPerT 利用 LLM 从生成文本和参考文本中提取原子方面及其证据，匹配这些方面，并根据内容和写作风格评估它们的一致性——这是个性化文本生成的两个关键属性。此外，ExPerT 为评估过程的每一步生成详细、细粒度的解释，增强了透明度和可解释性。我们的实验表明，与最先进的文本生成评估方法相比，ExPerT 在与人类判断的一致性方面实现了 7.2% 的相对改善。此外，人类评估者对 ExPerT 解释的可用性的评分为 5 分中的 4.7 分，突显了其在使评估决策更具可解释性方面的有效性。

Title: Federated Retrieval Augmented Generation for Multi-Product Question Answering

Authors: Parshin Shojaee, Sai Sree Harsha, Dan Luo, Akash Maharaj, Tong Yu, Yunyao Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.14998
Pdf URL: https://arxiv.org/pdf/2501.14998
Copy Paste: [[2501.14998]] Federated Retrieval Augmented Generation for Multi-Product Question Answering(https://arxiv.org/abs/2501.14998)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: Recent advancements in Large Language Models and Retrieval-Augmented Generation have boosted interest in domain-specific question-answering for enterprise products. However, AI Assistants often face challenges in multi-product QA settings, requiring accurate responses across diverse domains. Existing multi-domain RAG-QA approaches either query all domains indiscriminately, increasing computational costs and LLM hallucinations, or rely on rigid resource selection, which can limit search results. We introduce MKP-QA, a novel multi-product knowledge-augmented QA framework with probabilistic federated search across domains and relevant knowledge. This method enhances multi-domain search quality by aggregating query-domain and query-passage probabilistic relevance. To address the lack of suitable benchmarks for multi-product QAs, we also present new datasets focused on three Adobe products: Adobe Experience Platform, Target, and Customer Journey Analytics. Our experiments show that MKP-QA significantly boosts multi-product RAG-QA performance in terms of both retrieval accuracy and response quality.
摘要：大型语言模型和检索增强生成方面的最新进展提高了人们对企业产品领域特定问答的兴趣。然而，人工智能助手在多产品 QA 环境中经常面临挑战，需要跨不同领域做出准确的响应。现有的多领域 RAG-QA 方法要么不加区分地查询所有领域，增加计算成本和 LLM 幻觉，要么依赖于严格的资源选择，这可能会限制搜索结果。我们推出了 MKP-QA，这是一种新颖的多产品知识增强 QA 框架，具有跨领域和相关知识的概率联合搜索。此方法通过聚合查询域和查询段落概率相关性来提高多领域搜索质量。为了解决多产品 QA 缺乏合适基准的问题，我们还提供了专注于三款 Adobe 产品的新数据集：Adobe Experience Platform、Target 和 Customer Journey Analytics。我们的实验表明，MKP-QA 在检索准确性和响应质量方面显著提升了多产品 RAG-QA 的性能。
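
Sketch (not from the paper): the abstract says MKP-QA aggregates query-domain and query-passage probabilistic relevance but does not give the formula. One natural reading is to score each passage by the product of the two probabilities and keep the top results across all domains. The function name, the product rule, and the toy domain and passage labels below are assumptions for illustration.

```python
def rank_passages(domain_probs, passage_probs, top_k=3):
    """Rank passages across domains by a joint relevance score.

    domain_probs:  {domain: P(domain relevant | query)}
    passage_probs: {domain: {passage_id: P(passage relevant | query, domain)}}

    Hypothetical aggregation: score = P(domain) * P(passage), so a passage
    from an unlikely domain needs a very strong match to be selected.
    Returns a list of (score, domain, passage_id), best first.
    """
    scored = []
    for domain, p_domain in domain_probs.items():
        for passage_id, p_passage in passage_probs.get(domain, {}).items():
            scored.append((p_domain * p_passage, domain, passage_id))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    domain_probs = {"product_a": 0.5, "product_b": 0.25}
    passage_probs = {
        "product_a": {"a1": 0.5, "a2": 0.125},
        "product_b": {"b1": 0.75},
    }
    for score, domain, passage_id in rank_passages(domain_probs, passage_probs):
        print(f"{score:.4f}  {domain}  {passage_id}")
```

Compared with querying every domain with equal weight, this kind of soft weighting lets a strong passage from a secondary domain still surface, which matches the abstract's stated concern about rigid resource selection.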

Title: MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models

Authors: Zhongpu Chen, Yinfeng Liu, Long Shi, Zhi-Jie Wang, Xingyan Chen, Yu Zhao, Fuji Ren
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2501.15000
Pdf URL: https://arxiv.org/pdf/2501.15000
Copy Paste: [[2501.15000]] MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models(https://arxiv.org/abs/2501.15000)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there are a myriad of metrics to evaluate LLMs, they fail to evaluate readability from the perspective of output content structure. To this end, we focus on an overlooked yet important metric -- Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, by constructing a dataset with 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an accuracy of 84.1% with human judgments, outperforming existing methods by a large margin. Extensive experimental results also show that through fine-tuning on our proposed dataset, less performant open-source models are able to achieve performance comparable to GPT-4o in terms of Markdown Awareness. To ensure reproducibility and transparency, MDEval is open sourced at this https URL.
摘要：大型语言模型 (LLM) 有望提供结构化的 Markdown 响应，以提高网络聊天机器人 (例如 ChatGPT) 的可读性。尽管评估 LLM 的指标多种多样，但它们都未能从输出内容结构的角度评估可读性。为此，我们将重点关注一个被忽视但又很重要的指标——Markdown 感知，它直接影响这些语言模型生成内容的可读性和结构。在本文中，我们通过构建一个包含 20K 个实例的数据集，涵盖英文和中文的 10 个主题，引入了 MDEval，这是一个评估 LLM 的 Markdown 感知的综合基准。与传统的基于模型的评估不同，MDEval 通过结合基于模型的生成任务和统计方法提供了出色的可解释性。我们的结果表明，MDEval 的 Spearman 相关性达到 0.791，与人类的准确率为 84.1%，大大优于现有方法。大量实验结果还表明，通过对我们提出的数据集进行微调，性能较差的开源模型在 Markdown 感知方面能够达到与 GPT-4o 相当的性能。为了确保可重复性和透明度，MDEval 在此 https URL 上开源。
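
Sketch (not from the paper): the Spearman correlation quoted in the abstract measures how well a metric's ranking of responses agrees with a human ranking. The snippet below is a small pure-Python implementation of the standard definition (Pearson correlation of average ranks, with ties sharing their mean rank). The scores in the usage example are invented for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos..end
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    metric_scores = [0.9, 0.4, 0.7, 0.2]  # illustrative automatic scores
    human_scores = [5, 2, 3, 1]           # illustrative human ratings
    print(spearman(metric_scores, human_scores))
```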

Title: AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models

Authors: Zunhai Su, Wang Shen, Linge Li, Zhe Chen, Hanyu Wei, Huangqi Yu, Kehong Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15021
Pdf URL: https://arxiv.org/pdf/2501.15021
Copy Paste: [[2501.15021]] AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models(https://arxiv.org/abs/2501.15021)
Keywords: language model, llm
Abstract: Vision-language models (VLMs) show remarkable performance in multimodal tasks. However, excessively long multimodal inputs lead to oversized Key-Value (KV) caches, resulting in significant memory consumption and I/O bottlenecks. Previous KV quantization methods for Large Language Models (LLMs) may alleviate these issues but overlook the attention saliency differences of multimodal tokens, resulting in suboptimal performance. In this paper, we investigate the attention-aware token saliency patterns in VLM and propose AKVQ-VL. AKVQ-VL leverages the proposed Text-Salient Attention (TSA) and Pivot-Token-Salient Attention (PSA) patterns to adaptively allocate bit budgets. Moreover, achieving extremely low-bit quantization requires effectively addressing outliers in KV tensors. AKVQ-VL utilizes the Walsh-Hadamard transform (WHT) to construct outlier-free KV caches, thereby reducing quantization difficulty. Evaluations of 2-bit quantization on 12 long-context and multimodal tasks demonstrate that AKVQ-VL maintains or even improves accuracy, outperforming LLM-oriented methods. AKVQ-VL can reduce peak memory usage by 2.13x, support up to 3.25x larger batch sizes and 2.46x throughput.
摘要：视觉语言模型 (VLM) 在多模态任务中表现出色。然而，过长的多模态输入会导致键值 (KV) 缓存过大，从而导致严重的内存消耗和 I/O 瓶颈。先前的大型语言模型 (LLM) 的 KV 量化方法可能会缓解这些问题，但忽略了多模态标记的注意力显着性差异，导致性能不佳。在本文中，我们研究了 VLM 中的注意力感知标记显着性模式并提出了 AKVQ-VL。AKVQ-VL 利用提出的文本显着性注意 (TSA) 和枢轴标记显着性注意 (PSA) 模式来自适应地分配比特预算。此外，实现极低比特量化需要有效解决 KV 张量中的异常值。AKVQ-VL 利用 Walsh-Hadamard 变换 (WHT) 构建无异常值的 KV 缓存，从而降低量化难度。对 12 个长上下文和多模态任务的 2 位量化评估表明，AKVQ-VL 保持甚至提高了准确率，优于面向 LLM 的方法。AKVQ-VL 可以将峰值内存使用量降低 2.13 倍，支持高达 3.25 倍的批量大小和 2.46 倍的吞吐量。
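
Sketch (not from the paper): to illustrate why a Walsh-Hadamard transform (WHT) can make low-bit quantization easier, the snippet below applies an orthonormal fast WHT to a vector containing a single large outlier. The energy of the outlier is spread evenly over all coordinates, so the value range a quantizer must cover shrinks. This is a generic textbook WHT, not the authors' implementation, and the example vector is invented.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two vector.

    With the 1/sqrt(n) scaling the transform is its own inverse, so
    fwht(fwht(v)) recovers v up to floating-point error.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    out = [float(v) for v in values]
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1 / math.sqrt(n)
    return [v * scale for v in out]


if __name__ == "__main__":
    channel = [8.0, 0.0, 0.0, 0.0]   # one outlier dominates the range
    rotated = fwht(channel)
    print(rotated)                    # outlier energy spread over all entries
    print(fwht(rotated))              # applying the transform again inverts it
```

Because the transform is orthogonal, it preserves inner products, which is what makes this kind of rotation compatible with attention computations in principle.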

Title: Using Large Language Models for education managements in Vietnamese with low resources

Authors: Duc Do Minh, Vinh Nguyen Van, Thang Dam Cong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.15022
Pdf URL: https://arxiv.org/pdf/2501.15022
Copy Paste: [[2501.15022]] Using Large Language Models for education managements in Vietnamese with low resources(https://arxiv.org/abs/2501.15022)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs), such as GPT-4, Gemini 1.5, Claude 3.5 Sonnet, and Llama3, have demonstrated significant advancements in various NLP tasks since the release of ChatGPT in 2022. Despite their success, fine-tuning and deploying LLMs remain computationally expensive, especially in resource-constrained environments. In this paper, we propose VietEduFrame, a framework specifically designed to apply LLMs to educational management tasks in Vietnamese institutions. Our key contribution is the development of a tailored dataset, derived from student education documents at Hanoi VNU, which addresses the unique challenges faced by educational systems with limited resources. Through extensive experiments, we show that our approach outperforms existing methods in terms of accuracy and efficiency, offering a promising solution for improving educational management in under-resourced environments. While our framework leverages synthetic data to supplement real-world examples, we discuss potential limitations regarding broader applicability and robustness in future implementations.
摘要：自 2022 年 ChatGPT 发布以来，大型语言模型 (LLM)（例如 GPT-4、Gemini 1.5、Claude 3.5 Sonnet 和 Llama3）在各种 NLP 任务中都取得了显著进步。尽管取得了成功，但微调和部署 LLM 仍然需要大量计算，尤其是在资源受限的环境中。在本文中，我们提出了 VietEduFrame，这是一个专门设计用于将 LLM 应用于越南机构教育管理任务的框架。我们的主要贡献包括开发一个定制的数据集，该数据集源自河内 VNU 的学生教育文件，解决了资源有限的教育系统面临的独特挑战。通过大量实验，我们表明我们的方法在准确性和效率方面优于现有方法，为改善资源匮乏环境中的教育管理提供了有希望的解决方案。虽然我们的框架利用合成数据来补充现实世界的示例，但我们讨论了未来实施中更广泛的适用性和稳健性方面的潜在限制。

Title: An Attempt to Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models

Authors: Jaturong Kongmanee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.15054
Pdf URL: https://arxiv.org/pdf/2501.15054
Copy Paste: [[2501.15054]] An Attempt to Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models(https://arxiv.org/abs/2501.15054)
Keywords: language model, gpt, llm
Abstract: This research aims to unravel how large language models (LLMs) iteratively refine token predictions (or, in a general sense, vector predictions). We utilized a logit lens technique to analyze the model's token predictions derived from intermediate representations. Specifically, we focused on how LLMs access and use information from input contexts, and how the positioning of relevant information affects the model's token prediction refinement process. Our findings on a multi-document question answering task with GPT-2, obtained by varying the input context length (the number of documents), revealed that the number of layers between the first layer at which the model predicted the next token correctly and the later layer at which it finalized its correct prediction, as a function of the position of the relevant information (i.e., placed at the beginning, middle, or end of the input context), has a nearly inverted U shape. We found that the gap between these two layers, on average, diminishes when relevant information is positioned at the beginning or end of the input context, suggesting that the model requires more refinement when processing longer contexts with relevant information situated in the middle, and highlighting which layers are essential for determining the correct output. Our analysis provides insights into how token predictions are distributed across different conditions, and establishes important connections to existing hypotheses and previous findings in AI safety research and development.
摘要：本研究旨在揭示大型语言模型 (LLM) 如何迭代细化标记预测（或一般意义上的向量预测）。我们利用逻辑透镜技术分析了模型从中间表示得出的标记预测。具体来说，我们重点研究了 LLM 如何访问和使用来自输入上下文的信息，以及相关信息的定位如何影响模型的标记预测细化过程。我们使用 GPT-2 针对多文档问答任务的研究结果显示，模型正确预测下一个标记的第一层与模型最终确定其正确预测的后续层之间的层数与相关信息的位置（即将相关信息放在输入上下文的开头、中间或结尾）的关系呈近乎倒 U 形。我们发现，当相关信息位于输入上下文的开头或结尾时，这两层之间的差距平均会缩小，这表明模型在处理较长的上下文时需要进行更多改进，而相关信息位于中间，并强调了哪些层对于确定正确的输出至关重要。我们的分析提供了关于标记预测在不同条件下如何分布的见解，并与人工智能安全研究和开发中的现有假设和先前发现建立了重要的联系。
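
Sketch (not from the paper): the layer-gap quantity described in the abstract can be computed once a logit lens has produced a top-1 token for every layer. The snippet below assumes those per-layer predictions are already available as a list and shows one plausible way to define the gap. The function name and the exact definition of the "finalized" layer (the earliest layer from which the prediction stays correct through the last layer) are assumptions, and the example predictions are invented.

```python
def refinement_gap(layer_predictions, target):
    """Measure how many layers separate first-correct from finalized.

    layer_predictions: top-1 token decoded at each layer (index 0 first),
                       e.g. obtained with a logit lens.
    target:            the correct next token.

    Returns (first_correct, finalized, gap), or None when the target is
    never predicted or the final layer is wrong.
    """
    first_correct = next(
        (i for i, token in enumerate(layer_predictions) if token == target),
        None,
    )
    if first_correct is None or layer_predictions[-1] != target:
        return None
    # Walk backwards to the earliest layer of the final correct streak.
    finalized = len(layer_predictions) - 1
    while finalized > 0 and layer_predictions[finalized - 1] == target:
        finalized -= 1
    return first_correct, finalized, finalized - first_correct


if __name__ == "__main__":
    # Illustrative predictions across five layers.
    predictions = ["the", "Paris", "London", "Paris", "Paris"]
    print(refinement_gap(predictions, "Paris"))
```

Averaging this gap over many examples, grouped by where the relevant document sits in the context, would give the kind of position-versus-gap curve the abstract describes.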

Title: LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Authors: Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15089
Pdf URL: https://arxiv.org/pdf/2501.15089
Copy Paste: [[2501.15089]] LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion(https://arxiv.org/abs/2501.15089)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable progress in understanding long-context inputs. However, benchmarks for evaluating the long-context reasoning abilities of LLMs have not kept pace. Existing benchmarks often focus on a narrow range of tasks or on tasks that do not demand complex reasoning. To address this gap and enable a more comprehensive evaluation of the long-context reasoning capabilities of current LLMs, we propose a new synthetic benchmark, LongReason, which is constructed by synthesizing long-context reasoning questions from a varied set of short-context reasoning questions through context expansion. LongReason consists of 794 multiple-choice reasoning questions with diverse reasoning patterns across three task categories: reading comprehension, logical inference, and mathematical word problems. We evaluate 21 LLMs on LongReason, revealing that most models experience significant performance drops as context length increases. Our further analysis shows that even state-of-the-art LLMs still have significant room for improvement in providing robust reasoning across different tasks. We will open-source LongReason to support the comprehensive evaluation of LLMs' long-context reasoning capabilities.
摘要：大型语言模型 (LLM) 在理解长上下文输入方面取得了显著进展。然而，用于评估 LLM 长上下文推理能力的基准却落后于步伐。现有的基准通常侧重于较窄范围的任务或不需要复杂推理的任务。为了弥补这一差距并更全面地评估当前 LLM 的长上下文推理能力，我们提出了一个新的综合基准 LongReason，它通过上下文扩展从一组不同的短上下文推理问题中合成长上下文推理问题而构建。LongReason 包含 794 个具有多种推理模式的多项选择推理问题，涉及三个任务类别：阅读理解、逻辑推理和数学应用题。我们在 LongReason 上评估了 21 个 LLM，发现随着上下文长度的增加，大多数模型的性能都会显著下降。我们的进一步分析表明，即使是最先进的 LLM 在不同任务中提供稳健的推理方面仍有很大改进空间。我们将开源LongReason，以支持对LLM长上下文推理能力的全面评估。

Title: Speech Translation Refinement using Large Language Models

Authors: Huaixia Dou, Xinyu Tian, Xinglin Lyu, Jie Zhu, Junhui Li, Lifan Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15090
Pdf URL: https://arxiv.org/pdf/2501.15090
Copy Paste: [[2501.15090]] Speech Translation Refinement using Large Language Models(https://arxiv.org/abs/2501.15090)
Keywords: language model, gpt, llm
Abstract: Recent advancements in large language models (LLMs) have demonstrated their remarkable capabilities across various language tasks. Inspired by the success of text-to-text translation refinement, this paper investigates how LLMs can improve the performance of speech translation by introducing a joint refinement process. Through the joint refinement of speech translation (ST) and automatic speech recognition (ASR) transcription via LLMs, the performance of the ST model is significantly improved in both training-free in-context learning and parameter-efficient fine-tuning scenarios. Additionally, we explore the effect of document-level context on refinement under the context-aware fine-tuning scenario. Experimental results on the MuST-C and CoVoST 2 datasets, which include seven translation tasks, demonstrate the effectiveness of the proposed approach using several popular LLMs including GPT-3.5-turbo, LLaMA3-8B, and Mistral-12B. Further analysis suggests that jointly refining both transcription and translation yields better performance than refining translation alone. Meanwhile, incorporating document-level context significantly enhances refinement performance. We release our code and datasets on GitHub.
摘要：大型语言模型 (LLM) 的最新进展已在各种语言任务中展现出卓越的能力。受文本到文本翻译细化成功的启发，本文研究了 LLM 如何通过引入联合细化过程来提高语音翻译的性能。通过 LLM 对语音翻译 (ST) 和自动语音识别 (ASR) 转录进行联合细化，ST 模型的性能在无需训练的上下文学习和参数高效的微调场景中都得到了显着提高。此外，我们还探讨了在上下文感知微调场景下文档级上下文对细化的影响。在包含七个翻译任务的 MuST-C 和 CoVoST 2 数据集上的实验结果证明了使用包括 GPT-3.5-turbo、LLaMA3-8B 和 Mistral-12B 在内的几种流行 LLM 所提出方法的有效性。进一步的分析进一步表明，与单独细化翻译相比，联合细化转录和翻译可获得更好的性能。同时，结合文档级上下文可显著提高细化性能。我们在 GitHub 上发布我们的代码和数据集。

Title: Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training

Authors: Xunxin Cai, Chengrui Wang, Qingqing Long, Yuanchun Zhou, Meng Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15108
Pdf URL: https://arxiv.org/pdf/2501.15108
Copy Paste: [[2501.15108]] Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training(https://arxiv.org/abs/2501.15108)
Keywords: language model, gpt, llm, prompt
Abstract: The rapid advancement of large language models (LLMs) in biological-medical applications has highlighted a gap between their potential and the limited scale and often low quality of available open-source annotated textual datasets. In addition, the inherent complexity of the biomedical knowledge hierarchy significantly hampers efforts to bridge this gap. Can LLMs themselves play a pivotal role in overcoming this limitation? Motivated by this question, we investigate this challenge in the present study. We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature. Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain, guided by the biomedical knowledge hierarchy through medical subject headings (MeSH). This comprehensive framework establishes an automated workflow, thereby eliminating the need for manual intervention. Furthermore, we conducted comprehensive experiments to evaluate the impact of our framework-generated data on downstream language models of varying sizes. Our approach substantially improves question-answering tasks compared to pre-trained models from the life sciences domain and powerful closed-source models represented by GPT-4. Notably, the generated AI-Ready dataset enabled the Llama3-70B base model to outperform GPT-4 with MedPrompt, a model with several times as many parameters. Detailed case studies and ablation experiments underscore the significance of each component within our framework.
摘要：大型语言模型 (LLM) 在生物医学应用中的快速发展凸显了其潜力与可用的开源注释文本数据集的有限规模和通常低质量之间的差距。此外，生物医学知识层次结构的固有复杂性严重阻碍了弥合这一限制的努力。LLM 本身在克服这一限制方面发挥了关键作用？受此问题的启发，我们在本篇文章中研究了这一挑战，并提出了一个框架，该框架可自动从大量科学文献中提取高质量的文本训练数据。我们的方法通过医学主题词 (MeSH) 在生物医学知识层次结构的指导下自我评估并生成与生物医学领域更紧密相关的问题。这个全面的框架建立了一个自动化的工作流程，从而无需人工干预。此外，我们进行了全面的实验，以评估我们的框架生成的数据对不同大小的下游语言模型的影响。与生命科学领域的预训练模型和 GPT-4 代表的强大的闭源模型相比，我们的方法大大提高了问答任务的性能。值得注意的是，生成的 AI-Ready 数据集使 Llama3-70B 基础模型在使用 MedPrompt 时能够以多倍的参数量超越 GPT-4。详细的案例研究和消融实验强调了我们框架中每个组件的重要性

Title: Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads

Authors: Xingyang He, Jie Liu, Shaowei Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15113
Pdf URL: https://arxiv.org/pdf/2501.15113
Copy Paste: [[2501.15113]] Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads(https://arxiv.org/abs/2501.15113)
Keywords: language model, llm
Abstract: KV cache is a widely used acceleration technique for large language model (LLM) inference. However, its memory requirement grows rapidly with input length. Previous studies have reduced the size of the KV cache either by removing the same number of unimportant tokens for all attention heads or by allocating differentiated KV cache budgets for pre-identified attention heads. However, because the importance of attention heads varies across different tasks, the pre-identified attention heads fail to adapt effectively to various downstream tasks. To address this issue, we propose Task-KV, a method that leverages the semantic differentiation of attention heads to allocate differentiated KV cache budgets across various tasks. We demonstrate that attention heads far from the semantic center (called heterogeneous heads) make a significant contribution to task outputs and semantic understanding. In contrast, other attention heads play the role of aggregating important information and focusing reasoning. Task-KV allocates full KV cache budget to heterogeneous heads to preserve comprehensive semantic information, while reserving a small number of recent tokens and attention sinks for non-heterogeneous heads. Furthermore, we innovatively introduce middle activations to preserve key contextual information aggregated from non-heterogeneous heads. To dynamically perceive semantic differences among attention heads, we design a semantic separator to distinguish heterogeneous heads from non-heterogeneous ones based on their distances from the semantic center. Experimental results on multiple benchmarks and different model architectures demonstrate that Task-KV significantly outperforms existing baseline methods.
摘要：KV 缓存是一种广泛用于大型语言模型 (LLM) 推理的加速技术。然而，它的内存需求会随着输入长度的增加而迅速增长。先前的研究通过为所有注意力头删除相同数量的不重要的标记或为预先确定的注意力头分配差异化的 KV 缓存预算来减少 KV 缓存的大小。然而，由于注意力头在不同任务中的重要性不同，预先确定的注意力头无法有效适应各种下游任务。为了解决这个问题，我们提出了 Task-KV，一种利用注意力头的语义差异化在不同任务之间分配差异化 KV 缓存预算的方法。我们证明远离语义中心的注意力头（称为异构头）对任务输出和语义理解做出了重大贡献。相反，其他注意力头起着聚合重要信息和聚焦推理的作用。Task-KV 将完整的 KV 缓存预算分配给异构头以保留全面的语义信息，同时为非异构头保留少量最近的标记和注意力接收器。此外，我们创新性地引入了中间激活来保留从非异构头聚合的关键上下文信息。为了动态感知注意头之间的语义差异，我们设计了一个语义分离器，根据它们与语义中心的距离来区分异构头和非异构头。在多个基准和不同模型架构上的实验结果表明，Task-KV 明显优于现有的基线方法。
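
The budget split described in the Task-KV abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the semantic center or the separator is computed, so the sketch assumes the center is the mean of per-head summary vectors and that a fixed number of the farthest heads count as heterogeneous. The function name `allocate_budgets` and all parameters are hypothetical, and the middle activations mentioned in the abstract are left out.

```python
import math
from typing import List


def allocate_budgets(
    head_vectors: List[List[float]],
    full_budget: int,
    sink_tokens: int,
    recent_tokens: int,
    num_heterogeneous: int,
) -> List[int]:
    """Return a per-head KV cache budget (in tokens).

    Assumption: the "semantic center" is the mean of the head vectors,
    and the `num_heterogeneous` heads farthest from it are heterogeneous.
    """
    if not head_vectors:
        return []
    dim = len(head_vectors[0])
    n = len(head_vectors)
    # Mean vector over all heads (assumed definition of the semantic center).
    center = [sum(v[i] for v in head_vectors) / n for i in range(dim)]
    # Euclidean distance of each head from the center.
    distances = [math.dist(v, center) for v in head_vectors]
    # Heads sorted from farthest to nearest.
    ranked = sorted(range(n), key=lambda h: distances[h], reverse=True)
    heterogeneous = set(ranked[:num_heterogeneous])
    # Non-heterogeneous heads keep only attention sinks plus recent tokens.
    small_budget = min(full_budget, sink_tokens + recent_tokens)
    return [full_budget if h in heterogeneous else small_budget for h in range(n)]


# Toy example: three heads cluster near the origin, one lies far away.
heads = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
budgets = allocate_budgets(
    heads, full_budget=4096, sink_tokens=4, recent_tokens=60, num_heterogeneous=1
)
print(budgets)
```

In this toy input only the outlying head keeps the full budget; how the real method picks the number of heterogeneous heads per task is not stated in the abstract.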

Title: Option-ID Based Elimination For Multiple Choice Questions

Authors: Zhenhao Zhu, Bulou Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.15175
Pdf URL: https://arxiv.org/pdf/2501.15175
Copy Paste: [[2501.15175]] Option-ID Based Elimination For Multiple Choice Questions(https://arxiv.org/abs/2501.15175)
Keywords: language model, llm
Abstract: Multiple choice questions (MCQs) are a common and important task for evaluating large language models (LLMs). Based on common strategies humans use when answering MCQs, the process of elimination has been proposed as an effective problem-solving method. Existing approaches to the process of elimination generally fall into two categories: one involves having the model directly select the incorrect answer, while the other involves scoring the options. However, both approaches incur high computational costs and often perform worse than methods that answer based on option ID. To address this issue, this paper proposes a process of elimination based on option ID. We select 10 LLMs and conduct zero-shot experiments on 7 different datasets. The experimental results demonstrate that our method significantly improves the model's performance. Further analysis reveals that the sequential elimination strategy can effectively enhance the model's reasoning ability. Additionally, we find that sequential elimination is also applicable to few-shot settings and can be combined with debiasing methods to further improve model performance.
摘要：多项选择题（MCQ）是评估大型语言模型（LLM）的常见且重要的任务。基于人类回答 MCQ 时使用的常见策略，消除法已被提出作为一种有效的问题解决方法。现有的消除法方法通常分为两类：一类是让模型直接选出错误答案，另一类是对选项进行评分。然而，这两种方法的计算成本都很高，并且通常比基于选项 ID 回答的方法效果更差。针对这一问题，本文提出了一种基于选项 ID 的消除法。我们选取 10 个 LLM，在 7 个不同的数据集上进行零样本实验。实验结果表明，我们的方法显著提高了模型的性能。进一步分析表明，顺序消除策略可以有效增强模型的推理能力。此外，我们发现顺序消除也适用于少样本设置，并且可以与去偏差方法相结合以进一步提高模型性能。
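
The sequential elimination idea can be sketched as a loop that drops one option ID per round until a single ID remains. This is only an illustration under stated assumptions: the abstract does not give the prompt format or the stopping rule, and `pick_wrong_id` is a hypothetical stand-in for an LLM call that returns the ID of an option it judges incorrect.

```python
from typing import Callable, Dict


def eliminate_sequentially(
    options: Dict[str, str],
    pick_wrong_id: Callable[[Dict[str, str]], str],
) -> str:
    """Remove one option per round; return the last remaining option ID."""
    if not options:
        raise ValueError("at least one option is required")
    remaining = dict(options)
    while len(remaining) > 1:
        wrong_id = pick_wrong_id(remaining)
        if wrong_id not in remaining:
            raise ValueError(f"unknown option ID: {wrong_id!r}")
        del remaining[wrong_id]
    return next(iter(remaining))


def toy_picker(remaining: Dict[str, str]) -> str:
    """Hypothetical stand-in for the model: drop the least plausible ID."""
    plausibility = {"A": 0.10, "B": 0.70, "C": 0.15, "D": 0.05}  # made-up scores
    return min(remaining, key=lambda option_id: plausibility[option_id])


question_options = {"A": "3", "B": "4", "C": "5", "D": "22"}  # "What is 2 + 2?"
answer = eliminate_sequentially(question_options, toy_picker)
print(answer)
```

With four options the loop calls the picker three times, so a real deployment would trade extra model calls for the accuracy gains the abstract reports.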

Title: SEAL: Scaling to Emphasize Attention for Long-Context Retrieval

Authors: Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.15225
Pdf URL: https://arxiv.org/pdf/2501.15225
Copy Paste: [[2501.15225]] SEAL: Scaling to Emphasize Attention for Long-Context Retrieval(https://arxiv.org/abs/2501.15225)
Keywords: language model, llm
Abstract: In this work, we introduce a novel approach called Scaling to Emphasize Attention for Long-context retrieval (SEAL), which enhances the retrieval performance of large language models (LLMs) over extended contexts. Previous studies have shown that each attention head in LLMs has a unique functionality and collectively contributes to the overall behavior of the model. Similarly, we observe that specific heads are closely tied to long-context retrieval, showing positive or negative correlation with retrieval scores. Built on this insight, we propose a learning-based mechanism using zero-shot generated data to emphasize these heads, improving the model's performance in long-context retrieval tasks. By applying SEAL, we can achieve significant improvements in in-domain retrieval performance, including document QA tasks from LongBench, and considerable improvements in out-of-domain cases. Additionally, when combined with existing training-free context extension techniques, SEAL extends the context limits of LLMs while maintaining highly reliable outputs, opening new avenues for research in this field.
摘要：在这项工作中，我们引入了一种名为“缩放以强调长上下文检索的注意力”（SEAL）的新方法，该方法可增强大型语言模型 (LLM) 在扩展上下文中的检索性能。先前的研究表明，LLM 中的每个注意力头都具有独特的功能，并且共同对模型的整体行为做出贡献。同样，我们观察到特定的注意力头与长上下文检索密切相关，与检索分数呈正相关或负相关。基于这一见解，我们提出了一种基于学习的机制，使用零样本生成的数据来强调这些注意力头，从而提高模型在长上下文检索任务中的性能。通过应用 SEAL，我们可以显著提高域内检索性能，包括 LongBench 的文档 QA 任务，并在域外案例中取得显著改进。此外，当与现有的无需训练的上下文扩展技术相结合时，SEAL 可以扩展 LLM 的上下文限制，同时保持高度可靠的输出，为该领域的研究开辟新的途径。

Title: Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning

Authors: Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2501.15228
Pdf URL: https://arxiv.org/pdf/2501.15228
Copy Paste: [[2501.15228]] Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning(https://arxiv.org/abs/2501.15228)
Keywords: language model, hallucination, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation (RAG) is extensively utilized to incorporate external, current knowledge into large language models, thereby minimizing hallucinations. A standard RAG pipeline may comprise several components, such as query rewriting, document retrieval, document filtering, and answer generation. However, these components are typically optimized separately through supervised fine-tuning, which can lead to misalignments between the objectives of individual modules and the overarching aim of generating accurate answers in question-answering (QA) tasks. Although recent efforts have explored reinforcement learning (RL) to optimize specific RAG components, these approaches often focus on overly simplistic pipelines with only two components or do not adequately address the complex interdependencies and collaborative interactions among the modules. To overcome these challenges, we propose treating the RAG pipeline as a multi-agent cooperative task, with each component regarded as an RL agent. Specifically, we present MMOA-RAG, a Multi-Module joint Optimization Algorithm for RAG, which employs multi-agent reinforcement learning to harmonize all agents' goals towards a unified reward, such as the F1 score of the final answer. Experiments conducted on various QA datasets demonstrate that MMOA-RAG improves the overall pipeline performance and outperforms existing baselines. Furthermore, comprehensive ablation studies validate the contributions of individual components and the adaptability of MMOA-RAG across different RAG components and datasets. The code of MMOA-RAG is on this https URL.
摘要：检索增强生成 (RAG) 被广泛用于将外部的、当前的知识整合到大型语言模型中，从而最大限度地减少幻觉。标准 RAG 管道可能包含多个组件，例如查询重写、文档检索、文档过滤和答案生成。然而，这些组件通常通过监督微调单独优化，这可能导致各个模块的目标与在问答 (QA) 任务中生成准确答案的总体目标不一致。尽管最近的努力已经探索了强化学习 (RL) 来优化特定的 RAG 组件，但这些方法通常侧重于只有两个组件的过于简单的管道，或者没有充分解决模块之间复杂的相互依赖关系和协作交互。为了克服这些挑战，我们建议将 RAG 管道视为多智能体协作任务，每个组件都被视为一个 RL 智能体。具体来说，我们提出了 MMOA-RAG，一种用于 RAG 的多模块联合优化算法，它采用多智能体强化学习来协调所有智能体的目标，以实现统一的奖励，例如最终答案的 F1 分数。在各种 QA 数据集上进行的实验表明，MMOA-RAG 提高了整体管道性能并优于现有基线。此外，全面的消融研究验证了各个组件的贡献以及 MMOA-RAG 在不同 RAG 组件和数据集中的适应性。MMOA-RAG 的代码位于此 https URL 上。
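
The unified reward mentioned in the MMOA-RAG abstract, the F1 score of the final answer, can be illustrated with the token-level F1 commonly used in QA evaluation. This is a sketch under assumptions: the abstract does not specify tokenization or normalization, so the code lowercases and splits on whitespace, and the agent names in `shared_rewards` are illustrative labels taken from the pipeline components listed above.

```python
from collections import Counter


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Every agent receives the same scalar reward derived from the final answer.
reward = f1_score("the capital is Paris", "Paris")
shared_rewards = {
    agent: reward
    for agent in ("query_rewriter", "document_filter", "answer_generator")
}
print(shared_rewards)
```

Sharing one scalar across agents is what aligns each module with the end goal; any per-agent shaping terms would be an addition not described in the abstract.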

Title: ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval

Authors: Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15245
Pdf URL: https://arxiv.org/pdf/2501.15245
Copy Paste: [[2501.15245]] ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval(https://arxiv.org/abs/2501.15245)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) models have drawn considerable attention in modern open-domain question answering. The effectiveness of RAG depends on the quality of the top retrieved documents. However, conventional retrieval methods sometimes fail to rank the most relevant documents at the top. In this paper, we introduce ASRank, a new re-ranking method based on scoring retrieved documents using zero-shot answer scent, which relies on a pre-trained large language model to compute the likelihood of the document-derived answers aligning with the answer scent. Our approach demonstrates marked improvements across several datasets, including NQ, TriviaQA, WebQA, ArchivalQA, HotpotQA, and Entity Questions. Notably, ASRank increases Top-1 retrieval accuracy on NQ from $19.2\%$ to $46.5\%$ for MSS and from $22.1\%$ to $47.3\%$ for BM25. It also shows strong retrieval performance on several datasets compared to state-of-the-art methods (47.3 Top-1 by ASRank vs. 35.4 by UPR, both with BM25).
摘要：检索增强生成 (RAG) 模型在现代开放域问答中引起了广泛关注。RAG 的有效性取决于顶级检索文档的质量。然而，传统的检索方法有时无法将最相关的文档排在最前面。在本文中，我们介绍了一种新的重新排名方法 ASRank，该方法基于使用零样本答案气味对检索到的文档进行评分，该方法依赖于预先训练的大型语言模型来计算文档派生答案与答案气味一致的可能性。我们的方法在多个数据集上都表现出显著的改进，包括 NQ、TriviaQA、WebQA、ArchivalQA、HotpotQA 和实体问题。值得注意的是，ASRank 将 NQ 上的 Top-1 检索准确率从 MSS 的 $19.2\%$ 提高到 $46.5\%$，将 BM25 的 $22.1\%$ 提高到 $47.3\%$。与最先进的方法相比，它还在多个数据集上表现出强大的检索性能（ASRank 的 Top-1 为 47.3，BM25 的 UPR 为 35.4）。

Title: Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study

Authors: Miao Lin-Zucker, Joël Bellasen, Jean-Daniel Zucker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.15247
Pdf URL: https://arxiv.org/pdf/2501.15247
Copy Paste: [[2501.15247]] Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study(https://arxiv.org/abs/2501.15247)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The use of chatbots in language learning has evolved significantly since the 1960s, becoming more sophisticated platforms as generative AI emerged. These tools now simulate natural conversations, adapting to individual learners' needs, including those studying Chinese. Our study explores how learners can use specific prompts to engage Large Language Models (LLM) as personalized chatbots, aiming to target their language level based on the Common European Framework of Reference for Languages (CEFR) and the European Benchmarking Chinese Language (EBCL) project. Focusing on A1, A1+ and A2 levels, we examine the teaching of Chinese, which presents unique challenges due to its logographic writing system. Our goal is to develop prompts that integrate oral and written skills, using high-frequency character lists and controlling oral lexical productions. These tools, powered by generative AI, aim to enhance language practice by crossing lexical and sinographic recurrence. While generative AI shows potential as a personalized tutor, further evaluation is needed to assess its effectiveness. We conducted a systematic series of experiments using ChatGPT models to evaluate their adherence to constraints specified in the prompts. The results indicate that incorporating level A1 and A1+ characters, along with the associated reference list, significantly enhances compliance with the EBCL character set. Properly prompted, LLMs can increase exposure to the target language and offer interactive exchanges to develop language skills.
摘要：自 20 世纪 60 年代以来，聊天机器人在语言学习中的应用发生了重大变化，随着生成式人工智能的出现，聊天机器人成为了更加复杂的平台。这些工具现在可以模拟自然对话，适应个人学习者的需求，包括学习中文的人。我们的研究探讨了学习者如何使用特定提示来让大型语言模型 (LLM) 作为个性化聊天机器人，旨在根据欧洲语言共同参考框架 (CEFR) 和欧洲汉语基准 (EBCL) 项目确定他们的语言水平。我们专注于 A1、A1+ 和 A2 级别，研究中文教学，由于其表意文字书写系统，中文教学面临独特的挑战。我们的目标是开发集成口语和书面技能的提示，使用高频字符列表并控制口语词汇生成。这些由生成式人工智能驱动的工具旨在通过跨越词汇和汉字重复来增强语言练习。虽然生成式人工智能显示出作为个性化导师的潜力，但需要进一步评估以评估其有效性。我们使用 ChatGPT 模型进行了一系列系统性实验，以评估它们对提示中指定的约束的遵守情况。结果表明，结合 A1 级和 A1+ 级字符以及相关参考列表可显著提高对 EBCL 字符集的遵守程度。在适当的提示下，LLM 可以增加对目标语言的接触，并提供互动交流以培养语言技能。

Title: New Evaluation Paradigm for Lexical Simplification

Authors: Jipeng Qiang, Minjiang Huang, Yi Zhu, Yunhao Yuan, Chaowei Zhang, Xiaoye Ouyang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15268
Pdf URL: https://arxiv.org/pdf/2501.15268
Copy Paste: [[2501.15268]] New Evaluation Paradigm for Lexical Simplification(https://arxiv.org/abs/2501.15268)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Lexical Simplification (LS) methods use a three-step pipeline: complex word identification, substitute generation, and substitute ranking, each with separate evaluation datasets. We found large language models (LLMs) can simplify sentences directly with a single prompt, bypassing the traditional pipeline. However, existing LS datasets are not suitable for evaluating these LLM-generated simplified sentences, as they focus on providing substitutes for single complex words without identifying all complex words in a sentence. To address this gap, we propose a new annotation method for constructing an all-in-one LS dataset through human-machine collaboration. Automated methods generate a pool of potential substitutes, which human annotators then assess, suggesting additional alternatives as needed. Additionally, we explore LLM-based methods with single prompts, in-context learning, and chain-of-thought techniques. We introduce a multi-LLMs collaboration approach to simulate each step of the LS task. Experimental results demonstrate that LS based on multi-LLMs approaches significantly outperforms existing baselines.
摘要：词汇简化 (LS) 方法使用三步流程：复杂词识别、替代词生成和替代词排名，每个流程都有单独的评估数据集。我们发现大型语言模型 (LLM) 可以直接通过单个提示简化句子，从而绕过传统流程。然而，现有的 LS 数据集不适合评估这些 LLM 生成的简化句子，因为它们专注于为单个复杂词提供替代词，而不是识别句子中的所有复杂词。为了解决这一差距，我们提出了一种新的注释方法，用于通过人机协作构建一体化 LS 数据集。自动化方法会生成一个潜在替代词池，然后由人工注释者对其进行评估，并根据需要提出其他替代方案。此外，我们还探索了基于 LLM 的方法，包括单个提示、上下文学习和思路链技术。我们引入了一种多 LLM 协作方法来模拟 LS 任务的每个步骤。实验结果表明，基于多 LLM 方法的 LS 明显优于现有基线。

Title: Pre-training a Transformer-Based Generative Model Using a Small Sepedi Dataset

Authors: Simon P. Ramalepe, Thipe I. Modipa, Marelie H. Davel
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.15281
Pdf URL: https://arxiv.org/pdf/2501.15281
Copy Paste: [[2501.15281]] Pre-training a Transformer-Based Generative Model Using a Small Sepedi Dataset(https://arxiv.org/abs/2501.15281)
Keywords: language model
Abstract: Due to the scarcity of data in low-resourced languages, the development of language models for these languages has been very slow. Currently, pre-trained language models have gained popularity in natural language processing, especially in developing domain-specific models for low-resourced languages. In this study, we experiment with the impact of using occlusion-based techniques when training a language model for a text generation task. We curate 2 new datasets, the Sepedi monolingual (SepMono) dataset from several South African resources and the Sepedi radio news (SepNews) dataset from the radio news domain. We use the SepMono dataset to pre-train transformer-based models using the occlusion and non-occlusion pre-training techniques and compare performance. The SepNews dataset is specifically used for fine-tuning. Our results show that the non-occlusion models perform better than the occlusion-based models when measuring validation loss and perplexity. However, analysis of the generated text using the BLEU score metric, which measures the quality of the generated text, shows a slightly higher BLEU score for the occlusion-based models than for the non-occlusion models.
摘要：由于资源匮乏的语言数据稀缺，这些语言的语言模型开发进展非常缓慢。目前，预训练语言模型在自然语言处理中越来越受欢迎，尤其是在为资源匮乏的语言开发领域特定模型方面。在本研究中，我们试验了在为文本生成任务训练语言模型时使用基于遮挡的技术的影响。我们整理了 2 个新数据集，来自多个南非资源的 Sepedi 单语 (SepMono) 数据集和来自广播新闻领域的 Sepedi 广播新闻 (SepNews) 数据集。我们使用 SepMono 数据集使用遮挡和非遮挡预训练技术对基于 Transformer 的模型进行预训练，并比较性能。SepNews 数据集专门用于微调。我们的结果表明，在测量验证损失和困惑度时，非遮挡模型比基于遮挡的模型表现更好。然而，使用 BLEU 分数指标（衡量生成文本的质量）分析生成的文本，结果显示基于遮挡的模型的 BLEU 分数略高于非遮挡模型。
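
Validation loss and perplexity, the two metrics compared in the Sepedi abstract, are directly related: perplexity is the exponential of the mean per-token cross-entropy when the loss is measured in nats. The sketch below shows that relationship only; the loss values are made-up placeholders, not results from the paper.

```python
import math


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity as exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy_nats)


# Placeholder validation losses, purely illustrative.
val_losses = {"occlusion": 3.2, "non_occlusion": 3.0}
for name, loss in val_losses.items():
    print(f"{name}: loss={loss:.2f} perplexity={perplexity(loss):.1f}")
```

Because the mapping is monotonic, a model that wins on validation loss also wins on perplexity, which is why the two metrics agree in the abstract while BLEU can still differ.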

Title: Are Human Interactions Replicable by Generative Agents? A Case Study on Pronoun Usage in Hierarchical Interactions

Authors: Naihao Deng, Rada Mihalcea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15283
Pdf URL: https://arxiv.org/pdf/2501.15283
Copy Paste: [[2501.15283]] Are Human Interactions Replicable by Generative Agents? A Case Study on Pronoun Usage in Hierarchical Interactions(https://arxiv.org/abs/2501.15283)
Keywords: language model, llm, prompt, agent
Abstract: As Large Language Models (LLMs) advance in their capabilities, researchers have increasingly employed them for social simulation. In this paper, we investigate whether interactions among LLM agents resemble those of humans. Specifically, we focus on the pronoun usage difference between leaders and non-leaders, examining whether the simulation would lead to human-like pronoun usage patterns during the LLMs' interactions. Our evaluation reveals significant discrepancies between LLM-based simulations and human pronoun usage, with prompt-based or specialized agents failing to demonstrate human-like pronoun usage patterns. In addition, we reveal that even if LLMs understand the human pronoun usage patterns, they fail to demonstrate them in the actual interaction process. Our study highlights the limitations of social simulations based on LLM agents, urging caution in using such social simulations in practitioners' decision-making processes.
摘要：随着大型语言模型 (LLM) 功能的不断提升，研究人员越来越多地将其用于社交模拟。在本文中，我们研究了 LLM 代理之间的交互是否与人类的交互相似。具体来说，我们关注领导者和非领导者之间的代词使用差异，研究模拟是否会在 LLM 交互过程中导致类似人类的代词使用模式。我们的评估揭示了基于 LLM 的模拟与人类代词使用之间存在显著差异，基于提示或专门的代理无法展示类似人类的代词使用模式。此外，我们发现即使 LLM 了解人类代词使用模式，它们也无法在实际交互过程中展示这些模式。我们的研究强调了基于 LLM 代理的社交模拟的局限性，敦促从业者在决策过程中谨慎使用此类社交模拟。

Title: You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning

Authors: Ayan Sengupta, Siddhant Chaudhary, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15296
Pdf URL: https://arxiv.org/pdf/2501.15296
Copy Paste: [[2501.15296]] You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning(https://arxiv.org/abs/2501.15296)
Keywords: language model, llm
Abstract: The ever-increasing size of large language models (LLMs) presents significant challenges for deployment due to their heavy computational and memory requirements. Current model pruning techniques attempt to alleviate these issues by relying heavily on external calibration datasets to determine which parameters to prune or compress, thus limiting their flexibility and scalability across different compression ratios. Moreover, these methods often cause severe performance degradation, particularly in downstream tasks, when subjected to higher compression rates. In this paper, we propose PruneNet, a novel model compression method that addresses these limitations by reformulating model pruning as a policy learning process. PruneNet decouples the pruning process from the model architecture, eliminating the need for calibration datasets. It learns a stochastic pruning policy to assess parameter importance solely based on intrinsic model properties while preserving the spectral structure to minimize information loss. PruneNet can compress the LLaMA-2-7B model in just 15 minutes, achieving over 80% retention of its zero-shot performance with a 30% compression ratio, outperforming existing methods that retain only 75% performance. Furthermore, on complex multitask language understanding tasks, PruneNet demonstrates its robustness by preserving up to 80% performance of the original model, proving itself a superior alternative to conventional structured compression techniques.
摘要：大型语言模型 (LLM) 的大小不断增加，由于其对计算和内存的要求很高，给部署带来了重大挑战。当前的模型修剪技术试图通过严重依赖外部校准数据集来确定要修剪或压缩哪些参数来缓解这些问题，从而限制了它们在不同压缩比下的灵活性和可扩展性。此外，这些方法在受到更高压缩率时通常会导致严重的性能下降，尤其是在下游任务中。在本文中，我们提出了 PruneNet，这是一种新颖的模型压缩方法，它通过将模型修剪重新表述为策略学习过程来解决这些限制。PruneNet 将修剪过程与模型架构分离，从而无需校准数据集。它学习一种随机修剪策略，仅根据内在模型属性来评估参数重要性，同时保留谱结构以最大限度地减少信息丢失。 PruneNet 仅用 15 分钟便可压缩 LLaMA-2-7B 模型，在 30% 的压缩率下实现了超过 80% 的零样本性能保留，优于现有方法（仅保留 75% 的性能）。此外，在复杂的多任务语言理解任务中，PruneNet 可保留原始模型高达 80% 的性能，展现出其稳健性，证明其是传统结构化压缩技术的更佳替代方案。

Title: The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders?

Authors: Ayo Adedeji, Mardhiyah Sanni, Emmanuel Ayodele, Sarita Joshi, Tobi Olatunji
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.15310
Pdf URL: https://arxiv.org/pdf/2501.15310
Copy Paste: [[2501.15310]] The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders?(https://arxiv.org/abs/2501.15310)
Keywords: language model, llm
Abstract: The global adoption of Large Language Models (LLMs) in healthcare shows promise to enhance clinical workflows and improve patient outcomes. However, Automatic Speech Recognition (ASR) errors in critical medical terms remain a significant challenge. These errors can compromise patient care and safety if not detected. This study investigates the prevalence and impact of ASR errors in medical transcription in Nigeria, the United Kingdom, and the United States. By evaluating raw and LLM-corrected transcriptions of accented English in these regions, we assess the potential and limitations of LLMs to address challenges related to accents and medical terminology in ASR. Our findings highlight significant disparities in ASR accuracy across regions and identify specific conditions under which LLM corrections are most effective.
摘要：全球医疗保健领域采用大型语言模型 (LLM) 有望改善临床工作流程并改善患者治疗效果。然而，关键医学术语中的自动语音识别 (ASR) 错误仍然是一项重大挑战。如果未检测到这些错误，可能会危及患者护理和安全。本研究调查了尼日利亚、英国和美国医学转录中 ASR 错误的普遍性和影响。通过评估这些地区带口音英语的原始转录和 LLM 校正转录，我们评估了 LLM 在解决 ASR 中与口音和医学术语相关的挑战方面的潜力和局限性。我们的研究结果强调了不同地区 ASR 准确性的显著差异，并确定了 LLM 校正最有效的特定条件。

Title: Figurative-cum-Commonsense Knowledge Infusion for Multimodal Mental Health Meme Classification

Authors: Abdullah Mazhar, Zuhair hasan shaik, Aseem Srivastava, Polly Ruhnke, Lavanya Vaddavalli, Sri Keshav Katragadda, Shweta Yadav, Md Shad Akhtar
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2501.15321
Pdf URL: https://arxiv.org/pdf/2501.15321
Copy Paste: [[2501.15321]] Figurative-cum-Commonsense Knowledge Infusion for Multimodal Mental Health Meme Classification(https://arxiv.org/abs/2501.15321)
Keywords: language model
Abstract: The expression of mental health symptoms through non-traditional means, such as memes, has gained remarkable attention over the past few years, with users often highlighting their mental health struggles through figurative intricacies within memes. While humans rely on commonsense knowledge to interpret these complex expressions, current Multimodal Language Models (MLMs) struggle to capture these figurative aspects inherent in memes. To address this gap, we introduce a novel dataset, AxiOM, derived from the GAD anxiety questionnaire, which categorizes memes into six fine-grained anxiety symptoms. Next, we propose a commonsense and domain-enriched framework, M3H, to enhance MLMs' ability to interpret figurative language and commonsense knowledge. The overarching goal remains to first understand and then classify the mental health symptoms expressed in memes. We benchmark M3H against 6 competitive baselines (with 20 variations), demonstrating improvements in both quantitative and qualitative metrics, including a detailed human evaluation. We observe a clear improvement of 4.20% and 4.66% on weighted-F1 metric. To assess the generalizability, we perform extensive experiments on a public dataset, RESTORE, for depressive symptom identification, presenting an extensive ablation study that highlights the contribution of each module in both datasets. Our findings reveal limitations in existing models and the advantage of employing commonsense to enhance figurative understanding.
摘要：过去几年，通过非传统方式（例如 meme）表达心理健康症状引起了广泛关注，用户经常通过 meme 中的比喻性复杂性来强调他们的心理健康问题。虽然人类依靠常识来解释这些复杂的表达，但当前的多模态语言模型 (MLM) 难以捕捉 meme 中固有的这些比喻性方面。为了解决这一差距，我们引入了一个新数据集 AxiOM，它源自 GAD 焦虑问卷，将 meme 分为六种细粒度的焦虑症状。接下来，我们提出了一个常识和领域丰富的框架 M3H，以增强 MLM 解释比喻性语言和常识性知识的能力。总体目标仍然是首先了解 meme 中表达的心理健康症状，然后对其进行分类。我们将 M3H 与 6 个竞争基线（有 20 种变体）进行基准测试，结果显示定量和定性指标都有所改善，包括详细的人工评估。我们观察到加权 F1 指标的明显改善，分别为 4.20% 和 4.66%。为了评估普遍性，我们在公共数据集 RESTORE 上进行了大量实验，用于识别抑郁症状，并进行了一项广泛的消融研究，突出了两个数据集中每个模块的贡献。我们的研究结果揭示了现有模型的局限性以及利用常识来增强形象理解的优势。

Title: Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection

Authors: Bo Yang, Jiaxian Guo, Yusuke Iwasawa, Yutaka Matsuo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.15355
Pdf URL: https://arxiv.org/pdf/2501.15355
Copy Paste: [[2501.15355]] Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection(https://arxiv.org/abs/2501.15355)
Keywords: language model, llm, agent
Abstract: Recent studies have increasingly demonstrated that large language models (LLMs) possess significant theory of mind (ToM) capabilities, showing the potential for simulating the tracking of mental states in generative agents. In this study, we propose a novel paradigm called ToM-agent, designed to empower LLMs-based generative agents to simulate ToM in open-domain conversational interactions. ToM-agent disentangles the confidence from mental states, facilitating the emulation of an agent's perception of its counterpart's mental states, such as beliefs, desires, and intentions (BDIs). Using past conversation history and verbal reflections, ToM-agent can dynamically adjust counterparts' inferred BDIs, along with related confidence levels. We further put forth a counterfactual intervention method that reflects on the gap between the predicted responses of counterparts and their real utterances, thereby enhancing the efficiency of reflection. Leveraging empathetic and persuasion dialogue datasets, we assess the advantages of implementing the ToM-agent with downstream tasks, as well as its performance in both the first-order and the second-order ToM. Our findings indicate that the ToM-agent can grasp the underlying reasons for their counterpart's behaviors beyond mere semantic-emotional supporting or decision-making based on common sense, providing new insights for studying large-scale LLMs-based simulation of human social behaviors.

Title: Baichuan-Omni-1.5 Technical Report

Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, Shusen Zhang, Xin Wu, Shuai Zhao, Linchu Xiong, Yozhen Wu, Jiahui Ye, Wenhao Lu, Bowen Li, Yan Zhang, Yaqi Zhou, Xin Chen, Lei Su, Hongda Zhang, Fuzhong Chen, Xuezhen Dong, Na Nie, Zhiying Wu, Bin Xiao, Ting Li, Shunya Dang, Ping Zhang, Yijia Sun, Jincheng Wu, Jinjie Yang, Xionghai Lin, Zhi Ma, Kegeng Wu, Jia li, Aiyuan Yang, Hui Liu, Jianqiang Zhang, Xiaoxi Chen, Guangwei Ai, Wentao Zhang, Yicong Chen, Xiaoqin Huang, Kun Li, Wenjing Luo, Yifei Duan, Lingling Zhu, Ran Xiao, Zhe Su, Jiani Pu, Dian Wang, Xu Jia, Tianyu Zhang, Mengyu Ai, Mang Wang, Yujing Qiao, Lei Zhang, Yanjun Shen, Fan Yang, Miao Zhen, Yijie Zhou, Mingyang Chen, Fei Li, Chenzheng Zhu, Keer Lu, Yaqi Zhao, Hao Liang, Youquan Li, Yanzhao Qin, Linzhuang Sun, Jianhua Xu, Haoze Sun, Mingan Lin, Zenan Zhou, Weipeng Chen
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.15368
Pdf URL: https://arxiv.org/pdf/2501.15368
Copy Paste: [[2501.15368]] Baichuan-Omni-1.5 Technical Report(https://arxiv.org/abs/2501.15368)
Keywords: gpt, llm
Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio-tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.

Title: Evaluating the Effectiveness of XAI Techniques for Encoder-Based Language Models

Authors: Melkamu Abay Mersha, Mesay Gemeda Yigezu, Jugal Kalita
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.15374
Pdf URL: https://arxiv.org/pdf/2501.15374
Copy Paste: [[2501.15374]] Evaluating the Effectiveness of XAI Techniques for Encoder-Based Language Models(https://arxiv.org/abs/2501.15374)
Keywords: language model, llm
Abstract: The black-box nature of large language models (LLMs) necessitates the development of eXplainable AI (XAI) techniques for transparency and trustworthiness. However, evaluating these techniques remains a challenge. This study presents a general evaluation framework using four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity. We assess the effectiveness of six explainability techniques from five different XAI categories: model simplification (LIME), perturbation-based methods (SHAP), gradient-based approaches (InputXGradient, Grad-CAM), Layer-wise Relevance Propagation (LRP), and attention mechanisms-based explainability methods (Attention Mechanism Visualization, AMV). The techniques are evaluated across five encoder-based language models (TinyBERT, BERTbase, BERTlarge, XLM-R large, and DeBERTa-xlarge) using the IMDB Movie Reviews and Tweet Sentiment Extraction (TSE) datasets. Our findings show that the model simplification-based XAI method (LIME) consistently performs well across multiple metrics and models, significantly excelling in HA (with a score of 0.9685 on DeBERTa-xlarge), Robustness, and Consistency as the complexity of large language models increases. AMV demonstrates the best Robustness, with scores as low as 0.0020. It also excels in Consistency, achieving near-perfect scores of 0.9999 across all models. Regarding Contrastivity, LRP performs the best, particularly on more complex models, with scores up to 0.9371.

Title: Qwen2.5-1M Technical Report

Authors: An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15383
Pdf URL: https://arxiv.org/pdf/2501.15383
Copy Paste: [[2501.15383]] Qwen2.5-1M Technical Report(https://arxiv.org/abs/2501.15383)
Keywords: gpt
Abstract: We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series have significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.

Title: How Green are Neural Language Models? Analyzing Energy Consumption in Text Summarization Fine-tuning

Authors: Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15398
Pdf URL: https://arxiv.org/pdf/2501.15398
Copy Paste: [[2501.15398]] How Green are Neural Language Models? Analyzing Energy Consumption in Text Summarization Fine-tuning(https://arxiv.org/abs/2501.15398)
Keywords: language model
Abstract: Artificial intelligence systems significantly impact the environment, particularly in natural language processing (NLP) tasks. These tasks often require extensive computational resources to train deep neural networks, including large-scale language models containing billions of parameters. This study analyzes the trade-offs between energy consumption and performance across three neural language models: two pre-trained models (T5-base and BART-base), and one large language model (LLaMA 3-8B). These models were fine-tuned for the text summarization task, focusing on generating research paper highlights that encapsulate the core themes of each paper. A wide range of evaluation metrics, including ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore, were employed to assess their performance. Furthermore, the carbon footprint associated with fine-tuning each model was measured, offering a comprehensive assessment of their environmental impact. This research underscores the importance of incorporating environmental considerations into the design and implementation of neural language models and calls for the advancement of energy-efficient AI methodologies.

Title: Semantic Layered Embedding Diffusion in Large Language Models for Multi-Contextual Consistency

Authors: Irin Kabakum, Thomas Montgomery, Daniel Ravenwood, Genevieve Harrington
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.15405
Pdf URL: https://arxiv.org/pdf/2501.15405
Copy Paste: [[2501.15405]] Semantic Layered Embedding Diffusion in Large Language Models for Multi-Contextual Consistency(https://arxiv.org/abs/2501.15405)
Keywords: language model
Abstract: The Semantic Layered Embedding Diffusion (SLED) mechanism redefines the representation of hierarchical semantics within transformer-based architectures, enabling enhanced contextual consistency across a wide array of linguistic tasks. By introducing a multi-layered diffusion process grounded in spectral analysis, it achieves a complex balance between global and local semantic coherence. Experimental results demonstrate significant improvements in perplexity and BLEU scores, emphasizing the mechanism's ability to adapt effectively across diverse domains, including multilingual and cross-domain text generation. A rigorous mathematical framework underpins the embedding diffusion process, incorporating weighted adjacency matrices, kernel-based refinements, and dynamic layer-wise normalization. Error distribution analysis reveals that SLED addresses challenges in semantic alignment and coherence, outperforming baseline approaches across varied benchmarks. Scalability studies illustrate that its performance gains are maintained consistently across different model sizes, reflecting a practical balance between computational efficiency and linguistic precision. The implementation also achieves energy efficiency, reducing resource consumption during training and inference phases without compromising accuracy. Qualitative case studies further validate its adaptability to extended narratives and context-intensive scenarios, highlighting the mechanism's potential for real-world applications. SLED offers a different perspective on embedding design and its implications for advancing language modeling.

Title: OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas

Authors: Xiaoyang Wang, Hongming Zhang, Tao Ge, Wenhao Yu, Dian Yu, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15427
Pdf URL: https://arxiv.org/pdf/2501.15427
Copy Paste: [[2501.15427]] OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas(https://arxiv.org/abs/2501.15427)
Keywords: language model, gpt, llm, agent
Abstract: Customizable role-playing in large language models (LLMs), also known as character generalization, is gaining increasing attention for its versatility and cost-efficiency in developing and deploying role-playing dialogue agents. This study explores a large-scale data synthesis approach to equip LLMs with character generalization capabilities. We begin by synthesizing large-scale character profiles using personas from Persona Hub and then explore two strategies: response rewriting and response generation, to create character-aligned instructional responses. To validate the effectiveness of our synthetic instruction tuning data for character generalization, we perform supervised fine-tuning (SFT) using the LLaMA-3 8B model. Our best-performing model strengthens the original LLaMA-3 8B Instruct model and achieves performance comparable to GPT-4o models on role-playing dialogue. We release our synthetic characters and instruction-tuning dialogues to support public research.

Title: Token Democracy: The Architectural Limits of Alignment in Transformer-Based Language Models

Authors: Robin Young
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.15446
Pdf URL: https://arxiv.org/pdf/2501.15446
Copy Paste: [[2501.15446]] Token Democracy: The Architectural Limits of Alignment in Transformer-Based Language Models(https://arxiv.org/abs/2501.15446)
Keywords: language model, prompt
Abstract: Modern language models paradoxically combine unprecedented capability with persistent vulnerability in that they can draft poetry yet cannot reliably refuse harmful requests. We reveal this fragility stems not from inadequate training, but from a fundamental architectural limitation: transformers process all tokens as equals. Transformers operate as computational democracies, granting equal voice to all tokens. This is a design tragically unsuited for AGI, where we cannot risk adversarial "candidates" hijacking the system. Through formal analysis, we demonstrate that safety instructions fundamentally lack privileged status in transformer architectures, that they compete with adversarial inputs in the same computational arena, making robust alignment through prompting or fine-tuning inherently limited. This "token democracy" explains why jailbreaks bypass even extensively safety-trained models and why positional shifts erode prompt effectiveness. Our work systematizes practitioners' tacit knowledge into an architectural critique, showing current alignment approaches create mere preferences, not constraints.
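The claim that safety instructions "compete with adversarial inputs in the same computational arena" can be illustrated with plain scaled dot-product attention. The sketch below is not the paper's formal analysis; it is a minimal, assumed illustration in which the function name `attention_weights` and the toy vectors are hypothetical. It shows that the weight a token receives depends only on its query-key score, with no input that marks a token as a privileged instruction.

```python
import math
from typing import List


def attention_weights(query: List[float], keys: List[List[float]]) -> List[float]:
    """Standard scaled dot-product attention weights for one query.

    Note that the function takes no notion of token role (system, user,
    adversarial): every key competes through the same softmax.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Subtract the maximum score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy example: key 0 stands for a safety-instruction token,
# key 1 for an adversarial token whose key aligns more strongly with the query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [3.0, 0.0]]
weights = attention_weights(query, keys)
print(weights)
```

In this toy setting the second token receives the larger weight purely because its score is higher; nothing in the computation reserves attention for the first token.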

Title: STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection

Authors: Zewen Bai, Yuanyuan Sun, Shengdi Yin, Junyu Lu, Jingjie Zeng, Haohao Zhu, Liang Yang, Hongfei Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15451
Pdf URL: https://arxiv.org/pdf/2501.15451
Copy Paste: [[2501.15451]] STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection(https://arxiv.org/abs/2501.15451)
Keywords: llm
Abstract: The proliferation of hate speech has caused significant harm to society. The intensity and directionality of hate are closely tied to the target and argument it is associated with. However, research on hate speech detection in Chinese has lagged behind, and existing datasets lack span-level fine-grained annotations. Furthermore, the lack of research on Chinese hateful slang poses a significant challenge. In this paper, we provide a solution for fine-grained detection of Chinese hate speech. First, we construct a dataset containing Target-Argument-Hateful-Group quadruples (STATE ToxiCN), which is the first span-level Chinese hate speech dataset. Second, we evaluate the span-level hate speech detection performance of existing models using STATE ToxiCN. Finally, we conduct the first study on Chinese hateful slang and evaluate the ability of LLMs to detect such expressions. Our work contributes valuable resources and insights to advance span-level hate speech detection in Chinese.

Title: Data-adaptive Safety Rules for Training Reward Models

Authors: Xiaomin Li, Mingye Gao, Zhiwei Zhang, Jingxuan Fan, Weiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15453
Pdf URL: https://arxiv.org/pdf/2501.15453
Copy Paste: [[2501.15453]] Data-adaptive Safety Rules for Training Reward Models(https://arxiv.org/abs/2501.15453)
Keywords: language model, llm
Abstract: Reinforcement Learning from Human Feedback (RLHF) is commonly employed to tailor models to human preferences, especially to improve the safety of outputs from large language models (LLMs). Traditionally, this method depends on selecting preferred responses from pairs. However, due to the variability in human opinions and the challenges in directly comparing two responses, there is an increasing trend towards fine-grained annotation approaches that evaluate responses using multiple targeted metrics or rules. The challenge lies in efficiently choosing and applying these rules to handle the diverse range of preference data. In this paper, we propose a dynamic method that adaptively selects the most important rules for each response pair. We introduce a mathematical framework that utilizes the maximum discrepancy across paired responses and demonstrate theoretically that this approach maximizes the mutual information between the rule-based annotations and the underlying true preferences. We then train an 8B reward model using this adaptively labeled preference dataset and assess its efficacy using RewardBench. As of January 25, 2025, our model achieved the highest safety performance on the leaderboard, surpassing various larger models.
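The idea of adaptively choosing rules by their discrepancy across a response pair can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the per-rule scores, the rule names, the top-k cutoff `k`, and the helper names `select_rules` and `label_pair` are all hypothetical, and the summed margin used for labeling is an assumed aggregation.

```python
from typing import Dict, List, Tuple


def select_rules(scores_a: Dict[str, float], scores_b: Dict[str, float], k: int = 2) -> List[str]:
    """Return the k rules whose scores differ most between the two responses."""
    shared = sorted(set(scores_a) & set(scores_b))
    # Rank rules by absolute score gap, largest first (an assumed selection criterion).
    ranked = sorted(shared, key=lambda r: abs(scores_a[r] - scores_b[r]), reverse=True)
    return ranked[:k]


def label_pair(scores_a: Dict[str, float], scores_b: Dict[str, float], k: int = 2) -> Tuple[str, List[str]]:
    """Label the preferred response using only the selected rules."""
    rules = select_rules(scores_a, scores_b, k)
    margin = sum(scores_a[r] - scores_b[r] for r in rules)
    preferred = "A" if margin > 0 else "B" if margin < 0 else "tie"
    return preferred, rules


# Hypothetical rule scores for two candidate responses (higher is safer).
scores_a = {"no_violence": 0.9, "privacy": 0.5, "honesty": 0.6}
scores_b = {"no_violence": 0.2, "privacy": 0.55, "honesty": 0.1}
print(label_pair(scores_a, scores_b))
```

Here the "privacy" rule is ignored because the two responses barely differ on it, so the label is driven by the rules that actually separate the pair.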

Title: ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer

Authors: Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15570
Pdf URL: https://arxiv.org/pdf/2501.15570
Copy Paste: [[2501.15570]] ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer(https://arxiv.org/abs/2501.15570)
Keywords: language model, llm
Abstract: As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNN more expressive and demonstrates state tracking ability beyond transformers. We work with QRWK 32B based on RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is an ongoing work that will be updated continuously. The model checkpoints and source code are available at this https URL, this https URL.

Title: Cross-Cultural Fashion Design via Interactive Large Language Models and Diffusion Models

Authors: Spencer Ramsey, Amina Grant, Jeffrey Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15571
Pdf URL: https://arxiv.org/pdf/2501.15571
Copy Paste: [[2501.15571]] Cross-Cultural Fashion Design via Interactive Large Language Models and Diffusion Models(https://arxiv.org/abs/2501.15571)
Keywords: language model, llm, prompt
Abstract: Fashion content generation is an emerging area at the intersection of artificial intelligence and creative design, with applications ranging from virtual try-on to culturally diverse design prototyping. Existing methods often struggle with cultural bias, limited scalability, and alignment between textual prompts and generated visuals, particularly under weak supervision. In this work, we propose a novel framework that integrates Large Language Models (LLMs) with Latent Diffusion Models (LDMs) to address these challenges. Our method leverages LLMs for semantic refinement of textual prompts and introduces a weak supervision filtering module to effectively utilize noisy or weakly labeled data. By fine-tuning the LDM on an enhanced DeepFashion+ dataset enriched with global fashion styles, the proposed approach achieves state-of-the-art performance. Experimental results demonstrate that our method significantly outperforms baselines, achieving lower Frechet Inception Distance (FID) and higher Inception Scores (IS), while human evaluations confirm its ability to generate culturally diverse and semantically relevant fashion content. These results highlight the potential of LLM-guided diffusion models in driving scalable and inclusive AI-driven fashion innovation.

Title: Instruction Tuning for Story Understanding and Generation with Weak Supervision

Authors: Yangshu Yuan, Heng Chen, Christian Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15574
Pdf URL: https://arxiv.org/pdf/2501.15574
Copy Paste: [[2501.15574]] Instruction Tuning for Story Understanding and Generation with Weak Supervision(https://arxiv.org/abs/2501.15574)
Keywords: language model, llm
Abstract: Story understanding and generation have long been a challenging task in natural language processing (NLP), especially when dealing with various levels of instruction specificity. In this paper, we propose a novel approach called "Weak to Strong Instruction Tuning" for improving story generation by tuning models with instructions of varying clarity. We explore the potential of large language models (LLMs) to adapt to different types of instructions, weak and strong, and show that our method significantly enhances performance in story comprehension and generation. By leveraging the strength of instruction tuning, we train models to understand the nuances of story plots, characters, and themes while generating coherent and engaging narratives. Through extensive experiments on several benchmark datasets and comparison with state-of-the-art baselines, we demonstrate that our method outperforms existing techniques, yielding substantial improvements in both automatic evaluation metrics and human evaluations. Our work shows that adaptive instruction tuning can be a powerful tool in refining generative models for complex narrative tasks.

Title: Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework

Authors: Yuhong Sun, Zhangyue Yin, Xuanjing Huang, Xipeng Qiu, Hui Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15581
Pdf URL: https://arxiv.org/pdf/2501.15581
Copy Paste: [[2501.15581]] Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework(https://arxiv.org/abs/2501.15581)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains. Math Word Problems (MWPs) serve as a crucial benchmark for evaluating LLMs' reasoning abilities. While most research primarily focuses on improving accuracy, it often neglects understanding and addressing the underlying patterns of errors. Current error classification methods rely on static and predefined categories, which limit their ability to capture the full spectrum of error patterns in mathematical reasoning. To enable systematic error analysis, we collect error samples from 15 different LLMs of varying sizes across four distinct MWP datasets using multiple sampling strategies. Based on this extensive collection, we introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples that cover diverse error patterns and reasoning paths. To reduce human bias and enable fine-grained analysis of error patterns, we propose a novel framework for automated dynamic error classification in mathematical reasoning. Experimental results demonstrate that dataset characteristics significantly shape error patterns, which evolve from basic to complex manifestations as model capabilities increase. With deeper insights into error patterns, we propose error-aware prompting that incorporates common error patterns as explicit guidance, leading to significant improvements in mathematical reasoning performance.

Title: SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain

Authors: Dakuan Lu, Xiaoyu Tan, Rui Xu, Tianchu Yao, Chao Qu, Wei Chu, Yinghui Xu, Yuan Qi
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2501.15587
Pdf URL: https://arxiv.org/pdf/2501.15587
Copy Paste: [[2501.15587]] SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain(https://arxiv.org/abs/2501.15587)
Keywords: language model, llm
Abstract: Recent breakthroughs in large language models (LLMs) exemplified by the impressive mathematical and scientific reasoning capabilities of the o1 model have spotlighted the critical importance of high-quality training data in advancing LLM performance across STEM disciplines. While the mathematics community has benefited from a growing body of curated datasets, the scientific domain at the higher education level has long suffered from a scarcity of comparable resources. To address this gap, we present SCP-116K, a new large-scale dataset of 116,756 high-quality problem-solution pairs, automatically extracted from heterogeneous sources using a streamlined and highly generalizable pipeline. Our approach involves stringent filtering to ensure the scientific rigor and educational level of the extracted materials, while maintaining adaptability for future expansions or domain transfers. By openly releasing both the dataset and the extraction pipeline, we seek to foster research on scientific reasoning, enable comprehensive performance evaluations of new LLMs, and lower the barrier to replicating the successes of advanced models like o1 in the broader science community. We believe SCP-116K will serve as a critical resource, catalyzing progress in high-level scientific reasoning tasks and promoting further innovations in LLM development. The dataset and code are publicly available at this https URL.

Title: Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets

Authors: Eduard Barbu, Meeri-Ly Muru, Sten Marcus Malva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15624
Pdf URL: https://arxiv.org/pdf/2501.15624
Copy Paste: [[2501.15624]] Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets(https://arxiv.org/abs/2501.15624)
Keywords: language model, gpt
Abstract: This study introduces an approach to Estonian text simplification using two model architectures: a neural machine translation model and a fine-tuned large language model (LLaMA). Given the limited resources for Estonian, we developed a new dataset, the Estonian Simplification Dataset, combining translated data and GPT-4.0-generated simplifications. We benchmarked OpenNMT, a neural machine translation model that frames text simplification as a translation task, and fine-tuned the LLaMA model on our dataset to tailor it specifically for Estonian simplification. Manual evaluations on the test set show that the LLaMA model consistently outperforms OpenNMT in readability, grammaticality, and meaning preservation. These findings underscore the potential of large language models for low-resource languages and provide a basis for further research in Estonian text simplification.
摘要：本研究介绍了一种使用两种模型架构简化爱沙尼亚语文本的方法：神经机器翻译模型和微调大型语言模型 (LLaMA)。鉴于爱沙尼亚语资源有限，我们开发了一个新的数据集，即爱沙尼亚语简化数据集，结合了翻译数据和 GPT-4.0 生成的简化。我们对 OpenNMT（一种将文本简化作为翻译任务的神经机器翻译模型）进行了基准测试，并在我们的数据集上微调了 LLaMA 模型，使其专门针对爱沙尼亚语简化进行定制。对测试集的手动评估表明，LLaMA 模型在可读性、语法性和意义保留方面始终优于 OpenNMT。这些发现强调了大型语言模型对资源匮乏的语言的潜力，并为进一步研究爱沙尼亚语文本简化奠定了基础。

Title: People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text

Authors: Jenna Russell, Marzena Karpinska, Mohit Iyyer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.15654
Pdf URL: https://arxiv.org/pdf/2501.15654
Copy Paste: [[2501.15654]] People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text(https://arxiv.org/abs/2501.15654)
Keywords: gpt, llm, chat
Abstract: In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ('AI vocabulary'), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging to assess for automatic detectors. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.
摘要：在本文中，我们研究了人类检测商业 LLM（GPT-4o、Claude、o1）生成的文本的能力。我们聘请注释者阅读 300 篇非小说类英语文章，将它们标记为人类撰写或 AI 生成，并提供段落长度的解释来解释他们的决定。我们的实验表明，经常使用 LLM 进行写作任务的注释者在检测 AI 生成的文本方面表现出色，即使没有任何专门的培训或反馈。事实上，五位这样的“专家”注释者中的多数票只对 300 篇文章中的 1 篇进行了错误分类，即使在存在释义和人性化等规避策略的情况下，其表现也远远优于我们评估的大多数商业和开源检测器。对专家的自由形式解释进行定性分析表明，虽然他们严重依赖特定的词汇线索（“AI 词汇”），但他们也会发现文本中更复杂的现象（例如形式性、原创性、清晰度），而这些现象对于自动检测器来说很难评估。我们发布了带注释的数据集和代码，以促进未来对人工和自动检测人工智能生成文本的研究。
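
A minimal sketch of the majority-vote aggregation mentioned in the abstract above, assuming binary labels and an odd-sized annotator panel. The function names and the toy data are hypothetical and are not taken from the authors' released code or dataset.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by most annotators.

    Assumes binary labels and an odd panel size, so ties cannot occur.
    """
    if len(labels) % 2 == 0:
        raise ValueError("use an odd number of annotators to avoid ties")
    return Counter(labels).most_common(1)[0][0]


def panel_errors(panel_labels, gold):
    """Count articles where the panel's majority vote differs from the gold label."""
    return sum(
        majority_vote(votes) != truth
        for votes, truth in zip(panel_labels, gold)
    )


# Hypothetical toy data: three articles, five annotators each.
panel_labels = [
    ["ai", "ai", "human", "ai", "ai"],
    ["human", "human", "human", "ai", "human"],
    ["ai", "human", "human", "human", "ai"],
]
gold = ["ai", "human", "ai"]

print(panel_errors(panel_labels, gold))
```

Aggregating over a panel means a single annotator's mistake only changes the outcome when at least two others make the same mistake.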

Title: TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs

Authors: Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.15674
Pdf URL: https://arxiv.org/pdf/2501.15674
Copy Paste: [[2501.15674]] TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs(https://arxiv.org/abs/2501.15674)
Keywords: language model, llm
Abstract: The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block, and cannot efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights, by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, and for both encoder-only and decoder-only architectures, while achieving compression rates of up to approximately 250 times in the MHA weights, all without requiring any additional data, training, or fine-tuning. Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
摘要：大型语言模型 (LLM) 的推理能力可以通过结构化去噪权重来提高，但现有技术主要侧重于对 Transformer 块的前馈网络 (FFN) 进行去噪，而不能有效利用作为 Transformer 架构核心的多头注意力 (MHA) 块。为了解决这个问题，我们提出了一个新颖的直观框架，其核心是通过多头张量化过程和 Tucker 分解执行 MHA 压缩。通过在多个注意力头的权重之间强制共享高维子空间，可以实现 MHA 权重的高维结构化去噪和压缩。我们证明，这种方法可以持续增强 LLM 在多个基准数据集以及仅编码器和仅解码器架构中的推理能力，同时在 MHA 权重中实现高达 $\sim 250$ 倍的压缩率，所有这些都不需要任何额外的数据、训练或微调。此外，我们表明，所提出的方法可以与现有的基于 FFN 的去噪技术无缝结合，以进一步提高 LLM 推理性能。
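
To make the notion of a compression rate under a Tucker decomposition concrete, the sketch below counts parameters for a plain Tucker factorisation of an N-way tensor (one core tensor plus one factor matrix per mode). The tensor layout, the ranks, and the function names are illustrative assumptions. The abstract does not specify them, and the authors' shared-subspace construction may count parameters differently.

```python
from math import prod


def tucker_param_count(shape, ranks):
    """Parameters of a Tucker decomposition: core tensor plus one factor per mode."""
    if len(shape) != len(ranks):
        raise ValueError("shape and ranks must have the same number of modes")
    if any(r < 1 or r > n for n, r in zip(shape, ranks)):
        raise ValueError("each rank must lie between 1 and the mode size")
    core = prod(ranks)                                   # R1 * R2 * ... * RN
    factors = sum(n * r for n, r in zip(shape, ranks))   # sum of In * Rn
    return core + factors


def compression_ratio(shape, ranks):
    """Dense parameter count divided by the Tucker parameter count."""
    return prod(shape) / tucker_param_count(shape, ranks)


# Hypothetical layout: (d_model, d_head, {Q, K, V, O}, number of heads).
shape = (768, 64, 4, 12)
ranks = (32, 16, 4, 4)  # assumed ranks, chosen only for illustration
print(round(compression_ratio(shape, ranks), 1))
```

Lower ranks give higher compression. Setting every rank to the full mode size makes the factorisation larger than the dense tensor, so the ratio drops below 1.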

Title: Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts

Authors: Haodi Ma, Dzmitry Kasinets, Daisy Zhe Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.15688
Pdf URL: https://arxiv.org/pdf/2501.15688
Copy Paste: [[2501.15688]] Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts(https://arxiv.org/abs/2501.15688)
Keywords: language model
Abstract: Multimodal knowledge graph completion (MMKGC) aims to predict missing links in multimodal knowledge graphs (MMKGs) by leveraging information from various modalities alongside structural data. Existing MMKGC approaches primarily extend traditional knowledge graph embedding (KGE) models, which often require creating an embedding for every entity. This results in large model sizes and inefficiencies in integrating multimodal information, particularly for real-world graphs. Meanwhile, Transformer-based models have demonstrated competitive performance in knowledge graph completion (KGC). However, their focus on single-modal knowledge limits their capacity to utilize cross-modal information. Recently, Large vision-language models (VLMs) have shown potential in cross-modal tasks but are constrained by the high cost of training. In this work, we propose a novel approach that integrates Transformer-based KGE models with cross-modal context generated by pre-trained VLMs, thereby extending their applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform relevant visual information from entities and their neighbors into textual sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the model with the generated cross-modal context. This simple yet effective method significantly reduces model size compared to traditional KGE approaches while achieving competitive performance across multiple large-scale datasets with minimal hyperparameter tuning.
摘要：多模态知识图谱补全 (MMKGC) 旨在通过利用来自各种模态的信息以及结构数据来预测多模态知识图谱 (MMKG) 中的缺失链接。现有的 MMKGC 方法主要扩展传统的知识图谱嵌入 (KGE) 模型，该模型通常需要为每个实体创建一个嵌入。这导致模型尺寸过大，并且在集成多模态信息时效率低下，尤其是对于现实世界的图。同时，基于 Transformer 的模型在知识图谱补全 (KGC) 中表现出了有竞争力的性能。然而，它们对单模态知识的关注限制了它们利用跨模态信息的能力。最近，大型视觉语言模型 (VLM) 在跨模态任务中显示出潜力，但受到高昂训练成本的限制。在这项工作中，我们提出了一种新方法，将基于 Transformer 的 KGE 模型与预训练的 VLM 生成的跨模态上下文相结合，从而将其适用性扩展到 MMKGC。具体而言，我们使用预训练的 VLM 将来自实体及其邻居的相关视觉信息转换为文本序列。然后，我们将 KGC 定义为一个序列到序列的任务，使用生成的跨模态上下文对模型进行微调。与传统的 KGE 方法相比，这种简单而有效的方法显著减少了模型大小，同时以最少的超参数调整在多个大型数据集上实现了具有竞争力的性能。

Title: Adapting Biomedical Abstracts into Plain language using Large Language Models

Authors: Haritha Gangavarapu, Giridhar Kaushik Ramachandran, Kevin Lybarger, Meliha Yetisgen, Özlem Uzuner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15700
Pdf URL: https://arxiv.org/pdf/2501.15700
Copy Paste: [[2501.15700]] Adapting Biomedical Abstracts into Plain language using Large Language Models(https://arxiv.org/abs/2501.15700)
Keywords: language model, gpt
Abstract: A vast amount of medical knowledge is available to the public through online health forums and question-answering platforms on social media. The majority of the population in the United States lacks the health literacy needed to make the best use of that information. Health literacy is the ability to obtain and comprehend basic health information in order to make appropriate health decisions. To bridge this gap, organizations advocate adapting medical knowledge into plain language. Building robust systems to automate these adaptations helps both medical and non-medical professionals make the best use of the information available online. The goal of the Plain Language Adaptation of Biomedical Abstracts (PLABA) track is to adapt English-language biomedical abstracts, extracted from PubMed based on questions asked in MedlinePlus, into plain language for the general public at the sentence level. As part of this track, we leveraged the best open-source Large Language Models suitable and fine-tuned for dialog use cases. We compare and present the results for all of our systems and our ranking among the other participants' submissions. Our top-performing GPT-4-based model ranked first on the average simplicity measure and third on the average accuracy measure.
摘要：大量医学知识可通过在线健康论坛和社交媒体上的问答平台供公众使用。美国大多数人口的健康素养不足以充分利用这些信息。健康素养是指获取和理解基本健康信息以做出适当健康决策的能力。为了弥补这一差距，各组织提倡将这些医学知识改编成通俗易懂的语言。建立强大的系统来自动化改编有助于医疗和非医疗专业人员最大限度地利用在线可用信息。生物医学摘要的通俗语言改编 (PLABA) 课程的目标是根据 MedlinePlus 中提出的问题，使用句子级别的通俗易懂的语言改编从 PubMed 中提取的英语生物医学摘要，供公众使用。作为该课程的一部分，我们利用了最适合对话用例并经过微调的最佳开源大型语言模型。我们比较并展示了我们所有系统的结果以及我们在其他参与者提交内容中的排名。我们表现最佳的基于 GPT-4 的模型在平均简单性测量中排名第一，在平均准确度测量中排名第三。

Title: StaICC: Standardized Evaluation for Classification Task in In-context Learning

Authors: Hakaze Cho, Naoya Inoue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.15708
Pdf URL: https://arxiv.org/pdf/2501.15708
Copy Paste: [[2501.15708]] StaICC: Standardized Evaluation for Classification Task in In-context Learning(https://arxiv.org/abs/2501.15708)
Keywords: prompt
Abstract: Classification tasks are widely investigated in the In-Context Learning (ICL) paradigm. However, current efforts are evaluated on disjoint benchmarks and settings, and their performance is significantly influenced by seemingly trivial variables such as prompt templates, data sampling, and instructions. This leads to significant inconsistencies in the results reported across the literature and prevents fair comparison or meta-analysis across papers. Therefore, this paper proposes a standardized and easy-to-use evaluation toolkit (StaICC) for in-context classification. For the normal classification task, we provide StaICC-Normal, which selects 10 widely used datasets and generates prompts with a fixed form to mitigate the variance among experiment implementations. To broaden the usage of our benchmark, we also provide a sub-benchmark, StaICC-Diag, for diagnosing ICL from several aspects, aiming for more robust inference processing.
摘要：分类任务在情境学习 (ICL) 范式中得到广泛研究。然而，当前的努力是在不相交的基准和设置上进行评估的，而它们的性能受到一些琐碎变量的显著影响，例如提示模板、数据采样、指令等，这导致不同文献中报告的结果存在显著不一致，阻碍了不同论文之间的公平比较或元分析。因此，本文提出了一个标准化且易于使用的情境分类评估工具包 (StaICC)。其中，对于正常分类任务，我们提供 StaICC-Normal，选择 10 个广泛使用的数据集，并生成具有固定形式的提示，以减轻实验实现之间的差异。为了丰富我们基准的使用，我们还提供了一个子基准 StaICC-Diag，用于从多个方面诊断 ICL，旨在实现更强大的推理处理。

Title: ESGSenticNet: A Neurosymbolic Knowledge Base for Corporate Sustainability Analysis

Authors: Keane Ong, Rui Mao, Frank Xing, Ranjan Satapathy, Johan Sulaeman, Erik Cambria, Gianmarco Mengaldo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15720
Pdf URL: https://arxiv.org/pdf/2501.15720
Copy Paste: [[2501.15720]] ESGSenticNet: A Neurosymbolic Knowledge Base for Corporate Sustainability Analysis(https://arxiv.org/abs/2501.15720)
Keywords: gpt
Abstract: Evaluating corporate sustainability performance is essential to drive sustainable business practices, amid the need for a more sustainable economy. However, this is hindered by the complexity and volume of corporate sustainability data (i.e. sustainability disclosures), as well as by the limited effectiveness of the NLP tools used to analyse them. To this end, we identify three primary challenges (immateriality, complexity, and subjectivity) that exacerbate the difficulty of extracting insights from sustainability disclosures. To address these issues, we introduce ESGSenticNet, a publicly available knowledge base for sustainability analysis. ESGSenticNet is constructed from a neurosymbolic framework that integrates specialised concept parsing, GPT-4o inference, and semi-supervised label propagation, together with a hierarchical taxonomy. This approach culminates in a structured knowledge base of 44k knowledge triplets, such as ('halve carbon emission', supports, 'emissions control'), for effective sustainability analysis. Experiments indicate that ESGSenticNet, when deployed as a lexical method, more effectively captures relevant and actionable sustainability information from sustainability disclosures compared to state-of-the-art baselines. Besides capturing a high number of unique ESG topic terms, ESGSenticNet outperforms baselines on the ESG relatedness and ESG action orientation of these terms by 26% and 31% respectively. These metrics describe the extent to which topic terms are related to ESG and the extent to which they depict an action toward ESG. Moreover, when deployed as a lexical method, ESGSenticNet does not require any training, giving it a key advantage in simplicity for non-technical stakeholders.
摘要：在需要更可持续的经济的背景下，评估企业可持续发展绩效对于推动可持续商业实践至关重要。然而，企业可持续发展数据（即可持续发展披露）的复杂性和数量阻碍了这一点，尤其是用于分析这些数据的 NLP 工具的有效性。为此，我们确定了三个主要挑战——非物质性、复杂性和主观性，这加剧了从可持续发展披露中提取见解的难度。为了解决这些问题，我们引入了 ESGSenticNet，这是一个用于可持续发展分析的公开知识库。ESGSenticNet 由一个神经符号框架构建，该框架集成了专门的概念解析、GPT-4o 推理和半监督标签传播以及分层分类法。这种方法最终形成了一个由 44k 个知识三元组组成的结构化知识库——（“将碳排放减半”、“支持”、“排放控制”），用于有效的可持续性分析。实验表明，与最先进的基线相比，ESGSenticNet 在部署为词汇方法时，能够更有效地从可持续性披露中捕获相关且可操作的可持续性信息。除了捕获大量独特的 ESG 主题术语外，ESGSenticNet 在这些术语的 ESG 相关性和 ESG 行动导向方面的表现分别比基线高出 26% 和 31%。这些指标描述了主题术语与 ESG 的相关程度，并描述了对 ESG 的行动。此外，当部署为词汇方法时，ESGSenticNet 不需要任何培训，对于非技术利益相关者而言，其简单性具有关键优势。
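
The sketch below illustrates, under simplifying assumptions, what using a triplet knowledge base "as a lexical method" could look like: a training-free lookup of known concepts in a disclosure sentence. The only triplet shown is the one quoted in the abstract. The `Triplet` type, the `match_triplets` function, and the plain substring matching are hypothetical stand-ins, not the authors' concept parser.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    concept: str   # surface phrase to look for in text
    relation: str  # e.g. "supports"
    topic: str     # ESG topic the concept is linked to


# The single example triplet quoted in the abstract.
KNOWLEDGE_BASE = [
    Triplet("halve carbon emission", "supports", "emissions control"),
]


def match_triplets(text, knowledge_base=KNOWLEDGE_BASE):
    """Return triplets whose concept phrase occurs in the text.

    Matching is case-insensitive and collapses repeated whitespace.
    No model training or inference is involved.
    """
    normalized = " ".join(text.lower().split())
    return [t for t in knowledge_base if t.concept in normalized]


for hit in match_triplets("We aim to halve carbon emission by 2030."):
    print(hit.concept, hit.relation, hit.topic)
```

Substring matching misses inflected or reordered phrasings such as "carbon emissions were halved".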

Title: IndicMMLU-Pro: Benchmarking the Indic Large Language Models

Authors: Sankalp KJ, Ashutosh Kumar, Laxmaan Balaji, Nikunj Kotecha, Vinija Jain, Aman Chadha, Sreyoshi Bhaduri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.15747
Pdf URL: https://arxiv.org/pdf/2501.15747
Copy Paste: [[2501.15747]] IndicMMLU-Pro: Benchmarking the Indic Large Language Models(https://arxiv.org/abs/2501.15747)
Keywords: language model, llm
Abstract: Known by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research due to their rich cultural heritage, linguistic diversity, and complex structures. IndicMMLU-Pro is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) across Indic languages, building upon the MMLU Pro (Massive Multitask Language Understanding) framework. Covering major languages such as Hindi, Bengali, Gujarati, Marathi, Kannada, Punjabi, Tamil, Telugu, and Urdu, our benchmark addresses the unique challenges and opportunities presented by the linguistic diversity of the Indian subcontinent. This benchmark encompasses a wide range of tasks in language comprehension, reasoning, and generation, meticulously crafted to capture the intricacies of Indian languages. IndicMMLU-Pro provides a standardized evaluation framework to push the research boundaries in Indic language AI, facilitating the development of more accurate, efficient, and culturally sensitive models. This paper outlines the benchmarks' design principles, task taxonomy, data collection methodology, and presents baseline results from state-of-the-art multilingual models.
摘要：印度次大陆有超过 15 亿人知道印度语，由于其丰富的文化遗产、语言多样性和复杂的结构，印度语为自然语言处理 (NLP) 研究带来了独特的挑战和机遇。IndicMMLU-Pro 是一个全面的基准，旨在评估印度语中的大型语言模型 (LLM)，它建立在 MMLU Pro（大规模多任务语言理解）框架之上。我们的基准涵盖了印地语、孟加拉语、古吉拉特语、马拉地语、卡纳达语、旁遮普语、泰米尔语、泰卢固语和乌尔都语等主要语言，解决了印度次大陆语言多样性带来的独特挑战和机遇。该基准涵盖了语言理解、推理和生成方面的广泛任务，经过精心设计，以捕捉印度语言的复杂性。IndicMMLU-Pro 提供了一个标准化的评估框架，以推动印度语 AI 的研究边界，促进开发更准确、更高效、更具文化敏感性的模型。本文概述了基准的设计原则、任务分类、数据收集方法，并展示了最先进的多语言模型的基线结果。

Title: Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference

Authors: Go Kamoda, Benjamin Heinzerling, Tatsuro Inaba, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15754
Pdf URL: https://arxiv.org/pdf/2501.15754
Copy Paste: [[2501.15754]] Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference(https://arxiv.org/abs/2501.15754)
Keywords: language model, gpt
Abstract: According to the stages-of-inference hypothesis, early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the model's "inner vocabulary". Prior analysis of this detokenization stage has predominantly relied on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components that will be patched, and then observing changes in model behavior. Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference steps. Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2. Our decomposition yields interpretable terms that quantify the relative contributions of position-related, token-related, and mixed effects. By focusing on terms in this decomposition, we discover weight-based explanations of attention bias toward close tokens and attention for detokenization.
摘要：根据推理阶段假设，语言模型的早期层将其子词标记化输入（不一定对应于语言上有意义的分割）映射到更有意义的表示，形成模型的“内部词汇”。之前对这个去标记化阶段的分析主要依赖于探测和干预，例如路径修补，这涉及选择特定输入、选择要修补的组件子集，然后观察模型行为的变化。在这里，我们表明，去标记化阶段的几个重要方面可以纯粹通过分析模型权重来理解，而无需执行任何模型推理步骤。具体来说，我们在 GPT-2 中引入了第一层注意力的分析分解。我们的分解产生了可解释的术语，量化了位置相关、标记相关和混合效应的相对贡献。通过关注这种分解中的术语，我们发现了基于权重的对近距离标记的注意力偏差和去标记化注意力的解释。
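
As a rough illustration of how token-related, position-related, and mixed terms can arise (a simplified sketch, not necessarily the authors' exact formulation): assume the first-layer input at position $i$ is the sum of a token embedding $e_{t_i}$ and a learned position embedding $p_i$, and ignore layer normalization and bias terms. The symbols $W_Q$, $W_K$, and $d_k$ denote one head's query and key projections and its head dimension. The pre-softmax attention score between query position $i$ and key position $j$ then expands into four terms:

```latex
\begin{align*}
s_{ij} &= \frac{1}{\sqrt{d_k}}\,(e_{t_i} + p_i)^{\top} W_Q W_K^{\top} (e_{t_j} + p_j) \\
       &= \frac{1}{\sqrt{d_k}} \Big(
            \underbrace{e_{t_i}^{\top} W_Q W_K^{\top} e_{t_j}}_{\text{token--token}}
          + \underbrace{e_{t_i}^{\top} W_Q W_K^{\top} p_j}_{\text{token--position (mixed)}}
          + \underbrace{p_i^{\top} W_Q W_K^{\top} e_{t_j}}_{\text{position--token (mixed)}}
          + \underbrace{p_i^{\top} W_Q W_K^{\top} p_j}_{\text{position--position}}
          \Big)
\end{align*}
```

Every term depends only on embedding and projection weights, so under these assumptions each can be computed for all token and position pairs without running the model on any input.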

Title: Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages

Authors: Ivory Yang, Weicheng Ma, Chunhui Zhang, Soroush Vosoughi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15773
Pdf URL: https://arxiv.org/pdf/2501.15773
Copy Paste: [[2501.15773]] Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages(https://arxiv.org/abs/2501.15773)
Keywords: language model, llm
Abstract: Endangered languages, such as Navajo - the most widely spoken Native American language - are significantly underrepresented in contemporary language technologies, exacerbating the challenges of their preservation and revitalization. This study evaluates Google's large language model (LLM)-based language identification system, which consistently misidentifies Navajo, exposing inherent limitations when applied to low-resource Native American languages. To address this, we introduce a random forest classifier trained on Navajo and eight frequently confused languages. Despite its simplicity, the classifier achieves near-perfect accuracy (97-100%), significantly outperforming Google's LLM-based system. Additionally, the model demonstrates robustness across other Athabaskan languages - a family of Native American languages spoken primarily in Alaska, the Pacific Northwest, and parts of the Southwestern United States - suggesting its potential for broader application. Our findings underscore the pressing need for NLP systems that prioritize linguistic diversity and adaptability over centralized, one-size-fits-all solutions, especially in supporting underrepresented languages in a multicultural world. This work directly contributes to ongoing efforts to address cultural biases in language models and advocates for the development of culturally localized NLP tools that serve diverse linguistic communities.
摘要：濒危语言，例如使用最广泛的美洲原住民语言纳瓦霍语，在当代语言技术中代表性严重不足，这加剧了这些语言的保护和复兴挑战。这项研究评估了 Google 基于大型语言模型 (LLM) 的语言识别系统，该系统始终会错误识别纳瓦霍语，这暴露了其在应用于资源匮乏的美洲原住民语言时固有的局限性。为了解决这个问题，我们引入了一个随机森林分类器，该分类器经过了纳瓦霍语和八种经常混淆的语言的训练。尽管它很简单，但分类器的准确率却接近完美（97-100%），远远优于 Google 基于 LLM 的系统。此外，该模型还展示了对其他阿萨巴斯坎语（一种主要在阿拉斯加、太平洋西北部和美国西南部部分地区使用的美洲原住民语言）的稳健性，表明其具有更广泛的应用潜力。我们的研究结果强调，迫切需要优先考虑语言多样性和适应性而非集中式、一刀切的解决方案的 NLP 系统，尤其是在支持多元文化世界中代表性不足的语言方面。这项工作直接促进了持续努力解决语言模型中的文化偏见，并倡导开发服务于不同语言社区的文化本地化 NLP 工具。

Title: Large Language Models to Diffusion Finetuning

Authors: Edoardo Cetin, Tianyu Zhao, Yujin Tang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.15781
Pdf URL: https://arxiv.org/pdf/2501.15781
Copy Paste: [[2501.15781]] Large Language Models to Diffusion Finetuning(https://arxiv.org/abs/2501.15781)
Keywords: language model
Abstract: We propose a new finetuning method to provide pre-trained large language models (LMs) the ability to scale test-time compute through the diffusion framework. By increasing the number of diffusion steps, we show our finetuned models achieve monotonically increasing accuracy, directly translating to improved performance across downstream tasks. Furthermore, our finetuned models can expertly answer questions on specific topics by integrating powerful guidance techniques, and autonomously determine the compute required for a given problem by leveraging adaptive ODE solvers. Our method is universally applicable to any foundation model pre-trained with a cross-entropy loss and does not modify any of its original weights, fully preserving its strong single-step generation capabilities. We show our method is more effective and fully compatible with traditional finetuning approaches, introducing an orthogonal new direction to unify the strengths of the autoregressive and diffusion frameworks.
摘要：我们提出了一种新的微调方法，使预训练的大型语言模型 (LM) 能够通过扩散框架扩展测试时间计算。通过增加扩散步骤的数量，我们展示了我们的微调模型实现了单调增加的准确率，直接转化为下游任务性能的提高。此外，我们的微调模型可以通过集成强大的指导技术熟练地回答特定主题的问题，并通过利用自适应 ODE 求解器自主确定给定问题所需的计算。我们的方法普遍适用于任何使用交叉熵损失预训练的基础模型，并且不会修改其任何原始权重，完全保留其强大的单步生成能力。我们展示了我们的方法更有效并且与传统的微调方法完全兼容，引入了一个正交的新方向来统一自回归和扩散框架的优势。

Title: MADP: Multi-Agent Deductive Planning for Enhanced Cognitive-Behavioral Mental Health Question Answer

Authors: Qi Chen, Dexi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15826
Pdf URL: https://arxiv.org/pdf/2501.15826
Copy Paste: [[2501.15826]] MADP: Multi-Agent Deductive Planning for Enhanced Cognitive-Behavioral Mental Health Question Answer(https://arxiv.org/abs/2501.15826)
Keywords: language model, llm, agent
Abstract: The Mental Health Question Answer (MHQA) task requires the seeker and supporter to complete the support process in one-turn dialogue. Given the richness of help-seeker posts, supporters must thoroughly understand the content and provide logical, comprehensive, and well-structured responses. Previous works in MHQA mostly focus on single-agent approaches based on the cognitive element of Cognitive Behavioral Therapy (CBT), but they overlook the interactions among various CBT elements, such as emotion and cognition. This limitation hinders the models' ability to thoroughly understand the distress of help-seekers. To address this, we propose a framework named Multi-Agent Deductive Planning (MADP), which is based on the interactions between the various psychological elements of CBT. This method guides Large Language Models (LLMs) to achieve a deeper understanding of the seeker's context and provide more personalized assistance based on individual circumstances. Furthermore, we construct a new dataset based on the MADP framework and use it to fine-tune LLMs, resulting in a specialized model named MADP-LLM. We conduct extensive experiments, including comparisons with multiple LLMs, human evaluations, and automatic evaluations, to validate the effectiveness of the MADP framework and MADP-LLM.
摘要：心理健康问答 (MHQA) 任务要求求助者和支持者在一次对话中完成支持过程。鉴于求助者帖子的丰富性，支持者必须彻底理解内容并提供合乎逻辑、全面且结构良好的回复。MHQA 中的先前研究主要侧重于基于认知行为疗法 (CBT) 的认知元素的单智能体方法，但它们忽略了各种 CBT 元素之间的相互作用，例如情绪和认知。这种限制阻碍了模型彻底理解求助者痛苦的能力。为了解决这个问题，我们提出了一个名为多智能体演绎规划 (MADP) 的框架，该框架基于 CBT 的各种心理元素之间的相互作用。该方法指导大型语言模型 (LLM) 更深入地了解求助者的背景，并根据个人情况提供更个性化的帮助。此外，我们基于 MADP 框架构建了一个新数据集，并使用它来微调 LLM，从而产生了一个名为 MADP-LLM 的专门模型。我们进行了大量实验，包括与多个 LLM 的比较、人工评估和自动评估，以验证 MADP 框架和 MADP-LLM 的有效性。

Title: LCTG Bench: LLM Controlled Text Generation Benchmark

Authors: Kentaro Kurihara, Masato Mita, Peinan Zhang, Shota Sasaki, Ryosuke Ishigami, Naoaki Okazaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.15875
Pdf URL: https://arxiv.org/pdf/2501.15875
Copy Paste: [[2501.15875]] LCTG Bench: LLM Controlled Text Generation Benchmark(https://arxiv.org/abs/2501.15875)
Keywords: language model, gpt, llm
Abstract: The rise of large language models (LLMs) has led to more diverse and higher-quality machine-generated text. However, their high expressive power makes it difficult to control outputs based on specific business instructions. In response, benchmarks focusing on the controllability of LLMs have been developed, but several issues remain: (1) They primarily cover major languages like English and Chinese, neglecting low-resource languages like Japanese; (2) Current benchmarks employ task-specific evaluation metrics, lacking a unified framework for selecting models based on controllability across different use cases. To address these challenges, this research introduces LCTG Bench, the first Japanese benchmark for evaluating the controllability of LLMs. LCTG Bench provides a unified framework for assessing control performance, enabling users to select the most suitable model for their use cases based on controllability. By evaluating nine diverse Japanese-specific and multilingual LLMs like GPT-4, we highlight the current state and challenges of controllability in Japanese LLMs and reveal the significant gap between multilingual models and Japanese-specific models.
摘要：大型语言模型 (LLM) 的兴起带来了更加多样化和更高质量的机器生成文本。然而，它们的高表达能力使得根据特定的业务指令控制输出变得困难。为此，已经开发了专注于 LLM 可控性的基准，但仍存在一些问题：（1）它们主要涵盖英语和中文等主要语言，而忽略了日语等资源匮乏的语言；（2）当前的基准采用特定于任务的评估指标，缺乏基于不同用例的可控性选择模型的统一框架。为了应对这些挑战，本研究引入了 LCTG Bench，这是日本第一个用于评估 LLM 可控性的基准。LCTG Bench 提供了一个统一的控制性能评估框架，使用户能够根据可控性为其用例选择最合适的模型。通过评估九种不同的日语特定和多语言 LLM（如 GPT-4），我们强调了日语 LLM 可控性的现状和挑战，并揭示了多语言模型和日语特定模型之间的巨大差距。

Title: Parametric Retrieval Augmented Generation

Authors: Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, Yiqun Liu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2501.15915
Pdf URL: https://arxiv.org/pdf/2501.15915
Copy Paste: [[2501.15915]] Parametric Retrieval Augmented Generation(https://arxiv.org/abs/2501.15915)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) techniques have emerged as a promising solution to enhance the reliability of large language models (LLMs) by addressing issues like hallucinations, outdated knowledge, and domain adaptation. In particular, existing RAG methods append relevant documents retrieved from external corpora or databases to the input of LLMs to guide their generation process, which we refer to as the in-context knowledge injection method. While this approach is simple and often effective, it has inherent limitations. Firstly, increasing the context length and number of relevant documents can lead to higher computational overhead and degraded performance, especially in complex reasoning tasks. More importantly, in-context knowledge injection operates primarily at the input level, but LLMs store their internal knowledge in their parameters. This gap fundamentally limits the capacity of in-context methods. To this end, we introduce Parametric retrieval-augmented generation (Parametric RAG), a new RAG paradigm that integrates external knowledge directly into the parameters of feed-forward networks (FFN) of an LLM through document parameterization. This approach not only saves online computational costs by eliminating the need to inject multiple documents into the LLMs' input context, but also deepens the integration of external knowledge into the parametric knowledge space of the LLM. Experimental results demonstrate that Parametric RAG substantially enhances both the effectiveness and efficiency of knowledge augmentation in LLMs. Also, it can be combined with in-context RAG methods to achieve even better performance. We have open-sourced all the code, data, and models at the following anonymized GitHub link: this https URL
摘要：检索增强生成 (RAG) 技术已成为一种有前途的解决方案，可通过解决幻觉、过时知识和领域适应等问题来增强大型语言模型 (LLM) 的可靠性。具体而言，现有的 RAG 方法将从外部语料库或数据库检索到的相关文档附加到 LLM 的输入中以指导其生成过程，我们将其称为上下文知识注入方法。虽然这种方法简单且通常有效，但它有固有的局限性。首先，增加上下文长度和相关文档的数量会导致更高的计算开销和性能下降，尤其是在复杂的推理任务中。更重要的是，上下文知识注入主要在输入级别运行，但 LLM 将其内部知识存储在其参数中。这一差距从根本上限制了上下文方法的容量。为此，我们引入了参数检索增强生成 (Parametric RAG)，这是一种新的 RAG 范式，它通过文档参数化将外部知识直接集成到 LLM 的前馈网络 (FFN) 的参数中。这种方法不仅通过消除将多个文档注入 LLM 的输入上下文的需要来节省在线计算成本，而且还加深了外部知识与 LLM 的参数知识空间的集成。实验结果表明，参数 RAG 大大提高了 LLM 中知识增强的有效性和效率。此外，它可以与上下文中的 RAG 方法相结合，以实现更好的性能。我们在以下匿名 GitHub 链接中开源了所有代码、数据和模型：此 https URL
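
For contrast with the parametric approach, here is a minimal sketch of the in-context knowledge injection that the abstract describes as the existing RAG practice: retrieved documents are concatenated in front of the question. The prompt layout, the function name, and the `max_docs` parameter are illustrative assumptions, and retrieval itself is out of scope.

```python
def build_in_context_prompt(question, documents, max_docs=3):
    """Prepend up to max_docs retrieved documents to the question.

    Prompt length grows with the number and size of documents, which is
    the input-side overhead the abstract points to.
    """
    selected = documents[:max_docs]
    if not selected:
        return f"Question: {question}\nAnswer:"
    context = "\n\n".join(
        f"[Document {i}] {doc}" for i, doc in enumerate(selected, start=1)
    )
    return f"{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical retrieved passages.
docs = ["Passage about topic A.", "Passage about topic B."]
print(build_in_context_prompt("What is topic A?", docs))
```

Each added document lengthens the prompt, so the inference cost is paid again for every query.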

Title: MEL: Legal Spanish Language Model

Authors: David Betancur Sánchez, Nuria Aldama García, Álvaro Barbero Jiménez, Marta Guerrero Nieto, Patricia Marsà Morales, Nicolás Serrano Salas, Carlos García Hernán, Pablo Haya Coll, Elena Montiel Ponsoda, Pablo Calleja Ibáñez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16011
Pdf URL: https://arxiv.org/pdf/2501.16011
Copy Paste: [[2501.16011]] MEL: Legal Spanish Language Model(https://arxiv.org/abs/2501.16011)
Keywords: language model
Abstract: Legal texts, characterized by complex and specialized terminology, present a significant challenge for Language Models. Adding an underrepresented language, such as Spanish, to the mix makes it even more challenging. While pre-trained models like XLM-RoBERTa have shown capabilities in handling multilingual corpora, their performance on domain-specific documents remains underexplored. This paper presents the development and evaluation of MEL, a legal language model based on XLM-RoBERTa-large, fine-tuned on legal documents such as BOE (Boletín Oficial del Estado, the Spanish official report of laws) and congress texts. We detail the data collection, processing, training, and evaluation processes. Evaluation benchmarks show a significant improvement over baseline models in understanding the legal Spanish language. We also present case studies demonstrating the model's application to new legal texts, highlighting its potential to achieve top results across different NLP tasks.
摘要：法律文本以复杂和专业的术语为特征，对语言模型提出了重大挑战。再加上一种代表性不足的语言，如西班牙语，就更具挑战性。虽然像 XLM-RoBERTa 这样的预训练模型已经展示了处理多语言语料库的能力，但它们在特定领域文档上的表现仍未得到充分探索。本文介绍了 MEL 的开发和评估，MEL 是一种基于 XLM-RoBERTa-large 的法律语言模型，针对法律文件（如 BOE（西班牙官方法律报告）和国会文本）进行了微调。我们详细介绍了数据收集、处理、训练和评估过程。评估基准表明，在理解西班牙语法律语言方面，与基线模型相比，MEL 有显著的改进。我们还提供了案例研究，展示了该模型在新法律文本中的应用，强调了其在不同 NLP 任务中取得最佳结果的潜力。

Title: PISCO: Pretty Simple Compression for Retrieval-Augmented Generation

Authors: Maxime Louis, Hervé Déjean, Stéphane Clinchant
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2501.16075
Pdf URL: https://arxiv.org/pdf/2501.16075
Copy Paste: [[2501.16075]] PISCO: Pretty Simple Compression for Retrieval-Augmented Generation(https://arxiv.org/abs/2501.16075)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) pipelines enhance Large Language Models (LLMs) by retrieving relevant documents, but they face scalability issues due to high inference costs and limited context size. Document compression is a practical solution, but current soft compression methods suffer from accuracy losses and require extensive pretraining. In this paper, we introduce PISCO, a novel method that achieves a 16x compression rate with minimal accuracy loss (0-3%) across diverse RAG-based question-answering (QA) tasks. Unlike existing approaches, PISCO requires no pretraining or annotated data, relying solely on sequence-level knowledge distillation from document-based questions. With the ability to fine-tune a 7-10B LLM in 48 hours on a single A100 GPU, PISCO offers a highly efficient and scalable solution. We present comprehensive experiments showing that PISCO outperforms existing compression models by 8% in accuracy.
摘要：检索增强生成 (RAG) 管道通过检索相关文档来增强大型语言模型 (LLM)，但由于推理成本高且上下文大小有限，它们面临可扩展性问题。文档压缩是一种实用的解决方案，但当前的软压缩方法存在准确性损失，并且需要大量的预训练。在本文中，我们介绍了 PISCO，这是一种新方法，可在基于 RAG 的各种问答 (QA) 任务中实现 16 倍的压缩率，同时准确度损失最小（0-3%）。与现有方法不同，PISCO 不需要预训练或注释数据，仅依赖于从基于文档的问题中进行序列级知识提炼。PISCO 能够在单个 A100 GPU 上在 48 小时内微调 7-10B LLM，提供了一种高效且可扩展的解决方案。我们进行了全面的实验，表明 PISCO 的准确度比现有压缩模型高出 8%。

Title: Integration of LLM Quality Assurance into an NLG System

Authors: Ching-Yi Chen, Johanna Heininger, Adela Schneider, Christian Eckard, Andreas Madsack, Robert Weißgraeber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16078
Pdf URL: https://arxiv.org/pdf/2501.16078
Copy Paste: [[2501.16078]] Integration of LLM Quality Assurance into an NLG System(https://arxiv.org/abs/2501.16078)
Keywords: language model, llm
Abstract: In this paper, we present a system that uses a Large Language Model (LLM) to perform grammar and spelling correction as a component of Quality Assurance (QA) for texts generated by NLG systems, which is important for text production in real-world scenarios. Evaluating the results of the system on work-in-progress sports news texts in three languages, we show that it is able to deliver acceptable corrections.
摘要：在本文中，我们介绍了一个使用大型语言模型 (LLM) 执行语法和拼写校正的系统，这是 NLG 系统生成的文本质量保证 (QA) 的一部分，这对于现实世界中的文本生成非常重要。通过评估该系统对三种语言的体育新闻文本的编写结果，我们表明它能够提供可接受的校正。

Title: AdaCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Chain-of-Thought

Authors: Xin Huang, Tarun Kumar Vangani, Zhengyuan Liu, Bowei Zou, Ai Ti Aw
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.16154
Pdf URL: https://arxiv.org/pdf/2501.16154
Copy Paste: [[2501.16154]] AdaCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Chain-of-Thought(https://arxiv.org/abs/2501.16154)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) have shown impressive multilingual capabilities through pretraining on diverse corpora. While these models show strong reasoning abilities, their performance varies significantly across languages due to uneven training data distribution. Existing approaches that rely on machine translation, extensive multilingual pretraining, and cross-lingual tuning face scalability challenges and often fail to capture nuanced reasoning processes across languages. In this paper, we introduce AdaCoT (Adaptive Chain-of-Thought), a framework that enhances multilingual reasoning by dynamically routing thought processes through intermediary "thinking languages" before generating target-language responses. AdaCoT leverages a language-agnostic core and incorporates an adaptive, reward-based mechanism for selecting optimal reasoning pathways without requiring additional pretraining. Our comprehensive evaluation across multiple benchmarks demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency, with particularly strong performance gains in low-resource language settings. The results suggest that adaptive reasoning paths can effectively bridge the performance gap between high- and low-resource languages while maintaining cultural and linguistic nuances.
摘要：大型语言模型 (LLM) 通过在不同的语料库上进行预训练，展现出了令人印象深刻的多语言能力。虽然这些模型表现出强大的推理能力，但由于训练数据分布不均，它们在不同语言之间的性能差异很大。现有的使用机器翻译、广泛的多语言预训练和跨语言调整的方法面临着可扩展性的挑战，而且往往无法捕捉跨语言的细微推理过程。在本文中，我们介绍了 AdaCoT（自适应思维链），这是一个通过在生成目标语言响应之前通过中间“思维语言”动态路由思维过程来增强多语言推理的框架。AdaCoT 利用与语言无关的核心，并结合了一种自适应的、基于奖励的机制来选择最佳推理路径，而无需额外的预训练。我们对多个基准的全面评估表明，事实推理质量和跨语言一致性都有了显著的提高，在资源匮乏的语言环境中，性能提升尤其显著。结果表明，自适应推理路径可以有效地弥合高资源语言和低资源语言之间的性能差距，同时保持文化和语言的细微差别。

Title: Can summarization approximate simplification? A gold standard comparison

Authors: Giacomo Magnifico, Eduard Barbu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16181
Pdf URL: https://arxiv.org/pdf/2501.16181
Copy Paste: [[2501.16181]] Can summarization approximate simplification? A gold standard comparison(https://arxiv.org/abs/2501.16181)
Keywords: prompt
Abstract: This study explores the overlap between text summarization and simplification outputs. While summarization evaluation methods are streamlined, simplification lacks cohesion, prompting the question: how closely can abstractive summarization resemble gold-standard simplification? We address this by applying two BART-based BRIO summarization methods to the Newsela corpus, comparing outputs with manually annotated simplifications and achieving a top ROUGE-L score of 0.654. This provides insight into where summarization and simplification outputs converge and differ.
摘要：本研究探讨了文本摘要和简化输出之间的重叠。虽然摘要评估方法得到了简化，但简化缺乏凝聚力，这引发了一个问题：抽象摘要与黄金标准简化的相似程度如何？我们通过将两种基于 BART 的 BRIO 摘要方法应用于 Newsela 语料库来解决这个问题，将输出与手动注释的简化进行比较，并获得了 0.654 的最高 ROUGE-L 分数。这可以深入了解摘要和简化输出的趋同和不同之处。
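
For readers unfamiliar with the ROUGE-L metric reported above, the following is a minimal sketch of how a sentence-level ROUGE-L F-score can be computed from the longest common subsequence (LCS) of two token sequences. The whitespace tokenization, lowercasing, and the `beta` weighting are assumptions made for illustration; the abstract does not state which ROUGE implementation or settings the authors used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score (assumed: whitespace tokens, lowercased)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    summary = "the cat sat on the mat"
    simplification = "the cat lay on a mat"
    print(round(rouge_l(summary, simplification), 3))
```

With `beta=1.0` the score is the harmonic mean of LCS precision and recall, so a score such as 0.654 indicates that roughly two thirds of the tokens line up in order between the system output and the gold simplification.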

Title: Provence: efficient and robust context pruning for retrieval-augmented generation

Authors: Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, Stéphane Clinchant
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16214
Pdf URL: https://arxiv.org/pdf/2501.16214
Copy Paste: [[2501.16214]] Provence: efficient and robust context pruning for retrieval-augmented generation(https://arxiv.org/abs/2501.16214)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: Retrieval-augmented generation improves various aspects of large language model (LLM) generation, but suffers from computational overhead caused by long contexts as well as the propagation of irrelevant retrieved information into generated responses. Context pruning deals with both aspects, by removing irrelevant parts of retrieved contexts before LLM generation. Existing context pruning approaches are, however, limited, and do not provide a universal model that would be both efficient and robust in a wide range of scenarios, e.g., when contexts contain a variable amount of relevant information or vary in length, or when evaluated on various domains. In this work, we close this gap and introduce Provence (Pruning and Reranking Of retrieVEd relevaNt ContExts), an efficient and robust context pruner for Question Answering, which dynamically detects the needed amount of pruning for a given context and can be used out-of-the-box for various domains. The three key ingredients of Provence are formulating the context pruning task as sequence labeling, unifying context pruning capabilities with context reranking, and training on diverse data. Our experimental results show that Provence enables context pruning with negligible to no drop in performance, in various domains and settings, at almost no cost in a standard RAG pipeline. We also conduct a deeper analysis alongside various ablations to provide insights into training context pruners for future work.
摘要：检索增强生成改进了大型语言模型 (LLM) 生成的各个方面，但却存在由长上下文以及将不相关的检索到的信息传播到生成的响应中而导致的计算开销的问题。上下文修剪通过在 LLM 生成之前删除检索到的上下文中不相关的部分来处理这两个方面。然而，现有的上下文修剪方法有限，并且不能提供一个在各种场景中既高效又稳健的通用模型，例如，当上下文包含可变数量的相关信息或长度不同时，或者在各个领域进行评估时。在这项工作中，我们弥补了这一差距并引入了 Provence（对检索到的相关上下文进行修剪和重新排序），这是一种用于问答的高效而强大的上下文修剪器，它可以动态检测给定上下文所需的修剪量，并且可以开箱即用地用于各个领域。 Provence 的三个关键要素是将上下文修剪任务制定为序列标记、将上下文修剪功能与上下文重新排序统一起来以及对各种数据进行训练。我们的实验结果表明，Provence 在标准 RAG 管道中几乎不花费任何成本，在各种领域和设置中实现了上下文修剪，性能几乎不会下降。我们还对各种消融进行了更深入的分析，以提供对未来工作中训练上下文修剪器的见解。
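
To make the idea of "context pruning as sequence labeling" concrete, here is a small sketch under stated assumptions. It supposes that some labeling model (not shown) has already produced a relevance score in [0, 1] for every token of every sentence; the function name `prune_context`, the mean-score aggregation, and the 0.5 threshold are all hypothetical and are not taken from the Provence paper.

```python
def prune_context(sentences, token_scores, threshold=0.5):
    """Keep sentences whose mean per-token relevance score meets the threshold.

    sentences    -- list of sentence strings from a retrieved context
    token_scores -- list of lists; token_scores[i] holds the (assumed)
                    per-token relevance scores for sentences[i]
    threshold    -- hypothetical cut-off; not a value from the paper
    """
    if len(sentences) != len(token_scores):
        raise ValueError("need one score list per sentence")
    kept = []
    for sentence, scores in zip(sentences, token_scores):
        # Sentences with no scored tokens are treated as irrelevant.
        if scores and sum(scores) / len(scores) >= threshold:
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    context = [
        "Paris is the capital of France.",
        "Bananas are yellow.",
        "France is in Western Europe.",
    ]
    scores = [
        [0.9, 0.8, 0.7, 0.9, 0.8, 0.9],
        [0.1, 0.2, 0.1],
        [0.6, 0.4, 0.5, 0.7, 0.6],
    ]
    print(prune_context(context, scores))
```

Because the decision is made per sentence from the labels rather than by keeping a fixed top-k, the amount of pruning varies with how much of the context is actually relevant, which is the property the abstract describes as dynamically detecting the needed amount of pruning.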

Title: DBRouting: Routing End User Queries to Databases for Answerability

Authors: Priyangshu Mandal, Manasi Patwardhan, Mayur Patidar, Lovekesh Vig
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16220
Pdf URL: https://arxiv.org/pdf/2501.16220
Copy Paste: [[2501.16220]] DBRouting: Routing End User Queries to Databases for Answerability(https://arxiv.org/abs/2501.16220)
Keywords: llm
Abstract: Enterprise-level data is often distributed across multiple sources, and identifying the correct set of data sources with relevant information for a knowledge request is a fundamental challenge. In this work, we define the novel task of routing an end-user query to the appropriate data source, where the data sources are databases. We synthesize datasets by extending existing datasets designed for NL-to-SQL semantic parsing. We create baselines on these datasets by using open-source LLMs, using both pre-trained and task-specific embeddings fine-tuned using the training data. With these baselines we demonstrate that open-source LLMs perform better than embedding-based approaches, but suffer from token length limitations. Embedding-based approaches benefit from task-specific fine-tuning, more so when database-specific questions are available for training. We further find that the task becomes more difficult (i) with an increase in the number of data sources, (ii) having data sources closer in terms of their domains, (iii) having databases without external domain knowledge required to interpret their entities, and (iv) with ambiguous and complex queries requiring more fine-grained understanding of the data sources or logical reasoning for routing to an appropriate source. This calls for more sophisticated solutions to better address the task.
摘要：企业级数据通常分布在多个来源中，而识别具有与知识请求相关的信息的正确数据源集是一项基本挑战。在这项工作中，我们定义了将最终用户查询路由到适当数据源的新任务，其中数据源是数据库。我们通过扩展为 NL-to-SQL 语义解析设计的现有数据集来合成数据集。我们使用开源 LLM 在这些数据集上创建基线，使用预训练和使用训练数据微调的任务特定嵌入。通过这些基线，我们证明开源 LLM 的表现优于基于嵌入的方法，但受到标记长度限制。基于嵌入的方法受益于任务特定的微调，当有可用于训练的数据库特定问题的数据时更是如此。我们进一步发现，随着 (i) 数据源数量的增加，(ii) 数据源在领域方面越来越接近，(iii) 数据库不需要外部领域知识来解释其实体，以及 (iv) 查询模糊且复杂，需要更细致地了解数据源或逻辑推理以路由到适当的源，这项任务变得越来越困难。这就要求开发更复杂的解决方案来更好地完成任务。
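
The embedding-based routing baseline mentioned above can be illustrated with a toy sketch: embed the query and a textual description of each database, then rank databases by cosine similarity. Everything here is an assumption for illustration. The bag-of-words `embed` function stands in for the pre-trained or fine-tuned embeddings the abstract refers to, and the database names and descriptions are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def route(query, db_descriptions):
    """Return database names ranked from most to least similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), name) for name, desc in db_descriptions.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


if __name__ == "__main__":
    # Hypothetical databases, described by their schema vocabulary.
    databases = {
        "flights": "airline flight departure arrival airport",
        "hr": "employee salary department manager",
    }
    print(route("which employee has the highest salary", databases))
```

A ranking of this kind also hints at why the task gets harder as databases grow closer in domain: their descriptions share more vocabulary, so the similarity scores separate less cleanly.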

Title: A foundation model for human-AI collaboration in medical literature mining

Authors: Zifeng Wang, Lang Cao, Qiao Jin, Joey Chan, Nicholas Wan, Behdad Afzali, Hyun-Jin Cho, Chang-In Choi, Mehdi Emamverdi, Manjot K. Gill, Sun-Hyung Kim, Yijia Li, Yi Liu, Hanley Ong, Justin Rousseau, Irfan Sheikh, Jenny J. Wei, Ziyang Xu, Christopher M. Zallek, Kyungsang Kim, Yifan Peng, Zhiyong Lu, Jimeng Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16255
Pdf URL: https://arxiv.org/pdf/2501.16255
Copy Paste: [[2501.16255]] A foundation model for human-AI collaboration in medical literature mining(https://arxiv.org/abs/2501.16255)
Keywords: language model, llm
Abstract: Systematic literature review is essential for evidence-based medicine, requiring comprehensive analysis of clinical trial publications. However, the application of artificial intelligence (AI) models for medical literature mining has been limited by insufficient training and evaluation across broad therapeutic areas and diverse tasks. Here, we present LEADS, an AI foundation model for study search, screening, and data extraction from medical literature. The model is trained on 633,759 instruction data points in LEADSInstruct, curated from 21,335 systematic reviews, 453,625 clinical trial publications, and 27,015 clinical trial registries. We show that LEADS demonstrates consistent improvements over four cutting-edge generic large language models (LLMs) on six tasks. Furthermore, LEADS enhances expert workflows by providing supportive references following expert requests, streamlining processes while maintaining high-quality results. A study with 16 clinicians and medical researchers from 14 different institutions revealed that experts collaborating with LEADS achieved a recall of 0.81, compared to 0.77 for experts working alone in study selection, with a time savings of 22.6%. In data extraction tasks, experts using LEADS achieved an accuracy of 0.85 versus 0.80 without using LEADS, alongside a 26.9% time savings. These findings highlight the potential of specialized medical literature foundation models to outperform generic models, delivering significant quality and efficiency benefits when integrated into expert workflows for medical literature mining.
摘要：系统文献综述对于循证医学至关重要，需要对临床试验出版物进行全面分析。然而，由于在广泛的治疗领域和各种任务中训练和评估不足，人工智能 (AI) 模型在医学文献挖掘中的应用受到了限制。在这里，我们介绍了 LEADS，这是一个用于从医学文献中搜索、筛选和提取研究数据的 AI 基础模型。该模型在 LEADSInstruct 中的 633,759 个指令数据点上进行了训练，这些数据点来自 21,335 篇系统综述、453,625 篇临床试验出版物和 27,015 个临床试验注册中心。我们表明，LEADS 在六项任务上表现出比四种尖端通用大型语言模型 (LLM) 持续改进。此外，LEADS 通过根据专家要求提供支持性参考来增强专家工作流程，简化流程同时保持高质量结果。一项针对来自 14 个不同机构的 16 名临床医生和医学研究人员的研究表明，在研究选择方面，与 LEADS 协作的专家召回率为 0.81，而单独工作的专家召回率为 0.77，同时节省了 22.6% 的时间。在数据提取任务中，使用 LEADS 的专家的准确率为 0.85，而未使用 LEADS 的准确率为 0.80，同时节省了 26.9% 的时间。这些发现凸显了专门的医学文献基础模型优于通用模型的潜力，当将其集成到医学文献挖掘的专家工作流程中时，可以带来显著的质量和效率优势。

Title: Return of the Encoder: Maximizing Parameter Efficiency for SLMs

Authors: Mohamed Elfeki, Rui Liu, Chad Voegele
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2501.16273
Pdf URL: https://arxiv.org/pdf/2501.16273
Copy Paste: [[2501.16273]] Return of the Encoder: Maximizing Parameter Efficiency for SLMs(https://arxiv.org/abs/2501.16273)
Keywords: language model
Abstract: The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters or fewer - our systematic analysis across GPU, CPU, and NPU platforms reveals that encoder-decoder architectures achieve 47% lower first-token latency and 4.7x higher throughput compared to decoder-only models on edge devices. These gains may be attributed to encoder-decoder's one-time input processing and efficient separation of understanding and generation phases. We introduce a novel knowledge distillation framework that enables encoder-decoder models to leverage capabilities from large scalable decoder-only teachers while preserving their architectural advantages, achieving up to 6 average performance points improvement across diverse tasks, with significant gains in asymmetric sequence tasks where input and output distributions can benefit from different processing approaches. When combined with modern advances like Rotary Positional Embeddings (RoPE) and Vision encoders, our systematic investigation demonstrates that encoder-decoder architectures provide a more practical path toward deploying capable language models in resource-constrained environments. Our findings challenge the prevailing trend toward decoder-only scaling, showing that architectural choices become increasingly crucial as parameter budgets decrease, particularly for on-device and edge deployments where computational efficiency is paramount.
摘要：尽管编码器-解码器架构在序列处理方面具有根本的效率优势，但大型仅解码器语言模型的主导地位已经盖过了它们。对于小型语言模型 (SLM)（具有 10 亿个或更少参数的模型），我们在 GPU、CPU 和 NPU 平台上进行的系统分析表明，与边缘设备上的仅解码器模型相比，编码器-解码器架构的首令牌延迟降低了 47%，吞吐量提高了 4.7 倍。这些收益可能归因于编码器-解码器的一次性输入处理以及理解和生成阶段的有效分离。我们引入了一种新颖的知识蒸馏框架，使编码器-解码器模型能够利用大型可扩展仅解码器教师的能力，同时保留其架构优势，在不同任务中实现高达 6 个平均性能点的提升，在非对称序列任务中具有显着的收益，其中输入和输出分布可以从不同的处理方法中受益。当与旋转位置嵌入 (RoPE) 和视觉编码器等现代技术相结合时，我们的系统研究表明，编码器-解码器架构为在资源受限的环境中部署功能强大的语言模型提供了更实用的途径。我们的研究结果挑战了仅使用解码器进行扩展的流行趋势，表明随着参数预算的减少，架构选择变得越来越重要，尤其是对于计算效率至关重要的设备和边缘部署而言。

Title: URAG: Implementing a Unified Hybrid RAG for Precise Answers in University Admission Chatbots -- A Case Study at HCMUT

Authors: Long Nguyen, Tho Quan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16276
Pdf URL: https://arxiv.org/pdf/2501.16276
Copy Paste: [[2501.16276]] URAG: Implementing a Unified Hybrid RAG for Precise Answers in University Admission Chatbots -- A Case Study at HCMUT(https://arxiv.org/abs/2501.16276)
Keywords: language model, llm, chat, retrieval-augmented generation
Abstract: With the rapid advancement of Artificial Intelligence, particularly in Natural Language Processing, Large Language Models (LLMs) have become pivotal in educational question-answering systems, especially university admission chatbots. Concepts such as Retrieval-Augmented Generation (RAG) and other advanced techniques have been developed to enhance these systems by integrating specific university data, enabling LLMs to provide informed responses on admissions and academic counseling. However, these enhanced RAG techniques often involve high operational costs and require the training of complex, specialized modules, which poses challenges for practical deployment. Additionally, in the educational context, it is crucial to provide accurate answers to prevent misinformation, a task that LLM-based systems find challenging without appropriate strategies and methods. In this paper, we introduce the Unified RAG (URAG) Framework, a hybrid approach that significantly improves the accuracy of responses, particularly for critical queries. Experimental results demonstrate that URAG enhances our in-house, lightweight model to perform comparably to state-of-the-art commercial models. Moreover, to validate its practical applicability, we conducted a case study at our educational institution, which received positive feedback and acclaim. This study not only proves the effectiveness of URAG but also highlights its feasibility for real-world implementation in educational settings.
摘要：随着人工智能（尤其是自然语言处理）的快速发展，大型语言模型 (LLM) 已成为教育问答系统（尤其是大学招生聊天机器人）的关键。检索增强生成 (RAG) 等概念和其他先进技术已经开发出来，通过整合特定的大学数据来增强这些系统，使 LLM 能够就招生和学术咨询提供明智的回应。然而，这些增强的 RAG 技术通常涉及高昂的运营成本，并且需要训练复杂的专业模块，这对实际部署构成了挑战。此外，在教育背景下，提供准确的答案以防止错误信息至关重要，如果没有适当的策略和方法，基于 LLM 的系统将很难完成这项任务。在本文中，我们介绍了统一 RAG (URAG) 框架，这是一种混合方法，可显着提高响应的准确性，尤其是对于关键查询。实验结果表明，URAG 增强了我们的内部轻量级模型，使其性能与最先进的商业模型相当。此外，为了验证其实际适用性，我们在教育机构进行了案例研究，并获得了积极的反馈和赞誉。这项研究不仅证明了 URAG 的有效性，还强调了其在教育环境中实际实施的可行性。

Title: Matryoshka Re-Ranker: A Flexible Re-Ranking Architecture With Configurable Depth and Width

Authors: Zheng Liu, Chaofan Li, Shitao Xiao, Chaozhuo Li, Defu Lian, Yingxia Shao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16302
Pdf URL: https://arxiv.org/pdf/2501.16302
Copy Paste: [[2501.16302]] Matryoshka Re-Ranker: A Flexible Re-Ranking Architecture With Configurable Depth and Width(https://arxiv.org/abs/2501.16302)
Keywords: language model, llm
Abstract: Large language models (LLMs) provide powerful foundations to perform fine-grained text re-ranking. However, they are often prohibitive in reality due to constraints on computation bandwidth. In this work, we propose a flexible architecture called Matryoshka Re-Ranker, which is designed to facilitate runtime customization of model layers and sequence lengths at each layer based on users' configurations. Consequently, the LLM-based re-rankers can be made applicable across various real-world situations. The increased flexibility may come at the cost of precision loss. To address this problem, we introduce a suite of techniques to optimize the performance. First, we propose cascaded self-distillation, where each sub-architecture learns to preserve a precise re-ranking performance from its super components, whose predictions can be exploited as smooth and informative teacher signals. Second, we design a factorized compensation mechanism, where two collaborative Low-Rank Adaptation modules, vertical and horizontal, are jointly employed to compensate for the precision loss resulting from arbitrary combinations of layer and sequence compression. We perform comprehensive experiments based on the passage and document retrieval datasets from MSMARCO, along with all public datasets from the BEIR benchmark. In our experiments, Matryoshka Re-Ranker substantially outperforms the existing methods, while effectively preserving its superior performance across various forms of compression and different application scenarios.
摘要：大型语言模型 (LLM) 为执行细粒度文本重新排序提供了强大的基础。然而，由于计算带宽的限制，它们在现实中往往成本过高。在这项工作中，我们提出了一种名为 Matryoshka Re-Ranker 的灵活架构，旨在根据用户的配置促进模型层和每层的序列长度的运行时定制。因此，基于 LLM 的重新排序器可以适用于各种实际情况。增加的灵活性可能会以精度损失为代价。为了解决这个问题，我们引入了一套优化性能的技术。首先，我们提出了级联自蒸馏，其中每个子架构都学习从其超级组件中保持精确的重新排序性能，超级组件的预测可以用作平滑且信息丰富的教师信号。其次，我们设计了一个分解补偿机制，其中两个协作的低秩自适应模块（垂直和水平）被联合使用，以补偿由层和序列压缩的任意组合导致的精度损失。我们基于 MSMARCO 的段落和文档检索数据集以及 BEIR 基准的所有公共数据集进行了全面的实验。在我们的实验中，Matryoshka Re-Ranker 的表现大大优于现有方法，同时有效地在各种压缩形式和不同应用场景中保持了其卓越的性能。

Title: RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

Authors: Long Nguyen, Huy Nguyen, Bao Khuu, Huy Luu, Huy Le, Tuan Nguyen, Tho Quan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.16303
Pdf URL: https://arxiv.org/pdf/2501.16303
Copy Paste: [[2501.16303]] RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval(https://arxiv.org/abs/2501.16303)
Keywords: language model, llm, prompt
Abstract: Retrieving events from videos using text queries has become increasingly challenging due to the rapid growth of multimedia content. Existing methods for text-based video event retrieval often focus heavily on object-level descriptions, overlooking the crucial role of contextual information. This limitation is especially apparent when queries lack sufficient context, such as missing location details or ambiguous background elements. To address these challenges, we propose a novel system called RAPID (Retrieval-Augmented Parallel Inference Drafting), which leverages advancements in Large Language Models (LLMs) and prompt-based learning to semantically correct and enrich user queries with relevant contextual information. These enriched queries are then processed through parallel retrieval, followed by an evaluation step to select the most relevant results based on their alignment with the original query. Through extensive experiments on our custom-developed dataset, we demonstrate that RAPID significantly outperforms traditional retrieval methods, particularly for contextually incomplete queries. Our system was validated for both speed and accuracy through participation in the Ho Chi Minh City AI Challenge 2024, where it successfully retrieved events from over 300 hours of video. Further evaluation comparing RAPID with the baseline proposed by the competition organizers demonstrated its superior effectiveness, highlighting the strength and robustness of our approach.
摘要：由于多媒体内容的快速增长，使用文本查询从视频中检索事件变得越来越具有挑战性。现有的基于文本的视频事件检索方法通常侧重于对象级描述，而忽略了上下文信息的关键作用。当查询缺乏足够的上下文（例如缺少位置详细信息或背景元素不明确）时，这种限制尤其明显。为了应对这些挑战，我们提出了一种名为 RAPID（检索增强并行推理起草）的新系统，该系统利用大型语言模型 (LLM) 和基于提示的学习的进步，从语义上纠正用户查询并使用相关上下文信息丰富用户查询。然后通过并行检索处理这些丰富的查询，然后进行评估步骤，根据它们与原始查询的一致性选择最相关的结果。通过对我们定制开发的数据集进行大量实验，我们证明 RAPID 明显优于传统检索方法，尤其是对于上下文不完整的查询。我们的系统在参加 2024 年胡志明市人工智能挑战赛时，成功从超过 300 小时的视频中检索事件，验证了其速度和准确性。通过对 RAPID 与比赛组织者提出的基线进行比较的进一步评估，证明了其卓越的有效性，凸显了我们方法的优势和稳健性。

Title: LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

Authors: Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.16327
Pdf URL: https://arxiv.org/pdf/2501.16327
Copy Paste: [[2501.16327]] LUCY: Linguistic Understanding and Control Yielding Early Stage of Her(https://arxiv.org/abs/2501.16327)
Keywords: language model, agent
Abstract: The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Moving one step toward a more sophisticated audio agent from recent advancements in end-to-end (E2E) speech systems, we propose LUCY, an E2E speech model that (1) senses and responds to the user's emotion, (2) delivers responses in a succinct and natural style, and (3) uses external tools to answer real-time inquiries. Experimental results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. LUCY is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are out of its knowledge scope.
摘要：电影《她》中的 Samantha 是一个成熟的人工智能音频代理，她能够理解人类语音中的语言和副语言信息，并做出自然、信息丰富且对情感细微差别敏感的实时响应。为了在端到端 (E2E) 语音系统最新进展的基础上向更成熟的音频代理迈进一步，我们提出了 LUCY，这是一种端到端语音模型，它 (1) 能够感知并响应用户的情绪，(2) 能够以简洁自然的方式做出响应，以及 (3) 使用外部工具回答实时查询。实验结果表明，LUCY 在情绪控制方面优于同类模型，能够根据语言情绪指令生成情绪响应，并响应副语言情绪线索。根据外部语言模型的判断，LUCY 还能够以更自然的方式生成响应，而不会在一般问答方面牺牲太多性能。最后，LUCY 可以利用函数调用来回答超出其知识范围的问题。