2025-03-12

Title: Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models

Authors: Kefan Song, Jin Yao, Runnan Jiang, Rohan Chandra, Shangtong Zhang
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07806
Pdf URL: https://arxiv.org/pdf/2503.07806
Copy Paste: [[2503.07806]] Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models(https://arxiv.org/abs/2503.07806)
Keywords: language model, llm, prompt
Abstract: As Large Language Models (LLMs) become increasingly powerful and accessible to human users, ensuring fairness across diverse demographic groups, i.e., group fairness, is a critical ethical concern. However, current fairness and bias research in LLMs is limited in two aspects. First, compared to traditional group fairness in machine learning classification, it requires that the non-sensitive attributes, in this case, the prompt questions, be the same across different groups. In many practical scenarios, different groups, however, may prefer different prompt questions and this requirement becomes impractical. Second, it evaluates group fairness only for the LLM's final output without identifying the source of possible bias. Namely, the bias in LLM's output can result from both the pretraining and the finetuning. For finetuning, the bias can result from both the RLHF procedure and the learned reward model. Arguably, evaluating the group fairness of each component in the LLM pipeline could help develop better methods to mitigate the possible bias. Recognizing those two limitations, this work benchmarks the group fairness of learned reward models. By using expert-written text from arXiv, we are able to benchmark the group fairness of reward models without requiring the same prompt questions across different demographic groups. Surprisingly, our results demonstrate that all the evaluated reward models (e.g., Nemotron-4-340B-Reward, ArmoRM-Llama3-8B-v0.1, and GRM-llama3-8B-sftreg) exhibit statistically significant group unfairness. We also observed that top-performing reward models (w.r.t. canonical performance metrics) tend to demonstrate better group fairness.
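Illustration: the abstract reports statistically significant differences in reward-model scores across groups but does not say which test was used. One common way to check such a difference is a permutation test on the gap in mean reward. The sketch below is a minimal exact version for tiny samples; the function name and the toy scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation test on the absolute difference in means.

    Enumerates every way of splitting the pooled scores into two groups of
    the original sizes, so it is only practical for very small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical reward scores assigned to texts from two demographic groups.
rewards_group_a = [1.0, 2.0, 3.0]
rewards_group_b = [4.0, 5.0, 6.0]
p_value = exact_permutation_pvalue(rewards_group_a, rewards_group_b)
```

A small p-value would suggest that the reward gap is unlikely under random group assignment; with realistic sample sizes one would sample permutations rather than enumerate them.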

Title: Training Domain Draft Models for Speculative Decoding: Best Practices and Insights

Authors: Fenglu Hong, Ravi Raju, Jonathan Lingjie Li, Bo Li, Urmish Thakker, Avinash Ravichandran, Swayambhoo Jain, Changran Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07807
Pdf URL: https://arxiv.org/pdf/2503.07807
Copy Paste: [[2503.07807]] Training Domain Draft Models for Speculative Decoding: Best Practices and Insights(https://arxiv.org/abs/2503.07807)
Keywords: language model, llm
Abstract: Speculative decoding is an effective method for accelerating inference of large language models (LLMs) by employing a small draft model to predict the output of a target model. However, when adapting speculative decoding to domain-specific target models, the acceptance rate of the generic draft model drops significantly due to domain shift. In this work, we systematically investigate knowledge distillation techniques for training domain draft models to improve their speculation accuracy. We compare white-box and black-box distillation approaches and explore their effectiveness in various data accessibility scenarios, including historical user queries, curated domain data, and synthetically generated alignment data. Our experiments across Function Calling, Biology, and Chinese domains show that offline distillation consistently outperforms online distillation by 11% to 25%, white-box distillation surpasses black-box distillation by 2% to 10%, and data scaling trends hold across domains. Additionally, we find that synthetic data can effectively align draft models and achieve 80% to 93% of the performance of training on historical user queries. These findings provide practical guidelines for training domain-specific draft models to improve speculative decoding efficiency.
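Illustration: to make the notion of "acceptance rate" concrete, here is a minimal sketch of one speculative decoding step in its greedy form, where the target model keeps the longest prefix of draft tokens that matches its own greedy choice. The toy `draft_next` and `target_next` functions are hypothetical stand-ins for real models, and practical systems typically use a probabilistic acceptance rule rather than exact matching.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int], draft_next: NextToken, target_next: NextToken, k: int
) -> Tuple[List[int], int]:
    """Run one greedy speculative step.

    Returns the tokens appended to the prefix and the number of draft
    tokens that the target accepted.
    """
    # 1) The draft model proposes k tokens autoregressively.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target verifies the proposals (done in one batched pass in
    #    practice; a plain loop here for clarity).
    new_tokens = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: emit the target's token and stop.
            new_tokens.append(expected)
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
        ctx.append(tok)

    # All k proposals accepted: the target contributes one extra token.
    new_tokens.append(target_next(ctx))
    return new_tokens, k


# Toy models: the target counts upward modulo 10; the draft agrees except
# right after token 3, mimicking a draft that fails on out-of-domain input.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens, n_accepted = speculative_step([0], draft_next, target_next, k=4)
acceptance_rate = n_accepted / 4
```

In this toy run three of the four drafted tokens are accepted. A draft model that matches the target's domain more closely would raise that fraction, which is what the distillation recipes in the abstract aim for.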

Title: Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation

Authors: Fan Yin, Zifeng Wang, I-Hung Hsu, Jun Yan, Ke Jiang, Yanfei Chen, Jindong Gu, Long T. Le, Kai-Wei Chang, Chen-Yu Lee, Hamid Palangi, Tomas Pfister
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07826
Pdf URL: https://arxiv.org/pdf/2503.07826
Copy Paste: [[2503.07826]] Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation(https://arxiv.org/abs/2503.07826)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have exhibited the ability to effectively utilize external tools to address user queries. However, their performance may be limited in complex, multi-turn interactions involving users and multiple tools. To address this, we propose Magnet, a principled framework for synthesizing high-quality training trajectories to enhance the function calling capability of large language model agents in multi-turn conversations with humans. The framework is based on automatic and iterative translations from a function signature path to a sequence of queries and executable function calls. We model the complicated function interactions in multi-turn cases with a graph and design novel node operations to build reliable signature paths. Motivated by context distillation, when guiding the generation of positive and negative trajectories using a teacher model, we provide reference function call sequences as positive hints in context and contrastive, incorrect function calls as negative hints. Experiments show that, by training on the positive trajectories with supervised fine-tuning and applying preference optimization against negative trajectories, our 14B model, Magnet-14B-mDPO, obtains 68.01 on BFCL-v3 and 73.30 on ToolQuery, surpassing the performance of the teacher model Gemini-1.5-pro-002 by a large margin in function calling.

Title: Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan

Authors: Matthias Schöffel, Marinus Wiedner, Esteban Garces Arias, Paula Ruppert, Christian Heumann, Matthias Aßenmacher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07827
Pdf URL: https://arxiv.org/pdf/2503.07827
Copy Paste: [[2503.07827]] Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan(https://arxiv.org/abs/2503.07827)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora (hagiographical and medical texts), we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.

Title: HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations

Authors: Samir Abdaljalil, Hasan Kurban, Erchin Serpedin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07833
Pdf URL: https://arxiv.org/pdf/2503.07833
Copy Paste: [[2503.07833]] HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations(https://arxiv.org/abs/2503.07833)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) are increasingly used in various contexts, yet remain prone to generating non-factual content, commonly referred to as "hallucinations". The literature categorizes hallucinations into several types, including entity-level, relation-level, and sentence-level hallucinations. However, existing hallucination datasets often fail to capture fine-grained hallucinations in multilingual settings. In this work, we introduce HalluVerse25, a multilingual LLM hallucination dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality. We evaluate several LLMs on HalluVerse25, providing valuable insights into how proprietary models perform in detecting LLM-generated hallucinations across different contexts.

Title: MapQA: Open-domain Geospatial Question Answering on Map Data

Authors: Zekun Li, Malcolm Grossman, Eric (Ehsan) Qasemi, Mihir Kulkarni, Muhao Chen, Yao-Yi Chiang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.07871
Pdf URL: https://arxiv.org/pdf/2503.07871
Copy Paste: [[2503.07871]] MapQA: Open-domain Geospatial Question Answering on Map Data(https://arxiv.org/abs/2503.07871)
Keywords: language model, gpt, llm
Abstract: Geospatial question answering (QA) is a fundamental task in navigation and point of interest (POI) searches. While geospatial QA datasets exist, they are limited in both scale and diversity, often relying solely on textual descriptions of geo-entities without considering their geometries. A major challenge in scaling geospatial QA datasets for reasoning lies in the complexity of geospatial relationships, which require integrating spatial structures, topological dependencies, and multi-hop reasoning capabilities that most text-based QA datasets lack. To address these limitations, we introduce MapQA, a novel dataset that not only provides question-answer pairs but also includes the geometries of geo-entities referenced in the questions. MapQA is constructed using SQL query templates to extract question-answer pairs from OpenStreetMap (OSM) for two study regions: Southern California and Illinois. It consists of 3,154 QA pairs spanning nine question types that require geospatial reasoning, such as neighborhood inference and geo-entity type identification. Compared to existing datasets, MapQA expands both the number and diversity of geospatial question types. We explore two approaches to tackle this challenge: (1) a retrieval-based language model that ranks candidate geo-entities by embedding similarity, and (2) a large language model (LLM) that generates SQL queries from natural language questions and geo-entity attributes, which are then executed against an OSM database. Our findings indicate that retrieval-based methods effectively capture concepts like closeness and direction but struggle with questions that require explicit computations (e.g., distance calculations). LLMs (e.g., GPT and Gemini) excel at generating SQL queries for one-hop reasoning but face challenges with multi-hop reasoning, highlighting a key bottleneck in advancing geospatial QA systems.
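Illustration: the first approach in the abstract ranks candidate geo-entities by embedding similarity. The sketch below shows that ranking step with cosine similarity over hand-written toy vectors. The entity names and vectors are hypothetical; a real system would obtain embeddings from a trained encoder, and the abstract does not say which similarity measure the authors use.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    query_vec: List[float], candidates: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings for a question and three candidate geo-entities.
question_embedding = [1.0, 0.0]
entity_embeddings = {
    "cafe": [0.9, 0.1],
    "park": [0.0, 1.0],
    "school": [0.5, 0.5],
}
ranking = [name for name, _ in rank_candidates(question_embedding, entity_embeddings)]
```

Similarity ranking of this kind captures "closeness" in embedding space but performs no explicit arithmetic, which is consistent with the reported difficulty on questions that need distance calculations.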

Title: Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Authors: Alex Fang, Hadi Pouransari, Matt Jordan, Alexander Toshev, Vaishaal Shankar, Ludwig Schmidt, Tom Gunter
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07879
Pdf URL: https://arxiv.org/pdf/2503.07879
Copy Paste: [[2503.07879]] Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality(https://arxiv.org/abs/2503.07879)
Keywords: language model
Abstract: Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. In efforts to better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude. While this finding relies on repeating the dataset for many epochs, we also investigate repeats within these datasets at the document level. We find that not all documents within a dataset are equal, and we can create better datasets relative to a token budget by explicitly manipulating the counts of individual documents. We conclude by arguing that even as large language models scale, data filtering remains an important direction of research.

Title: Gemini Embedding: Generalizable Embeddings from Gemini

Authors: Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, Xiaoqi Ren, Shanfeng Zhang, Daniel Salz, Michael Boratko, Jay Han, Blair Chen, Shuo Huang, Vikram Rao, Paul Suganthan, Feng Han, Andreas Doumanoglou, Nithi Gupta, Fedor Moiseev, Cathy Yip, Aashi Jain, Simon Baumgartner, Shahrokh Shahi, Frank Palma Gomez, Sandeep Mariserla, Min Choi, Parashar Shah, Sonam Goenka, Ke Chen, Ye Xia, Koert Chen, Sai Meher Karthik Duddu, Yichang Chen, Trevor Walker, Wenlei Zhou, Rakesh Ghiya, Zach Gleicher, Karan Gill, Zhe Dong, Mojtaba Seyedhosseini, Yunhsuan Sung, Raphael Hoffmann, Tom Duerig
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07891
Pdf URL: https://arxiv.org/pdf/2503.07891
Copy Paste: [[2503.07891]] Gemini Embedding: Generalizable Embeddings from Gemini(https://arxiv.org/abs/2503.07891)
Keywords: language model
Abstract: In this report, we introduce Gemini Embedding, a state-of-the-art embedding model leveraging the power of Gemini, Google's most capable large language model. Capitalizing on Gemini's inherent multilingual and code understanding capabilities, Gemini Embedding produces highly generalizable embeddings for text spanning numerous languages and textual modalities. The representations generated by Gemini Embedding can be precomputed and applied to a variety of downstream tasks including classification, similarity, clustering, ranking, and retrieval. Evaluated on the Massive Multilingual Text Embedding Benchmark (MMTEB), which includes over one hundred tasks across 250+ languages, Gemini Embedding substantially outperforms prior state-of-the-art models, demonstrating considerable improvements in embedding quality. Achieving state-of-the-art performance across MMTEB's multilingual, English, and code benchmarks, our unified model demonstrates strong capabilities across a broad selection of tasks and surpasses specialized domain-specific models.

Title: Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?

Authors: Payel Das, Ching-Yun Ko, Sihui Dai, Georgios Kollias, Subhajit Chaudhury, Aurelie Lozano
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07903
Pdf URL: https://arxiv.org/pdf/2503.07903
Copy Paste: [[2503.07903]] Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?(https://arxiv.org/abs/2503.07903)
Keywords: language model, llm
Abstract: Large language models often expose their brittleness in reasoning tasks, especially while executing long chains of reasoning over context. We propose MemReasoner, a new and simple memory-augmented LLM architecture, in which the memory learns the relative order of facts in context, and enables hopping over them, while the decoder selectively attends to the memory. MemReasoner is trained end-to-end, with optional supporting fact supervision of varying degrees. We train MemReasoner, along with existing memory-augmented transformer models and a state-space model, on two distinct synthetic multi-hop reasoning tasks. Experiments performed under a variety of challenging scenarios, including the presence of long distractor text or target answer changes in the test set, show strong generalization of MemReasoner on both single- and two-hop tasks. This generalization of MemReasoner is achieved using none-to-weak supporting fact supervision (using none and 1% of supporting facts for one- and two-hop tasks, respectively). In contrast, baseline models overall struggle to generalize and benefit far less from using full supporting fact supervision. The results highlight the importance of explicit memory mechanisms, combined with additional weak supervision, for improving large language models' context processing ability toward reasoning tasks.

Title: EFPC: Towards Efficient and Flexible Prompt Compression

Authors: Yun-Hao Cao, Yangsong Wang, Shuzheng Hao, Zhenxing Li, Chengjun Zhan, Sichao Liu, Yi-Qi Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07956
Pdf URL: https://arxiv.org/pdf/2503.07956
Copy Paste: [[2503.07956]] EFPC: Towards Efficient and Flexible Prompt Compression(https://arxiv.org/abs/2503.07956)
Keywords: language model, gpt, llm, prompt
Abstract: The emergence of large language models (LLMs) like GPT-4 has revolutionized natural language processing (NLP), enabling diverse, complex tasks. However, extensive token counts lead to high computational and financial burdens. To address this, we propose Efficient and Flexible Prompt Compression (EFPC), a novel method unifying task-aware and task-agnostic compression for a favorable accuracy-efficiency trade-off. EFPC uses GPT-4 to generate compressed prompts and integrates them with original prompts for training. During training and inference, we selectively prepend user instructions and compress prompts based on predicted probabilities. EFPC is highly data-efficient, achieving significant performance with minimal data. Compared to the state-of-the-art method LLMLingua-2, EFPC achieves a 4.8% relative improvement in F1-score with 1% additional data at a 4x compression rate, and an 11.4% gain with 10% additional data on the LongBench single-doc QA benchmark. EFPC's unified framework supports broad applicability and enhances performance across various models, tasks, and domains, offering a practical advancement in NLP.

Title: LabelCoRank: Revolutionizing Long Tail Multi-Label Classification with Co-Occurrence Reranking

Authors: Yan Yan, Junyuan Liu, Bo-Wen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07968
Pdf URL: https://arxiv.org/pdf/2503.07968
Copy Paste: [[2503.07968]] LabelCoRank: Revolutionizing Long Tail Multi-Label Classification with Co-Occurrence Reranking(https://arxiv.org/abs/2503.07968)
Keywords: language model
Abstract: Motivation: Despite recent advancements in semantic representation driven by pre-trained and large-scale language models, addressing long tail challenges in multi-label text classification remains a significant issue. Long tail challenges have persistently posed difficulties in accurately classifying less frequent labels. Current approaches often focus on improving text semantics while neglecting the crucial role of label relationships. Results: This paper introduces LabelCoRank, a novel approach inspired by ranking principles. LabelCoRank leverages label co-occurrence relationships to refine initial label classifications through a dual-stage reranking process. The first stage uses initial classification results to form a preliminary ranking. In the second stage, a label co-occurrence matrix is utilized to rerank the preliminary results, enhancing the accuracy and relevance of the final classifications. By integrating the reranked label representations as additional text features, LabelCoRank effectively mitigates long tail issues in multi-label text classification. Experimental evaluations on popular datasets including MAG-CS, PubMed, and AAPD demonstrate the effectiveness and robustness of LabelCoRank.
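Illustration: the two-stage idea (initial scores, then reranking with a label co-occurrence matrix) can be sketched with a simple score-propagation rule. This is not the paper's exact formulation: the mixing weight `alpha`, the row normalization, and the toy data are all assumptions made for the example.

```python
from typing import List, Set


def cooccurrence_matrix(label_sets: List[Set[int]], n_labels: int) -> List[List[float]]:
    """Count how often each pair of distinct labels appears together, then row-normalize."""
    counts = [[0.0] * n_labels for _ in range(n_labels)]
    for labels in label_sets:
        for i in labels:
            for j in labels:
                if i != j:
                    counts[i][j] += 1.0
    for row in counts:
        total = sum(row)
        if total > 0:
            for j in range(n_labels):
                row[j] /= total
    return counts


def rerank(scores: List[float], cooc: List[List[float]], alpha: float) -> List[int]:
    """Blend initial scores with scores propagated through co-occurrence.

    Returns label indices sorted from highest to lowest blended score.
    """
    n = len(scores)
    propagated = [sum(scores[i] * cooc[i][j] for i in range(n)) for j in range(n)]
    blended = [(1 - alpha) * scores[j] + alpha * propagated[j] for j in range(n)]
    return sorted(range(n), key=lambda j: blended[j], reverse=True)


# Toy training label sets: labels 0 and 2 tend to co-occur; label 1 stands alone.
train_label_sets = [{0, 2}, {0, 2}, {1}]
cooc = cooccurrence_matrix(train_label_sets, n_labels=3)

# Stage 1: hypothetical classifier scores that under-rate label 2.
initial_scores = [0.9, 0.5, 0.4]
initial_ranking = sorted(range(3), key=lambda j: initial_scores[j], reverse=True)

# Stage 2: rerank using co-occurrence with the confidently predicted label 0.
final_ranking = rerank(initial_scores, cooc, alpha=0.3)
```

In this toy case label 2 moves ahead of label 1 after reranking because it frequently co-occurs with the confidently predicted label 0, which is the kind of lift for less frequent labels that the abstract describes.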

Title: Enhancing Multilingual Language Models for Code-Switched Input Data

Authors: Katherine Xie, Nitya Babbar, Vicky Chen, Yoanna Turura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07990
Pdf URL: https://arxiv.org/pdf/2503.07990
Copy Paste: [[2503.07990]] Enhancing Multilingual Language Models for Code-Switched Input Data(https://arxiv.org/abs/2503.07990)
Keywords: language model
Abstract: Code-switching, or alternating between languages within a single conversation, presents challenges for multilingual language models on NLP tasks. This research investigates if pre-training Multilingual BERT (mBERT) on code-switched datasets improves the model's performance on critical NLP tasks such as part-of-speech tagging, sentiment analysis, named entity recognition, and language identification. We use a dataset of Spanglish tweets for pre-training and evaluate the pre-trained model against a baseline model. Our findings show that our pre-trained mBERT model outperforms or matches the baseline model in the given tasks, with the most significant improvements seen for part-of-speech tagging. Additionally, our latent analysis uncovers more homogeneous English and Spanish embeddings for language identification tasks, providing insights for future modeling work. This research highlights the potential for adapting multilingual LMs to code-switched input data to improve their utility in globalized and multilingual contexts. Future work includes extending experiments to other language pairs, incorporating multiform data, and exploring methods for better understanding context-dependent code-switches.

Title: In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

Authors: Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long T. Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, Tomas Pfister
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08026
Pdf URL: https://arxiv.org/pdf/2503.08026
Copy Paste: [[2503.08026]] In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents(https://arxiv.org/abs/2503.08026)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities (utterances, turns, and sessions) into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs' cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.

Title: Learning to Search Effective Example Sequences for In-Context Learning

Authors: Xiang Gao, Ankita Sinha, Kamalika Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08030
Pdf URL: https://arxiv.org/pdf/2503.08030
Copy Paste: [[2503.08030]] Learning to Search Effective Example Sequences for In-Context Learning(https://arxiv.org/abs/2503.08030)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate impressive few-shot learning capabilities, but their performance varies widely based on the sequence of in-context examples. Key factors influencing this include the sequence's length, composition, and arrangement, as well as its relation to the specific query. Existing methods often tackle these factors in isolation, overlooking their interdependencies. Moreover, the extensive search space for selecting optimal sequences complicates the development of a holistic approach. In this work, we introduce Beam Search-based Example Sequence Constructor (BESC), a novel method for learning to construct optimal example sequences. BESC addresses all key factors involved in sequence selection by considering them jointly during inference, while incrementally building the sequence. This design enables the use of beam search to significantly reduce the complexity of the search space. Experiments across various datasets and language models show notable improvements in performance.
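Illustration: the abstract describes building an example sequence incrementally and using beam search to keep the search tractable. The sketch below shows generic beam search over ordered example sequences. The hand-written `toy_score` function is a hypothetical stand-in for BESC's learned scoring, which the abstract does not specify.

```python
from typing import Callable, List, Sequence, Tuple

Seq = Tuple[str, ...]


def beam_search_sequences(
    pool: Sequence[str],
    score_fn: Callable[[Seq], float],
    beam_width: int,
    max_len: int,
) -> Tuple[Seq, float]:
    """Grow sequences one example at a time, keeping the top `beam_width` per step.

    Returns the best-scoring sequence seen at any length, so the search also
    decides how many examples to include.
    """
    beams: List[Tuple[Seq, float]] = [((), score_fn(()))]
    best = beams[0]
    for _ in range(max_len):
        candidates = []
        for seq, _ in beams:
            for example in pool:
                if example in seq:
                    continue  # each example is used at most once
                extended = seq + (example,)
                candidates.append((extended, score_fn(extended)))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
        if beams[0][1] > best[1]:
            best = beams[0]
    return best


# Hypothetical scoring: per-example gain, a length penalty, and an order bonus
# so that composition, length, and arrangement all matter.
GAINS = {"a": 1.0, "b": 2.0, "c": 0.5}


def toy_score(seq: Seq) -> float:
    score = sum(GAINS[ex] for ex in seq) - 0.8 * len(seq)
    if "a" in seq and "b" in seq and seq.index("b") < seq.index("a"):
        score += 1.0
    return score


best_seq, best_score = beam_search_sequences(["a", "b", "c"], toy_score, beam_width=2, max_len=3)
```

With three examples there are only 15 non-empty ordered sequences, but the count grows factorially with pool size; the beam scores at most `beam_width * len(pool)` extensions per step instead.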

Title: Group Preference Alignment: Customized LLM Response Generation from In-Situ Conversations

Authors: Ishani Mondal, Jack W. Stokes, Sujay Kumar Jauhar, Longqi Yang, Mengting Wan, Xiaofeng Xu, Xia Song, Jennifer Neville
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08035
Pdf URL: https://arxiv.org/pdf/2503.08035
Copy Paste: [[2503.08035]] Group Preference Alignment: Customized LLM Response Generation from In-Situ Conversations(https://arxiv.org/abs/2503.08035)
Keywords: llm, prompt
Abstract: LLMs often fail to meet the specialized needs of distinct user groups due to their one-size-fits-all training paradigm (Lucy et al., 2024), and there is limited research on what personalization aspects each group expects. To address these limitations, we propose a group-aware personalization framework, Group Preference Alignment (GPA), that identifies context-specific variations in conversational preferences across user groups and then steers LLMs to address those preferences. Our approach consists of two steps: (1) Group-Aware Preference Extraction, where maximally divergent user-group preferences are extracted from real-world conversation logs and distilled into interpretable rubrics, and (2) Tailored Response Generation, which leverages these rubrics through two methods: a) Context-Tuned Inference (GPA-CT), which dynamically adjusts responses via context-dependent prompt instructions, and b) Rubric-Finetuning Inference (GPA-FT), which uses the rubrics to generate contrastive synthetic data for personalization of group-specific models via alignment. Experiments demonstrate that our framework significantly improves alignment of the output with respect to user preferences and outperforms baseline methods, while maintaining robust performance on standard benchmarks.

Title: A General Framework to Evaluate Methods for Assessing Dimensions of Lexical Semantic Change Using LLM-Generated Synthetic Data

Authors: Naomi Baes, Raphaël Merx, Nick Haslam, Ekaterina Vylomova, Haim Dubossarsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08042
Pdf URL: https://arxiv.org/pdf/2503.08042
Copy Paste: [[2503.08042]] A General Framework to Evaluate Methods for Assessing Dimensions of Lexical Semantic Change Using LLM-Generated Synthetic Data(https://arxiv.org/abs/2503.08042)
Keywords: llm
Abstract: Lexical Semantic Change (LSC) offers insights into cultural and social dynamics. However, the validity of methods for measuring different kinds of LSC has yet to be established due to the absence of historical benchmark datasets. To address this gap, we develop a novel three-stage evaluation framework that involves: 1) creating a scalable, domain-general methodology for generating synthetic datasets that simulate theory-driven LSC across time, leveraging In-Context Learning and a lexical database; 2) using these datasets to evaluate the effectiveness of various methods; and 3) assessing their suitability for specific dimensions and domains. We apply this framework to simulate changes across key dimensions of LSC (SIB: Sentiment, Intensity, and Breadth) using examples from psychology, and evaluate the sensitivity of selected methods to detect these artificially induced changes. Our findings support the utility of the synthetic data approach, validate the efficacy of tailored methods for detecting synthetic changes in SIB, and reveal that a state-of-the-art LSC model faces challenges in detecting affective dimensions of LSC. This framework provides a valuable tool for dimension- and domain-specific benchmarking and evaluation of LSC methods, with particular benefits for the social sciences.

Title: Odysseus Navigates the Sirens' Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation

Authors: Wen Luo, Feifan Song, Wei Li, Guangyue Peng, Shaohang Wei, Houfeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08057
Pdf URL: https://arxiv.org/pdf/2503.08057
Copy Paste: [[2503.08057]] Odysseus Navigates the Sirens' Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation(https://arxiv.org/abs/2503.08057)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly required to generate text that is both factually accurate and diverse across various open-ended applications. However, current stochastic decoding methods struggle to balance such objectives. We introduce Dynamic Focus Decoding (DFD), a novel plug-and-play stochastic approach that resolves this trade-off without requiring additional data, knowledge, or models. DFD adaptively adjusts the decoding focus based on distributional differences across layers, leveraging the modular and hierarchical nature of factual knowledge within LLMs. This dynamic adjustment improves factuality in knowledge-intensive decoding steps and promotes diversity in less knowledge-reliant steps. DFD can be easily integrated with existing decoding methods, enhancing both factuality and diversity with minimal computational overhead. Extensive experiments across seven datasets demonstrate that DFD significantly improves performance, providing a scalable and efficient solution for open-ended text generation.

Title: Context-aware Biases for Length Extrapolation

Authors: Ali Veisi, Amir Mansourian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08067
Pdf URL: https://arxiv.org/pdf/2503.08067
Copy Paste: [[2503.08067]] Context-aware Biases for Length Extrapolation(https://arxiv.org/abs/2503.08067)
Keywords: gpt
Abstract: Transformers' ability to generalize to longer sequences than they have been trained on, known as length extrapolation, degrades as sequence length increases. Most Relative Positional Encoding (RPE) methods address this problem by either adding constant linear biases or learning general biases, and so lack the ability to specialize for different sequences. In this work, inspired by ALiBi, we propose Context-aware Biases for Length Extrapolation (Cable), which learns token-specific biases for each head in decoder-based transformers. Cable learns adaptive, context-aware biases, overcoming the limitations of fixed patterns by adding dynamic biases specific to each token in the sequence. Results show that when tested on a sequence length of 1024, a GPT-3 Medium (334M parameters) with our positional encoding, trained on a sequence length of 512, achieves better perplexity (-0.65) than a similar network with sinusoidal positional encoding trained on a sequence length of 1024. This is achieved with 48% lower memory usage, and only 3.5% higher training time. Furthermore, our method notably improves the extrapolation ability of existing RPE methods on the Edu-FineWeb10B and WikiText-103 datasets. Code is available at: this https URL
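For readers unfamiliar with the "constant linear biases" the abstract contrasts Cable against, the following is a minimal sketch of the ALiBi-style baseline, not of Cable itself. Function names are illustrative, and the slope schedule shown is the standard geometric one for head counts that are powers of two. The key point is that the bias depends only on the distance between query and key positions, never on token content, which is the limitation Cable's learned, token-specific biases are meant to address.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (power-of-two head counts)."""
    return [2.0 ** (-8.0 * (head + 1) / num_heads) for head in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Causal bias matrix added to attention scores before the softmax.

    bias[i][j] = -slope * (i - j) for keys at or before the query position;
    future positions are masked with -inf.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    for row in alibi_bias(4, slopes[0]):
        print(row)
```

Because the penalty grows linearly with distance, the same matrix formula applies unchanged to sequences longer than those seen in training.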

Title: OASIS: Order-Augmented Strategy for Improved Code Search

Authors: Zuchen Gao, Zizheng Zhan, Xianming Li, Erxin Yu, Haotian Zhang, Yuqun Zhang, Jing Li
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.08161
Pdf URL: https://arxiv.org/pdf/2503.08161
Copy Paste: [[2503.08161]] OASIS: Order-Augmented Strategy for Improved Code Search(https://arxiv.org/abs/2503.08161)
Keywords: language model, llm
Abstract: Code embeddings capture the semantic representations of code and are crucial for various code-related large language model (LLM) applications, such as code search. Previous training primarily relies on optimizing the InfoNCE loss by comparing positive natural language (NL)-code pairs with in-batch negatives. However, due to the sparse nature of code contexts, training solely by comparing the major differences between positive and negative pairs may fail to capture deeper semantic nuances. To address this issue, we propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. Extensive benchmark evaluations demonstrate that our OASIS model significantly outperforms previous state-of-the-art models focusing solely on major positive-negative differences. It underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.
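As background for the InfoNCE objective with in-batch negatives that the abstract describes as the prior training recipe, here is a minimal pure-Python sketch. It is not the OASIS loss: the order-based similarity labels are not specified in the abstract and are not modeled here. The function name, the temperature value, and the toy similarity matrices are assumptions for illustration.

```python
import math


def info_nce(similarity: list[list[float]], temperature: float = 0.05) -> float:
    """Mean InfoNCE loss over a batch of query/code pairs.

    similarity[i][j] is the similarity between query i and code snippet j.
    The diagonal entry is the positive pair; every other entry in the row
    serves as an in-batch negative.
    """
    batch_size = len(similarity)
    total = 0.0
    for i in range(batch_size):
        logits = [s / temperature for s in similarity[i]]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / batch_size


if __name__ == "__main__":
    separated = [[1.0, 0.0], [0.0, 1.0]]  # positives score highest
    confused = [[0.0, 1.0], [1.0, 0.0]]   # negatives score highest
    print(info_nce(separated), info_nce(confused))
```

The loss treats all negatives in a row identically, which is the gap the abstract points to: it rewards separating the positive from the negatives but carries no signal about how similar each negative is.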

Title: RigoChat 2: an adapted language model to Spanish using a bounded dataset and reduced hardware

Authors: Gonzalo Santamaría Gómez, Guillem García Subies, Pablo Gutiérrez Ruiz, Mario González Valero, Natàlia Fuertes, Helena Montoro Zamorano, Carmen Muñoz Sanz, Leire Rosado Plaza, Nuria Aldama García, David Betancur Sánchez, Kateryna Sushkova, Marta Guerrero Nieto, Álvaro Barbero Jiménez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08188
Pdf URL: https://arxiv.org/pdf/2503.08188
Copy Paste: [[2503.08188]] RigoChat 2: an adapted language model to Spanish using a bounded dataset and reduced hardware(https://arxiv.org/abs/2503.08188)
Keywords: language model, llm, chat
Abstract: Large Language Models (LLMs) have become a key element of modern artificial intelligence, demonstrating the ability to address a wide range of language processing tasks at unprecedented levels of accuracy without the need to collect problem-specific data. However, these versatile models face a significant challenge: both their training and inference processes require substantial computational resources, time, and memory. Consequently, optimizing this kind of model to minimize these requirements is crucial. In this article, we demonstrate that, with minimal resources and in a remarkably short time, it is possible to enhance a state-of-the-art model for a given language task without compromising its overall capabilities, using a relatively small pretrained LLM as a basis. Specifically, we present our use case, RigoChat 2, illustrating how LLMs can be adapted to achieve superior results in Spanish-language tasks.

Title: Automating Violence Detection and Categorization from Ancient Texts

Authors: Alhassan Abdelhalim, Michaela Regneri
Subjects: cs.CL, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08192
Pdf URL: https://arxiv.org/pdf/2503.08192
Copy Paste: [[2503.08192]] Automating Violence Detection and Categorization from Ancient Texts(https://arxiv.org/abs/2503.08192)
Keywords: language model, llm
Abstract: Violence descriptions in literature offer valuable insights for a wide range of research in the humanities. For historians, depictions of violence are of special interest for analyzing the societal dynamics surrounding large wars and individual conflicts of influential people. Harvesting data for violence research manually is laborious and time-consuming. This study is the first one to evaluate the effectiveness of large language models (LLMs) in identifying violence in ancient texts and categorizing it across multiple dimensions. Our experiments identify LLMs as a valuable tool to scale up the accurate analysis of historical texts and show the effect of fine-tuning and data augmentation, yielding an F1-score of up to 0.93 for violence detection and 0.86 for fine-grained violence categorization.

Title: Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation

Authors: Wenlong Meng, Fan Zhang, Wendao Yao, Zhenyuan Guo, Yuwei Li, Chengkun Wei, Wenzhi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08195
Pdf URL: https://arxiv.org/pdf/2503.08195
Copy Paste: [[2503.08195]] Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation(https://arxiv.org/abs/2503.08195)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) have demonstrated significant utility in a wide range of applications; however, their deployment is plagued by security vulnerabilities, notably jailbreak attacks. These attacks manipulate LLMs to generate harmful or unethical content by crafting adversarial prompts. While much of the current research on jailbreak attacks has focused on single-turn interactions, it has largely overlooked the impact of historical dialogues on model behavior. In this paper, we introduce a novel jailbreak paradigm, Dialogue Injection Attack (DIA), which leverages the dialogue history to enhance the success rates of such attacks. DIA operates in a black-box setting, requiring only access to the chat API or knowledge of the LLM's chat template. We propose two methods for constructing adversarial historical dialogues: one adapts gray-box prefilling attacks, and the other exploits deferred responses. Our experiments show that DIA achieves state-of-the-art attack success rates on recent LLMs, including Llama-3.1 and GPT-4o. Additionally, we demonstrate that DIA can bypass 5 different defense mechanisms, highlighting its robustness and effectiveness.

Title: DeepRAG: Building a Custom Hindi Embedding Model for Retrieval Augmented Generation from Scratch

Authors: Nandakishor M
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08213
Pdf URL: https://arxiv.org/pdf/2503.08213
Copy Paste: [[2503.08213]] DeepRAG: Building a Custom Hindi Embedding Model for Retrieval Augmented Generation from Scratch(https://arxiv.org/abs/2503.08213)
Keywords: llm, retrieval augmented generation
Abstract: In this paper, I present our work on DeepRAG, a specialized embedding model we built specifically for Hindi language in RAG systems. While LLMs have gotten really good at generating text, their performance in retrieval tasks still depends heavily on having quality embeddings - something that's been lacking for Hindi despite being one of the world's most spoken languages. We tackled this by creating embeddings from the ground up rather than just fine-tuning existing models. Our process involved collecting diverse Hindi texts (over 2.7M samples), training a custom SentencePiece tokenizer that actually understands Hindi morphology, designing transformer architecture with Hindi-specific attention mechanisms, and optimizing with contrastive learning. Results were honestly better than I expected - we saw a 23% improvement in retrieval precision compared to the multilingual models everyone's been using. The paper details our methodology, which I think could help others working with low-resource languages where the one-size-fits-all multilingual models fall short. We've also integrated our embeddings with LangChain to build complete Hindi RAG systems, which might be useful for practitioners. While there's still tons more to explore, I believe this work addresses a critical gap for Hindi NLP and demonstrates why language-specific approaches matter.

Title: Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges

Authors: Xiaoxiao Liu, Qingying Xiao, Junying Chen, Xiangyi Feng, Xiangbo Wu, Bairui Zhang, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08292
Pdf URL: https://arxiv.org/pdf/2503.08292
Copy Paste: [[2503.08292]] Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges(https://arxiv.org/abs/2503.08292)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems. However, there is a lack of standardized evaluation criteria to assess their effectiveness, particularly in dynamic, interactive scenarios. In this study, we systematically examine the capabilities and limitations of LLMs in managing tasks within Intelligent Outpatient Referral (IOR) systems and propose a comprehensive evaluation framework specifically designed for such systems. This framework comprises two core tasks: static evaluation, which focuses on evaluating the ability of predefined outpatient referrals, and dynamic evaluation, which evaluates capabilities of refining outpatient referral recommendations through iterative dialogues. Our findings suggest that LLMs offer limited advantages over BERT-like models, but show promise in asking effective questions during interactive dialogues.

Title: Towards Scalable and Cross-Lingual Specialist Language Models for Oncology

Authors: Morteza Rohanian, Tarun Mehra, Nicola Miglino, Farhad Nooralahzadeh, Michael Krauthammer, Andreas Wicki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08323
Pdf URL: https://arxiv.org/pdf/2503.08323
Copy Paste: [[2503.08323]] Towards Scalable and Cross-Lingual Specialist Language Models for Oncology(https://arxiv.org/abs/2503.08323)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Clinical oncology generates vast, unstructured data that often contain inconsistencies, missing information, and ambiguities, making it difficult to extract reliable insights for data-driven decision-making. General-purpose large language models (LLMs) struggle with these challenges due to their lack of domain-specific reasoning, including specialized clinical terminology, context-dependent interpretations, and multi-modal data integration. We address these issues with an oncology-specialized, efficient, and adaptable NLP framework that combines instruction tuning, retrieval-augmented generation (RAG), and graph-based knowledge integration. Our lightweight models prove effective at oncology-specific tasks, such as named entity recognition (e.g., identifying cancer diagnoses), entity linking (e.g., linking entities to standardized ontologies), TNM staging, document classification (e.g., cancer subtype classification from pathology reports), and treatment response prediction. Our framework emphasizes adaptability and resource efficiency. We include minimal German instructions, collected at the University Hospital Zurich (USZ), to test whether small amounts of non-English language data can effectively transfer knowledge across languages. This approach mirrors our motivation for lightweight models, which balance strong performance with reduced computational costs, making them suitable for resource-limited healthcare settings. We validated our models on oncology datasets, demonstrating strong results in named entity recognition, relation extraction, and document classification.

Title: OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning

Authors: Jiawei Zhou, Lei Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.08398
Pdf URL: https://arxiv.org/pdf/2503.08398
Copy Paste: [[2503.08398]] OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning(https://arxiv.org/abs/2503.08398)
Keywords: language model, llm, retrieval-augmented generation
Abstract: In this paper, we analyze and empirically show that the learned relevance for conventional information retrieval (IR) scenarios may be inconsistent in retrieval-augmented generation (RAG) scenarios. To bridge this gap, we introduce OpenRAG, a RAG framework that is optimized end-to-end by tuning the retriever to capture in-context relevance, enabling adaptation to the diverse and evolving needs. Extensive experiments across a wide range of tasks demonstrate that OpenRAG, by tuning a retriever end-to-end, leads to a consistent improvement of 4.0% over the original retriever, consistently outperforming existing state-of-the-art retrievers by 2.1%. Additionally, our results indicate that for some tasks, an end-to-end tuned 0.2B retriever can achieve improvements that surpass those of RAG-oriented or instruction-tuned 8B large language models (LLMs), highlighting the cost-effectiveness of our approach in enhancing RAG systems.

Title: Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information

Authors: Elizaveta Kuznetsova, Ilaria Vitulano, Mykola Makhortykh, Martha Stolze, Tomas Nagy, Victoria Vziatysheva
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.08404
Pdf URL: https://arxiv.org/pdf/2503.08404
Copy Paste: [[2503.08404]] Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information(https://arxiv.org/abs/2503.08404)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking and contribute to the broader debate on the use of automated means for veracity identification. To achieve this purpose, we use AI auditing methodology that systematically evaluates performance of five LLMs (ChatGPT 4, Llama 3 (70B), Llama 3.1 (405B), Claude 3.5 Sonnet, and Google Gemini) using prompts regarding a large set of statements fact-checked by professional journalists (16,513). Specifically, we use topic modeling and regression analysis to investigate which factors (e.g. topic of the prompt or the LLM type) affect evaluations of true, false, and mixed statements. Our findings reveal that while ChatGPT 4 and Google Gemini achieved higher accuracy than other models, overall performance across models remains modest. Notably, the results indicate that models are better at identifying false statements, especially on sensitive topics such as COVID-19, American political controversies, and social issues, suggesting possible guardrails that may enhance accuracy on these topics. The major implication of our findings is that there are significant challenges for using LLMs for fact-checking, including significant variation in performance across different LLMs and unequal quality of outputs for specific topics, which can be attributed to deficits of training data. Our research highlights the potential and limitations of LLMs in political fact-checking, suggesting potential avenues for further improvements in guardrails as well as fine-tuning.

Title: Enhancing Multi-Hop Fact Verification with Structured Knowledge-Augmented Large Language Models

Authors: Han Cao, Lingwei Wei, Wei Zhou, Songlin Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08495
Pdf URL: https://arxiv.org/pdf/2503.08495
Copy Paste: [[2503.08495]] Enhancing Multi-Hop Fact Verification with Structured Knowledge-Augmented Large Language Models(https://arxiv.org/abs/2503.08495)
Keywords: language model, llm
Abstract: The rapid development of social platforms exacerbates the dissemination of misinformation, which stimulates research on fact verification. Recent studies tend to leverage semantic features to solve this problem as a single-hop task. However, in real-world situations, verifying a claim requires several pieces of evidence with complicated inner logic and relations. Recent studies attempt to improve both understanding and reasoning abilities to enhance performance, but they overlook the crucial relations between entities, which help models understand claims better and facilitate prediction. To emphasize the significance of relations, we resort to Large Language Models (LLMs), considering their excellent understanding ability. Unlike other methods that use LLMs as the predictor, we use them as relation extractors, since according to the experimental results they do better at understanding than at reasoning. Thus, to solve the challenges above, we propose a novel Structured Knowledge-Augmented LLM-based Network (LLM-SKAN) for multi-hop fact verification. Specifically, we utilize an LLM-driven Knowledge Extractor to capture fine-grained information, including entities and their complicated relations. In addition, we leverage a Knowledge-Augmented Relation Graph Fusion module to interact with each node and learn better claim-evidence representations comprehensively. The experimental results on four commonly used datasets demonstrate the effectiveness and superiority of our model.

Title: ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews

Authors: Xian Gao, Jiacheng Ruan, Jingsheng Gao, Ting Liu, Yuzhuo Fu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08506
Pdf URL: https://arxiv.org/pdf/2503.08506
Copy Paste: [[2503.08506]] ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews(https://arxiv.org/abs/2503.08506)
Keywords: language model, llm, agent
Abstract: Academic paper review is a critical yet time-consuming task within the research community. With the increasing volume of academic publications, automating the review process has become a significant challenge. The primary issue lies in generating comprehensive, accurate, and reasoning-consistent review comments that align with human reviewers' judgments. In this paper, we address this challenge by proposing ReviewAgents, a framework that leverages large language models (LLMs) to generate academic paper reviews. We first introduce a novel dataset, Review-CoT, consisting of 142k review comments, designed for training LLM agents. This dataset emulates the structured reasoning process of human reviewers: summarizing the paper, referencing relevant works, identifying strengths and weaknesses, and generating a review conclusion. Building upon this, we train LLM reviewer agents capable of structured reasoning using a relevant-paper-aware training method. Furthermore, we construct ReviewAgents, a multi-role, multi-LLM agent review framework, to enhance the review comment generation process. Additionally, we propose ReviewBench, a benchmark for evaluating the review comments generated by LLMs. Our experimental results on ReviewBench demonstrate that while existing LLMs exhibit a certain degree of potential for automating the review process, there remains a gap when compared to human-generated reviews. Moreover, our ReviewAgents framework further narrows this gap, outperforming advanced LLMs in generating review comments.

Title: Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

Authors: Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, Yequan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08524
Pdf URL: https://arxiv.org/pdf/2503.08524
Copy Paste: [[2503.08524]] Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency(https://arxiv.org/abs/2503.08524)
Keywords: language model, llm
Abstract: Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. A token-position-aware layer-skipping framework is proposed to reduce operations by 1.5x while maintaining performance. We first observe that tokens predicted later have lower perplexity and thus require less computation. Then, we propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which leverages a power-law decay function, $\left\lfloor L \times (\alpha^i) \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, $D^3$ achieves success across a wide range of generation tasks for the first time. Experiments on large language models (i.e., Llama) with $7 \sim 70$ billion parameters show that $D^3$ can achieve an average 1.5x speedup compared with the full-inference pipeline while maintaining comparable performance, with nearly no performance drop ($<1\%$) on the GSM8K and BBH benchmarks.
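The decay rule quoted in the abstract, $\left\lfloor L \times (\alpha^i) \right\rfloor$, can be illustrated with a short sketch. The values L = 32 and alpha = 0.99 below are assumptions chosen only for illustration, as is the lower bound of one layer; the abstract does not state the authors' settings or how very small layer counts are handled.

```python
import math


def layers_to_keep(total_layers: int, alpha: float, position: int,
                   min_layers: int = 1) -> int:
    """Number of layers retained for the token at `position`.

    Implements floor(L * alpha**i) from the abstract. The clamp to
    `min_layers` is an assumption added so the count never reaches zero.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if position < 0:
        raise ValueError("position must be non-negative")
    return max(min_layers, math.floor(total_layers * alpha ** position))


if __name__ == "__main__":
    # Hypothetical 32-layer model with a slow decay rate.
    for i in (0, 10, 100):
        print(i, layers_to_keep(32, 0.99, i))
```

With these illustrative values, the first token uses all 32 layers and later tokens use progressively fewer, which matches the abstract's observation that tokens predicted later need less computation.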

Title: DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering

Authors: Sher Badshah, Hassan Sajjad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08542
Pdf URL: https://arxiv.org/pdf/2503.08542
Copy Paste: [[2503.08542]] DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering(https://arxiv.org/abs/2503.08542)
Keywords: language model, llm
Abstract: Evaluating the free-form responses generated by Large Language Models (LLMs) remains a challenge due to their diverse and open-ended nature. Traditional supervised signal-based automatic metrics fail to capture semantic equivalence or handle the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive. Leveraging LLMs as evaluators offers a promising alternative due to their strong language understanding and instruction-following capabilities. Taking advantage of these capabilities, we propose the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLM-as-judges and engages a third arbitrator only in cases of disagreement. This selective arbitration prioritizes evaluation reliability while reducing unnecessary computational demands compared to conventional majority voting. DAFE utilizes task-specific reference answers with dynamic arbitration to enhance judgment accuracy, resulting in significant improvements in evaluation metrics such as Macro F1 and Cohen's Kappa. Through experiments, including a comprehensive human evaluation, we demonstrate DAFE's ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.
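The selective arbitration described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the DAFE implementation: the `Judge` signature, the binary correct/incorrect verdict, and the name `dafe_verdict` are hypothetical, and the judges are plain callables standing in for LLM calls.

```python
from typing import Callable, Tuple

# Hypothetical judge interface: (question, reference, response) -> is_correct
Judge = Callable[[str, str, str], bool]


def dafe_verdict(
    question: str,
    reference: str,
    response: str,
    judge_a: Judge,
    judge_b: Judge,
    arbitrator: Judge,
) -> Tuple[bool, bool]:
    """Return (verdict, arbitrator_used).

    Both primary judges are always consulted. The arbitrator is called
    only when they disagree, and its verdict is then taken as final
    (an assumption about how the disagreement is resolved).
    """
    a = judge_a(question, reference, response)
    b = judge_b(question, reference, response)
    if a == b:
        return a, False
    return arbitrator(question, reference, response), True
```

With binary verdicts, the result matches a three-way majority vote, since the arbitrator's answer always sides with one of the two disagreeing judges. The saving comes from skipping the third call whenever the primary judges agree.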

Title: Transferring Extreme Subword Style Using Ngram Model-Based Logit Scaling

Authors: Craig Messner, Tom Lippincott
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08550
Pdf URL: https://arxiv.org/pdf/2503.08550
Copy Paste: [[2503.08550]] Transferring Extreme Subword Style Using Ngram Model-Based Logit Scaling(https://arxiv.org/abs/2503.08550)
Keywords: language model
Abstract: We present an ngram model-based logit scaling technique that effectively transfers extreme subword stylistic variation to large language models at inference time. We demonstrate its efficacy by tracking the perplexity of generated text with respect to the ngram interpolated and original versions of an evaluation model. Minimizing the former measure while the latter approaches the perplexity of a text produced by a target author or character lets us select a sufficient degree of adaptation while retaining fluency.

Title: DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process

Authors: Minjun Zhu, Yixuan Weng, Linyi Yang, Yue Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08569
Pdf URL: https://arxiv.org/pdf/2503.08569
Copy Paste: [[2503.08569]] DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process(https://arxiv.org/abs/2503.08569)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B with fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available. The code, model, dataset, and demo have been released at this http URL.

Title: BiasEdit: Debiasing Stereotyped Language Models via Model Editing

Authors: Xin Xu, Wei Xu, Ningyu Zhang, Julian McAuley
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08588
Pdf URL: https://arxiv.org/pdf/2503.08588
Copy Paste: [[2503.08588]] BiasEdit: Debiasing Stereotyped Language Models via Model Editing(https://arxiv.org/abs/2503.08588)
Keywords: language model, prompt
Abstract: Previous studies have established that language models manifest stereotyped biases. Existing debiasing strategies, such as retraining a model with counterfactual data, representation projection, and prompting, often fail to efficiently eliminate bias or directly alter the models' biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method that removes stereotypical bias from language models through lightweight networks that act as editors to generate parameter updates. BiasEdit employs a debiasing loss that guides the editor networks to conduct local edits on partial parameters of a language model for debiasing, while a retention loss preserves the language modeling abilities during editing. Experiments on StereoSet and CrowS-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to tangential debiasing baselines, with little to no impact on the language models' general capabilities. In addition, we conduct bias tracing to probe bias in various modules and explore the impact of bias editing on different components of language models.

Title: NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

Authors: Delip Rao, Weiqiu You, Eric Wong, Chris Callison-Burch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08600
Pdf URL: https://arxiv.org/pdf/2503.08600
Copy Paste: [[2503.08600]] NSF-SciFy: Mining the NSF Awards Database for Scientific Claims(https://arxiv.org/abs/2503.08600)
Keywords: language model, llm, prompt
Abstract: We present NSF-SciFy, a large-scale dataset for scientific claim extraction derived from the National Science Foundation (NSF) awards database, comprising over 400K grant abstracts spanning five decades. While previous datasets relied on published literature, we leverage grant abstracts, which offer a unique advantage: they capture claims at an earlier stage in the research lifecycle, before publication takes effect. We also introduce a new task to distinguish between existing scientific claims and aspirational research intentions. Using zero-shot prompting with frontier large language models, we jointly extract 114K scientific claims and 145K investigation proposals from 16K grant abstracts in the materials science domain to create a focused subset called NSF-SciFy-MatSci. We use this dataset to evaluate three key tasks: (1) technical to non-technical abstract generation, where models achieve high BERTScore (0.85+ F1); (2) scientific claim extraction, where fine-tuned models outperform base models by 100% relative improvement; and (3) investigation proposal extraction, showing 90%+ improvement with fine-tuning. We introduce novel LLM-based evaluation metrics for robust assessment of claim/proposal extraction quality. As the largest scientific claim dataset to date, with an estimated 2.8 million claims across all STEM disciplines funded by the NSF, NSF-SciFy enables new opportunities for claim verification and meta-scientific research. We publicly release all datasets, trained models, and evaluation code to facilitate further research.

Title: Exploiting Instruction-Following Retrievers for Malicious Information Retrieval

Authors: Parishad BehnamGhader, Nicholas Meade, Siva Reddy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08644
Pdf URL: https://arxiv.org/pdf/2503.08644
Copy Paste: [[2503.08644]] Exploiting Instruction-Following Retrievers for Malicious Information Retrieval(https://arxiv.org/abs/2503.08644)
Keywords: llm, retrieval augmented generation
Abstract: Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.

Title: Exploring the Word Sense Disambiguation Capabilities of Large Language Models

Authors: Pierpaolo Basile, Lucia Siciliani, Elio Musacchio, Giovanni Semeraro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08662
Pdf URL: https://arxiv.org/pdf/2503.08662
Copy Paste: [[2503.08662]] Exploring the Word Sense Disambiguation Capabilities of Large Language Models(https://arxiv.org/abs/2503.08662)
Keywords: language model, llm
Abstract: Word Sense Disambiguation (WSD) is a historical task in computational linguistics that has received much attention over the years. However, with the advent of Large Language Models (LLMs), interest in this task (in its classical definition) has decreased. In this study, we evaluate the performance of various LLMs on the WSD task. We extend a previous benchmark (XL-WSD) to re-design two subtasks suitable for LLMs: 1) given a word in a sentence, the LLM must generate the correct definition; 2) given a word in a sentence and a set of predefined meanings, the LLM must select the correct one. The extended benchmark is built using XL-WSD and BabelNet. The results indicate that LLMs perform well in zero-shot learning but cannot surpass current state-of-the-art methods. However, a fine-tuned model with a medium number of parameters outperforms all other models, including the state-of-the-art.

Title: AgentOrca: A Dual-System Framework to Evaluate Language Agents on Operational Routine and Constraint Adherence

Authors: Zekun Li, Shinda Huang, Jiangtian Wang, Nathan Zhang, Antonis Antoniades, Wenyue Hua, Kaijie Zhu, Sirui Zeng, William Yang Wang, Xifeng Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08669
Pdf URL: https://arxiv.org/pdf/2503.08669
Copy Paste: [[2503.08669]] AgentOrca: A Dual-System Framework to Evaluate Language Agents on Operational Routine and Constraint Adherence(https://arxiv.org/abs/2503.08669)
Keywords: prompt, agent
Abstract: As language agents progressively automate critical tasks across domains, their ability to operate within operational constraints and safety protocols becomes essential. While extensive research has demonstrated these agents' effectiveness in downstream task completion, their reliability in following operational procedures and constraints remains largely unexplored. To this end, we present AgentOrca, a dual-system framework for evaluating language agents' compliance with operational constraints and routines. Our framework encodes action constraints and routines through both natural language prompts for agents and corresponding executable code serving as ground truth for automated verification. Through an automated pipeline of test case generation and evaluation across five real-world domains, we quantitatively assess current language agents' adherence to operational constraints. Our findings reveal notable performance gaps among state-of-the-art models, with large reasoning models like o1 demonstrating superior compliance while others show significantly lower performance, particularly when encountering complex constraints or user persuasion attempts.

Title: Self-Taught Self-Correction for Small Language Models

Authors: Viktor Moskvoretskii, Chris Biemann, Irina Nikishina
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08681
Pdf URL: https://arxiv.org/pdf/2503.08681
Copy Paste: [[2503.08681]] Self-Taught Self-Correction for Small Language Models(https://arxiv.org/abs/2503.08681)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have achieved remarkable performance across various tasks, they remain prone to errors. A key challenge is enabling them to self-correct. While prior research has relied on external tools or large proprietary models, this work explores self-correction in small language models (SLMs) through iterative fine-tuning using solely self-generated data. We introduce the Self-Taught Self-Correction (STaSC) algorithm, which incorporates multiple algorithmic design choices. Experimental results on a question-answering task demonstrate that STaSC effectively learns self-correction, leading to significant performance improvements. Our analysis further provides insights into the mechanisms of self-correction and the impact of different design choices on learning dynamics and overall performance. To support future research, we release our user-friendly codebase and lightweight models.

Title: Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents

Authors: Haoyu Wang, Sunhao Dai, Haiyuan Zhao, Liang Pang, Xiao Zhang, Gang Wang, Zhenhua Dong, Jun Xu, Ji-Rong Wen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.08684
Pdf URL: https://arxiv.org/pdf/2503.08684
Copy Paste: [[2503.08684]] Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents(https://arxiv.org/abs/2503.08684)
Keywords: language model, llm
Abstract: Previous studies have found that PLM-based retrieval models exhibit a preference for LLM-generated content, assigning higher relevance scores to these documents even when their semantic quality is comparable to human-written ones. This phenomenon, known as source bias, threatens the sustainable development of the information access ecosystem. However, the underlying causes of source bias remain unexplored. In this paper, we explain the process of information retrieval with a causal graph and discover that PLM-based retrievers learn perplexity features for relevance estimation, causing source bias by ranking documents with low perplexity higher. Theoretical analysis further reveals that the phenomenon stems from the positive correlation between the gradients of the loss functions in the language modeling and retrieval tasks. Based on the analysis, we propose a causal-inspired inference-time debiasing method called Causal Diagnosis and Correction (CDC). CDC first diagnoses the bias effect of the perplexity and then separates the bias effect from the overall estimated relevance score. Experimental results across three domains demonstrate the superior debiasing effectiveness of CDC, emphasizing the validity of our proposed explanatory framework. Source codes are available at this https URL.
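The two-step idea of diagnosing a perplexity effect and then subtracting it from the relevance score can be illustrated with a deliberately simplified sketch. Everything below is an assumption made for illustration and should not be read as the CDC estimator from the paper: the effect is modeled as linear, the coefficient is fitted by ordinary least squares on a hypothetical calibration set, and the names `fit_perplexity_effect` and `debias_scores` are invented here.

```python
from typing import List, Sequence


def fit_perplexity_effect(ppls: Sequence[float], scores: Sequence[float]) -> float:
    """Diagnosis step (simplified): least-squares slope of score on perplexity.

    Assumes a calibration set in which differences in score are driven
    by perplexity rather than by semantic relevance.
    """
    n = len(ppls)
    if n < 2 or n != len(scores):
        raise ValueError("need two or more paired observations")
    mean_p = sum(ppls) / n
    mean_s = sum(scores) / n
    var_p = sum((p - mean_p) ** 2 for p in ppls)
    if var_p == 0:
        raise ValueError("perplexities must not all be equal")
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(ppls, scores))
    return cov / var_p


def debias_scores(
    scores: Sequence[float], ppls: Sequence[float], beta: float
) -> List[float]:
    """Correction step (simplified): remove the estimated linear effect."""
    return [s - beta * p for s, p in zip(scores, ppls)]


# Hypothetical calibration data: lower perplexity goes with higher scores.
beta = fit_perplexity_effect([10.0, 20.0, 30.0], [0.9, 0.8, 0.7])
print(debias_scores([0.9, 0.8, 0.7], [10.0, 20.0, 30.0], beta))
```

A negative fitted coefficient means low-perplexity documents were being scored higher; subtracting the fitted term removes that advantage at inference time without retraining the retriever.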