2025-03-11

Title: What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets

Authors: Marco Antonio Stranisci, Christian Hardmeier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05721
Pdf URL: https://arxiv.org/pdf/2503.05721
Copy Paste: [[2503.05721]] What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets(https://arxiv.org/abs/2503.05721)
Keywords: language model, llm
Abstract: Data filtering strategies are a crucial component in developing safe Large Language Models (LLMs), since they support the removal of harmful content from pretraining datasets. However, there is a lack of research on the actual impact of these strategies on groups vulnerable to discrimination, and their effectiveness has not yet been systematically addressed. In this paper we present a benchmark study of data filtering strategies for harm reduction, aimed at providing a systematic overview of these approaches. We survey 55 technical reports of English LMs and LLMs to identify the filtering strategies existing in the literature and implement an experimental setting to test their impact on vulnerable groups. Our results show that the positive impact these strategies have in reducing harmful content in documents has the side effect of increasing the underrepresentation of groups vulnerable to discrimination in datasets.

Title: Graph Masked Language Models

Authors: Aarush Sinha, OM Kumar CU
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05763
Pdf URL: https://arxiv.org/pdf/2503.05763
Copy Paste: [[2503.05763]] Graph Masked Language Models(https://arxiv.org/abs/2503.05763)
Keywords: language model
Abstract: Language Models (LMs) are integral to Natural Language Processing (NLP), yet their interaction with structured knowledge graphs (KGs) remains an open research challenge. While Graph Neural Networks (GNNs) excel at capturing graph structures, they struggle with textual feature representation compared to pretrained LMs. To bridge this gap, we propose Graph Masked Language Models (GMLM) for node classification tasks. Our approach introduces two key innovations: a semantic masking strategy that selectively masks nodes based on their structural importance, ensuring critical graph components contribute effectively to learning, and a soft masking mechanism that generates interpolated node representations, enabling smoother information retention and improved gradient flow. Our dual-branch model architecture fuses structural graph information with contextual embeddings via a multi-layer fusion network. Extensive experiments on six node classification benchmarks demonstrate that GMLM not only achieves state-of-the-art (SOTA) performance but also enhances robustness and stability across datasets.
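Note: the "interpolated node representations" mentioned in the abstract can be pictured as a blend between a node's own features and a mask vector. The sketch below is one possible reading, not the authors' implementation: the linear form, the function name `soft_mask`, and the parameters `mask_vector` and `alpha` are all assumptions made for illustration.

```python
from typing import List


def soft_mask(features: List[float], mask_vector: List[float], alpha: float) -> List[float]:
    """Blend a node's features with a mask vector (hypothetical formulation).

    alpha = 0.0 keeps the original features, alpha = 1.0 replaces them
    with the mask vector, and values in between interpolate linearly.
    """
    if len(features) != len(mask_vector):
        raise ValueError("features and mask_vector must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * x + alpha * m for x, m in zip(features, mask_vector)]


if __name__ == "__main__":
    # A half-masked node keeps half of its original signal.
    print(soft_mask([1.0, 2.0], [0.0, 0.0], alpha=0.5))
```

Unlike a hard mask, this form leaves part of the original signal in the representation, which is consistent with the abstract's claim about information retention.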

Title: Medical Hallucinations in Foundation Models and Their Impact on Healthcare

Authors: Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, Lizhou Fan, Eugene Park, Tristan Lin, Joonsik Yoon, Wonjin Yoon, Maarten Sap, Yulia Tsvetkov, Paul Liang, Xuhai Xu, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Hae Won Park, Samir Tulebaev, Cynthia Breazeal
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.05777
Pdf URL: https://arxiv.org/pdf/2503.05777
Copy Paste: [[2503.05777]] Medical Hallucinations in Foundation Models and Their Impact on Healthcare(https://arxiv.org/abs/2503.05777)
Keywords: llm, hallucination, chain-of-thought
Abstract: Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need not only for technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at this https URL.

Title: FedMentalCare: Towards Privacy-Preserving Fine-Tuned LLMs to Analyze Mental Health Status Using Federated Learning Framework

Authors: S M Sarwar
Subjects: cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05786
Pdf URL: https://arxiv.org/pdf/2503.05786
Copy Paste: [[2503.05786]] FedMentalCare: Towards Privacy-Preserving Fine-Tuned LLMs to Analyze Mental Health Status Using Federated Learning Framework(https://arxiv.org/abs/2503.05786)
Keywords: language model, llm, chat, agent
Abstract: With the increasing prevalence of mental health conditions worldwide, AI-powered chatbots and conversational agents have emerged as accessible tools to support mental health. However, deploying Large Language Models (LLMs) in mental healthcare applications raises significant privacy concerns, especially regarding regulations like HIPAA and GDPR. In this work, we propose FedMentalCare, a privacy-preserving framework that leverages Federated Learning (FL) combined with Low-Rank Adaptation (LoRA) to fine-tune LLMs for mental health analysis. We investigate the performance impact of varying client data volumes and model architectures (e.g., MobileBERT and MiniLM) in FL environments. Our framework demonstrates a scalable, privacy-aware approach for deploying LLMs in real-world mental healthcare scenarios, addressing data security and computational efficiency challenges.
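Note: the abstract does not say which aggregation rule FedMentalCare uses. As an illustration of how varying client data volumes can enter a federated round, the sketch below applies federated averaging (FedAvg), a common rule in which each client's parameters are weighted by its number of training examples. The function name, the parameter name `lora_A`, and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# One client update: (number of local examples, {parameter name: flat values}).
ClientUpdate = Tuple[int, Dict[str, List[float]]]


def federated_average(client_updates: List[ClientUpdate]) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its data volume."""
    total_examples = sum(n for n, _ in client_updates)
    if total_examples <= 0:
        raise ValueError("clients must hold at least one example in total")
    reference = client_updates[0][1]
    averaged: Dict[str, List[float]] = {}
    for name, values in reference.items():
        averaged[name] = [
            sum(n * params[name][i] for n, params in client_updates) / total_examples
            for i in range(len(values))
        ]
    return averaged


if __name__ == "__main__":
    # Only the small adapter tensors (here a toy "lora_A") leave each client.
    updates = [
        (1, {"lora_A": [0.0, 4.0]}),
        (3, {"lora_A": [4.0, 0.0]}),
    ]
    print(federated_average(updates))
```

Sending only low-rank adapter values keeps the per-round payload small, which is the usual motivation for pairing LoRA with federated training.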

Title: Extracting and Emulsifying Cultural Explanation to Improve Multilingual Capability of LLMs

Authors: Hamin Koo, Jaehyung Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05846
Pdf URL: https://arxiv.org/pdf/2503.05846
Copy Paste: [[2503.05846]] Extracting and Emulsifying Cultural Explanation to Improve Multilingual Capability of LLMs(https://arxiv.org/abs/2503.05846)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have achieved remarkable success, but their English-centric training data limits performance in non-English languages, highlighting the need for enhancements in their multilingual capabilities. While some work on multilingual prompting methods handles non-English queries by utilizing English translations or restructuring them to more closely align with LLM reasoning patterns, these works often overlook the importance of cultural context, limiting their effectiveness. To address this limitation, we propose EMCEI, a simple yet effective approach that improves LLMs' multilingual capabilities by incorporating cultural context for more accurate and appropriate responses. Specifically, EMCEI follows a two-step process that first extracts relevant cultural context from the LLM's parametric knowledge via prompting. Then, EMCEI employs an LLM-as-Judge mechanism to select the most appropriate response by balancing cultural relevance and reasoning ability. Experiments on diverse multilingual benchmarks show that EMCEI outperforms existing baselines, demonstrating its effectiveness in handling multilingual queries with LLMs.

Title: This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs

Authors: Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05856
Pdf URL: https://arxiv.org/pdf/2503.05856
Copy Paste: [[2503.05856]] This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs(https://arxiv.org/abs/2503.05856)
Keywords: language model, llm, agent
Abstract: Mixture of large language model (LLM) Agents (MoA) architectures achieve state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by leveraging the collaboration of multiple LLMs at inference time. Despite these successes, an evaluation of the safety and reliability of MoA is missing. We present the first comprehensive study of MoA's robustness against deceptive LLM agents that deliberately provide misleading responses. We examine factors like the propagation of deceptive information, model size, and information availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of 49.2% when coupled with 3-layer MoA (6 LLM agents). However, we demonstrate that introducing only a single carefully-instructed deceptive agent into the MoA can reduce performance to 37.9%, effectively nullifying all MoA gains. On QuALITY, a multiple-choice comprehension task, the impact is also severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the historical Doge of Venice voting process, designed to minimize influence and deception, we propose a range of unsupervised defense mechanisms that recover most of the lost performance.

Title: QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation

Authors: Bang Nguyen, Tingting Du, Mengxia Yu, Lawrence Angrave, Meng Jiang
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05888
Pdf URL: https://arxiv.org/pdf/2503.05888
Copy Paste: [[2503.05888]] QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation(https://arxiv.org/abs/2503.05888)
Keywords: language model
Abstract: While the Question Generation (QG) task has been increasingly adopted in educational assessments, its evaluation remains limited by approaches that lack a clear connection to the educational values of test items. In this work, we introduce test item analysis, a method frequently used by educators to assess test question quality, into QG evaluation. Specifically, we construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We then examine whether existing QG evaluation approaches can effectively distinguish these differences. Our findings reveal significant shortcomings in these approaches with respect to accurately assessing test item quality in relation to student performance. To address this gap, we propose a novel QG evaluation framework, QG-SMS, which leverages Large Language Model for Student Modeling and Simulation to perform test item analysis. As demonstrated in our extensive experiments and human evaluation study, the additional perspectives introduced by the simulated student profiles lead to a more effective and robust assessment of test items.
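Note: item difficulty and item discrimination, two of the dimensions named in the abstract, have standard definitions in classical test theory. The sketch below computes them from a table of (simulated or real) student responses. It is not the QG-SMS pipeline: the function names, the 27% group fraction default, and the input layout are assumptions for illustration.

```python
from typing import List


def item_difficulty(responses: List[int]) -> float:
    """Proportion of students answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)


def item_discrimination(
    responses: List[int], total_scores: List[float], fraction: float = 0.27
) -> float:
    """Upper-lower discrimination index for one item.

    Students are ranked by total test score; the index is the proportion
    correct in the top group minus the proportion correct in the bottom group.
    """
    n = len(responses)
    group_size = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:group_size], order[-group_size:]
    p_upper = sum(responses[i] for i in upper) / group_size
    p_lower = sum(responses[i] for i in lower) / group_size
    return p_upper - p_lower


if __name__ == "__main__":
    answers = [0, 0, 1, 1]          # per-student correctness on one item
    totals = [1.0, 2.0, 3.0, 4.0]   # per-student total test scores
    print(item_difficulty(answers), item_discrimination(answers, totals, fraction=0.5))
```

An item that only strong students answer correctly yields an index near 1, while an item answered equally well by both groups yields an index near 0.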

Title: MastermindEval: A Simple But Scalable Reasoning Benchmark

Authors: Jonas Golde, Patrick Haller, Fabio Barth, Alan Akbik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05891
Pdf URL: https://arxiv.org/pdf/2503.05891
Copy Paste: [[2503.05891]] MastermindEval: A Simple But Scalable Reasoning Benchmark(https://arxiv.org/abs/2503.05891)
Keywords: language model, llm, agent
Abstract: Recent advancements in large language models (LLMs) have led to remarkable performance across a wide range of language understanding and mathematical tasks. As a result, increasing attention has been given to assessing the true reasoning capabilities of LLMs, driving research into commonsense, numerical, logical, and qualitative reasoning. However, with the rapid progress of reasoning-focused models such as OpenAI's o1 and DeepSeek's R1, there has been a growing demand for reasoning benchmarks that can keep pace with ongoing model developments. In this paper, we introduce MastermindEval, a simple, scalable, and interpretable deductive reasoning benchmark inspired by the board game Mastermind. Our benchmark supports two evaluation paradigms: (1) agentic evaluation, in which the model autonomously plays the game, and (2) deductive reasoning evaluation, in which the model is given a pre-played game state with only one possible valid code to infer. In our experimental results we (1) find that even easy Mastermind instances are difficult for current models and (2) demonstrate that the benchmark is scalable to possibly more advanced models in the future. Furthermore, we investigate possible reasons why models cannot deduce the final solution and find that current models are limited in deducing the concealed code as the number of statements from which information must be combined increases.
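Note: the deductive setting described in the abstract relies on a game state that admits exactly one valid code. The sketch below shows the standard Mastermind feedback rule and a brute-force check for which codes remain consistent with a guess history. It is an illustration, not the benchmark's code: the function names and the two-color toy example are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence, Tuple

Feedback = Tuple[int, int]  # (right color and position, right color but wrong position)


def feedback(secret: Sequence[str], guess: Sequence[str]) -> Feedback:
    """Score a guess against the secret code using the standard Mastermind rule."""
    exact = sum(s == g for s, g in zip(secret, guess))
    # Multiset overlap counts every shared color, including exact matches.
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return exact, overlap - exact


def consistent_codes(
    history: List[Tuple[Sequence[str], Feedback]], colors: Sequence[str], length: int
) -> List[Tuple[str, ...]]:
    """Enumerate every code that would have produced the observed feedback."""
    return [
        code
        for code in product(colors, repeat=length)
        if all(feedback(code, guess) == fb for guess, fb in history)
    ]


if __name__ == "__main__":
    history = [("AA", (1, 0)), ("BA", (0, 2))]
    remaining = consistent_codes(history, colors="AB", length=2)
    # A pre-played state is suitable for deduction when one code remains.
    print(remaining)
```

Raising the code length or the number of colors grows the search space exponentially, which is one way such a benchmark can be scaled in difficulty.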

Title: From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning

Authors: Eric Zhao, Pranjal Awasthi, Nika Haghtalab
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05919
Pdf URL: https://arxiv.org/pdf/2503.05919
Copy Paste: [[2503.05919]] From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning(https://arxiv.org/abs/2503.05919)
Keywords: language model, prompt
Abstract: Finetuning provides a scalable and cost-effective means of customizing language models for specific tasks or response styles, with greater reliability than prompting or in-context learning. In contrast, the conventional wisdom is that injecting knowledge via finetuning results in brittle performance and poor generalization. We argue that the dichotomy of "task customization" (e.g., instruction tuning) and "knowledge injection" (e.g., teaching new facts) is a distinction without a difference. We instead identify concrete factors that explain the heterogeneous effectiveness observed with finetuning. To this end, we conduct a large-scale experimental study of finetuning the frontier Gemini v1.5 model family on a spectrum of datasets that are artificially engineered to interpolate between the strengths and failure modes of finetuning. Our findings indicate that question-answer training data formats provide much stronger knowledge generalization than document/article-style training data, numerical information can be harder for finetuning to retain than categorical information, and models struggle to apply finetuned knowledge during multi-step reasoning even when trained on similar examples -- all factors that render "knowledge injection" to be especially difficult, even after controlling for considerations like data augmentation and information volume. On the other hand, our findings also indicate that it is not fundamentally more difficult to finetune information about a real-world event than information about what a model's writing style should be.

Title: IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining

Authors: Yixiao Li, Xianzhi Du, Ajay Jaiswal, Tao Lei, Tuo Zhao, Chong Wang, Jianyu Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05920
Pdf URL: https://arxiv.org/pdf/2503.05920
Copy Paste: [[2503.05920]] IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining(https://arxiv.org/abs/2503.05920)
Keywords: language model
Abstract: Recent advancements in large language models have intensified the need for efficient and deployable models within limited inference budgets. Structured pruning pipelines have shown promise in token efficiency compared to training target-size models from scratch. In this paper, we advocate incorporating enlarged model pretraining, which is often ignored in previous works, into pruning. We study the enlarge-and-prune pipeline as an integrated system to address two critical questions: whether it is worth pretraining an enlarged model even when the model is never deployed, and how to optimize the entire pipeline for better pruned models. We propose an integrated enlarge-and-prune pipeline, which combines enlarged model training, pruning, and recovery under a single cosine annealing learning rate schedule. This approach is further complemented by a novel iterative structured pruning method for gradual parameter removal. The proposed method helps to mitigate the knowledge loss caused by the rising learning rate in naive enlarge-and-prune pipelines and enables effective redistribution of model capacity among surviving neurons, facilitating smooth compression and enhanced performance. We conduct comprehensive experiments on compressing 2.8B models to 1.3B with up to 2T tokens in pretraining. The results demonstrate that the integrated approach not only provides insights into the token efficiency of enlarged model pretraining but also achieves superior performance of pruned models.
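Note: a single cosine annealing schedule spanning all stages means the learning rate never rises again when pruning or recovery begins. The sketch below uses the standard cosine annealing formula; the function name, the step counts, and the learning-rate values are hypothetical and are not taken from the paper.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Standard cosine annealing from lr_max at step 0 to lr_min at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # One schedule covers enlarged training, pruning, and recovery, so the
    # rate keeps decaying across stage boundaries instead of restarting.
    total = 1000
    for step in (0, 250, 500, 750, 1000):
        print(step, round(cosine_annealing_lr(step, total, lr_max=3e-4, lr_min=3e-5), 6))
```

A naive pipeline would instead restart a schedule for each stage, producing the rising learning rate that the abstract links to knowledge loss.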

Title: DETQUS: Decomposition-Enhanced Transformers for QUery-focused Summarization

Authors: Yasir Khan, Xinlei Wu, Sangpil Youm, Justin Ho, Aryaan Shaikh, Jairo Garciga, Rohan Sharma, Bonnie J. Dorr
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.05935
Pdf URL: https://arxiv.org/pdf/2503.05935
Copy Paste: [[2503.05935]] DETQUS: Decomposition-Enhanced Transformers for QUery-focused Summarization(https://arxiv.org/abs/2503.05935)
Keywords: language model
Abstract: Query-focused tabular summarization is an emerging task in table-to-text generation that synthesizes a summary response from tabular data based on user queries. Traditional transformer-based approaches face challenges due to token limitations and the complexity of reasoning over large tables. To address these challenges, we introduce DETQUS (Decomposition-Enhanced Transformers for QUery-focused Summarization), a system designed to improve summarization accuracy by leveraging tabular decomposition alongside a fine-tuned encoder-decoder model. DETQUS employs a large language model to selectively reduce table size, retaining only query-relevant columns while preserving essential information. This strategy enables more efficient processing of large tables and enhances summary quality. Our approach, equipped with table-based QA model Omnitab, achieves a ROUGE-L score of 0.4437, outperforming the previous state-of-the-art REFACTOR model (ROUGE-L: 0.422). These results highlight DETQUS as a scalable and effective solution for query-focused tabular summarization, offering a structured alternative to more complex architectures.
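Note: the ROUGE-L scores quoted in the abstract are based on the longest common subsequence (LCS) between a generated summary and a reference. The sketch below computes an LCS-based F1 score with lowercase whitespace tokenization. The tokenization and the use of a balanced F1 are assumptions, so scores from this sketch are not directly comparable to the paper's reported values.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based precision and recall combined into a balanced F1 score."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

Because the LCS respects word order but allows gaps, the metric rewards summaries that keep the reference's sequence of content words even when wording in between differs.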

Title: SANDWiCH: Semantical Analysis of Neighbours for Disambiguating Words in Context ad Hoc

Authors: Daniel Guzman-Olivares, Lara Quijano-Sanchez, Federico Liberatore
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05958
Pdf URL: https://arxiv.org/pdf/2503.05958
Copy Paste: [[2503.05958]] SANDWiCH: Semantical Analysis of Neighbours for Disambiguating Words in Context ad Hoc(https://arxiv.org/abs/2503.05958)
Keywords: language model, llm, chat
Abstract: The rise of generative chat-based Large Language Models (LLMs) over the past two years has spurred a race to develop systems that promise near-human conversational and reasoning experiences. However, recent studies indicate that the language understanding offered by these models remains limited and far from human-like performance, particularly in grasping the contextual meanings of words, an essential aspect of reasoning. In this paper, we present a simple yet computationally efficient framework for multilingual Word Sense Disambiguation (WSD). Our approach reframes the WSD task as a cluster discrimination analysis over a semantic network refined from BabelNet using group algebra. We validate our methodology across multiple WSD benchmarks, achieving a new state of the art for all languages and tasks, as well as in individual assessments by part of speech. Notably, our model significantly surpasses the performance of current alternatives, even in low-resource languages, while reducing the parameter count by 72%.

Title: SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs

Authors: Samir Abdaljalil, Hasan Kurban, Parichit Sharma, Erchin Serpedin, Rachad Atat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05980
Pdf URL: https://arxiv.org/pdf/2503.05980
Copy Paste: [[2503.05980]] SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs(https://arxiv.org/abs/2503.05980)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) are increasingly deployed across diverse domains, yet they are prone to generating factually incorrect outputs - commonly known as "hallucinations." Among existing mitigation strategies, uncertainty-based methods are particularly attractive due to their ease of implementation, independence from external data, and compatibility with standard LLMs. In this work, we introduce a novel and scalable uncertainty-based semantic clustering framework for automated hallucination detection. Our approach leverages sentence embeddings and hierarchical clustering alongside a newly proposed inconsistency measure, SINdex, to yield more homogeneous clusters and more accurate detection of hallucination phenomena across various LLMs. Evaluations on prominent open- and closed-book QA datasets demonstrate that our method achieves AUROC improvements of up to 9.3% over state-of-the-art techniques. Extensive ablation studies further validate the effectiveness of each component in our framework.
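Note: the general idea of uncertainty-based semantic clustering can be sketched as follows: sample several answers to the same question, embed them, group answers with similar meaning, and treat a spread across many groups as a sign of possible hallucination. The code below is not the SINdex measure, whose definition is not given in the abstract. It uses single-linkage clustering at a cosine-similarity threshold and the entropy of cluster sizes as a stand-in; the function names, the threshold, and the toy embeddings are hypothetical.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(embeddings: List[List[float]], threshold: float) -> List[int]:
    """Single-linkage clusters: link any pair with similarity >= threshold."""
    parent = list(range(len(embeddings)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(embeddings))]


def cluster_entropy(labels: List[int]) -> float:
    """Entropy of cluster sizes; 0 means every sampled answer agrees."""
    counts: Dict[int, int] = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # stand-ins for sentence embeddings
    labels = cluster(toy_embeddings, threshold=0.8)
    print(labels, cluster_entropy(labels))
```

In practice the embeddings would come from a sentence-embedding model, and the threshold would need tuning on held-out data.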

Title: Intent-Aware Self-Correction for Mitigating Social Biases in Large Language Models

Authors: Panatchakorn Anantaprayoon, Masahiro Kaneko, Naoaki Okazaki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06011
Pdf URL: https://arxiv.org/pdf/2503.06011
Copy Paste: [[2503.06011]] Intent-Aware Self-Correction for Mitigating Social Biases in Large Language Models(https://arxiv.org/abs/2503.06011)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Self-Correction based on feedback improves the output quality of Large Language Models (LLMs). Moreover, as Self-Correction functions like the slow and conscious System-2 thinking from cognitive psychology's perspective, it can potentially reduce LLMs' social biases. LLMs are sensitive to contextual ambiguities and inconsistencies; therefore, explicitly communicating their intentions during interactions when applying Self-Correction for debiasing is crucial. In this study, we demonstrate that clarifying intentions is essential for effectively reducing biases in LLMs through Self-Correction. We divide the components needed for Self-Correction into three parts: instruction, response, and feedback, and clarify intentions at each component. We incorporate an explicit debiasing prompt to convey the intention of bias mitigation from the instruction for response generation. In the response, we use Chain-of-Thought (CoT) to clarify the reasoning process. In the feedback, we define evaluation aspects necessary for debiasing and propose clear feedback through multi-aspect critiques and scoring. Through experiments, we demonstrate that self-correcting CoT responses obtained from a debiasing prompt based on multi-aspect feedback can reduce biased responses more robustly and consistently than the baselines. We also observe variation in debiasing efficacy when using models with different bias levels or when using separate models for response and feedback generation.

Title: GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

Authors: Xudong Lu, Yinghao Chen, Renshou Wu, Haohao Gao, Xi Chen, Xue Yang, Xiangyu Zhao, Aojun Zhou, Fangyuan Li, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06019
Pdf URL: https://arxiv.org/pdf/2503.06019
Copy Paste: [[2503.06019]] GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices(https://arxiv.org/abs/2503.06019)
Keywords: language model, llm
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled their deployment on mobile devices. However, challenges persist in maintaining strong language capabilities and ensuring hardware compatibility, both of which are crucial for user experience and practical deployment efficiency. In our deployment process, we observe that existing MLLMs often face performance degradation on pure language tasks, and the current NPU platforms on smartphones do not support the MoE architecture, which is commonly used to preserve pure language capabilities during multimodal training. To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. GenieBlue freezes the original LLM parameters during MLLM training to maintain pure language capabilities. It acquires multimodal capabilities by duplicating specific transformer blocks for full fine-tuning and integrating lightweight LoRA modules. This approach preserves language capabilities while achieving comparable multimodal performance through extensive training. Deployed on smartphone NPUs, GenieBlue demonstrates efficiency and practicality for applications on mobile devices.

Title: SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?

Authors: Xudong Lu, Haohao Gao, Renshou Wu, Shuai Ren, Xiaoxin Chen, Hongsheng Li, Fangyuan Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06029
Pdf URL: https://arxiv.org/pdf/2503.06029
Copy Paste: [[2503.06029]] SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?(https://arxiv.org/abs/2503.06029)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at this https URL.
摘要：大型语言模型（LLMS）已成为日常生活不可或缺的一部分，尤其是通过智能手机上的设备部署作为智能助手的发展。但是，现有的LLM评估基准主要集中于数学和英语编码等客观任务，这些任务不一定反映在现实世界移动方案中，尤其是对中国用户的实用性LLMS的实际用例。为了解决这些差距，我们介绍了SmartBench，这是第一个旨在评估中国移动环境中设备LLM的功能的基准。我们分析了代表性智能手机制造商提供的功能，并将其分为五个类别：文本摘要，文本Q \＆A，信息提取，内容创建和通知管理，进一步详细介绍了20个特定任务。对于每个任务，我们构建了包括50到200个问题解答对的高质量数据集，这些数据集反映了日常移动互动，并制定了针对这些任务量身定制的自动化评估标准。我们使用SmartBench对设备LLM和MLLM进行全面评估，并在对真正的智能手机NPU进行量化部署后评估其性能。我们的贡献提供了一个标准化的框架，用于评估中文的设备LLM，从而促进该关键领域的进一步发展和优化。代码和数据将在此HTTPS URL上可用。

Title: Mitigating Memorization in LLMs using Activation Steering

Authors: Manan Suri, Nishit Anand, Amisha Bhaskar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06040
Pdf URL: https://arxiv.org/pdf/2503.06040
Copy Paste: [[2503.06040]] Mitigating Memorization in LLMs using Activation Steering(https://arxiv.org/abs/2503.06040)
Keywords: language model, llm
Abstract: The memorization of training data by Large Language Models (LLMs) poses significant risks, including privacy leaks and the regurgitation of copyrighted content. Activation steering, a technique that directly intervenes in model activations, has emerged as a promising approach for manipulating LLMs. In this work, we explore the effectiveness of activation steering in reducing memorization while preserving generalization capabilities. We conduct empirical evaluations using a controlled memorization benchmark of literary material and demonstrate that our method successfully suppresses memorized content with minimal degradation in model performance in Gemma. Additionally, we analyze the trade-offs between suppression effectiveness and linguistic fluency, highlighting the advantages and limitations of activation-based interventions. Our findings contribute to ongoing efforts in developing safer and more privacy-preserving LLMs by providing a practical and efficient mechanism to mitigate unintended memorization.
摘要：大型语言模型（LLMS）对培训数据的记忆带来了重大风险，包括隐私泄漏和受版权保护内容的反流。激活转向是一种直接介入模型激活的技术，已成为操纵LLM的有前途的方法。在这项工作中，我们探讨了激活转向在减少记忆的有效性，同时保留了概括能力。我们使用文学材料的受控记忆基准进行经验评估，并证明我们的方法成功地抑制了记忆的内容，在Gemma中模型性能的最小降解。此外，我们分析了抑制效率和语言流利性之间的权衡，突出了基于激活的干预措施的优势和局限性。我们的发现通过提供一种实用有效的机制来减轻意外的记忆，为开发更安全，更加隐私的LLM的持续努力做出了贡献。
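
Illustrative note (not from the paper): the abstract above does not say how the steering direction is obtained or where it is applied. The sketch below shows only the generic idea of activation steering, shifting a hidden-state vector along a fixed direction. It assumes, purely for illustration, a difference-of-means direction between activations recorded on memorized and non-memorized inputs. The names `mean_vector`, `steering_vector`, `steer`, and `alpha` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a non-empty list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(memorized_acts: Sequence[Sequence[float]],
                    clean_acts: Sequence[Sequence[float]]) -> Vector:
    """Assumed direction: mean(clean) - mean(memorized), pointing away from memorization."""
    mem = mean_vector(memorized_acts)
    clean = mean_vector(clean_acts)
    return [c - m for m, c in zip(mem, clean)]


def steer(hidden: Sequence[float], vector: Sequence[float], alpha: float = 1.0) -> Vector:
    """Shift one hidden-state vector along the steering direction: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, vector)]
```

In a real model the shift would be applied to the activations of a chosen layer during generation. The strength `alpha` corresponds to the trade-off the abstract describes: a larger shift suppresses more but risks degrading fluency.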

Title: Constructions are Revealed in Word Distributions

Authors: Joshua Rozner, Leonie Weissweiler, Kyle Mahowald, Cory Shain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06048
Pdf URL: https://arxiv.org/pdf/2503.06048
Copy Paste: [[2503.06048]] Constructions are Revealed in Word Distributions(https://arxiv.org/abs/2503.06048)
Keywords: language model
Abstract: Construction grammar posits that constructions (form-meaning pairings) are acquired through experience with language (the distributional learning hypothesis). But how much information about constructions does this distribution actually contain? Corpus-based analyses provide some answers, but text alone cannot answer counterfactual questions about what caused a particular word to occur. For that, we need computable models of the distribution over strings -- namely, pretrained language models (PLMs). Here we treat a RoBERTa model as a proxy for this distribution and hypothesize that constructions will be revealed within it as patterns of statistical affinity. We support this hypothesis experimentally: many constructions are robustly distinguished, including (i) hard cases where semantically distinct constructions are superficially similar, as well as (ii) schematic constructions, whose "slots" can be filled by abstract word classes. Despite this success, we also provide qualitative evidence that statistical affinity alone may be insufficient to identify all constructions from text. Thus, statistical affinity is likely an important, but partial, signal available to learners.
摘要：构建语法认为，通过语言（分配学习假设）获得构造（表格融合配对）。但是，关于此分布的构造实际上包含多少信息？基于语料库的分析提供了一些答案，但是单独的文本无法回答有关导致特定词的原因的反事实问题。为此，我们需要在字符串上分布的可计算模型 - 即验证的语言模型（PLM）。在这里，我们将罗伯塔模型视为该分布的代理，并假设在其中将构建为统计亲和力的模式。我们在实验上支持这一假设：许多结构都有牢固的区分，包括（i）在语义上不同的构造在表面上相似的困难情况，以及（ii）示意图结构，其“插槽”可以通过抽象的单词类填充。尽管取得了成功，但我们还提供了定性证据，表明仅统计亲和力可能不足以识别文本中的所有结构。因此，统计亲和力可能是学习者可用的重要但部分信号。

Title: Fine-Grained Bias Detection in LLM: Enhancing detection mechanisms for nuanced biases

Authors: Suvendu Mohanty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06054
Pdf URL: https://arxiv.org/pdf/2503.06054
Copy Paste: [[2503.06054]] Fine-Grained Bias Detection in LLM: Enhancing detection mechanisms for nuanced biases(https://arxiv.org/abs/2503.06054)
Keywords: language model, llm, prompt
Abstract: Recent advancements in Artificial Intelligence, particularly in Large Language Models (LLMs), have transformed natural language processing by improving generative capabilities. However, detecting biases embedded within these models remains a challenge. Subtle biases can propagate misinformation, influence decision-making, and reinforce stereotypes, raising ethical concerns. This study presents a detection framework to identify nuanced biases in LLMs. The approach integrates contextual analysis, interpretability via attention mechanisms, and counterfactual data augmentation to capture hidden biases across linguistic contexts. The methodology employs contrastive prompts and synthetic datasets to analyze model behaviour across cultural, ideological, and demographic scenarios. Quantitative analysis using benchmark datasets and qualitative assessments through expert reviews validate the effectiveness of the framework. Results show improvements in detecting subtle biases compared to conventional methods, which often fail to highlight disparities in model responses to race, gender, and socio-political contexts. The framework also identifies biases arising from imbalances in training data and model architectures. Continuous user feedback ensures adaptability and refinement. This research underscores the importance of proactive bias mitigation strategies and calls for collaboration between policymakers, AI developers, and regulators. The proposed detection mechanisms enhance model transparency and support responsible LLM deployment in sensitive applications such as education, legal systems, and healthcare. Future work will focus on real-time bias monitoring and cross-linguistic generalization to improve fairness and inclusivity in AI-driven communication tools.
摘要：人工智能的最新进步，特别是大型语言模型（LLM），通过提高生成能力来改变自然语言的处理。但是，检测这些模型中嵌入的偏见仍然是一个挑战。微妙的偏见可以传播错误信息，影响决策并加强刻板印象，从而引发道德问题。这项研究提出了一个检测框架，以识别LLMS中细微的偏见。该方法整合了上下文分析，通过注意机制的可解释性以及反事实数据增强，以捕获语言环境中的隐藏偏见。该方法采用对比提示和合成数据集来分析跨文化，意识形态和人口统计学场景的模型行为。使用基准数据集和定性评估的定量分析验证了框架的有效性。结果表明，与传统方法相比，检测微妙偏见的改善，这些方法通常未能突出模型对种族，性别和社会政治环境的响应中的差异。该框架还确定了训练数据和模型体系结构中的失衡引起的偏见。连续的用户反馈确保适应性和改进。这项研究强调了积极的偏见缓解策略的重要性，并呼吁政策制定者，AI开发人员和监管机构之间的合作。拟议的检测机制增强了模型透明度，并支持敏感应用程序（例如教育，法律系统和医疗保健）中负责任的LLM部署。未来的工作将集中于实时偏见监测和交叉语言概括，以提高AI驱动的交流工具中的公平性和包容性。

Title: A Survey on Post-training of Large Language Models

Authors: Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06072
Pdf URL: https://arxiv.org/pdf/2503.06072
Copy Paste: [[2503.06072]] A Survey on Post-training of Large Language Models(https://arxiv.org/abs/2503.06072)
Keywords: language model, gpt, llm, chat
Abstract: The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. These challenges necessitate advanced post-training language models (PoLMs) to address these shortcomings, such as OpenAI-o1/o3 and DeepSeek-R1 (collectively known as Large Reasoning Models, or LRMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Efficiency, which optimizes resource utilization amidst increasing complexity; and Integration and Adaptation, which extend capabilities across diverse modalities while addressing coherence issues. Charting progress from ChatGPT's foundational alignment strategies to DeepSeek-R1's innovative reasoning advancements, we illustrate how PoLMs leverage datasets to mitigate biases, deepen reasoning capabilities, and enhance domain adaptability. Our contributions include a pioneering synthesis of PoLM evolution, a structured taxonomy categorizing techniques and datasets, and a strategic agenda emphasizing the role of LRMs in improving reasoning proficiency and domain flexibility. As the first survey of its scope, this work consolidates recent PoLM advancements and establishes a rigorous intellectual framework for future research, fostering the development of LLMs that excel in precision, ethical robustness, and versatility across scientific and societal applications.
摘要：大型语言模型（LLM）的出现从根本上改变了自然语言处理，使其在从对话系统到科学探索的范围内必不可少。但是，他们的预训练的架构通常揭示了在专业环境中的局限性，包括限制的推理能力，道德不确定性和次优的领域特定性能。这些挑战需要先进的培训后语言模型（POLMS）来解决这些缺点，例如OpenAI-O1/O3和DeepSeek-R1（统称为大型推理模型或LRMS）。本文介绍了对波尔姆斯的首次综合调查，系统地在五个核心范式中进行了系统的进化：微调，从而提高了特定于任务的精度；对齐，确保与人类偏好保持一致；推理，尽管奖励设计面临挑战，但仍推进了多步推断；效率，在复杂性提高，可以优化资源利用；以及整合和适应，这些功能跨越了各种方式，同时解决了连贯的问题。绘制从Chatgpt的基本对齐策略到DeepSeek-R1的创新推理进步的进度，我们说明了Polms如何利用数据集来减轻偏见，加深推理能力并增强域的适应性。我们的贡献包括Polm Evolution的开创性综合，结构化的分类法对技术和数据集进行了分类，以及一项战略议程，强调了LRMS在提高推理能力和域灵活性方面的作用。作为对其范围的首次调查，这项工作巩固了POLM的最新进步，并为未来的研究建立了一个严格的智力框架，促进了在科学和社会应用中精确，道德鲁棒性和多功能性的LLM的发展。

Title: GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images

Authors: Xiang Lan, Feng Wu, Kai He, Qinghao Zhao, Shenda Hong, Mengling Feng
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06073
Pdf URL: https://arxiv.org/pdf/2503.06073
Copy Paste: [[2503.06073]] GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images(https://arxiv.org/abs/2503.06073)
Keywords: language model, llm
Abstract: While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between time series signals and visual ECG representations, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework extracting complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction generation for generating high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters (e.g., QRS/PR intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show GEM significantly improves predictive performance (CSN 7.4% ↑), explainability (22.7% ↑), and grounding (24.8% ↑), making it more suitable for real-world clinical applications. GitHub repository: this https URL
摘要：尽管最近的多模式大语言模型（MLLM）具有高级自动ECG解释，但它们仍然面临两个关键局限性：（1）时间序列信号和视觉ECG表示之间的多模式协同作用不足，（2）将诊断与粒度波形证据联系起来的有限解释性。我们介绍了GEM，这是第一个MLLM统一ECG时间序列，12导联ECG图像和用于接地和临床医生与ECG的解释的文本。 GEM通过三项核心创新实现了基于特征的分析、证据驱动的推理和类似临床医生的诊断过程：提取互补时间序列和图像特征的双编码器框架，用于有效多模式理解的跨模式对齐，以及知识引导的指令生成，用于生成将诊断与可测量参数（例如QRS/PR间隔）联系起来的高粒度接地数据（ECG-Grounding）。此外，我们提出了扎根的心电图理解任务，这是一种临床动机的基准测试，旨在全面评估MLLM在接地的ECG理解中的能力。对现有基准和我们提出的基准测试的实验结果表明，GEM显着提高了预测性能（CSN 7.4% ↑），解释性（22.7% ↑）和接地（24.8% ↑），使其更适合现实世界临床应用程序。 GitHub存储库：此HTTPS URL

Title: Towards Conversational AI for Disease Management

Authors: Anil Palepu, Valentin Liévin, Wei-Hung Weng, Khaled Saab, David Stutz, Yong Cheng, Kavita Kulkarni, S. Sara Mahdavi, Joëlle Barral, Dale R. Webster, Katherine Chou, Avinatan Hassidim, Yossi Matias, James Manyika, Ryutaro Tanno, Vivek Natarajan, Adam Rodman, Tao Tu, Alan Karthikesalingam, Mike Schaekermann
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06074
Pdf URL: https://arxiv.org/pdf/2503.06074
Copy Paste: [[2503.06074]] Towards Conversational AI for Disease Management(https://arxiv.org/abs/2503.06074)
Keywords: language model, llm, agent
Abstract: While large language models (LLMs) have shown promise in diagnostic dialogue, their capabilities for effective management reasoning - including disease progression, therapeutic response, and safe medication prescription - remain under-explored. We advance the previously demonstrated diagnostic capabilities of the Articulate Medical Intelligence Explorer (AMIE) through a new LLM-based agentic system optimised for clinical management and dialogue, incorporating reasoning over the evolution of disease and multiple patient visit encounters, response to therapy, and professional competence in medication prescription. To ground its reasoning in authoritative clinical knowledge, AMIE leverages Gemini's long-context capabilities, combining in-context retrieval with structured reasoning to align its output with relevant and up-to-date clinical practice guidelines and drug formularies. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines. AMIE was non-inferior to PCPs in management reasoning as assessed by specialist physicians and scored better in both preciseness of treatments and investigations, and in its alignment with and grounding of management plans in clinical guidelines. To benchmark medication reasoning, we developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (US, UK) and validated by board-certified pharmacists. While AMIE and PCPs both benefited from the ability to access external drug information, AMIE outperformed PCPs on higher difficulty questions. While further research would be needed before real-world translation, AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management.
摘要：尽管大型语言模型（LLMS）在诊断对话中表现出了希望，但它们的有效管理推理的能力（包括疾病进展，治疗反应和安全药物处方）仍然不足。我们通过针对临床管理和对话进行了优化的新的基于LLM的代理系统，推动了先前证明的清晰医学情报探索者（AMIE）的诊断能力，并纳入了疾病演变和多次患者访问的推理，对治疗的反应以及对药物处方药处方的专业能力。为了基于权威临床知识的推理，Amie利用了Gemini的长期文化功能，将其内部检索与结构化推理相结合，以使其产量与相关和最新的临床实践指南和药物配方相结合。在一项随机，盲目的虚拟物镜结构化临床检查（OSCE）研究中，AMIE与100个多访问案例中的21位初级保健医生（PCP）进行了比较，旨在反映英国的NICE指导和BMJ最佳实践指南。艾米（Amie）在管理推理中不属于PCP，这是由专业医生评估的，在治疗和调查的准确性方面都更好，并在临床指南中与管理计划的一致性和基础。为了进行基准药物推理，我们开发了RXQA，这是一个源自两个国家药物配方（美国，英国）的多项选择问题，并由董事会认证的药剂师验证。尽管AMIE和PCP都受益于访问外部药物信息的能力，但AMIE在更高的难题上胜过PCP。尽管在现实世界翻译之前需要进行进一步的研究，但艾米在评估中的出色表现标志着作为疾病管理工具迈出的重要一步。

Title: Multi-Attribute Multi-Grained Adaptation of Pre-Trained Language Models for Text Understanding from Bayesian Perspective

Authors: You Zhang, Jin Wang, Liang-Chih Yu, Dan Xu, Xuejie Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06085
Pdf URL: https://arxiv.org/pdf/2503.06085
Copy Paste: [[2503.06085]] Multi-Attribute Multi-Grained Adaptation of Pre-Trained Language Models for Text Understanding from Bayesian Perspective(https://arxiv.org/abs/2503.06085)
Keywords: language model
Abstract: Current neural networks often employ multi-domain-learning or attribute-injecting mechanisms to incorporate non-independent and identically distributed (non-IID) information for text understanding tasks by capturing individual characteristics and the relationships among samples. However, the extent of the impact of non-IID information and how these methods affect pre-trained language models (PLMs) remain unclear. This study revisits the assumption that non-IID information enhances PLMs to achieve performance improvements from a Bayesian perspective, which unearths and integrates non-IID and IID features. Furthermore, we propose a multi-attribute multi-grained framework for PLM adaptations (M2A), which combines multi-attribute and multi-grained views to mitigate uncertainty in a lightweight manner. We evaluate M2A through prevalent text-understanding datasets and demonstrate its superior performance, mainly when data are implicitly non-IID and PLMs scale larger.
摘要：当前的神经网络通常采用多域学习或属性注射机制来纳入非独立且相同分布的（非IID）信息，以通过捕获单个特征和样本之间的关系来理解文本理解任务。但是，非IID信息的影响程度以及这些方法如何影响预训练的语言模型（PLM）尚不清楚。这项研究重新审查了以下假设：非IID信息可以增强PLM，从贝叶斯的角度来提高性能，从而发掘并整合了非IID和IID特征。此外，我们提出了一个多属性的PLM适应框架（M2A），该框架结合了多属性和多透明视图，以轻巧的方式减轻不确定性。我们通过普遍的文本认识数据集评估M2A，并证明其出色的性能，主要是当数据隐含非IID时，并且PLMS扩展更大。

Title: Evaluating Discourse Cohesion in Pre-trained Language Models

Authors: Jie He, Wanqiu Long, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06137
Pdf URL: https://arxiv.org/pdf/2503.06137
Copy Paste: [[2503.06137]] Evaluating Discourse Cohesion in Pre-trained Language Models(https://arxiv.org/abs/2503.06137)
Keywords: language model
Abstract: Large pre-trained neural models have achieved remarkable success in natural language processing (NLP), inspiring a growing body of research analyzing their ability from different aspects. In this paper, we propose a test suite to evaluate the cohesive ability of pre-trained language models. The test suite contains multiple cohesion phenomena between adjacent and non-adjacent sentences. We compare different pre-trained language models on these phenomena and analyze the experimental results, hoping more attention can be given to discourse cohesion in the future.
摘要：大型的预训练的神经模型在自然语言过程（NLP）方面取得了巨大的成功，激发了越来越多的研究，分析了它们从不同方面的能力。在本文中，我们提出了一个测试套件，以评估预训练的语言模型的凝聚力。该测试套件包含相邻句子和非雅典之间的多种内聚现象。我们尝试将不同的预训练的语言模型在这些现象上进行比较，并分析实验结果，希望将来可以更多地关注话语凝聚力。

Title: GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs

Authors: Mingyang Song, Mao Zheng, Xuan Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06139
Pdf URL: https://arxiv.org/pdf/2503.06139
Copy Paste: [[2503.06139]] GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs(https://arxiv.org/abs/2503.06139)
Keywords: language model, llm, prompt
Abstract: Using Large Language Models (LLMs) to evaluate and compare two answers from different models typically involves having LLM-based judges select the better answer. However, humans often approach problem-solving from a reverse perspective, for instance, by choosing the worse option instead of the better one in a pairwise comparison. Generally, this kind of reverse thinking plays a crucial role in human reasoning and decision-making and can further test the difference between original and reverse thought processes simultaneously. To address the above issue, in this paper, we propose a Goal-Reversed Prompting (GRP) approach for pairwise evaluation that shifts the original task from selecting the better answer to choosing the worse one. We encourage LLMs to think in reverse by prompting LLMs to identify the worse response. Experiments on closed-source models demonstrate that GRP significantly enhances evaluation capabilities, outperforming the prompt template with the original goal.
摘要：使用大型语言模型（LLM）评估和比较来自不同模型的两个答案通常涉及让基于LLM的法官选择更好的答案。但是，例如，人类通常从反向的角度解决问题解决问题，例如，在成对比较中选择较差的选项而不是更好的选择。通常，这种反向思维在人类的推理和决策中起着至关重要的作用，可以进一步测试原始思维过程和反向思维过程之间的差异。为了解决上述问题，在本文中，我们提出了一种反向的提示（GRP）方法，以进行成对评估，该方法将原始任务从选择更好的答案转移到选择较差的任务。我们鼓励LLM通过提示LLMS确定较差的反应来反向思考。封闭源模型的实验表明，GRP显着增强了评估功能，以最初的目标优于及时模板。
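
Illustrative note (not from the paper): the goal reversal described above can be sketched as a pair of prompt templates plus a mapping from the judge's reply back to the preferred answer. The prompt wording, the `A`/`B` reply format, and the function names `build_prompt` and `preferred_answer` are assumptions made for this example, not the authors' templates.

```python
def build_prompt(question: str, answer_a: str, answer_b: str, reverse: bool = False) -> str:
    """Build a pairwise-judging prompt; with reverse=True the judge is asked for the worse answer."""
    target = "worse" if reverse else "better"
    return (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        f"Which answer is {target}? Reply with 'A' or 'B'."
    )


def preferred_answer(judge_reply: str, reverse: bool = False) -> str:
    """Map the judge's reply to the answer that should be preferred.

    Under the reversed goal the judge names the worse answer,
    so the preferred answer is the other one.
    """
    choice = judge_reply.strip().upper()
    if choice not in {"A", "B"}:
        raise ValueError(f"unexpected judge reply: {judge_reply!r}")
    if reverse:
        return "B" if choice == "A" else "A"
    return choice
```

Mapping both variants onto the same "preferred answer" label lets the original and reversed prompts be scored against the same human preference annotations.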

Title: Sample-aware Adaptive Structured Pruning for Large Language Models

Authors: Jun Kong, Xinge Ma, Jin Wang, Xuejie Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06184
Pdf URL: https://arxiv.org/pdf/2503.06184
Copy Paste: [[2503.06184]] Sample-aware Adaptive Structured Pruning for Large Language Models(https://arxiv.org/abs/2503.06184)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved outstanding performance in natural language processing, but enormous model sizes and high computational costs limit their practical deployment. Structured pruning can effectively reduce the resource demands for deployment by removing redundant model parameters. However, the randomly selected calibration data and fixed single importance estimation metrics in existing structured pruning methods lead to degraded performance of pruned models. This study introduces AdaPruner, a sample-aware adaptive structured pruning framework for LLMs, aiming to optimize the calibration data and importance estimation metrics in the structured pruning process. Specifically, AdaPruner effectively removes redundant parameters from LLMs by constructing a structured pruning solution space and then employing Bayesian optimization to adaptively search for the optimal calibration data and importance estimation metrics. Experimental results show that AdaPruner outperforms existing structured pruning methods on a family of LLMs with varying pruning ratios, demonstrating its applicability and robustness. Remarkably, at a 20% pruning ratio, the model pruned with AdaPruner maintains 97% of the performance of the unpruned model.
摘要：大型语言模型（LLM）在自然语言处理中取得了出色的表现，但是巨大的模型大小和高计算成本限制了其实际部署。结构化的修剪可以通过删除冗余模型参数有效地减少部署的资源需求。但是，在现有的结构化修剪方法中，随机选择的校准数据和固定单个重要性估计指标导致修剪模型的性能降低。这项研究介绍了AdaPruner，这是一种针对LLM的样品感知的自适应结构化修剪框架，旨在优化结构化修剪过程中的校准数据和重要性估计指标。具体而言，AdaPruner通过构造结构化的修剪解决方案空间，然后采用贝叶斯优化来自适应地搜索最佳校准数据和重要性估计指标，从而有效地消除了LLM的冗余参数。实验结果表明，适应器在LLM家族上的实现率优于具有不同修剪比的现有结构化修剪方法，这表明其适用性和鲁棒性。值得注意的是，以20 \％的修剪率，用AdaPruner修剪的模型保持了未经修复模型的97 \％。

Title: CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset

Authors: Oriel Perets, Ofir Ben Shoham, Nir Grinberg, Nadav Rappoport
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06204
Pdf URL: https://arxiv.org/pdf/2503.06204
Copy Paste: [[2503.06204]] CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset(https://arxiv.org/abs/2503.06204)
Keywords: language model, gpt, llm
Abstract: Medical benchmark datasets significantly contribute to developing Large Language Models (LLMs) for medical knowledge extraction, diagnosis, summarization, and other uses. Yet, current benchmarks are mainly derived from exam questions given to medical students or cases described in the medical literature, lacking the complexity of real-world patient cases that deviate from classic textbook abstractions. These include rare diseases, uncommon presentations of common diseases, and unexpected treatment responses. Here, we construct Clinically Uncommon Patient Cases and Diagnosis Dataset (CUPCase) based on 3,562 real-world case reports from BMC, including diagnoses in open-ended textual format and as multiple-choice options with distractors. Using this dataset, we evaluate the ability of state-of-the-art LLMs, including both general-purpose and Clinical LLMs, to identify and correctly diagnose a patient case, and test models' performance when only partial information about cases is available. Our findings show that general-purpose GPT-4o attains the best performance in both the multiple-choice task (average accuracy of 87.9%) and the open-ended task (BERTScore F1 of 0.764), outperforming several LLMs with a focus on the medical domain such as Meditron-70B and MedLM-Large. Moreover, GPT-4o was able to maintain 87% and 88% of its performance with only the first 20% of tokens of the case presentation in multiple-choice and free text, respectively, highlighting the potential of LLMs to aid in early diagnosis in real-world cases. CUPCase expands our ability to evaluate LLMs for clinical decision support in an open and reproducible manner.
摘要：医疗基准数据集为开发大型语言模型（LLM）的发展，用于医学知识提取，诊断，摘要和其他用途。然而，当前的基准测试主要源自对医学生或医学文献中描述的案例的考试问题，缺乏偏离经典教科书摘要的现实世界患者病例的复杂性。这些包括罕见的疾病，常见疾病的不常见表现以及意外的治疗反应。在这里，我们基于来自BMC的3,562例现实世界病例报告，构建临床上罕见的患者病例和诊断数据集（CUPCASE），包括开放式文本格式的诊断，以及带有分散术者的多项选择选项。使用此数据集，我们评估了包括通用和临床LLM在内的最先进的LLM的能力，可以识别和正确诊断患者病例的能力，并在仅提供有关病例的部分信息时测试模型的性能。我们的发现表明，通用GPT-4O在多项选择任务（平均精度为87.9％）和开放式任务（Bertscore F1的0.764）中达到了最佳性能，优于几个LLM，重点是医疗领域，例如Mestitron-70B和Medlm-Large。此外，GPT-4O能够在多项选择和自由文本中分别维持其性能的87％和88％，而案例表现的前20％，强调了LLMS在现实世界中有助于早期诊断的潜力。 Cupcase扩展了我们以开放且可重复的方式评估LLM的LLM的临床决策支持的能力。
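
Illustrative note (not from the paper): the partial-information setting above keeps only the first portion of a case presentation (the abstract reports results for the first 20% of tokens). A minimal sketch of such truncation follows. It assumes whitespace tokenization, which may differ from the tokenization the authors used, and the name `truncate_case` is hypothetical.

```python
def truncate_case(text: str, fraction: float = 0.2) -> str:
    """Keep the leading `fraction` of whitespace-separated tokens (at least one token)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    tokens = text.split()
    keep = max(1, int(len(tokens) * fraction))
    return " ".join(tokens[:keep])
```

The truncated text would then be passed to the model in place of the full case presentation, with the multiple-choice options or the open-ended question left unchanged.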

Title: Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels

Authors: Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Ricard Marxer
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2503.06211
Pdf URL: https://arxiv.org/pdf/2503.06211
Copy Paste: [[2503.06211]] Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels(https://arxiv.org/abs/2503.06211)
Keywords: language model
Abstract: Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- aim to enable cross-modal knowledge transfer to overcome the scaling limitations of unimodal speech LMs. The predominant approach to TSLM training expands the vocabulary of a pre-trained text LM by appending new embeddings and linear projections for speech, followed by fine-tuning on speech data. We hypothesize that this method limits cross-modal transfer by neglecting feature compositionality, preventing text-learned functions from being fully leveraged at appropriate abstraction levels. To address this, we propose augmenting vocabulary expansion with modules that better align abstraction levels across layers. Our models, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute. Representation analyses and improved multimodal performance suggest our method enhances cross-modal transfer.
摘要：文本语言语言模型（TSLMS） - 经过培训的语言模型，旨在共同处理文本和语音 - 旨在使跨模式知识转移以克服单峰语音LMS的缩放限制。 TSLM培训的主要方法通过附加新的嵌入和语音线性预测，然后对语音数据进行微调，扩大了预训练的文本LM的词汇。我们假设该方法通过忽略特征组成性来限制交叉模式转移，从而阻止文本学习的功能在适当的抽象水平下完全利用。为了解决这个问题，我们提出了增强词汇扩展，其模块可以更好地对齐层次的抽象水平。我们的模型，\ textsc {smoltolk}，竞争对手或超越最先进的TSLM，受过数量级的训练。表示分析和改进的多模式性能表明我们的方法增强了跨模式转移。

Title: KnowLogic: A Benchmark for Commonsense Reasoning via Knowledge-Driven Data Synthesis

Authors: Weidong Zhan, Yue Wang, Nan Hu, Liming Xiao, Jingyuan Ma, Yuhang Qin, Zheng Li, Yixin Yang, Sirui Deng, Jinkun Ding, Wenhan Ma, Rui Li, Weilin Luo, Qun Liu, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06218
Pdf URL: https://arxiv.org/pdf/2503.06218
Copy Paste: [[2503.06218]] KnowLogic: A Benchmark for Commonsense Reasoning via Knowledge-Driven Data Synthesis(https://arxiv.org/abs/2503.06218)
Keywords: llm
Abstract: Current evaluations of commonsense reasoning in LLMs are hindered by the scarcity of natural language corpora with structured annotations for reasoning tasks. To address this, we introduce KnowLogic, a benchmark generated through a knowledge-driven synthetic data strategy. KnowLogic integrates diverse commonsense knowledge, plausible scenarios, and various types of logical reasoning. One of the key advantages of KnowLogic is its adjustable difficulty levels, allowing for flexible control over question complexity. It also includes fine-grained labels for in-depth evaluation of LLMs' reasoning abilities across multiple dimensions. Our benchmark consists of 3,000 bilingual (Chinese and English) questions across various domains, and presents significant challenges for current LLMs, with the highest-performing model achieving only 69.57%. Our analysis highlights common errors, such as misunderstandings of low-frequency commonsense, logical inconsistencies, and overthinking. This approach, along with our benchmark, provides a valuable tool for assessing and enhancing LLMs' commonsense reasoning capabilities and can be applied to a wide range of knowledge domains.
摘要：当前对LLM中常识性推理的评估受到自然语言语料库的稀缺性，其结构化注释用于推理任务。为了解决这个问题，我们介绍了通过知识驱动的合成数据策略生成的Knowlogic，这是一种基准。 Knowlogic整合了多样的常识性知识，合理的场景和各种类型的逻辑推理。知识的关键优势之一是其可调节的难度级别，从而可以灵活地控制问题复杂性。它还包括用于对LLMS跨多个维度的LLMS推理能力进行深入评估的细粒标签。我们的基准由各个领域的3,000个双语（中文和英语）问题组成，并对当前的LLM提出了重大挑战，表现最高的模型仅达到69.57 \％。我们的分析突出了常见错误，例如对低频常识，逻辑上矛盾和过度思考的误解。这种方法以及我们的基准为评估和增强LLMS的常识性推理功能提供了宝贵的工具，并可以应用于广泛的知识领域。

Title: Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning

Authors: Yanjun Chen, Yirong Sun, Xinghao Chen, Jian Wang, Xiaoyu Shen, Wenjie Li, Wei Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06232
Pdf URL: https://arxiv.org/pdf/2503.06232
Copy Paste: [[2503.06232]] Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning(https://arxiv.org/abs/2503.06232)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning has proven effective in natural language tasks but remains underexplored in multimodal alignment. This study investigates its integration into 3D vision-language learning by embedding structured reasoning into alignment training. We introduce the 3D-CoT Benchmark, a dataset with hierarchical CoT annotations covering shape recognition, functional inference, and causal reasoning. Through controlled experiments, we compare CoT-structured and standard textual annotations across large reasoning models (LRMs) and large language models (LLMs). Our evaluation employs a dual-layer framework assessing both intermediate reasoning and final inference quality. Extensive experiments demonstrate that CoT significantly improves 3D semantic grounding, with LRMs leveraging CoT more effectively than LLMs. Furthermore, we highlight that annotation structure influences performance: explicit reasoning markers aid LLMs, while unmarked CoT better aligns with LRM inference patterns. Our analyses suggest that CoT is crucial for enhancing multimodal reasoning, with implications beyond 3D tasks.
摘要：事实证明，经过思考链（COT）推理在自然语言任务中有效，但在多模式对齐中仍然没有得到充实的影响。这项研究通过将结构化推理嵌入对齐训练中，调查了其整合到3D视觉学习中。我们介绍了3D-COT基准测试，该数据集具有层次结构的COT注释，涵盖了形状识别，功能推理和因果推理。通过受控实验，我们比较了大型推理模型（LRMS）和大型语言模型（LLMS）的COT结构和标准文本注释。我们的评估采用了一个双层框架，评估了中间推理和最终推理质量。广泛的实验表明，COT显着改善了3D语义接地，而LRMS比LLM更有效地利用了COT。此外，我们强调说，注释结构会影响性能 - 明确推理标记有助于LLM，而未标记的COT可以更好地与LRM推理模式保持一致。我们的分析表明，COT对于增强多模式推理至关重要，其含义超出了3D任务。

Title: IteRABRe: Iterative Recovery-Aided Block Reduction

Authors: Haryo Akbarianto Wibowo, Haiyue Song, Hideki Tanaka, Masao Utiyama, Alham Fikri Aji, Raj Dabre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06291
Pdf URL: https://arxiv.org/pdf/2503.06291
Copy Paste: [[2503.06291]] IteRABRe: Iterative Recovery-Aided Block Reduction(https://arxiv.org/abs/2503.06291)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by ~3% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe demonstrates particular strength in the preservation of linguistic capabilities, showing a 5% improvement over the baselines in language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating preservation of multilingual capabilities.
摘要：大型语言模型（LLM）的部署越来越昂贵，推动了有效模型压缩技术的需求。尽管Block Pruning提供了一种简单的方法来减少模型大小，但现有方法通常难以维持性能或需要大量的计算资源才能恢复。我们提出了Iterabre，这是一种简单而有效的迭代修剪方法，可以在需要最少的计算资源的同时，获得出色的压缩结果。在压缩LLAMA3.1-8B和QWEN2.5-7B模型时，我们的方法仅使用250万代币进行恢复，平均比基线方法平均超过3％。 Iterabre在保存语言能力方面表现出了特殊的优势，在与语言相关的任务中的基准相比，提高了5％。我们的分析揭示了这些模型之间的独特修剪特征，同时也证明了多种语言能力的保存。

Title: States of LLM-generated Texts and Phase Transitions between them

Authors: Nikolay Mikhaylovskiy
Subjects: cs.CL, cond-mat.stat-mech, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06330
Pdf URL: https://arxiv.org/pdf/2503.06330
Copy Paste: [[2503.06330]] States of LLM-generated Texts and Phase Transitions between them(https://arxiv.org/abs/2503.06330)
Keywords: llm
Abstract: It has been known for some time that autocorrelations of words in human-written texts decay according to a power law. Recent works have also shown that the autocorrelation decay in texts generated by LLMs is qualitatively different from that in literary texts. Solid state physics ties autocorrelation decay laws to the states of matter. In this work, we empirically demonstrate that, depending on the temperature parameter, LLMs can generate text that can be classified as solid, critical state or gas.
摘要：众所周知，在人工写的文本中，单词的自相关是根据权力法腐烂的。最近的作品还表明，LLMS生成的文本中的自相关衰减在质量上与文学文本不同。固态物理学将自相关衰减法与物质状态联系起来。在这项工作中，我们从经验上证明，根据温度参数，LLM可以生成可以归类为固体，临界状态或气体的文本。
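
Note: the power-law decay mentioned in the abstract above can be illustrated with a small sketch. It is not the paper's code. It assumes the text has already been mapped to a numeric sequence (how words are encoded is left open here), and the helper names `autocorrelation` and `fit_power_law` are hypothetical. The fit is an ordinary least-squares line in log-log space, C(k) ≈ a · k^(-b).

```python
import math

def autocorrelation(xs, lag):
    """Sample autocorrelation of a numeric sequence at a given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / (n - lag)
    return cov / var

def fit_power_law(lags, values):
    """Least-squares fit of log C = log a - b * log k; returns (a, b).

    Points with non-positive lag or value are skipped, since their
    logarithm is undefined.
    """
    pts = [(math.log(k), math.log(c)) for k, c in zip(lags, values) if k > 0 and c > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope
```

On real text the autocorrelation estimates are noisy at large lags, so in practice one would restrict the fit to a range of lags where the estimates are reliable.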

Title: How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders

Authors: Tatsuro Inaba, Kentaro Inui, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06394
Pdf URL: https://arxiv.org/pdf/2503.06394
Copy Paste: [[2503.06394]] How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders(https://arxiv.org/abs/2503.06394)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge. However, the internal mechanisms underlying the development of these capabilities remain poorly understood. To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process. Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the interpretative results across these stages. Our findings suggest that LLMs initially acquire language-specific knowledge independently, followed by cross-linguistic correspondences. Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating the development of more conceptual understanding.
摘要：大型语言模型（LLMS）表现出了非凡的多语言能力和广泛的知识。但是，这些能力发展的内部机制仍然很少理解。为了调查这一点，我们分析了在训练过程中LLMS内部表示中编码的信息如何演变。具体来说，我们在模型的多个检查点训练稀疏的自动编码器，并系统地比较了这些阶段的解释结果。我们的发现表明，LLMS最初独立地获取特定于语言的知识，然后是跨语言对应。此外，我们观察到，在掌握令牌级别的知识之后，模型过渡到学习更高级别的抽象概念，表明了更概念的理解的发展。

Title: Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues

Authors: Alexander Scarlatos, Naiming Liu, Jaewook Lee, Richard Baraniuk, Andrew Lan
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.06424
Pdf URL: https://arxiv.org/pdf/2503.06424
Copy Paste: [[2503.06424]] Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues(https://arxiv.org/abs/2503.06424)
Keywords: language model, gpt, llm, prompt
Abstract: Generative artificial intelligence (AI) has the potential to scale up personalized tutoring through large language models (LLMs). Recent AI tutors are adapted for the tutoring task by training or prompting LLMs to follow effective pedagogical principles, though they are not trained to maximize student learning throughout the course of a dialogue. Therefore, they may engage with students in a suboptimal way. We address this limitation by introducing an approach to train LLMs to generate tutor utterances that maximize the likelihood of student correctness, while still encouraging the model to follow good pedagogical practice. Specifically, we generate a set of candidate tutor utterances and score them using (1) an LLM-based student model to predict the chance of correct student responses and (2) a pedagogical rubric evaluated by GPT-4o. We then use the resulting data to train an open-source LLM, Llama 3.1 8B, using direct preference optimization. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o. We also conduct qualitative analyses and a human evaluation to demonstrate that our model generates high quality tutor utterances.
摘要：生成人工智能（AI）有可能通过大型语言模型（LLM）扩展个性化的辅导。最近的AI导师通过培训或促使LLM遵循有效的教学原则来适应辅导任务，尽管他们没有接受过培训以最大程度地提高学生在整个对话过程中的学习。因此，他们可以以次优的方式与学生互动。我们通过引入一种培训LLM的方法来解决这一限制，以产生导师的话语，从而最大程度地提高学生正确性的可能性，同时仍鼓励模型遵循良好的教学实践。具体而言，我们生成了一组候选导师的话语，并使用（1）基于LLM的学生模型对其进行评分，以预测正确的学生响应的机会，以及（2）由GPT-4O评估的教学专栏。然后，我们使用所得数据使用直接偏好优化训练开源LLM Llama 3.1 8b。我们表明，我们的模型产生的导师话语导致正确的学生反应的机会显着更高，同时保持GPT-4O的教学质量。我们还进行了定性分析和人类评估，以证明我们的模型会产生高质量的导师话语。
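
Note: a minimal sketch of the data-construction step described in the abstract above, under stated assumptions. The abstract does not say how the student-model score and the rubric score are combined, so the weighted sum below is an assumption, as are the names `Candidate`, `combined_score`, and `preference_pair`. The `dpo_loss` function is the standard per-pair direct preference optimization loss, written on scalar log-probabilities for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    p_correct: float  # hypothetical student-model estimate in [0, 1]
    rubric: float     # hypothetical pedagogical rubric score in [0, 1]

def combined_score(c, weight=0.5):
    """Weighted sum of the two scores (the weighting is an assumption)."""
    return weight * c.p_correct + (1 - weight) * c.rubric

def preference_pair(cands, weight=0.5):
    """Return (chosen, rejected): the best- and worst-scoring candidates."""
    ranked = sorted(cands, key=lambda c: combined_score(c, weight), reverse=True)
    return ranked[0], ranked[-1]

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin of log-ratio differences))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))
```

The loss falls below log 2 once the policy prefers the chosen utterance more strongly than the reference model does.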

Title: Graph Retrieval-Augmented LLM for Conversational Recommendation Systems

Authors: Zhangchi Qiu, Linhao Luo, Zicheng Zhao, Shirui Pan, Alan Wee-Chung Liew
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.06430
Pdf URL: https://arxiv.org/pdf/2503.06430
Copy Paste: [[2503.06430]] Graph Retrieval-Augmented LLM for Conversational Recommendation Systems(https://arxiv.org/abs/2503.06430)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Conversational Recommender Systems (CRSs) have emerged as a transformative paradigm for offering personalized recommendations through natural language dialogue. However, they face challenges with knowledge sparsity, as users often provide brief, incomplete preference statements. While recent methods have integrated external knowledge sources to mitigate this, they still struggle with semantic understanding and complex preference reasoning. Recent Large Language Models (LLMs) demonstrate promising capabilities in natural language understanding and reasoning, showing significant potential for CRSs. Nevertheless, due to the lack of domain knowledge, existing LLM-based CRSs either produce hallucinated recommendations or demand expensive domain-specific training, which largely limits their applicability. In this work, we present G-CRS (Graph Retrieval-Augmented Large Language Model for Conversational Recommender Systems), a novel training-free framework that combines graph retrieval-augmented generation and in-context learning to enhance LLMs' recommendation capabilities. Specifically, G-CRS employs a two-stage retrieve-and-recommend architecture, where a GNN-based graph reasoner first identifies candidate items, followed by Personalized PageRank exploration to jointly discover potential items and similar user interactions. These retrieved contexts are then transformed into structured prompts for LLM reasoning, enabling contextually grounded recommendations without task-specific training. Extensive experiments on two public datasets show that G-CRS achieves superior recommendation performance compared to existing methods without requiring task-specific training.
摘要：会话推荐系统（CRS）已成为通过自然语言对话提供个性化建议的变革性范式。但是，由于用户通常会提供简短的，不完整的偏好陈述，因此他们面临着知识稀疏性的挑战。尽管最近的方法已经整合了外部知识来源来减轻这种情况，但他们仍然在语义理解和复杂的偏好推理上挣扎。最近的大型语言模型（LLMS）在自然语言理解和推理方面表现出有希望的能力，显示出对CRS的巨大潜力。然而，由于缺乏领域知识，现有的基于LLM的CRS可以产生幻觉建议或要求昂贵的领域特定培训，这在很大程度上限制了其适用性。在这项工作中，我们介绍了G-CRS（用于会话推荐系统的图形检索大型语言模型），这是一个新颖的无培训框架，结合了图检索检索效果的生成和内在的学习学习，以增强LLMS的建议功能。具体而言，G-CRS采用了两阶段检索和重新提示的体系结构，基于GNN的图形推理器首先确定候选项目，然后进行个性化的Pagerank Exploration，共同发现潜在项目和类似的用户互动。然后将这些检索的上下文转换为LLM推理的结构化提示，从而在没有特定于任务的培训的情况下可以扎根于上下文。在两个公共数据集上进行的广泛实验表明，与现有方法相比，G-CRS在不需要特定于任务的培训的情况下达到了优越的建议性能。
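
Note: the abstract above mentions Personalized PageRank exploration. The sketch below shows the generic power-iteration form of that algorithm and is not the G-CRS implementation. It assumes the graph is a dict mapping each node to a list of out-neighbours (every neighbour also appears as a key), that `seeds` is a non-empty set of nodes, and that dangling nodes return their mass to the seeds.

```python
def personalized_pagerank(graph, seeds, alpha=0.15, iters=100):
    """Power iteration for PageRank with restarts to a seed set.

    alpha is the restart probability; the returned scores sum to 1.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                # Spread the non-restart mass evenly over out-neighbours.
                share = (1 - alpha) * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1 - alpha) * rank[n] / len(seeds)
        rank = nxt
    return rank
```

In a recommendation setting the seeds would typically be the items or entities mentioned in the conversation, and high-scoring non-seed nodes become candidates.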

Title: VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

Authors: Yanling Wang, Yihan Zhao, Xiaodong Chen, Shasha Guo, Lixin Liu, Haoyang Li, Yong Xiao, Jing Zhang, Qi Li, Ke Xu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06492
Pdf URL: https://arxiv.org/pdf/2503.06492
Copy Paste: [[2503.06492]] VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering(https://arxiv.org/abs/2503.06492)
Keywords: language model, gpt
Abstract: Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at this https URL.
摘要：大型视觉模型（LVLM）表现出了显着的成就，但是在寻求事实的问题回答（QA）上，非事实反应的产生仍然很普遍。当前的多模式寻求事实基准主要集中于将模型输出与地面真相答案进行比较，从而有限地了解了特定于模态模块的性能。为了弥合这一差距，我们介绍了VisualSimpleqa，这是一个具有两个关键功能的多模式寻求事实基准。首先，它可以在视觉和语言模式中对LVLM的简化和脱钩评估。其次，它包含定义明确的难度标准，以指导人类注释并促进具有挑战性的子集，VisualSimpleqa-Hard的提取。在15个LVLM上进行的实验表明，即使是最新的模型，例如GPT-4O等最新模型也仅在VisualSimpleqa上的多模式事实寻求质量质量质量质量质量上只能达到60％+正确性，而在VisualSimpleqa-Hard上也只能达到30％+。此外，这些模型的分离评估突出了视觉和语言模块改进的大量机会。该数据集可在此HTTPS URL上找到。

Title: GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks

Authors: Haoqiang Kang, Enna Sachdeva, Piyush Gupta, Sangjae Bae, Kwonjoon Lee
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06514
Pdf URL: https://arxiv.org/pdf/2503.06514
Copy Paste: [[2503.06514]] GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks(https://arxiv.org/abs/2503.06514)
Keywords: language model, prompt, chain-of-thought
Abstract: Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune the VLM with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.
摘要：视觉语言模型（VLM）最近通过特定于任务的微调显示了顺序决策任务的有希望的进步。然而，常见的微调方法，例如监督的微调（SFT）和强化学习（RL）技术，例如近端策略优化（PPO），当前的显着局限性：SFT假设独立且相同分布式（IID）数据，而PPO则专注于最大程度地提高累积差异。这些限制通常会限制解决方案多样性和阻碍多步推理任务中的概括。为了应对这些挑战，我们介绍了一个新颖的框架GflowVLM，该框架使用生成流动网络（GFLOWNETS）微调VLM，以促进为复杂推理任务的生成不同的解决方案。 GflowVLM将环境建模为非马克维亚决策过程，从而使其能够捕获对现实应用程序必不可少的长期依赖关系。它将观察和任务描述作为输入，以提示后来指导行动选择的思维链（COT）推理。我们使用基于任务的奖励与Gflownets微调VLM。这种方法使VLM能够胜过包括SFT和RL在内的先验微调方法。经验结果证明了GflowVLM对诸如纸牌游戏（数字线，二十一点）和具体规划任务（ALFWORLD）等复杂任务的有效性，显示了增强的训练效率，解决方案多样性以及在分发和过度分发场景中的更强的概括能力。

Title: SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations

Authors: Xingwei Tan, Chen Lyu, Hafiz Muhammad Umer, Sahrish Khan, Mahathi Parvatham, Lois Arthurs, Simon Cullen, Shelley Wilson, Arshad Jhumka, Gabriele Pergola
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06534
Pdf URL: https://arxiv.org/pdf/2503.06534
Copy Paste: [[2503.06534]] SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations(https://arxiv.org/abs/2503.06534)
Keywords: language model, llm
Abstract: Detecting toxic language, including sexism, harassment and abusive behaviour, remains a critical challenge, particularly in its subtle and context-dependent forms. Existing approaches largely focus on isolated message-level classification, overlooking toxicity that emerges across conversational contexts. To promote and enable future research in this direction, we introduce SafeSpeech, a comprehensive platform for toxic content detection and analysis that bridges message-level and conversation-level insights. The platform integrates fine-tuned classifiers and large language models (LLMs) to enable multi-granularity detection, toxic-aware conversation summarization, and persona profiling. SafeSpeech also incorporates explainability mechanisms, such as perplexity gain analysis, to highlight the linguistic elements driving predictions. Evaluations on benchmark datasets, including EDOS, OffensEval, and HatEval, demonstrate the reproduction of state-of-the-art performance across multiple tasks, including fine-grained sexism detection.
摘要：检测包括性别歧视，骚扰和虐待行为在内的有毒语言仍然是一个关键的挑战，尤其是在其微妙和依赖上下文的形式中。现有的方法主要集中在孤立的消息级分类上，忽略了对话环境中出现的毒性。为了促进和启用未来的研究，我们介绍了SafeSpeech，这是一个综合的有毒内容检测和分析平台，桥梁消息级别和对话级别的见解。该平台集成了微调的分类器和大型语言模型（LLMS），以实现多粒性检测，有毒意识的对话摘要和角色分析。 Safpeech还结合了解释性机制，例如困惑性增益分析，以突出驱动预测的语言元素。在包括Edos，Insenseval和Hateval在内的基准数据集上的评估表明，在包括精细性别歧视检测在内的多个任务中，跨多个任务的最新性能再现。

Title: BingoGuard: LLM Content Moderation Tools with Risk Levels

Authors: Fan Yin, Philippe Laban, Xiangyu Peng, Yilun Zhou, Yixin Mao, Vaibhav Vats, Linnea Ross, Divyansh Agarwal, Caiming Xiong, Chien-Sheng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06550
Pdf URL: https://arxiv.org/pdf/2503.06550
Copy Paste: [[2503.06550]] BingoGuard: LLM Content Moderation Tools with Risk Levels(https://arxiv.org/abs/2503.06550)
Keywords: language model, llm
Abstract: Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of annotations on levels of severity, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severities, and styles, as well as BingoGuardTest, a test set with 988 examples explicitly labeled according to our severity rubrics, which enables fine-grained analysis of model behavior at different severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming the best public model, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.
摘要：大语言模型（LLM）产生的恶意内容可能造成不同程度的伤害。尽管现有的基于LLM的主持人可以检测有害内容，但他们难以评估风险水平，并可能会错过低风险的产出。准确的风险评估允许具有不同安全阈值的平台来量身定制内容过滤和拒绝。在本文中，我们为11个有害主题引入了主题严重性专栏，并构建BingoGuard，这是一种基于LLM的适应系统，旨在预测二进制安全标签和严重性水平。为了解决严重程度水平的注释缺乏注释，我们提出了一个可扩展的生成 - 然后进行过滤框架，该框架首先在不同的严重性水平上产生响应，然后过滤低质量的响应。使用此框架，我们创建了BingoGuardTrain，这是一个培训数据集，其中有54,897个示例，涵盖了各种主题，响应严重性，样式和BingoGuardTest，这是一个测试集，该测试集具有988个示例，该示例明确标记了我们的严重性标记，可以对不同的严重性级别进行模型行为进行良好的分析。我们的Bingoguard-8B接受了BingoGuardTrain的培训，在包括Wildguardtest和Harmbench在内的几个适度基准上以及Bingoguardtest，胜过最佳公共模型，Wildguard，Wildguard，Wildguard，以4.3 \％为4.3％。我们的分析表明，将严重性水平纳入训练可以显着提高检测性能，并使该模型能够有效评估有害反应的严重性。

Title: WildIFEval: Instruction Following in the Wild

Authors: Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06573
Pdf URL: https://arxiv.org/pdf/2503.06573
Copy Paste: [[2503.06573]] WildIFEval: Instruction Following in the Wild(https://arxiv.org/abs/2503.06573)
Keywords: llm, prompt
Abstract: Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints found in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. Our findings reveal that all evaluated models experience performance degradation with an increasing number of constraints. Thus, we show that all models have substantial room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction-following under complex, realistic conditions.
摘要：最近的LLM在遵循用户说明方面表现出色，但是具有多个约束的说明仍然是一个重大挑战。在这项工作中，我们介绍了WildifeVal-一个大规模的数据集，其中包括12K真实用户说明，具有多种多样的多种构造条件。与先前的数据集不同，我们的收藏集涵盖了自然用户提示中的宽阔词汇和局部约束范围。我们将这些约束分为八个高级类，以捕获其在现实情况下的分布和动态。利用WildifeVal，我们进行了广泛的实验，以基准测试领先LLM的指导跟踪功能。我们的发现表明，所有经过评估的模型都会经历越来越多的约束。因此，我们表明所有模型都有大量改进此类任务的空间。此外，我们观察到特定类型的约束在模型性能中起着至关重要的作用。我们释放数据集，以促进有关复杂，现实条件下的指导跟踪的进一步研究。

Title: Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation

Authors: Yingfeng Luo, Tong Zheng, Yongyu Mu, Bei Li, Qinghong Zhang, Yongqi Gao, Ziqiang Xu, Peinan Feng, Xiaoqian Liu, Tong Xiao, Jingbo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06594
Pdf URL: https://arxiv.org/pdf/2503.06594
Copy Paste: [[2503.06594]] Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation(https://arxiv.org/abs/2503.06594)
Keywords: language model, llm
Abstract: The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems using a single pre-trained Transformer decoder, while encoder-decoder architectures, which were the standard in earlier NMT models, have received relatively less attention. In this paper, we explore translation models that are universal, efficient, and easy to optimize, by marrying the world of LLMs with the world of NMT. We apply LLMs to NMT encoding and leave the NMT decoder unchanged. We also develop methods for adapting LLMs to work better with the NMT decoder. Furthermore, we construct a new dataset involving multiple tasks to assess how well the machine translation system generalizes across various tasks. Evaluations on the WMT and our datasets show that results using our method match or surpass a range of baselines in terms of translation quality, but achieve $2.4 \sim 6.5 \times$ inference speedups and a $75\%$ reduction in the memory footprint of the KV cache. It also demonstrates strong generalization across a variety of translation-related tasks.
摘要：随着大型语言模型（LLM）的出现，神经机器翻译（NMT）的领域发生了变化。自然语言处理（NLP）最近的许多重点都用于建模机器翻译和许多其他使用单个预训练的变压器解码器的问题，而编码器decoder架构（在早期NMT模型中是标准的编码器架构）受到了相对较少的关注。在本文中，我们通过将LLM的世界与NMT世界结合在一起，探讨了通用，高效且易于优化的翻译模型。我们将LLMS应用于NMT编码，并使NMT解码器保持不变。我们还开发了适应LLM与NMT解码器更好地工作的方法。此外，我们构建了一个新的数据集，该数据集涉及多个任务，以评估机器翻译系统在各种任务中的推广程度。对WMT和我们的数据集的评估表明，使用我们的方法匹配的结果或超过了翻译质量的一系列基准，但获得了$ 2.4 \ sim 6.5 \ times $ times $推理速度和$ 75 \％$ $ $ $ $ $ $ $ $减少KV Cache的内存足迹。它还显示了各种与翻译相关的任务的强烈概括。

Title: Enhancing NLP Robustness and Generalization through LLM-Generated Contrast Sets: A Scalable Framework for Systematic Evaluation and Adversarial Training

Authors: Hender Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06648
Pdf URL: https://arxiv.org/pdf/2503.06648
Copy Paste: [[2503.06648]] Enhancing NLP Robustness and Generalization through LLM-Generated Contrast Sets: A Scalable Framework for Systematic Evaluation and Adversarial Training(https://arxiv.org/abs/2503.06648)
Keywords: language model, llm
Abstract: Standard NLP benchmarks often fail to capture vulnerabilities stemming from dataset artifacts and spurious correlations. Contrast sets address this gap by challenging models near decision boundaries but are traditionally labor-intensive to create and limited in diversity. This study leverages large language models to automate the generation of diverse contrast sets. Using the SNLI dataset, we created a 3,000-example contrast set to evaluate and improve model robustness. Fine-tuning on these contrast sets enhanced performance on systematically perturbed examples, maintained standard test accuracy, and modestly improved generalization to novel perturbations. This automated approach offers a scalable solution for evaluating and improving NLP models, addressing systematic generalization challenges, and advancing robustness in real-world applications.
摘要：标准的NLP基准通常无法捕获来自数据集文物和虚假相关性的漏洞。对比集通过挑战近乎决策界限的模型来解决这一差距，但传统上是劳动力密集的，可以创造和限制多样性。这项研究利用大型语言模型可以自动化各种对比度集的产生。使用SNLI数据集，我们创建了一个3,000个示例对比度集，以评估和提高模型鲁棒性。对这些对比集进行微调，可以在系统扰动的示例上提高性能，保持标准测试准确性，并适度改善对新型扰动的概括。这种自动化方法提供了一种可扩展的解决方案，用于评估和改进NLP模型，应对系统的概括挑战以及在现实世界应用中提高鲁棒性。

Title: InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models

Authors: Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06692
Pdf URL: https://arxiv.org/pdf/2503.06692
Copy Paste: [[2503.06692]] InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models(https://arxiv.org/abs/2503.06692)
Keywords: language model
Abstract: Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
摘要：大型语言模型中的先进推理在具有挑战性的任务上取得了出色的性能，但是普遍的长篇文本推理范式面临着临界限制：二次计算缩放，序列长度，序列长度，由最大上下文边界限制的推理以及超出预先训练上下文窗口以外的绩效降低。现有方法主要压缩推理链，而无需解决基本缩放问题。为了克服这些挑战，我们引入了Inftythink，这种范式将单片推理转化为中间摘要的迭代过程。通过将简短的推理片段与简洁的进度摘要交织在一起，我们的方法可以使无限的推理深度在保持有限的计算成本。与传统方法相比，这会产生一种特征性的锯齿记忆模式，可显着降低计算复杂性。此外，我们开发了一种将长篇文化推理数据集重建为迭代格式的方法，将OpenR1-Math转换为333K培训实例。跨多个模型架构的实验表明，我们的方法降低了计算成本，同时提高了性能，QWEN2.5-MATH-7B显示了Math500，AIME24和GPQA_DIAMOND基准的3-13％改善。我们的工作挑战了推理深度和计算效率之间假定的权衡，从而在没有建筑修改的情况下为复杂的推理提供了更可扩展的方法。
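
Note: the iterative process described in the abstract above (short reasoning segments interleaved with progress summaries) can be sketched as a simple control loop. This is an illustration rather than the authors' implementation: `generate` and `summarize` stand in for model calls, and the `FINAL:` marker and prompt layout are hypothetical conventions.

```python
def iterative_reasoning(question, generate, summarize, max_rounds=8, done_marker="FINAL:"):
    """Alternate bounded reasoning segments with running summaries.

    generate(context) -> str produces one reasoning segment.
    summarize(previous_summary, segment) -> str produces the new summary.
    Returns (answer or None, peak context length in characters).
    """
    summary = ""
    peak_context = 0
    for _ in range(max_rounds):
        # Each round sees only the question plus the latest summary,
        # never the full history of earlier segments.
        context = question if not summary else f"{question}\nProgress so far: {summary}"
        peak_context = max(peak_context, len(context))
        segment = generate(context)
        if done_marker in segment:
            return segment.split(done_marker, 1)[1].strip(), peak_context
        summary = summarize(summary, segment)
    return None, peak_context
```

Because the context is rebuilt from the summary each round, its size stays bounded as long as the summaries do, which is the source of the sawtooth pattern the abstract describes.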

Title: PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts

Authors: Ming Zhang, Yuhui Wang, Yujiong Shen, Tingyi Yang, Changhao Jiang, Yilong Wu, Shihan Dou, Qinhao Chen, Zhiheng Xi, Zhihao Zhang, Yi Dong, Zhen Wang, Zhihui Fei, Mingyang Wan, Tao Liang, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06706
Pdf URL: https://arxiv.org/pdf/2503.06706
Copy Paste: [[2503.06706]] PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts(https://arxiv.org/abs/2503.06706)
Keywords: language model, gpt, llm
Abstract: Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct the Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on the PlantUML specification, each UML flowchart is converted into atomic dialogue units, i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained with merely 800 samples and a 0.5B model trained on the full data can both surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o by up to 43.88%, with an average of 11.00%. We further evaluate models' performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released in this https URL.
摘要：在严格的预定过程约束下运行的过程驱动的对话系统对于客户服务和设备维护方案至关重要。尽管大型语言模型（LLM）在对话和推理方面表现出了显着的进步，但他们仍然很难解决这些严格的对话任务。为了应对这一挑战，我们构建过程流对话（PFDIAL）数据集，其中包含12,705个高质量的中国对话说明，这些说明来自440个流程图，其中包含5,055个流程节点。基于Plantuml规范，每个UML流程图都将转换为原子对话单元，即结构化的五个tuper。实验结果表明，一个只有800个样品训练的7B模型，而对总数据进行训练的0.5B模型都可以超过90％的精度。此外，8B模型可以超过高达43.88％的GPT-4O，平均为11.00％。我们进一步评估了模型在过程流中的挑战向后转变方面的性能，并对各种数据集格式进行深入分析，以揭示其对处理决策和顺序分支中模型性能的影响。数据在此HTTPS URL中发布。

Title: Alignment for Efficient Tool Calling of Large Language Models

Authors: Hongshen Xu, Zihan Wang, Zichen Zhu, Lei Pan, Xingyu Chen, Lu Chen, Kai Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06708
Pdf URL: https://arxiv.org/pdf/2503.06708
Copy Paste: [[2503.06708]] Alignment for Efficient Tool Calling of Large Language Models(https://arxiv.org/abs/2503.06708)
Keywords: language model, llm
Abstract: Recent advancements in tool learning have enabled large language models (LLMs) to integrate external tools, enhancing their task performance by expanding their knowledge boundaries. However, relying on tools often introduces tradeoffs between performance, speed, and cost, with LLMs sometimes exhibiting overreliance and overconfidence in tool usage. This paper addresses the challenge of aligning LLMs with their knowledge boundaries to make more intelligent decisions about tool invocation. We propose a multi-objective alignment framework that combines probabilistic knowledge boundary estimation with dynamic decision-making, allowing LLMs to better assess when to invoke tools based on their confidence. Our framework includes two methods for knowledge boundary estimation, consistency-based and absolute estimation, and two training strategies for integrating these estimates into the model's decision-making process. Experimental results on various tool invocation scenarios demonstrate the effectiveness of our framework, showing significant improvements in tool efficiency by reducing unnecessary tool usage.
摘要：工具学习的最新进展使大型语言模型（LLMS）能够整合外部工具，通过扩大知识界限来增强其任务性能。但是，依靠工具通常会引入性能，速度和成本之间的权衡，而LLM有时在工具使用中表现出过分依赖和过度自信。本文解决了将LLM与知识边界保持一致的挑战，以便对工具调用做出更聪明的决定。我们提出了一个多物镜对齐框架，将概率知识边界估计与动态决策结合在一起，从而使LLM可以更好地评估何时根据其信心调用工具。我们的框架包括两种用于知识边界估计，基于一致性和绝对估计的方法，以及将这些估计值整合到模型决策过程中的两种培训策略。各种工具调用方案的实验结果证明了我们框架的有效性，通过减少不必要的刀具使用来显示刀具效率的显着提高。
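
Note: one plausible reading of the consistency-based estimation named in the abstract above is to sample several answers and treat their agreement rate as a confidence score. The sketch below follows that reading only. The function names, the string normalization, and the 0.7 threshold are assumptions and are not taken from the paper.

```python
from collections import Counter

def consistency_confidence(samples):
    """Return (majority answer, fraction of samples that agree with it)."""
    counts = Counter(s.strip().lower() for s in samples)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(samples)

def should_call_tool(samples, threshold=0.7):
    """Invoke the tool only when self-agreement falls below the threshold."""
    _, conf = consistency_confidence(samples)
    return conf < threshold
```

A higher threshold trades extra tool calls for fewer confidently wrong answers, which is the performance, speed, and cost tradeoff the abstract refers to.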

Title: Delusions of Large Language Models

Authors: Hongshen Xu, Zixv yang, Zichen Zhu, Kunyao Lan, Zihan Wang, Mengyue Wu, Ziwei Ji, Lu Chen, Pascale Fung, Kai Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06709
Pdf URL: https://arxiv.org/pdf/2503.06709
Copy Paste: [[2503.06709]] Delusions of Large Language Models(https://arxiv.org/abs/2503.06709)
Keywords: language model, llm, hallucination, retrieval augmented generation, agent
Abstract: Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high-belief hallucinations: incorrect outputs with abnormally high confidence, which makes them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self-reflection. We link delusion formation with training dynamics and dataset noise and explore mitigation strategies such as retrieval-augmented generation and multi-agent debating. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.

Title: Gender Encoding Patterns in Pretrained Language Model Representations

Authors: Mahdi Zakizadeh, Mohammad Taher Pilehvar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06734
Pdf URL: https://arxiv.org/pdf/2503.06734
Copy Paste: [[2503.06734]] Gender Encoding Patterns in Pretrained Language Model Representations(https://arxiv.org/abs/2503.06734)
Keywords: language model
Abstract: Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there is a lack of comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining the impact of bias mitigation techniques and fine-tuning on the encoded biases and their effectiveness, and exploring how model design differences influence the encoding of biases. Through rigorous and systematic investigation, our findings reveal a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the encoded bias in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing its internal representations. This work provides valuable guidance for advancing bias mitigation strategies and fostering the development of more equitable language models.

Title: Effectiveness of Zero-shot-CoT in Japanese Prompts

Authors: Shusuke Takayama, Ian Frank
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06765
Pdf URL: https://arxiv.org/pdf/2503.06765
Copy Paste: [[2503.06765]] Effectiveness of Zero-shot-CoT in Japanese Prompts(https://arxiv.org/abs/2503.06765)
Keywords: gpt, llm, prompt, chat, chain-of-thought
Abstract: We compare the effectiveness of zero-shot Chain-of-Thought (CoT) prompting in Japanese and English using ChatGPT-3.5 and 4o-mini. The technique of zero-shot CoT, which involves appending a phrase such as "Let's think step by step" to a prompt to encourage reasoning before answering, has been shown to offer LLM performance improvements in mathematical and reasoning tasks, particularly in English. We investigate how these effects transfer to Japanese using the Japanese Multi-task Language Understanding Benchmark (JMMLU) and the Multi-task Language Understanding Benchmark (MMLU). Our results show that while zero-shot CoT prompting can lead to notable performance gains for some prompt categories in GPT-3.5, its impact in GPT-4o-mini is associated with significant performance declines. However, for Japanese prompts there remain certain categories, such as college mathematics and abstract algebra, that still exhibit improvements, despite the broader trend of diminishing effectiveness in more advanced models.

Title: Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators

Authors: Feng Gu, Zongxia Li, Carlos Rafael Colon, Benjamin Evans, Ishani Mondal, Jordan Lee Boyd-Graber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06778
Pdf URL: https://arxiv.org/pdf/2503.06778
Copy Paste: [[2503.06778]] Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators(https://arxiv.org/abs/2503.06778)
Keywords: language model, llm
Abstract: Event annotation is important for identifying market changes, monitoring breaking news, and understanding sociological trends. Although expert annotators set the gold standards, human coding is expensive and inefficient. Unlike information extraction experiments that focus on single contexts, we evaluate a holistic workflow that removes irrelevant documents, merges documents about the same event, and annotates the events. Although LLM-based automated annotations are better than traditional TF-IDF-based methods or Event Set Curation, they are still not reliable annotators compared to human experts. However, adding LLMs to assist experts for Event Set Curation can reduce the time and mental effort required for Variable Annotation. When LLMs are used to extract event variables to assist expert annotators, the experts agree more with the extracted variables than with fully automated LLM annotation.

Title: Dr Genre: Reinforcement Learning from Decoupled LLM Feedback for Generic Text Rewriting

Authors: Yufei Li, John Nham, Ganesh Jawahar, Lei Shu, David Uthus, Yun-Hsuan Sung, Chengrun Yang, Itai Rolnick, Yi Qiao, Cong Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06781
Pdf URL: https://arxiv.org/pdf/2503.06781
Copy Paste: [[2503.06781]] Dr Genre: Reinforcement Learning from Decoupled LLM Feedback for Generic Text Rewriting(https://arxiv.org/abs/2503.06781)
Keywords: language model, llm, chat
Abstract: Generic text rewriting is a prevalent large language model (LLM) application that covers diverse real-world tasks, such as style transfer, fact correction, and email editing. These tasks vary in rewriting objectives (e.g., factual consistency vs. semantic preservation), making it challenging to develop a unified model that excels across all dimensions. Existing methods often specialize in either a single task or a specific objective, limiting their generalizability. In this work, we introduce a generic model proficient in factuality, stylistic, and conversational rewriting tasks. To simulate real-world user rewrite requests, we construct a conversational rewrite dataset, ChatRewrite, that presents "natural"-sounding instructions, from raw emails using LLMs. Combined with other popular rewrite datasets, including LongFact for the factuality rewrite task and RewriteLM for the stylistic rewrite task, this forms a broad benchmark for training and evaluating generic rewrite models. To align with task-specific objectives, we propose Dr Genre, a Decoupled-reward learning framework for Generic rewriting, that utilizes objective-oriented reward models with a task-specific weighting. Evaluation shows that Dr Genre delivers higher-quality rewrites across all targeted tasks, improving objectives including instruction following (agreement), internal consistency (coherence), and minimal unnecessary edits (conciseness).

Title: On the Mutual Influence of Gender and Occupation in LLM Representations

Authors: Haozhe An, Connor Baumler, Abhilasha Sancheti, Rachel Rudinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06792
Pdf URL: https://arxiv.org/pdf/2503.06792
Copy Paste: [[2503.06792]] On the Mutual Influence of Gender and Occupation in LLM Representations(https://arxiv.org/abs/2503.06792)
Keywords: llm
Abstract: We examine LLM representations of gender for first names in various occupational contexts to study how occupations and the gender perception of first names in LLMs influence each other mutually. We find that LLMs' first-name gender representations correlate with real-world gender statistics associated with the name, and are influenced by the co-occurrence of stereotypically feminine or masculine occupations. Additionally, we study the influence of first-name gender representations on LLMs in a downstream occupation prediction task and their potential as an internal metric to identify extrinsic model biases. While feminine first-name embeddings often raise the probabilities for female-dominated jobs (and vice versa for male-dominated jobs), reliably using these internal gender representations for bias detection remains challenging.

Title: Enhanced Multi-Tuple Extraction for Alloys: Integrating Pointer Networks and Augmented Attention

Authors: Mengzhe Hei, Zhouran Zhang, Qingbao Liu, Yan Pan, Xiang Zhao, Yongqian Peng, Yicong Ye, Xin Zhang, Shuxin Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06861
Pdf URL: https://arxiv.org/pdf/2503.06861
Copy Paste: [[2503.06861]] Enhanced Multi-Tuple Extraction for Alloys: Integrating Pointer Networks and Augmented Attention(https://arxiv.org/abs/2503.06861)
Keywords: language model
Abstract: Extracting high-quality structured information from scientific literature is crucial for advancing material design through data-driven methods. Despite the considerable research in natural language processing for dataset extraction, effective approaches for multi-tuple extraction in scientific literature remain scarce due to the complex interrelations of tuples and contextual ambiguities. In this study, we illustrate the multi-tuple extraction of mechanical properties from multi-principal-element alloys and present a novel framework that combines an entity extraction model based on MatSciBERT with pointer networks and an allocation model utilizing inter- and intra-entity attention. Our rigorous experiments on tuple extraction demonstrate impressive F1 scores of 0.963, 0.947, 0.848, and 0.753 across datasets with 1, 2, 3, and 4 tuples, confirming the effectiveness of the model. Furthermore, an F1 score of 0.854 was achieved on a randomly curated dataset. These results highlight the model's capacity to deliver precise and structured information, offering a robust alternative to large language models and equipping researchers with essential data for fostering data-driven innovations.

Title: Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation

Authors: Junhao Zhang, Richong Zhang, Fanshuang Kong, Ziyang Miao, Yanhan Ye, Yaowei Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06868
Pdf URL: https://arxiv.org/pdf/2503.06868
Copy Paste: [[2503.06868]] Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation(https://arxiv.org/abs/2503.06868)
Keywords: prompt
Abstract: Existing long-text generation methods primarily concentrate on producing lengthy texts from short inputs, neglecting the long-input and long-output tasks. Such tasks have numerous practical applications while lacking available benchmarks. Moreover, as the input grows in length, existing methods inevitably encounter the "lost-in-the-middle" phenomenon. In this paper, we first introduce a Long Input and Output Benchmark (LongInOutBench), including a synthetic dataset and a comprehensive evaluation framework, addressing the challenge of the missing benchmark. We then develop the Retrieval-Augmented Long-Text Writer (RAL-Writer), which retrieves and restates important yet overlooked content, mitigating the "lost-in-the-middle" issue by constructing explicit prompts. We finally employ the proposed LongInOutBench to evaluate our RAL-Writer against comparable baselines, and the results demonstrate the effectiveness of our approach. Our code has been released at this https URL.

Title: KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus

Authors: Xiaoming Shi, Zeming Liu, Yiming Lei, Chenkai Zhang, Haitao Leng, Chuan Wang, Qingjie Liu, Wanxiang Che, Shaoguo Liu, Size Li, Yunhong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06899
Pdf URL: https://arxiv.org/pdf/2503.06899
Copy Paste: [[2503.06899]] KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus(https://arxiv.org/abs/2503.06899)
Keywords: gpt, llm, chat
Abstract: Video-based dialogue systems, such as education assistants, have compelling application value, thereby garnering growing interest. However, the current video-based dialogue systems are limited by their reliance on a single dialogue type, which hinders their versatility in practical applications across a range of scenarios, including question-answering, emotional dialog, etc. In this paper, we identify this challenge as how to generate video-driven multilingual mixed-type dialogues. To mitigate this challenge, we propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus, termed KwaiChat, containing a total of 93,209 videos and 246,080 dialogues, across 4 dialogue types, 30 domains, 4 languages, and 13 topics. Additionally, we establish baseline models on KwaiChat. An extensive analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still cannot perform well in this situation even with the help of in-context learning and fine-tuning, which indicates that the task is not trivial and needs further research.

Title: Effect of Selection Format on LLM Performance

Authors: Yuchen Han, Yucheng Wu, Jeffrey Willard
Subjects: cs.CL, cs.AI, cs.CE, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06926
Pdf URL: https://arxiv.org/pdf/2503.06926
Copy Paste: [[2503.06926]] Effect of Selection Format on LLM Performance(https://arxiv.org/abs/2503.06926)
Keywords: language model, llm, prompt
Abstract: This paper investigates a critical aspect of large language model (LLM) performance: the optimal formatting of classification task options in prompts. Through an extensive experimental study, we compared two selection formats -- bullet points and plain English -- to determine their impact on model performance. Our findings suggest that presenting options via bullet points generally yields better results, although there are some exceptions. Furthermore, our research highlights the need for continued exploration of option formatting to drive further improvements in model performance.

Title: Lshan-1.0 Technical Report

Authors: Haotian Chen, Yanyu Xu, Boyan Wang, Chaoyue Zhao, Xiaoyu Han, Fang Wang, Lizhen Cui, Yonghui Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06949
Pdf URL: https://arxiv.org/pdf/2503.06949
Copy Paste: [[2503.06949]] Lshan-1.0 Technical Report(https://arxiv.org/abs/2503.06949)
Keywords: language model, llm
Abstract: In this report, we introduce our first-generation reasoning model, Lshan-1.0, a large language model designed for the highly specialized Chinese legal domain, offering comprehensive capabilities to meet diverse realistic needs. Existing legal LLMs face two primary challenges. Firstly, their design and evaluation are predominantly driven by computer science perspectives, leading to insufficient incorporation of legal expertise and logic, which is crucial for high-precision legal applications, such as handling complex prosecutorial tasks. Secondly, these models often underperform due to a lack of comprehensive training data from the legal domain, limiting their ability to effectively address real-world legal scenarios. To address these challenges, we first compile millions of legal documents covering over 20 types of crimes from 31 provinces in China for model training. From the extensive dataset, we further select high-quality samples for supervised fine-tuning, ensuring enhanced relevance and precision. The model further undergoes large-scale reinforcement learning without additional supervision, emphasizing the enhancement of its reasoning capabilities and explainability. To validate its effectiveness in complex legal applications, we also conduct human evaluations with legal experts. We develop fine-tuned models based on DeepSeek-R1-Distilled versions, available in three dense configurations: 14B, 32B, and 70B.

Title: CtrlRAG: Black-box Adversarial Attacks Based on Masked Language Models in Retrieval-Augmented Language Generation

Authors: Runqi Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06950
Pdf URL: https://arxiv.org/pdf/2503.06950
Copy Paste: [[2503.06950]] CtrlRAG: Black-box Adversarial Attacks Based on Masked Language Models in Retrieval-Augmented Language Generation(https://arxiv.org/abs/2503.06950)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by integrating external knowledge bases. However, this integration introduces a new security threat: adversaries can exploit the retrieval mechanism to inject malicious content into the knowledge base, thereby influencing the generated responses. Based on this attack vector, we propose CtrlRAG, a novel attack method designed for RAG systems in the black-box setting, which aligns with real-world scenarios. Unlike existing attack methods, CtrlRAG introduces a perturbation mechanism using a Masked Language Model (MLM) to dynamically optimize malicious content in response to changes in the retrieved context. Experimental results demonstrate that CtrlRAG outperforms three baseline methods in both Emotional Manipulation and Hallucination Amplification objectives. Furthermore, we evaluate three existing defense mechanisms, revealing their limited effectiveness against CtrlRAG and underscoring the urgent need for more robust defenses.

Title: Exploring Multimodal Perception in Large Language Models Through Perceptual Strength Ratings

Authors: Jonghyun Lee, Dojun Park, Jiwoo Lee, Hoekeon Choi, Sung-Eun Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.06980
Pdf URL: https://arxiv.org/pdf/2503.06980
Copy Paste: [[2503.06980]] Exploring Multimodal Perception in Large Language Models Through Perceptual Strength Ratings(https://arxiv.org/abs/2503.06980)
Keywords: language model, gpt, llm
Abstract: This study investigated the multimodal perception of large language models (LLMs), focusing on their ability to capture human-like perceptual strength ratings across sensory modalities. Utilizing perceptual strength ratings as a benchmark, the research compared GPT-3.5, GPT-4, GPT-4o, and GPT-4o-mini, highlighting the influence of multimodal inputs on grounding and linguistic reasoning. While GPT-4 and GPT-4o demonstrated strong alignment with human evaluations and significant advancements over smaller models, qualitative analyses revealed distinct differences in processing patterns, such as multisensory overrating and reliance on loose semantic associations. Despite integrating multimodal capabilities, GPT-4o did not exhibit superior grounding compared to GPT-4, raising questions about their role in improving human-like grounding. These findings underscore how LLMs' reliance on linguistic patterns can both approximate and diverge from human embodied cognition, revealing limitations in replicating sensory experiences.

Title: Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations

Authors: Jiho Jin, Woosung Kang, Junho Myung, Alice Oh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06987
Pdf URL: https://arxiv.org/pdf/2503.06987
Copy Paste: [[2503.06987]] Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations(https://arxiv.org/abs/2503.06987)
Keywords: language model, llm, prompt
Abstract: Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.

Title: Large Language Models Often Say One Thing and Do Another

Authors: Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixiang Zhou, Le Sun, Yingfei Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07003
Pdf URL: https://arxiv.org/pdf/2503.07003
Copy Paste: [[2503.07003]] Large Language Models Often Say One Thing and Do Another(https://arxiv.org/abs/2503.07003)
Keywords: language model, llm
Abstract: As large language models (LLMs) increasingly become central to various applications and interact with diverse user populations, ensuring their reliable and consistent performance is becoming more important. This paper explores a critical issue in assessing the reliability of LLMs: the consistency between their words and deeds. To quantitatively explore this consistency, we developed a novel evaluation benchmark called the Words and Deeds Consistency Test (WDCT). The benchmark establishes a strict correspondence between word-based and deed-based questions across different domains, including opinion vs. action, non-ethical value vs. action, ethical value vs. action, and theory vs. application. The evaluation results reveal a widespread inconsistency between words and deeds across different LLMs and domains. Subsequently, we conducted experiments with either word alignment or deed alignment to observe their impact on the other aspect. The experimental results indicate that alignment only on words or deeds poorly and unpredictably influences the other aspect. This supports our hypothesis that the underlying knowledge guiding LLMs' word or deed choices is not contained within a unified space.

Title: Toward Multi-Session Personalized Conversation: A Large-Scale Dataset and Hierarchical Tree Framework for Implicit Reasoning

Authors: Xintong Li, Jalend Bantupalli, Ria Dharmani, Yuwei Zhang, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07018
Pdf URL: https://arxiv.org/pdf/2503.07018
Copy Paste: [[2503.07018]] Toward Multi-Session Personalized Conversation: A Large-Scale Dataset and Hierarchical Tree Framework for Implicit Reasoning(https://arxiv.org/abs/2503.07018)
Keywords: language model, llm, agent
Abstract: There has been a surge in the use of large language model (LLM) conversational agents to generate responses based on long-term history from multiple sessions. However, existing long-term open-domain dialogue datasets lack complex, real-world personalization and fail to capture implicit reasoning, where relevant information is embedded in subtle, syntactic, or semantically distant connections rather than explicit statements. In such cases, traditional retrieval methods fail to capture relevant context, and long-context modeling also becomes inefficient due to numerous complicated persona-related details. To address this gap, we introduce ImplexConv, a large-scale long-term dataset with 2,500 examples, each containing approximately 100 conversation sessions, designed to study implicit reasoning in personalized dialogues. Additionally, we propose TaciTree, a novel hierarchical tree framework that structures conversation history into multiple levels of summarization. Instead of brute-force searching all data, TaciTree enables an efficient, level-based retrieval process where models refine their search by progressively selecting relevant details. Our experiments demonstrate that TaciTree significantly improves the ability of LLMs to reason over long-term conversations with implicit contextual dependencies.

Title: Multimodal Human-AI Synergy for Medical Imaging Quality Control: A Hybrid Intelligence Framework with Adaptive Dataset Curation and Closed-Loop Evaluation

Authors: Zhi Qin, Qianhui Gui, Mouxiao Bian, Rui Wang, Hong Ge, Dandan Yao, Ziying Sun, Yuan Zhao, Yu Zhang, Hui Shi, Dongdong Wang, Chenxin Song, Shenghong Ju, Lihao Liu, Junjun He, Jie Xu, Yuan-Cheng Wang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07032
Pdf URL: https://arxiv.org/pdf/2503.07032
Copy Paste: [[2503.07032]] Multimodal Human-AI Synergy for Medical Imaging Quality Control: A Hybrid Intelligence Framework with Adaptive Dataset Curation and Closed-Loop Evaluation(https://arxiv.org/abs/2503.07032)
Keywords: language model, gpt, llm, chat
Abstract: Medical imaging quality control (QC) is essential for accurate diagnosis, yet traditional QC methods remain labor-intensive and subjective. To address this challenge, in this study, we establish a standardized dataset and evaluation framework for medical imaging QC, systematically assessing large language models (LLMs) in image quality assessment and report standardization. Specifically, we first constructed and anonymized a dataset of 161 chest X-ray (CXR) radiographs and 219 CT reports for evaluation. Then, multiple LLMs, including Gemini 2.0-Flash, GPT-4o, and DeepSeek-R1, were evaluated based on recall, precision, and F1 score to detect technical errors and inconsistencies. Experimental results show that Gemini 2.0-Flash achieved a Macro F1 score of 90 in CXR tasks, demonstrating strong generalization but limited fine-grained performance. DeepSeek-R1 excelled in CT report auditing with a 62.23% recall rate, outperforming other models. However, its distilled variants performed poorly, while InternLM2.5-7B-chat exhibited the highest additional discovery rate, indicating broader but less precise error detection. These findings highlight the potential of LLMs in medical imaging QC, with DeepSeek-R1 and Gemini 2.0-Flash demonstrating superior performance.

Title: Bot Wars Evolved: Orchestrating Competing LLMs in a Counterstrike Against Phone Scams

Authors: Nardine Basta, Conor Atkins, Dali Kaafar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07036
Pdf URL: https://arxiv.org/pdf/2503.07036
Copy Paste: [[2503.07036]] Bot Wars Evolved: Orchestrating Competing LLMs in a Counterstrike Against Phone Scams(https://arxiv.org/abs/2503.07036)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: We present "Bot Wars," a framework using Large Language Model (LLM) scam-baiters to counter phone scams through simulated adversarial dialogues. Our key contribution is a formal foundation for strategy emergence through chain-of-thought reasoning without explicit optimization. Through a novel two-layer prompt architecture, our framework enables LLMs to craft demographically authentic victim personas while maintaining strategic coherence. We evaluate our approach using a dataset of 3,200 scam dialogues validated against 179 hours of human scam-baiting interactions, demonstrating its effectiveness in capturing complex adversarial dynamics. Our systematic evaluation through cognitive, quantitative, and content-specific metrics shows that GPT-4 excels in dialogue naturalness and persona authenticity, while Deepseek demonstrates superior engagement sustainability.
摘要：我们提出了使用大型语言模型（LLMS）骗子的框架“ Bot Wars”，通过模拟的对抗对话来反击电话骗局。我们的关键贡献是通过无明确优化的思想链推理来实现战略出现的正式基础。通过一种新颖的两层及时建筑，我们的框架使LLM可以在保持战略连贯性的同时制定人口统计的受害者角色。我们使用针对179小时的人类骗局诱饵相互作用的3200个骗局对话的数据集评估了我们的方法，这证明了其在捕获复杂的对抗动力学方面的有效性。我们通过认知，定量和特定于内容的指标进行系统的评估表明，GPT-4在对话自然性和性格真实性方面表现出色，而DeepSeek则表现出卓越的参与性可持续性。

Title: TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine

Authors: Tianai Huang, Lu Lu, Jiayuan Chen, Lihao Liu, Junjun He, Yuping Zhao, Wenchao Tang, Jie Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07041
Pdf URL: https://arxiv.org/pdf/2503.07041
Copy Paste: [[2503.07041]] TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine(https://arxiv.org/abs/2503.07041)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM-3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. We evaluate diverse models, including international (e.g., GPT-4o), Chinese (e.g., InternLM), and medical-specific (e.g., PLUSE). Results show a performance hierarchy: all models have limitations in specialized subdomains like Meridian & Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs. Models with Chinese linguistic and cultural priors perform better in classical text interpretation and clinical reasoning. TCM-3CEval sets a standard for AI evaluation in TCM, offering insights for optimizing LLMs in culturally grounded medical domains. The benchmark is available on Medbench's TCM track, aiming to assess LLMs' TCM capabilities in basic knowledge, classic texts, and clinical decision-making through multidimensional questions and real cases.
摘要：大型语言模型（LLM）在各种NLP任务和现代医学中都表现出色，但是他们在中医中的评估（TCM）却没有得到充实。为了解决这个问题，我们介绍了TCM3Ceval，这是一种评估TCM中LLM的基准，该基准在三个方面：核心知识掌握，经典文本理解和临床决策。我们评估包括国际（例如GPT-4O），中文（例如InternLM）和医学特定（例如Pluse）在内的各种模型。结果表明了性能层次结构：所有模型在子午线和兆头理论等专业亚地区都有局限性，揭示了当前能力和临床需求之间的差距。具有中国语言和文化先验的模型在古典文本解释和临床推理中表现更好。 TCM-3CEVAL设定了TCM中AI评估的标准，提供了优化文化基础医疗领域LLM的见解。基准标准可在Medbench的TCM轨道上获得，旨在通过多维问题和实际情况来评估LLMS在基本知识，经典文本和临床决策方面的TCM功能。

Title: DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science

Authors: Ziming You, Yumiao Zhang, Dexuan Xu, Yiwei Lou, Yandong Yan, Wei Wang, Huaming Zhang, Yu Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07044
Pdf URL: https://arxiv.org/pdf/2503.07044
Copy Paste: [[2503.07044]] DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science(https://arxiv.org/abs/2503.07044)
Keywords: llm, agent
Abstract: Data Science tasks are multifaceted, dynamic, and often domain-specific. Existing LLM-based approaches largely concentrate on isolated phases, neglecting the interdependent nature of many data science tasks and limiting their capacity for comprehensive end-to-end support. We propose DatawiseAgent, a notebook-centric LLM agent framework that unifies interactions among user, agent and the computational environment through markdown and executable code cells, supporting flexible and adaptive automated data science. Built on a Finite State Transducer (FST), DatawiseAgent orchestrates four stages, including DFS-like planning, incremental execution, self-debugging, and post-filtering. Specifically, the DFS-like planning stage systematically explores the solution space, while incremental execution harnesses real-time feedback and accommodates LLMs' limited capabilities to progressively complete tasks. The self-debugging and post-filtering modules further enhance reliability by diagnosing and correcting errors and pruning extraneous information. Extensive experiments on diverse tasks, including data analysis, visualization, and data modeling, show that DatawiseAgent consistently outperforms or matches state-of-the-art methods across multiple model settings. These results highlight its potential to generalize across data science scenarios and lay the groundwork for more efficient, fully automated workflows.
摘要：数据科学任务是多方面的，动态的，通常是特定于域的。现有的基于LLM的方法主要集中在孤立的阶段上，忽略了许多数据科学任务的相互依存性，并限制了其全面端到端支持的能力。我们提出了DataWiseagent，这是一种以笔记本为中心的LLM代理框架，通过降价和可执行的代码单元格在用户，代理和计算环境之间统一交互，从而支持灵活和自适应的自动化数据科学。 Datawiseagent建立在有限状态传感器（FST）的基础上，将精心编排四个阶段，包括类似DSF的计划，增量执行，自我抑制和过滤后。具体而言，类似DFS的计划阶段系统地探索了解决方案空间，而增量执行则可以利用实时反馈，并容纳LLM有限的功能来逐步完成任务。自我淘汰和过滤后模块通过诊断和纠正错误并修剪无关信息，从而进一步提高了可靠性。关于各种任务（包括数据分析，可视化和数据建模）的广泛实验表明，datawiseagent始终优于多个模型设置的最先进方法。这些结果强调了其在数据科学方案中概括的潜力，并为更高效，完全自动化的工作流奠定了基础。

Title: DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs

Authors: Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, Se-Young Yun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07067
Pdf URL: https://arxiv.org/pdf/2503.07067
Copy Paste: [[2503.07067]] DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs(https://arxiv.org/abs/2503.07067)
Keywords: language model, llm
Abstract: Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
摘要：尽管大语模型（LLMS）蒸馏成功，但大多数先前的工作都将相同的损失功能应用于教师和学生生成的数据。这些策略忽略了损失配方和数据类型之间的协同作用，从而导致学生模型的次优绩效提高。为了解决这个问题，我们提出了Distillm-2，这是一种对比方法，同时增加了教师反应的可能性，并通过利用这种协同作用来减少学生的反应。我们的广泛实验表明，Distillm-2不仅在各种任务中建立了高性能的学生模型，包括跟踪指导和代码的生成，还支持各种应用程序，例如偏好对准和视觉扩展。这些发现突出了一种对比方法的潜力，可以通过有效地对准不同数据类型的教师和学生模型来增强LLM蒸馏的功效。

Title: Linguistic Knowledge Transfer Learning for Speech Enhancement

Authors: Kuo-Hsuan Hung, Xugang Lu, Szu-Wei Fu, Huan-Hsin Tseng, Hsin-Yi Lin, Chii-Wann Lin, Yu Tsao
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2503.07078
Pdf URL: https://arxiv.org/pdf/2503.07078
Copy Paste: [[2503.07078]] Linguistic Knowledge Transfer Learning for Speech Enhancement(https://arxiv.org/abs/2503.07078)
Keywords: language model, llm
Abstract: Linguistic knowledge plays a crucial role in spoken language comprehension. It provides essential semantic and syntactic context for speech perception in noisy environments. However, most speech enhancement (SE) methods predominantly rely on acoustic features to learn the mapping relationship between noisy and clean speech, with limited exploration of linguistic integration. While text-informed SE approaches have been investigated, they often require explicit speech-text alignment or externally provided textual data, constraining their practicality in real-world scenarios. Additionally, using text as input poses challenges in aligning linguistic and acoustic representations due to their inherent differences. In this study, we propose the Cross-Modality Knowledge Transfer (CMKT) learning framework, which leverages pre-trained large language models (LLMs) to infuse linguistic knowledge into SE models without requiring text input or LLMs during inference. Furthermore, we introduce a misalignment strategy to improve knowledge transfer. This strategy applies controlled temporal shifts, encouraging the model to learn more robust representations. Experimental evaluations demonstrate that CMKT consistently outperforms baseline models across various SE architectures and LLM embeddings, highlighting its adaptability to different configurations. Additionally, results on Mandarin and English datasets confirm its effectiveness across diverse linguistic conditions, further validating its robustness. Moreover, CMKT remains effective even in scenarios without textual data, underscoring its practicality for real-world applications. By bridging the gap between linguistic and acoustic modalities, CMKT offers a scalable and innovative solution for integrating linguistic knowledge into SE models, leading to substantial improvements in both intelligibility and enhancement performance.
摘要：语言知识在口语理解中起着至关重要的作用。它为嘈杂环境中的语音感知提供了基本的语义和句法环境。但是，大多数语音增强（SE）方法主要依赖于声学特征来学习嘈杂和简洁言语之间的映射关系，并且对语言整合的探索有限。尽管已经研究了文本信息的SE方法，但通常需要明确的语音文本对齐或外部提供文本数据，从而在现实世界中限制了它们的实用性。此外，由于其固有的差异，使用文本作为输入会带来挑战语言和声学表示。在这项研究中，我们提出了跨模式知识转移（CMKT）学习框架，该框架利用预先训练的大型语言模型（LLMS）将语言知识注入SE模型，而无需推断文本输入或LLMS。此外，我们引入了一种未对准的策略来改善知识转移。该策略应用了受控的时间变化，鼓励模型学习更多强大的表示。实验评估表明，CMKT始终超过各种SE体系结构和LLM嵌入的基线模型，从而突出了其对不同配置的适应性。此外，关于普通话和英语数据集的结果证实了其在各种语言条件下的有效性，从而进一步验证了其稳健性。此外，即使在没有文本数据的情况下，CMKT仍然有效，强调了其对现实应用程序的实用性。通过弥合语言和声学方式之间的差距，CMKT提供了可扩展和创新的解决方案，用于将语言知识整合到SE模型中，从而实现了可理解性和增强性能的实质性提高。

Title: A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images

Authors: Xiaoyi Liang, Mouxiao Bian, Moxin Chen, Lihao Liu, Junjun He, Jie Xu, Lin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07094
Pdf URL: https://arxiv.org/pdf/2503.07094
Copy Paste: [[2503.07094]] A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images(https://arxiv.org/abs/2503.07094)
Keywords: language model, llm
Abstract: In recent years, large language models (LLMs) have demonstrated remarkable potential across various medical applications. Building on this foundation, multimodal large language models (MLLMs) integrate LLMs with visual models to process diverse inputs, including clinical data and medical images. In ophthalmology, LLMs have been explored for analyzing optical coherence tomography (OCT) reports, assisting in disease classification, and even predicting treatment outcomes. However, existing MLLM benchmarks often fail to capture the complexities of real-world clinical practice, particularly in the analysis of OCT images. Many suffer from limitations such as small sample sizes, a lack of diverse OCT datasets, and insufficient expert validation. These shortcomings hinder the accurate assessment of MLLMs' ability to interpret OCT scans and their broader applicability in ophthalmology. Our dataset, curated through rigorous quality control and expert annotation, consists of 439 fundus images and 75 OCT images. Using a standardized API-based framework, we assessed seven mainstream MLLMs and observed significant variability in diagnostic accuracy across different diseases. While some models performed well in diagnosing conditions such as diabetic retinopathy and age-related macular degeneration, they struggled with others, including choroidal neovascularization and myopia, highlighting inconsistencies in performance and the need for further refinement. Our findings emphasize the importance of developing clinically relevant benchmarks to provide a more accurate assessment of MLLMs' capabilities. By refining these models and expanding their scope, we can enhance their potential to transform ophthalmic diagnosis and treatment.
摘要：近年来，大型语言模型（LLMS）在各种医学应用中都表现出了巨大的潜力。在此基础的基础上，多模式大语模型（MLLM）将LLM与视觉模型集成在一起，以处理各种输入，包括临床数据和医学图像。在眼科中，已经探索了LLMS用于分析光学相干断层扫描（OCT）报告，有助于疾病分类，甚至预测治疗结果。但是，现有的MLLM基准通常无法捕获现实世界临床实践的复杂性，尤其是在OCT图像的分析中。许多人遭受诸如小样本量，缺乏多样化的OCT数据集以及专家验证不足的局限性。这些缺点阻碍了MLLM解释OCT扫描及其在眼科中更广泛适用性的能力的准确评估。通过严格的质量控制和专家注释策划的我们的数据集由439张眼底图像和75个OCT图像组成。使用标准化API的框架，我们评估了七个主流MLLM，并观察到不同疾病的诊断准确性的显着差异。尽管某些模型在诊断糖尿病性视网膜病和与年龄相关的黄斑变性等诊断状况方面表现良好，但他们与其他模型（包括脉络膜新生血管形成和近视）斗争，突出了性能的不一致以及需要进一步完善。我们的发现强调了开发临床相关基准测试以提供更准确评估MLLMS功能的重要性。通过完善这些模型并扩大其范围，我们可以增强其改变眼科诊断和治疗的潜力。

Title: ASTRA: A Negotiation Agent with Adaptive and Strategic Reasoning through Action in Dynamic Offer Optimization

Authors: Deuksin Kwon, Jiwon Hae, Emma Clift, Daniel Shamsoddini, Jonathan Gratch, Gale M. Lucas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07129
Pdf URL: https://arxiv.org/pdf/2503.07129
Copy Paste: [[2503.07129]] ASTRA: A Negotiation Agent with Adaptive and Strategic Reasoning through Action in Dynamic Offer Optimization(https://arxiv.org/abs/2503.07129)
Keywords: agent
Abstract: Negotiation requires dynamically balancing self-interest and cooperation to maximize one's own utility. Yet, existing agents struggle due to bounded rationality in human data, low adaptability to counterpart behavior, and limited strategic reasoning. To address this, we introduce principle-driven negotiation agents, powered by ASTRA, a novel framework for turn-level offer optimization grounded in two core principles: opponent modeling and Tit-for-Tat reciprocity. ASTRA operates in three stages: (1) interpreting counterpart behavior, (2) optimizing counteroffers via a linear programming (LP) solver, and (3) selecting offers based on negotiation tactics and the partner's acceptance probability. Through simulations and human evaluations, our agent effectively adapts to an opponent's shifting stance and achieves favorable outcomes through enhanced adaptability and strategic reasoning. Beyond improving negotiation performance, it also serves as a powerful coaching tool, offering interpretable strategic feedback and optimal offer recommendations.
摘要：谈判需要动态平衡自我利益和合作，以最大程度地提高自己的效用。然而，由于人类数据中有限的合理性，对对应行为的适应性低以及战略推理的有限理性，现有的代理人挣扎。为了解决这个问题，我们介绍了由Astra提供支持的原理驱动的谈判代理，这是一个以两种核心原则为基础的转向级别优化的新型框架：对手建模和TIT-FOR-TAT互惠。 Astra分为三个阶段：（1）解释对应行为，（2）通过线性编程（LP）求解器优化反击，以及（3）基于谈判策略和合作伙伴的接受概率选择要约。通过模拟和人类评估，我们的代理人有效地适应了对手的转移姿态，并通过增强的适应性和战略推理实现了有利的结果。除了提高谈判绩效外，它还可以作为强大的教练工具，提供可解释的战略反馈和最佳优惠建议。

Title: Application of Multiple Chain-of-Thought in Contrastive Reasoning for Implicit Sentiment Analysis

Authors: Liwei Yang, Xinying Wang, Xiaotang Zhou, Zhengchao Wu, Ningning Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07140
Pdf URL: https://arxiv.org/pdf/2503.07140
Copy Paste: [[2503.07140]] Application of Multiple Chain-of-Thought in Contrastive Reasoning for Implicit Sentiment Analysis(https://arxiv.org/abs/2503.07140)
Keywords: language model, chain-of-thought
Abstract: Implicit sentiment analysis aims to uncover emotions that are subtly expressed, often obscured by ambiguity and figurative language. To accomplish this task, large language models and multi-step reasoning are needed to identify those sentiments that are not explicitly stated. In this study, we propose a novel Dual Reverse Chain Reasoning (DRCR) framework to enhance the performance of implicit sentiment analysis. Inspired by deductive reasoning, the framework consists of three key steps: 1) hypothesize an emotional polarity and derive a reasoning process, 2) negate the initial hypothesis and derive a new reasoning process, and 3) contrast the two reasoning paths to deduce the final sentiment polarity. Building on this, we also introduce a Triple Reverse Chain Reasoning (TRCR) framework to address the limitations of random hypotheses. Both methods combine contrastive mechanisms and multi-step reasoning, significantly improving the accuracy of implicit sentiment classification. Experimental results demonstrate that both approaches outperform existing methods across various model scales, achieving state-of-the-art performance. This validates the effectiveness of combining contrastive reasoning and multi-step reasoning for implicit sentiment analysis.
摘要：隐式情感分析旨在发现巧妙地表达的情绪，通常被歧义性和比喻性语言所掩盖。为了完成这项任务，需要大型语言模型和多步推理来识别未明确说明的情感。在这项研究中，我们提出了一个新型的双重反向链推理（DRCR）框架，以增强隐式情感分析的性能。受推力推理的启发，该框架由三个关键步骤组成：1）假设情绪极性并得出推理过程，2）否定初始假设并得出新的推理过程，以及3）对比两种推理路径来推断最终的情感极性。在此基础上，我们还引入了三重反向链推理（TRCR）框架，以解决随机假设的局限性。两种方法都结合了对比机制和多步推理，显着提高了隐式情感分类的准确性。实验结果表明，两种方法都超过各种模型量表的现有方法，从而实现了最新的性能。这验证了与隐式情感分析结合对比度推理和多步推理的有效性。

Title: MRCEval: A Comprehensive, Challenging and Accessible Machine Reading Comprehension Benchmark

Authors: Shengkun Ma, Hao Peng, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07144
Pdf URL: https://arxiv.org/pdf/2503.07144
Copy Paste: [[2503.07144]] MRCEval: A Comprehensive, Challenging and Accessible Machine Reading Comprehension Benchmark(https://arxiv.org/abs/2503.07144)
Keywords: language model, llm
Abstract: Machine Reading Comprehension (MRC) is an essential task in evaluating natural language understanding. Existing MRC datasets primarily assess specific aspects of reading comprehension (RC), lacking a comprehensive MRC benchmark. To fill this gap, we first introduce a novel taxonomy that categorizes the key capabilities required for RC. Based on this taxonomy, we construct MRCEval, an MRC benchmark that leverages advanced Large Language Models (LLMs) as both sample generators and selection judges. MRCEval is a comprehensive, challenging and accessible benchmark designed to assess the RC capabilities of LLMs thoroughly, covering 13 distinct RC skills with a total of 2.1K high-quality multi-choice questions. We perform an extensive evaluation of 28 widely used open-source and proprietary models, highlighting that MRC continues to present significant challenges even in the era of LLMs.
摘要：机器阅读理解（MRC）是评估自然语言理解的重要任务。现有的MRC数据集主要评估阅读理解的特定方面（RC），缺乏全面的MRC基准。为了填补这一空白，我们首先引入了一种新颖的分类法，该分类法将RC所需的关键功能分类。基于这种分类法，我们构建了MRCEVAL，这是一种MRC基准，它利用先进的大语言模型（LLMS）作为样本生成器和选择法官。 MRCEVAL是一种全面，具有挑战性且易于访问的基准测试，旨在彻底评估LLM的RC功能，涵盖了13个不同的RC技能，总共有2.1k高质量的多项选择问题。我们对28种广泛使用的开源和专有模型进行了广泛的评估，这强调了MRC即使在LLMS时代也仍然存在重大挑战。

Title: DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation

Authors: Ming Wang, Fang Wang, Minghao Hu, Li He, Haiyang Wang, Jun Zhang, Tianwei Yan, Li Li, Zhunchen Luo, Wei Luo, Xiaoying Bai, Guotong Geng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07170
Pdf URL: https://arxiv.org/pdf/2503.07170
Copy Paste: [[2503.07170]] DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation(https://arxiv.org/abs/2503.07170)
Keywords: agent
Abstract: Long-form article generation (LFAG) presents challenges such as maintaining logical consistency, comprehensive topic coverage, and narrative coherence across extended articles. Existing datasets often lack both the hierarchical structure and fine-grained annotation needed to effectively decompose tasks, resulting in shallow, disorganized article generation. To address these limitations, we introduce DeFine, a Decomposed and Fine-grained annotated dataset for long-form article generation. DeFine is characterized by its hierarchical decomposition strategy and the integration of domain-specific knowledge with multi-level annotations, ensuring granular control and enhanced depth in article generation. To construct the dataset, a multi-agent collaborative pipeline is proposed, which systematically segments the generation process into four parts: Data Miner, Cite Retriever, Q&A Annotator and Data Cleaner. To validate the effectiveness of DeFine, we designed and tested three LFAG baselines: the web retrieval, the local retrieval, and the grounded reference. We fine-tuned the Qwen2-7b-Instruct model using the DeFine training dataset. The experimental results showed significant improvements in text quality, specifically in topic coverage, depth of information, and content fidelity. Our dataset is publicly available to facilitate future research.
摘要：长篇文章生成（LFAG）提出了挑战，例如保持逻辑一致性，全面的主题覆盖范围和跨扩展文章的叙事连贯性。现有的数据集通常缺乏有效分解任务所需的层次结构和细粒注释，从而导致浅层，混乱的文章生成。为了解决这些限制，我们介绍了Define，这是一个分解且细粒度的注释数据集，用于长期形式的文章生成。定义的特征在于其层次分解策略以及特定于域知识与多级注释的整合，从而确保文章生成中的颗粒状控制和增强深度。为了构建数据集，提出了一条多代理协作管道，该管道将生成过程系统地将生成过程分为四个部分：数据矿工，引用retreiver，Q＆A注释器和数据清洁器。为了验证定义的有效性，我们设计并测试了三个LFAG基准：Web检索，局部检索和接地参考。我们使用定义培训数据集微调了QWEN2-7B教学模型。实验结果表明，文本质量的显着改善，特别是主题覆盖，信息深度和内容保真度。我们的数据集公开可用，以促进未来的研究。

Title: Contextual Cues in Machine Translation: Investigating the Potential of Multi-Source Input Strategies in LLMs and NMT Systems

Authors: Lia Shahnazaryan, Patrick Simianer, Joern Wuebker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07195
Pdf URL: https://arxiv.org/pdf/2503.07195
Copy Paste: [[2503.07195]] Contextual Cues in Machine Translation: Investigating the Potential of Multi-Source Input Strategies in LLMs and NMT Systems(https://arxiv.org/abs/2503.07195)
Keywords: language model, gpt, llm
Abstract: We explore the impact of multi-source input strategies on machine translation (MT) quality, comparing GPT-4o, a large language model (LLM), with a traditional multilingual neural machine translation (NMT) system. Using intermediate language translations as contextual cues, we evaluate their effectiveness in enhancing English and Chinese translations into Portuguese. Results suggest that contextual information significantly improves translation quality for domain-specific datasets and potentially for linguistically distant language pairs, with diminishing returns observed in benchmarks with high linguistic variability. Additionally, we demonstrate that shallow fusion, a multi-source approach we apply within the NMT system, shows improved results when using high-resource languages as context for other translation pairs, highlighting the importance of strategic context language selection.
摘要：我们探讨了多源输入策略对机器翻译（MT）质量的影响，并将大型语言模型（LLM）的GPT-4O与传统的多语言神经机器翻译（NMT）系统进行了比较。使用中级语言翻译作为上下文提示，我们评估了它们在增强英语和中文翻译为葡萄牙语方面的有效性。结果表明，上下文信息可显着提高域特异性数据集的翻译质量，并可能用于语言遥远的语言对，并且在具有较高语言可变性的基准中观察到的回报减少。此外，我们证明了我们在NMT系统中采用的一种多源方法浅融合，在使用高资源语言作为其他翻译对的上下文时，显示出改进的结果，突出了战略上下文语言选择的重要性。

Title: LLM-C3MOD: A Human-LLM Collaborative System for Cross-Cultural Hate Speech Moderation

Authors: Junyeong Park, Seogyeong Jeong, Seyoung Song, Yohan Lee, Alice Oh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07237
Pdf URL: https://arxiv.org/pdf/2503.07237
Copy Paste: [[2503.07237]] LLM-C3MOD: A Human-LLM Collaborative System for Cross-Cultural Hate Speech Moderation(https://arxiv.org/abs/2503.07237)
Keywords: gpt, llm
Abstract: Content moderation is a global challenge, yet major tech platforms prioritize high-resource languages, leaving low-resource languages with scarce native moderators. Since effective moderation depends on understanding contextual cues, this imbalance increases the risk of improper moderation due to non-native moderators' limited cultural understanding. Through a user study, we identify that non-native moderators struggle with interpreting culturally-specific knowledge, sentiment, and internet culture in hate speech moderation. To assist them, we present LLM-C3MOD, a human-LLM collaborative pipeline with three steps: (1) RAG-enhanced cultural context annotations; (2) initial LLM-based moderation; and (3) targeted human moderation for cases lacking LLM consensus. Evaluated on a Korean hate speech dataset with Indonesian and German participants, our system achieves 78% accuracy (surpassing GPT-4o's 71% baseline), while reducing human workload by 83.6%. Notably, human moderators excel at nuanced content where LLMs struggle. Our findings suggest that non-native moderators, when properly supported by LLMs, can effectively contribute to cross-cultural hate speech moderation.
摘要：内容审核是一个全球挑战，但主要的技术平台优先考虑高资源语言，而本地主持人则将低资源的语言留下。由于有效的节制取决于理解上下文提示，因此由于非本地主持人的文化理解有限，这种失衡会增加适度不当的风险。通过用户研究，我们确定非本地主持人在仇恨言论节奏中努力解释特定于文化的知识，情感和互联网文化。为了协助他们，我们提出了LLM-C3Mod，这是一个人类合作管道，采用三个步骤：（1）抹布增强的文化背景注释；（2）初始基于LLM的适量；（3）针对缺乏LLM共识的案件的人类适度。我们的系统在与印尼和德国参与者的韩国仇恨言论数据集上进行了评估，可实现78％的准确性（超过GPT-4O的基线71％），同时将人类工作量减少了83.6％。值得注意的是，人类的主持人在LLM挣扎的细微差异中表现出色。我们的发现表明，如果LLM适当地支持非本地主持人，则可以有效地有助于跨文化仇恨言论。

Title: A Graph-based Verification Framework for Fact-Checking

Authors: Yani Huang, Richong Zhang, Zhijie Nie, Junfan Chen, Xuefeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07282
Pdf URL: https://arxiv.org/pdf/2503.07282
Copy Paste: [[2503.07282]] A Graph-based Verification Framework for Fact-Checking(https://arxiv.org/abs/2503.07282)
Keywords: language model, llm
Abstract: Fact-checking plays a crucial role in combating misinformation. Existing methods using large language models (LLMs) for claim decomposition face two key limitations: (1) insufficient decomposition, introducing unnecessary complexity to the verification process, and (2) ambiguity of mentions, leading to incorrect verification results. To address these challenges, we suggest introducing a claim graph consisting of triplets to address the insufficient decomposition problem and reduce mention ambiguity through graph structure. Based on this core idea, we propose a graph-based framework, GraphFC, for fact-checking. The framework features three key components: graph construction, which builds both claim and evidence graphs; graph-guided planning, which prioritizes the triplet verification order; and graph-guided checking, which verifies the triplets one by one between claim and evidence graphs. Extensive experiments show that GraphFC enables fine-grained decomposition while resolving referential ambiguities through relational constraints, achieving state-of-the-art performance across three datasets.
摘要：事实核对在对抗错误信息中起着至关重要的作用。现有的使用大语言模型（LLMS）进行索赔分解的方法面临两个关键局限性：（1）分解不足，对验证过程引入不必要的复杂性，以及（2）提及的模棱两可，导致不正确的验证结果。为了应对这些挑战，我们建议介绍一个由三胞胎组成的索赔图，以解决分解问题不足并通过图形结构减少提及的歧义。基于这个核心想法，我们提出了一个基于图形的框架GraphFC，以进行事实检查。该框架具有三个关键组成部分：图形结构，它们既可以构建索赔和证据图；图形指导计划，将三重态验证顺序确定优先；和图引导的检查，该检查可以在索赔和证据图之间逐一验证三元组。广泛的实验表明，GraphFC可以通过关系约束来解决参考模棱两可的同时实现细粒分解，从而在三个数据集中实现最新性能。

Title: Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies

Authors: Luyi Jiang, Jiayuan Chen, Lu Lu, Xinwei Peng, Lihao Liu, Junjun He, Jie Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07306
Pdf URL: https://arxiv.org/pdf/2503.07306
Copy Paste: [[2503.07306]] Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies(https://arxiv.org/abs/2503.07306)
Keywords: language model, llm, hallucination, prompt
Abstract: The evaluation and improvement of medical large language models (LLMs) are critical for their real-world deployment, particularly in ensuring accuracy, safety, and ethical alignment. Existing frameworks inadequately dissect domain-specific error patterns or address cross-modal challenges. This study introduces a granular error taxonomy through systematic analysis of the top 10 models on MedBench, categorizing incorrect responses into eight types: Omissions, Hallucination, Format Mismatch, Causal Reasoning Deficiency, Contextual Inconsistency, Unanswered, Output Error, and Deficiency in Medical Language Generation. Evaluation of 10 leading models reveals vulnerabilities: despite achieving 0.86 accuracy in medical knowledge recall, critical reasoning tasks show 96.3% omission, while safety ethics evaluations expose alarming inconsistency (robustness score: 0.79) when answer options are shuffled. Our analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi-step reasoning. To address these, we propose a tiered optimization strategy spanning four levels, from prompt engineering and knowledge-augmented retrieval to hybrid neuro-symbolic architectures and causal reasoning frameworks. This work establishes an actionable roadmap for developing clinically robust LLMs while redefining evaluation paradigms through error-driven insights, ultimately advancing the safety and trustworthiness of AI in high-stakes medical environments.
摘要：医学大语言模型（LLM）的评估和改进对于其现实世界的部署至关重要，尤其是确保准确性，安全性和道德一致性。现有的框架不充分解剖域特异性误差模式或解决跨模式挑战。这项研究通过对MedBench上十大模型进行系统分析引入了颗粒状的错误分类法，将不正确的响应分为8种类型：遗漏，幻觉，格式不匹配，因果推理缺乏症，上下文不一致，不稳定，未解决的，未解决的，输出误差和医学语言产生中的缺乏症。对10个领先模型的评估揭示了漏洞：尽管在医学知识中达到了0.86的准确性，但关键的推理任务仍显示出96.3％的遗漏，而安全道德的评估暴露了令人震惊的不一致（鲁棒性得分：0.79），这是期权被改组的。我们的分析发现了知识边界执法和多步推理中的系统性弱点。为了解决这些问题，我们提出了一个分层的优化策略，涵盖了四个级别，从迅速的工程和知识使检索到混合神经符号架构和因果推理框架。这项工作为开发临床上强大的LLM的行动路线图建立了可行的路线图，同时通过错误驱动的见解重新定义评估范例，最终提高了AI在高风险医疗环境中的安全性和可信度。

Title: Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models

Authors: Hao Zhou, Guergana Savova, Lijing Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07329
Pdf URL: https://arxiv.org/pdf/2503.07329
Copy Paste: [[2503.07329]] Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models(https://arxiv.org/abs/2503.07329)
Keywords: language model, llm
Abstract: The impact of random seeds in fine-tuning large language models (LLMs) has been largely overlooked despite its potential influence on model performance. In this study, we systematically evaluate the effects of random seeds on LLMs using the GLUE and SuperGLUE benchmarks. We analyze the macro-level impact through traditional metrics like accuracy and F1, calculating their mean and variance to quantify performance fluctuations. To capture the micro-level effects, we introduce a novel metric, consistency, measuring the stability of individual predictions across runs. Our experiments reveal significant variance at both macro and micro levels, underscoring the need for careful consideration of random seeds in fine-tuning and evaluation.
摘要：尽管该研究对模型的潜在影响，但随机种子在微调大语言模型（LLMS）中的影响已被忽略了，我们使用胶水和超级粘液基准来系统地评估随机种子对LLM的影响。我们通过准确性和F1等传统指标分析了宏观的影响，计算它们的平均值和差异以量化性能波动。为了捕获微观效果，我们引入了一种新颖的度量，一致性，以测量整个运行中个体预测的稳定性。我们的实验揭示了在宏观和微观水平上的显着差异，强调了在微调和评估中仔细考虑随机种子的必要性。
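
The abstract above distinguishes a macro-level view (mean and variance of accuracy across seeds) from a micro-level "consistency" metric over individual predictions. The abstract does not give the formula, so the sketch below assumes one plausible definition: the average, over all pairs of runs, of the fraction of examples on which the two runs predict the same label. The function names are hypothetical and the paper's actual definition may differ.

```python
from itertools import combinations
from statistics import mean, pvariance


def accuracy(gold, preds):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, preds)) / len(gold)


def macro_summary(gold, runs):
    """Mean and population variance of accuracy across runs (one run per seed)."""
    scores = [accuracy(gold, preds) for preds in runs]
    return mean(scores), pvariance(scores)


def pairwise_consistency(runs):
    """Assumed definition: mean agreement rate over all pairs of runs."""
    rates = [
        sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)
        for run_a, run_b in combinations(runs, 2)
    ]
    return mean(rates)
```

Under this definition, two seeds can reach the same accuracy while disagreeing on which examples they get right, which is the kind of instability an aggregate score hides.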

Title: RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing

Authors: Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, Carolyn Rose
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2503.07358
Pdf URL: https://arxiv.org/pdf/2503.07358
Copy Paste: [[2503.07358]] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing(https://arxiv.org/abs/2503.07358)
Keywords: llm
Abstract: We present RepoST, a scalable method to construct environments that provide execution feedback for repository-level code generation for both training and evaluation. Unlike existing works that aim to build entire repositories for execution, which is challenging for both humans and LLMs, we provide execution feedback with sandbox testing, which isolates a given target function and its dependencies to a separate script for testing. Sandbox testing reduces the complexity of external dependencies and enables constructing environments at a large scale. We use our method to construct RepoST-Train, a large-scale training set with 7,415 functions from 832 repositories. Training with the execution feedback provided by RepoST-Train leads to a performance gain of 5.5% Pass@1 on HumanEval and 3.5% Pass@1 on RepoEval. We also build an evaluation dataset, RepoST-Eval, and benchmark 12 code generation models.
摘要：我们提出了repost，这是一种可扩展的方法，用于构建环境，为培训和评估提供存储级代码生成的执行反馈。与旨在构建整个存储库的现有作品不同，这对人类和LLM都充满挑战，我们通过沙盒测试提供执行反馈，该反馈将给定的目标功能及其对单独的测试脚本的依赖性隔离。沙盒测试降低了外部依赖性的复杂性，并启用了大规模构建环境的复杂性。我们使用我们的方法来构建Repost-Train，这是一种大型火车，设置了来自832个存储库的7,415个功能。通过Repost-Train提供的执行反馈的培训可导致Humaneval的5.5％通过@1，而RepoEval的1次通过@1。我们还构建了评估数据集，repost-eval和Benchmark 12代码生成模型。
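
The abstract above reports gains in Pass@1. It does not say how the metric is computed, so the sketch below uses the estimator commonly used for pass@k on code benchmarks: with n sampled solutions per problem, of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. Whether RepoST uses this exact estimator is an assumption, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of sampled solutions that pass, so `pass_at_k(10, 3, 1)` is 0.3.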

Title: Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs

Authors: Gonzalo Mancera, Daniel de Alcala, Julian Fierrez, Ruben Tolosana, Aythami Morales
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07384
Pdf URL: https://arxiv.org/pdf/2503.07384
Copy Paste: [[2503.07384]] Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs(https://arxiv.org/abs/2503.07384)
Keywords: language model, llm
Abstract: This work adapts the gradient-based Membership Inference Test (gMINT) to LLM-based text classification and studies its behavior. MINT is a general approach intended to determine if given data was used for training machine learning models, and this work focuses on its application to the domain of Natural Language Processing. Using gradient-based analysis, the MINT model identifies whether particular data samples were included during the language model training phase, addressing growing concerns about data privacy in machine learning. The method was evaluated in seven Transformer-based models and six datasets comprising over 2.5 million sentences, focusing on text classification tasks. Experimental results demonstrate MINT's robustness, achieving AUC scores between 85% and 99%, depending on data size and model architecture. These findings highlight MINT's potential as a scalable and reliable tool for auditing machine learning models, ensuring transparency, safeguarding sensitive data, and fostering ethical compliance in the deployment of AI/NLP technologies.
摘要：这项工作调整并研究了基于梯度的会员推理测试（GMINT），以基于LLMS的文本分类。 MINT是一种通用方法，旨在确定给定的数据是否用于培训机器学习模型，这项工作重点是将其应用于自然语言处理领域。使用基于梯度的分析，MINT模型确定了在语言模型培训阶段是否包括特定的数据样本，从而解决了对机器学习中数据隐私的日益关注。该方法在七个基于变压器的模型和六个数据集中进行了评估，其中包括超过250万句话，重点是文本分类任务。实验结果证明了薄荷的鲁棒性，取决于数据大小和模型结构，其AUC得分在85％至99％之间。这些发现突出显示了薄荷糖作为审核机器学习模型的可扩展可靠工具的潜力，可确保透明度，保护敏感数据并促进AI/NLP技术部署的道德合规性。
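
The AUC figures in this abstract can be read as the probability that a randomly chosen training sample receives a higher membership score than a randomly chosen non-training sample. The sketch below computes that quantity directly from two lists of scores. It does not reproduce gMINT itself: the `auc` helper and the scores are hypothetical and made up for illustration.

```python
def auc(member_scores, nonmember_scores):
    """Rank-based AUC: P(member score > non-member score); ties count as 0.5."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical detector outputs: higher means "more likely used in training".
members = [0.9, 0.8, 0.7, 0.4]
nonmembers = [0.6, 0.3, 0.2, 0.1]
print(auc(members, nonmembers))  # 15 of 16 pairs are ordered correctly: 0.9375
```

An AUC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation, which is the scale on which the reported 85% to 99% range sits.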

Title: Revisiting Noise in Natural Language Processing for Computational Social Science

Authors: Nadav Borenstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07395
Pdf URL: https://arxiv.org/pdf/2503.07395
Copy Paste: [[2503.07395]] Revisiting Noise in Natural Language Processing for Computational Social Science(https://arxiv.org/abs/2503.07395)
Keywords: language model
Abstract: Computational Social Science (CSS) is an emerging field driven by the unprecedented availability of human-generated content for researchers. This field, however, presents a unique set of challenges due to the nature of the theories and datasets it explores, including highly subjective tasks and complex, unstructured textual corpora. Among these challenges, one of the less well-studied topics is the pervasive presence of noise. This thesis aims to address this gap in the literature by presenting a series of interconnected case studies that examine different manifestations of noise in CSS. These include character-level errors following the OCR processing of historical records, archaic language, inconsistencies in annotations for subjective and ambiguous tasks, and even noise and biases introduced by large language models during content generation. This thesis challenges the conventional notion that noise in CSS is inherently harmful or useless. Rather, it argues that certain forms of noise can encode meaningful information that is invaluable for advancing CSS research, such as the unique communication styles of individuals or the culture-dependent nature of datasets and tasks. Further, this thesis highlights the importance of nuance in dealing with noise and the considerations CSS researchers must address when encountering it, demonstrating that different types of noise require distinct strategies.
摘要：计算社会科学（CSS）是一个新兴领域，由研究人员为人类生成的内容的前所未有的可用性驱动。但是，由于其探索的理论和数据集的性质，包括高度主观的任务以及复杂的，非结构化的文本语料库，这一领域提出了一系列独特的挑战。在这些挑战中，噪音普遍存在的挑战之一是噪音的普遍存在。该论文旨在通过提出一系列相互联系的案例研究来解决文献中的这一差距，这些案例研究检查CSS中噪声的不同表现。其中包括在OCR处理历史记录，古老语言，主观和模棱两可任务的注释中的不一致之后的角色级别错误，甚至在内容生成过程中大型语言模型引入的噪音和偏见。该论文挑战了传统观念，即CSS中的噪声本质上是有害的或没有用的。相反，它认为某些形式的噪声可以编码有意义的信息，这些信息对于推进CSS研究是无价的，例如个人的独特交流方式或数据集和任务的文化依赖性性质。此外，该论文强调了细微差别在处理噪声方面的重要性，而CSS研究人员在遇到噪声方面必须解决，这表明不同类型的噪声需要不同的策略。

Title: LLMs syntactically adapt their language use to their conversational partner

Authors: Florian Kandra, Vera Demberg, Alexander Koller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07457
Pdf URL: https://arxiv.org/pdf/2503.07457
Copy Paste: [[2503.07457]] LLMs syntactically adapt their language use to their conversational partner(https://arxiv.org/abs/2503.07457)
Keywords: language model, llm, agent
Abstract: It has been frequently observed that human speakers align their language use with each other during conversations. In this paper, we study empirically whether large language models (LLMs) exhibit the same behavior of conversational adaptation. We construct a corpus of conversations between LLMs and find that two LLM agents end up making more similar syntactic choices as conversations go on, confirming that modern LLMs adapt their language use to their conversational partners in at least a rudimentary way.
摘要：人们经常观察到，人说话的人在对话期间将其语言使用保持一致。在本文中，我们从经验上研究了大语言模型（LLM）是否表现出相同的对话适应行为。我们在LLMS之间构建了一系列对话，发现随着对话的进行，两个LLM代理最终会做出更相似的句法选择，证实现代LLM至少以基本的方式使他们的语言使用对他们的对话伙伴。

Title: MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Authors: Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07459
Pdf URL: https://arxiv.org/pdf/2503.07459
Copy Paste: [[2503.07459]] MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning(https://arxiv.org/abs/2503.07459)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning, scenarios where current models still struggle despite their strong performance on standard tests. Drawing from seven established medical datasets, our benchmark addresses three key limitations in existing evaluations: (1) the prevalence of straightforward questions where even base models achieve high performance, (2) inconsistent sampling and evaluation protocols across studies, and (3) lack of systematic analysis of the interplay between performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in complex medical reasoning tasks. Additionally, advanced search-based agent methods offer promising performance-to-cost ratios compared to traditional approaches. Our analysis reveals substantial performance gaps between model families on complex questions and identifies optimal model selections for different computational constraints. Our benchmark and evaluation framework are publicly available at this https URL.
摘要：大型语言模型（LLMS）在现有的医学提问基准测试中表现出令人印象深刻的表现。这种高性能使评估和区分先进方法变得越来越困难。我们提出了MedagentsBench，这是一个基准，该基准侧重于挑战医学问题，需要多步临床推理，诊断制定和治疗计划 - 赛季里奥斯，尽管它们在标准测试方面表现出色，但目前的模型仍在挣扎。我们的基准从七个已建立的医疗数据集中借鉴了现有评估中的三个关键局限性：（1）直接问题的普遍性，即使基本模型甚至可以达到高性能，（2）跨研究的采样和评估协议不一致，以及（3）对性能，成本，成本和跨越时间之间的相互作用的系统分析。通过使用各种基本模型和推理方法的实验，我们证明了最新的思维模型R1和OpenAI O3在复杂的医学推理任务中表现出非凡的表现。此外，与传统方法相比，基于先进的基于搜索的代理方法提供了有希望的性能与成本比率。我们的分析揭示了模型家族在复杂问题上的大量性能差距，并确定了不同计算约束的最佳模型选择。我们的基准和评估框架在此HTTPS URL上公开可用。

Title: Language Models Fail to Introspect About Their Knowledge of Language

Authors: Siyuan Song, Jennifer Hu, Kyle Mahowald
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07513
Pdf URL: https://arxiv.org/pdf/2503.07513
Copy Paste: [[2503.07513]] Language Models Fail to Introspect About Their Knowledge of Language(https://arxiv.org/abs/2503.07513)
Keywords: language model, llm, prompt
Abstract: There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of standard introspective methods in linguistics to evaluate grammatical knowledge in models (e.g., asking "Is this sentence grammatical?"). We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction. Crucially, in both domains, a model's internal linguistic knowledge can be theoretically grounded in direct measurements of string probability. We then evaluate whether models' responses to metalinguistic prompts faithfully reflect their internal knowledge. We propose a new measure of introspection: the degree to which a model's prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge. While both metalinguistic prompting and probability comparisons lead to high task accuracy, we do not find evidence that LLMs have privileged "self-access". Our findings complicate recent results suggesting that models can introspect, and add new evidence to the argument that prompted responses should not be conflated with models' linguistic generalizations.
摘要：最近，人们对大型语言模型（LLM）是否可以对自己的内部状态进行内省。这样的能力将使LLMS更容易解释，并验证语言学中标准内省方法的使用来评估模型中的语法知识（例如，询问“这句话语法是语法吗？”）。我们系统地研究了21个开源LLM的新兴内省，这是内省具有理论意义的两个领域：语法知识和单词预测。至关重要的是，在两个领域中，理论上可以将模型的内部语言知识以直接测量字符串概率为基础。然后，我们评估模型对金属语言的反应是否忠实地反映了他们的内部知识。我们提出了一种内省的新量度：模型的促使响应预测其自己的字符串概率的程度，超出了另一个具有几乎相同内部知识的模型所预测的。尽管金属语言提示和概率比较都导致了很高的任务准确性，但我们没有发现LLM具有特权“自我访问”的证据。我们的发现使最近的结果复杂化，表明模型可以内省，并在以下论点中添加了新的证据，即促使响应不应与模型的语言概括相结合。

Title: TokenButler: Token Importance is Predictable

Authors: Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Nilesh Jain, Mohamed S. Abdelfattah
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07518
Pdf URL: https://arxiv.org/pdf/2503.07518
Copy Paste: [[2503.07518]] TokenButler: Token Importance is Predictable(https://arxiv.org/abs/2503.07518)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck, especially because prior research has shown that only a small subset of tokens contributes meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and heavily dependent on the input query. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks (pages) of tokens at generation, failing at dense, context-rich tasks. Additionally, many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. By training a light-weight predictor with less than 1.2% parameter overhead, TokenButler prioritizes tokens based on their contextual, predicted importance. This improves perplexity and downstream accuracy by over 8% relative to SoTA methods for estimating token importance. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy. Code, models and benchmarks: this https URL
摘要：大型语言模型（LLMS）依靠键值（KV）缓存来存储令牌历史记录，从而有效地解码令牌。随着KV-CACHE的增长，它成为主要的内存和计算瓶颈，但是，有机会减轻这种瓶颈，尤其是因为先前的研究表明，只有一小部分代币对每个解码步骤都有有意义的贡献。找到这些关键令牌的一个关键挑战是它们是动态的，并且非常依赖于输入查询。现有的方法要么通过永久驱逐令牌，要么保留完整的KV-CACHE，但要依赖于代企业的代币（页面），但在密集的，上下文填充的任务上失败。此外，许多现有的KV-CACHE稀疏方法依赖于代币重要性的不准确代理。为了解决这些局限性，我们介绍了Tokenbutler，这是一种高粒度，查询意识到的预测指标，学会识别这些关键令牌。通过训练少于1.2％的参数开销的轻质预测变量，Tokenbutler基于其上下文，预测的重要性来优先考虑令牌。相对于估计令牌重要性的SOTA方法相对于SOTA方法而言，这将困惑性和下游精度提高了8％以上。我们根据新型的合成小膜片共检索任务评估Tokenbutler，证明了近乎轨道的准确性。代码，型号和基准：此HTTPS URL
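
The idea that only a few cached tokens matter for a given decoding step can be illustrated with a small query-dependent top-k attention sketch. It is not TokenButler's predictor: the importance score here is the exact query-key dot product, standing in for a learned estimate, and `topk_indices`, `sparse_attention`, and the toy vectors are hypothetical names and data chosen for this example.

```python
import math

def topk_indices(scores, k):
    """Indices of the k largest scores, returned in original token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(query, keys, values, k):
    """Softmax attention restricted to the k cached tokens with the highest scores."""
    logits = [sum(q * x for q, x in zip(query, key)) for key in keys]
    keep = topk_indices(logits, k)
    peak = max(logits[i] for i in keep)  # subtract the max for numerical stability
    weights = [math.exp(logits[i] - peak) for i in keep]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, keep)) / total
           for d in range(dim)]
    return keep, out

# Toy cache of four tokens with 2-d keys and 1-d values.
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
kept, output = sparse_attention([1.0, 0.0], keys, values, k=2)
print(kept, output)
```

Because the kept set is recomputed per query, no token is evicted permanently; a different query can select a different subset of the same cache.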

Title: XIFBench: Evaluating Large Language Models on Multilingual Instruction Following

Authors: Zhenyu Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yaoyin Zhang, Xuchen Wei, Juntao Li, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07539
Pdf URL: https://arxiv.org/pdf/2503.07539
Copy Paste: [[2503.07539]] XIFBench: Evaluating Large Language Models on Multilingual Instruction Following(https://arxiv.org/abs/2503.07539)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings remains poorly understood, as existing evaluations lack fine-grained constraint analysis. We introduce XIFBench, a comprehensive constraint-based benchmark for assessing multilingual instruction-following abilities of LLMs, featuring a novel taxonomy of five constraint categories and 465 parallel instructions across six languages spanning different resource levels. To ensure consistent cross-lingual evaluation, we develop a requirement-based protocol that leverages English requirements as semantic anchors. These requirements are then used to validate the translations across languages. Extensive experiments with various LLMs reveal notable variations in instruction-following performance across resource levels, identifying key influencing factors such as constraint categories, instruction complexity, and cultural specificity.
摘要：大型语言模型（LLMS）在各种应用程序中都表现出了出色的指导跟踪功能。但是，由于现有评估缺乏细粒度的约束分析，因此它们在多语言环境中的性能仍然很少。我们介绍了Xifbench，这是一种基于综合约束的基准测试，用于评估LLMS的多语言指令跟随能力，具有五个约束类别的新型分类法和跨越不同资源水平的六种语言的465个平行说明。为了确保一致的跨语言评估，我们开发了一种基于要求的协议，该协议利用英语要求作为语义锚。然后使用这些要求来验证跨语言的翻译。具有各种LLM的广泛实验揭示了跨资源层面的指导跟踪性能的显着差异，从而确定了关键影响因素，例如约束类别，教学复杂性和文化特异性。

Title: KSOD: Knowledge Supplement for LLMs On Demand

Authors: Haoran Li, Junfeng Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07550
Pdf URL: https://arxiv.org/pdf/2503.07550
Copy Paste: [[2503.07550]] KSOD: Knowledge Supplement for LLMs On Demand(https://arxiv.org/abs/2503.07550)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet still produce errors in domain-specific tasks. To further improve their performance, we propose KSOD (Knowledge Supplement for LLMs On Demand), a novel framework that empowers LLMs to improve their capabilities with knowledge-based supervised fine-tuning (SFT). KSOD analyzes the causes of errors from the perspective of knowledge deficiency by identifying potential missing knowledge in the LLM that may lead to the errors. Subsequently, KSOD tunes a knowledge module on a knowledge dataset and verifies whether the LLM lacks the identified knowledge based on it. If the knowledge is verified, KSOD supplements the LLM with the identified knowledge using the knowledge module. Tuning LLMs on specific knowledge instead of specific tasks decouples task and knowledge, and our experiments on two domain-specific benchmarks and four general benchmarks empirically demonstrate that KSOD enhances the performance of LLMs on tasks requiring the supplemented knowledge while preserving their performance on other tasks. Our findings shed light on the potential of improving the capabilities of LLMs with knowledge-based SFT.
摘要：大型语言模型（LLMS）在各种任务中都表现出了显着的功能，但仍会在特定于领域的任务中产生错误。为了进一步提高其性能，我们提出了KSOD（根据需要使用LLMS的知识补充），这是一个新颖的框架，可以通过基于知识的监督微调（SFT）来提高LLMS的能力。 KSOD通过确定LLM中潜在的知识可能导致错误的潜在知识，从知识缺乏的角度分析了错误的原因。随后，KSOD对知识数据集进行了一个知识模块，并验证LLM是否基于IT缺乏已确定的知识。如果验证了知识，则KSOD使用知识模块为LLM补充了LLM。在特定知识上调整LLM，而不是特定的任务删除任务和知识，以及我们对两个特定领域的基准和四个一般基准进行实验，从经验上表明，KSOD可以增强LLM在需要补充知识的任务上的绩效，同时保留其在其他任务上的绩效。我们的发现阐明了通过基于知识的SFT提高LLM的功能的潜力。

Title: Detection Avoidance Techniques for Large Language Models

Authors: Sinclair Schneider, Florian Steuber, Joao A. G. Schneider, Gabi Dreo Rodosek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07595
Pdf URL: https://arxiv.org/pdf/2503.07595
Copy Paste: [[2503.07595]] Detection Avoidance Techniques for Large Language Models(https://arxiv.org/abs/2503.07595)
Keywords: language model, gpt
Abstract: The increasing popularity of large language models has not only led to widespread use but has also brought various risks, including the potential for systematically spreading fake news. Consequently, the development of classification systems such as DetectGPT has become vital. These detectors are vulnerable to evasion techniques, as demonstrated in an experimental series: Systematic changes of the generative models' temperature proved shallow learning detectors to be the least reliable. Fine-tuning the generative model via reinforcement learning circumvented BERT-based detectors. Finally, rephrasing led to a >90% evasion of zero-shot detectors like DetectGPT, although texts stayed highly similar to the original. A comparison with existing work highlights the better performance of the presented methods. Possible implications for society and further research are discussed.
摘要：大型语言模型的日益普及不仅导致广泛使用，而且带来了各种风险，包括系统地传播假新闻的潜力。因此，分类系统（例如检测）的开发变得至关重要。这些探测器很容易受到逃避技术的影响，如实验系列中所示：生成模型证明温度的浅学习检测器的系统变化是最不可靠的。通过增强学习绕过基于BERT的检测器来微调生成模型。最后，重新绘制导致> 90 \％逃避零射击检测器（如检测），尽管文本与原始文本高度相似。与现有工作的比较突出了提出方法的更好性能。讨论了对社会和进一步研究的可能影响。

Title: Implicit Reasoning in Transformers is Reasoning through Shortcuts

Authors: Tianhe Lin, Jian Xie, Siyu Yuan, Deqing Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07604
Pdf URL: https://arxiv.org/pdf/2503.07604
Copy Paste: [[2503.07604]] Implicit Reasoning in Transformers is Reasoning through Shortcuts(https://arxiv.org/abs/2503.07604)
Keywords: language model, gpt
Abstract: Test-time compute is emerging as a new paradigm for enhancing language models' complex multi-step reasoning capabilities, as demonstrated by the success of OpenAI's o1 and o3, as well as DeepSeek's R1. Compared to explicit reasoning in test-time compute, implicit reasoning is more inference-efficient, requiring fewer generated tokens. However, why does the advanced reasoning capability fail to emerge in the implicit reasoning style? In this work, we train GPT-2 from scratch on a curated multi-step mathematical reasoning dataset and conduct analytical experiments to investigate how language models perform implicit reasoning in multi-step tasks. Our findings reveal: 1) Language models can perform step-by-step reasoning and achieve high accuracy in both in-domain and out-of-domain tests via implicit reasoning. However, this capability only emerges when trained on fixed-pattern data. 2) Conversely, implicit reasoning abilities emerging from training on unfixed-pattern data tend to overfit a specific pattern and fail to generalize further. Notably, this limitation is also observed in state-of-the-art large language models. These findings suggest that language models acquire implicit reasoning through shortcut learning, enabling strong performance on tasks with similar patterns while lacking generalization.
摘要：测试时间计算正在作为增强语言模型复杂的多步推理功能的新范式，如Openai的O1和O3以及DeepSeek的R1所证明的那样。与测试时间计算中的显式推理相比，隐式推理的推理效率更高，需要更少的产生令牌。但是，为什么先进的推理能力无法以隐式推理方式出现？在这项工作中，我们在策划的多步数学推理数据集上从头开始训练GPT-2，并进行分析实验，以研究语言模型如何在多步任务中执行隐式推理。我们的发现揭示了：1）语言模型可以通过隐式推理进行逐步推理，并在内域和室外测试中获得高精度。但是，仅在对固定模式数据进行培训时才会出现此功能。 2）相反，从未固定模式数据的培训中出现的隐性推理能力倾向于过度拟合特定的模式，并且无法进一步推广。值得注意的是，在最先进的大语言模型中也观察到了这种限制。这些发现表明，语言模型通过快捷方式学习获得了隐性的推理，在缺乏概括的同时，在具有相似模式的任务上实现了强劲的绩效。

Title: SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models

Authors: Xun Liang, Hanyu Wang, Huayi Lai, Simin Niu, Shichao Song, Jiawei Yang, Jihao Zhao, Feiyu Xiong, Bo Tang, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07605
Pdf URL: https://arxiv.org/pdf/2503.07605
Copy Paste: [[2503.07605]] SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models(https://arxiv.org/abs/2503.07605)
Keywords: language model, llm
Abstract: Large Language Models have achieved remarkable success across various natural language processing tasks, yet their high computational cost during inference remains a major bottleneck. This paper introduces Sparse Expert Activation Pruning (SEAP), a training-free pruning method that selectively retains task-relevant parameters to reduce inference overhead. Inspired by the clustering patterns of hidden states and activations in LLMs, SEAP identifies task-specific expert activation patterns and prunes the model while preserving task performance and enhancing computational efficiency. Experimental results demonstrate that SEAP significantly reduces computational overhead while maintaining competitive accuracy. Notably, at 50% pruning, SEAP surpasses both WandA and FLAP by over 20%, and at 20% pruning, it incurs only a 2.2% performance drop compared to the dense model. These findings highlight SEAP's scalability and effectiveness, making it a promising approach for optimizing large-scale LLMs.
摘要：大型语言模型在各种自然语言处理任务中取得了巨大的成功，但推断期间的高计算成本仍然是主要的瓶颈。本文介绍了稀疏的专家激活修剪（SEAP），这是一种无训练的修剪方法，有选择地保留与任务相关的参数以减少推理开销。 SEAP受到隐藏状态和激活的聚类模式的启发，SEAP确定了特定于任务的专家激活模式并修剪模型，同时保留任务性能并提高计算效率。实验结果表明，SEAP显着降低了计算开销，同时保持竞争精度。值得注意的是，在修剪50％的情况下，SEAP超过20％以上的Wanda和plap，与密集模型相比，它的性能下降仅为2.2％。这些发现突出了SEAP的可伸缩性和有效性，使其成为优化大型LLM的有前途的方法。
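
Training-free, activation-guided pruning of the kind this abstract describes can be sketched generically: collect activations on task samples, score each hidden unit, and keep only the highest-scoring units. The scoring rule below (mean absolute activation) is an assumption chosen for simplicity, not SEAP's actual criterion, and `prune_mask` and the toy activations are hypothetical.

```python
def prune_mask(activations, sparsity):
    """Boolean keep-mask over hidden units from task-sample activations.

    activations: list of rows, one per sample, each a list of unit activations.
    sparsity: fraction of units to remove (0.5 removes half of them).
    """
    n_units = len(activations[0])
    # Score each unit by its mean absolute activation over the samples.
    scores = [sum(abs(row[j]) for row in activations) / len(activations)
              for j in range(n_units)]
    n_keep = n_units - int(n_units * sparsity)
    ranked = sorted(range(n_units), key=lambda j: scores[j], reverse=True)
    keep = set(ranked[:n_keep])
    return [j in keep for j in range(n_units)]

# Two hypothetical task samples over four hidden units.
acts = [[0.1, -2.0, 0.0, 1.0],
        [0.2, 1.5, 0.0, -1.2]]
print(prune_mask(acts, sparsity=0.5))  # keeps units 1 and 3
```

No gradient step is involved: the mask depends only on forward-pass statistics, which is what makes this style of pruning training-free.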