2025-03-21

Title: Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

Authors: Austin Xu, Srijan Bansal, Yifei Ming, Semih Yavuz, Shafiq Joty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.15620
Pdf URL: https://arxiv.org/pdf/2503.15620
Copy Paste: [[2503.15620]] Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings(https://arxiv.org/abs/2503.15620)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The large language model (LLM)-as-judge paradigm has been used to meet the demand for a cheap, reliable, and fast evaluation of model outputs during AI system development and post-deployment monitoring. While judge models -- LLMs finetuned to specialize in assessing and critiquing model outputs -- have been touted as general-purpose evaluators, they are typically evaluated only on non-contextual scenarios, such as instruction following. The omission of contextual settings -- those where external information is used as context to generate an output -- is surprising given the increasing prevalence of retrieval-augmented generation (RAG) and summarization use cases. Contextual assessment is uniquely challenging, as evaluation often depends on practitioner priorities, leading to conditional evaluation criteria (e.g., comparing responses based on factuality and then considering completeness if they are equally factual). To address the gap, we propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios. We build our benchmark with a multi-pronged data construction pipeline that leverages both existing human annotations and model-based perturbations. Our comprehensive study across 11 judge models and 9 general-purpose models reveals that the contextual information and its assessment criteria present a significant challenge to even state-of-the-art models. For example, OpenAI's o1, the best-performing model, barely reaches 55% consistent accuracy.
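The conditional criterion mentioned in the ContextualJudgeBench abstract (compare on factuality first, fall back to completeness only when factuality ties) can be illustrated with a small sketch. Everything here is hypothetical: the `Scores` fields, the numeric scores, and the tolerance are illustrative assumptions, not the benchmark's data format. The `consistent_accuracy` helper assumes that "consistent accuracy" means the judge must pick the gold response under both presentation orders; the abstract itself does not define the metric.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    factuality: float    # hypothetical per-response score in [0, 1]
    completeness: float  # hypothetical per-response score in [0, 1]


def prefer(a: Scores, b: Scores, tol: float = 1e-9) -> str:
    """Pick 'A', 'B', or 'tie': factuality first, completeness only on a tie."""
    for criterion in ("factuality", "completeness"):
        diff = getattr(a, criterion) - getattr(b, criterion)
        if abs(diff) > tol:
            return "A" if diff > 0 else "B"
    return "tie"


def consistent_accuracy(records) -> float:
    """records: (pick_original_order, pick_swapped_order, gold) triples,
    with both picks already mapped back to the same response labels.
    Assumption: a pair counts only if both picks match the gold label."""
    if not records:
        return 0.0
    hits = sum(1 for first, second, gold in records if first == gold == second)
    return hits / len(records)
```

Under this reading, a judge that flips its verdict when the two responses are swapped earns no credit for that pair, even if one of its two verdicts was right.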

Title: Enhancing Pancreatic Cancer Staging with Large Language Models: The Role of Retrieval-Augmented Generation

Authors: Hisashi Johno, Yuki Johno, Akitomo Amakawa, Junichi Sato, Ryota Tozuka, Atsushi Komaba, Hiroaki Watanabe, Hiroki Watanabe, Chihiro Goto, Hiroyuki Morisaka, Hiroshi Onishi, Kazunori Nakamoto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15664
Pdf URL: https://arxiv.org/pdf/2503.15664
Copy Paste: [[2503.15664]] Enhancing Pancreatic Cancer Staging with Large Language Models: The Role of Retrieval-Augmented Generation(https://arxiv.org/abs/2503.15664)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Purpose: Retrieval-augmented generation (RAG) is a technology to enhance the functionality and reliability of large language models (LLMs) by retrieving relevant information from reliable external knowledge (REK). RAG has gained interest in radiology, and we previously reported the utility of NotebookLM, an LLM with RAG (RAG-LLM), for lung cancer staging. However, since the comparator LLM differed from NotebookLM's internal model, it remained unclear whether its advantage stemmed from RAG or inherent model differences. To better isolate RAG's impact and assess its utility across different cancers, we compared NotebookLM with its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment. Materials and Methods: A summary of Japan's pancreatic cancer staging guidelines was used as REK. We compared three groups - REK+/RAG+ (NotebookLM with REK), REK+/RAG- (Gemini 2.0 Flash with REK), and REK-/RAG- (Gemini 2.0 Flash without REK) - in staging 100 fictional pancreatic cancer cases based on CT findings. Staging criteria included TNM classification, local invasion factors, and resectability classification. In REK+/RAG+, retrieval accuracy was quantified based on the sufficiency of retrieved REK excerpts. Results: REK+/RAG+ achieved a staging accuracy of 70%, outperforming REK+/RAG- (38%) and REK-/RAG- (35%). For TNM classification, REK+/RAG+ attained 80% accuracy, exceeding REK+/RAG- (55%) and REK-/RAG- (50%). Additionally, REK+/RAG+ explicitly presented retrieved REK excerpts, achieving a retrieval accuracy of 92%. Conclusion: NotebookLM, a RAG-LLM, outperformed its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment, suggesting that RAG may improve LLM's staging accuracy. Furthermore, its ability to retrieve and present REK excerpts provides transparency for physicians, highlighting its applicability for clinical diagnosis and classification.

Title: Am I eligible? Natural Language Inference for Clinical Trial Patient Recruitment: the Patient's Point of View

Authors: Mathilde Aguiar, Pierre Zweigenbaum, Nona Naderi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15718
Pdf URL: https://arxiv.org/pdf/2503.15718
Copy Paste: [[2503.15718]] Am I eligible? Natural Language Inference for Clinical Trial Patient Recruitment: the Patient's Point of View(https://arxiv.org/abs/2503.15718)
Keywords: language model, prompt
Abstract: Recruiting patients to participate in clinical trials can be challenging and time-consuming. Usually, participation in a clinical trial is initiated by a healthcare professional and proposed to the patient. Promoting clinical trials directly to patients via online recruitment might help to reach them more efficiently. In this study, we address the case where a patient is initiating their own recruitment process and wants to determine whether they are eligible for a given clinical trial, using their own language to describe their medical profile. To study whether this creates difficulties in the patient-trial matching process, we design a new dataset and task, Natural Language Inference for Patient Recruitment (NLI4PR), in which patient language profiles must be matched to clinical trials. We create it by adapting the TREC 2022 Clinical Trial Track dataset, which provides patients' medical profiles, and rephrasing them manually using patient language. We also use the associated clinical trial reports where the patients are either eligible or excluded. We prompt several open-source Large Language Models on our task and achieve F1 scores from 56.5 to 71.8 using patient language, against 64.7 to 73.1 for the same task using medical language. When using patient language, we observe only a small loss in performance for the best model, suggesting that having the patient as a starting point could be adopted to help recruit patients for clinical trials. The corpus and code bases are all freely available on our GitHub and HuggingFace repositories.

Title: KoGNER: A Novel Framework for Knowledge Graph Distillation on Biomedical Named Entity Recognition

Authors: Heming Zhang, Wenyu Li, Di Huang, Yinjie Tang, Yixin Chen, Philip Payne, Fuhai Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15737
Pdf URL: https://arxiv.org/pdf/2503.15737
Copy Paste: [[2503.15737]] KoGNER: A Novel Framework for Knowledge Graph Distillation on Biomedical Named Entity Recognition(https://arxiv.org/abs/2503.15737)
Keywords: llm
Abstract: Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that plays a crucial role in information extraction, question answering, and knowledge-based systems. Traditional deep learning-based NER models often struggle with domain-specific generalization and suffer from data sparsity issues. In this work, we introduce Knowledge Graph distilled for Named Entity Recognition (KoGNER), a novel approach that integrates Knowledge Graph (KG) distillation into NER models to enhance entity recognition performance. Our framework leverages structured knowledge representations from KGs to enrich contextual embeddings, thereby improving entity classification and reducing ambiguity in entity detection. KoGNER employs a two-step process: (1) Knowledge Distillation, where external knowledge sources are distilled into a lightweight representation for seamless integration with NER models, and (2) Entity-Aware Augmentation, which integrates contextual embeddings that have been enriched with knowledge graph information directly into GNN, thereby improving the model's ability to understand and represent entity relationships. Experimental results on benchmark datasets demonstrate that KoGNER achieves state-of-the-art performance, outperforming finetuned NER models and LLMs by a significant margin. These findings suggest that leveraging knowledge graphs as auxiliary information can significantly improve NER accuracy, making KoGNER a promising direction for future research in knowledge-aware NLP.

Title: Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer

Authors: Alexandra DeLucia, Mark Dredze
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15768
Pdf URL: https://arxiv.org/pdf/2503.15768
Copy Paste: [[2503.15768]] Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer(https://arxiv.org/abs/2503.15768)
Keywords: gpt
Abstract: Abstractive multi-document summarization (MDS) is the task of automatically summarizing information in multiple documents, from news articles to conversations with multiple speakers. The training approaches for current MDS models can be grouped into four approaches: end-to-end with special pre-training ("direct"), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. In this work, we evaluate MDS models across training approaches, domains, and dimensions (reference similarity, quality, and factuality), to analyze how and why models trained on one domain can fail to summarize documents from another (News, Science, and Conversation) in the zero-shot domain transfer setting. We define domain-transfer "failure" as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. In addition to exploring domain transfer for MDS models, we examine potential issues with applying popular summarization metrics out-of-the-box.

Title: Grammar and Gameplay-aligned RL for Game Description Generation with LLMs

Authors: Tsunehiko Tanaka, Edgar Simo-Serra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15783
Pdf URL: https://arxiv.org/pdf/2503.15783
Copy Paste: [[2503.15783]] Grammar and Gameplay-aligned RL for Game Description Generation with LLMs(https://arxiv.org/abs/2503.15783)
Keywords: language model, llm
Abstract: Game Description Generation (GDG) is the task of generating a game description written in a Game Description Language (GDL) from natural language text. Previous studies have explored generation methods leveraging the contextual understanding capabilities of Large Language Models (LLMs); however, accurately reproducing the game features of the game descriptions remains a challenge. In this paper, we propose reinforcement learning-based fine-tuning of LLMs for GDG (RLGDG). Our training method simultaneously improves grammatical correctness and fidelity to game concepts by introducing both grammar rewards and concept rewards. Furthermore, we adopt a two-stage training strategy where Reinforcement Learning (RL) is applied following Supervised Fine-Tuning (SFT). Experimental results demonstrate that our proposed method significantly outperforms baseline methods using SFT alone.

Title: Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation

Authors: Shangqing Zhao, Yuhao Zhou, Yupei Ren, Zhe Chen, Chenghao Jia, Fang Zhe, Zhaogaung Long, Shu Liu, Man Lan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15837
Pdf URL: https://arxiv.org/pdf/2503.15837
Copy Paste: [[2503.15837]] Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation(https://arxiv.org/abs/2503.15837)
Keywords: language model, llm
Abstract: Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models' generative capabilities in classical Chinese. We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.

Title: Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey

Authors: Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, Hua Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15850
Pdf URL: https://arxiv.org/pdf/2503.15850
Copy Paste: [[2503.15850]] Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey(https://arxiv.org/abs/2503.15850)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel in text generation, reasoning, and decision-making, enabling their adoption in high-stakes domains such as healthcare, law, and transportation. However, their reliability is a major concern, as they often produce plausible but incorrect responses. Uncertainty quantification (UQ) enhances trustworthiness by estimating confidence in outputs, enabling risk mitigation and selective prediction. However, traditional UQ methods struggle with LLMs due to computational constraints and decoding inconsistencies. Moreover, LLMs introduce unique uncertainty sources, such as input ambiguity, reasoning path divergence, and decoding stochasticity, that extend beyond classical aleatoric and epistemic uncertainty. To address this, we introduce a new taxonomy that categorizes UQ methods based on computational efficiency and uncertainty dimensions (input, reasoning, parameter, and prediction uncertainty). We evaluate existing techniques, assess their real-world applicability, and identify open challenges, emphasizing the need for scalable, interpretable, and robust UQ approaches to enhance LLM reliability.
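The survey abstract mentions selective prediction as one use of uncertainty estimates. A minimal sketch of that idea follows, assuming each answer comes with a scalar confidence; the record format and the threshold are illustrative assumptions and are not taken from the survey. The entropy helper shows one common way to turn a token or answer distribution into an uncertainty score.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def selective_prediction(records, threshold):
    """records: list of (confidence, is_correct) pairs (hypothetical format).

    The model answers only when confidence >= threshold and abstains
    otherwise. Returns (coverage, accuracy on the answered subset)."""
    answered = [ok for conf, ok in records if conf >= threshold]
    coverage = len(answered) / len(records) if records else 0.0
    accuracy = sum(answered) / len(answered) if answered else float("nan")
    return coverage, accuracy
```

Raising the threshold trades coverage for accuracy on the answered subset, which only helps when the confidence scores are reasonably calibrated.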

Title: Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering

Authors: DongGeon Lee, Ahjeong Park, Hyeri Lee, Hyeonseo Nam, Yunho Maeng
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.15879
Pdf URL: https://arxiv.org/pdf/2503.15879
Copy Paste: [[2503.15879]] Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering(https://arxiv.org/abs/2503.15879)
Keywords: retrieval-augmented generation
Abstract: Non-factoid question-answering (NFQA) poses a significant challenge due to its open-ended nature, diverse intents, and the need for multi-aspect reasoning, which renders conventional factoid QA approaches, including retrieval-augmented generation (RAG), inadequate. Unlike factoid questions, non-factoid questions (NFQs) lack definitive answers and require synthesizing information from multiple sources across various reasoning dimensions. To address these limitations, we introduce Typed-RAG, a type-aware multi-aspect decomposition framework within the RAG paradigm for NFQA. Typed-RAG classifies NFQs into distinct types -- such as debate, experience, and comparison -- and applies aspect-based decomposition to refine retrieval and generation strategies. By decomposing multi-aspect NFQs into single-aspect sub-queries and aggregating the results, Typed-RAG generates more informative and contextually relevant responses. To evaluate Typed-RAG, we introduce Wiki-NFQA, a benchmark dataset covering diverse NFQ types. Experimental results demonstrate that Typed-RAG outperforms baselines, thereby highlighting the importance of type-aware decomposition for effective retrieval and generation in NFQA. Our code and dataset are available at this https URL.
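The classify, decompose, retrieve, and aggregate flow described in the Typed-RAG abstract can be sketched as a toy pipeline. This is not the authors' implementation: the keyword rules, the `ASPECT_TEMPLATES` table, and the word-overlap retriever are hypothetical stand-ins for steps that a real system would presumably delegate to an LLM and a proper retriever.

```python
# Hypothetical decomposition rules: one list of single-aspect templates per type.
ASPECT_TEMPLATES = {
    "debate": ["arguments for {q}", "arguments against {q}"],
    "comparison": ["similarities {q}", "differences {q}"],
    "experience": ["first-hand accounts {q}"],
}


def classify_type(question: str) -> str:
    """Toy keyword classifier standing in for a learned type classifier."""
    q = question.lower()
    if q.startswith("should") or "pros and cons" in q:
        return "debate"
    if " vs " in q or "compare" in q:
        return "comparison"
    if "what is it like" in q:
        return "experience"
    return "other"


def decompose(question: str, qtype: str) -> list:
    """Turn a multi-aspect question into single-aspect sub-queries."""
    templates = ASPECT_TEMPLATES.get(qtype, ["{q}"])
    return [t.format(q=question) for t in templates]


def retrieve(query: str, corpus: list, k: int = 1) -> list:
    """Rank passages by word overlap with the query (toy retriever)."""
    terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def typed_rag(question: str, corpus: list):
    """Classify, decompose, retrieve per sub-query, then merge the evidence."""
    qtype = classify_type(question)
    evidence = []
    for sub_query in decompose(question, qtype):
        for passage in retrieve(sub_query, corpus):
            if passage not in evidence:
                evidence.append(passage)
    return qtype, evidence
```

In this sketch the aggregation step only deduplicates the retrieved passages; generating the final answer from the merged evidence is left out.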

Title: Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models

Authors: Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15888
Pdf URL: https://arxiv.org/pdf/2503.15888
Copy Paste: [[2503.15888]] Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models(https://arxiv.org/abs/2503.15888)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, conflicts between parametric knowledge and retrieved context pose challenges, particularly when retrieved information is unreliable or the model's internal knowledge is outdated. In such cases, LLMs struggle to determine whether to rely more on their own parameters or the conflicted context. To address this, we propose CK-PLUG, a plug-and-play method for controlling LLMs' reliance on parametric and contextual knowledge. We introduce a novel knowledge consistency metric, Confidence Gain, which detects knowledge conflicts by measuring entropy shifts in token probability distributions after context insertion. CK-PLUG then enables fine-grained control over knowledge preference by adjusting the probability distribution of tokens with negative confidence gain through a single tuning parameter. Experiments demonstrate CK-PLUG's ability to significantly regulate knowledge reliance in counterfactual RAG scenarios while maintaining generation fluency and knowledge accuracy. For instance, on Llama3-8B, memory recall (MR) of RAG response can be adjusted within a broad range (9.9%-71.9%), compared to the baseline of 42.1%. Moreover, CK-PLUG supports adaptive control based on the model's confidence in both internal and external knowledge, achieving consistent performance improvements across various general RAG tasks. Our code is available at this https URL.
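The abstract describes Confidence Gain only as an entropy shift in the next-token distribution after context insertion. One plausible reading, sketched below, is the entropy without context minus the entropy with context, so that a negative value means the retrieved context made the model less certain. The sign convention, the function names, and the example distributions are assumptions; the paper's exact definition and its adjustment rule are not reproduced here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence_gain(p_without_context, p_with_context):
    """Assumed definition: entropy drop caused by inserting the context.

    Positive: the context sharpened the prediction.
    Negative: the context increased uncertainty, hinting at a conflict
    between parametric and contextual knowledge."""
    return entropy(p_without_context) - entropy(p_with_context)


# Hypothetical distributions over three candidate tokens.
parametric = [0.90, 0.05, 0.05]   # model alone is confident
contextual = [0.40, 0.35, 0.25]   # retrieved context pulls mass elsewhere
gain = confidence_gain(parametric, contextual)  # negative under this reading
```

Tokens flagged this way would then be the ones whose probabilities a method like CK-PLUG adjusts with its single tuning parameter.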

Title: From Structured Prompts to Open Narratives: Measuring Gender Bias in LLMs Through Open-Ended Storytelling

Authors: Evan Chen, Run-Jun Zhan, Yan-Bai Lin, Hung-Hsuan Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15904
Pdf URL: https://arxiv.org/pdf/2503.15904
Copy Paste: [[2503.15904]] From Structured Prompts to Open Narratives: Measuring Gender Bias in LLMs Through Open-Ended Storytelling(https://arxiv.org/abs/2503.15904)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have revolutionized natural language processing, yet concerns persist regarding their tendency to reflect or amplify social biases present in their training data. This study introduces a novel evaluation framework to uncover gender biases in LLMs, focusing on their occupational narratives. Unlike previous methods relying on structured scenarios or carefully crafted prompts, our approach leverages free-form storytelling to reveal biases embedded in the models. Systematic analyses show an overrepresentation of female characters across occupations in six widely used LLMs. Additionally, our findings reveal that LLM-generated occupational gender rankings align more closely with human stereotypes than actual labor statistics. These insights underscore the need for balanced mitigation strategies to ensure fairness while avoiding the reinforcement of new stereotypes.

Title: Towards Automatic Continual Learning: A Self-Adaptive Framework for Continual Instruction Tuning

Authors: Peiyi Lin, Fukai Zhang, Kai Niu, Hao Fu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15924
Pdf URL: https://arxiv.org/pdf/2503.15924
Copy Paste: [[2503.15924]] Towards Automatic Continual Learning: A Self-Adaptive Framework for Continual Instruction Tuning(https://arxiv.org/abs/2503.15924)
Keywords: language model, llm
Abstract: Continual instruction tuning enables large language models (LLMs) to learn incrementally while retaining past knowledge, whereas existing methods primarily focus on how to retain old knowledge rather than on selecting which new knowledge to learn. In domain-specific contexts, maintaining data quality and managing system constraints remain key challenges. To address these issues, we propose an automated continual instruction tuning framework that dynamically filters incoming data, identifying and reducing redundant data across successive updates. Our approach utilizes a small proxy model for efficient perplexity-based filtering, and updates the proxy to ensure that the filtering criteria remain aligned with the evolving state of the deployed model. Compared to existing static data selection methods, our framework can effectively handle incrementally acquired data and shifting distributions. Additionally, it addresses practical deployment challenges by enabling seamless model updates, supporting version rollback, and incorporating automatic checkpoint evaluation. We evaluated the system in real-world medical scenarios. It reduced computational costs by 66.7%, improved model performance, and achieved autonomous updates, thus demonstrating its effectiveness for automatic continual instruction tuning.
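Perplexity-based filtering with a proxy model can be sketched as follows. The sketch assumes the proxy model has already produced per-token log-probabilities for each incoming sample, and that samples the proxy finds too easy (low perplexity) are treated as redundant. The batch format, the threshold, and the direction of the filter are illustrative assumptions rather than details from the paper.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean log-probability over a sample's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_batch(batch, min_ppl):
    """batch: list of (sample_id, token_logprobs) scored by the proxy model.

    Keeps only samples whose perplexity is at least min_ppl, i.e. data the
    proxy has not already absorbed (assumed redundancy criterion)."""
    return [sid for sid, lps in batch if perplexity(lps) >= min_ppl]


# Hypothetical scores: the proxy predicts "seen" well and "novel" poorly.
batch = [
    ("seen", [math.log(0.9)] * 3),
    ("novel", [math.log(0.1)] * 3),
]
kept = filter_batch(batch, min_ppl=2.0)
```

Because the abstract says the proxy is updated alongside the deployed model, the same sample can pass the filter in one round and be dropped as redundant in a later one.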

Title: From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models

Authors: Jinyi Liu, Yan Zheng, Rong Cheng, Qiyu Wu, Wei Guo, Fei Ni, Hebin Liang, Yifu Yuan, Hangyu Mao, Fuzheng Zhang, Jianye Hao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15944
Pdf URL: https://arxiv.org/pdf/2503.15944
Copy Paste: [[2503.15944]] From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models(https://arxiv.org/abs/2503.15944)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have shown remarkable progress, yet their capacity for logical "slow-thinking" reasoning persists as a critical research frontier. Current inference scaling paradigms suffer from two fundamental constraints: fragmented thought flows compromising logical coherence, and intensive computational complexity that escalates with search space dimensions. To overcome these limitations, we present Atomic Reasoner (AR), a cognitive inference strategy that enables fine-grained reasoning through systematic atomic-level operations. AR decomposes the reasoning process into atomic cognitive units, employing a cognitive routing mechanism to dynamically construct reasoning representations and orchestrate inference pathways. This systematic methodology implements stepwise, structured cognition, which ensures logical coherence while significantly reducing cognitive load, effectively simulating the cognitive patterns observed in human deep thinking processes. Extensive experimental results demonstrate AR's superior reasoning capabilities without the computational burden of exhaustive solution searches, particularly excelling in linguistic logic puzzles. These findings substantiate AR's effectiveness in enhancing LLMs' capacity for robust, long-sequence logical reasoning and deliberation.

Title: Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning

Authors: Chen Li, Nazhou Liu, Kai Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15952
Pdf URL: https://arxiv.org/pdf/2503.15952
Copy Paste: [[2503.15952]] Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning(https://arxiv.org/abs/2503.15952)
Keywords: llm
Abstract: Since DeepSeek-R1 popularized it, Group Relative Policy Optimization (GRPO) has become the core part of reasoning LLM training. However, we find some deficiencies that affect RL stability and inference efficiency. Thus, we propose Adaptive Group Policy Optimization (AGPO), which contains two simple but effective modifications: a revised advantage estimation method to mitigate zero-variance situations, and a length-based reward that incentivizes the model to avoid overthinking. The experiments demonstrate that our methods achieve more stable training and comparable or superior performance with significantly fewer tokens in reasoning steps.
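The zero-variance situation the abstract refers to can be seen in the usual GRPO group-normalized advantage, where each sampled response's reward is centered and scaled by the statistics of its own group. The sketch below shows only that baseline computation; it uses the population standard deviation and a small `eps`, both of which are implementation assumptions, and it does not attempt to reproduce AGPO's revised estimator, which the abstract does not specify.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Baseline GRPO-style advantage: (r_i - mean) / (std + eps) per group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Mixed outcomes give a usable learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response in the group gets the same reward, the variance is zero
# and every advantage collapses to 0, so the group contributes no gradient.
degenerate = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Groups where all samples are correct or all are wrong are the degenerate case shown last, which is presumably what a revised advantage estimate has to handle.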

Title: InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer

Authors: Tony Zhang, Rickard Brännvall
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.15983
Pdf URL: https://arxiv.org/pdf/2503.15983
Copy Paste: [[2503.15983]] InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer(https://arxiv.org/abs/2503.15983)
Keywords: language model
Abstract: This work explores optimizing transformer-based language models by integrating model compression techniques with inhibitor attention, a novel alternative attention mechanism. Inhibitor attention employs Manhattan distances and ReLU activations instead of the matrix multiplications and softmax activation of the conventional scaled dot-product attention. This shift offers potential computational and energy savings while maintaining model effectiveness. We propose further adjustments to improve the inhibitor mechanism's training efficiency and evaluate its performance on the DistilBERT architecture. Our knowledge distillation experiments indicate that the modified inhibitor transformer model can achieve competitive performance on standard NLP benchmarks, including General Language Understanding Evaluation (GLUE) and sentiment analysis tasks.

Title: ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph

Authors: Langming Liu, Haibin Chen, Yuhao Wang, Yujin Yuan, Shilei Liu, Wenbo Su, Xiangyu Zhao, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15990
Pdf URL: https://arxiv.org/pdf/2503.15990
Copy Paste: [[2503.15990]] ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph(https://arxiv.org/abs/2503.15990)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) have demonstrated their capabilities across various NLP tasks. Their potential in e-commerce is also substantial, evidenced by practical implementations such as platform search, personalized recommendations, and customer service. One primary concern associated with LLMs is their factuality (e.g., hallucination), which is urgent in e-commerce due to its significant impact on user experience and revenue. Although some methods have been proposed to evaluate LLMs' factuality, issues such as lack of reliability, high consumption, and lack of domain expertise leave a gap in effective assessment for e-commerce. To bridge the evaluation gap, we propose ECKGBench, a dataset specifically designed to evaluate the capacities of LLMs in e-commerce knowledge. Specifically, we adopt a standardized workflow to automatically generate questions based on a large-scale knowledge graph, guaranteeing sufficient reliability. We employ the simple question-answering paradigm, substantially improving evaluation efficiency by using minimal input and output tokens. Furthermore, we inject abundant e-commerce expertise in each evaluation stage, including human annotation, prompt design, negative sampling, and verification. Besides, we explore the LLMs' knowledge boundaries in e-commerce from a novel perspective. Through comprehensive evaluations of several advanced LLMs on ECKGBench, we provide meticulous analysis and insights into leveraging LLMs for e-commerce.

Title: Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models

Authors: Mario Sanz-Guerrero, Katharina von der Wense
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16022
Pdf URL: https://arxiv.org/pdf/2503.16022
Copy Paste: [[2503.16022]] Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models(https://arxiv.org/abs/2503.16022)
Keywords: language model, llm, prompt
Abstract: In-context learning (ICL) has transformed the use of large language models (LLMs) for NLP tasks, enabling few-shot learning by conditioning on labeled examples without finetuning. Despite its effectiveness, ICL is prone to errors, especially for challenging examples. With the goal of improving the performance of ICL, we propose corrective in-context learning (CICL), an approach that incorporates a model's incorrect predictions alongside ground truth corrections into the prompt, aiming to enhance classification accuracy through self-correction. However, contrary to our hypothesis, extensive experiments on text classification tasks demonstrate that CICL consistently underperforms standard ICL, with performance degrading as the proportion of corrections in the prompt increases. Our findings indicate that CICL introduces confusion by disrupting the model's task understanding, rather than refining its predictions. Additionally, we observe that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable criterion for effective selection. By presenting these negative results, we provide important insights into the limitations of self-corrective mechanisms in LLMs and offer directions for future research.

Title: The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement

Authors: Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, Deqing Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.16024
Pdf URL: https://arxiv.org/pdf/2503.16024
Copy Paste: [[2503.16024]] The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement(https://arxiv.org/abs/2503.16024)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) have recently transformed from text-based assistants to autonomous agents capable of planning, reasoning, and iteratively improving their actions. While numerical reward signals and verifiers can effectively rank candidate actions, they often provide limited contextual guidance. In contrast, natural language feedback better aligns with the generative capabilities of LLMs, providing richer and more actionable suggestions. However, parsing and implementing this feedback effectively can be challenging for LLM-based agents. In this work, we introduce Critique-Guided Improvement (CGI), a novel two-player framework, comprising an actor model that explores an environment and a critic model that generates detailed natural language feedback. By training the critic to produce fine-grained assessments and actionable revisions, and the actor to utilize these critiques, our approach promotes more robust exploration of alternative strategies while avoiding local optima. Experiments in three interactive environments show that CGI outperforms existing baselines by a substantial margin. Notably, even a small critic model surpasses GPT-4 in feedback quality. The resulting actor achieves state-of-the-art performance, demonstrating the power of explicit iterative guidance to enhance decision-making in LLM-based agents.

Title: Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content

Authors: Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16031
Pdf URL: https://arxiv.org/pdf/2503.16031
Copy Paste: [[2503.16031]] Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content(https://arxiv.org/abs/2503.16031)
Keywords: gpt, chat
Abstract: This paper presents the Deceptive Humor Dataset (DHD), a novel resource for studying humor derived from fabricated claims and misinformation. In an era of rampant misinformation, understanding how humor intertwines with deception is essential. DHD consists of humor-infused comments generated from false narratives, incorporating fabricated claims and manipulated information using the ChatGPT-4o model. Each instance is labeled with a Satire Level, ranging from 1 for subtle satire to 3 for high-level satire and classified into five distinct Humor Categories: Dark Humor, Irony, Social Commentary, Wordplay, and Absurdity. The dataset spans multiple languages including English, Telugu, Hindi, Kannada, Tamil, and their code-mixed variants (Te-En, Hi-En, Ka-En, Ta-En), making it a valuable multilingual benchmark. By introducing DHD, we establish a structured foundation for analyzing humor in deceptive contexts, paving the way for a new research direction that explores how humor not only interacts with misinformation but also influences its perception and spread. We establish strong baselines for the proposed dataset, providing a foundation for future research to benchmark and advance deceptive humor detection models.

Title: Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond

Authors: Yaoyao Yu, Leilei Gan, Yinghao Hu, Bin Wei, Kun Kuang, Fei Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16040
Pdf URL: https://arxiv.org/pdf/2503.16040
Copy Paste: [[2503.16040]] Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond(https://arxiv.org/abs/2503.16040)
Keywords: language model, llm
Abstract: Recently, Test-Time Scaling Large Language Models (LLMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated exceptional capabilities across various domains and tasks, particularly in reasoning. While these models have shown impressive performance on general language tasks, their effectiveness in specialized fields such as law remains unclear. To address this, we present a preliminary evaluation of LLMs in various legal scenarios, covering both Chinese and English legal tasks. Our analysis includes 9 LLMs and 17 legal tasks, with a focus on newly published and more complex challenges such as multi-defendant legal judgments and legal argument reasoning. Our findings indicate that, despite DeepSeek-R1 and OpenAI o1 being among the most powerful models, their legal reasoning capabilities are still lacking. Specifically, these models score below 80% on seven Chinese legal reasoning tasks and below 80% on two English legal reasoning tasks. This suggests that, even among the most advanced reasoning models, legal reasoning abilities remain underdeveloped.

Title: Incomplete Utterance Rewriting with Editing Operation Guidance and Utterance Augmentation

Authors: Zhiyu Cao, Peifeng Li, Yaxin Fan, Qiaoming Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.16043
Pdf URL: https://arxiv.org/pdf/2503.16043
Copy Paste: [[2503.16043]] Incomplete Utterance Rewriting with Editing Operation Guidance and Utterance Augmentation(https://arxiv.org/abs/2503.16043)
Keywords: llm
Abstract: Although existing fashionable generation methods on Incomplete Utterance Rewriting (IUR) can generate coherent utterances, they often result in the inclusion of irrelevant and redundant tokens in rewritten utterances due to their inability to focus on critical tokens in dialogue context. Furthermore, the limited size of the training datasets also contributes to the insufficient training of the IUR model. To address the first issue, we propose a multi-task learning framework EO-IUR (Editing Operation-guided Incomplete Utterance Rewriting) that introduces the editing operation labels generated by sequence labeling module to guide generation model to focus on critical tokens. Furthermore, we introduce a token-level heterogeneous graph to represent dialogues. To address the second issue, we propose a two-dimensional utterance augmentation strategy, namely editing operation-based incomplete utterance augmentation and LLM-based historical utterance augmentation. The experimental results on three datasets demonstrate that our EO-IUR outperforms previous state-of-the-art (SOTA) baselines in both open-domain and task-oriented dialogue. The code will be available at this https URL.

Title: Meta-Learning Neural Mechanisms rather than Bayesian Priors

Authors: Michael Goodale, Salvador Mascarenhas, Yair Lakretz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16048
Pdf URL: https://arxiv.org/pdf/2503.16048
Copy Paste: [[2503.16048]] Meta-Learning Neural Mechanisms rather than Bayesian Priors(https://arxiv.org/abs/2503.16048)
Keywords: language model
Abstract: Children acquire language despite being exposed to several orders of magnitude less data than large language models require. Meta-learning has been proposed as a way to integrate human-like learning biases into neural-network architectures, combining both the structured generalizations of symbolic models with the scalability of neural-network models. But what does meta-learning exactly imbue the model with? We investigate the meta-learning of formal languages and find that, contrary to previous claims, meta-trained models are not learning simplicity-based priors when meta-trained on datasets organised around simplicity. Rather, we find evidence that meta-training imprints neural mechanisms (such as counters) into the model, which function like cognitive primitives for the network on downstream tasks. Most surprisingly, we find that meta-training on a single formal language can provide as much improvement to a model as meta-training on 5000 different formal languages, provided that the formal language incentivizes the learning of useful neural mechanisms. Taken together, our findings provide practical implications for efficient meta-learning paradigms and new theoretical insights into linking symbolic theories and neural mechanisms.

Title: Tuning LLMs by RAG Principles: Towards LLM-native Memory

Authors: Jiale Wei, Shuchi Wu, Ruochen Liu, Xiang Ying, Jingbo Shang, Fangbo Tao
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.16071
Pdf URL: https://arxiv.org/pdf/2503.16071
Copy Paste: [[2503.16071]] Tuning LLMs by RAG Principles: Towards LLM-native Memory(https://arxiv.org/abs/2503.16071)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Memory, additional information beyond the training of large language models (LLMs), is crucial to various real-world applications, such as personal assistants. The two mainstream solutions to incorporate memory into the generation process are long-context LLMs and retrieval-augmented generation (RAG). In this paper, we first systematically compare these two types of solutions on three renovated/new datasets and show that (1) long-context solutions, although more expensive, more easily capture the big picture and better answer queries that require considering the memory as a whole; and (2) when the queries concern specific information, RAG solutions are more competitive, especially when the keywords can be explicitly matched. Therefore, we propose a novel method, RAG-Tuned-LLM, which fine-tunes a relatively small (e.g., 7B) LLM using data generated following the RAG principles, so it can combine the advantages of both solutions. Extensive experiments on three datasets demonstrate that RAG-Tuned-LLM can beat long-context LLMs and RAG methods across a wide range of query types.

Title: Cultural Alignment in Large Language Models Using Soft Prompt Tuning

Authors: Reem I. Masoud, Martin Ferianc, Philip Treleaven, Miguel Rodrigues
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16094
Pdf URL: https://arxiv.org/pdf/2503.16094
Copy Paste: [[2503.16094]] Cultural Alignment in Large Language Models Using Soft Prompt Tuning(https://arxiv.org/abs/2503.16094)
Keywords: language model, llm, prompt
Abstract: Large Language Model (LLM) alignment conventionally relies on supervised fine-tuning or reinforcement-learning-based alignment frameworks. These methods typically require labeled or preference datasets and involve updating model weights to align the LLM with the training objective or reward model. Meanwhile, in social sciences such as cross-cultural studies, factor analysis is widely used to uncover underlying dimensions or latent variables that explain observed patterns in survey data. The non-differentiable nature of these measurements deriving from survey data renders the former alignment methods infeasible for alignment with cultural dimensions. To overcome this, we propose a parameter-efficient strategy that combines soft prompt tuning, which freezes the model parameters while modifying the input prompt embeddings, with Differential Evolution (DE), a black-box optimization method for cases where a differentiable objective is unattainable. This strategy ensures alignment consistency without the need for preference data or model parameter updates, significantly enhancing efficiency and mitigating overfitting. Our method demonstrates significant improvements in Llama-3-8B-Instruct's cultural dimensions across multiple regions, outperforming both the Naive LLM and the In-context Learning (ICL) baseline, and effectively bridges computational models with human cultural nuances.
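To make the black-box optimization step concrete, here is a minimal sketch of Differential Evolution (the common rand/1/bin variant) searching over a small real-valued vector that stands in for soft prompt embeddings. Everything specific here is an assumption for illustration: the function names, the hyperparameters, the hypothetical `TARGET` vector, and the toy `survey_gap` objective, which merely imitates a non-differentiable score and does not query any LLM or survey instrument.

```python
import random

def differential_evolution(objective, dim, pop_size=20, f=0.8, cr=0.9,
                           generations=200, bounds=(-1.0, 1.0), seed=0):
    """Minimize a black-box objective with DE/rand/1/bin (no gradients needed)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated coordinate
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    v = min(hi, max(lo, v))  # clip to the search bounds
                else:
                    v = pop[i][j]
                trial.append(v)
            trial_score = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=lambda i: scores[i])
    return pop[best], scores[best]

# Hypothetical target "cultural dimension" scores; not data from the paper.
TARGET = [0.3, -0.5, 0.1, 0.7]

def survey_gap(vec):
    """Toy non-differentiable objective: rounding makes it piecewise constant."""
    return sum(abs(round(v, 2) - t) for v, t in zip(vec, TARGET))

best_vec, best_score = differential_evolution(survey_gap, dim=4)
print(best_vec, best_score)
```

Because selection is greedy, the best score in the population never gets worse from one generation to the next, which is why DE is usable when the objective can only be evaluated, not differentiated.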

Title: MKG-Rank: Enhancing Large Language Models with Knowledge Graph for Multilingual Medical Question Answering

Authors: Feiyang Li, Yingjian Chen, Haoran Liu, Rui Yang, Han Yuan, Yuang Jiang, Tianxiao Li, Edison Marrese Taylor, Hossein Rouhizadeh, Yusuke Iwasawa, Douglas Teodoro, Yutaka Matsuo, Irene Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16131
Pdf URL: https://arxiv.org/pdf/2503.16131
Copy Paste: [[2503.16131]] MKG-Rank: Enhancing Large Language Models with Knowledge Graph for Multilingual Medical Question Answering(https://arxiv.org/abs/2503.16131)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown remarkable progress in medical question answering (QA), yet their effectiveness remains predominantly limited to English due to imbalanced multilingual training data and scarce medical resources for low-resource languages. To address this critical language gap in medical QA, we propose Multilingual Knowledge Graph-based Retrieval Ranking (MKG-Rank), a knowledge graph-enhanced framework that enables English-centric LLMs to perform multilingual medical QA. Through a word-level translation mechanism, our framework efficiently integrates comprehensive English-centric medical knowledge graphs into LLM reasoning at a low cost, mitigating cross-lingual semantic distortion and achieving precise medical QA across language barriers. To enhance efficiency, we introduce caching and multi-angle ranking strategies to optimize the retrieval process, significantly reducing response times and prioritizing relevant medical knowledge. Extensive evaluations on multilingual medical QA benchmarks across Chinese, Japanese, Korean, and Swahili demonstrate that MKG-Rank consistently outperforms zero-shot LLMs, achieving a maximum 33.89% increase in accuracy, while maintaining an average retrieval time of only 0.0009 seconds.

Title: Automatically Generating Chinese Homophone Words to Probe Machine Translation Estimation Systems

Authors: Shenbin Qian, Constantin Orăsan, Diptesh Kanojia, Félix do Carmo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16158
Pdf URL: https://arxiv.org/pdf/2503.16158
Copy Paste: [[2503.16158]] Automatically Generating Chinese Homophone Words to Probe Machine Translation Estimation Systems(https://arxiv.org/abs/2503.16158)
Keywords: language model, llm
Abstract: Evaluating machine translation (MT) of user-generated content (UGC) involves unique challenges such as checking whether the nuance of emotions from the source is preserved in the target text. Recent studies have proposed emotion-related datasets, frameworks and models to automatically evaluate MT quality of Chinese UGC, without relying on reference translations. However, whether these models are robust to the challenge of preserving emotional nuances has been left largely unexplored. To address this gap, we introduce a novel method inspired by information theory which generates challenging Chinese homophone words related to emotions, by leveraging the concept of self-information. Our approach generates homophones that were observed to cause translation errors in emotion preservation, and exposes vulnerabilities in MT systems and their evaluation methods when tackling emotional UGC. We evaluate the efficacy of our method using human evaluation for the quality of these generated homophones, and compare it with an existing one, showing that our method achieves higher correlation with human judgments. The generated Chinese homophones, along with their manual translations, are utilized to generate perturbations and to probe the robustness of existing quality evaluation models, including models trained using multi-task learning, fine-tuned variants of multilingual language models, as well as large language models (LLMs). Our results indicate that LLMs with larger size exhibit higher stability and robustness to such perturbations. We release our data and code for reproducibility and further research.
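For readers unfamiliar with the term, self-information measures how surprising an outcome is: I(x) = -log2 p(x), so rarer items carry more bits. The sketch below shows only this definition on a toy frequency table. The token names and counts are hypothetical placeholders, and how the paper actually applies self-information to select homophones is not described in the abstract.

```python
import math
from collections import Counter

def self_information(token, counts):
    """Self-information in bits: I(x) = -log2 p(x), with p estimated from counts."""
    total = sum(counts.values())
    return -math.log2(counts[token] / total)

# Hypothetical toy frequency table; placeholder tokens, not data from the paper.
counts = Counter({"common_word": 3, "rare_homophone": 1})

for token in counts:
    print(token, self_information(token, counts))
```

With these counts the rare token has probability 1/4 and therefore carries 2 bits, while the common token carries less than 1 bit.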

Title: Towards Lighter and Robust Evaluation for Retrieval Augmented Generation

Authors: Alex-Razvan Ispas, Charles-Elie Simon, Fabien Caspani, Vincent Guigue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.16161
Pdf URL: https://arxiv.org/pdf/2503.16161
Copy Paste: [[2503.16161]] Towards Lighter and Robust Evaluation for Retrieval Augmented Generation(https://arxiv.org/abs/2503.16161)
Keywords: language model, gpt, llm, hallucination, prompt, retrieval augmented generation
Abstract: Large Language Models are prompting us to view more NLP tasks from a generative perspective. At the same time, they offer a new way of accessing information, mainly through the RAG framework. While there have been notable improvements for the autoregressive models, overcoming hallucination in the generated answers remains a continuous problem. A standard solution is to use commercial LLMs, such as GPT4, to evaluate these algorithms. However, such frameworks are expensive and not very transparent. Therefore, we propose a study which demonstrates the interest of open-weight models for evaluating RAG hallucination. We develop a lightweight approach using smaller, quantized LLMs to provide an accessible and interpretable metric that gives continuous scores for the generated answer with respect to their correctness and faithfulness. This score allows us to question decisions' reliability and explore thresholds to develop a new AUC metric as an alternative to correlation with human judgment.
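The abstract mentions using continuous scores to build an AUC-style metric. As a rough illustration of the underlying idea only, the sketch below computes the standard ROC AUC as the probability that a randomly chosen positive example (for instance, a faithful answer) receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy scores and labels are assumptions; the paper's own metric may be defined differently.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical judge scores and human labels (1 = acceptable answer).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))  # 3 of 4 positive/negative pairs are ordered correctly: 0.75
```

Unlike a single accuracy figure, this summary does not depend on choosing one decision threshold for the continuous score.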

Title: SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Authors: Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16163
Pdf URL: https://arxiv.org/pdf/2503.16163
Copy Paste: [[2503.16163]] SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs(https://arxiv.org/abs/2503.16163)
Keywords: language model, llm
Abstract: Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the sequence length increases, which has become a bottleneck for the application of LLMs on long sequences. Existing KV cache compression methods include eviction, merging, or quantization of the KV cache to reduce its size. However, compression results in irreversible information forgetting, potentially affecting the accuracy of subsequent decoding. In this paper, we propose SpeCache, which takes full advantage of the large and easily expandable CPU memory to offload the complete KV cache, and dynamically fetches KV pairs back in each decoding step based on their importance as measured by a low-bit KV cache copy in VRAM. To avoid inference latency caused by CPU-GPU communication, SpeCache speculatively predicts the KV pairs that the next token might attend to, allowing us to prefetch them before the next decoding step, which enables parallelization of prefetching and computation. Experiments on LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage while avoiding information forgetting for long sequences without re-training, even at a high 10x KV cache compression ratio.

Title: MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion

Authors: Qizhi Pei, Lijun Wu, Zhuoshi Pan, Yu Li, Honglin Lin, Chenlin Ming, Xin Gao, Conghui He, Rui Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.16212
Pdf URL: https://arxiv.org/pdf/2503.16212
Copy Paste: [[2503.16212]] MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion(https://arxiv.org/abs/2503.16212)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown impressive progress in mathematical reasoning. While data augmentation is promising to enhance mathematical problem-solving ability, current approaches are predominantly limited to instance-level modifications, such as rephrasing or generating syntactic variations, which fail to capture and leverage the intrinsic relational structures inherent in mathematical knowledge. Inspired by human learning processes, where mathematical proficiency develops through systematic exposure to interconnected concepts, we introduce MathFusion, a novel framework that enhances mathematical reasoning through cross-problem instruction synthesis. MathFusion implements this through three fusion strategies: (1) sequential fusion, which chains related problems to model solution dependencies; (2) parallel fusion, which combines analogous problems to reinforce conceptual understanding; and (3) conditional fusion, which creates context-aware selective problems to enhance reasoning flexibility. By applying these strategies, we generate a new dataset, MathFusionQA, and then fine-tune models (DeepSeekMath-7B, Mistral-7B, Llama3-8B) on it. Experimental results demonstrate that MathFusion achieves substantial improvements in mathematical reasoning while maintaining high data efficiency, boosting performance by 18.0 points in accuracy across diverse benchmarks while requiring only 45K additional synthetic instructions, representing a substantial improvement over traditional single-instruction approaches. Our datasets, models, and code are publicly available at this https URL.

Title: Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning

Authors: Zhaowei Liu, Xin Guo, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Zixuan Wang, Jiajie Xu, Weige Cai, Ziwei Yang, Xueqian Zhao, Chao Li, Sheng Xu, Dezhi Chen, Yun Chen, Zuo Bai, Liwen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16252
Pdf URL: https://arxiv.org/pdf/2503.16252
Copy Paste: [[2503.16252]] Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning(https://arxiv.org/abs/2503.16252)
Keywords: language model, llm
Abstract: Reasoning large language models are rapidly evolving across various domains. However, their capabilities in handling complex financial tasks still require in-depth exploration. In this paper, we introduce Fin-R1, a reasoning large language model specifically designed for the financial sector. Fin-R1 is built using a two-stage architecture, leveraging a financial reasoning dataset distilled and processed based on DeepSeek-R1. Through supervised fine-tuning (SFT) and reinforcement learning (RL) training, it demonstrates performance close to DeepSeek-R1 with a parameter size of 7 billion across a range of financial reasoning tasks. It achieves the state-of-the-art (SOTA) in the FinQA and ConvFinQA tasks between those LLMs in our evaluation, surpassing larger models in other tasks as well. Fin-R1 showcases strong reasoning and decision-making capabilities, providing solutions to various problems encountered in the financial domain. Our code is available at this https URL.
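Summary: Reasoning large language models are developing rapidly across many domains, but their ability to handle complex financial tasks still needs in-depth exploration. This paper introduces Fin-R1, a reasoning large language model designed specifically for the financial sector. Fin-R1 is built with a two-stage architecture and uses a financial reasoning dataset distilled and processed from DeepSeek-R1. Through supervised fine-tuning (SFT) and reinforcement learning (RL) training, it performs close to DeepSeek-R1 on a range of financial reasoning tasks with 7 billion parameters. Among the LLMs evaluated, it achieves state-of-the-art (SOTA) results on the FinQA and ConvFinQA tasks and also surpasses larger models on other tasks. Fin-R1 shows strong reasoning and decision-making capabilities and provides solutions to various problems encountered in the financial domain. The code is available at this https URL.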

Title: LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates

Authors: Ying Shen, Lifu Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16334
Pdf URL: https://arxiv.org/pdf/2503.16334
Copy Paste: [[2503.16334]] LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates(https://arxiv.org/abs/2503.16334)
Keywords: language model, llm
Abstract: Recent findings reveal that much of the knowledge in a Transformer-based Large Language Model (LLM) is encoded in its feed-forward (FFN) layers, where each FFN layer can be interpreted as the summation of sub-updates, each corresponding to a weighted column vector from the FFN's value parameter matrix that often encodes human-interpretable concepts. In light of this, we hypothesize that model performance and behaviors can be further enhanced and controlled by modulating the contributions of these sub-updates based on their relevance to the input or target output style, and propose LLMBRACES, a novel and efficient method that computes relevance scores associated with value vectors in FFN layers and leverages these scores to dynamically adjust the contribution of sub-updates. By optimizing sub-update contributions, LLMBRACES refines the prediction process, leading to more accurate and reliable outputs, much like a 'brace' providing support and stability. Moreover, LLMBRACES can be extended to support conditional control over generation characteristics, such as sentiment, thereby offering fine-grained steering of LLM outputs. Extensive experiments on various LLMs, including Qwen2.5-1.5B, Llama2-7B, and Llama3-8B, demonstrate that LLMBRACES outperforms baseline approaches in both fine-tuning and zero-shot settings while requiring significantly fewer tunable parameters, up to 75% fewer compared to LoRA. Furthermore, LLMBRACES excels in sentiment-controlled generation and toxicity reduction, highlighting its potential for flexible, controlled text generation across applications.
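Summary: Recent findings show that much of the knowledge in a Transformer-based large language model (LLM) is encoded in its feed-forward (FFN) layers, where each FFN layer can be interpreted as a sum of sub-updates, each corresponding to a weighted column vector from the FFN's value parameter matrix that often encodes a human-interpretable concept. Given this, we hypothesize that model performance and behavior can be further enhanced and controlled by modulating the contributions of these sub-updates according to their relevance to the input or the target output style. We propose LLMBRACES, a novel and efficient method that computes relevance scores associated with the value vectors in FFN layers and uses these scores to dynamically adjust the contribution of each sub-update. By optimizing sub-update contributions, LLMBRACES refines the prediction process and yields more accurate and reliable outputs, much like a 'brace' that provides support and stability. LLMBRACES can also be extended to support conditional control over generation characteristics, such as sentiment, offering fine-grained steering of LLM outputs. Extensive experiments on various LLMs, including Qwen2.5-1.5B, Llama2-7B, and Llama3-8B, show that LLMBRACES outperforms baseline approaches in both fine-tuning and zero-shot settings while requiring significantly fewer tunable parameters, up to 75% fewer than LoRA. LLMBRACES also excels at sentiment-controlled generation and toxicity reduction, highlighting its potential for flexible, controlled text generation across applications.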
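
The view of an FFN layer as a sum of sub-updates, and the idea of rescaling each sub-update by a relevance score, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the ReLU activation, the function names, and in particular the sigmoid dot-product relevance score are hypothetical stand-ins, since the abstract does not specify how LLMBRACES computes relevance.

```python
import math


def ffn_sub_updates(x, keys, values):
    """Decompose one FFN layer into sub-updates a_i * v_i.

    keys[i] is the i-th key vector (a row of the input projection) and
    values[i] is the matching value vector. ReLU is assumed as the activation.
    """
    sub_updates = []
    for k, v in zip(keys, values):
        activation = max(0.0, sum(xi * ki for xi, ki in zip(x, k)))
        sub_updates.append([activation * vj for vj in v])
    return sub_updates


def sum_vectors(vectors, dim):
    """Element-wise sum of a list of vectors of length dim."""
    total = [0.0] * dim
    for vec in vectors:
        for j, val in enumerate(vec):
            total[j] += val
    return total


def relevance_gates(x, values):
    """Hypothetical relevance score: sigmoid of <x, v_i> (not the paper's formula)."""
    return [1.0 / (1.0 + math.exp(-sum(xi * vj for xi, vj in zip(x, v))))
            for v in values]


def modulated_ffn(x, keys, values):
    """FFN output where each sub-update is rescaled by its relevance gate."""
    subs = ffn_sub_updates(x, keys, values)
    gates = relevance_gates(x, values)
    scaled = [[g * s for s in sub] for g, sub in zip(gates, subs)]
    return sum_vectors(scaled, len(x))


# Toy example: 2-dimensional hidden state, 3 key/value pairs.
x = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[2.0, 0.0], [0.0, 3.0], [5.0, 5.0]]

plain = sum_vectors(ffn_sub_updates(x, keys, values), len(x))
modulated = modulated_ffn(x, keys, values)
```

In this toy setup only the first key fires, so the unmodulated output is just the first value vector, and the modulated output is that same vector shrunk by its gate. A learned relevance module would replace `relevance_gates` in practice.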

Title: CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners

Authors: Yunzhi Yao, Jizhan Fang, Jia-Chen Gu, Ningyu Zhang, Shumin Deng, Huajun Chen, Nanyun Peng
Subjects: cs.CL, cs.AI, cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.16356
Pdf URL: https://arxiv.org/pdf/2503.16356
Copy Paste: [[2503.16356]] CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners(https://arxiv.org/abs/2503.16356)
Keywords: language model, llm
Abstract: Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they struggle to generalize these updates to multi-hop reasoning tasks that depend on the modified knowledge. Through an analysis of reasoning circuits -- the neural pathways LLMs use for knowledge-based inference -- we observe that current layer-localized KE approaches, such as MEMIT and WISE, which edit only a single layer or a few model layers, struggle to effectively incorporate updated information into these reasoning pathways. To address this limitation, we propose CaKE (Circuit-aware Knowledge Editing), a novel method that enables more effective integration of updated knowledge in LLMs. CaKE leverages strategically curated data, guided by our circuits-based analysis, that forces the model to utilize the modified knowledge, stimulating the model to develop appropriate reasoning circuits for newly integrated knowledge. Experimental results show that CaKE enables more accurate and consistent use of updated knowledge across related reasoning tasks, leading to an average of 20% improvement in multi-hop reasoning accuracy on the MQuAKE dataset compared to existing KE methods. We release the code and data at this https URL.
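Summary: Knowledge editing (KE) makes it possible to modify outdated or incorrect information in large language models (LLMs). Although existing KE methods can update isolated facts, they struggle to generalize these updates to multi-hop reasoning tasks that depend on the modified knowledge. By analyzing reasoning circuits, the neural pathways LLMs use for knowledge-based inference, we observe that current layer-localized KE methods such as MEMIT and WISE, which edit only one or a few model layers, have difficulty incorporating updated information into these reasoning pathways. To address this limitation, we propose CaKE (Circuit-aware Knowledge Editing), a novel method that integrates updated knowledge into LLMs more effectively. Guided by our circuit-based analysis, CaKE uses strategically curated data that forces the model to use the modified knowledge, stimulating it to develop appropriate reasoning circuits for the newly integrated knowledge. Experimental results show that CaKE enables more accurate and consistent use of updated knowledge across related reasoning tasks, improving multi-hop reasoning accuracy on the MQuAKE dataset by an average of 20% over existing KE methods. The code and data are released at this https URL.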

Title: Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Authors: Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen (Henry) Zhong, Hanjie Chen, Xia Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16419
Pdf URL: https://arxiv.org/pdf/2503.16419
Copy Paste: [[2503.16419]] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models(https://arxiv.org/abs/2503.16419)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon". In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking.
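Summary: Large language models (LLMs) have demonstrated remarkable capabilities on complex tasks. Recent advances in large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains such as mathematics and programming by using supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance chain-of-thought (CoT) reasoning. However, although longer CoT reasoning sequences improve performance, they also introduce significant computational overhead because of verbose and redundant outputs, known as the "overthinking phenomenon". This paper provides the first structured survey that systematically investigates current progress toward efficient reasoning in LLMs. Based on the inherent mechanisms of LLMs, existing work is categorized into several key directions: (1) model-based efficient reasoning, which optimizes full-length reasoning models into more concise ones or directly trains efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; and (3) input prompt-based efficient reasoning, which seeks to improve reasoning efficiency based on input prompt properties such as difficulty or length control. The survey also introduces the use of efficient data for training reasoning models, explores the reasoning capabilities of small language models, and discusses evaluation methods and benchmarking.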