2025-04-02

Title: Medical Reasoning in LLMs: An In-Depth Analysis of DeepSeek R1

Authors: Birger Moell, Fredrik Sand Aronsson, Sanian Akbar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00016
Pdf URL: https://arxiv.org/pdf/2504.00016
Copy Paste: [[2504.00016]] Medical Reasoning in LLMs: An In-Depth Analysis of DeepSeek R1(https://arxiv.org/abs/2504.00016)
Keywords: language model, llm
Abstract: Integrating large language models (LLMs) like DeepSeek R1 into healthcare requires rigorous evaluation of their reasoning alignment with clinical expertise. This study assesses DeepSeek R1's medical reasoning against expert patterns using 100 MedQA clinical cases. The model achieved 93% diagnostic accuracy, demonstrating systematic clinical judgment through differential diagnosis, guideline-based treatment selection, and integration of patient-specific factors. However, error analysis of seven incorrect cases revealed persistent limitations: anchoring bias, challenges reconciling conflicting data, insufficient exploration of alternatives, overthinking, knowledge gaps, and premature prioritization of definitive treatment over intermediate care. Crucially, reasoning length correlated with accuracy - shorter responses (<5,000 characters) were more reliable, suggesting extended explanations may signal uncertainty or rationalization of errors. While DeepSeek R1 exhibits foundational clinical reasoning capabilities, recurring flaws highlight critical areas for refinement, including bias mitigation, knowledge updates, and structured reasoning frameworks. These findings underscore LLMs' potential to augment medical decision-making through artificial reasoning but emphasize the need for domain-specific validation, interpretability safeguards, and confidence metrics (e.g., response length thresholds) to ensure reliability in real-world applications.

Title: ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding

Authors: Indraneil Paul, Haoyi Yang, Goran Glavaš, Kristian Kersting, Iryna Gurevych
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2504.00019
Pdf URL: https://arxiv.org/pdf/2504.00019
Copy Paste: [[2504.00019]] ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding(https://arxiv.org/abs/2504.00019)
Keywords: language model
Abstract: Language models (LMs) have become a staple of the code-writing toolbox. Their pre-training recipe has, however, remained stagnant over recent years, barring the occasional changes in data sourcing and filtering strategies. In particular, research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse, especially compared with corresponding efforts in natural language LMs. In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency. To this end, we compile ObscuraX, a dataset of approximately 55M source and obfuscated code pairs in seven languages. Subsequently, we pre-train ObscuraCoder models, ranging in size from 255M to 2.8B parameters, on a 272B-token corpus that includes ObscuraX and demonstrate that our obfuscation-based pre-training recipe leads to consistent improvements in Code-LMs' abilities compared to both vanilla autoregressive pre-training as well as existing de-obfuscation (DOBF) objectives. ObscuraCoder demonstrates sizeable gains across multiple tests of syntactic and semantic code understanding, along with improved capabilities in multilingual code completion, multilingual code commit summarization, and multi-purpose library-oriented code generation.

Title: Generalization Bias in Large Language Model Summarization of Scientific Research

Authors: Uwe Peters, Benjamin Chin-Yee
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2504.00025
Pdf URL: https://arxiv.org/pdf/2504.00025
Copy Paste: [[2504.00025]] Generalization Bias in Large Language Model Summarization of Scientific Research(https://arxiv.org/abs/2504.00025)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26 to 73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (OR = 4.85, 95% CI [3.06, 7.70]). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy.
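
The odds ratio and confidence interval quoted in the abstract above (OR = 4.85, 95% CI [3.06, 7.70]) are standard statistics for a 2x2 table. The following is a minimal sketch of how such figures can be computed, assuming a Wald (log-scale) interval; the abstract does not say which interval method the authors used. The function name `odds_ratio_ci` and the counts are hypothetical illustrations, not data from the study.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: group 1 with the outcome, b: group 1 without it
    c: group 2 with the outcome, d: group 2 without it
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts (NOT from the paper): 40 of 100 summaries in group 1
# and 10 of 100 in group 2 contain a broad generalization.
or_value, ci_low, ci_high = odds_ratio_ci(40, 60, 10, 90)
print(f"OR = {or_value:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

An interval that excludes 1, as the one reported in the abstract does, indicates that the difference between the two groups is unlikely to be due to chance alone.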

Title: Opioid Named Entity Recognition (ONER-2025) from Reddit

Authors: Muhammad Ahmad, Humaira Farid, Iqra Ameer, Muhammad Muzamil, Ameer Hamza Muhammad Jalal, Ildar Batyrshin, Grigori Sidorov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00027
Pdf URL: https://arxiv.org/pdf/2504.00027
Copy Paste: [[2504.00027]] Opioid Named Entity Recognition (ONER-2025) from Reddit(https://arxiv.org/abs/2504.00027)
Keywords: language model
Abstract: The opioid overdose epidemic remains a critical public health crisis, particularly in the United States, leading to significant mortality and societal costs. Social media platforms like Reddit provide vast amounts of unstructured data that offer insights into public perceptions, discussions, and experiences related to opioid use. This study leverages Natural Language Processing (NLP), specifically Opioid Named Entity Recognition (ONER-2025), to extract actionable information from these platforms. Our research makes four key contributions. First, we created a unique, manually annotated dataset sourced from Reddit, where users share self-reported experiences of opioid use via different administration routes. This dataset contains 331,285 tokens and includes eight major opioid entity categories. Second, we detail our annotation process and guidelines while discussing the challenges of labeling the ONER-2025 dataset. Third, we analyze key linguistic challenges, including slang, ambiguity, fragmented sentences, and emotionally charged language, in opioid discussions. Fourth, we propose a real-time monitoring system to process streaming data from social media, healthcare records, and emergency services to identify overdose events. Using 5-fold cross-validation in 11 experiments, our system integrates machine learning, deep learning, and transformer-based language models with advanced contextual embeddings to enhance understanding. Our transformer-based models (bert-base-NER and roberta-base) achieved 97% accuracy and F1-score, outperforming baselines by 10.23% (RF=0.88).

Title: Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding

Authors: Aayush Gautam, Susav Shrestha, Narasimha Annapareddy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.00030
Pdf URL: https://arxiv.org/pdf/2504.00030
Copy Paste: [[2504.00030]] Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding(https://arxiv.org/abs/2504.00030)
Keywords: language model, llm
Abstract: Speculative decoding accelerates large language model (LLM) inference by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, selecting an optimal speculation length is critical for maximizing speedup while minimizing wasted computation. We introduce GammaTune and GammaTune+, training-free adaptive algorithms that dynamically adjust speculation length based on token acceptance rates using a heuristic-based switching mechanism. Evaluated on SpecBench across multiple tasks and model pairs, our method outperforms other heuristic-based approaches and fixed-length speculative decoding, achieving an average speedup of 15% (±5%) with GammaTune and 16% (±3%) with GammaTune+, while reducing performance variance. This makes GammaTune a robust and efficient solution for real-world deployment.
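
The abstract above describes adjusting the speculation length from token acceptance rates. The sketch below illustrates that general idea only: it is not the GammaTune or GammaTune+ algorithm, whose switching rule the abstract does not specify. The class name, the thresholds, and the exponential smoothing are all assumptions chosen for illustration.

```python
class AdaptiveSpeculationLength:
    """Toy controller (hypothetical, not GammaTune): lengthen the draft when
    most proposed tokens are accepted, shorten it when most are rejected."""

    def __init__(self, gamma=4, min_gamma=1, max_gamma=16,
                 smoothing=0.5, grow_above=0.8, shrink_below=0.4):
        self.gamma = gamma              # current speculation length
        self.min_gamma = min_gamma
        self.max_gamma = max_gamma
        self.smoothing = smoothing      # weight given to the newest observation
        self.grow_above = grow_above    # assumed threshold for lengthening
        self.shrink_below = shrink_below  # assumed threshold for shortening
        self.rate = None                # smoothed acceptance rate

    def update(self, accepted, proposed):
        """Record one verification step and return the next speculation length."""
        if proposed <= 0:
            raise ValueError("proposed must be positive")
        observed = accepted / proposed
        if self.rate is None:
            self.rate = observed
        else:
            self.rate = (self.smoothing * observed
                         + (1 - self.smoothing) * self.rate)
        if self.rate > self.grow_above:
            self.gamma = min(self.gamma + 1, self.max_gamma)
        elif self.rate < self.shrink_below:
            self.gamma = max(self.gamma - 1, self.min_gamma)
        return self.gamma


controller = AdaptiveSpeculationLength(gamma=4)
history = [
    controller.update(accepted=4, proposed=4),  # all accepted: lengthen
    controller.update(accepted=5, proposed=5),  # all accepted: lengthen
    controller.update(accepted=0, proposed=6),  # smoothed rate still moderate
    controller.update(accepted=0, proposed=6),  # smoothed rate now low: shorten
]
print(history)
```

Smoothing the acceptance rate keeps a single rejected draft from immediately shrinking the speculation length, which is one plausible way to limit variance; whether the paper does something similar is not stated in the abstract.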

Title: Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge

Authors: Agam Shah, Liqin Ye, Sebastian Jaskowski, Wei Xu, Sudheer Chava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00042
Pdf URL: https://arxiv.org/pdf/2504.00042
Copy Paste: [[2504.00042]] Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge(https://arxiv.org/abs/2504.00042)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are frequently utilized as sources of knowledge for question-answering. While it is known that LLMs may lack access to real-time data or newer data produced after the model's cutoff date, it is less clear how their knowledge spans across historical information. In this study, we assess the breadth of LLMs' knowledge using financial data of U.S. publicly traded companies by evaluating more than 197k questions and comparing model responses to factual data. We further explore the impact of company characteristics, such as size, retail investment, institutional attention, and readability of financial filings, on the accuracy of knowledge represented in LLMs. Our results reveal that LLMs are less informed about past financial performance, but they display a stronger awareness of larger companies and more recent information. Interestingly, at the same time, our analysis also reveals that LLMs are more likely to hallucinate for larger companies, especially for data from more recent years. We will make the code, prompts, and model outputs public upon the publication of the work.

Title: CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Authors: Jixuan Leng, Chengsong Huang, Langlin Huang, Bill Yuchen Lin, William W. Cohen, Haohan Wang, Jiaxin Huang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.00043
Pdf URL: https://arxiv.org/pdf/2504.00043
Copy Paste: [[2504.00043]] CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation(https://arxiv.org/abs/2504.00043)
Keywords: language model, llm
Abstract: Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly either assess text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.

Title: Multi-Stakeholder Disaster Insights from Social Media Using Large Language Models

Authors: Loris Belcastro, Cristian Cosentino, Fabrizio Marozzo, Merve Gündüz-Cüre, Şule Öztürk-Birim
Subjects: cs.CL, cs.AI, cs.ET, cs.SI
Abstract URL: https://arxiv.org/abs/2504.00046
Pdf URL: https://arxiv.org/pdf/2504.00046
Copy Paste: [[2504.00046]] Multi-Stakeholder Disaster Insights from Social Media Using Large Language Models(https://arxiv.org/abs/2504.00046)
Keywords: language model, gpt, llm, prompt, chat
Abstract: In recent years, social media has emerged as a primary channel for users to promptly share feedback and issues during disasters and emergencies, playing a key role in crisis management. While significant progress has been made in collecting and analyzing social media content, there remains a pressing need to enhance the automation, aggregation, and customization of this data to deliver actionable insights tailored to diverse stakeholders, including the press, police, EMS, and firefighters. This effort is essential for improving the coordination of activities such as relief efforts, resource distribution, and media communication. This paper presents a methodology that leverages the capabilities of LLMs to enhance disaster response and management. Our approach combines classification techniques with generative AI to bridge the gap between raw user feedback and stakeholder-specific reports. Social media posts shared during catastrophic events are analyzed with a focus on user-reported issues, service interruptions, and encountered challenges. We employ full-spectrum LLMs, using analytical models like BERT for precise, multi-dimensional classification of content type, sentiment, emotion, geolocation, and topic. Generative models such as ChatGPT are then used to produce human-readable, informative reports tailored to distinct audiences, synthesizing insights derived from detailed classifications. We compare standard approaches, which analyze posts directly using prompts in ChatGPT, to our advanced method, which incorporates multi-dimensional classification, sub-event selection, and tailored report generation. Our methodology demonstrates superior performance in both quantitative metrics, such as text coherence scores and latent representations, and qualitative assessments by automated tools and field experts, delivering precise insights for diverse disaster response stakeholders.

Title: Distill-C: Enhanced NL2SQL via Distilled Customization with LLMs

Authors: Cong Duy Vu Hoang, Gioacchino Tangari, Clemence Lanfranchi, Dalu Guo, Paul Cayet, Steve Siu, Don Dharmasiri, Yuan-Fang Li, Long Duong, Damien Hilloulin, Rhicheek Patra, Sungpack Hong, Hassan Chafi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00048
Pdf URL: https://arxiv.org/pdf/2504.00048
Copy Paste: [[2504.00048]] Distill-C: Enhanced NL2SQL via Distilled Customization with LLMs(https://arxiv.org/abs/2504.00048)
Keywords: language model, llm
Abstract: The growing adoption of large language models (LLMs) in business applications has amplified interest in Natural Language to SQL (NL2SQL) solutions, in which there is competing demand for high performance and efficiency. Domain- and customer-specific requirements further complicate the problem. To address this conundrum, we introduce Distill-C, a distilled customization framework tailored for NL2SQL tasks. Distill-C utilizes large teacher LLMs to produce high-quality synthetic data through a robust and scalable pipeline. Finetuning smaller and open-source LLMs on this synthesized data enables them to rival or outperform teacher models an order of magnitude larger. Evaluated on multiple challenging benchmarks, Distill-C achieves an average improvement of 36% in execution accuracy compared to the base models from three distinct LLM families. Additionally, on three internal customer benchmarks, Distill-C demonstrates a 22.6% performance improvement over the base models. Our results demonstrate that Distill-C is an effective, high-performing and generalizable approach for deploying lightweight yet powerful NL2SQL models, delivering exceptional accuracies while maintaining low computational cost.

Title: JudgeLRM: Large Reasoning Models as a Judge

Authors: Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00050
Pdf URL: https://arxiv.org/pdf/2504.00050
Copy Paste: [[2504.00050]] JudgeLRM: Large Reasoning Models as a Judge(https://arxiv.org/abs/2504.00050)
Keywords: language model, gpt, llm
Abstract: The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.

Title: Integrating Large Language Models with Human Expertise for Disease Detection in Electronic Health Records

Authors: Jie Pan, Seungwon Lee, Cheligeer Cheligeer, Elliot A. Martin, Kiarash Riazi, Hude Quan, Na Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00053
Pdf URL: https://arxiv.org/pdf/2504.00053
Copy Paste: [[2504.00053]] Integrating Large Language Models with Human Expertise for Disease Detection in Electronic Health Records(https://arxiv.org/abs/2504.00053)
Keywords: language model, llm, prompt
Abstract: Objective: Electronic health records (EHR) are widely available to complement administrative data-based disease surveillance and healthcare performance evaluation. Defining conditions from EHR is labour-intensive and requires extensive manual labelling of disease outcomes. This study developed an efficient strategy based on advanced large language models to identify multiple conditions from EHR clinical notes. Methods: We linked a cardiac registry cohort in 2015 with an EHR system in Alberta, Canada. We developed a pipeline that leveraged a generative large language model (LLM) to analyze, understand, and interpret EHR notes by prompts based on specific diagnosis, treatment management, and clinical guidelines. The pipeline was applied to detect acute myocardial infarction (AMI), diabetes, and hypertension. The performance was compared against clinician-validated diagnoses as the reference standard and widely adopted International Classification of Diseases (ICD) codes-based methods. Results: The study cohort accounted for 3,088 patients and 551,095 clinical notes. The prevalence was 55.4%, 27.7%, and 65.9% for AMI, diabetes, and hypertension, respectively. The performance of the LLM-based pipeline for detecting conditions varied: AMI had 88% sensitivity, 63% specificity, and 77% positive predictive value (PPV); diabetes had 91% sensitivity, 86% specificity, and 71% PPV; and hypertension had 94% sensitivity, 32% specificity, and 72% PPV. Compared with ICD codes, the LLM-based method demonstrated improved sensitivity and negative predictive value across all conditions. The monthly percentage trends from the detected cases by LLM and reference standard showed consistent patterns.
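
The sensitivity, specificity, PPV, and negative predictive value reported in the abstract above all derive from a confusion matrix of predictions against the reference standard. A minimal sketch of those definitions follows; the function name `diagnostic_metrics` and the counts are hypothetical, and the numbers are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic accuracy measures from confusion-matrix counts.

    tp/fn: reference-positive cases flagged / missed by the detector
    tn/fp: reference-negative cases cleared / wrongly flagged
    """
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0 or tn + fn == 0:
        raise ValueError("each metric needs a non-zero denominator")
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly cleared
        "ppv": tp / (tp + fp),          # share of positive calls that are right
        "npv": tn / (tn + fn),          # share of negative calls that are right
    }


# Hypothetical counts (NOT from the paper).
metrics = diagnostic_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike sensitivity and specificity, PPV and NPV depend on prevalence, so the same detector can show different predictive values across conditions with different prevalence in the cohort.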

Title: Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology

Authors: Dou Liu, Ying Long, Sophia Zuoqiu, Tian Tang, Rong Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00061
Pdf URL: https://arxiv.org/pdf/2504.00061
Copy Paste: [[2504.00061]] Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology(https://arxiv.org/abs/2504.00061)
Keywords: language model, gpt, llm, chat
Abstract: Effective physician-patient communications in pre-diagnostic environments, and most specifically in complex and sensitive medical areas such as infertility, are critical but consume a lot of time and, therefore, cause clinic workflows to become inefficient. Recent advancements in Large Language Models (LLMs) offer a potential solution for automating conversational medical history-taking and improving diagnostic accuracy. This study evaluates the feasibility and performance of LLMs in those tasks for infertility cases. An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini. A total of 70 real-world infertility cases were processed, generating 420 diagnostic histories. Model performance was assessed using F1 score, Differential Diagnosis (DDs) Accuracy, and Accuracy of Infertility Type Judgment (ITJ). ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029, p = 0.045, d = 0.244) and demonstrated higher completeness in medical history-taking (97.58% vs. 77.11%), suggesting that ChatGPT-4o-mini is more effective in extracting detailed patient information, which is critical for improving diagnostic accuracy. In contrast, ChatGPT-4o performed slightly better in differential diagnosis accuracy (2.0524 vs. 2.0048, p > 0.05). ITJ accuracy was higher in ChatGPT-4o-mini (0.6476 vs. 0.5905) but with lower consistency (Cronbach's α = 0.562), suggesting variability in classification reliability. Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy. In future studies, expert validation for accuracy and dependability in a clinical setting, AI model fine-tuning, and larger datasets with a mix of cases of infertility have to be prioritized.

Title: Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B

Authors: Aleksandra Bakalova, Yana Veitsman, Xinting Huang, Michael Hahn
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.00132
Pdf URL: https://arxiv.org/pdf/2504.00132
Copy Paste: [[2504.00132]] Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B(https://arxiv.org/abs/2504.00132)
Keywords: language model, llm, prompt
Abstract: In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a few-shot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: In the lower layers, the model builds up representations of individual few-shot examples, which are contextualized by preceding examples through connections between few-shot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.

Title: Does "Reasoning" with Large Language Models Improve Recognizing, Generating, and Reframing Unhelpful Thoughts?

Authors: Yilin Qi, Dong Won Lee, Cynthia Breazeal, Hae Won Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00163
Pdf URL: https://arxiv.org/pdf/2504.00163
Copy Paste: [[2504.00163]] Does "Reasoning" with Large Language Models Improve Recognizing, Generating, and Reframing Unhelpful Thoughts?(https://arxiv.org/abs/2504.00163)
Keywords: language model, gpt, llm
Abstract: Cognitive Reframing, a core element of Cognitive Behavioral Therapy (CBT), helps individuals reinterpret negative experiences by finding positive meaning. Recent advances in Large Language Models (LLMs) have demonstrated improved performance through reasoning-based strategies. This inspires a promising direction of leveraging the reasoning capabilities of LLMs to improve CBT and mental reframing by simulating the process of critical thinking, potentially enabling more effective recognition, generation, and reframing of cognitive distortions. In this work, we investigate the role of various reasoning methods, including pre-trained reasoning LLMs and augmented reasoning strategies such as CoT and self-consistency in enhancing LLMs' ability to perform cognitive reframing tasks. We find that augmented reasoning methods, even when applied to "outdated" LLMs like GPT-3.5, consistently outperform state-of-the-art pretrained reasoning models on recognizing, generating and reframing unhelpful thoughts.

Title: Contradiction Detection in RAG Systems: Evaluating LLMs as Context Validators for Improved Information Consistency

Authors: Vignesh Gokul, Srikanth Tenneti, Alwarappan Nakkiran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00180
Pdf URL: https://arxiv.org/pdf/2504.00180
Copy Paste: [[2504.00180]] Contradiction Detection in RAG Systems: Evaluating LLMs as Context Validators for Improved Information Consistency(https://arxiv.org/abs/2504.00180)
Keywords: language model, llm, prompt, retrieval augmented generation, chain-of-thought
Abstract: Retrieval Augmented Generation (RAG) systems have emerged as a powerful method for enhancing large language models (LLMs) with up-to-date information. However, the retrieval step in RAG can sometimes surface documents containing contradictory information, particularly in rapidly evolving domains such as news. These contradictions can significantly impact the performance of LLMs, leading to inconsistent or erroneous outputs. This study addresses this critical challenge in two ways. First, we present a novel data generation framework to simulate different types of contradictions that may occur in the retrieval stage of a RAG system. Second, we evaluate the robustness of different LLMs in performing as context validators, assessing their ability to detect contradictory information within retrieved document sets. Our experimental results reveal that context validation remains a challenging task even for state-of-the-art LLMs, with performance varying significantly across different types of contradictions. While larger models generally perform better at contradiction detection, the effectiveness of different prompting strategies varies across tasks and model architectures. We find that chain-of-thought prompting shows notable improvements for some models but may hinder performance in others, highlighting the complexity of the task and the need for more robust approaches to context validation in RAG systems.

Title: Insight-RAG: Enhancing LLMs with Insight-Driven Augmentation

Authors: Pouya Pezeshkpour, Estevam Hruschka
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.00187
Pdf URL: https://arxiv.org/pdf/2504.00187
Copy Paste: [[2504.00187]] Insight-RAG: Enhancing LLMs with Insight-Driven Augmentation(https://arxiv.org/abs/2504.00187)
Keywords: language model, llm, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) frameworks have shown significant promise in leveraging external knowledge to enhance the performance of large language models (LLMs). However, conventional RAG methods often retrieve documents based solely on surface-level relevance, leading to many issues: they may overlook deeply buried information within individual documents, miss relevant insights spanning multiple sources, and are not well-suited for tasks beyond traditional question answering. In this paper, we propose Insight-RAG, a novel framework designed to address these issues. In the initial stage of Insight-RAG, instead of using traditional retrieval methods, we employ an LLM to analyze the input query and task, extracting the underlying informational requirements. In the subsequent stage, a specialized LLM -- trained on the document database -- is queried to mine content that directly addresses these identified insights. Finally, by integrating the original query with the retrieved insights, similar to conventional RAG approaches, we employ a final LLM to generate a contextually enriched and accurate response. Using two scientific paper datasets, we created evaluation benchmarks targeting each of the mentioned issues and assessed Insight-RAG against a traditional RAG pipeline. Our results demonstrate that the Insight-RAG pipeline successfully addresses these challenges, outperforming existing methods by a significant margin in most cases. These findings suggest that integrating insight-driven retrieval within the RAG framework not only enhances performance but also broadens the applicability of RAG to tasks beyond conventional question answering.

Title: Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemorcacy

Authors: Rabimba Karanjai, Boris Shor, Amanda Austin, Ryan Kennedy, Yang Lu, Lei Xu, Weidong Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00241
Pdf URL: https://arxiv.org/pdf/2504.00241
Copy Paste: [[2504.00241]] Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemorcacy(https://arxiv.org/abs/2504.00241)
Keywords: language model, llm, prompt
Abstract: This paper investigates the use of Large Language Models (LLMs) to synthesize public opinion data, addressing challenges in traditional survey methods like declining response rates and non-response bias. We introduce a novel technique: role creation based on knowledge injection, a form of in-context learning that leverages RAG and specified personality profiles from the HEXACO model and demographic information, and uses that for dynamically generated prompts. This method allows LLMs to simulate diverse opinions more accurately than existing prompt engineering approaches. We compare our results with pre-trained models with standard few-shot prompts. Experiments using questions from the Cooperative Election Study (CES) demonstrate that our role-creation approach significantly improves the alignment of LLM-generated opinions with real-world human survey responses, increasing answer adherence. In addition, we discuss challenges, limitations and future research directions.

Title: SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

Authors: Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He
Subjects: cs.CL, cs.AI, cs.MA, cs.SE
Abstract URL: https://arxiv.org/abs/2504.00255
Pdf URL: https://arxiv.org/pdf/2504.00255
Copy Paste: [[2504.00255]] SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers(https://arxiv.org/abs/2504.00255)
Keywords: language model, llm, agent
Abstract: This study evaluates large language models (LLMs) in generating code from algorithm descriptions from recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a multi-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies similarity between generated and reference reasoning graphs derived from code comments and structure. For evaluating implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In our experiments, we evaluate various powerful Non-Reasoning LLMs and Reasoning LLMs as foundational models. The best-performing LLM using Sci-Reproducer achieves only 39% execution accuracy, highlighting the benchmark's difficulty. Our analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. We will open-source our benchmark and code at this https URL.

Title: Text Chunking for Document Classification for Urban System Management using Large Language Models

Authors: Joshua Rodriguez (1), Om Sanan (2), Guillermo Vizarreta-Luna (1), Steven A. Conrad (1) ((1) Department of Systems Engineering, Colorado State University, Fort Collins, CO, USA, (2) Scarsdale High School, Scarsdale, NY, USA)
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2504.00274
Pdf URL: https://arxiv.org/pdf/2504.00274
Copy Paste: [[2504.00274]] Text Chunking for Document Classification for Urban System Management using Large Language Models(https://arxiv.org/abs/2504.00274)
Keywords: language model, gpt, llm, prompt
Abstract: Urban systems are managed using complex textual documentation that needs coding and analysis to set requirements and evaluate built environment performance. This paper contributes to the study of applying large-language models (LLM) to qualitative coding activities to reduce resource requirements while maintaining comparable reliability to humans. Qualitative coding and assessment face challenges like resource limitations and bias, accuracy, and consistency between human evaluators. Here we report the application of LLMs to deductively code 10 case documents on the presence of 17 digital twin characteristics for the management of urban systems. We utilize two prompting methods to compare the semantic processing of LLMs with human coding efforts: whole text analysis and text chunk analysis using OpenAI's GPT-4o, GPT-4o-mini, and o1-mini models. We found similar trends of internal variability between methods, and results indicate that LLMs may perform on par with human coders when initialized with specific deductive coding contexts. GPT-4o, o1-mini and GPT-4o-mini showed significant agreement with human raters when employed using a chunking method. The application of both GPT-4o and GPT-4o-mini as an additional rater with three manual raters showed statistically significant agreement across all raters, indicating that the analysis of textual documents is benefited by LLMs. Our findings reveal nuanced sub-themes of LLM application suggesting LLMs follow human memory coding processes where whole-text analysis may introduce multiple meanings. The novel contributions of this paper lie in assessing the performance of OpenAI GPT models and introducing the chunk-based prompting approach, which addresses context aggregation biases by preserving localized context.

Title: Do Large Language Models Exhibit Spontaneous Rational Deception?

Authors: Samuel M. Taylor, Benjamin K. Bergen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00285
Pdf URL: https://arxiv.org/pdf/2504.00285
Copy Paste: [[2504.00285]] Do Large Language Models Exhibit Spontaneous Rational Deception?(https://arxiv.org/abs/2504.00285)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) are effective at deceiving, when prompted to do so. But under what conditions do they deceive spontaneously? Models that demonstrate better performance on reasoning tasks are also better at prompted deception. Do they also increasingly deceive spontaneously in situations where it could be considered rational to do so? This study evaluates spontaneous deception produced by LLMs in a preregistered experimental protocol using tools from signaling theory. A range of proprietary closed-source and open-source LLMs are evaluated using modified 2x2 games (in the style of Prisoner's Dilemma) augmented with a phase in which they can freely communicate to the other agent using unconstrained language. This setup creates an opportunity to deceive, in conditions that vary in how useful deception might be to an agent's rational self-interest. The results indicate that 1) all tested LLMs spontaneously misrepresent their actions in at least some conditions, 2) they are generally more likely to do so in situations in which deception would benefit them, and 3) models exhibiting better reasoning capacity overall tend to deceive at higher rates. Taken together, these results suggest a tradeoff between LLM reasoning capability and honesty. They also provide evidence of reasoning-like behavior in LLMs from a novel experimental configuration. Finally, they reveal certain contextual factors that affect whether LLMs will deceive or not. We discuss consequences for autonomous, human-facing systems driven by LLMs both now and as their reasoning capabilities continue to improve.

Title: Do Chinese models speak Chinese languages?

Authors: Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2504.00289
Pdf URL: https://arxiv.org/pdf/2504.00289
Copy Paste: [[2504.00289]] Do Chinese models speak Chinese languages?(https://arxiv.org/abs/2504.00289)
Keywords: llm
Abstract: The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they speak the same languages as Western models? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy. To test whether Chinese LLMs today reflect an agenda about China's languages, we test performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with Western models', with the sole exception being better Mandarin. Sometimes, Chinese models cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur, even though they are good at French and German. These results provide a window into current development priorities, suggest options for future development, and indicate guidance for end users.

Title: Detecting and Mitigating Bias in LLMs through Knowledge Graph-Augmented Training

Authors: Rajeev Kumar, Harishankar Kumar, Kumari Shalini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00310
Pdf URL: https://arxiv.org/pdf/2504.00310
Copy Paste: [[2504.00310]] Detecting and Mitigating Bias in LLMs through Knowledge Graph-Augmented Training(https://arxiv.org/abs/2504.00310)
Keywords: language model, llm
Abstract: Large language models have revolutionized natural language processing with their surprising capability to understand and generate human-like text. However, many of these models inherit and further amplify the biases present in their training data, raising ethical and fairness concerns. The detection and mitigation of such biases are vital to ensuring that LLMs act responsibly and equitably across diverse domains. This work investigates Knowledge Graph-Augmented Training (KGAT) as a novel method to mitigate bias in LLMs. Using structured domain-specific knowledge from real-world knowledge graphs, we improve the understanding of the model and reduce biased output. Public datasets for bias assessment include Gender Shades, Bias in Bios, and FairFace, while metrics such as demographic parity and equal opportunity facilitate rigorous detection. We also performed targeted mitigation strategies to correct biased associations, leading to a significant drop in biased output and improved bias metrics. Equipped with real-world datasets and knowledge graphs, our framework is both scalable and effective, paving the way toward responsible deployment in sensitive and high-stakes applications.
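
The two fairness metrics named in the abstract above have widely used textbook definitions, sketched below in Python. This is a minimal illustration and not the paper's evaluation code: the function names, the binary 0/1 encoding of predictions and labels, and the choice to report the gap as max minus min across groups are all assumptions made here.

```python
from collections import defaultdict

def demographic_parity_gap(preds, groups):
    """Largest difference in positive-prediction rate between any two groups.

    preds: 0/1 predictions; groups: group label per example (assumed encoding).
    """
    by_group = defaultdict(list)
    for p, g in zip(preds, groups):
        by_group[g].append(p)
    rates = [sum(v) / len(v) for v in by_group.values()]
    return max(rates) - min(rates)

def equal_opportunity_gap(preds, labels, groups):
    """Largest difference in true-positive rate between any two groups.

    Only examples whose true label is 1 count; groups with no positive
    examples are skipped because their rate is undefined.
    """
    by_group = defaultdict(list)
    for p, y, g in zip(preds, labels, groups):
        if y == 1:
            by_group[g].append(p)
    tprs = [sum(v) / len(v) for v in by_group.values()]
    return max(tprs) - min(tprs)

# Toy data: two groups of four examples each.
preds = [1, 1, 0, 0, 1, 0, 0, 0]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
dp_gap = demographic_parity_gap(preds, groups)
eo_gap = equal_opportunity_gap(preds, labels, groups)
```

A gap of 0 means the groups are treated identically under that metric; larger gaps indicate stronger disparity.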

Title: VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation

Authors: Hoang Hai Phan, Nguyen Duc Minh Vu, Nam Dang Phuong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00339
Pdf URL: https://arxiv.org/pdf/2504.00339
Copy Paste: [[2504.00339]] VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation(https://arxiv.org/abs/2504.00339)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet faces challenges with low-resource language pairs like Vietnamese-Japanese (Vi-Ja). Issues include sparse parallel data and handling linguistic/cultural nuances. Recent progress in Large Language Models (LLMs) with strong reasoning, often refined via Reinforcement Learning (RL), enables high-quality synthetic data generation. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy using advanced LLMs with Chain-of-Thought prompting for challenging segments identified via corpus analysis. Subsequently, we employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B parameter Sailor model, which is based on the Qwen architecture) to create a practical and high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines.

Title: Leveraging Large Language Models for Automated Definition Extraction with TaxoMatic A Case Study on Media Bias

Authors: Timo Spinde, Luyang Lin, Smi Hinterreiter, Isao Echizen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00343
Pdf URL: https://arxiv.org/pdf/2504.00343
Copy Paste: [[2504.00343]] Leveraging Large Language Models for Automated Definition Extraction with TaxoMatic A Case Study on Media Bias(https://arxiv.org/abs/2504.00343)
Keywords: language model, llm
Abstract: This paper introduces TaxoMatic, a framework that leverages large language models to automate definition extraction from academic literature. Focusing on the media bias domain, the framework encompasses data collection, LLM-based relevance classification, and extraction of conceptual definitions. Evaluated on a dataset of 2,398 manually rated articles, the study demonstrates the framework's effectiveness, with Claude-3-sonnet achieving the best results in both relevance classification and definition extraction. Future directions include expanding datasets and applying TaxoMatic to additional domains.

Title: When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)

Authors: Mahak Agarwal, Divyam Khanna
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00374
Pdf URL: https://arxiv.org/pdf/2504.00374
Copy Paste: [[2504.00374]] When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)(https://arxiv.org/abs/2504.00374)
Keywords: language model, llm, agent
Abstract: In many real-world scenarios, a single Large Language Model (LLM) may encounter contradictory claims -- some accurate, others forcefully incorrect -- and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same LLM architecture serves as judge. We introduce the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice. Our experiments on five open-source LLMs (3B-14B parameters), where we systematically vary agent verbosity (30-300 words), reveal that even smaller models can craft persuasive arguments that override truthful answers, often with high confidence. These findings underscore the importance of robust calibration and adversarial testing to prevent LLMs from confidently endorsing misinformation.

Title: VerifiAgent: a Unified Verification Agent in Language Model Reasoning

Authors: Jiuzhou Han, Wray Buntine, Ehsan Shareghi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00406
Pdf URL: https://arxiv.org/pdf/2504.00406
Copy Paste: [[2504.00406]] VerifiAgent: a Unified Verification Agent in Language Model Reasoning(https://arxiv.org/abs/2504.00406)
Keywords: language model, agent
Abstract: Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational resources and lacking scalability across diverse reasoning tasks. To address these limitations, we propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification, which assesses completeness and consistency in model responses, and tool-based adaptive verification, where VerifiAgent autonomously selects appropriate verification tools based on the reasoning type, including mathematical, logical, or commonsense reasoning. This adaptive approach ensures both efficiency and robustness across different verification scenarios. Experimental results show that VerifiAgent outperforms baseline verification methods (e.g., deductive verifier, backward verifier) among all reasoning tasks. Additionally, it can further enhance reasoning accuracy by leveraging feedback from verification results. VerifiAgent can also be effectively applied to inference scaling, achieving better results with fewer generated samples and costs compared to existing process reward models in the mathematical reasoning domain. Code is available at this https URL

Title: Semantic Mastery: Enhancing LLMs with Advanced Natural Language Understanding

Authors: Mohanakrishnan Hariharan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00409
Pdf URL: https://arxiv.org/pdf/2504.00409
Copy Paste: [[2504.00409]] Semantic Mastery: Enhancing LLMs with Advanced Natural Language Understanding(https://arxiv.org/abs/2504.00409)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) have greatly improved their capability in performing NLP tasks. However, deeper semantic understanding, contextual coherence, and more subtle reasoning are still difficult to obtain. The paper discusses state-of-the-art methodologies that advance LLMs with more advanced NLU techniques, such as semantic parsing, knowledge integration, and contextual reinforcement learning. We analyze the use of structured knowledge graphs, retrieval-augmented generation (RAG), and fine-tuning strategies that match models with human-level understanding. Furthermore, we address the incorporation of transformer-based architectures, contrastive learning, and hybrid symbolic-neural methods that address problems like hallucinations, ambiguity, and inconsistency in the factual perspectives involved in performing complex NLP tasks, such as question answering, text summarization, and dialogue generation. Our findings show the importance of semantic precision for enhancing AI-driven language systems and suggest future research directions to bridge the gap between statistical language models and true natural language understanding.

Title: Multimodal LLMs for OCR, OCR Post-Correction, and Named Entity Recognition in Historical Documents

Authors: Gavin Greif, Niclas Griesshaber, Robin Greif
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2504.00414
Pdf URL: https://arxiv.org/pdf/2504.00414
Copy Paste: [[2504.00414]] Multimodal LLMs for OCR, OCR Post-Correction, and Named Entity Recognition in Historical Documents(https://arxiv.org/abs/2504.00414)
Keywords: language model, llm
Abstract: We explore how multimodal Large Language Models (mLLMs) can help researchers transcribe historical documents, extract relevant historical information, and construct datasets from historical sources. Specifically, we investigate the capabilities of mLLMs in performing (1) Optical Character Recognition (OCR), (2) OCR Post-Correction, and (3) Named Entity Recognition (NER) tasks on a set of city directories published in German between 1754 and 1870. First, we benchmark the off-the-shelf transcription accuracy of both mLLMs and conventional OCR models. We find that the best-performing mLLM model significantly outperforms conventional state-of-the-art OCR models and other frontier mLLMs. Second, we are the first to introduce multimodal post-correction of OCR output using mLLMs. We find that this novel approach leads to a drastic improvement in transcription accuracy and consistently produces highly accurate transcriptions (<1% CER), without any image pre-processing or model fine-tuning. Third, we demonstrate that mLLMs can efficiently recognize entities in transcriptions of historical documents and parse them into structured dataset formats. Our findings provide early evidence for the long-term potential of mLLMs to introduce a paradigm shift in the approaches to historical data collection and document transcription.
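
The "<1% CER" figure in the abstract above refers to character error rate, which is commonly computed as the character-level edit distance between a transcription and its reference, divided by the reference length. The Python sketch below assumes that standard Levenshtein-based definition; the abstract does not state the paper's exact normalization or text preprocessing, and the function names are chosen here for illustration.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `hyp` into `ref` (classic dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop a reference character
                            curr[j - 1] + 1,      # drop a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)

# One wrong character in a four-character reference.
example = cer("abcd", "abxd")
```

Because insertions count as errors, CER can exceed 1.0 when a transcription is much longer than its reference.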

Title: Memorizing is Not Enough: Deep Knowledge Injection Through Reasoning

Authors: Ruoxi Xu, Yunjie Ji, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Ben He, Yingfei Sun, Xiangang Li, Le Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00472
Pdf URL: https://arxiv.org/pdf/2504.00472
Copy Paste: [[2504.00472]] Memorizing is Not Enough: Deep Knowledge Injection Through Reasoning(https://arxiv.org/abs/2504.00472)
Keywords: language model, llm
Abstract: Although large language models (LLMs) excel in knowledge recall and reasoning, their static nature leads to outdated information as the real world evolves or when adapting to domain-specific knowledge, highlighting the need for effective knowledge injection. However, current research on knowledge injection remains superficial, mainly focusing on knowledge memorization and retrieval. This paper proposes a four-tier knowledge injection framework that systematically defines the levels of knowledge injection: memorization, retrieval, reasoning, and association. Based on this framework, we introduce DeepKnowledge, a synthetic experimental testbed designed for fine-grained evaluation of the depth of knowledge injection across three knowledge types (novel, incremental, and updated). We then explore various knowledge injection scenarios and evaluate the depth of knowledge injection for each scenario on the benchmark. Experimental results reveal key factors to reach each level of knowledge injection for LLMs and establish a mapping between the levels of knowledge injection and the corresponding suitable injection methods, aiming to provide a comprehensive approach for efficient knowledge injection across various levels.

Title: Making Large Language Models Better Reasoners with Orchestrated Streaming Experiences

Authors: Xiangyang Liu, Junliang He, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00473
Pdf URL: https://arxiv.org/pdf/2504.00473
Copy Paste: [[2504.00473]] Making Large Language Models Better Reasoners with Orchestrated Streaming Experiences(https://arxiv.org/abs/2504.00473)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) can perform complex reasoning by generating intermediate thoughts under zero-shot or few-shot settings. However, zero-shot prompting always encounters low performance, and the superior performance of few-shot prompting hinges on manually crafted demonstrations. In this paper, we present RoSE (Reasoning with Orchestrated Streaming Experiences), a general framework for solving reasoning tasks that can self-improve without complex external efforts. To enable RoSE, we describe an architecture that extends an LLM to store all answered questions and their thoughts in a streaming experience pool, then orchestrates helpful questions from the pool to assist in answering new questions. To set up a question-aware orchestration mechanism, RoSE first calculates the similarity of each question in the pool with a new test question. Since the solution to each answered question is not always correct, RoSE sorts the questions according to their similarity with the new question, and then uniformly divides them into multiple buckets. It finally extracts one question from each bucket to make the extracted questions more diverse. To make these extracted questions help RoSE answer new questions as much as possible, we introduce two further attributes for each question: uncertainty and complexity. RoSE preferentially selects the questions with low uncertainty and high complexity from each bucket. We evaluate the versatility of RoSE in various reasoning tasks, LLMs, and CoT methods.
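The selection procedure described in the RoSE abstract (sort by similarity, split uniformly into buckets, pick one question per bucket) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Experience` record, the precomputed `similarity`, `uncertainty`, and `complexity` scores, and the tie-breaking rule (lowest uncertainty first, then highest complexity) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    """Hypothetical record for one answered question in the experience pool."""
    question: str
    similarity: float   # similarity to the new question (assumed precomputed)
    uncertainty: float  # lower is preferred
    complexity: float   # higher is preferred


def select_experiences(pool: List[Experience], num_buckets: int) -> List[Experience]:
    """Pick one experience per similarity bucket.

    Assumption: within a bucket, prefer the lowest uncertainty and break
    ties by the highest complexity.
    """
    if not pool or num_buckets <= 0:
        return []
    # Sort from most to least similar to the new question.
    ranked = sorted(pool, key=lambda e: e.similarity, reverse=True)
    num_buckets = min(num_buckets, len(ranked))
    # Split into near-equal contiguous buckets.
    base, extra = divmod(len(ranked), num_buckets)
    selected, start = [], 0
    for i in range(num_buckets):
        size = base + (1 if i < extra else 0)
        bucket = ranked[start:start + size]
        start += size
        selected.append(min(bucket, key=lambda e: (e.uncertainty, -e.complexity)))
    return selected


pool = [
    Experience("a", 0.9, 0.5, 1.0),
    Experience("b", 0.8, 0.1, 2.0),
    Experience("c", 0.3, 0.2, 1.0),
    Experience("d", 0.1, 0.2, 5.0),
]
chosen = [e.question for e in select_experiences(pool, num_buckets=2)]
```

Taking one item per bucket keeps the demonstrations spread across the similarity range, which is the diversity argument the abstract makes.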

Title: Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models

Authors: Yilong Xu, Jinhua Gao, Xiaoming Yu, Yuanhai Xue, Baolong Bi, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00573
Pdf URL: https://arxiv.org/pdf/2504.00573
Copy Paste: [[2504.00573]] Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models(https://arxiv.org/abs/2504.00573)
Keywords: language model
Abstract: Retrieval-Augmented Language Models boost task performance, owing to the retriever that provides external knowledge. Although crucial, the retriever primarily focuses on semantic relevance, which may not always be effective for generation. Thus, utility-based retrieval has emerged as a promising topic, prioritizing passages that provide valid benefits for downstream tasks. However, due to insufficient understanding, capturing passage utility accurately remains unexplored. This work proposes SCARLet, a framework for training utility-based retrievers in RALMs, which incorporates two key factors, multi-task generalization and inter-passage interaction. First, SCARLet constructs shared context on which training data for various tasks is synthesized. This mitigates semantic bias from context differences, allowing retrievers to focus on learning task-specific utility for better task generalization. Next, SCARLet uses a perturbation-based attribution method to estimate passage-level utility for shared context, which reflects interactions between passages and provides more accurate feedback. We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain, showing that retrievers trained by SCARLet consistently improve the overall performance of RALMs.

Title: Enhancing Negation Awareness in Universal Text Embeddings: A Data-efficient and Computational-efficient Approach

Authors: Hongliu Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00584
Pdf URL: https://arxiv.org/pdf/2504.00584
Copy Paste: [[2504.00584]] Enhancing Negation Awareness in Universal Text Embeddings: A Data-efficient and Computational-efficient Approach(https://arxiv.org/abs/2504.00584)
Keywords: language model
Abstract: Negation plays an important role in various natural language processing tasks such as Natural Language Inference and Sentiment Analysis. Numerous prior studies have found that contextual text embedding models such as BERT, ELMo, RoBERTa or XLNet face challenges in accurately understanding negation. Recent advancements in universal text embeddings have demonstrated superior performance over contextual text embeddings in various tasks. However, due to the bias in popular evaluation benchmarks, the negation awareness capacity of these models remains unclear. To bridge the gap in existing literature, an in-depth analysis is initiated in this work to study the negation awareness of cutting-edge universal text embedding models. Our findings reveal a significant lack of negation awareness in these models, which often interpret negated text pairs as semantically similar. Because different tasks need different trade-offs between topic and negation information, among other semantic information, a data-efficient and computational-efficient embedding re-weighting method is proposed that does not modify the parameters of text embedding models. The proposed solution is able to improve text embedding models' negation awareness significantly on both simple and complex negation understanding tasks. Furthermore, the proposed solution can also significantly improve the negation awareness of Large Language Model-based, task-specific, high-dimensional universal text embeddings.

Title: Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

Authors: Weizhi Wang, Yu Tian, Linjie Yang, Heng Wang, Xifeng Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00595
Pdf URL: https://arxiv.org/pdf/2504.00595
Copy Paste: [[2504.00595]] Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources(https://arxiv.org/abs/2504.00595)
Keywords: language model, llm
Abstract: The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 442 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. The Open-Qwen2VL pre-training is conducted on academic-level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of the 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.
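As a quick sanity check on the token budget quoted in the Open-Qwen2VL abstract, the ratio of 5B to 1.4T tokens can be recomputed directly. The variable names below are illustrative only; the two token counts are the figures stated in the abstract.

```python
# Token counts as stated in the abstract.
open_qwen2vl_tokens = 5e9   # 5B packed multimodal tokens
qwen2_vl_tokens = 1.4e12    # 1.4T multimodal pre-training tokens

# Share of the Qwen2-VL token budget, expressed as a percentage.
ratio_percent = 100 * open_qwen2vl_tokens / qwen2_vl_tokens
print(f"{ratio_percent:.2f}%")  # prints "0.36%"
```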

Title: On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation

Authors: Jirui Qi, Raquel Fernández, Arianna Bisazza
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00597
Pdf URL: https://arxiv.org/pdf/2504.00597
Copy Paste: [[2504.00597]] On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation(https://arxiv.org/abs/2504.00597)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering (QA) tasks by leveraging relevant passages retrieved from corpora. In multilingual RAG (mRAG), the retrieved passages can be written in languages other than that of the query entered by the user, making it challenging for LLMs to effectively utilize the provided information. Recent research suggests that retrieving passages from multilingual corpora can improve RAG performance, particularly for low-resource languages. However, the extent to which LLMs can leverage different kinds of multilingual contexts to generate accurate answers, *independently from retrieval quality*, remains understudied. In this paper, we conduct an extensive assessment of LLMs' ability to (i) make consistent use of a relevant passage regardless of its language, (ii) respond in the expected language, and (iii) focus on the relevant passage even when multiple 'distracting' passages in different languages are provided in the context. Our experiments with four LLMs across three QA datasets covering a total of 48 languages reveal a surprising ability of LLMs to extract the relevant information from out-language passages, but a much weaker ability to formulate a full answer in the correct language. Our analysis, based on both accuracy and feature attribution techniques, further shows that distracting passages negatively impact answer quality regardless of their language. However, distractors in the query language exert a slightly stronger influence. Taken together, our findings deepen the understanding of how LLMs utilize context in mRAG systems, providing directions for future improvements.

Title: Efficient Construction of Model Family through Progressive Training Using Model Expansion

Authors: Kazuki Yano, Sho Takase, Sosuke Kobayashi, Shun Kiyono, Jun Suzuki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00623
Pdf URL: https://arxiv.org/pdf/2504.00623
Copy Paste: [[2504.00623]] Efficient Construction of Model Family through Progressive Training Using Model Expansion(https://arxiv.org/abs/2504.00623)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) gain widespread practical application, providing a model family with different parameter sizes has become standard practice to address diverse computational requirements. Conventionally, each model in a family is trained independently, resulting in computational costs that scale additively with the number of models. We propose an efficient method for constructing the model family through progressive training, where smaller models are incrementally expanded to larger sizes to create a complete model family. Through extensive experiments with a model family ranging from 1B to 8B parameters, we demonstrate that our method reduces computational costs by approximately 25% while maintaining comparable performance to independently trained models. Furthermore, by strategically adjusting maximum learning rates based on model size, our method outperforms independent training across various metrics. Beyond performance gains, our approach offers an additional advantage: models in our family tend to yield more consistent behavior across different model sizes.

Title: DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism

Authors: Dengchun Li, Naizheng Wang, Zihao Zhang, Haoyang Yin, Lei Duan, Meng Xiao, Mingjie Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00661
Pdf URL: https://arxiv.org/pdf/2504.00661
Copy Paste: [[2504.00661]] DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism(https://arxiv.org/abs/2504.00661)
Keywords: language model, llm
Abstract: Instruction-based fine-tuning of large language models (LLMs) has achieved remarkable success in various natural language processing (NLP) tasks. Parameter-efficient fine-tuning (PEFT) methods, such as Mixture of LoRA Experts (MoLE), combine the efficiency of Low-Rank Adaptation (LoRA) with the versatility of Mixture of Experts (MoE) models, demonstrating significant potential for handling multiple downstream tasks. However, the existing routing mechanisms for MoLE often involve a trade-off between computational efficiency and predictive accuracy, and they fail to fully address the diverse expert selection demands across different transformer layers. In this work, we propose DynMoLE, a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router's probability distribution. This approach mitigates router uncertainty, enhances stability, and promotes more equitable expert participation, leading to faster convergence and improved model performance. Additionally, we introduce an auxiliary loss based on Tsallis entropy to further guide the model toward convergence with reduced uncertainty, thereby improving training stability and performance. Our extensive experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements, outperforming LoRA by 9.6% and surpassing the state-of-the-art MoLE method, MoLA, by 2.3%. We also conduct a comprehensive ablation study to evaluate the contributions of DynMoLE's key components.
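To make the role of Tsallis entropy in DynMoLE's routing concrete, here is a small sketch. The entropy formula S_q(p) = (1 - sum_i p_i^q) / (q - 1) is the standard definition; everything else is an assumption for illustration, since the abstract does not spell out the routing rule. In particular, the `threshold`, and the choice to use top-k when the router is confident and top-p when it is uncertain, are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tsallis_entropy(probs: List[float], q: float = 2.0) -> float:
    """Tsallis entropy; reduces to Shannon entropy in the limit q -> 1."""
    if abs(q - 1.0) < 1e-9:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def route(logits: List[float], q: float = 2.0, threshold: float = 0.5,
          top_k: int = 2, top_p: float = 0.9) -> List[int]:
    """Hypothetical hybrid rule: top-k if entropy is low, otherwise top-p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if tsallis_entropy(probs, q) <= threshold:
        return order[:top_k]  # confident router: fixed number of experts
    chosen, mass = [], 0.0
    for i in order:           # uncertain router: cover top_p probability mass
        chosen.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return chosen


confident = route([10.0, 0.0, 0.0, 0.0])
uncertain = route([0.0, 0.0, 0.0, 0.0])
```

With a uniform distribution over four experts and q = 2, the entropy is 1 - 4 * (1/4)^2 = 0.75, so the sketch falls back to top-p and activates all four experts; a sharply peaked distribution has entropy near zero and activates only two.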

Title: Do LLMs Surpass Encoders for Biomedical NER?

Authors: Motasem S Obeidat, Md Sultan Al Nahian, Ramakanth Kavuluru
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00664
Pdf URL: https://arxiv.org/pdf/2504.00664
Copy Paste: [[2504.00664]] Do LLMs Surpass Encoders for Biomedical NER?(https://arxiv.org/abs/2504.00664)
Keywords: language model, llm
Abstract: Recognizing spans of biomedical concepts and their types (e.g., drug or gene) in free text, often called biomedical named entity recognition (NER), is a basic component of information extraction (IE) pipelines. Without a strong NER component, other applications, such as knowledge discovery and information retrieval, are not practical. The state of the art in NER has shifted from traditional ML models to deep neural networks, with transformer-based encoder models (e.g., BERT) emerging as the current standard. However, decoder models (also called large language models or LLMs) are gaining traction in IE. But LLM-driven NER often ignores positional information due to the generative nature of decoder models. Furthermore, they are computationally very expensive (both in inference time and hardware needs). Hence, it is worth exploring whether they actually excel at biomedical NER and assessing any associated trade-offs (performance vs. efficiency). This is exactly what we do in this effort, employing the same BIO entity tagging scheme (that retains positional information) using five different datasets with varying proportions of longer entities. Our results show that the LLMs chosen (Mistral and Llama: 8B range) often outperform the best encoder models (BERT-(un)cased, BiomedBERT, and DeBERTav3: 300M range) by 2-8% in F-scores except for one dataset, where they equal encoder performance. This gain is more prominent among longer entities of length >= 3 tokens. However, LLMs are one to two orders of magnitude more expensive at inference time and may need cost-prohibitive hardware. Thus, when performance differences are small or real-time user feedback is needed, encoder models might still be more suitable than LLMs.
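The BIO tagging scheme mentioned in the abstract keeps positional information by labelling every token as Beginning, Inside, or Outside an entity. The sketch below shows the generic scheme only, not the paper's pipeline; the function names, the token-index span format `(start, end_exclusive, type)`, and the example sentence are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity type), token indices


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert entity spans to one BIO tag per token (spans assumed non-overlapping)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from BIO tags; stray I- tags without a B- are ignored."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["Patients", "received", "acetylsalicylic", "acid", "daily"]
tags = spans_to_bio(tokens, [(2, 4, "Drug")])
recovered = bio_to_spans(tags)
```

Because each tag is tied to a token position, a multi-token entity such as the two-token drug name above is recovered with exact boundaries, which is the property the abstract contrasts with free-form generative output.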

Title: GLiNER-biomed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition

Authors: Anthony Yazdani, Ihor Stepanov, Douglas Teodoro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00676
Pdf URL: https://arxiv.org/pdf/2504.00676
Copy Paste: [[2504.00676]] GLiNER-biomed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition(https://arxiv.org/abs/2504.00676)
Keywords: language model, llm
Abstract: Biomedical named entity recognition (NER) presents unique challenges due to specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types or efficiently adapt to emerging concepts. To address these issues, we introduce GLiNER-biomed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models specifically tailored for biomedical NER. In contrast to conventional approaches, GLiNER uses natural language descriptions to infer arbitrary entity types, enabling zero-shot recognition. Our approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, enabling the generation of high-coverage synthetic biomedical NER data. We subsequently train two GLiNER architectures, uni- and bi-encoder, at multiple scales to balance computational efficiency and recognition performance. Evaluations on several biomedical datasets demonstrate that GLiNER-biomed outperforms state-of-the-art GLiNER models in both zero- and few-shot scenarios, achieving 5.96% improvement in F1-score over the strongest baseline. Ablation studies highlight the effectiveness of our synthetic data generation strategy and emphasize the complementary benefits of synthetic biomedical pre-training combined with fine-tuning on high-quality general-domain annotations. All datasets, models, and training pipelines are publicly available at this https URL.

Title: ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection

Authors: Xiaoxuan Zhu, Zhouhong Gu, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00695
Pdf URL: https://arxiv.org/pdf/2504.00695
Copy Paste: [[2504.00695]] ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection(https://arxiv.org/abs/2504.00695)
Keywords: language model, llm
Abstract: Pre-training large language models (LLMs) necessitates enormous diverse textual corpora, making effective data selection a key challenge for balancing computational resources and model performance. Current methodologies primarily emphasize data quality metrics and mixing proportions, yet they fail to adequately capture the underlying semantic connections between training samples and quality disparities within individual domains. We introduce ToReMi (Topic-based Reweighting for Model improvement), a novel two-stage framework that dynamically adjusts training sample weights according to their topical associations and observed learning patterns. Our comprehensive experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches, demonstrating accelerated perplexity reduction across multiple domains and enhanced capabilities on downstream evaluation tasks. Code is available at this https URL.

Title: Command A: An Enterprise-Ready Large Language Model

Authors: Team Cohere, Aakanksha, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, Raphaël Avalos, Zahara Aviv, Sammie Bae, Saurabh Baji, Alexandre Barbet, Max Bartolo, Björn Bebensee, Neeral Beladia, Walter Beller-Morales, Alexandre Bérard, Andrew Berneshawi, Anna Bialas, Phil Blunsom, Matt Bobkin, Adi Bongale, Sam Braun, Maxime Brunet, Samuel Cahyawijaya, David Cairuz, Jon Ander Campos, Cassie Cao, Kris Cao, Roman Castagné, Julián Cendrero, Leila Chan Currie, Yash Chandak, Diane Chang, Giannis Chatziveroglou, Hongyu Chen, Claire Cheng, Alexis Chevalier, Justin T. Chiu, Eugene Cho, Eugene Choi, Eujeong Choi, Tim Chung, Volkan Cirik, Ana Cismaru, Pierre Clavier, Henry Conklin, Lucas Crawhall-Stein, Devon Crouse, Andres Felipe Cruz-Salinas, Ben Cyrus, Daniel D'souza, Hugo Dalla-Torre, John Dang, William Darling, Omar Darwiche Domingues, Saurabh Dash, Antoine Debugne, Théo Dehaze, Shaan Desai, Joan Devassy, Rishit Dholakia, Kyle Duffy, Ali Edalati, Ace Eldeib, Abdullah Elkady, Sarah Elsharkawy, Irem Ergün, Beyza Ermis, Marzieh Fadaee, Boyu Fan, Lucas Fayoux, Yannis Flet-Berliac, Nick Frosst, Matthias Gallé, Wojciech Galuba, Utsav Garg, Matthieu Geist, Mohammad Gheshlaghi Azar, Seraphina Goldfarb-Tarrant, Tomas Goldsack, Aidan Gomez, Victor Machado Gonzaga, Nithya Govindarajan, Manoj Govindassamy, Nathan Grinsztajn, Nikolas Gritsch, Patrick Gu, Shangmin Guo, Kilian Haefeli, Rod Hajjar, Tim Hawes, Jingyi He, Sebastian Hofstätter, Sungjin Hong, Sara Hooker, Tom Hosking
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.00698
Pdf URL: https://arxiv.org/pdf/2504.00698
Copy Paste: [[2504.00698]] Command A: An Enterprise-Ready Large Language Model(https://arxiv.org/abs/2504.00698)
Keywords: language model, retrieval augmented generation, agent
Abstract: In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top of the range performance. It offers best-in-class Retrieval Augmented Generation (RAG) capabilities with grounding and tool use to automate sophisticated business processes. These abilities are achieved through a decentralised training approach, including self-refinement algorithms and model merging techniques. We also include results for Command R7B which shares capability and architectural similarities to Command A. Weights for both models have been released for research purposes. This technical report details our original training pipeline and presents an extensive evaluation of our models across a suite of enterprise-relevant tasks and public benchmarks, demonstrating excellent performance and efficiency.

Title: Aplicação de Large Language Models na Análise e Síntese de Documentos Jurídicos: Uma Revisão de Literatura

Authors: Matheus Belarmino, Rackel Coelho, Roberto Lotudo, Jayr Pereira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00725
Pdf URL: https://arxiv.org/pdf/2504.00725
Copy Paste: [[2504.00725]] Aplicação de Large Language Models na Análise e Síntese de Documentos Jurídicos: Uma Revisão de Literatura(https://arxiv.org/abs/2504.00725)
Keywords: language model, gpt, llm, hallucination, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have been increasingly used to optimize the analysis and synthesis of legal documents, enabling the automation of tasks such as summarization, classification, and retrieval of legal information. This study aims to conduct a systematic literature review to identify the state of the art in prompt engineering applied to LLMs in the legal context. The results indicate that models such as GPT-4, BERT, Llama 2, and Legal-Pegasus are widely employed in the legal field, and techniques such as Few-shot Learning, Zero-shot Learning, and Chain-of-Thought prompting have proven effective in improving the interpretation of legal texts. However, challenges such as biases in models and hallucinations still hinder their large-scale implementation. It is concluded that, despite the great potential of LLMs for the legal field, there is a need to improve prompt engineering strategies to ensure greater accuracy and reliability in the generated results.

Title: IHC-LLMiner: Automated extraction of tumour immunohistochemical profiles from PubMed abstracts using large language models

Authors: Yunsoo Kim, Michal W. S. Ong, Daniel W. Rogalsky, Manuel Rodriguez-Justo, Honghan Wu, Adam P. Levine
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00748
Pdf URL: https://arxiv.org/pdf/2504.00748
Copy Paste: [[2504.00748]] IHC-LLMiner: Automated extraction of tumour immunohistochemical profiles from PubMed abstracts using large language models(https://arxiv.org/abs/2504.00748)
Keywords: language model, gpt, llm
Abstract: Immunohistochemistry (IHC) is essential in diagnostic pathology and biomedical research, offering critical insights into protein expression and tumour biology. This study presents an automated pipeline, IHC-LLMiner, for extracting IHC-tumour profiles from PubMed abstracts, leveraging advanced biomedical text mining. There are two subtasks: abstract classification (include/exclude as relevant) and IHC-tumour profile extraction on relevant included abstracts. The best-performing model, "Gemma-2 finetuned", achieved 91.5% accuracy and an F1 score of 91.4, outperforming GPT4-O by 9.5% accuracy with 5.9 times faster inference time. From an initial dataset of 107,759 abstracts identified for 50 immunohistochemical markers, the classification task identified 30,481 relevant abstracts (Include) using the Gemma-2 finetuned model. For IHC-tumour profile extraction, the Gemma-2 finetuned model achieved the best performance with 63.3% Correct outputs. Extracted IHC-tumour profiles (tumour types and markers) were normalised to Unified Medical Language System (UMLS) concepts to ensure consistency and facilitate IHC-tumour profile landscape analysis. The extracted IHC-tumour profiles demonstrated excellent concordance with available online summary data and provided considerable added value in terms of both missing IHC-tumour profiles and quantitative assessments. Our proposed LLM based pipeline provides a practical solution for large-scale IHC-tumour profile data mining, enhancing the accessibility and utility of such data for research and clinical applications as well as enabling the generation of quantitative and structured data to support cancer-specific knowledge base development. Models and training datasets are available at this https URL.

Title: LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

Authors: Sameer Sadruddin, Jennifer D'Souza, Eleni Poupaki, Alex Watkins, Hamed Babaei Giglou, Anisa Rula, Bora Karasulu, Sören Auer, Adrie Mackus, Erwin Kessels
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2504.00752
Pdf URL: https://arxiv.org/pdf/2504.00752
Copy Paste: [[2504.00752]] LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models(https://arxiv.org/abs/2504.00752)
Keywords: language model, llm
Abstract: Extracting structured information from unstructured text is crucial for modeling real-world processes, but traditional schema mining relies on semi-structured data, limiting scalability. This paper introduces schema-miner, a novel tool that combines large language models with human feedback to automate and refine schema extraction. Through an iterative workflow, it organizes properties from text, incorporates expert input, and integrates domain-specific ontologies for semantic depth. Applied to materials science--specifically atomic layer deposition--schema-miner demonstrates that expert-guided LLMs generate semantically rich schemas suitable for diverse real-world applications.

Title: RECKON: Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model

Authors: Lin Zhang, Zhouhong Gu, Xiaoran Shi, Hongwei Feng, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00756
Pdf URL: https://arxiv.org/pdf/2504.00756
Copy Paste: [[2504.00756]] RECKON: Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model(https://arxiv.org/abs/2504.00756)
Keywords: language model, llm
Abstract: As large language models (LLMs) advance, efficient knowledge evaluation becomes crucial to verifying their capabilities. Traditional methods, relying on benchmarks, face limitations such as high resource costs and information loss. We propose the Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model (RECKON), which directly uses reference data to evaluate models. RECKON organizes unstructured data into manageable units and generates targeted questions for each cluster, improving evaluation accuracy and efficiency. Experimental results show that RECKON reduces resource consumption by 56.5% compared to traditional methods while achieving over 97% accuracy across various domains, including world knowledge, code, legal, and biomedical datasets. Code is available at this https URL

Title: Digitally Supported Analysis of Spontaneous Speech (DigiSpon): Benchmarking NLP-Supported Language Sample Analysis of Swiss Children's Speech

Authors: Anja Ryser, Yingqiang Gao, Sarah Ebling
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00780
Pdf URL: https://arxiv.org/pdf/2504.00780
Copy Paste: [[2504.00780]] Digitally Supported Analysis of Spontaneous Speech (DigiSpon): Benchmarking NLP-Supported Language Sample Analysis of Swiss Children's Speech(https://arxiv.org/abs/2504.00780)
Keywords: language model, llm
Abstract: Language sample analysis (LSA) is a process that complements standardized psychometric tests for diagnosing, for example, developmental language disorder (DLD) in children. However, its labor-intensive nature has limited its use in speech-language pathology practice. We introduce an approach that leverages natural language processing (NLP) methods not based on commercial large language models (LLMs) applied to transcribed speech data from 119 children in the German speaking part of Switzerland with typical and atypical language development. The study aims to identify optimal practices that support speech-language pathologists in diagnosing DLD more efficiently within a human-in-the-loop framework, without relying on potentially unethical implementations that leverage commercial LLMs. Preliminary findings underscore the potential of integrating locally deployed NLP methods into the process of semi-automatic LSA.

Title: Z1: Efficient Test-time Scaling with Code

Authors: Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00810
Pdf URL: https://arxiv.org/pdf/2504.00810
Copy Paste: [[2504.00810]] Z1: Efficient Test-time Scaling with Code(https://arxiv.org/abs/2504.00810)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can achieve enhanced complex problem-solving through test-time computing scaling, yet this often entails longer contexts and numerous reasoning token costs. In this paper, we propose an efficient test-time scaling method that trains LLMs on code-related reasoning trajectories, facilitating their reduction of excess thinking tokens while maintaining performance. First, we create Z1-Code-Reasoning-107K, a curated dataset of simple and complex coding problems paired with their short and long solution trajectories. Second, we present a novel Shifted Thinking Window to mitigate overthinking overhead by removing context-delimiting tags (e.g., <think>...</think>) and capping reasoning tokens. Trained with long and short trajectory data and equipped with Shifted Thinking Window, our model, Z1-7B, demonstrates the ability to adjust its reasoning level to the complexity of problems and exhibits efficient test-time scaling across different reasoning tasks, matching R1-Distill-Qwen-7B performance with about 30% of its average thinking tokens. Notably, fine-tuned with only code trajectories, Z1-7B demonstrates generalization to broader reasoning tasks (47.5% on GPQA Diamond). Our analysis of efficient reasoning elicitation also provides valuable insights for future research.
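Illustration: the abstract mentions capping reasoning tokens. The sketch below shows one way such a cap could work, assuming a hypothetical per-token `step` callable, a hypothetical hint that asks for the final answer once the cap is hit, and a toy stand-in model; none of these names come from the paper, and this is not the authors' Shifted Thinking Window implementation.

```python
from typing import Callable, List

def generate_with_cap(step: Callable[[List[str]], str],
                      prompt: List[str],
                      max_thinking_tokens: int,
                      hint: List[str],
                      max_answer_tokens: int = 16,
                      eos: str = "<eos>") -> List[str]:
    """Generate freely up to a cap; if the model has not stopped, append a hint and ask for a short answer."""
    context = list(prompt)
    produced = 0
    while produced < max_thinking_tokens:
        token = step(context)
        if token == eos:
            # The model finished on its own before reaching the cap.
            return context[len(prompt):]
        context.append(token)
        produced += 1
    # Cap reached: inject the (hypothetical) hint, then allow a few answer tokens.
    context.extend(hint)
    for _ in range(max_answer_tokens):
        token = step(context)
        if token == eos:
            break
        context.append(token)
    return context[len(prompt):]

def toy_step(context: List[str]) -> str:
    """Toy stand-in for a model: rambles until it sees the hint, then answers and stops."""
    if "ANSWER:" in context:
        return "<eos>" if context[-1] == "42" else "42"
    return "hmm"

capped = generate_with_cap(toy_step, ["Q"], max_thinking_tokens=3, hint=["ANSWER:"])
```

With the toy model, the three "hmm" tokens are followed by the injected hint and a single answer token.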

Title: ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

Authors: Yubo Wang, Xueguang Ma, Ping Nie, Huaye Zeng, Zhiheng Lyu, Yuxuan Zhang, Benjamin Schneider, Yi Lu, Xiang Yue, Wenhu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00824
Pdf URL: https://arxiv.org/pdf/2504.00824
Copy Paste: [[2504.00824]] ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations(https://arxiv.org/abs/2504.00824)
Keywords: language model, retrieval-augmented generation
Abstract: Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their capacity to adequately support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], and then utilizes its representation to look up relevant citations from a database. The retrieved references are fed into the model to augment the generation process. We jointly optimize both the generation and citation tasks within a single framework to increase efficiency. Trained on 500K papers from arXiv, our model achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality (measured across relevance, coherence, academic rigor, completeness, and innovation), surpassing models with 10x more parameters such as Qwen-2.5-72B-Instruct (15.8/25). Human studies also confirm ScholarCopilot's superior performance in citation recall, writing efficiency, and overall user experience, confirming the effectiveness of our approach.
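Illustration: the abstract describes looking up citations from the representation of a retrieval token and reports top-1 retrieval accuracy. The sketch below shows how such a lookup and metric could be computed, assuming dot-product similarity and toy two-dimensional embeddings; the function names and data are hypothetical, and the abstract does not specify the similarity function.

```python
from typing import Dict, List, Sequence

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def retrieve_top1(query_vec: Sequence[float],
                  citation_vecs: Dict[str, Sequence[float]]) -> str:
    """Return the id of the citation whose embedding scores highest against the query."""
    return max(citation_vecs, key=lambda cid: dot(query_vec, citation_vecs[cid]))

def top1_accuracy(queries: List[Sequence[float]],
                  gold_ids: List[str],
                  citation_vecs: Dict[str, Sequence[float]]) -> float:
    """Fraction of queries whose best-scoring citation is the gold citation."""
    hits = sum(retrieve_top1(q, citation_vecs) == g for q, g in zip(queries, gold_ids))
    return hits / len(gold_ids)

# Toy citation database and retrieval-token representations (illustrative only).
citations = {"paper_a": [1.0, 0.0], "paper_b": [0.0, 1.0]}
queries = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
gold = ["paper_a", "paper_b", "paper_b"]
accuracy = top1_accuracy(queries, gold, citations)
```

In the toy data the third query retrieves `paper_a` although the gold citation is `paper_b`, so two of three lookups count as hits.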

Title: How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study

Authors: Yunjie Ji, Sitong Zhao, Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Yiping Peng, Han Zhao, Xiangang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00829
Pdf URL: https://arxiv.org/pdf/2504.00829
Copy Paste: [[2504.00829]] How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study(https://arxiv.org/abs/2504.00829)
Keywords: language model, llm
Abstract: Enhancing the reasoning capabilities of Large Language Models (LLMs) with efficiency and scalability remains a fundamental challenge in artificial intelligence research. This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning (RL) strategies can substantially improve LLM reasoning performance. Through systematic analysis, we demonstrate that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. Moreover, we introduce a staged training methodology, progressively exposing models to increasingly challenging tasks, further amplifying reasoning capabilities. Our findings reveal significant cross-domain benefits when simultaneously training models on mathematical reasoning and code generation tasks. Notably, our proposed approach enables a 1.5B parameter model to achieve an accuracy of 42.3% on the AIME-2024 benchmark and 89.5% on the MATH-500 benchmark. These results underscore the efficacy of our method in advancing the reasoning proficiency of LLMs. We will open-source our datasets on GitHub and Hugging Face.
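Illustration: the abstract describes selecting training data by difficulty level and training in stages of increasing difficulty. The sketch below shows one way to build such stages, assuming difficulty is estimated from the pass rate of sampled attempts and that two thresholds split problems into three stages; the abstract does not say how difficulty is measured, so the pass-rate proxy, the thresholds, and all names are assumptions.

```python
from typing import Dict, List, Tuple

def pass_rate(outcomes: List[bool]) -> float:
    """Share of sampled attempts that solved the problem."""
    return sum(outcomes) / len(outcomes)

def build_stages(attempts: Dict[str, List[bool]],
                 thresholds: Tuple[float, float] = (0.7, 0.3)) -> List[List[str]]:
    """Bucket problems into [easy, medium, hard] stages by pass rate (assumed difficulty proxy)."""
    easy_min, medium_min = thresholds
    easy, medium, hard = [], [], []
    for problem_id in sorted(attempts):
        rate = pass_rate(attempts[problem_id])
        if rate >= easy_min:
            easy.append(problem_id)
        elif rate >= medium_min:
            medium.append(problem_id)
        else:
            hard.append(problem_id)
    # Training would visit the stages in this order, easiest first.
    return [easy, medium, hard]

# Toy attempt logs: True means the sampled solution was correct.
toy_attempts = {
    "p1": [True, True, True, True],
    "p2": [True, False, False, False],
    "p3": [True, True, False, False],
}
stages = build_stages(toy_attempts)
```

A staged schedule would then run RL on each bucket in turn, which is the ordering the returned list encodes.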

Title: m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models

Authors: Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00869
Pdf URL: https://arxiv.org/pdf/2504.00869
Copy Paste: [[2504.00869]] m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models(https://arxiv.org/abs/2504.00869)
Keywords: language model, llm, prompt
Abstract: Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve the overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.
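Illustration: the abstract describes budget forcing as extending test-time computation through iterative prompts. The sketch below shows one common form of that idea, assuming the model signals the end of its reasoning with a hypothetical `</think>` token and that the extension prompt is the word "Wait"; the `step` callable, token names, and toy model are assumptions, not details taken from the paper.

```python
from typing import Callable, List

def budget_force(step: Callable[[List[str]], str],
                 prompt: List[str],
                 num_extensions: int,
                 end_think: str = "</think>",
                 nudge: str = "Wait",
                 max_tokens: int = 64) -> List[str]:
    """Replace the first `num_extensions` end-of-thinking tokens with a nudge so the model keeps reasoning."""
    context = list(prompt)
    extensions_left = num_extensions
    for _ in range(max_tokens):
        token = step(context)
        if token == end_think:
            if extensions_left == 0:
                # Budget of extensions is spent: accept the end of reasoning.
                context.append(token)
                break
            # Suppress the stop signal and prompt the model to re-check its work.
            context.append(nudge)
            extensions_left -= 1
        else:
            context.append(token)
    return context[len(prompt):]

def toy_model(context: List[str]) -> str:
    """Toy stand-in for a model: emits one 'check' token, then tries to stop."""
    return "</think>" if context[-1] == "check" else "check"

trace = budget_force(toy_model, ["Q"], num_extensions=2)
```

Each suppressed stop adds one more round of reasoning, which is where a previously correct answer can also be overturned, as the abstract notes.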

Title: GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

Authors: Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, Bowen Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00891
Pdf URL: https://arxiv.org/pdf/2504.00891
Copy Paste: [[2504.00891]] GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning(https://arxiv.org/abs/2504.00891)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training data from the MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available at this https URL.

Title: On the Robustness of Agentic Function Calling

Authors: Ella Rabinovich, Ateret Anaby-Tavor
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00914
Pdf URL: https://arxiv.org/pdf/2504.00914
Copy Paste: [[2504.00914]] On the Robustness of Agentic Function Calling(https://arxiv.org/abs/2504.00914)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley Function Calling Leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.

Title: Multi-Token Attention

Authors: Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00927
Pdf URL: https://arxiv.org/pdf/2504.00927
Copy Paste: [[2504.00927]] Multi-Token Attention(https://arxiv.org/abs/2504.00927)
Keywords: language model, llm, long context
Abstract: Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and key token vector. This "single token attention" bottlenecks the amount of information used in distinguishing a relevant part from the rest of the context. To address this issue, we propose a new attention method, Multi-Token Attention (MTA), which allows LLMs to condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other's attention weights for more precise attention. As a result, our method can locate relevant context using richer, more nuanced information that can exceed a single vector's capacity. Through extensive evaluations, we demonstrate that MTA achieves enhanced performance on a range of popular benchmarks. Notably, it outperforms Transformer baseline models on standard language modeling tasks, and on tasks that require searching for information within long contexts, where our method's ability to leverage richer information proves particularly beneficial.
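Illustration: the abstract describes applying convolution over queries and keys so that nearby queries and keys influence each other's attention weights. The sketch below applies a small kernel to pre-softmax attention logits, assuming a single head, a kernel that only looks backward over queries and keys, and causal masking; the function name `mta_weights`, the kernel layout, and the toy inputs are assumptions, and the convolution over heads mentioned in the abstract is omitted.

```python
import math
from typing import List

def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def mta_weights(logits: List[List[float]], kernel: List[List[float]]) -> List[List[float]]:
    """Mix each logit with logits of earlier queries/keys via `kernel`, then apply a causal softmax."""
    n = len(logits)
    weights = []
    for i in range(n):              # query position
        row = []
        for j in range(i + 1):      # key position (causal: j <= i)
            mixed = 0.0
            for a, kernel_row in enumerate(kernel):    # offset over queries
                for b, w in enumerate(kernel_row):     # offset over keys
                    qi, kj = i - a, j - b
                    # Only use source logits that exist and are themselves causal.
                    if qi >= 0 and 0 <= kj <= qi:
                        mixed += w * logits[qi][kj]
            row.append(mixed)
        # Future keys receive zero attention weight.
        weights.append(softmax(row) + [0.0] * (n - i - 1))
    return weights

# A 1x1 identity kernel reduces the sketch to ordinary causal softmax attention.
uniform = mta_weights([[0.0] * 3 for _ in range(3)], [[1.0]])
mixed = mta_weights([[1.0, 0.0, 0.0], [0.5, 2.0, 0.0], [0.1, 0.3, 0.7]], [[0.5, 0.5], [0.25, 0.25]])
```

With the identity kernel each logit depends on a single query-key pair; a larger kernel lets a weight draw on several neighboring pairs, which is the effect the abstract attributes to MTA.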

Title: InformGen: An AI Copilot for Accurate and Compliant Clinical Research Consent Document Generation

Authors: Zifeng Wang, Junyi Gao, Benjamin Danek, Brandon Theodorou, Ruba Shaik, Shivashankar Thati, Seunghyun Won, Jimeng Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00934
Pdf URL: https://arxiv.org/pdf/2504.00934
Copy Paste: [[2504.00934]] InformGen: An AI Copilot for Accurate and Compliant Clinical Research Consent Document Generation(https://arxiv.org/abs/2504.00934)
Keywords: language model, gpt, llm
Abstract: Leveraging large language models (LLMs) to generate high-stakes documents, such as informed consent forms (ICFs), remains a significant challenge due to the extreme need for regulatory compliance and factual accuracy. Here, we present InformGen, an LLM-driven copilot for accurate and compliant ICF drafting by optimized knowledge document parsing and content generation, with humans in the loop. We further construct a benchmark dataset comprising protocols and ICFs from 900 clinical trials. Experimental results demonstrate that InformGen achieves near 100% compliance with 18 core regulatory rules derived from FDA guidelines, outperforming a vanilla GPT-4o model by up to 30%. Additionally, a user study with five annotators shows that InformGen, when integrated with manual intervention, attains over 90% factual accuracy, significantly surpassing the vanilla GPT-4o model's 57%-82%. Crucially, InformGen ensures traceability by providing inline citations to source protocols, enabling easy verification and maintaining the highest standards of factual integrity.

Title: Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?

Authors: Anna Bavaresco, Raquel Fernández
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00942
Pdf URL: https://arxiv.org/pdf/2504.00942
Copy Paste: [[2504.00942]] Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?(https://arxiv.org/abs/2504.00942)
Keywords: language model
Abstract: A common assumption in Computational Linguistics is that text representations learnt by multimodal models are richer and more human-like than those by language-only models, as they are grounded in images or audio -- similar to how human language is grounded in real-world experiences. However, empirical studies checking whether this is true are largely lacking. We address this gap by comparing word representations from contrastive multimodal models vs. language-only ones in the extent to which they capture experiential information -- as defined by an existing norm-based 'experiential model' -- and align with human fMRI responses. Our results indicate that, surprisingly, language-only models are superior to multimodal ones in both respects. Additionally, they learn more unique brain-relevant semantic information beyond that shared with the experiential model. Overall, our study highlights the need to develop computational models that better integrate the complementary semantic information provided by multimodal data sources.

Title: SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching

Authors: Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.00970
Pdf URL: https://arxiv.org/pdf/2504.00970
Copy Paste: [[2504.00970]] SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching(https://arxiv.org/abs/2504.00970)
Keywords: language model, llm, long context
Abstract: Large language models face significant computational and memory challenges when processing long contexts. During inference, efficient management of the key-value (KV) cache, which stores intermediate activations for autoregressive generation, is critical to reducing memory overhead and improving computational efficiency. Traditional token-level efficient KV caching methods overlook semantic information, treating tokens independently without considering their semantic relationships. Meanwhile, existing semantic-preserving KV cache management approaches often suffer from substantial memory usage and high time-to-first-token. To address these limitations, we propose SentenceKV, a novel sentence-level semantic KV caching approach designed to enhance inference efficiency while preserving semantic coherence. During prefilling, SentenceKV groups tokens based on sentence-level semantic similarity, compressing sentence representations into concise semantic vectors stored directly on the GPU, while individual KV pairs are offloaded to CPU. During decoding, SentenceKV generates tokens by selectively retrieving semantically relevant sentence-level KV entries, leveraging the semantic similarity between the prefilling-stage semantic vectors and decoding-stage queries. This ensures efficient and contextually accurate predictions, minimizing the loading of redundant or irrelevant data into GPU memory and significantly reducing memory overhead while maintaining stable inference latency, even for extremely long contexts. Extensive evaluations on benchmarks including PG-19, LongBench, and Needle-In-A-Haystack demonstrate that SentenceKV significantly outperforms state-of-the-art methods in both efficiency and memory usage, without compromising model accuracy.
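Illustration: the abstract describes grouping tokens by sentence, keeping one compact semantic vector per sentence, and retrieving only the KV entries of semantically relevant sentences during decoding. The sketch below shows that selection step, assuming the sentence vector is the mean of its token key vectors and relevance is a dot product with the decoding query; both choices, along with all names and toy data, are assumptions rather than the paper's stated design.

```python
from typing import List, Sequence, Tuple

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def build_sentence_index(token_keys: List[Sequence[float]],
                         spans: List[Tuple[int, int]]) -> List[List[float]]:
    """One summary vector per sentence; spans are half-open (start, end) token ranges."""
    return [mean_vector(token_keys[start:end]) for start, end in spans]

def select_token_positions(query: Sequence[float],
                           sentence_vecs: List[List[float]],
                           spans: List[Tuple[int, int]],
                           top_k: int) -> List[int]:
    """Token positions whose KV pairs would be fetched for the top_k most similar sentences."""
    scores = [dot(query, vec) for vec in sentence_vecs]
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    positions: List[int] = []
    for i in sorted(best):          # keep the original token order
        start, end = spans[i]
        positions.extend(range(start, end))
    return positions

# Toy prefill: five token key vectors split into three sentences.
token_keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.5]]
spans = [(0, 2), (2, 4), (4, 5)]
index = build_sentence_index(token_keys, spans)
selected = select_token_positions([0.0, 1.0], index, spans, top_k=1)
```

Only the selected positions would need to be moved from CPU to GPU memory at that decoding step, which is where the memory saving described in the abstract would come from.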

Title: Chinese Grammatical Error Correction: A Survey

Authors: Mengyang Qiu, Qingyu Gao, Linxuan Yang, Yang Gu, Tran Minh Nguyen, Zihao Huang, Jungyeul Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.00977
Pdf URL: https://arxiv.org/pdf/2504.00977
Copy Paste: [[2504.00977]] Chinese Grammatical Error Correction: A Survey(https://arxiv.org/abs/2504.00977)
Keywords: language model
Abstract: Chinese Grammatical Error Correction (CGEC) is a critical task in Natural Language Processing, addressing the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing. While L2 learners struggle with mastering complex grammatical structures, L1 users also benefit from CGEC in academic, professional, and formal contexts where writing precision is essential. This survey provides a comprehensive review of CGEC research, covering datasets, annotation schemes, evaluation methodologies, and system advancements. We examine widely used CGEC datasets, highlighting their characteristics, limitations, and the need for improved standardization. We also analyze error annotation frameworks, discussing challenges such as word segmentation ambiguity and the classification of Chinese-specific error types. Furthermore, we review evaluation metrics, focusing on their adaptation from English GEC to Chinese, including character-level scoring and the use of multiple references. In terms of system development, we trace the evolution from rule-based and statistical approaches to neural architectures, including Transformer-based models and the integration of large pre-trained language models. By consolidating existing research and identifying key challenges, this survey provides insights into the current state of CGEC and outlines future directions, including refining annotation standards to address segmentation challenges, and leveraging multilingual approaches to enhance CGEC.

Title: MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs

Authors: Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, Yihan Cao, Hui Ren, Xiang Li, Xiaoxiao Li, Yuyin Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.00993
Pdf URL: https://arxiv.org/pdf/2504.00993
Copy Paste: [[2504.00993]] MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs(https://arxiv.org/abs/2504.00993)
Keywords: language model, llm
Abstract: Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or "thinking paths", which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Distill-8B. Our top-performing model, MedReason-8B, outperforms Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset's quality, ensuring MedReason offers accurate and coherent medical reasoning. Our data, models, and code will be publicly available.
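Illustration: the abstract describes "thinking paths" that trace connections from question elements to answers through knowledge-graph entities. The sketch below finds such a path with breadth-first search over a tiny directed graph, assuming the shortest path is the one of interest; the toy graph, relation names, and function name are illustrative assumptions and are not taken from the paper's knowledge graph or pipeline.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Adjacency list: entity -> list of (relation, neighboring entity).
Graph = Dict[str, List[Tuple[str, str]]]

def find_path(graph: Graph, start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for a shortest entity path from `start` to `goal`; None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Toy graph linking a question entity to an answer entity (illustrative only).
toy_graph: Graph = {
    "polyuria": [("symptom_of", "diabetes mellitus")],
    "diabetes mellitus": [("treated_with", "insulin")],
}
thinking_path = find_path(toy_graph, "polyuria", "insulin")
```

A path found this way is only a candidate chain of reasoning; the abstract states that each path is additionally validated against clinical logic and evidence-based medicine.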

Title: Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models

Authors: José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01001
Pdf URL: https://arxiv.org/pdf/2504.01001
Copy Paste: [[2504.01001]] Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models(https://arxiv.org/abs/2504.01001)
Keywords: language model, prompt
Abstract: As language models improve and become capable of performing more complex tasks across modalities, evaluating them automatically becomes increasingly challenging. Developing strong and robust task-specific automatic metrics gets harder, and human-annotated test sets -- which are expensive to create -- saturate more quickly. A compelling alternative is to design reliable strategies to automate the creation of test data and evaluation, but previous attempts either rely on pre-existing data, or focus solely on individual tasks. We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task by leveraging language models for both synthetic test data creation and evaluation. ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation; it is scalable to tasks and languages where collecting real-world data is costly or impractical; it is model-agnostic, allowing the creation of increasingly challenging benchmarks as models improve. To assess the effectiveness of our framework, we create benchmarks for five text-only tasks and a multi-modal one: general capabilities in four languages (English, Chinese, French, and Korean), translation, and general vision-language capabilities in English. We then rank a broad range of open and closed systems on our benchmarks. ZSB rankings consistently correlate strongly with human rankings, outperforming widely-adopted standard benchmarks. Through ablations, we find that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial drivers of performance. We release all our benchmarks, and code to reproduce our experiments and to produce new benchmarks.
摘要：随着语言模型的改善并能够跨模式执行更复杂的任务，对它们进行评估会变得越来越具有挑战性。开发强大而健壮的特定任务自动指标会变得更加困难，并且被人类通知的测试集（创建价格昂贵）更快地饱和。一个引人注目的替代方法是设计可靠的策略来自动创建测试数据和评估，但是以前的尝试要么依赖于预先存在的数据，要么仅专注于单个任务。我们提出零击基准测试（ZSB），这是为任何任务创建高质量基准测试的框架，该框架通过利用语言模型来创建综合测试数据创建和评估。 ZSB是简单而灵活的：它仅需要创建提示数据生成的提示，一个进行评估。它可扩展到收集现实数据的昂贵或不切实际的任务和语言；它是模型不合时宜的，可以随着模型的改善而创建越来越具有挑战性的基准。为了评估框架的有效性，我们为五个仅文本任务和一个多模式的任务创建基准：四种语言（英语，中文，法语和韩文），翻译和英语的一般视觉语言功能。然后，我们在基准上对广泛的开放和封闭系统进行排名。 ZSB的排名始终与人类排名密切相关，表现优于广泛预料的标准基准。通过消融，我们发现可以通过开放的模型创建强大的基准，并且判断模型大小和数据集品种是性能的关键驱动力。我们释放所有基准和代码以复制我们的实验并生产新的基准测试。
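
Illustration (not from the paper): the abstract reports that ZSB rankings correlate strongly with human rankings. One common way to quantify agreement between two rankings is Spearman's rank correlation; whether the authors use this exact statistic is an assumption here. The system names and ranks below are made up, and the formula used assumes no tied ranks.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings.

    Each argument maps a system name to its rank (1 = best).
    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid only without ties.
    """
    systems = sorted(rank_a)
    if sorted(rank_b) != systems:
        raise ValueError("both rankings must cover the same systems")
    n = len(systems)
    if n < 2:
        raise ValueError("need at least two systems")
    d_squared = sum((rank_a[s] - rank_b[s]) ** 2 for s in systems)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical ranks: a benchmark ranking and a human-preference ranking.
benchmark_rank = {"sys_a": 1, "sys_b": 2, "sys_c": 3, "sys_d": 4}
human_rank = {"sys_a": 1, "sys_b": 3, "sys_c": 2, "sys_d": 4}

rho = spearman_rho(benchmark_rank, human_rank)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the benchmark orders systems almost the same way humans do; a value near 0 means the two orderings are unrelated.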

Title: Token embeddings violate the manifold hypothesis

Authors: Michael Robinson, Sourya Dey, Tony Chiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.01002
Pdf URL: https://arxiv.org/pdf/2504.01002
Copy Paste: [[2504.01002]] Token embeddings violate the manifold hypothesis(https://arxiv.org/abs/2504.01002)
Keywords: language model, llm, prompt
Abstract: Fully understanding the behavior of a large language model (LLM) requires understanding its input space. If this input space differs from our assumption, our understanding of and conclusions about the LLM are likely flawed, regardless of its architecture. Here, we elucidate the structure of the token embeddings, the input domain for LLMs, both empirically and theoretically. We present a generalized and statistically testable model where the neighborhood of each token splits into well-defined signal and noise dimensions. This model is based on a generalization of a manifold called a fiber bundle, so we denote our hypothesis test as the "fiber bundle null." Failing to reject the null is uninformative, but rejecting it at a specific token indicates that the token has a statistically significant local structure, and so is of interest to us. By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the token subspace is provably not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, and if one prompt contains a token implicated by our test, that prompt will likely exhibit more output variability proportional to the local signal dimension of the token.
摘要：要充分了解大语模型（LLM）的行为，就需要我们对其输入空间的理解。如果此输入空间与我们的假设不同，那么无论其架构如何，我们对LLM的理解和结论都可能存在缺陷。在这里，我们阐明了令牌嵌入的结构，即凭经验和理论上的LLMS的输入域。我们提出了一个广义和统计测试的模型，其中每个令牌的邻域分为明确的信号和噪声尺寸。该模型基于称为纤维束的歧管的概括，因此我们将假设检验表示为``纤维束零''。未能拒绝null的纤维束零。通过在几个开放源LLM上进行测试，每个LLM都有独特的令牌嵌入，我们发现零是经常被拒绝的，因此令牌子空间被证明不是光纤束，因此也不是歧管。由于我们的发现，当LLM带有两个语义上等效的提示时，如果一个提示包含我们测试牵涉的令牌，则该提示可能会表现出与代币的局部信号维度成正比的更多输出可变性。

Title: When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning

Authors: Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.01005
Pdf URL: https://arxiv.org/pdf/2504.01005
Copy Paste: [[2504.01005]] When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning(https://arxiv.org/abs/2504.01005)
Keywords: language model, llm
Abstract: Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method involves scoring each solution with a reward model (verifier) and choosing the best one. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should you spend the budget on scaling solutions via SC or generate fewer solutions and allocate compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is publicly available.
摘要：扩展测试时间计算已成为增强大语模型（LLM）推理能力（尤其是在数学问题解决的任务中）的关键策略。一种传统的方法，即自一致性（SC），为问题生成了多种解决方案，并通过多数投票选择了最常见的答案。另一种常见方法涉及使用奖励模型（验证者）对每个解决方案进行评分，并选择最佳方法。生成奖励模型（GENRM）的最新进步将验证作为下一步的预测任务，从而沿新轴进行推理时间缩放。具体而言，GENRM生成了多次验证链来评分每个解决方案。在有限的推理预算下，这引入了基本的权衡：您是否应该将预算用于扩展解决方案或通过SC生成更少的解决方案并通过GenRM分配计算？为了解决这个问题，我们在固定推理预算下对GENRM进行了评估。有趣的是，我们发现SC在不同模型和数据集的大多数实际推论预算中比GenRM更有效率。例如，GENRM首先匹配SC在消耗8倍之后的推理计算，并且需要更大的计算才能胜过它。此外，我们得出了GENRM范式的推理缩放定律，表明计算最佳推理比缩放验证的数量更为积极地缩放求解解决方案的生成。我们的工作提供了通过平衡解决方案的生成和验证来优化测试时间扩展的实用指导。该代码可在此HTTPS URL上找到。
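
Illustration (not from the paper): the trade-off in the abstract can be made concrete with a small sketch. `majority_vote` implements Self-Consistency as described (pick the most common answer). The budget helpers assume, purely for illustration, that every solution and every verification chain-of-thought costs one generation call of equal size; the paper's own compute accounting may differ. All function names and sample answers are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Self-Consistency: return the most frequent final answer.

    Ties are broken in favor of the answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def generation_calls(num_solutions, num_verifications):
    """Total calls if each solution also gets num_verifications verification chains.

    Assumes one solution and one verification cost the same (an illustrative assumption).
    """
    return num_solutions * (1 + num_verifications)


def max_solutions_under_budget(budget, num_verifications):
    """How many solutions fit in a fixed budget of generation calls."""
    return budget // (1 + num_verifications)


# Hypothetical sampled final answers for one math problem.
sampled = ["42", "41", "42", "7"]
print("SC answer:", majority_vote(sampled))

# With a budget of 32 calls, verification shrinks the number of solutions.
budget = 32
print("SC only:", max_solutions_under_budget(budget, 0), "solutions")
print("7 verifications each:", max_solutions_under_budget(budget, 7), "solutions")
```

Under this simplified accounting, spending the budget on verification sharply reduces how many candidate solutions can be sampled, which is the tension the abstract examines.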

Title: Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization

Authors: Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.01018
Pdf URL: https://arxiv.org/pdf/2504.01018
Copy Paste: [[2504.01018]] Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization(https://arxiv.org/abs/2504.01018)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Selective retrieval improves retrieval-augmented generation (RAG) by reducing distractions from low-quality retrievals and improving efficiency. However, existing approaches under-utilize the inherent knowledge of large language models (LLMs), leading to suboptimal retrieval decisions and degraded generation performance. To bridge this gap, we propose Self-Routing RAG (SR-RAG), a novel framework that binds selective retrieval with knowledge verbalization. SR-RAG enables an LLM to dynamically decide between external retrieval and verbalizing its own parametric knowledge. To this end, we design a multi-task objective that jointly optimizes an LLM on knowledge source selection, knowledge verbalization, and response generation. We further introduce dynamic knowledge source inference via nearest neighbor search to improve the accuracy of knowledge source decisions under domain shifts. Fine-tuning three LLMs with SR-RAG significantly improves both their response accuracy and inference latency. Compared to the strongest selective retrieval baseline, SR-RAG reduces retrievals by 29% while improving the performance by 5.1%.
摘要：选择性检索通过减少低质量检索和提高效率的干扰来改善检索增强的生成（RAG）。但是，现有的方法低估了大语言模型（LLMS）的固有知识，从而导致了次优的检索决策和降解的发电绩效。为了弥合这一差距，我们提出了自言自语的抹布（SR-rag），这是一个新颖的框架，将选择性检索与知识言语结合。 SR-rag使LLM能够在外部检索和口头表达自己的参数知识之间动态决定。为此，我们设计了一个多任务目标，该目标共同优化了知识源选择，知识口头化和响应生成的LLM。我们进一步通过最近的邻居搜索引入动态知识来源推断，以提高域移动下知识源决策的准确性。用SR-rag进行微调的三个LLM可显着提高其响应准确性和推理潜伏期。与最强的选择性检索基线相比，SR-rag将检索降低了29％，同时将绩效提高了5.1％。
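
Illustration (not from the paper): the abstract mentions choosing a knowledge source via nearest neighbor search. A minimal sketch of that general idea is a k-nearest-neighbor vote over stored query vectors, each labeled with the source that worked for it. The two-dimensional vectors, the labels "retrieve" and "verbalize", the cosine similarity measure, and the function `choose_source` are all assumptions for this example and are not the authors' implementation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def choose_source(query_vec, memory, k=3):
    """Vote among the k stored examples most similar to the query.

    memory is a list of (vector, label) pairs; the most common label wins.
    """
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical memory of past queries and the source that served them best.
MEMORY = [
    ([1.0, 0.0], "retrieve"),
    ([0.9, 0.1], "retrieve"),
    ([0.0, 1.0], "verbalize"),
    ([0.1, 0.9], "verbalize"),
    ([0.2, 0.8], "verbalize"),
]

print(choose_source([0.95, 0.05], MEMORY, k=3))
print(choose_source([0.05, 0.95], MEMORY, k=3))
```

Because the decision depends on stored examples rather than fixed model weights, the memory can be extended with examples from a new domain without retraining, which is one plausible reading of why nearest neighbor search helps under domain shift.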