2025-05-08

Title: Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding

Authors: Trilok Padhi, Ramneet Kaur, Adam D. Cobb, Manoj Acharya, Anirban Roy, Colin Samplawski, Brian Matejek, Alexander M. Berenbeim, Nathaniel D. Bastian, Susmit Jha
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.03788
Pdf URL: https://arxiv.org/pdf/2505.03788
Copy Paste: [[2505.03788]] Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding(https://arxiv.org/abs/2505.03788)
Keywords: language model, llm
Abstract: We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs). Existing state-of-the-art UQ methods rely on consistency among multiple responses generated by the LLM on an input query under diverse settings. However, these approaches often report higher confidence in scenarios where the LLM is consistently incorrect. This leads to a poorly calibrated confidence with respect to accuracy. To address this, we leverage cross-modal consistency in addition to self-consistency to improve the calibration of the multi-modal models. Specifically, we ground the textual responses to the visual inputs. The confidence from the grounding model is used to calibrate the overall confidence. Given that using a grounding model adds its own uncertainty in the pipeline, we apply temperature scaling - a widely accepted parametric calibration technique - to calibrate the grounding model's confidence in the accuracy of generated responses. We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA. The experiments demonstrate that the proposed framework achieves significantly improved calibration on both tasks.
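The temperature scaling step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy logits, the grid search, and the function names (`softmax`, `nll`, `fit_temperature`) are assumptions made for illustration. It shows only the general technique, which fits a single scalar T on held-out data by minimizing negative log-likelihood and then divides logits by T before the softmax.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[y])
    return total / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimizes validation NLL.

    A grid search is used here for simplicity (an assumption of this sketch);
    gradient-based optimizers are a common alternative.
    """
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05 .. 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Hypothetical validation set: the model is very confident on every example
# but is right on only 3 of 4, i.e. it is overconfident.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 1, 1, 0]

T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits[0], T)
print(f"fitted T = {T:.2f}, top confidence after scaling = {max(calibrated):.3f}")
```

On this toy data the fitted temperature is greater than 1, which softens the top confidence toward the observed accuracy of 0.75 without changing which class is predicted.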

Title: A Reasoning-Focused Legal Retrieval Benchmark

Authors: Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D. Manning, Peter Henderson, Daniel E. Ho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.03970
Pdf URL: https://arxiv.org/pdf/2505.03970
Copy Paste: [[2505.03970]] A Reasoning-Focused Legal Retrieval Benchmark(https://arxiv.org/abs/2505.03970)
Keywords: language model, llm
Abstract: As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs ("RAG" systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks which capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal research tasks, and were produced through annotation processes which resemble legal research. We describe the construction of these benchmarks and the performance of existing retriever pipelines. Our results suggest that legal RAG remains a challenging application, thus motivating future research.

Title: Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale

Authors: Jiale Liu, Yifan Zeng, Shaokun Zhang, Chi Zhang, Malte Højmark-Bertelsen, Marie Normann Gadeberg, Huazheng Wang, Qingyun Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.03973
Pdf URL: https://arxiv.org/pdf/2505.03973
Copy Paste: [[2505.03973]] Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale(https://arxiv.org/abs/2505.03973)
Keywords: llm, prompt, agent
Abstract: LLM-based optimization has shown remarkable potential in enhancing agentic systems. However, the conventional approach of prompting an LLM optimizer with all training trajectories from the training dataset in a single pass becomes untenable as datasets grow, leading to context window overflow and degraded pattern recognition. To address these challenges, we propose Fine-Grained Optimization (FGO), a scalable framework that divides large optimization tasks into manageable subsets, performs targeted optimizations, and systematically combines optimized components through progressive merging. Evaluation across the ALFWorld, LogisticsQA, and GAIA benchmarks demonstrates that FGO outperforms existing approaches by 1.6-8.6% while reducing average prompt token consumption by 56.3%. Our framework provides a practical solution for scaling up LLM-based optimization of increasingly sophisticated agent systems. Further analysis demonstrates that FGO achieves the most consistent performance gain across all training dataset sizes, showcasing its scalability and efficiency.

Title: X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

Authors: Qianchu Liu, Sheng Zhang, Guanghui Qin, Timothy Ossowski, Yu Gu, Ying Jin, Sid Kiblawi, Sam Preston, Mu Wei, Paul Vozila, Tristan Naumann, Hoifung Poon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.03981
Pdf URL: https://arxiv.org/pdf/2505.03981
Copy Paste: [[2505.03981]] X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains(https://arxiv.org/abs/2505.03981)
Keywords: language model, chain-of-thought
Abstract: Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet, most existing open-source research concentrates on training text-only reasoning models, with evaluations limited to mainly mathematical and general-domain tasks. Therefore, it remains unclear how to effectively extend reasoning capabilities beyond text input and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: General-domain text-based post-training can enable such strong generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chain-of-thoughts, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). Additionally, we find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific text-only data. Building upon this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves new state of the art on numerous text-only and multimodal medical benchmarks.

Title: SLOT: Structuring the Output of Large Language Models

Authors: Darren Yow-Bang Wang, Zhengyuan Shen, Soumya Smruti Mishra, Zhichao Xu, Yifei Teng, Haibo Ding
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04016
Pdf URL: https://arxiv.org/pdf/2505.04016
Copy Paste: [[2505.04016]] SLOT: Structuring the Output of Large Language Models(https://arxiv.org/abs/2505.04016)
Keywords: language model, llm, agent
Abstract: Structured outputs are essential for large language models (LLMs) in critical applications like agents and information extraction. Despite their capabilities, LLMs often generate outputs that deviate from predefined schemas, significantly hampering reliable application development. We present SLOT (Structured LLM Output Transformer), a model-agnostic approach that transforms unstructured LLM outputs into precise structured formats. While existing solutions predominantly rely on constrained decoding techniques or are tightly coupled with specific models, SLOT employs a fine-tuned lightweight language model as a post-processing layer, achieving flexibility across various LLMs and schema specifications. We introduce a systematic pipeline for data curation and synthesis alongside a formal evaluation methodology that quantifies both schema accuracy and content fidelity. Our results demonstrate that a fine-tuned Mistral-7B model with constrained decoding achieves near-perfect schema accuracy (99.5%) and content similarity (94.0%), outperforming Claude-3.5-Sonnet by substantial margins (+25 and +20 percentage points, respectively). Notably, even compact models like Llama-3.2-1B can match or exceed the structured output capabilities of much larger proprietary models when equipped with SLOT, enabling reliable structured generation in resource-constrained environments.
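To make the notion of schema accuracy in the abstract above concrete, here is a minimal sketch. It is not the paper's evaluation code: the flat `SCHEMA` dictionary, the `conforms` rule (exact key match plus type check), and the sample outputs are all assumptions made for illustration. The sketch simply reports the fraction of model outputs that parse as JSON and match the expected fields and types.

```python
import json

def conforms(raw_output, schema):
    """Return True if raw_output is a JSON object with exactly the schema's
    keys and a value of the expected Python type for each key."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    if not isinstance(obj, dict):
        return False
    if set(obj) != set(schema):
        return False  # missing or unexpected fields
    return all(isinstance(obj[key], typ) for key, typ in schema.items())

def schema_accuracy(outputs, schema):
    """Fraction of outputs that conform to the schema."""
    return sum(conforms(o, schema) for o in outputs) / len(outputs)

# Hypothetical schema and model outputs.
SCHEMA = {"name": str, "age": int}
outputs = [
    '{"name": "Ada", "age": 36}',   # conforms
    '{"name": "Bob"}',              # missing field
    'Name: Eve, age 29',            # free text, not JSON
    '{"name": "Cy", "age": "40"}',  # wrong type for "age"
]

print(schema_accuracy(outputs, SCHEMA))
```

A real evaluation would likely use a full JSON Schema validator and a separate content-similarity score; this flat type check is only the simplest instance of the idea.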

Title: Advancing and Benchmarking Personalized Tool Invocation for LLMs

Authors: Xu Huang, Yuefeng Huang, Weiwen Liu, Xingshan Zeng, Yasheng Wang, Ruiming Tang, Hong Xie, Defu Lian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04072
Pdf URL: https://arxiv.org/pdf/2505.04072
Copy Paste: [[2505.04072]] Advancing and Benchmarking Personalized Tool Invocation for LLMs(https://arxiv.org/abs/2505.04072)
Keywords: language model, llm
Abstract: Tool invocation is a crucial mechanism for extending the capabilities of Large Language Models (LLMs) and has recently garnered significant attention. It enables LLMs to solve complex problems through tool calls while accessing up-to-date world knowledge. However, existing work primarily focuses on the fundamental ability of LLMs to invoke tools for problem-solving, without considering personalized constraints in tool invocation. In this work, we introduce the concept of Personalized Tool Invocation and define two key tasks: Tool Preference and Profile-dependent Query. Tool Preference addresses user preferences when selecting among functionally similar tools, while Profile-dependent Query considers cases where a user query lacks certain tool parameters, requiring the model to infer them from the user profile. To tackle these challenges, we propose PTool, a data synthesis framework designed for personalized tool invocation. Additionally, we construct PTBench, the first benchmark for evaluating personalized tool invocation. We then fine-tune various open-source models, demonstrating the effectiveness of our framework and providing valuable insights. Our benchmark is public at this https URL.

Title: Natural Language Generation in Healthcare: A Review of Methods and Applications

Authors: Mengxian Lyu, Xiaohan Li, Ziyi Chen, Jinqian Pan, Cheng Peng, Sankalp Talankar, Yonghui Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04073
Pdf URL: https://arxiv.org/pdf/2505.04073
Copy Paste: [[2505.04073]] Natural Language Generation in Healthcare: A Review of Methods and Applications(https://arxiv.org/abs/2505.04073)
Keywords: language model, llm
Abstract: Natural language generation (NLG) is the key technology to achieve generative artificial intelligence (AI). With the breakthroughs in large language models (LLMs), NLG has been widely used in various medical applications, demonstrating the potential to enhance clinical workflows, support clinical decision-making, and improve clinical documentation. Heterogeneous and diverse medical data modalities, such as medical text, images, and knowledge bases, are utilized in NLG. Researchers have proposed many generative models and applied them in a number of healthcare applications. There is a need for a comprehensive review of NLG methods and applications in the medical domain. In this study, we systematically reviewed 113 scientific publications from a total of 3,988 NLG-related articles identified using a literature search, focusing on data modality, model architecture, clinical applications, and evaluation methods. Following PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines, we categorize key methods, identify clinical applications, and assess their capabilities, limitations, and emerging challenges. This timely review covers the key NLG technologies and medical applications and provides valuable insights for future studies to leverage NLG to transform medical discovery and healthcare.

Title: Bringing legal knowledge to the public by constructing a legal question bank using large-scale pre-trained language model

Authors: Mingruo Yuan, Ben Kao, Tien-Hsuan Wu, Michael M. K. Cheung, Henry W. H. Chan, Anne S. Y. Cheung, Felix W. H. Chan, Yongxi Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04132
Pdf URL: https://arxiv.org/pdf/2505.04132
Copy Paste: [[2505.04132]] Bringing legal knowledge to the public by constructing a legal question bank using large-scale pre-trained language model(https://arxiv.org/abs/2505.04132)
Keywords: language model, gpt
Abstract: Access to legal information is fundamental to access to justice. Yet accessibility refers not only to making legal documents available to the public, but also to rendering legal information comprehensible to them. A vexing problem in bringing legal information to the public is how to turn formal legal documents such as legislation and judgments, which are often highly technical, into easily navigable and comprehensible knowledge for those without legal education. In this study, we formulate a three-step approach for bringing legal knowledge to laypersons, tackling the issues of navigability and comprehensibility. First, we translate selected sections of the law into snippets (called CLIC-pages), each a short article that explains a particular technical legal concept in layperson's terms. Second, we construct a Legal Question Bank (LQB), which is a collection of legal questions whose answers can be found in the CLIC-pages. Third, we design an interactive CLIC Recommender (CRec). Given a user's verbal description of a legal situation that requires a legal solution, CRec interprets the user's input, shortlists questions from the question bank that are most likely relevant to the given legal situation, and recommends their corresponding CLIC-pages where relevant legal knowledge can be found. In this paper we focus on the technical aspects of creating an LQB. We show how large-scale pre-trained language models, such as GPT-3, can be used to generate legal questions. We compare machine-generated questions (MGQs) against human-composed questions (HCQs) and find that MGQs are more scalable, cost-effective, and more diversified, while HCQs are more precise. We also show a prototype of CRec and illustrate through an example how our three-step approach effectively brings relevant legal knowledge to the public.

Title: Enhancing Granular Sentiment Classification with Chain-of-Thought Prompting in Large Language Models

Authors: Vihaan Miriyala, Smrithi Bukkapatnam, Lavanya Prahallad
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04135
Pdf URL: https://arxiv.org/pdf/2505.04135
Copy Paste: [[2505.04135]] Enhancing Granular Sentiment Classification with Chain-of-Thought Prompting in Large Language Models(https://arxiv.org/abs/2505.04135)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: We explore the use of Chain-of-Thought (CoT) prompting with large language models (LLMs) to improve the accuracy of granular sentiment categorization in app store reviews. Traditional numeric and polarity-based ratings often fail to capture the nuanced sentiment embedded in user feedback. We evaluated the effectiveness of CoT prompting versus simple prompting on 2000 Amazon app reviews by comparing each method's predictions to human judgements. CoT prompting improved classification accuracy from 84% to 93%, highlighting the benefit of explicit reasoning in enhancing sentiment analysis performance.

Title: Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety

Authors: Variath Madhupal Gautham Nair, Vishal Varma Dantuluri
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.04146
Pdf URL: https://arxiv.org/pdf/2505.04146
Copy Paste: [[2505.04146]] Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety(https://arxiv.org/abs/2505.04146)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Existing large language models (LLMs) are advancing rapidly and produce outstanding results in image generation tasks, yet their content safety checks remain vulnerable to prompt-based jailbreaks. Through preliminary testing on platforms such as ChatGPT, MetaAI, and Grok, we observed that even short, natural prompts could lead to the generation of compromising images ranging from realistic depictions of forged documents to manipulated images of public figures. We introduce Unmasking the Canvas (UTC Benchmark; UTCB), a dynamic and scalable benchmark dataset to evaluate LLM vulnerability in image generation. Our methodology combines structured prompt engineering, multilingual obfuscation (e.g., Zulu, Gaelic, Base64), and evaluation using Groq-hosted LLaMA-3. The pipeline supports both zero-shot and fallback prompting strategies, risk scoring, and automated tagging. All generations are stored with rich metadata and curated into Bronze (non-verified), Silver (LLM-aided verification), and Gold (manually verified) tiers. UTCB is designed to evolve over time with new data sources, prompt templates, and model behaviors. Warning: This paper includes visual examples of adversarial inputs designed to test model safety. All outputs have been redacted to ensure responsible disclosure.

Title: Can Language Models Understand Social Behavior in Clinical Conversations?

Authors: Manas Satish Bedmutha, Feng Chen, Andrea Hartzler, Trevor Cohen, Nadir Weibel
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2505.04152
Pdf URL: https://arxiv.org/pdf/2505.04152
Copy Paste: [[2505.04152]] Can Language Models Understand Social Behavior in Clinical Conversations?(https://arxiv.org/abs/2505.04152)
Keywords: language model, llm, prompt
Abstract: Effective communication between providers and their patients influences health and care outcomes. The effectiveness of such conversations has been linked not only to the exchange of clinical information, but also to a range of interpersonal behaviors, commonly referred to as social signals, which are often conveyed through non-verbal cues and shape the quality of the patient-provider relationship. Recent advances in large language models (LLMs) have demonstrated an increasing ability to infer emotional and social behaviors even when analyzing only textual information. As automation also increases in clinical settings, such as for transcription of patient-provider conversations, there is growing potential for LLMs to automatically analyze and extract social behaviors from these interactions. To explore the foundational capabilities of LLMs in tracking social signals in clinical dialogue, we designed task-specific prompts and evaluated model performance across multiple architectures and prompting styles using a highly imbalanced, annotated dataset spanning 20 distinct social signals, such as provider dominance and patient warmth. We present the first system capable of tracking all 20 of these coded signals, and uncover patterns in LLM behavior. Further analysis of model configurations and clinical context provides insights for enhancing LLM performance on social signal processing tasks in healthcare settings.

Title: LLM-Independent Adaptive RAG: Let the Question Speak for Itself

Authors: Maria Marina, Nikolay Ivanov, Sergey Pletenev, Mikhail Salnikov, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04253
Pdf URL: https://arxiv.org/pdf/2505.04253
Copy Paste: [[2505.04253]] LLM-Independent Adaptive RAG: Let the Question Speak for Itself(https://arxiv.org/abs/2505.04253)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) are prone to hallucinations, and Retrieval-Augmented Generation (RAG) helps mitigate this, but at a high computational cost while risking misinformation. Adaptive retrieval aims to retrieve only when necessary, but existing approaches rely on LLM-based uncertainty estimation, which remains inefficient and impractical. In this study, we introduce lightweight LLM-independent adaptive retrieval methods based on external information. We investigated 27 features, organized into 7 groups, and their hybrid combinations. We evaluated these methods on 6 QA datasets, assessing the QA performance and efficiency. The results show that our approach matches the performance of complex LLM-based methods while achieving significant efficiency gains, demonstrating the potential of external information for adaptive retrieval.

Title: GASCADE: Grouped Summarization of Adverse Drug Event for Enhanced Cancer Pharmacovigilance

Authors: Sofia Jamil, Aryan Dabad, Bollampalli Areen Reddy, Sriparna Saha, Rajiv Misra, Adil A. Shakur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04284
Pdf URL: https://arxiv.org/pdf/2505.04284
Copy Paste: [[2505.04284]] GASCADE: Grouped Summarization of Adverse Drug Event for Enhanced Cancer Pharmacovigilance(https://arxiv.org/abs/2505.04284)
Keywords: language model, llm
Abstract: In the realm of cancer treatment, summarizing adverse drug events (ADEs) reported by patients using prescribed drugs is crucial for enhancing pharmacovigilance practices and improving drug-related decision-making. While the volume and complexity of pharmacovigilance data have increased, existing research in this field has predominantly focused on general diseases rather than specifically addressing cancer. This work introduces the task of grouped summarization of adverse drug events reported by multiple patients using the same drug for cancer treatment. To address the challenge of limited resources in cancer pharmacovigilance, we present the MultiLabeled Cancer Adverse Drug Reaction and Summarization (MCADRS) dataset. This dataset includes pharmacovigilance posts detailing patient concerns regarding drug efficacy and adverse effects, along with extracted labels for drug names, adverse drug events, severity, and adversity of reactions, as well as summaries of ADEs for each drug. Additionally, we propose the Grouping and Abstractive Summarization of Cancer Adverse Drug events (GASCADE) framework, a novel pipeline that combines the information extraction capabilities of Large Language Models (LLMs) with the summarization power of the encoder-decoder T5 model. Our work is the first to apply alignment techniques, including advanced algorithms like Direct Preference Optimization, to encoder-decoder models using synthetic datasets for summarization tasks. Through extensive experiments, we demonstrate the superior performance of GASCADE across various metrics, validated through both automated assessments and human evaluations. This multitasking approach enhances drug-related decision-making and fosters a deeper understanding of patient concerns, paving the way for advancements in personalized and responsive cancer care. The code and dataset used in this work are publicly available.

Title: The Aloe Family Recipe for Open and Specialized Healthcare LLMs

Authors: Dario Garcia-Gasulla, Jordi Bayarri-Planas, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Adrian Tormos, Daniel Hinjos, Pablo Bernabeu-Perez, Anna Arias-Duart, Pablo Agustin Martin-Torres, Marta Gonzalez-Mallo, Sergio Alvarez-Napagao, Eduard Ayguadé-Parra, Ulises Cortés
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04388
Pdf URL: https://arxiv.org/pdf/2505.04388
Copy Paste: [[2505.04388]] The Aloe Family Recipe for Open and Specialized Healthcare LLMs(https://arxiv.org/abs/2505.04388)
Keywords: language model, llm
Abstract: Purpose: With advancements in Large Language Models (LLMs) for healthcare, the need arises for competitive open-source models to protect the public interest. This work contributes to the field of open medical LLMs by optimizing key stages of data preprocessing and training, while showing how to improve model safety (through DPO) and efficacy (through RAG). The evaluation methodology used, which includes four different types of tests, defines a new standard for the field. The resultant models, shown to be competitive with the best private alternatives, are released with a permissive license. Methods: Building on top of strong base models like Llama 3.1 and Qwen 2.5, Aloe Beta uses a custom dataset to enhance public data with synthetic Chain of Thought examples. The models undergo alignment with Direct Preference Optimization, emphasizing ethical and policy-aligned performance in the presence of jailbreaking attacks. Evaluation includes close-ended, open-ended, safety and human assessments, to maximize the reliability of results. Results: Recommendations are made across the entire pipeline, backed by the solid performance of the Aloe Family. These models deliver competitive performance across healthcare benchmarks and medical fields, and are often preferred by healthcare professionals. On bias and toxicity, the Aloe Beta models significantly improve safety, showing resilience to unseen jailbreaking attacks. For a responsible release, a detailed risk assessment specific to healthcare is attached to the Aloe Family models. Conclusion: The Aloe Beta models, and the recipe that leads to them, are a significant contribution to the open-source medical LLM field, offering top-of-the-line performance while maintaining high ethical requirements. This work sets a new standard for developing and reporting aligned LLMs in healthcare.

Title: Large Means Left: Political Bias in Large Language Models Increases with Their Number of Parameters

Authors: David Exler, Mark Schutera, Markus Reischl, Luca Rettenberger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04393
Pdf URL: https://arxiv.org/pdf/2505.04393
Copy Paste: [[2505.04393]] Large Means Left: Political Bias in Large Language Models Increases with Their Number of Parameters(https://arxiv.org/abs/2505.04393)
Keywords: language model, llm, hallucination
Abstract: With the increasing prevalence of artificial intelligence, careful evaluation of inherent biases needs to be conducted to form the basis for alleviating the effects these predispositions can have on users. Large language models (LLMs) are predominantly used by many as a primary source of information for various topics. LLMs frequently make factual errors, fabricate data (hallucinations), or present biases, exposing users to misinformation and influencing opinions. Educating users on their risks is key to responsible use, as bias, unlike hallucinations, cannot be caught through data verification. We quantify the political bias of popular LLMs in the context of the recent vote of the German Bundestag using the score produced by the Wahl-O-Mat. This metric measures the alignment between an individual's political views and the positions of German political parties. We compare the models' alignment scores to identify factors influencing their political preferences. Doing so, we discover a bias toward left-leaning parties, most dominant in larger LLMs. Also, we find that the language we use to communicate with the models affects their political views. Additionally, we analyze the influence of a model's origin and release date and compare the results to the outcome of the recent vote of the Bundestag. Our results imply that LLMs are prone to exhibiting political bias. Large corporations with the necessary means to develop LLMs, thus, knowingly or unknowingly, have a responsibility to contain these biases, as they can influence each voter's decision-making process and inform public opinion in general and at scale.
摘要：随着人工智能的越来越流行，需要对固有的偏见进行仔细的评估，以构成减轻这些倾向可能对用户产生影响的影响的基础。大型语言模型（LLM）主要被许多主题的主要信息来源使用。 LLMS经常遇到事实错误，捏造数据（幻觉）或现在的偏见，使用户面临错误信息并影响意见。教育用户的风险是负责任使用的关键，因为与幻觉不同，偏见不能通过数据验证来捕获。我们使用wahl-o-mat产生的分数在最近的德国政府投票中量化了受欢迎的LLM的政治偏见。该指标衡量了个人的政治观点与德国政党的立场之间的一致性。我们比较模型的一致性得分，以确定影响其政治偏好的因素。这样做，我们发现对左倾政党有偏见，在较大的LLM中最主要的是。另外，我们发现我们用来与模型交流的语言会影响他们的政治观点。此外，我们分析了模型的起源和发布日期的影响，并将结果与联邦政府最近投票的结果进行了比较。我们的结果表明，LLM容易表现出政治偏见。大型公司具有开发LLMS的必要方式，因此，有意或在不知不觉中有责任遏制这些偏见，因为它们可以影响每个选民的决策过程，并一般和大规模地为公众舆论提供信息。
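
The abstract above does not spell out how the alignment score is computed. The following is a minimal Python sketch of one plausible agreement score, assuming each thesis is answered with agree (+1), neutral (0), or disagree (-1) and that a match earns 2 points, an adjacent answer 1 point, and an opposite answer 0 points. The function name `alignment_score`, the encoding, and the point scheme are illustrative assumptions, not the paper's or the Wahl-O-Mat's confirmed procedure.

```python
def alignment_score(answers, party_positions):
    """Return a percentage agreement between two answer lists.

    Assumed encoding (hypothetical): +1 = agree, 0 = neutral, -1 = disagree.
    Assumed points per thesis: 2 for identical answers, 1 when one side is
    neutral, 0 for opposite answers.
    """
    if len(answers) != len(party_positions):
        raise ValueError("both lists must cover the same theses")
    if not answers:
        raise ValueError("at least one thesis is required")

    points = 0
    for model_answer, party_answer in zip(answers, party_positions):
        distance = abs(model_answer - party_answer)  # 0, 1, or 2
        points += 2 - distance

    max_points = 2 * len(answers)
    return 100.0 * points / max_points


# Made-up answers for illustration only.
model_answers = [1, 0, -1, 1]
party_answers = [1, 1, -1, -1]
print(alignment_score(model_answers, party_answers))
```

Under these assumptions, comparing one model's answers against each party's positions yields a per-party percentage that can then be compared across models.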

Title: YABLoCo: Yet Another Benchmark for Long Context Code Generation

Authors: Aidar Valeev (1), Roman Garaev (1), Vadim Lomshakov (2), Irina Piontkovskaya (3), Vladimir Ivanov (1), Israel Adewuyi (1) ((1) Research Center of the Artificial Intelligence Institute, Innopolis University, Russia, (2) St. Petersburg Department of the Steklov Institute of Mathematics, Russia, (3) Huawei Noah's Ark Lab)
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2505.04406
Pdf URL: https://arxiv.org/pdf/2505.04406
Copy Paste: [[2505.04406]] YABLoCo: Yet Another Benchmark for Long Context Code Generation(https://arxiv.org/abs/2505.04406)
Keywords: language model, llm, long context
Abstract: Large Language Models demonstrate the ability to solve various programming tasks, including code generation. Typically, the performance of LLMs is measured on benchmarks with small or medium-sized context windows of thousands of lines of code. At the same time, in real-world software projects, repositories can span up to millions of LoC. This paper closes this gap by contributing a long-context code generation benchmark (YABLoCo). The benchmark features a test set of 215 functions selected from four large repositories with thousands of functions. The dataset contains function metadata, contexts of the functions with different levels of dependencies, docstrings, function bodies, and call graphs for each repository. This paper presents three key aspects of the contribution. First, the benchmark aims at function body generation in large repositories in C and C++, two languages not covered by previous benchmarks. Second, the benchmark contains large repositories from 200K to 2,000K LoC. Third, we contribute a scalable evaluation pipeline for efficient computing of the target metrics and a tool for visual analysis of generated code. Overall, these three aspects allow for evaluating code generation in large repositories in C and C++.
摘要：大型语言模型展示了解决各种编程任务（包括代码生成）的能力。通常，LLMS的性能是在具有数千行代码线的小或中尺寸上下文窗口的基准上测量的。同时，在实际软件项目中，存储库可以涵盖数百万个LOC。本文通过为长上下文代码生成基准（Yabloco）做出贡献来缩小这一差距。该基准的特征是从具有数千个功能的四个大存储库中选择的215个功能的测试集。该数据集包含函数的元数据，具有不同级别依赖关系的功能上下文，DocStrings，功能体和每个存储库的调用图。本文介绍了贡献的三个关键方面。首先，基准测试旨在在C和C ++的大型存储库中发挥作用，这两种语言未涵盖以前的基准测试。其次，基准为200k至2,000k Loc的大型存储库。第三，我们为目标指标的有效计算和可视化分析生成代码的工具提供了可扩展的评估管道。总体而言，这三个方面允许评估C和C ++大型存储库中的代码生成。

Title: OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models

Authors: Xiaoyu Xu, Minxin Du, Qingqing Ye, Haibo Hu
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04416
Pdf URL: https://arxiv.org/pdf/2505.04416
Copy Paste: [[2505.04416]] OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models(https://arxiv.org/abs/2505.04416)
Keywords: language model, llm
Abstract: Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. To address this, we propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. The framework follows a structured process: extracting target tokens, building retain sets, and fine-tuning with a tailored loss function comprising three components -- masking, distillation, and world fact. Using low-rank adapters (LoRA), it ensures efficiency without compromising unlearning quality. We conduct experiments on multiple datasets, including the Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (new document-level memorization score), model utility, and fluency. Results demonstrate its effectiveness in resisting membership inference attacks, minimizing the impact on retained data, and maintaining robustness across diverse scenarios.
摘要：大型语言模型（LLMS）接受了广泛的语料库培训的风险，将敏感，版权或有毒内容记住。为了解决这个问题，我们提出了遗忘，这是一个坚固的学习框架，可以在保留模型实用程序的同时删除目标数据。该框架遵循一个结构化的过程：提取目标令牌，建筑物保留套件以及通过量身定制的损失功能进行微调，其中包括三个组件 - 掩盖，蒸馏和世界事实。使用低级适配器（LORA），它可以确保效率，而不会损害强烈的学习质量。我们使用全面的指标套件在多个数据集上进行了多个数据集的实验：忘记质量（新的文档级记忆得分），模型实用程序和流利度。结果证明了其在抵抗会员推理攻击，最大程度地降低对保留数据的影响以及在各种情况下保持鲁棒性的有效性。

Title: Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

Authors: Yehui Tang, Yichun Yin, Yaoyuan Wang, Hang Zhou, Yu Pan, Wei Guo, Ziyang Zhang, Miao Rang, Fangcheng Liu, Naifu Zhang, Binghan Li, Yonghan Dong, Xiaojun Meng, Yasheng Wang, Dong Li, Yin Li, Dandan Tu, Can Chen, Youliang Yan, Fisher Yu, Ruiming Tang, Yunhe Wang, Botian Huang, Bo Wang, Boxiao Liu, Changzheng Zhang, Da Kuang, Fei Liu, Gang Huang, Jiansheng Wei, Jiarui Qin, Jie Ran, Jinpeng Li, Jun Zhao, Liang Dai, Lin Li, Liqun Deng, Peifeng Qin, Pengyuan Zeng, Qiang Gu, Shaohua Tang, Shengjun Cheng, Tao Gao, Tao Yu, Tianshu Li, Tianyu Bi, Wei He, Weikai Mao, Wenyong Huang, Wulong Liu, Xiabing Li, Xianzhi Yu, Xueyu Wu, Xu He, Yangkai Du, Yan Xu, Ye Tian, Yimeng Wu, Yongbing Huang, Yong Tian, Yong Zhu, Yue Li, Yufei Wang, Yuhang Gai, Yujun Li, Yu Luo, Yunsheng Ni, Yusen Sun, Zelin Chen, Zhe Liu, Zhicheng Liu, Zhipeng Tu, Zilin Ding, Zongyuan Zhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04519
Pdf URL: https://arxiv.org/pdf/2505.04519
Copy Paste: [[2505.04519]] Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs(https://arxiv.org/abs/2505.04519)
Keywords: language model, llm
Abstract: Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
摘要：稀疏的大语言模型（LLM）与专家（MOE）混合在一起，接近万亿个参数正在主导着最有能力的语言模型的领域。但是，大型模型量表对基础软件和硬件系统构成了重大挑战。在本文中，我们旨在发现将这种规模限制在上升NPU上的食谱。关键目标是在动态稀疏模型结构下更好地使用计算资源，并在实际硬件上实现预期的性能增长。为了选择适用于Ascend NPU的模型配置，而无需重复运行昂贵的实验，我们利用模拟比较各种模型超参数的权衡。这项研究导致了Pangu Ultra Moe，这是一个具有7180亿参数的稀疏LLM，我们在模型上进行了实验以验证模拟结果。在系统方面，我们深入研究专家并行性，以优化NPU设备之间的通信以减少开销的同步。我们还优化了设备内的内存效率，以进一步降低参数和激活管理开销。最后，当训练Pangu Ultra MoE（在6K上升NPU上的DeepSeek R1）训练时，我们的MFU为30.0％，并证明Ascend System能够利用最先进的语言模型的所有培训阶段。广泛的实验表明，我们的配方可以通过MOE进行大规模稀疏语言模型的有效培训。我们还研究了此类模型的行为，以供将来参考。
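
The abstract reports an MFU (model FLOPs utilization) of 30.0% but gives no formula. As a rough illustration, MFU is usually the ratio of the FLOPs a training run actually performs per second to the aggregate peak FLOPs of the hardware. The Python sketch below assumes the common approximation of about 6 FLOPs per active parameter per token (forward plus backward pass, ignoring attention terms). The function name and every number in the usage example are hypothetical and are not taken from the paper.

```python
def model_flops_utilization(active_params, tokens_per_second,
                            num_devices, peak_flops_per_device):
    """Estimate MFU as achieved FLOPs/s divided by peak FLOPs/s.

    Assumption: training costs roughly 6 FLOPs per active parameter per
    token. For a sparse MoE model, `active_params` means the parameters
    used per token, not the total parameter count.
    """
    if num_devices <= 0 or peak_flops_per_device <= 0:
        raise ValueError("device count and peak FLOPs must be positive")

    achieved_flops_per_second = 6.0 * active_params * tokens_per_second
    peak_flops_per_second = num_devices * peak_flops_per_device
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical numbers chosen only to make the arithmetic easy to follow.
mfu = model_flops_utilization(
    active_params=1e9,           # 1B active parameters (made up)
    tokens_per_second=1e4,       # throughput (made up)
    num_devices=1,
    peak_flops_per_device=6e14,  # peak FLOPs/s per device (made up)
)
print(f"MFU = {mfu:.1%}")
```

The 6-FLOPs-per-parameter rule is only an approximation; a careful measurement would also count attention and routing FLOPs.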

Title: Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

Authors: Josh McGiff, Nikola S. Nikolov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04531
Pdf URL: https://arxiv.org/pdf/2505.04531
Copy Paste: [[2505.04531]] Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review(https://arxiv.org/abs/2505.04531)
Keywords: language model, gpt, prompt, chat
Abstract: Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRL). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and the preservation of linguistic diversity in a world increasingly shaped by large-scale language technologies.
摘要：随着Chatgpt和Google Gemini等服务的出现，生成语言建模迅速流行。尽管这些模型在生产力和沟通中表现出了变革性的潜力，但它们绝大多数迎合了像英语这样的高农产品语言。这扩大了对自然语言处理（NLP）语言不平等的关注。本文介绍了第一个专门针对低资源语言（LRL）生成语言建模数据稀缺的策略的系统综述。从54项研究中得出，我们确定，分类和评估技术方法，包括跨生成任务的单语数据增强，反向翻译，多语言培训和及时的工程。我们还分析了体系结构选择，语言家庭代表和评估方法的趋势。我们的发现强烈依赖基于变压器的模型，集中于一小部分LRL，并且在整个研究中缺乏一致的评估。最后，我们提出了将这些方法扩展到更广泛的LRL的建议，并在构建公平的生成语言系统方面概述了开放挑战。最终，这篇综述旨在支持研究人员和开发人员为代表性不足的语言构建包容性的AI工具，这是赋予LRL扬声器能力的必要步骤，并在一个越来越多地受到大规模语言技术影响的世界中，在这个世界中维护语言多样性。

Title: ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Authors: Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04588
Pdf URL: https://arxiv.org/pdf/2505.04588
Copy Paste: [[2505.04588]] ZeroSearch: Incentivize the Search Capability of LLMs without Searching(https://arxiv.org/abs/2505.04588)
Keywords: language model, llm
Abstract: Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
摘要：有效的信息搜索对于增强大语言模型（LLMS）的推理和发电能力至关重要。最近的研究探索了使用强化学习（RL）通过与现实世界环境中的现场搜索引擎进行交互，从而提高了LLM的搜索功能。尽管这些方法显示出令人鼓舞的结果，但它们面临两个主要挑战：（1）不受控制的文档质量：搜索引擎返回的文档质量通常是无法预测的，因此将噪音和不稳定引入培训过程中。（2）高度高的API成本：RL培训需要频繁推出，可能涉及成千上万的搜索请求，这会导致大量API支出并严重限制可扩展性。为了应对这些挑战，我们介绍了ZeroSearch，这是一个强化学习框架，该框架激励LLM的搜索功能而无需与真实的搜索引擎进行交互。我们的方法始于轻量监督的微调，以将LLM转换为一个检索模块，该模块能够生成相关和嘈杂的文档以响应查询。在RL培训期间，我们采用了一种基于课程的推出策略，从而逐渐降低了生成的文档的质量，从而逐渐通过将其暴露于越来越具有挑战性的检索方案来逐渐引起模型的推理能力。广泛的实验表明，ZeroSearch使用3B LLM作为检索模块有效地激励LLM的搜索功能。值得注意的是，一个7B检索模块与真实搜索引擎的性能相当，而14B检索模块甚至超过了它。此外，它在各种参数尺寸的基础和指令调整模型中都很好地概括了，并且与广泛的RL算法兼容。
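
The curriculum-based rollout described above can be pictured as a schedule that raises the chance of serving a noisy simulated document as training progresses. The Python sketch below uses a simple linear ramp. The schedule shape, the function names, and the default probabilities are assumptions made for illustration, since the abstract does not state the actual schedule.

```python
import random


def noise_probability(step, total_steps, p_start=0.0, p_end=0.5):
    """Probability of serving a noisy document at a given training step.

    Assumption: a linear ramp from p_start to p_end over training.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return p_start + fraction * (p_end - p_start)


def sample_document_kind(step, total_steps, rng):
    """Decide whether the simulated retriever returns a noisy document."""
    if rng.random() < noise_probability(step, total_steps):
        return "noisy"
    return "relevant"


# Illustration: noisy documents become more frequent later in training.
rng = random.Random(0)
for step in (0, 50, 100):
    kinds = [sample_document_kind(step, 100, rng) for _ in range(1000)]
    print(step, kinds.count("noisy") / len(kinds))
```

Early rollouts therefore see mostly relevant documents, while later rollouts face harder retrieval conditions, which mirrors the progressive degradation the abstract describes.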