2025-02-10

Title: JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment

Authors: Yehan Yan, Tianhao Ma, Ruotai Li, Xinhan Zheng, Guodong Shan, Chisheng Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04345
Pdf URL: https://arxiv.org/pdf/2502.04345
Copy Paste: [[2502.04345]] JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment(https://arxiv.org/abs/2502.04345)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Traditional Chinese medicine (TCM) plays a vital role in health protection and disease treatment, but its practical application requires extensive medical knowledge and clinical experience. Existing TCM Large Language Models (LLMs) exhibit two critical limitations: incomplete medical consultation and diagnosis, and inaccurate syndrome differentiation-based treatment. To address these issues, this study establishes JingFang (JF): a novel TCM Large Language Model that demonstrates expert-level capability in medical diagnosis and syndrome differentiation-based treatment. We devise a Multi-agent Dynamic Collaborative Chain-of-Thought Mechanism (MDCCTM) for medical consultation, giving JF effective and accurate diagnostic ability. In addition, a Syndrome Agent and a Dual-Stage Retrieval Scheme (DSRS) are developed to significantly enhance the capacity of JF for disease treatment based on syndrome differentiation. JingFang not only facilitates the application of LLMs but also promotes the effective practice of TCM in human health protection and disease treatment.

Title: Multi-Lingual Cyber Threat Detection in Tweets/X Using ML, DL, and LLM: A Comparative Analysis

Authors: Saydul Akbar Murad, Ashim Dahal, Nick Rahimi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04346
Pdf URL: https://arxiv.org/pdf/2502.04346
Copy Paste: [[2502.04346]] Multi-Lingual Cyber Threat Detection in Tweets/X Using ML, DL, and LLM: A Comparative Analysis(https://arxiv.org/abs/2502.04346)
Keywords: language model, llm
Abstract: Cyber threat detection has become an important area of focus in today's digital age due to the growing spread of fake information and harmful content on social media platforms such as Twitter (now 'X'). These cyber threats, often disguised within tweets, pose significant risks to individuals, communities, and even nations, emphasizing the need for effective detection systems. While previous research has explored tweet-based threats, much of the work is limited to specific languages, domains, or locations, or relies on single-model approaches, reducing its applicability to diverse real-world scenarios. To address these gaps, our study focuses on multi-lingual tweet cyber threat detection using a variety of advanced models. The research was conducted in three stages: (1) We collected and labeled tweet datasets in four languages (English, Chinese, Russian, and Arabic), employing both manual and polarity-based labeling methods to ensure high-quality annotations. (2) Each dataset was analyzed individually using machine learning (ML) and deep learning (DL) models to assess their performance on distinct languages. (3) Finally, we combined all four datasets into a single multi-lingual dataset and applied DL and large language model (LLM) architectures to evaluate their efficacy in identifying cyber threats across various languages. Our results show that among machine learning models, Random Forest (RF) attained the highest performance; however, the Bi-LSTM architecture consistently surpassed other DL and LLM architectures across all datasets. These findings underline the effectiveness of Bi-LSTM in multilingual cyber threat detection. The code for this paper can be found at this link: this https URL.

Title: SCALM: Detecting Bad Practices in Smart Contracts Through LLMs

Authors: Zongwei Li, Xiaoqi Li, Wenkai Li, Xin Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04347
Pdf URL: https://arxiv.org/pdf/2502.04347
Copy Paste: [[2502.04347]] SCALM: Detecting Bad Practices in Smart Contracts Through LLMs(https://arxiv.org/abs/2502.04347)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: As the Ethereum platform continues to mature and gain widespread usage, it is crucial to maintain high standards of smart contract writing practices. While bad practices in smart contracts may not directly lead to security issues, they do elevate the risk of encountering problems. Therefore, to understand and avoid these bad practices, this paper introduces the first systematic study of bad practices in smart contracts, delving into over 35 specific issues. Specifically, we propose a large language model (LLM)-based framework, SCALM. It combines Step-Back Prompting and Retrieval-Augmented Generation (RAG) to identify and address various bad practices effectively. Our extensive experiments using multiple LLMs and datasets have shown that SCALM outperforms existing tools in detecting bad practices in smart contracts.

Title: Prompt-based Depth Pruning of Large Language Models

Authors: Juyun Wee, Minjae Park, Jaeho Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04348
Pdf URL: https://arxiv.org/pdf/2502.04348
Copy Paste: [[2502.04348]] Prompt-based Depth Pruning of Large Language Models(https://arxiv.org/abs/2502.04348)
Keywords: language model, prompt
Abstract: Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent -- a block that is crucial for a task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models, and achieves better on-task performance than static depth pruning baselines.
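The prompt-routed pruning idea in the abstract above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "blocks" are scalar functions rather than transformer layers, `OMISSION_SETS` is hand-written rather than built in a data-driven way, and `route` is a hypothetical keyword rule standing in for the trained lightweight router the abstract describes. It is not the authors' implementation.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Block = Callable[[float], float]


def make_block(scale: float) -> Block:
    """Toy stand-in for a transformer block: a residual update on a scalar."""
    return lambda x: x + scale * x


# Hypothetical candidate omission sets (indices of blocks to skip).
OMISSION_SETS: List[FrozenSet[int]] = [
    frozenset({1, 2}),
    frozenset({3, 4}),
]


def route(prompt: str) -> int:
    """Hypothetical router: maps a prompt to an index into OMISSION_SETS.

    A trained classifier would replace this keyword rule.
    """
    return 0 if "math" in prompt.lower() else 1


def forward(blocks: Sequence[Block], x: float, omit: FrozenSet[int]) -> float:
    """Run only the blocks whose index is not in the omission set."""
    for i, block in enumerate(blocks):
        if i in omit:
            continue
        x = block(x)
    return x


def pruned_forward(
    blocks: Sequence[Block], prompt: str, x: float
) -> Tuple[float, FrozenSet[int]]:
    """Pick an omission set from the prompt, then run the pruned model."""
    omit = OMISSION_SETS[route(prompt)]
    return forward(blocks, x, omit), omit
```

The point of the sketch is that the omission set is chosen once per prompt, before the forward pass, so the skipped blocks cost nothing at inference time.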

Title: Dynamic benchmarking framework for LLM-based conversational data capture

Authors: Pietro Alessandro Aluffi, Patrick Zietkiewicz, Marya Bazzi, Matt Arderne, Vladimirs Murevics
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04349
Pdf URL: https://arxiv.org/pdf/2502.04349
Copy Paste: [[2502.04349]] Dynamic benchmarking framework for LLM-based conversational data capture(https://arxiv.org/abs/2502.04349)
Keywords: language model, llm, agent
Abstract: The rapid evolution of large language models (LLMs) has transformed conversational agents, enabling complex human-machine interactions. However, evaluation frameworks often focus on single tasks, failing to capture the dynamic nature of multi-turn dialogues. This paper introduces a dynamic benchmarking framework to assess LLM-based conversational agents through interactions with synthetic users. The framework integrates generative agent simulation to evaluate performance on key dimensions: information extraction, context awareness, and adaptive engagement. By simulating various aspects of user behavior, our work provides a scalable, automated, and flexible benchmarking approach. Experimental evaluation - within a loan application use case - demonstrates the framework's effectiveness under one-shot and few-shot extraction conditions. Results show that adaptive strategies improve data extraction accuracy, especially when handling ambiguous responses. Future work will extend its applicability to broader domains and incorporate additional metrics (e.g., conversational coherence, user engagement). This study contributes a structured, scalable approach to evaluating LLM-based conversational agents, facilitating real-world deployment.

Title: CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

Authors: Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, Chuchu Fan
Subjects: cs.CL, cs.AI, cs.LG, cs.SC, cs.SE
Abstract URL: https://arxiv.org/abs/2502.04350
Pdf URL: https://arxiv.org/pdf/2502.04350
Copy Paste: [[2502.04350]] CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance(https://arxiv.org/abs/2502.04350)
Keywords: language model, gpt, llm
Abstract: Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-round guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-round supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the existing best LLM OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, Datasets, and Codes are available at this https URL.

Title: NER4all or Context is All You Need: Using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach

Authors: Torsten Hiltmann, Martin Dröge, Nicole Dresselhaus, Till Grallert, Melanie Althage, Paul Bayer, Sophie Eckenstaler, Koray Mendi, Jascha Marijn Schmitz, Philipp Schneider, Wiebke Sczeponik, Anica Skibba
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04351
Pdf URL: https://arxiv.org/pdf/2502.04351
Copy Paste: [[2502.04351]] NER4all or Context is All You Need: Using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach(https://arxiv.org/abs/2502.04351)
Keywords: llm, prompt
Abstract: Named entity recognition (NER) is a core task for historical research in automatically establishing all references to people, places, events and the like. Yet, due to the high linguistic and genre diversity of sources, only limited canonisation of spellings, the level of required historical domain knowledge, and the scarcity of annotated training data, established approaches to natural language processing (NLP) have been both extremely expensive and yielded only unsatisfactory results in terms of recall and precision. Our paper introduces a new approach. We demonstrate how readily-available, state-of-the-art LLMs significantly outperform two leading NLP frameworks, spaCy and flair, for NER in historical documents, with F1-scores seven to twenty-two percent higher. Our ablation study shows how providing historical context to the task and a bit of persona modelling that turns focus away from a purely linguistic approach are core to a successful prompting strategy. We also demonstrate that, contrary to our expectations, providing increasing numbers of examples in few-shot approaches does not improve recall or precision below a threshold of 16-shot. In consequence, our approach democratises access to NER for all historians by removing the barrier of scripting languages and computational skills required for established NLP tools and instead leveraging natural language prompts and consumer-grade tools and frontends.

Title: Investigating the Robustness of Deductive Reasoning with Large Language Models

Authors: Fabian Hoppe, Filip Ilievski, Jan-Christoph Kalo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04352
Pdf URL: https://arxiv.org/pdf/2502.04352
Copy Paste: [[2502.04352]] Investigating the Robustness of Deductive Reasoning with Large Language Models(https://arxiv.org/abs/2502.04352)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been shown to achieve impressive results for many reasoning-based Natural Language Processing (NLP) tasks, suggesting a degree of deductive reasoning capability. However, it remains unclear to what extent LLMs, in both informal and autoformalisation methods, are robust on logical deduction tasks. Moreover, while many LLM-based deduction methods have been proposed, there is a lack of a systematic study that analyses the impact of their design components. Addressing these two challenges, we propose the first study of the robustness of LLM-based deductive reasoning methods. We devise a framework with two families of perturbations: adversarial noise and counterfactual statements, which jointly generate seven perturbed datasets. We organize the landscape of LLM reasoners according to their reasoning format, formalisation syntax, and feedback for error recovery. The results show that adversarial noise affects autoformalisation, while counterfactual statements influence all approaches. Detailed feedback does not improve overall accuracy despite reducing syntax errors, pointing to the challenge of LLM-based methods to self-correct effectively.

Title: CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements

Authors: Afshin Khadangi, Amir Sartipi, Igor Tchappi, Gilbert Fridgen
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.04353
Pdf URL: https://arxiv.org/pdf/2502.04353
Copy Paste: [[2502.04353]] CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements(https://arxiv.org/abs/2502.04353)
Keywords: language model, llm
Abstract: Art, as a universal language, can be interpreted in diverse ways, with artworks embodying profound meanings and nuances. The advent of Large Language Models (LLMs) and the availability of Multimodal Large Language Models (MLLMs) raise the question of how these transformative models can be used to assess and interpret the artistic elements of artworks. While research has been conducted in this domain, to the best of our knowledge, a deep and detailed understanding of the technical and expressive features of artworks using LLMs has not been explored. In this study, we investigate the automation of a formal art analysis framework to analyze a large number of artworks rapidly and examine how their patterns evolve over time. We explore how LLMs can decode artistic expressions, visual elements, composition, and techniques, revealing emerging patterns that develop across periods. Finally, we discuss the strengths and limitations of LLMs in this context, emphasizing their ability to process vast quantities of art-related data and generate insightful interpretations. Due to the exhaustive and granular nature of the results, we have developed interactive data visualizations, available online at this https URL, to enhance understanding and accessibility.

Title: Reviving The Classics: Active Reward Modeling in Large Language Model Alignment

Authors: Yunyi Shen, Hao Sun, Jean-François Ton
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04354
Pdf URL: https://arxiv.org/pdf/2502.04354
Copy Paste: [[2502.04354]] Reviving The Classics: Active Reward Modeling in Large Language Model Alignment(https://arxiv.org/abs/2502.04354)
Keywords: language model, llm, prompt
Abstract: Building neural reward models from human preferences is a pivotal component in reinforcement learning from human feedback (RLHF) and large language model alignment research. Given the scarcity and high cost of human annotation, how to select the most informative pairs to annotate is an essential yet challenging open problem. In this work, we highlight the insight that an ideal comparison dataset for reward modeling should balance exploration of the representation space and make informative comparisons between pairs with moderate reward differences. Technically, challenges arise in quantifying the two objectives and efficiently prioritizing the comparisons to be annotated. To address this, we propose the Fisher information-based selection strategies, adapt theories from the classical experimental design literature, and apply them to the final linear layer of the deep neural network-based reward modeling tasks. Empirically, our method demonstrates remarkable performance, high computational efficiency, and stability compared to other selection methods from deep learning and classical statistical literature across multiple open-source LLMs and datasets. Further ablation studies reveal that incorporating cross-prompt comparisons in active reward modeling significantly enhances labeling efficiency, shedding light on the potential for improved annotation strategies in RLHF.
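As a rough illustration of Fisher information-based pair selection on a final linear layer, the sketch below greedily picks comparisons that most increase the determinant of the accumulated Fisher information (a D-optimality criterion from classical experimental design). The assumptions are loud ones: a Bradley-Terry preference model, two-dimensional features, a current weight estimate `w`, a small `ridge` term, and the greedy D-optimal rule itself are all chosen here for illustration. The abstract does not state which criterion or procedure the authors use.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]
Mat = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def pair_information(w: Vec, fa: Vec, fb: Vec) -> Mat:
    """Fisher information of one Bradley-Terry comparison for reward w . f."""
    d = (fa[0] - fb[0], fa[1] - fb[1])
    p = sigmoid(w[0] * d[0] + w[1] * d[1])
    c = p * (1.0 - p)  # largest when the predicted reward gap is moderate
    return [[c * d[0] * d[0], c * d[0] * d[1]],
            [c * d[1] * d[0], c * d[1] * d[1]]]


def add(m: Mat, n: Mat) -> Mat:
    return [[m[i][j] + n[i][j] for j in range(2)] for i in range(2)]


def det2(m: Mat) -> float:
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def select_pairs(
    w: Vec, pairs: Sequence[Tuple[Vec, Vec]], budget: int, ridge: float = 1e-3
) -> List[int]:
    """Greedy D-optimal selection: maximize det of accumulated information."""
    info: Mat = [[ridge, 0.0], [0.0, ridge]]
    remaining = list(range(len(pairs)))
    chosen: List[int] = []
    for _ in range(min(budget, len(pairs))):
        best = max(
            remaining,
            key=lambda i: det2(add(info, pair_information(w, *pairs[i]))),
        )
        chosen.append(best)
        remaining.remove(best)
        info = add(info, pair_information(w, *pairs[best]))
    return chosen
```

In this toy setup a pair with a very large predicted reward gap contributes little information because `p * (1 - p)` is close to zero, while pairs that differ along a feature direction not yet covered are favored, which mirrors the balance between exploration and moderate reward differences described in the abstract.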

Title: LLM-ProS: Analyzing Large Language Models' Performance in Competitive Problem Solving

Authors: Md Sifat Hossain, Anika Tabassum, Md. Fahim Arefin, Tarannum Shaila Zaman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04355
Pdf URL: https://arxiv.org/pdf/2502.04355
Copy Paste: [[2502.04355]] LLM-ProS: Analyzing Large Language Models' Performance in Competitive Problem Solving(https://arxiv.org/abs/2502.04355)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: The rapid advancement of large language models has opened new avenues for automating complex problem-solving tasks such as algorithmic coding and competitive programming. This paper introduces a novel evaluation technique, LLM-ProS, to assess the performance of state-of-the-art LLMs on International Collegiate Programming Contest (ICPC) problems. Using a curated dataset of 166 World Finals problems from 2011 to 2024, we benchmark the models' reasoning, accuracy, and efficiency. We evaluate five models (GPT-4o, Mistral Large, Llama-3.1-405B, and the o1 family, consisting of o1-mini and o1-preview) across critical metrics like correctness, resource utilization, and response calibration. Our results reveal significant differences in the models' abilities to generalize, adapt, and solve novel problems. We also investigate the impact of training methodologies, dataset contamination, and chain-of-thought reasoning on model performance. The findings provide new insights into optimizing LLMs for algorithmic tasks, highlighting both strengths and limitations of current models.

Title: Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription

Authors: Mahdi Alkaeed, Sofiat Abioye, Adnan Qayyum, Yosra Magdi Mekki, Ilhem Berrou, Mohamad Abdallah, Ala Al-Fuqaha, Muhammad Bilal, Junaid Qadir
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04356
Pdf URL: https://arxiv.org/pdf/2502.04356
Copy Paste: [[2502.04356]] Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription(https://arxiv.org/abs/2502.04356)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: In response to the success of proprietary Large Language Models (LLMs) such as OpenAI's GPT-4, there is a growing interest in developing open, non-proprietary LLMs and AI foundation models (AIFMs) for transparent use in academic, scientific, and non-commercial applications. Despite their inability to match the refined functionalities of their proprietary counterparts, open models hold immense potential to revolutionize healthcare applications. In this paper, we examine the prospects of open-source LLMs and AIFMs for developing healthcare applications and make two key contributions. Firstly, we present a comprehensive survey of the current state-of-the-art open-source healthcare LLMs and AIFMs and introduce a taxonomy of these open AIFMs, categorizing their utility across various healthcare tasks. Secondly, to evaluate the general-purpose applications of open LLMs in healthcare, we present a case study on personalized prescriptions. This task is particularly significant due to its critical role in delivering tailored, patient-specific medications that can greatly improve treatment outcomes. In addition, we compare the performance of open-source models with proprietary models in settings with and without Retrieval-Augmented Generation (RAG). Our findings suggest that, although less refined, open LLMs can achieve performance comparable to proprietary models when paired with grounding techniques such as RAG. Furthermore, to highlight the clinical significance of LLMs-empowered personalized prescriptions, we perform subjective assessment through an expert clinician. We also elaborate on ethical considerations and potential risks associated with the misuse of powerful LLMs and AIFMs, highlighting the need for a cautious and responsible implementation in healthcare.

Title: Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs

Authors: Hao Sun, Yunyi Shen, Jean-Francois Ton, Mihaela van der Schaar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04357
Pdf URL: https://arxiv.org/pdf/2502.04357
Copy Paste: [[2502.04357]] Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs(https://arxiv.org/abs/2502.04357)
Keywords: language model, llm, chat
Abstract: Large Language Models (LLMs) have made substantial strides in structured tasks through Reinforcement Learning (RL), demonstrating proficiency in mathematical reasoning and code generation. However, applying RL in broader domains like chatbots and content generation -- through the process known as Reinforcement Learning from Human Feedback (RLHF) -- presents unique challenges. Reward models in RLHF are critical, acting as proxies that evaluate the alignment of LLM outputs with human intent. Despite advancements, the development of reward models is hindered by challenges such as computationally heavy training, costly evaluation, and therefore poor reproducibility. We advocate for using embedding-based input in reward model research as an accelerated solution to those challenges. By leveraging embeddings for reward modeling, we can enhance reproducibility, reduce computational demands on hardware, improve training stability, and significantly reduce training and evaluation costs, hence facilitating fair and efficient comparisons in this active research area. We then show a case study of reproducing existing reward model ensemble research using embedding-based reward models. We discuss future avenues for research, aiming to contribute to safer and more effective LLM deployments.
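To make the embedding-based setting concrete, here is a minimal sketch of fitting a reward model on precomputed embeddings using only the Python standard library, so it runs on a CPU. It assumes a linear reward over fixed embeddings and a Bradley-Terry preference loss optimized by plain gradient descent; the function names, learning rate, and toy data are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Embedding = Sequence[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def train_reward_model(
    pairs: Sequence[Tuple[Embedding, Embedding]],
    dim: int,
    lr: float = 0.5,
    epochs: int = 200,
) -> List[float]:
    """Fit linear reward weights on (chosen, rejected) embedding pairs.

    Minimizes the mean Bradley-Terry loss -log sigmoid(w . (chosen - rejected)).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            p = 1.0 / (1.0 + math.exp(-dot(w, diff)))
            # d/dw of -log(p) is -(1 - p) * diff
            for k in range(dim):
                grad[k] -= (1.0 - p) * diff[k]
        w = [wk - lr * g / len(pairs) for wk, g in zip(w, grad)]
    return w


def reward(w: Sequence[float], embedding: Embedding) -> float:
    """Score a response embedding with the learned linear reward."""
    return dot(w, embedding)
```

Because the embeddings are computed once and reused, each experiment only trains the small head on top, which is the source of the cost and reproducibility benefits the abstract argues for.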

Title: Position: Scaling LLM Agents Requires Asymptotic Analysis with LLM Primitives

Authors: Elliot Meyerson, Xin Qiu
Subjects: cs.CL, cs.AI, cs.CC, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2502.04358
Pdf URL: https://arxiv.org/pdf/2502.04358
Copy Paste: [[2502.04358]] Position: Scaling LLM Agents Requires Asymptotic Analysis with LLM Primitives(https://arxiv.org/abs/2502.04358)
Keywords: language model, llm, agent
Abstract: Decomposing hard problems into subproblems often makes them easier and more efficient to solve. With large language models (LLMs) crossing critical reliability thresholds for a growing slate of capabilities, there is an increasing effort to decompose systems into sets of LLM-based agents, each of whom can be delegated sub-tasks. However, this decomposition (even when automated) is often intuitive, e.g., based on how a human might assign roles to members of a human team. How close are these role decompositions to optimal? This position paper argues that asymptotic analysis with LLM primitives is needed to reason about the efficiency of such decomposed systems, and that insights from such analysis will unlock opportunities for scaling them. By treating the LLM forward pass as the atomic unit of computational cost, one can separate out the (often opaque) inner workings of a particular LLM from the inherent efficiency of how a set of LLMs are orchestrated to solve hard problems. In other words, if we want to scale the deployment of LLMs to the limit, instead of anthropomorphizing LLMs, asymptotic analysis with LLM primitives should be used to reason about and develop more powerful decompositions of large problems into LLM agents.
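A toy accounting exercise shows what treating the forward pass as the atomic unit of cost can look like. The scenario is hypothetical and not from the paper: `n` partial results are merged by LLM calls, each call is assumed to cost a fixed `passes_per_call` forward passes, and two orchestrations are compared, a sequential chain and a balanced binary tree.

```python
from typing import Dict


def chain_cost(n: int, passes_per_call: int) -> Dict[str, int]:
    """Merge n items one after another: n - 1 calls, all on the critical path."""
    calls = max(n - 1, 0)
    return {
        "total_passes": calls * passes_per_call,
        "sequential_passes": calls * passes_per_call,
    }


def tree_cost(n: int, passes_per_call: int) -> Dict[str, int]:
    """Merge n items pairwise in a balanced tree.

    The number of calls is still n - 1, but calls on the same level can run
    in parallel, so the critical path is ceil(log2(n)) calls.
    """
    calls = max(n - 1, 0)
    depth = (n - 1).bit_length() if n > 1 else 0  # exact ceil(log2(n))
    return {
        "total_passes": calls * passes_per_call,
        "sequential_passes": depth * passes_per_call,
    }
```

Under these assumptions both orchestrations spend O(n) forward passes in total, but the chain has O(n) sequential passes while the tree has O(log n), a distinction that is independent of the inner workings of any particular LLM.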

Title: Exploring Spatial Language Grounding Through Referring Expressions

Authors: Akshar Tumu, Parisa Kordjamshidi
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.04359
Pdf URL: https://arxiv.org/pdf/2502.04359
Copy Paste: [[2502.04359]] Exploring Spatial Language Grounding Through Referring Expressions(https://arxiv.org/abs/2502.04359)
Keywords: language model
Abstract: Spatial reasoning is an important component of human cognition and is an area in which the latest vision-language models (VLMs) show signs of difficulty. Current analyses use image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for the evaluation of spatial reasoning by VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.

Title: MARAGE: Transferable Multi-Model Adversarial Attack for Retrieval-Augmented Generation Data Extraction

Authors: Xiao Hu, Eric Liu, Weizhou Wang, Xiangyu Guo, David Lie
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04360
Pdf URL: https://arxiv.org/pdf/2502.04360
Copy Paste: [[2502.04360]] MARAGE: Transferable Multi-Model Adversarial Attack for Retrieval-Augmented Generation Data Extraction(https://arxiv.org/abs/2502.04360)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) offers a solution to mitigate hallucinations in Large Language Models (LLMs) by grounding their outputs to knowledge retrieved from external sources. The use of private resources and data in constructing these external data stores can expose them to risks of extraction attacks, in which attackers attempt to steal data from these private databases. Existing RAG extraction attacks often rely on manually crafted prompts, which limit their effectiveness. In this paper, we introduce a framework called MARAGE for optimizing an adversarial string that, when appended to user queries submitted to a target RAG system, causes outputs containing the retrieved RAG data verbatim. MARAGE leverages a continuous optimization scheme that integrates gradients from multiple models with different architectures simultaneously to enhance the transferability of the optimized string to unseen models. Additionally, we propose a strategy that emphasizes the initial tokens in the target RAG data, further improving the attack's generalizability. Evaluations show that MARAGE consistently outperforms both manual and optimization-based baselines across multiple LLMs and RAG datasets, while maintaining robust transferability to previously unseen models. Moreover, we conduct probing tasks to shed light on the reasons why MARAGE is more effective compared to the baselines and to analyze the impact of our approach on the model's internal state.

Title: LLMs can be easily Confused by Instructional Distractions

Authors: Yerin Hwang, Yongil Kim, Jahyun Koo, Taegwan Kang, Hyunkyung Bae, Kyomin Jung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04362
Pdf URL: https://arxiv.org/pdf/2502.04362
Copy Paste: [[2502.04362]] LLMs can be easily Confused by Instructional Distractions(https://arxiv.org/abs/2502.04362)
Keywords: language model, llm, prompt
Abstract: Although large language models (LLMs) show exceptional skill in instruction-following tasks, this strength can turn into a vulnerability when the models are required to disregard certain instructions. Instruction-following tasks typically involve a clear task description and input text containing the target data to be processed. However, when the input itself resembles an instruction, confusion may arise, even if there is explicit prompting to distinguish between the task instruction and the input. We refer to this phenomenon as instructional distraction. In this paper, we introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs' performance under instructional distraction. The benchmark categorizes real-world instances of instructional distraction and evaluates LLMs across four instruction tasks (rewriting, proofreading, translation, and style transfer) alongside five input tasks (reasoning, code generation, mathematical reasoning, bias detection, and question answering). Our experimental results reveal that even the most advanced LLMs are susceptible to instructional distraction, often failing to accurately follow user intent in such cases.

Title: An Analysis for Reasoning Bias of Language Models with Small Initialization

Authors: Junjie Yao, Zhongwang Zhang, Zhi-Qin John Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04375
Pdf URL: https://arxiv.org/pdf/2502.04375
Copy Paste: [[2502.04375]] An Analysis for Reasoning Bias of Language Models with Small Initialization(https://arxiv.org/abs/2502.04375)
Keywords: language model, llm
Abstract: Transformer-based Large Language Models (LLMs) have revolutionized Natural Language Processing by demonstrating exceptional performance across diverse tasks. This study investigates the impact of the parameter initialization scale on the training behavior and task preferences of LLMs. We discover that smaller initialization scales encourage models to favor reasoning tasks, whereas larger initialization scales lead to a preference for memorization tasks. We validate this reasoning bias via real datasets and meticulously designed anchor functions. Further analysis of initial training dynamics suggests that specific model components, particularly the embedding space and self-attention mechanisms, play pivotal roles in shaping these learning biases. We provide a theoretical framework from the perspective of model training dynamics to explain these phenomena. Additionally, experiments on real-world language tasks corroborate our theoretical insights. This work enhances our understanding of how initialization strategies influence LLM performance on reasoning tasks and offers valuable guidelines for training models.

Title: MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf

Authors: Lingxiang Hu, Shurun Yuan, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04376
Pdf URL: https://arxiv.org/pdf/2502.04376
Copy Paste: [[2502.04376]] MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf(https://arxiv.org/abs/2502.04376)
Keywords: language model, gpt, llm, prompt
Abstract: In contemporary workplaces, meetings are essential for exchanging ideas and ensuring team alignment but often face challenges such as time consumption, scheduling conflicts, and inefficient participation. Recent advancements in Large Language Models (LLMs) have demonstrated their strong capabilities in natural language generation and reasoning, prompting the question: can LLMs effectively delegate participants in meetings? To explore this, we develop a prototype LLM-powered meeting delegate system and create a comprehensive benchmark using real meeting transcripts. Our evaluation reveals that GPT-4/4o maintain balanced performance between active and cautious engagement strategies. In contrast, Gemini 1.5 Pro tends to be more cautious, while Gemini 1.5 Flash and Llama3-8B/70B display more active tendencies. Overall, about 60% of responses address at least one key point from the ground-truth. However, improvements are needed to reduce irrelevant or repetitive content and enhance tolerance for transcription errors commonly found in real-world settings. Additionally, we implement the system in practical settings and collect real-world feedback from demos. Our findings underscore the potential and challenges of utilizing LLMs as meeting delegates, offering valuable insights into their practical application for alleviating the burden of meetings.

Title: Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data

Authors: Zhenqing Ling, Daoyuan Chen, Liuyi Yao, Yaliang Li, Ying Shen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04380
Pdf URL: https://arxiv.org/pdf/2502.04380
Copy Paste: [[2502.04380]] Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data(https://arxiv.org/abs/2502.04380)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or non-normalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this paper, we study the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations for both inter- and intra-diversity. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-development for LLMs.

Title: Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning

Authors: Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04381
Pdf URL: https://arxiv.org/pdf/2502.04381
Copy Paste: [[2502.04381]] Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning(https://arxiv.org/abs/2502.04381)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have attained human-level accuracy on medical question-answer (QA) benchmarks. However, their limitations in navigating open-ended clinical scenarios have recently been shown, raising concerns about the robustness and generalizability of LLM reasoning across diverse, real-world medical tasks. To probe potential LLM failure modes in clinical problem-solving, we present the medical abstraction and reasoning corpus (M-ARC). M-ARC assesses clinical reasoning through scenarios designed to exploit the Einstellung effect -- the fixation of thought arising from prior experience, targeting LLM inductive biases toward inflexible pattern matching from their training data rather than engaging in flexible reasoning. We find that LLMs, including current state-of-the-art o1 and Gemini models, perform poorly compared to physicians on M-ARC, often demonstrating lack of commonsense medical reasoning and a propensity to hallucinate. In addition, uncertainty estimation analyses indicate that LLMs exhibit overconfidence in their answers, despite their limited accuracy. The failure modes revealed by M-ARC in LLM medical reasoning underscore the need to exercise caution when deploying these models in clinical settings.

Title: Sparse Autoencoders for Hypothesis Generation

Authors: Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2502.04382
Pdf URL: https://arxiv.org/pdf/2502.04382
Copy Paste: [[2502.04382]] Sparse Autoencoders for Hypothesis Generation(https://arxiv.org/abs/2502.04382)
Keywords: llm
Abstract: We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
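
Note: step (2) of the abstract above, selecting features that predict the target variable, can be illustrated with a minimal sketch. The abstract does not say which selection criterion HypotheSAEs uses; the version below assumes ranking features by absolute Pearson correlation with the target, and the names `pearson` and `top_features` are hypothetical, chosen only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_features(activations, target, k):
    """Return indices of the k features most correlated (in absolute value) with target.

    activations: one row per text, each row a list of feature activations
                 (assumed here to come from an already-trained sparse autoencoder).
    target: one numeric value per text (e.g., clicks).
    """
    n_features = len(activations[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in activations]
        scores.append((abs(pearson(column, target)), j))
    # Highest score first; ties broken by lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]
```

Each selected index would then be passed to an LLM for a natural language interpretation, as in step (3); that step is not sketched here.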

Title: Enhancing Reasoning to Adapt Large Language Models for Domain-Specific Applications

Authors: Bo Wen, Xin Zhang
Subjects: cs.CL, cs.AI, cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2502.04384
Pdf URL: https://arxiv.org/pdf/2502.04384
Copy Paste: [[2502.04384]] Enhancing Reasoning to Adapt Large Language Models for Domain-Specific Applications(https://arxiv.org/abs/2502.04384)
Keywords: language model, llm, prompt
Abstract: This paper presents SOLOMON, a novel Neuro-inspired Large Language Model (LLM) Reasoning Network architecture that enhances the adaptability of foundation models for domain-specific applications. Through a case study in semiconductor layout design, we demonstrate how SOLOMON enables swift adaptation of general-purpose LLMs to specialized tasks by leveraging Prompt Engineering and In-Context Learning techniques. Our experiments reveal the challenges LLMs face in spatial reasoning and in applying domain knowledge to practical problems. Results show that SOLOMON instances significantly outperform their baseline LLM counterparts and achieve performance comparable to the state-of-the-art reasoning model o1-preview. We discuss future research directions for developing more adaptive AI systems that can continually learn, adapt, and evolve in response to new information and changing requirements.

Title: FedP$^2$EFT: Federated Learning to Personalize Parameter Efficient Fine-Tuning for Multilingual LLMs

Authors: Royson Lee, Minyoung Kim, Fady Rezk, Rui Li, Stylianos I. Venieris, Timothy Hospedales
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04387
Pdf URL: https://arxiv.org/pdf/2502.04387
Copy Paste: [[2502.04387]] FedP$^2$EFT: Federated Learning to Personalize Parameter Efficient Fine-Tuning for Multilingual LLMs(https://arxiv.org/abs/2502.04387)
Keywords: language model, llm
Abstract: Federated learning (FL) has enabled the training of multilingual large language models (LLMs) on diverse and decentralized multilingual data, especially on low-resource languages. To improve client-specific performance, personalization via the use of parameter-efficient fine-tuning (PEFT) modules such as LoRA is common. This involves a personalization strategy (PS), such as the design of the PEFT adapter structures (e.g., in which layers to add LoRAs and what ranks) and choice of hyperparameters (e.g., learning rates) for fine-tuning. Instead of manual PS configuration, we propose FedP$^2$EFT, a federated learning-to-personalize method for multilingual LLMs in cross-device FL settings. Unlike most existing PEFT structure selection methods, which are prone to overfitting low-data regimes, FedP$^2$EFT collaboratively learns the optimal personalized PEFT structure for each client via Bayesian sparse rank selection. Evaluations on both simulated and real-world multilingual FL benchmarks demonstrate that FedP$^2$EFT largely outperforms existing personalized fine-tuning methods, while complementing a range of existing FL methods.

Title: In Praise of Stubbornness: The Case for Cognitive-Dissonance-Aware Knowledge Updates in LLMs

Authors: Simone Clemente, Zied Ben Houidi, Alexis Huet, Dario Rossi, Giulio Franzese, Pietro Michiardi
Subjects: cs.CL, cs.AI, cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2502.04390
Pdf URL: https://arxiv.org/pdf/2502.04390
Copy Paste: [[2502.04390]] In Praise of Stubbornness: The Case for Cognitive-Dissonance-Aware Knowledge Updates in LLMs(https://arxiv.org/abs/2502.04390)
Keywords: language model, llm
Abstract: Despite remarkable capabilities, large language models (LLMs) struggle to continually update their knowledge without catastrophic forgetting. In contrast, humans effortlessly integrate new information, detect conflicts with existing beliefs, and selectively update their mental models. This paper introduces a cognitive-inspired investigation paradigm to study continual knowledge updating in LLMs. We implement two key components inspired by human cognition: (1) Dissonance and Familiarity Awareness, analyzing model behavior to classify information as novel, familiar, or dissonant; and (2) Targeted Network Updates, which track neural activity to identify frequently used (stubborn) and rarely used (plastic) neurons. Through carefully designed experiments in controlled settings, we uncover a number of empirical findings demonstrating the potential of this approach. First, dissonance detection is feasible using simple activation and gradient features, suggesting potential for cognitive-inspired training. Second, we find that non-dissonant updates largely preserve prior knowledge regardless of targeting strategy, revealing inherent robustness in LLM knowledge integration. Most critically, we discover that dissonant updates prove catastrophically destructive to the model's knowledge base, indiscriminately affecting even information unrelated to the current updates. This suggests fundamental limitations in how neural networks handle contradictions and motivates the need for new approaches to knowledge updating that better mirror human cognitive mechanisms.

Title: Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents

Authors: Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04392
Pdf URL: https://arxiv.org/pdf/2502.04392
Copy Paste: [[2502.04392]] Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents(https://arxiv.org/abs/2502.04392)
Keywords: language model, llm, agent
Abstract: The rapid expansion of web content has made on-device AI assistants indispensable for helping users manage the increasing complexity of online tasks. The emergent reasoning ability of large language models offers a promising path for next-generation on-device AI agents. However, deploying full-scale Large Language Models (LLMs) on resource-limited local devices is challenging. In this paper, we propose Division-of-Thoughts (DoT), a collaborative reasoning framework leveraging the synergy between locally deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT leverages a Task Decomposer to elicit the inherent planning abilities in language models to decompose user queries into smaller sub-tasks, which allows hybrid language models to fully exploit their respective strengths. In addition, DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks and create a dependency graph, facilitating parallel reasoning of sub-tasks and the identification of key steps. To allocate the appropriate model based on the difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an additional task head attached to the SLM that does not alter the SLM's parameters. To boost the adapter's task allocation capability, we propose a self-reinforced training method that relies solely on task execution feedback. Extensive experiments on various benchmarks demonstrate that DoT significantly reduces LLM costs while maintaining competitive reasoning accuracy. Specifically, DoT reduces the average reasoning time and API costs by 66.12% and 83.57%, respectively, while achieving reasoning accuracy comparable to the best baseline methods.
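
Note: the dependency-graph idea in the abstract above (sub-tasks whose prerequisites are all finished can be reasoned about in parallel) can be illustrated with a minimal sketch. This is not DoT's Task Scheduler, whose details the abstract does not give; it is a generic level-by-level topological ordering (Kahn's algorithm), and the function name `parallel_levels` and the example task names are hypothetical.

```python
from collections import defaultdict, deque


def parallel_levels(subtasks, depends_on):
    """Group sub-tasks into levels that could run in parallel.

    subtasks: list of task names.
    depends_on: dict mapping a task to the list of tasks it must wait for.
    Returns a list of levels; every task in a level depends only on tasks
    in earlier levels. Raises ValueError if the dependencies contain a cycle.
    """
    indegree = {task: 0 for task in subtasks}
    dependents = defaultdict(list)
    for task, prereqs in depends_on.items():
        for prereq in prereqs:
            indegree[task] += 1
            dependents[prereq].append(task)

    ready = deque(task for task in subtasks if indegree[task] == 0)
    levels = []
    scheduled = 0
    while ready:
        level = list(ready)
        ready.clear()
        levels.append(level)
        scheduled += len(level)
        # Finishing this level may unblock tasks that were waiting on it.
        for task in level:
            for dependent in dependents[task]:
                indegree[dependent] -= 1
                if indegree[dependent] == 0:
                    ready.append(dependent)

    if scheduled != len(subtasks):
        raise ValueError("dependency cycle detected")
    return levels


if __name__ == "__main__":
    tasks = ["find_flights", "find_hotels", "compare_prices", "book"]
    deps = {"compare_prices": ["find_flights", "find_hotels"], "book": ["compare_prices"]}
    print(parallel_levels(tasks, deps))
```

In this toy example the two search tasks share the first level, so they could be dispatched concurrently; which model (SLM or LLM) handles each sub-task is a separate decision that the sketch does not cover.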

Title: DECT: Harnessing LLM-assisted Fine-Grained Linguistic Knowledge and Label-Switched and Label-Preserved Data Generation for Diagnosis of Alzheimer's Disease

Authors: Tingyu Mo, Jacqueline C. K. Lam, Victor O.K. Li, Lawrence Y. L. Cheung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04394
Pdf URL: https://arxiv.org/pdf/2502.04394
Copy Paste: [[2502.04394]] DECT: Harnessing LLM-assisted Fine-Grained Linguistic Knowledge and Label-Switched and Label-Preserved Data Generation for Diagnosis of Alzheimer's Disease(https://arxiv.org/abs/2502.04394)
Keywords: language model, llm
Abstract: Alzheimer's Disease (AD) is an irreversible neurodegenerative disease affecting 50 million people worldwide. Low-cost, accurate identification of key markers of AD is crucial for timely diagnosis and intervention. Language impairment is one of the earliest signs of cognitive decline, which can be used to discriminate AD patients from normal control individuals. Patient-interviewer dialogues may be used to detect such impairments, but they are often mixed with ambiguous, noisy, and irrelevant information, making the AD detection task difficult. Moreover, the limited availability of AD speech samples and variability in their speech styles pose significant challenges in developing robust speech-based AD detection models. To address these challenges, we propose DECT, a novel speech-based domain-specific approach leveraging large language models (LLMs) for fine-grained linguistic analysis and label-switched and label-preserved data generation. Our study presents four novelties: (1) We harness the summarizing capabilities of LLMs to identify and distill key Cognitive-Linguistic information from noisy speech transcripts, effectively filtering irrelevant information. (2) We leverage the inherent linguistic knowledge of LLMs to extract linguistic markers from unstructured and heterogeneous audio transcripts. (3) We exploit the compositional ability of LLMs to generate AD speech transcripts consisting of diverse linguistic patterns to overcome the speech data scarcity challenge and enhance the robustness of AD detection models. (4) We use the augmented AD textual speech transcript dataset and a more fine-grained representation of AD textual speech transcript data to fine-tune the AD detection model. The results show that DECT achieves superior model performance, with an 11% improvement in AD detection accuracy on the datasets from DementiaBank compared to the baselines.

Title: Multimodal Medical Code Tokenizer

Authors: Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04397
Pdf URL: https://arxiv.org/pdf/2502.04397
Copy Paste: [[2502.04397]] Multimodal Medical Code Tokenizer(https://arxiv.org/abs/2502.04397)
Keywords: language model
Abstract: Foundation models trained on patient electronic health records (EHRs) require tokenizing medical data into sequences of discrete vocabulary items. Existing tokenizers treat medical codes from EHRs as isolated textual tokens. However, each medical code is defined by its textual description, its position in ontological hierarchies, and its relationships to other codes, such as disease co-occurrences and drug-treatment associations. Medical vocabularies contain more than 600,000 codes with critical information for clinical reasoning. We introduce MedTok, a multimodal medical code tokenizer that uses the text descriptions and relational context of codes. MedTok processes text using a language model encoder and encodes the relational structure with a graph encoder. It then quantizes both modalities into a unified token space, preserving modality-specific and cross-modality information. We integrate MedTok into five EHR models and evaluate it on operational and clinical tasks across in-patient and out-patient datasets, including outcome prediction, diagnosis classification, drug recommendation, and risk stratification. Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate using MedTok tokenizer with medical QA systems. Our results demonstrate the potential of MedTok as a unified tokenizer for medical codes, improving tokenization for medical foundation models.
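
Note: the phrase "quantizes both modalities into a unified token space" in the abstract above can be illustrated with a minimal sketch of nearest-neighbour vector quantization. The abstract does not describe MedTok's actual quantizer, so everything below is an assumption: the codebooks are toy values, the embeddings stand in for the outputs of the language model encoder and graph encoder, and the names `quantize` and `tokenize_code` are hypothetical.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to vector (squared Euclidean distance)."""
    best_index, best_distance = -1, float("inf")
    for index, code in enumerate(codebook):
        distance = sum((v - c) ** 2 for v, c in zip(vector, code))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def tokenize_code(text_embedding, graph_embedding, text_codebook, graph_codebook):
    """Map one medical code's two embeddings to token ids in a single shared id space.

    Graph-side ids are offset by the size of the text codebook so that the
    two modalities never collide (an assumed layout, for illustration only).
    """
    text_token = quantize(text_embedding, text_codebook)
    graph_token = len(text_codebook) + quantize(graph_embedding, graph_codebook)
    return [text_token, graph_token]


if __name__ == "__main__":
    text_codebook = [[0.0, 0.0], [1.0, 1.0]]
    graph_codebook = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
    print(tokenize_code([0.9, 1.2], [1.8, 2.1], text_codebook, graph_codebook))
```

A downstream EHR model would then consume these discrete ids in place of a single isolated textual token per medical code.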

Title: Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models

Authors: Xiao-Wen Yang, Xuan-Yi Zhu, Wen-Da Wei, Ding-Chu Zhang, Jie-Jing Shao, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04404
Pdf URL: https://arxiv.org/pdf/2502.04404
Copy Paste: [[2502.04404]] Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models(https://arxiv.org/abs/2502.04404)
Keywords: language model, llm
Abstract: The integration of slow-thinking mechanisms into large language models (LLMs) offers a promising way toward achieving Level 2 AGI Reasoners, as exemplified by systems like OpenAI's o1. However, several significant challenges remain, including inefficient overthinking and an overreliance on auxiliary reward models. We point out that these limitations stem from LLMs' inability to internalize the search process, a key component of effective reasoning. A critical step toward addressing this issue is enabling LLMs to autonomously determine when and where to backtrack, a fundamental operation in traditional search algorithms. To this end, we propose a self-backtracking mechanism that equips LLMs with the ability to backtrack during both training and inference. This mechanism not only enhances reasoning ability but also efficiency by transforming slow-thinking processes into fast-thinking through self-improvement. Empirical evaluations demonstrate that our proposal significantly enhances the reasoning capabilities of LLMs, achieving a performance gain of over 40 percent compared to the optimal-path supervised fine-tuning method. We believe this study introduces a novel and promising pathway for developing more advanced and robust Reasoners.

Title: MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot

Authors: Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2502.04413
Pdf URL: https://arxiv.org/pdf/2502.04413
Copy Paste: [[2502.04413]] MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot(https://arxiv.org/abs/2502.04413)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) is a well-suited technique for retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a key module of the healthcare copilot, helping reduce misdiagnosis for healthcare practitioners and patients. However, the diagnostic accuracy and specificity of existing heuristic-based RAG models used in the medical domain are inadequate, particularly for diseases with similar manifestations. This paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited reasoning for the medical domain that retrieves diagnosis and treatment recommendations based on manifestations. MedRAG systematically constructs a comprehensive four-tier hierarchical diagnostic KG encompassing critical diagnostic differences of various diseases. These differences are dynamically integrated with similar EHRs retrieved from an EHR database, and reasoned within a large language model. This process enables more accurate and specific decision support, while also proactively providing follow-up questions to enhance personalized medical decision-making. MedRAG is evaluated on both a public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) collected from Tan Tock Seng Hospital, and its performance is compared against various existing RAG methods. Experimental results show that, leveraging the information integration and relational abilities of the KG, our MedRAG provides more specific diagnostic insights and outperforms state-of-the-art models in reducing misdiagnosis rates. Our code will be available at this https URL
摘要：检索增强的一代（RAG）是一种适合检索隐私敏感电子健康记录（EHR）的技术。它可以作为医疗保健副驾驶的关键模块，有助于减少医疗保健从业者和患者的误诊。但是，医学领域中使用的基于启发式的抹布模型的诊断准确性和特异性不足，特别是对于具有相似表现的疾病。本文提出了MedRag，这是一种通过知识图（KG）精心促进的医学领域推理增强的抹布模型，以根据表现来检索诊断和治疗建议。 MedRag系统地构建了一个全面的四层层次诊断KG，包括各种疾病的关键诊断差异。这些差异与从EHR数据库中检索的类似EHR动态集成，并在大型语言模型中进行了推理。该过程使得更准确，更具体的决策支持，同时还可以主动提供后续问题以增强个性化的医疗决策。在公共数据集DDXPLU和从Tan Tock Seng医院收集的公共数据集DDXPLU和私人慢性疼痛诊断数据集（CPDD）上进行了评估，并将其性能与各种现有的抹布方法进行了比较。实验结果表明，利用KG的信息整合和关系能力，我们的MedRag提供了更具体的诊断见解，并且在降低误诊率的方面优于最先进的模型。我们的代码将在此HTTPS URL上提供

Title: EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models

Authors: He Hu, Yucheng Zhou, Lianzhong You, Hongbo Xu, Qianning Wang, Zheng Lian, Fei Richard Yu, Fei Ma, Laizhong Cui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04424
Pdf URL: https://arxiv.org/pdf/2502.04424
Copy Paste: [[2502.04424]] EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models(https://arxiv.org/abs/2502.04424)
Keywords: language model, llm
Abstract: With the integration of multimodal large language models (MLLMs) into robotic systems and various AI applications, embedding emotional intelligence (EI) capabilities into these models is essential for enabling robots to effectively address human emotional needs and interact seamlessly in real-world scenarios. Existing static, text-based, or text-image benchmarks overlook the multimodal complexities of real-world interactions and fail to capture the dynamic, multimodal nature of emotional expressions, making them inadequate for evaluating MLLMs' EI. Based on established psychological theories of EI, we build EmoBench-M, a novel benchmark designed to evaluate the EI capability of MLLMs across 13 evaluation scenarios from three key dimensions: foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis. Evaluations of both open-source and closed-source MLLMs on EmoBench-M reveal a significant performance gap between them and humans, highlighting the need to further advance their EI capabilities. All benchmark resources, including code and datasets, are publicly available at this https URL.
摘要：随着将多模式的大语言模型（MLLM）整合到机器人系统和各种AI应用中，将情绪智力（EI）功能嵌入这些模型对于使机器人能够有效地解决人类的情绪需求并在现实世界中无缝交互至关重要。现有的静态，基于文本或文本图像基准测试忽略了现实世界相互作用的多模式复杂性，并且无法捕获情感表达的动态，多模式的性质，从而使它们不足以评估MLLMS的EI。基于EI的既定心理理论，我们建立了Emobench-M，这是一种新颖的基准，旨在评估来自三个关键维度的13个评估场景中MLLM的EI能力：基本的情感识别，对话情感理解和社会复杂的情绪分析。对EMOBENCE-M上开源和封闭源MLLM的评估揭示了它们与人之间的显着性能差距，这强调了进一步提高其EI能力的必要性。所有基准资源（包括代码和数据集）在此HTTPS URL上均可公开使用。

Title: Decoding AI Judgment: How LLMs Assess News Credibility and Bias

Authors: Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2502.04426
Pdf URL: https://arxiv.org/pdf/2502.04426
Copy Paste: [[2502.04426]] Decoding AI Judgment: How LLMs Assess News Credibility and Bias(https://arxiv.org/abs/2502.04426)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are increasingly used to assess news credibility, yet little is known about how they make these judgments. While prior research has examined political bias in LLM outputs or their potential for automated fact-checking, their internal evaluation processes remain largely unexamined. Understanding how LLMs assess credibility provides insights into AI behavior and how credibility is structured and applied in large-scale language models. This study benchmarks the reliability and political classifications of state-of-the-art LLMs - Gemini 1.5 Flash (Google), GPT-4o mini (OpenAI), and LLaMA 3.1 (Meta) - against structured, expert-driven rating systems such as NewsGuard and Media Bias Fact Check. Beyond assessing classification performance, we analyze the linguistic markers that shape LLM decisions, identifying which words and concepts drive their evaluations. We uncover patterns in how LLMs associate credibility with specific linguistic features by examining keyword frequency, contextual determinants, and rank distributions. Beyond static classification, we introduce a framework in which LLMs refine their credibility assessments by retrieving external information, querying other models, and adapting their responses. This allows us to investigate whether their assessments reflect structured reasoning or rely primarily on prior learned associations.
摘要：大型语言模型（LLMS）越来越多地用于评估新闻信誉，但对它们如何做出这些判断知之甚少。虽然先前的研究已经检查了LLM产出中的政治偏见或自动事实检查的潜力，但他们的内部评估过程仍然在很大程度上尚未审查。了解LLM评估信誉如何提供对AI行为的见解，以及如何在大规模语言模型中构造和应用信誉。这项研究基于最先进的LLMS的可靠性和政治分类-Gemini 1.5 Flash（Google），GPT-4O MINI（OpenAI）和Llama 3.1（Meta） - 针对结构化的，专家驱动的评级系统，例如Newsguard和媒体偏见事实检查。除了评估分类绩效外，我们还分析了制定LLM决策的语言标记，确定哪些单词和概念可以推动其评估。我们通过检查关键字频率，上下文决定因素和等级分布将LLM与特定语言特征联系起来的模式。除静态分类外，我们还引入了一个框架，在该框架中，LLM通过检索外部信息，查询其他模型并调整其响应来完善其信誉评估。这使我们能够调查他们的评估是反映结构化推理还是主要依靠先前学到的关联。

Title: Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization

Authors: Yu-Neng Chuang, Leisheng Yu, Guanchu Wang, Lizhe Zhang, Zirui Liu, Xuanting Cai, Yang Sui, Vladimir Braverman, Xia Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04428
Pdf URL: https://arxiv.org/pdf/2502.04428
Copy Paste: [[2502.04428]] Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization(https://arxiv.org/abs/2502.04428)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed and democratized on edge devices. To improve the efficiency of on-device deployment, small language models (SLMs) are often adopted due to their low decoding latency and reduced energy consumption. However, these SLMs often generate inaccurate responses when handling complex queries. One promising solution is uncertainty-based SLM routing, which offloads high-stakes queries to stronger LLMs when the SLM produces low-confidence responses. This follows the principle of "If you lack confidence, seek stronger support" to enhance reliability. Relying on more powerful LLMs is effective but increases invocation costs. Therefore, striking a routing balance between efficiency and efficacy remains a critical challenge. Additionally, efficiently generalizing the routing strategy to new datasets remains under-explored. In this paper, we conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs across more than 1,500 settings. Our findings highlight: First, uncertainty-correctness alignment in different uncertainty quantification (UQ) methods significantly impacts routing performance. Second, uncertainty distributions depend more on the specific SLM and the chosen UQ method than on downstream data. Building on this insight, we propose a calibration data construction instruction pipeline and open-source a constructed hold-out set to enhance routing generalization to new downstream scenarios. The experimental results indicate that calibration data effectively bootstraps routing performance without any new data.
摘要：大型语言模型（LLM）越来越多地部署并在边缘设备上民主化。为了提高设备部署的效率，由于其有效的解码延迟和能源消耗降低，通常采用小语言模型（SLM）。但是，这些SLM在处理复杂查询时通常会产生不准确的响应。一个有希望的解决方案是基于不确定性的SLM路由，在导致SLM上低信心响应时将高风险查询转换为更强的LLM。这遵循“如果您缺乏信心，寻求更强大的支持”以提高可靠性。依靠更强大的LLM尚未有效，但增加了调用成本。因此，达到效率和功效之间的路由平衡仍然是一个关键的挑战。此外，将路由策略有效地概括为新数据集的概括还不足。在本文中，我们对从SLM到1500多个设置的不确定性驱动路由策略进行基准测试和概括进行了全面研究。我们的发现突出显示：首先，不同不确定性定量（UQ）方法中的不确定性纠正对准可以显着影响路由性能。其次，不确定性分布更多地取决于特定的SLM和所选的UQ方法，而不是下游数据。在Insight的基础上，我们提出了校准数据构建指令管道，并开源构造的固定设置，以增强新下游场景的路由概括。实验结果表明校准数据有效地引导程序路由性能，而无需任何新数据。
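
The routing principle described in this abstract ("if you lack confidence, seek stronger support") can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of mean token probability as the uncertainty score, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from typing import List


def mean_token_confidence(token_logprobs: List[float]) -> float:
    """Average per-token probability of an SLM response.

    This is one simple (assumed) uncertainty proxy; the paper benchmarks
    many UQ methods that are not reproduced here.
    """
    if not token_logprobs:
        return 0.0  # no evidence of confidence: treat as fully uncertain
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def route(token_logprobs: List[float], threshold: float = 0.5) -> str:
    """Return 'slm' to keep the small model's answer, 'llm' to offload.

    The threshold is a hypothetical value; in practice it would be tuned
    on calibration data to trade accuracy against invocation cost.
    """
    confidence = mean_token_confidence(token_logprobs)
    return "slm" if confidence >= threshold else "llm"
```

Raising the threshold sends more queries to the stronger model, which should improve reliability at the price of more LLM invocations.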

Title: Active Task Disambiguation with LLMs

Authors: Katarzyna Kobalczyk, Nicolas Astorga, Tennison Liu, Mihaela van der Schaar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04485
Pdf URL: https://arxiv.org/pdf/2502.04485
Copy Paste: [[2502.04485]] Active Task Disambiguation with LLMs(https://arxiv.org/abs/2502.04485)
Keywords: language model, llm, agent
Abstract: Despite the impressive performance of large language models (LLMs) across various benchmarks, their ability to address ambiguously specified problems--frequent in real-world interactions--remains underexplored. To address this gap, we introduce a formal definition of task ambiguity and frame the problem of task disambiguation through the lens of Bayesian Experimental Design. By posing clarifying questions, LLM agents can acquire additional task specifications, progressively narrowing the space of viable solutions and reducing the risk of generating unsatisfactory outputs. Yet, generating effective clarifying questions requires LLM agents to engage in a form of meta-cognitive reasoning, an ability LLMs may presently lack. Our proposed approach of active task disambiguation enables LLM agents to generate targeted questions maximizing the information gain. Effectively, this approach shifts the load from implicit to explicit reasoning about the space of viable solutions. Empirical results demonstrate that this form of question selection leads to more effective task disambiguation in comparison to approaches relying on reasoning solely within the space of questions.
摘要：尽管大语言模型（LLM）在各种基准测试中的表现令人印象深刻，但它们解决了含糊的指定问题的能力 - 在现实世界中的交互中很频繁 - 不被驱逐出境。为了解决这一差距，我们介绍了任务歧义的正式定义，并通过贝叶斯实验设计的角度构成了任务歧义的问题。通过提出澄清问题，LLM代理可以获取其他任务规格，逐步缩小可行解决方案的空间并降低产生不满意的输出的风险。但是，产生有效的澄清问题需要LLM代理以一种元认知推理形式，目前可能缺乏LLM的能力。我们提出的主动任务歧义方法使LLM代理能够产生目标问题最大化信息增益。有效地，此方法将负载从隐式转移到有关可行解决方案空间的明确推理。经验结果表明，与仅在问题的空间内依靠推理的方法相比，这种问题选择形式导致更有效的任务歧义。
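
The idea of choosing the clarifying question that maximizes information gain over the space of viable solutions can be sketched as follows. This is a simplified illustration, not the paper's method: it assumes a small, explicitly enumerated set of weighted candidate solutions and questions whose answers are fully determined by the solution, in which case the expected information gain of a question equals the entropy of its answer distribution. All names are hypothetical.

```python
import math
from typing import Callable, Dict


def answer_entropy(solutions: Dict[str, float],
                   answer_fn: Callable[[str], str]) -> float:
    """Entropy (in bits) of the answers induced by weighted candidate solutions.

    With deterministic answers, I(solution; answer) = H(answer), so this
    value is the expected information gain of asking the question.
    """
    total = sum(solutions.values())
    mass: Dict[str, float] = {}
    for solution, weight in solutions.items():
        answer = answer_fn(solution)
        mass[answer] = mass.get(answer, 0.0) + weight / total
    return -sum(p * math.log2(p) for p in mass.values() if p > 0)


def best_question(solutions: Dict[str, float],
                  questions: Dict[str, Callable[[str], str]]) -> str:
    """Pick the question whose answer splits the solution space most evenly."""
    return max(questions, key=lambda q: answer_entropy(solutions, questions[q]))
```

A question that every candidate solution answers identically has zero entropy and is never preferred, which matches the intuition that it would not narrow the solution space.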

Title: Building A Unified AI-centric Language System: analysis, framework and future work

Authors: Edward Hong Wang, Cynthia Xin Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04488
Pdf URL: https://arxiv.org/pdf/2502.04488
Copy Paste: [[2502.04488]] Building A Unified AI-centric Language System: analysis, framework and future work(https://arxiv.org/abs/2502.04488)
Keywords: language model
Abstract: Recent advancements in large language models have demonstrated that extended inference through techniques can markedly improve performance, yet these gains come with increased computational costs and the propagation of inherent biases found in natural languages. This paper explores the design of a unified AI-centric language system that addresses these challenges by offering a more concise, unambiguous, and computationally efficient alternative to traditional human languages. We analyze the limitations of natural language such as gender bias, morphological irregularities, and contextual ambiguities and examine how these issues are exacerbated within current Transformer architectures, where redundant attention heads and token inefficiencies prevail. Drawing on insights from emergent artificial communication systems and constructed languages like Esperanto and Lojban, we propose a framework that translates diverse natural language inputs into a streamlined AI-friendly language, enabling more efficient model training and inference while reducing memory footprints. Finally, we outline a pathway for empirical validation through controlled experiments, paving the way for a universal interchange format that could revolutionize AI-to-AI and human-to-AI interactions by enhancing clarity, fairness, and overall performance.
摘要：大语言模型的最新进展表明，通过技术的扩展推断可以显着提高性能，但是这些收益随着计算成本的增加以及自然语言中固有的偏见的传播。本文探讨了统一以AI为中心的语言系统的设计，该系统通过提供传统人类语言的更简洁，明确和计算上有效的替代方案来解决这些挑战。我们分析了自然语言的局限性，例如性别偏见，形态不规则性和上下文歧义，并研究了这些问题在当前的变压器体系结构中如何加剧这些问题，在这些问题中，多余的注意力头和效率低下的效率低下。利用新兴的人工通信系统和诸如Esperanto和Lojban等语言的见解，我们提出了一个框架，将各种自然语言输入转化为简化的AI友好语言，从而使更有效的模型培训和推理可以减少记忆足迹。最后，我们通过受控实验概述了经验验证的途径，为通用的互换格式铺平了道路，该格式可以通过增强清晰度，公平性和整体性能来彻底改变AI-to-to-ai和人向人类之间的相互作用。

Title: Multi-Agent Reinforcement Learning with Focal Diversity Optimization

Authors: Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Zachary Yahn, Ling Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04492
Pdf URL: https://arxiv.org/pdf/2502.04492
Copy Paste: [[2502.04492]] Multi-Agent Reinforcement Learning with Focal Diversity Optimization(https://arxiv.org/abs/2502.04492)
Keywords: language model, llm, agent
Abstract: The advancement of Large Language Models (LLMs) and their finetuning strategies has triggered renewed interest in multi-agent reinforcement learning. In this paper, we introduce a focal diversity-optimized multi-agent reinforcement learning approach, termed MARL-Focal, with three unique characteristics. First, we develop an agent-fusion framework for encouraging multiple LLM-based agents to collaborate in producing the final inference output for each LLM query. Second, we develop a focal-diversity optimized agent selection algorithm that can choose a small subset of the available agents based on how well they can complement one another to generate the query output. Finally, we design a conflict-resolution method to detect output inconsistency among multiple agents and produce our MARL-Focal output through reward-aware and policy-adaptive inference fusion. Extensive evaluations on five benchmarks show that MARL-Focal is cost-efficient and adversarial-robust. Our multi-agent fusion model achieves a performance improvement of 5.51% compared to the best individual LLM-agent and offers stronger robustness on the TruthfulQA benchmark. Code is available at this https URL
摘要：大型语言模型（LLM）及其填充策略的发展引发了对多机构增强学习的新兴趣。在本文中，我们介绍了一种焦点多样性优化的多代理增强学习方法，以质量为单位，具有三个独特的特征。首先，我们开发了一个代理融合框架，用于鼓励多个基于LLM的代理，以协作为每个LLM查询生成最终推理输出。其次，我们开发了一种焦点多样性优化的代理选择算法，该算法可以根据它们可以相互补充以生成查询输出的方式来选择可用代理的一小部分。最后，我们设计了一种解决冲突方法，以检测多个代理之间的产出不一致，并通过奖励感知和政策自适应推理融合产生我们的Marl-cocal输出。对五个基准测试的广泛评估表明，Marl-Cocal具有成本效益和对抗性。与最佳的个人LLM代理相比，我们的多代理融合模型可实现5.51 \％的性能提高，并对真实性基准提供了更强的鲁棒性。代码可在此HTTPS URL上找到

Title: Verifiable Format Control for Large Language Model Generations

Authors: Zhaoyang Wang, Jinqi Jiang, Huichi Zhou, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, Huaxiu Yao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04498
Pdf URL: https://arxiv.org/pdf/2502.04498
Copy Paste: [[2502.04498]] Verifiable Format Control for Large Language Model Generations(https://arxiv.org/abs/2502.04498)
Keywords: language model, gpt, llm
Abstract: Recent Large Language Models (LLMs) have demonstrated satisfactory general instruction-following ability. However, small LLMs with about 7B parameters still struggle with fine-grained format following (e.g., JSON format), which seriously hinders the advancement of their applications. Most existing methods focus on benchmarking general instruction following while overlooking how to improve the specific format-following ability of small LLMs. Besides, these methods often rely on evaluations based on advanced LLMs (e.g., GPT-4), which can introduce the intrinsic bias of LLMs and be costly due to the API calls. In this paper, we first curate a fully verifiable format-following dataset, VFF. In contrast to existing works, which often adopt external LLMs for instruction-following validation, every sample of VFF can be easily validated with a Python function. Further, we propose to leverage this verifiable feature to synthesize massive data for progressively training small LLMs, in order to improve their format-following abilities. Experimental results highlight the prevalent limitations in the format-following capabilities of 7B-level open-source LLMs and demonstrate the effectiveness of our method in enhancing this essential ability.
摘要：最近的大型语言模型（LLMS）表现出满足能力以后的一般指导。但是，具有约7B参数的小型LLM仍在努力挣扎的格式（例如JSON格式），这严重阻碍了其应用程序的进步。大多数现有方法都集中在基准一般说明以下，同时忽略了如何提高小型LLM的特定格式。此外，这些方法通常依赖于基于高级LLM（例如GPT-4）的评估，该评估可能引入LLM的内在偏见，并且由于API调用而成本高昂。在本文中，我们首先在数据集VFF之后策划了完全可验证的格式。与通常采用外部LLM进行指导遵循验证的现有作品相反，每个vff的每个样本都可以通过Python函数轻松验证。此外，我们建议利用这种可验证的功能来合成大量数据以逐步训练小型LLM，以提高其遵循能力的格式。实验结果突出了7B级开源LLM的功能之后的格式中普遍的限制，并证明了我们方法在增强这种基本能力方面的有效性。
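
The abstract notes that every VFF sample can be validated with a Python function. The dataset's actual validators are not shown in this digest, so the following is only a hypothetical example of what such a check might look like for a "respond with a JSON object containing these keys" constraint; the function name and the constraint itself are assumptions.

```python
import json
from typing import Iterable


def is_valid_json_object(response: str, required_keys: Iterable[str]) -> bool:
    """Check that a model response is a JSON object containing the given keys.

    Returns False for malformed JSON, for valid JSON that is not an object
    (e.g. a list), and for objects missing any required key.
    """
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(key in parsed for key in required_keys)
```

Because a check like this is deterministic and cheap, it could serve as a filter or reward signal without calling an external LLM judge.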

Title: ULPT: Prompt Tuning with Ultra-Low-Dimensional Optimization

Authors: Zijun Wu, Yongchang Hao, Lili Mou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04501
Pdf URL: https://arxiv.org/pdf/2502.04501
Copy Paste: [[2502.04501]] ULPT: Prompt Tuning with Ultra-Low-Dimensional Optimization(https://arxiv.org/abs/2502.04501)
Keywords: language model, llm, prompt
Abstract: Large language models achieve state-of-the-art performance but are costly to fine-tune due to their size. Parameter-efficient fine-tuning methods, such as prompt tuning, address this by reducing trainable parameters while maintaining strong performance. However, prior methods tie prompt embeddings to the model's dimensionality, which may not scale well with larger LLMs and more customized LLMs. In this paper, we propose Ultra-Low-dimensional Prompt Tuning (ULPT), which optimizes prompts in a low-dimensional space (e.g., 2D) and uses a random but frozen matrix for the up-projection. To enhance alignment, we introduce learnable shift and scale embeddings. ULPT drastically reduces the trainable parameters (e.g., the 2D configuration uses only 2% of the parameters of vanilla prompt tuning) while retaining most of the performance across 21 NLP tasks. Our theoretical analysis shows that random projections can capture high-rank structures effectively, and experimental results demonstrate ULPT's competitive performance over existing parameter-efficient methods.
摘要：大型语言模型可实现最先进的性能，但由于其尺寸而定为微调。参数有效的微调方法（例如及时调整）通过在保持强大的性能的同时减少可训练的参数来解决此问题。但是，先前的方法将提示嵌入到模型的维度上，这可能与较大的LLM和更自定义的LLM的扩展不佳。在本文中，我们提出了超高维及时调整（ULPT），该调整优化了在低维空间（例如2D）中的提示，并使用随机但冷冻的矩阵进行向上投影。为了增强对齐方式，我们引入了可学习的变化和规模嵌入。 ULPT大大降低了可训练的参数，例如，与Vanilla及时调谐相比，仅使用2％参数，同时保留了21个NLP任务的大部分性能。我们的理论分析表明，随机预测可以有效地捕获高级结构，实验结果证明了ULPT在现有参数有效方法上的竞争性能。
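
The mechanism in this abstract (a low-dimensional prompt, a frozen random up-projection, and learnable shift and scale embeddings) can be sketched in plain Python. The exact parameterization is an assumption: here each prompt token embedding is computed as `scale * (z @ P) + shift`, with one shared shift and scale vector of model dimension `d`. Function names and the Gaussian initialization are illustrative only.

```python
import random
from typing import List

Matrix = List[List[float]]


def make_frozen_projection(r: int, d: int, seed: int = 0) -> Matrix:
    """Random r x d matrix that is sampled once and never trained."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(r)]


def up_project(z: Matrix, proj: Matrix,
               scale: List[float], shift: List[float]) -> Matrix:
    """Map low-dimensional prompts z (n x r) to model space (n x d)."""
    d = len(scale)
    return [
        [scale[j] * sum(row[k] * proj[k][j] for k in range(len(row))) + shift[j]
         for j in range(d)]
        for row in z
    ]


def trainable_params(n_tokens: int, r: int, d: int) -> int:
    """Trainable values under this sketch: z (n x r) plus shift and scale (2d)."""
    return n_tokens * r + 2 * d
```

Under these assumptions, 100 prompt tokens with `r = 2` and `d = 768` would need 1,736 trainable values, compared with 76,800 for a full `n x d` prompt matrix.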

Title: When One LLM Drools, Multi-LLM Collaboration Rules

Authors: Shangbin Feng, Wenxuan Ding, Alisa Liu, Zifeng Wang, Weijia Shi, Yike Wang, Zejiang Shen, Xiaochuang Han, Hunter Lang, Chen-Yu Lee, Tomas Pfister, Yejin Choi, Yulia Tsvetkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04506
Pdf URL: https://arxiv.org/pdf/2502.04506
Copy Paste: [[2502.04506]] When One LLM Drools, Multi-LLM Collaboration Rules(https://arxiv.org/abs/2502.04506)
Keywords: llm
Abstract: This position paper argues that in many realistic (i.e., complex, contextualized, subjective) scenarios, one LLM is not enough to produce a reliable output. We challenge the status quo of relying solely on a single general-purpose LLM and argue for multi-LLM collaboration to better represent the extensive diversity of data, skills, and people. We first posit that a single LLM underrepresents real-world data distributions, heterogeneous skills, and pluralistic populations, and that such representation gaps cannot be trivially patched by further training a single LLM. We then organize existing multi-LLM collaboration methods into a hierarchy, based on the level of access and information exchange, ranging from API-level, text-level, logit-level, to weight-level collaboration. Based on these methods, we highlight how multi-LLM collaboration addresses challenges that a single LLM struggles with, such as reliability, democratization, and pluralism. Finally, we identify the limitations of existing multi-LLM methods and motivate future work. We envision multi-LLM collaboration as an essential path toward compositional intelligence and collaborative AI development.
摘要：该立场论文认为，在许多现实（即复杂，上下文化，主观）方案中，一个LLM不足以产生可靠的输出。我们挑战仅依靠单个通用LLM的现状，并主张多LLM协作以更好地代表数据，技能和人员的广泛多样性。我们首先认为，单个LLM不足的现实世界数据分布，异质技能和多元化人群，并且不能通过进一步培训单个LLM来对这种表示差距进行琐碎的修补。然后，我们根据访问和信息交换的级别将现有的多LLM协作方法组织为层次结构，从API级，文本级别，Logit级别到权重级协作。基于这些方法，我们强调了多LLM协作如何应对单个LLM斗争的挑战，例如可靠性，民主化和多元化。最后，我们确定了现有多LLM方法的局限性，并激发了未来的工作。我们将多LLM协作视为构成智能和协作AI开发的基本途径。

Title: Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems

Authors: Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04510
Pdf URL: https://arxiv.org/pdf/2502.04510
Copy Paste: [[2502.04510]] Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems(https://arxiv.org/abs/2502.04510)
Keywords: language model, llm
Abstract: We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by jointly optimizing model roles and weights. We represent multi-LLM systems as directed acyclic graphs (DAGs) of LLMs with topological message passing for collaborative generation. Given a pool of LLM experts and a utility function, Heterogeneous Swarms employs two iterative steps: role-step and weight-step. For role-step, we interpret model roles as learning a DAG that specifies the flow of inputs and outputs between LLMs. Starting from a swarm of random continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs in topological order, evaluate on the utility function (e.g. accuracy on a task), and optimize the adjacency matrices with particle swarm optimization based on the utility score. For weight-step, we assess the contribution of individual LLMs in the multi-LLM systems and optimize model weights with swarm intelligence. We propose JFK-score to quantify the individual contribution of each LLM in the best-found DAG of the role-step, then optimize model weights with particle swarm optimization based on the JFK-score. Experiments demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based baselines by 18.5% on average across 12 tasks. Further analysis reveals that Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles and substantial collaborative gains, and benefits from the diversity of language models.
摘要：我们提出了异质群，这是一种通过共同优化模型角色和权重来设计多LLM系统的算法。我们代表LLMS的多LLM系统的多LLM系统，其拓扑消息传递以进行协作生成。鉴于LLM专家和实用程序功能，异质群采用了两个迭代步骤：角色步骤和重量步骤。对于角色步骤，我们将模型角色解释为学习一个DAG，以指定LLMS之间的输入和输出的流量。从一群随机连续的邻接矩阵开始，我们将它们解码为离散的DAG，按拓扑顺序调用LLM，评估效用函数（例如，任务的准确性），并根据效用优化了相邻矩阵分数。对于重量步骤，我们评估单个LLM在多LLM系统中的贡献，并用群体智能优化模型权重。我们提出了肯尼迪（JFK）得分，以量化每个LLM在角色段最佳发现的DAG中的个体贡献，然后根据JFK-Score优化粒子群优化模型权重。实验表明，在12个任务中，异质群的表现平均超过15个基于重量的基线和/或基于体重的基线。进一步的分析表明，异质群体发现具有异质模型角色和实质性协作增长的多LLM系统，并从语言模型的多样性中受益。

Title: Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis

Authors: Shuhaib Mehri, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04511
Pdf URL: https://arxiv.org/pdf/2502.04511
Copy Paste: [[2502.04511]] Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis(https://arxiv.org/abs/2502.04511)
Keywords: llm
Abstract: LLMs demonstrate remarkable capabilities in following natural language instructions, largely due to instruction-tuning on high-quality datasets. While synthetic data generation has emerged as a scalable approach for creating such datasets, maintaining consistent quality standards remains challenging. Recent approaches incorporate feedback to improve data quality, but typically operate at the sample level, generating and applying feedback for each response individually. In this work, we propose Reference-Level Feedback, a novel methodology that instead collects feedback based on high-quality reference samples from carefully curated seed data. We use this feedback to capture rich signals of desirable characteristics that can be propagated to newly synthesized data. We present REFED, a dataset of 10K instruction-response pairs synthesized using such feedback. We demonstrate the effectiveness of our approach by showing that Llama-3.1-8B-Instruct finetuned on REFED achieves state-of-the-art performance among similar-sized SFT-based models on AlpacaEval 2.0 and strong results on Arena-Hard. Through extensive experiments, we show that our approach consistently outperforms traditional sample-level feedback methods with significantly fewer feedback collections and improves performance across different model architectures.
摘要：LLMS在遵循自然语言指令方面表现出了非凡的功能，这主要是由于在高质量数据集上进行了指导。尽管合成数据生成已成为创建此类数据集的可扩展方法，但保持一致的质量标准仍然具有挑战性。最近的方法结合了反馈以提高数据质量，但通常在样本级别运行，为每个响应分别生成和应用反馈。在这项工作中，我们提出了参考级别的反馈，这是一种新颖的方法，该方法基于经过精心策划的种子数据的高质量参考样本收集反馈。我们使用此反馈来捕获可观的特征的丰富信号，这些信号可以传播到新合成的数据。我们介绍了Ref，是使用此类反馈合成的10K指令 - 响应对的数据集。我们通过证明了对Alpacaeval 2.0的基于相似大小的SFT模型在Alpacaeval 2.0上实现最先进的表现，并在Arena-Hard上表现出了较大的SFT模型，并且在Arena-Hard上表现出了最先进的表现，从而证明了方法的有效性。通过广泛的实验，我们表明我们的方法始终优于传统的样本级反馈方法，这些反馈方法的反馈收集明显较少，并提高了不同模型体系结构的性能。

Title: Linear Correlation in LM's Compositional Generalization and Hallucination

Authors: Letian Peng, Chenyang An, Shibo Hao, Chengyu Dong, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04520
Pdf URL: https://arxiv.org/pdf/2502.04520
Copy Paste: [[2502.04520]] Linear Correlation in LM's Compositional Generalization and Hallucination(https://arxiv.org/abs/2502.04520)
Keywords: language model, hallucination, prompt
Abstract: The generalization of language models (LMs) is under active debate, contrasting their potential for general intelligence with their struggles with basic knowledge composition (e.g., reverse/transition curse). This paper uncovers the phenomenon of linear correlations in LMs during knowledge composition. Specifically, there exists a linear transformation between certain related pieces of knowledge that maps the next-token prediction logits from one prompt to another, e.g., "X lives in the city of" $\rightarrow$ "X lives in the country of" for every given X. This mirrors the linearity in human knowledge composition, such as Paris $\rightarrow$ France. Our findings indicate that the linear transformation is resilient to large-scale fine-tuning, generalizing updated knowledge when aligned with real-world relationships, but causing hallucinations when it deviates. Empirical results suggest that linear correlation can serve as a potential identifier of LM generalization. Finally, we show such linear correlations can be learned with a single feedforward network and pre-trained vocabulary representations, indicating LM generalization heavily relies on the latter.
摘要：语言模型（LMS）的概括正在进行积极的辩论，将其一般智力的潜力与基本知识组成的斗争（例如，反向/过渡诅咒）进行了对比。本文揭示了知识组成期间LMS中线性相关的现象。为了说明，存在某些相关知识之间存在线性转换X.这反映了人类知识组成的线性，例如巴黎$ \ rightarrow $法国。我们的发现表明，线性转换对大规模微调具有弹性，在与现实世界的关系保持一致时概括了更新的知识，但在偏离时会引起幻觉。经验结果表明，线性相关性可以用作LM概括的潜在标识符。最后，我们显示可以通过单个馈电网络和预训练的词汇表示可以学习这种线性相关性，这表明LM概括在很大程度上依赖于后者。

Title: Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection

Authors: Minseok Jung, Cynthia Fuertes Panizo, Liam Dugan, May Fung, Pin-Yu Chen, Paul Pu Liang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04528
Pdf URL: https://arxiv.org/pdf/2502.04528
Copy Paste: [[2502.04528]] Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection(https://arxiv.org/abs/2502.04528)
Keywords: language model, llm
Abstract: The advancement of large language models (LLMs) has made it difficult to differentiate human-written text from AI-generated text. Several AI-text detectors have been developed in response, which typically utilize a fixed global threshold (e.g., $\theta = 0.5$) to classify machine-generated text. However, we find that one universal threshold can fail to account for subgroup-specific distributional variations. For example, when using a fixed threshold, detectors make more false positive errors on shorter human-written text than on longer text, and, among long texts, more positive classifications on neurotic writing styles than on open ones. These discrepancies can lead to misclassification that disproportionately affects certain groups. We address this critical limitation by introducing FairOPT, an algorithm for group-specific threshold optimization in AI-generated content classifiers. Our approach partitions data into subgroups based on attributes (e.g., text length and writing style) and learns decision thresholds for each group, which enables careful balancing of performance and fairness metrics within each subgroup. In experiments with four AI text classifiers on three datasets, FairOPT enhances overall F1 score and decreases balanced error rate (BER) discrepancy across subgroups. Our framework paves the way for more robust and fair classification criteria in AI-generated output detection.
摘要：大型语言模型（LLM）的进步使得很难区分人类写的文本与AI生成的文本。已经开发了几个AI-TEXT检测器，该检测器通常使用固定的全局阈值（例如{\ theta} = 0.5）来对机器生成的文本进行分类。但是，我们发现一个通用阈值无法解释特定于亚组的分布变化。例如，当使用固定阈值时，检测器对短文本的人体写的文本造成了更多的假阳性错误，而对神经质写作样式的正面分类要比在长文本中打开更多。这些差异可能导致错误分类，从而影响某些群体。我们通过引入Fairopt来解决这一关键限制，Fairopt是一种在AI生成的内容分类器中用于组特定阈值优化的算法。我们的方法根据属性（例如文本长度和写作样式）将数据分配到子组中，并了解每个组的决策阈值，这可以在每个子组中仔细平衡性能和公平指标。在三个数据集上使用四个AI文本分类器的实验中，FaiROPT提高了整体F1分数并降低了子组之间的平衡错误率（BER）差异。我们的框架为AI生成的输出检测中的更健壮和公平的分类标准铺平了道路。
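
The contrast between one global threshold and group-specific thresholds can be illustrated with a short Python sketch. This is a deliberately simplified stand-in, not FairOPT itself: it picks, for each subgroup, the grid threshold that maximizes that group's F1 score and ignores the fairness terms the paper balances. The function names, the grid, and the data layout are assumptions.

```python
from typing import Dict, List, Optional, Tuple


def f1_at(scores: List[float], labels: List[int], threshold: float) -> float:
    """F1 score when texts with score >= threshold are flagged as AI-generated."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_group_thresholds(
    data: Dict[str, Tuple[List[float], List[int]]],
    grid: Optional[List[float]] = None,
) -> Dict[str, float]:
    """Learn one decision threshold per subgroup via grid search on F1."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return {
        group: max(grid, key=lambda t: f1_at(scores, labels, t))
        for group, (scores, labels) in data.items()
    }
```

If detector scores for short human-written texts run systematically higher than for long ones, the fitted threshold for the short-text group would end up higher, which is the kind of subgroup variation a single global threshold cannot capture.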

Title: Contextual Gradient Flow Modeling for Large Language Model Generalization in Multi-Scale Feature Spaces

Authors: Daphne Quillington, Kingsley Fairbrother, Xavier Tattershall, Irin Kabakum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04548
Pdf URL: https://arxiv.org/pdf/2502.04548
Copy Paste: [[2502.04548]] Contextual Gradient Flow Modeling for Large Language Model Generalization in Multi-Scale Feature Spaces(https://arxiv.org/abs/2502.04548)
Keywords: language model
Abstract: Optimization methodologies for training large-scale neural architectures often rely on uniform gradient propagation mechanisms that fail to align with hierarchical linguistic structures, limiting their capacity to generalize across diverse language distributions. A structured gradient refinement framework was introduced to incorporate multi-scale contextual adjustments, improving parameter adaptation through dynamic weighting strategies that enhanced representation coherence. Empirical evaluations demonstrated that structured propagation mechanisms contributed to reductions in gradient oscillations, resulting in more stable training dynamics and improved optimization efficiency. The comparative performance assessment indicated that models incorporating hierarchical propagation strategies exhibited greater robustness in long-range dependency retention and cross-domain adaptation. The hierarchical adjustment of weight updates provided an alternative to conventional backpropagation, reducing sensitivity to initialization conditions while improving overall convergence efficiency. The experimental results confirmed that structured gradient propagation influenced representation learning trajectories, aligning parameter updates with broader linguistic dependencies rather than isolated token-level relationships. Statistical evaluations indicated that structured optimization strategies mitigated overfitting while preserving adaptability across heterogeneous text distributions. The findings established that structured gradient propagation provided an empirically validated framework for refining hierarchical representation learning, supporting more effective integration of linguistic dependencies into optimization dynamics.
摘要：用于训练大规模神经体系结构的优化方法通常依赖于统一的梯度传播机制，这些机制无法与层次的语言结构保持一致，从而限制了它们跨越不同语言分布的能力。引入了一个结构化的梯度改进框架，以结合多尺度的上下文调整，从而通过动态加权策略来改善参数适应，从而增强表示形式相干。经验评估表明，结构化的传播机制导致梯度振荡的减少，从而导致更稳定的训练动力学和提高优化效率。比较性能评估表明，结合分层传播策略的模型在远程依赖保留和跨域适应中表现出更大的鲁棒性。重量更新的分层调整为常规反向传播提供了替代方案，从而降低了对初始化条件的敏感性，同时提高了整体收敛效率。实验结果证实，结构化梯度传播影响了表示轨迹的表示，将参数更新与更广泛的语言依赖性对齐，而不是孤立的令牌级别的关系。统计评估表明，结构化的优化策略可缓解过度拟合，同时保持异质文本分布的适应性。研究结果表明，结构化的梯度传播提供了经验验证的框架，用于完善层次表示学习，从而支持将语言依赖性更有效地整合到优化动力学中。

Title: TruthFlow: Truthful LLM Generation via Representation Flow Correction

Authors: Hanyu Wang, Bochuan Cao, Yuanpu Cao, Jinghui Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04556
Pdf URL: https://arxiv.org/pdf/2502.04556
Copy Paste: [[2502.04556]] TruthFlow: Truthful LLM Generation via Representation Flow Correction(https://arxiv.org/abs/2502.04556)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) are known to struggle with consistently generating truthful responses. While various representation intervention techniques have been proposed, these methods typically apply a universal representation correction vector to all input queries, limiting their effectiveness against diverse queries in practice. In this study, we introduce TruthFlow, a novel method that leverages the Flow Matching technique for query-specific truthful representation correction. Specifically, TruthFlow first uses a flow model to learn query-specific correction vectors that transition representations from hallucinated to truthful states. Then, during inference, the trained flow model generates these correction vectors to enhance the truthfulness of LLM outputs. Experimental results demonstrate that TruthFlow significantly improves performance on open-ended generation tasks across various advanced LLMs evaluated on TruthfulQA. Moreover, the trained TruthFlow model exhibits strong transferability, performing effectively on other unseen hallucination benchmarks.
摘要：众所周知，大型语言模型（LLM）在持续产生真实的反应方面挣扎。尽管已经提出了各种表示干预技术，但这些方法通常将通用表示校正向量应用于所有输入查询，从而限制了它们在实践中的各种查询中的有效性。在这项研究中，我们介绍了TruthFlow，这是一种新型方法，它利用流量匹配技术进行特定于特定的真实表示校正。具体而言，TruthFlow首先使用流模型来学习特定于查询的校正向量，这些校正向量将其从幻觉转变为真实的状态。然后，在推断过程中，训练有素的流模型会生成这些校正向量，以增强LLM输出的真实性。实验结果表明，TruthFlow显着改善了对真实性评估的各种高级LLM的开放式生成任务的性能。此外，受过训练的真相流模型具有强大的可传递性，在其他看不见的幻觉基准上有效地执行。

Title: My LLM might Mimic AAE -- But When Should it?

Authors: Sandra C. Sandoval, Christabel Acquaye, Kwesi Cobbina, Mohammad Nayeem Teli, Hal Daumé III
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04564
Pdf URL: https://arxiv.org/pdf/2502.04564
Copy Paste: [[2502.04564]] My LLM might Mimic AAE -- But When Should it?(https://arxiv.org/abs/2502.04564)
Keywords: language model, llm, prompt
Abstract: We examine the representation of African American English (AAE) in large language models (LLMs), exploring (a) the perceptions Black Americans have of how effective these technologies are at producing authentic AAE, and (b) in what contexts Black Americans find this desirable. Through both a survey of Black Americans ($n = 104$) and annotation of LLM-produced AAE by Black Americans ($n = 228$), we find that Black Americans favor choice and autonomy in determining when AAE is appropriate in LLM output. They tend to prefer that LLMs default to communicating in Mainstream U.S. English in formal settings, with greater interest in AAE production in less formal settings. When LLMs were appropriately prompted and provided in-context examples, our participants found their outputs to have a level of AAE authenticity on par with transcripts of Black American speech. Select code and data for our project can be found here: this https URL
摘要：我们研究了非裔美国人英语（AAE）在大语言模型（LLMS）中的代表，探索（a）黑人美国人对这些技术产生真实AAE的有效性的看法，以及（b）黑人美国人在哪些环境中发现这一点理想。通过对黑人美国人（$ n = 104美元）的调查和LLM美国黑人制作的AAE的注释（$ n = 228美元），我们发现黑人美国人喜欢选择和自主权，以确定何时适用于LLM输出的AAE 。他们倾向于在正式环境中使用LLMS违约而不是在美国英语主流英语中进行交流，并且在不太正式的环境中对AAE产量的兴趣更大。当LLM受到适当提示并在上下文示例中提供时，我们的参与者发现他们的输出具有一定程度的AAE真实性，与美国黑人演讲的成绩单相当。可以在此处找到我们项目的代码和数据：此HTTPS URL

Title: Extracting and Understanding the Superficial Knowledge in Alignment

Authors: Runjin Chen, Gabriel Jacob Perin, Xuxi Chen, Xilun Chen, Yan Han, Nina S. T. Hirata, Junyuan Hong, Bhavya Kailkhura
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04602
Pdf URL: https://arxiv.org/pdf/2502.04602
Copy Paste: [[2502.04602]] Extracting and Understanding the Superficial Knowledge in Alignment(https://arxiv.org/abs/2502.04602)
Keywords: language model, llm
Abstract: Alignment of large language models (LLMs) with human values and preferences, often achieved through fine-tuning based on human feedback, is essential for ensuring safe and responsible AI behaviors. However, the process typically requires substantial data and computation resources. Recent studies have revealed that alignment might be attainable at lower costs through simpler methods, such as in-context learning. This leads to the question: Is alignment predominantly superficial? In this paper, we delve into this question and provide a quantitative analysis. We formalize the concept of superficial knowledge, defining it as knowledge that can be acquired through easy token restyling, without affecting the model's ability to capture underlying causal relationships between tokens. We propose a method to extract and isolate superficial knowledge from aligned models, focusing on the shallow modifications to the final token selection process. By comparing models augmented only with superficial knowledge to fully aligned models, we quantify the superficial portion of alignment. Our findings reveal that while superficial knowledge constitutes a significant portion of alignment, particularly in safety and detoxification tasks, it is not the whole story. Tasks requiring reasoning and contextual understanding still rely on deeper knowledge. Additionally, we demonstrate two practical advantages of isolated superficial knowledge: (1) it can be transferred between models, enabling efficient offsite alignment of larger models using extracted superficial knowledge from smaller models, and (2) it is recoverable, allowing for the restoration of alignment in compromised models without sacrificing performance.
摘要：大型语言模型（LLM）与人类价值观和偏好的一致性通常是通过基于人类反馈来实现的，对于确保安全和负责的AI行为至关重要。但是，该过程通常需要大量的数据和计算资源。最近的研究表明，通过更简单的方法（例如在文化学习）中，可以在较低的成本下实现一致性。这导致了一个问题：一致性主要是表面的吗？在本文中，我们深入研究了这个问题，并提供了定量分析。我们将表面知识的概念形式化，将其定义为可以通过易于代币的重新设计来获取的知识，而不会影响模型捕获代币之间的基本因果关系的能力。我们提出了一种从对齐模型中提取和隔离表面知识的方法，重点是对最终令牌选择过程的浅修改。通过将仅与表面知识与完全排列的模型进行比较，我们量化了对齐的浅表部分。我们的发现表明，尽管肤浅的知识构成了一定部分的一致性，尤其是在安全和排毒任务方面，但这并不是整个故事。需要推理和上下文理解的任务仍然取决于更深的知识。此外，我们证明了孤立的表面知识的两个实用优势：（1）可以在模型之间传递它，从而使用较小模型提取的浅表知识在模型之间进行有效的非现场对齐，并且（2）可以恢复，以恢复。在不牺牲性能的情况下对折衷的模型进行对齐。

Title: M-IFEval: Multilingual Instruction-Following Evaluation

Authors: Antoine Dussolle, Andrea Cardeña Díaz, Shota Sato, Peter Devine
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04688
Pdf URL: https://arxiv.org/pdf/2502.04688
Copy Paste: [[2502.04688]] M-IFEval: Multilingual Instruction-Following Evaluation(https://arxiv.org/abs/2502.04688)
Keywords: language model, llm
Abstract: Instruction following is a core capability of modern large language models (LLMs), making evaluating this capability essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural context.
摘要：以下指导是现代大型语言模型（LLM）的核心能力，使评估此能力对于理解这些模型至关重要。来自文献的评估（IFEVAL）基准之后的指示使用客观标准来实现此操作，从而提供LLM性能的度量，而无需主观AI或人类判断。但是，它仅包括英语说明，限制其在其他语言中评估LLM的能力。我们提出了评估（M-fifeval）基准后的多语言指令，并通过通用和特定语言的指令将评估扩展到法语，日语和西班牙语。将此基准应用于8个最先进的LLMS，我们发现跨语言和指令类型的基准性能可能会有所不同，从而强调了在多种文化背景下评估LLM的多语言基准的重要性。

Title: ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning

Authors: Yuwei Yin, Giuseppe Carenini
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04689
Pdf URL: https://arxiv.org/pdf/2502.04689
Copy Paste: [[2502.04689]] ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning(https://arxiv.org/abs/2502.04689)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) achieve remarkable performance on challenging benchmarks that are often structured as multiple-choice question-answering (QA) tasks. Zero-shot Chain-of-Thought (CoT) prompting enhances reasoning in LLMs but provides only vague and generic guidance ("think step by step"). This paper introduces ARR, an intuitive and effective zero-shot prompting method that explicitly incorporates three key steps in QA solving: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Comprehensive experiments across diverse and challenging QA tasks demonstrate that ARR consistently improves the Baseline (without ARR prompting) and outperforms CoT. Ablation and case studies further validate the positive contributions of each component: analyzing, retrieving, and reasoning. Notably, intent analysis plays a vital role in ARR. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.
摘要：大型语言模型（LLMS）在具有挑战性的基准上实现了出色的性能，这些基准通常被构成多项选择提问（QA）任务。零射击链（COT）提示提示LLMS中的推理，但仅提供模糊而通用的指导（“逐步思考”）。本文介绍了ARR，这是一种直观有效的零射击提示方法，该方法明确合并了QA求解中的三个关键步骤：分析问题的意图，检索相关信息和逐步推理。跨不同质量检查任务进行的全面实验表明，ARR始终改善基线（无需ARR提示）和表现优于COT。消融和案例研究进一步验证了每个组成部分的积极贡献：分析，检索和推理。值得注意的是，意图分析在ARR中起着至关重要的作用。此外，各种模型大小，LLM系列和生成设置的广泛评估巩固了ARR的有效性，鲁棒性和概括性。
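The three ARR steps named in the abstract (analyze the intent, retrieve relevant information, reason step by step) can be illustrated with a small prompt builder. This is a minimal sketch only: the function name `build_arr_prompt`, the option labelling, and the exact trigger wording are illustrative assumptions and are not taken from the paper.

```python
def build_arr_prompt(question: str, options: list[str]) -> str:
    """Assemble a zero-shot multiple-choice prompt with an ARR-style trigger.

    The trigger sentence is a hypothetical paraphrase of the three steps
    listed in the abstract (analyze, retrieve, reason), not the authors'
    published prompt.
    """
    # Label the options (A), (B), ... in the order they were given.
    labelled = [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    trigger = (
        "Let's analyze the intent of the question, retrieve relevant "
        "information, and reason step by step."
    )
    return "\n".join([f"Question: {question}", *labelled, f"Answer: {trigger}"])
```

Calling `build_arr_prompt("Which gas do plants absorb?", ["Oxygen", "Carbon dioxide"])` returns a four-line prompt whose last line carries the trigger; a plain zero-shot CoT baseline would replace the trigger with the generic "think step by step" instruction quoted in the abstract.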

Title: Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?

Authors: Sourabrata Mukherjee, Atul Kr. Ojha, John P. McCrae, Ondrej Dusek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04718
Pdf URL: https://arxiv.org/pdf/2502.04718
Copy Paste: [[2502.04718]] Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?(https://arxiv.org/abs/2502.04718)
Keywords: language model, llm
Abstract: Text Style Transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Human evaluation is ideal but costly, as in other natural language processing (NLP) tasks; however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine both a set of existing metrics and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks (sentiment transfer and detoxification) in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of Large Language Models (LLMs) as tools for TST evaluation. Our findings highlight that certain advanced NLP metrics and experimental hybrid techniques provide better insights than existing TST metrics for delivering more accurate, consistent, and reproducible TST evaluations.
摘要：文本样式传输（TST）是转换文本以反映特定样式的任务，同时保留其原始内容。评估TST输出是一项多维挑战，需要评估样式转移精度，内容保存和自然性。使用人类评估是理想的，但代价高昂，与其他自然语言处理（NLP）任务相同，但是，TST的自动指标并没有像指标那样受到指标，例如机器翻译或摘要。在本文中，我们研究了从更广泛的NLP任务中进行TST评估的现有和新颖的指标集，重点介绍了两个流行的子任务转移和包括英语，印地语和孟加拉语的多种语言上下文中的排毒。通过通过与人类判断的相关性进行元评估，我们证明了这些指标在单独使用和合奏中使用时的有效性。此外，我们研究了大语模型（LLM）作为TST评估的工具的潜力。我们的发现强调，某些先进的NLP指标和实验混合技术，提供了比现有的TST指标更好的见解，以提供更准确，一致和可再现的TST评估。

Title: Concept Navigation and Classification via Open Source Large Language Model Processing

Authors: Maël Kubli
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04756
Pdf URL: https://arxiv.org/pdf/2502.04756
Copy Paste: [[2502.04756]] Concept Navigation and Classification via Open Source Large Language Model Processing(https://arxiv.org/abs/2502.04756)
Keywords: language model, llm
Abstract: This paper presents a novel methodological framework for detecting and classifying latent constructs, including frames, narratives, and topics, from textual data using Open-Source Large Language Models (LLMs). The proposed hybrid approach combines automated summarization with human-in-the-loop validation to enhance the accuracy and interpretability of construct identification. By employing iterative sampling coupled with expert refinement, the framework guarantees methodological robustness and ensures conceptual precision. Applied to diverse data sets, including AI policy debates, newspaper articles on encryption, and the 20 Newsgroups data set, this approach demonstrates its versatility in systematically analyzing complex political discourses, media framing, and topic classification tasks.
摘要：本文提出了一个新颖的方法学框架，用于检测和分类潜在构造，包括使用开源大语模型（LLMS）的文本数据，包括框架，叙述和主题。提出的混合方法将自动摘要与人类在环境验证相结合，以增强构造识别的准确性和解释性。通过采用迭代抽样以及专家的精致，该框架可以保证方法论上的鲁棒性并确保概念上的精度。该方法应用于不同的数据集，包括AI政策辩论，有关加密的报纸文章以及20种新闻组数据集，这种方法证明了其在系统地分析复杂的政治论述，媒体框架和主题分类任务时的多功能性。

Title: SeDi-Instruct: Enhancing Alignment of Language Models through Self-Directed Instruction Generation

Authors: Jungwoo Kim, Minsang Kim, Sungjin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04774
Pdf URL: https://arxiv.org/pdf/2502.04774
Copy Paste: [[2502.04774]] SeDi-Instruct: Enhancing Alignment of Language Models through Self-Directed Instruction Generation(https://arxiv.org/abs/2502.04774)
Keywords: language model, gpt, llm, chat
Abstract: The rapid evolution of Large Language Models (LLMs) has enabled the industry to develop various AI-based services. Instruction tuning is considered essential in adapting foundation models for target domains to provide high-quality services to customers. A key challenge in instruction tuning is obtaining high-quality instruction data. Self-Instruct, which automatically generates instruction data using ChatGPT APIs, alleviates the data scarcity problem. To improve the quality of instruction data, Self-Instruct discards many of the instructions generated from ChatGPT, even though this is cost-inefficient because many API calls are wasted. To generate high-quality instruction data at a low cost, we propose a novel data generation framework, Self-Directed Instruction Generation (SeDi-Instruct), which employs diversity-based filtering and iterative feedback task generation. Diversity-based filtering maintains model accuracy without excessively discarding low-quality generated instructions by enhancing the diversity of instructions in a batch. This reduces the cost of synthesizing instruction data. The iterative feedback task generation integrates instruction generation and training tasks and utilizes information obtained during the training to create high-quality instruction sets. Our results show that SeDi-Instruct enhances the accuracy of AI models by 5.2%, compared with traditional methods, while reducing data generation costs by 36%.
摘要：大型语言模型（LLM）的快速发展使该行业能够开发各种基于AI的服务。指导调整在为目标域调整基础模型以为客户提供高质量服务时被认为是必不可少的。指导调整中的一个关键挑战是获得高质量的指令数据。自动使用CHATGPT API自动生成指令数据的自我指导减轻了数据稀缺问题。为了提高指导数据的质量，自我结构会丢弃从chatgpt产生的许多指令，尽管由于许多无用的API调用，因此成本效率低下。为了以低成本生成高质量的指导数据，我们提出了一个新颖的数据生成框架，自动指导生成（SEDI-Instruct），该框架采用了基于多样性的过滤和迭代反馈任务生成。基于多样性的过滤可维持模型的准确性，而不会通过增强批处理中指令的多样性而过度丢弃低质量生成的指令。这降低了综合指令数据的成本。迭代反馈任务生成集成了指导生成和培训任务，并利用培训期间获得的信息来创建高质量的指导集。我们的结果表明，与传统方法相比，Sedi-Instruct可以提高AI模型的准确性5.2％，同时将数据生成成本降低了36％。

Title: Probing Internal Representations of Multi-Word Verbs in Large Language Models

Authors: Hassane Kissane, Achim Schilling, Patrick Krauss
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04789
Pdf URL: https://arxiv.org/pdf/2502.04789
Copy Paste: [[2502.04789]] Probing Internal Representations of Multi-Word Verbs in Large Language Models(https://arxiv.org/abs/2502.04789)
Keywords: language model, llm
Abstract: This study investigates the internal representations of verb-particle combinations, called multi-word verbs, within transformer-based large language models (LLMs), specifically examining how these models capture lexical and syntactic properties at different neural network layers. Using the BERT architecture, we analyze the representations of its layers for two different verb-particle constructions: phrasal verbs like 'give up' and prepositional verbs like 'look at'. Our methodology includes training probing classifiers on the internal representations to classify these categories at both word and sentence levels. The results indicate that the model's middle layers achieve the highest classification accuracies. To further analyze the nature of these distinctions, we conduct a data separability test using the Generalized Discrimination Value (GDV). While GDV results show weak linear separability between the two verb types, probing classifiers still achieve high accuracy, suggesting that representations of these linguistic categories may be non-linearly separable. This aligns with previous research indicating that linguistic distinctions in neural networks are not always encoded in a linearly separable manner. These findings computationally support usage-based claims on the representation of verb-particle constructions and highlight the complex interaction between neural network architectures and linguistic structures.
摘要：这项研究调查了基于变压器的大语言模型（LLMS）内的动词粒子组合的内部表示，称为多字动词，特别研究了这些模型如何在不同的神经网络层处捕获词汇和句法特性。使用BERT架构，我们分析了两个不同动词粒子构造的层的表示：诸如“放弃”和介词动词之类的短语动词和诸如“查看”的介词动词。我们的方法包括对内部表示的培训探测分类器，以在单词和句子级别上对这些类别进行分类。结果表明该模型的中层达到了最高的分类精度。为了进一步分析这些区别的性质，我们使用广义歧视值（GDV）进行了数据可分离性测试。虽然GDV结果显示两种动词类型之间的线性可分离性弱，但探测分类器仍然可以达到高精度，这表明这些语言类别的表示可能是非线性分离的。这与先前的研究一致，表明神经网络中的语言区别并不总是以线性分离的方式编码。这些发现在计算上支持基于用法的动词构造的说法，并突出了神经网络架构与语言结构之间的复杂相互作用。

Title: S$^2$-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency

Authors: Yuting Zeng, Weizhe Huang, Lei Jiang, Tongxuan Liu, Xitai Jin, Chen Tianying Tiana, Jing Li, Xiaohua Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.04790
Pdf URL: https://arxiv.org/pdf/2502.04790
Copy Paste: [[2502.04790]] S$^2$-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency(https://arxiv.org/abs/2502.04790)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various natural language processing (NLP) scenarios, but they still face challenges when handling complex arithmetic and logical reasoning tasks. While Chain-of-Thought (CoT) reasoning, self-consistency (SC) and self-correction strategies have attempted to guide models in sequential, multi-step reasoning, Multi-agent Debate (MAD) has emerged as a viable approach for enhancing the reasoning capabilities of LLMs. By increasing both the number of agents and the frequency of debates, the performance of LLMs improves significantly. However, this strategy results in a significant increase in token costs, presenting a barrier to scalability. To address this challenge, we introduce a novel sparsification strategy designed to reduce token costs within MAD. This approach minimizes ineffective exchanges of information and unproductive discussions among agents, thereby enhancing the overall efficiency of the debate process. We conduct comparative experiments on multiple datasets across various models, demonstrating that our approach significantly reduces the token costs in MAD. Specifically, compared to MAD, our approach achieves a reduction of up to 94.5% in token costs while keeping performance degradation below 2.0%.
摘要：大型语言模型（LLM）在各种自然语言处理（NLP）方案中表现出了显着的功能，但是在处理复杂的算术和逻辑推理任务时，它们仍然面临挑战。虽然经过思考链（COT）推理，但自矛盾（SC）和自我纠正策略试图指导模型，以顺序的多步推理，多代理辩论（MAD）已成为一种可行的方法LLMS的推理能力。通过增加代理的数量和辩论频率，LLMS的性能大大提高。但是，该策略导致令牌成本大幅提高，并呈现出可扩展性的障碍。为了应对这一挑战，我们引入了一种新颖的稀疏策略，旨在降低疯狂的代币成本。这种方法最大程度地减少了代理商之间无效的信息交流和非生产性讨论，从而提高了辩论过程的整体效率。我们在各种模型的多个数据集上进行了比较实验，表明我们的方法在相当大的程度上大大降低了令牌成本。具体而言，与MAD相比，我们的方法可在令牌成本下的令人印象深刻的减少94.5％，同时将绩效降解低于2.0 \％。

Title: Developmentally-plausible Working Memory Shapes a Critical Period for Language Acquisition

Authors: Masato Mita, Ryo Yoshida, Yohei Oseki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04795
Pdf URL: https://arxiv.org/pdf/2502.04795
Copy Paste: [[2502.04795]] Developmentally-plausible Working Memory Shapes a Critical Period for Language Acquisition(https://arxiv.org/abs/2502.04795)
Keywords: language model
Abstract: Large language models exhibit general linguistic abilities but significantly differ from humans in their efficiency of language acquisition. This study proposes a method for integrating the developmental characteristics of working memory during the critical period, a stage when human language acquisition is particularly efficient, into language models. The proposed method introduces a mechanism that initially constrains working memory during the early stages of training and gradually relaxes this constraint in an exponential manner as learning progresses. Targeted syntactic evaluation shows that the proposed method outperforms conventional models without memory constraints or with static memory constraints. These findings not only provide new directions for designing data-efficient language models but also offer indirect evidence supporting the underlying mechanisms of the critical period hypothesis in human language acquisition.
摘要：大语言模型具有一般语言能力，但与人类在语言获取效率方面有显着差异。这项研究提出了一种在关键时期整合工作记忆的发展特征的方法，这是人类语言获取特别有效地进入语言模型的阶段。提出的方法引入了一种机制，该机制最初在训练的早期阶段限制了工作记忆，并随着学习的进展，以指数级的方式逐渐放松了这一约束。有针对性的句法评估表明，所提出的方法优于传统模型，而没有内存约束或具有静态内存约束。这些发现不仅为设计数据有效的语言模型提供了新的方向，而且提供了间接证据，以支持人类语言获取中关键时期假设的基本机制。
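The abstract describes a working-memory constraint that is tight early in training and relaxed exponentially as learning progresses. A minimal sketch of such a schedule follows; the abstract does not give the functional form, so the interpretation of working memory as a context-window size, the function name `memory_window`, and the bounds `w_min` and `w_max` are all assumptions made for illustration.

```python
import math


def memory_window(step: int, total_steps: int, w_min: int = 2, w_max: int = 128) -> int:
    """Return the allowed context window (in tokens) at a training step.

    Hypothetical schedule: the window grows exponentially from w_min at
    step 0 to w_max at the final step, so the constraint is tight early
    and relaxed as training progresses.
    """
    if total_steps <= 1:
        return w_max
    # Clamp the step into the valid range and normalise it to [0, 1].
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    # Interpolate in log space: the window grows by a constant factor per step.
    window = w_min * math.exp(progress * math.log(w_max / w_min))
    return min(w_max, max(w_min, round(window)))
```

A static-constraint baseline of the kind mentioned in the abstract would correspond to returning a fixed window regardless of `step`, and an unconstrained model to always returning `w_max`.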

Title: Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks

Authors: Jing Yang, Max Glockner, Anderson Rocha, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04797
Pdf URL: https://arxiv.org/pdf/2502.04797
Copy Paste: [[2502.04797]] Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks(https://arxiv.org/abs/2502.04797)
Keywords: llm, hallucination
Abstract: Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data, making it challenging to train models for explainable predictions. To address this, we investigate how to use existing explanation datasets for self-rationalization and evaluate models' out-of-distribution (OOD) performance. We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods. The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization. For the generated explanation evaluation, we conduct a human study on 13 selected models and study its correlation with the Acceptability score (T5-11B) and three other LLM-based reference-free metrics. Human evaluation shows that the Acceptability score correlates most strongly with human judgments, demonstrating its effectiveness in evaluating free-text explanations. Our findings reveal: 1) a few annotated examples effectively adapt models for OOD explanation generation; 2) compared to sample selection strategies, the fine-tuning data source has a larger impact on OOD performance; and 3) models with higher label prediction accuracy tend to produce better explanations, as reflected by higher Acceptability scores.
摘要：自由文本的解释具有表现力且易于理解，但是许多数据集缺乏带注释的解释数据，因此在训练模型以进行可解释的预测方面具有挑战性。为了解决这个问题，我们研究了如何使用现有的解释数据集进行自治化并评估模型的分布（OOD）性能。我们微调T5总模型和OLMO-7B模型，并评估微调数据质量的影响，微调样本的数量以及很少的选择方法。在三个任务中对19个不同的OOD数据集进行了评估：自然语言推断（NLI），事实检查和抽象性摘要中的幻觉检测。对于生成的解释评估，我们对13个选定模型进行了人类研究，并研究了其与可接受性评分（T5-11B）和其他三个基于LLM的无参考指标的相关性。人类评估表明，可接受性评分与人类判断最密切相关，证明了其在评估自由文本解释方面的有效性。我们的发现揭示了：1）几乎没有带注释的示例有效地适应了OOD解释的生成； 2）与样本选择策略相比，微调数据源对OOD的性能有更大的影响； 3）具有较高标签预测准确性的模型倾向于产生更好的解释，如较高的可接受性评分所反映。

Title: Claim Extraction for Fact-Checking: Data, Models, and Automated Metrics

Authors: Herbert Ullrich, Tomáš Mlynář, Jan Drchal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04955
Pdf URL: https://arxiv.org/pdf/2502.04955
Copy Paste: [[2502.04955]] Claim Extraction for Fact-Checking: Data, Models, and Automated Metrics(https://arxiv.org/abs/2502.04955)
Keywords: llm
Abstract: In this paper, we explore the problem of Claim Extraction using one-to-many text generation methods, comparing LLMs, small summarization models finetuned for the task, and a previous NER-centric baseline QACG. As the current publications on Claim Extraction, Fact Extraction, Claim Generation and Check-worthy Claim Detection are quite scattered in their means and terminology, we compile their common objectives, releasing the FEVERFact dataset, with 17K atomic factual claims extracted from 4K contextualised Wikipedia sentences, adapted from the original FEVER. We compile the known objectives into an Evaluation framework of: Atomicity, Fluency, Decontextualization, Faithfulness checked for each generated claim separately, and Focus and Coverage measured against the full set of predicted claims for a single input. For each metric, we implement a scale using a reduction to an already-explored NLP task. We validate our metrics against human grading of generic claims, to see that the model ranking on $F_{fact}$, our hardest metric, did not change and the evaluation framework approximates human grading very closely in terms of $F_1$ and RMSE.
摘要：在本文中，我们使用一对多文本生成方法探讨了主张提取问题的问题，比较了LLMS，对任务进行了固定的小摘要模型以及以前以NER为中心的基线QACG。随着当前有关索赔提取，事实提取，索赔产生和值得支票的索赔检测的出版物在其手段和术语中都非常分散，我们汇编了他们的共同目标，释放了发狂的数据集，从4K上情境化的Wikipedia句子中提取了17K原子事实索赔，改编自原始发烧。我们将已知的目标汇编为一个评估框架：原子能，流利度，脱皮性，对每个生成的索赔的忠诚，以及针对单个输入的完整预测索赔来测量的重点和覆盖范围。对于每个度量标准，我们使用还原为已探索的NLP任务实现了一个秤。我们验证了我们的指标，以防止人类的通用主张分级，以了解我们最坚硬的指标上的$ f_ {fact} $排名的模型没有更改，并且评估框架近似于人类的分级，以$ f_1 $ and RMSE的价格非常接近。

Title: SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model

Authors: Jiayang Yu, Yihang Zhang, Bin Wang, Peiqin Lin, Yongkang Liu, Shi Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04958
Pdf URL: https://arxiv.org/pdf/2502.04958
Copy Paste: [[2502.04958]] SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model(https://arxiv.org/abs/2502.04958)
Keywords: language model
Abstract: Fine-tuning is a key approach for adapting language models to specific downstream tasks, but updating all model parameters becomes impractical as model sizes increase. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address this challenge by introducing additional adaptation parameters into pre-trained weight matrices. However, LoRA's performance varies across different insertion points within the model, highlighting potential parameter inefficiency due to unnecessary insertions. To this end, we propose SSMLoRA (State Space Model Low-Rank Adaptation), an extension of LoRA that incorporates a State Space Model (SSM) to interconnect low-rank matrices. SSMLoRA ensures that performance is maintained even with sparser insertions. SSMLoRA allows the model to not only map inputs to a low-rank space for better feature extraction but also leverage the computations from the previous low-rank space. Our method achieves comparable performance to LoRA on the General Language Understanding Evaluation (GLUE) benchmark while using only half the parameters. Additionally, due to its structure, SSMLoRA shows promise in handling tasks with longer input sequences. You can find our code here: this https URL.
摘要：微调是将语言模型调整到特定下游任务的关键方法，但是随着模型尺寸的增加，更新所有模型参数变得不切实际。参数有效的微调（PEFT）方法，例如低级适应（LORA），通过将其他适应参数引入预训练的权重矩阵中，以应对这一挑战。但是，洛拉的性能在模型中的不同插入点上有所不同，突出了由于不必要的插入而引起的潜在参数效率低下。为此，我们提出了SSMLORA（状态空间模型低级适应性），LORA的扩展将状态空间模型（SSM）融合到互连低级别矩阵。 SSMLORA确保即使插入更少，也可以保持性能。 SSMLORA允许模型不仅将输入映射到低级空间以进行更好的特征提取，还可以利用以前的低级空间的计算。我们的方法在仅使用一半参数的同时，在一般语言理解评估（胶水）基准的一般语言理解评估（GLUE）基准方面，可以实现与洛拉的可比性能。此外，由于其结构，SSMLORA在处理任务具有更长的输入序列方面显示出希望。您可以在此处找到我们的代码：此HTTPS URL。

Title: CoCoA: A Generalized Approach to Uncertainty Quantification by Integrating Confidence and Consistency of LLM Outputs

Authors: Roman Vashurin (1), Maiya Goloburda (1), Preslav Nakov (1), Artem Shelmanov (1), Maxim Panov (1) ((1) Mohamed bin Zayed University of Artificial Intelligence)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.04964
Pdf URL: https://arxiv.org/pdf/2502.04964
Copy Paste: [[2502.04964]] CoCoA: A Generalized Approach to Uncertainty Quantification by Integrating Confidence and Consistency of LLM Outputs(https://arxiv.org/abs/2502.04964)
Keywords: language model, llm
Abstract: Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches, with two major types being particularly prominent: information-based, which focus on model confidence expressed as token probabilities, and consistency-based, which assess the semantic relationship between multiple outputs generated using repeated sampling. Several recent methods have combined these two approaches and shown impressive performance in various applications. However, they sometimes fail to outperform much simpler baseline methods. Our investigation reveals distinctive characteristics of LLMs as probabilistic models, which help to explain why these UQ methods underperform in certain tasks. Based on these findings, we propose a new way of synthesizing model confidence and output consistency that leads to a family of efficient and robust UQ methods. We evaluate our approach across a variety of tasks such as question answering, abstractive summarization, and machine translation, demonstrating sizable improvements over state-of-the-art UQ approaches.
摘要：大语言模型（LLMS）的不确定性量化（UQ）方法包括各种方法，其中两种主要类型特别突出：基于信息的侧重于表示标记概率和基于一致性的模型置信度，它们评估了语义使用重复采样生成的多个输出之间的关系。最近的几种方法结合了这两种方法，并在各种应用中表现出了令人印象深刻的性能。但是，他们有时无法胜过更简单的基线方法。我们的研究揭示了LLM作为概率模型的独特特征，这有助于解释为什么这些UQ方法在某些任务中表现不佳。基于这些发现，我们提出了一种合成模型置信度和输出一致性的新方法，从而导致一个有效且稳健的UQ方法的家族。我们评估了我们在各种任务中的方法，例如回答，抽象性摘要和机器翻译，证明了对最先进的UQ方法的大量改进。

Title: Aligning Black-box Language Models with Human Judgments

Authors: Gerrit J. J. van den Burg, Gen Suzuki, Wei Liu, Murat Sensoy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.04997
Pdf URL: https://arxiv.org/pdf/2502.04997
Copy Paste: [[2502.04997]] Aligning Black-box Language Models with Human Judgments(https://arxiv.org/abs/2502.04997)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and unscalable. LLMs offer an efficient solution for continuous, automated evaluation. However, since the systems that are built and improved with these judgments are ultimately designed for human use, it is crucial that LLM judgments align closely with human evaluators to ensure such systems remain human-centered. On the other hand, aligning LLM judgments with human evaluators is challenging due to individual variability and biases in human judgments. We propose a simple yet effective framework to align LLM judgments with individual human evaluators or their aggregated judgments, without retraining or fine-tuning the LLM. Our approach learns a linear mapping between the LLM's outputs and human judgments, achieving over 142% average improvement in agreement across 29 tasks with only a small number of calibration examples used for training. Notably, our method works in zero-shot and few-shot settings, exceeds inter-human agreement on four out of six tasks, and enables smaller LLMs to achieve performance comparable to that of larger models.
摘要：大型语言模型（LLMS）越来越多地用作自动化法官，以评估建议系统，搜索引擎和其他主观任务，在这些任务中，依赖人类评估者的代价可能是昂贵，耗时且不可计入的。 LLM为连续自动评估提供了有效的解决方案。但是，由于这些判断最终是为人类使用而设计和改进的系统，因此LLM判断与人类评估人员紧密相符，以确保此类系统保持为以人为本，这一点至关重要。另一方面，由于人类判断的个人变异性和偏见，将LLM判断与人类评估者保持一致。我们提出了一个简单而有效的框架，以使LLM判断与单个人类评估者或其汇总判断，而无需重新调整或微调LLM。我们的方法学习了LLM的产出与人类判断之间的线性映射，在29个任务中，一致性达成142％以上的平均提高超过142％，只有少数用于培训的校准示例。值得注意的是，我们的方法在零射击和少量设置方面起作用，超过了六个任务中的四项，超过了人间协议，并使较小的LLMS能够实现与较大模型相当的性能。
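The idea of learning a linear mapping from LLM outputs to human judgments using a small calibration set can be sketched in one dimension with ordinary least squares. This is an illustrative simplification under stated assumptions: the paper's mapping may operate on richer outputs than a single scalar score, and the helper names `fit_linear_map` and `calibrate` are hypothetical.

```python
def fit_linear_map(llm_scores: list[float], human_scores: list[float]) -> tuple[float, float]:
    """Fit human = slope * llm + intercept by ordinary least squares.

    Scalar-score simplification for illustration; not the authors' exact
    formulation.
    """
    n = len(llm_scores)
    if n != len(human_scores) or n < 2:
        raise ValueError("need two or more paired scores")
    mean_x = sum(llm_scores) / n
    mean_y = sum(human_scores) / n
    var_x = sum((x - mean_x) ** 2 for x in llm_scores)
    if var_x == 0:
        raise ValueError("LLM scores must not all be identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(llm_scores, human_scores))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def calibrate(score: float, slope: float, intercept: float) -> float:
    """Map a raw LLM judgment onto the human rating scale."""
    return slope * score + intercept
```

The LLM itself is never retrained here: only the two mapping parameters are fitted on the calibration pairs, which is what makes the approach applicable to black-box models.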

Title: nvAgent: Automated Data Visualization from Natural Language via Collaborative Agent Workflow

Authors: Geliang Ouyang, Jingyao Chen, Zhihe Nie, Yi Gui, Yao Wan, Hongyu Zhang, Dongping Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.05036
Pdf URL: https://arxiv.org/pdf/2502.05036
Copy Paste: [[2502.05036]] nvAgent: Automated Data Visualization from Natural Language via Collaborative Agent Workflow(https://arxiv.org/abs/2502.05036)
Keywords: language model, llm, agent
Abstract: Natural Language to Visualization (NL2Vis) seeks to convert natural-language descriptions into visual representations of given tables, empowering users to derive insights from large-scale data. Recent advancements in Large Language Models (LLMs) show promise in automating code generation to transform tabular data into accessible visualizations. However, they often struggle with complex queries that require reasoning across multiple tables. To address this limitation, we propose a collaborative agent workflow, termed nvAgent, for NL2Vis. Specifically, nvAgent comprises three agents: a processor agent for database processing and context filtering, a composer agent for planning visualization generation, and a validator agent for code translation and output verification. Comprehensive evaluations on the new VisEval benchmark demonstrate that nvAgent consistently surpasses state-of-the-art baselines, achieving a 7.88% improvement in single-table and a 9.23% improvement in multi-table scenarios. Qualitative analyses further highlight that nvAgent maintains nearly a 20% performance margin over previous models, underscoring its capacity to produce high-quality visual representations from complex, heterogeneous data sources.
摘要：自然语言到可视化（NL2Vis）旨在将自然语言描述转换为给定表格的可视化表示，使用户能够从大规模数据中获得洞见。大型语言模型（LLM）的最新进展在自动化代码生成、将表格数据转换为易于理解的可视化方面展现出潜力。然而，它们往往难以处理需要跨多个表进行推理的复杂查询。为了解决这一局限，我们为NL2Vis提出了一个名为nvAgent的协作式智能体工作流。具体而言，nvAgent包含三个智能体：负责数据库处理和上下文过滤的处理器（processor）智能体、负责规划可视化生成的组合器（composer）智能体，以及负责代码翻译和输出验证的验证器（validator）智能体。在新的VisEval基准上的全面评估表明，nvAgent始终超过最先进的基线，在单表场景中提升7.88％，在多表场景中提升9.23％。定性分析进一步表明，nvAgent相对于以往模型保持了近20％的性能优势，凸显了其从复杂、异构的数据源生成高质量可视化表示的能力。

Title: ChallengeMe: An Adversarial Learning-enabled Text Summarization Framework

Authors: Xiaoyu Deng, Ye Zhang, Tianmin Guo, Yongzhe Zhang, Zhengjian Kang, Hang Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.05084
Pdf URL: https://arxiv.org/pdf/2502.05084
Copy Paste: [[2502.05084]] ChallengeMe: An Adversarial Learning-enabled Text Summarization Framework(https://arxiv.org/abs/2502.05084)
Keywords: language model, llm, hallucination, prompt
Abstract: The astonishing performance of large language models (LLMs) and their remarkable achievements in production and daily life have led to their widespread application in collaborative tasks. However, current large models face challenges such as hallucination and lack of specificity in content generation in vertical domain tasks. Inspired by the contrast and classification mechanisms in human cognitive processes, this paper constructs an adversarial learning-based prompt framework named ChallengeMe, which includes three cascaded solutions: generation prompts, evaluation prompts, and feedback optimization. In this process, we designed seven core optimization dimensions and set the threshold for adversarial learning. The results of mixed case studies on the text summarization task show that the proposed framework can generate more accurate and fluent text summaries compared to the current advanced mainstream LLMs.
摘要：大型语言模型（LLM）的惊人表现及其在生产和日常生活中的显著成就，使其在协作任务中得到了广泛应用。然而，当前的大模型在垂直领域任务中面临幻觉和内容生成缺乏针对性等挑战。受人类认知过程中对比与分类机制的启发，本文构建了一个基于对抗学习的提示框架ChallengeMe，其中包括三个级联方案：生成提示、评估提示和反馈优化。在此过程中，我们设计了七个核心优化维度，并设定了对抗学习的阈值。在文本摘要任务上的混合案例研究结果表明，与当前先进的主流LLM相比，所提出的框架能够生成更准确、更流畅的文本摘要。

Title: Flexible and Efficient Grammar-Constrained Decoding

Authors: Kanghee Park, Timothy Zhou, Loris D'Antoni
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.05111
Pdf URL: https://arxiv.org/pdf/2502.05111
Copy Paste: [[2502.05111]] Flexible and Efficient Grammar-Constrained Decoding(https://arxiv.org/abs/2502.05111)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can align with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
摘要：大型语言模型（LLM）经常被要求生成遵守精确句法规则的结构化输出，例如代码片段或格式化数据。语法约束解码（GCD）通过屏蔽那些可证明会导致输出不属于指定上下文无关文法（CFG）的词元，保证LLM的输出符合此类规则。为了保证可靠性（soundness），GCD算法必须计算给定LLM的子词分词器如何与给定上下文无关文法所使用的词元对齐，并据此计算词元掩码。高效地做到这一点颇具挑战，现有的GCD算法需要数十分钟来预处理常见文法。我们提出了一种新的GCD算法及其实现，其离线预处理速度比现有方法快17.71倍，同时在在线掩码计算中保持了最先进的效率。
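
The token-masking idea behind grammar-constrained decoding can be shown on a toy case. The sketch below is not the paper's algorithm: it uses a hypothetical seven-token subword vocabulary and the balanced-parentheses grammar, and it recomputes the mask on the fly rather than through offline preprocessing. A token is kept only if the output so far, extended by that token, can still be completed into a string of the grammar.

```python
import math

# Hypothetical subword vocabulary; several tokens span more than one grammar terminal.
VOCAB = ["(", ")", "((", "))", "()", ")(", "<eos>"]


def depth_after(depth, token):
    """Nesting depth after consuming token, or None if a ')' has no matching '('."""
    for ch in token:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return None
    return depth


def token_mask(depth):
    """True for every token that keeps the output a viable prefix of the grammar."""
    mask = []
    for tok in VOCAB:
        if tok == "<eos>":
            mask.append(depth == 0)  # stopping is legal only when fully balanced
        else:
            mask.append(depth_after(depth, tok) is not None)
    return mask


def constrained_argmax(logits, depth):
    """Pick the highest-scoring token among those the mask allows."""
    mask = token_mask(depth)
    masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
    return VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]


# At depth 0 the model's favourite token ")" is masked out, so "()" is chosen.
choice = constrained_argmax([0.1, 5.0, 0.2, 4.0, 0.3, 3.0, 0.0], depth=0)
```

For this grammar a single integer (the nesting depth) summarizes the parser state; a general context-free grammar needs a full parser state, which is where the alignment between subword tokens and grammar terminals becomes expensive.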

Title: CodeSCM: Causal Analysis for Multi-Modal Code Generation

Authors: Mukur Gupta, Noopur Bhatt, Suman Jana
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.05150
Pdf URL: https://arxiv.org/pdf/2502.05150
Copy Paste: [[2502.05150]] CodeSCM: Causal Analysis for Multi-Modal Code Generation(https://arxiv.org/abs/2502.05150)
Keywords: language model, llm, prompt
Abstract: In this paper, we propose CodeSCM, a Structural Causal Model (SCM) for analyzing multi-modal code generation using large language models (LLMs). By applying interventions to CodeSCM, we measure the causal effects of different prompt modalities, such as natural language, code, and input-output examples, on the model. CodeSCM introduces latent mediator variables to separate the code and natural language semantics of a multi-modal code generation prompt. Using the principles of Causal Mediation Analysis on these mediators, we quantify direct effects representing the model's spurious leanings. We find that, in addition to natural language instructions, input-output examples significantly influence code generation.
摘要：在本文中，我们提出了CodeSCM，一种结构因果模型（SCM），用于分析使用大型语言模型（LLM）的多模态代码生成。通过对CodeSCM施加干预，我们度量了不同提示模态（例如自然语言、代码和输入输出示例）对模型的因果效应。CodeSCM引入潜在中介变量，以分离多模态代码生成提示中的代码语义和自然语言语义。基于对这些中介变量的因果中介分析原理，我们量化了代表模型虚假倾向的直接效应。我们发现，除了自然语言指令之外，输入输出示例也会显著影响代码生成。

Title: Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation

Authors: Steffen Eger, Yong Cao, Jennifer D'Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, Chenghua Lin, Nafise Sadat Moosavi, Wei Zhao, Tristan Miller
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.05151
Pdf URL: https://arxiv.org/pdf/2502.05151
Copy Paste: [[2502.05151]] Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation(https://arxiv.org/abs/2502.05151)
Keywords: language model
Abstract: With the advent of large multimodal language models, science is now at a threshold of an AI-based technological transformation. Recently, a plethora of new AI models and tools has been proposed, promising to empower researchers and academics worldwide to conduct their research more effectively and efficiently. This includes all aspects of the research cycle, especially (1) searching for relevant literature; (2) generating research ideas and conducting experimentation; generating (3) text-based and (4) multimodal content (e.g., scientific figures and diagrams); and (5) AI-based automatic peer review. In this survey, we provide an in-depth overview over these exciting recent developments, which promise to fundamentally alter the scientific research process for good. Our survey covers the five aspects outlined above, indicating relevant datasets, methods and results (including evaluation) as well as limitations and scope for future research. Ethical concerns regarding shortcomings of these tools and potential for misuse (fake science, plagiarism, harms to research integrity) take a particularly prominent place in our discussion. We hope that our survey will not only become a reference guide for newcomers to the field but also a catalyst for new AI-based initiatives in the area of "AI4Science".
摘要：随着大型多模态语言模型的出现，科学正处于一场基于AI的技术变革的门槛上。近来，大量新的AI模型和工具被提出，有望帮助全球研究人员和学者更有效、更高效地开展研究。这涵盖研究周期的各个方面，尤其是：（1）检索相关文献；（2）产生研究想法并进行实验；生成（3）文本内容和（4）多模态内容（例如科学图表）；以及（5）基于AI的自动同行评审。在本综述中，我们深入概述了这些令人振奋的最新进展，它们有望从根本上永久改变科学研究过程。我们的综述涵盖上述五个方面，指出了相关的数据集、方法和结果（包括评估），以及局限性和未来研究的空间。关于这些工具的缺陷和被滥用的可能性（伪科学、剽窃、对研究诚信的损害）的伦理问题，在我们的讨论中占有特别突出的位置。我们希望本综述不仅能成为该领域新人的参考指南，还能成为“AI4Science”领域中基于AI的新举措的催化剂。

Title: DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

Authors: Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.05163
Pdf URL: https://arxiv.org/pdf/2502.05163
Copy Paste: [[2502.05163]] DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails(https://arxiv.org/abs/2502.05163)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model, DuoGuard, outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at this https URL.
摘要：大型语言模型（LLM）的快速发展增加了对护栏模型的需求，以确保负责任的使用，尤其是在检测不安全和非法内容方面。尽管英语中存在大量安全数据，但由于其他语言的开源安全数据稀缺，多语言护栏建模仍未得到充分探索。为了弥补这一空白，我们提出了一种新颖的双玩家强化学习（RL）框架，其中生成器和护栏模型以对抗方式共同演化，为多语言护栏训练生成高质量的合成数据。我们在理论上将这种交互形式化为双玩家博弈，并证明其收敛到纳什均衡。实证评估表明，我们的模型DuoGuard优于最先进的模型，在英语基准上比LlamaGuard3（8B）提升近10％，同时以明显更小的模型（0.5B）实现了4.5倍的推理加速。我们在多语言安全任务上取得了显著进展，尤其是在解决所收集的真实数据集中低资源语言的不平衡问题方面。消融研究强调了合成数据生成在弥合英语与其他语言之间开源数据不平衡方面的关键作用。这些发现确立了一种可扩展且高效的合成数据生成方法，为改进多语言护栏模型、增强LLM安全性铺平了道路。代码、模型和数据将在此HTTPS URL上开源。

Title: NoLiMa: Long-Context Evaluation Beyond Literal Matching

Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.05167
Pdf URL: https://arxiv.org/pdf/2502.05167
Copy Paste: [[2502.05167]] NoLiMa: Long-Context Evaluation Beyond Literal Matching(https://arxiv.org/abs/2502.05167)
Keywords: language model, gpt, llm, long context
Abstract: Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
摘要：近期的大型语言模型（LLM）支持从128K到1M词元的长上下文。评估这些能力的一种流行方法是“大海捞针”（needle-in-a-haystack，NIAH）测试，即从“草堆”（冗长的无关上下文）中检索“针”（相关信息）。这种方法的扩展包括增加干扰项、事实链接和上下文内推理。然而，在这些基准中，模型可以利用针与草堆之间已有的字面匹配来简化任务。为了解决这一问题，我们提出了NoLiMa，这是一个扩展NIAH的基准，采用精心设计的针集合，其中问题与针之间的词汇重叠极少，要求模型推断潜在关联才能在草堆中定位针。我们评估了12个声称支持至少128K词元上下文的流行LLM。尽管它们在短上下文（<1K）中表现良好，但随着上下文长度的增加，性能显著下降。例如，在32K时，有10个模型降至其强劲的短上下文基线的50％以下。即使是表现最好的例外之一GPT-4o，也从近乎完美的99.3％基线下降到69.7％。我们的分析表明，这些下降源于在缺少字面匹配时，注意力机制在更长上下文中面临更大的困难，从而更难检索到相关信息。
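
To make "minimal lexical overlap" concrete, the sketch below scores a question against a needle by the Jaccard overlap of their content words. This is an illustrative measure, not necessarily the one the benchmark uses. The stopword list and the example sentences are hypothetical; the second pair relies on the assumed world-knowledge link that the Semper Opera House is in Dresden.

```python
import re

# Small hypothetical stopword list, just enough for the examples below.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "is", "has", "who", "which", "been"}


def content_words(text):
    """Lowercased alphanumeric words with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def lexical_overlap(question, needle):
    """Jaccard overlap of content words: 1.0 means identical sets, 0.0 means disjoint."""
    q, n = content_words(question), content_words(needle)
    if not q or not n:
        return 0.0
    return len(q & n) / len(q | n)


# Classic NIAH-style pair: the question repeats most of the needle's wording.
literal = lexical_overlap("Who lives next to the opera house?",
                          "Yuki lives next to the opera house.")

# Latent-association pair: no shared content words, so string matching cannot help.
latent = lexical_overlap("Which character has been to Dresden?",
                         "Yuki lives next to the Semper Opera House.")
```

With these inputs the first pair shares four of five content words, while the second shares none, so answering it requires inferring the association rather than matching strings.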